Data Mining - USC Upstate: Faculty


Nov 20, 2013


Data Mining

Slides by Chris Gravlee



Sometimes called Data or Knowledge Discovery

The process of analyzing data from different
perspectives and summarizing it into useful
information that can be used to increase
revenue, cut costs, or both

Users are able to analyze data from many
different dimensions or angles, categorize it, and
summarize the relationships identified.

Technically, data mining is the process of finding
correlations or patterns among dozens of fields
in large relational databases
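
As a rough illustration of "finding correlations among fields", the sketch below computes the Pearson correlation for every pair of numeric fields and keeps the strongly related pairs. The records, the field names, and the 0.8 threshold are hypothetical, and the helper assumes no field is constant (a constant field would cause a division by zero).

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(rows, fields, threshold):
    """Return (field_a, field_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(fields, 2):
        r = pearson([row[a] for row in rows], [row[b] for row in rows])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Hypothetical customer records (one row per customer).
records = [
    {"visits": 1, "spend": 10.0, "age": 40},
    {"visits": 2, "spend": 20.0, "age": 25},
    {"visits": 3, "spend": 30.0, "age": 52},
    {"visits": 4, "spend": 40.0, "age": 31},
]

strong = correlated_pairs(records, ["visits", "spend", "age"], 0.8)
```

In this toy data only the visits/spend pair passes the threshold. A real system would search far more fields and would not stop at linear relationships.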



Data are any facts, numbers, or text that can
be processed by a computer

Organizations are accumulating vast and
growing amounts of data in different formats and
different databases. This includes:

Operational or transactional data, such as sales, cost,
inventory, payroll, and accounting

Nonoperational data, such as industry sales, forecast
data, and macroeconomic data

Meta Data: data about the data itself, such as logical
database design or data dictionary definitions



The patterns, associations, or
relationships among all this
data can provide information. For example, analysis
of retail point of sale transaction data can
yield information on which products are
selling and when.
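
A minimal sketch of the point-of-sale example: summing units sold per product and hour shows which products sell and when. The transaction rows and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical point-of-sale rows: (product, hour of day, units sold).
transactions = [
    ("milk", 9, 2),
    ("bread", 9, 1),
    ("milk", 17, 3),
    ("beer", 17, 6),
    ("beer", 18, 4),
]

def units_by_product_and_hour(rows):
    """Total units sold for each (product, hour) combination."""
    totals = Counter()
    for product, hour, units in rows:
        totals[(product, hour)] += units
    return totals

def peak_hour(rows, product):
    """Hour of day in which the given product sold the most units."""
    totals = units_by_product_and_hour(rows)
    by_hour = {hour: units for (name, hour), units in totals.items() if name == product}
    return max(by_hour, key=by_hour.get)
```

For instance, `peak_hour(transactions, "beer")` picks out the late-afternoon hour in this toy data.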



Information can be converted into knowledge
about historical patterns and
future trends. For example, summary
information on retail supermarket sales
can be analyzed in light of promotional
efforts to provide knowledge of consumer
buying behavior. Thus, a manufacturer or
retailer could determine which items are
most susceptible to promotional efforts.


What is Data Mining?

Data Mining is the process of extracting knowledge hidden in large volumes of raw data.

The importance of collecting data that reflect our business or scientific activities to
achieve competitive advantage is now widely recognized. Powerful systems for
collecting data and managing it in large databases are in place in all large and
mid-range companies. However, the bottleneck in turning this data into success is the
difficulty of extracting knowledge about the system we study from the collected data.

Human analysts with no special tools can no longer make sense of enormous
volumes of data that require processing in order to make informed business
decisions. Data mining automates the process of finding relationships and patterns in
raw data and delivers results that can be either utilized in an automated decision
support system or assessed by a human analyst.

What goods should be promoted to this customer?

What is the probability that a certain customer will respond to a planned

Can one predict the most profitable securities to buy/sell during the next trading

Will this customer default on a loan or pay back on schedule?

What medical diagnosis should be assigned to this patient?

How large are the peak loads of a telephone or energy network going to be?

Why did the facility suddenly start to produce defective goods?



Modeling the investigated system and discovering
the relations that connect variables in a database
are the subject of data mining.

Modern data mining systems learn on their own
from the previous history of the investigated
system, formulating and testing hypotheses
about the rules that this system obeys. When
concise and valuable knowledge about the
system of interest has been discovered, it can
and should be incorporated into a decision
support system that helps the manager
make wise and informed business decisions.


Why use data mining?

Data might be one of the most valuable assets of an organization, but only
if we know how to reveal the valuable knowledge hidden in raw data. For
instance, data mining allows us to extract diamonds of knowledge from
historical data and predict outcomes of future situations. In business, it will
help us optimize business decisions, increase the value of each
customer and communication, and improve customer satisfaction with our
services. In the medical domain, it may help discover causes of particular
diseases that were not known before. Data mining can be used on any type
of data (e.g., financial, medical, educational, communication, industrial).

Data that require analysis differ for companies in different industries.
Examples include:

Sales and contacts histories

Call support data

Demographic data on customers and prospects

Patient diagnoses and prescribed drugs data

Click stream and transactional data from a website

In all these cases data mining can help reveal knowledge hidden in data
and turn this knowledge into a crucial competitive advantage.


What can Data Mining do for us?

Identify our best prospects and then retain them as customers.

By concentrating marketing efforts only on the best prospects we will save
time and money, thus increasing the effectiveness of the marketing operation.

Predict cross-sell opportunities and make recommendations.

Whether we have a traditional or web-based operation, we can help the
customers quickly locate products of interest to them and simultaneously
increase the value of each communication with a customer.

Learn parameters influencing trends in sales and margins.

One may think this can be done with OLAP (Online Analytical Processing)
tools. True, OLAP can help prove a hypothesis, but only if we know what
questions to ask in the first place. In the majority of cases we may have no
clue as to what combination of parameters influences our operation. In these
situations data mining is the only real option.

Segment markets and personalize communications.

There might be distinct groups of customers, patients, or natural
phenomena that require different approaches in their handling. If we have a
broad customer range, we would need to address teenagers in California
and married homeowners in Minnesota with different products and
messages in order to optimize a marketing campaign.


Reasons for the growing
popularity of Data Mining

Growing Data Volume

The main reason automated computer systems
are needed for intelligent data analysis is
the enormous volume of existing and newly
appearing data that require processing. The
amount of data accumulated each day by
various business, scientific, and governmental
organizations around the world is daunting.
According to information from the GTE research
center, scientific organizations alone store
about 1 TB (terabyte!) of new information each day.



Limitations of Human Analysis

Two other problems that surface when human
analysts process data are the inadequacy of the
human brain when searching for complex
multifactor dependencies in data, and the lack of
objectivity in such an analysis. A human
expert is always a hostage of previous
experience of investigating other systems.
Sometimes this helps, sometimes this hurts, but
it is almost impossible to escape this bias.



Low Cost of Machine Learning

One additional benefit of using automated data
mining systems is that this process has a much
lower cost than hiring an army of highly trained
(and paid) professional statisticians. While data
mining does not eliminate human participation in
solving the task completely, it significantly
simplifies the job and allows an analyst who is
not a professional in statistics and programming
to manage the process of extracting knowledge
from data.



Knowledge Discovery in Databases

A process of six or more steps, including:

data warehousing,

data selection,

data preprocessing,

data transformation,

data mining,


Data Mining is
sometimes referred to as KDD

DM and KDD tend to be
used as synonyms


Typical Applications of Data Mining


Provide better customer service

Improve cross-selling opportunities (beer and

Increase direct mail response rates

Customer Retention

Identify patterns of defection

Predict likely defections

Risk Assessment and Fraud

Identify inappropriate or unusual behavior


Motivation: The Sizes

Databases today are huge:

More than 1,000,000 entities/records/rows

From 10 to 10,000 fields/attributes/variables

Gigabytes and terabytes

Databases are growing at an unprecedented rate

The corporate world is a cut-throat world

Decisions must be made rapidly

Decisions must be made with maximum knowledge


Motivation for doing Data Mining

Investment in Data Collection/Data Warehouse

Add value to the data holding

Competitive advantage

More effective decision making

OLTP Data Warehouse Decision

Work to add value to the data holding

Support high level and long term decision making

Fundamental move in use of Databases


Importance of Data Mining

By applying data mining techniques, which are elements of
statistics, artificial intelligence and machine learning, organizations are able
to identify trends within their data that they did not know existed. Data
mining can best be described as a business intelligence (BI)
technology that has various techniques to extract comprehensible,
hidden and useful information from a population of data. This BI
technology makes it possible to discover hidden trends and patterns
in large amounts of data. The output of a data mining exercise can
take the form of patterns, trends or rules that are implicit in the data.
Through data mining and the new knowledge it provides, individuals
are able to leverage the data to create new opportunities or value for
their organizations. The following are examples of practical uses of
data mining and the value it provides those who use this technology
to mine their data.


Fraud Detection

Credit card issuers have been using data mining
techniques to detect potentially fraudulent credit card
transactions.
When a credit transaction is executed, the transaction
and all data elements describing the transaction are
analyzed using a sophisticated data mining technique
called neural networks to determine whether or not the
transaction is a potentially fraudulent charge based upon
known fraudulent charges.

By utilizing data mining, credit card issuers have
decreased and mitigated losses due to fraudulent
charges.

Inventory Logistics

By incorporating data mining techniques, retailers can improve their
inventory logistics and thereby reduce their cost in handling
inventory.
Through data mining, a retailer can identify the demographics of its
customers, such as gender, marital status, number of children, etc.,
and the products that they buy.

This information can be extremely beneficial in stocking
merchandise in new store locations as well as identifying "hot"
selling products in one demographic market that should also be
displayed in stores with similar demographic characteristics.

For nationwide retailers, this information can have a tremendous
positive impact on their operations by decreasing inventory
movement as well as placing inventory in locations where it is likely
to sell.


Defect Analysis

Through the use of data mining techniques,
manufacturers are able to identify the characteristics
surrounding defective products, such as day of week and
time of the manufacturing run, components being used
and individuals working on the assembly line.

By understanding these characteristics, changes can be
made to the manufacturing process to improve the
quality of the products being produced.

Higher-quality products lead to improved reputation of the
organization within its industry and help to drive sales. In
addition, profitability improves through the reduction of
return materials allowances and field service calls.


Focused Hiring

Some employers use data mining techniques to
understand the characteristics of their top-performing
individuals. By understanding the characteristics of this
group, such as education, years of experience, skills and
personality traits, a hiring profile can be established to
help recruit and hire individuals who possess
characteristics similar to those of their best-performing individuals.

While this technique has been used, one must realize
that profiling is based upon historical data, which may
not be indicative of future top-performing individuals due
to changes in social, economic and environmental

Techniques Used in Data Mining

Link Analysis

association rules, sequential patterns, time

Predictive Modeling

tree induction, neural networks, regression

Database Segmentation

clustering, k

Deviation Detection

visualization, statistics
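
To make one of the listed techniques concrete, here is a minimal sketch of how an association rule such as "bread implies butter" is usually scored with support and confidence. The baskets and function names are hypothetical, and the sketch assumes the antecedent occurs in at least one basket (otherwise the confidence is undefined).

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Hypothetical market baskets, one set of products per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
```

A rule is typically kept only if both values exceed thresholds chosen by the analyst.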


Data Mining Techniques


Many data mining applications make
use of clustering according to similarity,
for example to segment a
client/customer base. Clustering
according to optimization of set
functions is used in data analysis.

Clustering/segmentation in databases
is the process of separating a data
set into components that reflect a
consistent pattern of behavior. Once
the patterns have been established,
they can be used to "deconstruct"
data into more understandable subsets.
They also provide sub-groups of a
population for further analysis or action,
which is important when dealing with
very large databases. For example, a
database could be used for profile
generation for target marketing:
previous responses to mailing
campaigns can be used to generate a
profile of people who responded, and
this profile can be used to predict response
and filter mailing lists to achieve the
best response.
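
Clustering by similarity can be sketched with k-means, one common algorithm (the slides do not say which algorithm they have in mind). The customer points are hypothetical two-feature records, and the deterministic choice of the first k points as starting centroids is an assumption made to keep the example reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def assign(points, centroids):
    """Index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = list(points[:k])
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = centroid(members)
    return centroids, assign(points, centroids)

# Hypothetical customers described by two numeric features.
customers = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centers, segments = kmeans(customers, k=2)
```

Each resulting segment could then be profiled separately, for example to filter a mailing list.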


Data Mining Techniques


A database is a store of information, but more important is the
information that can be inferred from it. There are two main
inference techniques available: deduction and induction.

Deduction is a technique to infer information that is a logical
consequence of the information in the database. For example, the join
operator applied to two relational tables, where the first concerns
employees and departments and the second departments and
managers, infers a relation between employees and managers.

Induction has been described earlier as the technique to infer
information that is generalized from the database, as when the example
above is used to infer that each employee has a manager. This
is higher level information or knowledge in that it is a general
statement about objects in the database. The database is searched
for patterns or regularities.
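
The two inference styles can be contrasted in a small sketch. The tables, names, and helper functions are hypothetical: the join derives employee/manager pairs that follow logically from the stored tuples (deduction), while the second function checks whether the general statement "each employee has a manager" is supported by every tuple (the kind of regularity induction looks for).

```python
# Hypothetical relational tables stored as lists of tuples.
works_in = [("alice", "sales"), ("bob", "sales"), ("carol", "research")]
managed_by = [("sales", "dave"), ("research", "erin")]

def join_on_department(employees, managers):
    """Deduction: join the two tables on the shared department column."""
    return [
        (employee, manager)
        for employee, dept in employees
        for manager_dept, manager in managers
        if dept == manager_dept
    ]

def every_employee_has_manager(employees, managers):
    """Induction check: does the generalization hold for all stored tuples?"""
    departments_with_manager = {dept for dept, _ in managers}
    return all(dept in departments_with_manager for _, dept in employees)

employee_manager = join_on_department(works_in, managed_by)
rule_holds = every_employee_has_manager(works_in, managed_by)
```

Note that the inductive statement is only as reliable as the data it was generalized from; a new tuple could contradict it.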

Induction has been used in the following ways within data mining:


Decision Trees

Decision trees are a simple knowledge representation that
classifies examples into a finite number of classes. The nodes are
labeled with attribute names, the edges are labeled with possible
values for this attribute, and the leaves are labeled with different classes.
Objects are classified by following a path down the tree, taking
the edges corresponding to the values of the attributes in an object.

The following is an example of objects that describe the weather at a
given time.
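
A minimal sketch of classifying such weather objects by following a path down a tree. The attribute names, values, classes, and the tree itself are hypothetical and chosen only for illustration; they are not taken from the slides.

```python
# Internal nodes name an attribute; branches are keyed by attribute value;
# leaves are plain strings holding the class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "humidity",
            "branches": {"high": "don't play", "normal": "play"},
        },
        "overcast": "play",
        "rainy": {
            "attribute": "windy",
            "branches": {True: "don't play", False: "play"},
        },
    },
}

def classify(node, obj):
    """Follow the edges matching the object's attribute values to a leaf."""
    while isinstance(node, dict):
        value = obj[node["attribute"]]
        node = node["branches"][value]
    return node

today = {"outlook": "sunny", "humidity": "high", "windy": False}
label = classify(tree, today)
```

The sketch raises a `KeyError` for an attribute value the tree has no edge for; a fuller version would need a default class.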


Rule Induction

A data mining system has to infer a model from the database; that is, it
may define classes such that the database contains one or more
attributes that denote the class of a tuple (i.e. the predicted
attributes), while the remaining attributes are the predicting
attributes. A class can then be defined by conditions on the attributes.
When the classes are defined, the system should be able to infer the
rules that govern classification; in other words, the system should
find the description of each class.

Production rules have been widely used to represent knowledge in
expert systems and they have the advantage of being easily
interpreted by human experts because of their modularity, i.e. a
single rule can be understood in isolation and doesn't need
reference to other rules. The proposition-like structure of such
rules has been described earlier but can be summed up as if
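
One very simple way to infer classification rules from tuples is the 1R ("one rule") method, used here purely as an illustration and not as the method the slides describe. For each predicting attribute it maps every value to the majority class, then keeps the attribute whose rules make the fewest errors. The rows and attribute names are hypothetical.

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """Return (attribute, {value: class}, errors) for the best single attribute."""
    best = None
    for attr in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per value: "if attr == value then majority class".
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rules[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# Hypothetical tuples; "play" is the predicted attribute.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]

attribute, rules, errors = one_r(rows, ["outlook", "windy"], "play")
```

Each entry in `rules` reads as a standalone production rule, which is the modularity the text describes.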


Neural Networks

Neural networks are an approach to computing that
involves developing mathematical structures with the
ability to learn. The methods are the result of academic
investigations to model nervous system learning. Neural
networks have the remarkable ability to derive meaning
from complicated or imprecise data and can be used to
extract patterns and detect trends that are too complex
to be noticed by either humans or other computer
techniques. A trained neural network can be thought of
as an "expert" in the category of information it has been
given to analyze. This expert can then be used to
provide projections given new situations of interest and
answer "what if" questions.



Neural networks have broad applicability to real-world
business problems and have already been
successfully applied in many industries. Since
neural networks are best at identifying patterns
or trends in data, they are well suited for
prediction or forecasting needs including:

sales forecasting

industrial process control

customer research

data validation

risk management

target marketing etc.



Neural networks use a set of processing elements (or nodes)
analogous to neurons in the brain. These processing elements are
interconnected in a network that can then identify patterns in data
once it is exposed to the data, i.e. the network learns from
experience just as people do. This distinguishes neural networks
from traditional computing programs, which simply follow instructions
in a fixed sequential order.
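
The idea of a processing element that learns from exposure to data can be sketched with a single perceptron, the simplest such node. This is an illustrative assumption rather than the network type the slides refer to; the training data (the logical AND function), learning rate, and epoch count are hypothetical choices.

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs exceeds zero, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust weights after each example in proportion to the error made."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias

# Hypothetical training experience: the logical AND of two inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
outputs = [predict(weights, bias, inputs) for inputs, _ in and_samples]
```

Nothing in the code states the AND rule explicitly; the weights that reproduce it emerge from repeated exposure to the examples. A single perceptron can only learn linearly separable patterns, which is why practical networks connect many nodes.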