Data Mining

desertcockatoo, Data Management

20 Nov 2013



The Mining Analogy

Data mining gains its name, and to some degree its popularity, by playing
off the idea that the data you have stored is much like a ‘mountain’,
and that buried within the mountain (just as buried within your data)
are certain ‘gems’ of great value.

The problem is that there are also lots of non-valuable rocks and rubble
in the mountain that need to be mined through and discarded in order
to get to that which is valuable.

The trick is that for both mountains of rock and mountains of data you
need some power tools to unearth the value.

For rock, this means earthmovers and dynamite; for data, this means
powerful computers and data mining software.

What Data Mining Isn’t

Statistics

Statistical tools take longer to run

They are less robust on messy real-world data

They must often be wielded by a master craftsman

OLAP (Online Analytical Processing)

OLAP provides a tool for looking quickly anywhere within the mountain.

What it doesn’t tell you is what is valuable.

Data Warehousing

Data Mining has come of age

What is data mining?
And why are so many people talking about it in
both the computer industry and in direct marketing?

The answer is simple:
data mining helps end users extract useful
business information from large databases.

What is so new about extracting information from data to make your
business run better?

The allure of data mining is that it promises to fix the problem that
stands between you and your data, and to allow you to ask
complex questions of your data, such as:

What has been going on?

What is going to happen next and how can I profit?

Data Mining has come of age (cont’d)

What has been going on?

The answer to this question can be provided by the data warehouse
and multidimensional database technology that allow the user to
easily navigate and visualize the data.

What is going to happen next and how can I profit?

The answer to this question can be provided by data mining tools
built on some of the latest computer algorithms: Decision Trees
(CART, CHAID, AID), Neural Networks, Nearest Neighbor, and
Rule Induction.

Learning from Your Past

“Those who cannot remember the past are
condemned to repeat it” [G. Santayana]

How does data mining work?

It works the same way as a human being does.

It uses historical information (experience) to learn
from the past.

The trick to building a successful predictive model
is to have some data in your database that
describes what has happened in the past (see
details on page 96).
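As a rough illustration of learning from historical records, the sketch below uses nearest neighbor, one of the techniques named in this deck. The customer fields (age, monthly spend), the records, and the function name `predict_nearest` are hypothetical, invented only for this example.

```python
import math

# Hypothetical historical records (assumed for illustration):
# (age, monthly_spend) -> did the customer leave?
history = [
    ((25, 20.0), True),
    ((30, 25.0), True),
    ((45, 80.0), False),
    ((50, 95.0), False),
]

def predict_nearest(record, history):
    """Return the outcome of the most similar historical record (1-nearest neighbor)."""
    _, nearest_outcome = min(
        history, key=lambda item: math.dist(item[0], record)
    )
    return nearest_outcome

new_customer = (28, 22.0)
print(predict_nearest(new_customer, history))
```

With no historical records there is nothing to compare against, which is the point made above. In practice the fields would usually be rescaled first so that no single one dominates the distance.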

Measuring Data Mining Effectiveness

Accuracy, Speed, Cost

To make the right choice of data mining tool, users need
to evaluate it in comparison to existing statistical
techniques and also compare among the large number of
new data mining products that are currently on the market.

Data mining technology is actually quite similar to
statistics in the way it builds a predictive model from data.

Often, the accuracy of that prediction depends more on the
correct deployment of the technology and the quality of the
data than it does on the technology itself.

The choice of data mining tool should be driven by the
advantages that it brings to the bottom line of the entire
business process, not just by its statistical predictive accuracy.

Measuring Data Mining Effectiveness

Accuracy, Speed, Cost (cont’d)

The other way that data mining techniques are often
measured is by speed.

The reasoning is that the faster the tool runs, the larger
the data set to which it can be applied. The larger the
database is, the better the accuracy of the predictive model
will be.

To truly determine which technologies are best, it is
helpful to look at the big picture, which includes a much
larger business process than just data analysis. The full
process includes data collection, data analysis (data
mining), predictive model visualization, and the launching
of a marketing program against a customer set

Discovery versus Prediction


Discovery: finding something that you weren’t looking for

One of the obvious things about real mining is that when you come
across a diamond or vein of gold, you know that you have found it

You can recognize the important properties of diamonds or gold because they have
been discovered before and you know what they look like and feel like

The main idea behind how these systems work is to make three measurements:

How strong is the association?

How unexpected is it?

How ubiquitous is it?

The first measurement requires that the pattern in the data be a strong pattern (for example,
that it occurs 90% of the time).

The second measurement ensures that the pattern is interesting to the user and not
something already expected.

The fulfillment of the third ensures that the pattern occurs often enough to matter.
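One common way to put numbers on these three questions is with the association-rule measures confidence, lift, and support. Mapping strength to confidence, unexpectedness to lift, and ubiquity to support is an assumption made here for illustration; the slides do not name specific formulas. The transactions and the function name `rule_measures` are hypothetical.

```python
# Hypothetical market-basket data; each set is one transaction.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread"},
]

def rule_measures(antecedent, consequent, transactions):
    """Measure the rule 'antecedent -> consequent' over a list of transactions."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent in t)
    n_c = sum(1 for t in transactions if consequent in t)
    n_both = sum(1 for t in transactions if antecedent in t and consequent in t)
    if n_a == 0 or n_c == 0:
        raise ValueError("both items must occur at least once")
    confidence = n_both / n_a      # strength: how often the rule holds when it applies
    lift = confidence / (n_c / n)  # unexpectedness: above 1 means more often than the baseline
    support = n_both / n           # ubiquity: share of all transactions showing the pattern
    return confidence, lift, support

confidence, lift, support = rule_measures("bread", "butter", transactions)
print(confidence, lift, support)
```

A rule can score high on one measure and low on another, which is why all three are checked together.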

Discovery versus Prediction (cont’d)


Prediction: you as the end user have a very specific event or attribute
for which you want to find an associated pattern.

For instance, suppose you want to predict customer attrition.

One of the most important parts of predicting customer attrition is having
historical information in your database about which customers have
attrited in the past.

There may be many interesting patterns in your database (say, between
the age of your customers and their buying habits) that you might like to
discover, but in this case, you know very well that attrition is costing you
a lot of money.

State of the Industry

The current offerings in data mining software products
emphasize different important aspects of the algorithms
and their usage. The different emphases are driven by
differences in the targeted user and the types of
problems being solved. There are four main categories of products:

Targeted Solutions

Business Tools

Business Analyst Tools

Research Analyst Tools

Data Mining Methodology

The first of these is the concept of
finding a pattern in the data.

The second of these is that of sampling, or not
having to use all of the data in order to make
significant conclusions about what might be
happening with other parts of the data.

The third is validating the predictive models that
arise out of data mining algorithms.
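Sampling and validation can be sketched together: build a simple model on a random sample, then check it against records it has never seen. Everything below is an assumption for illustration, including the synthetic data, the 10% noise rate, the single-cutoff model, and the names `accuracy` and `fit_threshold`.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic records: (monthly_spend, left_company). Low spenders usually leave.
records = []
for _ in range(1000):
    spend = random.uniform(0, 100)
    noisy = random.random() >= 0.9          # about 10% of labels are flipped
    left = (spend >= 40) if noisy else (spend < 40)
    records.append((spend, left))

# Sampling: build the model on a random sample rather than on all of the data.
random.shuffle(records)
sample, holdout = records[:200], records[200:]

def accuracy(cut, data):
    """Fraction of records where 'spend below cut' matches whether the customer left."""
    return sum((spend < cut) == left for spend, left in data) / len(data)

def fit_threshold(data):
    """Pick the spend cutoff (in steps of 5) that is most accurate on `data`."""
    return max(range(0, 101, 5), key=lambda cut: accuracy(cut, data))

cutoff = fit_threshold(sample)

# Validation: measure the model on records that were not used to build it.
print("cutoff:", cutoff)
print("accuracy on sample: ", accuracy(cutoff, sample))
print("accuracy on holdout:", accuracy(cutoff, holdout))
```

If the holdout accuracy is much lower than the sample accuracy, the model has fit quirks of the sample and not a pattern that carries over to other parts of the data.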

What is a pattern? What is a model?

Although there are many ways to define patterns and
models, here is what they mean in the context of data
warehousing and data mining:


Model. A description of the original historical database from
which it was built that can be successfully applied to new data in
order to make predictions about missing values or to make
statements about expected values.

Pattern. An event or combination of events in a database that
occurs more often than expected. Typically, this means that its
actual occurrence is significantly different from what would be
expected by random chance.
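The "more often than expected by random chance" test can be sketched by comparing an observed count with the count expected if two events were unrelated. The independence baseline, the normal approximation behind the z-score, the counts, and the function name `chance_check` are all assumptions made for illustration.

```python
import math

def chance_check(n, n_a, n_b, n_both):
    """Compare how often A and B occur together with what chance alone would give.

    n      -- total number of records
    n_a    -- records containing event A
    n_b    -- records containing event B
    n_both -- records containing both A and B
    """
    expected = n_a * n_b / n            # co-occurrences expected if A and B were independent
    p = expected / n                    # chance that a single record has both
    std = math.sqrt(n * p * (1 - p))    # binomial standard deviation of that count
    z = (n_both - expected) / std       # how many deviations the observed count is from chance
    return expected, z

# Hypothetical counts: 10,000 records, A in 1,000, B in 2,000, both in 400.
expected, z = chance_check(10_000, 1_000, 2_000, 400)
print(expected, z)
```

A z-score far from zero suggests the combination is a pattern in the sense defined above; a value near zero suggests it is what chance alone would produce.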

What is the difference between a pattern and a model?

Visualizing a pattern

Figure 5: A Graphical Representation of Number Sequence

Visualizing a pattern (cont’d)

Figure 5: A Graphical Representation of Complex Number Sequence

The graphical representation appears much more understandable than the raw data.

A note on terminology