Data Mining

desertcockatoo · Data Management

20 Nov 2013



The Mining Analogy


Data mining gains its name, and to some degree its popularity, by
playing off the idea that the data you have stored is much like a
‘mountain’ and that buried within the mountain (just as within your
data) are certain ‘gems’ of great value


The problem is that there are also lots of non-valuable rocks and
rubble in the mountain that need to be mined through and discarded in
order to get that which is valuable.


The trick is that, for both mountains of rock and mountains of data,
you need some power tools to unearth the value


For rock, this means earthmovers and dynamite; for data, this means
powerful computers and data mining software

What Data Mining Isn’t


Statistics


Statistical tools take longer to run


Less robust on messy real-world data


Must often be wielded by a master craftsman


OLAP (Online Analytical Processing)


OLAP provides a tool for looking quickly anywhere within the mountain


What it doesn’t tell you is what is valuable and what isn’t


Data Warehousing


Data Mining has come of age


What is data mining? And why are so many people talking about it in
both the computer industry and direct marketing?


The answer is simple: data mining helps end users extract useful
business information from large databases.


What is so new about extracting information from data to make your
business run better?


The allure of data mining is that it promises to fix the problem of
miscommunication between you and your data, and to allow you to ask
complex questions of your data, such as:


What has been going on?


What is going to happen next and how can I profit?

Data Mining has come of age (cont’d)


What has been going on?


The answer to this question can be provided by the data warehouse
and multidimensional database technology that allow the user to
easily navigate and visualize the data


What is going to happen next and how can I profit?


The answer to this question can be provided by data mining tools
built on some of the latest computer algorithms: decision trees
(CART, CHAID, AID), neural networks, nearest neighbor, and
rule induction

Learning from Your Past Mistakes


“Those who cannot remember the past are condemned to repeat it”
[G. Santayana]


How does data mining work?


It works the same way as a human being does.


It uses historical information (experience) to learn from the past.


The trick to building a successful predictive model is to have some
data in your database that describes what has happened in the past
(see details on page 96).
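
As a rough illustration of learning from historical data, the sketch below applies a 1-nearest-neighbor rule (one of the techniques named earlier) to a handful of past records. The field names, the records, and the use of unscaled Euclidean distance are assumptions made for this example only.

```python
import math

# Hypothetical historical records: (age, monthly_spend) and the known outcome.
history = [
    ((25, 40.0), "no"),
    ((32, 85.0), "yes"),
    ((47, 120.0), "yes"),
    ((51, 30.0), "no"),
    ((38, 95.0), "yes"),
]


def predict_nearest(history, new_point):
    """Return the outcome of the most similar past record (1-nearest neighbor)."""
    # Euclidean distance on raw values; a real application would usually
    # rescale the fields first so that no single field dominates.
    _, outcome = min(history, key=lambda rec: math.dist(rec[0], new_point))
    return outcome


# A new customer is scored by looking up the closest past experience.
prediction = predict_nearest(history, (34, 90.0))
print(prediction)
```

Without the historical outcomes stored next to each record, there would be nothing for the lookup to return, which is the point made above.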

Measuring Data Mining Effectiveness


Accuracy, Speed, Cost


To choose the right data mining tool, users need to evaluate it
against existing statistical techniques and also compare among the
large number of new data mining products currently on the market.


Data mining technology is actually quite similar to
statistics in the way it builds a predictive model from data.


Often, the accuracy of that prediction depends more on the correct
deployment of the technology and the quality of the data than it
does on the technology itself.


The choice of data mining should be driven by the advantages that it
brings to the bottom line of the entire business process, not just
the statistical predictive accuracy.

Measuring Data Mining Effectiveness


Accuracy, Speed, Cost (cont’d)


The other way that data mining techniques are often measured is by
speed.


The reasoning is that the faster the tool runs, the larger the data
set to which it can be applied. The larger the database is, the
better the accuracy of the predictive model will be.


To truly determine which technologies are best, it is
helpful to look at the big picture, which includes a much
larger business process than just data analysis. The full
process includes data collection, data analysis (data
mining), predictive model visualization, and the launching
of a marketing program against a customer set

Discovery versus Prediction


Discovery


finding something that you weren’t looking for


One of the obvious things about real mining is that when you come
across a diamond or vein of gold, you know that you have found it


You can recognize the important properties of diamonds or gold because they have
been discovered before and you know what they look like and feel like


These systems work by making three measurements:


How strong is the association?


How unexpected is it?


How ubiquitous is it?


The first measurement requires that the pattern in the data be a strong pattern
(for example, that it occurs 90% of the time)


The second measurement ensures that the pattern is interesting to the user and not
obvious.


The third ensures that the pattern occurs often enough to be useful.
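
One possible way to put numbers on the three measurements is sketched below for a rule of the form "if A then B" over a list of transactions. Mapping strength to confidence, unexpectedness to lift, and ubiquity to support is an assumption borrowed from common association-rule practice, not something the slides specify; the transactions are invented.

```python
def rule_measures(transactions, antecedent, consequent):
    """Score the rule 'antecedent -> consequent' on three measures."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_b = sum(1 for t in transactions if consequent in t)
    has_both = sum(1 for t in transactions if antecedent in t and consequent in t)

    # How strong: how often B appears when A does (confidence).
    strength = has_both / has_a if has_a else 0.0
    # How unexpected: strength relative to how common B is anyway (lift).
    baseline = has_b / n
    unexpectedness = strength / baseline if baseline else 0.0
    # How ubiquitous: share of all transactions the rule covers (support).
    ubiquity = has_both / n
    return {"strength": strength, "unexpectedness": unexpectedness, "ubiquity": ubiquity}


# Hypothetical market-basket data.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk"},
    {"eggs"},
    {"butter", "bread", "eggs"},
    {"milk", "eggs"},
]

measures = rule_measures(transactions, "bread", "butter")
print(measures)
```

A lift close to 1 would mean the rule says little beyond what the overall frequency of the consequent already implies, which is one way to read "obvious" in the second measurement.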

Discovery versus Prediction (cont’d)


Prediction


With prediction, you as the end user have a very specific event or attribute
for which you want to find an associated pattern.


For instance, suppose you want to predict customer attrition.


One of the most important parts of predicting customer attrition is having
historical information in your database about which customers have
attrited in the past.


There may be many interesting patterns in your database (say, between
the age of your customers and their buying habits) that you might like to
discover, but in this case, you know very well that attrition is costing you
a lot of money.
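
The attrition example can be sketched as a one-level decision tree (a "stump") fitted to historical records. This is only an illustration under stated assumptions: the single field (months since last purchase), the data, and the error-count criterion are hypothetical, and the code is not an implementation of CART, CHAID, or AID.

```python
# Hypothetical history: (months_since_last_purchase, attrited?)
history = [
    (1, False), (2, False), (3, False), (4, False), (5, True),
    (6, False), (7, True), (9, True), (10, True), (12, True),
]


def fit_stump(history):
    """Find the threshold t where 'attrited if value > t' makes the fewest errors."""
    best_threshold, best_errors = None, len(history) + 1
    for threshold in sorted({value for value, _ in history}):
        errors = sum((value > threshold) != attrited for value, attrited in history)
        if errors < best_errors:  # keep the first threshold with the lowest error count
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def predict_attrition(threshold, months_since_last_purchase):
    """Apply the fitted rule to a customer whose outcome is not yet known."""
    return months_since_last_purchase > threshold


threshold, errors = fit_stump(history)
print(threshold, errors, predict_attrition(threshold, 8))
```

The search is driven entirely by the known attrition flag in the history, which is why the target has to be chosen before the mining starts.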

State of the Industry


The current offerings in data mining software products emphasize
different aspects of the algorithms and their usage. The different
emphases are driven by differences in the targeted user and the
types of problems being solved. There are four main categories of
products:


Targeted Solutions


Business Tools


Business Analyst Tools


Research Analyst Tools

Data Mining Methodology


The first of these is the concept of finding a pattern in the data


The second of these is that of sampling or not
having to use all of the data in order to make
significant conclusions about what might be
happening with other parts of the data


The third is validating the predictive models that
arise out of data mining algorithms
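
Sampling and validation can be sketched together as a holdout test: draw a random sample, build a model on one part, and check it on the part it has not seen. The data, the toy threshold model, and the function names below are assumptions for illustration only.

```python
import random


def holdout_split(records, sample_size, test_fraction, seed=0):
    """Draw a random sample without replacement and split it into train and test parts."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(records, sample_size)
    cut = int(len(sample) * (1 - test_fraction))
    return sample[:cut], sample[cut:]


def fit_threshold(train):
    """Toy model: cut halfway between the largest negative and smallest positive value."""
    # The defaults only guard against a class being absent from a tiny sample.
    largest_negative = max((x for x, label in train if not label), default=0)
    smallest_positive = min((x for x, label in train if label), default=0)
    return (largest_negative + smallest_positive) / 2


def accuracy(threshold, records):
    """Share of records where 'x > threshold' matches the known label."""
    return sum((x > threshold) == label for x, label in records) / len(records)


# Hypothetical database of 100 records; only a sample of 40 is used.
records = [(value, value > 50) for value in range(100)]
train, test = holdout_split(records, sample_size=40, test_fraction=0.25)

threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
print(len(train), len(test), test_accuracy)
```

The accuracy on the held-out part, rather than on the records used to build the model, is the figure that says something about other parts of the data.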

What is a pattern? What is a model?


Although there are many ways to define patterns and
models, here is what they mean in the context of data
warehousing and data mining:


Model. A description of the original historical database from
which it was built that can be successfully applied to new data in
order to make predictions about missing values or to make
statements about expected values.


Pattern. An event or combination of events in a database that
occurs more often than expected. Typically, this means that its
actual occurrence is significantly different from what would be
expected by random chance.


What is the difference between a pattern and a model?
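
One way to make the pattern definition concrete is to compare how often two events actually occur together with how often they would co-occur if they were independent. The sketch below uses a normal approximation to the binomial to express the gap as a z-score; this particular test and the counts are assumptions chosen for illustration, not a method taken from the slides.

```python
import math


def cooccurrence_zscore(n_records, count_a, count_b, count_both):
    """Compare observed co-occurrence with what chance alone would predict."""
    # Under independence, P(A and B) = P(A) * P(B).
    p_chance = (count_a / n_records) * (count_b / n_records)
    expected = n_records * p_chance
    # Standard deviation of a binomial count with n_records trials.
    std_dev = math.sqrt(n_records * p_chance * (1 - p_chance))
    z_score = (count_both - expected) / std_dev if std_dev else 0.0
    return expected, z_score


# Hypothetical counts: 1000 records, event A in 200, event B in 300, both in 120.
expected, z_score = cooccurrence_zscore(1000, 200, 300, 120)
print(round(expected, 1), round(z_score, 2))
```

A large positive z-score suggests the combination occurs more often than random chance would explain, which matches the sense of "pattern" given above.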

Visualizing a pattern

Figure 5-2. A Graphical Representation of Number Sequence

Visualizing a pattern

Figure 5-2. A Graphical Representation of Complex Number Sequence



The graphical representation appears much more understandable than the raw data


A note on terminology


Database


Record


Field


Predictor


Prediction


Value
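
The slide lists these terms without definitions. The sketch below shows one common reading of them as a data structure; the table, the field names, and the interpretation of each term are assumptions, not definitions from the source.

```python
# Database: here, a single table represented as a list of records.
database = [
    {"customer_id": 1, "age": 34, "monthly_spend": 90.0, "attrited": False},
    {"customer_id": 2, "age": 51, "monthly_spend": 30.0, "attrited": True},
]

# Record: one row of the table.
record = database[0]

# Field: one named column of a record.
fields = list(record.keys())

# Prediction: the field whose value the model is asked to supply.
prediction_field = "attrited"

# Predictors: the fields used as inputs when making the prediction
# (the identifier is left out because it carries no behavioral information).
predictors = [f for f in fields if f not in ("customer_id", prediction_field)]

# Value: the content of one field in one record.
value = record["age"]

print(predictors, value)
```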

A note on terminology (cont’d)