Data Mining


20 Nov 2013


Data Mining Using IBM Intelligent Miner




Presented by:

Qiyan (Jennifer) Huang

Outline


Introduction


Mining Process


Main Functionalities of Intelligent Miner


Other Data Mining Products


Data Mining and Privacy


Summary


References


What is Data Mining


Data mining: discovering interesting patterns from large amounts of data

Also known as: knowledge discovery (mining) in databases (KDD), data/pattern analysis, information harvesting, business intelligence, etc.


Evolution of Database Technology

1960s:

Data collection, database creation

1970s:

Relational data model, relational DBMS implementation

1980s ~ present:

RDBMS, advanced data models

1990s-2000s:

Data mining and data warehousing, multimedia databases, and Web databases

Data Mining vs. Database Query

Database query:

Find all customers who have purchased milk.

Identify customers who have purchased more than $10,000 in the last month.

Data mining:

Find all items which are frequently purchased with milk (association rules).

Identify customers with similar buying habits (clustering).
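As a rough illustration of the contrast above, the following Python sketch runs a query-style lookup and a mining-style co-occurrence count over the same data. The customer names, the baskets, and the 50% threshold are invented for illustration and do not come from the slides.

```python
from collections import Counter

# Hypothetical purchase data: customer -> set of purchased items.
baskets = {
    "ann": {"milk", "bread", "eggs"},
    "bob": {"milk", "bread"},
    "cat": {"beer", "chips"},
    "dan": {"milk", "eggs", "bread"},
}

def customers_who_bought(item):
    """Database-style query: an exact answer to a question posed in advance."""
    return sorted(name for name, items in baskets.items() if item in items)

def frequently_bought_with(item, min_share=0.5):
    """Mining-style question: which items co-occur with `item` often enough?"""
    with_item = [items for items in baskets.values() if item in items]
    counts = Counter(other for items in with_item for other in items if other != item)
    # Keep only items present in at least min_share of the baskets containing `item`.
    return {other: n / len(with_item) for other, n in counts.items()
            if n / len(with_item) >= min_share}

print(customers_who_bought("milk"))
print(frequently_bought_with("milk"))
```

The query returns exactly the rows that match a fixed predicate, while the mining function has to scan for patterns and apply a threshold to decide which ones are worth reporting.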

Data Mining Process (KDD)

[Figure: Databases -> Data Cleaning -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2001

About DB2 Intelligent Miner


DB2 Intelligent Miner for Data

Focused on large-scale mining, such as large volumes of data and parallel data mining on Windows NT, Sun Solaris, and OS/390




Main Functionalities


Cluster analysis

Group the data that share similar trends and patterns

Classification

Predict the outcome based on historical data

Association analysis

Find frequent patterns



Classification

age     income  student  credit rating  buys computer
<=30    high    no       fair
<=30    high    no       excellent
31…40   high    no       fair
>40     medium  no       fair
>40     low     yes      fair
>40     low     yes      excellent
31…40   low     yes      excellent
<=30    medium  no       fair
<=30    low     yes      fair
>40     medium  yes      fair
<=30    medium  yes      excellent
31…40   medium  no       excellent
31…40   high    yes      fair

This follows an example from Quinlan’s ID3


Classification


age     income  student  credit rating  buys computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes

This follows an example from Quinlan’s ID3

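ID3, which the slide credits for this example, chooses the attribute to split on by information gain. The sketch below computes that gain for each attribute over the 13 labelled rows of the table above. The function and variable names are ours for illustration and are not part of Intelligent Miner; the age band 31…40 is written with ASCII dots.

```python
from collections import Counter
from math import log2

ATTRS = ("age", "income", "student", "credit_rating")
# The 13 rows of the table above; the last field is buys_computer.
ROWS = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, col):
    """Entropy of the class minus the weighted entropy after splitting on column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
print(best, gains)
```

On these rows the largest gain belongs to age, which is why a tree built this way would test age at the root.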

Association



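Association rule: identifies relationships among items

Example

“30% of customers buy shirts across all the transactions; 60% of these customers will also buy a tie”

Confidence factor is 60%

Support: if buying a shirt and a tie together is observed in 12% of all transactions, then the support is 12%

Lift = confidence / expected confidence (the share of all transactions that contain the rule head); taking 30% as that share, lift = 60% / 30% = 2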
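To make the three measures concrete, here is a minimal Python sketch that computes support, confidence, and lift for the rule shirt => tie. The ten transactions are invented for illustration and are not the slide's figures; lift follows the common definition of confidence divided by the support of the rule head.

```python
# Hypothetical transactions, invented for illustration.
transactions = [
    {"shirt", "tie"}, {"shirt", "tie"}, {"shirt", "tie"},
    {"shirt"}, {"shirt", "socks"},
    {"tie"}, {"socks"}, {"belt"}, {"socks", "belt"}, {"hat"},
]

def support(itemset):
    """Share of all transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head):
    """Share of the transactions containing `body` that also contain `head`."""
    return support(body | head) / support(body)

def lift(body, head):
    """How much more often `head` appears with `body` than it does overall."""
    return confidence(body, head) / support(head)

body, head = {"shirt"}, {"tie"}
print(support(body | head), confidence(body, head), lift(body, head))
```

A lift above 1 indicates that the body and head occur together more often than they would if they were independent.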


Association


Support (%)  Confidence (%)  Type  Lift    Rule Body         Rule Head
5.5286       34.0800         +     2.7300  [203] + [1207]    => [1716]
7.0388       34.1300         +     2.7400  [203] + [1719]    => [1716]
5.4662       34.1700         +     2.7400  [202] + [802]     => [1716]
5.8805       34.3400         +     2.7500  [203] + [802]     => [1716]
5.0163       34.4900         +     2.7600  [203] + [705]     => [1716]
7.1279       34.7400         +     2.7800  [202] + [1718]    => [1716]
5.8226       34.7600         +     3.3900  [711] + [203]     => [710]
5.0697       34.8300         +     2.7400  [202] + [1702]    => [1703]
5.2836       34.8300         +     2.7400  [202] + [1207]    => [1703]
5.4350       34.9400         +     3.4100  [201] + [711]     => [710]
5.3459       35.0200         +     2.7600  [201] + [1702]    => [1703]

Data Mining Products


More than 50 commercial data mining tools

Wide range of pricing

SAS Institute’s Enterprise Miner ~ $80k

SPSS Inc. Clementine ~ $75k

IBM Intelligent Miner ~ $60k

Desktop products start at a few hundred dollars


Data Mining Products

Data Mining Product Comparison by Algorithm

[Table: columns Algorithm, IBM, SAS, SPSS; rows Neural Network, Decision Tree, Clustering, Association, Nearest Neighbour, Kohonen Self-Organizing Map; the per-product marks are not recoverable]

Data Mining & Privacy


Release a limited subset of the data

Hide attributes that are potentially related to personal information

Release encrypted data

Audit to detect misuse of data

Set up a data mining controller

Summary


Introduction to Data Mining


A KDD Data Mining Process


Functionalities of Intelligent Miner


Commercial Data Mining Tools


Data Mining & Privacy

References

Angoss Whitepaper. http://www.angoss.com/ProdServ/AnalyticalTools/kseeker/whitepaper.html. Retrieved on Oct 26th, 2003

C. Clifton & D. Marks. Security and Privacy Implications of Data Mining. 1996

D. W. Abbott, I. P. Matkovsky & J. F. Elder IV. An Evaluation of High-end Data Mining Tools

Elder Research. http://www.rgrossman.com/faq/dm-02.htm. Retrieved on Oct 28th, 2003

IBM. DB2 Intelligent Miner. http://www-3.ibm.com/software/data/iminer/. Retrieved on Oct 26th, 2003

J. F. Elder & D. W. Abbott. A Comparison of Leading Data Mining Tools. August 1998

J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2000

http://www.cald.cs.cmu.edu/summerschool03/PrivacyPreservingDM.ppt. Retrieved on Nov 10th, 2003

Robert Grossman. http://www.datamininglab.com/toolcomp.html#comparison. Retrieved on Oct 20th, 2003

SPSS. http://www.spss.com/. Retrieved on Nov 12th, 2003


Evolution of Database Technology

1960s:

Data collection, database creation, and network DBMS

1970s:

Relational data model, relational DBMS implementation

1980s:

RDBMS, advanced data models

1990s-2000s:

Data mining and data warehousing, multimedia databases, and Web databases

Data Mining: On What Kind of Data?


Data Sources


Relational database


Data warehouses


Transactional databases


WWW


Data types


Audio


Image


Text


Output: A Decision Tree for “buys_computer”

age?
  <=30: student?
    no: no
    yes: yes
  30..40: yes
  >40: credit rating?
    excellent: no
    fair: yes

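Neural network

[Figure: a single neuron. The input vector x = (x_0, x_1, ..., x_n) and the weight vector w = (w_0, w_1, ..., w_n) feed a weighted sum, from which \mu_k is subtracted; the activation function f then produces the output y.]

\[ \mathrm{input}_i = \sum_{j=1}^{n} w_{ji}\,\mathrm{output}_j \]

\[ \mathrm{output}_i = \frac{1}{1 + e^{-\mathrm{gain}\cdot\mathrm{input}_i}} \]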
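The slide's formulas appear to describe a unit that takes a weighted sum of the outputs feeding into it and passes that sum through a logistic activation scaled by a gain parameter. A minimal Python sketch under that reading follows; the function name, the default gain of 1.0, and the sample numbers are assumptions made for illustration.

```python
from math import exp

def unit_output(prev_outputs, weights, gain=1.0):
    """Weighted sum of the incoming outputs, squashed by a logistic function."""
    net_input = sum(w * o for w, o in zip(weights, prev_outputs))
    return 1.0 / (1.0 + exp(-gain * net_input))

# Hypothetical values: three incoming outputs and their connection weights.
print(unit_output([0.2, 0.7, 0.1], [0.5, -0.3, 0.8]))
```

A larger gain makes the logistic curve steeper around zero, so the unit behaves more like a hard threshold.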

Applications of Clustering


Pattern Recognition


Image Processing


Economic Science (especially market research)


WWW


Document classification


Cluster Weblog data to discover groups of
similar access patterns

Data Mining & Privacy



[Figure: Data Mining Tool, Mining Controller, Data warehouse]

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
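As a sketch of how such groups might be found, the Python code below runs plain k-means on two-dimensional points. k-means is used only as a familiar stand-in and is not necessarily the method Intelligent Miner applies; the customer data (visits per month, average spend) and the starting centroids are invented for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend).
customers = [(1, 10), (2, 12), (1, 14), (9, 80), (10, 85), (8, 90)]
centroids, clusters = kmeans(customers, centroids=[(1, 10), (9, 80)])
print(centroids)
print(clusters)
```

The result depends on the starting centroids and on the number of clusters, both of which have to be chosen by the analyst.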

Association

Association and pattern analysis


Applications:

Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Examples:

buys(x, “diapers”) => buys(x, “beers”) [0.5%, 60%]

major(x, “CS”) ^ takes(x, “DB”) => grade(x, “A”) [1%, 75%]


Data Mining: On What Kind of Data?

Relational databases

Data warehouses

Transactional databases

Advanced DB and information repositories

Object-oriented and object-relational databases

Text databases and multimedia databases

Heterogeneous and legacy databases

WWW

Steps of a KDD Process



Learning the application domain: relevant prior knowledge and goals of the application

Creating a target data set: data selection

Data cleaning and preprocessing (may take 60% of the effort!)

Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation

Choosing functions of data mining: summarization, classification, regression, association, clustering

Choosing the mining algorithm(s)

Data mining: search for patterns of interest

Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge
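The steps above can be sketched as a chain of small functions, one per stage. This is only an illustrative outline: the records, field names, age bands, and the simple counting used as the "mining" step are all assumptions, not a description of how Intelligent Miner implements the process.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 25, "student": "yes", "buys_computer": "yes", "phone": "555-0100"},
    {"age": 28, "student": "no", "buys_computer": "no", "phone": None},
    {"age": None, "student": "yes", "buys_computer": "yes", "phone": "555-0101"},
    {"age": 35, "student": "no", "buys_computer": "yes", "phone": "555-0102"},
    {"age": 52, "student": "yes", "buys_computer": "yes", "phone": "555-0103"},
]

def select(records, fields):
    """Create the target data set: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that still have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Data transformation: discretize age into coarse bands."""
    def band(age):
        return "<=30" if age <= 30 else "31..40" if age <= 40 else ">40"
    return [{**r, "age": band(r["age"])} for r in records]

def mine(records):
    """Data mining: count (age band, outcome) co-occurrences."""
    return Counter((r["age"], r["buys_computer"]) for r in records)

def evaluate(patterns, min_count=1):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

target = select(raw, ["age", "student", "buys_computer"])
patterns = evaluate(mine(transform(clean(target))))
print(patterns)
```

Selecting fields before cleaning means a record is discarded only when a value the analysis actually needs is missing, which keeps more of the data available for mining.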

Strengths and Weaknesses

Strengths

Algorithm breadth

Graphical output

Available for PC and mainframe environments

Weaknesses

No automation

Data has to reside in IBM’s database system