Data mining - Textcube.com


GIST

Data Mining & Computational Biology lab

Data mining

2010.01.08

김신혁


Index

1. Introduction to Data mining

2. Classification

3. Classification process

4. Preparing the Data

5. Evaluating the model

6. Questions


What’s Data Mining?


Extracting knowledge from large amounts of data (Han and Kamber, 2006)



Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns



Often used as a synonym for Knowledge Discovery


Motivation


Lots of data is being collected and warehoused


Data is collected and stored at enormous speeds (GB/hour)

Microarrays generating gene expression data

Web data, e-commerce


Shopping transactions


Weather Data


Etc.


Applying Data mining


Business


customer relationship management (CRM)


market basket analysis


Web content mining


Science and engineering


Bioinformatics


forecasting


Sports


Etc.





Knowledge discovery process

[Figure: the knowledge discovery process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Data mining: the core of the knowledge discovery process


Other theories



Core Data mining tasks



Chapter 6. Classification and Prediction

Data Mining: Concepts and Techniques


Classification and Prediction


Databases are rich with hidden information that can be used to:

    extract important data classes (classification)

    predict future data trends (prediction)





[Figure: a database feeding two outputs, important data classes and future data trends]


Examples of Classification

Which loan applicants are “safe”?

[Figure: loan data records labeled safe, risky, safe]


Examples of Classification (cont'd)

Who will buy a new computer?

[Figure: customer profiles labeled BUY: No, BUY: Yes, BUY: Yes]


Examples of Classification (cont'd)

Categorizing news as finance, sports, politics



Example of Prediction

How much will a given customer spend during this sale?


Classification vs. Prediction


Classification
    Categorical (discrete, unordered) labels
    e.g. yes/no, safe/risky

Prediction
    Continuous-valued function, ordered values
    Regression analysis: most often used for prediction
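
The contrast can be sketched in a few lines. This is only an illustration under stated assumptions: the rule inside `classify_loan`, the function names, and the visits/spending numbers are all made up and do not come from the slides. The classifier returns a categorical label, while the predictor fits a simple least-squares line and returns a continuous value.

```python
from statistics import mean


def classify_loan(income: str) -> str:
    """Classification: return a categorical label (hypothetical rule)."""
    return "safe" if income == "high" else "risky"


def fit_line(xs, ys):
    """Prediction: ordinary least squares fit of y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    a = y_bar - b * x_bar
    return a, b


# Made-up data: number of past visits vs. amount spent during a sale.
visits = [1, 2, 3, 4]
spent = [10.0, 20.0, 30.0, 40.0]

a, b = fit_line(visits, spent)
predicted = a + b * 5  # continuous-valued output for a customer with 5 visits
label = classify_loan("high")  # categorical output
```

Here `label` is one of a fixed set of classes, whereas `predicted` can take any numeric value on an ordered scale.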





Classification: definition


Given a collection of tuples (the training set)
    each tuple is an n-dimensional attribute vector
    one of the attributes is the class

Goal: previously unseen records should be assigned a class as accurately as possible

This is supervised learning


name           age    income  loan_decision
Hong Gil Dong  young  high    safe
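
One possible way to hold the example tuple above in code is shown below. The type name `LoanTuple` is hypothetical, and treating `name` as an identifier rather than a predictive attribute is an assumption made for this sketch.

```python
from typing import NamedTuple


class LoanTuple(NamedTuple):
    """One training tuple: an attribute vector plus its class label."""

    name: str           # identifier (assumed not to be used for prediction)
    age: str            # attribute
    income: str         # attribute
    loan_decision: str  # class label attribute


row = LoanTuple("Hong Gil Dong", "young", "high", "safe")

# Split the tuple into the attribute vector and the class label.
attributes = (row.age, row.income)
label = row.loan_decision
```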


Supervised vs. Unsupervised learning


Supervised learning (classification)
    The class label of each training tuple is provided
    New data is classified based on the training set

Unsupervised learning (clustering)
    The class label of each training tuple is not known
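
A toy sketch of the difference, with made-up numbers. The supervised function needs labeled pairs; the unsupervised one sees only values. Note that `split_at_largest_gap` is a deliberately naive grouping rule invented for this illustration, not a standard clustering algorithm.

```python
def nearest_label(train, x):
    """Supervised: train holds (value, label) pairs; copy the closest label."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def split_at_largest_gap(values):
    """Unsupervised: no labels; cut the sorted values in two at the widest gap.

    Assumes at least two values.
    """
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


# Hypothetical income scores, with and without class labels.
labeled = [(20, "risky"), (25, "risky"), (80, "safe"), (90, "safe")]
unlabeled = [90, 20, 80, 25]

predicted = nearest_label(labeled, 70)
groups = split_at_largest_gap(unlabeled)
```

The supervised result is a named class; the unsupervised result is only a pair of groups whose meaning still has to be interpreted.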



Classification process

1. Learning (model construction)


Classification process (cont'd)


Learning step (training phase)
    A model is built to predict the associated class label of a given tuple
    The model is represented as classification rules, decision trees, or mathematical formulae
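
As a minimal sketch of the learning step, the code below builds a model in the form of classification rules. It uses a simplified single-attribute scheme (predict the majority class seen with each attribute value); this is a stand-in chosen for brevity, not the method from the slides, and the training rows and the name `learn_rules` are made up.

```python
from collections import Counter, defaultdict


def learn_rules(training_set, attribute, class_attribute):
    """Learn one rule per value of `attribute`: predict its majority class."""
    counts = defaultdict(Counter)
    for row in training_set:
        counts[row[attribute]][row[class_attribute]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


# Hypothetical training tuples with a known class label.
training_set = [
    {"age": "young", "income": "high", "loan_decision": "safe"},
    {"age": "young", "income": "low", "loan_decision": "risky"},
    {"age": "senior", "income": "low", "loan_decision": "risky"},
    {"age": "senior", "income": "high", "loan_decision": "safe"},
]

# The resulting model reads as: IF income = high THEN safe, and so on.
rules = learn_rules(training_set, "income", "loan_decision")
```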


Classification process (cont'd)

2. Classification (model usage)


Classification process (cont'd)


Model usage
    Measure the accuracy of the model
    Do not measure it on the training set: because of overfitting, the estimate would be too optimistic
    Instead, use a test set of tuples that were not used to build the model
    Accuracy of a classifier: the percentage of test set tuples that are correctly classified by the model
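
The accuracy definition above translates directly into code. In this sketch the model is a hypothetical hand-written rule and the test tuples are made up; only the `accuracy` calculation itself reflects the definition.

```python
def accuracy(model, test_set):
    """Fraction of test tuples whose predicted class equals the true class."""
    correct = sum(1 for attrs, label in test_set if model(attrs) == label)
    return correct / len(test_set)


def model(attrs):
    """Hypothetical learned rule: high income means a safe loan."""
    return "safe" if attrs["income"] == "high" else "risky"


# Hypothetical test set: tuples that were not used to build the model.
test_set = [
    ({"income": "high"}, "safe"),
    ({"income": "low"}, "risky"),
    ({"income": "high"}, "risky"),  # the rule gets this one wrong
    ({"income": "low"}, "risky"),
]

acc = accuracy(model, test_set)  # 3 of 4 correct
```

Multiplying `acc` by 100 gives the percentage form used in the definition.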



Overfitting

[Figure: overfitting. The decision boundary is distorted by a noise point]


Occam's razor


“simpler theories are generally better than more complex ones”
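
A small numeric illustration of the same effect, under stated assumptions: the data is made up, the "true" concept is assumed to be "safe when the score is at least 50", and one training point is deliberately mislabeled as noise. A 1-nearest-neighbor model stands in for the complex model and a single threshold for the simple one.

```python
def one_nn(train):
    """Complex model: memorizes every training point (1-nearest neighbor)."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def threshold(cut):
    """Simple model: a single cut on the attribute."""
    return lambda x: "safe" if x >= cut else "risky"


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


# Hypothetical scores; (30, "safe") is the noise point.
train = [(10, "risky"), (20, "risky"), (30, "safe"), (40, "risky"),
         (60, "safe"), (70, "safe"), (80, "safe")]
test = [(28, "risky"), (32, "risky"), (45, "risky"), (55, "safe"), (65, "safe")]

complex_model = one_nn(train)
simple_model = threshold(50)

complex_train = accuracy(complex_model, train)  # fits the noise perfectly
complex_test = accuracy(complex_model, test)    # but misclassifies near the noise point
simple_train = accuracy(simple_model, train)    # misses only the noise point
simple_test = accuracy(simple_model, test)
```

The memorizing model scores perfectly on the training set but worse on the test set, because its boundary bends around the noise point; the simpler model does the opposite.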


Preparing the Data


Preprocessing steps improve the accuracy, efficiency, and scalability of classification and prediction

Data cleaning
    Preprocessing the data to remove or reduce noise and to handle missing values

Relevance analysis
    Remove redundant or irrelevant attributes

Data transformation and reduction
    Generalization (e.g. income = {low, medium, high})
    Normalization
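
The two transformations named above can be sketched as follows. The income cut-offs and sample values are hypothetical, and min-max scaling is only one common choice of normalization; the slides do not say which method is meant.

```python
def generalize_income(income, low_max=30000, medium_max=70000):
    """Generalization: replace a numeric income with a coarse category.

    The two cut-off values are made-up defaults for illustration.
    """
    if income <= low_max:
        return "low"
    if income <= medium_max:
        return "medium"
    return "high"


def min_max_normalize(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20000, 50000, 120000]
levels = [generalize_income(v) for v in incomes]
scaled = min_max_normalize(incomes)
```

Generalization reduces the number of distinct values a learner has to handle, while normalization keeps attributes with large ranges from dominating those with small ranges.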


Evaluation Criteria

Accuracy
    Accuracy of a classifier: how well it predicts the class label
    Accuracy of a predictor: how well it guesses the value of the predicted attribute

Speed
    Computational cost of generating and using the model

Robustness
    Ability to handle noisy data or missing values

Scalability
    Ability to construct the classifier or predictor efficiently on large amounts of data

Interpretability
    The level of understanding and insight provided by the model
    This criterion is subjective


Classification techniques


Decision tree


Bayesian classifier


Bayesian belief network


Rule-based classifier


Artificial Neural Networks (ANN)


Support Vector Machines


Etc.


References


Han and Kamber (2006). Data Mining: Concepts and Techniques.

Pang-Ning Tan. Introduction to Data Mining.

http://www.wikipedia.org

"Occam's razor". Merriam-Webster's Collegiate Dictionary (11th ed.). New York: Merriam-Webster. 2003. ISBN 0-87779-809-5.


