What is Data Mining about?


15 Oct 2013


Basic Introduction

Main Tasks in DM

Applications of DM

Relations to other disciplines

Machine Learning: Supervised and
unsupervised learning

Concepts, Concept Space…

Reading material:

Chapter 1 of the textbook by Witten,

Chapters 1, 2 of the textbook by T.

Chapter 1 of the book by Han.

Data vs. information

Huge amounts of data from sports, business, science, medicine, economics. Kb, Mb, Gb, Tb, Lb…


Data: recorded facts

Information: patterns underlying the data.

Raw data is not helpful

How to extract patterns from raw data?


Marketing companies use historical
response data to build models to predict
who will respond to a direct mail or
telephone solicitation.

A government agency sifts through the
records of financial transactions to detect
money laundering or drug smuggling.

Diagnosis, building expert systems to
help physicians based on the previous

Data Mining = KDD:

Knowledge Discovery in Database
System: extensive works in database

Statistical learning has been
active for decades.

Machine Learning: a mature
subfield in Artificial Intelligence.

Data Mining is

Extraction of implicit, previously unknown, and
potentially useful information from data;

Needed: programs that detect patterns
and regularities in the data;

Strong patterns can be used to make predictions;


Most patterns are not interesting;

Patterns may be inexact (or even
completely spurious) if data is garbled or missing.

Machine learning techniques

Algorithms for acquiring structural
descriptions from examples

Structural descriptions represent
patterns explicitly

Machine Learning Methods can be used to

predict outcome in new situation;

understand and explain how the prediction
is derived (maybe even more important).

Can machines really learn?

Definitions of “learning” from dictionary:

To get knowledge of by study, experience, or
being taught; To commit to memory; To receive

How to measure this `learning’? The last two
tasks are easy for computers.

Operational definition:

Things learn when they change their behavior in
a way that makes them perform better in the future.
Does a slipper learn?

Does learning imply intention?


A computer program is said to learn from
experience E with respect to some class of
tasks T and performance measure P, if its
performance at tasks in T, as measured by P,
improves with experience E.

Designing a learning system:

Example: A checkers learning problem:

Task T: Playing Checkers;

Performance measure P: percent of games
won against opponents;

Training experience E: playing practice
games against itself.

A Learning system includes:

Choosing the training Experience


Whether the training experience provides direct or
indirect feedback regarding the choices made by
the performance system.


To which degree the learner can control the
sequence of training examples.


How well the training experience represents the
distribution of examples over which the final system
performance P must be measured.

Choosing the target function

Determine which kind of knowledge will be learned
and how this will be used by the performance system.

A Data Mining Process consists of:

Choosing a representation for the target function

Choosing a learning algorithm

The weather problem

Conditions for playing an unspecified game

Play  Windy  Humidity  Temperature  Outlook
Yes   False  Normal    Mild         Rainy
Yes   False  High      Hot
No    True   High      Hot          Sunny
No    False  High      Hot          Sunny
Yes   True   Normal    Cool
……    ……     ……        ……           ……

Structural Description:

If-Then Structure:

If outlook = sunny and humidity = high then play = no

Classification vs. association rules

Classification rule: predicts value of a pre-specified
attribute (the classification of an example)

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

Association rule: predicts value of an
arbitrary attribute or combination of attributes

If outlook = sunny and humidity = high then play = no

If temperature = cool then humidity = normal

If humidity = normal and windy = false then play = yes

Enumerating the version space:

Domain space: all possible combinations of
attribute values for the examples, equal to the
product of the number of possible values for each
attribute. For the weather problem,
3*3*2*2=36. (`Play’ is the target attribute).

How many possible classification rules?

If some attributes do not appear in the
if…then structure, we use `?’ to denote the
value of the corresponding attribute.

For example,
(?,mild, normal,?, play) means

If the temperature = mild and humidity = normal,
then play=yes.

Therefore, the concept space is: 4*4*3*3*2=288

Space of rule sets: approximately 2.7*10^34 (288^14, for rule sets of 14 rules)
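
The two counts above can be checked with a few lines of Python. This is only a sketch: the list of value counts (3 outlooks, 3 temperatures, 2 humidity levels, 2 windy values) is taken from the product 3*3*2*2 given in the text, and the variable names are illustrative.

```python
from math import prod

# Number of possible values per attribute in the weather problem
# (assumed order: outlook, temperature, humidity, windy).
attribute_values = [3, 3, 2, 2]
# 'Play' is the target attribute with two values (yes, no).
target_values = 2

# Domain space: every attribute takes exactly one of its values.
domain_space = prod(attribute_values)

# Concept space: every attribute may additionally be '?' (not tested),
# and each rule predicts one of the target values.
concept_space = prod(n + 1 for n in attribute_values) * target_values

print(domain_space, concept_space)
```

The extra `+ 1` per attribute is what turns 3*3*2*2 into 4*4*3*3; the final factor 2 comes from the two possible values of the target.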

Version space is the space of concepts
consistent with the present training set.

Weather data with mixed attributes

Two attributes with numeric values

Play  Windy  Humidity  Temperature  Outlook
Yes   False  85        85           Rainy
Yes   False  90        80           Overcast
No    True   86        83           Sunny
No    False  80        75           Sunny
Yes   True   95        65           Overcast
……    ……     ……        ……           ……

Classification Rules:

If outlook = rainy and windy = true then play = no

If outlook = overcast then play = yes

If humidity < 85 then play = yes

If none of the above then play = yes

Question: How to count the version space
and concept space? How to add a test to a
classification rule?

The contact lenses data

Age            Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Prepresbyopic  Hypermetrope            Yes
Prepresbyopic  Hypermetrope            Yes
Presbyopic     Myope                   No           Reduced               None
Presbyopic     Myope                   No           Normal
Presbyopic     Myope                   Yes          Reduced               None
Presbyopic     Myope                   Yes          Normal                Hard
Presbyopic     Hypermetrope            No           Reduced               None
Presbyopic     Hypermetrope            No           Normal                Soft
Presbyopic     Hypermetrope            Yes          Reduced               None
Presbyopic     Hypermetrope            Yes          Normal                None
Prepresbyopic  Hypermetrope            No           Normal                Soft
Prepresbyopic  Hypermetrope            No           Reduced               None
Prepresbyopic  Myope                   Yes          Normal                Hard
Prepresbyopic  Myope                   Yes          Reduced               None
Prepresbyopic  Myope                   No           Normal                Soft
Prepresbyopic  Myope                   No           Reduced               None
Young          Hypermetrope            Yes          Normal                Hard
Young          Hypermetrope            Yes          Reduced               None
Young          Hypermetrope            No           Normal                Soft
Young          Hypermetrope            No           Reduced               None
Young          Myope                   Yes          Normal                Hard
Young          Myope                   Yes          Reduced
Young          Myope                   No           Normal                Soft
Young          Myope                   No           Reduced


Instances with little difference might
have the same value for the target attribute.

A complete and correct rule set

If tear production rate = reduced then recommendation = none

If age = young and astigmatic = no and tear production rate
= normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear
production rate = normal then recommendation = soft

If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none

If spectacle prescription = hypermetrope and astigmatic = no
and tear production rate = normal then recommendation = soft

If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then recommendation = hard

If age = young and astigmatic = yes and tear production rate =
normal then recommendation = hard

If age = pre-presbyopic and spectacle prescription =
hypermetrope and astigmatic = yes then recommendation = none

If age = presbyopic and spectacle prescription =
hypermetrope and astigmatic = yes then recommendation = none

In total, we have 9 rules. Can we
summarize the patterns more efficiently?
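
To make the rule set concrete, the nine rules can be written down as data and applied to an instance. This is a minimal sketch, not part of the original slides: the dictionary keys (`age`, `prescription`, `astigmatic`, `tear`) and the function name `recommend` are illustrative choices, and the rules are tried in the order listed.

```python
# Each rule is (conditions, recommendation); attribute names are illustrative.
RULES = [
    ({"tear": "reduced"}, "none"),
    ({"age": "young", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "pre-presbyopic", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"age": "presbyopic", "prescription": "myope", "astigmatic": "no"}, "none"),
    ({"prescription": "hypermetrope", "astigmatic": "no", "tear": "normal"}, "soft"),
    ({"prescription": "myope", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "young", "astigmatic": "yes", "tear": "normal"}, "hard"),
    ({"age": "pre-presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
    ({"age": "presbyopic", "prescription": "hypermetrope", "astigmatic": "yes"}, "none"),
]


def recommend(instance):
    """Return the recommendation of the first rule whose conditions all hold."""
    for conditions, lenses in RULES:
        if all(instance.get(attr) == value for attr, value in conditions.items()):
            return lenses
    return None  # no rule fired


example = {"age": "young", "prescription": "myope",
           "astigmatic": "yes", "tear": "normal"}
print(recommend(example))
```

Writing the rules as data makes it easy to count conditions and to see why a more compact representation (for example a decision tree) might summarize the same patterns with less repetition.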

Classifying iris flowers

Sepal length  Sepal width  Petal length  Petal width  Type
5.1           3.5          1.4           0.2          Setosa
4.9           3.0          1.4           0.2          Setosa
7.0           3.2          4.7           1.4          Versicolor
6.4           3.2          4.5           1.5          Versicolor
6.3           3.3          6.0           2.5          Virginica
5.8           2.7          5.1           1.9          Virginica


If petal length < 2.45 then Iris setosa

If sepal width < 2.10 then Iris versicolor

Predicting CPU performance

Cycle time (ns)  Main memory (kb)  Cache  Channels      Performance
MYCT             Mmin    Mmax      Cach   Chmin  Chmax  PRP
125              256     6000      256    16     128    198
29               8000    32000     32     8      32     269
480              512     8000      32     0      0      67
480              1000    4000      0      0      0      45


Examples: 209 different computer configurations.

Using a linear regression function:

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
+ 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
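
The regression function can be evaluated directly on a row of the table. The sketch below uses the coefficients as printed on the slide; the negative signs on the intercept (55.9) and on the CHMIN term are an assumption, since the signs appear to have been lost in the extracted text. The function name `predict_prp` is illustrative.

```python
# Coefficients as printed on the slide. ASSUMPTION: the intercept and the
# CHMIN coefficient are negative (the signs are missing in the extracted text).
INTERCEPT = -55.9
COEFFICIENTS = {
    "MYCT": 0.0489,
    "MMIN": 0.0153,
    "MMAX": 0.0056,
    "CACH": 0.6410,
    "CHMIN": -0.2700,
    "CHMAX": 1.480,
}


def predict_prp(config):
    """Linear regression: intercept plus weighted sum of the attributes."""
    return INTERCEPT + sum(w * config[name] for name, w in COEFFICIENTS.items())


# First row of the table above (actual PRP in the data: 198).
first_row = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
             "CACH": 256, "CHMIN": 16, "CHMAX": 128}
print(round(predict_prp(first_row), 2))
```

The prediction for a single configuration can differ noticeably from the recorded value; the function is fitted to all 209 examples, not to any one of them.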

Data from labor negotiations

A case with missing values

Attribute                    Type               1     2     3     …  40
Duration                     {number of years}  1     2     3        2
Wage raise first year        Percentage%        2     4     4.3      4.5
Wage raise second year       Percentage%        ?     5     4.4      4
Wage raise third year        Percentage%        ?     ?     ?
Living cost adjustment       {none,tcf,tc}      none  tcf   ?        none
Working time per week        {Hours}            25    35    38       40
Pension                      {none,ret cntr}    none  ?     ?        ?
Standby pay                  Percentage%        ?     13    ?        ?
Shift supplement             Percentage%        ?     5     4        4
Education allowance          {yes,no}           yes   ?     ?        ?
Statutory holidays           number of days     11    15    12       12
Vacation                     {below,avg,gen}    avg   gen   gen      avg
Long-term disability assist  {yes,no}           no    ?     ?        yes
Dental plan                  {none,half,full}   none  ?     full     full
Bereavement assist           {yes,no}           no    ?     ?        yes
Health plan contribution     {none,half,full}   none  ?     full     half
Acceptability of contract    {good,bad}         bad   good  good     good

Why are these values missing, and
how can the missing values be estimated?
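
One simple way to estimate a missing numeric value is to replace it with the mean of the known values of the same attribute. The sketch below illustrates that idea only; it is one option among many, not the method prescribed by the slides, and the function name `impute_mean` is hypothetical. Missing entries (`?` in the table) are represented as `None`.

```python
def impute_mean(values):
    """Replace None entries with the mean of the known entries."""
    known = [v for v in values if v is not None]
    if not known:
        return list(values)  # nothing to estimate from
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


# "Wage raise second year" for the four contracts shown in the table.
wage_raise_second_year = [None, 5, 4.4, 4]
filled = impute_mean(wage_raise_second_year)
print(filled)
```

Mean imputation ignores why a value is missing; if, say, a third-year raise is missing because the contract only lasts one year, the absence itself carries information and a separate "missing" value may be more appropriate.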

Soybean classification

             Attribute                Number of values  Sample value
Environment  Time of occurrence       7                 July
             Precipitation            3                 Above normal
Seed         Condition                2                 Normal
             Mold growth              2                 Absent
Fruit        Condition of fruit pods  4                 Normal
             Fruit spots              5                 ?
Leaf         Condition                2                 Abnormal
             Leaf spot size           3                 ?
Stem         Condition                2                 Abnormal
             Stem lodging             2                 Yes
Root         Condition                3                 Normal
Diagnosis                             19                Diaporthe stem canker


Domain knowledge plays an important role.

If leaf condition is normal and

stem condition is abnormal and

stem cankers is below soil line and

canker lesion color is brown

then diagnosis is rhizoctonia root rot

If leaf malformation is absent and

stem condition is abnormal and

stem cankers is below soil line and

canker lesion color is brown

then diagnosis is rhizoctonia root rot

Data Mining Applications:

Processing loan application

Given: questionnaire with financial and
personal information

Problem: should money be lent?

Simple statistical method covers 90% of cases

Borderline cases referred to loan officers

But: 50% of accepted borderline cases defaulted

Solution(?): reject all borderline cases

No! Borderline cases are most active customers

Enter machine learning

1000 training examples of borderline cases

20 attributes: age, years with current
employer, years at current address, years with
the bank, other credit cards possessed…

Learned rules predicted 2/3 of borderline
cases correctly: a big deal in business.

Rules could be used to explain decisions to customers

More Applications:

Screening images,
Load forecasting, Diagnosis of Machine
fault, Marketing and sales, DNA recognition,

Inductive Learning:

finding a concept that fits the data


Let us recall the weather problem. It is
possible that the target attribute is `play=no’
no matter what the values of the other
attributes are. We use the symbol `∅’ to denote
this situation, or equivalently (∅,∅,∅,∅) in
the concept space

General-to-specific ordering:

Two descriptions:

d1=(sunny,?,?,hot,?), d2=(sunny,?,?,?,?).

Consider the sets s1, s2 of instances
classified positive by d1 and d2. Because d2
poses fewer constraints on the instances, s1
is a subset of s2. Correspondingly, we write
d1<d2. This yields an order on the
descriptions in the version space.
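
The ordering on descriptions can be tested attribute by attribute: d1 is at least as specific as d2 if, wherever d2 demands a value, d1 demands the same value. The sketch below assumes descriptions are tuples that use `?` for "any value" and contain no empty (never satisfied) entries; the function names are illustrative.

```python
ANY = "?"  # wildcard: the attribute may take any value


def more_specific_or_equal(d1, d2):
    """True if every instance covered by d1 is also covered by d2 (d1 <= d2)."""
    return all(b == ANY or a == b for a, b in zip(d1, d2))


def strictly_more_specific(d1, d2):
    """True if d1 < d2 in the general-to-specific ordering."""
    return d1 != d2 and more_specific_or_equal(d1, d2)


# The two descriptions from the text.
d1 = ("sunny", ANY, ANY, "hot", ANY)
d2 = ("sunny", ANY, ANY, ANY, ANY)
# A third, hypothetical description that is not comparable with d2.
d3 = (ANY, ANY, ANY, "hot", ANY)

print(strictly_more_specific(d1, d2))   # d1 < d2
print(strictly_more_specific(d2, d3), strictly_more_specific(d3, d2))
```

The pair d2, d3 shows that the ordering is only partial: neither description's set of covered instances contains the other's.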

Finding a maximally general description:

It is possible that there is no `<’ or `>’ relation
between two descriptions. For a description d in
a set S of descriptions, if there is no other
description d’ in S satisfying d’>d, then we say
d is maximally general in S.

We can similarly define the maximally specific
(or minimally general) description in S.

We can use an intuitive greedy algorithm to find
a maximally general description in S by searching
the concept space along the general-to-specific
(or its converse) ordering.

The space of consistent concept descriptions is
completely determined by two sets

L: most specific descriptions that cover all
positive examples and no negative ones

G: most general descriptions that cover all
positive examples and no negative ones

Only L and G need to be maintained and updated.

Candidate-elimination algorithm

Initialize L (∅) and G (?)

For each example e

  If e is positive:

    Delete all elements from G that do not cover e

    For each element r in L that does not cover e:

      Replace r by all of its most specific
      generalizations that cover e and that are
      more specific than some element in G

    Remove elements from L that are more general
    than some other element in L

  If e is negative:

    Delete all elements from L that cover e

    For each element r in G that covers e:

      Replace r by all of its most general
      specializations that do not cover e and that
      are more general than some element in L

    Remove elements from G that are more specific
    than some other element in G

Example of Candidate Elimination:

Play  Windy  Humidity  Temperature  Outlook
Yes   T      Normal    Hot          Sunny
Yes   T      High      Hot          Sunny
No    F      High      Cold         Rainy
Yes   T      Normal    Cold         Sunny

L(0)={(∅,∅,∅,∅)} L(1)={(sunny,hot,normal,T)}

L(2)={(sunny,hot,?,T)} L(3)=L(2)
L(4)={(sunny,?,?,T)}

G(0)={(?,?,?,?)} G(1)=G(0) G(2)=G(1)

G(3)={(sunny,?,?,?), (?,hot,?,?),(?,?,?,T)}
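
The candidate-elimination steps can be sketched in Python for conjunctive descriptions over (outlook, temperature, humidity, windy). This is an illustrative implementation under stated assumptions, not the author's code: `None` stands for the empty constraint, `?` for "any value", and the attribute domains in `DOMAINS` are assumed (only some of these values appear in the four examples).

```python
ANY = "?"    # any value is acceptable
NONE = None  # no value is acceptable (the empty constraint)

# ASSUMED attribute domains: outlook, temperature, humidity, windy.
DOMAINS = [("sunny", "overcast", "rainy"), ("hot", "mild", "cold"),
           ("high", "normal"), ("T", "F")]


def covers(h, x):
    return all(c == ANY or c == v for c, v in zip(h, x))


def more_general_or_equal(h1, h2):
    if any(b is NONE for b in h2):  # h2 covers nothing
        return True
    return all(a == ANY or a == b for a, b in zip(h1, h2))


def min_generalization(h, x):
    """Most specific generalization of h that covers x."""
    return tuple(v if c is NONE else (c if c == v else ANY)
                 for c, v in zip(h, x))


def min_specializations(h, x, domains):
    """Most general specializations of h that do not cover x."""
    return [h[:i] + (value,) + h[i + 1:]
            for i, c in enumerate(h) if c == ANY
            for value in domains[i] if value != x[i]]


def candidate_elimination(examples, domains):
    L = {(NONE,) * len(domains)}
    G = {(ANY,) * len(domains)}
    for x, positive in examples:
        if positive:
            G = {g for g in G if covers(g, x)}
            new_L = set()
            for s in L:
                s2 = s if covers(s, x) else min_generalization(s, x)
                if any(more_general_or_equal(g, s2) for g in G):
                    new_L.add(s2)
            # drop elements more general than another element of L
            L = {s for s in new_L
                 if not any(s != t and more_general_or_equal(s, t) for t in new_L)}
        else:
            L = {s for s in L if not covers(s, x)}
            new_G = set()
            for g in G:
                candidates = min_specializations(g, x, domains) if covers(g, x) else [g]
                new_G.update(c for c in candidates
                             if any(more_general_or_equal(c, s) for s in L))
            # drop elements more specific than another element of G
            G = {g for g in new_G
                 if not any(g != h and more_general_or_equal(h, g) for h in new_G)}
    return L, G


examples = [(("sunny", "hot", "normal", "T"), True),
            (("sunny", "hot", "high", "T"), True),
            (("rainy", "cold", "high", "F"), False),
            (("sunny", "cold", "normal", "T"), True)]
L, G = candidate_elimination(examples, DOMAINS)
print(L, G)
```

Running the sketch on the four examples reproduces the final specific boundary L(4) listed above; the fourth (positive) example also removes (?,hot,?,?) from G, because that description does not cover a cold day.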



Important decisions in learning systems:

The concept description language

The order in which the space is searched

The way that overfitting to the particular
training data is avoided

These properties form the “bias” of the search:

Language bias, search bias, overfitting-avoidance bias

Language bias

Most important question: is the language
universal or does it restrict what can be learned?

A universal language can express arbitrary
subsets of examples

If the language can represent statements
involving logical or (“disjunctions”) it is universal

Example: rule sets

Domain knowledge can be used to
exclude some concept descriptions
a priori from the search

Search bias

Search heuristic

“Greedy” search: performing the best single step

“Beam search”: keeping several alternatives

Direction of search:

General-to-specific, e.g. specializing a rule by adding conditions

Specific-to-general, e.g. generalizing an individual instance
into a rule


Overfitting-avoidance bias

Can be seen as a form of search bias

Modified evaluation criterion

E.g. balancing simplicity and number of errors

Modified search strategy

E.g. pruning (simplifying a description)

Pre-pruning: stops at a simple description
before search proceeds to an overly
complex one

Post-pruning: generates a complex
description first and simplifies it afterwards

Concepts, Instances, Attributes


Concepts: kinds of things that can be learned.

Aim: intelligible and operational concept description.


Instances: the individual, independent
examples of a concept.


Attributes: measuring aspects of an instance:
nominal and numeric ones

Practical issue: a file format for the input

Concept description: output of learning

Concepts in Data Mining:

Styles of learning:


Classification: predicting a discrete class


Association: detecting association rules
among features


Clustering: grouping similar instances into clusters

Numeric prediction: predicting a numeric quantity

Reading material: Chapters 2 and 3 of the
textbook by Witten; Chapter 1 and Sections
3.1, 3.2 and 5.2 of the book by Han
(Reference 2).

Classification learning:


Classification learning is supervised learning,
where the scheme is presented with the
actual outcome (the class) of each example

Example problems:

weather data, contact
lenses, irises, labor negotiations

Success can be measured on fresh data
for which class labels are known

In practice success is often measured subjectively

Association learning

Association learning is the learning
where no class is specified and any kind of
structure is considered “interesting”

Difference to classification learning:

predicting any attribute’s value, not just the
class, and more than one attribute’s value at
a time

There are far more association rules than
classification rules

To measure the success of an
association rule, we introduce two measures:

Coverage: the number of instances that can be
covered by the rule

Accuracy: the proportion of correctly predicted
instances among those covered.

Minimum coverage and accuracy are posed
in learning to avoid too many useless rules.
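
Coverage and accuracy can be computed by counting. The sketch below takes coverage as the number of instances that satisfy the rule's conditions and accuracy as the fraction of those for which the rule's conclusion also holds; the three instances are the complete rows of the first weather table, and the function and key names are illustrative.

```python
def coverage_and_accuracy(instances, antecedent, consequent):
    """Return (number of covered instances, fraction predicted correctly)."""
    covered = [x for x in instances
               if all(x[attr] == value for attr, value in antecedent.items())]
    correct = [x for x in covered
               if all(x[attr] == value for attr, value in consequent.items())]
    accuracy = len(correct) / len(covered) if covered else 0.0
    return len(covered), accuracy


# The three complete rows of the first weather table.
weather = [
    {"outlook": "rainy", "temperature": "mild", "humidity": "normal",
     "windy": False, "play": "yes"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high",
     "windy": True, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high",
     "windy": False, "play": "no"},
]

# If outlook = sunny and humidity = high then play = no
cov, acc = coverage_and_accuracy(
    weather, {"outlook": "sunny", "humidity": "high"}, {"play": "no"})
print(cov, acc)
```

A minimum-coverage threshold would discard rules that fire on too few instances, and a minimum-accuracy threshold would discard rules that are often wrong when they do fire.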

Clustering is to find groups of items
that are similar to each other.

Clustering is unsupervised: the
class of an example is not known.
Each group can be assigned as a
class. Success of clustering often
measured subjectively: how useful are
these groups to the user?

Example: iris data without class

Numeric prediction:

Classification with a numeric “class”:
supervised learning.
Success is measured on test data (or
subjectively if the concept description is intelligible)
Example: weather data with numeric
attributes, the performance of CPU…

Issues in Data Mining:

In methodologies and interactions:

Mining different kinds of knowledge in databases;

Interactive mining at various levels;

Incorporating domain knowledge;

Query languages and ad hoc mining;

Presentation and visualization;

Dealing with noisy and incomplete data.

Performance Issues:

Efficiency and scalability of the algorithms;

Parallel, distributed and incremental
mining algorithms.

Issues relevant to database:

Handling relational and complex types of data;

Mining from heterogeneous
databases and global information systems.