
DATA MINING

Concepts, Models and Methods

Part I

Paweł Lula

Department of Computational Systems,
Cracow University of Economics

pawel.lula@uek.krakow.pl

Outline

Part I
- Data mining approach
- Types of data and the concept of similarity and distance

Part II
- Classification of research problems
- Data mining models and methods
- Software for data mining

Paweł Lula, Cracow University of Economics, Kragujevac, May 2013

DATA MINING APPROACH

Information deluge

"Never before in human history have our brains had to process as much information as they do today. We have a generation of people who I call computer suckers because they are spending so much time in front of a computer screen or on their mobile phone or BlackBerry."

Edward Hallowell, psychiatrist
The Sunday Times, December 13, 2009

Information overload

Information overload: a situation in which you get more information than you can deal with at one time and become tired and confused.

Flood of data

"Computers have promised us a fountain of wisdom but delivered a flood of data."

W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, 1992

Data mining definition

Data mining: the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, "Knowledge Discovery in Databases: An Overview", AI Magazine, Fall 1992, pp. 213-228. ISSN 0738-4602.

Database → Data mining process → Knowledge

Data mining definition

Data mining: the science of extracting useful information from large data sets or databases.

D. Hand, H. Mannila, P. Smyth, Principles of Data Mining. MIT Press, Cambridge, 2001.

Database → Data mining process → Knowledge

Data mining definition

Data mining: the statistical and logical analysis of large sets of transaction data, looking for patterns that can aid decision making.

Ellen Monk, Bret Wagner, Concepts in Enterprise Resource Planning. Thomson Course Technology, Boston, 2006.

Database → Data mining process → Knowledge → Decisions

Key properties of the data mining approach

- data-based (data-driven) approach: models are based on data, not on theory,
- huge databases and warehouses can be analyzed,
- data mining methods belong to computational techniques,
- outcomes: easy to understand and easy to use,
- main field of application: business,
- main goal: decision support.

Data mining as an interdisciplinary field

Data mining draws on:
- Statistics,
- Mathematics,
- Databases,
- Artificial Intelligence,
- Machine Learning,
- Visualization,
- High Performance Computing.

Data mining process

DATABASE, WAREHOUSE → Selection, Transformation → DATA SET → Model building → MODEL → Verification, Evaluation → KNOWLEDGE → Management, Decision support

Gain knowledge about the process!
Define the goal of analysis!

TYPES OF DATA AND THE CONCEPT OF SIMILARITY AND DISTANCE

Distance vs. similarity

- Distance: a measure which reflects how far from each other two objects are.
- Similarity: a measure which reflects how close to each other two objects are.

Very often a transformation between distance and similarity exists.

Examples of the transformation:

similarity = 1 / distance
similarity = 1 - distance
similarity = max(distance) - distance
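
The three transformations can be sketched in Python as follows. The function names and the sample distances are illustrative assumptions; the sketch also makes explicit two conditions the formulas leave implicit: 1 / distance is undefined for distance 0, and 1 - distance only gives a sensible similarity when distances are scaled to the range [0, 1].

```python
def similarity_reciprocal(distance):
    """similarity = 1 / distance; undefined when distance is 0."""
    if distance == 0:
        raise ValueError("1 / distance is undefined for distance 0")
    return 1.0 / distance


def similarity_complement(distance):
    """similarity = 1 - distance; assumes distance is scaled to [0, 1]."""
    return 1.0 - distance


def similarity_from_max(distance, max_distance):
    """similarity = max(distance) - distance."""
    return max_distance - distance


distances = [0.5, 2.0, 4.0]            # hypothetical distances
largest = max(distances)
print([similarity_from_max(d, largest) for d in distances])  # [3.5, 2.0, 0.0]
```

In each case a larger distance gives a smaller similarity, which is the only property the slide requires.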

The formal definition of distance

Let $X$ be a set and $x, y \in X$. Then a function $d(x, y)$ is called a distance if:

- $d(x, y) \ge 0$,
- $d(x, y) = d(y, x)$,
- $d(x, x) = 0$.

A distance function $d(x, y)$ which also satisfies, for every $z \in X$, the condition

$d(x, y) \le d(x, z) + d(z, y)$ (triangle inequality)

is called a metric.
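
A minimal sketch of how these conditions could be checked on a small finite set of points. The helper names `is_distance` and `is_metric` and the sample points are assumptions made for illustration. The squared difference is included because it satisfies the three distance conditions but violates the triangle inequality.

```python
from itertools import product


def is_distance(d, points):
    """Check non-negativity, symmetry and d(x, x) = 0 on a finite set."""
    return (all(d(x, y) >= 0 for x, y in product(points, repeat=2))
            and all(d(x, y) == d(y, x) for x, y in product(points, repeat=2))
            and all(d(x, x) == 0 for x in points))


def is_metric(d, points):
    """A distance that also satisfies the triangle inequality."""
    return is_distance(d, points) and all(
        d(x, y) <= d(x, z) + d(z, y)
        for x, y, z in product(points, repeat=3))


points = [0, 1, 2, 5]                  # hypothetical sample points
absolute = lambda x, y: abs(x - y)
squared = lambda x, y: (x - y) ** 2    # a distance, but not a metric

print(is_metric(absolute, points))                                # True
print(is_distance(squared, points), is_metric(squared, points))   # True False
```

A check on a finite sample can only refute the metric property, not prove it for the whole set.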

Datum and data

Datum (plural: data):
- something given,
- a single piece of information,
- a fact or proposition used to draw a conclusion or make a decision.

Data: a collection of facts.

Classification of data according to the type of values

- quantitative (numerical, number-based) data:
  - discrete values (integer values),
  - continuous values (real values),
- qualitative (non-numerical, word-based) data:
  - two-state data (logical data, True/False, Yes/No),
  - many-state data (e.g. color of eyes).

Classification of data according to their structure

- Simple types of data (one object represents one value),
- Complex types of data (one object represents many values).

Distance for quantitative data

For numbers $x$ and $y$:

$\mathrm{dist}(x, y) = |x - y|$

For example:

$\mathrm{dist}(2, 6) = |2 - 6| = |-4| = 4$

Distance for qualitative data

Nominal values

X = {Kragujevac, Rome, London, New York}

Kragujevac = Rome: NO
Kragujevac ≠ Rome: YES

Example of distance:

dist(a, a) = 0
dist(a, b) = 1 for a ≠ b

We can also calculate distance based on additional knowledge:

distance by car(Kragujevac, Rome) = 1425 km

Distance for qualitative data

Ordered values

X = {small, medium, big}

Operations: =, ≠, >, <

dist(small, medium) < dist(small, big)
dist(small, small) = 0
dist(small, medium) = dist(medium, big)? PROBLEM! The order alone does not tell us whether neighbouring values are equally far apart.
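
A small sketch of both cases. The 0/1 distance for nominal values follows the slide. The rank-based distance for ordered values is only one possible choice: it assumes that neighbouring values are equally spaced, which is exactly the assumption flagged as a PROBLEM above. The function names are illustrative.

```python
def nominal_distance(a, b):
    """0 for identical values, 1 otherwise (only equality is meaningful)."""
    return 0 if a == b else 1


SIZES = ["small", "medium", "big"]   # ordered scale from the slide


def ordinal_distance(a, b, scale=SIZES):
    """Difference of ranks; ASSUMES equal spacing between neighbours."""
    return abs(scale.index(a) - scale.index(b))


print(nominal_distance("Kragujevac", "Rome"))    # 1
print(ordinal_distance("small", "medium"))       # 1
print(ordinal_distance("small", "big"))          # 2
```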

Types of complex data

- Matrices,
- Lists (sequences of elements),
- Records,
- Data frames (tables),
- Sets,
- Trees,
- Networks / Graphs,
- Texts (in natural languages).

Matrix

- a rectangular structure of elements,
- homogeneous,
- elements are arranged in rows and columns,
- the position of an element is described by indices.

Object representation in matrices

[Figure: a matrix with its two dimensions labelled "Objects" and "Features".]

Vector

- A matrix with one row (a 1 × m matrix) is called a row vector.
- A matrix with one column (an m × 1 matrix) is called a column vector.

Record

- a complex structure with fields,
- fields store values,
- fields are identified by names,
- a record is a heterogeneous structure.

Data frame

- a table-based structure,
- row = record,
- column = field in the record,
- data frame = vector of records.

Very popular in data analysis problems!

Objects as points

       X     Y     Z
1      x1    y1    z1
2      x2    y2    z2
3      x3    y3    z3
4      x4    y4    z4
...    ...   ...   ...
N      xN    yN    zN

Distance between points

Assume that we have two points:

$x = (x_1, x_2, \ldots, x_n)$
$y = (y_1, y_2, \ldots, y_n)$

The distance can be calculated as:

$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$ (Manhattan distance)

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (Euclidean distance)
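
As an illustration, the sum of absolute coordinate differences (Manhattan distance) and the square root of the sum of squared differences (Euclidean distance) can be computed for two points as below. The function names and the sample points are assumptions.

```python
import math


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = (1.0, 2.0, 3.0)      # hypothetical points
y = (4.0, 6.0, 3.0)
print(manhattan(x, y))   # 7.0
print(euclidean(x, y))   # 5.0
```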

The curse of dimensionality

The curse of dimensionality: problems caused by a huge number of dimensions (features).

Questions:

- Can distance be calculated? YES
- Do dimensions have an interpretation? YES (features)
- Can points be presented on a graph? NO
- Which features are important? PROBLEM!
- Which features have the strongest impact on the distance? PROBLEM!
- Is it possible to order features according to their importance? PROBLEM!

Solution: Principal Component Analysis

The goal of Principal Component Analysis

Data set → Transformation → New data set

Aspects of PCA

Aspect          Original data set                                     New data set
Interpretation  easy                                                  difficult
Importance      the importance of variables is difficult to predict   every subsequent variable has smaller importance
Correlation     generally variables are correlated                    variables are uncorrelated

How to measure the importance of a feature (dimension)

The importance of a feature = the range (spread) of its values. PCA measures this spread by the variance.

The idea of PCA

1. Find a point in the center of the data set (it is the origin of the new coordinate system),
2. define the first axis to maximize the importance of the new feature,
3. define the second axis, which is perpendicular to the first,
4. ...
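
The first two steps can be sketched in plain Python. This is an illustrative sketch only, not how any particular statistical package computes PCA: the data set is hypothetical, importance is measured by variance, and the first axis is found by power iteration on the covariance matrix (one of several ways to obtain its dominant eigenvector).

```python
import math


def center(data):
    """Step 1: move the origin to the mean of the data set."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]


def covariance(centered):
    """Sample covariance matrix of already centered data."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]


def first_axis(cov, iterations=200):
    """Step 2: direction of maximal variance, via power iteration."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


data = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]]  # hypothetical
centered = center(data)
axis = first_axis(covariance(centered))
# Coordinates of the objects along the first new axis (the new feature).
scores = [sum(a * x for a, x in zip(axis, row)) for row in centered]
print(axis)
print(scores)
```

The second axis would be found in the same way after removing the variation already explained by the first one, which keeps it perpendicular to the first.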

PCA

> # PCA on the four numeric columns of iris (column 5, Species, is dropped)
> pca <- princomp(iris[-5])
> summary(pca)
Importance of components:
                          Comp.1     Comp.2     Comp.3      Comp.4
Standard deviation     2.0494032 0.49097143 0.27872586 0.153870700
Proportion of Variance 0.9246187 0.05306648 0.01710261 0.005212184
Cumulative Proportion  0.9246187 0.97768521 0.99478782 1.000000000

New features

> pca$scores
             Comp.1       Comp.2       Comp.3        Comp.4
 [1,] -2.684125626 -0.319397247 -0.027914828  0.0022624371
 [2,] -2.714141687  0.177001225 -0.210464272  0.0990265503
 [3,] -2.888990569  0.144949426  0.017900256  0.0199683897
 [4,] -2.745342856  0.318298979  0.031559374 -0.0755758166
 [5,] -2.728716537 -0.326754513  0.090079241 -0.0612585926
 [6,] -2.280859633 -0.741330449  0.168677658 -0.0242008576
 [7,] -2.820537751  0.089461385  0.257892158 -0.0481431065
 [8,] -2.626144973 -0.163384960 -0.021879318 -0.0452978706
 [9,] -2.886382732  0.578311754  0.020759570 -0.0267447358
[10,] -2.672755798  0.113774246 -0.197632725 -0.0562954013
[11,] -2.506947091 -0.645068899 -0.075318009 -0.0150199245

The importance of new components

> screeplot(pca)

New components

Singular Value Decomposition

Expenditures:

Person    Food   Books  Travel  Health
Janek     1300     200      25     500
Agata     1140     870     450     120
Wacek      900      30    2300     400
Krysia     890     700     500       0
Andrzej   2500     200    4500     200
Wojtek     700       0       0    3100
Jacek     1300     500     900     300
Zygmunt   5000    4000       0     100
Marysia    500     300     400     200
Teresa     300     300     300     300
Viola     2000       0    3400    2500

Both the rows (persons) and the columns (expenditure categories) are treated as objects.

The goal of SVD

- definition of a new coordinate system,
- new dimensions form new features / components / latent variables,
- the new coordinate system is common for objects represented by rows and by columns,
- new features are not correlated,
- every subsequent feature has smaller importance,
- new features are hard to interpret.
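
For reference, the standard form of the decomposition is given below. It is general linear algebra rather than something taken from the slides, and the symbols $A$, $U$, $\Sigma$, $V$ are the usual textbook names. An $n \times m$ data matrix $A$ of rank $r$ is factored as:

```latex
A = U \Sigma V^{T}, \qquad
U^{T} U = I, \quad V^{T} V = I, \qquad
\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0
```

The columns of $U$ describe the row objects and the columns of $V$ describe the column objects in the same new coordinate system. The orthogonality of these columns corresponds to "new features are not correlated", and the decreasing singular values $\sigma_i$ correspond to "every subsequent feature has smaller importance".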


List (sequence)

List: an ordered collection of:
- values,
- events,
- tasks,
- goods,
- cities,
- ...

A sentence is a sequence of words.
A word is a sequence of letters.

Distance between sequences

Editing operations:
- Substitution: replacing one element in the sequence by another,
- Deletion: removing a given element from the sequence,
- Insertion: inserting a new element.

Distance between sequences

Assumption: cost(substitution) = cost(deletion) = cost(insertion) = 1

The edit distance between two sequences is the minimum number of editing operations required to change one sequence into the other.

Example:

d(phone, bone) = 2

phone → hone → bone
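
With unit costs, the edit distance is commonly computed by dynamic programming (the Levenshtein distance). The sketch below is one minimal way to do it; the function name is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs, by dynamic programming."""
    previous = list(range(len(b) + 1))          # distance from "" to b[:j]
    for i, x in enumerate(a, start=1):
        current = [i]                           # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion of x
                current[j - 1] + 1,             # insertion of y
                previous[j - 1] + (x != y),     # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


print(edit_distance("phone", "bone"))   # 2
```

The function works on any sequences whose elements can be compared, so it can be applied to lists of words as well as to strings of letters.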

Distance between sequences

Assumption: cost(substitution), cost(deletion) and cost(insertion) are defined separately.

The edit distance between two sequences is the cost of the cheapest sequence of editing operations required to change one sequence into the other.

Example:

dist("This building is big", "This building is huge") < dist("This building is big", "This building is small")
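
A sketch of the weighted variant, applied to sentences treated as sequences of words. The substitution costs below are hypothetical values chosen only to reproduce the inequality in the example (swapping near-synonyms is made cheap); in practice they would have to come from some external source of word similarity.

```python
def weighted_edit_distance(a, b, sub_cost, del_cost=1.0, ins_cost=1.0):
    """Cost of the cheapest edit sequence turning sequence a into b."""
    previous = [j * ins_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        current = [i * del_cost]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + del_cost,              # delete x
                current[j - 1] + ins_cost,           # insert y
                previous[j - 1] + sub_cost(x, y),    # substitute x by y
            ))
        previous = current
    return previous[-1]


# HYPOTHETICAL substitution costs: near-synonyms are cheap to swap.
WORD_COSTS = {frozenset(("big", "huge")): 0.2}


def word_sub_cost(x, y):
    if x == y:
        return 0.0
    return WORD_COSTS.get(frozenset((x, y)), 1.0)


base = "This building is big".split()
huge = "This building is huge".split()
small = "This building is small".split()
print(weighted_edit_distance(base, huge, word_sub_cost))    # 0.2
print(weighted_edit_distance(base, small, word_sub_cost))   # 1.0
```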

Tree

The best model for hierarchy representation.

Distance between nodes

Distance based on the length of the path between nodes (values refer to the example tree shown on the slide):

dist(A, B) = 1
dist(A, H) = 5
dist(G, G) = 0
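
A sketch of the path-length distance for a tree stored as a child-to-parent mapping. The tree below is hypothetical and is not the one drawn on the slide, so only the general behaviour (distance 1 between a parent and its child, distance 0 from a node to itself) can be compared with the values above.

```python
# HYPOTHETICAL tree (child -> parent); not the tree from the slide.
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "F": "C", "G": "F"}


def path_to_root(node, parent=PARENT):
    """List of nodes from the given node up to the root."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def tree_distance(u, v, parent=PARENT):
    """Number of edges on the path between u and v."""
    steps_from_u = {n: i for i, n in enumerate(path_to_root(u, parent))}
    for steps_from_v, n in enumerate(path_to_root(v, parent)):
        if n in steps_from_u:                 # lowest common ancestor
            return steps_from_u[n] + steps_from_v
    raise ValueError("nodes are not in the same tree")


print(tree_distance("A", "B"))   # 1
print(tree_distance("D", "G"))   # 5
print(tree_distance("G", "G"))   # 0
```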

Similarity between classes

Similarity measure based on information theory (Dekang Lin):

$\mathrm{sim}(C_1, C_2) = \frac{2 \log P(C_0)}{\log P(C_1) + \log P(C_2)}$

or, written with the information content $I(C) = -\log P(C)$:

$\mathrm{sim}(C_1, C_2) = \frac{2\, I(C_0)}{I(C_1) + I(C_2)}$

where $C_0$ is the class in the hierarchy that subsumes both $C_1$ and $C_2$.
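
A numerical sketch of Lin's measure, sim(C1, C2) = 2 log P(C0) / (log P(C1) + log P(C2)), where C0 is the class subsuming C1 and C2. The class probabilities are hypothetical; in practice they would be estimated from a corpus. The formula is undefined when both probabilities equal 1.

```python
import math


def lin_similarity(p_common, p_c1, p_c2):
    """sim(C1, C2) = 2 log P(C0) / (log P(C1) + log P(C2))."""
    return 2 * math.log(p_common) / (math.log(p_c1) + math.log(p_c2))


# HYPOTHETICAL probabilities: C0 subsumes C1 and C2, so P(C0) is the largest.
print(lin_similarity(0.2, 0.05, 0.01))
print(lin_similarity(0.05, 0.05, 0.05))   # 1.0: a class compared with itself
```

The more specific the common class C0 is (the smaller P(C0)), the closer the similarity gets to 1.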

WordNet

WordNet:
- a lexical database for the English language,
- it contains more than 150,000 words.

Ontology

Ontology: a model of domain knowledge; a set of concepts within a domain and the relationships between pairs of concepts.

Ontology-based distance = distance between concepts.

Distance between trees

Tree edit distance

Editing operations:
- Substitution / Relabel: changing the label of a node,
- Deletion: removing a given node from the tree,
- Insertion: inserting a new node.

Cost of editing operations: assume that cost(relabel), cost(deletion) and cost(insertion) are defined.

Assume that we have two trees, T1 and T2. Consider the sequence of operations which turns T1 into T2 with minimal cost. The cost of this sequence is the tree edit distance.

Graph / Network

Graph: a set of nodes (vertices) connected by edges (links).

Network modelling

Network model: a formal representation of a group of real objects and relationships between them.

APPLICATION PERSPECTIVE:
- network,
- real objects,
- real relationships.

MATHEMATICAL PERSPECTIVE:
- graph,
- vertices,
- edges, arcs.

Examples of networks

- Web networks,
- Social networks: persons (organisations) and relationships between them,
- Communication networks (phone networks, plane connections),
- Computer networks,
- Trade networks (export/import),
- Terrorist networks,
- ...

Similarity of nodes in the network

Types of node similarities:
- attribute-based similarity (based on the values of node attributes),
- taxonomy similarity (based on the type of nodes),
- relationship similarity (based on the connections between nodes).

Relationship similarity

Two objects are similar if they have similar relationships with other objects.

[Figure: examples of similar objects and dissimilar objects.]

Relationship dissimilarity measures

[Figure: network with nodes A, B, C, D and edge weights 7.1, 3, 6.3 and 1.7.]

Adjacency matrix:

      A     B     C     D
A   0.0   1.7   6.3   0.0
B   0.0   0.0   0.0   0.0
C   0.0   0.0   0.0   0.0
D   7.1   0.0   3.0   0.0

Euclidean-like dissimilarity:

$d(u, v) = \sqrt{\sum_{s=1,\; s \ne u, v}^{n} \left[ (q_{us} - q_{vs})^2 + (q_{su} - q_{sv})^2 \right]}$

Manhattan-like dissimilarity:

$d(u, v) = \sum_{s=1,\; s \ne u, v}^{n} \left( |q_{us} - q_{vs}| + |q_{su} - q_{sv}| \right)$

where $q_{us}$ is the entry of the adjacency matrix in row $u$ and column $s$, and $n$ is the number of nodes.
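
A sketch of both dissimilarities computed on the adjacency matrix from the slide (decimal commas written as points). It assumes the usual structural-equivalence form of these measures: for nodes u and v, the rows and the columns of the adjacency matrix are compared over all other nodes s, with differences squared and square-rooted in the Euclidean-like case and taken as absolute values in the Manhattan-like case. Function names are illustrative.

```python
import math

NODES = ["A", "B", "C", "D"]
Q = [  # adjacency matrix from the slide, decimal commas written as points
    [0.0, 1.7, 6.3, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [7.1, 0.0, 3.0, 0.0],
]


def _others(u, v):
    """Indices of all nodes except u and v."""
    return [s for s in range(len(Q)) if s not in (u, v)]


def euclidean_like(u, v):
    """Compare outgoing (rows) and incoming (columns) ties of u and v."""
    return math.sqrt(sum((Q[u][s] - Q[v][s]) ** 2 + (Q[s][u] - Q[s][v]) ** 2
                         for s in _others(u, v)))


def manhattan_like(u, v):
    return sum(abs(Q[u][s] - Q[v][s]) + abs(Q[s][u] - Q[s][v])
               for s in _others(u, v))


b, c = NODES.index("B"), NODES.index("C")
print(manhattan_like(b, c))   # approximately 7.6
print(euclidean_like(b, c))   # approximately 5.49
```

B and C have no outgoing ties, so their dissimilarity comes only from the different weights of the ties they receive from A and D.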

Distance between graphs

A graph can be transformed into another one by a finite sequence of graph edit operations, which may be defined differently in various algorithms. The graph edit distance (GED) is defined by the least-cost sequence of edit operations.

Set

Set: a collection of objects without any particular order.

Distance/similarity of sets

The Jaccard index (similarity measure):

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

The Jaccard distance (distance measure):

$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
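
A short sketch using the standard definition: the Jaccard index is the size of the intersection divided by the size of the union, and the Jaccard distance is one minus the index. The example sets are hypothetical, and returning 1.0 for two empty sets is a convention chosen here.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0          # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """1 - Jaccard index."""
    return 1.0 - jaccard_index(a, b)


basket_1 = {"bread", "milk", "butter"}       # hypothetical sets
basket_2 = {"bread", "milk", "jam", "tea"}
print(jaccard_index(basket_1, basket_2))     # 0.4
print(jaccard_distance(basket_1, basket_2))  # 0.6
```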

Text

- Text: a representation of written language.
- Text can carry information, opinions or feelings.

Frequency matrix as a tool for text representation

- Pieces of information are represented by words,
- Stages:
  - cutting text into words,
  - calculation of word occurrence frequencies,
  - forming the frequency matrix.

$X = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1m} \\ x_{21} & x_{22} & \ldots & x_{2m} \\ \ldots & \ldots & \ldots & \ldots \\ x_{n1} & x_{n2} & \ldots & x_{nm} \end{bmatrix}$

One dimension of the matrix is indexed by words and the other by documents; each element is the frequency of a word in a document.
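
The three stages can be sketched as follows. The tokenizer (lower-casing and keeping runs of letters) is a simplifying assumption, the documents are hypothetical, and the orientation of the matrix (rows = documents, columns = words) is chosen here for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Stage 1: cut text into lower-case words."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_matrix(documents):
    """Stages 2 and 3: count words, then arrange the counts in a matrix.

    ASSUMPTION: rows are documents and columns are words.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


docs = ["Data mining finds patterns in data.",      # hypothetical documents
        "Text mining finds patterns in text."]
vocabulary, matrix = frequency_matrix(docs)
print(vocabulary)   # ['data', 'finds', 'in', 'mining', 'patterns', 'text']
print(matrix)       # [[2, 1, 1, 1, 1, 0], [0, 1, 1, 1, 1, 2]]
```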

Distance between words

Each word is represented by its vector of frequencies in the frequency matrix. The distance between two words is the distance between their vectors.

Distance between documents

Each document is represented by its vector of frequencies in the frequency matrix. The distance between two documents is the distance between their vectors.
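
A sketch of distances between document vectors and between word vectors taken from one frequency matrix. The matrix is hypothetical, with rows as documents and columns as words (an assumption). The Euclidean distance is the one defined earlier for points; the cosine distance is an additional, commonly used choice for frequency vectors and is not taken from the slides.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """1 - cosine of the angle between u and v (an assumed extra measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# HYPOTHETICAL frequency matrix: rows = documents, columns = words.
X = [[2, 1, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 3, 1]]

doc_vectors = X
word_vectors = [list(col) for col in zip(*X)]   # transpose gives word vectors

print(euclidean(doc_vectors[0], doc_vectors[1]))
print(cosine_distance(doc_vectors[0], doc_vectors[1]))   # close to 0
print(cosine_distance(doc_vectors[0], doc_vectors[2]))   # 1.0 (no shared words)
print(euclidean(word_vectors[0], word_vectors[1]))
```

The second document is the first one repeated twice, so its Euclidean distance from the first is large while the cosine distance is close to zero; which behaviour is preferable depends on whether document length should matter.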

Distance between words and documents

Frequency matrix → SVD → Latent Semantic Analysis

Applying SVD to the frequency matrix places words and documents in a common coordinate system, so distances between words and documents can be calculated.

THANK YOU!

Part I

Data mining approach

Types of data and the concept of similarity and distance

