Data ware house and data mining - WordPress.com

salamiblackElectronics - Devices

Nov 27, 2013 (3 years and 8 months ago)


1.3 Data Mining: On What Kind of Data?


In principle, data mining should be applicable to any kind of data repository, as well as to transient data, such as data streams. Thus the scope of our examination of data repositories will include relational databases, data warehouses, transactional databases, advanced database systems, flat files, data streams, and the World Wide Web. Advanced database systems include object-relational databases and specific application-oriented databases, such as spatial databases, time-series databases, text databases, and multimedia databases. The challenges and techniques of mining may differ for each of the repository systems.


1.3.1 Relational Databases


A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs involve mechanisms for the definition of database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access.

A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases. An ER data model represents the database as a set of entities and their relationships. Consider the following example.


Example 1.1 A relational database for AllElectronics. The AllElectronics company is described by the following relation tables: customer, item, employee, and branch. Fragments of the tables described here are shown in Figure 1.6.


The relation customer consists of a set of attributes, including a unique customer identity number (cust_ID), customer name, address, age, occupation, annual income, credit information, category, and so on. Similarly, each of the relations item, employee, and branch consists of a set of attributes describing their properties. Tables can also be used to represent the relationships between or among multiple relation tables. For our example, these include purchases (customer purchases items, creating a sales transaction that is handled by an employee), items_sold (lists the items sold in a given transaction), and works_at (employee works at a branch of AllElectronics).


Relational data can be accessed by database queries written in a relational query language, such as SQL, or with the assistance of graphical user interfaces.
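As a minimal sketch of such a query, the Python snippet below uses the standard sqlite3 module to build two heavily simplified relations and ask for the revenue per item. The column names (name, price, qty) and all row values are assumptions made up for illustration; they are not taken from Figure 1.6.

```python
import sqlite3

# Hypothetical, simplified versions of two AllElectronics relations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (item_ID TEXT PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE items_sold (trans_ID TEXT, item_ID TEXT, qty INTEGER);
    INSERT INTO item VALUES ('I1', 'TV', 500.0), ('I8', 'phone', 100.0);
    INSERT INTO items_sold VALUES
        ('T100', 'I1', 1), ('T100', 'I8', 2), ('T200', 'I8', 1);
""")

# Join the two relations, then aggregate: total revenue per item.
rows = conn.execute("""
    SELECT i.name, SUM(s.qty * i.price) AS revenue
    FROM items_sold AS s JOIN item AS i ON s.item_ID = i.item_ID
    GROUP BY i.name
""").fetchall()
revenue = dict(rows)
print(revenue)
```

The query combines a join with an aggregate function, which is the kind of retrieval a relational query language is designed for.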




When data mining is applied to relational databases, we can go further by searching for trends or data patterns. For example, data mining systems can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information. Data mining systems may also detect deviations, such as items whose sales are far from those expected in comparison with the previous year. Such deviations can then be further investigated (e.g., has there been a change in packaging of such items, or a significant increase in price?). Relational databases are one of the most commonly available and rich information repositories, and thus they are a major data form in our study of data mining.
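The deviation check described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name find_deviations, the 50% threshold, and the sales figures are all hypothetical, and a real system would use a more careful model of the expected value.

```python
def find_deviations(last_year, this_year, threshold=0.5):
    """Return items whose sales changed by more than `threshold`
    (as a fraction of last year's sales)."""
    flagged = {}
    for item, previous in last_year.items():
        if previous == 0:
            continue  # no baseline to compare against
        current = this_year.get(item, 0)
        change = (current - previous) / previous
        if abs(change) > threshold:
            flagged[item] = change
    return flagged

# Made-up yearly unit sales per item.
last_year = {"TV": 1000, "phone": 800, "camera": 400}
this_year = {"TV": 1050, "phone": 300, "camera": 900}

deviations = find_deviations(last_year, this_year)
print(deviations)
```

Items flagged this way are candidates for the follow-up questions mentioned in the text, such as a change in packaging or price.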



1.3.2 Data Warehouses


Suppose that AllElectronics is a successful international company, with branches around the world. Each branch has its own set of databases. The president of AllElectronics has asked you to provide an analysis of the company’s sales per item type per branch for the third quarter. This is a difficult task, particularly since the relevant data are spread out over several databases, physically located at numerous sites.

If AllElectronics had a data warehouse, this task would be easy. A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and that usually resides at a single site. Data warehouses are constructed via a process of data cleaning, data integration, data transformation, data loading, and periodic data refreshing. Figure 1.7 shows the typical framework for construction and use of a data warehouse for AllElectronics.






[Figure: data sources in Chicago, New York, Toronto, and Vancouver are cleaned, integrated, transformed, loaded, and refreshed into the data warehouse, which clients access through query and analysis tools.]

Figure 1.7 Typical framework of a data warehouse for AllElectronics.
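The construction steps named above (cleaning, integration, transformation, loading) can be sketched as a small Python pipeline. Everything here is an assumption for illustration: the field names, the unified schema, and the currency conversion rate are hypothetical and do not come from the AllElectronics example.

```python
def clean(record):
    """Drop records with a missing amount; normalize the item type text."""
    if record.get("amount") is None:
        return None
    return {**record, "item_type": record["item_type"].strip().lower()}

def transform(record, branch, usd_rate):
    """Map a branch-local record onto one unified (hypothetical) schema."""
    return {
        "branch": branch,
        "item_type": record["item_type"],
        "quarter": record["quarter"],
        "amount_usd": round(record["amount"] * usd_rate, 2),
    }

def load(sources):
    """Clean, integrate, and transform every source into one list of rows."""
    warehouse = []
    for branch, (records, usd_rate) in sources.items():
        for record in records:
            cleaned = clean(record)
            if cleaned is not None:
                warehouse.append(transform(cleaned, branch, usd_rate))
    return warehouse

# Made-up branch data; 0.75 is an arbitrary conversion rate.
sources = {
    "Chicago": ([{"item_type": " Computer", "quarter": "Q3", "amount": 100.0}], 1.0),
    "Toronto": ([{"item_type": "phone", "quarter": "Q3", "amount": 200.0},
                 {"item_type": "phone", "quarter": "Q3", "amount": None}], 0.75),
}
warehouse = load(sources)
print(warehouse)
```

Periodic refreshing would amount to rerunning load on newly arrived records; that step is left out of the sketch.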






To facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored to provide information from a historical perspective (such as from the past 5 to 10 years) and are typically summarized. For example, rather than storing the details of each sales transaction, the data warehouse may store a summary of the transactions per item type for each store or, summarized to a higher level, for each sales region.

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute or a set of attributes in the schema, and each cell stores the value of some aggregate measure, such as count or sales amount. The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. A data cube provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.


Example 1.2 A data cube for AllElectronics. A data cube for summarized sales data of AllElectronics is presented in Figure 1.8(a). The cube has three dimensions: address (with city values Chicago, New York, Toronto, Vancouver), time (with quarter values Q1, Q2, Q3, Q4), and item (with item type values home entertainment, computer, phone, security). The aggregate value stored in each cell of the cube is sales amount (in thousands). For example, the total sales for the first quarter, Q1, for items relating to security systems in Vancouver is $400,000, as stored in cell <Vancouver, Q1, security>. Additional cubes may be used to store aggregate sums over each dimension, corresponding to the aggregate values obtained using different SQL group-bys (e.g., the total sales amount per city and quarter, or per city and item, or per quarter and item, or per each individual dimension).
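To make the relationship between cube cells and group-bys concrete, here is a minimal Python sketch that stores the base cells in a dictionary and sums the measure over the dimensions that are dropped. Only the <Vancouver, Q1, security> value of 400 comes from the example; the other cells and the helper name group_by are assumptions.

```python
from collections import defaultdict

# Base cells: (city, quarter, item type) -> sales amount in thousands.
# Only the Vancouver/Q1/security value is from the example; the rest are made up.
cube = {
    ("Vancouver", "Q1", "security"): 400,
    ("Vancouver", "Q2", "security"): 350,
    ("Toronto", "Q1", "security"): 200,
    ("Toronto", "Q1", "phone"): 100,
}
DIMENSIONS = ("city", "quarter", "item")

def group_by(cube, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for cell, amount in cube.items():
        totals[tuple(cell[p] for p in positions)] += amount
    return dict(totals)

per_city_quarter = group_by(cube, ("city", "quarter"))
per_item = group_by(cube, ("item",))
grand_total = group_by(cube, ())
print(per_city_quarter, per_item, grand_total)
```

Each call corresponds to one SQL group-by over the same base data; a warehouse would typically precompute and store some of these results rather than recompute them per query.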


“I have also heard about data marts. What is the difference between a data warehouse and a data mart?” you may ask. A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide. A data mart, on the other hand, is a department subset of a data warehouse. It focuses on selected subjects, and thus its scope is department-wide.

By providing multidimensional data views and the precomputation of summarized data, data warehouse systems are well suited for on-line analytical processing, or OLAP. OLAP operations use background knowledge regarding the domain of the data being studied in order to allow the presentation of data at different levels of abstraction. Such operations accommodate different user viewpoints. Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at differing degrees of summarization, as illustrated in Figure 1.8(b). For instance, we can drill down on sales data summarized by quarter to see the data summarized by month. Similarly, we can roll up on sales data summarized by city to view the data summarized by country.
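Roll-up and drill-down can be sketched in Python as moving along a concept hierarchy. The hierarchies (month to quarter, city to country) follow the text, but the monthly figures and the function names below are assumptions chosen for illustration.

```python
from collections import defaultdict

# Concept hierarchies for the time and address dimensions.
MONTH_TO_QUARTER = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1"}
CITY_TO_COUNTRY = {"Chicago": "USA", "New York": "USA",
                   "Toronto": "Canada", "Vancouver": "Canada"}

# Made-up monthly sales (in thousands): the finest level kept in this sketch.
monthly = {("Chicago", "Jan"): 150, ("Chicago", "Feb"): 100, ("Chicago", "Mar"): 150,
           ("Toronto", "Jan"): 80, ("Toronto", "Feb"): 90, ("Toronto", "Mar"): 70}

def roll_up(data, mapping, axis):
    """Replace the value on `axis` by its parent in `mapping` and re-aggregate."""
    out = defaultdict(int)
    for key, amount in data.items():
        key = list(key)
        key[axis] = mapping[key[axis]]
        out[tuple(key)] += amount
    return dict(out)

def drill_down(data, city):
    """Show the finer, monthly figures behind one city's quarterly total."""
    return {month: amount for (c, month), amount in data.items() if c == city}

by_city_quarter = roll_up(monthly, MONTH_TO_QUARTER, axis=1)
by_country_quarter = roll_up(by_city_quarter, CITY_TO_COUNTRY, axis=0)
chicago_months = drill_down(monthly, "Chicago")
print(by_country_quarter, chicago_months)
```

Note that drill-down is only possible here because the finer monthly data were kept; a summary alone cannot be expanded back into its details.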



[Figure: (a) a data cube with dimensions address (cities: Chicago, New York, Toronto, Vancouver), time (quarters: Q1, Q2, Q3, Q4), and item (types: home entertainment, computer, phone, security), with the cell <Vancouver, Q1, security> marked; (b) the results of a drill-down on time data for Q1 (months: Jan, Feb, March) and of a roll-up on address (countries: USA, Canada).]

Figure 1.8 A multidimensional data cube, commonly used for data warehousing, (a) showing summarized data for AllElectronics and (b) showing summarized data resulting from drill-down and roll-up operations on the cube in (a). For improved readability, only some of the cube cell values are shown.



1.3.3 Transactional Databases


In general, a transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction (such as items purchased in a store).






trans_ID    list of item_IDs
T100        I1, I3, I8, I16
T200        I2, I8
...         ...

Figure 1.9 Fragment of a transactional database for sales at AllElectronics.



The transactional database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson and of the branch at which the sale occurred, and so on.


Example 1.3 A transactional database for AllElectronics. Transactions can be stored in a table, with one record per transaction. A fragment of a transactional database for AllElectronics is shown in Figure 1.9. From the relational database point of view, the sales table in Figure 1.9 is a nested relation because the attribute list of item_IDs contains a set of items. Because most relational database systems do not support nested relational structures, the transactional database is usually either stored in a flat file in a format similar to that of the table in Figure 1.9 or unfolded into a standard relation in a format similar to that of the items_sold table in Figure 1.6.
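The two storage options in Example 1.3 can be illustrated with a short Python sketch that takes the two transactions visible in Figure 1.9 and either unfolds them into (trans_ID, item_ID) rows or writes them as flat-file lines. The function names and the tab-separated line format are assumptions; the actual layout of the items_sold table is not shown here.

```python
# Nested form, as in Figure 1.9: one record per transaction.
nested = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I2", "I8"],
}

def unfold(transactions):
    """Flatten the nested relation into one (trans_ID, item_ID) row per item."""
    return [(trans_id, item_id)
            for trans_id, items in transactions.items()
            for item_id in items]

def to_flat_file_lines(transactions):
    """Render each transaction as one line of a flat file (assumed format)."""
    return [f"{trans_id}\t{','.join(items)}"
            for trans_id, items in transactions.items()]

rows = unfold(nested)
lines = to_flat_file_lines(nested)
print(rows)
print(lines)
```

The unfolded rows fit a standard relation with atomic attribute values, at the cost of repeating the transaction identifier once per item.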




1.3.4 Advanced Data and Information Systems and Advanced Applications


Relational database systems have been widely used in business applications. With the progress of database technology, various kinds of advanced data and information systems have emerged and are undergoing development to address the requirements of new applications.




The new database applications include handling spatial data (such as maps), engineering design data (such as the design of buildings, system components, or integrated circuits), hypertext and multimedia data (including text, image, video, and audio data), time-related data (such as historical records or stock exchange data), stream data (such as video surveillance and sensor data, where data flow in and out like streams), and the World Wide Web (a huge, widely distributed information repository made available by the Internet). These applications require efficient data structures and scalable methods for handling complex object structures; variable-length records; semistructured or unstructured data; text, spatiotemporal, and multimedia data; and database schemas with complex structures and dynamic changes.

In response to these needs, advanced database systems and specific application-oriented database systems have been developed. These include object-relational database systems, temporal and time-series database systems, spatial and spatiotemporal database systems, text and multimedia database systems, heterogeneous and legacy database systems, data stream management systems, and Web-based global information systems.

While such databases or information repositories require sophisticated facilities to efficiently store, retrieve, and update large amounts of complex data, they also provide fertile grounds and raise many challenging research and implementation issues for data mining. In this section, we describe each of the advanced database systems listed above.




Object-Relational Databases


Object-relational databases are constructed based on an object-relational data model. This model extends the relational model by providing a rich data type for handling complex objects and object orientation. Because most sophisticated database applications need to handle complex objects and structures, object-relational databases are becoming increasingly popular in industry and applications.


Conceptually, the object-relational data model inherits the essential concepts of object-oriented databases, where, in general terms, each entity is considered as an object. Following the AllElectronics example, objects can be individual employees, customers, or items. Data and code relating to an object are encapsulated into a single unit. Each object has associated with it the following:


A set of variables that describe the objects. These correspond to attributes in the entity-relationship and relational models.

A set of messages that the object can use to communicate with other objects, or with the rest of the database system.

A set of methods, where each method holds the code to implement a message. Upon receiving a message, the method returns a value in response. For instance, the method for the message get_photo(employee) will retrieve and return a photo of the given employee object.


Objects that share a common set of properties can be grouped into an object class. Each object is an instance of its class. Object classes can be organized into class/subclass hierarchies so that each class represents properties that are common to objects in that class. For instance, an employee class can contain variables like name, address, and birthdate. Suppose that the class, sales_person, is a subclass of the class, employee. A sales_person object would inherit all of the variables pertaining to its superclass of employee. In addition, it has all of the variables that pertain specifically to being a salesperson (e.g., commission). Such a class inheritance feature benefits information sharing.
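The employee and sales_person classes described above map naturally onto classes in an object-oriented language. The following Python sketch is only an analogy, not the syntax of any object-relational database system; the constructor arguments and sample values are made up.

```python
class Employee:
    """Variables name, address, and birthdate, plus a method for get_photo."""

    def __init__(self, name, address, birthdate, photo=None):
        self.name = name
        self.address = address
        self.birthdate = birthdate
        self._photo = photo

    def get_photo(self):
        """Method that answers the get_photo message for this employee."""
        return self._photo


class SalesPerson(Employee):
    """Subclass: inherits every Employee variable and adds commission."""

    def __init__(self, name, address, birthdate, commission, photo=None):
        super().__init__(name, address, birthdate, photo)
        self.commission = commission


# Made-up sample object.
sp = SalesPerson("A. Smith", "12 Main St", "1980-01-01",
                 commission=0.05, photo=b"jpeg-bytes")
print(sp.name, sp.commission, sp.get_photo())
```

The SalesPerson object carries the inherited variables without redeclaring them, which is the information sharing that class inheritance provides.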

For data mining in object-relational systems, techniques need to be developed for handling complex object structures, complex data types, class and subclass hierarchies, property inheritance, and methods and procedures.


Temporal Databases, Sequence Databases, and Time-Series Databases


A temporal database typically stores relational data that include time-related attributes. These attributes may involve several timestamps, each having different semantics. A sequence database stores sequences of ordered events, with or without a concrete notion of time. Examples include customer shopping sequences, Web click streams, and biological sequences. A time-series database stores sequences of values or events obtained over repeated measurements of time (e.g., hourly, daily, weekly). Examples include data collected from the stock exchange, inventory control, and the observation of natural phenomena (like temperature and wind).

Data mining techniques can be used to find the characteristics of object evolution, or the trend of changes for objects in the database. Such information can be useful in decision making and strategy planning. For instance, the mining of banking data may aid in the scheduling of bank tellers according to the volume of customer traffic. Stock exchange data can be mined to uncover trends that could help you plan investment strategies (e.g., when is the best time to purchase AllElectronics stock?). Such analyses typically require defining multiple granularity of time. For example, time may be decomposed according to fiscal years, academic years, or calendar years. Years may be further decomposed into quarters or months.
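As an illustration of multiple time granularities, the Python sketch below maps one date to its calendar year, fiscal year, quarter, and month. The fiscal-year convention used (a year starting in July and named after the calendar year in which it ends) is purely an assumption; organizations define fiscal years differently.

```python
from datetime import date

def calendar_quarter(d):
    """Calendar quarter label, e.g. 'Q3' for any date in July to September."""
    return f"Q{(d.month - 1) // 3 + 1}"

def fiscal_year(d, start_month=7):
    """Fiscal year, assuming it starts in `start_month` and is named
    after the calendar year in which it ends (a hypothetical convention)."""
    if start_month == 1:
        return d.year
    return d.year + 1 if d.month >= start_month else d.year

d = date(2013, 11, 27)
granularities = {
    "calendar_year": d.year,
    "fiscal_year": fiscal_year(d),
    "quarter": calendar_quarter(d),
    "month": d.month,
}
print(granularities)
```

The same timestamp therefore falls into different groups depending on which decomposition of time an analysis uses.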


Spatial Databases and Spatiotemporal Databases


Spatial databases contain spatial-related information. Examples include geographic (map) databases, very large-scale integration (VLSI) or computer-aided design databases, and medical and satellite image databases. Spatial data may be represented in raster format, consisting of n-dimensional bit maps or pixel maps. For example, a 2-D satellite image may be represented as raster data, where each pixel registers the rainfall in a given area. Maps can be represented in vector format, where roads, bridges, buildings, and lakes are represented as unions or overlays of basic geometric constructs, such as points, lines, polygons, and the partitions and networks formed by these components.
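The difference between the raster and vector formats can be seen in a small Python sketch. The rainfall grid, the road, and the lake below are made-up data, and the helper functions are assumptions chosen to show one typical computation on each representation.

```python
# Raster: a 2-D grid where each cell (pixel) holds a measurement,
# here rainfall in millimetres for the corresponding area.
rainfall = [
    [0.0, 1.5, 2.0],
    [0.5, 3.0, 4.5],
]
max_rain = max(max(row) for row in rainfall)

# Vector: features built from points, lines, and polygons.
road = [(0, 0), (3, 4), (3, 10)]          # a polyline
lake = [(0, 0), (4, 0), (4, 3), (0, 3)]   # a polygon

def polyline_length(points):
    """Sum of the straight-line distances between consecutive points."""
    return sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def polygon_area(points):
    """Area of a simple polygon by the shoelace formula."""
    n = len(points)
    twice_area = sum(points[i][0] * points[(i + 1) % n][1]
                     - points[(i + 1) % n][0] * points[i][1]
                     for i in range(n))
    return abs(twice_area) / 2

print(max_rain, polyline_length(road), polygon_area(lake))
```

Raster data suit per-cell measurements over an area, whereas vector data keep the geometry of individual objects explicit.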

Geographic databases have numerous applications, ranging from forestry and ecology planning to providing public service information regarding the location of telephone and electric cables, pipes, and sewage systems. In addition, geographic databases are commonly used in vehicle navigation and dispatching systems. An example of such a system for taxis would store a city map with information regarding one-way streets, suggested routes for moving from region A to region B during rush hour, and the location of restaurants and hospitals, as well as the current location of each driver.

A spatial database that stores spatial objects that change with time is called a spatiotemporal database, from which interesting information can be mined. For example, we may be able to group the trends of moving objects and identify some strangely moving vehicles, or distinguish a bioterrorist attack from a normal outbreak of the flu based on the geographic spread of a disease with time.



Text Databases and Multimedia Databases


Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, warning messages, summary reports, notes, or other documents. Text databases may be highly unstructured (such as some Web pages on the World Wide Web). Some text databases may be somewhat structured, that is, semistructured (such as e-mail messages and many HTML/XML Web pages), whereas others are relatively well structured (such as library catalogue databases). Text databases with highly regular structures typically can be implemented using relational database systems.

“What can data mining on text databases uncover?” By mining text data, one may uncover general and concise descriptions of the text documents, keyword or content associations, as well as the clustering behavior of text objects. To do this, standard data mining methods need to be integrated with information retrieval techniques and the construction or use of hierarchies specifically for text data (such as dictionaries and thesauruses), as well as discipline-oriented term classification systems (such as in biochemistry, medicine, law, or economics).
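One simple form of keyword association is counting how often two keywords occur in the same document. The Python sketch below does this for three made-up bug-report titles; the stopword list and the whitespace tokenization are simplifying assumptions, and real text mining would rely on proper information retrieval preprocessing.

```python
from collections import Counter
from itertools import combinations

# Made-up documents, e.g. short bug-report titles.
docs = [
    "printer driver crash on install",
    "driver crash after update",
    "printer paper jam",
]
STOPWORDS = {"on", "after"}

def keyword_pairs(documents):
    """Count the number of documents in which each keyword pair co-occurs."""
    counts = Counter()
    for text in documents:
        words = sorted(set(text.split()) - STOPWORDS)
        counts.update(combinations(words, 2))
    return counts

pairs = keyword_pairs(docs)
print(pairs.most_common(1))
```

Pairs with a high count, such as "crash" and "driver" here, are candidates for keyword associations worth examining further.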

Multimedia databases store image, audio, and video data. They are used in applications such as picture content-based retrieval, voice-mail systems, video-on-demand systems, the World Wide Web, and speech-based user interfaces that recognize spoken commands. Multimedia databases must support large objects, because data objects such as video can require gigabytes of storage. Specialized storage and search techniques are also required. Because video and audio data require real-time retrieval at a steady and predetermined rate in order to avoid picture or sound gaps and system buffer overflows, such data are referred to as continuous-media data.



For multimedia data mining, storage and search techniques need to be integrated with standard data mining methods. Promising approaches include the construction of multimedia data cubes, the extraction of multiple features from multimedia data, and similarity-based pattern matching.


Heterogeneous Databases and Legacy Databases


A heterogeneous database consists of a set of interconnected, autonomous component databases. The components communicate in order to exchange information and answer queries. Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.

Many enterprises acquire legacy databases as a result of the long history of information technology development (including the application of different hardware and operating systems). A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases, or file systems. The heterogeneous databases in a legacy database may be connected by intra- or inter-computer networks.

Information exchange across such databases is difficult because it would require precise transformation rules from one representation to another, considering diverse semantics.



The World Wide Web


The World Wide Web and its associated distributed information services, such as Yahoo!, Google, America Online, and AltaVista, provide rich, worldwide, on-line information services, where data objects are linked together to facilitate interactive access. Users seeking information of interest traverse from one object via links to another. Such systems provide ample opportunities and challenges for data mining. For example, understanding user access patterns will not only help improve system design (by providing efficient access between highly correlated objects), but also lead to better marketing decisions (e.g., by placing advertisements in frequently visited documents, or by providing better customer/user classification and behavior analysis). Capturing user access patterns in such distributed information environments is called Web usage mining (or Weblog mining).

Although Web pages may appear fancy and informative to human readers, they can be highly unstructured and lack a predefined schema, type, or pattern. Thus it is difficult for computers to understand the semantic meaning of diverse Web pages and structure them in an organized way for systematic information retrieval and data mining. Web services that provide keyword-based searches without understanding the context behind the Web pages can only offer limited help to users. For example, a Web search based on a single keyword may return hundreds of Web page pointers containing the keyword, but most of the pointers will be very weakly related to what the user wants to find.

Data mining can often provide more help here than Web search services. For example, authoritative Web page analysis based on linkages among Web pages can help rank Web pages based on their importance, influence, and topics. Automated Web page clustering and classification help group and arrange Web pages in a multidimensional manner based on their contents. Web community analysis helps identify hidden Web social networks and communities and observe their evolution. Web mining is the development of scalable and effective Web data analysis and mining methods. It may help us learn about the distribution of information on the Web in general, characterize and classify Web pages, and uncover Web dynamics and the association and other relationships among different Web pages, users, communities, and Web-based activities.
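
The idea of ranking pages by the linkages among them can be illustrated with a small power iteration in the style of PageRank. This is only a sketch under stated assumptions: the four-page link graph, the damping factor of 0.85, and the function name `rank_pages` are all hypothetical and are not taken from the text.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by repeatedly passing each page's score along its out-links.

    links: dict mapping a page to the list of pages it links to (hypothetical data).
    """
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_score = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * score[page] / len(targets)
                for target in targets:
                    new_score[target] += share
            else:
                # A page without out-links spreads its score evenly over all pages.
                for target in pages:
                    new_score[target] += damping * score[page] / n
        score = new_score
    return score


links = {
    "home": ["products", "about"],
    "products": ["home"],
    "about": ["home", "products"],
    "blog": ["home"],
}
scores = rank_pages(links)
for page, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {value:.3f}")
```

In this toy graph the page that receives the most links ends up with the highest score, and the page that nobody links to ends up with the lowest.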



1.4 Data Mining Functionalities: What Kinds of Patterns Can Be Mined?


We have observed various types of databases and information repositories on which data mining can be performed. Let us now examine the kinds of data patterns that can be mined.

Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. In general, data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions. Data mining functionalities, and the kinds of patterns they can discover, are described below.


1.4.1 Concept/Class Description: Characterization and Discrimination


Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived via (1) data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or (2) data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or (3) both data characterization and discrimination.

Data characterization is a summarization of the general characteristics or features of a target class of data. The data corresponding to the user-specified class are typically collected by a database query. For example, to study the characteristics of software products whose sales increased by 10% in the last year, the data related to such products can be collected by executing an SQL query.
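
A query of this kind might look like the following sketch, run here against an in-memory SQLite database. The table `product_sales`, its columns, and the sample rows are hypothetical, and "increased by 10%" is read as "grew by at least 10%", which is an assumption.

```python
import sqlite3

# Hypothetical schema and data, created only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product_sales ("
    "name TEXT, category TEXT, sales_last_year REAL, sales_this_year REAL)"
)
conn.executemany(
    "INSERT INTO product_sales VALUES (?, ?, ?, ?)",
    [
        ("PhotoEdit", "software", 100.0, 125.0),
        ("TaxHelper", "software", 200.0, 204.0),
        ("GameBox", "software", 50.0, 58.0),
        ("LaserJet X", "printer", 300.0, 360.0),
    ],
)

# Collect the target class: software products whose sales grew by at least 10%.
rows = conn.execute(
    """
    SELECT name,
           (sales_this_year - sales_last_year) / sales_last_year AS growth
    FROM product_sales
    WHERE category = 'software'
      AND (sales_this_year - sales_last_year) / sales_last_year >= 0.10
    ORDER BY name
    """
).fetchall()
target_names = [name for name, _ in rows]
print(target_names)
```

The rows returned by the query form the target class; a characterization step would then summarize them, for example by averaging the growth values.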


1.4.2 Mining Frequent Patterns, Associations, and Correlations


Frequent patterns, as the name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns, including itemsets, subsequences, and substructures. A frequent itemset typically refers to a set of items that frequently appear together in a transactional data set, such as milk and bread. A frequently occurring subsequence, such as the pattern that customers tend to purchase first a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern. A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a (frequent) structured pattern. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
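
To make the notion of a frequent itemset concrete, the sketch below counts how often each item and each pair of items occurs in a handful of made-up transactions and keeps those that reach a minimum support. It enumerates candidates by brute force, which is an assumption made for brevity and not an efficient mining algorithm; the function name `frequent_itemsets` and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose fraction of containing transactions is at least min_support."""
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    n = len(transactions)
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}


transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
    ["eggs"],
]
result = frequent_itemsets(transactions, min_support=0.6)
for itemset, support in sorted(result.items()):
    print(itemset, support)
```

With a minimum support of 60%, only bread, milk, and the pair (bread, milk) survive in this toy data set.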


Example 1.6 Association analysis. Suppose, as a marketing manager of AllElectronics, you would like to determine which items are frequently purchased together within the same transactions. An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]

where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all of the transactions under analysis showed that computer and software were purchased together. This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules.



Dropping the predicate notation, the above rule can be written simply as “computer ⇒ software [1%, 50%]”.

Suppose, instead, that we are given the AllElectronics relational database relating to purchases. A data mining system may find association rules like

age(X, “20...29”) ∧ income(X, “20K...29K”) ⇒ buys(X, “CD player”) [support = 2%, confidence = 60%]

The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years of age with an income of 20,000 to 29,000 and have purchased a CD player at AllElectronics. There is a 60% probability that a customer in this age and income group will purchase a CD player. Note that this is an association between more than one attribute, or predicate (i.e., age, income, and buys). Adopting the terminology used in multidimensional databases, where each attribute is referred to as a dimension, the above rule can be referred to as a multidimensional association rule.
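
Support and confidence as used in this example can be computed directly from transaction counts: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The following sketch uses made-up transactions chosen so that the rule computer ⇒ software comes out at 1% support and 50% confidence; the function name `rule_metrics` is hypothetical.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_both / n_total
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical data: 200 transactions, 4 contain a computer,
# and 2 of those 4 also contain software.
transactions = (
    [{"computer", "software"}] * 2
    + [{"computer"}] * 2
    + [{"printer"}] * 196
)
support, confidence = rule_metrics(transactions, {"computer"}, {"software"})
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```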


1.4.3 Classification and Prediction


Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).


“How is the derived model presented?” The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks (Figure 1.10).

A decision tree is a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and tree leaves represent classes or class distributions. Decision trees can easily be converted to classification rules. A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naïve Bayesian classification, support vector machines, and k-nearest neighbor classification.
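
The relationship between a decision tree and IF-THEN rules can be shown with a small hand-written tree. In this sketch the attributes (age, income), the branch values, and the class labels A, B, and C are hypothetical; the tree is written by hand and not learned from training data.

```python
# Internal nodes test one attribute, branches are test outcomes,
# and leaves are class labels (all values here are made up).
tree = {
    "attribute": "age",
    "branches": {
        "youth": {
            "attribute": "income",
            "branches": {"high": "A", "low": "B"},
        },
        "middle_aged": "C",
        "senior": "C",
    },
}


def classify(node, record):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


def to_rules(node, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN rule string."""
    if not isinstance(node, dict):
        tests = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {tests} THEN class = {node}"]
    rules = []
    for outcome, child in node["branches"].items():
        rules.extend(to_rules(child, conditions + ((node["attribute"], outcome),)))
    return rules


print(classify(tree, {"age": "youth", "income": "low"}))
for rule in to_rules(tree):
    print(rule)
```

Each root-to-leaf path becomes exactly one rule, which is why the conversion from tree to rules is straightforward.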

Whereas classification predicts categorical (discrete, unordered) labels, prediction models continuous-valued functions. That is, it is used to predict missing or unavailable numerical data values rather than class labels. Although the term prediction may refer to both numeric prediction and class label prediction, in this book we use it to refer primarily to numeric prediction. Regression analysis is a statistical methodology that is most often used for numeric prediction, although other methods exist as well. Prediction also encompasses the identification of distribution trends based on the available data.
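
As a minimal illustration of numeric prediction by regression, the sketch below fits a straight line to a few points by ordinary least squares and uses it to predict an unseen value. The data (years of experience versus salary in thousands) and the function name `fit_line` are hypothetical, and simple linear regression is only one of many possible regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data that happens to lie exactly on y = 30 + 2x.
years_experience = [1, 2, 3, 4, 5]
salary_k = [32, 34, 36, 38, 40]

slope, intercept = fit_line(years_experience, salary_k)
predicted = intercept + slope * 6  # numeric prediction for an unseen value
print(slope, intercept, predicted)
```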

Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.


1.4.4 Cluster Analysis


“What is cluster analysis?” Unlike classification and prediction, which analyze class-labeled data objects, clustering analyzes data objects without consulting a known class label.













Figure 1.11 A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Each cluster “center” is marked with a “+”.



In general, the class labels are not present in the training data simply because they are not known to begin with. Clustering can be used to generate such labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived. Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.
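
One common way to form such clusters is k-means, sketched below for 2-D points such as customer locations. The text does not prescribe a particular algorithm, so the choice of k-means, the sample coordinates, and the initial centers are all assumptions made for illustration.

```python
def assign(points, centers):
    """Put each point into the cluster of its nearest center (squared Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
        clusters[distances.index(min(distances))].append(p)
    return clusters


def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points, starting from the given initial centers."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
    return centers, assign(points, centers)


# Hypothetical customer locations forming three visible groups.
points = [
    (1, 1), (1, 2), (2, 1),
    (8, 8), (9, 8), (8, 9),
    (1, 8), (2, 9), (1, 9),
]
centers, clusters = kmeans(points, centers=[points[0], points[3], points[6]])
for center, members in zip(centers, clusters):
    print(center, members)
```

Each resulting group can then be treated as a generated class label for its members.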



1.4.5 Outlier Analysis


A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Most data mining methods discard outliers as noise or exceptions. However, in some applications such as fraud detection, the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier mining.

Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures where objects that are a substantial distance from any other cluster are considered outliers. Rather than using statistical or distance measures, deviation-based methods identify outliers by examining differences in the main characteristics of objects in a group.


Example 1.9 Outlier analysis. Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account. Outlier values may also be detected with respect to the location and type of purchase, or the purchase frequency.
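
A simple statistical test for unusually large charges on one account can be sketched with a robust score based on the median and the median absolute deviation (MAD). The 0.6745 scaling constant and the 3.5 cutoff are a commonly used convention for this "modified z-score", adopted here as an assumption; the charge amounts and the function name `flag_outliers` are made up.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Return the amounts whose modified z-score exceeds the threshold."""
    med = median(amounts)
    mad = median([abs(a - med) for a in amounts])
    if mad == 0:
        return []  # no spread to measure against, so nothing is flagged in this sketch
    return [a for a in amounts if 0.6745 * abs(a - med) / mad > threshold]


# Hypothetical charges on a single account; one is far larger than the rest.
charges = [25, 40, 32, 18, 45, 30, 2500]
suspicious = flag_outliers(charges)
print(suspicious)
```

The median-based score is used here because a single extreme value inflates the mean and standard deviation, which can hide the very outlier one is looking for.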





1.4.6 Evolution Analysis


Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.


Example 1.10 Evolution analysis. Suppose that you have the major stock market (time-series) data of the last several years available from the New York Stock Exchange and you would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices, contributing to your decision making regarding stock investments.
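
One elementary form of time-series trend analysis is smoothing with a moving average, which damps day-to-day fluctuations so that the underlying direction is easier to see. The sketch below uses made-up prices and a window of three observations; both are assumptions, and real stock analysis would involve far more than this.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical daily closing prices for one stock.
prices = [10, 11, 12, 11, 13, 14, 15, 14, 16, 17]
smoothed = moving_average(prices, window=3)
rising = all(later > earlier for earlier, later in zip(smoothed, smoothed[1:]))
print(smoothed)
print("upward trend" if rising else "no consistent upward trend")
```

Although the raw prices dip twice, the smoothed series rises at every step in this toy example.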


1.6 Classification of Data Mining Systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science (Figure 1.12). Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.

Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems, which may help potential users distinguish between such systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows:



Classification according to the kinds of databases mined: A data mining system can be classified according to the kinds of databases mined. Database systems can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly.

For instance, if classifying according to data models, we may have a relational, transactional, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, stream data, multimedia data mining system, or a World Wide Web mining system.

Classification according to the kinds of knowledge mined: Data mining systems can be categorized according to the kinds of knowledge they mine, that is, based on data mining functionalities, such as characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities.

Moreover, data mining systems can be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.

Data mining systems can also be categorized as those that mine data regularities (commonly occurring patterns) versus those that mine data irregularities (such as exceptions, or outliers). In general, concept description, association and correlation analysis, classification, prediction, and clustering mine data regularities, rejecting outliers as noise. These methods may also help detect outliers.

Classification according to the kinds of techniques utilized: Data mining systems can be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems) or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique that combines the merits of a few individual approaches.

Classification according to the applications adapted: Data mining systems can also be categorized according to the applications they are adapted to. For example, data mining systems may be tailored specifically for finance, telecommunications, DNA, stock markets, e-mail, and so on. Different applications often require the integration of application-specific methods. Therefore, a generic, all-purpose data mining system may not fit domain-specific mining tasks.


1.9 Major Issues in Data Mining


Mining methodology and user interaction issues: These reflect the kinds of knowledge mined, the ability to mine knowledge at multiple granularities, the use of domain knowledge, ad hoc mining, and knowledge visualization.


Mining different kinds of knowledge in databases: Because different users can be interested in different kinds of knowledge, data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, including data characterization, discrimination, association and correlation analysis, classification, prediction, clustering, outlier analysis, and evolution analysis (which includes trend and similarity analysis). These tasks may use the same database in different ways and require the development of numerous data mining techniques.

Interactive mining of knowledge at multiple levels of abstraction: Because it is difficult to know exactly what can be discovered within a database, the data mining process should be interactive. For databases containing a huge amount of data, appropriate sampling techniques can first be applied to facilitate interactive data exploration. Interactive mining allows users to focus the search for patterns, providing and refining data mining requests based on returned results. Specifically, knowledge should be mined by drilling down, rolling up, and pivoting through the data space and knowledge space interactively, similar to what OLAP can do on data cubes. In this way, the user can interact with the data mining system to view data and discovered patterns at multiple granularities and from different angles.
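
The OLAP-style operations mentioned here can be pictured as re-aggregating the same rows along different dimensions. The sketch below is a loose illustration only: the sales rows, the dimension layout (country, city, quarter), and the helper name `aggregate` are hypothetical, and a real OLAP engine works on precomputed data cubes instead of raw tuples.

```python
from collections import defaultdict

# Hypothetical rows: (country, city, quarter, amount).
sales = [
    ("Canada", "Vancouver", "Q1", 100),
    ("Canada", "Vancouver", "Q2", 150),
    ("Canada", "Toronto", "Q1", 200),
    ("USA", "Chicago", "Q1", 300),
    ("USA", "Chicago", "Q2", 250),
]


def aggregate(rows, key_indexes):
    """Sum the amount (last field) grouped by the chosen dimension columns."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in key_indexes)] += row[-1]
    return dict(totals)


by_city = aggregate(sales, [0, 1])   # drill-down view: country and city
by_country = aggregate(sales, [0])   # roll-up: the city level is summed away
by_quarter = aggregate(sales, [2])   # pivot: view the same data by time

print(by_city)
print(by_country)
print(by_quarter)
```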


Incorporation of background knowledge: Background knowledge, or information regarding the domain under study, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction. Domain knowledge related to databases, such as integrity constraints and deduction rules, can help focus and speed up a data mining process, or judge the interestingness of discovered patterns.


Data mining query languages and ad hoc data mining: Relational query languages (such as SQL) allow users to pose ad hoc queries for data retrieval. In a similar vein, high-level data mining query languages need to be developed to allow users to describe ad hoc data mining tasks by facilitating the specification of the relevant sets of data for analysis, the domain knowledge, the kinds of knowledge to be mined, and the conditions and constraints to be enforced on the discovered patterns. Such a language should be integrated with a database or data warehouse query language and optimized for efficient and flexible data mining.




Presentation and visualization of data mining results: Discovered knowledge should be expressed in high-level languages, visual representations, or other expressive forms so that the knowledge can be easily understood and directly usable by humans. This is especially crucial if the data mining system is to be interactive. This requires the system to adopt expressive knowledge representation techniques, such as trees, tables, rules, graphs, charts, crosstabs, matrices, or curves.


Handling noisy or incomplete data: The data stored in a database may reflect noise, exceptional cases, or incomplete data objects. When mining data regularities, these objects may confuse the process, causing the knowledge model constructed to overfit the data. As a result, the accuracy of the discovered patterns can be poor. Data cleaning methods and data analysis methods that can handle noise are required, as well as outlier mining methods for the discovery and analysis of exceptional cases.


Pattern evaluation (the interestingness problem): A data mining system can uncover thousands of patterns. Many of the patterns discovered may be uninteresting to the given user, either because they represent common knowledge or lack novelty. Several challenges remain regarding the development of techniques to assess the interestingness of discovered patterns, particularly with regard to subjective measures that estimate the value of patterns with respect to a given user class, based on user beliefs or expectations. The use of interestingness measures or user-specified constraints to guide the discovery process and reduce the search space is another active area of research.




Performance issues: These include efficiency, scalability, and parallelization of data mining algorithms.


Efficiency and scalability of data mining algorithms: To effectively extract information from a huge amount of data in databases, data mining algorithms must be efficient and scalable. In other words, the running time of a data mining algorithm must be predictable and acceptable in large databases. From a database perspective on knowledge discovery, efficiency and scalability are key issues in the implementation of data mining systems. Many of the issues discussed above under mining methodology and user interaction must also consider efficiency and scalability.

Parallel, distributed, and incremental mining algorithms: The huge size of many databases, the wide distribution of data, and the computational complexity of some data mining methods are factors motivating the development of parallel and distributed data mining algorithms. Such algorithms divide the data into partitions, which are processed in parallel. The results from the partitions are then merged. Moreover, the high cost of some data mining processes promotes the need for incremental data mining algorithms that incorporate database updates without having to mine the entire data again “from scratch.” Such algorithms perform knowledge modification incrementally to amend and strengthen what was previously discovered.
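The partition, merge, and incremental-update ideas can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the mining task is reduced to counting item occurrences, the function names and sample transactions are hypothetical, and a thread pool stands in for a real parallel or distributed runtime.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(transactions):
    """Mine one partition: count in how many transactions each item occurs."""
    counts = Counter()
    for basket in transactions:
        counts.update(set(basket))  # count each item once per transaction
    return counts


def mine_partitioned(transactions, n_partitions=2):
    """Divide the data into partitions, process them concurrently, merge results."""
    parts = [transactions[i::n_partitions] for i in range(n_partitions)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partial_results = list(pool.map(count_items, parts))
    merged = Counter()
    for counts in partial_results:
        merged += counts
    return merged


def update_incrementally(previous_counts, new_transactions):
    """Fold newly arrived transactions into earlier results without a full re-scan."""
    return previous_counts + count_items(new_transactions)


# Hypothetical sample data, invented for this sketch.
old = [["milk", "bread"], ["milk"], ["bread", "eggs"], ["milk", "eggs"]]
new = [["eggs"], ["milk", "bread"]]

counts = mine_partitioned(old)
updated = update_incrementally(counts, new)
```

Because the counts are additive, merging partition results and applying an update both reduce to adding counters. Mining tasks whose results are not additive would need a more careful merge step.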


Issues relating to the diversity of database types:


Handling of relational and complex types of data: Because relational databases and data warehouses are widely used, the development of efficient and effective data mining systems for such data is important. However, other databases may contain complex data objects, hypertext and multimedia data, spatial data, temporal data, or transaction data. It is unrealistic to expect one system to mine all kinds of data, given the diversity of data types and different goals of data mining. Specific data mining systems should be constructed for mining specific kinds of data. Therefore, one may expect to have different data mining systems for different kinds of data.

Mining information from heterogeneous databases and global information systems: Local- and wide-area computer networks (such as the Internet) connect many sources of data, forming huge, distributed, and heterogeneous databases. The discovery of knowledge from different sources of structured, semistructured, or unstructured data with diverse data semantics poses great challenges to data mining. Data mining may help disclose high-level data regularities in multiple heterogeneous databases that are unlikely to be discovered by simple query systems and may improve information exchange and interoperability in heterogeneous databases. Web mining, which uncovers interesting knowledge about Web contents, Web structures, Web usage, and Web dynamics, becomes a very challenging and fast-evolving field in data mining.







3.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases


The entity-relationship data model is commonly used in the design of relational databases, where a database schema consists of a set of entities and the relationships between them. Such a data model is appropriate for on-line transaction processing. A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line data analysis.

The most popular data model for a data warehouse is a multidimensional model. Such a model can exist in the form of a star schema, a snowflake schema, or a fact constellation schema. Let’s look at each of these schema types.


Star schema: The most common modeling paradigm is the star schema, in which the data warehouse contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.


Example 3.1 Star schema. A star schema for AllElectronics sales is shown in Figure 3.4. Sales are considered along four dimensions, namely, time, item, branch, and location. The schema contains a central fact table for sales that contains keys to each of the four dimensions, along with two measures: dollars_sold and units_sold. To minimize the size of the fact table, dimension identifiers (such as time_key and item_key) are system-generated identifiers.


Notice that in the star schema, each dimension is represented by only one table, and each table contains a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may introduce some redundancy. For example, “Vancouver” and “Victoria” are both cities in the Canadian province of British Columbia. Entries for such cities in the location dimension table will create redundancy among the attributes province_or_state and country, that is, (..., Vancouver, British Columbia, Canada) and (..., Victoria, British Columbia, Canada). Moreover, the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial order).
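As a rough illustration of the star schema just described, the following Python sketch builds the AllElectronics sales tables in an in-memory SQLite database. The column names follow the schema of Example 3.1. The `_dim` and `_fact` table suffixes, the column types, and all sample rows are assumptions made for this sketch only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER,
    day_of_the_week TEXT, month INTEGER, quarter TEXT, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT,
    brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
    branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT,
    city TEXT, province_or_state TEXT, country TEXT);
-- Central fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL, units_sold INTEGER);
""")

# Hypothetical sample rows, invented for this sketch.
conn.execute("INSERT INTO time_dim VALUES (1, 15, 'Monday', 3, 'Q1', 2010)")
conn.execute("INSERT INTO item_dim VALUES (1, 'TV-100', 'Acme', 'television', 'wholesale')")
conn.execute("INSERT INTO branch_dim VALUES (1, 'Downtown', 'retail')")
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)", [
    (1, "101 Main St", "Vancouver", "British Columbia", "Canada"),
    (2, "22 Bay Rd", "Victoria", "British Columbia", "Canada"),  # repeats province, country
])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 500.0, 2),
    (1, 1, 1, 2, 250.0, 1),
])

# A typical star join: one join per dimension that the query touches.
rows = conn.execute("""
    SELECT l.province_or_state, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY l.province_or_state
""").fetchall()
```

The two `location_dim` rows both carry "British Columbia" and "Canada", which is the redundancy the text points out for a single denormalized dimension table.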



Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph forms a shape similar to a snowflake.






time dimension table: time_key, day, day_of_the_week, month, quarter, year
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
item dimension table: item_key, item_name, brand, type, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country

Figure 3.4 Star schema of a data warehouse for sales.




The major difference between the snowflake and star schema models is that the dimension tables of the snowflake model may be kept in normalized form to reduce redundancies. Such a table is easy to maintain and saves storage space. However, this saving of space is negligible in comparison to the typical magnitude of the fact table. Furthermore, the snowflake structure can reduce the effectiveness of browsing, since more joins will be needed to execute a query. Consequently, the system performance may be adversely impacted. Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in data warehouse design.


Example 3.2 Snowflake schema. A snowflake schema for AllElectronics sales is given in Figure 3.5. Here, the sales fact table is identical to that of the star schema in Figure 3.4. The main difference between the two schemas is in the definition of dimension tables. The single dimension table for item in the star schema is normalized in the snowflake schema, resulting in new item and supplier tables. For example, the item dimension table now contains the attributes item_key, item_name, brand, type, and supplier_key, where supplier_key is linked to the supplier dimension table, containing supplier_key and supplier_type information. Similarly, the single dimension table for location in the star schema can be normalized into two new tables: location and city. The city_key in the new location table links to the city dimension. Notice that further normalization can be performed on province_or_state and country in the snowflake schema shown in Figure 3.5, when desirable.






time dimension table: time_key, day, day_of_week, month, quarter, year
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
item dimension table: item_key, item_name, brand, type, supplier_key
supplier dimension table: supplier_key, supplier_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city_key
city dimension table: city_key, city, province_or_state, country

Figure 3.5 Snowflake schema of a data warehouse for sales.
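The normalization of the location dimension into location and city tables can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function names and sample rows are hypothetical, and city keys are simply assigned in order of first appearance.

```python
def snowflake_location(star_rows):
    """Split star-schema location rows into normalized location and city tables."""
    city_keys = {}       # (city, province_or_state, country) -> city_key
    city_table = []      # rows: (city_key, city, province_or_state, country)
    location_table = []  # rows: (location_key, street, city_key)
    for location_key, street, city, province_or_state, country in star_rows:
        city_tuple = (city, province_or_state, country)
        if city_tuple not in city_keys:
            city_keys[city_tuple] = len(city_keys) + 1  # system-generated key
            city_table.append((city_keys[city_tuple],) + city_tuple)
        location_table.append((location_key, street, city_keys[city_tuple]))
    return location_table, city_table


# Hypothetical star-schema rows, invented for this sketch.
star_location = [
    (1, "101 Main St", "Vancouver", "British Columbia", "Canada"),
    (2, "55 Oak Ave", "Vancouver", "British Columbia", "Canada"),
    (3, "22 Bay Rd", "Victoria", "British Columbia", "Canada"),
]
location_table, city_table = snowflake_location(star_location)


def lookup_country(location_key):
    """Answering a query now needs an extra join from location to city."""
    city_by_key = {row[0]: row for row in city_table}
    for loc_key, _street, city_key in location_table:
        if loc_key == location_key:
            return city_by_key[city_key][3]
    return None
```

Each city's province and country are now stored once, at the cost of the extra lookup in `lookup_country`, which mirrors the additional join a snowflake query needs.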




Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact constellation.



Example 3.3 Fact constellation. A fact constellation schema is shown in Figure 3.6. This schema specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star schema (Figure 3.4). The shipping table has five dimensions, or keys: item_key, time_key, shipper_key, from_location, and to_location, and two measures: dollars_cost and units_shipped. A fact constellation schema allows dimension tables to be shared between fact tables. For example, the dimension tables for time, item, and location are shared between the sales and shipping fact tables.


In data warehousing, there is a distinction between a data warehouse and a data mart. A data warehouse collects information about subjects that span the entire organization, such as customers, items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data warehouses, the fact constellation schema is commonly used, since it can model multiple, interrelated subjects. A data mart, on the other hand, is a department subset of the data warehouse that focuses on selected subjects, and thus its scope is department-wide. For data marts, the star or snowflake schemas are commonly used, since both are geared toward modeling single subjects, although the star schema is more popular and efficient.







time dimension table: time_key, day, day_of_week, month, quarter, year
sales fact table: time_key, item_key, branch_key, location_key, dollars_sold, units_sold
item dimension table: item_key, item_name, brand, type, supplier_type
shipping fact table: item_key, time_key, shipper_key, from_location, to_location, dollars_cost, units_shipped
shipper dimension table: shipper_key, shipper_name, location_key, shipper_type
branch dimension table: branch_key, branch_name, branch_type
location dimension table: location_key, street, city, province_or_state, country

Figure 3.6 Fact constellation schema of a data warehouse for sales and shipping.



3.2.3 Examples for Defining Star, Snowflake, and Fact Constellation Schemas



“How can I define a multidimensional schema for my data?” Just as relational query languages like SQL can be used to specify relational queries, a data mining query language can be used to specify data mining tasks. In particular, we examine how to define data warehouses and data marts in our SQL-based data mining query language, DMQL.

Data warehouses and data marts can be defined using two language primitives, one for cube definition and one for dimension definition. The cube definition statement has the following syntax:

    define cube ⟨cube_name⟩ [⟨dimension_list⟩]: ⟨measure_list⟩

The dimension definition statement has the following syntax:

    define dimension ⟨dimension_name⟩ as (⟨attribute_or_dimension_list⟩)


Let’s look at examples of how to define the star, snowflake, and fact constellation schemas of Examples 3.1 to 3.3 using DMQL. DMQL keywords are displayed in sans serif font.


Example 3.4 Star schema definition. The star schema of Example 3.1 and Figure 3.4 is defined in DMQL as follows:

    define cube sales_star [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

The define cube statement defines a data cube called sales_star, which corresponds to the central sales fact table of Example 3.1. This command specifies the dimensions and the two measures, dollars_sold and units_sold. The data cube has four dimensions, namely, time, item, branch, and location. A define dimension statement is used to define each of the dimensions.
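To make the two measures concrete, here is a small Python sketch of what dollars_sold = sum(sales_in_dollars) and units_sold = count(*) compute for each combination of dimension values. It is not a DMQL implementation: the `build_cube` function, the dictionary-based record format, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict


def build_cube(records, dimensions):
    """Aggregate raw sales records into cells keyed by their dimension values."""
    cells = defaultdict(lambda: {"dollars_sold": 0.0, "units_sold": 0})
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        cells[key]["dollars_sold"] += rec["sales_in_dollars"]  # sum(sales_in_dollars)
        cells[key]["units_sold"] += 1                          # count(*)
    return dict(cells)


# Hypothetical raw sales records, invented for this sketch.
records = [
    {"time": "Q1", "item": "TV", "branch": "B1", "location": "Vancouver",
     "sales_in_dollars": 400.0},
    {"time": "Q1", "item": "TV", "branch": "B1", "location": "Vancouver",
     "sales_in_dollars": 450.0},
    {"time": "Q2", "item": "phone", "branch": "B2", "location": "Victoria",
     "sales_in_dollars": 120.0},
]

# The full four-dimensional cube, and a coarser view over item only.
sales_star = build_cube(records, ["time", "item", "branch", "location"])
by_item = build_cube(records, ["item"])
```

Passing a shorter dimension list, as in `by_item`, yields a more aggregated view of the same records.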


Example 3.5 Snowflake schema definition. The snowflake schema of Example 3.2 and Figure 3.5 is defined in DMQL as follows:

    define cube sales_snowflake [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city (city_key, city, province_or_state, country))

This definition is similar to that of sales_star (Example 3.4), except that, here, the item and location dimension tables are normalized. For instance, the item dimension of the sales_star data cube has been normalized in the sales_snowflake cube into two dimension tables, item and supplier. Note that the dimension definition for supplier is specified within the definition for item. Defining supplier in this way implicitly creates a supplier_key in the item dimension table definition. Similarly, the location dimension of the sales_star data cube has been normalized in the sales_snowflake cube into two dimension tables, location and city. The dimension definition for city is specified within the definition for location. In this way, a city_key is implicitly created in the location dimension table definition.
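The implicit key creation described above can be sketched as a small recursive function. This is a hypothetical illustration rather than how a DMQL processor necessarily works: nested dimensions are encoded as `(name, attributes)` tuples, and the sketch assumes the first attribute of a nested dimension is its key, which holds for supplier_key and city_key here.

```python
def flatten_dimension(name, attributes):
    """Return {table_name: [columns]} for a possibly nested dimension definition."""
    tables = {}
    columns = []
    for attr in attributes:
        if isinstance(attr, tuple):  # nested dimension: (name, attributes)
            child_name, child_attrs = attr
            tables.update(flatten_dimension(child_name, child_attrs))
            columns.append(child_attrs[0])  # implicit foreign key to the child table
        else:
            columns.append(attr)
    tables[name] = columns
    return tables


# The item and location dimensions of sales_snowflake, in the tuple encoding.
item_tables = flatten_dimension(
    "item",
    ["item_key", "item_name", "brand", "type",
     ("supplier", ["supplier_key", "supplier_type"])],
)
location_tables = flatten_dimension(
    "location",
    ["location_key", "street",
     ("city", ["city_key", "city", "province_or_state", "country"])],
)
```

The resulting column lists match the item, supplier, location, and city tables of the snowflake schema, with supplier_key and city_key appearing in the parent tables as links.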


Finally, a fact constellation schema can be defined as a set of interconnected cubes. Below is an example.


Example 3.6 Fact constellation schema definition. The fact constellation schema of Example 3.3 and Figure 3.6 is defined in DMQL as follows:

    define cube sales [time, item, branch, location]:
        dollars_sold = sum(sales_in_dollars), units_sold = count(*)
    define dimension time as (time_key, day, day_of_week, month, quarter, year)
    define dimension item as (item_key, item_name, brand, type, supplier_type)
    define dimension branch as (branch_key, branch_name, branch_type)
    define dimension location as (location_key, street, city, province_or_state, country)

    define cube shipping [time, item, shipper, from_location, to_location]:
        dollars_cost = sum(cost_in_dollars), units_shipped = count(*)
    define dimension time as time in cube sales
    define dimension item as item in cube sales
    define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)
    define dimension from_location as location in cube sales
    define dimension to_location as location in cube sales

A define cube statement is used to define data cubes for sales and shipping, corresponding to the two fact tables of the schema of Example 3.3. Note that the time, item, and location dimensions of the sales cube are shared with the shipping cube. This is indicated for the time dimension, for example, as follows. Under the define cube statement for shipping, the statement “define dimension time as time in cube sales” is specified.
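One way to picture the sharing of dimensions between cubes is to store a reference instead of a second copy of the attribute list. The sketch below is only an illustration: the `cubes` dictionary layout and the helper functions are hypothetical, and a `(cube, dimension)` tuple stands for the DMQL phrase "as ⟨dimension⟩ in cube ⟨cube⟩".

```python
cubes = {
    "sales": {
        "time": ["time_key", "day", "day_of_week", "month", "quarter", "year"],
        "item": ["item_key", "item_name", "brand", "type", "supplier_type"],
        "branch": ["branch_key", "branch_name", "branch_type"],
        "location": ["location_key", "street", "city", "province_or_state", "country"],
    },
    "shipping": {
        "time": ("sales", "time"),               # time as time in cube sales
        "item": ("sales", "item"),               # item as item in cube sales
        "shipper": ["shipper_key", "shipper_name", "location_key", "shipper_type"],
        "from_location": ("sales", "location"),  # location in cube sales
        "to_location": ("sales", "location"),    # location in cube sales
    },
}


def resolve(cube_name, dim_name):
    """Follow references until the defining attribute list is reached."""
    definition = cubes[cube_name][dim_name]
    while isinstance(definition, tuple):
        cube_name, dim_name = definition
        definition = cubes[cube_name][dim_name]
    return cube_name, dim_name, definition


def shared_dimensions(cube_a, cube_b):
    """Names of the dimension tables that both cubes ultimately point to."""
    owners_a = {resolve(cube_a, d)[:2] for d in cubes[cube_a]}
    owners_b = {resolve(cube_b, d)[:2] for d in cubes[cube_b]}
    return sorted(name for _cube, name in owners_a & owners_b)


shared = shared_dimensions("sales", "shipping")
```

Both from_location and to_location resolve to the single location table of the sales cube, so the constellation holds only one copy of each shared dimension.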


The Process of Data Warehouse Design


A data warehouse can be built using a top-down approach, a bottom-up approach, or a combination of both. The top-down approach starts with the overall design and planning. It is useful in cases where the technology is mature and well known, and where the business problems that must be solved are clear and well understood. The bottom-up approach starts with experiments and prototypes. This is useful in the early stage of business modeling and technology development. It allows an organization to move forward at considerably less expense and to evaluate the benefits of the technology before making significant commitments. In the combined approach, an organization can exploit the planned and strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach.

From the software engineering point of view, the design and construction of a data warehouse may consist of the following steps: planning, requirements study, problem analysis, warehouse design, data integration and testing, and finally deployment of the data warehouse. Large software systems can be developed using two methodologies: the waterfall method or the spiral method. The waterfall method performs a structured and systematic analysis at each step before proceeding to the next, which is like a waterfall, falling from one step to the next. The spiral method involves the rapid generation of increasingly functional systems, with short intervals between successive releases. This is considered a good choice for data warehouse development, especially for data marts, because the turnaround time is short, modifications can be done quickly, and new designs and technologies can be adapted in a timely manner.

In general, the warehouse design process consists of the following steps:


1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account administration, sales, or the general ledger. If the business process is organizational and involves multiple complex object collections, a data warehouse model should be followed. However, if the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.

2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, for example, individual transactions, individual daily snapshots, and so on.

3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type, and status.

4. Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars_sold and units_sold.
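The choice of grain and of additive measures in the steps above can be illustrated with a short Python sketch. The `SaleFact` record, the `to_daily_snapshot` function, and the sample values are hypothetical, and the dimensions are reduced to day and item to keep the example small.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SaleFact:
    """One fact table record: two dimension values and two additive measures."""
    day: str
    item: str
    dollars_sold: float
    units_sold: int


def to_daily_snapshot(facts):
    """Roll transaction-grain facts up to one record per (day, item)."""
    totals = defaultdict(lambda: [0.0, 0])
    for fact in facts:
        totals[(fact.day, fact.item)][0] += fact.dollars_sold
        totals[(fact.day, fact.item)][1] += fact.units_sold
    return [SaleFact(day, item, dollars, units)
            for (day, item), (dollars, units) in sorted(totals.items())]


# Hypothetical facts at the individual-transaction grain.
transactions = [
    SaleFact("2020-01-01", "TV", 400.0, 1),
    SaleFact("2020-01-01", "TV", 800.0, 2),
    SaleFact("2020-01-02", "TV", 400.0, 1),
]
daily = to_daily_snapshot(transactions)
```

Because the measures are additive, the finer transaction grain can always be summed into daily snapshots, whereas individual transactions cannot be recovered from a table stored only at the daily grain.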


Because data warehouse construction is a difficult and long-term task, its implementation scope should be clearly defined. The goals of an initial data warehouse implementation should be specific, achievable, and measurable. This involves determining the time and budget allocations, the subset of the organization that is to be modeled, the number of data sources selected, and the number and types of departments to be served.

Once a data warehouse is designed and constructed, the initial deployment of the warehouse includes initial installation, roll-out planning, training, and orientation. Platform upgrades and maintenance must also be considered. Data warehouse administration includes data refreshment, data source synchronization, planning for disaster recovery, managing access control and security, managing data growth, managing database performance, and data warehouse enhancement and extension. Scope management includes controlling the number and range of queries, dimensions, and reports; limiting the size of the data warehouse; or limiting the schedule, budget, or resources.

Various kinds of data warehouse design tools are available. Data warehouse development tools provide functions to define and edit metadata repository contents (such as schemas, scripts, or rules), answer queries, output reports, and ship metadata to and from relational database system catalogues. Planning and analysis tools study the impact of schema changes and of refresh performance when changing refresh rates or time windows.


3.3.2 A Three-Tier Data Warehouse Architecture


Data warehouses often adopt a three-tier architecture, as presented in Figure 3.12.


Top tier (front-end tools): query/report, analysis, data mining
Middle tier (OLAP server): OLAP servers, whose output feeds the top tier
Bottom tier (data warehouse server): data warehouse, data marts, metadata repository, monitoring, administration
Data: operational databases and external sources, fed into the bottom tier through extract, clean, transform, load, and refresh

Figure 3.12 A three-tier data warehousing architecture.

1. The bottom tier is a warehouse database server that is almost always a relational database system. Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other external sources (such as customer profile information provided by external consultants). These tools and utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different sources into a unified format), as well as load and refresh functions to update the data warehouse (Section 3.3.3). The data are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server. Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores information about the data warehouse and its contents. The metadata repository is further described in Section 3.3.4.

2. The middle tier is an OLAP server that is typically implemented using either (1) a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or (2) a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations. OLAP servers are discussed in Section 3.3.5.

3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
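The three tiers can be sketched end to end in a few lines of Python. This is a toy illustration under stated assumptions, not a real warehouse: SQLite stands in for the warehouse database server, the two source formats and all values are invented, and the hypothetical `roll_up` function shows the ROLAP idea of mapping a multidimensional operation to a relational GROUP BY.

```python
import sqlite3

# Bottom tier: extract from two sources, clean, transform, and load.
source_a = [{"city": " vancouver ", "amount": "400.00"},
            {"city": "Victoria", "amount": "120.50"}]
source_b = [{"town": "VANCOUVER", "total_cents": 45000},
            {"town": None, "total_cents": 100}]


def clean_and_transform():
    """Merge similar data from both sources into one (city, dollars) format."""
    unified = []
    for row in source_a:
        unified.append((row["city"].strip().title(), float(row["amount"])))
    for row in source_b:
        if row["town"] is None:  # cleaning: drop records with no city
            continue
        unified.append((row["town"].strip().title(), row["total_cents"] / 100))
    return unified


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_fact (city TEXT, dollars_sold REAL)")
warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)", clean_and_transform())

# Middle tier, ROLAP style: a roll-up becomes a standard relational query.
ALLOWED_DIMENSIONS = {"city"}


def roll_up(dimension):
    if dimension not in ALLOWED_DIMENSIONS:  # never splice unchecked names into SQL
        raise ValueError(f"unknown dimension: {dimension}")
    sql = (f"SELECT {dimension}, SUM(dollars_sold) FROM sales_fact "
           f"GROUP BY {dimension} ORDER BY {dimension}")
    return warehouse.execute(sql).fetchall()


# Top tier: a front-end report built on the OLAP result.
report = [f"{city}: {dollars:.2f}" for city, dollars in roll_up("city")]
```

A MOLAP server would instead keep the aggregated cells in a multidimensional array structure and answer the same roll-up without generating SQL.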


From the architecture point of view, there are three data warehouse models: the enterprise warehouse, the data mart, and the virtual warehouse.



Enterprise warehouse: An enterprise warehouse collects all of the information about subjects spanning the entire organization. It provides corporate-wide data integration, usually from one or more operational systems or external information providers, and is cross-functional in scope. It typically contains detailed data as well as summarized data, and can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to design and build.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of users. The scope is confined to specific selected subjects. For example, a marketing data mart may confine its subjects to customer, item, and sales. The data contained in data marts tend to be summarized. Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX- or Windows-based. The implementation cycle of a data mart is more likely to be measured in weeks rather than months or years. However, it may involve complex integration in the long run if its design and planning were not enterprise-wide.

Depending on the source of data, data marts can be categorized as independent or dependent. Independent data marts are sourced from data captured from one or more operational systems or external information providers, or from data generated locally within a particular department or geographic area. Dependent data marts are sourced directly from enterprise data warehouses.

Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient query processing, only some of the possible summary views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.
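
As a rough illustration of the virtual warehouse idea, the sketch below defines one summary as a plain view over an operational table and materializes another as a snapshot table. It uses Python's built-in sqlite3 module; the orders table, the view and table names, and the sample rows are all assumptions made for the example. SQLite has no native materialized views, so the materialized summary is approximated here with CREATE TABLE ... AS SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational table.
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT,
        item     TEXT,
        amount   REAL
    );
    INSERT INTO orders VALUES
        (1, 'Ann', 'TV',    500.0),
        (2, 'Bob', 'phone', 300.0),
        (3, 'Ann', 'phone', 250.0);

    -- Virtual summary: recomputed from the operational table on every query.
    CREATE VIEW sales_by_customer AS
        SELECT customer, SUM(amount) AS total
        FROM orders GROUP BY customer;

    -- Materialized summary: a stored snapshot that is cheap to read
    -- but goes stale until it is rebuilt.
    CREATE TABLE sales_by_item AS
        SELECT item, SUM(amount) AS total
        FROM orders GROUP BY item;
""")

# A new operational transaction arrives after the summaries were defined.
conn.execute("INSERT INTO orders VALUES (4, 'Bob', 'TV', 450.0)")

view_rows = dict(conn.execute("SELECT customer, total FROM sales_by_customer"))
snapshot_rows = dict(conn.execute("SELECT item, total FROM sales_by_item"))
```

The view reflects the new order immediately because each query against it runs on the operational server, which is the extra load the text refers to. The snapshot avoids that load but still shows the totals from before the insert.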