CS-470: Data Mining

Nov 20, 2013 (3 years and 8 months ago)

CS-470: Data Mining

Fall 2009

Organizational Details

Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215

Instructor: Dr. Igor Aizenberg
Office: Science and Technology Building, 104C
Phone: (903) 334 6654
e-mail: igor.aizenberg@tamut.edu

Office hours:
Monday, Wednesday 10am-6pm
Tuesday 11am-3pm

Class Web Page: http://www.eagle.tamut.edu/faculty/igor/CS-470.htm

Text Book

R. J. Roiger, M. W. Geatz, Data Mining: A Tutorial-Based Primer, Addison Wesley, 2003, ISBN 0-201-74128-8

Control

Exams (open book, open notes):
Exam 1: October 6, 2009
Exam 2: November 10, 2009
Exam 3: December 8, 2009

Homework

Grading

Grading Method
Homework and preparation: 10%
Exam 1: 30%
Exam 2: 30%
Exam 3: 30%

Grading Scale:
90%+  A
80%+  B
70%+  C
60%+  D
less than 60%  F

Data Mining: A First View

Data Mining: A Definition

The process of employing one or more machine learning techniques to automatically analyze and extract knowledge from data.

The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.

What Is Data Mining?

Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.

Machine learning and data mining are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories … in particular to assist scientific discovery.

Why Data Mining?

Potential Applications

Database analysis and decision support
  Market analysis and management
    target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  Risk analysis and management
    forecasting, customer retention, improved underwriting, quality control, competitive analysis
  Fraud detection and management

Other Applications
  Text mining (news group, email, documents) and Web analysis
  Intelligent query answering
  Medical decision support

Market Analysis and Management (1)

Where are the data sources for analysis?
  Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
  Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
  Conversion of a single to a joint bank account: marriage, etc.
Cross-market analysis
  Associations/co-relations between product sales
  Prediction based on the association information
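
The "associations between product sales" idea can be made concrete with a small sketch. The baskets and item names below are hypothetical illustration data, not from the course, and the two measures shown (support and confidence) are one common way to quantify an association, offered here only as an assumption about what the slide has in mind.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical (alphabetical) form.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets holding `antecedent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

support = pair_support(baskets)
print(support[("bread", "milk")])            # share of baskets with both items
print(confidence(baskets, "bread", "milk"))  # share of bread baskets that also hold milk
```

A pair with high support and high confidence is a candidate for cross-selling; whether it is actually useful still has to be judged by an analyst.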

Market Analysis and Financial Time Series Prediction


Market Analysis and Management (2)

Customer profiling
  data mining can tell you what types of customers buy what products (clustering or classification)
Identifying customer requirements
  identifying the best products for different customers
  use prediction to find what factors will attract new customers
Provides summary information
  various multidimensional summary reports
  statistical summary information (data central tendency and variation)
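
As a minimal sketch of "statistical summary information", the snippet below computes two measures of central tendency and one of variation with Python's standard `statistics` module. The purchase amounts are made-up values chosen for illustration.

```python
import statistics

# Hypothetical purchase amounts for five customers (illustrative values).
purchases = [20.0, 35.0, 35.0, 50.0, 110.0]

summary = {
    "mean": statistics.mean(purchases),      # central tendency
    "median": statistics.median(purchases),  # central tendency, robust to outliers
    "stdev": statistics.stdev(purchases),    # variation (sample standard deviation)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The gap between the mean and the median here comes from the single large purchase, which is the kind of detail a summary report is meant to expose.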

Corporate Analysis and Risk Management

Finance planning and asset evaluation
  cash flow analysis and prediction
  contingent claim analysis to evaluate assets
  cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning:
  summarize and compare the resources and spending
Competition:
  monitor competitors and market directions
  group customers into classes and a class-based pricing procedure
  set pricing strategy in a highly competitive market

Fraud Detection and Management (1)

Applications
  widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
Approach
  use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
Examples
  auto insurance: detect a group of people who stage accidents to collect on insurance
  money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  medical insurance: detect professional patients, rings of doctors, and rings of references

Fraud Detection and Management (2)

Detecting inappropriate medical treatment
  Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
Detecting telephone fraud
  Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
Retail
  Analysts estimate that 38% of retail shrink is due to dishonest employees.
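
One simple way to "analyze patterns that deviate from an expected norm" in the telephone call model is to compare new call durations with a caller's history. The sketch below uses a z-score test; the function name, the threshold of 3 standard deviations, and the call data are all assumptions made for illustration, not the method used in the cases cited above.

```python
import statistics

def flag_unusual_calls(history_minutes, new_calls_minutes, threshold=3.0):
    """Return the new call durations that lie far from the historical norm."""
    mean = statistics.mean(history_minutes)
    stdev = statistics.stdev(history_minutes)
    flagged = []
    for minutes in new_calls_minutes:
        # With no variation in the history, no z-score can be computed,
        # so nothing is flagged in that case.
        z = (minutes - mean) / stdev if stdev > 0 else 0.0
        if abs(z) > threshold:
            flagged.append(minutes)
    return flagged

# Hypothetical call durations in minutes.
history = [3, 5, 4, 6, 5, 4, 3, 5]
print(flag_unusual_calls(history, [4, 6, 95]))
```

A real model would also use the destination and the time of day or week, as the slide notes; duration alone is only the simplest case.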

Other Applications

Sports
  IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
Astronomy
  JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
Internet Web Surf-Aid
  IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Induction-based Learning

The process of forming general concept definitions by observing specific examples of concepts to be learned.

Four Levels of Learning

Facts
Concepts
Procedures
Principles

Facts

A fact is a simple statement of truth.

Concepts

A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.

Procedures

A procedure is a step-by-step course of action to achieve a goal.

Principles

Principles are general truths or laws that are basic to other truths.

What Can Computers Learn?

Computers & Learning

Computers are good at learning concepts. Concepts are the output of a data mining session.

Three Concept Views

Classical View
Probabilistic View
Exemplar View

Classical View

All concepts have definite defining properties.

Probabilistic View

People store and recall concepts as generalizations created by observations.

Exemplar View

People store and recall likely concept exemplars that are used to classify unknown instances.

Methods of Learning

Supervised Learning

Build a learner model using data instances of known origin.

Use the model to determine the outcome of new instances of unknown origin.



Supervised Learning: A Decision Tree Example

Decision Tree

A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

[Figure: decision tree for the data in Table 1.1]

Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy

Table 1.2 Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?

Production Rules

IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat

IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold

IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy
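
The three production rules translate directly into code. The sketch below encodes them in a function (the name `diagnose` and the tuple layout are assumptions for illustration), checks them against the Fever and Swollen Glands columns of Table 1.1, and then applies them to the unclassified patients of Table 1.2.

```python
def diagnose(swollen_glands, fever):
    """Apply the three production rules; both inputs are 'Yes' or 'No'."""
    if swollen_glands == "Yes":
        return "Strep throat"
    if fever == "Yes":
        return "Cold"
    return "Allergy"

# (patient id, fever, swollen glands, known diagnosis), copied from Table 1.1.
training = [
    (1, "Yes", "Yes", "Strep throat"),
    (2, "No", "No", "Allergy"),
    (3, "Yes", "No", "Cold"),
    (4, "No", "Yes", "Strep throat"),
    (5, "Yes", "No", "Cold"),
    (6, "No", "No", "Allergy"),
    (7, "No", "Yes", "Strep throat"),
    (8, "No", "No", "Allergy"),
    (9, "Yes", "No", "Cold"),
    (10, "Yes", "No", "Cold"),
]

# (patient id, fever, swollen glands), copied from Table 1.2.
unknown = [(11, "No", "Yes"), (12, "Yes", "No"), (13, "No", "No")]

# The rules should reproduce every known diagnosis in the training data.
consistent = all(diagnose(sg, fever) == dx for _, fever, sg, dx in training)

# Classify the instances whose diagnosis is unknown.
predictions = {pid: diagnose(sg, fever) for pid, fever, sg in unknown}

print(consistent)
print(predictions)
```

Note that Sore Throat, Congestion, and Headache never appear in the rules: for this training set, two attributes are enough to separate the three diagnoses.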

Unsupervised Clustering

A data mining method that builds models from data without predefined classes.

The “Acme Investors” Dataset of customers maintaining a brokerage account

The “Acme Investors” Dataset

Table 1.3 Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K

The “Acme Investors” Dataset & Supervised Learning

1. Can I develop a general profile of an online investor?
2. Can I determine if a new customer is likely to open a margin account?
3. Can I build a model to predict the average number of trades per month for a new investor?
4. What characteristics differentiate female and male investors?

The “Acme Investors” Dataset & Supervised Learning

1. Can I develop a general profile of an online investor?
   output attribute: transaction method
2. Can I determine if a new customer is likely to open a margin account?
   output attribute: margin account
3. Can I build a model to predict the average number of trades per month for a new investor?
   output attribute: trades/month
4. What characteristics differentiate female and male investors?
   output attribute: sex
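
Choosing an output attribute amounts to splitting each instance into inputs and a target before any model is built. A minimal sketch follows, using two rows of Table 1.3; the dictionary keys and the helper name `split_instance` are hypothetical, chosen for illustration.

```python
# Two rows of Table 1.3 as dictionaries (key names are an assumption).
customers = [
    {"id": 1005, "account_type": "Joint", "margin_account": "No",
     "transaction_method": "Online", "trades_per_month": 12.5, "sex": "F",
     "age": "30-39", "recreation": "Tennis", "income": "40-59K"},
    {"id": 2110, "account_type": "Individual", "margin_account": "Yes",
     "transaction_method": "Broker", "trades_per_month": 22.3, "sex": "M",
     "age": "30-39", "recreation": "Fishing", "income": "40-59K"},
]

def split_instance(row, output_attribute, ignore=("id",)):
    """Separate one instance into its input attributes and its output value."""
    inputs = {
        key: value
        for key, value in row.items()
        if key != output_attribute and key not in ignore
    }
    return inputs, row[output_attribute]

# Question 2: is a new customer likely to open a margin account?
inputs, target = split_instance(customers[0], "margin_account")
print(sorted(inputs))
print(target)
```

The customer ID is dropped from the inputs because an identifier carries no information that generalizes to new customers.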

Alternative: The “Acme Investors” Dataset & Unsupervised Clustering


The “Acme Investors” Dataset & Unsupervised Clustering

1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?

Clustering

Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups (clusters).

Clustering: Two Approaches

A clustering algorithm requires us to provide an initial best estimate about the total number of clusters in the data (supervised).

A clustering algorithm uses some method in an attempt to determine a best number of clusters (unsupervised).
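
The first approach, where the analyst supplies the number of clusters, can be illustrated with a basic k-means loop. The slides do not name a specific algorithm, so k-means is an assumption here, as are the function name and the starting centers. The data are the Trades/Month values from Table 1.3.

```python
def kmeans_1d(values, k, iterations=20):
    """Cluster numbers into k groups with a basic k-means loop."""
    ordered = sorted(values)
    # Assumption: start from k values spread evenly across the sorted data.
    if k == 1:
        centers = [ordered[0]]
    else:
        centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its cluster,
        # keeping the old center if the cluster is empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

trades_per_month = [12.5, 0.5, 3.6, 22.3, 5.0]  # Trades/Month, Table 1.3
centers, clusters = kmeans_1d(trades_per_month, k=2)
print(centers)
print(clusters)
```

With k=2 the customers separate into infrequent and frequent traders. A different k gives a different segmentation, which is why the second approach tries to estimate the number of clusters from the data.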

Classification

Classification deals with discrete outcomes: yes or no; big or small; strange or not strange; yellow, green or red; etc.

Estimation is often used to perform a classification task: estimating the number of children in a family; estimating a family’s total household income; etc.

Neural networks and regression models are the best tools for classification/estimation.

Prediction

Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value.

Any of the techniques used for classification and estimation can be adapted for use in prediction.

Classification and Prediction: Implementation

To implement both classification and prediction, we should use training examples in which the value of the variable to be predicted, or the class membership of the instance to be classified, is already known.

Is Data Mining Appropriate for My Problem?

Will Data Mining help me?

Can we clearly define the problem?

Do potentially meaningful data exist?

Do the data contain hidden knowledge, or are the data useful for reporting purposes only?

Will the cost of processing the data be less than the likely increase in profit seen by applying any potential knowledge gained from the data mining?

Data Mining or Data Query?

Shallow Knowledge
Multidimensional Knowledge
Hidden Knowledge
Deep Knowledge

Shallow Knowledge

Shallow knowledge is factual. It can be easily stored and manipulated in a database.

Multidimensional Knowledge

Multidimensional knowledge is also factual. On-line Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.

Hidden Knowledge

Hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease.

Deep Knowledge

Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining or Data Query?

Shallow Knowledge can be extracted by a database query language like SQL.

Multidimensional Knowledge can be extracted by On-line Analytical Processing (OLAP) tools.

Hidden Knowledge represents patterns and regularities in data that cannot be easily found.

Deep Knowledge can be found if we are given some direction about what we are looking for.
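
Shallow knowledge is what a plain query returns when we already know what to ask. The sketch below loads three columns of Table 1.3 into an in-memory SQLite database and retrieves the online investors; the table and column names are assumptions made for illustration.

```python
import sqlite3

# In-memory database holding three columns of Table 1.3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER, transaction_method TEXT, trades_per_month REAL)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1005, "Online", 12.5),
        (1013, "Broker", 0.5),
        (1245, "Online", 3.6),
        (2110, "Broker", 22.3),
        (1001, "Online", 5.0),
    ],
)

# Shallow knowledge: a fact retrieved by asking for it directly.
rows = conn.execute(
    "SELECT id FROM customers WHERE transaction_method = ? ORDER BY id",
    ("Online",),
)
online_ids = [row[0] for row in rows]
conn.close()

print(online_ids)
```

The query answers "who trades online?" but says nothing about why, or about which new customers are likely to; questions of that kind are where data mining comes in.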

Data Mining vs. Data Query:

Use data query if you already know, or almost know, what you are looking for.

Use data mining to find regularities in data that are not obvious.

A Simple Data Mining Process Model

Knowledge Discovery in Databases (KDD)

The application of the scientific method to data mining. Data mining is one step of the KDD process.

Data Mining: A KDD Process

Data mining: the core of the knowledge discovery process.

[Figure: KDD process. Labels: Databases, Data Cleaning, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

The Data Warehouse

The data warehouse is a historical database designed for decision support.

A Simple Data Mining Process Model

[Figure labels: Operational Database, SQL Queries, Data Warehouse, Data Mining, Interpretation & Evaluation, Result Application]

1. Assemble a collection of data to analyze
2. Present these data to a data mining tool
3. Interpret the results
4. Apply the results to a new problem or situation