Data Mining


Data Mining Techniques


Data Mining

Xuequn Shang

Northwestern Polytechnical University

September 2006


About the Course


Time


Tue. 7:00 pm ~9:00 pm


Fri. 7:00 pm~9:00 pm


Location


Room XA107 West building


Instructor


Xuequn Shang, Ph.D.


shang@nwpu.edu.cn


Mini Survey


How many people have taken a database course before?


How many people have taken a statistics course?


How many people have taken a machine learning course before?


Textbook and Reference


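Textbook



Data Mining: Concepts and Techniques, Jiawei Han and Micheline Kamber, Morgan Kaufmann, 2001.



Chinese edition: 数据挖掘概念与技术 (Data Mining: Concepts and Techniques), translated by Fan Ming, Meng Xiaofeng, et al., China Machine Press, August 2001.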



References


Principles of Data Mining (Adaptive Computation and
Machine Learning), David J. Hand, Heikki Mannila,
Padhraic Smyth, MIT Press, 2001


Many research papers


Course Introduction


Data that has relevance for managerial decisions is accumulating at
an incredible rate due to a host of technological advances.


Electronic data capture has become inexpensive and ubiquitous as a by-product of innovations such as the internet, e-commerce, electronic banking, point-of-sale devices, bar-code readers, and intelligent machines.


Such data is often stored in data warehouses and data marts
specifically intended for management decision support.


Data mining is a rapidly growing field that is concerned with
developing techniques to assist managers to make intelligent use of
these repositories.


Application areas include credit rating, fraud detection, database marketing, customer relationship management, and stock market investments.


This course will examine methods that have emerged from both fields and proven to be of value in recognizing patterns and making predictions from an applications perspective. We will survey applications and provide an opportunity for hands-on experimentation with algorithms for data mining using easy-to-use software and cases.


Course Objective


To provide an introduction to knowledge discovery in
databases and complex data repositories, and to present
basic concepts relevant to real data mining applications,
as well as reveal important research issues germane to
the knowledge discovery domain and advanced mining
applications.


Students will understand the fundamental concepts underlying knowledge discovery in databases and gain hands-on experience with implementation of some data mining algorithms applied to real-world cases.


Evaluation


Assignments (2) 20%


Class participation 10%


Project 20%


Final Exam 50%




Quality of presentation + quality of report
+ quality of demos




About the Project


Implement and experimentally evaluate
the major method in the paper (60%)


If possible, improve the method in
effectiveness or efficiency, implement and
experimentally evaluate your improvement


Write a technical report (40%)


Contents


Introduction to Data Mining


Association analysis


Sequential Pattern Mining


Classification and prediction


Data Clustering


Data preprocessing


Advanced topics


Course Schedule (1)

Date     Time               Session      Topic
Sep-19   7:00 pm-9:00 pm    Session 1    Welcome and introduction
Sep-22   7:00 pm-9:00 pm    Session 2    Association rule mining
Sep-26                      Session 3
Sep-29                      Session 4    Sequential pattern mining
Oct-10                      Session 5    Classification
Oct-13                      Session 6


Course Schedule (2)

Date     Time               Session      Topic
Oct-17                      Session 7    Data clustering
Oct-20                      Session 8    Data preprocessing
Oct-24                      Session 9
Oct-27                      Session 10   Advanced topics
Oct-31                      Session 11
Nov-3                       Session 12   Seminar


Course Schedule (3)

Date     Time               Session      Topic
Nov-7                       Session 13   Examination
Nov-10                      Session 14


Useful Information


How to get a paper online?


DBLP


A good index for good papers


CiteSeer


Just google it


Send requests to the authors


Conferences and Journals on Data Mining


KDD, PAKDD, ICDM, DaWaK, PKDD, etc.


DMKD, TKDE, ACM Trans. on KDD, etc.


Additional Hints


Be a good citizen


Be a good graduate student


Be a good scientist


There are three chief ethical problems: fraud, plagiarism, and duplicate or simultaneous submissions


There are four basic considerations in technical ethics: honesty, justice, respect for others’ work, and respect for copyrights held by others.


Introduction


Why data mining?


What is data mining?


What kinds of data can be mined?


Are all the patterns interesting?


Data mining functionality


Major issues in data mining


Why Data Mining?



Changes in the Business Environment


Customers becoming more demanding


Markets are saturated


Databases today are huge:


More than 1,000,000 entities/records/rows


From 10 to 10,000 fields/attributes/variables


Gigabytes and terabytes


Databases are growing at an unprecedented rate


Decisions must be made rapidly


Decisions must be made with maximum knowledge



We are drowning in data, but starving for knowledge!




Necessity is the mother of invention


Data mining: automated analysis of massive data sets


Why Data Mining?


“The key in business is to know something that nobody else knows.” (Aristotle Onassis)


“To understand is to perceive patterns.” (Sir Isaiah Berlin)



What Is Data Mining?


Mining data: extracting or “mining” knowledge from large amounts of data


Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]



Applications


Data analysis and decision support


Market analysis and management


Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation


Risk analysis and management


Forecasting, customer retention, improved underwriting, quality control, competitive analysis


Fraud detection and detection of unusual patterns (outliers)


Other applications


Text mining (newsgroups, email, documents) and Web mining


Stream data mining


Bioinformatics and bio-data analysis


Ex. 1: Market Analysis and Management


Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies


Target marketing


Find clusters of “model” customers who share the same characteristics: interests, income level, spending habits, etc.


Determine customer purchasing patterns over time


Cross-market analysis: find associations/correlations between product sales, and predict based on such associations


Customer profiling: what types of customers buy what products (clustering or classification)


Customer requirement analysis


Identify the best products for different customers


Predict what factors will attract new customers


Provision of summary information


Multidimensional summary reports


Statistical summary information (data central tendency and variation)


Ex. 2: Corporate Analysis & Risk Management


Finance planning and asset evaluation


cash flow analysis and prediction


contingent claim analysis to evaluate assets


cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)


Resource planning


summarize and compare the resources and spending


Competition


monitor competitors and market directions


group customers into classes and apply a class-based pricing procedure


set pricing strategy in a highly competitive market


Ex. 3: Fraud Detection & Mining Unusual Patterns


Approaches: Clustering & model construction for frauds, outlier analysis


Applications: Health care, retail, credit card service, telecomm.


Auto insurance: ring of collisions


Money laundering: suspicious monetary transactions


Medical insurance


Professional patients, ring of doctors, and ring of references


Unnecessary or correlated screening tests


Telecommunications: phone-call fraud


Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm


Retail industry


Analysts estimate that 38% of retail shrink is due to dishonest employees


Anti-terrorism
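
The phone-call model above flags calls that deviate from an expected norm. A minimal sketch of that idea follows, assuming a simple z-score rule on call duration only. The function name `flag_unusual_calls`, the threshold of 3 standard deviations, and the sample durations are all hypothetical; a realistic fraud model would also use destination and time of day or week.

```python
from statistics import mean, stdev

def flag_unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that lie more than `threshold`
    sample standard deviations away from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical calls are identical: anything different is unusual.
        return [d for d in new_calls if d != mu]
    return [d for d in new_calls if abs(d - mu) / sigma > threshold]

# Hypothetical call durations in minutes for one subscriber.
history = [4, 5, 6, 5, 4, 6, 5, 5]
new_calls = [5, 7, 95]

unusual = flag_unusual_calls(history, new_calls)
print(unusual)
```

With this data only the 95-minute call is flagged; the 7-minute call stays within three standard deviations of the norm.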


The KDD Process


Data mining: the core of the knowledge discovery process

[Figure: the KDD process. Databases are cleaned and integrated into a data warehouse; task-relevant data is selected from it, mined, and the resulting patterns are evaluated]



KDD Process Steps


Preprocessing


Data cleaning


Data integration


Data selection


Data transformation


Data mining


Pattern evaluation


Knowledge presentation
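
The KDD steps listed on this slide can be sketched as a chain of small functions. This is only an illustration under stated assumptions: every function name and the two toy "stores" are hypothetical, and the mining step is a plain frequency count standing in for a real algorithm.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Data transformation: normalize text values to lower case."""
    return [{k: v.lower() if isinstance(v, str) else v for k, v in r.items()}
            for r in records]

def mine(records, field):
    """Data mining: count value frequencies (stand-in for a real algorithm)."""
    return Counter(r[field] for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only sufficiently frequent patterns."""
    return {value: n for value, n in patterns.items() if n >= min_count}

# Hypothetical data from two sources.
store_a = [{"customer": "c1", "item": "Beer", "price": 2.0},
           {"customer": "c2", "item": None, "price": 1.0}]
store_b = [{"customer": "c3", "item": "beer", "price": 2.5},
           {"customer": "c4", "item": "Milk", "price": 1.2}]

records = integrate(clean(store_a), clean(store_b))
records = transform(select(records, ["item"]))
knowledge = evaluate(mine(records, "item"), min_count=2)

# Knowledge presentation: report the surviving patterns.
for item, count in knowledge.items():
    print(f"{item}: bought {count} times")
```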


Confluence of Multiple Disciplines


[Figure: data mining at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines]


Classification Schemes


General functionality


Descriptive data mining


Predictive data mining


Different views lead to different classifications


Data view: Kinds of data to be mined


Knowledge view: Kinds of knowledge to be discovered


Method view: Kinds of techniques utilized


Application view: Kinds of applications adapted


What Kind of Data?


Database-oriented data sets and applications


Relational database, data warehouse, transactional database


Advanced data sets and advanced applications


Data streams and sensor data


Time-series data, temporal data, sequence data (incl. bio-sequences)


Structured data, graphs, social networks and multi-linked data


Object-relational databases


Heterogeneous databases and legacy databases


Spatial data and spatiotemporal data


Multimedia database


Text databases


The World-Wide Web



Relational Databases


Structured data


Table, records, attributes


Indexes & SQL


Online transactional processing (OLTP)


Insert a student “Jennet” into class CMPT 741, fall 2005


Online analytical processing (OLAP)


Find the average class size of CMPT 700 level courses in the last 3 years, grouped by semesters
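
The two example requests on this slide can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `enrollment` table, its columns, and every student other than Jennet are hypothetical, and the "last 3 years" filter is left out to keep the query short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enrollment (student TEXT, course TEXT, semester TEXT)")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [
        ("Ann", "CMPT 741", "Fall 2005"),
        ("Bob", "CMPT 741", "Fall 2005"),
        ("Cid", "CMPT 726", "Fall 2005"),
        ("Dee", "CMPT 741", "Spring 2006"),
    ],
)

# OLTP: a small transaction that touches a single record.
conn.execute(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    ("Jennet", "CMPT 741", "Fall 2005"),
)

# OLAP: an aggregate over many records, grouped by semester.
olap_query = """
SELECT semester, AVG(class_size)
FROM (
    SELECT semester, course, COUNT(*) AS class_size
    FROM enrollment
    WHERE course LIKE 'CMPT 7%'
    GROUP BY semester, course
)
GROUP BY semester
ORDER BY semester
"""
avg_sizes = dict(conn.execute(olap_query).fetchall())
print(avg_sizes)
```

The OLTP statement changes one row, while the OLAP query scans and summarizes the whole table, which is the contrast the slide draws.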



Data Warehouses


A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision making process [Inmon]

[Figure: data is cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools]



A Multi-dimensional Database

[Figure: a data cube with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), with cells numbered 1 to 64]


Transactional Databases

TID     Itemset
T100    Milk, bread, beer, diaper
T200    Beer, cook, fish, potato, orange, apple


What kinds of product combinations do customers like to buy together?
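
One simple way to approach the question above is to count how often each pair of items appears in the same transaction. The sketch below is a brute-force illustration, not a scalable association-mining algorithm. T100 and T200 come from the slide; T300 and T400, the function name `frequent_pairs`, and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

transactions = {
    "T100": {"milk", "bread", "beer", "diaper"},
    "T200": {"beer", "cook", "fish", "potato", "orange", "apple"},
    "T300": {"beer", "diaper", "bread"},   # hypothetical
    "T400": {"milk", "diaper", "beer"},    # hypothetical
}

def frequent_pairs(transactions, min_count):
    """Count every item pair that occurs together in a transaction
    and keep the pairs seen at least `min_count` times."""
    counts = Counter()
    for items in transactions.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = frequent_pairs(transactions, min_count=3)
print(pairs)
```

With these four transactions, beer and diaper are the only pair bought together at least three times.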



Spatial Databases


Spatial information


Geographic databases (map)


VLSI chip design databases


Satellite image databases


Spatial patterns


What are the changes in the forest over the last 10 years?


Find clusters of homes with kids aged 5-10



Time Series Data


A sequence of values that change over time


The sequence of stock prices sampled every 5 minutes


The daily temperature


Typical operations


Similarity search


Trend analysis
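
The two typical operations named on this slide can be illustrated with very small helpers. This is a sketch that assumes a moving average as the trend estimate and Euclidean distance as the similarity measure; both are common choices rather than the only ones, and the function names and temperature values are hypothetical.

```python
def moving_average(series, window):
    """Trend analysis: smooth a series with a sliding-window mean."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def euclidean_distance(a, b):
    """Similarity search: smaller distance means more similar series."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical daily temperatures.
temperatures = [20.0, 22.0, 21.0, 23.0, 25.0]

trend = moving_average(temperatures, window=3)
print(trend)
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```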



Semi-Structured Data


HTML web documents


XML documents


Digital libraries


Annotated multimedia databases


Image, audio and video data



Biological Data


Bio-sequences


DNA, gene, protein: very long sequences


Micro-array data


Medical documents and images


Typically very noisy


Data cleaning and integration are challenging



What Can Be Discovered?


What can be discovered depends upon the data mining task employed.


Descriptive DM tasks


Characterize general properties of the data


Predictive DM tasks


Perform inference on the available data in order to make predictions


What Kinds of Patterns?


Association rules and sequential patterns


Classification


Clustering


Outlier analysis


Other data mining tasks


Are All the “Discovered” Patterns Interesting?


Data mining may generate thousands or even millions of patterns; not all of them are interesting


What makes a pattern interesting?


Can a data mining system generate all of the
interesting patterns?


Can a data mining system generate only interesting
patterns?


What makes a pattern interesting?


Interestingness measures


A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm


Objective vs. subjective interestingness measures


Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.


Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, etc.
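
Support and confidence, the two objective measures named above, have standard definitions for a rule X → Y: support is the fraction of transactions that contain both X and Y, and confidence is support(X and Y) divided by support(X). The sketch below computes both; the function names and the four sample transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated probability of `consequent` given `antecedent`."""
    antecedent_support = support(transactions, antecedent)
    if antecedent_support == 0:
        raise ValueError("antecedent never occurs in the transactions")
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / antecedent_support

# Hypothetical market-basket data.
baskets = [
    {"beer", "diaper", "milk"},
    {"beer", "diaper"},
    {"beer", "bread"},
    {"milk", "bread"},
]

rule_support = support(baskets, {"beer", "diaper"})
rule_confidence = confidence(baskets, {"beer"}, {"diaper"})
print(rule_support, rule_confidence)
```

Here the rule beer → diaper has support 0.5 (2 of 4 baskets) and confidence 2/3 (2 of the 3 baskets that contain beer).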


Find All Interesting Patterns?


Find all the interesting patterns: Completeness


Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?


Heuristic vs. exhaustive search


Association vs. classification vs. clustering


Find Only Interesting Patterns?


Search for only interesting patterns: An optimization problem


Can a data mining system find only the interesting patterns?


Approaches


First generate all the patterns and then filter out the uninteresting ones


Generate only the interesting patterns: mining query optimization



Research Issues in Data Mining


Effectiveness


Efficiency


Applications


Theory



Effectiveness


What kind of patterns to mine?


Propose interesting data mining problems


How to identify interesting patterns


Interestingness measures


Useful constraints


Visualization and interaction


Presentation of mining results


Interactive, adaptive mining



Efficiency


Develop fast data mining algorithms


Identify effective heuristics for mining


Theoretical and/or empirical justification


Systematic implementation


Parallel, distributed, and incremental mining


Integration to product systems


Data mining module in DBMS and data warehouses



Applications


Handle noisy or incomplete data


Incorporate background knowledge


Application/domain-oriented solutions


Vertical solutions



Foundation for Data Mining


Knowledge representation


Data mining algebra and language


Integration of multiple mining tasks/DBMS


Open for new data/knowledge


Interaction and visualization


Data mining query optimization


Common construct


Automatic optimization by construct rewriting


Major Issues in Data Mining


Mining methodology


Mining different kinds of knowledge from diverse data types, e.g., bio, stream,
Web


Performance: efficiency, effectiveness, and scalability


Pattern evaluation: the interestingness problem


Incorporation of background knowledge


Handling noise and incomplete data


Parallel, distributed and incremental mining methods


Integration of the discovered knowledge with existing knowledge: knowledge fusion


User interaction


Data mining query languages and ad-hoc mining


Expression and visualization of data mining results


Interactive mining of knowledge at multiple levels of abstraction


Applications and social impacts


Domain-specific data mining & invisible data mining


Protection of data security, integrity, and privacy


A Brief History of Data Mining Society


1989 IJCAI Workshop on Knowledge Discovery in Databases


Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)


1991-1994 Workshops on Knowledge Discovery in Databases


Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)


1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)


Journal of Data Mining and Knowledge Discovery (1997)


ACM SIGKDD conferences since 1998 and SIGKDD Explorations


More conferences on data mining


PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.


ACM Transactions on KDD starting in 2007


Summary


Data mining: Discovering interesting patterns from large amounts of
data


A natural evolution of database technology, in great demand, with
wide applications


A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation


Mining can be performed in a variety of information repositories


Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.


Data mining systems and architectures


Major issues in data mining


Assignment (

)


What is data mining?


Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing from areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.


Assignment (

)


Define each of the following data mining functionalities: association and correlation analysis, classification, prediction, clustering, and evolution analysis. Give an example of each data mining functionality, using a real-life database with which you are familiar.


Association analysis: showing attribute-value conditions that occur frequently in a given set of data


Classification: finding a set of models that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown


Clustering analysis: analyzing data objects without consulting a known class label


Outlier analysis: finding data objects that do not comply with the general behavior or model of the data


Evolution analysis: describing and modeling regularities or trends for objects whose behavior changes over time
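
As a concrete instance of the classification definition above (predicting the class of an object whose label is unknown), here is a minimal sketch that assumes a one-nearest-neighbor rule, which is only one of many classification methods. The function name, the features (age, purchases per month), and the labels are hypothetical.

```python
def classify_nearest_neighbor(training, point):
    """Predict the label of `point` as the label of the closest
    labeled training example (squared Euclidean distance)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(training,
                   key=lambda example: squared_distance(example[0], point))
    return label

# Hypothetical labeled customers: (age, purchases per month) -> class.
training = [
    ((25, 2), "occasional"),
    ((30, 3), "occasional"),
    ((45, 12), "frequent"),
    ((50, 15), "frequent"),
]

print(classify_nearest_neighbor(training, (28, 4)))
print(classify_nearest_neighbor(training, (48, 13)))
```

Clustering differs in that the `training` list would carry no labels at all; the groups themselves would have to be discovered.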



Complement (

)


A student asked me what the difference between data mining and information retrieval is


There is really no clear difference


Actually, some recent information retrieval systems do discover associations between words and paragraphs



Complement (

)


What is the difference between data mining (DM) and pattern recognition (PR)?


Both aim to find useful relations


In PR, we typically deal with data sets of moderate size, while in a typical DM application, we are concerned with data sets that are large in terms of dimension and number of clusters


PR is an important technique used in DM

Data mining involves an integration of techniques from multiple disciplines



Architecture: Typical Data Mining System

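[Figure: layered architecture of a typical data mining system. Components: Graphical User Interface, Pattern Evaluation, Data Mining Engine, Knowledge Base, and Database or Data Warehouse Server, which draws on the Database, Data Warehouse, World-Wide Web, and other information repositories through data cleaning, integration, and selection]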


Thank you!