Data Mining Functionalities

sentencehuddle, Data Management

Nov 20, 2013



Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks.

In general, data mining tasks are classified into two categories:

1. Descriptive mining
Find human-interpretable patterns that describe the data.

2. Predictive mining
Use some variables to predict unknown or future values of other variables.




Common data mining tasks



Concept/class description: characterization and discrimination


Mining frequent patterns, associations, and correlations


Classification and prediction


Cluster analysis


Outlier analysis


Evolution analysis


Concept/class description: characterization and discrimination

It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms.

Such descriptions of a class or a concept are called class/concept descriptions.

These can be derived via:

Data characterization, by summarizing the data of the class under study (the target class) in general terms, or

Data discrimination, by comparing the target class with one or a set of comparative classes (the contrasting classes).

Characterization and discrimination (cont.)

There are several methods for effective data summarization and characterization, such as the OLAP roll-up operation and the attribute-oriented induction technique.

The output of characterization can be presented in various forms, such as pie charts, bar charts, curves, multidimensional data cubes, multidimensional tables, rule forms, and crosstabs.
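As a rough illustration of summarizing data "in general terms", the sketch below rolls city-level records up to the country level. The records, field layout, and the `roll_up` helper are all hypothetical and are not taken from the slides; a real OLAP system would do this over a data cube.

```python
from collections import defaultdict

# Hypothetical sales records: (city, country, product_class, amount)
sales = [
    ("Chicago", "USA", "computer", 1200),
    ("New York", "USA", "computer", 900),
    ("Toronto", "Canada", "computer", 700),
    ("Chicago", "USA", "printer", 300),
    ("Vancouver", "Canada", "printer", 200),
]

def roll_up(records):
    """Summarize city-level records at the more general country level."""
    totals = defaultdict(int)
    for _city, country, product, amount in records:
        totals[(country, product)] += amount
    return dict(totals)

summary = roll_up(sales)
for (country, product), total in sorted(summary.items()):
    print(f"{country:8} {product:10} {total}")
```

The resulting table is the kind of summarized output that could then be shown as a crosstab or bar chart.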

Characterization and discrimination:
examples


Mining frequent patterns, associations, and correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data.

There are many kinds of frequent patterns:

Itemsets

Subsequences

Substructures


Mining frequent patterns, associations, and correlations (cont.)

Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.

A frequent itemset is a set of items that frequently occur together in the records.

e.g.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:

{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}
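To make the two discovered rules concrete, the sketch below computes how often each rule holds in the five transactions of the table. The helper names `count`, `support`, and `confidence` are illustrative choices, not part of the slides; support is taken as the fraction of transactions containing all items of the rule, and confidence as the fraction of transactions containing the left-hand side that also contain the right-hand side.

```python
# The five transactions from the table (TID 1 to 5).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also have the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    s = support(lhs | rhs, transactions)
    c = confidence(lhs, rhs, transactions)
    print(f"{sorted(lhs)} --> {sorted(rhs)}: support={s:.0%}, confidence={c:.0%}")
```

On this table, {Milk} --> {Coke} has support 60% and confidence 75%, while {Diaper, Milk} --> {Beer} has support 40% and confidence of about 67%.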


Mining frequent patterns, associations, and correlations (cont.)

A frequent subsequence is a frequently occurring sequential pattern, such as customers tending to purchase first a PC, followed by a digital camera, and then a memory card.

A substructure can refer to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences. If a substructure occurs frequently, it is called a frequent substructure.


Mining frequent patterns, associations, and correlations (cont.)

Example:

buys(x, "computer") => buys(x, "software") [support=1%, confidence=50%]

where x is a variable representing a customer and buys is a predicate.

Confidence means that if a customer buys a computer, there is a 50% chance that the customer will buy software as well.

Support means that 1% of all transactions under analysis showed that computer and software were purchased together.


Mining frequent patterns, associations, and correlations (cont.)

There are two types of association rules:

Single-dimensional association rules (having only one predicate, as in the buys example above)

Multidimensional association rules (having more than one predicate)

e.g.: age(x, "20-29") ^ income(x, "20k-29k") => buys(x, "CD player") [support=2%, confidence=60%]

Typically, association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
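The thresholding step can be sketched as follows. This is a deliberately naive enumeration over single-item rules A --> B, not an efficient miner such as Apriori; the function name `interesting_rules`, the small market-basket data, and the threshold values 0.6 and 0.75 are assumptions chosen only for illustration.

```python
from itertools import permutations

# Small market-basket data set used only for illustration.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def interesting_rules(transactions, min_support, min_confidence):
    """Return (A, B, support, confidence) for rules A --> B that pass both thresholds."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    rules = []
    for a, b in permutations(items, 2):
        n_a = sum(1 for t in transactions if a in t)
        n_ab = sum(1 for t in transactions if a in t and b in t)
        if n_a == 0:
            continue  # rule is undefined if the antecedent never occurs
        sup, conf = n_ab / n, n_ab / n_a
        # A rule is kept only if it satisfies BOTH thresholds.
        if sup >= min_support and conf >= min_confidence:
            rules.append((a, b, sup, conf))
    return rules

rules = interesting_rules(transactions, min_support=0.6, min_confidence=0.75)
for a, b, sup, conf in rules:
    print(f"{a} --> {b}: support={sup:.0%}, confidence={conf:.0%}")
```

Raising either threshold shrinks the rule set; with these particular values only rules involving Milk with Coke or Diaper survive.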


Classification and prediction

Given a collection of records (the training set):

Each record contains a set of attributes, one of which is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Classification Example 1

Classification Example 2 (cont.)

Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set:

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Diagram: the Training Set is used to Learn a Classifier (the Model), which is then applied to the Test Set]
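The train-then-apply workflow can be sketched with the records from the table. The slides do not say which classifier is used, so the sketch below substitutes a 1-nearest-neighbor rule as a stand-in; the mixed-attribute `distance` function, its income scaling by 100, and the 7/3 hold-out split are all assumptions made for illustration.

```python
# Labeled records from the table: (refund, marital_status, income_in_K, cheat)
labeled = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Unlabeled records from the table (class shown as "?").
unlabeled = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def distance(a, b):
    """Assumed mixed distance: categorical mismatches plus a scaled income gap."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100

def predict(train, record):
    """1-nearest-neighbor: return the class of the closest training record."""
    nearest = min(train, key=lambda row: distance(row, record))
    return nearest[3]

def accuracy(train, test):
    """Fraction of labeled test records whose class is predicted correctly."""
    correct = sum(1 for row in test if predict(train, row) == row[3])
    return correct / len(test)

# Hold out the last three labeled records to validate the model.
train, holdout = labeled[:7], labeled[7:]
print("hold-out accuracy:", accuracy(train, holdout))

# Apply the classifier to the previously unseen records.
for record in unlabeled:
    print(record, "->", predict(labeled, record))
```

With only ten labeled records the hold-out estimate is very noisy; the point is the separation between data used to build the model and data used to validate it.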


Cluster analysis

Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:

Data points in one cluster are more similar to one another.

Data points in separate clusters are less similar to one another.

Similarity measures:

Euclidean distance, if attributes are continuous.

Other problem-specific measures.
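For continuous attributes, the Euclidean distance between two points is the square root of the sum of squared attribute differences. A minimal sketch follows; the function name `euclidean` and the sample points are illustrative only.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A smaller distance means the two data points are more similar.
print(euclidean((0, 0), (3, 4)))
print(euclidean((1.0, 2.0, 3.0), (1.0, 2.0, 3.0)))
```

Attributes measured on very different scales usually need to be normalized first, otherwise the largest-valued attribute dominates the distance.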





Cluster analysis (cont.)

The objects are clustered or grouped based on the principle of maximizing the intraclass similarities and minimizing the interclass similarities.

Clustering also facilitates taxonomy formation, i.e., the organization of observations into a hierarchy of classes that group similar events together.

Illustrating Clustering

Euclidean distance based clustering in 3-D space.

Intracluster distances are minimized.

Intercluster distances are maximized.
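One common way to obtain clusters with small intracluster distances is k-means; the slides do not name an algorithm, so this is only one possible choice. The sketch below uses hypothetical 3-D points and hand-picked starting centroids.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centroids, max_iterations=20):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical 3-D data: two visibly separated groups.
points = [
    (1.0, 1.0, 1.0), (1.5, 2.0, 1.0), (2.0, 1.0, 1.5),
    (8.0, 8.0, 8.0), (9.0, 8.5, 8.0), (8.5, 9.0, 9.0),
]
centroids, clusters = kmeans(points, [points[0], points[3]])
for center, members in zip(centroids, clusters):
    print(center, members)
```

The result depends on the starting centroids; practical implementations typically run several random initializations and keep the best one.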


Outlier analysis

A database may contain data objects that do not comply with the general behavior or model of the data.

These objects are outliers.

In most cases, these are discarded as noise or exceptions.

The analysis of outlier data is referred to as outlier mining.


Outlier analysis (cont.)

Examples: outlier analysis is used mostly for credit card fraud detection and in network monitoring (for example, intrusion detection).
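A simple way to flag values that deviate from the general behavior of the data is the interquartile-range (IQR) rule, sketched below on hypothetical card transaction amounts. The 1.5 multiplier is a common convention, not something stated in the slides, and real fraud detection uses far richer models.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical purchase amounts for one card; one value is far from the rest.
amounts = [25, 40, 31, 18, 52, 37, 29, 44, 33, 2500]
outliers = iqr_outliers(amounts)
print(outliers)
```

Quartiles are used here instead of the mean and standard deviation because a single extreme value inflates both of those and can hide itself.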




Evolution analysis

Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

Examples:

Predicting sales amounts of a new product based on advertising expenditure.

Predicting wind velocities as a function of temperature, humidity, air pressure, etc.

Time series prediction of stock market indices.
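The first example, predicting sales from advertising expenditure, can be sketched as a least-squares straight-line fit. The data values below are hypothetical, and a straight line is only one of many possible trend models.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations: advertising spend and resulting sales (same units).
advertising = [1, 2, 3, 4, 5]
sales = [12, 15, 15, 19, 20]

slope, intercept = fit_line(advertising, sales)
forecast = slope * 6 + intercept  # predicted sales for a spend of 6
print(f"sales ~ {slope:.2f} * advertising + {intercept:.2f}; forecast at 6: {forecast:.2f}")
```

For time-ordered data such as stock indices, the same fit can be applied with the time step as x, although dedicated time-series models are usually more appropriate.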