

17 Oct 2013


Course “Data Mining”


Vladimir Panov - HSE



Short description


This course is suitable for those who are interested in processing data with Data Mining techniques and in the effective use of statistical software. The current version of the course is provided for the STATISTICA software (developed by StatSoft, Inc.).



The course is divided into 3 parts. In the first part, we consider data preparation for efficient statistical analysis and discuss the main aspects of the Data Mining methodology. The second part is devoted to studying different methods for solving two popular statistical problems known as the classification and regression tasks. In the third part, we turn to other tasks such as clustering and dimension reduction.




Duration of the course: 32 academic hours.


Plan

1. Introduction to Data Mining, data preparation and preliminary remarks



General concepts of Data Mining and their implementation in the STATISTICA software.



Overview of the problems that can be solved by applying Data Mining techniques.



Data import and export, interaction with databases.



Preliminary data treatment (data cleaning) and data transformations: missing data, outliers, sparse data, duplicate values, incorrect values.



Descriptive statistics and preliminary data analysis, concept of the Drill down tool.



Visualization of the input data, interactive analysis of the plots.



Selection of the most important factors, the Feature selection tool.



Search for regularities in data, concepts of Link analysis and Association rules.



Analysis of the division into categories, the Weights of evidence tool.
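As a rough illustration of the preliminary data treatment topics listed above, the following is a minimal sketch in Python rather than in STATISTICA, which the course itself uses. The function name `clean`, the choice of median imputation, and the 1.5 * IQR outlier rule are assumptions made for this example only; they are not taken from the course materials.

```python
from statistics import median, quantiles

def clean(values):
    """Return (cleaned, outliers) for a list of numbers that may contain None.

    Hypothetical helper for illustration: imputes missing values with the
    median, then separates outliers using the 1.5 * IQR rule.
    """
    # Missing data: replace None with the median of the observed values.
    observed = [v for v in values if v is not None]
    med = median(observed)
    filled = [med if v is None else v for v in values]

    # Outliers: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, _, q3 = quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in filled if v < low or v > high]
    cleaned = [v for v in filled if low <= v <= high]
    return cleaned, outliers

# Example: one missing entry and one suspicious value.
cleaned, outliers = clean([10, 12, None, 11, 13, 12, 500])
```

Other imputation and outlier rules are equally reasonable; the point is only that missing values are handled before outlier limits are computed.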


2. Classification and regression tasks



Formulation of the problem, key concepts and definitions.



Concept of Classification and regression trees: graphical representation, analysis of the importance of predictors, general methodology, quality parameters, division into training and control samples, cross-validation methods.



Other methods for building classification and regression trees: Generalized CHAID models, boosted trees, random forests.



Support vector machines (SVM), notion of the optimal hyperplane.



Probabilistic approach to solving the classification task, naive Bayes models.



Nonparametric regression, Generalized additive models (GAM).



Spline approach to solving regression problems, Multivariate adaptive regression splines (MARS).



Comparison of different models with the Goodness of fit tool, visual analysis of the lift and gain charts.





Combining different models, ensemble learning (boosting and bagging).



Application of the models to new data, the Rapid Deployment tool, “vote” between models.



Classical methods of regression analysis: multivariate and logistic regression, variable selection, Akaike information criterion.



Multivariate normal distribution, Fisher's discriminant analysis.



Analysis of censored data, survival analysis.
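The division into training and control samples and the cross-validation idea mentioned in this part can be sketched as follows. This is a minimal, standard-library Python illustration, not the STATISTICA workflow used in the course. The one-split "stump" classifier stands in for a full classification tree, and all names (`k_fold_indices`, `cross_validate`, `fit_stump`, `predict_stump`) are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_stump(xs, ys):
    """One-level tree on 1-D data: choose the midpoint threshold with the
    fewest training errors (label 1 is predicted when x > threshold)."""
    pts = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return min(candidates,
               key=lambda t: sum((x > t) != bool(y) for x, y in zip(xs, ys)))

def predict_stump(threshold, x):
    return int(x > threshold)

def cross_validate(xs, ys, fit, predict, k=5):
    """Average accuracy over k folds; each fold is the control sample once."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        hits = sum(predict(model, xs[i]) == ys[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / k

# Two well-separated groups of one-dimensional observations.
xs = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
accuracy = cross_validate(xs, ys, fit_stump, predict_stump, k=5)
```

Because the model is refitted on each training subset and scored only on the held-out fold, the averaged accuracy estimates performance on unseen data rather than on the data used for fitting.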



3. Other tasks and methods for data analysis



Cluster analysis: formulation of the problem, key concepts and definitions, k-means clustering, tree clustering, two-way joining, and the EM algorithm.



Dimension reduction: formulation of the problem, curse of dimensionality, principal component analysis, multidimensional scaling, factor analysis, and independent component analysis.



Neural networks: methodology of the neural networks approach, structures of the networks, optimal choice of complexity and architecture.



Automation of data analysis, creation of automated reports, the Data Miner Workspace and Data Miner Recipes tools.
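To give a flavour of the k-means clustering topic named in this part, here is a minimal sketch of Lloyd's algorithm in standard-library Python. It is an illustration only and does not reproduce the STATISTICA implementation. The function name `k_means`, the fixed iteration count, and the explicitly supplied starting centers are assumptions of this example.

```python
from math import dist

def k_means(points, centers, iterations=10):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else old
            for cl, old in zip(clusters, centers)
        ]
    return centers, clusters

# Two visually obvious groups in the plane, with one starting center in each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = k_means(points, centers=[(0, 0), (10, 10)])
```

In practice the result depends on the starting centers, which is why implementations usually offer several initialization strategies or repeated runs.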



Literature

1. Duda, R., Hart, P., and Stork, D. Pattern classification. John Wiley, 2001.

2. Hastie, T.J., Tibshirani, R., and Friedman, J. The elements of statistical learning: Data Mining, inference and prediction. Springer, 2001.

3. Härdle, W. and Simar, L. Applied multivariate statistical analysis. Springer, 2012.

4. Hyvärinen, A., Karhunen, J., and Oja, E. Independent component analysis. John Wiley & Sons, 2001.

5. Nisbet, R., Elder, J., and Miner, G. Handbook of statistical analysis and Data Mining applications. Elsevier, 2009.

6. Wasserman, L. All of nonparametric statistics. Springer, 2007.