CS 563 FALL Term KNOWLEDGE DISCOVERY AND DATA MINING


Dr. Christos Nikolopoulos

Office: BR 197

(309) 677-2456

chris@bradley.edu

Class web site: http://hilltop.bradley.edu/~chris/ and on Sakai

Office hours: T and TH 11:45-12:30 and by appointment






Required Textbook:

Witten I. and Frank E., DATA MINING: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, 2008

Optional References:

1. Cios et al., DATA MINING: A Knowledge Discovery Approach, Springer, 2007.

2. Chris Nikolopoulos, Expert Systems: An Introduction to First, Second Generation and Hybrid Knowledge Based Systems, Marcel Dekker, 1997.

Description:

Advances in Knowledge Discovery and Data Mining bring together the latest research in the areas of statistics, databases, machine learning, and artificial intelligence, which together contribute to the rapidly growing field of knowledge discovery and data mining. Topics covered include fundamental issues, knowledge representation, cleaning and preprocessing of data sets, classification and clustering, machine learning algorithms, comparing machine learning algorithms and models, and evaluating performance.

The complementary topic of Data Warehousing and OLAP is covered in the class CS 572, Advanced Databases.

Learning Outcomes


Upon successful completion of the course students will be able to:

- approach data mining as a process.

- understand the mathematical and statistical foundations of the machine learning algorithms involved and be able to provide a clear and concise description of testing and benchmarking experiments.

- possess a toolbox of techniques that can be immediately applied to real world knowledge discovery problems, for clustering, estimation, prediction, and classification, including algorithms for k-means clustering, classification and regression trees, the C4.5 algorithm, logistic regression, k-nearest neighbor, multiple regression, and neural networks.

- be proficient in at least one leading data mining software package, for example WEKA.

- reason as to which method needs to be applied in a given situation depending on the specific application domain and additional requirements.

- evaluate and compare different models using various statistical techniques, for example Bernoulli trials and statistics such as the Kappa statistic, confusion matrix, RMS error, etc.
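As an illustration of the evaluation statistics named in the last outcome, the following is a minimal Python sketch, not part of the course materials. It computes a confusion matrix, the Kappa statistic, and the RMS error from lists of actual and predicted values. The function names and the sample labels are hypothetical, and WEKA reports these figures directly, so this only shows the arithmetic behind them.

```python
from collections import Counter
from math import sqrt


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def kappa_statistic(matrix):
    """Agreement between actual and predicted classes, corrected for chance.

    Undefined (division by zero) when chance agreement equals 1.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: products of matching row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    return (observed - expected) / (1 - expected)


def rms_error(actual, predicted):
    """Root mean squared error for numeric predictions."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical sample data, for illustration only.
actual = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
predicted = ["yes", "yes", "no", "no", "no", "yes", "yes", "no"]

matrix = confusion_matrix(actual, predicted, ["yes", "no"])
kappa = kappa_statistic(matrix)
rmse = rms_error([10.0, 20.0], [12.0, 18.0])
```

With this sample, six of eight instances are classified correctly (observed agreement 0.75) while chance agreement is 0.5, so the Kappa statistic is 0.5.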



Time Schedule:

The data set to be analyzed for the final project must be chosen by November 1st. The final project written report has to be written in a research paper format (abstract, introduction, main sections, conclusions, bibliography in PLA format) and is due by email by December 13th, 11:00 a.m.



The table below gives the reading assignments from the books and online sources. The dates are tentative and may be adjusted.


# Date | Topics for online discussion | Readings/Assignments
Day 1 | What is DM/KD? Overview of what we are going to cover. |
Day 2 | Review of Statistics | Notes emailed to you
Day 3 | Review of Statistics and Introduction to Machine Learning tools and techniques | Witten Part I, Chapter 1, pp. 4-39
Day 4 | Introduction to Machine Learning tools and techniques | Witten Part I, Chapter 1, pp. 4-39
Day 5 | Input: Concepts, instances and attributes | Witten Part I, Chapter 2, pp. 41-60
Day 6 | Output: Knowledge representation | Witten Part I, Chapter 3, pp. 61-82
Day 7 | Output: Knowledge representation | Witten Part I, Chapter 3, pp. 61-82
Day 8 | Machine Learning: the basic methods | Witten Part I, Chapter 4, pp. 83-111, sections 4.1-4.4. Watch video 1
Day 9 | Machine Learning: the basic methods | Witten Part I, Chapter 4, pp. 83-111, sections 4.1-4.4. Watch video 1
Day 10 | Machine Learning: the basic methods | Witten Part I, Chapter 4, pp. 112-139, sections 4.5-4.9. Watch video 2
Day 11 | Machine Learning: the basic methods | Witten Part I, Chapter 4, pp. 112-139, sections 4.5-4.9. Watch video 2
Day 12 | Machine Learning: the basic methods | Witten Part I, Chapter 4, pp. 112-139, sections 4.5-4.9. Watch video 2
Day 13 | The WEKA machine learning workbench | Witten Part II, Chapter 9, pp. 365-368 and Chapter 10, pp. 369-401
Day 14 | NO CLASS - Bradley on Fall Recess |
Tuesday, October 22nd | MIDTERM EXAM | Test is on Witten's chapters 1, 2, 3, and 4
Day 16 | Discuss exam/answers |
Day 17 | The WEKA machine learning workbench | Witten Part II, Chapter 10, pp. 401-423
Day 18 | Evaluating the discovered knowledge | Witten Part I, Chapter 5, pp. 143-160. Watch video 3
Day 19 | Evaluating the discovered knowledge | Witten Part I, Chapter 5, pp. 160-183. Watch video 4
Day 20 | Decide on a data set to use for Final Project (could use the University of California Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/) - send email to instructor to notify him of which data set you chose | Report which data set you chose and discuss data sets in class
Day 21 | Engineering the input and output, attribute selection, discretizing, automatic data cleansing | Witten Part I, Chapter 7, pp. 285-341
Day 22 | Engineering the input and output, attribute selection, discretizing, automatic data cleansing | Witten Part I, Chapter 7, pp. 285-341
Day 23 | Details on Decision trees, classification rules, extending linear models, neural nets | Witten Part I, Chapter 6, pp. 187-235, sections 6.1-6.3. Watch video 5
Day 24 | Details on Decision trees, classification rules, extending linear models, neural nets | Witten Part I, Chapter 6, pp. 187-235, sections 6.1-6.3. Watch video 5
Day 25 | Details on Decision trees, classification rules, extending linear models, neural nets | Witten Part I, Chapter 6, pp. 187-235, sections 6.1-6.3. Watch video 5
Day 26 | Instance-based learning, numeric prediction, clustering, Bayesian networks | Witten Part I, Chapter 6, pp. 235-283
Day 27 | NO CLASS - Thanksgiving Break |
Day 28 | Instance-based learning, numeric prediction, clustering, Bayesian networks | Witten Part I, Chapter 6, pp. 235-283
Day 29 | Instance-based learning, numeric prediction, clustering, Bayesian networks | Witten Part I, Chapter 6, pp. 235-283
Day 30 | Review, Project report is due | Project report is due by email by 5:00 p.m.
Friday, December 13th | FINAL EXAM 12:00-2:00 | Comprehensive but primarily Witten's chapters 5, 6 and 7



Assessment

400 Points Total

25% Midterm Exam
25% Final Data Mining project report
25% homework assignments
25% Final Exam



Some Videos on DM/KD to watch:

Video 1: IIT lecture 1: http://www.bing.com/videos/watch/video/lecture-34-data-mining-and-knowledge-discovery/1d0668894dc732fe82b91d0668894dc732fe82b9-83872645560

Video 2: IIT lecture 2: http://www.bing.com/videos/watch/video/lecture-35-data-mining-and-knowledge-discovery-part-ii/f2c1c8cfcc5e319417f6f2c1c8cfcc5e319417f6-29437526744

Video 3: DM and KD: http://videolectures.net/mps07_lavrac_dmkd/

Video 4: Data Mining at NASA: http://videolectures.net/kdd09_srivastava_dmnasata/

Video 5: SQL Know How Video: http://www.microsoft.com/showcase/en/us/details/38b7e057-42d2-4a8c-b4d2-3154bc35d87a

More: http://videolectures.net/Top/Computer_Science/Data_Mining/


Data Mining Project:

The project can be worked on either as an individual project or as a team project (teams of at most two members). The project is open ended and involves applying WEKA to analyze a data set. Which algorithms to use and which are the most appropriate, how to clean the data, etc. is entirely up to you. Statistical analysis is to be performed to compare models and find the most appropriate and accurate model. To be of enough complexity, the data set should contain both numeric and nominal values and also holes (missing values). To find a data set for your project, a possible source is the Machine Learning Repository at the University of California Irvine's ML site: http://archive.ics.uci.edu/ml/.

The data mining software you will use for your project is WEKA (see Witten's book and the download link in the main class page).
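To check whether a candidate data set meets the complexity requirement above, a quick scan of its ARFF file (WEKA's native format) can help. The following is a minimal Python sketch, not part of the course materials. It assumes simple ARFF text with unquoted attribute names and no commas inside values. The function name and the sample data are hypothetical.

```python
def summarize_arff(text):
    """Count numeric attributes, nominal attributes, and missing values."""
    counts = {"numeric": 0, "nominal": 0, "missing": 0}
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and ARFF comments, which start with '%'.
        if not line or line.startswith("%"):
            continue
        if in_data:
            # ARFF marks a missing value with a question mark.
            counts["missing"] += sum(
                1 for value in line.split(",") if value.strip() == "?"
            )
        elif line.lower().startswith("@attribute"):
            # Assumed shape: @attribute <name> <type>, with an unquoted name.
            parts = line.split(None, 2)
            if len(parts) < 3:
                continue
            kind = parts[2].strip()
            if kind.startswith("{"):
                counts["nominal"] += 1  # nominal values are listed in braces
            elif kind.lower() in ("numeric", "real", "integer"):
                counts["numeric"] += 1
        elif line.lower().startswith("@data"):
            in_data = True
    return counts


# Hypothetical sample data, for illustration only.
SAMPLE = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,?,yes
rainy,70,?
"""

summary = summarize_arff(SAMPLE)
# Complex enough only if all three counts are non-zero.
complex_enough = all(summary.values())
```

A data set with a zero in any of the three counts would not satisfy the requirement as stated.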