A Data Mining Course for
Computer Science

Primary Sources and Implementations

Dave Musicant

Saturday, March 4, 2006


Overview


What is data mining?


Why offer a course in data mining?


Why focus on research papers in an
undergraduate class?


What topics do I cover?


What research papers do I use in class?


What assignments do I use?


Does it work?

What is data mining?


“The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al)



Data Mining and Machine Learning are two sides of the same coin


Data mining focuses more on larger datasets


Machine learning focuses more on connections with artificial
intelligence


... but there is much overlap in the two areas.



My course is titled “Machine Learning and Data Mining”


boosts student enthusiasm

Why offer a course in data mining?


Interesting applied area of CS that uses theoretical
techniques


Reinforces and introduces data structures and
algorithms


heaps, R-trees, graphs


Privacy and ethics


Personal ownership in assignments


Students choose datasets in areas that interest them


New field, yet accessible


Can be done with only Data Structures as a prereq


It’s my research area

Why research papers? Can it be done?


One approach to course is to use data mining software


Lopez & Ludwig, University of Minnesota-Morris


I wanted students to implement data mining algorithms


Textbook support w/ computer science focus is limited


(I use Margaret Dunham’s text as a side reference)


Primary sources provide a rich experience


With proper selection, papers are accessible to
undergraduates


Papers must be supplemented in classroom


e.g. specific topics in linear algebra, statistics


directs classroom activity toward filling gaps and interpreting
papers instead of parroting reading


Topics, Papers, Assignments


Each topic consists of one or more papers that are assigned to the students to read before class discussion.


Students post to Caucus (electronic message board):


something they didn’t understand, or something they found
interesting


potential exam question


Assignment follows class discussion



Detailed references for all papers and datasets can be
found in paper

Topic 0: What is Data Mining?


Paper: J. Friedman, “Data Mining and Statistics: What’s the Connection?”



Entertaining and controversial


Pokes fun at flaws on all sides


Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course)


Assignment: For the “census-income” dataset, determine:


Number of records and features


How many features are continuous, how many are nominal


For continuous features: average, median, minimum, maximum,
standard deviation


2-dimensional scatter plots of two features at a time


Interesting patterns
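
The per-feature statistics in this assignment can be sketched with Python's standard library. This is only an illustration: the helper names `is_continuous` and `summarize` and the toy `ages` column are hypothetical, and treating "every value parses as a number" as the test for a continuous feature is an assumption rather than a rule from the assignment. Scatter plots are not covered here.

```python
import statistics

def is_continuous(column):
    """Assumption: a feature is continuous if every value parses as a number."""
    try:
        for value in column:
            float(value)
        return True
    except ValueError:
        return False

def summarize(values):
    """Return the summary statistics listed in the assignment for one feature."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical toy column standing in for one continuous feature
ages = [25, 38, 28, 44, 18]
summary = summarize(ages)
```

A nominal feature would instead be summarized by counting how often each distinct value occurs.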


Topic 1: Classification and Regression


Example: First Trimester Screening


Use this training set to learn how to classify patients where diagnosis is not known:


The input data is often easily obtained, whereas the classification is not.

Training Set

Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
1            5             20           118          Positive
2            3             15           130          Negative
3            7             10           52           Negative
4            2             30           100          Positive

Testing Set

Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
101          4             16           95           ?
102          9             22           125          ?
103          1             14           80           ?

Input Data: tissue (cm), Chemical 1, Chemical 2
Classification: Diagnosis

Technique: Nearest Neighbor


Envision each example as a point in n-dimensional space


Classify test point same as nearest training point

tissue (cm)   Chemical 1   Diagnosis
5             20           Positive
3             15           Negative
7             10           Negative
2             30           Positive

Figure: scatter plot of tissue (cm) and Chemical 1 for the training points, with an unlabeled test point: “What am I?”
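
The nearest neighbor rule on this slide can be sketched in a few lines of Python. The training points are the four (tissue, Chemical 1) rows shown above. The function name `nearest_neighbor` is hypothetical, and the use of unscaled Euclidean distance is an assumption, since the slide does not say which distance is meant. The query points are patients 101 and 102 from the testing set, restricted to the same two features.

```python
import math

# (tissue (cm), Chemical 1) -> Diagnosis, taken from the training table
training = [
    ((5, 20), "Positive"),
    ((3, 15), "Negative"),
    ((7, 10), "Negative"),
    ((2, 30), "Positive"),
]

def nearest_neighbor(point, training):
    """Return the label of the training example closest to `point`."""
    # math.dist computes Euclidean distance (Python 3.8+)
    _, label = min(training, key=lambda example: math.dist(point, example[0]))
    return label

label_101 = nearest_neighbor((4, 16), training)  # patient 101
label_102 = nearest_neighbor((9, 22), training)  # patient 102
```

Because the features are on different scales, Chemical 1 dominates the distance here; rescaling the features first is a common refinement.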

Topic 1: Classification and Regression


Focus on scalable nearest neighbor algorithms


Paper: Roussopoulos et al, “Nearest Neighbor Queries”



How to do NN efficiently when data doesn’t fit in core


Requires R-trees (I cover in class)


Assignment: Code up the traditional k-nearest neighbor algorithm, apply to census-income data


Experiment with different distance metrics (1-norm, 2-norm, cosine)


Experiment with different values of k


Produce plots showing training and test set accuracies


Interpret results
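
A minimal sketch of what this assignment asks for, assuming numeric feature tuples and a majority vote among the k closest training examples. All names here are hypothetical, the toy data is made up, and defining the cosine metric as one minus the cosine similarity is an assumption. This is the brute-force version, not the R-tree method from the paper.

```python
import math
from collections import Counter

def dist_1norm(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def dist_2norm(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dist_cosine(a, b):
    """Assumed definition: 1 - cosine similarity (1.0 if either vector is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def knn_classify(point, training, k, distance):
    """Majority vote over the k training examples nearest to `point`."""
    neighbors = sorted(training, key=lambda ex: distance(point, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(examples, training, k, distance):
    """Fraction of `examples` whose predicted label matches the true label."""
    correct = sum(knn_classify(x, training, k, distance) == y for x, y in examples)
    return correct / len(examples)

# Hypothetical toy data: two well-separated groups
toy = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
       ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]
prediction = knn_classify((2, 2), toy, 3, dist_2norm)
training_accuracy = accuracy(toy, toy, 1, dist_1norm)
```

Calling `accuracy` once on the training set and once on a held-out test set, for each value of k and each metric, gives the numbers needed for the plots.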

Topic 2: Clustering


Sometimes referred to as unsupervised learning


Goal: find clusters of similar data


Less accurate than supervised learning, but quite useful
when no training set is available


Where are the clusters below? How many are there?

Figure: two scatter plots, one of chemical 1 and tissue (cm), one of chemical 2 and tissue (cm)

Topic 2: Clustering


Assignment: Find dataset of interest from UCI
Repository


iris plant, letter recognition, liver disorders, Pima Indians
diabetes, Congressional voting records, wine recognition, zoo


this dataset is used for most remaining assignments


if dataset has a class label, discard it for this assignment


Implement basic clustering algorithm (k-means)


Try varying number of clusters


Try two different techniques for initializing clusters


Report and interpret results found
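
A minimal sketch of k-means with two initialization techniques, assuming points are numeric tuples and squared Euclidean distance. The two initializers shown (pick k data points at random; average a random partition of the data) are common choices, not necessarily the ones used in the course, and every name and the toy data below are hypothetical.

```python
import random

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_random_points(data, k, rng):
    """Use k distinct data points as the initial centers."""
    return [tuple(p) for p in rng.sample(data, k)]

def init_random_partition(data, k, rng):
    """Shuffle, deal the points into k groups, and use the group means."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [mean_point(shuffled[i::k]) for i in range(k)]

def assign(data, centers):
    """Put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for p in data:
        j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        clusters[j].append(p)
    return clusters

def kmeans(data, k, init, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = init(data, k, rng)
    for _ in range(max_iter):
        clusters = assign(data, centers)
        # An empty cluster keeps its old center
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return centers, assign(data, centers)

# Hypothetical toy data: two well-separated groups
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers_a, clusters_a = kmeans(data, 2, init_random_points)
centers_b, clusters_b = kmeans(data, 2, init_random_partition)
```

Changing `k` and `seed` is enough to explore how the number of clusters and the starting centers affect the result.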

Topic 2: Clustering


Paper: Bradley et al, “Scaling Clustering Algorithms to Large Databases”



Describes “Scalable K-means” algorithm


Class discussion around “data mining desiderata”


Paper: Guha et al, “CURE: An Efficient Clustering Algorithm for Large Databases”



Agglomerative clustering algorithm


completely different approach


Requires use of a heap (as I pose the assignment)


Assignment: Implement stripped-down version of CURE


Run on dataset, interpret results
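
To illustrate how a heap drives agglomerative clustering, here is a generic sketch that repeatedly merges the two clusters whose centroids are closest. It is not CURE itself and not necessarily the stripped-down version used in the course: CURE keeps several representative points per cluster and shrinks them toward the centroid, and all of that is omitted here. The names and toy data are hypothetical.

```python
import heapq
import itertools

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(data, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = {i: [p] for i, p in enumerate(data)}  # cluster id -> points
    next_id = len(data)
    # Heap of (squared centroid distance, cluster id, cluster id)
    heap = [(sq_dist(data[i], data[j]), i, j)
            for i, j in itertools.combinations(range(len(data)), 2)]
    heapq.heapify(heap)
    while len(clusters) > k and heap:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(i) + clusters.pop(j)
        merged_centroid = centroid(merged)
        # Push distances from the new cluster to every surviving cluster
        for other_id, points in clusters.items():
            d = sq_dist(merged_centroid, centroid(points))
            heapq.heappush(heap, (d, other_id, next_id))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

# Hypothetical toy data: two well-separated groups
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
result = agglomerate(data, 2)
```

Stale heap entries are skipped when popped rather than removed eagerly, which keeps the code short at the cost of a larger heap.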

Topic 3: Association Rules


“Supermarket basket analysis”


What items do people tend to buy together at the same time?


Paper: Agrawal et al, “Fast Algorithms for Mining Association Rules”



presents classic Apriori algorithm (skim other portions of paper)


Assignment: Implement Apriori algorithm and run it on own dataset
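
The frequent-itemset phase of Apriori can be sketched as follows. This is a simplified illustration, not the paper's implementation: the join step here unions any two frequent (k-1)-itemsets instead of using the paper's prefix-based join, there is no hash tree, and rule generation is left out. Treating `min_support` as an absolute transaction count is an assumption, and the basket data is made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count individual items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count support by scanning the transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

# Hypothetical supermarket baskets
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

The prune step is what makes the algorithm practical: a candidate is never counted if any of its subsets has already been found infrequent.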

Topic 4: Web Mining


How does Google rank importance of web pages?


Every page has a PageRank


PageRank of a page is determined by the PageRank of the
pages that link to it


manifests itself as an eigenvalue problem


Paper: Page et al, “The PageRank Citation Ranking: Bringing Order to the Web”



describes basic version of Google PageRank algorithm


cover eigenvalues in class


exposure to linear algebra, numerical analysis
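
The idea that a page's rank comes from the ranks of the pages linking to it can be sketched as a power iteration over a dictionary of links. This is one common formulation, not necessarily the exact variant in the paper or the course: the damping factor of 0.85, the even redistribution of rank from pages with no out-links, and the tiny link graph are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread evenly over all pages
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its rank along every out-link
        for p, targets in links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical four-page site
links = {
    "home": ["about", "courses"],
    "about": ["home"],
    "courses": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
```

The dictionary-of-lists representation stores only the links that exist, so the work per iteration is proportional to the number of links rather than the square of the number of pages.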

Topic 4: Web Mining


Paper: Chakrabarti et al, “Mining the Link Structure of the World Wide Web”



describes HITS algorithm for ranking web pages


Google isn’t the only way to do it


uses Latent Semantic Analysis, which requires singular value
decomposition (cover in class)


Assignment: Implement PageRank algorithm


try it on archive of department website


crawling for an assignment is dangerous


sparse data representation


hashing or other form of map for efficiency


interpret results


Figure: hubs and authorities
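
For comparison with PageRank, the hub and authority scores of HITS can be sketched as a pair of mutually reinforcing updates: a good authority is linked to by good hubs, and a good hub links to good authorities. This is a bare-bones illustration that runs over a whole link dictionary and skips the step of first selecting a query-specific subgraph. The function name, the fixed iteration count, and the toy graph are assumptions.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score: sum of authority scores of the pages linked to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical graph: three hub-like pages pointing at two authority-like pages
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_scores, auth_scores = hits(links)
```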

Topic 5: Collaborative Filtering


a.k.a. Recommender Systems


“I like Pink Floyd, Dream Theater, and Evanescence. Who should
I be listening to?”


Amazon.com, Yahoo! Launchcast


Paper: Breese et al, “Empirical Analysis of Predictive Algorithms for Collaborative Filtering”



Algorithms are nearest-neighbor-like in flavor


Involve averaging numerical scores


Need to normalize for individual biases


Students already working on final project, so no
assignment
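
The three points about the algorithms above (neighbor-like weighting, averaging scores, normalizing for individual biases) can be sketched as a correlation-weighted prediction. This follows the general memory-based form: the active user's mean rating plus a weighted average of how far other users' ratings sit from their own means. The details are assumptions: Pearson correlation over co-rated items as the weight, normalization by the sum of absolute weights, and the ratings themselves, which are invented for the example.

```python
import math

def mean_rating(ratings):
    return sum(ratings.values()) / len(ratings)

def pearson(a, b):
    """Correlation of two users' ratings over the items both have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a, mean_b = mean_rating(a), mean_rating(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common) *
                    sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, others, item):
    """Predict the active user's rating of `item` from the other users."""
    weighted, total = 0.0, 0.0
    for other in others:
        if item not in other:
            continue
        w = pearson(active, other)
        # Subtracting each user's own mean corrects for individual bias
        weighted += w * (other[item] - mean_rating(other))
        total += abs(w)
    base = mean_rating(active)
    return base if total == 0 else base + weighted / total

# Hypothetical ratings on a 1 to 5 scale
alice = {"Pink Floyd": 5, "Dream Theater": 4, "Evanescence": 1}
bob = {"Pink Floyd": 5, "Dream Theater": 4, "Evanescence": 2, "Rush": 5}
carol = {"Pink Floyd": 1, "Dream Theater": 2, "Evanescence": 5, "Rush": 1}
predicted = predict(alice, [bob, carol], "Rush")
```

A user with opposite tastes still contributes: a negative weight turns that user's below-average rating into evidence that the active user will like the item.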




Topic 6: Ethical Issues in Data Mining


Privacy concerns


Good vs. evil uses of data mining


Video: Ramakrishnan et al, “Data Mining: Good, Bad, or Just a Tool?”



Panel discussion from KDD 2004


Before watching video, students post to Caucus:


how data mining could be exploited


how this could be prevented (if possible)


After watching video


followup commentary


Pictures from conference website at http://www.acm.org/sigs/sigkdd/kdd2004/

Topic 6: Ethical Issues in Data Mining


Students’ response to video was more engaged than I expected


More problems than solutions are raised in video


Students were frustrated that solutions weren’t clear


Many students interested in issue of accountability


If someone’s privacy is violated, who is responsible?


“Who do I sue?”


Lively class discussion

Final Project


“Do almost anything you want regarding data mining, so
long as I approve it”


Find a paper and implement the algorithm within


Find a dataset of interest and study it completely, using
Weka and/or their own code from throughout the term


Quantitative association rules


Poker association rules


Collaborative filtering (music, art)


Attack KDD Cup problems


KDD Cup 2005: identify categories for web search queries


tried this once: tended to be too big for them in the time that I had


could perhaps be done with right level of support

Conclusions


Papers are most memorable part of course


Students speak very positively about this in evaluations


Significant prep time for me to fill in gaps


Caucus motivates reading papers


Students find this a pain, but are thankful afterwards in evals


Important to set deadline for posting a few hours before class so I
have time to read


Programming assignments work (mostly) well


Allow students to work in pairs if they wish


Grading is difficult: unspecified details in algorithms, differing
datasets


All materials available on my website at
http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05