A Data Mining Course for
Computer Science
Primary Sources and Implementations
Dave Musicant
Saturday, March 4, 2006
Overview
What is data mining?
Why offer a course in data mining?
Why focus on research papers in an
undergraduate class?
What topics do I cover?
What research papers do I use in class?
What assignments do I use?
Does it work?
What is data mining?
“The non-trivial discovery of novel, valid, comprehensible
and potentially useful patterns from data” (Fayyad et al)
Data Mining and Machine Learning are two sides of the same coin
Data mining focuses more on larger datasets
Machine learning focuses more on connections with artificial
intelligence
... but there is much overlap in the two areas.
My course is titled “Machine Learning and Data Mining”
boosts student enthusiasm
Why offer a course in data mining?
Interesting applied area of CS that uses theoretical
techniques
Reinforces and introduces data structures and
algorithms
heaps, R-trees, graphs
Privacy and ethics
Personal ownership in assignments
Students choose datasets in areas that interest them
New field, yet accessible
Can be done with only Data Structures as a prereq
It’s my research area
Why research papers? Can it be done?
One approach to course is to use data mining software
Lopez & Ludwig, University of Minnesota-Morris
I wanted students to implement data mining algorithms
Textbook support w/ computer science focus is limited
(I use Margaret Dunham’s text as a side reference)
Primary sources provide a rich experience
With proper selection, papers are accessible to
undergraduates
Papers must be supplemented in classroom
e.g. specific topics in linear algebra, statistics
directs classroom activity toward filling gaps and interpreting
papers instead of parroting reading
Topics, Papers, Assignments
Each topic consists of one or more papers that are
assigned to the students to read before class discussion.
Students post to Caucus (electronic message board):
something they didn’t understand, or something they found
interesting
potential exam question
Assignment follows class discussion
Detailed references for all papers and datasets can be
found in paper
Topic 0: What is Data Mining?
Paper: J. Friedman. “Data Mining and Statistics: What’s
the Connection?”
Entertaining and controversial
Pokes fun at flaws on all sides
Helps to ensure buy-in from computer science students (they
haven’t been tricked into taking a stats course)
Assignment: For the “census-income” dataset, determine:
Number of records and features
How many features are continuous, how many are nominal
For continuous features: average, median, minimum, maximum,
standard deviation
2-dimensional scatter plots of two features at a time
Interesting patterns
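The per-feature statistics in this assignment can be sketched in a few lines. This is only an illustrative sketch: the helper names `summarize` and `is_continuous` and the toy records are hypothetical, and the rule "a column is continuous if every entry parses as a number" is an assumption rather than part of the assignment.

```python
import statistics

def summarize(values):
    """Summary statistics for one continuous feature."""
    return {
        "average": statistics.mean(values),
        "median": statistics.median(values),
        "minimum": min(values),
        "maximum": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def is_continuous(column):
    """Assumption: a column is continuous if every entry parses as a number."""
    try:
        for entry in column:
            float(entry)
        return True
    except ValueError:
        return False

# Hypothetical toy records (not taken from the census-income data).
records = [
    ("39", "Bachelors"),
    ("50", "HS-grad"),
    ("38", "Masters"),
    ("53", "HS-grad"),
]
columns = list(zip(*records))  # one tuple per feature
num_records, num_features = len(records), len(columns)
first_feature = [float(v) for v in columns[0]]
stats = summarize(first_feature)
```

Nominal features would instead be summarized by counting distinct values, for example with `collections.Counter`.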
Topic 1: Classification and Regression
Example: First Trimester Screening
Use this training set to learn how to classify patients
where diagnosis is not known:
The input data is often easily obtained, whereas the
classification is not.
Training Set (input data: tissue, Chemical 1, Chemical 2; classification: Diagnosis)

Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
1            5             20           118          Positive
2            3             15           130          Negative
3            7             10           52           Negative
4            2             30           100          Positive

Testing Set

Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
101          4             16           95           ?
102          9             22           125          ?
103          1             14           80           ?
Technique: Nearest Neighbor
Envision each example as a point in n-dimensional space
Classify test point same as nearest training point
tissue (cm)   Chemical 1   Diagnosis
5             20           Positive
3             15           Negative
7             10           Negative
2             30           Positive
[Scatter plot of the four training points, tissue (cm) against Chemical 1, with an unlabeled test point marked “What am I?”]
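The nearest neighbor rule on this slide can be sketched as follows, using the four training points shown. The choice of plain Euclidean distance on unscaled features is an assumption, as is the function name `nearest_neighbor`; the test point is patient 101 from the testing set, restricted to the two features plotted here.

```python
import math

# Training points from the slide: (tissue in cm, Chemical 1) -> diagnosis.
training = [
    ((5, 20), "Positive"),
    ((3, 15), "Negative"),
    ((7, 10), "Negative"),
    ((2, 30), "Positive"),
]

def nearest_neighbor(point, examples):
    """Return the label of the training example closest to `point`."""
    _, label = min(examples, key=lambda ex: math.dist(point, ex[0]))
    return label

# Patient 101: tissue 4 cm, Chemical 1 = 16.
prediction = nearest_neighbor((4, 16), training)  # "Negative": closest to (3, 15)
```

Because Chemical 1 spans a much wider range than tissue, it dominates the unscaled distance; rescaling features first is a common alternative.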
Topic 1: Classification and Regression
Focus on scalable nearest neighbor algorithms
Paper: Roussopoulos et al, “Nearest Neighbor Queries”
How to do NN efficiently when data doesn’t fit in core
Requires R-trees (I cover in class)
Assignment: Code up the traditional k-nearest neighbor
algorithm, apply to census-income data
Experiment with different distance metrics (1-norm, 2-norm,
cosine)
Experiment with different values of k
Produce plots showing training and test set accuracies
Interpret results
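A minimal sketch of the brute-force k-nearest neighbor experiment described in this assignment, with the three distance metrics as interchangeable functions. All function names and the toy data are hypothetical, and majority voting with ties broken in favor of the nearer neighbor is an assumption rather than something the assignment specifies.

```python
import math
from collections import Counter

def dist_1norm(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def dist_2norm(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dist_cosine(a, b):
    # One minus cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_classify(point, examples, k, distance):
    """Majority vote among the k training examples closest to `point`."""
    neighbors = sorted(examples, key=lambda ex: distance(point, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(eval_set, train_set, k, distance):
    """Fraction of eval_set points whose predicted label matches the true one."""
    correct = sum(knn_classify(x, train_set, k, distance) == y for x, y in eval_set)
    return correct / len(eval_set)

# Hypothetical toy data standing in for the census-income records.
train = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((1.0, 0.5), "a"),
         ((8.0, 8.0), "b"), ((9.0, 8.5), "b"), ((8.5, 9.5), "b")]
test = [((2.0, 1.0), "a"), ((7.5, 9.0), "b")]

# Accuracy table keyed by (metric name, k), ready to be plotted.
results = {
    (name, k): (accuracy(train, train, k, fn), accuracy(test, train, k, fn))
    for name, fn in [("1-norm", dist_1norm), ("2-norm", dist_2norm)]
    for k in (1, 3)
}
```

Training accuracy at k = 1 is trivially perfect when no two training points coincide, since each point is its own nearest neighbor, which is one reason to plot training and test accuracy side by side.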
Topic 2: Clustering
Sometimes referred to as unsupervised learning
Goal: find clusters of similar data
Less accurate than supervised learning, but quite useful
when no training set is available
Where are the clusters below? How many are there?
[Two scatter plots: chemical 1 vs. tissue (cm), and chemical 2 vs. tissue (cm)]
Topic 2: Clustering
Assignment: Find dataset of interest from UCI
Repository
iris plant, letter recognition, liver disorders, Pima Indians
diabetes, Congressional voting records, wine recognition, zoo
this dataset is used for most remaining assignments
if dataset has a class label, discard it for this assignment
Implement basic clustering algorithm (k-means)
Try varying number of clusters
Try two different techniques for initializing clusters
Report and interpret results found
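A minimal k-means sketch with two interchangeable initialization techniques, in the spirit of this assignment. The two initializers shown (random data points, random partition) are assumptions about what "two different techniques" might mean, and all names and the toy data are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def init_random_points(points, k, rng):
    """Initialization 1: use k distinct data points as starting centers."""
    return rng.sample(points, k)

def init_random_partition(points, k, rng):
    """Initialization 2: split points into k random groups, use group means."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    return [mean_point(shuffled[i::k]) for i in range(k)]

def kmeans(points, k, init, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = init(points, k, rng)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = mean_point(members)
    return centers, assignment

# Hypothetical toy data with two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
centers_a, labels_a = kmeans(data, 2, init_random_points)
centers_b, labels_b = kmeans(data, 2, init_random_partition)
```

Varying `k` and `seed` and comparing the resulting assignments is one way to see how sensitive the clusters are to initialization.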
Topic 2: Clustering
Paper: Bradley et al, “Scaling Clustering Algorithms to
Large Databases”
Describes “Scalable K-means” algorithm
Class discussion around “data mining desiderata”
Paper: Guha et al, “CURE: An Efficient Clustering
Algorithm for Large Databases”
Agglomerative clustering algorithm
completely different approach
Requires use of a heap (as I pose the assignment)
Assignment: Implement stripped-down version of CURE
Run on dataset, interpret results
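To illustrate where a heap enters an agglomerative algorithm, here is a heavily simplified sketch. It is not CURE and not necessarily the stripped-down version used in the course: it merges clusters by centroid distance only and omits CURE's representative points, shrinking, and sampling. All names and the toy data are hypothetical.

```python
import heapq
import itertools

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = {i: [p] for i, p in enumerate(points)}  # cluster id -> members
    next_id = len(points)
    # Heap of candidate merges: (squared centroid distance, id, id).
    heap = [
        (sq_dist(clusters[i][0], clusters[j][0]), i, j)
        for i, j in itertools.combinations(clusters, 2)
    ]
    heapq.heapify(heap)
    while len(clusters) > num_clusters and heap:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(i) + clusters.pop(j)
        new_centroid = centroid(merged)
        # Push distances from the new cluster to every surviving cluster.
        for other_id, members in clusters.items():
            entry = (sq_dist(new_centroid, centroid(members)), other_id, next_id)
            heapq.heappush(heap, entry)
        clusters[next_id] = merged  # ids are never reused
        next_id += 1
    return list(clusters.values())

# Hypothetical toy data with two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
found = agglomerate(data, 2)
```

Stale heap entries are skipped lazily rather than deleted, which keeps the sketch short at the cost of a larger heap.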
Topic 3: Association Rules
“Supermarket basket analysis”
What items do people tend to buy together at the same
time?
Paper: Agrawal et al, “Fast Algorithms for Mining
Association Rules”
presents classic Apriori algorithm (skim other portions of paper)
Assignment: Implement Apriori algorithm and run it on
own dataset
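The frequent-itemset phase of Apriori can be sketched compactly. This version favors clarity over the candidate-generation and counting optimizations in the paper, and it stops before generating rules from the frequent itemsets. The function name, the use of an absolute support count, and the toy baskets are all hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: unite frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support=3)
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are, so candidates with an infrequent subset are never counted.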
Topic 4: Web Mining
How does Google rank importance of web pages?
Every page has a PageRank
PageRank of a page is determined by the PageRank of the
pages that link to it
manifests itself as an eigenvalue problem
Paper: Page et al, “The PageRank Citation Ranking:
Bringing Order to the Web”
describes basic version of Google PageRank algorithm
cover eigenvalues in class
exposure to linear algebra, numerical analysis
Topic 4: Web Mining
Paper: Chakrabarti et al, “Mining the Link Structure of
the World Wide Web”
describes HITS algorithm for ranking web pages
Google isn’t the only way to do it
uses Latent Semantic Analysis, which requires singular value
decomposition (cover in class)
Assignment: Implement PageRank algorithm
try it on archive of department website
crawling for an assignment is dangerous
sparse data representation
hashing or other form of map for efficiency
interpret results
[Figure labels: hubs, authorities]
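For the PageRank assignment, a sparse representation can be as simple as a dictionary mapping each page to its outgoing links. The sketch below uses power iteration. The damping factor of 0.85, the even redistribution of rank from pages with no outgoing links, the function signature, and the toy site are all assumptions, not details taken from the assignment.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links maps each page to the list of pages it links to (sparse form)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        # Each page passes an equal share of its rank along every out-link.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical four-page site.
site = {
    "index": ["courses", "faculty"],
    "courses": ["index"],
    "faculty": ["index", "courses"],
    "news": ["index"],
}
ranks = pagerank(site)
```

Only the links that exist are stored and visited, so each iteration costs time proportional to the number of links rather than the square of the number of pages.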
Topic 5: Collaborative Filtering
a.k.a. Recommender Systems
“I like Pink Floyd, Dream Theater, and Evanescence. Who should
I be listening to?”
Amazon.com, Yahoo! Launchcast
Paper: Breese et al, “Empirical Analysis of Predictive
Algorithms for Collaborative Filtering”
Algorithms are nearest neighbor-like in flavor
Involve averaging numerical scores
Need to normalize for individual biases
Students already working on final project, so no
assignment
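The idea of averaging scores after normalizing for individual biases can be sketched as a correlation-weighted prediction, in the spirit of the memory-based methods the paper analyzes. The function names, the ratings, and the item "Band X" are hypothetical; only the three band names come from the slide.

```python
import math

def mean_rating(ratings):
    return sum(ratings.values()) / len(ratings)

def pearson(a, b):
    """Correlation between two users over the items both have rated."""
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    mean_a, mean_b = mean_rating(a), mean_rating(b)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common)
                    * sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def predict(active, others, item):
    """Active user's mean plus a weighted average of mean-centered ratings."""
    weighted = [
        (pearson(active, other), other[item] - mean_rating(other))
        for other in others if item in other
    ]
    norm = sum(abs(w) for w, _ in weighted)
    if norm == 0:
        return mean_rating(active)  # no usable neighbors: fall back to the mean
    return mean_rating(active) + sum(w * d for w, d in weighted) / norm

# Hypothetical ratings on a 1-5 scale.
alice = {"Pink Floyd": 5, "Dream Theater": 4, "Evanescence": 2}
bob = {"Pink Floyd": 4, "Dream Theater": 5, "Evanescence": 1, "Band X": 5}
carol = {"Pink Floyd": 1, "Dream Theater": 2, "Evanescence": 5, "Band X": 1}

guess = predict(alice, [bob, carol], "Band X")
```

Subtracting each user's own mean before averaging is the normalization step: a user who rates everything highly and one who rates everything low can still agree on which items are above average. A negatively correlated user such as `carol` contributes too, since her low rating counts as evidence in favor.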
Topic 6: Ethical Issues in Data Mining
Privacy concerns
Good vs. evil uses of data mining
Video: Ramakrishnan et al, “Data Mining: Good, Bad, or
Just a Tool?”
Panel discussion from KDD 2004
Before watching video, students post to Caucus:
how data mining could be exploited
how this could be prevented (if possible)
After watching video
followup commentary
Pictures from conference website at http://www.acm.org/sigs/sigkdd/kdd2004/
Topic 6: Ethical Issues in Data Mining
Students’ response to video was more engaged than I
expected
More problems than solutions are raised in video
Students were frustrated that solutions weren’t clear
Many students interested in issue of accountability
If someone’s privacy is violated, who is responsible?
“Who do I sue?”
Lively class discussion
Final Project
“Do almost anything you want regarding data mining, so
long as I approve it”
Find a paper and implement the algorithm within
Find a dataset of interest and study it completely, using
Weka and/or their own code from throughout the term
Quantitative association rules
Poker association rules
Collaborative filtering (music, art)
Attack KDD Cup problems
KDD Cup 2005: identify categories for web search queries
tried this once: tended to be too big for them in the time that I had
could perhaps be done with right level of support
Conclusions
Papers are most memorable part of course
Students speak very positively about this in evaluations
Significant prep time for me to fill in gaps
Caucus motivates reading papers
Students find this a pain, but are thankful afterwards in evals
Important to set deadline for posting a few hours before class so I
have time to read
Programming assignments work (mostly) well
Allow students to work in pairs if they wish
Grading is difficult: unspecified details in algorithms, differing
datasets
All materials available on my website at
http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05