An Introduction to Data Mining Concepts


http://www.intelliware.ca


© 2006 Intelliware Development Inc.



Tim Eapen and B.C. Holmes

Intelliware Development



Agenda


Introduction to data mining


The typical steps


What were we trying to accomplish


Bayesian Categorization


An example


Data Clustering


k-means clustering


Interesting conclusions


Other Stuff


Java and Data Mining



What is Data Mining?


Data mining is the discovery of useful information from data


Data mining touches on many of the same problems as machine
learning and artificial intelligence



This is a huge topic, and we can’t hope to do more than just touch
on it today



Some Crazy Examples


Here are some interesting examples of useful information gleaned
from data:


“Diapers and beer”


People who buy diapers are also likely to buy beer. Put potato chips in
between them and the sales of all three items go up


Google AdWords:


“digital cameras” is worth more than “digital camera”


Airline traveler behaviours


Amazon.ca


“other people who bought this DVD liked such-and-such”



The Data Mining Process

Gather the Data

Cleanse

Extract the “Good Stuff”

Identify Patterns

Vet the results



What We Were Trying to Accomplish


Tim, Tom and I were working on the WhatAmITaking.com project


WhatAmITaking.com is a wiki / repository that collects information
about medications


Data is all available from public sources, including:


Government drug reference database


Wikipedia


Open-license publications available through the (U.S.) National Institutes
of Health


News articles


Concept: we want to use data mining techniques on publications and
news


First steps: we wanted to try to emulate the Google News-style
categorization and “topic” correlation




But Along the Way…


We learned some interesting things about the field of Data Mining



News: Obtaining News


How do we get news?


Need to build a “bot” or a “web crawler” that goes out to a large
number of web sites and GETs the interesting content.


Nice additions: look for links to other pieces of news


Some complications:


There’s a “Good Internet Citizen” standard (the robots.txt file
standard) that should be respected


If the site has a robots.txt file that says “bots keep out”, you shouldn’t
crawl their site.


How do you determine what’s a story and what’s not?


That’s a hard problem: too big a topic for this presentation
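The robots.txt check mentioned above can be sketched with Python's standard `urllib.robotparser` module. The robots.txt text, the `NewsFinderBot` user-agent string and the example.com URLs are made up for illustration; a real crawler would download the file from each site before fetching anything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every bot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_to_crawl(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/news/story.html",
                "http://example.com/private/draft.html"):
        print(url, allowed_to_crawl(ROBOTS_TXT, "NewsFinderBot", url))
```

A well-behaved crawler would run a check like this before every GET and skip any URL for which it returns False.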



Data Cleansing


You would not believe how bad some news sites are with respect
to their content.


Poor formatting


Bad encoding problems


Clear problems related to converting the content from another format
(e.g. Word)


Two interesting word-related cleansing problems


The “US spelling” versus “British spelling” problem


Root words


Some of it looks deliberately obfuscated



Extracting Interesting Stuff


Your typical web page news article has a lot of extra stuff on it:
banner ads, menus, links to “related stories”, navigation widgets,
etc.


Almost every discussion of word manipulation problems talks about “stop words”:
words that are so common they provide no significant meaning in
analysis of text:


the


he


she


said


it


etc…
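A minimal sketch of stop-word filtering, assuming a simple lowercase tokenizer. The stop-word set holds only the five examples listed above; real lists are much longer, and the function name `content_words` is an illustrative choice.

```python
import re

# Only the examples from the list above; a production list would be far longer.
STOP_WORDS = {"the", "he", "she", "said", "it"}


def content_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in STOP_WORDS]


if __name__ == "__main__":
    print(content_words("The coach said it was the best game"))
```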



Two Interesting Topics


Categorization


I know what the groups are, and I want to assign a group to any
particular data point


E.g.: News is categorized: Sports, Health, Finance, World News,
National, etc.


Data Clustering


I have a lot of data, and I want to find some mechanism for finding
meaningful groups


E.g.: News “events”




Bayesian Analysis

A Delightful Example



The Problem

HEALTH

SPORTS

TECHNOLOGY

BUSINESS

NEWS

ENTERTAINMENT


Given a random news article, how can we determine what category it belongs to?



In Light of New Evidence…


Do some detective work!


Start off with a hypothesis


Collect evidence


The evidence will be either consistent or inconsistent with a given
hypothesis


As more evidence is accumulated, the degree of belief in the initial
hypothesis will change


A hypothesis with a very high degree of belief may be accepted as
true


Likewise, a hypothesis with a very low degree of belief may be
considered false


How do we measure this degree of belief?



Bayes’ Theorem
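For reference, the standard statement of Bayes' theorem, written with H for a hypothesis and E for the evidence (the symbols used in the cookie example), is given here. The second line assumes a set of mutually exclusive hypotheses that covers every possibility.

```latex
\begin{align*}
P(H \mid E) &= \frac{P(E \mid H)\, P(H)}{P(E)} \\
P(E) &= \sum_{k} P(E \mid H_k)\, P(H_k)
\end{align*}
```

Here $P(H)$ is the prior degree of belief in the hypothesis, $P(E \mid H)$ is how likely the evidence is if the hypothesis is true, and $P(H \mid E)$ is the degree of belief after seeing the evidence.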



An Edible Example


Bowl #1: 10 Chocolate Chip Cookies, 30 Oatmeal Cookies


Bowl #2: 20 Chocolate Chip Cookies, 20 Oatmeal Cookies



State a Hypothesis


Little Johnny picks a bowl at random


Little Johnny picks a cookie at random


The cookie turns out to be an oatmeal cookie


How probable is it that Johnny picked the cookie out of bowl #1?



Consider the Evidence


Probability of selecting an oatmeal cookie given
Johnny chooses bowl #1: $P(E \mid H_1) = 30/40 = 0.75$


Probability of selecting an oatmeal cookie given
Johnny chooses bowl #2: $P(E \mid H_2) = 20/40 = 0.5$



An Edible Example


Bayes’ Theorem gives the following result


Notice that initially the prior probability that the cookie
came from bowl #1 was $P(H_1) = 0.5$


In light of evidence E, the probability that the cookie
came from bowl #1 increased to $P(H_1 \mid E) = 0.6$
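The cookie numbers can be checked with a short Python sketch. It assumes bowl #1 is the bowl with 10 chocolate chip and 30 oatmeal cookies, which is the reading consistent with the 0.6 result; the function name `posterior` and the dictionary layout are illustrative choices.

```python
def posterior(priors, likelihoods, hypothesis):
    """Bayes' theorem: P(H|E) = P(E|H) * P(H) / sum of P(E|H') * P(H') over all H'."""
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return likelihoods[hypothesis] * priors[hypothesis] / evidence


# Johnny picks a bowl at random, so each bowl has prior probability 0.5.
priors = {"bowl1": 0.5, "bowl2": 0.5}
# Chance of drawing an oatmeal cookie from each bowl (assumed bowl contents).
likelihoods = {"bowl1": 30 / 40, "bowl2": 20 / 40}

if __name__ == "__main__":
    print(round(posterior(priors, likelihoods, "bowl1"), 3))  # prints 0.6
```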



Back to our problem…


Given a random news article, how can we determine what category it belongs to?

OF COURSE WE CAN!!!

USE BAYESIAN ANALYSIS



Naïve Bayes Classifier


To categorize a news article use a Naïve Bayes Classifier


A simple probabilistic classifier based on some naïve independence
assumptions


Can be ‘trained’

Naïve Probabilistic Model


The probability model for a classifier is conditional: $p(C \mid F_1, \dots, F_n)$

Given a news article with $n$ words …


Let $C$ represent a category of news (e.g. Health)


Let $F_i$ represent the frequency with which the $i$-th word ($i = 1, \dots, n$)
appears in articles from category $C$



Naïve Probabilistic Model


We can express our probability model using Bayes’
Theorem



Solving this is difficult so we make some
simplifying assumptions:


The denominator is constant: it does not depend on the category $C$


Naively assume that each feature (word
frequency) $F_i$ is conditionally independent of
every other feature $F_j$ ($i \neq j$)
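Written out, the model and the two simplifications look roughly like this. It is a sketch in the symbols defined earlier ($C$ for the category, $F_i$ for the frequency of the $i$-th word); the first line is Bayes' theorem, the second drops the constant denominator, and the third applies the naive independence assumption.

```latex
\begin{align*}
p(C \mid F_1, \dots, F_n)
  &= \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)} \\
  &\propto p(C)\, p(F_1, \dots, F_n \mid C) \\
  &\approx p(C) \prod_{i=1}^{n} p(F_i \mid C)
\end{align*}
```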




Naïve Probabilistic Model


Problems with our assumptions


Words have context


Assuming that the frequency ($F_i$) of word $i$ is independent of
the frequency ($F_j$) of word $j$ is untrue

For example, the words ‘War’ and ‘Afghanistan’ are more likely to appear in the
same article than the words ‘War’ and ‘Tuna’


Benefits of our assumptions


It simplifies our math algorithm



Naïve Probabilistic Model

$$p(C) \prod_{i=1}^{n} p(F_i \mid C)$$



We can approximate the probability that an article
belongs to category $C$ as the ‘prior’ probability
that the article belongs to that category multiplied by the
product of the individual word frequencies for that category




A Simple Algorithm for Classifying An Article


Given a random article with n words, to classify the article into one
of several possible categories, do the following:


For each possible category, calculate the probability that the article
belongs to that category by considering the prior probability and word
frequencies


Classify the article as belonging to the category with the highest
probability






A Simple Example


Consider this very simple article …

hockey



puck


For simplicity consider that there are only two possible categories:


Sports


News



A Simple Example …


Consider the following word frequencies:

Word      Category   Frequency
hockey    Sports     98%
puck      Sports     96%
hockey    News       2%
puck      News       4%

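1. Let C = Sports:

$p(C) = 0.5$, $p(F_1 \mid C) = 0.98$ and $p(F_2 \mid C) = 0.96$

$p(C \mid F_1, F_2) \propto 0.5 \times 0.98 \times 0.96 = 0.4704$

2. Let C = News:

$p(C) = 0.5$, $p(F_1 \mid C) = 0.02$ and $p(F_2 \mid C) = 0.04$

$p(C \mid F_1, F_2) \propto 0.5 \times 0.02 \times 0.04 = 0.0004$

Sports has the higher score, so the article is classified as Sports.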
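The hockey/puck calculation can be reproduced with a small sketch of the classification algorithm. The function names and the dictionary layout are illustrative assumptions, not the Classifier4J API; the numbers are the ones from the table above.

```python
from math import prod


def naive_bayes_score(prior, word_frequencies):
    """Prior probability times the product of the per-word frequencies."""
    return prior * prod(word_frequencies)


def classify(priors, frequencies, words):
    """Score every category and return the best one together with all scores."""
    scores = {
        category: naive_bayes_score(
            priors[category], [frequencies[category][word] for word in words]
        )
        for category in priors
    }
    return max(scores, key=scores.get), scores


priors = {"Sports": 0.5, "News": 0.5}
frequencies = {
    "Sports": {"hockey": 0.98, "puck": 0.96},
    "News": {"hockey": 0.02, "puck": 0.04},
}

if __name__ == "__main__":
    best, scores = classify(priors, frequencies, ["hockey", "puck"])
    print(best, scores)
```

The scores are not normalized probabilities, because the constant denominator was dropped; only their relative size matters when picking the winning category.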





Gathering the Evidence


So where do the frequencies we use come from?


To perform Bayesian analysis, it is important to have a large
‘corpus’ of articles


This corpus is what we use to determine the word frequencies
used in categorizing a given article


This corpus would grow over time


This corpus is what we use to ‘train’ our Bayesian classifier
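Training can be sketched as counting. The version below is an assumption about what "frequency" means: the fraction of a category's articles that contain the word. The tiny corpus and the function name `train` are invented for illustration.

```python
from collections import Counter, defaultdict


def train(corpus):
    """corpus: iterable of (category, list_of_words) pairs.

    Returns (priors, frequencies), where priors[c] is the share of articles in
    category c and frequencies[c][w] is the share of c's articles containing w.
    """
    article_counts = Counter()
    word_counts = defaultdict(Counter)
    for category, words in corpus:
        article_counts[category] += 1
        word_counts[category].update(set(words))  # count each word once per article
    total = sum(article_counts.values())
    priors = {c: n / total for c, n in article_counts.items()}
    frequencies = {
        c: {w: k / article_counts[c] for w, k in counts.items()}
        for c, counts in word_counts.items()
    }
    return priors, frequencies


# Invented four-article corpus, far smaller than anything useful in practice.
CORPUS = [
    ("Sports", ["hockey", "puck"]),
    ("Sports", ["hockey", "goal"]),
    ("News", ["election", "vote"]),
    ("News", ["hockey", "strike"]),
]

if __name__ == "__main__":
    print(train(CORPUS))
```

A word never seen in a category has no entry here, and a frequency of zero would wipe out the whole product, so a real classifier needs some rule for unseen words.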



What We Actually Did


First step was to gather a ‘corpus’ of articles


This corpus would be used to train our Bayesian classifier


Initially started by gathering 5000 articles


Number of articles in the corpus would grow over time


Built a simple little ‘NewsFinder’ utility that would regularly go to
http://news.google.ca/ and gather articles


Google has seven categories of news


NewsFinder gathers from: World, Canada, Health, Business, Science, Sports, Entertainment



Bayesian Classifier


Started with an open-source package from SourceForge called
Classifier4J, available at http://classifier4j.sourceforge.net/


Created a SimpleClassifier


This classifier has an instance of our Bayesian classifier which does
all the Bayesian analysis for us


The classifier also has a WordDataSource: a simple map that
correlates a frequency with a given word in a given category


Used our corpus of articles to train our classifier (fill up our word
data source)





Issues To Consider


Making sure that the corpus was clean


This was part of ‘cleansing’ the data as we gather it


Had to actually tweak Classifier4j because the algorithm wasn’t
correct




Clustering

What is a Cluster, anyway?



Data Clustering


Data clustering is the process of taking “points” in some
n-dimensional space, and grouping them into some understandable
group.


That’s kind of “math-y” sounding. How does that relate to news?


This is the fundamental question: deciding on good “measures” is
the key success criterion


I want to defer the answer for now


There are two fundamental approaches:


Centroid


Guess certain centres of clusters, and iteratively refine them


Hierarchical


Assume that each point is a cluster, and iteratively merge them until
“good” clusters emerge



Another Key Consideration


The field of Data Mining spends a lot of time thinking about one
special problem:


Often, there’s too much data to fit into memory; any algorithms that
try to “cluster” information must think about the special problem of
data not fitting into memory


I’m not going to say too much about this problem



k-Means Algorithm


One of the fundamental centroid-based algorithms is called the
“k-means” algorithm


Assume you have a number of points of data and you want to
cluster these points into some number of clusters (k)


You don’t really need to know what the clusters represent, just some
arbitrary number of clusters



Step One: Pick k=3 objects



Step Two: Create initial Groupings

Groups are based on distance from initial points



Step Three: Find the “centres”/means



Step Four: Re-jig the clusters



Repeat until the Clusters don’t change
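The four steps can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it assumes the points are tuples of numbers, takes the first k points as the initial objects (the slides do not say how they are picked), and uses straight-line (Euclidean) distance.

```python
from math import dist


def k_means(points, k, max_iterations=100):
    """Cluster points into k groups; returns (centres, assignment)."""
    # Step one: pick k objects as the initial centres (here simply the first k).
    centres = list(points[:k])
    assignment = None
    for _ in range(max_iterations):
        # Steps two and four: put each point in the group of its nearest centre.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centres[c])) for p in points
        ]
        # Repeat until the clusters don't change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step three: move each centre to the mean of the points in its group.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centres[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centres, assignment


if __name__ == "__main__":
    sample = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
    print(k_means(sample, 2))
```

The result depends on the initial choice of centres, so practical implementations usually try several starting points.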



But How Do You Decide on k?


A key question to ask is “how many clusters is the right number?”


Try a bunch of different values, and map distance

[Chart: distance for each candidate value, 1 through 5]



Converting from Words to Points


One idea:


There are about 100,000,000 English words.


Consider an n-dimensional space, where n = 100,000,000


The frequency of a particular word in an article can be considered a
distance in one dimension of the n-dimensional space.
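A toy version of this idea, with a three-word vocabulary standing in for the full n dimensions. The vocabulary, the sample articles and the function names are assumptions made for illustration, and Euclidean distance is one possible choice of measure.

```python
from collections import Counter
from math import sqrt


def to_point(words, vocabulary):
    """Turn an article (a list of words) into one coordinate per vocabulary word."""
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[word] / total for word in vocabulary]


def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


VOCABULARY = ["hockey", "puck", "election"]

if __name__ == "__main__":
    a = to_point(["hockey", "puck", "hockey", "puck"], VOCABULARY)
    b = to_point(["hockey", "puck"], VOCABULARY)
    c = to_point(["election", "election"], VOCABULARY)
    print(distance(a, b), distance(a, c))
```

Articles that use the same words in the same proportions land on the same point, whatever their length.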



Unintuitive Conclusions


When dealing with points in n-dimensional space, where n is very
large (say > 100), most points are about the same distance apart: the
distance between almost any two points is close to the average distance.
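A small experiment makes this effect visible. The sketch assumes points drawn uniformly at random from the unit cube, which is only one possible data distribution, and measures how much the distances from one reference point vary relative to their average.

```python
import random
from math import dist
from statistics import mean, pstdev


def relative_spread(dimensions, n_points=200, seed=0):
    """Standard deviation of distances from a reference point, divided by their mean."""
    rng = random.Random(seed)
    reference = [rng.random() for _ in range(dimensions)]
    distances = [
        dist(reference, [rng.random() for _ in range(dimensions)])
        for _ in range(n_points)
    ]
    return pstdev(distances) / mean(distances)


if __name__ == "__main__":
    for n in (2, 10, 100, 1000):
        print(n, relative_spread(n))
```

As the number of dimensions grows, the relative spread shrinks, which is why "nearest" and "farthest" stop being meaningful without a better choice of measure.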



Determining a Good Measuring Stick


So how do you deal with the problem of large dimensional
spaces?


Try to determine a smaller set of “interesting” dimensions. Try
this:


Pick an article


In that article try to find 25 “interesting” words


What’s “interesting”?


Try 10 of the most common words in the article (excluding stop words)


Pick 10 of the most significant “classification” words (e.g. certain words
are strongly correlated with health articles. Find the 10 most strongly
correlated, that also have high frequency of occurrence in the article)


Pick 5 unusual words


Now you’ve got some measuring stick.


Now measure other articles according to this measuring stick, and
figure out distance
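Part of this recipe can be sketched in code: pick the most common non-stop words of one article as the measuring stick, then measure another article against it. The stop-word set, the sample sentences and the function names are assumptions for illustration; choosing the "classification" words and the unusual words would need the trained corpus and is not shown.

```python
from collections import Counter

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"the", "he", "she", "said", "it", "and"}


def common_words(words, how_many=10):
    """The most frequent words in an article, excluding stop words."""
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return [word for word, _ in counts.most_common(how_many)]


def measure(words, stick):
    """One coordinate per measuring-stick word: its frequency in the article."""
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[word] / total for word in stick]


if __name__ == "__main__":
    article = "the vaccine trial said the vaccine works and the vaccine trial continues".split()
    stick = common_words(article, 2)
    print(stick, measure("the trial ended".split(), stick))
```

With 25 words on the stick, every article becomes a point in a 25-dimensional space rather than one with a dimension per English word.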



Java and Data Mining


There are a few (but not many) Java initiatives relating to Data Mining


Bayesian classifier: Classifier4J


Used this initially, and discovered that the algorithm wasn’t correctly
implemented


Weka


Created by a number of Data Mining professors


The same group has published a Data Mining book with some references
to Weka (but it’s a heavy math book)


YALE (“Yet Another Learning Environment”)


There’s a Java Community Process around coming up with a
consistent Java API for data mining


JSR 73 and JSR 247


javax.datamining



Other Topics (Use Wikipedia)


w-shingling


Concept Mining



Crazy Ideas that Might Make Interesting
Experiments


Could you perform data mining on code?


What if you parsed Camel Case variable and class names and
performed text clustering on classes. Could you find interesting
relationships between classes? In different projects?


What could you learn if you tried to perform clustering on a bunch
of open source web frameworks? How much similarity and/or
difference do they have?
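As a starting point for the camel-case experiment, an identifier can be split into lowercase words with one regular expression. The pattern below is one possible approach, not an established standard, and it treats runs of capitals (such as HTML) as a single word.

```python
import re

# An acronym followed by a capitalized word, or an ordinary word, or a trailing
# acronym, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_camel_case(identifier):
    """Split a camel-case identifier into lowercase words."""
    return [part.lower() for part in _WORD.findall(identifier)]


if __name__ == "__main__":
    for name in ("WordDataSource", "SimpleClassifier", "parseHTMLDocument"):
        print(name, split_camel_case(name))
```

The resulting word lists could then be treated like tiny articles and fed to the same clustering approach used for news.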