What is Data Mining? - Dave Reed


Data Mining and Knowledge Discovery

By Matt Goliber and Jim Hougas

What is Data Mining?


Not like gold or diamond mining




Mining of knowledge from data




Important to many different fields




A Part of Knowledge Discovery in Databases (KDD)

The Process of Knowledge Discovery

Raw data
-> (data cleaning and integration) -> Data Warehouse
-> (data transformation, selection, and mining) -> Patterns
-> (pattern evaluation and knowledge presentation) -> KNOWLEDGE!
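The stages above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the function names, and the `min_count` threshold are all hypothetical, and the "mining" step is a plain frequency count rather than any particular algorithm.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"customer": "ann", "item": "Bagel"},
           {"customer": "bob", "item": None}]
store_b = [{"customer": "ann", "item": "cream cheese"},
           {"customer": "cy", "item": " bagel "}]

def clean_and_integrate(*sources):
    """Data cleaning and integration: merge sources, drop incomplete rows."""
    merged = [row for source in sources for row in source]
    return [row for row in merged if row["customer"] and row["item"]]

def transform_and_select(warehouse):
    """Transformation and selection: normalize item names, keep one column."""
    return [row["item"].strip().lower() for row in warehouse]

def mine(items):
    """Mining: count how often each item occurs (a trivial 'pattern')."""
    return Counter(items)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {item: n for item, n in patterns.items() if n >= min_count}

warehouse = clean_and_integrate(store_a, store_b)
patterns = mine(transform_and_select(warehouse))
knowledge = evaluate(patterns)
print(knowledge)
```

Each function maps onto one arrow of the diagram, so a stage can be swapped out without touching the others.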

Why is Data Mining useful?



We are data rich but information poor


- Internet
- Intelligence




Humans often lack the ability to comprehend and manage the immense amount of available and sometimes seemingly unrelated data

How long has this idea been around?



Late 1960s and early 1970s




Stanford’s Meta-DENDRAL (1970-76)
- Extension of DENDRAL




Doug Lenat with AM (1976)


Meta-DENDRAL



Extension of the DENDRAL (1965) program
- One of the first expert systems
- Interpreted mass spectra




Meta-DENDRAL took the mass spectra of compounds of known 3-D structure and formulated rules about the interpretation of the spectra




Came up with known rules and some new ones!

Sample Mass Spectrum

ethyl 3-oxo-3-phenylpropanoate (ethyl benzoylacetate)

AM


Doug Lenat, 1976



The name does not stand for anything; it stands alone



AM was given sets, bags, ordered sets, and lists



AM was also given operations to perform on these data sets
- Union, intersection, etc.



Came up with ideas about counting, addition, multiplication, prime numbers, and Goldbach’s conjecture



AM thought that these were all uninteresting



Liked maximally divisible numbers though…

What next?


Not a whole lot…


Databases were not prevalent enough, no great demand


Did benefit from machine learning research


Beginning of the 1990s, “The next area…”
- Ranked as one of the most promising research areas (NSF)
- Information explosion


Early commercial systems
- Farm Journal
- GM

Next Generation Techniques


Decision Trees


Each branch is a classification question


Allows businesses to segment customers, products, and sales regions


Questions organize the data
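A minimal sketch of how a tree of questions can segment customers. The tree here is written by hand rather than learned from data, and the field names, thresholds, and segment labels are hypothetical.

```python
# Hypothetical tree: each internal node asks one question about a customer,
# "is record[field] >= threshold?", and each leaf is a segment label.
tree = {
    "question": ("annual_spend", 1000),
    "yes": {
        "question": ("visits_per_month", 4),
        "yes": "loyal high spender",
        "no": "occasional high spender",
    },
    "no": "low spender",
}

def classify(node, record):
    """Follow one branch per question until a leaf (a string) is reached."""
    while isinstance(node, dict):
        field, threshold = node["question"]
        node = node["yes"] if record[field] >= threshold else node["no"]
    return node

customers = [
    {"name": "ann", "annual_spend": 1500, "visits_per_month": 6},
    {"name": "bob", "annual_spend": 1500, "visits_per_month": 1},
    {"name": "cy", "annual_spend": 200, "visits_per_month": 10},
]

# The questions organize the data: every customer lands in exactly one segment.
segments = {}
for customer in customers:
    segments.setdefault(classify(tree, customer), []).append(customer["name"])
print(segments)
```

A real decision-tree learner would choose the questions and thresholds from the data; only the traversal is shown here.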



Rule Induction


All patterns are pulled from the data


Accuracy and Significance are then added to them


Help the user know how strong a pattern is and how likely it is to occur again


Ex: If bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets
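The two numbers in the bagel example can be computed directly from basket data. The sketch below assumes that the 90% figure is what association-rule work usually calls confidence and the 3% figure is support (the share of all baskets containing both items); the baskets themselves are made up so that the rule comes out at exactly those values.

```python
# Hypothetical baskets: 27 with bagels and cream cheese, 3 with bagels only,
# and 870 with neither, for 900 baskets in total.
baskets = (
    [{"bagels", "cream cheese"}] * 27
    + [{"bagels"}] * 3
    + [{"milk"}] * 870
)

def rule_stats(baskets, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent -> consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    # Confidence: how often the consequent appears when the antecedent does.
    confidence = len(with_both) / len(with_antecedent)
    # Support: how often the whole pattern appears across all baskets.
    support = len(with_both) / len(baskets)
    return confidence, support

confidence, support = rule_stats(baskets, "bagels", "cream cheese")
print(f"confidence={confidence:.0%}, support={support:.0%}")
```

A rule with high confidence but very low support may be accurate yet apply to too few baskets to matter, which is why both numbers are reported together.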


Decision Trees vs. Rule Induction



Decision Trees


Always and only one rule covers each instance



Rule Induction


Many rules may cover the same instance, or no rule may cover an instance



Example


Decision Trees use height and shoe size to determine size of person


Rule Induction uses one or the other



Examples of Significant Developments


Stock Market Advances (1991)


Astrophysicists Doyne Farmer and Norman Packard


Their firm, Prediction Company, could predict stock market trends



Bell Atlantic (1996)


Consumer phone buying trends


Rule Induction



Advanced Scout (1997)


Inderpal Bhandari assists NBA coaches


Rule Induction



Persuade 400,000 undecided voters (2004)


MoveOn attempts to influence the election


Decision Tree


Challenges


Large Data Sets with High Complexity
- One or the other is currently possible, but not both


Expensive
- Costs of Bell Atlantic (experts are needed)
- Cost for a two-day course in Las Vegas ($1,300)
- Software ($100,000)

Research


DARPA


Defense Advanced Research Projects Agency


ACLU claims this is an invasion of privacy


Decision Tree



Uncovering Terrorists in public chat rooms


Tracks the times that messages are sent



Advanced Scout


Bhandari is working on Advanced Scout for the NHL


Rule Induction

Current State


Out of the Lab


Into Fortune 500 companies



Automate Model Scoring


Fingers are currently crossed in hopes that scoring by IT personnel is done correctly


Future States


Utilizing Company Warehouses


Data miners must take advantage of the million-dollar warehouse that a company builds




Effort Knob


Low for quick model, high for quality model




Computed Target Columns


User could create a new target variable


Ex: finance information that a business has
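One way to picture a computed target column is a new field derived from finance data the business already holds. This is a minimal sketch: the record fields, the `profitable` column name, and the 20% margin threshold are all hypothetical.

```python
# Hypothetical finance records that a business already stores.
records = [
    {"customer": "a", "revenue": 1200.0, "cost": 900.0},
    {"customer": "b", "revenue": 500.0, "cost": 480.0},
]

def add_target(record, min_margin=0.2):
    """Return a copy of the record with a computed 'profitable' target."""
    margin = (record["revenue"] - record["cost"]) / record["revenue"]
    return {**record, "profitable": margin >= min_margin}

labeled = [add_target(r) for r in records]
for row in labeled:
    print(row["customer"], row["profitable"])
```

The derived column could then serve as the variable a model is trained to predict, without anyone having stored it in the warehouse beforehand.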

Sources

http://web.media.mit.edu/~haase/thesis/node54.html#SECTION00711000000000000000

http://smi-web.stanford.edu/projects/history.html#METADENDRAL

http://www.cs.cf.ac.uk/Dave/AI2/node151.html

http://64.233.161.104/search?q=cache:Q6eMD9tEKwIJ:www.cosc.brocku.ca/Offerings/4P79/Week12.ppt+meta-dendral&hl=en

http://laurel.actlab.utexas.edu/~cynbe/muq/muf3_21.html

http://64.233.161.104/search?q=cache:yft0cQ5tZJQJ:www.cs.uwaterloo.ca/~shallit/Talks/cct.ps+%22fundamental+theorem+of+arithmetic%22+computer+data+mining+prove&hl=en

http://mathworld.wolfram.com/GoldbachConjecture.html

http://www.quantlet.com/mdstat/scripts/csa/html/node202.html

http://www.thearling.com

http://www.wired.com

http://www.dmreview.com

http://www.ebscohost.com

http://www.thearling.com/text/dmtechniques/dmtechniques.htm

http://www.aaai.org/Library/Magazine/Vol13/13-03/vol13-03.html

Han, J. and Kamber, M. Data Mining: Concepts and Techniques.