Data Mining - Department of Computer Science


Nov 20, 2013


Data Mining

Chris Nelson

CS 157 A

Fall 2007

Data Mining

New buzzword, old idea.

Inferring new information from already
collected data.

Traditionally the job of Data Analysts

Computers have changed this.

Far more efficient to comb through data using
a machine than eyeballing statistical data.

Data Mining

Two Main Components

Wikipedia definition: “Data mining is the entire process of applying
computer-based methodology, including new techniques for knowledge
discovery, from data.”

Knowledge Discovery

Concrete information gleaned from known data. Data you may not have
known, but which is supported by recorded facts.

(e.g., the diapers and beer example from the previous presentation)

Knowledge Prediction

Uses known data to forecast future trends, events, etc. (e.g., the stock market)

Wikipedia note: “some data mining systems such as neural networks are
inherently geared towards prediction and pattern recognition, rather than
knowledge discovery.” These include applications in AI and Symbol

Data Mining vs. Data Analysis

In terms of software and the marketing thereof

Data Mining != Data Analysis

Data Mining implies that the software applies some intelligence
beyond simple grouping and partitioning of data to
infer new information.

Data Analysis is more in line with standard statistical
software (e.g., web stats). These usually present
information about subsets and relations within the
recorded data set (e.g., browser/search engine usage,
average visit time, etc.)

Data Mining Subtypes

Data Dredging

The process of scanning a data set for relations and then
coming up with a hypothesis for existence of those relations.


Metadata

Data that describes other data. Can describe an individual
element, or a collection of elements.

Wikipedia example: “In a library, where the data is the
content of the titles stocked, metadata about a title would
typically include a description of the content, the author, the
publication date and the physical location”

Applications for Data Dredging in business include Market
and Risk Analysis, as well as trading strategies.

Applications for Science include disaster prediction.

Propositional vs. Relational Data

Old data mining methods relied on Propositional Data, or data
that was related to a single, central element and could be
represented in a vector format (e.g., the purchasing history of a
single user; Amazon uses such vectors in its related item
suggestions [a multidimensional dot product]).

Current, advanced data mining methods rely on Relational
Data, or data that can be stored and modeled easily through
use of relational databases. An example of this would be data
used to represent interpersonal relations.

Relational Data is more interesting than Propositional data to
miners in the sense that an entity, and all the entities to which
it is related, factor into the data inference process.
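The vector idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item list, the user names, and the `suggest` helper are hypothetical, and this is not a description of Amazon's actual system. Each user's purchase history is a vector with one slot per item, and the dot product of two such vectors counts the purchases they share.

```python
# Hypothetical catalog and purchase histories (1 = bought, 0 = not bought).
ITEMS = ["diapers", "beer", "wipes", "novel"]
HISTORY = {
    "alice": [1, 1, 0, 0],
    "bob":   [1, 1, 1, 0],
    "carol": [0, 0, 0, 1],
}

def dot(a, b):
    """Multidimensional dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def suggest(user, history=HISTORY, items=ITEMS):
    """Suggest items bought by the most similar other user."""
    mine = history[user]
    # Pick the other user whose purchase vector overlaps most with ours.
    nearest = max((u for u in history if u != user),
                  key=lambda u: dot(mine, history[u]))
    theirs = history[nearest]
    # Keep only the items the similar user bought that we have not.
    return [item for item, m, t in zip(items, mine, theirs) if t and not m]

print(suggest("alice"))
```

Here alice's vector overlaps most with bob's, so the one item bob bought that alice did not is suggested. A real system would likely normalize the vectors so that heavy buyers do not dominate.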

Key Component of Data Mining

Whether Knowledge Discovery or Knowledge
Prediction, data mining takes information that was
once quite difficult to detect and presents it in an
easily understandable format (e.g., graphical or

Data mining Techniques involve sophisticated
algorithms, including Decision Tree Classifications,
Association detection, and Clustering.
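As a rough illustration of the Clustering technique named above, here is a minimal one-dimensional k-means sketch (Lloyd's algorithm). The function name `kmeans_1d`, the sample values, and the starting centers are assumptions made for this example; production work would normally use a library implementation.

```python
def kmeans_1d(values, centers, rounds=10):
    """Cluster numbers around the given starting centers."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its nearest center.
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group
        # (an empty group keeps its old center).
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical data: visit times in minutes for six site visitors.
centers, groups = kmeans_1d([1, 2, 3, 20, 21, 22], centers=[0, 10])
print(centers, groups)
```

The algorithm finds the two groups (short visits and long visits) without being told where the boundary is, which is the sense in which the software "infers" structure.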

Since Data mining is not on test, I will keep things

Uses of Data Mining

AI/Machine Learning

Combinatorial/Game Data Mining

Good for analyzing winning strategies in games, and thus
developing intelligent AI opponents (e.g., chess).

Business Strategies

Market Basket Analysis

Identify customer demographics, preferences, and purchasing

Risk Analysis

Product Defect Analysis

Analyze product defect rates for given plants and predict
possible complications (read: lawsuits) down the line.
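Market Basket Analysis is usually framed in terms of association rules such as "customers who buy diapers also buy beer". The sketch below computes the two standard measures for such a rule, support and confidence. The baskets are made-up data and the function names are chosen for this example.

```python
# Hypothetical shopping baskets, one set of items per transaction.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"beer", "chips"},
]

def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets=BASKETS):
    """Of the baskets containing antecedent, the fraction that also
    contain consequent."""
    base = support(antecedent, baskets)
    both = support(set(antecedent) | set(consequent), baskets)
    return both / base if base else 0.0

print(support({"diapers", "beer"}))       # share of baskets with both
print(confidence({"diapers"}, {"beer"}))  # how often diapers imply beer
```

A rule is typically reported only when both numbers clear chosen thresholds, so that it is neither rare nor weak.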

Uses of Data Mining (Continued)

User Behavior Validation

Fraud Detection

In the realm of cell phones

Comparing phone activity to calling records.
Can help detect calls made on cloned phones.

Similarly, with credit cards, comparing
purchases with historical purchases. Can
detect activity with stolen cards.
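One simple way to compare a purchase with historical purchases is to check how many standard deviations it lies from the cardholder's usual spending. The sketch below assumes that approach. The function name, the threshold of 3, and the sample amounts are illustrative assumptions, and real fraud systems use far richer features than the amount alone.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag a purchase far outside the cardholder's historical spending."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # All past purchases were identical: flag any different amount.
        return amount != mu
    # z-score: distance from the mean in standard deviations.
    return abs(amount - mu) / sigma > threshold

# Hypothetical past purchase amounts for one card.
past = [20, 25, 22, 30, 18]
print(is_suspicious(past, 27))   # typical purchase
print(is_suspicious(past, 500))  # far outside the usual range
```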

Uses of Data Mining (Continued)

Health and Science

Protein Folding

Predicting protein interactions and functionality within
biological cells. Applications of this research include
determining causes and possible cures for Alzheimer's,
Parkinson's, and some cancers (caused by protein "misfolds").

Extraterrestrial Intelligence

Scanning Satellite receptions for possible transmissions from
other planets.

For more information see Stanford’s Folding@home and
UC Berkeley’s SETI@home projects. Both involve participation in a widely
distributed computer application.

Sources of Data for Mining

Databases (most obvious)

Text Documents

Computer Simulations

Social Networks

Privacy Concerns

Mining of public and government databases is done,
though people have raised, and continue to raise, concerns.

Wiki quote:

"data mining gives information that would not be
available otherwise. It must be properly interpreted
to be useful. When the data collected involves
individual people, there are many questions
concerning privacy, legality, and ethics."

Prevalence of Data Mining

Your data is already being mined, whether you like it or not.

Many web services require that you allow access to your information [for
data mining] in order to use the service.

Google mines email data in Gmail accounts to present account owners
with ads.

Facebook requires users to allow access to info from non-Facebook pages.
Facebook privacy policy:

"We may use information about you that we collect from other sources,
including but not limited to newspapers and Internet sources such as
blogs, instant messaging services and other users of Facebook, to
supplement your profile."

This allows access to your blog RSS feed (rather innocuous), as well as
information obtained through partner sites (worthy of concern).

Data Mining Controversies

Latest one: Facebook's Beacon Advertising program
(Just popped on Slashdot within the last week)

What Beacon does:

“when you engage in consumer activity at a
[Facebook] partner website, such as Amazon, eBay,
or the New York Times, not only will Facebook
record that activity, but your Facebook connections
will also be informed of your purchases or actions.”
[taken from

Controversies continued

Implications: “Thus where Facebook used to be collecting data only
within the confines of its own website, it will now extend that ability to
harvest data across other websites that it partners with. Some of the
companies that have signed on to participate on the advertising side
include Coca-Cola, Sony, Verizon, Comcast, eBay and the CBC. The
initial list of 44 partner websites participating on the data collection side
include the New York Times, Blockbuster, Amazon, eBay, LiveJournal,
and Epicurious.”

[Remember the privacy policy on the previous slide]

Verdict is still out. This may violate an old (100+ years) New York law
prohibiting advertising using endorsements without the endorsee’s

Facebook currently offers users no way to opt out of Beacon (once it has
been activated?). Users can close their accounts, but account data is never

Bottom Line

Data obtained through Data Mining is
incredibly valuable

Companies are understandably reluctant to
give up data they have obtained.

Expect to see prevalence of Data Mining and
(possibly subversive) methods increase in
years to come.

Recommended Resources and Works Consulted

Wikipedia Data Mining entry

"Privacy is Dead - Get Over It: Revisited"

Steve Rambam's Hope Number Six lecture

Facebook's Faux Pas

Beware of Facebook’s Beacon

Facebook Data Mining guide

Data Mining in Social Networks