+
Data Mining (ECS607U)
Lecture 1

Introduction
Dr. Timothy Hospedales
EECS, Queen Mary University of London
+
Overview and Learning Outcomes
Intro
What is Data Mining?
Why is it important?
Taxonomy of Data Mining Methods
Intuition for Algorithms
A Few Practical Issues
+
What is Data Mining?
…Ideas?
Look it up on the Internet?
“Databases”… “Big Data”
“Machine Learning”… “Statistics”
Vacuous marketing buzzword?
Related Disciplines: Stats, ML
Here: Applied Machine Learning
Concepts
How to apply it
Application areas
[Diagram of related fields: (ECS708P) Machine Learning, Statistics, Data Mining, Databases, Distributed & HPC]
+
What is Data Mining (Machine Learning)?
“Looking for Patterns in Data”
“Solving problems by analyzing data already present in databases”
“Extracting knowledge from data”
“Inducing new knowledge from past experience”
“Automating automation”
“Eliminating the software bottleneck”
“Getting computers to program themselves”
“Let the data do the work”
+
What is Data Mining?
Traditional Computing: Input Data + Program → Computer → Output Data
Machine Learning: Input Data + (Output Data) → Computer → Program
+
Magic?
No. It’s a bit like gardening
Seeds = Algorithms
Nutrients = Data
Plants = Programs
Gardener = You
+
What is Data Mining?
Traditional Computing: Knowledge → Programming → Program; New Data → Program → Prediction / Action
Machine Learning: Assumptions + Data → Learner → Model / Program; New Data → Model / Program → Prediction / Action
+
Why Data Mining?
Society produces huge amounts of data
Sources: business, science, medicine, economics, geography,
environment, sports, ...
Valuable resource to be exploited
Raw data is useless: need techniques to automatically extract information from it
Data: recorded facts
Information: patterns underlying the data
+
Examples
+
What is Data Mining?
Extracting implicit, previously unknown, useful information from data
Challenges:
Most patterns not interesting
Patterns may be inexact / noisy or spurious
Data may be garbled or missing
Machine learning / Data Mining Algorithms
Extract concepts (aka models, hypotheses) from examples
+
Can Machines Really Learn?
Definitions of “learning” from dictionary:
To get knowledge of by study, experience, or being taught
To become aware by information or from observation
To commit to memory
To be informed of, ascertain; to receive instruction
Operational definition:
Things learn when they change their behavior in a way that makes them perform better in the future.
Hard to Measure
+
What is Learned?
Models? (aka Concepts, Hypotheses)
Can be used to predict outcome in new situation
Can be used to understand the domain
+
A Simple Concrete Example
[Figure: data table; each column is an attribute (or feature), each row is an instance (or example)]
Instance:
• Specific example to be classified, associated or clustered.
• Characterized by a set of attributes
Attribute:
• Each instance described by a set of attributes
• Type: Nominal, Ordinal, …
• Ordinal: Order defined
• Nominal: Only equality defined
Concept/Hypothesis/Model:
• Type of thing to be learned
• E.g., rule to predict play for the day’s weather
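The terms above can be made concrete with a small sketch. The instances, attribute names, and the rule below are hypothetical illustrations, not the table from the slide: each dictionary is one instance, each key is an attribute, and `predict_play` is one candidate concept.

```python
# Hypothetical weather instances: each dict is one instance (row),
# each key is a nominal attribute (column). "play" is what we want to predict.
instances = [
    {"outlook": "sunny",    "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "overcast", "humidity": "high",   "windy": True,  "play": "yes"},
    {"outlook": "rainy",    "humidity": "normal", "windy": True,  "play": "no"},
    {"outlook": "rainy",    "humidity": "normal", "windy": False, "play": "yes"},
]


def predict_play(instance):
    """One candidate concept (model): a hand-written rule over the attributes."""
    if instance["outlook"] == "sunny":
        return "yes" if instance["humidity"] == "normal" else "no"
    if instance["outlook"] == "rainy":
        return "no" if instance["windy"] else "yes"
    return "yes"  # overcast


# Fraction of instances for which the rule agrees with the recorded outcome.
correct = sum(predict_play(x) == x["play"] for x in instances)
accuracy = correct / len(instances)
print(f"rule accuracy on the toy data: {accuracy:.2f}")
```

Here the rule was written by hand; a learning algorithm would search for such a rule from the instances.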
+
When Should You Try (Supervised) Machine Learning?
Situations where there is no human expert
x: A new molecule
f(x): Chemical effect of molecule
(There is no option to write a program)
Human can perform the task but can’t describe how
x: Picture of number plate
f(x): Character string of number plate
(Nobody knows how to write a program to do this)
Desired function is changing frequently
x: Stock prices and trades for last 10 days
f(x): Recommended trades
(Can’t write new programs quickly enough)
Each user needs a customized function f
x: Purchase history
f(x): Recommended purchases
(Would be too many programs to write)
+
A Taxonomy
Machine Learning: Supervised or Unsupervised
Supervised
• Target attribute
• Find a rule that can predict the target attribute
Unsupervised
• No target
• Understand, summarize, find patterns, explain
+
Unsupervised Learning: Summarizing Instances
Too many instances
Summarize them by:
• Clustering
• Density Estimation
Save Memory
Understand the Domain
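As a rough illustration of summarizing instances by clustering, the sketch below runs a plain k-means loop in NumPy. The data are synthetic and the function name `kmeans` and its parameters are assumptions made for this example; the slides do not prescribe a particular clustering algorithm.

```python
import numpy as np


def kmeans(points, k, n_iters=20, seed=0):
    """Minimal k-means: returns k cluster centers and a label per point."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, labels


# Synthetic data: two well-separated groups of 100 instances each.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(100, 2)),
])

centers, labels = kmeans(points, k=2)
print("200 instances summarized by 2 centers:")
print(centers)
```

Storing two centers instead of 200 instances is the "save memory" idea; inspecting the centers is one way to "understand the domain".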
+
Unsupervised Learning: Summarizing Instances
Examples
• Image Processing
• Customer Profiling
+
Unsupervised Learning: Summarizing Dimensions
Too many dimensions
• E.g., Your data contains temperature in Deg C and Deg F.
Summarize them by eliminating redundant dimensions
• Save Memory, processing time
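The Deg C / Deg F case can be sketched as follows. The readings are made up, and dropping columns by a correlation threshold is only one simple way to remove redundant dimensions, assumed here for illustration.

```python
import numpy as np

# Hypothetical readings: the second column is just the first in other units.
temp_c = np.array([12.0, 15.5, 19.0, 23.5, 30.0])
temp_f = temp_c * 9.0 / 5.0 + 32.0
humidity = np.array([80.0, 65.0, 70.0, 40.0, 55.0])
data = np.column_stack([temp_c, temp_f, humidity])

# Pairwise correlation between columns (dimensions).
corr = np.corrcoef(data, rowvar=False)

# Greedy pass: keep a column only if it is not (almost) perfectly
# correlated with a column we have already kept.
keep = []
for j in range(data.shape[1]):
    if all(abs(corr[j, i]) < 0.999 for i in keep):
        keep.append(j)

reduced = data[:, keep]
print("kept columns:", keep)
print("shape before:", data.shape, "after:", reduced.shape)
```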
+
Supervised Learning
Find a rule that can predict the target
[Figure: data table labeled with Attribute (or Feature), Instance (or Example), and Target]
Model: Decision Tree Classifier
+
Supervised Learning: Classification and Regression
Classification: Predict Nominal (Discrete) Target
Regression: Predict Numeric Target
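The difference between the two tasks can be sketched on one numeric attribute. The data and target names below are hypothetical, and the two models (a least-squares line and a 1-nearest-neighbour rule) are just convenient choices for illustration.

```python
import numpy as np

# Hypothetical training data with one numeric attribute (temperature, deg C).
temperature = np.array([10.0, 14.0, 18.0, 22.0, 26.0, 30.0])
ice_cream_sales = np.array([20.0, 28.0, 36.0, 44.0, 52.0, 60.0])  # numeric target
play_outside = np.array(["no", "no", "no", "yes", "yes", "yes"])  # nominal target

# Regression: fit a straight line by least squares, predict a number.
slope, intercept = np.polyfit(temperature, ice_cream_sales, deg=1)


def predict_sales(t):
    return slope * t + intercept


# Classification: copy the label of the nearest training instance,
# so the prediction is always one of the discrete classes.
def predict_play(t):
    nearest = np.abs(temperature - t).argmin()
    return str(play_outside[nearest])


print("regression output (a number):", predict_sales(20.0))
print("classification output (a class):", predict_play(25.0))
```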
+
A Taxonomy
Machine Learning
Supervised: Classification, Regression
Unsupervised: Dimension Reduction, Clustering / Density Estimation, Association Rules, Outlier Detection
+
Examples
+
Machine Learning: A Construction Manual
Tens of thousands of algorithms exist
Hundreds of new ones each year
Each algorithm has three components
(Feature Extraction)
Representation: Polynomial, Decision Tree, Support Vector Machine, HMM, …
Evaluation: Accuracy, Squared Error, Posterior Probability, …
Optimization: Combinatorial, Greedy, Gradient, …
+
Machine Learning: Sketch Classification as Search
Each algorithm has three components
Representation: Linear
Evaluation: Accuracy
Optimization: Search
Algorithm Sketch:
Enumerate all straight lines
Eliminate lines that don’t fit the data (accuracy < 100%)
Surviving lines are acceptable models
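The algorithm sketch above could look like this in code. There are infinitely many straight lines, so this version enumerates only a coarse grid of directions and offsets; the grid, the toy points, and the labels are all assumptions made for illustration.

```python
import itertools

import numpy as np

# Toy 2D data: three instances of class -1, three of class +1.
points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
                   [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])
labels = np.array([-1, -1, -1, 1, 1, 1])

# Representation: a line w . x + b = 0, with w a unit vector at angle theta.
angles = np.linspace(0.0, 2.0 * np.pi, 36, endpoint=False)  # every 10 degrees
offsets = np.linspace(-8.0, 8.0, 33)                        # steps of 0.5

survivors = []
for theta, b in itertools.product(angles, offsets):
    w = np.array([np.cos(theta), np.sin(theta)])
    # Points on the positive side of the line are predicted as +1.
    predictions = np.where(points @ w + b > 0, 1, -1)
    # Evaluation: accuracy on the data.
    accuracy = np.mean(predictions == labels)
    # Optimization: exhaustive search, keep only lines that fit perfectly.
    if accuracy == 1.0:
        survivors.append((theta, b))

print(f"{len(survivors)} of {len(angles) * len(offsets)} candidate lines survive")
```

On separable data like this, many lines survive, which is why an acceptable model is not unique.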
+
Machine Learning: Sketch Classification as Search
Each algorithm has three components
Representation: Decision Tree
Evaluation: Accuracy
Optimization: Search
Algorithm Sketch:
Enumerate all combinations of rules
Eliminate rules that don’t fit the data (accuracy < 100%)
Surviving rules are acceptable models
Issues:
No rule, or more than one rule, may survive
Search space is exponentially large: more rule combinations than atoms in the universe.
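A back-of-the-envelope count shows how quickly the search space grows. The sketch assumes one particular way of counting (every distinct yes/no labeling of all instances over n binary attributes is a separate concept) and uses 10^80 as a rough, commonly quoted order of magnitude for the number of atoms in the observable universe; neither figure comes from the slides.

```python
# Rough order-of-magnitude estimate, assumed for illustration only.
ATOMS_IN_UNIVERSE = 10 ** 80


def num_boolean_concepts(n_attributes):
    """Distinct yes/no labelings of all 2**n instances over n binary attributes."""
    return 2 ** (2 ** n_attributes)


# Find the smallest number of binary attributes for which there are
# more candidate concepts than atoms.
n = 1
while num_boolean_concepts(n) <= ATOMS_IN_UNIVERSE:
    n += 1

print("binary attributes needed:", n)
print("digits in the concept count:", len(str(num_boolean_concepts(n))))
```

Under this way of counting, nine binary attributes are already enough, so exhaustive enumeration is not a practical optimization strategy.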
+
Feature Design
Sometimes you are given a database of suitable attributes.
Sometimes you have to choose what data to collect and put in the database.
E.g., Detect oil slicks in satellite images
Oil slicks are dark with various sizes and shapes.
Not easy, because lookalike regions can be caused by windy weather
Very hard to do reliably with raw image data, so extract attributes:
Size of region
Shape, area
Intensity
Jaggedness of boundary
Proximity of other regions
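A minimal sketch of turning a candidate dark region into a few of the attributes listed above. The function `region_features`, the synthetic image, and the jaggedness measure (perimeter squared over area) are illustrative assumptions, not the method used in the oil slick study; proximity to other regions is left out.

```python
import numpy as np


def region_features(image, mask):
    """image: 2D array of intensities; mask: 2D bool array marking one region."""
    area = int(mask.sum())
    rows, cols = np.nonzero(mask)
    height = int(rows.max() - rows.min() + 1)
    width = int(cols.max() - cols.min() + 1)
    mean_intensity = float(image[mask].mean())

    # Perimeter: count region-pixel edges that face a non-region pixel
    # (4-neighbourhood). Padding treats everything outside the image as background.
    padded = np.pad(mask, 1, constant_values=False)
    neighbours = (padded[:-2, 1:-1], padded[2:, 1:-1],
                  padded[1:-1, :-2], padded[1:-1, 2:])
    perimeter = sum(int((mask & ~shifted).sum()) for shifted in neighbours)

    return {
        "area": area,
        "height": height,
        "width": width,
        "mean_intensity": mean_intensity,
        "perimeter": perimeter,
        # Larger values suggest a more irregular boundary for the same area.
        "jaggedness": perimeter ** 2 / area,
    }


# Synthetic 6x6 "image": bright background with one dark 3x4 rectangle.
image = np.full((6, 6), 0.8)
image[1:4, 1:5] = 0.2
mask = image < 0.5  # crude thresholding to find the dark region

features = region_features(image, mask)
print(features)
```

The resulting dictionary is one instance; its keys are the designed attributes a learner would see instead of raw pixels.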
+
Practical Issues
Bias & Variance
Separability
Model Complexity Control
Scalability (CPU and Memory): Instances, Dimensions
Interpretability
Mixed Attributes
Missing values: Special, or at Random?
Inaccurate values: Sensor noise, typos, deliberate errors
+
Visualization
Histograms (nominal or numeric)
Graphs (numeric)
2D, 3D scatter plots (numeric)
+
Ethics
Anonymizing data is difficult.
Machine learning can be used to re-identify anonymized data.
(E.g., 85% of the US population can be identified from ZIP code, DOB and gender)
Can be used to discriminate
Loan and insurance applications shouldn’t use gender, religion, race
Ethics depend on application
Gender, race ok for medicine
Correlated attributes
E.g., area code may correlate highly with race
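Why removing names is not enough can be sketched by counting how many records are unique on ZIP code, date of birth, and gender alone. The records below are invented for illustration.

```python
from collections import Counter

# Hypothetical "anonymized" records: names removed, but
# (ZIP code, date of birth, gender) kept.
records = [
    ("02139", "1980-03-14", "F"),
    ("02139", "1980-03-14", "F"),
    ("02139", "1975-11-02", "M"),
    ("90210", "1992-07-30", "F"),
    ("90210", "1988-01-19", "M"),
]

# How many records share each combination of the three attributes?
counts = Counter(records)

# A record whose combination occurs once can be linked to a named person
# by anyone holding another dataset with the same three attributes.
unique_records = [r for r in records if counts[r] == 1]
fraction_unique = len(unique_records) / len(records)

print(f"{fraction_unique:.0%} of records are unique on (ZIP, DOB, gender)")
```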
+
Software
Weka
Matlab
+
Module Organization
Lectures: Thursday 10:00–12:00
Labs: Tuesday 14:00–16:00
Assessment
Final Exam: 70%
Labs: 15%
Coursework: 15% (Weeks 7–10)
Lecturer: Dr. Timothy Hospedales
TA: Heng Yang
+
Communication
Course Materials
Book: Witten et al, Data Mining, Morgan Kaufmann, 2011.
More on QMPLUS: http://qmplus.qmul.ac.uk/
• News, readings, labs, assignment info.
• Check here first
• Check regularly for news
Email:
• Tim: tmh@eecs.qmul.ac.uk
• TA: heng.yang@eecs.qmul.ac.uk