Data Mining (ECS607U) Lecture 1 - Introduction

Oct 16, 2013

+

Data Mining (ECS607U)

Lecture 1 - Introduction

Dr. Timothy Hospedales

EECS, Queen Mary University of London

+

Overview and Learning Outcomes


Intro


What is Data Mining?


Why is it important?


Taxonomy of Data Mining Methods


Intuition for Algorithms


A Few Practical Issues


+

What is Data Mining?


…Ideas?


Look it up on the Internet?


“Databases”… “Big Data”


“Machine Learning”… “Statistics”


Vacuous marketing buzzword?


Related Disciplines: Stats, ML


Here: Applied Machine Learning


Concepts


How to apply it


Application areas


(ECS708P)

[Diagram: Machine Learning, Statistics, Data Mining, Databases, Distributed & HPC]

+

What is Data Mining (Machine Learning)?


“Looking for Patterns in Data”


“Solving problems by analyzing data already present in databases”


“Extracting knowledge from data”


“Inducing new knowledge from past experience”


“Automating automation”


“Eliminating the software bottleneck”


“Getting computers to program themselves”


“Let the data do the work”

+

What is Data Mining?


Traditional Computing

Input Data + Program -> Computer -> Output Data

Machine Learning

Input Data + (Output Data) -> Computer -> Program
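
The contrast between the two pictures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the temperature data, the labels "hot" and "mild", and the function names are all made up for the example, and the "learner" is a deliberately trivial threshold search.

```python
# Traditional computing: a person writes the program.
def handwritten_program(temperature_c):
    """Hand-coded rule: the programmer picks the threshold."""
    return "hot" if temperature_c > 25 else "mild"


# Machine learning: input data plus desired outputs go in, a program comes out.
def learn_threshold_program(inputs, outputs):
    """Return a new function whose threshold best matches the examples."""
    def hits(threshold):
        return sum(
            ("hot" if x > threshold else "mild") == y
            for x, y in zip(inputs, outputs)
        )

    best = max(sorted(set(inputs)), key=hits)
    return lambda x: "hot" if x > best else "mild"


# Hypothetical training examples (input data and the matching output data).
temps = [12, 18, 21, 27, 30, 33]
labels = ["mild", "mild", "mild", "hot", "hot", "hot"]

learned_program = learn_threshold_program(temps, labels)
print(learned_program(29))
```

Here `learned_program` plays the role of the "Program" box that comes out of the computer in the machine learning picture.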

+

Magic?


No. It’s a bit like gardening


Seeds = Algorithms


Nutrients = Data


Plants = Programs


Gardener = You

+

What is Data Mining?



Traditional Computing

Knowledge -> Programming -> Program

New Data -> Program -> Prediction / Action

Machine Learning

Assumptions + Data -> Learner -> Model / Program

New Data -> Model / Program -> Prediction / Action

+

Why Data Mining?


Society produces huge amounts of data


Sources: business, science, medicine, economics, geography, environment, sports, ...


Valuable resource to be exploited


Raw data is useless: need techniques to automatically extract information from it


Data: recorded facts


Information: patterns underlying the data


+

Examples

[Figures not preserved]

+

What is Data Mining


Extracting


Implicit


Previously Unknown


Useful


…information from data


Challenges:


Most patterns not interesting


Patterns may be inexact / noisy or spurious


Data may be garbled or missing


Machine learning / Data Mining Algorithms


Extract concepts (aka models, hypotheses) from examples

+

Can Machines Really Learn?


Definitions of “learning” from dictionary:


To get knowledge of by study, experience, or being taught


To become aware by information or from observation


To commit to memory


To be informed of, ascertain; to receive instruction


Operational definition:


Things learn when they change their behavior in a way that makes them perform better in the future.


Hard to Measure

+

What is Learned?


Models? (aka Concepts, Hypotheses)


Can be used to predict outcome in new situation


Can be used to understand the domain


+

A Simple Concrete Example

[Figure: data table annotated with "Attribute (or Feature)" and "Instance (or Example)"]

Instance:


Specific example to be classified, associated or clustered.


Characterized by a set of attributes


Attribute:


Each instance described by a set of attributes


Type: Nominal, Ordinal, …


Ordinal: Order defined (but not distance)


Nominal: Only equality defined

Concept/Hypothesis/Model:


Type of thing to be learned


E.g., rule to predict play for the day’s weather
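
As a rough illustration of these terms, the sketch below stores a handful of instances as dictionaries of attributes and checks one candidate rule against them. The attribute names and values are invented for the example and are not taken from the lecture's table; the rule is hand-written, not learned.

```python
# Each instance is one day, described by a set of nominal attributes.
# The values below are illustrative only.
instances = [
    {"outlook": "sunny", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "humidity": "normal", "windy": True, "play": "yes"},
    {"outlook": "overcast", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "rainy", "humidity": "normal", "windy": False, "play": "yes"},
]


def predict_play(day):
    """A candidate concept: a rule mapping the attributes to the target."""
    if day["outlook"] == "sunny" and day["humidity"] == "high":
        return "no"
    if day["outlook"] == "rainy" and day["windy"]:
        return "no"
    return "yes"


# Fraction of instances on which the rule agrees with the recorded target.
correct = sum(predict_play(day) == day["play"] for day in instances)
accuracy = correct / len(instances)
print(accuracy)
```

Only equality tests (`==`) are used on the attributes, which is all that nominal attributes support.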

+

When Should You Try (Supervised) Machine Learning?


Situations where there is no human expert?


X: A new molecule


F(x): Chemical effect of molecule


(There is no option to write a program)


Human can perform the task but can’t describe how


X: Picture of number plate


F(x): Character string of number plate


(Nobody knows how to write a program to do this)


Desired function is changing frequently


X: Stock prices and trades for last 10 days


F(x): Recommended trades


(Can’t write new programs quickly enough)


Each user needs a customized function f


X: Purchase history


F(x): Items recommended to buy


(Would be too many programs to write)

+

A Taxonomy

Machine Learning
  Supervised
  Unsupervised

Unsupervised


No Target


Understand, summarize, find patterns, explain

Supervised


Target Attribute


Find a rule that can predict the target attribute

+

Unsupervised Learning

Summarizing Instances

Too many instances


Summarize them by:


Clustering


Density Estimation


Save Memory

Understand the Domain
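
One way to summarize many instances by clustering is k-means, sketched below in plain Python. The lecture does not name a specific clustering algorithm on this slide, so k-means is an assumption made for illustration, as are the 2-D points and the starting centroids.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared distance).
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Six illustrative instances that form two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(centroids)
```

The two returned centroids stand in for the six original instances, which is the "save memory" and "understand the domain" idea in miniature.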

+

Unsupervised Learning

Summarizing Instances

Examples


Image Processing


Customer Profiling

+

Unsupervised Learning

Summarizing Dimensions

Too many dimensions


E.g., your data contains temperature in Deg C and Deg F.


Summarize them by eliminating redundant dimensions


Save Memory, processing time
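
The Deg C / Deg F case can be sketched with a simple correlation filter: a column is dropped when it is almost perfectly correlated with one already kept. This is a deliberately simple stand-in for dimension reduction, not a method named in the lecture; the column names, the data and the 0.999 threshold are assumptions for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def drop_redundant(columns, threshold=0.999):
    """Keep a column only if no already-kept column is (almost) a linear copy."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


temp_c = [10.0, 15.0, 20.0, 25.0, 30.0]
columns = {
    "temp_c": temp_c,
    "temp_f": [c * 9 / 5 + 32 for c in temp_c],  # same information as temp_c
    "humidity": [80.0, 60.0, 75.0, 50.0, 65.0],
}
reduced = drop_redundant(columns)
print(list(reduced))
```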

+

Supervised Learning

Find a rule that can predict target

[Figure: data table annotated with "Attribute (or Feature)", "Instance (or Example)" and "Target"]

Model: Decision Tree Classifier

+

Supervised Learning

Classification and Regression

Classification


Predict Nominal (Discrete) Target

Regression


Predict Numeric Target
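
The difference between the two tasks shows up in the type of the prediction. The sketch below uses a 1-nearest-neighbour classifier and a least-squares line purely as convenient examples; neither is prescribed by the slide, and the data is made up.

```python
def nearest_neighbour_classify(train, x):
    """Classification: return the nominal label of the closest training input."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for a numeric target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    return slope, my - slope * mx


# Classification: the target is a discrete label.
labelled = [(1.0, "small"), (2.0, "small"), (8.0, "large"), (9.0, "large")]
label = nearest_neighbour_classify(labelled, 7.0)

# Regression: the target is a number.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
prediction = slope * 4.0 + intercept
print(label, prediction)
```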

+

A Taxonomy

Machine Learning
  Supervised
    Classification
    Regression
  Unsupervised
    Dimension Reduction
    Clustering / Density Estimation
    Association Rules
    Outlier Detection

+

Machine Learning: A Construction Manual


Tens of thousands of algorithms exist


Hundreds new each year


Each algorithm has three components


(Feature Extraction)


Representation


Polynomial, Decision Tree, Support Vector Machine, HMM, …


Evaluation


Accuracy, Squared Error, Posterior Probability, …


Optimization


Combinatorial, Greedy, Gradient, …
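
To make the three components concrete, the sketch below picks one option from each list: a straight line as the representation, squared error as the evaluation, and gradient descent as the optimization. The data, step count and learning rate are assumptions chosen so the toy example converges; other combinations of the three components give other algorithms.

```python
def predict(params, x):
    """Representation: a straight line y = w * x + b."""
    w, b = params
    return w * x + b


def squared_error(params, data):
    """Evaluation: mean squared error of the line on the data."""
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)


def gradient_descent(data, steps=2000, learning_rate=0.05):
    """Optimization: repeatedly step against the gradient of the squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (predict((w, b), x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict((w, b), x) - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Illustrative data generated from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
params = gradient_descent(data)
print(params, squared_error(params, data))
```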

+

Machine Learning: Sketch Classification as Search


Each algorithm has three components


Representation: Linear


Evaluation: Accuracy


Optimization: Search



Algorithm Sketch:


Enumerate all straight lines


Eliminate lines that don’t fit the data (accuracy < 100%)


Surviving lines are acceptable models
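
The sketch above can be tried directly on a toy problem. Since "all straight lines" cannot literally be enumerated, the code below assumes a coarse integer grid of coefficients for lines of the form a*x + b*y + c = 0; the grid range and the six data points are illustrative choices.

```python
from itertools import product

# Toy 2-D data (illustrative): (x, y, label) with labels -1 and +1.
data = [(1, 1, -1), (2, 1, -1), (1, 2, -1), (4, 4, 1), (5, 4, 1), (4, 5, 1)]


def accuracy(line, data):
    """Fraction of points on the correct side of the line a*x + b*y + c = 0."""
    a, b, c = line
    hits = sum(
        (1 if a * x + b * y + c > 0 else -1) == label for x, y, label in data
    )
    return hits / len(data)


# "Enumerate all straight lines": here only a coarse grid of coefficients.
grid = range(-6, 7)
candidates = [line for line in product(grid, repeat=3) if line[:2] != (0, 0)]

# Eliminate lines that don't fit the data (accuracy < 100%).
survivors = [line for line in candidates if accuracy(line, data) == 1.0]
print(len(candidates), len(survivors))
```

Even on this tiny grid more than one line survives, so the search alone does not single out one model.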

+

Machine Learning: Sketch Classification as Search


Each algorithm has three components


Representation: Decision Tree


Evaluation: Accuracy


Optimization: Search


Algorithm Sketch:


Enumerate all combinations of rules


Eliminate rules that don’t fit the data (accuracy < 100%)


Surviving rules are acceptable models


Issues:


None or more than one rule may survive


Search space is exponentially large. More rule combinations than atoms.
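
A back-of-the-envelope count shows how quickly the rule space grows. The attribute value counts below are assumed for a small weather-style dataset, and the figure of 10^80 atoms is a commonly quoted rough estimate for the observable universe rather than a number from the lecture.

```python
from math import prod

# Assumed number of values per nominal attribute (illustrative).
values_per_attribute = {"outlook": 3, "temperature": 3, "humidity": 2, "windy": 2}
classes = 2  # play = yes / no

# A conjunctive rule tests each attribute for one value or ignores it.
conjunctions = prod(v + 1 for v in values_per_attribute.values())
rules = conjunctions * classes

# A candidate model is any subset of those rules.
rule_sets = 2 ** rules

ATOMS_ESTIMATE = 10 ** 80  # rough order of magnitude, stated as an assumption
print(rules, rule_sets > ATOMS_ESTIMATE)
```

With only four attributes the number of possible rule sets already exceeds that estimate, which is why exhaustive enumeration is not a practical optimization strategy.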


+

Feature Design


Sometimes you are given a database of suitable attributes.


Sometimes you have to choose what data to collect and put in the database.


E.g., Detect oil slicks in satellite images


Oil slicks are dark with various sizes and shapes.


Not easy, because lookalike regions can be caused by windy weather


Very hard to do reliably with raw image data, so extract attributes:


Size of region


Shape, area


Intensity


Jaggedness of boundary


Proximity of other regions
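
A minimal sketch of turning one candidate region into attributes is shown below. It assumes the region has already been segmented into a binary mask, and the particular definitions (edge-count perimeter, perimeter squared over area as "jaggedness") are illustrative choices, not the features used in the oil slick work the slide refers to.

```python
def region_features(mask, image):
    """Turn one candidate region into a few numeric attributes.

    mask[r][c] is 1 inside the region; image[r][c] is the pixel intensity.
    """
    cells = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    inside = set(cells)
    area = len(cells)
    mean_intensity = sum(image[r][c] for r, c in cells) / area
    # Perimeter: count cell edges that border a cell outside the region.
    perimeter = sum(
        (r + dr, c + dc) not in inside
        for r, c in cells
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    return {
        "area": area,
        "mean_intensity": mean_intensity,
        "jaggedness": perimeter ** 2 / area,  # larger for more ragged boundaries
    }


# A dark 2x2 region on a brighter background (made-up values).
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
image = [
    [90, 90, 90, 90],
    [90, 10, 20, 90],
    [90, 30, 20, 90],
    [90, 90, 90, 90],
]
features = region_features(mask, image)
print(features)
```

The resulting dictionary is one instance with three numeric attributes, ready to be fed to a classifier in place of the raw pixels.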

+

Practical Issues


Bias & Variance


Separability


Model Complexity Control


Scalability: CPU and Memory


Instances


Dimensions



Interpretability


Mixed Attributes


Missing values


Special, or at Random?


Inaccurate values


Sensor noise, typos, deliberate

+

Visualization


Histograms (nominal or numeric)


Graphs (numeric)


2D, 3D scatter plots (numeric)

+

Ethics


Anonymizing data is difficult.


Machine learning can be used to re-identify anonymized data. (E.g., 85% of the US population can be identified from ZIP code, DOB and gender)


Can be used to discriminate


Loan and insurance applications shouldn’t use gender, religion, race


Ethics depend on application


Gender, race ok for medicine


Correlated attributes


E.g., area code may correlate highly with race
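
The re-identification risk mentioned above can be checked on any table by counting how many records are unique on a few quasi-identifiers. The records below are entirely made up, and the function name and key names are hypothetical; the sketch only shows the counting idea.

```python
from collections import Counter

# Made-up "anonymized" records: names removed, quasi-identifiers kept.
records = [
    {"zip": "10001", "dob": "1980-02-14", "gender": "F"},
    {"zip": "10001", "dob": "1980-02-14", "gender": "M"},
    {"zip": "10002", "dob": "1975-07-01", "gender": "F"},
    {"zip": "10002", "dob": "1975-07-01", "gender": "F"},
    {"zip": "10003", "dob": "1990-11-30", "gender": "M"},
]


def unique_fraction(records, keys=("zip", "dob", "gender")):
    """Fraction of records whose combination of key values occurs only once."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if counts[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


fraction = unique_fraction(records)
print(fraction)
```

Any record counted as unique here could be linked to a named person by anyone holding another dataset with the same three fields.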

+

Software


Weka


Matlab

+

Module Organization


Lectures: Thursday 10:00-12:00


Labs: Tuesday 14:00-16:00


Assessment


Final Exam: 70%


Labs: 15%


Coursework: 15% (Week 7-10)


Lecturer


Dr. Timothy Hospedales


TA


Heng Yang

+

Communication

Course Materials


Book: Witten et al, Data Mining, Morgan Kaufmann, 2011.

More on QMPLUS: http://qmplus.qmul.ac.uk/


News, readings, labs, assignment info.


Check here first


Check regularly for news


Email:


Tim: tmh@eecs.qmul.ac.uk


TA: heng.yang@eecs.qmul.ac.uk
