Classification by Machine Learning Approaches



Michael J. Kerner


kerner@cbs.dtu.dk

Center for Biological Sequence Analysis

Technical University of Denmark


Outline

- Introduction to Machine Learning
- Datasets, Features
- Feature Selection
- Machine Learning Approaches (Classifiers)
- Model Evaluation and Interpretation
- Examples, Exercise


Machine Learning: Data-Driven Prediction

To learn: "to gain knowledge or understanding of or skill in by study, instruction, or experience"
(Merriam-Webster English Dictionary, 2005)


Machine Learning: learning the theory automatically from the data, through a process of inference, model fitting, or learning from examples: the automated extraction of useful information from a body of data by building good probabilistic models.

Ideally suited for areas with lots of data in the absence of a general theory.

Why do we need Machine Learning?

- Some tasks cannot be defined well, except by examples (e.g. recognition of faces or people).
- Large amounts of data may have hidden relationships and correlations; only automated approaches may be able to detect these.
- The amount of knowledge about a certain problem / task may be too large for explicit encoding by humans (e.g. in medical diagnostics).
- Environments change over time, and new knowledge is constantly being discovered; a continuous redesign of the systems "by hand" may be difficult.

The Machine Learning Approach

Input data (e.g. gene expression profiles, ...) -> Machine Learning (ML) -> Classifier -> Prediction: Yes / No

Machine Learning

- Learning task: What do we want to learn or predict?
- Data and assumptions: What data do we have available? What is their quality? What can we assume about the given problem?
- Representation: What is a suitable representation of the examples to be classified?
- Method and estimation: Are there possible hypotheses? Can we adjust our predictions based on the given results?
- Evaluation: How well does the method perform? Might another approach/model perform better?

Learning Tasks

- Classification: prediction of an item class.
- Forecasting: prediction of a parameter value.
- Characterization: finding hypotheses that describe groups of items.
- Clustering: partitioning of the (unassigned) dataset into clusters with common properties (unsupervised learning).

Emergence of Large Datasets

Dataset examples:

- Image processing
- Spam email detection
- Text mining
- DNA microarray data
- Protein function
- Protein localization
- Protein-protein interaction

Dataset Examples

- Edible or poisonous?
- mRNA splicing, mRNA splice site prediction

Protein Function Prediction: ProtFun

- Predict as many biologically relevant features as we can from the sequence
- Train artificial neural networks for each category
- Assign a probability for each category from the NN outputs

############## ProtFun 2.2 predictions ########
>KCNA1_HUMAN

# Functional category                Prob   Odds
   Amino_acid_biosynthesis           0.042  1.893
   Biosynthesis_of_cofactors         0.119  1.654
   Cell_envelope                     0.031  0.507
   Cellular_processes                0.027  0.373
   Central_intermediary_metabolism   0.046  0.731
   Energy_metabolism                 0.036  0.395
   Fatty_acid_metabolism             0.019  1.485
   Purines_and_pyrimidines           0.214  0.879
   Regulatory_functions              0.013  0.083
   Replication_and_transcription     0.019  0.073
   Translation                       0.129  2.925
=> Transport_and_binding             0.717  1.748

# Enzyme/nonenzyme                   Prob   Odds
   Enzyme                            0.231  0.807
=> Nonenzyme                         0.769  1.078

# Enzyme class                       Prob   Odds
   Oxidoreductase (EC 1.-.-.-)       0.040  0.193
   Transferase    (EC 2.-.-.-)       0.056  0.163
   Hydrolase      (EC 3.-.-.-)       0.062  0.195
   Lyase          (EC 4.-.-.-)       0.020  0.430
   Isomerase      (EC 5.-.-.-)       0.010  0.321
   Ligase         (EC 6.-.-.-)       0.017  0.326

# Gene Ontology category             Prob   Odds
   Signal_transducer                 0.061  0.284
   Receptor                          0.055  0.323
   Hormone                           0.001  0.206
   Structural_protein                0.002  0.086
   Transporter                       0.469  4.299
   Ion_channel                       0.207  3.633
=> Voltage-gated_ion_channel         0.280 12.736
   Cation_channel                    0.348  7.560
   Transcription                     0.163  1.270
   Transcription_regulation          0.166  1.331
   Stress_response                   0.011  0.125
   Immune_response                   0.031  0.370
   Growth_factor                     0.005  0.372
   Metal_ion_transport               0.159  0.345

("=>" marks the predicted category in each block.)


Complexity of datasets:

- Many instances (examples)
- Instances with multiple features (properties / characteristics)
- Dependencies between the features (correlations)

Data Preprocessing

Instance selection:

- Remove identical / inconsistent / incomplete instances (e.g. reduction of homologous genes, removal of wrongly annotated genes)

Feature transformation / selection:

- Projection techniques (e.g. principal components analysis)
- Compression techniques (e.g. minimum description length)
- Feature selection techniques

Benefits of Feature Selection

- Attain good, and often even better, classification performance using a small subset of features (less noise in the data)
- Provide more cost-effective classifiers: fewer features to take into account, smaller datasets, faster classifiers
- Identification of (biologically) relevant features for the given problem
Feature Selection

Filter approach:
All features -> feature subset selection -> learning algorithm -> optimal features

Wrapper approach:
All features -> feature subset search algorithm (selection criterion evaluated by the learning algorithm) -> selected features -> evaluation -> optimal features

Filter Approach

- Independent of the classification model
- A relevance measure for each feature is calculated
- Features with a value lower than a selected threshold t will be removed

Example: feature-class entropy, which measures the "uncertainty" about the class when observing feature i.

f1 f2 f3 f4 | class     f1 f2 f3 f4 | class
 1  0  1  1 |   1        1  0  0  0 |   0
 0  1  1  0 |   1        0  0  1  0 |   0
 1  0  1  0 |   1        1  1  0  1 |   0
 0  1  0  1 |   1        0  1  0  1 |   0
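The feature-class entropy measure can be computed directly for the toy table above. Note that for entropy, a lower value means the feature tells us more about the class, so this sketch keeps features whose conditional entropy falls below the threshold t:

```python
import math

# Toy dataset from the slide: four binary features plus a binary class.
data = [
    # f1 f2 f3 f4 class
    (1, 0, 1, 1, 1), (1, 0, 0, 0, 0),
    (0, 1, 1, 0, 1), (0, 0, 1, 0, 0),
    (1, 0, 1, 0, 1), (1, 1, 0, 1, 0),
    (0, 1, 0, 1, 1), (0, 1, 0, 1, 0),
]

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def conditional_entropy(feature_idx):
    """H(class | feature): expected class entropy after observing the feature."""
    n = len(data)
    h = 0.0
    for value in (0, 1):
        subset = [row[-1] for row in data if row[feature_idx] == value]
        if subset:
            h += len(subset) / n * entropy(subset)
    return h

# Filter step: score every feature, keep the informative ones.
t = 0.9
scores = {f"f{i+1}": conditional_entropy(i) for i in range(4)}
selected = [f for f, h in scores.items() if h < t]
```

On this table only f3 survives the filter: observing f3 reduces the class uncertainty from 1 bit to about 0.81 bits, while f1, f2 and f4 leave it unchanged.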

Wrapper approach

- Specific to a classification algorithm
- The search for a good feature subset is guided by a search algorithm
- The algorithm uses the evaluation of the classifier as a guide to find good feature subsets
- Search algorithm examples: sequential forward or backward search, genetic algorithms

Sequential backward elimination:

- Starts with the set of all features
- Iteratively discards the feature whose removal results in the best classification performance

Wrapper approach example (numbers are classification performance):

Full feature set: f1,f2,f3,f4
  Remove one: f2,f3,f4 (0.7) | f1,f3,f4 (0.8) | f1,f2,f4 (0.1) | f1,f2,f3 (0.75)  -> keep f1,f3,f4
  Remove one: f3,f4 (0.85) | f1,f4 (0.1) | f1,f3 (0.8)                            -> keep f3,f4
  Remove one: f4 (0.2) | f3 (0.7)                                                 -> keep f3
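The greedy loop described above can be sketched in a few lines. Here the subset scores are hard-coded from the slide's example rather than produced by a real classifier; in practice a hypothetical evaluate(subset) call would train and score the classifier on each candidate:

```python
# Subset scores from the example (stand-ins for classifier evaluations).
scores = {
    frozenset({"f2", "f3", "f4"}): 0.7,  frozenset({"f1", "f3", "f4"}): 0.8,
    frozenset({"f1", "f2", "f4"}): 0.1,  frozenset({"f1", "f2", "f3"}): 0.75,
    frozenset({"f3", "f4"}): 0.85,       frozenset({"f1", "f4"}): 0.1,
    frozenset({"f1", "f3"}): 0.8,
    frozenset({"f4"}): 0.2,              frozenset({"f3"}): 0.7,
}

def backward_elimination(features):
    """Greedy sequential backward search guided by subset scores."""
    current = frozenset(features)
    trace = []
    while len(current) > 1:
        # Evaluate every subset obtained by dropping a single feature ...
        candidates = [current - {f} for f in current]
        # ... and keep the one with the best classification performance.
        current = max(candidates, key=lambda s: scores[s])
        trace.append((sorted(current), scores[current]))
    return trace

trace = backward_elimination(["f1", "f2", "f3", "f4"])
```

The search path matches the example: {f1,f3,f4} at 0.8, then {f3,f4} at 0.85, then {f3} at 0.7. A practical implementation would also remember the best subset seen overall ({f3,f4} here) instead of only the final one.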

Classification Methods

- Decision trees
- Hidden Markov Models (HMMs)
- Support vector machines
- Artificial neural networks
- Bayesian methods
- ...

Decision Trees

- Simple, practical and easy to interpret
- Given a set of instances (with a set of features), a tree is constructed with internal nodes as the features and the leaves as the classes

Example dataset: shall we play golf?

Instance | Attributes / Features                         | Class
day        outlook    temperature  humidity  windy         Play Golf?
1          sunny      hot          high      FALSE         no
2          sunny      hot          high      TRUE          no
3          overcast   hot          high      FALSE         yes
4          rainy      mild         high      FALSE         yes
5          rainy      cool         normal    FALSE         yes
6          rainy      cool         normal    TRUE          no
7          overcast   cool         normal    TRUE          yes
8          sunny      mild         high      FALSE         no
9          sunny      cool         normal    FALSE         yes
10         rainy      mild         normal    FALSE         yes
11         sunny      mild         normal    TRUE          yes
12         overcast   mild         high      TRUE          yes
13         overcast   hot          normal    FALSE         yes
14         rainy      mild         high      TRUE          no
today      sunny      cool         high      TRUE          ?

Example: Shall we play golf today?

WEKA data file (arff format):

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no



Feature compositions

[Decision tree figure: branches over outlook (sunny / overcast / rainy), temperature (hot / cool / mild), humidity (high / normal) and windy (True / False), ending in YES / NO leaves]

Decision Trees

J48 pruned tree
------------------
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8

(Internal nodes are attributes / features, branches are attribute values, leaves are classes.)
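The same tree structure can be recovered with a minimal ID3-style learner (splitting on information gain, the same entropy idea as in the filter example). This is a simplified stand-in, not J48 itself, which implements the C4.5 algorithm with pruning; on this small dataset both pick outlook at the root:

```python
import math
from collections import Counter

ATTRS = ["outlook", "temperature", "humidity", "windy"]
DATA = [  # the 14 training days from the golf table
    ("sunny","hot","high","FALSE","no"),    ("sunny","hot","high","TRUE","no"),
    ("overcast","hot","high","FALSE","yes"),("rainy","mild","high","FALSE","yes"),
    ("rainy","cool","normal","FALSE","yes"),("rainy","cool","normal","TRUE","no"),
    ("overcast","cool","normal","TRUE","yes"),("sunny","mild","high","FALSE","no"),
    ("sunny","cool","normal","FALSE","yes"),("rainy","mild","normal","FALSE","yes"),
    ("sunny","mild","normal","TRUE","yes"), ("overcast","mild","high","TRUE","yes"),
    ("overcast","hot","normal","FALSE","yes"),("rainy","mild","high","TRUE","no"),
]

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def best_attribute(rows, attrs):
    """Pick the attribute index with the highest information gain."""
    def gain(i):
        n = len(rows)
        remainder = sum(
            len(sub) / n * entropy(sub)
            for v in {r[i] for r in rows}
            for sub in [[r for r in rows if r[i] == v]]
        )
        return entropy(rows) - remainder
    return max(attrs, key=gain)

def build(rows, attrs):
    labels = {r[-1] for r in rows}
    if len(labels) == 1:
        return labels.pop()  # pure leaf: all rows share one class
    if not attrs:            # no attributes left: majority vote
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    i = best_attribute(rows, attrs)
    node = {"attr": ATTRS[i], "branches": {}}
    for v in {r[i] for r in rows}:
        sub = [r for r in rows if r[i] == v]
        node["branches"][v] = build(sub, [a for a in attrs if a != i])
    return node

def classify(tree, instance):
    while isinstance(tree, dict):
        tree = tree["branches"][instance[ATTRS.index(tree["attr"])]]
    return tree

tree = build(DATA, list(range(4)))
# "today" (sunny, cool, high, TRUE): sunny -> humidity = high -> "no"
prediction = classify(tree, ("sunny", "cool", "high", "TRUE"))
```

The learned tree answers the "today" row from the table: sunny outlook and high humidity lead to the "no" leaf, matching the J48 output.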

Artificial Neural Networks (ANNs)

[Figures: an artificial neuron; a neural network]

Overfitting

Overfitting: a classifier that performs well on the training examples, but poorly on new examples.

Training and testing on the same data will generally produce a classifier that looks good (on this dataset) but is highly overfitted.

To avoid overfitting:

- Use separate training and testing data
- Use cross-validation
- Use the simplest model possible

Performance Evaluation

Cross-validation (10-fold): the data are split into 10 parts; in each of 10 rounds, 9/10 of the data serve as the training set for the ML classifier and the remaining 1/10 as the test set.
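The 10-fold scheme can be sketched in plain Python. The evaluate callback here is a hypothetical stand-in for training on the 9/10 split and scoring on the held-out 1/10; the dummy lambda only demonstrates the mechanics:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Split n instance indices into k disjoint folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(evaluate, n, k=10):
    """Train on k-1 folds, test on the held-out fold, k times; average."""
    folds = k_fold_indices(n, k)
    results = []
    for i, test in enumerate(folds):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        results.append(evaluate(train, test))
    return sum(results) / k

# Dummy evaluator standing in for "fit on train, score on test".
acc = cross_validate(lambda train, test: 0.9, n=140, k=10)
```

Each instance appears in exactly one test fold, so every prediction used in the average comes from a classifier that never saw that instance during training.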

Performance Evaluation

Confusion Matrix

TP  True Positives
TN  True Negatives
FP  False Positives
FN  False Negatives

                 Predicted Label
                 positive   negative
Known  positive  TP         FN
Label  negative  FP         TN

Performance Evaluation

- Precision (PPV) = TP / (TP + FP)
  Percentage of correct positive predictions

- Recall / Sensitivity = TP / (TP + FN)
  Percentage of positively labeled instances also predicted as positive

- Specificity = TN / (TN + FP)
  Percentage of negatively labeled instances also predicted as negative

- Accuracy = (TP + TN) / (TP + TN + FP + FN)
  Percentage of correct predictions

- Correlation Coefficient
  cc = (TP * TN - FP * FN) / sqrt( (TP+FP) * (FP+TN) * (TN+FN) * (FN+TP) )
  with -1 <= cc <= 1:
  cc =  1 : no FP or FN
  cc =  0 : random prediction
  cc = -1 : only FP and FN
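All five measures follow directly from the four confusion-matrix counts. A small sketch (the example counts are invented for illustration):

```python
import math

def performance(tp, tn, fp, fn):
    """Compute the standard measures from confusion-matrix counts."""
    precision   = tp / (tp + fp)
    sensitivity = tp / (tp + fn)                  # recall
    specificity = tn / (tn + fp)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    # Matthews correlation coefficient, as in the formula above.
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    return precision, sensitivity, specificity, accuracy, cc

# Hypothetical evaluation: 40 TP, 45 TN, 5 FP, 10 FN
p, sn, sp, acc, cc = performance(40, 45, 5, 10)
```

Note that a real implementation should guard against zero denominators (e.g. a classifier that never predicts positive makes TP + FP = 0).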

ROC - Receiver Operating Characteristic

A ROC curve plots the True Positive Rate (Sensitivity, TP / (TP + FN)) against the False Positive Rate (1 - Specificity, FP / (FP + TN)).
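A ROC curve is traced by sweeping a decision threshold over the classifier's output scores and recording one (FPR, TPR) point per threshold. The scores and labels below are made-up illustration data:

```python
def roc_points(scores, labels):
    """One (FPR, TPR) point per distinct score threshold, high to low."""
    p = sum(labels)           # number of positive instances
    n = len(labels) - p       # number of negative instances
    pts = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / n, tp / p))
    return pts

# Hypothetical classifier scores for 3 positives (1) and 3 negatives (0).
pts = roc_points([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0, 0])
```

Lowering the threshold admits more predictions as positive, so both rates rise monotonically; the last point, where everything is called positive, is always (1, 1).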

Case Study - Splice Site Prediction

Splice site prediction:

- Correctly identify the borders of introns and exons in genes (splice sites)
- Important for gene prediction
- Split up into 2 tasks: donor prediction (exon -> intron) and acceptor prediction (intron -> exon)

Case Study - Splice Site Prediction

- Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence: GT at donor sites, AG at acceptor sites
- Classification problem: distinguish between true GT, AG and false GT, AG.

Case Study - Splice Site Prediction

- Position dependent features: e.g. an A on position 1, a C on position 17, ...
- Position independent features: e.g. subsequence "TCG" occurs, "GAG" occurs, ...

Example sequence (donor site dinucleotide in upper case):

atcgatcagtatcgatGTctgagctatgag
(position dependent: bases at positions 1, 2, 3, ..., 17, 28; position independent: occurrences of tcg and gag anywhere in the sequence)

Features: Original Data

Human acceptor splice sites (FASTA):

>HUMGLUT4B_3535

GGGCCCCTAGCGGAAGGAAAAAAATCATGGTTCCATGTGACATGCTGTGTCTTTGTGTCTGCCTGTTCAGGATGGGGAACCCCCTCAGCA

>HUMGLUT4B_3763

GAGGACAGGTGTCTCGGGGGTGGTGGAAAGGGGACGGTCTGCAGGAAATCTGTCCTCTGCTGTCCCCCAGGTGATTGAACAGAGCTACAA

>HUMGLUT4B_4028

TGGGGGAAACAGGAAGGGAGCCACTGCTGGGTGCCCTCACCCTCACAGCCTCACTCTGTCTGCCTGCCAGGAAAAGGGCCATGCTGGTCA

>HUMGLUT4B_4276

TGGGCTTTCAGATGGGAATGGACACCTGCCCTCAGCCCTCTCTTCTTCCCTCGCCCAGGGCTGACATCAGGGCTGGTGCCCATGTACGTG

>HUMGLUT4B_4507

ATATGGTGGGCTTCCAAGGTAAGGCAGAAGGGCTGAGTGACCTGCCTTCTTTCCCAACCTTCTCCCACAGGTGCTGGGCTTGGAGTCCCT

>HUMGLUT4B_4775

GCCTCCGCCTCATCTTGCTAGCACCTGGCTTCCTCTCAGGTCCCCTCAGGCCTGACCTTCCCTTCTCCAGGTCTGAAGCGCCTGACAGGC

>HUMGLUT4B_5125

CCAGCCTGTTGTGGCTGGAGTAGAGGAAGGGGCATTCCTGCCATCACTTCTTCTTCTCCCCCACCTCTAGGTTTTCTATTATTCGACCAG

>HUMGLUT4B_5378

CCTCACCCACGCGGCCCCTCCTACTTCCCGTGCCCAAAAGGCTGGGGTCAAGCTCCGACTCTCCCCGCAGGTGTTGTTGGTGGAGCGGGC

>HUMGLUT4B_5995

CTGAGTTGAGGGCAAGGGAAGATCAGAAAGGCCTCAACTGGATTCTCCACCCTCCCTGTCTGGCCCCTAGGAGCGAGTTCCAGCCATGAG

>HUMGLUT4B_6716

CTGGTTGCCTGAAACTACCCCTTCCCTCCCCACCTCACTCCGTCAACACCTCTTTCTCCACCTGTCCCAGGAGGCTATGGGGCCCTACGT

>HSRPS6G_1493

CTTTGTAGATGGCTCTACAATTACCTGTATAGATAGTTTCGTAAACTATTTCCCCCCTTTTAATCCTTAGCTGAACATCTCCTTCCCAGC

[...]

Arff Data File - WEKA

@RELATION splice-train

@ATTRIBUTE -68_A {0,1}
@ATTRIBUTE -68_T {0,1}
@ATTRIBUTE -68_C {0,1}
@ATTRIBUTE -68_G {0,1}
@ATTRIBUTE -67_A {0,1}
@ATTRIBUTE -67_T {0,1}
@ATTRIBUTE -67_C {0,1}
@ATTRIBUTE -67_G {0,1}
[...]
@ATTRIBUTE 20_A {0,1}
@ATTRIBUTE 20_T {0,1}
@ATTRIBUTE 20_C {0,1}
@ATTRIBUTE 20_G {0,1}
@ATTRIBUTE class {true,false}

@DATA
0,0,0,1,0,0,0,1, [...] ,1,0,0,0,true
0,0,0,1,1,0,0,0, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,0,0,0,1,true
[...]
1,0,0,0,0,1,0,0, [...] ,0,1,0,0,true
0,0,0,1,0,0,1,0, [...] ,0,0,1,0,true
0,0,1,0,0,0,1,0, [...] ,0,0,0,1,true
0,0,1,0,0,0,1,0, [...] ,0,0,1,0,true


The original sequence files in FASTA
format have been converted to represent
the four DNA bases in a binary fashion



A:


1 0 0 0

T:


0 1 0 0

C:


0 0 1 0

G:


0 0 0 1
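The base-to-bits table above is straightforward to apply in code. A minimal sketch (the encode helper is illustrative, not part of WEKA or the course tools):

```python
# Binary encoding of DNA bases: each base becomes four 0/1 features.
CODE = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0),
        "C": (0, 0, 1, 0), "G": (0, 0, 0, 1)}

def encode(seq):
    """Flatten a DNA sequence into position-dependent binary features."""
    return [bit for base in seq.upper() for bit in CODE[base]]

features = encode("ATCG")  # 4 bases -> 16 binary features
```

An 88-nucleotide window therefore yields 88 x 4 = 352 binary features, which is exactly the attribute count of the arff file above.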

Case Study - Splice Site Prediction

- Local context of 88 nucleotides around the splice site
- 88 position dependent features, encoded as A=1000, T=0100, C=0010, G=0001: 352 binary features
- Reduce the dataset to contain fewer but relevant features: 352 binary features -> 15 binary features

Case Study - Splice Site Sequence Logos

[Sequence logo figures for acceptor sites (positions -18 ... +1) and donor sites (positions -3 ... +4) around the conserved dinucleotide]

Exercise:

- Building a prediction tool for human mRNA splice sites
- Feature selection for classification of splice sites
- Tool: the WEKA machine learning toolkit

Go to http://www.cbs.dtu.dk/~kerner/GeneDisc_Course_2007_MJK/ and follow the instructions.


Acknowledgements

Slides and exercises adapted from and inspired by:

- Søren Brunak
- David Gilbert, Aik Choon Tan
- Yvan Saeys