

Introduction to Bioinformatics: Lecture VIII

Classification and Supervised Learning


Jarek Meller
Division of Biomedical Informatics, Children’s Hospital Research Foundation
& Department of Biomedical Engineering, UC


Outline of the lecture

- Motivating story: correlating inputs and outputs
- Learning with a teacher
- Regression and classification problems
- Model selection, feature selection and generalization
- k-nearest neighbors and some other classification algorithms
- Phenotype fingerprints and their applications in medicine


Web watch: an on-line biology textbook by JW Kimball

Dr. J. W. Kimball's Biology Pages
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/

Story #1: B-cells and DNA editing, Apolipoprotein B and RNA editing
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/R/RNA_Editing.html#apoB_gene

Story #2: ApoB, cholesterol uptake, LDL and its endocytosis
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/Endocytosis.html#ldl



Complex patterns of mutations in genes related to cholesterol transport and uptake (e.g. LDLR, ApoB) may lead to an elevated level of LDL in the blood.


Correlations and fingerprints

Instead of relying on an often difficult-to-decipher underlying molecular model, one may simply try to find correlations between inputs and outputs. If measurements of certain attributes correlate with molecular processes, underlying genomic structures, phenotypes, disease states, etc., one can use such attributes as indicators of these “hidden” states and make predictions for new cases.

Consider, for example, elevated levels of low-density lipoprotein (LDL) particles in the blood as an indicator (fingerprint) of atherosclerosis.
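
As a minimal sketch of this idea in Python (NumPy and the made-up numbers are assumptions for illustration; this is not the study's data), one can check how strongly a single attribute correlates with a binary outcome:

    import numpy as np

    # Made-up LDL levels (mg/dL) and outcomes (1 = heart attack or stroke
    # within 5 years, 0 = healthy); illustration only, not real data.
    ldl = np.array([230, 250, 210, 120, 95, 110, 190, 100])
    outcome = np.array([1, 1, 1, 0, 0, 0, 1, 0])

    # Pearson correlation between attribute and outcome; a strong
    # correlation suggests the attribute can serve as a fingerprint.
    r = np.corrcoef(ldl, outcome)[0, 1]
    print(f"correlation(LDL, outcome) = {r:.2f}")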


Correlations and fingerprints: LDL example

[Figure: 3D scatter plot of simulated data. Healthy cases in blue; heart attack or stroke within 5 years of the exam in red. Axes: x = LDL, y = HDL, z = age. See the study by Westendorp et al., Arch Intern Med. 2003, 163(13):1549.]


LDL example: 2D projection


LDL example: regression with binary output and 1D projection for classification
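
The figure itself is not reproduced here. As a sketch of the idea it illustrates (NumPy and the made-up values are assumptions): fit a least-squares model to 0/1 class labels, which projects each patient onto a single score that can then be thresholded.

    import numpy as np

    # Made-up training data: rows are patients, columns are (LDL, HDL, age).
    X = np.array([[230.0, 60.0, 41.0],
                  [120.0, 50.0, 32.0],
                  [ 90.0, 70.0, 45.0],
                  [250.0, 40.0, 60.0]])
    y = np.array([0.0, 1.0, 1.0, 1.0])  # binary output: 0 = healthy, 1 = event

    # Least-squares fit of y ~ X w + b (a column of ones provides the bias b).
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)

    # The fitted values are a 1D projection of the data; threshold at 0.5.
    scores = A @ w
    print(scores, (scores > 0.5).astype(int))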


Unsupervised vs. supervised learning

In the case of unsupervised learning, the goal is to “discover” structure in the data and group (cluster) similar objects, given a similarity measure. In the case of supervised learning (or learning with a teacher), a set of examples with class assignments (e.g. healthy vs. diseased) is given, and the goal is to find a representation of the problem in some feature (attribute) space that provides a proper separation of the imposed classes. Such representations, with the resulting decision boundaries, may subsequently be used to make predictions for new cases.

[Figure: example of a feature space partitioned by decision boundaries into Class 1, Class 2 and Class 3.]
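
To make the contrast concrete, a minimal sketch (scikit-learn and the toy data are assumptions; the lecture does not prescribe a library): the same points are first clustered without labels, then classified with labels.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import KNeighborsClassifier

    # Toy 2D data: two well-separated blobs of points.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
    labels = np.array([0] * 20 + [1] * 20)

    # Unsupervised: discover groups from similarity alone (labels unused).
    clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    print("cluster sizes:", np.bincount(clusters))

    # Supervised: learn decision boundaries from labeled examples,
    # then predict the class of a new, unseen point.
    clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
    print("prediction for a new point:", clf.predict([[4.5, 5.2]]))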


Choice of the model, problem representation and feature selection: another simple example

[Figure: two toy examples, with classes F vs. M and adults vs. children and candidate features including height, weight, estrogen and testosterone; how well the classes separate depends on which features are chosen.]


Gene expression example again: JRA clinical classes

Picture: courtesy of B. Aronow


Advantages of prior knowledge, and problems with class assignment (e.g. in clinical practice) on the other hand

[Figure: protein families FixL, PYP and GLOBINS, with an unknown case marked “??”.]

Prior knowledge: members of the same class despite low sequence similarity, suggesting that a distance based on sequence similarity alone is not sufficient; adding structure-derived features might help (the “good model” question again).


Three phases in supervised learning protocols

- Training data: examples with class assignment are given.

- Learning:
  i) an appropriate model (or representation) of the problem needs to be selected in terms of attributes, distance measure and classifier type;
  ii) the adaptive parameters in the model need to be optimized to provide correct classification of the training examples (e.g. by minimizing the number of misclassified training vectors).

- Validation: cross-validation, independent control sets and other measures of “real” accuracy and generalization should be used to assess the success of the model and the training phase (finding a trade-off between accuracy and generalization is not trivial). A minimal sketch of such a protocol follows this list.
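
The sketch below walks through the three phases (scikit-learn and the synthetic data are assumptions, chosen for brevity):

    import numpy as np
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Phase 1: training data, synthetic feature vectors with class labels.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))             # e.g. standardized (LDL, HDL, age)
    y = (X[:, 0] - X[:, 1] > 0).astype(int)   # synthetic class assignment

    # Hold out an independent control set before any learning takes place.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Phase 2: learning, choose a model and fit its adaptive parameters.
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

    # Phase 3: validation, cross-validation plus the untouched control
    # set estimate the "real" accuracy and generalization of the model.
    print("cross-validation:", cross_val_score(model, X_train, y_train, cv=5).mean())
    print("control set:", model.score(X_test, y_test))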


Training set: LDL example again

A set of objects (here patients) x_i, i = 1, …, N, is given. For each patient, a set of features (attributes and the corresponding measurements on these attributes) is given too. Finally, for each patient we are given the class C_k, k = 1, …, K, that he/she belongs to.

Age   LDL   HDL   Sex   Class
41    230   60    F     healthy (0)
32    120   50    M     stroke within 5 years (1)
45    90    70    M     heart attack within 5 years (1)

{x_i, C_k}, i = 1, …, N


Optimizing adaptable parameters in the model

- Find a model y(x; w) that describes the objects of each class as a function of the features x and the adaptive parameters (weights) w.

- Prediction: given x (e.g. LDL = 240, age = 52, sex = male), assign the class C = ? (e.g. if y(x; w) > 0.5 then C = 1, i.e. the patient is likely to suffer from a stroke or heart attack in the next 5 years).
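
A minimal sketch of what optimizing w can look like, assuming a logistic model y(x; w) = 1/(1 + exp(-w·x)) trained by gradient descent (the lecture does not fix a particular model or optimizer):

    import numpy as np

    def y(X, w):
        # Logistic model: y(x; w) = 1 / (1 + exp(-x . w)), outputs in (0, 1).
        return 1.0 / (1.0 + np.exp(-X @ w))

    # Made-up standardized features (last column of ones acts as the bias).
    X = np.array([[ 1.2, -0.5, 1.0],
                  [-0.8,  0.3, 1.0],
                  [ 0.9, -1.1, 1.0],
                  [-1.5,  0.7, 1.0]])
    c = np.array([1, 0, 1, 0])

    # Gradient descent on the cross-entropy loss adjusts the weights w so
    # that the training examples are classified correctly.
    w = np.zeros(3)
    for _ in range(1000):
        w -= 0.1 * X.T @ (y(X, w) - c) / len(c)

    # Prediction for a new case: threshold y(x; w) at 0.5.
    x_new = np.array([1.0, -0.9, 1.0])
    print("C =", int(y(x_new, w) > 0.5))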


Examples of machine learning algorithms for classification and regression problems

- Linear perceptron, Least Squares
- LDA/FDA (Linear/Fisher Discriminant Analysis) (simple linear cuts, kernel non-linear generalizations)
- SVM (Support Vector Machines) (optimal, wide-margin linear cuts, kernel non-linear generalizations)
- Decision trees (logical rules)
- k-NN (k-Nearest Neighbors) (simple, non-parametric)
- Neural networks (general non-linear models, adaptivity, “artificial brain”)
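
Most of these have off-the-shelf implementations; a minimal sketch fitting several of them on the same synthetic data (scikit-learn and the data are assumptions):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic 2D data with a linear class boundary.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    classifiers = {
        "LDA": LinearDiscriminantAnalysis(),
        "SVM (linear)": SVC(kernel="linear"),
        "SVM (kernel)": SVC(kernel="rbf"),
        "decision tree": DecisionTreeClassifier(max_depth=3),
        "5-NN": KNeighborsClassifier(n_neighbors=5),
        "neural net": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
    }
    for name, clf in classifiers.items():
        print(name, clf.fit(X, y).score(X, y))   # training accuracy only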


Training accuracy vs. generalization


Model complexity, training set size and generalization
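
The original figures for these two slides are not reproduced here. A sketch that demonstrates the same phenomenon numerically (scikit-learn, decision trees and the noisy synthetic data are assumptions): training accuracy keeps climbing with model complexity, while accuracy on held-out data eventually stalls or drops.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Noisy synthetic data: a complex enough model starts fitting the noise.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(300, 2))
    y = ((X[:, 0] + X[:, 1] + rng.normal(0, 0.8, 300)) > 0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Grow the tree deeper (more complex) and compare train vs. test accuracy.
    for depth in (1, 2, 4, 8, 16):
        tree = DecisionTreeClassifier(max_depth=depth).fit(X_tr, y_tr)
        print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))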


Similarity measures
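
The slide's content is not reproduced here. As a sketch, three commonly used distance/similarity measures (the particular selection is an assumption):

    import numpy as np

    def euclidean(a, b):
        # L2 distance: sqrt of the sum of squared coordinate differences.
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):
        # L1 (city-block) distance: sum of absolute differences.
        return np.sum(np.abs(a - b))

    def correlation_distance(a, b):
        # 1 - Pearson correlation: small when the profiles co-vary,
        # a common choice for gene expression data.
        return 1.0 - np.corrcoef(a, b)[0, 1]

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([2.0, 2.5, 3.5])
    print(euclidean(a, b), manhattan(a, b), correlation_distance(a, b))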


k-nearest neighbors as a simple algorithm for classification

- Given a training set of N objects with known class assignment and k < N, find an assignment of new objects (not included in the training) to one of the classes, based on the assignments of their k nearest neighbors.

- A simple, non-parametric method that works surprisingly well, especially in the case of low-dimensional problems.

- Note, however, that the choice of the distance measure may again have a profound effect on the results.

- The optimal k is found by trial and error (see the sketch after this list).
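
A minimal sketch of that trial-and-error search for k, using cross-validation (scikit-learn and the synthetic data are assumptions):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(4)
    X = rng.normal(size=(150, 2))
    y = (X[:, 0] > 0).astype(int)   # synthetic labels

    # Try a range of k values and keep the one with the best
    # cross-validated accuracy.
    scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                 X, y, cv=5).mean()
              for k in (1, 3, 5, 7, 9, 15)}
    best_k = max(scores, key=scores.get)
    print(scores, "-> best k:", best_k)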


k-nearest neighbor algorithm

Step 1: Compute pairwise distances and take the k closest neighbors.

Step 2: Assign the class by simple majority voting: the new point belongs to the class with the most neighbors among the k.
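
A self-contained sketch of these two steps (Euclidean distance is an assumption; any of the similarity measures above could be substituted):

    import numpy as np
    from collections import Counter

    def knn_classify(X_train, y_train, x_new, k):
        # Step 1: compute distances from x_new to every training point
        # and take the k closest neighbors.
        distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
        neighbors = y_train[np.argsort(distances)[:k]]
        # Step 2: majority vote among the k neighbors.
        return Counter(neighbors).most_common(1)[0][0]

    # Made-up training set and query point.
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_classify(X_train, y_train, np.array([4.8, 5.1]), k=3))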