Peter - Bioinformatics Unit

Oct 2, 2013 (3 years and 10 months ago)

Peter Bajcsy, PhD

Automated Learning Group

National Center for Supercomputing Applications

University of Illinois

pbajcsy@ncsa.uiuc.edu

September 10, 2002



Data Mining in Bioinformatics

2

Outline


Introduction


Interdisciplinary Problem Statement


Microarray Problem Overview


Microarray Data Processing


Image Analysis and Data Mining


Prior Knowledge


Data Mining Methods


Database and Optimization Techniques


Visualization


Validation


Summary

3

Introduction: Recommended Literature

1. Bioinformatics: The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001

2. Data Mining: Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001

3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001

4

Bioinformatics, Computational Biology, Data Mining


Bioinformatics is an interdisciplinary field concerned with the
information processing problems in computational biology and with a
unified treatment of the data mining methods for solving these
problems.


Computational Biology is about modeling real data and simulating
unknown data of biological entities, e.g.


Genomes (viruses, bacteria, fungi, plants, insects,…)


Proteins and Proteomes


Biological Sequences


Molecular Function and Structure


Data Mining is searching for knowledge in data


Knowledge mining from databases


Knowledge extraction


Data/pattern analysis


Data dredging


Knowledge Discovery in Databases (KDD)

5

Introduction: Problems in Bioinformatics Domain


Problems in Bioinformatics Domain


Data production at the levels of molecules, cells,
organs, organisms, populations


Integration of structure and function data, gene
expression data, pathway data, phenotypic and
clinical data, …


Prediction of Molecular Function and Structure


Computational biology: synthesis (simulations) and
analysis (machine learning)


6

MICROARRAY PROBLEM

7

Microarray Problem: Major Objective


Major Objective: Discover a comprehensive theory of
life’s organization at the molecular level


The major actors of molecular biology: the nucleic
acids, deoxyribonucleic acid (DNA) and
ribonucleic acid (RNA)


The central dogma of molecular biology: DNA is transcribed
into RNA, and RNA is translated into protein


Proteins are complex molecules built from 20
different amino acids.


8

Input and Output of Microarray Data Analysis


Input:
Laser image scans (data) and underlying experiment
hypotheses or experiment designs (prior knowledge)


Output:


Conclusions about the input hypotheses or knowledge
about statistical behavior of measurements


The theory of biological systems learnt automatically from
data (machine learning perspective)


Model fitting, Inference process


9

Overview of Microarray Problem

[Diagram labels: Biology Application Domain; Experiment Design and Hypothesis; Microarray Experiment; Image Analysis; Data Mining; Data Analysis; Statistics; Artificial Intelligence (AI); Knowledge discovery in databases (KDD); Data Warehouse; Validation]

10

Statistics Community


Random Variables


Statistical Measures


Probability and Probability Distribution


Confidence Interval Estimations


Test of Hypotheses


Goodness of Fit


Regression and Correlation Analysis


11

Artificial Intelligence (AI) Community


Issues:


Prior knowledge (e.g., invariance)


Model deviation from true model


Sampling distributions


Computational complexity


Model complexity (overfitting)


Design Cycle of Predictive Modeling:

Collect Data, Choose Features, Choose Model, Train Classifier, Evaluate Classifier

12

Knowledge Discovery in Databases (KDD) Community

Database

13

Microarray Data Mining and Image Analysis Steps


Image Analysis


Normalization


Grid Alignment


Spot Quality Assurance Control


Feature construction (selection and extraction)


Data Mining


Prior knowledge


Statistics


Machine learning


Pattern recognition


Database techniques


Optimization techniques


Visualization


Validation


Issues


Cross validation techniques



14

MICROARRAY IMAGE ANALYSIS

15

Microarray Image Analysis

16

DATA MINING OF MICROARRAY DATA

17

Why Data Mining? Sequence Example


Biology: Language and Goals


A gene can be defined as a region of DNA.


A genome is one haploid set of chromosomes with the genes
they contain.


Perform competent comparison of gene sequences across
species and account for inherently noisy biological
sequences due to random variability amplified by evolution


Assumption: if a gene has high similarity to another gene
then they perform the same function



Analysis: Language and Goals


Feature is an extractable attribute or measurement (e.g.,
gene expression, location)


Pattern recognition tries to characterize data patterns
(e.g., similar gene expressions, equidistant gene locations).


Data mining is about uncovering patterns, anomalies and
statistically significant structures in data (e.g., find two
similar gene expressions with confidence > x)

18

Types of Expected Data Mining and Analysis Results

Hypothetical Examples:


Binary answers using tests of hypotheses


Drug treatment is successful with a confidence level x.


Statistical behavior (probability distribution functions)


A class of genes with functionality X follows a Poisson
distribution.


Expected events


As the amount of treatment increases, the gene
expression level decreases.


Relationships


Expression level of gene A is correlated with expression
level of gene B under varying treatment conditions (gene A
and B are part of the same pathway).


Decision trees


Classification of a new gene sequence by a “domain
expert”.


19

PRIOR KNOWLEDGE

20

Prior Knowledge: Experiment Design


Microarray sources of systematic and random errors


Feature selection and variability


Expectations and Hypotheses


Data cleaning and transformations


Data mining method selection


Interpretation


[Flow diagram labels: Collect Data; Choose Features; Data Cleaning and Transformations; Choose Model and Data Mining Method]

21

Prior Knowledge from Experiment Design

Complexity Levels of Microarray Experiments:

1. Compare a single gene in a control situation versus a treatment situation


Example: Is the level of expression (up-regulated or down-regulated)
significantly different in the two situations? (drug design application)


Methods: t-test, Bayesian approach

2. Find multiple genes that share common functionalities


Example: Which related genes are dependent on one another?


Methods: Clustering (hierarchical, k-means, self-organizing maps,
neural network, support vector machines)

3. Infer the underlying gene and protein networks that are responsible
for the patterns and functional pathways observed


Example: What is the gene regulation at system level?


Directions: mining regulatory regions, modeling regulatory networks
on a global scale

Goal of Future Experiment Designs:

Understand biology at the system level,
e.g., gene networks, protein networks, signaling networks, metabolic
networks, immune system and neuronal networks.

22

Data Mining Techniques

Visualization

23

STATISTICS

24

Statistics

Descriptive Statistics: describe data

Inductive Statistics: make forecasts and inferences (e.g., are two sample sets identically distributed?)

25


Gene Expression Level in Control and Treatment Situations


Is the behavior of a single gene different in the control situation
than in the treatment situation?


Statistical t-test


m: sample mean


s: variance


Normalized distance t follows a Student distribution with f degrees of freedom.

If t > thresh then the control and treatment data populations are
considered to be different.
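The formula for the normalized distance did not survive on this slide. A minimal sketch, assuming it is the usual unequal-variance (Welch) form t = (m_c - m_t) / sqrt(s_c/n_c + s_t/n_t), with m the sample mean, s the sample variance and n the number of replicates. The function name, the sample values and the threshold are hypothetical and only illustrate the comparison.

```python
import math
from statistics import mean, variance


def normalized_distance(control, treatment):
    """Return (t, f): Welch's t statistic and its approximate degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    m_c, m_t = mean(control), mean(treatment)
    s_c, s_t = variance(control), variance(treatment)  # sample variances
    se2 = s_c / n_c + s_t / n_t                        # squared standard error
    t = (m_c - m_t) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom f
    f = se2 ** 2 / ((s_c / n_c) ** 2 / (n_c - 1) + (s_t / n_t) ** 2 / (n_t - 1))
    return t, f


# Hypothetical expression levels of one gene (four replicates each)
control = [1.0, 1.2, 0.9, 1.1]
treatment = [2.0, 2.3, 1.9, 2.2]
t, f = normalized_distance(control, treatment)

thresh = 2.5  # assumed threshold; in practice taken from the Student distribution with f degrees of freedom
populations_differ = abs(t) > thresh
print(t, f, populations_differ)
```

The sketch compares |t| with the threshold so that both up-regulation and down-regulation count as a difference.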

26

MACHINE LEARNING

AND

PATTERN RECOGNITION

27

Machine Learning

Supervised: Examples

Unsupervised: “Natural groupings”

Reinforcement

28

Pattern Recognition

[Diagram labels:]

Linear Correlation and Regression

Neural Networks

NN representation and gradient based optimization

NN representation and genetic algorithm based optimization

Statistical Models

Decision Trees

Locally Weighted Learning

k-nearest neighbors, support vectors

29

Unsupervised Learning and Clustering


A cluster is a collection of data objects that are similar to
one another within the same cluster and are dissimilar to
the objects in other clusters.


Examples of data objects: gene expression levels, sets of co-regulated genes
(pathways), protein structures


Categories of Clustering Methods


Partitioning Methods


Hierarchical Methods


Density-Based Methods

“Natural groupings”

30

Unsupervised Clustering: Partitioning Methods


K-means Algorithm partitions a set of n objects into k
clusters so that the resulting intra-cluster similarity is high
but the inter-cluster similarity is low.


Input: number of desired clusters k


Output: k labels assigned to n objects


Steps:

1. Select k initial cluster centers

2. Compute similarity as a distance between an object and
each cluster center

3. Assign a label to an object based on the minimum distance
(the most similar center)

4. Repeat for all objects

5. Re-compute the cluster centers as the mean of all objects
assigned to a given cluster

6. Repeat from Step 2 until objects do not change their
labels.

Example: Centroid-Based Technique
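The K-means steps can be sketched in a few lines. This is an illustration under assumptions, not the implementation behind the slides: Euclidean distance stands in for the similarity measure, and the names `k_means`, `initial_centers` and the sample points are hypothetical.

```python
import math
import random


def k_means(objects, k, initial_centers=None, seed=0, max_iter=100):
    """Cluster a list of equal-length tuples; return (labels, centers)."""
    if initial_centers is None:
        # Step 1: pick k of the objects at random as initial centers
        centers = random.Random(seed).sample(objects, k)
    else:
        centers = list(initial_centers)
    labels = [None] * len(objects)
    for _ in range(max_iter):
        # Steps 2-4: label every object with its nearest center
        new_labels = [
            min(range(k), key=lambda j: math.dist(x, centers[j]))
            for x in objects
        ]
        if new_labels == labels:  # Step 6: stop when no label changes
            break
        labels = new_labels
        # Step 5: move each center to the mean of its assigned objects
        for j in range(k):
            members = [x for x, lab in zip(objects, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Hypothetical two-feature measurements forming two groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = k_means(points, 2, initial_centers=[points[0], points[3]])
print(labels, centers)
```

The result depends on the initial centers, so random starts are usually repeated and the partition with the lowest within-cluster distance is kept.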

31

Unsupervised Clustering: Partitioning Methods


K-medoids Algorithm partitions a set of n objects into k
clusters so that it minimizes the sum of the dissimilarities
of all the objects to their nearest medoid.


Input: number of desired clusters k


Output: k labels assigned to n objects


Steps:

1. Select k initial objects as the initial medoids

2. Compute similarity as a distance between an object and
each cluster medoid

3. Assign a label to an object based on the minimum distance
(the most similar medoid)

4. Repeat for all objects

5. Randomly select a non-medoid object and swap it with the
current medoid if the swap would decrease the intra-cluster
square error

6. Repeat from Step 2 until objects do not change their
labels.

Example: Representative-Based Technique
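A minimal sketch of the K-medoids idea follows. It departs from Step 5 in one labeled way: instead of a random swap it scans all medoid/non-medoid swaps, which keeps the example deterministic. It also scores a swap by the sum of distances to the nearest medoid rather than the squared error. All names and data are hypothetical.

```python
import math


def total_cost(objects, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(x, objects[m]) for m in medoids) for x in objects)


def k_medoids(objects, k, initial_medoids=None):
    """Return (labels, medoids); medoids are indices into objects."""
    # Step 1: initial medoids (default: the first k objects)
    medoids = list(initial_medoids) if initial_medoids is not None else list(range(k))
    cost = total_cost(objects, medoids)
    improved = True
    while improved:  # Step 6: stop when no swap lowers the cost
        improved = False
        for i in range(k):
            for candidate in range(len(objects)):
                if candidate in medoids:
                    continue
                # Step 5 (exhaustive variant): try swapping medoid i for candidate
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(objects, trial)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    # Steps 2-4: label each object with its nearest medoid
    labels = [
        min(range(k), key=lambda j: math.dist(x, objects[medoids[j]]))
        for x in objects
    ]
    return labels, medoids


# Hypothetical one-feature measurements forming two groups
values = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
medoid_labels, medoids = k_medoids(values, 2, initial_medoids=[0, 1])
print(medoid_labels, medoids)
```

Because every medoid is an actual data object, an outlier pulls the cluster representative less than it pulls a K-means centroid.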

32

Unsupervised Clustering: Hierarchical Clustering


Hierarchical Clustering partitions a set of n objects into a tree
of clusters


Types of Hierarchical Clustering


Agglomerative hierarchical clustering


Bottom-up strategy of building clusters


Divisive hierarchical clustering


Top-down strategy of building clusters


33

Unsupervised Agglomerative Hierarchical Clustering


Agglomerative Hierarchical Clustering partitions a set of n
objects into a tree of clusters with a bottom-up strategy.


Steps:

1. Assign a unique label to each data object and form n clusters

2. Find the nearest clusters and merge them

3. Repeat Step 2 until the number of remaining clusters equals the
desired number of clusters.


Types of Agglomerative Hierarchical Clustering


The nearest neighbor algorithms (minimum or single-linkage algorithm, minimal
spanning tree)


The farthest neighbor algorithms (maximum or complete-linkage algorithm)
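The agglomerative steps can be sketched as follows, assuming Euclidean distance between objects. Passing `min` as the linkage gives the nearest neighbor (single-linkage) variant and `max` gives the farthest neighbor (complete-linkage) variant. The function name and the data are hypothetical.

```python
import math


def agglomerative(objects, num_clusters, linkage=min):
    """Merge clusters bottom-up; return clusters as lists of object indices."""
    # Step 1: every object starts in its own cluster
    clusters = [[i] for i in range(len(objects))]

    def cluster_distance(a, b):
        # min -> single linkage, max -> complete linkage
        return linkage(math.dist(objects[i], objects[j]) for i in a for j in b)

    # Step 3: repeat until the desired number of clusters remains
    while len(clusters) > num_clusters:
        # Step 2: find the two nearest clusters and merge them
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: cluster_distance(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical one-feature measurements
levels = [(0.0,), (1.0,), (1.5,), (8.0,), (9.0,)]
single = agglomerative(levels, 2, linkage=min)
complete = agglomerative(levels, 2, linkage=max)
print(single, complete)
```

Recording the merge order instead of stopping at a fixed count would give the full tree of clusters.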

34

Unsupervised Clustering: Density-Based Clustering


Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
aggregates objects into clusters if the objects are density connected.


Density connected objects:


Simplified explanation: P and Q are density connected if
there is an object O such that both P and Q are density
reachable from O.


Aggregate P and Q if they are density connected with
respect to the R-radius neighborhood and Minimum Object
criteria
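A compact sketch of density-based clustering in the DBSCAN style, assuming Euclidean distance. The parameters `radius` and `min_objects` play the roles of the R-radius neighborhood and the Minimum Object criterion; the names and the sample data are hypothetical, and an object counts as its own neighbor.

```python
import math

NOISE = -1


def dbscan(objects, radius, min_objects):
    """Return one label per object: a cluster id (0, 1, ...) or NOISE."""
    n = len(objects)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if math.dist(objects[i], objects[j]) <= radius]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_objects:
            labels[i] = NOISE  # not a core object; may later become a border object
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_objects:
                queue.extend(j_neighbors)  # j is a core object: keep expanding
        cluster += 1
    return labels


# Hypothetical two-feature measurements: two dense groups and one outlier
spots = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
density_labels = dbscan(spots, radius=1.5, min_objects=3)
print(density_labels)
```

Unlike K-means, the number of clusters is not an input; it follows from the two density parameters.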

35

Supervised Learning or Classification


Classification is a two-step process consisting of learning
classification rules followed by assignment of classification
label.

36

Supervised Learning: Decision Tree


Decision tree algorithm constructs a tree structure in a top-down
recursive divide-and-conquer manner

Car Insurance: Risk Assessment

Age    Car Type    Risk
23     family      High
17     sports      High
43     sports      High
68     family      Low
32     truck       Low
20     family      High

Decision tree (attributes at the nodes, answers on the branches):

Age < 25 ?
    yes: Risk: High
    no: Sports car ?
        yes: Risk: High
        no: Risk: Low

[Figure: Visualization of Decision Boundaries]
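The car insurance tree can be written as two nested tests and checked against the training table. The branch layout below is reconstructed from the slide's labels and table rows, so treat it as an assumption; the function name is hypothetical.

```python
def assess_risk(age, car_type):
    """Classify insurance risk with the two-level tree from the slide."""
    if age < 25:              # root attribute: Age < 25 ?
        return "High"
    if car_type == "sports":  # second attribute: Sports car ?
        return "High"
    return "Low"


# Training table from the slide: (Age, Car Type, Risk)
training_data = [
    (23, "family", "High"),
    (17, "sports", "High"),
    (43, "sports", "High"),
    (68, "family", "Low"),
    (32, "truck", "Low"),
    (20, "family", "High"),
]

misclassified = [row for row in training_data if assess_risk(row[0], row[1]) != row[2]]
print(misclassified)
```

An empty list means the tree reproduces every row of the table it was built from, which says nothing yet about new drivers.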

37

Supervised Learning: Bayesian Classification


Bayesian Classification is based on Bayes' theorem and can
predict class membership probabilities.


Bayes Theorem (X: data sample, H: hypothesis of data label)


P(H|X): posterior probability


P(H): prior probability


Classification: maximum a posteriori hypothesis
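Bayes' theorem states P(H|X) = P(X|H) P(H) / P(X), where P(X) is the sum of P(X|H) P(H) over all hypotheses. A small sketch of maximum a posteriori classification follows; the class names, priors and likelihoods are made-up numbers used only for illustration.

```python
def posteriors(priors, likelihoods):
    """Map each hypothesis H to P(H|X), given P(H) and P(X|H)."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(X)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical two-class problem for one observed sample X
priors = {"class_1": 0.25, "class_2": 0.75}      # P(H)
likelihoods = {"class_1": 0.8, "class_2": 0.2}   # P(X|H)

posterior = posteriors(priors, likelihoods)
map_label = max(posterior, key=posterior.get)    # maximum a posteriori hypothesis
print(posterior, map_label)
```

Here the less frequent class wins because the sample is much more likely under it than under the more frequent class.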

38

Statistical Models: Linear Discriminant


Linear Discriminant Functions form boundaries between
data classes.


Finding Linear Discriminant Functions is achieved by
minimizing a criterion error function.

Linear discriminant function

Quadratic discriminant function

Finding w coefficients:

- Gradient Descent Procedures

- Newton’s algorithm
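The discriminant formulas did not survive on this slide. A sketch, assuming the textbook linear form g(x) = w_0 + sum_i w_i x_i and the perceptron criterion J(w) = sum over misclassified samples of -y g(x) as the error function to minimize by gradient descent. The labels are +1 and -1; the names, learning rate and data are hypothetical.

```python
def g(w, x):
    """Linear discriminant g(x) = w[0] + sum_i w[i+1] * x[i]."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def train_linear_discriminant(samples, labels, eta=0.1, max_iter=1000):
    """Batch gradient descent on the perceptron criterion; labels are +1 or -1."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(max_iter):
        misclassified = [(x, y) for x, y in zip(samples, labels) if y * g(w, x) <= 0]
        if not misclassified:
            break  # every sample lies on the correct side of the boundary
        # The gradient of J with respect to w is -sum(y * [1, x]); step against it
        w[0] += eta * sum(y for _, y in misclassified)
        for i in range(len(w) - 1):
            w[i + 1] += eta * sum(y * x[i] for x, y in misclassified)
    return w


# Hypothetical linearly separable data in two features
samples = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5), (5.5, 6.0)]
classes = [-1, -1, -1, 1, 1, 1]
w = train_linear_discriminant(samples, classes)
print(w)
```

The sign of g(x) then assigns a new sample to one of the two classes. For data that are not linearly separable this loop simply runs until `max_iter`.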

39

Neural Networks


Neural network is a set of connected input/output units where each
connection has a weight associated with it.


Phase I: learning - adjust weights such that the network accurately
predicts class labels of the input samples


Phase II: classification - assign labels by passing an unknown sample
through the network


Steps:

1. Initialize weights from [-1, 1]

2. Propagate the inputs forward

3. Backpropagate the error

4. Terminate learning (training) if (a) delta w < thresh or (b) percentage of
misclassified samples < thresh or (c) max number of iterations has been
exceeded
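A minimal sketch of the training steps for a network with one hidden layer and sigmoid units. The architecture, learning rate, seed and toy data are assumptions made for illustration, and only stopping rules (b) and (c) are implemented.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_network(samples, targets, hidden=3, eta=0.5, max_epochs=5000,
                  max_error_rate=0.0, seed=1):
    """Train by online backpropagation; return (predict function, epochs used)."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    # Step 1: initial weights from [-1, 1]; the last weight of each unit is its bias
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(x):
        h = [sigmoid(sum(w * v for w, v in zip(row, list(x) + [1.0]))) for row in w_h]
        o = sigmoid(sum(w * v for w, v in zip(w_o, h + [1.0])))
        return h, o

    for epoch in range(1, max_epochs + 1):  # rule (c): cap on iterations
        for x, t in zip(samples, targets):
            h, o = forward(x)                                   # Step 2
            delta_o = (t - o) * o * (1 - o)                     # Step 3: output error
            delta_h = [delta_o * w_o[j] * h[j] * (1 - h[j]) for j in range(hidden)]
            for j, v in enumerate(h + [1.0]):
                w_o[j] += eta * delta_o * v
            for j in range(hidden):
                for i, v in enumerate(list(x) + [1.0]):
                    w_h[j][i] += eta * delta_h[j] * v
        errors = sum((forward(x)[1] >= 0.5) != bool(t) for x, t in zip(samples, targets))
        if errors / len(samples) <= max_error_rate:             # rule (b)
            break
    return (lambda x: forward(x)[1]), epoch


# Hypothetical toy task: logical AND of two binary inputs
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [0, 0, 0, 1]
predict, epochs_used = train_network(inputs, outputs)
print(epochs_used, [round(predict(x), 2) for x in inputs])
```

Phase II is then a single call to `predict` for an unknown sample, with 0.5 as the decision threshold.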

Interpretation

40

Support Vector Machines (SVM)


SVM algorithm finds a separating hyperplane with the largest
margin and uses it for classification of new samples

41

DATABASE TECHNIQUES

AND

OPTIMIZATION TECHNIQUES

42

Database Techniques


Database Design and Modeling (tables, procedures,
functions, constraints)


Database Interface to Data Mining System


Efficient Import and Export of Data


Database Data Visualization


Database Clustering for Access Efficiency


Database Performance Tuning (memory usage, query
encoding)


Database Parallel Processing (multiple servers and
CPUs)


Distributed Information Repositories (data warehouse)





43

Optimization Techniques


Highly nonlinear search space (global versus local
maxima)


Gradient based optimization


Genetic algorithm based optimization


Optimization with sampling


Large search space


Example: A genome with N genes can encode 2^N
states (each gene is either active or inactive; graded regulation
is not considered). Human genome ~ 2^30,000 patterns;
nematode genome ~ 2^20,000 patterns.
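To get a feeling for the size of these search spaces, the number of decimal digits of 2^N can be computed with logarithms. The gene counts are the ones quoted on the slide; the function name is hypothetical.

```python
import math


def decimal_digits_of_two_to_the(n):
    """Number of decimal digits in 2**n, computed without building the integer."""
    return math.floor(n * math.log10(2)) + 1


human_digits = decimal_digits_of_two_to_the(30000)
nematode_digits = decimal_digits_of_two_to_the(20000)
print(human_digits, nematode_digits)  # 9031 6021
```

A number with thousands of digits rules out exhaustive enumeration, which is why the slide lists sampling and heuristic optimization.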



44

VISUALIZATION

45

Visualization


Data:

3D cubes, distribution charts, curves, surfaces, link
graphs, image frames and movies, parallel coordinates


Results:
pie charts, scatter plots, box plots, association rules,
parallel coordinates, dendrograms, temporal evolution



Pie chart

Parallel coordinates

Temporal evolution

46

Novel Visualization of Features

Feature Selection and Visualization


Feature Selection

Mean Feature Image

47

Novel Visualization of Clustering Results

Isodata (K-means) Clustering

Class Labeling and Visualization


Mean Feature Image

Label Image

48

VALIDATION

49

Why Validation?


Validation type:


Within the existing data


With newly collected data




Errors and uncertainties:


Systematic or random errors


Unknown variables: number of classes


Noise level: statistical confidence due to noise


Model validity: error measure, model over-fit or under-fit


Number of data points: measurement replicas



Other issues


Experimental support of general theories


Exhaustive sampling is not feasible






50

Error Detection: Example of Spot Screening

Mask Image


No Screening

Mask Image


Location and Size Screening

Mask Image


SNR Screening

51

Cross Validation: Example


One-tier cross validation


Train on different data than test data


Two-tier cross validation


The score from one-tier cross validation is used by
the bias optimizer to select the best learning
algorithm parameters (# of control points). The
more you optimize, the more you over-fit. The
second tier measures the level of over-fit
(unbiased measure of accuracy).


Useful for comparing learning algorithms with
control parameters that are optimized.


Number of folds is not optimized.


Computational complexity:


#folds of top tier × #folds of bottom tier ×
#control points × CPU of algorithm
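The structure of two-tier cross validation can be sketched as two nested loops over folds. Everything here is an assumption made for illustration: the interleaved fold split, the `score` callback signature, and the toy scoring function that ignores the data and only logs how often the learning algorithm would run.

```python
def k_fold_splits(n, folds):
    """Yield (train, test) index lists for n items, with interleaved folds."""
    indices = list(range(n))
    for f in range(folds):
        test = indices[f::folds]
        train = [i for i in indices if i % folds != f]
        yield train, test


def two_tier_cv(n, outer_folds, inner_folds, control_points, score):
    """Return the top-tier score averaged over the outer folds."""
    outer_scores = []
    for outer_train, outer_test in k_fold_splits(n, outer_folds):

        def inner_score(p):
            # Bottom tier: cross validate parameter p inside the outer training set only
            total = 0.0
            for tr, te in k_fold_splits(len(outer_train), inner_folds):
                total += score([outer_train[i] for i in tr],
                               [outer_train[i] for i in te], p)
            return total / inner_folds

        best = max(control_points, key=inner_score)
        # Top tier: score the selected parameter on data never used to select it
        outer_scores.append(score(outer_train, outer_test, best))
    return sum(outer_scores) / outer_folds


call_log = []


def toy_score(train_idx, test_idx, p):
    """Hypothetical scorer: records the call and pretends p = 3 is the best setting."""
    call_log.append(p)
    return -abs(p - 3)


estimate = two_tier_cv(n=20, outer_folds=5, inner_folds=4,
                       control_points=[1, 2, 3, 4], score=toy_score)
print(estimate, len(call_log))
```

With 5 top-tier folds, 4 bottom-tier folds and 4 control points, the learner runs 5 × 4 × 4 = 80 times for parameter selection plus 5 times for the top-tier scores, matching the product in the complexity estimate up to the final evaluation runs.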


52

Summary


Bioinformatics and Microarray problem


Interdisciplinary Challenges: Terminology


Understanding Biology and Computer Science


Data mining and image analysis steps


Image Analysis


Experiment Design as Prior Knowledge


Expected Results of Data Mining


Which Data Mining Technique to Use?


Data Mining Challenges: Complexity, Data Size, Search Space




Validation


Confidence in Obtained Results?


Error Screening


Cross validation techniques



53

Backup