Peter Bajcsy, PhD
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
pbajcsy@ncsa.uiuc.edu
September 10, 2002
Data Mining in Bioinformatics
2
Outline
•
Introduction
—
Interdisciplinary Problem Statement
—
Microarray Problem Overview
•
Microarray Data Processing
—
Image Analysis and Data Mining
—
Prior Knowledge
—
Data Mining Methods
—
Database and Optimization Techniques
—
Visualization
•
Validation
•
Summary
3
Introduction: Recommended Literature
1. Bioinformatics: The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001
2. Data Mining: Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001
3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001
4
Bioinformatics, Computational Biology, Data Mining
•
Bioinformatics is an interdisciplinary field about the
information processing problems in computational biology and a
unified treatment of the data mining methods for solving these
problems.
•
Computational Biology is about modeling real data and simulating
unknown data of biological entities, e.g.
—
Genomes (viruses, bacteria, fungi, plants, insects,…)
—
Proteins and Proteomes
—
Biological Sequences
—
Molecular Function and Structure
•
Data Mining is searching for knowledge in data
—
Knowledge mining from databases
—
Knowledge extraction
—
Data/pattern analysis
—
Data dredging
—
Knowledge Discovery in Databases (KDD)
5
Introduction: Problems in Bioinformatics Domain
•
Problems in Bioinformatics Domain
—
Data production at the levels of molecules, cells,
organs, organisms, populations
—
Integration of structure and function data, gene
expression data, pathway data, phenotypic and
clinical data, …
—
Prediction of Molecular Function and Structure
—
Computational biology: synthesis (simulations) and
analysis (machine learning)
6
MICROARRAY PROBLEM
7
Microarray Problem: Major Objective
•
Major Objective: Discover a comprehensive theory of
life’s organization at the molecular level
—
The major actors of molecular biology: the nucleic
acids, DeoxyriboNucleic Acid (DNA) and RiboNucleic Acid (RNA)
—
The central dogma of molecular biology
Proteins are very complicated molecules built from 20
different amino acids.
8
Input and Output of Microarray Data Analysis
•
Input:
Laser image scans (data) and underlying experiment
hypotheses or experiment designs (prior knowledge)
•
Output:
—
Conclusions about the input hypotheses or knowledge
about statistical behavior of measurements
—
The theory of biological systems learnt automatically from
data (machine learning perspective)
–
Model fitting, Inference process
9
Overview of Microarray Problem
[Diagram: overview of the microarray problem. Labels: Biology Application Domain; Experiment Design and Hypothesis; Microarray Experiment; Image Analysis; Data Analysis; Data Mining; Data Warehouse; Validation; Statistics; Artificial Intelligence (AI); Knowledge discovery in databases (KDD)]
10
Statistics Community
•
Random Variables
•
Statistical Measures
•
Probability and Probability Distribution
•
Confidence Interval Estimations
•
Test of Hypotheses
•
Goodness of Fit
•
Regression and Correlation Analysis
11
Artificial Intelligence (AI) Community
•
Issues:
—
Prior knowledge (e.g., invariance)
—
Model deviation from true model
—
Sampling distributions
—
Computational complexity
—
Model complexity (overfitting)
Design Cycle of Predictive Modeling: Collect Data → Choose Features → Choose Model → Train Classifier → Evaluate Classifier
12
Knowledge Discovery in Databases (KDD) Community
Database
13
Microarray Data Mining and Image Analysis Steps
•
Image Analysis
—
Normalization
—
Grid Alignment
—
Spot Quality Assurance Control
—
Feature construction (selection and extraction)
•
Data Mining
—
Prior knowledge
—
Statistics
—
Machine learning
—
Pattern recognition
—
Database techniques
—
Optimization techniques
—
Visualization
•
Validation
—
Issues
—
Cross validation techniques
14
MICROARRAY IMAGE
ANALYSIS
15
Microarray Image Analysis
16
DATA MINING OF
MICROARRAY DATA
17
Why Data Mining? Sequence Example
•
Biology: Language and Goals
•
A gene can be defined as a region of DNA.
•
A genome is one haploid set of chromosomes with the genes
they contain.
•
Perform competent comparison of gene sequences across
species and account for inherently noisy biological
sequences due to random variability amplified by evolution
•
Assumption: if a gene has high similarity to another gene
then they perform the same function
•
Analysis: Language and Goals
•
Feature is an extractable attribute or measurement (e.g.,
gene expression, location)
•
Pattern recognition tries to characterize data patterns
(e.g., similar gene expressions, equidistant gene locations).
•
Data mining is about uncovering patterns, anomalies and
statistically significant structures in data (e.g., find two
similar gene expressions with confidence > x)
18
Types of Expected Data Mining and Analysis Results
Hypothetical Examples:
•
Binary answers using tests of hypotheses
—
Drug treatment is successful with a confidence level x.
•
Statistical behavior (probability distribution functions)
—
A class of genes with functionality X follows a Poisson
distribution.
•
Expected events
—
As the amount of treatment increases, the gene
expression level decreases.
•
Relationships
—
Expression level of gene A is correlated with expression
level of gene B under varying treatment conditions (gene A
and B are part of the same pathway).
•
Decision trees
—
Classification of a new gene sequence by a “domain
expert”.
19
PRIOR KNOWLEDGE
20
Prior Knowledge: Experiment Design
• Microarray sources of systematic and random errors
• Feature selection and variability
• Expectations and Hypotheses
• Data cleaning and transformations
• Data mining method selection
• Interpretation
Collect Data → Choose Features → Data Cleaning and Transformations → Choose Model and Data Mining Method
21
Prior Knowledge from Experiment Design
Complexity Levels of Microarray Experiments:
1. Compare a single gene in a control situation versus a treatment situation
• Example: Is the level of expression (up-regulated or down-regulated) significantly different in the two situations? (drug design application)
• Methods: t-test, Bayesian approach
2. Find multiple genes that share common functionalities
• Example: Which related genes depend on one another?
• Methods: Clustering (hierarchical, k-means, self-organizing maps, neural network, support vector machines)
3. Infer the underlying gene and protein networks that are responsible for the patterns and functional pathways observed
• Example: What is the gene regulation at system level?
• Directions: mining regulatory regions, modeling regulatory networks on a global scale
Goal of Future Experiment Designs:
Understand biology at the system level,
e.g., gene networks, protein networks, signaling networks, metabolic
networks, immune system and neuronal networks.
22
Data Mining Techniques
Visualization
23
STATISTICS
24
Statistics
• Descriptive Statistics: describe data
• Inductive Statistics: make forecasts and inferences (e.g., are two sample sets identically distributed?)
25
• Gene Expression Level in Control and Treatment situations
• Is the behavior of a single gene different in the Control situation than in the Treatment situation?
Statistical t-test
• m: sample mean
• s: variance
The normalized distance t between the control and treatment sample means follows a Student distribution with f degrees of freedom.
If t > thresh then the control and treatment data populations are considered to be different.
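The slide's formula for the normalized distance did not survive extraction. The sketch below assumes the unequal-variance (Welch) form, t = (m_c - m_t) / sqrt(s_c/n_c + s_t/n_t), with f approximated by the Welch-Satterthwaite expression. The function name, the sample values, and the threshold are illustrative assumptions, not taken from the lecture.

```python
import math
import statistics


def welch_t(control, treatment):
    """Return (t, f): normalized distance between the sample means and
    the Welch-Satterthwaite approximation of the degrees of freedom."""
    n_c, n_t = len(control), len(treatment)
    m_c, m_t = statistics.mean(control), statistics.mean(treatment)
    # s denotes the sample variance, as on the slide
    s_c, s_t = statistics.variance(control), statistics.variance(treatment)
    se2 = s_c / n_c + s_t / n_t
    t = (m_c - m_t) / math.sqrt(se2)
    f = se2 ** 2 / ((s_c / n_c) ** 2 / (n_c - 1) + (s_t / n_t) ** 2 / (n_t - 1))
    return t, f


# Hypothetical expression levels of one gene (made-up numbers)
control = [10.1, 9.8, 10.3, 10.0, 9.9]
treatment = [12.0, 11.7, 12.4, 12.1, 11.8]

t, f = welch_t(control, treatment)
THRESH = 2.365  # assumed: two-sided 5% critical value of Student's t for 7 degrees of freedom
different = abs(t) > THRESH
```

The comparison uses |t| so that both up-regulation and down-regulation are detected; a one-sided test would compare t itself against the threshold.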
26
MACHINE LEARNING
AND
PATTERN RECOGNITION
27
Machine Learning
• Supervised (learning from examples)
• Unsupervised (“natural groupings”)
• Reinforcement
28
Pattern Recognition
[Diagram: Pattern Recognition approaches]
• Linear Correlation and Regression
• Neural Networks
• Statistical Models
• Decision Trees
• Locally Weighted Learning
• NN representation and gradient based optimization
• NN representation and genetic algorithm based optimization
• k-nearest neighbors, support vectors
29
Unsupervised Learning and Clustering
•
A cluster is a collection of data objects that are similar to
one another within the same cluster and are dissimilar to
the objects in other clusters.
•
Examples of data objects:
—
gene expression levels, sets of co-regulated genes
(pathways), protein structures
•
Categories of Clustering Methods
—
Partitioning Methods
—
Hierarchical Methods
—
Density-Based Methods
“Natural groupings”
30
Unsupervised Clustering: Partitioning Methods
• K-means Algorithm partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low.
• Input: number of desired clusters k
• Output: k labels assigned to n objects
• Steps:
1. Select k initial cluster centers
2. Compute the distance (dissimilarity) between an object and each cluster center
3. Assign to the object the label of the cluster center at minimum distance
4. Repeat for all objects
5. Re-compute each cluster center as the mean of all objects assigned to that cluster
6. Repeat from Step 2 until objects do not change their labels.
Example: Centroid-Based Technique
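The K-means steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming Euclidean distance and caller-supplied initial centers; the function name and the toy points are hypothetical.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Label each point with the index of its nearest center and
    re-estimate centers until the labels stop changing."""
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Steps 2-4: nearest center (minimum Euclidean distance) for every object
        new_labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # Step 6: labels are stable, stop
            break
        labels = new_labels
        # Step 5: each center becomes the mean of the objects assigned to it
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

The result depends on the initial centers; practical implementations usually restart from several initializations and keep the best partition.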
31
Unsupervised Clustering: Partitioning Methods
• K-medoids Algorithm partitions a set of n objects into k clusters so that it minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
• Input: number of desired clusters k
• Output: k labels assigned to n objects
• Steps:
1. Select k initial objects as the initial medoids
2. Compute the distance (dissimilarity) between an object and each cluster medoid
3. Assign to the object the label of the medoid at minimum distance
4. Repeat for all objects
5. Randomly select a non-medoid object and swap it with the current medoid if the swap would decrease the intra-cluster square error
6. Repeat from Step 2 until objects do not change their labels.
Example: Representative-Based Technique
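A minimal sketch of the K-medoids idea follows. To keep the example deterministic it tries every non-medoid object as a swap candidate instead of drawing one at random, and it uses the sum of Euclidean distances as the cost; both choices, along with the function names and data, are assumptions made for illustration.

```python
import math


def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def kmedoids(points, medoids):
    """medoids is a list of indices into points; returns (labels, medoids)."""
    medoids = list(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for cand in range(len(points)):
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                # Step 5: keep the swap only if it lowers the total cost
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids = trial
                    improved = True
    # Steps 2-4: label each object with the position of its nearest medoid
    labels = [
        min(range(len(medoids)), key=lambda j: math.dist(p, points[medoids[j]]))
        for p in points
    ]
    return labels, medoids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, medoids = kmedoids(points, medoids=[0, 1])
```

Unlike K-means, every cluster representative is an actual data object, which makes the method less sensitive to outliers.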
32
Unsupervised Clustering: Hierarchical Clustering
• Hierarchical Clustering partitions a set of n objects into a tree of clusters
• Types of Hierarchical Clustering
—
Agglomerative hierarchical clustering
–
Bottom-up strategy of building clusters
—
Divisive hierarchical clustering
–
Top-down strategy of building clusters
33
Unsupervised Agglomerative Hierarchical Clustering
• Agglomerative Hierarchical Clustering partitions a set of n objects into a tree of clusters with a bottom-up strategy.
• Steps:
1. Assign a unique label to each data object and form n clusters
2. Find the nearest pair of clusters and merge them
3. Repeat Step 2 until the number of remaining clusters equals the desired number of clusters.
• Types of Agglomerative Hierarchical Clustering
—
The nearest neighbor algorithms (minimum or single-linkage algorithm, minimal spanning tree)
—
The farthest neighbor algorithms (maximum or complete-linkage algorithm)
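The agglomerative steps can be illustrated with the nearest neighbor (single-linkage) rule, where the distance between two clusters is the smallest distance between any pair of their members. The sketch below is a naive, quadratic-time illustration; the function name and sample points are hypothetical. Replacing the inner min with max would give the farthest neighbor (complete-linkage) variant.

```python
import math


def single_linkage(points, k):
    """Merge the two nearest clusters until only k clusters remain.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(points))]  # Step 1: one cluster per object
    while len(clusters) > k:                      # Step 3: stop at k clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest pair of members
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # Step 2: merge nearest clusters
        del clusters[b]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters = single_linkage(points, k=3)
```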
34
Unsupervised Clustering: Density-Based Clustering
• Density-Based Spatial Clustering of Applications with Noise (DBSCAN) aggregates objects into clusters if the objects are density connected.
• Density connected objects:
—
Simplified explanation: P and Q are density connected if
there is an object O such that both P and Q are density
reachable from O.
—
Aggregate P and Q if they are density connected with
respect to the R-radius neighborhood and Minimum Object
criteria
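A compact sketch of density-based clustering in the style of DBSCAN is given below. The parameters radius and min_objects stand in for the R-radius neighborhood and the Minimum Object criterion; the function name, the convention that a neighborhood includes the object itself, and the toy data are assumptions for illustration.

```python
import math

NOISE = -1


def dbscan(points, radius, min_objects):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [None] * len(points)

    def neighbors(i):
        # all objects within the radius of object i (including i itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= radius]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_objects:   # not a core object
            labels[i] = NOISE
            continue
        cluster += 1                   # start a new cluster from a core object
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # border object reached from a core object
                labels[j] = cluster
            if labels[j] is not None:  # already assigned, do not expand again
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_objects:  # j is a core object: keep growing
                queue.extend(reachable)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, radius=1.5, min_objects=3)
```

Unlike the partitioning methods, the number of clusters is not an input, and isolated objects are reported as noise rather than forced into a cluster.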
35
Supervised Learning or Classification
• Classification is a two-step process consisting of learning classification rules followed by assignment of classification labels.
36
Supervised Learning: Decision Tree
• Decision tree algorithm constructs a tree structure in a top-down recursive divide-and-conquer manner
Car Insurance: Risk Assessment
Training data (attributes: Age, Car Type; answer: Risk)
Age   Car Type   Risk
23    family     High
17    sports     High
43    sports     High
68    family     Low
32    truck      Low
20    family     High
Decision tree
Age < 25 ?
   yes: Risk: High
   no: Sports car ?
      yes: Risk: High
      no: Risk: Low
[Figure: Visualization of Decision Boundaries]
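The car insurance example can be written as two nested tests. The sketch assumes the tree first asks whether Age < 25 and then whether the car is a sports car, which is the reading consistent with all six training rows; the function name is hypothetical.

```python
def assess_risk(age, car_type):
    """Classify insurance risk with the two-level decision tree."""
    if age < 25:                # root node: Age < 25 ?
        return "High"
    if car_type == "sports":    # second node: Sports car ?
        return "High"
    return "Low"


# Training rows from the slide: (Age, Car Type, Risk)
training = [
    (23, "family", "High"),
    (17, "sports", "High"),
    (43, "sports", "High"),
    (68, "family", "Low"),
    (32, "truck", "Low"),
    (20, "family", "High"),
]

predictions = [assess_risk(age, car) for age, car, _ in training]
errors = sum(pred != risk for pred, (_, _, risk) in zip(predictions, training))
```

A tree that reproduces every training row, as this one does, can still generalize poorly, which is why the validation slides later in the deck matter.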
37
Supervised Learning: Bayesian Classification
• Bayesian Classification is based on Bayes theorem and can predict class membership probabilities.
• Bayes Theorem (X: data sample, H: hypothesis of data label): P(H|X) = P(X|H) P(H) / P(X)
—
P(H|X): posterior probability
—
P(H) prior probability
• Classification: maximum a posteriori (MAP) hypothesis
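As a small worked example of Bayes theorem, the sketch below computes posteriors for two hypothetical gene labels and picks the maximum a posteriori label. The priors, likelihoods, and label names are made-up numbers for illustration only.

```python
def posterior(priors, likelihoods):
    """priors: {H: P(H)}, likelihoods: {H: P(X|H)}; returns {H: P(H|X)}."""
    # P(X) is the sum of P(X|H) P(H) over all hypotheses
    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical values: prior belief and likelihood of the observed sample X
priors = {"up-regulated": 0.2, "unchanged": 0.8}
likelihoods = {"up-regulated": 0.9, "unchanged": 0.1}

post = posterior(priors, likelihoods)
label = max(post, key=post.get)  # maximum a posteriori hypothesis
```

Here P(X) = 0.2 * 0.9 + 0.8 * 0.1 = 0.26, so the posterior of "up-regulated" is 0.18 / 0.26, roughly 0.69, and that label is chosen even though its prior was the smaller one.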
38
Statistical Models: Linear Discriminant
•
Linear Discriminant Functions form boundaries between
data classes.
•
Finding Linear Discriminant Functions is achieved by
minimizing a criterion error function.
Linear discriminant function
Quadratic discriminant function
Finding w coefficients:
• Gradient Descent Procedures
• Newton’s algorithm
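The slide's discriminant equations were lost in extraction, so the following is only a sketch under an assumed form: a linear discriminant g(x) = w0 + w1*x1 + ... + wd*xd, trained by batch gradient descent on the perceptron criterion (the sum of -y*g(x) over misclassified samples). The function names, learning rate, and data are hypothetical.

```python
def discriminant(w, x):
    """g(x) = w0 + w1*x1 + ... + wd*xd"""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def train_linear_discriminant(samples, labels, eta=0.1, max_epochs=1000):
    """Batch gradient descent on the perceptron criterion; labels are +1 or -1."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(max_epochs):
        grad = [0.0] * len(w)
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * discriminant(w, x) <= 0:   # misclassified sample
                mistakes += 1
                grad[0] -= y                  # gradient with respect to the bias w0
                for i, xi in enumerate(x, start=1):
                    grad[i] -= y * xi
        if mistakes == 0:                     # criterion is zero: boundary found
            break
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w


samples = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.0), (0.0, 0.0), (0.5, 0.0), (0.0, 1.0)]
labels = [1, 1, 1, -1, -1, -1]
w = train_linear_discriminant(samples, labels)
```

This procedure only terminates early when the classes are linearly separable; Newton's algorithm or a least-squares criterion would be alternatives for the same search over w.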
39
Neural Networks
• Neural network is a set of connected input/output units where each connection has a weight associated with it.
• Phase I (learning): adjust weights such that the network accurately predicts the class labels of the input samples
• Phase II (classification): assign labels by passing an unknown sample through the network
• Steps:
1. Initialize weights with values from [-1, 1]
2. Propagate the inputs forward
3. Backpropagate the error
4. Terminate learning (training) if (a) delta w < thresh or (b) percentage of misclassified samples < thresh or (c) max number of iterations has been exceeded
Interpretation
40
Support Vector Machines (SVM)
•
SVM algorithm finds a separating hyperplane with the largest
margin and uses it for classification of new samples
41
DATABASE TECHNIQUES
AND
OPTIMIZATION TECHNIQUES
42
Database Techniques
• Database Design and Modeling (tables, procedures, functions, constraints)
• Database Interface to Data Mining System
• Efficient Import and Export of Data
• Database Data Visualization
• Database Clustering for Access Efficiency
• Database Performance Tuning (memory usage, query encoding)
• Database Parallel Processing (multiple servers and CPUs)
• Distributed Information Repositories (data warehouse)
43
Optimization Techniques
•
Highly nonlinear search space (global versus local
maxima)
•
Gradient based optimization
•
Genetic algorithm based optimization
•
Optimization with sampling
•
Large search space
•
Example: A genome with N genes can encode 2^N
states (each gene is either active or inactive; degrees of
regulation are not considered). Human genome ~ 2^30,000;
Nematode genome ~ 2^20,000 patterns.
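To give a feel for these numbers, the short sketch below counts the decimal digits of 2^N using logarithms rather than materializing the integer as text. The function names are hypothetical; the gene counts are the approximate figures quoted on the slide.

```python
import math


def num_states(n_genes):
    """Number of on/off patterns for n_genes binary genes."""
    return 2 ** n_genes


def decimal_digits(n_genes):
    """Number of decimal digits in 2**n_genes, computed via log10."""
    return math.floor(n_genes * math.log10(2)) + 1


human_digits = decimal_digits(30_000)      # 2^30,000 has 9,031 decimal digits
nematode_digits = decimal_digits(20_000)   # 2^20,000 has 6,021 decimal digits
```

A search space with thousands of digits in its size cannot be enumerated, which motivates the gradient based, genetic algorithm based, and sampling based optimization listed above.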
44
VISUALIZATION
45
Visualization
• Data: 3D cubes, distribution charts, curves, surfaces, link graphs, image frames and movies, parallel coordinates
• Results: pie charts, scatter plots, box plots, association rules, parallel coordinates, dendrograms, temporal evolution
Pie chart
Parallel coordinates
Temporal evolution
46
Novel Visualization of Features
Feature Selection and Visualization
Feature Selection
Mean Feature Image
47
Novel Visualization of Clustering Results
Isodata (K-means) Clustering
Class Labeling and Visualization
Mean Feature Image
Label Image
48
VALIDATION
49
Why Validation?
•
Validation type:
—
Within the existing data
—
With newly collected data
•
Errors and uncertainties:
—
Systematic or random errors
—
Unknown variables: number of classes
—
Noise level: statistical confidence due to noise
—
Model validity: error measure, model over-fit or under-fit
—
Number of data points: measurement replicas
•
Other issues
—
Experimental support of general theories
—
Exhaustive sampling is not feasible
50
Error Detection: Example of Spot Screening
Mask Image: No Screening
Mask Image: Location and Size Screening
Mask Image: SNR Screening
51
Cross Validation: Example
•
One-tier cross validation
—
Train on different data than test data
•
Two-tier cross validation
—
The score from one-tier cross validation is used by
the bias optimizer to select the best learning
algorithm parameters (# of control points). The
more you optimize, the more you over-fit. The
second tier measures the level of over-fit
(unbiased measure of accuracy).
—
Useful for comparing learning algorithms with
control parameters that are optimized.
—
Number of folds is not optimized.
•
Computational complexity:
—
#folds of top tier × #folds of bottom tier ×
#control points × CPU time of algorithm
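The two tiers can be sketched with a toy learner. The example below assumes a one-dimensional k-nearest-neighbor classifier whose only control parameter is k; the bottom tier selects k on the training part of each top-tier fold, and the top tier scores that choice on data it never saw. All function names, the data, and the candidate values are hypothetical.

```python
def folds(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training samples (labels 0/1)."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def split(data, fold):
    held = set(fold)
    test = [data[i] for i in fold]
    train = [d for i, d in enumerate(data) if i not in held]
    return train, test


def one_tier(data, k, n_folds):
    """Mean accuracy when training and testing on different data."""
    scores = [accuracy(*split(data, fold), k) for fold in folds(len(data), n_folds)]
    return sum(scores) / len(scores)


def two_tier(data, candidates, top_folds, bottom_folds):
    """Top tier measures accuracy; bottom tier picks the control parameter."""
    outer = []
    for fold in folds(len(data), top_folds):
        train, test = split(data, fold)
        best_k = max(candidates, key=lambda k: one_tier(train, k, bottom_folds))
        outer.append(accuracy(train, test, best_k))
    return sum(outer) / len(outer)


data = [(float(x), int(x >= 10)) for x in range(20)]
candidates = [1, 3, 5]
score = two_tier(data, candidates, top_folds=4, bottom_folds=3)
n_runs = 4 * 3 * len(candidates)  # top folds x bottom folds x control points
```

Because the parameter is chosen without touching the top-tier test fold, the resulting score is not inflated by the optimization, at the price of the multiplicative cost counted in n_runs.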
52
Summary
•
Bioinformatics and Microarray problem
—
Interdisciplinary Challenges: Terminology
—
Understanding Biology and Computer Science
•
Data mining and image analysis steps
—
Image Analysis
—
Experiment Design as Prior Knowledge
—
Expected Results of Data Mining
—
Which Data Mining Technique to Use?
—
Data Mining Challenges: Complexity, Data Size, Search Space
•
Validation
—
Confidence in Obtained Results?
—
Error Screening
—
Cross validation techniques
53
Backup