Data Mining and Bioinformatics

April 30, 2004

What is Data Mining?


Data mining is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns for business advantage. (SAS Institute)


Example: detecting suspicious transactions with
credit cards



A Newer Definition


Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

The “Beers and Diapers” Story


Analyze sales records


Beers & diapers frequently occur together
in customer orders


Put beers next to diapers


Sales volume increases dramatically



Explanation?
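
A pattern like this is usually quantified with support, confidence, and lift for each item pair. The sketch below is only an illustration: the orders are invented toy data, and `pair_stats` is a hypothetical helper name, not something from the original analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy orders, invented for illustration only.
orders = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
    {"diapers", "milk"},
]

def pair_stats(orders):
    """Return {(a, b): (support, confidence of a -> b, lift)} for co-occurring pairs."""
    n = len(orders)
    item_counts = Counter(item for order in orders for item in order)
    pair_counts = Counter(
        pair for order in orders for pair in combinations(sorted(order), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n                      # fraction of orders with both items
        confidence = count / item_counts[a]      # estimate of P(b in order | a in order)
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        stats[(a, b)] = (support, confidence, lift)
    return stats

stats = pair_stats(orders)
support, confidence, lift = stats[("beer", "diapers")]
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 means the two items appear together more often than they would if they were bought independently, which is the kind of signal the story describes.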

Why Do Data Mining?


Do you know the differences between the
following concepts?


Data


Information


Knowledge


Difference between data mining and data
analysis


The latter is more specific


What do We Aim to Mine?


Relationships and summaries


Models (global summary of a data set)


Linear equations, clusters, graphs, tree structures


Prediction, classification, interpretation


Patterns (local, restricted regions)


Recurrent patterns, rules


Unusualness - anomaly detection


Analogy to data compression

The Whole KDD Process


KDD: Knowledge Discovery in Databases


Selecting the target data


Preprocessing the data


Transforming them if necessary


Performing data mining to extract patterns and
relationships


Interpreting and assessing the discovered
structures

Data Mining Techniques


Many of them originate from statistics, machine
learning, or pattern recognition


General steps


Determine the nature and structure of the representation to be used


Decide how to quantify and compare how well different representations fit the data (score function)


Choose an algorithm to optimize the score function


Decide what principles of data management are required to implement the algorithm efficiently


Example: Regression analysis X = aY + b


Credit card spending vs Annual income
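
As a rough illustration of the general steps applied to this example: the representation is a straight line, the score function is the sum of squared errors, and the optimization has a closed-form solution. The income and spending figures below are invented, and which variable is regressed on which is an assumption (spending as a function of income).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b, using the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx              # slope that minimizes the squared error
    b = mean_y - a * mean_x    # intercept: the line passes through the means
    return a, b

# Hypothetical figures (thousands of dollars), invented for illustration.
annual_income = [30, 45, 60, 80, 100]
card_spending = [4.1, 5.9, 8.2, 10.1, 13.0]

a, b = fit_line(annual_income, card_spending)
print(f"spending ~ {a:.3f} * income + {b:.3f}")
```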

Techniques


Regression/Fitting


Clustering


Neural networks


Bayesian networks


Hidden Markov models

Example: Naïve Bayesian

outlook  temp  humidity  windy  play
sunny    mild  high      false  no
sunny    hot   mild      true   yes
rainy    cool  high      false  yes
…
sunny    cool  high      true   ?

Naïve Bayesian - Continued


9 yes samples (out of 14):


2 sunny, 3 cool, 3 high, 3 true


Prob of yes (unnormalized): 9/14 * 2/9 * 3/9 * 3/9 * 3/9 = 0.0053


5 no samples (out of 14):


3 sunny, 1 cool, 4 high, 3 true


Prob of no (unnormalized): 5/14 * 3/5 * 1/5 * 4/5 * 3/5 = 0.0206


After normalizing: Yes / No = 20.5% / 79.5%
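
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using the counts from the slide, with one assumption stated in the code: 3 of the 9 yes samples are taken to be windy = true, which is the count that reproduces the 0.0053 figure.

```python
from fractions import Fraction as F

# Class priors and per-attribute conditional frequencies for the query
# (outlook=sunny, temp=cool, humidity=high, windy=true).
# Assumption: 3 of the 9 "yes" samples have windy=true.
prior = {"yes": F(9, 14), "no": F(5, 14)}
likelihood = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

def class_scores(prior, likelihood):
    """Multiply the prior by each conditional frequency (the naive independence assumption)."""
    scores = {}
    for label, p in prior.items():
        score = p
        for cond in likelihood[label]:
            score *= cond
        scores[label] = score
    return scores

scores = class_scores(prior, likelihood)
total = sum(scores.values())
posterior = {label: float(s / total) for label, s in scores.items()}

for label in ("yes", "no"):
    print(f"{label}: score={float(scores[label]):.4f} posterior={posterior[label]:.1%}")
```

The scores are not probabilities until they are divided by their sum, which is why the two numbers on the slide do not add up to 1.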

Clustering


Iterative clustering


K-means


Hierarchical clustering


Agglomerative method


Probabilistic model-based clustering


EM (Expectation Maximization)
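
To make the iterative idea concrete, here is a minimal K-means sketch. It is an illustration under simplifying assumptions: the points are invented, and the centroids are initialized from the first k points, which a real implementation would normally replace with a better seeding strategy.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, max_iterations=20):
    centroids = list(points[:k])  # naive initialization (assumption)
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Each pass alternates between assigning points to the nearest centroid and recomputing centroids, stopping once the centroids no longer move.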

Data Mining Applications


Interdisciplinary


statistics, databases, machine learning, pattern
recognition, AI, visualization, etc


Applications:


Marketing: sales model


Finance: loan decision


Insurance: risk analysis


Telecom: load prediction


Web/text mining


Surveillance: security


Bioinformatics …


In Bioinformatics


Analysis of Microarray Data


Mining free text


Structural genomics


protein crystallization


Predicting structure from sequence



Common theme: complex, fast-growing data (outgrowing our processing power)

Hybridization of Sample to Probe

Data Collection and Preprocessing


Microarray Expression Data


Fluorescence level


Noisy

Examples (rows) and features (columns):

          Experiment 1  Experiment 2  …  Experiment N  Category
Gene 1    1083          1464          …  1115          Y
Gene 2    1585          398           …  511           X
…
Gene M    170           302           …  751           X

Data Representations

Microarray Experiment Result

Machine Learning Tasks


Design of Microarrays


Probes (67 features) with fluorescence values: learn to choose the best probes for a new gene


Biological Applications of Microarrays


Classify new examples


Predicting the functional category of genes


Cluster genes based on similarity


Cluster experimental conditions


Learn a Bayesian network (that captures the joint probability distribution over the expression levels of genes)


A Support Vector Machine

Cluster Analysis

Bayesian Network

Machine Learning Tasks (cont’d)


Medical Applications of Microarrays


Cell disease classification


Predicting existing disease classes


Predicting the prognosis


Predicting the drug response of different
patients

Disease Diagnosis Models

Factors That Affect Drug Response

Wrap It Up


Data mining has great potential


Danger: don’t overpredict


S&P index = function of the previous year’s butter
production, cheese production, sheep population in
Bangladesh and US?


Finally: don’t expect it to answer all questions
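
The S&P warning can be reproduced with purely random numbers. The sketch below is an illustration, not an analysis of real market data: it generates a short random "index" series and many unrelated random candidate series, then reports the strongest correlation found. The series length, candidate count, and seed are arbitrary assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def best_spurious_match(n_years=10, n_candidates=500, seed=0):
    """Largest |correlation| between a random target and unrelated random series."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_years)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_years)]
        best = max(best, abs(pearson(target, candidate)))
    return best

print(f"strongest correlation found by chance: {best_spurious_match():.2f}")
```

With only a few observations and many candidate predictors, some series will correlate strongly with the target by chance alone, so a good in-sample fit says little about predictive value.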