Introduction to Bioinformatics - Department of Computer Science ...

Genotype Susceptibility and Integrated Risk Factors for Complex Diseases

Weidong Mao

Dumitru Brinza

Nisar Hundewale

Stefan Gremalshi

Alexander Zelikovsky


Department of Computer Science

Georgia State University


Outline



Human genetics basics


SNPs, Haplotypes and Genotypes


Genetic epidemiology


Prediction Methods


Genetic susceptibility to complex diseases


Conclusions and future plans


Human Genetics Basics

Genetics: DNA, gene, chromosome, and genome

  DNA = two complementary strands of nucleotides (A-T, G-C)

  Length of DNA is measured in base pairs (bp)

Human Genome Project (1990-2003)

  3 billion bp in the human genome

  About 20,000-25,000 genes

Over 99% of the genome is identical across individuals

  The remaining variation (under 1%) is mostly SNPs

  3.7 million SNPs

Single Nucleotide Polymorphisms (SNPs)

A single altered nucleotide in the genome sequence.

Found in at least 1% of the population.

Occurs every 100 to 300 bp.

Bi-allelic: wild type and mutation.

  AAGGCATGGCTA
  AACGCGTGGTTA
  AACGCGTGGCTA
  AACGCGTGGCTA

SNPs: genetic risk factors for diseases.


Genotypes, Haplotypes, 0,1,2 notations

Diploid organisms = two different "copies" of each chromosome = recombined copies of the parents' chromosomes

It is too expensive to examine the two versions of a chromosome separately

It is much cheaper to obtain genotype (mixed) data than haplotype (separated) data

Haplotype = description of a single copy (0 = wild type, 1 = minor allele)

Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)

Two haplotypes per individual:

  0 1 1 1 0 0 1 1 0
  1 1 0 1 0 0 1 0 0

Genotype for the individual:

  2 1 2 1 0 0 1 2 0

[Figure: SNP sites along a haplotype, with homozygous and heterozygous sites labeled.]
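The 0/1/2 coding can be illustrated with a short Python sketch. The function name is hypothetical; the two haplotypes are the ones shown on the slide.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    Coding from the slide: 0 = both copies are 0, 1 = both copies are 1,
    2 = the two copies differ (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = genotype_from_haplotypes(h1, h2)
print(genotype)  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The reverse direction is ambiguous: every site coded 2 can be resolved in two ways, which is why haplotype data is harder to obtain than genotype data.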

Genetic Epidemiology


Genetic epidemiology: searching for genetic risk factors for diseases.

Monogenic disease

  A mutated gene is entirely responsible for the disease.

  Typically rare in the population: < 0.1%.

Complex disease

  Affected by the interaction of multiple genes.

  Common: > 0.1%. In New York City, 12% of the population has type 2 diabetes.

The significance of a risk factor is measured by risk rate or odds ratio.
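As a rough illustration of the two measures, the sketch below computes them from a 2x2 table. It assumes that "risk rate" means the usual risk ratio (relative risk); the function names and the counts are hypothetical.

```python
def risk_ratio(a, b, c, d):
    """Risk of disease among the exposed divided by risk among the unexposed.

    a = exposed and sick,   b = exposed and healthy,
    c = unexposed and sick, d = unexposed and healthy.
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds of disease among the exposed divided by odds among the unexposed."""
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only.
rr = risk_ratio(30, 70, 10, 90)
oratio = odds_ratio(30, 70, 10, 90)
print(round(rr, 3), round(oratio, 3))
```

Here "exposed" would mean carrying the risk allele or genotype at a given SNP. A value above 1 for either measure suggests the factor is associated with higher risk.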


Genetic Susceptibility to Complex Diseases

Given: Genotypes of sick and healthy persons, and the genotype of a test person.

Find: Whether the test person has the disease.

  Genotype            Disease Status
  0101201020102210    -1  (healthy)
  0220110210120021    -1  (healthy)
  0200120012221110    -1  (healthy)
  0020011002212101    -1  (healthy)
  1101202020100110     1  (sick)
  0120120010100011     1  (sick)
  0210220002021112     1  (sick)
  0021011000212120     1  (sick)

  Test genotype g_t:
  0110211101211201    s(g_t)

Prediction Methods


Universal prediction methods:

  Statistical methods:
    - Closest Neighbor
    - Genotype Statistics

  Support Vector Machine (SVM)

  Random Forest

Ad hoc prediction methods:

  Pseudo-haplotype statistics

  Linear programming based prediction method

  Adjacent SNP pairs

Statistical Methods



Closest Genotype Neighbor:

For the test genotype g_t, find the closest training genotype g_i using Hamming distance, and then set s(g_t) = s(g_i).

  g_i: ATTCTGACCGCATC
  g_t: ATTGTGATCGCCTC
  H(g_i, g_t) = 3
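A minimal Python sketch of the closest-neighbor rule follows. The function names, the tie-breaking rule, and the small training set are assumptions for illustration; only the first `hamming` call uses the pair from the slide.

```python
def hamming(g1, g2):
    """Number of positions at which two equal-length genotypes differ."""
    if len(g1) != len(g2):
        raise ValueError("genotypes must have the same length")
    return sum(a != b for a, b in zip(g1, g2))


def closest_neighbor_status(training, g_t):
    """Return the status of the training genotype closest to g_t.

    training is a list of (genotype, status) pairs with status in {-1, 1}.
    Ties go to the earliest training genotype (an arbitrary choice).
    """
    _, status = min(training, key=lambda pair: hamming(pair[0], g_t))
    return status


# The pair from the slide differs at three sites.
print(hamming("ATTCTGACCGCATC", "ATTGTGATCGCCTC"))  # 3

# Hypothetical training data in 0/1/2 notation.
training = [("0120", -1), ("2210", 1), ("0021", -1)]
print(closest_neighbor_status(training, "2200"))  # 1
```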



Genotype Statistics:

A standard statistical method based on allele frequency. For each SNP j = 1, ..., m, we compute the LRR score of the risk rate (RR) as follows:

[Formula: image not available.]

For genotype g_t, if the cumulative LRR score over all SNPs is greater than 0, then the output disease status is s(g_t) = 1 (g_t is predicted to be in the control population), and s(g_t) = -1 otherwise.

Support Vector Machine (SVM) Algorithm


Learning Task

  Given: Genotypes of patients and healthy persons.

  Compute: A model that distinguishes whether a person has the disease.

Classification Task

  Given: Genotype of a new patient and a learned model.

  Determine: Whether the patient has the disease.

[Figure: Linear SVM and Non-Linear SVM.]

Random Forest Algorithm


Random Forests grows many classification trees. To classify a new object from an input vector, pass the input vector down each tree in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification with the most votes over all the trees in the forest.

Growing a tree, split selection, and prediction.

Random sub-sample of the training data, random splitter selection.
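The voting idea can be sketched with a deliberately simplified forest in which every tree has a single split (a decision stump) on one randomly chosen SNP site. This is an illustration of bootstrapping, random split selection, and majority voting, not the full Random Forests procedure; all names and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def grow_stump(sample, rng):
    """Grow a one-split tree on a randomly chosen SNP site.

    sample is a list of (genotype, status) pairs. For each value seen at
    the chosen site the stump stores the majority status of the sample.
    """
    site = rng.randrange(len(sample[0][0]))
    by_value = {}
    for genotype, status in sample:
        by_value.setdefault(genotype[site], []).append(status)
    votes = {value: majority(labels) for value, labels in by_value.items()}
    default = majority([status for _, status in sample])
    return site, votes, default


def stump_predict(stump, genotype):
    """Vote of one stump; unseen site values fall back to the sample majority."""
    site, votes, default = stump
    return votes.get(genotype[site], default)


def grow_forest(training, n_trees, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw with replacement, same size as the training set.
        sample = [rng.choice(training) for _ in training]
        forest.append(grow_stump(sample, rng))
    return forest


def forest_predict(forest, genotype):
    """Classification with the most votes over all trees."""
    return majority([stump_predict(stump, genotype) for stump in forest])


# Hypothetical toy data: every site separates cases (1) from controls (-1).
training = [("1221", 1)] * 4 + [("0000", -1)] * 4
forest = grow_forest(training, n_trees=25, seed=0)
print(forest_predict(forest, "1221"))
```

An odd number of trees avoids tied votes between the two classes.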

Random Forest Algorithm

[Figure: a data set of genotypes, a bootstrapped sample drawn from it, trees that split on homozygous vs. heterozygous SNP values, and a test genotype passed to the trees.]


Ad hoc classification methods

Pseudo-Haplotype Statistics:

  Genotype 1:        012001221000
  Pseudo-haplotype:  010011001000
  Genotype 2:        220212021000

[Figure: genotypes and pseudo-haplotypes labeled with disease status 1 or -1; the status of the test item is marked "?".]

LP-based Prediction Algorithm

Certain haplotypes are susceptible to the disease while others are resistant to the disease.

The genotype susceptibility is assumed to be the sum of the susceptibilities of its two haplotypes.

Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of the weights of its haplotypes is negative and for any case genotype it is positive.

For each vertex h_i (corresponding to a haplotype) of the graph G_X we wish to assign a weight p_i such that this sign condition holds for every genotype-edge e_{i,j} = (h_i, h_j), where s(e_{i,j}) ∈ {-1, 1} is the disease status of the genotype represented by edge e_{i,j}. The total sum of absolute values of the genotype weights is maximized.
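One way to write the linear program described for the LP-based method is sketched below. The bound on each p_i and the exact form of the constraints are assumptions chosen to match the verbal description, not formulas taken from the slide.

```latex
\begin{aligned}
\text{maximize}\quad & \sum_{e_{i,j}} s(e_{i,j})\,(p_i + p_j) \\
\text{subject to}\quad & s(e_{i,j})\,(p_i + p_j) \ge 0
  \quad \text{for every genotype-edge } e_{i,j} = (h_i, h_j), \\
& -1 \le p_i \le 1 \quad \text{for every haplotype } h_i .
\end{aligned}
```

Under the sign constraints, s(e_{i,j})(p_i + p_j) equals |p_i + p_j|, so the objective is the total sum of absolute values of the genotype weights while remaining linear in the p_i.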

Most Reliable 2 SNPs Prediction


The method chooses a pair of adjacent SNPs and predicts the disease status of the test genotype by voting among the training-set genotypes that have the same SNP values at the chosen sites. The most reliable pair of adjacent SNPs is the one with the highest prediction rate on the training set.


[Figure: a training set of genotypes and a test genotype, annotated with prediction rates of 60% and 100%.]
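The adjacent-pair voting scheme can be sketched as follows. The slide does not say how the prediction rate on the training set is measured; this sketch assumes each training genotype is left out of its own vote, and it returns 0 when the vote is empty or tied. Function names and data are hypothetical.

```python
from collections import Counter


def vote(training, test, j, skip=None):
    """Vote among training genotypes that match test at sites j and j+1.

    Returns the majority status, or 0 if nothing matches or the vote is
    tied. skip is the index of a training genotype to leave out.
    """
    tally = Counter(
        status
        for idx, (genotype, status) in enumerate(training)
        if idx != skip and genotype[j:j + 2] == test[j:j + 2]
    )
    if tally[1] == tally[-1]:
        return 0
    return 1 if tally[1] > tally[-1] else -1


def prediction_rate(training, j):
    """Fraction of training genotypes predicted correctly by pair (j, j+1)."""
    hits = sum(
        vote(training, genotype, j, skip=idx) == status
        for idx, (genotype, status) in enumerate(training)
    )
    return hits / len(training)


def most_reliable_pair(training):
    """Index j of the adjacent pair (j, j+1) with the highest prediction rate."""
    n_sites = len(training[0][0])
    return max(range(n_sites - 1), key=lambda j: prediction_rate(training, j))


def predict(training, test):
    j = most_reliable_pair(training)
    return j, vote(training, test, j)


# Hypothetical training set: sites 1 and 2 separate cases from controls.
training = [
    ("0120", 1), ("2121", 1), ("1120", 1),
    ("0011", -1), ("2010", -1), ("1011", -1),
]
print(predict(training, "2120"))  # (1, 1)
```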

Disease Tagging


Motivation: Genotyping and analyzing only a limited number of suspicious SNPs.

Tag SNPs: The subset of SNPs that are probably responsible for the disease.

[Figure: genotypes with the Tag SNPs marked.]

Minimal Disease Tagging Problem


Given: Genotypes partitioned into groups (e.g., case/control).

Find: The minimal number of SNPs that distinguish any case from any control.

Greedy algorithm: Drop a SNP if doing so does not collapse cases and controls.

[Figure: example run of the greedy algorithm, ending at STOP.]
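A possible reading of the greedy algorithm for the minimal disease tagging problem, as a Python sketch: scan the SNPs in order and drop each one whose removal still leaves every case distinguishable from every control. The left-to-right scan order and the function names are assumptions.

```python
def separates(cases, controls, sites):
    """True if no case and control look identical on the given sites."""
    case_views = {tuple(g[j] for j in sites) for g in cases}
    return all(tuple(g[j] for j in sites) not in case_views for g in controls)


def greedy_tag_snps(cases, controls):
    """Drop SNPs one at a time while cases and controls stay distinguishable.

    The result is minimal in the sense that no single remaining SNP can be
    dropped; it is not guaranteed to be the smallest possible set.
    """
    sites = list(range(len(cases[0])))
    if not separates(cases, controls, sites):
        raise ValueError("some case and control genotypes are identical")
    for j in list(sites):
        trial = [k for k in sites if k != j]
        if separates(cases, controls, trial):
            sites = trial
    return sites


# Hypothetical genotypes: only site 2 differs between the two groups.
cases = ["0011", "1111"]
controls = ["0001", "1101"]
print(greedy_tag_snps(cases, controls))  # [2]
```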

[Figure: example of the LP-based method. Haplotypes (101, 000, 011, 010, 110) are vertices; case (+) and control (-) genotypes (+122, +202, +221, -000, -022, -010, -210, -012) are edges between them. Haplotypes and genotypes are annotated with their weights; a weight of 0 is marked "Decided by other methods".]

Quality Measures of Prediction


Sensitivity: The ability to correctly detect disease.
  sensitivity = TP/(TP+FN)

Specificity: The ability to avoid calling normal as disease.
  specificity = TN/(FP+TN)

Accuracy = (TP+TN)/(TP+FP+FN+TN)

Risk Rate: A measure for risk factors.

Prediction:

              Disease +              Disease -
  Test +      True Positive (TP)     False Positive (FP)
  Test -      False Negative (FN)    True Negative (TN)
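The three formulas can be checked with a small Python sketch. It assumes statuses coded 1 (disease) and -1 (healthy), as elsewhere in the slides; the function names and the example statuses are hypothetical.

```python
def confusion_counts(real, predicted):
    """Count TP, FP, FN, TN for statuses coded 1 (disease) and -1 (healthy)."""
    pairs = list(zip(real, predicted))
    tp = sum(r == 1 and p == 1 for r, p in pairs)
    fp = sum(r == -1 and p == 1 for r, p in pairs)
    fn = sum(r == 1 and p == -1 for r, p in pairs)
    tn = sum(r == -1 and p == -1 for r, p in pairs)
    return tp, fp, fn, tn


def quality(real, predicted):
    """Sensitivity, specificity, and accuracy as defined on the slide."""
    tp, fp, fn, tn = confusion_counts(real, predicted)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical statuses: one sick person is predicted to be healthy.
real = [-1, -1, 1, 1, 1]
predicted = [-1, -1, 1, 1, -1]
measures = quality(real, predicted)
print(measures)
```

The sketch does not guard against empty classes; with no sick or no healthy persons in `real`, one of the ratios is undefined.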

Cross-validation Method

Leave-one-out test: The disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.

Leave-many-out test: Repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3.

[Figure: genotypes with real and predicted disease status; accuracy = 80%.]
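A leave-one-out test can be sketched in Python as below. The closest-neighbor predictor is only a stand-in; any function with the same signature could be passed instead. The names and the five genotypes are hypothetical and do not reproduce the slide's figure.

```python
def hamming(g1, g2):
    """Number of positions at which two equal-length genotypes differ."""
    return sum(a != b for a, b in zip(g1, g2))


def nearest_neighbor(training, g_t):
    """Stand-in predictor: status of the closest training genotype."""
    return min(training, key=lambda pair: hamming(pair[0], g_t))[1]


def leave_one_out_accuracy(data, predict):
    """Predict each genotype from all the others and return the hit rate.

    data is a list of (genotype, status) pairs; predict(training, g)
    returns a status for genotype g.
    """
    hits = 0
    for i, (genotype, status) in enumerate(data):
        training = data[:i] + data[i + 1:]
        hits += predict(training, genotype) == status
    return hits / len(data)


# Hypothetical data: the last genotype resembles the healthy ones.
data = [("0000", -1), ("0001", -1), ("2222", 1), ("2221", 1), ("0002", 1)]
accuracy = leave_one_out_accuracy(data, nearest_neighbor)
print(accuracy)  # 0.8
```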

Algorithms Evaluation


P-value: A measure of how much evidence we have against the null hypothesis.

Null hypothesis: The observed prediction accuracy is obtained by chance.

To reject the null hypothesis, we require p-value < 0.05.

Computing the p-value by randomization:

  Randomly permute the disease status of the population to generate 1000 instances.

  Apply the prediction methods to each instance to get its prediction accuracy.

  Compute the fraction of instances that have a higher prediction accuracy than the observed accuracy.

Confidence Intervals: Use bootstrapping to compute a 95% CI for each measure.
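The randomization procedure might look like the following Python sketch. It counts instances whose accuracy is at least as high as the observed one, a slightly more conservative choice than "higher"; the leave-one-out closest-neighbor accuracy is a stand-in for any prediction method, and all names and data are hypothetical.

```python
import random


def loo_accuracy(genotypes, statuses):
    """Leave-one-out accuracy of a closest-neighbor predictor (a stand-in)."""
    hits = 0
    for i, g in enumerate(genotypes):
        others = [k for k in range(len(genotypes)) if k != i]
        nearest = min(
            others,
            key=lambda k: sum(a != b for a, b in zip(genotypes[k], g)),
        )
        hits += statuses[nearest] == statuses[i]
    return hits / len(genotypes)


def permutation_p_value(genotypes, statuses, accuracy_of, n_instances=1000, seed=0):
    """Fraction of random relabelings that predict as well as the real labels."""
    rng = random.Random(seed)
    observed = accuracy_of(genotypes, statuses)
    at_least_as_good = 0
    for _ in range(n_instances):
        shuffled = list(statuses)
        rng.shuffle(shuffled)  # permute the disease status of the population
        if accuracy_of(genotypes, shuffled) >= observed:
            at_least_as_good += 1
    return at_least_as_good / n_instances


# Hypothetical population in which genotype fully determines status.
genotypes = ["0000"] * 10 + ["2222"] * 10
statuses = [-1] * 10 + [1] * 10
p_value = permutation_p_value(genotypes, statuses, loo_accuracy, n_instances=200)
print(p_value)
```

A fixed seed keeps the estimate reproducible; more instances give a finer-grained p-value at proportionally higher cost.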

Data Sets


Crohn's disease (Daly et al.): inflammatory bowel disease (IBD).

  Location: 5q31

  Number of SNPs: 103

  Population size: 387 (case: 144, control: 243)

Autoimmune disorders (Ueda et al.):

  Location: region containing the genes CD28, CTLA4, and ICOS

  Number of SNPs: 108

  Population size: 1036 (case: 384, control: 652)

Experiment Results

(IEEE International Conference on Granular Computing, W. Mao, et al.)

Conclusions


SNPs are genetic risk factors for complex diseases.

Most known methods focus on single markers and are not applicable to complex diseases.

We propose several ad hoc algorithms to predict genetic susceptibility and integrated risk factors for complex diseases.

Our algorithms are shown to have higher statistical significance and a higher prediction rate than universal methods.

Thank You!

Questions?