Bioinformatics 3 V 3 – Data for Building Networks

Mon, Oct 22, 2012

Bioinformatics 3, WS 12/13, V 3

Graph Layout 1

Requirements:

• fast and stable

• nice graphs

• visualize relations

• symmetry

• interactive exploration

• …

Force-directed layout: based on energy minimization
→ runtime
→ mapping into 2D

H3: for hierarchic graphs
→ MST-based cone layout
→ hyperbolic space

→ efficient layout for biological data???


Aim: analyze and visualize homologies within the protein universe

50 genomes, 145579 proteins, 21 × 10^9 BLASTP pairwise sequence comparisons

→ need to visualize an extremely large network!
→ develop a stepwise scheme

Expectations:

• homologs will be close together

• fusion proteins ("Rosetta Stone proteins") will link proteins of related function


LGL: stepwise scheme

Adai et al. J. Mol. Biol. 340, 179 (2004)

(0) create network from BLAST E-values
→ 145'579 proteins, E < 10^-12
→ 1'912'684 links, 30737 proteins in the largest cluster

(1) separate original network into connected sets
→ 11517 connected components, 33975 proteins w/out links

(2) force-directed layout of each component independently, based on a MST

(3) integrate connected sets into one coordinate system via a funnel process, starting from the largest set

The first connected set is placed at the bottom of a potential funnel. Other sets are placed one at a time on the rim of the potential funnel and allowed to fall towards the bottom, where they are frozen in space upon collision with the previous sets.
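Steps (0) and (1) can be sketched in a few lines. This is a minimal illustration, not the LGL code: the function names `build_network` and `connected_sets` and the `(protein_a, protein_b, e_value)` hit format are assumptions made for this example; only the cutoff E < 10^-12 comes from the slide.

```python
from collections import defaultdict

def build_network(hits, cutoff=1e-12):
    """Step (0): keep a link for every BLAST hit with E-value below the cutoff.

    `hits` is an iterable of (protein_a, protein_b, e_value) tuples
    (hypothetical input format, chosen for this sketch).
    """
    return [(a, b) for a, b, e in hits if a != b and e < cutoff]

def connected_sets(edges, nodes=()):
    """Step (1): split an undirected network into connected sets, largest first."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    for n in nodes:
        adjacency[n]  # proteins without links become singleton sets
    seen, components = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # depth-first traversal of one connected set
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Toy data: the A-D hit is too weak (E = 1e-3) and is dropped, F has no hits.
hits = [("A", "B", 1e-30), ("B", "C", 1e-20), ("D", "E", 1e-15), ("A", "D", 1e-3)]
components = connected_sets(build_network(hits), nodes="ABCDEF")
```

Sorting by size matches step (3), which starts the funnel process from the largest set.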


Component Layout I

For each component independently:

→ start from the root node of the MST

Root node: assigned arbitrarily or by an educated guess (the most "central" node)

Centrality: minimize total distance to all other nodes in the component

Layout → place root at the center

Level-n nodes: nodes that are n links away from the root in the MST
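The root choice and the level assignment can be sketched as follows. Two assumptions are made here that the slide does not state: distances are counted in hops (unweighted links), and the component and its spanning tree are given as adjacency dictionaries. All function names are hypothetical.

```python
from collections import deque

def hop_distances(adjacency, source):
    """Breadth-first search: number of links from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def most_central_node(adjacency):
    """Node with the smallest total distance to all other nodes (ties: first by name)."""
    return min(sorted(adjacency),
               key=lambda n: sum(hop_distances(adjacency, n).values()))

def levels_from_root(tree, root):
    """Group tree nodes by level n = number of links away from the root."""
    levels = {}
    for node, d in hop_distances(tree, root).items():
        levels.setdefault(d, []).append(node)
    return levels

# Toy spanning tree: a simple chain a - b - c - d - e
tree = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
root = most_central_node(tree)
levels = levels_from_root(tree, root)
```

Running one search per candidate root costs O(N·(N+E)), which is acceptable for a sketch but would need care for components with tens of thousands of proteins.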


Component Layout II

Adai et al. J. Mol. Biol. 340, 179 (2004)

• start with root node of the MST

• place level-1 nodes on circle (sphere) around root, add all links, relax springs (+ short-range repulsion)

• place level-2 nodes on circles (spheres) outside their level-1 parents, add all links, relax springs

• place level-3 nodes on circles (spheres) outside their level-2 parents, …


Combining the Components

When the components are finished → assemble using energy funnel

• place largest component at bottom

• place next smaller one somewhere on the rim, let it slide down
→ freeze upon contact

No information in the relative positions of the components!!!

Adai et al. J. Mol. Biol. 340, 179 (2004)

[Figure slide (footer: Bioinformatics 3, WS 11/12, Tihamer Geyer)]

Adai et al. J. Mol. Biol. 340, 179 (2004)


Annotations in the Largest Cluster

Related functions in the same regions of the cluster → predictions

Adai et al. J. Mol. Biol. 340, 179 (2004)


Clustering of Functional Classes

Adai et al. J. Mol. Biol. 340, 179 (2004)


Fusion Proteins

Fusion proteins connect two protein homology families
→ A, A', A'', AB and B, B', AB

Also in the network:

homologies <=> edges

remote homologies <=> in the same cluster

non-homologous functional relations <=> adjacent, linked clusters

→ historic genetic events: fusion, fission, duplications, …

Adai et al. J. Mol. Biol. 340, 179 (2004)


Functional Relations between Gene Families

Examples of spatial localization of protein function in the map

A: the linkage of the tryptophan synthase α family to the functionally coupled but non-homologous β family by the yeast tryptophan synthase αβ fusion protein

B: protein subunits of the pyruvate synthase and alpha-ketoglutarate ferredoxin oxidoreductase complexes

C: metabolic enzymes, particularly those of acetyl CoA and amino acid metabolism
→ DUF213 likely has metabolic function!

Adai et al. J. Mol. Biol. 340, 179 (2004)


And the Winner iiiis…

Adai et al. J. Mol. Biol. 340, 179 (2004)

Compare the layouts from

A: LGL
→ hierarchic force-directed layout according to MST
→ structure from homology

B: global force-directed layout without MST
→ no structure, no components visible

C: InterViewer
→ collapses similar nodes
→ reduced complexity


Graph Layout: Summary

Approach                              | Idea
Force-directed spring model           | relax energy, springs of appropriate lengths
Force-directed spring-electric model  | relax energy, springs for links, Coulomb repulsion between all nodes
H3                                    | spanning tree in hyperbolic space
LGL                                   | hierarchic, force-directed algorithm for modules

→ the same physical concept, different implementations!
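As an illustration of the spring-electric idea, here is a minimal 2D sketch that relaxes the energy by plain gradient descent. It is not the algorithm of any of the tools named above; the function name, the force constants, the step size, and the cap on the displacement per step are all assumptions chosen to keep the toy example stable.

```python
import math
import random

def spring_electric_layout(nodes, edges, steps=1000, rest_length=1.0,
                           k_spring=1.0, k_coulomb=0.1,
                           step_size=0.02, max_move=0.1, seed=0):
    """Springs along links plus Coulomb-like repulsion between all node pairs."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion ~ 1/r^2 between every pair of nodes
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                r = math.hypot(dx, dy) or 1e-9
                f = k_coulomb / r ** 2
                force[a][0] += f * dx / r
                force[a][1] += f * dy / r
                force[b][0] -= f * dx / r
                force[b][1] -= f * dy / r
        # Harmonic springs pull linked nodes towards the rest length
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            r = math.hypot(dx, dy) or 1e-9
            f = k_spring * (r - rest_length)
            force[a][0] += f * dx / r
            force[a][1] += f * dy / r
            force[b][0] -= f * dx / r
            force[b][1] -= f * dy / r
        # Move each node along its force, capping the step length
        for n in nodes:
            mx, my = step_size * force[n][0], step_size * force[n][1]
            length = math.hypot(mx, my)
            if length > max_move:
                mx, my = mx * max_move / length, my * max_move / length
            pos[n][0] += mx
            pos[n][1] += my
    return pos

# Toy example: a triangle relaxes to three roughly equal edge lengths
triangle_nodes = ["a", "b", "c"]
triangle_edges = [("a", "b"), ("b", "c"), ("a", "c")]
layout = spring_electric_layout(triangle_nodes, triangle_edges)
```

The all-pairs repulsion makes each step O(N²), which is one reason why a global force-directed layout becomes impractical for very large networks.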




A "Network"

So far: G = (E, V)

"Graph" = more than the sum of the individual parts ????

Edges = encode the connectivity

Vertices = the "things" to be connected

Classified by:

• degree distribution

• clustering

• connected components

• …

→ what are interesting biological "things"?
→ how are they connected?
→ is the information accessible/reliable?


Protein Complexes

Protein machinery built from parts via dimerization and oligomerization

Assembly of structures

Cooperation and allostery

Complex formation may lead to modification of the active site

Complex formation may lead to increased diversity


Gel Electrophoresis

Electrophoresis: directed diffusion of charged particles in an electric field

faster: higher charge, smaller

slower: lower charge, larger

Put proteins in a spot on a gel-like matrix, apply electric field
→ separation according to size (mass) and charge
→ identify constituents of a complex

Nasty details: protein charge vs. pH, cloud of counter ions, protein shape, denaturation, …


SDS-PAGE

SDS-polyacrylamide gel electrophoresis

For better control: denature proteins with detergent

Often used: sodium dodecyl sulfate (SDS)
→ denatures and coats the proteins with a negative charge
→ charge proportional to mass
→ traveled distance per time

For "quantitative" analysis: compare to marker (set of proteins with known masses)

After the run: staining to make proteins visible

Image from Wikipedia, marker in the left lane


Protein Charge?

Main source for charge differences: pH-dependent protonation states

<=> Equilibrium between

• density (pH) dependent H+-binding and

• density independent H+-dissociation

Probability to have a proton:

pKa = pH value for 50% protonation

[Plot: protonation probability P versus pH]

Asp 3.7-4.0 … His 6.7-7.1 … Lys 9.3-9.5

Each H+ has a +1e charge
→ Isoelectric point: pH at which the protein is uncharged
→ protonation state cancels permanent charges
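The probability formula is not legible in this copy of the slide. A common form that is consistent with "pKa = pH value for 50% protonation" is the Henderson-Hasselbalch relation, P = 1 / (1 + 10^(pH - pKa)); the sketch below assumes that form. The function names and the toy "protein" with one acidic and one basic group are hypothetical; the pKa values 3.9 and 9.4 are picked from within the Asp and Lys ranges listed above.

```python
def protonated_fraction(pH, pKa):
    """Probability that a titratable group carries its proton (assumed Henderson-Hasselbalch form)."""
    return 1.0 / (1.0 + 10.0 ** (pH - pKa))

def net_charge(pH, acidic_pKas, basic_pKas):
    """Basic groups are +1 when protonated, acidic groups are -1 when deprotonated."""
    positive = sum(protonated_fraction(pH, pKa) for pKa in basic_pKas)
    negative = sum(1.0 - protonated_fraction(pH, pKa) for pKa in acidic_pKas)
    return positive - negative

def isoelectric_point(acidic_pKas, basic_pKas, lo=0.0, hi=14.0):
    """Bisection for the pH with zero net charge (the charge decreases monotonically with pH)."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if net_charge(mid, acidic_pKas, basic_pKas) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy example: one Asp-like group (pKa 3.9) and one Lys-like group (pKa 9.4)
pI = isoelectric_point([3.9], [9.4])
```

For this symmetric two-group case the isoelectric point is the mean of the two pKa values, which gives a quick check on the bisection.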


2D Gel Electrophoresis

Two steps:

i) separation by isoelectric point via pH-gradient

ii) separation by mass with SDS-PAGE

low pH: protonated => pos. charge

high pH: unprotonated => neg. charge

→ Most proteins differ in mass and isoelectric point (pI)

[Figure: Step 1 (pH gradient), Step 2 (SDS-PAGE)]


Mass Spectrometry

Identify constituents of a (fragmented) complex via their mass patterns, detect by pattern recognition with machine learning techniques.

http://gene-exp.ipk-gatersleben.de/body_methods.html


Tandem affinity purification

The yeast two-hybrid method can only identify binary complexes.

In affinity purification, a protein of interest (bait) is tagged with a molecular label (dark route in the middle of the figure) to allow easy purification. The tagged protein is then co-purified together with its interacting partners (W-Z). This strategy can also be applied on a genome scale.

Identify proteins by mass spectrometry (MALDI-TOF).

Gavin et al. Nature 415, 141 (2002)

TAP analysis of yeast PP complexes

Gavin et al. Nature 415, 141 (2002)

Identify proteins by scanning the yeast protein database for proteins composed of fragments of suitable mass.

Here, the identified proteins are listed according to their localization (a).

(b) lists the number of proteins per complex.

Validation of TAP methodology

Gavin et al. Nature 415, 141 (2002)

Check of the method: can the same complex be obtained for different choices of attachment point (tag protein attached to different components of the complex)?

Yes, more or less (see gel in (a)).


Pros and Cons

Advantages:

• quantitative determination of complex partners in vivo without prior knowledge

• simple, high yield, high throughput

Difficulties:

• tag may prevent binding of the interaction partners

• tag may change (relative) expression levels

• tag may be buried between interaction partners
→ no binding to beads


Yeast Two-Hybrid Screening

Discover binary protein-protein interactions via physical interaction

complex of binding domain (BD) + activator domain (AD)

fuse bait to BD, prey to AD
→ expression only when a bait:prey complex forms


Performance of Y2H

Advantages:

• in vivo test for interactions

• cheap + robust
→ large scale tests

Problems:

• investigate the interaction between (i) overexpressed (ii) fusion proteins in the (iii) yeast (iv) nucleus

• spurious interactions via third protein

→ many false positives (up to 50% errors)


Synthetic Lethality

Apply two mutations that are viable on their own,

but lethal when combined.


In cancer therapy, this effect implies that inhibiting one of these genes
in a context where the other is defective should be selectively lethal to
the tumor cells but not toxic to the normal cells, potentially leading to a
large therapeutic window.

Synthetic lethality may point to:

• physical interaction (building blocks of a complex)

• both proteins belong to the same pathway

• both proteins have the same function (redundancy)

http://jco.ascopubs.org/


Gene Coexpression

All constituents of a complex should be present at the same point in the cell cycle
→ test for correlated expression

No direct indication for complexes (too many co-regulated genes), but useful "filter"-criterion

Standard tool: DNA micro arrays

DeRisi, Iyer, Brown, Science 278 (1997) 680:
Diauxic shift from fermentation to respiration in S. cerevisiae
→ Identify groups of genes with similar expression profiles


DNA Microarrays

Fluorescence labeled DNA (cDNA) applied to micro arrays
→ hybridization with complementary library strand
→ fluorescence indicates relative cDNA amounts

two labels (red + green) for experiment and control

Usually: red = signal, green = control
→ yellow = "no change"


Diauxic Shift

image analysis + clustering

DeRisi, Iyer, Brown, Science 278 (1997) 680

Identify groups of genes with similar time courses = expression profiles
→ "cause or correlation"?
→ biological significance?
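One simple way to test for "similar time courses" is a pairwise Pearson correlation between expression profiles. This is only a sketch of the idea, not the clustering used by DeRisi et al.; the gene names, the profile values, and the threshold of 0.9 are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long time courses."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def coexpressed_pairs(profiles, threshold=0.9):
    """All gene pairs whose profiles correlate at least as strongly as the threshold."""
    genes = sorted(profiles)
    return [(g, h)
            for i, g in enumerate(genes)
            for h in genes[i + 1:]
            if pearson(profiles[g], profiles[h]) >= threshold]

# Hypothetical log-ratio time courses for three genes
profiles = {
    "geneA": [0.0, 1.0, 2.0, 3.0],
    "geneB": [0.1, 1.1, 2.0, 3.2],   # rises together with geneA
    "geneC": [3.0, 2.0, 1.0, 0.0],   # anti-correlated with geneA
}
pairs = coexpressed_pairs(profiles)
```

A high correlation only flags candidates: as noted above, co-regulated genes need not form a complex.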


Interaction Databases

Bioinformatics: make use of existing databases


(low) Overlap of Results

For yeast: ~ 6000 proteins => ~18 million potential interactions
→ rough estimates: ≤ 100000 interactions occur
→ 1 true positive for 200 potential candidates = 0.5%
→ decisive experiment must have accuracy << 0.5% false positives

Different experiments detect different interactions

For yeast: 80000 interactions known, 2400 found in > 1 experiment

Problems with experiments:

i) incomplete coverage

ii) (many) false positives

iii) selective to type of interaction and/or compartment

[Figure: results from TAP, HMS-PCI, and Y2H; annotated: septin complex]

see: von Mering (2002)
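The estimate at the top of this slide can be checked with a few lines of arithmetic. The function names are hypothetical, and the scenario of a screen with perfect sensitivity is an assumption made only to show why the false-positive rate has to be far below 0.5%.

```python
def candidate_pairs(n_proteins):
    """Number of unordered protein pairs, n * (n - 1) / 2."""
    return n_proteins * (n_proteins - 1) // 2

def expected_precision(n_proteins, n_true, false_positive_rate, sensitivity=1.0):
    """Fraction of reported interactions that are real, for a screen over all pairs."""
    pairs = candidate_pairs(n_proteins)
    true_positives = sensitivity * n_true
    false_positives = false_positive_rate * (pairs - n_true)
    return true_positives / (true_positives + false_positives)

pairs = candidate_pairs(6000)                    # about 18 million
prior = 100000 / pairs                           # roughly 0.5 %
precision_at_half_percent = expected_precision(6000, 100000, 0.005)
precision_at_tenth_of_that = expected_precision(6000, 100000, 0.0005)
```

With a false-positive rate of 0.5%, even a screen that finds every true interaction reports nearly as many wrong pairs as correct ones; a ten times lower rate brings the precision above 90%.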


Criteria for Reliability

Guiding principles (incomplete list!):

1) mRNA abundance:
most experimental techniques are biased towards high-abundance proteins

2) compartments:

• most methods have their "preferred compartment"

• proteins from same compartment => more reliable

3) co-functionality:
complexes have a functional reason (assumption!?)


In-Silico Prediction Methods

Sequence-based:

• gene clustering

• gene neighborhood

• Rosetta stone

• phylogenetic profiling

• coevolution

"Work on the parts list"
→ fast
→ unspecific
→ high-throughput methods for pre-sorting

Structure-based:

• interface propensities

• spatial simulations

"Work on the parts"
→ specific, detailed
→ expensive
→ accurate


Gene Clustering

Search for genes with a common promoter
→ when activated, all are transcribed together as one operon

Idea: functionally related proteins or parts of a complex are expressed simultaneously

Example: bioluminescence in V. fischeri, regulated via quorum sensing
→ three proteins: I, AB, CDE


Gene Neighborhood

Hypothesis again: functionally related genes are expressed together

"functionally" = same {complex | pathway | function | …}

→ Search for similar sequences of genes (gene order) in different organisms

[Figure: genome 1, genome 2, genome 3]

(<=> Gene clustering: one species, promoters)


Rosetta Stone Method

Multi-lingual stele from 196 BC, found by the French in 1799
→ key to deciphering hieroglyphs

Idea: same "names" in different genome "texts"

[Figure: sp 1, sp 2, sp 3, sp 4, sp 5]

Enright, Ouzounis (2001): 40000 predicted pair-wise interactions from search across 23 species


Phylogenetic Profiling

Idea: either all or none of the proteins of a complex should be present in an organism
→ compare presence of protein homologs across species (e.g., via sequence alignment)


Distances

Occurrence of proteins P1 to P7 in four species (1 = present, 0 = absent):

      EC  SC  BS  HI
P1     1   1   0   1
P2     1   1   1   0
P3     1   0   1   1
P4     1   1   0   0
P5     1   1   1   1
P6     1   0   1   1
P7     1   1   1   0

Hamming distance between two proteins: number of species in which their occurrences differ

      P1  P2  P3  P4  P5  P6  P7
P1     0   2   2   1   1   2   2
P2         0   2   1   1   2   0
P3             0   3   1   0   2
P4                 0   2   3   1
P5                     0   1   1
P6                         0   2
P7                             0

Two pairs with identical occurrence (distance 0): P2-P7 and P3-P6
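The distance matrix can be recomputed directly from the occurrence profiles. The profile values are the ones listed on this slide; the variable and function names are chosen for this sketch.

```python
from itertools import combinations

species = ["EC", "SC", "BS", "HI"]

# Presence (1) or absence (0) of each protein in the four species
profiles = {
    "P1": (1, 1, 0, 1),
    "P2": (1, 1, 1, 0),
    "P3": (1, 0, 1, 1),
    "P4": (1, 1, 0, 0),
    "P5": (1, 1, 1, 1),
    "P6": (1, 0, 1, 1),
    "P7": (1, 1, 1, 0),
}

def hamming(a, b):
    """Number of positions (species) in which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))

# Upper triangle of the distance matrix
distances = {(p, q): hamming(profiles[p], profiles[q])
             for p, q in combinations(profiles, 2)}

# Pairs with identical phylogenetic profiles
identical = [pair for pair, d in distances.items() if d == 0]
```

With only four species many profiles coincide by chance, so real applications compare profiles across many more genomes.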


Coevolution

Idea: not only similar static occurrence, but similar dynamic evolution

Interfaces of complexes are often better conserved than the rest of the protein surfaces.

Also: look for potential substitutes
→ anti-correlated
→ missing components of pathways
→ function prediction across species
→ novel interactions

i2h method

Schematic representation of the i2h method.

A: Family alignments are collected for two different proteins, 1 and 2, including corresponding sequences from different species (a, b, c, …).

B: A virtual alignment is constructed, concatenating the sequences of the probable orthologous sequences of the two proteins. Correlated mutations are calculated.

Pazos, Valencia, Proteins 47, 219 (2002)

Correlated mutations at interface

Correlated mutations evaluate the similarity in variation patterns between positions in a multiple sequence alignment.

Similarity of those variation patterns is thought to be related to compensatory mutations.

Calculate for each pair of positions i and j in the sequence a rank correlation coefficient (r_ij):

where the summations run over every possible pair of proteins k and l in the multiple sequence alignment.

S_ikl is the ranked similarity between residue i in protein k and residue i in protein l. S_jkl is the same for residue j.

S_i and S_j are the means of S_ikl and S_jkl.

Pazos, Valencia, Proteins 47, 219 (2002)
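The formula for r_ij is not legible in this copy of the slide. Assuming it is a standard correlation coefficient built from the quantities defined above (an assumption; the exact normalization should be checked against Pazos and Valencia), it would read:

```latex
r_{ij} =
  \frac{\sum_{k,l} \left( S_{ikl} - \bar{S}_i \right) \left( S_{jkl} - \bar{S}_j \right)}
       {\sqrt{\sum_{k,l} \left( S_{ikl} - \bar{S}_i \right)^2}\;
        \sqrt{\sum_{k,l} \left( S_{jkl} - \bar{S}_j \right)^2}}
```

Here \bar{S}_i and \bar{S}_j denote the means written as S_i and S_j in the text. In this form r_ij lies between -1 and 1, and values close to 1 mark positions i and j whose variation patterns are similar.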


Summary

What you learned today: how to get some data on PP interactions

SDS-PAGE
TAP
MS
Y2H
synthetic lethality
micro array
DB
gene clustering
gene neighborhood
Rosetta stone
phylogenetic profiling
coevolution

→ type of interaction?
→ reliability?
→ sensitivity?
→ coverage?
→ …

Next lecture: Fri, Oct. 26, 2012

• combining weak indicators: Bayesian analysis

• identifying communities in networks




Tutorial: ???