Codons, Genes and Networks


12 Nov. 2013









Bioinformatics service







Math@Bio group of M. Gromov


Andrei Zinovyev

Plan of the talk


Part I: 7-cluster structure of the genome (codons and genes)

Part II: Coding and non-coding DNA scaling laws (genes and networks)


Part I: 7-cluster genome structure

Dr. Tatyana Popova, R&D Centre in Biberach, Germany

Prof. Alexander Gorban, Centre for Mathematical Modelling




Genomic sequence as a text in an unknown language

Frequency dictionaries:

$N = 4 = 4^1$

$N = 16 = 4^2$

$N = 64 = 4^3$

$N = 256 = 4^4$

..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc…
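A frequency dictionary of word length k can be sketched as follows. This is a minimal illustration, assuming overlapping words and relative frequencies; the function name `frequency_dictionary` and the sample string are illustrative choices, not taken from the talk.

```python
from collections import Counter


def frequency_dictionary(text, k):
    """Relative frequencies of all overlapping words of length k in text."""
    words = [text[i:i + k] for i in range(len(text) - k + 1)]
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


sample = "cgtggtgagctgatgctagggacgcacgtggtgagctgatgc"
for k in (1, 2, 3, 4):
    d = frequency_dictionary(sample, k)
    # At most 4**k distinct words can occur over the alphabet {a, c, g, t}.
    print(k, len(d), "of", 4 ** k, "possible words observed")
```

The dictionary size grows as 4^k, which is why short fragments can only populate the dictionaries for small k reliably.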


From text to geometry

cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc

Sequence length ~ $10^7$




cgtggtgagctgatgctagggacgcac

ggtgagctgatgctagggacgcacact

tgagctgatgctagggacgcacaattc


gtgagctgatgctagggacgcacggtg

……

gagctgatgctagggacgcacaagtga

Fragment length ~200-400

10000-20000 fragments

$R^N$
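The step from text to points in R^N can be sketched for triplets (N = 64). This is a sketch under stated assumptions: the window width of 300 and the step of 30 are illustrative values inside the 200-400 range quoted above, triplets are counted without overlap from the start of each window, and all names are hypothetical.

```python
from itertools import product

ALPHABET = "acgt"
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]  # 64 words
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def window_to_vector(window):
    """Frequencies of the non-overlapping triplets read from the window start."""
    vec = [0.0] * len(TRIPLETS)
    n = 0
    for i in range(0, len(window) - 2, 3):
        word = window[i:i + 3]
        if word in INDEX:  # skip words containing ambiguous letters
            vec[INDEX[word]] += 1.0
            n += 1
    return [v / n for v in vec] if n else vec


def sequence_to_points(seq, width=300, step=30):
    """Slide a window over seq and map every fragment to a point in R^64."""
    seq = seq.lower()
    return [window_to_vector(seq[s:s + width])
            for s in range(0, len(seq) - width + 1, step)]
```

Because the window start is not aligned with genes, a fragment inside a coding region reads the codons in one of three phases, which is what later separates the clusters.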

Method of visualization

principal component analysis

$R^N \to R^2$
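The projection from R^N to R^2 can be illustrated with a dependency-free sketch. It uses power iteration with deflation to find the first two principal components; this is one simple way to compute PCA, chosen here for self-containment, and is not claimed to be the procedure used for the plots in the talk. All function names are hypothetical.

```python
import math
import random


def center(points):
    """Subtract the mean point from every point."""
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    return [[p[j] - mean[j] for j in range(dim)] for p in points]


def leading_direction(points, iterations=200, seed=0):
    """Power iteration for the top eigenvector of X^T X (X = centered data)."""
    rng = random.Random(seed)
    dim = len(points[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        scores = [sum(x * y for x, y in zip(p, v)) for p in points]
        w = [sum(s * p[j] for s, p in zip(scores, points)) for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no variance left in the data
            return [0.0] * dim
        v = [x / norm for x in w]
    return v


def pca_2d(points):
    """Project points from R^N onto their first two principal components."""
    x = center(points)
    pc1 = leading_direction(x)
    s1 = [sum(a * b for a, b in zip(p, pc1)) for p in x]
    # Deflation: remove the pc1 component from every point, then repeat.
    residual = [[a - s * b for a, b in zip(p, pc1)] for p, s in zip(x, s1)]
    pc2 = leading_direction(residual, seed=1)
    s2 = [sum(a * b for a, b in zip(p, pc2)) for p in x]
    return list(zip(s1, s2))
```

The signs of the components are arbitrary, so a plot produced this way may appear mirrored relative to another PCA implementation.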

PCA plot

Caulobacter crescentus


singles, N=4

doublets, N=16

triplets, N=64

quadruplets, N=256

!!!

The information in genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

First explanation

cgtggtga gctgatgctagggrcgcacgtgg tgagctgatgct agggrcgacgtggtgagctgatg ctagggrcgc


tga tgc tag ggr cgc acg tgg



ctg atg cta ggg rcg cac gtg
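The three ways of reading a fragment as non-overlapping triplets can be reproduced with a short sketch. The function name `triplets` is illustrative; the fragment is the 23-letter window shown on the slide.

```python
def triplets(fragment, shift):
    """Split fragment into non-overlapping triplets starting at offset shift."""
    return [fragment[i:i + 3] for i in range(shift, len(fragment) - 2, 3)]


fragment = "gctgatgctagggrcgcacgtgg"  # window taken from the slide
for shift in (0, 1, 2):
    print(shift, " ".join(triplets(fragment, shift)))
# 0 gct gat gct agg grc gca cgt
# 1 ctg atg cta ggg rcg cac gtg
# 2 tga tgc tag ggr cgc acg tgg
```

Only one of the three shifts coincides with the true codon boundaries of a gene, so the same coding text yields three different triplet distributions.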


Basic 7-cluster structure

gtga gctgatgctagggrcgcacgtgg tgagc


gct gat gct agg grc gca cgt


gtga atcggtgggtgagtgtgctgcta tgagc


atc ggt ggg tga gtg tgc tgc


tcg gtg ggt gag tgt gct gct






cgg tgg gtg agt gtg ctg ctg



Non-coding parts

gtga gctgatgctagggr cgcacg aat

Point mutations: insertions, deletions

The flower-like 7-cluster structure is flat

[Figure: plot with vertical axis ticks from 0 to 0.16 and horizontal axis ticks from 1 to 29]

Seven classes vs seven clusters

Stanford

TIGR

Georgia Institute of Technology

Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194.

Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17):10026-31, 1998.

Lomsadze A., Ter-Hovhannisyan V., Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.

Computational gene prediction

Accuracy >90%

Mean-field approximation for triplet frequencies

$F_{IJK} \approx P^1_I P^2_J P^3_K$

$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \dots, F_{GGC}, F_{GGG}$: 64 numbers

Position-specific letter frequency + correlations

$P^j_i$: 12 numbers

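The mean-field approximation replaces the 64 triplet frequencies by products of 12 position-specific letter frequencies. A minimal sketch follows, assuming the codons are already given in the correct reading frame; the function names are hypothetical and the short codon list is taken from the example gene shown later in the talk.

```python
from itertools import product

LETTERS = "acgt"


def position_frequencies(codons):
    """P[j][letter]: frequency of letter at codon position j (12 numbers)."""
    p = [{x: 0.0 for x in LETTERS} for _ in range(3)]
    for codon in codons:
        for j, letter in enumerate(codon):
            p[j][letter] += 1.0 / len(codons)
    return p


def mean_field(p):
    """64 approximate triplet frequencies F_IJK ~ P1_I * P2_J * P3_K."""
    return {i + j + k: p[0][i] * p[1][j] * p[2][k]
            for i, j, k in product(LETTERS, repeat=3)}


codons = ["atg", "gat", "gct", "agg", "gtc", "gca", "cgc", "taa"]
approx = mean_field(position_frequencies(codons))
print(len(approx), round(sum(approx.values()), 6))
```

The difference between the observed triplet frequencies and these products is the "correlations" term.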
Why hexagonal symmetry?

0-+, -+0, +0-, +-0, -0+, 0+-

GC-content = $P_C + P_G$


Genome codon usage and mean-field approximation

ggtga ATG gat gct agg … gtc gca cgc TAA tgagct

correct frameshift: 64 frequencies $F_{IJK}$

ggtga ATG gat gct agg … gtc gca cgc TAA tgagct

12 frequencies $P^1_I$, $P^2_J$, $P^3_K$

$P^j_I$ are linear functions of GC-content

eubacteria

archaea

THE MYSTERY OF TWO STRAIGHT LINES ???

$R^{12}$

$R^{64}$

$F_{IJK} = P^1_I P^2_J P^3_K$ + correlations

Codon usage signature

0-+

19 possible eubacterial signatures

Example:

Palindromic signatures

Four symmetry types of the basic 7-cluster structure

flower-like

degenerated

perpendicular triangles

parallel triangles

B. halodurans (GC=44%)

S. coelicolor (GC=72%)

F. nucleatum (GC=27%)

E. coli (GC=51%)

Using branching principal components to analyze 7-cluster genome structures

Streptomyces coelicolor

Bacillus halodurans

Escherichia coli

Fusobacterium nucleatum


Web site

http://www.ihes.fr/~zinovyev/7clusters

cluster structures in genomic sequences

Papers (type Zinovyev in Google)

Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint.

Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.

Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.

Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.

Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10(4).

Part II: Coding and non-coding DNA scaling laws

Dr. Thomas Fink, Bioinformatics service

Dr. Sebastian Ahnert, Cavendish Laboratory, University of Cambridge

C-value and G-value paradox

Neither genome length nor gene number accounts for the complexity of an organism


Drosophila melanogaster (fruit fly) C=120 Mb

Podisma pedestris (mountain grasshopper) C=1650 Mb



Non-linear growth of regulation

Mattick, J. S. Nature Reviews Genetics 5, 316-323 (2004).

“Amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated

[Figure: log number of regulatory genes vs log number of genes for bacteria and archaea; lines with slope = 1.96 and slope = 1]
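A slope on a log-log plot is the exponent of a power law. The sketch below fits that slope by ordinary least squares; the data points are synthetic (generated from an exact quadratic law) and serve only to show the calculation, they are not data from the talk. The function name is hypothetical.

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Synthetic illustration only: regulators = 1e-4 * genes**2.
genes = [500, 1000, 2000, 4000, 8000]
regulators = [1e-4 * n ** 2 for n in genes]
slope = loglog_slope(genes, regulators)
print(round(slope, 2))
```

A fitted slope close to 2 corresponds to the quadratic growth of regulation, whereas a slope of 1 would mean that regulators grow in proportion to the number of genes.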

Complexity ceiling for prokaryotes

Adding a new function $\Delta S$ requires adding a regulatory overhead $\Delta R$; the total increase is

$\Delta N = \Delta R + \Delta S$

Since $R \sim N^2$, at some point $\Delta R > \Delta S$, i.e. the gain from a new function becomes too expensive for an organism: it requires too much regulation to be integrated.

There is a maximum possible genome length for prokaryotes (~10 Mb)
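The crossover point can be made explicit with a short worked derivation. It assumes, for illustration only, that the scaling holds exactly as R = aN^2 with a constant a > 0; the symbol a is introduced here and is not defined on this slide.

```latex
% Assumption: R = a N^2 holds exactly, with a constant a > 0.
\begin{align*}
N &= R + S, \qquad R = a N^2 \\
\Delta R &\approx 2 a N \, \Delta N, \qquad
\Delta S = \Delta N - \Delta R \approx (1 - 2 a N)\, \Delta N \\
\Delta R > \Delta S
  &\iff 2 a N > 1 - 2 a N
  \iff N > \frac{1}{4a}
\end{align*}
```

In the same idealization, the functional part S = N - aN^2 stops growing at N = 1/(2a), which gives a hard upper bound on the number of genes.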

How did eukaryotes bypass this limitation?

Presumably, they invented a cheaper (digital) regulatory system, based on RNA

This regulatory information is stored in the “non-coding” DNA


Simple model: Accelerated networks

Node is a gene ($c$ genes)

Edge is a “regulation” ($n$ edges)

$n = a c^2$

Connectivity < $k_{max}$: regulators are only proteins

Connectivity > $k_{max}$: deficit of regulations is taken from non-coding DNA

How much regulation does a genome need to take from non-coding DNA?

$n_{deficit} = \frac{k_{max}}{2}\,\frac{c}{c_{max}}\,(c - c_{max})$

$c_{max}$: prokaryotic ceiling
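The deficit of regulations can be derived from the accelerated-network model in two lines. The derivation below is a sketch under two assumptions that the slides do not state explicitly: "connectivity" is read as the mean degree k = 2n/c, and a is the constant in n = a c^2.

```latex
% Assumptions: "connectivity" is the mean degree k = 2n/c,
% and a is the constant in n = a c^2.
\begin{align*}
k &= \frac{2n}{c} = 2 a c \le k_{max}
  \quad\Longrightarrow\quad c_{max} = \frac{k_{max}}{2a} \\
n_{deficit} &= a c^2 - \frac{k_{max}}{2}\, c
  = \frac{k_{max}}{2}\,\frac{c}{c_{max}}\,(c - c_{max}),
  \qquad c > c_{max}
\end{align*}
```

The first term is the number of regulations the network needs and the second is the largest number that protein regulators can supply at mean degree k_max, so the deficit vanishes at c = c_max and grows quadratically above it.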

These regulations must be encoded in the non-coding part of the genome, therefore

$N = \frac{b}{2}\,\frac{C}{C_{prok}}\,(C - C_{prok})$

$N$: non-coding DNA length

$C$: coding DNA length

$C_{prok}$: ceiling for prokaryotes (~10 Mb)

$b$: some coefficient

Observation: coding length vs non-coding length

[Figure: non-coding length vs coding length; the curve for $b = 1$ shows the minimum non-coding length needed for the “deficit” regulation]

Hypothesis

Prokaryotes:

<Non-coding length> = $a$ <Coding length>

$a \approx$ 5-15% (little constant add-on, promoters, UTRs…)

15% ≈ 1/7

Eukaryotes:

$N_{reg} = \frac{b}{2}\,\frac{C}{C_{maxprok}}\,(C - C_{maxprok}) \sim C^2$

$C_{maxprok} \approx 10$ Mb, $b \approx 1$

This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger

This is only a hypothesis, but…

Prediction of $N_{reg}$ for human:

$N_{reg}$ = 87 Mb = 3% of genome length

$C$ = 48 Mb = 1.7%

$N_{reg} + C$ = 4.7%
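The order of magnitude of this prediction can be checked with a small calculation. The sketch below evaluates the eukaryotic formula with the rounded values given in the talk (C = 48 Mb, C_maxprok = 10 Mb, b = 1); the function name and the choice to return zero below the ceiling are assumptions made for the illustration.

```python
def regulatory_noncoding_mb(coding_mb, ceiling_mb=10.0, b=1.0):
    """N_reg = (b / 2) * (C / C_maxprok) * (C - C_maxprok), in Mb.

    Returns 0 below the prokaryotic ceiling, where the model assumes that
    proteins alone supply all regulation.
    """
    if coding_mb <= ceiling_mb:
        return 0.0
    return 0.5 * b * (coding_mb / ceiling_mb) * (coding_mb - ceiling_mb)


human = regulatory_noncoding_mb(48.0)
print(round(human, 1))  # about 91 Mb with the rounded inputs
```

With these rounded inputs the result is about 91 Mb, the same order as the 87 Mb quoted above; the difference presumably comes from unrounded values of the ceiling or of b, which the slides do not give.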


Thank you for your attention


Questions?