# Codons, Genes and Networks

Bioinformatics service

Math@Bio group of M. Gromov

Andrei Zinovyev

Plan of the talk

Part I: 7-cluster structure of the genome (codons and genes)

Part II: Coding and non-coding DNA scaling laws (genes and networks)

Part I: 7-cluster genome structure

Dr. Tatyana Popova, R&D Centre in Biberach, Germany

Prof. Alexander Gorban, Centre for Mathematical Modelling

Genomic sequence as a text in an unknown language

frequency dictionaries:

$N = 4 = 4^1$

$N = 16 = 4^2$

$N = 64 = 4^3$

$N = 256 = 4^4$

..cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc…
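
As an illustration of what a frequency dictionary is, here is a minimal Python sketch. It assumes that words are counted with an overlapping sliding window and that the alphabet is exactly `acgt`; the function name `frequency_dictionary` is hypothetical and the slide does not say how the dictionaries were actually built.

```python
from collections import Counter
from itertools import product

def frequency_dictionary(text, k):
    """Relative frequencies of all overlapping words of length k in text."""
    words = [text[i:i + k] for i in range(len(text) - k + 1)]
    counts = Counter(words)
    total = len(words)
    # List every possible word, so the dictionary always has 4**k entries.
    return {"".join(w): counts["".join(w)] / total
            for w in product("acgt", repeat=k)}

# Sequence taken from the slide text.
sequence = ("cgtggtgagctgatgctagggacgcacgtggtgagctgatgctaggg"
            "acgacgtggtgagctgatgctagggacgc")

for k in (1, 2, 3, 4):
    print(k, len(frequency_dictionary(sequence, k)))
```

The printed sizes follow the $N = 4^k$ pattern listed above.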

From text to geometry

cgtggtgagctgatgctagggacgcacgtggtgagctgatgctagggacgacgtggtgagctgatgctagggacgc

$10^7$ letters

cgtggtgagctgatgctagggacgcac

ggtgagctgatgctagggacgcacact

tgagctgatgctagggacgcacaattc

gtgagctgatgctagggacgcacggtg

……

gagctgatgctagggacgcacaagtga

Fragment length ~200-400; 10000-20000 fragments, each a point in $R^N$
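
The step from text to geometry can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the talk: the window width of 300 and step of 150 are hypothetical values chosen inside the quoted 200-400 range, each fragment is read as non-overlapping triplets starting from its first letter, and the names `triplet_vector` and `fragments_to_points` are invented here.

```python
from itertools import product

CODONS = ["".join(p) for p in product("acgt", repeat=3)]  # the 64 triplets

def triplet_vector(fragment):
    """Frequencies of non-overlapping triplets, read from the first letter."""
    counts = dict.fromkeys(CODONS, 0)
    n = 0
    for i in range(0, len(fragment) - 2, 3):
        word = fragment[i:i + 3]
        if word in counts:  # skip triplets containing other letters
            counts[word] += 1
            n += 1
    return [counts[c] / n if n else 0.0 for c in CODONS]

def fragments_to_points(sequence, width=300, step=150):
    """Slide a window along the sequence; each window becomes a point in R^64."""
    return [triplet_vector(sequence[s:s + width])
            for s in range(0, len(sequence) - width + 1, step)]
```

Each fragment is turned into a vector of 64 frequencies, so a genome becomes a cloud of points in $R^{64}$.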

Method of visualization: principal components analysis

$R^N \to R^2$
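
To make the projection concrete, here is a small dependency-free sketch of PCA that finds the first two principal axes by power iteration with deflation. Power iteration is a choice made for this example only; the talk does not say which PCA implementation was used, and the function name `pca_2d` and its parameters are hypothetical.

```python
import random

def pca_2d(points, iterations=200, seed=0):
    """Project points from R^N onto their first two principal components."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    axes = []

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    def deflate(v):
        # Remove the components of v along the axes found so far.
        for axis in axes:
            proj = dot(v, axis)
            v = [vi - proj * ai for vi, ai in zip(v, axis)]
        return v

    def cov_times(v):
        # Covariance matrix times v, computed as X^T (X v) / n.
        scores = [dot(row, v) for row in centered]
        return [sum(s * row[j] for s, row in zip(scores, centered)) / n
                for j in range(dim)]

    rng = random.Random(seed)
    for _ in range(2):
        v = deflate([rng.gauss(0, 1) for _ in range(dim)])
        norm = dot(v, v) ** 0.5
        v = [vi / norm for vi in v]
        for _ in range(iterations):
            w = deflate(cov_times(v))
            norm = dot(w, w) ** 0.5
            if norm < 1e-15:  # no variance left in this direction
                break
            v = [wi / norm for wi in w]
        axes.append(v)
    return [[dot(row, axis) for axis in axes] for row in centered]
```

Calling `pca_2d` on a list of frequency vectors returns one pair of coordinates per fragment, which can then be drawn as a scatter plot.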

PCA plot: Caulobacter crescentus

singles, N=4

doublets, N=16

triplets, N=64

N=256

!!! The information in a genomic sequence is encoded by non-overlapping triplets (Nature, 1961)

First explanation

cgtggtga gctgatgctagggrcgcacgtgg tgagctgatgct agggrcgacgtggtgagctgatg ctagggrcgc

tga tgc tag ggr cgc acg tgg

ctg atg cta ggg rcg cac gtg

Basic 7-cluster structure

gtga gctgatgctagggrcgcacgtgg tgagc

gct gat gct agg grc gca cgt

gtga atcggtgggtgagtgtgctgcta tgagc

atc ggt ggg tga gtg tgc tgc

tcg gtg ggt gag tgt gct gct

cgg tgg gtg agt gtg ctg ctg
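
The three ways of cutting one fragment into triplets can be reproduced with a few lines of Python. This is only a sketch: the helper name `triplets` is hypothetical, and the fragment is the 23-letter example from the slide.

```python
def triplets(sequence, shift):
    """Split sequence into non-overlapping triplets starting at offset shift."""
    return [sequence[i:i + 3] for i in range(shift, len(sequence) - 2, 3)]

fragment = "atcggtgggtgagtgtgctgcta"
for shift in range(3):
    print(shift, " ".join(triplets(fragment, shift)))
```

The same fragment gives three different triplet distributions depending on the shift, which is why fragments of coding sequence separate by phase once they are described by triplet frequencies.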

Non-coding parts

gtga gctgatgctagggr cgcacg aat

Point mutations:

insertions, deletions

a

The flower-like 7-cluster structure is flat

[Figure: plot with vertical axis from 0 to 0.16 and horizontal axis from 1 to 29]

Seven classes vs seven clusters

Stanford

TIGR

Georgia Institute of Technology

Hong-Yu Ou, Feng-Biao Guo and Chun-Ting Zhang (2003). Analysis of nucleotide distribution in the genome of Streptomyces coelicolor A3(2) using the Z curve method. FEBS Letters 540(1-3), 188-194.

Audic, S. and J. Claverie. Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17):10026-31, 1998.

Lomsadze A., Ter-Hovhannisyan V., Chernoff Y.O., Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Research, 2005, Vol. 33, No. 20.

Computational gene prediction

Accuracy >90%

Mean-field approximation for triplet frequencies

$$F_{IJK} \approx P^1_I \, P^2_J \, P^3_K$$

$F_{IJK}$: frequency of triplet $IJK$ ($I, J, K \in \{A, C, G, T\}$): $F_{AAA}, F_{AAT}, F_{AAC}, \ldots, F_{GGC}, F_{GGG}$: 64 numbers

$P^i_j$: position-specific letter frequency: 12 numbers (+ correlations)

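
A minimal sketch of the mean-field construction: the 12 position-specific letter frequencies are obtained from a table of 64 triplet frequencies, and the approximation is rebuilt as their product. The function names are hypothetical, and the two-codon usage table is a toy input invented for the example, not data from the talk.

```python
from itertools import product

LETTERS = "ACGT"

def position_frequencies(codon_freq):
    """P[i][letter]: frequency of each letter at codon position i (12 numbers)."""
    P = [dict.fromkeys(LETTERS, 0.0) for _ in range(3)]
    for codon, f in codon_freq.items():
        for i, letter in enumerate(codon):
            P[i][letter] += f
    return P

def mean_field(P):
    """Mean-field triplet frequencies: product of the three position frequencies."""
    return {i + j + k: P[0][i] * P[1][j] * P[2][k]
            for i, j, k in product(LETTERS, repeat=3)}

# Toy codon usage (hypothetical): two codons used equally often.
usage = {"ATG": 0.5, "GCC": 0.5}
P = position_frequencies(usage)
M = mean_field(P)
print(len(M), M["ATG"], M["ACC"])
```

In the toy table the triplet `ACC` never occurs, yet the mean-field product assigns it a non-zero frequency. The difference between the true table and the product is what the "correlations" term stands for.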
Why hexagonal symmetry?

0-+, -+0, +0-, +-0, -0+, 0+-

GC-content = $P_C + P_G$

Genome codon usage and mean-field approximation

ggtga ATG gat gct agg … gtc gca cgc TAA tgagct

correct frameshift

64 frequencies $F_{IJK}$

ggtga A T G  g a t  g c t  a g g  …  g t c  g c a  c g c  T A A tgagct

12 frequencies $P^1_I$, $P^2_J$, $P^3_K$

$P^i_j$ are linear functions of GC-content

eubacteria

archaea

THE MYSTERY OF TWO STRAIGHT LINES ???

$R^{12}$

$R^{64}$

$$F_{IJK} = P^1_I \, P^2_J \, P^3_K + \text{correlations}$$

Codon usage signature

0 - +

19 possible eubacterial signatures

Example:

Palindromic signatures

Four symmetry types of the basic 7-cluster structure

flower-like

degenerated

perpendicular triangles

parallel triangles

B. halodurans (GC=44%)

S. coelicolor (GC=72%)

F. nucleatum (GC=27%)

E. coli (GC=51%)

Using branching principal components to analyze 7-cluster genome structures

Streptomyces coelicolor

Bacillus halodurans

Escherichia coli

Fusobacterium nucleatum

Web-site: http://www.ihes.fr/~zinovyev/7clusters

cluster structures in genomic sequences

Papers (type Zinovyev …)

Gorban A, Zinovyev A. PCA deciphers genome. 2005. Arxiv preprint.

Gorban A, Popova T, Zinovyev A. Codon usage trajectories and 7-cluster structure of 143 complete bacterial genomic sequences. 2005. Physica A 353, 365-387.

Gorban A, Popova T, Zinovyev A. Four basic symmetry types in the universal 7-cluster structure of microbial genomic sequences. 2005. In Silico Biology 5, 0025.

Gorban A, Zinovyev A, Popova T. Seven clusters in genomic triplet distributions. 2003. In Silico Biology 3, 0039.

Zinovyev A, Gorban A, Popova T. Self-Organizing Approach for Automated Gene Identification. 2003. Open Systems and Information Dynamics 10 (4).

Part II: Coding and non-coding DNA scaling laws

Dr. Thomas Fink, Bioinformatics service

Dr. Sebastian Ahnert, Cavendish Laboratory, University of Cambridge

C-value and G-value

Neither genome length nor gene number accounts for the complexity of an organism

Drosophila melanogaster (fruit fly): C = 120 Mb

Podisma pedestris (mountain grasshopper): C = 1650 Mb

Non-linear growth of regulation

Mattick, J. S. Nature Reviews Genetics 5, 316-323 (2004).

The “amount of regulation” scales non-linearly with the number of genes: every new gene with a new function requires specific regulation, but the regulators also need to be regulated.

[Figure: log number of regulatory genes versus log number of genes, for bacteria and archaea; slopes 1.96 and 1]

Complexity ceiling for prokaryotes

Adding a new function $\Delta S$ requires adding regulation $\Delta R$; the total increase is $\Delta N = \Delta R + \Delta S$

Since $R \sim N^2$, at some point $\Delta R > \Delta S$, i.e. the gain from a new function becomes too expensive for the organism: it requires too much regulation to be integrated

There is a maximum possible genome length for prokaryotes (~10 Mb)
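
The crossover point can be made explicit with a short calculation. As a sketch only, assume that the quadratic scaling holds as an equality $R = \kappa N^2$, where the constant $\kappa$ is introduced here for illustration and does not appear in the talk, and consider a small increase $\Delta N > 0$:

$$
\begin{aligned}
\Delta R &\approx \frac{dR}{dN}\,\Delta N = 2\kappa N\,\Delta N,\\
\Delta S &= \Delta N - \Delta R \approx (1 - 2\kappa N)\,\Delta N,\\
\Delta R > \Delta S &\iff 2\kappa N > 1 - 2\kappa N \iff N > \frac{1}{4\kappa}.
\end{aligned}
$$

Under this assumption, beyond $N = 1/(4\kappa)$ each step adds more regulation than new function, and at $N = 1/(2\kappa)$ the gain $\Delta S$ vanishes entirely, which is one way to read the ceiling.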

How did eukaryotes bypass this limitation?

Presumably, they invented a cheaper (digital) regulatory system, based on RNA

This regulatory information is stored in the “non-coding” DNA

Simple model: accelerated networks

A node is a gene ($c$ genes)

An edge is a “regulation” ($n$ edges)

$n = a c^2$

Connectivity $< k_{max}$: regulators are only proteins

Connectivity $> k_{max}$: the deficit of regulations is taken from non-coding DNA

How much regulation does a genome need to take from non-coding DNA?

$$n_{deficit} = \frac{k_{max}}{2\,c_{max}}\; c\,(c - c_{max})$$

$c_{max}$: prokaryotic ceiling
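
The deficit can be sketched numerically. The code assumes that connectivity means the mean degree $2n/c$ of the network and that it reaches $k_{max}$ exactly at $c = c_{max}$, which fixes the coefficient in $n = a c^2$. Both points are assumptions of this example, and the function name `regulation_deficit` is hypothetical.

```python
def regulation_deficit(c, c_max, k_max):
    """Edges that protein regulators cannot supply once c exceeds c_max."""
    a = k_max / (2 * c_max)   # from n = a*c**2 and 2n/c = k_max at c = c_max
    needed = a * c ** 2       # edges the accelerated network requires
    supplied = k_max * c / 2  # edges available at maximal connectivity k_max
    return max(0.0, needed - supplied)

# Toy numbers (hypothetical): ceiling of 10 genes, maximal connectivity 4.
for c in (5, 10, 20):
    print(c, regulation_deficit(c, c_max=10, k_max=4))
```

Below the ceiling the deficit is zero. Above it, `needed - supplied` equals $\frac{k_{max}}{2 c_{max}}\, c\,(c - c_{max})$, which grows quadratically with the number of genes.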

These regulations must be encoded in the non-coding part of the genome, therefore

$$N = \frac{b}{2}\,\frac{C}{C_{prok}}\,(C - C_{prok})$$

$N$: non-coding DNA length

$C$: coding DNA length

$C_{prok}$: ceiling for prokaryotes (~10 Mb)

$b$: some coefficient

Observation: coding length vs non-coding length

$b = 1$

Minimum non-coding length needed for the “deficit” regulation

Hypothesis

Prokaryotes: $\langle \text{non-coding length} \rangle = a \, \langle \text{coding length} \rangle$, $a \approx 5$-$15\%$ (-on, promoters, UTRs…)

15% ≈ 1/7

Eukaryotes

$$N_{reg} = \frac{b}{2}\,\frac{C}{C_{maxprok}}\,(C - C_{maxprok}) \sim C^2, \qquad C_{maxprok} \approx 10\ \text{Mb}, \quad b \approx 1$$

This is the amount necessary for regulation, but repeats, genome parasites, etc., might make a genome much bigger

This is only a hypothesis, but…

Prediction of $N_{reg}$ for human:

$N_{reg}$ = 87 Mb = 3% of genome length

$C$ = 48 Mb = 1.7%

$N_{reg} + C$ = 4.7%
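
As a rough check of the order of magnitude, the eukaryotic formula can be evaluated for the human coding length. The sketch below plugs in the rounded constants $C_{maxprok} = 10$ Mb and $b = 1$; the function name is hypothetical, and the exact constants behind the 87 Mb figure are not given in the talk.

```python
def regulatory_noncoding_length(C, C_maxprok=10.0, b=1.0):
    """N_reg = (b/2) * (C / C_maxprok) * (C - C_maxprok), all lengths in Mb."""
    return 0.5 * b * (C / C_maxprok) * (C - C_maxprok)

n_reg = regulatory_noncoding_length(48.0)  # human coding length, Mb
print(round(n_reg, 1))
```

With these rounded inputs the result is about 91 Mb, close to the 87 Mb quoted above. The small difference presumably comes from a slightly different value of $C_{maxprok}$ or $b$, which is an assumption rather than something stated in the slides.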

Thank you for your attention

Questions?