Lecture 4. Topics in Genome Annotation

The Chinese University of Hong Kong

CSCI5050 Bioinformatics and Computational Biology

Lecture outline

1. Introduction
   - Motivation
   - Types of functional sequence elements
2. Gene annotation
   - Databases and file formats
   - Specific types:
     - Protein-coding genes (eukaryotes)
     - Non-coding RNAs
     - Pseudogenes
3. Regulatory regions
   - Representations
   - Protein binding sites

Last update: 1-Oct-2013

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2013

INTRODUCTION

Part 1

Understanding machine language


- This is what the PDF version of our lecture notes for Lecture 3 looks like when we open it in binary mode (shown as hexadecimal numbers).
- How do we interpret it?

Understanding machine language

[Figure: hexadecimal dump of the PDF with annotated elements: version number, language, number of pages]

Want to know more? Look for a standard called ISO 32000.

Understanding machine language


- We looked for elements that are easy to interpret
  - The meanings of many parts were not as obvious
  - It would be more complicated if it were an executable program instead, as it would contain both control and data elements
- In general, we tried to separate the long piece of content into elements/element types, and annotate each of them
  - Meanings of some elements can be determined with the help of other elements (e.g., number of pages)
- The next (more difficult) step is to understand the relative locations of the different elements and how they interact with each other

Understanding genomic language


Now, how do we interpret the human genome?


......TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTA
ACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAAC
CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC
CTAAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCAACCCCAACCCCAAC
CCCAACCCCAACCCCAACCCTAACCCCTAACCCTAACCCTAACCCTACCCTAACCCTAACC
CTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACC
CTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTCGCGGTACCCTCAGCCGGCCCG
CCCGCCCGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAGAGTACCACCGAAATCT
GTGCAGAGGACAACGCAGCTCCGCCCTCGCGGTGCTCTCCGGGTCTGTGCTGAGGAGA
ACGCAACTCCGCCGTTGCAAAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGCGC
CGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGCGC
GCCGCGCCGGCGCAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGAGGC
GCGCCGCGCCGGCGCAGGCGCAGACACATGCTAGCGCGTCGGGGTGGAGGCGTGGCG
CAGGCGCAGAGAGGCGCGCCGCGCCGGCGCAGGCGCAGAGACACATGCTACCGCGTC
CAGGGGTGGAGGCGTGGCGCAGGCGCAGAGAGGCGCACCGCGCCGGCGCAGGCGCA
GAGACACATGCTAGCGCGTCCAGGGGTGGAGGCGTGGCGCAGGCGCAGAGACGC......

Understanding genomic language


- Again, we first look for functional elements
  - Genes
    - Function-wise: protein-coding vs. non-coding
    - Sub-elements at the transcriptional level: whole transcripts, exons, introns, ...
    - Sub-elements at the translational level: 5’ UTR, coding sequence, 3’ UTR, ...
  - Regulatory regions
    - Function-wise: promoters, enhancers, silencers, insulators, ...
    - Position-wise: upstream, intronic, intergenic, ...

GENE ANNOTATION

Part 2

Gene structure revisited

Image source: http://www.carolguze.com/text/442-1-humangenome.shtml

Human gene annotation sets


- RefSeq (NCBI, National Center for Biotechnology Information, US National Institutes of Health)
  - Standard for most biologists
- Ensembl (EMBL-EBI, European Molecular Biology Laboratory, European Bioinformatics Institute)
  - Automatic annotation
- Havana (Wellcome Trust Sanger Institute)
- Gencode (ENCODE, Encyclopedia of DNA Elements)
  - Based on latest experimental data
  - Level 1: Experimentally validated
  - Level 2: Manually checked, but without experimental support
  - Level 3: Automatic annotation
- UCSC (University of California, Santa Cruz)
- Each has different versions

Comparison of gene annotation sets

Image source: Harrow et al., Genome Research 22(9):1760-1774, (2012)

Comparison of gene annotation sets

Example: p53

[Figure: gene models for p53 in UCSC, Gencode v17, Gencode v14, Gencode v7, RefSeq, and Ensembl]

Annotation file formats


- GFF format (from http://genome.ucsc.edu/FAQ/FAQformat.html): tab-delimited. Fields:
  1. seqname - The name of the sequence. Must be a chromosome or scaffold.
  2. source - The program that generated this feature.
  3. feature - The name of this type of feature. Some examples of standard feature types are "CDS", "start_codon", "stop_codon", and "exon".
  4. start - The starting position of the feature in the sequence. The first base is numbered 1.
  5. end - The ending position of the feature (inclusive).
  6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
  7. strand - Valid entries include '+', '-', or '.' (for don't know/don't care).
  8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.
  9. group - All lines with the same group are linked together into a single item.
- GTF format: Similar to GFF, except that the group field is replaced by a list of attributes in <name>, <value> pairs

Last update: 16-Nov-2012

GNBF5050 Theories and Algorithms in Bioinformatics | Kevin Yip@cse.cuhk | Fall 2012

Example


Gencode v12 GTF file:


chr1 ENSEMBL exon 17021 17055 . - . gene_id "ENSG00000227232.3"; transcript_id "ENST00000430492.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "WASH7P"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "WASH7P-202"; level 3; havana_gene "OTTHUMG00000000958.1";

chr1 HAVANA gene 29554 31109 . + . gene_id "ENSG00000243485.1"; transcript_id "ENSG00000243485.1"; gene_type "antisense"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "antisense"; transcript_status "NOVEL"; transcript_name "MIR1302-11"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";

...

chr1 HAVANA gene 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENSG00000237613.2"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A"; level 2; havana_gene "OTTHUMG00000000960.1";

chr1 HAVANA transcript 34554 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";

chr1 HAVANA exon 35721 36081 . - . gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";

chr1 HAVANA CDS 35721 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";

chr1 HAVANA start_codon 35734 35736 . - 0 gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM138A"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM138A-001"; level 2; havana_gene "OTTHUMG00000000960.1"; havana_transcript "OTTHUMT00000002842.1";

Key (fields highlighted on the slide):
- Annotation set: column 2 (e.g., ENSEMBL, HAVANA)
- Feature: column 3 (e.g., gene, transcript, exon, CDS, start_codon)
- Gene name: the gene_name attribute
- Transcript type: the transcript_type attribute
- Annotation level: the level attribute
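To make the nine-column layout concrete, here is a minimal Python sketch that parses one GTF record into a dictionary. The function name `parse_gtf_line` and the returned structure are illustrative choices, not part of any standard library, and the attribute list in the example is shortened from the FAM138A CDS record above. It assumes real tab characters between the nine fields, and it keeps only the last value when an attribute name (such as `tag`) is repeated.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame"]

def parse_gtf_line(line):
    """Parse one GTF record; the 9th field becomes a nested 'attributes' dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-delimited fields, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields[:8]))
    record["start"] = int(record["start"])  # 1-based
    record["end"] = int(record["end"])      # inclusive
    attributes = {}
    # Each attribute looks like:  name "value";   or   name value;
    pattern = r'(\w+)\s+(?:"([^"]*)"|([^;\s]+))\s*;'
    for name, quoted, bare in re.findall(pattern, fields[8]):
        attributes[name] = quoted if quoted else bare
    record["attributes"] = attributes
    return record

# Shortened version of the FAM138A CDS record (hypothetical tab layout).
example = "\t".join([
    "chr1", "HAVANA", "CDS", "35721", "35736", ".", "-", "0",
    'gene_id "ENSG00000237613.2"; transcript_id "ENST00000417324.1"; '
    'gene_type "protein_coding"; gene_name "FAM138A"; level 2;',
])
rec = parse_gtf_line(example)
# Feature length uses inclusive coordinates: end - start + 1.
print(rec["feature"], rec["end"] - rec["start"] + 1, rec["attributes"]["gene_name"])
```

Because both coordinates are 1-based and inclusive, the length of a feature is `end - start + 1`, which is easy to get wrong when mixing GTF with 0-based formats.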

Gene annotation: The process


- How to find out the locations of genes?
  - Experimental (require observed expression):
    - EST (Expressed Sequence Tag) libraries
    - Tiling microarrays
    - RNA sequencing
    - ...
  - Computational:
    - Similarity search
    - Simple features
    - Machine learning
    - Hidden Markov Models
    - ...

Computational gene finding: similarity search

- Find sequences that are similar to annotated genes
  - DNA (blastn)
  - Protein (blastx/tblastx): 6-frame translation

Reading frame (image credit: Wikipedia):

+3: L V R T
+2: T C S Y
+1: N L F V
5’-AACTTGTTCGTACA-3’
3’-TTGAACAAGCATGT-5’
-1: K N T C
-2: S T R V
-3: V Q E Y

(The reverse-strand frames are written left to right along the 3’-to-5’ strand, so each of those peptides reads from right to left.)
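The six-frame translation in the reading-frame example can be reproduced with a short Python sketch. The function names are illustrative, the standard genetic code is assumed, and trailing bases that do not form a complete codon are ignored.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))

def six_frame_translation(seq):
    """Return the three forward and three reverse-strand translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames

frames = six_frame_translation("AACTTGTTCGTACA")
for name in ["+1", "+2", "+3", "-1", "-2", "-3"]:
    print(name, frames[name])
```

The reverse frames come out in the 5’-to-3’ direction of the reverse strand (for example `CTNK` for frame -1), which is the slide's `K N T C` read from right to left.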

Computational gene finding: simple features

- Based on sequence information only
  - "Ab initio gene finding"
- Open reading frame (ORF)
  - Existence of start and stop codons in-frame and within a reasonable distance
  - More complicated when introns are present
- Splice junctions
  - Grammar rules or probabilistic models
- Promoter signals
  - TATA boxes
  - CpG islands
  - ...
- Codon bias
- ...

Image source: http://www.blackwellpublishing.com/ridley/a-z/codon_bias.asp
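As an illustration of the ORF criterion (an in-frame start and stop codon a reasonable distance apart), here is a minimal Python sketch. It is a simplification under stated assumptions: it scans the forward strand only, ignores introns, and `find_orfs` and the `min_codons` threshold are hypothetical names chosen for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    min_codons counts the codons from ATG up to, but excluding, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

# Toy sequence: ATG AAA CCC GGG TAA in frame 2 (offset of two bases).
print(find_orfs("CCATGAAACCCGGGTAACC"))
```

To cover both strands, the same scan could be run on the reverse complement. Real gene finders combine this signal with the other features listed above, since ORFs alone produce many false positives.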

Combining features


- How to combine the various features?
  - Essentially a machine learning problem
- For each window (e.g., 100-400bp), compute the various features
- Gather some positive examples (known coding genes)
- Gather some negative examples (known non-genic regions)
- Train a statistical model that can tell whether the window (or the middle nucleotide) is likely genic/coding
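The first step, computing features for each window, can be sketched in Python as follows. The two features (GC content and the CpG observed/expected ratio) are illustrative assumptions, not the lecture's feature set, and the function names are hypothetical. The resulting feature vectors would then be passed, together with positive and negative labels, to any classifier.

```python
def gc_content(window):
    """Fraction of G and C bases in the window."""
    return sum(base in "GC" for base in window) / len(window)

def cpg_obs_exp(window):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    c = window.count("C")
    g = window.count("G")
    if c == 0 or g == 0:
        return 0.0
    cpg = sum(1 for i in range(len(window) - 1) if window[i:i + 2] == "CG")
    return cpg * len(window) / (c * g)

def window_features(seq, size, step):
    """Return (start, gc_content, cpg_obs_exp) for each full-length window."""
    seq = seq.upper()
    features = []
    for start in range(0, len(seq) - size + 1, step):
        window = seq[start:start + size]
        features.append((start, gc_content(window), cpg_obs_exp(window)))
    return features

# Tiny toy input; real windows would be on the order of 100-400bp.
print(window_features("CGCGATAT", size=4, step=4))
```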

Computational gene finding: machine learning

- GRAIL: Neural network-based method

Image credit: Uberbacher and Mural, PNAS 88(24):11261-11265, (1991)

Fine-grained modeling

- All the above methods have limitations:
  - Similarity search: Only for genes with annotated homologs
  - Simple features: Each feature is weak, and thus can lead to false positives and false negatives
  - Machine learning (in that form): Does not fully utilize information about neighboring positions, and is not able to tell precise element boundaries
- Need methods that provide finer-grained modeling of gene structures

Hidden Markov Models (HMMs)


- Hidden Markov Models are statistical models for inferring unobserved information from an observed data sequence
  - Observed data: DNA sequence
  - Unobserved information:
    - State of each nucleotide (exon, intron, etc.)
    - Transition probabilities between states
    - Emission probabilities: E.g., what is the probability of emitting a certain nucleotide in the exon state?

HMM example


- Suppose you have two coins, one biased and one unbiased. Which coin was used each time if you observed the sequence <T, H, T>?

[Figure: three unknown states (?, ?, ?) emitting the observations T, H, T]

A possible model:

[Figure: two-state model with states A and B, each emitting H or T; probabilities shown in the diagram: 0.5, 0.5, 0.9, 0.1, 0.8, 0.2, 0.5, 0.5, 0.25, 0.75]

A possible run: observations T, H, T generated by states B, A, A

HMM algorithms


- There are algorithms for the following problems:
  - Given a model, compute the data likelihood of an observed sequence O, Pr(O | model)
    - Forward algorithm
    - Backward algorithm
  - Given a model and an observed sequence, determine the most likely state sequence
    - Viterbi algorithm
  - Given a set of states and a series of observed data sequences, estimate the transition and emission probabilities
    - Baum-Welch algorithm
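The forward and Viterbi algorithms can be sketched in Python for the two-coin example. The parameters below are assumptions: the numbers come from the model diagram, but which number belongs to which arrow did not survive in these notes, so coin A is taken to be unbiased and coin B to favor tails, with the self-transition probabilities chosen arbitrarily from the listed values.

```python
def forward_likelihood(obs, states, start, trans, emit):
    """Forward algorithm: Pr(observations | model), summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

def viterbi(obs, states, start, trans, emit):
    """Viterbi algorithm: most likely state path and its joint probability."""
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        # Best predecessor of each state, computed from the previous delta.
        prev = {s: max(states, key=lambda r: delta[r] * trans[r][s]) for s in states}
        delta = {s: delta[prev[s]] * trans[prev[s]][s] * emit[s][symbol]
                 for s in states}
        backpointers.append(prev)
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for prev in reversed(backpointers):
        path.append(prev[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical parameter assignment (see the assumptions in the text above).
states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}

obs = ["T", "H", "T"]
path, path_prob = viterbi(obs, states, start, trans, emit)
likelihood = forward_likelihood(obs, states, start, trans, emit)
print(path, path_prob, likelihood)
```

Under these assumed parameters the most likely path is A, A, A, which differs from the "possible run" B, A, A: a possible run is any path with non-zero probability, while Viterbi returns the single most probable one. For long sequences, the products should be computed in log space to avoid underflow.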

Computational gene finding: HMMs

- GENSCAN:
  - Both transcription (exon/intron) and translation (UTR/CDS)
  - Positive and negative strands
  - Single-exon vs. multi-exon genes
  - Three different frames
  - One type of generalized HMMs (GHMMs): Emission of a sequence instead of a single nucleotide

Image credit: Burge and Karlin, Journal of Molecular Biology 268(1):78-94, (1997)

Computational gene finding: HMMs

- VEIL: Multi-level models

[Figures: the overall model; the exon and stop codon model]

Image credit: Henderson et al., Journal of Computational Biology 4(2):127-141, (1997)

Gene finding in post-NGS era

- With the invention of RNA-seq, the ability to experimentally discover gene locations has been greatly improved:
  1. Sequence all RNAs
  2. Map them to reference genome
- Issues:
  - Experimental noise
  - Availability of good reference genome
  - Mapping of split reads and paired-end reads
  - Cell/tissue/condition-specific expression
  - Over- and under-representation of certain transcripts
  - Biochemical activity vs. biological function

Split mapping


- TopHat2

Image credit: Kim et al., Genome Biology 14(4):R36, (2013)

Transcript isoforms [Project]


- Given a set of RNA-seq short reads mapped to a gene, determine the transcript isoforms present and their relative abundance
- Cufflinks

Image credit: Trapnell et al., Nature Biotechnology 28(5):511-515, (2010)

Non-coding RNAs (ncRNAs)

- Non-coding RNAs are RNAs that function without being translated into proteins
- Many types:

| Type | Abbreviation | Function |
|---|---|---|
| Ribosomal RNA | rRNA | Translation |
| Transfer RNA | tRNA | Translation |
| Small nuclear RNA | snRNA | Splicing |
| Small nucleolar RNA | snoRNA | Nucleotide modifications |
| MicroRNA | miRNA | Gene regulation |
| Small interfering RNA | siRNA | Gene regulation |
| Long non-coding RNA (>200nt) | lncRNA | Various (mostly unknown) |
Identifying non-coding RNAs [project]

- Some features:
  - Strong evolutionary conservation
  - Strong secondary structure
  - Weak coding potential
  - (For small RNA) Strong signals in RNA-seq experiments that select for small RNA
  - (For non-polyadenylated RNA) Weak signals in RNA-seq experiments that enrich for poly-A RNA
  - ...

Machine learning for identifying ncRNAs

Image credit: Lu, Yip et al., Genome Research 21(2):276-285, (2011)

Identifying long non-coding RNAs

Image credit: Nam and Bartel, Genome Research 22(12):2529-2540, (2012)

Structural models for ncRNA


- Some small RNAs have strong structural features, which can be used to identify them from genomic sequences

[Figures: secondary structures of a tRNA and a snoRNA]

Image sources: http://www.bio.miami.edu/dana/pix/tRNA.jpg, http://lowelab.ucsc.edu/images/CDBox.jpg

Structural models for ncRNA


- Covariance models

Image credit: Eddy, BMC Bioinformatics 3:18, (2002)

Pseudogenes


- Pseudogenes are former genes that have lost their ability to code for (the original) protein
- Classification:
  - By mechanism of creation:
    - Non-processed pseudogenes: Mutation (e.g., premature stop codon)
    - Processed pseudogenes: Reverse transcription (missing introns)
  - By copy of gene:
    - Duplicated copy
    - The only copy (unitary pseudogenes)

Identifying pseudogenes


- Look for sequences similar to annotated coding genes or with strong coding potential
- Consider those that cannot produce the corresponding protein

Image credit: Zhang et al., Bioinformatics 22(12):1437-1439, (2006)

REGULATORY REGIONS

Part 3

Types of regulatory regions

Image credit: Maston et al., Annual Review of Genomics and Human Genetics 7:29-59, (2006)

Representations


- Most regulatory regions are recognized by specific DNA binding domains of proteins, with specific sequence signatures (motifs)
- However, these motifs are not exact
  - The proteins can bind (slightly) different versions of the motifs
- Representing a motif:
  - Consensus sequence
  - Regular expression
  - Position weight matrix
  - Sequence logo

Consensus sequence

- Suppose we have the following transcription factor binding site (TFBS) sequences:
  - CACAAAA
  - CACAAAT
  - CGCAAAA
  - CACAAAA
- Consensus sequence:
  - CACAAAA
- Degenerate sequence in IUPAC (International Union of Pure and Applied Chemistry) code (see http://www.bio-soft.net/sms/iupac.html):
  - CRCAAAW

Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf

| IUPAC nucleotide code | Base |
|---|---|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T (or U) | Thymine (or Uracil) |
| R | A or G |
| Y | C or T |
| S | G or C |
| W | A or T |
| K | G or T |
| M | A or C |
| B | C or G or T |
| D | A or G or T |
| H | A or C or T |
| V | A or C or G |
| N | any base |
| . or - | gap (not used in motifs) |
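Both representations can be derived from the aligned sites with a few lines of Python. This is a minimal sketch: the function names are illustrative, the sites are assumed to be gap-free and of equal length, and ties in the consensus are broken by first occurrence.

```python
from collections import Counter

# IUPAC codes keyed by the set of bases observed in a column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def consensus(sites):
    """Most frequent base in each column of the alignment."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def degenerate(sites):
    """IUPAC code covering every base observed in each column."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*sites))

sites = ["CACAAAA", "CACAAAT", "CGCAAAA", "CACAAAA"]
print(consensus(sites))   # CACAAAA
print(degenerate(sites))  # CRCAAAW
```

The degenerate sequence records which bases occur but not how often, which is the information a position weight matrix adds.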

Regular expression


- Suppose we have the following TFBS sequences (aligned, with _ marking a gap):
  - CACAAAAA
  - CACAAA_T
  - CGCAAAAA
  - CACAAA_A
- Regular expression
  - E.g., C[AG]CA{3,4}[AT]
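The regular expression can be tried directly with Python's `re` module. In this sketch the gap character `_` is removed before matching, and the genomic string passed to `scan` is a made-up example.

```python
import re

motif = re.compile(r"C[AG]CA{3,4}[AT]")

aligned_sites = ["CACAAAAA", "CACAAA_T", "CGCAAAAA", "CACAAA_A"]
sites = [s.replace("_", "") for s in aligned_sites]  # drop alignment gaps

# Every known site should match the pattern in full.
all_match = all(motif.fullmatch(s) is not None for s in sites)

def scan(sequence):
    """Return (0-based start, matched text) for non-overlapping matches."""
    return [(m.start(), m.group()) for m in motif.finditer(sequence)]

print(all_match)
print(scan("TTCGCAAATGG"))  # hypothetical sequence containing one site
```

Note that `finditer` reports non-overlapping matches only, so overlapping candidate sites would need a lookahead-based pattern or a manual scan.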

Position weight matrix

- Position weight matrix

ATGGCATG
AGGGTGCG
ATCGCATG
TTGCCACG
ATGGTATT
ATTGCACG
AGGGCGTT
ATGACATG
ATGGCATG
ACTGGATG

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | 0.9 | 0.0 | 0.0 | 0.1 | 0.0 | 0.8 | 0.0 | 0.0 |
| C | 0.0 | 0.1 | 0.1 | 0.1 | 0.7 | 0.0 | 0.3 | 0.0 |
| G | 0.0 | 0.2 | 0.7 | 0.8 | 0.1 | 0.2 | 0.0 | 0.8 |
| T | 0.1 | 0.7 | 0.2 | 0.0 | 0.2 | 0.0 | 0.7 | 0.2 |

- Pseudo-counts: add a small number to each count, to alleviate problems due to small sample size

With a pseudo-count of 1 added to each count (10 sequences + 4 pseudo-counts = 14):

|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| A | 10/14 | 1/14 | 1/14 | 2/14 | 1/14 | 9/14 | 1/14 | 1/14 |
| C | 1/14 | 2/14 | 2/14 | 2/14 | 8/14 | 1/14 | 4/14 | 1/14 |
| G | 1/14 | 3/14 | 8/14 | 9/14 | 2/14 | 3/14 | 1/14 | 9/14 |
| T | 2/14 | 8/14 | 3/14 | 1/14 | 3/14 | 1/14 | 8/14 | 3/14 |

Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf
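Both matrices can be recomputed from the ten sequences with a short Python sketch. The function name is illustrative, and exact fractions are used so that the entries can be compared directly with the tables; the pseudo-count is assumed to be an integer (or a `Fraction`), since `Fraction` does not accept a float numerator together with a denominator.

```python
from fractions import Fraction

def position_weight_matrix(sites, pseudocount=0, alphabet="ACGT"):
    """Return one {base: probability} dict per alignment column."""
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for column in zip(*sites):
        pwm.append({base: Fraction(column.count(base) + pseudocount, total)
                    for base in alphabet})
    return pwm

sites = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
         "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

plain = position_weight_matrix(sites)                    # first table
smoothed = position_weight_matrix(sites, pseudocount=1)  # second table

print([str(plain[0][b]) for b in "ACGT"])     # column 1 without pseudo-counts
print([str(smoothed[0][b]) for b in "ACGT"])  # column 1 with pseudo-count 1
```

`Fraction` reduces its values, so 10/14 prints as 5/7. The benefit of the pseudo-count is that no base has probability zero, so a single unseen base at one position no longer forces the score of a candidate site to zero.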

Sequence logo

- Sequence logo
  - Nucleotide with the highest probability on top
  - Total height of the nucleotides at the $i$-th position: $h_i$
    - $p_{i,x}$: probability of character $x$ at position $i$
    - $n$: number of sequences
  - Height of nucleotide $x$ = $p_{i,x} h_i$
- There are also representations that capture dependency between nucleotides (e.g., profile hidden Markov models)
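The formula for the total height did not survive in these notes. The sketch below assumes the commonly used definition of sequence logos, in which the height is the information content of the column with a small-sample correction: h_i = log2(4) - (H_i + e_n), where H_i = -sum over x of p_{i,x} log2 p_{i,x} and e_n = (4 - 1) / (2 ln(2) n). Whether the slide used exactly this form is an assumption, and the function names are illustrative.

```python
import math

def position_height(probs, n, alphabet_size=4):
    """Information content (bits) of one column with small-sample correction."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    correction = (alphabet_size - 1) / (2 * math.log(2) * n)
    return math.log2(alphabet_size) - (entropy + correction)

def letter_heights(probs, n):
    """Height of each letter: its probability times the column height."""
    h = position_height(probs, n)
    return {x: p * h for x, p in probs.items()}

# Column 1 of the position weight matrix example (10 sequences).
column = {"A": 0.9, "C": 0.0, "G": 0.0, "T": 0.1}
heights = letter_heights(column, n=10)
for base in sorted(heights, key=heights.get, reverse=True):
    print(base, round(heights[base], 3))
```

A fully conserved column approaches 2 bits as n grows, while a uniform column has a height near zero, which is why conserved positions stand out in a logo.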

Motif databases


- Two most commonly used databases: JASPAR and Transfac

Image source: JASPAR

Identifying regulatory regions


- Motif analysis
  - Binding sites of one protein vs. binding sites of multiple proteins
- Evolutionary conservation
- Other types of signal
  - Experimental protein binding (ChIP-chip and ChIP-seq)
  - Open chromatin
  - Histone modifications

Motif analysis


- Use HMMs or other statistical models to learn patterns of a type of binding site from known examples
- Apply the model to scan the whole genome for other binding sites
- Study co-occurrence of binding sites (multiple binding sites usually co-occur at regulatory modules)

Image source: The MEME Suite

Gibbs sampling example

- $s_1$ = ACCGGCT (current motif occurrence: CGG)
- $s_2$ = TGTCAGC (current motif occurrence: CAG)
- $s_3$ = TCGGTAT
- Assume
  - Length of motif, $w$ = 3
  - $z$ = 3, thus $s_3$ is taken out
  - Background probabilities:
    - A: 0.2
    - C: 0.3
    - G: 0.3
    - T: 0.2
  - $a_1$ = 3, $a_2$ = 4
- PWM (pseudo-count = 0.5):

| Nucleotide | Position 1 | Position 2 | Position 3 |
|---|---|---|---|
| A | 0.5/4 | 1.5/4 | 0.5/4 |
| C | 2.5/4 | 0.5/4 | 0.5/4 |
| G | 0.5/4 | 1.5/4 | 2.5/4 |
| T | 0.5/4 | 0.5/4 | 0.5/4 |

- Score of position 1, $A_1$:
  [(0.5/4)(0.5/4)(2.5/4)] / [(0.2)(0.3)(0.3)] = 0.542535
- Results:

| $x$ | $A_x$ |
|---|---|
| 1 | 0.542535 |
| 2 | 5.425347 |
| 3 | 0.325521 |
| 4 | 0.16276 |
| 5 | 0.732422 |
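The scoring step of this Gibbs sampling iteration can be reproduced with a short Python sketch. The function names are illustrative; the sequences, background probabilities, pseudo-count, and start positions are taken from the example. Each score is the probability of the window under the PWM divided by its probability under the background.

```python
def build_pwm(occurrences, pseudocount, alphabet="ACGT"):
    """PWM from the current motif occurrences of the remaining sequences."""
    total = len(occurrences) + pseudocount * len(alphabet)
    return [{b: (col.count(b) + pseudocount) / total for b in alphabet}
            for col in zip(*occurrences)]

def position_scores(seq, pwm, background):
    """Likelihood ratio (motif vs. background) for every start position."""
    w = len(pwm)
    scores = []
    for start in range(len(seq) - w + 1):
        motif_prob = 1.0
        background_prob = 1.0
        for offset, base in enumerate(seq[start:start + w]):
            motif_prob *= pwm[offset][base]
            background_prob *= background[base]
        scores.append(motif_prob / background_prob)
    return scores

background = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
s1, s2, s3 = "ACCGGCT", "TGTCAGC", "TCGGTAT"
a1, a2, w = 3, 4, 3  # 1-based start positions, as in the example

occurrences = [s1[a1 - 1:a1 - 1 + w], s2[a2 - 1:a2 - 1 + w]]  # CGG and CAG
pwm = build_pwm(occurrences, pseudocount=0.5)
scores = position_scores(s3, pwm, background)

# Normalized scores give the distribution from which a new a_3 is sampled.
weights = [s / sum(scores) for s in scores]
print([round(s, 6) for s in scores])
```

In a full sampler, the new start position in the held-out sequence would be drawn at random in proportion to `weights` (for example with `random.choices`), and the procedure would repeat with a different sequence taken out.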

Identifying regulatory modules [project]

Image credit: Su et al., PLOS Computational Biology 6(12):e1001020, (2010)

Co-occurrence of binding sites

- How to check whether the binding sites of a set of proteins are statistically more associated than random?
  - Count the number of times the binding sites co-occur within a certain window (e.g., a 500bp bin)
  - Compute the probability of having such a co-occurrence count or more if all binding sites are randomly distributed
    - By using a specific form of background distribution
    - By sampling from permuted genomes
- Main issues:
  - Neighboring positions are not independent
  - Background distribution is not uniform
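The counting and sampling steps can be sketched in Python as a naive permutation test. This is a deliberately simple version that places the second protein's sites uniformly at random over the bins, so it suffers from exactly the two issues listed above. The function names, the bin representation, and the parameters are hypothetical choices for illustration.

```python
import random

def cooccurrence_count(bins_a, bins_b):
    """Number of bins that contain a binding site of both proteins."""
    return len(set(bins_a) & set(bins_b))

def permutation_p_value(bins_a, bins_b, n_bins, n_perm=1000, seed=0):
    """Empirical P(count >= observed) when B's bins are placed uniformly."""
    rng = random.Random(seed)
    observed = cooccurrence_count(bins_a, bins_b)
    n_b = len(set(bins_b))
    hits = 0
    for _ in range(n_perm):
        shuffled_b = rng.sample(range(n_bins), n_b)
        if cooccurrence_count(bins_a, shuffled_b) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy data: both proteins bind the same 10 bins out of 1000.
observed, p_value = permutation_p_value(range(10), range(10), n_bins=1000)
print(observed, p_value)
```

Approaches that respect genome structure replace the uniform placement with a sampling scheme that preserves local dependence and regional differences in binding density.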

Co-occurrence of binding sites

- Base overlap ratio: $|\mathrm{TFBS}_A \cap \mathrm{TFBS}_B| / |\mathrm{TFBS}_A|$

[Figure: binding sites of TF A, TF B, and TF C along genomic positions 0-9]

|   | TF A | TF B | TF C |
|---|---|---|---|
| TF A | 1 | 4/7 | 5/7 |
| TF B | 4/5 | 1 | 3/5 |
| TF C | 5/6 | 3/6 | 1 |
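The base overlap ratio is a set operation on the bases covered by each factor, as in this Python sketch. The exact positions in the figure are not recoverable from these notes, so the three position sets below are hypothetical, chosen only so that they reproduce the ratios in the table.

```python
from fractions import Fraction

def base_overlap_ratio(bases_a, bases_b):
    """|A intersect B| / |A|: fraction of A's bound bases also bound by B."""
    bases_a, bases_b = set(bases_a), set(bases_b)
    return Fraction(len(bases_a & bases_b), len(bases_a))

# Hypothetical bound positions on a 10-base region (positions 0-9).
tf_sites = {
    "TF A": {0, 1, 2, 3, 4, 5, 6},
    "TF B": {3, 4, 5, 6, 7},
    "TF C": {1, 2, 4, 5, 6, 8},
}

for row in tf_sites:
    print(row, [str(base_overlap_ratio(tf_sites[row], tf_sites[col]))
                for col in tf_sites])
```

The ratio is asymmetric because the denominator is the size of the first set: the row factor determines the denominator, which is why the table is not symmetric.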

Binding hotspots

[Figure: DNase I hypersensitivity signal and binding sites of TF A, TF B, and TF C in a hotspot and a non-hotspot region]

Problem of uniform sampling


- Missing Gaussianity required for proper evaluation of statistical significance

[Figure: binding sites of TF A and TF B in a hotspot (more TF binding) and a non-hotspot (less TF binding) region, with the sampled distribution of the base overlap ratio]

- A peak at 0 is formed by samples from the non-overlap region
- Approximately normal distribution when sampled from a region with some overlap

Genome structure correction (GSC)


- Segmented block bootstrap: each sample must have some bases from each segment

[Figure: binding sites of TF A and TF B segmented into a hotspot (more TF binding) and a non-hotspot (less TF binding) region; each sample contains bases from each segment]

Co-occurrence of transcription factors

[Figure: pairwise co-occurrence of transcription factors (TF A vs. TF B)]

Image credit: The ENCODE Project Consortium, Nature 489(7414):57-74, (2012)

Current status


- Genes:
  - For humans and other well-studied model organisms, annotations for protein-coding genes are expected to be quite complete
    - Precise 5’ and 3’ end positions may need further improvements
    - Transcript isoform annotations are still being updated rapidly
  - New non-coding RNAs are constantly being discovered
- Regulatory regions:
  - Much more incomplete

Summary


- Genome annotation: Identifying and classifying functional sequence elements
- Different types:
  - Genes
    - Protein-coding and non-coding
    - Sub-elements
  - Regulatory regions