Introduction to Biological Basics and Bioinformatics

weinerthreeforksBiotechnology

Oct 2, 2013 (3 years and 10 months ago)

Bioinformatics Basics

Cyrus

Courtesy of LO Leung Yau’s original presentation

Outline


Biological Background


Cell


Protein


DNA & RNA


Central Dogma


Gene Expression


Bioinformatics


Sequence Analysis


Phylogenetic Trees


Data Mining

Biological Background


Cell


Basic unit of organisms


Prokaryotic


Eukaryotic


A bag of chemicals


Metabolism controlled by various enzymes


Correct working needs


Suitable amounts of various proteins



Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

Biological Background


Protein


Polymer of 20 types of amino acids


Folds into 3D structure


Shape determines the function


Many types


Transcription Factors


Enzymes


Structural Proteins




Picture taken from

http://en.wikipedia.org/wiki/Protein



http://en.wikipedia.org/wiki/Amino_acid

Biological Background


DNA & RNA


DNA


Double stranded


Adenine, Cytosine, Guanine, Thymine


A-T, G-C


Those parts coding for proteins are called genes


RNA


Single stranded


Adenine, Cytosine, Guanine, Uracil

Picture taken from

http://en.wikipedia.org/wiki/Gene

Biological Background


Genes


Genes


Protein-coding regions

3 nucleotides code for one amino acid


There are also start and stop codons
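The codon idea can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and a simplified reading rule (start at the first ATG, stop at the first stop codon); the names `CODON_TABLE` and `translate` are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first ATG (start codon) up to the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unknown codon
        if amino_acid == "*":  # stop codon: TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ccATGGCCTAAgg"))  # ATG GCC TAA
```

Real gene finding is more involved (introns, reading frames, both strands), so treat this as a toy model of "3 nucleotides code for one amino acid".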

Biological Background

in a nutshell


Abstractions

Functional Units:
Proteins

Templates: RNAs

Blueprints: DNAs

Not only the information (data), but also the control signals about what and how much data is to be sent

Proteins (TFs) also help

Biological Background


Sequences


Abstractions



Sequences

acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctact
ggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatact
ggatacagggcatataaaacaggggcaaggcacagactc

FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT                   /experiment="experimental evidence
FT                   recorded"
FT   exon            29..174
FT                   /gene="CREB"
FT                   /number=4
FT                   /experiment="experimental evidence
FT                   recorded"
FT   intron          175..>189
FT                   /gene="CREB"
FT                   /number=4
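To show how such annotations relate to the raw sequence, here is a small sketch that parses the location part of a feature line and cuts the matching piece out of a sequence. It assumes the lines follow the usual feature-table convention (1-based, inclusive positions; `<` and `>` marking a feature that continues beyond the shown sequence). The helper names `parse_location` and `extract` are hypothetical.

```python
import re

# Matches simple locations such as "29..174", "<1..28" or "175..>189".
LOCATION = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_location(text):
    """Return (start, end, open_start, open_end) for a simple location string."""
    match = LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    open_start, start, open_end, end = match.groups()
    return int(start), int(end), bool(open_start), bool(open_end)


def extract(sequence, start, end):
    """Cut out positions start..end (1-based, inclusive) from the sequence."""
    return sequence[start - 1:end]


start, end, _, _ = parse_location("29..174")
print(end - start + 1)  # length of the annotated exon in bases
```

Only the simplest location form is handled; joined or complemented locations would need a fuller parser.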

Annotations

Visualizations

Biological Background



DNA


RNA


Protein

Picture taken from

http://en.wikipedia.org/wiki/Gene

gene

Biological Background



DNA


RNA


Protein

Transcriptional Regulatory Network is the complex interaction between genes, transcription factors (TF) and transcription factor binding sites (TFBS).

[Figure labels: Transcription Factors, Binding sites, Genes, Promoter regions, Other functions]

Complex Interactions between Genes, TFs and TFBSs

Gene Expression Microarray Data


High throughput


Measures RNA level


Relies on A-T, G-C pairing


Can monitor expression of many genes
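The base-pairing rule behind the array can be illustrated with a reverse-complement check. This is a toy sketch that assumes perfect, exact pairing between a probe and a sample strand; real hybridization tolerates some mismatches and depends on conditions. The function names are made up for this example.

```python
# A pairs with T, G pairs with C (case is preserved).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the strand that pairs with `dna`, read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def can_hybridize(probe, sample):
    """True if `sample` contains a stretch that pairs perfectly with `probe`."""
    return reverse_complement(probe.upper()) in sample.upper()


print(reverse_complement("ATGC"))
print(can_hybridize("ATGC", "ttGCATtt"))
```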

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

Gene Expression Microarray Data

Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

Genes

Time points/Conditions

Colors: Expression (RNA) Levels

Bioinformatics

Sequence Analysis


Alignments


A way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences






http://en.wikipedia.org/wiki/Sequence_alignment

Bioinformatics

Sequence Analysis


Pairwise alignments





Method: dynamic programming!




No penalty for the consecutive ‘-’s before and after the sequence to be aligned
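A minimal dynamic-programming sketch of this idea follows. It scores the alignment of a short `query` inside a longer `target`, with no penalty for the gaps before and after the query. The scoring values (match 1, mismatch -1, gap -2) and the name `semiglobal_score` are assumptions for illustration, not the scheme used in the lecture.

```python
def semiglobal_score(query, target, match=1, mismatch=-1, gap=-2):
    """Best score for aligning all of `query` somewhere inside `target`."""
    m, n = len(query), len(target)
    # score[i][j]: best score for query[:i] aligned so that it ends at target[:j].
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0 stays 0: gaps placed before the query cost nothing.
    for i in range(1, m + 1):
        score[i][0] = i * gap  # query characters aligned to nothing are penalised
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in the target
                score[i][j - 1] + gap,       # gap inside the query
            )
    # Taking the best value in the last row makes gaps after the query free too.
    return max(score[m])


print(semiglobal_score("ACGT", "TTACGTTT"))
```

The table has (m+1) x (n+1) cells, so the cost grows with the product of the two sequence lengths.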

\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures

Bioinformatics

Sequence Analysis


Multiple (global) sequence alignment


Also dynamic programming (but can’t scale up!)
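A back-of-the-envelope sketch of why it does not scale: if the pairwise table is generalised directly to N sequences, the table gets one dimension per sequence. The function name `dp_cells` is made up for this illustration, and it assumes all sequences have the same length.

```python
def dp_cells(seq_length, n_sequences):
    """Number of cells in a full DP table for N sequences of equal length."""
    return (seq_length + 1) ** n_sequences


for n in (2, 3, 5, 10):
    print(n, dp_cells(100, n))
```

With sequences of length 100, two sequences need about ten thousand cells, while ten sequences already need more than 10^20, which is why heuristic methods are used in practice.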






Bioinformatics

Sequence Analysis


Multiple local sequence alignment


i.e. Motif (pattern) discovery




>seq1

acatggccgatcagctggtttttgtgtgcctgtttctgaatc

>seq2

ttctattttacgtaaatcagcttgaacatgtacctactggtg

>seq3

atgcacctttgatcaataccagctagacaaacgtgtgttg

>seq4

agtccaaagatcagggctggctgaatactggatcagct

>seq5

cagctacagggcatataaaggggcaaggcacagactc

Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).


TFBSs are the controlling keyholes in gene regulation!
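As a toy version of motif discovery, the sketch below looks for short words (k-mers) that occur in every one of the five example sequences. It only finds exact matches, which is a strong simplification since real motifs allow mismatches; `kmers` and `shared_kmers` are names chosen for this example.

```python
def kmers(seq, k):
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(seqs, k):
    """k-mers that occur (exactly) in every sequence."""
    common = kmers(seqs[0], k)
    for seq in seqs[1:]:
        common &= kmers(seq, k)
    return common


SEQS = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",  # seq1
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",  # seq2
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",    # seq3
    "agtccaaagatcagggctggctgaatactggatcagct",      # seq4
    "cagctacagggcatataaaggggcaaggcacagactc",       # seq5
]

print(sorted(shared_kmers(SEQS, 5)))
```

The word `cagct` appears in all five sequences, which is the kind of overrepresented pattern a motif finder would report.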

DNA motifs


Similar DNA fragments across individuals
and/or species


TFBS Motifs:

DNA fragments similar to “TATAA” are commonly needed for genes to function


Expensive and time-consuming to try a large set of candidates in biological experiments
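A quick computational screen can narrow down candidates before any wet-lab work. The sketch below lists the windows of a sequence that differ from a consensus such as TATAA in at most a given number of positions. The mismatch threshold and the function names are assumptions for illustration only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approx_matches(seq, motif, max_mismatches=1):
    """Return (position, window) pairs within max_mismatches of the motif."""
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if hamming(window, motif) <= max_mismatches:
            hits.append((i, window))
    return hits


print(approx_matches("ccTATATgg", "TATAA", max_mismatches=1))
```

Positions are 0-based here; a stricter or looser threshold changes how many candidates survive the screen.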

[Figure labels: DNA, TFBS (controlling, e.g. TATAA), Gene (functioning), TF (Transcription Factor), Transcription, RNA, Translation, Protein]

[Figure: Motif discovery. Labels: CGATTGA, f, Similar controlled functions (e.g. cancer gene activities), Maximized]

TFBS Motif Discovery

SNP (single nucleotide polymorphism)

Motif Discovery



[Figure: DNA from different people, grouped as Normal and Disease!, shown as columns of bases (A, C, T, G); labels: f, distinguish, Maximized]

Bioinformatics

Data mining


Classification


To predict!


Pre-processing: tidy up your materials!


Feature selection: the key points to go over


Classifier: the thinking style/manner of how to combine the key points and get some answer


Training: your practice of your thinking manner with answers known


Validation: mock quiz to evaluate what you’ve learnt from the training


Testing: your examination!
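The training / validation / testing stages need three separate portions of the data. A minimal sketch of such a split is shown below; the 60/20/20 proportions, the fixed seed and the name `split_dataset` are assumptions for illustration.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever is left is held out for testing
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
```

The key point is that the test part is never looked at during training or validation, just as the exam paper is not seen before the exam.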





\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf

Underfitting & Overfitting

TRANSFAC Project

TF - Transcription Factors, important regulators

TFBS - Transcription Factor Binding Site, major regulatory elements

TRANSFAC - The most representative DB for TFs and TFBSs

1. Modeling: statistical models, representations, Markov chains; Discovery: stochastic searching, indexing (suffix trees)

2. Relationship: TF-TFBS; TFBS-Gene… (understanding, prediction); Mining: text mining, approximate matching

3. Annotations: accurate wet-lab candidates (reduced labor and costs); Computation: large scale data processing; parallel computing

Representative Publications

[1] Gang Li, Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, A Cluster Refinement Algorithm for Motif Discovery, IEEE/ACM Transactions on Computational Biology and Bioinformatics (accepted)

[2] Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, TFBS identification based on genetic algorithm with combined representations and adaptive post-processing, Bioinformatics, 2008, 24(3), pp. 341-349

Bioinformatics

Data mining


Evaluation (scores!)


Confusion Matrix


Binary Classification


Performance Evaluation Metrics


Accuracy


Sensitivity/Recall/TP Rate


Specificity/TN Rate


Precision/PPV
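These four metrics all come straight from the counts in a binary confusion matrix (TP, FP, TN, FN). The sketch below uses the standard definitions; the function name and the example counts are made up for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard evaluation metrics from binary confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # all correct / all samples
        "sensitivity": tp / (tp + fn),                # recall, TP rate
        "specificity": tn / (tn + fp),                # TN rate
        "precision": tp / (tp + fp),                  # PPV
    }


# Hypothetical counts: 50 actual positives, 50 actual negatives.
for name, value in binary_metrics(tp=40, fp=5, tn=45, fn=10).items():
    print(f"{name}: {value:.3f}")
```

The sketch assumes none of the denominators is zero, i.e. both classes occur and at least one sample is predicted positive.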






\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf

Bioinformatics

Data mining


Evaluation


ROC (Receiver Operating Characteristic)


Trade-off between positive hits (TP) and false alarms (FP)
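The trade-off can be traced by lowering the decision threshold one sample at a time and recording the false-positive rate and true-positive rate at each step. The sketch below assumes labels are 1 (positive) or 0 (negative), that both classes are present, and that all scores are distinct; `roc_points` is a name chosen for this example.

```python
def roc_points(scores, labels):
    """Return the (FP rate, TP rate) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Visit samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for _score, label in ranked:
        if label == 1:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


print(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A curve that climbs quickly towards the top-left corner means many positive hits are gained before many false alarms are paid for.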




Not The End


Your corresponding tutor will have more project-specific stuff to tell you



Thanks


Q & A