
Nov 7, 2013


EE150a


Genomic Signal and
Information Processing


Seminar series


lectures in the first 3 meetings, followed by student presentations


statistical signal processing basics


background reading for each meeting


Location: Moore 080 (except today)


List of papers with links:
www.its.caltech.edu/~hvikalo/gsip.html


minor modifications of the list are likely


Contact: Haris Vikalo, Moore 125


Phone: 395-4184


E-mail: hvikalo@caltech.edu


Check the website occasionally for updates and a growing list of
research-related links


Today’s handouts:


basic course info and a list of papers


R. Karp’s “Mathematical Challenges from Genomics and
Molecular Biology”


sign-up sheet


Next time: Prof. Vaidyanathan’s lecture on “Signal Processing
Problems in Genomics”


In two weeks: lecture on DNA microarray technology and
novel techniques for estimating gene expression levels


Today: introduction with brief overview of the topics for
presentation


Central Dogma of Molecular Biology


Flow of information in a cell:






[Due to Francis Crick. It has recently been realized that the
dogma requires modifications, but more about that later in the
course.]


Recent development of high-throughput technologies that
study the above flow


requires interdisciplinary effort


dealing with a huge amount of information

genetics.gsk.com/graphics/dna-big.gif


Four nucleotides: adenine (A), cytosine
(C), guanine (G), and thymine (T)


Bindings:


A with T (weaker), C with G (stronger)


Forms a double helix


each strand is held together by sugar-phosphate bonds (strong);
the two strands are linked to each other by hydrogen bonds (weak)


Genes are the parts of DNA that encode proteins:



AAC TCG CAT CGA ACT CTA AGTC
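
The pairing rule above (A with T, C with G) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the GC fraction is included only as a rough proxy for how many of the stronger C-G pairs a strand contains.

```python
# Watson-Crick pairing from the slide: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def gc_fraction(strand: str) -> float:
    """Fraction of bases that are G or C (the more strongly bound pair)."""
    return sum(base in "GC" for base in strand) / len(strand)


# The sequence shown on the slide, written without spaces.
example = "AACTCGCATCGAACTCTAAGTC"
opposite = reverse_complement(example)
```

Applying `reverse_complement` twice returns the original strand, which is a quick sanity check on the pairing table.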



DNA Structure

Sidenote: Sequence Alignment


Perhaps the most fundamental operation in bioinformatics


used to decide if two genes or proteins are related by function,
structure, or evolutionary history


can identify patterns of conservation and variability


Performs pairwise matching between characters of each
sequence


One place where it is useful: SNP (single-nucleotide
polymorphism) detection


SNPs may indicate disease development (myocardial diseases,
arthritis, etc. have been associated with SNPs)


Sequence alignment is the first student presentation topic in
the series (HMM, dynamic programming, Bayesian methods)
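
As an illustration of the dynamic programming approach mentioned above, here is a minimal sketch of global pairwise alignment (Needleman-Wunsch style). The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for the example, not values from the course material.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of substitution, deletion, insertion.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


def substitution_columns(aligned_a, aligned_b):
    """Columns where two aligned sequences differ by a base, not a gap."""
    return [k for k, (x, y) in enumerate(zip(aligned_a, aligned_b))
            if x != y and "-" not in (x, y)]
```

With these assumed scores, aligning "ACGT" against "ACCT" keeps all four columns and `substitution_columns` reports column 2, which is the kind of single-base difference a SNP search looks for.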


Details of the information flow


Replication of DNA


{A,C,G,T} to {A,C,G,T}


Transcription of DNA to mRNA


{A,C,G,T} to {A,C,G,U}


Translation of mRNA to proteins


{A,C,G,U} to {20 amino acids}

http://www-stat.stanford.edu/~susan/courses/s166/central.gif
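
The alphabet mappings listed above can be sketched in a few lines of Python. This is a simplified illustration: it assumes the input is the coding strand (so transcription reduces to replacing T with U), uses the standard genetic code, and ignores splicing and the search for a start codon. The function names are hypothetical.

```python
# Standard genetic code, codons enumerated in U, C, A, G order; "*" marks stop.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand: str) -> str:
    """{A,C,G,T} -> {A,C,G,U}: mRNA copy of the coding strand (simplified)."""
    return coding_strand.replace("T", "U")


def translate(mrna: str) -> str:
    """{A,C,G,U} -> amino acids: read codons from position 0 until a stop."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")   # "AUGGCCUAA"
protein = translate(mrna)        # "MA" (Met, Ala, then a stop codon)
```

The table maps 64 codons onto 20 amino acids plus a stop signal, so the code is redundant: several codons encode the same amino acid.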

Genes can be turned on and off

Microarray Technology


A medium for matching known and unknown DNA samples
based on hybridization (base-pairing)


Two major applications


identification of a sequence (gene or gene mutation)


determination of expression level (abundance) of genes


Enables massively parallel gene expression studies


Two types of molecules take part in the experiments:


probes, orderly arranged on an array


targets, the unknown samples to be detected


“Traditionally”, there are two formats:


probe cDNA immobilized to a solid surface using robot
spotting and exposed to a set of targets, and


an array of oligonucleotide probes synthesized on chip (via,
e.g., photolithography)


Targets are typically fluorescently labeled cDNA molecules
obtained from mRNA samples


hybridize to their complementary probes


image readout

Types of Microarrays

http://pcf1.chembio.ntnu.no/~bka/images/MicroArrays.jpg

Illustration: DNA microarray

Sample Microarray Readout

Some Design Issues


Hybridization is binding of a target to its perfect complement


However, when a probe differs from a target's complement by a small
number of bases, it may still bind


This non-specific binding (cross-hybridization) is a source of
measurement noise


In special cases (e.g., arrays for gene detection), the designer has a
lot of control over the landscape of the probes on the array


The second presentation topic considers the combinatorial design
of such arrays


[How to deal with cross-hybridization on arrays used for
expression level measurements is the topic of the third lecture.]
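
A toy Python sketch of the distinction between specific binding and cross-hybridization follows. It simply counts mismatched positions between a probe and the perfect complement of a target; the `max_mismatches` threshold and the class labels are hypothetical, and real binding depends on thermodynamics that this sketch ignores.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand: str) -> str:
    """Perfect complement of a strand, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def mismatches(probe: str, target: str) -> int:
    """Number of positions where the probe fails to pair with the target."""
    if len(probe) != len(target):
        raise ValueError("this sketch only compares equal-length sequences")
    perfect_probe = reverse_complement(target)
    return sum(p != q for p, q in zip(probe, perfect_probe))


def binding_class(probe: str, target: str, max_mismatches: int = 2) -> str:
    """Classify a probe-target pair; the threshold is an illustrative guess."""
    count = mismatches(probe, target)
    if count == 0:
        return "specific"
    if count <= max_mismatches:
        return "possible cross-hybridization"
    return "no binding expected"


target = "AACTCGCA"
perfect = reverse_complement(target)   # "TGCGAGTT"
```

A probe that differs from `perfect` in one base falls into the cross-hybridization class here, which is the case the slide identifies as a source of measurement noise.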

Clustering Gene Expression Profiles


Microarrays measure expression levels of thousands of genes
simultaneously


For instance, we might take samples at different times during a
biological process


Cluster data in the expression level space


relatedness in biological function often implies similarity in
expression behavior (and vice versa)


similar expression behavior indicates co-expression


Clustering of expression level data is one of the topics
(traditional statistical methods, but also graph-theoretic and
information-theoretic approaches, etc.)
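
One traditional statistical method is agglomerative (hierarchical) clustering. The sketch below is a minimal pure-Python version that uses 1 minus the Pearson correlation as the distance between time-course profiles and average linkage between clusters; both choices, the function names, and the sample data are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cluster_profiles(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[gene] for gene in profiles]

    def distance(c1, c2):
        # Average linkage: mean of (1 - correlation) over all cross pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(1 - pearson(profiles[a], profiles[b])
                   for a, b in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]          # safe because j > i
    return [set(c) for c in clusters]


# Made-up expression levels at four time points (not real measurements).
profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4.5, 2],
}
groups = cluster_profiles(profiles, 2)
```

On this made-up data the rising genes g1 and g2 end up in one cluster and the falling genes g3 and g4 in the other, because correlation compares the shape of a profile rather than its magnitude.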

Example of Clustering

http://www.genomatix.de/gif/node43_documentation.gif


Rows: various gene
expression levels


Columns: Time progression


So-called hierarchical
clustering

Co-regulated genes


Co-expressed genes may be co-regulated


a combination of transcription factors (activating or
repressing proteins) regulates genes jointly


Finding binding sites (control regions) of co-regulated genes
is another topic


HMM, probabilistic methods
(EM, Gibbs sampling)
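
The probabilistic methods named above typically estimate a position-specific model of the binding site. The Python sketch below is not EM or Gibbs sampling itself; it only shows, under simplifying assumptions, the kind of model those methods fit: a log-odds position weight matrix built from already aligned example sites and used to scan a sequence. The uniform background, the pseudocount, the function names, and the example sites are all hypothetical.

```python
from math import log2


def build_pwm(sites, pseudocount=1.0):
    """Log-odds score per position and base, against a uniform background."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return pwm


def best_site(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scored = [
        (sum(pwm[k][sequence[start + k]] for k in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Hypothetical aligned binding sites and a sequence to scan.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
score, start = best_site(pwm, "GGCGTATAATGCCG")
```

In EM or Gibbs sampling the site positions are unknown, so the matrix and the site locations are estimated together; this sketch covers only the scoring step.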

Genetic Regulatory Networks


Proteins take part in gene regulation


feedback loop in the Central Dogma information flow


Thus to fully understand gene regulation, we need to consider
interactions


DNA, RNA, proteins, small molecules


Requires network formalism


directed graphs, Boolean networks, Bayesian networks,
differential equations etc.


Explore some of these models in gene regulation context
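
Of the formalisms listed, a Boolean network is the simplest to sketch. In the minimal Python example below, each gene is either on or off and all genes update synchronously from the previous state. The three-gene wiring (C represses A, A activates B, A and B jointly activate C) is invented for illustration and does not describe a real regulatory circuit.

```python
def step(state, rules):
    """Synchronous update: every gene reads the previous state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def trajectory(state, rules, n_steps):
    """List of states visited, starting with the initial state."""
    states = [state]
    for _ in range(n_steps):
        state = step(state, rules)
        states.append(state)
    return states


# Hypothetical three-gene network with a negative feedback loop.
rules = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B jointly activate C
}

start = {"A": True, "B": False, "C": False}
path = trajectory(start, rules, 5)
```

Starting from A on and the others off, this network returns to its initial state after five updates, so the negative feedback produces an oscillation rather than a fixed point.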

An Illustration of a Regulatory Network

Protein Translation/Folding


[Should time permit.]


The sequence-structure relationship will play a very important role
in the postgenomic era


potential great impact on genetics and pharmaceutical
chemistry, protein design


diseases such as Alzheimer’s are believed to be related to
protein misfolding


Computationally very hard


parallel, distributed computing

Genomic data fusion


Consider the problem of classification of a protein and assume
that we know:


original gene sequence encoding the protein


gene expression levels


some of the protein-protein interactions


Question: how to combine various types of data to classify the
protein


The last (right now…) topic of the seminar will be data fusion
of the various genomic data listed above


efficient convex-optimization-based statistical learning
algorithm

Summary


Trying to understand gene regulation


Recent technologies revolutionized research


huge amount of data


Multidisciplinary; identify opportunities


Challenging problems, quite important:


understanding information processes at the genetic level gives
insights into phenotypic effects (disease)


some of the ultimate goals are molecular diagnostics and
creating personalized drugs