EE150a Genomic Signal and Information Processing


Nov 7, 2013



Genomic Signal and
Information Processing

Seminar series

lectures in the first 3 meetings, followed by student presentations

statistical signal processing basics

background reading for each meeting

Location: Moore 080 (except today)

List of papers with links:

minor modifications of the list are likely

Contact: Haris Vikalo, Moore 125

Phone: 395


Check the website occasionally for updates and a growing list of
research-related links

Today’s handouts:

basic course info and a list of papers

R. Karp’s “Mathematical Challenges from Genomics and
Molecular Biology”

sign-up sheet

Next time: Prof. Vaidyanathan’s lecture on “Signal Processing
Problems in Genomics”

In two weeks: lecture on DNA microarray technology and
novel estimation techniques of gene expression levels

Today: introduction with brief overview of the topics for

Central Dogma of Molecular Biology

Flow of information in a cell:

[Due to Francis Crick. It has recently been realized that the
dogma requires modifications but more about that later in

Recent development of high-throughput technologies that
study the above flow

requires interdisciplinary effort

dealing with a huge amount of information

Four nucleotides: adenine (A), cytosine
(C), guanine (G), and thymine (T)


Base pairing: A with T (weaker), C with G (stronger)

Forms a double helix

each strand is linked via sugar-phosphate bonds (strong);
the two strands are linked via hydrogen bonds (weak)

Genome is the part of DNA that encodes


DNA Structure

Sidenote: Sequence Alignment

Perhaps the most fundamental operation in bioinformatics

used to decide if two genes or proteins are related by function,
structure, or evolutionary history

can identify patterns of conservation and variability

Performs pairwise matching between characters of each sequence

One place where it is useful: SNP (single-nucleotide
polymorphism) detection

SNPs may indicate disease development (myocardial diseases,
arthritis, etc. have been associated with SNPs)

Sequence alignment is the first student presentation topic in
the series (HMM, dynamic programming, Bayesian methods)
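The dynamic programming approach can be illustrated with a minimal sketch of global pairwise alignment (Needleman-Wunsch). The scoring values (match = 1, mismatch = -1, gap = -1) and the example sequences are assumptions chosen for illustration, not values from the seminar material.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    def sub(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),  # pair two characters
                score[i - 1][j] + gap,                          # gap in b
                score[i][j - 1] + gap,                          # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def mismatch_columns(aligned_a, aligned_b):
    """Columns where the aligned sequences differ (candidate SNP positions)."""
    return [k for k, (x, y) in enumerate(zip(aligned_a, aligned_b)) if x != y]
```

For two sequences that differ in a single base, `mismatch_columns` applied to the aligned strings points at the candidate SNP position. Real SNP calling also has to account for sequencing errors, which this sketch ignores.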

Details of the information flow

Replication of DNA

{A, C, G, T} to {A, C, G, T}

Transcription of DNA to mRNA

{A, C, G, T} to {A, C, G, U}

Translation of mRNA to proteins

{A, C, G, U} to {20 amino acids}
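The three alphabet mappings above can be sketched as string operations. This is a simplified illustration: it assumes the input is the coding strand read 5' to 3', uses the standard genetic code, and ignores splicing and other biological detail. The function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Standard genetic code with codons enumerated in U, C, A, G order; '*' is a stop codon.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def replicate(strand):
    """Replication: the complementary DNA strand, read 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(strand))


def transcribe(coding_strand):
    """Transcription: mRNA carries the coding-strand sequence with T replaced by U."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Translation: read codons from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC, UAA and stops at the third one.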


Genes can be turned on and off

Microarray Technology

A medium for matching known and unknown DNA samples
based on hybridization (base pairing)

Two major applications

identification of a sequence (gene or gene mutation)

determination of expression level (abundance) of genes

Enables massively parallel gene expression studies

Two types of molecules take part in the experiments:

probes, orderly arranged on an array

targets, the unknown samples to be detected

“Traditionally”, there are two formats:

probe cDNA immobilized to a solid surface using robot
spotting and exposed to a set of targets, and

an array of oligonucleotide probes synthesized on chip (via,
e.g., photolithography)

Targets are typically fluorescently labeled cDNA molecules
obtained from mRNA samples

hybridize to their complementary probes

image readout

Types of Microarrays

Illustration: DNA microarray

Sample Microarray Readout

Some Design Issues

Hybridization is binding of a target to its perfect complement

However, when a probe differs from a target by a small number
of bases, it still may bind

This non-specific binding (cross-hybridization) is a source of
measurement noise
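A toy model of the distinction between specific binding and cross-hybridization: count the positions where a target fails to pair with a probe. The mismatch threshold `max_mismatches` and the equal-length restriction are assumptions made for illustration; actual binding depends on thermodynamics, not only on a mismatch count.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Perfect complement of seq, read 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))


def mismatches(probe, target):
    """Number of positions where target differs from the probe's perfect complement."""
    if len(probe) != len(target):
        raise ValueError("this sketch only compares sequences of equal length")
    perfect = reverse_complement(probe)
    return sum(1 for x, y in zip(perfect, target) if x != y)


def classify_binding(probe, target, max_mismatches=2):
    """Label a probe/target pair; max_mismatches is a hypothetical threshold."""
    d = mismatches(probe, target)
    if d == 0:
        return "specific"
    if d <= max_mismatches:
        return "cross-hybridization"
    return "no binding"
```

With probe `"AAGGCT"`, the perfect complement is `"AGCCTT"`; a target that differs from it in one base is labeled as cross-hybridization under this threshold.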

In special cases (e.g., arrays for gene detection), the designer has a
lot of control over the landscape of the probes on the array

The second presentation topic considers combinatorial design
of such arrays

[How to deal with cross-hybridization on arrays used for
expression level measurements is the topic of the third lecture.]

Clustering Gene Expression Profiles

Microarrays measure expression levels of thousands of genes

For instance, we might take samples at different times during a
biological process

Cluster data in the expression level space

relatedness in biological function often implies similarity in
expression behavior (and vice versa)

similar expression behavior indicates co

Clustering of expression level data is one of the topics
(traditional statistical methods but also graph-theoretic
approach, information-theoretic approach, etc.)
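As an example of a traditional statistical method, here is a minimal sketch of agglomerative (hierarchical) clustering of expression profiles. The choice of distance (1 minus Pearson correlation), average linkage, and the gene names are assumptions for illustration; the seminar papers may use other choices.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def hierarchical_clusters(profiles, k):
    """Merge the two closest clusters (average linkage) until k clusters remain.

    profiles maps a gene name to its expression levels over time.
    """
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Genes whose levels rise together over time end up in one cluster and genes whose levels fall together in another, regardless of absolute scale, because correlation ignores scale and offset.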

Example of Clustering

Rows: various gene
expression levels

Columns: Time progression

so-called hierarchical clustering

Co-regulated genes

co-expressed genes may be co-regulated

a combination of transcription factors (activating or
repressing proteins) regulates genes jointly

Finding binding sites (control regions) of co-regulated genes
is another topic

HMM, probabilistic methods
(EM, Gibbs sampling)
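The probabilistic methods named above typically alternate between estimating a position weight matrix (PWM) from the current site guesses and re-locating the sites with that matrix. The sketch below shows only the scoring half: it builds a log-odds PWM from already aligned sites and scans a sequence with it. It is not an EM or Gibbs sampling implementation, and the example sites, the pseudocount, and the uniform background are assumptions.

```python
from math import log2


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        pwm.append({
            base: log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_site(pwm, sequence):
    """Return (start, window, score) of the highest-scoring window in sequence."""
    w = len(pwm)
    scored = [
        (sum(pwm[k][sequence[i + k]] for k in range(w)), i)
        for i in range(len(sequence) - w + 1)
    ]
    score, start = max(scored)
    return start, sequence[start:start + w], score
```

In an EM or Gibbs scheme, a step like `best_site` (or a probabilistic version of it) would be followed by re-estimating the matrix from the newly chosen sites, repeated until the sites stop changing.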

Genetic Regulatory Networks

Proteins take part in gene regulation

feedback loop in the Central Dogma information flow

Thus to fully understand gene regulation, we need to consider

DNA, RNA, proteins, small molecules

Requires network formalism

directed graphs, Boolean networks, Bayesian networks,
differential equations etc.

Explore some of these models in gene regulation context
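Of the formalisms listed, a Boolean network is the simplest to sketch: each gene is on or off, and its next state is a logical function of the current states. The three-gene negative feedback loop below (A activates B, B activates C, C represses A) is a hypothetical example, and synchronous updating is an assumption.

```python
def step(state, rules):
    """Synchronous update: every gene's next value is computed from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def trajectory(state, rules, max_steps=20):
    """Follow the network until a state repeats or max_steps states have been visited.

    Returns (visited_states, next_state); if next_state is in visited_states,
    the network has entered a cycle (or a fixed point).
    """
    visited = []
    while state not in visited and len(visited) < max_steps:
        visited.append(state)
        state = step(state, rules)
    return visited, state


# Hypothetical regulatory loop: A activates B, B activates C, C represses A.
RULES = {
    "A": lambda s: not s["C"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
```

Starting from A on and B, C off, this network never settles: the feedback through C keeps switching A, so the state cycles. Bayesian networks and differential equations replace the on/off abstraction with probabilities and concentrations, respectively.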

An Illustration of a Regulatory Network

Protein Translation/Folding

[Should time permit.]

structure relationship will play a very important role
in the postgenomic era

potential great impact on genetics and pharmaceutical
chemistry, protein design

diseases such as Alzheimer’s are believed to be related to
protein misfolding

Computationally very hard

parallel, distributed computing

Genomic data fusion

Consider the problem of classification of a protein and assume
that we know:

original gene sequence encoding the protein

gene expression levels

some of the protein-protein interactions

Question: how to combine various types of data to classify the protein?

The last (right now…) topic of the seminar will be data fusion
of the various genomic data listed above

efficient convex-optimization-based statistical learning


Trying to understand gene regulation

Recent technologies revolutionized research

huge amount of data

Multidisciplinary; identify opportunities

Challenging problems, quite important:

understanding information processes at the genetic level gives
insight into phenotypic effects (disease)

some of the ultimate goals are molecular diagnostics and
creating personalized drugs