
Bioinformatics lectures at Rice University

Li Zhang

Lecture 1

Department of Bioinformatics and Computational Biology

MD Anderson Cancer Center

March-April, 2012

Contact information


Li Zhang

Phone: 713-563-4298 (office)
       713-962-6661 (cell)

Email: lzhangli@mdanderson.org

URL: http://odin.mdacc.tmc.edu/~llzhang/RiceCourse/

Office location: FCT4.5034, Pickens Tower, 4th floor,
MD Anderson Cancer Center.





Homework


There will be 2-3 assignments posted online.


All students are required to complete the assignments.
Homework will be submitted at the beginning of class on
the due date.


If circumstances beyond the student’s control arise and an
assignment cannot be submitted on the due date, an
instructor should be contacted prior to the due date. With
an instructor’s permission, late homework may be accepted
within one week of the due date.


All decisions will be made on an individual student basis
and the final decision rests with the instructor assigning the
homework. A penalty of 10 percentage points will be
applied to late homework.


What is bioinformatics?


Bioinformatics is the application of computer science and
information technology to the field of biology and medicine.
Bioinformatics deals with algorithms, databases and
information systems, web technologies, artificial
intelligence and soft computing, information and computation
theory, software engineering, data mining, image processing,
modeling and simulation, signal processing, discrete
mathematics, control and system theory, circuit theory, and
statistics, for generating new knowledge of biology and
medicine, and improving & discovering new models of
computation (e.g. DNA computing, neural computing,
evolutionary computing, immuno-computing, swarm-computing,
cellular-computing).


Commonly used software tools and technologies in this field
include Java, XML, Perl, C, C++, Python, R, MySQL, SQL, CUDA,
MATLAB, and Microsoft Excel.


Focus area of this course


Most of the content of this course can be found in
Pierre Baldi’s book, “Bioinformatics: A machine
learning approach”, and a few key papers.


Introducing general topics in the current field of
bioinformatics.


Introducing high throughput technologies that provide
the data.


How to use algorithms and models to visualize and
explore large datasets in search of specific
features, patterns, and relationships hidden in the
data.


Computing language: R.


Not focused on databases or web applications, and
little structural biology.

Why should we study bioinformatics?

Why is it important to study bioinformatics?

Let us see a few growth charts …

Growth of PDB (Protein Structures)

The Protein Data Bank (PDB) is a
repository for the 3-D structural
data of large biological molecules,
such as proteins and nucleic acids.
Most structures are determined by
X-ray diffraction, but about 15% of
structures are determined by NMR.


Large-scale organized efforts by the
Structural Genomics Initiative and the
International Structural Genomics
Consortium have greatly
accelerated the pace of growth.

Growth Chart Of GEO (RNA etc)

The Gene Expression Omnibus (GEO) database holds
over 10 000 experiments comprising 300 000 samples
and 16 billion individual abundance measurements,
for over 500 organisms, submitted by 5000
laboratories from around the world. The database
typically receives over 60 000 query hits and
10 000 bulk FTP downloads per day, and has been
cited in over 5000 manuscripts.

GenBank growth chart (DNA sequences)

There are 126 billion bases in
135 million sequence records in
the traditional GenBank divisions
and 191 billion bases in 62
million sequence records in the
WGS division as of April 2011.
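
To get a feel for these numbers, here is a minimal Python sketch that works out the average record length in each division from the figures quoted above. The function and variable names are illustrative assumptions, not part of any GenBank tool, and the averages say nothing about how record lengths are actually distributed.

```python
def mean_record_length(total_bases, n_records):
    """Average number of bases per sequence record."""
    return total_bases / n_records


# (total bases, sequence records) as quoted on the slide for April 2011.
divisions = {
    "traditional": (126_000_000_000, 135_000_000),
    "WGS": (191_000_000_000, 62_000_000),
}

mean_lengths = {
    name: mean_record_length(bases, records)
    for name, (bases, records) in divisions.items()
}
total_bases = sum(bases for bases, _ in divisions.values())

for name, length in mean_lengths.items():
    print(f"{name}: about {length:,.0f} bases per record")
print(f"all divisions: {total_bases / 1e9:.0f} billion bases")
```

On these figures a WGS record is, on average, roughly three times as long as a record in the traditional divisions.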


A brief introduction to molecular biology


James Watson and Francis Crick

DNA

Next generation sequencing

The cost of sequencing has fallen 100,000-fold
in the past 12 years
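
As a rough illustration of what a decline of this size means, the Python sketch below converts it into a per-year factor and a halving time. It assumes the cost fell at a constant exponential rate over the whole period, which is a simplification (the real decline was not uniform), and the function name is hypothetical.

```python
import math


def fold_reduction_rates(fold, years):
    """Return (annual_factor, halving_time_years) for a total fold
    reduction spread evenly (exponentially) over the given years."""
    annual_factor = fold ** (1.0 / years)
    halving_time_years = years * math.log(2) / math.log(fold)
    return annual_factor, halving_time_years


# Figures from the slide: 100,000-fold over 12 years.
annual_factor, halving_time = fold_reduction_rates(100_000, 12)

print(f"Cost falls about {annual_factor:.2f}-fold per year")
print(f"Cost halves about every {halving_time * 12:.1f} months")
```

Under that assumption the cost drops by a factor of about 2.6 each year, that is, it halves in well under a year.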

The little USB drive could do it

Oxford Nanopore, long the sleeper project to
watch in the field of mapping DNA, just announced
two products that could dramatically change the
field of DNA sequencing: a new DNA sequencer
that may be able to handle a human genome in 15
minutes, and a USB thumb drive DNA sequencer
that can read DNA directly from blood with no
prep work.


“‘Game changer’ is an understatement,” says
George Church of Harvard University. (Church was
one of the inventors on one of the patents licensed
to Oxford Nanopore that led to the device.) He
ticks off the device’s specs: Tiny instruments for
$900. Able to read DNA in 10,000-letter stretches,
compared to a couple hundred for current
technologies. Able to sequence a human genome
in fifteen minutes (although you’d need 20 of the
server-size devices coming in 2013, not just the
USB stick).

Data explosion in the era of genomics


There has been a long series of breakthroughs
in micro-electronics and nano-electronics that
have produced instruments that quantify and/or
characterize large numbers of biological molecules
in parallel using a very small amount of biomaterial.



Such technical advances have made it possible to
comprehensively characterize and quantify the
building blocks (DNA, RNA, protein) in a
biological system.

Think Google.



Or, think Netflix.

Bioinformatics is the key in genomics

Genome, genomics and post genomic era

List of sequenced genomes of mammals:

Type                               Genome size   Year of completion
Cow                                3.0 Gb        2009
Dog                                2.4 Gb        2005
Guinea Pig                         3.4 Gb
Nine-banded Armadillo              3.0 Gb
Hedgehog-Tenrec
Horse                              2.1 Gb        2007
Western European Hedgehog
Cat                                3 Gb          2007
Human                              3.2 Gb        2001
African Elephant                   3 Gb
Rhesus Macaque
Gray Mouse Lemur
Gray Short-tailed Opossum          3.5 Gb        2007
Mouse                              2.5 Gb        2002
Little Brown Bat
American Pika
Platypus
Rabbit                             2.5 Gb
Small-eared Galago, or Bushbaby
Chimpanzee                         3.1 Gb        2005
Orangutan                          3.0 Gb
Rat                                2.8 Gb        2004
European Shrew                     3.0 Gb
Thirteen-lined Ground Squirrel
Domestic pig                                     2009
Northern Tree Shrew

Large Projects


TCGA: The Cancer Genome Atlas


1000 Genomes Project


1001 Genomes Project


ICGC: International Cancer Genome Consortium


The International HapMap Project





Data → Information → Knowledge/power

Bioinformatics provides tools to catalyze the transformations