Bioinformatics


Oct 2, 2013 (3 years and 11 months ago)


HGP, Fragment Assembly,
and Physical Mapping

蔡懷寬

E-mail: d7526010@csie.ntu.edu.tw

Human Genome Project (HGP)

Human Genome

Genome


All the genetic material in the chromosomes
of a particular organism


Its size is generally given as its total number
of base pairs.


Genome sizes


Human: 3000 million bases


Mouse: 3000 million bases


Drosophila (fruit fly): 165 million bases


Nematode (roundworm): 100 million bases


Yeast (fungus): 14 million bases


E. coli (bacteria): 4.67 million bases


The Human Genome Project


Abbreviated as HGP


Its main goals were to:


identify all the genes in human DNA,


determine the sequences of the 3 billion chemical
bases that make up human DNA


store this information in databases


develop tools for data analysis


transfer related technologies to the private sector


address the ethical, legal, and social issues (ELSI)
that may arise from the project

The work began formally in 1990 and was carried out in 16 centers across the
world. The project was originally planned to last 15 years, but rapid
technological advances have accelerated the expected completion date to 2003.

Human Genome Project (HGP)


Who did the work?

The international Human Genome Mapping Consortium includes researchers in
France, Germany, Japan, China, Great Britain, Canada and the US.

Celera Genomics (www.celera.com)

Human Genome Project (HGP)


Whose genome was sequenced?

Celera sequenced the genomes of five anonymous individuals: one
African-American, one Asian-Chinese, one Hispanic-Mexican and two
Caucasians. One individual's genome was sequenced 3.5 times; about half the
genome of each of the remaining individuals was sequenced.


The HGP study is based on data collected from many individuals from around
the world over a longer period of time, so it is more difficult to estimate
the exact size of the HGP pool (though it is significantly more than
Celera's five).




History and Progress of the HGP


February 2001:


Initial sequencing and analysis of the
human genome (Nature, Vol. 409, 15 Feb.
2001, by International Human Genome
Sequencing Consortium)


The sequence of the human genome
(Science, Vol. 291, 16 Feb. 2001, by J. C.
Venter, et al.)

Publication of the Draft Human Genome Sequence

February 12, 2001

Nature, 15 February 2001, Vol. 409, Pages 813-960

Science, 16 February 2001, Vol. 291, No. 5507, Pages 1145-1434

Strategies For Sequencing the Human Genome

The HGP's approach has primarily relied on a "map-based approach":
sequencing an overlapping series of large chunks of human DNA cloned into
bacteria. These sequences, represented by overlapping bacterial clones, are
then compiled by computer software using knowledge of each clone's position
on the map. In this way, the HGP has read each letter of DNA an average of
five times.

Celera's genome sequencing approach relied on "shotgun sequencing": a method
in which small bits of the genome are sequenced and assembled by computers
into intermediary "scaffolds" and, ultimately, whole chromosomes and the
genome.






How to Sequence a Genome by Mapped Clones

(clone & clone & clone)

Sequencing Strategies (1)


Map-Based Assembly:


Create a detailed complete fragment map


Time-consuming and expensive


Provides scaffold for assembly


Original strategy of Human Genome Project

Interactive Presentation: How to Sequence a Genome by Mapped Clones

Introduction

1. Mapping

2. Building Libraries

3. Subclones

4. E. coli to store and copy DNA

5. Preparing DNA for sequencing

6. Sequencing Reaction

7. Products of Sequencing Reaction

8. Separating the Sequencing Reaction

9. Reading the Sequencing Reaction

10. Assembling the Results

11. Working Draft Sequence

12. Conclusion

How to Sequence a Genome by Shotgun Sequencing

Sequencing Strategies (2)


Shotgun:


Quick, highly redundant


Requires 7-9X coverage for sequencing reads of 500-750 bp. This means
that for the human genome of 3 billion bp, 21-27 billion bases need to be
sequenced to provide adequate fragment overlap.


Computationally intensive


Troubles with repetitive DNA


Original strategy of Celera Genomics
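
The coverage arithmetic above can be checked with a short Python sketch. The function names are illustrative, not from the slides, and the last function uses the idealized Lander-Waterman (Poisson) model, which assumes reads land uniformly at random; real genomes with repeats and cloning bias will behave worse.

```python
import math

def bases_needed(genome_size_bp, coverage):
    """Total bases that must be sequenced to reach a given fold coverage."""
    return genome_size_bp * coverage

def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of fixed-length reads required (integer ceiling division)."""
    total = genome_size_bp * coverage
    return -(-total // read_length_bp)

def expected_fraction_covered(coverage):
    """Idealized Poisson estimate of the fraction of bases hit by >= 1 read."""
    return 1.0 - math.exp(-coverage)

HUMAN_GENOME_BP = 3_000_000_000  # 3 billion bp, as stated on the slide

for fold in (7, 9):
    print(
        f"{fold}X: {bases_needed(HUMAN_GENOME_BP, fold):,} bases, "
        f"{reads_needed(HUMAN_GENOME_BP, fold, 500):,} reads of 500 bp, "
        f"covered fraction ~ {expected_fraction_covered(fold):.6f}"
    )
```

Under this model the number of reads, not only the number of bases, scales with coverage, which is one reason the shotgun approach is described as computationally intensive.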


Shotgun Sequencing: Assembly of
Random Sequence Fragments


To sequence a Bacterial Artificial Chromosome (BAC, 100-300 Kb),
millions of copies are sheared randomly, inserted into plasmids,
and then sequenced. If enough fragments are sequenced, it
will be possible to reconstruct the BAC based on overlapping
fragments.
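
As a rough illustration of reconstruction from overlapping fragments, the sketch below uses a greedy merge: repeatedly join the two fragments with the longest suffix-prefix overlap. This is a toy technique chosen for clarity, not the assembler used by the HGP or Celera; it assumes error-free reads from a single strand and no repeats, and the names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (>= min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Greedily merge fragments by largest overlap; returns the remaining contigs."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        # Discard fragments fully contained in another fragment.
        frags = [f for f in frags if not any(f != g and f in g for g in frags)]
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return frags  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)

reads = ["ATGCGTAC", "GTACGTTA", "CGTTAGC"]
print(greedy_assemble(reads))
```

With repetitive DNA the longest overlap is often a false one, which is the "trouble with repetitive DNA" noted for the shotgun strategy.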

What Does the Draft Human Genome Sequence Tell Us?


By the Numbers


The Wheat from the Chaff


How It's Arranged


How the Human Compares with Other Organisms


Variations and Mutations

By the Numbers




The human genome contains 3164.7 million chemical nucleotide bases
(A, C, T, and G).



The average gene consists of 3000 bases, but sizes vary greatly, with
the largest known human gene being dystrophin at 2.4 million bases.



The total number of genes is estimated at 30,000 to 35,000, much lower
than previous estimates of 80,000 to 140,000 that had been based on
extrapolations from gene-rich areas as opposed to a composite of gene-rich
and gene-poor areas.



Almost all (99.9%) nucleotide bases are exactly the same in all people.



The functions are unknown for over 50% of discovered genes.

The Wheat from the Chaff




Less than 2% of the genome codes for proteins.




Repeated sequences that do not code for proteins ("junk DNA") make up at
least 50% of the human genome.



Repetitive sequences are thought to have no direct functions, but they shed
light on chromosome structure and dynamics. Over time, these repeats
reshape the genome by rearranging it, creating entirely new genes, and
modifying and reshuffling existing genes.



During the past 50 million years, a dramatic decrease seems to have
occurred in the rate of accumulation of repeats in the human genome.


How It's Arranged



The human genome's gene-dense "urban centers" are predominantly
composed of the DNA building blocks G and C.



In contrast, the gene-poor "deserts" are rich in the DNA building blocks A
and T. GC- and AT-rich regions usually can be seen through a microscope
as light and dark bands on chromosomes.



Genes appear to be concentrated in random areas along the genome, with
vast expanses of noncoding DNA between.



Stretches of up to 30,000 C and G bases repeating over and over often
occur adjacent to gene-rich areas, forming a barrier between the genes and
the "junk DNA." These CpG islands are believed to help regulate gene
activity.



Chromosome 1 has the most genes (2968), and the Y chromosome has
the fewest (231).
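
GC-rich and AT-rich regions, and candidate CpG islands, can be located computationally. The sketch below computes GC content in sliding windows and the observed/expected CpG ratio for a sequence. The ratio formula (CpG count times length, divided by C count times G count) is a commonly used measure that is not given in the slides, and all function names and the example sequence are illustrative assumptions.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window, step):
    """GC fraction for each full window of the given size, moved by step."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

example = "GGCCAATT"  # made-up sequence: GC-rich half followed by AT-rich half
print(windowed_gc(example, window=4, step=4))
print(cpg_obs_exp("CGCGCGCG"))
```

Windows with high GC fraction and a high observed/expected ratio would be candidates for the CpG islands described above; real analyses use much longer windows and additional thresholds.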

How the Human Compares with Other Organisms


Unlike the human's seemingly random distribution of gene-rich areas, many other
organisms' genomes are more uniform, with genes evenly spaced throughout.



Humans have on average three times as many kinds of proteins as the fly or worm
because of mRNA transcript "alternative splicing" and chemical modifications to the
proteins. This process can yield different protein products from the same gene.



Humans share most of the same protein families with worms, flies, and plants, but
the number of gene family members has expanded in humans, especially in proteins
involved in development and immunity.



The human genome has a much greater portion (50%) of repeat sequences than the
mustard weed (11%), the worm (7%), and the fly (3%).



Although humans appear to have stopped accumulating repeated DNA over 50
million years ago, there seems to be no such decline in rodents. This may account
for some of the fundamental differences between hominids and rodents, although
gene estimates are similar in these species.

Variations and Mutations



Scientists have identified about 1.4 million locations where single-base DNA
differences (SNPs) occur in humans. This information promises to revolutionize
the processes of finding chromosomal locations for disease-associated sequences
and tracing human history.



The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs females.
Researchers point to several reasons for the higher mutation rate in the male
germline, including the greater number of cell divisions required for sperm
formation than for eggs.
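
To make the idea of a single-base difference concrete, the sketch below lists the positions where two already-aligned sequences of equal length differ. It is a simplification under stated assumptions: real SNP discovery works from many reads per individual with quality scores and alignment gaps, and the function names and example sequences are hypothetical.

```python
def snp_positions(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Assumes both sequences are pre-aligned and of equal length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]

def identity(reference, sample):
    """Fraction of aligned positions at which the two sequences agree."""
    if not reference:
        return 1.0
    return 1.0 - len(snp_positions(reference, sample)) / len(reference)

ref = "ACGTACGT"  # made-up sequences for illustration
alt = "ACGAACGT"
print(snp_positions(ref, alt))
print(identity(ref, alt))
```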

Other Model Organisms

Organism          Genome size   Completion date   Estimated no. of genes

H. influenzae     1.8 Mb        1995              1,740
S. cerevisiae     12.1 Mb       1996              6,034
C. elegans        97 Mb         1998              19,099
A. thaliana       100 Mb        2000              25,000
D. melanogaster   180 Mb        2000              13,061
M. musculus       3000 Mb       -                 -
H. sapiens        3000 Mb       -                 35,000-45,000
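
One way to compare the organisms listed above is gene density, the estimated number of genes per million bases. The sketch below uses only the five organisms for which both a genome size and a gene count are listed, and assumes that "MB" in the table means millions of bases; the dictionary and function names are illustrative.

```python
# (genome size in millions of bases, estimated number of genes), from the table
MODEL_ORGANISMS = {
    "H. influenzae": (1.8, 1740),
    "S. cerevisiae": (12.1, 6034),
    "C. elegans": (97, 19099),
    "A. thaliana": (100, 25000),
    "D. melanogaster": (180, 13061),
}

def genes_per_megabase(size_mb, gene_count):
    """Estimated genes per million bases of genome."""
    return gene_count / size_mb

density = {
    name: genes_per_megabase(size, genes)
    for name, (size, genes) in MODEL_ORGANISMS.items()
}

# Print from densest to sparsest genome.
for name, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<16} {value:8.1f} genes per Mb")
```

In this list, density falls as genome size grows: the bacterium packs the most genes per megabase and the fruit fly the fewest.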

ELSI

Ethical, Legal, and Social Implications of Human Genetics Research

Discrimination in insurance and employment based on genetic
information


When and how new genetic tests should be integrated into
mainstream health care services


Informed consent in genetic research protocols; and


Public and professional education about genetics research and
bioethics.


The Next Step:

Functional Genomics


Transcriptomics (microarray)


Proteomics


Structural genomics


Knockout studies


Comparative genomics


Transcriptomics (microarray) involves large-scale analysis of messenger
RNAs transcribed from active genes to follow when, where, and under what
conditions genes are expressed.



Studying protein expression and function, or proteomics, can bring researchers
closer to what's actually happening in the cell than gene-expression studies. This
capability has applications to drug design.



Structural genomics initiatives are being launched worldwide to generate the 3-D
structures of one or more proteins from each protein family, thus offering clues
to function and biological targets for drug design.



Experimental methods for understanding the function of DNA sequences and the
proteins they encode include knockout studies to inactivate genes in living
organisms and monitor any changes that could reveal their functions.



Comparative genomics, analyzing DNA sequence patterns of humans and
well-studied model organisms side by side, has become one of the most powerful
strategies for identifying human genes and interpreting their function.