Bioinformatics - SJSU Department of Computer Science

Computer Science 286


Fall 2005


© 2005 by Sami Khuri

Bioinformatics Projects


The project will involve research in bioinformatics. You are encouraged to
work on a project related to your interests. A list of suggested topics is
attached to help you get started. You can pick a project from the list,
modify a project to suit your interests, or invent your own. The key idea
is to be creative either in developing a new algorithm or in implementing
an existing one. Results, whether good or bad, should be compared with
those obtained from existing bioinformatics tools or packages.

The project implementation may be developed for any platform. You can
use C, C++, C#, Java, Matlab, or Perl. The World Wide Web contains
many bioinformatics programs as well as their source code. You may use
and modify such code provided appropriate acknowledgements and
citations are made.

A two-page project proposal is due by the beginning of the lecture on
Thursday, October 13, 2005. The proposal should give a clear description
of the project and should contain absolutely no generalities or
definitions. Clearly state what you are planning to do and explain how
you plan to achieve it. Do not forget to specify the programming language
you intend to use. It is important to get started as early as possible.

The projects are due on Tuesday, November 29, 2005. Make sure to
hand in both a technical project description with appendices and
references and a disk or CD containing the source code. The approximate
length of the project report should be 10 pages for programming projects
and 30 pages for non-programming projects. You do not need to include a
hard copy of the source code.


A good project report will include the following:



Background of the problem from the literature search.


A clear definition of the problem.


An explanation and justification of methods of data analysis.


A description and justification of the data sources.


Analysis of the results and comparison with existing tools.


Conclusions based on the results.


Possible directions for future research.


Instructions on how to compile and execute your program, if
applicable.


A full list of references.




List of Suggested Projects


A - Programming Projects



1. DP Pairwise Comparison Algorithm

In this project, you should implement a dynamic programming
algorithm that does pairwise comparison. Your program should allow
the user to either use:



One penalty for gaps, or



Two gap penalties: one for starting a gap and one for extending
a gap.

Your program should also have the following three options:


Local comparison


Global comparison


Semiglobal comparison

The program should allow the user to enter gap penalties.

Compare your program to an existing package.
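
As a starting point for the single-gap-penalty case, the dynamic
programming recurrence with its three options can be sketched as
follows. This is only an illustration under stated assumptions: the
function name alignment_score, the mode strings, and the default scores
(match 1, mismatch -1, gap -2) are choices made for this example, and
the function returns only the score, not the alignment itself. The
two-penalty option (gap opening and gap extension) is not covered.

```python
def alignment_score(a, b, mode="global", match=1, mismatch=-1, gap=-2):
    """Best alignment score of strings a and b with one (linear) gap penalty.

    mode is "global", "local", or "semiglobal" (end gaps are free).
    All names and default scores are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Global comparison charges for leading gaps; the other modes do not.
    if mode == "global":
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if mode == "local":
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)

    if mode == "global":
        return score[-1][-1]
    if mode == "local":
        return best
    # Semiglobal: trailing gaps are free, so take the best value found
    # in the last row or in the last column.
    return max(max(score[-1]), max(row[-1] for row in score))
```

For example, alignment_score("ACGT", "TTACGTTT", mode="semiglobal")
scores the four matches without charging for the unaligned ends of the
longer sequence.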



[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.



2. K-Band DP for Pairwise Comparison

If two sequences are similar, the best alignments have their paths
near the main diagonal. It is not necessary to fill the entire matrix to
compute the optimal score and alignment. A narrow band around the
main diagonal should suffice. In this project, you are to implement a
dynamic programming algorithm that does pairwise comparison and
uses the K-Band procedure [SM97]. Your program should allow the
user to either use:



One penalty for gaps, or



Two gap penalties: one for starting a gap and one for extending
a gap.

Your program should also have three options:


Local comparison


Global comparison


Semiglobal comparison


The program should allow the user to enter gap penalties.

Compare your program to an existing package.
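
The band idea can be illustrated with a short sketch that fills only
the cells (i, j) with |i - j| <= k. It is not the full K-Band procedure
of [SM97]: the rule for enlarging k until the result is provably
optimal is left out, and the function name banded_global_score and the
default scores are assumptions made for this example. The value
returned is the best score among alignments whose path stays inside
the band, so it equals the optimal global score only when k is large
enough.

```python
NEG_INF = float("-inf")


def banded_global_score(a, b, k, match=1, mismatch=-1, gap=-2):
    """Global alignment score restricted to cells with |i - j| <= k.

    Illustrative sketch; names and default scores are assumptions.
    Rows are stored at full length for readability. A real implementation
    would keep only the 2k + 1 band cells of each row.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return NEG_INF  # the final cell (n, m) lies outside the band

    # Row 0: only columns 0..k are inside the band.
    prev = [j * gap if j <= k else NEG_INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [NEG_INF] * (m + 1)
        lo, hi = max(0, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[0] = i * gap
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap       # -inf if the cell above is outside the band
            left = cur[j - 1] + gap  # -inf if the cell to the left is outside
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]
```

Only about (2k + 1) cells are computed per row, so the running time
grows with n times k rather than with n times m.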



[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.


3. Optimal Linear Space DP for Pairwise Comparison

In this project, you are to implement a dynamic programming
algorithm that does pairwise comparison in linear space. The basic
algorithm is of quadratic complexity. With respect to space, it is
possible to improve the complexity from quadratic to linear. The
algorithm is described in Section 3.3.1 “Space Saving” of Section 3.3:
“Extensions of the Basic Algorithms” [SM97]. Your program should
have three options:


Local comparison


Global comparison


Semiglobal comparison

Compare your program to an existing package.
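
The first half of the space-saving idea, computing the optimal score
while keeping only two rows of the matrix, can be sketched as below.
This is a partial illustration only: recovering the alignment itself
in linear space needs an additional divide-and-conquer step, which is
not shown here. The function name and the default scores are
assumptions made for this example.

```python
def linear_space_global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score using two rows of the DP matrix.

    Illustrative sketch; names and default scores are assumptions.
    Only the score is returned, not the alignment.
    """
    if len(b) > len(a):
        a, b = b, a  # index columns by the shorter sequence to save space

    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur  # the older row is no longer needed
    return prev[-1]
```

The running time is still quadratic; only the memory drops to the
length of the shorter sequence.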


[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.


4. The Original BLAST

Basic Local Alignment Search Tool (BLAST) was first published in
1990 [AGM+90]. This project consists in reading, understanding and
implementing the algorithm presented in the article and comparing it
to an existing package.


[AGM+90] Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman,
D. Basic Local Alignment Search Tool. Journal of Molecular Biology,
215: 403-410; 1990.


5. PSI-BLAST

Position Specific Iterated Basic Local Alignment Search Tool
(PSI-BLAST) was first published in 1997 [AMS+97]. This project consists in
reading, understanding and implementing PSI-BLAST as described in
the article and comparing it to an existing package.



[AMS+97] Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z.,
Miller, W. and Lipman, D. Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids
Research, 25, 17: 3389-3402; 1997.


6. FASTA and FASTP
(three)

W. Pearson and D. Lipman developed FASTA, which provides a rapid
way of finding short stretches of similar sequences between a new
sequence and any sequence in a database [PL88]. Pearson continued
to improve the FASTA method for similarity searches in sequence
databases [Pea90], [Pea96]. This project consists in choosing one of
the three FAST algorithms from one of the three referenced articles,
reading, understanding and implementing it as described in the
article and comparing it to an existing package.


[PL88] Pearson, W. and Lipman, D. Improved tools for biological
sequence comparisons. Proc. Natl. Acad. Sci. 85: 2444-2448; 1988.

[Pea90] Pearson, W. Rapid and sensitive sequence comparison with
FASTP and FASTA. Methods Enzymol. 183: 63-98; 1990.

[Pea96] Pearson, W. Effective protein sequence comparison. Methods
Enzymol. 266: 227-258; 1996.


7. CLUSTAL W

CLUSTAL W is a commonly used package for multiple sequence
alignment. It was published in 1994 [THG94]. This project consists in
reading, understanding and implementing the algorithm presented in
the article and comparing it to an existing package.


[THG94] Thompson, J., Higgins, D., and Gibson, T. CLUSTAL W:
Improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and
weight matrix choice. Nucleic Acids Res. 22: 4673-4680; 1994.


8. Multiple Sequence Alignment and Genetic Algorithms

The dynamic programming approach to solving the MSA problem results in
exponential time complexity. Genetic algorithms are of considerable
interest to researchers because they can find high-scoring alignments
as good as those found by other methods.

This project consists in implementing your own genetic algorithm for
the MSA problem and comparing it to SAGA [NH96] or to [ZW97].



[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by
Genetic Algorithm (SAGA). Nucleic Acids Research, vol. 24, 8,
1515-1524; 1996.

[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple
Sequence Alignment. Comput. Appl. Bioscience, 13, 565-581; 1997.

[CW96] Corcoran, A.L., Wainwright, R.L. LIBGA: A User-friendly
workbench for order-based genetic algorithm research; 1996.


9. Multiple Sequence Alignment and Genetic Doping Algorithms

In a Genetic Doping Algorithm, nothing is fixed. Everything, including
the stochastic operators, depends on the context, which varies over
time [Bus04]. Unlike traditional Genetic Algorithms, the probabilities
of crossover and mutation vary from generation to generation. These
numbers are interconnected, both depending on average fitness
values of the population for each generation. It is quite possible that
genetic doping algorithms are more suitable for the multiple sequence
alignment problem than traditional genetic algorithms.

This project consists in implementing your own genetic doping
algorithm for the MSA problem and comparing it to SAGA [NH96] or to
[ZW97].


[Bus04] Buscema, M. Genetic Doping Algorithm (GenD): theory and
applications. Expert Systems, vol. 21, 2, May 2004.

[NH96] Notredame, C. and Higgins, D.G. Sequence Alignment by
Genetic Algorithm (SAGA). Nucleic Acids Research, vol. 24, 8,
1515-1524; 1996.

[ZW97] Zhang, C. and Wong, A. A Genetic Algorithm for Multiple
Sequence Alignment. Comput. Appl. Bioscience, 13, 565-581; 1997.



10. Fragment Assembly
(several)

With current technology it is impossible to directly sequence
contiguous DNA stretches of more than a few hundred bases.
Typically, several copies of a long DNA molecule are cut into random
pieces. Reconstructing the original sequence from these fragments is
called fragment assembly.

This project consists in choosing either the greedy algorithm
described in Section 4.3.4 or one of the heuristics described in
Section 4.4 [SM97]. Alternatively, you may choose a more recent
approach described in [KS99]. Read, understand and implement the
algorithm you chose and compare its performance to an existing
package.
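
To give a feel for the greedy approach, the sketch below repeatedly
merges the two fragments with the largest exact overlap. It is a toy
version under strong assumptions and not the algorithm of [SM97] or
[KS99]: fragments are error-free, all come from the same strand, and
only exact suffix-prefix overlaps are considered. The function names
overlap and greedy_assemble are made up for this example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Greedily merge fragments by largest exact overlap (toy sketch)."""
    # Remove duplicates, then fragments contained in another fragment.
    frags = sorted(set(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while len(frags) > 1:
        best_len, best_i, best_j = -1, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        # Merge the best pair: keep a, append the non-overlapping tail of b.
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""
```

For instance, the fragments ACCGT, CGTGC, TTAC and TACCGT are merged
into TTACCGTGC. Real data contain sequencing errors and fragments from
both strands, so an actual project needs approximate overlaps and
reverse complements.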


[KS99] Kim, S., and Segre, A. AMASS: A Structured Pattern Matching
Approach to Shotgun Sequence Assembly. Journal of Computational
Biology, 6(2), pp. 163-186; 1999.

[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.


10. Physical Mapping of DNA
(several)

Physical mapping is the process of determining the location of certain
markers (landmarks) of a DNA molecule. The markers are generally
small but precisely defined sequences. The resulting maps are used as a
basis for DNA sequencing, and for the isolation and characterization
of individual genes or other DNA regions of interest. For example, see
Figure 5.1 [SM97, page 144].


This project consists in choosing one of the algorithms described in
Sections 5.2, 5.3, 5.4, and 5.5 [SM97], reading it, understanding it,
implementing it and comparing its performance to an existing
package.


[SM97] Setubal, J. and Meidanis J. Introduction to Computational
Molecular Biology. PWS Publishing Company, 1997.


11. Phylogenetic Trees
(several)

The reconstruction of phylogenetic trees is a general problem in
biology. It is used in molecular biology to help understand the
evolutionary relationships among proteins, for example.

This project consists in choosing one of the four algorithms
mentioned below, choosing the appropriate referenced article(s),
reading, understanding and implementing it as described in the
article and comparing it to an existing package.




Phylogenetic Trees Based on Pairwise Distances [FD96]



Phylogenetic Trees Based on Neighbor Joining [SN87]



Phylogenetic Trees Based on Maximum Parsimony [Fel96]



Phylogenetic Trees Based on Maximum Likelihood Estimation
[BT86], [Fel81].



[BT86] Bishop, M. and Thompson, E. Maximum likelihood alignment
of DNA sequences. Journal of Molecular Biology. 190:159-165; 1986.

[FD96] Feng, D. and Doolittle, R. Progressive alignment of amino acid
sequences and construction of phylogenetic trees from them. Methods
Enzymol. 266; 1996.

[Fel81] Felsenstein, J. Evolutionary trees from DNA sequences: A
maximum likelihood approach. Journal of Molecular Evolution.
17:368-376; 1981.

[Fel96] Felsenstein, J. Inferring phylogeny from protein sequences by
parsimony, distance and likelihood methods. Methods Enzymol. 266;
1996.

[SN87] Saitou, N. and Nei, M. The neighbor joining method: a new
method for reconstructing phylogenetic trees. Molecular Biology and
Evolution; 4:406-425; 1987.


12. Gene Prediction
(several)

Gene prediction consists in identifying regions of genomic DNA that
encode proteins.

Some of the existing models that identify and distinguish coding
regions from non-coding regions are based on:


Hidden Markov Model,


Neural Network,


Probabilistic model,


Linear discriminant analysis,


Decision tree classification,


Quadratic discriminant analysis,


Stochastic context free grammars.

This project consists in choosing one of the above techniques and
implementing the prediction (search) algorithm, which will be able to
search a given database for genes that code for proteins.

Your algorithm should be compared to an existing package.


[BK97] Burge, C. and Karlin, S. Prediction of complete gene structures
in human genomic DNA. J. Mol. Biol. 268: 78-94; 1997.

[Kro97] Krogh, A. Two methods for improving performance of an HMM
and their application for gene-finding. Proc. Int. Conf. Intell. Syst.
Mol. Biol. 5: 179-186; 1997.

[Pre95] Prestridge, D. S. Predicting Pol II promoter sequences using
transcription factor binding sites. J. Mol. Biol. 249: 923-932; 1995.


[Rab89] Rabiner, L. R. A tutorial on hidden Markov models and
selected applications in speech recognition. Proc. IEEE, 77, 257-285;
1989.

[RJ86] Rabiner, L. R. and Juang, B. H. “An Introduction to Hidden
Markov Models,” IEEE ASSP Magazine, vol. 3, February 1986.

[Zha97] Zhang, M.Q. Identification of protein coding regions in the
human genome by quadratic discriminant analysis. Proc. Natl. Acad.
Sci. 94: 565-568; 1997.


13. The Protein Prediction Problem
(several)

The main goal in the protein prediction problem is to determine the
three-dimensional structure of a protein based on its amino acid
sequence. Recall that protein structure can be described at three
levels:


The primary structure is the sequence of amino acids in the chain,
i.e., a one-dimensional structure.


The secondary structure is the result of the folding of parts of the
amino acid chain. The two most important secondary structures
are the α-helix and the β-strand.


The tertiary structure is the real 3-dimensional configuration of the
protein under given environmental conditions (solvent, pH and
temperature).

The tertiary structure decides the biochemical function of the
protein. If the tertiary structure is changed, the protein normally
loses its ability to perform whatever function it has, since this
function depends on the geometrical shape of the active site in the
interior of the molecule.

This project consists in choosing an existing algorithm for protein
prediction, implementing it, and comparing it to a package currently
in use.


[CB00] Clote, P., and Backofen, R., Computational Molecular Biology:
An Introduction, John Wiley and Sons, LTD; 2000.


14. The RNA Structure Prediction Problem
(several)

Unlike DNA, which most frequently assumes its well-known
double-helical conformation, the three-dimensional structure of
single-stranded RNA is determined by the sequence of nucleotides in much
the same way that protein structure is determined by sequence. RNA
structure, however, is less complex than protein structure and can be
well characterized by identifying the location of commonly occurring
secondary structure elements [KR03].

This project consists in choosing an existing algorithm for RNA
structure prediction, implementing it, and comparing it to a package
currently in use.


[KR03] Krane, D. and Raymer, M. Fundamental Concepts of
Bioinformatics; 2003.


15. The Clustering Gene Expression Problem
(several)

Analysis of gene expression patterns can provide insight into
relationships between a gene and its function. Clustering techniques
applied to gene expression data partition genes into clusters (groups)
based on their expression patterns. Genes in the same cluster will
have similar expression patterns, while genes in different clusters will
have distinct expression patterns.


This project consists in choosing an existing clustering
algorithm (such as CLICK [SS00]), implementing it, and comparing it
to a package currently in use [BSY99].


[BSY99] Ben-Dor, A., Shamir, R. and Yakhini, Z. Clustering gene
expression patterns. Journal of Computational Biology, 6:281-297;
1999.

[SS00] Sharan, R., and Shamir, R. CLICK: A clustering algorithm with
applications to gene expression analysis. Proceedings of the 8th
International Conference on Intelligent Systems for Molecular Biology,
307-316; 2000.


16. Comparative Genomics
(several)

Comparative genomics is the analysis and comparison of genomes
from different species. The purpose is to gain a better understanding
of how species have evolved and to determine the function of genes
and non-coding regions of the genome.


[MBS+00] C. Mayor, M. Brudno, J. R. Schwartz, A. Poliakov, E. M.
Rubin, K. A. Frazer, L. Pachter, I. Dubchak. VISTA: Visualizing global
DNA sequence alignments of arbitrary length. Bioinformatics, 16:
1046-1047; 2000.

[LOP+02] G. G. Loots, I. Ovcharenko, L. Pachter, I. Dubchak and E.
M. Rubin. Comparative sequence-based approach to high-throughput
discovery of functional regulatory elements. Genome Res., 12:832-839;
2002.


17. Thermostability and Preferential Amino Acid and Codon Usage

Most organisms grow at temperatures from 20 to 50ºC, but some
prokaryotes, including Archaea and Bacteria, are capable of
withstanding higher temperatures, from 60º to over 100ºC. Farias and
Bonato [FB02] investigated the preferential usage of certain amino
acids (AA) and codons in thermally adapted organisms, by
comparative proteome analysis.


This project consists in writing a program that calculates the G+C% of
the genome sequences, computes the average proportion of each AA in
each genome, computes the (E+K)/(Q+H) ratio for each genome, and
computes the codon usage of each AA in each genome.

Perform the analysis of the whole genome sequences mentioned in the
article. Your program should be able to group all non-thermophilic
(mesophilic) genomes into one category and hyperthermophilic and
thermophilic genomes into a second category and to compute the
required statistics.
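
The four statistics are simple counts, and a rough sketch of them may
help in planning the program. The function names below are made up for
this example, the input is assumed to be plain strings (a DNA sequence,
a list of protein sequences, and an in-frame coding sequence), and
characters other than A, C, G and T are ignored when computing G+C%.
Grouping codons by the amino acid they encode would additionally need
a genetic code table, which is not included here.

```python
from collections import Counter


def gc_percent(dna):
    """G+C content as a percentage of the A, C, G, T bases in dna."""
    dna = dna.upper()
    acgt = sum(dna.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (dna.count("G") + dna.count("C")) / acgt


def amino_acid_proportions(proteins):
    """Proportion of each amino acid over a list of protein sequences."""
    counts = Counter()
    for protein in proteins:
        counts.update(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def ek_qh_ratio(proteins):
    """(E + K) / (Q + H) ratio over a list of protein sequences."""
    counts = Counter("".join(proteins).upper())
    denominator = counts["Q"] + counts["H"]
    if denominator == 0:
        return float("inf")
    return (counts["E"] + counts["K"]) / denominator


def codon_usage(cds):
    """Count in-frame codons of a coding sequence (trailing bases dropped)."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

Running these functions once per genome and then averaging within the
two temperature categories gives the comparison asked for above.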



[FB02] “Preferred codons and amino acid couples in
hyperthermophiles” by S.T. Farias and C.M. Bonato. Genome Biology,
2002.



18. Genome signatures in prokaryotic and eukaryotic organisms

Each genome has a characteristic "signature" defined as the ratios
between the observed dinucleotide frequencies and the frequencies
expected if neighbors were chosen at random (dinucleotide relative
abundances). The remarkable fact is that the signature is relatively
constant throughout the genome; i.e., the patterns and levels of
dinucleotide relative abundances of every 50-kb segment of the
genome are about the same. Campbell, Mrazek and Karlin analyzed
the signatures of different genomes in [CMK99].

More precisely, they compute the G+C% of the genome sequences,
the dinucleotide relative abundances of complete genomes,
the genomic signature profiles of organisms and also the
genomic signature difference between pairs of organisms.



This project consists in implementing an algorithm that computes the
G+C% of the genome sequences, the dinucleotide relative abundances
of complete genomes, the genomic signature profiles of organisms and
also the genomic signature difference between pairs of organisms.
Apply your program to prokaryotic and eukaryotic genome sequences
and present your results in a format similar to Figure 1 and Figure 2
(see [CMK99]).
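
A possible starting point is sketched below. It assumes the usual
definition of the dinucleotide relative abundance, the frequency of
the dinucleotide XY divided by the product of the frequencies of X and
Y, with counts pooled over the sequence and its reverse complement,
and it takes the signature difference to be the average absolute
difference over the 16 dinucleotides. Check these definitions against
[CMK99] before relying on them. The function names are made up for
this example, and the two strands are counted separately rather than
concatenated.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def relative_abundances(seq):
    """Dinucleotide relative abundances pooled over both strands."""
    seq = seq.upper()
    reverse_complement = seq.translate(COMPLEMENT)[::-1]

    mono = {base: 0 for base in "ACGT"}
    di = {pair: 0 for pair in DINUCLEOTIDES}
    for strand in (seq, reverse_complement):
        for base in strand:
            if base in mono:
                mono[base] += 1
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair in di:  # skips pairs containing N or other symbols
                di[pair] += 1

    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for pair in DINUCLEOTIDES:
        expected = 0.0
        if n_mono:
            expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        rho[pair] = (di[pair] / n_di) / expected if n_di and expected else 0.0
    return rho


def signature_difference(seq1, seq2):
    """Average absolute difference between two genomic signatures."""
    rho1, rho2 = relative_abundances(seq1), relative_abundances(seq2)
    return sum(abs(rho1[p] - rho2[p]) for p in DINUCLEOTIDES) / 16
```

Applying relative_abundances to successive 50-kb segments of one
genome, and signature_difference to pairs of genomes, gives the raw
numbers behind the kind of figures requested above.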


[CMK99] “Genome signature comparisons among prokaryote, plasmid,
and mitochondrial DNA” by A. Campbell, J. Mrazek, and S. Karlin.
Proc Natl Acad Sci U S A. 1999 August 3; 96(16): 9184-9189.



19. Asymmetric substitution patterns


The analyses of the genomes of three prokaryotes, Escherichia coli,
Bacillus subtilis, and Haemophilus influenzae, by Lobry [Lob96]
revealed a new type of genomic compartmentalization of base
frequencies. There was a departure from intrastrand equifrequency
between A and T or between C and G, showing that the substitution
patterns of the two strands of DNA were asymmetric. The positions of
the boundaries between these compartments were found to coincide
with the origin and terminus of chromosome replication.

Grigoriev [Gri98] developed a method of cumulative diagrams that
shows that the nucleotide composition of a microbial chromosome
changes at two points separated by about a half of its length. These
points also coincide with sites of replication origin and terminus for
all bacteria. The leading strand is found to contain more guanine than
cytosine residues.


This project consists in writing a program that calculates the 3
indices of base frequency using a nonoverlapping moving window of
size 10 kb [Lob96], computes the GC and AT skews as defined in
[Gri98], estimates the positions of the origin and terminus of
replication, and computes (C-G)/(C+G) % and (A-T)/(A+T) % in leading
and lagging coding sequences (see Table 2 in [Lob96]). Apply your
program to the sequences mentioned in both articles and compare the
results. Then, apply the program to other whole genome sequences.
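
The skew computation can be sketched as follows. This is a simplified
illustration, not the exact procedure of [Lob96] or [Gri98]: the skew
of a window is taken to be (G-C)/(G+C), or (A-T)/(A+T) when
pair=("A", "T") is passed, the cumulative skew is the running sum over
nonoverlapping windows, and the origin and terminus are guessed at the
minimum and maximum of the cumulative GC skew, relying on the leading
strand containing more G than C. The function names are made up for
this example, and the circularity of the chromosome is ignored.

```python
def window_skews(seq, window=10_000, pair=("G", "C")):
    """Skew (x - y) / (x + y) in each nonoverlapping window of seq."""
    x, y = pair
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        nx, ny = chunk.count(x), chunk.count(y)
        skews.append((nx - ny) / (nx + ny) if nx + ny else 0.0)
    return skews


def cumulative(values):
    """Running sum of a list of numbers."""
    total, sums = 0.0, []
    for value in values:
        total += value
        sums.append(total)
    return sums


def estimate_origin_terminus(seq, window=10_000):
    """Guess (origin, terminus) positions from the cumulative GC skew.

    The origin is placed at the end of the window where the cumulative
    skew is lowest, the terminus where it is highest.
    """
    sums = cumulative(window_skews(seq, window))
    if not sums:
        return None, None
    lowest = min(range(len(sums)), key=sums.__getitem__)
    highest = max(range(len(sums)), key=sums.__getitem__)
    origin = min((lowest + 1) * window, len(seq))
    terminus = min((highest + 1) * window, len(seq))
    return origin, terminus
```

The resolution of the estimate is one window, so a smaller window
gives a finer position at the cost of noisier skew values.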


[Lob96] “Asymmetric substitution patterns in the two DNA strands of
bacteria” by J.R. Lobry. 1996 May; 13(5):660-665.

[Gri98] “Analyzing genomes with cumulative skew diagrams” by A.
Grigoriev. Nucleic Acids Res. 1998 May 15; 26(10):2286-90.


Visualization Tools for Bioinformatics Algorithms


The main purpose of the projects in this category is to present a
visualization tool to assist students in learning algorithms related to
bioinformatics. One should be able to use the interactive,
user-friendly, educational tool you are to develop in classroom
demonstrations, hands-on laboratories, self-directed work outside of
class, and distance learning. The visualization package will include
background material, a detailed explanation of the algorithm with
examples, and quizzes and exercises. Using Java is highly
recommended.


20. Needleman and Wunsch’s DP Algorithm

Design and implement a visual interactive software package to
demonstrate how Needleman and Wunsch’s dynamic programming
algorithm is applied to solve protein and genomic sequence alignment
problems.


21. CLUSTAL W Algorithm

Design and implement a visual interactive software package to
demonstrate how the CLUSTAL W algorithm is applied to align
multiple sequences.


22. Phylogenetic Tree Construction

Choose a phylogenetic construction algorithm. Design and implement
a visual interactive software package to demonstrate how the
algorithm you selected constructs a phylogenetic tree.



B - Non-Programming Projects



Survey Papers


23. A Comprehensive Survey and Comparison of
Multiple Sequence Alignment Programs

In this project, you are to choose 5 MSA programs. Describe each one
in detail and compare their performances. The comparison should
include several runs with data from various databases.



24. A Comprehensive Survey and Comparison of Protein
Structure Visualization Tools

In this project, you are to choose 8 protein structure visualization
programs. Describe each one in detail and use them to visualize the
structure of different proteins. The project should include screen
dumps.


25. DNA Computing

Write a survey paper on the state of the art in DNA computing,
highlighting new directions.


26. Bayesian statistical methods for sequence alignment and
evolutionary distance estimation

Write a survey paper based on the results published by Agarwal and
States in 1996 [AS96], and Zhu et al. in 1998 [ZLL98]. For further
reading, a Bayesian bioinformatics tutorial by C. Lawrence is available
on the Internet [Law01]. Divide your paper into the following four
parts:

(a) Explain what Bayesian statistics is.

(b) Explain how Bayesian statistical methods can be applied to
sequence analysis.

(c) Explain how Bayesian statistical methods can be applied to
evolutionary distance estimation.

(d) Describe the most common Bayesian sequence alignment
algorithms.


[AS96] Agarwal P. and States D.J., A Bayesian evolutionary distance
for parametrically aligned sequences. Journal of Computational
Biology, vol. 3, pp. 1-17; 1996.

[Law01] Lawrence, C. Bayesian Bioinformatics Page
http://www.wadsworth.org/resnres/bioinfo/

[ZLL98] Zhu J., Liu J.S., and Lawrence C.E. Bayesian adaptive
sequence alignment algorithms. Bioinformatics, vol. 14, pp. 25-39;
1998.


27. Non-Coding DNA

Geneticists have long focused on just the small part of DNA that
contains blueprints for proteins. The remainder (in humans, 98% of
the DNA) was often dismissed as junk. But the discovery of many
hidden genes that work through RNA, rather than protein, has
overturned that assumption [Gib03]. Write a survey paper on that
topic, which would include antisense RNAs, microRNAs and
riboswitches.


[Gib03] Gibbs, W. W. Unseen Genome: gems among the junk.
Scientific American, November 2003, pp. 47-53; 2003.

[Sto02] Storz, G. An Expanding Universe of Noncoding RNAs. Science,
vol. 296, May 17, 2002; pp. 1260-1263; 2002.


28. Stem Cells

Stem cells raise the prospect of regenerating failing body parts and
curing diseases that have so far defied drug-based treatment [LR04].
Embryonic stem cells are derived from the portion of a very early stage
embryo that would eventually give rise to an entire body. Because
embryonic stem cells originate in this primordial stage, they retain the
“pluripotent” ability to form any cell type in the body.

This survey project introduces stem cells and elaborates on their
various potential applications. It answers such questions as why
we cannot simply inject embryonic stem cells into the parts of the
body we wish to regenerate and let them take their cues from
the surrounding environment. More importantly, the survey paper has
to clearly identify areas in stem cell research where bioinformatics
could play (and is playing) an important role.


[LR04] Lanza, R., and Rosenthal, N. The stem cell challenge. Scientific
American, 93-99, June 2004.




Non-Survey Projects


29. PAM and BLOSUM Substitution Matrices

Explain how various PAM or BLOSUM substitution matrices were
constructed. Discuss the differences and similarities between the PAM
and BLOSUM matrices. Describe applications of each of the matrices.


30. Amino Acid Scoring Matrices

The most commonly used substitution scoring matrices are PAM and
BLOSUM. Choose two other amino acid scoring matrices. Describe
how they were constructed. Discuss the differences and similarities
between these matrices. Describe applications of each of the matrices.



31. Test of Markov Model of Evolution in Proteins

In 1985, Wilbur tested the Markov model of evolution and showed
that it may be applicable if certain changes are made in the way the
PAM matrices are calculated [Wil85]. Write a research paper
describing how the tests were done, which conclusions were drawn
from the tests, and the extension of the original paper by George et al.
in 1990 [GBH90].


[GBH90] George, D.G., Barker, W.C., and Hunt L.T., Mutation data
matrix and its uses. Methods Enzymol. Vol. 183, pp. 333-351; 1990.

[Wil85] Wilbur, W.J. On the PAM model of protein evolution.
Molecular Biol. Evol. Vol. 2, pp. 434-447; 1985.


32. Which Genes Make Us Human?

The sequence of the human genome provides a new tool with which to
investigate human origins. It has been known since 1975, through the
work of Mary-Claire King and Allan Wilson, that the genomes of
humans and chimpanzees differ by only 1.3%. This DNA sequence
difference is unusually small for two species so different in anatomy
and behavior (Pra02). This puzzle has sparked intense interest in the
chimpanzee genome, now scheduled to be completely sequenced. A
comparative chimp-human clone map has recently been published
(Fujiyama et al., 2002). SNP mappers have jumped into the question,
reasoning that single nucleotide polymorphisms may hold the key
(Lew02).

However, gene expression studies will be required for any real answer
to the question, as predicted by King and Wilson. Researchers last
year presented the first comparative gene expression studies in
humans and other primates (Ape Genomics). A more comprehensive
analysis, including some proteomic data, shows major differences in
the pattern of brain gene expression between humans and chimps
(Enard et al., 2002). Recently, a very exciting candidate gene has been
identified that appears to be linked with language ability (see speech
gene). In addition, the gene shows statistical evidence of strong
selection during human evolution (Stephens et al., 2001).


The background above describes at least three different approaches to
answering the question of what, genetically speaking, defines our
human species. Based on publicly available bioinformatics tools and
databases, can your group suggest a purely bioinformatics approach?
This project should outline the approach and perform the analysis.
Discuss the results and compare them to those found in the
references. Describe the limitations of this approach (if any).

[Adapted from B. Chapman, 2003].


[Pra02] “Altered gene expression could explain the genetic difference
between human and chimp: Evolutionists Present Their 1.3%
Solution” by L. Pray. The Scientist 16(16): 36-41, 2002.

[Fujiyama et al., 2002] “Construction and Analysis of a
Human-Chimpanzee Comparative Clone Map” by A. Fujiyama and 16
co-authors. Science, vol. 295, January 2002; 131-134.

[Lew02] “SNPs as Windows on Evolution: Recent studies reveal that
the human species is young and genetically uniform” by R. Lewis. The
Scientist 16(1): 16-21, 2002.

[Enard et al., 2002] “Molecular evolution of FOXP2, a gene involved in
speech and language” by W. Enard, M. Przeworski, S.E. Fisher, C.S.L.
Lai, V. Wiebe, T. Kitano, A.P. Monaco and S. Paabo. Nature, vol. 418,
August 2002; 869-872.

[Stephens et al., 2001] “Haplotype Variation and Linkage
Disequilibrium in 313 Human Genes” by J.C. Stephens and 27
co-authors. Science 293: 489-493, 2001.



33. The USP6 Oncogene

Gene duplication is thought to be the major mechanism for the
emergence of novel genes during evolution. Such events are thought
to have occurred at early stages in the vertebrate lineage. Paulding et
al. [PRH03] report that the USP6 oncogene is derived from the fusion
of two other genes: USP32 and TBC1D3.


This project consists in performing the bioinformatics analysis
described in “Materials and Methods” of [PRH03] and reporting the
results.

[PRH03] “The Tre2 (USP6) oncogene is a hominoid-specific gene” by
C.A. Paulding, M. Ruvolo, and D.A. Haber. Proceedings of the National
Academy of Sciences, vol. 100, number 5, 2507-2511, March 4, 2003.


34. The Myosin Gene MYH16 Mutation

Powerful masticatory muscles are found in most primates, including
chimpanzees and gorillas, and were part of a prominent adaptation of
Australopithecus and Paranthropus, extinct genera of the family
Hominidae. In contrast, masticatory muscles are considerably smaller
in both modern and fossil members of Homo. The evolving hominid
masticatory apparatus (traceable to a Late Miocene, chimpanzee-like
morphology) shifted towards a pattern of gracilization nearly
simultaneously with accelerated encephalization in early Homo.
Stedman et al. [Stedman et al., 2004] showed that the gene encoding
the predominant myosin heavy chain (MYH) expressed in these
muscles was inactivated by a frameshifting mutation after the
lineages leading to humans and chimpanzees diverged. Loss of this
protein isoform is associated with marked size reductions in
individual muscle fibres and entire masticatory muscles. Using the
coding sequence for the myosin rod domains as a molecular clock, they
estimated that this mutation appeared approximately 2.4 million years
ago, predating the appearance of modern human body size and
emigration of Homo from Africa. This represents the first proteomic
distinction between humans and chimpanzees that can be correlated
with a traceable anatomic imprint in the fossil record.


This project consists in performing the bioinformatics analysis
described in [Stedman et al., 2004] with the help of [Cur04] and
[Pen04]. The findings of Stedman et al. were challenged a year later by
Perry et al. [Perry et al., 2005]. Your project should address the
second article’s findings and give your own conclusion. In other
words, you should give your own evaluation of the criticisms raised
in [Perry et al., 2005].



[Pen04] “The Primate Bite: Brawn Versus Brain?” by E. Pennisi.
Science vol. 303, page 1957, March 2004.

[Stedman et al., 2004] “Myosin gene mutation correlates with
anatomical changes in the human lineage” by H.H. Stedman and 9
co-authors. Nature vol. 428, pages 415-418, March 2004.

[Cur04] “Muscling in on hominid evolution” by P. Currie. Nature vol.
428, pages 373, March 2004.

[Perry et al., 2005] “Comparative Analyses Reveal a Complex History
of Molecular Evolution for Human MYH16” by G.H. Perry, B.C.
Verrelli, A.C. Stone. Molecular Biology and Evolution, Vol. 22, Number
3, March 2005, pp. 379-382.