Oct 1, 2013

Institute of Bioinformatics


Title

The Study of Memory-Efficient Multiple Sequence Alignment with Constraints
and Its Software Development and Applications

Principal Investigator

Chin Lung Lu

Sponsor

National Science Council

Key words:

MSA for Short, Bioinformatics and Computational Biology


Multiple sequence alignment (MSA for short) is one of the most important
problems in bioinformatics and computational biology. It has many applications,
including the identification of conserved motifs and domains among a set of
sequences, the reconstruction of phylogenetic trees, and structural and functional
prediction. Biologists usually have prior knowledge of their datasets concerning
the structures (such as active site residues, intramolecular disulfide bonds,
substrate binding sites and enzyme activities) and functionalities. However,
several widely used MSA software packages, such as PILEUP, CLUSTAL W and
WORKBENCH, do not make use of this information about known key residues, so
they always generate mismatches among these important residues. Hence, we
recently proposed the concept of constrained sequence alignment, which allows
biologists to input the sequences together with an additional constrained sequence,
each element of which corresponds to a structurally or functionally important
residue (or nucleotide), such that these specified residues (or nucleotides) are
aligned together in the computed alignment. Moreover, we designed a dynamic
programming algorithm for finding an optimal constrained alignment of two
sequences and then used it as a kernel to develop a constrained multiple sequence
alignment tool based on the progressive approach [TLC+02, TLC+03]. Our proposed
algorithm for two sequences runs in $O(\gamma n^4)$ time and consumes $O(n^4)$ space,
where $\gamma$ is the number of constrained residues and $n$ is the maximum length of
the sequences. Later, this result was improved independently by our group [TLY+03]
and by Chin et al. [CHL+03] to $O(\gamma n^2)$ time and $O(\gamma n^2)$ space using the
same dynamic programming approach. These improvements greatly increase the
performance of constrained multiple sequence alignment tools based on the
progressive approach. However, the quadratic memory requirement (which actually
approaches $O(n^3)$ if $\gamma$ is proportional to $n$) still limits the developed tools
to aligning sets of short
sequences. In fact, each of the residues (or nucleotides) in the constrained sequence
may represent a conserved site of a protein (or RNA) family, and each conserved site
may consist of a short fragment of amino acids (or nucleotides) instead of a single
amino acid (or nucleotide). In other words, the input constraint is a set of strings,
each of which represents a fragment of amino acids or nucleotides, instead of a
sequence. Hence, in this proposal, we shall concentrate on the study of this kind of
constrained sequence alignment, and hope to design and implement a
memory-efficient algorithm for aligning multiple sequences with user-specified
constraints without increasing the time complexity too much.
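
To make the pairwise case concrete, the following is a minimal sketch of a constrained global alignment computed with a three-dimensional dynamic programming table, in the spirit of the quadratic-time formulation mentioned above. It is not the implementation of the tools cited in this proposal: the function name, the scoring scheme (match, mismatch and linear gap scores) and the use of single-character constraints are all assumptions made for illustration.

```python
NEG_INF = float("-inf")


def constrained_alignment_score(a, b, constraint, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b in which every character
    of `constraint` appears, in order, in a column where a and b both carry it.

    Hypothetical helper: the scoring parameters are illustrative assumptions.
    Returns -inf when the constraint cannot be satisfied.
    """
    n, m, g = len(a), len(b), len(constraint)
    # table[k][i][j]: best score for a[:i] vs b[:j] with the first k
    # constraint characters already satisfied.
    table = [[[NEG_INF] * (m + 1) for _ in range(n + 1)] for _ in range(g + 1)]
    table[0][0][0] = 0
    for k in range(g + 1):
        for i in range(n + 1):
            for j in range(m + 1):
                if i == 0 and j == 0:
                    continue  # base cell already initialised
                best = NEG_INF
                if i > 0:  # a[i-1] aligned with a gap
                    best = max(best, table[k][i - 1][j] + gap)
                if j > 0:  # b[j-1] aligned with a gap
                    best = max(best, table[k][i][j - 1] + gap)
                if i > 0 and j > 0:
                    s = match if a[i - 1] == b[j - 1] else mismatch
                    # ordinary column, no constraint consumed
                    best = max(best, table[k][i - 1][j - 1] + s)
                    # column that satisfies the k-th constraint character
                    if k > 0 and a[i - 1] == b[j - 1] == constraint[k - 1]:
                        best = max(best, table[k - 1][i - 1][j - 1] + s)
                table[k][i][j] = best
    return table[g][n][m]


if __name__ == "__main__":
    print(constrained_alignment_score("GA", "AG", ""))   # unconstrained
    print(constrained_alignment_score("GA", "AG", "G"))  # G must align with G
```

The table in this sketch holds (γ+1)(n+1)(m+1) entries, which is where the memory pressure for long sequences comes from; a memory-efficient variant would have to avoid storing the whole table.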

NSC93-2213-E009-113(93R375)

-------------------------------------------------------------------------------------------------------


Title

Establishment of an Integrated Platform for Deciphering the Alternative
Splicing Mechanisms and Detecting Conserved Motifs in Human Genome

Principal Investigator

Hsien Da Huang

Sponsor

National Science Council

Keywords:

Detecting Conserved Motifs, Establishment of an Integrated Platform


In eukaryotes, a gene may generate multiple proteins. Alternative splicing of
pre-mRNA plays an important role in generating multiple isoforms. Deciphering the
mechanisms of alternative splicing plays an important role in gene expression
research. Previous research suggests that identifying conserved sequences in
exon/intron regions associated with alternative splicing has become a main
interest of biologists. In this work, we propose an integrated approach to
automatically identify the conserved sequences in selected exon/intron regions of a
gene group. First,
the alternative splicing database, namely ProSplicer, is constructed from several
biological databases covering nucleotides, genes, and proteins. Second, multiple
types of alternative splicing information, i.e., exon skipping, 5’ alternative
splicing sites, 3’ alternative splicing sites, and so on, are derived. Finally, for
each type of alternative splicing, the flanking intron sequences are collected and
then passed to motif discovery approaches to detect conserved motifs related to
alternative splicing, called AS-Motifs. The system is also capable of statistically
detecting AS-Motif co-occurrence. Tissue-specific information and gene
functionalities are also taken into account. The
contribution of this work is mainly to establish an integrated platform for predicting
conserved sequences in exon/intron regions for specific types of AS information, as
well as for detecting AS-Motif co-occurrence. The system facilitates deciphering
alternative splicing mechanisms by considering tissue-specific and
function-specific information of genes.
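
As an illustration of what a statistical test for motif co-occurrence could look like, the sketch below counts how many flanking sequences contain each of two motifs and computes a one-sided hypergeometric tail probability for the observed overlap. The abstract does not say which test the system uses, so the hypergeometric test is an assumption, as are the function names and the simplification of matching motifs as exact substrings.

```python
from math import comb


def count_motif_hits(sequences, motif_a, motif_b):
    """Return (total, with_a, with_b, with_both) for exact substring matches.

    Hypothetical helper: real motifs are usually matched with degenerate
    patterns or weight matrices rather than exact substrings.
    """
    with_a = sum(motif_a in s for s in sequences)
    with_b = sum(motif_b in s for s in sequences)
    with_both = sum(motif_a in s and motif_b in s for s in sequences)
    return len(sequences), with_a, with_b, with_both


def cooccurrence_p_value(total, with_a, with_b, with_both):
    """Probability of at least `with_both` sequences carrying both motifs if
    the two motifs were distributed independently (hypergeometric upper tail).
    """
    upper = min(with_a, with_b)
    tail = sum(
        comb(with_a, k) * comb(total - with_a, with_b - k)
        for k in range(with_both, upper + 1)
    )
    return tail / comb(total, with_b)


if __name__ == "__main__":
    # Toy sequences, made up purely for illustration.
    introns = ["ACGTTT", "ACG", "TTT", "GGG"]
    counts = count_motif_hits(introns, "ACG", "TTT")
    print(counts, cooccurrence_p_value(*counts))
```

A small p-value would suggest that the two motifs appear together more often than expected by chance; in practice the result would also need correction for testing many motif pairs.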

NSC93-2213-E009-075(93R364)

-------------------------------------------------------------------------------------------------------