


Conference Proceedings




2004 Nebraska EPSCoR State Conference on

Bioinformatics and Biomedical Computing



Held in conjunction with infotec2004


Wednesday April 21, 2004

Qwest Convention Center

455 North 10th Street

Omaha, NE 68102




CONFERENCE PROCEEDINGS

Organizing Institutions

Research Talks

1 Eugene V. Koonin, Keynote Speaker
2 Lyle Middendorf
3 Vadim Gladyshev
4 Dan Monaghan
5 Bruce Chase
6 Stephen Scott
7 Hong Jiang
8 Alex Nicoll
9 Steven Hinrichs

Panel: Bioinformatics Education in Nebraska

Organizing Committee

Poster Presentations

A Genetic Algorithm for Simplifying Amino Acid Alphabets and Predicting Protein-Protein Interactions
A Hidden Markov Model for Gene Functional Prediction
Analysis of core promoter motifs in genes that are expressed in pancreas
BioExtract Server Metadata Mapping - Creating a Federated Biological Database
Biomedical Computing Tools for Collaborative Research in HIV/AIDS
Comparative Analysis of Gene Prediction Methods and Development of a Fungal Genome Database System
Dichotomy Analysis of Proteomics and Genomics data
DNA-Computing
Evolution of 3-isopropylmalate dehydrogenase
Evolution of the SET domain: bacterial pathogens, symbionts, and horizontal gene transfer
Genome-Wide Identification of Thiol/Disulfide Oxidoreductases
High-Throughput Computational and Experimental Biology Strategies in Identifying Tumor Expressing CAMs
Identification of microorganisms at the species level by comparing strings derived from their DNA sequences
Identifying Splice Variants Through EST Assembly
Mining Mitochondrial Single Nucleotide Polymorphisms (SNPs) associated with human population evolution and genetics diseases
Mining Principal Components in Very Large Gene Expression Profiles
Molecular Dynamic Investigation Of The -Turn Forming Nature Of Tetrapeptides
Molecular evolutionary analysis of the SET domain protein families in fungal genomes
On Clustering Biological Data Using Message Passing
Ontology Specific Data Mining Based on Dynamic Grammars
Partition Coding and Its Application to Analysis of Complex Disease Data.
Peptide Sorter: A web-based, custom database homology searching program for sorting of de novo sequenced peptides from tandem mass spectrometry.
Phylogenetic reconstruction methods for highly diverged protein families
Prediction of amphipathic helices using statistical analysis
Promoter analysis from microarray results to characterize gene regulation by MUC1 signaling in pancreatic tumor cell lines.
Federated QTool: Multidatabase Queries Simplified
Ranking Differentially-Expressed Genes in Microarray Data
SPV: A Similar Parikh Vector Search Algorithm for Protein Sequences
Taxol Analogues - Predicting Antitumor Activities with Neural Network
Usage of multivariate methods in the analysis of protein sequences
Using enhancing signals to improve specificity of Ab initio splice site sensors



ORGANIZING INSTITUTIONS

Nebraska EPSCoR

EPSCoR is an acronym for the Experimental Program to Stimulate Competitive Research that
was initiated by the National Science Foundation (NSF) in 1980 to address concerns of the
U.S. Congress regarding the distribution of federal funds supporting research and
development. Nebraska EPSCoR is a statewide organization established to pursue research
grant opportunities of the federal agency EPSCoR programs.


Nebraska Informatics Center for the Life Sciences

The Nebraska Informatics Center for the Life Sciences (NICLS) facilitates the integration of
the biocomputing/informatics disciplines with the life sciences and coordinates cross-campus
and state-wide efforts in bioinformatics, chemoinformatics, pharmacoinformatics,
computational chemistry, and computational biology.


Nebraska Biomedical Research Infrastructure Network

The Nebraska Biomedical Research Infrastructure Network (BRIN) project is designed to
enhance the competitiveness of biomedical research in Nebraska by developing the human
and technological resources essential for cutting edge research in functional genomics. The
foundation of the project is collaboration between seven undergraduate institutions, two
community colleges and the three Ph.D.-granting institutions located throughout the State.


UNMC Eppley Cancer Center

The mission of the UNMC Eppley Cancer Center is to coordinate basic research and clinical
cancer research, patient care and educational programs and to facilitate application of new
knowledge about the etiology, diagnosis, treatment and prevention of cancer and to improve
health and quality of life.



RESEARCH TALKS

1

Eugene V. Koonin, Keynote Speaker

National Center for Biotechnology Information (NCBI), National Library of Medicine
(NLM), National Institutes of Health (NIH), Bethesda MD 20892-6510


Title of Talk

Evolution of eukaryotic gene repertoire and gene structure: insights from comparative
genomics


Abstract

A comprehensive evolutionary classification of genes is a must for making sense of genome
sequences. By comparing the protein sequences encoded in 7 completely sequenced
eukaryotic genomes, 6162 clusters of probable orthologs (euKaryotic Orthologous Groups, or
KOGs), which include between 50 and 80% of all gene products of the respective organisms,
were identified. By combining the most likely topology of the eukaryotic crown group
phylogenetic tree and the phyletic patterns of the KOGs, the most parsimonious scenario of
eukaryotic genome evolution and the minimal ancestral gene sets for ancestral eukaryotic
forms were reconstructed. The reconstructed gene set of the last common ancestor of the
eukaryotic crown group consists of 3365 KOGs and is substantially enriched in proteins
involved in information processing and central metabolism; the reconstructed gene set for the
last common ancestor of animals includes 4898 KOGs, many of which are implicated in
signal transduction. In an attempt to reveal the major trends in the evolution of eukaryotic
gene structure, intron positions were compared for 684 KOGs from 8 complete genomes of
animals, plants, fungi, and protists, and parsimonious scenarios were constructed for evolution
of exon-intron structure for the respective genes. Remarkable conservation of intron position
through >1.5 billion years of evolution was revealed, with one third of the introns in the
malaria parasite Plasmodium falciparum shared with at least one crown-group eukaryote.
Paradoxically, humans share many more introns with the plant Arabidopsis thaliana than with
fly or nematode. The evolutionary scenario inferred from this data holds that the common
ancestor of Plasmodium and the crown group and especially the common ancestor of animals,
plants and fungi had numerous introns. Most of these ancestral introns, which are retained in
the genomes of vertebrates and plants, have been lost in fungi, nematodes and arthropods, and
probably Plasmodium. Comparison of various features of ancient and younger introns starts
shedding light on probable mechanisms of intron insertion. A strong positive correlation was
noticed between the loss and gain of genes and introns in different eukaryotic lineages,
pointing to the existence of distinct, lineage-specific trends toward genome shrinkage or
expansion.
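
To illustrate how ancestral gene sets can be read off phyletic patterns and a species tree, here is a minimal Python sketch. It assumes Dollo parsimony (each gene is gained once and may be lost many times), which is one common choice and is not stated in the abstract. The tree, the taxon names, and the group identifiers are all hypothetical.

```python
# Toy rooted tree as child -> parent; all taxon and node names are hypothetical.
PARENT = {
    "animal_A": "animals", "animal_B": "animals",
    "animals": "crown", "fungus": "crown", "plant": "crown",
    "crown": None,
}

def ancestors(node):
    """Return the path from `node` up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def dollo_presence(present_leaves):
    """Nodes where a gene is inferred present under Dollo parsimony.

    The gene is gained once, at the last common ancestor (LCA) of all leaves
    that carry it, and is present on every path from that LCA down to a
    carrying leaf; absences elsewhere are explained as losses.
    """
    paths = [ancestors(leaf) for leaf in present_leaves]
    shared = set(paths[0]).intersection(*map(set, paths[1:]))
    lca = next(n for n in paths[0] if n in shared)  # first shared node going up
    present = set()
    for path in paths:
        present.update(path[: path.index(lca) + 1])
    return present

# Hypothetical phyletic patterns: group id -> genomes containing a member.
PATTERNS = {
    "KOG_x": {"animal_A", "animal_B"},
    "KOG_y": {"animal_A", "plant"},
    "KOG_z": {"fungus"},
}

def ancestral_gene_set(node):
    """Groups inferred to be present at `node`."""
    return {k for k, leaves in PATTERNS.items() if node in dollo_presence(leaves)}

print(sorted(ancestral_gene_set("animals")))
print(sorted(ancestral_gene_set("crown")))
```

In this toy setting a group seen only in animals is assigned to the animal ancestor, while a group shared by an animal and a plant is pushed back to the crown-group ancestor.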


POWERPOINT PRESENTATION


Contact Information

Phone: (301) 435-5913

Fax: (301) 435-7794

koonin@ncbi.nlm.nih.gov

http://www.ncbi.nlm.nih.gov/CBBresearch/Koonin/


2

Lyle Middendorf

Sr. Vice President of Research & Development and CTO

LI-COR Biosciences

4308 Progressive Avenue

PO Box 4000

Lincoln, NE 68504


Title of Talk

Moore’s Law of Genomic Information


Abstract

Bioinformatics progresses through stages of increasing complexity that deliver data,
information, knowledge, and wisdom. Mining of genome sequence data leads to the
biological interpretation of that data giving rise to a genomics information set, which, when
combined with other information sets (e.g. proteomics; cell signaling), provides a systems
biology knowledge base that has the potential to deliver the wisdom associated with
predictive and preventative healthcare. These stages of progression require interfaces that
must match both technologically and economically in order to achieve successful
implementation of the bioinformatics value chain. Within each stage, fundamental principles
which constrain the evolution of complexity can be assessed by a “Moore’s Law” metric. To
illustrate these constraints, an evaluation of the prospects for achieving a $1000 genome will
be presented from both a throughput and a cost per base perspective.
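
The cost-per-base perspective can be made concrete with a little arithmetic. The Python sketch below is illustrative only: the starting cost, the halving time, and the function name are assumptions chosen for the example, not figures from the talk.

```python
import math

def years_to_target(start_cost, target_cost, halving_time_years):
    """Years until a cost that halves every `halving_time_years` reaches the target.

    Assumes a pure exponential ("Moore's Law"-style) decline:
        cost(t) = start_cost * 2 ** (-t / halving_time_years)
    """
    if start_cost <= target_cost:
        return 0.0
    halvings = math.log2(start_cost / target_cost)
    return halvings * halving_time_years

# Hypothetical inputs: a genome costing $1,024,000 today and a cost that
# halves every 1.5 years. Neither number comes from the talk.
print(years_to_target(1_024_000, 1_000, 1.5))
```

Under these assumed inputs the target is ten halvings away, so the answer scales directly with the assumed halving time.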


POWERPOINT PRESENTATION NOT AVAILABLE


Information about LI-COR

LI-COR Biosciences is a leader in the design and manufacture of instrument systems for plant
biology, biotechnology, and environmental research. LI-COR instruments for photosynthesis,
carbon dioxide analysis, and light measurement are recognized world-wide for standard-setting
innovation in plant science research and environmental monitoring. The company
pioneered the development of infrared fluorescence labeling and detection systems for
imaging, DNA sequencing, genotyping, and AFLP® for genomic research and discovery.
Founded in 1971, the privately held company is based in Lincoln, Nebraska, with subsidiaries
near Frankfurt, Germany and in Cambridge, UK. LI-COR systems are used in over 100
countries and are supported by a global network of distributors.
http://bio.licor.com/CompInfo.htm



Contact Information


lylem@licor.com



Phone: (402) 467-0700

Fax: (402) 467-0819

http://www.licor.com (corporate homepage)

3

Vadim Gladyshev

Associate Professor, Department of Biochemistry

University of Nebraska, Lincoln NE 68588-0664


Title of Talk

How Selenium Has Altered Our Understanding of the Genetic Code


Abstract not available.


Dr. Gladyshev’s research interests are: identity and functions of selenocysteine-containing
proteins; mechanism of cancer prevention by selenium; bioinformatics; and redox regulation
of cellular processes.


POWERPOINT PRESENTATION NOT AVAILABLE




Contact Information

E-mail: vgladyshev1@unl.edu

Phone: (402) 472-4948

Fax: (402) 472-7842

http://www.unl.edu/biochem/gladyshev


4

Dan Monaghan

Professor, Department of Pharmaceutical Sciences

University of Nebraska Medical Center, Omaha NE 68198-6260


Title of Talk

Pharmacophore Informatics


Abstract not available.

Dr. Monaghan’s research interests are: pharmacoinformatics; modeling receptor-drug
interactions; and development of subtype-selective N-methyl-D-aspartate (NMDA) receptor
antagonists.


POWERPOINT PRESENTATION


Contact Information

E-mail: dtmonagh@unmc.edu

Phone: (402) 559-7196

Fax: (402) 559-7495

http://www.unmc.edu/Pharmacology/faculty/monaghan.html






5

Bruce Chase

Department of Biology

University of Nebraska at Omaha; Omaha, NE 68182


Title of Talk

Bioinformatic and Microarray Strategies to Identify Peripheral Biomarkers for Parkinson's
Disease


Abstract

Parkinson's disease is a chronic, progressive neurodegenerative disorder of unknown etiology
that is genetically and clinically heterogeneous. One research challenge is to identify
biomarkers to aid in clinical categorization and to serve in directing more optimized
therapeutic regimens. To address this challenge, we are employing transcriptional profiling
using microarray analyses to identify constellations of genes whose expression patterns serve
as biomarkers and molecular signatures for Parkinson's disease. To minimize invasiveness,
we are using RNA templates isolated from freshly drawn blood or lymphoblastoid cell lines.
To limit some aspects of the genetic and phenotypic variation, we are initially focusing on
analyzing expression profiles in individuals with parkinsonism who have genetically distinct
familial forms of Parkinson's disease. We will then compare these analyses with analyses in a
cohort of individuals with sporadic Parkinson's disease, and relate gene expression profiles to
the age of symptom onset and the severity of disease symptoms at the time of blood draw. We
anticipate that these analyses will allow us to identify a constellation of genes whose
expression pattern will serve as a biomarker for Parkinson's disease, and that they will allow
for the identification of stage-specific and disease-type specific biomarkers.


POWERPOINT PRESENTATION


Contact Information

E-mail: bchase@mail.unomaha.edu

Phone: (402) 554-2586

Fax: (402) 554-3532

http://www.unomaha.edu/~wwwbio/chase.html





6

Stephen Scott

Department of Computer Science and Engineering

University of Nebraska-Lincoln, Lincoln, NE 68588-0115


Title of Talk

Machine Learning in Bioinformatics


Abstract

Building machines that learn from experience is an important research goal of artificial
intelligence (AI). The field of machine learning is a subarea of AI that is concerned with the
question of how to construct computer programs that automatically improve with experience.
In recent years many successful machine learning applications have been developed,
including data mining programs that learn to detect fraudulent credit card transactions,
information-filtering systems that learn users' reading preferences, and numerous approaches
to biological sequence analysis, phylogenetic inference, and other applications in
bioinformatics. We will introduce some of the fundamental concepts in machine learning and
give an overview of various applications of machine learning in bioinformatics.


POWERPOINT PRESENTATION


Contact Information

E-mail: sscott@cse.unl.edu

Phone: (402) 472-6994

Fax: (402) 472-7767

http://www.cse.unl.edu/~sscott


7

Hong Jiang

Department of Computer Science and Engineering,

University of Nebraska-Lincoln; Lincoln, NE 68588-0115


Title of Talk

A Case Study of Parallel I/O for Biological Sequence Search on Linux Clusters


Abstract

In this paper we analyze the I/O access patterns of a widely-used biological sequence search
tool and implement two variations that employ parallel-I/O for data access based on PVFS
(Parallel Virtual File System) and CEFT-PVFS (Cost-Effective Fault-Tolerant PVFS).
Experiments show that the two variations outperform the original tool when equal or even
fewer storage devices are used in the former. It is also found that although the performance of
the two variations improves consistently when initially increasing the number of servers, this
performance gain from parallel I/O becomes insignificant with further increase in server
number.



We examine the effectiveness of two read performance optimization techniques in CEFT-PVFS
by using this tool as a benchmark. Performance results indicate: (1) Doubling the
degree of parallelism boosts the read performance to approach that of PVFS; (2) Skipping
hot-spots can substantially improve the I/O performance when the load on data servers is
highly imbalanced. The I/O resource contention due to the sharing of server nodes by multiple
applications in a cluster has been shown to degrade the performance of the original tool and
the variation based on PVFS by factors of up to 10 and 21, respectively, whereas the variation
based on CEFT-PVFS suffered only a two-fold performance degradation.


POWERPOINT PRESENTATION


Contact Information

E-mail: jiang@cse.unl.edu

Phone: (402) 472-6747

Fax: (402) 472-7767

http://cse.unl.edu/~jiang/


8

Alex Nicoll

Associate Director for Technology

Nebraska University Consortium on Information Assurance

College of Information Science and Technology

University of Nebraska at Omaha, Omaha NE 68182


Title of Talk

User Friendly Cluster Computing


Abstract


Computing clusters have become more and more prevalent in high performance computing
centers across the globe. However, the users of these clusters are more often scientific
researchers with little or no computing experience than expert programmers. Therefore it is
important to ensure that a cluster computing resource is accessible to all users, not just the
experts. This talk will give an overview of how we at UNO have addressed that need,
and a roadmap for future architecture improvements. Technical implementation details will be
included.


POWERPOINT PRESENTATION


Contact Information

E-mail: anicoll@unomaha.edu

Phone: (402) 554-2060



9

Steven Hinrichs

Director, University of Nebraska Center for Biosecurity

University of Nebraska Medical Center, Omaha, NE 68198-6495


Title of Talk

Challenges and Opportunities in Bioinformatics and Homeland Security


Abstract

The threats posed by bioterrorism raise many new challenges to the US and its citizens. The
fields of information technology and bioinformatics, from data exchange engines to
computational analysis of DNA sequences, have important roles to play in meeting these
challenges. This presentation will describe the opportunities presented by achieving a greater
level of data exchange between both federal and state agencies with responsibility for
emergency preparedness and by developing collaborative research programs between
biologists and computer scientists. The dual use application of information technology
solutions for not only bioterrorism preparedness but also public health in general will be
discussed, including the ability to detect the outbreak of new infectious illnesses and
computational approaches to the rapid identification of biological materials of unknown
origin.


POWERPOINT PRESENTATION


Contact Information

E-mail: shinrich@unmc.edu

Phone: (402) 559-4116

Fax: (402) 559-4077




PANEL: BIOINFORMATICS EDUCATION IN NEBRASKA


Hesham Ali, UNO, Moderator


University of Nebraska at Omaha



New Proposed Undergraduate Program in Bioinformatics at UNO

Presented by Hesham Ali



Graduate Degrees in Bioinformatics - Through the MS in CS and the Ph.D. in IT Programs at UNO

Presented by Hesham Ali


POWERPOINT PRESENTATION


Contact Information for Hesham Ali

E-mail: hesham@unomaha.edu

Phone: (402) 554-3623

Fax: (402) 554-3284

http://www.cs.unomaha.edu/fac-staff/hali.html


University of Nebraska Medical Center



The Nebraska Biomedical Infrastructure Network



Presented by William Chaney, UNMC


POWERPOINT PRESENTATION


Contact Information for William Chaney

E-mail: wchaney@unmc.edu

Phone: (402) 559-6657

Fax: (402) 559-6650

http://www.unmc.edu/Biochemistry/faculty/chaney.html




Bioinformatics Specialty Track, Department of Pathology-Microbiology (UNMC), in
Conjunction with the College of Information Science and Technology (UNO)


Presented by Donald Johnson, UNMC


POWERPOINT PRESENTATION


Contact Information for Donald Johnson

E-mail: drjohnso@unmc.edu

Phone: (402) 559-4038

Fax: (402) 559-4077

http://www.unmc.edu/Pathology/facbios/Johnsonbio.htm


University of Nebraska at Lincoln



Proposed Interdisciplinary Bioinformatics Specialization at UNL

Presented by Andrew Benson and Etsuko Moriyama, UNL



Interdisciplinary Bioinformatics and Biological Modeling Graduate Recruitment
Program at UNL

Presented by Andrew Benson and Etsuko Moriyama, UNL



Bioinformatics Education at UNL

Presented by Andrew Benson and Etsuko Moriyama, UNL

POWERPOINT PRESENTATION



Contact Information for Andrew Benson

E-mail: abenson1@unl.edu

Phone: (402) 472-5637

Fax: (402) 472-1693

http://foodsci.unl.edu/homepage/faculty/benson.htm


Contact Information for Etsuko Moriyama

E-mail: emoriyama2@unl.edu

Phone: (402) 472-4979

Fax: (402) 472-3139

http://psiweb.unl.edu/fac8.html





ORGANIZING COMMITTEE

Dr. Hesham Ali, Associate Dean of Academic Affairs

College of Information Science and Technology

University of Nebraska at Omaha, Omaha, NE 68182-0116

E-mail: hesham@unomaha.edu

Phone: (402) 554-3623

Fax: (402) 554-3284

http://www.cs.unomaha.edu/fac-staff/hali.html


Dr. Gleb Haynatzki, Assistant Professor

Department of Biomedical Sciences

Creighton University, Omaha, NE

E-mail: gleb@creighton.edu

Phone: (402) 280-4560



Dr. Dan Moser, Associate Director

Learning Environment and Internet Services, ITS

University of Nebraska Medical Center, Omaha NE 68198-5030

E-mail: dmoser@unmc.edu

Phone: (402) 559-5684

Fax: (402) 559-5579



Dr. Richard Murphy, Chair, Department of Biomedical Sciences

Creighton University, Omaha, NE 68178

E-mail: barrym@creighton.edu

Phone: (402) 280-2918

Fax: (402) 280-2690

http://www.biomedsci.creighton.edu/faculty/murphy.html


Dr. David Swanson, Research Assistant Professor

Department of Computer Science and Engineering

University of Nebraska-Lincoln, Lincoln, NE 68588-0115

E-mail: dswanson@rcfinfo.unl.edu

Phone: (402) 472-5006

Fax: (402) 472-1718

http://rcf.unl.edu/~swanson


Dr. Simon Sherman, Director of NICLS

Professor, Eppley Cancer Center

University of Nebraska Medical Center, Omaha NE 68198-6805

E-mail: ssherm@unmc.edu

Phone: (402) 559-4497

Fax: (402) 559-4651

http://www.unmc.edu/Eppley/faculty/f_sherm.html


Dr. James Turpen, Director

Administrative Core for Nebraska BRIN

University of Nebraska Medical Center

Omaha, NE 68198-6395

E-mail: jturpen@unmc.edu

Phone: (402) 559-4388







POSTER PRESENTATIONS

Poster presenters are underlined


A Genetic Algorithm for Simplifying Amino Acid Alphabets and Predicting
Protein-Protein Interactions

Matt Palensky and Hesham Ali

Dept of Computer Science; College of Information Science and Technology

University of Nebraska at Omaha, Omaha NE 68182-0116

mpalensky@mail.unomaha.edu, hesham@unomaha.edu


A central problem in creating simplified amino acid alphabets is narrowing down the massive
number of possible simplifications. Since considering all possible simplifications is
intractable, effectively using heuristics is essential. Genetic algorithms have been effective in
providing near-optimal solutions for similar combinatorial problems with large solution
spaces. Simplifying amino acid alphabets may potentially reduce the degree of complexity
for several difficult problems. In this project, we study the impact of reducing the alphabet in
addressing an important open problem in microbiology, which is predicting protein-protein
interactions. Various techniques for predicting protein-protein interactions exist, but no single
method can effectively predict more than a small subset of interactions. Hence, a
comprehensive listing of all of a cell's protein-protein interactions may require many
complementary approaches. Simplified amino acid alphabets could uncover hidden
relationships in protein sequences, and in turn provide a valuable first step in solving
protein-related microbiological problems. In this research, we employ a new genetic algorithm to
simplify amino acid alphabets and show the impact of reducing the alphabet in predicting
protein interactions.
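
As a rough illustration of why exhaustive search is impractical and how a genetic algorithm can explore groupings instead, consider the Python sketch below. It is not the authors' algorithm. The encoding (one group label per amino acid), the operators, the parameter values, and especially the toy fitness function and its sequences are assumptions made up for this example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def bell_number(n):
    """Number of ways to partition a set of n items (via the Bell triangle)."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

def simplify(seq, grouping):
    """Rewrite a protein sequence using one group label per residue."""
    return "".join(grouping[aa] for aa in seq)

def random_grouping(k, rng):
    """Assign each amino acid to one of k groups labelled 'a', 'b', ..."""
    return {aa: chr(ord("a") + rng.randrange(k)) for aa in AMINO_ACIDS}

def mutate(grouping, k, rng):
    """Move one randomly chosen amino acid to a random group."""
    child = dict(grouping)
    child[rng.choice(AMINO_ACIDS)] = chr(ord("a") + rng.randrange(k))
    return child

def crossover(a, b, rng):
    """Uniform crossover: each amino acid inherits its group from either parent."""
    return {aa: (a if rng.random() < 0.5 else b)[aa] for aa in AMINO_ACIDS}

def evolve(fitness, k=4, pop_size=30, generations=50, seed=0):
    """Keep the best half each generation and refill with mutated offspring."""
    rng = random.Random(seed)
    pop = [random_grouping(k, rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = [
            mutate(crossover(rng.choice(survivors), rng.choice(survivors), rng), k, rng)
            for _ in range(pop_size - len(survivors))
        ]
        pop = survivors + children
    return max(pop, key=fitness)

# Placeholder fitness, purely illustrative: reward groupings under which a
# hypothetical "related" pair looks alike and an "unrelated" pair looks different.
RELATED = ("ACDKLV", "GCEKIV")
UNRELATED = ("ACDKLV", "WYFHPQ")

def matches(pair, grouping):
    x, y = (simplify(s, grouping) for s in pair)
    return sum(c1 == c2 for c1, c2 in zip(x, y))

def toy_fitness(grouping):
    return matches(RELATED, grouping) - matches(UNRELATED, grouping)

print(bell_number(20))  # number of distinct groupings of the 20 amino acids
best = evolve(toy_fitness)
print(simplify("ACDKLV", best), toy_fitness(best))
```

A real fitness function would score how well the simplified alphabet supports interaction prediction on labelled data; the placeholder here only keeps the example self-contained.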

A Hidden Markov Model for Gene Functional Prediction

Xutao Deng, Hesham Ali

Dept of Computer Science

University of Nebraska at Omaha, Omaha, NE 68131

xdeng@mail.unomaha.edu, hesham@unomaha.edu


The prediction of the functional class of genes or open reading frames (ORFs) is important for
understanding the role of unknown genes and gene networks. Currently, the best accuracy of
the prediction provided by available computational approaches is around 30%. In this project,
we develop a gene functional prediction tool based on Hidden Markov Models (HMMs). The
training data are solely time-series gene expression data from yeast experiments. Because
time-series expression data have the Markov property and HMMs have shown great success in
modeling sequential data sets in the area of speech recognition, we expect the prediction
accuracy to be higher than that of other data mining tools such as Support Vector Machines
(SVMs) and decision trees. Preliminary results showed that HMMs can be elegantly applied to
gene expression data sets and achieve better performance than SVMs. Currently, we are integrating
HMMs into Dynamic Bayesian Networks (DBNs) for functional prediction of genes.
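
One standard way to classify with HMMs is to train one model per class and assign a profile to the class whose model gives it the highest likelihood, computed with the forward algorithm. The Python sketch below shows that idea under stated assumptions: the abstract does not describe the authors' model structure, so the discretization of expression changes into three symbols, the two class names, and every probability are invented for illustration.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM via the forward algorithm.

    start[i]    : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[i][o]  : probability that state i emits symbol o
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)

# Two hypothetical two-state models over discretized expression changes:
# 0 = down, 1 = flat, 2 = up. All probabilities are made up for illustration.
MODELS = {
    "class_rising": dict(
        start=[0.5, 0.5],
        trans=[[0.8, 0.2], [0.2, 0.8]],
        emit=[[0.1, 0.2, 0.7], [0.2, 0.6, 0.2]],
    ),
    "class_falling": dict(
        start=[0.5, 0.5],
        trans=[[0.8, 0.2], [0.2, 0.8]],
        emit=[[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]],
    ),
}

def classify(obs):
    """Assign the profile to the class whose HMM gives it the highest likelihood."""
    return max(MODELS, key=lambda name: forward_likelihood(obs, **MODELS[name]))

print(classify([2, 2, 1, 2]))
print(classify([0, 0, 1, 0]))
```

In practice the model parameters would be estimated from training profiles (for example with Baum-Welch) rather than written by hand.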


Analysis of core promoter motifs in genes that are expressed in pancreas


Winfried-Paul Schuller, Claudia Kappen and J. Michael Salbaum

Department of Genetics, Cell Biology and Anatomy, and Munroe-Meyer Institute, University
of Nebraska Medical Center, Omaha, NE 68198-5455

wschuller@unmc.edu, ckappen@unmc.edu, msalbaum@unmc.edu


The expression of genes in specific cells and tissues is regulated by the promoter of each
gene. The best known element of the core promoter is the TATA box. This element is located
upstream of the transcription start site (-25 to -32), and conforms more or less strictly to the
TATAA sequence motif. A second core promoter element is the initiator that overlaps the
start site of transcription, with the sequence Py-Py-A-N-T/A-Py-Py, where A is the start
site of transcription.
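
The two consensus motifs described above translate directly into regular expressions. The Python sketch below is only an illustration of such a scan: the function names and the example fragment are hypothetical, and a real analysis would also restrict matches by their position relative to the annotated start site.

```python
import re

# Initiator consensus from the text, Py-Py-A-N-T/A-Py-Py, as a regular
# expression: Py = C or T, N = any base. The lookahead allows overlapping hits.
INR = re.compile(r"(?=([CT][CT]A[ACGT][TA][CT][CT]))")
TATA = re.compile(r"TATAA")

def find_initiators(seq):
    """Return (index of the central A, matched 7-mer) for each initiator hit."""
    return [(m.start() + 2, m.group(1)) for m in INR.finditer(seq.upper())]

def has_tata(seq):
    """True if the TATAA motif occurs anywhere in the sequence."""
    return TATA.search(seq.upper()) is not None

# Hypothetical promoter fragment, invented for illustration.
fragment = "GGTATAAAGGCGCGCCTCATTCTGG"
print(has_tata(fragment))
print(find_initiators(fragment))
```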


The major interest in our laboratory is to understand which DNA sequences regulate
cell-type-specific expression of genes in various tissues and over time. In this study, we analyzed
the features and composition of promoters associated with genes that range in their expression
from highly specific for one or few tissues to broad or ubiquitous distribution. Data on tissue
distribution patterns of expression are available for many genes from microarray assays, and
relative specificity of each gene was classified on the basis of Shannon entropy as the
information measure.
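
A minimal Python sketch of an entropy-based specificity measure follows. The abstract does not give the exact formulation, so normalizing each gene's expression across tissues into relative shares is an assumption, and the expression values are hypothetical.

```python
import math

def expression_entropy(levels):
    """Shannon entropy (bits) of a gene's expression distribution across tissues.

    `levels` are non-negative expression values, one per tissue. They are
    normalized to relative shares p_t, and H = -sum(p_t * log2(p_t)).
    Low H means expression is concentrated in few tissues (high specificity);
    H = log2(number of tissues) means perfectly uniform expression.
    """
    total = sum(levels)
    if total <= 0:
        raise ValueError("gene has no measurable expression")
    shares = [x / total for x in levels if x > 0]
    return sum(-p * math.log2(p) for p in shares)

# Hypothetical expression values across four tissues.
print(expression_entropy([100, 0, 0, 0]))    # expressed in a single tissue
print(expression_entropy([25, 25, 25, 25]))  # uniform across tissues
```

Genes can then be ranked or binned by entropy, with the lowest values marking the most tissue-specific genes.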


Our analyses of promoter composition indicate that promoters of human genes with high
specificity for expression in pancreas preferentially contain the TATA box motif, with or
without initiator. As cell-type specificity decreases, the fraction of genes with TATA
motifs in their core promoter decreases. Conversely, initiator motifs are more prevalent in
widely expressed genes.


This pattern of differential promoter composition was confirmed for the corresponding mouse
orthologous genes. Thus, tissue-specific and ubiquitous genes appear to be regulated by
different core promoter elements. The relevance of our results for mechanisms of gene
regulation in various tissues will be discussed.


BioExtract Server Metadata Mapping - Creating a Federated Biological Database

Xingming Du

Department of Computer Science, University of South Dakota; Vermillion, SD 57069

xdu01@usd.edu




The rapid growth of biology research has resulted in an explosion of bioinformatics data
(DNA sequences, gene expression data) and databases. This generation of data and databases
has promoted biology research even further. However, the fact that these databases are
distributed has hindered biology research, creating a demand for mapping across
federated biological databases. A federated database refers to a set of disparate databases
which are viewed by researchers as one database. Federated biological databases represent an
extraordinarily diverse collection. They are complicated by their complex data types and
further complicated by the kinds of interpretation supported by the databases. BioExtract
Server Metadata mapping is one technique used to map search fields semantically to those in
federated biological databases. BioExtract Server queries the federated biological databases
semantically, extracts the related data from those databases and presents the data to
researchers, who can get the results in one step. BioExtract Server provides flexible web-based
query capabilities for researchers through the implementation of a relational meta-database.
It also supports system administration functionality for integration of new federated
databases via a web browser.
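
The idea of mapping one search field onto the differently named fields of several member databases can be sketched in a few lines of Python. This is not the BioExtract Server implementation, which the abstract describes as a relational meta-database. The database names, field names, and function below are all hypothetical.

```python
# Hypothetical metadata map: one user-facing search field -> the field name
# each member database uses for the same concept. Database and field names
# are invented for illustration.
FIELD_MAP = {
    "organism": {"db_one": "ORGANISM", "db_two": "species_name"},
    "gene_name": {"db_one": "GENE", "db_two": "symbol"},
}

def translate_query(criteria):
    """Rewrite {search_field: value} into one native query dict per database."""
    per_db = {}
    for field, value in criteria.items():
        for db, native_field in FIELD_MAP[field].items():
            per_db.setdefault(db, {})[native_field] = value
    return per_db

print(translate_query({"organism": "Neurospora crassa", "gene_name": "geneX"}))
```

Keeping the map as data rather than code is what allows a new database to be added by an administrator without changing the query logic.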


Biomedical Computing Tools for Collaborative Research in HIV/AIDS

Haizhen Zhu, Oleg Shats, Dmitry Shats, Kishore Machiraju, Marsha Ketcham, Simon Sherman

Nebraska Informatics Center for the Life Sciences

Eppley Institute for Research in Cancer and Allied Diseases;

University of Nebraska Medical Center; Omaha NE 68198-6805

hzhu@unmc.edu, oshats@unmc.edu, dshats@unmc.edu, kmachiraju@unmc.edu,
mketcham@unmc.edu, ssherm@unmc.edu


The long-term goal of this project is to create an expert system for HIV/AIDS research by
using the power of computer and information sciences. The expert system will combine
expertise in epidemiology, infectious diseases, neurosciences, biology, early detection and
patient care. The system will allow clinicians and researchers to collect HIV/AIDS-related
data in a convenient and efficient way, transform the data into statistical information, and use it
in statistical models to predict the risk of AIDS development as well as estimate the survival
rates of HIV/AIDS patients.


Comparative Analysis of Gene Prediction Methods and Development of a
Fungal Genome Database System

Skanth Ganesan (1); Steven Harris (2); Etsuko N. Moriyama (3)

1) Department of Computer Science; University of Nebraska-Lincoln

2) Department of Plant Pathology; Plant Science Initiative; University of Nebraska-Lincoln

3) School of Biological Sciences; Plant Science Initiative; University of Nebraska-Lincoln

skanth@unlserve.unl.edu;
sharri1@unl.edu; emoriyama2@unl.edu


Fungi, plants, and animals are three major kingdoms of eukaryotic organisms. A vast
number of fungi are filamentous and have enormous health, economic, and ecological impact.
As part of the Fungal Genome Initiative, the complete genome sequences of several
filamentous fungi have recently become available. Multiple gene prediction programs are
being used to address the problem of identifying coding regions within these genomes.
Despite several limitations, existing methods of gene prediction and models of gene structure
are often applied to newly sequenced organisms for which no model or method has yet been
tuned. Our objective is to analyze the available gene mining methods by assessing their
prediction performance as well as their use of varied genomic information. We are developing
an integrated genome database system that will facilitate the genome annotation of three
filamentous fungi: Neurospora crassa, Aspergillus nidulans, and Fusarium graminearum.


Dichotomy Analysis of Proteomics and Genomics data

Marina Sapir and Simon Sherman

Eppley Institute for Research in Cancer and Allied Diseases

University of Nebraska Medical Center; Omaha NE 68198-6805

marina@sapir.us, ssherm@unmc.edu


We introduce an intuitive integrated approach for the analysis of genomics and proteomics
data. The approach is based on a certain basic dichotomy of each feature. We use this
dichotomy to evaluate the classification ability of the feature and to build an elementary
classifier. A simple voting procedure aggregates these independent classifiers into the final
decision rule. The proposed dichotomy test can be used to evaluate the statistical
significance of the correlation between the feature and the class attribute. Applying the
dichotomy analysis to the Leukemia and Ovarian Cancer datasets, we were able to find several
features with strong classification abilities. The resulting classification rules, built with
very few features, are comparable in prognostic accuracy to much more computationally
intensive procedures applied to the same datasets.


DNA-Computing

Vladimir Ufimtsev and Vyacheslav Rykov

Department of Mathematics

University of Nebraska at Omaha, Omaha, NE 68132

vufimtsev@mail.unomaha.edu, vrykov@mail.unomaha.edu


Molecular computing is a field that focuses on manipulating single molecules for
computational purposes. One of the most powerful molecules found for these purposes is
deoxyribonucleic acid (DNA). Through biomolecular computing, the extraordinary parallelism
occurring in nature can be uncovered and used to our advantage. Great parallelism at
nanoscales has been discovered to be inherent in natural phenomena, and we can now
realistically imagine this power being used to solve computational problems. The formulation
of revolutionary algorithms in biomolecules would present a very effective alternative for
meeting the growing demand for computational power in our world. This paper focuses on the
sticker model for DNA computing. Existing algorithms for NP-Complete problems have been
adapted, and new methods and operations are proposed for computations using the sticker
model.




Evolution of 3-isopropylmalate dehydrogenase

Philip M. Terry and Hideaki Moriyama

Department of Chemistry; University of Nebraska-Lincoln, Lincoln, NE

pterry2@unl.edu, hmoriyama2@unl.edu


In excess of 150 protein sequences for a family of decarboxylating dehydrogenases, which
includes those for 3-isopropylmalate, isocitrate, and tartrate, are now available for the
study of evolutionary, sequence, structure, and function relationships among species. Among
them, 3-isopropylmalate dehydrogenase (IPMDH) is well studied biophysically and
biochemically. To analyze sequence variation in IPMDH among the available sequences, we
created multiple sequence alignments (MSA), using as input a set of BLASTP hits
(E value < e-14) obtained with an IPMDH as query. Gaps and substitutions in columns of the MSA
are being compared with available structures from the PDB to validate the alignment of
sequences in the MSA. We project biochemical knowledge of IPMDH onto the MSA to validate the
alignments.

Evolution of the SET domain: bacterial pathogens, symbionts, and
horizontal gene transfer

Raúl Alvarez-Venegas (1), Alexander Tikhonov (2), Etsuko Moriyama (1), and Zoya Avramova (1)

1) School of Biological Sciences, UNL Lincoln NE 68588;

2) Protometrix, Inc. Branford, CT 06405

ralvarez-venegas2@unlnotes.unl.edu, info@protometrix.com, emoriyama2@unl.edu,
zavramov@unlserve.unl.edu


Horizontal (or lateral) gene transfer (HGT) can occur between distantly related species. This
phenomenon is considered a major force in organismal evolution. However, questions still
surround the mechanisms and validity of HGT. Phylogenetic analysis is the best currently
available method for establishing incidences of ancient HGT. Here, we report phylogenetic
analyses of SET-domain containing proteins in prokaryotes and eukaryotes. The SET domain
has been defined as a highly conserved peptide (~130 amino acids) found in epigenetic
regulators. Biochemically, the SET peptide carries lysine methylating activity that targets
specific lysine residues in the tails of the nucleosomal histones. Because chromatin and
histones are signature features of eukaryotes, it has been assumed that SET genes are found
only in eukaryotes. SET-domain coding genes were reported in some bacteria, but their
initial identification only in parasitic and symbiotic species was assumed to represent
transfer from a eukaryote to a prokaryote. Comprehensive analysis of ~150 fully sequenced
bacterial and archaebacterial genomes identified ~30 prokaryotic species (pathogenic,
symbiotic, and free-living) that carry SET-domain coding genes. Even closely related species
within the same family can differ by the presence/absence of SET genes. These data seemed
to favor HGT. Further analysis, however, revealed SET-gene paralogs in bacteria.
Phylogenetic analysis of prokaryotic and eukaryotic SET genes revealed a surprising picture
indicating that the SET domain probably has a common ancestor. Therefore, the prokaryotic
gene(s) did not come from horizontal gene transfer between the eukaryotic and prokaryotic
domains of life. However, there are cases of apparent SET-gene HGT between prokaryotic
species, such as the SET genes in Bacillus and Methanosarcina. Finally, we show that in
bacteria, a peptide downstream of the SET peptide (named the post-SET domain in
eukaryotes) has co-evolved with the SET domain to perform bacterial gene-specific
functions.


Genome-Wide Identification of Thiol/Disulfide Oxidoreductases

Dmitri E Fomenko, Stephen Scott, and Vadim N Gladyshev

Department of Biochemistry

University of Nebraska-Lincoln, Lincoln, NE 68588

dfomenko@genomics.unl.edu, sscott@cse.unl.edu, vgladyshev1@unl.edu


Thiol-dependent redox regulation is an important but poorly characterized biological process
that is involved in oxidative stress defense, signal transduction, protein folding, and
regulation of protein activity. Thiol-dependent redox processes are catalyzed by structurally
distinct families of enzymes, thiol/disulfide oxidoreductases, which are difficult to identify
with available protein function prediction programs. The CxxC motif (two cysteines separated
by two residues) is the motif most often present in thiol/disulfide oxidoreductases. We found
that a motif in which one of the cysteines in CxxC is replaced with serine or threonine is
also suitable for a catalytic redox function. We show that conserved Cxx(C|S|T) and
(C|S|T)xxC sequences (x is any amino acid) present in the context of a simple secondary
structure pattern may be used as a predictor of redox function.
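
The sequence part of the Cxx(C|S|T) and (C|S|T)xxC patterns can be illustrated with a regular expression. The following is only a minimal sketch, not the authors' program: the function name and the toy sequence are hypothetical, and the secondary structure context that the abstract requires is not modeled here.

```python
import re

# "." stands for "x" (any residue). The lookahead lets overlapping
# candidates be reported, e.g. two CxxC motifs sharing a cysteine.
REDOX_MOTIF = re.compile(r"(?=(C..[CST]|[CST]..C))")

def find_redox_motifs(sequence):
    """Return (0-based position, 4-residue match) pairs for candidate motifs."""
    return [(m.start(), m.group(1)) for m in REDOX_MOTIF.finditer(sequence.upper())]

# Hypothetical toy sequence, not a real protein.
print(find_redox_motifs("MKWCGPCKAASTPYCLL"))  # [(3, 'CGPC'), (11, 'TPYC')]
```

A sequence-only scan like this would be expected to return many false positives, which is presumably why the abstract combines the motif with a structural pattern.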


High-Throughput Computational and Experimental Biology Strategies in
Identifying Tumor Expressing CAMs

Anguraj Sadanandam, Michelle L. Varney, and Rakesh K. Singh

Department of Pathology and Microbiology

University of Nebraska Medical Center, Omaha, NE

asadanandam@unmc.edu, mvarney@unmc.edu, rsingh@unmc.edu


Cross-talk between cell adhesion molecules (CAMs) on cancer cells and specific host
microenvironment cells is critical for tumor invasion and metastasis. Identifying
peptidomimetics that bind membrane receptors, seemingly on vascular endothelial cells of
specific organs, is significant for organ-selective targeting or blocking. Eleven unique
peptides that can bind specifically to lung, liver, bone marrow, or brain were identified by
in vivo selection using a phage display peptide library in NOD-SCID mice. These
organ-specific peptides are seven amino acids in length, and they are the critical binding
residues involved in CAM-specific protein interactions. We have developed a high-throughput
strategy based on the mouse genome and proteome to identify known CAMs containing these
peptides in their extracellular regions. The strategy involves three overlapping methods
comprising nucleotide/protein sequence, annotation, and mRNA expression based database
searches. These searches were done using the peptides as queries against different databases,
including a Local Mouse Cell Adhesion Molecule (LMCAM) sequence database developed in our lab.
The resultant proteins were analyzed using a filtering algorithm that selected approximately
30 known CAMs, including a family of proteins called semaphorins. Experimental analysis of
SEMA5A mRNA expression in human pancreatic cancer cell lines showed expression in lines
originating from metastatic tumors, but not in those from primary tumors. The results are
promising, as suggested by examination of public microarray and SAGE expression databases and
protein structure information. Therefore, a number of new CAMs can be identified with these
combined computational and experimental methodologies as an initial approach, thereby paving
the way for a more complete understanding of various disease processes and making specific
targeting possible using these peptidomimetics.


Identification of microorganisms at the species level by comparing strings
derived from their DNA sequences

Dhundy R Bastola (1), Peter C Iwen (1), Steven H Hinrichs (1) and Khalid Sayood*

1) Department of Pathology and Microbiology, University of Nebraska Medical Center, Omaha
NE 68198; *Department of Electrical Engineering, University of Nebraska-Lincoln, Lincoln,
NE 68588-0511


A new approach to evaluating the relatedness of DNA sequences that eliminates the requirement
to align sequences prior to analysis has recently been described and termed the Relative
Complexity Measure (RCM). The first step in the RCM method yields a “dictionary”
composed of “strings” derived from the sequence being analyzed. In this study, the dictionary
of strings derived by the RCM algorithm from the 18S rDNA and cytochrome b gene
sequences was used to evaluate the feasibility of identifying microorganisms based on the
similarity of strings present in their respective dictionaries. The 18S rDNA and cytochrome b
gene sequences from multiple strains of the following organisms were obtained from
GenBank and evaluated: Candida albicans, Candida glabrata, Candida parapsilosis, and
Candida kruisii, together with single strains of Candida dubliniensis, Candida lusitaniae,
Candida tropicalis, and Malassezia furfur. Using the RCM algorithm, a unique dictionary was
created for each species. The dictionaries were then compared using a second algorithm called
RCM-C, which extracted the “common strings” and “unique strings” and calculated the membership
of strings for the dictionaries (membership = number of common strings / (number of common
+ unique strings)). The membership values reflected the degree of similarity between the two
dictionaries. Using this method, we compared the dictionaries obtained from the cytochrome b
and 18S rDNA gene sequences separately for seven Candida species and M. furfur. In
addition, the dictionaries obtained from cytochrome b and 18S rDNA sequences for each
species were combined and queried with the dictionaries obtained from either the cytochrome
b or 18S rDNA sequence. The results showed that the RCM-C approach correctly
differentiated these microorganisms at the species level when either one of the target
sequences was used for the query. Combining the dictionaries from two different target
sequences did not alter the ability of this approach to identify microorganisms at the species
level. These results demonstrate that comparing strings derived from multiple target DNA
sequences using the RCM and RCM-C algorithms identified fungal organisms at the species
level, and that this approach was a dependable alternative to pairwise sequence comparison.
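
The membership calculation can be sketched in a few lines. This is an illustration only, not the RCM-C implementation: the dictionaries are taken as given sets of strings (how RCM derives them is not shown), the function name is hypothetical, and "unique strings" is assumed to mean strings found in exactly one of the two dictionaries.

```python
def membership(dict_a, dict_b):
    """Membership = common / (common + unique) for two string dictionaries.

    Assumption: "unique" strings are those present in only one dictionary.
    """
    a, b = set(dict_a), set(dict_b)
    common = a & b   # strings shared by both dictionaries
    unique = a ^ b   # strings present in exactly one dictionary
    total = len(common) + len(unique)
    return len(common) / total if total else 0.0

# Hypothetical toy dictionaries, not derived from real sequences.
print(membership({"AC", "G", "TT"}, {"AC", "G", "CA"}))  # 0.5
```

Under this assumption the value ranges from 0 (no shared strings) to 1 (identical dictionaries), so a query dictionary would be assigned to the species whose dictionary gives the highest membership.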

Identifying Splice Variants Through EST Assembly

Yi-feng Li and Hesham Ali

Department of Computer Science; College of Information Science and Technology;

University of Nebraska at Omaha, Omaha, NE 68182-0116

yl1@unmc.edu; hesham@unomaha.edu


Alternative splicing has recently emerged as a major mechanism for increasing protein
diversity. To further explore its functional roles and regulatory mechanisms, it is
essential to identify different splice forms from available resources. The Expressed Sequence
Tag (EST) database, which contains a broad sample of mRNA, provides an ideal source of
hints on different splicing patterns. Furthermore, the UniGene system at NCBI has partitioned
EST sequences into a non-redundant set of gene-oriented clusters. In this project, a program
tailored for EST assembly is developed to reconstruct each EST cluster into contigs that
correspond to different transcripts. After assembly, the reconstructed transcripts are aligned
with the parent genomic DNA to reveal possible splicing patterns. This assembly approach
significantly facilitates the discovery of splice variants from EST data.


Mining Mitochondrial Single Nucleotide Polymorphisms (SNPs) associated
with human population evolution and genetic diseases

Chenguang Wang (1); Guoqing Lu (2)

1) Department of Statistics, University of Nebraska-Lincoln

2) Center for Biotechnology, School of Biological Sciences; University of Nebraska-Lincoln

cgwang@bigred.unl.edu, glu3@unlnotes.unl.edu


Studies of mitochondrial genetic diseases and mitochondrial DNA (mtDNA) intraspecies
diversity are key topics in population genetics and medicine. Most mtDNA variations within
and among populations are single base variants, known as single nucleotide polymorphisms
(SNPs). SNPs, although an abundant form of mitochondrial genome variation, have not
been systematically studied in the field of human molecular evolution and genetic diseases.
This research uses the mitochondrial genome as a model to study molecular evolution and
disease-associated SNPs in humans. For this purpose, a bioinformatics tool consolidating mt
SNP information from various public repositories and the literature is being developed. We
will present the preliminary findings on mitochondrial SNPs potentially associated with human
population evolution and genetic diseases.


Mining Principal Components in Very Large Gene Expression Profiles

Li Xiao; Simon Sherman

Eppley Institute for Research in Cancer and Allied Diseases

University of Nebraska Medical Center, Omaha, NE 68198-6805

lxiao@unmc.edu, ssherm@unmc.edu


Microarray is a technique for monitoring the expression of thousands of genes simultaneously.
The gene expression profiles in a microarray experiment often form a huge multi-dimensional
dataset. Principal Component Analysis can present the variance structure of a set of
variables through a few new variables, which are linear combinations of the original ones.
The computational cost of obtaining the principal components of a large multidimensional
dataset is very high. In this work, a method to efficiently mine the principal components in
very large gene expression profiles is proposed. The silhouette validation technique was
applied to optimize the k value in k-means clustering of the gene expression profiles. The
dimensions of the data set are reduced by using the average values within each suitable class
of genes instead of the individual values of each gene. It was shown that for very large
multi-dimensional gene expression profiles, the principal components could be calculated in a
very reasonable computational time scale.
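
The reduction step, in which each class of genes is replaced by its average profile before the principal components are computed, can be sketched as follows. This is a hedged illustration rather than the authors' code: the function name is hypothetical, and the cluster labels are assumed to come from a k-means run whose k was already chosen (neither k-means nor silhouette validation is implemented here).

```python
def cluster_means(profiles, labels):
    """Average the expression profiles of all genes that share a cluster label.

    profiles: list of gene rows, one expression value per sample.
    labels:   cluster id for each gene (assumed to come from k-means).
    Returns a dict mapping cluster id -> mean profile.
    """
    sums, counts = {}, {}
    for row, label in zip(profiles, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

# Hypothetical toy data: three genes measured in two samples, two clusters.
reduced = cluster_means([[1, 2], [3, 4], [10, 20]], [0, 0, 1])
print(reduced)  # {0: [2.0, 3.0], 1: [10.0, 20.0]}
```

Principal Component Analysis would then be run on the k mean profiles rather than on thousands of individual genes, which is where the reduction in computational cost would come from.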


Molecular Dynamic Investigation Of The β-Turn Forming Nature Of
Tetrapeptides

Attila Borics, Sandor Lovas, and Richard Murphy

Department of Biomedical Sciences

Creighton University; Omaha, NE 68178

aborics@bif12.creighton.edu, slovas@bif1.creighton.edu, barrym@creighton.edu


Since small peptides with turn structures are highly flexible, their characterization by either
NMR or UV-CD spectroscopy is usually difficult and yields only time-averaged spectra with
contributions from each structure type present. NPGQ, GKDG, DDKG, DEKS, VPaH, and
VPsH were previously characterized as β-turns or turn-forming cores of longer peptides by
one or more of the above methods. Therefore, in this study 25 ns molecular dynamics
simulations of these structures were performed. The DSSP method and clustering were used to
analyze trajectories. DSSP analysis of trajectories showed a fluctuation between β-turn and
unordered structure for all sequences, although it failed to recognize bend structures because
of the insufficient peptide chain length.

Molecular evolutionary analysis of the SET domain protein families in
fungal genomes

Chendhore Veerappan (1); Zoya Avramova (2); Etsuko N. Moriyama (3)

1) Department of Computer Science; University of Nebraska-Lincoln

2) School of Biological Sciences; University of Nebraska-Lincoln

3) School of Biological Sciences; Plant Science Initiative; University of Nebraska-Lincoln

chendhorev@hotmail.com, zavramov@unlserv.unl.edu, emoriyama2@unl.edu


The SET domain is a motif of approximately 130 amino acids identified in plants, animals, and
yeast, and is considered to be associated with eukaryotic functions. SET-domain proteins both
activate and repress gene transcription mechanisms. Proteins in different families contain
unique sets of other domains that are not shared between families. In order to elucidate the
evolutionary relationships and distributions of this protein family across eukaryotes, we are
conducting large-scale searches of various fungal genomic databases as well as those of
protozoans and other eukaryotes. Our results indicate that some SET-domain protein groups are
unique to filamentous fungal species. Phylogenetic analysis shows that these proteins can be
classified based on the internal architectures of their SET domain sequences.

On Clustering Biological Data Using Message Passing

Huimin Geng*; Dhundy Bastola†; Hesham Ali*

*Department of Computer Science, College of Information Science and Technology,
University of Nebraska at Omaha, Omaha, NE 68182-0116

†Department of Pathology and Microbiology, University of Nebraska Medical Center,
Omaha, NE 68198-6495

hgeng@mail.unomaha.edu, dbastola@unmc.edu, hesham@unomaha.edu


Clustering algorithms are frequently used in many areas of bioinformatics to classify
biological data, for example in the analysis of gene expression and in the building of
phylogenetic trees. In this study, we propose a new clustering algorithm that employs the
concept of message passing. Message Passing Clustering (MPC) allows data elements to
communicate with each other and produces clusters by intrinsic processes, and hence simulates
human intelligence. We used 35 simulated data sets of dynamic gene expression typical of
microarray experiments to evaluate the proposed method. In our experiments, a 95% hit rate was
achieved, with 639 of the 674 genes correctly clustered. We have also applied MPC to
real data sets to build phylogenetic trees. The results show higher classification
accuracies than those of traditional clustering methods.

Ontology Specific Data Mining Based on Dynamic Grammars

Daniel Quest; Hesham Ali

Dept of Computer Science; College of Information Science and Technology; University of
Nebraska at Omaha, Omaha, NE 68182-0116

daniel_quest@cox.net, hesham@unomaha.edu


In this project, we introduce a new formal approach for mining biological databases. The
proposed grammar-based approach provides a flexible and powerful tool for advanced
sequence comparison and data mining. The approach benefits from the power of regular
expressions in allowing bioinformatics researchers to use advanced queries when comparing
sequences and searching for motifs in biological databases. A common hypothesis is that
biological sequences contain elements or functional units that determine the interactions of
the molecule. These elements may not be detectable by a homology search using simple
alignment tools because of the interference and noise produced by mutations in the
evolutionary process. However, these consensus subsequences or expressions are the key to
the functionality of the sequence or to understanding the relationship between the sequence
and other biological units. In this paper, we introduce a formal grammar and a corresponding
data mining engine capable of extracting records.


Partition Coding and Its Application to Analysis of Complex Disease Data.

Arkadii D'yachkov; Vyacheslav Rykov; David Torney; Vladimir Ufimtsev; Sergey Yekhanin

Department of Mathematics

University of Nebraska at Omaha; Omaha, NE 68132

dyachkov@mech.math.msu.su, vrykov@mail.unomaha.edu, dct@lanl.gov,
vufimtsev@mail.unomaha.edu, yekhanin@mit.edu


The rapidly advancing field of biology has endowed us with an extravagant amount of new
knowledge. From such knowledge we develop a better understanding of the functionality of
the human being. Genetics has enabled us to detect a class of diseases known as complex
diseases. At present, the diagnosis of complex diseases such as ADHD is a problem that
is still being studied. The mathematical tools that we discuss will aid the analysis of complex
disease data. These methods present new implications of the partitioning of data sequences.
We define a new concept of distance (based on the Hamming distance) between two distinct
sets of unordered partitions, which we call a partition-distance. We can verify that this
distance is a valid metric in the space of unordered partitions of any finite set S of size n,
where each partition contains at most q disjoint subsets of S. Using the distinct partitions
of a set S, endowed with the proposed metric, we investigate a new class of codes, which we
call q-partition codes.


Peptide Sorter: A web-based, custom database homology searching
program for sorting of de novo sequenced peptides from tandem mass
spectrometry.

Ingrid Jordon-Thaden*, Ben Birdsey, The Nguyen (1), Guoqing Lu (1).

School of Biological Sciences*, UNL Bioinformatics Core Facility (1), Beadle Center,
University of Nebraska-Lincoln, 68588-0666


A method for accurate and efficient sorting of the massive amount of data resulting from
de novo peptide sequencing in proteomic studies is proposed with Peptide Sorter. Sequences
obtained from de novo peptides from tandem mass spectrometry can be used to search by
homology and have been found to double the number of peptides added to the percent
coverage of a protein or to identify a homologous protein that mass database searching could
not determine. We have developed a web-based program that uses the BLAST algorithm and custom
databases to automatically sort through de novo peptides and display the data. The predicted
de novo sequenced peptides that result from the PEAKS program have inherent errors due to
the quality of the spectra that it interprets. To determine which sequence is the most
accurate for protein identification, searching the sequences by homology can help find the
errors without having to physically look at each individual spectrum. Hand sorting through
de novo peptides is inaccurate, biased, and time consuming, and it requires knowledge of mass
spectrometry (which researchers often lack). Proteomic questions can be categorized
into two main groups: known protein confirmation, in which the researchers are looking for
expression levels (i.e., presence/absence), and unknown protein identification. Where a
protein identification has been made, confidence is maximized by increasing % coverage.
In the case of an unknown protein, identification can occur, but often does not because
sequence variation causes mass database searching to fail. In these cases, raw data can show
that peptide sequences do exist, and de novo sequences must be used. To avoid wasting useful
data, automating de novo peptide sorting for large proteomic experiments is a necessity.


Phylogenetic reconstruction methods for highly diverged protein families

Cory Strope (1); Etsuko N. Moriyama (2)

1) Department of Computer Science; University of Nebraska-Lincoln

2) School of Biological Sciences; Plant Science Initiative; University of Nebraska-Lincoln

cstrope@cse.unl, emoriyama2@unl.edu




Objectives:

Phylogenetic trees are reconstructed based on multiple alignments. Using any
phylogenetic method (e.g., Neighbor-Joining, Maximum Parsimony, or Maximum
Likelihood), we can examine evolutionary relationships among protein sequences and
infer the hypothetical ancestral protein sequences. Multiple alignments of protein
sequences are generally more useful if the sequences have undergone only point
mutations with a limited number of insertion/deletion events. However, this approach is not
very effective for modeling more dynamic changes, such as duplication, translocation,
insertion, and deletion of large protein regions or domains. In this study, we analyzed the
performance of different methods of reconstructing phylogenetic trees from protein sequences
with such a dynamic evolutionary history.

Prediction of amphipathic helices using statistical analysis

Mamta Bajaj (1); Hideaki Moriyama (2); Etsuko N. Moriyama (3)

1) Department of Computer Science; University of Nebraska-Lincoln

2) Department of Chemistry; University of Nebraska-Lincoln

3) School of Biological Sciences; Plant Science Initiative; University of Nebraska-Lincoln

mam_b99@yahoo.com, hmoriyama2@unl.edu, emoriyama2@unl.edu


Many secondary structure prediction methods have been developed. However, very few
methods are available for predicting amphipathic helices. Amphipathic alpha helices are very
important for protein structure and function. These alpha helices have hydrophobic and
hydrophilic faces, which correspond to the protein side and the other side. Locating
such a helix helps in predicting the function of a protein, for example in DNA-binding
proteins. We are developing a method that predicts such alpha helices based on a set of new
statistics. Training sets consisting of amphipathic alpha helices are prepared from the
Protein Data Bank (PDB). The helices in the PDB are located by calculating torsion angles.
Surface accessibility is also calculated to find amphipathic helices, along with manual
examination. Using this training data, we optimize a set of statistics that discriminates
between amphipathic alpha helices and non-amphipathic helices. We will discuss the performance
of this new method compared with other methods.


Promoter analysis from microarray results to characterize gene regulation
by MUC1 signaling in pancreatic tumor cell lines.

Chunhui Yi, David J. Smith, Andrew J. Gawron, Michael A. Hollingsworth

Eppley Institute for Research in Cancer and Allied Diseases

University of Nebraska Medical Center, Omaha NE

cyi@unmc.edu, djsmith1@unmc.edu, agawron@unmc.edu, mahollin@unmc.edu


MUC1, a glycosylated transmembrane mucin that is substantially overexpressed and
aberrantly glycosylated in many tumors, is believed to contribute to metastasis. The
cytoplasmic tail (CT) of MUC1 co-localizes with β-catenin, which interacts with transcription
factors to regulate gene expression. We hypothesized that the MUC1 CT is involved in
regulating the expression of genes that contribute to tumor growth and metastasis. To
investigate alterations in gene expression in pancreatic tumor cells, we performed microarray
experiments using a cDNA array. The results revealed that overexpression of MUC1 in the
pancreatic cell line S2-013 differentially regulated 28 genes. Deletion of the MUC1 CT
partially restored the expression of 7 genes. The expression levels of 5 of these genes
directly correlated with increasing metastatic potential of pancreatic cell lines. To
determine whether this is a direct effect of MUC1 CT mediated signal transduction, we used
bioinformatics strategies to identify common transcription factor (TF) binding sites present
in the promoter regions of these genes. We interrogated the promoter sequences of these
genes using the MatInspector program to search for TF binding sites. We found that a putative
binding site for activator protein 4 was present in 6 of the 7 sequences but not in any of 3
control sequences. We also used the motif discovery tool MEME to identify consensus
motifs that may be potential TF binding sites in the 7 sequences, but failed to find specific
motifs due to the limitations of the program. These results suggest that bioinformatics tools
combined with biological techniques are a promising approach for the discovery of
downstream TFs that are regulated by novel signal transduction pathways. To further refine
our strategy, we need to use existing bioinformatics tools more efficiently and also develop
new tools appropriate to our study.

Federated QTool: Multidatabase Queries Simplified

Matthew Smart

Department of Computer Science

University of South Dakota;
Vermillion, SD 57069

msmart@usd.edu


QTool allows researchers with varying levels of technical skill to interact with the data in
their relational databases in order to generate tables of data for reports. It is very flexible in
that it can easily be connected to almost any relational database management system on the
market today. It is also capable of generating queries that span multiple databases (federated
database queries). The user interface has been simplified so that it does not require
knowledge of query languages to perform queries. The interface can also be incorporated
into a web browser for inclusion in a web-based system.


Ranking Differentially-Expressed Genes in Microarray Data

Linfeng Cao; Li Xiao; Simon Sherman

Eppley Institute for Research in Cancer and Allied Diseases;

University of Nebraska Medical Center; Omaha NE 68198-6805

linfengcao@hotmail.com, lxiao@unmc.edu, ssherm@unmc.edu



In microarray analysis, the expression levels of several thousand genes can be measured
simultaneously. To extract biologically meaningful information from microarray data,
statistical methods are used. The purpose of this work is to develop and implement in a new
software tool, MicroMultitest 1.0, a number of different algorithms for statistical testing (such
as t-test, p-value, adapted SAM method, p-value adjustment and multiple testing), as well as
the Receiver Operating Characteristic (ROC) analysis technique to quantify the accuracies of
different methods for analyzing DNA microarray data. We proposed to rank
differentially-expressed genes by the joint use of several statistical methods. We also
proposed to use the ROC curves to: (i) estimate the accuracies of different statistical methods,
and (ii) find the optimal cutoffs for statistical methods.
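
The idea of ranking genes by several statistics jointly and scoring a method with ROC analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the MicroMultitest implementation: it combines only two statistics (Welch's t and a mean difference) by averaging their ranks, it does not implement the adapted SAM method, and the gene names and expression values are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def mean_difference(a, b):
    """Difference of group means (a log-scale fold change if inputs are logs)."""
    return mean(a) - mean(b)


def ranks(scores):
    """Map gene -> rank, where rank 1 has the largest absolute score."""
    order = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return {gene: i + 1 for i, gene in enumerate(order)}


def joint_rank(data):
    """data: gene -> (group A values, group B values); sort by mean rank."""
    t_rank = ranks({g: welch_t(a, b) for g, (a, b) in data.items()})
    d_rank = ranks({g: mean_difference(a, b) for g, (a, b) in data.items()})
    return sorted(data, key=lambda g: (t_rank[g] + d_rank[g]) / 2)


def roc_auc(scores, truth):
    """Area under the ROC curve: chance that a truly changed gene
    receives a higher score than an unchanged one (ties count half)."""
    pos = [s for s, y in zip(scores, truth) if y]
    neg = [s for s, y in zip(scores, truth) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical expression values for three genes in two conditions.
    toy = {
        "gene_a": ([5.0, 5.2, 4.9], [2.0, 2.1, 1.9]),
        "gene_b": ([3.0, 3.1, 2.9], [3.0, 2.9, 3.1]),
        "gene_c": ([4.0, 4.5, 3.5], [3.0, 3.6, 2.4]),
    }
    print(joint_rank(toy))
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

Computing the area under the curve requires data in which the truly changed genes are known, for example simulated or spiked-in data; an optimal cutoff could then be chosen by scanning thresholds along the same curve.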


SPV: A Similar Parikh Vector Search Algorithm for Protein Sequences

Xiaolu Huang, Anguraj Sadanandam, Rakesh Singh and Hesham Ali

Department of Computer Science; College of Information Science and Technology

University of Nebraska at Omaha 68182-0116

xhuang@unmc.edu, hesham@unomaha.edu


Tumor markers are polypeptides expressed at the surface of the tumor cells. These molecules
can adhere to the receptors at the surface of normal tissue cells, and are considered to be
important for tumor cell metastasis. Previous studies showed that only 4-7 critical residues are
required for protein-protein interactions. These critical residues may not appear in the protein
sequence in a specific order or in a contiguous manner in order to perform their function.
Understanding these critical residues is very important in drug design and tumor metastasis
research. Given an ordered alphabet A of finite size k, with redundant elements
permissible, the Parikh vector of a word w on the alphabet A is the integer vector
$v = (n_1, n_2, \ldots, n_k)$, where $n_i$ is the number of occurrences of the $i$-th
letter of A in w. In this project, we propose
a new Similar Parikh Vector (SPV) search algorithm. SPV provides an excellent tool for
tumor marker search and prediction because Parikh vectors ignore residue order, whereas
traditional alignment algorithms are order dependent.
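
The Parikh vector itself is simple to compute, as the following minimal sketch shows. The window search around it is only an assumption about how an order-independent search could look: the abstract does not specify the SPV algorithm, so the sliding window, the L1 distance and the `max_distance` threshold are hypothetical choices, as is the alphabet ordering.

```python
from collections import Counter

# Assumed ordered alphabet A: the 20 standard amino acids (k = 20).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parikh_vector(word, alphabet=AMINO_ACIDS):
    """Return (n_1, ..., n_k): occurrences of each alphabet letter in word."""
    counts = Counter(word)
    return tuple(counts[letter] for letter in alphabet)


def l1_distance(u, v):
    """Sum of absolute differences between two count vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def similar_windows(sequence, query, max_distance=0, alphabet=AMINO_ACIDS):
    """List (start, window) pairs whose Parikh vector lies within
    max_distance of the query's, regardless of residue order."""
    target = parikh_vector(query, alphabet)
    width = len(query)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if l1_distance(parikh_vector(window, alphabet), target) <= max_distance:
            hits.append((start, window))
    return hits


if __name__ == "__main__":
    print(parikh_vector("ABBA", alphabet="AB"))
    # The query "DGR" matches the window "RGD" although the order differs.
    print(similar_windows("MKRGDSL", "DGR"))
```

Because only letter counts are compared, a permuted motif is found where an alignment-based search would not match it.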



Taxol Analogues - Predicting Antitumor Activities with Neural Network

Stan. Svojanovsky¹, Swapan Chakrabarti², George S. Wilson³, Gunda Georg⁴, Peter Smith¹

¹Department of Molecular and Integrative Physiology, Kansas University Medical Center,
Kansas City, KS 66160;
²Department of Electrical Engineering and Computer Science, University of Kansas,
Lawrence, KS 66045
³Department of Chemistry, University of Kansas, Lawrence, KS 66045
⁴Department of Medicinal Chemistry, University of Kansas, Lawrence, KS 66045

ssvojanovsky@kumc.edu; chakra@eecs.ku.edu; gwilson@ku.edu; georg@ku.edu;
psmith@kumc.edu


We present a back-propagation neural network (BPNN) design for 50 taxol derivatives
evaluated with a feature vector of 27 numerically quantified physical and chemical properties.
The training set contains 40 compounds with known antitumor activities. A cascade
of correlation and discriminant analyses then decreases the number of inputs to 8, in order to
construct an optimal NN prototype. Based on the training data set and BPNN architecture,
meaningful and accurate predictions of the anticancer activity for the 10 tested analogues are
achieved. The system design depends greatly on the nature of the non-linearity to be modeled.
For data sets containing periodicity (signature), the results indicate that the BPNN is more
flexible with better performance than statistical analyses based on the assumption of normally
distributed inputs. In this study, BPNN is used as a powerful tool for the design of
quantitative structure-activity relationships (QSAR) with screening of structurally similar
taxol analogues for their anticancer activities. The BPNN prototype was validated with synthesis
of these compounds and consequent tests that indicate the enhanced antitumor activities in 8
out of 10 predicted taxol analogues. This is more than two times better than the approximately
35% accuracy expected from a statistical classifier.
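
For readers unfamiliar with back-propagation, the following minimal sketch trains a one-hidden-layer network with 8 inputs on 40 samples and holds out 10. Only the 8/40/10 sizes come from the abstract. The hidden-layer size, learning rate, sigmoid units, squared-error loss and the synthetic data are all assumptions made for illustration; this is not the authors' network and uses none of their taxol descriptors.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyBPNN:
    """One-hidden-layer network trained by plain back-propagation."""

    def __init__(self, n_inputs=8, n_hidden=4, seed=0):
        rng = random.Random(seed)
        # One row of weights per hidden unit; the last entry is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr=0.5):
        hidden, out = self.forward(x)
        # Error signal at the sigmoid output for a squared-error loss.
        delta_out = (out - target) * out * (1.0 - out)
        # Propagate the error back using the output weights before updating.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]

    def mean_squared_error(self, samples):
        return sum((self.forward(x)[1] - y) ** 2 for x, y in samples) / len(samples)


def make_synthetic_set(n, rng):
    """Synthetic stand-in data: 8 random inputs, label 1 if x0 + x1 > 1."""
    samples = []
    for _ in range(n):
        x = [rng.random() for _ in range(8)]
        samples.append((x, 1.0 if x[0] + x[1] > 1.0 else 0.0))
    return samples


def train_demo(epochs=300, seed=1):
    rng = random.Random(seed)
    train, test = make_synthetic_set(40, rng), make_synthetic_set(10, rng)
    net = TinyBPNN()
    before = net.mean_squared_error(train)
    for _ in range(epochs):
        for x, y in train:
            net.train_step(x, y)
    return net, before, net.mean_squared_error(train), test


if __name__ == "__main__":
    net, before, after, test = train_demo()
    print(f"training MSE before: {before:.3f}, after: {after:.3f}")
    print(f"held-out MSE: {net.mean_squared_error(test):.3f}")
```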

Usage of multivariate methods in the analysis of protein sequences

Stephen O. Opiyo¹, Han Asard², Stephen Kachman³ and Etsuko N. Moriyama⁴

¹Department of Agronomy and Horticulture,
University of Nebraska-Lincoln, Lincoln, NE 68583-0915
²Department of Biochemistry, Plant Science Initiative,
University of Nebraska-Lincoln, Lincoln, NE 68588-0664
³Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE 68583-0712
⁴School of Biological Sciences, Plant Science Initiative,
University of Nebraska-Lincoln, Lincoln, NE 68588-0660

sopiyo@unlserve.unl.edu, hasard2@unlnotes.unl.edu, skachman@unl.edu,
emoriyama2@unlnotes.unl.edu


The number of amino acid sequences in databases is increasing. Various methods are
needed to extract information from this wealth of data. Multivariate methods have been little
used in the analysis of protein sequences in bioinformatics. In this study, we examine two
multivariate analysis methods: principal component analysis (PCA) and cluster analysis (CA).
Proteins included in the study are Cytochrome b561 (Cyt-b561) and fatty acid desaturase
enzymes. The objectives of this study are to use PCA to
extract information from physico-chemical properties of the 20 amino acids, to use auto and
cross covariance (ACC) to transform amino acid sequences into quantitative measures, and to
use PCA and CA to analyze the transformed protein sequences. We started from 13
physico-chemical properties of the 20 amino acids and used PCA to reduce the dimensionality of
this data set. Three principal components (S scores) are extracted. Various sizes of the amino
acid range (lag) used to calculate ACC are investigated using amino acid sequences from the proteins
listed above. So far we have successfully reduced the lag size up to 5 amino acids with the
maximum classification power. The use of ACC in data transformation makes it possible to
translate amino acid sequences of different lengths into the same number of variables. This enables us to use
various multivariate analysis methods without relying on multiple alignments but still
including positional information in our analyses. The results from this study show that
multivariate methods can be used in protein sequence classification.
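
The following minimal sketch shows how an auto and cross covariance transform can turn sequences of different lengths into vectors of equal length. The abstract does not give the formula, so the sketch assumes one common formulation: for descriptors j and k and a given lag, the products z[i][j] * z[i + lag][k] are averaged over all positions i. The three descriptor values per residue are placeholders, not the S scores extracted in the study, and only four residues are covered.

```python
from itertools import product

# Placeholder descriptors (three scores per residue). These are NOT the
# S scores from the study; they only make the example runnable.
PLACEHOLDER_SCORES = {
    "A": (0.1, -0.5, 0.3),
    "G": (-0.2, 0.4, 0.0),
    "L": (0.9, -0.1, -0.6),
    "K": (-0.7, 0.2, 0.8),
}


def acc_transform(sequence, scores=PLACEHOLDER_SCORES, max_lag=5):
    """Return ACC terms for lags 1..max_lag and every descriptor pair.

    Pairs with j == k are auto covariance terms; pairs with j != k are
    cross covariance terms. The output length depends only on the number
    of descriptors and max_lag, not on the sequence length.
    """
    n = len(sequence)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    z = [scores[residue] for residue in sequence]
    n_desc = len(z[0])
    features = []
    for lag in range(1, max_lag + 1):
        for j, k in product(range(n_desc), repeat=2):
            total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
            features.append(total / (n - lag))
    return features


if __name__ == "__main__":
    short = acc_transform("AGLKAGLK")
    long_ = acc_transform("KLGAKLGAALGK")
    # Both vectors have 3 * 3 * 5 = 45 entries despite different lengths.
    print(len(short), len(long_))
```

With three descriptors and a maximum lag of 5, every sequence maps to 45 variables, which can then be passed to PCA or cluster analysis without a multiple alignment.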


Using enhancing signals to improve specificity of Ab initio splice site sensors

Tchourbanov Alexander¹, Deogun Jitender² and Hesham Ali

²Department of Computer Science; College of Information Science and Technology;
University of Nebraska, Omaha, NE 68182

¹Department of Computer Science and Engineering,
University of Nebraska-Lincoln, Lincoln, NE 68588;

achurbanov@unomaha.edu, deogun@cse.unl.edu, hesham@unomaha.edu




In this paper, we describe a new approach to improve the precision of splice site annotation in
human genes. The problem is known to be extremely challenging since the human splice
signals are highly indistinct and frequent cryptic sites confuse signal sensors. There is strong
evidence that Exonic Splicing Enhancers (ESE) and Exonic Splicing Silencers (ESS)
influence commitment to splicing at early stages. We propose the use of Bayesian Networks
(BN) combined with a Boltzmann machine splice sensor to improve the specificity of splice
site prediction. The new program, SpliceScan, was implemented to demonstrate the feasibility of
specificity enhancement based on ESE/ESS signal interactions. The performance of
SpliceScan was assessed by comparing it to the recently developed GeneSplicer program. Our
experimental results show that SpliceScan outperforms GeneSplicer and produces fewer false
negatives for the test cases used. The proposed approach is of particular value for Ab initio
gene annotation.