Database Focus Group

CIPRES DB


CIPRES

Database Focus Group

NSF Site Visit

June 28, 2006

San Diego



Senior Personnel



Susan Davidson, University of Pennsylvania
Michael Donoghue, Yale University
Mark Miller, San Diego Supercomputer Center
Dan Miranker, UT Austin
Brent Mishler, UC Berkeley
William H. Piel, Yale University (TreeBASE II lead)
Val Tannen, University of Pennsylvania (database focus lead)


Other (Partially) Funded Personnel


Lucie Chan, Senior Software Developer, San Diego Supercomputer Center
Shirley Cohen, Database Developer, then PhD Student, UT Austin, then University of Pennsylvania
Sarah Cohen-Boulakia, Postdoc, University of Pennsylvania (not funded by CIPRES)
Jin Ruan, Senior Software Developer, San Diego Supercomputer Center (TreeBASE II Software Lead)
Yifeng Zheng, PhD student, University of Pennsylvania


Goals of the Database Focus


The major objective is the development of TreeBASE II
In addition, this focus has supported related research on
    storage/querying of the large phylogenetic trees constructed in
        the Simulation Focus (Davidson, Kim, Zheng)
        the Algorithms Focus of the project (Moret, Hunt, Warnow)
    data provenance in phyloinformatics workflows (Davidson, Cohen, Cohen-Boulakia)
    phylogenetic database extensions using a metric ordering to support molecular data (Miranker)
    genome-scale phylogenetics (Piel)
    searching large collections of trees for topological patterns (Piel)


The current TreeBASE (I)


A major data resource for biological and biomedical research, more than 10 years old
    Submissions must be published in a peer-reviewed scientific journal before being published in TreeBASE.
Has been searched from over 60,000 distinct IP addresses
Has accepted over 1,300 submissions that map to over
    3,700 trees and
    60,000 distinct taxa.
But the capabilities of the current database are being overtaken by demand.
CIPRES is developing TreeBASE II as a robust, scalable, and versatile redesign and re-engineering of TreeBASE I.



TreeBASE I Audience


Researchers from
    traditional systematics backgrounds and
    molecular biology backgrounds
who are concentrating on a series of focused experiments in the lab.
These users include those who periodically seek online representations of individual phylogenies for research and educational purposes.


Additional TreeBASE II Audiences (1)



Researchers who want to run meta-analyses on large collections of trees. Examples:
    identifying patterns in trees that result from one type of analysis over another
    visualizing large collections of trees
    studying collaborative networks among phylogeneticists


Additional TreeBASE II Audiences (2)



Phyloinformaticians who seek to make large-scale inferences using synthetic methods applied to large collections of trees. Examples:
    assemble a supertree for a large branch of the Tree of Life
    mine data in search of conflicting phylogenetic signals
    examine the evolution of genes and genomes in a comparative context


Additional TreeBASE II Audiences (3)



Bioinformaticians who conduct simulation studies.
    Frequently, simulation studies use simple models, such as the Kimura 2-parameter and Jukes-Cantor models, which are not believed to be biologically realistic.
    Finding realistic evolutionary models, using real data, and carrying out simulation studies are some of the main goals of this group.
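
To show how simple such a model is, here is a minimal sketch of sequence simulation under Jukes-Cantor (JC69). It is an illustration only and not part of TreeBASE II; the function names, the fixed random seed, and the 20-site sequence are assumptions made for the example. Branch length is measured in expected substitutions per site.

```python
import math
import random

BASES = "ACGT"

def jc69_same_probability(branch_length: float) -> float:
    """Probability that a site shows the same base after `branch_length`
    expected substitutions per site under Jukes-Cantor (JC69)."""
    return 0.25 + 0.75 * math.exp(-4.0 * branch_length / 3.0)

def evolve(sequence: str, branch_length: float, rng: random.Random) -> str:
    """Evolve a DNA sequence along one branch, treating sites independently."""
    p_same = jc69_same_probability(branch_length)
    evolved = []
    for base in sequence:
        if rng.random() < p_same:
            evolved.append(base)
        else:
            # Under JC69 the three other bases are equally likely.
            evolved.append(rng.choice([b for b in BASES if b != base]))
    return "".join(evolved)

# Hypothetical usage: a random 20-site ancestor evolved along a short branch.
rng = random.Random(0)
ancestor = "".join(rng.choice(BASES) for _ in range(20))
descendant = evolve(ancestor, branch_length=0.1, rng=rng)
```

The model has a single rate shared by all substitutions and all sites, which is why simulations built on it are easy to run but hard to defend as biologically realistic.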


Value Added by TreeBASE II




A phylogenetic query language to allow "power users" to run complex phyloinformatic queries, including queries on tree topology.
A robust service layer and LSIDs to allow external tools and services to interface with the database.
Storage of LSIDs and foreign handles to better integrate with external data services (morphological characters, gene names, taxon names, and museum specimen IDs).
Taxonomic intelligence for leaf and node labels.
Ability to store geographic coordinates to support phylogeographic data visualization and analysis.


Collected Use Cases: Query Examples

Given a set of taxa and a character matrix, find the characters for which the taxa have the same state.
Given a set of taxa and a set of trees, find all trees for which the subtree determined by the taxa (as leaves) is the same.
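
The two example queries can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the TreeBASE II query language: a matrix is assumed to be a dict from taxon label to a string of character states, a rooted tree is assumed to be a leaf label or a tuple of child trees, and "the subtree determined by the taxa" is read as the tree restricted to those leaves with child order ignored. All function names are hypothetical.

```python
from collections import defaultdict

def shared_state_characters(matrix, taxa):
    """Query 1: indices of characters where all given taxa share one state."""
    rows = [matrix[t] for t in taxa]
    return [i for i, states in enumerate(zip(*rows)) if len(set(states)) == 1]

def induced_subtree(tree, taxa):
    """Restrict a rooted tree to the leaves in `taxa`.

    Returns a canonical form (a label, or a frozenset of child forms), so
    child order is ignored. Nodes left with one child are collapsed.
    Returns None if no leaf is kept.
    """
    if isinstance(tree, str):
        return tree if tree in taxa else None
    kept = [s for s in (induced_subtree(c, taxa) for c in tree) if s is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]
    return frozenset(kept)

def group_by_induced_subtree(trees, taxa):
    """Query 2: group tree ids by the subtree that `taxa` determine."""
    groups = defaultdict(list)
    for tree_id, tree in trees.items():
        groups[induced_subtree(tree, set(taxa))].append(tree_id)
    return dict(groups)

# Made-up data for illustration.
matrix = {"A": "0110", "B": "0100", "C": "0111"}
same = shared_state_characters(matrix, ["A", "B", "C"])

trees = {
    "T1": (("A", "B"), ("C", "D")),
    "T2": ((("A", "B"), "E"), "C"),
    "T3": (("A", "C"), "B"),
}
groups = group_by_induced_subtree(trees, ["A", "B", "C"])
```

Here T1 and T2 fall into one group because both reduce to ((A, B), C) once the other leaves are pruned, while T3 reduces to ((A, C), B).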




TreeBASE II Capabilities: Submission

Friendlier interface, with more features semi-automated
Support for entering additional (currently non-NEXUS) data such as specimen IDs
Automated annotations (e.g., communicating with other sources to retrieve the sequence for a GenBank accession number)
Better error checking (e.g., matching taxon labels between trees and character matrices)
Assistance features will be opt-in and can be turned off by the user
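
The taxon-label check mentioned above could look roughly like the following. It is a sketch only: the nested-tuple tree, the dict-based matrix, and the function names are assumptions for illustration, and the labels are made-up test data.

```python
def leaf_labels(tree):
    """Collect leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_labels(child)
    return labels

def label_mismatches(tree, matrix):
    """Report taxon labels found in only one of the tree and the matrix."""
    in_tree = leaf_labels(tree)
    in_matrix = set(matrix)
    return {
        "only_in_tree": sorted(in_tree - in_matrix),
        "only_in_matrix": sorted(in_matrix - in_tree),
    }

# Made-up submission with one inconsistent label (space vs. underscore).
tree = (("Homo sapiens", "Pan troglodytes"), "Gorilla gorilla")
matrix = {
    "Homo sapiens": "ACGT",
    "Pan troglodytes": "ACGA",
    "Gorilla_gorilla": "ACTA",
}
report = label_mismatches(tree, matrix)
```

A submission tool could show such a report to the submitter before accepting the data, so that near-duplicate labels are fixed at the source.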


TreeBASE II Capabilities: Curation

Support for interaction with the publication process:
    In conjunction with journal submission, study data is submitted to TreeBASE
    It is not made visible to search/query users, but reviewers or journal editors can examine it (anonymous access)
    If and when the journal submission is accepted, the study data is made visible to search/query users
Support for TreeBASE II editors, for example:
    to correct author, citation, or other metadata
    to correct the taxon names (alignment between trees and character matrices or with taxonomic services)
    to remove orphan data
An interface with access to taxonomic services such as uBio (www.ubio.org) or the Glasgow Name Server (taxonomy.zoology.gla.ac.uk/rod/rod.html) will be provided to facilitate both submission support and curation capability.
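
The publication workflow described on this slide amounts to a small visibility rule. The sketch below is one possible reading, not the TreeBASE II implementation: the two statuses, the two roles, and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    SUBMITTED = "submitted"   # under journal review, hidden from search users
    PUBLISHED = "published"   # journal accepted, visible to everyone

class Role(Enum):
    SEARCH_USER = "search_user"
    REVIEWER = "reviewer"     # reviewers and journal editors

@dataclass
class StudySubmission:
    study_id: str
    status: Status = Status.SUBMITTED

    def visible_to(self, role: Role) -> bool:
        """Published studies are public; others are open to reviewers only."""
        if self.status is Status.PUBLISHED:
            return True
        return role is Role.REVIEWER

    def accept(self) -> None:
        """Record journal acceptance, making the study publicly visible."""
        self.status = Status.PUBLISHED

# Hypothetical usage.
study = StudySubmission("hypothetical-study-1")
before = (study.visible_to(Role.SEARCH_USER), study.visible_to(Role.REVIEWER))
study.accept()
after = (study.visible_to(Role.SEARCH_USER), study.visible_to(Role.REVIEWER))
```

A rejected submission is not covered here because the slide does not say what happens to its data.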








TreeBASE II Capabilities: Search (1)

2-step configurable GUI retrieving sets of studies, matrices, or trees.
    Step 1: choose search criteria
    Step 2: choose search
Study Search By:
    Disjunction of conjunctions of author last names
    Citation title matches given keyword(s)
    Name matches keyword
    Contains analysis/analysis step such that:
        Name matches given keyword(s)
        Uses given algorithm
        Uses given software package
        Input and/or output data contains given set of taxa
        Input and/or output data contains tree that matches given tree pattern
        Input and/or output data contains matrices satisfying given search criteria (same as below)
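
"Disjunction of conjunctions of author last names" means a query such as (Smith AND Jones) OR (Lee). A minimal sketch of that matching rule follows; the list-of-lists query encoding, the case-insensitive comparison, and the function name are assumptions, and the studies and names are made up.

```python
def matches_author_query(study_authors, query):
    """True if the study satisfies a disjunction of conjunctions.

    `query` is a list of groups. A study matches if, for at least one group,
    every last name in that group is among the study's authors.
    """
    authors = {name.lower() for name in study_authors}
    return any(all(name.lower() in authors for name in group) for group in query)

# Made-up studies for illustration.
studies = {
    "S1": ["Smith", "Jones"],
    "S2": ["Lee", "Garcia", "Smith"],
    "S3": ["Jones"],
}

# (Smith AND Jones) OR (Lee AND Garcia)
query = [["Smith", "Jones"], ["Lee", "Garcia"]]
hits = [sid for sid, authors in studies.items() if matches_author_query(authors, query)]
```

S3 is excluded because Jones alone satisfies neither group.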





TreeBASE II Capabilities: Search (2)

Tree Search By:
    Tree id number
    Appears in a study satisfying given search criteria (same as above)
    Appears in an analysis/analysis step satisfying given search criteria (same as above)
    Contains given set of taxa
    Matches given tree pattern
Matrix Search By:
    Uses given set of taxa
    Uses given set of character names
    Is a sequence matrix that uses a certain kind of biomolecular information
    Contains given specimen(s)


TreeBASE II Capabilities: Bulk Queries

XML-based query interface for tools that interoperate with TreeBASE II
Input: domain-specific query language
    based on the TreeBASE Domain Model
    related semantically to a simple subset of SQL or ODMG/OQL
    XML-based syntax
TreeBASE XML format for query output
    NEXUS data
    additional data in TreeBASE II
For the CIPRES tool, which is CORBA-based, we will use an IDL-to-XML bridge
Interactive (sophisticated) users can also submit prepared queries
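
The slides do not show the query syntax, so the following is purely a guess at what a small XML-encoded query and its parsing might look like. Every element and attribute name (`query`, `select`, `where`, `containsTaxon`, `usesAlgorithm`) is invented for this sketch and should not be taken as the TreeBASE II format.

```python
import xml.etree.ElementTree as ET

# Hypothetical query: trees that contain two taxa and come from a
# parsimony analysis. The vocabulary is invented for illustration.
QUERY_XML = """
<query select="tree">
  <where>
    <containsTaxon>Homo sapiens</containsTaxon>
    <containsTaxon>Pan troglodytes</containsTaxon>
    <usesAlgorithm>parsimony</usesAlgorithm>
  </where>
</query>
"""

def parse_query(xml_text):
    """Return (target, criteria) where criteria is a list of (name, value)."""
    root = ET.fromstring(xml_text)
    where = root.find("where")
    children = where if where is not None else []
    criteria = [(el.tag, (el.text or "").strip()) for el in children]
    return root.get("select"), criteria

target, criteria = parse_query(QUERY_XML)
```

A flat list of criteria like this corresponds to a conjunctive `WHERE` clause, which is in the spirit of "a simple subset of SQL" even though the real language may differ.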




TreeBASE II Domain Model




A detailed object-oriented Domain Model was designed for TreeBASE II (EER diagrams were manually derived from the Domain Model)
A very partial and simplified view:


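[Diagram: Study, Data, Matrix, Tree, Taxon, MatrixRow, RowSegment, and Specimen, connected by associations labeled with multiplicity 1; the connections themselves are not recoverable from the extracted text]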
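
As a rough illustration of how the classes named in the diagram might fit together, here is a sketch using Python dataclasses. Only the class names come from the slide. The relationships (Matrix and Tree as kinds of Data, a Study owning Data, a MatrixRow holding one Taxon and several RowSegments, a RowSegment optionally citing a Specimen) and every field name are assumptions, since the diagram's connections did not survive extraction.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Taxon:
    name: str

@dataclass
class Specimen:
    specimen_id: str

@dataclass
class RowSegment:
    states: str
    specimen: Optional[Specimen] = None   # assumed: a segment may cite a specimen

@dataclass
class MatrixRow:
    taxon: Taxon                          # assumed: one taxon per row
    segments: List[RowSegment] = field(default_factory=list)

@dataclass
class Data:
    label: str

@dataclass
class Matrix(Data):                       # assumed: Matrix is a kind of Data
    rows: List[MatrixRow] = field(default_factory=list)

@dataclass
class Tree(Data):                         # assumed: Tree is a kind of Data
    newick: str = ""
    taxa: List[Taxon] = field(default_factory=list)

@dataclass
class Study:
    title: str
    data: List[Data] = field(default_factory=list)

# Hypothetical usage with made-up values.
human = Taxon("Homo sapiens")
row = MatrixRow(human, [RowSegment("ACGT", Specimen("hypothetical-voucher-1"))])
study = Study(
    "Example study",
    [Matrix("M1", [row]), Tree("T1", "(Homo_sapiens);", [human])],
)
```

Sharing one `Taxon` object between the matrix row and the tree reflects the idea that taxon labels should align across a study's trees and matrices.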


Technologies used in TreeBASE II development






Open source
Proven technologies and best practices
Hibernate to generate the SQL schema from the Domain Model
Hibernate, based on the Domain Model, to program any database access
Tomcat Web container and one of SDSC's Web farms
Spring framework as an application container to manage transactions



Status and Future Plans





Requirements and use case collection is complete
The architectural design is complete
Currently working on detailed design and coding, including GUI work and loading data from TreeBASE I (some is ready)
A demo will be performed during the site visit

TreeBASE I data will be loaded by August 2006
Elements of the interactive user interface will be beta released and end-user tested throughout Fall 2006
New submissions accepted starting February 2007
Links to taxonomic services developed in Spring 2007
Bulk query API, including CIPRES tool interface, developed in 2007
Available as Web service at end of 2007