All metadata related to each individual sequence XML format

clumpfrustrated, Biotechnology

2 Oct 2013

Center for Computational Genomics and Bioinformatics

University of Minnesota

Source View

Community

Integrative Bioinformatics (NSF)

Arabidopsis (reference organism)

All cereals (NSF)

Rice

Legumes

Soy EST (USB)

Soy Functional (NSF)

Medicago (NSF)

Trees

Pine EST (DOE)

Pine Functional (NSF)


Partnerships

Research Community Support: Shared Expertise and Knowledge

Bioinformatics Community

Plant Community

Metacomputing Community

Federal Support: Grants and Contracts

Corporate Support: Hardware, Software, and Data

Integrated Genomics






Application View

Warehouse

Multi-species Comparative Functional Genomics

Metafam

All public genomic data

Sequence processing

Similarity Searches

Unigene Sets

Diogenes

Pipeline, Automation

Genomics Desktop

Functional Genomics

Array Design

SAGE

Clustering

Data Mining

Visualization & Exploration

Metabolic Pathway Reconstruction

BioData

Relational Genbank


The Genomics Grid

Distributed Computing:

Condor, Globus, Sun Grid

Clusters of Workstations

High Performance Networking

ATM / GBE / FCAL

Internet 2

Special Purpose Hardware

Time Logic “DeCypher”

Interoperable Software

“Grid Aware” Applications

Remote SQL Queries

Java

Enterprise level data storage

Oracle

High Throughput Genomics

Visual Exploration of Global Data Resources

Real Time, Visual Collaboration


Design Goals


Scalable - Provide a workload management solution for large scale bioinformatics processing

Extensible - Add new tools easily without modifying core components

Portable - Deliver functionality in heterogeneous environments

Collaborative - Combine processing resources to increase throughput


Underlying Components


Client

Data Files

Metadata

Context

Unique Internal Identifiers: 55, 56, 57, 58

Individual Data Items (Chromatograms or sequence files)

All metadata related to each individual sequence (XML format)

“Preprocessing” database

Data submissions happen in batches, initiated by clients.

File formats, processing requirements, and batch structure vary widely.

Data arrives at CCGB in a well structured format, amenable to automatic processing.

Web Based Submission Tool
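
The pairing of a unique internal identifier with per-sequence metadata in XML could look roughly like the sketch below. The element and attribute names (`sequence`, `internal_id`, `field`) and the sample values are assumptions made for illustration; the slides do not show the actual schema used by the preprocessing database.

```python
import xml.etree.ElementTree as ET

def metadata_to_xml(internal_id, record):
    """Wrap one sequence's metadata in an XML element keyed by its internal ID."""
    # Element and attribute names are illustrative, not the center's real schema.
    elem = ET.Element("sequence", {"internal_id": str(internal_id)})
    for name, value in record.items():
        field = ET.SubElement(elem, "field", {"name": name})
        field.text = value
    return ET.tostring(elem, encoding="unicode")

# Hypothetical metadata for the data item with internal identifier 55.
record = {"Sequence ID": "plate01_A01", "Experiment Type": "EST"}
xml_text = metadata_to_xml(55, record)
print(xml_text)
```

Keeping the identifier on the XML element rather than in the file name is one way to let chromatograms and metadata arrive separately and still be matched up later.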


Data Submission Prototype

In this example of a data submission page, the user selects the appropriate data directory and uses Netscape’s file browser to upload the tab-delimited spreadsheet file.


Metadata Required for Processing

Name             Type
Sequence ID      String (used to identify which data file is associated with this metadata)
Sequence Name    String (used for GSS# or EST# in GB submission)
Experiment Type  <GSS, EST>
Data Type        <Chromatogram, FASTA_sequence, array_expression>
Date Sequenced   Date
Seq Primer       Identifier for Primer (CBC maintained list)
Contact Name     Identifier for NCBI Contact File (CBC maintained list)
Citation         Identifier for NCBI Citation File (CBC maintained list)
Library          Identifier for NCBI Library File (CBC maintained list)
Class            <BAC_end, YAC_end, exon-trapped>
Organism         Identifier for organism (CBC maintained list)
Send to DB       <Yes, No, Update>

Some quality control checking is done at submission time to ensure that the metadata are consistent and correct. This includes a spellcheck-like feature that confirms primers, citations, and similar entries refer to items known to CBC.
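
A submission-time check of this kind might be sketched as follows. The enumerated values come from the metadata table; the primer list, the sample rows, and the function name `check_submission` are hypothetical stand-ins, since the real lists were maintained by the center and are not shown in the slides.

```python
import csv
import io

# Enumerated fields and their allowed values, taken from the metadata table.
ALLOWED = {
    "Experiment Type": {"GSS", "EST"},
    "Data Type": {"Chromatogram", "FASTA_sequence", "array_expression"},
    "Send to DB": {"Yes", "No", "Update"},
}
# Hypothetical stand-in for a maintained primer list.
KNOWN_PRIMERS = {"M13F", "M13R", "T7"}

def check_submission(tsv_text):
    """Return (row_number, message) pairs for problems in a tab-delimited sheet."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column, allowed in ALLOWED.items():
            if row.get(column) not in allowed:
                problems.append((row_number, f"{column}: unexpected value {row.get(column)!r}"))
        if row.get("Seq Primer") not in KNOWN_PRIMERS:
            problems.append((row_number, f"Seq Primer: unknown primer {row.get('Seq Primer')!r}"))
    return problems

sheet = (
    "Sequence ID\tExperiment Type\tData Type\tSeq Primer\tSend to DB\n"
    "seq001\tEST\tChromatogram\tM13F\tYes\n"
    "seq002\tESTT\tChromatogram\tM13X\tYes\n"
)
problems = check_submission(sheet)
for row_number, message in problems:
    print(row_number, message)
```

Reporting every problem with its row number, rather than stopping at the first one, lets a submitter fix a whole batch in one pass.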


Tasks in Processing Biological Data


Base Calling (Phred, Phran)


Vector Filter (VF4)


Artifact Filter (af)


BLAST (blast, blastx, tblastx, blastn)


Contig construction (Phrap)


Microarray Design


Primer Selection


Functional Analysis & Annotation


Submission to public repositories (Genbank)


Publication


TkBatch

User Interface: Provide a configurable interface to a set of tools.

Batch Processing System: Enable batch submission of thousands of jobs.

Dependency Management: Define Directed Acyclic Graphs (DAGs) for process flow.

A DAG is not a tree: a task may depend on more than one upstream task.
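
To make the DAG point concrete, here is a minimal sketch of ordering tasks by their dependencies with Kahn's algorithm. The task names echo the processing steps listed earlier, but the specific dependency edges are assumptions chosen for illustration, and nothing here reflects how TkBatch itself is implemented.

```python
from collections import deque

# Hypothetical dependencies: each task maps to the tasks it must wait for.
DEPENDS_ON = {
    "base_calling": [],
    "vector_filter": ["base_calling"],
    "blast": ["vector_filter"],
    "contig_construction": ["vector_filter"],
    "annotation": ["blast", "contig_construction"],  # two parents, so not a tree
}

def execution_order(depends_on):
    """Return one valid run order (Kahn's algorithm); raise if a cycle exists."""
    remaining = {task: len(parents) for task, parents in depends_on.items()}
    children = {task: [] for task in depends_on}
    for task, parents in depends_on.items():
        for parent in parents:
            children[parent].append(task)
    ready = deque(task for task, count in remaining.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in children[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    if len(order) != len(depends_on):
        raise ValueError("dependency cycle detected")
    return order

order = execution_order(DEPENDS_ON)
print(order)
```

The `annotation` task waits for both `blast` and `contig_construction`, which is exactly the shape a tree cannot express.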


Watchlist: Directed Acyclic Graph of processes which will act on the input data

File List: Input data, possibly selected from diverse locations in the file system

Compile to Job Description: Enumerates all tasks included in the job, all job dependencies, as well as a “status journal” indicating progress through the tasks.
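
The compile step described above can be sketched as a cross product of watchlist steps and input files. The dictionary layout, the `step:path` naming, and the status values are assumptions for illustration only; the slides do not document the real job description format.

```python
def compile_job(watchlist, file_list):
    """Expand a watchlist (step -> prerequisite steps) over every input file."""
    tasks = {}
    for path in file_list:
        for step, prerequisites in watchlist.items():
            tasks[f"{step}:{path}"] = {
                "depends_on": [f"{p}:{path}" for p in prerequisites],
                "status": "pending",  # this task's entry in the status journal
            }
    return {"tasks": tasks}

def runnable(job):
    """Names of pending tasks whose prerequisites are all done."""
    tasks = job["tasks"]
    return [
        name for name, task in tasks.items()
        if task["status"] == "pending"
        and all(tasks[dep]["status"] == "done" for dep in task["depends_on"])
    ]

# Hypothetical two-step watchlist applied to files from two directories.
watchlist = {"phred": [], "vector_filter": ["phred"]}
job = compile_job(watchlist, ["run1/a.scf", "run2/b.scf"])
first = runnable(job)
job["tasks"]["phred:run1/a.scf"]["status"] = "done"
second = runnable(job)
print(first)
print(second)
```

Because progress lives in the job description itself, a monitor can read the journal at any time to see which tasks are ready, running, or finished.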

TkBatch Use Outline

Submit to Distributed Processing

CONDOR metacomputing platform

Similar to GLOBUS and Sun’s GRID

Uses idle workstations to perform processing tasks

Dependency

Observe through TkBatch

Building process monitoring capabilities into the TkBatch system.

Obtaining CONDOR source code to make improvements directly.


Application Configuration

Some system abstraction, but still a very “close to the road” interface

Tools cannot be selected unless they are appropriate to the current output type in the watchlist.

Reasonable defaults are provided for command line options.
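
One way to picture the two rules above is a tool table that records each tool's input type, output type, and default options. The type names, the option names, and their values below are all hypothetical; they are not real flags of Phred or BLAST, and the sketch does not describe the actual TkBatch configuration format.

```python
# Hypothetical tool table: name -> (input type, output type, default options).
TOOLS = {
    "phred": ("chromatogram", "fasta", {"trim": "on"}),
    "vector_filter": ("fasta", "fasta", {}),
    "blastn": ("fasta", "blast_report", {"word_size": "11"}),
}

def selectable_tools(current_type):
    """Names of the tools that accept the watchlist's current output type."""
    return sorted(
        name for name, (in_type, _out, _opts) in TOOLS.items()
        if in_type == current_type
    )

def append_tool(watchlist, current_type, name, **overrides):
    """Add a tool to the watchlist and return the new current output type."""
    in_type, out_type, defaults = TOOLS[name]
    if in_type != current_type:
        raise ValueError(f"{name} expects {in_type}, but the watchlist yields {current_type}")
    # Defaults apply unless the user overrides an option explicitly.
    watchlist.append((name, {**defaults, **overrides}))
    return out_type

watchlist = []
current = "chromatogram"
current = append_tool(watchlist, current, "phred")
current = append_tool(watchlist, current, "blastn", word_size="16")
print(watchlist, current)
```

Filtering the menu with `selectable_tools` keeps invalid chains from being built in the first place, while the check in `append_tool` guards against them if one is requested anyway.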


Analysis Tools


RelGB

A simple relational framework for GenBank Data

Java based UI for biologically relevant queries

SSR Identification & primer design for ESTs

All; UTR; BAC-end; BAC

EST contigs: Diogenes-Blast; Primer3
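
The slides name SSR (simple sequence repeat) identification without describing the method used. As a rough illustration of the general idea only, the sketch below finds perfect di- and trinucleotide repeats with a regular expression. The motif lengths, the minimum of five copies, and the function name `find_ssrs` are assumptions, not the center's parameters.

```python
import re

# A motif of 2 or 3 bases followed by at least 4 more copies (5+ in total).
SSR_PATTERN = re.compile(r"([ACGT]{2,3})\1{4,}")

def find_ssrs(sequence):
    """Return (start, motif, repeat_count) for each perfect repeat found."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits

# Hypothetical EST fragment containing an (AT)6 repeat.
est = "GGC" + "AT" * 6 + "CCGTA"
hits = find_ssrs(est)
print(hits)
```

The reported start and length of each repeat are what a primer design step would need in order to pick flanking primers on either side.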


Analysis Tools


PERL and CGI Scripts, operating on XML indexes to data directories

Creates set of predefined web views on data

http://web.ahc.umn.edu/biodata

Grant Summary


Grant Info


Grant Statistics


Contig list


Submission Set List

Submission


Sequence Length Distribution


Submission Set Visualization


Search BLAST reports


Sequence List

Contig Sets


Contig Info Table


Phrap Parameters


Submissions in the Contig Set


Contig Quality Graphs


Sequences in the Contig Set

Contig Page


Sequence Info


Contig Visualization


Sequence Analysis Tools


BLAST Reports

Sequence Info


Raw Sequence


Filtered Sequence


Sequence Quality Graph


Sequence Analysis Tools


BLAST Reports

Project Statistics


Number of sequences


Number of submissions


Length Statistics


Contig Statistics


Quality Statistics

BioData Summary


BioData File Tree

contig_dir_###
| +-index.xml
| |   <contigdata
| |     kingdom="Planta"
| |     family="Pinaceae"
| |     species="Pinus taeda"
| |     files="contigs"
| |   >
| |
| |   <tissuelist>
| |     Xylem
| |   </tissuelist>
| |
| |   <librarylist>
| |     NXNV
| |   </librarylist>
| |
| |   <submissionlist>
| |     991206a 991206b 991207a 991207b
| |     20000103a 20000217a 20000515a 20000612
| |     20000103b 20000221a 20000515b 20000613a
| |     20000103c 20000328a 20000515c 20000613b
| |   </submissionlist>
| |
| |   <phrapparams
| |     minmatch="40"
| |     minscore="80"
| |   >
| |   </phrapparams>
| |
| |   <contigversionlist
| |     AssemblyProcessId="PtaedaNormalXylem"
| |     AssemblyProcessVersion="1"
| |     AssemblyStepNumber="1"
| |   >
| |   </contigversionlist>
| |   </contigdata>
| |
| +-libraryname.fasta.screen.ace.1
| |
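
The BioData views are described as Perl and CGI scripts operating on XML indexes like the one above. As a rough illustration in Python rather than Perl, the sketch below reads a shortened copy of that index and pulls out the values a web view might display. The function name `summarize_index` and the choice of fields are assumptions; the submission list is abbreviated to its first row.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the index.xml shown in the file tree.
INDEX_XML = """
<contigdata kingdom="Planta" family="Pinaceae" species="Pinus taeda" files="contigs">
  <tissuelist>Xylem</tissuelist>
  <librarylist>NXNV</librarylist>
  <submissionlist>991206a 991206b 991207a 991207b</submissionlist>
  <phrapparams minmatch="40" minscore="80"></phrapparams>
</contigdata>
"""

def summarize_index(xml_text):
    """Collect the values from one index that a summary page might show."""
    root = ET.fromstring(xml_text.strip())
    return {
        "species": root.get("species"),
        "tissues": root.findtext("tissuelist", default="").split(),
        "submissions": root.findtext("submissionlist", default="").split(),
        "phrap": dict(root.find("phrapparams").attrib),
    }

summary = summarize_index(INDEX_XML)
print(summary)
```

Keeping one small index per directory means a view can be built by walking the file tree, without a central database lookup.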


CCGB Condor Cluster


65 processors on 37 machines


Performance


4.75 Gflops


25 BIPS


19 GB memory


Figures are roughly equivalent to a 16-processor IBM SP2


Customized usage policies