A Data Coordinating Center for modENCODE


Stein, Lincoln D.



RESEARCH PLAN

A. SPECIFIC AIMS

1. We will provide a project workflow management system that will compile, track, and assure the quality of the data being generated by different modENCODE research projects.

2. We will merge the modENCODE datasets with information from the MODs and other online resources, to create a public resource that includes:

a. Primary data in public repositories

b. The location of any identified sequence element in the corresponding reference genome sequence

c. Metadata describing the experimental methods, including any informatics methods

d. Essential information from the Model Organism Databases (MODs), i.e., FlyBase or WormBase.

3. We will provide the research community with multiple efficient and unencumbered mechanisms for accessing data about functional elements in the Data Coordinating Center (DCC), including:

a. A genome browser view

b. Complex queries and custom reports

c. Bulk dataset downloads

d. A portal for distributing general information about the modENCODE project

4. We will work closely with the relevant MODs to ensure the long-term storage and management of modENCODE data.

5. We will use standard software engineering methodologies to ensure reliable processing and storage of the data.


B. BACKGROUND AND SIGNIFICANCE

The process by which scientists accumulate new knowledge remains the same in the digital age as it was in the pre-computer era: we collect data through experimentation; we analyze these data to detect patterns and make associations; and we make predictions based on our analysis and design new experiments to test them. Through this cycle we improve our understanding and models of how living systems work.

B.1. ENCODE AND MODENCODE

The Encyclopedia Of DNA Elements (ENCODE) project was launched by the National Human Genome Research Institute in September 2003 (http://www.genome.gov/10005107). The project's goal was simple: having reconstructed the "book of life" by sequencing and assembling the 3 billion base pairs of the human genome, ENCODE set out to read the book by identifying and characterizing all functional elements in the human genomic sequence.

Human ENCODE is formally organized into three phases: a pilot phase, a technology development phase, and a production phase. The project is currently in the pilot and technology development phases. For the pilot study, 44 regions representing ~1% of the genome (30 Mb) were targeted for intensive study. Using a variety of computational and experimental methods, each of the targeted regions was characterized for the structure and splicing patterns of protein-coding genes, temporal and spatial patterns of RNA transcription, the tissue-specific patterns of binding of transcription factors to their corresponding binding sites, features of chromatin such as DNase I hypersensitive sites, and variation information including a deep ascertainment of all common single-nucleotide polymorphisms in the target regions. A key part of the ENCODE pilot project involved the sequencing of the corresponding regions of non-human species in order to identify elements that have been conserved during evolution. These evolutionary data have yielded many new insights into the structure of the genome, including the identification of an entirely novel class of enhancers, the ultraconserved elements (Bejerano et al. 2004).

Over 20 principal investigators are participating in the pilot phase as data providers. The data flow is managed by a team of scientists and software engineers in David Haussler's group at the University of California, Santa Cruz (UCSC), who run the project's Data Coordinating Center (DCC). The UCSC DCC is organized around several simple principles: (1) Interactive submission: the DCC provides a means for data providers to upload samples of their data and interactively visualize them on the UCSC genome browser (Karolchik et al. 2003). This allows data providers to find and correct problems in their data at an early stage in the process. (2) Dialogue: engineers from the DCC engage data providers in a continuous dialogue in order to arrive at the best electronic representation of the dataset. (3) Quality control and milestones: the DCC and the data providers negotiate quality control checks on the data, performance milestones, and other metrics which allow the progress of the project to be monitored and problems to be identified early. (4) Open access: all information generated by ENCODE is available to the research community on a release "early and often" basis. Raw data, such as sequencing traces and microarray files, are submitted to the appropriate repositories, and sequence coordinate-based annotations are available for bulk download from UCSC.


modENCODE, announced in the winter of 2006, seeks the comprehensive characterization of functional elements in the genomes of the roundworm C. elegans and/or the fruitfly D. melanogaster. The project is organized in a similar manner to the human ENCODE pilot, with the main difference being that instead of characterizing 1% of the genome, all studies will be genome-wide.

modENCODE is a natural adjunct to the human ENCODE pilot. A fundamental limitation of human ENCODE is that the functional elements it identifies can only be validated and characterized in vitro, either biochemically or using a cell culture system. This limits the ability of researchers to understand the potential role of a functional element in complex integrative processes, such as cell-cell communication. In contrast, non-human model organisms can be manipulated in vivo. A putative functional element can be deleted from the organism's genome, modified, over-expressed, or put into a new spatial and temporal context. Such alterations frequently change the organism's phenotype in a way that reveals the biological significance of the element.

Another advantage of model organisms is that their genomes are often more compact than human. The C. elegans genome has almost the same number of protein-coding genes as human (21,000 versus roughly 24,000), but is only 100 Mb in length, just three times the size of the human ENCODE pilot target regions.

A final strength of model organisms is that they have mature information resources, the model organism databases (MODs), that integrate many aspects of the organisms' known biology. These include detailed descriptions of genetic pathways, the effects of allelic variation, tissue- and stage-specific gene expression patterns, and the organisms' development and neuroanatomy. A requirement of modENCODE is that all information generated by the project be integrated into the MODs for the long-term benefit of the model organism research community.

B.2. THE ROLE OF THE DATA COORDINATING CENTER

Any large project involving more than a few research groups requires a data coordinating center. The DCC provides a place for the various groups to upload their datasets; it runs quality control checks on the sets to identify potential errors; it manages versions of and revisions to the datasets; and it provides management tools for tracking project milestones. Recent NHGRI-sponsored projects that have had a DCC include the Whole Genome Association initiative, the International HapMap Project, and the Mammalian Gene Collection.

The modENCODE DCC will have the additional role of serving as liaison between the general research community, the modENCODE data producers, and the MODs. To liaise with the research community, the DCC must create a compelling web portal that describes the project and its datasets in terms that the general community can understand. To liaise with the data producers, the DCC must engage in a detailed dialogue that allows the DCC and the data producers to agree on the storage, visualization, and query needs for each dataset.

Effective liaison with the MODs is particularly important for this project. Because of the complexity and richness of the information already present in the worm and fly MODs, the modENCODE DCC must not simply collect the datasets in a database of its own devising, dump them out as flat files, and leave the MODs to sort out the details of integrating them

with their existing knowledge. The DCC must work closely with the MODs from the very beginning of the project to ensure that the modENCODE data collected is "MOD ready." MOD readiness spans the gamut from such basic but essential details as using the most up-to-date MOD-approved identifiers for gene models and chromosomal coordinates, to deeper issues such as relying on the appropriate MOD's ontology of anatomic structures to describe the origin of an RNA sample.

Our proposal for the DCC is based around the principles of integration of disparate pieces of information into a single logical repository, of the consistency required to make rigorous comparisons among the datasets, and of rapid feedback from biological experts.

To assure integration we will make the exchange of data as easy and seamless as possible, and provide tools to make the barriers to submitting data as low as possible. We will capitalize on our close affiliations with WormBase and FlyBase in order to ensure that the modENCODE data is "MOD ready" and capable of being rapidly incorporated into the MODs, allowing the sequence-coordinate based annotations of modENCODE to be correlated with pathways, functions, and phenotypes.

To assure consistency we will use accepted community standards, particularly the nomenclature approved by the appropriate MODs and the Sequence Ontology (SO; Eilbeck et al. 2005). We will scrupulously track each dataset using stable version numbers, and apply a uniform set of quality control criteria to each one.

To assure rapid feedback, we will provide multiple mechanisms for communication, including direct e-mail, access to a community WIKI, and specialized tools such as graphical genome annotation editors.

C. PRELIMINARY STUDIES

C.1. FLYBASE

FlyBase provides the scientific community with a readily accessible database of Drosophila genomics and genetics by capturing, integrating, and presenting core information that research has generated (Drysdale 2005). These data are obtained by manual curation of the scientific literature, recording personal communications from the research community, and bulk downloading from large-scale data producers. FlyBase will obtain modENCODE data from the DCC (see letters of support from Gelbart, Drysdale, and Ashburner). Lewis was with FlyBase until the completion of the final euchromatic genome sequence in December 2005, and is deeply knowledgeable about the internal technical and organizational details of FlyBase.

FlyBase's approach in constructing their database and its supporting tools is to develop them in a way that is consonant with the efforts of other genomic and genetic databases. Indeed, one of the specific aims of FlyBase is to "work closely with the other relevant genomic and genetic database groups to ensure that data structures are interoperable, that software are both cost-effective and reusable by other groups, and that the domain knowledge within the various groups is a shared resource" (FlyBase renewal, 2006). As FlyBase and these other database efforts have matured, there has been an increasing recognition of the value of shared semantic standards and collaborative software development. FlyBase, including Lewis and Mungall, has helped to lead efforts in these directions through its role in initiating the GO consortium (to develop a set of structured ontologies that would permit interrogation across databases at levels other than sequence similarity), the SO project (to develop an ontology for describing genomic elements, see Eilbeck 2002), and the GMOD consortium. Both FlyBase and the DCC will continue to jointly use and develop the same underlying tools and information structure, including the Chado database schema (see sections C.4.a and D.2). This will simplify data exchange between FlyBase and the DCC.

C.2. WORMBASE

WormBase (Schwarz et al. 2006) is the model organism database for the C. elegans community. Like FlyBase, its core activity is the manual curation of all scientific publications that are relevant to the C. elegans community; from these papers, WormBase maintains the master list of C. elegans genes and their alleles, strains, transgenic constructs, vectors, and other reagents. WormBase is responsible for the uniform annotation of the C. elegans, C. briggsae and C. remanei genomes, including gene model structures, the comprehensive list of alternative splice forms, suspected and confirmed nucleotide polymorphisms, repetitive elements, and the evolutionary relationships among the three genomes. The database also incorporates the results of large-scale studies, most notably expression patterns derived from GFP-tagged promoter constructs and microarrays.

Via the WormBase web site at www.wormbase.org, researchers can browse the C. elegans database in an object-at-a-time fashion and work their way out through a network of links. For example, a researcher can start with his favorite gene, follow links that describe the cells in which the gene is expressed, and from there link to pages that describe the anatomy and developmental lineage of those cells. WormBase supports a GBrowse-based genome browser view that allows researchers to view genes and other annotations in their genomic context (Stein et al. 2002), as well as a BioMart-based interface for creating complex queries and generating custom data reports (Kasprzyk et al. 2004). A multiple alignment view allows researchers to visualize the evolutionary relationships among the genomes of C. elegans, C. briggsae and, soon, C. remanei.

Researchers can download the entire database in bulk in a variety of formats and can also install the entire database and web site locally. WormBase sees roughly 1 million non-robot page accesses per month. While the underlying WormBase database is AceDB (Durbin and Thierry-Mieg, 1991), considerable effort has gone into making the genome annotation portion of the schema match the data model of the Sequence Ontology. WormBase exports its genome annotations in the SO-compatible GFF3 format (http://song.sourceforge.net/gff3.shtml); the GFF3 files are routinely validated by a GFF3 parser and used as the basis for the GBrowse genome annotation display. Because AceDB is now well outside the mainstream of database management systems, the WormBase consortium is developing plans to migrate to the same Chado database schema used by FlyBase; this decision was endorsed by its Scientific Advisory Board at its last annual meeting in January 2006.

Lincoln Stein is one of the founders of WormBase and is responsible for managing the WormBase user interface, the query tools, and the visualization system. In addition, the CSHL WormBase group is central to the process of updating the WormBase data model when needed to accommodate new data types. This intimate connection to WormBase gives the group a deep understanding of the WormBase database and its technical details.


C.3. INTERMINE AND FLYMINE

InterMine is an open source database project based in the Department of Genetics at the University of Cambridge and built on a Java/Struts/PostgreSQL platform. It is a generic system that has been used to build FlyMine (Lyne et al., submitted; http://www.flymine.org), an integrated database of genomic, expression, and protein data for Drosophila, Anopheles, and other organisms. FlyMine was developed to address the difficulty of using, in an integrated way, the large-scale datasets being generated by modern biology: data are available in a wide variety of formats in numerous different places, making their joint analysis otherwise hard. As an integrated resource, FlyMine makes it possible to run efficient data mining queries that span these domains of biological knowledge. The generic nature of this project and its underlying infrastructure means that it is possible to rapidly construct similarly integrated databases based on data from different organisms and datasets. The FlyMine project has already incorporated much of the starting Drosophila data that will underpin the DCC and therefore forms a good starting point for the integrated data core of the DCC.

The InterMine data warehouse system has a number of attractive features for use in the DCC:

a) A query optimizer that allows the modelling of the data to be uncoupled from the data denormalization that is often used in data warehouses to achieve good query performance. Here data is denormalized via the generation of precomputed tables (materialized views). This allows common queries to run extremely fast and at the same time forms the basis for the query optimizer, which, when worthwhile, transparently re-writes incoming queries to make use of the precomputed tables (see the sketch following this list). Relational database performance is often constrained by the need to carry out multiple table joins, and the combination of precomputed tables and query rewriting helps alleviate this. Tables can be precomputed and registered as such at any time. This means that the performance of the live database can be improved by precomputing bottlenecks in queries without having to rebuild the database, as is often the case for other data warehouses.

b) Arbitrary queries: The above query optimizer allows InterMine-based systems to respond well to arbitrary queries. End users can submit arbitrary queries using a web-based query building interface as well as a programmatic interface. Importantly, it is not necessary to have knowledge of the database schema in order to build queries: the query builder makes use of a class navigator which allows queries to be constructed based on the well-known properties of different classes of biological objects.

c) Template queries: Queries composed through the query builder can be easily converted into re-usable web pages called "query templates." This feature allows complex queries to be encapsulated for re-use by others. It is possible to search the accumulated library of such template queries for templates of interest, which can be run directly on one or more items, or adapted to make further queries.

d) List operations: It is possible to upload, generate, and store "bags" of objects. For instance, the genes that form the output of a query can be saved for later use. Bags can be viewed, edited, and combined in various ways, e.g. intersection and union. Importantly, the contents of bags can be used as constraints in the query builder or in templates. This gives users great flexibility in interrogating the database by being able to build up queries in stages and operate on multiple entities at a time.

e) Automatic code generation is used extensively: this means that, given a starting data model, the code to define the rest of the system is written automatically, and thus little effort is needed to generate the web application that provides the basic user interface, the database itself, the Java classes of the related object model, and the programming interface. Some further simple configuration improves the user interface, but the overall effort required to accommodate a new or changed data model is minimal. We have already established, through the FlyMine project, that InterMine performs well when running a sequence ontology-based schema like that of Chado (C.4.a). Here we propose to use an instance of InterMine running the Chado schema as the core data integration engine of the DCC.

C.4. GMOD

The mission of the GMOD (Generic Model Organism Database) project is to develop, distribute, and maintain software tools and procedures for shared use by existing and emerging model organism databases (http://www.gmod.org). The core of GMOD is the Chado relational database schema, on top of which run a variety of tools for managing sequence annotation, curating literature, performing bioinformatics analyses, running a MOD web site, and handling other typical MOD tasks.

GMOD tools are characterized by their adherence to common standards, such as the Sequence Ontology, and their distribution as open source projects (the exception to this rule is the Pathway Tools software for biological pathway curation, which is distributed under a license that allows binaries to be used freely by academics). GMOD tools are widely used in the animal and plant genome research community.

Stein and Lewis were key instigators of the original GMOD project, and Stein is now the GMOD Principal Investigator. Both have contributed major software components to GMOD: the Lewis group oversees the architectural design of the database schema and maintains Apollo (a genomic annotation editor; see Lewis 2002), while the Stein group manages the overall project and maintains GBrowse (an interactive web-accessible genomic browser; see Stein 2002), CMap (a comparative map viewer), and Turnkey (a Chado-driven web site engine). It also collaborates in the development of BioMart, a data warehouse and report generation engine used by the Ensembl and HapMap projects (see www.biomart.org).

The project will be heavily dependent on Chado and GBrowse, so these are now described in greater detail.

C.4.a. Chado

Chado is a modular relational database schema, designed to provide normalized, flexible data storage for genome annotation, genetic, phenotypic, and expression data from any model organism database project. Chado is organized as a set of distinct modules with tightly defined dependencies. It allows different software components to focus on the specific data compartments required. It allows for extensibility and for adaptability, with changes in one module producing minimal disruption of other modules. Across the schema, relational structures were designed which use controlled vocabularies and ontologies for describing attributes of objects and relationships between objects. Use of controlled vocabularies in general allows for evolution in data content without costly schema revision; in addition, the semantic richness of ontologies, the GO and SO in particular, is used to inform the Chado






object relationships and attributes. Chado has been designed to function on any SQL-compliant RDBMS platform, such as MySQL, Oracle, Sybase, or PostgreSQL. Chado is being used by a number of groups, including DictyBase, TIGR, ZFIN, and SGD.

In February 2004 the GMOD group decided to adopt Chado as the GMOD database schema. This was a significant step in several ways:

- The Chado schema has been, and will continue to be, scrutinized and undoubtedly improved by having many knowledgeable eyes looking at it.

- Development has become distributed, as other groups have taken responsibility for modules (e.g. expression, phenotype, and epidemiology), minimizing duplication of expensive software development efforts.

- The leverage of having an entire community involved in the effort increases interoperability through the adoption of common data standards and architectures.

- The modules within Chado will serve as reference common data model standards across GMOD.

The web site for Chado is http://www.gmod.org/chado.

The Chado modules, both existing and under current development, are listed below:

- general: General/core data used by all modules
- cv: Controlled vocabularies and ontologies
- pub: Publication, bibliographic, and any other attributions of data
- organism: Taxonomic data
- sequence: Anything related to biological sequences and annotation
- companalysis: Adjunct to the sequence module for computational analysis
- map: Adjunct to the sequence module for non-sequence localization of biological objects
- genetic: Phenotypic and genotypic information
- interactions: Descriptions of genetic and molecular interactions, and correlations between and amongst gene product expression and mutant phenotypes
- comparative: Extension to the sequence module for any inter- or intra-species comparisons at any level (e.g., synteny, peptide, haplotype data)

Chado makes extensive use of metadata, which is also modelled in the schema (in the cv module). In particular, the use of ontologies (or controlled vocabularies) is ubiquitous across the schema. Ontologies are indispensable as a means of specifying annotations at a high level of granularity. In addition, Chado uses ontologies as a means of typing entities and properties of those entities in the schema. This means that Chado can be more expressive and flexible than is typically the case for a relational database schema.

As a simple example, many genomics schemas (past and present) have specific tables for genes, transcripts, and exons that mirror the "central dogma." This is fine if the only data you are handling are protein-coding genes, but that era is passing, and such schemas must either a) fudge the differences between, for example, miRNAs and protein-coding genes, or b) continue to add new tables as new genomic feature types arise. By uncoupling the "type" from the physical storage, one can readily add new kinds of genomic features without any major overhaul of the schema.


There is always a trade-off between flexibility and normalization on the one hand and query simplicity and efficiency on the other. Chado chooses the former over the latter, with a view to allowing optional denormalizations in a report or warehouse database. This is the approach taken by the DCC, with Chado used as the central database for housing and data management, and denormalized schemas (InterMine, the GBrowse database, and BioMart) used for querying and for driving fast visualization.

An extremely simplified view of the sequence module, which is of central importance to the DCC, is shown in Figure 1. For more details visit the GMOD pages (http://www.gmod.org/).


Figure 1. In the Chado sequence model, a genomic feature has a type, which is specified by a term from the SO maintained in the cv module (not shown). It may also have references to external database identifiers or to synonyms. A sequence feature also has a location, which essentially is the start and end positions of that feature in reference to another feature, that is, to another sequence (since a feature defines a sequence). Sequence features may have other relationships to one another, as described by the Sequence Ontology.
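To make the ontology-typed feature idea concrete, the following minimal Python sketch mimics the spirit of the Chado feature/featureloc/cvterm design. It is illustrative only: the class and field names are simplified and hypothetical, and the real schema has many more columns and constraints.

    # Minimal sketch of the Chado-style typed-feature idea (illustrative only).

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class CVTerm:
        """A term from a controlled vocabulary, e.g. the Sequence Ontology."""
        cv: str      # e.g. "sequence"
        name: str    # e.g. "gene", "mRNA", "miRNA"

    @dataclass
    class FeatureLoc:
        """Locates one feature on another feature (the source feature)."""
        srcfeature: "Feature"
        fmin: int
        fmax: int
        strand: int  # +1 or -1

    @dataclass
    class Feature:
        uniquename: str
        type: CVTerm                        # typed by an ontology term, not by a table
        locations: List[FeatureLoc] = field(default_factory=list)

    def so(name):
        """Shorthand for a Sequence Ontology term."""
        return CVTerm("sequence", name)

    # A chromosome is itself a feature; other features are located on it.
    chrom = Feature("I", so("chromosome"))
    gene = Feature("example-gene", so("gene"), [FeatureLoc(chrom, 200000, 238000, -1)])

    # Adding a brand-new feature type requires only a new SO term, no new table:
    tfbs = Feature("site-42", so("TF_binding_site"), [FeatureLoc(chrom, 199850, 199862, +1)])
    print(tfbs.type.name, tfbs.locations[0].fmin, tfbs.locations[0].fmax)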

Chado was devised by Chris Mungall (FlyBase-Berkeley) and David Emmert (FlyBase-Harvard). Both will be available as technical consultants on the project (Mungall is part of the Lewis group).

C.4.b. GBrowse

GBrowse (Figure 2) is the GMOD web-based genome browser (Stein et al. 2002). Its user interface is similar to the familiar UCSC Genome Browser: using any standard web browser, users connect to a GBrowse site, search for genomic landmarks (e.g. chromosome names, cytogenetic bands, the name of a familiar gene), and then browse through the region of interest. A series of horizontal tracks shows the relationships among various types of genome annotations.

GBrowse supports many of the features of the UCSC browser, including the ability to show qualitative data (e.g. gene models), quantitative data (e.g. conservation scores), a form of semantic zooming in which the visual representation of a sequence feature changes as the

magnification changes in order to show increasing levels of detail, file upload, and user-configurable tracks. In addition, GBrowse has some features that the UCSC browser does not, including the ability to display remote annotation tracks from Distributed Annotation System (DAS) servers (Dowell et al. 2002), and the ability of the user to change the vertical order of the tracks. GBrowse also supports a richer set of visual displays than the UCSC browser; for example, it can display embedded images of organisms to demonstrate tissue-specific expression patterns, and it includes facilities for showing statistical data including box-and-whisker plots, pie charts, spectrograms, and DNA sequencing traces.

However, the main strength of GBrowse is its configurability and portability. GBrowse connects to the underlying source of sequence annotation data via a flexible series of adaptors. A DAS adaptor allows GBrowse to be run off a remote DAS source such as Ensembl, and a GenBank adaptor allows GBrowse to run off NCBI Entrez. For this project we will be using the Chado adaptor, which allows the software to display annotations stored in a Chado database, and the GFF3 adaptor, which allows GBrowse to run off a fast, partially denormalized representation of GFF3 data. The latter adaptor is part of recent versions of BioPerl (Stajich et al. 2002).




Figure 2: The GBrowse Genome Annotation Browser (from www.hapmap.org)

The main drawback of GBrowse is that it is written in an interpreted language (Perl), which means that its performance is not equal to that of the UCSC Genome Browser. As a result, GBrowse cannot display dense annotations like SNPs across an entire chromosome at once the way that the UCSC browser can. However, in practice it is rarely informative to view dense annotations in a chromosome-wide fashion, and GBrowse can use semantic zooming to convert dense features such as SNPs into histograms of feature density when the region of interest becomes too large.
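As an illustration of this density-histogram form of semantic zooming, the short Python sketch below bins feature start positions into fixed-width windows once a region exceeds a size threshold. It is not GBrowse code (GBrowse is written in Perl), and the threshold and bin count are arbitrary values chosen for the example.

    # Illustrative sketch of semantic zooming for dense features (not GBrowse code).

    import random

    def render_track(feature_starts, region_start, region_end,
                     detail_threshold=100_000, bins=50):
        """Return individual features for small regions, or a density histogram.

        Below detail_threshold bp the features are returned as-is for
        glyph-by-glyph drawing; above it they are summarized as bin counts.
        """
        span = region_end - region_start
        if span <= detail_threshold:
            return ("features", sorted(feature_starts))
        width = span / bins
        counts = [0] * bins
        for pos in feature_starts:
            if region_start <= pos < region_end:
                counts[int((pos - region_start) / width)] += 1
        return ("histogram", counts)

    # Example: 10,000 simulated SNP positions over a 2 Mb region.
    snps = [random.randint(0, 1_999_999) for _ in range(10_000)]
    mode, data = render_track(snps, 0, 2_000_000)
    print(mode, data[:10])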

GBrowse is installed at over 200 sites world-wide and is well supported by both the Stein group and volunteer contributors to the GMOD project.


C.4.c. Transporting annotations to new assemblies

Sequencing consortia periodically update the genome with new sequence and new assemblies. Transporting annotations that were developed on one assembly to a new assembly takes some care, especially if bases are inserted or deleted, or mapping errors are corrected. For types of annotations that can be easily recomputed, such as the mapping of mRNA to DNA, it is best to rerun the computations on the new assembly. Often, however, annotations cannot be easily recomputed. For mature genome assemblies such as are available for C. elegans and D. melanogaster, the changes between assemblies are small, and the vast majority of annotations can be cleanly mapped from one assembly to another. UCSC has developed a system, Lift Over, which does this mapping automatically on many annotation file formats. This system, and our proposed use of it, is described in more detail in section D.2.a.i, "Handling Genome Build Changes."

C.4.d. Whole genome alignments

Multiple genome alignments provide a wealth of useful information in themselves, and are the basis of many other types of analysis. These include conservation plots to highlight critical regions, and programs that look for compensatory mutations indicative of RNA secondary structure, such as is found in microRNAs. Producing multiple genome alignments is an exceedingly computationally intensive job, and displaying them interactively can also be a challenge.

Webb Miller's group at Penn State and Jim Kent's group at UCSC have jointly developed a multiple alignment pipeline that was used for human ENCODE. This pipeline consists of a number of steps, many of which are computationally intensive, and which for best results must be tuned for the particular species involved in the alignment.

1. Pairwise alignments: these must be done between the 'reference species' and each other species. Pairwise alignments also need to be done between any species that are more closely related to each other than they are to the reference species. For modENCODE, the reference species would be C. elegans and/or D. melanogaster.

   a. Find or create a DNA alignment matrix appropriate for the two species being aligned. This is analogous to selecting the correct BLOSUM matrix for a protein alignment.

   b. Mask out repetitive sequence with RepeatMasker and Tandem Repeat Finder (trf).

   c. Break up sequences into chunks of approximately 10 Mb that overlap by 100 Kb (a sketch of this chunking follows the list).

   d. Run the blastz program on all pairs of chunks.

   e. Gather all the blastz output together, changing from 'chunk' back to chromosome coordinates.

   f. Run axtChain to merge local alignments into longer alignments, and to take out redundant alignments from overlapping parts of chunks.

      i. Figure out the nonlinear gap penalties to use. This is analogous to the gap open/gap extend penalties used in a protein alignment, but gaps must be treated with more care in genome/genome alignments than in protein/protein alignments, since they can be so much longer.

      ii. Convert from blastz's lav format to axt format, and split the data by chromosome in the reference sequence.

      iii. Run axtChain on each chromosome.

   g. Use chainNet and netFilter to separate out the orthologous alignments from other alignments.

2. Multiple alignment

   a. Create a directory for each 10 Mb of the reference genome containing all alignments involving regions orthologous to that region.

   b. Construct a phylogenetic tree of the species being aligned.

   c. Use the multiz program to produce a multiple alignment progressively up the phylogenetic tree.

   d. Merge the 10 Mb chunks, removing overlapping areas, and moving back to chromosome coordinates.

   e. Use the multiple alignment to make a phylogenetic tree with accurate branch lengths.

   f. Run the phastCons program to create conservation scores for each base in the reference genome.
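Step 1c (and the merge back to chromosome coordinates in steps 1e and 2d) amounts to straightforward coordinate bookkeeping. The following Python sketch is purely illustrative: it cuts a chromosome into roughly 10 Mb chunks with 100 Kb overlaps and maps a position within a chunk back to chromosome coordinates.

    # Illustrative sketch of the chunking used for pairwise alignment jobs (step 1c)
    # and of lifting chunk coordinates back to the chromosome (steps 1e and 2d).
    # Sizes follow the text: ~10 Mb chunks with 100 Kb overlap.

    CHUNK = 10_000_000
    OVERLAP = 100_000

    def make_chunks(chrom_length):
        """Yield (chunk_start, chunk_end) pairs in 0-based chromosome coordinates."""
        start = 0
        while start < chrom_length:
            end = min(start + CHUNK, chrom_length)
            yield (start, end)
            if end == chrom_length:
                break
            start = end - OVERLAP          # successive chunks overlap by 100 Kb

    def to_chrom_coords(chunk_start, pos_in_chunk):
        """Convert a position within a chunk back to chromosome coordinates."""
        return chunk_start + pos_in_chunk

    chunks = list(make_chunks(30_500_000))       # e.g. a 30.5 Mb chromosome
    print(len(chunks), chunks[:2])
    print(to_chrom_coords(chunks[1][0], 12345))  # an alignment hit at offset 12345 in chunk 2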

This pipeline has already been applied to 7 flies on the D. melanogaster genome browser at genome.ucsc.edu, but outside of modENCODE there is no funding available to continue the fly and worm comparative genomics work at UCSC.

The UCSC chain, net, and conservation tracks are some of the most popular features of that browser. The chain track is especially useful for looking at gene families that have expanded or contracted independently in different species. The net track is useful for focusing on orthologous alignments and for highlighting large-scale genomic rearrangements. The conservation track is quite rich. At the top of the track is a plot of overall sequence conservation. Underneath is a display containing a line for each species in the alignment. At high levels of zoom these lines display bases and, in coding regions, amino acids. Conserved residues are drawn more darkly than nonconserved residues. When zoomed further out, the species lines show conservation plots between the reference genome and the species in that line. The user can control which species to display in the species lines, and customize many other aspects of the display. The track shows insertions, deletions, break points, and missing data as well as conservation levels.


D. RESEARCH METHODS

D.1. SPECIFIC AIM 1: WE WILL PROVIDE A PROJECT WORKFLOW MANAGEMENT SYSTEM THAT WILL COMPILE, TRACK, AND ASSURE THE QUALITY OF THE DATA BEING GENERATED BY DIFFERENT MODENCODE RESEARCH PROJECTS

Our submission process mirrors the successful model of the UCSC human ENCODE project, but with some extensions. One characteristic of both the worm and fly communities that we must accommodate is the active dialog between the primary data generators and the data managers. By participating in this dialog we will facilitate greater consistency between annotations generated by different techniques, and reach community consensus on the biological significance of results. Another novel aspect is the need to make the activities of the DCC compatible with the needs of the worm and/or fly model organism databases.

The high-level architecture of the proposed DCC is shown in Figure 3.

Figure 3. A high-level view of the DCC architecture, color-coded by area of responsibility: Berkeley in gold, Cambridge in blue, and CSHL in green.

As described in more detail in subsequent sections, the data providers and data managers from the DCC will iteratively develop a "contract" for the format of the data to be produced, its metadata, and its quality control tests. Data producers will upload their datasets to a staging server, the "data marshaling area," where the management system will run a series of tests to validate the syntax of the upload as well as its semantic consistency according to the quality control tests agreed to in the contract. If the uploads validate, producers will be able to view the dataset in an instance of GBrowse and flag their approval. The data will then be moved to a production database based on Chado/InterMine, where it will be integrated more fully with other modENCODE datasets as well as data imported from WormBase, FlyBase, and other reference databases. A website hosted on the production database will be the main community portal to modENCODE data, and will provide a GBrowse-based graphical interface to the data, data-mining and report-generation facilities based on InterMine and BioMart respectively, and bulk file-based downloads. The production database will also serve as the staging area for offloading the datasets into WormBase and/or FlyBase.

D.1.a. Uploading, validation, and review

Data acquisition, processing, validation, and review will be the primary responsibility of the Berkeley group, who will manage the data marshaling area and post-processing pipeline, and of Data Managers working out of the Berkeley and CSHL groups, who will be responsible for liaison with the data providers and for data review.

D.1.a.i. Data Provider Identification and Data Submission

The DCC will have three Data Managers on staff. These will be PhD-level biologists with a background in bioinformatics. Such individuals commonly have experience as curators for model organism databases. At the beginning of the project, each data provider will be paired with a DCC Data Manager whose scientific background and interests match well. Data Managers will work with their assigned data providers throughout the lifetime of the project to ensure the satisfactory transfer of information produced by the data producer into the DCC. At the beginning of the project, Data Managers will work with providers to develop a "contract" which specifies the format of the data, the identifiers used in the data, and the quality control tests necessary to ensure the completeness and consistency of the particular data set. Later, the Data Managers will shepherd data sets into the marshaling and production databases, monitor the outcome of the QC tests, help the data producers with problems that arise during submission and subsequent processing, and act as liaisons between the data providers and the software developers responsible for data set query and visualization.

Project profiles: An early task of the Data Managers will be to gather, in electronic form, the fundamental information that describes each of the different modENCODE projects. We will work with the combined set of modENCODE projects to define common information that all projects must provide to create their project's profile, as well as information that is specific to each dataset.

As in the human ENCODE project, project profiles will include: a synopsis describing the project; a brief description of the types of data analysis being done; home and help page URLs; acknowledgements of the individuals or organizations involved in data collection and analysis; any related literature references; and other project-level information.

Project WIKI: Communication between the DCC and the providers, as well as communication among the providers, will be facilitated by a WIKI (collaborative web site) to which each modENCODE participant will have password-protected write access. We installed a trial WIKI while the various modENCODE projects were first being discussed and found it to be quite useful as a vehicle for collaboration during preparation of the proposals. For interested reviewers, the trial WIKI is located at http://www.wormbase.org/modencode/wiki, username "mod", password "encode".


Development of a dataset contract and dataset registration: The data managers will work with each modENCODE data provider to develop submission formats for each of their data types. Some data types, such as sequence- and microarray-based data, will use standard submission formats such as GFF3. Other types, such as anatomic expression patterns, will need extensions and/or new submission formats. Whenever possible, we will include in the upload formats a set of one or more URLs that will act as cross-references between the data display at the DCC and related information provided by the providers. For instance, the provider might provide a link to a stock center page that describes where a transgenic strain carrying a promoter::GFP fusion construct may be ordered.

The data managers will also work out with each provider a set of quality control tests to be run on their datasets following submission. These tests will usually be performed locally at the DCC, but some tests might be hosted by the provider and accessed remotely via, e.g., a web services interface.

Together, the profile, the submission format, and the QC tests will form a data exchange contract between the DCC and the provider. The dataset will then be registered with the DCC and the provider given a private web page that it can use to upload datasets, view the status of its uploaded sets, and flag datasets for revision, retraction, or approval for public release.

Data Submission. Data providers will use their web pages to upload datasets in the format agreed upon in the contract. Each dataset submission will be given a unique tracking number, required for each unique combination of project ID, data type, data format, and submission date. The DCC will track individual submissions by incrementing a primary release number for the entire dataset after each submission is made public, and incrementing a sub-release number for each revised submission of the dataset during quality control.

Note that the genomic scope of any particular dataset will be defined as part of the contract. Some data producers may choose to provide whole-genome datasets, while others may prefer to provide a series of chromosome- or chromosome arm-specific datasets. This means that the DCC will be able to treat each submission as a complete dataset and will not have to be concerned about combining multiple separate submissions. The data producers will be able to flag whether any particular submission is "complete" or "partial" so that end users will know whether to expect further data.

Data Previewing. Data providers will be able to preview their data in two ways. First, for data types that can be anchored to the genome, providers will be able to upload representative portions of their data to the genome browser running on the production database using the standard GBrowse upload facility. This will allow providers to tweak their data to their liking without making a formal submission.

Second, after data providers upload their sequence to the marshaling area and the basic syntax checks are completed, their dataset will be placed in a temporary DAS server database accessible to the production database GBrowse. Providers will then be given a private URL that they can use to preview their data. This will allow providers and selected associates to view the entire dataset in the context of all other annotations.


D.1.a.ii. Data Processing and Quality Assurance

After a dataset is uploaded, the responsible data manager will be alerted by a script that manages the file upload area. He or she will then shepherd the dataset through a series of QC checks using a combination of automated, manual, and ad hoc steps.

The QC checks include: syntax errors; use of the correct genomic sequence; arithmetic errors; and identifier errors. The syntax test is the simplest; it simply confirms that the data conform to the format specification, and it can be run automatically when the dataset is first uploaded. For GFF3-formatted data, which we anticipate will make up the bulk of the modENCODE datasets, we will use a GFF3 syntax validation program developed by the Stein laboratory and currently used to validate WormBase GFF3 files prior to public release. Syntax-checking scripts for other file formats will be developed as needed.
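By way of illustration, a minimal per-line GFF3 syntax check might look like the Python sketch below. This is not the Stein laboratory validator referred to above, only a sketch of the kind of test involved (nine tab-separated columns, integer coordinates with start not greater than end, and a valid strand field).

    # Minimal sketch of a GFF3 per-line syntax check (illustrative only;
    # the production validator mentioned in the text is far more complete).

    VALID_STRANDS = {"+", "-", ".", "?"}

    def check_gff3_line(line, lineno):
        """Return a list of error strings for one non-comment GFF3 line."""
        errors = []
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            return ["line %d: expected 9 tab-separated columns, got %d"
                    % (lineno, len(cols))]
        seqid, source, ftype, start, end, score, strand, phase, attrs = cols
        if not (start.isdigit() and end.isdigit()):
            errors.append("line %d: start/end must be integers" % lineno)
        elif int(start) > int(end):
            errors.append("line %d: start > end" % lineno)
        if strand not in VALID_STRANDS:
            errors.append("line %d: bad strand %r" % (lineno, strand))
        return errors

    # Typical use over a submitted file:
    # with open("submission.gff3") as fh:
    #     for i, line in enumerate(fh, 1):
    #         if line.startswith("#") or not line.strip():
    #             continue
    #         for err in check_gff3_line(line, i):
    #             print(err)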

Genomic sequence confirmation tests confirm that data coordinates are relative to the currently approved genomic assemblies of either FlyBase or WormBase, as appropriate. This is important because of the prevalence of arithmetic errors in sequence coordinates, most famously the "off by one" error that can lead to the deletion or insertion of a nucleotide at either the 5' or 3' end of an annotation element. To identify such errors, data managers will develop scripts that check that dataset features are consistent with other features on the genome. For example, we will confirm that proposed gene models contain long open reading frames and that transcription factor binding sites identified by ChIP-chip include motifs that are similar to the expected TFBS motifs.
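The open-reading-frame check mentioned above could be sketched as follows. This is illustrative Python rather than the DCC's actual scripts: it scans a spliced transcript sequence (forward strand only, for brevity) for the longest ATG-initiated reading frame, and the 100-codon threshold is an arbitrary placeholder.

    # Illustrative sketch of a "does this gene model contain a long ORF?" check.
    # Not production QC code; the 100-codon threshold is a placeholder.

    STOPS = {"TAA", "TAG", "TGA"}

    def longest_orf(seq):
        """Length in codons of the longest ATG-initiated ORF in the three forward frames."""
        seq = seq.upper()
        best = 0
        for frame in range(3):
            codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
            start = None
            for idx, codon in enumerate(codons):
                if start is None and codon == "ATG":
                    start = idx
                elif start is not None and codon in STOPS:
                    best = max(best, idx - start + 1)
                    start = None
        return best

    def check_gene_model(name, spliced_sequence, min_codons=100):
        """Return an error message if the model lacks a long ORF, else None."""
        if longest_orf(spliced_sequence) < min_codons:
            return "%s: longest ORF shorter than %d codons" % (name, min_codons)
        return None

    print(check_gene_model("example-gene", "ATG" + "GCT" * 200 + "TAA"))  # passes -> None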

Data managers will also write and run QC scripts to validate public and project-specific identifiers. Public identifiers refer to auxiliary data associated with individual annotations that is housed in public repositories such as NCBI, dbEST, or InterPro. These identifiers must be formatted correctly and must be confirmed to exist at the public resource. Similarly, project-specific identifiers must also be well formatted (fit the regular expression defined in the data exchange contract) and be unique. The programs needed for these tests are not complicated, but these basic sanity checks cannot be overlooked.
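A sketch of the project-specific identifier check is shown below; the regular expression is a made-up example of the kind of pattern a dataset contract might specify, not a real modENCODE convention.

    # Illustrative identifier checks. The pattern is a hypothetical example of
    # what a dataset contract might specify.

    import re

    CONTRACT_ID_PATTERN = re.compile(r"^MODENC_[A-Z]{2}\d{6}$")   # hypothetical pattern

    def check_identifiers(ids):
        """Return error messages for malformed or duplicated identifiers."""
        errors, seen = [], set()
        for ident in ids:
            if not CONTRACT_ID_PATTERN.match(ident):
                errors.append("malformed identifier: %s" % ident)
            if ident in seen:
                errors.append("duplicate identifier: %s" % ident)
            seen.add(ident)
        return errors

    print(check_identifiers(["MODENC_AB000001", "MODENC_AB000001", "bad-id"]))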

Additional tests may also be required for particular datasets and will be defined and implemented to detect problems such as unexpected overlap (or non-overlap) between annotation elements that are particular to different data types. For example, the exons of a predicted transcript must not overlap and must be on the same strand. If two ORFs overlap and share the same reading frame, then they should have identical gene identifiers. The tests appropriate for each data producer and the datasets they provide will be established during the registration process and refined by the data managers as the datasets come in.
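The exon consistency test mentioned above might be sketched as follows (illustrative Python; exons are represented as (start, end, strand) tuples in 1-based inclusive coordinates, an assumption made for the example):

    # Illustrative check that a transcript's exons do not overlap and share a strand.

    def check_transcript_exons(transcript_id, exons):
        """exons: list of (start, end, strand) tuples, 1-based inclusive coordinates."""
        errors = []
        strands = {strand for _, _, strand in exons}
        if len(strands) > 1:
            errors.append("%s: exons on mixed strands %s" % (transcript_id, strands))
        ordered = sorted(exons, key=lambda e: e[0])
        for (s1, e1, _), (s2, e2, _) in zip(ordered, ordered[1:]):
            if s2 <= e1:                      # next exon starts before the previous one ends
                errors.append("%s: exons (%d-%d) and (%d-%d) overlap"
                              % (transcript_id, s1, e1, s2, e2))
        return errors

    print(check_transcript_exons("tx-1", [(100, 200, "+"), (150, 300, "+")]))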

All test scripts developed by the DCC will be made available as standalone utilities so that the data providers can run them as part of their own QC process, at their discretion.

D.1.a.iii. Data processing workflow

We will develop a simple web-driven tracking database to help data managers keep track of each of the datasets. Each registered dataset will be assigned a status, which will be updated as the dataset goes through the various phases of uploading, QC, and release. Table 1 describes the status values.


Registered: Data producer has entered project and dataset profile
Ready for syntax check: Data is available in the DCC staging area
Ready for QC: Dataset has passed the syntax check and is ready for the QC process
QC in progress: Dataset has begun the DCC QC process
QC enquiry: Dataset did not pass QC and the data producer has been notified of errors to correct
QC passed: Dataset passed QC and the data producer has been notified to sign off on release
Public: Dataset is available for download
Retracted: Dataset has been retracted by the producer for further work
Full release: Dataset is available on all DCC public interfaces
Retired: Dataset has been replaced by a more recent release

Table 1: Dataset status values


Figure 4. Data uploading and processing workflow

The data processing workflow is shown in Figure 4. When a dataset is first registered, but before an upload has occurred, its status is "Registered." After it is uploaded, the script that manages the upload changes its status to "Ready for syntax check," indicating that it is ready for automatic syntax checking.

A watchdog process will periodically interrogate the tracking database for all datasets that are ready for syntax checking and automatically launch the appropriate syntax-checking script. Datasets that pass the syntax check will be flagged by the watchdog process as "Ready for QC," and the process will alert both the data manager and the provider that the upload was accepted. Those that do not pass will have their status set to "QC enquiry," and the data manager and provider will be alerted that the dataset has problems that caused the syntax check to fail.

Datasets that have passed the syntax check will also be automatically loaded into a DAS database running on the lightweight BioPerl GFF database. This will allow the data manager and the provider to preview the dataset in the context of other genomic annotations, as described in more detail below.

The responsible data manager now begins the QC process by setting the status of the dataset to "QC in progress." He or she tests the dataset for consistency using the various general and dataset-specific tests described earlier. If the data manager is satisfied that the dataset is consistent, he or she will set its status to "QC passed" and notify the data provider that the dataset is ready for final review. Otherwise, he or she sets its status to "QC enquiry" and asks the provider for assistance. The provider may suggest one or more patches to the dataset, or submit an entirely new file to replace it. In the latter case, the dataset is marked "Retracted" until its replacement is submitted.

"QC passed" datasets will require the final approval of the provider before the dataset is released to the public. From their profile page, providers will be able to preview their dataset as it will appear on the modENCODE web site. They will be able to adjust the appearance (glyph, color, height, labels, semantic zooming properties) of their track and, working with the data manager, design the most appropriate visualization for their dataset. When satisfied, the data provider will push an "Approve" button on their profile page, thereby switching the dataset status to "Public."

When a dataset is released by the data provider, its DAS server track will be made publicly available, and the dataset itself will enter the monthly modMine production build process described in section D.2. Up to this point a data provider may work with the manager to make minor edits without creating a new version, or even choose to retract a dataset from the staging area. Once the dataset is made public it cannot be changed in any way. Instead, the data provider must "retire" the dataset and submit a new version to replace it.

Under special circumstances we will allow a released dataset to revert to a previous version while the current version is undergoing a repair process. This circumstance will be made known to the public.

The status of every data submission will appear on the data producer's profile page. The data provider can click through to reach either the QC report or the visual GBrowse display. We will also provide management pages that list the status of all of the datasets for the entire modENCODE project.

The workflow tracking database will use a PostgreSQL backend. The middleware layer and web-based front end will be written in Perl, using the standard DBI module for database access and the CGI module for form generation and file upload.

D.2. SPECIFIC AIM 2: WE WILL INTEGRATE THE MODENCODE DATASETS WITH INFORMATION FROM THE MODS AND OTHER ONLINE RESOURCES

After data is marshaled, it will be integrated into a production database called "modMine." From our conversations with potential modENCODE participants, we expect to see a greater diversity of data types in modENCODE than in the human ENCODE pilot (see Appendix A for a list of the anticipated types), and we therefore need a richer and more interconnected data model than was used for the human ENCODE pilot. For this reason, modMine will use Chado for the underlying relational schema and InterMine for its user interface and high-performance query processing.

modMine will be initialized using baseline data from WormBase and/or FlyBase (see section D.4), including the canonical gene list, proteins, protein GO terms, and other ontologies. These will be supplemented with key data sets that we believe will be needed by modENCODE participants, including:

- Orthology data from InParanoid (REF) and Homophila (REF) to relate genes of the Drosophilids, nematodes, and human.

- Worm and/or fly protein-protein interaction data from IntAct (REF).

- Transcription factor binding site motifs from ORegAnno (REF).

These datasets will be loaded into modMine using scripts that are already in production use for FlyMine, and will become available for browsing by both modENCODE participants and the public as early as the second quarter of the first year of the project (see section J, Milestones).

When a new modENCODE dataset is approved by the data providers, the marshaling area status tracking database will flag the dataset as ready for incorporation into modMine. It will then be incorporated into the next modMine build, which will occur on a monthly basis. Software engineers in the Micklem group will examine each of the dataset contracts and develop the best data model for representing them under Chado; the InterMine build process first translates data from its source formats into objects that conform to this target data model. They will then identify common identifiers among the datasets that can be used to integrate them: each source is loaded through an integration step in which objects are merged according to defined criteria. For example, protein information from InterPro and IntAct would be integrated according to their shared UniProt identifiers. When two data sources provide overlapping data, their relative priority can be defined at any degree of granularity, so that the same integrated database is produced regardless of the order in which the data sources are loaded.
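The merge-by-shared-identifier step can be illustrated with a small Python sketch. The record fields and source contents are simplified, hypothetical examples, not the actual InterMine object model; the point is only that keying on a shared accession makes the merged object set independent of load order.

    # Illustrative sketch of merging records from two sources on a shared
    # identifier (here a UniProt accession). Field names are simplified examples.

    def merge_by_uniprot(*sources):
        """Merge dicts keyed by UniProt accession; earlier (higher-priority) sources win."""
        merged = {}
        for source in sources:
            for accession, record in source.items():
                target = merged.setdefault(accession, {})
                for key, value in record.items():
                    target.setdefault(key, value)   # keep any value set by a higher-priority source
        return merged

    interpro = {"P12345": {"domains": ["PF00001"]}}
    intact = {"P12345": {"interactions": ["Q99999"]},
              "P67890": {"interactions": ["P12345"]}}

    print(merge_by_uniprot(interpro, intact))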

For data that is submitted to a primary repository, such as microarray files that are deposited in GEO (Barrett et al. 2005), the build procedure will include validation of the provided accession number by fetching the data from the repository and then checking that the metadata contained in the record is consistent with what we know about the dataset.


Much of the data generated by modENCODE will be genome annotation, which is routinely
stored and manipulated by Chado.


However, we anticipate the need to model several non-genomic data types, including
phenotypes to describe the consequences of RNAi, deletion or insertional mutagenesis
experiments; anatomical hierarchy data to describe the time and place of gene expression
experiments; images and movies which record the behaviour of GFP::gene fusion constructs
during development; position weight-matrix motifs generated by, for instance, binding a
transcription factor to a hairpin double-stranded DNA sequence microarray; as well as data
types corresponding to such reagents as strains, clone libraries, microarray descriptions,
and antibodies.

The Chado schema already handles many of these data types, but in those cases in which the
schema needs to be extended, we will make the appropriate improvements and disseminate
them to the GMOD consortium. All major changes will be approved by Chris Mungall, a
member of our team and one of the core Chado developers.

ModMine will be the basis for the modENCODE web site described in the next section. We
will use it to generate denormalized databases for the GBrowse interface as well as bulk data
dumps for community downloads.

D.2.a.i. Handling genome build changes

While both the fly and worm genomes are complete, it is conceivable that these may change
(e.g. extensions into the heterochromatic regions of the fly) over the course of the project. In
this case it will be desirable to transfer all annotations performed on the previous genomic
build into the coordinate system of the current one. To cover this eventuality we have
arranged to use the Lift Over system developed for the human ENCODE project, which makes
use of whole-genome alignments to transfer annotations among genome versions. Expertise
in generating such alignments and transferring annotations will be provided by the UCSC
subcontract.

Lift Over starts with an alignment between the old assembly and the new assembly. This is
done using BLAT with a special fastMap option to produce local alignments, and then tools
(axtChain and chainNet) developed for finding orthologous regions between species. The
alignments are run at very stringent settings, and the result is an alignment in chain format
that describes, for each base in the old assembly, the corresponding base in the new
assembly. Once the alignment chains are produced a separate program, liftOver, maps
annotations from one assembly to another using the chains as a guide. LiftOver produces two
outputs: a file containing the annotations that mapped cleanly, and another file containing
the annotations that did not map cleanly, along with comments on why the mapping failed.
Reasons why a mapping can fail include:

- The annotation falls in a region that was removed in the new assembly (possibly because it was contamination).
- The annotation falls in a region that is duplicated in the new assembly.
- The annotation falls in a region that is split into two regions in the new assembly.

After running liftOver we will ask the original data providers to look at any annotations that
don't map cleanly, and submit revised annotations if necessary.
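
As a sketch of how this step would be driven in practice (file names are placeholders, and the
over.chain file is assumed to have been built beforehand with the blat -fastMap, axtChain,
chainNet and netChainSubset steps described above), the remapping and the report returned
to the data provider might look like this:

    #!/usr/bin/perl
    # Sketch: remap one dataset with liftOver and summarise the failures so the
    # original data provider can review them. File names are placeholders.
    use strict;
    use warnings;

    my $chains   = 'oldToNew.over.chain';       # built from the whole-genome alignment
    my $input    = 'dataset_old_coords.bed';
    my $mapped   = 'dataset_new_coords.bed';
    my $unmapped = 'dataset_unmapped.bed';

    system( 'liftOver', $input, $chains, $mapped, $unmapped ) == 0
        or die "liftOver failed\n";

    # In the unmapped file, a '#<reason>' comment line precedes each record that failed.
    open my $fh, '<', $unmapped or die "Cannot read $unmapped: $!\n";
    my ( %reason_count, $reason );
    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^#\s*(.+)/ ) {
            $reason = $1;
            $reason_count{$reason}++;
        }
        elsif ( length $line and defined $reason ) {
            print "REVIEW\t$reason\t$line\n";    # one line per annotation to send back
        }
    }
    close $fh;

    print "Summary of liftOver failures:\n";
    printf "  %6d  %s\n", $reason_count{$_}, $_ for sort keys %reason_count;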

To adapt the Lift Over system for use in modENCODE, we will need to extend it to read and
write the annotation file formats used by the project. These changes will be relatively easy to
accomplish. LiftOver already supports GFF version 2, so extension to GFF3 should involve no
more than a few hundred lines of C code. If support for Chado XML is needed as well, this
will be more work, but with the help of the autoXML code generator even that should take no
more than two programmer-weeks.

Alternatively, we can use modMine facilities to dump annotations from the production
database's Chado format into GFF3, and run liftOver on that.

D.2.a.ii. Multiple alignments

Deep multiple whole-genome alignments (WGA) among vertebrates were a key information
resource for the human ENCODE project; for this reason we place high priority on having
high-quality WGA data from fly and/or worm available in the modENCODE database at an
early stage of the project. Currently sequencing data on 12 Drosophila species and 3
Caenorhabditis species is available, with information on an additional 2 Caenorhabditis
species expected in 2006.

We plan to produce a multiple genome alignment of at least 12 Drosophila species and/or a
similar number of Caenorhabditis species, as they become available, depending on the
choices of organism(s) for modENCODE. The alignments will contain selected other
insect/nematode species, if available, as outgroups. These alignments will be produced at
UCSC using the Penn State/UCSC pipeline. The pipeline takes approximately one month to
run for 12 vertebrate species. We estimate that it will take approximately a week to run for
12 fly or worm genomes.[2]

[2] One might think that, since vertebrate genomes are 20-30 times as large as fly and worm
genomes, the alignment would take only a day to run. However, the pipeline discards
non-aligning sequence fairly rapidly, and because flies and worms are much more gene-dense
than vertebrates, much more of the fly and worm sequence survives into the later stages of
the pipeline than is the case with vertebrates.


Though the Penn State/UCSC pipeline is well established, it is still significant work to run it.
The jobs must be run on computer clusters rather than single CPUs, and the size of the files
produced is enormous even using state of the art compression. The parameters of many of
the programs need to be tuned to the species involved for best results. Because of these and
other complications we are budgeting 0.5 FTE of a software engineer in the first year, and
0.1 FTE in subsequent years. During the first year we plan on producing at most two
revisions each of fly and worm sequences, and in subsequent years only one, as needed.
These revisions will incorporate additional species and improved assemblies.

Multiple genome alignment is a topic of much research, and UCSC and Penn State are not the
only groups doing this research. As with the human ENCODE project, in modENCODE we
will store and display multiple alignments from other groups producing multiple genome
alignments, if such groups emerge. A (non-UCSC) Data Manager will work with each
alignment group as part of the standard contract and data marshaling process. The Data
Manager will trade notes with the QC group at UCSC, who had a similar role in the human
ENCODE project. The aligners will deliver their alignments in MAF format as described at
http://genome.ucsc.edu/FAQ/FAQformat.html#format5. The UCSC/Penn State work will
serve as a baseline of comparison for other alignment groups, and ensure that at least one
high quality multiple alignment for worm and/or one for fly will be made.

Creating chain, net, and conservation tracks for GBrowse will involve significant work. Since
the UCSC Genome Browser is written in C, and GBrowse is largely in Perl, it will take some
work to transport these tracks to the GBrowse environment. Since GBrowse already calls
several C modules, it will be possible to reuse a portion of the UCSC code. For this reason
UCSC has selected Angie Hinrichs, an engineer very proficient in Perl as well as very familiar
with the UCSC C code base, to do the gluing together. It is likely to take her 3-4 months to do
the necessary melding in a mixture of computer languages.

Another delicate aspect of the work will be integrating the multiple genome alignments into
the Chado schema. See section D.5.a for more analysis of this integration in the context of
integrating other code and methods from the human ENCODE project into modENCODE.

D.3. SPECIFIC AIM 3: WE WILL PROVIDE THE RESEARCH COMMUNITY WITH MULTIPLE EFFICIENT AND UNENCUMBERED MECHANISMS FOR DELIVERING DATA ABOUT FUNCTIONAL ELEMENTS IN THE DCC

A web site hosted on the production database will be the public portal to the modENCODE
project. It will feature static pages describing the project and the datasets (using the dataset
profiles developed under Specific Aim #1), as well as pointers to resources of interest to the
community, including the human ENCODE project, the worm and/or fly MODs and other
high-value web sites.

The production database web site will allow researchers to browse the modENCODE datasets
visually, to perform sophisticated queries against them, to generate customized reports, and
to download the datasets in bulk.

D.3.a. Genome Browser

Graphic browsing of the modENCODE datasets will use the GBrowse genome browser, a
piece of software that the Stein group is intimately familiar with. GBrowse will display tracks
corresponding to reference annotations from the MODs, including gene models, common
reagents such as cloned cDNAs and PCR primers, and well-known landmarks such as genetic
markers. To this we will add all modENCODE-specific tracks that have genomic coordinates.
This includes mapped RACE and TecRed sequencing reads, RT-PCR results, splicing array
results, putative small RNAs, whole-genome transcription profiling, the locations of
constructs used for promoter::GFP fusions, ChIP/chip and/or ChIP-PET binding sites and
their binding intensities, locations of DNAseI-hypersensitivity sites, the locations of modified
histones, and so forth. We will also provide tracks that show the amount of evolutionary
conservation across a region of interest, using the multiple sequence alignment data from
comparative sequencing.

Users will be able to take advantage of all of GBrowse's standard features, including
configuring tracks, changing their order, generating publication-quality images, bookmarking
views, and generating a variety of quick reports such as multiple alignments and colorized
sequence files of a region of interest. They will be able to upload private data sets and view
them in the context of the modENCODE annotations, and view remote annotations (such as
tracks from WormBase and FlyBase) via DAS. As described earlier, we will also use this
facility to allow modENCODE data providers to preview their data before publication.

Although GBrowse could be run directly from Chado/InterMine, we do not have experience
with its performance in this configuration under heavy load. For this reason, we think it
safest to drive GBrowse from a read-only denormalized database backend using the BioPerl
GFF schema, a configuration that has been thoroughly tested with large data sets. We will
regularly make GFF3-format dumps from the production database, and load them into the
GBrowse backend.
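
The dumps themselves reduce to a straightforward query over Chado's feature tables. The
sketch below assumes a DBI handle on the production database and a deliberately simplified
view of the feature/featureloc/cvterm join; the real dumper would also emit attributes, parent
relationships, and dataset provenance:

    #!/usr/bin/perl
    # Minimal sketch of dumping located features from Chado as GFF3.
    # Connection details and the simplified join are illustrative only.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=modmine', 'gffdump', '',   # hypothetical credentials
                            { RaiseError => 1, AutoCommit => 1 } );

    my $sth = $dbh->prepare(q{
        SELECT src.uniquename AS seqid,
               t.name         AS type,
               fl.fmin, fl.fmax, fl.strand,
               f.uniquename   AS feature_id
        FROM   feature f
        JOIN   featureloc fl ON fl.feature_id  = f.feature_id
        JOIN   feature src   ON src.feature_id = fl.srcfeature_id
        JOIN   cvterm t      ON t.cvterm_id    = f.type_id
    });
    $sth->execute;

    print "##gff-version 3\n";
    while ( my $r = $sth->fetchrow_hashref ) {
        my $strand = !defined $r->{strand} ? '.' : $r->{strand} < 0 ? '-' : '+';
        # Chado stores interbase coordinates; GFF3 is 1-based, hence fmin + 1.
        print join( "\t",
            $r->{seqid}, 'modENCODE', $r->{type},
            $r->{fmin} + 1, $r->{fmax}, '.', $strand, '.',
            "ID=$r->{feature_id}" ), "\n";
    }
    $dbh->disconnect;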

One issue with this plan is that the GFF3 backend does not lend itself to dynamic updating;
to update or retract a dataset it is usually easiest to clear and reload the entire backend from
scratch. To simplify management, we will maintain many of the larger datasets within
independent DAS servers, each of which will have a standalone database backend. Each DAS
server will run on top of a GFF3 database containing one to a small number of datasets, and
will be located on a different machine than GBrowse and its main backend. By spreading the
databases among multiple backends, we will reduce load on the main server and increase
performance for the end user. This architecture also gives us the flexibility to create
additional GBrowse servers for load balancing without duplicating the backend database.

To further improve GBrowse performance, we will implement a change to the image
rendering algorithm. Instead of rendering each track serially in a single bitmapped image, as
is currently done, we will modify the GBrowse graphics library so that the tracks are rendered
in parallel as separate bitmaps and then stacked on top of each other on the HTML page. This
architectural change will allow us to make efficient use of multiprocessor servers and to
offload the rendering of particularly feature-dense tracks onto dedicated independent
processors. The change will also allow us to implement a feature often requested by users:
the ability to change the order of tracks by dragging a track and dropping it into the desired
position. Currently users change track order using a menu-heavy track configuration page.
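
A minimal sketch of the per-track parallelism follows; render_track() is a hypothetical
stand-in for the real glyph-drawing code, and each child process writes its track to its own
bitmap, which the page then stacks:

    #!/usr/bin/perl
    # Sketch of rendering GBrowse tracks in parallel as separate bitmaps.
    # render_track() is a placeholder for the real GD-based glyph code.
    use strict;
    use warnings;

    my @tracks = qw(gene_models chip_chip conservation tiling_array);

    my %children;
    for my $track (@tracks) {
        my $pid = fork();
        die "fork failed: $!\n" unless defined $pid;
        if ( $pid == 0 ) {                         # child: render one track, then exit
            render_track( $track, "$track.png" );
            exit 0;
        }
        $children{$pid} = $track;
    }

    # Parent: wait for every child, then stack the per-track images on the page.
    while (%children) {
        my $pid = wait();
        last if $pid == -1;
        delete $children{$pid};
    }
    print qq(<img src="$_.png" alt="$_ track" /><br />\n) for @tracks;

    sub render_track {
        my ( $track, $file ) = @_;
        # Placeholder only: the real code draws the track's features and writes a PNG.
        open my $fh, '>', $file or die "Cannot write $file: $!\n";
        print {$fh} "placeholder image data for $track\n";
        close $fh;
    }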

The parallel track image rendering algorithm and the drag-and-drop interface have already
been implemented as proof-of-principle demos by Lincoln Stein and Ian Holmes (personal
communication). A minor improvement to the GBrowse software will be the implementation
of a draggable rectangle to interactively set the region of interest, as is already implemented
by the Ensembl ContigView.

D.3.b. Querying InterMine and BioMart

ModMine will play a central role in dissemination of data by acting as a portal for the general
research community through interfaces that allow the running of complex, arbitrary queries
as well as standardized "query templates": we will provide a library of useful query templates
(built in consultation with the data producers and broader user community), which will be
accessible both via the web interface and as web services for use by remote programs. In
addition, internally, modMine will provide data feeds to the GBrowse databases, the FTP site,
and the data-filtering BioMart interface.

Part of what InterMine provides as infrastructure for the production database is query
flexibility: it has been designed to allow data to be stored in normal form, but with
transparent query optimization through materialized views. This means that we can tune the
performance of the live production database by generating further precomputed
denormalised tables to address query bottlenecks: once each such table is complete and
registered it is immediately available to the query optimizer and can increase the speed of
appropriate queries.
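
By way of illustration only (table and column names are hypothetical, and InterMine manages
and registers its real precomputed tables itself), such a denormalised table for a known
bottleneck join might be built along these lines:

    #!/usr/bin/perl
    # Sketch of precomputing a denormalised table for a frequent join.
    # Table and column names are hypothetical placeholders.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=modmine', 'tuner', '', { RaiseError => 1 } );

    $dbh->do('DROP TABLE IF EXISTS precomp_gene_binding_site');
    $dbh->do(q{
        CREATE TABLE precomp_gene_binding_site AS
        SELECT g.identifier AS gene,
               b.motif      AS motif,
               b.score      AS score,
               b.start_pos  AS site_start,
               b.end_pos    AS site_end
        FROM   gene g
        JOIN   binding_site b ON b.gene_id = g.id
    });

    # An index on the lookup column lets the optimizer exploit the new table immediately.
    $dbh->do('CREATE INDEX precomp_gbs_gene_idx ON precomp_gene_binding_site (gene)');
    $dbh->disconnect;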

Importantly for general users, it is not necessary to have knowledge of the schema in order to
build queries: arbitrary queries can be built using a web interface that in effect allows
browsing of the schema in a biologically intuitive way. InterMine also provides "template
queries" which can be used to encapsulate built queries for future re-use through a simple
interface. This is particularly useful for complex queries. We intend to compile a library of
useful template queries, in response to discussions with the data producers, MODs and the
general research community. We routinely precompute all template queries and other
obvious multi-table joins, and this, together with the query optimizer, gives good
performance on arbitrarily complex queries. An example of a complex query that InterMine
supports would be finding genes with nearby positive ChIP/chip probes, where the
orthologues of the genes also had positive nearby probes in analogous experiments.

We will also provide DCC data through the BioMart database system. This relies on massive
denormalisation of data to provide rapid querying and filtering of datasets through a powerful
web interface. We will have two choices in building BioMart: we can make use of existing
work to establish a direct data flow between Chado and BioMart, and we can also generate
the BioMart-format data directly as precomputed tables in InterMine which can then be
dumped for loading into BioMart.


BioMart excels at creating tabular database extracts that join at most two data types. A
realistic example of a BioMart query would be: "Find all Drosophila genes with a strong
upstream RFX-1 transcription factor binding site. Format their FlyBase accession numbers
in an Excel worksheet along with the binding site sequence, the gene ontology terms
associated with the fly gene, and the IDs of C. elegans genes that are orthologous to the fly
gene."

D.3.c. Web services

InterMine query templates are simplified web interfaces to complex queries. These simplified
interfaces can be derived rapidly once a query has been built. We will provide web services for
all query templates. As the query templates will be built in response to consultation with all
users of the DCC, we anticipate that this will satisfy power users who wish to collect and use
data using remote programs, perhaps as part of more complex workflows. Such users will also
be able to make use of the programming interfaces provided by InterMine and BioMart.
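
For example, a remote program might run a published template over HTTP and read back
tab-delimited rows; the endpoint, template name, and parameter names below are purely
illustrative placeholders, to be replaced by the services actually published with each template:

    #!/usr/bin/perl
    # Sketch of calling a template query as a web service from a remote program.
    # The endpoint, template name, and parameters are illustrative placeholders.
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI;

    my $uri = URI->new('http://modmine.example.org/service/template/results');   # hypothetical URL
    $uri->query_form(
        name   => 'GeneToUpstreamBindingSites',   # hypothetical template name
        value1 => 'RFX-1',
        format => 'tab',
    );

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($uri);
    die 'Template query failed: ', $response->status_line, "\n" unless $response->is_success;

    # Each line of the response is one result row, ready for use in a larger workflow.
    for my $row ( split /\n/, $response->decoded_content ) {
        my @fields = split /\t/, $row;
        print join( ' | ', @fields ), "\n";
    }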

D.3.d. Bulk download

The users of the DCC will be varied, and we expect, apart from many browsing users, that
there will be labs that will want to download and use bulk datasets. Through loading data
into the core database we will be able to provide datasets in a number of standard forms
(e.g. GFF3, Chado-XML), regardless of the data format in which they were provided to us by
the data producers. In addition it will be possible to download more complex bulk datasets
through both the InterMine and BioMart interfaces: for instance, one might be interested
only in probes from ChIP/chip experiments that were positive for two or more specified
transcription factors.

D.3.e. Release notes

Each release will be accompanied by a detailed report that summarizes what is new and
changed in the current release relative to the previous one, including a list of annotations that
are no longer present. If required, we can also produce release difference summaries for other
pairwise combinations.


D.4. SPECIFIC AIM 4: WE WILL WORK CLOSELY WITH THE RELEVANT MODS TO ENSURE THE LONG-TERM STORAGE AND MANAGEMENT OF modENCODE DATA

Our liaison with the MODs (WormBase and/or FlyBase, depending on the outcome of the
RFA-HG-06-006 awards) will have two main phases. During the first phase, which will occur
early in the project, we will work with the MODs to import the core framework annotations
from the MODs into the DCC Chado/InterMine database, modMine. During the second
phase, which will occur in the latter part of the project, we will work with the MODs to export
modENCODE data from the production database into the MODs in a form that will maximize
its utility to the community and minimize its impact on the MOD workload.

D.4.a. Data import from the MODs

We will import from the MODs the minimum information necessary to support the
interpretation of modENCODE data and the execution of DCC quality assurance tests. Most
of the information that we import will be genome annotations, and will include the DNA
sequence, gene models, gene names and synonyms, the positions of well-known promoters
and other transcriptional regulatory elements, protein names and sequences, well-known
markers, and those genome build elements such as cosmids and BACs that might be used as
reagents in modENCODE experiments. We will import additional genome annotations as
needed by the projects funded. For example, if any modENCODE data providers develop
assays based on EST (expressed sequence tag) sequences, we will import the EST lists, their
sequences, and their genomic alignments.

We will import additional key non-genomic data elements that will be needed to support
modENCODE datasets. This will include strain names and their accession numbers, the fly
and/or worm anatomy and developmental ontologies, and basic protein annotations
including gene ontology associations and protein domain content. Again, depending on which
proposals are funded, we will bring in more specialized data types, such as the C. elegans
developmental lineage.

Over the course of the project the core genome annotations will be updated. Most frequently,
these updates will affect the gene models, which are frequently improved by the MODs on the
basis of new information (including, we trust, modENCODE information). Fortunately both
MODs version gene models in such a way that the relationship between the current version
and older versions is maintained, even when a gene model is split or merged. We will track
new MOD releases and bring in updates of the core genome annotations in such a way that
both the current versions and older versions are visible to modENCODE participants and the
public.

The task of importing core annotations from FlyBase into InterMine has already been
accomplished (see Preliminary Results), and so little additional development effort will be
needed to bring FlyBase data into the DCC production database. Indeed, this will be made
even easier by the decision to base modMine on the Chado schema. For WormBase, we will
import genome annotation data from WormBase GFF3 files, which are updated and placed on
an FTP site at regular intervals. Other core annotations will be imported from WormBase's
AceDB database via scripts based on the AcePerl middleware layer
(http://stein.cshl.org/AcePerl). The initial importation and development of the update
protocol will require a modest effort (2-3 weeks) by one University of Cambridge software
engineer working in collaboration with data managers from the Stein group (for WormBase
import) and/or the Lewis group (for FlyBase import).

Multiple whole-genome alignments will be an important component of the integrated
modENCODE database, and we expect that these alignments will be heavily used by some
data providers for designing reagents and interpreting results. Because both the Drosophila
and Caenorhabditis sequencing efforts are moving rapidly, it is not clear whether stable
multiple sequence alignments (MSAs) will be obtained from the MODs, from the sequencing
analysis groups, or from modENCODE data providers. To cover all eventualities, the UCSC
group will evaluate the whole-genome alignments available at the time the project begins and
recommend the best course of action: either to import MSAs, or to generate them de novo.
In any case, the UCSC group will be responsible for developing the importation, storage and
query infrastructure needed to handle MSAs efficiently.

D.4.b. Data export to the MODs

To ensure that all data developed by modENCODE is transferred fully to FlyBase and/or
WormBase for long-term storage, we will maintain close contact with one or both MODs.
Should C. elegans projects be funded, Lincoln Stein, who is a WormBase co-PI, will attend the
monthly WormBase-wide teleconferences and provide the group with regular updates on the
modENCODE project and DCC activities. Similarly, should D. melanogaster projects be
funded, Suzi Lewis, a former FlyBase PI, will act as high-level liaison to FlyBase. Stein and
Lewis are also familiar with each other's databases: Stein is on the FlyBase Scientific Advisory
Board, and Lewis was formerly on the WormBase Scientific Advisory Board.

Over the course of the project, we will develop protocols for transferring modENCODE data
to the MODs by means of flat-file exports and/or database dumps. For FlyBase, the most
expeditious mechanism for transferring information will likely be to export database tables
directly from the modMine database. Since FlyBase also uses Chado, the data should import
with little or no difficulty. WormBase may have adopted Chado for internal use by this point,
in which case we will use the same mechanism. However, if WormBase has not completed the
transition to Chado, we will export modENCODE data in the form of AceDB ".ace" files for
incorporation into WormBase during the regular WormBase build process. The Stein group
has extensive experience with generating and manipulating these files.
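
Generating .ace files is mechanical; the sketch below writes a few hypothetical features in the
tag-value format AceDB expects (the class and tag names are placeholders chosen for
illustration and would in practice follow the WormBase models file):

    #!/usr/bin/perl
    # Sketch of writing modENCODE features as AceDB ".ace" records.
    # Class and tag names are placeholders; real exports follow the WormBase models.
    use strict;
    use warnings;

    # Hypothetical features destined for WormBase.
    my @features = (
        { id => 'modENCODE_TF_site_0001', method => 'ChIP_chip', seq => 'CHROMOSOME_II', start => 100_200, end => 100_450 },
        { id => 'modENCODE_TF_site_0002', method => 'ChIP_chip', seq => 'CHROMOSOME_X',  start =>  55_300, end =>  55_720 },
    );

    open my $ace, '>', 'modencode_export.ace' or die "Cannot write .ace file: $!\n";
    for my $f (@features) {
        # Each object starts with 'Class : "name"'; tag lines follow; a blank line ends it.
        print {$ace} qq(Feature : "$f->{id}"\n);
        print {$ace} qq(Method "$f->{method}"\n);
        print {$ace} qq(Sequence "$f->{seq}" $f->{start} $f->{end}\n);
        print {$ace} "\n";
    }
    close $ace;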

We expect that the data export issues will be most keen during the third year of the project,
when the data flow will be at its maximum and datasets will reach completion. For this
reason, we will assign an additional half-time data manager to the task of acting as data
export liaison to the MODs during the third year of the project.

D.5. SPECIFIC AIM 5: WE WILL USE STANDARD SOFTWARE ENGINEERING METHODOLOGIES TO ENSURE RELIABLE PROCESSING AND STORAGE OF THE DATA

This section describes the software engineering methodologies that we will apply to the project.

D.5.a. Learning from human ENCODE

We will, whenever possible, make use of the tools, reagents, protocols, and best practices
developed during the pilot phase of the human ENCODE project. For the DCC, a key part of
this will be interacting with the groups at NCBI and UCSC who played a similar role in the
human project. Many of the groups producing data for modENCODE will also produce data
for human ENCODE. At a minimum, when the same groups are producing the same types of
data in the two projects, we will use the same data transfer protocols in modENCODE as in
human ENCODE. For the microarray and SAGE data NCBI used the GEO system. (See the
Deposit and Update section of http://www.ncbi.nlm.nih.gov/geo/.) UCSC handled most of
the other data types, and their protocols and formats are described at
http://genome.ucsc.edu/ENCODE/submission.html.

To help smooth the way for this sharing, the modENCODE PIs will pay a three-day visit to the
Haussler/Kent lab, and a two-day visit to NCBI GEO, in order to become acquainted with
human ENCODE's human workflow processes, internal and external data formats, and user
interface drivers.

Ideally, the modENCODE DCC should also share substantial code and database structures
with the human DCCs. In practice, this sharing is complicated by a number of technical and
cultural factors. On a technical level UCSC tends to work in the C language, NCBI in C++, and
the modENCODE programmers in Perl and Java. UCSC and NCBI tend to do much of the
database work directly in SQL, using code generators that map between structures in C and
database tables. The GMOD developers tend to intersperse an object layer between the
database and the rest of the code. The database style is also quite different between the
organizations. The UCSC style in particular tends to put data from different sources into
different tables, and to use metadata outside of the relational schema proper to indicate that
the data is of a particular type. UCSC tends to create new tables for each new dataset, while
Chado uses a small number of generic tables that are adapted for specific datasets using
controlled vocabularies.

Although there are pros and cons to each approach, after much discussion we came down on
the side of taking a GMOD-centric approach, primarily because this will dramatically simplify
the process of transferring data between the modENCODE database and the model organism
databases (FlyBase in particular). Nevertheless, we will reuse as many human ENCODE
software tools as feasible. We have already described the extensive reuse of the Penn
State/UCSC whole-genome alignment pipeline and visualization. In addition, we will adopt
technology developed by human ENCODE to store WGA data in Chado, as well as to
represent microarray data results and other data types not currently well represented in
GMOD tools. These will become part of the GMOD open source toolkit to benefit the
community of GMOD users.
community of GMOD users.

Jim Kent will act as technical liaison between modENCODE and human ENCODE. The PIs
will together choose the best mechanisms to integrate human ENCODE with GMOD
technology, and both the UCSC and NCBI groups will be given an opportunity to comment on
their proposals.

D.5.b. Software and Database Development Methodology

The four groups involved in this DCC consortium proposal operate on similar software
engineering principles: as a starting point, we believe that it is important to make the output
of our efforts freely available and to collaborate with existing open source development efforts
where possible. Design, coding and testing standards are important, and we routinely employ
design reviews, proper versioning (e.g. using Subversion), and automated unit testing, and we
generate consistent documentation. User requirements are met most effectively when users
and the software development staff work in close proximity to one another as a part of their
daily routines. In this regard, all four PIs are well situated because we all work in close
association with active wet labs.

The tasks outlined in the Plans section, along with the estimated engineering efforts required
to implement them and the implementation timeline, are summarized in section I.3.

D.5.c. Performance tuning

Measured just in annotated base pairs, modENCODE is 6-9 times the size of the human
ENCODE pilot (100 Mb C. elegans, 180 Mb D. melanogaster versus 30 Mb of human
ENCODE). In addition, modENCODE data sets may include raw data, such as anatomic
expression pattern movies and images, that was not part of human ENCODE. Because of the
large volume of anticipated data, we will be continually watchful for bottlenecks in data
processing that might interfere with the timely integration and publication of modENCODE
data.

A software engineer from the Micklem group will have the full-time task of acting as roving
troubleshooter and performance tuner. His or her job will be to monitor the performance of
the system, identify problems early, and work in collaboration with the other software
engineers and data managers to alleviate the problems. The rest of this section describes
some of the areas that the troubleshooter will monitor.

The data marshaling process. Computation is unlikely to be a bottleneck in the data
marshaling process because most of the quality control tests (e.g. syntax checking) are
lightweight. However, from a systems standpoint, the most likely bottleneck will be the
process of negotiating data contracts with the data providers, writing code to implement the
contract, and validating the code. We will seek to minimize these bottlenecks by reusing
contracts for similar data (e.g. RACE and Tec-Red) and by relying on familiar, tab-delimited
file formats whenever possible. We will also seek to streamline the negotiation process by
establishing firm deadlines for completion of the negotiations.

Building the production database. It is likely that the major bottleneck for the production
database will be the integration and cross-validation of the various datasets during the build
process; for example, the confirmation that all ArrayExpress accession numbers correspond to
a valid microarray experiment. In order to minimize the impact of the database build, we will
make use of core data "snapshots." All stable data will be integrated first, validated and then
snapshotted in a standalone database. To create a build we will take a copy of the most recent
snapshot and add to it the new and less stable datasets.

Querying the production database. As described in Specific Aim #3, we intend to go beyond
the human ENCODE database by providing end users with the ability to pose arbitrary and
ad hoc queries against the production database. However, such queries can be slow when
performed across a normalized database schema such as Chado. To streamline this process,
we will generate denormalized tables that are optimized for particular common subqueries;
this is what InterMine and BioMart excel at.

GBrowse visualization. We anticipate that the genome browser visualization provided by
GBrowse will be heavily used by the community, and the load may adversely impact the
software's performance. In order to minimize these effects, we will use standalone DAS
databases as the back end for the modENCODE GBrowse, and slightly rearchitect the software
as described in Specific Aim #3 to take better advantage of multiple CPUs for the
computation-intensive rendering steps. To further improve performance, we will adopt a
load-balancing system based on the Squid web proxy (http://www.squid-cache.org). This
software will transparently route GBrowse requests to multiple GBrowse instances in a
round-robin fashion. To maximize the flexibility of this arrangement, GBrowse will be run on
a system of server blades, which will allow us to increase the number of GBrowse backends as
necessary to meet the load requirements.

Bulk Downloads. Many users will be interested in obtaining extracts of the modENCODE
data that meet certain requirements. In order to reduce the number of file formats that are
produced "on the fly" via the InterMine and BioMart report generation facilities, we will
analyze the incoming requests in order to identify the most frequent requests, and then
produce precomputed files for bulk downloading for each release.

Network Latency. The number of network router hops introduces a delay between the time a
user makes a request on the DCC web site and the time a response is seen. Latency becomes
worse as the distance between the user and the web site increases, and is most noticeable for
heavily interactive software, such as GBrowse. The latency delay is more than merely
annoying; it subliminally trains users not to experiment with changing display settings. If we
find that network latency is degrading the end-user experience, we will take advantage of the
fact that the DCC is physically spread across three time zones (UK, East Coast, West Coast)
to set up data mirrors at two or more of the sites. This will also help with load balancing.

E. HUMAN SUBJECTS RESEARCH

None.

F. VERTEBRATE ANIMALS

None.

G. SELECT AGENT RESEARCH

None.

H. LITERATURE CITED

Ashburner M. Won for All: How the Drosophila Genome Was Sequenced. Cold Spring Harbor Press, 2006. ISBN 0-87969-802-0.

Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R. NCBI GEO: mining millions of expression profiles--database and tools. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D562-6.

Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D. Ultraconserved elements in the human genome. Science. 2004 May 28;304(5675):1321-5. Epub 2004 May 6.

Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. The distributed annotation system. BMC Bioinformatics. 2001;2:7. Epub 2001 Oct 10.

Drysdale RA, Crosby MA; FlyBase Consortium. FlyBase: genes and gene models. Nucleic Acids Res. 2005 Jan 1;33(Database issue):D390-5.

Durbin R, Thierry-Mieg J. A C. elegans Database. Documentation, code and data available from http://www.acedb.org. 1991.

Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M. The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 2005;6(5):R44.

Hoon S, Ratnapu KK, Chia JM, Kumarasamy B, Juguang X, Clamp M, Stabenau A, Potter S, Clarke L, Stupka E. Biopipe: a flexible framework for protocol-based bioinformatics analysis. Genome Res. 2003 Aug;13(8):1904-15. Epub 2003 Jul 17.

Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ. The UCSC Genome Browser Database. Nucleic Acids Res. 2003 Jan 1;31(1):51-4.

Kasprzyk A, Keefe D, Smedley D, London D, Spooner W, Melsopp C, Hammond M, Rocca-Serra P, Cox T, Birney E. EnsMart: a generic system for fast and flexible access to biological data. Genome Res. 2004 Jan;14(1):160-9.

Lewis SE, Searle SM, Harris N, Gibson M, Lyer V, Richter J, Wiel C, Bayraktaroglir L, Birney E, Crosby MA, Kaminker JS, Matthews BB, Prochnik SE, Smithy CD, Tupy JL, Rubin GM, Misra S, Mungall CJ, Clamp ME. Apollo: a sequence annotation editor. Genome Biol. 2002;3(12):RESEARCH0082.

Lyne R, Smith R, Rutherford K, Wakeling M, Riley T, Guillier F, Ji W, Mclaren P, Woodbridge M, Janssens H, Watkins X, Rana D, Lilley K, Russell S, Ashburner M, Mizuguchi K, Varley A, Micklem G. FlyMine: an integrated database for Drosophila and Anopheles genomics. Genome Research (draft/submitted).

Mungall CJ, Misra S, Berman BP, Carlson J, Frise E, Harris N, Marshall B, Shu S, Kaminker JS, Prochnik SE, Smith CD, Smith E, Tupy JL, Wiel C, Rubin GM, Lewis SE. An integrated computational pipeline and database to support whole genome sequence annotation. Genome Biol. 2002;3(12):RESEARCH0081. Epub 2002 Dec 23.

Schwarz EM, Antoshechkin I, Bastiani C, Bieri T, Blasiar D, Canaran P, Chan J, Chen N, Chen WJ, Davis P, Fiedler TJ, Girard L, Harris TW, Kenny EE, Kishore R, Lawson D, Lee R, Muller HM, Nakamura C, Ozersky P, Petcherski A, Rogers A, Spooner W, Tuli MA, Van Auken K, Wang D, Durbin R, Spieth J, Stein LD, Sternberg PW. WormBase: better software, richer content. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D475-8.

Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002 Oct;12(10):1611-8.

Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, Lewis S. The generic genome browser: a building block for a model organism system database. Genome Res. 2002 Oct;12(10):1599-610.

Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, Burton J, Butler J, Campbell RD, Carninci P, Cawley S, Chiaromonte F, Chinwalla AT, Church DM, Clamp M, Clee C, Collins FS, Cook LL, Copley RR, Coulson A, Couronne O, Cuff J, Curwen V, Cutts T, Daly M, David R, Davies J, Delehaunty KD, Deri J, Dermitzakis ET, Dewey C, Dickens NJ, Diekhans M, Dodge S, Dubchak I, Dunn DM, Eddy SR, Elnitski L, Emes RD, Eswara P, Eyras E, Felsenfeld A, Fewell GA, Flicek P, Foley K, Frankel WN, Fulton LA, Fulton RS, Furey TS, Gage D, Gibbs RA, Glusman G, Gnerre S, Goldman N, Goodstadt L, Grafham D, Graves TA, Green ED, Gregory S, Guigo R, Guyer M, Hardison RC, Haussler D, Hayashizaki Y, Hillier LW, Hinrichs A, Hlavina W, Holzer T, Hsu F, Hua A, Hubbard T, Hunt A, Jackson I, Jaffe DB, Johnson LS, Jones M, Jones TA, Joy A, Kamal M, Karlsson EK, Karolchik D, Kasprzyk A, Kawai J, Keibler E, Kells C, Kent WJ, Kirby A, Kolbe DL, Korf I, Kucherlapati RS, Kulbokas EJ, Kulp D, Landers T, Leger JP, Leonard S, Letunic I, Levine R, Li J, Li M, Lloyd C, Lucas S, Ma B, Maglott DR, Mardis ER, Matthews L, Mauceli E, Mayer JH, McCarthy M, McCombie WR, McLaren S, McLay K, McPherson JD, Meldrim J, Meredith B, Mesirov JP, Miller W, Miner TL, Mongin E, Montgomery KT, Morgan M, Mott R, Mullikin JC, Muzny DM, Nash WE, Nelson JO, Nhan MN, Nicol R, Ning Z, Nusbaum C, O'Connor MJ, Okazaki Y, Oliver K, Larty EO, Pachter L, Parra G, Pepin KH, Peterson J, Pevzner P, Plumb R, Pohl CS, Poliakov A, Ponce TC, Ponting CP, Potter S, Quail M, Reymond A, Roe BA, Roskin KM, Rubin EM, Rust AG, Santos R, Sapojnikov V, Schultz B, Schultz J, Schwartz MS, Schwartz S, Scott C, Seaman S, Searle S, Sharpe T, Sheridan A, Shownkeen R, Sims S, Singer JB, Slater G, Smit A, Smith DR, Spencer B, Stabenau A, Strange-Thomann NS, Sugnet C, Suyama M, Tesler G, Thompson J, Torrents D, Trevaskis E, Tromp J, Ucla C, Vidal AU, Vinson JP, von Niederhausern AC, Wade CM, Wall M, Weber RJ, Weiss RB, Wendl MC, West AP, Wetterstrand K, Wheeler R, Whelan S, Wierzbowski J, Willey D, Williams S, Wilson RK, Winter E, Worley KC, Wyman D, Yang S, Yang SP, Zdobnov EM, Zody MC, Lander ES. Initial sequencing and comparative analysis of the mouse genome. Nature. 2002;420(6915):520-562.

I. MULTIPLE PI LEADERSHIP PLAN & MILESTONES

The DCC trio of Stein, Micklem, and Lewis are all highly experienced project managers who
have successfully led distributed projects in the past. In addition they have strong
connections to the fly and worm research communities, giving them highly detailed
knowledge and experience with the biology, the technical underpinnings, and the community
structure. The DCC will be a distributed, but highly coordinated, effort among the three sites.
Each site has special strengths and interests, and these are reflected in the assignment of
responsibilities within the DCC: CSHL, visualization and MOD liaison; Berkeley, data
collection and validation; Cambridge, query access and data distribution; and UCSC, multiple
alignments and high-performance computing.

Details on documentation, deliverable coordination and management, and quality controls
are discussed in the research plan above.

I.1. DECISION-MAKING AND RESPONSIBILITIES

Stein, Micklem, Lewis and Kent will make decisions by consensus. Most critical to a jointly
directed project, we have clearly delineated responsibilities. No critical commitments or
representations on behalf of the DCC project are made without first consulting the other two
members of the leadership core and obtaining full endorsement.

The PIs are jointly responsible for assuring that we achieve the technical goals of the DCC.
When issues arise in meeting our goals, they are responsible for reallocating personnel, funds
and other resources in order to remedy the situation.

Our priorities will primarily be driven by the needs of the modENCODE project, but also by
the fly, worm, and human ENCODE research communities, where the four PIs have deep ties.

I.2. COMMUNICATIONS

Given our geographic distribution, effective communication is essential. We all have
considerable experience in successfully managing distributed projects and we know what
works.

The entire DCC staff will speak on the phone weekly (minimally), and when necessary we will
use one of the instant messaging protocols. Agendas will be set on a rotating basis, as will be
the keeping of minutes (which will be archived on the DCC WIKI site). Every meeting will
begin with a report on outstanding action items, personnel issues, and data updates. In
addition, we will correspond frequently (daily) via e-mail to discuss specific issues on
appropriate topic lists. We will hold face-to-face meetings of the entire staff twice a year, one
of which will be a satellite to the main modENCODE annual meeting while the others rotate
among the three sites. Whenever possible we will informally get together in conjunction with
other conferences we attend.

We will encourage a team attitude, in which any question can be asked and any creative new
approach proposed.

I.3. MILESTONES

Table 1 (this page and next) shows the major activities and milestones for the project.
Milestones are shown as "X" characters and ongoing activities as solid cells. Activities chiefly
performed by software engineers are shown in shades of blue, while those chiefly performed
by data managers are shown in shades of orange. Greater levels of effort are shown as shaded
cells of increasing intensity. See the bottom of the table for the key.


The table spans twelve quarters, 1q1 through 4q3 (first through fourth quarter of project
years one through three). Activities are grouped by specific aim; rows marked "milestone"
carry one or more "X" milestone marks, while the remaining rows are ongoing activities
whose shading indicates the level of effort (see the key).

General Management
  Hire all personnel (milestone)
  Establish internal & external mailing lists (milestone)
  Establish call schedule (milestone)
  Establish modENCODE WIKI (milestone)
  Initial public modENCODE web site (milestone)
  Project-wide staff meeting (recurring milestones)

Specific Aim #1: Project workflow system & data acquisition
  Install hardware for storage (milestone)
  Requirements analysis
  Managers conference with providers
  Negotiate "contracts" with providers
  Prototype & test tracking database
  Tracking system online & operational (milestone)
  Monitor data acquisition / supervise QC

Specific Aim #2: Integrate data into modMine
  Install hardware for data processing (milestone)
  Requirements analysis
  Develop unit tests for system
  Prototype & test modMine
  modMine online & operational (milestone)
  Troubleshooting & performance tuning
  Initial import of MOD framework data (milestone)
  MOD framework data update (recurring milestones)
  Unit tests for data import
  Prototype import from staging area
  First import of data from staging area (milestone)
  modENCODE data imports (recurring milestones)
  GBrowse comparative genomics tracks
  Human/MOD DCC discussions
  Comparative genomics -> Chado
  Multiple alignments

Specific Aim #3: Public user interface
  modMine web site design
  modMine web site implementation
  Develop query templates for modMine
  Develop BioMart interface
  Design FTP site for bulk downloads
  FTP site online (milestone)
  Design GBrowse load-balancing
  GBrowse load-management configuration
  GBrowse parallel-rendering algorithm
  GBrowse load testing & performance tuning
  GBrowse UI improvements
  GBrowse installed at three sites (milestone)
  Initial import of modMine data into GBrowse (milestone)
  Preview DAS server installed & tested (milestone)
  Web site docs & other static text

Specific Aim #4: MOD liaison & exchange
  Initial talks between DCC & MODs
  Select MOD data sets to import
  Initial import of MOD framework (cf. SA 3) (milestone)
  MOD framework data update (cf. SA 3) (recurring milestones)
  Discuss data export
  Develop & test export formats
  First export of data to MODs (milestone)
  Subsequent exports to MODs (recurring milestones)

Specific Aim #5: Software engineering
  Common CVS accounts established (milestone)
  Remote login accounts established (milestone)
  Unit testing framework selected (milestone)
  Software documentation
  Software releases (as needed)

Key:
  light software development activity (<1 software engineer FTE)
  moderate software development activity (1-2 software engineering FTEs)
  heavy software development activity (2 or more software engineering FTEs)
  light data management activity (<1 data manager FTE)
  moderate data management activity (1-2 data management FTEs)
  heavy data management activity (2 or more data management FTEs)
  X: milestone, placed at the end of the first, second, or third month of the quarter

Table 1: Major project activities and milestones.


J. CONSORTIUM/CONTRACTUAL ARRANGEMENTS

See attached.

K. RESOURCE SHARING

The DCC project is a community database resource project. We do not generate animals or
reagents. All resources from the DCC project will be made available openly and freely to all,
including academic and industry research communities. Details of documentation,
versioning, APIs, database downloads, syntax, data files, and tools are discussed throughout
the Research Plan.

Data in our public database will be updated on a regular basis, at least monthly. Software
changes will be incorporated as developed, with notification to modENCODE members and to
the research community through the DCC web site and through mailings to interest groups
and modENCODE mailing lists. The DCC is committed to supporting the widest possible
distribution and use of these data.

Software. While not primarily a software development project, it is probable that the DCC
will develop data processing utilities, visualization tools and other software that could be of
general utility. For example, it is likely that the DCC will develop new GBrowse glyphs and
plugins during the process of designing the best visual displays for modENCODE data. All
software developed by the DCC will be made freely available to the public under one or more
open source licenses approved by the Open Source Initiative
(http://www.opensource.org/licenses). Likewise, improvements to pre-existing software will
be transmitted back to the authors for consideration in future versions.

L. LETTERS OF SUPPORT

Sandra to fill in


APPENDICES

M. APPENDIX: DATA TYPES

What follows are lists of the experimental approaches and resulting data types that we expect
to receive for processing during the modENCODE project. It is based on discussions with
likely participants, supplemented by data providers' contributions to the prototype
modENCODE WIKI.

1. Anticipated experimental approaches

Gene structure
  Protein-coding gene structure
    New gene prediction algorithms
    5' and 3' RACE to determine UTRs and polyA sites
    ORFeome-style RT-PCR resequencing of predicted genes
    Whole-genome tiling arrays for gene structure
    TecRed ascertainment of gene 5' ends (worm)
    Alternative splicing prediction and splicing arrays
    Proteomic techniques for validation/discovery
  Non-coding gene structure
    Small RNA discovery via immunoprecipitation of RISC complex / size fractionation
    Deep sequencing of small RNA fraction
    Genome tiling array hybridisation

Gene Expression
  Gene probe microarrays
  Whole-genome tiling arrays
  SAGE analysis
  High-throughput construction of GFP fusion strains followed by in vivo imaging
    Automated lineage analysis (worm)
    Automated tissue analysis (fly)

Cis-regulatory elements
  Transcription-factor binding sites
    ChIP/chip profiling
    Hairpin DNA array profiling
    ChIP-PET (Paired End di-Tag) sequencing from ChIP-immunopurified material
    Yeast one-hybrid assays
    DamID
    Computational predictions of cis-regulatory elements (promoters, enhancers and boundary elements/insulators)
  Non-transcription-factor DNA/protein interaction
    ChIP/chip profiling
    Affinity chromatography followed by proteomics
  Splicing elements
    Computational prediction of splice regulatory elements

DNA replication
  Identification of pre-replicative complex binding sites
    ChIP/chip (tiling arrays)
    Computational predictions of cis-acting sequence determinants
  Functional dynamics of replication
    Replication timing profiles (tiling arrays)
    Identification of specific origins by BrdU enrichment (tiling arrays)
    Identification of tissue- and developmental-stage-specific replication origins
    Comparative genomic hybridization, CGH (tiling arrays)

Chromatin structure
  DNAse I hypersensitive sites
  Histone modification
    Distribution of histone modifications and of various chromosomal proteins
  OH radical cleavage profiles

Variation & Conservation
  Identification of functional variants
    High-throughput mutagenesis
  Conservation of functional sites
    Whole-genome alignments
    Computational identification of conserved cis-regulatory motifs
    Computational identification of conserved motifs in transcripts (e.g. miRNA targets)
  Whole genome assembly and alignment


2. Resources created & used during modENCODE

Strains (transgenics, etc.)
  High-throughput mutagenesis
  GFP reporter strains for targeted cell types & tissues
  piggyBac and other insertional mutants
cDNA libraries
Microarrays
  Gene probe
  Whole-genome tiling
Antibodies
  Anti-TF antibodies
Clones (gene fusions, etc.)
  Gateway-rescued clones from ORFeome project
Oligo probe sets


3. Data sets generated during validation

In vivo ascertainment of knockouts across functional elements
Phenotype of insertional mutants
Biochemical ascertainment (varies with experiment)


4. Primary data types

Gene structures / splice structures (many experimental/computational methods)
Features defined by a chromosomal start and end point
  Low-resolution evidence for genes (tiling array data)
  cis-element computational predictions
  Chromatin / cis-regulatory regions: ChIP/chip microarray data, DamID data
  Motif locations
  Chromatin modifications/structure
  Conserved regions
  Functional variants
Sequences
  Genome
  RNA/EST
  Paired-end ditags
  Tag sequencing expression analysis
Sequence alignments
  Whole genome
  RNA/genome
  Multiple sequence alignments
Motifs
  Position weight matrices
  Regular expressions
  Consensus sequences
Anatomical hierarchy
  Time and place of gene expression / replication origins
Images
  In situ hybridisation
  Promoter::GFP fusions
Movies
  4D movies of organisms carrying Promoter::GFP fusions
Strains
Clones/libraries
Microarrays
Antibodies
Phenotypes
  Anatomy ontology descriptions of abnormal phenotypes