6 Combining PCBC and external datasets. - NHLBI Progenitor Cell ...


Dec 7, 2013 (3 years and 8 months ago)


Bioinformatics Core Proposal for the NHLBI Progenitor Cell Biology Consortium

1 Overview

High-throughput genomic data sets offer tremendous promise in the identification of novel cellular mechanisms underlying both human development and disease. However, with this promise there remain tremendous challenges in the analysis, interpretation and comparison of these results between experiments. While many laboratories have employed individuals or teams of computational staff to handle such data, differing approaches, misunderstanding of important analysis-specific configurations and improper quality control have led to improper analyses which cannot be appropriately compared to other datasets.

For these reasons, standardized software pipelines aimed at achieving consistent efficiencies in the way in which these data are handled have become essential in the analysis of modern genomics datasets. In addition to providing reproducible analysis, such pipelines include the ability to customize analyses, add additional components, update the underlying methods in a carefully controlled and structured manner and improve the ability to compare results among large numbers of experimental analyses. In addition to these backend, low-level data analysis procedures, intuitive frontend methods for data presentation, summarization, visualization and access are particularly important for the biologists and investigative research staff attempting to identify crucial regulatory biology. Given the diversity of research environments, resources and methods used within the PCBC, such standardized bioinformatics becomes immediately important in creating integrated resources which not only allow for individual dataset interpretation, but also cross-dataset comparative analyses to understand variation between like datasets and determine novel biology stemming from meta-analyses.

Here we propose to form a united Bioinformatics Core to provide research support for the NHLBI’s Progenitor Cell Biology Consortium (PCBC), and to generate an open-access data portal of value to both the consortium and the entire biomedical research community who would benefit from improved understanding of human stem cells and their potential uses. The PCBC Bioinformatics Core will carry out and enable both highly automated and user-driven data analyses from the gamut of PCBC genomic data and will enable analyses of PCBC-generated data with respect to a wealth of genomic, biologic, and disease-relevant data from other sources. The PCBC genomics portal will allow PCBC data to be downloaded in its entirety in raw, normalized, or signature-focused format, and queried and analyzed with respect to known pathways and biological processes, as well as for interactome-based relationships to provide improved understanding of regulatory and disease-related mechanisms.

The overarching goal of the Bioinformatics Core is to allow researchers to significantly extend current knowledge of progenitor cell molecular and cellular biology and to identify and define mechanisms of lineage-specific determination and differentiation. The Bioinformatics Core will provide direct user support for PCBC researchers who have varying levels of expertise and interests so that they can gain immediate help in the analysis of raw, processed, and interpreted data generated from the molecular and phenotypic characterization of human stem cell-associated samples. The data portal will transition to a full open-access research data resource for community-wide support. Consortium PIs are welcomed and encouraged to utilize the services provided by the Bioinformatics Core, and educational initiatives will be directed both at use of backend analysis resources and at biological end-users, who will be trained in a variety of approaches to analysis of the PCBC datasets. Of course, consortium PIs are also welcome to continue using their existing systems to perform bioinformatics analysis, and make use of complementary services provided by the core (e.g., in data distribution or data processing), as well as bioinformatics training resources.

Additionally, the Bioinformatics Core will immediately leverage and provide focused analysis results based on a set of key priorities that have been defined by consortium members for 3 initial driving biological projects that bring together 3 hub sites (Johns Hopkins, Fred Hutchinson Cancer Research Center, and Gladstone Institutes) with the Bioinformatics Core (consisting of members from University of Cincinnati, Sage Bionetworks, Gladstone Institutes, Vanderbilt University, and the University of Maryland Baltimore). The Bioinformatics Core will also serve as a central data processing and hosting capability for data generated by other PCBC cores, such as the Cell Characterization Core (Cincinnati Children’s Hospital Medical Center) and other cores that may emerge, such as an RNA Core (Stanford University).

Through the combined efforts of the Cell Characterization Core and PCBC PIs, the PCBC is now developing a rapidly expanding, substantial, novel resource of molecular characterizations of progenitor cells and their differentiated derivatives. The origins and induction methods used in the generation of the cell lines are extensively documented and annotated, and the lines are systematically characterized based on their cellular phenotype and differentiative potential. Extracting maximum scientific value from PCBC projects will require: 1) detailed data curation efforts that link all cell, sample, treatment, and technology-specific data; 2) sophisticated computational tools and pipelines that carry out consistent and genome-knowledge aware data processing; 3) the ability to share raw, processed, and analyzed data with the PCBC and eventually the broader scientific community; 4) flexible user-driven computational tools that allow further analysis and comparison of all results derived from PCBC analyses with those derived from the same rich processing of many other samples; and 5) the ability of the Bioinformatics Core to provide help and collaborative resources to assist Consortium investigators in the analysis of all cell characterization data.


Carrying the above out in a systematic, automated, and Core-assisted fashion will enable the consortium and the biomedical research community to unveil and define new biology and disease translational capabilities based on human progenitor cells and resources. Training and assisted use of these will allow consortium investigators to gain maximal insights into the molecular basis of progenitor cells themselves, factors that affect their differential lineage potential, and the identification of shared, unique, and novel pathways for specific lineage generation and disease modeling.

With the critical mass of data that has already been generated by the PCBC; the flood of data anticipated in the coming year; the broad need to establish scientific synergy across the consortium that will improve understanding of progenitor cell biology; and the overarching need to develop progenitor cell platforms for the analysis and treatment of human diseases, there is a pressing need to move beyond the planning role of the PCBC Bioinformatics Committee and to create a Bioinformatics Core facility focused on execution and support of data sharing and data analysis. Our intention is to establish focused computational and bioinformatics staff resources who can advance understanding, novel hypothesis generation, and testing that will lead to the development of a wealth of uses of human progenitor cells. This proposal describes a Bioinformatics Core that we believe will accomplish these goals for the PCBC.


2 Specific aims

Aim 1. Provide computational infrastructure that enables PCBC data to be accessed, analyzed and compared to other data sets in an integrative, reproducible fashion.

Sub-aim 1.1. Provide direct capability to visualize transcripts as a function of genomic locus structure and to do this for all samples, averages of groups of samples, and as a function of differences between groups of samples.

Sub-aim 1.2. Provide direct capability to initiate new primary data processing and secondary analyses following established scripts that can also be modified and updated as new primary data and new genomic reference data become available for re-analysis of existing data.

Aim 2. Support bioinformatics analysis and training for PCBC projects.

Sub-aim 2.1. Support rich, focused bioinformatics analyses for 3 initial driving biological projects.

2.1.1. Expression profiling of ESC-specific noncoding RNA as a metric for evaluating reprogramming quality in human iPSC lines. Hub Site: Johns Hopkins (Dr. Elias Zambidis, Dr. Jeffrey Huo)

2.1.2. Using adult hematopoietic progenitor cells to define the molecular basis of lineage commitment. Hub Site: Fred Hutchinson Cancer Research Center (Dr. Beverly Torok-Storb)

2.1.3. Analysis of cardiac and neural crest (NC) specification in human ES and iPS cells. Hub Site: Gladstone Institutes (Dr. Bruce Conklin)

Sub-aim 2.2. Expand support to any additional project led by PCBC investigators interested in interacting with the Bioinformatics Core to perform bioinformatics analyses that directly support PCBC investigators’ research goals.

Sub-aim 2.3. Provide training for bioinformatics tools used for performing data analysis tasks through online training materials and hosting of visiting scientists for short or extended periods at Bioinformatics Core locations (Sage Bionetworks, University of Cincinnati).

Sub-aim 2.4. Support pilot project (with Dr. Torok-Storb) to create an index of citable specific attribution credit for researchers who contribute to the PCBC Data Portal, reagents collection, and bioinformatics tools.

Aim 3. Support a diversity of data processing workflows that enable bioinformatics staff and advanced users to carry out, curate, and share their analysis of genomic, genetic, and phenotypic data generated by PCBC investigators and the Cell Characterization Core.

Sub-aim 3.1. Provide access to linked and complementary computational systems to share primary and processed data among PCBC investigators and to distribute PCBC data publicly according to schedules for public data release.

Sub-aim 3.2. Develop automated data processing workflows that apply to all PCBC genomic data and samples, allowing for the execution of both standardized Galaxy-based and customized Sage Synapse analyses that compare PCBC samples to a library of other samples, and that transform raw data generated by the PCBC into normalized and highly structured data formats amenable to downstream analysis. Data processing workflows will support all major technologies used in PCBC projects, including RNAseq, miRseq, ExomeSeq, and ChIP-seq.

Sub-aim 3.3. Provide users with the ability to carry out sample and analysis curation using highly structured ontologies for PCBC meta-data, including description of sample origins, derivation methodologies, analysis technology, and data processing and analysis and interpretation details.

Sub-aim 3.4. Curate publicly available stem cell data resources and provide the ability to quantitatively analyze these data in conjunction with data generated by PCBC investigators.

Aim 4. Create an open access web portal for network, pathway, and functional characterization analyses using PCBC data in conjunction with a broad array of other available genomics data.

Sub-aim 4.1. Provide direct access to lists of genes, transcripts, mRNAs, and miRs and their relative expression levels that define principal signatures of progenitor cells and progenitor-derived cells.

4.1.1. Collect, organize, and annotate gene, transcript, isoform, promoter, and miR clusters from various samples and sample types (undifferentiated stem cell clusters, differentiation clusters, treatment and disease effect clusters).

4.1.2. Present clusters of ES/iPS/other stem cell or somatic cell samples in an interactive format to retrieve quantitative individual gene data and cluster content and feature enrichment information.

Sub-aim 4.2. Provide capability for PCBC data to be quantitatively compared and contrasted with other molecular signatures and activities of gene clusters and sets.

4.2.1. Develop search engine to test how genes of interest cluster in the PCBC samples.

4.2.2. Provide links to other stem cell databases/search engines such as Fungenes, Amazonia, MEM - Multi Experiment Matrix.

Sub-aim 4.3. Provide direct, easy-to-use capability for investigators to compare all PCBC data with their own data and other data, and to perform biological network analysis to define new pathways and molecular interactions that are critical for progenitor cell functions.

4.3.1. Provide a platform and training to enable comparative analysis of gene expression at the level of mRNAs, lincRNAs, miRs, and alternative splice forms.

4.3.2. Provide platform and training for the analysis of co-functioning genes and proteins in biological networks, including protein interactions and miR interactions, and for in-silico evaluation of the effects of alternate splicing on biological processes and pathways.

4.3.3. Provide in the web-accessible results analysis platform the ability for users to save and share their biological networks and pathways and to allow others to use and edit them as new versions.



3 Leveraging and Seamlessly Connecting a Wealth of Powerful Genomic Data Resources

The above proposed aims are ambitious yet essential. While the requested support level (5.1 FTEs) represents a substantial commitment, this level of support would actually be insufficient to fulfill our aims for the Bioinformatics Core infrastructure if it were not for our ability to leverage a wealth of advanced infrastructure and prior and ongoing investments from other sources. Doing this and committing to the seamless connecting of these resources will allow us to return far greater added value to the PCBC than we could otherwise do. For example, the Cincinnati Children’s Genome Bioinformatics Group includes 4 programming engineers, 2 research associates, and 4 Bioinformatics graduate students who have developed a powerful infrastructure for carrying out integrative multi-datatype systems biology analyses, including the highly published suite of tools at http://toppgene.cchmc.org/ and newly developed disease systems biology tools at http://gataca.cchmc.org/, and the San Francisco Gladstone group have been pioneers in the development of http://Cytoscape.org and have developed many powerful additional modules that leverage the Cytoscape infrastructure. These resources will be fully linked and combined in an end-user friendly portal.

Sage Bionetworks includes a group of 11 professional software engineers and 20 research scientists developing generally applicable bioinformatics tools and software infrastructure that will provide support for many of the activities in the proposed workplan; the PCBC requested funding will be used to ensure that these activities and functions fully enable PCBC-specific data analyses and data integration. PCBC-critical functions will be built on top of existing tools developed by this team, which will ensure that they are not orphaned as the tools are progressively advanced. Thus much of the functionality for which Sage is funded by the mechanisms listed below can be leveraged and extended for PCBC-specific applications:

Through Date | Sage Total | FTEs 2012 | Full Name | Grant #
2014/2015 | $8,067,874 | 10.25 | Integrating cancer datasets for predictive model development and training | U54-CA149237-01 (NCI)
2014 | $4,999,996 | 6.8 | Sage integrative Bionetwork community: scalable resource for the state of Washington | 3104672
2016 | $7,344,061 | 8 | 12 additional sources |


This funding is dedicated to supporting Sage’s mission of: 1) developing predictive models of disease-related phenotypes through integrative analysis of large-scale genomic data sets; and, 2) building and supporting an open source computer platform and database to more effectively harness genome-scale data.

Sage has advanced this mission through efforts to support data sharing and community-based research, including: 1) the Synapse commons repository, including over 10,000 highly curated and “compute-ready” datasets processed through automated normalization and curation procedures and governed by IRB-approved data sharing policies; 2) the TCGA pan-cancer analysis project, in which Sage is developing tools used by the TCGA consortium to process, store, and share TCGA data and apply transparent, reproducible analysis procedures to identify molecular patterns across tumor types; 3) the breast cancer prognosis challenge, an open community competition to develop prognostic models of breast cancer survival.

The ability to promote data sharing, bioinformatics training, and reusable, transparent methodologies within the PCBC consortium is strongly aligned with Sage’s core mission, and Sage will be able to apply PCBC support to extend related activities funded by other mechanisms to succeed in achieving the ambitious aims of this proposal. By integrating the suites of tools for data exploration and analysis developed by the Cincinnati and San Francisco groups, we believe that this will represent the most powerful initiative yet devised to carry out multi-scale pathway-level analyses of any human system.

Figure 1. Integrative Access to Metadata, Raw Data, and Large-Scale (e.g., GALAXY-cloud processed) Datasets from Synapse. Synapse uses a set of web services to provide access to a data repository, which comprises a federated collection of curated, adjusted and analyzed datasets, models and code. Synapse may also reference restricted data stored in external databases, such as dbGAP or TCGA. Versioning of data, workflows and tools allows for the documentation of details on how individual models were generated, and enables these models to be reproduced. Storage of the data repository and services in the cloud allows for scalability, access and the potential to use high performance computing facilities directly from Synapse.



4 Data Processing Using the Sage Bionetworks Synapse Platform

The bioinformatics core will leverage the Synapse compute platform (Derry et al., 2012) (synapse.sagebase.org) developed by Sage Bionetworks (www.sagebase.org), a non-profit organization with the mission of (1) developing predictive models of disease-related phenotypes through integrative analysis of large-scale genomic data sets; (2) building and supporting an open source compute platform and database to more effectively harness genome-scale data by enabling disease models to be evolved by contributor scientists with a shared vision to accelerate the elimination of human disease. Synapse enables genomic analysis and provides broad access to molecular models of disease and the underlying datasets and algorithms that were used to construct them.

Synapse provides various functionalities (Figure 1), including:

1. Management of datasets.

2. Analysis code and models in user-created projects.

3. Ability to publish these resources for public reuse.

4. Inclusion of a workflow and versioning system to track the specific dataset and code that were used for a particular analysis.

5. Access to tools to enable scientific analysis and collaboration.

6. Integration with cloud computing technologies providing scientists with access to on-demand super-computer power without the upfront capital costs of building and managing a private cluster.
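To make the idea of tracking the specific dataset and code used for an analysis concrete, the following is a minimal illustrative sketch in Python. It is not the Synapse implementation; the function names and record fields (`provenance_record`, `dataset_sha256`, `code_sha256`) are hypothetical, and content hashing is simply one common way such a link could be recorded.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(analysis_name: str, dataset: bytes, code: str) -> dict:
    """Build a record tying an analysis to the exact dataset and code used.

    All field names are hypothetical and chosen for illustration only.
    """
    return {
        "analysis": analysis_name,
        "dataset_sha256": sha256_hex(dataset),
        "code_sha256": sha256_hex(code.encode("utf-8")),
    }


def same_inputs(a: dict, b: dict) -> bool:
    """True when two records refer to identical dataset and code content."""
    return (a["dataset_sha256"] == b["dataset_sha256"]
            and a["code_sha256"] == b["code_sha256"])


if __name__ == "__main__":
    first = provenance_record("normalize", b"raw counts v1", "log2(x + 1)")
    rerun = provenance_record("normalize", b"raw counts v1", "log2(x + 1)")
    changed = provenance_record("normalize", b"raw counts v2", "log2(x + 1)")
    print(same_inputs(first, rerun))    # identical dataset and code
    print(same_inputs(first, changed))  # dataset content differs
```

Under this assumption, two analyses can be recognized as reproductions of one another whenever their records agree, and any change to either the data or the code yields a new, distinguishable record.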

Synapse is built around a set of web services that provide a variety of features, including annotation, indexing, history tracking, versioning, authentication, authorization and data persistence. We designed an application-programming interface (API) that allows for structured queries across the metadata of all datasets, models and tools. Structured queries can be semantically enhanced by having Synapse delegate to external services, such as the National Center for Biomedical Ontology (NCBO) (Musen et al., 2011). The API provides federated access to datasets and other objects managed by Synapse, which allows a single API to be used to query and load data regardless of whether the data are hosted directly on Synapse or are linked to an external system. This strategy allows the support of use cases where the data generator has imposed restrictions on its redistribution (for example, in dbGAP) or when data volumes are sufficient to preclude hosting in the public cloud. The API also allows analysis code to be brought to the data that is located with the computing resources. This obviates the need to download large data sets, a feature that will increasingly become a priority as genetic data volumes outstrip network transfer capacities.
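The federated-access pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Synapse API: the class and method names (`CatalogEntry`, `FederatedCatalog`, `query`, `load`) are hypothetical, and the external system is stood in for by a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CatalogEntry:
    """One dataset: metadata plus either hosted bytes or an external loader."""
    metadata: Dict[str, str]
    hosted_payload: Optional[bytes] = None
    external_loader: Optional[Callable[[], bytes]] = None


class FederatedCatalog:
    """Hypothetical single entry point for querying and loading datasets."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry_id: str, entry: CatalogEntry) -> None:
        self._entries[entry_id] = entry

    def query(self, **criteria: str) -> List[str]:
        """Return sorted ids whose metadata match every key/value given."""
        return sorted(
            entry_id
            for entry_id, entry in self._entries.items()
            if all(entry.metadata.get(k) == v for k, v in criteria.items())
        )

    def load(self, entry_id: str) -> bytes:
        """Load data the same way whether it is hosted or externally linked."""
        entry = self._entries[entry_id]
        if entry.hosted_payload is not None:
            return entry.hosted_payload
        if entry.external_loader is not None:
            return entry.external_loader()
        raise LookupError(f"no data source registered for {entry_id}")
```

The caller never needs to know where a dataset lives: `query` works on metadata alone, and `load` hides whether the bytes come from the repository itself or from a linked system that retains control over redistribution.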


Synapse provides an ‘analysis-ready’ version of each dataset by running the curated data through a quality control process in preparation for downstream analysis. These adjusted datasets exist in conjunction with the source code and the detailed documentation that describes the transformation from the underlying curated and/or raw data. In addition, Synapse provides normalized versions of gene expression data from public databases -- for example, Gene Expression Omnibus (GEO) or ArrayExpress -- as well as clinical phenotypes curated to existing ontologies, such as NCBO. Sage Bionetworks seeds the repository with versions of the data before and after quality control that are useful for modeling analyses, but scientists may both deposit data and participate in the process of curation and quality control using custom tools.


A key feature of Synapse will be access to the tools developed by the scientific community for the manipulation and analysis of data. Synapse supports integration with applications that support various users, from data curators to bioinformaticians and biologists, and tools for data adjustment, normalization and reformatting, as well as for model building, will be developed and shared. For example, we have developed an R client to allow platform-hosted data to be accessed from the R environment, thereby providing a ready link to a wealth of existing analysis methodology (contained in the Comprehensive R Archive Network (CRAN) and Bioconductor). Synapse also includes a web portal that allows researchers to search and navigate through content relevant to their research interests and form projects with existing or new colleagues. General-purpose tools such as wikis, user forums and issue trackers can easily be adopted from other domains to support scientific research teams.


5 Integration of Synapse and Galaxy tools

The Galaxy workflow tool provides a set of complementary functionalities to Synapse, and we believe the integration of these tools will provide a particularly powerful and unique resource to the PCBC. In particular, Synapse provides unique functionality in terms of hosting a highly structured repository of public datasets, enabling sophisticated computational analysis through tight integration with programmatic environments such as R, and flexibility in building sophisticated workflows supporting many data types and supporting code transparency, data versioning, and provenance. Galaxy serves a different need by providing data interchange formats allowing analysis workflows to be stitched together from commonly-used prepackaged tools. We therefore propose to combine the data resources, flexibility, and programmatic environment of Synapse with the support for bioinformatics tools provided by Galaxy by:

1. Linking output of RNA-seq data processing pipelines built in Galaxy to automatically be hosted in Synapse. This will allow the RNA-seq data to feed into Synapse analysis tools and data sharing mechanisms, and to be analyzed in conjunction with other datasets hosted in Synapse.

2. Providing data interchange tools allowing datasets generated by Synapse data processing workflows to feed into bioinformatics analysis tools supported by Galaxy.

3. Extending the Galaxy graphical workflow system to represent workflows implemented in Synapse.

4. Linking Synapse with the Globus / GridFTP used by Galaxy to support rapid transfer of large data files.

Figure 2. Combined analysis of curated data within the Synapse platform. Pipelines have been built to bring data into Synapse and automate the process of applying standardized data processing methods and integration with well defined meta-data ontologies to produce analysis-ready versions of datasets that can be shared with the community and plugged into downstream analytical pipelines. Data from large publicly available resources (e.g., TCGA and GEO) have been processed using these methods and are available in Synapse. The Bioinformatics core will create pipelines to process PCBC data using similar procedures, enabling them to be shared with the consortium, linked to analytical pipelines, and integrated with relevant stem cell data sets curated from public data residing in Synapse (e.g., all stem cell studies from GEO).

6 Combining PCBC and external datasets.

As an example of the value of large scale curated data to extend PCBC data, a proof-of-concept project was performed to curate all publicly accessible stem cell studies from the Gene Expression Omnibus and use this resource to perform meta-analysis to identify molecular differences between phenotypic states. As shown in Figure 3, this resource allowed queries to be formed such as assessing the most differentially expressed genes in all stem cell versus non stem cell gene expression profiles performed on the same expression platform and available in GEO, demonstrating clear signal of differential expression of well-characterized stem cell related genes.
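A query of this kind can be illustrated with a short Python sketch. It is not the method used in the proof-of-concept project, whose statistical details are not given here; it simply ranks genes by the difference between their mean expression in stem cell profiles and in all other profiles. The function name `rank_by_mean_difference`, the sample labels, and the input layout are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def rank_by_mean_difference(
    expression: Dict[str, Dict[str, float]],
    stem_samples: Iterable[str],
) -> List[Tuple[str, float]]:
    """Rank genes by mean(stem cell profiles) - mean(all other profiles).

    `expression` maps gene -> {sample id -> expression value}. Genes lacking
    values in either group are skipped. Larger differences come first.
    """
    stem = set(stem_samples)
    differences = {}
    for gene, by_sample in expression.items():
        in_group = [v for s, v in by_sample.items() if s in stem]
        out_group = [v for s, v in by_sample.items() if s not in stem]
        if not in_group or not out_group:
            continue  # cannot compare without both groups
        differences[gene] = mean(in_group) - mean(out_group)
    return sorted(differences.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented toy values, used only to exercise the function.
    toy = {
        "SOX2": {"stem_a": 9.0, "stem_b": 8.0, "other_a": 2.0, "other_b": 3.0},
        "GENE_X": {"stem_a": 5.0, "stem_b": 5.0, "other_a": 5.0, "other_b": 5.0},
    }
    for gene, diff in rank_by_mean_difference(toy, ["stem_a", "stem_b"]):
        print(gene, diff)
```

A real analysis would also need to restrict the comparison to a single expression platform, as the text notes, and would typically add a significance test rather than rely on the raw difference alone.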


Extending proof-of-concept studies to support the PCBC bioinformatics core.

The PCBC bioinformatics core will create publicly available resources allowing the community to perform similar queries on large-scale publicly available stem cell datasets. The bioinformatics core will also allow PCBC investigators to perform integrative analysis to combine PCBC datasets with publicly available data, and make such tools available to the community upon public release of PCBC datasets.


Strategy for data analysis and portal-generation

Figure 4 depicts our proposed pipeline for leveraging multiple informatics tools to develop an integrative system for processing, sharing, and analyzing data generated by the PCBC that is combinable with complementary datasets from public resources.


Figure 3. Aggregated data sets analysis approaches. The assembled resource of publicly available stem cell microarray data was queried to identify which genes tend to be more active (i.e. expressed) in the studies that relate to stem cell biology in the GPL1261 platform. The figure represents the difference in activity for each gene in studies exploring stem cell biology (y-axis) versus the average activity in studies that do not explore stem cell biology. The red dots in the figure correspond to SOX2, POU5F1 (OCT3/4), LIN28, and KLF4. Green dots represent other transcription factors with higher activity in stem cells. Orange and black dots represent genes tested by Yamanaka and Yu (2007) respectively.

The pipeline begins with tools that use structured terminologies and ontologies to capture sample and technology-specific metadata and sample information for PCBC experiments. The bioinformatics committee has to date focused on developing this information and capability through the PCBC sample submission portal, and this will be extended in the current proposal so as to store submitted data in highly structured formats linked to downstream data hosting and analysis tools, as depicted in Figure 4.

Data processing pipelines will be developed using the Synapse platform to handle the compiled data generated from all genomic technologies commonly used in the PCBC, including RNAseq, miRseq, ExomeSeq, and ChIP-seq, as well as FACS and phenotypic profiling data. For each technology, we will store all raw data with transparent links to data processing scripts used to generate downstream data at different levels of pre-processing and data aggregation. For each data platform we will evaluate the use of either Galaxy or Synapse and select the optimal technology (or support both as needed), given that each provides built-in support for a subset of data platforms. We will use the Synapse integration with R programming infrastructure to develop novel data processing pipelines as needed to support data platforms used in the PCBC. As a proof of concept, we have developed an initial RNA-seq data processing pipeline using Galaxy.

All processed, “analysis ready” data will be automatically stored in Synapse, providing a common repository and data sharing system for all PCBC-generated data. These data will be combined with relevant datasets curated in Synapse from public domain sources, such as GEO (Figure 3), enabling PCBC-specific data to be interpreted together with large-scale public datasets. Data stored in Synapse will be poised to integrate with a flexible array of downstream data analysis systems, as data can be accessed: 1) programmatically in R, Python, or Java; 2) through a web-based graphical client; 3) through simple REST-based APIs.

Figure 4 -- Pipeline for data processing, multi-stage analyses, and results-interpretation sharing for the integrated PCBC Informatics Core.

We will leverage the programmatic data access and provenance tracking capabilities that start from the Sample Manager and link the Cell Characterization Core data generation processors to the Globus Online Data Repository, to Galaxy and SAGE, and back to a Cincinnati-developed Results Manager Portal to implement data analysis pipelines for commonly used procedures including the generation of expression signature gene lists, list comparisons, and gene set enrichment and biological network analysis for identifying important genes, biological processes, regulatory and biological state-determining mechanisms that are reflective of all progenitor cell-related origins, characteristics, and differentiation potential.
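As one hedged illustration of the list comparison and gene set enrichment steps named above, the sketch below computes the overlap between a signature gene list and a reference gene set and a one-sided hypergeometric (over-representation) p-value. This is a standard textbook test chosen for illustration; it is not stated to be the method the Core or its tools use, and the function name `enrichment_p_value` is hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def enrichment_p_value(
    signature: Iterable[str],
    gene_set: Iterable[str],
    background_size: int,
) -> Tuple[int, float]:
    """Return (overlap size, P[overlap >= observed]) under random sampling.

    Assumes both lists are drawn from the same background of
    `background_size` genes, and that each list fits within it.
    """
    sig = set(signature)
    ref = set(gene_set)
    n, big_k = len(sig), len(ref)
    if background_size < max(n, big_k):
        raise ValueError("background must be at least as large as each list")
    k = len(sig & ref)
    total = comb(background_size, n)
    # Sum hypergeometric probabilities for overlaps of k or more genes.
    tail = sum(
        comb(big_k, i) * comb(background_size - big_k, n - i)
        for i in range(k, min(n, big_k) + 1)
    )
    return k, tail / total


if __name__ == "__main__":
    overlap, p = enrichment_p_value(
        ["SOX2", "POU5F1", "LIN28"],   # signature list
        ["SOX2", "POU5F1", "KLF4"],    # reference gene set
        background_size=10,            # toy background size
    )
    print(overlap, p)
```

In practice the background would be the set of genes measured on the platform, and p-values across many gene sets would need multiple-testing correction.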

Figure 4. The PCBC Omics Portal -- A unified portal web server for generating user-accessible progenitor cell characterization results.



At the Web Access point (e.g., something similar to GATACA (http://gataca.cchmc.org/)), next stage analyses will be provided by the integration of a set of bioinformatics tools that the Cincinnati Bioinformatics group has established that allow users to carry out downstream analyses from the SAGE-based analysis pipeline. To do this, a user can send a specific dataset or set of results generated from the analysis pipeline to be further dissected, analyzed, and visualized using a variety of tools including table and heatmap generation as well as rendered into a Cytoscape network document/diagram for visualization and connectivity analyses. Additional functionalities, in particular the inclusion of differential splicing-based results from the UCSF AltAnalyze Package, will also be seamlessly integrated. Separate datasets can be aggregated and compared, and diverse data and knowledge relevant to a broad range of projects can be retrieved and inter-compared. The overall goal to be enabled by these functions is to provide users with a translational systems biology-based framework that will allow for the identification of pathways responsible for normal development, differentiation, and disease pathogenesis. Developed and supported infrastructure at Cincinnati will allow users’ results to be saved, organized, shared, and subjected to downstream analyses from pipelines and workflows established using the SAGE Synapse infrastructure. A key goal that will be addressed for the PCBC is to make readily available for any user the ability to save, retrieve, and downstream analyze all the SAGE and Galaxy-based pipeline results from progenitor cell characterizations. To do this, users will be able to save their analysis results, as well as search and retrieve those that are generated by other users or automatic processes.

The Cincinnati group will provide a focused version of the ToppGene and GATACA database webservers. For example, as shown in Figure 5, the GATACA database (http://gataca.cchmc.org/), a new version of which will be customized specifically for the Progenitor Cells Data Portal, provides access to 18 different kinds of gene-based datasets and genesets associated with specific biological and disease-associated ontology terms, pathways, interactions, experimental datasets and manuscripts culled from a broad array of large-scale genomic profiling characterizations of human diseases, genetically engineered disease models, and other large-scale dataset-based analyses that provide integrative and translational insights.

Figure 5. Data Portal: Retrieval of Progenitor Cell Signatures. Search results for “stem cell” from gene sets and genomic technologies (shown are data from ChIPseq/ChIPchip, RNAseq, and microarray result sets and “metasets”).


From any of the curated, annotated, or auto-generated gene signatures that are saved into the PCBC Web Portal, it will then be possible to directly request a graphical visualization of the gene pattern clustering across the whole or a relevant subset of samples pertinent to the specific analysis results. Each of these gene signatures will be visualizable through a heatmap such as is shown in Figure 6. Users will be able to directly download Excel tables, graphic images, and TreeView/Cluster data from the PCBC web portal. These can be results that they generated themselves using the SAGE Synapse portal, or that were generated by Bioinformatics Core staff members or others in the consortium who share results made using Synapse.

Figure 6. PCBC Datamine: Heatmap of RNAseq Expression Signatures: 4809 gene-level transcripts with significantly different expression in induced pluripotent stem cells (iPSC) as compared to human embryonic stem cells (hESC), or Definitive Endoderm (DE), or Embryoid Body (EB). Also shown is the relative expression of any of those transcripts in human heart.

To analyze one or more of these signatures (gene, transcript, miR, methylation, or chromatin/ChIPseq-based), the user then clicks through to send the genes to the ToppGene Suite (http://toppgene.cchmc.org), a powerful one-stop multi-GeneSet analysis server that supports gene list enrichment, candidate gene discovery and ranking based on functional annotations and network topology to facilitate hypothesis generation. The core in addition supports and extends a variety of advanced scientific discovery and data management software that provide integrative and translational insights. Each signature will also be analyzed by AltAnalyze (see below) for alternate spliceform biological impact. Many of the functions of ToppGene and AltAnalyze are complementary and these will be combined and harmonized as the system is refined.
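Gene list enrichment of the kind described above is commonly scored with a hypergeometric test on the number of query genes annotated to a term versus the number annotated in the whole genome. The sketch below is a generic illustration under that assumption; the source does not state which test or multiple-testing correction ToppGene applies, and the function name and parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(genome_size, term_in_genome, query_size, term_in_query):
    """Upper-tail hypergeometric probability P(X >= term_in_query).

    genome_size    : number of annotated genes in the background
    term_in_genome : background genes carrying the annotation term
    query_size     : number of genes in the submitted signature
    term_in_query  : signature genes carrying the annotation term
    No multiple-testing correction is applied in this sketch.
    """
    if not (0 <= term_in_genome <= genome_size and 0 <= query_size <= genome_size):
        raise ValueError("counts must lie between 0 and genome_size")
    max_overlap = min(term_in_genome, query_size)
    total = comb(genome_size, query_size)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(term_in_genome, k) * comb(genome_size - term_in_genome, query_size - k)
        for k in range(max(term_in_query, 0), max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10-gene background, 4 annotated, 3-gene query, 2 hits.
    print(enrichment_p_value(10, 4, 3, 2))
```

Exact integer binomials keep the small-probability tail free of rounding error until the final division, which matters when the reported P-values are extremely small.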


From ToppGene, for all gene lists, clusters, genomic location-based peak profiles, and networks that are generated as signatures representative of progenitor cell genomic biology, it is critical to provide users with direct access to gene- and list-level connectivity to integrated biological knowledge that enables next-stage analyses of these data types. One of the basic key functions of the downstream data analysis capabilities, as provided by ToppGene, is to provide ready access to a table view of different sets of stem cell signatures as obtained from different technologies, experiments, and analysis approaches.

Figure 6 legend: hESC (blue), iPSC (purple), DE (red), EB (orange), Heart (yellow).

This will be a single button click that is enabled next to each signature and will look like the signature below. The ToppGene table view in Figure 7 allows users to (for example) interactively study focused areas of a functional group of genes, a specific biological process, or a cell component to better understand the underlying genetic requirements and pathways that are being addressed by progenitor cells or their derivatives.



Figure 7. ToppGene view of significant functions and features associated with the core Stem Cell gene expression signature (uppermost cluster in Figure 6).


By clicking the TG button next to any of the specific category hits, it is possible to study further the specific subset of genes and their additional/focused relationships with respect to other known features in common, such as protein interactions, co-occurrence in a manuscript, shared regulatory elements, etc.

For example, to take a somewhat less well studied area of progenitor cell biology, the area of cell surface interactions can be chosen to show its shared relationships in a network view (Figure 8).

[Figure 7 screenshot: ToppGene result page, with a TG button beside each hit. For each of 14 annotation categories (GO: Molecular Function, GO: Biological Process, GO: Cellular Component, Human Phenotype, Mouse Phenotype, Domain, Pathway, Pubmed, Interaction, Cytoband, Transcription Factor Binding Site, Gene Family, Coexpression, Computational), up to five top enriched terms are listed with the columns ID, Name, Source, P-value, Term in Query, and Term in Genome. Top hits include sequence-specific DNA binding (GO:0043565, P-value 4.581E-71, 178 in query, 693 in genome) and mitotic cell cycle (GO:0000278, P-value 3.915E-86, 196 in query, 690 in genome).]

Figure 8. Feature-based network analysis of core human stem cell genes associated with Cell Projection organization and regulation.




The above network view provides an example of how cross-related data and knowledge can provide different kinds of insights and help generate hypotheses about new or emergent properties and functions associated with statistically significant relationships shared among subgroups of core stem cell-associated genes. Here a set of genes that are involved in the organization of cell projections was chosen based on several different gene ontology criteria and then combined with other genes in a core iPSC gene signature based on their sharing highly significant relationships, which are shown in different color squares according to the type of relationship (GO, pathway, transcription factor binding sites, protein-protein interactions, domain, disease phenotype, etc.). This “subnetwork” of transcripts that are strongly expressed in iPSCs consists of a group of genes with significant associations to proteins whose functions are better known in neuron-associated contexts, including GABAergic and glutamatergic receptors, the formation of vesicles, and the regulation of tube size. The impact of this type of network view is to provide researchers with the ability to formulate novel hypotheses that can be tested experimentally. For example, this view suggests interesting hypotheses such as a possible role for ERBB2 in responding to intercellular signaling for the regulation of cell projections, the potential of small molecules such as neurotransmitters to actually modify progenitor cell signaling, and the possibility that there is connectivity and feedback regulation within a progenitor cell aggregate and that the size, growth, and morphology of aggregates are regulated at a level beyond that of classical FGF and WNT signaling.

Thus, an important goal of the PCBC Genomic Analysis Portal will be to provide a simple and trainable interface that allows any user to build biological networks to examine any area of interest and to identify shared relationships and pathways that could shape progenitor cell biology.
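The idea of linking genes through the features they share (GO terms, pathways, binding sites, interactions, and so on) can be sketched as a small edge-building routine. This is only an illustration: the function name, the `min_shared` threshold, and the toy gene and feature labels are hypothetical, and the portal's actual network construction is not described in this proposal.

```python
from itertools import combinations


def shared_feature_edges(gene_features, min_shared=1):
    """Link every pair of genes sharing at least `min_shared` features.

    gene_features maps a gene symbol to an iterable of feature labels.
    Returns a list of (gene_a, gene_b, sorted_shared_features) tuples,
    which could be exported as a simple edge table for a network viewer.
    """
    edges = []
    for gene_a, gene_b in combinations(sorted(gene_features), 2):
        shared = set(gene_features[gene_a]) & set(gene_features[gene_b])
        if len(shared) >= min_shared:
            edges.append((gene_a, gene_b, sorted(shared)))
    return edges


if __name__ == "__main__":
    # Placeholder genes and feature labels, used only to show the output shape.
    toy = {
        "GENE_A": {"GO:term_1", "pathway:P1"},
        "GENE_B": {"pathway:P1", "domain:D1"},
        "GENE_C": {"domain:D2"},
    }
    for edge in shared_feature_edges(toy):
        print(edge)
```

Keeping the list of shared features on each edge preserves the relationship type, which is what the colored squares in a network view of this kind encode.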


Characterization of Cell Lineages and Alternative Isoforms

To facilitate additional functional predictions, we will integrate the AltAnalyze open-source analysis platform, developed by Dr. Nathan Salomonis out of the iPS Cells in Heart Disease hub, into the described Synapse workflow and bioinformatics Data Portal. In particular, Synapse provides the framework for programmatically accessing large-scale highly curated datasets, integrating them with public domain data curated to the same specifications, and developing transparent reproducible workflows through provenance tracking of versioned data, code, and models. AltAnalyze provides an ideal suite of methodologies to integrate with Synapse workflow and data access tools, including a suite of tools devoted to characterization of transcriptome profiles at the gene, isoform and cell level (Figure 9) (Emig et al., 2010).

AltAnalyze results strongly complement those produced through the Cincinnati Bioinformatics group, and provide investigators with a host of algorithms that facilitate recognition of central and novel effectors of biological processes and functional interpretations. At the gene level, RNA-Seq or microarray data are summarized from exon-aligned results, evaluated for differential expression using moderated tests, annotated based on identifiers from multiple databases and evaluated for enriched pathways, transcription factor and microRNA binding sites, ontologies and lineage-restricted biomarkers (Zambon et al., 2012). These results are automatically visualized upon WikiPathways and as hierarchically clustered heatmaps that will be made available through the PCBC web portal. In addition to these, quality control plots to identify possible outliers or poor sample quality are produced. At the isoform level, alternative exon expression is evaluated and competing junction pairs are compared to identify alternative splicing events, alternative promoters or alternative polyadenylation. These results are further evaluated to observe isoform variation that leads to proteins with compositionally distinct domains or that blunts protein expression altogether. To facilitate immediate visualization of these alternative exon results, we will integrate the Java web-start program SpliceSeq with pre-computed results through the PCBC web portal (Ryan, Cleland, Kim, Wong, & Weinstein, 2012).

Figure 9. Automated AltAnalyze analysis pipeline for genes, isoforms and cell-based correlations.
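Comparing competing junction pairs can be illustrated with a generic "percent spliced in" (PSI) calculation from inclusion and exclusion junction read counts. This is a textbook-style sketch, not AltAnalyze's algorithm, which the proposal does not detail; the function names and the `min_reads` coverage filter are assumptions.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads, min_reads=10):
    """Fraction of junction reads supporting exon inclusion.

    Returns None when total coverage is below `min_reads`, because a
    ratio computed from a handful of reads is not informative.
    """
    total = inclusion_reads + exclusion_reads
    if total < min_reads:
        return None
    return inclusion_reads / total


def delta_psi(condition_a, condition_b, min_reads=10):
    """Change in PSI from condition A to condition B.

    Each condition is an (inclusion_reads, exclusion_reads) pair.
    Returns None if either condition lacks sufficient coverage.
    """
    psi_a = percent_spliced_in(*condition_a, min_reads=min_reads)
    psi_b = percent_spliced_in(*condition_b, min_reads=min_reads)
    if psi_a is None or psi_b is None:
        return None
    return psi_b - psi_a


if __name__ == "__main__":
    # Toy counts: inclusion drops from 30/40 reads to 10/40 reads.
    print(delta_psi((30, 10), (10, 30)))
```

A large absolute change in PSI between two conditions flags a candidate alternative splicing event for closer inspection and experimental validation.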

Cell-level analyses in AltAnalyze are aimed at determining the proportion of different adult, fetal and progenitor cell types that comprise a given RNA sample. This approach, called LineageProfiler, works by comparing submitted expression profiles (gene or exon RPKM or detection calls) to a large precompiled compendium of lineage-restricted biomarker profiles. The biomarkers are selected during the database build process, by finding those genes or exons with the best cell-type restricted expression patterns. The results from LineageProfiler are correlation-based enrichment scores for each cell type for each submitted RNA sample. These results are automatically visualized as a clustered heatmap and as colored network diagrams illustrating known differentiation paths for all examined cell types and tissues (Figure 9).
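A correlation-based score of a sample against reference biomarker profiles can be sketched as follows. This is a minimal illustration of the general idea, assuming a plain Pearson correlation over shared markers; it is not the LineageProfiler implementation, and the function names and data layout are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no correlation signal
    return cov / sqrt(var_x * var_y)


def lineage_scores(sample, reference):
    """Score one sample against each reference cell-type profile.

    sample    : {marker: expression value}
    reference : {cell_type: {marker: expression value}}
    Only markers present in both the sample and a profile are compared;
    cell types with fewer than two shared markers are skipped.
    """
    scores = {}
    for cell_type, profile in reference.items():
        markers = sorted(set(sample) & set(profile))
        if len(markers) < 2:
            continue
        scores[cell_type] = pearson(
            [sample[m] for m in markers], [profile[m] for m in markers]
        )
    return scores


if __name__ == "__main__":
    # Toy profiles with placeholder marker names.
    reference = {
        "cell_type_1": {"m1": 1.0, "m2": 2.0, "m3": 3.0},
        "cell_type_2": {"m1": 3.0, "m2": 2.0, "m3": 1.0},
    }
    print(lineage_scores({"m1": 2.0, "m2": 4.0, "m3": 6.0}, reference))
```

Scores computed per cell type and per sample form a matrix that can be passed directly to a clustered heatmap.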

In the provided example, LineageProfiler predictions are shown for an analysis of mESC differentiation into cardiac progenitors. From this analysis, it becomes apparent that endodermal and fetal cardiac programs are induced by day 4 of differentiation, followed by an adult cardiac induction program by day 5 and exit from the immature endoderm/cardiac progenitor stage by day 10. Cell-type correlation scores at day 10 are shown on the Lineage Network. Regulated genes associated with each lineage are also returned, allowing investigators to confirm the expression of specific cell marker genes. This approach is particularly important, given the need of PCBC investigators to determine which cell types may be present, and to what degree, from distinct in vitro progenitor differentiation protocols. In addition to differentiation, this approach has been successful in identifying which cell types predominate in tissue mixing experiments as well as finding contamination due to inconsistent tissue dissection.

To improve the LineageProfiler arm of this pipeline for the PCBC, we will improve the quality and makeup of the biomarker database, build an independent Cytoscape app implemented in Java, and improve the methodologies for more accurate detection of cell types. In improving the biomarker database, it will be imperative to include RNA-Seq data emerging through the Cell Characterization core and from PCBC investigators for gold-standard cell types (e.g., human fetal cardiac RNA).


7

Driving Biological Projects

7.1

Expression profiling of ESC-specific noncoding RNA as a metric for evaluating reprogramming quality in human iPSC lines

Hub sites: Johns Hopkins (Dr. Jeffrey Huo, Dr. Elias Zambidis)

Bioinformatics core support: Cincinnati, Sage Bionetworks

The ideal human induced pluripotent stem cell (hiPSC) line should have the capacity to differentiate into any desired clinical lineage with equivalent efficiency to bona fide human embryonic stem cells (hESCs). However, recent work has demonstrated that many hiPSC lines possess limited efficiency of differentiation into mesodermal, ectodermal, and endodermal lineages compared to bona fide hESC. In contrast, we have recently discovered that high-quality, low-passage hiPSC lines derived from hematopoietic lineages with novel reprogramming approaches can differentiate into multiple lineages with ESC-like efficiency and minimal retention of epigenetic memory.


Furthermore, current efforts by the PCBC Cell Characterization Core (CCC) in Cincinnati are engaged in comparing the quantitative potency of a large repertoire of hiPSC lines (submitted by multiple Hub Investigators) to differentiate into therapeutically relevant cell lineages of clinical interest. One Aim is to independently confirm the differences in differentiation quality between hiPSC lines derived by different methods and somatic donors by multiple PCBC investigators at multiple Hubs. The proposed Bioinformatics Core will provide to PCBC investigators the tools and expertise necessary to integrate transcriptomic data being amassed by the PCBC CCC with the vast constellations of other genomics and epigenetics data beyond the PCBC. This unified resource can then be harnessed to identify mechanisms that may explain why some hiPSC lines submitted to the PCBC are ‘handicapped’ regarding their ability to differentiate into specific lineages. Identifying the molecular mechanisms underlying these differences in hiPSC differentiation potency and quality should accelerate the development of methods to generate higher quality, fully reprogrammed hiPSC.


Specific Aim 1: Identify ESC-like ncRNA expression profiles in a defined repertoire of hiPSC lines to evaluate the pluripotent state. Non-coding RNAs (including miRNAs and lncRNAs) have been shown to play critical roles in regulating a constellation of protein-coding genes in ESC and iPSC. Our preliminary data with microRNA microarrays of various hiPSC with both excellent and poor hemato-vascular differentiation potency are consistent with the hypothesis that expression of ESC-specific ncRNAs is associated with differentiation potency into hemato-vascular lineages. The proposed Bioinformatics Core tools would enable the RNA-Seq and miRNA-Seq data being amassed by the PCBC Cell Characterization Core to be mined for candidate short and long ncRNAs present in low-passage iPSC, but absent in ESC, that may be predictive of differentiation potential. A correlation between robust differentiation capacity and high quality iPSC lines with ESC-like lncRNA and miRNA profiles will be sought. The goal is to utilize tools of bioinformatics to identify potential common mechanisms (signaling pathways, transcription factors, epigenetic regulators, or other ncRNAs) that regulate the aberrant expression of these handicapping ncRNAs and that could be countered with optimized iPSC reprogramming methods.
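Mining expression data for ncRNAs that are present in iPSC but absent in ESC amounts, in its simplest form, to a threshold filter over two expression tables. The sketch below assumes normalized expression values and illustrative `present` and `absent` cutoffs; the function name, thresholds, and identifiers are hypothetical and would need to be chosen from the actual data.

```python
def candidate_ncrnas(ipsc_expr, esc_expr, present=1.0, absent=0.1):
    """Return ncRNAs expressed in iPSC but effectively absent in ESC.

    ipsc_expr, esc_expr : {ncRNA id: normalized expression value}
    present : minimum value to call an ncRNA expressed in iPSC
    absent  : maximum value to call an ncRNA absent in ESC
    An ncRNA missing from the ESC table is treated as not detected (0.0).
    """
    return sorted(
        name
        for name, value in ipsc_expr.items()
        if value >= present and esc_expr.get(name, 0.0) <= absent
    )


if __name__ == "__main__":
    # Placeholder identifiers and values.
    ipsc = {"ncRNA_1": 5.0, "ncRNA_2": 5.0, "ncRNA_3": 0.05}
    esc = {"ncRNA_1": 0.0, "ncRNA_2": 4.0, "ncRNA_3": 0.0}
    print(candidate_ncrnas(ipsc, esc))
```

In practice such a filter would be applied per cell line with replicate-aware statistics; the fixed cutoffs here only show the shape of the query.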


Specific Aim 2: Determine the role of somatic donor cell epigenetic memory in determining or limiting hiPSC differentiation potential. It has been proposed that failure to completely erase the “epigenetic memory” of the donor cell type of origin during factor-driven reprogramming leaves a residual epigenetic signature that may serve as a subsequent handicap (or in some cases advantage) for lineage-specific differentiation. The residual starting cell signature may comprise aberrant patterns of gene expression (protein coding or non-coding), histone modifications, or DNA methylation. We hypothesize that the methods of iPSC generation and source of somatic cell donor can determine the degree of residual epigenetic memory retention in hiPSC lines. The proposed Bioinformatics Core tools would enable the transcriptomic data being amassed by the PCBC Cell Characterization Core to be seamlessly integrated with non-PCBC epigenetic databases into a unified resource, making possible a comprehensive assessment of the degree to which individual iPSC lines characterized by the PCBC harbor residual epigenetic memory of their cell type of origin. The degree of retention of epigenetic memory of source cell origin in studied iPSC lines can then be correlated to the degree of observed inability to differentiate into specific defined (and standardized) lineages currently being employed by the PCBC CCC in Cincinnati. This may not only provide a generalized assay by which hiPSC quality can be measured with regard to its differentiation potency, but it should also identify hiPSC derivation methods that either limit or augment somatic memory retention.


7.2

Using adult hematopoietic progenitor cells to define the molecular basis of lineage commitment.

Hub site: Fred Hutchinson Cancer Research Center (Dr. Beverly Torok-Storb)

Bioinformatics core support: Sage Bionetworks

Our goal is to generate populations of specific progenitors ex vivo to treat lineage-specific cytopenias and/or control commitment to specific lineages in vivo for the same purpose. Both strategies require a map of molecular events responsible for commitment. Theoretically this would be accomplished by characterizing highly purified populations of cells representing pre- and post-commitment stages. Ideally these stages would be identified by their expression of lineage-specific markers; however, marker expression often exists as a continuum, and may not be lineage specific, making isolation criteria somewhat arbitrary.

In an effort to address this problem we are defining populations by their gain or loss of markers in response to specific cytokines, small molecules, and shRNAs. The resulting populations are relatively homogeneous, particularly in their responsiveness to a subsequent set of signals. For example, culture conditions have been developed that drive a subpopulation of CD34+ cells into a narrowly defined population of cells (p1) that can, with manipulation of subsequent culture conditions, commit either to red blood cell precursors (p2) or to megakaryocyte precursors (p3). Comparative transcriptome analysis of the p1, p2, and p3 populations should identify gene products that are and are not involved in commitment of p1 to p2, or p1 to p3.
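The comparison of p1 against p2 and p3 can be sketched as a classification of genes by their log2 fold change along each branch. This is an illustrative simplification that ignores replicates and statistical testing; the function names, the pseudocount, and the fold-change threshold are assumptions rather than part of the proposed analysis.

```python
from math import log2


def log2_fold_change(after, before, pseudocount=1.0):
    """log2 ratio of two expression values, stabilized by a pseudocount."""
    return log2((after + pseudocount) / (before + pseudocount))


def classify_commitment_genes(p1, p2, p3, threshold=1.0):
    """Group genes by whether they change from p1 to p2, p1 to p3, or both.

    p1, p2, p3 : {gene: expression value} for the three populations.
    A gene counts as changed on a branch when the absolute log2 fold
    change relative to p1 is at least `threshold`.
    Only genes measured in all three populations are classified.
    """
    groups = {"p2_specific": [], "p3_specific": [], "shared": [], "unchanged": []}
    for gene in sorted(set(p1) & set(p2) & set(p3)):
        to_p2 = abs(log2_fold_change(p2[gene], p1[gene])) >= threshold
        to_p3 = abs(log2_fold_change(p3[gene], p1[gene])) >= threshold
        if to_p2 and to_p3:
            groups["shared"].append(gene)
        elif to_p2:
            groups["p2_specific"].append(gene)
        elif to_p3:
            groups["p3_specific"].append(gene)
        else:
            groups["unchanged"].append(gene)
    return groups


if __name__ == "__main__":
    # Placeholder genes and values.
    p1 = {"g1": 1, "g2": 1, "g3": 1, "g4": 1}
    p2 = {"g1": 7, "g2": 1, "g3": 7, "g4": 1}
    p3 = {"g1": 1, "g2": 7, "g3": 7, "g4": 1}
    print(classify_commitment_genes(p1, p2, p3))
```

Genes in the branch-specific groups are the natural candidates for involvement in commitment to one lineage, while the shared group points to changes common to leaving the p1 state.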

Identifying pathways that participate in lineage commitment will inform strategies to control commitment events. Ideally we will identify reagents that can be used in vivo. Translating from in vitro studies of human cells to clinical applications will require safety and efficacy studies in large animals. We propose establishing parallels between in vitro results with human cells and dog cells, followed by studies comparing in vitro dog results to in vivo dog results, which should allow for translation from in vitro human to clinical applications. Clearly, comparative characterizations between human and dog progenitor populations would accelerate this process.


7.3

Analysis of cardiac and neural crest (NC) specification in human ES and iPS cells

Hub site: Gladstone Institutes (Dr. Bruce Conklin)

Bioinformatics core support: Gladstone Institutes

The Conklin laboratory within the iPS Cells in Heart Disease hub is currently working to optimize various protocols for the differentiation of highly enriched human cardiac and NC cell populations for the purpose of transcriptome evaluation. Both of these projects involve the analysis of multiple distinct time-points, replicates, protocols and cell lines.

Based on preliminary marker data, cardiac differentiation produces varying degrees of contamination from endodermal cell types, and possibly other cell types, during the protocol. The major goals for this DBP will be to identify and validate new RNA isoforms expressed during the transition from pluripotent to differentiated progenitors, identify new progenitor-specific marker genes unique to these cell states, validate computationally predicted novel markers from meta-analyses and identify non-desired differentiating cell types present at each time-point.

To achieve these goals, we propose to utilize the PCBC Bioinformatics core software pipeline to
evaluate and subsequently validate the use of these bioinformatics methods with the goal of
presenting these results in a high profile publication. Differentially expressed genes and isoforms will
be evaluated using the Cincinnati bioinformatics tools set and the AltAnalyze pipeline. Validation will
entail qPCR, RT-PCR and protein immunohistochemistry. Downstream analyses may entail RNA
isoform inhibition in hESCs during cardiac differentiation to evaluate differentiation outcomes.
Currently, only a small set of NC specific markers exists, many of which appear to overlap with other
progenitor cell-types and NC derived lineages such as melanocytes, bone, smooth muscle, peripheral
neurons and glial cells. For both cardiac and NC, we will evaluate marker gene predictions produced
by the software LineageProfiler that overlap with our in vitro derived cells using RT-PCR and qPCR
compared to other cell types or derived lineages (NC).

Based on these results, we will FACS-sort early progenitors during the differentiation process to
see if we can further enrich for cardiac and NC cells.

For each differentiation state, we will examine the presence of alternative cell types that appear in our
cultures using LineageProfiler. Where alternative cell types are predicted, we will confirm using qPCR
and flow cytometry for corresponding annotated cell-specific markers.
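
As a rough illustration of how alternative cell types might be flagged in a culture, the sketch below scores one sample's expression profile against reference cell-type profiles by Pearson correlation. It is not LineageProfiler's actual algorithm, which is not described here. The reference profiles, expression values, function names, and the cutoff of 0 are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def score_lineages(sample, references):
    """Correlate a sample with each reference profile over their shared genes."""
    scores = {}
    for cell_type, profile in references.items():
        genes = sorted(set(sample) & set(profile))
        scores[cell_type] = pearson([sample[g] for g in genes],
                                    [profile[g] for g in genes])
    return scores


def candidate_contaminants(scores, target, cutoff=0.0):
    """Non-target cell types whose correlation with the sample exceeds cutoff."""
    return sorted(ct for ct, r in scores.items() if ct != target and r > cutoff)


# Hypothetical log2 expression values for four marker genes (illustration only).
REFERENCES = {
    "cardiac":      {"TNNT2": 9.0, "NKX2-5": 8.0, "SOX17": 2.0, "SOX10": 1.0},
    "endoderm":     {"TNNT2": 2.0, "NKX2-5": 1.5, "SOX17": 9.0, "SOX10": 1.0},
    "neural_crest": {"TNNT2": 1.0, "NKX2-5": 1.0, "SOX17": 1.5, "SOX10": 9.0},
}
# A cardiac culture with some endodermal signal (elevated SOX17).
SAMPLE = {"TNNT2": 8.5, "NKX2-5": 7.0, "SOX17": 6.0, "SOX10": 1.2}

scores = score_lineages(SAMPLE, REFERENCES)
best_match = max(scores, key=scores.get)
follow_up = candidate_contaminants(scores, target=best_match)
```

In this toy example the sample correlates best with the cardiac reference, while the endoderm reference also correlates positively. A non-target lineage flagged this way would be a candidate for confirmation by qPCR and flow cytometry, as proposed above.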


8 Additional projects

Proposals for additional driving biological projects will be submitted as 1-page proposals and reviewed
during the PCBC steering committee teleconferences. The resources requested for the Bioinformatics
Core provide support for 3 additional driving biological projects.

Additional projects will be accommodated within the scope of the requested Bioinformatics Core
resources as deemed feasible by Bioinformatics Core PIs. Otherwise, the Bioinformatics Core will
support additional projects from PCBC investigators either through a fee-for-service mechanism or by
funding requests for support of additional biological projects.


9 Project timeline and Guidance

Our goal is to develop the bioinformatics core as a resource to support the PCBC through the duration
of the consortium, creating a high-value and visible output of the PCBC that serves to integrate efforts
across multiple investigators, provide an open-access community data and analysis portal, and
perform cutting-edge analyses at the forefront of stem cell research. The proposed work plan is
intended to run for 3 years, after which time the evolving data, user-based needs, and the best
approach to maximizing the value contributed by the bioinformatics core will be evaluated by the
PCBC steering committee to determine if a proposal will be considered for an extension. Presentations
to the bioinformatics and steering committees will also be regularly scheduled for feedback and
guidance, to ensure that goals are being addressed and met and that new challenges are defined.

The following timelines provide rough milestones of when we expect to accomplish the items listed in
this proposal. A major goal of the bioinformatics core is to be flexible and responsive to consortium
needs. Therefore, the following milestones are intended as a flexible timeline that will be modified,
assessed, and prioritized throughout the duration of the project. We will work with the steering
committee to submit a yearly project plan for its approval. We will join monthly calls of the
Bioinformatics Committee and quarterly calls of the Steering Committee to evaluate progress against
these objectives and revise the work plan accordingly. In particular, research priorities of the
bioinformatics core will heavily depend on the initial 3 driving biological projects, in addition to work
with the Cell Characterization Core, and additional driving biological projects that emerge during the
proposed timeframe. The tools developed by the bioinformatics core are reusable for multiple projects,
and we would be able to support multiple research projects if there is interest from consortium PIs.


Months 0-6



Deployment of website in Sage Synapse allowing PCBC investigators to access all raw data
generated by the cell line characterization core.

Default data processing pipelines developed for all supported platforms, including RNA-seq,
miRseq, ExomeSeq, and ChIP-seq. Processed versions of all datasets distributed to PCBC
investigators.

Ability to access all processed data programmatically through an R interface (in addition to
graphical web clients).

Initial analysis results for driving biological projects. For example, data analysis results from the
current stage of PCBC data sets, which encompasses mRNAseq and miRseq analyses of iPSC,
hESC, EB, and DE sample types as compared with each other and a variety of other human
samples of relevance to hematologic, cardiovascular, and lung development.

Initiation of live web-based video training sessions for PCBC laboratories on the use of tools
developed by the bioinformatics core. Training sessions anticipated to be held bi-monthly or as
determined by the PCBC steering committee.

Standardized ontology for cell line meta-data defined. All samples from cell line
characterization core curated according to this ontology.

Web portal allowing investigators to perform sample and analysis curation using highly
structured ontologies for PCBC meta-data including description of sample origins, derivation
methodologies, analysis technology, and data processing and analysis and interpretation
details.


Months 6-12



Graphical workflow tools to visualize data processing and analysis pipelines applied to PCBC
data.

Versioning capabilities to allow data to be continuously updated with links stored to frozen
versions of datasets used in previous analyses.

Support for rapid upload/download of large data files into Synapse through integration with
Globus / GridFTP.

Tools for heatmap generation, as well as rendering of results into network diagrams for
visualization and connectivity analyses.

Ability to access all processed data programmatically in R and Python.

Testable hypotheses developed for all driving biological projects and experiments initiated
based on analysis performed by the Bioinformatics Core in collaboration with PCBC
investigators.

Curation of publicly available stem cell datasets, and distribution of these data in Synapse in
the same format as PCBC-generated data.

Linking output of RNA-seq data processing pipelines built in Galaxy to automatically be
hosted in Synapse, and providing data interchange tools allowing datasets generated by
Synapse data processing workflows to feed into bioinformatics analysis tools supported by
Galaxy.

Deployment of PCBC communication tools including wikis, forums, and message boards.
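
The versioning milestone in the list above can be pictured with a small sketch: every update to a dataset creates a new immutable version, and an analysis stores the version number it used, so it can always be traced back to the exact frozen data. This is a generic illustration and not the Synapse API. The class names, fields, and sample data are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetVersion:
    """One immutable snapshot of a dataset."""
    number: int
    content: bytes
    sha256: str


@dataclass
class VersionedDataset:
    """A dataset whose updates append new versions instead of overwriting."""
    name: str
    versions: list = field(default_factory=list)

    def update(self, content: bytes) -> DatasetVersion:
        version = DatasetVersion(
            number=len(self.versions) + 1,
            content=content,
            sha256=hashlib.sha256(content).hexdigest(),
        )
        self.versions.append(version)
        return version

    def latest(self) -> DatasetVersion:
        return self.versions[-1]

    def get(self, number: int) -> DatasetVersion:
        return self.versions[number - 1]


@dataclass
class AnalysisRecord:
    """An analysis together with a link to the frozen dataset version it used."""
    description: str
    dataset_name: str
    dataset_version: int


# Hypothetical usage: the analysis keeps pointing at version 1 after an update.
rnaseq = VersionedDataset("rnaseq_counts")
v1 = rnaseq.update(b"gene,count\nTNNT2,10\n")
analysis = AnalysisRecord("differential expression", rnaseq.name, v1.number)
rnaseq.update(b"gene,count\nTNNT2,12\n")
frozen = rnaseq.get(analysis.dataset_version)
```

Storing a content hash alongside each version is one simple way to verify later that the frozen data behind a published analysis has not changed.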


Years 1-2



Collect, organize, and annotate clusters of genes, transcripts, isoforms, promoters, and miRs
from all samples and sample types (undifferentiated stem cell clusters, differentiation clusters,
treatment and disease effect clusters).

Data analysis pipelines for commonly used procedures including the generation of expression
signature gene lists, list comparisons, and gene set enrichment and biological network
analysis for identifying important genes, biological processes, regulatory and biological
state-determining mechanisms that are reflective of all progenitor cell-related origins,
characteristics, and differentiative capabilities.

Identification of pathways responsible for normal development, differentiation, and disease
pathogenesis.

Ability for any user to save, retrieve, and perform downstream analysis of all the Sage and
Galaxy-based pipeline results from progenitor cell characterizations.

Publications from all driving biological projects.

All analyses performed in driving biological projects thoroughly documented and reproducible,
providing a robust set of analysis tools and training material to distribute publicly as a
PCBC-developed resource.

Present clusters of ES/iPS/other stem cell or somatic cell samples in an interactive format to
retrieve cluster content information and individual gene information.

Provide network generation capability for PCBC data to allow users to compare and contrast
the molecular signatures and activities of all gene clusters and sets.

Feature-based network analysis of core human stem cell genes associated with Cell
Projection organization and regulation.
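
One common way to carry out the gene set enrichment step mentioned in the list above is an over-representation test based on the hypergeometric distribution: given a signature gene list, how surprising is its overlap with an annotated gene set? The sketch below is a minimal, assumed implementation of that standard test. It is not the method used by any specific PCBC tool, and the gene identifiers and function names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, set_size, list_size, overlap):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a population that contains set_size annotated genes."""
    upper = min(set_size, list_size)
    total = comb(population, list_size)
    tail = sum(
        comb(set_size, k) * comb(population - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def enrichment(signature, gene_set, background):
    """Overlap and one-sided p-value, restricted to the background genes."""
    signature = set(signature) & set(background)
    gene_set = set(gene_set) & set(background)
    shared = signature & gene_set
    p_value = hypergeom_enrichment_p(
        len(background), len(gene_set), len(signature), len(shared)
    )
    return sorted(shared), p_value


# Hypothetical identifiers: 10 background genes, 3 annotated, all 3 in the list.
background = {f"G{i}" for i in range(10)}
pathway = {"G0", "G1", "G2"}
signature = {"G0", "G1", "G2"}

shared, p_value = enrichment(signature, pathway, background)
```

In practice, p-values from many gene sets would also need a multiple-testing correction, which this sketch leaves out.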


Years 2-3



Carry out deep analyses of PCBC data that facilitate the formulation and execution of a
next-generation research agenda, in particular addressing the areas of cell/transplantation
competence, disease effects, disease model effect signatures and modifiers.

Execution of driving biological research projects resulting in multiple publications using tools
from the PCBC bioinformatics core.

Network-based differentiation signatures inferred to describe differentiation of progenitor cells
to multiple cell fates studied by PCBC investigators.

Search engine to test how genes of interest cluster in the PCBC samples.

Links to other stem cell databases/search engines such as FunGenES, Amazonia, and MEM
(Multi Experiment Matrix).

Completion of research study identifying underlying cis-regulatory mechanisms, epigenetic
modifications, pathway activities, and biological process-associated genes and protein
interactions.


10 Training and user support

1. The Bioinformatics Core will host an online forum to answer all bioinformatics-related
questions from PCBC investigators and trainees. A member of the Bioinformatics Core will
monitor the forum. Trainees will be encouraged to use the forum as a resource to obtain
expert advice in their bioinformatics training. A member of the Bioinformatics Core will aim to
answer all questions within 48 hours of submission.

2. Bioinformatics Core sites (Cincinnati and Sage Bionetworks) will host visiting scientists
interested in pursuing deeper immersion in bioinformatics training. We envision being able to
accommodate all requests for visiting scientists for periods up to several months (or possibly
longer as arranged specifically with PCBC PIs).

3. The Bioinformatics Core will develop training material, including online documentation,
tutorials, and recorded YouTube demonstrations of the use of all tools and
resources developed by the Bioinformatics Core. Research done in the context of Driving
Biological Projects will drive the development of reusable analytical tools that serve as the
basis for training materials. Training material will be presented to PCBC trainees in the manner
deemed most appropriate by the PCBC steering committee, including recurring webinar
training sessions, recorded content developed into an online short-course, or short-courses
hosted at Bioinformatics Core sites given to groups of visiting trainees.

4. The Bioinformatics Core, in collaboration with Dr. Torok-Storb, will develop a pilot project for
mechanisms of giving meaningful credit attribution to faculty and trainees who contribute data,
reagents, or results to the PCBC Data Portal, analytical tools, or reagents library. Possible
mechanisms include GitHub-style user profiles associated with digital content provided by
contributors, a system of citable credits assigned to user profiles based on defined criteria, or
an index of all contributions of code, results, reagents, etc. made by individuals. This effort will
also serve as both a monitor of, and incentive for, collaboration in the form of data contribution,
a main goal of the PCBC.

11 Project resources and personnel

Sage Bionetworks personnel

1. 75% blended support for software engineering (led by Dr. Bruce Hoff) for core infrastructure
development.

2. 75% support for research scientist (Dr. Larsson Omberg) for bioinformatics analysis in driving
biological projects.

3. 75% support for development of data processing pipelines, analysis tools, and data set
aggregation and curation (led by Dr. Brig Mecham) for user interactions and analyses with
PCBC members and other Bioinformatics Core members.

4. 20% support for Dr. Adam Margolin for project oversight as well as effort to work with PCBC
researchers.

Sage Bionetworks staffing levels include user support and training for PCBC investigators, including
rapid response to bioinformatics-related PCBC questions through forums and mailing lists; hosting of
visiting scientists at Sage Bionetworks for periods up to several months; and development of training
materials distributed online and optionally taught at short courses at Sage Bionetworks.


Cincinnati Bioinformatics personnel

1. 75% support for senior software engineer (Scott Tabar) for PCBC genomics web portal
development that allows for user access of finished results for secondary analyses based on
focused comparisons and user-specified areas of interest.

2. 50% support for hands-on user support bioinformatician (Mayur Sarangdhar, PhD) to work
with users on driving biological projects and to carry out requested and mapped-out analyses,
continuous testing of results, and user interactions with PCBC and to feed analysis results into
the PCBC results portal.

3. 20% support for Phillip Dexheimer for the development of integrated sample metadata
aggregation and APIs, initial pipeline raw data processing and automated transfer of all data,
metadata, and results sets.

4. 20% support for Dr. Anil Jegga, D.V.M, MSc for development of downstream web portal data
integration and data mining user interfaces including the network analyses and functional
enrichment for prediction and prioritization of candidate genes, pathways and networks as well
as effort to work with PCBC researchers.

5. 20% support for Dr. Bruce Aronow for project oversight as well as effort to work with PCBC
researchers.


Gladstone Bioinformatics core personnel

1. 20% support for software engineer (Alexander Williams, MS) for LineageProfiler infrastructure
development, dataset evaluation and Cytoscape application development.

2. 15% support for software engineer (Justin Nand, BS) to integrate AltAnalyze and associated
software with the Cincinnati data analysis pipeline and web-based visualization of alternative
exon analysis data through the PCBC results portal. (Could be redundant with Scott Tabar and
could be eliminated)

3. 30% support for Dr. Nathan Salomonis for AltAnalyze and LineageProfiler infrastructure
development and database support, project oversight as well as effort to work with PCBC
researchers.


Vanderbilt core personnel

1. 5% support for Dr. Antonis Hatzopoulos, who will work closely with the Bioinformatics Leads in
the Bioinformatics Core on the design of the PCBC Database web portal for user interface,
data access, and key functions to be enabled including network, pathway, and functional
characterization analyses using PCBC and all other available genomics data.


University of Maryland Baltimore (Administrative Coordinating Center) personnel

1. 10% support for Dr. Lynne Schriml, the PCBC Administrative Coordinating Center
co-investigator and ontologist, who will transfer processed data files to Livelink for data sharing
with PCBC U01 Hub Site investigators and NHLBI program officers. She will curate those data
files and be responsible for their organization and naming conventions in a manner that makes
them accessible, intelligible and readily searchable for the PCBC. Dr. Schriml will assist with
the development of ontologies for the processed bioinformatics data.


Total Resources

Project resources to support: 1) personnel across 3 institutions (plus minor support for Vanderbilt and
Maryland); 2) Four initial driving biological projects with support for at least 3 additional projects to be
defined; 3) Training and user support.

1. Total funded project FTEs

a. 2.35 software engineers

b. 2.2 bioinformatics scientists

c. 0.55 PI / admin

d. 5.1 total FTEs

If approved by the steering committee, a detailed budget will be prepared to support the 5.1 FTE level
of work, which will amount to direct costs greater than $500,000 and less than $1 million per year for a
period of 3 years.
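
The 5.1 FTE figure can be cross-checked against the percent-effort allocations listed per site in this section. The short sketch below sums those percentages as integers and compares the result with the category totals. The variable names are illustrative, and the proposal does not state which individuals fall into which category, so only the grand totals are compared.

```python
# Percent effort per person, in the order listed under each site in this section.
EFFORT_PERCENT = {
    "Sage Bionetworks": [75, 75, 75, 20],
    "Cincinnati": [75, 50, 20, 20, 20],
    "Gladstone": [20, 15, 30],
    "Vanderbilt": [5],
    "Maryland": [10],
}

# Category totals from the FTE summary, in hundredths of an FTE.
CATEGORY_FTE_HUNDREDTHS = {
    "software engineers": 235,
    "bioinformatics scientists": 220,
    "PI / admin": 55,
}

site_totals = {site: sum(values) for site, values in EFFORT_PERCENT.items()}
total_percent = sum(site_totals.values())
category_total = sum(CATEGORY_FTE_HUNDREDTHS.values())

for site, percent in site_totals.items():
    print(f"{site}: {percent / 100:.2f} FTE")
print(f"Total from personnel lists: {total_percent / 100:.2f} FTE")
print(f"Total from category summary: {category_total / 100:.2f} FTE")
```

Both routes give 510 hundredths, that is 5.10 FTE, so the per-site allocations and the category summary agree.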


12 Research Data Sharing Plan

Genomics and Bioinformatics Data

The Cincinnati and Sage Bionetworks groups will work together to take responsibility for distributing
Atlas-level data through Sage Synapse, which can host data in a private project allowing PCBC
investigators to grant access to collaborators, and open the project for public access at the
appropriate time. As requested by investigators, data will also be posted to appropriate NCBI data
repositories upon public release. This includes mutation data, mRNA and miR expression data, and
RNAseq splice form summaries. Both summarized and raw next-generation sequencing data will
be placed in repositories as appropriate.


As is key for the overall infrastructure and analytical service core plan, we will establish a unified
open access web portal that will integrate access to the multiplicity of data and analysis sources and
infrastructure in order to carry out integrative data analyses that are critical for establishing the
biological mechanisms responsible for progenitor cell biology and that facilitate the ability of progenitor
cells to be used both therapeutically and as disease models. All data will be completely
downloadable and mineable from links provided from the common portal website, including gene,
disease, pathway, gene expression data and all available gene and disease-based knowledge of
gene-pathway-phenotype-disease relationships aggregated from many different database resources.


The primary criterion for success in this project is integral to the Research Data Sharing Plan
itself: the fundamental commitment is that the combined, linked, and versatile web resources that are
created here must be directly shareable with, useable by, and directly beneficial to trainees,
researchers, and clinicians for improved modeling, analysis, exploration and discovery of optimal
progenitor cell generation and differentiation, and for the ability to model disease mechanisms and
evaluate disease treatments.


Data Linking between Cell Characterization Core Technologies and Sample Metadata

We will build a complete pipeline that links the PCBC sample submission server, and the entirety of the
data entered there, with cell characterization core procedures and data outputs, feeding a unified
database that will provide summaries and details of all associated information and aggregated analyses.
The pipeline will be implemented using standard linked web server application interface approaches and
will be seamlessly connected between the PCBC administrative hub, the cell characterization core lab,
and the GALAXY, SAGE, and CHMC data/analysis/interpretation level application environments.


References

Derry, J. M. J., Mangravite, L. M., Suver, C., Furia, M. D., Henderson, D., Schildwachter, X., Bot, B.,
et al. (2012). Developing predictive molecular maps of human disease through community-based
modeling. Nature genetics, 44(2), 127-30. Retrieved from
http://www.ncbi.nlm.nih.gov/pubmed/22281773

Emig, D., Salomonis, N., Baumbach, J., Lengauer, T., Conklin, B. R., & Albrecht, M. (2010).
AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic acids
research, 38(Web Server issue), W755-62. Retrieved from
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2896198&tool=pmcentrez&rendertype=abstract

Musen, M. A., Noy, N. F., Shah, N. H., Whetzel, P. L., Chute, C. G., Story, M.-A., & Smith, B. (2011).
The National Center for Biomedical Ontology. Journal of the American Medical Informatics
Association : JAMIA. doi:10.1136/amiajnl-2011-000523

Ryan, M. C., Cleland, J., Kim, R., Wong, W. C., & Weinstein, J. N. (2012). SpliceSeq: A Resource for
Analysis and Visualization of RNA-Seq Data on Alternative Splicing and Its Functional Impacts.
Bioinformatics (Oxford, England). Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/22820202

Zambon, A. C., Gaj, S., Ho, I., Hanspers, K., Vranizan, K., Evelo, C. T., Conklin, B. R., et al. (2012).
GO-Elite: A Flexible Solution for Pathway and Ontology Over-Representation. Bioinformatics
(Oxford, England). Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/22743224

Kaimal V, Bardes EE, Tabar SC, Jegga AG, Aronow BJ. ToppCluster: a multiple gene list
feature analyzer for comparative enrichment clustering and network-based dissection of
biological systems. Nucleic Acids Res. 2010 May 25. PMID: 20484371

Gudivada RC, Qu XA, Chen J, Jegga AG, Neumann EK, Aronow BJ. Identifying disease-causal
genes using Semantic Web-based representation of integrated genomic and phenomic knowledge.
J Biomed Inform. 2008 Aug 23. 105(40):15493-8.




Budget Justification and Summary of Duties

Cincinnati Children’s Hospital Research Foundation


Personnel


Bruce J. Aronow, Ph.D. (2.4 calendar months) currently directs the Cincinnati Bioinformatics Team
and will continue to play a co-leading role in the establishment of the PCBC Integrative
Bioinformatics Core. Aronow’s group will take specific responsibility for managing the integration of all
sets of results from the SAGE pipeline into a single datamine interface that encompasses the pipeline
processing and analytic efforts of the Cincinnati, SAGE, and Gladstone bioinformatics groups. Aronow
will manage the overall goal that the Progenitors Cells Datamine must provide researchers with
useful, informative, and significant insights into the molecular and biological properties of human
progenitor cells. Aronow will share responsibilities with Drs. Margolin and Salomonis to ensure that
automated data analyses are accurate and complete, that there is consistent data quality monitoring,
and that there is continuous development of analyses that provide specific significant signatures and
results, and tables and visualizations, that provide useful genomic data and insights for the
entire research consortium. Dr. Aronow will also work with Carolyn Lutzko and others of the Cell
Characterization Core and individual hub PIs and research team members to ensure that next-stage
experimental designs improve in response to emerging data analyses and that the
results of the genomic characterization of progenitor cells and their derivatives can be evaluated
relative to all other key heart, lung, and blood related reference samples against which the
progenitor-derived materials may be usefully compared. With Dr. Lutzko, Dr. Aronow will also have
responsibility for ensuring that all results and reports are documented and that website data
representations are accessible and useable for focused analyses and the preparation of manuscripts
and reports, and will work with the Sage Group to ensure that these can be carried out reproducibly
from PCBC genomic data.


Anil Jegga, D.V.M., M.S. (1.80 calendar months) is an Assistant Professor of Pediatrics in the
Division of Biomedical Informatics at Cincinnati Children’s Hospital Medical Center. Dr. Jegga will be
responsible for helping the web site developers and data end-users with the data integration,
representation, overview analyses and data mining including progenitor and differentiated cell
multi-pathway interactions, novel pathways and biological networks. Dr. Jegga is an expert in
transcriptional regulatory mechanisms and will engage in user support and education, and will
facilitate higher order analyses and other functional enrichment analyses related to transcriptional
and microRNA-based mechanisms encompassing features and interactions of promoters, enhancers,
transcription factors and miRs. Specifically, he will lead efforts in the network and topology-based
analyses of data so as to facilitate functional analysis of genes and prediction and prioritization of
candidate genes, pathways and networks with some emphasis on gene expression regulation.
He, along with Dr. Aronow, will also oversee design and development of algorithms and approaches
for PCBC consortia data mining and analyses.


Scott Tabar, Sr. System Programmer (6 calendar months) will be responsible for establishing the
PCBC results mine using GATACA and Toppgene Suite interoperability and for developing user
interfaces. Specifically, he will develop advanced and user-friendly systems that will help the
end-users to readily save, retrieve, and analyze all the SAGE and Galaxy-based pipeline results from
progenitor cell characterizations. He will work under direct supervision of Drs. Aronow and Jegga and
also work very closely with Sage advanced bioinformatics programmers Drs. Hoff and Mecham
and Dr. Nathan Salomonis at the Gladstone to achieve completely seamless user access and
functionality.


Phillip Dexheimer, Sr. System Programmer (2.40 calendar months) will be responsible for
aggregating and analyzing the microarray and next-generation RNAseq and expression data (both from
the PCBC consortia and other publicly available data sets) to facilitate querying across the multiple
datasets. He will work under direct supervision of Dr. Aronow.


Mayur Sarangdhar, Research Fellow (2.40 calendar months) will be responsible for developing
algorithms and computational approaches for knowledge-mining from the aggregated expression data
sets. The algorithms and computational approaches developed will be made available as part of
Gataca and ToppGene applications after validation for the users. He will work under direct supervision
of Dr. Aronow.



Sage Bionetworks Budget Justification


The following staffing levels include user support and training for PCBC investigators,
including rapid response to bioinformatics-related PCBC questions through forums and
mailing lists; hosting of visiting scientists at Sage Bionetworks for periods up to several
months; and development of training materials distributed online and optionally taught at
short courses at Sage Bionetworks.


Personnel


Adam Margolin, Ph.D.

Role: Principal Investigator, Computational Biology, Sage Bionetworks (2.4 calendar months)

Dr. Margolin directs the Computational Biology team at Sage Bionetworks, overseeing the
development of novel computational methods for predicting disease-related phenotypes, developing
large-scale data processing and sharing tools, and leading several consortium-based projects to
leverage cloud-enabled computing technologies for collaborative analysis across a distributed network
of investigators. Dr. Margolin will play a co-leading role in the PCBC Bioinformatics Core and will
direct the Sage-related activities described below and within the proposal, and ensure integration with
the components of the core led by Drs. Aronow and Salomonis. Dr. Margolin will take direct
responsibility for overseeing development of Synapse as a hub for sharing data, results, and code
within the PCBC, and ultimately with the broader community, as well as developing the data
processing pipelines used to provide high-quality versions of all PCBC datasets. Dr. Margolin will also
lead research efforts of the Driving Biological Projects, which will both drive functionality of the Core
and provide concrete demonstrations of its value in yielding biological insights.


Dr. Margolin is a pioneer in computational approaches for inferring cellular regulatory networks, and
developing predictive models linking alterations in cellular networks to genotype-specific cancer
therapeutics. Prior to joining Sage Bionetworks as Director of Computational Biology, Dr. Margolin
initiated and led the genotype-specific therapeutics initiative at the Broad Institute, leading to
development of approaches to analyze data from genetic characterization of cancer cell lines coupled
with viability screens following small molecule treatment. The results of this work were recently
published in Nature and Cancer Cell, demonstrating that most well characterized genetic alterations
known to confer sensitivity to particular compounds can be discovered in an unbiased de novo
analysis, and that cell lines sensitive or resistant to compound treatment could be predicted with high
accuracy. Dr. Margolin’s expertise and leadership will be critical in developing and applying analytical
methods to interpret the data generated in this research program.


Brigham Mecham, Ph.D.

Role: Computational Biologist / Data Curator, Sage Bionetworks (9 calendar months)

Dr. Mecham will lead the scientific computing effort to provide high quality, curated data to the PCBC
community through development and documentation of, and training on, data processing scripts
used to standardize PCBC-generated omics and phenotypic data in computable formats, and through
developing data integration with publicly available stem cell resources. Dr. Mecham is the key
contributor to Sage Bionetworks’ commons data repository effort and has over 10 years of experience
in developing curation, normalization and data management bioinformatics tools. During his Ph.D.
research, Dr. Mecham developed the widely used Supervised Normalization for Microarrays method
to robustly account for confounding variables in microarray normalization. At Sage Bionetworks, Dr.
Mecham has extended this work to build a repository of over 11,000 high-quality curated datasets,
including all data stored in large publicly accessible resources such as GEO, Array Express, and
TCGA. Dr. Mecham has co-authored several papers demonstrating the power of leveraging public
data to develop molecular signatures of stem cell fates, including the discovery of a molecular
“developmental yardstick” through mining of public resources. This study identified and curated nearly
100 stem cell datasets from the public domain, which will be integrated into the Data Portal developed
by the PCBC Bioinformatics Core.


Larsson Omberg, Ph.D.

Role: Computational Biologist / Analyst, Sage Bionetworks (9 calendar months)

Dr. Omberg will serve as the primary research scientist at Sage Bionetworks, and will perform
bioinformatics analysis for the driving biological research projects and provide bioinformatics support
and training for additional research projects driven by PCBC investigators. This research will have the
dual objective of performing deep bioinformatics analysis to advance the scientific questions of
interest to PCBC, and performing such research through use of robust, transparent analytical
pipelines to create computational resources that can be reused in other PCBC projects, provide
training materials, and drive development of a publicly available PCBC-developed analytical resource.


Dr. Omberg is an established researcher in mathematical approaches for extracting genomic
phenotypes and disease signals from system level biological data. As a graduate student and
postdoctoral researcher at the University of Texas, Dr. Omberg developed a generalization of
principal component analysis to tensors, or higher dimensional data. Applying these methods to
perturbations of synchronized cell lines, he was able to predict and experimentally verify a strong link
between transcription and replication. During his second postdoc at Cornell University he was integral
to a collaboration with the Genetic Medicine group at Weill Cornell Medical School to look for genetic
and genomic traits of smoking and COPD as well as developing techniques for correcting for
population structure in genetic studies. The former led to publications focusing on translational
medicine, and the latter has garnered interest among the general population as suggestive links
between Pygmy height and their genetics were discovered. Dr. Omberg is a Senior Scientist at Sage
Bionetworks, where his expertise in tensor-based analysis and experience with integrative data
analysis and interpretation will be critical to this research program.


Software engineering and platform support (9 person/months)

The software and platform support responsibilities will be distributed among the experts listed below:

Bruce Hoff, Ph.D.

Role: Software Engineer, Sage Bionetworks

Dr. Hoff will serve to integrate analytical tools and algorithms into the Synapse platform. He is
an exceptional software designer and engineer with over 20 years of experience and deep
expertise in genomic data analysis, computer modeling, machine learning and data mining.
His experience with multiple languages and operating systems makes him an essential member
of the project team.


John Hill

Role: Software Engineer, Sage Bionetworks


John Hill will build the web services and core Synapse functionality used to host and distribute
versioned data, code, and models. John has over 10 years of industry experience as a
professional software engineer. John has experience with a range of software technologies,
with a focus on developing enterprise-class Java-built web services and applications. John
built a variety of software products targeting the pharmaceutical industry market, and was
instrumental in designing and building the text mining and ontology management aspects of
Teranode Fuel.


Xavier Schildwachter

Role: Database architect, Sage Bionetworks

Xavier Schildwachter will be the primary database architect and systems administrator of the
Synapse database used to host PCBC data. Xavier has more than 15 years of experience in
software design engineering and in defining and managing the development of systems that
hold raw and analyzed data, as well as the high-throughput analysis pipelines required to
process the data. He has written requirements for and managed development of many internal
research software systems at Merck and defined many of the research computing and
knowledge management strategies for the company. He has completed an M.S. in computer
science from the Universite Libre de Bruxelles.



Gladstone Institutes Budget Justification


Personnel


Dr. Nathan Salomonis, Software Engineer (3 calendar months)

Dr. Salomonis is a software developer and experimental biologist, with over 14 years of experience
analyzing genomics datasets and developing associated bioinformatics methods. He is the principal
software architect and developer of AltAnalyze, LineageProfiler and GO-Elite. He is also a developer
for several related applications including DomainGraph, Mosaic, OncoSplit and GenMAPP-CS.
Through independent and collaborative research, Dr. Salomonis has obtained considerable
experience with next generation RNA-sequence analysis, in particular evaluating the effects of
alternative RNA isoform regulation at the level of associated protein isoforms, domains and microRNA
binding sites. He has mentored several students in the design of high-throughput data analysis and
visualization plugins for the network visualization tool Cytoscape through Google Summer of Code
and the National Resource for Network Biology academy. During the funding period of this
proposal, Dr. Salomonis will have several roles, including providing overall project oversight,
supervision of Mr. Williams and Mr. Nand in their work to integrate existing and novel bioinformatics
methods into the Synapse and Galaxy workflows, database content management and software
development. As an important part of this work, Dr. Salomonis will work with other PCBC investigators
and the Cell Characterization Core to implement specific feature requests, such as the addition of new
analysis methods, annotation resources and reference expression profiles (e.g., RNA-Seq datasets
for LineageProfiler). Dr. Salomonis will also prepare related manuscripts and present talks and
tutorials at international meetings.


Alexander Williams, Software Engineer (3 calendar months)

Mr. Williams has Master’s degrees in both Computer Science and Bioinformatics, with ten years of
experience as a programmer and five years of experience in bioinformatics. He has an extensive
software development track record, which includes the tool GenomeMixer, Cytoscape plugin
development and an application for the iPhone. In the Bioinformatics Core at Gladstone, Mr. Williams
has developed high-throughput data analysis pipelines for microarray, RNA-seq, and ChIP-seq
analyses. This work includes genome sequence alignment, quality control, differential expression
analysis, data visualization, biostatistics and vetting of software. For the duration of this proposal, Mr.
Williams will enhance LineageProfiler biostatistical methods; identify, integrate and evaluate
reference RNA-Seq datasets; and develop an independent Cytoscape version of LineageProfiler
providing additional visualization methods.


Justin Nand, Software Engineer (1.8 calendar months)

Mr. Nand has a Bachelor’s degree in Bioengineering and extensive experience in data mining, next
generation sequence analysis pipeline development, genome assembly, machine learning, database
and web-based content management. Mr. Nand has expertise in over 6 scripting and object-oriented
programming languages as well as multiple database management systems. As support for a large
next generation sequence analysis group at the University of California, San Diego, he managed and
extended a web-based Galaxy pipeline instance, integrating additional open-source and custom
analysis tools. At the Gladstone Institutes he has contributed significantly to the production and
development of the new Gladstone website, focusing on backend content management using Drupal
as well as implementing and deploying multiple in-house databases. Mr. Nand will contribute to this
project by integrating multiple resources with the Cincinnati data analysis pipeline. These
resources include AltAnalyze, LineageProfiler and existing web-based visualization software for
alternative splicing analysis data, made available through the PCBC results portal.



Vanderbilt University Budget Justification


Personnel


Antonis Hatzopoulos, Ph.D. (0.6 calendar months)

Dr. Hatzopoulos, Project Director and Principal Investigator of the PCBC Vanderbilt Hub, is an Associate Professor of
Medicine. He coordinated the development of the Functional Genomics in Embryonic
Stem cells (FunGenES) database. The database is based on gene expression profiling data of
mouse embryonic stem cells obtained under a series of experimental conditions before and during
differentiation. He has experience with bioinformatics analysis and database organization of large
expression profiling datasets. He will work closely with the bioinformaticians in the Bioinformatics
Core on the design of the PCBC Database web portal for data storage and access, as well as on network,
pathway, and functional characterization analyses using PCBC and all other available genomics
data.


Resources and Environment



CCHMC - Cincinnati Children’s Equipment and Resources for BIOINFORMATICS AND COMPUTING


Dr. Aronow’s Genome Bioinformatics Core Facility at Cincinnati Children’s Hospital Research
Foundation provides both analytical infrastructure and dedicated support for investigators to
work with or from pre/auto-processed genomic technology data, and provides training for the use and
analysis of genomics data and knowledge. The bioinformatics core also provides investigators with
data and knowledge resources extracted from published, publicly available, and locally generated
high-throughput data and, using web applications, makes these available for a variety of applications
and downstream analysis goals.


Visualization of RNAseq, exome, CNV and epigenomic data is generally enabled via the UCSC
Genome Browser and track data for the IGV browser. For the PCBC portal, we will use this
approach for some simple tasks, but as development of the proposed PCBC portal progresses, we will
provide a seamless connection for users to perform primary data processing and visualization of
RNAseq and ChIP-Seq results using GALAXY. Taken together with the large-scale comparative
analysis of the SAGE SYNAPSE environment, this will provide a powerful analytical and discovery
environment for PCBC users at every level of expertise.


Additional powerful resources are available through the Genomics Core and Division of Biomedical
Informatics, including compute-cluster enabled use of the Trapnell and Salzberg Tuxedo suite of
Bowtie, Tophat, and Cufflinks for spliced and unspliced RNAseq alignment, isoform assembly and
quantitation; R, SAS, Matlab, SPSS, SUDAAN, Affymetrix Expression Console, CLUSTER,
TREEVIEW, Partek, and GeneSpring; and the JBrowse and UCSC Genome Browser viewer applications,
for scanning, normalization, statistical, and visualization analyses respectively.



Software: The bioinformatics group in Cincinnati carries out extensive software development. Most
programming is done in Java, using Eclipse and the Struts framework, with statistical and
mathematical functions carried out using the R-Server application programming interface. The Jakarta
Project’s Struts Framework allows for the use of Servlets, JSP, custom Struts tag libraries and other
components within a unified framework that supports rapid design and deployment of web applications.
The value of frameworks is realized in the ordered and componentized nature of code, which facilitates
maintenance, update, reuse, and Tomcat deployments.


Centralized authentication (LDAP) servers serve unique usernames and passwords to all
investigators. Advanced developer tools (Eclipse IDE, CVS, Ant, compilers, debuggers,
graphical development environments) and parallel computing software (MPI, PVM) are also used
extensively.


Data Security: All data collection systems incorporate a multi-layered data security approach through
the use of roles, user accounts, and passwords. To access data, users must be assigned to one or
more research projects (role) and have a unique login (account) and password. This is important
since the nature of the DPI is to provide informatics support to multiple research initiatives. Using this
methodology, individual investigators will only have access to their own data (roles), accessible only
with a login account and password. Since the principal investigators are considered the owners of the
data, they must authorize all roles and user accounts. In addition to standard authentication, secure
data are protected by a dual Atheon firewall system. These firewalls are placed between the
“outside world” and the CHRF network and prohibit unauthorized external computing protocols and
users. All data will be stored and accessed in accordance with the HCFA’s Internet Security Policy
(http://aspe.os.dhhs.gov/admnsimp/nprm/sec09.htm) and other state and local requirements. A critical
aspect of the proposed system is data security and confidentiality. We also intend to follow the Health
Insurance Portability and Accountability Act of 1996 (HIPAA) guidelines for handling clinical data. All
data within the DPI will meet or already meet these guidelines.
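
The role, account, and password model described above can be sketched as follows. This is only a minimal illustration and not the DPI's actual implementation: the class and method names (AccessRegistry, grant_role, can_access) are hypothetical, and the use of salted PBKDF2 password hashing is an assumption, since the proposal does not say how credentials are stored.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass, field


def hash_password(password: str, salt: bytes) -> bytes:
    # PBKDF2 is an assumption; the proposal does not describe password storage.
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


@dataclass
class Account:
    login: str
    salt: bytes
    password_hash: bytes
    # Roles are the research projects this user has been assigned to.
    roles: set = field(default_factory=set)


class AccessRegistry:
    """Hypothetical registry: data access requires an account, a password and a role."""

    def __init__(self):
        self.accounts = {}
        self.project_owner = {}  # project name -> login of the owning PI

    def add_account(self, login: str, password: str) -> None:
        salt = os.urandom(16)
        self.accounts[login] = Account(login, salt, hash_password(password, salt))

    def register_project(self, project: str, pi_login: str) -> None:
        self.project_owner[project] = pi_login

    def grant_role(self, project: str, login: str, authorized_by: str) -> None:
        # The principal investigator owns the data and must authorize every role.
        if self.project_owner.get(project) != authorized_by:
            raise PermissionError("only the project's PI may authorize roles")
        self.accounts[login].roles.add(project)

    def can_access(self, login: str, password: str, project: str) -> bool:
        account = self.accounts.get(login)
        if account is None:
            return False
        candidate = hash_password(password, account.salt)
        # Constant-time comparison avoids leaking hash prefixes through timing.
        if not hmac.compare_digest(candidate, account.password_hash):
            return False
        return project in account.roles
```

In this sketch a valid login alone is not sufficient: access is granted only when the account also holds a role for the requested project, and that role can only have been granted by the project's principal investigator.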


Internet Connectivity: The DPI provides redundant Internet connectivity that includes Internet and
Internet2 connectivity through ORNet, Ohio’s academic Internet consortium, via OC48 (2.488 Gbps)
bandwidths, with redundancy through Cincinnati Bell at OC3 (155.52 Mbps).
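
To give a sense of scale for the two link speeds quoted above, the following sketch computes an idealized lower bound on transfer time for a dataset. The function name and the 1 TB dataset size are hypothetical, chosen only for illustration; the calculation ignores protocol overhead and link sharing, so real transfers would take longer.

```python
OC48_BPS = 2.488e9   # OC48 line rate in bits per second, as quoted in the text
OC3_BPS = 155.52e6   # OC3 line rate in bits per second, as quoted in the text


def ideal_transfer_hours(size_bytes: float, link_bps: float) -> float:
    """Lower bound on transfer time: payload bits divided by the line rate."""
    return size_bytes * 8 / link_bps / 3600


if __name__ == "__main__":
    one_tb = 1e12  # hypothetical 1 TB (decimal) dataset
    for name, rate in (("OC48", OC48_BPS), ("OC3", OC3_BPS)):
        print(f"{name}: {ideal_transfer_hours(one_tb, rate):.1f} h")
```

Under these idealized assumptions, a 1 TB dataset would take just under an hour over the OC48 link and about 14 hours over the OC3 backup link, a factor of roughly 16.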


AVAILABLE EQUIPMENT

Bioinformatics and Genomics


Hardware:



A 1000+ CPU Linux cluster for parallel computation, including multi-core AMD Opteron and Intel
Xeon based servers with 4-64 GB of memory



A VMWare cluster of 150+ virtual machines running Windows and Linux operating systems for
testing and development environments as well as to provide backup and redundant services.



Dedicated MySQL, MS SQL and Oracle database servers



A terabyte-scale storage facility, with currently over 80TB usable space, with daily incremental and
weekly full backups.



Centralized authentication (LDAP) servers, serving unique usernames and passwords to all
investigators



Basic developer tools (compilers, debuggers, graphical development environments) and parallel
computing software (MPI, PVM) are available.



Dedicated redundant web servers principally using Apache, Tomcat, and JBoss.



A petabyte-scale storage facility (including IBM DS-8100 SANs and IBM N3700 NAS devices),
with currently over 800TB usable space, with daily incremental and weekly full backups.

Server management: CCHMC maintains a 7 x 24 x 365 Data Center with operators continually
onsite. All Sun Microsystems Solaris, Windows, Apple Macintosh, and Linux servers are backed up
with daily incremental backups, weekly full backups, and monthly offsite vaults, using Veritas
NetBackup. Veritas NetBackup is also used in tandem with Oracle Recovery Manager to back up
approximately 20 Oracle instances.

Dual zoned firewalls that are monitored by security experts protect the servers from the
external Internet and the internal DMZ. In the event of a power outage, two Liebert 220 kVA UPS,
distributed PDUs (power distribution units), and a diesel backup generator provide coverage.






SAGE BIONETWORKS RESOURCES


Laboratory:

N/A



Clinical:

N/A



Animal:

N/A



Computer:

Sage Bionetworks uses a combination of scalable cloud-based storage and analytical computational
resources and its own excellent computational facility. The cloud-based services are procured from
Amazon Web Services on a fee-for-service basis and provide a cost-effective solution to variable
needs, technology upgrades and support. The Sage Bionetworks facility comprises a 5,488 CPU
core Linux cluster with over 40TB of online data storage capacity. The cluster consists of ~5,000 job
slots with a combination of quad-core Intel Xeons, 64-bit single-core Xeons and AMD dual-core
Opteron processors. The cluster nodes are IBM blade servers. The system is physically very dense,
with 84 servers per rack. The IBM blades use chilled-water doors to remove more than 50% of the
heat produced by the cluster nodes. The cluster is operated from 7 head nodes. Four head nodes are
available to cluster users and two head nodes are dedicated to administration and operation. The
cluster is well maintained through a partnership agreement with the University of Miami. One of the
major components is an Isilon clustered file system with 36 nodes. Each node in the Isilon is attached
to the cluster via gigabit Ethernet. This highly parallelized file system is very well suited to supporting
systems biology and bioinformatics workflows. The file system can sustain throughputs of 30Gb/sec.
Storage for cluster administration, UNIX home directories, utilities and archiving is kept on a pair of
BlueArc Titan II NAS storage arrays with 12TB of storage. At the core of the cluster is a Force10
network switch. Cluster nodes, head nodes and our Isilon storage system are connected to this
central network switch. The Force10 has a 1Tb/sec backplane that allows the storage nodes to
operate at wire speed. This high-throughput network design allows the CPUs in the cluster to operate
at 80% capacity or better. All networked file systems, databases, and home directories are backed up
using Veritas software to a robotic tape library. Tapes are taken off-site each week for disaster
recovery. All IT-related protocols and procedures for dealing with human data on both AWS and
Sage Bionetworks systems have been approved by the Western Institutional Review Board.



Office:

Sage Bionetworks leases approximately 3,500 sq ft of office space in the Arnold Building of the Fred
Hutchinson Cancer Research Center (FHCRC). As part of the agreement with FHCRC, all Sage
Bionetworks staff have full access to the excellent research resources (e.g. library and reference
collection) and common areas (e.g. conference rooms) at the Center. Scientists are equipped
with state-of-the-art desktop computers and local servers, and all staff and collaborators will have
access to Amazon Web Services for work and development of the Synapse computational and
collaboration platform.



Other:

N/A




SCIENTIFIC ENVIRONMENT: Contribution to the probability of success.


The Fred Hutchinson Cancer Research Center is an outstanding research environment with over 190
faculty members, almost 400 pre- and post-doctoral researchers and over 2,800 total staff. There is
an extensive international seminar program, and Sage Bionetworks scientists work closely with the
FHCRC computational biology program as well as with other Public Health, Human Health and Basic
Science collaborators. All Sage Bionetworks scientists have formal affiliate positions at FHCRC.
Sage Bionetworks has a services agreement with FHCRC for facilities logistics but is a distinct
corporate entity and manages professional services such as accounting, audit, insurance and legal
services through a combination of its administrative team and local professional service providers.
Sage Bionetworks and FHCRC are located within two miles of the University of Washington and its
Medical Center, Seattle Biotechnology Research Institute (SBRI), the Institute for Systems Biology
(ISB), Seattle Children’s Hospital, the Program for Appropriate Health Technology (PATH) and the
Benaroya Institute, a proximity that facilitates many productive interactions.


EARLY STAGE INVESTIGATORS: Describe institutional investment in the success of the investigator.

N/A



SPECIAL FACILITIES: Describe facilities used for working with biohazard, or other dangerous substances.

N/A




FACILITIES & OTHER RESOURCES


GLADSTONE INSTITUTES


The Gladstone Bioinformatics core is a federated system of bioinformatics groups with tools,
resources and services to offer Gladstone researchers. This system takes advantage of the
tremendous synergy between the groups and provides the infrastructure needed to define joint
projects and analysis pipelines that span more than one group. Together, we offer a unique set of
perspectives, skills and resources to tackle a broad range of problems. Specifically, we offer services,
training, research and collaboration opportunities.


Software: Dr. Salomonis is the lead developer of the tools AltAnalyze, LineageProfiler and GO-Elite and
provides support for these and several other applications. These applications are
developed in Python and distributed both as source code and as operating system-specific binary
installations. Associated visualization tools are developed primarily for use with the Java-based Cytoscape
network analysis software, of which members of the Gladstone Bioinformatics Core are core
developers; other visualizations are achieved through the creation of browser tracks within the UCSC Genome Browser.
Other software development is primarily in R, for the purpose of designing streamlined analysis
pipelines for the Bioinformatics Core and as extensions to AltAnalyze.


Laboratory:

Gladstone occupies a ~200,000 sq. ft. research facility that was completed in the Fall of 2004. This
facility is located at the Mission Bay campus of the University of California, San Francisco (UCSF).
The Bioinformatics Core, of which Dr. Salomonis is also a member, occupies 500 sq. ft. of dry lab space
and 100 sq. ft. of office space. Dry lab and office space are equipped with desktop computers, a
conferencing-capable phone system to facilitate collaborative discussions, and locking file cabinets.
Members of Gladstone have full access to the editorial and graphics departments of the Gladstone
Institutes, which provide assistance in the preparation of manuscripts, slides, and other materials
required for the dissemination of research findings. Dr. Conklin has an administrative assistant who
devotes 50% effort to support of the laboratory.


Major Equipment:

All of the infrastructure and equipment necessary to successfully complete the proposed research is
readily available at the Gladstone Institutes. This includes low- and high-speed centrifuges, gel
electrophoresis equipment, ultracentrifuges, cold room, fluorescence microscope, confocal
microscope, autoclaves, X-ray processor, phosphoimager, and miscellaneous small equipment. This
equipment is located either directly in the Principal Investigator’s lab or in adjacent labs. Flow
cytometers and a cell sorter are available within Gladstone, as is a dedicated stem cell core.


Computer:


Computational facilities include shared resources in (1) Gladstone Institutes, (2) California Institute for
Quantitative Biosciences (QB3) at UCSF, (3) Division of Biostatistics, (4) Sequence Analysis and
Computer Service (SACS) at UCSF, as well as computers owned by Dr. Conklin and the
Bioinformatics Core. All of the computers described below are linked through a high-speed campus-wide network.

Gladstone Institutes: A computer network connected by T1 lines for scientific writing/editing, data
analysis, and graphics/illustration is available at the Gladstone Institutes. The network is secured with
a SonicWall network appliance for anti-virus protection and a Barracuda intrusion detection appliance
for spam and spyware detection.

QB3: A computer cluster with 1000 Opteron nodes (several with 32GB memory), shared storage, and
systems administration is available through QB3 and the associated Resource for Biocomputing,
Visualization, and Informatics. The Bioinformatics Core, through its affiliation with Dr. Katie Pollard’s
lab at Gladstone, has full access to this resource free of charge.

Biostatistics: A computer cluster with 10 Mac nodes (running OSX) and system administration are
available through the Division of Biostatistics.

SACS: The Plato cluster is maintained by the Resource for Biocomputing, Visualization and
Informatics. Plato is based on Hewlett Packard Proliant DL580 and DL585 servers with a total of 40
cores. Each system has 32GB of memory and is interconnected via Gigabit Ethernet. The cluster is
connected to a Storage Area Network (SAN) via Fibre Channel to access data stored on an HP
StorageWorks 6400 disk array. The Conklin lab has a subscription for access to SACS.

Research Group: Dr. Salomonis has access to Conklin laboratory as well as Bioinformatics Core
computer resources. In the Conklin laboratory, this includes a dual-2.66 GHz Intel Core i7 MacBook
Pro with 8GB RAM and Mac OS X and Windows 7 operating systems (dual-boot). Dr. Salomonis also
has access to a 12-core 2.66 GHz Intel Xeon Mac Pro with 16GB of RAM.

Bioinformatics Core: The Core has (1) a Dell PowerEdge R710 server with dual quad core
processors, 48 GB RAM and 6 TB RAID5 storage, running Linux Ubuntu OS; (2) a Dell PowerEdge
R710 server with dual hex core processors, 96 GB RAM and 12 TB RAID5 storage, running Linux
Ubuntu OS; and (3) personal computers with monitors and keyboards.


Bioinformatics:

The Gladstone Institutes established a Bioinformatics Core in January 2009. Dr. Holloway is the
director of the Bioinformatics Core. The Bioinformatics Core provides experimental design, software
engineering services and data analysis support for genomics and systems biology research on a
fee-for-service basis. The staff has expertise in statistical programming, databases, workflows, cluster
computing, and a wide range of bioinformatics tools. Dr. Salomonis is currently providing consultant
services to the Bioinformatics Core to support research projects in the areas of transcriptome
analysis, alternative splicing and biological pathway analysis. The staff of the Bioinformatics Core are
available to assist with the proposed project if needed and interact closely with Dr. Salomonis.


Animal:


n/a


Clinical:


n/a


Environment:

Gladstone is a world-class biomedical research facility housing approximately 20 laboratories with
expertise in molecular biology, cell biology, development, and computational biology. Gladstone
researchers focus on three main disease areas: cardiovascular disease, virology/immunology, and
neurodegenerative disease. Seminars, journal club, and “research in progress” talks (for the
discussion of recent results from each laboratory), each held weekly, provide opportunities for training
and collaboration. The Gladstone annual retreat enables members of the laboratory to present their
work to the wider scientific community and to explore collaborations.


Gladstone has an award-winning Postdoctoral Fellow program that includes institution-wide mentoring
standards, a postdoctoral advisor (staff person) who provides career development and job placement
advice, and a series of workshops in leadership, laboratory management, grant writing, interviewing
skills, manuscript preparation, and public speaking. UCSF provides additional opportunities for
training in grantsmanship, career development, entrepreneurship, and graduate teaching.


Dr. Salomonis is involved in active collaborative projects with the laboratories comprising the PCBC
iPS Cells in Heart Disease hub. These collaborations provide opportunities for mentoring, scientific
discussion, and exchange of technologies and data, which will enrich the proposed project.