EMBL-EBI The European Bioinformatics Institute



EMBL-EBI (S.A. Sansone) Scientific progress report - May 2003 (12 pages)

EMBL-EBI The European Bioinformatics Institute


Microarray Informatics


ILSI-HESI AND EBI RESEARCH PROJECT

SCIENTIFIC PROGRESS REPORT

May 2003


Title of the research project

‘Establishment of a database for toxicogenomic gene expression data’

Operative commencement date

1 April 2002

Time period covered by the report

31 November 2002 - 30 May 2003

Team participating in the research project

Microarray Informatics Team
EMBL-EBI The European Bioinformatics Institute, Hinxton,
Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK.
http://www.ebi.ac.uk/microarray

Principal investigator

Alvis Brazma, PhD
Email: brazma@ebi.ac.uk

Project coordinator

Susanna-Assunta Sansone, PhD
Email: sansone@ebi.ac.uk

Other contributors

Niran Abeygunawardena (Web designer and programmer)
Email: niran@ebi.ac.uk

Sergio Contrino (Database programmer)
Email: contrino@ebi.ac.uk

Gonzalo Garcia Lara (Web designer and programmer)
Email: gonzalo@ebi.ac.uk

Helen Parkinson, PhD (Data curation coordinator)
Email: parkinson@ebi.ac.uk

Philippe Rocca-Serra, PhD (Data curation and data management)
Email: rocca@ebi.ac.uk

Ugis Sarkans, PhD (Database development coordinator)
Email: ugis@ebi.ac.uk

Mohammadreza Shojatalab (Database programmer)
Email: shoja@ebi.ac.uk




SCIENTIFIC PROGRESS REPORT



1. Background

1.1. Objective of the project

Infrastructure for microarray gene expression data [1] has been established at EMBL-EBI The European Bioinformatics Institute. Following an agreement with the International Life Sciences Institute’s Health and Environmental Sciences Institute (ILSI HESI), the current infrastructure is undergoing further development according to the specification of the ILSI HESI toxicogenomics experiments.

The goal of the project is the establishment of a database for toxicogenomic gene expression data generated by the ILSI HESI Genomics Committee. The ILSI HESI studies (Genotoxicity, Hepatotoxicity and Nephrotoxicity) will be consistently annotated according to the requirements of the Minimum Information About a Microarray Experiment (MIAME) [2] and will be compliant with the proposed MIAME for toxicogenomics (MIAME/Tox) [3,4], an effort aiming to define the core that is common to most toxicogenomic experiments. Data will be submitted via a dedicated on-line tool (Tox-MIAMExpress), hosted in the public repository for microarray gene expression data (ArrayExpress) and supported by a query and data analysis interface. The infrastructure for toxicogenomics gene expression data is shown in figure 1.


1.2. MIAME requirements

The microarray community response to MIAME has been very favourable and many instrument manufacturers, software developers, and international databases have moved to adapt their systems to capture and manage MIAME-compliant data. MIAME requirements allow sufficient and structured information to be recorded in order to correctly interpret and replicate the experiments or retrieve and analyse the data.

An open letter has been sent to the main scientific journals by the MGED Society and published by Nature [5], Science, The Lancet and Bioinformatics [6]. The letter includes a guide for authors, editors and reviewers of microarray gene expression papers. The guidelines are based on the MIAME document and aim to help authors meet the requirements. The letter also guides editors and reviewers in their evaluation of whether or not a manuscript provides as much information as necessary for others to replicate and interpret the analysis presented. Finally, the letter suggests that journals should require submission of data to either of the two public repositories: ArrayExpress or GEO [7]. Nature, the Nature group of journals, Cell, The Lancet, EMBO and Toxicologic Pathology have responded accordingly, by requiring an accession number to be supplied at or before acceptance for publication.


1.3. ArrayExpress infrastructure

The ArrayExpress infrastructure consists of the database itself, two data submission routes (in Microarray Gene Expression Mark-up Language (MAGE-ML) [8] format or via the web-based submission/annotation tool MIAMExpress [1,9]), an online database query interface and the Expression Profiler [10] online analysis tool. A query-optimised data warehouse is under development. ArrayExpress will also be fully integrated with the relevant databases at the EBI.

ArrayExpress accepts three types of submission: experiments, protocols and arrays. Each of these is assigned an accession number. MIAMExpress is the data annotation and submission tool for the ArrayExpress curation database. MIAMExpress is presented in the form of a questionnaire, where the required fields are taken from the MIAME concepts; it therefore captures MIAME-compliant information. Whenever possible MIAMExpress uses drop-down lists of controlled vocabularies that help the user to select the appropriate term for the required field. MIAMExpress version 1.01 is generic; no species- or domain-specific interface has been designed. Currently we are developing a plant-specific interface as a part of this collaborative project.

MIAMExpress is an open source project, consisting of a Perl-CGI interface, a MySQL database, and a MAGE-ML export component implemented using MAGEstk [8]. The system can be installed locally and used as an ‘electronic notebook’ for microarray experiments, potentially allowing ‘one-click’ submissions to ArrayExpress or to any other database or tool that accepts MAGE-ML formatted data. ArrayExpress is implemented in Oracle; the query interface is implemented via Java servlets and uses Tomcat and Velocity.

The infrastructure for data sharing is based on the adoption of the MAGE-ML data exchange format. MAGE-ML is an XML-based data exchange format developed by the Microarray Gene Expression Data (MGED) Society [11] and adopted as a standard by the Object Management Group (OMG) [12]. Currently MAGE-ML export is implemented by a number of companies (including Affymetrix, Agilent, Rosetta Biosoftware, Iobion) and many academic and governmental laboratories (including TIGR, Sanger Institute, Stanford, NIEHS).

The adoption of a common data exchange format and the public repository will increase the value of the ILSI HESI data by achieving the objective of the project and giving a critical mass to the studies.


1.4. ArrayExpress infrastructure development team

The European Molecular Biology Laboratory (EMBL), the European Commission (TEMBLOR grant, CAGE grant) and the EBI Industry Programme (Biostandards) fund the ArrayExpress project and support 13 staff members. In addition, the ILSI-HESI toxicogenomics database grant supports 2 other staff members.


1.5. Phases of the project

Phase 1: submission of data to ‘curation database’.

As a first step, MIAMExpress will be extended to meet the specifics of the ILSI HESI studies. The toxicogenomic-specific MIAMExpress will allow annotation and submission of the following domains:

- toxicological endpoints domain (biological or toxicological results)
- genomic domain (gene names and/or sequence identifiers on each array and array descriptors)
- experimental domain (samples, analysis descriptors and calculated result)


ILSI HESI members will submit the above domains by using a toxicogenomics-specific MIAMExpress, while the genomic domain for Affymetrix arrays will enter ArrayExpress directly as files in MAGE-ML format.

Curators at EBI will assist the members during the submission process and check the data for MIAME and MIAME/Tox compliance. A study will be considered completed when the above domains have been successfully submitted. Only completed studies will enter the ArrayExpress ‘query database’.


Phase 2: data enter ArrayExpress ‘query database’.

ILSI HESI Genomics Committee members will be able to access the data through a dedicated ArrayExpress web query interface. ILSI HESI data will be exclusive to participating ILSI HESI members for a period of 10 months from the date of submission.
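The 10-month exclusivity rule can be illustrated with a short sketch. It is not part of the ArrayExpress code base: the function names are hypothetical, and clamping the day to the end of a shorter month is an assumption, since the report does not say how the period is counted.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 10  # exclusivity period stated in the report


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    Assumption: if the target month is shorter than the start month,
    the day is clamped to the last day of the target month.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_date(submitted: date) -> date:
    """First day on which a submission is no longer member-exclusive."""
    return add_months(submitted, EMBARGO_MONTHS)


def is_public(submitted: date, today: date) -> bool:
    """True once the exclusivity period has elapsed."""
    return today >= release_date(submitted)


print(release_date(date(2003, 1, 15)))                # 2003-11-15
print(is_public(date(2003, 1, 15), date(2003, 6, 1)))  # False
```

Under these assumptions a study submitted on 15 January 2003 would become public on 15 November 2003.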

The EBI will extend the interface to accommodate the specific requirements of the ILSI HESI members wherever possible and subject to resources. A data warehouse will also be developed to allow gene-centric queries, to combine data from several experiments and to provide cross-platform analysis possibilities.

The EBI will also collaborate with the ILSI HESI members in the development of dedicated data analysis tools, subject to resources.


Phase 3: data publicly released.

At the end of a period of 10 months from the date of submission, the ILSI HESI data will be integrated with all the public gene expression data in the ArrayExpress database and other relevant databases at the EBI.

The EBI will continue the development of advanced analyses and query
facilities according to recommendations from ILSI HESI members and subject to
resources.



2. Scientific and technical progress made

2.1. Tox-MIAMExpress annotation/submission tool

The ILSI HESI studies need to be consistently annotated according to the MIAME requirements and also integrated with conventional toxicology data (e.g., clinical observations, gross necropsy examination, histopathology evaluation and clinical pathology). To achieve this, a toxicogenomics-specific version of the MIAME-compliant web-based annotation/submission tool has been developed. The new Tox-MIAMExpress [13] allows gene expression data to be linked to conventional toxicology metadata. Tox-MIAMExpress also complies with the proposed MIAME/Tox, an effort aiming to define the core that is common to most toxicogenomic experiments.

A Tox-MIAMExpress prototype version was presented at the ILSI HESI Plenary Meeting in December 2002, and a fully functional production version has accepted on-line submissions since January 2003 (fig. 2).

Tox-MIAMExpress is presented in the form of a questionnaire, where the required fields are taken from the MIAME and MIAME/Tox concepts. Explanatory pop-up help pages are available for each required field by clicking on the terms. Tox-MIAMExpress also provides the user with a Help Page, designed to guide them through the submission procedures. An assessment of the terminologies used to record the toxicological endpoints has shown inconsistency among the ILSI HESI in-life reports. To overcome this problem, drop-down lists of controlled vocabularies are provided to select the appropriate term for each required field. Use of common terminologies will ensure harmonization of the biological endpoints, allowing successful data mining, data evaluation and data comparison. The following controlled vocabularies have been incorporated in the Tox-MIAMExpress interface:



- International Union of Pure and Applied Chemistry (IUPAC) [14] “Properties and units in the clinical laboratory sciences”, for Clinical Chemistry;
- NIEHS National Toxicology Program (NTP) Pathology Code Table (PCT) [15] for clinical observations and pathology/histopathology evaluation;
- MGED Ontology terms [14], for a standardized description of a microarray experiment.
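As an illustration of how a controlled vocabulary keeps endpoint terms consistent, the sketch below checks a submitted value against an allowed list. It is a minimal example, not the Tox-MIAMExpress implementation: the field name and the function are hypothetical, and apart from "necrosis" (used as an example endpoint in this report) the terms are placeholders rather than entries taken from the PCT, IUPAC or MGED Ontology lists.

```python
# Placeholder vocabulary for one hypothetical field. Only "necrosis" is
# mentioned in the report; the other terms are invented for illustration.
VOCABULARIES = {
    "histopathology_finding": {"necrosis", "hypertrophy", "inflammation"},
}


def check_term(field: str, value: str) -> str:
    """Return the canonical term, or raise ValueError if it is not allowed."""
    allowed = VOCABULARIES[field]
    term = value.strip().lower()  # tolerate case and stray whitespace
    if term not in allowed:
        raise ValueError(
            f"{value!r} is not a controlled term for {field}; "
            f"choose one of {sorted(allowed)}"
        )
    return term


print(check_term("histopathology_finding", " Necrosis "))  # necrosis
```

A drop-down list enforces the same constraint at entry time; a check of this kind would mainly matter for bulk uploads such as spreadsheets.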

A downloadable Excel template has been developed for pathology/histopathology data upload. The spreadsheet includes the PCT controlled vocabulary for an easy import of the pathological findings.

Tox-MIAMExpress allows three types of submission: array designs, protocols and experiments (fig. 3). The user can choose to submit an array design or a protocol only. However, when entering an experiment, the user must also provide the protocols and the array design to complete the submission. If linked to multiple experiments, array designs and protocols should be submitted only once.
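The dependency between the three submission types can be sketched as follows. This is a simplified model under stated assumptions, not the actual tool: the class, method and field names are hypothetical, and array designs and protocols are identified by name only.

```python
class SubmissionStore:
    """Toy store of array designs and protocols shared across experiments."""

    def __init__(self):
        # Sets make a repeated submission of the same item a no-op,
        # mirroring the rule that shared items are submitted only once.
        self.array_designs = set()
        self.protocols = set()

    def submit_array_design(self, name: str) -> None:
        self.array_designs.add(name)

    def submit_protocol(self, name: str) -> None:
        self.protocols.add(name)

    def missing_for(self, experiment: dict) -> list:
        """List what must still be submitted before the experiment is complete."""
        missing = []
        if experiment["array_design"] not in self.array_designs:
            missing.append("array design: " + experiment["array_design"])
        for protocol in experiment["protocols"]:
            if protocol not in self.protocols:
                missing.append("protocol: " + protocol)
        return missing


store = SubmissionStore()
store.submit_array_design("Design D")
store.submit_protocol("Hybridization")

experiment = {
    "name": "Example study",
    "array_design": "Design D",
    "protocols": ["Hybridization", "Labeling"],
}
print(store.missing_for(experiment))  # ['protocol: Labeling']
```

Once the missing protocol is submitted, `missing_for` returns an empty list and the experiment can be treated as complete.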

The ILSI HESI studies (Genotoxicity, Hepatotoxicity and Nephrotoxicity) will be submitted as three experiments in Tox-MIAMExpress. Given the complexity of the experiments and the number of companies contributing to them, Tox-MIAMExpress divides the experiment submission into two parts:

- Part 1 of the experiment holds information on sample treatments, toxicological endpoints and extracts preparation;
- Part 2 of the experiment holds information on labeled extracts preparation, hybridizations and gene expression data files.

When creating an account, users are asked to select the part of the experiment they will submit information for. Tox-MIAMExpress allows the users to ‘submit’ information related to Part 1 or Part 2 of an experiment only, while ‘view’ permission is given for the other part. Array and protocol submissions are accessed with no restrictions and can be shared among the members.
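The access rule described above can be written as a small decision function. This is a sketch for illustration only: the section labels and the function name are hypothetical and do not come from the Tox-MIAMExpress source code.

```python
EXPERIMENT_PARTS = ("part1", "part2")
SHARED_SECTIONS = ("array_design", "protocol")


def permission(account_part: str, section: str) -> str:
    """Return 'submit' or 'view' for an account registered for one part."""
    if account_part not in EXPERIMENT_PARTS:
        raise ValueError("unknown account part: " + account_part)
    if section in SHARED_SECTIONS:
        return "submit"  # arrays and protocols carry no restrictions
    if section == account_part:
        return "submit"  # the part chosen when the account was created
    if section in EXPERIMENT_PARTS:
        return "view"    # the other part is read-only
    raise ValueError("unknown section: " + section)


print(permission("part1", "part1"))     # submit
print(permission("part1", "part2"))     # view
print(permission("part2", "protocol"))  # submit
```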

Only completed experiments (studies) will be assigned an accession number and enter the ArrayExpress ‘query database’.

Tox-MIAMExpress is an open source project, consisting of a Perl-CGI interface, a MySQL database, and a MAGE-ML export component implemented using MAGEstk. The system can also be installed locally and used as an ‘electronic notebook’ for toxicogenomics experiments, potentially allowing ‘one-click’ submissions to ArrayExpress or to any other toxicogenomics database or tool that accepts MAGE-ML formatted data.





Array design submission

Array design is the general layout of printed spots on an array. A set of procedures for formatting the array design information into a standard referencing system has been developed. The Array Design File (ADF) allows a spot on the array to be located unambiguously and provides a consistent biological annotation for data mining, data evaluation and data comparison across different arrays and technology platforms. Step-by-step help notes guide the users through the set of procedures for merging the ‘spotter output file’ and the ‘clone tracking file’ to format the ADF. A web page in Tox-MIAMExpress allows the upload of the generated ADF along with other general array design information, e.g. array design name, version, manufacturer, surface type etc.
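The merge of the ‘spotter output file’ and the ‘clone tracking file’ is essentially a join on a shared clone identifier. The sketch below shows the idea on two small tab-delimited tables. The column names and values are hypothetical, and the output is not the official ADF layout, which this report does not specify.

```python
import csv
import io

# Hypothetical spotter output: where each clone was printed on the array.
SPOTTER_OUTPUT = "\n".join([
    "block\tcolumn\trow\tclone_id",
    "1\t1\t1\tCL001",
    "1\t2\t1\tCL002",
    "1\t3\t1\tCL999",
])

# Hypothetical clone tracking file: what each clone represents.
CLONE_TRACKING = "\n".join([
    "clone_id\tgene_name\tdb_identifier",
    "CL001\tGene one\tID0001",
    "CL002\tGene two\tID0002",
])


def build_adf_rows(spotter_text: str, tracking_text: str) -> list:
    """Join spot positions with clone annotation on clone_id."""
    tracking = {
        row["clone_id"]: row
        for row in csv.DictReader(io.StringIO(tracking_text), delimiter="\t")
    }
    rows = []
    for spot in csv.DictReader(io.StringIO(spotter_text), delimiter="\t"):
        annotation = tracking.get(spot["clone_id"], {})
        rows.append({
            "block": spot["block"],
            "column": spot["column"],
            "row": spot["row"],
            "clone_id": spot["clone_id"],
            # Spots without tracking information keep empty annotation
            # fields so that every position on the array is still listed.
            "gene_name": annotation.get("gene_name", ""),
            "db_identifier": annotation.get("db_identifier", ""),
        })
    return rows


adf_rows = build_adf_rows(SPOTTER_OUTPUT, CLONE_TRACKING)
for adf_row in adf_rows:
    print(adf_row)
```

Keeping unannotated spots in the output, rather than dropping them, preserves the one-row-per-spot property needed to locate every spot unambiguously.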


Array design re-annotation

The array annotation is created on the basis of the sequence information available at the time of its release. However, drafts of the genomes are continuously updated and the array annotation can subsequently be improved. A prototype web page, under development, will allow the users to access the latest gene annotation and re-annotate or update the ADF before submission. The user will upload the ‘spotter output file’ in a web form, and the database identifiers will be automatically extracted and submitted to EnsMart [16]. EnsMart is a system built on the data in EnsEMBL, the genome database at EBI containing a consistent species-specific and interspecies annotation (including Homo sapiens, Mus musculus and Rattus norvegicus). EnsMart is the recipient of the latest updated drafts of the genomes, with cross-references between identifiers from a wide variety of the public sequence repositories and internal EnsEMBL identifiers. The output of the query to the EnsMart system will be downloaded in ADF format for direct submission to Tox-MIAMExpress.


Array design status

A full list of array designs used in the ILSI-HESI studies, their format and availability at the present time is given in figure 4. Array designs in ADF format have already been submitted to Tox-MIAMExpress. Array designs produced by Affymetrix are in the process of being loaded into ArrayExpress directly as files in MAGE-ML format (fig. 1), via submission to a dedicated ftp site [17].


2.2. Data query and data warehouse

A dedicated web page meeting the specific needs of the ILSI-HESI datasets is under development. The query interface will allow simple queries and data retrieval. An example of how such queries may be structured is the following:



“Show me all experiments FROM ILSI-HESI members A and B, WHERE the compound used is C AND array design is D”.
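A query of this shape maps naturally onto a SQL SELECT with a WHERE clause. The sketch below uses an in-memory SQLite table purely for illustration. The table, its columns and the accession values are hypothetical: the real ArrayExpress schema (implemented in Oracle) is not described in this report.

```python
import sqlite3

# Hypothetical, simplified schema: one row per experiment.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment ("
    "accession TEXT, member TEXT, compound TEXT, array_design TEXT)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?, ?, ?)",
    [
        ("EXP-1", "A", "C", "D"),
        ("EXP-2", "B", "C", "D"),
        ("EXP-3", "A", "C", "E"),  # different array design
        ("EXP-4", "F", "C", "D"),  # different member
    ],
)


def find_experiments(conn, members, compound, array_design):
    """Accessions from the given members for one compound and array design."""
    placeholders = ", ".join("?" for _ in members)
    sql = (
        "SELECT accession FROM experiment "
        f"WHERE member IN ({placeholders}) "
        "AND compound = ? AND array_design = ? "
        "ORDER BY accession"
    )
    rows = conn.execute(sql, [*members, compound, array_design]).fetchall()
    return [accession for (accession,) in rows]


matches = find_experiments(conn, ["A", "B"], "C", "D")
print(matches)  # ['EXP-1', 'EXP-2']
```

All user-supplied values are passed as bound parameters, so only the number of placeholders is built dynamically.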

The users can export the gene expression data matrix from these experiments into Expression Profiler, the on-line analysis tool, for further sub-selection, filtering and clustering, or use the latest annotations of the genomes through the link provided to the EnsMart database.

In order to answer complex user-constructed queries across and within database domains (conventional toxicology and gene expression data) a data warehouse is under development. An example of how such queries may be structured is the following:



“Show me if gene X expression goes up (or down) after treatment with compound Y with biological endpoint Z (e.g. necrosis) in experiments from ILSI-HESI members A and B”.
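A cross-domain query of this kind combines an expression value with a toxicological endpoint recorded for the same treatment. The following sketch shows one way to answer it over flat records. The record layout, the use of a log2 ratio and the averaging rule are assumptions made for illustration; the report does not describe the data warehouse design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    member: str        # submitting ILSI-HESI member
    gene: str
    compound: str
    endpoint: str      # conventional toxicology finding, e.g. "necrosis"
    log2_ratio: float  # treated vs. control; above 0 means up-regulated


def direction_by_member(measurements, gene, compound, endpoint, members):
    """Report 'up', 'down' or 'unchanged' per member, from the mean log2 ratio."""
    values = {}
    for m in measurements:
        if (m.gene, m.compound, m.endpoint) == (gene, compound, endpoint) \
                and m.member in members:
            values.setdefault(m.member, []).append(m.log2_ratio)
    summary = {}
    for member, ratios in values.items():
        mean = sum(ratios) / len(ratios)
        summary[member] = "up" if mean > 0 else "down" if mean < 0 else "unchanged"
    return summary


data = [
    Measurement("A", "X", "Y", "necrosis", 1.2),
    Measurement("A", "X", "Y", "necrosis", 0.8),
    Measurement("B", "X", "Y", "necrosis", -0.5),
    Measurement("A", "W", "Y", "necrosis", -2.0),  # other gene, ignored
]
summary = direction_by_member(data, "X", "Y", "necrosis", {"A", "B"})
print(summary)  # {'A': 'up', 'B': 'down'}
```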

The data warehouse will also be able to run computations in order to answer more complex queries such as the following:



“Show me the most reproducible gene expression changes for all experiments with biological endpoint X (e.g. necrosis) and give me a quantitative measure of this reproducibility”.
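The report does not define the reproducibility measure. As one possible illustration, the sketch below scores each gene by sign consistency: the fraction of experiments in which the change goes in the majority direction. Both the measure and the example values are assumptions, not the method planned for the data warehouse.

```python
def sign_consistency(log2_ratios):
    """Fraction of experiments agreeing with the majority direction (0 to 1)."""
    ups = sum(1 for r in log2_ratios if r > 0)
    downs = sum(1 for r in log2_ratios if r < 0)
    return max(ups, downs) / len(log2_ratios)


def rank_by_reproducibility(changes_by_gene):
    """Sort genes from most to least reproducible; ties break by gene name."""
    scored = [
        (gene, sign_consistency(ratios))
        for gene, ratios in changes_by_gene.items()
        if ratios  # skip genes without any measurement
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical log2 ratios for three genes across four experiments that
# share the same biological endpoint.
changes = {
    "geneX": [1.0, 0.7, 1.3, 0.9],
    "geneY": [0.5, -0.4, 0.6, -0.2],
    "geneZ": [-1.1, -0.9, 0.2, -1.0],
}
ranking = rank_by_reproducibility(changes)
print(ranking)  # [('geneX', 1.0), ('geneZ', 0.75), ('geneY', 0.5)]
```

Sign consistency ignores the magnitude of the change; a production measure would probably also weigh effect size and replicate variance.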


2.3. A MIAME for Toxicogenomics

Following the very favourable response that MIAME has received from the microarray community, the ILSI-HESI, NIEHS-NCT [18] and EBI have initiated a harmonization process as applied to array-based toxicogenomic experiments. ILSI-HESI, NIEHS-NCT and EBI have drafted MIAME/Tox, a document defining the core data that is common to most toxicogenomic experiments. Like MIAME, MIAME/Tox is not a formal specification, but a set of guidelines, supporting a number of objectives including linking toxicological data within a study, linking several studies from one institution, exchanging toxicogenomics datasets among institutions and developing data standards and data management software.

MIAME/Tox has also served as a tool for guiding the development of Tox-MIAMExpress and the toxicogenomics databases underway at NIEHS-NCT [19]. Compliance with MIAME/Tox, the adoption of a common data exchange format and the public repository will increase the value of the ILSI HESI data by achieving the objective of the project and giving a critical mass to the toxicogenomics datasets.


3. Annexes

3.1. Project contributors

Microarray Informatics staff members (other than the project coordinator) who have participated in the ILSI HESI project are listed below. Their individual contribution at the present time is given as a percentage.



- Niran Abeygunawardena (Web designer and programmer) 10%.
- Sergio Contrino (Database programmer) 80%.
- Gonzalo Garcia Lara (Web designer and programmer) 10%.
- Helen Parkinson (Data curation coordinator) 10%.
- Philippe Rocca-Serra (Data management) 40%.
- Ugis Sarkans (Database development coordinator) 10%.
- Mohammadreza Shojatalab (Database programmer) 40%.


3.2. Publications and web addresses

[1] Brazma, A. et al. (2003). ArrayExpress - a Public Repository for Microarray Gene Expression Data at the EBI. Nucleic Acids Research, 31(1): 68-71.

[2] Brazma, A. et al. (2001). Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nature Genetics, 29, 365-371.

[3] ILSI HESI Committee on the Application of Genomics to Mechanism Based Risk Assessment: http://hesi.ilsi.org/index.cfm?pubentityid=120

[4] MGED Society: http://www.mged.org

[5] Microarray standards at last. Nature 2002 419:323.

[6] Ball C.A. et al. (2002). An open letter to the scientific journals. Science, 298(5593):539. Bioinformatics, 18(11):1409. The Lancet, 360:1019.

[7] Edgar, R., Domrachev, M. and Lash, A. (2002). Gene Expression Omnibus: NCBI gene expression and hybridisation array data repository. Nucleic Acids Res., 30(1), 207-210.

[8] Spellman, P.T. et al. (2002). Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biology, 3(9), research0046.1-0046.9.

[9] MIAMExpress: http://www.ebi.ac.uk/miamexpress/

[10] Vilo, J. et al. (2003). Expression Profiler. In Parmigiani, G., Garrett, E.S., Irizarry, R. and Zeger, S.L. (eds.), The analysis of gene expression data: methods and software, in press, Springer-Verlag.

[11] MGED Society: http://www.mged.org

[12] OMG: http://www.omg.org

[13] Tox-MIAMExpress (username: ilsi and password: mxpress): http://www.ebi.ac.uk/ilsi

[14] IUPAC: http://www.iupac.org/publications/pac/2000/7205/7205olesen.html

[15] NIEHS NTP, PCT: http://hazel.niehs.nih.gov/user_spt/pct_terms.htm

[16] EnsMart: http://www.ensembl.org/EnsMart

[17] Ftp site for MAGE-ML submission to ArrayExpress: http://www.ebi.ac.uk/microarray/ArrayExpress/Submissions/submissions.html

[18] NIEHS NCT: http://www.niehs.nih.gov/nct/

[19] Waters, M. et al. (2003). Systems Toxicology and the Chemical Effects in Biological Systems (CEBS) Knowledge Base. Environ. Health Perspect. (Toxicogenomics) 111, 15-28.




Fig. 1. Infrastructure for toxicogenomics gene expression data

[Diagram not reproduced. Labels recovered from the figure: EBI; ArrayExpress (Oracle); Tox-MIAMExpress (MySQL); Expression Profiler; Data Analysis Softwares; Data Matrix; MAGE-ML files; MIAME-compliant MAGE-ML pipelines (Affymetrix, Agilent, NIEHS, TIGR, SMD, Sanger Institute); Other Bioinformatic Databases @ EBI (EnsMart); Other Microarray (GEO, CIBEX) and Toxicogenomics Databases (CEBS). Flows are labelled DATA SUBMISSION, DATA QUERY, DATA ANALYSIS and DATA IMPORT/EXPORT, with MAGE-ML as the exchange format and www access points.]


Fig. 2. Tox-MIAMExpress on-line




Fig. 3. Tox-MIAMExpress submission process


Fig. 4. Array designs availability


Array | Manufacturer | Format and status | Created by | Submitted by
Atlas Glass Microarrays Human 1.0 | ClonTech | ADF submitted to Tox-MIAMExpress. | EBI | EBI
Atlas Rat Toxicology Array II | ClonTech | ADF submitted to Tox-MIAMExpress. | EBI | EBI
(DuPont) Rat Microarray Chip | Molecular Dynamics | No sufficient information is available. | |
(RWJPRI) Rat Mega A | Molecular Dynamics | ADF submitted to Tox-MIAMExpress. | Alex Nai (J&J) | EBI
(GSK) Tox2 | Molecular Dynamics | No sufficient information is available. | |
GeneChip Murine Genome 11K subA | Affymetrix | MAGE-ML file ready to go in ArrayExpress. | Affymetrix |
GeneChip Murine Genome 11K subB | Affymetrix | MAGE-ML file ready to go in ArrayExpress. | Affymetrix |
GeneChip Murine Genome U74Av2 | Affymetrix | MAGE-ML file ready to go in ArrayExpress. | Affymetrix |
GeneChip Rat Genome U34A | Affymetrix | MAGE-ML file ready to go in ArrayExpress. | Affymetrix |
NIEHS Rat Arrays (43, 49, 59, 52, 56, 60, 66, 69, 86) | NIEHS | ADF formatting ongoing. | Pierre Bushel (NIEHS) |
Phase-1 Rat Toxicology Array 250 | Phase-1 | No sufficient information is available. | |
Phase-1 Rat Toxicology Array 700 | Phase-1 | ADF submitted to Tox-MIAMExpress. | EBI/Scott Pine (FDA) | EBI
Rat GEM 1.04 | Incyte | ADF submitted to Tox-MIAMExpress. | William Curtis (Pharmacia) | EBI
RatGEM 3.03 | Incyte | ADF submitted to Tox-MIAMExpress. | William Curtis (Pharmacia) | EBI