Objectives Progress and Achievements - Gen2Phen


Dec 7, 2013 (3 years and 10 days ago)

Summary

Project objectives, work progress and achievements for 1st and 2nd years of the project

WP1: SCIENTIFIC COORDINATION

1. Project objectives for the period 1 (01/01/2008 - 31/12/2008)

Work Package 1 (WP1) is concerned with providing top-level oversight and scientific
coordination to make GEN2PHEN a success. Planned activities involve monitoring
scientific progress, tracking and reacting to new scientific ideas, optimising the role of
committees, and supervising plus assisting Work Package leaders as they execute
their tasks. It also involves ensuring that all GEN2PHEN activities emphasize quality,
for which robust quality assessment procedures will be devised. Importantly, WP1 will
additionally provide ethical oversight for the whole project, and to this end will
undertake specific ethical assessment exercises as and when they are needed. All
these activities will be closely intertwined with WP10 (Project Management), so that
scientific leadership and management of the project are mutually reinforced.

Activity #1 concerns ‘Project Coordination’, the goal of which is to establish and apply
procedures by which every aspect of the project can be monitored and optimised. In
the first year of the project, this will entail helping Partners to familiarize themselves
with the project plan and how this relates to the state-of-the-art, ensuring they all get
acquainted such that they identify and develop synergies, and producing low-overhead
pathways for scientific reporting. Modes of working with the WP Leaders will be
established that leave them free to lead and innovate, whilst being coordinated by
WP1. A Steering Committee, a Scientific Advisory Board, and other ad-hoc
Boards/Committees will be created to provide a key cornerstone for the project.

Activity #2 concerns ‘Project Quality and Assessment’, representing an aspect of
GEN2PHEN that will be particularly emphasized. To this end, and with respect to each
individual activity and Deliverable, WP1 will frequently remind all Partners about the
need for quality, offer advice and guidance on good work practice, and routinely
compile results via mechanisms that involve checking for quality. For GEN2PHEN as a
whole, WP1 will undertake a sequential system-wide ‘Pilot’ to assess a broad range
of project activities and products in the context of one or more disease scenarios. The
first Pilot will be run towards the end of the first year, and then again every 18 months
or so. Findings will be documented and used to inform the recursive adaptation and
improvement of the program of work.

Activity #3 concerns ‘Ethical Issues’, representing an important feature of GEN2PHEN
that naturally overlaps with the previously stated issue of project quality. But rather than
rely on activity #2 to cover ethical questions, WP1 will take steps to properly address
the many challenging ethical dimensions of G2P databasing. This will involve the use
of suitably designed oversight, analysis and guidance procedures, starting in the first
year with a thorough assessment of the Consortium’s ethical awareness and
expectations. Subsequently, WP1 will produce guidance on when, where, and how
ethical issues should most rationally be addressed, with the general goal of ‘hard
coding’ good ethical principles into the IT systems that GEN2PHEN creates.

2. Project objectives for the period 2 (01/01/2009 - 31/12/2009)

Work Package 1 (WP1) is concerned with providing top-level oversight and scientific
coordination to make GEN2PHEN a success. Planned activities involve monitoring
scientific progress, tracking and reacting to new scientific ideas, optimising the role of
committees, and supervising plus assisting Work Package leaders as they execute
their tasks. It also involves ensuring that all GEN2PHEN activities emphasize quality,
for which robust quality assessment procedures will be devised. Importantly, WP1
provides additional ethical oversight for the whole project, and to this end undertakes
specific ethical assessment exercises as and when they are needed. All these activities
are closely intertwined with WP10 (Project Management), so that scientific leadership
and management of the project are mutually reinforced.

Activity #1 concerns ‘Project Coordination’, the goal of which is to establish and apply
procedures by which every aspect of the project can be monitored and optimised. This
entails helping Partners to familiarize themselves with the project plan and how this
relates to the state-of-the-art, ensuring they all get acquainted such that they identify
and develop synergies, and producing low-overhead pathways for scientific reporting.
Modes of working established with the WP Leaders leave them free to lead and
innovate, whilst being coordinated by WP1. A Steering Committee, a Scientific
Advisory Board, and other ad-hoc Boards/Committees provide a key cornerstone for
the project.

Activity #2 concerns ‘Project Quality and Assessment’, representing an aspect of
GEN2PHEN that is particularly emphasized. To this end, and with respect to each
individual activity and Deliverable, WP1 frequently reminds all Partners about the need
for quality, offers advice and guidance on good work practice, and routinely compiles
results via mechanisms that involve checking for quality. For GEN2PHEN as a whole,
WP1 will undertake a sequential system-wide ‘Pilot’ to assess a broad range of
project activities and products in the context of one or more disease scenarios. The first
Pilot was run towards the end of the first year, and it is then planned again every 18
months or so. Findings are being documented and used to inform the recursive
adaptation and improvement of the program of work.

Activity #3 concerns ‘Ethical Issues’, representing an important feature of GEN2PHEN
that naturally overlaps with the previously stated issue of project quality. But rather than
rely on activity #2 to cover ethical questions, WP1 has taken steps to properly address
the many challenging ethical dimensions of G2P databasing. This involves the use of
suitably designed oversight, analysis and guidance on when, where, and how ethical
issues should most rationally be addressed, with the general goal of ‘hard coding’ good
ethical principles into the IT systems that GEN2PHEN creates.

3. Work progress and achievements during the period 1 (01/01/2008 - 31/12/2008)

During the first year of the GEN2PHEN project, WP1 launched the project effectively
and accomplished a high rate of productivity. All the anticipated coordination
infrastructures have been formulated, including committees, guidance texts, and
progress reporting/tracking procedures. Three WP1 Deliverables scheduled for the first
year have progressed well, with only one slightly delayed (by <1 month) due to the
need to react to unexpected developments in the field (see below for further details).

Some particularly notable achievements made by WP1 during 2008 include:




• Produced a major review of the G2P databasing field, which enunciated the
  GEN2PHEN perspective
• Formed all required Committees and Boards, explained their remits, and
  scheduled meetings
• Established project-wide ‘Progress Reports’ to keep all Partners updated on a
  monthly basis, as well as 4-monthly reports that are sent to the project’s EC officer
• Completed a comprehensive ‘Project Assessment Pilot’
• Compiled and issued formal guidance on QC/QA for software development
• Worked on ethics: Consortium awareness (intra-project), and privacy in GWAS
  (globally)

These are, however, only a subset of all the advances made, the details of which are
as follows:

For activity #1 (Project Coordination), prompt actions in setting up effective
coordination systems have helped instill into the project a positive and highly
collaborative atmosphere, which itself has led to rapid progress on all scientific and
technical matters. As a first step, a major review of the G2P databasing field was
written [1], illustrating the principal needs of the field in a way that corresponds to the
GEN2PHEN plan. The whole Consortium read and contributed to this review, in order to
get everyone ‘on the same page’ from early on in the project.

On a more practical level, the General Assembly (GA) and the Steering Committee
(SC) bodies were formed and given their remits, and they have convened on 2 and 4
occasions respectively. Two GA Meetings, each held over two days, allowed all the
Partners to present themselves to the Consortium, begin new cross-group interactions,
and understand and influence the project’s objectives. A Scientific Advisory Board
(SAB) has been formed, comprising three individuals with excellent international
credentials: Paul Burton (biobanking/statistics expert), Lincoln Stein
(informatics/databases expert), and Jochen Taupitz (ethics expert). To streamline
pan-project communications, dedicated email lists have been put in place for admin and
science matters. To help coordinate research activities, user-friendly tools (TRAC and
DRUPAL content management systems) have been provided at the project Knowledge
Center (see WP8) which enable collaborating Partners to create private and/or public
web-spaces for joint planning and sharing of information. Additionally, each month the
Project Coordinator asks the whole Consortium to point out any completed
sub-activities or significant advances. These are then compiled into a convenient ‘Monthly
Project Report’ that is distributed to all the Partners so that everyone is informed in
real-time about significant GEN2PHEN developments. Every 4 months, a condensed
form of these reports, with an Executive Summary, is issued to the project’s EC Officer.

For activity #2 (Project Quality and Assessment), all activities are running fully to
schedule. To help instill a highly professional approach to software development from
the outset, firm guidance on this matter has been formulated by the project leadership
(Deliverable D1.1: Specification of Procedures for Quality Testing of Software). All
Partners have agreed to work in accordance with this guidance, one facet of which
involves issuing a statement of QC/QA procedures used for each piece of software
created in GEN2PHEN. To address quality and utility issues on a larger scale, we have
partnered with the ‘InSiGHT Consortium’ to first explore in depth the G2P databasing
needs of clinical and research groups interested in hereditary non-polyposis colorectal
cancer (HNPCC, also known as Lynch syndrome), and thereby to compare this to what
we have built and plan to build within GEN2PHEN. A report on that work (Deliverable
D1.2: Initial Report from Project Assessment Pilot) will be submitted during January
2009, but its main conclusion is that GEN2PHEN is appropriately tackling many
important issues. However, the Pilot also highlights certain activities that need greater
emphasis (e.g., announcing/explaining the tools we build), and suggests new challenges
that need to be tackled (see below).

For activity #3 (Ethical Issues), we have conducted a survey of Partners’ knowledge
and expectations regarding ethics, and results will be analysed in January 2009 for entry
into Deliverable D1.3 (Report on General Ethical Issues in G2P Database Work). This
will put that Deliverable ~1 month behind schedule, for reasons explained below. Once
complete, Deliverable D1.3 will be used to guide ethics within GEN2PHEN, and training
activities around ethical questions. Additionally, we have been involved in policy
discussions (with Wellcome Trust, NIH, NHGRI, NCI, etc.) on ‘ethics versus practicality’
as it pertains to ensuring privacy for aggregate genotype frequency datasets (relevant
to association study summary level databases). This has led to new ideas for a major
‘BioScience and Researcher ID Project’, which is described further in our reporting on
WP5.

Work Deviations and New Opportunities

The work plan for WP1 is proceeding well, with only one slight delay to note. This
concerns Deliverable D1.3 (Report on General Ethical Issues in G2P Database Work)
which is likely to be completed ~1 month later than scheduled. This delay occurred
because we had to redesign the survey at the heart of this Deliverable to take account
of ethical issues raised by a recent publication that showed how individual study
participants could be identified from summary-level genotyping datasets (e.g.,
case-control association study marker frequency data) [2]. This slight delay in finalizing D1.3
will have no adverse effect on the project as a whole.

The Project Assessment Pilot was a very useful exercise, not least because it revealed
quite starkly that there is a big gap between the research and clinical worlds in the
nature of their understanding, perspective, and need for G2P databases. Specifically,
current databases tend to emphasise the research desire for large and complex
datasets that contain much uncertainty. The clinical world, however, needs small,
targeted datasets, with specific conclusions and implications attached.
Consequently, current G2P databases are at least partly to blame for the very slow
translation of genetic data into enhanced medical practice. This problem clearly needs
to be addressed. GEN2PHEN has, therefore, recently begun exploring the relevant
issues with key groups such as Peter Tonellato (Harvard) and John Quackenbush
(Dana-Farber Institute). But our spare resources are very limited, and so we suggest
there is now a need for new, large-scale, international funding of ‘Clinical-GEN2PHEN’
type programs of work.

4. Work progress and achievements during the period 2 (01/01/2009 - 31/12/2009)

During the second year of the GEN2PHEN project, effective coordination actions have
maintained a positive and highly collaborative project atmosphere. All the anticipated
coordination infrastructures have been formulated, including committees, guidance
texts, and progress reporting/tracking procedures. One WP1 Deliverable was
scheduled for the second year, and this was completed on time.

Some particularly notable activities in WP1 during 2009 include:



• Engaging strategic contacts and partnerships, such as the P3G biobanking project,
  the UK Select Committee report on Genomic Medicine, Cameron Neylon concerning
  linked-data and social networking issues, and John Barber regarding the ECARUCA
  database.
• Creating project-wide ‘Progress Reports’ on a monthly cycle to keep all Partners
  updated.
• Interacting with the EC Project Officer to explain and discuss our progress, not least
  via the provision of 4-monthly progress reports.
• Progress on ethics: exploring Consortium awareness, and devising concrete
  operational policies for the project.

These are, however, only a subset of all the advances made, the details of which are
as follows:

For activity #1 (Project Coordination), by diligent and responsive coordination efforts,
we have maintained a positive and highly collaborative project atmosphere. This has
allowed us to make a considerable and early impact on the field, as widely recognized
and reflected by the HUGO Journal asking us to write a paper describing the project’s
goals and operational strategies. This offer was accepted, and we are on track to
submit a manuscript in early 2010.

To ensure the project continues to evolve in visionary and leading directions, WP1 has
developed strategic contacts and partnerships. Specifically, we have: a) established
a Memorandum of Understanding between our project and the P3G international
biobanking project, towards mutual effort and shared goals, b) contributed written and
oral evidence to the UK Select Committee report on Genomic Medicine, and attended a
follow-on expert workshop, c) engaged with Cameron Neylon (emails, visits), who is a
leading figure in linked-data and social networking issues, to see how these can benefit
G2P databasing, and d) liaised with John Barber (Wessex NGRL) regarding options for
inter-connecting their ECARUCA database (www.ecaruca.net) and GEN2PHEN
resources.

Additionally, we have gathered monthly progress reports from all Partners, and
compiled these into bulleted monthly and 4-monthly reports that we issue to the whole
Consortium. These meetings and reports ensure optimal management and
understanding of the project’s progress, and keep all partners aware of new and
collaborative opportunities.

Continued rapid progress in the project has meant that we have had no difficulties
coordinating the completion of project Deliverables. In total, 21 of these were due
during 2009, and all have been, or are likely to be, submitted on time (further details
provided in the individual WP reports).

Frequent and informative interaction with our EC Project Officer has been a natural
priority for the project. In addition to emails and phone discussions, we make available
the unofficial 4-monthly reports, which we are told are very much appreciated. More
substantially, WP1 supplied the detailed scientific section of D10.2: “Technical and
Financial Annual Reports #1”. This was then followed up by securing a ‘Scientific
Advisory Board’ report on the first year’s progress, responses to this report from the
EC Project Officer and from GEN2PHEN, and email discussions with the EC Project
Officer about all the issues raised, in particular the potential value of some kind of
‘clinical-GEN2PHEN’ project.

Leading up to the next full review of GEN2PHEN via its official mid-term Commission
Review, WP1 and the EC Project Officer have together arranged dates and venues
(May 4th 2010 in Brussels), sought reviewers, and announced this plan to the
Consortium and to the project’s SAB.

For activity #2 (Project Quality and Assessment), the main activity has been to report
back to the Consortium on Deliverable D1.2 (“Initial Report from Project Assessment
Pilot”) and ensure that suitable changes to the project were implemented. This
essentially entailed better announcing/explaining the tools we build, focusing on urgent
formation of federated databases so that data producers could manage their own data
release, and concentrating on issues related to clinical utility of G2P data (many of
which, however, go far beyond what this one project can solve). Subsequently, plans
have been assembled regarding the strategy and disease focus for D1.5 (“Intermediate
Report from Project Assessment Pilot”, due month 30). This second pilot will be run in
collaboration with the ‘ENIGMA project’: an analysis of inherited breast cancer which
aims to guide the interpretation of unclassified variants in the BRCA1 and BRCA2
genes.

For activity #3 (Ethical Issues), based upon the plan of attack decided during the first
year, we have prepared, tested, issued, and analysed an ethics survey questionnaire.
The findings are not only reported as part of D1.4 (“Report on External
ELSI Developments”), but they have also been used to update the already submitted
Deliverable D1.3 (“Report on General Ethical Issues in G2P Database Work”) to
include concrete proposals for a GEN2PHEN ethics policy. These guidelines are due
for full discussion at GAM5 in January 2010, with subsequent deployment across the
project. Additionally, we contributed to a French national consultation on revision of the
bioethics law on genetic testing, and we launched "hSERN" (http://www.hsern.eu) to
provide legal/ethical information on exchange of biological samples.

WP2: DOMAIN ANALYSIS AND COMMUNITY RELATIONS

5. Project objectives for the period 1 (01/01/2008 - 31/12/2008)

Work Package 2 (WP2) aims to analyze the existing G2P database field. Success in
this Work Package is key, as its findings will provide the basis for future planning of the
GEN2PHEN project. Completion in a timely manner is therefore also vital, and for this
reason WP2 was tasked with embarking on an extensive domain analysis during the
first year of the project. WP2 is also tasked with the job of helping to develop good
community relations, via a number of approaches (e.g., meetings and specialized
workshops). This work also starts from the very beginning of the GEN2PHEN project.

Activity #1 concerns ‘Community Consultations’, for which work in the first year will
entail designing a plan for community consultations. We will then proceed to consult
and seek expert opinion from various G2P field stakeholders, including G2P data
creators, database technologists, biobank teams, G2P data end-users, LSDB
community, a number of initiatives, e.g. Human Variome Project (HVP), GAIN,
WTCCC, P3G consortium, human genetics societies and genetic journals, etc. The
goal will be to create a meaningful systems requirements document, enabling us to
compare and contrast our project plan with what others judge to be the main needs and
trends in the field, in order to fine-tune and evolve the strategic direction of the project.
This activity will also help build mutual understanding and trust with the end-users of
the GEN2PHEN developments.

Activity #2 concerns ‘Technical Domain Analysis’, via which we will assess and
document the technical state-of-the-art (data content requirements and data access
needs) for each sub-type of G2P database (LSDBs, Diagnostics DBs, and Genomics
DBs holding individual and summary level datasets). Investigations will cover human
and model organism G2P database projects, plus current integration systems. A
particular focus will be LSDBs, since these very important components of the G2P
domain are currently plagued by great heterogeneity of content and structure. Data
models and data exchange formats will also be considered.

6. Project objectives for the period 2 (01/01/2009 - 31/12/2009)

Work Package 2 (WP2) aims to analyze the existing G2P database field. Success in
this Work Package was of utmost importance, as its findings provided the basis for
future planning of the GEN2PHEN project. Completion in a timely manner was therefore
also vital, and for this reason WP2 was tasked with embarking on an extensive domain
analysis during the first 18 months of the project. WP2 was also tasked with the job of
helping to develop good community relations, via a number of approaches (e.g.,
meetings and specialized workshops). This work also started at the very beginning of
the GEN2PHEN project.

Activity #1 concerns ‘Community Consultations’, for which work in months 12-18 was
expected to continue that undertaken in the first year, namely consulting and seeking
expert opinion from various G2P field stakeholders, including G2P data creators,
database technologists, biobank teams, G2P data end-users, LSDB community, a
number of initiatives, e.g. Human Variome Project (HVP), human genetics societies
and genetic journals, etc. The goal was to create a meaningful systems requirements
document, enabling us to compare and contrast our project plan with what others judge
to be the main needs and trends in the field, in order to fine-tune and evolve the
strategic direction of the project. This activity was also to help build mutual
understanding and trust with the end-users of the GEN2PHEN developments.

Activity #2 concerns ‘Technical Domain Analysis’, via which assessment and
documentation of the technical state-of-the-art (data content requirements and data
access needs) for each sub-type of G2P database (LSDBs, Diagnostics DBs, and
Genomics DBs holding individual and summary level datasets) was to be achieved.
Human and model organism G2P database projects, plus current integration systems,
were to be covered. A particular focus on LSDBs was also planned, since these very
important components of the G2P domain are currently plagued by great heterogeneity
of content and structure. Data models and data exchange formats were also to be
considered.

7. Work progress and achievements during the period 1 (01/01/2008 - 31/12/2008)

During the first year of the GEN2PHEN project, WP2 accomplished all its goals. This
not only included domain analysis and outreach, but also organizing two community
consultation workshops whereas only one had been planned. These workshops are
summarized in Deliverable D2.1, which was completed on time.

For activity #1 (Community Consultations), most effort was directed into
co-organizing an international conference, as reported in Deliverable D2.1: “Workshop to
Review the G2P Database Field and Current Data Models”. Additionally, GEN2PHEN
partners have attended many meetings around the globe to become fully aware of
developments in the field, and we held an open discussion session on LSDBs at
HGM2008 at which we committed to produce a 'tutorial' style training manuscript and
website on LSDBs.

The international conference was organized in conjunction with the Human Variome
Project (HVP). Specific discussion areas included i) classifying genetic variation from
unlinked clinical medicine or research laboratories, ii) capturing data from diagnostic
and service laboratories, iii) assessment of pathogenicity, iv) data transfer, v) data
integration and access, vi) funding governance, vii) emerging countries’ initiatives and
involvement, viii) ethics, ix) attribution and publication, and x) pilot projects. Gratifyingly,
the LOVD and UMD database systems (part of GEN2PHEN) were recognized to be
amongst today’s leading solutions for LSDB creation. Clinical and pathology data
standards must now be developed by experts in each genetic disorder for interpreting
the effects of LSDB-stored genetic variation.

For activity #2 (Technical Domain Analysis), one initiative involved organizing a
technology-focused meeting. This was a workshop organized in conjunction with WP3
(Standard Data Models and Terminologies), on data modeling and related issues
(outcomes reported in deliverables D2.1 and D3.1). At this meeting, Partners jointly
considered their own GEN2PHEN systems alongside data models external to
GEN2PHEN (e.g., PaGE, MAGE, and FuGE). This workshop helped us to i)
understand database resources already provided by partners, thereby gaining insights
into technologies, use cases, data models, ontologies, and requirements for
integration, ii) identify commonalities and differences between the GEN2PHEN and
external resources, and iii) evaluate relevant public domain models. Several partners
presented proposals for the representation of ‘Phenotypes’, highlighting a challenging
area that must be further developed in GEN2PHEN, preferably in conjunction with the
model organism community.

Additional work in activity #2 involved an attempt to analyze the content of existing
LSDBs. We are working to establish a list of features/fields from several high quality
LSDBs, to compare and contrast these with fields and functionalities in LOVD and
UMD. In parallel, we are determining the field content for each of the ~700 LSDBs
available on the web. Initial findings indicate that LOVD and MUTbase are the most
frequently used database platforms, and that field lists and nomenclatures need to be
further harmonized and regimented. Additionally, it is clear that better visualization
tools need to be developed, along with more useful querying systems. Once complete,
this analysis will help instruct WP4 (Genetics G2P Databases) in their database
development work.


8. Work progress and achievements during the period 2 (01/01/2009 - 31/12/2009)

During the second year of the GEN2PHEN project, WP2 completed its full scheduled
mission, as planned, by month 18. This included domain analysis and outreach
activities, as well as extensive community consultation. The results of these efforts are
summarized in Deliverables D2.2 (“General G2P Field System Requirements Report”)
and D2.3 (“Technical State-Of-The-Art Document for G2P Databases”), both of which
were completed on time by month 18. Some continuation and additional work will
probably be done in subsequent months/years (e.g., manuscript writing, workshop
participation) as needed by GEN2PHEN, but essentially this Work Package has now
finished successfully.

For activity #1 (Community Consultations), we continued our outreach work with
various G2P field stakeholders, using the following main approaches: surveys, e-mail
exchanges, organization of workshops (e.g., on data/object models) with external
groups, and organization of joint meetings with other major database unification efforts
(e.g., Human Variome Project). As part of this, relationships have been further
strengthened with related projects CASIMIR, BBMRI, FMA, DMuDB, CIMR, ENGAGE,
OBI, EFO, DataSHaPER, PhenX, and Sequence Ontology, towards the
co-development of standards and ontologies.

Via our outreach efforts, we produced a detailed systems requirements document
(Deliverable D2.2), that compares and contrasts our project plan with what others judge
to be the main needs and trends in the field. This document summarized the following:
Model Organism databases (section 2.2.1), interfaces and user requirements (section
2.2.2), and new concepts and possibilities (section 2.2.3).

For section 2.2.1, we surveyed some of the main existing Model Organism Databases, to
identify where lessons can be learned that can benefit today’s broad field of
human-focused G2P databasing. This effort revealed a number of important trends, in
particular regarding software reuse, community data standards and community
collaboration, which are highly relevant to a human G2P genetics/genomics database
world facing a deluge of large-scale, complex biomedical data. For section 2.2.2, we
undertook and reported a genome browser survey designed to list interface and user
requirements of current and potential future genome browsers. For section 2.2.3, we
summarized our participation in several workshops, such as one on “Anatomical Basis of
Disease”, participated in strategy and white paper meetings with P3G, BBMRI, PHOEBE,
discussed mutation database and pathogenicity issues with Partners Healthcare
representatives (e.g., Mollie Ullman-Cullere) and the Human Variome Project, and
outlined planning for a workshop on pathogenicity.

For activity #2 (Technical Domain Analysis), and for Deliverable D2.3, we undertook a
technical analysis of each sub-type of the G2P databases of interest to the
GEN2PHEN project, seeking external advice as needed. This technical analysis
covered the following topics: LSDBs and diagnostic databases (section 2.3.1),
whole-genome genomics databases (section 2.3.2), specifications for locus-reference
genomic sequences (section 2.3.3), and ethics for LSDBs and diagnostic databases
(section 2.3.4). An emphasis was placed upon parameters of standardization,
particularly data models, data exchange formats, and ontologies/nomenclatures. Data
models in current use were documented, enabling a comparative analysis of these
features along with a consideration of how they are used in conjunction with specific
data curation criteria. These data model summaries provide important input for the data
model development work of WP3. A sub-focus on ontologies assessed and
documented the state-of-the-art for relevant projects, producing guidance to be used
later on in specific implementation activities in WP4 and WP5. Ethics was also
specifically addressed, as related to LSDBs and diagnostic databases. Our analysis
work spanned human and model organism databases, as these must all, ultimately, be
properly integrated. To that end, we considered empirically-determined pros and cons
of various current integration strategies.

Additional work was then done to thoroughly analyze the content of existing LSDBs.
This entailed establishing a complete list of features/fields from several high-quality
LSDBs, to compare and contrast these with fields and functionalities in LOVD and
UMD. In parallel, we determined the field content for each of the ~1200 LSDBs
available on the web. These analyses showed that LOVD and MUTbase are the most
frequently used database platforms, and that field lists and nomenclatures across the
domain need to be further harmonized and regimented. Additionally, significantly better
visualization tools are needed, along with more powerful querying systems. This LSDB
analysis is now being prepared for journal publication.
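
As a rough illustration of the kind of field comparison described above, the following
minimal Python sketch counts how often each field occurs across a set of databases and
lists the fields common to all of them. The database names and field names are
hypothetical placeholders, not the actual field lists gathered in the analysis.

```python
from collections import Counter

# Hypothetical field lists for three LSDBs; real field names will differ.
lsdb_fields = {
    "db_a": ["gene", "dna_change", "protein_change", "reference"],
    "db_b": ["gene", "dna_change", "phenotype"],
    "db_c": ["gene", "protein_change", "reference", "ethnicity"],
}


def field_usage(fields_by_db):
    """Count how many databases carry each field (once per database)."""
    counts = Counter()
    for fields in fields_by_db.values():
        counts.update(set(fields))
    return counts


def shared_fields(fields_by_db):
    """Return the fields present in every database, sorted by name."""
    sets = [set(fields) for fields in fields_by_db.values()]
    return sorted(set.intersection(*sets)) if sets else []


usage = field_usage(lsdb_fields)
common = shared_fields(lsdb_fields)
```

Fields that appear in only one or two databases are natural candidates for discussion
when field lists are harmonized.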

Work Deviations and New Opportunities

There were no work deviations in WP2, but two new opportunities were identified that
we shall further develop in conjunction with other Work Packages:

Given the detailed LSDB analysis we completed, the possibility exists for showing and
regularly updating this information via a dynamic web interface. Additionally, this
information can be integrated with an earlier and parallel list produced by the HVP. This
has now all been arranged, and a system is being produced by the GEN2PHEN
Knowledge Center (WP8) for shared display and joint updating/editing by us, the HVP,
and any LSDB curator that wishes to contribute information.

From the LSDB domain analysis it became apparent that very little is being done to
document ethnic differences in mutation frequency in current LSDBs. We shall
therefore theoretically explore ways to precisely record the genetic heterogeneity of a
given population and/or ethnic group by establishing individual-level genomic
databases, and propose these ideas to WP4 for possible implementation.

WP3: STANDARD DATA MODELS AND TERMINOLOGIES

9. Project objectives for the period 1 (01/01/2008 - 31/12/2008)


Work Package 3 (WP3) tackles one of the main challenges in G2P databasing: the lack
of interoperability between public data resources when there are many specialized data
resources that differ widely in scale. For example, LSDBs can contain data for a single
gene, and may use different reference sequence systems. Very large database
resources also exist which contain, e.g., genome-wide data for multiple association
studies, operate on a high-throughput scale, and cannot release complete data sets for
ethical reasons. These disparate resources are well represented in GEN2PHEN, and in
WP3 we will assess the needs of different partners, collate these, and employ these
use cases to develop data models.

As its first objective, WP3 will survey partner use cases prior to starting model
development. Simultaneously, we shall identify external stakeholders and existing data
models that cover the GEN2PHEN domain, which have the potential for re-use within
GEN2PHEN. Secondly, we shall develop data models supporting the core use cases
identified, and validate these models with consortium member use cases and data.
Thirdly, jointly with WP4, WP6, and WP7, we shall develop standard reference
sequences for use by LSDBs, and facilitate data exchange of gene-specific data within
GEN2PHEN. This work is broken down into specific activities, as follows:

Activity #1 concerns ‘Core Data Model Development’, which entails identifying
Consortium and community use cases and thereby establishing a priority list for
modeling across the domains of Genotype, Phenotype, Data and Analysis, and
Environment. This will provide the basis for the iterative development of a Core Data
Model (CDM) in the first 1-2 years of the project, and Specific Data Models (SDMs)
thereafter, for the various databases of interest to GEN2PHEN. These will be made
maximally interoperable with related major projects elsewhere. As part of this effort we
shall define and specify data exchange formats in addition to providing data models.

Activity #2 concerns ‘Advanced Data Modeling Issues’, which entails building on the
first activity in the second half of the project to continually develop data models to
handle new concepts and challenges, such as new statistical tests, complex phenotype
analyses, copy-number variation, epistasis, and epigenetics/methylation variation. It
may also be appropriate to register our CDMs and SDMs with approved and
community-relevant certification bodies, such as OMG (International) and NCICB
(International), to promote the use of GEN2PHEN models and data resources.

Activity #3 concerns ‘Other Standards Development’, which involves contributing to
various other efforts, such as structured mutation nomenclature for the G2P field, and
helping to select and formalize optimal genomic and protein reference sequences for
all GEN2PHEN project LSDBs. This requires first establishing a robust reference gene
structure for genes/regions of interest, and this must be tackled early on in the project
and in close coordination with all relevant stakeholders globally. Additionally, we shall
extend the existing Ensembl DAS standard (currently evolving into DAS2) to provide
design support for G2P data tracks in genome browsers.

10. Project objectives for the period 2 (01/01/2009 - 31/12/2009)

Work Package 3 (WP3) tackles one of the main challenges in G2P databasing: the lack
of interoperability between public data resources when there are many specialized data
resources that differ widely in scale. For example, LSDBs can contain data for a single
gene, and may use different reference sequence systems. Very large database
resources also exist which contain, e.g., genome-wide data for multiple association
studies, operate on a high-throughput scale, and cannot release complete data sets for
ethical reasons. These disparate resources are well represented in GEN2PHEN, and in
WP3 we intend to assess the needs of different partners, collate these, and employ
these use cases to develop data models.

Amongst its first objectives, WP3 has surveyed partner use cases, and identified
external stakeholders and existing data models which have the potential for re-use
within GEN2PHEN. Secondly, WP3 needed to develop data models supporting the
core use cases identified, and validate these models with consortium member use
cases and data. Thirdly, jointly with WP4, WP6, and WP7, WP3 had the objective of
developing standard reference sequences for use by LSDBs, and facilitating data
exchange of gene-specific data within GEN2PHEN. This work is broken down into
specific activities, as follows:

Activity #1 concerns ‘Core Data Model Development’, which entails identifying
Consortium and community use cases and thereby establishing a priority list for
modeling across the domains of Genotype, Phenotype, Data and Analysis, and
Environment. This provides the basis for the iterative development of a Core Data
Model (CDM) in the first 2 years of the project, and Specific Data Models (SDMs)
thereafter, for the various databases of interest to GEN2PHEN. These should be made
maximally interoperable with related major projects elsewhere. As part of this effort we
needed to define and specify data exchange formats in addition to providing data
models.

Activity #2 concerns ‘Advanced Data Modeling Issues’, which entails building on
activity #1 in the second half of the project to continually develop data models to handle
new concepts and challenges, such as new statistical tests, complex phenotype
analyses, copy-number variation, epistasis, and epigenetics/methylation variation. It
may also be appropriate to register our CDMs and SDMs with approved and
community-relevant certification bodies, such as OMG (International) and NCICB
(International), to promote the use of GEN2PHEN models and data resources.

Activity #3 concerns ‘Other Standards Development’, which involves contributing to
various other efforts, such as structured mutation nomenclature for the G2P field, and
helping to select and formalize optimal genomic and protein reference sequences for
all GEN2PHEN project LSDBs. This requires first establishing a robust reference gene
structure for genes/regions of interest, and this must be tackled early on in the project
and in close coordination with all relevant stakeholders globally. Additionally, extension
of the existing Ensembl DAS standard (currently evolving into DAS2) to provide design
support for G2P data tracks in genome browsers was envisaged.

11. Work progress and achievements during the period 1 (01/01/2008 - 31/12/2008)

During the first year of the GEN2PHEN project, WP3 has achieved the intended goals,
and provided a substantial amount of ‘standards guidance’ as
input to WP2, WP4,
WP5, and WP6. Two WP3 Deliverables were scheduled for the first year, and both
have been completed.

Some particularly notable achievements made by WP3 during 2008 include:

- Comprehensive use case development for the G2P databasing field within the
consortium

- Co-development and publication of the generic PaGE-OM reference model for
G2P data

These are, however, only a subset of all the advances made, the details of which are
as follows:

For activity #1 (Core Data Model Development), good progress has been made. Use
case identification was achieved by: i) an internal survey of all relevant Partners, ii) a
3-day workshop of Partners during which use cases were developed further, and iii)
evaluation of relevant available internal and external data models to determine
additional use cases. The workshop findings and the external data models were
documented on the project wiki site, and the workshop summary plus the use cases
with their supporting documentation were reported in Deliverable 3.1 (“Identification of
Consortium Use Cases”).

A core data model was then developed to support the key use cases, and this work
was documented as Deliverable 3.2 (“Development of High-Level Domain Model:
Version 1”). As part of this work, all Partners were asked to describe their existing data
models, and relevant public domain data models were also identified and their scope
and utility documented. Several Partners participated in the development of the
overarching PaGE-OM reference model (www.pageom.org), which has now been
published [3] and accepted as an OMG standard. That PaGE-OM development work
has now been frozen, based in part on the GEN2PHEN experience that it can be better
used as a reference model than as an implementation model. Discussions were also
held with developers of data formats in the UHTS domain, and existing metadata
formats are being investigated for use in GEN2PHEN resources. Additionally, the use
cases developed for Deliverable 3.1 were developed into ‘Schemalets’: simple models
representing key parts of the GEN2PHEN domain and used in the PaGE-OM
development (http://www.schemalet.org/mediawiki/index.php/Portal:PaGE-OM). These
provide communication tools for Partners, and are an intermediate building block for
producing larger data models. Core data models will be reported in Deliverable 3.2.

Integrative data modeling has been addressed in the context of LSDBs, where there are
multiple models within the GEN2PHEN consortium. A mapping model has been
produced which identifies and maps key components of both the LOVD and UMD
database schemas. Further work is now underway to identify the minimal information
required for data exchange representation, which is being used to validate this model.
The documented and validated model will be reported in Deliverable 3.2.
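
The idea of a mapping model plus a minimal information check can be sketched in a few
lines of Python. This is only an illustrative sketch: the field names on both sides and
the minimal field set are hypothetical, and do not reproduce the actual LOVD or UMD
schemas or the mapping produced by WP3.

```python
# Hypothetical field names; not the actual LOVD or UMD schemas.
FIELD_MAP = {
    "gene_symbol": "gene",
    "variant_dna": "mutation_cdna",
    "variant_protein": "mutation_protein",
}

# Hypothetical minimal set of fields required for data exchange.
MINIMAL_FIELDS = {"gene_symbol", "variant_dna"}


def translate(record, field_map=FIELD_MAP):
    """Rename mapped fields; return the translated record and unmapped field names."""
    translated = {field_map[k]: v for k, v in record.items() if k in field_map}
    unmapped = sorted(k for k in record if k not in field_map)
    return translated, unmapped


def missing_minimal(record, minimal=MINIMAL_FIELDS):
    """List the minimal exchange fields that a source record lacks."""
    return sorted(minimal - set(record))


source_record = {"gene_symbol": "GENE1", "variant_dna": "c.76A>T", "curator_note": "n/a"}
target_record, unmapped_fields = translate(source_record)
```

Reporting unmapped fields separately, rather than silently dropping them, makes gaps in
the mapping visible when it is validated against real records.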

Beyond GEN2PHEN, data models for high-throughput data have been provided by the
gene expression community: namely, the MAGE-OM model
(www.mged.org/Workgroups/MAGE/mage-om.html) and the related XML format
MAGE-ML [4]. MAGE-OM can also represent array-based genotyping data, and a
subsequent tab-delimited data format, MAGE-TAB [5], and standard data model are
available which cleanly separate data and metadata. In view of these developments,
and in order to collaborate with the mouse community who use a simple object model
for e-QTL work, a prototype implementation using an extended MAGE-TAB based
object model has been developed in WP3. Its utility will now be explored in the context
of G2P data. Prototyping is also underway with high-throughput genotyping data in
partnership with the European Genotype Archive and the ENGAGE project.
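
To illustrate what separating data from metadata means in a tab-delimited setting, the
sketch below reads a small sample sheet (metadata) that points to external data files.
The column names, sample values, and file names are hypothetical and deliberately
simplified; they are not the MAGE-TAB specification itself.

```python
import csv
import io

# Illustrative tab-delimited sample sheet (metadata only); all values are hypothetical.
SAMPLE_SHEET = (
    "sample_id\tspecies\tphenotype\tdata_file\n"
    "S1\tMus musculus\tobese\tgenotypes_batch1.txt\n"
    "S2\tMus musculus\tlean\tgenotypes_batch1.txt\n"
    "S3\tHomo sapiens\tobese\tgenotypes_batch2.txt\n"
)


def read_sheet(text):
    """Parse the tab-delimited sample sheet into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def samples_by_file(rows):
    """Group sample identifiers by the data file that holds their measurements."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["data_file"], []).append(row["sample_id"])
    return grouped


rows = read_sheet(SAMPLE_SHEET)
grouped = samples_by_file(rows)
```

Because the measurements live in separate files, sample annotation can be corrected or
extended without touching the bulk data.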

For activity #2 (Advanced Data Modeling Issues), work is scheduled to start only after
month 13, and none has yet commenced. A workshop has been scheduled
for January 2009 to begin this work.

For activity #3 (Other Standards Development), in partnership with WP4, WP6, and
WP7, we have developed a new ‘LRG’ standard structure for gene reference
sequences. This is described in detail under the WP6 section of this report.
Additionally, we have begun to explore ontologies relevant to G2P databasing, to help
to bring cross-species compatible ontologies to bear in the work of the GEN2PHEN
Consortium. To this end, links have been made with model organism communities,
especially via the CASIMIR project (http://www.casimir.org.uk) which is developing and
prototyping phenotype representation methodologies. GEN2PHEN has so far attended
three meetings with the rat and mouse communities. Currently, the complex disorder
Metabolic Syndrome has been selected as a test bed for cross-species phenotype
ontology usage. A suitable genetic association dataset has been identified that will be
used for comparison of ontology representation developed for mouse models of
Metabolic Syndrome and to determine what is useful for human data, thereby
establishing what needs to be supported by data modeling.

12. Work progress and achievements during the period 2 (01/01/2009 - 31/12/2009)


During the second year of the GEN2PHEN project, WP3 has achieved the intended
goals, and provided a substantial amount of ‘standards guidance’ as input to WP4,
WP5, and WP6. Six WP3 Deliverables were scheduled for the second year and have
been completed.

Some particularly notable achievements made by WP3 during 2009 include:

- Development of a core GEN2PHEN Phenotype Object Model and a Reference
Implementation. Suitable data sets to test the model were identified and successfully
loaded into the implementation (43000 observation targets).

- Specification of a G2P exchange format derived from high-level object modeling
activities for each of the sub-domains: LSDBs, high-throughput data and phenotypic
descriptions. The formats were tested, implemented and successfully adopted,
e.g. MAGE-TAB by the EGA and ENGAGE communities.

- Successful adoption of PaGE-OM as an official OMG standard and a generic
reference model for G2P data.

These, however, are only a subset of all the advances made, the details of which are
as follows.

For activity #1 (Core Data Model Development), use cases identified and reported
last year in Deliverable 3.1 (“Identification of Consortium Use Cases”) were further
elaborated by liaison with the CASIMIR project and the DataSHaPER and PhenX
communities. CASIMIR is an FP6 project which focuses on co-ordination and integration
of databases of experimental data relevant to the use of the mouse as a model organism
for human disease. The DataSHaPER platform aims to facilitate the prospective
harmonization of emerging biobanks and provides development of questionnaires and
information-collection devices. PhenX is a project funded by the NHGRI to contribute to
the integration of genetics and epidemiologic research, and aims to develop a
recommended minimal set of high-priority measures for use in Genome-wide
Association Studies (GWAS) and other large-scale genomic research efforts. These
use cases were gathered and fed back into the respective domain-specific object models.
The overarching PaGE-OM, successfully adopted as an official OMG standard,
continued to be used as a reference for high-level domain modeling.

Two GEN2PHEN modelling workshops were held (Helsinki, January 19-22, 2009;
Geneva, May 7-8, 2009) and these laid the groundwork for specific sub-domain
development in the context of specific phenotype extensions. External invited
participants from the epidemiology, medical genetics, ontology development, and
model organism communities provided expertise and use cases beyond those of
Consortium Partners. Reports and slides from the workshops are available from:

http://askja.gene.le.ac.uk/drupal5/content/second-gen2phen-data-modelling-workshop

http://askja.gene.le.ac.uk/drupal5/content/first-phenotype-workshop

Several public data models (see documentation at www.schemalet.org) currently exist
in the phenotype space, and those closely aligned to GEN2PHEN were evaluated for
relevance, domain coverage compared to existing resources, ease of use, and
complexity. This has been reported in Deliverable 3.5 (“High-Level Domain Model
Version 2, with Sample/Phenotype Focus”). Subsequently, a core GEN2PHEN
phenotype model has been developed to support primary GEN2PHEN use cases in
the LSDB and high-throughput domains. The model was implemented on the MOLGENIS
platform [1] and testing is well under way, with several publicly available mouse and
human datasets loaded successfully. An ontology annotation service, accessing the
semantic layer through an ontology browser (NCBO BioPortal or EBI Ontology Lookup
Service) or a local ontology file in OWL or OBO formats, is currently in development.
Interoperability with other sub-domain models is being considered as a high priority.

For activity #2 (Advanced Data Modelling Issues), which commenced in month 13,
good progress has been made. Deliverable 3.4 (“Scope and Range Requirements
of Specialized Domain Models”) provides the focus for specific data model development
in later phases of the project to support future partner requirements, and in particular
concentrates on areas in which partners are actively involved and which are emerging
in the community:

- Defining LSDB background information

- Establishment of data content standards in LSDB context

- Establishment of security model requirements

- Support for complex phenotype representation

- Identification of relevant new technologies

Documents describing minimal information standards were created, with HGVS
community feedback. The ‘optional’ parts of the recommendations, in particular, helped
in expanding the scope of the LSDB object model. This was achieved during the Helsinki
Workshop (January 19-22, 2009), where partners and invited experts discussed the
minimal content requirements and subsequently produced sub-domain models
supporting LSDB, diagnostic lab, and GWAS data exchange use cases. These efforts
are now being continued on the GEN2PHEN Knowledge Centre platform in a dedicated
open access interest group.

For activity #3 (Other Standards Development), there have been a number of
achievements regarding G2P data formats and ontology development. The priorities for
data formats in GEN2PHEN are data exchange between locus-specific databases
and central repositories, and high-throughput data. The modelling work to date has
separated these domains to support immediate needs for data exchange.

Validation of LSDB data models commenced this year by working with the existing
LSDBs inside and outside the GEN2PHEN Consortium, most of which have existing
data formats. We aimed to align the formats as much as possible. Validation of the
MAGE-TAB object model is underway and progress is promising. Phenotypic
descriptors, e.g. membership of a cohort through a shared phenotype or trait, will
require an extension of MAGE-TAB, and the requirement to provide details of markers
in the context of high-throughput data will also require an extension. Our devised G2P
data exchange format supporting many use cases is reported in Deliverable 3.7
(“Derivation and Specification of Exchange Format”). This format, combined with web
services developed in parallel, should provide the required data integration amongst
different LSDBs, central browsers, and diagnostic labs.

Additionally, we have extended the existing Ensembl DAS2 standard to provide design
support for sequence variation data tracks in genome browsers. This new ‘SNP-DAS’
functionality is already in use by the Ensembl variation pages (see WP6 and
deliverable D3.8).

Furthermore, Deliverables D3.3 (“Standard Reference Sequences, Made Available
from Ensembl”, see WP6 below) and D3.6 (“A High-Level Domain Model Version 3”)
have also been completed.

Work Deviations and New Opportunities

Cooperation was established with the CASIMIR project (http://www.casimir.org.uk)
towards producing mappings between human and mouse ontologies, and in
particular between the Mammalian Phenotype (MP) Ontology and the Human Phenotype
Ontology (HPO) [2]. As an extension to the original WP3 plan, this work has been
recognised as paramount to cross-species queries and successful integration of human
and mouse annotations in the G2P domain. The complex disorder ‘Metabolic
Syndrome’ has been selected as a test bed for cross-species phenotype ontology
usage. A prototype ontology with integrated metabolic syndrome vocabulary has now
been generated and released for consumption by the ENGAGE community.
Consequently, specific recommendations for terms related to the G2P field were
submitted to the Sequence Ontology (SO) and the Ontology for Biomedical
Investigations (OBI).

WP4: GENETICS G2P DATABASES

13. Project objectives for the period 1 (01/01/2008 - 31/12/2008)

Work Package 4 (WP4) aims to develop generic database solutions for the
management of G2P information relating to any gene/region, and then promote the
creation and deployment of databases for genes involved in Mendelian disorders.
These ambitions include the development of tools for local data management by those
that generate the data, tools that will help with submitting data to community
databases, and tools that will actively gather data from various external depositories.
The initial concept is to build exemplar databases, which may be replicated elsewhere
so that a federated network can ultimately evolve.

Activity #1 ‘LSDB-In-A-Box Solutions’. The goal here is to devise and continually
upgrade modular components for use in building and running single gene level G2P
genetics databases. A comprehensive data model generated by WP3 in the first year of
the project will form the basis to design and build a unifying core LSDB solution. It will
entail establishing one optimized database schema, standard data
input/output/exchange modalities, and compatible database APIs. The basic
components will then be further refined (with respect to data import and validation
support, text plus graphical data output options, curatorial tools, search functionality,
and grid/network integration potential) in subsequent years of the project.

Activity #2 ‘LSDB Creation’ will be achieved by local deployment of the technologies
developed in Activity #1. One or more foundational databases supporting at least all
genes involved in Mendelian disorders are to be created by the end of the third year of
the project. They will then be increasingly populated with datasets gathered directly
from other databases (with permission), by submissions from the community, and
curated from the literature, in close partnership with WP7 (Data Flows). Additionally, a
simple database archiving service will be provided wherein we will be able to back up
existing LSDB databases upon request.

Activity #3 ‘Solutions for Diagnostic Labs’ targets mainly the movement into LSDBs
of G2P datasets generated by diagnostic laboratories. Components to be developed
include dedicated data submission applications and/or website submission forms, and
tools that assist with the process of manual and automated checking and curation of
submitted data. This work will be designed in full compliance with ethical guidelines
emerging from WP1 (Scientific Coordination), one focus of which is the ‘hard-coding’ of
good ethical practice into data submission pipelines.

Activity #4 ‘Ontologies in Genetics Databases’ will start in 2009 and incorporate
G2P related ontologies into GEN2PHEN genetics database activities, once strategic
guidance on this has been formulated by WP2 and WP3. G2P related ontologies
should increase the consistency of descriptions of genotype- and phenotype-related
features, improve the annotation in LSDBs and facilitate advanced searches using the
ontology terms.

Activity #5 ‘Testing and Validation’ will start in 2009 and this is designed as a quality
assurance policy for every piece of software developed in support of genetics G2P
databases. It involves guidelines that are laid out in Deliverable D1.1 (Specification of
Procedures for Quality Testing of Software) that emphasize best practice in quality
control and testing of coding activities. Project policy then requires each software
product we generate to be accompanied by a structured quality assurance statement
regarding how that piece of software was developed.

Activity #6 ‘Deployment Partnerships’ will involve enlisting curators to 'adopt' the
foundational LSDBs designed and built in activity #2. These will be domain experts,
from diverse sources. This activity complements activity #2 (local setting up of LSDBs)
in that it involves helping remote groups to set up LSDBs using the software solutions
that we have generated, all of which should be highly interoperable. This activity
requires that we develop and advertise our database solutions sufficiently
professionally that the community wishes to adopt them. Since system development
and promotion takes time, no deployment work was anticipated in the first year of the
project.

14. Project objectives for the period 2 (01/01/2009 - 31/12/2009)

Work Package 4 (WP4) aims to develop generic database solutions for the
management of G2P information relating to any gene/region, and then promote the
creation and deployment of databases for genes involved in Mendelian disorders.
These ambitions include the development of tools for local data management by those
that generate the data, tools that help with submitting data to community databases,
and tools that actively gather data from various external depositories. The initial
concept is to build exemplar databases, which may be replicated elsewhere so that a
federated network can ultimately evolve.

Activity #1 ‘LSDB-In-A-Box Solutions’ aims to devise and continually upgrade
modular components for use in building and running single gene level G2P genetics
databases. The comprehensive data model generated by WP3 in the first year of the
project was planned to serve as the basis to design and build a unifying core LSDB
solution. This entailed establishing one optimized database schema, standard data
input/output/exchange modalities, and compatible database APIs. The basic
components are then further refined (with respect to data import and validation support,
text plus graphical data output options, curatorial tools, search functionality, and
grid/network integration potential) in subsequent years of the project.

Activity #2 ‘LSDB Creation’ is to be accomplished by local deployment of the
technologies developed in Activity #1. One or more foundational databases supporting
at least all genes involved in Mendelian disorders are to be created by the end of the
third year of the project. They will then be increasingly populated with datasets
gathered directly from other databases (with permission), by submissions from the
community, and curated from the literature, in close partnership with WP7 (Data
Flows). Additionally, a simple database archiving service will be provided wherein we
will be able to back up existing LSDB databases upon request.

Activity #3 ‘Solutions for Diagnostic Labs’ targets mainly the movement into LSDBs
of G2P datasets generated by diagnostic laboratories. Components to be developed
include dedicated data submission applications and/or website submission forms, and
tools that assist with the process of manual and automated checking and curation of
submitted data. This work is being designed in full compliance with ethical guidelines
emerging from WP1 (Scientific Coordination), one focus of which is the ‘hard-coding’ of
good ethical practice into data submission pipelines.

Activity #4 ‘Ontologies in Genetics Databases’ will start mid-way through the project
and incorporate G2P related ontologies into GEN2PHEN genetics database activities,
once strategic guidance on this has been formulated by WP2 and WP3. G2P related
ontologies should increase the consistency of descriptions of genotype- and
phenotype-related features, improve the annotation in LSDBs and facilitate advanced
searches using the ontology terms.

Activity #5 ‘Testing and Validation’ was due to start in 2009 and this is designed as a
quality assurance policy for every piece of software developed in support of genetics
G2P databases. It involves guidelines that are laid out in Deliverable D1.1
(Specification of Procedures for Quality Testing of Software) that emphasize best
practice in quality control and testing of coding activities. Project policy then requires
each software product we generate to be accompanied by a structured quality
assurance statement regarding how that piece of software was developed.

Activity #6 ‘Deployment Partnerships’ involves enlisting curators to 'adopt' the
foundational LSDBs designed and built in activity #2. These will be domain experts,
from diverse sources. This activity complements activity #2 (local setting up of LSDBs)
in that it involves helping remote groups to set up LSDBs using the software solutions
that we have generated, all of which should be highly interoperable. This activity
requires that we develop and advertise our database solutions sufficiently
professionally that the community wishes to adopt them. Deployment work was
anticipated to begin in the second or third years of the project.

15. Work progress and achievements during the period 1 (01/01/2008 - 31/12/2008)

During the first year of the GEN2PHEN project, good progress has been made on all
activity areas specified within WP4. The rate of progress puts each activity on track or
(in many cases) well ahead of schedule, since many activities were able to be
launched ahead of input from WP2, WP3, and WP4. No Deliverables were scheduled
for 2008, and so there is nothing to report on that front.

Some particularly notable achievements made by WP4 during 2008 include:

- Release of new versions of the LSDB database software (LOVD2 and UMD)

- Launch of >30 new LSDBs with the community, using our LSDB-in-a-box platforms

- Update of the Mutalyzer software design specifications to handle the latest HGVS
mutation nomenclature guidelines

- Creation of 'Predictor' software to estimate the pathogenicity of amino-acid variations

These are, however, only a subset of all the advances made, the details of which are
as follows:

For activity #1 (LSDB-In-A-Box Solutions), an initial set of database components plus
manuals, installation CDs, etc., have been built sufficiently to launch LSDBs. These
components include two parallel (LOVD [6] and UMD [7]) sets of core data models and
database schema, database implementations, web-interfaces, basic search
functionalities, and various data display options. Although the LOVD and UMD systems
use different data models, they have been successfully cross-mapped by WP3,
meaning they should be usefully interoperable. Additionally, various items of support
software have been generated. For LOVD, a "data sharing export" feature has been
constructed that is fully compliant with recommendations issued by us and HGVS
regarding what core data elements should be exchanged between LSDBs and central
repositories [8]. For UMD, a 'UMD Predictor' tool has been built which predicts the
pathogenicity of amino-acid changing nucleotide substitutions [9].

For activity #2 (LSDB Creation), using the components built in Activity #1, GEN2PHEN
teams have assembled two functional LSDB-in-a-box platforms (LOVD and UMD),
along with release notes and user manuals for each. These will be developed
continually throughout the project, and hopefully made more deeply interoperable as
time passes. During 2008, more than 30 new LSDBs have been created by the
community on our local GEN2PHEN servers, using the LOVD and UMD platforms.

For activity #3 (Solutions for Diagnostic Labs), we have linked up with diagnostics
labs in the UK to work on this challenge. Progress will require the development of tools
for intra-lab data collection, and one such tool has now been constructed (currently
marketed by PhenoSystems) to enable collection of DNA sequence data. This tool
uses an XML format which was defined specifically for the purpose of exchanging data
to and from LSDBs.

For activity #4 (Ontologies in Genomics Databases), no specific actions on
ontologies have been taken while we await guidance on this matter from WP3
(Standard Data Models and Terminologies). On related ‘standards’ issues, we have: i)
investigated the latest formal HGVS mutation nomenclature guidelines and updated the
design specifications of the Mutalyzer software [10], which checks variation
nomenclature to ensure it is compatible with the HGVS guidelines, and ii) worked with
WP3 and WP6 to design the LRG reference for gene sequences (see the WP6 section
of this report for further details).
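To illustrate what checking variation nomenclature involves, the following is a minimal sketch in Python. It is not Mutalyzer's implementation: the pattern and the function name `check_substitution` are assumptions made for illustration, and it recognises only the simplest coding-DNA substitution form (for example `c.76A>T`), whereas the HGVS guidelines cover many more variant types.

```python
import re

# Simplified pattern for a coding-DNA substitution such as "c.76A>T".
# Assumption for illustration only: real HGVS names also cover deletions,
# insertions, intronic offsets, reference-sequence prefixes, and more.
SUBSTITUTION = re.compile(r"c\.(?P<pos>[1-9][0-9]*)(?P<ref>[ACGT])>(?P<alt>[ACGT])")


def check_substitution(name):
    """Return a list of problems found in `name`; an empty list means it passed."""
    match = SUBSTITUTION.fullmatch(name)
    if match is None:
        return ["not a simple coding-DNA substitution of the form c.<pos><ref>><alt>"]
    if match.group("ref") == match.group("alt"):
        return ["reference and variant nucleotide are identical"]
    return []


if __name__ == "__main__":
    for candidate in ["c.76A>T", "c.76a>t", "c.76A>A", "76A>T"]:
        problems = check_substitution(candidate)
        print(candidate, "OK" if not problems else problems)
```

A full checker would also compare the stated reference nucleotide against the reference sequence itself, which this sketch does not attempt.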

For activity #5 (Testing and Validation), we have made sure that all Partners active in
WP4 have read Deliverable D1.1 (Specification of Procedures for Quality Testing of
Software) and aim to operate in accordance with it throughout the GEN2PHEN project.

For activity #6 (Deployment Partnerships), work was not due to start until month 19.
However, many groups (e.g., staff at NIH) have indicated they would like to use our
LSDB-in-a-box software to support mutation gathering and sharing activities, and so we
have started to deploy our systems beyond GEN2PHEN.

16. Work progress and achievements during the period 2 (01/01/2009 - 31/12/2009)

During the second year of the GEN2PHEN project, good progress has been made on
all activity areas specified within WP4. The rate of progress puts each activity on track
or (in many cases) well ahead of schedule, since many activities were launched ahead
of full input from WP2, WP3, and WP7. Three deliverables scheduled for 2009 were all
completed on time, specifically:

D4.1. User-Manual for LSDB-in-a-box V1 Software (including 2 Manuals)

D4.2. Graphical Software for the Presentation of LSDB Data

D4.3. A Validated Code-Base for Checking Mutation Nomenclature

Some particularly notable achievements made by WP4 during 2009 include:



• Released new versions of the LSDB database software (LOVD2 and UMD) with
enhanced graphical display features.

• Created >40 new curated LSDBs, and >500 LSDBs for genes on the X chromosome
through the LOVD website (www.LOVD.nl/MR).

• Launched prototype web-services for UMD and LOVD databases, and the Findis
database.

• Updated the Mutalyzer software design specifications to handle the latest HGVS
mutation nomenclature guidelines.

• Launched a collaboration between LOVD and WikiProfessional, WikiPeople, & CWA
to connect LOVD to the Wiki environment.

These are, however, only a subset of all the advances made, the details of which are
as follows.

For activity #1 (LSDB-In-A-Box Solutions), the initial set of database components
plus User Manuals (Deliverable 4.1), installation CDs, etc., have been steadily
improved. These components include two parallel (LOVD and UMD) sets of core data
models and database schema, database implementations, web-interfaces, basic
search functionalities, and various data display options (Deliverable 4.2). Although the
LOVD and UMD systems use different data models, they have been successfully
cross-mapped by WP3 (Deliverable 3.2), which implies that they should be usefully
interoperable. Additionally, simple webservice data access protocols have been piloted
for the LOVD and the UMD platforms, and for LOVD this is fully compliant with
recommendations issued by GEN2PHEN and HGVS regarding what core data
elements should be exchanged between LSDBs and central repositories [3].
Furthermore, we have prototyped a Java-based web and Atom server for the Findis
database.

For activity #2 (LSDB Creation), >40 new LSDBs have been created for the
community on our local GEN2PHEN servers, using the two functional LSDB-in-a-box
platforms (LOVD and UMD). In addition, the feasibility of hosting a large number of
LSDBs in a single LOVD installation was demonstrated: through the LOVD website
(www.LOVD.nl/MR), >500 LSDBs for genes on the X chromosome were created to
disclose the results of a large-scale resequencing study in patients with X-linked mental
retardation [4]. Local LOVD servers are being visited by an increasing number of users
(from 2716 unique IP-addresses in January 2009 to 3974 in November 2009). These
users are also viewing increasing numbers of pages (from 109,801 pages in January
2009 to 721,722 in November 2009). Webservice-based data access will be provided
for LSDBs as soon as the code for this has been validated, and to enable access via
the Wiki environment, LOVD developers have launched a collaboration with
WikiProfessional, WikiPeople, and the Concept Web Alliance.
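The growth implied by these usage figures can be worked out directly. The short Python sketch below uses only the four numbers quoted above; the helper name `percent_increase` is introduced here for illustration, and unique IP addresses are treated as a rough proxy for users.

```python
def percent_increase(start, end):
    """Relative growth from `start` to `end`, expressed as a percentage."""
    return (end - start) / start * 100.0


# Figures quoted in the report for the local LOVD servers.
ips_jan, ips_nov = 2716, 3974
pages_jan, pages_nov = 109_801, 721_722

visitor_growth = percent_increase(ips_jan, ips_nov)
page_growth = percent_increase(pages_jan, pages_nov)

# Pages viewed per unique IP address in each month.
pages_per_ip_jan = pages_jan / ips_jan
pages_per_ip_nov = pages_nov / ips_nov

if __name__ == "__main__":
    print(f"unique IP addresses: +{visitor_growth:.1f}%")
    print(f"pages viewed:        +{page_growth:.1f}%")
    print(f"pages per IP:        {pages_per_ip_jan:.1f} -> {pages_per_ip_nov:.1f}")
```

On these figures, page views grew far faster than the number of visiting addresses, so each address viewed considerably more pages in November than in January.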

For activity #3 (Solutions for Diagnostic Labs), we are exploring ways to tackle the
fundamental conundrum of how to transfer mutation observations from many labs to
many potential users and collators of this information (not least LSDBs and centralized
search portals), without each transfer relationship having to be individually established
and tailored to different data formats. One approach involves building and operating the
DMuDB database, which is run by and for UK diagnostic labs, with participation
arranged manually on a case-by-case basis. This is now an established resource. A
more universal solution involves a website ‘clearing house’ for new data dissemination,
designed to receive but not show mutation data. Instead, this resource will post a
‘menu’ of its hosted content so that users can select relevant records to download. We
call this concept the ‘Café for Routine Genetic Data Exchange (Café RouGE)’, and it
provides a holistic solution because: a) all data providers and users will need to
handle only one data format, b) diagnostic labs can submit data either en masse when
convenient, or one at a time as they are processed (via a dedicated ‘submit button’ on
their analysis software), and c) data access can be open or controlled, as per the
preference of the submitting lab. We anticipate a launch of Café RouGE during 2010,
since the core software has now been built, a ‘submit button’ has been placed on
‘GeneSearch’ sequence analysis software from Partner PHENO, and digital user-ID
technologies for data access control are being provided by WP5.
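The ‘clearing house’ idea described above can be sketched as a small data structure. The Python below is illustrative only and does not reflect the actual Café RouGE software: the class names, field names, and the in-memory storage are all assumptions. It shows the three properties listed: a single payload format, submission of records one at a time or in bulk, and a ‘menu’ that lists hosted content without exposing the data, with download either open or restricted by the submitting lab.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """One hosted record (hypothetical fields, for illustration only)."""
    record_id: str
    lab: str
    gene: str
    payload: str                      # variant data in the single shared format
    open_access: bool = True          # the submitting lab chooses open or controlled
    allowed_users: set = field(default_factory=set)


class ClearingHouse:
    """Receives mutation data but only ever displays a menu of what it hosts."""

    def __init__(self):
        self._records = {}

    def submit(self, *submissions):
        # Accepts one record at a time or many at once (en masse).
        for submission in submissions:
            self._records[submission.record_id] = submission

    def menu(self):
        # Descriptive fields only; the payload is deliberately left out.
        return [
            {"record_id": s.record_id, "lab": s.lab, "gene": s.gene,
             "open_access": s.open_access}
            for s in self._records.values()
        ]

    def download(self, record_id, user):
        s = self._records[record_id]
        if s.open_access or user in s.allowed_users:
            return s.payload
        raise PermissionError(f"{user} may not download {record_id}")


if __name__ == "__main__":
    cafe = ClearingHouse()
    cafe.submit(Submission("r1", "Lab A", "GENE1", "c.76A>T"),
                Submission("r2", "Lab B", "GENE2", "c.5C>G",
                           open_access=False, allowed_users={"curator"}))
    print(cafe.menu())
    print(cafe.download("r2", "curator"))
```

Keeping the payload out of `menu()` is the design choice that lets a lab advertise that data exist while still controlling who may retrieve them.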

For activity #4 (Ontologies in Genomics Databases), in line with the project plan, no
specific action on ontologies has been taken while we await guidance on this matter
from WP3 (Standard Data Models and Terminologies). On related ‘standards’ issues,
we have: a) investigated the latest formal HGVS mutation nomenclature guidelines and
updated the design specifications of the Mutalyzer software [5], which checks variation
nomenclature to ensure it is compatible with the HGVS guidelines, and b) worked with
WP3 and WP6 to design the LRG reference for gene sequences, including the
construction of a Java-based LRG parser to handle LRG schema version 1.3 (see the
WP6 section of this report for details).

For activity #5 (Testing and Validation), we have ensured that all staff active in WP4
have read Deliverable D1.1 (Specification of Procedures for Quality Testing of
Software), and they will operate in accordance with it throughout the GEN2PHEN
project. The Mutalyzer software for sequence variant nomenclature checking has been
prepared according to these guidelines, and hence incorporates a Quality Assurance
Report.

For activity #6 (Deployment Partnerships), we are contacting authors of papers
describing many genetic variants as potential LSDB curators, and helping them to start
new LSDBs on our servers. We are also contacting LSDB curators to revive their
LSDBs and/or change them to a standard format. An example of the success of this
approach is provided by our collaboration with the International Society for
Gastrointestinal Hereditary Tumours (InSiGHT), which resulted in the merger of colon
cancer gene variant data from several different databases into one new LOVD
database (http://www.lovd.nl/insight). Some groups prefer to host their own
installations, and in such cases our LSDB-in-a-box software is being provided to these
groups to support their mutation gathering and sharing activities.

Work Deviations and New Opportunities

Genome-wide resequencing is a key application for next generation sequencing
technologies. This has a potentially high impact from a medical perspective, since a
large number of new sequence variants will be observed in every individual genome
analysed. This presents LSDBs with the opportunity to obtain more variant data, but
LSDB software must first be adapted to accept direct (or indirect) automatic submission
of such data. Resequencing projects would also benefit if given the ability to search
within LSDBs to assess the potential pathogenicity of detected variants. And for
personal genomics, LSDBs could be utilised to provide well-structured additional
information about disease mechanisms. WP4 will therefore begin considering these
newly emerging possibilities and the new functionalities they will demand of LSDBs,
and then develop our systems accordingly, within the capacity of GEN2PHEN.

WP5: GENOMICS G2P DATABASES

17. Project objectives for the period 1 (01/01/2008 - 31/12/2008)

Work Package 5 (WP5) aims to develop generic database solutions for the
management of G2P information relating to any gene/region or the whole genome, and
the creation and deployment of specific examples of such databases. These ambitions
include the development of tools for local data management by those that generate the
data, tools that will help with submitting data to community databases, and tools that
will actively gather data from various external depositories. The initial concept is to
build exemplar ‘central’ databases, but to enable the systems to be replicated
elsewhere so that a federated network can ultimately evolve.

Activity #1 concerns the core objective of building suitable ‘Genomics Database
Solutions’. The goal here is to devise and continually upgrade modular components
for use in building and running summary level G2P genomics databases. In the first
year of the project the basic components for one or more databases were to be
assembled, building upon pre-existing interests of Partners, such as the HGVbaseG2P
and the IGVdb teams. These components would then be further refined (with respect to
data import and validation support, text plus graphical data output options, curatorial
tools, search functionality, and grid/network integration potential) in subsequent years
of the project.

Activity #2 concerns the related objective of ‘Creation of Central Genomics
Database(s)’, by local deployment of the technologies developed in Activity #1. One or
more such databases were to be created by the end of the first year of the project,
though these may or may not be sufficiently complete to be released for broad access
and use by the community. Subsequently, these are intended to be made fully
available, and steadily improved by incorporation of technology developments
emerging from Activity #1. They will then be increasingly populated with datasets
gathered directly from other databases (with permission), by submissions from the
community, and curated from the literature, in close partnership with WP7 (Data
Flows).

Activity #3 concerns the development of ‘Tools For Data Submission’, targeting
mainly the collection of small to medium sized datasets by direct deposition by the
broad community. Components to be developed include dedicated data submission
applications and/or website submission forms, and tools that assist with the process of
manual and automated checking and curation of submitted data. Data exchange
formats developed in WP3 (Standard Data Models and Terminologies) will provide an
integral part of this activity. All of this work will be designed in full compliance with
ethical guidelines emerging from WP1 (Scientific Coordination), one focus of which is
the ‘hard-coding’ of good ethical practice into data submission pipelines. During the first
year of the project, little if any construction work will be done on this activity, as it will be
necessary to first establish, implement and test robust data models and database
schema for the resources into which data is to be deposited.

Activity #4 concerns ‘Tools For Data Harvesting’, and this will operate in close
partnership with Activity #3 in terms of the pipelines and procedures developed. The
main difference, however, will be that this activity focuses on the systems for active
gathering of core representations of large datasets (especially Genome Wide
Association Studies: GWAS) from other depositories (e.g., NCI, dbGaP, dbSNP, EGA).
In the first year of the project, contacts with such groups should be made, existing data
formats evaluated, and algorithmic principles for data transfer established.

Activity #5 concerns building ‘Tools for Local Data Management and Output’. The
rationale here is that data producers need informatics solutions that will enable them to
streamline the transfer of G2P data into the database world. Virtually no such solutions
currently exist, especially for small to medium sized labs, and this is one of the main
reasons that virtually no G2P data is regularly submitted to internet-based databases.
Systems to be built in the GEN2PHEN project will be compatible with standards
emerging from WP3 (Standard Data Models and Terminologies), and largely centered
upon our commercial partners, so that extensive product support is available, and the
community feels confident to adopt these products. In the first year of the project, the
commercial partners will contribute to the emerging data standards, and plan for how to
integrate these in their software products.

Activity #6 concerns ‘Ontologies in Genomics Databases’. The objective here is to
optimize the semantic dimensions of data gathered in genomics G2P databases,
especially in relation to phenotype information, so that search systems can most
powerfully interrogate database content, and in order that interoperability between
projects is maximized. Implementation of this activity will be largely a matter of
providing effective support for data curation, and it will leverage ontology developments
coming from WP3 (Standard Data Models and Terminologies) and the broader
community. Such ontologies will not be available for several years, and so no work is
planned on this activity in the first year of the project.

Activity #7 concerns ‘Testing and Validation’, and this is designed as a quality
assurance policy for every piece of software developed in support of genomics G2P
databases. It involves guidelines that are laid out in Deliverable D1.1 (Specification of
Procedures for Quality Testing of Software) that emphasize best practice in quality
control and testing of coding activities. Project policy then requires each software
product we generate to be accompanied by a structured quality assurance statement
regarding how that piece of software was developed. In the first year of the project,
WP5 participants are to ensure they understand and adopt the working principles
stated in D1.1.

Activity #8 concerns ‘Deployment Partnerships’, which signifies the progression from
a ‘central database’ model to one that emphasizes the ‘federated database’ concept.
This involves deploying WP5 database solutions, in whole or in part, to help others
(G2P research laboratories, institutes, organizations, consortia, funders and informatics
teams) set up their own genomics G2P databases, all of which should be highly
interoperable. This requires that we develop and advertise our database solutions
sufficiently professionally that the community wishes to adopt them. Such system
development and promotion will take time, however, and so no deployment work is
anticipated in the first year of the project.

18. Project objectives for the period 2 (01/01/2009 - 31/12/2009)

Work Package 5 (WP5) aims to develop generic database solutions for the
management of G2P information relating to any gene/region or the whole genome, and
the creation and deployment of specific examples of such databases. These ambitions
include the development of tools for local data management by those that generate the
data, tools that help with submitting data to community databases, and tools that will
actively gather data from various external depositories. The basic concept is to build
exemplar ‘central’ databases, but to enable the systems to be replicated elsewhere so
that a federated network can ultimately evolve.

Activity #1 concerns the core objective of building suitable ‘Genomics Database
Solutions’. The goal here is to devise and continually upgrade modular components
for use in building and running summary level G2P genomics databases. In the first
year of the project, the basic components for one or more databases were assembled,
building upon pre-existing interests of Partners, such as the HGVbaseG2P and the
IGVdb teams. These components are to be further refined (with respect to data import
and validation support, text plus graphical data output options, curatorial tools, search
functionality, and grid/network integration potential) in subsequent years of the project.

Activity #2 concerns the related objective of ‘Creation of Central Genomics
Database(s)’, by local deployment of the technologies developed in Activity #1. One or
more such databases were created by the end of the first year of the project.