1. Background - The Cochrane Collaboration




Background Paper: Cochrane Linked Data Project: From “Star Trek” to the present


November 2012

Prepared by

Chris Mavergames, Director of Web Development, Cochrane Web Team

Lorne Becker, Cochrane Web Team and Cochrane Innovations

For the attention of

Anyone who’s interested.

Structure

1. Background
2. User stories and user research
3. The demonstrator
4. Future plans
5. Further reading and viewing
6. Glossary

1. Background

In May 2011, members of the Web Team, the Cochrane Editorial Unit (CEU), the IMS Team, Wiley and consultants from Ontoba held their first meeting to explore how semantic web technologies could enable more dynamic use of Cochrane content, both within The Cochrane Library and by allowing Cochrane to connect to the “web of data” and forge potential partnerships with others working in this new online and technological context. Originally dubbed the “Star Trek” project because of its futuristic thinking, the project continued through the remainder of 2011 and into 2012, focusing mainly on showing proof-of-concept through an initial set of use cases. In March 2012, the project entered its second phase and officially became the Cochrane Linked Data Project.

The impetus behind this project grew out of two things: developments on the web in linking data and injecting “meaning” and structure into content, and reports from our users, gathered through research by Wiley and others, that they would like to see other views of our content. The task thus became “thinking outside the container of the Review”, the full-text PDF presentation that is our current standard. This required “interrogating” our current RevMan XML structure and looking at how it could be improved or augmented to support doing interesting things with the content, such as providing new ways to browse and search it and repackaging it for various users in various contexts.


Cochrane Reviews are great, but…

• There are problems that limit their use by some people
• Difficult to wade through all of the text
• Difficult to understand the figures, terminology, and other bits of the Review
• Hard to compare interventions without reading multiple Reviews
• Moving from studies in CENTRAL to Reviews that included that study is difficult
• Can be difficult to find the Review you seek

Linked data, semantic web and Cochrane: The basics

The linked data approach makes it possible for a machine (i.e. a computer program) to “read” (really, query) a web page or set of pages and return specific portions of interest to the user. For example, a semantic web standard called “GoodRelations” uses linked data markup to enrich search results, so that product details such as photos, price, user reviews and ratings can be extracted and presented to help the user make a purchasing decision. The display of recipes in Google and other search engines is also being enhanced by linked data markup: Google “New York Cheesecake recipe” and you will see that a photo, a rating and a preview appear with the results.


But…

• Machines aren’t good at reading web pages

because…

• Data on the web is meant for human consumption
• Machines need the data to be structured
• Once structured, information can be more easily shared within datasets and across web pages

Fortunately, Cochrane Reviews are structured, but we still need to teach the machines how to read them, where to find data within them and how the data is related.

The web is moving from a web of documents to a “web of data”. Right now, the links on web pages are between documents, but the data and content within web pages and in databases are largely devoid of any “meaning”. The semantic web and linked data are a way to move toward a web of data that allows for more meaningful connections between things. See the “Further reading and viewing” section for more information.

Cochrane Semantic Model

The semantic web technology stack (http://en.wikipedia.org/wiki/File:Semantic-web-stack.png) uses ontologies (semantic models) to describe a domain. For example, Cochrane Reviews can be described using OWL, the Web Ontology Language, and RDFS, RDF Schema, to map the classes and relationships of the various components. So, a Review includes a number of studies, and each study may have, for example, a risk of bias assessment in a Review. Once these concepts and relationships are made explicit, a machine can then “understand” the underlying content. The ontology is used together with data in RDF (Resource Description Framework) format, a simple data model that stores information as “triples” that can be queried against a given ontology or set of ontologies describing the data.

Here is a simple example:

RDF stores data in triples: Subject -> Property -> Object

This is the way humans think as well, in sentences.

<Gerd Antes> hasRole <Director German Ctr>

<Director German Ctr> worksIn <Freiburg, Germany>

<Gerd Antes> worksIn <Freiburg, Germany>

So, given the first 2 statements, the machine could infer the 3rd statement.
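
The inference in this example can be sketched in a few lines of Python. This is only an illustration, not how a triple store or OWL reasoner is implemented: the triples are plain tuples, and the single rule (if X hasRole R and R worksIn P, then X worksIn P) is an assumption written by hand here, whereas in practice such a rule would have to be declared in the ontology.

```python
# Triples are (subject, property, object) tuples; the names mirror the example above.
triples = {
    ("Gerd Antes", "hasRole", "Director German Ctr"),
    ("Director German Ctr", "worksIn", "Freiburg, Germany"),
}


def infer_works_in(facts):
    """Apply one hand-written (hypothetical) rule:
    X hasRole R  and  R worksIn P  =>  X worksIn P.
    Returns only the triples that were not already stated."""
    inferred = set()
    for subject, prop, obj in facts:
        if prop != "hasRole":
            continue
        # Look for a statement about where the role itself is located.
        for subject2, prop2, obj2 in facts:
            if subject2 == obj and prop2 == "worksIn":
                inferred.add((subject, "worksIn", obj2))
    return inferred - facts


new_facts = infer_works_in(triples)
print(new_facts)
```

Running the sketch derives the third statement from the first two without it ever being stored explicitly.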

We have created an ontology, a semantic model, for Cochrane Reviews and studies. The latest version can be found here: http://data.cochrane.org/ontologies/review/. It is still a work-in-progress and needs to be evaluated and tested to be sure that the inferences it makes are consistent with Cochrane methods and that it can fulfill the use cases, and thus the needs, of our various end-users.

2. User stories and user research

Projects already conducted by Wiley and Cochrane have indicated that end users would like to find and view Cochrane content in a variety of different ways, and that developers of new products for The Cochrane Library would like to be able to select and manipulate sections of Cochrane Reviews for repackaging into new products. An improved XML structure and semantic web technologies could facilitate the delivery of “dynamic Cochrane content”. From this research and other thinking within the Linked Data Project, the RevMan Advisory Committee (RAC) and other groups within Cochrane, we have developed lists of “user stories” that inform larger sets of use cases. Using industry techniques we learned from our consultants, Ontoba, we have used various rubrics and tools to arrive at and describe these user stories and use cases.


‘So that…’ phrases


One way to capture user stories is to use the “So that…” framework to describe what people want to do with your content, on your website, etc. You translate desired features into the form: “As a xxx, I want to be able to yyy, so that I can zzz.” Here are some examples from the Linked Data Project:

1. As a ‘XXX’, I want to see all the information about a study in CRS, so that...

• ‘Clinician’: I can see if the paper is relevant to my clinical question before reading the full report.
• ‘Systematic reviewer’: I can screen the paper to see if it is relevant to my review without having to read the full report.
• ‘Anybody’: I can easily compare the characteristics of studies, as the CRS format is common across entries.

2. As a ‘XXX’, I want to see all risk of bias analysis conducted on a study, so that...

• ‘Clinician’: I can see whether the study is biased and whether the results are trustworthy.
• ‘Clinician’: I can see if there are differing opinions on the biases in the study from different authors, and this may help me reach my own conclusions about whether or not I think the study is biased.
• ‘Systematic reviewer’: I can identify whether someone has already done the work of assessing the risk of bias of a study, and this may save me time. I could use the information as a starting point and amend it if needed for my own review, or I could use it after I have performed my own assessments to see how they differ.


From groups of user stories, we are able to build out use cases that can inform potential prototypes of functionality for use on our websites. One example is the idea of an “Asthma Super Centre”, or browser of the evidence on asthma. Another one we’ve been working on is a CENTRAL demonstrator that shows the power of linking between studies and Reviews and the information in Reviews about the studies they evaluate.


User stories and use cases

For the current linked data project, we have been focusing on two sorts of user stories. One is the idea of an “Asthma Super Centre”, or browser of the Cochrane evidence on asthma, that would address our perception that users would like to find and view Cochrane content in a variety of different ways. The generic user story for this section has been the following:



As a reader of Cochrane Reviews, I would like to:

1. Filter reviews by selected parameters to show me the subset most relevant to me
2. Display selected portions of those reviews in a format that works for me
3. Link out to selected content (both Cochrane & non-Cochrane) that would enhance the usefulness of the review material



The second focus for the linked data project has been a CRS-CDSR demonstrator that explores the potential for linking between studies and Reviews and the information in Reviews about the studies they evaluate. The primary user story that we have been addressing in this section is:

As a Cochrane author who has identified a single trial report that is relevant to my review, I would like to:

1. See what other published reports from the same trial have been identified in the “studified” data in CRS
2. See which other Cochrane Reviews have this as one of their included studies
3. See the Risk of Bias appraisals of this study from those other Reviews

…so that I can improve my review by using the work that others in the Collaboration have already done.

3. The demonstrator


As part of phase 2 of the Cochrane Linked Data Project, the Web Team, CEU and Ontoba have created a demonstrator site in which we can build out these initial use cases and which serves as a “sandbox” for demonstrating the power of using linked data with Cochrane and other external content. At present, the demonstrator includes only a subset of the asthma reviews produced by the Airways group. The demonstrator is at http://demonstrator.dev.cochrane.org and has functionality that relates to both the Asthma Super Centre and the CRS/CDSR user stories.

1. Searching Reviews by drug name. Currently, there is no cross-indexing against variant names of drugs in Cochrane Reviews. We have linked to Drugbank (http://www.drugbank.ca/), which includes most of the variants of drug names, including the different brand names and generic names used in different countries. We have created a “semantic search” that allows users to type any name for an asthma medication and find the relevant Cochrane Reviews. See: http://demonstrator.dev.cochrane.org/interventions. This functionality would greatly improve the discoverability of Cochrane content in The Cochrane Library: at present, for example, a search for “Prozac” returns zero results, while a search for “fluoxetine” returns 30.

2. Displaying selected portions of reviews. Clicking on any title on the “List of Reviews” page in the demonstrator (http://demonstrator.dev.cochrane.org/reviews) takes you to a custom view of that review, which we have created by including the sections of the review suggested during the Strategic Discussion in Paris. This capability of showing selected portions of a review and rearranging their order could in future allow us to devise different “views” for different user groups, to let users customize their own Cochrane view by selecting specific components and their order, or to compare reviews by looking at components from 2 or more Reviews side by side.

3. Linking out to selected content. In addition to linking to Drugbank as noted above, we have linked to SIDER, a linked data set that includes information on side effects from FDA label information (see http://sideeffects.embl.de/drugs/2153/ for an example).

4. Finding which Cochrane Reviews have included a particular study. Each review page in the demonstrator includes the list of included studies from the review, with a link to a specific study page for each item on the list. Each study page includes a list of all of the reviews (in our limited set) that have included that study, using the unique study identifier from CRS and the links that CRS provides between studies and Reviews (see http://demonstrator.dev.cochrane.org/studies/revman/002304061509242379-STD-O_x0027_Byrne-2005 for an example).

5. See what other published reports from the same trial have been identified in the “studified” data in CRS. Once again using the links with CRS, each study page in the demonstrator includes a list of all published reports from the study that have been identified by Cochrane collaborators and either used in reviews or studified in CRS.

6. See the Risk of Bias appraisals of a single study from different Reviews. This information is also included on each study page in the demonstrator. In some cases (as in the O’Byrne example above), there is good agreement. Some other examples have more variation: http://demonstrator.dev.cochrane.org/studies/revman/949204060709442762-STD-Koopmans-2006.

While the above examples are simple, they show proof-of-concept for this approach and, critically, the data in the “triple store” beneath this website is completely dynamic. There are only ca. 40 Reviews on asthma in there now, but if we were to put all Cochrane Reviews and their related studies in the linked data repository, the query results would update automatically.
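
The drug-name search described in item 1 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the synonym table is typed in by hand (the demonstrator draws its variants from Drugbank), the review titles are placeholders, and `find_reviews` is a hypothetical helper rather than part of the demonstrator.

```python
# Hypothetical synonym table: canonical (generic) name -> all known names.
# In the demonstrator these variants come from Drugbank, not a hard-coded dict.
SYNONYMS = {
    "fluoxetine": {"fluoxetine", "prozac"},
    "salbutamol": {"salbutamol", "albuterol", "ventolin"},
}

# Placeholder index of reviews keyed by canonical drug name.
REVIEWS_BY_DRUG = {
    "fluoxetine": ["Review A", "Review B"],
    "salbutamol": ["Review C"],
}


def find_reviews(term):
    """Resolve any variant name to its canonical name, then look up reviews."""
    needle = term.strip().lower()
    for canonical, names in SYNONYMS.items():
        if needle in names:
            return REVIEWS_BY_DRUG.get(canonical, [])
    return []  # unknown name: no reviews found


print(find_reviews("Prozac"))      # same reviews as for "fluoxetine"
print(find_reviews("fluoxetine"))
```

The point of the sketch is that the brand name and the generic name resolve to the same entry before the search runs, so both return the same Reviews.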

The technology behind the demonstrator

Demonstrator.dev.cochrane.org uses the Drupal open-source content management system (CMS), the same system used to produce 130+ of the websites for The Cochrane Collaboration. Drupal “plays nicely” with the semantic web stack, including an RDFx module and a very powerful module called SPARQL Views, which allows SPARQL queries to be constructed within the core Drupal Views system. With our triple store (linked data repository) software, OWLiM, running in the background on a server at a canonical data.cochrane.org address, we use Drupal and its RDF and SPARQL modules to quickly create working prototype websites that can be quickly styled using Drupal’s built-in theming and templating system.
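
To give a feel for what a query against the triple store does, here is a small Python sketch of triple-pattern matching, the idea underlying a SPARQL basic graph pattern. It is not SPARQL, OWLiM or SPARQL Views code: the in-memory `store`, the property names `includesStudy` and `hasTitle`, and the identifiers are all hypothetical and are not taken from the Cochrane ontology.

```python
# A toy in-memory "triple store"; identifiers and property names are hypothetical.
store = [
    ("review:R1", "includesStudy", "study:S1"),
    ("review:R2", "includesStudy", "study:S1"),
    ("review:R3", "includesStudy", "study:S2"),
    ("review:R1", "hasTitle", "Review one"),
    ("review:R2", "hasTitle", "Review two"),
    ("review:R3", "hasTitle", "Review three"),
]


def match(pattern, triple, bindings):
    """Unify one pattern with one triple. Terms starting with '?' are variables.
    Returns the extended bindings, or None if the triple does not fit."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it agrees with an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples):
    """Join the patterns one after another, keeping every consistent binding."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in triples
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


# "Which reviews include study S1, and what are their titles?"
solutions = query(
    [("?review", "includesStudy", "study:S1"), ("?review", "hasTitle", "?title")],
    store,
)
titles = sorted(s["?title"] for s in solutions)
print(titles)
```

Because the answer is computed from whatever triples are in the store at query time, adding more reviews to `store` changes the result without any change to the query itself.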

4. Future plans

Our experience with the linked data project to date has convinced us that it has the potential to become an “enabling technology” for the Collaboration that could allow us to do more with our data. However, there are a number of issues that should be explored as we decide how best to integrate linked data within the Cochrane IT structure. These include:

• Potential additional user stories
• The technical architecture, including the implications of increased use of linked data for the IMS, Web Team, CRS and our publisher
• Adding structure and standardization to Cochrane Reviews

Potential User Stories


Our success in realizing the relatively limited goals of the Linked Data Project to date has encouraged us to look at additional user stories that might be addressed using this approach, including several items on the RevMan wish list. For example, RevMan case # 119285, which says "Provide easy access from RevMan to relevant sections of other reviews using the same studies via CRS. E.g. if you were completing the RoB table for a study you could easily see how other authors have assessed the risk of bias for that study", is very similar to the CRS/CDSR user story that we have been working on, and case 122027 calls for "Interaction between RevMan & CRS" without offering specific details.

Some of the user stories that might be explored using this approach include the following:


As a Managing Editor, or as a Cochrane Review user wishing to keep up with a specific area of content, I want to see the date of publication for a subset of reviews (e.g. the set included in an overview, article, guideline, etc.) so that I can see if any have been updated since I last looked at them. This came up as a specific request from an ME, but could easily apply to writers of Cochrane overviews, guidelines, book chapters, etc.

Case 122010 - "Enable authors to generate a visual graphic highlighting each treatment being compared in their review. Each node would be a treatment, each line at least one RCT with numbers corresponding to the number of RCTs."


Case 121023 - "Calculator for estimating power to overthrow current primary outcome. For example, for a very potent intervention with high precision it may need a study total of around 15,000 people with a neutral result to drag the findings back to being null. Should it be a weak finding a trial of 100 may substantially change the result."
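
The data behind the graphic requested in Case 122010 could be assembled roughly as follows. This Python sketch is only one possible reading of that request: the list of trials is invented, each trial is reduced to a single pair of compared treatments, and `comparison_network` is a hypothetical helper. It computes the nodes and the RCT count per line, and leaves the drawing itself aside.

```python
from collections import Counter


def comparison_network(trials):
    """Count RCTs per unordered pair of treatments.
    Each trial is given as a (treatment, comparator) pair."""
    edges = Counter()
    for treatment, comparator in trials:
        # frozenset makes "A vs B" and "B vs A" the same line in the graph.
        edges[frozenset((treatment, comparator))] += 1
    return edges


# Invented example data: three trials comparing three treatments.
trials = [
    ("drug A", "placebo"),
    ("placebo", "drug A"),
    ("drug A", "drug B"),
]

edges = comparison_network(trials)
nodes = set().union(*edges)  # every treatment that appears in any comparison

for pair, count in edges.items():
    print(" vs ".join(sorted(pair)), "-", count, "RCT(s)")
```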

Technical architecture

This refers to how we would actually go about building all this out in practice within, alongside or otherwise in relation to our current systems, workflows and dataflows. An industry standard in implementing semantic web and linked data technologies is to “not blow up the company”: innovate alongside existing tools and technologies to create a metadata store that better describes the content, but leave the existing content store(s) alone. However, we might want to innovate in the authoring process and/or other parts of the content production process as well. This is all still to be determined and will be discussed at the Linked Data Project meeting in London from 4-6 December 2012.

Paul Wilton from Ontoba drew up a possible technical architecture diagram to provide us with an example of one approach we could consider:



Adding structure and standardization to Cochrane Reviews

The fact that Cochrane Reviews are very structured has been critical to the success of the linked data project to date. However, this structure could be greatly improved by coding some key elements in a standard way across reviews. For example, the only way to determine which interventions have been included in a Review is to parse the text in the title of each forest plot. A standardized way of coding the Intervention (I) and the Comparison (C) for each analysis would improve the power and precision of linked data queries of CDSR. Ideally, all elements of the Population, Interventions, Comparisons and Outcomes covered in Cochrane Reviews would be coded using some standard taxonomy.
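
As a rough illustration of why coded elements help, the Python sketch below stores a coded intervention and comparator alongside each forest plot title and answers "which reviews involve this intervention?" without parsing any title text. The record layout, the `INT:` codes and the review identifiers are all hypothetical; no such coding scheme exists in the reviews today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One analysis in a review, with hypothetical coded I and C fields."""
    review_id: str
    forest_plot_title: str   # free text, as found in reviews today
    intervention_code: str   # assumed taxonomy code for the Intervention
    comparator_code: str     # assumed taxonomy code for the Comparison


def reviews_mentioning(analyses, code):
    """Reviews in which the coded treatment appears as either I or C."""
    return sorted(
        {a.review_id for a in analyses
         if code in (a.intervention_code, a.comparator_code)}
    )


# Invented example records.
analyses = [
    Analysis("review-1", "Drug A versus placebo", "INT:A", "INT:PLACEBO"),
    Analysis("review-2", "Drug B vs. drug A", "INT:B", "INT:A"),
    Analysis("review-3", "Drug B compared with placebo", "INT:B", "INT:PLACEBO"),
]

print(reviews_mentioning(analyses, "INT:A"))
```

Note that the three titles phrase the comparison in three different ways ("versus", "vs.", "compared with"), which is exactly what makes title parsing fragile and coded fields attractive.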

Unfortunately, no existing taxonomy adequately addresses this need, although several widely used taxonomies could partially address our requirements. One approach to this problem would be for the Collaboration to build on the various CRG topic lists to develop a Cochrane taxonomy which would not be identical to any individual taxonomy, but would mirror specific portions of a handful of key taxonomies in a way that allows meaningful linkages to them.

The taxonomy could be built gradually by working with individual CRGs. The process has already been initiated with the Airways group as part of the Cochrane Linked Data Project. The CEU browse list would gradually evolve from its current structure to the new taxonomy. As each CRG completed its section of the taxonomy, the relevant section of the CRG browse would be replaced. The eventual result would be that the CEU browse would be completely replaced by the new taxonomy, and each review would have only a single set of topics.

5. Further reading and viewing

Here are some presentations, videos and articles that provide further background on linked data and the semantic web, as well as on the work so far in Cochrane in the “Star Trek” and Linked Data Project.

Presentations


Linked Data and Cochrane Reviews: A Report from the “Star Trek” Crew
Plenary talk by Chris Mavergames from Madrid Colloquium, October 2011
http://www.slideshare.net/mavergames/linked-data-and-cochrane-reviews-12936733


Sustainability and Cochrane Reviews: How Technology can Help
Plenary talk by Chris Mavergames from UK Contributors’ Meeting in Loughborough, March 2012
http://www.slideshare.net/mavergames/sustainability-and-cochrane-reviews-how-technology-can-help-12207716

Web 3.0: The Semantic Web
http://www.slideshare.net/HatemMahmoud/web-30-the-semantic-web


Videos


Linked Data and the Web of Data

https://www.youtube.com/watch?v=GKfJ5onP5SQ




Intro to the Semantic Web
https://www.youtube.com/watch?v=OGg8A2zfWKg


The Semantic Web of Data
Tim Berners-Lee, inventor of the World Wide Web
https://www.youtube.com/watch?v=HeUrEh-nqtU



6. Glossary

Here is a glossary of terms related to
linked data as well as a few related to Cochrane.

API

Application Programming Interface: allows different pieces of software to communicate.

CENTRAL

Cochrane Central Register of Controlled Trials (CENTRAL)

Controlled vocabulary

Most commonly known in indexing and cataloguing, controlled vocabularies use pre-defined, specific and agreed-upon sets of terms for use in taxonomies, thesauri and other systems to tag and organize content and data.

Drupal

An open-source Content Management System (CMS); see http://drupal.org. The Cochrane Web Team uses Drupal for the 130+ websites it manages.

Drupal themes

Layouts and designs for Drupal-based websites.

Drupal Views

A module in Drupal that is basically a GUI (Graphical User Interface) for querying the database (MySQL) behind Drupal and displaying content on a website in more or less any form you like.

GoodRelations

From http://www.heppnetz.de/projects/goodrelations/, “GoodRelations is the most powerful vocabulary for publishing all of the details of your products and services in a way friendly to search engines, mobile applications, and browser extensions. By adding a bit of extra code to your Web content, you make sure that potential customers realize all the great features and services and the benefits of doing business with you, because their computers can extract and present this information with ease.”

Linked data

Part of the movement known as the Semantic Web or Web 3.0, linked data refers to a set of concepts and standards for connecting data on the web and across data silos. See: http://linkeddata.org/.

Linked Life Data

“A semantic data integration platform for the biomedical domain”; see http://linkedlifedata.com. It includes the Unified Medical Language System (UMLS), which includes SNOMED CT, as well as Drugbank, both used in the Linked Data Project demonstrator site at demonstrator.dev.cochrane.org.

Metadata

Put simply, “data about data”. Data that describes your content.

Ontology


“An ontology is a specification of a conceptualization.” Ontologies in the semantic web are used to describe a domain, including the classes, properties and relationships between things.

OWL

The Web Ontology Language. See: http://www.w3.org/TR/owl-features/.

OWLiM

Semantic repository software, or “triple store”, currently used in the Cochrane Linked Data Project. See: http://www.ontotext.com/owlim.

RDF

Resource Description Framework. A data model for storing data in “triples”. See: http://en.wikipedia.org/wiki/Resource_Description_Framework.

RDFS

RDF Schema language. See: http://en.wikipedia.org/wiki/RDF_Schema.


Semantic Web

See videos above!

SNOMED CT

Systematized Nomenclature of Medicine -- Clinical Terms. A controlled vocabulary of medical terms. See: http://en.wikipedia.org/wiki/SNOMED_CT.

SPARQL

SPARQL Protocol and RDF Query Language. The query language for querying data in RDF format. See: http://en.wikipedia.org/wiki/SPARQL.

SPARQL Views

A Drupal module that integrates the SPARQL query language with the Views module to create displays of content on a website.

Taxonomy

A less formal way of creating a system to organize content. Note: there is substantial debate about the difference between ontologies, taxonomies, controlled vocabularies and thesauri!

Triples

RDF triples. In the RDF data model, data is stored as triples of the form subject - predicate - object. There are multiple serializations for RDF, including RDF/XML, Turtle and N3.

Triple store

From Wikipedia: A triplestore is a purpose-built database for the storage and retrieval of triples, a triple being a data entity composed of subject-predicate-object, like "Bob is 35" or "Bob knows Fred". See: http://en.wikipedia.org/wiki/Triplestore.