Science, Models, and the Semantic Web




Professor James Hendler
Dept of Computer Science
University of Maryland
College Park, MD 20853



As powerful as the current World Wide Web search engines and interface technologies
are, they can leave much to be desired in many cases. A key problem, especially facing
the practitioner of science, is that current search technology is primarily oriented around
the printed word. When searching for a web site, or a paper on a particular topic, engines
like Google can do a phenomenal job of searching literally billions of possibilities and
identifying useful candidates. However, when searching for data in databases, images
and figures, programs that provide necessary capabilities, devices that can be remotely
accessed, or other such non-textual web resources, the tools are far less helpful. What
database has the latest sequence of that gene you’re working on, and how do you interact
with it? Is there someone else using the chemical you’ve been testing as a reagent in
their experiments? Is there a program online someplace that can make some sense of the
millions of samples that your new data collector is gathering from your sequencing
library? On the current Web, questions like these can barely even be asked, but web
developers are already building the underlying technologies that will make them as
commonplace as a keyword search is today!


The next generation of the World Wide Web, dubbed the “Semantic Web”[i] by Tim
Berners-Lee, inventor of the current web, is aimed at improving communications
between people in different disciplines, between different types of computer programs,
and between people and the machines they have come to rely on in the pursuit of
science. This new technology is based on a simple idea: when scientists need to interact
with each other, especially across disciplinary bounds, they generally construct a model
of some type to facilitate communication. This model can be physical, like the
well-known “stick and ball” models of atoms, or virtual, as in the visualizations of complex
molecules on a three-dimensional computer screen. The model can be mathematical, like
the set of equations dictating the behavior of a chemical system, or an engineered artifact,
like the gear and chain models of the solar system used in an introductory astronomy
class. Models can also be conceptual constructs, like the biologists’ taxonomy of species,
family and phyla. In all of these cases, these models improve communication because
they expose the intrinsic “semantics” of the system being modeled: rather than being
forced to agree on the technical jargon of a particular field, each scientist can understand
the model in their own way, but in a shared context. The Semantic Web is based on web
languages that make it possible to build simple models of technical domains, and to
expose the semantics instead of just the textual jargon. The languages are still in
relatively early stages, like the sticks and balls of simple molecules, but with the help of
the scientific community, they will evolve rapidly to allow an explosion of new
capabilities.

Please note: This position paper is the basis of a Policy Forum article to appear in
Science on January 24, 2003. Material in this paper is therefore under Science’s embargo
rules until then. It must not be posted on a public web site, or distributed without this
note attached. An extended version of this article will be submitted for the workshop’s
book.


The Semantic Web is based on a new generation of web languages that go beyond the
presentation capabilities of HTML and the document-tagging capabilities of the
Extensible Markup Language (XML). These languages provide a set of modeling
primitives that, like the models used in science, can improve communication among
humans who need to interact with each other across subareas and disciplines. Perhaps
even more importantly, using these languages the models become machine readable,
allowing interoperation between the different “disciplines” of computer technology,
whether they be different applications and operating systems, different databases used to
process and visualize physical data, or different computer programs offered as
“services” running across a variety of machines and devices. These new languages,
coming with names like RDF, RDF-Schema, DAML and OWL, are not just a new
alphabet soup coming to confuse the scientific community. Rather, they make available
new tools that enable significant improvements in cross-disciplinary collaboration and
“e-science,” as it has come to be known.


The primary features of the Semantic Web models are based on concepts that are easily
grasped by the scientist. They start with taxonomies, expand them to thesauri, and on to
what are known as ontologies: thesauri which delineate precise relationships between
the taxonomical entities. These ontologies provide the basic machine-readable models
that supply the interoperability and intercommunication needed for collaborative
work. Let’s briefly explore an example taken from work being done at the US National
Cancer Institute’s Center for Bioinformatics (NCICB) as part of the National Cancer
Institute Metathesaurus.[ii]



A very simplified definition from the NCICB ontology would be the description of an
“Oncogene” which has a set of properties:


    Class (Oncogene):
        Found_In_Organism(organism).
        Gene_Has_Function(function).
        In_Chromosomal_Location(location; unique)
        Gene_Associated_With_Disease(Disease).


The definition also specifies the restrictions on the various properties (for example, it
could state that there can be only one “In_Chromosomal_Location”). In addition, specific
instances of oncogenes are described and linked to this definition. For example, the gene
MYC is defined to have certain values of the properties:


    Oncogene(MYC):
        Found_In_Organism(Human).
        Gene_Has_Function(Transcriptional_Regulation).
        Gene_Has_Function(Gene_Transcription).
        In_Chromosomal_Location(8q24).
        Gene_Associated_With_Disease(Burkitts_Lymphoma).


This provides a simple model of what an oncogene is and what properties it has, written
in a machine-readable way.[iii] This deceptively simple model, by being published on a
web site, gains a tremendous amount of power: the power of the Web, since different
authors, systems, devices and programs can now link to it. For example, suppose you
were doing medical research about “Burkitt’s Lymphoma.” By explicitly linking this
English phrase to its use in the ontology document above, you would be linked to
information about MYC. Other researchers are able to link their resources as well.
A biotechnologist who has been sequencing chromosome 8 can link data about location
8q24 to this same definition. In addition, the database may contain links to other
information, for example that the locus PVT1 is located on 8q24. Online resources like
PubMed also link to similar terms; for example, PubMed contains a link from PVT1 to a
paper entitled “Rearrangement of a DNA sequence homologous to a cell-virus junction
fragment in several Moloney murine leukemia virus-induced rat thymomas.”[iv]
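The oncogene definition and its MYC instance can be mirrored in a few lines of ordinary code, which may help make the idea of a property "restriction" concrete. The sketch below is only an illustration in Python, not the notation or tooling actually used by the NCICB; the `Instance` class, the `assert_value` method, and the two module-level sets are hypothetical names introduced here. It assumes a single restriction, namely that "In_Chromosomal_Location" may hold at most one value.

```python
from dataclasses import dataclass, field

# Property names are copied from the simplified Oncogene definition in the text.
# Everything else in this sketch (class, method, and set names) is hypothetical.
ONCOGENE_PROPERTIES = {
    "Found_In_Organism",
    "Gene_Has_Function",
    "In_Chromosomal_Location",
    "Gene_Associated_With_Disease",
}
# Assumed restriction: a property listed here may hold at most one value.
UNIQUE_PROPERTIES = {"In_Chromosomal_Location"}


@dataclass
class Instance:
    """A named individual of class Oncogene with multi-valued properties."""
    name: str
    values: dict = field(default_factory=dict)

    def assert_value(self, prop, value):
        """Record one property value, enforcing the uniqueness restriction."""
        if prop not in ONCOGENE_PROPERTIES:
            raise KeyError(f"{prop} is not a property of Oncogene")
        current = self.values.setdefault(prop, [])
        if value in current:
            return  # restating a known fact changes nothing
        if prop in UNIQUE_PROPERTIES and current:
            raise ValueError(
                f"{prop} is unique; {self.name} already has {current[0]}"
            )
        current.append(value)


# The MYC instance, with the values given in the text.
myc = Instance("MYC")
myc.assert_value("Found_In_Organism", "Human")
myc.assert_value("Gene_Has_Function", "Transcriptional_Regulation")
myc.assert_value("Gene_Has_Function", "Gene_Transcription")
myc.assert_value("In_Chromosomal_Location", "8q24")
myc.assert_value("Gene_Associated_With_Disease", "Burkitts_Lymphoma")
```

In this sketch a second, different chromosomal location for MYC would be rejected with a `ValueError`, while a second function is accepted, which is the distinction the "unique" marker draws.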


From the NCICB model, we know MYC is associated with Burkitt’s Lymphoma and
with chromosomal location 8q24. This in turn is linked to a database that associates that
location with the locus PVT1, and PubMed has identified this paper as being linked to
that locus. Even though the words “Burkitt’s lymphoma” and “8q24” don’t appear
anywhere in the PubMed description, the system now contains the links needed to answer
a query such as “Is anyone working on a paper that describes a locus on the chromosome
associated with Burkitt’s Lymphoma?” It can also find sequences in the database
associated with PVT1 or 8q24, and follow the links through these to the gene’s function,
location, organism, etc. More importantly, no one entity had to input all the information
at a single time: the NCICB created the ontology, the medical researcher knew about
Burkitt’s Lymphoma, the biotechnologist about 8q24 and PVT1, and PubMed about
PVT1 and the aforecited paper. As with the current Web, the magic of the Semantic Web is
in the links. What’s more, if even more powerful modeling features of the web languages
are used, even more complex relationships can be expressed, and linked to other kinds of
resources, creating the backbone for the sorts of interdisciplinary queries shown at the
beginning of this article.[v]
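The chain of links described above can be sketched as a small program. This is a minimal illustration in Python rather than in any actual Semantic Web language: each source contributes its own (subject, property, object) statements, and the query is answered only once they are merged. The source labels, the helper functions, and the property names "Locus_Located_On" and "Described_In" are assumptions made for this sketch; the statements themselves come from the example in the text.

```python
# Statements published independently by three sources. The property names
# "Locus_Located_On" and "Described_In" are hypothetical; the other two
# come from the Oncogene definition in the text.
NCICB = [
    ("MYC", "Gene_Associated_With_Disease", "Burkitts_Lymphoma"),
    ("MYC", "In_Chromosomal_Location", "8q24"),
]
SEQUENCING_DB = [("PVT1", "Locus_Located_On", "8q24")]
PUBMED = [("PVT1", "Described_In", "Lemay and Jolicoeur 1984")]


def merge(*sources):
    """Pool the statements of several sources into one list of triples."""
    return [triple for source in sources for triple in source]


def subjects(triples, prop, obj):
    """All subjects s for which the statement (s, prop, obj) exists."""
    return {s for s, p, o in triples if p == prop and o == obj}


def objects(triples, subj, prop):
    """All objects o for which the statement (subj, prop, o) exists."""
    return {o for s, p, o in triples if s == subj and p == prop}


def papers_for_disease(triples, disease):
    """Follow disease -> gene -> chromosomal location -> locus -> paper."""
    papers = set()
    for gene in subjects(triples, "Gene_Associated_With_Disease", disease):
        for location in objects(triples, gene, "In_Chromosomal_Location"):
            for locus in subjects(triples, "Locus_Located_On", location):
                papers |= objects(triples, locus, "Described_In")
    return papers


web = merge(NCICB, SEQUENCING_DB, PUBMED)
print(papers_for_disease(web, "Burkitts_Lymphoma"))
# prints {'Lemay and Jolicoeur 1984'}
```

Dropping any one source from the `merge` call breaks the chain and the query returns an empty set, which reflects the point that no single contributor holds the whole answer.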


While the Semantic Web holds incredible promise for the needs of scientists and can be
an important constituent in the emerging field of e-science, it will take the collaborative
efforts of those in the traditional scientific fields working closely with information
technologists to make it happen. The Semantic Web, like the Web itself, will be a
“disruptive” technology, that is, a technology that changes the processes used by
practitioners in their fields. The Web has changed how we search for information,
publish our results, share our data, and even how we purchase lab supplies. More
importantly, it has lowered the barriers to many types of scientific collaboration between
organizations and research groups. The Semantic Web will likewise change how we do
business, and will likely have its greatest impact by lowering the barriers to doing
interdisciplinary research, enabling chemists, biologists, physicists, anthropologists,
geneticists, mathematicians and all the rest of us to find and share each other’s resources
and outputs.


Unfortunately, despite the huge impact that the Semantic Web is likely to have on
science, scientists are not currently highly engaged in the Semantic Web activity. There
are many reasons for this, but an important one is the mechanism by which the
infrastructure of science is funded. Crosscutting efforts like this one, which transcend
disciplinary boundaries, are hard to fund within the traditional discipline-oriented review
panels used in choosing fundable research. This hindered the original adoption of web
technologies by the scientific community, and threatens to hinder the adoption of the new
Semantic Web technologies as well. Research scientists, and the graduate students they
employ, need to help information technologists define and field these new tools. The
e-Science initiative in the UK[vi] is a good example of how research scientists and
information technologists can work together for the betterment of science, and recent
efforts to unite the Semantic Web and Grid computing[vii] show great promise. Scientists
around the world should unite with their colleagues in Computer Science and Information
Technology to create similar programs.


There is also another issue on which information technologists and scientists must start to
speak with a single voice. Numerous scientific communities are becoming more and
more troubled by the conflicting policies and priorities caused by the ownership and
patenting of scientific work. As universities and companies try harder to profit from the
intellectual property created by their researchers, the ability of scientists to share
materials and processes suffers. Similarly, in Information Technology circles, the push
for ownership and rights, coupled with a patent office that cannot keep up with the
technical complexities of a rapidly changing field, is causing damage to the open
discussion and sharing of Semantic Web technologies. Research scientists must team
with their computer science brethren and fight against the intellectual property policies
and runaway patent madness that make the free dissemination of our products impossible.
The original World Wide Web revolution was enabled by open code, free software, and
the wide dissemination of low-cost computing technology. The Semantic Web requires
similar openness, which is much more difficult in today’s environment, where the drive for
ownership and profit challenges the scientific and technological freedoms
required to bring this new generation of web technology into the hands of the scientific
community.


Bringing modeling languages to the web is just the start of an exciting revolution that will
open up the way science is performed. Cross-disciplinary scientific collaboration holds
the potential to keep the scientific enterprise vital and growing. Semantic Web
technologies will cause us to change the way we approach, fund, and perform science,
but those changes will be for the better if scientists and information technologists can join
together to create the tools we need to cope with the ever more complex problems we are
all striving to solve.




[i] T. Berners-Lee, J. Hendler and O. Lassila, The Semantic Web, Scientific American, May
2001 (available online at http://www.sciam.com/2001/0501issue/0501berners-lee.html).







[ii] See http://ncicb.nci.nih.gov/ and http://ncimeta.nci.nih.gov/indexMetaphrase.html for
information on the NCI cancer metathesaurus project.


[iii] The notation here is greatly simplified and not “web friendly.” The actual notation
uses the familiar angle brackets and URIs of the World Wide Web. See
http://www.w3.org/TR/owl-absyn/ for the details of the actual syntax used in these
ontology definitions.


[iv] Lemay G, Jolicoeur P. Proc Natl Acad Sci U S A 1984 Jan;81(1):38-42; online at
http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=m&form=6&uid=84119459&Dopt=r


[vi] See www.research-councils.ac.uk/escience for more about this programme.

[vii] For more about Grid Computing see: The Grid: Blueprint for a New Computing
Infrastructure, Ian Foster and Carl Kesselman (eds), Morgan Kaufmann, San
Francisco, CA, 1998. Integrating Semantic Web and Grid technology will be a central
theme of the Euroweb 2002 Conference being held in England in December 2002:
http://www.w3c.rl.ac.uk/Euroweb/