Application of Semantic Technology: Semantic Medline on the Cray XMT2

Nov 4, 2013 (3 years and 10 months ago)

© 2011 Noblis, Inc. Noblis proprietary and confidential.

Victor J. Pollara

2 October 2012

Background


Government and industry are housing data of every kind.


Structured data sets may have rich information in them that is hidden because of how the
database is designed. Most are in tabular formats (either relational DBs or flat files).


A tremendous amount is in the form of text.


‘Big Data’ does not mean ‘Useful Data’ if you can’t answer the questions you need to answer.


Often we can greatly increase the utility of existing data by:


Augmenting the existing data set with a small model (e.g. ontology, taxonomy)


Integrating multiple sets on common data elements


Extracting and structuring information from text


Example using the XMT2: Semantic Medline (Rindflesch, Shin, et al.)


60M+ high-confidence ‘facts’ extracted from 22M biomedical (PubMed) citations


Augment it with biomedical knowledge models (e.g. UMLS Metathesaurus, NCBI Taxonomy)


Integrate with other resources (e.g. Geonames)


Overview

This talk:

Tabular data and semantic data: bridging the gap

Text data and semantic data:

Semantic Medline Application

Cray XMT2

Semantic Services with the XMT2

Augmentation and Integration

Going beyond semantic data



Technologies for Bridging the Tabular-Semantic Gap

There is a class of software products that creates ‘semantic’ views of the data
in a relational database.

Examples:


D2R/D2RQ


Does not disturb a live relational database.


Renders all the data in triple format.


Fully automatic (does not require a subject matter expert)


Is ignorant of the semantics of the data.


R2RML


Language in which a subject matter expert builds a model that adds semantics to
the data in the database


When used with a tool like Revelytix Spyder, it provides a more semantic view of the data


Mappings can embed SQL queries, so it relies on SQL to do all the heavy lifting


Not practical if data values map to non-regular URLs


Scripting languages (e.g. Perl)


Can do anything you want


Traditional ETL


Creates another version of the data
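
To make the "fully automatic but ignorant of the semantics" trade-off concrete, here is a minimal Python sketch of a naive, D2R-style triplification: every row becomes a subject and every column a predicate. It is an illustration only, not D2RQ's actual algorithm. The base URI `http://example.org/`, the function name `triplify_rows`, and the column names in the sample row are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]

BASE = "http://example.org/"  # hypothetical base URI, not from the talk


def triplify_rows(table: str, key: str,
                  rows: Iterable[Dict[str, object]]) -> List[Triple]:
    """Turn each row into triples: one subject per row, one predicate per column."""
    triples: List[Triple] = []
    for row in rows:
        subject = f"<{BASE}{table}/{row[key]}>"
        for column, value in row.items():
            if column == key or value is None:
                continue  # the key is already encoded in the subject URI
            predicate = f"<{BASE}{table}#{column}>"
            # Values stay plain literals: the mapping knows nothing about
            # what a drug code or an age actually means.
            triples.append((subject, predicate, f'"{value}"'))
    return triples


if __name__ == "__main__":
    # Sample row with hypothetical column names.
    claims = [{"patient_id": 125454, "age": 65, "drug_code": 229}]
    for s, p, o in triplify_rows("claims", "patient_id", claims):
        print(s, p, o, ".")
```

The output is valid triple data, but a drug code is still just a literal; linking it to a knowledge model is the step that needs a subject matter expert or a hand-written script.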


The XMT2 vs. Bridge Technologies


The architecture of the XMT2 is suited for data that is not easily subdivided


Efficiency of computation requires the entire set to be held in shared memory


Data with little semantic content is not the best candidate (e.g. triplifying huge tabular arrays of numerical data is not appropriate)



Since you are going to create a copy of the data for the XMT2, the best
approach is to remodel it to contain as rich a semantic structure as possible.


Any ontology that adds semantic richness can support new queries that might
be valuable


Since you are doing ETL, a scripting language is appropriate.


Augmentation


For example, Medicare collects vast amounts of claims data


Researchers can use it to evaluate the effectiveness of procedures or drugs


But the format makes it difficult to explore the data in medically meaningful ways

UMLS Knowledge Model

Antibacterials
    Penicillins
        Amoxicillin
        Ampicillin
    Cephalosporin

Tabular Claims Data

Patient ID    Age    Drug Code
125454        65     229
224377        77     634
986904        82     229
774826        66     551
223556        71     634
394857        70     551
675849        65     551

Why add more data to an already large set?
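
One way to see the payoff is that, with a small drug hierarchy attached, a question like "which patients received any penicillin?" becomes a simple lookup. The Python sketch below uses the claims rows and the knowledge model from this slide. The mapping from drug codes 229, 634, and 551 to drug names is invented for illustration, because the slide does not say what the codes mean.

```python
from typing import Dict, List, Optional, Set, Tuple

# Rows from the slide: (patient id, age, drug code).
CLAIMS: List[Tuple[int, int, int]] = [
    (125454, 65, 229), (224377, 77, 634), (986904, 82, 229),
    (774826, 66, 551), (223556, 71, 634), (394857, 70, 551),
    (675849, 65, 551),
]

# HYPOTHETICAL code-to-drug mapping; the slide does not give one.
DRUG_OF_CODE: Dict[int, str] = {
    229: "Amoxicillin",
    634: "Ampicillin",
    551: "Cephalosporin",
}

# Child -> parent links of the small knowledge model.
PARENT: Dict[str, Optional[str]] = {
    "Amoxicillin": "Penicillins",
    "Ampicillin": "Penicillins",
    "Penicillins": "Antibacterials",
    "Cephalosporin": "Antibacterials",
    "Antibacterials": None,
}


def ancestors(term: str) -> Set[str]:
    """Return the term itself plus every class above it in the hierarchy."""
    found: Set[str] = set()
    current: Optional[str] = term
    while current is not None:
        found.add(current)
        current = PARENT.get(current)
    return found


def patients_on(drug_class: str) -> List[int]:
    """Patients whose drug falls anywhere under the given class."""
    return [pid for pid, _age, code in CLAIMS
            if drug_class in ancestors(DRUG_OF_CODE[code])]


if __name__ == "__main__":
    print(patients_on("Penicillins"))
```

The claims table never mentions "Penicillins"; the query only works because the few rows of the model were added.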


Why Modeling is Crucial


e.g., NCBI Taxonomy


The NCBI taxonomy is only available as a RDB (tabular) dump:



Wrote Perl scripts to remodel the NCBI Taxonomy relational tables into a single RDF file.


R2RML would work well for this because names are uniform


http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606


The other possibility would be to use an automatic D2R map of the DB


nodes.dmp column names:

tax_id         -- node id in GenBank taxonomy database
parent tax_id  -- parent node id in GenBank taxonomy database
rank           -- rank of this node (superkingdom, kingdom, ...)
embl code      -- locus-name prefix; not unique
division id    -- see division.dmp file

D2R mapping leaves ids as integers:

row    tax_id    parent tax_id
r1     670904    562
r2     745156    562
r3     866768    562

After remodeling you get the right meaning: these are subclasses and the structure is tree shaped:

ncbitax:562
    ncbitax:670904
    ncbitax:745156
    ncbitax:866768
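
The remodeling step can be sketched in a few lines. The talk used Perl; the Python below is an illustrative equivalent, not the author's script. It assumes the public NCBI dump layout, where fields are separated by tab-bar-tab and each line ends with tab-bar. The expansion of the `ncbitax:` prefix to `http://example.org/ncbitax/` is an assumption, since the slides show only the prefix.

```python
from typing import Iterable, Iterator

# ASSUMED expansion of the "ncbitax:" prefix; the slides show only the prefix.
NCBITAX = "http://example.org/ncbitax/"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


def nodes_to_ntriples(lines: Iterable[str]) -> Iterator[str]:
    """Yield one rdfs:subClassOf N-Triples statement per nodes.dmp row."""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\t|"):
            line = line[:-2]  # drop the end-of-row marker
        fields = line.split("\t|\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        tax_id, parent_id = fields[0].strip(), fields[1].strip()
        if tax_id == parent_id:
            continue  # the root node lists itself as its own parent
        yield f"<{NCBITAX}{tax_id}> <{SUBCLASS_OF}> <{NCBITAX}{parent_id}> ."


if __name__ == "__main__":
    # Rows trimmed to the first two columns for brevity.
    sample = ["670904\t|\t562\t|\n", "745156\t|\t562\t|\n", "866768\t|\t562\t|\n"]
    for statement in nodes_to_ntriples(sample):
        print(statement)
```

Because the parent link is emitted as `rdfs:subClassOf` between URIs rather than as an integer column, a triplestore can follow the tree without knowing anything about the original table layout.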


Text Extraction and Triples


The first task in text extraction is to identify entities (e.g. people, places,
things, events)


Good for document characterization, document matching, categorization.


Natural language processing can go much further by:



Tagging each term with its part of speech


Using the part-of-speech tags to extract ‘subject-verb-object’ triples


These triples mirror the triple structure of semantic data


Use controlled vocabularies and ontologies to manage entities and relations


Example:

“Tamoxifen has been shown in vitro to inhibit protein kinase C through estrogen receptor-independent antineoplastic effects.”


tamoxifen --inhibits--> protein kinase C

Subject (tamoxifen): urn:nlm.nih.gov:UMLS/CUI/C0039286

Predicate (inhibits): urn:nlm.nih.gov:semmed/relation/inhibits

Object (protein kinase C): urn:nlm.nih.gov:UMLS/CUI/C0033634
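
The last step, mapping an extracted subject-verb-object phrase onto controlled identifiers, can be sketched as a pair of lookups. The identifiers below are the ones on this slide. The function name `to_triple` and the normalization of "inhibit" to the "inhibits" relation are assumptions made for illustration. This is not how the Semantic Medline extraction pipeline itself is implemented.

```python
from typing import Dict, Optional, Tuple

# Identifiers copied from the slide.
CONCEPTS: Dict[str, str] = {
    "tamoxifen": "urn:nlm.nih.gov:UMLS/CUI/C0039286",
    "protein kinase c": "urn:nlm.nih.gov:UMLS/CUI/C0033634",
}

# ASSUMED normalization: both verb forms map to the one relation on the slide.
RELATIONS: Dict[str, str] = {
    "inhibit": "urn:nlm.nih.gov:semmed/relation/inhibits",
    "inhibits": "urn:nlm.nih.gov:semmed/relation/inhibits",
}


def to_triple(subject: str, verb: str, obj: str) -> Optional[Tuple[str, str, str]]:
    """Map an extracted subject-verb-object phrase to a URI triple.

    Returns None when any part is missing from the controlled vocabulary,
    so unrecognized text never produces a 'fact'.
    """
    s = CONCEPTS.get(subject.strip().lower())
    p = RELATIONS.get(verb.strip().lower())
    o = CONCEPTS.get(obj.strip().lower())
    if s is None or p is None or o is None:
        return None
    return (s, p, o)


if __name__ == "__main__":
    print(to_triple("Tamoxifen", "inhibit", "protein kinase C"))
```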


Semantic Medline


The National Library of Medicine hosts a website that contains over 22M
citations from the biomedical literature (PubMed).


Even though they are only titles and abstracts, there is a lot of knowledge in them


But the site only provides access to the citations by ‘search’


NLM scientists (Rindflesch, Shin, et al.) built a web-app for exploring high-confidence ‘facts’ extracted from PubMed citations (Semantic Medline)


The ‘facts’ are represented most naturally as a graph


Without a high-performance triplestore server, they currently use a relational database (MySQL) to store the facts


We think the Cray XMT has potential to support a graph database as a
replacement for MySQL.


We proposed to port Semantic Medline to the Noblis XMT2


Cray has provided a Beta version triplestore server named uRiKA


It provides a SPARQL endpoint (analogous to a SQL connector for MySQL)
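
For readers who have not used a SPARQL endpoint, the sketch below shows what a client call looks like under the standard SPARQL protocol: the query travels as a `query` parameter of an HTTP GET and the results come back as JSON. The endpoint URL is hypothetical, and nothing here is specific to uRiKA, whose actual address and configuration the talk does not give. The query asks what tamoxifen inhibits, using the identifiers from the text-extraction example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:3030/sparql"  # hypothetical endpoint URL

QUERY = """
SELECT ?object WHERE {
  <urn:nlm.nih.gov:UMLS/CUI/C0039286>
      <urn:nlm.nih.gov:semmed/relation/inhibits> ?object .
}
LIMIT 10
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(endpoint: str, query: str) -> list:
    """Send the query and return the list of result bindings."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_query(ENDPOINT, QUERY)` needs a live endpoint; `build_request` can be inspected without one.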


First let’s look at Semantic Medline’s functionality…


A Network Presentation of Biomedical Facts


Porting the Data to the Noblis XMT2


The Computing Environment:


4TB of shared memory


128 cores, each capable of running 128 independent threads (16384 threads)


Maximum recommended size: 20 billion triples (occupies 2TB, but uRiKA uses the remaining 2TB as scratch space)


uRiKA provides a SPARQL endpoint as well as a web client a user can interact with directly.


‘Service nodes’ are Linux machines separate from the ‘compute nodes’ and there
is a communication latency between them that must be managed



Phase 1: Naïve triplification to test uRiKA as a triplestore server


Converted a key Semantic Medline MySQL table into triples (similar to D2R)


Included UMLS concepts (6 M) and instances of relations (21 M)


NCBI taxonomy (~1M taxa)


http://www.geonames
.


Modified Semantic Medline code to issue SPARQL queries to uRiKA



Initial Observations


Initial Results for the Beta version of uRiKA


uRiKA processes complex SPARQL 1.0 queries properly


uRiKA is set up to cut through very complex queries that would stymie an RDB


But we were issuing trivial queries (lots of them) and it is not tuned for that kind of usage, so the aggregate response time was too slow for an acceptable user experience of the Semantic Medline webapp

Collaborators’ experiences with an alternative software library (Speed-MT) from Sandia Labs show faster results and we are looking at this library to see if it can be used in place of uRiKA or as an adjunct to it


uRiKA shows that the XMT2 can support web services


Cray will release an improved version of uRiKA in approximately 6 months


We believe there are many other services that could coexist on the same
machine.
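
The observation about many trivial queries suggests one generic mitigation, which the talk does not report trying: fold several single-subject lookups into one query so the endpoint is hit once. The Python sketch below builds such a query using only SPARQL 1.0 features (a `FILTER` with `||`). The function name `batched_lookup_query` is hypothetical, and whether this would actually help on uRiKA is untested.

```python
from typing import Iterable


def batched_lookup_query(subjects: Iterable[str], predicate: str) -> str:
    """Build one SPARQL 1.0 query that covers many single-subject lookups."""
    subject_list = list(subjects)
    if not subject_list:
        raise ValueError("need at least one subject URI")
    # One equality test per subject, joined with logical OR.
    conditions = " || ".join(f"?s = <{uri}>" for uri in subject_list)
    return (
        "SELECT ?s ?o WHERE {\n"
        f"  ?s <{predicate}> ?o .\n"
        f"  FILTER ({conditions})\n"
        "}"
    )


if __name__ == "__main__":
    print(batched_lookup_query(
        ["urn:nlm.nih.gov:UMLS/CUI/C0039286", "urn:nlm.nih.gov:UMLS/CUI/C0033634"],
        "urn:nlm.nih.gov:semmed/relation/inhibits",
    ))
```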


The XMT2 Supporting Multiple Services

XMT rest bridge

A separate system called MeGraphs does the following:


Maintains a directory of graphs resident in memory


Provides an engine that can run different algorithms on
a chosen graph in the directory


Supports job queueing


Provides an API for building client applications


The only thing missing, in our opinion, is a way for external
processes to access the engine.

We are currently experimenting with:


Developing a client for MeGraphs that receives requests from the outside world and acts as a general REST service


Building custom services that support graph data that
goes beyond ordinary triples


A key focus is on responsiveness of the services


Because of the XMT2 architecture, process initiation
can be time consuming, so the goal will be to keep the
data ‘live’ in shared memory and be sure that each
service has a memory map of the data relevant to it, so
that it can respond as quickly to requests as possible.
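
The pattern described above (a directory of resident graphs, an engine that runs a chosen algorithm on a chosen graph, and a front end that accepts outside requests) can be sketched in plain Python. This is a hypothetical stand-in to show the shape of the idea, not the MeGraphs API; every class, method, and field name below is invented.

```python
from typing import Callable, Dict, List

Graph = Dict[str, List[str]]                 # adjacency list: node -> neighbours
Algorithm = Callable[[Graph, dict], object]  # takes a graph and request params


class GraphService:
    """Hypothetical stand-in for a graph directory plus engine (NOT MeGraphs)."""

    def __init__(self) -> None:
        self._graphs: Dict[str, Graph] = {}        # graphs kept live in memory
        self._algorithms: Dict[str, Algorithm] = {}

    def load_graph(self, name: str, graph: Graph) -> None:
        """Load once; later requests reuse the resident copy."""
        self._graphs[name] = graph

    def register(self, name: str, algorithm: Algorithm) -> None:
        self._algorithms[name] = algorithm

    def list_graphs(self) -> List[str]:
        return sorted(self._graphs)

    def handle(self, request: dict) -> dict:
        """Dispatch one request, e.g. the decoded body of a REST call."""
        graph = self._graphs.get(request.get("graph"))
        algorithm = self._algorithms.get(request.get("algorithm"))
        if graph is None or algorithm is None:
            return {"status": "error", "reason": "unknown graph or algorithm"}
        result = algorithm(graph, request.get("params", {}))
        return {"status": "ok", "result": result}


def out_degree(graph: Graph, params: dict) -> int:
    """Example algorithm: number of outgoing edges of one node."""
    return len(graph.get(params["node"], []))
```

The design choice that matters is that `load_graph` runs once and `handle` runs per request, so the cost of starting a process or loading data is not paid on every call.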


External Processes

Representing Data in Graph Form

Tabular Data: a table with columns Name, Addr, SSN and one row each for Joe, Sam, Moe, Pam, Zoe

Social networks: a “graph” has “nodes” and “edges”. Supports “link analysis” and degrees of separation. Edges may have weights representing strength or certainty. (Diagram nodes: Joe, Zoe, Moe, Sam, Pam)

Semantic graph has named relations with direction (diagram edge label: hasArmsSupplier). Permits much more sophisticated queries. Supports reasoning. <Moe> <hasDad> <Joe> is called a “triple” in the semantic world.

Enhanced semantic graphs with weighted edges (diagram edge label: hasArmsSupplier, P=0.8)
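
The three graph flavours on this slide can be held in one small structure. The Python sketch below stores weighted, named, directed edges, answers a "degrees of separation" question by ignoring names and direction, and answers a semantic question by using them. Only the triple <Moe> <hasDad> <Joe>, the relation name hasArmsSupplier, and the weight 0.8 come from the slide. The endpoints of the other edges and the `knows` relation are invented, because the diagram's connections did not survive extraction.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str, str, float]  # (subject, relation, object, weight)

EDGES: List[Edge] = [
    ("Moe", "hasDad", "Joe", 1.0),           # triple from the slide
    ("Joe", "hasArmsSupplier", "Sam", 0.8),  # name and weight from slide; endpoints invented
    ("Sam", "knows", "Pam", 1.0),            # invented
    ("Pam", "knows", "Zoe", 1.0),            # invented
]


def neighbours(edges: List[Edge]) -> Dict[str, Set[str]]:
    """Plain social-network view: drop relation names and direction."""
    adjacency: Dict[str, Set[str]] = {}
    for subject, _relation, obj, _weight in edges:
        adjacency.setdefault(subject, set()).add(obj)
        adjacency.setdefault(obj, set()).add(subject)
    return adjacency


def degrees_of_separation(edges: List[Edge], start: str, goal: str) -> Optional[int]:
    """Breadth-first search for the fewest hops between two nodes."""
    if start == goal:
        return 0
    adjacency = neighbours(edges)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt == goal:
                return distance + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None  # not connected


def objects_of(edges: List[Edge], subject: str, relation: str,
               min_weight: float = 0.0) -> List[str]:
    """Semantic view: follow one named, directed relation above a weight."""
    return [obj for s, r, obj, w in edges
            if s == subject and r == relation and w >= min_weight]
```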


Challenges


An important factor that makes the XMT2 outstanding for graph computation
is a very efficient internal representation of a graph.


The efficiency comes from packing a lot of information tightly into a special
data structure.


Deleting edges is relatively easy. Deleting nodes is more expensive but still
not prohibitive. Adding new edges between existing nodes is very costly.


This seems to imply that transactional applications are not a good fit, but we
are not convinced of that yet, and we plan to find out.


This is certainly true in the short term


Our goal is to build experimental services that support the basic “Create, Read, Update, Delete” (CRUD) operations with acceptable latency for webapps.


Analytical services are a good fit.


Dynamic fraud detection: build the graph in memory and as it is updated re-run analytical agents that look for the emergence of triggering conditions.


Entity resolution: as new attributes are assigned to entities, analytical agents
would check if certain thresholds are crossed.
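
The talk does not say which packed data structure the XMT2 software uses. As an illustration only, the Python sketch below assumes a compressed sparse row (CSR) layout, a common tightly packed graph representation, to show why the costs can be lopsided: an edge can be deleted by overwriting one slot, while inserting an edge shifts every later entry and updates the offsets.

```python
from typing import List, Tuple

TOMBSTONE = -1  # marks a deleted edge slot


class CSRGraph:
    """Toy CSR graph; an ASSUMED layout, not the XMT2's actual structure."""

    def __init__(self, num_nodes: int, edges: List[Tuple[int, int]]) -> None:
        # offsets[n] .. offsets[n + 1] delimit node n's slice of targets.
        self.offsets = [0] * (num_nodes + 1)
        for src, _dst in edges:
            self.offsets[src + 1] += 1
        for i in range(num_nodes):
            self.offsets[i + 1] += self.offsets[i]
        self.targets = [0] * len(edges)
        cursor = list(self.offsets[:-1])
        for src, dst in edges:
            self.targets[cursor[src]] = dst
            cursor[src] += 1

    def neighbours(self, node: int) -> List[int]:
        lo, hi = self.offsets[node], self.offsets[node + 1]
        return [t for t in self.targets[lo:hi] if t != TOMBSTONE]

    def delete_edge(self, src: int, dst: int) -> bool:
        """Cheap: overwrite one slot inside the source node's slice."""
        for i in range(self.offsets[src], self.offsets[src + 1]):
            if self.targets[i] == dst:
                self.targets[i] = TOMBSTONE
                return True
        return False

    def add_edge(self, src: int, dst: int) -> int:
        """Costly: shift later targets and bump later offsets.

        Returns how many array entries had to move or change.
        """
        position = self.offsets[src + 1]
        moved = len(self.targets) - position
        self.targets.insert(position, dst)
        for i in range(src + 1, len(self.offsets)):
            self.offsets[i] += 1
        return moved + (len(self.offsets) - (src + 1))
```

On a graph with billions of edges, the work in `add_edge` grows with the size of the whole structure, which is one plausible reading of why transactional workloads look like a poor fit in the short term.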


Conclusion


We believe that the XMT2 shows potential as a platform for providing
semantic services on large semantic data sets


Over the next 12 months we will build a variety of services and test their
utility and responsiveness


The internal graph representation is extensible in ways that could
simultaneously support the logical queries of SPARQL and analytical
methods that use other graph properties


This would enable us to tackle a much wider class of real world graph
problems and we will build a catalogue of these problems and describe how
this computational resource can be part of their solution