Integrating Text Mining into Bio-Informatics Workflows

ISMB Demo; June 27, 2005

Neil Davis

George Demetriou

Robert Gaizauskas

Yikun Guo

Ian Roberts

Henk Harkema

Natural Language Processing Group

Department of Computer Science

University of Sheffield

Sheffield, UK

2

Overview


Demonstration scenario: to show the use of text mining
techniques to support biomedical researchers investigating
the genetic basis of human disorders


Case study: Williams-Beuren syndrome


Overview of presentation & demonstration:


Background on Williams-Beuren syndrome


Architecture of system


Text mining


Workflows


User interface


Demonstration of system

3

Context


MyGrid


University of Manchester, University of Newcastle, University of
Nottingham, University of Sheffield, University of Southampton,
IT Innovation Centre, European Bioinformatics Institute


CLEF


University of Manchester, University College London, Royal
Marsden Hospital, University of Cambridge, University of
Sheffield, Open University


4

WBS: Clinician’s View


WBS was first clinically described in 1961 before genetic
screening was available


WBS presents multiple but highly variable symptoms,
including:


Congenital heart disorders


Elfin face


Mental retardation with relatively spared language skills


Growth retardation


Dental malformations


Infantile hypercalcemia


Due to the variable underlying genetic basis of WBS,
the symptoms that WBS patients present are notoriously
variable

5

WBS: Geneticist’s View


WBS is caused by a variable (typically 1.54-1.88 Mb)
deletion from 7q11.23


The deleted region (termed the Williams-Beuren Critical
Region or WBSCR) spans multiple genes


The complex genotype (multiple genes may be deleted)
leads to a complex phenotype (multiple symptoms may
be present)

6

Research Process


Multi-step and iterative:


Step 1: Sequence the section of the genome of interest


Step 2: Scan the sequence for putative genes


Step 3: BLAST the putative gene sequence against a database
of known genes to identify homologues whose function may
be known


Step 4: Annotate the putative gene sequence with the data
associated with homologous sequences


Repeat as new data and sequences become available


Text Mining techniques can facilitate Step 4


Navigating biomedical literature to find papers containing
information about homologous genes etc.

7

Workflows




8

Text Mining


Uncovering the information content of unstructured or
semi-structured textual data sources in an automatic way


Includes research areas such as information extraction (IE),
information retrieval (IR), natural language processing
(NLP), knowledge discovery from databases (KDD)


Relevance to biomedical informatics


Textual biomedical data sources contain valuable information,
but volume is so large and growing so fast that it is difficult for
researchers to find relevant information


Some information is available in textual form only,
e.g., clinical records

9

Text Mining Workflow


Workflow: computational model for
processes that require repeated
execution of a series of analytical
tasks



BLAST reports provide links to
abstracts in the literature


Use MeSH terms to find related
papers


Show retrieved papers to end user
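
The first step of this workflow, pulling literature links out of a sequence database record, can be sketched as follows. This is an illustration only, not the demonstrator's code: it assumes the input is a Swiss-Prot flat-file entry whose reference cross-reference (RX) lines carry `PubMed=<id>;` fields. The function name `extract_pubmed_ids` and the sample record are invented for the example.

```python
import re

# Matches "PubMed=12345;" fields on Swiss-Prot RX (reference cross-reference) lines.
PUBMED_FIELD = re.compile(r"PubMed=(\d+);")


def extract_pubmed_ids(record_text):
    """Return the PubMed ids cited in a flat-file record, in order, without duplicates."""
    seen = []
    for line in record_text.splitlines():
        if not line.startswith("RX"):
            continue  # only RX lines carry bibliographic cross-references
        for pmid in PUBMED_FIELD.findall(line):
            if pmid not in seen:
                seen.append(pmid)
    return seen


# Hypothetical, abbreviated record used only to exercise the function.
SAMPLE_RECORD = """\
ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
RN   [1]
RX   MEDLINE=00000001; PubMed=1111111;
RN   [2]
RX   PubMed=2222222; DOI=10.0000/example;
RN   [3]
RX   PubMed=1111111;
"""

if __name__ == "__main__":
    print(extract_pubmed_ids(SAMPLE_RECORD))
```

The resulting ids would then be handed to the later steps (fetching abstracts, looking up MeSH terms, finding related papers).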

10

Architecture of System

[Figure: system architecture diagram]

Components: User Client; Workflow Server; Medline Server

Workflow Server (Workflow Enactment): Initial Workflow; Extract PubMed Id; Get Medline Abstract; Cluster Abstracts; Get Related Abstracts

Medline Server: Medline, pre-processed offline to extract biomedical terms + indexed

Labels on connections: Swissprot/Blast record; Workflow definition + parameters; Clustered PubMed Ids + titles; PubMed Ids; Term-annotated Medline abstracts; Medline Abstracts

11

Text Collection Server

[Figure: system architecture diagram, repeated from the Architecture of System slide]

12

Text Collection Server


Text collection is MEDLINE (www.ncbi.nlm.nih.gov)


More than 14 million abstracts, dating from the 1950s onwards


Largest repository of biomedical abstracts


Copies made available for research, updated continually


Records contain semi-structured information annotated in XML


Unique id


PubMed id


Citation information


author(s), journal, year, etc.


Manually assigned controlled vocabulary keywords (MeSH terms)


Text of abstract


13

Text Collection Server


Local copy


Loaded in MySQL, indexed on various fields, e.g. MeSH terms


Text portion indexed for search engines (Lucene, Madcow)


Text preprocessed with text mining tools (AMBIT & Termino)


Indexes built for term classes (proteins, genes, diseases, etc.)


Server accepts web service calls to, e.g.


Return text of abstract given a PubMed id


Return MeSH terms of abstracts given PubMed ids


Return PubMed ids of abstracts with given MeSH terms


Return PubMed ids of abstracts matching a free text query


Return PubMed ids of abstracts containing a specific term
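
The lookups listed above can be pictured with a small in-memory stand-in. This is only a sketch of the kinds of operations involved, not the server's actual interface: the class and method names are hypothetical, the MySQL and Lucene back ends are replaced by plain dictionaries, the MeSH lookup is assumed to require all given terms, and the sample records are made up.

```python
import re
from collections import defaultdict


def _words(text):
    """Lower-cased word set used for the naive free-text match."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class TextCollection:
    """Toy in-memory stand-in for the text collection server's lookups."""

    def __init__(self):
        self._abstracts = {}                 # PubMed id -> abstract text
        self._mesh = {}                      # PubMed id -> set of MeSH terms
        self._by_mesh = defaultdict(set)     # MeSH term -> PubMed ids
        self._by_term = defaultdict(set)     # recognized term -> PubMed ids

    def add(self, pmid, text, mesh_terms, terms=()):
        self._abstracts[pmid] = text
        self._mesh[pmid] = set(mesh_terms)
        for mesh_term in mesh_terms:
            self._by_mesh[mesh_term].add(pmid)
        for term in terms:
            self._by_term[term.lower()].add(pmid)

    def get_abstract(self, pmid):
        """Text of the abstract for one PubMed id."""
        return self._abstracts[pmid]

    def get_mesh_terms(self, pmids):
        """MeSH terms for each of the given PubMed ids."""
        return {pmid: sorted(self._mesh[pmid]) for pmid in pmids}

    def find_by_mesh(self, mesh_terms):
        """PubMed ids annotated with all of the given MeSH terms (assumed semantics)."""
        sets = [self._by_mesh.get(t, set()) for t in mesh_terms]
        return sorted(set.intersection(*sets)) if sets else []

    def search_text(self, query):
        """PubMed ids whose abstract contains every word of the query."""
        wanted = _words(query)
        return sorted(p for p, t in self._abstracts.items() if wanted <= _words(t))

    def find_by_term(self, term):
        """PubMed ids whose abstract was annotated with the given term."""
        return sorted(self._by_term.get(term.lower(), set()))


def build_demo_collection():
    """Two made-up records, for illustration only."""
    c = TextCollection()
    c.add("101", "Elastin deletion in Williams syndrome",
          ["Williams Syndrome", "Elastin"], terms=["elastin"])
    c.add("102", "Elastin and supravalvular aortic stenosis",
          ["Elastin", "Aortic Stenosis, Supravalvular"],
          terms=["elastin", "supravalvular aortic stenosis"])
    return c
```

In the demonstrator these operations are exposed as web service calls; the sketch only shows the shape of the requests and answers.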

14

Preprocessing / Text Mining


AMBIT


Lexical & terminological processing


Syntactic & semantic processing


Pattern recognition & discourse processing


Termino


Large-scale terminological resource to support term processing
for (biomedical) text processing applications


Efficient recognition and classification of terms in text through use
of finite state recognizers compiled from terminological database


Terms are associated with links to outside ontologies and other
terminological knowledge sources


Text Mining results saved as annotations on text
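
To give a feel for term recognition with a recognizer compiled from a terminology, here is a minimal sketch. It is not Termino's implementation: it assumes whitespace tokenization and a longest-match strategy, uses a token-level trie as a simple finite-state recognizer, and the function names and sample terms are invented.

```python
_ACCEPT = object()  # sentinel key marking an accepting state; cannot clash with a token


def compile_recognizer(term_classes):
    """Compile {term: class} into a token-level trie (a simple finite-state recognizer)."""
    root = {}
    for term, term_class in term_classes.items():
        node = root
        for token in term.lower().split():
            node = node.setdefault(token, {})
        node[_ACCEPT] = term_class
    return root


def recognize(text, recognizer):
    """Return (start, end, matched_text, class) tuples for longest matches, left to right.

    start and end are token offsets; end is exclusive.
    """
    tokens = text.split()
    matches = []
    i = 0
    while i < len(tokens):
        node = recognizer
        best = None
        j = i
        # Follow transitions as long as the next token is known in this state.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if _ACCEPT in node:
                best = (i, j, " ".join(tokens[i:j]), node[_ACCEPT])
        if best:
            matches.append(best)
            i = best[1]      # continue after the longest match
        else:
            i += 1
    return matches


if __name__ == "__main__":
    terms = {"elastin": "protein", "williams syndrome": "disease"}
    fsr = compile_recognizer(terms)
    for match in recognize("Deletion of elastin in Williams syndrome patients", fsr):
        print(match)
```

Each match could then be stored as an annotation on the text, with the class serving as the key into indexes for proteins, genes, diseases and so on.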

15

Workflow Server

[Figure: system architecture diagram, repeated from the Architecture of System slide]

16

Workflow Server


Workflow server runs the Freefluo enactment engine to
execute XScufl workflows (designed using Taverna)


WBS workflow:


17

Interface/Browsing Client

[Figure: system architecture diagram, repeated from the Architecture of System slide]

18

Interface/Browsing Client


Two components


Submit workflows for enactment


Explore results, find related documents, free text search


Explore results


Documents organized in tree derived from MeSH hierarchy
(or chromosome locations)


Links to outside databases containing more information about terms


Find related documents


Terms hyperlinked to same terms in other documents


Finding similar documents


Similarity measure based on MeSH terms


Similarity measure based on words in document


Free text search


Based on Lucene search engine
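
The slides do not say which similarity measures the client uses. As one plausible reading, the sketch below uses Jaccard overlap for MeSH term sets and cosine similarity over word counts for document text; both choices, and the function names, are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def mesh_similarity(terms_a, terms_b):
    """Jaccard overlap of two MeSH term sets: |A and B| / |A or B|."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def word_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    va = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    vb = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(target_terms, candidates):
    """Order candidate documents ({doc_id: mesh_terms}) by MeSH similarity to a target."""
    scored = [(mesh_similarity(target_terms, terms), doc_id)
              for doc_id, terms in candidates.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Ranking candidates by either score gives a "similar documents" list; a MeSH-based score reflects the indexers' judgement, while a word-based score also works for text without manual keywords.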

19

Interface/Browsing Client


Gridsphere Portal Framework is used for relaying workflow
requests to the Freefluo enactment engine


Text Mining Results viewer is implemented as a Java Swing
applet for enhanced functionality and easy inclusion
in portals


The applet can re-enact workflow requests via the portal
so that the user can further process document sets without
explicitly having to enact a new workflow

20

Interface/Browsing Client

[Screenshot: results viewer, with labelled areas: Abstract body; MeSH Tree; Abstract Titles; Free text search; Search scope restrictors; Linked terms; Get Related Abstracts]

21

Chromosome location


Extracting relationships between terms


Viewer can be used to show data organized according to
other trees, e.g., chromosome location, GO tree, etc.
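
Organizing documents under a tree can be sketched generically: any scheme whose categories have dotted hierarchical codes (MeSH tree numbers have this shape) can be turned into a nested structure. The code below is an assumption-laden illustration rather than the viewer's code; the function names and the sample codes attached to the documents are made up.

```python
def build_tree(doc_codes):
    """Nest documents under dotted hierarchical codes, e.g. 'C16.320' under 'C16'.

    doc_codes maps a document id to the list of codes assigned to it.
    """
    tree = {"children": {}, "docs": []}
    for doc_id, codes in doc_codes.items():
        for code in codes:
            node = tree
            for part in code.split("."):
                node = node["children"].setdefault(part, {"children": {}, "docs": []})
            node["docs"].append(doc_id)
    return tree


def render(node, indent=0, prefix=""):
    """Return the tree as indented text lines, one line per category."""
    lines = []
    for part in sorted(node["children"]):
        child = node["children"][part]
        label = f"{prefix}.{part}" if prefix else part
        docs = ", ".join(sorted(child["docs"]))
        lines.append("  " * indent + label + (f": {docs}" if docs else ""))
        lines.extend(render(child, indent + 1, label))
    return lines


if __name__ == "__main__":
    # Hypothetical assignment of codes to two documents.
    demo = {"101": ["C16.320"], "102": ["C16.320", "C14"]}
    print("\n".join(render(build_tree(demo))))
```

Swapping the codes for chromosome bands or GO identifiers with a hierarchical path would reuse the same structure, which is the point made on this slide.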

22

Further Information


Papers


N. Davis, G. Demetriou, R. Gaizauskas, Y. Guo, I. Roberts. In press.
Web Service Architectures for Text Mining: An Exploration of the
Issues via an E-Science Demonstrator.
In: International Journal of Web Services Research.


R. Gaizauskas, N. Davis, G. Demetriou, Y. Guo, I. Roberts. 2004.
Integrating Text Mining Services into Distributed Bioinformatics
Workflows: A Web Services Implementation.
In: Proceedings of the IEEE International Conference on Services
Computing (SCC 2004).


Contact


Neil Davis: n.davis@dcs.shef.ac.uk


Sheffield NLP website: http://www.nlp.shef.ac.uk/

23

More slides

24

Context: MyGrid


Objective:


To develop a comprehensive suite of middleware components
specifically to support data-intensive in silico
experiments in biology


Workflows and query specifications link together third party and local
resources using web service protocols


Sheffield’s contribution:


Provision of text mining capabilities to link experimental results to
the biomedical literature


Duration, funding, participants:


4 years, ending in June 2005


EPSRC-funded e-Science pilot project


Five UK universities, European Bioinformatics Institute, several
industrial partners (GSK, IBM)

25

Common WBS Deletions

SVAS = supravalvular aortic stenosis


26

Why Research WBS?


Without an understanding of the underlying causes of the
disease, only palliative care can be offered


Before any type of therapy can be developed, the pertinent
genes, interactions and expression pathways must all be
elucidated

27

Williams-Beuren Syndrome


Congenital disorder resulting in mental retardation, caused
by deletion of genetic material on chromosome 7


Area in which deletions occur not well characterised


better sequence information is becoming available


As new sequence information becomes available


gene finding software is run against it


BLAST is run against new putative genes to identify
homologues whose function may be known


BLAST reports provide links to abstracts in the literature

28

Why Automate?


Searching for associated papers is tedious
and time-consuming


The gene annotation pipeline is iterative, and automating its
time-consuming elements will free up researchers' time
for more specialist research


Automation allows easy collection of provenance and
replication of the research process
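
How automation makes provenance collection cheap can be shown with a toy workflow runner. This is not Freefluo or Taverna: it is a minimal sketch in which a workflow is assumed to be a linear list of named steps, and the runner records what each step received and produced. All names are hypothetical.

```python
import time


def run_workflow(steps, initial_input):
    """Run steps in order, feeding each output to the next step.

    steps is a list of (name, function) pairs. Returns (result, provenance),
    where provenance holds one record per executed step.
    """
    provenance = []
    data = initial_input
    for name, func in steps:
        started = time.time()
        output = func(data)
        provenance.append({
            "step": name,
            "input": data,
            "output": output,
            "started": started,
            "seconds": time.time() - started,
        })
        data = output
    return data, provenance


if __name__ == "__main__":
    # Made-up steps standing in for real services.
    demo_steps = [
        ("split_ids", lambda record: record.split()),
        ("count_ids", len),
    ]
    result, log = run_workflow(demo_steps, "1111111 2222222 3333333")
    print(result)
    for entry in log:
        print(entry["step"], "->", entry["output"])
```

Because every run leaves the same kind of record, a later run over new sequence data can be compared with, or replayed from, an earlier one.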

29

Architecture of System (2)


A 3-way division of labour is a sensible way to deliver distributed
text mining services


Providers of e-archives, such as Medline, will make archives
available via a web services interface


Cannot offer tailored services for every application


Will provide core, common services


Specialist workflow designers will add value to basic services
from archive to meet their organization’s needs


Users will prefer to execute predefined workflows via standard
light clients such as a browser


Architecture appropriate for many research areas, not just
bioinformatics

30

Text Mining Service Architecture


Data pre-processing and merging architecture:

[Figure: pre-processing diagram; surviving labels: AMBIT & Termino; MEDLINE abstracts]

31

Text Mining: Termino


Large-scale terminological resource to support term
processing for (biomedical) text processing applications


Uniform access to terminological information aggregated
across many sources, without the need for multiple,
source-specific terminological components


Immediate entry points into a variety of outside ontologies
and other knowledge sources, making this information
available to processing steps subsequent to term recognition


Efficient recognition of terms in text through use of finite
state recognizers compiled from contents of Termino


Lexical look-up service accessible via web service
(http://don.dcs.shef.ac.uk/termino)
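
The idea of uniform access to terms aggregated from several sources can be illustrated with a merged look-up table. The sketch below does not reflect Termino's schema or its web service; the source names, identifiers and function names are invented, and matching is assumed to be case-insensitive.

```python
from collections import defaultdict


def aggregate(sources):
    """Merge {source_name: {term: identifier}} into one case-insensitive index.

    The result maps a lower-cased term to a list of (source_name, identifier)
    pairs, so one look-up returns entry points into every source.
    """
    index = defaultdict(list)
    for source_name, entries in sources.items():
        for term, identifier in entries.items():
            index[term.lower()].append((source_name, identifier))
    return dict(index)


def look_up(index, term):
    """Return all (source_name, identifier) pairs known for a term."""
    return sorted(index.get(term.lower(), []))


if __name__ == "__main__":
    # Hypothetical sources and identifiers.
    demo_sources = {
        "source_a": {"Elastin": "A:001"},
        "source_b": {"elastin": "B:42", "LIMK1": "B:43"},
    }
    merged = aggregate(demo_sources)
    print(look_up(merged, "ELASTIN"))
```

A processing step that runs after term recognition needs only this one table, rather than a separate component per source.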

32

Workflow Server


Workflow server runs the Freefluo enactment engine to
execute XScufl workflows (designed using Taverna)


Graves’ disease workflow:


33

Example Project: CLEF


Clinical e-Science Framework


Objective:


To develop a high-quality, secure and interoperable information
repository, derived from operational electronic patient records, to enable
ethical and user-friendly access to patient information in support of
clinical care and biomedical research


Sheffield’s contribution:


Analyzing clinical narratives to extract medically relevant entities and
events, and their properties and relationships


Duration, funding, participants:


2003-2005 (CLEF), 2005-2007 (CLEF-S)


Funded by Medical Research Council (MRC)


Six universities, Royal Marsden Hospital, industrial partners engaged
through CLEF Industrial Forum Meetings