
Oct 19, 2013


Chemical Entity extraction using the chemicalize.org technology

Josef Scheiber

Novartis Pharma AG


NITAS/TMS

Where the story of this project started ...

Dreirosenbrücke

Novartis Campus

A day in October 2008

Some time around 7:45 in the morning ...

Vision for text mining

Integration of chemical and biological knowledge

Mining for Chemical Knowledge - Rationale

- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects

Connection table

Mining for Chemical Knowledge - Rationale

Information on compounds targeting GPCRs

1988: 9 journal articles
1992: 256 articles & 34 patents
2005: >14,000 publications

Information explosion

HELP

Source: Banville, Debra L. "Mining chemical structural information from the drug literature." Drug Discovery Today, Number 1/2, Jan. 2006, pp. 35-42

Example: Project Prospect

Royal Society of Chemistry

Enhancing Journal Articles with Chemical Features

This helps you identify other articles that discuss the same molecule

Mining for Chemical Knowledge - Focus for today

- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects

Connection table

A use case for successful patent mining

(molecules you sometimes find in your inbox ;-) )

Vardenafil (2003, Bayer)
1.24 billion (USD 1.6 billion)

Sildenafil (1998, Pfizer)
11.7 billion (USD 15.1 billion)

Slide inspired by an example from Steve Boyer/IBM; sales data from the Prous Integrity database

Conventional Database Building

Facts - current standard

... (ACS) owes most of its wealth to its two 'information services' divisions: the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature. Together, in 2004, these divisions made about $340 million (82% of the society's revenue) and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily ...

Source: ACS homepage

Facts

Established application
Straightforward use
De facto gold standard
Unique data source

Very costly
No structure export at a reasonable price
Very limited in large-scale follow-up analysis
Most recent patents not available

Not data (search), but integration, analysis and insight, leading to decisions and discovery

Now - What would be the perfect solution?

All patent offices require that all claimed structures be provided in a machine-readable version, available for one-click download


Text extraction

Definition:

Extract all molecules that are mentioned in a patent text of interest, convert them to structures and make them available in machine-readable format
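
The definition above can be illustrated with a minimal sketch. It is not the chemicalize.org method: it assumes a tiny, hand-written name-to-SMILES lookup table (`NAME_TO_SMILES`, hypothetical) in place of a real name-to-structure converter, and it only shows the overall shape of "find names in text, emit machine-readable structures".

```python
import re

# Toy name-to-SMILES dictionary (illustrative assumption only; a real system
# would use a name-to-structure converter instead of a fixed lookup table).
NAME_TO_SMILES = {
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
    "ethanol": "CCO",
}

def extract_structures(text):
    """Return (name, SMILES, start offset) for each dictionary name found in text."""
    hits = []
    for name, smiles in NAME_TO_SMILES.items():
        # Whole-word, case-insensitive match of the chemical name.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((name, smiles, match.start()))
    # Report hits in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[2])

sample = "The residue was dissolved in ethanol and washed with acetic acid."
for name, smiles, offset in extract_structures(sample):
    print(f"{offset:3d}  {name:12s} {smiles}")
```

A fixed dictionary cannot handle systematic IUPAC names it has never seen, which is why the extractors and converters listed on the following slide are needed in practice.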

Mining for Chemical Knowledge

Technologies from providers

Text entity recognition

(a) Extractors (IUPAC names)
- TEMIS Chemical Entity Relationships Skill Cartridge
- Accelrys Pipeline Pilot extractor (Notiora)
- Fraunhofer (ProMiner Chemistry)
- Chemaxon (chemicalize.org)
- Oscar (Corbett, Murray-Rust et al.)
- SureChem
- IBM ChemFrag Annotator

(b) Converter (Names to connection table)
- CambridgeSoft name=struct
- OpenEye Lexichem
- Chemaxon

Image recognition
- OSRA (NIH)
- Clide Pro (Keymodule Ltd.)
- Fraunhofer chemoCR
- ChemReader

The objective

To offer NIBR scientists a tool with sophisticated text analysis methods that leverages the methods of TMS

Mining for Chemical Knowledge - Novartis Tools

The chemicalize technology is working under the hood!

Clipboard Analysis

Patent text
Identified structures
View structure onMouseOver
Export to other applications

Mining for Knowledge - Novartis Tools

Input example: J Med Chem Paper

Mining for Chemical Knowledge - Use Case

Medicinal Chemist wants to synthesize competitor compound as tool compound for own project

Identification of core scaffold
Analysis of substitution patterns

This enables the identification of compounds most representative for a competitor patent
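
As a rough sketch of the use case above, the snippet below picks the most frequent core scaffold and, within it, the most frequent substituent. Everything in it is an assumption for illustration: the compound records, the scaffold labels and the single R1 position are hypothetical and pre-annotated, whereas a real workflow would derive scaffolds and substituents from the extracted structures with a cheminformatics toolkit.

```python
from collections import Counter

# Hypothetical, pre-annotated extraction results: each compound is given as
# (compound id, core scaffold label, substituent at position R1).
compounds = [
    ("cpd-1", "scaffold-A", "methyl"),
    ("cpd-2", "scaffold-A", "ethyl"),
    ("cpd-3", "scaffold-A", "methyl"),
    ("cpd-4", "scaffold-B", "phenyl"),
    ("cpd-5", "scaffold-A", "methyl"),
]

def most_representative(records):
    """Return the most common scaffold, its most common R1, and matching ids."""
    # Step 1: identify the core scaffold shared by most compounds.
    scaffold_counts = Counter(scaffold for _, scaffold, _ in records)
    core, _ = scaffold_counts.most_common(1)[0]
    # Step 2: analyse the substitution pattern within that scaffold.
    r1_counts = Counter(r1 for _, scaffold, r1 in records if scaffold == core)
    top_r1, _ = r1_counts.most_common(1)[0]
    # Step 3: list the compounds that carry both.
    ids = [cid for cid, scaffold, r1 in records if scaffold == core and r1 == top_r1]
    return core, top_r1, ids

print(most_representative(compounds))
```

Frequency is only one possible notion of "representative"; it is used here because it needs nothing beyond counting.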

Example - A text-based patent

Automated text extraction: 452 compounds
Reference: 636 compounds
71%
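
The 71% figure is consistent with dividing the number of extracted compounds by the number of reference compounds. The short check below assumes that reading, i.e. that every extracted compound also appears in the reference set, which the slide does not state explicitly.

```python
def recall_percent(found, reference):
    """Share of reference compounds that were recovered, in percent."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return 100.0 * found / reference

# Numbers from the text-based patent example: 452 extracted, 636 in reference.
print(f"{recall_percent(452, 636):.0f}%")  # prints "71%"
```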

A patent example

Example - An image-based patent

Text extraction is not suitable for this case: it finds only a meager 40 molecules, versus 1,129 in the reference

Why?

An entirely image-based patent example

Language issues

- e.g. Japanese patents

Encountered problems

- OCR (Optical Character Recognition)!!
- USPTO and WIPO are now available full text in most cases
- Typos!
- Name2Struct problems (less an issue here)
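
One common way to soften OCR errors and typos is fuzzy matching against a list of known names. The sketch below uses Python's standard `difflib` for this; the vocabulary `KNOWN_NAMES` and the 0.8 cutoff are illustrative assumptions, and nothing here reflects how the tools named in this talk actually handle the problem.

```python
import difflib

# Hypothetical vocabulary of known chemical names (illustrative only).
KNOWN_NAMES = ["benzene", "ethanol", "toluene", "pyridine"]

def correct_name(token, cutoff=0.8):
    """Return the closest known name for a possibly misspelled token, or None."""
    matches = difflib.get_close_matches(token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# "benzcne" and "ethano1" mimic typical OCR confusions (e/c and l/1).
for token in ["benzcne", "ethano1", "water"]:
    print(token, "->", correct_name(token))
```

A high cutoff keeps unrelated words such as "water" from being forced onto a chemical name, at the cost of missing heavily garbled tokens.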

IBM initiative

Patent Mining / ChemVerse database (Steve Boyer)

- The objective is to automatically extract all molecules from all available patents and make them searchable in a database
- They leverage cloud computing and have access to all full-text patents
- This is going in absolutely the right direction
- They annotate the molecules with information from freely available databases

Future ideas: Patent Analysis

- Markush translation, Image+Target
- Ranking capabilities of outcome for user
- "Blurred" dictionaries for translating generic terms like aryl, cycloalkyl etc.
- Select, annotate as entity, on-the-fly error correction
- Result goes in a database
- Crowdsourcing efforts to improve and store results
- Suggest functionality

To enable true Patinformatics analyses ...

Definition by Tony Trippe:

Acknowledgements

Alex Fromm
Katia Vella
Olivier Kreim

Therese Vachon
Daniel Cronenberger
Pierre Parisot
Martin Romacker
Nicolas Grandjean

NITAS/TMS

Clayton Springer
Naeem Yusuff
Bharat Lagu

And many other people in different divisions of NIBR for their support