IBM Research An Inter-Corporate Collaboration on Computer Curation of Intellectual Property & the Scientific Literature


Nov 25, 2013 (3 years and 6 months ago)


SKBoyer@us.ibm.com

© 2011 IBM Research

Applying text & image analysis technology to better understand IP (patents) and the scientific literature: computer curation of the literature

Stephen Boyer, Ph.D.
Sboyer@us.ibm.com
408-858-5544


the challenges of today's researchers

What we are trying to accomplish

The Problem

All content and no discovery ?

The problem :

Gain a better understanding of IP (patents) and the Scientific Literature


The Question:

Can we use computers to "read" documents, identify critical entities, and perform meaningful associations that can help us with our work?


What we did :

1) Apply text analytics technology to analyze

Patents & the Scientific Literature

(>30 M IP documents & Medline abstracts)

2) Apply image analytics to IP documents

3) Explore how these technologies can be applied to foreign documents

(for example Chinese & Japanese patents)


The Value :

Provide new insights into chemical & biomedical information



(still a work in progress).



Collaborators

Corporate sponsors: IBM Research, Novartis, Pfizer, DuPont, Lilly, Boehringer Ingelheim, Roche / Genentech, AstraZeneca (AZ), Bristol-Myers Squibb (BMS)

Other informal collaborators / partners: NIH, University of Texas, EMBL-EBI, University of Dundee, UC Davis, ChemAxon, CambridgeSoft, Dalhousie, Univ. of New Mexico

A collaborative work in progress

[Figure: chemical structure diagram]

Bayer patented molecule: Vardenafil (Levitra)
Annual sales of ~$320 million
Late to market, found a similar molecule and gained share

Pfizer patented molecule: Sildenafil (Viagra)
Annual sales of >$1.7 billion
1st to market, but didn't patent (cover) the full chemical space

Why this is important !

[Figure: chemical structure diagram]

Chemistry:

1 Carbon, 1 Nitrogen, 1 double bond, 1 hydrogen


Business:

$1.7B in revenue

An opportunity loss of $320M

A revenue gain of $320M


What are the differences between these two molecules?

Example IP Challenge

[Figure: chemical structure diagrams]

Additional properties, relationships, new insights, new IP

How do I find entities from the docs?

How do I find entity relationships?

How do I exploit other information sources?

Sources: web, scientific & news; worldwide patents; Medline




a) (2P/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d, 1H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H), 1.49 (s, 9H).

b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1H-NMR (CD3OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H).

EXAMPLE 24

(2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester 0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by

Can you find the key molecules in an unstructured text, for example a scientific journal or patent?


Chemical nomenclature can be daunting



What is this compound ??







[Figure: chemical structure of the compound]

Identify the chemical names, then convert them to structures [chemical names -> structures]!


entity identification

Valium (Trade Name) = Diazepam (Generic Name) = CAS # 439-14-5 (Chemical ID #) =

ALBORAL, ALISEUM, ALUPRAM, AMIPROL, ANSIOLIN, ANSIOLISINA, APAURIN, APOZEPAM, ASSIVAL, ATENSINE, ATILEN, BIALZEPAM, CALMOCITENE, CALMPOSE, CERCINE, CEREGULART, CONDITION, DAP, DIACEPAN, DIAPAM, DIAZEMULS, DIAZEPAN, DIAZETARD, DIENPAX, DIPAM, DIPEZONA, DOMALIUM, DUKSEN, DUXEN, E-PAM, ERIDAN, EVACALM, FAUSTAN, FREUDAL, FRUSTAN, GIHITAN, HORIZON, KIATRIUM, LA-III, LEMBROL, LEVIUM, LIBERETAS, METHYL DIAZEPINONE, MOROSAN, NEUROLYTRIL, NOAN, NSC-77518, PACITRAN, PARANTEN, PAXATE, PAXEL, PLIDAN, QUETINIL, QUIATRIL, QUIEVITA, RELAMINAL, RELANIUM, RELAX, RENBORIN, RO 5-2807, S.A. R.L., SAROMET, SEDAPAM, SEDIPAM, SEDUKSEN, SEDUXEN, SERENACK, SERENAMIN, SERENZIN, SETONIL, SIBAZON, SONACON, STESOLID, STESOLIN, TENSOPAM, TRANIMUL, TRANQDYN, TRANQUASE, TRANQUIRIT, TRANQUO-TABLINEN, UMBRIUM, UNISEDIL, USEMPAX AP, VALEO, VALITRAN, VALRELEASE, VATRAN, VELIUM, VIVAL, VIVOL, WY-3467

Valium has > 149 names

Problem: I need to find information about Valium


nomenclature issues

There are many different chemical names for Valium

Valium = Diazepam = CAS # 439-14-5 =

7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE
7-CHLORO-1-METHYL-5-PHENYL-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-2-OXO-5-PHENYL-3H-1,4-BENZODIAZEPINE
1-METHYL-5-PHENYL-7-CHLORO-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE
7-CHLORO-1-METHYL-5-3H-1,4-BENZIODIAZEPIN-2(1H)-ONE
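The name-normalization idea on this slide can be sketched in a few lines: map every known synonym to one canonical identifier and sanity-check that identifier. The sketch below is only an illustration, not the system described in the deck. The `SYNONYMS` table is a hypothetical mini-dictionary built from a handful of the names listed above, and `normalize` and `cas_is_valid` are names invented for this example. The check-digit rule itself (weight the digits 1, 2, 3, ... from the right, then take the sum modulo 10) is the standard CAS Registry Number rule.

```python
import re

# Hypothetical mini-dictionary: a few of the synonyms listed above, all mapped
# to the CAS Registry Number quoted on the slide.
SYNONYMS = {
    "valium": "439-14-5",
    "diazepam": "439-14-5",
    "sedapam": "439-14-5",
    "diapam": "439-14-5",
    "7-chloro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one": "439-14-5",
}


def cas_is_valid(cas: str) -> bool:
    """Verify the CAS check digit: weight digits 1, 2, 3, ... from the right."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == int(match.group(3))


def normalize(name: str):
    """Map any known synonym to one canonical identifier, or None if unknown."""
    key = re.sub(r"\s+", "", name).lower()
    return SYNONYMS.get(key)
```

With this table, `normalize("Valium")` and `normalize(" DIAPAM")` both resolve to the same identifier, so a search can be run once against the identifier instead of once per name. A dictionary like this only covers names someone has already listed, which is the limitation the following slides address with structure-based search.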


entity identification

Problems of taxonomy & name normalization

The scientist simply wants information about Valium

Multiple documents contain information about Valium, so the scientist has to choose keywords (taxonomies & dictionaries) for each source.

Names in use: Valium; Diazepam; Sedapam; DIAPAM; 439-14-5 (Chemical ID); 7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE; 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

Sources: Medline; in-house database; Chem. Abstracts; Pereira notebook 23a; patent database

Considerations for searching documents (or web pages) for chemical substances

Chemicals have a wide variety of trivial and official names.

A text search for one name cannot find documents in which the chemical appears under one of its alternative names.

Synonym expansion is insufficient.

Searching by structure will find all such cases.

Source: J Cooper / IBM

Name normalization is important

Finding similar structures, not just text!

Further, we would like to find compounds which are supersets of the given structure, for example toluene and methylnaphthalene.

Text searches won't find documents with similar structures.

Source: J Cooper / IBM

Find documents with similar structures
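The toluene / methylnaphthalene example can be made concrete with a small substructure test. This is a minimal sketch under loud assumptions: molecules are hand-entered, hydrogen-suppressed graphs, bond orders and aromaticity are ignored, and `graph` and `is_substructure` are hypothetical helper names. A production system would use a cheminformatics toolkit rather than this brute-force backtracking search.

```python
def graph(elements, bonds):
    """Build a hydrogen-suppressed molecular graph: (elements, adjacency sets)."""
    adjacency = [set() for _ in elements]
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return list(elements), adjacency


def is_substructure(query, target):
    """Backtracking subgraph match: every query bond must exist in the target."""
    q_elem, q_adj = query
    t_elem, t_adj = target

    def extend(mapping):
        # mapping[i] is the target atom assigned to query atom i.
        if len(mapping) == len(q_elem):
            return True
        q = len(mapping)  # next query atom to place
        for t in range(len(t_elem)):
            if t in mapping or t_elem[t] != q_elem[q]:
                continue
            # Bonds from q to already-placed query atoms must be mirrored.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n < q):
                if extend(mapping + [t]):
                    return True
        return False

    return extend([])


# Six-membered carbon ring shared by all three example molecules.
RING = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
benzene = graph(["C"] * 6, RING)
toluene = graph(["C"] * 7, RING + [(0, 6)])
# Second ring fused at atoms 4 and 5, methyl group (atom 10) on atom 0.
methylnaphthalene = graph(
    ["C"] * 11,
    RING + [(4, 6), (6, 7), (7, 8), (8, 9), (9, 5), (0, 10)],
)
```

Here `is_substructure(toluene, methylnaphthalene)` succeeds because the toluene skeleton is contained in the larger molecule, while the reverse query fails. No keyword search over the two names could reveal that relationship.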


Applying text and image analytics to better understand IP (patents) & the scientific literature: computer curation of the literature


The Solution


The proposed solution

Patents contain molecular data in multiple forms:

As text: chemical names found in the text of documents

As bitmap images: pictures of chemicals found in the document images

As manually created chemical Complex Work Units (CWUs)


Text Analytics

The computer "reads" documents and attempts to determine domain-specific entities, for example chemical names, gene names, disease names, etc.

Let's start with text analysis …

Entity extraction

Step 1: Identify the chemical entities

Step 2: Extract chemical names and load into tables

Chemical entities extracted from page:

5-chloro-N-methyl-N-phthalimidoacetylanthranilic acid
N-aminoacetyl-5-chloro-N-methylanathranilic acid
Phosphorus pentachloride
aluminum chloride
hydrazine
7-chloro-1.3-dihydro-1-methyl-5-phenyl-2H-1,4-benzodiazepin-2-one
benzene

Step 3: Convert words to structures

Name -> Structure program

Name: 7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

Language-free entities (shown for benzene):

SMILES string: c1ccccc1

Connection table:
6 6 0 0 0 0 0 0 0 0999 V2000
6.7092 5.6087 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.7076 4.5056 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.6607 3.9551 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.6160 4.5062 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.6121 5.6136 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.6583 6.1591 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
3 4 2 0 0 0 0
4 5 1 0 0 0 0
5 6 2 0 0 0 0
6 1 1 0 0 0 0
M END

InChI: InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H

Convert the chemicals into machine-readable formats!
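To show why a connection table counts as machine readable, the sketch below parses the benzene table from this slide and derives its molecular formula. It rests on assumptions: real V2000 molfiles use fixed-width columns and a three-line header, whereas this toy parser splits on whitespace and expects the block to start at the counts line, as it does in the slide. `parse_ctab` and `formula` are hypothetical helper names, and the valence table covers carbon only.

```python
from collections import Counter

# The benzene connection table exactly as it appears on the slide.
MOL_BLOCK = """\
6 6 0 0 0 0 0 0 0 0999 V2000
6.7092 5.6087 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.7076 4.5056 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.6607 3.9551 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.6160 4.5062 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
8.6121 5.6136 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.6583 6.1591 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 2 0 0 0 0
2 3 1 0 0 0 0
3 4 2 0 0 0 0
4 5 1 0 0 0 0
5 6 2 0 0 0 0
6 1 1 0 0 0 0
M END
"""

VALENCE = {"C": 4}  # assumption: only carbon is needed for this example


def parse_ctab(block):
    """Return (element symbols, bonds) from a whitespace-separated table."""
    lines = [line.split() for line in block.strip().splitlines()]
    n_atoms, n_bonds = int(lines[0][0]), int(lines[0][1])
    atoms = [fields[3] for fields in lines[1:1 + n_atoms]]
    bonds = [(int(a), int(b), int(order))
             for a, b, order, *_ in lines[1 + n_atoms:1 + n_atoms + n_bonds]]
    return atoms, bonds


def formula(atoms, bonds):
    """Fill unused valences with implicit hydrogens and count the elements."""
    used = Counter()
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    counts = Counter(atoms)
    counts["H"] += sum(VALENCE[element] - used[index]
                       for index, element in enumerate(atoms, start=1))
    return "".join(f"{element}{counts[element]}" for element in sorted(counts))
```

Running `formula(*parse_ctab(MOL_BLOCK))` gives `C6H6`, the same formula that appears in the InChI string on the slide. The three notations describe one molecule.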

Step 4: Automate the process

Scale up & automate the process: Medline, patents, web pages, any text -> IBM servers -> HealthCare Life Science data warehouse (Valium, Benzene, ...)

11 million patent documents
18 million Medline abstracts
100 million chemical structures (>12 million unique)


Summary of overall text analysis operations for chemistry

Paper -> words
Words filtered against (the dictionary of the English language minus the dictionary of desired entities) -> chemical names (HMM, CRF, CFG)
Chemical names -> [Name=Structure] -> SMILES string -> 2D structure
Example: toluene / methyl benzene -> [CC1=CC=CC=C1] -> 2D structure with a CH3 group
Computational resources: Blue Gene enabled
Options to compute 300 properties per molecule

Overall process flow for text analysis
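One reading of the "English dictionary minus desired entities" box is a stop-list: any token that is not an ordinary English word is kept as a candidate chemical name. The sketch below illustrates that reading only. The word lists are tiny hypothetical stand-ins, `candidate_entities` is an invented name, and the deck's actual annotators (it mentions HMM, CRF and CFG methods) are certainly more sophisticated.

```python
# Hypothetical stand-in for a dictionary of the English language.
ENGLISH = {"the", "was", "dissolved", "in", "and", "added", "to", "of",
           "mixture", "benzene", "toluene"}
# Hypothetical dictionary of desired entities that happen to be English words.
DESIRED = {"benzene", "toluene"}

# Ordinary words to discard: English minus the entities we want to keep.
STOP_WORDS = ENGLISH - DESIRED


def candidate_entities(text):
    """Return tokens that survive the stop-list, in document order."""
    tokens = (token.strip(".,;:") for token in text.split())
    return [token for token in tokens
            if token and token.lower() not in STOP_WORDS
            and not token.isdigit()]


SENTENCE = ("The toluene was dissolved in benzene and added to "
            "7-chloro-1-methyl-5-phenyl-2H-1,4-benzodiazepin-2-one.")
```

For `SENTENCE`, the function keeps `toluene`, `benzene` and the systematic name, which would then be passed to the name-to-structure step. Splitting on whitespace is a deliberate simplification; names containing spaces (for example "aluminum chloride") would need a smarter tokenizer.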


Why use Blue Gene?

Find and compute the 3D structure of every molecule on every page of every patent (and Medline abstract)

Identify every protein (from our dictionary of >350K proteins) on every page of every patent (and Medline abstract)

Identify every disease (from our list of 14,500) on every page of every patent and map it to Medline MeSH codes

Identify the occurrence of every biomarker (from our dictionary of 485 biomarkers) on every page of every patent

……….your request goes here!

Equivalent to 240K simultaneous Google searches

Data warehouse: compute properties & find relationships

Examples: chemicals derived from text analytics

Examples of structures created via automated chemical annotation

Leading Causes of Annotator Problems *

Typical problems encountered when dealing with OCR text (in the examples, an underscore marks a stray space):

Improper spacing within the chemical name:
2-_(Bicyclo_[2.2._1]_hept-5-en-2-ylamino)_-5-_[2-_(4-chloro-3-methylphenoxy)_ethyl]-l,_3-_thiazol-4_(5H)-one

Run-on lists:
indane, 1,2,_3,4-tetrahydroquinoline, 3,_4-dihydro-2H-1,_4-benzoxazine, 1,5-naphthyridine, 1, 8-naphthyridine

Numbering of compounds:
Comparative Example 3, 2-bromo-4-(1, 3-dioxo-1, 3-dihydro-2H-isoindol-2-yl) butanoic acid 4-(1,3-dioxo-1,3-dihydro-2H-isoindol-2-yl) butanoic acid

Formatting issues:
2-[2-(bicyclo [2.2. 1] hept-5-en-2-ylamino)-4-oxo-4, 5-dihydro-1, 3-thiazol-5-yl]-N-<BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> (4-metlioxyphenyl)-N-methylacetamide

Missing or incorrect parenthesis:
5-(2-anilinoethyl)-2-[(2-cyclohex-1-en-1-ylethyl)amino}-1,3-thiazol-4(5H)-one

* using WO/2005/075471 as an example
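Two of the problems above, stray spaces and mismatched brackets, lend themselves to simple pre-checks before a name is handed to a name-to-structure converter. The following is a rough sketch, not the annotator's actual clean-up logic. `squeeze` and `brackets_balanced` are hypothetical helpers, and the spacing rules are heuristics that assume the input is a single chemical name rather than running text.

```python
import re

CLOSERS = {")": "(", "]": "[", "}": "{"}


def squeeze(name):
    """Remove whitespace that OCR inserted around locants and brackets."""
    name = re.sub(r"\s*-\s*", "-", name)               # "hept - 5" -> "hept-5"
    name = re.sub(r"([,.])\s+(?=\d)", r"\1", name)     # "4, 5" -> "4,5"
    name = re.sub(r"\s*([\[\](){}])\s*", r"\1", name)  # "bicyclo [" -> "bicyclo["
    return name


def brackets_balanced(name):
    """Check that (), [] and {} are properly nested and closed."""
    stack = []
    for char in name:
        if char in "([{":
            stack.append(char)
        elif char in CLOSERS:
            if not stack or stack.pop() != CLOSERS[char]:
                return False
    return not stack


SPACED = ("2-[2-(bicyclo [2.2. 1] hept-5-en-2-ylamino)-4-oxo-4, 5-dihydro-"
          "1, 3-thiazol-5-yl]")
BAD_BRACKET = ("5-(2-anilinoethyl)-2-[(2-cyclohex-1-en-1-ylethyl)amino}"
               "-1,3-thiazol-4(5H)-one")
```

`squeeze(SPACED)` closes the gaps in the ring descriptor and the locants, and `brackets_balanced(BAD_BRACKET)` flags the `[` that is closed by `}`. Character confusions such as "l" for "1" or "metlioxy" for "methoxy" are not addressed here and need dictionary or model support.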


OCR Errors: Compound Names

IRF Symposium 2007, Wolfgang Thielemann

Searching full-text patents (WO, EP, US, FR, GB, DE, JP) for the term "Simvastatin" yields 9030 patents (3666 INPADOC families).

But there are 392 more patents which are not found due to typos and OCR errors.


OCR Errors: Chemical Names

If you think that was bad... look at the IUPAC names (reproduced with their OCR errors):

WO2007096753: 6(R)-[2-(8'(S)-2",2"-dimethylbutyryloxy-2'(S),6'(R)-dimethyl-l',2',6',7,'8',8a'(R)-hexahydronapthyl-l'(S))-ethyl]-4(R)-hydroxy-3,4-5,6-tetrahydro-2H-pyran-2-one

WO2005095374: 6(R)-[2-[8(5)-(2,2-dimethyl.butyyloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)-hexahydro-l (S)-napthylelhyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one

WO2005095374: 6(R)-[2-[8(S)-(2, 2-dimethylbulyryloxy)-2 (S), 6 (R)-dimethyl-1, 2, 6, 7, 8, 8a(R)-hexabydro-l (S)-napthylethyl/-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one

WO2003018570: 6(R)-[2-[8(S)-(2,2 10 dimethylbutylyloxy)-2(S),6(R)-dimethyl-1,2, 6,7,8,8a(R) hexahydronaphthyl]-l(S)ethyl]-4(R)-hydroxy-3,4,5,6 tetra hydro-2H-pyrane-2-one

WO2003048149: 6(R)-[2-[8(S)-(2,2- dimethylbutylyloxy)-2(S),6(R)-dimethyl-1,2,6,7,8,8a(R)- hexahydronaphthyl]-l(S)ethyl]-4(R)-hydroxy-3,4,5,6 20 tetrahydro-2H-pyrane-2-on

WO2003018570: 6(R)-[2-[8(S)-(2,2-dimethylbutylyloxy)-2(S),6(R)-dimeth yl-1,2,6,7,8,8a(R)-hexahydronaphthyl]-l(S) ethyl]-hydrox y-3,4,5,6-tetrahydro-2H-pyrane-2-one

WO2005095374: 6(R)-[2-[8(S)-(2,2-dimethylbutyrylaxy)-2 (S),6 (R)-dimethyAl, 2, 6, 7, 8, 8a(R)-hexahydro-l (S)-napthylJethyl)-4(R)-hydroxy-3, 4, 5, 6-tetrahydro-2H-pyran-2 one

WO2006072963: 6(R)-{2[8(S)-(2,2dimethylbutyryloxy)2(5),6(R)..dimethyI..</p><p>1,2,6,7,8,8a(R)-hexahydro-1 (S)-naphthylJethy1J-4(R)hydroxy3,4,5, 6 tetrahydro-2H-pyran-2-one


Transposed Characters

Some errors cannot originate from an erroneous OCR process. Accidentally transposed characters are another source of variations:

ehtyl: 1565 patents
mehtyl: 840 patents
compuond: 231 patents
relaese: 44 patents
formual: 1689 patents
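Transpositions like these are cheap to detect: two words of equal length that differ only in one swapped pair of adjacent characters. The sketch below shows one way to do it; `is_transposition`, `correct` and the five-word `VOCABULARY` are hypothetical names and data chosen to mirror the examples on the slide, not part of the original work.

```python
def is_transposition(word, reference):
    """True if swapping one adjacent pair of characters gives the reference."""
    if len(word) != len(reference) or word == reference:
        return False
    diffs = [i for i, (a, b) in enumerate(zip(word, reference)) if a != b]
    return (len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and word[diffs[0]] == reference[diffs[1]]
            and word[diffs[1]] == reference[diffs[0]])


# Hypothetical vocabulary of correctly spelled terms.
VOCABULARY = ["ethyl", "methyl", "compound", "release", "formula"]


def correct(word):
    """Return the vocabulary word that `word` is a transposition of, if any."""
    for reference in VOCABULARY:
        if word == reference or is_transposition(word, reference):
            return reference
    return word
```

Applied to the slide's examples, `correct("ehtyl")` returns `ethyl` and `correct("formual")` returns `formula`, while unknown words pass through unchanged. Expanding a query with such variants is one way to recover the patents that an exact-match search misses.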

Chemical Name Annotation of US patents backfile (1976-2005) & US patent applications (2002-2005)

Preliminary results as of June 20, 2006

65,645,252 = # of molecules identified (total)*
3,623,248 = # of unique molecules
1,830,575 = # of molecules passing the Lipinski rules
363,993 = # of documents with possible 112 violations
17,122 = # of 2005 pre-grants w/ possible 112 violations

* All identified molecules were successfully converted to SMILES strings

Rule 112 Analysis

Analysis & Results

Molecules
TOTAL 65,645,252
UNIQUE 3,623,248
DRUG¹ 1,830,575

¹ Passing Lipinski's Rule of 5
http://en.wikipedia.org/wiki/Lipinski's_Rule_of_Five
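For readers unfamiliar with the filter behind the "DRUG" count, Lipinski's Rule of Five checks four computed properties against fixed thresholds. The sketch below assumes the descriptors have already been calculated by some other tool. The `Descriptors` class, the `max_violations` default and the two example molecules with their numbers are all hypothetical, and the deck does not say which pass criterion was used.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float      # molecular weight in daltons
    log_p: float           # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d):
    """Count how many of the four Rule of Five thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes(d, max_violations=1):
    """Assumed convention: tolerate at most `max_violations` exceedances."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values, for illustration only.
small_molecule = Descriptors(285.0, 2.8, 0, 3)
large_molecule = Descriptors(812.0, 6.4, 7, 12)
```

With these made-up values, `small_molecule` has no violations and passes, while `large_molecule` exceeds all four thresholds and fails.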


IBM's Research Collaboration on Computer Curation

Automated Text & Image Analysis!

Data: patents, Medline, Edgar, web

Annotation Factory (Blue Gene, "UIMA") annotators: chemicals, biomarkers, genes, proteins, cell lines, cell types, people, institutions, diseases, symptoms, other

Attributes: journals, entities, relationships

Data Warehouse

Search and analysis: full-text, chemical structures, co-occurrence, Lipinski rules, Section 112, trends, molecular networks & time lines

Post-processing with Scitegic Pipeline Pilot and other partner tools

What about processing image data?

We can also analyze the image content of patents & journals

IBM pioneered a process for converting images of chemical structures into Mol files (machine-readable representations of chemical structures…)

Image entity recognition

Seminal paper on converting chemical images into MOL files: Optical Recognition of Chemical Structures (OROCS)

Stages: Scan, Separate, Vectorize, Segment, Cleanup, OCR, Structure Recognition, Aggregation, Post Process

Example output: O=C(CN1C2(C3=CC=CC=C3)OC(C)=CC1=O)N(C)C4=C2C=C(Cl)C=C4

Optical recognition of chemical structures (OROCS): how it works

Extract the images from the page -> isolate the chemical images -> OCR the chemical image -> SMILES string

Optimization of the image processing process

Pre-processing of the images makes a significant difference

This shows the selective extraction of image data from within the patent (individual images)

Source: Dr John Kinney

Examples: results from OCR of chemical images

[Figure columns: image extracted from the page; structure generated from the image; SMILES string generated from the image]

Chemical derived from OCR of image data

Learning from the Exceptions

Radicals, polymers, organometallics

Name lookup table differences, e.g. "formal": Malathion, aka Formal; Dimethyl formal, aka Formal

Structure conventions differ, e.g. CH3MgBr vs. CH3Mg+.Br-

Ionization state/stereochemistry

Internal error corrections

Some names are incomplete and therefore ambiguous! Vinyl Toluene or 3-mercaptopropyl-dimethoxymethylsilane: often tagged as ambiguous. Where do the punctuation marks belong?

Differences of opinion

Structures from Images

Image-to-structure software is very effective on clean, crisp images

Like text, image quality in documents varies greatly!

Improper structure assignments are common

Structure Recognition Process

Clipped images from documents are used. Processing of full-page images is slow and gives many errors.

OSRA (NIH) is run to produce SDFile output

A PipelinePilot protocol is used to analyze and filter the resulting structure set.

Criteria for filtering invalid structures:

Presence of non-element atoms, R, X, etc.

Inappropriate internal coordinates (bond length and angles) of the 2D representation.

Over-assigned stereochemistry can be corrected rather than removing the entire structure


Examples of common errors in translation

[Table columns: example structure (image); error; filter rule]

Error: Double bond interpreted as two single bonds
Filter rule: The minimum bond distance where neither atom is hydrogen is required to be greater than 0.85 Å.

Error: Aromatic bond interpreted as exocyclic bond from ring
Filter rule: The minimum bond angle from an exocyclic terminal atom to the ring atoms was required to be greater than 50°.

Error: Atom found in center of single bond
Filter rule: The maximum bond angle of a carbon with exactly two single bonds was required to be less than 155°.

Error: Single bond divided into two single bonds
Filter rule: The minimum bond angle which includes any terminal atom was required to be greater than 10°.
Conversion Statistics

20,081 patents with 487,537 clip files

[Chart categories: Non-standard atoms; Min Bond Length; Min Exocycle Angle; Min Bond Angle; Max Carbon Bond Angle; Bad Stereo; Confidence Cutoff; Clean Molecules]

35% clean

Combining Text and Image Structures

[Charts: "All Structures" and "Normal Organics", each split into From Text, From Image, and Both]
[Diagram: three parallel workflows]

Image processing operations: 'Clip' images -> OSRA / Clide -> SDF files -> multi-step post-processing operations

PTO / data processing operations: Chem CWUs -> CDX / MOL files -> SDF files -> multi-step post-processing operations

Text processing operations: Text -> [Name=Structure] -> SMILES [ChemList] -> multi-step post-processing operations



Computer curation now involves multiple types of analysis, combining technologies into workflow protocols

Analysis of text
Analysis of images
Analysis of XML files
Analysis of CWUs
Derived meta data
Internal data

IBM + collaborator input -> output db to collaborators


Multiple workflows for processing text & images via different technologies

Computer Curation Process Overview

Data sources: U.S. patents (1976-2009); U.S. pre-grants (all); PCT & EPO applications; Medline abstracts (>18 M); selected Internet content; in-house content

Annotation Factory: parse & extract data; Annotator 1; Annotator 2; e Classifier & other data associations; computational analytics; ChemVerse (semantic associations)

IP database (e.g. DB2): database + computed meta data; ADU*; ChemVerse db

User applications: Pipeline Pilot; BIW; SIMPLE; Chem Search; Cognos/DDQB/other apps; view selected documents & reports

* ADU = Automated Data Update

Services hosted at IBM Almaden

What about additional meta data?

How should we identify, extract and associate attributes?

Data association: semantic associations using ChemVerse

Molecular entities have various attributes (from different sources):

IP attributes (Orange Book): legal status, assignee, foreign filings, expiration date

Spectral attributes (NIST db): IR spectra, NMR, mass spec, etc.

Physical attributes (computational): MW, MF, bp, mp, etc.

Drug and screening attributes:
Drugbank: activity, pharma data, protein binding, half life
WomBat: activity, pharma data, target data for SRA, literature references
PubChem: activity, pharma data, target data for SRA, literature references

Toxicity attributes (EPA databases): toxicity studies, LD50, literature references

ChemVerse: a tool for associating molecular attributes from different sources


Attributes derived from different sources

Internet data sources (Orange Book, PubChem, Drugbank, FDA, others) feed "The Tank". Each data source contributes its attributes under its own schema (Data Source 1 / Schema 1, Data Source 2 / Schema 2).

Example attributes: location, structure (trusted database), SMILE, InChi_id, binding site, code name, target, activity, app_id, trade name, geo, country, pathway, tox, IP status, certifications, licensing

Example databases: Database A (Medline), Database C (Tox)

Queries run in both directions:
Input list of attributes -> output file list of SMILES
Input list of SMILES -> output file list of attributes

ChemVerse: semantically maps associations of attributes from different sources

Semantic association of attributes
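The two query directions on this slide amount to a join across sources on a shared structure key. The sketch below is a toy illustration of that idea, not ChemVerse itself. The source names, attribute names and placeholder values are all hypothetical, and it assumes every source already uses the same canonical structure string as its key, which in practice is the hard part.

```python
from collections import defaultdict

# Hypothetical records: source -> structure key -> attributes.
SOURCES = {
    "source_1": {
        "c1ccccc1": {"trade_name": "placeholder-name-1"},
        "CC1=CC=CC=C1": {"trade_name": "placeholder-name-2"},
    },
    "source_2": {
        "c1ccccc1": {"tox": "placeholder-tox"},
    },
}


def build_index(sources):
    """Merge per-source records into one attribute map per structure key."""
    index = defaultdict(dict)
    for source_name, records in sources.items():
        for structure_key, attributes in records.items():
            for attribute, value in attributes.items():
                # Keep the provenance next to each value.
                index[structure_key][attribute] = (value, source_name)
    return index


def attributes_for(index, smiles_list):
    """Input list of SMILES -> attributes known for each structure."""
    return {smiles: index.get(smiles, {}) for smiles in smiles_list}


def structures_with(index, wanted_attributes):
    """Input list of attributes -> structures that carry all of them."""
    wanted = set(wanted_attributes)
    return sorted(key for key, attrs in index.items() if wanted <= set(attrs))
```

After `index = build_index(SOURCES)`, `attributes_for(index, ["c1ccccc1"])` returns both the trade-name and the tox entry with their sources, and `structures_with(index, ["trade_name", "tox"])` returns only the structure that appears in both sources.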

Who is participating?

Jeff Blaney, Slaton Lipscomb, Keven Clark, Jw Feng, Vickie Tsui, Bin Qing, Ben Sellers

Sorel Muresan, Christopher Southan, Niklas Blomberg, Plamen Petrov

Cynthia Yang, Charles Hand, Michael Rogers, Ramesh Durvasula, Alice Goshorn, Mark Hermsmeier

Bruce A. Lefker, Christopher Kibbey, David J. Walsh, Sarah Blendermann, Bryn Williams-Jones, Jacquelyn Klug-McLeod, Lee Harland, Robert Owen, Marudai Balasubramanian

Therese Vachon, Edgar Jacoby, Peter Ertl, Peter Gedeck, Fatma Oezdemir-Zaech, John W. Davies, Jeremy Jenkins, Allen Cornett, Stefan Wetzel, Greg Landrum, Richard Lewis, A J Dambra

Jasmin Saric, Scott Oloff, John Hart, Stephen Boyer, John Proudfoot, Markus Kunze

John B. Kinney, Timothy E. Mueller, Glenn Macstravic, Chrstophe Mazenc, Lustin Diaconescu

Paul Halfpenny, Thompson Doman, Marc Nicklaus, Igor Filippov, Marcus Sitzmann

John Overington, Christopher Steinbeck, Dominique Clark

Stephen Boyer, Jeff Kreulen, Ying Chen, Tom Griffin, Alfredo Alba, Scott Spangler, Eric Louie, Brad Wade, John Colino, Isaac Cheng, Ana Lelescu, Linda Kato, Su Yan, Ashish Sanghavi, Ramachandran Prasad, Qi He, Timothy J Bethea, Yanbo Wu, Meenakshi Nagarajan, Christopher Campbell

Organizations: AstraZeneca, BMS, Pfizer, Novartis, NIH, Lilly, Genentech / Roche, DuPont, EBI, Boehringer, WIPO

Backup materials

Research: it's a journey …