The Automated Extraction of Reactions from the Patent Literature

Arya Mir, Artificial Intelligence and Robotics

3 Apr 2012

Daniel Lowe Unilever Centre for Molecular Science Informatics University of Cambridge

1

Automated Extraction of Reactions from the Patent Literature

Daniel Lowe

Unilever Centre for Molecular Science Informatics

University of Cambridge

2

Chemistry patent applications

[Figure: Chemistry patent applications per year, 2000 to 2009; axis from 0 to 400,000. Source: World Intellectual Property Indicators, 2011 edition]

100,000s of applications each year

3

4

The idea

XML patents → Reaction Extraction System → Extracted Reactions

5

Steps involved


Identifying experimental sections


Identifying chemical entities


Chemical name to structure conversion


Associating chemical entities with quantities


Assigning chemical roles


Atom-atom mapping


6

Building on existing projects

7

Archetypal experimental section

Paragraph number

Section heading

Section target compound

Step target compound

Synthesis

Characterisation

Workup

Step identifier

8

Jessop, D. M.; Adams, S. E.; Murray-Rust, P. Mining Chemical Information from Open Patents. Journal of Cheminformatics 2011, 3, 40.

9

ChemicalTagger


Tags words of text

Parses tags to identify phrases

Generates an XML parse tree



http://chemicaltagger.ch.cam.ac.uk/


Hawizy, L.; Jessop, D. M.; Adams, N.; Murray-Rust, P. ChemicalTagger: A tool for semantic text-mining in chemistry. Journal of Cheminformatics 2011, 3, 17.

10

Tagging

Additional taggers:

OPSIN tagger: Finds names OPSIN can parse

Trivial chemical name tagger: Tags a few chemicals missed by the other taggers and cases that are partially matched by the regex tagger, e.g. Dess-Martin reagent

Regex tagger: Tags keywords, e.g. “yield”, “mL”

OSCAR4 tagger: Finds names OSCAR4 believes to be chemical, e.g. “2-methylpyridine”

OpenNLP: Tags parts of speech
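
As a rough illustration of what a keyword-level regex tagger does, here is a minimal Python sketch. It is not ChemicalTagger's implementation (which is Java); the pattern table is an assumption made for this example. The tags CD, NN-MASS, NN-AMOUNT and NN-EQ are borrowed from the sample output on the next slide, while NN-VOL and NN-YIELD are hypothetical names.

```python
import re

# Hypothetical pattern table: first matching pattern wins.
TOKEN_PATTERNS = [
    ("NN-MASS", re.compile(r"^(?:mg|g|kg)$", re.IGNORECASE)),
    ("NN-VOL", re.compile(r"^(?:ml|l)$", re.IGNORECASE)),        # assumed tag name
    ("NN-AMOUNT", re.compile(r"^(?:mmol|mol)$", re.IGNORECASE)),
    ("NN-EQ", re.compile(r"^(?:eq|equiv)\.?$", re.IGNORECASE)),
    ("NN-YIELD", re.compile(r"^yield(?:s|ed)?$", re.IGNORECASE)),  # assumed tag name
    ("CD", re.compile(r"^\d+(?:\.\d+)?$")),
]


def regex_tag(tokens):
    """Return (token, tag) pairs; tag is None when no pattern matches."""
    tagged = []
    for token in tokens:
        tag = next(
            (name for name, pattern in TOKEN_PATTERNS if pattern.match(token)),
            None,
        )
        tagged.append((token, tag))
    return tagged


tagged = regex_tag(["606", "mg", "mL", "yield", "DCM"])
```

Tokens left untagged here (such as "DCM") would be left to the other taggers in the stack.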

11

Sample ChemicalTagger Output

<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS>
      <CD>606</CD>
      <NN-MASS>mg</NN-MASS>
    </MASS>
    <COMMA>,</COMMA>
    <AMOUNT>
      <CD>2.1</CD>
      <NN-AMOUNT>mmol</NN-AMOUNT>
    </AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT>
      <CD>1</CD>
      <NN-EQ>eq</NN-EQ>
    </EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>
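
A parse tree of this shape is straightforward to consume downstream. The sketch below is an illustration in Python using only the standard library, not part of the extraction system itself; the function name `read_molecule` is hypothetical. It pulls the chemical name and its quantities out of the sample above.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<MOLECULE>
  <OSCARCM>
    <OSCAR-CM>methyl</OSCAR-CM>
    <OSCAR-CM>4-(chlorosulfonyl)benzoate</OSCAR-CM>
  </OSCARCM>
  <QUANTITY>
    <_-LRB->(</_-LRB->
    <MASS><CD>606</CD><NN-MASS>mg</NN-MASS></MASS>
    <COMMA>,</COMMA>
    <AMOUNT><CD>2.1</CD><NN-AMOUNT>mmol</NN-AMOUNT></AMOUNT>
    <COMMA>,</COMMA>
    <EQUIVALENT><CD>1</CD><NN-EQ>eq</NN-EQ></EQUIVALENT>
    <_-RRB->)</_-RRB->
  </QUANTITY>
</MOLECULE>
"""


def read_molecule(xml_text):
    """Return (name, {quantity kind: (value, unit)}) for one MOLECULE element."""
    root = ET.fromstring(xml_text.strip())
    # The name is split across several OSCAR-CM tokens; rejoin with spaces.
    name = " ".join(el.text.strip() for el in root.iter("OSCAR-CM"))
    quantities = {}
    for kind in ("MASS", "AMOUNT", "EQUIVALENT"):
        node = root.find(f"QUANTITY/{kind}")
        if node is None:
            continue
        value = float(node.findtext("CD"))
        # The unit token is the child whose tag starts with "NN-".
        unit = next(c.text.strip() for c in node if c.tag.startswith("NN-"))
        quantities[kind] = (value, unit)
    return name, quantities


name, quantities = read_molecule(SAMPLE)
```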

12

Phrase Identification

13

Quantity Identification

14

Section/Step Parsing

15

Pyridine, pyridines and pyridine rings

Entity                                 Type
Pyridine                               Exact
The pyridine / Pyridine from step 1    DefiniteReference
Pyridines / A pyridine                 ChemicalClass
Pyridine ring / Pyridyl                Fragment

16

Section/Step Parsing

Workup phrase types: Concentrate, Degass, Dry, Extract, Filter, Partition, Precipitate, Purify, Recover, Remove, Wash, Quench

17

Atom-mapping

18

Example

Methyl 4-[(pentafluorophenoxy)sulfonyl]benzoate

To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq) and Et3N (540 mg, 5.4 mmol, 2.5 eq) and the reaction mixture stirred at room temperature until all of the starting material was consumed. The solvent was evaporated in vacuo and the residue redissolved in ethyl acetate (10 ml), washed with water (10 ml), saturated sodium hydrogen carbonate (10 ml), dried over sodium sulphate, filtered and evaporated to yield the title compound as a white solid (690 mg, 1.8 mmol, 85%).
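
To make the quantity-association step concrete on this paragraph, the following Python sketch collects the parenthesised quantity groups from the first sentence. It is a simplified regex stand-in, not the parse-tree approach the system uses; the patterns and the function name `extract_quantity_groups` are assumptions for illustration.

```python
import re

# A parenthesised group with no nested parentheses and at least one digit.
PAREN_GROUP = re.compile(r"\(([^()]*\d[^()]*)\)")
# A number followed by one of a few units (illustrative list, not exhaustive).
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|mmol|mol|ml|eq|%)", re.IGNORECASE)


def extract_quantity_groups(text):
    """Return one list of (value, unit) pairs per parenthesised quantity group."""
    groups = []
    for match in PAREN_GROUP.finditer(text):
        quantities = [
            (float(value), unit.lower())
            for value, unit in QUANTITY.findall(match.group(1))
        ]
        if quantities:  # skip brackets that belong to chemical names
            groups.append(quantities)
    return groups


SENTENCE = (
    "To a solution of methyl 4-(chlorosulfonyl)benzoate (606 mg, 2.1 mmol, 1 eq) "
    "in DCM (35 ml) was added pentafluorophenol (412 mg, 2.2 mmol, 1.1 eq)"
)
groups = extract_quantity_groups(SENTENCE)
```

The bracket inside the chemical name is ignored because it contains no digits; each remaining group would then be attached to the chemical entity that immediately precedes it.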

19

Graphical Output

20

CML output

<reaction xmlns="http://www.xml-cml.org/schema" xmlns:cmlDict="http://www.xml-cml.org/dictionary/cml/" xmlns:nameDict="http://www.xml-..
  <dl:reactionSmiles>Cl[S:2]([c:5]1[cH:14][cH:13][c:8]([C:9]([O:11][CH3:12])=[O:10])[cH:7][cH:6]1)(=[O:4])=[O:3].[F:15][c:16]1[c:21]([OH:22])[c:20]([..
  <productList>
    <product role="product">
      <molecule id="m0">
        <name dictRef="nameDict:unknown">title compound</name>
      </molecule>
      <amount units="unit:mmol">1.8</amount>
      <amount units="unit:mg">690</amount>
      <amount units="unit:percentYield">85.0</amount>
      <identifier dictRef="cml:smiles" value="FC1=C(C(=C(C(=C1OS(=O)(=O)C1=CC=C(C(=O)OC)C=C1)F)F)F)F"/>
      <identifier dictRef="cml:inchi" value="InChI=1/C14H7F5O5S/c1-23-14(20)6-2-4-7(5-3-6)25(21,22)24-13-11(18)9(16)8(15)10(17)12(13)19/h2-5H..
      <dl:entityType>definiteReference</dl:entityType>
      <dl:state>solid</dl:state>
    </product>
  </productList>
  <reactantList>
    <reactant role="reactant" count="1">
      <molecule id="m1">
        <name dictRef="nameDict:unknown">methyl 4-(chlorosulfonyl)benzoate</name>
      </molecule>
      <amount units="unit:mmol">2.1</amount>
      <amount units="unit:mg">606</amount>
      <amount units="unit:eq">1.0</amount>
      <identifier dictRef="cml:smiles" value="ClS(=O)(=O)C1=CC=C(C(=O)OC)C=C1"/>

Quantities including yield are extracted

Entity is classified as an exact compound, definite reference, chemical class or polymer

Reaction SMILES

SMILES and InChIs for every structure-resolvable reagent/product
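
For readers who want to consume such output, the sketch below reads the amounts of one product with Python's standard library. The fragment is a cut-down, self-contained copy of the product element shown on this slide (the truncated parts are left out), and the function name `read_amounts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI as shown in the slide's <reaction> element.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

FRAGMENT = """
<product xmlns="http://www.xml-cml.org/schema" role="product">
  <molecule id="m0">
    <name dictRef="nameDict:unknown">title compound</name>
  </molecule>
  <amount units="unit:mmol">1.8</amount>
  <amount units="unit:mg">690</amount>
  <amount units="unit:percentYield">85.0</amount>
</product>
"""


def read_amounts(xml_text):
    """Map each unit (without its 'unit:' prefix) to its numeric value."""
    root = ET.fromstring(xml_text.strip())
    return {
        el.get("units").split(":", 1)[1]: float(el.text)
        for el in root.findall("cml:amount", CML_NS)
    }


amounts = read_amounts(FRAGMENT)
```

Elements in the default CML namespace must be addressed through a prefix mapping, which is why `findall` is given `CML_NS`.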

21

Evaluation


2008-2011 USPTO patent applications classified as containing organic chemistry

65,034 documents

484,259 atom-mapped reactions extracted

Adding the requirements that all identified product molecules were resolvable to structures and that all reagents were believed to describe exact compounds leaves 424,621 reactions

100 of these were selected for manual evaluation of quality

22

Reactions found

[Figure: Patents with given number of reactions (scale 1 to 100,000) plotted against number of extracted reactions (0 to 1000)]

23

Results


96% of reactions had the primary starting material and product correctly identified, with no misidentification of reagents that could be confused with the starting material

Compared with the 495 expected chemical entities, there were 61 false positives and 16 false negatives

Only 4 of the 321 reagents with quantities did not have these quantities recognised and associated with the reagent

Association of quantities/yields with products was less successful: 48 of the 74 cases where such data were present were handled correctly
24

Use Cases


Reaction searching



Analysing trends in reactions over time



Reaction outcome prediction


25

Example of reaction searching

C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)

6 reactions found in 5 patents
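
The query above uses atom maps to tie the two alkene carbons on the left to two ring carbons on the right. As an illustration of how such a string is structured, this pure-Python sketch splits it into reactants, agents and products and collects the map numbers on each side. It does no chemistry and is not the SMARTS search used for the slide; the helper names are hypothetical.

```python
import re

# Matches a bracket atom carrying an atom-map number, e.g. [CH:1] or [CH2:2].
ATOM_MAP = re.compile(r"\[[^\]]*:(\d+)\]")


def split_reaction(reaction_smiles):
    """Split 'reactants>agents>products' into three lists of component strings."""
    reactants, agents, products = reaction_smiles.split(">")

    def components(side):
        return [part for part in side.split(".") if part]

    return components(reactants), components(agents), components(products)


def mapped_atoms(components):
    """Return the set of atom-map numbers used in a list of component strings."""
    return {int(n) for part in components for n in ATOM_MAP.findall(part)}


QUERY = "C[CH:1]=[CH2:2].ICI>>C([CH:1]1[CH2:2][CH2]1)"
reactants, agents, products = split_reaction(QUERY)
```

Here both sides carry map numbers 1 and 2, while the unmapped `[CH2]` in the product ring has no counterpart among the mapped reactant atoms.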

26

Name I20110224.tar\US20110046406A1-20110224.ZIP\0066

Text from US 2011/0046406 A1

27

Most lexical variants

1-ethyl-3-(dimethylaminopropyl)carbodiimide hydrochloride

EDCI hydrochloride

1-ethyl-3-[3-(dimethylamino)propyl]-carbodiimide hydrochloride

N-ethyl-N'-(3-dimethylamino-propyl)-carbodiimide hydrochloride

N-[3-(Dimethylamino) propyl]-N'-ethylcarbodiimide hydrochloride

1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl

N1-((Ethylimino)methylene)-N3,N3-dimethylpropane-1,3-diamine hydrochloride

N-(3-dimethylaminopropyl)-N'-ethylcarbodiimide hydrochloride

1-ethyl-3-dimethylaminopropyl-carbodiimide hydrochloride

1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl

1-[3(dimethylamino)propyl]-3-ethylcarbodiimide hydrochloride

1-(-3-dimethylamino-propyl)-3-ethylcarbodiimide hydrochloride

N-(3-Dimethylamino-1-propyl)-N'-ethylcarbodiimide hydrochloride

1-ethyl-3-(3-dimethylaminopropyl)carbodiimide monohydrochloride

1-(3-(Dimethylamino)propyl)-3-ethyl-carbodiimide hydrochloride

And 127 more!

675 chemicals had over 10 lexical variants!
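
Counts like these fall out once every name has been resolved to a structure identifier. The Python sketch below shows one way the grouping could be done; it is an illustration only. The keys "key-A" and "key-B" are placeholders standing in for real identifiers such as InChIs, and the function names are hypothetical.

```python
from collections import defaultdict


def variants_by_structure(records):
    """Group distinct name strings under the structure key they resolve to."""
    groups = defaultdict(set)
    for name, structure_key in records:
        groups[structure_key].add(name)
    return dict(groups)


def highly_variable(groups, threshold):
    """Return the structure keys with more than `threshold` distinct names."""
    return sorted(key for key, names in groups.items() if len(names) > threshold)


# Placeholder keys; in practice these would come from name-to-structure conversion.
records = [
    ("EDCI hydrochloride", "key-A"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide HCl", "key-A"),
    ("1-(3-dimethylaminopropyl)-3-ethylcarbodiimide.HCl", "key-A"),
    ("EDCI hydrochloride", "key-A"),  # repeated mention, same variant
    ("pentafluorophenol", "key-B"),
]
groups = variants_by_structure(records)
```

Using a set per key means repeated mentions of the same string count once, so the size of each set is the number of lexical variants.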

28

Most common solvents

29

Known Limitations


The first workup reagent is often erroneously classified as a reactant

Atom mapping produces mappings that are not necessarily representative of reaction mechanism and occasionally involve clearly incorrect atoms

Conditions from analogous reactions are not resolved

Temperature/time for reactions to occur not captured

30

Conclusions


424,621 exact atom-mapped reactions were extracted from 4 years of USPTO patent applications

Evaluation indicates the reactions to be of generally good quality, especially if the misidentification of workup reagents as reactants is not considered important

All the code to extract reactions is open source:
https://bitbucket.org/dan2097/patent-reaction-extraction

31

Acknowledgements

Unilever Centre:
Robert Glen
Peter Murray-Rust
Lezan Hawizy
David Jessop
Matthew Grayson

Indigo toolkit:
Mikhail Rybalkin
Savelyev Alexander
Dmitry Pavlov

Boehringer Ingelheim for funding

SMARTS searching:
Roger Sayle

32

Any Questions?

Email: daniel@nextmovesoftware.com