Accessing U.S. Government Databases with the CACTVS Toolkit


4 Nov 2013


Accessing U.S. Government Chemical Structure Databases with the CACTVS Toolkit

Wolf-D. Ihlenfeldt
Xemistry GmbH
Lahntal, Germany
wdi@xemistry.com

The US Gov Chemical Structure Data Information Pool

PubChem
  Depositor structures (SID)
  Unique structures (CID)
  Assays (AID)
  Links to the rest of NCBI Entrez


The US Government Chemical Structure Data Pool

NIST Web Book
  Spectra, physical properties
ChemIDPlus
  Physical properties, biomedical links
NCI Chemical Identifier Resolver
  From name and IDs to structure

Other sources

ChemSpider (UK)
EINECS (EU)
KEGG (JP)
EMolecules (US, commercial)
ChEBI (UK)
ChEMBL (UK)
Drugbank (CA)
PDB (US, academic)
CommonChemistry (US, commercial)
Wikipedia (World)


How to Work with the Data?

Web interface for humans
  Hard to work with from software
Many DBs provide external links
  Prone to breaking, becoming outdated
Data available as batch download
  Massive, difficult to manage
Lack of formal interface documentation or programmatic access
  PubChem, Entrez and the NCI Resolver are the good guys



The CACTVS Toolkit

Generic chemistry toolkit
Manages objects like structures, reactions, tables
Extensible collection of properties, methods and I/O modules
Implicit automatic method chaining
Scripting language interface for RAD
Ships with access properties and modules for all these databases
Comprehensive solution for multi-DB projects


Basic Tasks

Name/Identifier resolution
  NCI Resolver -> REST interface
  KEGG -> text query

cactvs>ens create "vioxx"
ens0
cactvs>dataset create [list "+morphine +methyl"]
dataset0
cactvs>dataset ens dataset0
ens1 ens2
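
For readers without the toolkit, the REST-style lookup named above can be sketched in Python. This only builds a request URL for the public NCI Chemical Identifier Resolver, whose paths have the form `/chemical/structure/<identifier>/<representation>`. The helper name `resolver_url` is hypothetical, and the sketch does not claim to reproduce the requests that Cactvs issues internally.

```python
from urllib.parse import quote

# Public base URL of the NCI/CADD Chemical Identifier Resolver.
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier, representation="smiles"):
    """Build a resolver URL: <base>/<identifier>/<representation>.

    resolver_url is a hypothetical helper name, not part of any toolkit.
    """
    # Percent-encode everything (including '/') so that the identifier
    # stays a single path segment.
    ident = quote(identifier, safe="")
    rep = quote(representation, safe="")
    return f"{RESOLVER_BASE}/{ident}/{rep}"


if __name__ == "__main__":
    print(resolver_url("vioxx"))
```

Fetching the resulting URL (for example with `urllib.request.urlopen`) returns plain text; network access is deliberately left out of the sketch.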



Basic Tasks: Get Database ID

Text structure query (SMILES, InChI)
NCBI PUG Web service

cactvs>ens get ens0 E_SIDSET
9792 207247 535364 5146347 7847634 7980536 8146414 8153131 10486532 11341940 11362123 11362973 11364757 11365535…
cactvs>ens get ens0 E_CHEMIDPLUS_ID
0162011907

Basic Tasks: Download Objects

PubChem: from CID, SID, AID
PDB, CHEMBL, KEGG: from codes
Resolver: from name, identifiers

cactvs>ens create 1
ens0
cactvs>ens create CHEMBL277500
ens1

Basic Tasks: Download Objects

PubChem I/O via native ASN.1

cactvs>table create 198
table3
cactvs>table get table3 colnames
SID SID_Source Version Date Outcome Score schedule endpoint vehicle dose tcprcnt toxicity
cactvs>table get table3 T_NCBI_ASSAY_DESCRIPTION(description)
{The antitumor activity of compounds was measured in mice bearing transplantable tumors. Survival or tumor size were measured and the…



Basic Tasks: I/O of ID Files

Read files with CIDs, SIDs, CASNOs…

cactvs>set fh [molfile open test.cas]
molfile0
cactvs>molfile loop $fh eh { puts [ens get $eh E_CID] }
436534 321512 234 32532….
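
When ID files contain CAS numbers, it can help to screen them before any lookup is attempted. The Python sketch below applies the standard CAS check-digit rule (digits weighted 1, 2, 3, … from the right, sum modulo 10). The one-ID-per-line file layout and the helper names `is_valid_cas` and `split_cas_lines` are assumptions; the slide does not show how `test.cas` is laid out or whether the toolkit performs such a check.

```python
import re

# CAS registry numbers look like 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Return True if the string is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # The rightmost digit gets weight 1, the next one weight 2, and so on.
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def split_cas_lines(lines):
    """Split lines (assumed: one ID per line) into valid and rejected CAS numbers."""
    valid, rejected = [], []
    for line in lines:
        token = line.strip()
        if not token:
            continue  # skip blank lines
        (valid if is_valid_cas(token) else rejected).append(token)
    return valid, rejected


if __name__ == "__main__":
    print(split_cas_lines(["71-43-2", "162011-90-7", "71-43-3", ""]))
```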

Implicit Property Lookup

Yes, it's controlled, with metadata and origin tracing

cactvs>ens create benzene
ens0
cactvs>ens get ens0 E_CAS
71-43-2
cactvs>ens get ens0 E_UVSPECTRUM
1 {INSTITUTE OF ENERGY PROBLEMS OF CHEMICAL PHYSICS, RAS} {INEP CP RAS, NIST OSRD Collection (C) 2007 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.} 0 n.i.g. {} {{$NIST SQUIB} 1951ROM/VOD930-932 {$NIST SOURCE} TSGMTE {$REF AUTHOR} {Romand, J.; Vodar, B.} {$REF TITLE} {Spectres d'absorption du benzene a l'etat vapeur et a l'etat condense dans l'ultraviolet lointain} {$REF JOURNAL} {Compt. Rend.} {$REF VOLUME} 233 {$REF PAGE} 930-932 {$REF DATE} 1951} {} {RAS UV No. 118} 0.0 {} {} 0.0 162.418 206.9805 1.0 1.0 {Wavelength (nm)} {Logarithm epsilon} 317 {} {3.7038 3.7101 3.7161 3.722 3.722….

More Property Lookups

cactvs>ens show ens0 E_NIST_WEBBOOK_ID
C71432
cactvs>ens get ens0 E_AIDSET
330 421 426 427 433 434 435 445 530 540 541 542 543 544 545 546 584 585...
cactvs>ens get ens0 E_NAMESET
BENZENE 71-43-2 NCGC00090744-02 UN1114 {Benzen [Polish]} {Benzene + aniline combo} 270709_ALDRICH 311855_SIGMA {Benzene (including benzene from gasoline)} 676985_ALDRICH {Benzene [UN1114] [Flammable liquid]} 154628_SIAL {Benzene, labeled with carbon-14 and tritium}…

More Property Lookups

cactvs>ens get ens0 E_MESH_TERMS
{68001554 {Benzene Cyclohexatriene Benzol Benzole}
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=mesh&Cmd=ShowDetailView&TermToSearch=68001554
{68001554 Benzene {68006841 {Hydrocarbons, Aromatic} {68006844 {Hydrocarbons, Cyclic} {68006838 Hydrocarbons {68009930 {Organic Chemicals} {1000068 {Chemicals and Drugs Category} {1000048 {All MeSH Categories}}}}}}}}}
{68009930 {{Organic Chemicals} {Chemicals, Organic}}
http://www.ncbi.nlm.nih.gov/sites/entrez?Db=mesh&Cmd=ShowDetailView&TermToSearch=68009930...

Construction of Display URLs

cactvs>ens get ens0 E_PUBCHEM_URL
http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=241
cactvs>ens get ens0 E_CHEMIDPLUS_URL
http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?objectHandle=DBMaint&actionHandle=default&nextPage=jsp/chemidheavy/ResultScreen.jsp&ROW_NUM=0&TXTSUPERLISTID=0000071432
cactvs>ens metadata ens0 E_CHEMIDPLUS_URL info
{JSESSIONID=257C452AAB26D395DCC4AC2652F05C99; Path=/chemidplus}
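
The sample values on these slides suggest a simple relationship between a CAS number and some of the database IDs: benzene (71-43-2) has the NIST WebBook ID C71432 and the ChemIDplus list ID 0000071432, and the ChemIDplus ID shown earlier for Vioxx, 0162011907, is CAS 162011-90-7 with the hyphens removed and padded to ten digits. The Python sketch below encodes that observed pattern. It is an inference from the slide values, not a documented rule, and the toolkit itself obtains these IDs by querying the services. All function names are hypothetical.

```python
def cas_digits(cas):
    """Strip the hyphens from a CAS number and make sure only digits remain."""
    digits = cas.replace("-", "")
    if not digits.isdigit():
        raise ValueError(f"not a CAS number: {cas!r}")
    return digits


def nist_webbook_id(cas):
    # Pattern observed on the slides: 'C' followed by the CAS digits.
    return "C" + cas_digits(cas)


def chemidplus_id(cas):
    # Pattern observed on the slides: CAS digits left-padded with zeros to 10 characters.
    return cas_digits(cas).zfill(10)


def pubchem_summary_url(cid):
    # URL layout exactly as printed on the slide for E_PUBCHEM_URL.
    return f"http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid={int(cid)}"


if __name__ == "__main__":
    print(nist_webbook_id("71-43-2"), chemidplus_id("71-43-2"))
    print(pubchem_summary_url(241))
```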

Ugliness under the Hood

Absence of a clean programmatic interface hurts!

set mdata [encode -url [molfile string $eh]]
set pdata "indexes=&DT_ROWS_PER_PAGE2=1&objectHandle=Search&actionHandle=searchChemIdLite&nextPage=jsp%2Fchemidheavy%2FChemidDataview.jsp&DT_ROWS_PER_PAGE=1&responseHandle=JSP&QV10=&QO10=Text+Search&QF11=Locator&QV11=&QO11=in&STRING_TO_FILE=$mdata&QF1=Name&QO1=%3D&QV1=&QV8=&QF8=ToxTestType&QO5=between&QV5=&QF5=ToxResult&QV6=&QF6=ToxSpecies&QV7=&QF7=ToxRoute&QV9=&QF9=ToxEffect&ChemType=1001&QF3=ChemType&QV3=&QF2=ChemProp&QO2=between&QV2=&ChemDataSourceType=0&QF4=ChemDataSourceType&QV4=&LocatorExpr1=&LocatorOper=AND&LocatorExpr2=&chemical_viewer=marvin&StructureSimilarPctg=80&QF10=StructureEqual&structurePref=marvin&QO12=between&QV12=&QF12=MolWeight&x=22&y=5"
set data [post -contenttype application/x-www-form-urlencoded -raw http://chem.sis.nlm.nih.gov/chemidplus/ProxyServlet?chemidheavy $pdata # auto status]
if {![regexp {chemid=([0-9]+)} $data dummy id] && ![regexp {javascript:loadChemicalIndex[^0-9]*([0-9]+)} $data dummy id]} {
    error "no ChemIDplus record"
}
ens set $eh E_CHEMIDPLUS_ID $id
ens metadata $eh E_CHEMIDPLUS_ID info $status(cookies)

Power by Design

In contrast, PubChem has a well-defined set of interfaces
  PUG, EUtils, cookie-free download URLs
No simulated Web form posting
No HTML page scraping
Support for more than just ID access

The PubChem Virtual File Project

Improved access to the PubChem database
  Indistinguishable from a local, read-only structure file in the Cactvs scripting environment
Input functions
  Transparently read structures and assay tables with all their data from PubChem, by decoding native binary ASN.1
Query functions
  Convenient development and conservation of queries exceeding the capabilities of Web interfaces and PUG, maintaining standard Cactvs query and retrieval syntax

Transforming the PubChem Database into a Virtual File

Cactvs toolkit uses file record as primary key
PubChem uses CID (AID, SID) as primary key
Establish mapping via record/CID map
  Precomputed as a 20M-bit bitmap
  Set bit indicates active CID
  Automatic download from Xemistry if needed, local caching, up-to-date check via Entrez query
  Checked and potentially updated every 30 mins on Xemistry server
  Data size 800K compressed, download <10s
  Download of full active CID set from Entrez ~10-25 mins
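
The record/CID mapping described above can be illustrated with a small Python sketch: a bitmap in which a set bit marks an active CID is expanded into a sorted list, so that record n of the virtual file corresponds to the n-th active CID. The bit layout (least significant bit first, bit 0 standing for CID 1), the 1-based record numbering, and all function names are assumptions for illustration only; the slides do not specify the actual encoding of the Xemistry bitmap.

```python
from bisect import bisect_left


def active_cids(bitmap):
    """Yield the CIDs whose bit is set.

    Assumed layout: bit i, counted LSB-first within each byte, stands for CID i + 1.
    """
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                yield byte_index * 8 + bit + 1


def build_record_map(bitmap):
    """Return the sorted list of active CIDs; list position + 1 is the record number."""
    return list(active_cids(bitmap))


def cid_for_record(record_map, record):
    """Map a 1-based record number to its CID."""
    if not 1 <= record <= len(record_map):
        raise IndexError(f"record {record} out of range")
    return record_map[record - 1]


def record_for_cid(record_map, cid):
    """Map a CID back to its 1-based record number, or None if the CID is not active."""
    pos = bisect_left(record_map, cid)
    if pos < len(record_map) and record_map[pos] == cid:
        return pos + 1
    return None


if __name__ == "__main__":
    demo = build_record_map(bytes([0b00001011, 0b00000001]))
    print(demo, cid_for_record(demo, 3), record_for_cid(demo, 9))
```

Keeping only the bitmap on the wire and expanding it locally is what makes the download small: one bit per possible CID, instead of one integer per active CID.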

PubChem Virtual File I/O

Code sample:

filex load pubchem
19
molfile open <pubchem>
molfile0
molfile count molfile0
19450023
molfile read molfile0
ens0
ens props ens0
…E_INCHI E_IUPAC_NAME E_NCBI_COMPOUND_ID E_EXACT_MASS E_TPSA E_SMILES E_SMILES/2….
ens get ens0 E_CID
1
molfile read molfile0
ens1
molfile set molfile0 record 999999

Contact Entrez e-utils, get database status, get CID bitmap from Xemistry
Single-record ASN.1 download via display page

Simple PubChem Queries

Code sample:

set fh [molfile open <pubchem>]
set cidlist [molfile scan $fh "structure >= $smarts" \
    {proplist E_CID}]

Operations behind the scenes:
  Set-up of PUG record
  Post PUG, monitor return status
  Cache CID result data
  Direct access to result set, no structure download
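
The submit, monitor, cache cycle listed above follows a common pattern for asynchronous Web services. The Python sketch below shows that pattern in generic form with the transport injected as a callable, so it runs without network access. The status strings `"running"` and `"success"`, the function names, and the cache keyed by query text are all hypothetical; they are not the actual PUG status vocabulary or the toolkit's implementation.

```python
import time


def poll_until_done(check_status, interval=2.0, max_polls=30, sleep=time.sleep):
    """Call check_status() until the job leaves the (assumed) 'running' state.

    check_status must return a (status, payload) tuple.
    """
    for _ in range(max_polls):
        status, payload = check_status()
        if status == "success":
            return payload
        if status != "running":
            raise RuntimeError(f"query failed with status {status!r}")
        sleep(interval)
    raise TimeoutError("query did not finish in time")


def run_query(query, submit, cache, **poll_options):
    """Submit a query once, poll for its result, and cache the result by query text.

    submit(query) is a caller-supplied function that posts the request and
    returns a check_status callable for that job.
    """
    if query not in cache:
        cache[query] = poll_until_done(submit(query), **poll_options)
    return cache[query]


if __name__ == "__main__":
    def fake_submit(query):
        # Stand-in for a real service: reports 'running' twice, then a result.
        states = iter([("running", None), ("running", None), ("success", [241])])
        return lambda: next(states)

    print(run_query("structure >= c1ccccc1", fake_submit, {}, sleep=lambda s: None))
```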

Intermediate PubChem Queries

Code sample:

set fh [molfile open <pubchem>]
set elist [molfile scan $fh \
    "or {structure = $smiles1} {structure = $smiles2} \
    {structure = $smiles3}" enslist]

Operations behind the scenes:
  Create and post PUG records, get history keys
  Perform server-side e-utils result merge via history keys
  Retrieve CID set
  Download structures as ASN.1 blobs via CID
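
The server-side merge step can be illustrated with the Entrez E-utilities history mechanism, in which earlier result sets are referenced as `#<query_key>` inside a new ESearch term that carries the same `WebEnv`. The Python sketch below only constructs such a request URL for the `pccompound` database. The function name and the sample `WebEnv` value are hypothetical, and the exact requests issued by the toolkit are not shown on the slides.

```python
from urllib.parse import urlencode

# ESearch endpoint of the NCBI Entrez E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def merge_history_url(web_env, query_keys, operator="OR", db="pccompound"):
    """Build an ESearch URL that combines stored result sets on the server.

    web_env and query_keys are the history handles returned by earlier queries.
    """
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator!r}")
    if not query_keys:
        raise ValueError("at least one query key is required")
    # '#1 OR #2 OR #3' refers to result sets 1, 2 and 3 of the same WebEnv.
    term = f" {operator} ".join(f"#{key}" for key in query_keys)
    params = {"db": db, "term": term, "WebEnv": web_env, "usehistory": "y"}
    return f"{ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # 'WEBENV123' is a placeholder, not a real history handle.
    print(merge_history_url("WEBENV123", [1, 2, 3]))
```

Merging on the server means only the final CID set has to be transferred, which matches the order of operations on the slide: merge first, then retrieve the CID set, then download structures.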

Power PubChem Queries

Code sample:

set th [molfile scan <pubchem> \
    "and {structure >= c1cncc1} {E_PUBCHEM_AID_COUNT(active) > 25}" \
    {tablecollection image E_CID E_NAME E_SMILES E_PUBCHEM_AID_COUNT(active) E_PUBCHEM_AID_COUNT(inactive) E_ACTIVE_AIDSET} \
    {} {maxhits 10}]
table write $th active_pyrroles_in_pubchem.xls


Graphical Tools for the Masses

Draw or read structure
Compute database ID property
  Display data
Compute lookup properties
  Display data
Compute access URL property
  Load page into HTML widget

… in Stand-alone Tools

… and in Web Applications