BIOINFORMATICS IN DRUG DEVELOPMENT
AND ASSESSMENT
From the Meeting of the International Society for the Study of Xenobiotics,
August 29-September 2, 2004, Vancouver, Canada
David S. Wishart
Departments of Biological Sciences and Computing Science, University of Alberta,
Edmonton, Alberta, Canada

Bioinformatics is playing an increasingly important role in nearly all aspects of drug discovery, drug assessment, and drug development. This growing importance lies not only in the role that bioinformatics plays in handling large volumes of data, but also in the utility of bioinformatics tools to predict, analyze, or help interpret clinical and preclinical findings. This review focuses on describing and evaluating some of the newer or more important bioinformatics resources (i.e., databases and software) that are of growing importance to understanding or predicting drug metabolism, especially with respect to the absorption, distribution, metabolism, excretion (ADME), and toxicity (T) of both existing drugs and potential drug leads. Detailed descriptions and critical assessments of a number of potentially useful bioinformatics/cheminformatics databases and predictive ADMET software tools are provided. Additionally, several pharmaceutically important applications of both the databases and software are highlighted. Given the rapid growth in this area and the rapid changes that are taking place, a special emphasis is placed on freely available or Web-accessible resources.

Key Words: Bioinformatics; Drug metabolism; ADME; Databases; Computer algorithms.

INTRODUCTION

Genomics, proteomics, and metabolomics are profoundly changing the traditional approaches to drug discovery and development. Nowadays, potential drug targets are increasingly being identified through high-throughput sequencing (Carlton, 2003; Kramer and Cohen, 2004), through high-throughput microarray or two-dimensional (2D) gel experiments (Butte, 2002; Onyango, 2004; Walgren and Thompson, 2004), or through large-scale mass spectrometry or chemical library screens (Comess and Schurdak, 2004; Jeffery and Bogyo, 2003; Lindsay, 2003). These same high-throughput technologies are also being used to accelerate drug development and assessment (Ansede and Thakker, 2004; Curtis et al., 2002; Frank and Hargreaves, 2003). The completion of the human genome (Venter et al., 2001) and, more recently, the completion of the mouse, rat, and dog genomes (Gibbs et al., 2004; Parker et al., 2004; Waterston, 2002) are already having a significant impact on our understanding of drug metabolism and drug toxicity (Goldstein et al., 2003; Guzey and Spigset, 2004; Shioda, 2004). Likewise, the application of standard genomics/proteomics technologies, such as gene chips, mass spectrometry, 2D gels, or NMR spectroscopy, is now allowing much more rapid and thorough characterization of the absorption, distribution, metabolism, excretion (ADME), and toxicity (T) of potential drug leads (Ackermann et al., 2002; Kassel, 2004; Lindon et al., 2004; Villeneuve and Parissenti, 2004).

Address correspondence to David S. Wishart, Departments of Biological Sciences and Computing Science, University of Alberta, Edmonton, Alberta, T6G 2E8, Canada; Fax: 780-492-1071; E-mail: david.wishart@ualberta.ca
Drug Metabolism Reviews, 2:279–308, 2005
Copyright © 2005 Taylor & Francis Inc.
ISSN: 0360-2532 print/1097-9883 online
DOI: 10.1081/DMR-200055225

Of course, with the increasing use of these high-throughput techniques, scientists are now finding that the greatest challenge lies not with the collection of the data, but with its storage, retrieval, analysis, and interpretation. For instance, sequence databases are now doubling in size every 12 months, with the latest release of GenBank containing 39,000,000 sequences and 43 billion bases and occupying 100 gigabytes of disk space. Already more than 1,000 viral genomes, 200 bacterial genomes, and more than a dozen eukaryotic genomes have been sequenced (http://www.ebi.ac.uk/genomes/). Similar high-throughput sequencing efforts are also leading to the identification and archiving of millions of mutations or single nucleotide polymorphisms (SNPs) that are being continuously deposited in several databases (Fredman et al., 2004; Jiang et al., 2003; Thorisson and Stein, 2003). Sequencing efforts are not the only culprits. High-throughput mass spectrometry experiments or compound screening experiments can easily generate several gigabytes of data per day, while individual microarray experiments, tissue array experiments, or 2D gel runs typically generate 100–200 megabytes of image or text/image data each. All of these experimental results are, in turn, being reported in a rapidly growing body of journal articles and electronic abstracts. At last count, PubMed contained 15 million abstracts from more than 4,600 journals, occupying more than 40 gigabytes of textual data (Oliver et al., 2004).
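
To give a feel for what a 12-month doubling time implies, the following Python sketch projects the number of bases forward from the 43 billion quoted above. It is only an illustration: it assumes the doubling rate stays constant, which the text does not claim, and the function name `projected_size` is hypothetical rather than part of any database toolkit.

```python
def projected_size(current: float, years: float,
                   doubling_time_years: float = 1.0) -> float:
    """Return the size after `years`, assuming one doubling per
    `doubling_time_years` (an illustrative constant-rate assumption)."""
    return current * 2 ** (years / doubling_time_years)


# 43 billion bases is the GenBank figure quoted in the text.
bases_now = 43e9

if __name__ == "__main__":
    for years in (1, 2, 5):
        print(f"after {years} year(s): "
              f"{projected_size(bases_now, years):.3g} bases")
```

Under this assumption the archive grows 32-fold in five years, which is why storage and retrieval, rather than data collection, become the limiting steps.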

In order to deal with this “data explosion,” scientists are increasingly turning toward computers and computer scientists to help. This has led to the emergence of two new fields in information science: bioinformatics and cheminformatics. Bioinformatics is a field of information technology concerned with the storage, retrieval, visualization, prediction, and analysis of molecular data having biological or clinical importance (Brown, 2003; Buckingham, 2004). In a similar fashion, cheminformatics is primarily concerned with the handling of chemical data having pharmaceutical or synthetic importance (Lahana, 2002). Bioinformatics and cheminformatics have essentially existed as two separate disciplines for more than 20 years. Each field has developed its own culture, with bioinformatics largely evolving to a noncommercial, Web-based, and open-source model, while cheminformatics has evolved to a somewhat more commercial, closed-source, or limited-access model. Because pharmaceutical research typically involves both chemical and biological analyses, a continuing challenge for pharmaceutical researchers is to combine bioinformatics with cheminformatics in an effective and coherent way.

The purpose of this review is to provide a survey of the bioinformatics (and cheminformatics) resources that are potentially most relevant to pharmaceutical researchers. Because it is impossible to review all aspects of bioinformatics in all areas of pharmaceutical research, the focus here will be primarily on describing existing or emerging bioinformatics tools that may have some utility in drug development or assessment, especially ADMET. In particular, I will discuss two areas of bioinformatics that are particularly relevant to predicting or understanding drug metabolism and toxicity: bioinformatics or cheminformatics databases and predictive software. For those readers who desire a broader review of basic bioinformatics, of laboratory information management systems (LIMS), or of specific software for genomic or proteomic analysis, there are now several excellent textbooks on these subjects (Baxevanis and Ouellette, 2004; Lesk, 2002).

DATABASES

Data and databases are key to both bioinformatics and cheminformatics. Without large quantities of easily accessible electronic data, most kinds of data searches would prove to be fruitless, and most kinds of predictive or analytical software could never be developed or tested. Not only is the quantity of biological or chemical data important, but so too is the quality. Obviously, if data quality is poor, then any predictions or interpretations relying on this faulty data will also be poor. This issue of data quality is one of the most difficult areas of bioinformatics and cheminformatics to monitor or control (CODATA, 2000). In fact, continuing quality control and data curation are among the most expensive and labor-intensive aspects of any database operation. This is why most large databases are either commercially operated (e.g., chemical or pharmaceutical databases) or maintained by well-funded government agencies or national libraries (e.g., bioinformatics databases).

Generally speaking, there are four types of bioinformatics or cheminformatics databases: archival, curated, commercial, and noncommercial. Most bioinformatics databases are public or noncommercial, while most cheminformatics databases are commercial. Archival databases (which are almost always public) accept or include data “as is” from data depositors with relatively modest checking or validation. Their purpose is to be an open-access repository, and their goal is to accumulate as much data as possible. An example of an archival bioinformatics database is GenBank (Benson et al., 2004). On the other hand, curated databases are specialized data resources maintained by one or more curators who select, input, or invite only the “highest-quality” data from selected niches. The quality of the data is of the utmost importance, whereas the quantity of data is secondary. In pharmaceutical research, curated databases are generally of greatest interest, because they offer both the high-quality content and the specialized information (primarily with respect to humans) that is needed by most pharmaceutical researchers.

Regardless of whether one uses archival or curated databases, all databases must have an interface that allows users to access, search, visualize, and retrieve data that is of interest to them. Recent developments in both graphical user interface design and Web interface design (particularly in Common Gateway Interface or CGI scripting tools) have made many bioinformatics and cheminformatics databases easy to query, highly visual, and very interactive. As a result, many useful databases (including commercial databases) are now easily accessible over the Web.

In this short review of bioinformatics/cheminformatics databases, I will discuss four different types of curated databases that I believe are of particular importance to drug metabolism research. These include sequence or sequence annotation databases, SNP or mutation databases, general metabolism databases, and finally, drug metabolism/interaction databases.

Sequence and Sequence Annotation Databases

Gene and protein sequence data are now critical to almost all aspects of pharmaceutical research. For instance, the routine sequencing of pathogens, such as viruses, parasites, or bacteria, now allows remarkably quick and relatively inexpensive identification of potential protein drug targets or pathogenicity “islands” (Buysse, 2001; Chan et al., 2002). Likewise, the sequencing of the human, mouse, and rat genomes has led to the identification of more than 30 families of drug-metabolizing enzymes along with dozens of potentially new protein drug targets (Kramer and Cohen, 2004; Lindsay, 2003). Additionally, gene annotation data (which is derived from gene sequence data) is routinely used to analyze and interpret gene or protein expression experiments in pharmacogenomic or toxicogenomic studies. Sequence and sequence annotation data play a key role not only in identifying possible protein targets, but also in providing a basis for our understanding of the mechanism or process by which a drug may work, where it may target, or how it might be metabolized.

Of course, the availability of the human, mouse, rat, and dog genomes does not mean that all the necessary and relevant molecular information about their genomes is available. In fact, four years after the announced completion of the human genome, we are still uncertain of the true number of human genes (at last count it was 24,195 genes and 35,193 transcripts, give or take ~3,000), the number of alternatively spliced gene products or isozymes (estimated at 100,000), the number of peptides, proproteins, proteins, and chemically modified proteins (estimated at 500,000), and the probable function of these genes or proteins (Southan, 2004). The same level of uncertainty also holds for the mouse, rat, and dog genomes. Given the state of flux that exists with these very important genomes, it is expected that new data about many genes, gene names, and gene functions will be continuously emerging. Therefore, it is critical that drug researchers understand that sequence databases are dynamic entities and that constant vigilance is needed if one is to stay current with what has been learned and what has changed about a given genome. Fortunately, there are a number of continuously updated sequence databases and sequence database resources that make this task a little easier.

The two primary providers and curators of annotated sequence databases are the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland (USA), and the European Bioinformatics Institute (EBI) in Hinxton (UK). Both groups have staff levels in excess of 200 individuals, and both maintain and operate dozens of general and specialized sequence databases, all of which are freely accessible through their home pages (NCBI at http://www.ncbi.nlm.nih.gov/; EBI at http://www.ebi.ac.uk/). Both the EBI and NCBI are mandated to provide free and up-to-the-minute access to all publicly available sequences (including genomic data), and their raw sequence repositories are updated and synchronized every 24 hours. However, this does not mean that both groups offer identical services or identical secondary (i.e., curated or specialty) databases. In fact, a friendly rivalry is maintained by both groups in an effort to constantly develop, improve, or create new products, databases, or services that will be more attractive to researchers than the other group’s latest offerings. As a general rule, the EBI’s strength is in providing protein-rich resources, while the NCBI’s strength is in providing DNA-rich resources (Brooksbank et al., 2003; Wheeler et al., 2004). Many of the databases, programs, and resources described in this review originate from either the NCBI or EBI.

Rather than reviewing the hundreds (perhaps thousands) of sequence databases available, I will provide a somewhat selective survey of the dozen or so most useful sequence databases for drug metabolism and pharmacogenomics researchers. These are also listed in Table 1. In selecting these databases, I have tried to identify those that are particularly comprehensive, easy to use, and widely acclaimed.

Table 1. Databases of importance to drug development and drug assessment

Database name                               URL/Web address

Sequence/Sequence Annotation
GenBank                                     http://www.ncbi.nlm.nih.gov/BLAST/
GenBank Stats                               http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
Ensembl                                     http://www.ensembl.org/
EntrezGene                                  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
UCSC-GoldenPath                             http://www.genome.ucsc.edu/cgi-bin/hgGateway
RefSeq                                      http://www.ncbi.nlm.nih.gov/RefSeq/
SwissProt                                   http://us.expasy.org/sprot/
UniProt                                     http://www.pir.uniprot.org/
TrEMBL                                      http://www.ebi.ac.uk/trembl/index.html
GeneCards                                   http://bioinfo.weizmann.ac.il/cards/index.shtml
Mouse genome database (MGD)                 http://www.informatics.jax.org/menus/allsearch_menu.shtml
Rat genome database (RGD)                   http://rgd.mcw.edu/
MAGPIE/BLUEJAY                              http://magpie.ucalgary.ca/
SymAtlas                                    http://symatlas.gnf.org/SymAtlas/
CypAlleles DB                               http://www.imm.ki.se/CYPalleles/
Directory of P-450 containing systems       www.icgeb.org/p450/
Cytochrome P-450 interaction table          http://medicine.iupui.edu/flockhart/table.htm
Human membrane transporter database (HMTD)  http://lab.digibench.net/transporter/
Transporters page                           http://www.med.rug.nl/mdl/english/tab3.htm
Human ABC transporters database             http://nutrigene.4t.com/humanabc.htm

SNPs and Mutations
The SNP Consortium                          http://snp.cshl.org/
JSNP                                        http://snp.ims.u-tokyo.ac.jp/
HGVbase                                     hgvbase.cgb.ki.se/
dbSNP                                       http://www.ncbi.nlm.nih.gov/projects/SNP/
SNPper                                      http://snpper.chip.org/
GO database                                 http://www.geneontology.org/
PolySearch                                  http://redpoll.pharmacy.ualberta.ca/PolySearch/
OMIM                                        http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM
HGMD                                        http://archive.uwcm.ac.uk/uwcm/mg/hgmd0.html

Metabolism and Metabolic Pathways
KEGG                                        http://www.genome.jp/kegg/
MetaCyc                                     http://metacyc.org/
ExPASy Biochemical Pathways                 http://www.expasy.org/cgi-bin/search-biochem-index
PUMA2                                       http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi
Metabolic pathways of biochemistry          http://www.gwu.edu/~mpb/
Main metabolic pathways Web page            http://home.wxs.nl/~pvsanten/mmp/mmp.html
The Medical Biochemistry Page               http://web.indstate.edu/thcme/mwking/
UM-BBD                                      http://umbbd.ahc.umn.edu/

Drug Metabolism and Drug Interactions
MDL Inc. (METABOLITE Db)                    http://mdl.com/products/predictive/metabolite/index.jsp
DIDbase                                     http://depts.washington.edu/didbase/
DEREK/VITIC                                 http://www.chem.leeds.ac.uk/luk/vitic/index.html
PharmGKB                                    http://www.pharmgkb.org/
DrugBank                                    http://redpoll.pharmacy.ualberta.ca/drugbank/

Perhaps the best general resource for human, mouse, rat, and dog (and chimp) genome sequence and genome annotation data is the Ensembl database and genome browser (Brooksbank et al., 2003; Hammond and Birney, 2004). Ensembl is both a queryable database and a Web-accessible genome viewer that contains the sequence data, automatic annotation, and schematic image maps of numerous metazoan genomes. Ensembl also contains information about gene organization or gene order (synteny), chromosome structure, and sequence relationships between different genes in different genomes (orthology). The Ensembl database is continuously updated, with the human genome being updated every two months. Ensembl is one of three main systems that annotate and display human genome information. The other two are the UCSC (GoldenPath) genome browser system (Karolchik et al., 2003) and the NCBI’s genome resources (Wheeler et al., 2004). Just like Ensembl, the NCBI genome resources (especially EntrezGene, which recently replaced LocusLink) also support Web-accessible queries for a large number of other metazoan genomes (rat, mouse, chimp, and dog). Both Ensembl and EntrezGene depend on reference sequence (RefSeq) data supplied by the NCBI. RefSeq is a comprehensive, publicly available, nonredundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for all major research organisms (Pruitt et al., 2003).

In addition to Ensembl and EntrezGene, both of which are particularly “genocentric,” the other key sequence or sequence annotation resource is SwissProt (O’Donovan et al., 2002). SwissProt, which was established by Amos Bairoch in 1986, is a manually curated, protein-only sequence database that provides a very high level of annotation about protein sequences, names, functions, properties, and relationships. SwissProt is updated every four to six months and has been recently integrated into UniProt (the Universal Protein Resource), a much larger database that also contains the TrEMBL protein sequence database (Apweiler et al., 2004). The most recent release of SwissProt contains 163,496 sequence entries, comprising 59,718,080 amino acids abstracted from 120,925 references. SwissProt 45 contains sequence and sequence annotation information from more than 160 organisms, including 11,556 human sequences, 8,389 mouse sequences, and 3,985 rat sequences. Notably, as one might gather from these numbers, SwissProt has so far formally annotated only about 40% of the human proteome, 30% of the mouse proteome, and 15% of the rat proteome. This simply underlines the point made earlier that we still have a long way to go before we fully annotate the human (and mouse, and rat) genome.
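
The coverage figures above can be turned into a rough back-of-the-envelope calculation. The Python sketch below divides each SwissProt entry count by the quoted coverage fraction to obtain the proteome size those percentages imply, and also computes the mean entry length for the release. The percentages in the text are rounded, so the results are order-of-magnitude estimates only; the names `swissprot_counts` and `implied_proteome_size` are hypothetical.

```python
# Entry counts and approximate coverage fractions quoted in the text
# for SwissProt release 45.
swissprot_counts = {
    "human": (11_556, 0.40),
    "mouse": (8_389, 0.30),
    "rat": (3_985, 0.15),
}


def implied_proteome_size(annotated: int, coverage: float) -> int:
    """Proteome size implied by `annotated` entries at a given coverage."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in the interval (0, 1]")
    return round(annotated / coverage)


implied = {
    organism: implied_proteome_size(count, coverage)
    for organism, (count, coverage) in swissprot_counts.items()
}

# Mean length of a SwissProt entry: total residues / total entries.
mean_entry_length = 59_718_080 / 163_496

if __name__ == "__main__":
    for organism, size in implied.items():
        print(f"{organism}: ~{size:,} proteins implied")
    print(f"mean entry length: {mean_entry_length:.0f} residues")
```

All three organisms come out at roughly 27,000 to 29,000 proteins, and the average entry is about 365 residues long.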

A typical SwissProt entry contains information on the protein name, synonyms, gene name, taxonomy, references, probable function, subunit structure, cellular location, reactions, catalytic activity, polymorphisms, substrates/products, EC numbers, tissue specificity, induction, similarity, sequence features, membrane-spanning regions (if any), signal peptides (if any), sequence conflicts, sequence variants, 2D PAGE location(s), links to structure, links to sequence motifs, links to protein interactions or interacting partners, sequence, sequence length, and molecular weight. The SwissProt database is regarded by many as the gold standard for protein annotation and protein sequence information. Certainly, many scientists routinely use SwissProt to learn the latest molecular details about the 100+ different drug-metabolizing enzymes found in humans as well as other key information on many drug/xenobiotic targets and drug/xenobiotic transporters.

Despite their breadth and depth, Ensembl, EntrezGene, and SwissProt are sometimes a little frustrating to use. This is because they are designed to cover many different organisms for many different (nonmedical) users. To counter this problem, several organism-specific databases have recently emerged that are particularly useful to pharmaceutical and biomedical researchers. The GeneCards database (Rebhan et al., 1998) is perhaps the best biomedical/gene sequence resource for the human genome. It is primarily a Web-queryable database of human genes, their products, and their involvement in diseases. The GeneCards database is regenerated and updated every two to three months by data mining a large number of public databases, including SwissProt, Ensembl, and RefSeq. A standard GeneCard includes information on the gene’s official (HUGO) name, synonyms, gene IDs, UniGene cluster, cytogenetic locus, known SNPs, gene coordinates, the names of the gene products, their cellular functions, gene expression graphs, similarities with other proteins, involvement in diseases, orthologs to mouse genes, relevant references, and a list of disorders and mutations in which the gene is involved. These data are clearly relevant to any researcher working in the areas of pharmacogenomics, drug metabolism, drug toxicity, or drug assessment.

While GeneCards is quickly becoming the primary database for human genome data, two other specialized databases are emerging as key resources for mouse and rat genomics. Both the Mouse Genome Database (MGD; Blake et al., 2003) and the Rat Genome Database (RGD; Twigger et al., 2002) offer comprehensive sequence, strain, mutation, and phenotypic information about lab strains of mice and rats. The MGD, in particular, includes information on mouse genetic markers, molecular clones (probes, primers, and YACs), phenotypes, comparative mapping data, graphical displays of linkage, cytogenetic and physical maps, and experimental mapping data, as well as strain distribution patterns for recombinant inbred strains. MGD also includes the Gene Expression Literature Index, which is a collection of references to the scientific literature reporting data on endogenous gene expression during mouse embryonic development. Very few sequence databases have such an extensive linkage between genotypic and phenotypic information, and this makes the MGD particularly useful for researchers conducting metabolic or pharmacogenomic studies.

Two other lesser-known but notable sequence databases are MAGPIE (Gaasterland and Sensen, 1996) and SymAtlas (see Table 1). MAGPIE, or the Multi-purpose Automated Genome Project Investigation Environment, provides extensive sequence data, sequence graphics, and sequence annotation for all publicly available genomes (including human, mouse, and rat). Fortunately, MAGPIE presents each genome individually, so one can still choose to be very organism-specific with regard to any queries or data retrieval. In addition to the usual organism-specific gene sequence data, MAGPIE provides a putative description of each gene, sequence comparison or alignment results, graphical summaries, links to other databases, taxonomic data, lists of homologues or orthologues, annotated metabolic pathway diagrams, and gene ontology (GO) assignments. The annotated metabolic pathways are unique to MAGPIE and are certainly relevant to those interested in studying xenobiotic or general biotic metabolism. MAGPIE has recently been linked to the BLUEJAY genome browser to allow much more interactive viewing and browsing of MAGPIE’s extensive genomic data.

SymAtlas is a GeneCards-like database developed by the Genomics Institute of the Novartis Research Foundation. SymAtlas supports queries over several different genomes (human, mouse, rat, and other metazoans), but its primary focus is on human data. While generally providing less comprehensive annotation data than GeneCards, SymAtlas is notable in that it provides Affymetrix (U133A) gene chip expression data for many gene entries covering nearly 80 different human tissues or cell types. Because these data were collected by a single lab on a uniformly prepared set of tissues, the resource offers a unique opportunity for drug researchers to look at the tissue-specific gene expression variations for many important genes involved in xenobiotic metabolism.

Even with their organism-specific focus, GeneCards, MGD, RGD, MAGPIE, and SymAtlas are sometimes still too general for many pharmacogenomic and drug metabolism researchers. As a result, a number of much more specialized sequence databases have emerged over the past few years that cater specifically to the needs of those studying drug metabolism and drug responses. In particular, researchers have begun to compile sequence data on both drug-metabolizing enzymes and drug transporters, along with data on their substrates, inhibitors, and physiological properties. As most readers may recall, there are essentially two different classes of drug-metabolizing enzymes, the phase I enzymes and the phase II enzymes. For humans, these two classes alone contain nearly 30 different protein families. Phase I enzymes are involved in xenobiotic oxidation, reduction, hydrolysis, and other chemical transformations. Phase II enzymes are involved in the conjugation of xenobiotic substrates via glucuronidation, sulfation, acetylation, or methylation to increase solubility and facilitate elimination. Cytochrome P450 (CYP) enzymes constitute the quantitatively most important group of phase I enzymes (Vermeulen, 2003). Given the importance of CYP enzymes in drug metabolism and drug toxicity, it is not surprising to find that a number of CYP-specific sequence and sequence annotation databases have been developed over the past few years.
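
The phase I/phase II distinction described above lends itself to a simple lookup. The Python sketch below encodes only the reaction types named in the text; the mapping is deliberately incomplete (phase I also covers the “other chemical transformations” mentioned above), and the names `PHASE_BY_REACTION` and `metabolic_phase` are hypothetical, chosen for illustration rather than taken from any existing tool.

```python
# Reaction types named in the text, mapped to their metabolic phase.
# This is an illustrative subset, not an exhaustive classification.
PHASE_BY_REACTION = {
    "oxidation": "phase I",
    "reduction": "phase I",
    "hydrolysis": "phase I",
    "glucuronidation": "phase II",
    "sulfation": "phase II",
    "acetylation": "phase II",
    "methylation": "phase II",
}


def metabolic_phase(reaction: str) -> str:
    """Return 'phase I', 'phase II', or 'unknown' for a reaction name.

    Matching ignores case and surrounding whitespace."""
    return PHASE_BY_REACTION.get(reaction.strip().lower(), "unknown")


if __name__ == "__main__":
    for name in ("Oxidation", "glucuronidation", "dealkylation"):
        print(f"{name}: {metabolic_phase(name)}")
```

Returning "unknown" for unlisted reactions, rather than guessing, keeps the sketch honest about the limits of the table it is built from.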

The CYPAlleles database (Ingelman-Sundberg, 2002) is among the most important of these CYP databases. It contains an up-to-date compilation of cytochrome P450 allelic variants and the recommended nomenclature for all known human polymorphic CYP genes. The intent of this unique and important resource is to standardize CYP nomenclature and to avoid “home-made” allelic designations that can confuse the nomenclature system and the scientific literature. Another very useful cytochrome P450 resource is the Directory of P-450 containing systems (Fabian and Degtyarenko, 1997). While not human specific, this comprehensive listing provides links, tables, alignments, structures, three-dimensional (3D) images, EC numbers, OMIM (Online Mendelian Inheritance in Man) links, and descriptions and summaries of cytochrome P450 and P450-containing systems. A third cytochrome P450 resource that is particularly useful is a table of cytochrome P450 isoforms and their drug substrates (see Table 1). This is maintained by David Flockhart, who chairs the division of clinical pharmacology at Indiana University. While not exactly a sequence database, this cytochrome P450 resource provides a list of drugs that are metabolized by specific cytochrome P450 isoforms along with information on each isoform’s known inhibitors, inducers, and genetic influences. Most drug names in this table are also hyperlinked to specific literature references in PubMed.

While there appear to be no specific sequence databases for phase II enzymes, there are several sequence databases of note pertaining to drug or xenobiotic transporters. Transporters are known to have significant effects on drug metabolism, drug distribution, and the development of adverse drug reactions. Among the more useful and pharmaceutically relevant transporter databases is the Human Membrane Transporter Database (Yan and Sadee, 2000). This nicely annotated database provides information on membrane transporter sequences, types, chromosomal locations, tissue localization, variants, 3D structures, associated diseases, and, most importantly, the drug substrates of many membrane transporters. Two other lesser-known but useful transporter sequence/annotation databases exist: the Transporters Page and the Human ATP Binding Cassette (ABC) Transporters database (Table 1). The Transporters Page, which is maintained by Groningen University Hospital in the Netherlands, describes the name, sequence, chromosomal location, animal association, tissue expression, and tissue localization of many transporters in the liver, intestine, and kidney. It is essentially an updated version of a table that originally appeared in Müller and Jansen’s review of membrane transporters (Müller and Jansen, 1997). On the other hand, the ABC Transporters database is a relatively small database that describes the sequence, sequence links, sequence location, phenotype, animal association, tissue location, regulation, and function/substrates of all 49 known ATP-binding cassette transporters. It is designed to be a general resource for ABC transporters, but it obviously has some value for many pharmaceutical researchers.
332
SNP and Mutation Databases
Another class of sequence databases of growing importance in pharmaceutical (especially pharmacogenomic) research comprises the SNP (single nucleotide polymorphism) and mutation databases. SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C, or G) in the genome sequence is altered. For a variation to be considered a SNP, it must occur in at least 1% of the population. If the frequency is less than 1% (a somewhat arbitrary cutoff), the variation is called a mutation. SNPs account for about 90% of all human genetic variation and are believed to occur every 100 to 300 bases along the 3-billion-base human genome. Approximately 5 million of the ~10 million human SNPs have so far been cataloged. SNPs may occur in exons (coding regions), introns (noncoding regions between exons), and intergenic regions (regions between genes). They may lead to coding or amino acid sequence changes (nonsynonymous), or they may leave the amino acid sequence unchanged (synonymous). The vast majority of SNPs occur in intergenic regions. A common misperception is that only SNPs in coding regions are biologically relevant. In fact, SNPs in noncoding regions may play a greater role in expression levels, splicing variants, and, ultimately, gene product activity than SNPs in coding regions. Therefore, researchers should always take care to consider SNPs upstream, downstream, and in intronic regions of their gene of interest.
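The two distinctions drawn above (SNP versus mutation by population frequency, and synonymous versus nonsynonymous by coding effect) can be illustrated with a minimal Python sketch. The function names and example codons are hypothetical and chosen only for illustration; the 1% cutoff is the convention stated in the text, and the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_by_frequency(allele_frequency, threshold=0.01):
    """Label a variant 'SNP' if it reaches the (somewhat arbitrary) 1% cutoff."""
    return "SNP" if allele_frequency >= threshold else "mutation"


def classify_coding_change(codon, position, new_base):
    """Compare the amino acid encoded before and after a single-base change.

    `position` is the 0-based index of the altered base within the codon.
    """
    variant = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[variant]:
        return "synonymous"
    return "nonsynonymous"


if __name__ == "__main__":
    print(classify_by_frequency(0.05))           # common variant
    print(classify_by_frequency(0.002))          # rare variant
    print(classify_coding_change("CTT", 2, "C"))  # Leu -> Leu
    print(classify_coding_change("GAG", 1, "T"))  # Glu -> Val
```

Note that this sketch applies only to variants in coding regions; as the text emphasizes, variants in introns or upstream and downstream regions cannot be assessed this way and may still be biologically relevant.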
SNPs are highly relevant to drug metabolism (Jaja, 2003; Meisel et al., 2003). For instance, SNPs and mutations in drug-metabolizing enzymes are known to lead to large differences in drug responses and drug exposure between individual subjects. Some of these responses can prove to be quite toxic, even fatal (Lindpaintner, 2003; Meisel et al., 2003). These adverse drug reactions (ADRs) now account for up to 50% of the reported clinical trial failures for IND (Investigational New Drug) applications (Yu and Adedoyin, 2003). Furthermore, among approved drugs, ADRs now rank as one of the top ten leading causes of death and illness in the developed world (Lazarou et al., 1998). The direct medical costs of ADRs are estimated to be US$30–130 billion annually in the United States alone, with drug-related mortality claiming an estimated 218,000 lives annually (Lazarou et al., 1998; White et al., 1999). ADRs are particularly problematic for children, with ~15% of all pediatric hospitalizations leading to ADRs. In fact, it is estimated that approximately 26,500 children in the United States die each year from ADRs (Impicciatore et al., 2001; Lazarou et al., 1998).
Genetic factors such as SNPs and related mutations contribute to an estimated 50% of ADRs and account for 20–95% of the variability in drug responses (Classen et al., 1997; Kalow et al., 1998). SNPs and mutations in phase I metabolizing enzymes (cytochrome P450s), phase II enzymes, drug transporters (such as ABC transporters), and even many drug targets are now known to significantly affect drug ADMET properties and frequently lead to ADRs (Meisel et al., 2003; Riley and Kenna, 2004). The importance of identifying and archiving SNPs related to ADRs and ADMET complications motivated a number of pharmaceutical companies to start the SNP Consortium in 1999 (Masood, 1999). This effort, combined with other systematic genome sequencing efforts by public centers around the world, has led to the identification of more than 5.1 million human SNPs and the establishment of several key public SNP databases.
The oldest database is the SNP Consortium database (TSC). This archival, Web-queryable resource, which was last updated in September 2003, contains 1.8 million human SNPs. It includes not only SNP sequence data but also allele frequency and genotype data, as well as fully downloadable tables in MySQL format (Thorisson and Stein, 2003). A graphical genome browsing interface shows SNPs mapped onto the genome assembly in the context of externally available gene predictions and other features. The TSC has relatively limited querying capabilities, which somewhat reduces its overall utility. Two other archival SNP databases exist: JSNP (the Japanese SNP database, which catalogs SNPs in the Japanese population) and the Human Genome Variation database (HGVbase, based in Sweden). Both continue to be curated but are no longer accepting submissions (Fredman et al., 2004). Almost all SNP data are now being deposited into the dbSNP database, which is maintained at the NCBI (Wheeler et al., 2004). dbSNP is the primary SNP database and serves as a central repository for both SNPs and short deletion and insertion polymorphisms for more than a dozen organisms, including humans. Interestingly, dbSNP uses a looser "variation" definition for SNPs than is usually adopted, so there is no requirement or assumption about a minimum (>1%) allele frequency. The dbSNP database can be queried via Locus ID, gene name or symbol, gene product, accession number, or gene ontology (GO) terms. dbSNP queries (for a given gene) return a plethora of information about the number of SNPs in the gene, the gene's mRNA orientation, links to the mRNA sequence, links to the protein sequence, validation status, links to OMIM and 3D structure (if it exists), gene function, heterozygosity, SNP codon position, and SNP amino acid position. The dbSNP database currently contains 10 million human SNP records, of which 5.1 million are nonredundant, 1.8 million are validated, and 1 million have some allelic frequency information.
It is important to emphasize that SNP data are still very rough and that the data do not have the same reliability as genomic reference sequence (RefSeq) data. In some respects, the SNP field is roughly where the human genome sequencing field was in the mid-1990s. Some estimates suggest that more than 40% of the lower-frequency SNPs (<10%) are not valid or are mis-mapped (Jiang et al., 2003). Nevertheless, even in their relatively incomplete form, SNP databases still provide considerable insight into the possible reasons for variations in metabolism and phenotypic responses to drugs. This is certainly why SNP databases are generating so much excitement in the field of pharmacogenomics (Jaja, 2003; Jiang et al., 2003; Lindpaintner, 2003).
Besides dbSNP and HGVbase, several other integrated human SNP databases with powerful querying functions exist. A particularly useful SNP database and SNP query engine is SNPper (Riva and Kohane, 2004). This data-rich resource combines SNP data and validation status taken from dbSNP with allele frequency data from the SNP Consortium database; gene position, chromosome location, and coding data (exon, intron, intergenic) from GoldenPath; gene ontology information from the GO (Gene Ontology) database (Ashburner and Lewis, 2002); and protein domain information from SwissProt, along with links to OMIM and PubMed. Additional data processing allows SNPper to provide possible or probable coding changes, coding position changes, and amino acid changes. SNPper supports a variety of queries based on gene names (HUGO name or GenBank, LocusLink, OMIM, or UniGene identifiers), chromosome/genetic position, or GO classification terms. SNPper also supports a variety of textual outputs (for single or multiple SNPs) and provides a Java applet that can be used to display its SNPsets in colorful, graphical form. To prevent excessive downloads, protracted queries, or other abuses of the system, SNPper normally requires user registration. This registration process can be somewhat discouraging to first-time visitors.
Another powerful SNP database and searching tool is PolySearch (Polymorphism Search, Table 1). This Web-accessible resource supports PubMed literature searches on drugs and disease genes, as well as gene name searches, SNP searches, mutation searches, and PCR-primer searches. Users may search through a variety of SNP databases (HGVbase, dbSNP, HGMD) and text resources (PubMed, lists of disease names/synonyms, drug names/synonyms). Each type of search may be performed independently (e.g., find all genes associated with tamoxifen; find all polymorphisms and mutations associated with genes adr1, drcl, and trxA; find PCR primers for gene sequences adr1, drcl, and trxA) or in a combined fashion (find all SNPs for all genes associated with breast cancer and design all necessary primers for subsequent SNP analysis). PolySearch uses a variety of techniques, including text-mining, Web-based screen-scraping, and primer design, to generate its results, which can be sent as an HTML hyperlinked table or presented in a Web-accessible relational (MySQL) database format. PolySearch is specifically designed to allow pharmaceutical researchers to explore the relationships among drugs, polymorphisms, diseases, and physiological responses. A screenshot of several PolySearch Web pages is given in Fig. 1.
SNP databases are not the only resources that attempt to track and capture polymorphic variation. In fact, long before polymorphisms and SNPs became the rage, genetics researchers were busy accumulating details about specific mutations in the human genome. In many respects, mutation information is much more reliable and far more precise than SNP data, because it is not normally collected in a high-throughput manner. Disease-causing mutations are usually identified and confirmed many times over before they are published. Furthermore, they are often independently confirmed by other labs before they become widely accepted or deposited into databases. Mutation databases tend to capture different kinds of polymorphisms than SNP databases, so there is relatively little overlap between the two. Therefore, in searching for genetic causes for a particular phenotype, it is always wise to look at both SNP and mutation databases.
Two important human mutation databases exist: OMIM and HGMD. Online Mendelian Inheritance in Man (OMIM) is a superbly researched encyclopedic resource containing genetic (cloning, gene function, gene structure, mapping), phenotypic, historical, and clinical data on more than 5,500 genetic and/or metabolic disorders (Wheeler et al., 2004). Of these, nearly 400 diseases have specific gene sequences associated with them, and for another 1,600 disorders, there is a solid understanding of their molecular origins. Many of the key mutations in OMIM are described in detail and are linked to either sequence data or PubMed references. OMIM is searchable by title, text, clinical data, allelic variants, chromosome number, and citations. It is also fully downloadable. While not as encyclopedic as OMIM, the Human Gene Mutation Database (HGMD) is primarily a repository of sequence mutations found in well-studied human genes. The HGMD contains 42,521 mutations compiled for 1,657 genes, with 1,530 reference cDNA sequences. The database can be searched by disease, gene name, or gene symbol. A more complete description of HGMD is given by Stenson et al. (2003).
Figure 1. A screenshot montage of the DrugBank home page with several associated query or result display screens. DrugBank supports multiple types of online structure, text, and sequence queries and provides richly annotated data on drugs, drug properties, drug targets (genes/proteins), drug–protein interactions, and drug metabolism.
A continuing challenge for pharmaceutical researchers is to find ways of relating SNP or mutation data to specific metabolic responses or ADRs. Table 2 lists genes with known drug response phenotypes. This is by no means a complete list, as the number of genes implicated in drug metabolism and ADRs is continuously growing. However, from this subset of human genes, it is possible to use any of the above-mentioned SNP/mutation databases to obtain specific polymorphism data for these genes. From there, a researcher may use any of the more specialized sequence databases already described (i.e., the cytochrome P450 and membrane transporter databases) to obtain additional information about the drugs or substrates associated with these genes, as well as data (via PubMed links) about their phenotypic consequences. In this way, several existing and seemingly unrelated databases may be combined to allow one to learn much more about the genetic aspects of drug metabolism and ADRs.
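The workflow of combining a gene list with polymorphism records and substrate annotations can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the records below are invented placeholders standing in for exports from three separate resources, and the function name and record layout are hypothetical rather than the schema of any real database.

```python
# Placeholder data standing in for three unrelated resources (all hypothetical).
DRUG_RESPONSE_GENES = ["CYP2D6", "ABCB1", "VDR"]  # symbols from a list such as Table 2

VARIANT_RECORDS = [  # stand-in for a SNP/mutation database export
    {"gene": "CYP2D6", "variant_id": "var-001", "region": "exon"},
    {"gene": "CYP2D6", "variant_id": "var-002", "region": "intron"},
    {"gene": "ABCB1", "variant_id": "var-003", "region": "exon"},
]

SUBSTRATE_ANNOTATIONS = {  # stand-in for a specialized enzyme/transporter database
    "CYP2D6": ["drug A", "drug B"],
    "ABCB1": ["drug C"],
}


def link_genes_to_variants_and_drugs(genes, variant_records, substrates):
    """Join the three sources on gene symbol and return one report per gene."""
    report = {}
    for gene in genes:
        report[gene] = {
            "variants": [r["variant_id"] for r in variant_records if r["gene"] == gene],
            "substrates": substrates.get(gene, []),
        }
    return report


if __name__ == "__main__":
    combined = link_genes_to_variants_and_drugs(
        DRUG_RESPONSE_GENES, VARIANT_RECORDS, SUBSTRATE_ANNOTATIONS
    )
    for gene, info in combined.items():
        print(gene, info)
```

The gene symbol serves as the join key here because it is the one identifier the text assumes all of the databases share; in practice, symbol synonyms and identifier mismatches would need to be reconciled first.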
Metabolism and Metabolic Pathway Databases
Small molecules account for more than 99% of all FDA-approved drugs and still constitute the vast majority of all IND applications. Interestingly, most small-molecule drugs are designed or chosen to mimic existing small-molecule metabolites. It is also important to remember that most drugs, prodrugs, or xenobiotics are either targeted to, or metabolized by, preexisting metabolic pathways that originally evolved to handle endogenous metabolites. Therefore, the selection of successful drug candidates or drug targets, along with an understanding of their metabolic effects and fates, is highly dependent on our understanding of an organism's (both target and host) general metabolism.
Table 2. Human genes (with HUGO gene symbols) associated with known drug response phenotypes, as identified in the PharmGKB database.
Symbol   Gene product                  Symbol   Gene product
ABCB1    ATP-binding cassette B1       KCNA5    K+ voltage-gated channel
ABCC1    ATP-binding cassette C1       MTHFR    methylenetetrahydrofolate reductase
ABCC2    ATP-binding cassette C2       NME2     non-metastatic cells 2, prot.
ABCC4    ATP-binding cassette C4       NOS3     nitric oxide synthase 3
ABCG2    ATP-binding cassette G2       NP       nucleoside phosphorylase
ADA      adenosine deaminase           NPR2     natriuretic peptide receptor B
ADCY3    adenylate cyclase 3           NQO1     NAD(P)H dehydrogenase 1
ADCY7    adenylate cyclase 7           NR1I2    nuclear receptor 1I2
ADCY8    adenylate cyclase 8           PAPSS1   PAPsulfate synthase 1
ADCY9    adenylate cyclase 9           PDE4B    phosphodiesterase 4B
ADRB1    adrenergic, beta-1, receptor  POLA     DNA polymerase, alpha
ADRB2    adrenergic, beta-2, receptor  POLE     DNA polymerase, epsilon
AK2      adenylate kinase 2            POLR2C   RNA polymerase II
ATIC     ribo-IMP cyclohydrolase       PPAT     phosphor. amidotransferase
ATP5J    ATP synthase, subunit F6      PRPS2    phosphoribosyl PP synthetase
CES1     carboxylesterase 1            RRM1     ribonucleotide reductase M1
CES2     carboxylesterase 2            SHAPY    Ca-dependent diphosphatase
COMT     catechol-O-methyltransferase  SLC19A1  folate transporter, member 1
CYP2B6   cytochr. P450 subfamily B6    SLC22A1  organic cation transporter 1
CYP2C19  cytochr. P450 subfamily C19   SLC22A2  organic cation transporter 2
CYP2C9   cytochr. P450 subfamily C9    SLC29A1  nucleoside transporter 1
CYP2D6   cytochr. P450 subfamily D6    SULT1A1  sulfotransferase family 1A1
CYP3A    cytochr. P450 subfamily A     SULT1A3  sulfotransferase family 1A3
CYP3A4   cytochr. P450 subfamily A4    UGT1A1   UDP glycosyltransferase 1A1
CYP3A5   cytochr. P450 subfamily A5    UGT1A6   UDP glycosyltransferase 1A6
DCK      deoxycytidine kinase          UGT1A9   UDP glycosyltransferase 1A9
ESR1     estrogen receptor 1           UGT1A    UDP glycosyltransferase 1A
FHIT     fragile histidine triad gene  UGT2B7   UDP glycosyltransferase 2B7
GGH      gamma-glutamyl hydrolase      VDR      vitamin D receptor
GSTP1    glutathione S-transferase     XDH      xanthine dehydrogenase
GUCY1A3  guanylate cyclase 1, alpha 3  XRCC1    x-ray repair gene
HPRT1    hypoxanth. PO4-transferase 1
Given the importance of host/target metabolism in drug development and assessment, it is little wonder that more drug researchers are finding that metabolic pathway databases can play a role in drug research programs (Fairlamb, 2002; Karp et al., 1999). These blended bioinformatics–cheminformatics databases can provide detailed, organism-specific information (proteins, pathways, chemical structures, etc.) about primary and secondary metabolism for hundreds of different organisms and thousands of different compounds. Unfortunately, most general metabolism databases contain relatively little information about xenobiotic or drug metabolism. Instead, xenobiotic fates or putative xenobiotic targets must often be inferred from their chemical similarity to existing endogenous metabolites. Normally this metabolic inference must be done manually, but more recently, several groups have shown that it can be automated (Akutsu, 2004; Hou et al., 2004; Jaworska et al., 2002; McShan et al., 2004). McShan et al. demonstrated the potential of automated inference using a well-known metabolic database called KEGG (see Table 1). These workers employed a two-part symbolic computational method in which the biotransformation rules were first inferred from molecular graphs of known compounds and pathways. These rules were then recursively applied to different compounds to generate novel metabolic networks containing new biotransformations and new metabolites. They tested their approach on two "drugs," ethanol and furfuryl alcohol, that were not in their original dataset and demonstrated that the inferred metabolic fates matched well with the known literature.
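The second step of such a method, recursively applying biotransformation rules to a starting compound to grow a metabolic network, can be sketched as follows. This is a deliberately simplified toy, not the algorithm of McShan et al.: the rules here are hand-written rather than inferred from molecular graphs, and each compound is reduced to a hypothetical (scaffold, functional group) pair instead of a full structure.

```python
# Hand-written toy rules: functional group -> list of (reaction name, new group).
# A real system would infer such rules from known compounds and pathways.
RULES = {
    "CH2OH": [("alcohol oxidation", "CHO")],
    "CHO": [("aldehyde oxidation", "COOH")],
}


def expand(compound, rules, network=None, seen=None):
    """Recursively apply rules to `compound`, collecting (substrate, reaction, product) edges."""
    if network is None:
        network, seen = [], set()
    if compound in seen:  # guard against cycles in the rule set
        return network
    seen.add(compound)
    scaffold, group = compound
    for reaction, new_group in rules.get(group, []):
        product = (scaffold, new_group)
        network.append((compound, reaction, product))
        expand(product, rules, network, seen)
    return network


if __name__ == "__main__":
    ethanol = ("CH3", "CH2OH")
    furfuryl_alcohol = ("furan-2-yl", "CH2OH")
    for start in (ethanol, furfuryl_alcohol):
        for substrate, reaction, product in expand(start, RULES):
            print(substrate, "--", reaction, "->", product)
```

Under these two rules, each alcohol is carried through an aldehyde to a carboxylic acid, which mirrors the familiar oxidative route for primary alcohols. The `seen` set matters once rules can regenerate earlier compounds, since recursion would otherwise not terminate.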
Perhaps the most widely known metabolic pathway resource is the Kyoto Encyclopedia of Genes and Genomes, or KEGG (Kanehisa et al., 2004). KEGG has been under development at the Kanehisa lab at the Institute for Chemical Research in Kyoto, Japan, since 1995. It contains genomic, chemical, and network/pathway information for more than 1 million genes in nearly 250 organisms, including 11,503 chemical compounds and 6,304 reactions (at last count). The chemical compound database contains chemical structures of most known metabolic compounds as well as a limited number of pharmaceutical and environmental compounds. All chemical structures are manually entered, computationally verified, and continuously updated. KEGG also contains 235 manually drawn and fully annotated organism-specific pathways or wiring diagrams for metabolism, gene signaling, and protein interactions. Users may browse or query the KEGG database in any number of ways. To access its metabolic pathway data, a user can select the PATHWAY database and choose from several hundred metabolic and catabolic processes/pathways. Clicking on these hyperlinked names will send the user to a hyperlinked image describing the pathway and containing additional hyperlinks to compounds and protein/enzyme data or structures. Through KEGG's DBGET interface, users can query for specific compounds, reactions, or proteins by their names or synonyms. A much more complete description of KEGG and its contents can be found in an article by Kanehisa et al. (2004) and references therein.
An equally comprehensive metabolic database or encyclopedia is MetaCyc, developed by Peter Karp at the Stanford Research Institute (Krieger et al., 2004). MetaCyc is a Web-accessible database containing nonredundant, experimentally elucidated metabolic pathways from more than 240 different organisms gathered from the scientific literature. It stores visual, chemical (structure), and textual information for primary metabolism and secondary metabolism, as well as associated compounds, enzymes, and genes. MetaCyc supports cross-organism metabolic queries or organism-specific queries, including humans (H. sapiens) and several pathogenic microorganisms (E. coli O157, B. anthracis, H. pylori, P. falciparum, etc.). After choosing an organism or set of organisms, users can query MetaCyc by the name of a protein, gene, reaction, pathway, or chemical compound, or by EC (Enzyme Commission) number. It is also possible to browse the MetaCyc database if one is not able to frame a specific query, or to search its databases using a sequence query via BLAST (Altschul et al., 1997). Most MetaCyc queries or browsing operations return a rich and colorful collection of hyperlinked figures, pathways, chemical structures, reactions, enzyme names, references, and protein/gene sequence data. MetaCyc is somewhat more targeted to biomedical applications than KEGG. Its capacity to select organism-specific pathways probably gives it an important advantage over KEGG for most pharmaceutical applications.
A third type of metabolic database is the ExPASy biochemical pathways database (Table 1). This unique database provides a queryable Web interface to the famous Roche Biochemical Pathways chart. Users may type in compound names (or parts of names), or they may select zoomable thumbnail/subsection views of the chart. In this way, it is possible to interactively and visually explore metabolic pathways or cellular and molecular processes on a desktop computer. All compound queries provide hyperlinked lists to EC numbers (and the corresponding enzyme information) and to the specific location of these compound(s) on the Roche wall chart. While not as comprehensive or current as KEGG or MetaCyc, the Roche chart is still considered to be the gold standard for most descriptions of metabolic (primarily human) pathways.
In addition to these three major pathway resources, there are a number of smaller or lesser-known metabolism databases. These include the Main Metabolic Pathways Web page, The Metabolic Pathways of Biochemistry page, and The Medical Biochemistry Web page (Table 1), all three of which provide nicely illustrated, modestly hyperlinked or annotated reaction diagrams for most standard metabolic processes. While not as broadly useful for studying xenobiotic metabolism as KEGG or MetaCyc, these pages could serve as an inspiration for a Web-savvy developer to create a comparable resource for drug metabolism.
Yet another potentially useful database, one that caters specifically to xenobiotics, is the University of Minnesota Biocatalysis/Biodegradation Database (Ellis et al., 2003). The UM-BBD focuses primarily on microbial biocatalytic reactions and biodegradation pathways for a wide variety of xenobiotic compounds important to biotechnology and general chemistry. At last count, this unique resource contained 142 pathways, 930 reactions, 875 compounds, 590 enzymes, and 250 biotransformation rules determined for more than 340 different microbes. The database may be queried by compound, enzyme, microorganism, pathway, biotransformation name, chemical formula, chemical structure, CAS registry number, or EC code. While the UM-BBD is primarily targeted to environmental chemists, the similarity of many drug compounds to the UM-BBD compounds and the importance of microbes both as drug targets and as facilitators of drug metabolism in humans (and in the environment) suggest that this database could be of some use to at least a few pharmaceutical researchers.
Drug Metabolism and Drug Interaction Databases
While general metabolism and pathway databases are playing an increasingly important role in drug development and assessment, another class of databases is emerging that is probably much more relevant to pharmaceutical researchers: the drug metabolism and drug interaction databases. These databases focus much more directly on known drugs or drug metabolites and attempt to link the genomic/proteomic information being gathered about the relevant genes or proteins with the drug compounds themselves. As a rule, drug metabolism/interaction databases are highly curated, so they tend to be much smaller and less comprehensive than the archival sequence, SNP, and metabolism databases already described. This is partly because much of the information needed by these databases is not electronically accessible; in many cases, it still resides in monographs, books, FDA filings, and journal articles. It is also because this kind of data is relatively difficult to interpret and compile, as it requires detailed knowledge from many diverse disciplines (chemistry, pharmacy, pharmacology, pharmacogenomics, bioinformatics, cheminformatics, biochemistry, proteomics, genomics, etc.).
Several commercial drug metabolism or drug interaction databases now exist, including products offered by MDL Information Systems, the University of Washington (Seattle), and Lhasa Limited, a not-for-profit company based at the University of Leeds. MDL maintains and sells the MDL METABOLITE system and the MDL TOXICITY database. The METABOLITE system consists of a database, a registration or data-sharing system, and a browser (Table 1). The database covers xenobiotic transformations abstracted from 1991 onward from the top 60 journals covering metabolism and biotransformation. It also includes the complete collection of metabolic schemes abstracted from Biotransformation of Drugs and Pharmacokinetics from 1901 onward. The database, which is updated twice a year, consists primarily of medicinal compounds along with a variety of other agricultural or environmental xenobiotics. The current version holds more than 9,000 parent compounds and nearly 60,000 transformations. In addition to this structural information, METABOLITE contains enzyme information (name, EC number), species data, physiological activity, parent compound toxicity, bioavailability, route of administration, excretion routes, CAS registry numbers of parent compounds, and references to the original literature. The METABOLITE browser provides a graphical interface to the database, allowing structural searches and visual displays of metabolic transformations, parent compound structures, and degradation products. The MDL TOXICITY database is similar in concept to METABOLITE in that it also supports graphical searching of chemical structures and contains extensive experimental parameters such as species tested, doses, routes of administration, toxic effects, and full references. The database is updated quarterly and contains the complete contents of the Registry of Toxic Effects of Chemical Substances (150,000+ compounds), of which two-thirds are drugs or drug leads.
The University of Washington maintains and licenses the Metabolism and Transport Drug Interaction Database, or DIDbase (Levy et al., 2003). DIDbase is a Web-based relational database designed to evaluate the interaction potential of drugs in development. It is updated monthly and currently includes information from more than 4,000 PubMed articles concerning in vivo and in vitro drug interaction studies in humans. DIDbase contains the details (therapeutic range, PK parameters, kinetic parameters, Vmax data, AUC data, dosage parameters, dosage intervals, administration method, experimental parameters, etc.) extracted and manually verified from thousands of pharmaceutical research papers. Users may query the database using up to 40 categories, including drug name, target protein, therapeutic class, precipitant/inhibitor, enzyme names, genotype, and experiment type (in vitro, in vivo). Selected information from each relevant article in the database is displayed according to original schemes, which highlight study conditions as well as study results. DIDbase, while rich in content, is very much a text-driven database and still lacks the flash, graphics, and style of more established commercial databases.
Lhasa produces and distributes the VITIC (formerly DEREK) toxicity database (Greene et al., 1999). This database is designed to allow users to predict and assess the toxicological effects of more chemicals while reducing the need for animal testing. It is based on data collected by the ILSI Health and Environmental Sciences Institute (HESI) from numerous public sources (the U.S. Food and Drug Administration [FDA], the Environmental Protection Agency [EPA], and existing publicly available literature). The latest release of VITIC includes 59,800 in vitro genetic toxicity data records for 2,130 chemicals, 156 hepatotoxicity data records for 41 chemicals, and 2,700 skin sensitization data records for 830 chemicals. The VITIC database is fully structure-searchable, supporting exact structure, substructure, CAS number, and chemical formula queries. Users may also deposit their own toxicological data from a variety of formats and store it in a format suitable for structure-activity prediction or modeling.
In addition to these commercial endeavors, several public drug interaction or drug metabolism databases are also starting to appear. These include PharmGKB (Klein and Altman, 2004) and DrugBank (Table 1). PharmGKB (the PharmacoGenomics Knowledge Base) is an integrated resource designed to archive experimental and literature data on drug–gene or drug–gene product interactions (i.e., pharmacogenomics). It is also one of the first databases devoted to relating human phenotypes (related to drug response) to genotype information. PharmGKB now includes extensive data on drug metabolism genes, drugs, diseases, and drug pathways, all of which are linked to each other and to several primary datasets collected by the NIH Pharmacogenomics Research Network. The database contains partial information on 15,195 human genes, 4,674 drugs, and 4,070 diseases. However, at this stage in the database's development, only about 200 drugs, genes, or diseases have specific genotype/phenotype associations. For those PharmGKB entries with phenotype/genotype relations, the database also provides nicely curated sets of literature references as evidence for the association of genes with drugs and diseases. PharmGKB also maintains 12 nicely illustrated drug metabolism/action pathways describing the molecular details and interacting partners for such drug classes as beta blockers, statins, and glucocorticoids. The popularity of the database has grown considerably over the past 4 years; PharmGKB now gets more than 20,000 visitors per month. PharmGKB primarily focuses on human genes and their polymorphisms and phenotypes, although efforts are now underway to expand this to mouse and rat studies. Because it contains human clinical and anonymized patient information (obtained from its collaborators in the NIH Pharmacogenomics Research Network), PharmGKB normally requires user registration to access some of its data in order to maintain subject privacy and confidentiality. Much more detailed descriptions of PharmGKB and its potential applications to drug development and drug assessment are given in the literature (Hewett et al., 2002; Oliver et al., 2002).
DrugBank (Fig. 2) is a freely available, fully downloadable drug interaction and drug metabolism database. It contains detailed information about 250 of the most frequently prescribed FDA-approved drugs (chemical structure, common and chemical names, 3D structure coordinates, drug/chemical class, solubility, indication, pharmacology, etc.), along with detailed information about their known protein targets (sequence, 3D coordinates, cellular localization, interacting partners, etc.), expected or measured toxicity, metabolic fate, and known metabolizing enzymes. On average, each drug entry contains nearly 50 data fields, with about 20% of these fields providing clickable hyperlinks to graphical views or external databases. DrugBank also supports a wide range of querying options, including simple browsing or sorting (by name, accession number, molecular weight, CAS number, category, or indication), text querying, sequence searching (via local BLAST search), chemical structure querying (via a structure-drawing applet), and relational category querying. Because the database is designed to support both sequence queries and chemical structure queries, DrugBank allows users to take a new protein or even a new proteome and scan those sequences against DrugBank's entire chemical database. In this way, protein homologues may be identified that may be targeted with existing drugs or existing metabolite analogues. Similarly, users may scan DrugBank with a new chemical structure or a library of structures to identify which protein targets these compounds might bind or which phase I metabolizing enzymes might act on them. In addition to these drug discovery and drug assessment applications, DrugBank can serve as an interactive teaching tool and as a sophisticated, openly accessible electronic reference on drug data and drug metabolism. At the time of this writing, DrugBank (like PharmGKB) is still under active development, and so new links, new fields, and new compounds are constantly being added and updated.

PREDICTIVE SOFTWARE TOOLS FOR ADMET

Over the past century, physics and chemistry have largely gone from purely observational disciplines to fully predictive sciences. Unfortunately, most life science disciplines are still observational or “nonpredictive.” The complexity of biological systems is such that simple rules and straightforward mathematical relationships are almost impossible to derive. This is now starting to change, thanks, in large part, to the rapid advances in computer technology and to the growing availability of very large sets of experimental biological data. As we have seen in the last section, a significant number of large and richly annotated databases are now available to almost any life science researcher (Apweiler et al., 2004; Oliver et al., 2002; Wheeler et al., 2004). These datasets are allowing bioinformaticians, cheminformaticians, and other IT professionals to apply statistical tools, data-mining techniques, and machine-learning algorithms to extract novel patterns, relationships, and parameters that are starting to make a number of life science disciplines (especially in pharmaceutical research) much more predictive.

Figure 2. A screenshot montage of the PolySearch home page with several associated query or result display screens. PolySearch supports multiple types of searching, including Medline disease/gene association searches, disease/drug association searches, and gene/SNP searches. Users may also use PolySearch to design PCR primers to sequence putative SNPs extracted from HGVbase or dbSNP sources.

The use of predictive software, particularly in drug metabolism and drug toxicology research, is also leading to the advent of in silico ADMET. In silico ADMET allows normally expensive and time-consuming bench tests to be done in seconds on a computer, at a tiny fraction of the cost (Beresford et al., 2004; Van de Waterbeemd and Gifford, 2003; Yu and Adedoyin, 2003). While in silico techniques are not yet as reliable as experimental methods, combining in silico methods with high-throughput in vitro methods is allowing drug metabolism/toxicity screening to be done much faster and much more efficiently. Given the high cost of late-stage drug trial failures and the high failure rate of new lead compounds (both usually due to ADMET issues), the mantra in drug discovery of “fail early–fail often” is particularly compatible with in silico ADMET (Yu and Adedoyin, 2003). In this section, we will provide a brief survey of some of the predictive software tools or in silico techniques that may be of particular interest to drug researchers doing ADMET studies. Specifically, we will focus on three areas: chemical property prediction, mechanistic ADMET prediction (modeling/docking), and empirical (or expert system) ADMET and biotransformation prediction. Web links for many of these programs and Web servers are given in Table 3.

Table 3. Predictive software (programs and servers) for ADMET and biotransformation.

Program name                    URL/Web address

Chemical or Physiological Property Prediction
Tripos Inc.                     http://www.tripos.com/
Accelrys                        http://www.accelrys.com/
CambridgeSoft                   http://www.cambridgesoft.com/
ACD Labs (iLab)                 http://www.acdlabs.com/ilab/
Actelion Property Explorer      http://www.actelion.com/
ChemSilico                      http://chemsilico.com/
Compumine                       http://www.compumine.se/adme/adme.jsp
Pre-ADME                        http://preadme.bmdrc.org/preadme/index.php
PASS                            http://www.ibmh.msk.su/PASS

Mechanistic ADMET Prediction
Protein Data Bank               http://www.rcsb.org/pdb/
DeepView                        http://www.expasy.org/spdbv/
Modeller                        http://salilab.org/modeller/modeller.html
WHATIF                          http://www.cmbi.kun.nl/gv/servers/WIWWWI/
Swiss-Model                     http://swissmodel.expasy.org//SWISS-MODEL.html
CPH-Models                      http://www.cbs.dtu.dk/services/CPHmodels/
SDSC1 Homology Model Server     http://cl.sdsc.edu/hm.html
Dock (molecular docking)        http://dock.compbio.ucsf.edu/
FlexX (molecular docking)       http://www.biosolveit.de/FlexX/
Glide (molecular docking)       http://www.schrodinger.com/Products/glide.html
GOLD (molecular docking)        http://www.ccdc.cam.ac.uk/products/life_sciences/gold/
GROMACS                         http://www.gromacs.org/
X-Plor                          http://xplor.csb.yale.edu/xplor/
AMBER                           http://xplor.csb.yale.edu/xplor/

Empirical Biotransformation Prediction
META                            http://www.multicase.com/products/prod05.htm
MexAlert                        http://www.compudrug.com/inside.php
MetabolExpert                   http://www.compudrug.com/inside.php
METEOR                          http://www.chem.leeds.ac.uk/LUK/meteor/

Chemical Property Prediction

Chemical properties refer to those physical/chemical properties or characteristics of a compound that relate to its structure or behavior in different solution conditions. These include electronic or charge distribution properties, preferred conformations, heats of formation, solubility, hydrophobicity, pKa, refractivity, melting point, length, area, volume, reactive groups, etc. Some of these properties, such as solubility, LogP, and charge, are highly relevant to understanding or predicting the activity, absorption, distribution, and metabolism of drug compounds (Hansch and Zhang, 1993; Hou and Xu, 2003). Many chemical properties can be predicted directly from a chemical structure using quantum mechanical methods, semiclassical molecular mechanics methods, or, more frequently, empirical/statistical methods (Lipinski et al., 2001). Chemical property prediction has been an integral part of many chemistry software packages for more than 30 years. Most of today’s commercial chemistry software vendors, such as ACD Labs, CambridgeSoft, Tripos, and Accelrys, offer some kind of chemical property prediction algorithm. However, many of these predictions are also freely available over the Internet through a variety of Web servers (Tetko, 2003; Van de Waterbeemd and De Groot, 2002). The vast majority of these servers employ statistical methods (Bayesian analysis, principal component analysis), neural nets, or other pattern recognition tools to predict ADMET properties. Most have been trained and validated on databases containing thousands of test compounds and properties.

ACD Labs, in addition to its commercial products, provides an Internet “virtual chemistry” service called I-lab (Table 3). I-lab allows users (guests or registered users) to do one-off predictions of nuclear magnetic resonance (NMR) spectra, IUPAC names, and phys–chem parameters using structures drawn with its structure-drawing applet. Users may choose from one of three pull-down menus (NMR, naming, or phys–chem) and can select from up to 20 different predictions. The phys–chem predictor offers pKa, LogP, logD, water solubility, boiling point, melting point, molar volume, refractive index, molecular weight, molecular volume, and a host of other property calculations. A key limitation of I-lab is that only one property can be calculated for one structure at a time. Furthermore, its relatively heavy use and its dependency on applet technology can sometimes lead to delays of up to a minute between mouse clicks and submissions.

The Actelion Property Explorer (search Google for “Actelion Property Explorer”) is a Web-enabled Java applet that allows users to draw chemical structures and then rapidly calculates various drug-related properties, including toxicity risks (mutagenicity, tumorigenicity, irritancy, and reproductive effect), solubility, logP, molecular weight, drug-likeness, and overall drug score. The Property Explorer uses a fragment-based approach to determine the drug-like and toxicity properties, and appears to score above 85% for most of the toxicity values when compared with the RTECS database. The predicted values are displayed numerically on the right side of the applet along with color-coded indicators showing whether the compound exhibits good (green), bad (red), or indifferent (yellow) properties. Hyperlinks provide more detail about the applet’s algorithms and comparative performance. The accessibility and ease of use of this applet are particularly appealing, but the lack of peer-reviewed assessment or detailed documentation concerning the quality of its algorithms is a bit worrisome.

ChemSilico is primarily a commercial provider of ADMET prediction software, but it also provides online calculation of a number of useful and interesting physicochemical (aqueous solubility, logD, pKa) and biological parameters, such as blood–brain barrier partition, plasma protein binding, genotoxicity prediction, and mutagenicity prediction (Table 3). The programs use neural net prediction methods and were developed and trained using data-mining methods to extract hundreds of topological and functional descriptors from 35,000 compounds. The site also contains additional information with detailed statistics about the performance of its different prediction modules. Users must register to use the site. Free software trials are also available.

Compumine is another company offering freely available, Web-accessible ADMET prediction for compounds (Table 3). Its server uses the company’s own Rule Discovery System (RDS), a machine-learning or machine induction method, to look for chemical fragment features and global properties to predict aqueous solubility (LogSol), Ames mutagenicity, and acute toxicity (pLC50). Users may submit a SMILES string or upload an “sdf,” “mol2,” “mol,” or “xyz” file of their compound of interest. The Web site is easy to use and provides a number of examples; however, details on the description and validation of its RDS methods are quite sparse.

Pre-ADME is another Web site that offers a wide range of ADME and toxicological property calculations for any given chemical compound (Table 3). Three classes of predictors are offered: a molecular descriptor calculator, a drug likeness predictor, and an ADME predictor. The molecular descriptor calculator can compute about 955 molecular properties, including constitutional, topological, physicochemical, and geometrical descriptors, many of which are needed for ADME prediction. The drug likeness predictor is very simple and uses Lipinski’s rules (Rule of Five) and lead-like rules in its predictions. The ADME predictor is quite distinctive and can predict permeability for Caco-2 cells, MDCK cells, and the BBB (blood–brain barrier), as well as HIA (human intestinal absorption), plasma protein binding, and skin permeability, using an artificial neural network. Users can draw input structures using ACD’s Structure Drawing Applet or upload compound files in “sdf” or “mol” file format. Users may choose to register, although the site also offers open/guest access without registration.
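
To make the drug likeness step concrete, the following is a minimal sketch of a Rule of Five check in Python. It is not Pre-ADME's implementation. The thresholds are the commonly quoted Lipinski cutoffs, the descriptor values are assumed to have been computed elsewhere, and the names `Descriptors`, `rule_of_five_violations`, and `is_drug_like`, along with the default tolerance of one violation, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Pre-computed descriptors for one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in daltons
    log_p: float            # octanol-water partition coefficient (LogP)
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def rule_of_five_violations(d: Descriptors) -> List[str]:
    """Return a description of each Rule of Five cutoff the compound exceeds."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.log_p > 5:
        violations.append("LogP > 5")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed policy: tolerate at most `max_violations` broken cutoffs."""
    return len(rule_of_five_violations(d)) <= max_violations


if __name__ == "__main__":
    # Made-up descriptor values, used only to exercise the functions.
    candidate = Descriptors(mol_weight=350.0, log_p=2.1,
                            h_bond_donors=2, h_bond_acceptors=5)
    print(rule_of_five_violations(candidate), is_drug_like(candidate))
```

A filter of this kind is cheap enough to run over an entire compound library before any of the more expensive predictors are applied.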

PASS (Prediction of Activity Spectra for Substances) is another Web server designed to predict the biological activity and ADMET for compounds on the basis of their structural formulae (Sadym et al., 2003). PASS employs a statistical or probabilistic (Bayesian) approach based on a training set of ~46,000 compound entries to predict nearly 1,500 different properties, including pharmacological effects, mechanisms of action, mutagenicity, carcinogenicity, teratogenicity, and embryotoxicity. The average accuracy of any given prediction, based on leave-one-out cross-validation, is about 85%. PASS is particularly appealing in that it is one of the few property prediction servers to provide a detailed description of its methods, validation protocols, test sets, and results. A published description is also available. PASS is freely accessible, but users must first register to use the system.
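
The combination of a Bayesian predictor with leave-one-out cross-validation can be illustrated with a small sketch. This is not the PASS algorithm, whose details are given in its own documentation; it swaps in a plain Bernoulli naive Bayes classifier with Laplace smoothing as a stand-in. The fragment labels, the six-compound toy set, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Compound = Tuple[Set[str], bool]   # (structural fragments present, is_active)


def train(data: List[Compound]) -> Dict[bool, Tuple[float, Dict[str, float]]]:
    """Estimate a class prior and per-fragment probabilities for each class."""
    vocab = sorted({frag for frags, _ in data for frag in frags})
    model = {}
    for label in (True, False):
        rows = [frags for frags, y in data if y == label]
        n = len(rows)
        prior = (n + 1) / (len(data) + 2)              # Laplace-smoothed prior
        probs = {f: (sum(f in r for r in rows) + 1) / (n + 2) for f in vocab}
        model[label] = (prior, probs)
    return model


def predict(model, frags: Set[str]) -> bool:
    """Return True if the 'active' class has the higher log-posterior."""
    scores = {}
    for label, (prior, probs) in model.items():
        s = math.log(prior)
        for f, p in probs.items():
            s += math.log(p if f in frags else 1.0 - p)
        scores[label] = s
    return scores[True] > scores[False]


def leave_one_out_accuracy(data: List[Compound]) -> float:
    """Hold out each compound in turn, train on the rest, and score the hit rate."""
    hits = 0
    for i, (frags, label) in enumerate(data):
        model = train(data[:i] + data[i + 1:])
        hits += predict(model, frags) == label
    return hits / len(data)


# Hypothetical toy set: "nitro" marks actives and "hydroxyl" marks inactives.
TOY_SET: List[Compound] = [
    ({"nitro", "aryl"}, True),
    ({"nitro", "amine"}, True),
    ({"nitro", "aryl", "amine"}, True),
    ({"hydroxyl", "aryl"}, False),
    ({"hydroxyl", "amine"}, False),
    ({"hydroxyl", "aryl", "amine"}, False),
]

if __name__ == "__main__":
    print(leave_one_out_accuracy(TOY_SET))
```

Because every compound is predicted by a model that never saw it, the resulting figure is a fairer estimate of performance on new structures than accuracy measured on the training set itself.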

Mechanistic ADMET Prediction

The 3D structures of a growing number of phase I enzymes are now known, including the structures of CYP2C9, CYP2C8, and CYP3A4 (Schoch et al., 2004; Williams et al., 2003; Yano et al., 2004). These structures and many more 3D protein structures are deposited in the Protein Data Bank (Westbrook et al., 2002). The availability of these important enzyme structures is now allowing for a much more detailed understanding of the P450 reaction cycle, both in terms of the active site geometry and the structure of the substrates. Additionally, homology modeling of other cytochrome P450s (mutants and isoforms) with various drug-like substrates is also allowing a greater understanding of xenobiotic metabolism and biotransformation (Vermeulen, 2003). Similarly, detailed quantitative structure–activity relationships (QSAR) for many cytochrome P450 substrates have also been compiled, and these are helping to reveal the importance of substrate size, hydrophobicity, and electronic structure in explaining metabolic rate variations (Lewis, 2003a,b; Lewis et al., 1998). In other words, the availability of so much structural data is opening the door to far more mechanistic predictions and predictive tools for ADMET. This is encouraging, because mechanistic predictions, due to their detailed structural foundation, tend to be somewhat more accurate than empirically derived ADMET predictions (Beresford et al., 2004; Langowski and Long, 2002; Yu and Adedoyin, 2003).

Structural modeling of cytochrome P450 variants and mutations is now quite simple to do, given the abundance of mammalian CYP structures and the existence of many excellent homology modeling software tools. Recall that in homology modeling, one attempts to predict or model the 3D structure of a new protein (with unknown structure) using the structure of a previously solved, but closely related (>35% sequence identity), protein. Homology modeling has been used since the early 1980s to help in a variety of areas in drug discovery and drug design (Hillisch et al., 2004). In fact, homology modeling represents one of the most successful applications of bioinformatics to pharmaceutical science. In addition to several high-quality commercial modeling packages from Tripos and Accelrys, there are also a number of freely available homology modeling tools, including Modeller (Fiser and Sali, 2003), DeepView, and WHATIF, that can be downloaded and installed on most computer platforms. More recently, homology modeling services have become widely available on the Web. These servers include the SWISS-MODEL server (Schwede et al., 2003), the CPH Models server, and the SDSC1 server (see Table 3 for a complete list). Typically, a user only needs to type or paste in the sequence of the protein of interest and press the “submit” button. A 3D structure will be returned to the submitter, via e-mail, within a few minutes. SWISS-MODEL is particularly popular (more than 120,000 requests per year), as it is very accurate, constantly updated, provides a comprehensive report on the modeled structure quality, and allows various levels of user interactivity.
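
The >35% sequence identity criterion mentioned above can be checked with a few lines of code once the query and a candidate template have been aligned. The sketch below assumes the two sequences are already aligned (equal length, with "-" marking gaps) and, as one of several possible conventions, computes identity over the columns where neither sequence has a gap. The function names and the example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def is_usable_template(aligned_query: str, aligned_template: str,
                       threshold: float = 35.0) -> bool:
    """Apply the >35% identity rule of thumb to an aligned query/template pair."""
    return percent_identity(aligned_query, aligned_template) > threshold


if __name__ == "__main__":
    # Made-up aligned fragments; real inputs would come from an alignment program.
    print(percent_identity("ACDE-FG", "ACDQKFG"))
    print(is_usable_template("AAAA", "AKKK"))
```

Other conventions (for example, dividing by the length of the shorter sequence) give slightly different values, so the chosen denominator should be stated whenever an identity figure is reported.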

In many cases, a 3D model or structure of the enzyme of interest is not sufficient for predictive ADMET work. To fully understand enzyme mechanics, one usually needs to have some knowledge of how the substrate binds to the enzyme active site. If a structure of the protein with the ligand of interest is not available, it is often possible to use another class of bioinformatics tools (called docking software) to place the substrate of interest into the active site. Docking software is a type of simulation tool that uses rapid spatial sampling techniques (such as genetic algorithms) and empirical energy functions to translate and rotate molecules into enzyme active sites (Brooijmans and Kuntz, 2003; Yang and Chen, 2004). As with homology modeling, there are numerous commercial software products, several freeware packages, and a variety of Web-accessible molecular docking tools from which users may choose. These include DOCK, FLEXX, GLIDE, GOLD, SLIDE, and SURFLEX (Table 3). Many of these have recently been reviewed and their performance assessed (Brooijmans and Kuntz, 2003; Kellenberger et al., 2004). Once a ligand is docked to its enzyme, the resulting structures may be further refined, probed, or tested using molecular dynamics and energy minimization tools, such as X-plor (Schwieters et al., 2003), GROMACS/GROMOS (Stocker and van Gunsteren, 2000), or AMBER (Wang et al., 2004).
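
The idea of sampling translations and rotations with a genetic algorithm and ranking poses with an empirical score can be shown with a deliberately tiny two-dimensional toy. It does not reproduce how DOCK, FlexX, Glide, GOLD, or any other real docking program works. The scoring function (squared distance from each ligand atom to its nearest pocket site), the coordinates, and all parameter values are assumptions chosen only for illustration.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]
Pose = Tuple[float, float, float]   # (x shift, y shift, rotation in radians)

# Hypothetical 2D coordinates for three pocket sites and a three-atom ligand.
POCKET: List[Point] = [(0.0, 0.0), (1.5, 0.0), (1.5, 1.5)]
LIGAND: List[Point] = [(5.0, 5.0), (6.5, 5.0), (6.5, 6.5)]


def place(ligand: List[Point], pose: Pose) -> List[Point]:
    """Rotate the ligand about the origin, then translate it."""
    tx, ty, theta = pose
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in ligand]


def score(ligand: List[Point], pocket: List[Point], pose: Pose) -> float:
    """Toy empirical score: summed squared distance to the nearest pocket site."""
    total = 0.0
    for ax, ay in place(ligand, pose):
        total += min((ax - px) ** 2 + (ay - py) ** 2 for px, py in pocket)
    return total   # lower is better


def dock(ligand: List[Point], pocket: List[Point], generations: int = 60,
         pop_size: int = 30, seed: int = 0):
    """Minimal genetic algorithm over poses, with elitism."""
    rng = random.Random(seed)

    def fitness(pose: Pose) -> float:
        return score(ligand, pocket, pose)

    population = [(rng.uniform(-10, 10), rng.uniform(-10, 10),
                   rng.uniform(-math.pi, math.pi)) for _ in range(pop_size)]
    history = []                        # best score seen in each generation
    for _ in range(generations):
        population.sort(key=fitness)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]
        children = [population[0]]      # elitism: the best pose always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Crossover picks each parameter from one parent; mutation adds noise.
            child = tuple(rng.choice(pair) + rng.gauss(0, 0.3)
                          for pair in zip(a, b))
            children.append(child)
        population = children
    best = min(population, key=fitness)
    return best, fitness(best), history


if __name__ == "__main__":
    pose, best_score, history = dock(LIGAND, POCKET)
    print(pose, best_score)
```

Because the best pose of each generation is carried forward unchanged, the best score can never get worse from one generation to the next; real docking programs add flexible ligands, three-dimensional rotations, and far more elaborate energy terms.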

Docking, structure refinement, and molecular dynamics are very CPU-intensive efforts, requiring considerable computer resources and a solid understanding of the techniques, the software interface, and the pitfalls or limitations of each method. These kinds of computational tools lie somewhat near the extreme end of what we normally consider bioinformatics (or cheminformatics) software. Nevertheless, there are a growing number of successful examples that demonstrate that these advanced bioinformatic tools can be used to understand and predict CYP activity, substrate specificity, inhibition, induction, reaction rates, and potential biotransformation products (Szklarz and Paulsen, 2002; Vermeulen, 2003).

Empirical ADMET/Biotransformation Predictors

It is important to differentiate the software tools described in the section “Chemical Property Prediction” from the software tools described here. While both kinds of predictors depend on empirical or nonmechanistic approaches to make their predictions, the software reviewed in that section is primarily used to predict the chemical or physiological properties of a “native” or untransformed xenobiotic compound. The software described here is primarily designed to predict biotransformation products and metabolic fates of a given xenobiotic compound. This kind of predictive capability is particularly important if researchers are attempting to identify which enzymes are metabolizing a given drug, how they are metabolizing it, where the transformations are happening, and how the drug is being cleared. Of particular interest in biotransformation studies is the possibility that a potentially useful drug compound could be transformed into a reactive intermediate. Such an intermediate could elicit toxic side effects through covalent reaction with, and modification of, important signaling proteins or enzymes. This is one of the prime motivations for developing biotransformation predictors.

There are four major expert systems that are commonly used for in silico biotransformation prediction: META (Klopman et al., 1997), MetabolExpert (Darvas, 1987), MexAlert (Table 3), and METEOR (Greene et al., 1999). All four are commercial products, and all four are highly dependent on a set of expert rules and a set of biotransformation databases to assist in their predictions. MetabolExpert is perhaps the oldest of these programs. Initiated by CompuDrug in 1985, released in 1987, and still being sold by CompuDrug today, the system contains a database of 179 biotransformations and a biotransformation knowledge base consisting of dozens of logical rules (“if-then-else” statements) that describe how, when, and where molecular subfragments may be changed in a query compound. Users may modify these rules to accommodate any special knowledge or conditions they may need. MetabolExpert is capable of performing two types of biotransformation predictions: one in which biotransformations are matched to, or deconvoluted from, the query structure; and a second in which biotransformations are compared or matched to analogous transformations from its database and from a “learned” biotransformation tree. MetabolExpert handles both phase I and phase II enzyme biotransformations, although for compounds (or intermediates) that appear to prefer phase II transformations, those transformations are always treated as terminal. The program is also able to upload and display all structures, metabolite trees, and products graphically. A more complete description of the program is given by Langowski and Long (2002).
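
A rule-driven predictor of this general kind can be sketched in a few lines. The example below is only a schematic of the if-then style of reasoning and of treating phase II products as terminal; it is not MetabolExpert's rule base or algorithm. Compounds are reduced to sets of functional-group labels, and the three rules, the labels, and the depth limit are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Rule:
    name: str        # name of the biotransformation
    phase: int       # 1 for phase I, 2 for phase II
    consumes: str    # functional group that must be present ("if")
    produces: str    # functional group that replaces it ("then")


@dataclass
class Node:
    groups: FrozenSet[str]
    via: str = "parent compound"
    terminal: bool = False
    children: List["Node"] = field(default_factory=list)


# Hypothetical three-rule knowledge base.
RULES = [
    Rule("O-demethylation", 1, "methoxy", "hydroxyl"),
    Rule("glucuronidation", 2, "hydroxyl", "O-glucuronide"),
    Rule("sulfation", 2, "hydroxyl", "O-sulfate"),
]


def expand(node: Node, rules: List[Rule], max_depth: int = 3) -> Node:
    """Grow a metabolite tree; products of phase II rules are not expanded."""
    if node.terminal or max_depth == 0:
        return node
    for rule in rules:
        if rule.consumes in node.groups:                       # "if" part
            product = (node.groups - {rule.consumes}) | {rule.produces}
            child = Node(frozenset(product), via=rule.name,
                         terminal=(rule.phase == 2))           # "then" part
            node.children.append(expand(child, rules, max_depth - 1))
    return node


def pathways(node: Node, prefix: Tuple[str, ...] = ()) -> List[str]:
    """List every root-to-leaf route through the metabolite tree."""
    trail = prefix + (node.via,)
    if not node.children:
        return [" -> ".join(trail)]
    routes = []
    for child in node.children:
        routes.extend(pathways(child, trail))
    return routes


if __name__ == "__main__":
    tree = expand(Node(frozenset({"methoxy"})), RULES)
    for route in pathways(tree):
        print(route)
```

Real expert systems match rules against full chemical structures rather than group labels, but the control flow (test a condition, apply a transformation, recurse on the product, stop at terminal metabolites) is the same in outline.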

METEOR is perhaps the most widely used, up-to-date, and extensively supported biotransformation predictor (Greene et al., 1999). It is produced and maintained by Lhasa Limited (Leeds, UK). Like MetabolExpert, METEOR depends on a set of biotransformation rules to predict the metabolic fate of a query chemical structure. METEOR’s knowledge base consists of 217 biotransformations along with nearly 900 reasoning rules. Unlike those in MetabolExpert, the biotransformation rules in METEOR are generic, and its internal architecture allows it to handle more detailed or sophisticated descriptions of chemical characteristics (bond valency, ring size, ring fusion status, etc.). Users may query the METEOR system using either graphical images (generated by ISIS-Draw) or “mol” files of their compound of interest. They may also limit the biotransformation search to different levels of depth or breadth by selecting, for instance, only the most likely biotransformations. Users may also limit searches to phase-I-only or phase-II-only biotransformations. The resulting structures and structure trees may be queried, exported, or displayed in a variety of ways. The compounds are also linked to the MDL METABOLITE database so that users may investigate any literature precedents for the predicted metabolites/transformations. Much more information about METEOR is available in several detailed reviews (Greene et al., 1999; Langowski and Long, 2002).

Empirical ADMET/biotransformation predictors have fallen from favor recently, largely because of the rapid growth of high-throughput in vitro ADMET screening and biotransformation techniques. In particular, the use of tandem mass or ion-trap mass spectrometers coupled with HPLC interfaces and sophisticated compound identification software is allowing many drug metabolites to be rapidly identified and the biotransformation pathways to be experimentally verified (Yu and Adedoyin, 2003). Nevertheless, confirmation of the existence of certain xenobiotic by-products by both in vitro and in silico methods certainly adds to one’s confidence in the correctness of any given result.

CONCLUSIONS AND FUTURE PROSPECTS

Bioinformatics and cheminformatics are transforming pharmaceutical research. Not only are these computational techniques having an impact on the early phases of drug discovery, but they are also having an impact further down the developmental pipeline. In this review, we tried to focus on the impact of bioinformatics software and databases on understanding drug metabolism, specifically in the areas of drug ADMET (absorption, distribution, metabolism, excretion, and toxicity) and adverse drug reactions (ADRs). These are critical areas in drug development, as many newly identified drug leads fail when rigorously tested for their ADMET or ADR properties (Yu and Adedoyin, 2003). Being able to predict or explain these failures is where bioinformatics can obviously be of great help. From the results and examples presented here, we have seen how bioinformatics databases can be used to explore and explain the molecular-level properties of newly sequenced genes or variants of genes (SNPs and mutations). We have also seen how other kinds of bioinformatics databases can be used to explain or extend our understanding of xenobiotic metabolism or to identify potential metabolic interference from existing enzymes and pathways. More importantly, we have seen how the information being accumulated in these databases can be used to make predictive software that employs powerful machine-learning or artificial intelligence techniques to extract patterns from the data. These patterns are then, in turn, used to predict physiological properties or chemical fates for previously unseen or uncharacterized chemical entities. It stands to reason that as more data are accumulated, more powerful and much more accurate predictions can be made. The accuracy of many predictive methods, from homology modeling to biotransformation prediction, is continuously improving. In some cases, the predictions are so good that our confidence in them is starting to reach the point of near certainty. This is definitely good news for the drug industry, as certain in silico approaches promise to make the costs of drug failures (and, hence, drug discovery) much less than they currently are (Beresford et al., 2004).

In the future, we can expect that the scope of predictive or analytical methods in drug metabolism and ADMET will grow, and so too will the scope or breadth of information contained in many bioinformatics or cheminformatics databases. It is not unreasonable to expect that in the near future, human physiological, genetic, metabolic, and even epidemiological data will be collected. Perhaps this epidemiological information will incorporate characteristics of specific disease populations and racial or ethnic groups. This information could then be integrated with high-throughput in vitro ADMET data to predict the population distribution and likelihood of various ADRs, toxic effects, or pharmacokinetic parameters. This kind of in silico clinical trial work is already starting to happen in a few small companies and academic consortia (www.simcyp.com). Given the rapid pace of development in bioinformatics, we can only expect to be continuously surprised by the power of computers and the innovative ideas coming from the people who program them.

ACKNOWLEDGMENTS

The author wishes to thank Genome Canada, Bristol-Myers Squibb, and the CIHR-Rx&D research chairs program for their generous financial support.

REFERENCES

Ackermann, B. L., Berna, M. J., Murphy, A. T. (2002). Recent advances in use of LC/MS/MS for quantitative high-throughput bioanalytical support of drug discovery. Curr. Top. Med. Chem. 2:53–66.
Akutsu, T. (2004). Efficient extraction of mapping rules of atoms from enzymatic reaction data. J. Comput. Biol. 11:449–462.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Ansede, J. H., Thakker, D. R. (2004). High-throughput screening for stability and inhibitory activity of compounds toward cytochrome P450-mediated metabolism. J. Pharm. Sci. 93:239–255.
Apweiler, R., Bairoch, A., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N., Yeh, L. S. (2004). UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 32 Database issue:D115–D119.
Ashburner, M., Lewis, S. (2002). On ontologies for biologists: the Gene Ontology – untangling the web. Novartis Found Symp. 247:66–80.
Baxevanis, A. D., Ouellette, B. F. F. (2004). Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd Ed. New York: Wiley-Interscience.
Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Wheeler, D. L. (2004). GenBank: update. Nucleic Acids Res. 32 Database issue:D23–D26.
Beresford, A. P., Segall, M., Tarbit, M. H. (2004). In silico prediction of ADME properties: are we making progress? Curr. Opin. Drug Discov. Dev. 7:36–42.
Blake, J. A., Richardson, J. E., Bult, C. J., Kadin, J. A., Eppig, J. T., Mouse Genome Database Group (2003). MGD: the Mouse Genome Database. Nucleic Acids Res. 31:193–195.
Brooijmans, N., Kuntz, I. D. (2003). Molecular recognition and docking algorithms. Annu. Rev. Biophys. Biomol. Struct. 32:335–373.
Brooksbank, C., Camon, E., Harris, M. A., Magrane, M., Martin, M. J., Mulder, N., O’Donovan, C., Parkinson, H., Tuli, M. A., Apweiler, R., Birney, E., Brazma, A., Henrick, K., Lopez, R., Stoesser, G., Stoehr, P., Cameron, G. (2003). The European Bioinformatics Institute’s data resources. Nucleic Acids Res. 31:43–50.
Brown, S. M. (2003). Bioinformatics becomes respectable. Biotechniques 34:1124–1127.
Buckingham, S. (2004). Bioinformatics: data’s future shock. Nature 428:774–777.
Butte, A. (2002). The use and analysis of microarray data. Nat. Rev. Drug Discov. 1:951–960.
Buysse, J. M. (2001). The role of genomics in antibacterial target discovery. Curr. Med. Chem. 8:1713–1726.
Carlton, J. (2003). The Plasmodium vivax genome sequencing project. Trends Parasitol. 19:227–231.
Chan, P. F., Macarron, R., Payne, D. J., Zalacain, M., Holmes, D. J. (2002). Novel antibacterials: a genomics approach to drug discovery. Curr. Drug Targets Infect. Disord. 2:291–308.
Classen, D. C., Pestotnik, S. L., Evans, R. S., Lloyd, J. F., Burke, J. P. (1997). Adverse drug events in hospitalized patients. Excess length of stay, extra costs, and attributable mortality. J. Am. Med. Assoc. 277:301–306.
Claussen, H., Buning, C., Rarey, M., Lengauer, T. (2001). FlexE: efficient molecular docking considering protein structure variations. J. Mol. Biol. 308:377–395.
CODATA Task Group on biological macromolecules and colleagues. Committee on Data for Science and Technology of the International Council of Scientific Unions (2000). Quality control in databanks for molecular biology. Bioessays 22:1024–1034.
Comess, K. M., Schurdak, M. E. (2004). Affinity-based screening techniques for enhancing lead discovery. Curr. Opin. Drug Discov. Dev. 7:411–416.
Curtis, C. G., Chien, B., Bar-Or, D., Ramu, K. (2002). Organ perfusion and mass spectrometry: a timely merger for drug development. Curr. Top. Med. Chem. 2:77–86.
Darvas, F. (1987). MetabolExpert, an expert system for predicting metabolism of substances. In: QSAR in Environmental Toxicology. Dordrecht: Reidel. 71–81.
Ellis, L. B., Hou, B. K., Kang, W., Wackett, L. P. (2003). The University of Minnesota Biocatalysis/Biodegradation Database: post-genomic data mining. Nucleic Acids Res. 31:262–265.
Fabian, P., Degtyarenko, K. N. (1997). The directory of P450-containing systems in 1996. Nucleic Acids Res. 25:274–277.
Fairlamb, A. H. (2002). Metabolic pathway analysis in trypanosomes and malaria parasites. Philos. Trans. R. Soc. Lond. B Biol. Sci. 357:101–107.
Fiser, A., Sali, A. (2003). Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol. 374:461–491.
Frank, R., Hargreaves, R. (2003). Clinical biomarkers in drug discovery and development. Nat. Rev. Drug Discov. 2:566–580.
Fredman, D., Munns, G., Rios, D., Sjoholm, F., Siegfried, M., Lenhard, B., Lehvaslaiho, H., Brookes, A. J. (2004). HGVbase: a curated resource describing human DNA variation and phenotype relationships. Nucleic Acids Res. 32 Database issue:D516–D519.
Gaasterland, T., Sensen, C. W. (1996). MAGPIE: automated genome interpretation. Trends Genet. 12:76–78.
Gibbs, R. A., et al. (2004). Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428:493–521.
Goldstein, D. B., Tate, S. K., Sisodiya, S. M. (2003). Pharmacogenetics goes genomic. Nat. Rev. Genet. 4:937–947.
Greene, N., Judson, P. N., Langowski, J. J., Marchant, C. A. (1999). Knowledge-based expert systems for toxicity and metabolism prediction: DEREK, StAR and METEOR. SAR QSAR Environ. Res. 10:299–314.
Guzey, C., Spigset, O. (2004). Genotyping as a tool to predict adverse drug reactions. Curr. Top. Med. Chem. 4:1411–1421.
Hammond, M. P., Birney, E. (2004). Genome information resources – developments at Ensembl. Trends Genet. 20:268–272.
Hansch, C., Zhang, L. (1993). Quantitative structure-activity relationships of cytochrome P-450. Drug Metab. Rev. 25:1–48.
Hewett, M., Oliver, D. E., Rubin, D. L., Easton, K. L., Stuart, J. M., Altman, R. B., Klein, T. E. (2002). PharmGKB: the Pharmacogenetics Knowledge Base. Nucleic Acids Res. 30:163–165.
Hillisch, A., Pineda, L. F., Hilgenfeld, R. (2004). Utility of homology models in the drug discovery process. Drug Discov. Today 9:659–669.
Hou, B. K., Ellis, L. B., Wackett, L. P. (2004). Encoding microbial metabolic logic: predicting biodegradation. J. Ind. Microbiol. Biotechnol. 31:261–272.
Hou, T. J., Xu, X. J. (2003). ADME evaluation in drug discovery. 3. Modeling blood–brain barrier partitioning using simple molecular descriptors. J. Chem. Inf. Comput. Sci. 43:2137–2152.
Impicciatore, P., Choonara, I., Clarkson, A., Provasi, D., Pandolfini, C., Bonati, M. (2001). Incidence of adverse drug reactions in paediatric in/out-patients: a systematic review and meta-analysis of prospective studies. Br. J. Clin. Pharmacol. 52:77–83.
Ingelman-Sundberg, M. (2002). Polymorphism of cytochrome P450 and xenobiotic toxicity. Toxicology 181–182:447–452.
Jaja, C. (2003). Foretelling our pharmacogenomic future. Nat. Biotechnol. 21:487–488.
Jaworska, J., Dimitrov, S., Nikolova, N., Mekenyan, O. (2002). Probabilistic assessment of biodegradability based on metabolic pathways: catabol system. SAR QSAR Environ. Res. 13:307–323.
Jeffery, D. A., Bogyo, M. (2003). Chemical proteomics and its application to drug discovery. Curr. Opin. Biotechnol. 14:87–95.
Jiang, R., Duan, J., Windemuth, A., Stephens, J. C., Judson, R., Xu, C. (2003). Genome-wide evaluation of the public SNP databases. Pharmacogenomics 4:779–789.
Kalow, W., Tang, B. K., Endrenyi, L. (1998). Hypothesis: comparisons of inter- and intra-
1086 Kalow,W.,Tang,B.K.,Endrenyi,L.(1998).Hypothesis:comparisons of inter- and intra-
1087 individual variations can substitute for twin studies in drug research.Pharmacogenetics
1088 8:283–289.
305BIOINFORMATICS IN DRUG DEVELOPMENT
1089Kanehisa,M.,Goto,S.,Kawashima,S.,Okuno,Y.,Hattori,M.(2004).The KEGG resource for
1090deciphering the genome.Nucleic Acids Res.32 Database issue:D277–D280.
1091Karolchik,D.,Baertsch,R.,Diekhans,M.,Furey,T.S.,Hinrichs,A.,Lu,Y.T.,Roskin,K.M.,
1092Schwartz,M.,Sugnet,C.W.,Thomas,D.J.,Weber,R.J.,Haussler,D.,Kent,W.J.(2003).
1093The UCSC Genome Browser Database.Nucleic Acids Res.31:51–54.
1094Karp,P.D.,Krummenacker,M.,Paley,S.,Wagg,J.(1999).Integrated pathway-genome databases
1095and their role in drug discovery.Trends Biotechnol.17:275–281.
1096Kassel,D.B.(2004).Applications of high-throughput ADME in drug discovery.Curr.Opin.Chem.
1097Biol.8:339–345.
1098Kellenberger,E.,Rodrigo,J.,Muller,P.,Rognan,D.(2004).Comparative evaluation of eight
1099docking tools for docking and virtual screening accuracy.Proteins 57:225–242.
1100Klein,T.E.,Altman,R.B.(2004).PharmGKB:the pharmacogenetics and pharmacogenomics
1101knowledge base.PharmacogenomicsQ9 4:1.
1102Klopman,G.,Tu,M.,Talafous,J.(1997).META.3.A genetic algorithm for metabolic transform
1103priorities optimization.J.Chem.Inf.Comput.Sci.37:329–334.
1104Kramer,R.,Cohen,D.(2004).Functional genomics to new drug targets.Nat.Rev.Drug Discov.
11053:965–972.
1106Krieger,C.J.,Zhang,P.,Mueller,L.A.,Wang,A.,Paley,S.,Arnaud,M.,Pick,J.,Rhee,S.Y.,
1107Karp,P.D.(2004).MetaCyc:a multiorganismdatabase of metabolic pathways and enzymes.
1108Nucleic Acids Res.32 Database issue:D438–D442.
1109Lahana,R.(2002).Cheminformatics–decision making in drug discovery.Drug Discov.Today
11107:898–900.
1111Langowski,J.,Long,A.(2002).Computer systems for the prediction of xenobiotic metabolism.
1112Adv.Drug Deliv.Rev.54:407–415.
1113Lazarou,J.,Pomeranz,B.H.,Corey,P.N.(1998).Incidence of adverse drug reactions in
1114hospitalized patients:a meta-analysis of prospective studies.J.Am.Med.Assoc.279:1200–
11151205.
1116Lesk,A.M.(2002).Introduction to Bioinforamtics.Oxford:Pubmed Oxford University Press.
1117Levy,R.H.,Hachad,H.,Yao,C.,Ragueneau-Majlessi,I.(2003).Relationship between extent of
1118inhibition and inhibitor dose:literature evaluation based on the metabolism and transport
1119drug interaction database.Curr.Drug Metab.4:371–380.
1120Lewis,D.F.(2003a).P450 structures and oxidative metabolismof xenobiotics.Pharmacogenomics
11214:387–395.
1122Lewis,D.F.(2003b).Quantitative structure–activity relationships (QSARs) within the cytochrome
1123P450 system:QSARs describing substrate binding,inhibition and induction of P450s.
1124Inflammopharmacology 11:43–73.
1125Lewis,D.F.,Ioannides,C.,Parke,D.V.(1998).An improved and updated version of the
1126COMPACT procedure for the evaluation of P450-mediated chemical activation.Drug
1127Metab.Rev.30:709–737.
1128Lindon,J.C.,Holmes,E.,Bollard,M.E.,Stanley,E.G.,Nicholson,J.K.(2004).Metabonomics
1129technologies and their applications in physiological monitoring,drug safety assessment and
1130disease diagnosis.Biomarkers 9:1–31.
1131Lindpaintner,K.(2003).Pharmacogenetics and the future of medical practice.J.Mol.Med.
113281:141–153.
1133Lindsay,M.A.(2003).Target discovery.Nat.Rev.Drug Discov.2:831–838.
1134Lipinski,C.A.,Lombardo,F.,Dominy,B.W.,Feeney,P.J.(2001).Experimental and
1135computational approaches to estimate solubility and permeability in drug discovery and
1136development settings.Adv.Drug Deliv.Rev.46:3–26.
1137Masood,E.(1999).Consortium plans free SNP map of human genome.Nature 398:545–546.
1138McShan,D.C.,Updadhayaya,M.,Shah,I.(2004).Symbolic inference of xenobiotic metabolism.
1139Pac.Symp.Biocomput.2004:545–556.
D.S.WISHART306
1140 Meisel,C.,Gerloff,T.,Kirchheiner,J.,Mrozikiewicz,P.M.,Niewinski,P.,Brockmoller,J.,Roots,
1141 I.(2003).Implications of pharmacogenetics for individualizing drug treatment and for study
1142 design.J.Mol.Med.81:154–167.
1143 Mu¨ller,M.,Jansen,P.L.M.(1997).Molecular aspects of hepatobiliary transport.Am.J.Physiol.
1144 272:G1285–G1303.
1145 O’Donovan,C.,Martin,M.J.,Gattiker,A.,Gasteiger,E.,Bairoch,A.,Apweiler,R.(2002).High-
1146 quality protein knowledge resource:SWISS-PROT and TrEMBL.Brief.Bioinform.3:275–
1147 284.
1148 Oliver,D.E.,Rubin,D.L.,Stuart,J.M.,Hewett,M.,Klein,T.E.,Altman,R.B.(2002).Ontology
1149 development for a pharmacogenetics knowledge base.Pac.Symp.Biocomput.2002:65–
1150 76.
1151 Oliver,D.E.,Bhalotia,G.,Schwartz,A.S.,Altman,R.B.,Hearst,M.A.(2004).Tools for loading
1152 MEDLINE into a local relational database.BMC Bioinformatics 5:146.
1153 Onyango,P.(2004).The role of emerging genomics and proteomics technologies in cancer drug
1154 target discovery.Curr.Cancer.Drug Targets 4:111–124.
1155 Parker,H.G.,Kim,L.V.,Sutter,N.B.,Carlson,S.,Lorentzen,T.D.,Malek,T.B.,Johnson,G.S.,
1156 DeFrance,H.B.,Ostrander,E.A.,Kruglyak,L.(2004).Genetic structure of the purebred
1157 domestic dog.Science 304:1160–1164.
1158 Pruitt,K.D.,Tatusova,T.,Maglott,D.R.(2003).NCBI Reference Sequence project:update and
1159 current status.Nucleic Acids Res.31:34–37.
1160 Rebhan,M.,Chalifa-Caspi,V.,Prilusky,J.,Lancet,D.(1998).GeneCards:a novel functional
1161 genomics compendium with automated data mining and query reformulation support.
1162 Bioinformatics 14:656–664.
1163 Riley,R.J.,Kenna,J.G.(2004).Cellular models for ADMET predictions and evaluation of drug–
1164 drug interactions.Curr.Opin.Drug Discov.Dev.7:86–99.
1165 Riva,A.,Kohane,I.S.(2004).A SNP-centric database for the investigation of the human genome.
1166 BMC Bioinformatics 5:33.
1167 Sadym,A.,Lagunin,A.,Filimonov,D.,Poroikov,V.(2003).Prediction of biological activity
1168 spectra via the Internet.SAR QSAR Environ.Res.14:339–347.
1169 Schoch,G.A.,Yano,J.K.,Wester,M.R.,Griffin,K.J.,Stout,C.D.,Johnson,E.F.(2004).
1170 Structure of human microsomal cytochrome P450 2C8.Evidence for a peripheral fatty acid
1171 binding site.J.Biol.Chem.279:9497–9503.
1172 Schwede,T.,Kopp,J.,Guex,N.,Peitsch,M.C.(2003).SWISS-MODEL:An automated protein
1173 homology-modeling server.Nucleic Acids Res.31:3381–3385.
1174 Schwieters,C.D.,Kuszewski,J.J.,Tjandra,N.,Clore,M.G.(2003).The Xplor-NIH NMR
1175 molecular structure determination package.J.Magn.Reson.160:65–73.
1176 Shioda,T.(2004).Application of DNA microarrays to toxicological research.J.Environ.Pathol.
1177 Toxicol.Oncol.23:13–31.
1178 Southan,C.(2004).Has the yo-yo stopped?An assessment of human protein-coding gene number.
1179 Proteomics 4:1712–1726.
1180 Stenson,P.D.,Ball,E.V.,Mort,M.,Phillips,A.D.,Shiel,J.A.,Thomas,N.S.,Abeysinghe,S.,
1181 Krawczak,M.,Cooper,D.N.(2003).Human Gene Mutation Database (HGMD):2003
1182 update.Hum.Mutat.21:577–581.
1183 Stocker,U.,van Gunsteren,W.F.(2000).Molecular dynamics simulation of hen egg white
1184 lysozyme:a test of the GROMOS96 force field against nuclear magnetic resonance data.
1185 Proteins 40:145–153.
1186 Szklarz,G.D.,Paulsen,M.D.(2002).Molecular modeling of cytochrome P450 1A1:en-
1187 zyme–substrate interactions and substrate binding affinities.J.Biomol.Struct.Dyn.
1188 20:155–162.
1189 Tetko,I.V.(2003).The WWW as a tool to obtain molecular parameters.Mini Rev.Med.Chem.
1190 3:809–820.
307BIOINFORMATICS IN DRUG DEVELOPMENT
1191Thorisson,G.A.,Stein,L.D.(2003).The SNP Consortium website:past,present and future.
1192Nucleic Acids Res.31:124–127.
1193Twigger,S.,Lu,J.,Shimoyama,M.,Chen,D.,Pasko,D.,Long,H.,Ginster,J.,Chen,C.F.,Nigam,
1194R.,Kwitek,A.,Eppig,J.,Maltais,L.,Maglott,D.,Schuler,G.,Jacob,H.,Tonellato,P.J.
1195(2002).Rat Genome Database (RGD):mapping disease onto the genome.Nucleic Acids Res.
119630:125–128.
1197Van de Waterbeemd,H.,De Groot,M.(2002).Can the Internet help to meet the challenges in
1198ADME and e-ADME?SAR QSAR Environ.Res.13:391–401.
1199Van de Waterbeemd,H.,Gifford,E.(2003).ADMET in silico modelling:towards prediction
1200paradise?Nat.Rev.Drug Discov.2:192–204.
1201Venter,J.C.,et al.(2001).The sequence of the human genome.Science 291:1304–1351.
1202Vermeulen,N.P.(2003).Prediction of drug metabolism:the case of cytochrome P450 2D6.Curr.
1203Top.Med.Chem.3:1227–1239.
1204Villeneuve,D.J.,Parissenti,A.M.(2004).The use of DNA microarrays to investigate the
1205pharmacogenomics of drug response in living systems.Curr.Top.Med.Chem.4:1329–
12061345.
1207Walgren,J.L.,Thompson,D.C.(2004).Application of proteomic technologies in the drug
1208development process.Toxicol.Lett.149:377–385.
1209Wang,J.,Wolf,R.M.,Caldwell,J.W.,Kollman,P.A.,Case,D.A.(2004).Development and
1210testing of a general amber force field.J.Comput.Chem.25:1157–1174.
1211Waterston,R.J.(2002).Initial sequencing and comparative analysis of the mouse genome.Nature
1212420:520–562.
1213Westbrook,J.,Feng,Z.,Jain,S.,Bhat,T.N.,Thanki,N.,Ravichandran,V.,Gilliland,G.L.,
1214Bluhm,W.,Weissig,H.,Greer,D.S.,Bourne,P.E.,Berman,H.M.(2002).The Protein Data
1215Bank:unifying the archive.Nucleic Acids Res.30:245–248.
1216Wheeler,D.L.,Church,D.M.,Edgar,R.,Federhen,S.,Helmberg,W.,Madden,T.L.,Pontius,J.
1217U.,Schuler,G.D.,Schriml,L.M.,Sequeira,E.,Suzek,T.O.,Tatusova,T.A.,Wagner,L.
1218(2004).Database resources of the National Center for Biotechnology Information:update.
1219Nucleic Acids Res.32 Database issue:D35–D40.
1220White,T.J.,Arakelian,A.,Rho,J.P.(1999).Counting the costs of drug-related adverse events.
1221Pharmacoeconomics 15:445–458.
1222Williams,P.A.,Cosme,J.,Ward,A.,Angove,H.C.,Vinkovic,D.,Jhoti,H.(2003).Crystal
1223structure of human cytochrome P450 2C9 with bound warfarin.Nature 424:464–468.
1224Yan,Q.,Sadee,W.(2000).Human membrane transporter database:a Web-accessible relational
1225database for drug transport studies and pharmacogenomics.AAPS PharmSci.2:E20.
1226Yang,J.M.,Chen,C.C.(2004).GEMDOCK:a generic evolutionary method for molecular
1227docking.Proteins 55:288–304.
1228Yano,J.K.,Wester,M.R.,Schoch,G.A.,Griffin,K.J.,Stout,C.D.,Johnson,E.F.(2004).The
1229structure of human microsomal cytochrome P450 3A4 determined by x-ray crystallography
1230to 2.05-a resolution.J.Biol.Chem.279:38091–38094.
1231Yu,H.,Adedoyin,A.(2003).ADME-Tox in drug discovery:integration of experimental and
1232computational technologies.Drug Discov.Today 8:852–861.
D.S.WISHART308