Data Mining Tools for Curation of the Human Metabolome Database
Savita Shrivastava, Craig Knox, Paul Stothard, Russ Greiner, David Wishart, University of Alberta, Edmonton, Canada
Updating the HMDB
Introduction
The extensive information stored in the HMDB has been assembled by a team of curators using a collection of custom data mining programs developed specifically for building and updating the HMDB. These software tools use sequence and text comparison algorithms to obtain up-to-date metabolite information from some of the most reliable and complete resources. Two of the HMDB data mining tools, MetaboBuilder and MetabolizingInfo, are discussed below.
Abstract: The Human Metabolome Database (HMDB) contains more than 1400 metabolite entries, each consisting of more than 80 data fields. Obtaining and evaluating the contents of these data fields has required the development of several custom software tools. These data mining programs extract information from several publicly accessible databases (KEGG, PubChem, PubMed, MetaCyc, ChEBI, PDB, Swiss-Prot, GenBank), and generate a series of web-based reports. These reports, by combining the results obtained from several independent sources, provide a useful means for evaluating the reliability of the metabolite information that is added to the HMDB. The HMDB is regularly updated as additional data becomes available and as source databases and data mining methods improve.
Building the MetaboCards
The HMDB contains more than 1400 metabolite entries, each consisting of over 80 data fields. The data pertaining to each metabolite is accessible as a “MetaboCard”. The MetaboCard serves as a curator-friendly summary of the current metabolite annotations stored in the HMDB (Fig. 1). The initial set of MetaboCards is assembled using a data mining program called MetaboBuilder, which searches a variety of databases using sequence and keyword queries. The results of each search are evaluated to determine whether they are relevant for the metabolite in question, or if they should be discarded. MetaboBuilder also coordinates the updating of fields that are calculated from the contents of other fields, such as protein molecular weight and protein isoelectric point. The content that is gathered and generated by MetaboBuilder is stored in a relational database and in a flat file database to facilitate curator review.
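A field calculated from the contents of another field, such as protein molecular weight, can be recomputed whenever the stored sequence changes. The following is a minimal sketch of that idea and not MetaboBuilder's actual code: the function names, the record layout (`protein_sequence`, `protein_molecular_weight`), and the choice of average residue masses are assumptions made for illustration.

```python
# Average residue masses in daltons for the 20 standard amino acids.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added for the free N- and C-termini


def protein_molecular_weight(sequence: str) -> float:
    """Return the average molecular weight (Da) of an unmodified protein chain."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    try:
        return sum(AVERAGE_RESIDUE_MASS[residue] for residue in seq) + WATER_MASS
    except KeyError as exc:
        raise ValueError(f"unsupported residue: {exc.args[0]}") from None


def update_derived_fields(record: dict) -> dict:
    """Return a copy of a (hypothetical) record with its derived field recomputed."""
    updated = dict(record)
    weight = protein_molecular_weight(record["protein_sequence"])
    updated["protein_molecular_weight"] = round(weight, 2)
    return updated
```

Keeping the calculation in one function means the derived field cannot drift from the sequence it depends on. An isoelectric point field could be handled the same way, given an agreed set of pKa values.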
Fig. 1. Data stored in the HMDB is available to users and curators in the form of MetaboCards. The cards are generated by a data mining program that retrieves information from several external and internal databases and scripts. Whenever possible, the contents of the MetaboCards are hyperlinked to additional information to aid in the curation process.
Evaluating Metabolizing Enzymes
Each of the automatically generated MetaboCards is reviewed by curators who look for missing or incorrect information. To assist the curators, the HMDB development team has prepared several tools that obtain information from additional resources, using data mining approaches that differ from those used to build the MetaboCards. One of the programs, called MetabolizingInfo, is used to evaluate the content of the MetaboCards relating to metabolizing enzymes. Currently more than 3,000 protein (and DNA) sequences are linked to the metabolite entries. The MetabolizingInfo program uses the name of each metabolite and its known synonyms to obtain publications from PubMed, metabolizing enzymes from Swiss-Prot, and metabolite and metabolizing enzyme information from KEGG. The searches are conducted using a combination of WWW agents and public database APIs. All of the retrieved information is ranked using a scoring system and presented to the curator as an HTML document (Fig. 2). Each of the entries in the document is hyperlinked to a complete database record (Fig. 3).
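The search-and-rank step described above can be sketched as follows. This is an illustrative sketch only: the poster does not specify the scoring system, so the weights used here (2 points for the primary name, 1 point per matching synonym), the function names, and the record layout are all assumptions. The query is restricted with PubMed's `[Title/Abstract]` field tag, which is one plausible choice rather than the documented behaviour of MetabolizingInfo.

```python
def build_pubmed_query(name, synonyms):
    """Combine a metabolite name and its synonyms into one OR-joined query."""
    terms, seen = [], set()
    for term in [name, *synonyms]:
        cleaned = term.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:  # skip blanks and case-insensitive repeats
            seen.add(key)
            terms.append(f'"{cleaned}"[Title/Abstract]')
    return " OR ".join(terms)


def score_record(text, name, synonyms):
    """Hypothetical score: 2 if the primary name occurs, plus 1 per synonym found."""
    lowered = text.lower()
    score = 2 if name.lower() in lowered else 0
    score += sum(1 for synonym in synonyms if synonym.lower() in lowered)
    return score


def rank_records(records, name, synonyms, cutoff=1):
    """Return (score, record) pairs at or above the cutoff, best first."""
    scored = [(score_record(r["text"], name, synonyms), r) for r in records]
    kept = [pair for pair in scored if pair[0] >= cutoff]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept
```

For example, `build_pubmed_query("corticosterone", ["17-deoxycortisol"])` yields a single query string covering both names, and `rank_records` can then be applied to whatever text each source returns.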
The HMDB will never be a “finished” database, since new research is always providing additional data. Furthermore, the HMDB data mining tools and curators constantly scrutinize and update existing content. The HMDB is available at http://www.hmdb.ca. We encourage users to provide us with their feedback.
Fig. 2. The MetabolizingInfo program uses text-based searches to retrieve information from Swiss-Prot, PubMed, and KEGG. Records that pass a scoring cut-off are presented in a colour-coded HTML table. The table for corticosterone is shown above. Each external record ID is hyperlinked to its corresponding record for curator review. Some of these records are shown in Fig. 3.
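A report of the kind described in Fig. 2 could be generated along the following lines. This is a hedged sketch, not the MetabolizingInfo implementation: the poster does not say what the colours encode, so colouring rows by source database is an assumption, as are the function name, the row layout, and the colour values.

```python
from html import escape

# Assumed colour scheme: one background colour per source database.
SOURCE_COLOURS = {
    "Swiss-Prot": "#cfe8ff",
    "PubMed": "#e2f7d5",
    "KEGG": "#fff2c2",
}


def render_report(rows, cutoff):
    """Render rows scoring at or above the cutoff as a colour-coded HTML table."""
    lines = [
        "<table>",
        "<tr><th>Source</th><th>Record ID</th><th>Score</th></tr>",
    ]
    for row in sorted(rows, key=lambda r: r["score"], reverse=True):
        if row["score"] < cutoff:
            continue  # records below the cut-off are left out of the report
        colour = SOURCE_COLOURS.get(row["source"], "#ffffff")
        link = (
            f'<a href="{escape(row["url"], quote=True)}">'
            f'{escape(row["record_id"])}</a>'
        )
        lines.append(
            f'<tr style="background-color:{colour}">'
            f'<td>{escape(row["source"])}</td><td>{link}</td>'
            f'<td>{row["score"]}</td></tr>'
        )
    lines.append("</table>")
    return "\n".join(lines)
```

Each record ID becomes a hyperlink to the source record, so the curator can move from the summary table to the full entry in one step.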
Fig. 3. The HMDB data mining tools, such as the MetabolizingInfo program, provide web-based reports for human curators. These reports contain hyperlinks to records in a variety of external databases, including Swiss-Prot, PubMed, and KEGG. Shown above are a Swiss-Prot record, a PubMed abstract, a KEGG compound record, and a KEGG enzyme record obtained for corticosterone.
By using a combination of automated data mining and manual curation, the HMDB aims to be a comprehensive and reliable database of human metabolites.