Data Mining Tools for Curation of the Human Metabolome Database

Savita Shrivastava, Craig Knox, Paul Stothard, Russ Greiner, David Wishart, University of Alberta, Edmonton, Canada

Updating the HMDB

Introduction

The extensive information stored in the HMDB has been assembled by a team of curators using a collection of custom data mining programs developed specifically for building and updating the HMDB. These software tools use sequence and text comparison algorithms to obtain up-to-date metabolite information from some of the most reliable and complete resources. Two of the HMDB data mining tools, MetaboBuilder and MetabolizingInfo, are discussed below.

Abstract: The Human Metabolome Database (HMDB) contains more than 1400 metabolite entries, each consisting of more than 80 data fields. Obtaining and evaluating the contents of these data fields has required the development of several custom software tools. These data mining programs extract information from several publicly accessible databases (KEGG, PubChem, PubMed, MetaCyc, ChEBI, PDB, Swiss-Prot, GenBank) and generate a series of web-based reports. By combining the results obtained from several independent sources, these reports provide a useful means of evaluating the reliability of the metabolite information that is added to the HMDB. The HMDB is regularly updated as additional data becomes available and as source databases and data mining methods improve.

Building the MetaboCards

The HMDB contains more than 1400 metabolite entries, each consisting of over 80 data fields. The data pertaining to each metabolite is accessible as a “MetaboCard”. The MetaboCard serves as a curator-friendly summary of the current metabolite annotations stored in the HMDB (Fig. 1). The initial set of MetaboCards is assembled using a data mining program called MetaboBuilder, which searches a variety of databases using sequence and keyword queries. The results of each search are evaluated to determine whether they are relevant to the metabolite in question or should be discarded. MetaboBuilder also coordinates the updating of fields that are calculated from the contents of other fields, such as protein molecular weight and protein isoelectric point. The content that is gathered and generated by MetaboBuilder is stored in a relational database and in a flat file database to facilitate curator review.
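
The idea of a field calculated from the contents of another field can be illustrated with protein molecular weight, which follows directly from the stored sequence. The sketch below is not MetaboBuilder's code: the function names and the card keys (`protein_sequence`, `protein_molecular_weight`) are hypothetical, and the residue masses are standard average values in daltons. Isoelectric point is left out because it depends on a choice of pKa table that the poster does not specify.

```python
# Average amino acid residue masses in daltons (standard rounded values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added for the free N- and C-termini


def protein_molecular_weight(sequence: str) -> float:
    """Return the average molecular weight of a protein sequence in daltons."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[residue] for residue in seq) + WATER_MASS


def refresh_derived_fields(card: dict) -> dict:
    """Return a copy of a card with its sequence-derived field recomputed.

    The key names are hypothetical; the real MetaboCard schema is not
    described in the poster.
    """
    updated = dict(card)
    updated["protein_molecular_weight"] = round(
        protein_molecular_weight(card["protein_sequence"]), 2
    )
    return updated
```

Recomputing the value from the sequence whenever the sequence changes, rather than editing it by hand, keeps the two fields from drifting apart.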

Fig. 1 Data stored in the HMDB is available to users and curators in the form of MetaboCards. The cards are generated by a data mining program that retrieves information from several external and internal databases and scripts. Whenever possible, the contents of the MetaboCards are hyperlinked to additional information to aid in the curation process.


Evaluating Metabolizing Enzymes

Each of the automatically generated MetaboCards is reviewed by curators who look for missing or incorrect information. To assist the curators, the HMDB development team has prepared several tools that obtain information from additional resources, using data mining approaches that differ from those used to build the MetaboCards. One of the programs, called MetabolizingInfo, is used to evaluate the content of the MetaboCards relating to metabolizing enzymes. Currently, more than 3,000 protein (and DNA) sequences are linked to the metabolite entries. The MetabolizingInfo program uses the name of each metabolite and its known synonyms to obtain publications from PubMed, metabolizing enzymes from Swiss-Prot, and metabolite and metabolizing enzyme information from KEGG. The searches are conducted using a combination of WWW agents and public database APIs. All of the retrieved information is ranked using a scoring system and presented to the curator as an HTML document (Fig. 2). Each of the entries in the document is hyperlinked to a complete database record (Fig. 3).
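
A name-plus-synonyms search of PubMed could be expressed as a single query along the following lines. This is a minimal sketch under stated assumptions: the poster says only that public database APIs are used, so the choice of the NCBI E-utilities `esearch` endpoint, the `[Title/Abstract]` field restriction, and the function names are illustrative rather than a description of MetabolizingInfo itself. The code only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; the poster does not name it).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_term(name: str, synonyms: list[str]) -> str:
    """Combine a metabolite name and its synonyms into one OR query."""
    seen = set()
    clauses = []
    for candidate in [name, *synonyms]:
        cleaned = candidate.strip()
        key = cleaned.lower()
        if not cleaned or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        seen.add(key)
        clauses.append(f'"{cleaned}"[Title/Abstract]')
    return " OR ".join(clauses)


def build_esearch_url(name: str, synonyms: list[str], max_records: int = 20) -> str:
    """Return the URL of a PubMed search for the metabolite."""
    params = {
        "db": "pubmed",
        "term": build_pubmed_term(name, synonyms),
        "retmax": max_records,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

For example, `build_esearch_url("Corticosterone", ["17-deoxycortisol"])` yields a URL whose `term` parameter matches either name in the title or abstract. Quoting each name keeps multi-word synonyms together as phrases.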

The HMDB will never be a “finished” database, since new research is always providing additional data. Furthermore, the HMDB data mining tools and curators constantly scrutinize and update existing content. The HMDB is available at http://www.hmdb.ca. We encourage users to provide us with their feedback.


Fig. 2 The MetabolizingInfo program uses text-based searches to retrieve information from Swiss-Prot, PubMed, and KEGG. Records that pass a scoring cut-off are presented in a colour-coded HTML table. The table for corticosterone is shown above. Each external record ID is hyperlinked to its corresponding record for curator review. Some of these records are shown in Fig. 3.
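
The cut-off and colour-coded table described in this caption can be sketched as follows. The scoring system itself is not described in the poster, so the sketch assumes each retrieved record already carries a numeric `score`; the record keys, the default cut-off, the colours, and the function names are all hypothetical.

```python
from html import escape

# Hypothetical row colours, one per source database.
SOURCE_COLOURS = {
    "Swiss-Prot": "#e0f0ff",
    "PubMed": "#e8ffe0",
    "KEGG": "#fff4d6",
}


def rank_records(records: list[dict], cutoff: float) -> list[dict]:
    """Keep records scoring at or above the cut-off, best first."""
    kept = [record for record in records if record["score"] >= cutoff]
    return sorted(kept, key=lambda record: record["score"], reverse=True)


def render_report(metabolite: str, records: list[dict], cutoff: float = 0.5) -> str:
    """Render the ranked records as an HTML table with hyperlinked record IDs."""
    rows = []
    for record in rank_records(records, cutoff):
        colour = SOURCE_COLOURS.get(record["source"], "#ffffff")
        rows.append(
            f'<tr style="background-color:{colour}">'
            f'<td>{escape(record["source"])}</td>'
            f'<td><a href="{escape(record["url"], quote=True)}">'
            f'{escape(record["record_id"])}</a></td>'
            f'<td>{record["score"]:.2f}</td></tr>'
        )
    header = "<tr><th>Source</th><th>Record ID</th><th>Score</th></tr>"
    body = "\n".join([header, *rows])
    return f"<table><caption>{escape(metabolite)}</caption>\n{body}\n</table>"
```

Filtering before rendering keeps low-scoring hits out of the curator's view, while the hyperlinked IDs let the curator check each surviving record against its source.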

Fig. 3 The HMDB data mining tools, such as the MetabolizingInfo program, provide web-based reports for human curators. These reports contain hyperlinks to records in a variety of external databases, including Swiss-Prot, PubMed, and KEGG. Shown above is a Swiss-Prot record, PubMed abstract, KEGG compound record, and KEGG enzyme record obtained for corticosterone.

By using a combination of automated data mining and manual curation, the HMDB aims to be a comprehensive and reliable database of human metabolites.