NISBREAugust2008_final


Nucleotide Level

We define four statistics to describe how results are scored at the nucleotide level. If a base is part of an actual site and is predicted by the tool, it is a true positive. If it is only predicted, it is a false positive. If it is part of an actual site and not predicted, it is a false negative, and if it is neither predicted nor an actual site, it is a true negative.
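The four counts above can be sketched as a small scoring routine. This is a minimal illustration, not MTAP's actual code; representing sites as half-open `[start, end)` intervals is an assumption made here for clarity.

```python
def score_nucleotides(seq_len, actual_sites, predicted_sites):
    """Nucleotide-level TP/FP/FN/TN from half-open [start, end) intervals."""
    actual = set()
    for start, end in actual_sites:
        actual.update(range(start, end))
    predicted = set()
    for start, end in predicted_sites:
        predicted.update(range(start, end))

    tp = len(actual & predicted)   # in an actual site and predicted
    fp = len(predicted - actual)   # predicted only
    fn = len(actual - predicted)   # in an actual site, not predicted
    tn = seq_len - tp - fp - fn    # neither predicted nor an actual site
    return tp, fp, fn, tn

# Example: 20 bp sequence, one 5 bp site, prediction shifted by 2 bp.
counts = score_nucleotides(20, [(5, 10)], [(7, 12)])
```

Sensitivity (TP / (TP + FN)) and specificity then follow directly from these four counts.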


Site Level

We define three measures to describe how results are scored at the site level, using a threshold that defines what percentage of a motif must be covered to count as a correct prediction. If a site is an actual site and is predicted by the tool by at least the threshold percentage (6.7% for these tests), it is a true positive. If it is only predicted, it is a false positive. If it is part of an actual site and not predicted, it is a false negative. True negatives are not counted at the site level.
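The site-level rule could look like the following sketch. Again this is illustrative, not MTAP's implementation; the simple all-pairs matching of predictions to sites is a simplifying assumption.

```python
def score_sites(actual_sites, predicted_sites, threshold=0.067):
    """Site-level TP/FP/FN from half-open [start, end) intervals.
    An actual site counts as found if some prediction covers at least
    `threshold` of its length (6.7% in these tests)."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    found = set()  # indices of actual sites matched by some prediction
    used = set()   # indices of predictions matching some actual site
    for i, site in enumerate(actual_sites):
        for j, pred in enumerate(predicted_sites):
            if overlap(site, pred) >= threshold * (site[1] - site[0]):
                found.add(i)
                used.add(j)
    tp = len(found)
    fn = len(actual_sites) - tp
    fp = len(predicted_sites) - len(used)  # predictions matching nothing
    return tp, fp, fn
```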




















Phylogenetic vs. non-Phylogenetic

Some TFBS detection methods take prediction one step further. Much as students may be allowed supplemental material on a test, these programs make use of orthologous sequence data and a phylogenetic tree (both generated by the user) in an attempt to raise sensitivity and specificity. We implement methods for automatically generating orthologous sequence data and phylogenetic trees based on ribosomal RNA, so that phylogenetic programs can be incorporated into MTAP.
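One way the tree-generation step might be approximated: pairwise p-distances over a pre-aligned rRNA block, clustered into a Newick tree. This is a toy sketch under stated assumptions, not MTAP's pipeline; MTAP's actual distance model and clustering method are not described here, UPGMA is simply one standard choice, and the sequences and names below are placeholders.

```python
def p_distance(a, b):
    """Fraction of mismatched positions between two aligned sequences."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

def upgma(names, seqs):
    """Cluster aligned sequences by UPGMA; return a Newick string."""
    clusters = {i: (names[i], 1) for i in range(len(names))}  # id -> (label, size)
    dist = {(i, j): p_distance(seqs[i], seqs[j])
            for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        li, ni = clusters.pop(i)
        lj, nj = clusters.pop(j)
        # size-weighted average linkage to every remaining cluster
        new_d = {k: (dist[(min(i, k), max(i, k))] * ni +
                     dist[(min(j, k), max(j, k))] * nj) / (ni + nj)
                 for k in clusters}
        dist = {(a, b): d for (a, b), d in dist.items()
                if i not in (a, b) and j not in (a, b)}
        for k, d in new_d.items():
            dist[(min(k, next_id), max(k, next_id))] = d
        clusters[next_id] = (f"({li},{lj})", ni + nj)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"
```

Branch lengths are omitted for brevity; the resulting topology could then be handed to a phylogenetic prediction program alongside the orthologous sequences.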









Results

Program outputs and scoring revealed a wide array of observations on current TFBS detection methods. Some key observations include the following:

Our method of automation returns sensitivity and specificity comparable to "hand-made" results. Figure 4 shows the nucleotide and site sensitivity, and nucleotide specificity, for Tompa's "hand-made" method versus our automated method. The graph suggests that our automated method returns results comparable to Tompa's, evidence that our method reliably produces accurate results. Nucleotide and site sensitivity are higher overall with our method, suggesting that automation is an acceptable alternative to manual assessment.
















Phylogenetic programs are comparable in performance to non-phylogenetic programs over RegulonDB. Figure 5 depicts statistics for phylogenetic vs. non-phylogenetic programs over sequences 400 bp upstream of the coding sequence from RegulonDB, using genomes NC_000913, NC_007946, and AC_000091. The phylogenetic program PhyloGibbs compares favorably with MEME, Weeder, and AlignAce. PhyME returns low sensitivity, perhaps due to its reluctance to make predictions. More evaluations over different genome selections are needed to explore the benefits of phylogenetic approaches. Regardless, this illustrates the capabilities of MTAP for evaluating different classes of algorithms.









Abstract

Transcription factor binding sites (TFBS) regulate the expression of genes in the cell. Their discovery remains one of the most challenging problems in molecular biology. To address this problem, traditional techniques have been supplemented by computational prediction methods. Today over 100 algorithms exist for detecting TFBS in various problem domains, yet it remains unclear how to assess their performance. Comparing these tools across datasets is nearly impossible due to a lack of standards for running programs and reporting results. This work proposes a novel method for standardizing runtime procedures and assessing tool performance. To compare the methods fairly, a standard reporting format was developed for each tool. We developed a pipeline framework to integrate, score, and evaluate TFBS identification for each method. We evaluated 9 computational methods for detecting TFBS and obtained statistics that describe their performance. These results allowed us to rank each method by performance on a range of datasets. Computational detection of TFBS remains a challenging problem. Sensitivity and specificity remain low for the algorithms regardless of dataset. Our method for comparing tools exposes a unique view into the behavior of TFBS detection and enables the improvement of methods in the future.



Motivation

One of the most complex problems facing scientists today is the prediction of TFBS patterns in silico. Over 100 computational prediction tools have been developed to address this problem. As a result, it can be unclear to scientists which available method is right for their specific domain. A method that performs poorly on one dataset may perform better on another, but until recently there had been no way to determine this quickly. Comparing even a small set of TFBS prediction tools manually requires extensive knowledge of each program, hours of manual computation, and the ability to streamline a number of different input/output formats into a format that is easily scored. Our objective is to make the process of comparing and evaluating tool performance automated and accurate. Our work allows many popular, publicly available tools to be analyzed on a number of different datasets and parameter settings within a few days. The Motif Tool Assessment Platform, or MTAP, allows scientists to learn more about which TFBS detection tool is right for their area of expertise. We are able to show that when tools are run through an automated pipeline, they perform just as well (if not better) than when they are executed manually. MTAP converts the results of each tool we examine into one standard format, which can be used for evaluation, visualization, and integrative prediction. On a representative set of data, we are able to give insight into which programs perform best in varying domains. Finally, we can identify the overall areas where TFBS prediction is lacking, and where TFBS prediction algorithms perform well.



Methods

The Motif Tool Assessment Platform, or MTAP, is a platform for the integration of motif discovery tools. To accommodate a wide diversity of libraries and dependencies, the MTAP architecture was implemented using Java, Perl, Python, and C++. To handle the computational demands, MTAP evaluates prediction tools simultaneously on a computer cluster. To ensure that our automation process reported results similar to those from manual investigation, we included 5 of the 13 tools from Tompa et al.'s evaluation of TFBS detection programs [1]. The remaining 8 methods were not included because they were unobtainable, obsolete, or existed only as web versions. To expand beyond what had already been evaluated, we included additional tools based on popularity, availability, and the ability to be executed on the command line.



Input / Output

Input datasets were created using peer-reviewed public databases. Data was collected from RegulonDB, DBTBS, and Tompa et al.'s assessment on fly, human, yeast, and mouse.

Outputs from each tool vary, so a custom parsing class was designed for each individual tool. The results are stored in the pipeline scaffold framework for scoring, and in a standard output format for later retrieval.
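The per-tool parsing classes might be organized as in the sketch below. Everything here is hypothetical: the base-class interface, the line format consumed by `MemeParser` (which is not MEME's real output format), and the tab-separated standard format are all illustrative assumptions.

```python
class MotifParser:
    """Base class: each tool gets a subclass that converts its raw
    output into uniform (sequence_id, start, end) site records."""
    def parse(self, raw):
        raise NotImplementedError

class MemeParser(MotifParser):
    """Hypothetical parser for lines of the form '<seq_id> <start> <width>'."""
    def parse(self, raw):
        sites = []
        for line in raw.strip().splitlines():
            seq_id, start, width = line.split()
            start = int(start)
            sites.append((seq_id, start, start + int(width)))
        return sites

def to_standard_format(sites):
    """Serialize site records into one tab-separated standard format."""
    return "\n".join(f"{s}\t{a}\t{b}" for s, a, b in sites)
```

With one such subclass per tool, the scoring and visualization stages only ever see the standard format.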

A Novel Approach for Integrating and Assessing Computational
Tools to Detect Transcription Factor Binding Sites

Kathryn Dempsey*, Daniel Quest*
°
, Mohammad Shafiullah*, Dhundy Bastola* and Hesham Ali*
°


*College of Information Science & Technology, University of Nebraska at Omaha, Omaha, NE 68182

°
Department of Pathology & Microbiology, University of Nebraska Medical Center, Omaha, NE 68198





MTAP gives insight into program performance over different problem features. MTAP was run over the same 400 bp sequences generated from RegulonDB as in Figure 5. Figure 6 was produced with MTAP by plotting the ratio of information content in the motif relative to background sequence size versus site sensitivity (sSn). Increasing the information content of the motif and decreasing the background sequence noise is expected to improve tool prediction performance.
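Motif information content is commonly computed, for a set of aligned site instances, as the sum over columns of the relative entropy against the background distribution. The sketch below assumes a uniform 0.25 background, which may differ from the background model MTAP actually used for Figure 6.

```python
import math

def information_content(sites, bg=0.25):
    """Total information content (bits) of a motif, summed over columns,
    from equal-length aligned site instances:
    IC_col = sum_b p_b * log2(p_b / bg), with a uniform background."""
    n, width = len(sites), len(sites[0])
    total = 0.0
    for col in range(width):
        counts = {}
        for s in sites:
            counts[s[col]] = counts.get(s[col], 0) + 1
        for c in counts.values():
            p = c / n
            total += p * math.log2(p / bg)
    return total

# A perfectly conserved 4-column motif carries 4 * log2(4) = 8 bits.
ic = information_content(["ACGT", "ACGT", "ACGT"])
```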

However, not all tools are able to take advantage of this feature. Some methods, such as AnnSpec, AlignAce, MEME, and Weeder (MEME and Weeder highlighted), exploit an increase in information; as a result, their site sensitivity increases with the number of sequences. Other programs do not appear to take advantage of the amount of information given. This highlights one area where efforts to improve TFBS prediction can be focused.















Discussion and Conclusions

Computational detection of TFBS remains a challenging problem. Automation of computational TFBS prediction makes assessment possible where there had previously been no genome-wide means of assessment. Overall, sensitivity and specificity remain low for the algorithms regardless of dataset, but by examining the patterns exposed by our work we can better suggest critical areas for improvement. We have shown that automating this process is not only possible, it performs just as well as runs executed by hand. Automation not only cuts down the time and labor required, it also eliminates opportunities for human error when running the tools, scoring the output, and so on. We have also shown that phylogenetic programs do not always outperform non-phylogenetic programs. These results raise further questions about phylogenetic programs and the additional information used in their implementation, such as the number and proximity of the orthologs used and the size of the phylogenetic tree. All of these can now be investigated with the aid of MTAP. Finally, we have presented representative datasets that give a glimpse into the many different ways we can rank and compare TFBS prediction tools. Our method for comparing tools exposes a unique view into the behavior of TFBS detection and enables the improvement of methods in the future.


References

[1] M. Tompa et al., "Assessing Computational Tools for the Discovery of Transcription Factor Binding Sites," Nature Biotechnology, vol. 23, no. 1, January 2005, pp. 137-144.

[2] H. Salgado et al., "RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12," Nucleic Acids Res., vol. 32 (2004), no. 1, pp. 65-67.

[3] http://biobase.ist.unomaha.edu/

Acknowledgement

This project was supported by the NIH grant number P20 RR016469 from the INBRE
Program of the National Center for Research Resources. We would like to thank the
developers of the motif detection tools for their help and for making their tools
available.