Microsoft doc - Bioinformatics at UALR


Oct 1, 2013 (3 years and 9 months ago)

ISDB: Interaction Sentence Database

Michael Bauer

Bioinformatics Masters Capstone Project




Abstract
1 Introduction
1.1 Background
1.2 Information Extraction
1.3 Hypothesis Generation
1.4 Purpose
2 Data Manipulation and Script Development
2.1 Repository Characteristics
2.2 Performance Issues
2.3 Program Description
2.4 Zipf’s Law
2.5 Problems with Sentences Files
3 Data Storage
3.1 Mysql DB vs. Flat Files
3.2 The MYSQL Database
4 Web Application
4.1 Graphical User Interface
4.2 How to Use
5 References
6 User Manual
6.1 Overview
6.2 Getting Started
6.3 Simple Query
6.4 Co-occurrence Query
6.5 Indirect Relationship Query
6.6 HTML Pages by Term
6.7 Contact Information
Appendix
Appendix A - CGI script for the web interface
Appendix B - Main program for traversing the tree and extracting the sentences
Appendix C - Script to match interaction-indicating terms in a sentence
Appendix D - Script to benchmark the performance of MYSQL vs flat files
Appendix E - Script to remove font tags and duplicate sentences
Appendix F - Script to populate the MYSQL tables
Appendix G - SQL script to initialize/create the MYSQL tables
Appendix H - Helper script for IO operations
Appendix I - Script to sort and remove duplicate sentences
Appendix J - Text to HTML Script
Appendix K - Interaction-indicating terms (440)




Abstract

There continues to be rapid growth in the amount of scientific literature available online. Consequently, it is becoming increasingly difficult to do comprehensive searches for specific biological information such as biological entity interactions. Scientific literature databases such as MEDLINE are often overwhelming due to the large amount of irrelevant text retrieved when performing a search for a particular chemical interaction. There is a need to shift aspects of the task of analyzing data from the researcher to the computer. The three main methods used to identify interactions in text are co-occurrence analysis, template matching, and natural-language processing; however, with all these methods, obtaining both high precision and high recall is a challenge. A greater knowledge of the sentence characteristics that may indicate an interaction between two biological entities is needed to aid in the creation of better-performing information extraction tools. I have created the Interaction Sentence Database (ISDB), which allows a researcher to retrieve a set of sentences fitting specific characteristics. MySQL was used as the database management system, and Perl code was written to parse and manipulate the data and populate the tables. The sentences in the database all contain at least two chemical terms and one interaction-indicating term. The web interface to the database allows the user to query for sentences based on an interaction-indicating term, a single biomolecule name, or two biomolecule names, and for sentences with indirect connections through an intermediate biomolecule name. Using the retrieved sentences to further characterize sentences that describe an interaction can be expected to improve the precision and recall of text mining and information retrieval tools.



1 Introduction

1.1 Background

We are fortunate to be in a time in which there are abundant online sources of scientific literature. In MEDLINE alone there are over 15.5 million articles, with approximately 500,000 new articles being added each year; see Figure 1 (Anonymous 2007; Bekhuis 2006).

Figure 1. This shows the growth in the number of articles available at MEDLINE. The data was obtained from http://www.nlm.nih.gov/bsd/medline_lang_distr.html




It is becoming increasingly difficult for human curators and researchers to manage the sheer volume of available articles. It takes a long time and is expensive to hand curate articles (Rebholz-Schuhmann et al. 2005). The development of better automated literature mining tools to perform information extraction and new relationship identification with high recall and precision is essential to be able to take full advantage of the massive amount of information that is available. The goal is to have the computer analyze the large amount of information instead of the researcher (Cohen and Hersh 2005).


The majority of information obtained from biological research is stored as text in journals and in the comment fields of databases such as the GenBank feature table (Hirschman et al. 2002). When a paper is submitted to a journal it is often left to a curator to pull information from the text into specialized databases. These specialized databases can include interaction, complex, and pathway databases. The use of these databases can help condense a paper down to its relevant facts (Corney et al. 2004). Instead of having to read a stack of papers, a simple query on a specialized database can quickly return the information of interest. Accurate information extraction tools are necessary to automate the process of pulling these facts from text to populate specialized databases.


1.2 Information Extraction

Information extraction is the process of pulling different pre-defined types of facts from text (Jensen Juhl et al. 2006). The facts may be specific relationships between biological entities; for example, protein-protein interactions or gene-protein interactions. The most commonly used methods for automated information extraction are co-occurrence analysis, template matching, and natural language processing:



Co-occurrence just requires that two entities co-occur in the same sentence. Co-occurrence analysis results in lower precision and higher recall than the template matching and natural language processing methods, which means that it finds many possible relationships but many are not true relationships (Jensen Juhl et al. 2006).

Template matching uses a text pattern to identify relationships automatically (Corney et al. 2004). A simple template would be: [protein1] ’binds with’ [protein2]; for example, it would match the protein relationship in the text “STAM binds with ubiquitin”. It can be time consuming to generate the different possible templates, which can become quite complex.

Natural language processing parses a sentence into the different parts of speech and uses a set of rules to identify possible relationships. The advantage of natural language processing is the ability to infer the direction of the relationship and distinguish between different relationships when more than two entities are present in the same sentence.
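The difference between the first two methods can be illustrated with a minimal sketch. It is written in Python rather than Perl, the entity list is a hypothetical three-word dictionary, and the single template is the ’binds with’ example from the text; none of this is taken from an existing extraction tool.

```python
import re

# Hypothetical entity dictionary; a real system would use a much larger list.
ENTITIES = {"STAM", "ubiquitin", "insulin"}


def cooccurring_pairs(sentence):
    """Co-occurrence: every pair of known entities found in the same sentence."""
    words = re.findall(r"[A-Za-z0-9]+", sentence)
    found = sorted({w for w in words if w in ENTITIES})
    return [(a, b) for i, a in enumerate(found) for b in found[i + 1:]]


# Template: [protein1] 'binds with' [protein2]
BINDS_WITH = re.compile(r"\b(\w+) binds with (\w+)\b")


def template_pairs(sentence):
    """Template matching: only pairs joined by the literal pattern are returned."""
    return [(a, b) for a, b in BINDS_WITH.findall(sentence)
            if a in ENTITIES and b in ENTITIES]


if __name__ == "__main__":
    for s in ["STAM binds with ubiquitin.",
              "STAM and ubiquitin were measured separately."]:
        print(cooccurring_pairs(s), template_pairs(s))
```

Both sentences yield a co-occurrence pair, but only the first satisfies the template, which is the precision and recall trade-off described above in miniature.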

It is the goal of many information extraction tools to find unknown protein-protein interactions (Cooper and Kershenbaum 2005); for example, SUISEKI (Blaschke and Valencia 2001) is a tool which uses a combination of co-occurrence and natural language processing.


The accuracy of the different implementations for extracting relationships is often measured using a common evaluation method. The following values are used to calculate recall and precision.

True positives (TP): Correct relationships/extractions

False positives (FP): Type I errors

False negatives (FN): Type II errors

Recall is the fraction of relevant documents that were retrieved and is also known as sensitivity. It is calculated using

\[ \text{Recall} = \frac{TP}{TP + FN}. \]

Precision is the fraction of retrieved relationships that are relevant and is also known as positive predictive value. It is calculated using

\[ \text{Precision} = \frac{TP}{TP + FP}. \]
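As a worked illustration of how true positives, false positives, and false negatives combine into the two measures, the following Python sketch scores a set of extracted relationships against a gold-standard set. The function name and the example pairs are hypothetical and serve only to show the arithmetic.

```python
def evaluate(extracted, gold):
    """Return (precision, recall) for a set of extracted pairs against a gold set."""
    tp = len(extracted & gold)   # true positives: extracted and correct
    fp = len(extracted - gold)   # false positives: extracted but wrong (Type I)
    fn = len(gold - extracted)   # false negatives: correct but missed (Type II)
    precision = tp / (tp + fp) if extracted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")}
    extracted = {("A", "B"), ("B", "C"), ("X", "Y")}
    print(evaluate(extracted, gold))
```

With these made-up sets there are two true positives, one false positive, and two false negatives, so precision is 2/3 and recall is 1/2.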


1.3 Hypothesis Generation

Once the facts have been accurately pulled from the text of several different publications, unknown or undocumented relationships can be inferred. The increased fragmentation of science and highly specialized fields leads to a disconnect between researchers (Cohen and Hersh 2005; Swanson 1991). Possible connections go undiscovered because there is no researcher with enough cross-disciplinary knowledge to make the connections (Blaschke and Valencia 2001). Automated information extraction tools can pull facts from different disciplines and find possible connections that would have been overlooked. This is often called hypothesis generation or knowledge discovery. After the possible connections have been found, the researcher has a set of hypotheses that can be scientifically verified. These hypotheses are generated by finding indirect relationships. Suppose there is some causal relationship between entity ‘A’ and entity ‘B’ in one document, and in another unconnected document there is a causal relationship between entity ‘B’ and entity ‘C’. It may then be inferred that there is a relationship between entities ‘A’ and ‘C’. Figure 2 shows an abstract representation of the concept behind hypothesis generation.


Figure 2. Shows an indirect relationship between entities ‘A’ and ‘C’ through entity ‘B’.
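The A, B, C inference can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (relationships are treated as undirected, and a candidate is reported only when ‘A’ and ‘C’ are not already directly linked); the function name is hypothetical.

```python
from collections import defaultdict


def indirect_candidates(relations):
    """Return (A, B, C) triples where A-B and B-C are known but A-C is not."""
    neighbours = defaultdict(set)
    for a, b in relations:
        neighbours[a].add(b)
        neighbours[b].add(a)
    direct = dict(neighbours)  # freeze so lookups cannot add keys
    candidates = set()
    for b, linked in direct.items():
        for a in linked:
            for c in linked:
                # a < c avoids reporting both (A, B, C) and (C, B, A)
                if a < c and c not in direct[a]:
                    candidates.add((a, b, c))
    return sorted(candidates)


if __name__ == "__main__":
    print(indirect_candidates([("A", "B"), ("B", "C")]))
```

Each returned triple is a hypothesis to be verified, not an established relationship.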


According to Dr. Swanson, a leader in hypothesis generation using scientific literature:

The reward system and ethos of science … recognize only the physical world as a source of new knowledge. The literature tends to be seen as a sort of knowledge necrology, a mechanism of diffusion that supports laboratory-based discovery, but without a life of its own. Science may be better served by a new image of its literature as a vast mosaic of undiscovered connections, a potential source of countless recombinant ideas - a world with its own endless frontier (Swanson 1990, p.36).


1.4 Purpose

The purpose of this project was to create a tool that would give researchers greatly improved access to a unique database of sentences. The Interaction Sentence Database (ISDB) contains sentences that have at least two biomolecule names and at least one interaction-indicating term. The researcher therefore does not have to sift through other online resources to find sentences for their research. They can spend more time finding other sentence characteristics that can help predict whether two terms that co-occur in a sentence truly represent an interaction relationship. Researchers can retrieve datasets of sentences that contain a certain interaction term or a biomolecule name. They can research different combinations of statistical and natural language processing techniques on the sentences that have entities that co-occur to maximize the precision and/or recall of relationships. This tool can also return a set of sentences that represent an indirect relationship. This will allow researchers to study relationships that span multiple sentences and are linked by an intermediate term. ISDB is expected to facilitate the advancement and improvement of automated literature mining tools.


2 Data Manipulation and Script Development

2.1 Repository Characteristics

The MedRep repository was created by Ding Jing and is comprised of sentences that contain at least two different biomolecules (http://bioinformatics.ualr.edu/dan/medrep/) (Ding 2003). The sentences were obtained from MEDLINE abstracts. A dictionary of 40,000 biochemical terms (80,000 names) was developed and used to identify the sentences. The sentences are organized in a two-level tree structure and indexed by the first letter of each of the two biomolecules. The first level is the first biochemical term of interest and the second level is the second biochemical term. The sentences are located in html files along with other sentences that have the same biomolecules, and these files are grouped in directories depending on the first letters of the two biomolecules. The html pages also have links back to the abstracts from which the sentences were obtained. The size of the database is around 19 gigabytes and is comprised of over 200,000 files. The size can further be broken down into 8 gigabytes of plain text sentences and 11 gigabytes of html code. There are over 900,000 biochemical pairs which are located in 175 million sentences in the database.


A Perl script that I wrote (see Appendix C) identified and counted the number of the most common interaction-indicating terms in the entire database. A time-stamp was taken every time 10,000 files were processed, and from this data it was possible to observe the distribution of the sentences in the repository by the amount of time spent on each successive 10,000-file group. Figure 3 shows that there is clearly an uneven distribution of sentences in the section of the repository where the sentences that have a biomolecule starting with ‘I’ are located. Further analysis of this particular section of the repository revealed several html files that are over 35 megabytes; for example, the INTERLEUKI1~INTERLEUKI1.htm file. This can be verified by going to http://bioinformatics.ualr.edu/dan/medrep/II .


Figure 3. A graph showing the time spent analyzing the different files. Most of the script time is spent on files that begin with ‘I’. Further investigation revealed that many files in this section were over 35 megabytes in size.


2.2 Performance Issues

There are several characteristics of the MedRep repository that lend it to easy manual access but which make automatic analysis more difficult. The sentences are embedded in html files, which makes them easily accessible to users over the internet through a browser but requires more processing time to parse out the sentences from the html code. Another performance issue I encountered is that the script accesses the repository remotely, which can hinder performance somewhat unpredictably depending on network speed and congestion. MedRep enables users to find sentences by searching for two biological terms in either order, but for automatic analysis this requires the tree traversal algorithm to visit each file twice and thus doubles the time necessary to traverse the entire repository.

I also observed that some sentences are duplicated multiple times in the repository, but it is not clear to what extent the problem exists. These duplicate sentences pose a problem when performing statistical analysis on the sentences by skewing the results obtained.

The initial analysis of the performance of the script used for detecting and counting the number of interaction-indicating terms has shown that this task is processor intensive. I observed that the CPU-intensive task of string matching the 427 interaction-indicating terms against the sentences in the repository increased the processing time approximately 20 times compared to the same script without the term comparison. The script was run on a desktop. Server, processor speeds, and amount of memory have a significant effect on the time needed for the script to process the entire repository. Figure 3 shows the total time spent on each section. A huge spike in the ‘I’ section shows that nearly 60% of the program processing time is spent in this section.

2.3 Program Description

Five major scripts were written to manipulate the data (Figure 6). The scripts used to traverse and analyze the MedRep database were written in Perl. Perl was chosen for its pattern matching abilities and ease of use of regular expressions. The main script traverses the repository in post order, first visiting the left subtree, then the right subtree, and finally the parent node. The script is designed to be run on any computer with a Perl interpreter and an internet connection. The script grabs a Web page from the server using the home page as the root node. The page is then scanned for links and these links are placed on a stack. Once all the links have been pushed onto the stack, one link is popped off the top of the stack and that link is followed and the procedure is repeated. When no more links within the MedRep repository are found in the html code, the file is a leaf node. That page is sent to another subroutine that parses this html page for sentences. The regular expression shown in Figure 4 was used to extract the sentences from the HTML code, shown in Figure 5.
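The stack-based walk described above can be sketched as follows. This is a Python illustration, not the Perl script from the appendix: the pages are a hypothetical in-memory dictionary instead of HTTP requests, the link pattern is a simplified assumption, and a `seen` set is added so that no page is followed twice.

```python
import re

# Hypothetical miniature repository: page name -> HTML source.
PAGES = {
    "index.htm": '<a href="A.htm">A</a> <a href="I.htm">I</a>',
    "A.htm": '<a href="A1~A2.htm">pair</a>',
    "I.htm": '<a href="I1~I2.htm">pair</a>',
    "A1~A2.htm": "<p>leaf A</p>",
    "I1~I2.htm": "<p>leaf I</p>",
}

LINK = re.compile(r'href="([^"]+)"')  # simplified link pattern (assumption)


def leaf_pages(root, fetch=PAGES.get):
    """Depth-first walk with an explicit stack; returns pages that contain no links."""
    stack, seen, leaves = [root], {root}, []
    while stack:
        page = stack.pop()                      # follow the link on top of the stack
        links = LINK.findall(fetch(page) or "")
        if not links:                           # no links found: this is a leaf node
            leaves.append(page)
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                stack.append(link)
    return leaves


if __name__ == "__main__":
    print(leaf_pages("index.htm"))
```

In the real script each leaf page would then be handed to the sentence-parsing subroutine.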


Figure 4. Perl implementation of the regular expression used to extract sentences from the HTML code. Different tags and word boundaries are used to locate the sentences.

Figure 5. A sample of HTML code containing a sentence. A regular expression was written that extracted the sentences and PMIDs from the HTML code.


These sentences are sent to other subroutines for further analysis (see Appendix). The sentences are sent to a subroutine that locates the interaction-indicating terms in the sentences and places the sentences in the appropriate file. First, the interaction-indicating term list is read in from a file and it is then sent to a subroutine that creates a regular expression that will match all the words in the list efficiently. The module Regexp::List was used and obtained from CPAN, the Perl module library. The regular expression that is returned is used in a matching statement. Every time a term is matched, the sentence is written to a hash table where the key is the interaction-indicating term and the value is the sentence. A sentence can be written multiple times depending on how many times a term is matched in the sentence and if there is more than one interaction-indicating term. Sentences are added to the hash table until 10,000 sentences have been added. Then the sentences are written to files using the interaction term as the text file name. A time stamp is also written to a separate file to keep track of progress.
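A rough Python equivalent of the matching step is sketched below. A plain alternation built with the standard `re` module stands in for the optimized expression that Regexp::List produces, the term list and sentences are made up, and the periodic flush to per-term files is left out to keep the example short.

```python
import re
from collections import defaultdict


def build_term_regex(terms):
    """Combine all terms into one case-insensitive, whole-word alternation."""
    # Longer terms first so a longer term is preferred over its own prefix.
    ordered = sorted(set(terms), key=lambda t: (-len(t), t))
    pattern = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def bucket_sentences(sentences, terms):
    """Map each matched term to the list of sentences it occurs in."""
    regex = build_term_regex(terms)
    buckets = defaultdict(list)
    for sentence in sentences:
        # finditer reports every occurrence, so a sentence can be stored
        # several times, as described in the text.
        for match in regex.finditer(sentence):
            buckets[match.group(0).lower()].append(sentence)
    return dict(buckets)


if __name__ == "__main__":
    demo = ["STAM binds ubiquitin.", "A inhibits B and binds C.", "No interaction here."]
    print(bucket_sentences(demo, ["binds", "inhibits"]))
```

Keying the dictionary by term mirrors the hash table whose contents are later written to one file per interaction-indicating term.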


Figure 6. Structure and work flow of the sentence extraction and database building scripts.

2.4 Zipf’s Law

From a graph of the frequency of occurrence of the interaction-indicating terms, it was observed that the graph closely corresponded to Zipf’s Law. Zipf’s Law can describe the occurrence of words in a corpus and implies that a few words occur at a high frequency and many more occur at a very low frequency (Black 2006). Zipf’s law implies that knowing and having a complete list of all the terms that may indicate an interaction is difficult because most terms that are used are used infrequently and words that indicate an interaction are varied and evolving (Rebholz-Schuhmann et al. 2005). The use of Zipf’s law can let us see how well our list of terms compares to the expected occurrence of the target terms (Powers 1998). The plot, Figure 7, does not exactly conform to the law, which may be due to not all of the verbs / interaction-indicating terms being identified. There should be a larger number of less frequently occurring interaction-indicating terms.
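One common way to check how closely a list of term counts follows Zipf’s law is to fit a straight line to log(frequency) against log(rank); an ideal Zipfian list has a slope near -1. The Python sketch below shows that check on synthetic counts. It is an illustration of the general technique, not the analysis that produced Figure 7.

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(frequency) versus log(rank)."""
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    # Synthetic counts that follow frequency = 1000 / rank exactly.
    ideal = [1000 / rank for rank in range(1, 51)]
    print(zipf_slope(ideal))
```

A list that is missing its rare terms would show a tail that drops off faster than the fitted line, which is the deviation discussed above.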


Figure 7. This graph closely resembles Zipf’s law, which states that a few words in a corpus occur at a high frequency and many more occur at a very low frequency. This graph shows the occurrence of the different interaction-indicating terms. It is clear that there are a few terms that occur often and many more that occur infrequently. Only a few terms are written below the graph due to space limitations.


2.5 Problems with Sentences Files

There are several issues that arose after the sentences were put into files named after the interaction-indicating terms that they contained. There is an uneven distribution of sentences across files. Some files contain hundreds of thousands of sentences while others have only one or two sentences. The files range in size from a few bytes to three gigabytes. After analyzing some of the files that were created, it became apparent that the original database contained many duplicate sentences.


The module (Regexp::List) that is supposed to create an efficient regular expression from a list of words also seemed to have some problems. When compared to the original approach of matching each verb individually, the generated regular expression only found the first verb that it encountered in the sentence and moved on to the next sentence.


To correct the problem with the regular expression module (Regexp::List), the case insensitivity and global search options had to be implemented. When using the regular expression that is created by the module, every occurrence of the term is found, so there was a dramatic increase in the number of duplicate sentences. The total size of all the interaction-indicating term files after the initial filtering of the sentences into the verb files was over 200GB, at which time the server ran out of disk space. To catch some of the duplicates, a piece of code was added to remove duplicate sentences that happened to be next to each other. This reduced the total size of the files to 111GB. The total number of sentences analyzed was 183,297,525 and the number of sentences placed into files was 261,415,504. Sentences can be placed in more than one file if they contain more than one interaction-indicating term.


Many of the duplicate sentences were not next to each other; to remove the bulk of these duplicate sentences, a new script was written. To remove duplicates, the sentences were sorted and then each line was read in. If it was a duplicate of the previous sentence, it was removed. Due to the large size of many of the files (up to 3GB), it was necessary to sort the files on the disk instead of in memory. A suitable Perl module (Sort::External) was located at CPAN; this routine sorts items in memory and periodically writes them to temporary files. It then puts the sorted files together. After the sort script was run and the duplicate sentences removed, the total size of all the files was dramatically reduced to 10GB. A quick analysis of the files reveals that many of the duplicates had been removed but there are still quite a few cases due to different markup tags on otherwise identical sentences. To remove these sentences the FONT tag was removed and the sentences were written to a temporary file. The temp file was opened and the sentences were written to the interaction-indicating term file only if they had not already been written back. The total size of the files was then reduced to 3.1GB and the total number of sentences to 9,185,141. This represents a total reduction of 96%.
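The clean-up steps can be condensed into a short Python sketch. It simplifies the procedure in two ways that should be kept in mind: the sort happens in memory (the project needed the on-disk Sort::External module for multi-gigabyte files), and the FONT tags are stripped before sorting rather than in a later pass. The tag pattern and sample sentences are assumptions. The last lines recompute the overall reduction from the sentence counts given in the text.

```python
import re

# Assumed shape of the markup to strip: opening and closing FONT tags.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def drop_adjacent_duplicates(lines):
    """Yield each line unless it equals the line immediately before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def dedupe(sentences):
    """Strip FONT tags, sort, then remove the now-adjacent duplicates."""
    stripped = (FONT_TAG.sub("", s) for s in sentences)
    return list(drop_adjacent_duplicates(sorted(stripped)))


# Reduction from 261,415,504 sentences placed into files to 9,185,141 kept.
reduction = 1 - 9_185_141 / 261_415_504

if __name__ == "__main__":
    demo = ['<font color="red">A</font> binds B.', "A binds B.",
            "C inhibits D.", "A binds B."]
    print(dedupe(demo))
    print(f"{reduction:.1%}")
```

Sorting first is what allows a single pass comparing each line with its predecessor to catch duplicates that were originally far apart.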

3 Data Storage

3.1 Mysql DB vs. Flat Files

One design decision that needed to be made was whether to continue using flat html files for the repository or to use a relational database system. A database is easier to update and to keep consistent compared to flat files. Databases can help manage multiple users. Complex queries can also be used in a database using SQL statements. Another advantage of a database is that duplicates can be more readily controlled, which results in less storage space needed, better representation of reality, easy modification of the data, and ease of use if a user needs a combination of records. A disadvantage of databases is that they are more complex to set up and manage. They also require additional software. Another disadvantage is that they may be slower in some situations. Flat files have the advantage that the sentences are broken out by verb/interaction term ahead of time in our system. In some cases accessing flat files may also be faster. Table 1 shows the results of a benchmark test for grabbing sentences containing a particular interaction-indicating term. Flat file access was approximately twice as fast as the database query for the same term. A disadvantage of flat files is that they are not all equal in size so there may be memory issues if a large file is read into memory.


        Rate     MYSQL   FLAT
MYSQL   559/s    --      -47%
FLAT    1053/s   88%     --

Table 1. A performance comparison between a MYSQL database and flat text files. A script was written that compares the speed to grab all of the sentences in a file versus the sentences in the MYSQL database that contain the interaction-indicating term “acetylate”. The two functions were compared 1,000 times. The chart shows the rate per second and the percent speed difference between the two tests. Reading the lines from the flat text files was nearly twice as fast as performing the query.


It was decided to use both flat files and a MYSQL database. In addition to building the MYSQL database I created flat files with sentences separated by the interaction-indicating terms. The flat files were converted to simple html files for easy access over the internet.

In the future it may be useful to parse all the words in the sentences into a database so that different analyses of the sentences can be performed on parts of speech other than the interaction-indicating term or in combination with each other.



3.2 The MYSQL Database

To add more options for selecting sentences, the MYSQL database management system was employed. MYSQL is an open source database management system which is fast and relatively simple to use. Initially, four tables were developed: sentence, interaction term, chemical word, and chemical synonym tables. I developed a PERL script to parse the previously-created files and populate the tables. The sentences in the files are read in a single line at a time. For each sentence that is processed, a unique ID is created. The sentence has the PMID number of the abstract from which it comes concatenated to the end, delimited with a ‘|’ symbol, and the interaction-indicating terms it contains, delimited with ‘@’ symbols. The script splits the sentence on the ‘|’ symbol and the unique ID, the sentence, and the PMID number are inserted into the sentence table. The script separates the interaction-indicating terms by the ‘@’ symbol and they are inserted into the interaction table with the unique sentence id and a unique id, which is created for each term. An example sentence can be seen in Figure 8. The example sentence contains two interaction-indicating terms “binding” and “improvement”.


Figure 8. Example sentence with additional information concatenated to the end of the sentence. The “|” delineates the PMID number 8296790 and the “@” symbols mark the interaction-indicating terms “binding” and “improvement”. This information is used by the script that populates the MYSQL database.
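Splitting such a line into its parts might look like the Python sketch below. The exact layout is an assumption inferred from the description (sentence text, then ‘|’, then the PMID, then each term introduced by ‘@’), and the sentence used in the example is invented; only the PMID and the two terms come from Figure 8.

```python
def parse_record(line):
    """Split 'sentence|PMID@term1@term2' into its three components."""
    # rpartition splits on the last '|' so any '|' inside the sentence survives.
    sentence, _, tail = line.rstrip("\n").rpartition("|")
    pmid, *terms = tail.split("@")
    return {"sentence": sentence, "pmid": pmid, "terms": [t for t in terms if t]}


if __name__ == "__main__":
    # Hypothetical sentence text; the layout after '|' is assumed.
    print(parse_record("X shows improvement in binding of Y.|8296790@binding@improvement\n"))
```

The three fields correspond to what the population script inserts into the sentence and interaction tables.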


The script next obtains the chemical term and its synonyms by using a regular expression that grabs the biomolecule terms found in between the chemical markup tags in the sentence; the synonym is found in the tag (see Figure 9).

Figure 9. Perl regular expression used to grab the chemical terms in the sentences.

A unique id was created for each set of terms that is located in a sentence. The chemical terms were inserted into the chemical word table along with the sentence unique ID and its own unique ID. Each chemical synonym term was inserted into the synonym table associated with the chemical term id.



An additional table was created within the MYSQL database management environment using SQL statements. The table that was created holds co-occurrences where each row contained two words that co-occurred in the same sentence. The following is the SQL statement that was used:

CREATE TABLE relationships
SELECT a.sid AS sid, a.cword AS word1, b.cword AS word2
FROM chemical AS a, chemical AS b
WHERE a.sid = b.sid AND a.cword != b.cword;


The statement creates two aliases, ‘a’ and ‘b’, of the chemical table and finds words that are not equal but have the same sentence id. Then it populates the newly created table. Figure 10 provides an entity-relationship (ER) diagram which shows the interrelationship between all the tables in the database. You can see that the sentence table is the parent table for all the other tables.


Figure 10. An entity-relationship (ER) diagram showing the interrelationships between the entities in the Interaction Sentence Database.
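The behaviour of the co-occurrence self-join can be tried out on a few rows with Python's built-in sqlite3 module. This is only a small stand-in for the MYSQL setup: SQLite needs the `CREATE TABLE ... AS SELECT` spelling, the `cid` column and the sample rows are assumptions, and the table and column names otherwise follow the statement in the text.

```python
import sqlite3


def cooccurrence_rows():
    """Build a tiny chemical table and derive the co-occurrence rows from it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE chemical (cid INTEGER PRIMARY KEY, sid INTEGER, cword TEXT);
        INSERT INTO chemical (sid, cword) VALUES
            (1, 'STAM'), (1, 'ubiquitin'),
            (2, 'insulin'), (2, 'glucose'), (2, 'STAM');
        -- Self-join: pair every word with every other word in the same sentence.
        CREATE TABLE relationships AS
            SELECT a.sid AS sid, a.cword AS word1, b.cword AS word2
            FROM chemical AS a, chemical AS b
            WHERE a.sid = b.sid AND a.cword != b.cword;
    """)
    rows = conn.execute(
        "SELECT sid, word1, word2 FROM relationships ORDER BY sid, word1, word2"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in cooccurrence_rows():
        print(row)
```

Because the join is symmetric, every pair is stored in both orders, so a sentence with n distinct words contributes n(n-1) rows; this is worth keeping in mind when estimating the size of tables derived from it.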


There was an attempt to create a second table that would represent terms that are indirectly related. Term ‘A’ co-occurs with a ‘B’ term and the ‘B’ term co-occurs with a ‘C’ term, so ‘A’ is indirectly related to the ‘C’ term. The following SQL statement was used to find indirect relationships and to populate a new table.

CREATE TABLE indirect
SELECT a.sid AS sid1, b.sid AS sid2, a.word1 AS A, a.word2 AS B, b.word1 AS C
FROM cooccurrence AS a, cooccurrence AS b
WHERE a.word2 = b.word2 AND a.sid != b.sid;


Again, two aliases were made, this time using the co-occurrence table. Words were selected where the two second terms from the co-occurrence table were equal and the sentence IDs were not equal. The problem with this statement was that the table grew to over 123 gigabytes in size and possibly over a billion rows. Such a table would be very slow to query. Instead of building a permanent table containing all the possible indirect relationships, I decided to find these relationships on the fly when the user supplies the terms. The following SQL statement is used:

SELECT sentence.sent, sentence.pmid
FROM sentence
LEFT JOIN (
    SELECT DISTINCT(b.sid)
    FROM cooccurence AS a, cooccurence AS b
    WHERE a.sid != b.sid
      AND a.word1 = 'term1'
      AND a.word2 = 'term2'
      AND b.word2 = 'term2'
) AS indirect
ON sentence.sid = indirect.sid;


This statement returns sentences and their corresponding PMID numbers. A second SQL statement is used to return just the words that are possibly indirectly related to another term. This query is the same as above minus the join, which makes it faster.
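To illustrate the on-the-fly lookup, the following is a minimal sketch under stated assumptions rather than the production code: it uses Python with in-memory SQLite instead of MySQL, invented sample rows, and hypothetical function names. It also assumes an inner join is intended for the sentence query, so that only matching sentences are returned, and it passes the user-supplied terms as bound parameters.

```python
import sqlite3

def make_db():
    """Create a tiny sentence/cooccurrence database with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sentence (sid INTEGER PRIMARY KEY, sent TEXT, pmid INTEGER);
        CREATE TABLE cooccurrence (sid INTEGER, word1 TEXT, word2 TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO sentence VALUES (?, ?, ?)",
        [(1, "A binds B.", 111), (2, "C activates B.", 222), (3, "D inhibits E.", 333)],
    )
    pairs = [(1, "A", "B"), (2, "C", "B"), (3, "D", "E")]
    # Store each pair in both orders, as the co-occurrence self-join would.
    both = pairs + [(sid, w2, w1) for sid, w1, w2 in pairs]
    conn.executemany("INSERT INTO cooccurrence VALUES (?, ?, ?)", both)
    return conn

def indirect_terms(conn, a_term, b_term):
    """Word-list variant: candidate 'C' terms that share 'B' with 'A'."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.word1
        FROM cooccurrence AS a, cooccurrence AS b
        WHERE a.sid != b.sid
          AND a.word1 = ? AND a.word2 = ?
          AND b.word2 = ? AND b.word1 != ?
        ORDER BY b.word1
        """,
        (a_term, b_term, b_term, a_term),
    ).fetchall()
    return [r[0] for r in rows]

def indirect_sentences(conn, a_term, b_term):
    """Sentence variant: sentences (other than A's) in which 'B' co-occurs."""
    return conn.execute(
        """
        SELECT s.sent, s.pmid
        FROM sentence AS s
        JOIN (
            SELECT DISTINCT b.sid AS sid
            FROM cooccurrence AS a, cooccurrence AS b
            WHERE a.sid != b.sid
              AND a.word1 = ? AND a.word2 = ?
              AND b.word2 = ?
        ) AS indirect ON s.sid = indirect.sid
        ORDER BY s.sid
        """,
        (a_term, b_term, b_term),
    ).fetchall()

conn = make_db()
terms = indirect_terms(conn, "A", "B")
sentences = indirect_sentences(conn, "A", "B")
print(terms, sentences)
```

Because the subquery is restricted to the two supplied terms, only a small slice of the co-occurrence table is examined per request, which is the reason for preferring this approach over the permanent table.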

Figure 11 represents sample rows from all the tables in the database.




Figure 11. Samples of the information contained in each table in the database. A) The parent table to all the other tables. Contains the sentence ID, the sentence, and the PMID number. B) Table that contains the biomolecule name. Holds the chemical ID, the sentence it is from, and the biomolecule name. C) Contains the interaction-indicating term, the sentence it is from, and its unique ID. D) Built from the interaction table; contains all the biomolecule names that co-occur and the sentence IDs of the sentences they co-occur in. E) The child of the interaction table; contains the synonyms of the biomolecule names.


It is often desirable to have a random set of sentences to validate results. An option for the simple and co-occurrence queries allows users to decide whether they want the full data set or a random subset of the query results. MySQL supports the clause ORDER BY RAND(), which sorts the result rows in random order. The LIMIT clause is then used to return the desired number of randomized sentences. The following returns 100 randomized results for a simple query with the interaction-indicating term “forms”:

SELECT sentence.sent, sentence.pmid
FROM sentence
LEFT JOIN interaction
ON sentence.sid = interaction.sid
WHERE term = 'forms'
ORDER BY RAND() LIMIT 100;
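The random-subset option can be sketched as follows. This is an illustrative example only: it uses Python with in-memory SQLite, where the random function is spelled RANDOM() rather than MySQL's RAND(), and the sample rows and the function name random_sentences are invented.

```python
import sqlite3

def random_sentences(conn, term, limit):
    """Return up to `limit` randomly chosen sentences containing `term`."""
    return conn.execute(
        """
        SELECT sentence.sent, sentence.pmid
        FROM sentence
        JOIN interaction ON sentence.sid = interaction.sid
        WHERE interaction.term = ?
        ORDER BY RANDOM()
        LIMIT ?
        """,
        (term, limit),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sentence (sid INTEGER PRIMARY KEY, sent TEXT, pmid INTEGER);
    CREATE TABLE interaction (iid INTEGER PRIMARY KEY, sid INTEGER, term TEXT);
    """
)
# Invented data: 20 sentences with 'forms' and 5 with 'binds'.
for i in range(25):
    term = "forms" if i < 20 else "binds"
    conn.execute("INSERT INTO sentence VALUES (?, ?, ?)", (i, f"X{i} {term} Y{i}.", 1000 + i))
    conn.execute("INSERT INTO interaction (sid, term) VALUES (?, ?)", (i, term))

sample = random_sentences(conn, "forms", 5)
for sent, pmid in sample:
    print(pmid, sent)
```

Sorting by a random value requires the database to assign a value to every matching row before applying the limit, so on very frequent terms this kind of query can be noticeably slower than the unrandomized one.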




4 Web Application

4.1 Graphical User Interface

The graphical user interface to the ISDB (Interaction Sentence DataBase) is a web-based application built on the LAMP (Linux, Apache, MySQL, and Perl) framework. These are all open source software and are freely available. Linux is the operating system, Apache is the web server, MySQL is the database management system, and Perl is the programming language used to build the dynamic web pages. Perl is used to create CGI (Common Gateway Interface) scripts, which the client and the server use to communicate. Different forms are built depending on the type of query the user wants. The script takes the information that the user supplies and chooses the correct SQL statement to issue. MySQL is integrated with Perl using the DBI module, which allows relatively easy access to the database. The SQL statements are sent to the database and the results, if any, are returned. The script next builds a page that contains the sentences which are the results of the query and displays them in the browser window (see Figure 12). To reduce the number of times invalid user information is supplied to the server, JavaScript code checks that all appropriate fields are filled in for each query. Cascading Style Sheets (CSS) were also used to customize some of the buttons.
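The step in which the script chooses an SQL statement from the submitted form can be sketched as below. This is a hypothetical illustration in Python, not the Perl logic of isdb.cgi: the form field names (mode, queryby, verb, chemical, chem_one, chem_two) are taken from the CGI script in Appendix A, while the function name choose_query, the exact SQL text, and the table name cooccurrence are assumptions. Placeholders (?) are used so that user input is passed as bound values rather than pasted into the SQL string.

```python
def choose_query(form):
    """Map submitted form fields to an (sql, params) pair (hypothetical sketch)."""
    mode = form.get("mode", "simple")
    if mode == "simple":
        if form.get("queryby") == "cterm":
            # Simple query by biomolecule (chemical) term.
            return (
                "SELECT s.sent, s.pmid FROM sentence AS s "
                "JOIN chemical AS c ON s.sid = c.sid WHERE c.cword = ?",
                (form["chemical"],),
            )
        # Simple query by interaction-indicating term (the form's default).
        return (
            "SELECT s.sent, s.pmid FROM sentence AS s "
            "JOIN interaction AS i ON s.sid = i.sid WHERE i.term = ?",
            (form["verb"],),
        )
    if mode == "co_occurrence":
        return (
            "SELECT s.sent, s.pmid FROM sentence AS s "
            "JOIN cooccurrence AS r ON s.sid = r.sid "
            "WHERE r.word1 = ? AND r.word2 = ?",
            (form["chem_one"], form["chem_two"]),
        )
    raise ValueError(f"unsupported mode: {mode}")

sql, params = choose_query({"mode": "simple", "queryby": "iterm", "verb": "forms"})
print(sql)
print(params)
```

Perl's DBI module offers the same placeholder mechanism through prepare and execute, so the idea carries over directly to the CGI script.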




Figure 12. Structure of the ISDB server and client interaction.


4.2 How to Use


The web interface (bioinformatics.ualr.edu/~mbauer/cgi-bin/ISDB/isdb.cgi) allows the user to query for different sets of sentences that fit particular criteria. There are three different types of queries:



The first is a simple query that allows one to query for sentences that contain a certain interaction-indicating term or biomolecular term, depending on which radio button is selected on the form (Figure 13A). A list of sentences that contain the search term is returned.





The second type of query is for sentences that contain two biomolecule name terms that co-occur in a sentence (Figure 13B). There are two text input fields for the two biomolecule name terms of interest. For these first two queries there is also the option to return a random set of sentences for the specified query.



The final type of query finds sentences that are indirectly related by an intermediate biomolecule name term (Figure 13C). There are also two text input fields, one for the ‘A’ biomolecule name term and a second for an intermediate biomolecule name term ‘B’. The radio buttons on the right allow the user to pick the type of results returned. If the word list is selected, a list of possible ‘C’ biomolecule name terms is returned. If the sentence button is selected, a list of sentences is returned that contain the ‘B’ and ‘C’ biomolecule name terms.

Each of the different queries can be reached by using the navigation bar at the top of all forms. The user also has the ability to choose the number of sentences to view per page. An example output can be seen in Figure 14. There is also a link to a page that contains links to HTML files of the sentences, one file for each interaction-indicating term (see Figure 15).




Figure 13. Screenshots of the ISDB web interface. A) The form for the simple query by biomolecule name or interaction-indicating term. B) Form for the querying of biomolecule names that co-occur. C) Form to query for indirect relationships between an ‘A’ biomolecule name and ‘C’ biomolecule name connected by a ‘B’ biomolecule name.



Figure 14. Screenshots of sample results of a simple query for the interaction-indicating term ‘forms’.





Figure 15. Screenshot of the HTML flat files. A) A list of links to pages of sentences grouped by interaction-indicating terms. B) Sample HTML page of sentences that contain the interaction-indicating term ‘form’.


5 References

MEDLINE®: Number of Citations to English Language Articles; Number of Citations Containing Abstracts [Internet]; c2007 [referenced 2008 04/01]. Available from: http://www.nlm.nih.gov/bsd/medline_lang_distr.html

Bekhuis T. 2006. Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy. Biomed Digit Libr 3:2.

Dictionary of Algorithms and Data Structures: Zipf's law [Internet]; c2006 [referenced 2008 04/01]. Available from: http://www.nist.gov/dads/HTML/zipfslaw.html

Blaschke C and Valencia A. 2001. The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform 12:123-34.

Cohen AM and Hersh WR. 2005. A survey of current work in biomedical text mining. Brief Bioinform 6(1):57-71.

Cooper JW and Kershenbaum A. Discovery of protein-protein interactions using a combination of linguistic, statistical and graphical information. BMC Bioinformatics [Internet]. Available from: http://www.biomedcentral.com/1471-2105/6/143

Corney DP, Buxton BF, Langdon WB, Jones DT. 2004. BioRAT: Extracting biological information from full-length papers. Bioinformatics 20(17):3206-13.

Ding J. 2003. PathBinder: A sentence repository of biochemical interactions extracted from MEDLINE :1-36.

Hirschman L, Park JC, Tsujii J, Wong L, Wu CH. 2002. Accomplishments and challenges in literature data mining for biology. Bioinformatics 18(12):1553-61.

Jensen Juhl L, Saric J, Bork P. 2006. Literature mining for the biologist: From information retrieval to biological discovery. Nature Reviews GENETICS 7:119-29.

Powers D. Applications and explanations of Zipf's law. ACL [Internet]. Available from: http://www.aclweb.org/anthology-new/W/W98/W98-1218.pdf

Rebholz-Schuhmann D, Kirsch H, Couto F. 2005. Facts from text - is text mining ready to deliver? PloS Biology 3(2):0188-91.

Swanson DR. 1991. Complementary structures in disjoint science literatures. ACM :280-9.

Swanson DR. 1990. Medical literature as a potential source of new knowledge. Bull Med Libr Assoc :29.




6 User Manual

6.1 Overview


Interaction Sentence Database (ISDB) allows a researcher to retrieve a set of sentences fitting specific characteristics. The sentences in the database all contain at least two chemical terms and one interaction-indicating term. The web interface to the database allows the user to query for sentences based on an interaction-indicating term, a single biomolecule name, or two biomolecule names, and for sentences with indirect connections through an intermediate biomolecule name.

6.2 Getting Started


The web interface can be accessed at:

http://bioinformatics.ualr.edu/~mbauer/cgi-bin/ISDB/isdb.cgi

6.3 Simple Query


A simple query for sentences that contain a certain interaction-indicating term or biomolecular term.


1. Click “Simple Query”.
2. Choose the type of term to use in the query.
3. Enter an interaction term or a biomolecule term, depending on the choice made in step 2.
4. Choose to receive the complete dataset or a random set of sentences.
5. Choose the number of sentences to show per page.
6. Press “Query” to submit the query.

6.4 Co-occurrence Query




A query for sentences that contain two biomolecule name terms that co-occur in a sentence.


1. Click “Co-Occurrence Query”.
2. Enter the two biomolecule name terms.
3. Choose to receive the complete dataset or a random set of sentences.
4. Choose the number of sentences to show per page.
5. Press “Query” to submit the query.

6.5 Indirect Relationship Query


A query to find sentences that are indirectly related by an intermediate biomolecule name term (B).


1. Click “Indirect Relationship Query”.
2. Enter the first biomolecular name term (A) to use in the query.
3. Enter the second, intermediate biomolecular name term (B) to use in the query.
4. Choose to receive sentences or a list of words that are indirectly related to (A).
5. Choose the number of sentences to show per page.
6. Press “Query” to submit the query.

6.6 HTML Pages by Term


View simple HTML files containing sentences separated by interaction-indicating term.




1. Click the “HTML by term” link.
2. Click the name of an interaction-indicating term to view sentences that contain the term.

6.7 Contact Information


Direct questions and comments to:

Michael Bauer

University of Arkansas Little Rock

mabauerATualr.edu



Appendix

Appendix A - CGI script for the web interface

#!/usr/bin/perl
##################################################
## FileName: isdb.cgi
## Version: 1
## Date: 02.11.08
## Author: Michael Bauer (mabauer@ualr.edu)
##################################################
## Note: This script to provide an interface
##       to the sentence database
##
##################################################

use DBI;

our $rootURL = 'http://bioinformatics.ualr.edu';
my $database = 'sentDB';
my $server = 'localhost';
my $user = 'mbauer';
my $passwd = 'mb1313';

#my $tm = DBI->connect("dbi:mysql:$database:$server:", $user, $passwd);
&ParseFormData;
&DoControl;

exit;


sub DoControl
{
    my ($html,$term,$count,$sql,$mode);

    $html .= &Header;

    if ( $FormData{'mode'}) { $mode = $FormData{'mode'}; }
    elsif ( $FormData{'co_occurrence'}) { $mode = 'co_occurrence'; }
    elsif ( $FormData{'indirect'}) { $mode = 'indirect'; }
    else { $mode = 'simple'; }

    if ($mode eq 'simple' and ($FormData{'getsentence'} or $FormData{'direction'} ))
    {
        if ($FormData{'dataset'} eq 'rand')
        {
            if($FormData{'verb'}) {$term = $FormData{'verb'};}
            if($FormData{'chemical'}) {$term = $FormData{'chemical'}; }

            $html .= &PrintRandomSent($term,
                                      $FormData{'sentnum'},
                                      $FormData{'queryby'},
                                      $mode);
        }

        if ($FormData{'getsentence'} and $FormData{'dataset'} eq 'full'){
            if($FormData{'verb'})
            {
                $term = $FormData{'verb'};
                $sql = &GetSqlCount($mode,$FormData{'queryby'});
                $count = &TotalResults($sql,$term,$mode);
            }
            if($FormData{'chemical'})
            {
                $term = $FormData{'chemical'};
                $sql = &GetSqlCount($mode,$FormData{'queryby'});
                $count = &TotalResults($sql,$term,$mode);
            }
            if($FormData{'sentnum'} eq 'all'){$scount=$count;}
            else{$scount=$FormData{'sentnum'};}

            $html .= &PrintSent($term,
                                0,
                                $scount,
                                '',
                                $count,
                                $FormData{'queryby'},
                                $mode);
        }

        if ($FormData{'direction'}){
            if($FormData{'term'}) {$term = $FormData{'term'};}
            if($FormData{'chemical'}) {$term = $FormData{'chemical'};}
            $html .= &PrintSent($term,
                                $FormData{'more'},
                                $FormData{'num'},
                                $FormData{'direction'},
                                $FormData{'count'},
                                $FormData{'queryby'},
                                $mode);
        }
    }
    elsif ($mode eq 'co_occurrence' and ($FormData{'getsentence'} or $FormData{'direction'} ))
    {
        if ($FormData{'dataset'} eq 'rand')
        {
            $term1 = $FormData{'chem_one'};
            $term2 = $FormData{'chem_two'};
            $term = $term1.'-'.$term2;

            $html .= &PrintRandomSent($term,
                                      $FormData{'sentnum'},
                                      $FormData{'queryby'},
                                      $mode);
        }

        if ($FormData{'getsentence'} and $FormData{'dataset'} eq 'full'){
            $term1 = $FormData{'chem_one'};
            $term2 = $FormData{'chem_two'};
            $term = $term1.'-'.$term2;
            $sql = &GetSqlCount($mode,'');
            $count = &TotalResults($sql,$term,$mode);

            if($FormData{'sentnum'} eq 'all'){$scount=$count;}
            else{$scount=$FormData{'sentnum'};}

            $html .= &PrintSent($term,
                                0,
                                $scount,
                                '',
                                $count,
                                $FormData{'queryby'},
                                $mode);
        }

        if ($FormData{'direction'}){
            $term = $FormData{'term'};
            $html .= &PrintSent($term,
                                $FormData{'more'},
                                $FormData{'num'},
                                $FormData{'direction'},
                                $FormData{'count'},
                                $FormData{'queryby'},
                                $mode);
        }
    }
    elsif ($mode eq 'indirect' and ($FormData{'getsentence'} or $FormData{'direction'} ))
    {
        if ($FormData{'getsentence'}){
            $term1 = $FormData{'aterm'};
            $term2 = $FormData{'bterm'};
            $term = $term1.'-'.$term2;
            # $sql = &GetSqlCount($mode,$FormData{'qresults'});
            # $count = &TotalResults($sql,$term,$mode);
            if($FormData{'sentnum'} eq 'all'){$scount=$count;}
            else{$scount=$FormData{'sentnum'};}

            $html .= &PrintIndirect($term,
                                    0,
                                    $scount,
                                    '',
                                    $count,
                                    $FormData{'qresults'},
                                    $mode);
        }

        if ($FormData{'direction'}){
            $term = $FormData{'term'};
            $html .= &PrintIndirect($term,
                                    $FormData{'more'},
                                    $FormData{'num'},
                                    $FormData{'direction'},
                                    $FormData{'count'},
                                    $FormData{'queryby'},
                                    $mode);
        }
    }
    elsif ($mode eq 'co_occurrence')
    {
        $html .= &Co_Occurrence;
    }
    elsif ($mode eq 'indirect')
    {
        $html .= &IndirectRelation;
    }
    else {$html .= &SimpleQueryForm;}

    $html .= &Footer;

    print $html;
}


sub Header
{
    my $html;

    $html .= "Content-type: text/html\n";
    # HTTP Status followed by a blank line is always required for HTTP
    $html .= "Status: 200 OK\n\n";
    $html .= "<html>\n";
    $html .= "<head>\n";
    $html .= " <META HTTP-EQUIV='Content-Type' CONTENT='text/html; charset=utf-8'>\n";
    $html .= " <META NAME=author CONTENT='Michael Bauer'>\n";
    $html .= " <META NAME='keywords' CONTENT='text mining, sentence database, sentences, text'>\n";
    $html .= " <BASE href='http://bioinformatics.ualr.edu/~mbauer'>\n";
    $html .= " <title>Sentence DB</title>\n";
    $html .= &JavaFunc;

    # Style the submit buttons of the navigation bar so they look like links
    $html .= "<style type='text/css'>\n";
    $html .= " .submitLink {\n";
    $html .= " color: #00F;\n";
    $html .= " background-color: transparent;\n";
    $html .= " text-decoration: underline;\n";
    $html .= " border: none;\n";
    $html .= " cursor: hand;\n";
    $html .= " cursor: pointer;\n";
    $html .= " }\n";
    $html .= "</style>\n";

    $html .= " </head>\n";
    $html .= " <body leftmargin='0' topmargin='0' marginwidth='0' marginheight='0' text='black' bgcolor='white'>\n";
    $html .= "<center><img src='$rootURL"."/~mbauer/ISDB/img/ISDBheader.jpg' width='730' height='101'></center>\n";
    return $html;
}



sub Footer
{
    my $html;
    $html .= " <a href='http://bioinformatics.ualr.edu/~mbauer/sorted/html_FILES/'>HTML by term</a>\n";
    $html .= " </body>\n";
    $html .= "</html>\n";

    return $html;
}


sub ToolMenu
{
    # Each argument is the background color of one menu cell;
    # '#0099FF' marks the currently selected query type.
    my ($color1,$color2,$color3) = @_;
    my $html;

    $html .= "<br><br>\n";
    $html .= "<FORM name='menu' ACTION='$rootURL"."/~mbauer/cgi-bin/ISDB/isdb.cgi' METHOD='POST'>\n";
    $html .= "<table bgColor='lightblue' width='70%' border='0' align='center'>\n";
    $html .= "<tr>\n";

    $html .= "<td bgcolor='$color1' align='center' onMouseover='MouseOver(this)'";
    if($color1 ne '#0099FF'){ $html .= " onMouseout='MouseOut(this)'";}
    $html .= "><input type='submit' class='submitLink' name='simple' value='Simple Query'></td>\n";

    $html .= "<td bgcolor='$color2' align='center' onmouseover='MouseOver(this)'";
    if($color2 ne '#0099FF'){$html .= " onMouseout='MouseOut(this)'";}
    $html .= "><input type='submit' class='submitLink' name='co_occurrence' value='Co-Occurrence Query'></td>\n";

    $html .= "<td bgcolor='$color3' align='center' onmouseover='MouseOver(this)'";
    if($color3 ne '#0099FF'){$html .= " onMouseout='MouseOut(this)'";}
    $html .= "><input type='submit' class='submitLink' name='indirect' value='Indirect Relationship Query'></td>\n";

    $html .= "</tr>\n";
    $html .= "</table>\n";
    $html .= "</form>";

    return $html;
}


sub SimpleQueryForm
{
    my $html;

    $html .= &ToolMenu('#0099FF','lightblue','lightblue');
    $html .= "<FORM name='simple_query_form' ACTION='$rootURL"."/~mbauer/cgi-bin/ISDB/isdb.cgi' METHOD='POST'>\n";
    $html .= "<TABLE frame='border' width='70%' align='center'>\n";
    $html .= "<tr>\n";
    $html .= "<td><label for='verb'>Interaction Term</label></td>\n";
    $html .= "<td><label for='chemical'>Query By</label></td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td>\n";
    $html .= "<INPUT TYPE='text' NAME='verb'>\n";
    $html .= "</td>\n";
    $html .= "<td>\n";
    $html .= "<INPUT TYPE='radio' name='queryby' value='iterm' onClick='enableField()' CHECKED>Interaction Term<br>\n";
    $html .= "<INPUT TYPE='radio' name='queryby' value='cterm' onClick='enableField()'>Chemical Term<br>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";

    $html .= "<td><label for='chemical'>Chemical Term</label></td>\n";
    $html .= "<td><label for='dataset'>Dataset</label></td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td><INPUT TYPE='text' NAME='chemical' disabled='true'></td>\n";
    $html .= "<td>\n";
    $html .= "<INPUT TYPE='radio' name='dataset' value='full' onClick='modSelectField(this.form)' CHECKED>Complete Dataset<br>\n";
    $html .= "<INPUT TYPE='radio' name='dataset' value='rand' onClick='modSelectField(this.form)'>Random Dataset<br>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";

    $html .= "<tr>\n";
    $html .= "<td></td>\n";
    $html .= "<td><label for='sentnum'>Sentences to Display</label></td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td></td>\n";
    $html .= "<td>\n";
    $html .= "<SELECT name='sentnum'>\n";
    $html .= "<OPTION value='10'>10</OPTION>\n";
    $html .= "<OPTION value='20'>20</OPTION>\n";
    $html .= "<OPTION value='50'>50</OPTION>\n";
    $html .= "<OPTION value='100'>100</OPTION>\n";
    $html .= "<OPTION value='1000'>1000</OPTION>\n";
    $html .= "<OPTION value='all'>ALL</OPTION>\n";
    $html .= "</SELECT>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";
    $html .= "</TABLE>\n";
    # Second table holds the submit button; the row is opened inside the table
    $html .= "<TABLE width='60%' align='center'>\n";
    $html .= "<tr>\n";
    $html .= "<td><center><INPUT TYPE='submit' value='Query' onClick='return validate_form()'></center></td>\n";
    $html .= "</tr>\n";
    $html .= "</TABLE>\n";
    $html .= "<INPUT TYPE='hidden' NAME='getsentence' ID='getsentence' VALUE='getsentence'>\n";
    $html .= "<INPUT TYPE='hidden' NAME='mode' ID='mode' VALUE='simple'>\n";
    $html .= "</FORM>\n";

    return $html;
}


sub Co_Occurrence
{
    my $html;

    $html .= &ToolMenu('lightblue','#0099FF','lightblue');

    $html .= "<FORM name='co_occur_form' ACTION='$rootURL"."/~mbauer/cgi-bin/ISDB/isdb.cgi' METHOD='POST'>\n";
    $html .= "<TABLE frame='border' width='70%' align='center'>\n";
    $html .= "<tr>\n";
    $html .= "<td><label for='verb'>1st Chemical Term</label></td>\n";
    $html .= "<td>\n";
    $html .= "<INPUT TYPE='radio' name='dataset' value='full' onClick='modSelectField(this.form)' CHECKED>Complete Dataset<br>\n";
    $html .= "<INPUT TYPE='radio' name='dataset' value='rand' onClick='modSelectField(this.form)'>Random Dataset<br>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td>\n";
    $html .= "<INPUT TYPE='text' NAME='chem_one'>\n";
    $html .= "</td>\n";
    $html .= "<td>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td><label for='chemical'>2nd Chemical Term</label></td>\n";
    $html .= "<td><label for='sentnum'>Sentences to Display</label></td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td><INPUT TYPE='text' NAME='chem_two'></td>\n";
    $html .= "<td>\n";
    $html .= "<SELECT name='sentnum'>\n";
    $html .= "<OPTION value='10'>10</OPTION>\n";
    $html .= "<OPTION value='20'>20</OPTION>\n";
    $html .= "<OPTION value='50'>50</OPTION>\n";
    $html .= "<OPTION value='100'>100</OPTION>\n";
    $html .= "<OPTION value='1000'>1000</OPTION>\n";
    $html .= "<OPTION value='all'>ALL</OPTION>\n";
    $html .= "</SELECT>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";
    $html .= "</TABLE>\n";
    # Second table holds the submit button; the row is opened inside the table
    $html .= "<TABLE width='60%' align='center'>\n";
    $html .= "<tr>\n";
    $html .= "<td><center><INPUT TYPE='submit' value='Query' onClick='return validate_co_form()'></center></td>\n";
    $html .= "</tr>\n";
    $html .= "</TABLE>\n";
    $html .= "<INPUT TYPE='hidden' NAME='getsentence' ID='getsentence' VALUE='getsentence'>\n";
    $html .= "<INPUT TYPE='hidden' NAME='mode' ID='mode' VALUE='co_occurrence'>\n";
    $html .= "<INPUT TYPE='hidden' NAME='queryby' ID='queryby' VALUE='co_occur'>\n";
    $html .= "</FORM>\n";

    return $html;
}






sub IndirectRelation
{
    my $html;

    $html .= &ToolMenu('lightblue','lightblue','#0099FF');
    $html .= "<FORM name='indirect_form' ACTION='$rootURL"."/~mbauer/cgi-bin/ISDB/isdb.cgi' METHOD='POST'>\n";
    $html .= "<TABLE frame='border' width='70%' align='center'>\n";
    $html .= "<tr>\n";
    $html .= "<td><label for='atrm'><font size='16'>A</font> Biomolecular Term</label></td>\n";
    $html .= "<td><label for='qresults'><font size='16'>C</font> Biomolecular Terms</label></td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td>\n";
    $html .= "<INPUT TYPE='text' NAME='aterm'>\n";
    $html .= "</td>\n";
    $html .= "<td>\n";
    $html .= "<INPUT TYPE='radio' name='qresults' value='sentences' onClick='enableField()' CHECKED>Sentences<br>\n";
    $html .= "<INPUT TYPE='radio' name='qresults' value='wlist' onClick='enableField()'>Word List<br>\n";
    $html .= "</td>\n";
    $html .= "<td>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td><label for='btrm'><font size='16'>B</font> Biomolecular Term</label></td>\n";
    $html .= "<td><label for='sentnum'>Sentences to Display</label></td>\n";
    $html .= "</tr>\n";
    $html .= "<tr>\n";
    $html .= "<td><INPUT TYPE='text' NAME='bterm'></td>\n";
    $html .= "<td>\n";
    $html .= "<SELECT name='sentnum'>\n";
    $html .= "<OPTION value='10'>10</OPTION>\n";
    $html .= "<OPTION value='20'>20</OPTION>\n";
    $html .= "<OPTION value='50'>50</OPTION>\n";
    $html .= "<OPTION value='100'>100</OPTION>\n";
    $html .= "<OPTION value='1000'>1000</OPTION>\n";
    $html .= "</SELECT>\n";
    $html .= "</td>\n";
    $html .= "</tr>\n";
    $html .= "</TABLE>\n";
    # Second table holds the submit button; the row is opened inside the table
    $html .= "<TABLE width='60%' align='center'>\n";
    $html .= "<tr>\n";


$html .= "<td><center><INPUT TYPE='submit' value='Query' onClick='return
validate_in_form()'></center></td>
\