Combine Unsupervised Learning and Heuristic Rules to Annotate Organism Morphological Descriptions



Hong Cui
University of Arizona
hongcui@email.arizona.edu

Sriramu Singaram
University of Arizona
sriramus@email.arizona.edu

Alyssa Janning
University of Arizona
ajanning@email.arizona.edu

ABSTRACT

Biodiversity literature is a comprehensive compilation of information on living organisms and fossils. Rich factual information on the characteristics of organisms is presented in narrative form, which limits its repurposing and reuse. Transforming narrative information into atomic forms has been of special concern to informatics researchers and biological researchers alike. Previous research shows similar results but lacks a detailed, scientific evaluation that would help illuminate the problem and eventually lead to a higher-performance approach. Due to the sublanguage nature of morphological descriptions, general-purpose natural language processing (NLP) tools are thought to be ineffective in this application. A heuristic-based approach has been suggested in the literature. In this paper, we report our experiments with such an approach, where a set of simple, intuitive heuristic rules, informed by the results of an unsupervised learning algorithm, is used to segment taxonomic descriptions and identify the organs along with their associated character/value pairs (e.g., color=white, shape=ovoid). This model system allows us to investigate the character annotation problem further, study the characteristics of morphological descriptions, identify the areas where the system fails, and suggest ways to address those failures. One such suggestion is to make use of general-purpose syntactic parsers in a controlled manner.

Keywords

Character annotation, unsupervised machine learning algorithm, character markup, heuristic rules, semantic annotation technique

INTRODUCTION

Biodiversity literature is an abundant source of morphological description documents published by biosystematics experts. It is essential that the information present in these documents be easily and conveniently accessible to end users. Experts and researchers alike thus feel the growing need for a semantics-based service capable of supplying atomic information at a more granular level that can be reused to address various information and/or biological questions.

Examples of typical information questions addressed include information retrieval and visualization. Atomic information extracted, like what is described below, has been translated into RDF (Resource Description Framework) triples and will be used by Flora of North America in their new web presence. The extracted information is also being converted into SDD (Structured Descriptive Data) format to generate organism identification systems in a semi-automated manner. Examples of typical biological questions addressed include relations between molecular sequences and morphological characters (i.e., phylogenetics research), and relations between biological characters and climate change.

In the meantime, more print literature has been digitized. The Biodiversity Heritage Library has digitized close to 34 million pages of biodiversity documents in the public domain (see http://www.biodiversitylibrary.org/). This has dramatically changed the scale of the problem, and the solution will no longer be project-specific dictionaries or software. High-performance, high-throughput, biodiversity-wide approaches are needed to dissect description documents into individual, semantic-enabled factual statements.

ASIST 2011, October 9-13, 2011, New Orleans, LA, USA.



We have been working to atomize morphological descriptions in the past years, starting from paragraph classification, to sentence markup (Cui, Boufford & Selden, 2010), and finally to character-level annotation. Figure 1 presents a morphological description sentence and its character-level annotation in XML. This is a sample output from the system described in this paper.

<?xml version="1.0" encoding="UTF-8"?>
<statement id="s1058">
  <text>style branches relatively short, apices rounded.</text>
  <structure name="branch" constraint="style" id="o1">
    <character name="size" modifier="relatively" value="short" />
  </structure>
  <structure name="apex" id="o2">
    <character name="shape" value="rounded" />
  </structure>
</statement>

Figure 1. Character Annotation of a Sample Sentence from FNA v. 19


In this paper we focus on the character-level annotation of morphological descriptions by means of unsupervised learning and simple heuristic rules; the latter has been suggested and used by Diederich et al. (1999), Wood et al. (2003), the PATO project (PATO, 2011), and Balhoff et al. (2010). The research question this paper sets out to address is what potential a heuristic rule-based approach holds for fine-grained annotation of morphological descriptions. To answer this question, we will inevitably also need to examine the syntactic characteristics of morphological descriptions.


In what follows, we first review related research in section 2 before explaining the heuristic-based character annotation algorithm in section 3. The results obtained from two real-world corpora are presented in section 4, followed by analyses of the description text, annotation results, and shortcomings of the algorithm in section 5. We conclude the paper in section 6, detailing future enhancements and development efforts for the algorithm.


REVIEW OF RELATED WORK

The task presented in this paper may be considered a specialized form of what is commonly known as information extraction (IE), in which a selection of factual information is extracted from natural language documents to populate a predefined template. The template typically captures the key facts about an event and takes the form of who did what, when, why, and how. A typical information extraction system makes use of shallow parsing (including Part of Speech tagging) and Named Entity Recognition techniques to identify nouns and verbs, and uses the verbs to find candidate relationships between the entities involved in the event. Note that identifying the right verbs/verb phrases is crucially important in a typical IE system. Often some form of machine learning method is used to classify information into appropriate categories.

Full reviews of IE research done in other domains can be found elsewhere (e.g., Blaschke, Hirschman, & Valencia, 2002; Chapman & Cohen, 2009; Hobbs & Riloff, 2010). IE in the biodiversity domain is less researched and to some extent different. Our task is sufficiently different from a typical IE task that a somewhat different approach is needed. These differences include: a) no extraction template can be made ahead of time, as we cannot foresee what information may be included in the descriptions of any organism; b) in describing the characteristics of an entity, our domain corpora contain fewer verbs (when verbs are present, they are often present/past participles); c) basic NLP tools such as machine-readable lexicons, Part of Speech (POS) taggers, and syntactic parsers do not exist for the biodiversity domain, and furthermore, there is no standard syntax that applies to all morphological descriptions; d) all facts presented in the text are important and should be included in the final annotation.


Below we review only the works closely related to our task. Although a direct comparison cannot be drawn between our system and those reviewed, this brief review will nonetheless provide relevant context for the reader.

Wood et al. (2003) used manually created dictionaries, an ontology, and a lookup list to extract and correlate characters/states from a set of descriptions of 18 plant species included in 6 different English Floras. They made use of parallel text (i.e., descriptions of a common species from different Floras) to find three times more targeted information that would otherwise be missed, improving extraction recall three-fold. Evaluation results were reported as 66% recall and 74% precision. In other words, without parallel text, the recall would have been 22%.
Diederich, Fortuner & Milton (1999) reported a system called Terminator, which was similar to Wood et al.'s in that both used a hand-crafted domain ontology (including structure names, character names, and state names) to support character extraction. The extraction process was basically a heuristic-based fuzzy keyword match. As Terminator is an interactive system (i.e., a human operator selects correct answers), the evaluation was done on 16 descriptions to report the time taken to process them. Extraction performance was measured on only 1 random sample: for general characters (as opposed to numerical measurements), 55% of the time a perfect structure/character/value triple was shown among the first 5 candidates suggested by the system.



Tang & Heidorn (2007) adapted Soderland's (1999) supervised regular expression pattern learner to extract characteristics about leaves (i.e., leaf shape, size, color, arrangement) and fruit/nut shape from 1600 FNA species descriptions. Their reported recall ranged from 33% to 80%, and precision ranged from 75% to 100%.
Taylor (1995) hand-crafted a set of simple grammar rules and a small lexicon specifically for extracting character states from several Floras. The performance was not scientifically evaluated but was estimated at 60% to 80% recall.

Difficulties in automated character-level annotation have prompted interactive software applications that allow the user to manually annotate morphological characters. These applications include Phenex (Balhoff et al., 2010) and GoldenGATE (Sautter, Agosti, & Bohm, 2007). While these systems possess many plausible features, it is widely acknowledged that manual annotation is one of the major bottlenecks of curating scientific data.

Unsupervised Learning and Heuristic Rule Hybrid Character Annotation Algorithm

It is evident to us that character-level annotation requires some level of syntactic parsing to group semantically related words (called chunking) and separate one chunk from another. The full parsing approach taken by Taylor is attractive for its performance potential, but it has serious scalability issues. We decided to start with a model system implementing the most evident heuristic rules. By studying the model system, we hope to gain knowledge of what would be needed in a high-performance system.

Our algorithm expects plain-text, typical sublanguage description paragraphs containing sentences similar to what is shown in Figure 1. These description paragraphs are first tagged at the sentence level by the unsupervised, bootstrapping-based learning algorithm reported in Cui, Boufford, & Selden (2010). The algorithm learns organ names and descriptive terms (such as “red”, “large”, and “erect”), which are used to tag the content of each sentence for the heuristic step of the system. For example:

Before: style branches relatively short, apices rounded.

After: <style> <branches> {relatively} {short} , <apices> {rounded} .

This operation may be thought of as a simple POS tagging operation where only two tags are applied: nouns are tagged in “<>” and other words in “{}”.
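This two-tag operation can be sketched as a lookup over the learned vocabulary. The `ORGAN_TERMS` set below is a hypothetical stand-in for the organ names the unsupervised algorithm actually learns from the corpus:

```python
import re

# Hypothetical stand-in for the organ vocabulary learned by the
# unsupervised bootstrapping algorithm; not hard-coded in the real system.
ORGAN_TERMS = {"style", "branches", "apices"}

def tag_sentence(sentence):
    """Tag organ names with <> and all other words with {}; punctuation stays bare."""
    tagged = []
    for token in sentence.split():
        word, punct = re.match(r"([^,.;]*)([,.;]?)$", token).groups()
        if not word:  # a bare punctuation token
            tagged.append(token)
            continue
        mark = f"<{word}>" if word.lower() in ORGAN_TERMS else f"{{{word}}}"
        tagged.append(mark + (" " + punct if punct else ""))
    return " ".join(tagged)

print(tag_sentence("style branches relatively short, apices rounded."))
# -> <style> <branches> {relatively} {short} , <apices> {rounded} .
```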

The heuristic-based character annotation step then takes the tagged sentences and applies a set of simple rules to segment and annotate them in the consecutive operations described below.

CondenseMarkedsent

Each tagged sentence is put through a series of condensation procedures aimed at grouping semantically related chunks, thereby reducing the complexity of the sentence for further annotation. For example:

- {lanceolate} or {ovate} is reduced to {lanceolate_or_ovate}
- <stem> or <branches> is reduced to <stem_or_branches>
- <spines> of <phyllaries> is reduced to <spines_of_phyllaries>
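A minimal sketch of these condensation rules as regular-expression rewrites (covering only the “or”/“of” patterns shown above; the actual procedure handles more cases):

```python
import re

def condense(tagged):
    """Collapse coordinated or possessive pairs into a single chunk."""
    # {lanceolate} or {ovate} -> {lanceolate_or_ovate}
    tagged = re.sub(r"\{(\w+)\} (or) \{(\w+)\}", r"{\1_\2_\3}", tagged)
    # <stem> or <branches> -> <stem_or_branches>
    # <spines> of <phyllaries> -> <spines_of_phyllaries>
    tagged = re.sub(r"<(\w+)> (or|of) <(\w+)>", r"<\1_\2_\3>", tagged)
    return tagged

print(condense("{lanceolate} or {ovate}"))   # -> {lanceolate_or_ovate}
print(condense("<spines> of <phyllaries>"))  # -> <spines_of_phyllaries>
```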


HideCommas

Description sentences at times consist of multiple clauses, each describing the characteristics of one organ/structure. For example, the sentence in Figure 1 has two clauses. Clauses are always preceded by commas, but commas also frequently appear elsewhere. In preparing for segmentation, it is necessary to hide or eliminate those unnecessary or semantically irrelevant commas that might otherwise hamper the process of creating a legitimate segment.

First, all commas that occur between a preposition and an organ are hidden. This portion of the sentence indicates the presence of a relation, and hence these commas should not form a criterion for segmentation. (Note the “<>” tag tells the system which words represent an organ/structure.) For example:

Before: <styles> {1} in each {bisexual} , {functionally} {staminate} , or {pistillate} <floret>;

After: <styles> {1} in each {bisexual} {:} {functionally} {staminate} {:} or {pistillate} <floret>;

As seen in the above example, each “,” has been replaced with a “{:}” as part of hiding it.

Also hidden are the commas used to separate multiple character states belonging to one organ. These commas should not be considered a criterion for segmentation, as they have no semantic value with regard to the segmentation procedure apart from their grammatical value in the sentence. For example:

Before: <apices> of {inner} {erect} , {abaxial} <faces> {gray_tomentose} , ± {twisted} .

After: <apices> of {inner} {erect} {:} {abaxial} <faces> {gray_tomentose} , ± {twisted} .
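The first comma-hiding rule can be sketched as below. The preposition list here is an assumption, and the second rule (hiding commas between the character states of one organ) would be implemented analogously:

```python
PREPOSITIONS = {"in", "on", "of", "at", "with", "near"}  # assumed list

def hide_commas(tagged):
    """Replace commas that fall between a preposition and the next organ
    tag with {:} so they no longer trigger segmentation."""
    tokens = tagged.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in PREPOSITIONS:
            # look ahead for the next organ tag; if found, hide commas in between
            j = i + 1
            while j < len(tokens) and not tokens[j].startswith("<"):
                j += 1
            if j < len(tokens):
                out.append(tok)
                out.extend("{:}" if t == "," else t for t in tokens[i + 1:j])
                i = j
                continue
        out.append(tok)
        i += 1
    return " ".join(out)

before = "<styles> {1} in each {bisexual} , {functionally} {staminate} , or {pistillate} <floret>;"
print(hide_commas(before))
# -> <styles> {1} in each {bisexual} {:} {functionally} {staminate} {:} or {pistillate} <floret>;
```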


Segmentation

Once a sentence has been condensed and preprocessed to hide the extra commas, we proceed to break the sentence into smaller segments. This partitioning of the sentence is done using the remaining commas that have not been hidden.

While segmenting the sentence based on the commas, we try to keep the semantic meaning of the sentence intact. In some cases this is done by appending organs to the segments of which they are rightfully the subjects. The organ is appended because the segment is likely describing character states associated with the organ, or describing a relation with respect to a second organ. For example:

Before: <stamens> {5} , {alternate} with <corolla> <lobes> , <filaments> {inserted} on <corollas> {:} {usually} {distinct} , <anthers> {introrse} , {usually} {connate} and {forming} <tubes> around <styles>;

After:
Segment 1: <stamens> {5} ,
Segment 2: <stamens> {alternate} with <corolla_has_lobes> ,
Segment 3: <filaments> {inserted} on <corollas> {:} {usually} {distinct} ,
Segment 4: <anthers> {introrse} ,
Segment 5: <anthers> {usually} {connate} and {forming} <tubes> around <styles>;

Notice the organ <stamens> being appended to Segment 2 and the organ <anthers> being appended to Segment 5.

The resulting segments fall into one of two categories:

- Simple segments: segments that have only one organ/structure as the subject (for example, Segments 1 and 4).
- Complex segments: segments comprising two or more organs/structures (for example, Segments 2, 3, and 5). These segments typically contain relations among structures, for example “alternate with” in Segment 2.
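The splitting and organ-appending steps can be sketched as below, using an already-condensed form of the example above (the real rules are richer than this):

```python
import re

def segment(sentence):
    """Split on the commas that were not hidden, then prepend the previous
    segment's subject organ to any segment that does not start with one."""
    parts = [p.strip() for p in sentence.split(" , ") if p.strip()]
    segments = []
    subject = None
    for part in parts:
        if not part.startswith("<") and subject:
            part = subject + " " + part          # append the carried-over subject
        m = re.match(r"(<[^>]+>)", part)
        if m:
            subject = m.group(1)                 # remember this segment's subject
        segments.append(part)
    return segments

s = ("<stamens> {5} , {alternate} with <corolla_has_lobes> , "
     "<filaments> {inserted} on <corollas> {:} {usually} {distinct} , "
     "<anthers> {introrse} , {usually} {connate} and {forming} <tubes> around <styles>;")
for i, seg in enumerate(segment(s), 1):
    print(f"Segment {i}: {seg}")
```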

ParseSegment

After segmentation, the resulting segments are returned to their original form for ParseSegment, where character annotation is performed. Segments are annotated as XML structures according to an XML schema we created collaboratively with biologists. The schema is accessible at http://sites.google.com/site/biosemanticsproject/character-annotation-discussions/xml-schema-for-character-annotation.

Segments are annotated differently depending on whether they are simple or complex. The annotation of simple segments is rather straightforward, as each segment is expected to contain one organ name (indicated by “<>” tags) and a set of descriptors (called “character states” in biology terminology, indicated by “{}” tags). All character states (such as “red” or “large”) in a simple segment should be logically associated with the organ. CharacterStateExtractor below describes the procedure for associating a character state (e.g., “red”) with its corresponding character (“color”).

For a complex segment, there is a need to identify the structures and their character states along with their associated relations. Due to a lack of knowledge of the syntactic structure of complex segments, an intuitive rule is applied to a segment recursively: the first structure is obtained by looking for a “<>” tag; everything after it and up to the next structure is collected as the relation, and the next structure is identified by again looking for a “<>” tag. Everything before the first organ is considered to contain character descriptors for the first organ, and everything after the 2nd organ is considered to contain character descriptors for the 2nd organ.

Applying these simple rules, Segment 4 in the previous example is annotated as:

<statement id="s20">
  <text>anthers introrse ,</text>
  <structure name="anther" id="o1">
    <character name="dehiscence" value="introrse"/>
  </structure>
</statement>

Segment 3 in the previous example is annotated as follows. A typical attachment mistake the simple rule will cause is shown in this example: “usually distinct” is a character of filaments, not corollas.

<statement id="s19">
  <text>filaments inserted on corollas , usually distinct ,</text>
  <structure name="filament" id="o1"></structure>
  <structure name="corolla" id="o2">
    <character name="fusion" value="distinct"/>
  </structure>
  <relation id="R1" from="o1" to="o2" name="inserted on" negation="false"/>
</statement>
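The intuitive rule for complex segments can be sketched as below; note that, by design, the sketch reproduces the attachment mistake discussed above (“usually distinct” lands on the second organ):

```python
def parse_complex(segment):
    """Intuitive complex-segment rule: descriptors before the first organ
    belong to it, tokens between the two organs form the relation, and
    descriptors after the second organ belong to the second organ."""
    def bare(token):
        return token.strip("<>{}")
    tokens = [t for t in segment.replace(";", "").split() if t not in (",", "{:}")]
    organ_idx = [i for i, t in enumerate(tokens) if t.startswith("<")]
    i1, i2 = organ_idx[0], organ_idx[1]
    return (bare(tokens[i1]),                              # first organ
            [bare(t) for t in tokens[:i1]],                # its descriptors
            " ".join(bare(t) for t in tokens[i1 + 1:i2]),  # relation
            bare(tokens[i2]),                              # second organ
            [bare(t) for t in tokens[i2 + 1:]])            # its descriptors

print(parse_complex("<filaments> {inserted} on <corollas> {:} {usually} {distinct} ,"))
# -> ('filaments', [], 'inserted on', 'corollas', ['usually', 'distinct'])
```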

CharacterStateExtractor

Chunks containing character states extracted from a segment need to be processed to associate them with their corresponding characters and produce the XML character elements. This module takes a chunk in two different forms: the tagged form and the plain text form.

The tagged form includes one or more character states enclosed within “{” “}” tags. The character states are extracted and used to query a glossary of terms to find their character match. This glossary of character and character state pairs is constructed by running the unsupervised semantic markup algorithm and is augmented by expert input. Character states fill the value attribute, and characters fill the name attribute of the <character> element, for example <character name="dehiscence" value="introrse"/>. As seen in ParseSegment and Figure 1, each <character> element is associated with a structure by nesting character elements in the relevant structure elements.

The plain text form is used to handle numerical character states that describe the length, width, height, size, thickness, or count of a particular organ. These numerical values may be discrete values or range values. They can be broadly grouped under three categories:

- Numbers defining characteristics in metric units (1.5-3 cm long, 3 mm wide)
- Numbers defining characteristics in relative terms (2 times longer than inner)
- Numbers defining count or presence (Seeds usually 1)

Each category is handled uniquely by using a set of predefined regular expression patterns. The resulting annotations for the above examples are:

- <character char_type="range_value" name="length" from="1.5" to="3" unit="cm"/>
- <character char_type="relative_value" name="size" value="2" modifier="usually" constraint="times as long as inner"/>
- <character name="count" value="1" modifier="usually"/>
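A sketch of the regular-expression handling for two of the three categories follows. The patterns are simplified assumptions; the real system uses a larger predefined set:

```python
import re

# Simplified, assumed patterns for the range and count categories.
RANGE_RE = re.compile(r'(?P<lo>[\d.]+)\s*-\s*(?P<hi>[\d.]+)\s*(?P<unit>mm|cm|m)\s+(?P<dim>long|wide|high)')
COUNT_RE = re.compile(r'(?:(?P<mod>usually|rarely)\s+)?(?P<value>\d+)\s*$')

def extract_numeric(text):
    """Map a plain-text numerical chunk to a <character> element string."""
    m = RANGE_RE.search(text)
    if m:
        name = {"long": "length", "wide": "width", "high": "height"}[m.group("dim")]
        return (f'<character char_type="range_value" name="{name}" '
                f'from="{m.group("lo")}" to="{m.group("hi")}" unit="{m.group("unit")}"/>')
    m = COUNT_RE.search(text)
    if m:
        mod = f' modifier="{m.group("mod")}"' if m.group("mod") else ""
        return f'<character name="count" value="{m.group("value")}"{mod}/>'
    return None

print(extract_numeric("1.5 - 3 cm long"))
# -> <character char_type="range_value" name="length" from="1.5" to="3" unit="cm"/>
print(extract_numeric("usually 1"))
# -> <character name="count" value="1" modifier="usually"/>
```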


SegmentIntegrator

The annotated segments produced so far need to be merged back into whole sentences. This is done by integrating the XML structures of segments belonging to a single sentence. Text that had been appended to subsequent segments during the segmentation procedure is now merged with that of the corresponding organ from the previous segment.

EXPERIMENTAL RESULTS

A set of 559 description sentences randomly selected from volume 19 of Flora of North America (FNA, 1993) and 472 descriptions from Part H of Treatise on Invertebrate Paleontology (TIP, Moore, 1952) were hand-annotated. The original sampling rate was 5%, but a number of non-descriptive sentences were removed from the two test sets (e.g., “characters unknown”, “2n=16”), leaving effective sampling rates of 4.5% for FNA and 4.8% for TIP. The entire volumes of FNA v19 and TIP/H were annotated using the algorithm described above, and performance evaluation was carried out on the sampled sentences.

Performance evaluation was done by comparing machine-annotated sentences with their human-annotated counterparts. Precision and recall measures were calculated:

Precision = Number of correctly identified instances / Number of identified instances

Recall = Number of correctly identified instances / Actual number of instances
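Treating annotations as comparable tuples, the two measures can be computed as follows (a sketch; the real evaluation also applies the strict/relaxed matching modes described below):

```python
def precision_recall(machine, gold):
    """Precision and recall over sets of annotated instances,
    per the formulas above."""
    correct = len(set(machine) & set(gold))
    precision = correct / len(machine) if machine else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical (structure, character, value) triples for illustration.
gold = {("anther", "dehiscence", "introrse"), ("filament", "fusion", "distinct")}
machine = {("anther", "dehiscence", "introrse"), ("corolla", "fusion", "distinct")}
p, r = precision_recall(machine, gold)
print(f"P={p:.2f} R={r:.2f}")  # -> P=0.50 R=0.50
```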


We report these measurements for structures, characters/states, and relations. The correctness of an annotation is defined in two modes: strict and relaxed. Correctness in strict mode requires an exact and perfect match between the human- and machine-annotated sentences, while in the relaxed mode, partial string matches are also counted as correct. It is important to note that for a structure element to be correct in either mode, the name attribute must be correct. For a character/state (i.e., name/value pair) to be identified as correct, the pair must be correct and must be associated with a correct structure. For a relation to be correct, the relation name and the involved structures must be correct at the same time. In a sense, the relaxed mode probably reflects human perception better, and it is often easy programmatically to change a partial match into an exact match by introducing additional heuristic rules.

Table 1 shows the descriptive statistics of the FNA v19 and TIP part H test sets. The “N” column lists the total number of sentences in a test set. Table 2 shows the precision and recall measures for structure, character, and relation elements. The sentence-level measurements were based on the relaxed notion of correctness and are the basis for further analysis of performance shown in Table 3 and Table 4.


Data Set   Sentences (N)   Average number of objects in a sentence
                           Token   Structure   Character   Relation
FNA.v19    559             12.7    1.68        3.63        0.19
TIP.H      472             9.92    2.07        2.11        0.79

Table 1. Descriptive Statistics of FNAv19 and TIP part H Test Sets.

Data Set   Avg.        Structure P/R       Character P/R       Relation P/R
           segs/sent   strict   relaxed    strict   relaxed    strict   relaxed
FNA.v19    1.35        71/67    88/83      36/35    63/60      2/3      13/18
TIP.H      1.43        59/63    85/92      45/37    52/43      5/5      36/38

Table 2.1. Annotation Performance of the Hybrid Algorithm

Data Set   Avg. segs/sent   Sentence P/R   Correct Sentences
FNAv19     1.35             74/73          266 (47.5%)
TIP.H      1.43             72/69          125 (26.7%)

Table 2.2. Annotation Performance of the Hybrid Algorithm

# of structures in sent   FNA.v19 (correct N=266)   TIP.H (correct N=125)
One structure             251                       111
Multiple structures       15                        14

Table 3. Sentence Complexity and Performance in Number of Sentences Correctly Annotated

DISCUSSION

The hybrid algorithm produced similar performance on the two data sets in terms of sentence-wide precision and recall (Table 2). The performance was not up to our expectations but was comparable to Wood et al. (2003) and Diederich et al. (1999), both of which were based on keyword-based searches along with some less-specified heuristics and were tested on unlimited characters of similar scope to ours, but on a much smaller number of descriptions. This may allow us to draw more general conclusions about heuristics-based methods than our research alone would warrant.



                                    FNA.v19 (N=559)   TIP.H (N=472)
                                    (P/R)=74/73       (P/R)=72/69
Sentences with 1 structure          369 (83/82)       186 (84/83)
Sentences with 2 structures         102 (56/53)       139 (66/62)
Sentences with 3 structures         38 (57/53)        85 (68/61)
Sentences with 4 structures         26 (65/63)        37 (61/58)
Sentences with 5 structures         10 (61/55)        13 (65/55)
Sentences with 6 structures         8 (56/52)         8 (57/51)
Sentences with 7 structures         2 (75/74)         1 (50/41)
Sentences with 8 structures         2 (54/49)         2 (46/37)
Sentences with 9 structures         1 (63/54)         0
Sentences with 10 structures        1 (48/36)         1 (70/67)
Sentences with multiple structures  190 (58/55)       286 (65/60)

Table 4. Sentence Complexity and Performance in Average Sentence Level Precision and Recall

Data Set          Not segmented sentences   Not seg. sent. with multiple structures
FNA.v19 (N=559)   437 (78%)                 71 (13%)
TIP.H (N=472)     307 (65%)                 130 (27.5%)

Table 5. Segmentation Effectiveness

Tables 2 and 3 show that the vast majority of correctly annotated sentences describe a single structure. Table 4 shows that performance decreases as sentence complexity increases. This is probably true across the studies. Unfortunately, though it may be unexpected by many, a significant portion, if not the majority, of the sentences in morphological descriptions are complex and describe at least two structures (and can describe as many as ten, as seen in our sample).

The Segmentation heuristics seem to be less effective in reducing text complexity for the annotation modules, as no segmentation was conducted in the majority of the sentences, even when there were multiple structures in a sentence (Table 5). At its simplest, a segment should include one structure per non-relational segment, or two structures per relational segment (higher-rank relations should be broken down into multiple binary relations). The extremely low performance on relation annotation (Table 2.1) suggests that further chunking is needed in relational segments to more accurately identify the structures, characters, and relations in a chunk.

For example, “heads 5-8+ in corymbiform arrays” should be segmented into two segments, “heads 5-8+” and “in corymbiform arrays”, with “in” and “corymbiform arrays” separated in the latter, so that “in” will eventually be translated into a relation “arranged_in” associating heads with arrays. The simple Segmentation heuristics did not segment this sentence because they did not know the POS tag of “arrays”; as a result, “corymbiform” was annotated as a character of heads. The unsupervised algorithm was good at learning domain-specific terms but helped little in POS-tagging common words such as “arrays”. We also note that the relations in the description text are often signaled by either a preposition (“at”, “in”, “near”, etc.) or a past/present participle. These are all common English words and basic syntax (prepositional phrases and verb phrases) that the NLP field has studied thoroughly.

This analysis has led us to take a fresh look at general-purpose syntactic parsers such as the Stanford Parser (SNLP, 2011) and to investigate ways to use them effectively in morphological description parsing. A preliminary, small-scale test shows that, by replacing the Segmentation heuristics with the Stanford Parser, it is possible to increase precision and recall for relations by 30%-40%.



Two other factors contributed errors to the annotation. One is that the current implementation did not take full advantage of the unsupervised learning results. For example, although the sentence “distal cauline well developed” contains no noun at all, the unsupervised algorithm was able to tag it as “distal cauline leaf” using context information. The heuristic algorithm, in its current implementation, only uses the learning results to POS-tag sentences, so it would believe that the sentence contains all descriptors (“{}”) without a structure and would therefore associate all characters/values with a default structure, “whole_organism”. There were at least 43 similar sentences in the FNA v19 data set. This causes an error in the structure element and also in all related character and relation elements. In addition, the algorithm did not attempt to recognize constraints for structures, but mistook them for characters. For example, “distal leaves large” should be interpreted as “distal-leaves large” (where “distal” is a constraint for “leaves”; see Figure 1 for another example) as opposed to “leaves distal and large”, where “distal” is annotated as a character of the leaves. Again, the unsupervised algorithm could have helped avoid this type of problem, but it was not fully used.

The other factor is domain-specific conventions that need special treatment. For example, most people would agree that the sentence “annuals or biennials” contains two plural nouns, yet in the annotation, annuals and biennials should be annotated as if they were adjectives (e.g., life_style="annual"). This sort of domain-specific annotation requirement should probably stay as heuristic rules packaged in a pluggable module.

As a last note in this section, ontology support for this application is indispensable. For the sentence “profile biplanate or gently concavoconvex ;”, the algorithm's correct annotation of “biplanate” and “concavoconvex” as “shape” did not get credit, because the expected answer is the more specific “profile”.


CONCLUSION AND FUTURE WORK

We implemented and studied a simple hybrid unsupervised learning and heuristic-based annotation system to learn the potential of heuristic rule-based systems for annotating a huge amount of biodiversity description text. Through careful analyses of the description text and annotation results, we conclude:

1. The syntax structure of morphological descriptions is more complex than many had expected.
2. In general, simple segmentation is not capable of breaking sentences into simple components properly (ideally one structure per non-relational chunk and two structures per relational chunk).
3. Further analysis of relational chunks is needed to better identify relations (i.e., verb and preposition phrases).
4. Collection/domain-dependent features require domain-specific heuristics.
5. Context is still needed in many cases; therefore, better integration with the unsupervised learning algorithm is required.

Items 2 and 3 are the major bottlenecks, while 4 and 5 can be addressed easily. To address 2 and 3, we could 1) further refine and add new heuristic rules, or 2) make use of existing research and knowledge on syntactic structure and common English words. Option 1 will be difficult to maintain and runs the risk of overfitting (i.e., making the system less useful for other description collections). We find option 2 the better choice and have started to investigate ways to replace the Segmentation module with the Stanford Parser, not by retraining it, but by placing controls on it. A preliminary, small-scale test shows that, by replacing the Segmentation heuristics with the Stanford Parser, it is possible to increase precision and recall for relations by 30%-40%. The resulting system will be a hybrid unsupervised learning and parsing algorithm.


ACKNOWLEDGMENTS

Research reported in this paper was supported by funding from the Advances in Biological Informatics Program of the National Science Foundation (EF 0849982). The authors thank Jessica Bazeley of Yale University, James Macklin of Harvard University, and Prasad Avhad and Edward McCain of the University of Arizona for their help in preparing test descriptions. The authors also thank the publishers of FNA and TIP for their sustained support in this and related biodiversity literature processing projects.

REFERENCES

Balhoff, J.P., Dahdul, W.M., Kothari, C.R., Lapp, H., Lundberg, J.G., Mabee, P., Midford, P.E., Westerfield, M.E., & Vision, T.J. (2010). Phenex: Ontological annotation of phenotypic diversity. PLoS ONE, 5(5), e10500.

Blaschke, C., Hirschman, L., & Valencia, A. (2002). Information extraction in molecular biology. Briefings in Bioinformatics, 3(2), 154-165.

Chapman, W. & Cohen, K.B. (2009). Current issues in biomedical text mining and natural language processing. Journal of Biomedical Informatics, 42, 757-759.

Cui, H., Boufford, D., & Selden, P. (2010). Semantic annotation of biosystematics literature without training examples. Journal of the American Society for Information Science and Technology.

Diederich, J., Fortuner, R., & Milton, J. (1999). Computer-assisted data extraction from the taxonomical literature. Retrieved December 12, 2009, from http://math.ucdavis.edu/~milton/genisys.html

FNA - Flora of North America Editorial Committee (Eds.). (1993-). Flora of North America. Retrieved July 10, 2008, from http://www.fna.org/

Hobbs, J. & Riloff, E. (2010). Information extraction. In N. Indurkhya & F. J. Damerau (Eds.), Handbook of Natural Language Processing (2nd ed.). Chapman & Hall/CRC Press, Taylor & Francis Group.

Moore, R. C., Teichert, C., Robison, R. A., Kaesler, R. L., & Selden, P. A. (Eds.). (1952). Treatise on Invertebrate Paleontology. Lawrence, Kansas: University of Kansas and Boulder, Colorado: Geological Society of America.

PATO - Phenotypic quality ontology. (n.d.). Retrieved May 21, 2011, from http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=UO

Sautter, G., Agosti, D., & Bohm, K. (2007). Semi-automated XML markup of biosystematics legacy literature with the GoldenGATE editor. Proceedings of Pacific Symposium on Biocomputing 2007 (pp. 391-402).

SNLP - The Stanford Natural Language Processing Group. (n.d.). The Stanford Parser: A statistical parser. Retrieved February 20, 2011, from http://nlp.stanford.edu/software/lex-parser.shtml

Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3), 233-272.

Tang, X., & Heidorn, P. B. (2007). Using automatically extracted information in species page retrieval. Proceedings of TDWG 2007. Retrieved April 20, 2009, from http://www.tdwg.org/proceedings/article/view/195

Taylor, A. (1995). Extracting knowledge from biological descriptions. Proceedings of 2nd International Conference on Building and Sharing Very Large-Scale Knowledge Bases (pp. 114-119).

Wood, M.M., Lydon, S. J., Tablan, V., Maynard, D., & Cunningham, H. (2003). Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing: Selected Papers from RANLP 2003 (pp. 70-77).