2.B.2 Language Resources: Integrating Resources, New Resources


The knowledge intensive machine learning approaches we use rely heavily on annotated data for training purposes. At the moment these resources are limited in scope and too language and genre specific. Simply annotating additional data in desired genres and languages is a short-term fix, albeit an expensive and time consuming one, but it does not address the more fundamental issue of developing techniques for rapidly porting these resources to new genres and new languages as and when priorities shift. We believe that a paradigm shift away from more data annotation is needed, and this involves a comprehensive approach to more effective use of machine learning in the creation and annotation process itself, including: 1) active learning to quickly extend the lexical coverage of current annotations; 2) automatic cross mapping between resources to maximize their utility as Gold Standards; 3) automatically creating coherent clusters by clustering algorithms trained against these Gold Standards; and 4) rapid porting techniques based on using clusters for backoff.

Active Learning

We have recently shown that active learning can be effective in significantly reducing the amount of hand-tagged data needed for training a supervised Word Sense Disambiguation (WSD) system on our coarse-grained senses (See Section F). Since the features used in WSD are quite similar to those used in SRL, we are now planning to use this approach as a technique for quickly extending the coverage of our SRL systems to the AQUAINT corpora.


Linking Lexical Resources

A wide variety of lexical resources (PropBank, FrameNet, VerbNet, WordNet) are available which can be used to extract semantic information from the propositions in a sentence. Since each of these resources was created independently, and with differing goals, they provide complementary information and will be more useful when they can be used in conjunction with each other. However, there are significant differences in their coverage of lexemes and in the ways they structure that information. We are therefore interested in extending the coverage of each of these lexical resources to include the lexemes described by the remaining resources, and in providing “meaning preserving” links between their data structures. We want to extend VerbNet's coverage to match FrameNet and PropBank, and vice versa. One of our long-term goals is to be able to use the semantically coherent classes in VerbNet and FrameNet to automatically extend the coverage of our Semantic Role Labeling systems trained on PropBank to new lexical items not in the training data, but present in these classes.

We will be able to use the existing mappings between classes that we are developing under Phase 2 funding to propose new candidates for each resource, which can be quickly filtered based on syntactic compatibility. Of the 2900 entries in FrameNet, 900 are not already in VerbNet and could be added. Another natural source for extending VerbNet's coverage is the 3200 frame files for PropBank, 1100 of which are not yet in VerbNet.
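As a concrete illustration of this candidate-and-filter step, the sketch below proposes lemmas present in one resource but missing from another, then keeps only those whose observed syntactic frames fit the proposed target class. The lemma inventories, frame notation, and class definition are toy stand-ins, not real resource data.

```python
def propose_candidates(source_lemmas, target_lemmas):
    """Lemmas covered by the source resource but missing from the target."""
    return sorted(set(source_lemmas) - set(target_lemmas))

def filter_by_frames(candidates, observed_frames, class_frames):
    """Keep candidates whose observed syntactic frames are all licensed by
    the proposed target class."""
    return [c for c in candidates
            if observed_frames.get(c, set()) <= class_frames]

# Toy inventories and a hypothetical NP-V-... frame notation:
framenet_lemmas = {"give", "donate", "hand", "glimpse"}
verbnet_lemmas = {"give", "hand"}
observed_frames = {"donate": {"NP-V-NP-PP"}, "glimpse": {"NP-V-PP"}}
give_class_frames = {"NP-V-NP-NP", "NP-V-NP-PP"}

candidates = propose_candidates(framenet_lemmas, verbnet_lemmas)
accepted = filter_by_frames(candidates, observed_frames, give_class_frames)
print(candidates, accepted)  # ['donate', 'glimpse'] ['donate']
```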

Automatic clustering. Having the 1M word PropBank corpus annotated with VerbNet and FrameNet gives us a rich resource for training clustering algorithms. The VerbNet and FrameNet annotations provide us with several alternative Gold Standard groupings of lexical items which we can train the clustering algorithms against. We have already had marked success in improving WSD results through the use of automatically derived noun clusters (See Section F). We will be able to seed these techniques with our tagged PropBank data, and evaluate our results against VerbNet and FrameNet classes.
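One simple way to score an automatically derived clustering against such a Gold Standard grouping is cluster purity. The sketch below computes it over a toy verb clustering; the cluster ids and gold class labels are illustrative, not actual VerbNet or FrameNet data.

```python
from collections import Counter

def purity(clusters, gold):
    """Fraction of items whose induced cluster's majority gold class matches.
    clusters / gold: maps from item to cluster id / gold class label."""
    by_cluster = {}
    for item, cid in clusters.items():
        by_cluster.setdefault(cid, []).append(gold[item])
    hits = sum(Counter(members).most_common(1)[0][1]
               for members in by_cluster.values())
    return hits / len(clusters)

# Toy induced verb clusters scored against hypothetical gold classes:
auto = {"give": 0, "hand": 0, "donate": 0, "run": 1, "jog": 1, "eat": 1}
gold = {"give": "transfer", "hand": "transfer", "donate": "transfer",
        "run": "motion", "jog": "motion", "eat": "ingest"}
print(round(purity(auto, gold), 3))  # 0.833
```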

New Genres and Languages

Having reliable semi-supervised clustering algorithms will greatly reduce the effort required to provide annotations for new genres and languages. New genres can be addressed through a combination of active learning and the use of coherent semantic clusters as a backoff for WSD and SRL. Beginning with a set of coherent semantic clusters will facilitate the creation of PropBank Frame files, and will provide grounds for experimentation with semi-automatic frame file creation.
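The backoff idea can be sketched as follows: a lexical predictor (here, a most-frequent-label baseline per lemma) falls back to its semantic cluster's label distribution when a lemma was unseen in training. The clusters, labels, and training data below are fabricated for illustration.

```python
from collections import Counter

def train(examples, lemma_to_cluster):
    """Collect label counts per lemma and per semantic cluster."""
    lemma_counts, cluster_counts = {}, {}
    for lemma, label in examples:
        lemma_counts.setdefault(lemma, Counter())[label] += 1
        cluster_counts.setdefault(lemma_to_cluster[lemma], Counter())[label] += 1
    return lemma_counts, cluster_counts

def predict(lemma, lemma_counts, cluster_counts, lemma_to_cluster):
    if lemma in lemma_counts:                       # seen lemma: lexical stats
        return lemma_counts[lemma].most_common(1)[0][0]
    cluster = lemma_to_cluster.get(lemma)           # unseen lemma: back off
    if cluster in cluster_counts:
        return cluster_counts[cluster].most_common(1)[0][0]
    return "UNKNOWN"

clusters = {"give": "transfer", "hand": "transfer", "donate": "transfer"}
train_data = [("give", "Agent-Theme-Recipient"), ("hand", "Agent-Theme-Recipient")]
lc, cc = train(train_data, clusters)
# "donate" never appears in training but shares the "transfer" cluster:
print(predict("donate", lc, cc, clusters))  # Agent-Theme-Recipient
```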

We will apply this general approach to improving the coverage and accuracy of all of our lexically based components, such as SRL, WSD, Event Detection, Temporal Relation Detection, etc. However, there are some types of annotation that will require additional effort. As we have already discussed, many nominal expressions introduce events, and these need to be detected and integrated into our event descriptions. NomBank provides a first pass at training data, but this will need additional analysis and processing to improve consistency. A major gap in annotation is the lack of annotated temporal relations for Chinese. We also have very limited Korean PropBank data for training SRL systems. We plan to address each of these limitations through specific, targeted annotation efforts, as described below:


Nominal events. Using Time Bank as our Gold Standard, we will analyze the types of nominal events annotated in Time Bank that are missing from NomBank (somewhere between 15% and 20%) to determine the most efficient means of detecting and annotating them. We will annotate a portion of these for a subset of the existing PropBank. We will then retrain our event detection component and run it on the AQUAINT data. We will hand correct a portion of this data, retrain, and then use active learning to hand correct an additional amount for retraining. Our Chinese data consists primarily of event annotations, so that data, coupled with PropBank, is already available for porting the English event detection component to Chinese.


Chinese Temporal Relations. Using the Time Bank guidelines for English, we will annotate temporal relations between already identified events in 6 words of the Chinese PropBank/NomBank, an amount similar to Time Bank's.


Additional Korean PropBank. We will use the existing 200K Korean PropBank to train a dependency parser. Coupling this with the Korean morphological analyzer developed at Penn, we will automatically parse AQUAINT relevant documents. We will then experiment with manual annotation of PropBank arguments on top of the automatic dependency parses. The goal is to provide enough training data for active learning to become feasible.

Previous Accomplishments (to go in F; may need to be cut)

Extended VerbNet. With a mapping from PropBank to VerbNet 1.0 involving close to 1500 lemmas, and a mapping between VerbNet 1.0 and FrameNet 1.1 involving 1952 verb senses (out of 4173) representing 172 classes (out of 191) (Kipper et al., 2006), we are currently extending these mappings to include the new lemmas in Extended VerbNet and in FrameNet 1.2. We first generated the PB/VN mappings automatically using two heuristic classifiers: 1) we used the SenseLearner WSD engine to find the WordNet sense tag of each verb and then used the existing WordNet/VerbNet mapping to choose the corresponding VerbNet class; and 2) we used the syntactic frames associated with each class to filter out instances with incompatible syntactic structures. Remaining ambiguities are being hand corrected, resulting in a Gold Standard tagged corpus for training purposes.
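The two heuristic classifiers can be sketched as a small pipeline, assuming the WSD step has already produced a WordNet sense tag for each token. The sense tags, VerbNet class ids, and frame notation below are illustrative stand-ins for the real resources.

```python
# Hypothetical fragments of a WordNet/VerbNet mapping and of VerbNet's
# licensed syntactic frames:
wn_to_vn = {
    ("give", "wn-sense-1"): "give-13.1",
    ("give", "wn-sense-2"): "contribute-13.2",
}
vn_frames = {
    "give-13.1": {"NP-V-NP-NP", "NP-V-NP-PP"},
    "contribute-13.2": {"NP-V-PP"},
}

def map_token(lemma, wn_sense, observed_frame):
    """Classifier 1: WordNet sense -> VerbNet class via the existing mapping.
    Classifier 2: reject instances whose syntactic frame the class does not
    license. Unmapped or rejected tokens are left for hand correction."""
    vn_class = wn_to_vn.get((lemma, wn_sense))
    if vn_class is None or observed_frame not in vn_frames[vn_class]:
        return None
    return vn_class

print(map_token("give", "wn-sense-1", "NP-V-NP-NP"))  # give-13.1
print(map_token("give", "wn-sense-2", "NP-V-NP-NP"))  # None: incompatible frame
```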


We use a smoothed maximum entropy (MaxEnt) model with a Gaussian prior (2002) and linguistically motivated features for supervised verb sense disambiguation (Chen and Palmer, 2005a, 2005b). With three enhancements in the extraction and treatment of linguistically motivated features, our system achieved higher performance than the best previously known results on the SENSEVAL2 English verbs, including 87% on coarse-grained WN senses. The fourth enhancement, clustering-based feature selection of semantic features, improved our system's performance even further on verbs whose senses rely heavily on their noun phrase arguments. We applied the same methodology to Chinese verb sense disambiguation successfully (Xue et al., 2006).
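To make the model family concrete, here is a minimal sketch, not our actual system: binary logistic regression trained by gradient ascent, with the Gaussian prior realized as the usual L2 penalty on the weights. The binary context features and toy data are invented for illustration.

```python
import math

def train_maxent(data, sigma2=1.0, lr=0.5, epochs=500):
    """data: list of (set of active features, label in {0, 1})."""
    feats = sorted({f for x, _ in data for f in x})
    w = {f: 0.0 for f in feats}
    for _ in range(epochs):
        grad = {f: -w[f] / sigma2 for f in feats}   # Gaussian-prior gradient
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-sum(w[f] for f in x)))
            for f in x:
                grad[f] += y - p                    # log-likelihood gradient
        for f in feats:
            w[f] += lr * grad[f] / len(data)
    return w

def predict(w, x):
    return 1 if sum(w.get(f, 0.0) for f in x) > 0 else 0

# Toy data: disambiguating "bank" from a single context feature.
data = [({"ctx=river"}, 0), ({"ctx=money"}, 1)] * 3
w = train_maxent(data)
print(predict(w, {"ctx=money"}), predict(w, {"ctx=river"}))  # 1 0
```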

Our experiments on active learning for supervised WSD showed that two uncertainty-based active learning methods (Entropy Sampling and Minimum Margin Sampling, Schein, 2005), combined with the smoothed MaxEnt model, work well on learning coarse-grained English verb senses (Chen et al., 2006). Data analysis of the active learning process suggests that high-quality feature extraction and selection are important for active learning to benefit WSD.
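The two selection criteria can be sketched directly: entropy sampling picks the instance whose predicted sense distribution has the highest entropy, while minimum margin sampling picks the instance with the smallest gap between its top two sense probabilities. The pool of distributions below is illustrative.

```python
import math

def entropy(p):
    """Shannon entropy of a sense probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def margin(p):
    """Gap between the two most probable senses."""
    top = sorted(p, reverse=True)
    return top[0] - top[1]

def select(pool, criterion):
    """Pick the next instance to hand-tag. pool: {instance_id: distribution}."""
    if criterion == "entropy":
        return max(pool, key=lambda i: entropy(pool[i]))
    return min(pool, key=lambda i: margin(pool[i]))

pool = {
    "s1": [0.90, 0.05, 0.05],   # model is confident
    "s2": [0.40, 0.35, 0.25],   # flat distribution: highest entropy
    "s3": [0.50, 0.50, 0.00],   # top two senses tied: smallest margin
}
print(select(pool, "entropy"), select(pool, "margin"))  # s2 s3
```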

Our work on Chinese verb sense discrimination using EM Clustering indicates that a general semantic taxonomy of Chinese nouns built semi-automatically, together with more constrained lexical sets, can substantially alleviate the data sparseness that faces the EM clustering model (Chen and Palmer, 2004). We expect that a decent general taxonomy of nouns will also benefit supervised WSD. Our current research is focusing on automatic acquisition of semantically coherent noun groups from large text corpora. The acquired noun groups can be used to build the general semantic taxonomy and as semantic features for WSD and other NLP applications such as information retrieval and information extraction.
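A toy version of this EM clustering setup can be sketched as a two-component multinomial mixture over context-word counts, with a couple of nouns pinned to components at initialization in the spirit of seeding with tagged data. All counts, vocabulary, and nouns below are fabricated for illustration.

```python
import math

def em_cluster(counts, vocab, k=2, iters=30, seed_assign=None):
    """counts: {noun: {context_word: count}}. seed_assign optionally pins a
    few nouns to components at initialization. Returns {noun: component}."""
    seed_assign = seed_assign or {}
    nouns = sorted(counts)
    resp = {n: ([1.0 if c == seed_assign[n] else 0.0 for c in range(k)]
                if n in seed_assign else [1.0 / k] * k)
            for n in nouns}
    for _ in range(iters):
        # M-step: component priors and add-one-smoothed multinomials.
        prior = [sum(resp[n][c] for n in nouns) / len(nouns) for c in range(k)]
        theta = []
        for c in range(k):
            tot = {w: 1.0 for w in vocab}
            for n in nouns:
                for w, cnt in counts[n].items():
                    tot[w] += resp[n][c] * cnt
            z = sum(tot.values())
            theta.append({w: v / z for w, v in tot.items()})
        # E-step: responsibilities from per-component log-likelihoods.
        for n in nouns:
            logp = [math.log(prior[c]) +
                    sum(cnt * math.log(theta[c][w])
                        for w, cnt in counts[n].items())
                    for c in range(k)]
            m = max(logp)
            probs = [math.exp(l - m) for l in logp]
            s = sum(probs)
            resp[n] = [p / s for p in probs]
    return {n: resp[n].index(max(resp[n])) for n in nouns}

counts = {
    "river": {"water": 5, "flow": 3},
    "lake":  {"water": 4, "flow": 2},
    "bond":  {"market": 5, "price": 3},
    "stock": {"market": 4, "price": 4},
}
vocab = ["water", "flow", "market", "price"]
labels = em_cluster(counts, vocab, seed_assign={"river": 0, "bond": 1})
print(labels["river"] == labels["lake"], labels["bond"] == labels["stock"])
```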