Using Machine Learning to Annotate Data for NLP Tasks Semi-Automatically


Centre for Text Technology (CTexT)

Research Unit: Languages and Literature in the South African Context

North-West University, Potchefstroom Campus (PUK)

South Africa

{Gerhard.VanHuyssteen; Martin.Puttkammer; Sulene.Pilon; Handre.Groenewald}@nwu.ac.za

Gerhard B van Huyssteen, Martin J Puttkammer, Suléne Pilon and Hendrik J Groenewald

30 September 2007; Borovets


Overview



Introduction


End-User Requirements


Solution: Design & Implementation


Evaluation


Conclusion


Human Language Technologies



HLTs depend on the availability of linguistic data


Specialized lexicons


Annotated and raw corpora


Formalized grammar rules


Creation of such resources


Expensive and protracted


Especially for less-resourced languages


Less-resourced Languages



"languages for which few digital resources exist; and thus,
languages whose computerization poses unique
challenges. [They] are languages with limited financial,
political, and legal resources… " (Garrett, 2006)


Implicit in this definition:


Lacks human resources (little attention in research or discussions)


Lacks computational linguists working on these languages



Research question:


How could one facilitate development of linguistic data by enabling non-experts to collaborate in the computerization of less-resourced languages?




Methodology I


Empowering linguists and mother-tongue speakers to deliver annotated data


High quality


Shortest possible time


Scale up the annotation of linguistic data by mother-tongue speakers


User-friendly environments


Bootstrapping


Machine learning instead of rule-based techniques



Methodology II


The general idea:


Development of gold standards


Development of annotated data


Bootstrapping



With the click of a button:


Annotate data


Train machine-learning algorithm



Central Point of Departure I



Annotators are invaluable resources


Based on experiences with less-resourced languages


Annotators have mostly word processing skills


Used to a GUI-based environment


Usually limited skills in a computational or
programming environment


In the worst cases, annotators have difficulties with


File management


Unzipping


Proper encoding of text files


Central Point of Departure II



Aim of this project: enable annotators to focus on what they are good at, namely enriching data with expert linguistic knowledge


Training the machine learner occurs
automatically




End-User Requirements I


Unstructured interviews with four
annotators

1. What do you find unpleasant about your work as an annotator?

2. What will make your life as an annotator easier?


End-User Requirements II

1. What do you find unpleasant about your work as an annotator?


Repetitiveness


Lack of concentration/motivation


Feeling “useless”


Do not see results


End-User Requirements III

2. What will make your life as an annotator
easier?


Friendly environment (i.e. GUI-based, and not lists of words)


Bite-sized chunks of data rather than endless lists


Would rather correct data than annotate from scratch


Program should already suggest a possible
annotation


Click or drag


Reference works need to be available


Automatic data management



Solution: TurboAnnotate


User-friendly annotation environment



Bootstrapping with machine learning



Creating gold standards/annotated lists


Inspired by DictionaryMaker (Davel and Peche, 2006) and Alchemist (University of Chicago, 2004)


DictionaryMaker


Alchemist


Simplified Workflow of TurboAnnotate


Step 1: Create Gold Standard


Create gold standard


Independent test set for evaluating
performance


1000 random instances used


Annotator only has to select one
data file
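As a minimal illustration of Step 1 (not TurboAnnotate's actual code), the 1000 random instances could be drawn from the selected data file as follows; the file names and helper name are hypothetical.

```python
# Minimal sketch of Step 1, assuming the data file is a plain word list
# (one word per line); the sample size follows the slides, names are hypothetical.
import random

def create_gold_standard(data_file: str, gold_file: str, n: int = 1000) -> None:
    """Draw n random, unique words from data_file to serve as the gold standard."""
    with open(data_file, encoding="utf-8") as f:
        words = [line.strip() for line in f if line.strip()]
    gold = random.sample(words, min(n, len(words)))
    with open(gold_file, "w", encoding="utf-8") as f:
        f.write("\n".join(gold) + "\n")

# create_gold_standard("afrikaans_base_list.txt", "gold_standard.txt")  # hypothetical files
```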



Step 2: Verify Annotations


New data sourced from base list


Automatically annotated by classifier


Presented to annotator in the "Annotate" tab


TurboAnnotate: Annotation Environment



Step 3: Verify Annotated Set


Bootstrapping


Inspired by DictionaryMaker

200 words per chunk


Trained in the background


Annotator verifies


Click “accept” or correct the instance


Verified data serve as training data


Iterative process until the desired results are reached
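The loop described in Steps 2 and 3 can be sketched as follows. This is a simplified sketch, not TurboAnnotate's internals: the callables passed in stand in for TiMBL training and classification, the annotator's verification step in the GUI, and evaluation against the gold standard.

```python
# Simplified sketch of the Step 2-3 bootstrapping loop (chunk size per the slides).
def bootstrap(base_list, training_data, gold_standard,
              train, classify, verify, evaluate,
              chunk_size=200, target_accuracy=0.99):
    """train/classify/verify/evaluate are caller-supplied callables standing in
    for TiMBL training, TiMBL classification, annotator verification in the GUI,
    and evaluation against the gold standard."""
    while base_list:
        chunk, base_list = base_list[:chunk_size], base_list[chunk_size:]
        classifier = train(training_data)                 # trained in the background
        suggestions = [classify(classifier, w) for w in chunk]
        training_data += verify(suggestions)              # annotator accepts or corrects
        if evaluate(classifier, gold_standard) >= target_accuracy:
            break                                         # stop once results are good enough
    return training_data
```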


The Machine Learning System I


Tilburg Memory-Based Learner (TiMBL)


Wide success and applicability in the field of
natural language processing


Available for research purposes


Relatively easy to use


On the downside


Performs best with large quantities of data


For the tasks of hyphenation and compound analysis, however, TiMBL performs well even with small quantities of data
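For illustration only, a hedged sketch of driving TiMBL from a script via its command-line interface. It assumes the conventional invocation in which -f names the training file and -t the test file; the binary name, options and file names may differ per installation, and this is not necessarily how TurboAnnotate integrates TiMBL.

```python
# Hedged sketch: calling TiMBL through its command-line interface.
# Assumes the conventional "timbl -f <train> -t <test>" invocation; adjust the
# binary name and options to the installed TiMBL version.
import subprocess

def run_timbl(train_file: str, test_file: str) -> str:
    """Train on train_file, classify test_file, and return TiMBL's console output."""
    result = subprocess.run(
        ["timbl", "-f", train_file, "-t", test_file],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# print(run_timbl("hyphenation.train", "hyphenation.gold"))  # hypothetical files
```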


The Machine Learning System II


Default parameter settings used


Task-specific feature selection


Performance is evaluated against
gold standard


For hyphenation and compound analysis, accuracy is determined on word level and not per instance
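A small illustrative sketch of what word-level accuracy means here: a word counts as correct only if every one of its instances receives the correct class. The data structures are illustrative, not TurboAnnotate's.

```python
# Word-level accuracy: a word is correct only if all of its instances
# (one per character position) were assigned the correct class.
def word_level_accuracy(words, predicted, gold):
    """predicted/gold map a word to its list of per-instance class labels."""
    correct = sum(1 for w in words if predicted[w] == gold[w])
    return correct / len(words)

# Toy example: one of the two words has a single wrong instance, so accuracy is 0.5.
gold = {"w1": ["=", "+", "="], "w2": ["=", "=", "="]}
pred = {"w1": ["=", "+", "="], "w2": ["=", "+", "="]}
print(word_level_accuracy(["w1", "w2"], pred, gold))  # 0.5
```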


Features I


All input words are converted to feature vectors


Splitting window


Context 3 positions (left and right)


Class


Hyphenation: indicating a break


Compound Analysis: 3 possible classes


+ indicating word boundary


_ indicating valence morpheme


= no break
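A minimal sketch of the windowing described above, applied to the example on the next slide. The padding symbol and exact instance layout are assumptions for illustration, not necessarily TurboAnnotate's exact feature format.

```python
# Turn an annotated word into windowed instances: for every potential split
# point, take 3 characters of left context, 3 of right context, and the class
# ('+' word boundary, '_' valence morpheme, '=' no break). '#' is an assumed
# padding symbol for positions beyond the word edges.
def word_to_instances(segmented: str, context: int = 3):
    """segmented: word with '+'/'_' marking boundaries, e.g. 'eksamen+lokaal'."""
    chars, classes = [], []
    for ch in segmented:
        if ch in "+_":
            classes[-1] = ch          # label the split point before this marker
        else:
            chars.append(ch)
            classes.append("=")       # default: no break after this character
    padded = ["#"] * context + chars + ["#"] * context
    instances = []
    for i in range(len(chars) - 1):   # a split point lies between chars i and i+1
        left = padded[i + 1 : i + 1 + context]
        right = padded[i + 1 + context : i + 1 + 2 * context]
        instances.append((left + right, classes[i]))
    return instances

# Example from the next slide: eksamenlokaal 'examination room'
for feats, cls in word_to_instances("eksamen+lokaal"):
    print(" ".join(feats), cls)       # e.g. "m e n l o k +" at the compound boundary
```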




Features II



Example: eksamenlokaal 'examination room'


Parameter Optimisation I


Large variations in accuracy occur when
parameter settings of MBL algorithms are
changed


Finding the best combination of parameters


Exhaustive searches undesirable


Slow and computationally expensive



Parameter Optimisation II


Alternative: Paramsearch (Van den Bosch, 2005)


delivers combinations of algorithmic
parameters that are estimated to perform well


PSearch


Our own modification of
Paramsearch


Only run after all data have been annotated


Ensures the best possible classifier
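As a simplified illustration only of the wrapped progressive sampling idea behind Paramsearch (score candidate settings on a small sample, prune the weaker half, repeat on larger samples): this is not the actual Paramsearch or PSearch code, and the evaluation callable and candidate settings are hypothetical.

```python
# Simplified sketch of wrapped progressive sampling (NOT Paramsearch/PSearch):
# evaluate(settings, n) is a hypothetical callable that trains and scores a
# classifier with the given settings on n training items.
def progressive_search(candidates, evaluate, sample_sizes=(500, 1000, 2000)):
    survivors = list(candidates)
    for n in sample_sizes:
        scored = sorted(survivors, key=lambda s: evaluate(s, n), reverse=True)
        survivors = scored[: max(1, len(scored) // 2)]   # prune the weaker half
    return survivors[0]

# Illustrative candidate settings (e.g. varying the number of nearest neighbours):
# best = progressive_search([{"k": 1}, {"k": 3}, {"k": 5}, {"k": 7}], evaluate)
```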




Criteria


Two criteria


Accuracy


Human effort (time)


Evaluated on the tasks of hyphenation and
compound analysis for Afrikaans and
Setswana


Four human annotators


Two well-experienced in annotating


Two considered novices in the field


Accuracy


Two kinds of accuracy


Classifier accuracy


Human accuracy


Expressed as percentage of correctly
annotated words over total number of
words


Gold standard excluded from the training data


Classifier Accuracy (Hyphenation)

# Words in Training Data    Accuracy: Afrikaans    Accuracy: Setswana
200                         38.60%                 94.50%
600                         54.00%                 98.30%
1000                        58.30%                 98.80%
2000                        68.50%                 98.90%


Human Accuracy


Human accuracy


Two separate unseen datasets of 200 words for each
language


First dataset annotated in an ordinary text editor


The second dataset annotated with TurboAnnotate.


Human Accuracy


Annotation Tool             Accuracy (Hyph)    Time (s) (Hyph)    Accuracy (CA)    Time (s) (CA)
Text Editor (200 words)     93.25%             1325               91.50%           802
TurboAnnotate (200 words)   98.34%             1258               94.00%           748


Human Effort I


Two questions



Is it faster to annotate with TurboAnnotate?


What would the predicted saving on human effort be
on a large dataset?




Human Effort II



# Words in Training Set    Time (s) (Hyph)    Time (s) (CA)
0                          1258               748
600                        663                614
2000                       573                582


Human Effort III


1 minute faster to annotate 200 words with
TurboAnnotate


Larger dataset (40,000 words)


Difference of only circa 3.5 uninterrupted human hours (see the arithmetic sketch below)


This picture changes when the effect of
bootstrapping is considered


Extrapolating to 42,967 words


Saving of 51 hours (68%) for hyphenation


Saving of 9 hours (41%) for compound analysis
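A quick arithmetic check of the "circa 3.5 hours" figure above, using only the per-200-word hyphenation times reported on the Human Accuracy slide and ignoring the effect of bootstrapping.

```python
# Rough check of the no-bootstrapping saving on a 40,000-word dataset,
# using the hyphenation times per 200 words (1325 s vs. 1258 s).
words = 40_000
chunks = words / 200
saving_seconds = chunks * (1325 - 1258)   # 67 s saved per 200-word chunk
print(saving_seconds / 3600)              # ~3.7 uninterrupted hours
```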



Conclusion


TurboAnnotate
helps to increase the
accuracy of human annotators


Saves human effort



Future Work


Other lexical annotation tasks


Creating lexicons for spelling checkers


Creating data for morphological analysis


Stemming


Lemmatization


Improve GUI


Network solution


Active Learning


Experiment with C5.0



TurboAnnotate


Requirements:


Linux


Perl 5.8


Gtk+ 2.10


TiMBL 5.1


Open-source


Available at http://www.nwu.ac.za/ctext



Acknowledgements


This work was supported by a grant from the
South African National Research Foundation
(GUN: FA2004042900059).


We also acknowledge the inputs and
contributions of


Ansu Berg


Pieter Nortjé


Rigardt Pretorius


Martin Schlemmer


Wikus Slabbert
