GATE
Overview and Demo
University of Washington
CLMA Treehouse Presentation
October 8, 2010
Prescott Klassen
Overview
• Summary of GATE information and documentation found at gate.ac.uk
• GATE Developer features, components, and plug-ins
• IDE Demo
• Embedded GATE
• Using GATE with Condor on Patas
• GATE code samples
Background
• Sheffield Natural Language Processing Group at the University of Sheffield
• Released 1996; re-written and re-released 2002
• Latest Release GATE 5.2.1 (May 6, 2010): Windows, Linux, Solaris, and Mac OS
• Beta Release GATE 6.0 (Beta 1, August 21, 2010)
• 100% Java Reference Implementation
• Compatible with IBM Unstructured Information Management Architecture (UIMA)
• Open Source (GNU Library General Public License)
• XML Corpus Encoding Standard (XCES) format, used by the American National Corpus
What is GATE?
• An architecture describing how language processing systems are made up of components.
• A framework (class library) written in Java and tested on Linux, Windows and Solaris.
• A graphical development environment built on the framework (IDE for NLP)
GATE Products
• GATE Developer
  – IDE for language processing components bundled with the ANNIE (A Nearly-New Information Extraction system) and plug-ins
• GATE Teamware
  – Web app for collaborative semantic annotation projects incorporating a workflow engine and a backend service infrastructure
• GATE Embedded
  – Object library optimized for inclusion in applications
• GATE Services
  – Hosted services for cloud application development
• GATE Wiki
  – Wiki/CMS
• GATE Cloud
  – Cloud computing solution for hosted large-scale text processing
GATE Components
• Language Resources (LRs): documents, corpora and ontologies
• Processing Resources (PRs): parsers, stemmers, co-reference resolvers, ML components, etc.
• Visual Resources (VRs): IDE components that provide a visual interface (GUI) to GATE components and plug-ins
Language Resources
• Documents, corpora, and ontologies
• Can persist in Java Serial Store or Lucene Serial Data Store
• Document = content + annotations + features
• “Stand-off” Markup
• Annotations as Directed Acyclic Graphs (start Node, end Node, ID, type, Feature Map, pointers into the source document as character offsets)
• Input Formats: Plain Text, HTML, SGML, XML, RTF, Email, PDF, Microsoft Word
• Ontology support (Sesame2, OWLIM3)
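The "Document = content + annotations + features" and stand-off markup bullets above can be illustrated with a minimal sketch in plain Java. The `StandOffSketch`, `Doc`, and `Span` types are hypothetical stand-ins invented for this illustration and are not the GATE classes; the point is only that annotations carry character offsets and a feature map while the text itself is never modified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in types, NOT the GATE API: they only illustrate stand-off markup.
final class StandOffSketch {

    /** One annotation: an ID, a type, a feature map, and character offsets into the text. */
    static final class Span {
        final int id;
        final String type;
        final long start; // offset of the first character
        final long end;   // offset just past the last character
        final Map<String, Object> features = new HashMap<String, Object>();

        Span(int id, String type, long start, long end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Document = content + annotations + features; the content is never changed. */
    static final class Doc {
        final String content;
        final List<Span> annotations = new ArrayList<Span>();
        final Map<String, Object> features = new HashMap<String, Object>();
        private int nextId = 0;

        Doc(String content) {
            this.content = content;
        }

        /** Adds an annotation that points into the text by offsets only. */
        Span annotate(String type, long start, long end) {
            if (start < 0 || end > content.length() || start > end) {
                throw new IllegalArgumentException("offsets out of range");
            }
            Span span = new Span(nextId++, type, start, end);
            annotations.add(span);
            return span;
        }

        /** Recovers the annotated text from the untouched content. */
        String textOf(Span span) {
            return content.substring((int) span.start, (int) span.end);
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc("GATE was developed in Sheffield.");
        Span place = doc.annotate("Location", 22, 31);
        place.features.put("kind", "city");
        System.out.println(place.type + " -> " + doc.textOf(place));
    }
}
```

Because the markup lives beside the text rather than inside it, several overlapping annotation layers can describe the same characters without interfering with each other.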
Processing Resources
• ANNIE (a Nearly-New Information Extraction System)
  – Document Reset
  – Tokeniser
  – Gazetteer
  – Sentence Splitter
  – RegEx Sentence Splitter
  – Part of Speech Tagger
  – Semantic Tagger
  – Orthographic Coreference (OrthoMatcher)
  – Pronominal Coreference
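The ANNIE components listed above run one after another, each building on what the earlier ones produced. A minimal sketch of that serial-pipeline idea in plain Java follows; the `Step` interface and the two toy steps are hypothetical and far simpler than the real components, and none of this is the GATE API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT the GATE API: a serial pipeline in which each step
// can read the annotations produced by the steps that ran before it.
final class PipelineSketch {

    interface Step {
        void run(String text, List<String> annotations);
    }

    /** Toy tokeniser: records one "Token:" entry per whitespace-separated word. */
    static final class WhitespaceTokeniser implements Step {
        public void run(String text, List<String> annotations) {
            for (String word : text.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    annotations.add("Token:" + word);
                }
            }
        }
    }

    /** Toy tagger: only finds something if the tokeniser has already run. */
    static final class CapitalisedTagger implements Step {
        private static final String PREFIX = "Token:";

        public void run(String text, List<String> annotations) {
            List<String> found = new ArrayList<String>();
            for (String a : annotations) {
                if (a.startsWith(PREFIX) && a.length() > PREFIX.length()
                        && Character.isUpperCase(a.charAt(PREFIX.length()))) {
                    found.add("Capitalised:" + a.substring(PREFIX.length()));
                }
            }
            annotations.addAll(found);
        }
    }

    /** Runs the steps strictly in list order over one text. */
    static List<String> execute(List<Step> steps, String text) {
        List<String> annotations = new ArrayList<String>();
        for (Step step : steps) {
            step.run(text, annotations);
        }
        return annotations;
    }

    public static void main(String[] args) {
        List<Step> steps = Arrays.<Step>asList(new WhitespaceTokeniser(), new CapitalisedTagger());
        System.out.println(execute(steps, "Sheffield builds GATE"));
    }
}
```

Swapping the two steps leaves the tagger with nothing to read, which is why the order of components in a pipeline matters.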
Processing Resources
• JAPE (Java Annotation Pattern Engine):
  – Regular expressions over annotations
  – Finite state transduction over annotations based on regular expressions
  – Not against strings but against annotation graphs
  – Non-deterministic
• ANNIC: ANNotations-In-Context
  – full-featured annotation indexing and retrieval system
  – Searchable Serial DataStore
  – Based on Lucene
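To make "regular expressions over annotations" concrete, here is a rough sketch in plain Java. It is not JAPE syntax and not how JAPE is implemented: it assumes a flat, non-overlapping sequence of annotation types, maps each type to one letter, and runs `java.util.regex` over that encoding, so the pattern matches annotation types rather than characters. The type names (`Title`, `FirstName`, `Surname`) and the rule are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, NOT JAPE: a regex applied to a sequence of annotation
// types instead of to the characters of the text.
final class AnnotationPatternSketch {

    /** Maps each annotation type to one letter so a regex can run over the sequence. */
    static char symbolFor(String type) {
        if (type.equals("Title")) return 'T';
        if (type.equals("FirstName")) return 'F';
        if (type.equals("Surname")) return 'S';
        return 'o'; // any other annotation type
    }

    /**
     * Finds every run matching the rule "optional Title, then FirstName, then Surname".
     * Each result is {startIndex, endIndexExclusive} into the annotation list.
     */
    static List<int[]> findPersons(List<String> types) {
        StringBuilder encoded = new StringBuilder();
        for (String type : types) {
            encoded.append(symbolFor(type));
        }
        Matcher matcher = Pattern.compile("T?FS").matcher(encoded);
        List<int[]> matches = new ArrayList<int[]>();
        while (matcher.find()) {
            matches.add(new int[] { matcher.start(), matcher.end() });
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList(
            "Token", "Title", "FirstName", "Surname", "Token", "FirstName", "Surname");
        for (int[] m : findPersons(types)) {
            System.out.println("Person spans annotations " + m[0] + " to " + (m[1] - 1));
        }
    }
}
```

The sketch ignores annotation features and overlapping annotations, both of which matter when matching against a real annotation graph.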
Processing Resources
• The Annotation Diff Tool
  – enables two sets of annotations in one or two documents to be compared
  – figures are generated for precision, recall, F-measure
• Corpus Benchmark Tool
  – Apply evaluation across an entire corpus
• Balanced Distance Metric (BDM) Ontology Tool
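The precision, recall, and F-measure figures mentioned for the Annotation Diff Tool can be worked through with a small plain-Java sketch. It assumes strict matching only: an annotation counts as correct when its type and offsets are identical in both sets, encoded here as a hypothetical `"type:start-end"` string. Partial overlaps are not modelled, and this is not the tool's own code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, NOT the GATE Annotation Diff implementation:
// strict-match precision, recall and F1 between a key set and a response set.
final class AnnotationDiffSketch {

    /** Returns {precision, recall, f1}; annotations are compared as "type:start-end" keys. */
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<String>(response);
        correct.retainAll(key); // annotations present in both sets

        double precision = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f1 = (precision + recall == 0.0)
            ? 0.0
            : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<String>(Arrays.asList(
            "Person:0-5", "Location:22-31", "Date:40-50", "Person:60-70"));
        Set<String> response = new HashSet<String>(Arrays.asList(
            "Person:0-5", "Location:22-31", "Person:80-90"));
        double[] s = score(key, response);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", s[0], s[1], s[2]);
    }
}
```

With two of three response annotations correct against four key annotations, precision is 2/3, recall is 1/2, and F1 is 4/7.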
Processing Resources (Plug-ins)
• OntoGazetteer
• HashGazetteer
• Gazetteer List Collector
• Large KB Gazetteer
• Ontology-Aware JAPE Transducer
• Batch Learning PR (LibSVM, PAUM algorithm, Weka interface)
• Machine Learning PR (Maxent, Weka and SVM Light)
Resources on the Web site gate.ac.uk
• User Guide
• Movie Tutorials
• Developer’s Guide/API docs
• NLP Application Programmer’s Guide
• Research Papers
• GATE project descriptions
• Demos
• Plug-in Info
• Commercial/Academic partnerships
• Etc…
IDE Demo
What is GATE Embedded?
• Everything in GATE IDE without the GUI
• A Java framework for many different types of NLP solutions
• A complex assortment of core functionality and plug-ins
• Extensible and Composable
  – GATE can be included as a component in other Java Frameworks and vice versa
Example Application with a GATE Embedded Component
Running GATE (“Hello World”)
import gate.*;
import gate.creole.*;
import java.io.File;

public class Main {
    public static void main(String[] args) throws Exception {
        // Replace the two placeholders with the local GATE and plug-in directories.
        Gate.setGateHome(new File("<Path to GATE>"));
        Gate.setPluginsHome(new File("<Path to Plugins>"));
        Gate.init(); // start GATE
    }
}
Registering Directories
// Register each plug-in directory before creating resources from it.
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "ANNIE").toURL());
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Information_Retrieval").toURL());
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Stemmer_Snowball").toURL());
Creating Processing Resources
// Controller that runs its processing resources in order over a corpus.
SerialAnalyserController annieController =
    (SerialAnalyserController) Factory.createResource(
        "gate.creole.SerialAnalyserController",
        Factory.newFeatureMap(), Factory.newFeatureMap(), "ANNIE");

// Empty parameter map: each resource below is created with its defaults.
FeatureMap params = Factory.newFeatureMap();
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.annotdelete.AnnotationDeletePR", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.tokeniser.DefaultTokeniser", params));
annieController.add((ProcessingResource) Factory.createResource(
    "stemmer.SnowballStemmer", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.gazetteer.DefaultGazetteer", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.splitter.RegexSentenceSplitter", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.POSTagger", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.ANNIETransducer", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.orthomatcher.OrthoMatcher", params));

// The pronominal coreferencer gets its own parameters.
FeatureMap coRefParams = Factory.newFeatureMap();
coRefParams.put("resolveIt", "true");
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.coref.Coreferencer", coRefParams));
Creating Language Resources
Corpus corpus = Factory.newCorpus("DUC Queries");

@SuppressWarnings("static-access")
File topicsFile = new File(ConfigMgr.getTopicFilePath() + "topics.xml");
gate.Document topicDoc = Factory.newDocument(topicsFile.toURL());
corpus.add(topicDoc);

// Run the pipeline over every document in the corpus.
annieController.setCorpus(corpus);
annieController.execute();
Iteration and Cleanup
AnnotationSet defaultAnnotations = topicDoc.getAnnotations();
AnnotationSet originalMarkup = topicDoc.getAnnotations("Original markups");
AnnotationSet topicAnnotationSet = originalMarkup.get("TOPIC");

for (Annotation topicAnnotation : topicAnnotationSet) {
    ArrayList<Query> topicQueryArrayList = new ArrayList<Query>();
    if (ConfigMgr.isQueryBreakdown()) {
        topicQueryArrayList = Utilities.buildTopicMultiQuery(
            topicAnnotation, originalMarkup, defaultAnnotations, config);
    } else {
        topicQueryArrayList = Utilities.buildTopicQuery(
            topicAnnotation, originalMarkup, defaultAnnotations, config);
    }
    String topicKey = topicQueryArrayList.get(0).getDucTopicName();
    globalQueryHash.put(topicKey, topicQueryArrayList);
}

// Release the language resources once processing is finished.
topicDoc.cleanup();
Factory.deleteResource(topicDoc);
corpus.cleanup();
Factory.deleteResource(corpus);
Iterating through Annotations
public static AnnotationSet getChildAnnotationSet(
        String childAnnotationSetName,
        Annotation annotation,
        AnnotationSet parentAnnotationSet) throws NullPointerException {
    AnnotationSet childAnnotationSet = null;
    // traverse nested Annotation Set for named annotation using parent offsets to delimit range
    try {
        childAnnotationSet = parentAnnotationSet.get(
            childAnnotationSetName,
            annotation.getStartNode().getOffset(),
            annotation.getEndNode().getOffset());
        if (childAnnotationSet == null) {
            throw new NullPointerException();
        }
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
    return childAnnotationSet;
}
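The idea behind the helper above, selecting the annotations of one type that fall within a parent annotation's offsets, can be sketched without GATE. The `Span` type and `childrenOf` method below are hypothetical, and the containment rule (the child lies entirely inside the parent) is an assumption of this sketch rather than a statement about how `AnnotationSet.get` selects by range. Returning an empty list instead of `null` is one possible way to spare callers a null check.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT the GATE API: pick child annotations by type
// using the parent's character offsets to delimit the search range.
final class ChildSelectionSketch {

    static final class Span {
        final String type;
        final long start;
        final long end;

        Span(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    /** Spans of childType lying entirely inside the parent's offsets; never null. */
    static List<Span> childrenOf(Span parent, String childType, List<Span> all) {
        List<Span> children = new ArrayList<Span>();
        for (Span candidate : all) {
            boolean inside = candidate.start >= parent.start && candidate.end <= parent.end;
            if (candidate != parent && candidate.type.equals(childType) && inside) {
                children.add(candidate);
            }
        }
        return children;
    }

    public static void main(String[] args) {
        Span topic = new Span("TOPIC", 0, 50);
        List<Span> all = Arrays.asList(
            topic,
            new Span("title", 5, 20),
            new Span("narr", 25, 40),
            new Span("title", 60, 70)); // belongs to a later topic
        System.out.println(childrenOf(topic, "title", all).size() + " title inside the topic");
    }
}
```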
Example Script for Compiling on Patas
#!/bin/bash
javac -classpath .:/NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/activation.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-contrib-1.0b2.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-launcher.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/antlr.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-nodeps.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-trax.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/Bib2HTML.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-discovery-0.2.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-fileupload-1.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-logging.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/concurrent.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gateHmm.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/geronimo-ws-metadata_2.0_spec-1.1.1.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/GnuGetOpt.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/icu4j.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jakarta-oro-2.0.5.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/javacc.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxb-api-2.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxen-1.1.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jaxws-api-2.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jwnl.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lubm.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/mail.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/pg73jdbc3.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/poi-2.5.1-final-20040804.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/spring-beans-2.0.8.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/spring-core-2.0.8.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/stax-api-1.0.1.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/tm-extractors-0.4.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xml-apis.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xmlunit-1.2.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xpp3-1.1.3.3_min.jar:/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xstream-1.2.jar:edu.mit.jwi_2.1.5.jar ling573extractive/*.java
GATE Condor Script
universe = java
executable = ling573extractive/Main.class
arguments = ling573extractive.Main
output = ling573extractive.output
error = ling573extractive.error
jar_files = /NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar,edu.mit.jwi_2.1.5.jar
java_vm_args = -Xmn100M -Xms500M -Xmx500M
+RequiresWholeMachine = True
Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )
queue
Discussion