GATE Overview and Demo - University of Washington


GATE Overview and Demo

University of Washington CLMA Treehouse Presentation

October 8, 2010

Prescott Klassen

Overview


Summary of GATE information and documentation found at gate.ac.uk

GATE Developer features, components, and plug-ins

IDE Demo

Embedded GATE

Using GATE with Condor on Patas

GATE code samples

Background


Sheffield Natural Language Processing Group at the University of Sheffield

Released 1996

Re-written and re-released 2002

Latest Release GATE 5.2.1 (May 6, 2010)

Windows, Linux, Solaris, and Mac OS

Beta Release GATE 6.0 (Beta 1, August 21, 2010)

100% Java Reference Implementation

Compatible with IBM Unstructured Information Management Architecture (UIMA)

Open Source (GNU Library General Public License)

XML Corpus Encoding Standard (XCES) format, used by the American National Corpus

What is GATE?


An architecture describing how language processing systems are made up of components.

A framework (class library) written in Java and tested on Linux, Windows and Solaris.

A graphical development environment built on the framework (IDE for NLP)

GATE Products


GATE Developer
IDE for language processing components bundled with ANNIE (A Nearly-New Information Extraction system) and plug-ins

GATE Teamware
Web app for collaborative semantic annotation projects incorporating a workflow engine and a backend service infrastructure

GATE Embedded
Object library optimized for inclusion in applications

GATE Services
Hosted services for cloud application development

GATE Wiki
Wiki/CMS

GATE Cloud
Cloud computing solution for hosted large-scale text processing


GATE Components


Language Resources (LRs): documents, corpora and ontologies

Processing Resources (PRs): parsers, stemmers, co-reference resolvers, ML components, etc.

Visual Resources (VRs): IDE components that provide a visual interface (GUI) to GATE components and plug-ins


Language Resources


Documents, corpora, and ontologies

Can persist in Java Serial Store or Lucene Serial Data Store

Document = content + annotations + features

“Stand-off” Markup

Annotations as Directed Acyclic Graphs (start Node, end Node, ID, type, Feature Map); Nodes are pointers into the source document (character offsets)

Input Formats: Plain Text, HTML, SGML, XML, RTF, Email, PDF, Microsoft Word

Ontology support (Sesame2, OWLIM3)
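
The stand-off model on this slide can be illustrated without GATE itself. The following is a minimal sketch in plain Java, not GATE's API: the names `StandoffDemo`, `Ann`, and `covered` are hypothetical, and offsets are plain long values rather than GATE Node objects. It shows that the document content is never modified; each annotation only records an ID, a type, a feature map, and start/end character offsets into the content.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy stand-off annotation model, not GATE's classes.
class StandoffDemo {

    // An annotation points into the text by character offsets.
    static class Ann {
        final int id;
        final String type;
        final long start; // inclusive character offset
        final long end;   // exclusive character offset
        final Map<String, Object> features = new HashMap<>();

        Ann(int id, String type, long start, long end) {
            this.id = id;
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Recover the text an annotation covers; the content itself is untouched.
    static String covered(String content, Ann a) {
        return content.substring((int) a.start, (int) a.end);
    }

    public static void main(String[] args) {
        String content = "GATE was released in 1996.";

        List<Ann> annotations = new ArrayList<>();
        Ann token = new Ann(0, "Token", 0, 4);
        token.features.put("kind", "word");
        annotations.add(token);
        annotations.add(new Ann(1, "Date", 21, 25));

        for (Ann a : annotations) {
            System.out.println(a.type + " -> " + covered(content, a));
        }
    }
}
```

Because annotations live beside the text rather than inside it, several overlapping layers (tokens, sentences, original markup) can describe the same span without conflicting.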


Processing Resources


ANNIE (A Nearly-New Information Extraction System)

Document Reset

Tokeniser

Gazetteer

Sentence Splitter

RegEx Sentence Splitter

Part of Speech Tagger

Semantic Tagger

Orthographic Coreference (OrthoMatcher)

Pronominal Coreference


Processing Resources


JAPE (Java Annotation Pattern Engine):

Regular expressions over annotations

Finite state transduction over annotations based on regular expressions

Not against strings but against annotation graphs

Non-deterministic

ANNIC: ANNotations-In-Context

Full-featured annotation indexing and retrieval system

Searchable Serial DataStore

Based on Lucene
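
The idea of matching patterns over annotations rather than over characters can be sketched in plain Java. This is not JAPE and not how JAPE is implemented: the class `AnnotationPatternDemo`, the method `findPersons`, and the annotation types `Title` and `Name` are assumptions made for illustration. Each annotation type is mapped to one symbol, and an ordinary regular expression is then run over the symbol sequence, so a match corresponds to a run of annotations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regular expression applied to a sequence of
// annotation types instead of to the characters of the text.
class AnnotationPatternDemo {

    // Returns {firstIndex, endIndexExclusive} pairs for runs matching: Title Name+
    static List<int[]> findPersons(List<String> types) {
        StringBuilder symbols = new StringBuilder();
        for (String t : types) {
            if (t.equals("Title")) {
                symbols.append('T');
            } else if (t.equals("Name")) {
                symbols.append('N');
            } else {
                symbols.append('O'); // any other annotation type
            }
        }
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile("TN+").matcher(symbols);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("Token", "Title", "Name", "Name", "Token");
        for (int[] s : findPersons(types)) {
            System.out.println("Person spans annotations " + s[0] + " to " + (s[1] - 1));
        }
    }
}
```

In a real rule the right-hand side would create a new annotation over the matched span; here the matched index range stands in for that step.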



Processing Resources


The Annotation Diff Tool

Enables two sets of annotations in one or two documents to be compared

Figures are generated for precision, recall, and F-measure

Corpus Benchmark Tool

Applies evaluation across an entire corpus

Balanced Distance Metric (BDM) Ontology Tool
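
The figures reported by an annotation comparison can be reproduced by hand. The sketch below is plain Java and makes simplifying assumptions: annotations are encoded as `type:start:end` strings, only exact matches count as correct, and the names `AnnotationDiffDemo` and `score` are hypothetical. It computes precision (correct / response), recall (correct / key), and the balanced F-measure 2PR / (P + R).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: strict-match precision, recall and F1 for two annotation sets.
class AnnotationDiffDemo {

    // Returns {precision, recall, f1}; annotations are "type:start:end" strings.
    static double[] score(Set<String> key, Set<String> response) {
        Set<String> correct = new HashSet<>(response);
        correct.retainAll(key); // annotations present in both sets

        double p = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double r = key.isEmpty() ? 0.0 : (double) correct.size() / key.size();
        double f = (p + r == 0.0) ? 0.0 : 2 * p * r / (p + r);
        return new double[] { p, r, f };
    }

    public static void main(String[] args) {
        Set<String> key = new HashSet<>(
            Arrays.asList("Person:0:5", "Date:21:25", "Organization:30:40"));
        Set<String> response = new HashSet<>(
            Arrays.asList("Person:0:5", "Date:21:24"));

        double[] s = score(key, response);
        System.out.printf("P=%.2f R=%.2f F1=%.2f%n", s[0], s[1], s[2]);
    }
}
```

With one exact match out of two response annotations and three key annotations, precision is 1/2 and recall is 1/3. A fuller comparison would also need a policy for partially overlapping annotations, which this sketch treats as misses.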


Processing Resources (Plug-ins)

OntoGazetteer

HashGazetteer

Gazetteer List Collector

Large KB Gazetteer

Ontology-Aware JAPE Transducer

Batch Learning PR (LibSVM, PAUM algorithm, Weka interface)

Machine Learning PR (Maxent, Weka and SVM Light)

Resources on the Web site gate.ac.uk


User Guide


Movie Tutorials


Developer’s Guide/API docs


NLP Application Programmer’s Guide


Research Papers


GATE project descriptions


Demos


Plug-in Info

Commercial/Academic partnerships


Etc…


IDE Demo

What is GATE Embedded?


Everything in GATE IDE without the GUI

A Java framework for many different types of NLP solutions

A complex assortment of core functionality and plug-ins

Extensible and Composable

GATE can be included as a component in other Java Frameworks and vice-versa

Example Application with a GATE Embedded Component

Running GATE (“Hello World”)

import java.io.File;

import gate.*;
import gate.creole.*;

public class Main {

    public static void main(String[] args) throws Exception {
        // Replace the placeholder strings with the local installation paths.
        Gate.setGateHome(new File("<Path to GATE>"));
        Gate.setPluginsHome(new File("<Path to Plugins>"));

        Gate.init(); // start GATE
    }
}


Registering Directories


// Register each plug-in directory whose resources are used later.
// File.toURL() is deprecated, so the File is converted via toURI().
Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Information_Retrieval").toURI().toURL());

Gate.getCreoleRegister().registerDirectories(
    new File(Gate.getPluginsHome(), "Stemmer_Snowball").toURI().toURL());
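
The registration calls take a URL, so each plug-in directory has to be converted from a File. The sketch below, in plain Java with no GATE dependency, shows only that conversion. `File.toURL()` has been deprecated since Java 6 because it does not escape characters such as spaces; `toURI().toURL()` is the usual replacement. The class name `PluginUrlDemo`, the method `pluginUrl`, and the path are made up for illustration.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;

// Illustrative only: converting a plug-in directory to a URL.
class PluginUrlDemo {

    // Build the URL for a named plug-in below a plug-ins home directory.
    static URL pluginUrl(File pluginsHome, String pluginName) throws MalformedURLException {
        // toURI() escapes characters that are illegal in URLs (for example spaces).
        return new File(pluginsHome, pluginName).toURI().toURL();
    }

    public static void main(String[] args) throws MalformedURLException {
        // Hypothetical location containing a space.
        File pluginsHome = new File("/opt/gate home/plugins");
        System.out.println(pluginUrl(pluginsHome, "ANNIE"));
    }
}
```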


Creating Processing Resources


// Create an empty serial pipeline named "ANNIE".
SerialAnalyserController annieController =
    (SerialAnalyserController) Factory.createResource(
        "gate.creole.SerialAnalyserController",
        Factory.newFeatureMap(),
        Factory.newFeatureMap(), "ANNIE");

// Empty parameter map: each PR below is created with its defaults.
FeatureMap params = Factory.newFeatureMap();

// Add the processing resources in the order they should run.
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.annotdelete.AnnotationDeletePR", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.tokeniser.DefaultTokeniser", params));
annieController.add((ProcessingResource) Factory.createResource(
    "stemmer.SnowballStemmer", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.gazetteer.DefaultGazetteer", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.splitter.RegexSentenceSplitter", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.POSTagger", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.ANNIETransducer", params));
annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.orthomatcher.OrthoMatcher", params));

// The pronominal coreferencer gets its own parameter map.
FeatureMap coRefParams = Factory.newFeatureMap();
coRefParams.put("resolveIt", "true");

annieController.add((ProcessingResource) Factory.createResource(
    "gate.creole.coref.Coreferencer", coRefParams));
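
The controller runs its processing resources in the order they were added, once for each document in the corpus. That control flow can be sketched in plain Java without GATE. Everything here is hypothetical: `SerialPipelineDemo`, `Step`, and `Pipeline` are illustration names, and a "document" is reduced to a list that records which steps touched it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the control flow of a serial pipeline over a corpus.
class SerialPipelineDemo {

    // One processing step; it receives the document it should work on.
    interface Step {
        void execute(List<String> document);
    }

    static class Pipeline {
        private final List<Step> steps = new ArrayList<>();

        void add(Step step) {
            steps.add(step);
        }

        // Run every step, in insertion order, on every document.
        void execute(List<List<String>> corpus) {
            for (List<String> document : corpus) {
                for (Step step : steps) {
                    step.execute(document);
                }
            }
        }
    }

    // Build a three-step pipeline whose steps only record their names.
    static Pipeline build() {
        Pipeline pipeline = new Pipeline();
        pipeline.add(doc -> doc.add("reset"));
        pipeline.add(doc -> doc.add("tokenise"));
        pipeline.add(doc -> doc.add("tag"));
        return pipeline;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = new ArrayList<>();
        corpus.add(new ArrayList<>());
        corpus.add(new ArrayList<>());

        build().execute(corpus);
        System.out.println(corpus);
    }
}
```

Order matters in the real pipeline for the same reason it does here: later steps, such as a part-of-speech tagger, rely on the annotations that earlier steps, such as the tokeniser, have already produced.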


Creating Language Resources


Corpus corpus = Factory.newCorpus("DUC Queries");

// ConfigMgr is not a GATE class; it appears to be a helper from the example project.
@SuppressWarnings("static-access")
File topicsFile = new File(ConfigMgr.getTopicFilePath() + "topics.xml");
gate.Document topicDoc = Factory.newDocument(topicsFile.toURI().toURL());

corpus.add(topicDoc);
annieController.setCorpus(corpus);

// Run every processing resource over the corpus.
annieController.execute();


Iteration and Cleanup


AnnotationSet defaultAnnotations = topicDoc.getAnnotations();
AnnotationSet originalMarkup = topicDoc.getAnnotations("Original markups");
AnnotationSet topicAnnotationSet = originalMarkup.get("TOPIC");

for (Annotation topicAnnotation : topicAnnotationSet) {

    ArrayList<Query> topicQueryArrayList = new ArrayList<Query>();

    if (ConfigMgr.isQueryBreakdown()) {
        topicQueryArrayList = Utilities.buildTopicMultiQuery(topicAnnotation,
            originalMarkup, defaultAnnotations, config);
    } else {
        topicQueryArrayList = Utilities.buildTopicQuery(topicAnnotation,
            originalMarkup, defaultAnnotations, config);
    }

    String topicKey = topicQueryArrayList.get(0).getDucTopicName();
    globalQueryHash.put(topicKey, topicQueryArrayList);
}

topicDoc.cleanup();
Factory.deleteResource(topicDoc);
corpus.cleanup();
Factory.deleteResource(corpus);



Iterating through Annotations


public static AnnotationSet getChildAnnotationSet(
        String childAnnotationSetName,
        Annotation annotation,
        AnnotationSet parentAnnotationSet)
        throws NullPointerException {

    AnnotationSet childAnnotationSet = null;

    // traverse nested Annotation Set for named annotation using parent offsets to delimit range
    try {
        childAnnotationSet = parentAnnotationSet.get(childAnnotationSetName,
            annotation.getStartNode().getOffset(),
            annotation.getEndNode().getOffset());

        if (childAnnotationSet == null) {
            // Caught by the handler below, so the caller receives null.
            throw new NullPointerException(
                "No " + childAnnotationSetName + " annotations found in range");
        }
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }

    return childAnnotationSet;
}
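
Selecting child annotations by the parent's offsets raises the question of what counts as "inside" the range. The sketch below, in plain Java with hypothetical names (`RangeSelectDemo`, `Span`, `overlapping`, `contained`), shows the two common interpretations: annotations that overlap the range at all, and annotations that lie wholly within it. Which of these a given GATE `AnnotationSet` method implements should be checked in the API documentation for the version in use; this sketch makes no claim about that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: two ways to select annotations by an offset range.
class RangeSelectDemo {

    static class Span {
        final String type;
        final long start; // inclusive
        final long end;   // exclusive

        Span(String type, long start, long end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Spans of the given type that share at least one character with [start, end).
    static List<Span> overlapping(List<Span> all, String type, long start, long end) {
        List<Span> out = new ArrayList<>();
        for (Span s : all) {
            if (s.type.equals(type) && s.start < end && s.end > start) {
                out.add(s);
            }
        }
        return out;
    }

    // Spans of the given type that lie entirely inside [start, end).
    static List<Span> contained(List<Span> all, String type, long start, long end) {
        List<Span> out = new ArrayList<>();
        for (Span s : all) {
            if (s.type.equals(type) && s.start >= start && s.end <= end) {
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Span> all = Arrays.asList(
            new Span("Sentence", 0, 10),
            new Span("Sentence", 6, 9),
            new Span("Sentence", 8, 20),
            new Span("Sentence", 25, 30),
            new Span("Token", 6, 9));

        // Parent annotation covering offsets 5 to 15.
        System.out.println("overlapping: " + overlapping(all, "Sentence", 5, 15).size());
        System.out.println("contained:   " + contained(all, "Sentence", 5, 15).size());
    }
}
```

For nested markup such as sentences inside a topic element, the contained interpretation is usually the one wanted, since it excludes annotations that merely straddle the parent's boundary.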

Example Script for Compiling on Patas

#!/bin/bash

# GATE 5.1 installation on Patas
GATE=/NLP_TOOLS/tool_sets/gate/gate-5.1

# Classpath: current directory, gate.jar, then the jars under lib/
CP=".:$GATE/bin/gate.jar"
for jar in \
    activation.jar ant-contrib-1.0b2.jar ant.jar ant-junit.jar \
    ant-launcher.jar jdom.jar antlr.jar ant-nodeps.jar ant-trax.jar \
    Bib2HTML.jar commons-discovery-0.2.jar commons-fileupload-1.0.jar \
    commons-lang-2.4.jar commons-logging.jar concurrent.jar gate-asm.jar \
    gate-compiler-jdt.jar gateHmm.jar geronimo-ws-metadata_2.0_spec-1.1.1.jar \
    GnuGetOpt.jar icu4j.jar jakarta-oro-2.0.5.jar javacc.jar jaxb-api-2.0.jar \
    jaxen-1.1.jar jaxws-api-2.0.jar junit.jar jwnl.jar log4j-1.2.14.jar \
    lubm.jar lucene-core-2.2.0.jar mail.jar nekohtml-1.9.8+2039483.jar \
    ontotext.jar orajdbc3.jar PDFBox-0.7.2.jar pg73jdbc3.jar \
    poi-2.5.1-final-20040804.jar spring-beans-2.0.8.jar spring-core-2.0.8.jar \
    stax-api-1.0.1.jar tm-extractors-0.4.jar wstx-lgpl-3.2.3.jar \
    xercesImpl.jar xml-apis.jar xmlunit-1.2.jar xpp3-1.1.3.3_min.jar \
    xstream-1.2.jar
do
    CP="$CP:$GATE/lib/$jar"
done
CP="$CP:edu.mit.jwi_2.1.5.jar"

javac -classpath "$CP" ling573extractive/*.java



GATE Condor Script

universe = java

executable = ling573extractive/Main.class

arguments = ling573extractive.Main

output = ling573extractive.output

error = ling573extractive.error

jar_files = /NLP_TOOLS/tool_sets/gate/gate-5.1/bin/gate.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ant-junit.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/jdom.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/commons-lang-2.4.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-asm.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/gate-compiler-jdt.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/log4j-1.2.14.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/lucene-core-2.2.0.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/nekohtml-1.9.8+2039483.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/ontotext.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/PDFBox-0.7.2.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/orajdbc3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/wstx-lgpl-3.2.3.jar,/NLP_TOOLS/tool_sets/gate/gate-5.1/lib/xercesImpl.jar,edu.mit.jwi_2.1.5.jar

java_vm_args = -Xmn100M -Xms500M -Xmx500M

+RequiresWholeMachine = True

Requirements = ( Memory > 0 && TotalMemory >= (7*1024) )

queue


Discussion