NLP Tools

Oct 24, 2013 (3 years and 9 months ago)

By:
Asef Pourmasoumi
Hossein Kamyar

Supervisor: Dr. Kahani

NLP Tasks

Sentence splitter & Tokenizer
Stemming
Discourse analysis
Coreference Resolution
Named entity recognition (NER)
Natural language generation
Natural language understanding
Part of speech tagging (POS)
Optical character recognition (OCR)
Semantic role labeling (SRL)
Parsing & Chunker
Relationship extraction
Question answering
Text Summarization
Summarization Evaluation
Machine Translation
Sentiment analysis
Speech recognition
Speech segmentation
Topic segmentation
Word sense disambiguation
Text simplification
Text-to-speech
Query expansion
Recognizing textual entailment (RTE)
Text to image
Clustering & Classification & IR
And …

Sentence splitter & Tokenizer

GATE

University of Illinois
Sentence Segmentation tool
Download link: http://cogcomp.cs.illinois.edu/page/tools_view/2

Stanford University
Stanford CoreNLP, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, and the coreference resolution system.
Download link: http://nlp.stanford.edu/software/corenlp.shtml

MontyTagger
Link: http://web.media.mit.edu/~hugo/montylingua/

LingPipe

OpenNLP
Link: http://incubator.apache.org/opennlp/index.html

Natural Language Toolkit
Open source Python modules; Windows, Mac OS X and Linux.
Link: http://www.nltk.org/download

Also called sentence breaking or sentence boundary disambiguation.
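
To make the task concrete, here is a minimal sketch of sentence splitting and tokenization using only the Python standard library. It is not how GATE, CoreNLP, or NLTK work internally; the abbreviation list and the regular expressions are assumptions chosen for illustration, and real tools handle far more cases.

```python
import re

# Abbreviations that should not end a sentence (a tiny, hypothetical list).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace and a capital letter."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # boundary disambiguation: the period belongs to an abbreviation
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

def tokenize(sentence):
    """Return words (with inner hyphens/apostrophes) and punctuation as tokens."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = "Dr. Kahani supervised the project. It surveys NLP tools! Is it useful?"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The abbreviation check is the "disambiguation" part: a period after "Dr." is not treated as a sentence boundary.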

Stemming

Oleander Porter's algorithm - stemming library in C++ released under BSD
Lovins stemming algorithm - with source code in a couple of languages
Porter stemming algorithm - including source code in several languages
Lancaster stemming algorithm - Lancaster University, UK
UEA-Lite Stemmer - University of East Anglia, UK
Themis - open source IR framework, includes Porter stemmer implementation (PostgreSQL, Java API)
Snowball - free stemming algorithms for many languages, includes source code, including stemmers for five Romance languages
PTStemmer - a Java/Python/.NET stemming toolkit for the Portuguese language
jsSnowball - open source JavaScript implementation of Snowball stemming algorithms for many languages
hindi_stemmer - open source stemmer for Hindi
czech_stemmer - open source stemmer for Czech
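
As a rough illustration of what a stemmer does, the sketch below strips a single suffix from a word. It is deliberately not the Porter, Lovins, or Lancaster algorithm: the suffix list and the minimum stem length are assumptions made up for this example.

```python
# Suffixes tried in order, longest first; a small hypothetical rule set,
# far cruder than the Porter, Lovins, or Lancaster rule sets.
SUFFIXES = ["ization", "ational", "fulness", "ing", "ies", "ed", "es", "ly", "s"]

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["stemming", "stemmed", "stems", "libraries", "sing"]:
    print(w, "->", naive_stem(w))
```

Note that "stemming" and "stems" end up with different stems here ("stemm" and "stem"). The published algorithms listed above add further rules, such as undoubling consonants, to avoid this kind of mismatch.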

Coreference Resolution

Illinois has an online and downloadable CR system.

Stanford University
Integrated in the Stanford suite of NLP tools, StanfordCoreNLP.
Download link: http://nlp.stanford.edu/software/corenlp.shtml

LingPipe

OpenNLP
Link: http://incubator.apache.org/opennlp/index.html

Natural Language Toolkit
Download link: http://www.nltk.org/download

BART (Beautiful Anaphora Resolution Toolkit)
Download link: http://www.bart-coref.org/

GuiTAR (A General Tool for Anaphora Resolution)
Download link: http://cswww.essex.ac.uk/Research/nle/GuiTAR/

CR determines which words ("mentions") refer to the same objects ("entities").


Named entity recognition

Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization).

Illinois

Stanford Natural Language Processing Group
Link: http://nlp.stanford.edu/software/CRF-NER.shtml
Downloadable (written in Java); English and German.

LingPipe

OpenNLP
Link: http://incubator.apache.org/opennlp/index.html

Natural Language Toolkit
Link: http://www.nltk.org/download
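
The input and output of the task can be sketched with a toy gazetteer lookup. This is only an illustration: the tools above use trained statistical models (the Stanford one is CRF-based, as its URL suggests), and the gazetteer entries and type labels below are invented for the example.

```python
# A tiny hand-made gazetteer; entries and types are illustrative only.
GAZETTEER = {
    ("stanford", "university"): "ORGANIZATION",
    ("university", "of", "illinois"): "ORGANIZATION",
    ("lancaster",): "LOCATION",
    ("noam", "chomsky"): "PERSON",
}
MAX_SPAN = max(len(key) for key in GAZETTEER)

def tag_entities(tokens):
    """Greedy longest-match lookup of token spans in the gazetteer."""
    entities, i = [], 0
    while i < len(tokens):
        for length in range(min(MAX_SPAN, len(tokens) - i), 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + length])
            if span in GAZETTEER:
                entities.append((" ".join(tokens[i:i + length]), GAZETTEER[span]))
                i += length
                break
        else:
            i += 1  # no entity starts at this token
    return entities

tokens = "Noam Chomsky spoke at the University of Illinois".split()
print(tag_entities(tokens))
```

A lookup like this cannot recognize names it has never seen, which is the main reason practical systems learn from annotated text instead.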

Part of speech tagging

Illinois

Stanford Natural Language Processing Group
Link: http://nlp.stanford.edu/software/tagger.shtml
Downloadable (written in Java); English, Arabic, Chinese.

LingPipe

OpenNLP
Link: http://incubator.apache.org/opennlp/index.html

MontyTagger
Link: http://web.media.mit.edu/~hugo/montylingua/

Natural Language Toolkit
Open source Python modules; Windows, Mac OS X and Linux.
Link: http://www.nltk.org/download

GATE

And many others at http://nlp.stanford.edu/links/statnlp.htm

Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or verb ("to book a flight").
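
The "book" example can be sketched with a toy tagger that resolves ambiguity from the previous tag. The lexicon, tag names, and the two context rules are assumptions for this illustration; the taggers listed above learn such preferences from annotated corpora.

```python
# Possible tags per word; a toy lexicon, not a real tagset inventory.
LEXICON = {
    "the": ["DET"], "a": ["DET"], "to": ["TO"], "on": ["PREP"],
    "book": ["NOUN", "VERB"], "table": ["NOUN"], "flight": ["NOUN"],
}

def tag(tokens):
    """Pick a tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for token in tokens:
        options = LEXICON.get(token.lower(), ["NOUN"])  # unknown words default to NOUN
        previous = tags[-1] if tags else None
        if len(options) == 1:
            choice = options[0]
        elif previous == "TO" and "VERB" in options:
            choice = "VERB"  # "to book" -> infinitive verb
        elif previous == "DET" and "NOUN" in options:
            choice = "NOUN"  # "the book" -> noun
        else:
            choice = options[0]
        tags.append(choice)
    return list(zip(tokens, tags))

print(tag("the book on the table".split()))
print(tag("to book a flight".split()))
```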

Semantic role labeling

Illinois has an online and downloadable SRL system.

MontyTagger
Link: http://web.media.mit.edu/~hugo/montylingua/

ASSERT (Automatic Statistical SEmantic Role Tagger)
Link: http://cemantix.org/assert.html
Downloadable; OS: Red Hat Linux
Designed and implemented by Sameer S. Pradhan, with some initial contribution from Daniel Gildea at the University of Rochester.
ASSERT is trained to tag (i) PropBank arguments, (ii) thematic roles, and (iii) opinions in plain text.

SwiRL: The Semantic Role Labeler
A semantic role labeler for English, constructed on top of full syntactic analysis of text using Eugene Charniak's parser.
SwiRL trains one classifier for each argument label using a rich set of syntactic and semantic features.
Link: http://www.surdeanu.name/mihai/swirl/

CoNLL-2005 Shared Task: Semantic Role Labeling: Systems & Results
Link: http://www.lsi.upc.edu/~srlconll/st05/st05.html

Parser & Chunker

Illinois

Stanford
Link: http://nlp.stanford.edu/software/tagger.shtml
Downloadable (written in Java); English, Arabic, Chinese.

OpenNLP
Link: http://incubator.apache.org/opennlp/index.html

Natural Language Toolkit
Link: http://www.nltk.org/download

Determine the parse tree (grammatical analysis) of a given sentence.

Question answering

List of question-and-answer websites

Website | Founded | Alexa Ranking | Registration?
Allexperts | 1998 | 1957 | No
AOL Answers | 2006 | 6634 | Yes
Answerbag | 2003 | 1128 | -
Answers | 2005 | 127 | No
Askpedia | - | 123765 | -
Ask Me Help Desk | 2003 | 6686 | Yes
Askville | - | - | Yes
Blurtit | 2006 | 1716 | -
ChaCha | - | 1198 | -
Experts Exchange | 1996 | 1424 | Yes
Wolfram Alpha | 2009 | 3883 | No
Wikipedia Reference Desk | 2001 | 7 | No

Automatic Summarization

http://topicmarks.com/dashboard
http://www.tools4noobs.com/summarize/
http://www.uoguelph.ca/~wdarling/summ/

Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

Other

http://swesum.nada.kth.se/index-eng.html
http://www.summarization.com/mead/
http://textcompactor.com/

Multi-document online text summarizer

http://newsfeedresearcher.com/
http://iresearch-reporter.com/
http://shablast.com/
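
One simple way to produce such a summary is extractive: score each sentence and keep the best ones. The sketch below scores sentences by the average frequency of their content words. It is a generic illustration, not the method of any tool linked above; the stopword list and the scoring formula are assumptions.

```python
import re
from collections import Counter

# A short illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "it", "on", "for"}

def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)

article = ("Stocks fell sharply on Monday. "
           "Analysts blamed the fall in stocks on weak earnings. "
           "The weather was sunny.")
print(summarize(article))
```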

Summarization Evaluation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Link: http://berouge.com/default.aspx
Downloadable, written in Perl.

MEADeval (An Evaluation Framework for Extractive Summarization)
Link: http://tangra.si.umich.edu/clair/meadeval/
Downloadable, written in Perl.
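
The "recall-oriented" idea behind ROUGE can be sketched as n-gram recall: the share of the reference summary's n-grams that also appear in the candidate summary. This is a minimal Python sketch with a single reference and whitespace tokenization (both simplifying assumptions); the Perl package covers more variants and options than this.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate (clipped counts)."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

In this example five of the six reference unigrams and three of the five reference bigrams are matched.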



Machine Translation

Stanford: Entailment-based MT evaluation
Link: http://nlp.stanford.edu/software/mteval.shtml
Downloadable (written in Java)
It is based on the Stanford RTE system, which performs inference between two short texts, determining if one is entailed by the other. This inference mechanism is used to predict the adequacy of MT system output at the segment level compared to a reference translation.

EGYPT system: System from the 1999 JHU workshop. Mainly of historical interest.
GIZA++ and mkcls: Franz Och. C++. GPL.
Thot: Phrase-based model building kit.
Phramer: An open-source Java statistical phrase-based MT decoder.
Moses: A new open-source phrase-based MT decoder with functionality beyond Pharaoh.
SRILM: For creating n-gram language models.
Syntax Augmented Machine Translation via Chart Parsing: Andreas Zollmann and Ashish Venugopal.
Rewrite: a decoder for IBM Model
BLEU scoring tool: For machine translation evaluation.

Free, but getting them requires hassle:
Pharaoh decoder: Philipp Koehn, ISI.
MTTK: Machine Translation Tool Kit. Deng and Byrne.
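
The core quantity inside BLEU is the clipped ("modified") n-gram precision of a system translation against reference translations. The sketch below computes only that quantity; full BLEU additionally combines several n-gram orders and applies a brevity penalty, which are omitted here. Whitespace tokenization and lowercasing are assumptions of this sketch, not the behaviour of the BLEU scoring tool listed above.

```python
from collections import Counter

def modified_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate against one or more references."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = counts(candidate.lower().split())
    if not cand:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref = Counter()
    for reference in references:
        for gram, count in counts(reference.lower().split()).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

refs = ["the cat is on the mat", "there is a cat on the mat"]
print(modified_precision("the the the the the the the", refs))
```

Clipping is what stops a degenerate output such as "the the the ..." from scoring well: only two of its seven words are credited, because "the" occurs at most twice in any one reference.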

Topic segmentation

Given a chunk of text, separate it into segments each of which is devoted to a topic, and identify the topic of the segment.

Stanford Topic Modeling Toolbox (TMT)
Link: http://nlp.stanford.edu/software/tmt/tmt-0.3/
Downloadable (written in Java)
English, Arabic, Chinese version, 14.7 MB

Features
Import and manipulate text from cells in Excel and other spreadsheets.
Train topic models (LDA and Labeled LDA) to create summaries of the text.
Select parameters (such as the number of topics) via a data-driven process.
Generate rich Excel-compatible outputs for tracking word usage across topics, time, and other groupings of data.

Word sense disambiguation

WordNet::SenseRelate
Link: http://senserelate.sourceforge.net/
Three related word sense disambiguation packages:
WordNet-SenseRelate-AllWords: Assigns a sense to each word in a text.
WordNet-SenseRelate-TargetWord: Assigns a sense to a given target word.
WordNet-SenseRelate-WordToSet: Assigns the meaning to a word that is most related to a given set of words.
They carry out word sense disambiguation by measuring the semantic similarity between a word and its neighbors. In particular, a word is assigned the sense that is most related to its neighbors.

GWSD is a system for unsupervised all-words graph-based word sense disambiguation.
Link: http://lit.csci.unt.edu/~rada/downloads/GWSD/GWSD.1.0.tar.gz

Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.
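
The idea of picking the sense most related to the neighboring words can be sketched with a simple overlap count. This is not the SenseRelate or GWSD algorithm: the sense inventory below is a hypothetical stand-in for WordNet, and plain set overlap stands in for a real relatedness measure.

```python
# Hypothetical mini sense inventory: sense label -> words related to that sense.
# A real system would draw these from WordNet glosses or relatedness measures.
SENSES = {
    "bank": {
        "bank/finance": {"money", "loan", "deposit", "account", "cash"},
        "bank/river": {"river", "water", "shore", "fishing", "boat"},
    }
}

def disambiguate(target, context_words):
    """Choose the sense whose related words overlap most with the context."""
    context = {w.lower() for w in context_words}
    best_sense, best_overlap = None, -1
    for sense, related in SENSES.get(target.lower(), {}).items():
        overlap = len(related & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", "he sat on the bank of the river fishing".split()))
print(disambiguate("bank", "she opened an account to deposit money".split()))
```

Words missing from the inventory return `None`, and ties fall to the first sense listed, so the order of senses acts as a crude default.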

List of Toolkits

Name | Language | Creators | Site
AlchemyAPI | C, C++, C#, Java, Python, Perl, Ruby | Orchestr8 | [1]
Antelope framework | C#, VB.net | Proxem | [2]
Apertium | C++, Java | (various) | [3]
Cogito | - | Expert System S.p.A. | [4]
Carabao Language Kit | Any COM+ compliant language | Digital Sonata Pty Ltd | [5]
DELPH-IN | LISP, C++ | Deep Linguistic Processing with HPSG Initiative | [6]
Distinguo | C++ | Ultralingua Inc. | [7]
Ellogon | C/C++ | Georgios Petasis | [8]
FreeLing | C++ | Universitat Politècnica de Catalunya | [9]
General Architecture for Text Engineering | Java | GATE open source community | [10]
Graph Expression | Java | Startup huti.ru | [11]
Learning Based Java | Java | Cognitive Computation Group at the University of Illinois | [12]
LingPipe | Java | Alias-i | [13]
LinguaStream | Java | University of Caen, France | [14]
Mallet | Java | University of Massachusetts Amherst | [15]
MII nlp toolkit | Java | UCLA Medical Imaging Informatics (MII) Group | [16]
Modular Audio Recognition Framework | Java | The MARF Research and Development Group, Concordia University | [17]
MontyLingua | Python, Java | MIT | [18]
Natural Language Toolkit (NLTK) | Python | - | [19]
NooJ (based on INTEX) | .NET | University of Franche-Comté, France | [20]
OpenNLP | Java | Online community | [21]
Rosette | C, C++, Java, .NET | Basis Technology | [22]
ScalaNLP | Scala | David Hall and Daniel Ramage | [23]
Stanford NLP | Java | The Stanford Natural Language Processing Group | [24]
Text Engineering Software Laboratory (Tesla) | Java | University of Cologne | [25]
Thinktelligence Delegator | Java | Thinktelligence Corporation | [26]
UIMA | Java / C++ | Apache | [27]
WebLab-project | Java | OW2 | [28]
UniteX | Java & C++ | Laboratoire d'Automatique Documentaire et Linguistique | [29]
The Dragon Toolkit | Java | Drexel University | [30]
Factorie | Java | University of Massachusetts Amherst | [31]
Silpa (Indic Language Processing Toolkit) | Python | Silpa open source community developers | [32]

Corpora

LDC (Linguistic Data Consortium) and its catalogue by year. Email: ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., the ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs.

European Language Resources Association and its catalogue. The distribution agency is ELDA. Rapidly growing collection of materials in European languages.

ICAME (International Computer Archive of Modern English). Sells various corpora (including Brown and London-Lund).

Reuters @ NIST. Reuters corpora are now distributed by NIST.

TRACTOR. TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages.

CLR (Consortium for Lexical Research). Focuses more on language processing tools and lexicons, but does have some corpora.

OTA (Oxford Text Archive). Provides mainly literary texts. Has a bright new web site. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk.

Leipzig Corpora Collection. Sentence collections in a MySQL database for 17 mainly European languages.

BNC (British National Corpus). A 100-million-word corpus of British English, now also available in an XML edition.

European Corpus Initiative Multilingual Corpus I (ECI/MCI). A 98-million-word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap.

Survey of English Usage. At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, the Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).

International Corpus of English (ICE). Million-word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc.

Corpora held by Lancaster University. The site provides its own annotations.

The European Language Activity Network. Promises a uniform query language for accessing corpora in all EU languages, but isn't quite there yet.

Talkbank. Rich video and transcripts.

NLP Research Group

Academic departments with computational linguistics programs

Institute for Communicating and Collaborative Systems at the University of Edinburgh
Institute for Research in Cognitive Science at the University of Pennsylvania
Computational Linguistics & Phonetics at Saarland University
Computational Linguistics and Language Technology at Ohio State University
Stanford Natural Language Processing Group
Computational Linguistics at the University of Washington
Human Language Technology Research Institute at the University of Texas at Dallas
Department of Computer Science at the University of Illinois Urbana-Champaign (Cognitive Computation Group)
Center for Language and Speech Processing at Johns Hopkins University

Non-university computational linguistics groups

German Research Center for Artificial Intelligence
NLP Research Sponsors


Summer Internships and Opportunities


Google Internships


Summer of Code 2008



Data Science Summer Institute

Blogs, Video Lectures

Blogs

Hal Daumé III's NLP blog
LingPipe blog (Bob Carpenter)
Fernando Pereira's Structured Learning blog
Language Log
John Langford's Machine Learning blog
Jamie Pennebaker's Wordwatcher's blog

Video lectures

ACL Video Archive

Videos of Machine Learning lectures

Machine Learning and Cognitive Science 2007
Includes talks by Chris Manning, Sharon Goldwater, John Goldsmith, and others.

MIT workshop: Where Does Syntax Come From? Have We All Been Wrong?
Speakers include Chris Manning, Noam Chomsky, Partha Niyogi, Howard Lasnik and Joshua Tenenbaum.

NIPS 2007 tutorials
Including Geoffrey Hinton, Ben Taskar, and Robert Schapire.

Graduate Summer School: Probabilistic Models of Cognition: The Mathematics of Mind (July 9-26, 2007)
Slides and webcast links of all the talks. A lot of good introductory material on graphical models, Bayesian learning, etc.

Microsoft Research
Videos on Researchchannel.

Google Roundtable

Conferences

General (World Wide): ACL / ANLP / COLING / LREC / HLT
General (USA): NAACL / CICLING
General (Europe): EACL / RANLP / AMLaP
General (Asia): IJCNLP (formerly NLPRS) / PACLIC / PACLING / JNLP / IALP
Formal Grammar: FG / LFG / HPSG / TAG+
Machine Learning: ICML / ECML / NIPS
Statistical NLP: EMNLP / CoNLL / WVLC
Information Retrieval: SIGIR / ECIR
Computational Semantics: IWCS / ICoS
Others: IWPT / WAS / MOL / SENSEVAL / FSMNLP

Journals

NLP/CL
Computational Linguistics
Natural Language Engineering
Journal on Research on Language and Computation
Language Resources and Evaluation (formerly Computers and the Humanities)
Research on Language and Computation
Logic, Language and Information
Computer Speech and Language
Linguistic Issues in Language Technology (LiLT)
Journal of Interesting Negative Results in Natural Language Processing and Machine Learning
CfP: Interesting Negative Results in Summarization
Terminology
Traitement Automatique des Langues
CfP: Special Issue on Scaling NLP
Texto!
Corpus Linguistics and Linguistic Theory
ICAME Journal

IR/IS
Information Retrieval
D-Lib Magazine
Information Processing & Management
Journal of the American Society for Information Science and Technology
Information Science
Information Development
Information Design Journal + Document Design

Speech Processing
International Journal of Speech Technology
Speech Communication
Journal of the Acoustical Society of America
IEEE Transactions on Signal Processing
IEEE Transactions on Audio, Speech & Language Processing
CfP: Special Issue on New Approaches to Statistical Speech and Text Processing

Linguistics
Language@Internet
Lingua
Natural Language & Linguistic Theory
Natural Language Semantics
Cambridge Occasional Papers in Linguistics
System
Speculative Grammarian

Discourse/Pragmatics
Discourse Processes
Text & Talk
Multicultural Discourses
Journal of Pragmatics

Language and Identity
Language in Society
Journal of Language, Identity, and Education
Language & Intercultural Communication

BioInformatics
Bioinformatics
Biomedical Informatics
Applied Bioinformatics
Online Journal of Bioinformatics
In Silico Biology
Artificial Intelligence in Medicine

Supplementary Links

http://lac.essex.ac.uk/vm
http://comp.ling.utexas.edu/wiki/doku.php/nlp_links
http://www-a2k.is.tokushima-u.ac.jp/member/kita/NLP/nlp.html
http://www.coli.uni-saarland.de/~csporled/page.php?id=tools
http://www.elsnet.org/toolslist.html
http://zope.bioinfo.cnio.es/bionlp_tools/all_bionlp_tools
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits


Question?

