Applied Natural Language Processing

I256

Applied Natural Language Processing


Fall 2009


Lecture 1



Introduction

Barbara Rosario

Introductions


Barbara Rosario


iSchool alumni (class 2005)


Intel Labs


Gopal Vaswani


iSchool master student (class 2010)


You?


Today


Introductions


Administrivia


What is NLP


NLP Applications


Why is NLP difficult


Corpus-based statistical approaches


Course goals


What we’ll do in this course

Administrivia


http://courses.ischool.berkeley.edu/i256/f09/index.html


Books:


Foundations of Statistical NLP, Manning and Schuetze, MIT Press


Natural Language Processing with Python, Bird, Klein & Loper, O'Reilly.


(also on line)


See Web site for additional resources


Work:


Individual coding assignments (Python & NLTK - Natural Language Toolkit) (4 or 5)


Final group project


Participation


Office hours:


Barbara: Thursday 2:00-3:00 in Room 6


Gopal: Tuesday 2:00-3:00 in Room 6 (to be confirmed)


Administrivia


Communication:


My email: barbara.rosario@intel.com


Gopal: gopal.vaswani@gmail.com


Mailing list: i256@ischool.berkeley.edu


Send an email to majordomo@ischool.berkeley.edu with "subscribe i256" in the body


Through intranet


Announcements: webpage and/or mailing list and/or Bspace (TBA)


Public discussion: Bspace(?)


Related course: Statistical Natural Language Processing, Spring 2009, CS 288


http://www.cs.berkeley.edu/~klein/cs288/sp09/


Instructor: Dan Klein


Much more emphasis on statistical algorithms


Questions?

Natural Language Processing


Fundamental goal: deep understanding of broad language


Not just string processing or keyword matching!


End systems that we want to build:


Ambitious: speech recognition, machine translation, question answering…


Modest: spelling correction, text categorization…

Slide taken from Klein’s course: UCB CS 288 spring 09


Example: Machine Translation

NLP applications


Text Categorization


Classify documents by topics, language, author, spam filtering, information retrieval (relevant, not relevant), sentiment classification (positive, negative)


Spelling & Grammar Corrections


Information Extraction


Speech Recognition


Information Retrieval


Synonym Generation


Summarization


Machine Translation


Question Answering


Dialog Systems


Language generation

Why NLP is difficult


An NLP system needs to answer the question “who did what to whom”


Language is ambiguous


At all levels: lexical, phrase, semantic


Iraqi Head Seeks Arms



Word sense is ambiguous (head, arms)


Stolen Painting Found by Tree


Thematic role is ambiguous: tree is agent or location?


Ban on Nude Dancing on Governor’s Desk


Syntactic structure (attachment) is ambiguous: is the ban or the dancing on the desk?


Hospitals Are Sued by 7 Foot Doctors


Semantics is ambiguous: what is 7 foot?
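A rough sketch of why lexical ambiguity multiplies: if each word can take several senses, the number of candidate readings is the product of the per-word sense counts. The sense inventory below is hand-written for illustration only (it is an assumption, not taken from WordNet or any other lexicon).

```python
from itertools import product

# Hypothetical, hand-written sense inventory for the headline "Iraqi Head Seeks Arms".
SENSES = {
    "head": ["leader", "body part"],
    "seeks": ["tries to obtain"],
    "arms": ["weapons", "limbs"],
}

def readings(sentence):
    """Return every combination of word senses for the sentence."""
    words = sentence.lower().split()
    # Words missing from the inventory are treated as unambiguous.
    options = [SENSES.get(word, [word]) for word in words]
    return list(product(*options))

all_readings = readings("Iraqi Head Seeks Arms")
for reading in all_readings:
    print(reading)
print(len(all_readings), "candidate readings")
```

Only one of these combinations is the intended one, and nothing in the string itself says which.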

Why NLP is difficult


Language is flexible


New words, new meanings


Different meanings in different contexts


Language is subtle


He arrived at the lecture


He chuckled at the lecture


He chuckled his way through the lecture


**He arrived his way through the lecture


Language is complex!

Why NLP is difficult


MANY hidden variables


Knowledge about the world


Knowledge about the context


Knowledge about human communication techniques


Can you tell me the time?


Problem of scale


Many (infinite?) possible words, meanings, contexts


Problem of sparsity


Very difficult to do statistical analysis: most things (words, concepts) are never seen before


Long range correlations
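The sparsity problem can be illustrated with a toy sketch: even when a new sentence is almost identical to the one we have seen, a good share of its words and word pairs are unseen. The two sentences are made up for illustration; real corpora show the same effect at a much larger scale.

```python
def bigrams(tokens):
    """Return the list of adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))

# Toy "training" and "test" texts (illustrative assumptions, not real corpus data).
train = "the kids decorate the cake with the frosting".split()
test = "the kids decorate the tree with the lights".split()

seen_words = set(train)
seen_bigrams = set(bigrams(train))

unseen_words = [w for w in test if w not in seen_words]
unseen_bigrams = [b for b in bigrams(test) if b not in seen_bigrams]

print("unseen words:", unseen_words)
print("unseen bigrams:", unseen_bigrams)
print("unseen bigram rate:", len(unseen_bigrams) / len(bigrams(test)))
```

The rate of unseen bigrams is higher than the rate of unseen words, which is one reason longer contexts are harder to estimate.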

Why NLP is difficult


Key problems:


Representation of meaning


Language presupposes knowledge about the world


Language only reflects the surface of meaning


Language presupposes communication between people

Meaning


What is meaning?


Physical referent in the real world


Semantic concepts, characterized also by relations.


How do we represent and use meaning


I am Italian


From lexical database (WordNet)


Italian = a native or inhabitant of Italy


Italy = republic in southern Europe [..]


I am Italian


Who is “I”?


I know she is Italian / I think she is Italian


How do we represent “I know” and “I think”?


Does this mean that I is Italian? What does it say about the “I” and about the person speaking?


I thought she was Italian


How do we represent tenses?

Corpus-based statistical approaches to tackle NLP problems


How can a machine understand these differences?


Decorate the cake with the frosting


Decorate the cake with the kids


Rule-based approaches, i.e. hand-coded syntactic constraints and preference rules:


The verb decorate requires an animate being as agent


The object cake is formed by any of the following inanimate entities (cream, dough, frosting…)


Such approaches have been shown to be time-consuming to build, do not scale up well, and are very brittle to new, unusual, metaphorical uses of language


To swallow requires an animate being as agent/subject and a physical object as object


I swallowed his story


The supernova swallowed the planet
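A minimal sketch of a hand-coded rule like the one above, showing how it breaks on the two example sentences. The noun types and the helper name accepts are assumptions made up for this illustration.

```python
# Hypothetical hand-coded lexicon: semantic types for a few nouns.
NOUN_TYPES = {
    "i": {"animate", "physical"},
    "child": {"animate", "physical"},
    "pill": {"physical"},
    "planet": {"physical"},
    "supernova": {"physical"},
    "story": {"abstract"},
}

# Rule from the slide: "to swallow" needs an animate agent and a physical object.
VERB_RULES = {"swallow": {"agent": "animate", "object": "physical"}}

def accepts(agent, verb, obj):
    """Return True if the hand-coded rule for `verb` allows this agent and object."""
    rule = VERB_RULES[verb]
    agent_ok = rule["agent"] in NOUN_TYPES.get(agent, set())
    object_ok = rule["object"] in NOUN_TYPES.get(obj, set())
    return agent_ok and object_ok

print(accepts("child", "swallow", "pill"))        # True: the literal use passes
print(accepts("i", "swallow", "story"))           # False: metaphorical object is rejected
print(accepts("supernova", "swallow", "planet"))  # False: inanimate agent is rejected
```

Both rejected sentences are perfectly good English, so the rule set would need yet more exceptions, which is the brittleness the slide describes.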


Corpus-based statistical approaches to tackle NLP problems


A statistical NLP approach seeks to solve these problems by automatically learning lexical and structural preferences from text collections (corpora)


Statistical models are robust, generalize well and behave gracefully in the presence of errors and new data.


So:


Get large text collections


Compute statistics over those collections


(The bigger the collections, the better the statistics)


Corpus-based statistical approaches to tackle NLP problems


Decorate the cake with the frosting


Decorate the cake with the kids



From (labeled) corpora we can learn that:

#(kids are subject/agent of decorate) > #(frosting is subject/agent of decorate)



From (unlabeled) corpora we can learn that:

#(“the kids decorate the cake”) >> #(“the frosting decorates the cake”)

#(“cake with frosting”) >> #(“cake with kids”)


etc.


Given these “facts” we then need a statistical model for the attachment decision
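The counts above can be sketched over a toy unlabeled corpus. The sentences, the function names and the decision rule are all assumptions for illustration: comparing two raw counts is a naive heuristic, not the statistical model the slide calls for.

```python
# Toy unlabeled corpus (made-up sentences, for illustration only).
CORPUS = [
    "the kids decorate the cake",
    "the kids decorate the tree",
    "we ate cake with frosting",
    "she baked a cake with frosting",
    "a cake with frosting and candles",
    "the frosting melted",
]

def count_ngram(corpus, phrase):
    """Count how often the word sequence `phrase` occurs in the corpus."""
    target = phrase.split()
    n = len(target)
    total = 0
    for sentence in corpus:
        tokens = sentence.lower().split()
        total += sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))
    return total

def prefer_reading(noun, verb, pp_noun):
    """Naively compare '<noun> with <pp_noun>' against '<pp_noun> <verb>' counts."""
    noun_score = count_ngram(CORPUS, f"{noun} with {pp_noun}")
    agent_score = count_ngram(CORPUS, f"{pp_noun} {verb}")
    if noun_score > agent_score:
        return "modifies noun"
    if agent_score > noun_score:
        return "agent of verb"
    return "undecided"

print(prefer_reading("cake", "decorate", "frosting"))  # modifies noun
print(prefer_reading("cake", "decorate", "kids"))      # agent of verb
```

With a corpus this small most counts are zero, so the "undecided" branch would fire often; larger collections and a proper probabilistic model address that.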

Corpus-based statistical approaches to tackle NLP problems


Topic categorization: classify the document into semantic topics




Document 1


The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.


Topic = sport

Document 2


One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.


Topic = disaster

Corpus-based statistical approaches to tackle NLP problems


Topic categorization: classify the document into semantic topics




Document 1 (sport)


The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan …

Document 2 (disasters)


One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as….


From (labeled) corpora we can learn that:

#(sport documents containing word Cup) > #(disaster documents containing word Cup) -- feature


We then need a statistical model for the topic assignment
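One possible statistical model for the topic assignment is a naive Bayes classifier over word counts. The sketch below is written from scratch in plain Python with add-one smoothing; the four training snippets are invented, and it counts word occurrences rather than documents containing a word, which is a simplification of the feature on the slide.

```python
import math
from collections import Counter, defaultdict

# Tiny labeled corpus (made-up snippets, for illustration only).
TRAIN = [
    ("sport", "the team won the cup final"),
    ("sport", "twins win the doubles match in the cup"),
    ("disaster", "hurricane prompted evacuation orders"),
    ("disaster", "high wind warnings as the hurricane approached"),
]

def train(examples):
    """Collect per-topic document counts, word counts and the vocabulary."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        doc_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def classify(text, model):
    """Return the topic with the highest smoothed log-probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # skip words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAIN)
print(model[1]["sport"]["cup"], model[1]["disaster"]["cup"])  # the "Cup" feature counts
print(classify("The U.S. swept into the Davis Cup final", model))
print(classify("Hurricane Jeanne prompted evacuation orders", model))
```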

Corpus-based statistical approaches to tackle NLP problems



Feature extraction (usually linguistically motivated)


Statistical models



Data (corpora, labels, linguistic resources)


Goals of this Course


Learn about the problems and possibilities of natural
language analysis:


What are the major issues?


What are the major solutions?


At the end you should:


Agree that language is difficult, interesting and important


Be able to assess language problems


Know which solutions to apply when, and how


Feel some ownership over the algorithms


Be able to use software to tackle some NLP tasks


Know language resources


Be able to read papers in the field

What We’ll Do in this Course


Linguistic Issues


What is the range of language phenomena?


What are the knowledge sources that let us disambiguate?


What representations are appropriate?


Applications


Software (Python and NLTK)


Statistical Modeling Methods

What We’ll Do in this Course


Read books, research papers and tutorials


Final project


Your own ideas or choose from some suggestions I will provide


We’ll talk later during the course about ideas/methods etc., but come talk to me if you already have some ideas


Learn Python


Learn/use NLTK (Natural Language ToolKit) to try out various algorithms

Python


Simple yet powerful


The Zen of Python: http://www.python.org/dev/peps/pep-0020/


Very clear, readable syntax


Strong introspection capabilities


http://www.ibm.com/developerworks/linux/library/l-pyint.html (recommended)



Intuitive object orientation


Natural expression of procedural code


Full modularity, supporting hierarchical packages


Exception-based error handling


Very high level dynamic data types


Extensive standard libraries and third party modules for virtually every task


Excellent functionality for processing linguistic data.


NLTK is one such extensive third party module.




Source : python.org
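A short sketch of what "introspection" means in practice, using only built-ins and the standard inspect module. The helper name describe is made up for this example.

```python
import inspect

def describe(obj):
    """Collect a few facts about an object using built-in introspection."""
    return {
        "type": type(obj).__name__,
        "public_attrs": [name for name in dir(obj) if not name.startswith("_")],
        "callable": callable(obj),
        "doc": inspect.getdoc(obj),
    }

info = describe("cake")
print(info["type"])                     # str
print("split" in info["public_attrs"])  # True
print(getattr("cake", "upper")())       # CAKE
print(inspect.signature(describe))      # (obj)
```

This is handy when exploring an unfamiliar library interactively: dir() and getdoc() show what an object offers without leaving the interpreter.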

Python

Source : nltk.org

Language processing task   NLTK modules                 Functionality

Accessing corpora          nltk.corpus                  standardized interfaces to corpora and lexicons
String processing          nltk.tokenize, nltk.stem     tokenizers, sentence tokenizers, stemmers
Collocation discovery      nltk.collocations            t-test, chi-squared, point-wise mutual information
Part-of-speech tagging     nltk.tag                     n-gram, backoff, Brill, HMM, TnT
Classification             nltk.classify, nltk.cluster  decision tree, maximum entropy, naive Bayes, EM, k-means
Chunking                   nltk.chunk                   regular expression, n-gram, named-entity
Parsing                    nltk.parse                   chart, feature-based, unification, probabilistic, dependency

This is not the complete list
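A minimal sketch of the "string processing" row, assuming NLTK is installed. It uses wordpunct_tokenize and PorterStemmer, which need no downloaded corpora. So that the snippet still runs without NLTK, it falls back to a comparable regular expression and plain lowercasing; that fallback is an assumption of this sketch, not part of NLTK.

```python
import re
from collections import Counter

try:
    # Both work with the NLTK package alone, without extra data downloads.
    from nltk.stem import PorterStemmer
    from nltk.tokenize import wordpunct_tokenize
    HAVE_NLTK = True
except ImportError:
    HAVE_NLTK = False

def tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    if HAVE_NLTK:
        return wordpunct_tokenize(text)
    return re.findall(r"\w+|[^\w\s]+", text)  # fallback when NLTK is missing

def normalize(tokens):
    """Stem tokens with the Porter stemmer, or just lowercase them without NLTK."""
    if HAVE_NLTK:
        stemmer = PorterStemmer()
        return [stemmer.stem(token) for token in tokens]
    return [token.lower() for token in tokens]

text = "Decorate the cake with the frosting. Decorate the cake with the kids."
tokens = tokenize(text)
counts = Counter(normalize(tokens))

print(tokens[:7])
print(counts.most_common(3))
```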

NLTK


NLTK defines an infrastructure that can be used to build NLP programs in Python.


It provides basic classes for representing data relevant to natural language processing.


Standard interfaces for performing tasks such as part-of-speech tagging, syntactic parsing, and text classification.


Standard implementations for each task which can be combined to solve complex problems.









Resources:


Download at http://www.nltk.org/download


Getting started with NLTK Chapter 1


NLP and NLTK talk at Google: http://www.youtube.com/watch?v=keXW_5-llD0


Topics


Text corpora & other resources


Words (morphology, tokenization, stemming, part-of-speech, WSD, collocations, lexical acquisition, language models)


Syntax: chunking, PCFG & parsing


Statistical models (esp. for classification)


Applications


Text classification


Information extraction


Machine translation


Semantic Interpretation


Sentiment Analysis


QA / Summarization


Information retrieval

Next Assignment




Due before next class Tue Sep 1


No turn-in


Download and install Python and NLTK


Download the NLTK Book Collection, as described at the beginning of chapter 1 of the book Natural Language Processing with Python


Readings:


Chapter 1 of the book Natural Language Processing with Python



Chapter 3 of Foundations of Statistical NLP


Next class:


Linguistic Essentials


Python Introduction