Lecture 1 (19/4/2011)


Syllabus


Why take this class


Introduction to NLP



http://www.staff.zu.edu.eg/hmabobakr/userdownloads/post/CSE_620_ST_NLP.html


Reference site:


www.u.arizona.edu/~echan3/539.html




Instructor


Instructor: Hitham M. Abo Bkr

Office hours: Wednesday 11:30-12:00

E-mail: hithamab@yahoo.com

I will create a doodle poll and e-mail the link to you

I am teaching 3 classes, so there may be students from other classes in office hours



3 Units. This course will introduce students to the computational methods used in modern Statistical Natural Language Processing: corpora, principles of machine learning, statistical models of linguistic structure, and evaluation of system performance. Many applications will be presented: parsing, language modeling, part of speech tagging, sentiment analysis, machine translation, word sense disambiguation, information extraction, and others.


Corpus


Part of speech tag


N-gram


Regular expression


Word sense


Syntactic constituency


Phrase structure tree


Context-free grammar


Parsing


Elementary probability theory


Smoothing



These topics will be reviewed in this class as needed, but you should also read about them in the Jurafsky & Martin textbook


Assignments: there will be up to six assignments, involving short-answer questions and programming questions. Some assignments will involve using existing software, and others will require coding from scratch. Assignments will be reduced in size for students enrolled in 439. Students without substantial programming experience may work together in pairs, with the consent of the instructor.



There will not be any exams.



Each assignment will be graded on a scale of 0 to 20 points. Late assignments will not be accepted.



The overall score for the course will be weighted
according to these criteria: assignments 70%,
attendance 30%. The course grade will be A, B,
C, D, or E. Incompletes will not be offered.
Grades for students enrolled in 439 and 539 will
be calculated separately.



Proficiency in programming is assumed for this course. Some of the lectures and scripts for assignments may use the Python language and Numerical Python (contained within the numpy module). Numerical Python will be covered in class. Lectures on the Python language are available here: http://www.u.arizona.edu/~echan3/508.html




For assignments, students may use any
programming language. It is recommended that
a “mainstream” language be used.



Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition. Pearson/Prentice-Hall.



Stephen Marsland. 2009. Machine Learning: An Algorithmic Perspective. CRC Press.

http://www-ist.massey.ac.nz/smarsland/MLbook.html



Christopher M. Bishop. 2006. Pattern Recognition and Machine
Learning. Springer.


Richard O. Duda, Peter E. Hart, and David G. Stork. 2001. Pattern Classification, Second Edition. Wiley-Interscience.


Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.


Stuart Russell and Peter Norvig. 2003. Artificial Intelligence: A Modern Approach, Second Edition. Pearson/Prentice-Hall.


Tom Mitchell. 1997. Machine Learning. WCB/McGraw-Hill.


Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.


Thomas M. Cover and Joy A. Thomas. 1991. Elements of
Information Theory. John Wiley & Sons.



Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. 6th printing with corrections, 2003. The MIT Press.



Available for free through the UA library


(see link on course web page)



Many lectures will refer to conference and
journal papers, which will be linked to on the
course web page. Students may find these
papers useful as supplemental reading
material.



Syllabus


Why take this class


Introduction to NLP



More linguists are using computational or data-based methods these days

Model the mind as a computer

But most linguists were not educated as computer scientists



In this class you’ll learn the principles behind
NLP technologies, and how to use them
appropriately

http://www.angry-monk.com/transblog/wp-content/gallery/article-pics/mammalian-brain-computer-inside.jpg


Computational linguistics jobs


http://linguistlist.org/jobs/browse-job-current-rs-1.html


http://languagelog.ldc.upenn.edu/nll/?p=1067



Software engineering jobs,


looking for “natural language processing”


http://jobsearch.monster.com/PowerSearch.aspx?q=natural%20language%20processing&tm=60&sort=dt.rv&rad=20&rad_units=miles


Commercially important area of research:
sentiment analysis


http://www.nytimes.com/2009/08/24/technology/internet/24emotion.html?hpw



Expertise in the structure of other languages
is needed as search engines adapt to other
languages besides English


http://www.nytimes.com/2008/12/31/technology/internet/31hindi.html?_r=1&ref=business


Automated trading programs that monitor the
news, blog posts, twitter, etc.


http://www.nytimes.com/2010/12/23/business/23trading.html?hpw



Many of the computational techniques in
statistical natural language processing can be
applied to other domains, such as biotechnology
and finance


http://www.nytimes.com/2009/08/06/technology/06stats.html





Computer Conquers 'Jeopardy!'



http://online.wsj.com/article/SB10001424052748704307404576080333201294262.html



Syllabus


Why take this class


Introduction to NLP



Terms can be used interchangeably:


Computational linguistics


Natural Language Processing



Linguists say “computational linguistics”


Computer scientists and engineers like
“Natural Language Processing”




Machine translation


Understanding text


e.g. persons, their relationships, and their actions


Search engines, answering users’ questions


Speech recognition


Modeling human language learning and
processing



Tokenizing a document


Sentence segmentation


Morphological analysis


Part of speech tagging


Syntactic parsing


Semantic analysis
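
A few of these steps in code, as a minimal sketch with the NLTK toolkit (NLTK is an assumption here, not a course requirement, and its tokenizer/tagger models must be downloaded once):

import nltk

# Sketch of the first pipeline stages using NLTK (assumed installed).
# One-time setup: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
text = "Mrs. Clinton visited Miami. She gave a speech."

for sent in nltk.sent_tokenize(text):   # sentence segmentation
    tokens = nltk.word_tokenize(sent)   # tokenization
    print(nltk.pos_tag(tokens))         # part of speech tagging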


Before 1990: create systems by hand, by writing
symbolic rules and grammars



Chomsky hierarchy


Regular:


Regular expressions, regular grammars, finite-state automata


Rewrite rules, finite-state transducers


Context-free:


Context-free grammars, pushdown automata


Computational complexity of recognition
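
To make the regular level concrete: a finite-state automaton runs in a single left-to-right pass over the input, which is why recognition of regular languages is cheap. A minimal sketch (the language (ab)* and the state names are invented for illustration):

# Deterministic finite-state automaton for the toy language (ab)*.
# The transition table and state names are made up for this sketch.
DELTA = {("q0", "a"): "q1", ("q1", "b"): "q0"}

def accepts(s, start="q0", accepting={"q0"}):
    state = start
    for ch in s:                     # one pass over the input
        state = DELTA.get((state, ch))
        if state is None:            # no transition defined: reject
            return False
    return state in accepting

print(accepts("abab"))   # True
print(accepts("aba"))    # False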


Hillary Clinton


Hillary Diane Clinton


Hillary D. Clinton


Secretary of State Clinton


Hillary


H D Clinton


Hillary Rodham Clinton


H D R C


Mrs. Clinton



A name consists of an optional title and one of the following. One, a first name or initial, an optional middle name (which is either a first name, last name, or initial), and any number of last names. Two, an optional first name or initial, an optional middle name, and at least one last name.



Title = Secretary of State | Mrs.


First = Hillary | Diane


Last = Rodham | Clinton


Initial = D | H | R | C | D. | H. | R. | C.


Name = (Title) [(First|Initial) (First|Last|Initial)? (Last)*]
     | [(First|Initial)? (First|Last|Initial)? (Last)+]
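
One way to encode this toy grammar as a Python regular expression (the word lists come from the slide; the encoding details are illustrative, not a reference implementation):

import re

TITLE   = r"(?:Secretary of State|Mrs\.)"
FIRST   = r"(?:Hillary|Diane)"
LAST    = r"(?:Rodham|Clinton)"
INITIAL = r"(?:[DHRC]\.?)"

NAME = re.compile(
    rf"(?:{TITLE} )?"                        # optional title
    rf"(?:(?:{FIRST}|{INITIAL})"             # 1: first name or initial,
    rf"(?: (?:{FIRST}|{LAST}|{INITIAL}))?"   #    optional middle,
    rf"(?: {LAST})*"                         #    any number of last names
    rf"|(?:(?:{FIRST}|{INITIAL}) )?"         # 2: optional first,
    rf"(?:(?:{FIRST}|{LAST}|{INITIAL}) )?"   #    optional middle,
    rf"{LAST}(?: {LAST})*)"                  #    at least one last name
)

for s in ["Hillary Rodham Clinton", "Mrs. Clinton", "H D Clinton", "rich baker"]:
    print(s, "->", bool(NAME.fullmatch(s)))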



For further improvements:


Use lists of first names, last names, titles


Capitalization patterns: [A-Z][a-z]* [A-Z]. [A-Z][a-z]*



Foreign names


Wen Jiabao


Mikhail Khodorkovsky


Abdul Aziz bin Abdur Rahman Al Saud



Capitalization


hillary clinton


HILLARY CLINTON



Ambiguities


Clinton, S.C. = location

rich baker = person or phrase?

john = person or toilet?


Large number of rare names


Very difficult to provide coverage of all cases



Rules too general


Need information about how the string appears in a sentence


She became a rich baker by selling cupcakes.


Very difficult to specify exact combination of conditions
for precise recognition



Hard to maintain system


Rule-based systems can get very, very large


Availability of large annotated corpora


Corpus: electronic file of language usage


Annotations: linguistic structure is indicated in the
corpus


Note: corpus = singular, corpora = plural

Not “corpuses”


Benefits of corpora


Catalogs actual language usage


Use to train machine learning algorithms


Quantitative evaluation


Henry Kucera and W. Nelson Francis, 1967, Computational Analysis of Present-Day American English.

500 texts, about 1.2 million words

“Balanced” corpus: texts from 15 different genres

Newspapers, editorials, literature, science, government documents, cookbooks, etc.



http://en.wikipedia.org/wiki/Brown_Corpus
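
If you want to look at the Brown Corpus yourself, NLTK ships a copy (an aside; assumes NLTK is installed and the corpus was fetched once with nltk.download('brown')):

from nltk.corpus import brown

print(brown.categories()[:5])     # a few of the 15 genres
print(brown.tagged_words()[:8])   # (word, tag) pairs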



Sample of tagged text from the Brown Corpus:

Miami/np-hl ,/,-hl Fla./np-hl ,/,-hl March/np-hl 17/cd-hl --/-- The/at
Orioles/nps tonight/nr retained/vbd the/at distinction/nn of/in being/beg
the/at only/ap winless/jj team/nn among/in the/at eighteen/cd
Major-League/nn-tl clubs/nns as/cs they/ppss dropped/vbd their/pp$
sixth/od straight/jj spring/nn exhibition/nn decision/nn ,/, this/dt
one/cd to/in the/at Kansas/np-tl City/nn-tl Athletics/nns-tl by/in a/at
score/nn of/in 5/cd to/in 3/cd ./.


1.3 million words of Wall Street Journal articles

Manually annotated for syntactic structure

http://www.cis.upenn.edu/~treebank/

Example sentence:

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.



( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
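
Bracketed trees like this can be read programmatically; a small sketch with NLTK's Tree class, shown on a simplified version of the bracketing above:

from nltk import Tree

# Parse a (simplified) Penn Treebank bracketing; NLTK is assumed installed.
t = Tree.fromstring(
    "(S (NP-SBJ (NNP Pierre) (NNP Vinken)) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board)))) (. .))"
)
t.pretty_print()    # draws the tree as ASCII art
print(t.leaves())   # ['Pierre', 'Vinken', 'will', 'join', 'the', 'board', '.']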








[Diagram: annotated corpus → machine learning algorithm → NLP system]

http://www.ibiblio.org/hhalpin/homepage/presentations/semsearch09/brainsky.jpg


Formulate the problem in terms of the output labels to be predicted given input data

Have an annotated corpus containing labels for its data

Use the corpus to train a classifier

Apply the classifier to predict labels for new data
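
The whole train/predict loop fits in a few lines. A toy sketch with scikit-learn (an assumed toolkit, not prescribed by the course; the four-example “corpus” is invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# The "annotated corpus": input data paired with output labels.
texts  = ["great movie", "terrible plot", "loved it", "awful acting"]
labels = ["pos", "neg", "pos", "neg"]

vec = CountVectorizer()        # represent each text by word counts
X = vec.fit_transform(texts)

clf = MultinomialNB()          # a Naive Bayes classifier
clf.fit(X, labels)             # train on the corpus

# Predict labels for new, unseen data.
print(clf.predict(vec.transform(["loved this movie", "awful plot"])))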





Many algorithms to predict the label of an item
based on its features:


Perceptron


Decision Trees


Naïve Bayes


Neural Networks


Maximum Entropy


Support Vector Machines


Memory-Based Learning


Margin Infused Relaxed Algorithm


etc.
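
For a flavor of the simplest of these, a minimal perceptron training loop in numpy (a from-scratch sketch; the toy 2-d data points are invented):

import numpy as np

# Learn weights w so that sign(w . x) matches the label y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)
for _ in range(10):                  # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:  # misclassified (or on the boundary)
            w += yi * xi             # perceptron update

print(w, [int(np.sign(np.dot(w, xi))) for xi in X])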



Most common machine learning algorithms can
be applied to any domain:


Handwriting recognition


Interpreting visual scenes


Predicting stock prices


Identifying proteins in DNA sequences



Though some algorithms were designed
specifically for language


Utilize CFGs, finite automata, and other formalisms


Fast system development (assuming a corpus)



Higher performance



Higher coverage of linguistic possibilities: rare constructions are encountered in corpora



Can handle ambiguity in language






Consider the classic ambiguous sentence “I saw a man on the hill with a telescope”:

“I was on the hill that has a telescope when I saw a man.”


“I saw a man who was on the hill that has a telescope on it.”

“I was on the hill when I used the telescope to see a man.”

“I saw a man who was on a hill and who had a telescope.”

“Using a telescope, I saw a man who was on a hill.”



Real-life sentences can easily have millions of parses



Not a solution to output all possible parses



Statistical NLP:

Count frequencies of different constructions in a corpus

Assign a probability to a parse

Output the most likely parse
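
A toy version of this idea with NLTK's PCFG and Viterbi parser; the grammar and rule probabilities below are invented for illustration (in practice they would be estimated from treebank counts):

from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    S   -> NP VP       [1.0]
    NP  -> 'I'         [0.4]
    NP  -> Det N       [0.4]
    NP  -> NP PP       [0.2]
    VP  -> V NP        [0.6]
    VP  -> VP PP       [0.4]
    PP  -> P NP        [1.0]
    Det -> 'a'         [0.5]
    Det -> 'the'       [0.5]
    N   -> 'man'       [0.5]
    N   -> 'telescope' [0.5]
    V   -> 'saw'       [1.0]
    P   -> 'with'      [1.0]
""")

parser = ViterbiParser(grammar)
tokens = "I saw a man with a telescope".split()
for tree in parser.parse(tokens):   # yields the single most probable parse
    print(tree)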


KIDS MAKE NUTRITIOUS SNACKS


STOLEN PAINTING FOUND BY TREE


LUNG CANCER IN WOMEN MUSHROOMS


QUEEN MARY HAVING BOTTOM SCRAPED


DEALERS WILL HEAR CAR TALK AT NOON


MINERS REFUSE TO WORK AFTER DEATH


MILK DRINKERS ARE TURNING TO POWDER


DRUNK GETS NINE MONTHS IN VIOLIN CASE


JUVENILE COURT TO TRY SHOOTING DEFENDANT


COMPLAINTS ABOUT NBA REFEREES GROWING UGLY



Expense of annotating corpora


Penn Treebank: 1.2 million words, $1 per word in 1990



Languages of lesser commercial importance


Difficult to obtain data


Hard to get funding to pay for annotation



Domain specificity: a classifier trained on a
particular corpus will not necessarily work well
on another



We’ll also look at unsupervised and semi-supervised learning algorithms

Utilize corpora without annotations, or with only a small amount of annotation

Linguistic structure is built into the learner instead of explicitly indicated in the data