
Python for NLP and the Natural Language Toolkit


CS1573: AI Application Development, Spring 2003

(modified from Edward Loper’s notes)


Outline


Review: Introduction to NLP (knowledge of language, ambiguity, representations and algorithms, applications)


HW 2 discussion


Tutorials: Basics, Probability

Python and Natural Language Processing



Python is a great language for NLP:

Simple

Easy to debug:
  Exceptions
  Interpreted language

Easy to structure:
  Modules
  Object-oriented programming

Powerful string manipulation
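A few of the built-in string operations behind that last claim (a minimal illustration):

>>> s = 'colorless green ideas'
>>> s.split()
['colorless', 'green', 'ideas']
>>> s[:9]
'colorless'
>>> '-'.join(['a', 'b', 'c'])
'a-b-c'
>>> s.replace('green', 'red')
'colorless red ideas'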

Modules and Packages


Python modules "package program code and data for reuse." (Lutz)

Similar to a library in C, or a package in Java.

Python packages are hierarchical modules (i.e., modules that contain other modules).

Three commands for accessing modules:

1. import
2. from…import
3. reload

Modules and Packages: import

The import command loads a module:

# Load the regular expression module
>>> import re

To access the contents of a module, use dotted names:

# Use the search function from the re module
>>> re.search(r'\w+', text)

To list the contents of a module, use dir:

>>> dir(re)
['DOTALL', 'I', 'IGNORECASE', ...]

Modules and Packages: from…import

The from…import command loads individual functions and objects from a module:

# Load the search function from the re module
>>> from re import search

Once an individual function or object is loaded with from…import, it can be used directly:

# Use the search function from the re module
>>> search(r'\w+', text)

import vs. from…import

import:
  Keeps module functions separate from user functions.
  Requires the use of dotted names.
  Works with reload.

from…import:
  Puts module functions and user functions together.
  More convenient names.
  Does not work with reload.
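A minimal sketch contrasting the two styles on the same call:

>>> import re                        # dotted name required
>>> re.search(r'\w+', 'two words').group()
'two'
>>> from re import search           # direct name available
>>> search(r'\w+', 'two words').group()
'two'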

Modules and Packages: reload

If you edit a module, you must use the reload command before the changes become visible in Python:

>>> import mymodule
...
>>> reload(mymodule)

The reload command only affects modules that have been loaded with import; it does not update individual functions and objects loaded with from…import.

Introduction to NLTK


The Natural Language Toolkit (NLTK) provides:

Basic classes for representing data relevant to natural language processing.

Standard interfaces for performing tasks, such as tokenization, tagging, and parsing.

Standard implementations of each task, which can be combined to solve complex problems.

NLTK: Example Modules



nltk.token: processing individual elements of text, such as words or sentences.

nltk.probability: modeling frequency distributions and probabilistic systems.

nltk.tagger: tagging tokens with supplemental information, such as parts of speech or WordNet sense tags.

nltk.parser: a high-level interface for parsing texts.

nltk.chartparser: a chart-based implementation of the parser interface.

nltk.chunkparser: a regular-expression based surface parser.

NLTK: Top-Level Organization

NLTK is organized as a flat hierarchy of packages and modules.

Each module provides the tools necessary to address a specific task.

Modules contain two types of classes:

Data-oriented classes are used to represent information relevant to natural language processing.

Task-oriented classes encapsulate the resources and methods needed to perform a specific task.

To the First Tutorials


Tokens and Tokenization


Frequency Distributions

The Token Module


It is often useful to think of a text in terms of smaller elements, such as words or sentences.

The nltk.token module defines classes for representing and processing these smaller elements.

What might be other useful smaller elements?

Tokens and Types



The term word can be used in two different ways:

1. To refer to an individual occurrence of a word
2. To refer to an abstract vocabulary item

For example, the sentence "my dog likes his dog" contains five occurrences of words, but four vocabulary items.

To avoid confusion, use more precise terminology:

1. Word token: an occurrence of a word
2. Word type: a vocabulary item
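A quick plain-Python check of that count:

>>> sentence = 'my dog likes his dog'
>>> tokens = sentence.split()        # word tokens: occurrences
>>> len(tokens)
5
>>> len(set(tokens))                 # word types: vocabulary items
4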

Tokens and Types (continued)



In NLTK, tokens are constructed from their types using the Token constructor:

>>> from nltk.token import *
>>> my_word_type = 'dog'
>>> my_word_token = Token(my_word_type)
>>> my_word_token
'dog'@[?]

Token member functions include type and loc.

Text Locations



A text location @[s:e] specifies a region of a text:

s is the start index
e is the end index

The text location @[s:e] specifies the text beginning at s, and including everything up to (but not including) the text at e.

This definition is consistent with Python slices.

Think of indices as appearing between elements:

  I   saw   a   man
 0   1     2   3    4

Shorthand notation when location width = 1.
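Because the convention matches Python slices, a plain list shows the same endpoint behavior:

>>> words = ['I', 'saw', 'a', 'man']
>>> words[1:3]     # from index 1 up to, but not including, index 3
['saw', 'a']
>>> words[2:3]     # a width-1 location
['a']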


Text Locations (continued)

Indices can be based on different units:

character
word
sentence

Locations can be tagged with sources (files, or other text locations; e.g., the first word of the first sentence in the file).

Location member functions:

start
end
unit
source
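The constructor for text locations isn't shown in these slides, so the following stand-in class is purely illustrative of the four member functions:

class Location:                      # hypothetical stand-in, not NLTK's class
    def __init__(self, start, end=None, unit='w', source=None):
        if end is None:
            end = start + 1          # width-1 shorthand
        self._start = start
        self._end = end
        self._unit = unit            # e.g., 'c' (character), 'w' (word), 's' (sentence)
        self._source = source
    def start(self): return self._start
    def end(self): return self._end
    def unit(self): return self._unit
    def source(self): return self._source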


Tokenization


The simplest way to represent a text is with a single string.

Difficult to process text in this format.

Often, it is more convenient to work with a list of tokens.

The task of converting a text from a single string to a list of tokens is known as tokenization.

Tokenization (continued)

Tokenization is harder than it seems:

I'll see you in New York.
The aluminum-export ban.

The simplest approach is to use "graphic words" (i.e., separate words using whitespace).

Another approach is to use regular expressions to specify which substrings are valid words (see the sketch below).

NLTK provides a generic tokenization interface: TokenizerI.
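A minimal sketch of both approaches on those sentences (the regular expression is illustrative, not NLTK's):

>>> import re
>>> text = "I'll see you in New York. The aluminum-export ban."
>>> text.split()                     # graphic words: punctuation sticks to words
["I'll", 'see', 'you', 'in', 'New', 'York.', 'The', 'aluminum-export', 'ban.']
>>> re.findall(r"\w+(?:[-']\w+)*", text)   # keep internal apostrophes and hyphens
["I'll", 'see', 'you', 'in', 'New', 'York', 'The', 'aluminum-export', 'ban']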

TokenizerI


Defines a single method, tokenize, which takes a string and returns a list of tokens.

tokenize is independent of the level of tokenization and of the implementation algorithm.
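As a sketch, an implementation of this interface can be very small (illustrative only; it returns plain strings, while NLTK's tokenizers return Token objects with locations):

class MyWSTokenizer:                 # hypothetical class, not NLTK's WSTokenizer
    def tokenize(self, text):
        # One token per whitespace-separated substring
        return text.split()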

Example

from nltk.token import WSTokenizer
from nltk.draw.plot import Plot

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Count up how many times each word length occurs
wordlen_count_list = []
for token in tokens:
    wordlen = len(token.type())
    # Add zeros until wordlen_count_list is long enough
    while wordlen >= len(wordlen_count_list):
        wordlen_count_list.append(0)
    # Increment the count for this word length
    wordlen_count_list[wordlen] += 1

Plot(wordlen_count_list)

Next Tutorial: Probability

An experiment is any process which leads to a well-defined outcome.

A sample is any possible outcome of a given experiment.

Rolling a die?
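For the die question, a plain-Python sketch (standard library only, no NLTK):

import random

counts = {}
for i in range(600):                 # run the experiment 600 times
    sample = random.randint(1, 6)    # one roll of the die = one sample
    counts[sample] = counts.get(sample, 0) + 1
# counts now maps each of the six possible samples
# to roughly 100 occurrences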

Outline

Review: Basics

Probability:
  Experiments and Samples
  Frequency Distributions
  Conditional Frequency Distributions

Review: NLTK Goals


Classes for NLP data


Interfaces for NLP tasks


Implementations, easily combined (what is an example?)

Accessing NLTK


What is the relation to Python?

Words


Types and Tokens


Text Locations


Member Functions

Tokenization


TokenizerI


Implementations


>>> tokenizer = WSTokenizer()
>>> tokenizer.tokenize(text_str)
['Hello'@[0w], 'world.'@[1w], 'This'@[2w], 'is'@[3w], 'a'@[4w], 'test'@[5w], 'file.'@[6w]]

Word Length Freq. Distribution Example

from nltk.token import WSTokenizer
from nltk.probability import SimpleFreqDist

# Extract a list of words from the corpus
corpus = open('corpus.txt').read()
tokens = WSTokenizer().tokenize(corpus)

# Construct a frequency distribution of word lengths
wordlen_freqs = SimpleFreqDist()
for token in tokens:
    wordlen_freqs.inc(len(token.type()))

# Extract the set of word lengths found in the corpus
wordlens = wordlen_freqs.samples()

Frequency Distributions


A frequency distribution records the number of times each outcome of an experiment has occurred:

>>> freq_dist = FreqDist()
>>> for token in document:
...     freq_dist.inc(token.type())

First the constructor creates an empty distribution; it is then initialized by storing experimental outcomes with inc().
Methods


The freq method returns the frequency of a given sample.

We can find the number of times a given sample occurred with the count method.

We can find the total number of sample outcomes recorded by a frequency distribution with the N method.

The samples method returns a list of all samples that have been recorded as outcomes by a frequency distribution.

We can find the sample with the greatest number of outcomes with the max method.

Examples of Methods


>>> freq_dist.count('the')
6

>>> freq_dist.freq('the')
0.012

>>> freq_dist.N()
500

>>> freq_dist.max()
'the'
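And the samples method, in the same style (output values are illustrative):

>>> freq_dist.samples()
['the', 'of', 'a', ...]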

Simple Word Length Example


>>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)

# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))

What is the "outcome" for our experiment?


Simple Word Length Example

>>> from nltk.token import WSTokenizer
>>> from nltk.probability import FreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)

# What is the distribution of word lengths in a corpus?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     freq_dist.inc(len(token.type()))

This length is the "outcome" for our experiment, so we use inc() to increment its count in a frequency distribution.


Complex Word Length Example

# Define vowels as "a", "e", "i", "o", and "u"
>>> VOWELS = ('a', 'e', 'i', 'o', 'u')

# What is the distribution for words ending in vowels?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if token.type()[-1].lower() in VOWELS:
...         freq_dist.inc(len(token.type()))

What is the condition?

More Complex Example

# What is the distribution of word lengths for
# words following words that end in vowels?
>>> ended_in_vowel = 0   # Did the last word end in a vowel?
>>> freq_dist = FreqDist()
>>> for token in tokens:
...     if ended_in_vowel:
...         freq_dist.inc(len(token.type()))
...     ended_in_vowel = token.type()[-1].lower() in VOWELS

Conditional Frequency Distributions

A condition specifies the context in which an experiment is performed.

A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions.

The individual frequency distributions are indexed by the condition.

NLTK's ConditionalFreqDist class:

>>> cfdist = ConditionalFreqDist()
>>> cfdist
<ConditionalFreqDist with 0 conditions>

Conditional Frequency Distributions (continued)

To access the frequency distribution for a condition, use the indexing operator:

>>> cfdist['a']
<FreqDist with 0 outcomes>

# Record lengths of some words starting with 'a'
>>> for word in 'apple and arm'.split():
...     cfdist['a'].inc(len(word))

# How many are 3 characters long?
>>> cfdist['a'].freq(3)
0.66667

To list the conditions that have been accessed, use the conditions method:

>>> cfdist.conditions()
['a']

Example: Conditioning on a Word's Initial Letter

>>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> from nltk.draw.plot import Plot
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()



Example (continued)

# How does initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)

What are the condition and the outcome?

Example (continued)

# How does initial letter affect word length?
>>> for token in tokens:
...     outcome = len(token.type())
...     condition = token.type()[0].lower()
...     cfdist[condition].inc(outcome)

What are the condition and the outcome?

Condition = the initial letter of the token
Outcome = its word length

Prediction

Prediction is the problem of deciding a likely outcome for a given run of an experiment.

To predict the outcome, we first examine a training corpus.

Training corpus:
  The context and outcome for each run are known.
  Given a new run, we choose the outcome that occurred most frequently for the context.

A conditional frequency distribution finds the most frequent occurrence.

Prediction Example: Outline

Record each outcome in the training corpus, using the context that the experiment was run under as the condition.

Access the frequency distribution for a given context with the indexing operator.

Use the max() method to find the most likely outcome.

Example: Predicting Words

Predict a word's type, based on the preceding word's type.

>>> from nltk.token import WSTokenizer
>>> from nltk.probability import ConditionalFreqDist
>>> corpus = open('corpus.txt').read()
>>> tokens = WSTokenizer().tokenize(corpus)
>>> cfdist = ConditionalFreqDist()   # empty

Example (continued)

>>> context = None   # The type of the preceding word
>>> for token in tokens:
...     outcome = token.type()
...     cfdist[context].inc(outcome)
...     context = token.type()

Example (continued)

>>> cfdist['prediction'].max()
'problems'
>>> cfdist['problems'].max()
'in'
>>> cfdist['in'].max()
'the'

What are we predicting here?

Example (continued)

We predict the most likely word for any context. A generation application:

>>> word = 'prediction'
>>> for i in range(15):
...     print word,
...     word = cfdist[word].max()
prediction problems in the frequency distribution of the frequency distribution of the frequency distribution of

For Next Time

HW3

To run NLTK from unixs.cis.pitt.edu, you should add /afs/cs.pitt.edu/projects/nltk/bin to your search path.

Regular Expressions (J&M handout, NLTK tutorial)