# Corpora and Statistical Methods


Oct 24, 2013 (4 years and 6 months ago)


Lecture 1, Part I

Albert Gatt

Corpora and Statistical Methods

Tutorial

CSA5011
--

Corpora and Statistical Methods

Next Monday at 11:00.

This will take the form of a discussion of the following paper:

Jurafsky, D. (2003). Probabilistic knowledge in psycholinguistics.

(Available from course web page)

Course goals


Introduce the field of statistical natural language processing (statistical NLP).

Describe the main directions, problems, and algorithms in the field.

Discuss the theoretical foundations.

Involve students in hands-on experiments with real problems.

A general introduction


Language


We can define a language formally as:

a set of symbols (“alphabet”)

a set of rules to combine those symbols

This mathematical definition covers many classes of languages, not just human language.
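To make the formal definition concrete, here is a minimal sketch of a toy language given by an alphabet plus one combination rule. The language itself (n copies of "a" followed by n copies of "b") and the name `in_language` are invented for illustration and are not part of the course material.

```python
# Toy formal language (invented for illustration): alphabet {a, b},
# one rule: a string is n copies of "a" followed by n copies of "b".
ALPHABET = {"a", "b"}

def in_language(string):
    """Check the alphabet first, then the combination rule."""
    if not set(string) <= ALPHABET:
        return False  # contains a symbol outside the alphabet
    n = len(string) // 2
    return len(string) % 2 == 0 and string == "a" * n + "b" * n

print(in_language("aabb"))  # symbols and rule both satisfied
print(in_language("abab"))  # right symbols, wrong combination
print(in_language("abc"))   # "c" is not in the alphabet
```

The same two ingredients, a symbol inventory and rules of combination, apply to programming languages and, at several levels at once, to natural language.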

Java: An artificial (formal) language


fixed set of basic symbols:

public, static, for, while, {, }…

fixed syntax for symbol combination

public static void main(String[] args) {
    for (int i = 0; i < args.length; i++) {
    }
}

Natural language


Often much more complicated than an artificial language.

NB: Some theorists view NL as a special kind of formal language as well (Montague…).

It does conform to the formal definition:

there are symbols

there are modes of combination

However, there are many levels at which these symbols and rules are defined.

Levels of analysis in Natural language (I)


Acoustic properties (phonetics)

defines a basic set of sounds in terms of their features

studies the combination of these phonemes

Higher-order acoustic features (phonology)

how combinations of phonemes combine into larger units, with suprasegmental features such as intonation.

Levels of analysis in Natural language (II)


Word formation (morphology)

combines morphemes into words

Combination into longer units in a structure-dependent way (syntax)

“legal” word combinations in a language

recursive phrasal combination

Interpretation (semantics):

of words (lexical semantics)

of longer units (sentential/propositional semantics)

Interpretation in context (pragmatics)

Natural Language Processing


Studies language at all its levels.

phonology, morphology, syntax, semantics…

focusses on process (Sparck-Jones '07)

computational methods to understand and generate human language

Often, the distinction between NLP and computational linguistics is fuzzy.

Kindred disciplines: Linguistics


Theoretical linguistics tends to be less process-oriented than NLP

Q: how can we characterise knowledge that native speakers have of their language?

declarative models of speaker’s knowledge of language

how speakers process language in real time

NB: This depends on the theoretical orientation!

NLP has strong ties to theoretical linguistics

it has also been an important contributor: process models can serve as tests for declarative models

Kindred disciplines: Psycholinguistics


Like NLP, psycholinguistics tends to be strongly process-oriented

studies the online processes of language understanding and language production

NLP has benefited from such models.

NLP has also been a contributor:

it is increasingly common to test psycholinguistic theories by building computational models.


Knowledge-based:

system is based on a priori rules and constraints

e.g. a syntactic parser might have hand-crafted rules such as:

NP → Det N A+

Problem: it is extremely difficult to hand-code all the relevant knowledge.
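As a rough illustration of what a hand-crafted rule looks like in practice, the sketch below encodes the noun-phrase rule above, read here as NP → Det N A+ (a determiner, a noun, then one or more adjectives), as a pattern over tag sequences. The reading of the rule, the tag names and the function `is_np` are assumptions made for this example.

```python
import re

# Hand-written rule, read here as NP -> Det N A+ (assumption: "+" means
# "one or more"). The rule is matched against a space-joined tag sequence.
NP_RULE = re.compile(r"Det N( A)+")

def is_np(tags):
    """Return True if the tag sequence matches the hand-crafted NP rule."""
    return NP_RULE.fullmatch(" ".join(tags)) is not None

print(is_np(["Det", "N", "A"]))       # determiner, noun, adjective
print(is_np(["Det", "N", "A", "A"]))  # a second adjective also matches
print(is_np(["Det", "A", "N"]))       # a different order is rejected
```

Every further construction (bare nouns, prenominal adjectives, relative clauses and so on) needs another rule written by hand, which is the scaling problem noted above.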


Statistical:

starting point is a large repository of text or speech (a corpus)

corpus is often annotated with relevant information, e.g.:

parsed corpora (syntax)

tagged corpora (part-of-speech)

word-sense annotated corpora (semantics)

tries to learn a model from the data

tries to generalise this model to new data
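The learn-then-generalise idea can be sketched with about the simplest possible model: for each word, remember the tag it most often carries in a tagged corpus, then apply that to new text. The toy corpus, the function names and the default tag for unseen words are all invented for illustration; real taggers use far richer models.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (invented for illustration): lists of (word, tag) pairs.
TRAIN = [
    [("the", "at"), ("dog", "nn"), ("barks", "vbz")],
    [("the", "at"), ("walk", "nn"), ("was", "bedz"), ("long", "jj")],
    [("they", "ppss"), ("walk", "vb")],
    [("a", "at"), ("long", "jj"), ("walk", "nn")],
]

def train(sentences):
    """Learn, for every word, the tag it most often carries in the data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(model, words, default="nn"):
    """Apply the model to new text; unseen words get a default tag."""
    return [(w, model.get(w, default)) for w in words]

model = train(TRAIN)
print(tag(model, ["the", "long", "walk", "ends"]))
```

Note that "walk" is tagged as a noun here only because that is its more frequent use in the training data, which already hints at how strongly such models depend on the corpus they are trained on.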

A bird’s-eye view


We find similar “divisions” within mainstream linguistics:

generative linguistics tends to formulate generalisations about “internalised speaker knowledge of language” (competence, I-Language…)

corpus linguistics tends to formulate generalisations based on patterns observed in corpora

The idea of “linguistic knowledge”


Traditional linguistic theory (since the 1950s) introduced a dichotomy:

competence: a person’s knowledge of language, formalised as a set of rules

performance: actual production and perception of language in concrete situations

Much of linguistic theory has focused on characterising competence.

The idea of “linguistic knowledge”


The use of data (corpora) involves an increased focus on “performance”.

The idea is that exposure to such regularities is a crucial part of human language learning.

(Evidence for this is our topic for Monday’s tutorial!)

An initial example


Suppose you’re a linguist interested in the syntax of verb phrases.

Some verbs are transitive, some intransitive:

I ate the meat pie (transitive)

I swam (intransitive)

quiver, quake

[…] these as intransitive

Corpus data suggests they have transitive uses:

the insect quivered its wings

it quaked his bowels (with fear)

Example II: lexical semantics


Quasi-synonymous lexical items exhibit subtle differences in context.

strong, powerful

A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued


Some differences between strong and powerful (source: British National Corpus):

strong: wind, feeling, accent, flavour

powerful: tool, weapon, punch, engine

The differences are subtle, but examining their collocates helps.
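A minimal sketch of how collocates might be extracted: count the words that immediately follow each adjective. The sentences below are invented stand-ins, not British National Corpus data, and `right_collocates` is a hypothetical helper; real collocation analysis normally also applies an association measure rather than raw counts.

```python
from collections import Counter

# Invented toy sentences standing in for a real corpus such as the BNC.
CORPUS = [
    "a strong wind blew across the bay",
    "she spoke with a strong accent",
    "the strong wind dropped at dusk",
    "it is a powerful tool for analysis",
    "the car has a powerful engine",
    "a powerful tool needs careful use",
]

def right_collocates(sentences, target):
    """Count the words that immediately follow `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts

print(right_collocates(CORPUS, "strong").most_common())
print(right_collocates(CORPUS, "powerful").most_common())
```

Even in this tiny sample the two adjectives select disjoint sets of nouns, which is the kind of contextual cue a theory of lexical semantics could draw on.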

Statistical approaches to language


Do not rely on categorical judgements of grammaticality etc. Examples:

1. Degrees of grammaticality: people often do not have categorical judgements of acceptability.

2. Category blending: We live nearer town than you thought.

Is near

3. Syntactic ambiguity: She killed the man with the gun.

What is the most likely parse?

Statistical NLP vs. Corpus Linguistics (I)


Corpus linguistics became popular with the arrival of large, machine-readable corpora

generally viewed as a methodology

tests hypotheses empirically on data

aim is to refine a theory of language, or discover novel generalisations

Statistical NLP shares these aims; however:

it is often corpus-driven rather than corpus-based

the “theory” or “model” learned is often not a priori given

Statistical NLP vs. Corpus Linguistics (II)


The term “corpus” may mean different things to different people:

To a corpus linguist, a corpus is a balanced, representative sample of a particular language variety (e.g. The British National Corpus)

Representativeness allows generalisations to be made more rigorously.

In statistical NLP, there has traditionally been less emphasis on these properties.

emphasis on algorithms for learning language models

we frequently find the tacit assumption that the algorithm can be applied to any set of data, given the right annotations

Some applications of Statistical NLP


[Diagram: Language Technology, linking Text, Meaning and Speech through Natural Language Understanding, Natural Language Generation, Speech Recognition, Speech Synthesis and Machine translation]

A (very) rough division of NLP tasks


understanding: typically take as input free text or speech, and conduct some structural or semantic analysis

POS Tagging, parsing, semantic role labelling, sentiment/opinion mining, named entity recognition…

generation: typically take textual or non-linguistic input, outputting some text/speech

automatic weather reporting, summarisation, machine translation

How effective are statistical NLP tools at carrying out these and other tasks?

Are statistical techniques actually useful for learning things about language?

Example 1: Semantics

“goat”

sheep     0.359
cow       0.345
pig       0.331
rabbit    0.305
cattle    0.304
deer      0.289
lamb      0.286
donkey    0.276
poultry   0.262
boar      0.261
camel     0.259
elephant  0.258
calf      0.258
pony      0.255

Example of an automatically acquired thesaurus of similar words.

Data: 1.5 bn words obtained from the web. (www.sketchengine.co.uk)

How does this work?

Example 1: Semantics (cont/d)


Corpus-based lexical semantic acquisition typically uses vector-space models.

represent a word as a vector containing information about the context in which it is likely to occur

some models also include grammatical relations (subject-of, object-of etc)
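A minimal sketch of the vector-space idea: each word is represented by counts of the words occurring near it, and two words are compared by the cosine of the angle between their vectors. The toy sentences, the window size and the function names are assumptions for illustration; this is not a description of how the thesaurus shown earlier was actually built.

```python
import math
from collections import Counter

# Invented toy sentences; a real system would use a very large corpus.
CORPUS = [
    "the goat eats grass on the farm",
    "the sheep eats grass on the hill",
    "the engine burns fuel in the car",
]

def context_vector(word, sentences, window=2):
    """Count the words occurring within `window` tokens of `word`."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == word:
                lo, hi = max(0, i - window), i + window + 1
                # Add every neighbour in the window except the word itself.
                vector.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return vector

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

goat, sheep, engine = (context_vector(w, CORPUS) for w in ("goat", "sheep", "engine"))
print(cosine(goat, sheep), cosine(goat, engine))
```

Words that share contexts ("goat" and "sheep" both precede "eats grass") come out as more similar than words that do not, which is the intuition behind ranking neighbours by a similarity score.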

Example 2: POS Tagging


<tok pos="at">The</tok>
<tok pos="jj">tall</tok>
<tok pos="nn">woman</tok>
<tok pos="cc">and</tok>
<tok pos="at">the</tok>
<tok pos="jj">strange</tok>
<tok pos="nn">boy</tok>
<tok pos="vbd">thought</tok>
<tok pos="jj">statistical</tok>
<tok pos="nn">NLP</tok>
<tok pos="bedz">was</tok>
<tok pos="jj">pointless</tok>
<tok pos=".">.</tok>

“The tall woman and the strange boy thought statistical NLP was pointless.”

Output from a statistical POS Tagger, trained on the Brown Corpus (LingPipe demo library)
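Tagger output in this token-per-element form is easy to consume downstream. The sketch below reads a few of the tokens into (word, tag) pairs using Python's standard XML library; the enclosing `<s>` element is added here only so that the fragment parses as a single document, and `read_tokens` is a hypothetical helper, not part of LingPipe.

```python
import xml.etree.ElementTree as ET

# A few tokens from the tagger output, wrapped in a root element
# (the <s> wrapper is an assumption so the fragment is well-formed XML).
TAGGED = """<s>
<tok pos="at">The</tok>
<tok pos="jj">tall</tok>
<tok pos="nn">woman</tok>
<tok pos="vbd">thought</tok>
</s>"""

def read_tokens(xml_text):
    """Return (word, tag) pairs from <tok pos="..."> elements."""
    root = ET.fromstring(xml_text)
    return [(tok.text, tok.get("pos")) for tok in root.iter("tok")]

pairs = read_tokens(TAGGED)
print(pairs)
print([word for word, tag in pairs if tag == "nn"])  # nouns only
```

Filtering by tag in this way is a typical first step in corpus analysis, for instance when extracting all adjective-noun sequences.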

Uses of POS Tagging:

pre-parsing

corpus analysis for linguistics

Example 3: parsing

Parsed using the Stanford Parser.

Based on a probabilistic context-free grammar of English

trained on a treebank

CFG rules with probabilities
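To show how rule probabilities can be used to rank competing analyses, here is a small sketch that scores two parses of the ambiguous sentence "She killed the man with the gun" as the product of the probabilities of the rules each tree uses. The grammar and all the numbers are invented for illustration, lexical probabilities are left out, and this is not a description of the Stanford Parser's internals.

```python
from math import prod

# Hypothetical rule probabilities, invented for illustration; a real PCFG
# would estimate them from rule counts in a treebank.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("NP", ("NP", "PP")): 0.2,
    ("NP", ("Det", "N")): 0.5,
    ("NP", ("Pro",)): 0.3,
    ("PP", ("P", "NP")): 1.0,
}

def tree_prob(tree):
    """Probability of a tree = product of the probabilities of its rules."""
    label, children = tree[0], tree[1:]
    if all(isinstance(c, str) for c in children):
        return 1.0  # pre-terminal over a word: lexical probabilities omitted
    rule = (label, tuple(child[0] for child in children))
    return RULES[rule] * prod(tree_prob(child) for child in children)

she = ("NP", ("Pro", "she"))
the_man = ("NP", ("Det", "the"), ("N", "man"))
with_gun = ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "gun")))

# Reading 1: the PP modifies the verb phrase (the gun is the instrument).
vp_attach = ("S", she, ("VP", ("VP", ("V", "killed"), the_man), with_gun))
# Reading 2: the PP modifies the noun phrase (the man has the gun).
np_attach = ("S", she, ("VP", ("V", "killed"), ("NP", the_man, with_gun)))

print(tree_prob(vp_attach), tree_prob(np_attach))
```

Under these made-up numbers the verb-phrase attachment scores higher; with probabilities estimated from a real treebank the preference could go either way, which is exactly what the data is meant to decide.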


Example 4: Machine translation


Input: (Maltese translation of example sentence)

Output: The wife and son long strange nonetheless feels that the statistical NLP is without purpose.

Translated using Maltese-English

Obvious shortcomings, but robust, i.e. some output returned, even if garbled.

Based on automatic alignment between parallel text corpora.

Example 5: Generation/Summarisation


[…] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified […]

Automatically generated article: 3-M syndrome (Sauper and Barzilay 2009)

Now on Wikipedia!!! (http://en.wikipedia.org/wiki/3-M_syndrome)

Summarised from multiple documents drawn from the web.

Uses automatically acquired templates from human-authored texts to ensure coherence.

Features of Statistical NLP systems


Robustness: typically, don’t break down with new or unknown input

Portability: statistical learning algorithms can in principle be ported to new domains (given data)

Sensitivity to training data: if (say) a POS tagger is trained on medical text, its performance will decline on a new genre (e.g. news).

Some important concepts


All the systems surveyed rely on regularities in large repositories of training data, expressed as probabilities.

In practice, we distinguish between:

training/development data: for learning a model and fine-tuning

test data: for evaluation on unseen but compatible data
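One common way to obtain these sets is to shuffle the available data once and cut it into three disjoint parts. The sketch below assumes an 80/10/10 split and a fixed random seed; both choices, and the name `split_data`, are illustrative rather than prescribed by the course.

```python
import random

def split_data(items, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle and split items into training, development and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_fraction)
    n_dev = int(len(items) * dev_fraction)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

sentences = [f"sentence {i}" for i in range(100)]  # placeholder data
train, dev, test = split_data(sentences)
print(len(train), len(dev), len(test))
```

The important property is that the test portion is never consulted while the model is being learned or tuned, so that evaluation really is on unseen data.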

References


Sparck-Jones, K. (2007). Computational Linguistics: What […]. Computational Linguistics 33 (3): 437–441.

McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches)