Corpora and Statistical Methods

Lecture 1, Part I


Albert Gatt

Tutorial

CSA5011 -- Corpora and Statistical Methods

Next Monday at 11:00.

This will take the form of a discussion of the following paper:
  Jurafsky, D. (2003). Probabilistic knowledge in psycholinguistics.
  (Available from course web page)

Course goals

Introduce the field of statistical natural language processing (statistical NLP).

Describe the main directions, problems, and algorithms in the field.

Discuss the theoretical foundations.

Involve students in hands-on experiments with real problems.

A general introduction

Language

We can define a language formally as:
  a set of symbols (“alphabet”)
  a set of rules to combine those symbols

This mathematical definition covers many classes of languages, not just human language.

Java: An artificial (formal) language

fixed set of basic symbols:
  public, static, for, while, {, }…

fixed syntax for symbol combination:

  public static void main(String[] args) {
      for (int i = 0; i < args.length; i++) {
      }
  }

Natural language

Often much more complicated than an artificial language.
  NB: Some theorists view NL as a special kind of formal language as well (Montague…).

It does conform to the formal definition:
  there are symbols
  there are modes of combination

However, there are many levels at which these symbols and rules are defined.

Levels of analysis in Natural language (I)

Acoustic properties (phonetics)
  defines a basic set of sounds in terms of their features
  studies the combination of these sounds (phonemes)

Higher-order acoustic features (phonology)
  how phonemes combine into larger units, with suprasegmental features such as intonation.


Levels of analysis in Natural language (II)

Word formation (morphology)
  combines morphemes into words

Combination into longer units in a structure-dependent way (syntax)
  “legal” word combinations in a language
  recursive phrasal combination

Interpretation (semantics):
  of words (lexical semantics)
  of longer units (sentential/propositional semantics)

Interpretation in context (pragmatics)


Natural Language Processing

Studies language at all its levels.
  phonology, morphology, syntax, semantics…
  focusses on process (Sparck-Jones 2007)
  computational methods to understand and generate human language

Often, the distinction between NLP and computational linguistics is fuzzy.


Kindred disciplines: Linguistics

Theoretical linguistics tends to be less process-oriented than NLP
  Q: how can we characterise the knowledge that native speakers have of their language?
  this leads to declarative models of speakers’ knowledge of language
  tends to say less about how speakers process language in real time
  NB: This depends on the theoretical orientation!

NLP has strong ties to theoretical linguistics
  it has also been an important contributor: process models can serve as tests for declarative models

Kindred disciplines: Psycholinguistics

Like NLP, psycholinguistics tends to be strongly process-oriented
  studies the online processes of language understanding and language production

NLP has benefited from such models.

NLP has also been a contributor:
  it is increasingly common to test psycholinguistic theories by building computational models.


Paradigms in NLP (I)

Knowledge-based:
  system is based on a priori rules and constraints
  e.g. a syntactic parser might have hand-crafted rules such as:
    NP → Det AdjP N
    AdjP → A+
  Problem: it is extremely difficult to hand-code all the relevant knowledge.
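
The two hand-crafted rules above can be written down directly as a small recogniser, which also makes their brittleness visible. What follows is only a minimal sketch in Python, assuming a toy lexicon and a flat encoding of the rules as a pattern over category labels. The names LEXICON, NP_PATTERN and is_np are hypothetical and are not taken from any parser discussed in the course.

```python
import re

# Toy lexicon mapping words to categories (illustrative assumption).
LEXICON = {
    "the": "Det", "a": "Det",
    "tall": "A", "strange": "A",
    "woman": "N", "boy": "N",
}

# NP -> Det AdjP N, with AdjP -> A+ (one or more adjectives),
# encoded as a pattern over a space-separated category sequence.
NP_PATTERN = re.compile(r"Det( A)+ N")


def is_np(phrase):
    """Return True if the phrase is licensed by the two hand-written rules."""
    words = phrase.lower().split()
    if any(w not in LEXICON for w in words):
        # Unknown word: the hand-coded knowledge simply runs out.
        return False
    categories = " ".join(LEXICON[w] for w in words)
    return NP_PATTERN.fullmatch(categories) is not None


print(is_np("the tall woman"))        # licensed by the rules
print(is_np("the tall strange boy"))  # AdjP may contain several adjectives
print(is_np("the woman"))             # rejected: AdjP is obligatory in the rule as written
print(is_np("the tall giraffe"))      # rejected: "giraffe" is not in the lexicon
```

The last two calls illustrate the problem noted on the slide: every gap in the rules or in the lexicon has to be found and patched by hand.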


Paradigms in NLP (II)

Statistical:
  starting point is a large repository of text or speech (a corpus)
  corpus is often annotated with relevant information, e.g.:
    parsed corpora (syntax)
    tagged corpora (part-of-speech)
    word-sense annotated corpora (semantics)
  tries to learn a model from the data
  tries to generalise this model to new data
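
As a rough illustration of what "learning a model from the data" and "generalising to new data" can mean, here is a minimal sketch of a most-frequent-tag baseline in Python. The tiny tagged corpus is invented for the example (with Brown-style tags), the names TAGGED, train and tag are hypothetical, and the fallback tag for unknown words is an arbitrary assumption; real statistical taggers are considerably more sophisticated.

```python
from collections import Counter, defaultdict

# A tiny hand-made tagged corpus (invented for illustration).
TAGGED = [
    [("the", "at"), ("tall", "jj"), ("woman", "nn"), ("thought", "vbd")],
    [("the", "at"), ("thought", "nn"), ("was", "bedz"), ("strange", "jj")],
    [("the", "at"), ("boy", "nn"), ("thought", "vbd")],
]


def train(tagged_sentences):
    """Learn, for each word, the tag it most often carries in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, pos in sentence:
            counts[word][pos] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="nn"):
    """Tag new text; unseen words fall back to a default tag (an assumption)."""
    return [(w, model.get(w, default)) for w in words]


model = train(TAGGED)
print(tag(model, ["the", "strange", "girl", "thought"]))
```

The word "thought" is ambiguous in the toy corpus; the learned model picks its more frequent tag, and the unseen word "girl" still receives some tag rather than causing a failure.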

The paradigms: a bird’s-eye view

We find similar “divisions” within mainstream linguistics:
  generative linguistics tends to formulate generalisations about “internalised speaker knowledge of language” (competence, I-Language…)
  corpus linguistics tends to formulate generalisations based on patterns observed in corpora

The two paradigms are viewed as having roots in different traditions:
  rationalist tradition (Plato, Descartes…)
  empiricist tradition (Locke…)

The idea of “linguistic knowledge”

Traditional linguistic theory (since the 1950s) introduced a dichotomy:
  competence: a person’s knowledge of language, formalised as a set of rules
  performance: actual production and perception of language in concrete situations

Much of linguistic theory has focused on characterising competence.

The idea of “linguistic knowledge”

The use of data (corpora) involves an increased focus on “performance”.

The idea is that exposure to the regularities found in language use is a crucial part of human language learning.
  (Evidence for this is our topic for Monday’s tutorial!)


An initial example

Suppose you’re a linguist interested in the syntax of verb phrases.
  Some verbs are transitive, some intransitive
    I ate the meat pie (transitive)
    I swam (intransitive)

What about:
  quiver
  quake

Corpus data suggests they have transitive uses:
  the insect quivered its wings
  it quaked his bowels (with fear)

Most traditional grammars characterise these as intransitive

Example II: lexical semantics

Quasi-synonymous lexical items exhibit subtle differences in context.
  strong
  powerful

A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.


Example II continued

Some differences between strong and powerful (source: British National Corpus):
  strong: wind, feeling, accent, flavour
  powerful: tool, weapon, punch, engine

The differences are subtle, but examining their collocates helps.
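
Collocates of this kind can be found by simple counting. The sketch below, in Python, counts the words that immediately follow a given node word. The sentences are invented for illustration and are not drawn from the British National Corpus; the names SENTENCES and right_collocates are hypothetical.

```python
from collections import Counter

# Invented example sentences (not BNC data).
SENTENCES = [
    "a strong wind blew all night",
    "she has a strong accent",
    "a powerful engine drives the car",
    "the tool is a powerful weapon",
    "a strong flavour of garlic",
]


def right_collocates(node, sentences):
    """Count the words that occur immediately to the right of `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - 1):
            if tokens[i] == node:
                counts[tokens[i + 1]] += 1
    return counts


print(right_collocates("strong", SENTENCES))
print(right_collocates("powerful", SENTENCES))
```

On a real corpus, raw counts like these would normally be compared against how frequent each word is overall, so that very common words do not dominate the list.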

Statistical approaches to language

Do not rely on categorical judgements of grammaticality etc. Examples:

1. Degrees of grammaticality: people often do not have categorical judgements of acceptability.

2. Category blending: We live nearer town than you thought.
   Is near an adjective or a preposition?

3. Syntactic ambiguity: She killed the man with the gun.
   What is the most likely parse?


Statistical NLP vs. Corpus Linguistics (I)

Corpus linguistics became popular with the arrival of large, machine-readable corpora.
  generally viewed as a methodology
  tests hypotheses empirically on data
  aim is to refine a theory of language, or discover novel generalisations

Statistical NLP shares these aims; however:
  it is often corpus-driven rather than corpus-based
  the “theory” or “model” learned is often not given a priori


Statistical NLP vs. Corpus Linguistics (II)

The term “corpus” may mean different things to different people:
  To a corpus linguist, a corpus is a balanced, representative sample of a particular language variety (e.g. The British National Corpus)
  Representativeness allows generalisations to be made more rigorously.

In statistical NLP, there has traditionally been less emphasis on these properties.
  emphasis on algorithms for learning language models
  we frequently find the tacit assumption that the algorithm can be applied to any set of data, given the right annotations


Some applications of Statistical NLP

[Figure: Language Technology. Components shown: Natural Language Understanding, Natural Language Generation, Speech Recognition, Speech Synthesis, Machine translation; they link Text, Meaning and Speech.]

A (very) rough division of NLP tasks

understanding: typically take as input free text or speech, and conduct some structural or semantic analysis
  POS Tagging, parsing, semantic role labelling, sentiment/opinion mining, named entity recognition…

generation: typically take textual or non-linguistic input, outputting some text/speech
  automatic weather reporting, summarisation, machine translation

How effective are statistical NLP tools at carrying out these and other tasks?
  Are statistical techniques actually useful for learning things about language?

Example 1: Semantics

Words most similar to “goat”:

  word       score
  sheep      0.359
  cow        0.345
  pig        0.331
  rabbit     0.305
  cattle     0.304
  deer       0.289
  lamb       0.286
  donkey     0.276
  poultry    0.262
  boar       0.261
  camel      0.259
  elephant   0.258
  calf       0.258
  pony       0.255

Example of an automatically acquired thesaurus of similar words.
  Data: 1.5 bn words obtained from the web.
  (www.sketchengine.co.uk)

How does this work?

Example 1: Semantics (cont/d)

Corpus-based lexical semantic acquisition typically uses vector-space models.
  represent a word as a vector containing information about the context in which it is likely to occur
  some models also include grammatical relations (subject-of, object-of etc)
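
To make the idea of a vector-space model concrete, here is a minimal sketch in Python: each word is represented by counts of the words occurring within a small window around it, and two words are compared by the cosine of their vectors. The sentences, the window size and the names SENTENCES, context_vector and cosine are assumptions made for illustration; this is not a description of how the thesaurus shown earlier was built.

```python
import math
from collections import Counter

# Invented example sentences.
SENTENCES = [
    "the farmer milked the goat today",
    "the farmer milked the cow today",
    "the engine of the car failed",
]


def context_vector(target, sentences, window=2):
    """Count the words occurring within `window` tokens of `target`."""
    vector = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vector[tokens[j]] += 1
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


goat = context_vector("goat", SENTENCES)
cow = context_vector("cow", SENTENCES)
car = context_vector("car", SENTENCES)
print(cosine(goat, cow), cosine(goat, car))
```

In this toy data "goat" and "cow" occur in identical contexts and so come out as more similar to each other than "goat" and "car", which share only the word "the".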

Example 2: POS Tagging

<tok pos="at">The</tok>
<tok pos="jj">tall</tok>
<tok pos="nn">woman</tok>
<tok pos="cc">and</tok>
<tok pos="at">the</tok>
<tok pos="jj">strange</tok>
<tok pos="nn">boy</tok>
<tok pos="vbd">thought</tok>
<tok pos="jj">statistical</tok>
<tok pos="nn">NLP</tok>
<tok pos="bedz">was</tok>
<tok pos="jj">pointless</tok>
<tok pos=".">.</tok>


“The tall woman and the strange boy thought statistical NLP was pointless.”

Output from a statistical POS Tagger, trained on the Brown Corpus (LingPipe demo library)

Uses of POS Tagging:
  pre-parsing
  corpus analysis for linguistics
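
For corpus analysis, tagger output like the above first has to be read back into (word, tag) pairs. The following is a minimal sketch in Python, assuming the markup has exactly the simple form shown (one pos attribute per tok element, no nesting). The names MARKUP, TOK, read_tokens and tag_counts are hypothetical and are not part of LingPipe.

```python
import re
from collections import Counter

# A fragment of the tagger output shown above.
MARKUP = (
    '<tok pos="at">The</tok> <tok pos="jj">tall</tok> '
    '<tok pos="nn">woman</tok> <tok pos="cc">and</tok> '
    '<tok pos="at">the</tok> <tok pos="jj">strange</tok> '
    '<tok pos="nn">boy</tok>'
)

# Matches one <tok pos="...">word</tok> element; group 1 is the tag, group 2 the word.
TOK = re.compile(r'<tok\s+pos="([^"]*)"\s*>([^<]*)</tok\s*>')


def read_tokens(markup):
    """Return the (word, tag) pairs found in the markup, in order."""
    return [(word, pos) for pos, word in TOK.findall(markup)]


def tag_counts(markup):
    """Count how often each tag occurs, a basic corpus-analysis step."""
    return Counter(pos for _, pos in read_tokens(markup))


print(read_tokens(MARKUP))
print(tag_counts(MARKUP))
```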




Example 3: parsing


Parsed using the Stanford Parser.
  Based on probabilistic context-free grammar of English
    trained on a treebank
    CFG rules with probabilities
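
In a probabilistic context-free grammar, the probability of a parse tree is the product of the probabilities of the rules used to build it, which gives a way of ranking competing parses. The sketch below, in Python, applies this to the ambiguous sentence "She killed the man with the gun" from earlier in the lecture. The grammar and all probabilities are invented for illustration, the names RULES and tree_prob are hypothetical, and nothing here reflects the actual grammar or algorithm of the Stanford Parser.

```python
# Toy PCFG: (left-hand side, right-hand side) -> probability.
# All numbers are invented; probabilities for each left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V", "NP", "PP")): 0.4,
    ("NP", ("she",)): 0.2,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("Det", "N", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("killed",)): 1.0,
    ("Det", ("the",)): 1.0,
    ("N", ("man",)): 0.5,
    ("N", ("gun",)): 0.5,
    ("P", ("with",)): 1.0,
}


def tree_prob(tree):
    """Probability of a tree: the product of the probabilities of its rules.

    A tree is a tuple (label, child, ...); a leaf is a plain string.
    """
    if isinstance(tree, str):
        return 1.0
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = RULES[(label, rhs)]
    for child in children:
        prob *= tree_prob(child)
    return prob


the_man = ("NP", ("Det", "the"), ("N", "man"))
the_gun = ("NP", ("Det", "the"), ("N", "gun"))
with_the_gun = ("PP", ("P", "with"), the_gun)

# Reading 1: the PP attaches to the verb phrase (the gun is the instrument).
vp_attach = ("S", ("NP", "she"), ("VP", ("V", "killed"), the_man, with_the_gun))
# Reading 2: the PP attaches to the noun phrase (the man has the gun).
np_attach = ("S", ("NP", "she"),
             ("VP", ("V", "killed"),
              ("NP", ("Det", "the"), ("N", "man"), with_the_gun)))

print(tree_prob(vp_attach), tree_prob(np_attach))
```

With these invented numbers the verb-attachment reading comes out as more probable; with probabilities estimated from a treebank the ranking could of course differ.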


Example 4: Machine translation

Input:
  (Maltese translation of example sentence)

Output:
  The wife and son long strange nonetheless feels that the statistical NLP is without purpose.

Translated using Maltese-English Google Translate.

Obvious shortcomings, but robust, i.e. some output returned, even if garbled.

Based on automatic alignment between parallel text corpora.

Example 5: Generation/Summarisation

[…] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified […]

Automatically generated article about 3-M syndrome (Sauper and Barzilay 2009)
  Now on Wikipedia!!! (http://en.wikipedia.org/wiki/3-M_syndrome)

Summarised from multiple documents drawn from the web.

Uses automatically acquired templates from human-authored texts to ensure coherence.

Features of Statistical NLP systems

Robustness: typically, don’t break down with new or unknown input

Portability: statistical learning algorithms can in principle be ported to new domains (given data)

Sensitivity to training data: if (say) a POS tagger is trained on medical text, its performance will decline on a new genre (e.g. news).

Some important concepts

All the systems surveyed rely on regularities in large repositories of training data, expressed as probabilities.

In practice, we distinguish between:
  training/development data: for learning a model and fine-tuning
  test data: for evaluation on unseen but compatible data
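
The distinction between training, development and test data can be sketched as a simple partition of a corpus. The Python example below is illustrative only: the function name split_data, the 80/10/10 proportions and the fixed random seed are assumptions, not a procedure prescribed by the course.

```python
import random


def split_data(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the items and split them into train, development and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed so the split is reproducible
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


sentences = ["sentence %d" % i for i in range(100)]
train, dev, test = split_data(sentences)
print(len(train), len(dev), len(test))
```

The model is fitted on the training portion, tuned on the development portion, and evaluated once on the test portion, which it has never seen.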

References

Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33 (3): 437-441.

McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge.
  (Contains an interesting discussion of corpus-based vs. corpus-driven approaches)