FSNLP_Lin_Ch1

24 Oct 2013

Foundations of Statistical Natural Language Processing

Ch1. Introduction

Lin, Yu-Chun

FSNLP - Introduction

Statistical NLP

All quantitative approaches to automated language processing

Computational Linguistics

The study of computer systems for understanding and generating natural languages

To make the computer a fluent user of ordinary language in all kinds of conversation tasks


Road Map - FSNLP

Preliminaries: math and linguistic foundations

Words: collocations, n-gram models, word sense disambiguation, lexical acquisition

Grammar: Markov models, tagging, probabilistic context-free grammar, probabilistic parsing

Applications and Techniques: statistical alignment, machine translation, clustering, information retrieval, and text categorization



Applications of NLP


Machine Translation


Natural Language Interface (to Databases)


Text Processing (Understanding/Generation)


Writing Aids (Spelling Checker, Grammar Checker, Style Checker)


Speech Recognition/Synthesis


OCR and OLCR


Intelligent Information Retrieval


Digital Libraries


NLP for the World Wide Web


Text Data Mining


Language Modeling of Biological Data


......



Rationalist Approach


1960-1985

Noam Chomsky

Chomskyan linguistics

A significant part of the knowledge in the human mind is not derived by the senses but is fixed in advance, presumably by genetic inheritance.


Rationalist Approach (cont.)



Poverty of the stimulus




Generative Linguistics




Linguistic competence: knowledge of language structure in the mind of a native speaker

Linguistic performance: e.g. affected by memory limitations or distracting noises

Knowledge-based



Empiricist Approach


1920-1960, 198X-

Also postulates some cognitive abilities at birth

The mind is not a tabula rasa

General operations applied to sensory input

American structuralists

Corpus-based

Zellig Harris, 1951


Categorical


Grammaticality

Non-categorical phenomena

Conventionality

New POS usage, e.g. near

Word meanings change gradually, e.g. kind of / sort of



Probabilistic


The argument for a probabilistic approach to cognition is that we live in a world filled with uncertainty and incomplete information.


Unseen events


Ambiguity


NLP difficulties


Ambiguities


AI-complete problem

Two parse trees for "Our company is training workers":

[S [NP Our company] [VP [Aux is] [VP [V training] [NP workers]]]]

[S [NP Our company] [VP [V is] [NP [VP [V training] [NP workers]]]]]
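The ambiguity of "Our company is training workers" can be made concrete by writing two of its parses as data. The sketch below assumes a simple nested-tuple encoding, (label, child, ...) with plain strings as leaves. The names `PARSE_PROGRESSIVE`, `PARSE_COPULA`, `leaves`, and `bracketed` are illustrative and are not taken from the slides.

```python
# Each tree is (label, child, child, ...); a leaf is a plain string.
# Reading 1: "is" is an auxiliary and "training" is the main verb.
PARSE_PROGRESSIVE = (
    "S",
    ("NP", "Our", "company"),
    ("VP", ("Aux", "is"), ("VP", ("V", "training"), ("NP", "workers"))),
)
# Reading 2: "is" is the main verb and "training workers" acts as a noun phrase.
PARSE_COPULA = (
    "S",
    ("NP", "Our", "company"),
    ("VP", ("V", "is"), ("NP", ("VP", ("V", "training"), ("NP", "workers")))),
)


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def bracketed(tree):
    """Render a tree in labelled-bracket notation."""
    if isinstance(tree, str):
        return tree
    return "[" + tree[0] + " " + " ".join(bracketed(c) for c in tree[1:]) + "]"


for tree in (PARSE_PROGRESSIVE, PARSE_COPULA):
    print(" ".join(leaves(tree)), "<-", bracketed(tree))
```

Both trees yield the same word sequence, so the surface string alone cannot tell a parser which structure was intended.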


Rule-based vs. Corpus-based: Advantages

Corpus-based

Knowledge acquisition can be achieved automatically by the computer

Uncertain knowledge can be objectively quantified

Consistency and completeness are easy to obtain

Well suited to handling large amounts of fine-grained information (with many parameters)

Well-established statistical theories and techniques are available

Rule-based

No need to prepare a database

Easy to incorporate existing linguistic knowledge

Generalizes better to an unseen domain

Reasoning processes are explainable and traceable

The operating mechanism is easy to understand



Rule-based vs. Corpus-based: Disadvantages

Corpus-based

Preparing the database is a time-consuming and tedious task

Generalization is poor for a small database

Reasoning processes are implicit and inaccessible to humans

Parameters interact, so it is hard to identify the effect of a particular one

Rule-based

Hard to maintain consistency (between different people, on different occasions)

Hard to handle uncertain knowledge (uncertainty factors are not easy to quantify objectively)

Hard to deal with complex, irregular information

Knowledge acquisition is very time-consuming

Not easy to obtain high coverage (completeness) for a given domain

Not easy to avoid redundancy


Lexical resources (corpora)


Brown

Balanced corpus

British English version: LOB (Lancaster-Oslo-Bergen)

Penn Treebank

WSJ

Training set: sections 2-21

Development set: section 22

Test set: section 23
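The WSJ split listed above can be written down as a small lookup. This is only a sketch: `wsj_split` is a hypothetical helper, and it assumes that WSJ sections are numbered 00 to 24 and that sections outside the listed ranges are simply left unused.

```python
def wsj_split(section):
    """Map a WSJ section number to its role in the split listed above."""
    # Assumption: the WSJ portion is divided into sections 00 through 24.
    if not 0 <= section <= 24:
        raise ValueError(f"WSJ sections run from 00 to 24, got {section}")
    if 2 <= section <= 21:
        return "train"
    if section == 22:
        return "dev"
    if section == 23:
        return "test"
    # Sections 00, 01 and 24 are not assigned a role by the slide.
    return "unused"


for section in (1, 2, 21, 22, 23):
    print(f"section {section:02d}: {wsj_split(section)}")
```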


Lexical resources (corpora)


Canadian Hansards


Bilingual corpus, parallel texts


WordNet


Electronic dictionary


Synset


Relations between words


Meronymy (part-whole relations)


etc.


Word counts


Function words


Word tokens vs. word types

Some facts

The 100 most common words account for 50.9% of the tokens

Almost half (49.8%) of the word types are hapax legomena (they occur only once)

Over 90% of the word types occur <=10 times

12% of the text is made up of words that occur <=3 times
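The distinction between tokens, types, and hapax legomena is easy to check on a toy text. The following is a minimal sketch; the regular-expression tokenizer and the `word_statistics` helper are assumptions made for illustration, and the percentages quoted above come from a far larger text than this sample.

```python
import re
from collections import Counter


def word_statistics(text):
    """Return token count, type count, and hapax legomena for a text."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Hapax legomena are the word types that occur exactly once.
    hapaxes = sorted(word for word, count in counts.items() if count == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapax_legomena": hapaxes,
    }


sample = "the cat sat on the mat and the dog sat on the log"
stats = word_statistics(sample)
print(stats["tokens"], stats["types"], stats["hapax_legomena"])
```

Here "the" contributes four tokens but only one type, which is the same effect that lets a small set of function words account for a large share of a real text.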


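Zipf's laws

Principle of Least Effort

Zipf's law: $f \propto \frac{1}{r}$, i.e. there is a constant $k$ such that $f \cdot r = k$, where $f$ is the frequency of a word and $r$ is its rank

Weak points

Poor fit at the highest and lowest ranks

Refined by Mandelbrot: $f = P(r + \rho)^{-B}$, or equivalently $\log f = \log P - B \log(r + \rho)$

The significance of power laws

Zipf's law also holds for randomly generated text.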
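The remark about randomly generated text can be tried out directly. The sketch below "types" characters uniformly at random from a four-letter alphabet plus a space and then tabulates rank, frequency, and their product. The alphabet, the seed, and the helper names `rank_frequency` and `random_text` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows


def random_text(n_chars, alphabet="abcd", seed=0):
    """Pick characters uniformly at random; the space acts as the word separator."""
    rng = random.Random(seed)
    symbols = alphabet + " "
    return "".join(rng.choice(symbols) for _ in range(n_chars))


tokens = random_text(100_000).split()
table = rank_frequency(tokens)
for rank, word, freq, product in table[:8]:
    print(rank, word, freq, product)
```

With uniform random typing, all words of the same length are equally likely, so frequency falls in steps as rank grows rather than along a smooth curve. Short words still dominate the top ranks, which is the power-law-like shape the slide refers to. The same `rank_frequency` function can be run on real tokens to inspect how constant $f \cdot r$ is.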


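Zipf's laws (cont.)

$m$ = the number of meanings of a word: $m \propto \sqrt{f}$, and hence $m \propto \frac{1}{\sqrt{r}}$

The frequency $F$ of different interval sizes $I$ between occurrences of a word ($p$ varies between about 1 and 1.3): $F \propto I^{-p}$

Content words tend to occur near another occurrence of the same word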
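The interval statistic can be computed with a few lines of code. This sketch assumes that an interval is the difference between the token positions of two successive occurrences of the same word; the slide does not spell out the exact definition, and the helper names are illustrative.

```python
from collections import Counter


def interval_sizes(tokens, word):
    """Distances (in tokens) between successive occurrences of `word`."""
    positions = [i for i, tok in enumerate(tokens) if tok == word]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def interval_frequencies(tokens, word):
    """Map each interval size I to its frequency F for one word."""
    return Counter(interval_sizes(tokens, word))


tokens = "the whale dived and the whale rose then the ship saw the whale".split()
print(interval_sizes(tokens, "whale"))
print(interval_frequencies(tokens, "the"))
```

On a real corpus, plotting the resulting counts against interval size on log-log axes is one way to see whether small intervals dominate, as the clustering of content words suggests.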


Collocations


Concordance


KWIC: Key word in context
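A KWIC concordance lists every occurrence of a key word with a window of context on each side, aligned so that the key word forms a column. The function below is a minimal sketch; the name `kwic`, the whitespace tokenization, and the default window of three tokens are assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return one aligned line per occurrence of `keyword` with `width` tokens of context."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() != keyword.lower():
            continue
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        hits.append((left, tok, right))
    # Right-align the left context so the key word lines up in one column.
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  {tok}  {right}" for left, tok, right in hits]


text = "he showed the fence to Tom and Tom showed him the brush"
lines = kwic(text.split(), "showed")
for line in lines:
    print(line)
```

Reading down the aligned column makes it easy to compare the patterns a word appears in, for example which kinds of objects follow a verb.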