
Machine Learning:

Basic Introduction

Jan Odijk

January 2011

LOT Winter School 2011

1

Overview


Introduction


Rule-based Approaches


Machine Learning Approaches


Statistical Approach


Memory Based Learning


Methodology


Evaluation


Machine Learning & CLARIN

2

Introduction


As a scientific discipline


Studies algorithms that allow computers to
evolve behaviors based on empirical data


Learning: empirical data are used to
improve performance on some tasks


Core concept: Generalize from observed
data



3

Introduction


Plural Formation


Observed: list of (singular form, plural form)


Generalize: predict plural form given a singular
form for new words (not in observed list)


PoS tagging


Observed: text corpus with PoS-tag annotations


Generalize: predict the PoS-tag of each token in a new text corpus


4

Introduction


Supervised Learning


Map input into desired output, e.g. classes


Requires a training set


Unsupervised Learning


Model a set of inputs (e.g. into clusters)


No training set required

5

Introduction


Many approaches


Decision Tree Learning


Artificial Neural Networks


Genetic programming


Support Vector Machines


Statistical Approaches


Memory Based Learning

6

Introduction


Focus here


Supervised learning


Statistical Approaches


Memory-based learning


7

Rule-Based Approaches


Rule-based systems for language


Lexicon


Lists all idiosyncratic properties of lexical items


Unpredictable properties, e.g. man is a noun


Exceptions to rules, e.g. past tense(go) = went


Hand-crafted


In a fully formalized manner


8

Rule-Based Approaches


Rule-based systems for language (cont.)


Rules


Specify regular properties of the language


E.g. direct object directly follows verb (in English)


Hand-crafted


In a fully formalized manner


9

Rule-Based Approaches


Problems for rule-based systems


Lexicon


Very difficult to specify and create


Always incomplete


Existing dictionaries


Were developed for use by humans


Do not specify enough properties


Do not specify the properties in a formalized manner


10

Rule-Based Approaches


Problems for rule-based systems (cont.)


Rules


Extremely difficult to describe a language (or even a
significant subset of language) by rules


Rule systems become very large and difficult to
maintain


(No robustness (‘fail softly’) for unexpected input)

11

Machine Learning


Machine Learning


A machine learns


Lexicon


Regularities of language


From a large corpus of observed data


12

Statistical Approach


Statistical approach


Goal: get output O given some input I


Given a word in English, get its translation in
Spanish


Given acoustic signal with speech, get the
written transcription of the spoken word


Given preceding tags and following ambitag,
get tag of the current word


Work with probabilities P(O|I)

13

Statistical Approach


P(A): the probability of A


A: an event (usually modeled by a set)


Event space Ω: all possible elementary events


0 ≤ P(A) ≤ 1


For a finite event space and a uniform distribution:
P(A) = |A| / |Ω|



14

Statistical Approach


Simple Example


A fair coin is tossed 3 times


What is the probability of (exactly) two heads?


2 possibilities for each toss: Heads or Tails


Solution:


Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}


A = {HHT, HTH, THH}


P(A) = |A| / |Ω| = 3/8




15

Statistical Approach


Conditional Probability


P(A|B)


Probability of event A given that event B has
occurred


P(A|B) = P(A ∩ B) / P(B)   (for P(B) > 0)




[Venn diagram: events A and B overlapping in A ∩ B]




16

Statistical Approach


A fair coin is tossed 3 times


What is the probability of (exactly) two heads (A), given that the first toss is H (B)?


Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}


A = {HHT, HTH, THH}


B = {HHH,HHT,HTH,HTT}


A ∩ B = {HHT, HTH}


P(A|B) = P(A∩B) / P(B) = (2/8) / (4/8) = 2/4 = 1/2




17

Statistical Approach


Given


P(A|B)=P(A∩B) / P(B)


(multiply by P(B))


P(A∩B) = P(A|B) P(B)


P(B∩A) = P(B|A) P(A)



P(A∩B) = P(B∩A)



P(A∩B) = P(B|A) P(A)


Bayes Theorem:


P(A|B) = P(A∩B)/P(B) = P(B|A)P(A) / P(B)




18

Statistical Approach


Bayes Theorem Check




Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}


A = {HHT, HTH, THH}


B = {HHH,HHT,HTH,HTT}


A ∩ B = {HHT, HTH}


P(B|A) = P(B∩A) / P(A) = (2/8) / (3/8) = 2/3


P(A|B) = P(B|A)P(A) / P(B) = (2/3 * 3/8) / (4/8) = (1/4) / (1/2) = 1/2
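
These figures are easy to verify by brute-force enumeration; a minimal Python sketch (not part of the original slides):

    from itertools import product
    from fractions import Fraction

    omega = list(product("HT", repeat=3))          # all 8 outcomes of three tosses
    A = [o for o in omega if o.count("H") == 2]    # exactly two heads
    B = [o for o in omega if o[0] == "H"]          # first toss is heads
    AB = [o for o in A if o in B]

    P = lambda E: Fraction(len(E), len(omega))     # uniform distribution: P(E) = |E| / |Omega|
    print(P(A))                            # 3/8
    print(P(AB) / P(B))                    # P(A|B) = 1/2
    print((P(AB) / P(A)) * P(A) / P(B))    # Bayes: P(B|A) * P(A) / P(B) = 1/2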





19

Statistical Approach


Statistical approach


Using Bayesian inference (noisy channel model)


get P(O|I) for all possible O, given I


take that O, given input I, for which P(O|I) is highest: Ô


Ô = argmax_O P(O|I)


20

Statistical Approach


Statistical approach


How to obtain P(O|I)?


Bayes Theorem



P(O|I) = P(I|O) * P(O) / P(I)

21

Statistical Approach


Did we gain anything?


Yes!


P(O) and P(I|O) often easier to estimate than
P(O|I)


P(I) can be ignored: it is independent of O.


(though the resulting scores are then no longer true probabilities)


In particular:


argmax_O P(O|I) = argmax_O P(I|O) * P(O)
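
As an illustration of this argmax, here is a tiny Python sketch with made-up candidate outputs and probabilities (not from the slides):

    def best_output(candidates, likelihood, prior):
        """Return the O maximizing P(I|O) * P(O); P(I) is constant and can be dropped."""
        return max(candidates, key=lambda o: likelihood[o] * prior[o])

    # hypothetical example: two candidate tags for one observed word
    prior      = {"noun": 0.6, "verb": 0.4}      # P(O), estimated from a corpus
    likelihood = {"noun": 0.01, "verb": 0.05}    # P(I|O), estimated from an annotated corpus
    print(best_output(["noun", "verb"], likelihood, prior))   # -> 'verb' (0.02 > 0.006)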


22

Statistical Approach


P(O) (also called the Prior probability)


Used for the language model in MT and ASR


cannot be computed: must be estimated


P(w) estimated using the relative frequency of
w in a (representative) corpus


count how often w occurs in the corpus


Divide by total number of word tokens in corpus


= relative frequency; set this as P(w)


(ignoring smoothing)
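
A minimal Python sketch of this relative-frequency estimate, using a toy corpus (no smoothing; not from the slides):

    from collections import Counter

    corpus = "the man saw the boy with the telescope".split()
    counts = Counter(corpus)
    total = len(corpus)                 # total number of word tokens

    def p(w):
        """Estimate P(w) as the relative frequency of w in the corpus."""
        return counts[w] / total

    print(p("the"))    # 3/8 = 0.375
    print(p("boy"))    # 1/8 = 0.125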


23

Statistical Approach


P(I|O) (also called the likelihood)


Cannot easily be computed


But estimated on the basis of a corpus


Speech recognition:


Transcribed speech corpus → Acoustic Model


Machine translation


Aligned parallel corpus → Translation Model


24

Statistical Approach


How to deal with sentences instead of
words?


Sentence S = w1 .. wn


P(S) = P(w1) * .. * P(wn)?


NO: This misses the connections between the
words


P(S) = (chain rule)


P(w1) P(w2|w1) P(w3|w1w2) .. P(wn|w1..wn-1)


25

Statistical Approach


N-grams needed (not really feasible)


Probabilities of n-grams are estimated by the relative frequency of n-grams in a corpus


Frequencies get too low for n-grams with n >= 3 to be useful


In practice: use bigrams, trigrams (4-grams)


E.g. Bigram model:


P(S) ≈ P(w1) * P(w2|w1) * P(w3|w2) * .. * P(wn|wn-1)
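
A toy bigram model in Python, assuming made-up sentence-boundary markers <s> and </s> and no smoothing (a sketch, not from the slides):

    from collections import Counter

    corpus = "<s> the man saw the boy </s> <s> the boy saw the man </s>".split()
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p_bigram(prev, w):
        """Estimate P(w | prev) by the relative frequency of the bigram."""
        return bigrams[(prev, w)] / unigrams[prev]

    def p_sentence(words):
        """Approximate P(S) as a product of bigram probabilities."""
        p = 1.0
        for prev, w in zip(words, words[1:]):
            p *= p_bigram(prev, w)
        return p

    print(p_sentence("<s> the man saw the boy </s>".split()))   # 0.015625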


26

Memory Based Learning


Classification


Determine
input features


Determine
output classes


Store observed examples


Use similarity metrics to classify unseen
cases


27

Memory Based Learning


Example: PP-attachment


Given an input sequence V .. N .. PP


PP attaches to V?, or


PP attaches to N?


Examples


John ate crisps with Mary


John ate pizza with fresh anchovies


John had pizza with his best friends

28

Memory Based Learning


Input features (feature vector):


Verb


Head noun of complement NP


Preposition


Head noun of complement NP in PP


Output classes (indicated by class labels)


Verb (i.e. attaches to the verb)


Noun (i.e. attaches to the noun)

29

Memory Based Learning


Training Corpus:




Id   Verb   Noun1    Prep   Noun2       Class
1    ate    crisps   with   Mary        Verb
2    ate    pizza    with   anchovies   Noun
3    had    pizza    with   friends     Verb
4    has    pizza    with   John        Verb
5    ...
30

Memory Based Learning


MBL: Store training corpus (feature vectors + associated class) in memory


For new cases:


Stored in memory?


Yes: assign associated class


No: use similarity metrics




31

Similarity Metrics


(actually: distance metrics)


Input:
eats pizza with Liam


Compare input feature vector X with each vector Y in memory: Δ(X,Y)


Comparing vectors: sum the differences for the n individual features:
Δ(X,Y) = Σ_{i=1..n} δ(x_i, y_i)


32

Similarity Metrics


δ(f1, f2):


If f1, f2 are numeric:

δ(f1, f2) = |f1 - f2| / (max - min)

12 - 2 = 10 in a range of 0 .. 100: 10/100 = 0.1

12 - 2 = 10 in a range of 0 .. 20: 10/20 = 0.5


If f1, f2 are not numeric:

δ(f1, f2) = 0 if f1 = f2 (no difference: distance = 0)

δ(f1, f2) = 1 if f1 ≠ f2 (difference: distance = 1)
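
This per-feature distance translates directly into Python (a sketch, not from the slides; lo and hi stand for the feature's minimum and maximum):

    def delta(f1, f2, lo=None, hi=None):
        """Per-feature distance: scaled difference for numbers, overlap (0/1) otherwise."""
        if isinstance(f1, (int, float)) and isinstance(f2, (int, float)):
            return abs(f1 - f2) / (hi - lo)
        return 0 if f1 == f2 else 1

    print(delta(12, 2, lo=0, hi=100))   # 0.1
    print(delta(12, 2, lo=0, hi=20))    # 0.5
    print(delta("with", "with"))        # 0
    print(delta("pizza", "crisps"))     # 1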


33

Similarity Metrics



Id        Verb    Noun1      Prep     Noun2          Class   Δ(X,Y)
New(X)    eats    pizza      with     Liam           ??
Mem 1     ate:1   crisps:1   with:0   Mary:1         Verb    3
Mem 2     ate:1   pizza:0    with:0   anchovies:1    Noun    2
Mem 3     had:1   pizza:0    with:0   friends:1      Verb    2
Mem 4     has:1   pizza:0    with:0   John:1         Verb    2
Mem 5     ...
34

Similarity Metrics


Look at the “k nearest neighbours” (k-NN)


(k = 1): look at the ‘nearest’ set of vectors


The set of feature vectors with ids {2,3,4} has the smallest distance (viz. 2)


Take the most frequent class occurring in this set: Verb


Assign this as class to the new example
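
The whole procedure fits in a few lines of Python; a sketch (not from the slides) that stores the training corpus above and classifies the new case exactly as described:

    from collections import Counter

    memory = [                                          # the stored training corpus
        (("ate", "crisps", "with", "Mary"),      "Verb"),
        (("ate", "pizza",  "with", "anchovies"), "Noun"),
        (("had", "pizza",  "with", "friends"),   "Verb"),
        (("has", "pizza",  "with", "John"),      "Verb"),
    ]

    def distance(x, y):
        """Delta(X,Y): number of differing features (overlap metric)."""
        return sum(0 if xi == yi else 1 for xi, yi in zip(x, y))

    def classify(x):
        dists = [(distance(x, y), cls) for y, cls in memory]
        nearest = min(d for d, _ in dists)              # k = 1: the smallest distance
        votes = Counter(cls for d, cls in dists if d == nearest)
        return votes.most_common(1)[0][0]

    print(classify(("eats", "pizza", "with", "Liam")))  # -> 'Verb' (ids 2,3,4 tie at 2; majority is Verb)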

35

Similarity Metrics


With Δ(X,Y) = Σ_{i=1..n} δ(x_i, y_i)
every feature is ‘equally important’


Perhaps some features are more ‘important’


Adaptation:


Δ(X,Y) = Σ_{i=1..n} w_i * δ(x_i, y_i)


where w_i is the weight of feature i
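
The weighted variant as a Python sketch; the weights here are arbitrary illustrative values, not values from the slides:

    def weighted_distance(x, y, w):
        """Delta(X,Y) = sum of w_i * delta(x_i, y_i), with the overlap delta for symbolic features."""
        return sum(wi * (0 if xi == yi else 1) for xi, yi, wi in zip(x, y, w))

    weights = [0.8, 0.5, 0.1, 0.2]        # e.g. the verb matters more than the preposition
    print(weighted_distance(("eats", "pizza", "with", "Liam"),
                            ("ate",  "pizza", "with", "Mary"), weights))   # 0.8 + 0 + 0 + 0.2 = 1.0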



36

Similarity Metrics


How to obtain the weight of a feature?


Can be based on knowledge


Can be computed from the training corpus


In various ways:


Information Gain


Gain Ratio


χ²


37

Methodology


Split corpus into


Training corpus


Test Corpus


Essential to keep test corpus separate


(Ideally) Keep Test Corpus unseen


Sometimes


Development set


To do tests while developing

38

Methodology


Split


Training 50%


Test 50%


Pro


Large test set


Con


Small training set

39

Methodology


Split


Training 90%


Test 10%


Pro


Large training set


Con


Small test set

40

Methodology


10-fold cross-validation


Split corpus in 10 equal subsets


Train on 9; Test on 1 (in all 10 combinations)


Pro:


Large training sets


Still independent test sets


Con:

training set still not maximal

requires a lot of computation
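
A bare-bones Python sketch of k-fold cross-validation; the "classifier" here is only a stand-in that predicts the majority class of its training data (not from the slides):

    from collections import Counter

    def fit(train):
        return Counter(cls for _, cls in train).most_common(1)[0][0]     # majority class

    def evaluate(model, test):
        return sum(1 for _, cls in test if cls == model) / len(test)     # accuracy

    def cross_validate(examples, k=10):
        folds = [examples[i::k] for i in range(k)]                       # k roughly equal subsets
        scores = []
        for i in range(k):
            test = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            scores.append(evaluate(fit(train), test))
        return sum(scores) / k                                           # average over the k runs

    data = [((w,), "Verb" if n % 3 else "Noun") for n, w in enumerate("abcdefghijklmnopqrst")]
    print(cross_validate(data, k=10))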

41

Methodology


Leave One Out


Use all examples in training set except 1


Test on 1 example (in all combinations)


Pro:


Maximal training sets


Still independent test sets


Con: requires a lot of computation

42

Evaluation


                          True class
                          Positive (P)           Negative (N)

Predicted    Correct      True Positive (TP)     False Positive (FP)
class        Incorrect    False Negative (FN)    True Negative (TN)
43

Evaluation


TP= examples that have class C and are
predicted to have class C


FP = examples that have class ~C but are
predicted to have class C


FN= examples that have class C but are
predicted to have class ~C


TN= examples that have class ~C and are
predicted to have class ~C

44

Evaluation


Precision = TP / (TP+FP)


Recall = True Positive Rate = TP / P


False Positive Rate = FP / N


F-Score = (2*Prec*Rec) / (Prec+Rec)


Accuracy = (TP+TN)/(TP+TN+FP+FN)
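
The same definitions as straightforward Python functions, applied to made-up counts (a sketch, not from the slides):

    def precision(tp, fp):        return tp / (tp + fp)
    def recall(tp, fn):           return tp / (tp + fn)              # P = TP + FN
    def f_score(prec, rec):       return 2 * prec * rec / (prec + rec)
    def accuracy(tp, tn, fp, fn): return (tp + tn) / (tp + tn + fp + fn)

    tp, fp, fn, tn = 40, 10, 20, 30                                   # illustrative counts only
    p, r = precision(tp, fp), recall(tp, fn)
    print(p, r, f_score(p, r), accuracy(tp, tn, fp, fn))              # 0.8 0.666... 0.727... 0.7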

45

Example Applications


Morphology for Dutch


Segmentation into stems and affixes


Abnormaliteiten -> abnormaal + iteit + en


Map to morphological features (e.g. inflectional)


liepen -> lopen + past plural


Instance for each character


Features: Focus char; 5 preceding and 5
following letters + class
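
One way to build such character instances in Python; char_instances, the padding symbol, and the per-character class labels below are hypothetical illustrations, not the tool's actual format:

    def char_instances(word, class_per_char, window=5, pad="_"):
        """One instance per character: the focus char, the 5 preceding and 5 following chars, plus its class."""
        padded = pad * window + word + pad * window
        instances = []
        for i, cls in enumerate(class_per_char):
            j = i + window
            features = tuple(padded[j - window:j]) + (padded[j],) + tuple(padded[j + 1:j + window + 1])
            instances.append((features, cls))
        return instances

    # e.g. for "liepen", with made-up per-character labels
    print(char_instances("liepen", ["stem", "stem", "stem", "stem", "infl", "infl"])[0])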

46

Example Applications


Morphology for Dutch Results



              Prec   Rec    F-Score
Full:         81.1   80.7   80.9
Typed Seg:    90.3   89.9   90.1
Untyped Seg:  90.4   90.0   90.2


Seg = correctly segmented


Typed = assigned correct type


Full = typed segmentation + correct spelling changes

47

Example Applications


Part-of-Speech Tagging


Assignment of tags to words in context


[word] -> [(word, tag)]


[book that flight] ->
[(book, verb) (that, Det) (flight, noun)]


Book in isolation is ambiguous between noun and verb: marked by an ambitag: noun/verb

48

Example Applications


Part-of-Speech Tagging Features


Context:


preceding tag + following ambitag


Word:


Actual word form for 1000 most frequent words


some features of the word


ambitag of the word


+/- capitalized


+/- with digits


+/- hyphen

49

Example Applications


Part-of-Speech Tagging Results


WSJ: 96.4% accuracy


LOB Corpus: 97.0% accuracy

50

Example Applications


Phrase Chunking


Marking of major phrase boundaries


The man gave the boy the money ->


[NP the man] gave [NP the boy] [NP the money]


Usually encoded with tags per word:


I-X = inside X; O = outside; B-X = beginning of a new X


the     I-NP
man     I-NP
gave    O
the     I-NP
boy     I-NP
the     B-NP
money   I-NP
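
A small Python sketch that turns such per-word tags back into bracketed chunks (IOB1-style, NP chunks only; not from the slides):

    def iob_to_chunks(tagged):
        out, chunk = [], []
        for word, tag in tagged:
            if tag == "O":
                if chunk:
                    out.append("[NP " + " ".join(chunk) + "]")
                    chunk = []
                out.append(word)
            else:                                        # I-NP continues, B-NP starts a new chunk
                if tag.startswith("B-") and chunk:
                    out.append("[NP " + " ".join(chunk) + "]")
                    chunk = []
                chunk.append(word)
        if chunk:
            out.append("[NP " + " ".join(chunk) + "]")
        return " ".join(out)

    tags = [("the", "I-NP"), ("man", "I-NP"), ("gave", "O"), ("the", "I-NP"),
            ("boy", "I-NP"), ("the", "B-NP"), ("money", "I-NP")]
    print(iob_to_chunks(tags))   # [NP the man] gave [NP the boy] [NP the money]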


51

Example Applications


Phrase Chunking Features


Word form


PoS-tags of


2 preceding words


The focus word


1 word to the right


52

Example Applications


Phrase Chunking Results




        Prec   Rec    F-score
NP      92.5   92.2   92.3
VP      91.9   91.7   91.8
ADJP    68.4   65.0   66.7
ADVP    78.0   77.9   77.9
PP      91.9   92.2   92.0

53

Example Applications


Coreference Marking


COREA project


Demo


Een 21-jarige dronkenlap[3] besloot maandagnacht zijn[5005=3] roes uit te slapen op de snelweg A1[9] bij Naarden. De politie[12=9] trof de man[14=5005] slapend aan achter het stuur van zijn[5017=14] auto[18], terwijl de motor nog draaide.

(‘A 21-year-old drunk decided on Monday night to sleep off his intoxication on the A1 motorway near Naarden. The police found the man asleep behind the wheel of his car, while the engine was still running.’)

54

Machine Learning & CLARIN


Web services in workflow systems are created for several MBL-based tools


Orthographic Normalization


Morphological analysis


Lemmatization


PoS-tagging


Chunking


Coreference assignment


Semantic annotation (semantic roles, locative and temporal
adverbs)

55

Machine learning & CLARIN


Web services in workflow systems are created for statistically based tools such as:


Speech recognition


Audio mining


All based on SPRAAK


Tomorrow more on this!


56