
Combined term paper for GSLT courses “Natural Language Processing” and “Statistical Methods”, spring 2005


1.1 Abstract

In this combined term paper I investigate the evidence for verb tense agreement using two annotated corpora, one Danish (PAROLE) and one Swedish (Stockholm/Umeå Corpus). I propose using a Bayesian model for integrating the verb tense agreement priors, and investigate the possible independence assumptions thoroughly.

I conclude that it is likely possible to quantify and incorporate verb tense agreement with a standard n-gram model in a Bayesian framework, but that there are some exceptions to the independence assumptions that merit further investigation.


1.2 Structure

As term paper for the course “Natural Language Processing”, I submit the sections 1.3 to 1.12, with the remaining sections as background material.

As term paper for the course “Statistical Methods”, I submit the sections 1.3 to 1.5 and 1.13 to 1.19, again with the remaining sections as background material.

Hence, the sections 1.3 to 1.5 are shared material, necessary for evaluating both sections of work.



1 Combined term paper for GSLT courses “Natural Language Processing” and “Statistical Methods”, spring 2005
1.1 Abstract
1.2 Structure
1.3 Introduction
1.4 Nomenclature
1.4.1 Lexeme or Surface representation
1.4.2 Class or role
1.4.3 Parts of Speech (POS)
1.4.4 History
1.4.5 Markov or n-gram model
1.4.6 Tense
1.4.7 Long-range model
1.4.8 Trigger or conditioning influence
1.4.9 Target or dependent word
1.4.10 Candidates
1.5 Notation
1.6 The influence of tense
1.7 Indicators of tense
1.8 Prior work and State of the Art
1.9 Simplifying assumptions; focus of the investigation
1.10 Long-range models
1.10.1 The effect of distance
1.11 Class-based Statistical Language Models
1.11.1 lexeme / lexeme
1.11.2 class / lexeme
1.11.3 lexeme / class
1.11.4 class / class
1.11.5 Hybrid class-based models
1.11.6 Discussion
1.12 Plan for the investigation
1.13 Resources: Corpora and Tools
1.13.1 Methodology and Metrics
1.13.2 Tools
1.13.3 Results
1.13.4 Discussion
1.13.5 Conclusion
1.14 Investigating the existence of tense-agreement
1.14.1 Methodology
1.14.2 Tools
1.14.3 Results
1.14.4 Discussion
1.14.5 Conclusion
1.15 Investigating the long-range effect of verb tense
1.15.1 Methodology
1.15.2 Tools
1.15.3 Results
1.15.4 Discussion
1.15.5 Conclusion
1.16 Effect of distance from the previous verb
1.16.1 Methodology
1.16.2 Tools
1.16.3 Results
1.16.4 Discussion
1.16.5 Conclusion
1.17 Is verb tense agreement independent of the trigger lexeme?
1.17.1 Methodology
1.17.2 Tools
1.17.3 Results
1.17.4 Discussion
1.17.5 Conclusion
1.18 Is verb tense agreement independent of the target lexeme?
1.18.1 Methodology
1.18.2 Tools
1.18.3 Results
1.18.4 Discussion
1.18.5 Conclusion
1.19 Conclusion for data analysis
2 References
3 Results
3.1 Metrics collected from corpora
3.2 Initial investigation of tense-agreement
3.3 Investigation of long-range effect of verb tense
3.4 Conditional probability pr. distance
3.5 Verb tense agreement given trigger lexeme
3.6 Verb tense agreement given target lexeme
4 Appendix A: Tag/lexeme stream generators
5 Appendix B: Capitalization program
6 Appendix C: Support units
6.1 getTagAndLexemeUnit.h
6.2 getTagAndLexemeUnit.cpp
6.3 LowerUnit.h
6.4 LowerUnit.cpp
7 Appendix D: Metric/probability calculation and reporting



1.3 Introduction

The GazeTalk project aims to develop and evaluate an Augmentative and Alternative Communication (AAC) system: a system suitable for communication and text production for severely disabled users. The primary goal of the project is to make available a free eye-gaze and mouse operated system, which allows for faster input and more advanced features than are currently available in the marketplace. The system is currently available in Danish, Swedish, English and Japanese, and the source code will in due time be released under a suitable Open Source license.

A core feature of the system is the use of statistical language modeling (SLM) to increase the input rate and allow for a relatively low number of on-screen targets (currently 10 active targets/buttons, with two of the available 12 positions reserved for text display). We employ both letter prediction and whole-word prediction. Both prediction subsystems (letter and word) are implemented using a relatively simple Markov model, as we are only concerned with retrieving the ranks of words. The word-level model uses a 3-gram back-off model [Katz 87].

A common and persistent user comment has been that while the perceived quality of the prediction is good, some aspects of the model are less than satisfying. Specifically, many have commented that the model does not “remember” which tense is currently being used.

1.4 Nomenclature

In the following, I shall use various terms which are common to the SLM community, but which may be unfamiliar to the reader, or may have several interpretations depending on the domain in which they are used. Hence, I will first list and define the most central terms, to avoid any confusion.


1.4.1 Lexeme or Surface representation

The lexeme is the sequence of letters which makes up a word. This is also sometimes referred to as the surface representation.


1.4.2 Class or role

The class of a word is the semantic or syntactic entity which a given lexeme embodies in a context. In this paper, class usually refers to the part of speech (POS) a word embodies and, if applicable, its tense.


1.4.3 Parts of Speech (POS)

The part-of-speech of a word is the syntactic role or class of a word in a given context.


1.4.4 History

The history of a lexeme is, theoretically, all information which can be derived from the preceding words in the text. In most statistical language models the history is assumed to consist only of the preceding lexemes, but in this investigation I explicitly include class information of the preceding words as part of the history, as well as possibly formatting features such as paragraph and article breaks.

1.4.5 Markov or n-gram model

A Markov model assumes that the likelihood of a symbol appearing at a given position is wholly dependent on its history (see above). An n-gram Markov model (often just referred to as an n-gram model) makes the simplifying assumption that only the last (n-1) items of the history are relevant for the likelihood of a symbol.
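In symbols, the n-gram assumption can be written as the approximation P(w_i | w_1 ... w_i-1) ≈ P(w_i | w_i-n+1 ... w_i-1), i.e., the full history is replaced by its last (n-1) items.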

1.4.6 Tense

The concept of tense varies from language to language, and the granularity of recording it varies from corpus to corpus.

According to the Oxford English Dictionary, tense is:

a set of forms taken by a verb to indicate the time (and sometimes also the continuance or completeness) of the action in relation to the time of the utterance: the past tense.

In order to simplify the investigation, I will primarily consider the three most common (and most commonly recorded in corpora) verb tenses, namely present tense, past tense and infinitive.

1.4.7 Long-range model

In SLM, any language model which incorporates elements of the history of a word that are outside the reach of a traditional n-gram model is referred to as a long-range model. The value of n in question is usually 3.

1.4.8 Trigger or conditioning influence

In models that only incorporate one item or feature of the history of a word, the feature (or “conditioning influence”) is often referred to as a “trigger”. Such models are also sometimes called “trigger models” or “trigger-based models”.

1.4.9 Target or dependent word

In models that only incorporate one item or feature of the history of a word, the word that is being influenced by the trigger (the dependent word) is often called the “target”.

1.4.10 Candidates

The words considered for a given position by a language model are usually referred to as “candidates”. Often all words known to the model (the dictionary) are candidates, as not including them assigns a probability of zero to their occurrence, which induces less desirable effects when integrating the model in a larger probabilistic framework.


1.5 Notation

In the interest of efficient, clear and brief communication, I have adopted a notation for describing data and probabilities, which I shall describe in this section. It is largely compatible with other informal shorthand notations used in an SLM context, but is described in detail here to avoid creating rather than reducing confusion.

As I am mainly concerned with investigating verb tense, I shall only note the word class for verbs when listing histories and candidates, except where it is necessary for the point investigated. Hence, the following:

I saw/PAST a (horse, cow, running/INFINITIVE)

should be interpreted as a history consisting of the lexeme/tag sequence:

I/OTHER saw/PAST a/OTHER

...and a candidate list containing the following candidates as lexeme/tag pairs:

horse/OTHER, cow/OTHER, running/INFINITIVE

The tag OTHER is used for anything not identified as a verb, and is (as can be seen from the example above) usually left out when not pertinent to a discussion.

When stating probabilities, I will use the following shorthand notation:

P( <tagname> )

...where <tagname> is either a verb tense or the tag OTHER, to denote the probability P( X = <tagname> ). Accordingly, I shall use a similar notation for conditional probabilities:

P( <tagname1> | <tagname2> )

...to denote the probability P( X = any lexeme embodying class <tagname1> | Y = any lexeme embodying class <tagname2> ), Y ∈ H(X); in layman’s terms, “the probability that a word has the class <tagname1> given that the class <tagname2> precedes it in its immediate history”.

Furthermore, when there is no risk of inducing confusion, I shall not specify whether I am referring to lexemes or classes (tags). As an example, when denoting the probability of the lexeme “was” occurring given a preceding verb in the history embodying the verb tense infinitive, I shall just write:

P(was | INFINITIVE)

Finally, when using the above notation for conditional probabilities, the conditioning element is understood to be the closest preceding verb tense. When referring to long-range models, the condition should be read “given that the closest verb is at a distance of more than 2, and embodies INFINITIVE”.

Some may consider this notation sloppy, but in the interest of efficiency I have adopted it in this paper. The alternative would be more precise, but much more space consuming and error prone, and ultimately more confusing to the reader.


1.6 The influence of tense

The influence of tense, and the problems an n-gram model faces when attempting to incorporate it, is easily illustrated:

Most of our training text is in present tense. Hence, if one types

Example 1:

“Yesterday we went to the mall and “

...it’s relatively easy for a human to anticipate that if the next word is a verb (which is likely in this case), it should probably be in past tense (“had”, “saw”, “talked”, “bought”). This feature (tense agreement) is in this case not captured by a traditional n-gram model of lower n value than 5, which is the distance between “went” (the closest preceding clear indicator of tense) and “had” in the sample sentence above. Hence, an n-gram model of lower n value than 5 trained on predominantly present tense material will in this example tend to suggest a verb in present tense, contrary to human expectation.

Such features beyond the reach of an n-gram model are commonly referred to as “long-range features”. As the amount of text available for training of models rarely exceeds a few hundred million words, it is uncommon to use models of a larger n value than 3. Even a 3-gram model is relatively sparsely populated when trained with hundreds of millions of words [Rosenfeld 00], and it’s usually necessary to augment it (for example by interpolating with a 2- and 1-gram model) to make up for the sparseness. Hence, any influence measurable over a distance of more than three words can be considered a long-range influence by today’s standards.

1.7 Indicators of tense

The tense of verbs is the clearest indicator of tense. While some lexemes can occur in other roles than as a verb, and while it’s not always possible to determine from a lexeme whether it is a verb, and if so, what tense the verb has, current methods for parts-of-speech determination (“tagging”) are able to resolve these ambiguities with a high degree of precision and reliability [SaLT].

There are other possible indicators of tense which might be worth considering. In the example sentence (example 1) the first clear indication of the tense of the sentence is not the verb; it’s the adverb “Yesterday”. Other features will probably also indicate the current tense of the narrative or discourse, such as the tense used in the preceding sentence, and possibly the use of double quotes (citations/quoting), or paragraph breaks.


1.8 Prior work and State of the Art

In [Rosenfeld 00] the current state of the art is summed up in a compact but inclusive overview. There has been surprisingly little work done in the field of quantifying linguistic knowledge in statistical models. In fact, Rosenfeld poses incorporating linguistic knowledge as priors using a Bayesian framework as a future challenge.

The nearest thing to prior work is Rosenfeld’s Ph.D. thesis [Rosenfeld 94], which describes experiments with incorporating stemming in an Entropy Maximization framework for integrating long-range influences with influences of distance 1, 2 and 3.

1.9 Simplifying assumptions; focus of the investigation

Despite the possibility of other features than verb tense influencing the likelihood of the tense of the following verb, the tense of verbs is a clear, indisputable indicator of the current tense of the discourse. As this is usually recorded as a metasyntactic feature (or commonly “tag”) in annotated (“tagged”) corpora, it’s also a very suitable feature to use as a basis for investigating the impact of tense on the likelihoods of the following words.

There are other features, such as adverbs indicating tense (”yesterday”, ”tomorrow”, ”last week”), that are also intuitively strong indicators of tense. Unfortunately, this information is usually not recorded in current annotated corpora. While they are attractive features for tense modeling from a linguistic point of view, this makes them less usable for practical reasons.

As for features which occur in preceding sentences, or outside the sentence, such as the tense of the last verb in the preceding sentence, or a paragraph break, they are also less attractive as subjects when investigating tense, for much the same reasons as the previously mentioned: structural features (e.g., paragraph break, article break...) are not always recorded even in annotated corpora. Further, many corpora are not “running corpora” (corpora where the order of sentences has been preserved).

For these reasons, I have chosen to focus solely on verbs as indicators of tense, and as features for inclusion in a language model incorporating tense. This is not to say that the other features mentioned do not merit investigation, but for an initial attempt at quantifying the influence of tense and possibly incorporating it in a model, verbs are a much more practical choice as focus of the investigation.

I shall also make the assumption that it is possible to reliably (albeit not perfectly) determine from the surrounding words if a word is a verb, and if so, what its tense is. This is a necessary capability for constructing a language model incorporating knowledge that verb tense affects the probabilities of the following verb’s tense.

This is not an unreasonable assumption, as the parts-of-speech assignment task, which this is an example of, is usually performed with a very high degree of accuracy by current methods [SaLT].

In the context of this investigation I shall simulate this capability when identifying the preceding (conditioning) verb, and, if necessary, devise some method of assigning verb and tense information to candidate words when the need arises.


1.10 Long-range models

Long-range models, or indeed any kind of language models, are usually categorized as either rule-based or stochastic. Traditionally, the assumption is that rule-based models are able to capture knowledge of the structure of language which eludes the statistical models. Despite this, statistical models have for the last several decades consistently outperformed rule-based models in terms of quantifiable results when evaluated on real world data [SaLT].

I shall initially investigate the possibility of integrating linguistic knowledge (here, knowledge of the existence of tense agreement) in a statistical model as a prior. This is a relatively rare approach, as linguistic knowledge is usually modeled using primarily rule-based systems such as context-free grammars (CFGs) [Rosenfeld 00].


There is a range of statistical models suitable for integrating n-gram models and other information sources. The ones most commonly used for the SLM task are:

• Bayesian models

• Interpolation

• Entropy maximization

Bayesian models are well suited for incorporating independent information sources, but in the interest of manageability and simplicity of the model, it’s usually necessary to assume independence of the information sources. This is referred to as “naïve Bayesian modeling”.


Interpolation is easy to implement and performs well, but is unsuitable for combining heterogeneous information sources. Additionally, interpolation assumes independence between the integrated sources, which is not always actually the case (e.g., when interpolating a 1-gram and a 2-gram model).


Entropy maximization excels at incorporating heterogeneous (and interdependent) data sources in a provably optimal fashion. Unfortunately, it is complicated to implement, consumes large quantities of computing resources during training, and requires explicit, on-the-fly normalization when integrating with probabilistic models.

Hence, given that sufficient independence assumptions can be made, a naïve Bayesian model seems most suited for the task of modeling verb tense agreement.


If the effect can be shown only to be dependent on the verb tense of the trigger and the target, integrating the knowledge source with an n-gram model is indeed simple. Given integration with a 2-gram model, which can provide the probabilities P(wn) (unigram probability of the candidate word wn) and P(wn | wn-1) (bigram probability of the candidate word wn), the integrated probability is:

P(wn | h) = P(wn) * P(wn | wn-1) * P(previous verb tense | tense of wn)
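As an illustration, the following is a minimal C++ sketch of this combination, not the GazeTalk implementation: the lookup functions unigram, bigram and tensePrior are hypothetical stand-ins for tables built from corpus counts, the values returned here are purely illustrative, and the score is left unnormalized since only the ranking of candidates is needed:

#include <iostream>
#include <string>

enum Tense { OTHER, INFINITIVE, PRESENT, PAST };

struct Candidate { std::string lexeme; Tense tense; };

// Hypothetical stand-ins for the real lookup tables, which would be
// built from corpus counts such as those described in sections 1.13-1.14.
double unigram(const std::string&) { return 0.001; }                    // P(wn)
double bigram(const std::string&, const std::string&) { return 0.01; }  // P(wn | wn-1)
double tensePrior(Tense prev, Tense cand) {                             // P(prev verb tense | tense of wn)
    return (prev == cand) ? 1.7 : 0.2;  // illustrative values only
}

// Unnormalized score for ranking candidates; the tense-agreement prior is
// applied only when both the candidate and the closest preceding verb
// actually carry a tense.
double combinedScore(const Candidate& c, const std::string& prevWord, Tense prevVerbTense) {
    double p = unigram(c.lexeme) * bigram(c.lexeme, prevWord);
    if (c.tense != OTHER && prevVerbTense != OTHER)
        p *= tensePrior(prevVerbTense, c.tense);
    return p;
}

int main() {
    Candidate had{"had", PAST}, see{"see", PRESENT};
    // History: "Yesterday we went/PAST to the mall and " (example 1)
    std::cout << combinedScore(had, "and", PAST) << ' '
              << combinedScore(see, "and", PAST) << '\n';  // "had" outranks "see"
}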


1.10.1 The effect of distance

Rosenfeld [Rosenfeld 94] found that in most cases, distance between trigger pairs was an important feature. It is therefore necessary to investigate if this is also the case for any verb-class based model, and (if so), whether it should be incorporated, and how.

The subject is investigated thoroughly in section 2.6 of [Rosenfeld 94], and shall not be described further here.


1.11 Class-based Statistical Language Models

Another important question is whether the model should consider lexemes, metasyntactic features, or a combination. In other words: should the model base its decision on the lexemes in the history or their class, and should it influence the following words as lexemes, or based on their class? In the following, I will investigate the pros, cons and requirements for the various types of models.

In ordinary language, the four possible combinations of class- or lexeme-based models can be described thus, in the context of only considering verb / verb influences:

Conditioning feature | Dependent feature | Description
lexeme | lexeme | Given that the last verb is the lexeme “X”, and that a candidate lexeme “Y” occurs as a verb, we modify the probability based on how “X” as a verb conditions “Y” as a verb.
class | lexeme | Given that the last verb is of class X, and a candidate lexeme “Y” occurs as a verb, we modify the probability based on how the class X conditions “Y” as a verb.
lexeme | class | Given that the last verb is the lexeme “X”, and that a candidate lexeme is a verb of class Y, we modify the probability based on how “X” as a verb conditions the occurrence of the verb class Y.
class | class | Given that the last verb is of class X, and a candidate lexeme is a verb of class Y, we modify the probability based on how the class X conditions the occurrence of the verb class Y.



Once again, I will illustrate the problem with a practical example. For the notation used, please see section 1.5:

When analyzing the partial sentence and a set of candidates:

Example 2:

“Yesterday we went to the mall and “ (see | ordered | had | saw | talked | bought | later | suddenly ...)

we can interpret the partial sentence, or the history, and the set of candidates in four different ways:


1.11.1 lexeme / lexeme

“Yesterday we went/TRIGGER to the mall and “ (see | ordered | later | suddenly | ...)

In this case the model is much like a traditional trigger-model as described in [Rosenfeld 96]. We identify the word “went” as a verb, and for all words following we incorporate the knowledge that, having just seen “went” as a verb, we can expect it to condition other words in various ways. This model requires significant storage resources (roughly the number of verbs known by the model times the size of the dictionary), and a large training material to saturate. It cannot influence words that are not in the training material, and cannot use words not in the training material as triggers.


1.11.2 class / lexeme

“Yesterday we went/VERB_PAST to the mall and “ (see | ordered | later | suddenly | ...)

Here the probabilities of the candidates are not influenced by the word “went”, but by the fact that the word “went” in this case acts as a verb in past tense. This model requires less significant storage resources (roughly the number of verb tenses times the size of the dictionary), and a somewhat smaller training material to saturate. It is, however, only accurate if it can be determined that any past tense verb would influence the candidates in the same way as “went”.

In more formal terms, the requirement (independence assumption) for using this model in this case is that P(X | went) = P(X | VERB_PAST). If this is not the case, it’s not possible to generalize the history by only incorporating the role of “went” in the model.


1.11.3 lexeme / class

“Yesterday we went/TRIGGER to the mall and “ (see/VERB_PRESENT | ordered/VERB_PAST | later/OTHER | suddenly/OTHER | ...)

Here, as in the lexeme / lexeme model, the trigger is the word “went”, but the model only records how the word “went” influences the likelihood of verb tenses following it. The storage requirements are slightly smaller than for the class / lexeme model (the number of verbs times the number of verb tenses), but are probably of the same order of magnitude, as verbs usually comprise a large fraction of the words in any dictionary.

The formal requirement of this model is in this case that P(X | went) = P(Y | went), where Y is any word of the same verb tense as X.


1.11.4 class / class

“Yesterday we went/VERB_PAST to the mall and “ (see/VERB_PRESENT | ordered/VERB_PAST | later/OTHER | suddenly/OTHER | ...)

In the final and most general model, the trigger is the fact that “went” acts as a past tense verb, and it influences the candidates solely based on which verb tense they embody. The storage requirements for this model are very modest (the number of verb tenses squared), and the training material required is proportionally smaller as well.

The formal requirement for using this model is in this case that P(X | went) = P(Y | VERB_PAST), where Y is any word of the same verb tense as X.


1.11.5 Hybrid class-based models

It’s entirely possible that the independence assumptions in some of the models hold for some, but not all, verbs. If those verbs occur with a high frequency, not incorporating this information will decrease the performance of the language model. A solution to this problem could be a hybrid model, which explicitly models the commonly occurring verbs that diverge from the general case.


1.11.6 Discussion

The most interesting model, both from a linguistic and a practical point of view, is a class/class model. From the point of view of the linguist, it encodes general knowledge about the properties of language. From the point of view of the engineer, it has the property of scaling beyond the training material, in effect inferring knowledge about the properties of words which were either sparsely represented in, or absent from, the training material. It also consumes far less computational resources, as all that would need to be stored and incorporated in the model would be knowledge about the properties and interaction of classes.

Any model which incorporates class information faces two challenges which the lexeme/lexeme model does not:

1) The model must be able to find the closest preceding verb in the history of the word currently being investigated, and, if using the tense of the verb as conditioning influence, determine its tense.

2) The model must be able to determine with some certainty whether a candidate word for the position currently being investigated is a verb, and (if so), which tense it embodies.

As mentioned in section 1.9, I am assuming in this investigation that both are solvable problems. I will simulate a perfect solution to problem 1), and will, if necessary, implement a solution to problem 2).

Should the investigation of possible independence assumptions indicate the need for a hybrid model, I will investigate this further.

1.12 Plan for the investigation

Given the preceding deliberations, and the goal of, as a minimum, proving or disproving the feasibility and positive effect of incorporating tense in a language model, the following questions (some of which depend on the answers to others) must be posed and answered:

• Is tense-correspondence of verbs an actual phenomenon? If so:
  o Is it a long-range influence? If so:
    ▪ Which independence assumptions can be made?
      - Is it dependent on distance from the feature?
      - Is it dependent on the conditioning word’s class or lexeme?
      - Is it dependent on the conditioned word’s class or lexeme?
      - Are there sufficiently many high-frequency verbs which diverge strongly from the independence assumption, so as to warrant the use of a hybrid model?
• Can it be incorporated in a statistical language model? If so:
  o Which statistical method should be used?
  o Will a model incorporating this feature outperform a traditional model? If so:
    ▪ By how much, and by which measures?
  o Can the tense of verbs be easily determined? If so:
    ▪ Is it feasible to build a general purpose model which is able to perform well using untagged text as test or training material?

In the following, I will attempt to investigate the issue further, and answer these questions. Due to size limitations, I shall only investigate the points above the dividing line in the context of term papers for the GSLT classes.


1.13 Resources: Corpora and Tools

For this investigation, I had access to two tagged corpora: the Danish PAROLE corpus (DKPAROLE) [DKPAROLE] and the Swedish Stockholm-Umeå Corpus (SUC) [SUC].

SUC is a running corpus consisting of roughly 1 million Swedish words. DKPAROLE is a running corpus approximately ¼ the size of SUC, i.e., about 250.000 Danish words. The tagsets used for the corpora are largely compatible in terms of the word classes (parts-of-speech) they identify. Notable exceptions are that SUC does not identify the tense of verbs directly as part of the POS tag, and that DKPAROLE identifies two additional verb forms (present participle and past participle) which are not identified as individual classes in SUC.

Additionally, SUC identifies some verb classes which are not considered standard [SUC].


In this section I describe which metrics I decided to apply for evaluating the corpora as tools for the investigation, the tools developed for this purpose, and the results of the evaluations.

1.13.1 Methodology and Metrics

As mentioned in section 1.4.6, I initially decided to focus primarily on the three common classes of verb tense, namely present tense, past tense and infinitive. Additionally, I will include the past and present participles from DKPAROLE in order to investigate if any effect discovered reaches beyond the three basic tenses.

I decided to extract the following metrics from the corpora:

Size (in number of words)
This metric is indicative of the reliability of any other metric or conclusion derived from the corpus. It is also used for deriving several other measures, such as the unconditional probability of lexemes or classes.

Dictionary size
This metric, when combined with the corpus size, indicates if the corpus is “broad” or “shallow”, i.e., covers many topic areas or sources cursorily, or few topic areas or sources at length.

Proportion of corpus per class, when counting words occurring at least 1, 10 and 100 times
This metric is useful for estimating the impact of a model that improves precision for words on a class-wide scale, and (when only considering words occurring several times) for estimating the “openness” of a class, and the degree to which the corpus covers a given class.

Verb distances
This metric will indicate how many verbs are potentially affected by a long-range model (see section 1.10 for background information).

1.13.2 Tools

For this task I implemented several programs.

Tag/lexeme stream generators
For both corpora I implemented a simple Perl program which converts the internal format of the corpora to a text file in the format:

<tag> <lexeme>

...and inserts synthetic start-of-sentence (<s>) and end-of-sentence (</s>) markers.

The programs are included as appendix A.

Note that the program also, in the case of the SUC corpus, identifies the various verb tenses and assigns them new tags, as the tagset used by SUC does not explicitly encode verb tense in the tag proper.


Proper name identification
A problem with the raw tag/lexeme stream is that the first word in a sentence is usually written with a capital letter. One traditional solution is to disregard capital letters, in effect converting all lexemes to lower case. Unfortunately, this removes an important source of information for identifying proper names, and should thus be avoided if at all possible. The information on proper names is potentially an important resource for later stages of processing, such as verb identification and classification, and additionally the absence of proper name identification would artificially inflate the dictionary size measure for the corpora (e.g., “He” and “he” counted as separate words).

For this purpose, I implemented a simple but robust solution: if a lexeme is encountered in upper case, but has previously been seen in lower case, it is converted to lower case.

This solution does not address the problem that some proper names have the same spelling as some words in other word classes (e.g., “Smith” and “smith”), but it will reliably resolve the vast majority of capitalization-related problems.

The program is implemented in C++, and is included as appendix B.

The program uses two modules implementing common operations (upper/lower case of strings, input from a tag/lexeme stream). These are included as appendix C.
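In essence, the heuristic can be sketched as follows; this is a minimal stand-alone C++ sketch (not the appendix B code), which handles only ASCII letters, whereas the real corpora also contain Danish and Swedish letters such as å, ä, ö and æ:

#include <cctype>
#include <iostream>
#include <set>
#include <string>

// Lower-case a lexeme (ASCII only in this sketch).
static std::string toLower(std::string s) {
    for (char& c : s) c = std::tolower(static_cast<unsigned char>(c));
    return s;
}

int main() {
    std::set<std::string> seenLower;  // lexemes observed in lower case so far
    std::string tag, lexeme;
    while (std::cin >> tag >> lexeme) {  // <tag> <lexeme> stream on stdin
        if (std::isupper(static_cast<unsigned char>(lexeme[0]))) {
            // Seen in lower case before: treat as a capitalized ordinary
            // word rather than a proper name.
            if (seenLower.count(toLower(lexeme)))
                lexeme = toLower(lexeme);
        } else {
            seenLower.insert(lexeme);
        }
        std::cout << tag << ' ' << lexeme << '\n';
    }
}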


Metric calculation and reporting
The actual metric calculation and reporting is performed by a C++ program (included as appendix D), which implements all metric and probability calculation and reporting functionality in this investigation.


1.13.3 Results

The observed metrics are collected in the “Results” section, in the subsection “Metrics collected from corpora”.


1.13.4 Discussion

The observed values for size and dictionary size are roughly as expected given the reported sizes of the corpora.

The data on class size as a proportion of corpus size is best presented graphically:

[Graphs omitted: proportion of corpus per class for SUC and DKPAROLE; the underlying counts are given in section 3.1.]



Investigating further, I tabulate the decrease in the number of verbs in percent for the two corpora:

         | Min. 1 occurrence | Min. 10 occurrences | Min. 100 occurrences
SUC      | 100%              | 88,93%              | 64,48%
DKPAROLE | 100%              | 77,84%              | 52,06%


From this we can infer two important facts:

1) While some word classes are affected more than others, the verb classes are among the less affected.

2) The verb classes in DKPAROLE are affected more than the verb classes in SUC.

The first fact is expected, as it’s well known that verbs are semi-open classes, meaning that they are less likely than open classes (such as nouns) to be extended with new members over time, albeit more so than closed classes (such as prepositions). The second fact indicates that SUC is a better choice when analyzing verbs, as there is a bigger chance that a given verb occurs a statistically significant number of times in SUC than in DKPAROLE.


As for verb distance, the information necessary to determine the potential effect of a long-range model incorporating verb tense can be tabulated so:

         | Verbs at distance > 2 from previous verb | Proportion of corpus
SUC      | 68533                                    | 0,059
DKPAROLE | 20361                                    | 0,067



1.13.5 Conclusion

The metrics on corpus sizes are in agreement with the data supplied with the corpora. This indicates that the basic tools work as expected.

The metrics on group sizes and the effect of discarding less commonly occurring words indicate that the DKPAROLE corpus is marginal for this use due to its small size, while SUC has an acceptable size. As a consequence of this, I shall mainly use DKPAROLE to verify the results obtained from SUC, but not use results obtained from DKPAROLE as basis for any conclusions in the absence of supporting data from SUC.

The measurable impact of a long-range model incorporating verb tense will likely be small when measured for an entire corpus, as it will at best affect approximately 6% of the corpus. As a consequence, I may need to adopt other measures than the standard measure of average perplexity reduction to investigate the effect of any long-range model.

1.14 Investigating the existence of tense-agreement

This is a relatively simple task, as all that’s needed is to investigate whether the conditional probability for verb classes diverges strongly from the unconditional probability, in effect using the very definition of conditional probability:

The probability of X (P(X)) is conditional on Y iff P(X) != P(X | Y)

I am interested in two aspects of this:

For each verb tense X:

1) Is P(X) dependent on the preceding verb tense Y?

2) Is P(X | Y) different from P(X | Z), where Y and Z are verb tenses and Y is different from Z?

1.14.1 Methodology

For each verb class in the two corpora, I can determine the conditional and unconditional probabilities by performing a simple counting task.

The unconditional probability of verb tense X is the number of times the class X occurs, divided by the number of words in the corpus.

The conditional probability of verb tense X given verb tense Y is the number of times X follows Y, divided by the number of words following Y.

I calculate these probabilities for all the verb tenses I include for the corpora (see section 1.13.1 for details).
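As an illustration, a minimal C++ sketch of this counting task (not the appendix D program) could look as follows, assuming a SUC-style tag/lexeme stream on standard input, where the included verb tense tags are VBINF, VBPRS and VBPRT, and where “follows Y” means that Y is the tense of the closest preceding verb:

#include <iostream>
#include <map>
#include <string>

static bool isVerbTense(const std::string& tag) {
    return tag == "VBINF" || tag == "VBPRS" || tag == "VBPRT";
}

int main() {
    long words = 0;
    std::map<std::string, long> tenseCount;  // occurrences of each verb tense
    std::map<std::string, long> followers;   // words whose closest preceding verb has tense Y
    std::map<std::string, std::map<std::string, long>> follows;  // Y -> X -> times X follows Y

    std::string tag, lexeme, prev;           // prev = tense of the closest preceding verb
    while (std::cin >> tag >> lexeme) {
        ++words;
        if (!prev.empty()) {
            ++followers[prev];
            if (isVerbTense(tag)) ++follows[prev][tag];
        }
        if (isVerbTense(tag)) { ++tenseCount[tag]; prev = tag; }
    }

    // Unconditional: count(X) / corpus size.
    for (const auto& [x, n] : tenseCount)
        std::cout << "P(" << x << ") = " << double(n) / words << '\n';
    // Conditional: count(X following Y) / words following Y.
    for (const auto& [y, inner] : follows)
        for (const auto& [x, n] : inner)
            std::cout << "P(" << x << " | " << y << ") = " << double(n) / followers[y] << '\n';
}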

1.14.2 Tools

For this task I extended the metric collection and reporting tool (provided as Appendix D) with the following capabilities. The program now reports three additional metrics from a tag/lexeme stream:

1) The actual counts obtained for verb tense given verb tense occurrences

2) The calculated probabilities (unconditional and conditional on preceding verb tense) for all verb tenses

3) The probabilities reported in 2), normalized by the unconditional probability, the better to highlight any conditional effect


1.14.3 Results

The observed metrics are collected in the “Results” section in the subsection “Initial investigation of tense-agreement”.

1.14.4 Discussion

As mentioned in section 1.13, the SUC corpus is about four times as big as the Danish PAROLE. Further, I include 5 different tags for verb tense for DKPAROLE, but only 3 for SUC (see section 1.13.1 for details). As a consequence, the results from DKPAROLE should in general be used only to validate the results from SUC, and, in some cases, entirely disregarded. Specifically, the conditional counts given V_PARTC_PRES are extremely low, and consequently the related probabilities should be disregarded.

Consolidating the results in one table, while disregarding V_PARTC_PRES and V_PARTC_PAST from DKPAROLE, as they have no counterparts in SUC, we see the following:



                        | Given infinitive | Given present tense | Given past tense
Infinitive, SUC         | 1,10658          | 1,47262             | 1,23778
Infinitive, DKPAROLE    | 0,612553         | 0,982222            | 1,65796
Present tense, SUC      | 0,698858         | 0,935669            | 0,120967
Present tense, DKPAROLE | 0,791969         | 0,894888            | 0,202564
Past tense, SUC         | 0,553072         | 0,155229            | 1,69168
Past tense, DKPAROLE    | 0,648328         | 0,196518            | 1,77442


We see that in 7 of 9 cases, there is a rough agreement between the results from the two corpora. Considering the smaller size of DKPAROLE, and the fact that the corpora contain different (but closely related) languages, this is a remarkable correspondence.


1.14.5 Conclusion

Based on the results, I conclude that there is indeed reason to acknowledge verb tense as a strong prior for verb tense. In particular, past tense conditions strongly negatively for present tense and vice versa, which was the user observation that originally motivated the investigation. It is unclear whether the differences between the results from SUC and DKPAROLE are due to the smaller size of DKPAROLE (which probably leads to more false priors), the difference between Danish and Swedish, or some other, unknown, factor.

1.15 Investigating the long-range effect of verb tense

This is much the same task as in the previous section. Here, I investigate this proposition:

The probability of verb tense X (P(X)) is long-range conditional on Y iff P(X) != P(X | Y), given distance(Y, X) > 2

1.15.1 Methodology

As this task is very similar to the previous one, there is little difference in methodology. The only relevant difference is that it’s necessary to obtain occurrence counts incorporating the distance requirement mentioned above.


1.15.2 Tools

For this task I added two more counts and one report to the metric collection and reporting tool (provided as Appendix D):

• Count: The number of lexemes following a given verb class at a distance > 2

• Count: The number of times a given verb class follows a given verb class at a distance > 2

I re-used the reporting facilities used in the previous section with the counts mentioned above.


1.15.3 Results

The observed metrics are collected in the “Results” section in the subsection “Investigation of long-range effect of verb tense”.


1.15.4 Discussion

Once again, the DKPAROLE results should only be used as an indication of the validity of results obtained from SUC.

There are some clear differences in the long-range results as compared to the results obtained in the general investigation of verb tense agreement (see section 1.14.4).

Difference: long range vs. general case for SUC:


                   | Given infinitive | Given present tense | Given past tense
Infinitive, SUC    | -0,096           | -0,588              | -0,461
Present tense, SUC | 0,307            | 0,386               | 0,05
Past tense, SUC    | 0,234            | 0,055               | 0,731


1.15.5 Conclusion

The big differences in the conditional probabilities for long-range influence vs. general conditional probability are an indication of a potential problem, namely that distance from the preceding verb may be a significant factor. Should this be the case, it will be necessary to incorporate this information in a long-range model of verb tense agreement, which will complicate the modeling considerably (see section 1.10 for discussion).


1.16 Effect of distance from the previous verb

Following on the previous section, I investigate the effect of distance from the previous verb in detail.

1.16.1 Methodology

This task is again a simple counting task. In order to determine the dependence (if any) of conditional probability on distance from the preceding verb, I need to tabulate the information on a per-distance basis.

1.16.2 Tools

This task necessitates collecting two new counts, and implementing a new report, in the metric collection and reporting tool (provided as Appendix D). A sketch of the counting loop follows the list.

• Count: The number of lexemes following a given verb class at distance 1..max

• Count: The number of times a given verb class follows a given verb class at a distance of 1..max

• Report: The conditional probability of verb class given verb class given distance
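As an illustration, a minimal C++ sketch of the per-distance counting (again not the appendix D program, under the same assumed SUC-style stream as in section 1.14.1). Note that the distance > 2 counts used in section 1.15 are simply these counts summed over all distances greater than 2:

#include <iostream>
#include <map>
#include <string>

static bool isVerbTense(const std::string& tag) {
    return tag == "VBINF" || tag == "VBPRS" || tag == "VBPRT";
}

int main() {
    // followersAt[Y][d]  = words at distance d from a preceding verb of tense Y
    // followsAt[Y][X][d] = verbs of tense X at distance d from a preceding verb of tense Y
    std::map<std::string, std::map<int, long>> followersAt;
    std::map<std::string, std::map<std::string, std::map<int, long>>> followsAt;

    std::string tag, lexeme, prev;
    int d = 0;  // words since the closest preceding verb
    while (std::cin >> tag >> lexeme) {
        ++d;
        if (!prev.empty()) {
            ++followersAt[prev][d];
            if (isVerbTense(tag)) ++followsAt[prev][tag][d];
        }
        if (isVerbTense(tag)) { prev = tag; d = 0; }
    }

    // P(X | Y, distance d) = count(X at distance d after Y) / words at distance d after Y.
    for (const auto& [y, byTense] : followsAt)
        for (const auto& [x, byDist] : byTense)
            for (const auto& [dist, n] : byDist)
                std::cout << "P(" << x << " | " << y << ", d=" << dist << ") = "
                          << double(n) / followersAt[y][dist] << '\n';
}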


1.16.3 Results

The collected probabilities are reported in the “Results” section, in the subsection “Conditional probability pr. distance”.


1.16.4 Discussion

The results are best discussed on the basis of a graphical representation.

[Graphs omitted: P(verb tense | verb tense) as a function of distance; the underlying probabilities are given in section 3.4.]






These graphs validate the results from the previous section, namely that there is a marked difference between the close-range (distance < 3) and long-range (distance > 2) conditional probabilities.

The question relevant for the goal of building a long-range model is, however, whether it’s acceptable to assume that the influence of the previous verb is independent of the distance to it, given a distance greater than 2.

For some combinations of verb tense it’s obviously an acceptable approximation to assume independence of distance, specifically for P(VBINF | VBINF), P(VBPRT | VBPRS), P(VBINF | VBPRT) and P(VBPRS | VBPRT), as these show little or no dependence on distance. As for the rest, the graphs do show that the probabilities either approach independence, or exhibit independence for most values of distance > 2, but this is not the case for the critical values of distance (3, 4 and 5), which comprise the majority of the cases (see section 1.13 for details).


1.16.5 Conclusion

It is possible to assume independence of distance to the preceding verb, but it is clearly an approximation. It is feasible to incorporate this dependence in a long-range model, but for an initial investigation it seems a reasonable simplification to incorporate the independence assumption in a model.

One should obviously attempt to quantify any negative impact on performance caused by this (marginal) assumption during evaluation of a model incorporating the assumption.


1.17 Is verb tense agreement independent of the trigger lexeme?

I now move on to investigating another central independence assumption, namely that the effect of tense agreement is dependent on the tense of the trigger word, rather than the lexeme. The assumption investigated is then:

For the lexeme X embodying the verb tense Y, which has a closest preceding verb with a lexeme of S embodying the verb tense T, P(Y | T) = P(Y | S)

1.17.1 Methodology

While it’s unlikely that the independence assumption holds exactly, I can investigate whether it almost holds most of the time, which is sufficiently close to act on the assumption when constructing a model.


For this purpose I shall collect the probabilities P(X | lexeme/Y) for all combinations of verb classes X and Y. To avoid false conditioning, I shall only collect these probabilities for cases where the lexeme embodying Y occurs more than 10 times in the corpus.

I shall then investigate the variance of the values, using the measure of standard deviation. Given a normal distribution of data, the data should be clustered so that 68% of the data are within one standard deviation of the mean, 95% within two standard deviations and 99,7% within three standard deviations [Wikipedia] (figure 1). If I find evidence that the data are clustered significantly closer to the mean, I shall consider this sufficient evidence that the independence assumption is valid.

[Figure 1: the normal distribution [Wikipedia]]
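The clustering test itself is a simple computation. As an illustration, a minimal C++ sketch (not the appendix D code) which reads the collected probabilities P(X | lexeme/Y) for one tense pair as numbers on standard input, and reports the proportions within 1, 2 and 3 standard deviations of the mean:

#include <cmath>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> p;  // P(X | lexeme/Y), one value per qualifying lexeme
    double v;
    while (std::cin >> v) p.push_back(v);
    if (p.empty()) return 0;

    double mean = 0, var = 0;
    for (double x : p) mean += x;
    mean /= p.size();
    for (double x : p) var += (x - mean) * (x - mean);
    double stddev = std::sqrt(var / p.size());

    std::cout << "mean = " << mean << ", stddev = " << stddev << '\n';
    // Compare against 68%, 95% and 99,7% for a normal distribution.
    for (int k = 1; k <= 3; ++k) {
        long within = 0;
        for (double x : p)
            if (std::fabs(x - mean) <= k * stddev) ++within;
        std::cout << "within +/-" << k << " stddev: " << double(within) / p.size() << '\n';
    }
}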


1.17.2 Tools

For this task I added two counts and one report to the metric collection and reporting tool (provided as Appendix D):

• Count: The number of times a verb tense occurs following a given verb tense represented by a given lexeme at a distance > 2.

• Count: The number of lexemes following a given verb tense represented by a given lexeme at a distance > 2.

• Report: The proportion of the individual probabilities P(verb tense | lexeme/verb tense), by all verb tense x verb tense combinations that occur, within 1, 2 and 3 standard deviations from the mean.

As for the number of occurrences of a lexeme, I re-used the counts collected in section 1.13.

1.17.3 Results

The observed metrics are collected in the “Results” section in the subsection “Verb tense agreement given trigger lexeme”.


1.17.4 Discussion

To better provide an overview of the results, I tabulate them below.



                                  | Mean       | Stddev    | Within +/-1 stddev | Within +/-2 stddev | Within +/-3 stddev
Infinitive given Infinitive       | 0,0395645  | 0,0360109 | 0,866242           | 0,961783           | 0,980892
Present tense given Infinitive    | 0,0768285  | 0,0439039 | 0,847134           | 0,961783           | 0,987261
Past tense given Infinitive       | 0,0317191  | 0,0242779 | 0,732484           | 0,949045           | 0,980892
Infinitive given present tense    | 0,037078   | 0,0956872 | 0,934605           | 0,959128           | 0,970027
Present tense given present tense | 0,100287   | 0,0784574 | 0,866485           | 0,978202           | 0,989101
Past tense given present tense    | 0,00916873 | 0,023937  | 0,956403           | 0,978202           | 0,989101
Infinitive given past tense       | 0,0336267  | 0,0727862 | 0,912844           | 0,954128           | 0,977064
Present tense given past tense    | 0,011176   | 0,0111036 | 0,550459           | 0,949541           | 0,986239
Past tense given past tense       | 0,122083   | 0,106106  | 0,917431           | 0,986239           | 0,986239


For all but one combination (present tense given past tense), the clustering of values is very pronounced, i.e., significantly higher than it would be had the data points been distributed according to a normal distribution.

For all combinations, the proportion of the data within +/- three standard deviations is less than the 99,7% that would be expected given a normal distribution.


As another potential indication of the validity of the results, I tabulate the means and the measured long-distance conditional probability for the tag combinations below.
distance conditional probability for the tag combination below.



                                  | Mean       | Measured verb tense agreement
Infinitive given Infinitive       | 0,0395645  | 0,0330739
Present tense given Infinitive    | 0,0768285  | 0,0658688
Past tense given Infinitive       | 0,0317191  | 0,0303732
Infinitive given present tense    | 0,037078   | 0,0289623
Present tense given present tense | 0,100287   | 0,086597
Past tense given present tense    | 0,00916873 | 0,0081227
Infinitive given past tense       | 0,0336267  | 0,0254384
Present tense given past tense    | 0,011176   | 0,0111783
Past tense given past tense       | 0,122083   | 0,0935529


As can be seen from the numbers, there is general agreement between the mean probability measured per lexeme embodying the verb tense, and the measured verb tense agreement.

1.17.5 Conclusion

The fact that there is general agreement between the mean of the data points considered and the measured verb tense agreement indicates that the methodology used is sound.

The pronounced clustering around the mean for all but one combination of verb tenses supports the independence assumption of tense agreement investigated in this section. However, the fact that less than 99,7% of the population lies within +/- 3 standard deviations for all combinations indicates that there are significant exceptions to the independence assumption. This is an indication that a hybrid model (see section 1.11.5) might be needed in order to represent the true tense agreement probabilities.


1.18 Is verb tense agreement independent of the target lexeme?

In the previous section I showed that the verb tense agreement effect is largely independent of the trigger lexeme. In order to argue that the verb tense effect is wholly dependent on the tense of the trigger and target, I must now show that it is independent of the target lexeme. The assumption investigated is then:

For the lexeme X embodying the verb tense Y, which has a closest preceding verb with a lexeme of S embodying the verb tense T, P(X | T) = P(Y | T)

1.18.1 Methodology

The task is very much like the previous one: I shall collect the probabilities P(lexeme/X | Y) for all combinations of verb classes X and Y. To avoid false conditioning, I only collect these probabilities for cases where the lexeme embodying X occurs more than 10 times in the corpus.


1.18.2 Tools

For this task I added one count. I was able to re-use counts from the investigation of long-range influence (section 1.15). I had to create a modified version of the reporting facilities used in the previous section, as I had to sum the counts obtained for occurrences of the target lexemes in order to calculate the combined probability for validation. I added this count and the report to the metric collection and reporting tool (provided in Appendix D):

• Count: The number of times a verb tense represented by a given lexeme occurs following a given verb tense at a distance > 2.

• Report: The proportion of the individual probabilities P(lexeme/verb tense | verb tense), by all verb tense x verb tense combinations that occur, within 1, 2 and 3 standard deviations from the mean.

• Report: The combined probability of the events P(lexeme/verb tense | verb tense)

For the number of occurrences of a lexeme, I once again re-used the counts collected in section 1.13.


1.18.3 Results

The observed metrics are collected in the “Results” section in the subsection “Verb tense agreement given target lexeme”.


1.18.4 Discussion

To better provide an overview of the results, I tabulate them below.



                                  | Combined probability | Stddev   | Within +/-1 stddev | Within +/-2 stddev | Within +/-3 stddev
Infinitive given Infinitive       | 0,020022             | 0,000145 | 0,941980           | 0,962457           | 0,972696
Present tense given Infinitive    | 0,050378             | 0,000784 | 0,959044           | 0,989761           | 0,989761
Past tense given Infinitive       | 0,020481             | 0,000368 | 0,965870           | 0,986348           | 0,986348
Infinitive given present tense    | 0,021246             | 0,000093 | 0,959520           | 0,980510           | 0,982009
Present tense given present tense | 0,072005             | 0,000632 | 0,976012           | 0,985007           | 0,994003
Past tense given present tense    | 0,005099             | 0,000064 | 0,977511           | 0,983508           | 0,992504
Infinitive given past tense       | 0,015912             | 0,000115 | 0,956072           | 0,968992           | 0,974160
Present tense given past tense    | 0,007522             | 0,000126 | 0,968992           | 0,981912           | 0,987080
Past tense given past tense       | 0,072829             | 0,000796 | 0,968992           | 0,981912           | 0,989664


The results here are much clearer than was the case for the previous investigation (section 1.17). The clustering of the values around the mean is very pronounced, in all cases exceeding 94% of the data points.

As for the proportion covered by +/- three standard deviations, the results are comparable to the previous investigation.

As for the validation through comparing the combined probability with the measured tag/tag conditional probability, I have tabulated the values below.



                                  | Combined probability | Measured verb tense agreement
Infinitive given Infinitive       | 0,020022             | 0,0330739
Present tense given Infinitive    | 0,050378             | 0,0658688
Past tense given Infinitive       | 0,020481             | 0,0303732
Infinitive given present tense    | 0,021246             | 0,0289623
Present tense given present tense | 0,072005             | 0,086597
Past tense given present tense    | 0,005099             | 0,0081227
Infinitive given past tense       | 0,015912             | 0,0254384
Present tense given past tense    | 0,007522             | 0,0111783
Past tense given past tense       | 0,072829             | 0,0935529


While there is general agreement, it is much less clear than was the case for the previous investigation.

1.18.5 Conclusion

While the validation of the results (comparison of the combined probability for the measured lexemes and the measured verb tense agreement) is less clear, it is still pronounced, and supports the use of this methodology.

The clustering around the mean value is even more pronounced than in the previous investigation. There is also in this case some evidence of significant (highly frequent) exceptions to the independence assumption.


1.19 Conclusion for data analysis

Based on the conclusions reached in sections 1.13 to 1.18, I can conclude that it is a reasonable assumption that verb tense agreement can be incorporated in a statistical model as a prior, on the assumption that the effect is independent of distance and of the actual lexemes representing the verb tenses.

I intend to build and evaluate such a model in the near future, but due to space restrictions this falls beyond the scope of even a dual term paper.


2 References

[Katz 87] S. M. Katz, “Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer”, IEEE Transactions on Acoustics, Speech and Signal Processing, Volume ASSP-35, pages 400-401, March 1987.

[Rosenfeld 00] Ronald Rosenfeld, “Two Decades of Statistical Language Modeling: Where Do We Go From Here?”, Proceedings of the IEEE, 88(8), 2000.

[Rosenfeld 94] Ronald Rosenfeld, Adaptive Statistical Language Modeling: A Maximum Entropy Approach, Ph.D. thesis, Computer Science Department, Carnegie Mellon University, TR CMU-CS-94-138, April 1994.

[Wikipedia] Various authors, Wikipedia article on standard deviation, http://en.wikipedia.org/wiki/Standard_deviation

[DKPAROLE] Britt Kerson, Vejledning til det danske morfosyntaktisk taggede PAROLE-korpus, http://korpus.dsl.dk/paroledoc_dk.pdf

[SUC] Gunnel Källgren, Documentation of the Stockholm-Umeå Corpus, http://www.ling.su.se/staff/sofia/suc/manual.pdf

[SaLT] Daniel Jurafsky and James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, draft of November 10, 2004, chapter 8: “Word classes and part-of-speech tagging”, Prentice-Hall, New Jersey.


3 Results

3.1 Metrics collected from corpora

Size (in number of words)
SUC: 1159484
DKPAROLE: 305439

Dictionary size
SUC: 96148
DKPAROLE: 36152


Proportion of corpus per class, when counting words occurring at least 1, 10 and 100 times

SUC:

Tag   | Min. 1  | Min. 10 | Min. 100
AB    | 81793   | 78431   | 67915
DT    | 55696   | 55663   | 55466
END   | 69897   | 69897   | 69897
HA    | 9275    | 9225    | 9045
HD    | 528     | 526     | 526
HP    | 15879   | 15862   | 15856
HS    | 160     | 144     | 144
IE    | 12848   | 12848   | 12845
IN    | 1578    | 1235    | 815
JJ    | 72333   | 55695   | 30658
KN    | 52893   | 52886   | 52548
NN    | 230583  | 138908  | 46239
PC    | 18239   | 8341    | 2037
PL    | 11650   | 11617   | 11108
PM    | 39832   | 19850   | 3908
PN    | 72102   | 72027   | 71475
PP    | 123104  | 122994  | 122105
PS    | 10045   | 10013   | 9736
RG    | 15560   | 11373   | 7065
RO    | 2061    | 1947    | 1732
SN    | 17090   | 17065   | 16925
START | 69897   | 69897   | 69897
UO    | 2538    | 557     | 265
VBAN  | 82      | 78      | 0
VBIMP | 1309    | 1062    | 623
VBINF | 37962   | 31609   | 18028
VBKON | 204     | 187     | 130
VBPRS | 75946   | 70235   | 56371
VBPRT | 44772   | 39268   | 27920
VBSMS | 20      | 16      | 0
VBSUP | 13608   | 9788    | 4527
Total | 1159484 | 989244  | 785806


DKPAROLE:

Class               >= 1     >= 10    >= 100
ADJ                 23319    15341     6678
ADJ_GEN                36        0        0
ADV                 19017    18154    13629
EGEN                13188     4708      524
EGEN_GEN             1073      156        0
END                 14838    14838    14838
FORK                  146       61        1
FORM                   47        0        0
INTERJ                259      186      164
N                     168      106        0
NUM                  4119     2701      921
NUM_GEN                 6        0        0
NUM_ORD               419      234       62
NUM_ORD_GEN             1        0        0
N_DEF_PLU            2569      425        0
N_DEF_PLU_GEN         234        0        0
N_DEF_SING          11065     4027      143
N_DEF_SING_GEN       1085      237        0
N_GEN                   6        0        0
N_INDEF                47       42        0
N_INDEF_PLU         11716     5594     1390
N_INDEF_PLU_GEN       252       79        0
N_INDEF_SING        25072    11696     1751
N_INDEF_SING_GEN      408       61        0
N_PLU                  23       15        0
PRON_DEMO            6939     6938     6852
PRON_DEMO_GEN           6        0        0
PRON_INTER_REL        525      522      367
PRON_INTER_REL_GEN     29       29       29
PRON_PERS           13965    13965    13843
PRON_POSS            2433     2429     2038
PRON_REC               77       77        0
PRON_REC_GEN            7        0        0
PRON_UBST            9490     9452     9282
PRON_UBST_GEN          22       18        0
PRÆP                30927    30891    30118
SKONJ                9819     9816     9704
START               14839    14839    14839
SYMBOL                169      164      114
TEGN                25553    25546    25492
UKONJ                5730     5696     5453
UL                    295       86       19
UNIK                 9037     9037     9037
V_GERUND               56        1        0
V_IMP                 437      224       50
V_INF                8068     5772     3027
V_INF_PAS             628      153        0
V_PARTC_PAST         6342     3221      881
V_PARTC_PRES          667      186        0
V_PAST               9047     7379     5158
V_PAST_PAS             34        0        0
V_PRES              19347    17278    13565
V_PRES_PAS            772      173        0
XX                   1066      193      111
Total              305439   242746   190080
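
For reference, counts of the kind shown in the two tables above could be produced along the following lines. This is a minimal sketch assuming the corpus is already available as a sequence of (word, class) pairs; the actual corpus formats and tools are described in section 1.13.

    from collections import Counter

    def class_proportions(tagged_corpus, thresholds=(1, 10, 100)):
        """Per-class token counts, restricted to words occurring at
        least `threshold` times; tagged_corpus is a list of
        (word, cls) pairs."""
        word_freq = Counter(word for word, _ in tagged_corpus)
        tables = {}
        for threshold in thresholds:
            per_class = Counter()
            for word, cls in tagged_corpus:
                if word_freq[word] >= threshold:
                    per_class[cls] += 1
            tables[threshold] = per_class
        return tables

    # Toy example: at threshold 2 the hapax word "ser" drops out.
    toy = [("han", "PRON_PERS"), ("ser", "V_PRES"), ("han", "PRON_PERS")]
    print(class_proportions(toy, thresholds=(1, 2)))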


Verb distances (a counting sketch follows the two tables below):

SUC:

Distance  Count
 1        14683
 2        10893
 3        12812
 4        12290
 5        10422
 6         8561
 7         6632
 8         4826
 9         3561
10         2608
11         1884
12         1319
13          991
14          722
15          481
16          327
17          283
18          216
19          121
20          102
21           97
22           61
23           55
24           28
25           17
26           20
27           25
28           14
29           18
30            8
31            6
32           10
33            4
34            1
35            2
36            2
37            1
38            1
41            1
43            1
45            1
54            1
58            1


DKPAROLE:

Distance  Count
 1         5960
 2         3211
 3         2733
 4         3249
 5         3022
 6         2602
 7         2144
 8         1691
 9         1250
10          920
11          718
12          516
13          407
14          270
15          202
16          153
17          118
18           78
19           59
20           51
21           38
22           33
23           20
24           13
25           16
26           13
27           11
28            7
29            4
30            7
31            3
33            2
34            1
36            3
37            2
38            1
39            2
46            1
47            1
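
The distance histograms above can be produced by a single pass over the tag sequence. The sketch below assumes the corpus is a list of class tags and that a predicate identifying verb tags (here called is_verb) is supplied; handling of sentence boundaries (the START and END tags) is omitted for brevity.

    from collections import Counter

    def verb_distances(tags, is_verb):
        """Histogram of distances between consecutive verbs, where the
        distance is the number of tokens from one verb to the next
        (adjacent verbs have distance 1)."""
        histogram = Counter()
        previous = None
        for position, tag in enumerate(tags):
            if is_verb(tag):
                if previous is not None:
                    histogram[position - previous] += 1
                previous = position
        return histogram

    # Example with SUC-style tags: two verbs two tokens apart.
    tags = ["PN", "VBPRS", "AB", "VBINF", "NN"]
    print(verb_distances(tags, lambda t: t.startswith("VB")))  # Counter({2: 1})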




3.2 Initial investigation of tense-agreement

SUC:

Population size:

Verb tense             Number of words following
VBINF (Infinitive)     193.293
VBPRS (Present tense)  407.743
VBPRT (Past tense)     237.653

Occurrences:

        Global   Given VBINF  Given VBPRS  Given VBPRT
VBINF   37.962   7.003        19.659        9.631
VBPRS   75.946   8.848        24.989        1.883
VBPRT   44.772   4.128         2.444       15.524

Calculated probabilities:

        Unconditional  Given VBINF  Given VBPRS  Given VBPRT
VBINF   0,0327404      0,03623      0,0482142    0,0405255
VBPRS   0,0654998      0,0457751    0,0612862    0,00792332
VBPRT   0,0386137      0,0213562    0,00599397   0,0653221

Normalized by unconditional probability:

        Unconditional  Given VBINF  Given VBPRS  Given VBPRT
VBINF   1              1,10658      1,47262      1,23778
VBPRS   1              0,698858     0,935669     0,120967
VBPRT   1              0,553072     0,155229     1,69168


DKPAROLE:

Population size:

Verb tense                                Number of words following
V_INF (Infinitive)                        44.055
V_PARTC_PAST (Participle, past tense)     36.526
V_PARTC_PRES (Participle, present tense)   3.478
V_PAST (Past tense)                       47.776
V_PRES (Present tense)                    97.753

Occurrences:

              Global   Given   Given         Given         Given   Given
                       V_INF   V_PARTC_PAST  V_PARTC_PRES  V_PAST  V_PRES
V_INF          8.068   1.143     591          36           1.765   4.281
V_PARTC_PAST   6.342     642     534          25           1.429   3.435
V_PARTC_PRES     667     104      87          27             127     219
V_PAST         9.047     846     892          96           2.511     569
V_PRES        19.347   2.210   1.603         206             613   5.541

Calculated probabilities:

              Global      Given       Given         Given         Given       Given
                          V_INF       V_PARTC_PAST  V_PARTC_PRES  V_PAST      V_PRES
V_INF         0,0264144   0,0259448   0,0161803     0,0103508     0,0369432   0,0437941
V_PARTC_PAST  0,0207636   0,0145727   0,0146197     0,00718804    0,0299104   0,0351396
V_PARTC_PRES  0,00218374  0,00236069  0,00238186    0,00776308    0,00265824  0,00224034
V_PAST        0,0296197   0,0192033   0,024421      0,0276021     0,0525578   0,00582079
V_PRES        0,0633416   0,0501646   0,0438865     0,0592294     0,0128307   0,0566837

Normalized by unconditional probability:

              Global  Given     Given         Given         Given     Given
                      V_INF     V_PARTC_PAST  V_PARTC_PRES  V_PAST    V_PRES
V_INF         1       0,982222  0,612553      0,391861      1,3986    1,65796
V_PARTC_PAST  1       0,70184   0,704105      0,346185      1,44052   1,69237
V_PARTC_PRES  1       1,08103   1,09073       3,55494       1,21729   1,02592
V_PAST        1       0,648328  0,824485      0,931883      1,77442   0,196518
V_PRES        1       0,791969  0,692855      0,935079      0,202564  0,894888
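
The three kinds of tables in this section are related by a simple relative-frequency computation. The sketch below makes the relationship explicit for a single (trigger, target) pair; the function and argument names are mine, chosen to match the table headers.

    def normalized_agreement(global_count, population, pair_count, pair_population):
        """Relate the three tables above for one (trigger, target) pair.

        global_count    -- occurrences of the target tense in the whole corpus
        population      -- total number of words in the corpus
        pair_count      -- occurrences of the target tense following the trigger
        pair_population -- number of words following the trigger tense
        """
        p_unconditional = global_count / population   # "Global"/"Unconditional" column
        p_conditional = pair_count / pair_population  # "Given ..." column
        return p_conditional / p_unconditional        # normalized table

    # Example, SUC: VBPRT given VBPRT (section 3.2).
    print(normalized_agreement(44772, 1159484, 15524, 237653))  # about 1.69168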



3.3 Investigation of long-range effect of verb tense

SUC:

Population size:

Verb tense  Number of words following
VBINF       121.818
VBPRS       265.552
VBPRT       153.154

Occurrences:

        Global   Given VBINF  Given VBPRS  Given VBPRT
VBINF   37.962   4.029         7.691        3.896
VBPRS   75.946   8.024        22.996        1.712
VBPRT   44.772   3.700         2.157       14.328

Calculated probabilities:

        Unconditional  Given VBINF  Given VBPRS  Given VBPRT
VBINF   0,0327404      0,0330739    0,0289623    0,0254384
VBPRS   0,0654998      0,0658688    0,086597     0,0111783
VBPRT   0,0386137      0,0303732    0,0081227    0,0935529

Normalized by unconditional probability:

        Unconditional  Given VBINF  Given VBPRS  Given VBPRT
VBINF   1              1,01019      0,884604     0,776974
VBPRS   1              1,00563      1,32209      0,170661
VBPRT   1              0,78659      0,210358     2,42279


DKPAROLE:

Population size:

Verb tense    Number of words following
V_INF         28.763
V_PARTC_PAST  24.482
V_PARTC_PRES   2.198
V_PAST        31.348
V_PRES        62.749

Occurrences:

              Global   Given   Given         Given         Given   Given
                       V_INF   V_PARTC_PAST  V_PARTC_PRES  V_PAST  V_PRES
V_INF          8.068     836     432          31             797   1.853
V_PARTC_PAST   6.342     164     237          20             320     844
V_PARTC_PRES     667      62      56          17              76     139
V_PAST         9.047     824     828          75           2.415     552
V_PRES        19.347   2.163   1.472         151             603   5.394

Calculated probabilities:

              Global      Given       Given         Given         Given      Given
                          V_INF       V_PARTC_PAST  V_PARTC_PRES  V_PAST     V_PRES
V_INF         0,0264144   0,0290651   0,0176456     0,0141037     0,0254243  0,0295304
V_PARTC_PAST  0,0207636   0,00570177  0,00968058    0,00909918    0,010208   0,0134504
V_PARTC_PRES  0,00218374  0,00215555  0,00228739    0,0077343     0,0024244  0,00221517
V_PAST        0,0296197   0,0286479   0,0338208     0,0341219     0,0770384  0,00879695
V_PRES        0,0633416   0,0752008   0,0601258     0,0686988     0,0192357  0,0859615

Normalized by unconditional probability:

              Global  Given     Given         Given         Given     Given
                      V_INF     V_PARTC_PAST  V_PARTC_PRES  V_PAST    V_PRES
V_INF         1       1,10035   0,668029      0,53394       0,962514  1,11796
V_PARTC_PAST  1       0,274605  0,466229      0,438228      0,49163   0,647789
V_PARTC_PRES  1       0,987089  1,04747       3,54177       1,1102    1,01439
V_PAST        1       0,967193  1,14183       1,152         2,60092   0,296997
V_PRES        1       1,18723   0,949231      1,08458       0,303681  1,35711


3.4 Conditional probability per distance

In the tables below, each row gives the distance from the trigger verb, and the columns give the conditional probability of each target tense at that distance. A dash marks distances for which no value was available.

SUC:

P dependent on VBINF and distance:

Distance  VBINF       VBPRS       VBPRT
 1        0,04968     0,198272    0,0982272
 2        0,0324679   0,132528    0,0598627
 3        0,0401047   0,0916038   0,0464044
 4        0,0343588   0,0735955   0,0367122
 5        0,0329321   0,0606263   0,0286415
 6        0,0341096   0,0523212   0,0283624
 7        0,0319492   0,0455553   0,0239871
 8        0,0300234   0,0448974   0,0202451
 9        0,0268736   0,0440954   0,0227101
10        0,0249286   0,0358348   0,0223319
11        0,027065    0,0347979   -
12        0,0247973   0,0300429   -

P dependent on VBPRS and distance:

Distance  VBINF       VBPRS       VBPRT
 1        0,0042528   0,00821593  0,000671494
 2        0,00756398  0,0206688   0,00181173
 3        0,0216367   0,0553268   0,0033112
 4        0,0306871   0,0805351   0,00478489
 5        0,0353264   0,0955735   0,00592332
 6        0,034306    0,103909    0,00833732
 7        0,0381836   0,11067     0,00843073
 8        0,0357939   0,106208    0,00891914
 9        0,0336219   0,10589     0,0093523
10        0,0304047   0,10484     0,00996543
11        0,0311368   0,102671    0,00993155
12        0,0297135   0,0925009   -
13        0,0281625   0,099723    -
14        0,0240634   0,0846786   -
15        0,021124    0,0785173   -
16        0,0286011   0,0582423   -
17        -           0,0750988   -
18        -           0,0763948   -

P dependent on VBPRT and distance:

Distance  VBINF       VBPRS       VBPRT
 1        0,00330564  0,0021442   0,00667828
 2        0,00704828  0,00480793  0,0225797
 3        0,019145    0,00973279  0,0601743
 4        0,0233278   0,0112016   0,088297
 5        0,0256677   0,0129694   0,101496
 6        0,0279704   0,0168057   0,115877
 7        0,0295018   0,0171832   0,118007
 8        0,028434    0,0168304   0,118127
 9        0,0282026   0,019363    0,11239
10        0,022019    0,0188468   0,119612
11        0,0263491   0,0192551   0,104383
12        0,0286195   0,0218855   0,0925926
13        -           -           0,0870712
14        -           -           0,107577
15        -           -           0,0770416
16        -           -           0,0829171
17        -           -           0,0768246
18        -           -           0,0901503

Normalized by unconditional probability:

P dependent on VBINF and distance:

Distance  VBINF      VBPRS      VBPRT
 1        1,51739    3,02706    2,54384
 2        0,991677   2,02333    1,5503
 3        1,22493    1,39854    1,20176
 4        1,04943    1,1236     0,950754
 5        1,00586    0,925595   0,741743
 6        1,04182    0,7988     0,734517
 7        0,975833   0,695503   0,621207
 8        0,917014   0,685458   0,524299
 9        0,820807   0,673214   0,588135
10        0,761401   0,547098   0,57834
11        0,826655   0,531267   -
12        0,757392   0,458672   -

P dependent on VBPRS and distance:

Distance  VBINF      VBPRS      VBPRT
 1        0,129894   0,125434   0,01739
 2        0,231029   0,315556   0,0469193
 3        0,660855   0,844687   0,0857519
 4        0,937285   1,22955    0,123917
 5        1,07899    1,45914    0,153399
 6        1,04782    1,5864     0,215916
 7        1,16625    1,68962    0,218335
 8        1,09326    1,6215     0,230984
 9        1,02692    1,61664    0,242201
10        0,92866    1,60062    0,25808
11        0,951019   1,5675     0,257203
12        0,907547   1,41223    -
13        0,860175   1,52249    -
14        0,734974   1,29281    -
15        0,645195   1,19874    -
16        0,873573   0,889198   -
17        -          1,14655    -
18        -          1,16634    -

P dependent on VBPRT and distance:

Distance  VBINF      VBPRS      VBPRT
 1        0,100965   0,0327359  0,172951
 2        0,215278   0,0734038  0,584758
 3        0,584752   0,148593   1,55836
 4        0,712506   0,171017   2,28667
 5        0,783975   0,198007   2,62849
 6        0,854307   0,256577   3,00094
 7        0,901081   0,26234    3,05609
 8        0,868469   0,256954   3,05919
 9        0,8614     0,295619   2,91061
10        0,672534   0,287738   3,09765
11        0,804789   0,293972   2,70326
12        0,874134   0,334131   2,39792
13        -          -          2,25493
14        -          -          2,78597
15        -          -          1,99519
16        -          -          2,14735
17        -          -          1,98957
18        -          -          2,33467
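
The distance tables above extend the same relative-frequency computation with a distance parameter. The sketch below shows one way of collecting the underlying counts; whether intervening verbs should terminate the window is a methodological choice (section 1.16) that the sketch deliberately leaves open by counting every position up to the maximum distance.

    from collections import Counter

    def tense_by_distance(tags, verb_tenses, max_distance=18):
        """For each trigger tense, count target tenses at each distance
        1..max_distance after the trigger, and the number of positions
        observed at that distance, so that P(target | trigger, distance)
        can be formed as counts[(trigger, d)][target] / totals[(trigger, d)]."""
        counts = {}
        totals = Counter()
        for i, trigger in enumerate(tags):
            if trigger not in verb_tenses:
                continue
            for d in range(1, max_distance + 1):
                if i + d >= len(tags):
                    break
                totals[(trigger, d)] += 1
                target = tags[i + d]
                if target in verb_tenses:
                    counts.setdefault((trigger, d), Counter())[target] += 1
        return counts, totals

    # Example: P(VBPRT two tokens after a VBPRT trigger) in a toy sequence.
    tags = ["VBPRT", "NN", "VBPRT", "NN", "VBPRS"]
    counts, totals = tense_by_distance(tags, {"VBINF", "VBPRS", "VBPRT"})
    print(counts[("VBPRT", 2)]["VBPRT"] / totals[("VBPRT", 2)])  # 0.5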



3