The Light at the End of the Tunnel


A Further Look at the Data Sparsity Problem


Abstract:

Insufficient repetition of linguistic features is a major stumbling block in the myriad applications of data-driven machine learning. While the existence of the data sparsity problem is irrefutable, its precise scale remains largely unknown, and it is generally accepted as unsolvable in the sense that simply increasing text volumes, even by an order of magnitude, provides no respite from it.

This paper attempts to quantify data sparsity as experienced with popular corpora of today, and makes suggestions for how rapidly advancing technology might allow it to be eliminated entirely.



Introduction


Data sparsity is the term used to describe the phenomenon whereby patterns in text repeat so infrequently that meaningful conclusions cannot be drawn. Many treat the observation that "language is a system of very rare events" as both comforting and depressing, but not as a challenge.


Although it is generally accepted that gathering contextual information from a corpus will improve language processing techniques, the data sparsity problem often precludes effective algorithms from being developed. Current techniques collect patterns of usage for a word or phrase and use the frequency of occurrence of these patterns to assign a probability distribution, or to aid machine learning algorithms.


Many applications use a trigram (of words) model as an approximation of context, and repetition of these trigrams in the training corpus allows probabilities to be assigned to each three-word sequence in a language. Unfortunately, many valid three-word sequences do not appear in the training corpus, and so many trigrams that appear in new documents have never been seen before.
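
To make the mechanism concrete, the sketch below shows maximum-likelihood trigram estimation in the usual way. It is an illustration of the general technique rather than the system used in this paper, and the simple whitespace tokenisation is an assumption:

    from collections import Counter

    def trigrams(tokens):
        # every consecutive three-word sequence
        return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

    def train(text):
        # count trigrams and their two-word histories in a training corpus
        tokens = text.lower().split()          # assumed tokenisation
        tri_counts = Counter(trigrams(tokens))
        bi_counts = Counter(zip(tokens, tokens[1:]))
        return tri_counts, bi_counts

    def probability(tri_counts, bi_counts, w1, w2, w3):
        # maximum-likelihood P(w3 | w1, w2); unseen trigrams receive zero,
        # which is exactly the sparsity problem described above
        history = bi_counts[(w1, w2)]
        return tri_counts[(w1, w2, w3)] / history if history else 0.0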


For example, using a 100 million word corpus for language modelling, it is estimated that over 40% of all continuous three-word sequences from a new document are unseen in the training corpus. If the task is more ambitious, such as collecting the (subject, verb, object) triples from a training corpus, then only a very small percentage of the triples found in a new document will have been seen before.


Data sparsity pervades every aspect of statistical machine learning. Its inescapability in modern language processing has led many to question whether there might ever be enough data to eliminate it, and to seek alternative or augmented means to combat it. Some of these include smoothing, exploiting additional textual information (syntax, parts of speech, etc.), estimating frequency counts from the web, and reliance on domain-specific corpora.


This paper quantifies the extent of data sparsity across a range of corpus sizes, from the very small (160,000 words) to the uncommonly large (1.7 billion words), and makes accurate predictions about sparsity for large corpora. Current statistical learning techniques are operating with unacceptably sparse training corpora. Banko and Brill (2001) also argue that standard corpora do not do a sufficient job of approximating language, and show that incorrect conclusions can be drawn about the performance of various algorithms when the corpus size is small.


This paper attempts to consider the effects of using a very large corpus to model the whole of language, rather than only a few select words or phrases. As such, the use of the Internet as a corpus is not considered, as it is not generally used as a corpus in the traditional sense (rather, it is used to estimate the probabilities of a few phrases, e.g. Grefenstette & Kilgarriff 2003; Keller et al. 2002). It is shown that the accelerated pace of technological advancement, coupled with some encouraging facts gained from studying the effects of increasing corpus size, suggests that an imminent solution to the data sparsity problem is possible.


We focus on quantifying the coverage provided for unigrams, bigrams and trigrams using different sized training corpora (up to a 1.7 billion word corpus of news data). We consider both using training corpora of the same genre as the test documents, and also different genres.


Data


Training Data


The Gigaword English Corpus is a large archive of newswire text data acquired by the Linguistic Data Consortium. The total corpus consists of over 1.7 billion words from four distinct international sources of English newswire, ranging from approximately 1994 to 2002:

- Agence France Press English Service (afe), 1994-2002
- Associated Press Worldstream English Service (apw), 1994-2002
- The New York Times Newswire Service (nyt), 1994-2002
- The Xinhua News Agency English Service (xie), 1995-2001



Testing Data


Initial testing data came from three sources, intended to represent a range of language use. For each source, a 1,000 word and a 5,000 word document was produced. The sources were as follows:

- Newspaper articles: documents composed of current stories from www.guardian.co.uk and www.thesun.co.uk
- Scientific writing: from Einstein's Special and General Theory of Relativity
- Children's writing: from www.childrens-express.org, a project producing news and current affairs stories by children, for children


Later tests also used the Medline corpus, 1.2 billion words of abstracts from the Medline project. Medline was compiled by the U.S. National Library of Medicine (NLM) and contains publications in the fields of life sciences and biomedicine. It contains nearly eleven million records from over 7,300 different publications, spanning 1965 to the present day.


Method


For the purposes of this paper, coverage is defined as the percentage of tokens from an unseen document found at least once in a training corpus. Both type and token percentages were explored, and token coverage is reported here. For this application, the type/token distinction is as follows: in counting tokens, all instances of a specific n-gram are counted separately towards the final percentage; for types, each unique n-gram is counted only once.
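
As an illustration of this definition (not the authors' code), the following sketch computes both token and type coverage for n-grams of a given order, assuming the training n-grams have already been collected into a set:

    def ngrams(tokens, n):
        # all consecutive n-word sequences in a token list
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def coverage(train_ngrams, test_tokens, n):
        # token coverage: fraction of all test n-gram instances seen in training
        # type coverage:  fraction of unique test n-grams seen in training
        test = ngrams(test_tokens, n)
        seen_tokens = sum(1 for g in test if g in train_ngrams)
        types = set(test)
        seen_types = sum(1 for g in types if g in train_ngrams)
        return seen_tokens / len(test), seen_types / len(types)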


Training and test corpora were all prepared in the same way: all non-alphabetic characters were removed, and all words were converted to lower case. For bigrams and trigrams, tokens were formed both by allowing and by prohibiting the crossing of sentence boundaries. However, it was found that this had a minimal impact upon percentage scores, and the results reported here are those allowing sentence-boundary crossing.
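
A minimal sketch of this normalisation step is shown below; the exact rules used in the paper (for example, how apostrophes and digits were treated) are not stated, so the regular expression here is an assumption:

    import re

    def normalise(text):
        # lower-case and replace non-alphabetic characters with spaces,
        # so that word boundaries survive (assumed behaviour)
        return re.sub(r"[^a-z]+", " ", text.lower()).split()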

Results


Figure 1 shows the token coverage of unigrams, bigrams and trigrams:





Figure 2 shows the degradation of coverage when using a training corpus from a different domain as the n-gram size increases. Perhaps surprisingly, the Gigaword news corpus does a perfectly acceptable job of covering the unigrams from the medical texts, but is clearly inappropriate for trigrams.




Discussion


The results show that while the use of domain-specific (or at least balanced) corpora is desirable, it is not necessary. The Gigaword corpus, although composed entirely of news data, does not give significantly worse coverage of scientific writing, or indeed that of children. Perhaps most surprising is that seemingly incomprehensible test documents from the Medline corpus achieve almost perfect unigram coverage; it is only in bigram and trigram coverage that the extent of the differences is apparent.


In using the Gigaword training corpus, it was shown that only novel words (colloquialisms, "jargon", etc.) or misspellings in a test document would be unseen. It is perhaps important to note that, without the use of any dictionary, this brute-force "spell-checker" recognises far more words than those included in word processors and, perhaps remarkably, even covers the majority of proper nouns and technical terms. From the 18,000 words of test documents, only the words below were unseen:


1. gastropubs
2. bedsits
3. Geddling
4. incompatibilitiy
5. tracky
6. Dutty
7. tranfalgar
8. paracingulate
9. camoufage
10. galileian
11. McLone
12. MOBOs
13. realily
14. vacuo
15. galflei
16. translatory
17. viewsthemselves



Of these, three are proper nouns (3, 6 and 11), three are colloquialisms (1, 2 and 5), five are misspellings (4, 7, 9, 13 and 17), five are technical terms (8, 10, 14, 15 and 16), and one is an acronym (12). As such, one might consider the colloquialisms and technical terms as novel: both creep in from outside of mainstream language, but may be assimilated (particularly in the case of colloquialisms). One could even use a diachronic corpus to observe the introduction and assimilation of these terms into common use. The proper nouns may be considered a special case, as these are notoriously difficult to identify, and one would not necessarily expect a corpus of any size to cover all possible proper nouns.


Furthermore, the performance of this corpus for detecting unusual bigram usage is approaching similar standards. With the Gigaword corpus, only approximately 5% of bigrams (800 bigrams from 18,000 words of text) were unseen, across a range of genres and topics. Preliminary tests also involved the use of some far larger test documents, and the results were approximately consistent across text sizes.


While it has been suggested that sufficient speech data cannot reasonably be gathered to allow perfect performance with current techniques (Moore, 2001), the same cannot necessarily be said for written text.


Figure 3 indicates forecast bigram and trigram coverage with increased corpus sizes, using logarithmic regression and plotted on a logarithmic scale. From the chart it is estimated that, with a corpus of approximately 10 billion words, we are able to cover nearly 100% of bigrams. Similarly, a corpus containing approximately 1 trillion words is likely to cover all trigrams. It is accepted that this is an estimate, with an asymptote in the region of 99%; however, the results obtained with unigrams, and the proximity to this percentage observed with bigrams, suggest that this level of coverage is achievable.
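
The forecast amounts to fitting coverage against the logarithm of corpus size and extrapolating. A minimal sketch of that procedure is given below; the fitting choice (ordinary least squares on log10 of the corpus size) and the function names are assumptions, and the measured data points from the figures would need to be supplied:

    import numpy as np

    def fit_coverage_curve(corpus_sizes, coverages):
        # fit coverage = a * log10(corpus size) + b and return a predictor
        a, b = np.polyfit(np.log10(corpus_sizes), coverages, 1)
        return lambda size: a * np.log10(size) + b

    # usage, with the measured (size, coverage %) points substituted:
    # predict = fit_coverage_curve(sizes_in_words, trigram_coverage_percent)
    # predict(1e12)   # forecast coverage for a one-trillion-word corpus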




Assuming a reasonably consistent continuation of the trends shown by these results, perfect coverage of non-novel use can be expected so long as sufficient data is available. A reasonable prediction for the amount of data required for perfect trigram coverage was calculated to be in the region of 1 trillion words. This translates to a storage requirement of approximately 18 terabytes: advances in storage capacity over the last few years suggest that such a figure is not unreasonable in the short-term future.
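
As a back-of-envelope check (not a calculation given in the paper), 18 terabytes for one trillion words implies an average of roughly 18 bytes per stored word, which presumably allows for whitespace, markup and storage overhead on top of the raw characters:

    # assumed figures for illustration only
    words = 1_000_000_000_000          # one trillion words
    bytes_per_word = 18                # implied average, including overhead
    total_terabytes = words * bytes_per_word / 10**12
    print(total_terabytes)             # -> 18.0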


Nor can the availability of so much textual data be denied: Google currently indexes over 8 billion pages, and even relatively old estimates measure the size of the web in petabytes (thousands of terabytes). With several authors having shown that the web is a reasonable model of language (Keller et al.), rather than the noisy and unreliable source many have predicted, its application to language processing tasks is surely simply a matter of time. While its use so far has largely been restricted to highly selective language modelling, the continuing advent of better storage and processing solutions suggests that its use as an offline corpus in the traditional sense is plausible. Similarly, several repositories of information (including the U.S. Library of Congress) are working on digitising large sections of their catalogues. The benefits of using extremely large corpora for statistical language modelling and its associated applications may be speculation at this stage, but the future looks promising.

References


[1] Banko, M. and Brill, E. (2001). Mitigating the Paucity of Data Problem. In Proceedings of the Conference on Human Language Technology.

[2] Kilgarriff, A. and Grefenstette, G. (2003). Introduction to the Special Issue on the Web as Corpus. In the ACL special issue on the Web as Corpus.

[3] Lapata, M. and Keller, F. (2005). Web-based Models for Natural Language Processing. ACM Transactions on Speech and Language Processing.

[4] Moore, R. (2001). There's No Data Like More Data (But When Will Enough Be Enough?). In Proceedings of the IEEE International Workshop on Intelligent Signal Processing.