Overview of Stemming Algorithms


Ilia Smirnov

DePaul University

03/12/08





This paper is an overview of the state-of-the-art in the area of stemming and lemmatization algorithms. It covers the basic ideas of "classical" (affix removal) techniques as well as some recent approaches such as stochastic algorithms. The scope of the paper is restricted to techniques for English and to language-independent techniques, due to the complexity and insufficient development of algorithms for languages with different morphologies.



1. Introduction


Many natural languages (Indo-European, Uralic and Semitic) are inflected. In such languages, several words sharing the same morphological invariant (root) can be related to the same topic. The ability of an Information Retrieval (IR) system to conflate words allows reducing the index and enhancing recall, sometimes even without significant deterioration of precision (as shown by Wessel Kraaij and Renee Pohlmann [9]). Conflation also conforms to users' intuition, because users don't need to worry about the "proper" morphological form of words in a query. Algorithms implementing automated conflation are known as "stemming" algorithms.


One of the key questions in conflation algorithms is whether automated stemming can be as effective (for IR purposes) as manual processing. Another important and related question is whether a word must be truncated exactly at the root morpheme boundary, or may be truncated at a linguistically incorrect point. Both questions were studied by W.B. Frakes [2] and were answered positively.


So although stemming often relies on knowledge of language morphology, its goal is not to find the proper meaningful root of a word. Instead, a word can be truncated at a position "incorrect" from the natural language point of view. For example, M.F. Porter's algorithm [17] can make the following productions:

probate -> probat

cease -> ceas

Apparently, the results are not morphologically correct forms of words. Nevertheless, since the document index and queries are stemmed "invisibly" to the user, this particularity should not be considered a flaw, but rather a feature distinguishing stemming from lemmatization (which is the task of finding a canonical form of a lexeme).


All stemming algorithms can be roughly classified as affix removing, statistical and mixed. Affix removal stemmers apply a set of transformation rules to each word, trying to cut off known prefixes or suffixes. The first such algorithm was described by J.B. Lovins in 1968, and a few more affix removal algorithms have been suggested since. The most often used is Porter's algorithm, published in 1980 [17] and eventually developed into a whole stemming framework, Snowball [18].


The major drawback of the affix removal approach is its dependency on a priori knowledge of language morphology. Statistical algorithms try to cope with this problem by finding distributions of root elements in a corpus. Such algorithms started evolving only recently, as the increase in computing power made feasible the heavy computations necessary for such approaches.


Mixed algorithms can combine several approaches. For example, an affix removal algorithm can be enhanced
by dictionary lookups for irregular verbs or exceptional plural/singular forms like "feet/foot".


The variety of stemming algorithms naturally brings up the question of their comparison. Though explicit measures like under-stemming (removing too little of a suffix) and over-stemming (removing too much) do exist, they are hard to use due to the lack of a standard testing set (and even the possibility of its creation is questionable). So usually, stemmers are compared indirectly by their effect on search recall. Performance characteristics (speed and storage requirements) are used as well.



2. Affix Removal Techniques


2.1. Trivial Algorithms


The simplest stemming algorithm truncates words at the N-th symbol (keeping words shorter than N letters unaffected). Though not used for stemming in real systems, the algorithm provides a good baseline for evaluating other algorithms (Chris D. Paice [15]).
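For illustration, here is a minimal Python sketch of such a trunc(N) baseline (the function name is ours):

def trunc_stem(word: str, n: int) -> str:
    """Trivial stemmer: truncate the word at the N-th symbol; words
    shorter than N letters pass through unaffected."""
    return word[:n]

print(trunc_stem("stemming", 5))  # -> "stemm"
print(trunc_stem("stem", 5))      # -> "stem" (unchanged)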


Another simple approach is the so-called "S"-stemmer, an algorithm conflating singular and plural forms of English nouns. It can be described with the following set of rules, applicable to words longer than three characters (Donna Harman [6]):

IF a word ends in "ies", but not "eies" or "aies"
    THEN "ies" -> "y"
ELSE IF a word ends in "es", but not "aes", "ees" or "oes"
    THEN "es" -> "e"
ELSE IF a word ends in "s", but not "us" or "ss"
    THEN "s" -> NULL
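These rules translate almost mechanically into Python; the sketch below follows the ELSE IF chain exactly as stated above (consult Harman [6] for edge cases not covered by this description):

def s_stem(word: str) -> str:
    # Words of three characters or fewer are left unchanged.
    if len(word) <= 3:
        return word
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ies" -> "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]         # "es" -> "e"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "s" -> NULL
    return word

print(s_stem("ponies"))  # -> "pony"
print(s_stem("horses"))  # -> "horse"
print(s_stem("focus"))   # -> "focus" (blocked by the "us" exception)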




2.2. Lovins Algorithm


Julie Beth Lovins' paper was the first ever published description of a stemmer. (Since the original article, Lovins, J.B. 1968: "Development of a stemming algorithm", Mechanical Translation and Computational Linguistics, 11, 22-31, is available only on a paid basis, the algorithm is presented here based on Dr. Porter's description [19].) It defines 294 endings, each linked to one of 29 conditions, plus 35 transformation rules. For a word being stemmed, an ending with a satisfying condition is found and removed. For example, for the word "nationally", two endings match: "ationally" with the rule "stem must be longer than 3 symbols", so it is rejected, and "ionally" with no restrictions, so it is removed, leaving the stem "nat". A suitable transformation rule is applied next. The aim of this step is to deal with doubled consonants ("sitting -> sitt -> sit"), irregular plurals ("matrix" and "matrices") and so on. The algorithm is very fast (faster than Porter's), but it misses certain endings, perhaps because it was influenced by the technical vocabulary used by the author.
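The longest-match mechanism can be sketched in a few lines of Python. The ending table below is a tiny hypothetical fragment (the real algorithm has 294 endings, 29 conditions and 35 recoding rules, which are omitted here):

# Each ending maps to its condition; the conditions shown follow the
# "nationally" example above and are illustrative only.
ENDINGS = {
    "ationally": lambda stem: len(stem) > 3,  # "stem longer than 3 symbols"
    "ionally":   lambda stem: True,           # no restrictions
}

def lovins_like_stem(word: str) -> str:
    for ending in sorted(ENDINGS, key=len, reverse=True):  # longest first
        if word.endswith(ending):
            stem = word[: -len(ending)]
            if ENDINGS[ending](stem):
                return stem  # the 35 recoding rules would be applied here
    return word

print(lovins_like_stem("nationally"))  # "ationally" rejected -> "nat"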



2.3. Porter's Algorithm and Snowball Framework


The Porter algorithm [17] defines five successively applied steps of word transformation. Each step consists of a set of rules in the form

<condition> <suffix> -> <new suffix>

For example, the rule

(m > 0) EED -> EE

means "if the word has at least one vowel and consonant plus the EED ending, change the ending to EE". So "agreed" becomes "agree" while "feed" remains unchanged.
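The condition (m > 0) refers to Porter's "measure" of the stem, the m in the form [C](VC)^m[V]. Below is a simplified Python sketch of the measure and of this single rule (real implementations also treat "y" as a conditional vowel, which is omitted here):

def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions: m in [C](VC)^m[V]."""
    m, prev_is_vowel = 0, False
    for ch in stem:
        is_vowel = ch in "aeiou"
        if prev_is_vowel and not is_vowel:
            m += 1  # one VC sequence completed
        prev_is_vowel = is_vowel
    return m

def rule_eed(word: str) -> str:
    """The rule (m > 0) EED -> EE from Porter's algorithm [17]."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]  # EED -> EE: drop the final "d"
    return word

print(rule_eed("agreed"))  # -> "agree" (m of "agr" is 1)
print(rule_eed("feed"))    # -> "feed"  (m of "f" is 0; rule does not fire)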


The algorithm is very concise (having just about 60 rules) and very readable for a programmer. It is also very efficient in terms of computational complexity. The main flaws and errors (like over-stemming for "police/policy") are well known and can be corrected to some extent with a dictionary. So it is no wonder that it became the most popular and standard approach to stemming.


Disappointed with the many incorrect implementations of his algorithm used in different studies, Dr. Porter not only published the standard implementation written in C and Java, but also developed a whole stemmer framework called Snowball. This framework provides a stemmer definition script language and a translator to ANSI C and Java. The main purpose of the project is to allow programmers to develop their own stemmers for other character sets or languages. Currently there are implementations for many Romance, Germanic, Uralic and Scandinavian languages, as well as English, Russian and Turkish, available on the project web site [18].



2.4. Paice/Husk Algorithm


The Paice/Husk stemmer is described in a Chris D. Paice article [14]. It is an iterative algorithm with one table containing about 120 rules indexed by the last letter of a suffix. On each iteration, it tries to find an applicable rule by the last character of the word. If there is no such rule, it terminates. It also terminates if a word starts with a vowel and there are only two letters left, or if a word starts with a consonant and there are only three characters left. Otherwise, the rule is applied and the process repeats. In another article [15], Chris D. Paice compared the Paice/Husk algorithm with Porter's and found that the Paice/Husk approach has a tendency to over-stem (being a "heavy" algorithm in his terminology).
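A skeletal Python version of the iteration is shown below. The three rules are a hypothetical fragment, and the real table carries extra flags (e.g. "apply only to an intact word", continue vs. terminate) that this sketch omits:

# Rules indexed by the last letter of the suffix: (suffix, replacement).
RULES = {
    "s": [("s", "")],    # strip plural "s"
    "g": [("ing", "")],  # strip "ing"
    "y": [("ly", "")],   # strip "ly"
}

def is_minimal(word: str) -> bool:
    """Stop stripping: two letters left for vowel-initial words,
    three for consonant-initial ones."""
    return len(word) <= (2 if word[0] in "aeiou" else 3)

def paice_husk_like(word: str) -> str:
    while not is_minimal(word):
        for suffix, replacement in RULES.get(word[-1], []):
            if word.endswith(suffix):
                word = word[: -len(suffix)] + replacement
                break
        else:
            return word  # no applicable rule: terminate
    return word

print(paice_husk_like("meetings"))  # "meetings" -> "meeting" -> "meet"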



2.5. Dawson Algorithm


The Dawson algorithm [1] can be considered an improvement of the Lovins approach. It follows the same longest-match process and has perhaps the most comprehensive list of English suffixes (along with transformation rules): about 1200 entries. The suffixes are stored in reversed order, indexed by their length and last letter. The rules define whether a found suffix can be removed (for example, only if the remaining part of the word is not shorter than N symbols, or only if the suffix is preceded by a particular sequence of characters). It seems that the algorithm didn't gain popularity due to its complexity and the lack of a standard reusable implementation.
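The storage scheme itself is easy to illustrate in Python (the suffix list below is a tiny hypothetical fragment of the roughly 1200 entries):

from collections import defaultdict

SUFFIXES = ["ations", "ation", "ness", "ing", "s"]

# Dawson-style index: suffixes kept reversed, keyed by length and last letter.
index = defaultdict(list)
for suf in SUFFIXES:
    index[(len(suf), suf[-1])].append(suf[::-1])

def candidate_suffixes(word: str):
    """Yield matching suffixes, longest first, as a longest-match
    process requires."""
    for length in range(len(word) - 1, 0, -1):
        for rev in index.get((length, word[-1]), []):
            if word[::-1].startswith(rev):
                yield rev[::-1]

print(list(candidate_suffixes("nations")))  # -> ['ations', 's']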



3. Statistical Algorithms


3.1. N-gram Stemming


Most stemmers are language-specific. Single N-gram stemming, suggested by James Mayfield and Paul McNamee [11], tries to bypass this limitation. The idea is to analyze the distribution of all N-grams in a document (with some rather high value of N, like 4 or 5, selected empirically). Since morphological invariants (unique word roots) occur less frequently than variable parts (common prefixes and suffixes, for example, "ing" or "able"), a typical statistic like inverse document frequency (IDF) can be used to identify them. An example for 4-grams of the word "juggling" in the CLEF 2002 collection shows a typical distribution [5]:


N-gram    Document Frequency      N-gram    Document Frequency
_jug      681                     glin      4,567
jugg      495                     ling      55,210
uggl      6,775                   ing_      106,463
ggli      3,003




The authors successfully tested the algorithm on eight European languages. Though N-gram stemming underperformed stems and required a significant amount of memory and storage for the index, its ability to work with an arbitrary language makes it useful for many applications.
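A sketch of the underlying statistic in Python (names are ours; the actual system of [11] indexes the n-grams themselves rather than stems):

import math
from collections import defaultdict

def ngrams(word: str, n: int = 4):
    """Character n-grams of a word, with "_" marking word boundaries."""
    padded = f"_{word}_"
    return {padded[i : i + n] for i in range(len(padded) - n + 1)}

def idf_table(documents, n: int = 4):
    """IDF of every n-gram: rare n-grams (high IDF) are root candidates,
    while frequent ones like "ing_" behave like affixes (cf. the table)."""
    df = defaultdict(int)
    for doc in documents:
        grams = set()
        for word in doc.split():
            grams |= ngrams(word, n)
        for g in grams:
            df[g] += 1
    return {g: math.log(len(documents) / count) for g, count in df.items()}

print(sorted(ngrams("juggling")))  # the seven 4-grams shown in the table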



3.2. HMM Algorithm


Another statistical approach to stemmer design was used by Massimo Melucci and Nicola Orio to build a stemming algorithm based on Hidden Markov Models (HMM) [12]. It needs neither prior linguistic knowledge nor a manually created training set. Instead, it uses unsupervised training, which can be performed at indexing time.

HMMs are finite-state automata with transitions defined by probability functions. Since the probability of each path can be computed, it is possible to find the most probable path (with Viterbi decoding) through the automaton graph. Each character comprising a word is considered a state. The authors divided all possible states into two groups (roots and suffixes) and two categories: initial (which can be roots only) and final (roots or suffixes). Transitions between states define the word building process. For any given word, the most probable path from initial to final states will produce the split point (a transition from roots to suffixes). Then the sequence of characters before this point can be considered as a stem.
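The split-point search can be illustrated with a deliberately simplified two-block model in Python. All probabilities below are hypothetical; in the real algorithm they are learned from the collection by unsupervised training, and the most probable path is found by Viterbi decoding:

import math

P_ROOT   = {c: 0.04 for c in "abcdefghijklmnopqrstuvwxyz"}
P_SUFFIX = dict(P_ROOT, **{"e": 0.10, "d": 0.10, "i": 0.12,
                           "n": 0.12, "g": 0.12, "s": 0.10})
T_ROOT_ROOT, T_ROOT_SUFFIX, T_SUFFIX_SUFFIX = 0.8, 0.2, 0.7

def best_split(word: str) -> int:
    """Most probable root/suffix boundary; enumerating the split points
    is equivalent to Viterbi decoding for this chain-shaped model."""
    best_i, best_logp = None, -math.inf
    for i in range(1, len(word) + 1):  # at least one root character
        logp = sum(math.log(P_ROOT[c]) for c in word[:i])
        logp += (i - 1) * math.log(T_ROOT_ROOT)
        if i < len(word):  # the path enters the suffix block
            logp += math.log(T_ROOT_SUFFIX)
            logp += sum(math.log(P_SUFFIX[c]) for c in word[i:])
            logp += (len(word) - i - 1) * math.log(T_SUFFIX_SUFFIX)
        if logp > best_logp:
            best_i, best_logp = i, logp
    return best_i

word = "walked"
print(word[: best_split(word)])  # -> "walk" with these toy parameters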


The authors considered three different topologies of HMM in their experiments. Using Porter's algorithm as a baseline, they found that the HMM stemmer had a tendency to over-stem words. Still, the ability of the HMM algorithm to work automatically with different languages (shown on five European languages) is beneficial.



3.3. YASS (Yet Another Suffix Stripper) Algorithm


Conflation can be viewed as a clustering problem with an a priori unknown number of clusters. Usually such problems can be solved with a hierarchical algorithm having a distance measure function for cluster elements. The resulting clusters are then considered as equivalence classes and their centroids as stems. This approach is used in the work of Prasenjit Majumder, et al. [20]. The authors suggested several distance measures rewarding long matching prefixes and penalizing early mismatches. In its simplest form, for two strings X = x0 x1 ... xn and Y = y0 y1 ... ym, a penalty function can be defined as

p_i = 0   if x_i = y_i, 0 <= i <= min(m, n)
p_i = 1   otherwise
Then the distance function is defined as

D(X, Y) = sum_{i=0}^{n} p_i / 2^i
Apparently, a threshold must be chosen to distinguish elements belonging to a cluster from those lying outside it. Unfortunately, it seems that this problem can be solved only empirically. Also, the approach requires significant computing power, especially if it is applied to a substantial lexicon. On the other hand, like most statistical algorithms, this one can be used for any language without knowledge of its morphology. The authors also showed that lexicon clustering enhances recall almost as well as Porter's algorithm.
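In Python, the measure above reads as follows (a sketch of only the simplest form; the authors define several refined variants in [20]):

def yass_distance(x: str, y: str) -> float:
    """D(X, Y) = sum of p_i / 2**i: early mismatches cost the most,
    so long shared prefixes give small distances."""
    n = max(len(x), len(y))
    return sum(
        0.0 if i < min(len(x), len(y)) and x[i] == y[i] else 1.0 / 2**i
        for i in range(n)
    )

print(yass_distance("astronomer", "astronomically"))  # ~0.008 (close)
print(yass_distance("astronomer", "astrology"))       # ~0.045 (farther apart)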



3.4. Corpus-Based Stemming


One of the major flaws of classical stemmers like Porter's is that they often conflate words with similar syntax but completely different semantics. For example, "news" and "new" are both stemmed to "new", although they belong to two quite different categories. Another problem is that while a stemming algorithm may be suitable for one corpus, it will produce too many errors on another. For example, "stock", "stocks", "stocking", etc. have a special meaning in the case of the Wall Street Journal. In their work [22], Jinxi Xu and W. Bruce Croft suggested an approach which allows correcting "rude" stemming results based on the statistical properties of the corpus used. The basic idea is to generate equivalence classes for words with a classical stemmer and then "separate back" some conflated words based on their co-occurrence in the corpus. This also helps prevent well-known incorrect conflations of Porter's algorithm, such as "policy/police", since the chances of these two words co-occurring are rather low. Using Porter's and trigram matching algorithms on three English corpora and one Spanish corpus, the authors showed significant improvement in retrieval efficiency (though it should be noted that separating conflated entries back almost canceled the results of stemming).
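The separation step can be illustrated with a toy Python sketch; the real algorithm of [22] uses a statistical co-occurrence significance measure and reclusters each class, which this fragment only hints at:

from itertools import combinations

def cooccurs(windows, w1: str, w2: str) -> int:
    """Count text windows containing both variants."""
    return sum(1 for win in windows if w1 in win and w2 in win)

def refine_class(conflation_class, windows, threshold: int = 1):
    """Keep together only variants that co-occur often enough."""
    keep = set()
    for w1, w2 in combinations(conflation_class, 2):
        if cooccurs(windows, w1, w2) >= threshold:
            keep.update({w1, w2})
    return keep

windows = [{"police", "policing"}, {"policy", "policies"}]
print(refine_class({"police", "policy", "policing"}, windows))
# -> {'police', 'policing'}: "policy" is separated back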



3.5. Context Sensitive Stemming


As a further development of corpus-based stemming, Fuchun Peng, Nawaaz Ahmed, Xin Li and Yumao Lu suggested context sensitive stemming for web search [4]. In their work, corpus analysis is used to find word distributional similarity. Then a few morphological rules from Porter's stemmer are applied to the similarity list to find stemming candidates, some of which are finally selected based on the handling purpose, for example, pluralization. The obtained forms are used to expand a search query over a non-transformed index. For example, for the word "develop", the top 10 similar words using cosine similarity and including right/left bigrams for context would be (there are more results discussed in [4]):



Rank   Candidate     Similarity      Rank   Candidate      Similarity
1      developing    0.339           6      tutoring       0.138
2      developed     0.176           7      analyzing      0.128
3      incubator     0.160           8      developement   0.128
4      develops      0.150           9      automation     0.126
5      development   0.148           10     berts          0.119



Applying the stemming rules retains "developing, developed, develops, development, developement", and for pluralization purposes only "develops" is selected. Hence, the user's query "develop" is expanded to "develop OR develops".
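The final expansion step is trivial; a hypothetical sketch:

def expand_query(term: str, variants) -> str:
    """Expand a query term with its selected variants on an
    unstemmed index."""
    return " OR ".join([term, *variants])

print(expand_query("develop", ["develops"]))  # -> "develop OR develops"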



4. Other Approaches


Two very common approaches to stemming involve derivational or inflectional morphology analysis. Both require vast language-specific databases: containing word groups organized by case, gender, number and other syntactical variations in the case of inflectional databases, or organized by base-derived forms (including changes in the part of speech) in the case of derivational ones. Similarly, compound word sets can be useful for some languages like German or Finnish. The strength of derivational and inflectional analysis lies in the ability to produce morphologically correct stems, cope with exceptions, and process prefixes as well as suffixes. But the major and obvious flaw of dictionary-based algorithms is their inability to cope with words that are not in the lexicon. Also, a lexicon must be manually created in advance, which requires significant effort.


The process of building derivational and inflectional stemmers is described in detail by Robert Krovetz in his article [10]. Interestingly, the author started with an attempt to improve Porter's algorithm by adding a dictionary check after each iteration. The goal was to stop stemming if a correct form of a word was found, and also to be able to process irregular forms. The resulting stemmer performed worse than the original, so Robert Krovetz developed his own inflectional algorithm instead. As the final step, a derivational approach was used to conflate words with the same meaning and to provide word-sense disambiguation. This combination of inflectional processing followed by derivational processing roughly doubled the number of stems corresponding to real words compared with Porter's algorithm, at almost the same performance. The most challenging problems were proper nouns ("Mooer" was stemmed to "moo" and "Navier-Stokes" to "stoke") and spelling errors.


Lemmatization can be considered an alternative to stemming, though its aim is word normalization rather than just stem finding. As a result of lemmatization, a word's suffix can be not only removed but substituted with a different one. An advantage is that normalized forms can be used for integration with many other applications or for word disambiguation. For example, a lemmatizer taking context into account would be able to distinguish the noun "saw" from the verb "saw" and convert the latter to "see", while a stemming algorithm won't even try to solve this problem. The main disadvantage of lemmatization is the need for a comprehensive lexicon, with all base forms along with all inflected ones (a full-form lexicon) or all base forms with a set of rules for deriving the inflected forms (a base-form lexicon). Such lexicons are created manually, which is a tedious and challenging job. Nevertheless, the WordNet and EuroWordNet projects are a good basis for building dictionary-based lemmatizers [21]. Also, a few new lemmatization approaches based on AI techniques have appeared recently, such as the usage of if-then rules with the Ripple Down Rule (RDR) induction algorithm [16] or a Naive Bayesian classifier [13].
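As a minimal example of a dictionary-based lemmatizer, NLTK exposes WordNet's morphological analyzer (assuming the nltk package and its WordNet data are installed; note that the caller must supply the part of speech, which is exactly the context information a plain stemmer lacks):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("feet"))          # -> "foot" (noun by default)
print(lemmatizer.lemmatize("was", pos="v"))  # -> "be"   (irregular verb)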



5. Stemming Effectiveness


The effectiveness and error-proneness of stemming algorithms have been studied by a few groups of scientists. All suggested approaches can be classified as direct or indirect methods. Direct evaluation assumes measuring some characteristic, like under- or over-stemming, on a special testing word set. Indirect methods usually estimate how a stemmer affects information retrieval recall. Apparently, direct evaluation requires creating a test set in advance, which is tedious manual work. On the other hand, indirect methods narrow down the scope by assuming only a particular usage of stemming. They are also very sensitive to the nature of the corpora and queries used in the testing.



Direct evaluation of under- and over-stemming errors using sets of manually conflated words was done by Chris D. Paice [15]. He compared the Lovins, Porter and Paice/Husk stemmers using length truncation as a baseline. While both the Porter and Paice/Husk stemmers showed better results than the Lovins stemmer, a direct comparison of the two was found impossible due to significant differences between the algorithms and hence different behavior on different data sets. Because of that, Dr. Paice suggested classifying stemmers as "light" and "heavy", i.e. having tendencies to under- or over-stem correspondingly, and using them for different tasks. In this classification, the Porter algorithm is a "light" one (similar to trunc(7)) while Paice/Husk is a "heavy" algorithm (similar to trunc(5)).


William B. Frakes in his article [3] discusses different direct metrics for stemmer strength evaluation. Some of them are: the mean number of words per conflation class; the index compression factor; the mean number of characters removed in forming stems; and the median and mean Hamming distance (the number of different characters in the same positions in two strings) between words and their stems. He also evaluated distances between stems obtained with the Lovins, Porter, Paice and S-removal algorithms and found strong similarity between the Lovins, Paice and Porter productions.
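Two of these metrics are straightforward to compute; a short Python sketch reflecting our reading of them (with the convention that, for strings of unequal length, the length difference is counted into the Hamming distance):

def hamming(word: str, stem: str) -> int:
    """Positionwise character differences plus the length difference."""
    diffs = sum(1 for a, b in zip(word, stem) if a != b)
    return diffs + abs(len(word) - len(stem))

def mean_chars_removed(pairs) -> float:
    """Mean number of characters removed in forming stems."""
    return sum(len(w) - len(s) for w, s in pairs) / len(pairs)

pairs = [("running", "run"), ("cats", "cat"), ("agreed", "agree")]
print(mean_chars_removed(pairs))          # -> 2.0
print([hamming(w, s) for w, s in pairs])  # -> [4, 1, 1]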


The results obtained from indirect evaluations are somewhat contradictory. Donna Harman [6, 7] studied how stemming affects the effectiveness of information retrieval and found almost no improvement for the Lovins, Porter and S stemmers. She explained that positively affected queries were offset by negatively affected ones (with high ranking of non-relevant documents retrieved as a result of stemming). She also reported a significant (up to two times) decrease in system performance for the Lovins stemmer. Nevertheless, Donna Harman admitted that a "weak" stemmer (like one removing plural forms and simple past tense endings) may be useful as answering users' expectations.


Robert Krovetz, in the article mentioned above [10], also estimated a stemmer's effect on retrieval performance. His results (with his own derivational stemmer) showed a 15-35% increase on some collections (CACM and NPL).


One of the most recent studies of affix removal stemming efficiency was made in 1996 by Wessel Kraaij and Renee Pohlmann [9]. They tried to verify some issues reported in previous research, so they followed Donna Harman's method. The authors found that inflectional stemming did significantly improve recall, though the simplest methods (such as the classical Porter algorithm) were not quite efficient.


Another 1996 article, by David A. Hull and Gregory Grefenstette [8], states that "most conflation algorithms perform about 5% better than no stemming, and there is little difference between methods" (the study compared the S, Lovins, Porter, and Xerox inflectional and derivational algorithms). The article tries to reconcile the results of different studies. The authors showed that stemming is inefficient in the case of well-defined queries over a small set of documents, but is efficient in all other cases. This explains the difference in the results obtained by Donna Harman and Robert Krovetz.


Carmen Galvez and Felix de Moya-Anegon compared the effectiveness of Porter's algorithm with a lemmatizer built using finite-state transducers [5]. The results showed significantly higher recall for stemming (0.97 vs. 0.73) but much lower precision (0.77 vs. 0.96).




6. Conclusions


It should be noted that stemming algorithms seem to be a rather narrow research problem with relatively few publications. So the question of finding "the best" conflation algorithm is still open, as is the problem of finding the best "tuning" parameters for existing algorithms (this is even explicitly stated as one of the goals of the Snowball project).


Also, most of the work has been done only for the English language, leaving a vast research field for studies of other languages, especially the highly inflected Slavic group, where the importance of conflation is much higher. At the same time, for Germanic and similarly highly compounding languages (combining several roots in a single word), alternative approaches like decompounding may prove more efficient.


From the author's point of view, classical stemmers have one major disadvantage: they can be used only in a "closed" environment, where both queries and the collections being searched are transformed in the same manner, opaquely to users. This puts restrictions on the system's integration ability. On the other hand, dictionary-based algorithms, including natural language processing approaches, allow integration with other applications, for example, for interactive query expansion or machine translation. Of course, they require constant updates to dictionaries due to language evolution, but this task is constantly performed by publishers and scientific groups. Also, growing computing power makes natural language processing approaches more feasible.



7. References


1. Dawson John. "Suffix removal and word conflation". ALLC Bulletin, Volume 2, No. 3. 1974, 33-46

2. Frakes W.B. "Term conflation for information retrieval". Proceedings of the 7th annual international ACM SIGIR conference on Research and development in information retrieval. 1984, 383-389

3. Frakes William B. "Strength and similarity of affix removal stemming algorithms". ACM SIGIR Forum, Volume 37, No. 1. 2003, 26-30

4. Fuchun Peng, Nawaaz Ahmed, Xin Li and Yumao Lu. "Context sensitive stemming for web search". Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. 2007, 639-646

5. Galvez Carmen and Moya-Anegon Felix. "An evaluation of conflation accuracy using finite-state transducers". Journal of Documentation, 62(3). 2006, 328-349

6. Harman Donna. "How effective is suffixing?" Journal of the American Society for Information Science. 1991; 42, 7-15

7. Harman Donna. "A failure analysis on the limitations of suffixing in an online environment". Proceedings of the 10th annual international ACM SIGIR conference on Research and development in information retrieval. 1987, 102-107

8. Hull David A. and Grefenstette Gregory. "A detailed analysis of English stemming algorithms". Rank Xerox Research Center Technical Report. 1996

9. Kraaij Wessel and Pohlmann Renee. "Viewing stemming as recall enhancement". Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval. 1996, 40-48

10. Krovetz Robert. "Viewing morphology as an inference process". Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval. 1993, 191-202

11. Mayfield James and McNamee Paul. "Single N-gram stemming". Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. 2003, 415-416

12. Melucci Massimo and Orio Nicola. "A novel method for stemmer generation based on hidden Markov models". Proceedings of the twelfth international conference on Information and knowledge management. 2003, 131-138

13. Mladenic Dunja. "Automatic word lemmatization". Proceedings B of the 5th International Multi-Conference Information Society IS. 2002, 153-159

14. Paice Chris D. "Another stemmer". ACM SIGIR Forum, Volume 24, No. 3. 1990, 56-61

15. Paice Chris D. "An evaluation method for stemming algorithms". Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. 1994, 42-50

16. Plisson Joel, Lavrac Nada and Mladenic Dunja. "A rule based approach to word lemmatization". Proceedings C of the 7th International Multi-Conference Information Society IS. 2004

17. Porter M.F. "An algorithm for suffix stripping". Program. 1980; 14, 130-137

18. Porter M.F. "Snowball: A language for stemming algorithms". 2001. http://snowball.tartarus.org/texts/introduction.html

19. Porter M.F. "The Lovins stemming algorithm". http://snowball.tartarus.org/algorithms/lovins/stemmer.html

20. Prasenjit Majumder, Mandar Mitra, Swapan K. Parui, Gobinda Kole, Pabitra Mitra and Kalyankumar Datta. "YASS: Yet another suffix stripper". ACM Transactions on Information Systems, Volume 25, Issue 4. 2007, Article No. 18

21. Toman Michal, Tesar Roman and Jezek Karel. "Influence of word normalization on text classification". The 1st International Conference on Multidisciplinary Information Sciences & Technologies. 2006, 354-358

22. Xu Jinxi and Croft Bruce W. "Corpus-based stemming using co-occurrence of word variants". ACM Transactions on Information Systems, Volume 16, Issue 1. 1998, 61-81