Automated Email Classification using Semantic Relationships

Janine Wicke
Master of Science Thesis


Stockholm, Sweden
2010






Master's Thesis in Computer Science (30 ECTS credits)
at the School of Computer Science and Engineering
Royal Institute of Technology, year 2010

Supervisor at CSC was Viggo Kann
Examiner was Stefan Arnborg



TRITA-CSC-E 2010:038
ISRN-KTH/CSC/E--10/038--SE
ISSN-1653-5715

Royal Institute of Technology
School of Computer Science and Communication

KTH CSC
SE-100 44 Stockholm, Sweden

URL: www.kth.se/csc







Abstract
In this Master's project, the use of semantic relationships in email classification with Support Vector Machines is examined. The corpus consists of emails in German. Semantically related words are mapped to structures. The approach is based on the theory of semantic fields. A graph is built with the help of the semantic relations between words looked up in a thesaurus.

Two search algorithms, breadth-first search and the Tarjan algorithm, are applied to identify graph components. The size of the structures is limited to a suitable maximal size. Disambiguation is done in three different ways with a graph-based approach. Experiments evaluate the results, comparing the values of the different variables (disambiguation, path length, information gain, search algorithm). The results show that the outcome for the variables is mostly homogeneous; the quality of a category depends on its character and its correlation to the thesaurus, i.e. the rate of words that are out-of-vocabulary but highly frequent in the category. An optimal configuration for each category can be found by an exhaustive search over all variable configurations.
Referat
Email classification using semantic relationships

In this Master's project, email classification using semantic relationships is examined. The corpus consists of emails in German. Semantically related words are collected into structures. The approach is based on the theory of semantic fields. Words are looked up in a thesaurus and a graph is built from the given relations. Two search algorithms, breadth-first search and Tarjan's algorithm, are used to find graph components. The structures are limited to a suitable size. Disambiguation is done with a graph-based method in three different ways. The method is evaluated and different variables are examined (disambiguation, the maximal path length, information gain, the search algorithm). The results show that the variables are largely equivalent. The quality of a category is connected to its character and its correlation with the thesaurus, i.e. how many words are highly frequent in the category but missing from the thesaurus. An optimal configuration can be found for every category by an exhaustive search over all possible configurations.
Contents
1 Introduction
2 Theoretical Background
  2.1 Text Classification
    2.1.1 General Overview about Text Categorization
    2.1.2 Machine Learning
    2.1.3 Support Vector Machines
    2.1.4 Quality Measurements
  2.2 Linguistic Background
    2.2.1 Ambiguousness of language
    2.2.2 Semantic relationships
    2.2.3 The theory of semantic fields
    2.2.4 The German language
  2.3 Search Algorithms
    2.3.1 Breadth-first search
    2.3.2 Tarjan's algorithm
3 Related Work
4 Method
  4.1 Tools and Corpus
    4.1.1 Minor Third
    4.1.2 Tree Tagger
    4.1.3 Thesaurus
    4.1.4 The Corpus
  4.2 Description of Test Framework
  4.3 Theoretical Reasoning about a Graph
  4.4 Algorithms and Implementation for Building Semantic Structures
    4.4.1 Details about implementation
    4.4.2 Construction of a graph
    4.4.3 Component Search
    4.4.4 Treatment of polysemous words
    4.4.5 Trimming graphs - pathlength
5 Evaluation - Experiments
  5.1 Corpus Analysis
    5.1.1 Decisions about the corpus
    5.1.2 Category System
    5.1.3 Preprocessing
    5.1.4 Categories sizes and other statistical data
    5.1.5 Correlation to the thesaurus
  5.2 Baseline Tests and Test Setup
    5.2.1 Test setup
    5.2.2 Variance
    5.2.3 Results in Baseline
  5.3 Experiments with Variables in the Graph
    5.3.1 Appearance
    5.3.2 Algorithms
    5.3.3 Disambiguation
    5.3.4 Path Length
    5.3.5 Characteristics of the Categories
    5.3.6 Summary of the Experiments
6 Conclusion and Future Work
Bibliography
Appendices
A Statistics about the Corpus
B Experiments
C Graph Gallery
  C.1 Good Graph Components
  C.2 Problematic Graph Components
  C.3 Large Graph Components
List of Figures
List of Tables
Chapter 1
Introduction
Text-based communication over digital media has become a very important means of communication; the number of emails sent every day is growing continuously. In large email-handling systems such as a customer support, emails have to be sorted and categorized according to their content.

The topic of this Master's thesis is automated email classification. The main idea is to apply semantic relationships for feature reduction. Hyperonym and synonym relationships are applied. Emails are classified according to their semantic content into the given category system. The classification algorithm is Support Vector Machines (SVM).

The approach used to model words and their semantic relationships is based on graph theory and the theory of semantic fields [16]. Keywords are looked up in a thesaurus and their relationships with each other are mapped in a graph. Two search algorithms, breadth-first search and the Tarjan algorithm, are applied to identify graph components, i.e. structures of semantically related words. A very central problem is to decide how to set the limits of semantic structures. The question is under which conditions synonyms are still "semantically related enough" to each other to be pooled together.

The email corpus used for this study consists of emails sent to the customer support of an Austrian telephone provider. The thesis was written at Artificial Solutions AB in Stockholm. The corpus is based on data provided by Artificial Solutions AB. The emails are in German.

The results of this study are evaluated by experiments. Results of classification using semantic relationships are compared to test results on the same corpus with a basic SVM.
Structure of the thesis This thesis report is structured as follows: The second chapter treats the theoretical background of this Master's thesis. Foundations of text classification and linguistics are described. In the next chapter a short overview of related work is given. The fourth chapter explains the method and experimental setup. The strategy and implementation of feature regrouping is explained and the tools used for this Master's project are presented. The last chapter deals with evaluation. It is divided into two parts. In the first part, the corpus is analyzed. The second part reports on the classification experiments. Finally, a conclusion is given.
Chapter 2
Theoretical Background
This chapter presents the theoretical background for this Master's project. The first part gives a short overview of text classification and machine learning and explains the algorithm Support Vector Machines (SVM), which is used in this Master's project. The second part introduces linguistic terms, while the third part treats the necessary graph theory and explains two search algorithms used for graph component search.
2.1 Text Classification
2.1.1 General Overview about Text Categorization
The discipline of text categorization deals with content-oriented search. Text documents are assigned to categories according to their semantic content. The set of categories or the hierarchy of categories (category system) is defined before categorization. This is the key difference from the sister discipline of text clustering, where categories have to be found for a number of given documents.

A classifier is a function that maps a text document to a category with a probability. Classifiers (considered in the scope of this Master's project) are binary, i.e. for every single category, one classifier has to be trained. All text documents belonging to a category are called positive examples; all other documents are called negative examples.

There are different machine learning approaches to text classification: in general, decision trees, neural networks and statistical methods can be distinguished. Statistical methods, especially Support Vector Machines, have been shown to be very appropriate for the problem of text classification [39].

A very central problem is the task of modeling natural language, as information is stored in complex semantic and grammatical structures. Usually, text documents are represented with the vector space model.
A text document is transformed into a so-called bag-of-words. Words (or terms) are features; their occurrence is counted in a vector. A feature vector describes a text document in the following way: each dimension d_t in the vector corresponds to a separate term t occurring in the corpus (the term corpus describes the collection of texts used for a classification project). The value of the dimension d_t weights the number of occurrences of the term t in the text in relation to the number of occurrences of the term t in the whole corpus. Several different ways of computing these term weights have been developed; one of the best known schemes is tf-idf weighting [35]. The SVM implementation Minor Third [8] provides a representation of the vector space model. Important characteristics for text categorization are the high dimensionality of the vector space and very sparse feature vectors.

This model is the simplest approach [41]. Even in a small corpus the number of different words rises to extremes (feature explosion).

Feature selection is the task of deciding which terms are taken into account when building feature vectors. A strict and brutal feature selection is one way to face the problem of feature explosion. Joachims [20] uses an information gain criterion; a term has to appear in at least three different documents.

Definition A term t has an appearance value of k if t appears in at least k different documents.
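To make the appearance criterion concrete, the following minimal Java sketch counts document frequencies and keeps only terms with an appearance value of at least k. It is an illustration of the criterion only; the actual feature handling in this project is done by Minor Third and the tools described in Chapter 4, and the class and method names here are invented for the example.

    import java.util.*;

    public class AppearanceFilter {
        /** Counts, for each term, the number of documents it occurs in
         *  and keeps only terms with an appearance value of at least k. */
        public static Set<String> selectFeatures(List<List<String>> documents, int k) {
            Map<String, Integer> docFrequency = new HashMap<>();
            for (List<String> doc : documents) {
                // count each term only once per document
                for (String term : new HashSet<>(doc)) {
                    docFrequency.merge(term, 1, Integer::sum);
                }
            }
            Set<String> features = new TreeSet<>();
            for (Map.Entry<String, Integer> e : docFrequency.entrySet()) {
                if (e.getValue() >= k) {
                    features.add(e.getKey());
                }
            }
            return features;
        }

        public static void main(String[] args) {
            List<List<String>> corpus = Arrays.asList(
                    Arrays.asList("rechnung", "telefon", "frage"),
                    Arrays.asList("rechnung", "vertrag"),
                    Arrays.asList("rechnung", "telefon"));
            // with k = 2 only "rechnung" and "telefon" survive
            System.out.println(selectFeatures(corpus, 2));
        }
    }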
2.1.2 Machine Learning
Machine learning is a very large discipline in artificial intelligence. The major focus of machine learning research is to extract information from data automatically, by computational and statistical methods. The basic principle is to generate knowledge automatically from experience. An algorithm learns by example, trying to extract rules and principles. In that way, unknown examples can be classified.

Several forms can be distinguished; the best known are supervised, unsupervised and reinforcement learning. In the first form, labeled data are available. A larger part is used to train a classifier; a smaller part is kept aside to test it afterward. For unsupervised learning, no labeled data are available and the agent has to model a set of inputs. In the case of reinforcement learning, the algorithm learns a policy of how to act by being rewarded or punished depending on its performance.

In our case, supervised learning is applied. Machine learning has two main phases: the given set of annotated data is separated into two subsets. In the training step, the classifier is "taught" by a set of given examples. A smaller part is left out for testing. Those examples are unknown to the classifier. By comparing the results achieved by the classifier to the annotations, the quality of the classifier is evaluated.
Dealing with computational learning, we cannot expect to gain correctness and completeness of the learning method. According to the theory of PAC learning (probably approximately correct learning) proposed by L. Valiant [23] and Vapnik-Chervonenkis [43], the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that the selected function will have low generalization error with high probability. The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.

However, there are two further requirements which are very crucial for the definition of correctness of a learner: time complexity and feasibility of learning.

• A function must be learnable in polynomial time.
• The result of learning is to be a concept definition or a function that recognizes a concept.
The following definitions are quite theoretical, so some terms need to be introduced. For a given set X of all possible examples, a concept c is a subset of examples (those expressing the concept c). A concept class C is a subset of the power set of X. A concept c is often defined in the same way as the classification of the concept: c can be considered as a function mapping X to 1 (for positive examples) or to 0 (for negative examples). In the field of application of this Master's project, a concept c can be considered as a (well-defined) email category. The actual learning result is called the hypothesis h. A hypothesis h and the hypothesis space H are defined in a similar way. A hypothesis h corresponds to the set of emails that a trained classifier assigns to a category. If the classifier assigns all given emails (for all x in a learning set E) to the actual category (h(x) = c(x)), it is called consistent for the given examples.
Definition: PAC (probably approximately correct)-learnable A concept class C with examples x of size |x| is PAC-learnable by a hypothesis space H if there is an algorithm L(δ, ε) that

• for all arbitrary but constant probability distributions D over x
• for all concepts c ∈ C
• in time polynomial in 1/ε, |C|, |x|
• with a probability of at least 1 − δ

returns a hypothesis h ∈ H whose error is not higher than ε. We can also say that L is a PAC learning algorithm for C.
The PAC theorem can be proved with the Vapnik-Chervonenkis dimension (cf. [31, 22ff]). This concept helps to illustrate the expressive strength of a hypothesis class. It is a measure of the capacity of a statistical classification algorithm, defined as the cardinality of the largest set of points that the algorithm can shatter.

Definition: Shattering a set of examples Given a hypothesis space H over X and a subset S of X containing m elements, S is shattered by H if for all S' ⊆ S there is a hypothesis h_{S'} ∈ H that covers S', that is S ∩ h_{S'} = S'. All subsets of S are recognized by hypotheses in H.
Definition: Vapnik-Chervonenkis dimension (VC-dim) The Vapnik-Chervonenkis dimension of H, VCdim(H), is defined as the number of elements of the largest set S that is shattered by H:

\mathrm{VCdim}(H) = \max\{m : \exists S \subseteq X,\ |S| = m,\ H \text{ shatters } S\}   (2.1)

This dimension indicates how many differences H can express. If no maximal shattered set exists, VCdim is infinite. To shatter a set of size m, at least 2^m hypotheses are needed. Calculating VCdim exactly is often very difficult.
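As an illustration of the definition (a standard textbook example, not taken from this thesis), consider linear classifiers in the plane:

    % Hypothesis class: half-planes h(x) = sign(w . x + b) in R^2.
    % Any three points in general position can be shattered: all 2^3 = 8
    % labelings are realizable by some separating line. No set of four points
    % can be shattered: either one point lies in the convex hull of the other
    % three, or the XOR-style labeling of the two diagonals of the
    % quadrilateral is not linearly separable. Hence
    \mathrm{VCdim}(H_{\text{linear},\,\mathbb{R}^2}) = 3,
    % consistent with the remark above that shattering m points requires at
    % least 2^m hypotheses; in general, linear classifiers in R^n have
    % VC dimension n + 1.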
2.1.3 Support Vector Machines
Support Vector Machines (SVM) are used in many different areas where many features are involved in the classification task, such as image classification and handwriting recognition.

Statistical learning theory describes the exactness of the conclusions drawn from a set of seen examples about unseen examples. SVMs learn by example. The overall goal is to generalize the training data. This is done by creating a hyperplane in the vector space and maximizing the margin between the two classes. The hyperplane divides the vector space into positive and negative areas according to membership in the category, as Figure 2.1 illustrates.

Figure 2.1. A binary classification problem with positive (+) and negative (-) examples. The picture on the left side illustrates that all hyperplanes h_1 to h_4 divide positive from negative examples. On the right side, the hyperplane h* maximized by SVM is shown (cf. [33, p. 27]).
The simplest version of an SVM uses a linear classification rule. Given is a set of training data E with l examples: E = (x_1, y_1), (x_2, y_2), ..., (x_l, y_l). Every example consists of a feature vector x ∈ X and a classification of this example y ∈ {+1, −1}. The classification rule is defined as (cf. Eq. 2.2):

h(\vec{x}) = \mathrm{sign}\Bigl(b + \sum_{i=1}^{n} x_i w_i\Bigr) = \mathrm{sign}(\vec{w}\cdot\vec{x} + b)   (2.2)
where w and b are the two variables adapted by the SVM. w is the weight vector assigning a weight to every feature. The variable b is a threshold value. If w · x + b > 0 then the example will be classified as positive. The task the SVM has to accomplish is to fulfill the following inequalities (cf. Eq. 2.3). It can be considered as an optimization problem, as the SVM maximizes the margin of the hyperplane h* (cf. Fig. 2.1).

y_1 \frac{1}{\lVert\vec{w}\rVert}\,[\vec{w}\cdot\vec{x}_1 + b] \ge \delta, \;\ldots,\; y_l \frac{1}{\lVert\vec{w}\rVert}\,[\vec{w}\cdot\vec{x}_l + b] \ge \delta   (2.3)

δ describes the distance to the hyperplane of the examples with the vectors closest to the hyperplane, the so-called support vectors. They give the algorithm its name: only these support vectors determine the maximal hyperplane, which is an interesting feature of the SVM algorithm; leaving out all other examples, the result of the SVM would be the same [33].

Vapnik shows that the maximal hyperplane really is maximal by formulating the expected error, which is limited by the number of support vectors. The detailed presentation of the mathematical proof can be found in [43] and in more detailed articles about SVM such as [31] and [33].
To allow training mistakes (mislabeled examples), a slight modification was suggested by Vapnik and Cortes in 1995 [9]: the Soft Margin method (cf. Eq. 2.4, 2.5). If there exists no hyperplane that can split the "yes" and "no" examples, the Soft Margin method will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. The introduced slack variables ζ_i measure the degree of misclassification of the datum x_i.

\forall i:\; c_i(\vec{w}\cdot\vec{x}_i - b) \ge 1 - \zeta_i   (2.4)

The objective function is then increased by a term which penalizes non-zero ζ_i, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the problem transforms to:

\min\Bigl\{\tfrac{1}{2}\lVert\vec{w}\rVert^2 + C\sum_{i=1}^{l}\zeta_i \;:\; c_i(\vec{w}\cdot\vec{x}_i - b) \ge 1 - \zeta_i,\ \forall i\Bigr\}   (2.5)
Joachims first applied SVMs to text categorization and achieved outstanding results [20]. He argues that SVMs are well suited for text categorization because they are robust in cases of high-dimensional vector spaces with sparse feature vectors.

One more outstanding characteristic of the SVM algorithm is that the kernel function can be extended very easily to a non-linear function. This is known as the kernel trick [39]. It is done by projecting the data into a vector space of a higher dimension; the function can still be calculated efficiently, as Boser et al. showed, by calculating the scalar product of this function [7]. For the application area of text categorization, usually linear kernels are used [20].
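Once w and b have been learned, classifying a new email reduces to evaluating Eq. 2.2 on its (sparse) feature vector. The following minimal Java sketch shows only this decision step; it is an illustration, not Minor Third's API, and the sparse map representation and names are assumptions of the example.

    import java.util.Map;

    public class LinearDecision {
        /** Evaluates sign(w . x + b) for a sparse feature vector x. */
        public static int classify(Map<String, Double> weights, double bias,
                                   Map<String, Double> featureVector) {
            double score = bias;
            for (Map.Entry<String, Double> feature : featureVector.entrySet()) {
                // features absent from the weight vector contribute nothing
                score += weights.getOrDefault(feature.getKey(), 0.0) * feature.getValue();
            }
            return score > 0 ? +1 : -1;
        }
    }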
2.1.4 Quality Measurements
To be able to compare and understand the quality of the results of a classification experiment, a quality measure is needed. A classification experiment can be considered as a search problem: the classifier searches for emails belonging to the given category in a number of uncategorized emails.

In general, two basic questions are important to evaluate a classifier:

• Do all documents that the classifier assigns to the target category really belong to that category?
• Does the classifier find all documents belonging to the given category?

The first question focuses on the classifier's accuracy (precision), the second on the classifier's breadth (recall). To calculate these measures, we introduce the confusion matrix.

Definition: confusion matrix A confusion matrix maps the result of a classifier against given classification information. A relevant document is a document that belongs to the target category according to the given information. True positive documents are relevant documents that the classifier has found; true negatives are non-relevant documents that the classifier has correctly left out.

                relevant              not relevant
found           true positive (tp)    false positive (fp)
not found       false negative (fn)   true negative (tn)
Definition: precision The precision of a classifier is the ratio of the true positive documents to all documents the classifier has returned:

p = \frac{tp}{tp + fp}   (2.6)

Definition: recall The measure recall is calculated by dividing the number of true positive documents by the number of all relevant documents:

r = \frac{tp}{tp + fn}   (2.7)

To illustrate the definitions of precision and recall, consider the extreme cases: if the classifier returns all documents, recall is maximal, while precision only reflects the proportion of relevant to non-relevant documents in the corpus. If the classifier returns only a single relevant document, precision is maximal but recall is minimal. Precision and recall are normally negatively correlated: increasing precision results in decreasing recall and vice versa.
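A small sketch of how both measures are computed from the confusion matrix counts (the counts themselves come from comparing classifier output with the annotations; the class below is an illustration, not part of the test framework):

    public class ConfusionMatrix {
        final int tp, fp, fn, tn;

        ConfusionMatrix(int tp, int fp, int fn, int tn) {
            this.tp = tp; this.fp = fp; this.fn = fn; this.tn = tn;
        }

        /** Precision: fraction of returned documents that are relevant. */
        double precision() {
            return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
        }

        /** Recall: fraction of relevant documents that were found. */
        double recall() {
            return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
        }

        public static void main(String[] args) {
            // e.g. 40 relevant emails found, 10 false alarms, 20 relevant emails missed
            ConfusionMatrix m = new ConfusionMatrix(40, 10, 20, 930);
            System.out.printf("p = %.2f, r = %.2f%n", m.precision(), m.recall());
            // prints p = 0.80, r = 0.67
        }
    }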
2.2 Linguistic Background
This section gives a short overview of the theory of semantic meaning, relationships and fields. It introduces essential terms and concepts for the linguistic background of this project and also describes the problem of the structure of language.

"Structure is the most general and deepest feature of language", states Wilhelm von Humboldt [40]. Words can be grouped in families and fields of semantically related terms. The school of European Structuralism tries to define and to explain the structures in language. A very essential way to model a word's meaning is the two-component model of de Saussure: a word is described by its phonetic component (signifiant) and by its semantic component (signifié). The semantic component can consist of a concrete or abstract concept. The meaning of words, especially of those with an abstract concept, is often explained using the referential model: to explain one term, other terms are needed. Language is reflexive. But language is also blurred: a concept can hardly be defined exactly, and it is not possible to enumerate all the characteristic features that define one concept. Blank [4] hits the bull's-eye of this problem with his well-known example question: "Considering the concept of a dog and trying to define the concept 'dog-like' as an animal with the feature [has four legs], is a dog with only three legs still a dog?"
2.2.1 Ambiguousness of language
Language is ambiguous. Ambiguousness can happen on several levels: two words can be connected to the same phonetic sound chain, and homograph words have the same spelling. But even for groups of words and whole sentences, ambiguousness is a very common phenomenon; the sentence John told Robert's son that he must help him [28], for example, changes its meaning according to the assignment of the personal pronouns. For email classification, ambiguousness on the word level is most important, as words usually are used as a basis to determine feature vectors.

Ambiguousness on a lexical level is called polysemy. A polyseme is a word or phrase with multiple, related meanings (called lexemes or senses of meaning). The different lexemes must have a common semantic core. This has to be distinguished from homonymy: homonym words do not have any related meaning, but their forms have coincidentally converged to the same form because of sound shifting and other linguistic developments. Properly speaking, a homonym is two signs with the same phonetic form. A standard example [4] for homonymy is the French verb louer, which has two different meanings, 'praise' and 'hire, rent'; these two lemmas derive from two different Latin origins, the verbs laudare and locare.

Word sense disambiguation describes the method of identifying the appropriate meaning of a polysemous word in a given context. This is a very prominent problem in computational linguistics, as it appears in different disciplines of natural language processing such as machine translation, information retrieval, text classification and question answering. But considering the last 25 years of research in artificial intelligence, it has to be conceded that there is no global disambiguation algorithm [44]. Depending on the type of application, disambiguation of words can be achieved by combining several knowledge sources using selection criteria and encyclopedic knowledge. Dealing with written language, information about part-of-speech is essential to distinguish polysemous words and expressions. (Dealing with spoken language, further cues can help: the dialog system SMARTKOM is able to distinguish irony and sarcasm by taking the speaker's mimics into account [45].)

A very efficient method for the disambiguation of different words that coincide in a homograph form is part-of-speech tagging (POS-tagging) and lemmatisation, where the grammatical word class is identified and the word is brought back to its basic form. For example, the German word Lachen can be translated with laughter or puddle (Lache in its basic form); in the sentence Die Lachen waren gross (The puddles were big) a POS-tagger identifies the word Lachen as a plural form of Lache (puddle) because of the verb form, the word's function in the sentence, etc. The POS-tagger used in this Master's project is described in Section 4.1.2.

Ambiguous polysemous words cannot be resolved with a POS-tagger (Fig. C.7 in the appendix illustrates an example of the polysemous word Leiter, conductor (elec.) and head, leader, chief, with two lexemes united in one node). Most approaches to word sense disambiguation use the context the word occurs in to identify the right lexeme. The well-known Yarowsky algorithm [21, p. 638 ff.] is based on two basic assumptions borrowed from the linguistic discipline of discourse analysis [4] (an implementation of the algorithm [29] achieves 96% accuracy, but it requires a database that maps a lexeme to a context or a set of possible contexts, such as [48]):

1. One sense per collocation: the collocations (words in the neighborhood of the concerned word) are very useful for defining the specific sense of a word.
2. One sense per discourse: the sense of a polysemous word that appears several times in a document is usually the same throughout that document.
2.2.2 Semantic relationships
There are different types of semantic relationships; the most well-known are synonymy, antonymy and hyperonymy. Defining synonymy is one of the central problems in semantics; the dispute about this topic is as old as the first reflections on language [51]. Gauger proposes a wide definition, defining synonymy not as identity of meaning but as an "alikeness in meaning" [15]. There are different types or degrees of synonymy: partial synonymy describes a simple alikeness in meaning. "Proper" synonymy implies that two terms can be exchanged without changing the semantic content of the context. Total synonymy implies the complete identity of two terms, that is, their commutability in every context. The problem is that the "degree of alikeness" is hard to measure and to compute. An empirical approach is to define synonym pairs by votes of native speakers; online thesauri such as [22] are based on this approach, where users vote for proposed synonym pairs.

Hyperonymy is often described as the "lexicographical relationship" and can also be classified as a special case of synonymy. In computational linguistics, this relationship is also known as an IS-A relationship. In contrast to synonymy, hyperonymy has beneficial qualities thanks to its stricter definition: hyperonymy is transitive (cf. [4]): if a dog is a mammal and a dachshund is a dog, then a dachshund is a mammal, too. This fact greatly facilitates the use of this relationship in a computational context. Antonyms express a contrary relationship. There are different types of antonymy as well, but those can be omitted here, since the thesaurus used in this Master's project does not represent antonym relations [32].
2.2.3 The theory of semantic fields
The theory of semantic fields picks up Humboldt's thought, describing the semantics of language in groups of semantically related words, so-called semantic fields, synfields or subsynsystems (Lyons). The relationships between different words construct a larger structure, a field. Depending on how tightly the field definition is drawn, and where the limits are set, the field describes one single concept in its different aspects, and the meanings of the single field elements become equivalent to a certain degree.

Coseriu and Geckeler [16] define a semantic field as a continuum of semantic content describing one concept, the archilexeme. This superordinate concept or umbrella term unites all semantic features of the field elements. The different field elements are in opposition to each other with respect to one semantic feature. A semantic feature (seme) is the smallest unit of meaning recognized; it is a binary variable, i.e. the feature is there or not. In the semantic field of cooking terms in English, for example, fry and roast share the feature [+method of cooking meat] but only roast has the feature [+oven] (cf. [6, p. 11]).

The notion of semantic fields in its most strict and idealistic definition can be compared to the mathematical concept of equivalence classes (this comparison is not to be taken at face value, since language does not obey mathematical rules; its purpose is to enhance understanding of the linguistic concept):

• Language is organized in structures or fields. The congruence representative is the archilexeme.
• In a field, synonyms are equivalent to each other.
• Among the elements of a semantic field, the synonym relation is transitive.

A main problem is to set up limits for semantic fields, as borders between semantic fields become blurred because of polysemous words and unclear borders between the fields of meaning of different words [25]. To define those borders properly, a very detailed analysis of collocations is necessary (a collocation is a group of words that often appears in the context of the concerned word and helps to define its meaning in this concrete part of speech [5]). Creating a semantic field manually, with a proper analysis of the opposition of every single pair of field elements, is a lot of work. In this project we are therefore not going to find semantic fields in the strict sense; we only look for structures of semantically related words, as the strict definitions cannot currently be fulfilled by the computational approach.

A semantic field can be modelled as a graph: words represent nodes, their relationships are edges. The center of a semantic field, the archilexeme [16], represents the complete meaning core of the semantic field. Using the graph-based approach, evidence for the archilexeme is a maximal number of outgoing edges and a position in the very center of the field [47].
2.2.4 The German language
The German language is a West Germanic language. German is usually cited as an outstanding example of a highly inflected language. Words are inflected according to the four-case system; for verbs, several conjugation schemata exist. With four cases and three genders plus plural there are 16 distinct possible combinations of case, gender and number [14]. In German orthography, nouns and most words with the syntactical function of nouns are capitalized, which is supposed to make it easier for readers to find out what function a word has within the sentence (Das Auto war kaputt - the car was broken).

In German, noun compounds are formed where the first noun modifies the category given by the second, for example Telefonrechnung (telephone bill). Unlike English, where newer compounds or combinations of longer nouns are often written in open form with separating spaces, German (like the other Germanic languages) nearly always uses the closed form without spaces. The longest German word verified to be actually in (albeit very limited) use is Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz ("beef labeling supervision duty assignment law") [19].
2.3 Search Algorithms
Before the two search algorithms, breadth-first search and Tarjan's algorithm, are presented, some basic terms have to be defined (cf. [11]).

In an undirected graph G, two vertices u and v are called connected if there is a path in G from u to v. If every pair of distinct vertices in a graph G can be reached through some path in the graph, the graph is called connected. If G is a directed graph, it is called weakly connected if the graph obtained by replacing all directed edges with undirected ones is connected. It is strongly connected (or strong) if it contains a directed path from u to v and a directed path from v to u for every pair of vertices u, v.

A connected component is a maximal connected subgraph of G. The strongly connected components are the maximal strongly connected subgraphs. Each vertex belongs to exactly one connected component, as does each edge. A disconnected undirected graph G is composed of a set of at least two connected components.
2.3.1 Breadth-first search
Breadth-first search (bfs) is a uniform search algorithm that aims to visit every node of a graph exhaustively. Starting from the start node and expanding the graph level by level, the graph is searched in its breadth.
given: starting node u
       empty queue Q
       empty component C

unmark all vertices
choose the starting vertex u
mark u
enqueue u
add u to C
while Q not empty
    dequeue a vertex u from Q
    visit u
    for each unmarked neighbor v of u
        mark v
        add v to the end of the queue Q
        add edge (u,v) to the component C
        add v to C
    end for
end while

Figure 2.2. Pseudocode of the breadth-first search algorithm (bfs)
Starting from a given start node u (cf. Fig. 2.2), it is tested for each neighbor v of u whether v has already been visited. If node v has not yet been discovered, it is added to a waiting queue and expanded in a later step. After all neighbors of u have been considered, the same procedure is applied to the next node in the waiting queue.

The algorithm has a complexity of O(|V| + |E|), where |V| is the number of vertices and |E| the number of edges.

With a slight modification, the bfs algorithm can be used to identify all connected components in a graph. The original bfs algorithm is enclosed in a further while-loop, and a further queue administrates the vertices, marking nodes that have already been found in a component and providing new starting vertices to the bfs algorithm. The complexity of the modified algorithm is still O(|V| + |E|).
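For illustration, a compact Java version of the modified bfs, returning all connected components of an undirected graph given as adjacency lists, could look as follows (class and method names are chosen for the example and are not the Synfield implementation):

    import java.util.*;

    public class BfsComponents {
        /** Returns the connected components of an undirected graph. */
        public static List<Set<String>> components(Map<String, List<String>> adjacency) {
            Set<String> marked = new HashSet<>();
            List<Set<String>> components = new ArrayList<>();
            // the outer loop provides a new start vertex for every unvisited node
            for (String start : adjacency.keySet()) {
                if (marked.contains(start)) continue;
                Set<String> component = new HashSet<>();
                Deque<String> queue = new ArrayDeque<>();
                marked.add(start);
                queue.add(start);
                while (!queue.isEmpty()) {
                    String u = queue.poll();
                    component.add(u);
                    for (String v : adjacency.getOrDefault(u, List.of())) {
                        if (marked.add(v)) {   // true only for unmarked neighbors
                            queue.add(v);
                        }
                    }
                }
                components.add(component);
            }
            return components;
        }
    }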
2.3.2 Tarjan’s algorithm
The Tarjan algorithm (tarj), named after its inventor Robert Tarjan, finds strongly connected components in a directed graph (cf. [42]). The basic idea of the algorithm is to execute depth-first search from a given starting vertex (cf. Fig. 2.3). The identified components are subgraphs of the depth-first-search tree. The root of such a subgraph is the root of the strong component.

Visited vertices are put on a stack in the order of their visit. When depth-first search returns from a subgraph, the stack is emptied step by step, deciding whether the current start vertex v of this subgraph is the root vertex of a strong component. While performing depth-first search, the visited vertices v' in the supposed subgraph are indexed in this order (for vertex v' this index is called v'.dfs). Additionally, each node v is assigned a value v.lowlink, the minimum dfs index reachable from v along the explored paths: v.lowlink := min {v'.dfs : v' is reachable from v}. This value is calculated during runtime.

A node v is identified as the root vertex of a strong component if and only if v.lowlink = v.dfs.

The Tarjan algorithm has a linear time complexity. The tarjan procedure is called once for each node; the forall statement considers each edge at most twice. The algorithm's running time is therefore linear in the number of edges in G, O(|V| + |E|).

Applying the same modification as for the bfs algorithm, all components in a disconnected graph can be found.
Input: Graph G = (V, E)

index = 0
empty stack S
empty component C

forall v in V do
    if (v.dfs is undefined)
        tarjan(v)
end forall

procedure tarjan(v)
    v.dfs = index
    v.lowlink = index
    index = index + 1
    S.push(v)
    forall (v, v') in E do
        if (v'.dfs is undefined)
            tarjan(v')
            v.lowlink = min(v.lowlink, v'.lowlink)
        else if (v' is in S)
            v.lowlink = min(v.lowlink, v'.dfs)
    end forall
    if (v.lowlink == v.dfs)
        repeat
            v' = S.pop()
            C.add(v')
        until (v' == v)
end procedure
Figure 2.3.Pseudocode of the Tarjan algorithm (tarj)
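A corresponding Java sketch of Tarjan's algorithm (a recursive variant; again the class and field names are illustrative and not Synfield's classes) could look as follows:

    import java.util.*;

    public class TarjanScc {
        private final Map<String, List<String>> graph;
        private final Map<String, Integer> dfs = new HashMap<>();      // v.dfs
        private final Map<String, Integer> lowlink = new HashMap<>();  // v.lowlink
        private final Deque<String> stack = new ArrayDeque<>();
        private final Set<String> onStack = new HashSet<>();
        private final List<List<String>> components = new ArrayList<>();
        private int index = 0;

        public TarjanScc(Map<String, List<String>> graph) { this.graph = graph; }

        public List<List<String>> strongComponents() {
            for (String v : graph.keySet()) {
                if (!dfs.containsKey(v)) tarjan(v);
            }
            return components;
        }

        private void tarjan(String v) {
            dfs.put(v, index);
            lowlink.put(v, index);
            index++;
            stack.push(v);
            onStack.add(v);
            for (String w : graph.getOrDefault(v, List.of())) {
                if (!dfs.containsKey(w)) {
                    tarjan(w);
                    lowlink.put(v, Math.min(lowlink.get(v), lowlink.get(w)));
                } else if (onStack.contains(w)) {
                    lowlink.put(v, Math.min(lowlink.get(v), dfs.get(w)));
                }
            }
            if (lowlink.get(v).equals(dfs.get(v))) {   // v is the root of a strong component
                List<String> component = new ArrayList<>();
                String w;
                do {
                    w = stack.pop();
                    onStack.remove(w);
                    component.add(w);
                } while (!w.equals(v));
                components.add(component);
            }
        }
    }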
Chapter 3
Related Work
A lot of research has been done comparing algorithms in text categorization. Joachims was the first to apply Support Vector Machines (SVM) to text classification [20]. A number of studies confirmed these results, comparing different algorithms and settings against each other [54, 41].

Different techniques have been applied in different studies, such as co-training [24] and testing different kernels [30]. (In co-training, instead of training one classifier, two classifiers are used; the view on the data of one classifier is used to train the second one. In the case of only small amounts of labeled data and large amounts of unlabeled data, co-training has been shown to be very efficient. The feature sets for the two classifiers have to be independent.) The general goal is to increase the performance of email classifiers.

Features are the entities used as basic information for classification. In text classification, words are usually used as features in the bag-of-words model (BOW), i.e. words are represented as an unordered set disregarding word order and grammar. A very central problem is the high variety of words in natural language, which leads to a very high number of features. Because of this, a smart way to reduce or select features can have a deep impact on the quality of classification.

A basic idea is to identify features that are characteristic for a category and to cut off unimportant features. Aggressive feature selection usually implies a reduction of 90-95%, using statistical methods such as information gain or chi-square. This idea seems to work quite well in different studies and for different corpora [10]. However, there are pros and cons for aggressive feature selection: it may result in a loss of information, and SVMs can handle high-dimensional input spaces [34] anyway.

Linguistically motivated strategies for feature selection touch the bag-of-words model and aim to improve the representation of information. Crawford et al. [10] found improvements for classification with SVM for their corpora using phrase selection based on statistically selected 1- and 2-grams (an n-gram is a subsequence of n items in a given sequence; in the sequence abcab the 2-gram ab occurs two times). Youn and McLeod built a spam filter using an ontology based on keywords [53].
Instead of single words, semantic concepts can be treated as features. From a linguistic point of view, words similar or alike in meaning can be pooled into one semantic concept.

Yang and Callan built an ontology from a large email corpus. In order to identify similar concepts, a graph-based approach involving WordNet is used. Hyperonym relationships with up to two levels in between and 2-grams are used to regroup concepts [52]. Wermter and Hung find hyperonym relationships useful to build self-organizing maps for text classification [46]. (Self-organizing maps are a kind of artificial neural network trained with unsupervised learning to produce a low-dimensional representation of the input space of the training samples, called a map; unlike other neural networks, they use a neighborhood function to preserve the topological properties of the input space.) In all three studies WordNet [13] is the knowledge base used to map semantic relationships. WordNet is a lexical database, a kind of machine-readable thesaurus for the English language; it groups English words into sets of synonyms and records the various semantic relations between these synonym sets.
Chapter 4
Method
Classifying emails involves several working steps, but feature selection and feature reduction are very central problems. The main thought behind this Master's project is the use of semantic relationships between keywords to regroup them into one feature. This is done by a graph-based approach. The words are vertices, connected to each other. Edges are the relationships between words, given by a thesaurus. To identify components in the graph, two search algorithms (tarjan and bfs) are applied. A real challenge is to identify graph components and to limit their size in such a way that the words involved are still semantically related enough to share a semantic core. A selection of examples of semantic structures is found in Appendix C; the examples illustrate different aspects of the applied approach.

The effects of feature selection methods can hardly be calculated in advance, as categories and their characteristic word frequencies may differ a lot. Therefore the research is mostly empirical: each approach has to be evaluated experimentally. There is no baseline for this corpus. The description of the method is strictly separated from the results.

In order to conduct classification experiments, a number of tools are needed. Support Vector Machines are used for classification in this Master's project. As German is a highly inflected language, a lemmatiser is used to bring words back to their basic form.

This chapter describes the method and implementation. The first section introduces the tools needed. Then the test framework and its different components are explained. The third section goes into more detail about constructing a graph. The last section explicates the implementation.
4.1 Tools and Corpus
In this section, the tools used in this Master's project are presented: Minor Third (the implementation of Support Vector Machines), the POS-tagger and lemmatiser Tree Tagger, and the thesaurus. The corpus is presented as well.
4.1.1 Minor Third
Minor Third [8] is an open source Java library for machine learning in NLP (natural language processing) applications. It provides a large number of classification algorithms and a very easy-to-use interface; exchanging the classification algorithm from Support Vector Machines to Naive Bayes, for example, only requires changing one command line option. Tools for annotating text and visualising are provided as well, but are not used in this Master's project. A very big advantage of the architecture of Minor Third is its approach to handling labeling information: the corpus is stored in a so-called text base. All labeling information is stored in an external file that is completely independent from the text source (in this report called the label file). Several label files containing different labeling information can be used for the same text base. The format of a label file is intuitive and can be created manually or automatically.
4.1.2 Tree Tagger
Since German is a highly inflected language, POS-tagging of German is not trivial. As a POS-tagger and lemmatizer, the Tree Tagger [1] is used. It is based on a probabilistic tagging method estimating transition probabilities by constructing a binary decision tree out of trigrams [38]. The accuracy of the Tree Tagger was measured to be 96% [38], so tagging mistakes do occur. The Tree Tagger also recognizes common proper nouns such as Peter and Maria. Unknown words are guessed. Once installed, it is ready to use and does not need to be trained.
4.1.3 Thesaurus
The Open Thesaurus [32] is an open source thesaurus project for German. Access and collaboration are free to everybody who registers; to avert drawbacks on quality, the community pays attention to quality assurance, following the principle of self-control with the administrator as the last control instance.

Its structure is similar to WordNet [13]: groups of synonyms are mapped in hyperonym relations. The thesaurus does not contain any grammatical information, i.e. about word classes. As most work is done by laymen, the structure is kept simple: only synonym and IS-A relationships are implemented. Meta information about language level is given for the lemmas, but no POS meta information, that is, no information about the word class of a lemma.

The thesaurus provides a web interface and a version to be integrated into text processing software, or it can be downloaded as a MySQL database dump. The vocabulary is situated in everyday speech. It contains about 60000 entries [32].
4.1.4 The Corpus
The corpus in its original form consists of about 45000 emails sent to the customer support service of an Austrian telephone and Internet provider. The corpus consists only of the first emails in a communication chain, that is, emails that were sent from outside to the customer service. The emails are saved in a large database. The emails are assigned to categories by CSO EMS™ and the employees in the customer service, i.e. the labeling is reliable and does not need to be reviewed manually. (CSO EMS™ is an email processing application provided by Artificial Solutions that handles, analyzes and classifies incoming emails to assist support staff; its classification approach is rule-based.)

The corpus is not standardized and was never used for such a research project before, so there are no baseline results or any gold standard to compare with. (The well-known Reuters corpus, for example, is considered an inofficial standard corpus for NLP research projects in the English language [36].)
4.2 Description of Test Framework
The main steps to prepare a test set are extracting the emails from the database, preprocessing, and feature regrouping using semantic relationships.

Figure 4.1 illustrates the steps. The corpus is created by extracting emails from the CSO EMS™ database, in which the EMS system stores emails and logs various pieces of meta information. Only the first mails sent to the customer service are taken into account. Labeling information for training and testing the classifiers is extracted from the database as well.

Figure 4.1. Architecture of the test framework

The emails are extracted only once and saved as separate text files. The preprocessing steps are also done only once. Preprocessing implies four different treatments of the data:
• Some emails are extracted in HTML encoding. All HTML tags have to be removed to get raw text.
• Analysing the content showed that the original corpus contains a lot of spam. The spam is removed.
• Stop word treatment: special characters and objectionable pieces of information such as email addresses and customer IDs are omitted. (Stop words are words without semantic meaning such as or, and, the; removing them is a very common preprocessing step in NLP applications.)
• Lemmatisation: using the Tree Tagger [1], the emails are POS-tagged and words are brought back to their basic form.

Removing HTML tags and stop-word-like information is done via Perl scripts programmed for this Master's project. Removing the spam is realized by eliminating two categories (cf. 5.1.1).
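These two cleaning steps were implemented as Perl scripts for this project; purely as an illustration of what they do, a comparable sketch in Java might look like this (the regular expressions and the stop word list are simplified placeholders, not the patterns actually used):

    import java.util.Set;

    public class Preprocessing {
        private static final Set<String> STOP_WORDS = Set.of("und", "oder", "der", "die", "das");

        /** Strips HTML tags from an extracted email body. */
        static String removeHtml(String text) {
            return text.replaceAll("<[^>]+>", " ");
        }

        /** Removes email addresses, long digit sequences (e.g. customer IDs) and stop words. */
        static String removeNoise(String text) {
            String cleaned = text.replaceAll("\\S+@\\S+", " ")      // email addresses
                                 .replaceAll("\\b\\d{6,}\\b", " "); // long digit sequences
            StringBuilder result = new StringBuilder();
            for (String token : cleaned.split("\\s+")) {
                if (!STOP_WORDS.contains(token.toLowerCase())) {
                    result.append(token).append(' ');
                }
            }
            return result.toString().trim();
        }
    }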
In the last step, illustrated by a double rectangle in Fig. 4.1, the actual experiments are done. Minor Third manages all tasks in the classification process. The input data are the corpus as separate text files and label files containing the categorization information.

The main part of this project is the step illustrated in Figure 4.1 by a flash: regrouping features using semantic structures. This working step is very complex. A tool programmed in Java looks up words in a thesaurus and calculates semantic structures from the resulting graph. It is described in the next sections. Practically speaking, this could be considered a preprocessing step as well, as groups of words in the raw text are replaced.
4.3 Theoretical Reasoning about a Graph
In [47] it was shown that semantic fields can be modelled with a graph-based approach. In this Master's project, a similar model is used to compute semantic structures automatically. The vocabulary is mapped in a graph. Words are vertices and their relationships are edges connecting them. The thesaurus serves as a knowledge base for semantic relationships; the structure and the completeness of the graph (i.e. whether all words are represented in it) depend a lot on the thesaurus.

The thesaurus models only two types of semantic relationships: synonyms and IS-A relations. In order to create a "real hierarchy" ("eine echte Hierarchie" [32]), a synonym group can be related to only one hyperonym. This fact limits the variety of the thesaurus; on the other hand it facilitates the construction of semantic structures. Furthermore, antonym relations are omitted completely. The IS-A or hyperonym relation can be considered as a special case of a partial synonym relation [15]. Therefore, in the graph model we can pool the two into one type of relation. (It may be considered to introduce weights in the graph-based model in order to differentiate between hyperonym and synonym relations, but this is left to an eventual later phase of fine tuning.)

Definition: neighbor A word a is related to another word b if it is mentioned in the thesaurus as a synonym or hyperonym of b. We say that a is a neighbor of b. We distinguish between thesaurus neighbors (relations mentioned in the thesaurus) and field or structure neighbors (related words in the graphs).

One graph is built for the whole corpus and for all categories. The groups must be disjoint subsets of the set of words, the vocabulary of the corpus. The groups shall not be too large, as the members are to be semantically related. For each group, an identifier such as _Syngroup209_ is created. The words in the mails are replaced with the corresponding identifier.

Central questions are the disambiguation of polysemous words and the size and characteristics of those components. The solution applied to these questions has to fit within limits of time and effort.

Disambiguation is done according to the graph-based model by counting the neighbors of the different lexemes of a word. The context is not taken into account.

Such a word graph consists of a number of unconnected components. Two search algorithms, breadth-first search and the Tarjan algorithm, are applied to identify them. Components can become very large, up to about 1000 nodes. Those components have to be trimmed. After selecting a central node, the idea is to follow paths of a maximal length before cutting off. The central node is selected as the one with the maximal number of active neighbors, as high connectivity is evidence for the central position of a word in a semantic structure (cf. Sec. 2.2.3). A very large graph component can be seen as a cluster of different overlapping semantic structures. The next section goes into more detail about the implementation and construction of a graph. Three of the main questions described there (search algorithm, disambiguation and path length) are variables examined in the evaluation part (cf. Chap. 5.3).
4.4 Algorithms and Implementation for Building Semantic Structures
In this section, the implementation of the tool to analyze and regroup semantic structures is described.

Synfield, the heart of this project, is a tool to compute a graph of semantic structures with several features. The graph or individual graph components can be visualized. A number of smaller tools perform tasks for preprocessing and preparing experiments.

Figure 4.2. The functionality of the tool Synfield

Figure 4.2 illustrates the functionality of the tool. The program itself is shown as a rectangle. It has access to two data sources, the corpus and the thesaurus. The output of the program is a graph mapping words into semantic structures, or a lexicon file containing regrouped features.

Such a graph is constructed in the following way:
1. Extract a list of all words occurring in the corpus (text elements).
2. Look up all text elements in the thesaurus and save them in nodes, each with all given neighbors according to the chosen disambiguation type.
3. Search the graph for relationships between nodes.
4. Analyze the graph and search for components using bfs search or tarjan.

Possible actions with the graphical user interface:

1. Select and visualize graph components.
2. Trim graph components with a maximal path length and prepare a version of the corpus for classification experiments.

Looking up a word is done by iterating over the list of text elements and querying the thesaurus. How a graph is built is explained in detail in Section 4.4.2. Information about the ways of disambiguation is given in Section 4.4.4. The algorithms for component search are explained in the theoretical chapter (Section 2.3); in Section 4.4.3 the results are reviewed. The path-length algorithm to trim graphs and its effects are described in Section 4.4.5. The next section goes into more detail about aspects of the implementation of the tool.
4.4.1 Details about implementation
Synfield is programmed in Java. A graph consists of a list of nodes. A node is represented by the word itself and lists referring to its neighbors (graph synonyms, graph hyperonyms, thesaurus synonyms and thesaurus hyperonyms). The data structure is inherited from Java's TreeMap; the name of the node represents the key. The classes Nodelist and Node provide a number of functionalities, such as different kinds of list unification.
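A stripped-down sketch of this data layout (field and class names follow the description above but are otherwise illustrative; the real classes provide further operations):

    import java.util.*;

    /** One vertex of the word graph. */
    class Node {
        final String word;
        final List<Node> thesaurusSynonyms = new ArrayList<>();
        final List<Node> thesaurusHyperonyms = new ArrayList<>();
        final List<Node> graphSynonyms = new ArrayList<>();   // field/structure neighbors
        final List<Node> graphHyperonyms = new ArrayList<>();

        Node(String word) { this.word = word; }
    }

    /** The graph: a sorted map from the word (key) to its node. */
    class NodeList extends TreeMap<String, Node> {
        Node getOrCreate(String word) {
            return computeIfAbsent(word, Node::new);
        }
    }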
The thesaurus is available as a MySQL database [32]. Looking up a word is done by querying the word entry from the database.

Visualizing the graph or graph components can be accomplished in two different ways. The tool provides an interface to Graph Viz (cf. [12]); running this application creates a picture of a graph, which is useful for large graphs to gain a visual overview of the graph structure. For a smaller graph component and manual analysis, the second visualization technique, using the library JGraph [17], is more advisable. Building on Java Swing, nodes and edges are movable and resizable. The user has to adjust the graph structure on his or her own, but the outcome is much more flexible. (The figures in Appendix C were created with Graph Viz, the figures in this section with JGraph.) The visualization tool was used to analyze graph components and to determine semantically motivated constraints on size and path length, as described in this chapter.
Once graph components (or the characteristics needed to compute graph components) are identified, the tool can be used to replace and regroup words according to the graph components. Using one identifier for each component, a lexicon is created; again a TreeMap is used. Iterating over the words in all mails, lemmas given in the lexicon are replaced in the corpus.
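A minimal sketch of this replacement step (the identifier numbering and the method names are illustrative assumptions, not the actual Synfield code):

    import java.util.*;

    public class LexiconReplacer {
        /** Builds a lexicon mapping every lemma of a component to the component's identifier. */
        static TreeMap<String, String> buildLexicon(List<Set<String>> components) {
            TreeMap<String, String> lexicon = new TreeMap<>();
            int id = 0;
            for (Set<String> component : components) {
                String identifier = "_Syngroup" + (id++) + "_";
                for (String lemma : component) {
                    lexicon.put(lemma, identifier);
                }
            }
            return lexicon;
        }

        /** Replaces lemmas occurring in the lexicon by their group identifier. */
        static String regroup(String lemmatizedMail, TreeMap<String, String> lexicon) {
            StringBuilder out = new StringBuilder();
            for (String token : lemmatizedMail.split("\\s+")) {
                out.append(lexicon.getOrDefault(token, token)).append(' ');
            }
            return out.toString().trim();
        }
    }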
4.4.2 Construction of a graph
A graph is built according to the following principle (cf. pseudocode in Fig. 4.3). The vocabulary of the corpus is given as a list of words C. The words extracted from the corpus are called text elements. Each text element is looked up in the thesaurus. A node is created for each text element and for each word mentioned in the thesaurus. Found thesaurus neighbors (the term thesaurus neighbors refers to both synonyms and hyperonyms, cf. 4.3) are registered for each text element.

The next step is the disambiguation of text elements with more than one lexeme; it is treated in Section 4.4.4. Then the graph is searched for neighbor relationships. This is done by an exhaustive search over all text elements:
find_neighbors: ∀t ∈ C ∀m ∈ C: m ≠ t ∧ m ∈ T_t ⇒ N_t := N_t ∪ {m}    (4.1)
The procedure find_neighbors (cf. Eq. 4.1) iterates in a double loop over all text elements t and m in the given vocabulary list C. If the text element m is a thesaurus neighbor of the word t (i.e. an element of the set of all thesaurus neighbors of t, T_t), m is added as a field neighbor to the node of t, i.e. to the set of all field neighbors of t, N_t.
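A minimal Java sketch of find_neighbors as defined in Eq. (4.1) could look as follows; the map-based representation of T_t and N_t is an assumption made for illustration, not the tool's actual code.

    // Sketch of the exhaustive find_neighbors search over the vocabulary.
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class FindNeighbors {
        // thesaurusNeighbors.get(t) is T_t, the thesaurus neighbors registered for text element t.
        // Returns N_t for every t: all other text elements that are thesaurus neighbors of t.
        static Map<String, Set<String>> run(Set<String> vocabulary,
                                            Map<String, Set<String>> thesaurusNeighbors) {
            Map<String, Set<String>> fieldNeighbors = new HashMap<>();
            for (String t : vocabulary) {
                Set<String> nT = new HashSet<>();
                for (String m : vocabulary) {
                    if (!m.equals(t) && thesaurusNeighbors.getOrDefault(t, Set.of()).contains(m)) {
                        nT.add(m);   // N_t := N_t ∪ {m}
                    }
                }
                fieldNeighbors.put(t, nT);
            }
            return fieldNeighbors;
        }
    }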
Let us consider an example. The following list of words (cf. Fig. 4.4) is a component from the corpus plus some supplementary words. The words are looked up in the thesaurus and their neighbors are registered. Relationships between the words are found.
Figure 4.5 shows a graph derived from the vocabulary list given in Fig. 4.4, with translations. The graph consists of 23 nodes in total. Three components consist of several vertices, three of only one single node. The single-node components are not processed further. This graph is the result of a breadth-first search. The difference from a Tarjan search is discussed in Section 4.4.3. Some nodes are not connected to any other nodes, such as Interessent (prospective customer) and Bestandskunde (existing customer). As the thesaurus does not contain those words, no information about their semantic relationships is available.^10
Talking about feature reduction and the regrouping of semantically related features, the size of the graph is important. The question may arise why one should go to so much trouble, involving graph theory, search algorithms and a number of other instruments, just because of regrouping words.
^9 The term thesaurus neighbors refers to both synonyms and hyperonyms (cf. Sec. 4.3).
^10 Those words are called out-of-vocabulary words (OOV) in speech recognition [21, p. 95]. Both terms are very frequent in the corpus; they are specific terms in the area of application. The use of a thesaurus modified for the specific field of application and containing all specific terms may improve classification results.
Given: the corpus vocabulary in a list of words.

global node list N, empty
forall text elements t
    if (t is not in N)
        create a node n for t
        add t to N
    else
        get node n for t
    look up t in the thesaurus and get thesaurus neighbors t'
    forall thesaurus neighbors t' of t
        if (t' is not in N)
            create a node n' for t'
            add t' to N
        else
            get node n' for t'
        register n' as a thesaurus neighbor of n
    end forall
end forall
forall polysemous text elements t
    disambiguate t
end forall
forall text elements t
    find field neighbors for t
end forall

Figure 4.3. Pseudocode of the algorithm to construct a graph
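Building on the node sketch from Section 4.4.1, a condensed Java version of the construction loop in Fig. 4.3 might look like the following; the Thesaurus interface stands in for the MySQL lookup and is an illustrative assumption, not the thesis' implementation.

    // Sketch of the graph-construction loop from Fig. 4.3.
    import java.util.List;

    class GraphBuilder {
        interface Thesaurus {
            List<String> neighbors(String word);   // synonyms and hyperonyms of one word
        }

        static Nodelist build(List<String> textElements, Thesaurus thesaurus) {
            Nodelist nodes = new Nodelist();
            for (String t : textElements) {
                Node n = nodes.computeIfAbsent(t, Node::new);        // create or fetch the node for t
                for (String neighborWord : thesaurus.neighbors(t)) {
                    Node m = nodes.computeIfAbsent(neighborWord, Node::new);
                    n.thesaurusSynonyms.add(m.word);                  // register m as a thesaurus neighbor of n
                }
            }
            // Disambiguation and the find_neighbors step (Eq. 4.1) would follow here.
            return nodes;
        }
    }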
Considering the second largest graph component gives a first answer: merely looking up erfolglos (unsuccessful) in the thesaurus returns only the synonyms fruchtlos (fruitless) and vergeblich (in vain). The graph component, in contrast, describes the meaning of the whole synset; not only direct neighbors of the word are considered. Applying graph theory extends one word to an enclosed structure in which the sense of meaning is kept. And there is no need for manual labor, as graphs can be calculated automatically and within an acceptable time scale.
4.4.3 Component Search

In order to identify components in a disconnected graph, the two algorithms breadth-first search and the Tarjan algorithm are applied (cf. Sec. 2.3).
The algorithms return different results. The Tarjan algorithm finds 186 components for 4990 text elements, BFS returns 194. Overall, the Tarjan algorithm binds 1572 nodes and the BFS algorithm 1661.
• Nett, nett (amiable, likable)
• lieb (good, nice, dear)
• höflich (courteous, polite)
• schön (nice, beautiful)
• verbindlich (authoritative, binding)
• geehrt (honored)
• geschätzt (appreciated, estimated)
• teuer (beloved, dear, expensive)
• verehrt (adored, venerated)
• angesehen (esteemed)
• freundlich (friendly)
• erfolglos (unsuccessful)
• fruchtlos (fruitless, effectless)
• vergeblich (in vain)
• unnötig (unnecessary)
• zwecklos (useless)
• schwer (heavy, difficult)
• schwierig (difficult)
• kompliziert (complicated)
• Interessent (prospective customer)
• Bestandskunde (existing customer)
• Kunde (customer)

Figure 4.4. Word list for the running example
Figure 4.5. The whole graph of the running example, drawn with JGraph

Figure 4.6 maps the size of components to their number. Components containing only one node are left out. The component sizes are grouped into classes (on the x-axis). The Tarjan algorithm produces slightly more middle-sized components (5-20 nodes) whereas the BFS search returns a larger number of small components (2-4 nodes).
The largest component found in the corpus is in both cases bigger than 1000 nodes (cf. figures in Sec. C.3). Such a big cluster is not usable as a semantic structure, but with the help of the path-length algorithm (cf. Section 4.4.5) the components can be trimmed to a suitable size.
But apart from statistical data, the algorithms also differ in detail. As is clear from the definitions, BFS components are larger and wider, as the following examples illustrate. Searching the running example (cf. Fig. 4.4) for graph components with breadth-first search, the largest component BFS finds is illustrated in Fig. 4.7.
Applying the Tarjan algorithm instead of BFS returns the following result: in the Tarjan component the two nodes Nett^11 (nice, likable) and schön (fine, beautiful) are missing (cf. Fig. 4.8). Those two nodes are not connected strongly enough to the component. The vertex Nett sends three edges but does not receive any; the node schön receives only one edge but does not have any neighbors of its own in this component.
^11 Attention: there is a distinction between Nett and nett.
Figure 4.6. The number of components of a certain size

Figure 4.7. The largest component that BFS finds in the running example
Figure 4.8. The largest component in the running example found with the Tarjan algorithm
This example illustrates the difference between the two search algorithms as defined by their basic postulates (cf. Sec. 2.3): in a strongly connected component there is a directed path from every node u to every other node v. Both of the missing nodes only have neighbors in one direction, therefore they are omitted by the Tarjan algorithm.
This is an important distinction from a linguistic point of view: weakly connected nodes are not semantically related to the semantic core^12 of the field.
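As an illustration of the component search, the following sketch collects components by breadth-first search over the neighbor lists; the adjacency-map representation is an assumption, not the tool's actual code, and a Tarjan search on the directed graph would instead return only the strongly connected components.

    // Sketch of component detection with breadth-first search.
    import java.util.*;

    class ComponentSearch {
        // adjacency: word -> neighboring words (followed as given for the BFS expansion)
        static List<Set<String>> bfsComponents(Map<String, Set<String>> adjacency) {
            Set<String> visited = new HashSet<>();
            List<Set<String>> components = new ArrayList<>();
            for (String start : adjacency.keySet()) {
                if (visited.contains(start)) continue;
                Set<String> component = new HashSet<>();
                Deque<String> queue = new ArrayDeque<>();
                queue.add(start);
                visited.add(start);
                while (!queue.isEmpty()) {
                    String node = queue.poll();
                    component.add(node);
                    for (String neighbor : adjacency.getOrDefault(node, Set.of())) {
                        if (visited.add(neighbor)) {
                            queue.add(neighbor);
                        }
                    }
                }
                components.add(component);
            }
            return components;
        }
    }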
4.4.4 Treatment of polysemous words

Disambiguation (cf. Sec. 2.2.1) is the process of identifying which sense of a polysemous word is used in a given context. The usual approach is to apply information from the context in which the word occurs in order to identify the accurate sense of meaning.
Let us consider an example. The word freundlich (friendly) is given with the following annotations in the thesaurus:

1. leutselig (archaic)^13
2. charmant, gefällig, nett
3. galant, höflich, verbindlich, zuvorkommend
4. heiter, schön, sonnig (weather)
^12 This is the case for the node schön. Beautiful does not have much in common with the notion of friendly-kind-polite. This case is also interesting for disambiguation (cf. Section 4.4.4). But here again, the quality of the semantic structure depends a lot on the entries in the thesaurus. Only nouns are spelled with capital letters in German; the adjective nett does not exist in a nominal form, so this entry is actually wrong.
^13 Translations: leutselig: affable; charmant: charming; gefällig: accommodating; nett: nice, kind, friendly; galant: chivalrous, gallant; höflich: polite; verbindlich: obliging, authoritative; zuvorkommend: courteous; heiter: bright; schön: beautiful; sonnig: sunny.
The second and the third lexeme are used most frequently, describing the character (2) or the behavior (3) of a person.^14 The fourth lexeme is used in the context of weather. Applying word-sense disambiguation means choosing the most appropriate definition, i.e. the most suitable set of thesaurus neighbors.
In the following, three approaches for the treatment of polysemous words are described and compared.^15 Polysemous words are identified by several groups of neighbors, one for each lemma. For each lemma a node is created, and neighbors are stored as for usual nodes. A polysemous word is represented as a group of its lexemes.
Naive disambiguation

The naive approach does not do any disambiguation at all.^16 Instead of drawing distinctions between the different lemmas, the lists of synonyms are merged into one list, as are the lists of hyperonyms. So schön, nett and verbindlich, i.e. three words that differ a lot in their sense of meaning, are treated as equivalent neighbors of the word freundlich.
The resulting graphs have a higher connectivity. The risk is to falsify the semantic regrouping if neighbors with different meanings get mixed.^17
Graph model based approaches to disambiguation

Standard approaches to disambiguation based on context recognition demand context information relating a given lexeme to a certain context class.^18 This information is not available in this Master's project, or is beyond the limits of time and effort for a subproblem.
^14 In the sentence Er ist ein freundlicher Mensch (He is a kind person), the second or the third alternative would be suitable. A semantic structure replacing freundlich in this context should consist of these words to catch the correct sense of meaning.
^15 These three approaches are the values for the variable disambiguation tested later in the evaluation part (cf. Sec. 5.3.3).
^16 A very illustrative example (cf. Fig. C.7 (Appendix)) is the word Leiter (conductor (elec.) and head, leader, chief): if only naive disambiguation is done, the synonyms for 'conductor' and those for the second lexeme 'head, leader, chief' are treated as one single feature. However, as this Master's project examines the effect of feature regrouping and of its different aspects, the different ways of disambiguation have to be taken into account.
^17 Another question is whether the lexeme freundlich describing weather is used at all in the context of customer mails sent to a telephone agency.
^18 An implementation of the Yarowsky algorithm [21, p. 638 ff.], adequately trained, could be applied; alternatively, a database mapping a sense of meaning to frequent collocators (i.e. words that usually appear in the context of the word and define its sense of meaning), such as [48, 50], would allow applying context information.
For this reason, an approach to disambiguation using the graph-based model was developed and applied.^19
After looking up all text elements in the thesaurus, the appropriate lexeme for a polysemous word is chosen. This is done by identifying the lexeme that has the most neighbors among the text elements. Given a polysemous word w with its lexemes l_1 ... l_n and the set of all words W, the lexeme representing the word is chosen by calculating the maximum intersection of the lexemes' neighbor sets with the global word set:

w := l_j  where  j = arg max_i |W ∩ N_i|    (4.2)
Referring to the example of freundlich, in Fig. 4.5 the text elements connected to freundlich are nett, schön, höflich and verbindlich. Nett is a synonym in the second lexeme describing a person's character, höflich and verbindlich are found in the third one, schön in the fourth one. So the third lexeme is chosen because most of its list items are found as text elements.
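A minimal sketch of this selection rule, Eq. (4.2), might look as follows; the representation of the lexemes' neighbor sets is an illustrative assumption, and tie handling is deliberately left open because the two variants below resolve ties differently.

    // Sketch of choosing a lexeme by maximal overlap with the corpus vocabulary.
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    class GraphDisambiguation {
        // lexemeNeighbors.get(j) is N_j, the thesaurus neighbors of lexeme l_j.
        // Returns the index of the chosen lexeme; here the first maximum wins,
        // ties are resolved differently by disambiguation Unite and Alpha.
        static int chooseLexeme(List<Set<String>> lexemeNeighbors, Set<String> vocabulary) {
            int best = 0;
            int bestOverlap = -1;
            for (int j = 0; j < lexemeNeighbors.size(); j++) {
                Set<String> overlap = new HashSet<>(lexemeNeighbors.get(j));
                overlap.retainAll(vocabulary);        // W ∩ N_j
                if (overlap.size() > bestOverlap) {
                    bestOverlap = overlap.size();
                    best = j;
                }
            }
            return best;
        }
    }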
There is an ambiguous case, and the two following approaches resolve it in different ways.
Disambiguation unite  If lexemes have the same maximal number of neighbors in the graph, those lexemes are treated with naive disambiguation: they are united and their neighbor lists are merged. This is the case for the word nett. It has two neighbors in the field, lieb and freundlich, which belong to different lexemes.^20 But as no unique maximum can be found, those two lexemes are united.
In Figure 4.9 the node nett still has two neighbors. The node freundlich, however, is only connected to the neighbors of the chosen lexeme, höflich and verbindlich. Other edges outgoing from this node are omitted. The graph has changed a lot. The node schön is now unconnected. In some regions, e.g. around the node lieb, the graph is thinned out.
Disambiguation alpha  This approach to disambiguation is more severe. If lexemes have the same maximal number of neighbors in the graph, the lexeme that comes last in the list is chosen and all others are ignored.
^19 As one global graph is built for the whole corpus, Yarowsky's second assumption, "one sense per discourse", is extended: the whole corpus is considered as one single discourse. It is assumed that e.g. the word freundlich occurs only in the sense of meaning describing a person's behavior. This very general assumption facilitates implementation but has a big drawback on the precision of this approach.
^20 The complete thesaurus entry of nett is:
    1. fein, hübsch, niedlich, süß
    2. ansprechend, lieb, liebenswürdig, reizend, sympathisch, umgänglich
    3. charmant, freundlich, gefällig
(Translations: fein: fine; hübsch: pretty; niedlich: cute; süß: sweet; ansprechend: pleasant; lieb: beloved; liebenswürdig: amiable; reizend: attractive, charming; sympathisch: likable; umgänglich: companionable; charmant: charming; freundlich: friendly; gefällig: complaisant)
Figure 4.9. The graph of the running example built with disambiguation Unite
Figure 4.10 illustrates the output of the running example, built with BFS and disambiguation Alpha. The word nett is again an ambiguous case where the special rule of disambiguation Alpha is applied. As no unique maximum can be found, the second lexeme, containing the node lieb, wins.
In Figure 4.10 the node nett has only one neighbor. The node freundlich, however, is only connected to the neighbors of the chosen lexeme, höflich and verbindlich. Other edges outgoing from this node are omitted. The graph has changed a lot. The node schön is now unconnected. In some regions, e.g. around the node lieb, the graph is thinned out.
Figure 4.10. The graph of the running example built with disambiguation Alpha
4.4.5 Trimming graphs - path length

A graph component containing 20 nodes or more is not useful for regrouping words into related word groups, as the semantic content differs a lot.^21 In order to limit the component size, an algorithm is developed to trim components according to a given maximal path length. The algorithm is reminiscent of breadth-first search, differing in the path-length restriction: from a chosen central node, the algorithm takes at most as many steps as the maximal path length allows. If the upper bound is reached, the connections to further neighbor nodes are cut:
The procedure cutComponent (cf. Fig. 4.11) is called in a similar way as the adapted search algorithms: all nodes of a big component are stored in a priority queue, sorted by their number of neighbors.^22 The procedure is run with the node with maximal connectivity. All nodes contained in the component the procedure returns are removed from the priority queue.
^21 The graph in C.5 (Appendix) illustrates how much the words can differ after a short path.
^22 If a large graph consists of several overlapping semantic structures, the nodes with the highest number of neighbors are supposed to be the semantic core of a local semantic structure (cf. Sec. 2.2.3, 4.3). This approach could be improved by adding another dimension which involves the word frequency in the corpus. In the current approach, only the position in the graph component is taken into account. An idea would be e.g. to correlate the number of neighbors of a node with the tf-idf measure for the respective word.
procedure: cutComponent
given:
    node c with highest connectivity (maximal number of neighbors)
    all nodes in the graph are unmarked
    distance = 0 for all nodes
    empty queue Q
    maxDistance
    empty component C

mark c as visited, add c to C
enqueue c in Q
while Q is not empty
    dequeue current node d from Q
    for all neighbors n of d do
        if n was already visited, continue with next neighbor
        mark n as visited
        n.distance = d.distance + 1
        if n.distance < maxDistance
            enqueue n in Q
            add n to C
        if n.distance = maxDistance
            add n to C
            cut off all neighbors of n that are not in C
    end for all
end while
return C

Figure 4.11. Pseudocode for the procedure cutComponent
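A compact Java rendering of the procedure in Fig. 4.11 could look like the following sketch; it assumes the adjacency-map representation used in the earlier sketches and is not the thesis' actual implementation. The choice of the central node (from the priority queue) is left to the caller.

    // Sketch of trimming a component to a maximal path length around a central node.
    import java.util.*;

    class ComponentTrimmer {
        static Set<String> cutComponent(Map<String, Set<String>> adjacency,
                                        String center, int maxDistance) {
            Map<String, Integer> distance = new HashMap<>();
            Set<String> component = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>();
            distance.put(center, 0);
            component.add(center);
            queue.add(center);
            while (!queue.isEmpty()) {
                String d = queue.poll();
                for (String n : adjacency.getOrDefault(d, Set.of())) {
                    if (distance.containsKey(n)) continue;   // already visited
                    int dist = distance.get(d) + 1;
                    distance.put(n, dist);
                    component.add(n);
                    if (dist < maxDistance) {
                        queue.add(n);
                    }
                    // Nodes at exactly maxDistance are kept but not expanded further;
                    // in the full procedure their edges out of C would also be cut.
                }
            }
            return component;
        }
    }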
The complexity of the algorithm is quadratic.^23
In the following, the algorithm is illustrated by an example: the biggest component in the first graph picture of this section (Figure 4.5). The node with the highest connectivity (the most outgoing edges) is usually the semantic center of the field (cf. Sec. 2.2.3).
The algorithm starts with the node geschätzt,^24 marked with a red circle in Figure 4.12. The maximal path length is 2. In the first step, all direct neighbors of the node geschätzt are discovered (angesehen, gesehen, verehrt, teuer, lieb). The second step involves only one node, the node nett. For this border node, a cut operation is needed. The first iteration returns the component illustrated in Figure 4.13.
^23 Assuming the worst case, a maximally connected graph component where all vertices are adjacent to each other: each node is only visited once, but all neighbors of each node are examined anyway.
^24 The node lieb has as many outgoing edges (δ(lieb) = 5 = δ(geschätzt)). For further distinction, the number of thesaurus neighbors is taken into account. This node has fewer thesaurus neighbors than the node geschätzt.
The second iteration starts at the node freundlich, marked with a red, dotted circle. The first step reaches the neighbors schön, verbindlich and höflich. This component is shown in Figure 4.14. The node Nett drops out as a single-node component.
The search with the cutting algorithm is exhaustive, as all nodes are visited. The queue, and more generally the underlying idea of breadth-first search, ensures that nodes are visited in the appropriate order, by their distance to the central node.
In the following we will have a closer look at the newly created components. The larger component in Figure 4.13 resembles a K5^25 when the node angesehen is taken away. Apart from the word angesehen, all words in this component are closely related in their lexical meaning.^26
In the experiments, graphs with maximal path lengths from 1 to 3 are tested.
^25 The K5 graph is maximally connected and non-planar, i.e. there is no way to arrange the nodes in the plane such that the edges do not cross [11].
^26 Some more examples can be found in the graph gallery in the Appendix (cf. Sec. C).
Figure 4.12. A component cut with the path-length algorithm
Figure 4.13. Component 1 cut from the example graph with path length 2

Figure 4.14. Component 2 cut from the example graph with path length 2
Chapter 5

Evaluation - Experiments

Experiments are done in two steps. First the corpus is analyzed, and preprocessing steps and baseline tests are discussed. The second part reports classification experiments done with SVM, involving semantic feature regrouping. The effect of the different variables, such as the search algorithm and the path length, is examined. These results lead to a characterization of an ideal category for which semantic feature regrouping has a positive effect on text classification with SVM.
5.1 Corpus Analysis
This section treats the analysis of the corpus and the different preprocessing steps.
5.1.1 Decisions about the corpus

Bringing the data into a usable form was one of the most difficult challenges in this Master's project. Keep in mind that this is not a certified corpus such as [36]. The data in the whole original corpus is very unclean: analysing randomly picked emails, about two thirds turned out to be spam or unusable emails (i.e. mail returned to sender, out-of-office emails etc.). In order to purify the data, the two Unbekannt categories were removed completely, reducing the size of the corpus by two thirds. The data is still very unclean, and experiments using such a big corpus take a lot of time.^1 The emails are not always the first mails of a customer contacting the customer service, as e.g. colleagues forward information about a customer calling, etc. For this reason, the mails were filtered again for mails coming from a mail contact interface on the website of the telephone agency. Only the proper text of the email was kept; all other information (e.g. variables about the type of subscription) was omitted. The result is about 2000 remaining mails. The average text length was halved: the average text length was 260 words in the spam-reduced corpus, while in the mail-form corpus a mail counts 126 words on average (cf. Tab. A.2 (Appendix)). When talking about the corpus in the following, it refers to this last form containing 2000 mails.
^1 The different steps of preprocessing (as they are all done offline), i.e. calculating a graph and replacing words in the mails, demand time and assistance. Running an experiment series with SVM for all categories, with 5-fold cross-validation and two test setups, takes about 48 hours with the given hardware. Running a test setup with the smaller corpus (about 2000 emails) is about 100 times faster.
5.1.2 Category System

The structure of the category system is complex. There are two main categories: Produkt and Typ. Every email is assigned to both main categories, i.e. it has two classification variables.
The main category Produkt describes different product types, e.g. a telephone flat rate or an Internet service. It consists of 12 subcategories. The category Typ could be translated as service; it comprises technical questions, contract cancellations etc. There are 6 subcategories. Every message is classified with exactly one pair of categories. Considering pairs of category assignments (Produkt, Typ), the number of categories rises to 72 (of which 67 occur).
Training and testing a classifier in an accurate way demands a large enough set of data, at least more emails than folds^2 in the category. Omitting the two main categories Produkt and Typ, only 5 categories out of 18 contain fewer than 5 emails and cannot be used. In the complex double-featured category system more than half of the categories (43 of 67) are concerned.
Usually in text categorization a complex category hierarchy is broken down into a combination of binary classification tasks [33]. A binary classifier returns a probability for a document to belong to a category. The scope of this Master's thesis is not to build a classification system but to examine the effect of semantic feature selection.
For these reasons the category system had to be simplified. It was decided to omit the two main categories and to consider only the subcategories. So when doing an experiment, the whole corpus is treated twice and each email is classified twice.
Table 5.1 gives an overview of all categories with a short description of their content.
5.1.3 Preprocessing

Several preprocessing steps were carried out. The emails were taken out of the enterprise's database and stored in separate text files. The whole email, subject and body, was taken for classification. Further information, such as sender, sending date etc., was omitted. To be able to administrate the large number of emails in the 18 categories, a database was created. It was used to extract lists of filenames identifying all emails in one category, or to provide classification information to the classification algorithm.
^2 Folds here denotes the number of parts the data set is divided into for cross-validation (cf. Sec. 5.2.1).
Category     Description
Product 1    ADSL high-speed internet
Product 2    Different kinds of offers for business customers
Product 3    A combination of internet and telephone flat rate
Product 4    A DSL flat rate
Product 5    Different kinds of offers for the fixed-line network
Product 6    Cell phone issues
Product 7    A low-price telephone subscription
Product 8    A telephone flat rate
Type 1       Contract cancellation
Type 2       Information about a product
Type 3       Payment issues
Type 4       Technical issues
Type 5       Changes

Table 5.1. Overview of categories and their content (5 categories have been omitted)
The preprocessing steps concerned text cleanliness and tagging. About half of the emails were stored in HTML format, so HTML tags such as "<br><br>" had to be removed.
In Section 2.1 feature reduction was identified as a central problem in text categorization. The whole corpus is stored in a POS-tagged and in a lemmatized version. Lemmatisation and POS-tagging are done by the TreeTagger [37], [38]. This helped to cut down the number of features by about one third (241973 token types reduced to 140630, cf. Tab. A.1). Furthermore, knowledge about a word's basic form (lemma) and its word class is essential for the use of semantic relationships. The tagger has an error rate of less than 5%, but mistakes or unresolvable ambiguities are possible. Tagging or lemmatisation errors were not corrected, as this manual labour is too time consuming. In any ambiguous case the tagger returns all possible forms;^3 the first form in the list was then chosen.
A very common step in the preprocessing of natural language data is the elimination of stop words. Stop words are frequent words without a direct semantic meaning, such as or, and, the. Stop words were removed from the corpus using a stop word list [2].
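For illustration, two of these steps (HTML tag removal and stop-word filtering) might be sketched as follows; the regular expression and the tiny inline stop-word list are assumptions made for the example, whereas the thesis uses the stop word list [2].

    // Sketch of HTML tag removal and stop-word filtering.
    import java.util.Arrays;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    class Preprocessing {
        // Illustrative stop words; the real list [2] is much longer.
        private static final Set<String> STOP_WORDS = Set.of("und", "oder", "der", "die", "das");

        static String stripHtml(String text) {
            return text.replaceAll("<[^>]+>", " ");   // drop tags such as <br>
        }

        static List<String> removeStopWords(List<String> tokens) {
            return tokens.stream()
                         .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                         .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            String mail = "Sehr geehrte Damen und Herren<br><br>meine Rechnung ist falsch";
            List<String> tokens = Arrays.asList(stripHtml(mail).trim().split("\\s+"));
            System.out.println(removeStopWords(tokens));
        }
    }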
5.1.4 Category sizes and other statistical data

In the category system Typ, the categories are quite balanced in size, as Figure 5.1 illustrates. The four categories Type 5, Type 4, Type 2 and Type 1 consist of 100-500 mails each. The largest category, Type 3, contains about one third of the whole corpus, about 700 mails.
^3 The personal pronoun sie (she, they) can also be used as the polite form of address, spelled with a capital letter. If a sentence starts with sie, it is not clear which form is meant. The tagger returns sie, Sie, sie (3rd person singular, polite form of address, 3rd person plural). In such a case, the first word in this list was taken.
Figure 5.1. Histogram of the category system Typ
The category system Produkt (illustrated in Figure 5.2) shows a more heterogeneous distribution of category sizes than the other category system. Three categories consist of fewer than 5 emails and have to be omitted; they are not shown in the diagram. The category Product 7 comprises a little more than one third of all mails (740). The rest lie between about 50 (Product 4) and about 500 emails (Product 3).
Figure 5.2. Histogram of the category system Produkt
The tables in the appendix (Tab. A.1) give an overview of the categories, their size and some more statistical data.
5.1.5 Correlation to the thesaurus

The thesaurus contains about 60000 entries, but most of the words belong to everyday speech. A large percentage of the corpus vocabulary consists of specific terms which are not registered in the thesaurus. Those words are, like out-of-vocabulary (OOV) words in speech recognition (cf. [21]), not included in further processing, i.e. feature regrouping is not possible as no information about their semantic relations is available. Of 13010 words in the corpus only 4008 are found in the thesaurus, i.e. less than one third (24%). A lot of the affected OOV terms are highly frequent and important for classification. For instance, the word Anschluß (access, connection), which appears 288 times, is OOV. Another example is the word triple from the example graph in Section 4.4.2: Interessent (prospective customer), Bestandskunde (existing customer), Kunde (customer).
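As an illustration, the thesaurus coverage of a category could be computed along the following lines; the Thesaurus interface is a hypothetical stand-in for the database lookup, not the tool's actual API.

    // Sketch of computing the share of a category's words found in the thesaurus.
    import java.util.Collection;

    class ThesaurusCoverage {
        interface Thesaurus {
            boolean contains(String word);
        }

        // Returns the fraction of the category's words that are found in the thesaurus.
        static double coverage(Collection<String> categoryWords, Thesaurus thesaurus) {
            if (categoryWords.isEmpty()) return 0.0;
            long found = categoryWords.stream().filter(thesaurus::contains).count();
            return (double) found / categoryWords.size();
        }
    }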
Table A.3 (Appendix) maps the categories to the percentage of words contained in the thesaurus. The categories can be separated into four groups:

1. very low coverage (< 60%): Product 3, Type 3 and Product 7
2. low coverage (60-70%): Type 1 and Type 2
3. medium coverage (70-80%): Type 4, Product 8 and Type 5
4. high coverage (> 80%): Product 1, Product 2, Product 4, Product 5 and Product 6

When working with semantic feature regrouping, this correlation is an important property of a category.
[18] unites concepts by identifying 2-gram words: air pollution, water pollution and pollution are hierarchically regrouped into the concept of pollution. Such an approach would be conceivable for this Master's project, as many OOV words are composed of several single words that are in the thesaurus. The word Telefonrechnung (telephone bill) consists of Telefon and Rechnung. Both composites are covered by the