KNOWLEDGE EXTRACTION FROM CALL CENTER DATA: APPLICATION OF TEXT MINING


RAKESH ERAZUTH MOHANDAS
IBM India
rakesmoh@in.ibm.com

VISHNUPRASAD NAGADEVARA
Indian Institute of Management Bangalore
Bangalore 560076
nagadev@iimb.ernet.in



Textual data mining is a rapidly growing application of knowledge discovery in databases (KDD) due to the ever-increasing volume of structured and unstructured documents produced by large organizations. It becomes more challenging when the databases contain both textual and fixed-form numerical data. This paper attempts to extract knowledge from a database containing textual data as well as numerical ratings given by customers. It uses text mining and data mining techniques to identify the categories and relate them to the ratings given by the customers.


1. Introduction

Textual data mining is a rapidly growing application of knowledge discovery in databases (KDD) due to the ever-increasing volume of structured and unstructured documents produced by large organizations. Most of the techniques developed in the area of data mining and information retrieval are designed to handle only fixed-format data or free-form text, but not a combination of both. There are many application domains in which both fixed-format and free-form data are available. For example, an e-commerce Web site may collect profiles of its customers as fixed-format data as well as users' reviews of its products in free-form text. Incorporating both types of information into the data mining task enhances our understanding of the domain, but also increases the complexity of the problem (Tan et al., 2000).


Text mining, also known as Text Data Mining (TDM) (Hearst, 1999) and Knowledge Discovery in Textual Databases (KDT) (Feldman and Dagan, 1995), can be described as the process of identifying novel or meaningful information from a collection of text data. Novel or meaningful information is generally defined as information that is not readily available in the text data directly. Dörre et al. (1999) say that text mining "applies the same analytical functions of data mining to the domain of textual information, relying on sophisticated text analysis techniques that distill information from free-text documents", whereas Tan (1999) describes text mining as "the process of extracting interesting and nontrivial patterns or knowledge from text documents". Hearst distinguishes text data mining from information retrieval and data mining. She calls text mining the process of discovering heretofore unknown information from a text source. For example, suppose that a document establishes a relationship between topics A and B and another document establishes a relationship between topics B and C. These two documents jointly establish the possibility of a novel relationship between A and C (novel because no document explicitly relates A and C). Hearst also argues that tasks such as text categorization, text clustering, co-citation analysis, etc., cannot be classified as text mining because they do not produce anything novel.
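Hearst's example of combining relationships across documents can be sketched in a few lines. The toy snippet below (the documents and topic pairs are invented for illustration) flags a topic pair as a candidate novel relationship when no single document relates the pair directly but some bridging topic connects them:

```python
# A minimal sketch of Hearst's notion of novelty: two documents each
# relate a pair of topics; a pair of topics never stated together in
# any single document, but connected via a bridging topic, is a
# candidate "novel" relationship.

from itertools import combinations

doc_topics = {
    "doc1": {"A", "B"},   # document relating topics A and B
    "doc2": {"B", "C"},   # document relating topics B and C
}

# Pairs explicitly related within some single document
explicit = set()
for topics in doc_topics.values():
    for pair in combinations(sorted(topics), 2):
        explicit.add(pair)

# Candidate novel pairs: never co-occur, but share a bridging topic
all_topics = sorted(set().union(*doc_topics.values()))
novel = []
for a, c in combinations(all_topics, 2):
    if (a, c) in explicit:
        continue
    bridges = [b for b in all_topics
               if b not in (a, c)
               and tuple(sorted((a, b))) in explicit
               and tuple(sorted((b, c))) in explicit]
    if bridges:
        novel.append((a, c, bridges))

print(novel)  # → [('A', 'C', ['B'])]
```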





1 Proceedings of the 7th International Conference on Knowledge Management, 22-23 Oct 2010, Pittsburgh, Pennsylvania, USA.



Data mining addresses only a very limited part of a company's total data assets, namely the structured information available in databases. Probably more than 90% of a company's data are never looked at. These include letters from customers, email correspondence, recordings of phone calls with customers, contracts, technical documentation, patents, etc. With ever-dropping prices of mass storage, companies collect more and more of such data online. More often than not, the only way the data is made usable is by making it accessible and searchable on a company's intranet. Text mining helps to dig out the hidden gold from textual information. Text mining leaps from old-fashioned information retrieval to information and knowledge discovery (Dörre, Gerstl and Seiffert, 1999).


Text mining is also often confused with data mining, with some people describing text mining as a simple extension of data mining techniques applied to unstructured databases. Hearst disputes this by saying that data mining is not `mining' at all but simply a (semi-)automated discovery of patterns/trends across large databases that help in making decisions, and that no new facts are established during this discovery process. Kroeze et al. (2003) expand on the `novel'/`non-novel' classification of Hearst by introducing a new class called `semi-novel'. They classify data/information retrieval as non-novel and knowledge discovery as semi-novel, and they introduce a new type of investigation process called `intelligent text mining', which they classify as truly `novel'. They define intelligent text mining as the process of automatic knowledge creation. They stress that artificial intelligence techniques, which help simulate human intelligence, are critical to achieving intelligent text mining.



People also tend to confuse text mining with Information Extraction (IE). Information extraction deals with the extraction of facts about pre-specified entities, events or relationships from unrestricted text sources. One can think of information extraction as the creation of a structured representation of selected information drawn from text. There is no notion of novelty involved, because only information that is already present is extracted, and therefore IE cannot be classified as text mining. However, information extraction is directly involved in the text mining process. For example, one approach to text mining from semi-structured documents on the web is to use extraction techniques to convert documents into a collection of structured data and then apply data mining techniques for analysis. This approach is common in analyzing unstructured call center data.


Call centers are now regarded as the most important interface for companies to communicate with their customers. Call center operators have to respond to various requests from their customers without offending them. In addition, the records are said to contain precious information for understanding the trends of the customers. However, the operators often overlook such hidden information in analyzing the records because they tend to rely solely on static clustering measures and empirical keywords. Also, the information that the call center operators deal with in their daily work is quite different from that required by top management, with its global and long-term viewpoints. Consequently, call center records are not fully utilized for discovering business opportunities or avoiding risks. This fact is called the bottleneck of electronic Customer Relationship Management (Shimazu, Momma and Furukawa, 2003). Shimazu et al. carried out experiments to discover important information that could not have been extracted by a set of keywords specified by a domain specialist.


Identification of the important words that represent the content of each document is the first step in discovering important information and knowledge from large amounts of data (Sakurai, Ichimura, Suyama and Orihara, 2001). One of the most commonly used methods for identifying the importance of a word is calculating the frequency of its occurrence in the document or the text database. Unfortunately, this is not a foolproof method for determining the importance of a word. Matsuo proposed an algorithm that extracts important words by excluding general words (Matsuo, Ohsawa and Ishizuka, 2001). It first identifies frequent words within a document, then calculates the co-occurrence frequency of each frequent word with other words, and finally extracts the words that have higher co-occurrence. Hisamitsu et al. report that the semantic relationship among phrases is more useful (Hisamitsu, Niwa and Tsujii, 2000). Besides identifying important words, some researchers introduce structure to represent the relationships among these words. For example, Zaki proposed an efficient algorithm to induce frequent trees in a forest consisting of ordered labeled rooted trees. He also proposed an algorithm for inducing frequent sequential patterns, where he reported experimental results of word sequential pattern induction.
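The frequency- and co-occurrence-based scoring described above can be illustrated with a rough sketch. This is a simplification of the Matsuo et al. approach (the actual algorithm scores words with a chi-square measure against the co-occurrence distribution); the sample text and stop-word list are invented for illustration:

```python
# A simplified sketch of co-occurrence-based keyword scoring: find
# frequent words, then count how often each frequent word co-occurs
# (within a sentence) with the other frequent words. Words with high
# co-occurrence scores become keyword candidates.

from collections import Counter
from itertools import combinations

text = ("the service was quick . the service was friendly . "
        "quick friendly service matters .")
sentences = [s.split() for s in text.split(".") if s.strip()]

# Step 1: find frequent words, ignoring very common stop words
stop = {"the", "was", "a", "of"}
freq = Counter(w for s in sentences for w in s if w not in stop)
frequent = {w for w, c in freq.items() if c >= 2}

# Step 2: count co-occurrence of each frequent word with the others
cooc = Counter()
for s in sentences:
    words = sorted(set(s) & frequent)
    for w1, w2 in combinations(words, 2):
        cooc[w1] += 1
        cooc[w2] += 1

# Rank: the highest-scoring words are the keyword candidates
ranked = sorted(cooc.items(), key=lambda kv: -kv[1])
print(ranked)  # 'service' co-occurs most often
```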


2. Objectives

The objectives of the study are:

1. To extract the terms, concepts and categories from a textual database
2. To apply mining techniques to the categories extracted from the textual database in order to relate them to the customer ratings, and
3. To identify the categories that impact the ratings of the customers


3. Business Problem

A leading multinational Business Process Outsourcing (BPO) company is facing a challenge in its operational process and qualitative human resource management. Most of the clients of this BPO firm expect certain quality standards in its delivery and would like it to improve on certain operational issues. The client base of this organization typically runs into millions, and it serves the requirements of several medium-sized industries. It is typically infeasible for the organization to understand, develop and deliver on each of these customer requirements, as there is a cost involved in each delivery. So it would like to understand some of the common key issues with its services and human resources and try to improve on them. At its end, the organization had developed an open-ended question for its clients, whose responses it frequently collects as a way to understand customer sentiment. Along with the responses to the open-ended question, the customers rated their experience on a 5-point Likert scale. In this study the organization has collected responses from 401 clients, which it would like to mine in order to develop an automated strategy for mapping the key issues to the service locations and the managers who deal with them. It can thus identify the root of each issue and help those executives get better trained. The same approach could be used to identify those executives who have received positive feedback from the clients and to provide incentives for the achievers, thus reducing the attrition of the high achievers.

3.1 Translation of the Business Problem into a Data Mining Objective

The data is in an unstructured format and needs a specialized tool that helps in extracting relevant single-word expressions of customer sentiment, called "terms". Each such single word is essentially a function of many similar synonyms, and these terms are then grouped together under a lead term called a "concept". There is also a need to develop certain business rule sets to extract the concepts based on the domain. These rule sets will be developed and saved in the library of a text mining engine. The extracted concepts/business rule sets are converged into a single "category", which is in essence a higher-level concept or topic that captures the key ideas, knowledge, and attitudes expressed in the text.

Categories are made up of sets of descriptors, such as concepts, types, and rules. Together, these descriptors are used to identify whether or not a record or document belongs to a given category. A document or record can be scanned to see whether any text it contains matches a descriptor. If a match is found, the document/record is assigned to that category. This process is called categorization. The bottom line is to deploy a tool to extract the "categories".

This categorization will help the business users of the information to condense a large number of words into more reliable concepts, which in turn will be configured into categories that best explain an individual's perception of the service provided by the organization. These keywords are mapped to the numerical rating given by the customers, thus identifying sentiments that are critical and require attention.

In this paper, we will extract only the positive and negative sentiments through text analytics. This will help us identify the rating levels given by the customers. In this study we will also highlight some of the important aspects of a text mining procedure. The steps below guide the analyst in performing the text mining extraction reliably and confidently.
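The descriptor-matching categorization described above can be sketched in a few lines. The categories and descriptors below are invented for illustration; a real engine also matches on types and linguistic rules, not just literal terms:

```python
# A minimal sketch of descriptor-matching categorization: a record is
# assigned to a category when any of the category's descriptors
# (terms, synonyms) appears in its text.

categories = {
    "Positive Service": {"friendly", "helpful", "quick service"},
    "Negative Service": {"rude", "slow", "no call-back"},
}

def categorize(record):
    """Return every category whose descriptors match the record."""
    text = record.lower()
    return [name for name, descriptors in categories.items()
            if any(d in text for d in descriptors)]

print(categorize("The agent was friendly but very slow."))
# → ['Positive Service', 'Negative Service']
```

Note that a record can fall into several categories at once, which is exactly what happens later in the paper when positive and negative categories co-occur in the same response.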

3.2 Steps Involved in Text Analysis

There are seven major steps in the text analytics process:

1. Preparing text for analysis
   - Language identification
   - Document conversion
   - Segmentation
2. Extracting concepts
3. Uncovering opinions, relationships, facts, and events through Text Link Analysis
4. Building categories
5. Building text analytics models
6. Merging text analytics models with other data models
7. Deploying results to predictive models


The technology offering that we are using in this study is from IBM SPSS and is known in the market as PASW Text Analytics, version 13.0. This technology relies on linguistics-based text analysis for the extraction of key terms and ideas from the textual responses. Linguistics-based text analysis is based on the field of study known as natural language processing (NLP), also known as computational linguistics.

The various steps involved in the extraction process are:

1. Input data conversion
2. The identification of candidate terms
3. The identification of equivalence classes and integration of synonyms
4. Type assignment
5. Indexing
6. Pattern matching and events extraction
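Step 3, the merging of synonyms into equivalence classes, can be illustrated with a toy sketch. The synonym table below is invented for illustration; a real engine derives equivalence classes from its linguistic resources and fuzzy matching:

```python
# A toy sketch of synonym integration: merge synonymous terms into a
# single equivalence class headed by a lead term (the "concept").

synonyms = {
    "fast": "quick",
    "rapid": "quick",
    "speedy": "quick",
    "courteous": "friendly",
    "polite": "friendly",
}

def to_concept(term):
    """Map a term to its lead term (concept); unknown terms map to themselves."""
    return synonyms.get(term.lower(), term.lower())

terms = ["Fast", "rapid", "polite", "quick", "billing"]
concepts = sorted({to_concept(t) for t in terms})
print(concepts)  # → ['billing', 'friendly', 'quick']
```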

There are several approaches for creating categories. Because every data set is unique, the number of techniques and the order in which they are applied may change:

- Manual approach
  - Drag and drop technique
  - Code Frame Manager
- Automated solution
- Statistical text analysis solution
  - Term/type frequency
- Linguistic text analysis solution
  - Term derivation
  - Term inclusion
  - Semantic networks
  - Term co-occurrence

The methods used in this extraction procedure are concept derivation and semantic networks, with a derived cut-off of 30 categories to be generated during the extraction.


4. Results and Discussion

The text mining extraction relies heavily on linguistic resources to dictate how to extract relevant information from the text data. These linguistic resources come from the resource template, which is made up of libraries, compiled resources and some advanced resources that are used to define and manage types, terms, synonyms, and exclude lists. Some of these libraries come for specific applications, such as opinions/surveys, genomics, and security intelligence. During extraction, Text Mining for Clementine also refers to some internal, compiled resources, which contain a large number of definitions complementing the types in the Core library. In this study we have used the templates below to run the extraction procedure.

1. Local Library: Used to store user-defined dictionaries. It is an empty library added by default to all resources, and it contains an empty type dictionary. It is most useful because changes or refinements made to the resources directly (such as adding a word to a type) from the other interactive workbench views are automatically stored in the first library listed in the library tree in the Resource Editor; by default, this is the Local Library. You cannot publish this library because it is specific to the session project data. If you want to publish its contents, you must rename it first.

2. Core Library: Used in most cases, since it comprises the five basic built-in types representing people, locations, organizations, products, and unknown. While you may see only a few terms listed in one of its type dictionaries, the types represented in the Core library are actually complements to the robust types found in the compiled resources delivered with the text mining product. These compiled resources contain thousands of terms for each type. This explains how names such as George can be extracted and typed as Person when only John appears in the Person type dictionary in the Core library. Similarly, if you do not include the Core library, you may still see these types in your extraction results, since the compiled resources containing these types will still be used by the extractor.

3. CRM Library: Used to extract words and phrases often found in the CRM industry.

4. Customer Satisfaction Library: Used to extract words and phrases often used to understand customer satisfaction in different industries.

5. Market Intelligence Library: Used to extract words and phrases often used for identifying market conditions in different industries.

6. Product Satisfaction Library: Used to extract words and phrases often used in product satisfaction research.

The last four libraries are specific to the present study. Using the above dictionaries and the NLP text extraction engine, we were able to extract 488 concepts, out of which we have developed 27 categories. This type of unstructured data reduction is similar to factor analysis, which reduces a number of variables into more reliable components, each a function of all the related parameters. Only those sentiments that are relevant to the area of application are extracted. Figure 1 shows a captured screen-shot of the extraction of categories.

Figure 1. Captured screen-shot of the extraction of categories

These extracted categories are used for further analysis. Before going on to develop a text mining model, we would also like to evaluate and extract patterns from the text data. These patterns can help uncover interesting relationships between concepts in the data. Figures 2 and 3 show the negative sentiments that are associated with the extracted concepts.

Figure 2. Extracted concepts associated with negative sentiments

Figure 3. Association of concepts with negative sentiments

Similarly, Figures 4 and 5 show the concepts that have been extracted and their association with positive sentiments.

Figure 4. Extracted concepts associated with positive sentiments





Figure 5. Association of concepts with positive sentiments

This method of analysis helps reduce the number of words used in a given context to more reliable categories, which in turn contain many synonyms, hyponyms and contextual statements. This helps the analyst identify the key areas of dissatisfaction typically faced in a conversation with the service representative.


The next step in knowledge extraction is to relate the concepts/categories to the ratings given by the customers. As mentioned earlier, these ratings are on a 5-point Likert scale, with 1 being highly dissatisfied through 5 for highly satisfied. Table 1 presents the 27 extracted categories with their corresponding frequencies. In addition, Table 1 also presents the meaning associated with each of the categories. The mapping of a meaning to each category is important because the categories are combinations of words, and sometimes these combinations may not be self-explanatory. One such example is [ timing & <Positive> ] in Table 1.

Table 1. The frequency of the categories extracted and their meanings

Sl. No.  Word                                Count  Meaning
1        [ service & <Positive> ]            175    Positive customer service
2        [ service & <Negative> ]            84     Negative customer service
3        [ <Positive> + <> ]                 58     Positive sentiment associated
4        No Problem                          43     No problem / no comments
5        [language+<Negative>]               36     Negative about the language/accent of the representative
6        [ difficulty + . ]                  22     Difficulty and the sentiment associated with it
7        Friendly                            17     Friendly service
8        [ inactivity + . ]                  17     No waiting time
9        [efficient+<Positive>]              15     Efficiency of the workforce/organisation
10       sense of loyalty                    13     Sense of loyalty
11       [ question + . ]                    12     Questions and the sentiment associated with them
12       [ timing & <Positive> ]             12     Operating hours of the organisation / on-time solution
13       operating hours                     12     Work hours of the organisation
14       easyline                            11     Type of online transaction provided by TD
15       [ on the phone + . ]                10     On-the-phone solutions
16       [ bank & <Positive> ]               9      Positive about the organisation (here, TD Bank)
17       [ td & <Positive> ]                 9      Positive about the organisation
18       Convenience                         7      Convenience of banking
19       [ <Negative> + <> ]                 7      Negative sentiment associated
20       [experience+<Positive>]             7      Positive experience of the workforce/service and organisation
21       product                             7      Product offered by the organisation
22       quick service                       7      Fast service
23       [ telecommunication + <Positive> ]  6      Positive about the telecom/call center service
24       accesibility                        6      Accessibility of the services offered
25       ease of use                         4      Usage of the services offered, such as online transactions
26       online                              4      Online services
27       patient                             4      Patience of the customer representative toward customer queries


The above table shows the categories extracted from the unstructured data. The concepts are divided into negative and positive customer sentiments. We will use both of these sentiments to study the behavior of the ratings. As mentioned earlier, the customers rated their experience on a 5-point Likert scale, with 1 being poor and 5 being excellent. In order to relate the extracted categories to the ratings of the customers, web diagrams were created for each rating. While creating the web diagrams, a minimum threshold level of 50 percent of the link was set. Figures 6 and 7 present the web diagrams for ratings 4 and 5, respectively.
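One possible reading of the 50-percent link threshold (an assumption, since the exact formula behind the tool's web diagrams is not spelled out here) is that a category is linked to a rating only when at least half of the records containing that category carry the rating. The records below are invented for illustration:

```python
# A rough sketch of thresholded category-rating links: link a category
# to a rating only when the category co-occurs with that rating in at
# least 50% of the records carrying the category.

records = [
    ({"friendly", "quick service"}, 4),
    ({"friendly"}, 4),
    ({"friendly"}, 3),
    ({"patient"}, 5),
    ({"patient"}, 5),
]

threshold = 0.5
links = []
all_categories = set().union(*(cats for cats, _ in records))
for cat in sorted(all_categories):
    ratings_with_cat = [r for cats, r in records if cat in cats]
    for rating in sorted(set(ratings_with_cat)):
        share = ratings_with_cat.count(rating) / len(ratings_with_cat)
        if share >= threshold:
            links.append((cat, rating, share))

print(links)  # 'friendly'→4, 'patient'→5, 'quick service'→4 survive
```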



Figure 6. Web diagram showing the relationship between different categories and a rating of 4

Figure 7. Web diagram showing the relationship between different categories and a rating of 5


It can be seen from the above diagrams that "Friendly", "Quick Service", "Product", "Timing & Positive", "Experience & Positive", "Bank & Positive" and "Negative" are the categories that influence a rating of 4. Similarly, "Patient", "Ease of Use", "No Problem", "Telecommunication & Positive", "TD & Positive" and "Efficient & Positive" influence a rating of 5. It is interesting to note that there are no categories common to the ratings of 4 and 5. On the other hand, these ratings are ordinal in nature and hence subject to the perceptional feelings of the customers. Nevertheless, the ratings of 4 and 5 are positive in nature and consequently, all these categories are important in influencing a positive customer experience. The customer service representatives in the call center need to be trained to understand the importance of these categories.

Only a small proportion of customers had rated their experience as 1 or 2, which are negative in nature. As a result, the web diagrams for these ratings did not fulfill the threshold requirement of 50 percent. At the same time, it is extremely important to identify the categories that lead to negative customer ratings. In order to address this problem, a classification tree was constructed from the data, using the customer rating as the dependent variable and the extracted categories as the explanatory variables. Since the negative ratings (ratings of 1 or 2) constitute a small proportion as compared to the positive ratings (ratings of 4 or 5), the dataset falls into the category of skewed data sets. Skewed datasets lead to a typical problem with classification trees: the minority class gets overwhelmed by the majority class, and consequently the classification tree models tend to make predictions in favor of the majority class. This problem is usually addressed by methods of "under-sampling" or "over-sampling" (Anujkumar and Nagadevara, 2001). In this particular case, under-sampling is not appropriate because it would result in losing observations, leading to loss of valuable information. The over-sampling method was employed in this particular analysis by replicating the minority cases a number of times so that the resulting dataset is balanced. The classification tree was constructed using the balanced dataset. Figure 8 presents a part of the classification tree.


Figure 8. A part of the classification tree
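The over-sampling by replication described above can be sketched as follows; the records and labels are invented for illustration:

```python
# A minimal sketch of over-sampling by replication: minority-class
# records are duplicated until every class roughly matches the size
# of the majority class.

from collections import Counter
import math

def oversample(records):
    """Replicate minority-class records to balance the dataset."""
    counts = Counter(label for _, label in records)
    target = max(counts.values())
    balanced = []
    for features, label in records:
        # Replicate each record enough times to approach the target size
        balanced.extend([(features, label)] * math.ceil(target / counts[label]))
    return balanced

data = [("r1", 1), ("r2", 4), ("r3", 4), ("r4", 4), ("r5", 4)]
balanced = oversample(data)
print(Counter(label for _, label in balanced))  # the lone rating-1 record is replicated 4x
```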

Rules to predict the customer rating are extracted from the classification tree. Table 2 presents a sample of the rules.

Table 2. Sample of rules extracted from the classification tree

Rule for rating of 1.0:
    if [ call-back + . ] = T
    and [ service & <Negative> ] = T
    and [ service & <Positive> ] = T
    then 1.000

Rule for rating of 2.0:
    if [ difficulty + . ] = T
    and [ inactivity + . ] = T
    and [ service & <Negative> ] = T
    then 2.000

Rule for rating of 3.0:
    if [language+<Negative>] = T
    and [ service & <Negative> ] = T
    and easyline = T
    then 3.000

Rule for rating of 4.0:
    if Location = Pune
    and product = T
    and [ service & <Positive> ] = T
    then 4.000

Rule for rating of 5.0:
    if [ service & <Positive> ] = T
    and easyline = T
    then 5.000
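The rules in Table 2 can be read as an ordered rule list: each rule is a conjunction of category-match flags, and the first rule that fires determines the predicted rating. A sketch in Python (the field names are illustrative stand-ins for the bracketed category labels):

```python
# A sketch of applying the extracted rule set: a record is a dict of
# category-match flags (True when the record matched that category).
# Rules are tried in order; the first match wins.

def predict(rec):
    """Apply the sample rule set in order; return the first matching rating."""
    g = rec.get
    if g("call-back") and g("service_neg") and g("service_pos"):
        return 1.0
    if g("difficulty") and g("inactivity") and g("service_neg"):
        return 2.0
    if g("language_neg") and g("service_neg") and g("easyline"):
        return 3.0
    if g("location") == "Pune" and g("product") and g("service_pos"):
        return 4.0
    if g("service_pos") and g("easyline"):
        return 5.0
    return None  # no rule fires

rec = {"service_pos": True, "easyline": True}
print(predict(rec))  # → 5.0
```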

These prediction rules extracted from the classification tree are applied to the dataset in order to validate the extent of correctness of the predictions obtained from the classification tree. The results of the predictions are presented in Table 3.

Table 3. Observed vs. predicted ratings: prediction through the classification tree

                   Predicted
Observed      1    2    3    4    5   Total   Accuracy
1            13    0    0    1    1      15     86.67%
2             0    8    2    1    0      11     72.73%
3             7    0   40    6    0      53     75.47%
4            21    1   29   82   39     172     47.67%
5            18    3   23   39   66     149     44.30%
Total        59   12   94  129  106     400     52.25%


It can be seen from Table 3 that the predictions of the classification tree are fairly accurate with respect to the ratings of 1, 2 and 3, whereas the predictions are not so accurate with respect to the ratings of 4 and 5. But the categories that influence the ratings of 4 and 5 have already been identified through the web diagrams.
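The per-class and overall accuracies in Table 3 follow directly from the observed-vs-predicted counts: each class accuracy is the diagonal count divided by its row total, and the overall accuracy is the diagonal sum over all 400 records:

```python
# Computing Table 3's accuracies from the confusion matrix.
# Rows are observed ratings 1-5; columns are predicted ratings 1-5.

matrix = [
    [13, 0,  0,  1,  1],   # observed 1
    [0,  8,  2,  1,  0],   # observed 2
    [7,  0, 40,  6,  0],   # observed 3
    [21, 1, 29, 82, 39],   # observed 4
    [18, 3, 23, 39, 66],   # observed 5
]

# Per-class accuracy: correct predictions / row total
per_class = [row[i] / sum(row) for i, row in enumerate(matrix)]

# Overall accuracy: diagonal sum / grand total
overall = sum(matrix[i][i] for i in range(5)) / sum(map(sum, matrix))

for rating, acc in enumerate(per_class, start=1):
    print(f"rating {rating}: {acc:.2%}")
print(f"overall: {overall:.2%}")  # → overall: 52.25%
```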

The rule set extracted from the classification tree can be analyzed to identify the categories that influence the negative ratings (ratings of 1 or 2). These categories are "Call-back", "Service & Negative", "Language & Negative", "On the phone", "Experience & Positive", "Service & Positive", "Difficulty" and "Inactivity". It is interesting to note that some of the positive categories, such as "Experience & Positive" and "Service & Positive", are associated with the negative ratings. The analysis shows that these categories, in combination with the other negative categories, influence the negative ratings. In other words, even though these are positive categories, they lead to negative ratings when associated with some of the negative categories. Obviously, the influence of the negative categories is much stronger, and even when the customer is exposed to some of the positive categories, the perceived experience is negative owing to the strong influence of the negative categories.

5. Summary and Conclusions

Textual data mining is a rapidly growing application of knowledge discovery in databases (KDD) due to the ever-increasing volume of structured and unstructured documents produced by large organizations. Most of the techniques developed in the area of data mining and information retrieval are designed to handle only fixed-format data or free-form text, but not a combination of both. There are many application domains in which both fixed-format and free-form data are available. For example, an e-commerce Web site may collect profiles of its customers as fixed-format data as well as users' reviews of its products in free-form text. Incorporating both types of information into the data mining task enhances our understanding of the domain, but also increases the complexity of the problem. This paper analyzes data containing both textual and numeric types of data. The data consisted of unstructured textual responses. In addition, the database contained the ratings given by the customers on a 5-point Likert scale. These ratings were given by the customers based on their interactions with the CSEs of the BPO.

Initially, different concepts were extracted from the textual data. These concepts were grouped into various categories, which were classified into positive and negative categories. Web diagrams were developed to identify the categories that are linked to the ratings given by the customers. Based on these web diagrams, it was possible to identify the different categories that are linked to the ratings of 4 and 5. On the other hand, it was not possible to identify the categories linked to the ratings of 1, 2 and 3 using web diagrams, so a different approach was required to identify these categories. Classification trees were constructed to predict the ratings based on the various categories, and the rule sets extracted from the classification trees were used to identify the categories that lead to the ratings of 1, 2 and 3. It was found that web diagrams were effective in identifying the categories linked to the positive ratings (ratings of 4 and 5), whereas classification trees were more effective in identifying the categories that would help in predicting the negative ratings (ratings of 1 and 2).

References

Dörre, J., Gerstl, P. and Seiffert, R. Text Mining: Finding Nuggets in Mountains of Textual Data. In Knowledge Discovery and Data Mining (KDD-99, San Diego, CA, USA), pages 398-401, 1999. Available at http://www.cs.uvm.edu/~xwu/kdd/KDD-dorre.pdf

Feldman, R. and Dagan, I. Knowledge Discovery in Textual Databases (KDT). In Knowledge Discovery and Data Mining, pages 112-117, 1995.

Hearst, M. Untangling Text Data Mining. In Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.

Hisamitsu, T., Niwa, Y. and Tsujii, J. A Method of Measuring Term Representativeness: Baseline Method Using Co-occurrence Distribution. In Proceedings of the 18th International Conference on Computational Linguistics (Saarbrücken, Germany, July 2000), pages 320-326.

Koppel, M., Schler, J. and Zigdon, K. Determining an Author's Native Language by Mining a Text for Errors. KDD'05, August 21-24, 2005, Chicago, IL, USA.

Kroeze, J. H., Matthee, M. C. and Bothma, T. J. D. Differentiating Data- and Text-Mining Terminology. In Proceedings of the 2003 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists on Enablement through Technology, pages 93-101, 2003.

Matsuo, Y., Ohsawa, Y. and Ishizuka, M. KeyWorld: Extracting Keywords in a Document as a Small World. In Proceedings of the Fourth International Conference on Discovery Science (Washington, D.C., 2001), pages 271-281.

Rong Chen, Rose, A. and Bederson, B. B. How People Read Books Online: Mining and Visualizing Web Logs for Use Information, 2008. Available at http://hcil.cs.umd.edu/trs/2009-05/2009-05.pdf

Roy, S. and Subramaniam, L. V. Automatic Generation of Domain Models for Call Centers from Noisy Transcriptions. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 737-744, Sydney, July 2006.

Sakurai, S., Ichimura, Y., Suyama, A. and Orihara, R. Inductive Learning of a Knowledge Dictionary for a Text Mining System. In Proceedings of the 14th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems (Budapest, Hungary, June 2001), pages 247-252.

Sehgal, A. K. Text Mining: The Search for Novelty in Text. A report submitted in partial fulfillment of the requirements of the Ph.D. Comprehensive Examination in the Department of Computer Science.

Shimazu, K., Momma, A. and Furukawa, K. Experimental Study of Discovering Essential Information from Customer Inquiry. SIGKDD '03, August 24-27, 2003, Washington, DC, USA.

Tan, A. Text Mining: The State of the Art and the Challenges. In Proceedings of the Pacific Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99) Workshop on Knowledge Discovery from Advanced Databases, pages 65-70, 1999.

Tan, P.-N., Blau, H., Harp, S. and Goldman, R. Textual Data Mining of Service Center Call Records. KDD 2000, Boston, MA, USA.