

Text Mining

An Examination of Text Mining Software

for Tri-State Times

Sujan Manandhar

Virginia Dressler

Information Science

Dr. J. Holmes

Mar 19, 2007



With an estimated 80 percent or more of the content on the Web in unstructured text format
(Haravu & Neelameghan, 2003, p. 100), effective and relevant information retrieval is a growing
problem, particularly in situations involving massive amounts of data. This can be seen in
queries of commercial search engines, which often yield high recall and low precision.
Commercial search engines also include indexing and ranking mechanisms that are unknown to the
user and that greatly affect the result of any search through the ordering mechanisms and the
selection of terms. The greater the number of recalled documents, the more time and energy the
user must spend sorting meaningful information from the irrelevant. As a result, quality
information retrieval methods have become a field of growing interest in software development.
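The trade-off between recall and precision mentioned above can be made concrete with a small
calculation. The following is a minimal sketch in Python; the function name and the document
identifiers are hypothetical and are used only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the results of a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of the retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of the relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: ten documents come back, but only two of them are
# relevant, and those two are the only relevant documents in the collection.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = ["doc3", "doc7"]
precision, recall = precision_recall(retrieved, relevant)
```

In this made-up case every relevant document is found (high recall), but the user still has to
sort through eight irrelevant results (low precision).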

Text mining is in a relatively early phase and, as such, is still quite limited in application
and use, yet it often yields beneficial results. In this paper, the authors look into text
mining as a solution to one of the problems of information retrieval. Text mining can be defined
as the discovery of new, previously unknown information by automatically extracting information
from processed text sources. In addition, text mining is capable of linking extracted
information together to form new facts or new hypotheses to be explored further by more
conventional means of experimentation (Hearst, 2003). Text mining can also be used to uncover a
“narrative in an unstructured mass of text” (Haravu & Neelameghan, p. 103) and to show how a
particular environment is evolving (a defined market or business, for example).

Concepts in different texts can be linked together through a number of natural language
processing methods that draw on resources such as a corresponding thesaurus, a glossary,
structured subject representations, or a schedule of a classification scheme. Depending on the
application, these are either preformatted components of a software package or can be
manipulated by the end user. For this study, the authors chose relatively basic applications of
text mining, which allowed the results of a query to be studied with respect to the decisions
and considerations of a test group.

The authors conducted a study in which they evaluated different text mining software for the
information organization needs of the Tri-State University newspaper. This study compares
various text mining systems on their functionality and ease of use, as well as on how well they
fit the small budget and skill set of the users.

Problem Statement

When different text mining software is used, the same set of documents will yield different
results for the same query. Differences in the algorithms and weighting schemas are the main
influence on the retrieved data, though the assessment and evaluation of the user should also
be considered as a factor in the overall effectiveness and relevance of the retrieved
information. Additionally, a comparison of the results will examine the varying degrees of user
satisfaction, which indicate a strong degree of subjectivity and differences in each user’s
relation to information and language.

Extent of the study

This study looks at only a small aspect of some simple applications to provide a cursory
example of the limitations and benefits of text mining. By using a small set of documents from
a single source, the authors hope to limit the scope enough to examine the effectiveness of the
retrieved information.


We believe that future research will be needed to delve further into the topic with larger case
studies and to examine how text mining can be implemented in settings and uses other than
medical, technical and business. A review and study of the major applications of text mining
would also be useful at this point in time. Since this is an early stage for the applications
of text mining, we feel that it is also important to observe the current limitations and gain
information from existing applications. The possibility of creating links between increasingly
massive banks of textual information will be a major topic of study as our dependence on and
stock in technology increase.


Text mining can mean different things depending on the method and purpose. Sharp (2001) aptly
says, “text mining per se is new and is still defining itself.” Yeates (2002) says that text
mining discovers patterns in natural language text, and is the process of analyzing text to
extract information from it for particular purposes. Natarajan (2005) describes text mining as
intelligent text analysis by discovering unknown links and relationships between sets of
documents, even perhaps between non-trivial information. Whatever the context or definition,
the computational process involved in text mining is the fundamental aspect of the detection of

The applications of text mining date to the mid-1990s, with IBM’s release of the first software
package, Intelligent Text Miner, in 1998. The acknowledgment of information existing as raw
material to an organization has impacted this aspect of information retrieval to software
(do Prado et al., 2004, p. 225). As well, the considerations of the user have become a larger
part of software design, particularly with natural language processing capabilities. This is
particularly effective in situations where there is a lack of controlled (or structured) data,
such as Web content, e-mails, and other informal documents. A simple information retrieval is
often complicated by the factors inherent to unstructured data, such as the variety of spelling

Natarajan (2005) cites five requirements for high-quality text mining. The desirable conditions
and factors for information retrieval include information comprehensiveness, quality of the
knowledge base, a high-quality method of information retrieval, techniques and protocols of
information extraction (implementation of an internal thesaurus, etc.), and technical expertise
(p. 33). Interestingly, the classification and cataloguing methods of the library have often
been used as a reference in many natural language processing applications.

Literature Review

The topic of text mining is still a relatively new area, and most of the literature on the
subject tends to be geared towards practical application (business, financial, medical, etc.).
But as far back as the 1950s, Luhn (1958), in a seminal paper on automatic abstracting, pointed
out "the resolving power of significant words" in primary text. Doyle (1961) hints at early
text mining and says "natural characterization and organization of information can come from
analysis of frequencies and distributions of words in libraries." Swanson (1988_1), a major
force behind text mining, examined scientific literature as a natural phenomenon worthy of
"exploration, correlation, and synthesis."

Sullivan (2004) discusses text mining within a business model, outlining the constructs of such
software and the impact and benefits of text mining in a practical scenario. Sullivan also
mentions the limitations, as of 2004, of NLP processing and its error rates. He also concludes
that pre-existing (preformatted) processing schemas and components in many text mining
applications are inappropriate for most queries. More effective are the methods that allow the
user to identify relations and categories within a sample set of documents (“supervised
learning”, pg. 102). In turn, algorithms would be created from these decisions and choices of
related documents.

do Prado (2004) applies the CRISP-DM methodological approach to a case study. A set of 57,000
documents from a news agency was gathered in 2001 to study the relationships between the text
and the defined grouping of information (types of news). Clusters of text connected unrelated
items, indicating a method of knowledge discovery. Seven main schemas of clustering were found
in this group of documents, both hierarchical and non-hierarchical. A conceptual approach to
the search yielded better results (p. 224). Sharp (2001) looks at several examples of text
mining and natural language processing (NLP). He also traces the aspects of machine learning in
text mining and how this can play a pivotal role in its development. He is able to highlight
some of the main features of text mining along with some of the seminal works related to it.

Feldman and Sanger (2006) were among the first to encapsulate the whole topic of text mining in
a book that breaks down the core principles and examines probabilistic models. A selection of
existing applications in specific fields of interest is also studied, mainly business,
technical and medical situations.

Major Concepts in Text Mining

Before getting into the nuts and bolts of text mining, it is important to know some of the key
concepts that lead up to it. Expounding on all the concepts related to text mining is not
possible within the scope of this paper. The authors will try to explain some of the important
terms that this study comes across: natural language processing, knowledge discovery, and data
mining.

Knowledge Discovery

Knowledge discovery is the process of finding novel, interesting, and useful patterns in data.
Data mining is a subset of knowledge discovery. This method allows the data to suggest new
hypotheses to test (Purple Insight, n.d.). James M. Caruthers makes an interesting analogy and
says "Instead of mining for a nugget of gold, knowledge discovery is more like sifting through a
warehouse filled with small gears, levers, etc., none of which is particularly valuable by
itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts."
(Venere,

In his seminal work, Swanson (1988_ proved that it was possible to discover new knowledge from
existing literature by linking the information present in complementary but disconnected areas.
Smalheiser and Swanson (1998) postulated a number of new biomedical hypotheses, which were later
verified by domain experts. They developed two approaches, known as Open and Closed discovery,
for generating new hypotheses. However, their research required substantial manual input. Since
then, a number of efforts have aimed to automate the discovery process.

Natural Language Processing

Natural language processing (NLP) is a major component of text mining. NLP is the branch of
linguistics which deals with computational models of language. Sharp (2001) says that NLP can
differentiate how words are used, such as by sentence parsing and part-of-speech tagging, and in
the process add discriminatory power to statistical text analysis. He says that NLP could be a
powerful tool for text mining.

Natural language has evolved to help humans communicate with one another and also to record
information. Computers are still incapable of understanding natural language, though the
attempt to make meaningful and relevant connections between words with language processing
mechanisms continues to be of interest and study. Humans are able to differentiate and apply
linguistic patterns to text, overcoming obstacles (such as slang, spelling variations, and
contextual meaning), while computers do not handle them easily. Meanwhile, although our
language capabilities allow us to comprehend unstructured data, we lack the computer's ability
to process text in large volumes at high speeds. The key to text mining is creating technology
that combines a human's linguistic capabilities with the speed and accuracy of a computer
(Fan, Wallace, et al., 2005).

The complete understanding of natural language text is difficult to attain. Text mining focuses
on extracting a small amount of information from text with high reliability (Yeates, 2002).
Natural language is ambiguous and the same keyword may express entirely different meanings,
e.g. “Washington” may be a person or a place. Such ambiguity is normally resolved through
context. The inverse problem is that different expressions may refer to the same meaning, e.g.
“car” and “automobile”. From these two problems, it is easy to rule out the surface expression
of the keywords alone as a proper representation for text mining (Chibelushi, Sharp, & Salter,
n.d.).

Moreover, because of the centrality of natural language text to its mission, text mining also
draws on advances made in other computer science disciplines concerned with the handling of
natural language. Perhaps most notably, text mining exploits techniques and methodologies from
the areas of information retrieval, information extraction, and corpus-based computational
linguistics (Feldman & Sanger, 2006).

One example of the application of a large-scale natural language processing database to
practical search methods is the WordNet project (Stevenson, 2003, p. 39). A cognitive
psychologist used research on the structure of the human mental lexicon and attempted to
construct a resource that would mirror this structure. A basic block of terms was created in
the research, and over half of these terms were found to have a rather large number of
synonyms, which were tiered into hierarchical chains. For example, the term ‘canary’ was related
to about 20 other terms, the closest being ‘finch’ and the furthest (but still deemed to be
related) ‘entity’ (p. 40). When such a resource is applied in conjunction with text mining, all
relevant synonyms would in effect be used with a query. Again, there are a number of thesaurus
and dictionary programs, whose quality also varies, which would become a part of the natural
language processing. The quality of these programs would affect the search, as too many
synonyms would result in higher recall of similar terms that would less likely be relevant.
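The effect of adding synonyms to a query can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the tiny thesaurus, the documents, and the function
names are all hypothetical, and a real system would draw on a resource such as WordNet rather
than a hand-written dictionary.

```python
# Hypothetical miniature thesaurus standing in for a real lexical resource.
THESAURUS = {
    "car": {"automobile", "auto"},
    "canary": {"finch", "bird"},
}


def expand_query(terms, thesaurus=THESAURUS):
    """Return the query terms together with their listed synonyms."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        expanded |= thesaurus.get(term, set())
    return expanded


def search(terms, documents):
    """Return the ids of documents containing at least one query term."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if terms & set(text.lower().split())
    )


DOCS = {
    "a": "the car dealer sold one automobile",
    "b": "an auto parts recall was announced",
    "c": "the finch is a small bird",
}
plain = search({"car"}, DOCS)
expanded = search(expand_query(["car"]), DOCS)
```

Here the expanded query also returns document "b", which mentions "auto" only in passing. The
extra synonyms raise recall, but the added document may or may not be relevant to the user.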

Technological advances are, however, beginning to close the gap between human and computer
languages. The field of natural language processing has produced technologies that teach
computers natural language, enabling them to analyze, understand, and even generate text
(Fan, Wallace, et al., 2005). As more time and research are put into this facet of text mining,
the possibilities for relevant, meaningful results will grow.


Data Mining

Text mining has its roots in data mining and consequently has many similar features. Like data
mining, text mining seeks to extract useful information from data sources through the
identification and exploration of interesting patterns. But unlike data mining, in text mining
the data sources are created from defined and processed document collections. Interesting
patterns are found not among formalized database records but in the unstructured textual data
in the documents in these collections (Feldman & Sanger, 2006). In addition, both text mining
and data mining systems have a similar high-level architecture, including preprocessing
routines, pattern discovery algorithms, and presentation-layer elements such as visualization
tools to enhance the browsing of answer sets.

Text mining adopts many of the specific types of patterns in its core knowledge discovery
operations that were first introduced in data mining research (Feldman & Sanger, 2006). While
data mining mostly deals with structured data, text mining is designed to handle structured
data from databases or XML files, and can also handle unstructured or semi-structured data sets
(such as email, full-text documents, and HTML files). As a result, text mining is a better
solution for companies where volumes of diverse information must be merged and managed
(Fan, Wallace, et al., 2005).

To date, however, most research and development efforts have centered on data mining using
structured data. Because data mining deals with data that have already been stored in a
structured format, much of its preprocessing focus falls on two critical tasks: (1) scrubbing
and normalizing data and (2) creating extensive numbers of table joins. But in text mining,
preprocessing deals with the identification and extraction of representative features for
natural language documents. These preprocessing operations mainly transform unstructured data
stored in document collections into a more explicitly structured intermediate format
(Feldman & Sanger, 2006).
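As an illustration of what such an intermediate format might look like, the sketch below turns
raw text into one term-count record per document. It is a minimal example in Python, not the
preprocessing used by any particular package; the tokenization rule, the function name and the
sample stories are assumptions made for the example.

```python
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Hypothetical document collection of two short news stories.
documents = {
    "story1": "The council approved the budget. The budget passed 5-2.",
    "story2": "Students protested the tuition increase.",
}

# Structured intermediate form: one term-count record per document.
term_counts = {doc_id: preprocess(text) for doc_id, text in documents.items()}
```

Once every document is represented this way, later stages can treat the collection much like a
table of records, with terms as the features.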

Process of Text Mining

Text mining can be examined in three stages: text preparation, text processing and text
analysis. A key element of text mining is its focus on the document collection. At its
simplest, a document collection can be any grouping of text-based documents. Practically
speaking, however, most text mining solutions are aimed at discovering patterns across very
large document collections. Another critical element in text mining is the document. For
practical purposes, a document can be very informally defined as a unit of discrete textual
data within a collection that usually, but not necessarily, correlates with some real-world
document such as a business report, legal memorandum, e-mail, research paper, manuscript,
article, press release, or news story (Feldman & Sanger, 2006).

During the initial stage, text preparation, a selected set of documents is input into text
mining software that cleans and preprocesses the data. The text processing stage is where the
user enters the picture and submits an information query to the program. An algorithm is
applied to the processed data, which clusters the data to find meaningful patterns. Text
analysis evaluates the relevance of the mined text and renders it in a more tangible form
(Natarajan, 2005).

After text files are input to a system, text mining software produces a semantic network of key
concepts and terms in each file, defined by a weighting algorithm. This algorithm is used to
find meaningful patterns in data by frequency of terms. Phrasal analysis is also available in
addition to the term searches. This method has often proved to be the most effective method of
text mining, as noun phrases such as company names, personal names, locations, or case names
would often be more useful than single term searches. Subject headings and content descriptors
are frequently derived from library classification schemes. The user can enter a query and
receive a compilation of relevant data pulled from documents in the format of an XML or HTML
file, or even a comma-separated file. These forms of the results are often highly visual or
graphical, and aim to produce an organizational knowledge map. The purpose of this compilation
would be to discover new knowledge or information based on similar concepts or to find a
narrative in an unstructured set of documents.

An understanding of the specific mapping of the selected group of information is important to
the information retriever. There should be a certain level of awareness of the classification,
clustering and categorization schemes of platform servers, network servers, database and
workgroup applications to effectively use text mining software. The quality of the information
that is mined is largely dependent on a certain level of organization within the originating
knowledge base. As well, the results are reviewed by the individual, who assigns relevance and
value to the information.

Figure 1: Text Mining Process (Adapted from Fan, Wallace, et al., 2005)


in Tex


There are several techniques in the text mining technologies that are utilized through
different applications. Some of these include information extraction, topic tracking,
summarization, categorization, clustering, concept linkage, information visualization, and
question answering (Fan, Wallace, et al., 2005). In this section we review some of the
technologies that we used in evaluating the text mining software for the Tri-State project.

Information extraction


This technology is a popular method for analyzing unstructured text and identifying key phrases
and relationships within text. It looks for predefined sequences in the text by pattern
matching. The technology is especially useful when dealing with large volumes of text
(Fan, Wallace, et al., 2005).
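Pattern matching of this kind can be sketched with regular expressions. The two patterns below
(a dollar amount, and a "Name, role of Organization" phrase) are hypothetical examples chosen
for a news setting; they are assumptions for illustration and are far simpler than the rule
sets a real extraction tool would use.

```python
import re

# Hypothetical predefined sequences to look for in news text.
MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:million|billion))?")
PERSON_ROLE = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), "
    r"(?P<role>[a-z ]+) of "
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*)"
)


def extract(text):
    """Return the dollar amounts and person/role/organization triples found."""
    return {
        "amounts": MONEY.findall(text),
        "roles": [m.groupdict() for m in PERSON_ROLE.finditer(text)],
    }


sample = (
    "Jane Doe, president of Acme Widgets, said the company "
    "will invest $2.5 million in a new plant."
)
facts = extract(sample)
```

The result links a person, a role, an organization and an amount that appear in the same
sentence, which is the kind of key phrase and relationship the technique is meant to surface.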


Text summarization helps users figure out whether a lengthy document meets their needs and is
worth reading. It is important to reduce the length and detail of a document while retaining
its main points and overall meaning. Sentence extraction mines important sentences from a given
text by statistically weighting all the sentences in the text. Other heuristics, such as
position information and extracting sentences that follow a key phrase like "in conclusion",
are also used in summarization. Headings and other markers of subtopics are searched in order
to identify the document's key points (Fan, Wallace, et al., 2005).
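A minimal sketch of sentence extraction is given below in Python. It weights each sentence by
the average collection frequency of its content words and adds a bonus for the lead sentence
and for a cue phrase. The stop word list, the bonus values and the sample article are
assumptions made for the example, not settings taken from any of the packages reviewed.

```python
import re
from collections import Counter

# Small illustrative stop word list and cue phrases (assumed for the sketch).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for",
             "was", "it", "that"}
CUE_PHRASES = ("in conclusion",)


def content_words(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, n=2):
    """Return the n highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(index, sentence):
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        value = sum(freq[w] for w in tokens) / len(tokens)
        if index == 0:
            value += 1.0  # position heuristic: favor the lead sentence
        if any(cue in sentence.lower() for cue in CUE_PHRASES):
            value += 1.0  # cue phrase heuristic
        return value

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(i, sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


article = (
    "The city council approved the budget on Monday. Weather was mild. "
    "Several members praised the budget process. "
    "In conclusion, the budget will take effect in July."
)
summary = summarize(article, n=2)
```

For this made-up article the lead sentence and the "in conclusion" sentence are kept, while the
sentence about the weather is dropped.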

Topic tracking

A topic-tracking system keeps user profiles and, based on the documents a user views, forecasts
other documents of interest to the user. It allows users to choose keywords and notifies them
when news relating to the topics becomes available. Some tools let users select particular
categories of interest and infer users' interests based on their reading histories and through
information they have left behind online (Fan, Wallace, et al., 2005).

Term weighting and association rules are also common in text mining. In the term weighting
technique, document representation is done by removing functional words (e.g. conjunctions,
prepositions, pronouns, adverbs, etc.) and then assigning weights to content words (e.g. agent,
decision making), in order to describe how important the word is for that particular document
or document collection. This is because some words carry more meaning than others (Chibelushi,
Sharp, & Salter, n.d.). Association rules have made their way from data mining to text mining.
An association rule is a probabilistic statement about the co-occurrence of certain events in a
database or large collection of texts.
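Both ideas can be illustrated with a short Python sketch. The weighting shown here is TF-IDF,
one common scheme that is assumed for the example (the source does not name a specific
formula), and the rule measure is the usual support and confidence of a one-term rule. The
function word list, the function names and the documents are hypothetical.

```python
import math
import re
from collections import Counter

# Small illustrative list of functional words (assumed for the sketch).
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to",
                  "it", "is", "was", "by"}


def content_terms(text):
    """Tokenize and drop functional words, keeping content words only."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in FUNCTION_WORDS]


def tf_idf(documents):
    """Weight each content word by term frequency times inverse document frequency."""
    counts = {d: Counter(content_terms(t)) for d, t in documents.items()}
    n_docs = len(documents)
    doc_freq = Counter(term for c in counts.values() for term in c)
    weights = {}
    for doc_id, c in counts.items():
        total = sum(c.values())
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in c.items()
        }
    return weights


def rule_confidence(documents, antecedent, consequent):
    """Support and confidence of the rule 'antecedent -> consequent'."""
    term_sets = [set(content_terms(t)) for t in documents.values()]
    with_a = [s for s in term_sets if antecedent in s]
    with_both = [s for s in with_a if consequent in s]
    support = len(with_both) / len(term_sets) if term_sets else 0.0
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence


docs = {
    "d1": "The agent made the decision.",
    "d2": "The agent was late.",
    "d3": "A decision on the budget was delayed by the budget office.",
}
weights = tf_idf(docs)
support, confidence = rule_confidence(docs, "agent", "decision")
```

In this toy collection "budget" receives the highest weight in d3 because it is frequent there
and absent elsewhere, and the rule "agent -> decision" holds in half of the documents that
mention "agent".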

Purpose of Text Mining

The purpose of text mining is to make meaningful connections between unstructured text data. As
we have discussed, the three main stages in text mining are data preparation, data processing
and data analysis. Depending on the software or sites reviewed (or the guidance of the human
counterpart), the data is formatted during the selection and preprocessing stage. One issue in
the result of mined text is the lack of a hierarchy in the display of indexes or clustered
information.

A data-mining algorithm is then applied to the preprocessed data. Sentences and paragraphs are
identified and parts of speech are tagged in the set of documents at this point. Natural
language processing would provide conceptual relations between entities and perhaps make links
between certain chunks of information (the people, associated companies, etc.). Text analysis,
the evaluation of the output, is the more subjective aspect of the process. After being run
through certain algorithms, the resulting text is subjected to further processing (Link
Discovery tool, or other).

Rudimentary term extraction is the most basic form of text mining; it weights lists of terms
from a set of texts into a feature vector. A search of any scale would in effect measure the
similarities between documents by their feature vectors. In some text mining software, the user
or systems administrator would take a sample group of documents and create certain rules on
terms which the software translates into an algorithm (this has been referred to as “supervised
learning”, Sullivan, p. 102), as mentioned earlier. Alternatively, other software is set with
preexisting classification schemas that weight phrases and terms (or “unsupervised learning”,
Sullivan, p.
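Measuring similarity between feature vectors can be sketched as follows. The example uses raw
term counts as the vector and cosine similarity as the measure; both are common choices assumed
here for illustration, and the headlines and function names are hypothetical.

```python
import math
import re
from collections import Counter


def feature_vector(text):
    """Represent a text as a sparse vector of term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_text, documents):
    """Rank documents by similarity to the query text, best match first."""
    query = feature_vector(query_text)
    scored = [(cosine_similarity(query, feature_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, reverse=True)


docs = {
    "a": "tuition increase approved by trustees",
    "b": "football team wins opener",
    "c": "students protest tuition increase",
}
ranking = most_similar("tuition increase protest", docs)
```

The document sharing the most terms with the query ranks first, and a document with no terms in
common scores zero.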

The underlying notion in text mining is that frequency of term occurrence equates to relevance.
Perhaps more useful is maximum frequent sequences (MFS), a method of extracting the phrases
that occur most frequently in a set of documents. Specific phrases can often provide higher
precision in the retrieved results, for example through the use of a company name, product
name, proper name, etc. Also of note, the user needs a certain level of awareness of how the
language of the query, controlled or natural, affects the search in order to obtain pertinent
results.
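The idea of extracting frequently occurring phrases can be sketched in Python as below. Note
that this is a deliberately simplified stand-in that counts fixed-length word sequences
appearing in several documents; it is not the MFS algorithm itself, and the function name,
thresholds and sample sentences are assumptions for the example.

```python
import re
from collections import Counter


def frequent_phrases(documents, n=2, min_docs=2):
    """Return word n-grams that occur in at least min_docs documents."""
    doc_counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Use a set so each phrase is counted once per document.
        ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(ngrams)
    return {p: count for p, count in doc_counts.items() if count >= min_docs}


stories = [
    "Acme Widgets opened a plant.",
    "Shares of Acme Widgets rose.",
    "The plant employs 200 people.",
]
phrases = frequent_phrases(stories, n=2, min_docs=2)
```

Only the company name survives the threshold here, which reflects the point made above: a
specific phrase is a sharper search key than either of its words alone.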

As the set of documents grows, patterns begin to emerge within the mined text, and eventually
the number of patterns grows as well. More successful cases of text mining are in areas with
highly controlled language, such as biotechnology, competitive intelligence, and consumer
product development. Table 1 summarizes some of the main uses and the technologies used in text
mining in the principal industries.


Table 1: Applications of text mining in various industries (Source: Fan, Wallace, et al., 2005)

As previously mentioned, natural language techniques were applied to text mining in an attempt
to represent the user and the document in the search process and so aid the search method. In
addition, searching with NLP can also classify documents together by discovering multiple point
relationships between terms and phrases. NLP is, however, prone to error, particularly in
environments with a range of topics and styles. Even with this in mind, beneficial connections
can be made between previously unrelated items.


Research Model Using the CRISP-DM

CRISP-DM was developed in late 1996 by three main players of the then young data mining market:
DaimlerChrysler (then Daimler-Benz), SPSS (then ISL), and NCR. DaimlerChrysler had already had
some experience in applying data mining in its business operations. SPSS had been providing
services related to data mining since 1990 and also launched the first commercial data mining
workbench in 1994. NCR had the Teradata data warehouse and had teams of data mining consultants
and technology specialists to service its clients’ requirements (CRISP-DM, 2000). Over the years
the model has been developed and adapted for various data mining purposes. We found that this
particular model would be an effective method to apply to our case study.

The CRISP-DM methodological model consists of the following phases (Sullivan, 2004):

Business understanding

The clients’ point of view is considered at the first stage, identifying the requirements and
objectives of the selected applications. Problems and restrictions of each application are
identified and examined as well. This phase also incorporates a description of the client
background, the business objectives, and a description of the criteria used to measure the
success of the study.

Data understanding

All relevant information is identified to carry out the application, and also to develop an
initial gauge of the applications’ content, quality and utility. This initial collection of data
assists the analyst in discovering the particulars of the individual programs. As well, problems
related to the format and values of the applications are looked at at this point. The manner in
which data was collected, including the different sources, meaning, volumes, reading procedure,
etc., will also be of interest. These can also give an indication of the quality of the data.

Data preparation

In this stage, the final data set from which the model will be created and validated is
reviewed. Tools for data extraction, cleaning, and transformation are applied to data
preparation. Combinations of tables, format changes and aggregations of values are carried out
to satisfy the input requirements of the particular learning algorithm.

Modeling

Data mining techniques are selected and applied at this stage, according to the objectives as
defined in the first step of the model. The core phase of KDD (Knowledge Discovery and Data
Mining) is modeling, which corresponds to the choice of the technique, its parameterization,
and its execution over a defined training data set. As well, other models can be created in
this phase if required.


Evaluation

A review of the previous steps is made in order to verify the results against the objectives as
defined in the business understanding phase. The next tasks to be performed are also defined
here. Depending on the results, route corrections may be defined, which correspond to a return
to one of the already performed phases using other parameters, or to a search for additional
data.


Deployment

This phase sets the necessary actions to make the acquired knowledge available to the
organization. A final report is generated to explain the results and the experiences useful to
the client business.


Figure 2: Phases of the CRISP-DM Process Model (Source: CRISP-DM, 2000)

The CRISP-DM process is more of a cycle in which the sequence of the phases is not necessarily
strict. Moving back and forth between different phases is essential. The sequence of the tasks
is dependent on the outcome of each phase. The arrows (see Figure 2) indicate the most
important and frequent dependencies between phases. The outer circle in the figure indicates
that the process is cyclic in nature.

The mining process continues after a solution has been deployed. The lessons learned during the
practice can generate new, often more focused business questions. Subsequent mining processes
will benefit from the experiences of previous ones, with discovery and examination of
successful results.


In our study for the Tri-State Times we decided to use the CRISP model as well. It is a popular
and proven solution, and many business solutions have depended on it. In a study analyzing the
transcripts of meetings, Chibelushi, Sharp, and Salter (n.d.) recorded a set of meetings and
transcribed them for further processing. These transcripts were manually edited to prepare for
the modeling phase and then further analyzed to track the themes discussed and extract the key
issues and any associated actions, as well as to identify the initiator. The approach combined
statistical natural language processing and semantic analysis of the transcripts. do Prado
(2004) applies the CRISP-DM methodological approach to a case study in order to look into the
relationships between the text and the defined grouping of news items.

The Tri-State Times Text Mining Project

The student newspaper, the Tri-State Times, is researching the use of text mining software to
assist in finding news articles, columns, and editorials that are similar to selected news
items. In addition, the software should be able to investigate the primary terms and phrases in
the selected article so that similarities to other articles can be identified.

The members of the Tri-State Times have approached the School of Library Information Science
(SLIS) to assist in researching the text mining software that would be appropriate for their
use. However, due to a lack of funds, they would like to use a system that can be purchased for
a minimal cost or one that is available for free. In addition, the upkeep of the system should
be easy and should not incur any major additional costs.

The members of the research team at SLIS will conduct a study in which they will evaluate
different text mining software that could satisfy the needs of the Tri-State Times. Once the
search has been narrowed down, they will conduct tests based on the tasks that are needed by
the newspaper. In addition to comparing the ease of use of the various systems, a survey will
be conducted with a sample of the users to determine the functionality of the system and what
the users feel about the system and the resulting data sets. A combination of these factors
will be used to determine the best text mining solution for the newspaper.

Step 1: Business Understanding

During this initial phase, it is necessary to focus on understanding the project objectives and
requirements from the organization’s perspective, and then to convert this knowledge into a
data mining problem definition and a preliminary plan designed to achieve the objectives.

The student newspaper at Tri-State University is short-staffed. The editorial team has
consulted the SLIS to see if it can come up with a solution to help them
archive important articles. The objective of this project is to find a method for locating
articles, columns, and editorials similar to the stories that the editorial team selects. To achieve
this, the team surveys and selects major news articles and then looks for similar articles covering
the same news story in other major newspapers. One can look through most of the major news sources
manually, one at a time, or use a text mining model to assist in the process. The
text mining model will be able to identify the key terms and ideas in the articles. These terms can
also be used to look up similar articles. In addition, other articles can be identified and selected
from other news sources automatically with the use of certain text mining software. Finally,
the key terms and sentences can also be used to formulate a summary of the article. This summary
can be entered into a collection or database of summaries and accessed in the future for use in
writing editorials, opinions, and other articles.
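
As an illustration of what "identifying key terms" can mean in practice, the following is a minimal sketch that ranks the words of an article by raw frequency after removing common stopwords. It is not the method used by any of the tools examined in this paper; the function names, the small stopword list, and the minimum word length are assumptions made only for this example.

```python
import re
from collections import Counter

# A deliberately small, illustrative stopword list (an assumption for this
# sketch; real text mining tools rely on much larger lists).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is",
    "are", "was", "were", "that", "this", "with", "as", "by", "at", "it",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def key_terms(text, top_n=5):
    """Return the top_n most frequent non-stopword terms with their counts."""
    counts = Counter(
        token for token in tokenize(text)
        if token not in STOPWORDS and len(token) > 2
    )
    return counts.most_common(top_n)


article = (
    "The council approved the budget. The budget funds the library, "
    "and the library budget grows."
)
print(key_terms(article, top_n=2))
```

Terms extracted this way could then serve as search terms when looking for coverage of the same story in other news sources.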



Step 2

Data Understanding

Data understanding begins with an initial collection of news stories from various national
news websites, e.g. CNN, Fox News, Yahoo News, etc. Next, local stories of interest are
selected from local and regional newspapers. To become familiar with the data, the main
stories are identified from one or two of these news sources. A major data quality issue may be
the availability of more than one version of a news article at various times, especially
on the Internet. Another issue is the amount of time spent identifying the main news article,
and whether this becomes unwieldy. Once the editorial team selects the news articles (the dataset),
a member of the staff should be able to enter them into the system and generate the output: key
terms and phrases, a summary of the article, and possible matches of similar articles.
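
The data quality issue of multiple versions of one article could, for example, be handled by flagging near-duplicate texts before they enter the dataset. The sketch below compares two texts by the Jaccard similarity of their three-word sequences (shingles). The shingle size, the 0.5 threshold, and all function names are assumptions chosen for illustration, not part of the project as carried out.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word sequences)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Texts shorter than k words are treated as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_same_story_version(text_a, text_b, threshold=0.5, k=3):
    """Flag two texts as versions of one story if their shingles overlap enough."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


early = "The city council approved the new budget on Monday night"
later = "The city council approved the new budget on Monday night after debate"
print(is_same_story_version(early, later))
```

A staff member could keep only the most recent of any pair flagged this way, so that the same story is not counted twice.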

Step 3

Data Preparation

The data preparation phase covers all activities needed to construct the final dataset (the data
that will be fed into the modeling tool(s)) from the initial raw data. In this case, the raw data
includes all the news articles that the editorial team deemed worth archiving during the first
selection. The final dataset includes the articles that have been selected from the raw data and
are considered more important than the rest. During this phase, one or two versions of the main
stories of the day are chosen from the news sources. These news articles are converted
to text versions so that they can be entered without the images. These text versions of the news
stories make up the final dataset that is ready to be entered into the text mining system.
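
The conversion of a web article to a text-only version can be sketched with Python's standard html.parser module, as below. This is only one possible way to perform the step; the class and function names are hypothetical, and the sample markup is invented for the example.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content.

    Image tags carry no text of their own, so they drop out naturally.
    """

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    "<html><body><h1>Budget approved</h1>"
    '<img src="council.jpg" alt="Council">'
    "<p>The council voted on Monday.</p></body></html>"
)
print(html_to_text(page))
```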

Step 4

Modeling


During this phase, various modeling techniques are selected and applied. The SLIS team looked
at various text mining tools in light of the needs of the Tri-State Times. Text mining tools
vary in capability and price. Major vendors offer text mining tools that cost
thousands of dollars. Some of the major text mining tools are compared below along
with their major features:



Table: Text mining technologies offered by commercial vendors (Fan, Wallace, et al., 2005)

Although most of the above systems offered all the functionality that the Tri-State Times
needed, these tools were beyond the budget of the student newspaper. In addition,
learning and applying these systems would require additional effort. Hence the SLIS team had to
look for systems that were simpler to use and were less expensive or offered for free on the
Internet, so that the initial costs would be minimal. The SLIS team narrowed down the
search to three text mining models: 1) Termine, 2) Textalyzer, and 3) Topicalizer.

The SLIS team took a sample set of chosen articles from the final data set and entered
them into the respective models. The output from these systems was then tabulated and
compared. In some cases, the outputs were not comparable. In these cases, another set of data
was taken and re-entered into the systems. The outputs for these articles were then evaluated.

Step 5

Evaluation


During this stage the SLIS team evaluated the results that were generated from the three selected
models of text mining. The results were compared on the basis of three criteria: 1) terms and
phrases selected, 2) summary generated, and 3) generated list of similar articles.

The key terms that the systems generated were evaluated in terms of whether the terms
selected could be used to look for other related articles from other news sources. The summaries
generated were compared to see if the models were able to generate a concise summary that
could be used in the future for editorials and opinions. In addition, the summaries were evaluated
to see if the system was able to connect ideas, or just extract sentences from the input text.
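
To make the distinction concrete, a purely extractive summarizer only selects sentences that already exist in the input. The sketch below scores each sentence by the average frequency of its content words and keeps the highest-scoring ones in their original order, loosely in the spirit of Luhn (1958). It is an assumption-laden illustration and does not describe how the evaluated systems actually build their summaries; the stopword list and function names are invented for the example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption for this sketch).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is",
    "are", "was", "were", "that", "this", "with", "as", "by", "at", "it",
}


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part.strip() for part in parts if part.strip()]


def content_words(sentence):
    """Return the lowercase non-stopword words of a sentence."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]


def summarize(text, n_sentences=2):
    """Return the n_sentences highest-scoring sentences in original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)


story = (
    "The council approved the library budget. Rain is expected tomorrow. "
    "The library budget funds new books."
)
print(summarize(story, n_sentences=2))
```

Because every sentence in the output is copied from the input, a summary of this kind cannot connect ideas across sentences, which is the limitation the evaluators were asked to look for.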

Finally, the sets of articles that the systems generated as similar to the input
were also evaluated, along with the number of articles in the output and the variety of sources.
These outputs were evaluated by a team of eleven participants who looked at the
three criteria, 1) terms and phrases selected, 2) summary generated, and 3) generated list of
similar articles, for each of the three systems. In addition, these users also evaluated the
systems on ease of use and available functionality.
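
One common way to produce a ranked list of similar articles is to represent each text as a vector of term counts and compare vectors by cosine similarity. The sketch below shows this idea under stated assumptions: it is not the ranking method of Termine, Textalyzer, or Topicalizer, and the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(query, candidates):
    """Rank candidate articles (label -> text) by similarity to the query."""
    query_vector = term_vector(query)
    scored = [(label, cosine(query_vector, term_vector(text)))
              for label, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


selected = "council approves library budget"
others = {
    "source_a": "library budget approved by council",
    "source_b": "rain expected tomorrow",
}
for label, score in rank_similar(selected, others):
    print(label, round(score, 3))
```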

The users tested the three software models with a few news articles that they input into the
systems. The output was generated and compared for each of the systems. In addition, a short
survey was taken to evaluate their user experience. (See Appendix A for the survey.)
Overall, the users were satisfied with all three models of text mining that were evaluated.
The fact that these systems were available on the Internet at no cost was attractive to the
users, who were aware of the financial constraints. In addition, the ease of use of all three
systems was a feature that the participants liked. All three models tested did provide keywords
and phrases that were helpful in locating other news articles of interest to the users. But due
to the additional features of one of the systems, including the summary and the links to other
articles, most of the participants preferred that system to the other ones tested.

Topicalizer is a service which automatically analyses a document specified by a URL or a
plain text regarding its word, phrase and text structure. It provides functional information on a
given text including the following: word, sentence and paragraph count, collocations, syllable
structure, lexical density, keywords, readability and a short abstract on what the given text is
about (Topicalizer, n.d.).

The results of the study were summarized as follows:

Based on the information collected from the users, there was a 72% agreement rate on the
resulting sets of information. Interestingly, this closely matched the figures reported in previous
studies measuring the frequency of agreement between expert and non-expert semantic taggers
(Stevenson, 2003, p. 50). Often, the areas of disagreement involved topics that were more subject
to multiple interpretation (the user was not sure of context or relevance).
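
The paper does not record how the agreement rate was calculated. One simple possibility, shown here only as a sketch, is pairwise percent agreement: for every item, count how many pairs of participants gave the same judgment, and divide by the total number of pairs. The function name, the data layout, and the sample judgments are all assumptions made for illustration and are not the study's data.

```python
from itertools import combinations


def percent_agreement(ratings):
    """Share of rater pairs, over all items, that gave the same judgment.

    ratings is a list with one entry per item; each entry is the list of
    judgments that the raters gave for that item.
    """
    agreeing_pairs = 0
    total_pairs = 0
    for item_judgments in ratings:
        for first, second in combinations(item_judgments, 2):
            total_pairs += 1
            if first == second:
                agreeing_pairs += 1
    return agreeing_pairs / total_pairs if total_pairs else 0.0


# Hypothetical judgments from three raters on two items.
judgments = [
    ["relevant", "relevant", "relevant"],
    ["relevant", "irrelevant", "relevant"],
]
print(f"{percent_agreement(judgments):.0%}")
```

A chance-corrected statistic such as Cohen's or Fleiss' kappa would be a stricter alternative, since raw percent agreement does not account for agreement that would occur by chance.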

Step 6

Deployment


Creation of the model is generally not the end of the project. Once the testers chose
the system of choice for the Tri-State Times and the trial runs were complete, the results were
organized and stored so that they can be retrieved as needed by the student newspaper. A
structure for reporting was created with a consistent and understandable format for the users of
these results.


By using a selection of text mining software, our study found that differences exist in the manner
in which people relate to information, even from the same query. Apart from differences in the
particular algorithms and weighting schemas, a definitive difference was found in the decision
processes of different individuals presented with the same set of information, and also in the
comparison of satisfaction surveys. These differences ultimately prove to be subjective in nature,
though they contain information that could guide further improvements in the software.

We felt that although certain limitations can be found in text mining, there are also
opportunities for further research and study. There have been many benefits in the existing
applications of text mining, and we feel that there are many possibilities for improvement in the
current software. Our study was limited to a specific set of data and users, while future research
could utilize a much broader scope of application (larger sets of data, a wider array of topics,
a larger user base).



References

Abutridy, J. (2004). Driven Explanatory Text Mining: Beyond Keywords. Retrieved March 3, 2007,
from Universidad de Concepción, Departamento de Ingeniería Informática.

Chibelushi, C., Sharp, B., & Salter, A. (2004). A Text Mining Approach to Tracking Elements of
Decision Making: a pilot study. Retrieved March 5, 2007, from Staffordshire University, School
of Computing.

CRISP-DM (2000). Retrieved March 2, 2007.

do Prado, H. A., Moreira de Oliveira, J. P., Ferneda, E., Wives, L. K., Silva, E. M., & Loh, S.
(2004). Transforming Textual Patterns into Knowledge. In Raisinghani, M. (Ed.), Business
intelligence in the digital economy: opportunities, limitations, and risks (pp. 207-227). Hershey,
PA: Idea Group Publications.

Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the Association for
Computing Machinery, 223.

Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2005). Tapping Into the Power of Text Mining.
Retrieved March 10, 2007.

Feldman, R., & Sanger, J. (2006). The text mining handbook: advanced approaches in analyzing
unstructured data. New York: Cambridge University Press.

Haravu, L. J., & Neelameghan, A. (2003). Text Mining and Data Mining in Knowledge
Organization and Discovery: The Making of Knowledge-Based Products. Cataloging &
Classification Quarterly, (1/2), 97.

Hearst, M. (2003). What Is Text Mining? Retrieved March 12, 2007.

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and
Development, 2, 159.

Mironova, S. Y., Berry, M. W., Atchley, S., & Beck, M. (2004). Advancements in text mining
algorithms and software. In Kargupta, H. (Ed.), Data mining: next generation challenges and
future directions (pp. 425-436). Menlo Park, CA: MIT Press.

Natarajan, M. (2005, July). Role of Text Mining in Information Extraction and Information.
DESIDOC Bulletin of Information Technology, (4), 31.

Purple Insight (n.d.). Retrieved March 12, 2007.

Sharp, M. (2001). Text Mining. Seminar in Information Studies. Retrieved March 2, 2007.

Smalheiser, N. R., & Swanson, D. R. (1998). Using ARROWSMITH: A Computer-Assisted
Approach to Formulating and Assessing Scientific Hypotheses. Computer Methods and
Programs in Biomedicine, (3), 149.

SPSS (2005). Improve Business Results with Text Mining. Retrieved March 1, 2007.

Stevenson, M. (2003). Word Sense Disambiguation. Stanford, CA: Center for the Study of
Language and Information.

Sullivan, D. (2004). Text Mining in Business Intelligence. In Raisinghani, M. (Ed.),
Business intelligence in the digital economy: opportunities, limitations, and risks (pp. 98).
Hershey, PA: Idea Group Publications.

Swanson, D. R. (1988_1). Historical note: Information retrieval and the future of an illusion.
Journal of the American Society for Information Science, 39, 92.

Swanson, D. R. (1988_2). Migraine and Magnesium: Eleven neglected connections. Perspectives
in Biology and Medicine, 31, 526.

Termine (n.d.). Retrieved February 22, 2007.

Textalyzer (n.d.). Retrieved February 22, 2007.

Topicalizer (n.d.). Retrieved February 22, 2007.

Venere, E. (2004). 'Knowledge discovery' C… Creation of N… Purdue News Service.
Retrieved March 3, 2007.

Yeates, S. (2002). Text Mining. Retrieved March 2, 2007.




Appendix A

Questionnaire for Participants

(Note: One form for each system)

Please feel free to add or remove success viewpoints as appropriate.

Estimate the level of success of the query, using this response scale.


very satisfied


very dissatisfied

1. __ The generated summary was relevant to the query.

2. __ The terms and phrases of the query were found to give relevant results.

3. __ The results gave too many variations to similar articles.

4. __ The system was easy to use and understand.

5. __ The system provided simple and ample methods of search mechanisms.


Identify both the successful and unsuccessful elements found in the resulting information. Were
these useful for further study, or were they irrelevant?


Estimate the level of satisfaction of the results, using this response scale.


very satisfied



very dissatisfied

1. __ The resulting data sets were sufficient for our purpose of use and study.

2. __ The extracted information led to a discovery of new knowledge.

3. __

4. __

5. __


Rate the following characteristics for the environment for the project being reviewed. Use a scale
of 1 to 5, where 1 is the lowest rating and 5 is the highest. If the item does not apply, mark an X.

__ ease of software use

__ software quality

__ analysis capability

__ design capability

__ appropriateness of technology to query

__ effective use of data configuration

__ quality assurance of data

__ clarity of source