Information Science paper on Data Mining (Mar ... - Personal.kent.edu






Text Mining



An Examination of Text Mining Software

for Tri-State Times










Sujan Manandhar

Virginia Dressler

Information Science

Dr. J. Holmes

Mar 19, 2007



Introduction


With an estimated 80 percent or more of Web content in unedited, unstructured text format
(Haravu & Neelameghan, 2003, p. 100), the need for effective, relevant information retrieval
methods is increasingly evident, particularly where massive amounts of data are involved. The
problem can be seen in queries to commercial search engines, which often yield high recall and
low precision. Commercial search engines also rely on indexing and ranking mechanisms that are
hidden from the user and that greatly affect the result of any search through the ordering of
results and the selection of terms. The greater the number of recalled documents, the more time
and energy the user must spend sorting meaningful information from the irrelevant. As a result,
the demand for quality information retrieval methods has made the topic a field of interest in
software development.

Text mining is in a relatively early phase and, as such, is still quite limited in application
and use, yet it often yields beneficial results. In this paper, the authors look into text mining
as a solution to one of the problems of information retrieval. Text mining can be defined as the
discovery of new, previously unknown information by automatically extracting information from
processed text sources. In addition, text mining is capable of linking extracted information
together to form new facts or new hypotheses to be explored further by more conventional means
of experimentation (Hearst, 2003). Text mining can also be used to uncover a “narrative in an
unstructured mass of text” (Haravu & Neelameghan, p. 103) and to show how a particular
environment is evolving (a defined market or business, for example).

Concepts in different texts can be linked through a number of natural language processing
resources, such as a corresponding thesaurus, a glossary, pre-structured subject
representations, or a schedule of a classification scheme. Depending on the application, these
are either preformatted components of a software package or can be manipulated by the end user.
For this study, the authors chose relatively basic applications of text mining, to allow study
of the results of a query with respect to the decisions and considerations of a test group.


The authors conducted a study in which they evaluated different text mining software for the
information organization needs of the Tri-State University newspaper. This study compares
various text mining systems on their functionality and ease of use, as well as on their fit with
the small budget and skill set of the users.


Problem Statement


When different text mining software is used, the same set of documents will yield different
results for the same query. Differences in the algorithms and weighting schemas are the main
influence on the retrieved data, though the assessment and evaluation of the end user should
also be considered an influencing factor in the overall effectiveness and relevance of the
retrieved information. Additionally, a comparison of the results will examine the varying
degrees of satisfaction among users, which indicate a strong degree of subjectivity and
differences in each user’s relation to information and language.



Extent of the study


This study looks at only a small aspect of some simple applications to provide a cursory example
of the limitations and benefits of text mining. By using a small set of documents from a single
source, the authors hope to limit the scope enough to examine the effectiveness of retrieved
information.


We believe that future research will be needed to delve further into the topic with larger case
studies and to examine how text mining can be implemented in settings and uses other than
medical, technical and business. A review and study of the major applications of text mining
would also be useful at this point in time. Since this is an early stage for the applications of
text mining, we feel that it is also important to observe the current limitations and gain
information from existing applications. The possibility of creating links between increasingly
massive banks of textual information will be a major topic of study as our dependence on and
investment in technology increase.


Background


Text mining can mean different things depending on the method and purpose. Sharp (2001) aptly
says, “text mining per se is new and is still defining itself.” Yeates (2002) says that text
mining discovers patterns in natural language text, and is the process of analyzing text to
extract information from it for particular purposes. Natarajan (2005) describes text mining as
intelligent text analysis that discovers unknown links and relationships between sets of
documents, even perhaps between non-trivial information. Whatever the context or definition,
the computational process involved in text mining is fundamental to the detection of patterns.

The applications of text mining date to the mid-1990s, with IBM’s release of the first software
package, Intelligent Text Miner, in 1998. The recognition that information is a raw material for
an organization has made this aspect of information retrieval important to software designers
(do Prado, Oliveira, et al., 2004, p. 225). As well, the considerations of the user have become
a larger part of software design, particularly with natural language processing capabilities.
This is particularly effective in situations where there is a lack of controlled (or structured)
data, such as Web content, e-mails, and other informal documents. Even simple information
retrieval is often complicated by factors inherent to unstructured data, such as variations in
spelling.

Natarajan (2005) cites five requirements for high quality text mining. The desirable conditions
and factors for information retrieval include: information comprehensiveness, quality of
knowledge base, high-quality method of information retrieval, techniques and protocols of
information extraction (implementation of internal thesaurus, etc.) and technical expertise (p.
33). Interestingly, classification and cataloguing methods of the library have often been used
for reference in many of the various natural language processing applications.


Literature Review


Text mining is still a relatively new area, and most of the literature on the subject is geared
towards practical application (business, financial, medical, etc.). But as far back as the
1950s, Luhn (1958), in a seminal paper on automatic abstracting, pointed out "the resolving
power of significant words" in primary text. Doyle (1961) hints at early text mining and says
"natural characterization and organization of information can come from analysis of frequencies
and distributions of words in libraries." Swanson (1988_1), a major force behind text mining,
examined scientific literature as a natural phenomenon worthy of "exploration, correlation, and
synthesis."


Sullivan (2004) discusses text mining within a business model, outlining the constructs of such
software and the impact and benefits of text mining in a practical scenario. Sullivan also
mentions the limitations, as of 2004, of NLP processing and its error rates. He also concludes
that pre-existing (preformatted) processing schemas and components in many text mining
applications are inappropriate for most queries. More effective are the methods that allow the
user to identify relations and categories within a sample set of documents (“supervised
learning”, p. 102). In turn, algorithms would be created from these decisions and choices of
related documents.

do Prado (2004) applies the CRISP-DM methodological approach to a case study. A set of 57,000
documents from a news agency was gathered in 2001 to study the relationships between the text
and the defined grouping of information (types of news). Clusters of text connected unrelated
items, indicating a method of knowledge discovery. Seven main schemas of clustering were found
in this group of documents, both hierarchical and non-hierarchical. A conceptual approach to the
search yielded better results (p. 224). Sharp (2001) looks at several examples of text mining
and natural language processing (NLP). He also traces the aspects of machine learning in text
mining and how this can play a pivotal role in its development. He is able to highlight some of
the main features of text mining along with some of the seminal works related to it.


Feldman and Sanger (2006) were among the first to encapsulate the whole topic of text mining in
a book that breaks down the core principles and examines probabilistic models. A selection of
existing applications in specific fields of interest is also studied, mainly business, technical
and medical situations.


Major Concepts in Text Mining


Before getting into the nuts and bolts of text mining, it is important to know some of the key
concepts that lead up to it. Expounding on all the concepts related to text mining is not
possible within the scope of this paper. The authors will try to explain some of the important
terms that this study comes across: natural language processing, knowledge discovery, and data
mining.


Knowledge Discovery


Knowledge discovery is the process of finding novel, interesting, and useful patterns in data.
Data mining is a subset of knowledge discovery. This method allows the data to suggest new
hypotheses to test (Purple Insight, n.d.). James M. Caruthers makes an interesting analogy and
says "Instead of mining for a nugget of gold, knowledge discovery is more like sifting through a
warehouse filled with small gears, levers, etc., none of which is particularly valuable by
itself. After appropriate assembly, however, a Rolex watch emerges from the disparate parts."
(Venere, 2004)

In his seminal work, Swanson (1988_2) proved that it was possible to discover new knowledge from
existing literature by linking the information present in complementary but disconnected
articles. Smalheiser and Swanson (1998) postulated a number of new biomedical hypotheses, which
were later verified by domain experts. They developed two approaches, known as Open and Closed
discovery, for generating new hypotheses. However, their research required substantial manual
input. Since then, a number of efforts have aimed to automate the discovery process.


Natural Language Processing


Natural language processing (NLP) is a major component of text mining. NLP is the branch of
linguistics which deals with computational models of language. Sharp (2001) says that NLP can
differentiate how words are used, for example through sentence parsing and part-of-speech
tagging, and in the process add discriminatory power to statistical text analysis. He says that
NLP could be a powerful tool for text mining.

Natural language has evolved to help humans communicate with one another and also to record
information. Computers are still incapable of understanding natural language, though attempts to
use language processing mechanisms to make meaningful and relevant connections between words
continue to be of interest and study. Humans are able to differentiate and apply linguistic
patterns to text, overcoming obstacles (such as slang, spelling variations, and contextual
meaning), while computers do not handle them easily. Meanwhile, although our language
capabilities allow us to comprehend unstructured data, we lack the computer's ability to process
text in large volumes at high speeds. The key to text mining is creating technology that
combines a human's linguistic capabilities with the speed and accuracy of a computer (Fan,
Wallace, et al. 2005).

The complete understanding of natural language text is difficult to attain. Text mining focuses
on extracting a small amount of information from text with high reliability (Yeates, 2002).
Natural language is ambiguous and the same keyword may express entirely different meanings, e.g.
“Washington” may be a person or a place. Such ambiguity is normally resolved through context.
The inverse problem is that different expressions may refer to the same meaning, e.g. “car” and
“automobile”. Given these two problems, the surface expression of the keywords alone can readily
be ruled out as a proper representation for text mining (Chibelushi, Sharp, & Salter, n.d.).

Moreover, because of the centrality of natural language text to its mission, text mining also
draws on advances made in other computer science disciplines concerned with the handling of
natural language. Perhaps most notably, text mining exploits techniques and methodologies from
the areas of information retrieval, information extraction, and corpus-based computational
linguistics (Feldman & Sanger, 2006).


One example of the application of a large-scale natural language processing database to
practical search methods is the WordNet project (Stevenson, 2003, p. 39). A cognitive
psychologist used research on the structure of the human mental lexicon and attempted to
construct a resource that would mirror this structure. A basic block of terms was created in the
research, and over half of these terms were found to have a rather large number of synonyms,
which were tiered into hierarchical chains. For example, the term ‘canary’ was related to about
20 other terms, the closest being ‘finch’ and the furthest (but still deemed to be related)
‘entity’ (p. 40). When this resource is applied in conjunction with text mining, all relevant
synonyms would in effect be used with a query. Again, there are a number of thesaurus and
dictionary programs, of varying quality, which would become a part of the natural language
processing. The quality of these programs would affect the search, as too many synonyms would
result in higher recall of similar terms that are less likely to be relevant.
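The effect of synonym expansion on a query can be illustrated with a minimal sketch. It does not
use WordNet itself: the `SYNONYMS` table, the sample documents, and the function names are
hypothetical stand-ins for a real thesaurus resource and document collection.

```python
# Hypothetical, hand-made synonym table standing in for a thesaurus such as WordNet.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
}

# Invented sample collection: document id -> text.
DOCS = {
    1: "the car was parked outside",
    2: "an automobile show opens today",
    3: "a military vehicle crossed the border",
    4: "the finch sang all morning",
}


def expand(term):
    """Return the term together with its listed synonyms."""
    return {term} | SYNONYMS.get(term, set())


def search(terms):
    """Return the ids of documents containing at least one of the terms."""
    return sorted(
        doc_id for doc_id, text in DOCS.items()
        if terms & set(text.split())
    )


plain = search({"car"})            # surface keyword only
expanded = search(expand("car"))   # keyword plus synonyms
```

In this toy collection the expanded query also retrieves document 3, which matches the broad
synonym "vehicle" but may not be what the user wanted. That is the trade-off between recall and
relevance described above.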

Technological advances are, however, beginning to close the gap between human and computer
languages. The field of natural language processing has produced technologies that teach
computers natural language, enabling them to analyze, understand, and even generate text (Fan,
Wallace, et al. 2005). As more time and research are put into this facet of text mining, the
possibilities for relevant, meaningful results will grow.





Data Mining


Text mining has its roots in data mining and consequently has many similar features. Like data
mining, text mining seeks to extract useful information from data sources through the
identification and exploration of interesting patterns. But unlike data mining, in text mining
the data sources are created from defined and processed document collections. Interesting
patterns are found not among formalized database records but in the unstructured textual data in
the documents in these collections (Feldman and Sanger, 2006). In addition, both text mining and
data mining systems have a similar high-level architecture, including preprocessing routines,
pattern-discovery algorithms, and presentation-layer elements such as visualization tools to
enhance the browsing of answer sets.


Text mining adopts many of the specific types of patterns in its core knowledge discovery
operations that were first introduced in data mining research (Feldman & Sanger, 2006). While
data mining mostly deals with structured data, text mining is designed to handle structured data
from databases or XML files, and can also handle unstructured or semi-structured data sets (such
as email, full-text documents, and HTML files). As a result, text mining is a better solution
for companies where huge volumes of diverse information must be merged and managed (Fan,
Wallace, et al. 2005).

To date, however, most research and development efforts have centered on data mining using
structured data. Because data mining deals with data that have already been stored in a
structured format, much of its preprocessing focus falls on two critical tasks: 1) scrubbing and
normalizing data and 2) creating extensive numbers of table joins. But in text mining,
preprocessing deals with the identification and extraction of representative features for
natural language documents. These preprocessing operations mainly transform unstructured data
stored in document collections into a more explicitly structured intermediate format (Feldman &
Sanger, 2006).
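One simple form such a structured intermediate format can take is a table of term counts, with
one row per document. The sketch below is an illustration under that assumption; the function
names and sample sentences are invented, and real preprocessing would involve far more cleaning.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def to_term_matrix(documents):
    """Turn raw texts into (vocabulary, rows); each row holds term counts."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Invented sample documents.
docs = ["Budget cuts hit the library.", "The library budget grows."]
vocabulary, rows = to_term_matrix(docs)
```

Each document becomes a fixed-length row that pattern-discovery algorithms can work on, much
like a database record.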



Process of Text Mining


Text mining can be examined in three stages: text preparation, text processing and text
analysis. A key element of text mining is its focus on the document collection. At its simplest,
a document collection can be any grouping of text-based documents. Practically speaking,
however, most text mining solutions are aimed at discovering patterns across very large document
collections. Another critical element in text mining is the document. For practical purposes, a
document can be very informally defined as a unit of discrete textual data within a collection
that usually, but not necessarily, correlates with some real-world document such as a business
report, legal memorandum, e-mail, research paper, manuscript, article, press release, or news
story (Feldman & Sanger, 2006).

During the initial stage, text preparation, a selected set of documents is input into text
mining software that cleans and preprocesses the data. The text processing stage is where the
user enters the picture and enters an information query into the program. An algorithm is
applied to the processed data, which clusters the data to find meaningful patterns. Text
analysis evaluates and determines the relevance of the mined text and renders it into a more
tangible form (Natarajan, 2005).

After text files are input to a system, text mining software produces a semantic network of key
concepts and terms in each file, defined by a weighting algorithm. This algorithm is used to
find meaningful patterns in data by frequency of terms. Phrasal analysis is also available in
addition to the term searches. This method has often proved to be the most effective method of
text mining, as noun phrases such as company names, personal names, locations, or case names
would often be more useful than single term searches. Subject headings and content descriptors
are frequently derived from library classification schemes. The user can enter a query and
receive a compilation of relevant data pulled from documents in the format of an XML or HTML
file, or even a comma-separated file. These forms of the results are often highly visual or
graphical, and aim to produce an organizational knowledge map. The purpose of this compilation
would be to discover new knowledge or information based on similar concepts or to find a
narrative in an unstructured set of documents.

An understanding of the specific mapping of the selected group of information is important to
the information retriever. There should be a certain level of awareness of the classification,
clustering and categorization schemes of platform servers, network servers, database and
workgroup applications to effectively use text mining software. The quality of the information
that is mined is largely dependent on a certain level of organization within the originating
knowledge base. As well, the results are reviewed by the individual to assign relevance and
value to the information.



Figure 1 - Text Mining Process (Adapted from Fan, Wallace, et al. 2005)






Technologies in Text Mining


Several techniques in text mining technology are utilized across different applications. These
include information extraction, topic tracking, summarization, categorization, clustering,
concept linkage, information visualization, and question answering (Fan, Wallace, et al.). In
this section we review some of the technologies that we used in evaluating the text mining
software for our research, the Tri-State project.


Information extraction - This technology is a popular method for analyzing unstructured text and
identifying key phrases and relationships within text. It looks for predefined sequences in the
text by pattern matching. The technology is especially useful when dealing with large volumes of
text (Fan, Wallace, et al. 2005).
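Pattern matching for a predefined sequence can be sketched with a regular expression. The
pattern below is only an illustration: the "person, role of organization" template, the role
words, and the sample sentence (with invented names) are assumptions, not taken from any
particular extraction tool.

```python
import re

# Illustrative template: "<First Last>, <role> of <Capitalized Organization>".
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), "
    r"(?P<role>editor|reporter) of "
    r"(?P<org>[A-Z][\w-]*(?: [A-Z][\w-]*)*)"
)


def extract(text):
    """Return (person, role, organization) triples found in the text."""
    return [(m["person"], m["role"], m["org"]) for m in PATTERN.finditer(text)]


# Invented sample text.
sample = (
    "Jane Smith, editor of Tri-State Times, resigned on Monday. "
    "John Doe, reporter of Campus Weekly, will replace her."
)
triples = extract(sample)
```

A fixed template like this is precise but brittle: any phrasing that departs from it is missed,
which is why practical systems combine many patterns.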

Summarization - Text summarization helps users figure out whether a lengthy document meets their
needs and is worth reading. It is important to reduce the length and detail of a document while
retaining its main points and overall meaning. Sentence extraction mines important sentences
from a given text by statistically weighting all the sentences in the text. Other heuristics,
such as position information and extracting sentences that follow a key phrase like "in
conclusion", are also used in summarization. Headings and other markers of subtopics are
searched in order to identify the document's key points (Fan, Wallace, et al. 2005).
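A rough sketch of sentence extraction follows. The scoring scheme (average word frequency plus
fixed bonuses for the first sentence and for a cue phrase) is an assumption chosen for
illustration, as are the stopword list, the bonus values, and the sample text.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "was", "it", "on", "for"}
CUE_PHRASES = ("in conclusion",)


def content_tokens(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_tokens(text))

    def score(index):
        tokens = content_tokens(sentences[index])
        value = sum(freq[w] for w in tokens) / (len(tokens) or 1)
        if index == 0:
            value += 1.0  # position heuristic: favor the opening sentence
        if sentences[index].lower().startswith(CUE_PHRASES):
            value += 1.0  # cue-phrase heuristic
        return value

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


# Invented sample text.
text = (
    "The council approved the library budget. Students waited outside. "
    "Weather was mild. In conclusion, the library budget passed."
)
summary = summarize(text)
```

Because the summary reuses whole sentences, it keeps the author's wording but cannot merge or
rephrase points.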

Topic tracking - A topic-tracking system keeps user profiles and, based on the documents a user
views, forecasts other documents of interest to the user. This allows users to choose keywords
and notifies them when news relating to the topics becomes available. Some tools let users
select particular categories of interest and infer users' interests based on their reading
histories and click-through information they've left behind online (Fan, Wallace, et al. 2005).
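The profile-and-notify idea can be sketched as below. This is a deliberately simple keyword
matcher; the `TopicTracker` class, its method names, and the sample users are hypothetical, and
real systems would use richer models of interest than shared words.

```python
import re


class TopicTracker:
    """Keeps per-user keyword profiles and flags matching new documents."""

    def __init__(self):
        self.profiles = {}  # user -> set of lowercase keywords

    def subscribe(self, user, *keywords):
        """Record keywords the user chose explicitly."""
        self.profiles.setdefault(user, set()).update(k.lower() for k in keywords)

    def record_view(self, user, document_keywords):
        """Treat the keywords of a viewed document as implied interests."""
        self.subscribe(user, *document_keywords)

    def notify(self, text):
        """Return the users whose profile shares a word with the new text."""
        words = set(re.findall(r"[a-z]+", text.lower()))
        return sorted(u for u, kws in self.profiles.items() if kws & words)


tracker = TopicTracker()
tracker.subscribe("ana", "tuition")                   # explicit choice
tracker.record_view("ben", ["football", "stadium"])   # inferred from reading
alerts = tracker.notify("New stadium plans announced")
```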


Term weighting and association rules are also common in text mining. In the term weighting
technique, document representation is done by removing function words (e.g. conjunctions,
prepositions, pronouns, adverbs, etc.) and then assigning weights to content words (e.g. agent,
decision making), in order to describe how important the word is for that particular document or
document collection. This is because some words carry more meaning than others (Chibelushi,
Sharp, and Salter, n.d.). Association rules have made their way from data mining to text mining.
An association rule is a probabilistic statement about the co-occurrence of certain events in a
database or large collection of texts (Ibid.).
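Both ideas can be sketched in a few lines. The weighting shown is TF-IDF, one common scheme that
the sources cited here do not specifically name, and the rule measure is the usual confidence
estimate; the function-word list, function names, and sample sentences are illustrative
assumptions.

```python
import math
import re
from collections import Counter

# Small illustrative list of function words; a real list would be longer.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is", "it", "by"}


def content_words(text):
    """Lowercase tokens with function words removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in FUNCTION_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document (term frequency x log idf)."""
    token_lists = [content_words(d) for d in documents]
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return weights


def rule_confidence(documents, antecedent, consequent):
    """Estimate P(consequent in document | antecedent in document)."""
    sets = [set(content_words(d)) for d in documents]
    with_antecedent = [s for s in sets if antecedent in s]
    if not with_antecedent:
        return 0.0
    return sum(consequent in s for s in with_antecedent) / len(with_antecedent)


# Invented sample documents.
docs = ["The agent made a decision.", "The agent waited.", "A decision by the board."]
weights = tf_idf(docs)
confidence = rule_confidence(docs, "agent", "decision")
```

In this sample, "made" receives a higher weight in the first document than "agent" because it
appears in no other document, and the rule "agent implies decision" holds in one of the two
documents that mention "agent".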


Purpose of Text Mining


The purpose of text mining is to make meaningful connections between unstructured text data. As
we have discussed, the three main stages in text mining are data preparation, data processing
and data analysis. Depending on the software or sites reviewed (or the guidance of the human
counterpart), data preparation consists of formatting the data during the selection and
preprocessing stage. One issue in the result of mined text is the lack of a hierarchy in the
display of indexes or clustered information.

A data-mining algorithm is then applied to the preprocessed data. Sentences and paragraphs are
identified and parts of speech in a set of documents are tagged at this point. Natural language
processing would provide conceptual relations between entities and perhaps make links between
certain chunks of information (the people, associated companies, etc.). Text analysis, the
evaluation of the output, is the more subjective aspect of the process. After being run through
certain algorithms, the resulting text is subjected to further processing (Link Discovery tool,
or other).

Rudimentary term extraction is the most basic form of text mining; it weights lists of terms
from a set of texts into a feature vector. A search of any scale would in effect measure the
similarities between documents by their feature vectors. In some text mining software, the user
or systems administrator would take a sample group of documents and create certain rules on
terms, which the software translates into an algorithm (this has been referred to as “supervised
learning”, Sullivan, p. 102, as mentioned earlier). Alternatively, other software is set with
preexisting classification schemas that weight phrases and terms (or “unsupervised learning”,
Sullivan, p. 102).
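Measuring similarity between feature vectors is often done with cosine similarity; the sketch
below assumes that measure and raw term counts as features, neither of which is specified by the
sources above. The function names and sample headlines are invented.

```python
import math
import re
from collections import Counter


def feature_vector(text):
    """Represent a text as a mapping from term to raw count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the candidate text whose vector is closest to the query's."""
    q = feature_vector(query)
    return max(candidates, key=lambda c: cosine_similarity(q, feature_vector(c)))


# Invented sample headlines.
best = most_similar(
    "campus budget vote",
    ["football game tonight", "student budget vote delayed"],
)
```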

The underlying notion in text mining is that frequency of term occurrence equates to relevance.
Perhaps more useful is Maximum frequent sequences (MFS), which is a method of extracting phrases
that occur the most frequently in a set of documents. Specific phrases can often provide higher
precision in retrieval, for example by the use of company name, product name, proper name, etc.
Also of note, a certain level of awareness on the part of the user of the effect of the language
of the query, controlled or natural, on the search is vital to pertinent results.
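The idea of extracting frequent phrases can be approximated by counting word n-grams across
documents. This is a much simpler technique than the MFS method itself, which searches for
maximal sequences of any length; the function name, parameters, and sample sentences here are
illustrative assumptions.

```python
import re
from collections import Counter


def frequent_phrases(documents, n=2, min_docs=2):
    """Return (phrase, document count) for word n-grams in at least `min_docs` documents."""
    doc_counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Use a set so each phrase is counted once per document.
        ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(ngrams)
    return sorted(
        (" ".join(gram), count)
        for gram, count in doc_counts.items()
        if count >= min_docs
    )


# Invented sample documents.
docs = [
    "The student senate raised fees",
    "Fees worry the student senate",
    "The senate met today",
]
phrases = frequent_phrases(docs)
```

Alongside the useful phrase "student senate", the sample also surfaces "the student", which
shows why function words are usually filtered before phrase counting.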

As a set of documents grows, patterns begin to emerge within the mined text, and the number of
patterns eventually grows as well. More successful cases of text mining are in areas with highly
controlled language, such as biotechnology, competitive intelligence, and consumer product
development. Table 1 summarizes some of the main uses and the technologies used in text mining
in the principal industries.


Table 1 - Applications of text mining in various industries (Source: Fan, Wallace, et al. 2005)

As previously mentioned, natural language techniques were applied to text mining in an attempt
to represent the user and the document in the search process and so aid the search method. In
addition, searching with NLP can also classify documents together by discovering multiple-point
relationships between terms and phrases. NLP is, however, prone to error, particularly in
environments with a range of topics and styles. Even with this in mind, there can be beneficial
connections to be made between previously unrelated items.


Research Model Using the CRISP-DM


CRISP-DM was developed in late 1996 by three main players of the then young data-mining market:
DaimlerChrysler (then Daimler-Benz), SPSS (then ISL), and NCR. DaimlerChrysler had already had
some experience in applying data mining in its business operations. SPSS had been providing
services related to data mining since 1990 and also launched the first commercial data mining
workbench, Clementine, in 1994. NCR had the Teradata data warehouse and teams of data mining
consultants and technology specialists to service its clients’ requirements (CRISP-DM, 2000).
Over the years the model has been developed and adapted for better data mining for various
purposes. We found that this particular model would be an effective method to apply in our case
study.


The CRISP-DM methodological model consists of the following (Sullivan, 2004):



Business understanding - The clients’ point of view is considered at the first stage,
identifying the requirements and objectives of the selected applications. Problems and
restrictions of each application are identified and examined as well. This phase also
incorporates a description of the client background, the business objectives, and a description
of the criteria used to measure the success of the study.



Data understanding - All information needed to carry out the application is identified, and an
initial gauge of the applications’ content, quality and utility is developed. This initial
collection of data helps the analyst discover the particulars of the individual programs. As
well, problems related to the format and values of the applications are examined at this point.
The manner in which data was collected, including the different sources, meaning, volumes,
reading procedure, etc. will also be of interest. These can also give an indication of the
quality of the data.



Data Preparation - In this stage, the final data set from which the model will be created and
validated will be reviewed. Tools for data extraction, cleaning, and transformation are applied
to data preparation. Combinations of tables, format changing and aggregation of values are drawn
out to satisfy the input requirements of the particular learning algorithms.



Modeling - Data-mining techniques are selected and applied at this stage, according to the
objectives as defined in the first step of the model. The core phase of KDD (Knowledge Discovery
and Data Mining) is modeling, which corresponds to the choice of the technique, its
parameterization, and its execution over a defined data training set. As well, other models can
be created in this phase if required.



Evaluation - A review of the previous steps will be made in order to verify the results against
the objectives as defined in the business understanding phase. The next tasks to be performed
will also be defined here. Depending on the results, route corrections may be defined, which
correspond to a return to one of the already performed phases using other parameters, or looking
for additional data.



Deployment - This phase sets the necessary actions to make the acquired knowledge available to
the organization. A final report is generated to explain the results and the experiences useful
to the client business.



Figure 2: Phases of the CRISP-DM Process Model (Source: CRISP-DM, 2000)


The CRISP-DM process is more of a cycle in which the sequence of the phases is not strict.
Moving back and forth between different phases is essential. The sequence of the tasks is
dependent on the outcome of each phase. The arrows (see Figure 2) indicate the most important
and frequent dependencies between phases. The outer circle in the figure indicates that the
process is cyclic in nature.

The mining process continues after a solution has been deployed. The lessons learned during the
practice can generate new, often more focused business questions. Subsequent mining processes
will benefit from the experiences of previous ones, with discovery and examination of successful
results.


In our study for the Tri-State Times we decided to use the CRISP model as well. It is a popular
and proven solution. Many business solutions have depended on the CRISP model. In a study
analyzing the transcripts of meetings, Chibelushi, Sharp, and Salter (n.d.) recorded a set of
meetings and transcribed them for further processing. These transcripts were manually edited to
prepare for the modeling phase and then further analyzed to track the themes discussed and
extract the key issues and any associated actions, as well as identifying the initiator. The
approach combined statistical natural language processing and semantic analysis of the
transcripts. do Prado (2004) applies the CRISP-DM methodological approach to a case study in
order to look into the relationships between the text and the defined grouping of news items.


The Tri-State Times Text Mining Project


The student newspaper Tri-State Times is researching the use of text mining software to assist
in finding news articles, columns, and editorials that are similar to selected news items. In
addition, the software should be able to identify the primary terms and phrases in the selected
article so that similarities to other articles can be identified.

The members of the Tri-State Times have approached the School of Library and Information Science
(SLIS) to assist in researching the text mining software that would be appropriate for their
use. However, due to lack of funds, they would like to use a system that can be purchased for a
minimal cost or one that is available for free. In addition, the upkeep of the system should be
easy and should not incur any major additional costs.

The members of the research team at SLIS will conduct a study in which they will evaluate
different text mining software that could satisfy the needs of the Tri-State Times. Once the
search has been narrowed down, they will conduct tests based on the criteria of the tasks that
are needed by the newspaper. In addition to comparing the ease of use of the various systems, a
survey will be conducted with a sample of the users to determine the functionality of the system
and to determine what the users feel about the system and the resulting data sets. A combination
of these factors will be used to determine the best text mining solution for the newspaper.


Step 1 - Business Understanding

During this initial phase, it is necessary to focus on understanding the project objectives and
requirements from the organization’s perspective, and then converting this knowledge into a data
mining problem definition and a preliminary plan designed to achieve the objectives.

The student newspaper at Tri-State University is short staffed. The editorial team has
consulted the SLIS to see whether it can come up with a solution to help them archive
important articles. The objective of this project is to find a method for locating articles,
columns, and editorials similar to the stories that the editorial team selects. To achieve this,
the team surveys and selects major news articles and then looks for similar articles covering the
same news story in other major newspapers. One can look into most of the major news sources
manually, one at a time, or use a text mining model to assist in the process. The text mining
model will be able to identify the key terms and ideas in the articles. These terms can
also be used to look up similar articles. In addition, other articles can be identified and selected
from other news sources automatically with the use of certain text mining software. Finally, the
key terms and sentences can be used to formulate a summary of the article. This summary
can be entered into a collection or database of summaries and accessed in the future for use in
writing editorials, opinions, and other articles.
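
The idea of extracting key terms and using them to find similar articles can be illustrated with
a minimal sketch. It assumes a simple TF-IDF weighting and cosine similarity, which is one common
way to do this and is not necessarily what any of the tools examined in this paper use. The
stop-word list, the function names, and the sample headlines are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "for", "at", "by", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, using a smoothed TF-IDF weight."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def key_terms(vector, k=5):
    """Return the k highest-weighted terms (ties broken alphabetically)."""
    return [t for t, _ in sorted(vector.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


def rank_similar(selected, candidates):
    """Return candidate indices ordered from most to least similar to the selected article."""
    vectors = tfidf_vectors([selected] + candidates)
    scores = [(cosine(vectors[0], v), i) for i, v in enumerate(vectors[1:])]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    # Made-up headlines, not data from the study.
    selected = "City council approves new budget for public schools"
    candidates = [
        "Local team wins the championship game",
        "Council votes to approve school budget increase",
        "Weather forecast predicts heavy rain",
    ]
    print(rank_similar(selected, candidates))  # [1, 0, 2]
```

In this sketch the second candidate ranks first because it shares the terms "council" and
"budget" with the selected article; the other two share no terms and tie at zero.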



Step 2 - Data Understanding

Data understanding begins with an initial collection of news stories from various national
news websites, e.g., CNN, Fox News, and Yahoo News. Next, local stories of interest are
selected from local and regional newspapers. In order to get familiar with the data, the main
stories are identified from one or two of these news sources. A major data quality issue may be
the availability of more than one version of a news article at various times, especially
on the Internet. Another issue is the amount of time spent on identifying the main news article,
and whether this becomes unwieldy. Once the editorial team selects the news articles (the
dataset), a member of the staff should be able to enter them into the system and generate the
output: key terms and phrases, a summary of the article, and possible matches of similar articles.

Step 3 - Data Preparation

The data preparation phase covers all activities to construct the final dataset (data that will be fed
into the modeling tool(s)) from the initial raw data. In this case, the raw data includes all the
news articles that the editorial team deemed worth archiving during the first selection of these
articles. The final dataset includes the articles that have been selected from the raw data because
they are considered more important than the rest. During this phase, one or two versions of the
main stories of the day are chosen from the news sources. These news articles are converted
to text versions so that they can be input without the images. These text versions of the news
stories make up the final dataset that is ready to be entered into the text mining system.
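
As an illustration of the conversion to text versions, the following is a minimal sketch that
strips markup (and therefore images) from an HTML page using only the Python standard library.
It is an assumption about how such a step could be automated, not a description of how the
SLIS team prepared the articles; the class name, the function name, and the list of block-level
tags are hypothetical choices.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect visible text; script/style content is skipped and images contribute nothing."""

    SKIPPED = {"script", "style"}
    BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li"}  # assumed list, not exhaustive

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")  # start block-level elements on a new line

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Return the plain text of an HTML document, one block element per line."""
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    # Made-up page fragment for illustration.
    page = (
        "<html><body><h1>Budget approved</h1>"
        '<img src="photo.jpg" alt="Council">'
        "<p>The council voted on <b>Tuesday</b>.</p></body></html>"
    )
    print(html_to_text(page))
```

Real news pages also carry navigation menus, advertisements, and comments, so a production
version would need additional rules to isolate the article body.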

Step 4 - Modeling

During this phase, various modeling techniques are selected and applied. The SLIS team looked
at various text mining tools in light of the needs of the Tri-State Times. Text mining tools
come with various capabilities and prices. Major vendors offer text mining tools that cost in the
region of thousands of dollars. Some of the major text mining tools are compared below along
with their major features:



Table 2 - Text-mining technologies offered by commercial vendors
(Source: Fan, Wallace, et al., 2005)


Although most of the above systems offered all the functionalities that the Tri-State project
needed, these tools were beyond the budget of the student newspaper. In addition,
learning and applying these systems would require additional effort. Hence the SLIS team had to
look for systems that were simpler to use and were less expensive or offered for free on the
Internet, so that the initial costs would be minimal. The SLIS team narrowed down the
search to three text mining models: 1) Termine, 2) Textalyzer, and 3) Topicalizer.

The SLIS team took a sample set of chosen articles from the final data set and plugged
them into the respective models. The outputs from these systems were then tabulated and
compared. In some cases, the outputs were not comparable. In these cases, another set of data
was taken and plugged into the systems. The outputs for these articles were then evaluated.
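
One way to make such a tabulated comparison concrete is to measure how much the term lists
produced by the different systems overlap. The sketch below uses the Jaccard coefficient for
this purpose; this is an illustrative assumption, not the comparison procedure the SLIS team
followed, and the term lists shown are invented rather than actual output from the tools.

```python
from itertools import combinations


def normalize(terms):
    """Lowercase terms and collapse internal whitespace so equal terms compare equal."""
    return {" ".join(t.lower().split()) for t in terms}


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    a, b = normalize(a), normalize(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_outputs(outputs):
    """outputs maps a tool name to its list of extracted terms; returns pairwise scores."""
    return {
        (x, y): jaccard(outputs[x], outputs[y])
        for x, y in combinations(sorted(outputs), 2)
    }


if __name__ == "__main__":
    # Hypothetical term lists for a single article (not results from the study).
    outputs = {
        "Termine": ["school budget", "city council", "tax levy"],
        "Textalyzer": ["budget", "council", "school budget"],
        "Topicalizer": ["School Budget", "city council", "vote"],
    }
    for pair, score in compare_outputs(outputs).items():
        print(pair, score)
```

A score near 1 means two systems extracted nearly the same terms; a score near 0 means their
outputs are not directly comparable, which is the situation described above.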


Step 5 - Evaluation

During this stage the SLIS team evaluated the results that were generated from the three selected
models of text mining. The results were compared on the basis of three criteria: 1) terms and
phrases selected, 2) summary generated, and 3) generated list of similar articles.

The key terms that the systems generated were evaluated in terms of whether the terms
selected could be used to look for other related articles from other news sources. The summaries
generated were compared to see if the models were able to generate a concise summary that
could be used in the future for editorials and opinions. In addition, the summaries were evaluated
to see if the system was able to connect ideas, or just extract sentences from the input text.

Finally, the set of articles that each system generated as similar to the input article
was also evaluated. The number of articles in the output and the variety of sources were also
investigated. These outputs were evaluated by a team of eleven participants who looked at the
three criteria for each of the three systems: 1) terms and phrases selected, 2) summary
generated, and 3) generated list of similar articles. In addition, these users also evaluated the
systems for ease of use and available functionalities.

The users tested the three software models with a few news articles that they input into the
systems. The outputs were generated and compared for each of the systems. In addition, a short
survey was taken to evaluate their user experience (see Appendix A for the survey). Overall, the
users were satisfied with all three models of text mining that were evaluated. The fact that these
systems were available on the Internet at no cost was attractive to the users, who were aware of
the financial constraints. The ease of use of all three systems was also a feature that the
participants liked. All three models tested did provide keywords and phrases that were helpful in
locating other news articles of interest to the users. However, due to the additional
functionalities of Topicalizer, including the summary and the links to other articles, most of
the participants preferred this system to the others tested.


Topicalizer is a service which automatically analyses a document specified by a URL or a
plain text regarding its word, phrase and text structure. It provides functional information on a
given text including the following: Word, sentence and paragraph count, collocations, syllable
structure, lexical density, keywords, readability and a short abstract on what the given text is
about. (Topicalizer, n.d.)
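
To give a sense of what a few of these measures involve, the following is a rough sketch that
computes paragraph, sentence, and word counts for a plain text. How Topicalizer itself computes
its figures is not documented here, so every rule below is an assumption: sentences are split on
terminal punctuation, paragraphs on blank lines, and "lexical density" is approximated as the
ratio of distinct words to total words (a type-token ratio). The function name is hypothetical.

```python
import re


def text_statistics(text):
    """Return simple counts for a plain text, under the assumptions stated above."""
    # Paragraphs: blocks separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Sentences: split after '.', '!' or '?' followed by whitespace (a crude rule).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words: runs of letters and apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    distinct = set(words)
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
        # Assumed definition: distinct words divided by total words.
        "lexical_density": len(distinct) / len(words) if words else 0.0,
    }


if __name__ == "__main__":
    # Made-up sample text.
    sample = "The council met. The council voted.\n\nThe budget passed."
    print(text_statistics(sample))
```

Abbreviations such as "Dr." or "e.g." would be miscounted as sentence ends by this rule, which
is one reason results from different tools can differ on the same article.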

The results of the study were summarized as follows:



Based on the information collected from the users, there was a 72% agreement rate on the
resulting sets of information. Interestingly, this is close to the rates reported in previous
studies measuring the frequency of agreement between expert and non-expert semantic taggers
(Stevenson, 2003, p. 50). Often, the areas of disagreement involved topics that were more subject
to multiple interpretations (the user was not sure of context or relevance).
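
For readers unfamiliar with agreement rates, the sketch below shows one simple way such a figure
can be calculated: the proportion of matching judgments over all pairs of raters. This is an
assumption offered for illustration only; the exact calculation behind the 72% figure is not
described here, and the ratings in the example are invented.

```python
from itertools import combinations


def pairwise_agreement(judgments):
    """judgments: one list of labels per rater, all covering the same items in order.

    Returns the share of (rater pair, item) comparisons in which the two raters agree.
    """
    agree = 0
    total = 0
    for a, b in combinations(judgments, 2):
        for x, y in zip(a, b):
            agree += int(x == y)
            total += 1
    return agree / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical relevance judgments (1 = relevant, 0 = not) from three raters on four items.
    ratings = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 0, 0],
    ]
    print(round(pairwise_agreement(ratings), 2))  # 0.67
```

Simple percent agreement does not correct for agreement that would occur by chance, so it
tends to look more favorable than chance-corrected statistics.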

Step 6 - Deployment

Creation of the model is generally not the end of the project. Once the testers chose
Topicalizer as the system of choice for the Tri-State Times and the trial runs were
complete, the results were organized and stored in a way that allows them to be retrieved as
needed by the student newspaper. A structure for reporting was created with a consistent and
understandable format for the users of these results.


Conclusion


By using a selection of text mining software, our study found that differences exist in the manner
in which people relate to information, even from the same query. Apart from differences in the
particular algorithms and weighting schemas, a definite difference was found in the decision
processes of different individuals presented with the same set of information, and also in the
comparison of satisfaction surveys. These differences ultimately prove to be largely subjective in
nature, though important information can be found within them for further improvements in the
software.

We felt that although certain limitations can be found in text mining, there are also
opportunities for further research and study. There have been many benefits in the existing
applications of text mining, and we feel that there are many possibilities for improvements in the
current software. Our study was limited to a specific set of data and users, while future research
could utilize a much broader scope of application (larger sets of data, a wider array of topics,
a larger user base).


References

Atkinson-Abutridy, J. (2004). Semantically-Driven Explanatory Text Mining: Beyond Keywords.
Retrieved March 3, 2007, from Universidad de Concepción, Departamento de Ingeniería Informática,
http://www.springerlink.com/content/v78y4u242a67uupe/fulltext.pdf

Chibelushi, C., Sharp, B., & Salter, A. (2004). A Text Mining Approach to Tracking Elements of
Decision Making: a pilot study. Retrieved March 5, 2007, from Staffordshire University, School
of Computing,
http://www.comp.lancs.ac.uk/computing/research/cseg/projects/tracker/css_iceis04.pdf

CRISP-DM. (2000). Retrieved on Mar 02, 2007 from
http://www.crisp-dm.org/Overview/index.htm

do Prado, H. A., Moreira de Oliveira, J. P., Ferneda, E., Wives, L. K., Silva, E. M., & Loh, S.
(2004). Transforming Textual Patterns into Knowledge. In Raisinghani, Mahesh (Ed.), Business
intelligence in the digital economy: opportunities, limitations, and risks (pp. 207-227). Hershey,
PA: Idea Group Publications.


Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the Association for
Computing Machinery, 8, 223-239.

Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2005). Tapping Into the Power of Text Mining.
Retrieved on Mar 10, 2007 from
http://pubs.dlib.vt.edu:9090/2/01/text_mining_final_preprint.pdf

Feldman, R. & Sanger, J. (2006). The text mining handbook: advanced approaches in analyzing
unstructured data. New York: Cambridge University Press.

Haravu, L. J. and Neelameghan, A. (2003). Text Mining and Data Mining in Knowledge
Organization and Discovery: The Making of Knowledge-Based Products. Cataloging &
Classification, 37(1/2), 97-113.

Hearst, M. (2003). What Is Text Mining? Retrieved on March 12, 2007 from
http://www.ischool.berkeley.edu/~hearst/text-mining.html



Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and
Development, 2, 159-165.

Mironova, S. Y., Berry, M. W., Atchley, S., & Beck, M. (2004). Advancements in text mining
algorithms and software. In Kargupta, H. (Ed.), Data mining: next generation challenges and
future directions (pp. 425-436). Menlo Park, CA: MIT Press.

Natarajan, M. (2005, July). Role of Text Mining in Information Extraction and Information
Management. DESIDOC Bulletin of Information Technology, 25(4), 31-8.

Purple Insight. (n.d.). Retrieved on March 12, 2007 from
http://www.purpleinsight.com/downloads/docs/visualizer_tutorial/glossary/go01.html



Sharp, M. (2001). Text Mining. Seminar in Information Studies. Retrieved on Mar 02, 2007 from
http://www.scils.rutgers.edu/~msharp/text_mining.htm

Smalheiser, N. R., & Swanson, D. R. (1998). Using ARROWSMITH: A Computer Assisted
Approach to Formulating and Assessing Scientific Hypotheses. Computer Methods and
Programs in Biomedicine, 57(3), 149-153.

SPSS. (2005). Improve Business Results with Text Mining. Retrieved March 1, 2007, from
http://www.spss.com/PDFs/TMC4SPC-0105.pdf

Stevenson, M. (2003). Word Sense Disambiguation. Stanford, CA: Center for the Study of
Language and Information.

Sullivan, D. (2004). Text Mining in Business Intelligence. In Raisinghani, Mahesh (Ed.),
Business intelligence in the digital economy: opportunities, limitations, and risks (pp. 98-110).
Hershey, PA: Idea Group Publications.


Swanson, D. R. (1988_1). Historical note: Information retrieval and the future of an illusion.
Journal of the American Society for Information Science, 39, 92-98.

Swanson, D. R. (1988_2). Migraine and Magnesium: Eleven neglected connections. Perspectives
in Biology and Medicine, 31, 526-557.

Termine. Retrieved on Feb 22, 2007 from
http://www.nactem.ac.uk/software/termine/

Textalyser. Retrieved on Feb 22, 2007 from
http://textalyser.net/

Topicalizer. Retrieved on Feb 22, 2007 from
http://www.topicalizer.com/

Venere, E. (2004). 'Knowledge discovery' Could Speed Creation of New Products. Purdue
News Service. Retrieved on Mar 03, 2007 from
http://www.purdue.edu/UNS/html4ever/2004/041018.Caruthers.discover.html

Yeates, S. (2002). Text Mining. Retrieved on March 02, 2007 from
http://www.cs.waikato.ac.nz/~nzdl/textmining/




Appendix A - Questionnaire for Participants

(Note: One form for each system)


Please feel free to add or remove success viewpoints as appropriate.


Estimate the level of success of each query, using this response scale.


5 very satisfied
4 satisfied
3 neutral
2 dissatisfied
1 very dissatisfied


1. __ The generated summary was relevant to the query.


2. __ The terms and phrases of the query were found to give relevant results.


3. __ The results gave too many variations to similar articles.


4. __ The system was easy to use and understand.


5. __ The system provided simple and ample methods of search mechanisms.








Identify both the successful and unsuccessful elements found in the resulting information. Were
these useful for further study, or were they irrelevant?







Estimate the level of satisfaction of the results, using this response scale.


5 very satisfied
4 satisfied
3 neutral
2 dissatisfied
1 very dissatisfied



1. __ The resulting data sets were sufficient for our purpose of use and study.


2. __ The extracted information led to a discovery of new knowledge.


3. __


4. __


5. __





Rate the following characteristics for the environment for the project being reviewed. Use a scale
of 1 to 5, where 1 is the lowest rating and 5 is the highest. If the item does not apply, mark an X.



__ ease of software use

__ software quality

__ analysis capability

__ design capability

__ appropriateness of technology to query

__ effective use of data configuration

__ quality assurance of data

__ clarity of source