TEXTUAL INFORMATION CLUSTERING AND VISUALIZATION FOR KNOWLEDGE DISCOVERY AND MANAGEMENT



Xavier Polanco


INTRODUCTION

We are concerned with the design and development of computer-based information analysis tools in
which cluster analysis, computational linguistics and artificial intelligence techniques are combined. On
the technology side, a computer-based information analysis system is an integrated environment that
assists a user in carrying out the complex process of converting information from textual data sources
into knowledge.

TEXT MINING

Text mining consists of extracting information from hidden patterns in large textual collections. A
very large amount of information is available in textual form in databases and online information sources.
At this scale, manual analysis and effective extraction of useful information are not possible, so
automatic tools for analyzing large textual collections are needed. The results can be important both for
the analysis of the collection and for providing intelligent navigation and browsing methods (Feldman et al.,
1998; Landau et al., 1998).


The text mining process can be organized roughly into five major steps: [1] Data Selection, [2] Term
Extraction and Filtering, [3] Data Clustering, [4] Cluster Mapping or Visualization, [5] Result
Interpretation (Polanco and François, 2000b).
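
The first two steps can be sketched in a few lines of Python. This is a minimal illustration and not the tools described in this paper: the record layout, the `abstract` field, the toy stopword list and the `min_df` threshold are all assumptions introduced for the example.

```python
import re
from collections import Counter

# Assumed toy stopword list; a real system would use a curated one.
STOPWORDS = {"the", "of", "and", "a", "in", "for", "to", "is", "on"}


def select_documents(records, field="abstract"):
    """Step 1, data selection: keep the text of records that have a non-empty field."""
    return [r[field] for r in records if r.get(field)]


def extract_terms(text):
    """Step 2a, term extraction: split the text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filter_terms(docs_terms, min_df=2):
    """Step 2b, filtering: drop stopwords and terms found in fewer than min_df documents."""
    doc_freq = Counter(t for terms in docs_terms for t in set(terms))
    keep = {t for t, n in doc_freq.items() if n >= min_df and t not in STOPWORDS}
    return [[t for t in terms if t in keep] for terms in docs_terms]


records = [
    {"abstract": "Neural networks for text clustering"},
    {"abstract": "Clustering of text collections"},
    {"abstract": ""},            # no text: dropped in step 1
    {"title": "No abstract"},    # no abstract field: dropped in step 1
]
texts = select_documents(records)
filtered = filter_terms([extract_terms(t) for t in texts])
```

The filtered term lists are the input that steps 3 to 5 (clustering, mapping, interpretation) would then work on.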

CLUSTERING

The aim of our activity is to analyze information by computer using cluster analysis
and cartography (or mapping) algorithms, which represent the generated clusters in the form of maps. We
have applied this approach to the domain of scientific and technical information, i.e. publications
and patents stored in databases (Polanco et al., 1995; 1998a; 1998b).


The analysis of the textual information is divided into two phases. The first involves generating
clusters with clustering procedures in which learning is unsupervised (the user does not define
classes), while the second consists of positioning the clusters on a global map in order to display the
topical organization of knowledge. Both phases are data driven. A hypertext interface generator
then provides a user-friendly interface that displays the global map, the topics or clusters, and the
document set, giving access to useful information organized by topic (cluster).
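
The first phase can be illustrated with a generic unsupervised procedure. The sketch below uses plain k-means with cosine similarity, which is a stand-in chosen for brevity and not necessarily the clustering procedure used in the work cited above. The function names, the toy term vectors and the seeding of centroids with the first k vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two term-weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans(vectors, k, iterations=10):
    """Unsupervised clustering: no classes are given, only the number of clusters k."""
    centroids = [list(v) for v in vectors[:k]]  # assumed seeding: first k vectors
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Update: each centroid becomes the mean of its member vectors.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy document vectors over the terms (neural, network, patent, law).
vectors = [[1, 1, 0, 0], [0, 0, 1, 1], [2, 1, 0, 0], [0, 0, 1, 2]]
labels, centroids = kmeans(vectors, k=2)
```

The resulting centroids are the objects that the second phase would position on the global map.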


Artificial neural networks (ANNs) are a useful class of models consisting of layers of nodes. Our
interest in ANNs is based on the links that exist between data analysis and ANN approaches in the
areas of clustering and mapping (Kohonen, 1997; Polanco et al., 1998a; 1998b; 2000a; 2000b).
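
Kohonen's self-organizing map (SOM) is one model in which clustering and mapping meet: each input is assigned to a unit of a two-dimensional grid, and neighbouring units are trained to hold similar weight vectors. The following is a minimal sketch under assumed settings (grid size, learning-rate and radius schedules, and random seed are all illustrative), not the configuration used in the cited studies.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid coordinate whose weight vector is closest to x (squared Euclidean)."""
    return min(weights, key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)))


def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a rows x cols self-organizing map on data (a list of equal-length vectors)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per map unit, initialised with random values in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly towards 0
        lr = lr0 * frac                      # learning rate
        radius = max(radius0 * frac, 0.5)    # neighbourhood radius on the grid
        for x in data:
            bmu = best_matching_unit(weights, x)
            for unit, w in weights.items():
                grid_dist2 = (unit[0] - bmu[0]) ** 2 + (unit[1] - bmu[1]) ** 2
                h = math.exp(-grid_dist2 / (2 * radius ** 2))
                # Move the unit towards x, more strongly when it is near the winner.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


data = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
weights = train_som(data, rows=2, cols=2)
positions = [best_matching_unit(weights, x) for x in data]  # map cell of each input
```

After training, `positions` gives the grid cell of each input, so the grid serves both as a set of clusters and as a map.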


INFORMATION VISUALIZATION

Information visualization is using vision to think (Card et al., 1999). We are concerned with
cartography algorithms that represent the clusters in the form of maps. The maps studied in Polanco et
al. (1998b) are not only a means of visualization. They are also an analysis tool insofar as they allow
users to evaluate the relative position of clusters (or topics) in the multidimensional space of
representation. As we have observed, the readability of such maps is a problem that must be addressed.
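
One way to place clusters on a map is to project their vectors onto two axes. The sketch below uses principal components analysis, one of the two methods named in the title of Polanco et al. (1998b). Computing the components by power iteration with deflation is an implementation choice made here to keep the example dependency-free, and `pca_2d` and its helpers are hypothetical names. It needs at least two input rows, and assumes the fixed starting vector is not orthogonal to either of the two leading components.

```python
import math


def _matvec(matrix, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def _power_iteration(matrix, iterations):
    """Approximate the dominant eigenvector of a symmetric matrix."""
    dim = len(matrix)
    v = [float(i + 1) for i in range(dim)]  # assumed fixed, non-uniform start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = _matvec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def pca_2d(rows, iterations=200):
    """Project each row onto the two leading principal axes of the data."""
    n, dim = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    axes = []
    for _ in range(2):
        v = _power_iteration(cov, iterations)
        eigenvalue = sum(a * b for a, b in zip(v, _matvec(cov, v)))
        axes.append(v)
        # Deflation: remove the component just found before looking for the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(a * b for a, b in zip(r, axis)) for axis in axes] for r in centered]


# Four toy cluster centroids in a three-dimensional term space.
centroids = [[4.0, 0.0, 0.0], [-4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
coords = pca_2d(centroids)  # one (x, y) map position per centroid
```

Because the toy centroids lie in a plane, their relative positions on the two-dimensional map match their positions in the original space; with real data the projection loses information, which is one source of the readability problem.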



The maps are "visualization-based analysis tools." In the context of data mining and knowledge
discovery in databases, Brachman and Anand (1996) have noted that "The visualization produced is by
itself a model, and the user can examine the visualization to determine its explanatory power (...)
Appropriate display of data points and their relationships can give the analyst insight that is virtually
impossible to get from looking at tables of output or simple summary statistics. In fact, for some tasks,
appropriate visualization is the only thing needed to solve a problem or confirm a hypothesis, even though
we do not usually think of picture-drawing as a kind of analysis."

KNOWLEDGE DISCOVERY

As a framework for what knowledge discovery in databases (KDD) means, we summarize here the
view of Brachman and Anand (1996). They invite us to look at KDD as a human-centered process. A KDD
system is a technical means of supporting the discovery of knowledge by a user. In a given context, the
output of the knowledge discovery process would more typically be the specification for a knowledge
discovery application. The goal of designing KDD as a process is to help us better understand how to do
knowledge discovery, and how best to support the human analysts. Without human analysts KDD is
unthinkable. It is crucial to emphasize the key role played by humans in knowledge discovery.


It is important to understand who the user is and what tasks the user has to perform. We assume that
our user is not a business end-user, but the "analyst." So it is the analyst's needs and tasks that will
be the focus of our attention. The analyst "analyzes" the data using data analysis and visualization tools. This
analysis leads the analyst to some sort of "insight" about the data. The analyst then uses presentation tools
to disseminate this insight to a broader audience, that is, the parties that generated the original goal of the
analysis.

KNOWLEDGE MANAGEMENT

We would add to the information analysis a formalized operator for processing the knowledge
produced by experts when they analyze the clusters. Knowledge management presupposes that it is
implemented by a system. The system must be able to process the results of the knowledge organization,
allowing not only exploration and visualization but also operations on this knowledge. The system must
be able to manage at least three types of data that we wish to combine: clusters, classes and the
bibliographic or textual data (which may themselves be of different types). The idea is therefore to
model not only the bibliographic data and the clusters obtained from this data, as is currently done,
but also the classes of knowledge obtained from the experts' analysis of the clusters.
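
The three types of data and one operation over them can be sketched as a small data model. Everything here is hypothetical and given only for illustration: the class names, the fields, the `KnowledgeStore` container and the `documents_for_class` operation are assumptions, not the design of an existing system.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Bibliographic or textual data; `kind` allows for different types."""
    doc_id: str
    title: str
    kind: str = "publication"  # for example "publication" or "patent"


@dataclass
class Cluster:
    """A cluster obtained automatically from the documents."""
    cluster_id: str
    terms: list
    doc_ids: list


@dataclass
class KnowledgeClass:
    """A class of knowledge defined by an expert who analyzed the clusters."""
    name: str
    cluster_ids: list
    expert_note: str = ""


@dataclass
class KnowledgeStore:
    """Holds the three types of data and supports operations that combine them."""
    documents: dict = field(default_factory=dict)
    clusters: dict = field(default_factory=dict)
    classes: dict = field(default_factory=dict)

    def documents_for_class(self, name):
        """Follow class -> clusters -> documents, without duplicates."""
        seen, result = set(), []
        for cluster_id in self.classes[name].cluster_ids:
            for doc_id in self.clusters[cluster_id].doc_ids:
                if doc_id not in seen:
                    seen.add(doc_id)
                    result.append(self.documents[doc_id])
        return result


store = KnowledgeStore()
store.documents["d1"] = Document("d1", "Neural maps of science")
store.documents["d2"] = Document("d2", "Text clustering method", kind="patent")
store.clusters["c1"] = Cluster("c1", ["neural", "map"], ["d1", "d2"])
store.clusters["c2"] = Cluster("c2", ["clustering"], ["d2"])
store.classes["mapping"] = KnowledgeClass("mapping", ["c1", "c2"], "Topics on cartography")
titles = [d.title for d in store.documents_for_class("mapping")]
```

Keeping the expert-defined classes as first-class objects, rather than as annotations on clusters, is what would allow them to be reused in later analyses of the same domain.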

Generally speaking, a knowledge management system (KMS) is concerned with the identification,
acquisition, development, diffusion, use, and preservation of the enterprise’s knowledge. Without going
into details, we can accept for our purpose the following general concept: "Knowledge management is the
formal management of knowledge for facilitating creation, access, and reuse of knowledge, typically using
advanced technology" (O'Leary, 1998a; 1998c). Our aim is to capitalize on the expert knowledge produced
in a science and technology watch analysis, as a way of reusing this knowledge in new actions concerning
the same domain. In our opinion, science and technology watch and knowledge management will yield their
full benefit only if these tasks are fully integrated.

A more operational definition of knowledge management is in terms of converting and connecting:
"Knowledge management is a process of converting knowledge from the sources accessible to an
organization and connecting people with that knowledge" (O’Leary, 1998b). The functions of a KMS
are therefore converting knowledge from textual data sources and connecting people with that
knowledge.

Science and technology watch in the broadest sense can be considered as the observation and
follow-up of scientific and technological changes in order to alert decision-makers to the
consequences of scientific and technological issues and trends.



REFERENCES

Brachman R. J., Anand T., 1996, “The Process of Knowledge Discovery in Databases.” In U. M.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds), Advances in Knowledge Discovery and Data
Mining, Menlo Park, Calif., AAAI Press / The MIT Press, pp. 37-57.

Card S. K., Mackinlay J. D., Shneiderman B. (eds), 1999, Readings in Information Visualization. San
Francisco, Calif., Morgan Kaufmann Publishers Inc.

Feldman R., Aumann Y., Zilberstein A., Ben-Yehuda Y., 1998, “Trend Graphs: Visualizing the
Evolution of Concept Relationships in Large Document Collections.” In J.M. Zytkow and M. Quafafou
(eds), Principles of Data Mining and Knowledge Discovery. Berlin, Springer Verlag, pp. 38-46.

Kohonen T., 1997, Self-Organizing Maps, Berlin, Springer.

Landau D., Feldman R., Aumann Y., Fresko M., Lindell Y., Lipshtat O., Zamir O., 1998, “TextVis: An
Integrated Visual Environment for Text Mining.” In J.M. Zytkow and M. Quafafou (eds), Principles of Data
Mining and Knowledge Discovery. Berlin, Springer Verlag, pp. 56-64.

O'Leary D. E., 1998a, “Knowledge-Management Systems: Converting and Connecting.” IEEE
Intelligent Systems, vol. 13, n° 3, pp. 30-33.

O'Leary D. E., 1998b, “Using AI in Knowledge Management: Knowledge Bases and Ontologies.”
IEEE Intelligent Systems, vol. 13, n° 3, pp. 34-39.

O'Leary D. E., 1998c, “Enterprise Knowledge Management.” Computer, vol. 31, n° 3, pp. 54-61.

Polanco X., Grivel L., Royauté J., 1995, “How To Do Things with Terms in Informetrics:
Terminological Variation and Stabilization as Science Watch Indicators.” In Proceedings of the Fifth
International Conference of the International Society for Scientometrics and Informetrics, edited by
M.E.D. Koenig and A. Bookstein. Medford, N.J., Learned Information Inc., pp. 435-444.

Polanco X., François C., Keim J-P., 1998a, “Artificial Neural Network Technology for the
Classification and Cartography of Scientific and Technical Information.” Scientometrics, vol. 41, n° 1,
pp. 69-82.

Polanco X., François C., Ould Louly A., 1998b, “For Visualization-Based Analysis Tools in Knowledge
Discovery Process: A Multilayer Perceptron versus Principal Components Analysis - A Comparative
Study.” In J.M. Zytkow and M. Quafafou (eds), Principles of Data Mining and Knowledge Discovery.
Second European Symposium, PKDD’98, Nantes, France, 23-26 September 1998. Lecture Notes in Artificial
Intelligence 1510. Subseries of Lecture Notes in Computer Science. Berlin, Springer, pp. 28-37.

Polanco X., François C., Lamirel J. Ch., 2000a, “Using Artificial Neural Networks for Mapping
Science.” In Book of Abstracts of the Sixth International Conference on Science and Technology
Indicators. Leiden, The Netherlands, 24-27 May, p. 89.

Polanco X., François C., 2000b, “Data Clustering and Cluster Mapping or Visualization in Text
Processing and Mining.” In Proceedings of the Sixth International Conference of the International
Society of Knowledge Organization, July 10-13, 2000, in Toronto, Canada, pp. 359-365.