Textual Information Clustering



Xavier Polanco

URI - INIST - CNRS

Textual Information Clustering and Visualization for Knowledge Discovery and Management

2

Introduction


We are concerned with the design and development of computer-based information analysis tools

Cluster analysis, computational linguistics, and artificial intelligence techniques are combined


3

On the technology side


A computer-based information analysis system is an integrated environment that assists a user in carrying out the complex process of converting information from textual data sources into knowledge

4

Information Analysis System

[Diagram: architecture of the Information Analysis System. Labels: French or English text-data; Lexicons or terminological resources; Dataset or Corpus; Term Extraction and Indexation; Clustering and Mapping; Bibliometric statistics; SDOC; NEURODOC; MIRIAD; ILC; HENOCH; DBMS-R; WWW Server; Mac; PC; WS]

5

Home Pages

Intranet

Extranet

6

Plan


Text Mining


Cluster Analysis


Visualization or Mapping


Knowledge Discovery


Knowledge Management

7

Textual Information


A large amount of information is available in textual form in databases and online sources

In this context, manual analysis and effective extraction of useful information are not possible

It is therefore relevant to provide automatic tools for analyzing large textual collections

8

Text Mining


Text mining consists of extracting information from hidden patterns in large text-data collections

The results can be important both:

for the analysis of the collection, and

for providing intelligent navigation and browsing methods

9

Process


The text mining process can be organized roughly into five major steps:


Data Selection


Term Extraction and Filtering


Data Clustering and Classification


Mapping or Visualization


Result Interpretation


Iterative and interactive process

10

Natural Language Processing


Experience shows that a linguistic engineering approach ensures higher performance of the data mining algorithms

Part-of-speech tagging (tagging texts) and lemmatization are generally accepted tasks

11

The approach


Our approach to text mining is based on extracting meaningful terms from documents

In this presentation, the focus is on the term extraction process, and

on the need to organize the generated terms in a taxonomy

12

The main tasks


Term extraction or acquisition


Indexation


Human control and screening




Indexing quality control



Index screening


clustering phase

13

Language Engineering

Lexicons: Management and Linguistic Processing

Texts: Part-of-speech tagging, lemmatization, and indexation

[Diagram labels: Text-DB; Lexicons; Natural Language Engineering System; Indexed Corpus]

14

Variation

Variation types: Syntactic Variation, Morpho-syntactic Variation

Normal Form: Resistance gene
Variations: Resistance methylase gene; Resistance and susceptibility gene; Gene of the antibiotic resistance

Normal Form: Rare species
Variation: Rarely encountered enterococcus species
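
A rough way to see how a variant can be attached to its normal form is to compare sets of lemmas. The sketch below is a toy under stated assumptions, not the linguistic processing behind this deck: the `LEMMAS` and `STOPWORDS` tables and the `is_variant` helper are hypothetical, and a real system would rely on part-of-speech tagging and lemmatization rather than a hand-written table.

```python
# Hypothetical lemma table and stopword list, for illustration only.
LEMMAS = {"rarely": "rare", "genes": "gene"}
STOPWORDS = {"of", "the", "and"}


def lemmas(phrase):
    """Lowercase the phrase, drop stopwords, and map each word to its lemma."""
    words = phrase.lower().split()
    return [LEMMAS.get(w, w) for w in words if w not in STOPWORDS]


def is_variant(normal_form, candidate):
    """True if every lemma of the normal form also occurs in the candidate."""
    return set(lemmas(normal_form)) <= set(lemmas(candidate))


print(is_variant("Resistance gene", "Gene of the antibiotic resistance"))
print(is_variant("Rare species", "Rarely encountered enterococcus species"))
print(is_variant("Rare species", "Resistance methylase gene"))
```

Comparing bags of lemmas ignores word order and syntax, so it over-matches; it is only meant to show why lemmatization is needed before variants such as "Rarely" and "Rare" can be related.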

15

Taxonomy


A taxonomic structure should improve text mining

Considering the clustering techniques that might be used in text mining, one must be mindful that more taxonomic classifying capabilities could be incorporated into text mining

A taxonomic classifying capability might also facilitate cluster interpretation by giving the user some kind of rules

16

Clustering


Clustering is a descriptive task where one
seeks to identify a finite set of categories


Clustering is used to segment a database
into subsets or clusters


Clustering means finding the clusters
themselves from a given set of data

17

Clustering Process

[Diagram labels: Text-DB; Lexicons; Natural Language Engineering System; Indexed Corpus; D(n,p); Clustering Algorithm; C(m,p)]

Dissimilarity Measures: d(x,y)

Similarity Measures: s(x,y)
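
The slide does not say which measures s(x,y) and d(x,y) are used. As one common choice for binary keyword vectors, the sketch below assumes the Jaccard coefficient and its complement; the function names are illustrative, and the vectors are the rows D1, D3 and D4 of the documents-by-keywords example in this deck.

```python
def jaccard_similarity(x, y):
    """s(x,y): shared keywords divided by keywords present in either vector."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 1.0


def jaccard_dissimilarity(x, y):
    """d(x,y) = 1 - s(x,y)."""
    return 1.0 - jaccard_similarity(x, y)


D1 = [1, 0, 1, 0, 1, 1]
D3 = [0, 1, 0, 1, 0, 0]
D4 = [1, 0, 0, 1, 0, 1]

print(jaccard_similarity(D1, D4))     # 2 shared keywords out of 5 present
print(jaccard_dissimilarity(D1, D3))  # no shared keyword
```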

18

Documents × Keywords

      KW1  KW2  KW3  KW4  KW5  KW6
D1     1    0    1    0    1    1
D2     1    0    1    0    1    1
D3     0    1    0    1    0    0
D4     1    0    0    1    0    1

C1 = ({D1, D2}, {KW1, KW3, KW5, KW6})

C2 = ({D4}, {KW1, KW4, KW6})

C3 = ({D3}, {KW2, KW4})

Di × KWj = {1, 0}

Di × KWj = {1, 2, …, n}
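
The slide does not name the algorithm that produced C1, C2 and C3. One simple reading, assumed here, is that documents with identical keyword profiles fall into the same cluster. The helper `group_by_profile` below is a hypothetical illustration of that reading on the binary matrix above.

```python
from collections import defaultdict

KEYWORDS = ["KW1", "KW2", "KW3", "KW4", "KW5", "KW6"]
MATRIX = {
    "D1": [1, 0, 1, 0, 1, 1],
    "D2": [1, 0, 1, 0, 1, 1],
    "D3": [0, 1, 0, 1, 0, 0],
    "D4": [1, 0, 0, 1, 0, 1],
}


def group_by_profile(matrix, keywords):
    """Group documents that are indexed by exactly the same keywords."""
    groups = defaultdict(list)
    for doc, row in matrix.items():
        profile = tuple(kw for kw, bit in zip(keywords, row) if bit)
        groups[profile].append(doc)
    return [(docs, list(profile)) for profile, docs in groups.items()]


for docs, kws in group_by_profile(MATRIX, KEYWORDS):
    print(docs, kws)
```

Each printed pair is a set of documents with the keywords they share, which is the same (documents, keywords) shape as the clusters listed on the slide.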

19

Clustering Algorithms


Major families of clustering methods:


Sequential algorithms


Hierarchical algorithms


Agglomerative algorithms


Divisive algorithms


Fuzzy clustering algorithms
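
To make the agglomerative family concrete, here is a minimal single-linkage sketch. It is an illustration under assumptions, not the algorithm used by the tools in this deck: the Jaccard distance, the stopping threshold and the function names are all choices made for the example, and the vectors reuse the four documents of the documents-by-keywords example.

```python
def jaccard_distance(x, y):
    """1 minus the share of keywords common to both binary vectors."""
    inter = sum(1 for a, b in zip(x, y) if a and b)
    union = sum(1 for a, b in zip(x, y) if a or b)
    return 1.0 - inter / union if union else 0.0


def agglomerate(vectors, threshold):
    """Single linkage: merge the two closest clusters until they are too far apart."""
    clusters = [[name] for name in vectors]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # the closest pair is farther than the threshold: stop merging
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


VECTORS = {
    "D1": [1, 0, 1, 0, 1, 1],
    "D2": [1, 0, 1, 0, 1, 1],
    "D3": [0, 1, 0, 1, 0, 0],
    "D4": [1, 0, 0, 1, 0, 1],
}

print(agglomerate(VECTORS, threshold=0.5))
print(agglomerate(VECTORS, threshold=0.7))
```

A divisive algorithm would run in the opposite direction, starting from one cluster that holds every document and splitting it step by step.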

20

Information Analysis Process


The text-data information analysis is divided into two phases:

1. Cluster generation

2. Map display of clusters


A hypertext user interface enables the
analyst to explore and interpret results

21

Example

Antibiotic Resistance

[Diagram labels: Data; 4025 documents (1998-1999); 2 DB; Medicine; Molecular Biology; Clusters; 30; Map; Hypertext]

22

Information Visualization

Definition: The use of computer-supported, interactive, visual representation of abstract data to amplify the acquisition or use of knowledge (Card et al., 1999)


Visual artifacts aid human thought


The progress of civilization can be read in the
invention of visual artifacts, from writing to
mathematics, to maps, to diagrams, to visual
computing

23

Process


Raw Data → Data Tables

Data Tables → Clustering

Clustering → Visual Structures: Maps

Visual Structures → Views

24

Visual Structures


Data Tables are mapped to Visual Structures,
which augment a spatial substrate with marks and
graphical properties to encode information


A Graphic Representation is said to be expressive if all and only the data in the Data Table are also represented in the Visual Structure

A Graphic Representation is said to be more effective if it is faster to interpret

25

Map Display


We are concerned with map display of the
clusters


A problem of particular interest is how to visualize a data set with many variables:

1. Multivariate data are clustered, and

2. Clusters are mapped

26

Mapping tools


For mapping, we use the following
techniques:


Density and Centrality Diagrams


Principal Component Analysis (PCA)


Multi-Layer Perceptrons (MLP)

Self-Organizing Maps (SOM)

Multi-SOMs

27

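Multi-Layer Perceptron 1

[Figure: multi-layer perceptron with inputs x_1, ..., x_i, ..., x_p and outputs s_1, ..., s_k, ..., s_p. Weights W^c_{ij}, matrix W^c(p,2); weights W^s_{jk}, matrix W^s(2,p). Error: ISE = ||s - x||^2. Example terms: prion, proteins, scrapie, CJD, human disease, mankind, spongiform encephalopathy]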

28

Multi-Layer Perceptron 2

[Figure: network with Input Layer (x_1, ..., x_p), First Hidden Layer, Second Hidden Layer (Cartography), Output Layer (y_1, ..., y_p), and a Polarizer node. Other labels: C(m,p). Example terms: plasmids, protein, infection, resistance, Agrobacterium]

29

Multi-SOM Platform

[Diagram labels: Raw Data; DB; Pre-processing; Processing System; SOMPACK; MULTISOM; Post-processing; MAPS; Graphic-Hypertext User Interface; Java Application]

30

Multi-Self-Organizing Map Display

Use of the inter-Map Communication Mechanism

Maps associated with 5 viewpoints:

Map 1: Plants

Map 2: Plant Parts

Map 3: Pathogen Agents

Map 4: Genetic Techniques

Map 5: Patenting Firms

[Figure: five maps numbered 1 to 5; label: Rice Area Activated]

31

Knowledge Discovery


KD is informally defined as the extraction
of useful knowledge from databases or large
amounts of data


One of the most important research topics in KD is rule discovery or extraction

The discovered knowledge is usually expressed in the form of « if-then » rules


32

Association Rules


Association rules can be seen as one of the
key tasks of KDD


The intuitive meaning of an association rule X → Y, where X and Y are keywords or descriptors, is: “a document set containing keyword X is likely to also contain keyword Y”

33

Example


In a given food-industry corpus:

“98% of the documents concerned with apple juice relate it to the chromatography analytic technique”

X → Y: “apple juice → chromatography”
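
A percentage such as the 98% above is usually called the confidence of the rule X → Y: the share of documents containing X that also contain Y. The sketch below computes support and confidence on the four-document keyword matrix used elsewhere in this deck, since the food-industry corpus itself is not available; the function names are illustrative.

```python
DOCS = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}


def support(docs, items):
    """Fraction of all documents that contain every keyword in items."""
    items = set(items)
    return sum(1 for kws in docs.values() if items <= kws) / len(docs)


def confidence(docs, x, y):
    """Fraction of the documents containing x that also contain y."""
    with_x = [kws for kws in docs.values() if x in kws]
    if not with_x:
        return 0.0
    return sum(1 for kws in with_x if y in kws) / len(with_x)


print(confidence(DOCS, "KW3", "KW1"))  # every document with KW3 also has KW1
print(confidence(DOCS, "KW1", "KW3"))  # only two of the three KW1 documents have KW3
print(support(DOCS, {"KW1", "KW3"}))
```

The two confidence values differ, which shows that an association rule is directional: X → Y can hold strongly while Y → X does not.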


34

The Galois Lattice


Our current research includes an approach based on the lattice structure to discover concepts and rules relating the objects (documents) and their properties (keywords)


The Galois lattice approach is also known
as conceptual clustering

35

The concept lattice

Hasse diagram of the concepts:

C1: (D1, Ø)

C2: ({d1,d2,d4}, {t1,t6})

C3: ({d3,d4}, {t4})

C4: ({d1,d2}, {t1,t3,t5,t6})

C5: ({d4}, {t1,t4,t6})

C6: ({d3}, {t2,t4})

C7: (Ø, T1)

The formal concept C4 has two own terms {t3,t5} and two inherited terms {t1,t6}

Given the context (D1,T1) where D1 = {d1,d2,d3,d4} and T1 = {t1,t2,t3,t4,t5,t6}

R    t1  t2  t3  t4  t5  t6
d1    1   0   1   0   1   1
d2    1   0   1   0   1   1
d3    0   1   0   1   0   0
d4    1   0   0   1   0   1

Table: The input relation R = documents × keywords
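
The seven concepts can be recomputed from the input relation R. The sketch below is a brute-force illustration, not the method used in the research described here: it intersects the term sets of every subset of documents to obtain the closed term sets (intents), then looks up the documents sharing each of them (extents). The function name `formal_concepts` is an assumption, and enumerating all subsets is only practical for a toy context like this one.

```python
from itertools import combinations

CONTEXT = {
    "d1": frozenset({"t1", "t3", "t5", "t6"}),
    "d2": frozenset({"t1", "t3", "t5", "t6"}),
    "d3": frozenset({"t2", "t4"}),
    "d4": frozenset({"t1", "t4", "t6"}),
}


def formal_concepts(context):
    """Return all (extent, intent) pairs of the context, largest extent first."""
    all_terms = frozenset().union(*context.values())
    docs = list(context)
    intents = set()
    for size in range(len(docs) + 1):
        for subset in combinations(docs, size):
            shared = all_terms  # the empty subset of documents shares every term
            for doc in subset:
                shared &= context[doc]
            intents.add(shared)
    concepts = []
    for intent in intents:
        extent = frozenset(d for d, terms in context.items() if intent <= terms)
        concepts.append((extent, intent))
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[0])))


CONCEPTS = formal_concepts(CONTEXT)
for extent, intent in CONCEPTS:
    print(sorted(extent), sorted(intent))
```

Ordering the concepts by decreasing extent size mirrors reading the Hasse diagram from top (all documents, no shared term) to bottom (no document, all terms).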

36

Association Rules Extraction


The formal concept C4 yields the following rules:

R1: t3 → t1 ∧ t6

R2: t5 → t1 ∧ t6

R3: t3 ↔ t5

Interpretation of R1 and R2: the use of term t3 or t5 is always associated with that of terms t1 and t6

Rule R3 expresses the mutual equivalence of the terms {t3, t5}: all the documents which have the term t3 also have the term t5, and vice versa
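
The rules attached to C4 can be derived mechanically from its own and inherited terms. The sketch below is one possible reading, with hypothetical helper names: each own term implies the inherited terms, own terms imply each other, and every candidate rule is checked against the input relation of the concept lattice slide.

```python
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}


def holds(context, premise, conclusion):
    """True if every document containing the premise also contains the conclusion."""
    return all(conclusion <= terms
               for terms in context.values() if premise <= terms)


def rules_from_concept(intent, inherited):
    """Own terms imply the inherited terms, and own terms imply each other."""
    own = sorted(intent - inherited)
    rules = [({t}, set(inherited)) for t in own]
    rules += [({t}, {u}) for t in own for u in own if t != u]
    return rules


C4_INTENT = {"t1", "t3", "t5", "t6"}
C2_INTENT = {"t1", "t6"}  # terms that C4 inherits from C2

RULES = rules_from_concept(C4_INTENT, C2_INTENT)
for premise, conclusion in RULES:
    print(sorted(premise), "->", sorted(conclusion),
          holds(CONTEXT, premise, conclusion))
```

The reverse direction does not hold: `holds(CONTEXT, {"t1"}, {"t3"})` is false because d4 contains t1 without t3, which is why the rules run from own terms to inherited terms and not the other way round.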


37

Summary

Text Mining

Clustering

Mapping

Knowledge Discovery

38

Knowledge Management


A knowledge management system is
concerned with the identification,
acquisition, development, diffusion, use,
and preservation of the enterprise’s
knowledge


39

KM Objectives


Using advanced technology


For facilitating creation, access, and reuse
of knowledge


For converting knowledge from the sources
accessible to an organization and
connecting people with that knowledge

40

Project


Adding to the information analysis
system a formalized operator for
processing together:


The knowledge that is extracted from
databases


The knowledge that the experts produce when
they analyze the clusters, maps, concepts and
rules


41

We have reached our last subject, but not the end!

42

Xavier Polanco