Nov 25, 2013 (3 years and 8 months ago)

OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY

A Distributed Agent Implementation of
Multiple Species Flocking Model for
Document Partitioning Clustering

Xiaohui Cui, Ph.D. and Thomas E. Potok, Ph.D.

Applied Software Engineering Research Group

Oak Ridge National Laboratory


Outline

Introduction to dynamic information streams and their issues

Bio-inspired clustering

MSF clustering model based on bird flock collective behavior

TFIDF not practical for dynamic data

MSF document clustering algorithm

Multi-agent document clustering implementation

Future work and conclusion


Text Challenge


Problem


How to effectively reduce the size of a large, streaming
set of documents


“Give me the 10 documents that I need to read, out of
the 1000 I received today?”


Characteristics


A steady flow of simple documents


Need to rapidly organize the documents into subsets


Select representative documents from the subsets



Approach


Use standard IR techniques to convert text
to vectors


Use unsupervised learning/text clustering
to organize the documents


Look for improvements in term weighting
approaches


Standard Information Retrieval

Document 1: The Army needs sensor technology to help find improvised explosive devices
Terms: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device

Document 2: ORNL has developed sensor technology for homeland defense
Terms: ORNL, develop, sensor, technology, homeland, defense

Document 3: Mitre has won a contract to develop homeland defense sensors for explosive devices
Terms: Mitre, won, contract, develop, homeland, defense, sensor, explosive, devices

Term List: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device, ORNL, develop, homeland, Defense, Mitre, won, contract

Vector Space Model

Term         Doc 1   Doc 2   Doc 3
Army         1       0       0
Sensor       1       1       1
Technology   1       1       0
Help         1       0       0
Find         1       0       0
Improvise    1       0       0
Explosive    1       0       1
Device       1       0       1
ORNL         0       1       0
develop      0       1       1
homeland     0       1       1
Defense      0       1       1
Mitre        0       0       1
won          0       0       1
contract     0       0       1
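
As a minimal sketch of how the term list and the binary term-by-document matrix on this slide can be built, the Python below starts from the already-stemmed terms of the three example documents. The names `docs`, `build_term_list`, and `binary_matrix` are illustrative assumptions and do not come from the slides; stop-word removal and stemming are assumed to have been done beforehand.

```python
# Stemmed, stop-word-free terms of the three example documents (from the slide).
docs = {
    "Doc 1": ["army", "sensor", "technology", "help", "find",
              "improvise", "explosive", "device"],
    "Doc 2": ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    "Doc 3": ["mitre", "won", "contract", "develop", "homeland",
              "defense", "sensor", "explosive", "device"],
}


def build_term_list(docs):
    """Collect every distinct term in first-seen order."""
    terms = []
    for words in docs.values():
        for word in words:
            if word not in terms:
                terms.append(word)
    return terms


def binary_matrix(docs, terms):
    """Map each term to a row of 0/1 flags, one flag per document."""
    return {
        term: [1 if term in words else 0 for words in docs.values()]
        for term in terms
    }


terms = build_term_list(docs)
matrix = binary_matrix(docs, terms)
for term in terms:
    print(f"{term:<12}", *matrix[term])
```

Each row of `matrix` corresponds to one term, and each column to one document, so a document's vector is the column read top to bottom.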


Standard Textual Clustering

Vector Space Model

Term         Doc 1   Doc 2   Doc 3
Army         1       0       0
Sensor       1       1       1
Technology   1       1       0
Help         1       0       0
Find         1       0       0
Improvise    1       0       0
Explosive    1       0       1
Device       1       0       1
ORNL         0       1       0
develop      0       1       1
homeland     0       1       1
Defense      0       1       1
Mitre        0       0       1
won          0       0       1
contract     0       0       1

TFIDF

$$W_{ij} = \log_2(1 + f_{ij}) \cdot \log_2\left(\frac{N}{n}\right)$$

Dissimilarity Matrix (Documents to Documents)

         Doc 1   Doc 2   Doc 3
Doc 1    100%    17%     21%
Doc 2            100%    36%
Doc 3                    100%

Cluster Analysis

[Figure: dendrogram over D1, D2, D3; most similar documents are joined first, using Euclidean distance]

Time Complexity: O(n^2 log n)
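
The weighting and comparison steps on this slide can be sketched in Python. This is an illustration under stated assumptions: the weight is taken to be W_ij = log2(1 + f_ij) * log2(N / n), where f_ij is the count of term i in document j, N is the number of documents, and n is the number of documents containing the term. Cosine similarity is used for the document-to-document comparison; it is a common choice, but not necessarily the measure behind the percentages shown on the slide. All function names are hypothetical.

```python
import math
from collections import Counter

# Stemmed terms of the three example documents.
docs = [
    ["army", "sensor", "technology", "help", "find",
     "improvise", "explosive", "device"],
    ["ornl", "develop", "sensor", "technology", "homeland", "defense"],
    ["mitre", "won", "contract", "develop", "homeland",
     "defense", "sensor", "explosive", "device"],
]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (assumed TFIDF form)."""
    n_docs = len(docs)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for words in docs for term in set(words))
    vectors = []
    for words in docs:
        counts = Counter(words)
        vectors.append({
            term: math.log2(1 + f) * math.log2(n_docs / doc_freq[term])
            for term, f in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(docs)
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(f"Doc {i + 1} vs Doc {j + 1}: {cosine(vectors[i], vectors[j]):.2f}")
```

Because "sensor" occurs in all three documents, its weight is zero under this formula. Every weight depends on N and n, so the whole document set has to be known before any vector can be computed.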

Issues (1)


Analysts are currently overwhelmed by the volume of information streams generated every day.

Research in cluster analysis focuses mainly on how to cluster static data collections quickly and accurately.

Research on clustering dynamic information streams is limited.


Solution: Bio-inspired Clustering

New computational algorithms inspired by biological models, such as ant colonies, bird flocks, and bee swarms, can solve problems in dynamic environments.

These algorithms are characterized by the interaction of a large number of agents that all follow the same rules.

Bio-inspired clustering algorithms apply the self-organizing and collective behaviors of social insects to organize dynamically changing data.



Data Clustering by Ant Clustering Algorithm

[Figure: three panels labeled 1, 2, 3]

Deneubourg proposed the first clustering solutions inspired by ant colonies in 1991.

Agent (ant) action rule: agents move randomly in the grid. An agent only recognizes objects immediately in front of it. It picks up or drops an item according to a pickup probability and a drop probability.

Data objects can only move when a small number of ant agents carry them, which slows down the clustering.
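
The slide does not give the pickup and drop probabilities. A minimal sketch, assuming the form commonly attributed to Deneubourg's basic model, P_pick = (k1 / (k1 + f))^2 and P_drop = (f / (k2 + f))^2, where f is the fraction of similar items the ant perceives nearby. The constants `k1` and `k2`, their default values, and the function names are illustrative assumptions.

```python
import random


def pick_probability(f, k1=0.1):
    """Chance that an unladen ant picks up an item; high when f is low."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.3):
    """Chance that a laden ant drops its item; high when f is high."""
    return (f / (k2 + f)) ** 2


def ant_decision(carrying, f, rng=random):
    """Return 'pick', 'drop', or 'move' for one ant at one grid cell.

    `carrying` says whether the ant already holds an item; an unladen ant
    is assumed to be standing next to an item it could pick up.
    """
    if not carrying and rng.random() < pick_probability(f):
        return "pick"
    if carrying and rng.random() < drop_probability(f):
        return "drop"
    return "move"


for f in (0.0, 0.2, 0.8):
    print(f, round(pick_probability(f), 3), round(drop_probability(f), 3))
```

Isolated items (low f) are likely to be picked up, and carried items are likely to be dropped where similar items are dense (high f), which is what makes piles grow over time.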


A New Clustering Algorithm Based on Bird Flock Collective Behavior

Trivial behavior
Emergent behavior = flocking


Flocking Model

Flocking model, one of the first bio-inspired computational collective behavior models, was first proposed by Craig Reynolds in 1987.

Alignment: steer towards the average heading of the local flock mates

$$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow v_{ar} = \frac{1}{n} \sum_{x}^{n} v_x$$

Separation: steer to avoid crowding flock mates

$$d(P_x, P_b) \le d_2 \Rightarrow v_{sr} = \sum_{x}^{n} \frac{v_x + v_b}{d(P_x, P_b)}$$

Cohesion: steer towards the average position of local flock mates

$$d(P_x, P_b) \le d_1 \cap d(P_x, P_b) \ge d_2 \Rightarrow v_{cr} = \sum_{x}^{n} (P_x - P_b)$$
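
The three steering rules can be sketched in Python for a single boid b. This is an illustration, assuming the rule forms v_ar = (1/n) Σ v_x, v_sr = Σ (v_x + v_b) / d(P_x, P_b), and v_cr = Σ (P_x - P_b), with alignment and cohesion applied to flock mates at a distance between d2 and d1 and separation applied to flock mates within d2. The dictionary layout and the function names are assumptions made for this example.

```python
import math


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def flocking_velocities(boid, others, d1, d2):
    """Return (alignment, separation, cohesion) vectors for `boid`.

    `boid` and each entry of `others` are dicts holding 2-D tuples under
    the keys 'pos' and 'vel'. d1 is the sensing range and d2 < d1 is the
    crowding range.
    """
    align = [0.0, 0.0]
    separate = [0.0, 0.0]
    cohere = [0.0, 0.0]
    n_mates = 0
    for other in others:
        d = distance(other["pos"], boid["pos"])
        if d == 0 or d > d1:
            continue  # skip coincident boids and boids out of sensing range
        if d <= d2:
            # Separation: crowding neighbors count more the closer they are.
            separate[0] += (other["vel"][0] + boid["vel"][0]) / d
            separate[1] += (other["vel"][1] + boid["vel"][1]) / d
        if d >= d2:
            # Alignment and cohesion use the same set of flock mates.
            align[0] += other["vel"][0]
            align[1] += other["vel"][1]
            cohere[0] += other["pos"][0] - boid["pos"][0]
            cohere[1] += other["pos"][1] - boid["pos"][1]
            n_mates += 1
    if n_mates:
        align = [component / n_mates for component in align]
    return tuple(align), tuple(separate), tuple(cohere)


b = {"pos": (0.0, 0.0), "vel": (1.0, 0.0)}
mates = [
    {"pos": (3.0, 0.0), "vel": (0.0, 1.0)},    # within d1, outside d2
    {"pos": (0.5, 0.0), "vel": (0.0, 1.0)},    # crowding neighbor
    {"pos": (10.0, 10.0), "vel": (5.0, 5.0)},  # out of range, ignored
]
v_ar, v_sr, v_cr = flocking_velocities(b, mates, d1=5.0, d2=1.0)
print(v_ar, v_sr, v_cr)
```

A neighbor at exactly distance d2 satisfies both conditions and contributes to all three vectors, as the inequalities above allow. How the three vectors are weighted and combined into the boid's next velocity is left out here.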


Flocking Demo


Multiple Species Flocking (MSF) Model

Feature similarity rule: steer away from other birds that have dissimilar features and stay close to those birds that have similar features.
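
One way to picture the feature similarity rule is the Python sketch below. It is not the authors' formula: it assumes cosine similarity between feature vectors, a hypothetical `threshold` that separates "similar" from "dissimilar", and a simple unit-vector attraction or repulsion toward each neighbor.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_similarity_velocity(boid, others, threshold=0.5):
    """Steer toward similar neighbors and away from dissimilar ones.

    `boid` and each entry of `others` are dicts with a 2-D 'pos' tuple and
    a 'features' list. The threshold and weighting are illustrative only.
    """
    steer = [0.0, 0.0]
    for other in others:
        dx = other["pos"][0] - boid["pos"][0]
        dy = other["pos"][1] - boid["pos"][1]
        d = math.hypot(dx, dy)
        if d == 0:
            continue  # no direction to steer along
        s = cosine_similarity(boid["features"], other["features"])
        # Positive weight attracts, negative weight repels.
        weight = s if s >= threshold else -(1.0 - s)
        steer[0] += weight * dx / d
        steer[1] += weight * dy / d
    return tuple(steer)


me = {"pos": (0.0, 0.0), "features": [1.0, 0.0]}
neighbors = [
    {"pos": (2.0, 0.0), "features": [1.0, 0.0]},  # same "species"
    {"pos": (0.0, 3.0), "features": [0.0, 1.0]},  # different "species"
]
print(feature_similarity_velocity(me, neighbors))
```

In this example the boid is pulled along +x toward the similar neighbor and pushed along -y away from the dissimilar one.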


Issues (2)


TFIDF not practical for dynamic data

Every document added to or removed from the set requires recalculation of the entire VSM

Requires sequential processing

Not good for a distributed agent approach

The document set must be known before the VSM can be calculated


Inverse Corpus Frequency













$$W_{ij} = \log_2(1 + f_{ij}) \cdot \log_2\left(\frac{C + 1}{c + 1}\right)$$

[Figure: Unique Term Count (0 to 250,000) versus Number of Documents (K) (5 to 905)]

Look at the forest, not the trees

We analyzed nearly 1 million documents from 6 major research corpora

We found 229,023 unique terms (a large dictionary contains around 70,000 terms)

We use this term frequency distribution as our “global” term frequency

Reed, Jiao, et al., “TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams,” The Fifth International Conference on Machine Learning and Applications (2006), to appear

Reed et al., “Multi-Agent System for Distributed Cluster Analysis,” Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), May 24-25, 2004, Edinburgh, Scotland
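
A minimal Python sketch of corpus-based weighting, assuming the form W_ij = log2(1 + f_ij) * log2((C + 1) / (c + 1)), where C is the number of documents in a fixed reference corpus and c is the number of those documents that contain the term. `CORPUS_SIZE` and the counts in `CORPUS_DOC_FREQ` are invented placeholders, not the statistics gathered by the authors, and `tficf_vector` is a hypothetical name.

```python
import math
from collections import Counter

# Placeholder reference-corpus statistics (invented for illustration).
CORPUS_SIZE = 1_000_000
CORPUS_DOC_FREQ = {"sensor": 12_000, "technology": 90_000, "army": 8_000}


def tficf_vector(words, corpus_doc_freq=CORPUS_DOC_FREQ,
                 corpus_size=CORPUS_SIZE):
    """Weight the terms of one document using only static corpus statistics."""
    counts = Counter(words)
    return {
        term: math.log2(1 + f)
        * math.log2((corpus_size + 1) / (corpus_doc_freq.get(term, 0) + 1))
        for term, f in counts.items()
    }


vector = tficf_vector(["army", "sensor", "sensor", "technology"])
for term, weight in vector.items():
    print(f"{term:<12}{weight:.3f}")
```

The function needs only the document itself and a static frequency table, so the same call can run on whichever machine holds the document, and adding or removing other documents does not change the result.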


Why this matters


We can now generate an accurate vector
directly from a text document



That vector can be generated wherever
the document resides



We can now use agents to create vectors
from documents over a broad range of
computers


Multiple Species Flocking (MSF)
Document Clustering


Each document is projected as a bird in a 2D virtual space.

Birds with similar document vector features (analogous to a bird's species and colony in nature) automatically group together and become a flock.

Birds with different document vector features stay away from this flock.


MSF Document Clustering Demo

The Document Collection Dataset

No.   Category/Topic                     Number of articles
1     Airline Safety                     10
2     China and Spy Plane and Captives   4
3     Hoof and Mouth Disease             9
4     Amphetamine                        10
5     Iran Nuclear                       16
6     N. Korea and Nuclear Capability    5
7     Mortgage Rates                     8
8     Ocean and Pollution                10
9     Saddam Hussein and WMD             10
10    Storm Irene                        22
11    Volcano                            8


Performance Results of MSF, K-means and Ant Clustering Algorithm

The clustering results of the K-means, Ant clustering and MSF clustering algorithms on synthetic* and document** datasets after 300 iterations

Dataset                    Algorithm   Average cluster number   Average F-measure value
Synthetic Dataset          MSF         4                        0.9997
                           K-means     (4)***                   0.9879
                           Ant         4                        0.9823
Real Document Collection   MSF         9.105                    0.7913
                           K-means     (11)***                  0.5632
                           Ant         1                        0.1623

* Four data types, each including 200 two-dimensional (x, y) data objects. x and y are distributed according to a Normal distribution.

** 112 news article dataset, 11 categories

*** The k-means algorithm has pre-knowledge of the cluster number.

Ref: X. Cui, J. Gao and T. E. Potok, A Flocking Based Algorithm for Document Clustering Analysis, Journal of Systems Architecture, Volume 52, Issues 8-9, pp. 505-515, August 2006, ISSN: 1318-7621
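
The slide reports F-measure values without defining the metric. A sketch in Python, assuming the definition commonly used for clustering evaluation: for each true category i and cluster j, precision is n_ij / n_j, recall is n_ij / n_i, F(i, j) is their harmonic mean, and the overall score is the category-size-weighted sum of the best F(i, j) per category. Whether the reported numbers were computed exactly this way is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Overall F-measure of a clustering against true category labels."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    pair_counts = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = pair_counts[(cls, clu)]
            if n_ij == 0:
                continue
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2 * precision * recall / (precision + recall))
        # Weight each category by its share of the data.
        total += (n_i / n) * best
    return total


truth = ["a", "a", "b", "b"]
print(clustering_f_measure(truth, [0, 0, 1, 1]))  # clusters match categories
print(clustering_f_measure(truth, [0, 0, 0, 0]))  # everything in one cluster
```

Under this definition, a result that merges everything into a single cluster scores poorly when there are many categories, because precision is low for every category.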


MSF Clustering Algorithm for Information Stream

The MSF clustering algorithm can achieve better performance in document clustering than the K-means and Ant clustering algorithms.

The algorithm can continually refine the clustering result and quickly react to changes in individual data items. This characteristic makes it suitable for clustering dynamically changing document information, such as a text information stream.



Multi-Agent Document Clustering Implementation

JADE platform (http://jade.tilab.com/)

Linux cluster machine

One main node and three client nodes, connected by a Gigabit Ethernet switch. Each node contains a single 2.4 GHz Intel Pentium IV processor and 512 MB of memory.

Document datasets are derived from the TREC collections. TREC: Text REtrieval Conference (http://trec.nist.gov/)



Current and Future Work

Switched the agent platform from JADE to our lightweight agent platform (ORMAC).

Built a control agent that automatically generates and deploys flock agents on all available nodes of a 135-node cluster.

Built agents that monitor news updates on several popular Internet news websites, collect the news, and feed it into the system in real time.

Building a better GUI


Conclusion


The heuristic search mechanism of the flocking model helps document agents quickly form flocks and react to changes in any individual document.

The TFICF vector space model, an enhancement of TFIDF, allows parallel or distributed algorithms for information stream clustering.

The agent architecture provides an analysis approach that can run on cluster computers.


Thank you!


The architectures of the central model and the distributed model

[Figure: the distributed model. Labels: Head Node, JADE main Container, JADE system agents, JADE Container, Node1, Node2, Node3, Boid agents, Location proxy agents]

[Figure: the single processor model. Labels: Head Node, JADE main Container, JADE system agents, JADE Container, Node1, Boid agents, Location proxy agent]