
Agent-Based Knowledge Discovery:

Survey and Evaluation















A Term Paper for

EE380L Data Mining


WRITTEN BY:

Austin Bingham

Paul Chan

Dung Lam


SUBMITTED TO:

Dr. Joydeep Ghosh


May 2000

1. Introduction

With the prevalence of networking in the past decade, data has not only grown exponentially in size but also become more decentralized and disordered. In addition, databases, knowledge bases, and online repositories of information (such as dictionaries, user survey results, and server logs) around the world can now interact with one another. These intertwining networks of data sources present a challenge for knowledge discovery, as most existing techniques assume a single source of data. To make the problem worse, there is no agreed-upon method for discovering knowledge through distributed information gathering from heterogeneous data sources. Consequently, the rate of knowledge discovery fails to keep up with that of data generation, and the percentage of knowledge relative to the amount of data declines steadily. To remedy the situation, researchers have developed Agent-based Knowledge Discovery (ABKD) as a new paradigm that combines the two fields of Distributed Artificial Intelligence and Machine Learning [Davies and Edwards 1995A]. The purpose of this paper is to examine how we can apply existing agent-based techniques to the knowledge discovery, or data-mining, field. From this evaluation, an ideal agent-based system is proposed along with issues that must be considered.

An agent-based data mining system is a natural choice for mining large sets of inherently distributed data. One example application of such systems is military decision-making [Yang, Honavar, Miller, and Wong 1998]. Every day, commanders and intelligence analysts need to access critical information in a timely fashion. Typical day-to-day operations involve intelligence data gathering and analysis, situation monitoring and assessment, and looking for potentially interesting patterns in data, such as relationships between troop movements and significant political developments in a region. The information can be valuable for decision-makers taking both proactive and reactive measures designed to safeguard a nation's security concerns. In a crisis, we need to be able to deliver accurate information to the decision-makers at the right time without overwhelming them with large volumes of irrelevant data. This involves physically distributed data sources, including satellite images, intelligence reports, and records of communication with officers at the frontier. In that situation, nobody can afford the delay of sending large volumes of data back for central processing before relevant information can be presented to decision-makers.

Financial institutions and law-enforcement agencies share similar information processing needs. In order to predict market fluctuations, brokerage houses need to analyze news and financial transactions from all over the world in real time. The continuous growth in the amount of data to process makes it impossible for centralized analysis to take place. Besides, the prevalence of electronic commerce demands a secure and trusted inter-banking network with high-speed verification and authentication mechanisms. This requires a widely deployed system that detects local fraudulent transaction attempts and propagates the attack information as soon as possible [Chan and Stolfo 1996]. Likewise, law-enforcement agencies need to obtain information from one another on case histories or crime patterns to coordinate nationwide or even worldwide efforts to fight crime.

This paper surveys and evaluates ABKD systems. Section 2 introduces the idea of applying agent technology in distributed data mining. It also describes some metrics that are useful for evaluating the existing ABKD architectures in Section 3. Section 4 proposes the desired characteristics of an ideal ABKD architecture. Section 5 suggests some possible future work in ABKD research, and Section 6 concludes the paper.




2. Agent Technology

2.1 What is an Agent?

Many areas of research employ agent technology, and thus the definition of an agent varies according to the focus of the research. For example, research in multi-agent systems (MAS) commonly characterizes agents as autonomous and able to plan and coordinate within an organization for solving a problem. In ABKD, an agent is a software entity that can 1) interoperate with its data source and/or other agents, 2) receive/gather raw data, 3) process and learn from the data source or from other sources, and 4) coordinate with other agents to produce relevant and useful information. Research in ABKD emphasizes how agents manipulate data and how agents extract information from distributed data sources. Based on this characterization, many aspects of research in ABKD, such as planning, coordination, and communication, overlap with other fields of agent research. This paper, however, limits its description of agent technology to the context of knowledge discovery.
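The four capabilities above map naturally onto a software interface. The following minimal Java sketch is our own illustration (the type names DiscoveryAgent, DataSource, Record, and Model are hypothetical, not drawn from any surveyed system):

    import java.util.List;

    // Hypothetical illustration of the four ABKD agent capabilities.
    public interface DiscoveryAgent {
        // 1) Interoperate with a data source and/or other agents.
        void connect(DataSource source, List<DiscoveryAgent> peers);

        // 2) Receive or gather raw data from the connected source.
        List<Record> gather();

        // 3) Process and learn from the gathered data.
        Model learn(List<Record> rawData);

        // 4) Coordinate with other agents to produce useful information.
        Model coordinate(List<Model> peerModels);
    }

    interface DataSource { }   // placeholder for a database or repository
    interface Record { }       // placeholder for a raw data item
    interface Model { }        // placeholder for learned knowledge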

There are two types of agent-based systems: homogeneous systems and heterogeneous systems. Agents in homogeneous systems have the same functionality and capabilities, whereas agents in heterogeneous systems have dissimilar functionalities and capabilities but can still coordinate with one another. In general, heterogeneous systems are useful for processing different kinds of databases using a variety of techniques, but it may be difficult to integrate the resultant heterogeneous information. Agent systems can also be classified by the source of control. In decentralized systems, agents negotiate among themselves to resolve coordination problems. Centralized systems are usually easier to implement but have single points of failure.

In addition, some agent systems allow agents to dynamically change their roles when necessary. Having static agent roles within a system may simplify the coordination mechanism, but the system will be less robust as a whole. Choosing the right characteristics for an ABKD system involves considering what types of data are being mined and what coordination and integration techniques are preferred.


2.2 How does ABKD work?


ABKD systems fit naturally to domains with distributed resources. There are three general methods for ABKD to learn from distributed data. The first method involves collecting data into a single repository. This method is impractical and does not take advantage of agents and distributed networks.

Sian researched the second method, which involves information exchange among agents during their learning of local data [Sian 1991]. In the ideal case, since agents are working as a single algorithm over all the data sources, few or no revisions or integration are necessary. However, this method restricts the choice of possible algorithms to those specifically designed for distributed learning. Another drawback of this method is its assumption of consistently reliable communication and a secure data channel.

In the third method, agents independently process the data and learn locally. After the agents have completed, they share, refine, and integrate their results. The level of independence in local learning is a design decision that factors into the communication capability of the agents. The third method makes better use of agent technology and is more suitable when the system designers are concerned with network instability and security breaches. It also allows the use of conventional algorithms in the local learning stage. However, problems may arise during the integration phase when agents try to merge different types of results from different local-learning algorithms. Davies and Edwards in particular proposed a high-level model of the third method using multiple distributed agents:

One or more agents per network node are responsible for examining and analyzing a local data source. In addition, an agent may query a knowledge source for existing knowledge (such as rules and predicates). The agents communicate with each other during the discovery process. This allows agents to integrate the new knowledge they produce into a globally coherent theory. A user communicates with the agents via a user-interface. In addition, a supervisory agent responsible for coordinating the discovery agents may exist. … The interface allows the user to assign agents to data sources, and to allocate high-level discovery goals. It allows the user to critique new knowledge discovered by the agents, and to direct the agents to new discovery goals, including ones that might make use of the new knowledge. [Davies and Edwards 1995B]


ABKD systems use software agents for encapsulating the learning functionality of data-mining techniques, as well as coordinating distributed agents. There is significant interdependence between the integration of gathered information and the coordination mechanism in an ABKD system: if integration is concurrent with the gathering process, the coordination of the agents is critical for accurate knowledge discovery; if integration occurs after agents independently gather information, less coordination effort is required.

Two common techniques for merging or integrating gathered information are theory revision and knowledge integration. Both of these techniques involve local learning by agents but differ in the way they discover knowledge. Theory revision adopts incremental learning, with which an agent passes the theory it develops to another agent for further refinement with respect to the latter's data sources. In the case of simple knowledge integration, theories are tested against all training examples and the best theory with respect to a test set is selected. ABKD systems can also implement variations of these two techniques. For example, agents can send their theory to every agent, each of which then modifies the theory to fit its own local data. The final theory is chosen from the resulting theories based on a test set [Davies and Edwards 1995B].
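As a concrete illustration of simple knowledge integration, the Java sketch below (our own, with hypothetical Theory and selectBest names) scores each locally learned theory against a shared test set and keeps the most accurate one:

    import java.util.List;

    public class KnowledgeIntegrator {
        // A locally learned theory that can classify a test example.
        public interface Theory {
            boolean predict(double[] example);
        }

        // Simple knowledge integration: test every local theory against a
        // shared test set and select the one with the highest accuracy.
        public static Theory selectBest(List<Theory> theories,
                                        List<double[]> testSet,
                                        List<Boolean> labels) {
            Theory best = null;
            double bestAccuracy = -1.0;
            for (Theory t : theories) {
                int correct = 0;
                for (int i = 0; i < testSet.size(); i++) {
                    if (t.predict(testSet.get(i)) == labels.get(i)) {
                        correct++;
                    }
                }
                double accuracy = (double) correct / testSet.size();
                if (accuracy > bestAccuracy) {
                    bestAccuracy = accuracy;
                    best = t;
                }
            }
            return best;
        }
    }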

2.3 What do Agents Contribute to Data Mining?

With the availability of a wide spectrum of agent systems, ABKD contributes to data mining in a number of ways. First of all, adopting ABKD provides parallelism, which improves the speed, the efficiency, and the reliability of data mining. The distributed nature of agent systems allows the parallel execution of the data-mining process regardless of the number of distant data sources involved. This means that non-parallel data-mining algorithms can still be applied on local data (relative to the agent) because information about other data sources is not necessary for local operations. It is the responsibility of agents to integrate the information from numerous local sources in collaboration with other agents.

Second, agent concepts assist developers in designing distributed data-mining systems. The encapsulation of variables and methods in the object-oriented paradigm leads to the idea of encapsulating data-mining techniques, and thus developers can reuse agent objects that contain existing techniques. After defining the agent objects, the developers can design how the agent objects interact with one another to generate the correct results.

Third, agent concepts provide users of a data-mining system the capability to retrieve the discovered knowledge at different stages of progression. For instance, a user may want to view the information gathered by a particular agent before integration takes place. The sophistication of details retrieved at each stage depends on the implementation of individual agent-based systems.

Another advantage of adopting ABKD is the ability of agents to gather or search for information beyond a single data repository. As an example, we can view the World Wide Web as one large database of web pages with no particular order or organization. An agent can randomly sample from the database (World Wide Web) or it can selectively filter certain items (web pages). The agent can then process the retrieved information or relay the items to other agents for further processing. The rich interactions and coordination among agents distinguish ABKD from conventional techniques.

2.4 What are the limitations of ABKD?

Despite all its contributions, ABKD is not a panacea for problems inherent with a particular data-mining technique, such as noise, missing data, or lack of scalability. Moreover, ABKD systems in many cases are more difficult to design and implement than conventional data-mining systems. Hence, ABKD systems are better suited for mining enormous amounts of distributed data, which usually requires a complicated conventional data-mining system.

2.5 How to evaluate ABKD?

Several implementations of agent-based knowledge discovery exist (such as SAIRE and JAM) and more are in development (like InfoSleuth and BODHI). Thus, it is important to be able to evaluate and compare various agent architectures and distributed learning techniques. This paper suggests some common metrics for most, if not all, ABKD systems:

1) What type of information or data do agents communicate with one another? Do they share summarized information or raw data that represents the data source they mine?

2) How often do agents communicate with one another? Does their communication require high bandwidth?

3) Do agents communicate during or after the learning process?

4) Are both the architecture and the implementation easily scalable? Are there limitations on the application?

5) Can the system reuse existing machine learning algorithms without extensive modification?

6) What is the integration technique? Is it efficient, scalable, and practical?

7) What is the coordination technique? Is it efficient, scalable, and practical?

8) What are the results of experiments, if any?

These metrics provide some clues about the advantages as well as problems involved in ABKD. With these metrics, the next section will evaluate some of the present work in ABKD. Following that, the paper will present the desired characteristics of an ideal ABKD architecture.




3. Existing Agent Architectures for Data Mining

3.1 CAS

The developers of Cooperative Agent Society (CAS) identified the generation of concise, high-quality information in response to a user's needs as the core problem of information gathering (IG). The constant growth of the number of available information sources compounds the problem. The authors presented a sophisticated view of IG as the processes of both acquiring and retrieving information, instead of just information retrieval. They based their arguments on the observation that "no single source of information may contain the complete response to a query and hence may necessitate piecing together mutually related partial responses from disparate and heterogeneous sources." Hence, they proposed that the paradigm for supporting flexible and reliable IG applications is a distributed cooperative task, in which agents act as the intermediaries between a user and the system.

The team suggested an agent-based approach for several reasons. Their main motivation for using intelligent agents was that "the components with which to interact [when gathering information] are not known a priori". Other motivations included the maintenance of data sources by different providers, the difference in creation times of data sources, and the use of different problem-solving paradigms. Because agents can negotiate and cooperate with one another, the team believed that agents are important tools for interacting with heterogeneous data sources.

The team used the Internet as their test source because it provides the kind of environment they were interested in and where the test results would be generally applicable. They first examined both non-agent-based and partially-agent-based approaches for IG, so that they could determine how an agent-based approach should work and what issues it should address. The non-agent-based systems that the authors looked into were mostly navigational systems such as the World Wide Web and gopher. The authors concluded the main problem with non-agent-based systems was that "although [non-agent-based systems] allow [the] user to search through a large number of information sources, they provide very limited capabilities for locating, combining, and processing information; the user is still responsible for finding the information."

The authors classified the partially agent-based systems they examined into two categories. In the first category, the systems use agents to help users in browsing, mainly as tools that interactively advise users on which link to pick. This approach easily falls prey to poor designs of agents that constantly make "annoying suggestions." Systems in the second category use agents to help users in document search on the Internet, with tools like client-based search tools and indexing agents (or search engines). Nevertheless, the team found it difficult to scale systems in this category as the size of the document pool grows, mainly as a result of the stress such systems place on network resources.

Using information from these prototypes, the authors proposed a completely agent-based IG tool called CAS. The main design concepts of CAS are:

1) Search at remote sites with multiple agents - domain-expert agents determine which sites to search and how to optimize the search

2) Cooperation of agents - an agent would consult other agents when facing an unfamiliar situation

3) Abstraction of low-level details from users

Based on these concepts, the team developed three types of agents in the CAS system: 1) User Agents, or UA (one per user), 2) Machine Agents, or MA (one per data source), and 3) Managers, or MAN (each uses its domain knowledge to direct search to proper data sources).

The CAS system adopts the following mechanism of agent interaction for IG. Initially, the UA learns about the preferences of its user either directly from the user or through monitoring. The UA also provides an interface for its user to submit queries. With both the query and the profile of its user, the UA can then select a proper MAN for answering the request. This selection process requires the UA to have meta-knowledge about each MAN. The UA can also ask other UAs for advice on picking a MAN. After that, the selected MAN may request further domain-specific information from the user via the UA to process the query. Once all the proper information is gathered, the MAN formulates a plan and contacts the corresponding MAs. Similar to the selection of a MAN by the UA, the MAN uses meta-knowledge about each MA together with advice from other MANs to choose the proper MAs for service. Upon receiving their directions from the MAN, the MAs will try to retrieve the appropriate data from the system. Again, MAs may consult other MAs if they do not have enough information about the query. The key to this approach is the cooperation between agents. Each level of the search requires a high degree of interaction among peer agents for advice and direction.
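To make the selection steps concrete, the following Java sketch (ours, not CAS source code; the coverage-score scheme and all names are assumptions) shows how a UA might combine local meta-knowledge with peer advice when picking a MAN:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of how a User Agent (UA) might pick a Manager (MAN).
    public class UserAgent {
        // Meta-knowledge: how well each MAN covers a given topic (0..1).
        private final Map<String, Map<String, Double>> manCoverage = new HashMap<>();

        public void recordCoverage(String man, String topic, double score) {
            manCoverage.computeIfAbsent(man, k -> new HashMap<>()).put(topic, score);
        }

        // Pick the MAN whose coverage of the query topic is highest,
        // falling back to peer advice when local meta-knowledge is missing.
        public String selectManager(String topic, List<UserAgent> peers) {
            String best = null;
            double bestScore = 0.0;
            for (Map.Entry<String, Map<String, Double>> e : manCoverage.entrySet()) {
                double score = e.getValue().getOrDefault(topic, 0.0);
                if (score > bestScore) {
                    bestScore = score;
                    best = e.getKey();
                }
            }
            if (best == null) {                  // no local knowledge: ask peer UAs
                for (UserAgent peer : peers) {
                    String advice = peer.selectManager(topic, List.of());
                    if (advice != null) return advice;
                }
            }
            return best;
        }
    }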

The team suggested that CAS solves many of the problems found in other distributed data-mining systems. While CAS does not require users to know exactly where to find data, CAS guides the user by asking appropriate questions about domain-specific topics. In addition, CAS simplifies the maintenance of data among sources by placing an MA at each information source. The topology of CAS allows parallel execution and improves the security of the system.

The authors of this paper presented a theoretical example to show how CAS can be used. In this example, a user tries to plan a trip and wants to perform tasks such as booking a flight, renting a car, and finding interesting routes for sightseeing. CAS will handle this request by first obtaining and clarifying the user's request through the UA. Next, the UA will dispatch the request to the proper MAN based on meta-knowledge about the MAN's domain. This MAN will then dispatch different parts of the request to different MAs best suited for each subtask. After resolving discrepancies between returned values, the MAN will return the final results to the user via the UA.

The implementation of CAS has two phases. The first phase involves the development of cooperating UAs that learn from users, and the design of MANs that plan request fulfillment and develop trust relationships with other agents. The second phase involves the incorporation of more intelligence into the agents so that they can make better plans. As of the writing of this paper, the authors are investigating real-time planning and learning algorithms for this purpose.

The team began their implementation in 1996 using libwww and wrote their code in C. They used Netscape as the user-interface and implemented each agent as a separate process. There was one UA and several MANs per user. The team used standard web search engines like Lycos, Infoseek, and Crawler, for the data sources and de facto MAs. In their prototype, the UA maintains a log of exchanges as well as a trust table for other agents. After each user query, the UA gets feedback from the user on the usefulness of the information and will recalculate its trust of each agent as a result. As the authors adopted a long-range approach for implementation, apparently they are still working on CAS. Unfortunately, as all data on CAS from Tohoku University are in Japanese, information regarding the current status of CAS is unavailable for this paper. The information summarized in this section can be found in [Okada, Lee, and Shiratori 1996]. Further information on CAS (in Japanese only) is available from [CAS web].
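The paper does not give the UA's trust-update rule, so the Java sketch below shows only one plausible reading, an exponential moving average over per-query user feedback; the formula and names are our assumptions:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical trust table for a CAS User Agent. The update rule
    // (exponential moving average) is our assumption; [Okada, Lee, and
    // Shiratori 1996] does not specify the actual formula.
    public class TrustTable {
        private final Map<String, Double> trust = new HashMap<>();
        private final double learningRate;

        public TrustTable(double learningRate) {
            this.learningRate = learningRate;
        }

        // feedback in [0, 1]: the user's rating of the agent's usefulness
        // for the last query.
        public void update(String agentId, double feedback) {
            double old = trust.getOrDefault(agentId, 0.5); // neutral prior
            trust.put(agentId, (1 - learningRate) * old + learningRate * feedback);
        }

        public double trustOf(String agentId) {
            return trust.getOrDefault(agentId, 0.5);
        }
    }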

3.2 PADMA

PADMA (Parallel Data Mining Agents) is an agent-based system designed to address issues in the data-mining field like scalability of algorithms, as well as the distributed nature of data and computation. The team that developed PADMA suggests that "the very distributed nature of the data storage and computing environments is likely to play an important role in the design of the next generation of data mining systems."

With this view of the steady growth in research on agent-based information processing architectures and parallel computing, PADMA uses specialized agents for each specific domain, so that PADMA can evolve to be a "flexible system that will exploit data mining agents in parallel, for the particular application at hand."

PADMA consists of three main components: 1) data-mining agents, 2) a facilitator for coordinating agents, and 3) a user interface. The third component is not of interest to this paper.

Specifically, data-mining agents directly access the data to extract high-level useful information, and thus each agent needs to specialize in the particular domain of the data it deals with. Each agent has its own disk subsystem and performs I/O operations on data independent of other agents: this is key to the parallel execution in PADMA. In this way, agents can employ local I/O optimization techniques to increase their speed and improve their accuracy. After extracting information from the data, agents share their mined information through the facilitator module. Other than coordinating agents, the facilitator presents the mining results to the user interface and routes feedback from the user to the agents.

PADMA addresses the scalability issue by reducing the inter-agent and inter-process communication during the mining process. In the initial stage of processing a user request, each agent runs independently and queries the data in its own data set. This independence in the initial phase allows a speedup that is linear with the number of agents involved. Once each agent finishes its local extraction operations, the facilitator merges the information from the agents into a final result.

Similarly, PADMA analyzes data in a parallel fashion. The facilitator instructs the data-mining agents to run a clustering algorithm on their respective local data sources. After analyzing its local sets of data, each agent returns a "concept graph" to the facilitator without interacting with other agents. The concept graph is a null object if no data relevant to the user query exists at a particular data source. The facilitator then combines the concept graphs from the agents and returns the clustering result to the user interface. Note that the mechanisms for detecting and hierarchically merging clusters are largely independent of the way PADMA functions. The system administrator thus needs to provide the clustering mechanisms for each domain to which PADMA is applied.
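The run-locally-then-merge pattern can be sketched as follows in Java (our own illustration; the "concept graph" is reduced to a list of cluster labels, which is far simpler than PADMA's actual structures):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.*;

    // Sketch of PADMA's run-in-parallel, merge-at-the-facilitator pattern.
    public class Facilitator {
        public interface MiningAgent {
            List<String> cluster(String query);  // a (possibly empty) concept graph
        }

        public static List<String> mine(List<MiningAgent> agents, String query)
                throws InterruptedException, ExecutionException {
            ExecutorService pool =
                Executors.newFixedThreadPool(Math.max(1, agents.size()));
            List<Future<List<String>>> futures = new ArrayList<>();
            for (MiningAgent agent : agents) {
                // Each agent queries and clusters its own data set independently.
                futures.add(pool.submit(() -> agent.cluster(query)));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());          // facilitator combines local results
            }
            pool.shutdown();
            return merged;
        }
    }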

The team tested PADMA for clustering related texts in a corpus. The test involved designing the agents and the facilitator to identify text relationships based on n-grams, so as to alleviate the problems of typographical errors and misspellings in the texts. Their test showed that PADMA could deliver satisfactory clustering results in an acceptable time frame.

The PADMA project is still under active research. The current implementation performs querying and clustering on bodies of texts. The team ran experiments and tests against the TIPSTER text corpus of size 36 MB, and showed PADMA had linear speedups for clustering. However, the current implementation did not have a reasonable speedup in query operations. The team is now investigating the bottleneck that prevents this speedup. The next step will be tests with a larger corpus (100 MB). The team is also trying to develop a combination of supervised and unsupervised clustering algorithms that can be used in PADMA. For more information and detail see [Kargupta and Hamzaoglu and Stafford 1999]. Further information can also be found at [Los Alamos National Laboratory web].


3.3 SAIRE

SAIRE (Scalable Agent-based Information Retrieval Engine) is an agent framework for solving the problem of information overload. The authors remarked that because of this problem, the information delivered to users in a data search is "often unorganized and overwhelming." SAIRE attempts to alleviate the problem with a combination of software agents, concept-based search, and natural language (NL) processing. The system provides facilities for tailoring a search to the specific need of a user. For instance, a user may use a technical word for its very specific meaning instead of its more common meaning. In this case, SAIRE will make sure the search is based on the meaning desired by that user.

SAIRE emphasizes domain-specific queries and user interaction issues more than most distributed knowledge integration or data-mining systems do. Since the team tries to factor users' search objectives and prior activities into the searching process, SAIRE aims to "[provide] an opportunity for non-science users to answer questions and perform data analysis using quality science data". Meeting this goal involves incorporating vast amounts of domain expertise into the agents that interact with users, as well as the agents that extract information from the data sources.

Users interact with SAIRE through a User Interface Agent (UIA). The UIA accepts user inputs and passes the inputs to the Natural Language Parser Agent (NLP). The NLP extracts important phrases from the user input, interprets the inputs, and then generates a request to the SAIRE Coordinator Agent (SCA).

The NLP consists of four agents: 1) a dynamic dictionary, 2) a grammar-checking module, 3) a pre-processor, and 4) a chart parser. Both the dictionary and the grammar-checking module are specific to the domain in which the NLP is working. In addition, the dictionary is split into a main dictionary with words and semantic meanings pertinent to a domain, and a user dictionary that contains words with ambiguous or special meanings. SAIRE interacts with the user to construct the user dictionary and update it with each clarification of a word's preferred domain meaning.



Figure 1. The architecture of SAIRE.

The SCA first forwards the request from the NLP to a User Modeling Agent (UMA). The UMA monitors the usage patterns of individual users and user groups so that SAIRE can adapt to the requests of frequent users and user groups. The UMA, together with the Concept Search Agent (CSA), provides user-specific interpretations of the request to the SCA. After that, the SCA attempts to resolve any remaining ambiguities with the UMA and the user-specific dictionary. If ambiguities remain, the UMA requests clarification from the user, and this clarification will update the user dictionary.

Once the SCA fully understands a request, it sends the request to the proper data source managers. When the corresponding data source agents return information, the SCA passes the results to a Results Agent (RA). The RA notifies the UIA of the availability of the results and provides tools for presenting this data in different media and various formats.

Instead of having each agent maintain local information by direct interaction with other agents, the SCA serves as a centralized coordinator for agents. Since the SCA is aware of the capabilities of every data source agent, it can coordinate the agents in a very sophisticated way. The SCA can also store this information safely in a repository, and possibly enhance the fault tolerance of the system. The SCA keeps track of the locations and skill bases of agent managers (AM) in the system, and provides this information for the use of all data source agents. An agent manager (AM) controls the command-driven, domain-specific data source agents in a particular domain. Furthermore, by monitoring the request history for each agent, the SCA can control the resource usage of agents through migrating agents from node to node or spawning new agents when necessary. Consequently, SAIRE overloads no single node in the network and makes the most of the available bandwidth. This multi-agent coordinator architecture of SAIRE is best suited for applications with well-known data sources but no effective means of finding appropriate agents in the agent pool.
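A minimal Java sketch of this kind of centralized, skill-aware coordination appears below; the registry structure and the least-loaded routing rule are our assumptions, not details of the SAIRE implementation:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of an SCA-style registry of agent managers (AMs),
    // keyed by skill.
    public class CoordinatorRegistry {
        private final Map<String, List<String>> skillToManagers = new HashMap<>();
        private final Map<String, Integer> requestCount = new HashMap<>();

        public void registerManager(String managerId, List<String> skills) {
            for (String skill : skills) {
                skillToManagers.computeIfAbsent(skill, k -> new ArrayList<>())
                               .add(managerId);
            }
        }

        // Route a request to the least-loaded AM with the required skill,
        // mirroring the SCA's load-aware coordination.
        public String route(String skill) {
            List<String> candidates = skillToManagers.getOrDefault(skill, List.of());
            String best = null;
            int fewest = Integer.MAX_VALUE;
            for (String m : candidates) {
                int load = requestCount.getOrDefault(m, 0);
                if (load < fewest) {
                    fewest = load;
                    best = m;
                }
            }
            if (best != null) requestCount.merge(best, 1, Integer::sum);
            return best;
        }
    }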

The authors evaluated SAIRE with several experiments, and the results were quite promising. In a sample of representative requests, the number of documents retrieved ranged from 8 to 536 per query. The precision, or the percentage of documents retrieved that are relevant to a user query, ranged from 75% to 100%. With these results, the authors claimed that SAIRE has the potential to retrieve only those documents that are relevant to a user's objectives and interests, and therefore users need not sort through a vast pool of irrelevant documents.

The SAIRE project appears to have been suspended in 1997. As of February 1997, SAIRE could understand 11,000 words and 7000 phrases as well as clarify ambiguous words through user-agent dialogue. SAIRE also could take user context and previous history into account when understanding a query. The last implementation of SAIRE involved 8 agent groups of 16 agents apiece, and each agent could collaborate with others to fulfill user requests. The implementation also provided visual displays of agent activity along with run-time explanations.

This section summarized work from Lockheed Martin Space Mission Systems & Services presented in [Das and Kocur 1997]. Further information on the SAIRE project can also be found at [SAIRE web].

3.4 InfoSleuth

InfoSleuth is an agent-based system for information retrieval. The team at MCC developed the system for the purpose of extracting and integrating semantic information from diverse sources, as well as providing temporal monitoring of the information network and identifying any patterns that may emerge [Unruh, Martin, and Perry 1998]. They finished the InfoSleuth project by June 30, 1997, and the InfoSleuth project is now in phase two, called InfoSleuth II. The work described in this paper has come under the auspices of both projects. However, the second project focuses on studying how to support multimedia information objects, and on promoting widespread deployment of data-mining technology in business organizations.

In order to deal with the active joining and leaving of data sources in the InfoSleuth system while avoiding the need for central coordination, the team developed its own multi-brokering peer-to-peer architecture to coordinate agent actions [Nodine, Bohrer, and Ngu 1998]. The brokering system matches specific requests for services with the agents that can provide the services. This matching process is based on both the syntactic characteristics of the request, as well as the semantic nature of the requested service.

Each data-mining agent in the InfoSleuth system subscribes to agents called brokers. Each broker in turn advertises the capabilities of the agents that subscribe to it, as well as what kind of broker advertisements it will take. The brokering system then groups brokers that provide similar agent services into a consortium, but there is enough overlap among different consortia to guarantee interconnectivity among brokers. Brokers belonging to a consortium maintain up-to-date information about other brokers in the consortium as well as general information about the presence of other consortia.

When a broker wants to join the system, it needs to first discover which consortia its services fit within. Then the new broker will advertise its services and openness for advertisements. Only those brokers whose openness includes the services will discover the new broker, and they can choose whether to accept the advertisement after assessing the capabilities of the new broker. On the other hand, the new broker can query the brokers it advertises to for a list of brokers, and if it is interested in any brokers in the list, it can add their advertisements to its own list.

As a data source joins the system, each of its data-mining agents subscribes to one or two brokers. After being in the brokering system for a while, each agent can change its preferred brokers. One way is for the agent to query the related consortia for brokers; if there is a match, it then adds the broker to its preferred list. If the agent figures out that one of its preferred brokers is always forwarding service requests from/to another broker, it may simply replace that preferred broker with the intermediate broker.
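The subscribe-and-match core of this brokering scheme can be sketched in a few lines of Java; the sketch below is our own simplification that matches on plain keywords, whereas InfoSleuth also matches on the semantics of the request:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical sketch of InfoSleuth-style brokering: agents advertise
    // capabilities to a broker, and the broker matches incoming service
    // requests to subscribed agents.
    public class Broker {
        private final Map<String, Set<String>> capabilityToAgents = new HashMap<>();

        // An agent subscribes by advertising the services it can provide.
        public void subscribe(String agentId, Set<String> capabilities) {
            for (String capability : capabilities) {
                capabilityToAgents.computeIfAbsent(capability, k -> new HashSet<>())
                                  .add(agentId);
            }
        }

        // Match a service request with the agents that advertised it.
        public Set<String> match(String requestedService) {
            return capabilityToAgents.getOrDefault(requestedService, Set.of());
        }
    }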

InfoSleuth uses multiple layers of agents for the task of information gathering and analysis. At each of the data sources, a Resource Agent extracts semantic concepts from the source. Upon receiving user requests, a Multi-resource Query Agent determines whether the request involves more than one Resource Agent, and if so, it will integrate the annotated data from multiple sources. At the same time, Data-mining Agents and Sentinel Agents perform the tasks of intelligent system monitoring and correlation of high-level patterns emerging from the data sources. Data-mining agents provide event notifications that encode statistical analyses and summaries of the retrieved data. Sentinel agents support the data-mining activities by organizing inputs to data-mining agents, and monitoring for "higher-level" event patterns based on data-mining agents' output events. Through all these layers of agents, InfoSleuth supports derived requests such as deviation analysis, filtered deviations, and correlated deviations.



Figure 2. The architecture of InfoSleuth.

Even though the team has evaluated InfoSleuth with several experiments, they did not publish any data regarding its performance.

With the elaborate brokering system, InfoSleuth does not require central coordination for collaborative agent action. Besides, the peer-to-peer feature of the brokering system provides an efficient way for a data-mining agent to locate another agent for use. The brokering system also provides mechanisms with which agents can rate the service provided by brokers and switch brokers accordingly. This allows the system to dynamically adapt itself to both network instability and major categorical shifts of user requests. Nonetheless, the information necessary for brokers and agents to adjust their links may propagate very slowly across the network. In that case, InfoSleuth may have sub-optimal performance for prolonged periods of time.

Organizations such as the National Institute of Standards and Technology and companies like Texas Instruments and Eastman Chemical Company have adopted InfoSleuth as the infrastructure for their data-mining operations [MCC web A]. In particular, the EDEN (Environmental Data Exchange Network) project recently used InfoSleuth to support integrated access via web browsers to environmental information sources provided by agencies in different countries [MCC web B].

3.5 JAM

JAM (Java Agents for Meta-Learning over Distributed Databases) attempts to provide a scalable solution for learning patterns and generating a descriptive representation from a large amount of data in distributed databases [Stolfo, Prodromidis, Tselepis, Lee, Fan, and Chan 1997]. The authors identified the need for scaling algorithms in data mining. They claimed that even though many well-developed data-mining algorithms exist, most of these algorithms assume that the total set of data can fit into memory, and this assumption does not hold in many data mining contexts. The team thus developed JAM as an agent-based framework for handling this scaling problem.

Another motivation for their agent-based data-mining framework is to handle inherently distributed data. The authors claimed that data can be inherently distributed because of its storage on physically distributed mobile platforms like ships or cellular phones. Other reasons for the inherently distributed nature of data include, but are not limited to, secure and fault-tolerant distribution of data and services, proprietary issues (different parts of data belonging to different entities), or statutory constraints required by laws.



Figure 3. The architecture of a JAM network with 3 Datasites.

The JAM system is a collection of distributed learning and classification programs linked by a network of Datasites. Each JAM Datasite consists of a local database, one or more base-learning agents, one or more meta-learning agents, a local user configuration file, graphical user interfaces, and animation facilities. A learning agent is a machine-learning program for computing the classifiers at distributed sites. Base-learning agents at each Datasite first compute base classifiers from a collection of independent and inherently distributed databases in a parallel fashion. Meta-learning agents are learning processes that integrate several base classifiers, which may be generated by different Datasites. In addition, JAM has a central and independent module, called the Configuration File Manager (CFM), which keeps the up-to-date state of the distributed system. The CFM stores a list of participating Datasites and logs events for future reference and evaluation.

At each Datasite, local learning agents operate on the local database to compute the base classifier. Each Datasite may import classifiers from peer Datasites and combine these with its own local classifier using the local meta-learning agent. JAM solves the scaling problem of data mining by computing a meta-classifier that integrates all the base-classifier and meta-classifier modules once they are computed. The system can then use the resultant meta-classifier module to classify other datasets of interest. Through this ensemble approach, JAM boosts the overall predictive accuracy.
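The meta-classifier idea can be illustrated with a small Java sketch. Note that this sketch combines base predictions by majority vote for simplicity; JAM supports a range of meta-learning strategies, and its best reported result used a Bayesian meta-classifier:

    import java.util.List;

    // Sketch of JAM-style meta-learning: base classifiers are computed at
    // separate Datasites; here the meta-classifier is a simple majority
    // vote over their predictions.
    public class MetaClassifier {
        public interface BaseClassifier {
            boolean classify(double[] example);  // e.g., "fraudulent" or not
        }

        private final List<BaseClassifier> baseClassifiers;

        public MetaClassifier(List<BaseClassifier> baseClassifiers) {
            this.baseClassifiers = baseClassifiers;
        }

        public boolean classify(double[] example) {
            int votesFor = 0;
            for (BaseClassifier c : baseClassifiers) {
                if (c.classify(example)) votesFor++;
            }
            return votesFor * 2 > baseClassifiers.size();
        }
    }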

The CFM assumes a passive role in the configuration maintenance of the system. It maintains a list of active member Datasites for coordination of meta-learning activities. Upon receiving a JOIN request from a new Datasite, the CFM verifies the validity of the request as well as the identity of the site. Similarly, a DEPARTURE request invokes the CFM to verify the request and remove the Datasite from the list of active members. The CFM logs the events between Datasites, stores the links among Datasites, and keeps the status of the system.

JAM implements both the CFM and Datasites as multi-threaded Java programs. Meta-learning agents are implemented as Java applets for their need to migrate to other sites.

The team initially designed JAM for the purpose of fraud and intrusion detection in financial information systems. They have conducted an experiment using the system for detecting fraudulent credit card transactions, which involved processing inherently distributed data from various financial institutions. They obtained the best performance by using a Bayesian network as the meta-classifier: JAM was able to classify 80% of the true positives and had a false alarm rate of 13%.

Agents in JAM communicate with one another to exchange the classifiers they have developed. The system does not require the dispersion of data across the different sites throughout the execution. This allows the participants to share only information without violating security or proprietary protection of the data.

Since JAM does not specify implementation, different Datasites may choose different machine-learning algorithm implementations as the learning agents, and some of these algorithms may not scale well for large data sets. Thus, the ability to handle large datasets may vary among Datasites.

Moreover, there can be a limitation on how many Datasites can join a JAM system. Even though JAM has no central coordinator, the CFM constantly monitors the global state of the system and contends for more network bandwidth when more Datasites join the system. The CFM can be both a single point of failure and a bottleneck to reasonable system performance.

The JAM project ended in December 1998. The team posted the evaluation of JAM's performance in intrusion detection on their website [JAM web A]. For software download and specification, refer to [JAM web B].

3.6 DKN

DKN (Distributed Knowledge Network) is a research project for large-scale automated data extraction and knowledge acquisition and discovery from heterogeneous, distributed data sources [Yang, Honavar, Miller, and Wong 1998]. As part of this project, the team implemented a toolkit of machine learning algorithms, called KADLab, which uses customizable agents for document classification and retrieval from distributed data sources.

Instead of building an agent infrastructure like most projects, the DKN team chose to use the commercially available Voyager platform from ObjectSpace. Voyager uses the Java language object model and allows regular message syntax for constructing and deploying remote objects. Through the Object Request Broker, Voyager provides services to remote objects and autonomous agents. Objects and other agents can send messages to a moving agent, and an agent can continue to execute as it moves. The platform also provides services for persistence, group communication, and basic directory services.

The team has experimented with their approach for retrieving paper abstracts and news articles on a point-to-point basis [Yang, Pai, Honavar, Miller 1998]. They first trained the classifiers with user preferences, and then incorporated the classifiers into mobile agents on the Voyager platform. When a user queries a document using their system, a mobile agent (Agent 1) is generated. Agent 1 moves to a remote site to retrieve relevant documents. It sends the documents to the local site and then dies. Next, the user gives feedback as to whether the documents are interesting or not. These preferences train the classifiers and generate another agent (Agent 2). Agent 2 moves to the remote site and runs the classifier to retrieve relevant documents. It sends the relevant documents to the local site and dies. The team claimed that the mobile agents return only a subset of relevant documents, but they did not explain the mechanism through which they incorporate classifiers into agents.
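The two-agent cycle can be sketched in plain Java as below; this is our own illustration of the lifecycle and deliberately does not use the Voyager API:

    import java.util.List;
    import java.util.function.Predicate;
    import java.util.stream.Collectors;

    // Plain-Java sketch of the DKN retrieval cycle described above.
    public class RetrievalCycle {
        public interface RemoteSite {
            List<String> fetchAll();
        }

        // Agent 1: retrieve all candidate documents and terminate.
        public static List<String> agentOne(RemoteSite site) {
            return site.fetchAll();
        }

        // Agent 2: run the feedback-trained classifier remotely so that only
        // relevant documents travel back to the local site.
        public static List<String> agentTwo(RemoteSite site,
                                            Predicate<String> classifier) {
            return site.fetchAll().stream()
                       .filter(classifier)
                       .collect(Collectors.toList());
        }
    }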

Other than the point-to-point experiment, the team did not publish any experiments regarding the system performance with distributed data sources or under varying network environments. Nor did they publish the data source characteristics in the experiments they conducted.

An important feature of their work is the use of an off-the-shelf agent platform. By not building their own agent platform, developers can proceed to programming the agent activities and launch the agents into the network within a relatively short time frame. On the other hand, most commercially available agent platforms are for general agent usage, and thus developers that use such platforms do not enjoy the same leverage as with a platform specifically designed for data mining. In the case of DKN, the team found it difficult to keep track of agents once the agents were launched into the network. This is because Voyager requires a proprietary Java message format for communications among agents. Therefore, instead of updating agents at remote sites with the new classifier information, their system has to regenerate and dispatch new agents from a central location every time a user provides feedback to the learning process. Their system may not scale well to handle distributed data sources due to the considerable overhead in agent generation and garbage collection.

3.7 BODHI

BODHI is an implementation of Kargupta's Collective Data-Mining (CDM) framework for distributed knowledge discovery using agents. CDM aims at "designing and implementing efficient algorithms that generate models from heterogeneous and distributed data with guaranteed global correctness of the model" [Kargupta web].

An agent in BODHI is an interface between the learning algorithm and the communication module. At each site, there is an agent station module that maintains communication between sites and handles security issues. A facilitator module coordinates inter-agent communication and directs data and control flow among distributed sites. Most of the BODHI implementation is in Java for flexibility, but the system can still import learning algorithms implemented in native code at local machines.


BODHI uses several learning algorithms specifically developed for distributed data mining: collective decision rule learning using Fourier analysis [Kargupta, Park, Hershberger, and Johnson 1999], collective hierarchical clustering [Johnson and Kargupta 1999], collective multivariate regression using wavelets [Hershberger and Kargupta 1999], and collective principal component analysis [Rannar, MacGregor, and Wold 1998]. The first algorithm uses Fourier analysis to find the Fourier spectrum of the data at each data source, and then sends the local spectra to a centralized site for merging. BODHI can then transform the resultant spectrum into a decision tree representation. Collective hierarchical clustering requires the transmission of local dendrograms at O(n) communication cost. It then creates a global model from the local models with an O(n^2) bound in time and an O(n) bound in space. Collective multivariate regression only requires the aggregation of significant coefficients of the wavelet transformation of local data at a central site. The algorithm can then reconstruct the model by performing regression on the coefficients. Collective principal component analysis involves the creation of a global covariance matrix from loading matrices and sample score matrices after distributed data analysis.
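The communication pattern these algorithms share, shipping only the locally significant transform coefficients rather than raw data, can be sketched in Java as follows (our illustration; the transform is abstracted away, and the merge-by-summing rule is a placeholder for the algorithm-specific merge):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the collective-learning communication pattern: each site
    // keeps only the locally significant coefficients of some orthonormal
    // transform (Fourier, wavelet, ...) and ships those to a central site.
    public class CollectiveAggregator {
        public record Coefficient(int index, double value) { }

        // Local step: drop coefficients below a significance threshold.
        public static List<Coefficient> significant(List<Coefficient> local,
                                                    double threshold) {
            List<Coefficient> kept = new ArrayList<>();
            for (Coefficient c : local) {
                if (Math.abs(c.value()) >= threshold) kept.add(c);
            }
            return kept;
        }

        // Central step: merge the sparse spectra from all sites into one
        // global model (here, simply by summing coefficients at the same
        // index; the real merge rule depends on the algorithm).
        public static double[] merge(List<List<Coefficient>> perSite, int length) {
            double[] global = new double[length];
            for (List<Coefficient> site : perSite) {
                for (Coefficient c : site) {
                    global[c.index()] += c.value();
                }
            }
            return global;
        }
    }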


In general, collective learning algorithms attempt to build the most accurate model with respect to a centralized algorithm while minimizing data communication [Kargupta web].

The use of these algorithms minimizes the amount of communication between the central coordinator and local data sources, but the type of information communicated varies according to the algorithm used. The use of such algorithms lacks inter-agent communication before model integration, and this may be a drawback for some data-mining applications.


BODHI adopts an agent architecture that provides the necessary infrastructure for the execution and information transmission of collective learning algorithms. The system uses the network bandwidth efficiently. However, if too many distributed agents send their local models concurrently to a central location for merging, the large amount of incoming information may overload the network, and thus scalability is an issue. More detail concerning the implementation can be found on Kargupta's website [Kargupta web].



4. Desired Characteristics of Ideal ABKD Architecture

The survey in the previous section demonstrated the various ways of applying agent technology to distributed data mining. It covered most of the issues that may arise in distributed data mining, ranging from system architecture and network topology to user interaction. While each existing ABKD system addresses certain issues of distributed data mining better than others, one may wonder if it is possible to extract all the good features from existing ABKD systems to specify an ideal ABKD system that is flexible and robust enough to address all of the issues in distributed data mining. In particular, this section attempts to provide insights, if not final answers, to the following questions:

1) What are the characteristics of an ideal ABKD architecture?

2) What can a user or a developer expect from an ideal ABKD system?

3) Is it possible to build an ideal ABKD system?

4.1 Environment

By definition, any ABKD system needs to work in a networking environment. In some networks like the Internet, where the stability is uncontrollable, an ideal ABKD system needs to allow the dynamic joining and leaving of remote data sources at any given time. Other networks like corporate intranets can be fairly stable most of the time. In that case, the ideal ABKD system needs to be aware of the stable network and work with the remote data sources through more efficient means.

An ideal ABKD system scales well with the number of data sources and the size of data to be mined at each remote site. Also, the ideal system can handle concurrent queries from a large number of users. Based on the earlier discussion of existing ABKD systems, avoiding centralized coordination and system monitoring is the key to resolving these scalability issues.

4.2 Information Integration

In distributed data mining, ABKD systems need to integrate information from different data sources. These data sources may store data in different formats, belong to different application domains, and support the retrieval of data in different kinds of data structures. Yet there are cases where all the distributed data sites are homogeneous. An ideal ABKD system needs to adapt to the different possible natures of remote data sources to perform the necessary information integration.

One issue in ABKD research is whether to integrate information during the mining process, or after independent mining at each remote data source. This issue is closely coupled with the mechanisms with which agents are coordinated. In the former choice of integration, central coordination is not necessary and the action of an agent can be influenced by many different entities through extensive communication. In the latter choice, a coordinator at the top of the system hierarchy is necessary to manage agents for proper information integration. An ideal ABKD system should support both types of information integration and adjust itself dynamically according to the network environment and application context.

What is more, an ideal ABKD system should support both any-time algorithms and non-interruptible algorithms. The use of any-time algorithms in a data-mining operation allows the users to interrupt the system at any stage of processing and retrieve an analysis of the results up to that stage. However, any-time analysis may not make sense for certain data-mining operations or techniques.
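A minimal Java sketch of an any-time mining task is shown below (our own illustration): the worker refines its model in rounds and publishes a snapshot after each round, so the user can interrupt at any stage and still read a usable partial result.

    import java.util.concurrent.atomic.AtomicReference;

    // Minimal sketch of an any-time mining task.
    public class AnyTimeMiner implements Runnable {
        private final AtomicReference<String> bestSoFar =
                new AtomicReference<>("no result yet");
        private volatile boolean stopped = false;

        @Override
        public void run() {
            for (int round = 1; !stopped; round++) {
                String refined = refine(round);     // one pass of the algorithm
                bestSoFar.set(refined);             // publish the partial result
            }
        }

        public String currentResult() { return bestSoFar.get(); }
        public void stop()            { stopped = true; }

        private String refine(int round) {
            // Placeholder: a real miner would refine clusters, rules, etc.
            return "model after " + round + " refinement round(s)";
        }
    }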

4.3 Result Processing

Result processing is an aspect of data mining that involves presenting results to users, understanding user queries, and archiving past results efficiently. Even though result processing is usually not a concern in distributed data mining research, agent technology readily addresses issues in result processing.

As it is important for the data-mining system to thoroughly understand user requests, the ideal system should perform user profiling and take into account the history, experience, and profile of the user before processing the requests from a user. Only then can the system verify whether it is finding the information interesting to the user.

Interactive query clarification is a powerful agent tool for this purpose. It allows the data mining system to ask the user questions before the user submits a query. As a result, the system can ensure that it understands the correct meaning of a particular word in the user request. Incorporating this feature into the ideal ABKD system will extend its applicability in domain-specific operations without limiting its application to a particular domain.

Moreover, the ideal system should be able to measure and report the relevance of the returned information with respect to the user query. The user can then cross-reference this relevance rating when making a decision based on the returned information.

An ideal ABKD system should not only support advanced user interaction but also make the best use of available resources to resolve user queries. The ideal system performs data source profiling to better direct queries to the proper data sources. Data source profiling requires that the coordinating entities keep track of the meta-information for each data source. Besides returning information of higher quality, proper data source profiling also promotes efficient use of system resources, especially network bandwidth.

In addition, the ideal system may cache past data-mining results, so as to reduce the amount of processing for recurrent queries. Since caching only works well in certain domains, the ideal system should provide means for the system administrator to specify whether caching should take place, and if so, at what points in the system and at what level.

4.4 Resource Usage

In many data-mining applications, it is desirable to minimize resource usage without significant compromises in speed or accuracy. Hence, an ideal ABKD system needs to provide facilities for adjusting resource usage for different performance requirements. An example is the resource usage for inter-agent communication. An ideal ABKD system should allow agents to communicate with one another to resolve problems. Increasing the communication among agents usually results in more accurate and more meaningful data mining. Yet an increase in such communication invariably leads to an increase in network traffic, and the ideal system may need to limit the communication among agents to control both bandwidth usage and runtime bounds. Therefore, the ideal system should adjust the level of inter-agent communication dynamically according to the run-time environment.

Since querying data sources is an expensive operation, an ideal system should try to minimize the time overhead in accessing remote data sites. Much research has been conducted in this area to find ways of gathering information with a minimal amount of database access.

4.5 How realistic is it?

Researchers may wonder if they can build such an ideal ABKD system. Even though no existing ABKD system possesses all the ideal characteristics, the survey of ABKD systems in the previous section can provide some insight regarding the feasibility of building the ideal system.

Since the ideal ABKD system essentially consists of the good features of all the existing ABKD systems, the survey demonstrates that each constituent part of the ideal system can be implemented. However, the inherent difficulty of integration means that the ideal system is not a simple union of these constituent parts. As research in distributed data mining shows the difficulty in combining approaches to learning information, building an ideal system is a feasible but non-trivial task. It will require an in-depth understanding of the problems in distributed data mining and a familiarity with all possible solutions, so that alternative approaches can be taken whenever necessary.

Perhaps the most difficult aspect of building an ideal ABKD architecture is allowing users to apply the system to their domain-specific needs. To be useful across a wide array of application domains, the ideal system must support flexible tradeoffs among many conflicting desired characteristics. The challenge for developers is to incorporate these conflicting characteristics so that the features work properly on their own and, more importantly, work together when necessary.

4.6 Proposed ABKD Architecture

We propose an agent-based data-mining architecture that accommodates most of the ideal characteristics described in the previous section. The proposed architecture resembles SAIRE but emphasizes data sources and their management rather than the user interface. SAIRE requires a high level of inter-agent communication, and it does not address the problem of unstable networks: although SAIRE does not prevent data sites from joining or leaving the system, it does not directly address the issues that arise when they do. Designed mainly to allow user-friendly search through a vast amount of data, SAIRE ignored many issues that our proposed architecture deals with.

The proposed system contains three main types of agents: the UI (user interface) agent, the Manager agent, and the KB (knowledgebase) agent. The UI agent helps the user get specific information via the Manager agent. The Manager agent, which is an expert in a particular field, interoperates with a number of KB agents or other Manager agents to get the requested information. Each KB agent wraps around a data source and uses a conventional data-mining technique to extract information from the data.
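
The division of labor can be summarized in code; the interfaces below are our own sketch of the three agent types, not an existing implementation:

    import java.util.List;

    // Illustrative skeletons of the three agent types; all identifiers
    // here are invented for this sketch.
    class Request { String text; }     // a user request, already translated
    class Result  { String payload; }  // information returned to the user

    interface UIAgent {                // one per active user
        Result submit(Request userRequest);   // forwards to Manager agents
    }

    interface ManagerAgent {           // an expert in one field
        Result resolve(Request request);      // delegates to KB/Manager agents
    }

    interface KBAgent {                // wraps one data source
        List<Result> query(Request request);  // mines its own source
    }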

The architecture uses a “registration” mechanism: an agent has access only to those agents that have registered with it. This protocol makes the organization of the agents dynamic; in other words, the UI and KB agents are mutable. UI agents are created when users enter the system and destroyed when users leave. Similarly, KB agents are created and destroyed as data sources join and leave the system. In general, a UI agent has access to the multiple Manager agents that have registered with it.

A Manager agent is the least volatile agent since it has no direct link to an external object, such as a user or a database. Manager agents organize themselves into a hierarchy of domain expertise. Each Manager agent has access to a specified number of Manager and/or KB agents that have registered with it. In addition, a Manager agent can terminate itself if no KB or Manager agent is available.

We assume that every KB agent understands the meaning of its data with respect to a hierarchy, so that it can register itself with the correct branch. As a starting point, researchers can implement a simple organization with a two-level hierarchy and use the prototype to study the possibilities of multi-level hierarchies.

Based on Lam’s research on agent design, we suggest that each agent have six functionalities: sensing, modeling, organizing, planning, acting, and communicating. A KB agent senses (gathers) the data and produces a model or statistic from the data, and then it registers with Manager agents that are interested in its data. When a KB agent receives an information request from a Manager agent, the KB agent queries its data source using the proper query language and then communicates the results to the Manager agent. The KB agent collaborates with other KB agents when the Manager agent requests high-level information that requires the querying of more than one data source. KB agents also collaborate during modeling.
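
The six functionalities could be captured as a common agent interface; the signatures below are our guess at a shape, not Lam’s published design:

    // Sketch of the six suggested functionalities; the comments note how
    // a KB agent, in particular, would realize each one.
    interface AgentFunctions {
        void sense();        // gather data from the environment or source
        void model();        // build a model or statistics from sensed data
        void organize();     // register with interested Manager agents
        void plan();         // decide how to satisfy a pending request
        void act();          // carry out the plan, e.g. query the source
        void communicate();  // exchange results with collaborating agents
    }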

A Manager agent is responsible for gaining access to all data sources pertaining to a particular field, which may be modeled as a set of keywords within the agent. If a KB agent finds no Manager agent to register with, a new Manager agent will be instantiated. A Manager agent models the KB agents that have registered with it, as well as other Manager agents it may collaborate with. The UI agent sends a model of user preferences along with the user’s request to the Manager agent to ensure the retrieval of accurate information.

As an example, suppose a user makes three requests: “find the average price of a house sold for less than $150,000 within 5 miles of my current home,” “what area within my state has the highest selling price for a house in the past 10 years,” and “what is the expected selling price for my neighborhood in the next 10 years.” This example assumes that the ABKD system has established the Manager and KB agents for the realty domain. The UI agent first translates the request into terms that the agents understand. Then it gets the user’s location from the user model and sends the request, along with the user model, to a Manager agent that handles residential property (as opposed to business property). When the Manager agent receives this request, it locates a Manager agent that exclusively deals with houses (as opposed to apartments) and redirects the request to that agent. Having modeled the data content of each registered KB agent, the house Manager agent selects only those registered KB agents that have data pertaining to the request. For instance, the agent may select only KB agents within the user’s state. A query is formed and sent to each selected KB agent. The KB agent executes the query and returns the results to the Manager agent. Once the Manager agent receives the results from the KB agents, it can send the results directly to the user or perform further integration, analysis, or formatting of the results according to the user’s preferences sent by the UI agent. Manager agents can produce high-level knowledge using conventional data-mining techniques, such as clustering, decision trees, and statistical methods. This knowledge is then sent to the UI agent, which displays it according to the user’s request.
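
The selection step in this example can be sketched as keyword matching over the Manager agent’s registrations (the class and keyword names below are invented for illustration):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical house Manager agent: forwards a query only to those
    // registered KB agents whose keyword profile pertains to the request.
    public class HouseManagerAgent {
        private final Map<String, Set<String>> kbProfiles = new HashMap<>();

        public void register(String kbAgentId, Set<String> keywords) {
            kbProfiles.put(kbAgentId, keywords);
        }

        // e.g. requestKeywords = {"house", "price", "texas"}
        public List<String> selectKbAgents(Set<String> requestKeywords) {
            List<String> selected = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : kbProfiles.entrySet()) {
                if (!Collections.disjoint(e.getValue(), requestKeywords)) {
                    selected.add(e.getKey());  // this source is pertinent
                }
            }
            return selected;
        }
    }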

The implementation of the proposed architecture needs to focus on flexibility, standardization, and platform-independence. Agent modeling and interoperation can be done with KQML (Knowledge Query and Manipulation Language), which “is part of a larger effort, the ARPA Knowledge Sharing Effort, which is aimed at developing techniques and methodologies for building large-scale knowledge bases which are sharable and reusable” [Mayfield web]. Agent communication should follow the FIPA (Foundation for Intelligent Physical Agents) ACL (agent communication language) standard [FIPA web]. The main programming language should be Java, and CORBA can be used to support and interoperate with other languages possibly used by the KB agents. CORBA is a powerful tool for interfacing agents with legacy systems. Very often an agent queries data sources that are older, entrenched systems, which do not readily interoperate with modern protocols. By defining CORBA interfaces for both agents and data sources, we can simplify the task of making these data sources accessible. Moreover, CORBA simplifies interfacing with newer systems since it is an established standard. The learning time for adding a data site to the agent network can thus be substantially shortened.
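
To make the communication layer concrete, a Manager-to-KB request could be rendered in a FIPA-ACL-style string encoding; the sketch below builds the message by hand rather than through any particular toolkit, and the agent names, ontology, and content are invented:

    // Illustrative FIPA-ACL-style request built by hand; a production
    // system would use an ACL library rather than string concatenation.
    public class AclMessageExample {
        public static void main(String[] args) {
            String msg = "(request\n"
                       + "  :sender   manager-house\n"
                       + "  :receiver kb-county-records\n"
                       + "  :language SQL\n"
                       + "  :ontology realty\n"
                       + "  :content  \"SELECT AVG(price) FROM sales\")";
            System.out.println(msg);
        }
    }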

Our proposed ABKD architecture handles distributed, dynamic, heterogeneous data by having mutable KB agents that are independent of other KB agents. Our overall goal is to provide a flexible framework to support the implementation of various techniques. Depending on the domain and the type of request made, either Manager agents or UI agents can integrate data. In addition, decentralized coordination allows the system to remain functional despite breakdowns in networks or agents. Moreover, the administrator can choose whether to have KB agents mine the data beforehand or only when a query is made. A subset of the UI agents can evolve into a system that recommends further topics or items of interest based on the user model.



5. Future Work

Until recently, there was no standard for inter-agent communication across different agent systems, so existing ABKD research did not consider the possibility of systems working with one another. In addition, present ABKD systems either support only collective data-mining algorithms specially designed for distributed execution or provide a general framework for reusing conventional machine-learning algorithms. One potential research direction is to develop an ABKD system supporting both types of algorithms, so that researchers can evaluate the two on the same basis. Researchers may also try to build an ABKD system that interoperates with systems from other research teams.

Another observation is that most existing ABKD systems perform only one type of data-mining task, such as classification. Building multi-purpose ABKD systems would allow researchers to reuse the expensive agent platform implementation.

Moreover, publications in ABKD research rarely detail the implementation of the underlying agent systems. Since editors generally prefer new design ideas to implementation details, most papers describe only the architecture of the agent systems and how the agents interact. It would help if researchers shared their experience of building prototype ABKD systems, so that novices in ABKD research need not start from scratch. Better still, researchers could publish technical reports on their web sites detailing the design alternatives they considered but decided not to adopt, so that others can benefit from previous work.

Last but not least, there is no common methodology for benchmarking the various ad hoc implementations of agent-based systems, and researchers in most cases have not published performance evaluations of their projects. Building on existing research that establishes metrics for multi-agent systems [Lam and Barber 2000], researchers may develop a generic test bed that helps compare the various ABKD architectures. Such a test bed could lead to the identification and recommendation of a suitable architecture for each application domain.



6. Conclusion

This paper introduced the idea of ABKD systems, which allow the mining of distributed, heterogeneous data with relative ease. In addition, the use of agent technology enhances the parallel execution of data-mining processes. Nevertheless, ABKD is not a panacea for the problems inherent in any particular data-mining technique.

Next, the paper presented a set of metrics for evaluating ABKD systems and then evaluated present work in ABKD research, examining both systems built for a specific application domain and systems for general data-mining applications. We also examined an architecture that uses an existing commercial package instead of building its own agent infrastructure. In general, most existing ABKD systems use multiple layers of agents to handle the different levels of data-mining tasks.

The paper then described the desired characteristics of an ideal ABKD architecture in terms of its functionalities, resource requirements, and processing of results. Since some of these characteristics conflict, researchers must make tradeoffs when incorporating them into their own systems.

Finally, the paper identified potential future work in ABKD research, such as building ABKD systems that support both dedicated and conventional data-mining algorithms, developing ABKD systems for multiple categories of data-mining tasks, and implementing a test bed for evaluating different ABKD architectures with respect to different application domains.





References


Balakrishnan, K. and Honavar, V. (1998). Intelligent Diagnosis Systems. Journal of Intelligent Systems. In press.

Chan, P. and Stolfo, S. (1996). Sharing learned models among remote database partitions by local meta-learning. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 2-7.

CAS. Shiratori Lab: New Projects. http://www.shiratori.riec.tohoku.ac.jp/index-e.html.

Das, B. and Kocur, D. (1997). Experiments in Using Agent-Based Retrieval from Distributed and Heterogeneous Databases. In Knowledge and Data Engineering Exchange Workshop, 27-35.

Davies, W. H. E. and Edwards, P. (1995A). Agent-Based Knowledge Discovery. In Working Notes of the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments.

Davies, W. H. E. and Edwards, P. (1995B). Distributed Learning: An Agent-Based Approach to Data-Mining. In Proceedings of the ML95 Workshop on Agents that Learn from Other Agents.

Domingos, P. (1997). Knowledge acquisition from examples via multiple models. In International Conference on Systems, Man and Cybernetics.

FIPA. Foundation for Intelligent Physical Agents. http://www.fipa.org.

Hall, Lawrence O., Chawla, Nitesh, and Bowyer, Kevin W. (1998). Combining Decision Trees Learned in Parallel. In Distributed Data Mining Workshop at KDD-98.

Hayes, Caroline C. (1999). Agents in a Nutshell - A Very Brief Introduction. In IEEE Transactions on Knowledge and Data Engineering, Vol. 11, No. 1, Jan/Feb 1999.

Hershberger, D. and Kargupta, H. (1999). Distributed Multivariate Regression Using Wavelet-based Collective Data Mining. In Special Issue on Parallel and Distributed Data Mining of the Journal of Parallel and Distributed Computing. Kumar, V., Ranka, S., and Singh, V. (Ed.). In press. Also available as Technical Report EECS-99-02.

Honavar, V. (1994). Toward Learning Systems That Use Multiple Strategies and Representations. In Artificial Intelligence and Neural Networks: Steps Toward Principled Integration, pp. 615-644. Honavar, V. and Uhr, L. (Ed.). New York: Academic Press.

Honavar, V. (1998). Inductive Learning: Principles and Applications. In Intelligent Data Analysis in Science. Cartwright, H. (Ed.). London: Oxford University Press.

JAM (A). The JAM Project Overview. http://www.cs.columbia.edu/~sal/JAM/PROJECT/recent-results.html.

JAM (B). Software Download for the JAM Project. http://www.cs.columbia.edu/~andreas/JAM_download.html.

Johnson, E. and Kargupta, H. (1999). Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In Large-Scale Parallel KDD Systems, Lecture Notes in Computer Science, Springer-Verlag. Zaki, M. and Ho, C. (Ed.).

Kargupta, H. Distributed Knowledge Discovery from Heterogeneous Sites. http://www.eecs.wsu.edu/~hillol/DKD/ddm_research.html.

Kargupta, H., Hamzaoglu, I., and Stafford, B. (1999). Scalable, Distributed Data Mining Using An Agent Based Architecture. In Proceedings of Knowledge Discovery and Data Mining. Heckerman, D., Mannila, H., Pregibon, D., and Uthurusamy, R. (Ed.). AAAI Press, 211-214.

Kargupta, H., Park, B., Hershberger, D., and Johnson, E. (1999). Collective Data Mining: A New Perspective Toward Distributed Data Mining. Submitted for publication in Advances in Distributed Data Mining. Kargupta, H. and Chan, P. (Ed.). AAAI Press.

Los Alamos National Laboratory. Parallel Data Mining Agents. http://www-fp.mcs.anl.gov/ccst/research/reports_pre1998/algorithm_development/padma/kargupta.html.

Lam, D. N. and Barber, K. S. (2000). Tracing Dependencies of Strategy Selections in Agent Design. To be published in AAAI-2000, 17th National Conference on AI.

Mayfield, James, Labrou, Yannis, and Finin, Tim. Desiderata for Agent Communication Languages. http://www.cs.umbc.edu/kqml/papers/desiderata-acl/root.html. University of Maryland Baltimore County.

MCC (A). Who Will Use InfoSleuth and For What. http://www.mcc.com/projects/infosleuth/introduction/applications.html, last updated February 10, 1998.

MCC (B). Project Documents. http://www.mcc.com/projects/env/eden/docs/fact.html, last updated October 11, 1999.

Miller, L., Honavar, V., and Barta, T. A. (1997). Warehousing Structured and Unstructured Data for Data Mining. In Proceedings of the American Society for Information Science Annual Meeting (ASIS 97). Washington, D.C.

Nodine, M., Bohrer, W., and Ngu, A. (1998). Semantic brokering over dynamic heterogeneous data sources in InfoSleuth. MCC Technical Report. Submitted to ICDE '99.

Okada, R., Lee, E., and Shiratori, N. (1996). Agent Based Approach for Information Gathering on Highly Distributed and Heterogeneous Environment. In Proc. 1996 International Conference on Parallel and Distributed Systems.

Parekh, R. and Honavar, V. (1998). Constructive Theory Refinement in Knowledge Based Neural Networks. In Proceedings of the International Joint Conference on Neural Networks, Anchorage, Alaska.

Parekh, R., Yang, J., and Honavar, V. (1998). Constructive Neural Network Learning Algorithms for Multi-Category Pattern Classification. In IEEE Transactions on Neural Networks.

Rannar, S., MacGregor, J. F., and Wold, S. (1998). Adaptive Batch Monitoring using Hierarchical PCA. In Chemometrics & Intelligent Laboratory Systems.

SAIRE. SAIRE Homepage. http://saire.ivv.nasa.gov/.

Sian, S. (1991). Extending Learning to Multiple Agents: Issues and a Model for Multi-Agent Machine Learning (MA-ML). In Proceedings of the European Working Session on Learning. Kodratoff, Y. (Ed.). Springer-Verlag, 458-472.

Stolfo, S., Prodromidis, A., Tselepis, S., Lee, W., Fan, D., and Chan, P. (1997). JAM: Java agents for meta-learning over distributed databases. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, pages 74-81, Newport Beach, CA. AAAI Press.

Unruh, A., Martin, G., and Perry, B. (1998). Getting only what you want: Data mining and event detection using InfoSleuth agents. Technical Report MCC-INSL-113-98, MCC InfoSleuth Project.

Williams, G. (1990). Inducing and Combining Multiple Decision Trees. Ph.D. Dissertation, Australian National University, Canberra, Australia.

Yang, J. and Honavar, V. (1998). Feature Subset Selection Using a Genetic Algorithm. In Feature Extraction, Construction, and Subset Selection: A Data Mining Perspective. Motoda, H. and Liu, H. (Ed.). New York: Kluwer. A shorter version of this paper appears in IEEE Intelligent Systems (Special Issue on Feature Transformation and Subset Selection).

Yang, J. and Honavar, V. (1998). DistAl: An Inter-Pattern Distance Based Constructive Neural Network Learning Algorithm. In Intelligent Data Analysis. In press. A preliminary version of this paper appears in [IJCNN98].

Yang, J., Pai, P., Honavar, V., and Miller, L. (1998). Mobile Intelligent Agents for Document Classification and Retrieval: A Machine Learning Approach. In Proceedings of the European Symposium on Cybernetics and Systems Research. In press.

Yang, J., Honavar, V., Miller, L., and Wong, J. (1998). Intelligent Mobile Agents for Information Retrieval and Knowledge Discovery from Distributed Data and Knowledge Sources. In Proceedings of the IEEE Information Technology Conference, Syracuse, NY.