Agent-Based Knowledge Discovery:


Survey and Evaluation

A Term Paper for

EE380L Data Mining


Austin Bingham

Paul Chan

Dung Lam


Dr. Joydeep Ghosh

May 2000



With the prevalence of networking in the past decade, data has not only grown exponentially in size but also become more decentralized and disordered. In addition, databases, knowledge bases, and online repositories of information (such as dictionaries, user survey results, and server logs) around the world can now interact with one another. These intertwining networks of data sources present a challenge for knowledge discovery, as most existing techniques assume a single source of data. To make the problem worse, there is no agreed method for discovering knowledge through distributed information gathering from heterogeneous data sources. Consequently, the rate of knowledge discovery fails to keep up with that of data generation, and the percentage of knowledge relative to the amount of data declines steadily. To remedy the situation, researchers have developed Agent-based Knowledge Discovery (ABKD) as a new paradigm that combines the two fields of Distributed Artificial Intelligence and Machine Learning [Davies and Edwards 1995A]. The purpose of this paper is to examine how we can apply existing agent-based techniques to the knowledge discovery, or mining, field. From this evaluation, an ideal agent-based system is proposed along with issues that must be considered.

An agent-based data mining system is a natural choice for mining large sets of inherently distributed data. One example application of such systems is military decision-making [Yang, Honavar, Miller, and Wong 1998]. Every day, commanders and intelligence analysts need to access critical information in a timely fashion. Typical day-to-day operations involve intelligence data gathering and analysis, situation monitoring and assessment, and looking for potentially interesting patterns in the data, such as relationships between troop movements and significant political developments in a region. This information can be valuable for decision-makers to take both proactive and reactive measures designed to safeguard a nation's security concerns.

In a crisis, we need to be able to deliver accurate information to the decision-makers at the right time without overwhelming them with large volumes of irrelevant data. This involves physically distributed data sources, including satellite images, intelligence reports, and records of communication with officers at the frontier. In that situation, nobody can afford the delay of sending large volumes of data back for central processing before we can present relevant information to decision-makers.

Financial institutions and law-enforcement agencies share similar information-processing needs. In order to predict market fluctuations, brokerage houses need to analyze news and financial transactions from all over the world in real time. The continuous growth in the amount of data to process makes it impossible for centralized analysis to take place. Besides, the prevalence of electronic commerce demands a secure and trusted inter-banking network with strong verification and authentication mechanisms. This requires a widely deployed system that detects local fraudulent transaction attempts and propagates the attack information as soon as possible [Chan and Stolfo 1996]. Likewise, law-enforcement agencies need to obtain information from one another about case histories or crime patterns to coordinate nationwide or even worldwide efforts to fight crime.

This paper surveys and evaluates ABKD systems. Section 2 introduces the idea of applying agent technology in distributed data mining. It also describes some metrics that are useful for evaluating the existing ABKD architectures in Section 3. Section 4 proposes the desired characteristics of an ideal ABKD architecture. Section 5 suggests some possible future work in ABKD research, and Section 6 concludes the paper.


Agent Technology


What is an Agent?

Many areas of research employ agent technology, and thus the definition of an agent varies according to the focus of the research. For example, research in multi-agent systems (MAS) commonly characterizes agents as autonomous and able to plan and coordinate within an organization for solving a problem. In ABKD, an agent is a software entity that can 1) interoperate with its data source and/or other agents, 2) receive/gather raw data, 3) process and learn from the data source or from other sources, and 4) coordinate with other agents to produce relevant and useful information. Research in ABKD emphasizes how agents manipulate data and how agents extract information from distributed data sources. Based on this characterization, many aspects of research in ABKD, such as planning, coordination, and communication, overlap with other fields of agent research. This paper, however, limits its description of agent technology to the context of knowledge discovery.

There are two types of agent-based systems: homogeneous systems and heterogeneous systems. Agents in homogeneous systems have the same functionality and capabilities, whereas agents in heterogeneous systems have dissimilar functionalities and capabilities but can still coordinate with one another. In general, heterogeneous systems are useful for processing different kinds of databases using a variety of techniques, but it may be difficult to integrate the resultant heterogeneous information. Agent systems can also be classified by the source of control. In decentralized systems, agents negotiate among themselves to resolve coordination problems. Centralized systems are usually easier to implement but have single points of failure.

In addition, some agent systems allow agents to dynamically change their roles when necessary. Having static agent roles within a system may simplify the coordination mechanism, but the system will be less robust as a whole. Choosing the right characteristics for an ABKD system involves considering what types of data are being mined and what coordination and integration techniques are preferred.


How does ABKD work?

ABKD systems fit naturally to domains with distributed resources. There are three general methods for ABKD to learn from distributed data. The first method involves collecting data into a single repository. This method is impractical and does not take advantage of agents and distributed resources.

Sian researched the second method, which involves information exchange among agents during their learning of local data [Sian 1991]. In the ideal case, since agents are working as a single algorithm over all the data sources, few or no revisions or integration is necessary. However, this method restricts the choice of possible algorithms to those specifically designed for distributed learning. Another drawback of this method is its assumption of consistently reliable communication and secure data channels.

In the third method, agents independently process the data and learn locally. After the agents have completed, they share, refine, and integrate their results. The level of independence in local learning is a design decision that factors into the communication capability of the agents. The third method makes better use of agent technology and is more suitable when the system designers are concerned with network instability and security breaches. It also allows the use of conventional algorithms in the local learning stage. However, problems may arise during the integration phase when agents try to merge different types of results from different local learning algorithms. Davies and Edwards in particular proposed a high-level model of the third method using multiple distributed agents:

One or more agents per network node are responsible for examining and analyzing a local data source. In addition, an agent may query a knowledge source for existing knowledge (such as rules and predicates). The agents communicate with each other during the discovery process. This allows agents to integrate the new knowledge they produce into a globally coherent theory. A user communicates with the agents via a user interface. In addition, a supervisory agent responsible for coordinating the discovery agents may exist. … The interface allows the user to assign agents to data sources, and to allocate high-level discovery goals. It allows the user to critique new knowledge discovered by the agents, and to direct the agents to new discovery goals, including ones that might make use of the new knowledge. [Davies and Edwards 1995B]
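The model above can be illustrated with a small sketch. This is not the authors' code; the class names and the toy "majority-label" theory representation are hypothetical, chosen only to show local discovery agents producing theories that are merged into one globally coherent theory.

```python
# Illustrative sketch: each discovery agent learns simple majority-label
# rules from its local data source; a merge step combines the local
# theories into one globally coherent theory.
from collections import Counter

class DiscoveryAgent:
    def __init__(self, local_data):
        # local_data: list of (attribute_value, label) pairs at this node
        self.local_data = local_data

    def learn(self):
        # Produce a local "theory": for each attribute value, count labels.
        theory = {}
        for value, label in self.local_data:
            theory.setdefault(value, Counter())[label] += 1
        return theory

def integrate(theories):
    # Merge local theories by summing label counts across agents,
    # then keep the majority label per attribute value.
    merged = {}
    for theory in theories:
        for value, counts in theory.items():
            merged.setdefault(value, Counter()).update(counts)
    return {value: counts.most_common(1)[0][0] for value, counts in merged.items()}

agents = [
    DiscoveryAgent([("red", "stop"), ("green", "go")]),
    DiscoveryAgent([("red", "stop"), ("red", "go"), ("green", "go")]),
]
global_theory = integrate(a.learn() for a in agents)
# global_theory == {"red": "stop", "green": "go"}
```

Note that no raw data crosses node boundaries here: only the compact local theories are exchanged, which is the point of the third method.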

ABKD systems use software agents for encapsulating the learning functionality of data-mining techniques, as well as coordinating distributed agents. There is significant interdependence between the integration of gathered information and the coordination mechanism in an ABKD system: if integration is concurrent with the gathering process, the coordination of the agents is critical for accurate knowledge discovery; if integration occurs after agents independently gather information, less coordination effort is required.

Two common techniques for merging or integrating gathered information are theory revision and knowledge integration. Both of these techniques involve local learning by agents but differ in the way they discover knowledge. Theory revision adopts incremental learning, with which an agent passes the theory it develops to another agent for further refinement with respect to the latter's data sources. In the case of simple knowledge integration, theories are tested against all training examples and the best theory with respect to a test set is selected. ABKD systems can also implement variations of these two techniques. For example, agents can send their theory to every other agent, which then modifies the theory to fit its own local data. The final theory is chosen from the resulting theories based on a test set [Davies and Edwards 1995B].
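Simple knowledge integration can be sketched as follows. The toy "theory" (a single decision threshold) and the scoring function are illustrative assumptions, not taken from the paper; the point is only the selection step: score every locally learned theory against a shared test set and keep the winner.

```python
# Hedged sketch of simple knowledge integration: each agent's locally
# learned theory (here, just a threshold classifier) is scored against a
# shared test set, and the best-scoring theory is selected.
def accuracy(theory, test_set):
    # theory: a threshold t; predict positive (1) when x >= t
    return sum((x >= theory) == bool(y) for x, y in test_set) / len(test_set)

local_theories = [0.2, 0.5, 0.9]                    # one threshold per agent
test_set = [(0.1, 0), (0.4, 0), (0.6, 1), (0.8, 1)]  # shared (x, label) pairs

# Selection step: pick the theory with the best test-set performance.
best_theory = max(local_theories, key=lambda t: accuracy(t, test_set))
# best_theory == 0.5, which classifies the whole test set correctly
```

Theory revision would instead pass one theory along the chain of agents, refining it at each stop, rather than scoring independent candidates as done here.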


What do Agents Contribute to Data Mining?

With the availability of a wide spectrum of agent systems, ABKD contributes to data mining in a number of ways. First of all, adopting ABKD provides parallelism, which improves the speed, the efficiency, and the reliability of data mining. The distributed nature of agent systems allows the parallel execution of the data-mining process regardless of the number of distant data sources involved. This means that non-parallel data-mining algorithms can still be applied on local data (relative to the agent) because information about other data sources is not necessary for local operations. It is the responsibility of agents to integrate the information from numerous local sources in collaboration with other agents.

Second, agent concepts assist developers in designing distributed data-mining systems. The encapsulation of variables and methods in the object-oriented paradigm leads to the idea of encapsulating data-mining techniques, and thus developers can reuse agent objects that contain existing techniques. After defining the agent objects, the developers can design how the agent objects interact with one another to generate the correct results.

Third, agent concepts provide users of a data-mining system the capability to retrieve the discovered knowledge at different stages of progression. For instance, a user may want to view the information gathered by a particular agent before integration takes place. The sophistication of details retrieved at each stage depends on the implementation of individual agent-based systems.

Another advantage of adopting ABKD is the ability of agents to gather or search for information beyond a single data repository. As an example, we can view the World Wide Web as one large database of web pages with no particular order or organization. An agent can randomly sample from the database (the World Wide Web) or it can selectively filter certain items (web pages). The agent can then process the retrieved information or relay the items to other agents for further processing. The rich interactions and coordination among agents distinguish ABKD from conventional techniques.


What are the limitations of ABKD?

Despite all its contributions, ABKD is not a panacea for problems inherent in a particular data-mining technique, such as noise, missing data, or lack of scalability. Moreover, ABKD systems in many cases are more difficult to design and implement than conventional data-mining systems. Hence, ABKD systems are better suited for mining enormous amounts of distributed data, which would otherwise require a complicated conventional data-mining system.


How to evaluate ABKD?

Several implementations of agent-based knowledge discovery exist (such as SAIRE and JAM) and more are in development (like InfoSleuth and BODHI). Thus, it is important to be able to evaluate and compare various agent architectures and distributed learning techniques. This paper suggests some common metrics for most, if not all, ABKD systems:


What type of information or data do agents communicate with one another?

Do they share summarized information or raw data that represents the data source they mine?


How often do agents communicate with one another?

Does their communication require high bandwidth?


Do agents communicate during or after the learning process?


Are both the architecture and the implementation easily scalable?

Are there limitations on the application?


Can the system reuse existing machine learning algorithms without extensive modification?


What is the integration technique?

Is it efficient, scalable, and practical?


What is the coordination technique?

Is it efficient, scalable, and practical?


What are the results of experiments, if any?

These metrics provide some clues about the advantages as well as the problems involved in ABKD. With these metrics, the next section will evaluate some of the present work in ABKD. Following that, the paper will present the desired characteristics of an ideal ABKD architecture.


Existing Agent Architectures for Data Mining



The developers of Cooperative Agent Society (CAS) identified the generation of concise, high-quality information in response to a user's needs as the core problem of information gathering (IG). The constant growth of the number of available information sources compounds the problem. The authors presented a sophisticated view of IG as the processes of both acquiring and retrieving information, instead of just information retrieval. They based their arguments on the observation that "no single source of information may contain the complete response to a query and hence may necessitate piecing together mutually related partial responses from disparate and heterogeneous sources." Hence, they proposed that the paradigm for supporting flexible and reliable IG applications is a distributed cooperative task, in which agents act as the intermediaries between a user and the data sources.

The team suggested an agent-based approach for several reasons. Their main motivation for using intelligent agents was that "the components with which to interact [when gathering information] are not known a priori". Other motivations included the maintenance of data sources by different providers, the difference in creation times of data sources, and the use of different problem-solving paradigms. Because agents can negotiate and cooperate with one another, the team believed that agents are important tools for interacting with heterogeneous data sources.

The team used the Internet as their test source because it provides the kind of environment they were interested in and one where the test results would be generally applicable. They first examined both non-agent-based and partially agent-based approaches for IG, so that they could determine how an agent-based approach should work and what issues it should address. The non-agent-based systems that the authors looked into were mostly navigational systems such as the World Wide Web and gopher. The authors concluded that the main problem with non-agent-based systems was that "although [non-agent-based systems] allow [the] user to search through a large number of information sources, they provide very limited capabilities for locating, combining, and processing information; the user is still responsible for finding the information."

The authors classified the partially agent-based systems they had examined into two categories. In the first category, the systems use agents to help users in browsing, mainly as tools that interactively advise users on which link to pick. This approach easily falls prey to poor designs of agents that constantly make "annoying suggestions." Systems in the second category use agents to help users in document search on the Internet, with tools like client-based search tools and indexing agents (or search engines). Nevertheless, the team found it difficult to scale systems in this category as the size of the document pool grows, mainly as a result of the stress such systems place on network bandwidth.
Using information from these prototypes, the authors proposed a completely agent-based IG tool called CAS. The main design concepts of CAS are:

Search at remote sites with multiple agents: expert agents determine which sites to search and how to optimize the search

Cooperation of agents: an agent would consult other agents when facing an unfamiliar query

Abstraction of low-level details from users

Based on these concepts, the team developed three types of agents in the CAS system: 1) User Agents, or UA (one per user), 2) Machine Agents, or MA (one per data source), and 3) Managers, or MAN (each uses its domain knowledge to direct searches to proper data sources).

The CAS system adopts the following mechanism of agent interaction for IG. Initially, the UA learns about the preferences of its user either directly from the user or through monitoring. The UA also provides an interface for its user to submit queries. With both the query and the profile of its user, the UA can then select a proper MAN for answering the request. This selection process requires the UA to have meta-knowledge about each MAN. The UA can also ask other UAs for advice on picking a MAN. After that, the selected MAN may request further domain-specific information from the user via the UA to process the query. Once all the proper information is gathered, the MAN formulates a plan and contacts the corresponding MAs. Similar to the selection of a MAN by the UA, the MAN uses knowledge about each MA together with advice from other MANs to choose the proper MAs for service. Upon receiving their directions from the MAN, the MAs will try to retrieve the appropriate data from the system. Again, MAs may consult other MAs if they do not have enough information about the query. The key to this approach is the cooperation between agents. Each level of the search requires a high degree of interaction among peer agents for advice and direction.
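The UA-MAN-MA chain can be sketched in a few lines. The class and method names here are hypothetical stand-ins for the CAS agents, and the peer-consultation steps are omitted; the sketch only shows the dispatch path from user query to data retrieval and back.

```python
# Illustrative sketch of the CAS interaction chain: a User Agent (UA)
# selects a Manager (MAN) from its meta-knowledge, the MAN dispatches to
# Machine Agents (MAs), and results flow back up to the user.
class MA:
    def __init__(self, source):
        self.source = source              # the data this MA fronts
    def retrieve(self, query):
        return [item for item in self.source if query in item]

class MAN:
    def __init__(self, domain, mas):
        self.domain, self.mas = domain, mas
    def handle(self, query):
        # Dispatch the query to every MA in this domain and pool results.
        results = []
        for ma in self.mas:
            results.extend(ma.retrieve(query))
        return results

class UA:
    def __init__(self, managers):
        # meta-knowledge: which MAN serves which domain
        self.managers = {man.domain: man for man in managers}
    def submit(self, domain, query):
        return self.managers[domain].handle(query)

travel = MAN("travel", [MA(["flight to Austin", "hotel in Austin"]),
                        MA(["flight to Tokyo"])])
ua = UA([travel])
answers = ua.submit("travel", "flight")
# answers == ["flight to Austin", "flight to Tokyo"]
```

In the real system each selection step also involves consulting peer agents (UAs advise other UAs, MANs advise other MANs), which this sketch leaves out.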

The team suggested that CAS solves many of the problems found in other distributed data systems. While CAS does not require users to know exactly where to find data, CAS guides the user by asking appropriate questions about domain-specific topics. In addition, CAS simplifies the maintenance of data among sources by placing an MA at each information source. The topology of CAS also allows parallel execution and improves the security of the system.

The authors of this paper presented a theoretical example to show how CAS can be used. In this example, a user tries to plan a trip and wants to perform tasks such as booking a flight, renting a car, and finding interesting routes for sightseeing. CAS will handle this request by first obtaining and interpreting the user's request through the UA. Next, the UA will dispatch the request to the proper MAN based on meta-knowledge about each MAN's domain. This MAN will then dispatch different parts of the request to different MAs best suited for each subtask. After resolving discrepancies between returned values, the MAN will return the final results to the user via the UA.

The implementation of CAS has two phases. The first phase involves the development of cooperating UAs that learn from users, and the design of MANs that plan request fulfillment and develop trust relationships with other agents. The second phase involves the incorporation of more intelligence into the agents so that they can make better plans. As of the writing of this paper, the authors are developing real-time planning and learning algorithms for this purpose.

The team began their implementation in 1996 using libwww and wrote their code in C. They used Netscape as the user interface and implemented each agent as a separate process. There was one UA and several MANs per user. The team used standard web search engines like Lycos, Infoseek, and Crawler for the data sources and de facto MAs. In their prototype, the UA maintains a log of exchanges as well as a trust table for other agents. After each user query, the UA gets feedback from the user on the usefulness of the information and recalculates its trust of each agent as a result. As the authors adopted a long-range approach for implementation, apparently they are still working on CAS. Unfortunately, as all data on CAS from Tohoku University are in Japanese, information regarding the current status of CAS is unavailable for this paper. The information summarized in this section can be found in [Okada and Lee and Shiratori 1996]. Further information on CAS (in Japanese only) is available from [CAS web].



PADMA (Parallel Data Mining Agents) is an agent-based system designed to address issues in the data-mining field such as the scalability of algorithms, as well as the distributed nature of data and computation. The team that developed PADMA suggests that "the very distributed nature of the data storage and computing environments is likely to play an important role in the design of the next generation of data mining systems."

With this view of the steady growth in research on agent-based information-processing architectures and parallel computing, PADMA uses specialized agents for each specific domain, so that PADMA can evolve to be a "flexible system that will exploit data mining agents in parallel, for the particular application at hand."

PADMA consists of three main components: 1) data-mining agents, 2) a facilitator for coordinating agents, and 3) a user interface. The third component is not of interest to this paper.

Specifically, data-mining agents directly access the data to extract high-level useful information, and thus each agent needs to specialize in the particular domain of the data it deals with. Each agent has its own disk subsystem and performs I/O operations on data independent of other agents: this is key to the parallel execution in PADMA. In this way, agents can employ local I/O optimization techniques to increase their speed and improve their accuracy. After extracting information from the data, agents share their mined information through the facilitator module. Other than coordinating agents, the facilitator presents the mining results to the user interface and routes feedback from the user to the agents.

PADMA addresses the scalability issue by reducing the inter-agent and inter-process communication during the mining process. In the initial stage of processing a user request, each agent runs independently and queries the data in its own data set. This independence in the initial phase allows a speedup that is linear with the number of agents involved. Once each agent finishes its local extraction operations, the facilitator merges the information from the agents into a final result.

Similarly, PADMA analyzes data in a parallel fashion. The facilitator instructs the data-mining agents to run a clustering algorithm on their respective local data sources. After analyzing its local sets of data, each agent returns a "concept graph" to the facilitator without interacting with other agents. The concept graph is a null object if no data relevant to the user query exists at a particular data source. The facilitator then combines the concept graphs from the agents and returns the clustering result to the user interface. Note that the mechanisms for detecting and hierarchically merging clusters are largely independent of the way PADMA functions. The system administrator thus needs to provide the clustering mechanisms for each domain to which PADMA is applied.
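The PADMA flow described above can be sketched as follows. This is a hedged simplification, not the PADMA code: the "concept graph" is reduced to a set of keywords (with None standing in for the null concept graph), and the agents run sequentially here where PADMA would run them as parallel processes.

```python
# Hedged sketch of the PADMA flow: each data-mining agent processes its
# own partition in isolation and returns a simplified "concept graph"
# (a keyword set, or None when nothing is relevant); the facilitator
# merges the non-null results without the agents ever interacting.
def mining_agent(local_docs, query):
    relevant = {word for doc in local_docs for word in doc.split()
                if query in doc}
    return relevant or None       # None plays the role of the null concept graph

def facilitator(partitions, query):
    # In PADMA the agents would run in parallel; sequential here for clarity.
    graphs = [mining_agent(docs, query) for docs in partitions]
    merged = set()
    for graph in graphs:
        if graph is not None:     # skip null concept graphs
            merged |= graph
    return merged

partitions = [["troop movement north"],   # data source 1
              ["weather report"],         # data source 2 (irrelevant)
              ["troop supply south"]]     # data source 3
result = facilitator(partitions, "troop")
# result == {"troop", "movement", "north", "supply", "south"}
```

Because each agent touches only its own partition, the extraction stage parallelizes with no coordination cost, which is where PADMA's reported linear speedup comes from.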

The team tested PADMA for clustering related texts in a corpus. The test involved designing the agents and the facilitator to identify text relationships based on n-grams, so as to alleviate the problems of typographical errors and misspellings in the texts. Their test showed that PADMA could deliver satisfactory clustering results in an acceptable time frame.

The PADMA project is still under active research. The current implementation performs querying and clustering on bodies of texts. The team ran experiments and tests against the TIPSTER text corpus of size 36 MB, and showed that PADMA had linear speedups for clustering. However, the current implementation did not achieve a reasonable speedup in query operations. The team is now investigating the bottleneck that prevents this speedup. The next step will be tests with a larger corpus (100 MB). The team is also trying to develop a combination of supervised and unsupervised clustering algorithms that can be used in PADMA. For more information and detail, see [Kargupta and Hamzaoglu and Stafford 1999]. Further information can also be found at [Los Alamos National Laboratory web].



SAIRE (Scalable Agent-based Information Retrieval Engine) is an agent framework for solving the problem of information overload. The authors remarked that because of this problem, the information delivered to users in a data search is "often unorganized and overwhelming." SAIRE attempts to alleviate the problem with a combination of software agents, concept-based search, and natural language (NL) processing. The system provides facilities for tailoring a search to the specific needs of a user. For instance, a user may use a technical word for its very specific meaning instead of its more common meaning. In this case, SAIRE will make sure the search is based on the meaning desired by that user.

SAIRE places more emphasis on domain-specific queries and user-interaction issues than most distributed knowledge-integration or data-mining systems do. Since the team tries to factor users' search objectives and prior activities into the searching process, SAIRE aims to "[provide] an opportunity for non-science users to answer questions and perform data analysis using quality science data". Meeting this goal involves incorporating vast amounts of domain expertise into the agents that interact with users, as well as the agents that extract information from the data sources.

Users interact with SAIRE through a User Interface Agent (UIA). The UIA accepts user inputs and passes them to the Natural Language Parser Agent (NLP). The NLP extracts important phrases from the user input, interprets the inputs, and then generates a request to the SAIRE Coordinator Agent (SCA).

The NLP consists of four agents: 1) a dynamic dictionary, 2) a grammar-checking module, 3) a pre-processor, and 4) a chart parser. Both the dictionary and the grammar-checking module are specific to the domain in which the NLP is working. In addition, the dictionary is split into a main dictionary with words and semantic meanings pertinent to a domain, and a user dictionary that contains words with ambiguous or special meanings. SAIRE interacts with the user to construct the user dictionary and update it with each clarification of a word's preferred domain meaning.

Figure 1. The architecture of SAIRE.

The SCA first forwards the request from the NLP to a User Modeling Agent (UMA). The UMA monitors the usage patterns of individual users and user groups so that SAIRE can adapt to the requests of frequent users and user groups. The UMA, together with the Concept Search Agent (CSA), provides user-specific interpretations of the request to the SCA. After that, the SCA attempts to resolve any remaining ambiguities with the UMA and the user-specific dictionary. If ambiguities remain, the UMA requests clarification from the user, and this clarification will update the user dictionary.

Once the SCA fully understands a request, it sends the request to the proper data source managers. When the corresponding data source agents return information, the SCA passes the results to a Results Agent (RA). The RA notifies the UIA of the availability of the results and provides tools for presenting this data in different media and various formats.

Instead of having each agent maintain local information by direct interaction with other agents, the SCA serves as a centralized coordinator for agents. Since the SCA is aware of the capabilities of every data source agent, it can coordinate the agents in a very sophisticated way. The SCA can also store this information safely in a repository, possibly enhancing the fault tolerance of the system. The SCA keeps track of the locations and skill bases of agent managers (AM) in the system, and provides this information for the use of all data source agents. An agent manager controls the domain-specific data source agents in a particular domain. Furthermore, by monitoring the request history for each agent, the SCA can control the resource usage of agents by migrating agents from node to node or spawning new agents when necessary. Consequently, SAIRE overloads no single node in the network and uses the available bandwidth efficiently. This multi-agent coordinator architecture of SAIRE is best suited for applications with well-known data sources but no effective means of finding appropriate agents in the agent pool.
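The SAIRE request pipeline described above can be sketched end to end. Every function name, the toy parsing rule, and the sample data below are hypothetical; the sketch only illustrates the flow UIA input, NLP parsing, UMA refinement with the user dictionary, SCA dispatch to data sources, and delivery via the RA.

```python
# Illustrative sketch of SAIRE's request pipeline:
# UIA -> NLP -> UMA (user dictionary) -> SCA dispatch -> RA.
def nlp_parse(raw_input):
    # Stand-in for the NLP agent: extract key phrases from user input.
    return [w for w in raw_input.lower().split() if len(w) > 3]

def uma_refine(phrases, user_profile):
    # Stand-in for the UMA: bias phrases toward the user's preferred
    # domain meanings, as recorded in the user dictionary.
    return [user_profile.get(p, p) for p in phrases]

def sca_dispatch(phrases, sources):
    # Stand-in for the SCA: forward the interpreted request to every
    # data source and collect matching documents.
    return [doc for docs in sources.values() for doc in docs
            if any(p in doc for p in phrases)]

def results_agent(hits):
    # Stand-in for the RA: deduplicate and present the results.
    return sorted(set(hits))

profile = {"craft": "spacecraft"}      # user prefers the technical meaning
sources = {"nasa": ["spacecraft telemetry", "launch window"],
           "news": ["local craft fair"]}
phrases = uma_refine(nlp_parse("show craft data"), profile)
out = results_agent(sca_dispatch(phrases, sources))
# out == ["spacecraft telemetry"]: the ambiguous word "craft" was
# resolved to the meaning the user prefers, so the craft fair is skipped
```

The user-dictionary step is what gives SAIRE its claimed precision: the ambiguous term never reaches the data sources in its common sense.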

The authors evaluated SAIRE with several experiments, and the results were quite promising. In a sample of representative requests, the number of documents retrieved ranged from 8 to 536 per query. The precision, or the percentage of documents retrieved that are relevant to a user query, ranged from 75% to 100%. With these results, the authors claimed that SAIRE has the potential to retrieve only those documents that are relevant to a user's objectives and interests, and therefore users need not sort through a vast pool of irrelevant documents.
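The precision figure quoted above follows the standard information-retrieval definition, which can be computed directly (the code is illustrative, not from the SAIRE paper):

```python
# Precision: the fraction of retrieved documents that are relevant.
def precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved)

# Example: 4 documents retrieved, 3 of them relevant.
p = precision(retrieved={"d1", "d2", "d3", "d4"},
              relevant={"d1", "d2", "d3"})
# p == 0.75, i.e. the 75% lower bound reported for SAIRE
```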

The SAIRE project appears to have been suspended in 1997. As of February 1997, SAIRE could understand 11,000 words and 7000 phrases as well as clarify ambiguous words through user-agent dialogue. SAIRE could also take user context and previous history into account when understanding a query. The last implementation of SAIRE involved 8 agent groups of 16 agents apiece, and each agent could collaborate with others to fulfill user requests. The implementation also provided visual displays of agent activity along with run-time explanations.

This section summarized work from Lockheed Martin Space Mission Systems & Services presented in [Das and Kocur 1997]. Further information on the SAIRE project can also be found at [SAIRE web].



InfoSleuth is an agent-based system for information retrieval. The team at MCC developed the system for the purpose of extracting and integrating semantic information from diverse sources, as well as providing temporal monitoring of the information network and identifying any patterns that may emerge [Unruh, Martin, and Perry 1998]. They finished the InfoSleuth project by June 30, 1997, and the InfoSleuth project is now in phase two, called InfoSleuth II. The work described in this paper has come under the auspices of both projects. However, the second project focuses on studying how to support multimedia information objects, and on promoting widespread deployment of data-mining technology in business organizations.

In order to deal with the active joining and leaving of data sources in the InfoSleuth system while avoiding the need for central coordination, the team developed its own multi-brokering peer architecture to coordinate agent actions [Nodine, Bohrer, and Ngu 1998]. The brokering system matches specific requests for services with the agents that can provide the services. This matching process is based on both the syntactic characteristics of the request and the semantic nature of the requested services.

Each data-mining agent in the InfoSleuth system subscribes to agents called brokers. Each broker in turn advertises the capabilities of the agents that subscribe to it, as well as what kind of broker advertisements it will accept. The brokering system then groups brokers that provide similar agent services into a consortium, but there is enough overlap among different consortia to guarantee interconnectivity among brokers. Brokers belonging to a consortium maintain up-to-date information about other brokers in the consortium as well as general information about the presence of other consortia.

When a broker wants to join the system, it needs to first discover which consortia its services fit within. The new broker then advertises its services and its openness to advertisements. Only those brokers whose openness includes the advertised services will discover the new broker, and they can choose whether to accept the advertisement after assessing the capabilities of the new broker. On the other hand, the new broker can query the brokers it advertises to for a list of brokers, and if it is interested in any brokers in the list, it can add their advertisements to its own list.

As a data source joins the system, each of its data-mining agents subscribes to one or two brokers. After being in the brokering system for a while, each agent can change its preferred brokers. In one way, the agent queries the related consortia for brokers; if there is a match, it adds the broker to its preferred list. Alternatively, if the agent figures out that one of its preferred brokers is always forwarding its service requests from/to another broker, it may simply replace that preferred broker with the intermediate broker.
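The broker-replacement rule just described can be sketched as follows. All class and method names here are illustrative assumptions, not InfoSleuth's actual API; the sketch only captures the idea of swapping a persistently-forwarding broker for the intermediate broker it forwards to:

```python
from collections import Counter

class MiningAgent:
    """Minimal sketch of the broker-preference adjustment described above."""
    def __init__(self, preferred_brokers):
        self.preferred = set(preferred_brokers)
        self.forwards = Counter()          # (broker, intermediate) -> count

    def record_forward(self, broker, intermediate):
        """Note that `broker` forwarded our service request to `intermediate`."""
        self.forwards[(broker, intermediate)] += 1

    def adjust(self, threshold=3):
        """Replace any preferred broker that keeps forwarding to the same
        intermediate broker with that intermediate broker directly."""
        for (broker, intermediate), n in list(self.forwards.items()):
            if broker in self.preferred and n >= threshold:
                self.preferred.discard(broker)
                self.preferred.add(intermediate)

agent = MiningAgent(["brokerA"])
for _ in range(3):
    agent.record_forward("brokerA", "brokerB")
agent.adjust()
print(agent.preferred)  # {'brokerB'}
```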

InfoSleuth uses multiple layers of agents for the task of information gathering and analysis. At each of the data sources, a Resource Agent extracts semantic concepts from the source. Upon receiving user requests, a Multi-resource Query Agent determines whether the request involves more than one Resource Agent, and if so, it integrates the annotated data from multiple sources. At the same time, Data-mining Agents and Sentinel Agents perform the tasks of intelligent system monitoring and correlation of high-level patterns that emerge from the data sources. Data-mining agents provide event notifications that encode statistical analyses and summaries of the retrieved data. Sentinel agents support the data-mining activities by organizing inputs to data-mining agents, and by monitoring for "higher-level" event patterns based on data-mining agents' output events. Through all these layers of agents, InfoSleuth supports derived requests such as deviation analysis, filtered deviations, and correlated deviations.

Figure 2. The architecture of InfoSleuth.

Even though the team has evaluated InfoSleuth with several experiments, they did not publish any performance data.

With its elaborate brokering system, InfoSleuth does not require central coordination for collaborative agent action. Besides, the peer-to-peer feature of the brokering system provides an efficient way for a data-mining agent to locate another agent for use. The brokering system also provides mechanisms with which agents can rate the service provided by brokers and switch brokers accordingly. This allows the system to dynamically adapt itself to both network instability and major categorical shifts in user requests. Nonetheless, the information necessary for brokers and agents to adjust their links may propagate very slowly across the network. In that case, InfoSleuth may have suboptimal performance for prolonged periods of time.

Organizations such as the National Institute of Standards and Technology and companies like Texas Instruments and Eastman Chemical Company have adopted InfoSleuth as the infrastructure for their data-mining operations [MCC web A]. In particular, the EDEN (Environmental Data Exchange Network) project recently used InfoSleuth to support integrated access via web browsers to environmental information sources provided by agencies in different countries [MCC web B].



JAM (Java Agents for Meta-Learning over Distributed Databases) attempts to provide a scalable solution for learning patterns and generating a descriptive representation from a large amount of data in distributed databases [Stolfo, Prodromidis, Tselepis, Lee, Fan, and Chan 1997]. The authors identified the need for scalable algorithms in data mining. They claimed that even though many well-developed data-mining algorithms exist, most of these algorithms assume that the total set of data can fit into memory, and this assumption does not hold in many data-mining contexts. The team thus developed JAM as an agent-based framework for handling this scaling problem.

Another motivation for their agent-based data-mining framework is to handle inherently distributed data. The authors claimed that data can be inherently distributed because of its storage on physically distributed mobile platforms like ships or cellular phones. Other reasons for the inherently distributed nature of data include, but are not limited to, secure and fault-tolerant distribution of data and services, proprietary issues (different parts of the data belong to different entities), or statutory constraints imposed by law.

Figure 3. The architecture of a JAM network with 3 Datasites.

The JAM system is a collection of distributed learning and classification programs linked by a network of Datasites. Each JAM Datasite consists of a local database, one or more base-learning agents, one or more meta-learning agents, a local user configuration file, graphical user interfaces, and animation facilities. A learning agent is a machine-learning program for computing the classifiers at distributed sites. Base-learning agents at each Datasite first compute base classifiers from a collection of independent and inherently distributed databases in a parallel fashion. Meta-learning agents are learning processes that integrate several base classifiers, which may be generated by different Datasites. In addition, JAM has a central and independent module, called the Configuration File Manager (CFM), which keeps up-to-date state of the distributed system. The CFM stores a list of participating Datasites and logs events for future reference and evaluation.

At each Datasite, local learning agents operate on the local database to compute the base classifier. Each Datasite may import classifiers from peer Datasites and combine these with its own local classifier using the local meta-learning agent. JAM solves the scaling problem of data mining by computing a meta-classifier that integrates all the base-classifier and meta-classifier modules once they are computed. The system can then use the resultant meta-classifier module to classify other datasets of interest. Through this ensemble approach, JAM boosts the overall predictive accuracy.
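The meta-learning step can be illustrated with a deliberately simple combiner. The sketch below uses majority voting over base classifiers as one instance of integrating classifiers computed at different sites; JAM itself supports trained combiners as well (the paper later mentions a Bayesian meta-learner), and the base classifiers here are invented toy rules, not JAM's:

```python
def majority_meta(base_classifiers):
    """Combine base classifiers learned at different Datasites into one
    meta-classifier by majority vote -- a minimal form of meta-learning."""
    def meta(x):
        votes = [clf(x) for clf in base_classifiers]
        return max(set(votes), key=votes.count)
    return meta

# Hypothetical base classifiers, each learned from a different local database:
site1 = lambda x: "fraud" if x["amount"] > 1000 else "ok"
site2 = lambda x: "fraud" if x["overseas"] else "ok"
site3 = lambda x: "ok"

clf = majority_meta([site1, site2, site3])
print(clf({"amount": 5000, "overseas": True}))   # fraud
print(clf({"amount": 50, "overseas": False}))    # ok
```

Only the classifiers (small functions or models) ever cross the network; the raw transaction records stay at their home Datasites.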

The CFM assumes a passive role in the configuration maintenance of the system. It maintains a list of active member Datasites for coordination of meta-learning activities. Upon receiving a JOIN request from a new Datasite, the CFM verifies the validity of the request as well as the identity of the site. Similarly, a DEPARTURE request invokes the CFM to verify the request and remove the Datasite from the list of active members. The CFM logs the events between Datasites, stores the links among Datasites, and keeps the status of the system.
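The CFM's passive bookkeeping role might be sketched as below. Method and field names are illustrative assumptions, not JAM's actual interface:

```python
class CFM:
    """Sketch of the Configuration File Manager: a passive membership
    list plus an event log, as described above."""
    def __init__(self):
        self.active = set()
        self.log = []

    def join(self, site, credentials_ok=True):
        """Admit a Datasite after (hypothetically) verifying its identity."""
        if credentials_ok:
            self.active.add(site)
            self.log.append(("JOIN", site))

    def departure(self, site):
        """Verify a DEPARTURE request and remove the Datasite."""
        if site in self.active:
            self.active.discard(site)
            self.log.append(("DEPARTURE", site))

cfm = CFM()
cfm.join("datasite1")
cfm.join("datasite2")
cfm.departure("datasite1")
print(sorted(cfm.active))  # ['datasite2']
```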

JAM implements both the CFM and Datasites as multi-threaded Java programs. Meta-learning agents are implemented as Java applets because of their need to migrate to other sites.

The team initially designed JAM for the purpose of fraud and intrusion detection in financial information systems. They conducted an experiment using the system to detect credit card fraud transactions, which involved processing inherently distributed data from various financial institutions. They obtained the best performance by using a Bayesian network as the meta-learner: JAM was able to classify 80% of the true positives and had a false alarm rate of 13%.

Agents in JAM communicate with one another to share the classifiers they have developed. The system does not require the dispersion of data across the different sites throughout the execution. This allows the participants to share only derived information without violating security or proprietary protection of the data.

Since JAM does not specify an implementation, different Datasites may choose different machine-learning algorithm implementations as the learning agents, and some of these algorithms may not scale well for large data sets. Thus, the ability to handle large datasets may vary among Datasites.

Moreover, there can be a limitation on how many Datasites can join a JAM system. Even though JAM has no central coordinator, the CFM constantly monitors the global state of the system and requires more network bandwidth as more Datasites join the system. The CFM can be both a single point of failure and a bottleneck to reasonable system performance.

The JAM project ended in December 1998. The team posted the evaluation of JAM's performance in intrusion detection on their website [JAM web A]. For software download and specification, refer to [JAM web B].



DKN (Distributed Knowledge Network) is a research project for large-scale automated data extraction and knowledge acquisition and discovery from heterogeneous, distributed data sources [Yang, Honavar, Miller, and Wong 1998]. As part of this project, the team implemented a toolkit of machine-learning algorithms, called KADLab, which uses customizable agents for document classification and retrieval from distributed data sources.

Instead of building an agent infrastructure like most projects, the DKN team chose to use the commercially available Voyager platform from ObjectSpace. Voyager uses the Java language object model and allows regular message syntax for constructing and deploying remote objects. Through its Object Request Broker, Voyager provides services to remote objects and autonomous agents. Objects and other agents can send messages to a moving agent, and an agent can continue to execute as it moves. The platform also provides services for persistence, group communication, and basic directory services.

The team has experimented with their approach for retrieving paper abstracts and news articles on a point-to-point basis [Yang, Pai, Honavar, Miller 1998]. They first trained the classifiers with user preferences, and then incorporated the classifiers into mobile agents on the Voyager platform. When a user queries a document using their system, a mobile agent (Agent 1) is generated. Agent 1 moves to a remote site to retrieve relevant documents. It sends the documents to the local site and then dies. Next, the user gives feedback as to whether the documents are interesting or not. These preferences train the classifiers and generate another agent (Agent 2). Agent 2 moves to the remote site and runs the classifier to retrieve relevant documents. It sends the relevant documents to the local site and dies. The team claimed that the mobile agents return only a subset of relevant documents, but they did not explain the mechanism through which they incorporate classifiers into the agents.

Other than the point-to-point experiment, the team did not publish any experiments regarding the system's performance with distributed data sources or under varying network environments. Nor did they publish the characteristics of the data sources in the experiments they conducted.

An important feature of their work is the use of an off-the-shelf agent platform. By not building their own agent platform, developers can proceed to programming the agent activities and launch the agents into the network within a relatively short time frame. On the other hand, most commercially available agent platforms are for general agent usage, and thus developers that use such platforms do not enjoy the same leverage as with a platform specifically designed for data mining. In the case of DKN, the team found it difficult to keep track of agents once the agents are launched into the network. This is because Voyager requires a proprietary Java message format for communications among agents. Therefore, instead of updating agents at remote sites with the new classifier information, their system has to regenerate and dispatch new agents from a central location every time a user provides feedback to the learning process. Their system may not scale well to handle distributed data sources due to the considerable overhead in agent generation and garbage collection.



BODHI is an implementation of Kargupta's Collective Data Mining (CDM) framework for distributed knowledge discovery using agents. CDM aims at "designing and implementing efficient algorithms that generate models from heterogeneous and distributed data with guaranteed global correctness of the model" [Kargupta web].

An agent in BODHI is an interface between the learning algorithm and the communication module. At each site, there is an agent station module that maintains communication between sites and handles security issues. A facilitator module coordinates inter-agent communication and directs data and control flow among the distributed sites. Most of the BODHI implementation is in Java for flexibility, but the system can still import learning algorithms implemented in native code at local sites.

BODHI uses several learning algorithms specifically developed for distributed data mining: collective decision rule learning using Fourier analysis [Kargupta, Park, Hershberger, and Johnson 1999], collective hierarchical clustering [Johnson and Kargupta 1999], collective multivariate regression using wavelets [Hershberger and Kargupta 1999], and collective principal component analysis [Rannar, MacGregor, and Wold 1998]. The first algorithm uses Fourier analysis to find the Fourier spectrum of the data at each data source, and then sends the local spectra to a centralized site for merging. BODHI can then transform the resultant spectrum into a decision-tree representation. Collective hierarchical clustering requires the transmission of local dendrograms at O(n) communication cost. It then creates a global model from the local models with an O(n²) bound in time and an O(n) bound in space. Collective multivariate regression only requires the aggregation of the significant coefficients of the wavelet transformation of local data at a central site. The algorithm can then reconstruct the model by performing regression on the coefficients. Collective principal component analysis involves the creation of a global covariance matrix from loading matrices and sample score matrices after distributed data analysis.

In general, collective learning algorithms attempt to build the most accurate model with respect to a centralized algorithm while minimizing data communication [Kargupta web].
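The principle of trading raw data for compact local summaries can be illustrated with a much simpler collective model than BODHI's wavelet or Fourier schemes: distributed simple linear regression via sufficient statistics. Each site ships five numbers instead of its records, and the merged fit is exactly what a centralized least-squares fit would produce. This is a hedged analogy, not BODHI's actual algorithm:

```python
def local_stats(xs, ys):
    """Each data site summarizes its records as five sufficient statistics;
    only these numbers -- not the raw data -- cross the network."""
    n = len(xs)
    return (n, sum(xs), sum(ys),
            sum(x * y for x, y in zip(xs, ys)),
            sum(x * x for x in xs))

def merge_and_fit(stats_list):
    """The central site sums the local statistics and fits y = a + b*x,
    recovering the same model as a centralized least-squares fit."""
    n, sx, sy, sxy, sxx = (sum(s[i] for s in stats_list) for i in range(5))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Two "distributed" sites holding halves of data generated by y = 2x + 1:
site1 = local_stats([0, 1, 2], [1, 3, 5])
site2 = local_stats([3, 4], [7, 9])
print(merge_and_fit([site1, site2]))  # (1.0, 2.0)
```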

The use of these algorithms minimizes the amount of communication between the central coordinator and the local data sources, but the type of information communicated varies according to the algorithm used. Such algorithms lack inter-agent communication before model integration, and this may be a drawback for some data-mining applications.

BODHI adopts an agent architecture that provides the necessary infrastructure for the execution and information transmission of collective learning algorithms. The system uses the network bandwidth efficiently. However, if too many distributed agents send their local models concurrently to a central location for merging, the large amount of incoming information may overload the network, and thus scalability is an issue. More detail concerning the implementation can be found on Kargupta's website [Kargupta web].


Desired Characteristics of an Ideal ABKD Architecture

The survey in the previous section demonstrated the various ways of applying agent technology to distributed data mining. It covered most of the issues that may arise in distributed data mining, ranging from system architecture and network topology to user interaction. While each existing ABKD system addresses certain issues of distributed data mining better than others, one may wonder if it is possible to extract all the good features from existing ABKD systems to specify an ideal ABKD system that is flexible and robust enough to address all of the issues in distributed data mining. In particular, this section attempts to provide insights, if not final answers, to the following questions:


What are the characteristics of an ideal ABKD architecture?


What can a user or a developer expect from an ideal ABKD system?


Is it possible to build an ideal ABKD system?



By definition, any ABKD system needs to work in a networking environment. In some networks, like the Internet, where the stability is uncontrollable, an ideal ABKD system needs to allow the dynamic joining and leaving of remote data sources at any given time. Other networks, like corporate intranets, can be fairly stable most of the time. In that case, the ideal ABKD system needs to be aware of the stable network and work with the remote data sources through more efficient means.

An ideal ABKD system scales well with the number of data sources and the size of data to be mined at each remote site. Also, the ideal system can handle concurrent queries from a large number of users. Based on the earlier discussion of existing ABKD systems, avoiding centralized coordination and system monitoring is the key to resolving these scalability issues.


Information Integration

In distributed data mining, ABKD systems need to integrate information from different data sources. These data sources may store data in different formats, belong to different application domains, and support the retrieval of data in different kinds of data structures. Yet there are cases where all the distributed data sites are homogeneous. An ideal ABKD system needs to adapt to the different possible natures of remote data sources to perform the necessary information integration.

One issue in ABKD research is whether to integrate information during the mining process, or after independent mining at each remote data source. This issue is closely coupled with the mechanisms with which agents are coordinated. In the former choice of integration, central coordination is not necessary, and the action of an agent can be influenced by many different entities through extensive communication. In the latter choice, a coordinator at the top of the system hierarchy is necessary to manage agents for proper information integration. An ideal ABKD system should support both types of information integration and adjust itself dynamically according to the network environment and application context.

What is more, an ideal ABKD system should support both any-time and non-any-time algorithms. The use of any-time algorithms in a data-mining operation allows the users to interrupt the system at any stage of processing and retrieve an analysis of the results up to that stage. However, any-time analysis may not make sense for certain data-mining operations or techniques.
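The defining property of an any-time algorithm is that a usable partial result exists after every unit of work. A minimal sketch (the streaming frequency count is an invented example task, chosen only because its partial results are always meaningful):

```python
import itertools

def anytime_counts(stream, budget):
    """Any-time mining loop: after processing each record, the current best
    answer is available, so the user can interrupt at any stage and still
    get an analysis of the data seen so far."""
    counts = {}
    for i, item in enumerate(itertools.islice(stream, budget), 1):
        counts[item] = counts.get(item, 0) + 1
        yield i, dict(counts)        # a usable partial result at every step

data = iter(["a", "b", "a", "a", "c"])
results = list(anytime_counts(data, budget=3))
print(results[-1])  # (3, {'a': 2, 'b': 1}) -- the answer after stopping at 3 records
```

A non-any-time algorithm, by contrast, produces nothing useful until its final step, which is why the ideal system must support both modes.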


Result Processing

Result processing is an aspect of data mining that involves presenting results to users, understanding user queries, and archiving past results efficiently. Even though result processing is usually not a concern in distributed data-mining research, agent technology readily addresses issues in result processing.
As it is important for the data-mining system to thoroughly understand user requests, the ideal system should perform user profiling and take into account the history, experience, and profile of the user before processing that user's requests. Only then can the system verify that it is finding information that is interesting to the user.

Interactive query clarification is a powerful agent tool for this purpose. It allows the data-mining system to ask the user questions before the user submits a query. As a result, the system can ensure that it understands the correct meaning of a particular word in the user request. Incorporating this feature into the ideal ABKD system will extend its applicability in domain-specific operations without limiting its application to a particular domain.

Moreover, the ideal system should be able to measure and report the relevance of the returned information with respect to the user query. The user can then cross-reference this relevance rating when making a decision based on the returned information.

An ideal ABKD system should not only support advanced user interaction but also make the best use of available resources to resolve user queries. The ideal system performs data source profiling to better direct queries to the proper data sources. Data source profiling requires that the coordinating entities keep track of the meta-information for each data source. Besides returning information of higher quality, proper data source profiling also promotes efficient use of system resources, especially network bandwidth.

In addition, the ideal system may cache past data-mining results, so as to reduce the amount of processing for recurrent queries. Since caching only works well in certain domains, the ideal system should provide means for the system administrator to specify whether caching should take place, and if so, at what points in the system and at what level.
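An administrator-controlled result cache of the kind described might look like the following sketch. The class, its parameters, and the eviction policy are all illustrative assumptions:

```python
class QueryCache:
    """Sketch of administrator-controlled result caching; whether caching
    is enabled, and how many results to keep, are deployment decisions."""
    def __init__(self, enabled=True, max_entries=128):
        self.enabled = enabled
        self.max_entries = max_entries
        self.store = {}

    def get_or_mine(self, query, mine_fn):
        if self.enabled and query in self.store:
            return self.store[query]         # recurrent query: no re-mining
        result = mine_fn(query)
        if self.enabled:
            if len(self.store) >= self.max_entries:
                self.store.pop(next(iter(self.store)))  # evict oldest entry
            self.store[query] = result
        return result

calls = []
def mine(q):
    calls.append(q)                          # stands in for an expensive mining run
    return f"result({q})"

cache = QueryCache()
cache.get_or_mine("avg house price", mine)
cache.get_or_mine("avg house price", mine)
print(len(calls))  # 1 -- the second, recurrent query hit the cache
```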


Resource Usage

In many data-mining applications, it is desirable to minimize resource usage without significant compromises in speed or accuracy. Hence, an ideal ABKD system needs to provide facilities for adjusting resource usage for different performance requirements. An example is the resource usage for inter-agent communication. An ideal ABKD system should allow agents to communicate with one another to resolve problems. Increasing the communication among agents usually results in more accurate and more meaningful data mining. Yet an increase in such communication invariably leads to an increase in network traffic, and the ideal system may need to limit the communication among agents to control both bandwidth usage and run-time bounds. Therefore, the ideal system should adjust the level of inter-agent communication dynamically according to run-time conditions.

Since querying data sources is an expensive operation, an ideal system should try to minimize the time overhead in accessing remote data sites. A great deal of research has been conducted in this area to find ways of gathering information with a minimal amount of database access.


How realistic is it?

Researchers may wonder if they can build such an ideal ABKD system. Even though no existing ABKD system possesses all the ideal characteristics, the survey of ABKD systems in the previous section can provide some insight regarding the feasibility of building the ideal system.

Since the ideal ABKD system essentially consists of the good features of all the existing ABKD systems, the survey demonstrates that each constituent part of the ideal system can be implemented. However, the inherent difficulty of integration means that the ideal system is not a simple union of these constituent parts. As research in distributed data mining shows the difficulty of combining approaches to learning information, building an ideal system is a feasible but non-trivial task. It will require an in-depth understanding of the problems in distributed data mining and a familiarity with all possible solutions, so that alternative approaches can be taken whenever necessary.

Perhaps the most difficult aspect of building an ideal ABKD system architecture is allowing users to use the system for their domain-specific needs. The ideal system needs to support flexible tradeoffs among many conflicting desired characteristics in order to be useful for a wide array of application domains. It can be a challenge for developers to incorporate these conflicting characteristics into the system so that the features work properly by themselves and, more importantly, work together when necessary.


Proposed ABKD Architecture

We propose an agent-based data-mining architecture that accommodates most of the ideal characteristics we described in the previous section. The proposed architecture resembles SAIRE but places more emphasis on data sources and management than on the user interface. SAIRE requires a high level of inter-agent communication, and it does not address the problem of unstable networks. In other words, even though SAIRE does not prevent data sites from joining or leaving the system, it does not directly address those issues. Mainly as an effort to allow user-friendly search through a vast amount of data, SAIRE ignored many issues that our proposed architecture deals with.

The proposed system contains three main types of agents: the UI (user interface) agent, the Manager agent, and the KB (knowledge base) agent. The UI agent helps the user get specific information via the Manager agent. The Manager agent, which is an expert in a particular field, interoperates with a number of KB agents or other Manager agents to get the requested information. Each KB agent wraps around a data source and uses some conventional data-mining technique to extract information from the data.

The architecture uses a "registration" mechanism that gives an agent access only to those agents that have registered with it. This protocol makes the organization of the agents dynamic; in other words, the UI and KB agents are mutable. UI agents are created when users enter the system, and the system destroys the corresponding UI agents when users leave. Similarly, as data sources join and leave the system, the KB agents are created and destroyed. In general, a UI agent has access to multiple Manager agents that have registered with the UI agent.

A Manager agent is the least volatile agent since it has no direct link to an external object, such as a user or a database. Manager agents organize themselves into a hierarchy of domain expertise. Each Manager agent has access to a specified number of Manager and/or KB agents that have registered with the Manager agent. Besides, a Manager agent can terminate itself if no KB or Manager agent is registered with it.

We assume that every KB agent understands the meaning of its data with respect to a hierarchy so that it can register itself to the correct branch of the hierarchy. Nevertheless, researchers can implement a simple organization with a two-level hierarchy, and use the prototype to study the possibilities of multi-level hierarchies.
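The registration mechanism described above can be sketched for a two-level hierarchy. All class and method names are our own illustrative choices, and in a fuller implementation a KB agent that finds no matching Manager would trigger the instantiation of a new one:

```python
class ManagerAgent:
    """Sketch of the registration mechanism: an agent can reach only those
    agents that have registered with it."""
    def __init__(self, domain):
        self.domain = domain
        self.registered = []       # KB agents and sub-Managers visible to us

    def register(self, agent):
        self.registered.append(agent)

    def idle(self):
        """A Manager may terminate itself when nothing is registered with it."""
        return not self.registered

class KBAgent:
    def __init__(self, topic):
        self.topic = topic

    def find_manager(self, managers):
        """Register with the Manager whose domain matches this agent's data."""
        for m in managers:
            if m.domain == self.topic:
                m.register(self)
                return m
        return None                # a new Manager would be spawned here

realty = ManagerAgent("realty")
KBAgent("realty").find_manager([realty])
print(len(realty.registered), realty.idle())  # 1 False
```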

Based on Lam's research on agent design, we suggest that each agent have six functionalities: sensing, modeling, organizing, planning, acting, and communicating. A KB agent senses (gathers) the data and produces a model or statistic from the data, and then it registers with the Manager agents that are interested in its data. When a KB agent gets an information request from a Manager agent, the KB agent queries its data source using the proper query language, and then communicates the results to the Manager agent. The KB agent collaborates with other KB agents when the Manager agent requests high-level information that requires the querying of more than one data source. KB agents also collaborate during modeling.

A Manager agent is responsible for getting access to all data sources pertaining to a particular field, which may be modeled as a set of keywords within the agent. If a KB agent finds no Manager agent to register with, a new Manager agent will be instantiated. A Manager agent models the KB agents that have registered with it, as well as other Manager agents it may collaborate with. The UI agent sends a model of user preferences along with the user's request to the Manager agent to ensure the retrieval of accurate information.

As an example, suppose a user makes three requests: "find the average price of a house sold for less than $150,000 within 5 miles of my current home," "what area within my state has the highest selling price for a house in the past 10 years," and "what is the expected selling price for my neighborhood in the next 10 years." This example assumes that the ABKD system has established the Manager and KB agents for the realty domain. The UI agent first translates the request into terms that the agents understand. Then, it gets the user's location from the user model and sends the request, along with the user model, to a Manager agent that handles residential property (as opposed to business property). When the Manager agent receives this request, it locates a Manager agent that exclusively deals with houses (as opposed to apartments) and redirects the request to that agent. Having modeled the data content of each registered KB agent, the house Manager agent selects only those registered KB agents that have data pertaining to the request. This means that the agent may select only those KB agents within the user's state. A query is formed and sent to each selected KB agent. The KB agent executes the query and returns the results to the Manager agent. Once the Manager agent receives the results from the KB agents, it can send the results directly to the user or do further integration, analysis, or formatting of the results according to the user's preferences that were sent from the UI agent. Manager agents can produce high-level knowledge using conventional data-mining techniques, such as clustering, decision trees, and statistical methods. This knowledge is then sent to the UI agent, which displays it according to the user's request.

The implementation of the proposed architecture needs to focus on flexibility, standardization, and independence. Agent modeling and interoperation can be done with KQML (Knowledge Query and Manipulation Language), which "is part of a larger effort, the ARPA Knowledge Sharing Effort, which is aimed at developing techniques and methodologies for building large-scale knowledge bases which are sharable and reusable" [Mayfield web]. Agent communication should follow the FIPA (Foundation for Intelligent Physical Agents) ACL (agent communication language) standard [FIPA web]. The main programming language should be Java, and CORBA can be used to support and interoperate with other languages that are possibly used by the KB agents. CORBA is a powerful tool for interfacing agents with legacy systems. Very often an agent queries data sources that are older, entrenched systems, which do not readily interoperate with modern protocols. By defining CORBA interfaces for both agents and data sources, we can simplify the task of making these data sources accessible. Moreover, CORBA simplifies the interfacing with newer systems since CORBA is an established standard. The learning time for adding a data site to the agent network can thus be substantially shortened.

Our proposed ABKD architecture handles distributed, dynamic, heterogeneous data by having mutable KB agents that are independent of other KB agents. Our overall goal is to provide a flexible framework to support the implementation of various techniques. Depending on the domain and the type of request made, either Manager agents or UI agents can integrate data. In addition, decentralized coordination allows the system to remain functional despite breakdowns in networks or agents. What is more, the administrator can choose whether to have KB agents mine the data beforehand or only when a query is made. A subset of the UI agents can evolve into a system that finds further topics or items of interest based on the user model.


Future Works

Until recently, there has been a lack of standards for inter-agent communication among different agent systems. Thus existing ABKD research did not consider the possibility of working with one another. In addition, present works on ABKD systems either support only collective data-mining algorithms specially designed for distributed execution, or provide a general framework for the reuse of conventional machine-learning algorithms. One potential research direction is to develop an ABKD system supporting both types of algorithms, so that researchers can evaluate the two types of algorithms on the same basis. Besides, researchers may try to build an ABKD system that works with other systems from different research teams.

Another observation is that most existing ABKD systems perform only one type of data-mining task,
such as classification. Building multi-purpose ABKD systems will allow researchers to reuse the
expensive agent platform implementation.

Moreover, publications in ABKD research rarely detail the implementation of the underlying agent
systems. Since editors in general prefer new design ideas to implementation details, most papers
describe only the architecture of the agent systems and how agents interact. It would be helpful if
researchers shared their experience of building prototype ABKD systems, so that novices in ABKD
research need not start from scratch. Better still, researchers may provide technical reports on their
websites detailing the design alternatives they considered but decided not to adopt, so that others
can benefit from the previous work.

Last but not least, there is no common methodology for benchmarking the various ad hoc
implementations of agent-based systems. Also, researchers of ABKD systems in most cases did not
publish performance-evaluation data for their projects. Building on existing research that establishes
metrics for multi-agent systems [Lam and Barber 2000], researchers may develop a generic test bed
for ABKD systems that helps to compare the various ABKD architectures. The development of such a
test bed can lead to the identification and recommendation of a suitable architecture for each
application domain.



Conclusion

This paper introduced the idea of ABKD systems, which allow the mining of distributed,
heterogeneous data with relative ease. In addition, the use of agent technology enhances parallel
execution of data-mining processes. Nevertheless, ABKD is not a panacea for problems inherent in a
particular data-mining technique.

Next, the paper presented a set of metrics for evaluating ABKD systems, and then evaluated present
work in ABKD research. It examined ABKD systems built for a specific application domain and those
built for general data-mining applications. We also took a look at an architecture that uses an
existing commercial package instead of building its own agent infrastructure. In general, most of the
existing ABKD systems used multiple layers of agents to handle different levels of data-mining tasks.

After that, the paper described the desired characteristics of an ideal ABKD architecture in terms of
its functionalities, resource requirements, and processing of results. Since some of the desired
characteristics conflict, researchers need to make tradeoffs when trying to incorporate them into
their own systems.

Finally, the paper identified some potential directions for ABKD research, such as building ABKD
systems that support both dedicated and conventional data-mining algorithms, developing ABKD
systems for multiple categories of data-mining tasks, and implementing a test bed for evaluating
different kinds of ABKD architectures with respect to different application domains.



References

Balakrishnan, K. and Honavar, V. (1998). Intelligent Diagnosis Systems. Journal of Intelligent Systems.
In press.

Chan, P. and Stolfo, S. (1996). Sharing learned models among remote database partitions by local
learning. In Proceedings of the Second International Conference on Knowledge Discovery and Data
Mining, 2

CAS. Shiratori Lab: New Projects.

Das, B. and Kocur, D. (1997). Experiments in Using Agent-Based Retrieval from Distributed and
Heterogeneous Databases. In Knowledge and Data Engineering Exchange Workshop, 27

Davies, W. H. E. and Edwards, P. (1995A). Agent-Based Knowledge Discovery. In Working Notes of
the AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed
Environments.

Davies, W. H. E. and Edwards, P. (1995B). Distributed Learning: An Agent-Based Approach to Data
Mining. In Proceedings of ML95 Workshop on Agents that Learn from Other Agents.

Domingos, P. (1997). Knowledge acquisition from examples via multiple models. In International
Conference on Systems, Man and Cybernetics.

FIPA. Foundation for Intelligent Physical Agents.

Hall, L. O., Chawla, N., and Bowyer, K. W. (1998). Combining Decision Trees Learned in Parallel. In
Distributed Data Mining Workshop at KDD

Hayes, C. C. (1999). Agents in a Nutshell - A Very Brief Introduction. In IEEE Transactions on
Knowledge and Data Engineering, Vol. 11, No. 1, Jan/Feb 1999.

Hershberger, D. and Kargupta, H. (1999). Distributed Multivariate Regression Using Wavelet-Based
Collective Data Mining. In Special Issue on Parallel and Distributed Data Mining of the Journal of
Parallel and Distributed Computing. Kumar, V., Ranka, S., and Singh, V. (Ed.) (In press). Also available
as Technical Report EECS

Honavar, V. (1994). Toward Learning Systems That Use Multiple Strategies and Representations. In
Artificial Intelligence and Neural Networks: Steps Toward Principled Integration, pp. 615. Honavar, V.
and Uhr, L. (Ed.). New York: Academic Press.

Honavar, V. (1998). Inductive Learning: Principles and Applications. In Intelligent Data Analysis in
Science. Cartwright, H. (Ed.). London: Oxford University Press.

JAM (A). The JAM Project Overview.

JAM (B). Software Download for the JAM Project.

Johnson, E. and Kargupta, H. (1999). Collective, Hierarchical Clustering from Distributed,
Heterogeneous Data. In Large-Scale Parallel KDD Systems, Lecture Notes in Computer Science,
Springer-Verlag. Zaki, M. and Ho, C. (Ed.).

Kargupta, H. Distributed Knowledge Discovery from Heterogeneous Sites.

Kargupta, H., Hamzaoglu, I., and Stafford, B. (1999). Scalable, Distributed Data Mining Using an
Agent-Based Architecture. In Proceedings of Knowledge Discovery and Data Mining. Heckerman, D.,
Mannila, H., Pregibon, D., and Uthurusamy, R. (Ed.). AAAI Press. 211

Kargupta, H., Park, B., Hershberger, D., and Johnson, E. (1999). Collective Data Mining: A New
Perspective Toward Distributed Data Mining. Submitted for publication in Advances in Distributed
Data Mining. Kargupta, H. and Chan, P. (Ed.). AAAI Press.

Los Alamos National Laboratory. Parallel Data Mining Agents. http://www

Lam, D. N. and Barber, K. S. Tracing Dependencies of Strategy Selections in Agent Design. To be
published in AAAI-2000, 17th National Conference on AI.

Mayfield, J., Labrou, Y., and Finin, T. Desiderata for Agent Communication Languages.
ot.html. University of Maryland Baltimore

MCC (A). Who Will Use InfoSleuth and For What. Last updated February 10,

MCC (B). Project Documents. ojects/env/eden/docs/fact.html, last updated October 11, 1999.

Miller, L., Honavar, V., and Barta, T. A. (1997). Warehousing Structured and Unstructured Data for
Data Mining. In Proceedings of the American Society for Information Science Annual Meeting (ASIS
97). Washington, D.C.

Nodine, M., Bohrer, W., and Ngu, A. (1998). Semantic brokering over dynamic heterogeneous data
sources in InfoSleuth. MCC Technical Report. Submitted to ICDE '99.

Okada, R., Lee, E., and Shiratori, N. (1996). Agent Based Approach for Information Gathering on
Highly Distributed and Heterogeneous Environment. In Proc. 1996 International Conference on
Parallel and Distributed Systems.

Parekh, R. and Honavar, V. (1998). Constructive Theory Refinement in Knowledge Based Neural
Networks. In Proceedings of the International Joint Conference on Neural Networks, Anchorage,

Parekh, R., Yang, J., and Honavar, V. (1998). Constructive Neural Network Learning Algorithms for
Multi-Category Pattern Classification. In IEEE Transactions on Neural Networks.

Rannar, S., MacGregor, J. F., and Wold, S. (1998). Adaptive Batch Monitoring using Hierarchical PCA.
In Chemometrics & Intelligent Laboratory Systems.

Sian, S. (1991). Extending Learning to Multiple Agents: Issues and a Model for Multi-Agent Machine
Learning (MA-ML). In Proceedings of the European Working Session on Learning. Kodratoff, Y. (Ed.),
Springer-Verlag, 458

Stolfo, S., Prodromidis, A., Tselepis, S., Lee, W., Fan, D., and Chan, P. (1997). JAM: Java agents for
meta-learning over distributed databases. In Proceedings of the 3rd International Conference on
Knowledge Discovery and Data Mining, pages 74, Newport Beach, CA. AAAI Press.

Unruh, A., Martin, G., and Perry, B. (1998). Getting only what you want: Data mining and event
detection using InfoSleuth agents. Technical Report MCC-98, MCC InfoSleuth Project.

Williams, G. (1990). Inducing and Combining Multiple Decision Trees. Ph.D. Dissertation, Australian
National University, Canberra, Australia.

Yang, J. and Honavar, V. (1998). Feature Subset Selection Using a Genetic Algorithm. In Feature
Extraction, Construction, and Subset Selection: A Data Mining Perspective. Motoda, H. and Liu, H.
(Ed.). New York: Kluwer. A shorter version of this paper appears in IEEE Intelligent Systems (Special
Issue on Feature Transformation and Subset Selection).

Yang, J. and Honavar, V. (1998). DistAl: An Inter-Pattern Distance Based Constructive Neural Network
Learning Algorithm. In Intelligent Data Analysis. In press. A preliminary version of this paper appears
in [IJCNN98].

Yang, J., Pai, P., Honavar, V., and Miller, L. (1998). Mobile Intelligent Agents for Document
Classification and Retrieval: A Machine Learning Approach. In Proceedings of the European
Symposium on Cybernetics and Systems Research. In press.

Yang, J., Honavar, V., Miller, L., and Wong, J. (1998). Intelligent Mobile Agents for Information
Retrieval and Knowledge Discovery from Distributed Data and Knowledge Sources. In Proceedings of
the IEEE Information Technology Conference. Syracuse, NY.