IADIS MULTI CONFERENCE ON COMPUTER SCIENCE
AND INFORMATION SYSTEMS
21 - 23 June
Proceedings
Edited by:
António Palma dos Reis
international association for development of the information society
Algarve, Portugal
of
INTELLIGENT SYSTEMS AND AGENTS 2009
IADIS INTERNATIONAL CONFERENCE
INTELLIGENT SYSTEMS AND
AGENTS 2009
part of the
IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND
INFORMATION SYSTEMS 2009
PROCEEDINGS OF THE
IADIS INTERNATIONAL CONFERENCE
INTELLIGENT SYSTEMS AND
AGENTS 2009
part of the
IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND
INFORMATION SYSTEMS 2009
Algarve, Portugal
JUNE 21 - 23, 2009
Organised by
IADIS
International Association for Development of the Information Society
Copyright 2009
IADIS Press
All rights reserved
This work is subject to copyright. All rights are reserved, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other way, and storage in data banks.
Permission for use must always be obtained from IADIS Press. Please contact secretariat@iadis.org
Intelligent Systems and Agents Volume Editor:
António Palma dos Reis
Computer Science and Information Systems Series Editors:
Piet Kommers, Pedro Isaías and Nian-Shing Chen
Associate Editors: Luís Rodrigues and Patrícia Barbosa
ISBN: 9789728924874
TABLE OF CONTENTS
FOREWORD ix
PROGRAM COMMITTEE
xiii
KEYNOTE LECTURE
xv
FULL PAPERS
COMBINING FUZZY DOMINANCE BASED PSO AND GRADIENT DESCENT
FOR EFFECTIVE PARAMETER ESTIMATION OF GENE REGULATORY
NETWORKS
Sanjoy Das, Karim Morcos and Stephen M. Welch
3
TREE-STRUCTURE-AWARE GENETIC OPERATORS IN GENETIC
PROGRAMMING
Kisung Seo and Chulhyuk Pang
11
CONTENT AND COMMUNICATION BASED SUBCOMMUNITY DETECTION
USING PROBABILISTIC TOPIC MODELS
Alexandru Berlea, Markus Döhring and Nicolai Reuschling
19
THE GENE EXPRESSION PROGRAMMING APPLIED TO THE SEASONAL
DEMAND FORECAST
Evandro Bittencourt, Raul Landmann, Paulo César Oliveira, Sidney Schossland, Edson Wilson
Torrens and Jerzy Wyrebski
27
USING “SOCIAL ACTIONS” AND RL-ALGORITHMS TO BUILD POLICIES IN
DEC-POMDP
Thomas Vincent and Akplogan Mahuna
35
AGENT-BASED LEARNING MANAGEMENT SYSTEMS: UPSIDES AND
CHALLENGES FOR SUPPORTING USERS
Shenghua Liu, Ari Wahlstedt and Anne Honkaranta
43
A NOVEL METHOD FOR IRIS FEATURE EXTRACTION BASED ON
CONTOURLET TRANSFORM AND CO-OCCURRENCE MATRIX
Amir Azizi and Hamid Reza Pourreza
53
IRRIGATION AND FERTILIZATION EXPERT SYSTEM FOR VEGETABLES
BASED ON GEOGRAPHICAL INFORMATION SYSTEM
Mostafa Mahmoud
61
HYBRID SYSTEM BASED ON ROUGH SETS AND WAVELET NEURAL
NETWORKS
Yasser F. Hassan
69
A METHOD FOR COMBINING INSTANCE SELECTION ALGORITHMS
Yoel Caises, Antonio González, Enrique Leyva and Raúl Pérez
77
EXTRACTING RULE SUBSETS IN A GENETIC ITERATIVE MODEL
Yoel Caises, Antonio González, Enrique Leyva and Raúl Pérez
85
ELIMINATING BORDER INSTANCES TO AVOID OVERFITTING
Khalil el Hindi and Mousa AL-Akhras
93
MULTI AGENT SYSTEM INTEGRATING NATURALISTIC DECISION ROLES:
APPLICATION TO MARITIME TRAFFIC
Thierry Le Pors, Thomas Devogele and Christine Chauvin
100
COORDINATED MULTIAGENT BASED FRAMEWORK FOR PATIENT AND
RESOURCE SCHEDULING
E. Grace Mary Kanaga, M.L. Valarmathi and Preethi S.H. Darius
108
MASITS – A MULTIAGENT BASED INTELLIGENT TUTORING SYSTEM
DEVELOPMENT METHODOLOGY
Egons Lavendelis and Janis Grundspenkis
116
SPECIFYING AND VALIDATING THE AGENT PERFORMANCE EVALUATION
METHODOLOGY: THE SYMBIOSIS USE CASE
Christos Dimou, Fani A. Tzima, Andreas L. Symeonidis, and Pericles. A. Mitkas
125
AN EXTENSION OF THE CLIENT – SERVER MODEL TO THE MOBILE AGENTS:
THE SELLER – BUYER MODEL
Djamel Eddine Menacer, Habiba Drias and Christophe Sibertin-Blanc
133
A HARDWARE BASED APPROACH FOR PROTECTING MULTIAGENT
SYSTEMS
Antonio Muñoz, Antonio Maña and Marioli Montenegro
141
INTEGRATING ANT COLONY OPTIMIZATION IN A MOBILE-AGENT BASED
RESOURCE DISCOVERY ALGORITHM
Yasushi Kambayashi and Yoshikuni Harada
149
A MULTIAGENT ARCHITECTURE FOR AUGMENTED REALITY
APPLICATIONS
J. A. Mateos Ramos, D. Vallejo Fernández, I. Arriaga Sánchez, C. González Morcillo
159
ALTERNATIVE NEIGHBORHOOD CONFIGURATIONS IN AN ABMS MODEL TO
ESTIMATE THE ADOPTION OF TELECENTERS IN BRAZIL
Ismael Mattos A. Ávila, Luiz Acácio G. Rolim and Giovanni M. Holanda
167
MODELING RESPONSIBILITY IN ORGANIZATIONS
Lambèr Royakkers and Maarten Verkerk
177
SHORT PAPERS
FOCUSED TIME-DELAY NEURAL NETWORK MODELING TOWARDS TYPING
STREAM PREDICTION
Jun Li, Karim Ouazzane, Hassan Kazemian, Yanguo Jing and Richard Boyd
189
USE OF THE NEURAL NETWORK FOR ESTIMATING THE MUD'S QUANTITY
GENERATED BY EFFLUENT TREATMENT STATIONS: A CASE STUDY
Paulo Bousfield, Cladir Zanottelli, Paulo Olivieira, Cátia Ganske, Sidney Schossland and Edson
Torrens
195
TOPOLOGICAL APPROACH FOR ROBUST INTERPOLATION OF SPEECH
SPECTRA
Yoshinao Shiraki
200
APPLICATION OF NEURAL NETWORK TO PREDICT ADVERSE SITUATIONS IN
TROUBLE TICKETING REPORTS
Julia Gómez, Yaiza Temprado, Margarita Gallardo, Carolina García and Francisco Javier
Molinero
204
SUMMAVILLE: AN AUTOMATIC AND WEB-BASED NEWS STORIES
SUMMARIZER
Paulo C F de Oliveira, Edson Wilson Torrens, Paulo Bousfield, Sidney Schossland, Evandro
Bittencourt and Raul Landmann
209
CLASSIFICATION OF SERIOUS SEXUAL ASSAULT USING FUZZY
CLUSTERING
Don Casey and Phillip Burrell
215
AN ARCHITECTURE FOR AN AGENT-BASED COOPERATIVE SYSTEMS
Ebrahim Alhashel, Masoud Mohammadian, Bala Balachandran and Dharmendra Sharma
219
FINITE, REASONING AND INTERACTING AGENTS
Michal Walicki and Paul Simon Svanberg
225
AMBIENT ACTIVITY RECOGNITION: A POSSIBILISTIC APPROACH
Patrice C. Roy, Bruno Bouchard, Abdenour Bouzouane and Sylvain Giroux
231
EVACUATION BEHAVIORS IN AN EMERGENCY STATION BY AGENT-BASED
APPROACH
Kazuki Satoh, Toru Takahashi, Takashi Yamada, Atsushi Yoshikawa and Takao Terano
236
DOCOPT A NEW METHOD FOR DISTRIBUTED CONSTRAINT SATISFACTION
AND OPTIMIZATION PROBLEMS RESOLUTION
Kais Ben Salah and Khaled Ghedira
242
AN HYBRID APPROACH FOR FAULT RESISTANCE IN MULTIAGENT
SYSTEMS
Mounira Bouzahzah and Ramdane Maamri
247
POSTERS
CONTEXT MANAGEMENT AND USER PREFERENCE LEARNING IN SMART
HOME ENVIRONMENTS
Víctor M. Peláez Martínez, Luis Ángel San Martín Rodríguez, Roberto González Rodríguez and
Vanesa Lobato Rubio
255
ROSES: AN EXPERT SYSTEM FOR DIAGNOSING SIX NEUROLOGIC DISEASES
IN CHILDREN
Sayed Yousef Monir Vaghefi and Touran Mahmoudian Isfahani
259
NEURAL NETWORKS AS IMPROVING TOOLS FOR AGENT BEHAVIOR
Alketa Hyso, Eva Çipi and Betim Çiço
261
AUTHOR INDEX
FOREWORD
These proceedings contain the papers of the IADIS International Conference on Intelligent
Systems and Agents 2009, which was organised by the International Association for
Development of the Information Society in Algarve, Portugal, 21 – 23 June, 2009. This
conference is part of the Multi Conference on Computer Science and Information Systems
2009, 17 - 23 June 2009, which had a total of 1131 submissions.
The IADIS Intelligent Systems and Agents conference addresses in detail two main aspects:
intelligent systems and agents. The conference aims to contribute to both academics and
practitioners, so both fundamental and applied research are considered relevant.
Submissions were accepted under the following areas and topics:
Area 1 – Intelligent Systems
 Algorithms
 Artificial Intelligence
 Automation Systems and Control
 Bio Informatics
 Computational Intelligence
 Expert Systems
 Fuzzy Technologies and Systems
 Game and Decision Theories
 Intelligent Control Systems
 Intelligent Internet Systems
 Intelligent Software Systems
 Intelligent Systems
 Machine Learning
 Neural Networks
 Neurocomputers
 Optimization
 Parallel Computation
 Pattern Recognition
 Robotics and Autonomous Robots
 Signal Processing
 Systems Modelling
 Web Mining
Area 2 – Agents
 Adaptive Agent Systems
 Agent Applications
 Agent Communication
 Agent Development
 Agent middleware
 Agent Models and Architectures
 Agent Ontologies
 Agent Oriented Systems and Engineering
 Agent Programming, Languages and Environments
 Agent Systems
 Agent Technologies
 Agent Theories
 Agent Trends
 Agents Analysis and Design
 Agents and Learning
 Agents and Ubiquitous Computing
 Agents in Networks
 Agents Protocols and Standards
 Artificial Systems
 Computational Complexity
 e-Commerce and Agents
 Embodied Agents
 Mobile Agents
 Multi-Agent Systems
 Negotiation Strategies
 Performance Issues
 Security, Privacy and Trust
 Semantic Grids
 Simulation
 Web Agents
The IADIS Intelligent Systems and Agents 2009 conference received 103 submissions from
more than 27 countries. Each submission was anonymously reviewed by an average of
four independent reviewers, to ensure that accepted submissions were of a high standard.
Consequently, only 22 full papers were approved, which means an acceptance rate below
22%. A few more papers were accepted as short papers and posters. An extended version of
the best papers will be published in the IADIS International Journal on Computer Science
and Information Systems (ISSN: 1646-3692) and also in other selected journals, including
journals from Inderscience.
Besides the presentation of full papers, short papers and posters, the conference also
included one keynote presentation from an internationally distinguished researcher. We
would therefore like to express our gratitude to Dr. Ronald R. Yager, Machine Intelligence
Institute, Iona College, New York for accepting our invitation as keynote speaker.
As we all know, organising a conference requires the effort of many individuals. We would
like to thank all members of the Program Committee, for their hard work in reviewing and
selecting the papers that appear in the proceedings.
This volume has taken shape as a result of the contributions from a number of individuals.
We are grateful to all authors who have submitted their papers to enrich the conference
proceedings. We wish to thank all members of the organizing committee, delegates,
invitees and guests whose contribution and involvement are crucial for the success of the
conference.
Last but not least, we hope that everybody will have a good time in Algarve, and we
invite all participants to next year's edition of the IADIS International Conference on
Intelligent Systems and Agents 2010, which will be held in Freiburg, Germany.
António Palma dos Reis,
ISEG - Technical University of Lisbon,
Portugal
Intelligent Systems and Agents 2009 Conference Program Chair
Piet Kommers, University of Twente, The Netherlands
Pedro Isaías, Universidade Aberta (Portuguese Open University), Portugal
Nian-Shing Chen, National Sun Yat-sen University, Taiwan
MCCSIS 2009 General Conference Co-Chairs
Algarve, Portugal
June 2009
PROGRAM COMMITTEE
INTELLIGENT SYSTEMS AND AGENTS CONFERENCE
PROGRAM CHAIR
António Palma dos Reis, ISEG - Technical University of Lisbon, Portugal
MCCSIS GENERAL CONFERENCE CO-CHAIRS
Piet Kommers, University of Twente, The Netherlands
Pedro Isaías, Universidade Aberta (Portuguese Open University), Portugal
Nian-Shing Chen, National Sun Yat-sen University, Taiwan
INTELLIGENT SYSTEMS AND AGENTS CONFERENCE COMMITTEE
MEMBERS
Adel M. Alimi, University of Sfax, Tunisia
Adina Magda Florea, University "Politehnica" of Bucharest, Romania
Agris Nikitenko, Riga Technical University, Latvia
Alessandro Ricci, Università di Bologna in Cesena, Italy
Alfredo Garro, Universita' della Calabria, Italy
Andrea Addis, University of Cagliari, Italy
Angel García-Olaya, Universidad Carlos III de Madrid, Spain
Anton Bogdanovych, UTS, Australia
Anton Nijholt, University of Twente, The Netherlands
Costin Badica, University of Craiova, Romania
Dariusz Krol, Wroclaw University of Technology, Poland
David A. Pelta, University of Granada, Spain
Dickson K.W. Chiu, Computer Systems, Hong Kong
Dídac Busquets, Universitat de Girona, Spain
Djamila Ouelhadj, ASAP Research Group, UK
Eloisa Vargiu, DIEE - University of Cagliari, Italy
Ezendu Ariwa, London Metropolitan University, United Kingdom
Fariba Sadri, Imperial College London, UK
Federico Bergenti, Università degli Studi di Parma, Italy
Federico Castanedo Sotela, Universidad Carlos III de Madrid, Spain
Fikret Ercal, University of Missouri, USA
Gerard Murray, Port of Melbourne - Boskalis Australia Alliance, Australia
Giovanni Semeraro, University of Bari, Italy
Giuseppe Mangioni, Universita di Catania, Italy
Hans Werner Guesgen, Massey University, New Zealand
Haralambos Mouratidis, University of East London, United Kingdom
Heinrich C. Mayr, Alpen-Adria-Universitaet Klagenfurt, Austria
Huiye Ma, Centrum voor Wiskunde en Informatica (CWI), The Netherlands
Jackeline Spinola de Freitas, Universidad Politécnica de Madrid, Spain
Jaime Ramírez, Universidad Politécnica de Madrid, Spain
Javier Carbo Rubiera, Univ. Carlos III de Madrid, Spain
Jesualdo Tomás Fernández Breis, University of Murcia, Spain
Jim Cunningham, Imperial College, UK
Jorge A. Ramírez-Uresti, ITESM-CEM, Mexico
Jørgen Villadsen, Technical University of Denmark, Denmark
José Antonio Iglesias, University of Carlos III, Spain
José Carlos Cortizo Pérez, Universidad Europea de Madrid, Spain
José Manuel Molina López, Universidad Carlos III de Madrid, Spain
Juan Manuel Serrano, Universidad Rey Juan Carlos, Spain
Julius Stuller, Academy of Sciences of the Czech Republic, Czech Republic
Krysia Broda, Imperial College, UK
Lars Nolle, Nottingham Trent University, UK
Laura Naismith, McGill University, Canada
Laurent Vercouter, Ecole des Mines de Saint-Etienne, France
Leonardo Garrido, Tecnologico de Monterrey, México
Longbing Cao, Univ of Technology, Sydney, Australia
Maite López Sánchez, University of Barcelona, Spain
Marc Esteva, University of Technology, Sydney, Australia
Maria Bielikova, Slovak University of Technology, Slovakia
Maria Salamó Llorente, University of Barcelona, Spain
Marko Grobelnik, Jozef Stefan Institute, Slovenia
Matjaz Gams, Jozef Stefan Institute, Slovenia
Mengjie Zhang, Victoria University of Wellington, New Zealand
Michelangelo Ceci, Università degli Studi di Bari, Italy
Miguel Angel Patricio, Universidad Carlos III de Madrid, Spain
Miguel González Mendoza, ITESM-CEM, Mexico
Mirjana Ivanovic, University of Novi Sad, Serbia
Nesrine Baklouti, University of Sfax, Tunisia
Nizar Rokbani, REGIM, Tunisia
P.K. Mahanti, University of New Brunswick, Canada
Paolo Petta, Institute of Medical Cybernetics and Artificial Intelligence, Austria
Patrick Wong, Open University, United Kingdom
Rainer Hilscher, New Vectors LLC, USA
Ramon Brena Pinero, Tecnológico de Monterrey, Mexico
Raúl Arrabales Moreno, Universidad Carlos III de Madrid, Spain
Raymond Chiong, Swinburne University of Technology, Malaysia
Razvan Andonie, Central Washington University, USA
Ricardo Imbert, Universidad Politécnica de Madrid, Spain
Roland Kaschek, Massey University, New Zealand
Roman Neruda, Academy of Sciences of the Czech Republic, Czech Republic
Stuart Chalmers, University of Aberdeen, UK
Sviatoslav Braynov, University of Illinois, USA
Thierry Moyaux, Université de Lyon, France
Tomas Klos, Delft University of Technology, The Netherlands
Vincent Thomas, LORIA, France
Viorel Negru, West University of Timisoara, Romania
William Song, Durham University, UK
Yubin Yang, Nanjing University, China
Zoran Budimac, University of Novi Sad, Serbia
KEYNOTE LECTURE
LEARNING METHODS FOR EVOLVING INTELLIGENT
SYSTEMS AND AGENTS
Ronald R. Yager
Machine Intelligence Institute
Iona College, New York
ABSTRACT
In this presentation our concern is with technologies that allow the construction of intelligent
systems and agents that can evolve and learn based on experiences. We discuss a number of
technologies that support this capability: the participatory learning paradigm, the hierarchical
prioritized structure and the mountain clustering method. The basic premise of the participatory
learning paradigm is that learning takes place in the framework of what is already learned and
believed. The implication of this is that every aspect of the learning process is affected and guided
by the current belief system. The name, participatory learning, highlights the fact that in learning
we are in a situation in which the current knowledge of what we are trying to learn participates in
the process of learning about itself. The hierarchical prioritized structure provides a generalization
of fuzzy systems modeling by introducing a hierarchical representation of the rules. It supports
systems evolution by allowing the learning of new rules and their insertion at different levels
of the hierarchy.
Full Papers
COMBINING FUZZY DOMINANCE BASED PSO AND
GRADIENT DESCENT FOR EFFECTIVE PARAMETER
ESTIMATION OF GENE REGULATORY NETWORKS
Sanjoy Das, Karim Morcos
Electrical & Computer Engineering Department
Stephen M. Welch
Division of Agronomy
Kansas State University
Manhattan, KS 66506
USA
ABSTRACT
Stochastic optimization techniques such as multi-objective PSO are very useful in determining the parameters of gene
regulatory network models. Unfortunately, evaluating the performance of such a model with a set of parameters is
computationally expensive. The fuzzy ε-dominance based PSO algorithm is a recent approach that is particularly well
suited for these modeling tasks, achieving convergence to the Pareto front with a relatively small number of function
evaluations. In order to further reduce the function evaluations, this paper considers ways to incorporate explicit gradient
descent steps within this algorithm. As a case study, the performance of the proposed approach is investigated to compute
the parameters of a differential equation model of Arabidopsis flowering time control.
KEYWORDS
PSO, multi-objective, gradient, optimization, genomics, Arabidopsis.
1. INTRODUCTION
Particle Swarm Optimization (PSO) is a population-based approach that maintains a set of candidate
solutions, called particles, which are allowed to move within the search space (Clerc & Kennedy, 2002). The
trajectory followed by each particle is guided by its own memory, as well as by its interaction with other
particles. The specific method of adjusting a particle's trajectory is motivated by the interaction of birds,
fish, or other organisms that move in swarms. Eventually, the particles converge to suitable optima. We
will use the terms particle and solution interchangeably henceforth.
Multi-objective optimization has been the focus of much recent research (Das & Panigrahi, 2008). Unlike
in single-objective optimization, where it is easy to compare one solution to another, in multi-objective
problems a solution that is inferior to another in one objective may in fact be better in another. Under
these circumstances, the concept of Pareto optimality is used. Several multi-objective versions of PSO have
been recently proposed (Coello et al., 2004, Koduru et al., 2007, Das & Panigrahi, 2008).
While biologically motivated algorithms such as PSO are very effective in providing optimal solutions,
they can be further improved by maintaining the correct balance between exploration and exploitation (Das,
2008). Adding an exploitative component allows the algorithm to make use of local information to guide the
search towards better regions in the search space. This property lets the algorithm converge towards the
Pareto front using fewer function evaluations, a much-desired characteristic in applications such as gene
regulatory network modeling, where a substantial amount of computation is involved in evaluating each
objective function (cf. Cai et al., 2007). Local search techniques have been successfully included within
genetic algorithms (Koduru et al., 2008). PSO hybrid algorithms for single-objective optimization (Das et al.,
2006), and more recently, multi-objective optimization (Koduru et al., 2007), have also made their
appearance. The approach proposed by Koduru et al. (2007) makes use of the concept of fuzzy ε-dominance
within PSO to accomplish better convergence, and is specifically suited for gene regulatory network
modeling. It also has been shown to outperform the popular NSGA-II (Deb et al., 2002), as well as the
multi-objective PSO (MOPSO) proposed by Coello et al. (2004).
In the above methods, local search techniques have relied on the derivative-free Nelder-Mead algorithm,
which can only approximate the true gradient of the objective. In this paper, a method to explicitly compute
the gradient of the objectives for a representative gene differential equation model, namely the genetic
network controlling flowering time in Arabidopsis, has been formulated. Using this result, the addition of
separate gradient descent steps to the PSO of Koduru et al. (2007) is considered. The simulation results
indicate that the inclusion of gradient descent within PSO improves the convergence of the optimization
algorithm.
2. FUZZY ε-DOMINANCE BASED PSO
2.1 PSO
We first describe the variant of the standard PSO algorithm used here. This algorithm maintains a population of N
particles whose positions, $X(i)$, $i = 1, 2, \ldots, N$, are initialized to random values. These positions are
incremented in each iteration t of the algorithm according to the instantaneous velocity $V_t(i)$, as follows,
$$X_{t+1}(i) = X_t(i) + V_t(i) \tag{1}$$
The velocity is also updated in each iteration, using the particle’s own recorded previous best position, as
well as the current location of the other particles. The update rule is given by,
$$V_{t+1}(i) = \chi\left(V_t(i) + C_1 \times U[0,1] \times (IB(i) - X_t(i)) + C_2 \times U[0,1] \times (GB - X_t(i))\right) \tag{2}$$
In the above equation, $C_1$ and $C_2$ are two constants, called the cognitive and the social constants, and $\chi$ is
a constriction coefficient that helps in maintaining stability (Clerc & Kennedy, 2002). The factor U[0,1] is a
uniformly distributed random number in [0, 1]. The quantity $IB(i)$ is the individual best recorded position of the
i-th particle so far, in terms of objective function. The other quantity, GB, is the global best position of any
particle (usually in the current iteration t). In this paper, these constants have been set to the following values:
$\chi = 0.4$, $C_1 = 2.1$ and $C_2 = 2.1$.
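As an illustration only, the two update rules can be sketched in Python for a single particle. The function name `pso_step` and the choice to draw U[0,1] independently for every coordinate are assumptions made for this sketch; the text does not specify how the random factors are sampled.

```python
import random

# Constants quoted in the text: constriction, cognitive and social coefficients.
CHI, C1, C2 = 0.4, 2.1, 2.1

def pso_step(x, v, ib, gb, rng=random):
    """One iteration for one particle; all arguments are equal-length lists.

    Assumption: U[0,1] is sampled independently per coordinate.
    """
    # Equation (1): the position moves by the current velocity V_t.
    new_x = [xj + vj for xj, vj in zip(x, v)]
    # Equation (2): the velocity is pulled towards the individual and global bests.
    new_v = [
        CHI * (vj + C1 * rng.random() * (ibj - xj) + C2 * rng.random() * (gbj - xj))
        for xj, vj, ibj, gbj in zip(x, v, ib, gb)
    ]
    return new_x, new_v
```

When a particle already sits on both its individual best and the global best, the attraction terms vanish and the velocity is simply scaled by the constriction coefficient.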
2.2 Multiobjective Optimization
When dealing with optimization problems with multiple objectives, the conventional concept of optimality
does not hold (Das & Panigrahi, 2008). Instead, the concepts of dominance and Pareto-optimality are applied.
Without loss of generality, let us assume that the multi-objective problem entails the simultaneous
minimization of all M objectives, $e_i(\cdot)$, $i = 1, \ldots, M$. Let the solution space be denoted as
$\Psi \subset \Re^n$. A solution $u \in \Psi$ is said to dominate another solution $v \in \Psi$ iff
$\forall i \in \{1, 2, \ldots, M\}$, $e_i(u) \le e_i(v)$, with at least one of the
inequalities being strict; i.e. u is as good as v for all objectives and better for at least one. This relationship
is written $u \prec v$. In the set of all feasible solutions, that subset whose members are not dominated is called
the Pareto set. In other words, if S is the population, the Pareto set is
$\{u \in S \mid \forall v \in S,\ \neg(v \prec u)\}$. Its
corresponding image in the space of all objective functions is known as the Pareto front.
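The dominance relation and the Pareto set defined above can be sketched as follows. The function names are illustrative and not taken from the paper; each solution is represented only by its vector of objective values, all of which are minimized.

```python
def dominates(eu, ev):
    """True if objective vector eu dominates ev: no worse everywhere, strictly better somewhere."""
    return all(a <= b for a, b in zip(eu, ev)) and any(a < b for a, b in zip(eu, ev))

def pareto_set(objs):
    """Indices of the members of objs that are not dominated by any other member."""
    return [i for i, eu in enumerate(objs)
            if not any(dominates(ev, eu) for ev in objs)]
```

For example, among the objective vectors (1, 4), (2, 2), (3, 3) and (4, 1), only (3, 3) is dominated (by (2, 2)), so the other three form the Pareto set.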
Since all the solutions in the Pareto set are non-dominated, they must be treated as equally good.
Therefore, the goal of an effective multi-objective optimization algorithm is to find candidate solutions
whose images in the objective function space are (i) as close to the true Pareto front as possible, and (ii)
as spread out and evenly spaced as possible, thereby sampling an extensive region of the Pareto
front. These two conditions are usually referred to as convergence and diversity respectively. Accomplishing
good convergence and diversity are the two crucial aspects of any multi-objective optimization algorithm,
including PSO. Fuzzy ε-dominance is a recently proposed scheme that combines convergence and diversity
into one single measure, allowing multi-objective optimization problems to be treated as though they
involved only a single objective. Fuzzy ε-dominance is an extension of fuzzy dominance that has been
modified to take into account diversity. Both are discussed next in this section.
2.3 Fuzzy Dominance based PSO
Given a monotonically non-decreasing function $\mu_i^{dom}(\cdot)$, whose range is in [0, 1],
$i \in \{1, 2, \ldots, M\}$, a solution $u \in \Psi$ is said to i-dominate solution $v \in \Psi$ if and only if
$e_i(u) < e_i(v)$. This relationship can be denoted as $u \succ_i^F v$. If $u \succ_i^F v$, the degree of fuzzy
i-dominance is equal to $\mu_i^{dom}(e_i(v) - e_i(u)) \equiv \mu_i^{dom}(u \succ_i^F v)$. Fuzzy
dominance can be regarded as a fuzzy relationship $u \succ_i^F v$ between u and v. Solution $u \in \Psi$ is said to
fuzzy dominate $v \in \Psi$ if and only if $\forall i \in \{1, 2, \ldots, M\}$, $u \succ_i^F v$. This relationship
can be denoted as $u \succ^F v$.
The degree of fuzzy dominance can be defined by invoking the concept of fuzzy intersection and using a
t-norm,
$$\mu^{dom}(u \succ^F v) = \bigcap_{i=1}^{M} \mu_i^{dom}(u \succ_i^F v) \tag{3}$$
In another implementation of fuzzy dominance (Koduru et al., 2008), the membership functions $\mu_i^{dom}(\cdot)$
were defined to be zero for negative arguments. Therefore, whenever $e_i(u) > e_i(v)$, the degree of fuzzy
dominance $u \succ_i^F v$ necessarily evaluated to zero. In this paper, we allow nonzero values in accordance with
Koduru et al. (2007). The membership functions used are trapezoidal, yielding nonzero values whenever
their arguments are to the right of a threshold ε, as shown in Figure 1 below.
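Mathematically, the memberships $\mu_i^{dom}(u \succ_i^F v)$ are defined as,
$$\mu_i^{dom}(\Delta e_i) = \begin{cases} 0 & \text{if } \Delta e_i \le -\varepsilon \\ (\Delta e_i + \varepsilon)/\Delta_i & \text{if } -\varepsilon < \Delta e_i < \Delta_i - \varepsilon \\ 1 & \text{if } \Delta e_i \ge \Delta_i - \varepsilon \end{cases} \tag{4}$$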
where $\Delta e_i = e_i(v) - e_i(u)$. Given a population of solutions $S \subset \Psi$, a solution $v \in S$ is
said to be fuzzy dominated in S iff it is fuzzy dominated by any other solution $u \in S$. In this case, the degree
of fuzzy dominance can be computed by performing a union operation over every possible $\mu^{dom}(u \succ^F v)$,
carried out using t-conorms as,
$$\mu^{dom}(S \succ^F v) = \bigcup_{u \in S} \mu^{dom}(u \succ^F v) \tag{5}$$
In this manner, each solution can be assigned a single measure to reflect the amount by which it is dominated
by others in a population. Better solutions within a set will be assigned lower fuzzy dominances, although unlike in
Koduru et al. (2008) non-dominated solutions may not necessarily be assigned zero values. The intersection and
union operators follow the standard min and max definitions, respectively (Mendel, 1995).
Figure 1. Fuzzy membership functions used here to compute ε-dominances
[Figure: trapezoidal membership μdom(e(v) − e(u)), with ε, Δ and 1 marked on the axes]
Typically, in multi-objective PSO, an archive of the best A solutions found is maintained. The
velocities of the particles in the population are redirected towards archive solutions. As newer solutions are
discovered, the best ones among them are inserted into the archive, while the older solutions are discarded. This
is also the strategy adopted here. In each iteration of the present scheme, the population of particles is merged
with the archive, and the fuzzy dominances computed. The A + N solutions are sorted in ascending order of
their fuzzy dominances, and the best A solutions are the archive for the next iteration. The global best used in
equation (2) is the archive solution with the lowest fuzzy dominance. The individual best $IB(i)$ is updated
only when the i-th particle dominates its own earlier stored individual best, in which case $X_t(i)$ replaces $IB(i)$.
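The archive step described above can be sketched as follows. The scoring callable `fuzzy_dominance` is a hypothetical stand-in for whatever routine assigns each merged solution its fuzzy dominance; any function that maps a list of solutions to a list of scores, with lower meaning better, will do for the purpose of this sketch.

```python
def update_archive(archive, population, fuzzy_dominance, size):
    """Merge archive and population, keep the `size` least-dominated solutions.

    Returns the new archive and the global best (its first, lowest-scoring entry).
    """
    merged = archive + population
    scores = fuzzy_dominance(merged)
    # Sort indices by ascending fuzzy dominance; ties keep the earlier index.
    ranked = sorted(range(len(merged)), key=lambda i: scores[i])
    new_archive = [merged[i] for i in ranked[:size]]
    return new_archive, new_archive[0]
```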
3. GENE NETWORK MODEL AND OVERALL APPROACH
3.1 Flowering Time Control Gene Network in Arabidopsis
In the Arabidopsis plant (A. thaliana), three genes TERMINAL FLOWERING 1 (TFL1), APETALA 1 (AP1),
and LEAFY (LFY) play a special role in flowering (Welch et al. 2003, 2005). OFF to ON state changes in two
of them (AP1 and LFY) signal plant commitment to flowering. One plausible mechanism for their interaction
involves a three-element positive feedback loop that incorporates a bistable switch as shown below,
$$\begin{aligned}
\frac{d}{dt}\mathit{LFY} &= R_L\, h_{up}(\mathit{SOC1} - \mathit{TFL1}, K_{LFY}) - \lambda_L\, \mathit{LFY} \\
\frac{d}{dt}\mathit{AP1} &= R_H\, h_{up}(\mathit{LFY}, K_{AP1}) - \lambda_H\, \mathit{AP1} \\
\frac{d}{dt}\mathit{TFL1} &= R_T\, h_{dwn}(\mathit{AP1}, K_{TFL1}) - \lambda_T\, \mathit{TFL1}
\end{aligned} \tag{6}$$
where $h_{up}$ and $h_{dwn}$ are, respectively, promotive (n = 3) and repressive (n = −3) Hill functions defined as,
$$h(x, K) = \frac{x^n}{x^n + K^n}. \tag{7}$$
The difference input to $h_{up}$ in the equation for LFY is restricted to positive values as negative
biochemical concentrations are impossible. The quantity K is replaced with $K_{LFY}$, $K_{AP1}$, or $K_{TFL1}$ for
the Hill functions pertaining to LFY, AP1, and TFL1 in equation (6). For clarity, the time argument has been dropped
in the equation, a convention that we shall follow for the remainder of the paper.
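A minimal sketch of the right-hand side of equation (6) is given below. It assumes that the repressive Hill function corresponds to a negative exponent in equation (7), which is rewritten here as K^n / (x^n + K^n) so that it stays defined at x = 0. The dictionary keys and function names are illustrative only.

```python
def h_up(x, k, n=3):
    """Promotive Hill function of equation (7)."""
    return x**n / (x**n + k**n)

def h_dwn(x, k, n=3):
    """Repressive Hill function: equation (7) with exponent -n, rearranged (assumed reading)."""
    return k**n / (x**n + k**n)

def rhs(state, soc1, p):
    """Time derivatives (dLFY/dt, dAP1/dt, dTFL1/dt) for the given state and SOC1 level."""
    lfy, ap1, tfl1 = state
    drive = max(soc1 - tfl1, 0.0)  # the difference input is restricted to non-negative values
    d_lfy = p["R_L"] * h_up(drive, p["K_LFY"]) - p["lam_L"] * lfy
    d_ap1 = p["R_H"] * h_up(lfy, p["K_AP1"]) - p["lam_H"] * ap1
    d_tfl1 = p["R_T"] * h_dwn(ap1, p["K_TFL1"]) - p["lam_T"] * tfl1
    return d_lfy, d_ap1, d_tfl1
```

The two Hill functions are complementary in this form: for equal arguments, `h_up(x, k) + h_dwn(x, k)` equals one.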
The TFL1 gene becomes increasingly active with time, in the developing shoot’s apex. In turn, this
influences the activity of LFY and AP1 in leaf primordia that sequentially emerge from the apex. It is not
known precisely how this molecular influence is exerted, but the net effect is to slow the plant’s development
toward flowering. Increasing levels of LFY and AP1 within each primordium offset this effect. Ultimately,
the switch changes state, causing the primordium within which this happens to initiate inflorescence
development. Equation (6) models all these effects as if they are direct, although this may not be the case in
a real plant.
External switch input is provided by the expression level of the SUPPRESSOR OF OVEREXPRESSION
OF CO 1 (SOC1) gene. An equation for the SOC1 expression level at time instant t is,
[Equation (8): SOC1(t) expressed in terms of t and the parameters a_s, b_s and p_s, including a sine term]
As done earlier in Koduru et al. (2008), synthetic data was generated covering a period of nine days and
the emergence of four leaf primordia. Each primordium was simulated for a total of 30 hrs of simulation time.
This data generated time varying expression levels for each gene, which shall be denoted as
$\mathit{LFY}_d$, $\mathit{AP1}_d$, $\mathit{TFL1}_d$, and $\mathit{SOC1}_d$. The values of the parameters used are
provided in Table 1 below. For further details of the implementation, the reader is referred to Koduru et al. (2008).
Table 1. Values of the parameters used to generate desired expression level data
Parameter  R_L  R_H  R_T  λ_L  λ_H  λ_T  K_LFY  K_AP1  K_TFL1  a_s   b_s    p_s
Value      1.0  1.0  1.0  1.0  1.0  1.0  0.4    0.9    0.3     0.01  0.007  24
3.2 Problem Formulation
The goal of the gene network problem is to estimate the values of the six variables, $R_L$, $R_H$, $R_T$,
$K_{LFY}$, $K_{AP1}$, and $K_{TFL1}$ so that the model, when simulated again, produces expression levels for
LFY and AP1 that are as close as possible to those stored earlier as $\mathit{LFY}_d$ and $\mathit{AP1}_d$.
In other words, the two objective functions to minimize are,
$$err_1 = \int (\mathit{LFY} - \mathit{LFY}_d)^2\, dt, \tag{9a}$$
and
$$err_2 = \int (\mathit{AP1} - \mathit{AP1}_d)^2\, dt. \tag{9b}$$
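On sampled trajectories, each of these integrals can be approximated numerically. The sketch below uses a simple rectangle rule with a fixed step `dt`; both the rule and the function name are assumptions made for illustration, not the integration scheme used by the authors.

```python
def squared_error_integral(simulated, target, dt):
    """Rectangle-rule approximation of the integral of (simulated - target)^2 over time."""
    return sum((s - d) ** 2 for s, d in zip(simulated, target)) * dt
```

Calling it once with the LFY trajectories and once with the AP1 trajectories gives the two objective values for a candidate parameter vector.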
Each solution, a particle in PSO, is therefore a six-dimensional vector
$X = [R_L\ R_H\ R_T\ K_{LFY}\ K_{AP1}\ K_{TFL1}]$.
This is a simpler version of the original formulation of the GRN2 problem defined in Koduru et al. (2008),
where a more extensive set of simulations was carried out, and all nine parameters were determined by the
algorithm. However, this reduced version is sufficient for the present study, as our aim is limited to studying
the effect of gradient descent within a version of PSO. Our goal is NOT to produce the most effective
multi-objective PSO algorithm, as was the focus in Koduru et al. (2008), where a very fast genetic algorithm is
reported. It should be noted that the modified version of the fuzzy dominance based PSO used here as the
main algorithm, within which gradient descent has been incorporated, has been shown to perform
significantly better than other competitive algorithms on several benchmarks (Koduru et al., 2007).
3.3 Gradient Descent Algorithm
Instead of the aggregate square errors, instantaneous values of the errors in Equation (9a,b) are considered.
These are,
(
)
2
1 d
LFYLFYerr −=
, (10a)
and
(
)
2
2 d
AP1AP1err −=
. (10b)
The gradient descent rule that is used for adaptation is,
$$ \frac{dX}{dt} = -\zeta \nabla(\mathit{err}_1 + \mathit{err}_2), \tag{11} $$
which replaces the usual rule $X = X - \zeta \nabla(\mathit{err})$. The quantity $\zeta$ is the usual adaptation
rate constant. The factor $\nabla(\mathit{err})$ is the vector derivative of $\mathit{err}$ with respect to $X$.
With respect to a generic parameter $P$, the scalarized version of Equation (11) is,
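$$ \frac{dP}{dt} = -\zeta \left( \frac{\partial \mathit{err}_1}{\partial P} + \frac{\partial \mathit{err}_2}{\partial P} \right)
 = -\zeta \left( (\mathit{LFY} - \mathit{LFY}_d) \frac{\partial \mathit{LFY}}{\partial P}
 + (\mathit{AP1} - \mathit{AP1}_d) \frac{\partial \mathit{AP1}}{\partial P} \right). \tag{12} $$
Here the constant factor of 2 that arises from differentiating the squared errors is absorbed into $\zeta$.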
The expression levels in Equation (6) can be rewritten as,
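$$ \begin{aligned}
\mathit{LFY} &= R_L \left( \frac{d}{dt} + \lambda_L \right)^{-1} h_{up}(\mathit{SOC1} - \mathit{TFL1}, K_{LFY}), \\
\mathit{AP1} &= R_H \left( \frac{d}{dt} + \lambda_H \right)^{-1} h_{up}(\mathit{LFY}, K_{AP1}), \\
\mathit{TFL1} &= R_T \left( \frac{d}{dt} + \lambda_T \right)^{-1} h_{dwn}(\mathit{AP1}, K_{TFL1}),
\end{aligned} \tag{13} $$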
where each $\left( \frac{d}{dt} + \lambda \right)^{-1}$ is a linear operator that is commutative with respect to
scalars as well as derivatives.
Clearly, the derivatives of $\mathit{AP1}$ with respect to $R_L$, $R_T$, $K_{LFY}$, and $K_{TFL1}$ are zero.
Furthermore, letting $\mathit{VAP1} = \partial \mathit{AP1} / \partial R_H$ and
$\mathit{WAP1} = \partial \mathit{AP1} / \partial K_{AP1}$, we get
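$$ \mathit{VAP1} = \left( \frac{d}{dt} + \lambda_H \right)^{-1} h_{up}(\mathit{LFY}, K_{AP1}), $$
and,
$$ \mathit{WAP1} = R_H \left( \frac{d}{dt} + \lambda_H \right)^{-1} \frac{\partial}{\partial K_{AP1}} h_{up}(\mathit{LFY}, K_{AP1}). $$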
Thus, the time evolution of the variables VAP1 and WAP1 follow the differential equations,
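$$ \frac{d\,\mathit{VAP1}}{dt} = -\lambda_H \mathit{VAP1} + h_{up}(\mathit{LFY}, K_{AP1}), $$
and,
$$ \frac{d\,\mathit{WAP1}}{dt} = -\lambda_H \mathit{WAP1} + R_H \frac{\partial}{\partial K_{AP1}} h_{up}(\mathit{LFY}, K_{AP1}). $$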
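Similarly, letting $\mathit{VLFY} = \partial \mathit{LFY} / \partial R_L$ and
$\mathit{WLFY} = \partial \mathit{LFY} / \partial K_{LFY}$, we can obtain,
$$ \frac{d\,\mathit{VLFY}}{dt} = -\lambda_L \mathit{VLFY} + h_{up}(\mathit{SOC1} - \mathit{TFL1}, K_{LFY}), $$
and,
$$ \frac{d\,\mathit{WLFY}}{dt} = -\lambda_L \mathit{WLFY} + R_L \frac{\partial}{\partial K_{LFY}} h_{up}(\mathit{SOC1} - \mathit{TFL1}, K_{LFY}). $$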
The derivatives of LFY with respect to the other parameters, as well as those of TFL1 are zero. Using
Equation (12), the set of gradient descent based adaptation rules can be readily obtained,
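$$ \begin{aligned}
\frac{dR_L}{dt} &= -\zeta (\mathit{LFY} - \mathit{LFY}_d)\, \mathit{VLFY}, \\
\frac{dR_H}{dt} &= -\zeta (\mathit{AP1} - \mathit{AP1}_d)\, \mathit{VAP1}, \\
\frac{dK_{LFY}}{dt} &= -\zeta (\mathit{LFY} - \mathit{LFY}_d)\, \mathit{WLFY}, \\
\frac{dK_{AP1}}{dt} &= -\zeta (\mathit{AP1} - \mathit{AP1}_d)\, \mathit{WAP1}.
\end{aligned} \tag{14} $$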
The adaptation rules for the other parameters cannot be obtained in this manner, and these parameters therefore
rely on PSO alone for improvement. When the adaptation rules are applied to PSO, the top M solutions of the
population are subjected to them in each iteration. Early studies by the authors of this paper, in which the
entire population was subject to adaptation, did not provide any significant gains and are not discussed any
further in this paper. PSO is run with and without gradient descent based parameter adaptation to investigate
the improvement in convergence speed gained when gradient descent is used.
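The adaptation of $R_H$ and $K_{AP1}$ in Equation (14) can be sketched by integrating the state AP1 and its sensitivities VAP1 and WAP1 side by side. This is only an illustration under stated assumptions: the saturating form chosen for `h_up`, the forward Euler scheme, the zero initial conditions, and all function names are hypothetical and are not taken from the paper, whose own `h_up` is defined in an earlier section.

```python
def h_up(x, k):
    # ASSUMED activation function; the paper's own h_up may differ
    return x / (k + x)


def dh_up_dk(x, k):
    # Partial derivative of the assumed h_up with respect to k
    return -x / (k + x) ** 2


def simulate_ap1(lfy, r_h, k_ap1, lam_h, dt):
    """Forward Euler solution of dAP1/dt = -lam_h*AP1 + r_h*h_up(LFY, k_ap1)."""
    ap1, out = 0.0, []
    for x in lfy:
        out.append(ap1)
        ap1 += dt * (-lam_h * ap1 + r_h * h_up(x, k_ap1))
    return out


def adapt_ap1_params(lfy, ap1_d, r_h, k_ap1, lam_h, zeta, dt):
    """One pass of the R_H and K_AP1 rules of Eq. (14) along a sampled trajectory."""
    ap1 = v = w = 0.0  # state AP1 and sensitivities VAP1, WAP1
    for x, target in zip(lfy, ap1_d):
        err = ap1 - target
        # Parameter rates from Eq. (14)
        d_rh = -zeta * err * v
        d_k = -zeta * err * w
        # State and sensitivity rates (the AP1 row of Eq. (13) and its derivatives)
        d_ap1 = -lam_h * ap1 + r_h * h_up(x, k_ap1)
        d_v = -lam_h * v + h_up(x, k_ap1)
        d_w = -lam_h * w + r_h * dh_up_dk(x, k_ap1)
        ap1 += dt * d_ap1
        v += dt * d_v
        w += dt * d_w
        r_h += dt * d_rh
        k_ap1 += dt * d_k
    return r_h, k_ap1
```

In a hybrid run, a routine of this kind would be applied only to the top M particles in each PSO iteration, with the remaining parameters left to PSO.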
4. RESULTS
The population size is fixed at N = 30 and the archive size at M = 10. Figure 2 shows the average values of
$\mathit{err}_1$ and $\mathit{err}_2$ for the solutions in the archive. The plots on the left pertain to the
simulation in which PSO was run without gradient descent; the plots on the right show the errors when gradient
descent was applied to PSO. It is clear that when gradient descent is applied, the errors decrease more rapidly
with increasing iteration.
Figure 2. Convergence plots of $\log(\mathit{err}_1)$ (top) and $\log(\mathit{err}_2)$ (bottom) vs. iteration
Figure 3 shows the final populations obtained at the end of 25 iterations when PSO is run without and with
gradient descent, with the solutions marked with dots (.) and plus signs (+), respectively. The non-dominated
solutions are circled. It is again clear that when gradient descent is applied, the solutions are
significantly closer to the origin (0,0).
Figure 3. Solutions at the end of 25 iterations
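The non-dominated solutions circled in Figure 3 can be identified with a simple filter over the (err1, err2) pairs. The sketch below uses ordinary Pareto dominance for minimization, not the fuzzy dominance measure used inside the PSO itself; the function names are illustrative.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Return the points that no other point dominates, in their original order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

This quadratic-time filter is adequate for a population of N = 30.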
5. CONCLUSION AND FURTHER RESEARCH
In this study, we have provided a method to apply gradient descent to some parameters of gene differential
equation models. Models of other genetic networks are similar. Therefore, our method can readily be
extended to other models also. Although our results are preliminary, they strongly suggest that a
straightforward application of gradient descent greatly improves the convergence speed of the algorithm.
This observation is consistent with earlier research taken up by others, such as Das et al. (2006) and Koduru
et al. (2007), where other forms of local search were used in place of explicit gradient descent.
Further investigation is necessary to analyze the effectiveness of such a method for larger models. This
research may also be applied to study other flowering time models of Arabidopsis (Wilczek et al. 2009).
ACKNOWLEDGEMENT
This research has been funded by the US National Science Foundation, through Grant No. NSF FIBR
0425759.
REFERENCES
Cai, X., et al. (2007). Discovering Structures in Gene Regulatory Networks Using Genetic Programming and Particle
Swarms. Proceedings of the Genetic and Evolutionary Computing Conference, London, UK, pp. 1750.
Clerc, M., and Kennedy, J. (2002). The Particle Swarm – Explosion, Stability, and Convergence in a Multidimensional
Complex Space. IEEE Transactions on Evolutionary Computation, Vol. 6, No. 1, pp. 58 – 73.
Coello, C.A.C., Pulido, G.T., and Lechuga, M.S. (2004). Handling Multiple Objectives with Particle Swarm
Optimization. IEEE Transactions on Evolutionary Computation, Vol. 8, No. 3, pp. 256 – 279.
Das, S., et al. (2006). Adding Local Search to Particle Swarm Optimization. Proceedings, World Congress on
Computational Intelligence, Vancouver, BC, Canada, pp. 428 – 433.
Das, S. (2008). Evolutionary Algorithms with Nelder-Mead Simplex Based Local Search. Encyclopedia of Artificial
Intelligence (Eds. J. R. Rabuñal, J. Dorado & A. Pazos), Idea Group Publishing, Vol. 3, pp. 1191 – 1196.
Das, S., and Panigrahi, B.K. (2008). Multi-Objective Evolutionary Algorithms. Encyclopedia of Artificial Intelligence
(Eds. J. R. Rabuñal, J. Dorado & A. Pazos), Idea Group Publishing, Vol. 3, pp. 1145 – 1151.
Deb, K., et al. (2002). A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary
Computation, Vol. 6, No. 2, pp. 182 – 197.
Koduru, P., Welch, S.M., and Das, S. (2007). A Particle Swarm Optimization Approach for Estimating Confidence
Regions. Proceedings of the Genetic and Evolutionary Computing Conference, London, UK, pp. 70 – 77.
Koduru, P., Das, S., and Welch, S.M. (2007). Multi-Objective and Hybrid PSO Using ε-Fuzzy Dominance. Proceedings of
the Genetic and Evolutionary Computing Conference, London, UK, pp. 853 – 860.
Koduru, P., et al. (2008). Multi-Objective Evolutionary-Simplex Hybrid Approach for the Optimization of Differential
Equation Models of Gene Networks. IEEE Transactions on Evolutionary Computation, Vol. 12, No. 5, pp. 572 – 590.
Koduru, P., Das, S., and Welch, S.M. (2009). A Hybrid PSO Algorithm for Single and Multiobjective Optimization. IEEE
Transactions on Evolutionary Computation, revised and resubmitted.
Mendel, J.M. (1995). Fuzzy Logic Systems for Engineering: A Tutorial. Proceedings of the IEEE, Vol. 83, No. 3,
pp. 345 – 377.
Welch, S.M., Dong, Z., Roe, J.L., and Das, S. (2005). Flowering Time Control: Gene Network Modeling and the Link to
Quantitative Genetics. Australian Journal of Agricultural Research, Vol. 56, pp. 919 – 936.
Welch, S.M., et al. (2005). Merging Genomic Control Networks with Soil-Plant-Atmosphere-Continuum (SPAC) Models.
Agricultural Systems, Vol. 86, pp. 243 – 274.
Welch, S.M., Roe, J.L., and Dong, Z. (2003). A Genetic Neural Network Model of Flowering Time Control in
Arabidopsis thaliana. Agronomy Journal, Vol. 95, pp. 71 – 81.
Wilczek, A.M., et al. (2009). Effects of Genetic Perturbation on Seasonal Life History Plasticity. Science,
doi: 10.1126/science.1165826.
TREE-STRUCTURE-AWARE GENETIC OPERATORS IN
GENETIC PROGRAMMING
Kisung Seo, Chulhyuk Pang
Electronic Engineering, Seokyeong University
Seoul, Korea
ABSTRACT
In this paper, we suggest tree-structure-aware GP operators that heed tree distributions in structure space and
their possible structural difficulties. The main idea of the proposed GP operators is to place the offspring
generated by crossover and/or mutation in a specified region of tree structure space insofar as possible, taking
into account the observation that most solutions are found in that region. To enable that, the proposed operators
are designed to utilize information about the region to which the parents belong and node/depth statistics of the
subtree selected for modification. To demonstrate the effectiveness of the proposed approach, experiments on the
binomial-3 regression and even-parity problems are performed. The results obtained with the proposed
tree-structure-aware operators are superior to those of standard GP on both test problems, in both success rate
and number of evaluations.
KEYWORDS
Genetic Programming, Tree-structure-aware GP Operators
1. INTRODUCTION
The tree representation of GP chromosomes, as compared with the string representation typically used in GA,
gives GP more flexibility to encode solution representations for many real-world design and optimization
applications [Koza 1992, 1994]. Due to the characteristics of tree representations, some problems are commonly
experienced, such as code bloat [Luke, McPhee, Silva], destructive crossover [Ito, Majeed, Poli 1999, 2000],
and structural difficulties [Daida 2001, 2003a, 2003b, Hoai].
Daida et al. [Daida 2001, 2003a, 2003b] investigate regions of the space of tree structures through the
distribution of tree shapes, and show that tree structure can have a substantial impact on determining problem
difficulty in standard GP. Their work has provided experimental support for the hypothesis that the iterative
random growth of information structures can be a significant and limiting factor that adversely affects solution
difficulty. One of the most important observations obtained from the above research is that most solutions in
standard GP are found in a specific region of the space of tree structures [Daida 2001, 2003a, 2003b].
We suggest new recombination operators using the above idea of tree structure space. For crossover, a
random subtree is selected from parent 1 and node/depth statistics of the subtree are calculated. According to
these statistics, a subtree of parent 2 is selected to make the child lie within a specified tree-structure
region, known as region I (defined in the next section). One offspring is obtained from two parents, because the
choice of a subtree to swap, based on statistics of the subtree it replaces, is made from only one parent.
Therefore we do two crossover operations in a row to generate two offspring. In the case of mutation, the growth
method (grow, full, or half & half) used to replace a parent subtree is determined based on the current shape of
the parent tree.
In this paper, we propose new tree-shape-aware genetic operators for GP. To demonstrate the
effectiveness of our proposed approach, experiments on the binomial-3 regression and even-parity problems are
performed. Section 2 discusses the previous work and the space of tree structures. Section 3 describes the
proposed structure-aware GP operators. Section 4 presents experimental results for two test problems, and
Section 5 concludes the paper.
2. STRUCTURAL DIFFICULTY AND GP OPERATORS
2.1 Structural Difficulty in GP
Daida et al. [Daida 2003b] showed that structure alone can pose great difficulty for a standard GP search
(using an expression tree representation and subtree swapping crossover). In particular, they classified four
regions of the search space of tree structures, as shown in Fig. 1.
The regions of tree structures are summarized as follows, classified according to number of nodes and
tree depth. There exist at least four distinct regions for depths 0 – 26. Region I contains most solutions in
standard GP. Far fewer individuals than in Region I appear in Regions II (IIa and IIb). Only partial mixing of
size / shape subtrees occurs here, with mixing becoming nonexistent towards the boundaries furthest away
from Region I. Region III (IIIa and IIIb) is a place where even fewer individuals typically appear. Regions
IVa and IVb are regions that are structurally precluded from binary trees.
In [Hoai], extensive work on a new tree-based representation and simple local operators for GP has been
done. The authors argue that GP's problem of structural difficulty results from the lack of local
structure-editing operators due to its fixed-arity representation. Applying a TAG (tree-adjoining grammar)-based
representation and local structure-editing operators to Daida's LID problem, they demonstrated that the
operators significantly ease the structural difficulty problem in GP.
Although those new findings broaden the insight into structural difficulty in GP, they do not relate
directly to how the phenomenon can be applied to improve the search efficiency of GP. Their relaxation of the
fixed-arity approach cannot be applied to GP problems in general, because in many cases the necessary arity
is defined according to problem-specific characteristics.
In this paper, we focus on how to apply these observations, which have been well studied
theoretically, to enhance the current standard GP operators. Based on tree structure characteristics,
region-based GP operators are proposed. Because most solutions of test problems [Daida 2003a] found by standard
GP search occur in region I, it is natural to say that it is easy for GP to find solutions to problems which exist
in region I. Furthermore, biasing the operators to put generated offspring into region I seems to be a plausible
strategy to improve search capability in standard tree-based GP.
Figure 1. Node and depth regions (reprinted with permission from Daida)
2.2 Problems of GP Operators
The recombination operator in GP is regarded as a main driving force for the success of a GP run. Many
variants of the recombination operator have been introduced and used, but the most commonly used one is
the one point crossover operator [Koza 1992], due to its simplicity and ease of implementation. Many
researchers have tried to improve its performance. One approach is context preservation [Majeed]. In it, the
order and number of the parent nodes of a swapped subtree in its container parent tree are preserved to the
best possible extent in the other parent. A recently introduced operator, context-aware crossover [Majeed],
which implicitly discovers the best possible crossover site for a subtree, has been shown to consistently
attain higher fitnesses while processing fewer individuals. On the other hand, GP uniform crossover has also
been suggested, inspired by the GA uniform crossover concept [Poli 1999, 2000]. Also, a depth-dependent
crossover for GP has been proposed [Ito], in which the depth selection ratio is varied according to the depth
of a node, i.e. shallow nodes are more often chosen as the crossover points.
The various approaches mentioned above have contributed to the analysis of existing problems with GP
operators and to the development of new GP operators with improved performance. These approaches mainly aim
to minimize destructive effects of standard crossover, to preserve the position of genetic material in
the genotype, and/or to accumulate building blocks. Although it is very important to preserve or minimize
destruction of genetic material, there is no known implicit direction to favor. In other words, there is no
information about which building blocks are good to preserve, because a subtree cannot be evaluated alone.
In contrast, our tree-structure-aware genetic operators for GP provide concrete guidance in the form of an
implicit evolution direction. Assuming that solutions in Region I are more likely to be helpful to search,
guidance on the direction that GP operators should pursue is available.
3. TREE-STRUCTURE-AWARE OPERATORS
The main idea of the proposed tree-structure-aware GP operators is to place the offspring resulting
from crossover and/or mutation into region I as much as possible, using the observed distribution of tree
structures as guidance in specifying the shape of genetic material to be introduced [Daida 2003a, 2003b].
3.1 Tree-structure-aware Crossover
The tree-structure-aware crossover algorithm is described in Figures 2 and 3. Two parents are selected by
roulette wheel selection and the region of each parent is examined. A subtree of parent1 is chosen at random,
and a subtree of parent2 is chosen for substitution according to the node/depth ratio of the subtree of parent1
and the regions of each parent.
For example, if the region of parent1 is lower than the region of parent2, a subtree of parent2 is chosen
among possible candidates which have a smaller node/depth ratio than that of the subtree of parent1. This
operation generates a child which has a high probability of being in region I. In the opposite case, a subtree of
parent2 is chosen among candidates which have larger node/depth ratios than that of the subtree of parent1.
Unlike standard crossover, one offspring is obtained from crossover of two parents, because it is difficult
to control the shape of both offspring by the subtree swapping operation in the proposed crossover. Therefore,
offspring from the proposed crossover are generated one at a time.
Figure 2. Pseudo code of crossover operation
    Select two parents using roulette wheel selection
    Examine the region to which each parent belongs
    Choose a subtree at random in parent1
    Calculate node/depth ratio for the subtree of parent1
    If (region of parent1 is lower than region of parent2)
        Choose a subtree of parent2 so that node/depth ratio is less than that of parent1's subtree
    else if (region of parent1 is higher than region of parent2)
        Choose a subtree of parent2 so that node/depth ratio is greater than that of parent1's subtree
    else
        Choose random subtree in parent2
    Iterate the above process to generate required number of crossover offspring
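The crossover procedure of Figure 2 can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: the `Node` class, counting a single node as depth 1, passing each parent's region as an integer rank supplied by the caller, and falling back to a random subtree when no candidate satisfies the ratio condition.

```python
import copy
import random


class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def subtrees(node):
    """Yield every node of the tree rooted at `node` (pre-order)."""
    yield node
    for child in node.children:
        yield from subtrees(child)


def size(node):
    return sum(1 for _ in subtrees(node))


def depth(node):
    # A single node counts as depth 1 so the ratio is always defined (assumption)
    return 1 + max((depth(c) for c in node.children), default=0)


def ratio(node):
    return size(node) / depth(node)


def pick_donor(parent2, ref_ratio, rank1, rank2, rng=random):
    """Choose a subtree of parent2 according to the rule in Figure 2."""
    pool = list(subtrees(parent2))
    if rank1 < rank2:
        candidates = [s for s in pool if ratio(s) < ref_ratio]
    elif rank1 > rank2:
        candidates = [s for s in pool if ratio(s) > ref_ratio]
    else:
        candidates = pool
    # Fall back to any subtree when no candidate qualifies (assumption)
    return rng.choice(candidates or pool)


def crossover(parent1, parent2, rank1, rank2, rng=random):
    """Return one offspring: a copy of parent1 with one subtree replaced."""
    child = copy.deepcopy(parent1)
    target = rng.choice(list(subtrees(child)))
    donor = copy.deepcopy(pick_donor(parent2, ratio(target), rank1, rank2, rng))
    # Overwrite the chosen node in place, so no parent pointer is needed
    target.label, target.children = donor.label, donor.children
    return child
```

Calling `crossover` twice, as the text describes, yields two offspring; neither parent is modified.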
Figure 3. Tree-structure-aware crossover. Panel 1) Region of parent1 is lower than region of parent2 (Tree1 in
region IIb, Tree2 in region I): choose a subtree of parent2 so that its node/depth ratio is less than that of
the subtree of parent1. Panel 2) Region of parent1 is higher than region of parent2 (Tree1 in region IIIa, Tree2
in region IIb): choose a subtree of parent2 so that its node/depth ratio is greater than that of the subtree of
parent1.
3.2 Tree-structure-aware Mutation
The major feature of the tree-structure-aware mutation is that the generation method (grow, full, or
half & half) of the replacement subtree is chosen according to the current shape of the parent tree. The
tree-structure-aware mutation algorithm is described in detail in Figures 4 and 5.
A parent is selected by roulette wheel selection and the shape of the parent is examined as in crossover.
The number of nodes on each branch from the root node of the parent is calculated. If the region of the
parent is higher than region I, for example, a random node from the larger branch is chosen in order to
balance the tree structure and move the offspring toward region I. Then mutation with "grow" generation is
executed, which tends to make the subtree smaller.
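A minimal sketch of this mutation rule is given below. The `Node` class, the binary function set, the 0.5 stopping probability in the grow method, the encoding of the parent's region as a signed number, and the reading of half-and-half as an even choice between grow and full are all assumptions made for illustration.

```python
import random


class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


FUNCTIONS = ["+", "-", "*", "/"]  # binary functions, illustrative only
TERMINALS = ["x"]


def generate(max_depth, method, rng):
    """Build a random tree with the 'grow' or 'full' method."""
    stop = max_depth <= 1 or (method == "grow" and rng.random() < 0.5)
    if stop:
        return Node(rng.choice(TERMINALS))
    return Node(rng.choice(FUNCTIONS),
                [generate(max_depth - 1, method, rng) for _ in range(2)])


def nodes(tree):
    yield tree
    for child in tree.children:
        yield from nodes(child)


def size(tree):
    return sum(1 for _ in nodes(tree))


def mutate(parent, region_cmp, max_depth=4, rng=random):
    """Mutate `parent` in place.

    region_cmp > 0: parent lies above region I; < 0: below; 0: inside.
    """
    branches = sorted(parent.children, key=size)
    if region_cmp > 0 and branches:
        pool, method = list(nodes(branches[-1])), "grow"  # larger branch
    elif region_cmp < 0 and branches:
        pool, method = list(nodes(branches[0])), "full"   # smaller branch
    else:
        pool = list(nodes(parent))
        method = rng.choice(["grow", "full"])             # half-and-half
    target = rng.choice(pool)
    new = generate(max_depth, method, rng)
    target.label, target.children = new.label, new.children
    return parent
```

With a binary function set, a "full" subtree of depth d always has 2^d - 1 nodes, whereas "grow" can stop early, which is why it tends to produce smaller subtrees.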
Figure 4. Pseudo code for mutation operation
    Select a parent using roulette wheel selection
    Determine the region to which the parent belongs
    Calculate the number of nodes on each branch of the root node of the parent
    If (region of parent is higher than region I)
        Choose a node from the larger branch and mutate with "grow" generation
    else if (region of parent is lower than region I)
        Choose a node from the smaller branch and mutate with "full" generation
    else
        Choose a random node and mutate with "half_and_half" generation
Figure 5. Tree-structure-aware mutation
4. EXPERIMENTAL SETUP
The tree-structure-aware GP operators have been applied to two standard benchmark problems, binomial-3
regression and even parity, and compared with standard GP. In all experiments, 20 runs were executed, and the
number of evaluations, the hit rate, and the success rate of the solutions found were recorded. We used lilGP
[Zongker] for the standard GP runs, and modified it heavily to implement the new crossover and mutation
operators. These experiments were performed on a single Core 2 Duo 2.13GHz PC with 2GB RAM. The GP
parameters were as shown below.
Number of generations : 500 for binomial-3,
                        2000 to 10000 for multiplexor,
                        and 100 to 7000 for even-parity
Population sizes      : 500
Initial population    : half_and_half
Initial depth         : 2-6
Max depth             : 17 (25 for 11-multiplexor)
Selection             : tournament (size=7) for standard
                        roulette wheel for proposed
Crossover             : 0.9
Mutation              : 0.1
4.1 Binomial-3 Regression
The first experiment was on the binomial-3 problem, which was proposed as a tunably difficult problem by
Daida, et al. in [2]. The binomial-3 is a symbolic regression problem and involves seeking a function
expressible as:
$$ f(x) = x^3 + 3x^2 + 3x + 1 \tag{1} $$
The fitness cases of the binomial-3 problem are 50 equidistant points generated from equation
(1) over the interval [-1, 0). The raw fitness is the sum of the absolute errors. A hit is defined as being
within 0.01 in ordinate of a fitness case, for a total of 50 possible hits. The stop criterion is finding an
individual in the population that scores 50 hits.
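The fitness definition above can be sketched as follows, assuming that the 50 fitness cases are spaced 1/50 apart starting at x = -1 and that a candidate program is available as a Python callable; both points, and the function names, are assumptions made for illustration.

```python
def target(x):
    """Equation (1): f(x) = x^3 + 3x^2 + 3x + 1."""
    return x ** 3 + 3 * x ** 2 + 3 * x + 1


# 50 equidistant points on [-1, 0); the right end point is excluded
XS = [-1.0 + i / 50.0 for i in range(50)]


def evaluate(candidate):
    """Return (raw_fitness, hits) for a candidate function of x."""
    errors = [abs(candidate(x) - target(x)) for x in XS]
    raw_fitness = sum(errors)                  # sum of absolute errors
    hits = sum(1 for e in errors if e <= 0.01)  # cases within 0.01 of the target
    return raw_fitness, hits


def is_solution(candidate):
    """Stop criterion: the individual scores a hit on all 50 fitness cases."""
    return evaluate(candidate)[1] == len(XS)
```

For instance, `is_solution(lambda x: (x + 1) ** 3)` holds, because $(x + 1)^3$ expands to Equation (1).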
The function set is defined as {+, −, ×, ÷}, and the terminal set is defined as {x, α}, where x is the
symbolic variable and α is the set of ephemeral random constants (ERCs).
The ephemeral random constants are uniformly distributed over a specified interval of the form [-α, α],
where α is a real number that defines the range for ERCs. Six values for α were used, namely {0, 1, 2, 3,
10, 100}.
Table 1. Results of Binomial-3 Regression Problem
The experimental results for the binomial-3 problem are summarized in Table 1. Overall, the proposed crossover
and mutation produced better results than standard crossover and mutation. Because the binomial-3 problem is
relatively easy, the hit rates differ only slightly. The number of evaluations, which is one of the most
important indexes, shows larger differences: for the ERC ranges [-1, 1], [-3, 3], [-10, 10], and [-100, 100],
the proposed method needs roughly 40% to 77% fewer evaluations than standard GP. The exceptions are the range
[-2, 2], where it needs more evaluations, and the very easy case of no ERCs, where the two methods are nearly
equal.
Binomial-3 regression
Operators                        ERC range      Hit rate   No. of evals   Success rate
standard crossover & mutation    no ERC         50.00      4,850.0        100%
                                 [-1, 1]        47.75      149,676.5      85%
                                 [-2, 2]        48.25      23,794.1       85%
                                 [-3, 3]        49.10      116,740.6      85%
                                 [-10, 10]      48.00      109,411.8      85%
                                 [-100, 100]    48.55      143,000.0      95%
proposed crossover & mutation    no ERC         50.00      4,552.5        100%
                                 [-1, 1]        49.60      70,368.4       95%
                                 [-2, 2]        49.45      37,629.4       85%
                                 [-3, 3]        49.25      26,968.8       85%
                                 [-10, 10]      49.75      65,975.0       90%
                                 [-100, 100]    49.45      54,182.4       85%
4.2 Even-parity
The second experiment was on the even-n-parity problem. Parity functions have been recognized as difficult for
genetic programming to induce if no bias favorable to their induction is introduced in the function set, the
input representation, or in any other part of the algorithm [Page]. The function set is defined as {and, or,
nand, nor} and the terminal set is defined as {d0, d1, d2,…, dn}, where each element of the terminal set is a
data bit.
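As an illustration of how hits can be counted for this problem, the sketch below evaluates a candidate expression over all 2^n input combinations, so the maximum number of hits is 2^n. The nested-tuple encoding of expressions and the function names are assumptions made for illustration.

```python
from itertools import product

FUNCS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "nand": lambda a, b: not (a and b),
    "nor": lambda a, b: not (a or b),
}


def evaluate(expr, bits):
    """Evaluate a nested-tuple expression such as ("and", "d0", "d1")."""
    if isinstance(expr, str):
        return bits[int(expr[1:])]  # terminal "dK" reads data bit K
    op, left, right = expr
    return FUNCS[op](evaluate(left, bits), evaluate(right, bits))


def parity_hits(expr, n):
    """Count the input combinations on which expr equals even-n-parity."""
    hits = 0
    for bits in product([False, True], repeat=n):
        want = sum(bits) % 2 == 0  # true when the number of 1 bits is even
        hits += evaluate(expr, bits) == want
    return hits
```

For n = 2, the expression `("or", ("and", "d0", "d1"), ("nor", "d0", "d1"))` scores all 4 hits.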
Table 2 summarizes the experimental results for the even-3, even-4 and even-5 parity problems. For the even-3
parity problem, the results of the proposed method are better than those of standard GP in terms of number of
evaluations. For the even-4 parity problem, the success rate of the proposed method is better than that of
standard GP. Moreover, the number of evaluations of the proposed crossover and mutation is only 12% of that of
standard GP. For the even-5 parity problem, the success rate of the proposed structure-aware operators (75%) is
clearly superior to that of standard GP, which never found a solution (0%).
Table 2. Results of Even Parity Problem
3-, 4-, and 5-even parity
Operators                        Bits      Hit rate          No. of evals   Success rate
standard crossover & mutation    3 bits    8.00 (max 8)      35,815.8       100%
                                 4 bits    15.55 (max 16)    1,987,333.3    75%
                                 5 bits    20.45 (max 32)    NA             0%
proposed crossover & mutation    3 bits    8                 15,266.7       100%
                                 4 bits    15.90             237,631.6      95%
                                 5 bits    31.35             2,924,100.0    75%
5. CONCLUSION
We have suggested new recombination operators based on tree distributions in structure space and structural
difficulties. The main idea of the proposed tree-structure-aware GP operators is to generate offspring via
crossover and mutation that have tree structures residing within region I in the Daida et al. [3]
characterization, by biasing the tree structures of the altered subtrees.
To demonstrate the effectiveness of our proposed approach, experiments on the binomial-3 regression
and even-parity problems were performed. The results obtained with the proposed tree-structure-aware operators
were superior to those of standard GP on both test problems, in both success rate and number of evaluations.
By exploiting the regions in the space of tree structures identified by Daida et al., our proposed
tree-structure-aware operators can enhance search capability over the randomly generated tree structures
exhibited by standard GP.
Further study will aim at analysis, extension and refinement of the tree-structure-aware GP operators to
validate their effectiveness more theoretically and to apply them to more complex and practical real-world
problems.
ACKNOWLEDGMENTS
This work was supported by the Korea Research Foundation Grant funded by the Korea government
(MOFHRD, Basic Research Promotion Fund) (KRF-2007-314-D00176).
REFERENCES
Daida J. M. et al, 2001. “What Makes a Problem GP-Hard? Analysis of a Tunably Difficult Problem in Genetic
Programming,” Genetic Programming and Evolvable Machines, 2(2), pp. 165-191.
Daida J. M. and Hilss A. M., 2003. “Identifying Structural Mechanisms in Standard Genetic Programming,” Proceedings
of the Genetic and Evolutionary Computation Conference (GECCO-2003), LNCS 2724, Chicago, IL, USA,
pp. 1639-1651.
Daida J. M. and Hilss A. M., 2003. “What Makes a Problem GP-Hard? Validating a Hypothesis of Structural Causes,”
Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2003), LNCS 2724, Chicago, IL,
USA, pp. 1665-1677.
Ito T. et al, 1998. “Depth-Dependent Crossover for Genetic Programming,” in Proceedings of the IEEE World Congress
on Computational Intelligence, Anchorage, AK, USA, pp. 775-780.
Koza J. R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press,
Cambridge, MA, USA.
Koza J. R., 1994. Genetic Programming II: Automatic Discovery of Reusable Programs, MIT Press, Cambridge, MA,
USA.
Luke S., 2000. Issues in Scaling Genetic Programming: Breeding Strategies, Tree Generation, and Code Bloat, PhD
thesis, University of Maryland.
Majeed H. and Ryan C., 2007. “On the Constructiveness of Context-Aware Crossover,” Proceedings of the Genetic and
Evolutionary Computation Conference (GECCO’07), London, England, United Kingdom, pp. 1659-1666.
McPhee N. F. et al, 2004. “On the Strength of Size Limits in Linear Genetic Programming,” Proceedings of the Genetic
and Evolutionary Computation Conference (GECCO-2004), LNCS 3103, Seattle, WA, USA, pp. 593-604.
Hoai N. X. et al, 2006. “Representation and Structural Difficulty in Genetic Programming,” IEEE Transactions on
Evolutionary Computation, 10(2), pp. 157-166.
Page J. et al, 1999. “Smooth Uniform Crossover with Smooth Point Mutation in Genetic Programming: A Preliminary
Study,” Proceedings of EuroGP’99, LNCS 1598, Göteborg, Sweden, pp. 39-48.
Poli R. and Page J., 2000. “Solving High-Order Boolean Parity Problems with Smooth Uniform Crossover, Sub-Machine
Code GP and Demes,” Genetic Programming and Evolvable Machines, 1(1/2), pp. 37-56.
Silva S. and Costa E., 2005. “Resource-Limited Genetic Programming: The Dynamic Approach,” Proceedings of the
Genetic and Evolutionary Computation Conference (GECCO’05), Washington, DC, USA, pp. 1673-1680.
Zongker D. and Punch B., 1995. LilGP User’s Manual, Michigan State University.
CONTENT AND COMMUNICATION BASED SUB-COMMUNITY
DETECTION USING PROBABILISTIC TOPIC MODELS
Alexandru Berlea¹, Markus Döhring, Nicolai Reuschling
SAP Research
Bleichstr. 8, 64283 Darmstadt, Germany
ABSTRACT
Sub-community detection is a fundamental task in social network analysis and is becoming increasingly
interesting in business applications related to supporting collaboration platforms on the Internet and mining the
content generated on them. We present a set of methods for sub-community detection that leverage probabilistic
topic models. The methods are based on similarities among community members arising from their communication
links, their topics of interest, or both. We also identify suitable scenarios for the application of the proposed
approaches. Preliminary experimental results indicate that our hybrid approach is a promising candidate for the
analysis of large forum communities.
KEYWORDS
Community Mining, Topic Detection, Probabilistic Models, Social Network Analysis
1. INTRODUCTION
Given a set of interacting entities, sub-community detection is defined as the task of identifying subsets of
entities characterized by common properties. This definition is quite broad, leaves considerable room for
interpretation, and requires some refinement in order to specify the scope of this paper.
Firstly, depending on the nature of the entities and their interaction, sub-community detection is a matter
of interest for different disciplines such as social network analysis or biology. Our focus will be on Internet
communities. Supported by the increase in usability provided by Internet applications designed for
collaboration (such as blogs, forums, wikis), Internet communities and their user numbers have proliferated
in recent years, making them increasingly interesting and relevant for business purposes. For example,
recommendation systems, which have been shown to have a significant impact on sales figures, may be improved
if sub-communities of people sharing the same interests or tastes are detected and the offer to customers is
tailored accordingly. Additionally, companies are becoming increasingly interested in having Internet
communities around their products; both identifying and supporting these communities may benefit from
sub-community detection.
Secondly, our definition of sub-community detection does not specify which common properties are
supposed to be shared by the members of a sub-community. As a rule of thumb, these properties should
provide a good clustering of the community into sub-communities, i.e. minimize the differences inside a
sub-community and maximize the differences between sub-communities. In this paper we will restrict ourselves
to properties which are derivable from information in Internet communities that is always openly available to
anyone: the communication structure (who is talking to whom) and the communication content (who is
talking about what).
Corresponding to the large interest in sub-community detection, there is a large body of recent research
literature in which various approaches to sub-community detection have been reported, many of which are
tailored to Internet communities. Related work is reviewed in Section 2. Our contribution is to demonstrate
how existing traditional methods for analyzing communication structures can be combined with new
state-of-the-art methods for automatic content analysis in order to perform sub-community detection. In
particular, we show how different assumptions as to how sub-communities are defined can be realized by
variations of the methods used and the combinations thereof. One distinguishing feature of our approaches is the
focus on Internet forums, in which communication links among participants are explicitly present. We present
results and experiences obtained using a very large Internet community and also address how they can be
visualized.
1 Corresponding author: alexandru.berlea@sap.com
The ingredients underlying our approaches are introduced in Section 3. Subsequently, in Section 4 we
show how these can be used to devise different types of approaches to sub-community detection: purely
communication based (Section 4.1), purely content based (Section 4.2) and combined approaches (Section
4.3). We also mention how the different design decisions lead to different flavors of sub-community
detection. As an example, we apply our combined approach to a large community and present the results in
Section 5.
2. RELATED WORK
Most approaches to subcommunity detection consider either only the communication structure or only the
communication content. In contrast, the approaches that we present consider both aspects, similarly to
related work in (Viermetz 2008), (Gloor and Zhao 2006), (Tuulos and Tirri 2004), (Dietz 2006) and (Zhou et
al. 2006). More precisely, we follow the line of research of the last three, in which the communication
content is analyzed using probabilistic models. Probabilistic models have recently been shown to be a
powerful method for automatically detecting topics in collections of documents. They are especially
suitable for content generated in Internet communities due to the large numbers of documents and authors.
(Dietz 2006) addresses communities of researchers as derivable from their publications. The assumption
there is that the strength of a tie between two subcommunity members is given by the similarity of the
topics they address. Strictly speaking, there are no explicit communication structures available. The
detected subcommunities are thus in fact communities of topics. (Zhou et al. 2006) address communities as
derivable from email exchanges and present two approaches, which assume that a subcommunity is either a
set of users that communicate frequently or a set of users that share common topics. While (Tuulos and
Tirri 2004) address both content and structure analysis, their focus is not on subcommunity detection.
Instead, the authors, targeting chat data mining, address how topic detection accuracy can be improved by
using the communication structure to better discriminate between useful content and noise.
Subcommunity detection can be cast as a clustering problem. Given a set of entities, clustering is the
task of automatically partitioning this set, without supervision, into subsets, or clusters, such that
similarities are maximized within clusters and minimized between clusters. By this definition,
subcommunity detection can be seen as an instance of clustering in which the entities are community users
and each cluster corresponds to a subcommunity. More formally, a clustering (subcommunity detection
method) of E entities (users) into C clusters is specified by an $E \times C$ matrix. The element
$m_{i,j}$ of the matrix has a value in the interval [0,1] denoting the grade of membership of entity
(user) i in cluster (subcommunity) j. The particular case in which the grade of membership of each entity
i is 1 for exactly one cluster and 0 for all others is referred to as hard clustering, as opposed to the
general case of fuzzy clustering. Fuzzy subcommunity detection methods are in general more expressive:
they allow a user to be assigned to different subcommunities with different confidence levels and thus
naturally cover real-life scenarios in which users are, or feel, simultaneously part of different
subcommunities.
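The membership matrix can be made concrete with a small sketch. The three users, the two subcommunities and the helper names `is_hard` and `harden` are hypothetical and serve only to illustrate the definitions above; taking the largest grade per row is one simple way, assumed here, of deriving a hard clustering from a fuzzy one.

```python
def is_hard(m):
    """Return True if every row holds exactly one 1 and zeros elsewhere."""
    return all(sorted(row) == [0.0] * (len(row) - 1) + [1.0] for row in m)


def harden(m):
    """Derive a hard clustering by keeping only the largest grade in each row.

    Ties are broken in favour of the first cluster (an arbitrary choice).
    """
    hard = []
    for row in m:
        best = max(range(len(row)), key=lambda j: row[j])
        hard.append([1.0 if j == best else 0.0 for j in range(len(row))])
    return hard


# Hypothetical fuzzy memberships: 3 users (rows) x 2 subcommunities (columns).
fuzzy = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.5, 0.5],
]
hard = harden(fuzzy)
print(is_hard(fuzzy), is_hard(hard))
```

Note that the definition only requires each entry to lie in [0,1]; that the rows of `fuzzy` sum to 1 is a property of this example, not a requirement stated in the text.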
Typically, the entities to be clustered can be represented as feature vectors, i.e. points in a
multi-dimensional space. A variety of clustering techniques exist, arising from different choices of
entities’ features, similarity measures among them and grouping methods (Jain, Murthy & Flynn 1999).
In some cases, such as Internet communities, the set of interacting entities is naturally
represented as the nodes of a graph, where an arc (implicitly) denotes the similarity between the entities it
connects. Clustering the nodes of a graph is known as graph clustering (Schaeffer 2007), and the clusters
become subgraphs. One range of techniques is based on density properties; they try to maximize the internal
coherence of subgraphs by identifying maximal subgraphs that have a density above a certain threshold.
Cut-based approaches try to maximize the independence of subgraphs, where independence is defined in
terms of the cut size needed to isolate the subgraphs. Another proposed approach iteratively
removes the arcs with the highest betweenness (the number of shortest paths passing through the arc), based on
the assumption that these arcs are links between clusters rather than within a cluster. For this particular work,
we are interested in clustering methods that scale to the large number of entities typically found in
Internet communities. Computing optimal graph clusters is in general NP-complete and thus
not feasible for large graphs. In practice, however, many algorithms have been proposed that are able to
find reasonably good partitions efficiently (Schaeffer 2007).
In particular, a cut-based approach suitable for very large graphs is implemented in Cluto (Karypis 2003),
a state-of-the-art clustering tool. In order to deal with very large input graphs, Cluto’s algorithm
(Karypis & Kumar 1999) reduces the original input graph by first collapsing nodes and edges, then
partitioning the reduced graph, and finally projecting the obtained partition back to the original graph.
At the graph level, this method, being cut-based, tends to find clusters such that the number of
inter-cluster edges is minimized. The interpretation in terms of subcommunity detection is obtained
directly by taking each cluster to represent a subcommunity: subcommunities are detected so as to minimize
the amount of information exchanged across subcommunities. Because it optimizes this global property of a
community, the method is meaningful in scenarios in which the community is regarded as a whole (e.g. for
its visualization), but may be less suited to explaining (locally) which bonds hold a subcommunity
together, or why a person belongs to a subcommunity.
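The quantity that a cut-based method tries to keep small can be illustrated as follows. This is only a sketch of the objective (the number of inter-cluster edges of a given partition), not of Cluto's multilevel algorithm; the toy graph, the two partitions and the function name `cut_size` are hypothetical.

```python
def cut_size(edges, cluster_of):
    """Count the edges whose two endpoints lie in different clusters."""
    return sum(1 for u, v in edges if cluster_of[u] != cluster_of[v])


# Hypothetical communication graph: two triangles joined by the edge c-d.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("c", "d"),
    ("d", "e"), ("e", "f"), ("d", "f"),
]

# A partition that follows the two triangles, and one that ignores them.
by_triangle = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
alternating = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

print(cut_size(edges, by_triangle))   # 1
print(cut_size(edges, alternating))   # 5
```

A cut-based method would prefer the first partition, since only one edge has to be removed to isolate the two subgraphs.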
3. PRELIMINARIES
In this section we briefly address the fundamental techniques underlying the subcommunity detection
methods that will be introduced in the next section. Rather than giving a formal and precise introduction
to these techniques, our aim is to provide the intuitive understanding needed for a self-contained
presentation and to introduce the terminology and notation used in the remainder of the paper.
3.1 Community Data
While the approaches that we will introduce are in principle applicable to arbitrary online communities, for
the sake of the presentation and evaluation we will refer to the particular case of online forums. It is
furthermore convenient to introduce here our practical experimental data. This will allow us to refer to a
concrete use case in examples used throughout the paper.
We utilized forum data publicly available on the SAP Developer Network (SDN). The SDN forums
contain a broad variety of discussion topics related to the SAP software landscape. In some cases, SDN users
may explicitly link to each other in their profiles. However, this happens quite rarely and we do not use any
such information. Instead we simply assume that subcommunity structures are latent in the communication
structure and the exchanged content. We restricted ourselves to the forum group focusing on “Application
Server” issues. It offers a mixture of business and technical content that is typical of SDN overall, while
covering a moderately diverse set of distinct subareas within the forum. It contains 61,781 threads
totaling 272,582 posts by 23,545 users.
3.2 Probabilistic Topic Models
Probabilistic topic models (PTMs) lie at the basis of some recent promising approaches to automatic topic
detection. A PTM offers a generative explanation of a document collection in which topics are explicitly
modeled. More precisely, the model is specified by a fixed number of topics as probability distributions over
words and a probability distribution over topics associated with each document. Each token of a document is
assumed to be generated in turn by first sampling a topic from the topic distribution associated with the
document and then sampling a word from the probability distribution denoted by that topic.
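The generative process just described can be sketched in a few lines. The two toy topics, the document's topic distribution and the function name `generate_document` are hypothetical and chosen only to make the two sampling steps visible.

```python
import random


def generate_document(theta_d, topics, length, rng):
    """Generate `length` tokens for one document.

    theta_d: probability of each topic for this document.
    topics:  one dict per topic, mapping word -> probability.
    """
    topic_ids = list(range(len(topics)))
    tokens = []
    for _ in range(length):
        # Step 1: sample a topic from the document's topic distribution.
        z = rng.choices(topic_ids, weights=theta_d)[0]
        # Step 2: sample a word from the distribution denoted by that topic.
        words = list(topics[z])
        weights = [topics[z][w] for w in words]
        tokens.append(rng.choices(words, weights=weights)[0])
    return tokens


# Hypothetical topics, loosely inspired by a security and a database theme.
topics = [
    {"password": 0.5, "login": 0.5},
    {"sql": 0.5, "driver": 0.5},
]
# A document that is entirely about the first topic.
doc = generate_document([1.0, 0.0], topics, 20, random.Random(0))
print(doc)
```

Topic detection runs this process in reverse: given only the tokens, it looks for the topics and per-document topic distributions that best explain them.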
Now, if we are able to find the model that best explains the document collection at hand, the topics of
each document d can be looked up as the most probable topics in $\theta_d$, the probability distribution
over topics associated with d. Finding this model is an inference problem that in general cannot be solved
exactly. Instead, one tries to approximate the optimal solution. The various topic detection methods that
have been proposed differ in the inference methods used as well as in the additional assumptions they make
regarding the underlying model. In particular, for our purposes we use Latent Dirichlet Allocation (Blei, Ng & Jordan
2003) as our PTM and Gibbs sampling as the inference method (Steyvers & Griffiths 2007), a combination
that has been reported to generally deliver good results. Intuitively, the detected topics can be thought
of as patterns of co-occurrences of words in the document collection. One advantage of this kind of topic
detection is that it is robust to ambiguous words, since the co-occurring words in the context account for
the right topic assignment.
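For readers unfamiliar with the inference step, the following is a minimal textbook-style sketch of collapsed Gibbs sampling for Latent Dirichlet Allocation. It is not the implementation used for the experiments; the hyperparameters `alpha` and `beta`, the iteration count, the function name `lda_gibbs` and the toy corpus are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids.

    Returns (theta, phi): per-document topic distributions and
    per-topic word distributions, estimated from the final sample.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # tokens per topic
    z = []                                                # topic of each token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample its topic conditioned on all other assignments.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Hypothetical corpus over a 4-word vocabulary: words 0/1 and words 2/3 co-occur.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

Here each row of `theta` plays the role of the topic distribution of one document, and each row of `phi` is one topic, i.e. a distribution over words.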
For our purposes we ran the topic detection described above on a corpus built from the experimental
data previously introduced, aggregating all posts of a thread into one document. We fixed the number of
topics to 75. Selecting the right number of topics is in general driven by the sensible granularity for the
application domain at hand: the number should not be so small that the topics become too general, nor so
large that they become too specific. Some of the detected topics are listed in Table 1 by their
most representative (likely) words in decreasing order. By looking at the words we can identify these topics
as dealing with Security, Web Services, Web Dynpro, Databases and Web Servers, respectively.
Table 1. Topics detected in our experimental data
Topic      Most representative words (in decreasing order)
Topic 3    user password login logon role id log authentication portal sso
Topic 16   service web ejb webservice bean proxy wsdl model client method
Topic 35   web dynpro abap webdynpro Java wd wda tutorial ui component
Topic 47   database connection sql datasource datum jdbc db table driver oracle
Topic 68   server http url service web port error domain browser host
3.3 Centrality Metrics for SNA
Detecting persons with special roles in social networks is often based on measuring their centrality in the
network. Local degree centrality, defined as the number of edges incident to a node, may be used
to measure how intensively the node communicates. Closeness centrality, sometimes also termed the global
centrality of a node, is measured as the sum of distances from the node to all other nodes and indicates
how well connected the node is to all other reachable nodes. Betweenness centrality is defined as the
number of shortest paths between any two nodes of a network that run through a given node. This metric may
give clues about how important the node is for connecting subnets within a community. Eigenvector
centrality is used to determine the general importance of a node within a network. It does so not only by
counting connections, but also by giving more weight to connections to nodes that are themselves more
central than other nodes. The diameter allows conclusions based on the maximum number of steps that have
to be taken to get from one node to another within a community network. In other words, the diameter is
defined as the longest shortest path within a network. This metric can be an indicator of how fast
information can be passed through a network, or of whether propagation takes place rather extensively
(indicated by a small diameter) or not.
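Three of these metrics (degree, closeness as a sum of distances, and diameter) can be computed with breadth-first search on an unweighted, undirected graph, as in the sketch below. The adjacency-list representation, the function names and the four-node path graph are assumptions for illustration; betweenness and eigenvector centrality are omitted to keep the example short.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree(adj, node):
    """Local degree centrality: number of edges incident to the node."""
    return len(adj[node])


def closeness_sum(adj, node):
    """Sum of distances to all reachable nodes (smaller means more central)."""
    return sum(bfs_distances(adj, node).values())


def diameter(adj):
    """Longest shortest path, taken over pairs of mutually reachable nodes."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)


# Hypothetical network: the path a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(degree(adj, "b"), closeness_sum(adj, "a"), closeness_sum(adj, "b"), diameter(adj))
```

In the path graph, the inner node b has a smaller distance sum than the end node a, which matches the intuition that it is better connected to the rest of the network.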
4. PROBABILISTIC TOPIC MODELS FOR COMMUNITY DETECTION
In this section we introduce three approaches to subcommunity detection based on probabilistic topic
models: one purely communication based, one purely content based and one that combines communication
and content.
4.1 A Communication Based Approach
Assuming that users that tend to co-occur in discussion threads should belong to the same subcommunity,
we might detect these subcommunities by applying the method for detecting patterns of co-occurrences
introduced in Section 3.2 to the communication structure of the threads. For that we can consider each thread
as a “document”, the “tokens” of which are the users that have posted in the thread, one “token” for each post.
Each resulting “topic” $\theta_t$ (a probability distribution over users) will denote a subcommunity. More
precisely, the grade of membership of a user u in subcommunity t is given by $\theta_t(u)$ (the probability
of pattern t generating user u). As a fuzzy method, this has the advantages presented in Section 2.
Intuitively, the grade of membership of a user u to a subcommunity c denotes how likely it is that u will
contribute to threads in which other members of c are also present. Each thread will thereby tend to be
assigned to a small number of communities. Altogether, this approach is suitable for scenarios in which an
automatic categorization of threads is needed according to groups of highly active users driving them.
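The construction of the thread "documents" can be sketched as follows. The input format (a list of `(thread_id, user_id)` pairs, one per post) and the function name `threads_to_user_documents` are assumptions; the resulting documents would then be passed to a topic model in place of word tokens.

```python
from collections import defaultdict


def threads_to_user_documents(posts):
    """Group posts by thread; each post contributes one user "token"."""
    docs = defaultdict(list)
    for thread_id, user_id in posts:
        docs[thread_id].append(user_id)
    return dict(docs)


# Hypothetical posts as (thread_id, user_id) pairs, in posting order.
posts = [
    ("t1", "alice"), ("t1", "bob"), ("t1", "alice"),
    ("t2", "carol"), ("t2", "bob"),
]
docs = threads_to_user_documents(posts)
print(docs)
```

A user who posts several times in a thread appears several times in its document, so frequent posters weigh more heavily in the detected co-occurrence patterns.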
4.2 A Content Based Approach
We will refer to the topics that are detected by PTMs in user-generated content (such as forums) as discussion
topics. Assuming that the membership of a user in a subcommunity depends exclusively on the discussion
topics in which the user participates, subcommunities can be identified by detecting discussion topics as
presented in Section 3.2. Each discussion topic t can specify a subcommunity in a number of ways, each of
which leads to a slightly different (more or less obvious) interpretation of the subcommunities.
(1) For example, the grade of membership of user u in subcommunity t can be specified as the average
proportion of topic t over all threads to which user u contributes. We call these topic proportions
u’s interests.
(2) In order to also account for the number of posts in the different threads, the average can be weighted
by the number of posts made in each of these threads.
(3) Another way to compute the interest of user u in discussion topic t is to count the number of u’s posts
within discussion threads, the top topic of which is topic t.
Subsequently, if a hard clustering is needed we can place the user u in the subcommunity corresponding
to his largest interest. The approach (3) is suitable in particular if we can assume that each thread essentially
deals with only one topic (the top topic, whose proportion is the greatest for the thread). Essentially we assign
the user to the discussion topic to which most of his posts have been made.
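The three variants and the hard assignment can be sketched for a single user as follows. The inputs are hypothetical: `thread_topics` maps each thread to its topic proportions, and `post_counts` maps each thread the user contributed to onto a number of posts. For variant (2) we assume that the weights are the user's own post counts, which is one possible reading of the description.

```python
def interests_average(post_counts, thread_topics, k):
    """Variant (1): plain average of topic proportions over the user's threads."""
    threads = list(post_counts)
    return [sum(thread_topics[t][j] for t in threads) / len(threads) for j in range(k)]


def interests_weighted(post_counts, thread_topics, k):
    """Variant (2): average weighted by the user's number of posts per thread."""
    total = sum(post_counts.values())
    return [
        sum(n * thread_topics[t][j] for t, n in post_counts.items()) / total
        for j in range(k)
    ]


def interests_top_topic(post_counts, thread_topics, k):
    """Variant (3): number of the user's posts in threads whose top topic is j."""
    counts = [0] * k
    for t, n in post_counts.items():
        top = max(range(k), key=lambda j: thread_topics[t][j])
        counts[top] += n
    return counts


def hard_assignment(interests):
    """Place the user in the subcommunity of the largest interest."""
    return max(range(len(interests)), key=lambda j: interests[j])


# Hypothetical data: two topics, two threads, one user with 3 and 1 posts.
thread_topics = {"t1": [0.75, 0.25], "t2": [0.25, 0.75]}
post_counts = {"t1": 3, "t2": 1}

print(interests_average(post_counts, thread_topics, 2))    # [0.5, 0.5]
print(interests_weighted(post_counts, thread_topics, 2))   # [0.625, 0.375]
print(interests_top_topic(post_counts, thread_topics, 2))  # [3, 1]
```

In this example variant (1) cannot separate the two topics, whereas variants (2) and (3) both place the user in the subcommunity of the first topic because most of the user's posts are in thread t1.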
Note that discussion topics are detected at thread level rather than at post level. This is sensible
since we can assume that most of the posts in a thread deal with similar topics and implicitly use the
enclosing thread as disambiguating context. This implies that the approach is not completely oblivious to the
communication structure: the subcommunity of a user is determined by the topics of the user’s posts, which
are in turn influenced by the (tokens of the) posts of the other users talking in the same thread.
One might thus argue that users who often talk within the same thread are more likely to end up in the same
subcommunity. Yet, on closer inspection this is only the case if the co-occurrences of these users’
words in the threads are statistically relevant at the level of the whole document collection, i.e. here,
throughout all threads. Given a large number of threads and posts, as in our use case, our
subcommunity assignment is therefore overwhelmingly determined by the content of the communication
rather than by its structure.
All in all, the approaches introduced here are suitable for scenarios in which there is no reason to suppose
that subcommunity members are bound together by ties other than the need to solve their problems at hand,
and in which these problems mostly fall under one topic. This is true to a large extent in so-called business
communities, as opposed to social networks in the narrower sense.
The assumption underlying the subcommunity approaches proposed so far in this section is that a
subcommunity is essentially defined by one discussion topic and leads to a user ending up (with high probability