IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS

21 - 23 June

Algarve, Portugal

Proceedings of

INTELLIGENT SYSTEMS AND AGENTS 2009

Edited by:

António Palma dos Reis

international association for development of the information society

IADIS INTERNATIONAL CONFERENCE

INTELLIGENT SYSTEMS AND

AGENTS 2009

part of the

IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND

INFORMATION SYSTEMS 2009

PROCEEDINGS OF THE

IADIS INTERNATIONAL CONFERENCE

INTELLIGENT SYSTEMS AND

AGENTS 2009

part of the

IADIS MULTI CONFERENCE ON COMPUTER SCIENCE AND

INFORMATION SYSTEMS 2009

Algarve, Portugal

JUNE 21 - 23, 2009

Organised by

IADIS

International Association for Development of the Information Society

Copyright 2009

IADIS Press

All rights reserved

This work is subject to copyright. All rights are reserved, whether the whole or part of the material

is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation,

broadcasting, reproduction on microfilms or in any other way, and storage in data banks.

Permission for use must always be obtained from IADIS Press. Please contact secretariat@iadis.org

Intelligent Systems and Agents Volume Editor:

António Palma dos Reis

Computer Science and Information Systems Series Editors:

Piet Kommers, Pedro Isaías and Nian-Shing Chen

Associate Editors: Luís Rodrigues and Patrícia Barbosa

ISBN: 978-972-8924-87-4

TABLE OF CONTENTS

FOREWORD ix

PROGRAM COMMITTEE xiii

KEYNOTE LECTURE xv

FULL PAPERS

COMBINING FUZZY DOMINANCE BASED PSO AND GRADIENT DESCENT

FOR EFFECTIVE PARAMETER ESTIMATION OF GENE REGULATORY

NETWORKS

Sanjoy Das, Karim Morcos and Stephen M. Welch

3

TREE-STRUCTURE-AWARE GENETIC OPERATORS IN GENETIC

PROGRAMMING

Kisung Seo and Chulhyuk Pang

11

CONTENT AND COMMUNICATION BASED SUB-COMMUNITY DETECTION

USING PROBABILISTIC TOPIC MODELS

Alexandru Berlea, Markus Döhring and Nicolai Reuschling

19

THE GENE EXPRESSION PROGRAMMING APPLIED TO THE SEASONAL

DEMAND FORECAST

Evandro Bittencourt, Raul Landmann, Paulo César Oliveira, Sidney Schossland, Edson Wilson

Torrens and Jerzy Wyrebski

27

USING “SOCIAL ACTIONS” AND RL-ALGORITHMS TO BUILD POLICIES IN

DEC-POMDP

Thomas Vincent and Akplogan Mahuna

35

AGENT-BASED LEARNING MANAGEMENT SYSTEMS: UPSIDES AND

CHALLENGES FOR SUPPORTING USERS

Shenghua Liu, Ari Wahlstedt and Anne Honkaranta

43

A NOVEL METHOD FOR IRIS FEATURE EXTRACTION BASED ON

CONTOURLET TRANSFORM AND CO-OCCURRENCE MATRIX

Amir Azizi and Hamid Reza Pourreza

53

IRRIGATION AND FERTILIZATION EXPERT SYSTEM FOR VEGETABLES

BASED ON GEOGRAPHICAL INFORMATION SYSTEM

Mostafa Mahmoud

61

HYBRID SYSTEM BASED ON ROUGH SETS AND WAVELET NEURAL

NETWORKS

Yasser F. Hassan

69

A METHOD FOR COMBINING INSTANCE SELECTION ALGORITHMS

Yoel Caises, Antonio González, Enrique Leyva and Raúl Pérez

77

EXTRACTING RULE SUBSETS IN A GENETIC ITERATIVE MODEL

Yoel Caises, Antonio González, Enrique Leyva and Raúl Pérez

85

ELIMINATING BORDER INSTANCES TO AVOID OVERFITTING

Khalil el Hindi and Mousa AL-Akhras

93

MULTI AGENT SYSTEM INTEGRATING NATURALISTIC DECISION ROLES:

APPLICATION TO MARITIME TRAFFIC

Thierry Le Pors, Thomas Devogele and Christine Chauvin

100

COORDINATED MULTI-AGENT BASED FRAMEWORK FOR PATIENT AND

RESOURCE SCHEDULING

E. Grace Mary Kanaga, M.L. Valarmathi and Preethi S.H. Darius

108

MASITS – A MULTI-AGENT BASED INTELLIGENT TUTORING SYSTEM

DEVELOPMENT METHODOLOGY

Egons Lavendelis and Janis Grundspenkis

116

SPECIFYING AND VALIDATING THE AGENT PERFORMANCE EVALUATION

METHODOLOGY: THE SYMBIOSIS USE CASE

Christos Dimou, Fani A. Tzima, Andreas L. Symeonidis, and Pericles. A. Mitkas

125

AN EXTENSION OF THE CLIENT – SERVER MODEL TO THE MOBILE AGENTS:

THE SELLER – BUYER MODEL

Djamel Eddine Menacer, Habiba Drias and Christophe Sibertin-Blanc

133

A HARDWARE BASED APPROACH FOR PROTECTING MULTIAGENT

SYSTEMS

Antonio Muñoz, Antonio Maña and Marioli Montenegro

141

INTEGRATING ANT COLONY OPTIMIZATION IN A MOBILE-AGENT BASED

RESOURCE DISCOVERY ALGORITHM

Yasushi Kambayashi and Yoshikuni Harada

149

A MULTIAGENT ARCHITECTURE FOR AUGMENTED REALITY

APPLICATIONS

J. A. Mateos Ramos, D. Vallejo Fernández, I. Arriaga Sánchez, C. González Morcillo

159

ALTERNATIVE NEIGHBORHOOD CONFIGURATIONS IN AN ABMS MODEL TO

ESTIMATE THE ADOPTION OF TELECENTERS IN BRAZIL

Ismael Mattos A. Ávila, Luiz Acácio G. Rolim and Giovanni M. Holanda

167

MODELING RESPONSIBILITY IN ORGANIZATIONS

Lambèr Royakkers and Maarten Verkerk

177

SHORT PAPERS

FOCUSED TIME-DELAY NEURAL NETWORK MODELING TOWARDS TYPING

STREAM PREDICTION

Jun Li, Karim Ouazzane, Hassan Kazemian, Yanguo Jing and Richard Boyd

189

USE OF THE NEURAL NETWORK FOR ESTIMATING THE MUD ‘S QUANTITY

GENERATED BY EFFLUENT TREATMENT STATIONS: A CASE STUDY

Paulo Bousfield, Cladir Zanottelli, Paulo Olivieira, Cátia Ganske, Sidney Schossland and Edson

Torrens

195

TOPOLOGICAL APPROACH FOR ROBUST INTERPOLATION OF SPEECH-

SPECTRA

Yoshinao Shiraki

200

APPLICATION OF NEURAL NETWORK TO PREDICT ADVERSE SITUATIONS IN

TROUBLE TICKETING REPORTS

Julia Gómez, Yaiza Temprado, Margarita Gallardo, Carolina García and Francisco Javier

Molinero

204

SUMMAVILLE: AN AUTOMATIC AND WEB-BASED NEWS STORIES

SUMMARIZER

Paulo C F de Oliveira, Edson Wilson Torrens, Paulo Bousfield, Sidney Schossland, Evandro

Bittencourt and Raul Landmann

209

CLASSIFICATION OF SERIOUS SEXUAL ASSAULT USING FUZZY

CLUSTERING

Don Casey and Phillip Burrell

215

AN ARCHITECTURE FOR AN AGENT-BASED COOPERATIVE SYSTEMS

Ebrahim Alhashel, Masoud Mohammadian, Bala Balachandran and Dharmendra Sharma

219

FINITE, REASONING AND INTERACTING AGENTS

Michal Walicki and Paul Simon Svanberg

225

AMBIENT ACTIVITY RECOGNITION: A POSSIBILISTIC APPROACH

Patrice C. Roy, Bruno Bouchard, Abdenour Bouzouane and Sylvain Giroux

231

EVACUATION BEHAVIORS IN AN EMERGENCY STATION BY AGENT-BASED

APPROACH

Kazuki Satoh, Toru Takahashi, Takashi Yamada, Atsushi Yoshikawa and Takao Terano

236

DOC-OPT A NEW METHOD FOR DISTRIBUTED CONSTRAINT SATISFACTION

AND OPTIMIZATION PROBLEMS RESOLUTION

Kais Ben Salah and Khaled Ghedira

242

AN HYBRID APPROACH FOR FAULT RESISTANCE IN MULTI-AGENT

SYSTEMS

Mounira Bouzahzah and Ramdane Maamri

247

POSTERS

CONTEXT MANAGEMENT AND USER PREFERENCE LEARNING IN SMART

HOME ENVIRONMENTS

Víctor M. Peláez Martínez, Luis Ángel San Martín Rodríguez, Roberto González Rodríguez and

Vanesa Lobato Rubio

255

ROSES: AN EXPERT SYSTEM FOR DIAGNOSING SIX NEUROLOGIC DISEASES

IN CHILDREN

Sayed Yousef Monir Vaghefi and Touran Mahmoudian Isfahani

259

NEURAL NETWORKS AS IMPROVING TOOLS FOR AGENT BEHAVIOR

Alketa Hyso, Eva Çipi and Betim Çiço

261

AUTHOR INDEX

FOREWORD

These proceedings contain the papers of the IADIS International Conference on Intelligent

Systems and Agents 2009, which was organised by the International Association for

Development of the Information Society in Algarve, Portugal, 21 – 23 June, 2009. This

conference is part of the Multi Conference on Computer Science and Information Systems

2009, 17 - 23 June 2009, which had a total of 1131 submissions.

The IADIS Intelligent Systems and Agents conference addresses in detail two main aspects:

intelligent systems and agents. The conference aims to contribute to both academics and

practitioners, so both fundamental and applied research are considered

relevant.

Submissions were accepted under the following areas and topics:

Area 1 – Intelligent Systems

- Algorithms

- Artificial Intelligence

- Automation Systems and Control

- Bio Informatics

- Computational Intelligence

- Expert Systems

- Fuzzy Technologies and Systems

- Game and Decision Theories

- Intelligent Control Systems

- Intelligent Internet Systems

- Intelligent Software Systems

- Intelligent Systems

- Machine Learning

- Neural Networks

- Neurocomputers

- Optimization

- Parallel Computation

- Pattern Recognition

- Robotics and Autonomous Robots

- Signal Processing

- Systems Modelling

- Web Mining

Area 2 – Agents

- Adaptive Agent Systems

- Agent Applications

- Agent Communication

- Agent Development

- Agent middleware

- Agent Models and Architectures

- Agent Ontologies

- Agent Oriented Systems and Engineering

- Agent Programming, Languages and Environments

- Agent Systems

- Agent Technologies

- Agent Theories

- Agent Trends

- Agents Analysis and Design

- Agents and Learning

- Agents and Ubiquitous Computing

- Agents in Networks

- Agents Protocols and Standards

- Artificial Systems

- Computational Complexity

- eCommerce and Agents

- Embodied Agents

- Mobile Agents

- Multi-Agent Systems

- Negotiation Strategies

- Performance Issues

- Security, Privacy and Trust

- Semantic Grids

- Simulation

- Web Agents

The IADIS Intelligent Systems and Agents 2009 conference received 103 submissions from

more than 27 countries. Each submission has been anonymously reviewed by an average of

four independent reviewers, to ensure that accepted submissions were of a high standard.

Consequently, only 22 full papers were approved, which means an acceptance rate below 22%.

A few more papers were accepted as short papers and posters. An extended version of

the best papers will be published in the IADIS International Journal on Computer Science

and Information Systems (ISSN: 1646-3692) and also in other selected journals, including

journals from Inderscience.

Besides the presentation of full papers, short papers and posters, the conference also

included one keynote presentation from an internationally distinguished researcher. We

would therefore like to express our gratitude to Dr. Ronald R. Yager, Machine Intelligence

Institute, Iona College, New York for accepting our invitation as keynote speaker.

As we all know, organising a conference requires the effort of many individuals. We would

like to thank all members of the Program Committee, for their hard work in reviewing and

selecting the papers that appear in the proceedings.

This volume has taken shape as a result of the contributions from a number of individuals.

We are grateful to all authors who have submitted their papers to enrich the conference

proceedings. We wish to thank all members of the organizing committee, delegates,

invitees and guests whose contribution and involvement are crucial for the success of the

conference.

Last but not least, we hope that everybody will have a good time in Algarve, and we

invite all participants to next year's edition of the IADIS International Conference on

Intelligent Systems and Agents 2010, which will be held in Freiburg, Germany.

António Palma dos Reis,

ISEG - Technical University of Lisbon,

Portugal

Intelligent Systems and Agents 2009 Conference Program Chair

Piet Kommers, University of Twente, The Netherlands

Pedro Isaías, Universidade Aberta (Portuguese Open University), Portugal

Nian-Shing Chen, National Sun Yat-sen University, Taiwan

MCCSIS 2009 General Conference Co-Chairs

Algarve, Portugal

June 2009

PROGRAM COMMITTEE

INTELLIGENT SYSTEMS AND AGENTS CONFERENCE

PROGRAM CHAIR

António Palma dos Reis, ISEG - Technical University of Lisbon, Portugal

MCCSIS GENERAL CONFERENCE CO-CHAIRS

Piet Kommers, University of Twente, The Netherlands

Pedro Isaías, Universidade Aberta (Portuguese Open University), Portugal

Nian-Shing Chen, National Sun Yat-sen University, Taiwan

INTELLIGENT SYSTEMS AND AGENTS CONFERENCE COMMITTEE

MEMBERS

Adel M. Alimi, University of Sfax, Tunisia

Adina Magda Florea, University "Politehnica" of Bucharest, Romania

Agris Nikitenko, Riga Technical University, Latvia

Alessandro Ricci, Università di Bologna in Cesena, Italy

Alfredo Garro, Universita' della Calabria, Italy

Andrea Addis, University of Cagliari, Italy

Angel García-Olaya, Universidad Carlos III de Madrid, Spain

Anton Bogdanovych, UTS, Australia

Anton Nijholt, University of Twente, The Netherlands

Costin Badica, University of Craiova, Romania

Dariusz Krol, Wroclaw University of Technology, Poland

David A. Pelta, University of Granada, Spain

Dickson K.W. Chiu, Computer Systems, Hong Kong

Dídac Busquets, Universitat de Girona, Spain

Djamila Ouelhadj, ASAP Research Group, UK

Eloisa Vargiu, DIEE - University of Cagliari, Italy

Ezendu Ariwa, London Metropolitan University, United Kingdom

Fariba Sadri, Imperial College London, UK

Federico Bergenti, Università degli Studi di Parma, Italy

Federico Castanedo Sotela, Universidad Carlos III de Madrid, Spain

Fikret Ercal, University of Missouri, USA

Gerard Murray, Port of Melbourne - Boskalis Australia Alliance, Australia

Giovanni Semeraro, University of Bari, Italy

Giuseppe Mangioni, Universita di Catania, Italy

Hans Werner Guesgen, Massey University, New Zealand

Haralambos Mouratidis, University of East London, United Kingdom

Heinrich C. Mayr, Alpen-Adria-Universitaet Klagenfurt, Austria

Huiye Ma, Centrum voor Wiskunde en Informatica (CWI), The Netherlands

Jackeline Spinola de Freitas, Universidad Politécnica de Madrid, Spain

Jaime Ramírez, Universidad Politécnica de Madrid, Spain

Javier Carbo Rubiera, Univ. Carlos III de Madrid, Spain

Jesualdo Tomás Fernández Breis, University of Murcia, Spain

Jim Cunningham, Imperial College, UK

Jorge A. Ramírez-Uresti, ITESM-CEM, Mexico

Jørgen Villadsen, Technical University of Denmark, Denmark

José Antonio Iglesias, University of Carlos III, Spain

José Carlos Cortizo Pérez, Universidad Europea de Madrid, Spain

José Manuel Molina López, Universidad Carlos III de Madrid, Spain

Juan Manuel Serrano, Universidad Rey Juan Carlos, Spain

Julius Stuller, Academy of Sciences of the Czech Republic, Czech Republic

Krysia Broda, Imperial College, UK

Lars Nolle, Nottingham Trent University, UK

Laura Naismith, McGill University, Canada

Laurent Vercouter, Ecole des Mines de Saint-Etienne, France

Leonardo Garrido, Tecnologico de Monterrey, México

Longbing Cao, Univ of Technology, Sydney, Australia

Maite López Sánchez, University of Barcelona, Spain

Marc Esteva, University of Technology, Sydney, Australia

Maria Bielikova, Slovak University of Technology, Slovakia

Maria Salamó Llorente, University of Barcelona, Spain

Marko Grobelnik, Jozef Stefan Institute, Slovenia

Matjaz Gams, Jozef Stefan Institute, Slovenia

Mengjie Zhang, Victoria University of Wellington, New Zealand

Michelangelo Ceci, Università degli Studi di Bari, Italy

Miguel Angel Patricio, Universidad Carlos III de Madrid, Spain

Miguel González Mendoza, ITESM-CEM, Mexico

Mirjana Ivanovic, University of Novi Sad, Serbia

Nesrine Baklouti, University of Sfax, Tunisia

Nizar Rokbani, REGIM, Tunisia

P.K. Mahanti, University of New Brunswick, Canada

Paolo Petta, Institute of Medical Cybernetics and Artificial Intelligence, Austria

Patrick Wong, Open University, United Kingdom

Rainer Hilscher, New Vectors LLC, USA

Ramon Brena Pinero, Tecnológico de Monterrey, Mexico

Raúl Arrabales Moreno, Universidad Carlos III de Madrid, Spain

Raymond Chiong, Swinburne University of Technology, Malaysia

Razvan Andonie, Central Washington University, USA

Ricardo Imbert, Universidad Politécnica de Madrid, Spain

Roland Kaschek, Massey University, New Zealand

Roman Neruda, Academy of Sciences of the Czech Republic, Czech Republic

Stuart Chalmers, University of Aberdeen, UK

Sviatoslav Braynov, University of Illinois, USA

Thierry Moyaux, Université de Lyon, France

Tomas Klos, Delft University of Technology, The Netherlands

Vincent Thomas, LORIA, France

Viorel Negru, West University of Timisoara, Romania

William Song, Durham University, UK

Yubin Yang, Nanjing University, China

Zoran Budimac, University of Novi Sad, Serbia

KEYNOTE LECTURE

LEARNING METHODS FOR EVOLVING INTELLIGENT

SYSTEMS AND AGENTS

Ronald R. Yager

Machine Intelligence Institute

Iona College, New York

ABSTRACT

In this presentation our concern is with technologies that allow the construction of intelligent

systems and agents that can evolve and learn based on experiences. We discuss a number of

technologies that support this capability: the participatory learning paradigm, the hierarchical

prioritized structure and the mountain clustering method. The basic premise of the participatory

learning paradigm is that learning takes place in the framework of what is already learned and

believed. The implication of this is that every aspect of the learning process is affected and guided

by the current belief system. This name, participatory learning, highlights the fact that in learning

we are in a situation in which the current knowledge of what we are trying to learn participates in

the process of learning about itself. The hierarchical prioritized structure provides a generalization

of fuzzy systems modeling by introducing a hierarchical representation of the rules. It supports

systems evolution by allowing the learning of new rules and their insertion at different levels

of the hierarchy.

Full Papers

COMBINING FUZZY DOMINANCE BASED PSO AND

GRADIENT DESCENT FOR EFFECTIVE PARAMETER

ESTIMATION OF GENE REGULATORY NETWORKS

Sanjoy Das, Karim Morcos

Electrical & Computer Engineering Department

Stephen M. Welch

Division of Agronomy

Kansas State University

Manhattan, KS 66506

USA

ABSTRACT

Stochastic optimization techniques such as multi-objective PSO are very useful in determining the parameters of gene

regulatory network models. Unfortunately, evaluating the performance of such a model with a set of parameters is

computationally expensive. The fuzzy ε-dominance based PSO algorithm is a recent approach that is particularly well

suited for these modeling tasks, achieving convergence to the Pareto front with a relatively small number of function

evaluations. In order to further reduce the function evaluations, this paper considers ways to incorporate explicit gradient

descent steps within this algorithm. As a case study, the proposed approach is used to compute

the parameters of a differential equation model of Arabidopsis flowering time control.

KEYWORDS

PSO, multi-objective, gradient, optimization, genomics, Arabidopsis.

1. INTRODUCTION

Particle Swarm Optimization (PSO) is a population-based approach that maintains a set of candidate

solutions, called particles, which are allowed to move within the search space (Clerc & Kennedy, 2002). The

trajectory followed by each particle is guided by its own memory, as well as by its interaction with other

particles. The specific method of adjusting a particle's trajectory is motivated by the interaction of birds,

fish, or other organisms that move in swarms. Eventually, the particles converge to suitable optima. We

will use the terms particle and solution interchangeably henceforth.

Multi-objective optimization has been the focus of much recent research (Das & Panigrahi, 2008). Unlike

in single-objective optimization where it is easy to compare one solution to another, in multi-objective

problems a solution that is inferior to another in one objective may in fact be better in another. Under

these circumstances, the concept of Pareto optimality is used. Several multi-objective versions of PSO have

been recently proposed (Coello et al., 2004, Koduru et al., 2007, Das & Panigrahi, 2008).

While biologically motivated algorithms such as PSO are very effective in providing optimal solutions,

they can be further improved by maintaining the correct balance between exploration and exploitation (Das,

2008). Adding an exploitative component allows the algorithm to make use of local information to guide the

search towards better regions in the search space. This property lets the algorithm converge towards the

Pareto front using fewer function evaluations – a much-desired characteristic in applications such as gene

regulatory network modeling, where a substantial amount of computation is involved in evaluating each

objective function (cf. Cai et al., 2007). Local search techniques have been successfully included within

genetic algorithms (Koduru et al., 2008). PSO hybrid algorithms for single objective optimization (Das et al., 2006), and more recently, multi-objective optimization (Koduru et al., 2007), have also made their appearance. The approach proposed by Koduru et al. (2007) makes use of the concept of fuzzy ε-dominance within PSO to accomplish better convergence, and is specifically suited for gene regulatory network modeling. It has also been shown to outperform the popular NSGA-II (Deb et al., 2002), as well as the multi-objective PSO (MOPSO) proposed by Coello et al. (2004).

In the above methods, local search techniques have relied on the derivative-free Nelder-Mead algorithm,

which can only approximate the true gradient of the objective. In this paper, a method to explicitly compute

the gradient of the objectives for a representative gene differential equation model, namely the genetic

network controlling flowering time in Arabidopsis, has been formulated. Using this result, the addition of

separate gradient descent steps to the PSO of Koduru et al. (2007) is considered. The simulation results

indicate that the inclusion of gradient descent within PSO improves the convergence of the optimization

algorithm.

2. FUZZY ε-DOMINANCE BASED PSO

2.1 PSO

We first describe the variant of the standard PSO algorithm used here. This algorithm maintains a population of N particles whose positions, X(i), i = 1, 2, …, N, are initialized to random values. These positions are incremented in each iteration t of the algorithm according to the instantaneous velocity $V_t(i)$, as follows,

$$X_{t+1}(i) = X_t(i) + V_t(i) \qquad (1)$$

The velocity is also updated in each iteration, using the particle’s own recorded previous best position, as

well as the current location of the other particles. The update rule is given by,

$$V_{t+1}(i) = \chi\left(V_t(i) + C_1 \times U[0,1] \times (IB(i) - X_t(i)) + C_2 \times U[0,1] \times (GB - X_t(i))\right) \qquad (2)$$

In the above equation, $C_1$ and $C_2$ are two constants, called the cognitive and the social constants, and χ is a constriction coefficient that helps in maintaining stability (Clerc & Kennedy, 2002). The factor U[0,1] is a uniformly distributed random number in [0, 1]. The quantity IB(i) is the individual best recorded position of the i-th particle so far, in terms of the objective function. The other quantity, GB, is the global best position of any particle (usually in the current iteration t). In this paper, these constants have been set to the following values: χ = 0.4, $C_1$ = 2.1 and $C_2$ = 2.1.
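The update rules (1) and (2) can be illustrated with a minimal Python sketch. The function name `pso_step` and the choice to draw a fresh U[0,1] sample for every coordinate are assumptions made for illustration; the text does not state whether the random factors are drawn per particle or per dimension.

```python
import random

# Constants quoted in the text: constriction, cognitive and social constants.
CHI, C1, C2 = 0.4, 2.1, 2.1


def pso_step(x, v, ib, gb, rng=random):
    """Apply Eq. (1) and Eq. (2) once to a single particle.

    x, v  : current position X_t(i) and velocity V_t(i), as lists of floats
    ib, gb: individual best IB(i) and global best GB
    Returns (X_{t+1}(i), V_{t+1}(i)).
    """
    # Eq. (1): the position is advanced with the current velocity V_t(i).
    new_x = [xj + vj for xj, vj in zip(x, v)]
    # Eq. (2): constricted velocity update, one random draw per coordinate
    # (an assumption of this sketch).
    new_v = [
        CHI * (vj
               + C1 * rng.random() * (ibj - xj)
               + C2 * rng.random() * (gbj - xj))
        for xj, vj, ibj, gbj in zip(x, v, ib, gb)
    ]
    return new_x, new_v
```

When a particle already sits at both its individual best and the global best, the attraction terms vanish and the velocity simply shrinks by the factor χ at each step.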

2.2 Multi-objective Optimization

When dealing with optimization problems with multiple objectives, the conventional concept of optimality

does not hold (Das & Panigrahi, 2008). Instead, the concepts of dominance and Pareto-optimality are applied.

Without loss of generality, let us assume that the multi-objective problem entails the simultaneous minimization of all M objectives, $e_i(\cdot)$, $i = 1, \ldots, M$. Let the solution space be denoted as $\Psi \subset \Re^n$. A solution $u \in \Psi$ is said to dominate another solution $v \in \Psi$ iff $\forall i \in \{1, 2, \ldots, M\}$, $e_i(u) \le e_i(v)$, with at least one of the inequalities being strict; i.e. u is as good as v for all objectives and better for at least one. This relationship is written $u \prec v$. In the set of all feasible solutions, that subset whose members are not dominated is called the Pareto set. In other words, if S is the population, the Pareto set is $\{u \in S \mid \forall v \in S, \neg(v \prec u)\}$. Its corresponding image in the space of all objective functions is known as the Pareto front.
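The dominance relation and the Pareto set defined above can be sketched in a few lines of Python. The names `dominates` and `pareto_set` are illustrative only, and solutions are represented directly by their objective vectors, with all objectives minimized.

```python
def dominates(e_u, e_v):
    """True if objective vector e_u dominates e_v (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(e_u, e_v))
    strictly_better = any(a < b for a, b in zip(e_u, e_v))
    return no_worse and strictly_better


def pareto_set(objs):
    """Indices of the members of `objs` that no other member dominates."""
    return [
        i for i, e_u in enumerate(objs)
        if not any(dominates(e_v, e_u) for j, e_v in enumerate(objs) if j != i)
    ]
```

For the four objective vectors (1, 2), (2, 1), (2, 2) and (3, 3), only the first two are non-dominated, so `pareto_set` returns their indices.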

Since all the solutions in the Pareto set are non-dominated, they must be treated as equally good.

Therefore, the goal of an effective multi-objective optimization algorithm is to find candidate solutions

whose images in the objective function space are (i) as close to the true Pareto front as possible, and (ii)

as spread out and evenly spaced as possible, thereby sampling an extensive region of the Pareto

front. These two conditions are usually referred to as convergence and diversity respectively. Good

convergence and good diversity are the two crucial goals of any multi-objective optimization algorithm,

including PSO. Fuzzy ε-dominance is a recently proposed scheme that combines convergence and diversity

into one single measure, allowing multi-objective optimization problems to be treated as though they

involved only a single objective. Fuzzy ε-dominance is an extension of fuzzy dominance that has been

modified to take into account diversity. Both are discussed next in this section.

2.3 Fuzzy Dominance based PSO

Given a monotonically non-decreasing function $\mu_i^{dom}(\cdot)$, whose range is in [0, 1], $i \in \{1, 2, \ldots, n\}$, a solution $u \in \Psi$ is said to i-dominate solution $v \in \Psi$, if and only if $e_i(u) < e_i(v)$. This relationship can be denoted as $u \succ_i^F v$. If $u \succ_i^F v$, the degree of fuzzy i-dominance is equal to $\mu_i^{dom}(e_i(v) - e_i(u)) \equiv \mu_i^{dom}(u \succ_i^F v)$. Fuzzy dominance can be regarded as a fuzzy relationship $u \succ_i^F v$ between u and v. Solution $u \in \Psi$ is said to fuzzy dominate $v \in \Psi$ if and only if $\forall i \in \{1, 2, \ldots, M\}$, $u \succ_i^F v$. This relationship can be denoted as $u \succ^F v$. The degree of fuzzy dominance can be defined by invoking the concept of fuzzy intersection and using a t-norm,

$$\mu^{dom}(u \succ^F v) = \bigcap_{i=1}^{M} \mu_i^{dom}(u \succ_i^F v) \qquad (3)$$

In another implementation of fuzzy dominance (Koduru et al., 2008), the membership functions $\mu_i^{dom}(\cdot)$ were defined to be zero for negative arguments. Therefore, whenever $e_i(u) > e_i(v)$, the degree of fuzzy dominance $u \succ_i^F v$ necessarily evaluated to zero. In this paper, we allow non-zero values in accordance with Koduru et al. (2007). The membership functions used are trapezoidal, yielding nonzero values whenever their arguments are to the right of a threshold ε, as shown in Figure 1 below.

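Mathematically, the memberships $\mu_i^{dom}(u \succ_i^F v)$ are defined as,

$$\mu_i^{dom}(\Delta e_i) = \begin{cases} 0 & \text{if } \Delta e_i \le -\varepsilon \\ (\Delta e_i)/\Delta_i & \text{if } -\varepsilon < \Delta e_i < \Delta_i - \varepsilon \\ 1 & \text{if } \Delta e_i \ge \Delta_i - \varepsilon \end{cases} \qquad (4)$$

where $\Delta e_i = e_i(v) - e_i(u)$. Given a population of solutions $S \subset \Psi$, a solution $v \in S$ is said to be fuzzy dominated in S iff it is fuzzy dominated by any other solution $u \in S$. In this case, the degree of fuzzy dominance can be computed by performing a union operation over every possible $\mu^{dom}(u \succ^F v)$, carried out using t-conorms as,

$$\mu^{dom}(S \succ^F v) = \bigcup_{u \in S} \mu^{dom}(u \succ^F v) \qquad (5)$$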

In this manner, each solution can be assigned a single measure to reflect the amount by which it is dominated by others in a population. Better solutions within a set will be assigned lower fuzzy dominances, although unlike in Koduru et al. (2008), non-dominated solutions may not necessarily be assigned zero values. The intersection and union operators follow the standard min and max definitions, respectively (Mendel, 1995).
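A minimal Python sketch of Equations (3) and (5) with the min and max operators is given here. The membership function `ramp_membership` is a hypothetical stand-in, not the paper's Equation (4): it assumes a continuous ramp that is 0 up to −ε and reaches 1 after a width Δ, and the default values of `eps` and `width` are placeholders. All function names are illustrative.

```python
def ramp_membership(delta_e, eps=0.1, width=1.0):
    """Hypothetical trapezoidal membership (an assumption of this sketch):
    0 for delta_e <= -eps, rising linearly over `width`, then saturating at 1."""
    return min(1.0, max(0.0, (delta_e + eps) / width))


def pairwise_degree(e_u, e_v, mu=ramp_membership):
    """Eq. (3): degree to which u fuzzy dominates v, with min as the t-norm.
    The argument of each membership is delta e_i = e_i(v) - e_i(u)."""
    return min(mu(ev_i - eu_i) for eu_i, ev_i in zip(e_u, e_v))


def dominated_degree(objs, k, mu=ramp_membership):
    """Eq. (5): degree to which solution k is fuzzy dominated in the
    population, with max as the t-conorm over all other solutions."""
    return max(
        pairwise_degree(e_u, objs[k], mu)
        for j, e_u in enumerate(objs) if j != k
    )
```

Lower values of `dominated_degree` indicate better solutions, which is the ordering used when the population is ranked.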

[Figure 1: plot with axis labels e(u)-e(v) and μdom(e(v)-e(u)); marked values -ε, Δ and 1.]

Figure 1. Fuzzy membership functions used here to compute ε-dominances

Typically, in multi-objective PSO, an archive of the best A solutions found is maintained. The velocities of the particles in the population are redirected towards archive solutions. As newer solutions are discovered, the best ones among them are inserted into the archive, while the older solutions are discarded. This is also the strategy adopted here. In each iteration of the present scheme, the population of particles is merged with the archive, and the fuzzy dominances are computed. The A + N solutions are sorted in ascending order of their fuzzy dominances, and the best A solutions form the archive for the next iteration. The global best used in equation (2) is the archive solution with the lowest fuzzy dominance. The individual best IB(i) is updated only when the i-th particle dominates its own earlier stored individual best, in which case $X_t(i)$ replaces IB(i).
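The archive update described above can be sketched as follows. The function `update_archive` and its `degree_fn` argument are hypothetical names: `degree_fn` stands for any routine that returns one fuzzy dominance value per solution in the merged set, so that the sketch does not depend on a particular membership function.

```python
def update_archive(archive, population, degree_fn, size):
    """Merge archive and population, rank by fuzzy dominance, keep the best.

    archive, population: lists of solutions
    degree_fn          : maps a list of solutions to a list of fuzzy
                         dominance values (lower is better); supplied by caller
    size               : archive capacity A
    Returns (new_archive, global_best).
    """
    merged = archive + population              # the A + N candidates
    degrees = degree_fn(merged)
    # Ascending order of fuzzy dominance; sorted() is stable, so ties keep
    # their original relative order.
    order = sorted(range(len(merged)), key=degrees.__getitem__)
    new_archive = [merged[i] for i in order[:size]]
    # The global best GB is the archive member with the lowest fuzzy dominance.
    return new_archive, new_archive[0]
```

Passing the ranking function in as an argument keeps the archive logic separate from the choice of t-norm and membership shape.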

3. GENE NETWORK MODEL AND OVERALL APPROACH

3.1 Flowering Time Control Gene Network in Arabidopsis

In the Arabidopsis plant (A. thaliana), three genes TERMINAL FLOWERING 1 (TFL1), APETALA 1 (AP1),

and LEAFY (LFY) play a special role in flowering (Welch et al. 2003, 2005). OFF to ON state changes in two

of them (AP1 and LFY) signal plant commitment to flowering. One plausible mechanism for their interaction

involves a three-element positive feedback loop that incorporates a bistable switch as shown below,

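$$\begin{aligned} \frac{d}{dt} LFY &= R_L\, h_{up}(SOC1 - TFL1, K_{LFY}) - \lambda_L\, LFY \\ \frac{d}{dt} AP1 &= R_H\, h_{up}(LFY, K_{AP1}) - \lambda_H\, AP1 \\ \frac{d}{dt} TFL1 &= R_T\, h_{dwn}(AP1, K_{TFL1}) - \lambda_T\, TFL1 \end{aligned} \qquad (6)$$

where $h_{up}$ and $h_{dwn}$ are, respectively, promotive (n = 3) and repressive (n = -3) Hill functions defined as,

$$h(x, K) = \frac{x^n}{x^n + K^n} \qquad (7)$$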

The difference input to $h_{up}$ in the equation for LFY is restricted to positive values as negative biochemical concentrations are impossible. The quantity K is replaced with $K_{LFY}$, $K_{AP1}$, or $K_{TFL1}$ for the Hill functions pertaining to LFY, AP1, and TFL1 in equation (6). For clarity, the time argument has been dropped in the equation, a convention that we shall follow for the remainder of the paper.
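The right-hand side of the model in Equation (6), together with the Hill functions of Equation (7), can be written as a short Python sketch. The function names, the dictionary keys, and the clamping of the difference input with `max(..., 0.0)` are illustrative choices; the SOC1 level is passed in as a plain number. The repressive function is written in the algebraically equivalent form K^3 / (K^3 + x^3) so that x = 0 does not cause a division by zero.

```python
def h_up(x, k, n=3):
    """Promotive Hill function, Eq. (7) with exponent n = 3."""
    return x**n / (x**n + k**n)


def h_dwn(x, k, n=3):
    """Repressive Hill function, Eq. (7) with exponent -n, rewritten as
    K^n / (K^n + x^n) to stay finite at x = 0."""
    return k**n / (k**n + x**n)


def rhs(state, soc1, p):
    """Time derivatives (dLFY/dt, dAP1/dt, dTFL1/dt) of Eq. (6)."""
    lfy, ap1, tfl1 = state
    drive = max(soc1 - tfl1, 0.0)   # difference input kept non-negative
    d_lfy = p["R_L"] * h_up(drive, p["K_LFY"]) - p["lam_L"] * lfy
    d_ap1 = p["R_H"] * h_up(lfy, p["K_AP1"]) - p["lam_H"] * ap1
    d_tfl1 = p["R_T"] * h_dwn(ap1, p["K_TFL1"]) - p["lam_T"] * tfl1
    return d_lfy, d_ap1, d_tfl1


# Rate, decay and threshold values as listed in Table 1 of the text.
PARAMS = {"R_L": 1.0, "R_H": 1.0, "R_T": 1.0,
          "lam_L": 1.0, "lam_H": 1.0, "lam_T": 1.0,
          "K_LFY": 0.4, "K_AP1": 0.9, "K_TFL1": 0.3}
```

A Hill function evaluated at its own threshold K returns one half, which is a convenient sanity check for either variant.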

The TFL1 gene becomes increasingly active with time, in the developing shoot’s apex. In turn, this

influences the activity of LFY and AP1 in leaf primordia that sequentially emerge from the apex. It is not

known precisely how this molecular influence is exerted, but the net effect is to slow the plant’s development

toward flowering. Increasing levels of LFY and AP1 within each primordium offset this effect. Ultimately,

the switch changes state, causing the primordium within which this happens to initiate inflorescence

development. Equation (6) models all these effects as if they are direct, although this may not be the case in

a real plant.

External switch input is provided by the expression level of the SUPPRESSOR OF OVEREXPRESSION

OF CO (SOC1) gene. An equation for the SOC1 expression level at time instant t is,

)

2

sin(

2

)(

s

ss

s

p

t

t

ba

tbSOC1

π−

+=

(8)

As done earlier in Koduru et al. (2008), synthetic data was generated covering a period of nine days and the emergence of four leaf primordia. Each primordium was simulated for a total of 30 hrs of simulation time. This data provided time-varying expression levels for each gene, which shall be denoted as $LFY_d$, $AP1_d$, $TFL1_d$, and $SOC1_d$. The values of the parameters used are provided in Table 1 below. For further details of the implementation, the reader is referred to Koduru et al. (2008).

Table 1. Values of the parameters used to generate desired expression level data

Parameter   Value
R_L         1.0
R_H         1.0
R_T         1.0
λ_L         1.0
λ_H         1.0
λ_T         1.0
K_LFY       0.4
K_AP1       0.9
K_TFL1      0.3
a_s         0.01
b_s         0.007
p_s         24

3.2 Problem Formulation

The goal of the gene network problem is to estimate the values of the six variables, $R_L$, $R_H$, $R_T$, $K_{LFY}$, $K_{AP1}$, and $K_{TFL1}$ so that the model, when simulated again, produces expression levels for LFY and AP1 that are as close as possible to those stored earlier as $LFY_d$ and $AP1_d$. In other words, the two objective functions to minimize are,

$$err_1 = \int (LFY - LFY_d)^2 \, dt \qquad (9a)$$

and

$$err_2 = \int (AP1 - AP1_d)^2 \, dt \qquad (9b)$$
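Objectives of the form (9a) and (9b) have to be evaluated numerically from sampled trajectories. The following Python sketch uses the trapezoidal rule on a uniform time grid; the function name and the choice of quadrature rule are assumptions of the sketch, since the text does not state how the integrals are computed.

```python
def squared_error_integral(sim, target, dt):
    """Trapezoidal approximation of the integral of (sim - target)^2 dt.

    sim, target: equally long sequences sampled on a uniform grid (>= 2 points)
    dt         : grid spacing
    """
    sq = [(s - d) ** 2 for s, d in zip(sim, target)]
    # Trapezoidal rule: every sample has weight dt, end points weight dt / 2.
    return dt * (sum(sq) - 0.5 * (sq[0] + sq[-1]))
```

Calling the function once with the simulated and stored LFY traces and once with the AP1 traces gives the two objective values for one particle.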

Each solution, a particle in PSO, is therefore a six-dimensional vector $X = [R_L \; R_H \; R_T \; K_{LFY} \; K_{AP1} \; K_{TFL1}]$.

This is a simpler version of the original formulation of the GRN2 problem defined in Koduru et al. (2008), where a more extensive set of simulations was carried out and all nine parameters were determined by the algorithm. However, this reduced version is sufficient for the present study, as our aim is limited to studying the effect of gradient descent within a version of PSO. Our goal is not to produce the most effective multi-objective PSO algorithm, which was the focus in Koduru et al. (2008), where a very fast genetic algorithm is reported. A modified version of the fuzzy dominance based PSO, which serves here as the main algorithm within which gradient descent is incorporated, has been shown to perform significantly better than other competitive algorithms on several benchmarks (Koduru et al., 2007).
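As an illustration of how the two objectives in Equations (9a) and (9b) might be evaluated on sampled trajectories, the following minimal sketch approximates each integral with the trapezoidal rule. The function names, the uniform sampling step `dt`, and the choice of the trapezoidal rule are assumptions made for this example and are not taken from the original implementation.

```python
def integrated_squared_error(simulated, desired, dt):
    """Approximate the integral of (simulated - desired)^2 dt (trapezoidal rule).

    Both series are assumed to be sampled on the same uniform grid with step dt.
    """
    if len(simulated) != len(desired):
        raise ValueError("series must have the same length")
    sq = [(s - d) ** 2 for s, d in zip(simulated, desired)]
    # Trapezoidal rule: end points are weighted by one half.
    return dt * (sum(sq) - 0.5 * (sq[0] + sq[-1]))


def objectives(lfy, lfy_d, ap1, ap1_d, dt):
    """Return (err_1, err_2) for one candidate parameter vector X."""
    return (integrated_squared_error(lfy, lfy_d, dt),
            integrated_squared_error(ap1, ap1_d, dt))
```

A particle would be scored by simulating the model with its six parameters and passing the resulting LFY and AP1 series, together with the stored desired series, to `objectives`.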

3.3 Gradient Descent Algorithm

Instead of the aggregate square errors, instantaneous values of the errors in Equation (9a,b) are considered.

These are,

$$err_1 = \left( LFY - LFY^d \right)^2, \qquad (10a)$$

and

$$err_2 = \left( AP1 - AP1^d \right)^2. \qquad (10b)$$

The gradient descent rule that is used for adaptation is,

$$\frac{dX}{dt} = -\zeta \nabla (err_1 + err_2), \qquad (11)$$

which replaces the usual rule $X = X - \zeta \nabla(err)$. The quantity $\zeta$ is the usual adaptation rate constant. The factor $\nabla(err)$ is the vector derivative of err with respect to $X$. With respect to a generic parameter $P$, the scalarized version of Equation (11) is,

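$$\frac{dP}{dt} = -\zeta \left( \frac{\partial err_1}{\partial P} + \frac{\partial err_2}{\partial P} \right) = -\zeta \left( (LFY - LFY^d) \frac{\partial LFY}{\partial P} + (AP1 - AP1^d) \frac{\partial AP1}{\partial P} \right). \qquad (12)$$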

The expression levels in Equation (6) can be rewritten as,

$$\begin{aligned}
LFY &= R_L \left( \lambda_L + \frac{d}{dt} \right)^{-1} h_{up}(SOC1 - TFL1, K_{LFY}), \\
AP1 &= R_H \left( \lambda_H + \frac{d}{dt} \right)^{-1} h_{up}(LFY, K_{AP1}), \\
TFL1 &= R_T \left( \lambda_T + \frac{d}{dt} \right)^{-1} h_{dwn}(AP1, K_{TFL1}),
\end{aligned} \qquad (13)$$

where each $\left( \lambda + \frac{d}{dt} \right)^{-1}$ is a linear operator that is commutative with respect to scalars as well as derivatives.

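Clearly, the derivatives of AP1 with respect to $R_L$, $R_T$, $K_{LFY}$, and $K_{TFL1}$ are zero. Furthermore, letting $VAP1 = \frac{\partial AP1}{\partial R_H}$ and $WAP1 = \frac{\partial AP1}{\partial K_{AP1}}$, we get

$$VAP1 = \left( \lambda_H + \frac{d}{dt} \right)^{-1} h_{up}(LFY, K_{AP1}),$$

and,

$$WAP1 = R_H \left( \lambda_H + \frac{d}{dt} \right)^{-1} \frac{\partial}{\partial K_{AP1}} h_{up}(LFY, K_{AP1}).$$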

Thus, the time evolution of the variables VAP1 and WAP1 follows the differential equations,

$$\frac{dVAP1}{dt} = -\lambda_H VAP1 + h_{up}(LFY, K_{AP1}),$$

and,

$$\frac{dWAP1}{dt} = -\lambda_H WAP1 + R_H \frac{\partial}{\partial K_{AP1}} h_{up}(LFY, K_{AP1}).$$

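Similarly, letting $VLFY = \frac{\partial LFY}{\partial R_L}$ and $WLFY = \frac{\partial LFY}{\partial K_{LFY}}$, we can obtain,

$$\frac{dVLFY}{dt} = -\lambda_L VLFY + h_{up}(SOC1 - TFL1, K_{LFY}),$$

and,

$$\frac{dWLFY}{dt} = -\lambda_L WLFY + R_L \frac{\partial}{\partial K_{LFY}} h_{up}(SOC1 - TFL1, K_{LFY}).$$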

The derivatives of LFY with respect to the other parameters, as well as those of TFL1, are zero. Using Equation (12), the set of gradient descent based adaptation rules can be readily obtained,

$$\begin{aligned}
\frac{dR_L}{dt} &= -\zeta (LFY - LFY^d) \, VLFY, \\
\frac{dR_H}{dt} &= -\zeta (AP1 - AP1^d) \, VAP1, \\
\frac{dK_{LFY}}{dt} &= -\zeta (LFY - LFY^d) \, WLFY, \\
\frac{dK_{AP1}}{dt} &= -\zeta (AP1 - AP1^d) \, WAP1.
\end{aligned} \qquad (14)$$

The adaptation rules for the other parameters cannot be obtained in this manner; those parameters must therefore rely on PSO for improvement. When these adaptation rules are applied to PSO, in each iteration the top M solutions of the population are subject to them. Early experiments by the authors, in which the entire population was subject to adaptation, did not provide any significant gains and are not discussed further in this paper. PSO is run with and without gradient descent based parameter adaptation to investigate the improvement in convergence speed when gradient descent is used.
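The sensitivity equations for VAP1 and WAP1 and the corresponding adaptation rules can be sketched numerically as follows. This is only an illustration under stated assumptions: the saturating form of `h_up`, the input trace `lfy_input`, the forward Euler scheme, and all numerical settings are hypothetical choices made for the example, since the definition of $h_{up}$ and the integrator used in the study are given elsewhere.

```python
import math


def h_up(x, k):
    # Assumed saturating activation; stands in for the paper's h_up.
    return x / (k + x)


def dh_up_dk(x, k):
    # Partial derivative of the assumed h_up with respect to k.
    return -x / (k + x) ** 2


def lfy_input(t):
    # Hypothetical, strictly positive LFY trace used only to drive the example.
    return 0.5 + 0.4 * math.sin(t)


def simulate_ap1(r_h, k_ap1, lam_h=1.0, dt=0.01, steps=3000):
    """Euler-integrate AP1 together with VAP1 = dAP1/dR_H and WAP1 = dAP1/dK_AP1."""
    ap1 = v = w = 0.0
    for n in range(steps):
        x = lfy_input(n * dt)
        d_ap1 = -lam_h * ap1 + r_h * h_up(x, k_ap1)   # AP1 dynamics implied by Eq. (13)
        d_v = -lam_h * v + h_up(x, k_ap1)             # sensitivity to R_H
        d_w = -lam_h * w + r_h * dh_up_dk(x, k_ap1)   # sensitivity to K_AP1
        ap1, v, w = ap1 + dt * d_ap1, v + dt * d_v, w + dt * d_w
    return ap1, v, w


def adaptation_rates(zeta, ap1, ap1_d, v, w):
    """Right-hand sides of the R_H and K_AP1 rules in Equation (14)."""
    err = ap1 - ap1_d
    return -zeta * err * v, -zeta * err * w
```

Comparing `v` and `w` against central finite differences of the simulated AP1 value is a simple way to check that the sensitivity equations were transcribed correctly before using them inside the optimizer.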

4. RESULTS

The population size is fixed at N = 30, and the archive size at M = 10. Figure 2 shows the average value of $err_1$ and $err_2$ for the solutions in the archive. The plots on the left pertain to the simulation in which PSO was run without gradient descent; the plots on the right show the errors when gradient descent was applied to PSO. When gradient descent is applied, the errors decrease more rapidly with increasing iteration.


Figure 2. Convergence plots of log($err_1$) (top) and log($err_2$) (bottom) vs. iteration

Figure 3 shows the final population obtained at the end of 25 iterations when PSO is run without and with gradient descent, with the solutions marked with dots (.) and plus signs (+) respectively. The non-dominated solutions are circled. Again, when gradient descent is applied, the solutions are significantly closer to the origin (0,0).

Figure 3. Solutions at the end of 25 iterations
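For readers unfamiliar with the term, the non-dominated solutions of a two-objective minimization problem can be identified as in the sketch below. It uses plain Pareto dominance for illustration only; it is not the fuzzy dominance measure used by the PSO variant in this study, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated(points):
    """Return the points that are not dominated by any other point (minimization)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For example, among the (err_1, err_2) pairs (1, 4), (2, 2), (4, 1), and (3, 3), only (3, 3) is dominated, because (2, 2) is better in both objectives.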

5. CONCLUSION AND FURTHER RESEARCH

In this study, we have provided a method to apply gradient descent to some parameters of gene differential

equation models. Models of other genetic networks are similar. Therefore, our method can readily be

extended to other models also. Although our results are preliminary, they strongly suggest that a

straightforward application of gradient descent greatly improves the convergence speed of the algorithm.

This observation is consistent with earlier research taken up by others, such as Das et al. (2006) and Koduru

et al. (2007), where other forms of local search were used in place of explicit gradient descent.

Further investigation is necessary to analyze the effectiveness of such a method for larger models. This

research may also be applied to study other flowering time models of Arabidopsis (Wilczek et al. 2009).

ACKNOWLEDGEMENT

This research has been funded by the US National Science Foundation, through Grant No. NSF FIBR

0425759.


REFERENCES

Cai, X., et al. (2007). Discovering Structures in Gene Regulatory Networks Using Genetic Programming and Particle Swarms. Proceedings of the Genetic and Evolutionary Computing Conference, London, UK, pp. 1750.

Clerc, M., and Kennedy, J. (2002). The Particle Swarm - Explosion, Stability, and Convergence in a Multidimensional Complex Space. IEEE Transactions on Evolutionary Computation, Vol. 6, No. 1, pp. 58-73.

Coello, C.A.C., Pulido, G.T., and Lechuga, M.S. (2004). Handling Multiple Objectives with Particle Swarm Optimization. IEEE Transactions on Evolutionary Computation, Vol. 8, No. 3, pp. 256-279.

Das, S., et al. (2006). Adding Local Search to Particle Swarm Optimization. Proceedings, World Congress on Computational Intelligence, Vancouver, BC, Canada, pp. 428-433.

Das, S. (2008). Evolutionary Algorithms with Nelder-Mead Simplex Based Local Search. Encyclopedia of Artificial Intelligence (Eds. J. R. Rabuñal, J. Dorado & A. Pazos), Idea Group Publishing, Vol. 3, pp. 1191-1196.

Das, S., and Panigrahi, B.K. (2008). Multi-Objective Evolutionary Algorithms. Encyclopedia of Artificial Intelligence (Eds. J. R. Rabuñal, J. Dorado & A. Pazos), Idea Group Publishing, Vol. 3, pp. 1145-1151.

Deb, K., et al. (2002). A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, Vol. 6, No. 2, pp. 182-197.

Koduru, P., Welch, S.M., and Das, S. (2007). A Particle Swarm Optimization Approach for Estimating Confidence Regions. Proceedings of the Genetic and Evolutionary Computing Conference, London, UK, pp. 70-77.

Koduru, P., Das, S., and Welch, S.M. (2007). Multi-Objective and Hybrid PSO Using ε-Fuzzy Dominance. Proceedings of the Genetic and Evolutionary Computing Conference, London, UK, pp. 853-860.

Koduru, P., et al. (2008). Multi-Objective Evolutionary-Simplex Hybrid Approach for the Optimization of Differential Equation Models of Gene Networks. IEEE Transactions on Evolutionary Computation, Vol. 12, No. 5, pp. 572-590.

Koduru, P., Das, S., and Welch, S.M. (2009). A Hybrid PSO Algorithm for Single and Multi-objective Optimization. IEEE Transactions on Evolutionary Computation, revised and resubmitted.

Mendel, J.M. (1995). Fuzzy Logic Systems for Engineering: A Tutorial. Proceedings of the IEEE, Vol. 83, No. 3, pp. 345-377.

Welch, S.M., Dong, Z., Roe, J.L., and Das, S. (2005). Flowering Time Control: Gene Network Modeling and the Link to Quantitative Genetics. Australian Journal of Agricultural Research, Vol. 56, pp. 919-936.

Welch, S.M., et al. (2005). Merging Genomic Control Networks with Soil-Plant-Atmosphere-Continuum (SPAC) Models. Agricultural Systems, Vol. 86, pp. 243-274.

Welch, S.M., Roe, J.L., and Dong, Z. (2003). A Genetic Neural Network Model of Flowering Time Control in Arabidopsis thaliana. Agronomy Journal, Vol. 95, pp. 71-81.

Wilczek, A.M., et al. (2009). Effects of Genetic Perturbation on Seasonal Life History Plasticity. Science, doi: 10.1126/science.1165826.


TREE-STRUCTURE-AWARE GENETIC OPERATORS IN

GENETIC PROGRAMMING

Kisung Seo, Chulhyuk Pang

Electronic Engineering, Seokyeong University

Seoul, Korea

ABSTRACT

In this paper, we suggest tree-structure-aware GP operators that heed tree distributions in structure space and their

possible structural difficulties. The main idea of the proposed GP operators is to place the generated offspring of

crossover and/or mutation in a specified region of tree structure space insofar as possible, taking into account the

observation that most solutions are found in that region. To enable that, the proposed operators are designed to utilize

information about the region to which the parents belong and node/depth statistics of the subtree selected for

modification. To demonstrate the effectiveness of the proposed approach, experiments on the binomial-3 regression and

even parity problems are performed. The results show that the proposed tree-structure-aware operators

outperform standard GP on both test problems in both success rate and number of evaluations.

KEYWORDS

Genetic Programming, Tree-structure-aware GP Operators

1. INTRODUCTION

The tree representation of GP chromosomes, as compared with the string representation typically used in GA,

gives GP more flexibility to encode solution representations for many real-world design and optimization

applications [Koza 1992, 1994]. Due to a characteristic of tree representations, some problems are commonly

experienced, such as code bloat [Luke, McPhee, Silva], destructive crossover [Ito, Majeed, Poli 1999, 2000],

and structural difficulties [Daida 2001, 2003a, 2003b, Hoai].

Daida et al. [Daida 2001, 2003a, 2003b] investigate regions of a tree structure’s space by a distribution of

tree shapes, and show that tree structure can have a substantial impact on determining problem difficulty in

standard GP. Their work has provided experimental support for the hypothesis that the iterative random

growth of information structures can be a significant and limiting factor that adversely affects solution

difficulty. One of the most important observations obtained from the above research is that most solutions in

standard GP are found in a specific region of the space of tree structures [Daida 2001, 2003a, 2003b].

We suggest new recombination operators using the above idea of tree structure space. For crossover, a random subtree is selected from parent 1 and node/depth statistics of the subtree are calculated. According to these statistics, a subtree of parent 2 is selected so that the child lies within a specified tree-structure region, known as region I (defined in the next section). Only one offspring is obtained from two parents, because the choice of a subtree to swap, based on statistics of the subtree it replaces, is made for only one parent. Therefore we perform two crossover operations in a row to generate two offspring. In the case of mutation, the growth method (grow, full, or half & half) used to replace a parent subtree is determined based on the current shape of the parent tree.

In this paper, we propose new tree-shape-aware genetic operators for GP. To demonstrate the

effectiveness of our proposed approach, experiments on binomial-3 regression and even parity problem are

performed. Section 2 discusses the previous work and the space of tree structures. Section 3 describes the

proposed structure-aware GP operators. Section 4 presents experimental results for two test problems, and

Section 5 concludes the paper.


2. STRUCTURAL DIFFICULTY AND GP OPERATORS

2.1 Structural Difficulty in GP

Daida et al. [Daida 2003b] showed that structure alone can pose great difficulty for a standard GP search

(using an expression tree representation and subtree swapping crossover). In particular, they classified four

regions of the search space of tree structures, as shown in Fig. 1.

The regions of tree structures are summarized as follows, classified according to number of nodes and

tree depth. There exist at least four distinct regions for depths 0 – 26. Region I contains most solutions in

standard GP. Far fewer individuals than in Region I appear in Regions II (IIa and IIb). Only partial mixing of

size / shape subtrees occurs here, with mixing becoming non-existent towards the boundaries furthest away

from Region I. Region III (IIIa and IIIb) is a place where even fewer individuals typically appear. Regions

IVa and IVb are regions that are structurally precluded from binary trees.

In [Hoai], extensive work on a new tree-based representation and simple local operators for GP has been

done. They have argued that GP’s problem of structural difficulty results from the lack of local structure-

editing operators due to its fixed-arity representation. Applying a TAG (tree adjoining grammars)-based

representation and local structure-editing operators to Daida’s LID problem, they demonstrated that the

operators significantly ease the structural difficulty problem in GP.

Although those new findings broaden the insight into structural difficulty in GP, they do not relate

directly to how we apply the phenomenon to improve the search efficiency of GP. Their relaxation of the

fixed-arity approach cannot be applied to GP problems in general, because in many cases, the necessary arity

is defined according to problem-specific characteristics.

In this paper, we focus on how to apply these observations, which have been well studied theoretically, to enhance the current standard GP operators. Based on tree structure characteristics, region-based GP operators are proposed. Because most solutions of test problems [Daida 2003a] found by standard GP search occur in region I, it is natural to say that it is easy for GP to find solutions to problems which exist in region I. Furthermore, biasing the operators to put generated offspring into region I seems to be a plausible strategy to improve search capability in standard tree-based GP.

Figure 1. Node and depth regions - Reprinted with permission from [Daida

2.2 Problems of GP Operators

The recombination operator in GP is regarded as a main driving force for the success of a GP run. Many variants of the recombination operator have been introduced and used, but the most commonly used one is the one point crossover operator [Koza 1992], due to its simplicity and ease of implementation. Many researchers have tried to improve its performance. One approach is context preservation [Majeed], in which the order and number of the parent nodes of a swapped subtree in its container parent tree are preserved to the best possible extent in the other parent. A recently introduced operator, context-aware crossover [Majeed], which implicitly discovers the best possible crossover site for a subtree, has been shown to consistently attain higher fitness while processing fewer individuals. GP uniform crossover has also been suggested, inspired by the GA uniform crossover concept [Poli 1999, 2000]. In addition, a depth-dependent crossover for GP has been proposed [Ito], in which the depth selection ratio is varied according to the depth of a node, i.e. shallow nodes are more often chosen as the crossover points.

The various approaches mentioned above have contributed to analysis of existing problems with GP

operators and development of new GP operators with performance improvements. These approaches are

mainly to minimize destructive effects of standard crossover, or preserve the position of genetic material in

the genotype, and/or to accumulate building blocks. Although it is very important to preserve or minimize

destruction of genetic material, there is no known implicit direction to favor. In other words, there is no

information about which building blocks are good to preserve, because a subtree cannot be evaluated alone.

On the other hand, concrete guidance is provided as an implicit evolution direction in our tree-structure-

aware genetic operators for GP. Assuming that solutions in Region I are more likely to be helpful to search,

guidance on the direction that GP operators should pursue is available.

3. TREE-STRUCTURE-AWARE OPERATORS

The main idea of the proposed tree-structure-aware GP operators is to place the offspring resulting from crossover and/or mutation into region I as much as possible, using the observed tree structure as guidance in specifying the shape of the genetic material to be introduced [Daida 2003a, 2003b].

3.1 Tree-structure-aware Crossover

The tree-structure-aware crossover algorithm is described in Figures 2 and 3. Two parents are selected by

roulette wheel selection and the region of each parent is examined. A subtree of parent1 is chosen at random,

and a subtree of parent2 is chosen for substitution according to the node/depth ratio of the subtree of parent1

and the regions of each parent.

For example, if the region of parent1 is lower than the region of parent2, a subtree of parent2 is chosen

among possible candidates which have a smaller node/depth ratio than that of the subtree of parent1. This

operation generates a child which has high probability of being in region I. In the other case, a subtree of

parent2 is chosen among candidates which have larger node/depth ratios than that of the subtree of parent1.

Unlike standard crossover, one offspring is obtained from crossover of two parents, because it is difficult to control the shape of both offspring by the subtree swapping operation in the proposed crossover. Therefore, offspring from the proposed crossover are generated one at a time.

Figure 2. Pseudo code of crossover operation

Select two parents using roulette wheel selection
Examine the region to which each parent belongs
Choose a subtree at random in parent1
Calculate node/depth ratio for the subtree of parent1
If (region of parent1 is lower than region of parent2)
    Choose a subtree of parent2 so that node/depth
    ratio is less than that of parent1's subtree
else if (region of parent1 is higher than region of parent2)
    Choose a subtree of parent2 so that node/depth
    ratio is greater than that of parent1's subtree
else
    Choose random subtree in parent2
Iterate the above process to generate required number of
crossover offspring

Figure 3. Tree-structure-aware crossover. Panel 1) Region of parent1 is lower than region of parent2 (Tree1 in region IIb, Tree2 in region I): choose a subtree of parent2 so that its node/depth ratio is less than that of the subtree of parent1. Panel 2) Region of parent1 is higher than region of parent2 (Tree1 in region IIIa, Tree2 in region IIb): choose a subtree of parent2 so that its node/depth ratio is greater than that of the subtree of parent1.
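The subtree choice in the crossover can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: trees are nested tuples `(operator, child, ...)` with strings as terminals, a single terminal counts as depth 1, the regions of the two parents are passed in as comparable integer ranks (`rank1`, `rank2`) computed elsewhere, and a random subtree is used as a fallback when no candidate satisfies the ratio condition.

```python
import random


def size(tree):
    # Number of nodes in the tree.
    return 1 if not isinstance(tree, tuple) else 1 + sum(size(c) for c in tree[1:])


def depth(tree):
    # A single terminal has depth 1 here so that the ratio is always defined (assumption).
    return 1 if not isinstance(tree, tuple) else 1 + max(depth(c) for c in tree[1:])


def subtrees(tree):
    # Yield the tree itself and every subtree below it.
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)


def ratio(tree):
    return size(tree) / depth(tree)


def choose_parent2_subtree(parent2, subtree1, rank1, rank2, rng=random):
    """Pick the parent2 subtree according to the region comparison of the parents."""
    target = ratio(subtree1)
    candidates = list(subtrees(parent2))
    if rank1 < rank2:        # parent1 lies in a lower region than parent2
        filtered = [s for s in candidates if ratio(s) < target]
    elif rank1 > rank2:      # parent1 lies in a higher region than parent2
        filtered = [s for s in candidates if ratio(s) > target]
    else:
        filtered = candidates
    # Fallback to any subtree if the condition cannot be met (assumption).
    return rng.choice(filtered or candidates)
```

The chosen subtree would then replace the randomly selected subtree of parent1 to form the single offspring of the operation.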

3.2 Tree-structure-aware Mutation

The major feature of the tree-structure-aware mutation is that it controls the replacement subtree through the choice of generation method (grow, full, or half & half), considering the current shape of the parent tree. The tree-structure-aware mutation algorithm is described in detail in Figures 4 and 5.

A parent is selected by roulette wheel selection and the shape of the parent is examined as in crossover. The number of nodes on each branch of the root node of the parent is calculated. If the region of the parent is higher than region I, for example, a random node from the larger branch is chosen in order to balance the tree structure and move the offspring toward region I. Then mutation with "grow" generation is executed, which tends to make the subtree smaller.
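The decision logic of the mutation operator, as given in the Figure 4 pseudo code, can be written compactly as in the sketch below. The function name, the integer encoding of the region comparison (`region_cmp`), and the representation of the root branches as a list of node counts are assumptions made for illustration; how the region of a tree is determined is outside the scope of this sketch.

```python
def mutation_plan(region_cmp, branch_sizes):
    """Return (branch index to mutate in, generation method).

    region_cmp > 0: parent is higher than region I
    region_cmp < 0: parent is lower than region I
    region_cmp == 0: parent is inside region I (any node may be chosen)
    """
    indices = range(len(branch_sizes))
    if region_cmp > 0:
        # Shrink the larger branch with "grow" generation.
        return max(indices, key=branch_sizes.__getitem__), "grow"
    if region_cmp < 0:
        # Enlarge the smaller branch with "full" generation.
        return min(indices, key=branch_sizes.__getitem__), "full"
    return None, "half_and_half"
```

The mutation point would then be drawn at random from the nodes of the returned branch, or from the whole tree when the index is `None`.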


Figure 5. Tree-structure-aware mutation

4. EXPERIMENTAL SETUP

The tree-structure-aware GP operators have been applied to two standard benchmark problems, binomial-3 regression and even parity, and compared with standard GP. In all experiments, 20 runs were executed, and the number of evaluations, the hit rate, and the success rate of solutions found are reported. We used lil-GP [Zongker] for the standard GP runs, and modified it heavily to implement the new crossover and mutation operators. These experiments were performed on a single Core 2 Duo 2.13GHz PC with 2GB RAM. The GP parameters were as shown below.

Number of generations : 500 for binomial-3,
                        2000 to 10000 for multiplexor,
                        and 100 to 7000 for even-parity
Population sizes      : 500
Initial population    : half_and_half
Initial depth         : 2-6
Max depth             : 17 (25 for 11-multiplexor)
Selection             : tournament (size=7) for standard,
                        roulette wheel for proposed
Crossover             : 0.9
Mutation              : 0.1

Figure 4. Pseudo code for mutation operation

Select a parent using roulette wheel selection
Determine the region to which the parent belongs
Calculate the number of nodes on each branch of the root node
of the parent
If (region of parent is higher than region I)
    Choose a node from the larger branch and
    mutate with "grow" generation
else if (region of parent is lower than region I)
    Choose a node from the smaller branch and
    mutate with "full" generation
else
    Choose a random node and
    mutate with "half_and_half" generation

4.1 Binomial-3 Regression

The first experiment was on the binomial-3 problem, which was proposed as a tunably difficult problem by

Daida, et al. in [2]. The binomial-3 is a symbolic regression problem and involves seeking a function

expressible as:

$$f(x) = x^3 + 3x^2 + 3x + 1 \qquad (1)$$

The binomial-3 problem was defined with fitness cases at 50 equidistant points generated from equation (1) over the interval [-1,0). The raw fitness is the sum of the absolute errors. A hit is defined as being within 0.01 in ordinate of a fitness case, for a total of 50 possible hits. The stop criterion is finding an individual in the population that scores 50 hits.
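The fitness evaluation described above can be sketched as follows. The exact placement of the 50 equidistant points (starting at -1 with step 0.02) and the use of a non-strict comparison for the 0.01 tolerance are assumptions of this example; the function names are hypothetical and do not come from lil-GP.

```python
def target(x):
    # Equation (1): the binomial-3 target function, equal to (x + 1)^3.
    return x ** 3 + 3 * x ** 2 + 3 * x + 1


def fitness_cases(n=50, lo=-1.0, hi=0.0):
    # n equidistant points on the half-open interval [lo, hi).
    step = (hi - lo) / n
    return [(lo + i * step, target(lo + i * step)) for i in range(n)]


def evaluate(candidate, cases, tol=0.01):
    """Return (raw fitness, hits) for a candidate function of one variable."""
    errors = [abs(candidate(x) - y) for x, y in cases]
    raw_fitness = sum(errors)                 # sum of absolute errors
    hits = sum(e <= tol for e in errors)      # cases matched within the tolerance
    return raw_fitness, hits
```

A run would stop as soon as some individual's `hits` equals the number of fitness cases.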

The function set is defined as {+, -, ×, ÷}, and the terminal set is defined as {x, α}, where x is the symbolic variable and α denotes the ephemeral random constants (ERCs).

The ephemeral random constants are uniformly distributed over an interval of the form [-α, α], where α is a real number that defines the range for ERCs. Six values for α were used, namely {0, 1, 2, 3, 10, 100}.

Table 1. Results of Binomial-3 Regression Problem

The experimental results for the binomial-3 problem are summarized in Table 1. The proposed crossover & mutation produced a higher hit rate than standard crossover & mutation for every ERC range, with comparable success rates. Because the binomial-3 problem is a relatively easy problem, the differences in hit rate are slight. Larger differences appear in the number of evaluations, which is one of the most important indexes: in Table 1 the proposed method needs roughly 40% to 77% fewer evaluations than standard GP for four of the five ERC ranges. The exceptions are the [-2,2] range, where it needs more, and the very easy case of no ERCs, where the two are nearly equal.

Binomial-3 regression

Operators                        ERC range     Hit rate   No. of Evals   Success rate
standard crossover & mutation    noERC         50.00      4,850.0        100%
                                 [-1,1]        47.75      149,676.5      85%
                                 [-2,2]        48.25      23,794.1       85%
                                 [-3,3]        49.10      116,740.6      85%
                                 [-10,10]      48.00      109,411.8      85%
                                 [-100,100]    48.55      143,000.0      95%
proposed crossover & mutation    noERC         50.00      4,552.5        100%
                                 [-1,1]        49.60      70,368.4       95%
                                 [-2,2]        49.45      37,629.4       85%
                                 [-3,3]        49.25      26,968.8       85%
                                 [-10,10]      49.75      65,975.0       90%
                                 [-100,100]    49.45      54,182.4       85%


4.2 Even-parity

The second experiment was on the even-n-parity problem. Parity functions have been recognized as difficult for genetic programming to induce if no bias favorable to their induction is introduced in the function set, the input representation, or any other part of the algorithm [Page]. The function set is defined as {and, or, nand, nor} and the terminal set is defined as {d0, d1, d2,…, dn}, where each element of the terminal set is a data bit.
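For concreteness, the fitness cases of the even-n-parity problem and the hit count of a candidate program can be generated as in the sketch below. The function names are hypothetical, and the candidate is assumed to be a callable taking one argument per data bit.

```python
from itertools import product


def even_parity_cases(n):
    # One case per bit pattern; the output is True when the number of 1-bits is even.
    return [(bits, sum(bits) % 2 == 0) for bits in product((0, 1), repeat=n)]


def hits(candidate, n):
    """Count the fitness cases on which the candidate returns the correct output."""
    return sum(bool(candidate(*bits)) == out for bits, out in even_parity_cases(n))
```

With n bits there are 2^n fitness cases, which matches the maximum hit counts of 8, 16, and 32 listed for the 3-, 4-, and 5-bit problems.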

Table 2 summarizes the experimental results for the even-3, even-4, and even-5 parity problems. For the even-3 parity problem, the results of the proposed method are better than those of standard GP in terms of number of evaluations. For the even-4 parity problem, the success rate of the proposed method (95%) is better than standard GP's (75%), and the number of evaluations of the proposed crossover & mutation is only 12% of standard GP's. For the even-5 parity problem, the proposed structure-aware operators reach a success rate of 75%, whereas standard GP never found a solution.

Table 2. Results of Even Parity Problem

3-, 4-, and 5-even parity

Operators                        Bits     Hit rate          No. of Evals    Success rate
standard crossover & mutation    3bits    8.00 (max8)       35,815.8        100%
                                 4bits    15.55 (max16)     1,987,333.3     75%
                                 5bits    20.45 (max32)     NA              0%
proposed crossover & mutation    3bits    8                 15,266.7        100%
                                 4bits    15.90             237,631.6       95%
                                 5bits    31.35             2,924,100.0     75%

5. CONCLUSION

We have suggested new recombination operators based on tree distributions in structure space and structural difficulties. The main idea of the proposed tree-structure-aware GP operators is to generate offspring via crossover and mutation that have tree structures residing within region I in the Daida et al. [3] characterization, by biasing the tree structures of the altered subtrees.

To demonstrate the effectiveness of our proposed approach, experiments on the binomial-3 regression and even parity problems were performed. The experimental results showed that the proposed tree-structure-aware operators were superior to standard GP on both test problems in both success rate and number of evaluations.

By using the regions in the space of tree structures identified by Daida et al., our proposed tree-structure-aware operators can enhance search capability over the randomly generated tree structures exhibited by standard GP.

Further study will aim at analysis, extension and refinement of the tree-structure-aware GP operators to

validate their effectiveness more theoretically and to apply them to more complex and practical real-world

problems.

ACKNOWLEDGMENTS

This work was supported by the Korea Research Foundation Grant funded by the Korea government (MOEHRD, Basic Research Promotion Fund) (KRF-2007-314-D00176).


REFERENCES

Daida J. M. et al, 2001. "What Makes a Problem GP Hard? Analysis of a Tunably Difficult Problem in Genetic Programming," Genetic Programming and Evolvable Machines, 2(2), pp.165-191.

Daida J. M. and Hilss A. M., 2003. "Identifying Structural Mechanisms in Standard Genetic Programming," Proceedings of the Genetic and Evolutionary Computation Conference (GECCO2003), LNCS 2724, Chicago, IL, USA, pp.1639-1651.

Daida J. M. and Hilss A. M., 2003. "What Makes a Problem GP Hard? Validating a Hypothesis of Structural Causes," Proceedings of the Genetic and Evolutionary Computation Conference (GECCO2003), LNCS 2724, Chicago, IL, USA, pp.1665-1677.

Ito T. et al, 1998. "Depth Dependent Crossover for Genetic Programming," Proceedings of the IEEE World Congress on Computational Intelligence, Anchorage, AK, USA, pp.775-780.

Koza J. R., 1992. Genetic Programming: On the Programming of Computers by Natural Selection, MIT Press, Cambridge, MA, USA.

Koza J. R., 1994. Genetic Programming II: Automatic Discovery of Reusable Programs, MIT Press, Cambridge, MA, USA.

Luke S., 2000. Issues in Scaling Genetic Programming: Breeding Strategies, Tree Generation and Code Bloat, PhD thesis, University of Maryland.

Majeed H. and Ryan C., 2007. "On the Constructiveness of Context Aware Crossover," Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'07), London, England, United Kingdom, pp.1659-1666.

McPhee N. F. et al, 2004. "On the Strength of Size Limits in Linear Genetic Programming," Proceedings of the Genetic and Evolutionary Computation Conference (GECCO2004), LNCS 3103, Seattle, WA, USA, pp.593-604.

Hoai N. X. et al, 2006. "Representation and Structural Difficulty in Genetic Programming," IEEE Transactions on Evolutionary Computation, 10(2), pp.157-166.

Page J. et al, 1999. "Smooth Uniform Crossover with Smooth Point Mutation in Genetic Programming: A Preliminary Study," Proceedings of EuroGP'99, LNCS 1598, Göteborg, Sweden, pp.39-48.

Poli R. and Page J., 2000. "Solving High Order Boolean Parity Problems with Smooth Uniform Crossover, Sub Machine Code GP and Demes," Genetic Programming and Evolvable Machines, 1(1/2), pp.37-56.

Silva S. and Costa E., 2005. "Resource Limited Genetic Programming: The Dynamic Approach," Proceedings of the Genetic and Evolutionary Computation Conference (GECCO'05), Washington, DC, USA, pp.1673-1680.

Zongker D. and Punch B., 1995. Lil-GP User's Manual, Michigan State University.


CONTENT AND COMMUNICATION BASED SUB-

COMMUNITY DETECTION USING PROBABILISTIC

TOPIC MODELS

Alexandru Berlea¹, Markus Döhring, Nicolai Reuschling

SAP Research

Bleichstr. 8, 64283 Darmstadt, Germany

ABSTRACT

Sub-community detection is a fundamental task in social network analysis and becomes increasingly interesting in

business applications related to supporting collaboration platforms on the Internet and mining the content generated on

them. We present a set of methods for sub-community detection that leverage probabilistic topic models. The methods

are based on similarities among community members arising from their communication links, their topics of interest, or

on both aspects. We thereby identify suitable scenarios for the application of the proposed approaches. Preliminary

experimental results indicate our hybrid approach as a promising candidate for the analysis of large forum communities.

KEYWORDS

Community Mining, Topic Detection, Probabilistic Models, Social Network Analysis

1. INTRODUCTION

Given a set of interacting entities, sub-community detection is defined as the task of identifying subsets of entities characterized by common properties. This definition is quite broad: it leaves considerable room for interpretation and requires some refinement in order to specify the scope of this paper.

Firstly, depending on the nature of the entities and their interaction, sub-community detection is a matter

of interest for different disciplines such as social network analysis or biology. Our focus will be on Internet

communities. Supported by the increase in usability provided by Internet applications designed for

collaboration (such as blogs, forums, wikis), Internet communities and their user numbers have proliferated in recent years, making them increasingly interesting and relevant for business purposes. For example, recommendation systems, which have been shown to have a significant impact on sales figures, may be improved if sub-communities of people sharing the same interests or tastes are detected, so that offers can be tailored to the customers accordingly. Additionally, companies are becoming increasingly interested in having Internet

communities around their products; identifying these communities or supporting them may both benefit from

sub-community detection.

Secondly, our definition of sub-community detection does not specify which common properties are

supposed to be shared by the members of the sub-communities. As a rule of thumb, these properties should

provide a good clustering of the community into sub-communities, i.e. minimize the differences inside a sub-

community and maximize the differences between sub-communities. In this paper we will restrict ourselves

to properties which are derivable from information in Internet communities that is always openly available to

anyone: the communication structure (who is talking to whom) and the communication content (who is

talking about what).

Corresponding to the large interest in sub-community detection, there is a large body of recent research

literature in which various approaches to sub-community detection have been reported, many of which are

tailored to Internet communities. Related work is reviewed in Section 2. Our contribution is to demonstrate

how existing traditional methods for analyzing communication structures can be combined with new state-of-the-art

methods for automatic content analysis in order to perform sub-community detection. In particular, we

show how different assumptions about how sub-communities are defined can be accommodated by varying the

methods used and their combinations. One distinguishing feature of our approaches is the focus on

Internet forums, in which communication links among participants are explicitly present. We present results

and experiences obtained using a very large Internet community and also address how they can be visualized.

1 Corresponding author: alexandru.berlea@sap.com

IADIS International Conference Intelligent Systems and Agents 2009

The ingredients underlying our approaches are introduced in Section 3. Subsequently, in Section 4 we

show how these can be used to devise different types of approaches to sub-community detection: purely

communication based (Section 4.1), purely content based (Section 4.2) and combined approaches (Section

4.3). Along the way we point out how the different design decisions lead to different flavors of sub-community

detection. As an example, we apply our combined approach to a large community and present the results in

Section 5.

2. RELATED WORK

Most approaches to sub-community detection consider either only the communication structure or only the

communication content. As opposed to this, the approaches that we will present consider both aspects,

similarly to related work in (Viermetz 2008), (Gloor and Zhao 2006), (Tuulos and Tirri 2004), (Dietz 2006)

and (Zhou et al. 2006). More precisely, our work follows the line of research of the last three, in which the

communication content is analyzed using probabilistic models. Probabilistic models have recently been shown to

be a powerful method to automatically detect topics in collections of documents. They are especially suitable

for content generated in Internet communities due to the large numbers of documents and authors. (Dietz

2006) addresses communities of researchers as derivable from their publications. The assumption thereby is

that the strength of a tie between two sub-community members is given by the similarity of the topics

addressed by them. Strictly speaking, there are no explicit communication structures available. The detected

sub-communities are thus in fact communities of topics. (Zhou et al. 2006) address communities as derivable

from e-mail exchanges and present two approaches basically assuming either that a sub-community is a set of

users that communicate frequently or, respectively, a set of users that share common topics. While (Tuulos

and Tirri 2004) address both content and structure analysis, their focus is not on sub-community detection.

Instead, the authors, targeting chat data mining, address how topic detection accuracy can be improved by using

the communication structure to better discriminate between useful content and noise.

Sub-community detection can be cast as a clustering problem. Given a set of entities, clustering is the

task of (automatically and without supervision) partitioning this set into sub-sets, or clusters, such that the

similarities are maximized intra-cluster and minimized inter-cluster. By this definition sub-community detection can be

seen as an instance of clustering in which the considered entities are community users whereas each cluster

corresponds to a sub-community. More formally, a clustering (sub-community detection method) of E

entities (users) into C clusters is specified by an $E \times C$ matrix. The element $m_{i,j}$ of the matrix has a value

in the interval [0,1] denoting the grade of membership of entity (user) i to cluster (sub-community) j. The particular

case in which the grade of membership for each entity i is 1 for exactly one cluster and 0 for all others is

referred to as hard clustering, as opposed to the general, fuzzy clustering. Fuzzy sub-community detection

methods are in general more expressive as they allow assigning a user with different confidence levels to

different sub-communities and allow a natural covering of real-life scenarios in which users are or feel

simultaneously part of different sub-communities.
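
The relation between fuzzy and hard clustering can be illustrated with a minimal sketch. The matrix values and the helper name `harden` below are illustrative assumptions, not data or code from our experiments; each row is one user and each column one sub-community.

```python
def harden(membership):
    """Turn a fuzzy E x C membership matrix into a hard one.

    Each row (entity) gets grade 1 for its highest-graded cluster
    and grade 0 for all other clusters.
    """
    hard = []
    for row in membership:
        best = max(range(len(row)), key=lambda j: row[j])
        hard.append([1 if j == best else 0 for j in range(len(row))])
    return hard


# Hypothetical fuzzy clustering of E = 3 users into C = 3 sub-communities.
fuzzy = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.4, 0.5, 0.1],
]
hard = harden(fuzzy)
```

Note that the third user, who is almost equally attached to the first two sub-communities, loses this information in the hard clustering.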

Typically, the entities to be clustered can be represented as feature vectors, i.e. points in a multi-

dimensional space. A variety of clustering techniques exist arising from different choices of entities’ features,

similarity measures among them and grouping methods (Jain, Murthy & Flynn 1999).

In some cases, such as Internet communities, the set of interacting entities is naturally

represented as the nodes of a graph, in which an arc (implicitly) denotes the similarity between the entities it

connects. Clustering the nodes of a graph is known as graph clustering (Schaeffer 2007); the clusters

then become sub-graphs. A range of techniques is based on density properties; they try to maximize the internal

coherence of sub-graphs by identifying maximal sub-graphs that have a density above a certain threshold.

Cut-based approaches try to maximize the independence of sub-graphs, where independence is defined in

terms of the cut size needed to isolate the sub-graphs. Another proposed approach is based on iteratively

removing arcs with the highest betweenness (the number of shortest paths passing through the arc), based on

ISBN: 978-972-8924-87-4 © 2009 IADIS


the assumption that these arcs are links between clusters rather than within a cluster. For this particular work,

we are interested in clustering methods which scale to the large number of entities typically available in

Internet communities. The problem of computing optimal graph clusters is in general NP-complete, so exact

algorithms are not applicable to large graphs. In practice, however, many algorithms have been proposed which are

able to find reasonably good partitions efficiently (Schaeffer 2007).
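
The two quantities that density-based and cut-based techniques optimize can be made concrete with a small sketch. The functions `density` and `cut_size` and the toy edge list are illustrative assumptions for an undirected, unweighted graph; they are not part of any of the cited tools.

```python
def density(nodes, edges):
    """Fraction of all possible undirected edges that exist among `nodes`."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    internal = sum(1 for u, v in edges if u in nodes and v in nodes)
    return internal / (n * (n - 1) / 2)


def cut_size(nodes, edges):
    """Number of edges with exactly one endpoint inside `nodes`."""
    nodes = set(nodes)
    return sum(1 for u, v in edges if (u in nodes) != (v in nodes))


# Hypothetical graph: two triangles joined by the single edge (c, d).
edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]
triangle_density = density({"a", "b", "c"}, edges)
triangle_cut = cut_size({"a", "b", "c"}, edges)
```

In this toy graph the sub-graph {a, b, c} is both maximally dense and cheap to isolate, so density-based and cut-based methods would agree on it; on real community graphs the two criteria can lead to different clusters.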

In particular, a cut-based approach suitable for very large graphs is implemented in Cluto (Karypis 2003),

a state-of-the-art clustering tool. In order to be able to deal with very large input graphs, Cluto’s algorithm

(Karypis & Kumar 1999) reduces the original input graph by first collapsing nodes and edges, then

partitioning the reduced graph, and finally projecting the obtained partition back to the original graph. At the

graph level, this method, being a cut-based one, tends to find clusters such that the number of inter-cluster

edges is minimized. The interpretation in terms of sub-community detection is straightforwardly obtained by

taking each cluster to represent a sub-community: sub-communities are detected so as to minimize the amount

of information exchanged across sub-communities. By optimizing this global property of a community, this

method is meaningful in scenarios in which the community is regarded as a whole (e.g. for its visualization),

but may be less suited to explaining (locally) which bonds hold a sub-community together, or why

a person belongs to a sub-community.

3. PRELIMINARIES

In this section we briefly address the fundamental techniques underlying the sub-community detection

methods that will be introduced in the next section. Rather than giving a formally precise introduction to these

techniques, our aim is to provide the intuitive understanding needed for a self-contained

presentation and to introduce the terminology and notations used in the remainder of the paper.

3.1 Community Data

While the approaches that we will introduce are in principle applicable to arbitrary online communities, for

the sake of the presentation and evaluation we will refer to the particular case of online forums. It is

furthermore convenient to introduce here our practical experimental data. This will allow us to refer to a

concrete use case in examples used throughout the paper.

We utilized forum data publicly available on the SAP Developer Network (SDN). The SDN forums

contain a broad variety of discussion topics related to the SAP software landscape. In some cases, SDN users

may explicitly link to each other in their profiles. However, this happens quite rarely and we do not use any

such information. Instead we simply assume that sub-community structures are latent in the communication

structure and the exchanged content. We restricted ourselves to the forum group focusing on “Application

Server” issues. It offers a mixture of business and technical content that is typical for the overall SDN, while

providing a moderate diversity of distinct sub-areas within the forum. It contains 61,781 threads

totaling 272,582 posts by 23,545 users.

3.2 Probabilistic Topic Models

Probabilistic topic models (PTMs) lie at the basis of some recent promising approaches to automatic topic

detection. A PTM offers a generative explanation of a document collection in which topics are explicitly

modeled. More precisely, the model is specified by a fixed number of topics as probability distributions over

words and a probability distribution over topics associated with each document. Each token of a document is

assumed to be generated in turn, by first sampling a topic from the topic distribution associated with the

document and then sampling a word from the probability distribution denoted by the topic.
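
The generative process just described can be sketched in a few lines. The two toy topics, the document-level topic distribution `theta_d`, and the function name `generate_document` are assumptions made purely for illustration; they are not topics detected in our data.

```python
import random


def generate_document(theta_d, topics, length, rng):
    """Generate `length` tokens following the PTM generative story.

    theta_d: topic probabilities of the document (one weight per topic).
    topics:  list of dicts, each mapping a word to its probability.
    """
    topic_ids = list(range(len(theta_d)))
    tokens = []
    for _ in range(length):
        # Step 1: sample a topic from the document's topic distribution.
        z = rng.choices(topic_ids, weights=theta_d, k=1)[0]
        # Step 2: sample a word from the word distribution of that topic.
        words = list(topics[z].keys())
        probs = list(topics[z].values())
        tokens.append(rng.choices(words, weights=probs, k=1)[0])
    return tokens


# Two hypothetical topics (word distributions) and one document mixture.
topics = [
    {"user": 0.5, "password": 0.3, "login": 0.2},
    {"database": 0.5, "sql": 0.3, "table": 0.2},
]
theta_d = [0.9, 0.1]
doc = generate_document(theta_d, topics, 20, random.Random(0))
```

Topic detection runs this story in reverse: given only the generated tokens of many documents, inference tries to recover plausible values for the topics and for each document's topic distribution.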

Now, if we are able to find the model that best explains the document collection at hand, the topics of

each document d can be looked up as the most probable topics in $\theta_d$, the probability distribution over topics

associated with d. Finding this model is an inference problem which is generally not exactly solvable.

Instead, one tries to approximate the optimal solution. The various topic detection methods that have been

proposed differ in the inference methods used as well as in additional assumptions they make regarding the

underlying model. In particular, for our purposes we use Latent Dirichlet Allocation (Blei, Ng & Jordan 2003)

as our PTM and Gibbs sampling as the inference method (Steyvers & Griffiths 2007), which have

been reported to generally deliver good results. Intuitively, the detected topics can be thought of as patterns

of co-occurrences of words in the document collection. One advantage of this topic detection is that it is

robust to ambiguous words, as the co-occurring words in the context typically determine the right topic

assignment.

For our purposes we run the topic detection as mentioned above on a corpus built from the experimental

data previously introduced, by aggregating all posts to a thread into one document. We fixed the number of

topics to 75; selecting the right number of topics is in general driven by the sensible granularity for the

application domain at hand: not too small a number in order not to get too general topics and not too large a

number in order not to get too specific ones. Some of the topics detected are depicted in Table 1 by their

most representative (likely) words in decreasing order. By looking at the words we can identify these topics

as dealing with Security, Web Services, Web Dynpro, Databases and Web Servers, respectively.

Table 1. Topics detected in our experimental data

Topic 3     user password login logon role id log authentication portal sso

Topic 16    service web ejb webservice bean proxy wsdl model client method

Topic 35    web dynpro abap webdynpro Java wd wda tutorial ui component

Topic 47    database connection sql datasource datum jdbc db table driver oracle

Topic 68    server http url service web port error domain browser host

3.3 Centrality Metrics for SNA

Detecting persons with special roles in social networks is often based on measuring their centrality in the

network as follows. Local degree centrality, defined as the number of edges incident to a node, may be used

to measure how intensively the node communicates. Closeness centrality, sometimes also termed global

centrality of a node, is measured as the sum of distances from the node to all other nodes and may tell for a

given node how well connected it is to all other reachable nodes. Betweenness centrality is defined as the

number of shortest paths between any two nodes of a network that run through a given node. The calculation

and exploitation of this metric may give clues about how important the node is for connecting subnets within

a community. Eigenvector centrality is used to determine the general importance of a node within a network.

This is done not only by counting connections, but also by overweighting connections to nodes which are

themselves more central than other nodes. Diameter metrics may allow conclusions based on the maximum

number of steps that have to be taken to get from one node to another within a community network. In other

words, diameter is defined as the longest shortest path within a network. This metric can be an indicator for

how fast information can be passed through a network or whether propagation takes place rather extensively

(indicated by a small diameter) or not.
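
Three of these metrics (degree, closeness as a sum of distances, and diameter) can be sketched with plain breadth-first search. The adjacency-set representation, the function names, and the four-node path graph are illustrative assumptions; the sketch covers unweighted, undirected graphs and only considers reachable nodes.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from `source` to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree(adj, node):
    """Local degree centrality: number of edges incident to `node`."""
    return len(adj[node])


def closeness_sum(adj, node):
    """Closeness as defined in the text: sum of distances to reachable nodes."""
    return sum(bfs_distances(adj, node).values())


def diameter(adj):
    """Longest shortest path between any two mutually reachable nodes."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)


# Hypothetical communication graph: the path a - b - c - d.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
}
```

With closeness measured as a sum of distances, smaller values indicate better connected nodes: in the path graph the inner node b obtains a smaller sum than the end node a.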

4. PROBABILISTIC TOPIC MODELS FOR COMMUNITY DETECTION

In this section we introduce three approaches to sub-community detection based on probabilistic topic

models: one purely communication based, one purely content based and one that combines communication

and content.

4.1 A Communication Based Approach

Assuming that users that tend to co-occur in discussion threads should belong to the same sub-community,

we might detect these sub-communities by applying the method for detecting patterns of co-occurrences

introduced in Section 3.2 to the communication structure of the threads. For that we can consider each thread

as a “document”, the “tokens” of which are users that have posted in the thread, one “token” for each post.

Each resulting “topic” $\theta_t$ (a probability distribution over users) will denote a sub-community. More precisely,

the grade of membership of a user u to sub-community t is given by $\theta_t(u)$ (the probability of pattern t

generating user u). As a fuzzy method this has the advantages presented in Section 2.


Intuitively, the grade of membership of a user u to a sub-community c denotes how likely it is that u will

contribute to threads in which other members of c are also present. Each thread will thereby tend to be

assigned to a small number of communities. Altogether, this approach is suitable for scenarios in which an

automatic categorization of threads is needed according to groups of highly active users driving them.
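
The preprocessing step of this approach, turning threads into “documents” whose “tokens” are users, can be sketched as follows. The post records, user names, and the helper `threads_to_user_documents` are hypothetical; the sketch stops before topic inference, which would then be run on these documents exactly as on ordinary text.

```python
from collections import Counter


def threads_to_user_documents(posts):
    """Group posts by thread; each post contributes its author as one token.

    posts: iterable of (thread_id, user) pairs, one pair per post.
    Returns a dict mapping thread_id to the list of user tokens.
    """
    documents = {}
    for thread_id, user in posts:
        documents.setdefault(thread_id, []).append(user)
    return documents


# Hypothetical forum excerpt: five posts in two threads.
posts = [
    ("t1", "alice"), ("t1", "bob"), ("t1", "alice"),
    ("t2", "carol"), ("t2", "bob"),
]
documents = threads_to_user_documents(posts)
# Bag-of-users view of each thread, the usual input form for a topic model.
bags = {thread: Counter(users) for thread, users in documents.items()}
```

Because a user appears once per post, highly active users carry more weight in the detected patterns, which matches the interpretation of sub-communities as groups of users driving the threads.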

4.2 A Content Based Approach

We will refer to the topics that are detected by PTMs in user generated content (such as forums) as discussion

topics. Assuming that the membership of a user in a sub-community exclusively depends on the discussion

topics in which the user participates, sub-communities can be identified by detecting discussion topics as

presented in Section 3.2. Each discussion topic t can specify a sub-community in a number of ways, each of

which leads to slightly different (more or less obvious) interpretations of the sub-communities.

(1) For example, the grade of membership of user u to sub-community t can be specified as the average

proportion of topic t over all threads to which user u contributes. We call these topic proportions for user u,

u’s interests.

(2) In order to also account for the number of posts in the different threads, the average can be weighted

by the number of posts made in each of these threads.

(3) Another way to compute the interest of user u in discussion topic t is to count the number of u’s posts

within discussion threads, the top topic of which is topic t.

Subsequently, if a hard clustering is needed we can place the user u in the sub-community corresponding

to the user’s largest interest. Approach (3) is particularly suitable if we can assume that each thread essentially

deals with only one topic (the top topic, whose proportion is the greatest for the thread). Essentially we assign

the user to the discussion topic to which most of his posts have been made.
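
The three ways of computing a user’s interests, and the subsequent hard assignment, can be sketched as follows. The function names, the two-topic proportions, and the post counts are assumptions chosen for illustration; `thread_topics` stands for the per-thread topic proportions obtained from topic detection, and `user_posts` for the number of posts one user made in each thread.

```python
def interests(user_posts, thread_topics, variant):
    """Interest of one user in each discussion topic.

    user_posts:    dict thread_id -> number of posts by the user in that thread.
    thread_topics: dict thread_id -> list of topic proportions of the thread.
    variant:       1 = plain average, 2 = post-weighted average,
                   3 = post counts in threads whose top topic is t.
    """
    num_topics = len(next(iter(thread_topics.values())))
    scores = [0.0] * num_topics
    if variant == 1:
        for thread in user_posts:
            for t, p in enumerate(thread_topics[thread]):
                scores[t] += p / len(user_posts)
    elif variant == 2:
        total = sum(user_posts.values())
        for thread, n in user_posts.items():
            for t, p in enumerate(thread_topics[thread]):
                scores[t] += n * p / total
    elif variant == 3:
        for thread, n in user_posts.items():
            props = thread_topics[thread]
            top = max(range(num_topics), key=lambda t: props[t])
            scores[top] += n
    else:
        raise ValueError("variant must be 1, 2 or 3")
    return scores


def hard_assignment(scores):
    """Index of the topic (sub-community) with the largest interest."""
    return max(range(len(scores)), key=lambda t: scores[t])


# Hypothetical user with 3 posts in thread t1 and 1 post in thread t2.
thread_topics = {"t1": [0.75, 0.25], "t2": [0.25, 0.75]}
user_posts = {"t1": 3, "t2": 1}
```

In this toy case variant (1) cannot separate the two topics, whereas variants (2) and (3) both favor the first topic because they take the number of posts into account.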

Note that discussion topics are detected on thread level rather than on post level. This is sensible to do

since we can assume that most of the posts to a thread deal with similar topics and implicitly use the

enclosing thread as disambiguating context. This implies that the approach is not completely oblivious of the

communication structure; the reason is that the sub-community of a user is determined by the topics of the

user’s posts, which are in turn influenced by the (tokens of the) posts of the users talking in the same thread.

One might thus argue that users who often talk within the same thread are more likely to end up in the same

sub-community. Yet, on second thought, this is only the case if the co-occurrences of these users’

words in the threads are statistically relevant at the level of the whole document collection, i.e. in our case

throughout all threads. Given a large number of threads and posts, as in our use case, this means that

our sub-community assignment is overwhelmingly determined by the content of the communication

rather than by its structure.

All in all, the approaches introduced here are suitable for scenarios in which there is no reason to suppose

that sub-community members are bound together by ties other than the need to solve their problems at hand

and these problems mostly fall under one topic. This is true to a large extent in so-called business

communities, as opposed to social networks in the narrower sense.

The assumption underlying the sub-community approaches proposed so far in this section is that a sub-

community is essentially defined by one discussion topic and leads to a user ending up (with high probability
