Gram-ART: Variable-Length Representation with Non-Parametric Probabilistic Templates and Complex Cluster Geometry

Ryan J. Meuth, John Seiffertt, Paul Robinette, and Donald C. Wunsch II

Abstract — A new Adaptive Resonance Theory variant is presented that is capable of clustering variable-length semantic inputs by creating templates that store a non-parametric distribution over the symbols and structure of a given grammar. Originally created as an automatic function definition mechanism for a Genetic Programming architecture, the Gram-ART method has many other useful applications and properties. The variable cluster geometry of Gram-ART is demonstrated on a 2D clustering task, and Gram-ART performance is shown to improve on that of Fuzzy ART on the benchmark iris dataset.
I. INTRODUCTION

THE Adaptive Resonance Theory (ART) unsupervised learning method has long been a state-of-the-art clustering tool due to its low run-time complexity and its ability to scale the number of clusters via a single parameter.
Additionally, the seminal ART1 architecture [1] has been the subject of many research modifications, resulting in the development of Fuzzy ART [2], Gaussian ART [3], Category Theory ART [4], and numerous others.
Genetic Programming (GP) is a rapidly growing field with increasingly valuable application to a number of important areas [5]. While this evolutionary algorithm is able to efficiently generate solutions to many problems which significantly outpace those devised by human experts, there are issues of computational cost to be addressed. In particular, we investigate a class of GPs which tend to produce function trees of such magnitude that the approach is rendered less effective in very short order.
The Gram-ART clustering algorithm is introduced to intelligently and dynamically adjust the size of the GP function tree in a way that satisfies the dual criteria of efficacy and computability. Designed to operate on expressions encoded in a BNF grammar, Gram-ART is capable of clustering variable-length inputs. The algorithm is based on the neural cognitive model known as Adaptive Resonance Theory, and it is the first such ART-based architecture to address the need for variable-length inputs.
Manuscript received September 3, 2008. This work was supported in part by The Boeing Company, the National Science Foundation, the Missouri S&T Intelligent Systems Center, and the M.K. Finley Missouri Endowment.
Ryan J. Meuth, John Seiffertt, Paul Robinette, and Dr. Donald C. Wunsch II are with the Applied Computational Intelligence Laboratory at the Missouri University of Science and Technology, Rolla, MO 65401 USA (e-mail: rmeuth@mst.edu, jes0b4@mst.edu, pmrmq3@mst.edu, dwunsch@mst.edu, respectively).
Section II develops the background context of Genetic Programming, Grammatical Evolution, and Automatic Function Definition. Section III provides an overview of Adaptive Resonance Theory and the Fuzzy ART algorithm. Section IV details the Gram-ART algorithm, and Sections V and VI analyze Gram-ART cluster geometry and compare Gram-ART performance to Fuzzy ART on the standard iris dataset. Section VII details the use of Gram-ART for automatic function definition in Genetic Programming. The remaining sections provide conclusions, acknowledgments, and references.
II. GENETIC PROGRAMMING

In genetic programming, the genome of an individual is represented as a tree structure, where operations are applied at branches and the leaves are constants and problem parameters, as illustrated in Figure 1 [6, 7]. One advantage of GP is that the results can be easily interpreted by humans and formally verified, a quality that is not present in many other computational intelligence methods [8].

Figure 1. Function representation as a tree structure.
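As a minimal sketch of this representation (the node class and operator set here are illustrative assumptions, not the paper's code), such a tree is evaluated by recursing from the root:

```python
# Operations sit at branches; constants and problem parameters at leaves.
OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
}

class Node:
    def __init__(self, value, children=()):
        self.value, self.children = value, list(children)

    def evaluate(self, params):
        if not self.children:                 # leaf: parameter lookup or constant
            return params.get(self.value, self.value)
        args = [c.evaluate(params) for c in self.children]
        return OPS[self.value](*args)         # branch: apply the operation

# (x + 3) * y
tree = Node("*", [Node("+", [Node("x"), Node(3)]), Node("y")])
print(tree.evaluate({"x": 2, "y": 4}))  # → 20
```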
There has been some development of methods to generalize function blocks (branches in an individual's genome) that appear similarly and usefully across individuals and across generations, making those blocks available as fundamental components in the next generation of programs [9-12]. In this way, a library of available functions is generated and customized in a meta-evolutionary way. This modification leads to greatly increased performance, and the reuse of code allows the algorithm to find solutions that it would have very little chance of finding otherwise. Additionally, by creating function blocks and removing parts of an individual's genome from active evolutionary modification, the probability of architectural changes increases, as the genome is effectively shortened, and changes are only
allowed on parts of the genome that have a higher-level effect. In this way the evolutionary process starts by building and stabilizing low-level functionality, which grows to higher-level functions that exploit it. The result is a progressive, fitness-driven increase in program complexity that greatly accelerates GP performance in terms of both quality and speed.
Grammatical Evolution

Grammatical evolution is a way of using an integer-based representation to encode a function tree. In this way a standard integer-based genetic algorithm can be used to perform the evolutionary process while the individuals are expressed as programs. This method has the benefits of speed, high population diversity, low memory requirements, and stability [13].
In grammatical evolution, the integer chromosome is used to select production rules of a Backus-Naur Form (BNF) grammatical definition to generate a valid program. The BNF grammar is a way of expressing a language in the form of production rules. A BNF grammar consists of the tuple {N, T, P, S}, where N is the set of non-terminals, such as <expr>, <op>, and <preop>, corresponding to expressions, binary operators, and unary operators, respectively; T is the set of terminals, such as the operations AND, OR, and NOT; P is the set of production rules that map from N to T; and S is a seed symbol which is a member of N.

An example of a simple binary BNF grammar would be:

N = {expr, op, preop, var}
T = {AND, OR, NOT, X, Y}
S = <expr>

P can be represented as:

1. <expr> ::= <expr> <op> <expr>   (a)
            | <preop> <expr>       (b)
            | <var>                (c)
2. <op>   ::= AND                  (a)
            | OR                   (b)
3. <preop> ::= NOT                 (a)
4. <var>  ::= X                    (a)
            | Y                    (b)

Rule | Number of Choices
  1  | 3
  2  | 2
  3  | 1
  4  | 2

Table 1. Choices available for each production rule.
The GE process uses the integer chromosome as a series of choices over the production rules of the grammar. Starting with the seed, the decoding process uses the components of the chromosome to decide which choice to make when a non-terminal has one or more choices. For example, given the chromosome {1, 2, 5, 3, 2, 6} and the above BNF grammar, a functional representation, T1, can be constructed. The seed <expr> is used as the root of the tree:

T1: <expr>

Looking at the rule-choice table, <expr> has three possible choices. The first term in the chromosome (1, taken modulo the three choices) selects rule 1b:

T1: <preop> <expr>

There is only one possible choice for <preop>, so the function becomes:

T1: NOT (<expr>)

Evaluating <expr> again, this time taking the modulus of 5 and 3, we choose rule 1c:

T1: NOT (<var>)

And as <var> has two options available, Y is chosen:

T1: NOT (Y)
Note that the decoding process terminated before all elements in the chromosome were utilized. Also, if all elements are used before the decoding process has filled all non-terminals, the pointer can be reset to the beginning and the chromosome reused. To prevent infinitely long codes, a wrap counter is used and a maximum wrap count is implemented. When a decoding exceeds the maximum wrap count, the tree is marked as invalid and is given a non-competitive fitness value, ensuring that it will not survive to the next iteration.

Using this encoding style, grammars of arbitrary and dynamic complexity can be implemented, including the grammars of compilable languages and arbitrary functions. This method has been shown to be effective on a variety of problems, and even to outperform genetic programming using tree-based genetic operators [13].
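The decoding walk above can be sketched in executable form (a minimal illustration under our own naming; following the worked example, every non-terminal consumes one chromosome entry, taken modulo its number of choices):

```python
# The example grammar from the text, as a dictionary of productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<preop>", "<expr>"], ["<var>"]],
    "<op>": [["AND"], ["OR"]],
    "<preop>": [["NOT"]],
    "<var>": [["X"], ["Y"]],
}

def decode(chromosome, seed="<expr>", max_wraps=3):
    """Expand the seed leftmost-first, consuming one chromosome entry per
    non-terminal. Returns None (invalid individual) if the wrap limit is hit."""
    symbols, out, pos, wraps = [seed], [], 0, 0
    while symbols:
        sym = symbols.pop(0)
        if sym not in GRAMMAR:               # terminal: emit it
            out.append(sym)
            continue
        choices = GRAMMAR[sym]
        if pos == len(chromosome):           # reuse the chromosome: wrap around
            pos, wraps = 0, wraps + 1
            if wraps > max_wraps:
                return None                  # exceeds the maximum wrap count
        idx = chromosome[pos] % len(choices)
        pos += 1
        symbols = choices[idx] + symbols     # leftmost expansion
    return " ".join(out)

print(decode([1, 2, 5, 3, 2, 6]))  # → "NOT Y"
```

As in the worked example, the last two chromosome entries go unused, and a chromosome that keeps regenerating non-terminals is eventually marked invalid by the wrap counter.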
Automatic Function Definition

A key aspect of the GP process is defining the functions produced through the evolution. Koza's early attempts at function definition utilize a rigid structure where the number of functions and arguments is fixed [9]. This limits the flexibility of the defined functions and limits the complexity of evolved programs. Later attempts utilize the differential fitness of the population to determine when code components should be added. However, this leads to a large number of functions, with any given function having a small chance of being selected.
To automatically generalize useful functions, we propose that a clustering method could be used with differential fitness selection. The parameters of the clustering method could be tuned to control the number and coarseness of functions generated, providing a simple mechanism for automatic function definition. As categories are generated, the templates from each category are added to the grammar as new functions, and the GE/GA process can then take advantage of these new elements.

For this purpose, a new clustering algorithm based on Adaptive Resonance Theory is proposed that is able to utilize a variable-length representation to encode categories against a specified grammar. Currently no Adaptive Resonance Theory based clustering method exists that is able to handle trees or variable-length representations. This new algorithm is called Gram-ART.
III. ADAPTIVE RESONANCE THEORY

Adaptive Resonance Theory (ART) was developed by Carpenter and Grossberg as a solution to the plasticity and stability dilemma, i.e., how adaptable (plastic) should a learning system be so that it does not suffer from catastrophic forgetting of previously learned rules [1, 2, 14]. ART can learn arbitrary input patterns in a stable, fast, and self-organizing way, thus overcoming the effect of learning instability that plagues many other competitive networks. ART is not, as is popularly imagined, a neural network architecture. It is a learning theory hypothesizing that resonance in neural circuits can trigger fast learning [15].
Adaptive Resonance Theory exhibits theoretically rigorous properties desired by neuroscientists, which solved some of the major difficulties faced by modelers in the field. Chief among these properties is stability under incremental learning. In fact, it is this property which translates well to the computational domain and gives the ART1 clustering algorithm, the flavor of ART most faithful to the underlying differential equation model, its high status among unsupervised learning researchers. At its heart, the ART1 algorithm relies on calculating a fitness level between an input and the available categories. In this way it is very much like the well-known k-means algorithm, although the number of categories is variable and grows dynamically as needed by the given data set.
What fundamentally differentiates ART1 from similar distance-based clustering algorithms is a second fitness calculation, whereby a given category can reject the inclusion of an input if the input does not meet the category's standards as governed by a single global parameter. Cognitively, this models the brain's generation and storage of expectations in response to neuronal stimulation. The initial fitness, measuring the degree to which each input fits each of the established categories, is considered a short-term memory trace which excites a top-down expectation from long-term memory. Computationally, this second fitness calculation acts to tune the number of categories, and it may force the creation of new categories where a k-means styled algorithm would not, thus exhibiting stronger, more nuanced classification potential. The ART1 algorithm has enjoyed great popularity in a number of practical application areas of engineering interest. Its chief drawback is the requirement that input vectors be binary. The ART2 algorithm was first proposed to get around this restriction, but in practice today it is the Fuzzy ART modification of ART1 which powers most of the new ART research and applications.
Fuzzy ART admits input vectors with elements in the range [0, 1]. Typically a sort of preprocessing called complement coding is applied to the input vectors, as well as any normalization required, mapping the data to the specified range. Fuzzy ART's core fitness equations take a different form than those of ART1, leveraging the mechanics of fuzzy logic to accommodate analogue data vectors. Researchers have concocted a wide variety of ART-based architectures by modifying the fitness equations to specialize them for a given problem domain.

For example, Gaussian ARTMAP uses the normal distribution to partition categories, with the relevant fitness equations incorporating the Gaussian kernel. This parametric statistical approach to ART was the first in what has become a rich field of study. Other parametric methods incorporate different probability distributions or allow for alternative preprocessing schemes based on statistics. The Gram-ART architecture presented in this paper extends this body of knowledge by exploring non-parametric statistical methods for category determination.
Parametric statistics assume much about the underlying distribution of the inputs to the system. In running a standard t-test, for example, it is required that the data be generated by Gaussians or that we have a sufficient quantity of data to ensure the sampling distribution is normal. It is often the case in practice that such normality assumptions are invalid. Gram-ART adds to the existing probabilistic ART architectures in that it makes no such assumptions regarding the distribution of inputs (as compared to, for example, Gaussian ARTMAP). Instead, it relies on non-parametric, or distribution-free, statistical models of the inputs when making its classifications. This allows Gram-ART to effectively handle data from small samples, or data about whose structure nothing is known. The interested reader is directed to [16] for further details regarding non-parametric statistical analysis.
Other specializations of ART include ARTMAP-IC [17], which allows for input data to be inconsistently labeled and is shown to work well on medical databases; Ellipsoidal ARTMAP [18], which calculates elliptical category regions and produces superior results to methods based on hyper-rectangles in a number of problem domains; and a version of ART which uses category theory to better model the storage and organization of internal knowledge. Overall, Adaptive Resonance Theory enjoys much attention from those studying computational learning for both scientific and engineering purposes.

Fuzzy ART incorporates fuzzy set theory into ART and extends the ART family by being capable of learning stable recognition clusters in response to both binary and real-valued input patterns with either fast or slow learning.
The basic Fuzzy ART architecture consists of two layers of nodes or neurons: the feature representation field F1 and the category representation field F2, as shown in Figure 2. The neurons in layer F1 are activated by the input pattern, while the prototypes of the formed clusters, represented by hyper-rectangles, are stored in layer F2. The neurons in layer F2 that are already being used as representations of input patterns are said to be committed. Correspondingly, the uncommitted neuron encodes no input patterns. The two layers are connected via adaptive weights, W_j, emanating from node j in layer F2. After layer F2 is activated according to the winner-take-all competition between a certain number of committed neurons and one uncommitted neuron, an expectation is reflected in layer F1 and compared with the input pattern. The orienting subsystem, with the pre-specified vigilance parameter ρ (0 ≤ ρ ≤ 1), determines whether the expectation and the input pattern are closely matched. If the match meets the vigilance criterion, learning occurs and the weights are updated. This state is called resonance, which suggests the name of ART. On the other hand, if the vigilance criterion is not met, a reset signal is sent back to layer F2 to shut off the current winning neuron for the entire duration of the presentation of this input pattern, and a new competition is performed among the remaining neurons. This new expectation is then projected into layer F1, and the process repeats until the vigilance criterion is met. In the case where an uncommitted neuron is selected for coding, a new uncommitted neuron is created to represent a potential new cluster.

Fuzzy ART exhibits fast, stable, and transparent learning and atypical pattern detection. The Fuzzy ART method has the benefit of being a highly efficient clustering method with a linear runtime complexity.
Algorithmically, there are two steps to ART: category choice and the vigilance test. Let x be the input, w_j the weights associated with category j (this is really w_ij, where the weight is a vector of elements, but this subscript is typically suppressed), and ρ the vigilance; ∧ denotes the element-wise fuzzy AND (minimum) and |·| the sum of a vector's elements. In category choice, the degree of match is calculated for each category j:

T_j(x) = |x ∧ w_j| / |w_j|    (1)

For the vigilance test, with winning category J, we calculate

|x ∧ w_J| / |x|    (2)

and compare it with ρ. We then cycle between category choice and the vigilance test until resonance occurs, and we update the weights according to Eq. (3):

w_J(new) = β(x ∧ w_J) + (1 − β) w_J(old)    (3)

Fast learning occurs when β = 1.
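The category choice, vigilance, and update cycle of Eqs. (1)-(3) can be sketched as follows (a minimal single-pass illustration; the function names, the complement-coding helper, and the small alpha added to the choice denominator for numerical safety are our own assumptions):

```python
def complement_code(x):
    """Standard Fuzzy ART preprocessing: map x in [0,1]^d to (x, 1-x)."""
    return list(x) + [1.0 - v for v in x]

def fuzzy_and(a, b):
    """Element-wise fuzzy AND (minimum)."""
    return [min(u, v) for u, v in zip(a, b)]

def fuzzy_art_step(x, weights, rho=0.7, beta=1.0, alpha=1e-6):
    """Present one complement-coded input; return the resonating category
    index, committing a new category if none passes the vigilance test."""
    # Eq. (1): choice value for every committed category
    scores = [sum(fuzzy_and(x, w)) / (alpha + sum(w)) for w in weights]
    for j in sorted(range(len(weights)), key=lambda k: -scores[k]):
        w = weights[j]
        if sum(fuzzy_and(x, w)) / sum(x) >= rho:       # Eq. (2): vigilance
            weights[j] = [beta * m + (1 - beta) * wv   # Eq. (3): update
                          for m, wv in zip(fuzzy_and(x, w), w)]
            return j
    weights.append(list(x))                            # commit a new category
    return len(weights) - 1

weights = []
a = fuzzy_art_step(complement_code([0.1, 0.1]), weights)
b = fuzzy_art_step(complement_code([0.12, 0.1]), weights)
c = fuzzy_art_step(complement_code([0.9, 0.9]), weights)
print(a, b, c)  # → 0 0 1
```

Two nearby points resonate with the same category, while the distant point fails vigilance against it and commits a new one.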
IV. GRAM-ART ALGORITHM

This algorithm is a specialization of ART designed to handle variable-length input patterns represented in a tree structure based on a BNF grammar.
Initialization

Let x be a tree under the grammar, and let w_j be a generalized tree corresponding to category j. Note here that the category representations in Gram-ART are themselves trees, thus abstracting the hyper-rectangular prototype forms of earlier manifestations of ART. Each node in the generalized tree has an array representing the distribution of possible symbols at that node; a template can thus be viewed as a weight matrix with one row per node. Note that we set the number of columns equal to the maximum number of symbols in the grammar. If a given symbol does not apply to a given node, then its entry in the weight matrix will be eternally zero. Finally, we let ρ represent the vigilance level.
We need a measure of magnitude for inputs and weights. Since the sizes of the elements of these distributions do not correspond in a meaningful way to any sense of magnitude, we instead define the measure to be simply the number of nodes present in each. That is, we define the tree-norm |x| as the number of nodes in x; the same definition applies to the templates w_j.
Figure 2. Topological structure of Fuzzy ART. Layers F1 and F2 are connected via adaptive weights W. The orienting subsystem is controlled by the vigilance parameter ρ.
Initially, there are no category nodes committed. The first input vector will be used to update the weights, so no initial values of the weights need to be given.

We need to define a notion of overlap between the input and the weights. We cannot use the normal intersection nor the fuzzy AND operator, because x and w_j are not guaranteed to be of the same dimensions. Therefore, we define the trace of x in w_j, denoted x ∧ w_j, as follows:

x ∧ w_j = Σ_{i=1}^{r} w_{j,i,x_i}    (4)

where r is the number of nodes shared by the root-aligned trees and x_i is the symbol at node i of the input. The trace is the sum of the values stored in the weight matrix corresponding to the choices enumerated for a given input. This has the effect of comparing root-aligned trees.
An example is given below. Consider two trees, one for the function "X AND Y" and another for the function "NOT X", shown in Figure 3.

Figure 3. Function trees.

To store the combination of the two, we create a type of tree, called a proto-tree, that holds a distribution over the symbols at each node and has a variable number of children.

Figure 4. Proto-node.

Combining these two trees results in the proto-tree shown in Figure 5.

Figure 5. Proto-tree.
Construction

To compare and update trees, we define recursive functions that traverse both trees synchronously, outlined in Tables 4 and 5. We define two node structures, ProtoNode and TreeNode, outlined in Tables 2 and 3. The ProtoNode structure is used to construct tree prototypes, which are the templates of Gram-ART. The ProtoNode holds a distribution over all symbols in a given grammar (Line 2) and the update counters for each symbol (Line 3), as well as an array of child nodes (Line 4). The TreeNode structure, outlined in Table 3, holds a single symbol (Line 2) and an array of child nodes (Line 3).

1  struct ProtoNode
2    double dist[]
3    int N[]
4    ProtoNode protochildren[]
5  end struct

Table 2. ProtoNode structure.

1  struct TreeNode
2    Terminal t
3    TreeNode children[]
4  end struct

Table 3. TreeNode structure.
The CompareNode function, outlined in Table 4, performs the recursive process of comparing a tree with a prototype. The function first accumulates the probability of a tree's symbols occurring in the prototype (Line 2), then increments a counter that tracks the number of nodes that the trees have in common (Line 3). The function then calls itself on each of the child nodes to accumulate statistics for the remainder of the tree.

1  function CompareNode(TreeNode &A, ProtoNode &B, double &sum, double &size)
2    sum = sum + B.dist[A.t]
3    size = size + 1
4    for each i in A.children[]
5      CompareNode(A.children[i], B.protochildren[i], sum, size)
6  end function

Table 4. Trace process pseudo-code.
Step 1: Category Match

The first step in ART is to calculate the strength of the activations to the category nodes. We define this activation strength, or choice value, for category j as Equation (5):

T_j(x) = (x ∧ w_j) / |w_j|    (5)

This quantity measures to what extent the input pattern fires the category weight entries of w_j. If the elements of x correspond to all 1's in the rows of w_j, then this is a perfect match, with activation equal to one when the trees are of equal size. If the category is nowhere close to the input, then the corresponding weight entries will be small, so that this quantity approaches zero. Note that the weight matrix might contain more or fewer elements than the input, and this measure penalizes such mismatches. In the numerator, if no weight value exists to correspond to the input, then that value does not get summed. In the denominator, all the rows of the weight are accounted for, so this will lower the choice value if the input has fewer entries.
Step 2: Vigilance Test

Once T_j has been calculated for all categories, this vector is sorted and the highest category is checked for vigilance. In the vigilance test we want to see how accurately the chosen category J can predict the value of the input, so we check the following condition:

(x ∧ w_J) / |x| ≥ ρ    (6)

If this condition is satisfied, then we move on to the weight update step. Otherwise, we zero out the value T_J and proceed with the next highest category match. If none of the categories pass the vigilance test, then a new uncommitted node is assigned to the current input.
Step 3: Weight Update

There are two operations used in the weight update procedure: the element update and the row insertion. The element update is a weighted sum of the frequency with which a given option has been presented and is calculated by

w^J_ij(new) = (N w^J_ij(old) + δ_ij) / (N + 1)    (7)

where N is the number of inputs prior to the latest one and δ is a characteristic function given by Eq. (8):

δ_ij = 1 if x_i = j; 0 otherwise    (8)

1  function UpdateNode(TreeNode &A, ProtoNode &B)
2    B.dist[A.t] = NewWeight(B.dist[A.t], B.N[A.t])
3    B.N[A.t] = B.N[A.t] + 1
4    for each i in A.children[]
5      UpdateNode(A.children[i], B.protochildren[i])
6  end function

Table 5. Update process pseudo-code.

The recursive process for updating a template is described in Table 5. The function first updates the probability of a tree symbol occurring in that node location using Equation (7) (Line 2), then increments the number of updates for that symbol (Line 3). The function then calls itself on each of the child nodes, recursively updating the rest of the tree.
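The template operations of Tables 2-5 can be rendered in executable form as a sketch (our own naming; dictionaries replace the symbol-indexed arrays of the pseudocode):

```python
class TreeNode:
    def __init__(self, symbol, children=()):
        self.t, self.children = symbol, list(children)

    def size(self):                          # tree-norm: node count
        return 1 + sum(c.size() for c in self.children)

class ProtoNode:
    def __init__(self):
        self.dist = {}                       # symbol -> relative frequency
        self.n = {}                          # symbol -> update count
        self.children = []

    def compare(self, a):
        """Trace of tree a in this template (Table 4): accumulated
        probability mass plus the number of root-aligned shared nodes."""
        trace, size = self.dist.get(a.t, 0.0), 1
        for i, child in enumerate(a.children):
            if i < len(self.children):       # only root-aligned overlap counts
                t, s = self.children[i].compare(child)
                trace, size = trace + t, size + s
        return trace, size

    def update(self, a):
        """Fold tree a into the template (Table 5, using Eqs. 7-8)."""
        n = sum(self.n.values())             # inputs seen at this node so far
        for sym in set(self.dist) | {a.t}:
            delta = 1.0 if sym == a.t else 0.0
            self.dist[sym] = (n * self.dist.get(sym, 0.0) + delta) / (n + 1)
        self.n[a.t] = self.n.get(a.t, 0) + 1
        for i, child in enumerate(a.children):
            if i == len(self.children):      # row insertion: grow the template
                self.children.append(ProtoNode())
            self.children[i].update(child)

t_and = TreeNode("AND", [TreeNode("X"), TreeNode("Y")])   # "X AND Y"
t_not = TreeNode("NOT", [TreeNode("X")])                  # "NOT X"
proto = ProtoNode()
proto.update(t_and)
proto.update(t_not)
trace, size = proto.compare(t_and)
print(round(trace, 2), size)  # → 2.5 3
```

After folding in both example trees from Figure 3, the root splits its mass evenly between AND and NOT, so the trace of "X AND Y" against the proto-tree is 0.5 + 1.0 + 1.0 = 2.5 over 3 shared nodes.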
V. CLUSTER ANALYSIS

To demonstrate arbitrary cluster geometries in Gram-ART, a dataset consisting of two mixed Gaussian distributions is given as input. To translate between a continuous 2D space and a symbolic grammar, the X and Y dimensions are evenly segmented into three symbols each, giving nine separate regions of activation. One hundred points were given as input with a vigilance value of 0.7. Four templates were produced, shown in Figure 6. The data points are shown in white. Regions with high activation values are shown in bright red, while low activation values are shown in black. Template 1 is mainly composed of the empty upper region. This is a fairly trivial cluster, but it is important to note that Gram-ART still segments this region off as a different function set. Templates 2, 3, and 4 are centered in the densely populated bottom right, center, and left areas, respectively. Nearby regions are also partially activated, illustrating the inherent arbitrary geometry of the clusters.
Figure 6. Low-resolution Gram-ART template shapes (Templates 1-4).
Arbitrary cluster geometries are further illustrated in a second experiment. The X and Y dimensions were segmented into ten symbols each, which produced eleven clusters. These templates are each shown by bands of activation in the X and Y directions in Figure 7. The first cluster is again the trivial case of the empty top area. After that, several bands can be seen in each cluster. Some of these bands even have multiple peaks in them, indicating a complex relationship among the input data. The sample data is still separable, and Gram-ART is able to divide it nonlinearly into clusters.
Figure 7. High-resolution Gram-ART cluster shapes (Templates 1-11).
VI. FUZZY ART VS. GRAM-ART COMPARISON

To evaluate the performance of the Gram-ART algorithm, the standard Fisher's iris dataset [19, 20] was used as a benchmark against the Fuzzy ART method. The performance of Fuzzy ART is shown in Figures 8 and 9.

Figure 8. Fuzzy ART clustering performance on the iris dataset.

Figure 9. Fuzzy ART iris data clustering profile.
To evaluate Gram-ART on the iris dataset, the following grammar was constructed:

N = {SL, SW, PL, PW}
T = {SL1, SL2, SL3, SL4, SL5, SL6, SL7, SL8, SL9, SL10,
     SW1, SW2, SW3, SW4, SW5, SW6, SW7, SW8, SW9, SW10,
     PL1, PL2, PL3, PL4, PL5, PL6, PL7, PL8, PL9, PL10,
     PW1, PW2, PW3, PW4, PW5, PW6, PW7, PW8, PW9, PW10}
S = <SL> <SW> <PL> <PW>

P can be represented as:

1. <SL> ::= {SL1, SL2, SL3, SL4, SL5, SL6, SL7, SL8, SL9, SL10}
2. <SW> ::= {SW1, SW2, SW3, SW4, SW5, SW6, SW7, SW8, SW9, SW10}
3. <PL> ::= {PL1, PL2, PL3, PL4, PL5, PL6, PL7, PL8, PL9, PL10}
4. <PW> ::= {PW1, PW2, PW3, PW4, PW5, PW6, PW7, PW8, PW9, PW10}
The input variables are translated into symbols by finding the max and min of each variable and then dividing the range into equal compartments. For instance, the sepal length variable has min and max values of 2 and 4.4, respectively. Dividing this range into ten equal bins results in symbol SL1 with range 2 to 2.24, symbol SL2 with range 2.24 to 2.48, etc. Thus, each input variable is translated into a symbolic representation and input to Gram-ART for clustering. Combined with a fixed seed, the grammar is able to encode a fixed-length symbolic representation of the input data.
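This binning step can be sketched as follows (a small illustration; the helper name is our own, and the range follows the sepal-length figures quoted above):

```python
def make_binner(lo, hi, prefix, n_bins=10):
    """Return a function mapping a value in [lo, hi] to an equal-width
    bin symbol such as 'SL1' .. 'SL10'."""
    width = (hi - lo) / n_bins
    def to_symbol(v):
        # clamp so that v == hi falls into the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        return f"{prefix}{idx + 1}"
    return to_symbol

sepal_length = make_binner(2.0, 4.4, "SL")
print(sepal_length(2.1))   # → "SL1" (range 2 to 2.24)
print(sepal_length(4.4))   # → "SL10"
```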
Setting the vigilance at 0.8, the clustering results are shown in Figures 10 and 11.
Figure 10. Gram-ART iris data performance.

Figure 11. Gram-ART iris data clustering profile.

Figure 12. Iris data cluster confusion.

By examining the confusion matrices of the two methods, shown in Figure 12, we can see that the number of confused inputs is reduced for Gram-ART, suggesting a higher ability to distinguish between the natural clusters in the data.
VII. GRAMMATICAL EVOLUTION AND GRAM-ART AUTOMATIC FUNCTION DEFINITION RESULTS

In a Grammatical Evolution architecture, the Gram-ART unit serves the purpose of dynamic function definition, providing a library of generalized functions. If an individual has a non-zero differential fitness between itself and the higher fitness of its two parents, a search is initiated to find a sub-tree in the individual that differs from that of its parents. When found, this sub-tree is passed as input to the Gram-ART method, where it is matched to a category and modifies a template. The templates are then extracted from Gram-ART and added to the grammar as high-level functions that are available for new individuals to utilize. In this way, useful sub-trees are removed from the evolutionary process, and the genetic operations are then focused on increasingly high-level modifications to the programs. This type of mechanism maintains population diversity and is able to counteract the bloat of individuals that causes fitness stagnation [9].
The even-parity problem stack is a standard GP function approximation benchmark and a good demonstration problem for function usage, as 2-bit even parity is a sub-problem of 4-bit even parity, which is a sub-problem of 6-bit, etc. [6]. Curriculum control is utilized to increase the size of the even-parity problem incrementally from 2 bits to 10 bits as the evolutionary optimizer reaches either convergence or 100% accuracy on all test cases.
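The benchmark's fitness measure can be sketched as follows (our own minimal formulation; the GP individual is stood in for by a plain boolean function):

```python
from itertools import product

def even_parity_fitness(func, n_bits):
    """Score a boolean function by the fraction of the 2^n input cases
    on which it matches the even-parity target."""
    cases = list(product([False, True], repeat=n_bits))
    hits = sum(func(*bits) == (sum(bits) % 2 == 0) for bits in cases)
    return hits / len(cases)

# XNOR is exactly 2-bit even parity, so it scores perfect fitness:
print(even_parity_fitness(lambda x, y: x == y, 2))  # → 1.0
```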
Figure 13 shows the average early evolutionary profiles of the architecture on the 10-bit even parity problem. The baseline used neither curriculum control nor dynamic function definition; the iterated method used only curriculum control to step through even-parity problem sizes from 2 to 9 before being exposed to the 10-bit even parity problem shown here. The iterated dynamic method utilized both curriculum control and the Gram-ART unit for dynamic function definition. We can see that the baseline method makes very little early progress, while both iterated methods start at a much higher fitness. Additionally, the iterated dynamic method is able to take large optimization steps due to its ability to construct individuals from existing generalized partial solutions.

Figure 13. Evolutionary profile for baseline, iterated, and iterated with dynamic (I.D.) function definition under the 10-bit even parity problem.

Figure 14. Comparative computational expense for baseline and iterated training.

Figure 15. Example function utilization after iterated curriculum with dynamic function definition.
Figure 14 compares the computational expense of the baseline method to an iterated method over the same number of generations (and thus, individual evaluations) each. As the iterated method starts with simple problems that are faster to evaluate, it has a much lower computational expense than the baseline method, which must evaluate the largest and most computationally expensive problem instance for each individual.
Figure 15 shows the normalized distribution of function utilization by the fittest individuals after an iterated learning curriculum with dynamic function definition under the even-parity curriculum. We can see that early dynamically generated functions are highly utilized. For a vigilance value of 0.7, the number of clusters generated using Gram-ART as automatic function definition is orders of magnitude smaller than the number of functions generated without clustering: fewer than a hundred functions were generated with clustering, compared to tens of thousands generated without it. In the genetic process, this increases the probability that any given high-value function will be selected for utilization in an individual.
VIII. CONCLUSION
Gram-ART, a new Adaptive Resonance Theory variant, has been developed with many valuable properties, including the ability to cluster symbolic information – not only data but the structure of data relative to a grammar. Additionally, the Gram-ART method is able to develop cluster shapes that are not geometrically constrained, which improves its ability to distinguish between classes in non-linearly separable data.
IX. ACKNOWLEDGMENTS
The authors would like to acknowledge John Vian and Emad W. Saad of the Boeing Phantom Works Intelligent Adaptive Systems Group for technical discussions and aerospace applications concepts. Additional special thanks to Rui Xu and Ryanne Dolan for review and editing assistance.
X. REFERENCES
[1] S. Grossberg and G. A. Carpenter, "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[2] G. A. Carpenter and S. Grossberg, "Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system," Neural Networks, vol. 4, pp. 759-771, 1991.
[3] J. R. Williamson, "Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps," Neural Networks, vol. 9, pp. 881-897, 1996.
[4] M. J. Healy, R. D. Olinger, R. J. Young, T. P. Cuadell, and K. W. Larson, "Modification of the ART1 architecture based on category theoretic design principles," Neural Networks, vol. 1, pp. 457-462, 2005.
[5] W. B. Langdon and R. Poli, Foundations of Genetic Programming. New York: Springer Verlag, 2002.
[6] J. R. Koza, "Hierarchical genetic algorithms operating on populations of computer programs," in International Joint Conference on Artificial Intelligence, 1989, pp. 768-774.
[7] J. R. Koza, "The genetic programming paradigm: Genetically breeding populations of computer programs to solve problems," in Dynamic, Genetic and Chaotic Programming. John Wiley, 1992, pp. 201-321.
[8] J. P. Rosca, "Genetic programming exploratory power and the discovery of functions," in Conference on Evolutionary Programming, 1995, pp. 719-736.
[9] J. R. Koza, "Hierarchical automatic function definition in genetic programming," in Foundations of Genetic Algorithms 2. Morgan Kaufmann, 1992, pp. 297-318.
[10] J. R. Koza, "Simultaneous discovery of detectors and a way of using the detectors via genetic programming," in International Conference on Neural Networks, 1993.
[11] J. P. Rosca, "Hierarchical Learning with Procedural Abstraction Mechanisms," 1997.
[12] P. J. Angeline, "Evolutionary Algorithms and Emergent Intelligence," Doctoral Thesis, Ohio State University, Columbus, OH, 1993.
[13] M. O'Neill and C. Ryan, "Grammatical evolution," IEEE Transactions on Evolutionary Computation, vol. 5, pp. 349-358, 2001.
[14] G. A. Carpenter, S. Grossberg, and D. B. Rosen, "ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition," Neural Networks, vol. 4, pp. 493-504, 1991.
[15] R. Xu and D. Wunsch II, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, pp. 645-678, 2005.
[16] J. D. Gibbons, Nonparametric Statistics: An Introduction. Sage Publications, 1993.
[17] G. A. Carpenter and N. Markuzon, "ARTMAP-IC and medical diagnosis: Instance counting and inconsistent cases," Neural Networks, vol. 11, pp. 323-336, 1998.
[18] G. C. Anagnostopoulos and M. Georgiopoulos, "Ellipsoidal ART and ARTMAP for incremental clustering and classification," in Int. Joint Conference on Neural Networks, 2001, pp. 1221-1226.
[19] A. Asuncion and D. J. Newman, "Fisher's Iris Dataset," UCI Machine Learning Repository, 2007.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, pp. 179-188, 1936.