Gram-ART: Variable Length Representation with Non-Parametric Probabilistic Templates and Complex Cluster Geometry

Ryan J. Meuth, John Seiffertt, Paul Robinette, and Donald C. Wunsch II
Abstract

A new Adaptive Resonance Theory variant is presented that is capable of clustering variable-length semantic inputs by creating templates that store a non-parametric distribution over the symbols and structure of a given grammar. Originally created as an automatic function definition mechanism for a Genetic Programming architecture, the Gram-ART method has many other useful applications and properties. The variable cluster geometry of Gram-ART is demonstrated on a 2D clustering task, and Gram-ART performance is shown to be improved compared to that of Fuzzy-ART on the benchmark iris dataset.

I. INTRODUCTION

The Adaptive Resonance Theory (ART) unsupervised learning method has long been a state-of-the-art clustering tool due to its low run-time complexity and its ability to scale the number of clusters via a single parameter. Additionally, the seminal ART1 architecture [1] has been the subject of many research modifications, resulting in the development of Fuzzy-ART [2], Gaussian ART [3], Category Theory ART [4], and numerous others.


Genetic Programming (GP) is a rapidly growing field with increasingly valuable applications in a number of important areas [5]. While this evolutionary algorithm is able to efficiently generate solutions to many problems, solutions that significantly outpace those devised by human experts, there are issues of computational cost to be addressed. In particular, we investigate a class of GPs which tend to produce function trees of such magnitude that the approach is rendered less effective in very short order. The Gram-ART clustering algorithm is introduced to intelligently and dynamically adjust the size of the GP function tree in a way that satisfies the dual criteria of efficacy and computability.


Designed to operate on expressions encoded in a BNF grammar, Gram-ART is capable of clustering variable-length inputs. The algorithm is based on the neural cognitive model known as Adaptive Resonance Theory, and it is the first such ART-based architecture to address the need for variable-length inputs.



Manuscript received September 3, 2008. This work was supported in part by The Boeing Company, the National Science Foundation, the Missouri S&T Intelligent Systems Center, and the M.K. Finley Missouri Endowment.

Ryan J. Meuth, John Seiffertt, Paul Robinette, and Donald C. Wunsch II are with the Applied Computational Intelligence Laboratory at the Missouri University of Science and Technology, Rolla, MO 65401 USA (e-mail: rmeuth@mst.edu, jes0b4@mst.edu, pmrmq3@mst.edu, dwunsch@mst.edu, respectively).



Section II develops the background context of Genetic Programming, Grammatical Evolution, and Automatic Function Definition. Section III provides an overview of Adaptive Resonance Theory and the Fuzzy-ART algorithm. Section IV details the Gram-ART algorithm, and Sections V and VI analyze Gram-ART cluster geometry and compare Gram-ART performance to Fuzzy-ART on the standard iris dataset. Section VII details the use of Gram-ART for automatic function definition in Genetic Programming. The remaining sections provide conclusions, acknowledgments, and references.

II. GENETIC PROGRAMMING

In genetic programming, the genome of an individual is represented as a tree structure, where operations are applied at branches and the leaves are constants and problem parameters, as illustrated in Figure 1 [6, 7]. One advantage of GP is that the results can be easily interpreted by humans and formally verified, a quality that is not present in many other computational intelligence methods [8].

Figure 1. Function representation as a tree structure.


There has been some development of methods to generalize function blocks (branches in an individual's genome) that appear similarly and usefully across individuals and across generations, making those blocks available as fundamental components in the next generation of programs [9-12]. In this way, a library of available functions is generated and customized in a meta-evolutionary way. This modification leads to greatly increased performance, and the repetition of code allows the algorithm to find solutions that it would have very little chance of finding otherwise. Additionally, by creating function blocks and removing parts of an individual's genome from active evolutionary modification, the probability of architectural changes increases, as the genome is effectively shortened, and changes are only
allowed on parts of the genome that have a higher-level effect. In this way the evolutionary process starts by building and stabilizing low-level functionality, which grows into higher-level functions that exploit it. The result is a progressive, fitness-driven increase in program complexity that greatly accelerates GP performance in terms of both solution quality and speed.

Grammatical Evolution

Grammatical evolution is a way of using an integer-based representation to encode a function tree. In this way a standard integer-based genetic algorithm can be used to perform the evolutionary process while the individuals are expressed as programs. This method has the benefits of speed, high population diversity, low memory requirements, and stability [13].


In grammatical evolution, the integer chromosome is used to select production rules of a Backus-Naur Form (BNF) grammatical definition to generate a valid program. The BNF grammar is a way of expressing a language in the form of production rules. A BNF grammar consists of the tuple {N, T, P, S}, where N is the set of non-terminals, such as <expr>, <op>, and <pre_op>, corresponding to expressions, binary operators, and unary operators, respectively; T is the set of terminals, such as the operations AND, OR, and NOT; P is the set of production rules that map from N to T; and S is a seed symbol which is a member of N.


An example of a simple binary BNF grammar would be:

N = {expr, op, pre_op, var}
T = {AND, OR, NOT, X, Y}
S = <expr>

P can be represented as:

1. <expr>   ::= <expr> <op> <expr>   (a)
              | <pre_op> <expr>      (b)
              | <var>                (c)
2. <op>     ::= AND                  (a)
              | OR                   (b)
3. <pre_op> ::= NOT                  (a)
4. <var>    ::= X                    (a)
              | Y                    (b)

Rule | Number of Choices
  1  |  3
  2  |  2
  3  |  1
  4  |  2

Table 1. Choices available for each production rule.


The GE process uses the integer chromosome as a series of choices over the production rules of the grammar. Starting with the seed, the decoding process uses the components of the chromosome to decide which choice to make when a non-terminal has one or more choices. For example, given the chromosome {1, 2, 5, 3, 2, 6} and the above BNF grammar, a functional representation, T1, can be constructed. The seed <expr> is used as the root of the tree:

T1: <expr>

Looking at the rule-choice table, <expr> has three possible choices. The first term in the chromosome selects rule 1b:

T1: <pre_op> <expr>

There is only one possible choice for <pre_op>, so the function becomes:

T1: NOT (<expr>)

Evaluating <expr> again, this time taking the modulus of 5 and 3, choose rule 1c:

T1: NOT(<var>)

And as <var> has two options available, Y is chosen:

T1: NOT(Y)


Note that the decoding process terminated before
all
elements

in the
chromosome were utilized. Also, if
elements
were all used before the decoding process filled all non
-
terminals, the pointer can be reset to the beginning and the
chromosome reused. To prevent infinitely long codes, a
wrap counter is used and maximum wrap count is
implemented. When a decod
ing exceeds the maximum wrap
count, the tree is marked as invalid and is given a non
-
competitive fitness value, ensuring that it will not survive to
the next iteration.
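The decoding procedure above is compact enough to show in full. The following is a minimal Python sketch of grammatical-evolution decoding for the example grammar; the grammar dictionary, the mod rule for selecting a production, and the policy of consuming a codon only when a non-terminal has more than one choice are illustrative assumptions rather than the reference implementation of [13].

# A sketch of GE decoding for the binary grammar above.
GRAMMAR = {
    "<expr>":   [["<expr>", "<op>", "<expr>"], ["<pre_op>", "<expr>"], ["<var>"]],
    "<op>":     [["AND"], ["OR"]],
    "<pre_op>": [["NOT"]],
    "<var>":    [["X"], ["Y"]],
}
SEED = "<expr>"


def decode(chromosome, grammar=GRAMMAR, seed=SEED, max_wraps=2):
    """Expand the seed into a terminal string, or return None if the
    maximum wrap count is exceeded (the tree is then marked invalid)."""
    symbols = [seed]                  # work list, expanded left to right
    pointer, wraps = 0, 0
    while any(s in grammar for s in symbols):
        i = next(k for k, s in enumerate(symbols) if s in grammar)
        choices = grammar[symbols[i]]
        if len(choices) == 1:
            choice = choices[0]       # no codon needed for a single choice
        else:
            if pointer >= len(chromosome):
                pointer, wraps = 0, wraps + 1     # wrap and reuse the chromosome
                if wraps > max_wraps:
                    return None       # invalid: gets a non-competitive fitness
            choice = choices[chromosome[pointer] % len(choices)]
            pointer += 1
        symbols[i:i + 1] = list(choice)
    return " ".join(symbols)


if __name__ == "__main__":
    print(decode([1, 2, 5, 3, 2, 6]))   # one possible expansion: "NOT Y"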


Using this encoding style, grammars of arbitrary and dynamic complexity can be implemented, including the grammars of compilable languages and arbitrary functions. This method has been shown to be effective on a variety of problems, and even to out-perform genetic programming using tree-based genetic operators [13].


Automatic Function Definition

A key aspect of the GP process is defining the functions produced through evolution. Koza's early attempts at function definition utilize a rigid structure where the number of functions and arguments is fixed [9]. This limits the flexibility of the defined functions and limits the complexity of evolved programs. Later attempts utilize the differential fitness of the population to determine when code components should be added. However, this leads to a large number of functions, with any given function having a small chance of being selected.


To automatically generalize useful functions, we propose that a clustering method could be used with differential fitness selection. The parameters of the clustering method could be tuned to control the number and coarseness of the functions generated, providing a simple mechanism for automatic function definition. As categories are generated, the templates from each category are added to the grammar as new functions, and the GE/GA process can then take advantage of these new elements.


For this purpose, a new clustering algorithm based on Adaptive Resonance Theory is proposed that is able to utilize a variable-length representation to encode categories against a specified grammar. Currently no Adaptive Resonance Theory-based clustering method exists that is able to handle trees or variable-length representations. This new algorithm is called Gram-ART.


III. ADAPTIVE RESONANCE THEORY


Adaptive Resonance Theory (ART) was developed by Carpenter and Grossberg as a solution to the plasticity and stability dilemma, i.e., how adaptable (plastic) should a learning system be so that it does not suffer from catastrophic forgetting of previously learned rules [1, 2, 14]. ART can learn arbitrary input patterns in a stable, fast, and self-organizing way, thus overcoming the effect of learning instability that plagues many other competitive networks. ART is not, as is popularly imagined, a neural network architecture. It is a learning theory hypothesizing that resonance in neural circuits can trigger fast learning [15].


Adaptive Resonance Theory exhibits theoretically rigorous properties desired by neuroscientists, properties that solved some of the major difficulties faced by modelers in the field. Chief among these is stability under incremental learning. In fact, it is this property which translates well to the computational domain and gives the ART1 clustering algorithm, the flavor of ART most faithful to the underlying differential equation model, its high status among unsupervised learning researchers. At its heart, the ART1 algorithm relies on calculating a fitness level between an input and the available categories. In this way it is very much like the well-known k-means algorithm, although the number of categories is variable and grows dynamically as needed by the given data set.


What fundamentally differentiates ART1 from similar distance-based clustering algorithms is a second fitness calculation whereby a given category can reject the inclusion of an input if the input does not meet the category's standards, as governed by a single global parameter. Cognitively, this models the brain's generation and storage of expectations in response to neuronal stimulation. The initial fitness, measuring the degree to which each input fits each of the established categories, is considered a short-term memory trace which excites a top-down expectation from long-term memory. Computationally, this second fitness calculation acts to tune the number of categories, and it may force the creation of new categories where a k-means-styled algorithm would not, thus exhibiting stronger, more nuanced classification potential. The ART1 algorithm has enjoyed great popularity in a number of practical application areas of engineering interest. Its chief drawback is the requirement that input vectors be binary. The ART2 algorithm was first proposed to get around this restriction, but in practice today it is the Fuzzy-ART modification of ART1 which powers most of the new ART research and applications.



Fuzzy-ART admits input vectors with elements in the range [0,1]. Typically a form of preprocessing called complement coding is applied to the input vectors, along with any normalization required to map the data to the specified range. Fuzzy-ART's core fitness equations take a different form than those of ART1, leveraging the mechanics of fuzzy logic to accommodate analog data vectors. Researchers have concocted a wide variety of ART-based architectures by modifying the fitness equations to specialize them for a given problem domain.


For example, Gaussian ARTMAP uses the normal distribution to partition categories, with the relevant fitness equations incorporating the Gaussian kernel. This parametric statistical approach to ART was the first in what has become a rich field of study. Other parametric methods incorporate different probability distributions or allow for alternative preprocessing schemes based on statistics. The Gram-ART architecture presented in this paper extends this body of knowledge by exploring non-parametric statistical methods for category determination.


Parametric statistics assume much about the underlying distribution of the inputs to the system. In running a standard t-test, for example, it is required that the data be generated by Gaussians or that we have a sufficient quantity of data to ensure the sampling distribution is normal. It is often the case in practice that such normality assumptions are invalid. Gram-ART adds to the existing probabilistic ART architectures in that it makes no such assumptions regarding the distribution of inputs (as compared to, for example, Gaussian ARTMAP). Instead, it relies on non-parametric, or distribution-free, statistical models of the inputs when making its classifications. This allows Gram-ART to effectively handle data from small samples or data about whose structure nothing is known. The interested reader is directed to [16] for further details regarding non-parametric statistical analysis.



Other specializations of ART include ARTMAP
-
IC

[17]

which allows for input data to be inconsistently la
beled and
is shown to work well on medical databases, Ellipsoidal
ARTMAP

[18]

which calculates elliptical category regions
and produces superior results to methods based on hyper
-
rectangles in a number of problem domains, and a version of
ART which uses category theory to better model the storage



and organization of internal

knowledge. Overall, Adaptive
Resonance Theory enjoys much attention by those studying
computational learning for both scientific and engineering
purposes.


Fuzzy-ART incorporates fuzzy set theory into ART and extends the ART family by being capable of learning stable recognition clusters in response to both binary and real-valued input patterns with either fast or slow learning.


The basic FA architecture consists of two layers of nodes or neurons: the feature representation field F1 and the category representation field F2, as shown in Figure 2. The neurons in layer F1 are activated by the input pattern, while the prototypes of the formed clusters, represented by hyper-rectangles, are stored in layer F2. The neurons in layer F2 that are already being used as representations of input patterns are said to be committed. Correspondingly, an uncommitted neuron encodes no input patterns. The two layers are connected via adaptive weights, W_j, emanating from node j in layer F2. After layer F2 is activated according to the winner-take-all competition between a certain number of committed neurons and one uncommitted neuron, an expectation is reflected in layer F1 and compared with the input pattern. The orienting subsystem with the pre-specified vigilance parameter ρ (0 ≤ ρ ≤ 1) determines whether the expectation and the input pattern are closely matched. If the match meets the vigilance criterion, learning occurs and the weights are updated. This state is called resonance, which suggests the name of ART. On the other hand, if the vigilance criterion is not met, a reset signal is sent back to layer F2 to shut off the current winning neuron for the entire duration of the presentation of this input pattern, and a new competition is performed among the remaining neurons. This new expectation is then projected into layer F1, and the process repeats until the vigilance criterion is met. In the case where an uncommitted neuron is selected for coding, a new uncommitted neuron is created to represent a potential new cluster.


Fuzzy-ART exhibits fast, stable, and transparent learning and atypical pattern detection. The Fuzzy-ART method has the benefit of being a highly efficient clustering method with linear runtime complexity.

Algorithmically, there are two steps to ART: category choice and the vigilance test. Let x be the input, w_j the weights associated with category j (this is really w_{ij}, where the weight for each category is a vector of elements, but this subscript is typically suppressed), and let ρ be the vigilance.

In category choice, the degree of match is calculated:


T_j(x) = \frac{| x \wedge w_j |}{| w_j |}    (1)


for each category j.

For the vigilance test, we calculate

\frac{| x \wedge w_j |}{| x |}    (2)


and compare it with ρ. We then cycle between category choice and the vigilance test until resonance occurs, and we update the weights according to eq. 3:




w_j^{new} = \beta \, (x \wedge w_j) + (1 - \beta) \, w_j    (3)


Fast learning occurs when β=1.
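For concreteness, the Fuzzy-ART cycle of equations (1)-(3) can be sketched as follows. This is a minimal illustration, not the authors' implementation; complement coding and the small choice parameter alpha added to the denominator of eq. (1) are standard Fuzzy-ART conventions assumed here.

# A compact sketch of the Fuzzy-ART loop described by equations (1)-(3).
import numpy as np


def fuzzy_art(inputs, rho=0.75, beta=1.0, alpha=1e-6):
    """Cluster rows of `inputs` (values in [0,1]); returns (labels, weights)."""
    weights, labels = [], []
    for x in inputs:
        x = np.concatenate([x, 1.0 - x])                # complement coding
        T = [np.minimum(x, w).sum() / (alpha + w.sum()) for w in weights]  # eq. (1)
        for j in np.argsort(T)[::-1]:                   # category choice, best first
            w = weights[j]
            if np.minimum(x, w).sum() / x.sum() >= rho: # vigilance test, eq. (2)
                weights[j] = beta * np.minimum(x, w) + (1 - beta) * w       # eq. (3)
                labels.append(int(j))
                break
        else:                                           # no category resonated: commit a new one
            weights.append(x.copy())
            labels.append(len(weights) - 1)
    return labels, weights


if __name__ == "__main__":
    data = np.array([[0.10, 0.10], [0.12, 0.09], [0.90, 0.85]])
    print(fuzzy_art(data, rho=0.8)[0])    # e.g. [0, 0, 1]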

IV. GRAM-ART ALGORITHM


This algorithm is a specialization of ART designed to handle
variable
-
leng
th input patterns represented in a tree structure
based on a BNF grammar
.


Initialization

Let x be a tree under the given grammar, and let w_j be a generalized tree corresponding to category j. Note here that the category representations in Gram-ART are themselves trees, thus abstracting the hyper-rectangular prototype forms of earlier manifestations of ART. Each node in the generalized tree has an array representing the distribution of possible symbols at that node; here, r represents the number of nodes in a tree. Note that we set the number of columns equal to the maximum number of symbols in the grammar. If a given symbol does not apply to a given node, then its entry in the weight matrix will be eternally zero. Finally, we let ρ represent the vigilance level.


We need a measure of magnitude for inputs and weights. Since the sizes of the elements of these distributions do not correspond in a meaningful way to any sense of magnitude, we instead define the measure to be simply the number of nodes present in each. That is, we define the tree-norm |x| as the number of nodes in x, and likewise |w_j| as the number of nodes in the prototype tree w_j.





[Figure 2 diagram: Input Pattern I feeds Layer F1; Layers F1 and F2 are connected by weights W; the Orienting Subsystem, governed by ρ, issues the Reset signal.]

Figure 2. Topological structure of Fuzzy ART. Layers F1 and F2 are connected via adaptive weights W. The orienting subsystem is controlled by the vigilance parameter ρ.





Initially, there are no category nodes committed. The first input vector will be used to update the weights, so no initial values of the weights need to be given.


We need to define a notion of overlap between the input and the weights. We cannot use the normal intersection nor the fuzzy-AND operator, because x and w_j are not guaranteed to be of the same dimensions. Therefore, we define the trace of x in w_j, denoted here by \langle x, w_j \rangle, as follows:

\langle x, w_J \rangle = \sum_{i=1}^{r} w_{J, i, x_i}    (4)

The trace is the sum of the values stored in the weight matrix corresponding to the choices enumerated for a given input x. This has the effect of comparing root-aligned trees. An example is given below.


Consider two trees, one for the function "X AND Y" and another for the function "NOT X", which are shown in Figure 3.

Figure 3. Function trees.

To store the combination of the two, we create a type of tree, called a proto-tree, that holds a distribution over the symbols at each node and has a variable number of children.

Figure 4. Proto-node.

Combining these two trees results in the proto-tree shown in Figure 5.

Figure 5. Proto-tree construction.


To compare and update trees, we define recursive functions that traverse both trees synchronously, outlined in Tables 4 and 5. We define two node structures, ProtoNode and TreeNode, outlined in Tables 2 and 3. The ProtoNode structure is used to construct tree prototypes, which are the templates of Gram-ART. The ProtoNode holds a distribution over all symbols in a given grammar (Line 2) and the update counters for each symbol (Line 3), as well as an array of child nodes (Line 4). The TreeNode structure, outlined in Table 3, holds a single symbol (Line 2) and an array of child nodes (Line 3).


1  struct ProtoNode
2      double dist[]
3      int N[]
4      ProtoNode protochildren[]
5  end struct

Table 2. Proto-node structure.


1  struct TreeNode
2      Terminal t
3      TreeNode children[]
4  end struct

Table 3. Tree-node structure.


The CompareNode function, outlined in Table 4, performs the recursive process of comparing a tree with a prototype. The function first accumulates the probability of a tree's symbols occurring in the prototype (Line 2), then increments a counter that tracks the number of nodes that the trees have in common (Line 3). The function then calls itself on each of the child nodes to accumulate statistics for the remainder of the tree.


1  function CompareNode(TreeNode &A, ProtoNode &B, double &sum, double &size)
2      sum = sum + B.dist[A.t]
3      size = size + 1
4      For each i in A.children[]
5          CompareNode(A.children[i], B.protochildren[i], sum, size)
6  end function

Table 4. Trace process pseudo-code.


Step 1: Category Match

The first step in ART is to calculate the strength of the activations of the category nodes. We define this activation strength, or choice value, for category J as in equation 5:

T_J(x) = \frac{\langle x, w_J \rangle}{| w_J |}    (5)


This quantity measures the extent to which the input pattern x fires the category weight entries of w_J. If the elements of x correspond to all 1's in the rows of w_J, then this is a perfect match, with activation equal to |x| / |w_J|. If the category w_J is nowhere close to the input x, then the corresponding weight entries will be small, so the quantity approaches 0.

Note that the weight matrix might contain more or fewer elements than the input, and this measure penalizes such mismatches. In the numerator, if a weight value does not exist to correspond to an input node, then that value does not get summed. In the denominator, all the rows of the weight matrix are accounted for, so this will lower the choice value if the input has fewer entries.


Step 2: Vigilance Test

Once T_J has been calculated for all categories, this vector is sorted and the highest category is checked for vigilance. In the vigilance test we want to see how accurately the chosen category can predict the value of the input, so we check the following condition:

\frac{\langle x, w_J \rangle}{| x |} \geq \rho    (6)

If this condition is satisfied, then we move on to the weight update step. Otherwise, we zero out the value T_J and proceed with the next highest category match. If none of the categories pass the vigilance test, then a new uncommitted node is assigned to the current input.


Step 3: Weight Update

There are two operations used in the weight update procedure: the element update and the row insertion.

The element update is a weighted sum of the frequency with which a given option has been presented and is calculated by


w_{Jij}^{new} = \frac{N \, w_{Jij} + \delta_j}{N + 1}    (7)


where N is the number of inputs prior to the latest one and δ_j is a characteristic function given by eq. 8.






\delta_j = \begin{cases} 1 & \text{if } x_i = j \\ 0 & \text{otherwise} \end{cases}    (8)
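As a purely illustrative numerical check of equations (7) and (8) (the values here are hypothetical, not taken from the experiments): suppose a prototype node has seen N = 3 inputs and currently stores w_{Jij} = 0.5 for symbol j. If the new input presents symbol j at that node, so that \delta_j = 1, then

w_{Jij}^{new} = \frac{3 \cdot 0.5 + 1}{3 + 1} = 0.625,

while every other symbol's entry at that node is scaled by 3/4, which keeps the distribution at the node normalized.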



1  function UpdateNode(TreeNode &A, ProtoNode &B)
2      B.dist[A.t] = NewWeight(B.dist, B.N)
3      B.N[A.t] = B.N[A.t] + 1
4      For each i in A.children[]
5          UpdateNode(A.children[i], B.protochildren[i])
6  end function

Table 5. Update process pseudo-code.


The recursive process for updating a prototype w_J is described in Table 5. The function first updates the probability of a tree symbol occurring at that node location using equation 7 (Line 2), then increments the number of updates for that symbol (Line 3). The function then calls itself on each of the child nodes, recursively updating the rest of the tree.
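Putting the pieces of Section IV together, the following Python sketch mirrors the structures of Tables 2 and 3 and the recursions of Tables 4 and 5, together with equations (5)-(8). Symbol indexing, the handling of children missing from the prototype, and the helper names are assumptions made for illustration; the pseudo-code above remains the authoritative description.

# A sketch of the Gram-ART node structures, trace, match, and update.
from dataclasses import dataclass, field


@dataclass
class TreeNode:                      # Table 3: one symbol plus children
    t: int                           # symbol index into the grammar's symbol set
    children: list = field(default_factory=list)


@dataclass
class ProtoNode:                     # Table 2: symbol distribution, counters, children
    dist: list                       # probability of each symbol at this node
    N: list                          # update count for each symbol
    protochildren: list = field(default_factory=list)


def count_nodes(a: TreeNode):
    return 1 + sum(count_nodes(c) for c in a.children)


def proto_size(b: ProtoNode):        # tree-norm |w_J|
    return 1 + sum(proto_size(c) for c in b.protochildren)


def compare_node(a: TreeNode, b: ProtoNode):
    """Table 4: accumulate the trace <x, w_J> and the node count of x."""
    trace, size = b.dist[a.t], 1
    for i, child in enumerate(a.children):
        if i < len(b.protochildren):            # unmatched nodes contribute no trace
            t, s = compare_node(child, b.protochildren[i])
            trace, size = trace + t, size + s
        else:
            size += count_nodes(child)
    return trace, size


def choice_and_match(x: TreeNode, w: ProtoNode):
    """Equations (5) and (6): (choice value T_J, vigilance match) for one category."""
    trace, size = compare_node(x, w)
    return trace / proto_size(w), trace / size


def update_node(a: TreeNode, b: ProtoNode, n_symbols: int):
    """Table 5 with equations (7)-(8): fold the input tree into the prototype."""
    N = sum(b.N)                                # inputs seen at this node so far
    for j in range(n_symbols):
        delta = 1.0 if j == a.t else 0.0        # eq. (8)
        b.dist[j] = (N * b.dist[j] + delta) / (N + 1)   # eq. (7)
    b.N[a.t] += 1
    for i, child in enumerate(a.children):
        if i >= len(b.protochildren):           # grow the prototype as needed
            b.protochildren.append(ProtoNode([0.0] * n_symbols, [0] * n_symbols))
        update_node(child, b.protochildren[i], n_symbols)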

V. CLUSTER ANALYSIS


To demonstrate arbitrary cluster
geometries in Gram
-
ART a

dataset consisting of two

mixed
Gaussian distribution
s

is
given as input.
To translate between a continuous 2D space
and a symbolic grammar, t
he X and Y dimensions are evenly
segmented into three symbols each, giving nine separate
regions
of activation
.
One hundred
points were given as
input with a vigilance value of 0.
7
. F
our
templates
were
produced
, shown in Figure 6
. The

data
points are shown in
white. Regions with high activation values are shown in
bright red while low activat
ion values are shown in black.


Template 1 below is mainly composed of the empty upper region. This is a fairly trivial cluster, but it is important to note that Gram-ART still segments this region off as a different function set. Templates 2, 3, and 4 are centered in the densely populated bottom right, center, and left areas, respectively. Nearby regions are also partially activated, illustrating the inherent arbitrary geometry of the clusters.


[Figure panels: Templates 1 and 2; Templates 3 and 4.]

Figure 6. Low-resolution Gram-ART template shapes.


Arbitrary cluster geometries are further illustrated in a second experiment. The X and Y dimensions were segmented into ten symbols each, which produced eleven clusters. These templates are each shown by bands of activation in the X and Y directions in Figure 7. The first cluster is again the trivial case of the empty top area. After that, several bands can be seen in each cluster. Some of these bands even have multiple peaks in them, indicating a complex relationship among the input data. The sample data is still separable, and Gram-ART is able to divide it nonlinearly into clusters.











[Figure panels: Templates 1 and 2; Templates 3 and 4; Templates 5 and 6; Templates 7 and 8; Templates 9 and 10; Template 11.]

Figure 7. High-resolution Gram-ART cluster shapes.



VI. FUZZY-ART VS. GRAM-ART COMPARISON


To evaluate the performance of the Gram
-
ART algorithm,
the standard Fisher’s Iris
d
ataset
[19, 20]

was used a
s a
benchmark against the Fuzzy
-
ART method. The
performance of

Fuzzy
-
ART is shown in Figure 8

and 9
.



Figure 8
. Fuzzy
-
ART clustering performance on the Iris
dataset.



Figure 9. Fuzz
-
ART Iris Data clustering profile.


To evaluate Gram-ART on the Iris dataset, the following grammar was constructed:

N = {SL, SW, PL, PW}

T = {SL1, SL2, SL3, SL4, SL5, SL6, SL7, SL8, SL9, SL10,
     SW1, SW2, SW3, SW4, SW5, SW6, SW7, SW8, SW9, SW10,
     PL1, PL2, PL3, PL4, PL5, PL6, PL7, PL8, PL9, PL10,
     PW1, PW2, PW3, PW4, PW5, PW6, PW7, PW8, PW9, PW10}

S = <SL> <SW> <PL> <PW>


P can be represented as:

1. <SL> ::= {SL1, SL2, SL3, SL4, SL5, SL6, SL7, SL8, SL9, SL10}
2. <SW> ::= {SW1, SW2, SW3, SW4, SW5, SW6, SW7, SW8, SW9, SW10}
3. <PL> ::= {PL1, PL2, PL3, PL4, PL5, PL6, PL7, PL8, PL9, PL10}
4. <PW> ::= {PW1, PW2, PW3, PW4, PW5, PW6, PW7, PW8, PW9, PW10}

The input variables are translated into symbols by finding the max and min of each variable and then dividing that range into equal compartments. For instance, the sepal length variable has a min and max value of 2 and 4.4, respectively. Dividing this range into ten equal bins results in symbol SL1 with range 2 to 2.24, symbol SL2 with range 2.25 to 2.48, etc.

Thus, each input variable is translated into a symbolic representation and input to Gram-ART for clustering. Combined with a fixed seed, the grammar is able to encode a fixed-length symbolic representation of the input data.
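A short sketch of this translation step is given below; the bin boundaries, the feature ranges used for the other three variables, and the clipping of the maximum value into the last bin are illustrative assumptions rather than the exact preprocessing used in the experiments.

# Map one continuous Iris sample to the fixed-length symbolic form used above.
def to_symbols(row, mins, maxs, prefixes=("SL", "SW", "PL", "PW"), n_bins=10):
    """Return a list of terminals such as ['SL2', 'SW5', 'PL1', 'PW1']."""
    symbols = []
    for value, lo, hi, prefix in zip(row, mins, maxs, prefixes):
        width = (hi - lo) / n_bins
        bin_index = min(int((value - lo) / width), n_bins - 1)  # keep the max in the last bin
        symbols.append(f"{prefix}{bin_index + 1}")
    return symbols


# Example using the range quoted in the text for the first feature (2 to 4.4):
# a value of 2.3 falls in the second bin and becomes "SL2".
print(to_symbols([2.3, 3.0, 1.4, 0.2],
                 mins=[2.0, 2.0, 1.0, 0.1],
                 maxs=[4.4, 4.4, 6.9, 2.5]))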


Setting the vigilance at 0.8, the clustering results are shown in Figures 10 and 11.



Figure 10. Gram-ART Iris data performance.

Figure 11. Gram-ART Iris data clustering profile.

Figure 12. Iris data cluster confusion.


By examining the confusion matrices of the two methods, shown in Figure 12, we can see that the number of confused inputs is reduced for Gram-ART, suggesting a greater ability to distinguish between the natural clusters in the data.

VII. GRAMMATICAL EVOLUTION AND GRAM-ART AUTOMATIC FUNCTION DEFINITION RESULTS


In a Grammatical Evolution architecture, the Gram-ART unit serves the purpose of dynamic function definition, providing a library of generalized functions. If an individual has a non-zero differential fitness between itself and the higher fitness of its two parents, a search is initiated to find a sub-tree in the individual that differs from those of its parents. When found, this sub-tree is passed as input to the Gram-ART method, where it is matched to a category and modifies a template. The templates are then extracted from Gram-ART and added to the grammar as high-level functions that are available for new individuals to utilize. In this way, useful sub-trees are removed from the evolutionary process, and the genetic operations are then focused on increasingly high-level modifications to the programs. This type of mechanism maintains population diversity and is able to counteract the bloat of individuals that causes fitness stagnation [9].


The even-parity problem stack is a standard GP function approximation benchmark and a good demonstration problem for function reuse, as 2-bit even parity is a sub-problem of 4-bit even parity, which is in turn a sub-problem of 6-bit even parity, and so on [6]. Curriculum control is utilized to increase the size of the even-parity problem incrementally from 2 bits to 10 bits as the evolutionary optimizer reaches either convergence or 100% accuracy on all test cases.
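For reference, the target of this benchmark is simple to state in code: the n-bit even-parity function returns true exactly when an even number of the input bits are set, and a candidate program is scored against all 2^n input cases. The sketch below is illustrative only.

# The even-parity target function and its full test-case set.
from itertools import product


def even_parity(bits):
    return sum(bits) % 2 == 0


def all_cases(n_bits):
    """Every test case for the n-bit problem, e.g. 4 cases for 2-bit parity."""
    return [(bits, even_parity(bits)) for bits in product((0, 1), repeat=n_bits)]


print(all_cases(2))
# [((0, 0), True), ((0, 1), False), ((1, 0), False), ((1, 1), True)]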


Figure 13 shows the average early evolutionary profiles of the architecture on the 10-bit even-parity problem. The baseline used neither curriculum control nor dynamic function definition; the iterated method used only curriculum control to step through even-parity problem sizes from 2 to 9 before being exposed to the 10-bit even-parity problem shown here. The iterated dynamic method utilized both curriculum control and the Gram-ART unit for dynamic function definition. We can see that the baseline method makes very little early progress, while both iterated methods start at a much higher fitness. Additionally, the iterated dynamic method is able to take large optimization steps due to its ability to construct individuals from existing generalized partial solutions.






Figure 13. Evolutionary profile for Baseline, Iterated, and Iterated with Dynamic (I.D.) function definition on the 10-bit even-parity problem.

Figure 14. Comparative computational expense for Baseline and Iterated training.

Figure 15. Example function utilization after the iterated curriculum with dynamic function definition.


Figure 14 compares the computational expense of the baseline method to an iterated method over the same number of generations (and thus individual evaluations) each. As the iterated method starts with simple problems that are faster to evaluate, it has a much lower computational expense than the baseline method, which must evaluate the largest and most computationally expensive problem instance for each individual.

Figure 15 shows the normalized distribution of function utilization by the fittest individuals after an iterated learning curriculum with dynamic function definition under the even-parity curriculum. We can see that early dynamically generated functions are highly utilized.


For a vigilance value of 0.7, the number of clusters generated using Gram-ART for automatic function definition is orders of magnitude smaller than the number of functions generated without clustering. Fewer than a hundred functions were generated using clustering, compared to tens of thousands of functions generated without clustering. In the genetic process, this increases the probability that any given high-value function will be selected for utilization in an individual.

VIII. CONCLUSION

Gram-ART, a new Adaptive Resonance Theory variant, has been developed with many valuable properties, including the ability to cluster symbolic information: not only data but the structure of data relative to a grammar. Additionally, the Gram-ART method is able to develop non-geometrically constrained cluster shapes, which provides an increased ability to distinguish between non-linearly separable data.


IX. ACKNOWLEDGMENTS

The authors would like to acknowledge John Vian and Emad W. Saad of the Boeing Phantom Works Intelligent Adaptive Systems Group for technical discussions and aerospace application concepts. Additional special thanks to Rui Xu and Ryanne Dolan for review and editing assistance.

X. REFERENCES

[1] S. Grossberg and G. A. Carpenter, "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[2] G. A. Carpenter and S. Grossberg, "Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system," Neural Networks, vol. 4, pp. 759-771, 1991.
[3] J. R. Williamson, "Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps," Neural Networks, vol. 9, pp. 881-897, 1996.
[4] M. J. Healy, R. D. Olinger, R. J. Young, T. P. Caudell, and K. W. Larson, "Modification of the ART1 architecture based on category theoretic design principles," Neural Networks, vol. 1, pp. 457-462, 2005.
[5] W. B. Langdon and R. Poli, Foundations of Genetic Programming. New York: Springer-Verlag, 2002.
[6] J. R. Koza, "Hierarchical genetic algorithms operating on populations of computer programs," in International Joint Conference on Artificial Intelligence, 1989, pp. 768-774.
[7] J. R. Koza, "The genetic programming paradigm: Genetically breeding populations of computer programs to solve problems," in Dynamic, Genetic and Chaotic Programming. John Wiley, 1992, pp. 201-321.
[8] J. P. Rosca, "Genetic programming exploratory power and the discovery of functions," in Conference on Evolutionary Programming, 1995, pp. 719-736.
[9] J. R. Koza, "Hierarchical automatic function definition in genetic programming," in Foundations of Genetic Algorithms 2. Morgan Kaufmann, 1992, pp. 297-318.
[10] J. R. Koza, "Simultaneous discovery of detectors and a way of using the detectors via genetic programming," in International Conference on Neural Networks, 1993.
[11] J. P. Rosca, "Hierarchical learning with procedural abstraction mechanisms," 1997.
[12] P. J. Angeline, "Evolutionary algorithms and emergent intelligence," Doctoral Thesis, Columbus, OH: Ohio State University, 1993.
[13] M. O'Neill and C. Ryan, "Grammatical evolution," IEEE Transactions on Evolutionary Computation, vol. 5, pp. 349-358, 2001.
[14] G. A. Carpenter, S. Grossberg, and D. B. Rosen, "ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition," Neural Networks, pp. 493-504.
[15] R. Xu and D. Wunsch, II, "Survey of clustering algorithms," IEEE Transactions on Neural Networks, vol. 16, pp. 645-678, 2005.
[16] J. D. Gibbons, Nonparametric Statistics: An Introduction. Sage Publications, 1993.
[17] G. A. Carpenter and N. Markuzon, "ARTMAP-IC and medical diagnosis: Instance counting and inconsistent cases," Neural Networks, vol. 11, pp. 323-336, 1998.
[18] G. C. Anagnostopoulos and M. Georgiopoulos, "Ellipsoidal ART and ARTMAP for incremental clustering and classification," in Int. Joint Conference on Neural Networks, 2001, pp. 1221-1226.
[19] A. Asuncion and D. J. Newman, "Fisher's Iris Dataset," UCI Machine Learning Repository, 2007.
[20] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, pp. 179-188, 1936.