UNIVERSITY OF CALIFORNIA,
IRVINE
Modular Neural Network Architecture for Detection of Operational Problems on
Urban Arterials
DISSERTATION
Submitted in partial satisfaction of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in Civil Engineering
by
Sarosh Islam Khan
Dissertation Committee:
Professor Stephen Ritchie
1996
© 2002 Sarosh Khan
The dissertation of Sarosh Islam Khan
is approved and is acceptable in quality
and form for publication on microfilm:
___________________________
___________________________
___________________________
Committee Chair
University of California, Irvine
1996
MODULAR NEURAL NETWORK ARCHITECTURE FOR DETECTION OF
OPERATIONAL PROBLEMS ON URBAN ARTERIALS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. Introduction
1.1 Problem Definition
1.2 Research Approach
1.3 Organization of Dissertation
2. Incident Detection Approaches
2.1 Introduction
2.2 Basic Approaches to Incident Detection
2.2.1 Pattern Recognition-Based Algorithms
2.2.2 Time-Series Methods
2.2.3 Bayesian Approach
2.2.4 Catastrophe Theory-Based Algorithm
2.3 Surface Street Incident Detection
2.3.1 Time-Series Based Algorithm
2.3.2 Knowledge-Based System Using Video Image Processing
2.3.3 Decision Tree-Based Approach
2.3.4 Discriminant Analysis-Based Approach
3. Pattern Recognition and Neural Networks
3.1 Introduction
3.2 Statistical Approaches to Pattern Recognition
3.2.1 Bayesian Classifier
3.2.2 Discriminant Functions
3.3 Artificial Neural Networks
3.3.1 Learning Schemes
3.3.2 Unsupervised Learning
3.3.3 Competitive Learning
3.4 Artificial Neural Networks for Pattern Recognition
3.5 Artificial Neural Network Architectures
3.5.1 Multilayer Feed-Forward Neural Network
3.5.2 Projection Neural Network
3.5.3 Modularity of Neural Network Models
4. Data Collection
4.1 Introduction
4.2 Signalized Street Networks Selected as Study Areas
4.2.1 Traffic Control System
4.2.2 Description of Study Networks
4.3 Field Data Collected
4.4 Microscopic Simulation
4.4.1 Limitations of NETSIM, Version 4.2
4.4.2 NETSIM Enhancements
4.4.3 NETSIM Representation of Study Networks and Its Calibration
4.4.4 Calibration of NETSIM
4.5 Simulated Data Collected
4.5.1 Simulated Data Set
5. Model Development
5.1 Introduction
5.2 Selection of Features
5.2.1 Input Features
5.2.2 Output Features
5.3 Performance Measures
5.3.1 Root Mean Square (RMS) Error
5.3.2 Detection Rate (DR)
5.3.3 False Alarm Rate (FAR)
5.3.4 Average Time To Detection (TTD)
5.3.5 Classification Rate (CR)
5.3.6 Statistical Techniques Used to Evaluate the Performance of Models Developed
5.4 Neural Network Model Developed
5.4.1 Feasibility Study
5.5 Different Input Features
5.5.1 Parameter Selection for Multilayer Feedforward (MLF) Neural Network
5.5.2 Number of Hidden Layer Processing Elements
5.5.3 Optimum Learning and Momentum Coefficients
5.5.4 Generalization
5.6 Output Features
5.7 Modular Network Developed
5.8 Statistical Classifiers Developed
5.8.1 Discriminant Analysis
6. Results and Comparative Evaluation
6.1 Introduction
6.2 Neural Network Classifier
6.2.1 Multilayer Feedforward Neural Network (MLF)
6.2.2 Projection Network
6.2.3 Modular Neural Network
6.3 Neural Network and Statistical Classifiers
6.4 Effect of Flow Conditions and Network Geometry on Performance
7. Conclusions and Recommendations
7.1 Conclusions
7.2 Recommendations
8. References
LIST OF FIGURES
Figure 3-1. Pattern Classifier
Figure 3-2. A Processing Element
Figure 3-3. Multilayer Feedforward Neural Network
Figure 3-4. Hidden and Output Layer Processing Elements
Figure 3-5. Projection transformation and the formation of boundary surfaces
Figure 3-6. (a) MLF neural network (b) a modular equivalent
Figure 4-1. Los Angeles Network
Figure 4-2. Anaheim Network
Figure 4-3. NETSIM representation of the Anaheim Study Network
Figure 5-1. Detector (i) Configuration #1 and (ii) Configuration #2
Figure 5-2. Single Neural Network Model to Detect Different Types of Operational Problems
Figure 5-3. Modular Architecture of Neural Network Models to Detect ...
Figure 6-1. Input Feature Selection Using Simulated Data
Figure 6-2. Single MLF Network to Detect Different Types of Problems
Figure 6-3. Training of MLF and Projection Network
Figure 6-4. Single and Modular Neural Network Model
Figure 6-5. Performance of Neural Network and Statistical Models
Figure 6-6. Performance of the Modular Neural Network Model Based on ...
LIST OF TABLES
Table 4-1. Data Collected from the Anaheim Network
Table 4-2. Data Collected from the LA Network
Table 5-3. Input Features
Table 6-1. DR and TTD Performance Measures
1. Introduction
1.1 Problem Definition
In recent years, transportation research has revealed that problems of widespread
congestion cannot be solved by building more roads or by expanding existing
infrastructure. A significant part of the solution lies in better management of traffic. One
of the principal thrusts of the new national program on Intelligent Transportation Systems
(ITS) is Advanced Transportation Management Systems (ATMS).
To facilitate better management, recent research has focused on continuous monitoring of
traffic to ascertain the 'normal' level of congestion and to provide an understanding of
how it forms and spreads. Techniques for rapidly detecting incidents have become a vital
link in the management of traffic. As pointed out by Ritchie (1990), a major concern in
ATMS is providing decision support to effectively detect, verify and develop response
strategies for incidents that disrupt the flow of traffic. A key element of providing such
support is automating the process of detecting operational problems on large area
networks. Successful detection of operational problems in their early stages is vital for
formulating response strategies such as modifying surface street signal timing plans and
activating or updating traveler information systems, including changeable message signs,
in-vehicle navigation systems and highway advisory radio, among others. It is also
needed as a basis for alerting police, emergency vehicles and tow services. Reliable
surface street incident detection is necessary for the development of integrated freeway
arterial control systems, and will permit improved coordination of freeway ramp meters
and surface street signal timing. Therefore, developing a capability for automating
incident detection on arterial streets will aid in accomplishing a true integration of
freeway and arterial networks.
Since the early 1970s, research has been conducted to develop incident detection
algorithms for freeways to aid traffic engineers. Only recently, since the mid-1980s, has
any research focused on similar efforts for surface streets. As pointed out by Han and
May (1988), little work has been done on the arterial side because of characteristic
differences between freeways and arterials. Therefore a significant challenge lies in
formulating an incident detection methodology for arterial streets.
Differences between freeways and surface streets that impact incident detection include
the following:
• multiple access: freeways have directed access points through entry and exit ramps,
but surface streets have multiple access points through left, right and through
movements of traffic, giving drivers multiple choices
• geometric constraints: surface streets have geometric constraints such as
channelization for separation of traffic movements, and conflicting movements;
freeways have fewer of these features
• control measures: entry ramps on freeways control the rate of traffic entering the
main line flow, whereas intersection control on surface street networks controls phase
sequencing through either fixed time control or traffic actuated control with variable
splits, depending on the local demand within the constraints of cycle lengths, and
minimum and maximum phase lengths.
• operating conditions: surface streets usually operate at lower speeds compared to
freeways, which allows vehicles on surface streets to change lanes more readily,
thereby making the problem of distinguishing between incident and non-incident
patterns more difficult as vehicles are able to maneuver around incident locations
more easily.
• detector configuration: freeways usually have more uniform spacing of detector
stations (e.g. one-half or one-third of a mile), but on surface streets the surveillance
detector location varies based on the length of the links
Because of the characteristic differences between surface streets and freeways, the
effects, types and nature of 'incidents' differ. Very limited work has been done in
developing an algorithm to detect incidents on signalized surface street networks. Of the
few attempts (Bell and Thancanamootoo, 1988; Han and May, 1989; Chen and Chang,
1993), none have been extensively tested or implemented for a city’s street network.
Therefore, there is a need for the development of an incident detection system for
signalized street networks that will automate the process for a traffic management center.
This research proposes to develop such a new approach to detecting incidents or traffic
operational problems on surface streets.
1.2 Research Approach
The objective of developing an algorithm to automate the process of detecting
operational problems or traffic management problems is to provide traffic management
centers, overseeing the operations and control of street networks, with a decision tool.
This tool, embedded in a traffic management system, would be part of a four-step
incident management process: incident detection, incident confirmation, incident
response and recovery monitoring (Ritchie, 1990). In the case of surface street networks,
any operational problem that requires the attention of an operator in a traffic management
center, and results in the operator formulating a response, may be defined as an incident.
Therefore, the role of a detection algorithm will be to detect the following types of
incidents that are relevant to the operation and control of surface street networks:
• Reduced capacity: an accident, a stalled vehicle, illegal street parking, a lane
closure, or a blockage within a link or within an intersection
• Excess demand: queues due to a special event that do not clear over several cycles;
left-turn pockets that overflow over several cycles; inadequate capacity due to
insufficient effective green time available to a particular phase
• Detector malfunction: system detectors or traffic control detectors
Neural classifiers, as ensembles of a large number of collectively interacting elements,
are capable of storing representations of concepts and information as collective states.
Different aspects of a pattern recognition problem can therefore be expressed in a
distributed manner over a set of interconnection weights.
This research proposed the use of artificial neural networks in a modular architecture to
detect the different types of operational problems listed above. The types of problem that
can be detected depend on factors such as range of operating conditions, configuration of
system detectors within the network, and block or link length. Neural network models
were developed to demonstrate the feasibility of training and testing different
architectures of neural network models as components of a modular architecture, with
appropriate architecture for each subproblem of pattern recognition. It was hypothesized
that the performance of such a modular architecture would exceed that of any single
architecture applied to the detection of the different types of operational problems. A
comparative analysis was carried out to study the performance of each type of model
considered. Also included was a study of the effect flow levels and detector
configurations have on the performance of the incident detection model. The results
show that, with a suitable architecture, the modular neural network classifiers developed
from loop detector data outperform statistical techniques such as discriminant analysis.
This is demonstrated by testing the detection of operational problems on street networks
in the cities of Los Angeles and Anaheim, California, using cyclic data collected from a
microscopic simulation and from the Urban Traffic Control System (UTCS)
implemented in the field.
In this research we propose using a multiplicity of networks to take advantage of a
modular architecture. As a result, a different network learns a different region of the
input domain or class pattern by decomposing the problem at hand and splitting its input
domain. A modular architecture was used to develop a hierarchy of neural nets to detect
different types of problems under different operating conditions. From the tests
performed for arterial incident detection with various architectures, it was clear that
the detection rate and the false alarm rate rose together: an attempt to increase the
detection rate increased the false alarm rate as well, so an improvement in one measure
came at the expense of the other. This has also
been found in incident detection for freeways (Ritchie and Cheu, 1990). Therefore, an
attempt was made to train two separate neural networks, one to optimize the detection
rate and the other to optimize the false alarm rate. It was also shown that, using two
differently trained neural networks, the false alarm rates could be lowered by the dual
system of networks compared to a single network.
Incident detection systems, both for freeways and arterials, will operate in a traffic
management center controlling and monitoring large-area networks. It is therefore of
utmost importance to keep the false alarm rate extremely low, so that the models remain
credible when implemented in an actual traffic management center. One of the main
concerns of traffic engineers seeking an incident
detection algorithm is not only a high detection rate, but perhaps of equal or greater
importance is a low false alarm rate. As has been shown in the freeway case, even
moderate false alarm rates can result in traffic engineers in a TMC ignoring real alarms.
The dual neural network, composed of single networks, was trained separately to
optimize the performance of detection rate and false alarm rate, and was jointly used to
reduce the false alarm rates to an acceptable level for use by a traffic management center
for large surface street networks.
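The dual-network decision logic described above can be sketched as follows. This is a minimal illustration rather than the implementation developed in this research: the fusion rule (requiring both networks to agree before raising an alarm) and the thresholds are assumptions made for the sketch.

```python
def dual_network_alarm(p_detect, p_false, t_detect=0.5, t_false=0.5):
    """Combine the outputs of two separately trained incident classifiers.

    p_detect -- output (0..1) of the network trained to maximize detection rate
    p_false  -- output (0..1) of the network trained to minimize false alarms
    t_detect, t_false -- decision thresholds (illustrative values)
    """
    # Requiring agreement suppresses alarms raised by only one network,
    # trading a small loss in detection rate for fewer false alarms.
    return p_detect >= t_detect and p_false >= t_false
```

In a traffic management center such a rule would be applied each signal cycle to the cyclic detector features presented to both networks.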
The overall objective of this research was to:
• develop a methodology to automate the process of detecting operational
problems on surface street networks
• extend the application of artificial neural networks to incident detection
• detect different types of problems
• test the robustness of the model developed by testing on different types of
surface street networks and different detector configurations
1.3 Organization of Dissertation
The research effort is described in the following chapters:
Chapter 1 presents the problem addressed in this Dissertation, the approach proposed and
the objectives of this research.
Chapter 2 presents a review of basic techniques applied to the problem of detecting
non-recurring congestion or incidents on freeways, and also discusses the limited work
done to develop a methodology for signalized surface street networks.
Chapter 3 frames the problem addressed in this research as a pattern recognition
problem; presents the main approaches to solving such problems, namely statistical and
neural classifiers, along with their strengths and weaknesses; and explains why neural
network classifiers were proposed for this problem. Finally, a neural network
architecture is proposed as the basis of a comprehensive system to detect traffic
operational problems on signalized arterials.
Chapter 4 describes the field and simulated data collected to develop and test the
performance measures of the models developed for this research. The microsimulator
used, its calibration, the city networks represented, and the experiments designed are
presented in this chapter.
Chapter 5 presents the model development, input feature selection, model structure,
training and testing of the proposed model for different types of traffic operational
problems under different operating conditions, and detector configurations.
Chapter 6 evaluates the results of the different neural classifiers, a statistical classifier,
and the modular architecture of neural classifiers.
Chapter 7 discusses the findings of this research and future directions for research in
this area.
2. Incident Detection Approaches
2.1 Introduction
Detecting incidents on either a freeway section or a surface street is a pattern recognition,
or more specifically a classification, problem. Algorithms have been developed for
incident detection using various techniques. They can be classified as pattern
recognition or pattern matching techniques (comparative algorithms), statistically based
algorithms, and traffic flow modeling-based approaches.
There are basically two approaches: one relies on the notion of similarity, the other on
estimating beliefs or probabilities. Attempts to classify incident and non-incident data
in which the emphasis was on determining 'similarities' include decision tree
techniques, and time-series or filtering techniques. These techniques ultimately rely
on means to determine how close the traffic parameters are to some 'normal' values or
predicted values using calibrated thresholds, determined differently by different
techniques.
2.2 Basic Approaches to Incident Detection
2.2.1 Pattern Recognition-Based Algorithms
Pattern matching algorithms based on decision trees for freeway incident detection were
developed by Payne and Tignor (1978), and were later developed by others (ref) as a
series of algorithms. They use decision trees to detect incidents from traffic
parameters. These algorithms, better known as the California Algorithms, are based on
the pattern traffic exhibits when an incident occurs: congestion builds upstream of the
incident, causing an increase in occupancy upstream and a decrease in occupancy
downstream. But this difference can also be caused by a bottleneck, so the algorithm is
also used to distinguish a bottleneck from an incident.
Occ_u(t) − Occ_d(t) ≥ K_1        Eq. 2-1

[Occ_u(t) − Occ_d(t)] / Occ_u(t) ≥ K_2        Eq. 2-2

[Occ_d(t−2) − Occ_d(t)] / Occ_d(t−2) ≥ K_3        Eq. 2-3

where,
Occ_u(t) = upstream occupancy at time t (%)
Occ_d(t) = downstream occupancy at time t (%)
K_1, K_2, K_3 = calibrated thresholds
The first two tests (Eq. 2-1, Eq. 2-2) compare the absolute and relative differences in
occupancy against thresholds. The third test (Eq. 2-3) determines whether the difference
is due to a bottleneck or recurring congestion. Various versions of the algorithm have
been developed from this basic form; currently, version 8 is used on freeways in Los
Angeles with 30-second occupancy values.
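As a sketch, the three occupancy tests can be expressed directly in code. The threshold values below are placeholders for illustration, not the calibrated constants used in any deployed version of the algorithm.

```python
def california_algorithm(occ_u, occ_d, occ_d_prev2, k1=8.0, k2=0.35, k3=0.25):
    """Flag a suspected incident from detector occupancies (%).

    occ_u        -- current upstream occupancy
    occ_d        -- current downstream occupancy
    occ_d_prev2  -- downstream occupancy two intervals earlier
    k1, k2, k3   -- calibrated thresholds (placeholder values here)
    """
    # Test 1: absolute spatial difference in occupancy
    test1 = (occ_u - occ_d) >= k1
    # Test 2: spatial difference relative to upstream occupancy
    test2 = occ_u > 0 and (occ_u - occ_d) / occ_u >= k2
    # Test 3: relative temporal drop in downstream occupancy,
    # used to separate an incident from a bottleneck
    test3 = occ_d_prev2 > 0 and (occ_d_prev2 - occ_d) / occ_d_prev2 >= k3
    return test1 and test2 and test3
```

All three tests must pass before an incident is declared, which is what limits false alarms from ordinary spatial occupancy differences.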
2.2.2 Time-Series Methods
Algorithms were also developed based on statistical forecasting of traffic behavior by
time series algorithms (Cook and Cleveland 1974, Dudek and Messer, 1974; Ahmed and
Cook, 1982). These time-series methods provide a means of forecasting short-term
traffic behavior; a significant deviation between observed and forecast values of a
traffic parameter signals an incident.
2.2.3 Bayesian Approach
Levin and Krause (1978) used a Bayesian approach to classify incident and
non-incident data. Their algorithm uses the ratio of the difference between upstream and
downstream one-minute occupancies, together with historical incident data. It is based
on the ratio of the conditional distributions of the traffic features under incident and
incident-free conditions, and on the prior probability of an incident occurring at a
particular location and time period. The algorithm performs better than the California
Algorithm but has a high mean time to detect.
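A minimal sketch of such a likelihood-ratio classifier is shown below. The Gaussian densities and the prior used in the example are illustrative assumptions; Levin and Krause estimated the distributions from historical incident data.

```python
from math import exp, sqrt, pi

def incident_posterior(x, lik_incident, lik_normal, prior):
    """Posterior probability of an incident given traffic feature x.

    lik_incident / lik_normal -- conditional densities of x under incident
    and incident-free conditions; prior -- P(incident) for the location
    and time period, estimated from historical data.
    """
    num = lik_incident(x) * prior
    den = num + lik_normal(x) * (1.0 - prior)
    return num / den

def gaussian(mean, sd):
    """Illustrative Gaussian density; the real conditional distributions
    would be estimated from historical incident records."""
    return lambda x: exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

# x: relative upstream-downstream occupancy difference (assumed feature)
p = incident_posterior(0.7, gaussian(0.7, 0.2), gaussian(0.1, 0.2), prior=0.05)
```

Because the prior P(incident) is small, a large likelihood ratio is needed before the posterior favors an incident, which is one reason this approach trades a longer mean time to detect for fewer false alarms.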
2.2.4 Catastrophe Theory-Based Algorithm
The McMaster Algorithm (Persaud and Hall, 1989) applies catastrophe theory to a
two-dimensional analysis of traffic flow and occupancy data, separating the regions
that correspond to different states of traffic. When specific transitions between traffic
states are observed over a period of time, an incident is declared.
2.3 Surface Street Incident Detection
Few attempts at surface street incident detection have been reported in the literature.
One is based on decision trees (Han and May, 1989); another applies a time-series
technique to data from an urban traffic control system and to simulated data for an
isolated intersection (Bell and Thancanamootoo, 1988); a knowledge-based approach
uses video image processing data (Sellam, Boulmakoul, and Pierrelee, 1991); and a
discriminant analysis method uses traffic parameters from loop data and detector
configuration data (Chen and Chang, 1993). A description of each of these attempts is
presented next.
2.3.1 Time-Series Based Algorithm
A time-series approach was used by Bell and Thancanamootoo (1988) to perform
incident detection. An incident was defined as an unexpected, non-recurrent,
longer-term loss of capacity at a critical location. The key variables (determined by the
detector type) at each detector site were collected on a cyclic basis. While traffic
conditions remained normal, the mean and variance of the traffic parameters were
updated each cycle by exponential smoothing according to Eq. 2-4 and Eq. 2-5.
Abnormal conditions were identified when observed values fell outside upper and
lower bounds computed from Eq. 2-6, Eq. 2-7, and Eq. 2-8, where the bounds were
established in terms of the smoothed mean and variance (Eq. 2-9 and Eq. 2-10).
F̂(t) = 0.8 F̂(t−1) + 0.2 F(t)        Eq. 2-4

Ô(t) = 0.8 Ô(t−1) + 0.2 O(t)        Eq. 2-5

F(t) < F̂(t) − D_1 σ̂_F(t)        Eq. 2-6

O(t) > Ô(t) + D_2 σ̂_O(t)        Eq. 2-7

O(t) < Ô(t) − D_3 σ̂_O(t)        Eq. 2-8

σ̂_F²(t) = 0.1 [F(t−1) − F̂(t−1)]² + 0.9 σ̂_F²(t−1)        Eq. 2-9

σ̂_O²(t) = 0.1 [O(t−1) − Ô(t−1)]² + 0.9 σ̂_O²(t−1)        Eq. 2-10

where F(t) and O(t) are the observed cyclic flow and occupancy, F̂(t) and Ô(t) are the
estimated flow and occupancy, σ̂_F(t) and σ̂_O(t) are the estimated standard deviations
of flow and occupancy, and D_1, D_2, D_3 are threshold multipliers. The thresholds
respond to changes in the level of traffic due to the exponential smoothing; the bounds
were established based on the smoothed mean and variance.
Whenever an incident was suspected, the mean and variance were frozen so that the
incident would not contaminate their computation. If an incident was not confirmed
within 5 consecutive cycles of freezing, the values were reset to the most current
observations and smoothing recommenced.
When the upper bound of occupancy was exceeded, an incident was suspected upstream,
and the downstream occupancy was checked. If that too was found to exceed the lower
bound then an incident was confirmed. On the other hand, if the lower bound was
exceeded, then an upstream detector was checked for upper bound infringement to
confirm an incident.
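The smoothing, bound-checking and freezing steps above can be sketched for a single detector as follows. The threshold multiplier d and the initial variance are assumptions made for the sketch, and the adjacent-detector confirmation step is omitted.

```python
from math import sqrt

class SmoothedOccupancyDetector:
    """Sketch of the cyclic exponential-smoothing scheme for one detector.

    The smoothing constants (0.8/0.2 for the mean, 0.9/0.1 for the
    variance) follow the equations above; d and the starting variance
    are illustrative assumptions.
    """

    def __init__(self, occ0, d=3.0):
        self.mean = occ0          # smoothed occupancy estimate
        self.var = 1.0            # smoothed variance estimate (assumed start)
        self.d = d                # threshold multiplier (assumed)
        self.frozen_cycles = 0    # cycles since statistics were frozen

    def update(self, occ):
        band = self.d * sqrt(self.var)
        suspect = occ > self.mean + band or occ < self.mean - band
        if suspect and self.frozen_cycles < 5:
            # freeze the statistics so a suspected incident
            # does not contaminate the mean and variance
            self.frozen_cycles += 1
        else:
            # normal cycle, or 5 unconfirmed cycles: resume smoothing
            self.frozen_cycles = 0
            self.var = 0.1 * (occ - self.mean) ** 2 + 0.9 * self.var
            self.mean = 0.8 * self.mean + 0.2 * occ
        return suspect
```

In the full scheme, an alarm raised by one detector would then be checked against the adjacent detector before the incident is confirmed.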
SCOOT data for an isolated T-intersection were collected from the Traffic Management
Division of TRRL in London. The data covered an incident in which a vehicle was parked
close to the stop line for an hour. A second data set, also from a T-intersection and
covering both directions of traffic, was collected over 2 hours from a Traffic Control
System Unit using the SCOOT system; in this case the incidents were right-turning vehicles
blocking the traffic. Both incidents were detected using the algorithm developed. In
addition, SCOOT data were collected for Middlesbrough over a 2-hour period covering the
evening peak on each of 2 days, and the incident detection algorithm produced no false
alarms for this data set.
This approach was used by researchers in the DRIVE project MONICA (Monitoring
Incidents and Congestion Automatically). Bretherton and Bowen (1991) report their
work with the algorithm developed by Bell and Thancanamootoo, which they extended to an
arterial using detector data from adjacent intersections. Data were collected using a
system developed for the London UTCS to receive, process and store traffic information
produced by SCOOT. Data were collected over a three-hour period for two
consecutive links, where for an hour there was a lane-blocking incident. The paper
reports that field testing was to take place, but no follow-up literature reporting
results was available.
2.3.2 Knowledge-Based System Using Video Image Processing
Sellam, Boulmakoul, and Pierrelee (1991) developed a knowledge-based system using
video image processing for incident detection at signalized intersections. Their paper
presents the general architecture of a system developed as part of the DRIVE project
INVAID. It consists of three units: an Image Processing Unit (IPU), which outputs a
binary image of the junctions or streets; a Measurements Processing Unit (MPU), which
processes the binary image data to compute traffic indicators; and a Diagnosis
Processing Unit (DPU), which diagnoses a problem and, if necessary, makes requests to the
MPU. The incident detection is based on the binary output of the digitized image: each
black pixel on the source image represents a "moving vehicle" and a white pixel
corresponds to the background. A user-specified parameter determines the time interval
until which a stopping vehicle would still be considered a moving vehicle and after which
it would be considered a parked vehicle. This results in a spatial detection algorithm as
opposed to a vehicle-detection algorithm.
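The stopped-vehicle rule described above can be sketched as a per-pixel timer over the stack of binary images; the array layout and the threshold name here are illustrative, not taken from the INVAID system.

```python
import numpy as np

def classify_parked(frames, park_after):
    """frames: (T, H, W) boolean stack of binary images (True = vehicle pixel).

    Returns a boolean (H, W) mask of pixels that stayed occupied for more
    than `park_after` consecutive frames, i.e. the "parked vehicle" pixels.
    """
    run = np.zeros(frames.shape[1:], dtype=int)     # consecutive-occupancy count
    parked = np.zeros(frames.shape[1:], dtype=bool)
    for frame in frames:
        run = np.where(frame, run + 1, 0)           # reset counter when pixel clears
        parked |= run > park_after                  # exceeded the user threshold
    return parked
```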
2.3.3 Decision Tree-Based Approach
Shortly after Thancanamootoo's work was reported, Han and May (1988, 1989) reported
their attempt to develop an incident detection algorithm for surface streets. They selected
a downtown area in Los Angeles under ATSAC (Automated Traffic Surveillance and
Control) with signalized surface streets of short blocks (400-500 feet) and detectors in
all lanes. The detector data collected were smoothed.
The algorithm first uses the smoothed data to detect abnormal detector data patterns,
making comparisons against the 1st and 99th percentile statistical ranges of historical
flow, occupancy and speed data.
After a check for detector malfunction, the algorithm proceeds to determine whether either
an Impending Saturation Occupancy or an Impending Congestion Occupancy is
exceeded. The flows are also checked against Medium and Low Flow thresholds.
Thresholds of 300 and 500 for flow, and 30% and 40% for occupancy, are used. These
values were determined for the test in an attempt to minimize the false alarm rate, and the
thresholds are therefore time and detector dependent. One-minute and three-minute
smoothed data were used to determine the thresholds. Traffic conditions on a street were
classified into one of six states, depending on the occupancy and volume. Based also on
the condition of adjacent lanes and downstream streets, a classification as a lane blockage,
approach blockage or arterial blockage was made.
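The six-state definitions themselves are not reproduced here, so the following sketch only illustrates the kind of threshold binning just described, using the stated flow (300/500) and occupancy (30%/40%) thresholds; the band names are hypothetical.

```python
# Illustrative thresholds taken from the text; in the actual algorithm these
# are time and detector dependent.
FLOW_LOW, FLOW_MEDIUM = 300, 500
OCC_SATURATION, OCC_CONGESTION = 30.0, 40.0

def classify_state(flow, occupancy):
    """Bin a smoothed (flow, occupancy) sample by the stated thresholds."""
    if flow < FLOW_LOW:
        flow_band = "low"
    elif flow < FLOW_MEDIUM:
        flow_band = "medium"
    else:
        flow_band = "high"
    if occupancy >= OCC_CONGESTION:
        occ_band = "impending congestion"
    elif occupancy >= OCC_SATURATION:
        occ_band = "impending saturation"
    else:
        occ_band = "normal"
    return flow_band, occ_band
```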
The algorithm was implemented as a system in C. It was tested offline for a section of a
street in Los Angeles (Washington Blvd.) near the Coliseum; this testbed had 20
detectors. In the early stages of development, runs were made to determine the
thresholds using one-minute and three-minute smoothed data. In 1989, Han and May
reported the development of TOPDOG, developed in Turbo Prolog and based on the
algorithm described, and reported using 50 minutes of data from Venice Blvd., Los
Angeles. The system is still in the initial stages of testing as a demonstration prototype.
Further work is required to test whether a global set of thresholds may be determined
for a detector configuration, as opposed to using time and detector dependent thresholds.
2.3.4 Discriminant Analysis-Based Approach
Discriminant analysis has been proposed for incident detection on surface streets by Chen
and Chang (1993). Their paper briefly presents an incident detection algorithm for
surface streets as part of a 3-module system comprising a dynamic traffic flow prediction
model, an incident detection model, and an incident monitoring module, and presents the
overall architecture of an incident detection system. According to the architecture
presented, the flow model captures the dynamics of the traffic and thus predicts the flow
conditions; this prediction is compared to the real-time condition and forms the basis of
the detection module. The paper presents the detection portion of this system:
For a one-lane blockage:

d_1 = -106.7 + 3.54\,ASPDUS + 7.22\,ASPDDS + 0.66\,AIFLDS + 2.21\,AOCUS
      - 53.70\left(\frac{DFSPD}{DTSPACE}\right) + 1.69\,DFOC + 0.88\,TRUCK
      + 1.73\,BLRATO + 1.06\,DSP_{12} + 2.26\,DSP_{23}

and for a two-lane blockage:

d_2 = -131.26 + 3.39\,ASPDUS + 6.80\,ASPDDS + 0.10\,AIFLDS + 2.17\,AOCUS
      - 156.88\left(\frac{DFSPD}{DTSPACE}\right) + 1.49\,DFOC + 0.05\,TRUCK
      + 1.58\,BLRATO + 1.89\,DSP_{12} + 2.04\,DSP_{23}

where,
ASPDUS, ASPDDS = average upstream and downstream speed
AIFLDS = average downstream flow
AOCUS = average upstream occupancy
DFSPD = upstream and downstream speed difference
DFOC = upstream and downstream occupancy difference
DSP12 = speed difference of lanes 1 and 2
DSP23 = speed difference of lanes 2 and 3
BLRATO = fraction of the blocked area of a link
TRUCK = composition of heavy vehicles
DTSPACE = detector spacing
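Evaluating the two discriminant functions can be sketched as below. The coefficient-to-variable pairing follows a reconstruction of the garbled equations and should be verified against Chen and Chang (1993), and the larger-score classification rule is an assumption, not stated in the paper.

```python
# Reconstructed coefficients for the one-lane (D1) and two-lane (D2)
# blockage discriminants; RATIO stands for the DFSPD/DTSPACE term.
D1 = dict(const=-106.7, ASPDUS=3.54, ASPDDS=7.22, AIFLDS=0.66, AOCUS=2.21,
          RATIO=-53.70, DFOC=1.69, TRUCK=0.88, BLRATO=1.73,
          DSP12=1.06, DSP23=2.26)
D2 = dict(const=-131.26, ASPDUS=3.39, ASPDDS=6.80, AIFLDS=0.10, AOCUS=2.17,
          RATIO=-156.88, DFOC=1.49, TRUCK=0.05, BLRATO=1.58,
          DSP12=1.89, DSP23=2.04)

def score(coeffs, f):
    """f: feature dict including DFSPD and DTSPACE (RATIO = DFSPD/DTSPACE)."""
    s = coeffs["const"] + coeffs["RATIO"] * f["DFSPD"] / f["DTSPACE"]
    for name in ("ASPDUS", "ASPDDS", "AIFLDS", "AOCUS", "DFOC",
                 "TRUCK", "BLRATO", "DSP12", "DSP23"):
        s += coeffs[name] * f[name]
    return s

def classify_blockage(features):
    """Return 1 or 2 for a one- or two-lane blockage (larger score wins)."""
    return 1 if score(D1, features) >= score(D2, features) else 2
```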
This method is based on a set of multivariate discriminant functions. The paper
reports a misclassification rate of 13.22% and a variance of 0.59 for one- and two-lane
blockages classified in an experimental design of a 3-lane arterial
segment. The paper does not present the configuration of the network simulated or the
details of the design of the experiments conducted, such as the location of the blockages
and whether the blockages were partial or complete lane closures. NETSIM, a microscopic
simulation model, was used for the experiments, although only one-lane blockages (a
partial blockage as a stalled vehicle) can be simulated in it. The use of the variable
'fraction of the blocked area of a link' as reported in the paper suggests use of the
'lane closure' simulation feature of NETSIM, which requires specifying the percent of
time the lane closure is simulated, rather than lane blockages. Also, no information was
available on how the detector data were obtained, as the current version of NETSIM does
not simulate surveillance detectors.
All of the incident detection methods described here are in their early stages of
development, unlike the freeway incident detection approaches. It may be noted that none
of them (except the Chen and Chang paper) report any performance measures for the
algorithms presented, and the discriminant analysis-based algorithm reports only a
misclassification rate, with no false alarm rates or times to detection.
3. Pattern Recognition and Neural Networks
3.1 Introduction
The task of determining a procedure that uses currently available information on an object
or an event to assign it to one of a pre-specified set of categories or classes is termed
pattern classification or pattern recognition. A body of work has developed out of the
extensive study of pattern recognition problems, which has led to mathematical models for
designing classifiers, as shown in Figure 3-1. These models use a set of features of
an object or an event and describe a relationship between these features (inputs) and its
class pattern.
[Figure 3-1. Pattern Classifier (Duda and Hart, 1973): input features x_1, ..., x_n feed
discriminant functions g_1(x), ..., g_n(x), and a maximum selector produces the class
output y(x).]
In the literature there are a few types of pattern recognition techniques, mostly based on
statistical, machine learning, or neural network-based approaches. In this research,
statistical and neural network approaches were considered and evaluated, and the
discussion will be limited to these two.
3.2 Statistical Approaches to Pattern Recognition
In this approach, the problem of pattern recognition is considered a problem of estimating
density functions in a high dimensional space and dividing the highdimensional space
into the regions of patterns or classes. The input features are considered realizations of
random vectors, where the conditional density functions depend on the class pattern, and
the density function may be known or assumed. The performance of a classifier can be
analyzed for a given distribution of input vectors. In designing a classifier, the
conditional density form may be known or assumed and the functional form of the
classifier or discriminant can also be assumed to be linear, quadratic or piecewise linear.
The best classifier for a given distribution is based on Bayes decision theory and
minimizes the probability of classification error; it is also considered the optimal
classifier. When the density function is assumed, the parameters of the function need to
be estimated using parametric techniques, and when the density function is not known
nonparametric techniques are used. For problems where the data do not fit the common
density functions, nonparametric techniques are applied. However, nonparametric
techniques are normally used for offline analysis because of the limitations in
performance, storage requirement, speed and complexity of the algorithms. Therefore,
this research, where a methodology is required to perform online, realtime analysis,
statistical nonparametric methods are not discussed further.
In general, it is also known that it is easier to design a classifier for an input feature
vector of lower dimensionality. Therefore, techniques are also used in pattern
recognition that reduce the dimensionality of a given input vector to perform feature
extraction, which then forms the basis of a linear classifier.
3.2.1 Bayesian Classifier
One of the fundamental approaches to pattern recognition is based on Bayesian decision
theory, and is expressed in terms of probability structures. When the distributions of the
input feature random vectors are given, it can be shown that the Bayesian classifier is the
best classifier which minimizes the probability of classification error (Duda and Hart,
1973).
3.2.2 Discriminant Functions
Even when the probability distribution of the input feature vectors is given,
implementation of the optimal Bayes classifier is often difficult when the dimensionality
of the input feature vector is high. In such cases another statistical classifier,
discriminant analysis, is often used, where the mathematical forms of the discriminant
functions are assumed (linear or quadratic classifiers).
Classification into groups or pattern classes is based on differences in the characteristics
of the features of objects that come from the different classes. A good classification rule
for discriminating between the classes minimizes the misclassification rate while
accounting for the prior probability of occurrence of each class and the cost of
misclassification. Discriminant functions are therefore designed to take these factors into
consideration.
Discriminant function (DF) procedures are based on normal populations. Let f_1(x) and
f_2(x) be multivariate normal densities with mean vectors and covariance matrices
(\mu_1, \Sigma_1) and (\mu_2, \Sigma_2), respectively. When \Sigma_1 = \Sigma_2 = \Sigma,
the classification rule that minimizes the expected cost of misclassification is,
For class 1:

(\mu_1 - \mu_2)' \Sigma^{-1} x - \frac{1}{2} (\mu_1 - \mu_2)' \Sigma^{-1} (\mu_1 + \mu_2) \ge \ln\left(\frac{p_2}{p_1}\right)    Eq. 3-1

and for class 2 otherwise, where p_1 and p_2 are the prior probabilities of belonging to
class 1 and class 2, respectively.
The classification rule in this case is linear. When \Sigma_1 \ne \Sigma_2, that is, when
the covariance structures differ, the discriminant function becomes,
For class 1:

-\frac{1}{2} x' (\Sigma_1^{-1} - \Sigma_2^{-1}) x + (\mu_1' \Sigma_1^{-1} - \mu_2' \Sigma_2^{-1}) x - k \ge \ln\left(\frac{p_2}{p_1}\right)    Eq. 3-2

and for class 2 otherwise, where,

k = \frac{1}{2} \ln\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) + \frac{1}{2} (\mu_1' \Sigma_1^{-1} \mu_1 - \mu_2' \Sigma_2^{-1} \mu_2)
In this case, the classification regions are defined as quadratic functions of x. Therefore,
in using discriminant functions as classifiers, assumptions of normality are made for the
multivariate density functions, and linear and quadratic classifiers arise based on the
structure of the covariances. If the data are not multivariate normal, they may be
transformed to variables that are more nearly normal before applying the linear or
quadratic discriminant classifier. Alternatively, when appropriate transformations cannot
be found, the linear or quadratic classification rule may still be applied and its
performance checked (Johnson and Wichern, 1992) to determine whether linear or quadratic
decision surfaces can perform the classification reasonably well for the particular
pattern recognition problem.
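The equal-covariance linear classification rule described above can be sketched with known parameters as follows; all parameter values supplied by the caller are illustrative.

```python
import numpy as np

def linear_rule(x, mu1, mu2, sigma, p1, p2):
    """Assign x to class 1 when the linear score meets ln(p2/p1)."""
    diff = mu1 - mu2
    a = np.linalg.solve(sigma, diff)          # Sigma^{-1} (mu1 - mu2)
    score = a @ x - 0.5 * a @ (mu1 + mu2)     # linear score minus midpoint term
    return 1 if score >= np.log(p2 / p1) else 2
```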
3.2.2.1 Fisher Discriminant Function
In this linear discrimination rule, the assumption of normality for the multivariate data is
not made; however, the population covariance matrices are implicitly assumed to be
equal. The rule maximizes the difference between the classes by maximizing the ratio of
the squared distance between the sample means to the sample variance. The separation
distance is expressed not in terms of the original input variables x, but in terms of a set
of transformed variables y, obtained by taking the linear combination of the x's that
maximizes the separation distance in terms of y. The Fisher discriminant function
becomes,
For class 1 if:

y = (\bar{x}_1 - \bar{x}_2)' \Sigma_{pooled}^{-1} x \ge \frac{1}{2} (\bar{x}_1 - \bar{x}_2)' \Sigma_{pooled}^{-1} (\bar{x}_1 + \bar{x}_2)    Eq. 3-3

and to class 2 otherwise.
For the Fisher discriminant function, the two classes are assumed to have a common
covariance matrix. From the discriminant developed, the maximum relative separation
distance can also be calculated, and the significance of the difference in means can be
tested. A significant difference in means for the classes does not imply that the
classifier developed will classify well. If good separation between the means does not
exist, the classification need not be performed; but if the difference in means is
significant, other methods of testing the validity of the classification procedure are
used.
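A minimal sketch of the Fisher rule estimates the sample means and pooled covariance from training data and classifies by the midpoint between the projected means; the training samples used in practice would of course come from the application.

```python
import numpy as np

def fisher_rule(X1, X2):
    """Build a two-class Fisher classifier from training samples X1, X2."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled sample covariance (common-covariance assumption).
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, m1 - m2)       # S_pooled^{-1} (xbar1 - xbar2)
    midpoint = 0.5 * w @ (m1 + m2)        # halfway between projected means
    return lambda x: 1 if w @ x >= midpoint else 2
```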
3.3 Artificial Neural Networks
Classifiers based on Bayes decision theory can be shown to be optimal classifiers. But
for Bayesian classifiers, the conditional probability density function for each class must
be known. Even when the densities are known, they may be difficult to implement. A
classifier is designed by assuming the mathematical form of the classifier (linear,
quadratic or piecewise linear) and the parameters have to be estimated. But the
performance of the classifiers depends on certain conditions of the density functions and
their covariance structures. It may be mentioned that the Bayesian linear classifier is
optimal only when the distribution is normal and the covariances are equal. When the
assumption of equality of the covariances is inappropriate, the Bayesian classifier is not
optimum. For unequal covariances or nonnormal distributions, quadratic discriminants
or other types of classifiers may be developed. But as pointed out, often the robustness
of linear classification is preferred to the performance of the more complex classifiers.
Therefore, for most pattern recognition problems, linear classifiers are initially designed,
and then the performance evaluated.
In recent years, artificial neural networks have been applied to a variety of pattern
recognition problems and have emerged as an approach that outperforms some of the
statistically based techniques described in the previous section. As pointed out in the
previous section, the performance of statistical classifiers depends on how well the
assumptions about the density functions fit the data they describe, and also on the
appropriateness of the functional form of the discriminant function. These factors are
application-specific: when the restrictive assumptions are violated for a particular
application, the classifier is no longer best or optimum. Neural network-based classifiers
are a set of classifiers that address these types of problems.
The artificial neural network architectures that were initially developed and successfully
applied were based on the original perceptron (Rosenblatt, 1958). They are parallel
distributed processing information structures that combine computational and knowledge
representation methods. An artificial neural network consists of many processing
elements (PEs) that are massively connected with each other. The processing elements
can be arranged in layers, with an external layer receiving input from data sources, which
is passed on through interconnections to other processing elements and on to an output
layer of processing elements. Since these structures are distributed in nature, they are
robust and efficient, and they have the capability to capture highly nonlinear
mappings between inputs and outputs.
Each processing element receives inputs from external data sources or other processing
elements and passes them through a summation function and a transfer or activation
function, as shown in Figure 3-2. The method of obtaining the weights that perform the
desired mapping is termed learning. The various types of training methods are described
in the next section.
[Figure 3-2. A Processing Element: inputs x_0, ..., x_n, weighted by the vector w, are
summed to give v and passed through an activation function φ to produce the output y.]
3.3.1 Learning Schemes
3.3.1.1 Supervised Learning
Supervised learning occurs when the network parameters are adapted under the combined
influence of the error signal and the input vector. This adjustment is often made by an
iterative procedure until the output produced follows the target signal closely in some
statistical terms according to either a least mean square or steepest descent algorithm
using the instantaneous estimate of the gradient of the error surface.
3.3.1.2 Unsupervised Learning
Unsupervised learning, in contrast, is not controlled by a teacher or an external target
signal, and the performance of the network is not judged by the ability to follow a target
signal. Learning is carried out not in terms of a cost function expressed through an error
signal, but rather based on a target-independent measure. Network parameters are adjusted
based on an independent measure of performance which captures statistical regularities of
the input vector and forms classes of patterns automatically.
3.3.1.3 Competitive Learning
Competitive learning occurs when the processing units compete amongst themselves to
respond to a stimulus or input. In Hebbian learning the units produce outputs
simultaneously, but in competitive learning only one unit responds at a time; the scheme is
therefore often referred to as the winner-take-all method. Within a collection of
processing units, certain units specialize on certain groups or classes of patterns and
therefore work well to detect certain features of an input set.
Because the weights are normalized in this learning scheme, the weight vectors, like the
properly scaled input vectors, lie on an N-dimensional unit hypersphere, and learning
moves the weight vector toward the center of gravity of the cluster it discovers.
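A winner-take-all update of the kind described above can be sketched as follows; the learning rate is an illustrative choice.

```python
import numpy as np

def competitive_step(W, x, lr=0.1):
    """One winner-take-all update.

    W: (units, dims) weight matrix with unit-normalized rows; x: input vector.
    The unit with the largest inner product wins and is moved toward the
    input, then renormalized back onto the unit hypersphere.
    """
    x = x / np.linalg.norm(x)
    winner = int(np.argmax(W @ x))            # only one unit responds
    W[winner] += lr * (x - W[winner])         # drift toward the cluster center
    W[winner] /= np.linalg.norm(W[winner])    # keep the weights normalized
    return winner
```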
3.4 Artificial Neural Networks for Pattern Recognition
Artificial neural networks applied to pattern recognition problems have clearly emerged
as a major contender to statistically based approaches. The applicability of one
approach over the other is, of course, application-specific. Statistical work in
pattern recognition is based on discriminant analysis or on a class of models that attempt
to estimate the joint distribution of the features within each class; these approaches are
characterized by an explicit underlying probability model of membership in each class.
The main advantages of applying neural network techniques are as follows:
• Only weak assumptions need to be made about the data set, as opposed to
statistically based approaches, where more restrictive assumptions about the
underlying distributions are made to obtain the optimal classifiers.
• Classifiers based on Bayes decision theory can be shown to be optimal, but they
require the conditional probability density function for each class to be known,
and even known densities may be difficult to implement. The performance of such
classifiers depends on conditions of the density functions and their covariance
structures; for example, the Bayesian linear classifier is optimal only when the
distribution is normal and the covariances are equal, so when the equal-covariance
assumption is inappropriate for an application, the classifier is not optimum.
• Even though there are nonparametric statistical techniques that can be applied
to pattern recognition to avoid making assumptions about the distribution of
input features, these techniques are difficult to apply to online applications,
as they are computationally intensive and require extensive amounts of storage
space.
• Nonparametric techniques also suffer from the 'curse of dimensionality':
although with enough samples convergence to an arbitrarily complicated unknown
density function is assured, the number of samples required may be very large,
and the demand for samples grows exponentially with the dimensionality of the
input feature space. This limitation severely restricts the practical
application of nonparametric techniques. Work with MLF networks (Baron, 1991,
1992) shows that the rate of convergence expressed as a function of the training
set size N is of the order (1/N)^{1/2} (times a logarithmic factor), which makes
the application of neural network techniques feasible for problems that require
online implementation.
• Since the relationships in an ANN are expressed as a set of interconnections,
ANNs are distributed in nature and thereby more robust and computationally
efficient.
3.5 Artificial Neural Network Architectures
3.5.1 Multilayer Feedforward Neural Network
The multilayer feedforward neural network consists of an input layer, one or more
nonlinear hidden layers, and an output layer. It employs an error-correcting
backpropagation algorithm for training. Backpropagation has two distinct
phases, a forward pass and a backward pass. In the forward pass the input is propagated
through the layers of processing units, and in the backward pass the errors computed are
propagated backwards to adjust the parameters so as to minimize the mean square error
(Rumelhart and McClelland, 1986).
The model of each processing element includes a nonlinear transfer function in the
hidden layers and the output layer. The hidden layers progressively extract more and more
features from the input data. The layers are interconnected through a set of weights, and
due to this distributed form of nonlinear processing, these structures are able to produce
highly nonlinear mappings between inputs and outputs. The weights in these networks are
adjusted according to the well-known backpropagation algorithm, which minimizes the mean
square error between the outputs produced by the network and a set of desired outputs.
The error is computed as,

e_j(n) = d_j(n) - y_j(n)    for output unit j    Eq. 3-4

The sum of the squared errors is,

E(n) = \frac{1}{2} \sum_j e_j^2(n)    over all output units j    Eq. 3-5

and the mean error over N training patterns is,

E_{mean} = \frac{1}{N} \sum_{n=1}^{N} E(n)    Eq. 3-6
The objective is to adjust the parameters or weights. The errors computed for every
pattern are summed over the entire set of patterns to produce an estimate of the
overall error, which is used as a measure to adjust the weights. Considering a typical
processing unit as shown in Figure 3-3, the net input produced is,

v_j(n) = \sum_{i=0}^{p} w_{ji}(n)\, y_i(n)    for unit j over all p inputs    Eq. 3-7
According to the backpropagation algorithm, the weight adjustment \Delta w_{ji} is
proportional to the instantaneous gradient,

\frac{\partial \varepsilon(n)}{\partial w_{ji}(n)} = \frac{\partial \varepsilon(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}    Eq. 3-8
where,

\frac{\partial \varepsilon(n)}{\partial e_j(n)} = e_j(n), \quad \frac{\partial e_j(n)}{\partial y_j(n)} = -1, \quad \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'_j(v_j(n)), \quad \frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)    Eq. 3-9
[Figure 3-3. Multilayer Feedforward Neural Network, forward propagation and
backpropagation phases (Haykin, 1992): inputs x_0, x_1, ..., x_n propagate forward
through weight layers (W_1, θ_1), (W_2, θ_2), (W_3, θ_3) and activation functions φ to
outputs y_1, ..., y_m; local gradients δ_1, ..., δ_m propagate backwards through W_3^T
and W_2^T using the derivatives φ'.]
Therefore,

\Delta w_{ji}(n) = -\eta \frac{\partial \varepsilon(n)}{\partial w_{ji}(n)} = \eta\, \delta_j(n)\, y_i(n)    Eq. 3-10

where the local gradient is,

\delta_j(n) = -\frac{\partial \varepsilon(n)}{\partial v_j(n)} = e_j(n)\, \varphi'_j(v_j(n))    Eq. 3-11
[Figure 3-4. Hidden and Output Layer Processing Elements: the hidden-unit output
y_j(n) = φ(v_j(n)), with v_j(n) formed from the weighted inputs w_{ji}(n) y_i(n), feeds an
output unit k through w_{kj}(n); the output y_k(n) = φ(v_k(n)) is compared with the
desired response d_k(n) to form the error e_k(n).]
For hidden layer units:

\delta_j(n) = -\frac{\partial \varepsilon(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial \varepsilon(n)}{\partial y_j(n)}\, \varphi'_j(v_j(n))    for hidden unit j    Eq. 3-12

\frac{\partial \varepsilon(n)}{\partial y_j(n)} = \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)}    over all output units k    Eq. 3-13

where,

\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi'_k(v_k(n)), \quad \frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)

and,

\frac{\partial \varepsilon(n)}{\partial y_j(n)} = -\sum_k e_k(n)\, \varphi'_k(v_k(n))\, w_{kj}(n) = -\sum_k \delta_k(n)\, w_{kj}(n)

so that,

\delta_j(n) = \varphi'_j(v_j(n)) \sum_k \delta_k(n)\, w_{kj}(n)    for hidden unit j
Therefore the learning rule may be summarized as:

(Weight correction \Delta w_{ji}(n)) = (Learning parameter \eta) \cdot (Local gradient \delta_j(n)) \cdot (Input of unit y_i(n))    Eq. 3-14
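One backpropagation step for a single-hidden-layer network, following the error, local-gradient, and weight-correction expressions above with a sigmoid activation, can be sketched as follows; the layer sizes, learning rate, and online single-pattern update are illustrative choices (bias terms are omitted for brevity).

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def backprop_step(x, d, W1, W2, eta=0.5):
    """Update W1 (hidden) and W2 (output) in place for one pattern x, target d."""
    # Forward pass: net input v = W y, output y = phi(v).
    v1 = W1 @ x
    y1 = sigmoid(v1)
    v2 = W2 @ y1
    y2 = sigmoid(v2)
    e = d - y2                                   # output error e = d - y
    # Local gradients: output units e*phi'(v); hidden units back-propagate
    # the weighted output gradients through phi'(v). phi'(v) = y(1-y) for
    # the sigmoid.
    delta2 = e * y2 * (1 - y2)
    delta1 = (W2.T @ delta2) * y1 * (1 - y1)
    # Weight correction = learning parameter * local gradient * unit input.
    W2 += eta * np.outer(delta2, y1)
    W1 += eta * np.outer(delta1, x)
    return 0.5 * float(e @ e)                    # instantaneous squared error
```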
3.5.2 Projection Neural Network
Multilayer feedforward (MLF) networks are capable of mapping any continuous
bounded function of their inputs to their outputs by dividing the input space into regions
using hyperplanes. The locations of the hyperplanes are determined by the weights and
thresholds of the hidden layer nodes. Nonlinear combinations of these hyperplanes can
bound regions by hyperplanes or curved surfaces, either open or closed, but developing
such complex boundaries requires more regions, and therefore a large number of hidden
layer nodes and large networks. On the other hand, there is a different set of neural
network architectures that train very fast but do not attempt to minimize error. Networks
such as Kohonen networks, Radial Basis Functions, and a few others attempt to place
prototypes within closed decision boundaries around training data points and adjust their
parameters; these networks place hyperspheres around prototypes and adjust their radii.
A third set of neural networks has evolved that combines the ability to form closed
boundaries with error minimization. An example of this type of network is the projection
network, which can form both closed and open decision regions; training will cause closed
boundaries to open if required, and vice versa. Details of this network are presented in
the next section. The advantage of this type of network lies in its ability to initialize
rapidly to a good starting point, which substantially speeds up training.
The projection network is based on the concept of projecting the inputs to one higher
dimension to form a hypersphere, on which the weight vectors also lie. With a hidden
layer, such networks are capable of forming either an open or a closed region within the
original input space. The idea of projecting inputs to a higher dimension and using the
inner product of the input and the weights to determine the closeness of the two vectors
has been used by other network architectures such as Radial Basis Functions (RBF). In the
case of RBF, however, the framework of clustering was used to form closed prototypes,
whereas in the projection network these boundaries are formed within the original
framework of backpropagation (discussed in the MLF section).
As mentioned earlier, the idea is to project the input vector from N dimensions onto a
higher dimension (N+1) by transforming the input vector x to x', subject to |x'| = R. An
example of such a transformation is,

x' = R \left( \frac{h}{\sqrt{h^2 + |x|^2}}, \frac{x}{\sqrt{h^2 + |x|^2}} \right)    Eq. 3-15
These projections then serve as inputs to an MLF network. Here h is the distance along the
extra dimension between the origin of the original input space and the (N+1)-dimensional
projection sphere. One component of the projected vector lies along the extra dimension
(N+1) and the others along the original dimensions; therefore the weight that connects the
(N+1) component to a hidden unit also lies along the extra dimension. The weight vectors
are likewise constrained to |w'| = R. Figure 3-5 shows the projection of a 2-dimensional
space onto a 3-dimensional space. The vector x lies in the original 2-dimensional space;
the 3-dimensional vector x' lies along the line that connects the center of the sphere to
the point x, extended to intersect the surface of the sphere. As x' lies in the
(N+1)-dimensional space, the weight vector w' that connects the modified input x' to a
hidden layer node is also 3-dimensional. The net input to the hidden layer node is,
v = w' \cdot x' - w_0, \quad w_0 = \text{constant}    Eq. 3-16
As described for the MLF network, the output produced from the input vector x' and weights
w' with bias or threshold w_0 is passed through a nonlinear function, often sigmoidal. The
threshold determines the location of the decision boundary formed and is proportional to
the distance from the origin to the decision surface or hyperplane.
[Figure 3-5. Projection transformation and the formation of boundary surfaces (Wilensky
and Manukian, 1992): an input x in the N-dimensional input space is projected as x' onto a
projection sphere of radius R.]
In (N+1) dimensions, each hidden layer node still draws a hyperplanar decision
boundary, which intersects the hypersphere in a circle whose position is determined by the
threshold. Therefore, the projection of the intersection surface back onto the original
N-dimensional space is a function of the threshold. As shown in Figure 3-5, if the
threshold is large, the resulting intersection circle is small and lies to one side of the
original 2-dimensional plane, so its projection back onto the plane is a closed boundary.
If the threshold is small, the intersection circle approaches a great circle of the sphere,
and its projection back onto the 2-dimensional plane is a curve, an open boundary, or a
line. It is this ability of the projection network to form hyperplanar or hyperspherical
prototypes that allows the Logicon projection network to be initialized rapidly to a good
starting point. It is during learning that closed boundaries can become open boundaries
and vice versa, as the weights and thresholds are adjusted.
The training of the projection network weights is based on backpropagation, which minimizes the error by changing the weight vector in the direction of maximum error decrease. In the case of the projection network, however, the weight vector must be moved in the direction of maximum error decrease while being constrained to remain tangent to the hypersphere surface, in order to keep the weight vectors on the hypersphere. The change in weights is expressed as:

δw′ = (w′/R) × (α∇e × (w′/R))

where e is the error between the desired output and the output produced by the network, ∇e is the error gradient with respect to the weights, and α is the gain. The weights need to be normalized to have magnitude R to prevent the vectors from moving away from the hypersphere.
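A minimal sketch of this constrained update, assuming the error gradient is first projected onto the tangent plane of the hypersphere and the stepped weights are then renormalized to magnitude R (function and variable names are illustrative):

```python
import numpy as np

def projected_weight_update(w, grad_e, alpha, R):
    # Keep only the component of the error gradient tangent to the
    # hypersphere, step against it, then renormalize back to magnitude R.
    w_hat = w / R                                    # unit vector along w'
    tangent = grad_e - np.dot(grad_e, w_hat) * w_hat # tangential component
    w_new = w - alpha * tangent                      # gradient-descent step
    return R * w_new / np.linalg.norm(w_new)         # stay on the hypersphere
```

The final renormalization implements the constraint stated above: without it, repeated tangential steps would slowly drift off the sphere.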
1.1.2 Modularity of Neural Network Models
A modular architecture allows decomposition and assignment of tasks to several modules. Separate architectures can therefore be developed, each solving a subtask with the best possible architecture, and the individual modules or building blocks may be combined to form a comprehensive system. The modules decompose the problem into two or more subsystems that operate on the inputs without communicating with each other. The module outputs are mediated by an integrating unit that is not permitted to feed information back to the modules (Jacobs, Jordan, Nowlan, and Hinton, 1991). The modular architecture combines two learning schemes, supervised and competitive. The supervised learning scheme is used to train the different modules of the networks, while a gating network operates in a competitive mode to assign different patterns of the task to a module through a mechanism that acts as a 'mediator'.
Neural networks are commonly designed at the level of processing elements or units, which represent the finest level. Layers of processing elements are at a coarser level, and adding networks adds an even coarser level to the classification. However, there may be significant practical and theoretical advantages to be gained by considering modularity at the network level. The advantages of modularizing are described in the following section.
1.1.2.1 Why Modularize?
Proper Assignment of Tasks or Functions
Networks composing the modular architecture compete with each other to learn the training patterns. As a result, they learn different functions or tasks by partitioning the function into independent tasks and allocating a distinct network to learn each task. In addition, the architecture allows the allocation of a topologically appropriate network to each task. Often there is a natural way to decompose a complex task into a set of simple tasks. For example, the nonlinear function y = |x| can be approximated either with a neural network using one layer of hidden units, or by assigning two networks, one for each linear piece (x ≥ 0 and x < 0), each with a single linear unit without any hidden units, together with an appropriate switching or gating mechanism. This example shows that if the data support a discontinuity in the function being described, then it may be more effective to fit separate models on both sides of the discontinuity. Similarly, if a function is simple in some region, then a global model could be fitted in that region rather than approximating the function locally with a large number of local models. By proper assignment of tasks, the network structure could be simplified while performing the same set of tasks. Modular structures therefore perform local generalization by learning the patterns of a particular region, which ensures that the performance of a single module does not affect the other modules of the structure. In this example, the task decomposition simplified the structure by removing the need for hidden layers, thereby reducing computational cost.
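The y = |x| decomposition above can be sketched as two single-unit linear 'experts' plus a hard gate on the sign of x. This is a toy illustration of the decomposition idea, not the trained gating network described later in this section.

```python
def abs_by_experts(x):
    # Expert 1 models y = x on the region x >= 0.
    expert_pos = lambda t: t
    # Expert 2 models y = -x on the region x < 0.
    expert_neg = lambda t: -t
    # A hard switch on sign(x) stands in for a learned gating mechanism.
    return expert_pos(x) if x >= 0 else expert_neg(x)
```

Each expert is linear and needs no hidden units; only the switching mechanism carries the nonlinearity.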
Speed of Learning
As the example demonstrates, a modular network training the two simpler networks would train faster in the absence of hidden layer processing elements. Modularization is able to take advantage of function decomposition and can also reduce the conflicting information that tends to retard learning. In the literature this is termed 'crosstalk', and it may be spatial or temporal in nature (Jacobs, Jordan and Barto, 1986). In the MLF, crosstalk may occur when backpropagation is applied to two or more outputs, as shown in Figure 3-6.
[Figure: two networks, each with hidden units h1, h2, h3 and output units o1, o2.]
Figure 3-6. (a) MLF neural network (b) a modular equivalent
If the hidden unit h1 in Figure 3-6(a) has positive weights to output units o1 and o2, and the first output o1 is 'too large' while the second output o2 is 'too small', then the backpropagation derivative information for o1 will specify that the hidden layer output should be smaller, while for the second output o2 it will suggest that the hidden layer output should be larger. This conflict in derivative information is referred to as spatial crosstalk. Modular architectures, as noted by Plaut and Hinton (1987), are immune to spatial crosstalk.
Besides spatial crosstalk, there may be temporal crosstalk. For example, if a network is initially trained to learn one pattern, its hidden units become useful in performing that function; but when another pattern is trained, the performance on the first pattern may deteriorate. With backpropagation training this is often overcome by adding more hidden layer units, but hidden units are added at a computational cost in speed and complexity. Use of modular networks may eliminate this need.
Representation
Modular structures can also provide a method of representation that is natural or easy to
understand. The modules can be viewed as 'building blocks' for more comprehensive and
complex tasks, where the idea is to 'divide and conquer'. This philosophy has long been
used in computer science, and in numerical methods such as finite element analysis. The
modular structure can provide a means of decomposition at a broad level which
suppresses the activation of a large number of processing elements, while activating a
smaller number of processing elements in a single module.
This structure also allows domain-specific knowledge to be incorporated. For example, Jacobs, Jordan and Barto (1990) in their work decompose the 'what' and 'where' vision tasks to two different multilayer feedforward neural networks. Knowledge can also be incorporated into the design of a structure by deciding how to divide the input information between the gating networks and the modules, since there may be a natural context of division. Another way to incorporate domain-specific knowledge is in the design when part of the functional properties may be known. For example, a linear or nonlinear portion may be identified, and therefore different networks may be designed to possess different topologies, weights, activation functions, error functions, input variables, etc.
[Figure: N modules receiving input vector X, a gating network producing g1, g2, …, gn, and a winner-take-all combination producing output vector Y.]
Figure 3-7. Modular Architecture
1.1.2.1.1 Algorithms
The problem of decomposing training cases into a set of subtasks was addressed by Hampshire and Waibel (1989) for the case when these subtasks can naturally be identified beforehand. But Jacobs et al. (1991) first presented the idea of a system that learns to allocate subtasks to different networks/experts or modules. The idea in this method is that a gating mechanism encourages one module or network to be assigned to a subtask, and that these modules perform their tasks locally, decoupled from one another. Therefore no crosstalk or interference between the weights occurs, as shown in Figure 3-6(b). Hampshire and Waibel (1989) in their work presented a method where the modules were not decoupled, since the final output was a linear combination of the outputs produced by the individual modules. The interaction between modules caused some of the modules to be used for a subtask. Jacobs and Jordan (1991) therefore proposed another error function to encourage competition between modules using the gating network. The gating network makes a stochastic decision about which single expert to use on each occasion. Each module works as a feedforward neural network, and all modules receive the same input and have the same number of outputs. The gating network is also a feedforward network and typically receives the same input as the other modules (Figure 3-7).
y = Σᵢ₌₁ⁿ gᵢ yᵢ    Eq. 3-16
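A minimal sketch of Eq. 3-16, with a softmax used so that the gating outputs gᵢ are positive and sum to one. The softmax choice and all names here are assumptions for illustration, not necessarily the gating network's exact form.

```python
import numpy as np

def mixture_output(x, experts, gate_w):
    # Gating network: a linear layer followed by softmax, producing g_i.
    s = gate_w @ x
    g = np.exp(s - s.max())
    g = g / g.sum()
    # System output: y = sum_i g_i * y_i  (Eq. 3-16)
    ys = np.stack([expert(x) for expert in experts])
    return g @ ys
```

With sharply peaked gating outputs this reduces to selecting a single expert, which is the winner-take-all behavior the training procedure below encourages.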
During training, the weights of all the networks are modified simultaneously, as in the multilayer feedforward network. The training of the modules and the gating network is based on minimizing their respective error functions. The error function for the modules is,

J_y = ½ (y* − y)ᵀ (y* − y)

where y* is the desired output and y is the output of the system.
The error function of the gating mechanism is more elaborate. For each training pattern, one module comes closer to producing the desired output than the others. If, for a given training pattern, the system's performance improves significantly over its past performance, the weights of the gating network are adjusted to move the output of the winner toward 1 and the outputs of the losers toward 0. If the system's performance does not improve, the gating weights are adjusted to move all of the outputs toward some neutral value. The measure that determines whether the performance of the model has improved is,

J̄_y(t) = α J_y(t) + (1 − α) J̄_y(t − 1)

where 0 < α < 1 determines how rapidly the past values are forgotten by exponentially weighting the average of J_y.
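This exponentially weighted average can be sketched as (names assumed):

```python
def running_error(J_bar_prev, J_t, alpha):
    # J_bar(t) = alpha * J_y(t) + (1 - alpha) * J_bar(t - 1), with 0 < alpha < 1.
    # Larger alpha forgets the past more quickly.
    return alpha * J_t + (1 - alpha) * J_bar_prev
```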
The binary variables λ_WTA and λ_NT indicate whether the performance has improved. That is,

If J_y(t) < γ J̄_y(t − 1)  then  λ_WTA = 1 and λ_NT = 0;  else  λ_WTA = 0 and λ_NT = 1

where γ is a multiplicative factor that determines how much less the current error must be than the measure of the past.
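The improvement test above can be sketched as a hypothetical helper:

```python
def performance_flags(J_t, J_bar_prev, gamma):
    # lambda_WTA = 1 when the current error beats gamma times the
    # exponentially weighted measure of the past error; else lambda_NT = 1.
    improved = J_t < gamma * J_bar_prev
    lam_wta = 1 if improved else 0
    lam_nt = 1 - lam_wta
    return lam_wta, lam_nt
```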
If the architecture's performance improves significantly, the module whose output is closest to the desired output is determined. The error for module i is defined as the sum of squared errors between the module's output yᵢ and the desired output y*. The error is then,

J_yᵢ = ½ (y* − yᵢ)ᵀ (y* − yᵢ)
The module that wins is the module with the smallest error. If network or module i wins, then the desired value of the ith output of the gating network, gᵢ*, is set to 1, and otherwise it is set to 0. If the system's performance does not improve significantly, that is, when λ_NT = 1, the weights are adjusted so that the outputs of the gating network move toward a neutral value 1/n, where n is the number of modules. The gating network's error function is defined as,
J_G = (λ_WTA/2) Σᵢ₌₁ⁿ (gᵢ* − gᵢ)² + (λ_WTA/2) (1 − Σᵢ₌₁ⁿ gᵢ)² + (λ_WTA/2) Σᵢ₌₁ⁿ gᵢ(1 − gᵢ) + (λ_NT/2) Σᵢ₌₁ⁿ (1/n − gᵢ)²
where the first three terms contribute to the error when the performance of the system improves, and the fourth term contributes when the performance has not improved significantly. The first term is the sum of the squared errors between the desired outputs and the actual outputs of the gating network; the second term takes its smallest value when the outputs of the gating network sum to one; the third term takes its smallest value when the outputs of the gating network are binary valued; and the fourth term is the sum of the squared errors between the neutral value and the actual outputs of the gating network, which applies when the performance of the system does not improve.
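A sketch of this four-term gating error (the exact grouping of terms in the source is garbled, so this follows the verbal description of the four terms; all names are illustrative):

```python
import numpy as np

def gating_error(g, g_star, lam_wta, lam_nt):
    n = len(g)
    term1 = 0.5 * np.sum((g_star - g) ** 2)    # match winner-take-all targets
    term2 = 0.5 * (1.0 - g.sum()) ** 2         # outputs should sum to one
    term3 = 0.5 * np.sum(g * (1.0 - g))        # smallest when outputs are binary
    term4 = 0.5 * np.sum((1.0 / n - g) ** 2)   # pull toward neutral value 1/n
    return lam_wta * (term1 + term2 + term3) + lam_nt * term4
```

When λ_NT = 1 and every gating output already equals the neutral value 1/n, the error vanishes, matching the behavior described above.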