UNIVERSITY OF CALIFORNIA,

IRVINE

Modular Neural Network Architecture for Detection of Operational Problems on

Urban Arterials

DISSERTATION

Submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in Civil Engineering

by

Sarosh Islam Khan

Dissertation Committee:

Professor Stephen Ritchie

1996

© 2002 Sarosh Khan

The dissertation of Sarosh Islam Khan

is approved and is acceptable in quality

and form for publication on microfilm:

___________________________

___________________________

___________________________

Committee Chair

University of California, Irvine

1996

MODULAR NEURAL NETWORK ARCHITECTURE FOR DETECTION OF

OPERATIONAL PROBLEMS ON URBAN ARTERIALS

LIST OF CONTENTS

LIST OF FIGURES...................................................................................................1-5

LIST OF TABLES.....................................................................................................1-6

1. Introduction..................................................................................................1-7

1.1 Problem Definition...............................................................................................1-7

1.2 Research Approach..............................................................................................1-3

1.3 Organization of Dissertation................................................................................1-6

2. Incident Detection Approaches..................................................................2-1

2.1 Introduction..........................................................................................................2-1

2.2 Basic Approaches to Incident Detection..............................................................2-1

2.2.1 Pattern Recognition-Based Algorithms......................................................2-1

2.2.2 Time-Series Methods..................................................................................2-2

2.2.3 Bayesian Approach.....................................................................................2-2

2.2.4 Catastrophe Theory-Based Algorithm........................................................2-3

2.3 Surface Street Incident Detection........................................................................2-3

2.3.1 Time-Series Based Algorithm.....................................................................2-3

2.3.2 Knowledge-Based System Using Video Image Processing........................2-5

2.3.3 Decision Tree-Based Approach..................................................................2-6

2.3.4 Discriminant Analysis-Based Approach.....................................................2-7

3. Pattern Recognition and Neural Networks................................................3-1

3.1 Introduction..........................................................................................................3-1

3.2 Statistical Approaches to Pattern Recognition....................................................3-2

3.2.1 Bayesian Classifier.....................................................................................3-3

3.2.2 Discriminant Functions...............................................................................3-3

3.3 Artificial Neural Networks..................................................................................3-5

3.3.1 Learning Schemes.......................................................................................3-7

3.3.2 Unsupervised learning................................................................................3-7

3.3.3 Competitive Learning.................................................................................3-8

3.4 Artificial Neural Networks for Pattern Recognition............................................3-8

3.5 Artificial Neural Network Architectures...........................................................3-10

3.5.1 Multi-layer Feed Forward Neural Network..............................................3-10

3.5.2 Projection Neural Network.......................................................................3-15

3.5.3 Modularity of Neural Network Models....................................................3-18

4. Data Collection.............................................................................................4-2

4.1 Introduction..........................................................................................................4-2

4.2 Signalized Street Networks Selected as Study Areas..........................................4-2

4.2.1 Traffic Control System...............................................................................4-2

4.2.2 Description of Study Networks...................................................................4-3

4.3 Field Data Collected............................................................................................4-7

4.4 Microscopic Simulation.......................................................................................4-9

4.4.1 Limitations of NETSIM, Version 4.2.......................................................4-10

4.4.2 NETSIM Enhancements...........................................................................4-10

4.4.3 NETSIM Representation of Study Networks and Its Calibration.............4-11

4.4.4 Calibration of NETSIM............................................................................4-13

4.5 Simulated Data Collected..................................................................................4-14

4.5.1 Simulated Data Set....................................................................................4-14

5. Model Development.....................................................................................5-2

5.1 Introduction..........................................................................................................5-2

5.2 Selection of Features............................................................................................5-4

5.2.1 Input Features.............................................................................................5-4

5.2.2 Output Features...........................................................................................5-6

5.3 Performance Measures.........................................................................................5-6

5.3.1 Root Mean Square (RMS) Error.................................................................5-6

5.3.2 Detection Rate (DR)...................................................................................5-7

5.3.3 False Alarm Rate (FAR).............................................................................5-7

5.3.4 Average Time To Detection (TTD)............................................................5-7

5.3.5 Classification Rate (CR).............................................................................5-7

5.3.6 Statistical Techniques Used to Evaluate the Performance of Models

Developed............................................................................................................5-8

5.4 Neural Network Model Developed....................................................................5-10

5.4.1 Feasibility Study.......................................................................................5-10

5.5 Different Input Features.....................................................................................5-11

5.5.1 Parameter Selection for Multilayer Feedforward (MLF) Neural Network5-13

5.5.2 Number of Hidden Layer Processing Elements........................................5-13

5.5.3 Optimum Learning and Momentum Coefficients.....................................5-15

5.5.4 Generalization...........................................................................................5-15

5.6 Output Features..................................................................................................5-16

5.7 Modular Network Developed............................................................................5-18

5.8 Statistical Classifiers Developed.......................................................................5-19

5.8.1 Discriminant Analysis...............................................................................5-19

6. Results and Comparative Evaluation..........................................................21

6.1 Introduction............................................................................................................21

6.2 Neural Network Classifier...................................................................................6-2

6.2.1 Multilayer Feedforward Neural Network (MLF).......................................6-2

6.2.2 Projection Network.....................................................................................6-7

6.2.3 Modular Neural Network............................................................................6-8

6.3 Neural Network and Statistical Classifiers..........................................................6-9

6.4 Effect of Flow Conditions and Network Geometry on Performance..................6-9

7. Conclusions and Recommendations............................................................2

7.1 Conclusions..............................................................................................................2

7.2 Recommendations................................................................................................7-4

8. References....................................................................................................8-1

LIST OF FIGURES

Figure 3-1. Pattern Classifier..........................................................................................3-1

Figure 3-2. A Processing Element..................................................................................3-7

Figure 3-3. Multilayer Feedforward Neural Network..................................................3-12

Figure 3-4. Hidden and Output Layer Processing Elements..........................................3-13

Figure 3-5. Projection transformation and the formation of boundary surfaces...........3-17

Figure 3-6. (a) MLF neural network (b) a modular equivalent....................................3-20

Figure 4-1. Los Angeles Network...................................................................................4-4

Figure 4-2. Anaheim Network........................................................................................4-6

Figure 4-3. NETSIM representation of the Anaheim Study Network..................................

Figure 5-1. Detector (i) Configuration #1 and (ii) Configuration #2...............................71

Figure 5-2. Single Neural Network Model to Detect Different

Types of Operational Problems.......................................................................84

Figure 5-3. Modular Architecture of Neural Network Models to Detect.........................86

Figure 6-1. Input Feature Selection Using Simulated Data..............................................93

Figure 6-2. Single MLF Network to Detect Different Types of Problems.......................95

Figure 6-3. Training of MLF and Projection Network.....................................................96

Figure 6-4. Single and Modular Neural Network Model..................................................97

Figure 6-5. Performance of Neural Network and Statistical Models...............................98

Figure 6-6. Performance of the Modular Neural Network Model Based on....................100

LIST OF TABLES

Table 4-1. Data Collected from the Anaheim Network..................................................4-7

Table 4-2. Data Collected from the LA Network...........................................................4-8

Table 5-3. Input Features..............................................................................................5-13

Table 6-1. DR and TTD Performance Measures............................................................6-3

1. Introduction

1.1 Problem Definition

In recent years, transportation research has revealed that problems of widespread

congestion cannot be solved by building more roads or by expanding existing

infrastructure. A significant part of the solution lies in better management of traffic. One

of the principal thrusts of the new national program on Intelligent Transportation Systems

(ITS) is Advanced Transportation Management Systems (ATMS).

To facilitate better management, recent research has focused on continuous monitoring of

traffic to ascertain the 'normal' level of congestion and to provide an understanding of

how it forms and spreads. Techniques for rapidly detecting incidents have become a vital

link in the management of traffic. As pointed out by Ritchie (1990), a major concern in

ATMS is providing decision support to effectively detect, verify and develop response

strategies for incidents that disrupt the flow of traffic. A key element of providing such

support is automating the process of detecting operational problems on large area

networks. Successful detection of operational problems in their early stages is vital for

formulating response strategies such as modifying surface street signal timing plans and

activating or updating traveler information systems, including changeable message signs,

in-vehicle navigation systems and highway advisory radio, amongst others. It is also

needed as a basis for alerting police, emergency vehicles and tow services. Reliable

surface street incident detection is necessary for the development of integrated freeway-

arterial control systems, and will permit improved coordination of freeway ramp meters

and surface street signal timing. Therefore, developing a capability for automating

incident detection on arterial streets will aid in accomplishing a true integration of

freeway and arterial networks.

Since the early 1970s, research has been conducted to develop incident detection

algorithms for freeways to aid traffic engineers. Only since the mid-1980s has

any research focused on similar efforts for surface streets. As pointed out by Han and

May (1988), little work has been done on the arterial side because of characteristic

differences between freeways and arterials. Therefore a significant challenge lies in

formulating an incident detection methodology for arterial streets.

Differences between freeways and surface streets that impact incident detection include

the following:

• multiple access: freeways have directed access points through entry and exit ramps,

but surface streets have multiple access points through left, right and through

movements of traffic, giving drivers multiple choices

• geometric constraints: surface streets have geometric constraints such as

channelization for separation of traffic movements, and conflicting movements;

freeways have fewer of these features

• control measures: entry ramps on freeways control the rate of traffic entering the

main line flow, whereas intersection control on surface street networks controls phase

sequencing through either fixed time control or traffic actuated control with variable

splits, depending on the local demand within the constraints of cycle lengths, and

minimum and maximum phase lengths.

• operating conditions: surface streets usually operate at lower speeds compared to

freeways, which allows vehicles on surface streets to change lanes more readily,

thereby making the problem of distinguishing between incident and non-incident

patterns more difficult as vehicles are able to maneuver around incident locations

more easily.

• detector configuration: freeways usually have more uniform spacing of detector

stations (e.g. half or one-third of a mile), but for surface streets the surveillance

detector location varies based on the length of the links

Because of the characteristic differences between surface streets and freeways, the

effects, types and nature of 'incidents' differ. Very limited work has been done in

developing an algorithm to detect incidents on signalized surface street networks. Of the

few attempts (Bell and Thancanamootoo, 1988; Han and May, 1989; Chen and Chang,

1993), none have been extensively tested or implemented for a city’s street network.

Therefore, there is a need for the development of an incident detection system for

signalized street networks that will automate the process for a traffic management center.

This research proposes to develop such a new approach to detecting incidents or traffic

operational problems on surface streets.

1.2 Research Approach

The objective of developing an algorithm to automate the process of detecting

operational problems or traffic management problems is to provide traffic management

centers, overseeing the operations and control of street networks, with a decision tool.

This tool, embedded in a traffic management system, would be part of a four step

incident management system - incident detection, incident confirmation, incident

response and recovery monitoring (Ritchie, 1990). In the case of surface street networks,

any operational problem that requires the attention of an operator in a traffic management

center, and results in an operator formulating a response, may be defined as an incident.

Therefore, the role of a detection algorithm will be to detect the following types of

incidents that are relevant to the operation and control of surface street networks:

• Reduced capacity:

accident, stalled vehicles, illegal street parking, lane closure, or blockage

within a link or within an intersection

• Excess demand:

due to special event, queues do not clear over several cycles

left-turn pockets overflow over several cycles

inadequate capacity due to inadequate effective green time available to a

particular phase

• Detector malfunction:

system detectors

traffic control detectors

Neural classifiers, as an ensemble of a great number of collectively interacting elements,

are capable of storing representations of concepts and information as collective states.

Therefore, different aspects of a pattern recognition problem can be expressed over a set

of interconnections as weights in a distributed manner.

This research proposed the use of artificial neural networks in a modular architecture to

detect the different types of operational problems listed above. The types of problem that

can be detected depend on factors such as range of operating conditions, configuration of

system detectors within the network, and block or link length. Neural network models

were developed to demonstrate the feasibility of training and testing different

architectures of neural network models as components of a modular architecture, with

appropriate architecture for each sub-problem of pattern recognition. It was hypothesized

that the performance of such a modular architecture would exceed that of any single

architecture applied to the detection of the different types of operational problems. A

comparative analysis was carried out to study the performance of each type of model

considered. Also included was a study of the effect flow levels and detector

configurations have on the performance of the incident detection model. The results

show that, with the selection of a suitable architecture, the modular

neural network classifiers developed from loop detector data outperform

statistical techniques such as discriminant analysis. This is demonstrated by testing the

detection of operational problems on street networks in the Cities of Los Angeles and

Anaheim, California, using cyclic data collected from a microscopic simulation and from the

Urban Traffic Control System (UTCS) implemented in the field.

In this research we propose using a multiplicity of networks to take advantage of a

modular architecture. As a result, a different network learns a different region of the

input domain or class pattern by decomposing the problem at hand and splitting its input

domain. A modular architecture was used to develop a hierarchy of neural nets to detect

different types of problems under different operating conditions. From the tests

performed for incident detection for arterials with various architectures, it was clear that

the detection rate and the false alarm rate increased together. An

attempt to increase the detection rate therefore resulted in an increase in the false alarm rate as well,

i.e. an improvement in one measure came at the expense of the other. This has also

been found in incident detection for freeways (Ritchie and Cheu, 1990). Therefore, an

attempt was made to train two separate neural networks, one to optimize the detection

rate and another to optimize the false alarm rate. It was also shown that, using two

differently trained neural networks, the false alarm rates could be lowered by the dual

system of networks compared to a single network.

Incident detection systems, both for freeways and arterials, will operate in a traffic

management center controlling and monitoring large area networks. It is therefore of

utmost importance that the neural network models produce very few false alarms when

they are implemented in a real traffic management center. Traffic engineers seeking an

incident detection algorithm are concerned not only with a high detection rate but,

perhaps to an equal or greater degree, with a low false alarm rate. As has been shown in

the freeway case, even moderate false alarm rates can result in traffic engineers in a

traffic management center (TMC) ignoring real alarms.

The dual neural network is composed of two single networks, trained separately to

optimize the detection rate and the false alarm rate respectively, and used jointly to

reduce the false alarm rate to an acceptable level for use by a traffic management center

for large surface street networks.
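The text does not state at this point how the outputs of the two networks are combined. Purely as an illustration, the sketch below assumes the simplest joint rule, in which an alarm is raised only when both networks raise one. The function names and the alarm sequences are hypothetical and are not taken from the models developed in this research.

```python
def joint_alarms(alarms_a, alarms_b):
    """Combine two per-interval alarm sequences with a logical AND.

    Assumption: an incident is declared only when both networks agree.
    """
    if len(alarms_a) != len(alarms_b):
        raise ValueError("alarm sequences must have the same length")
    return [a and b for a, b in zip(alarms_a, alarms_b)]


def false_alarm_count(alarms, incident_present):
    """Count intervals where an alarm was raised but no incident was present."""
    return sum(
        1 for alarm, truth in zip(alarms, incident_present) if alarm and not truth
    )


# Made-up example: each network raises one false alarm, in different intervals.
truth = [False, False, True, True, False]
net_a = [True, False, True, True, False]
net_b = [False, False, True, True, True]
combined = joint_alarms(net_a, net_b)
```

Under this assumed rule, a false alarm survives only if both networks make the same mistake in the same interval, which is why a dual system can lower the false alarm rate. The cost is that an incident missed by either network is also missed by the pair.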

The overall objective of this research was to:

• develop a methodology to automate the process of detecting operational

problems on surface street networks

• extend the application of artificial neural networks to incident detection

• detect different types of problems

• test the robustness of the model developed by testing on different types of

surface street networks and different detector configurations

1.3 Organization of Dissertation

The research effort is described in the following chapters:

Chapter 1 presents the problem addressed in this Dissertation, the approach proposed and

the objectives of this research.

Chapter 2 presents a review of basic techniques applied to the problem of detecting non-

recurring congestion or incidents on freeways, and also discusses the limited work

done to develop a methodology for signalized surface street networks.

Chapter 3 identifies the problem addressed in this research as a pattern recognition

problem, presents different approaches to solving pattern recognition problems, namely

statistical and neural classifiers, the strengths and weaknesses of these approaches, and

why neural network classifiers were proposed for the problem addressed in this research.

Finally a neural network architecture was proposed to develop a comprehensive system

to detect traffic operational problems for signalized arterials.

Chapter 4 describes the field and simulated data collected to develop and test the

performance measures of the models developed for this research. The micro-simulator

used, its calibration, the city networks represented, and the experiments designed are

presented in this chapter.

Chapter 5 presents the model development, input feature selection, model structure,

training and testing of the proposed model for different types of traffic operational

problems under different operating conditions, and detector configurations.

Chapter 6 evaluates the results of the different neural classifiers, a statistical classifier,

and the modular architecture of neural classifiers.

Chapter 7 discusses the findings of this research, and future direction of research in this

area.

2. Incident Detection Approaches

2.1 Introduction

Detecting incidents on either a freeway section or a surface street is a pattern recognition,

or more specifically a classification, problem. Algorithms have been developed for

incident detection using various techniques. They can be classified as pattern

recognition, pattern matching techniques or comparative algorithms, statistically-based

algorithms, and traffic flow modeling-based approaches.

There are basically two approaches: one tries to find similarities and

the other estimates beliefs or probabilities. Attempts to classify incident and non-incident

data where the emphasis was on trying to determine the 'similarities' included decision

tree techniques, and time series or filtering techniques. These techniques ultimately rely

on means to determine how close the traffic parameters are to some 'normal' values or

predicted values using calibrated thresholds, determined differently by different

techniques.

2.2 Basic Approaches to Incident Detection

2.2.1 Pattern Recognition-Based Algorithms

Pattern matching algorithms based on decision trees for freeway incident detection were

developed by Payne and Tignor (1978), and were later developed by others (ref) as a

series of algorithms. They are based on decision trees to detect incidents from traffic

parameters. These algorithms, better known as the California Algorithms, are based on

the pattern of traffic when an incident occurs. When an incident occurs, congestion

builds upstream of the incident, causing an increase in occupancy upstream and a

decrease in occupancy downstream. This difference can also be caused by a

bottleneck, so the algorithm is also used to distinguish a bottleneck from an

incident.

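\[
\mathrm{Occ}_u(t) - \mathrm{Occ}_d(t) \ge K_1 \qquad \text{(Eq. 2-1)}
\]

\[
\frac{\mathrm{Occ}_u(t) - \mathrm{Occ}_d(t)}{\mathrm{Occ}_u(t)} \ge K_2 \qquad \text{(Eq. 2-2)}
\]

\[
\frac{\mathrm{Occ}_d(t-2) - \mathrm{Occ}_d(t)}{\mathrm{Occ}_d(t-2)} \ge K_3 \qquad \text{(Eq. 2-3)}
\]

where,

$\mathrm{Occ}_u(t)$ = upstream occupancy for time t (%)

$\mathrm{Occ}_d(t)$ = downstream occupancy for time t (%)

$K_1, K_2, K_3$ = thresholds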

The first two tests (Eq. 2-1, Eq. 2-2) were used to compare the absolute difference in

occupancy and the relative differences against thresholds. The third test (Eq. 2-3)

determines whether the difference is due to a bottleneck or recurring congestion. Various

versions of the algorithm have been developed based on this version. Currently, version

8 of this algorithm is being used for freeways in Los Angeles using 30 second occupancy

values.
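The three occupancy tests described above can be sketched in Python as follows. This is a minimal illustration only. The names `Thresholds` and `california_tests` are hypothetical, the threshold values in the usage example are placeholders rather than calibrated values, and the guard against zero occupancy is an added assumption.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical thresholds K1, K2, K3 (not values from the source)."""
    k1: float  # absolute occupancy difference, in percentage points
    k2: float  # occupancy difference relative to upstream occupancy
    k3: float  # relative drop in downstream occupancy over two intervals


def california_tests(occ_up_t, occ_down_t, occ_down_t_minus_2, th):
    """Return True when all three occupancy tests (Eq. 2-1 to Eq. 2-3) pass."""
    # Assumption: treat zero occupancy as "no decision" to avoid dividing by zero.
    if occ_up_t <= 0 or occ_down_t_minus_2 <= 0:
        return False
    abs_diff = occ_up_t - occ_down_t                                    # Eq. 2-1
    rel_diff = abs_diff / occ_up_t                                      # Eq. 2-2
    down_drop = (occ_down_t_minus_2 - occ_down_t) / occ_down_t_minus_2  # Eq. 2-3
    return abs_diff >= th.k1 and rel_diff >= th.k2 and down_drop >= th.k3


# Usage with placeholder thresholds and 30 second occupancy values (%).
example_thresholds = Thresholds(k1=8.0, k2=0.5, k3=0.3)
incident_flag = california_tests(30.0, 10.0, 20.0, example_thresholds)
```

The sketch evaluates the three tests together for a single interval; the persistence checks used in later versions of the algorithm are not represented here.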

2.2.2 Time-Series Methods

Algorithms were also developed based on statistical forecasting of traffic behavior by

time series algorithms (Cook and Cleveland, 1974; Dudek and Messer, 1974; Ahmed and

Cook, 1982). These time-series based methods provide a means of forecasting short term

traffic behavior. A significant deviation between observed and estimated values of traffic

parameters signals an incident.

2.2.3 Bayesian Approach

Levin and Krause (1978) have also used the Bayesian approach to classify incident and

non-incident data. This algorithm uses the ratio of the difference between upstream and

downstream one minute occupancies and also uses historical incident data. It is based on

mathematical expressions derived from the ratio of the distributions of the traffic features

under incident and incident-free conditions, and from the probability of

the occurrence of an incident at a particular location and time period. This algorithm

performs better than the California Algorithm, but has a high mean time to detect.
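The way a historical incident probability enters such a classifier can be illustrated with Bayes' rule. The sketch below is a generic two-class posterior computation and not the exact expressions of Levin and Krause (1978). The function name, its arguments and the numbers in the usage example are assumptions introduced for illustration.

```python
def incident_posterior(likelihood_incident, likelihood_normal, prior_incident):
    """Posterior probability of an incident given an observed traffic feature.

    likelihood_incident: density of the feature under incident conditions
    likelihood_normal:   density of the feature under incident-free conditions
    prior_incident:      historical probability of an incident at this
                         location and time period
    """
    numerator = likelihood_incident * prior_incident
    denominator = numerator + likelihood_normal * (1.0 - prior_incident)
    if denominator == 0.0:
        raise ValueError("feature value has zero probability under both conditions")
    return numerator / denominator


# Made-up numbers: the feature is 12 times more likely under an incident,
# but incidents are rare at this location and time period.
posterior = incident_posterior(0.6, 0.05, 0.01)
```

Because the prior probability of an incident is small, even a feature value that strongly favors the incident class yields a modest posterior, which is consistent with such an approach needing several intervals of evidence before declaring an incident.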

2.2.4 Catastrophe Theory-Based Algorithm

The McMaster Algorithm (Persaud and Hall, 1989) is based on applying catastrophe

theory to the two dimensional analysis of traffic flow and occupancy data, by separating

the areas corresponding to different states of traffic conditions. When specific changes of

traffic states are observed over a period of time, an incident is detected.

2.3 Surface Street Incident Detection

Few attempts have been reported in the literature for surface street incident detection.

One is based on decision trees (Han and May, 1989); another applies a time-series

technique to data from an urban traffic control system and simulated data for an isolated

intersection (Bell and Thancanamootoo, 1988); a knowledge-based approach uses video

image processing data (Sellam, Boulmakoul, and Pierrelee, 1991); and a discriminant

analysis method uses traffic parameters from loop data and detector configuration data

(Chen and Chang, 1993). A description of each of these attempts is presented next.

2.3.1 Time-Series Based Algorithm

A time series approach was used by Bell and Thancanamootoo (1988) in an effort to

perform incident detection. An incident was defined as an unexpected, non-recurrent,

longer term loss of capacity at a critical location. The key variables (determined by the

detector type) at each detector site were collected on a cyclic basis. When the traffic

condition remained normal, the mean and variance of traffic parameters were updated or

estimated each cycle by exponential smoothing according to Eq. 2-4 and Eq. 2-5.

Abnormal conditions were identified when the estimated key variable values were

outside the range of an upper and lower bound as computed in Eq. 2-6, Eq. 2-7, and Eq.

2-8 where the bounds were established in terms of the smoothed mean and variance (Eq.

2-9 and Eq. 2-10).

\[
\hat{F}(t) = 0.8\,\hat{F}(t-1) + 0.2\,F(t) \qquad \text{(Eq. 2-4)}
\]

\[
\hat{O}(t) = 0.8\,\hat{O}(t-1) + 0.2\,O(t) \qquad \text{(Eq. 2-5)}
\]

\[
F(t) < \hat{F}(t) - DV_1\,\hat{\sigma}_F(t) \qquad \text{(Eq. 2-6)}
\]

\[
O(t) > \hat{O}(t) + DV_2\,\hat{\sigma}_O(t) \qquad \text{(Eq. 2-7)}
\]

\[
O(t) < \hat{O}(t) - DV_3\,\hat{\sigma}_O(t) \qquad \text{(Eq. 2-8)}
\]

\[
\hat{\sigma}_F(t) = 0.1\,\bigl(F(t-1) - \hat{F}(t-1)\bigr)^2 + 0.9\,\hat{\sigma}_F(t-1) \qquad \text{(Eq. 2-9)}
\]

\[
\hat{\sigma}_O(t) = 0.1\,\bigl(O(t-1) - \hat{O}(t-1)\bigr)^2 + 0.9\,\hat{\sigma}_O(t-1) \qquad \text{(Eq. 2-10)}
\]

In these equations, $F(t)$ and $O(t)$ were observed cyclic flow and occupancy, $\hat{F}(t)$ and $\hat{O}(t)$ were estimated

flow and occupancy, and $\hat{\sigma}_F(t)$ and $\hat{\sigma}_O(t)$ were estimated standard deviations of flow

and occupancy. The thresholds respond to changes in level of traffic due to exponential

smoothing. The bounds were established based on the smoothed mean and variance.

Whenever an incident was suspected, the mean and variance were frozen to prevent the

incident from affecting their computation. If an incident was not confirmed after the

mean and variance had been frozen for 5 consecutive cycles, the values

were reset to the most current ones and smoothing resumed.

When the upper bound of occupancy was exceeded, an incident was suspected upstream,

and the downstream occupancy was checked. If that was found to fall below the lower

bound, then an incident was confirmed. Conversely, if the lower bound was

violated, then an upstream detector was checked for an upper bound violation to

confirm an incident.
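The smoothing, bound check, and freeze logic described above can be sketched as follows. This is a minimal illustration, not the published implementation: the class name `SmoothedSeries`, the deviation multiplier `dv`, and the starting values are hypothetical, and the sketch assumes the smoothed squared deviation is a variance whose square root sets the width of the bounds. Incident confirmation is left to the caller.

```python
import math


class SmoothedSeries:
    """Exponentially smoothed mean and variance of one detector variable."""

    def __init__(self, mean, var, dv=2.0, max_frozen=5):
        self.mean = mean              # smoothed estimate, e.g. estimated flow
        self.var = var                # smoothed squared deviation (assumed a variance)
        self.dv = dv                  # hypothetical deviation multiplier
        self.max_frozen = max_frozen  # cycles to keep statistics frozen
        self.frozen = 0               # consecutive abnormal cycles so far
        self.last_dev = 0.0           # observed minus estimated value, previous cycle

    def bounds(self):
        spread = self.dv * math.sqrt(self.var)
        return self.mean - spread, self.mean + spread

    def update(self, obs):
        """Return 'low', 'high' or 'normal' for this cycle's observation."""
        lower, upper = self.bounds()
        if obs < lower or obs > upper:
            state = "low" if obs < lower else "high"
            self.frozen += 1          # statistics stay frozen while abnormal
            if self.frozen >= self.max_frozen:
                # No confirmation in time: reset to the current value.
                self.mean = obs
                self.last_dev = 0.0
                self.frozen = 0
            return state
        self.frozen = 0
        self.var = 0.1 * self.last_dev ** 2 + 0.9 * self.var
        self.mean = 0.8 * self.mean + 0.2 * obs
        self.last_dev = obs - self.mean
        return "normal"


def incident_confirmed(upstream_occ_state, downstream_occ_state):
    """High occupancy upstream together with low occupancy downstream."""
    return upstream_occ_state == "high" and downstream_occ_state == "low"
```

One `SmoothedSeries` would be kept per detector and per variable, with `update` called once per signal cycle.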

SCOOT data for an isolated T-intersection were collected from the Traffic Management

Division of TRRL in London. These data contained an incident in which a vehicle was parked close to

the stop line for an hour. Data covering 2 hours were also obtained from the Traffic Control

System Unit using the SCOOT system. This set also consisted of data from a T-intersection, for both

directions of traffic. In this case the incidents were right-turning vehicles blocking the

traffic. Both incidents were detected using the algorithm developed. SCOOT data were also

collected for Middlesbrough over a 2-hour period, which covered the evening peak

period for 2 days. The incident detection algorithm developed produced no false alarms

for this data set.

This approach was used by researchers in the DRIVE project MONICA (Monitoring

Incidents and Congestion Automatically). Bretherton and Bowen (1991) report their

work with the algorithm developed by Bell and Thancanamootoo which extended it to an

arterial, using detector data from adjacent intersections. Data were collected using a

system developed for the London UTCS to receive, process and store traffic information

produced from SCOOT. Data were collected over a three hour period for two

consecutive links, during which there was a lane-blocking incident lasting an hour. The paper

reports that field testing was to take place, but no follow-up literature reporting the

results was available.

1.3.2 Knowledge-Based System Using Video Image Processing

Sellam, Boulmakoul, and Pierrelee (1991) developed a knowledge-based system using

video image processing for incident detection at signalized intersections. Their paper

presents the general architecture of a system developed as part of the DRIVE project

INVAID. It consists of three units: the Image Processing Unit (IPU), which outputs a

binary image of the junctions or streets; a Measurements Processing Unit (MPU), which

processes the binary image data to compute indicators of traffic; and a Diagnosis

Processing Unit (DPU), which diagnoses a problem and, if necessary, makes requests to the

MPU. The incident detection is based on the binary output of the digitized image: each

black pixel on the source image represents a "moving vehicle" and a white pixel

corresponds to the background. A user-specified parameter determines the time interval

during which a stopped vehicle is still considered a moving vehicle and after which

it is considered a parked vehicle. This results in a spatial detection algorithm as

opposed to a vehicle detection algorithm.

1.3.3 Decision Tree-Based Approach

Shortly after Thancanamootoo's work was reported, Han and May (1988, 1989) reported

their attempt to develop an incident detection algorithm for surface streets. They selected

a downtown area in Los Angeles under ATSAC (Automated Traffic Surveillance and Control

System) with signalized surface streets of short blocks (400-500 feet) and detectors in all

lanes. Detector data collected were smoothed.

The algorithm first uses the smoothed data to detect abnormal detector data patterns.

Comparisons are made against statistical ranges, bounded by the 1st and 99th percentiles,

of historical flow, occupancy and speed data.

After a check of detector malfunction, the algorithm proceeds to determine whether either

an Impending Saturation Occupancy or an Impending Congestion Occupancy is

exceeded. The flows are also checked against Medium and Low Flow thresholds.

Thresholds of 300 and 500 for flow, and 30% and 40% for occupancy, are used. These

values are determined for the test, in an attempt to minimize the false alarm rate. These

thresholds are therefore time and detector dependent. One minute and three minute

smoothed data were used to determine the thresholds. Traffic conditions on a street were

classified into one of six states, depending on the occupancy and volume. Based also on

the condition of adjacent lanes and downstream streets, classification as a lane blockage,

approach blockage or arterial blockage was also made.

The algorithm was implemented as a system in C. It was tested off-line for a section of a

street in Los Angeles (Washington Blvd.) near the Coliseum. This testbed had 20

detectors. In the early stages of development, runs were made to determine the

thresholds using one minute and three minute smoothed data. In 1989, Han and May

reported the development of TOPDOG, developed in TurboProlog, based on the

algorithm described. They also reported using 50 minute data from Venice Blvd., Los

Angeles. The system is still in its initial stages of testing as a demonstration prototype.

Still further work is required to test whether a global set of thresholds may be determined

for a detector configuration, as opposed to using time and detector dependent thresholds.

1.3.4 Discriminant Analysis-Based Approach

Discriminant analysis has been proposed for incident detection on surface streets by Chen

and Chang (1993). This paper very briefly presents an incident detection algorithm for

surface streets as part of a 3-module system - a dynamic traffic flow prediction model, an

incident detection model, and an incident monitoring module. The paper presents the

overall architecture of an incident detection system. According to the architecture

presented, the flow model captures the dynamics of the traffic and thus predicts the flow

conditions; this is compared to the real-time condition and forms the basis of the

detection module. The paper presents the detection portion of this system:

$$d_{1\text{-lane blockage}} = -106.7 + 3.54\,ASPDUS + 7.22\,ASPDDS + 0.66\,AIFLDS + 2.21\,AOCUS - 53.70\left(\frac{DFSPD}{DTSPACE}\right) + 1.69\,DFOC + 0.88\,TRUCK + 1.73\,BLRATO + 1.06\,DSP_{12} + 2.26\,DSP_{23}$$

$$d_{2\text{-lane blockage}} = -131.26 + 3.39\,ASPDUS + 6.80\,ASPDDS + 0.10\,AIFLDS + 2.17\,AOCUS - 156.88\left(\frac{DFSPD}{DTSPACE}\right) + 1.49\,DFOC + 0.05\,TRUCK + 1.58\,BLRATO + 1.89\,DSP_{12} + 2.04\,DSP_{23}$$

where,

ASPDUS, ASPDDS average upstream and downstream speed

AIFLDS average downstream flows

AOCUS average upstream occupancy

DFSPD upstream and downstream speed difference

DFOC upstream and downstream occupancy difference

DSP12 speed difference of lane 1 and 2

DSP23 speed difference of lane 2 and 3

BLRATO fraction of the blocked area of a link

TRUCK composition of heavy vehicles

DTSPACE detector spacing

This method is based on a set of multivariate discriminant functions. The paper

presenting this method reports a misclassification rate of 13.22% and a variance of 0.59

for one and two-lane blockages classified in an experimental design of a 3-lane arterial

segment. The paper does not present the configuration of the network simulated, and the

details of design of the experiments conducted, such as the location of the blockages, and

whether the blockages were partial or complete lane closures. NETSIM, a microscopic

simulation model, was used for the experiments, in which only one-lane blockages (a partial

blockage such as a stalled vehicle) can be simulated. The use of the variable 'fraction of the

blocked area of a link' as reported in the paper suggests use of the 'lane closure'

simulation feature of NETSIM, which requires specifying the percent of time the lane

closure is simulated, rather than lane blockages. Also, no information was provided on how the

detector data were obtained, as the current version of NETSIM does not simulate

surveillance detectors.

All of the incident detection methods described here are in their early stages of

development, unlike the freeway incident detection approaches. It may be mentioned

here that none (except the Chen and Chang paper) report any performance measures for

the algorithms presented. The discriminant analysis-based algorithm reports only a

misclassification rate, but neither false alarm rates nor times to detection.

1. Pattern Recognition and Neural Networks

1.1 Introduction

The task of determining a procedure to use currently available information on an object

or an event to assign it to a prespecified set of categories or classes is termed pattern

classification or pattern recognition. A body of work has developed out of the extensive

study of pattern recognition problems which has led to the development of mathematical

models to design classifiers as shown in Figure 1-1. These models use a set of features of

an object or an event and describe a relationship between these features (inputs) and its

class pattern.

[Figure: inputs x1, x2, ..., xn feed the discriminant functions g1(x), g2(x), ..., gn(x); a Max selector produces the output y(x)]

Figure 1-1. Pattern Classifier

(Duda and Hart, 1973)
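The structure in the figure, a bank of discriminant functions followed by a maximum selector, can be sketched as below. The three linear discriminants are hypothetical and serve only to exercise the selector; any functions of the feature vector could be substituted.

```python
def classify(x, discriminants):
    """Return the index of the discriminant function with the largest value at x."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical discriminant functions for a two-feature input.
g = [
    lambda x: 1.0 * x[0] - 0.5 * x[1],
    lambda x: -1.0 * x[0] + 2.0 * x[1],
    lambda x: 0.2,
]

print(classify((3.0, 1.0), g))
```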

In the literature there are a few types of pattern recognition techniques. They are mostly

based on statistical, machine learning or neural network-based approaches. In this

research, statistical and neural network approaches were considered and evaluated, and

the discussion will be limited to these two.

1.2 Statistical Approaches to Pattern Recognition

In this approach, the problem of pattern recognition is considered a problem of estimating

density functions in a high dimensional space and dividing the high-dimensional space

into the regions of patterns or classes. The input features are considered realizations of

random vectors, where the conditional density functions depend on the class pattern, and

the density function may be known or assumed. The performance of a classifier can be

analyzed for a given distribution of input vectors. In designing a classifier, the

conditional density form may be known or assumed and the functional form of the

classifier or discriminant can also be assumed to be linear, quadratic or piecewise linear.

The best classifier for a given distribution is based on Bayes decision theory and

minimizes the probability of classification error; it is also considered the optimal

classifier. When the density function is assumed, the parameters of the function need to

be estimated using parametric techniques, and when the density function is not known

nonparametric techniques are used. For problems where the data do not fit the common

density functions, non-parametric techniques are applied. However, nonparametric

techniques are normally used for off-line analysis because of the limitations in

performance, storage requirement, speed and complexity of the algorithms. Therefore, in

this research, where a methodology is required to perform on-line, real-time analysis,

statistical nonparametric methods are not discussed further.

In general, it is also known that it is easier to design a classifier for an input feature

vector of lower dimensionality. Therefore, techniques are also used in pattern

recognition that reduce the dimensionality of a given input vector to perform feature

extraction, which then forms the basis of a linear classifier.

1.2.1 Bayesian Classifier

One of the fundamental approaches to pattern recognition is based on Bayesian decision

theory, and is expressed in terms of probability structures. When the distributions of the

input feature random vectors are given, it can be shown that the Bayesian classifier is the

best classifier which minimizes the probability of classification error (Duda and Hart,

1973).
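For reference, the two-class form of the Bayes decision rule can be written as follows. This is the standard textbook statement for equal misclassification costs, given here as a sketch; the symbols $p_i$ (prior probability of class $i$) and $f_i(x)$ (class-conditional density) are notation assumed for this illustration.

```latex
P(\text{class } i \mid x) = \frac{p_i\, f_i(x)}{p_1 f_1(x) + p_2 f_2(x)}, \qquad i = 1, 2
\\[4pt]
\text{assign } x \text{ to class 1 if } p_1 f_1(x) \ge p_2 f_2(x), \text{ and to class 2 otherwise.}
```

Choosing the class with the larger posterior probability at every $x$ is what minimizes the probability of classification error.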

1.2.2 Discriminant Functions

Even when the probability distribution of the input feature vectors is given,

implementation of the optimal Bayes classifier is often difficult when the dimensionality

of the input feature vector is high. In such cases, another statistical classifier, discriminant

analysis, is often used, where the mathematical forms of the discriminant functions

are known (linear or quadratic classifiers).

Classification into groups or pattern classes is based on differences in the characteristics

of the features of objects that come from the different classes. A good classification rule

for discriminating between the classes minimizes the misclassification rate while taking into account the prior

probabilities of occurrence of each class and the cost of misclassification. Therefore,

discriminant functions should take these factors into consideration.

Discriminant function (DF) procedures are based on normal populations. Let $f_1(x)$ and

$f_2(x)$ be multivariate normal densities, with mean vectors and covariance matrices $\mu_1$ and $\Sigma_1$,

and $\mu_2$ and $\Sigma_2$, respectively. When $\Sigma_1 = \Sigma_2 = \Sigma$, the classification rule that minimizes the expected

cost of misclassification is,

For class 1:

$$(\mu_1 - \mu_2)'\,\Sigma^{-1} x_0 - \frac{1}{2}(\mu_1 - \mu_2)'\,\Sigma^{-1}(\mu_1 + \mu_2) \ge \ln\left(\frac{p_2}{p_1}\right)$$

Eq. 1-1

and for class 2 otherwise,

where $p_1$, $p_2$ = prior probabilities of belonging to class 1 and class 2, respectively.

The classification rule in this case is linear, but when $\Sigma_1 \neq \Sigma_2$, that is, when the covariance

structure is different, the discriminant function becomes,

For class 1:

$$-\frac{1}{2}\,x'\left(\Sigma_1^{-1} - \Sigma_2^{-1}\right)x + \left(\mu_1'\Sigma_1^{-1} - \mu_2'\Sigma_2^{-1}\right)x - k \ge \ln\left(\frac{p_2}{p_1}\right)$$

Eq. 1-2

and for class 2 otherwise,

where,

$$k = \frac{1}{2}\ln\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right) + \frac{1}{2}\left(\mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2\right)$$

In this case, the classification regions are defined as quadratic functions of x. Therefore,

in using discriminant functions as classifiers, assumptions of normality are made for the

multivariate density functions, and linear and quadratic classifiers arise based on the

structure of the covariances. If the data are not multivariate normal, they may be

transformed to variables that are more nearly normal, and the linear or quadratic discriminant

classifier can then be used to determine the appropriateness of a particular classifier. Alternatively,

when appropriate transformations of the data cannot be found, the linear or quadratic

classifying rule may be applied to check the performance of the classifier (Johnson and

Wichern, 1992) and to determine whether linear or quadratic decision surfaces can

perform the classification reasonably well for the particular pattern recognition problem.
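The linear (equal covariance) and quadratic (unequal covariance) rules can be sketched for a two-feature problem as follows. The helper names `inv2`, `det2` and `quad` are hypothetical, the population means and covariances are assumed known, and misclassification costs are assumed equal. The code is restricted to 2x2 matrices only to stay dependency-free.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def det2(m):
    (a, b), (c, d) = m
    return a * d - b * c


def quad(u, m, v):
    """Bilinear form u' M v for 2-vectors u, v and a 2x2 matrix M."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def linear_rule(x, mu1, mu2, cov, p1=0.5, p2=0.5):
    """Equal-covariance rule; returns the class label 1 or 2."""
    s_inv = inv2(cov)
    diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
    total = (mu1[0] + mu2[0], mu1[1] + mu2[1])
    score = quad(diff, s_inv, x) - 0.5 * quad(diff, s_inv, total)
    return 1 if score >= math.log(p2 / p1) else 2


def quadratic_rule(x, mu1, mu2, cov1, cov2, p1=0.5, p2=0.5):
    """Unequal-covariance rule; returns the class label 1 or 2."""
    s1, s2 = inv2(cov1), inv2(cov2)
    k = (0.5 * math.log(det2(cov1) / det2(cov2))
         + 0.5 * (quad(mu1, s1, mu1) - quad(mu2, s2, mu2)))
    score = (-0.5 * (quad(x, s1, x) - quad(x, s2, x))
             + quad(mu1, s1, x) - quad(mu2, s2, x) - k)
    return 1 if score >= math.log(p2 / p1) else 2
```

When both covariance matrices are equal, the quadratic term cancels and the two functions give the same decision, which is a convenient sanity check.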

1.2.2.1 Fisher Discriminant Function

In this linear discrimination rule, the assumption of normality for the multivariate data is

not made. However, the population covariance matrices are implicitly assumed to be

equal. In this case, the rule maximizes the differences between the classes by

maximizing the ratio of the squared distance between the sample means to the sample

variance. The separation distance is expressed not in terms of the original input variables

x, but in terms of a set of transformed variables y. The x variables are transformed to y by taking the

linear combination of x's that maximizes the separation distance in terms of y. The Fisher

discriminant function becomes:

For class 1 if:

$$y = (\bar{x}_1 - \bar{x}_2)'\,\Sigma_{pooled}^{-1}\,x \ge \frac{1}{2}(\bar{x}_1 - \bar{x}_2)'\,\Sigma_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2)$$

Eq. 1-3

and to class 2 otherwise.

For the Fisher discriminant function, the two classes are assumed to have a common

covariance matrix. From the discriminant developed, the maximum relative separation

distance can also be calculated and the significance of the difference in means can be

computed. A significant difference in means for the classes does not imply that the

classifier developed will classify well. If good separation between the

means does not exist, then the classification need not be performed. But if the difference

in means is significant, then other methods of testing the validity of the classification

procedure are used.
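A minimal sketch of the Fisher rule for two classes with two features is given below, assuming the standard pooled sample covariance with n1 + n2 - 2 degrees of freedom. The function names and the toy samples in the usage note are hypothetical.

```python
def mean2(rows):
    n = len(rows)
    return (sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n)


def scatter2(rows, m):
    """Sum of outer products of deviations from the mean m (2x2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - m[0], r[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_rule(class1, class2):
    """Return (a, m): the discriminant direction a and the midpoint m."""
    m1, m2 = mean2(class1), mean2(class2)
    s1, s2 = scatter2(class1, m1), scatter2(class2, m2)
    dof = len(class1) + len(class2) - 2
    sp = [[(s1[i][j] + s2[i][j]) / dof for j in range(2)] for i in range(2)]
    det = sp[0][0] * sp[1][1] - sp[0][1] * sp[1][0]
    inv = ((sp[1][1] / det, -sp[0][1] / det), (-sp[1][0] / det, sp[0][0] / det))
    d = (m1[0] - m2[0], m1[1] - m2[1])
    # a = inverse pooled covariance times the difference of sample means
    a = (inv[0][0] * d[0] + inv[0][1] * d[1], inv[1][0] * d[0] + inv[1][1] * d[1])
    m = 0.5 * (a[0] * (m1[0] + m2[0]) + a[1] * (m1[1] + m2[1]))
    return a, m


def classify(x, a, m):
    """Assign x to class 1 when its projection y = a'x is at least the midpoint."""
    return 1 if a[0] * x[0] + a[1] * x[1] >= m else 2
```

For example, `a, m = fisher_rule([(0, 0), (1, 0), (0, 1), (1, 1)], [(3, 3), (4, 3), (3, 4), (4, 4)])` followed by `classify((0.5, 0.5), a, m)` assigns the point to class 1.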

1.3 Artificial Neural Networks

Classifiers based on Bayes decision theory can be shown to be optimal classifiers. But

for Bayesian classifiers, the conditional probability density function for each class must

be known. Even when the densities are known, they may be difficult to implement. A

classifier is designed by assuming the mathematical form of the classifier (linear,

quadratic or piecewise linear) and the parameters have to be estimated. But the

performance of the classifiers depends on certain conditions of the density functions and

their covariance structures. It may be mentioned that the Bayesian linear classifier is

optimal only when the distribution is normal and the covariances are equal. When the

assumption of equality of the covariances is inappropriate, the Bayesian classifier is not

optimum. For unequal covariances or non-normal distributions, quadratic discriminants

or other types of classifiers may be developed. But as pointed out, often the robustness

of linear classification is preferred to the performance of the more complex classifiers.

Therefore, for most pattern recognition problems, linear classifiers are initially designed,

and then the performance evaluated.

In recent years, artificial neural networks have been applied to a variety of pattern

recognition problems and have clearly emerged as an approach that has outperformed some

of the statistically based techniques described in the previous section. As pointed out in the

previous section, the performance of statistical classifiers depends on how well the

assumptions about the density functions fit the data they describe, and also on the

appropriateness of the functional form of the discriminant function. These factors are

application-specific. When the restrictive assumptions are violated for a particular

application, the classifier is no longer best or optimum. Neural network-based classifiers

are another set of classifiers that address these types of problems.

The artificial neural network architectures that were initially developed and successfully

applied were based on the original perceptron (Rosenblatt, 1958). They are parallel,

distributed information-processing structures that combine computational and knowledge

representation methods. An artificial neural network consists of many processing

elements (PE's) that are massively connected with each other. The processing elements

can be arranged in layers, with an external layer receiving input from data sources, which

is passed on to other processing elements through interconnections, and on to an output

layer of processing elements. Since these structures are distributed in nature, they are

known to be robust and efficient, and they also have the capability to capture highly non-linear

mappings between inputs and outputs.

Each processing element receives inputs from external data sources or other processing

elements and passes these through a summation function and a transfer or activation function, as

shown in Figure 1-2. The method of obtaining the weights to perform the desired

mapping is termed learning. The various types of training methods are described in

the next section.

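[Figure: inputs X0, ..., Xn are weighted by W and summed (Σ) to give v, which is passed through the activation function φ to produce the output Y]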

Figure 1-2. A Processing Element
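A single processing element can be sketched as a weighted sum followed by an activation function. The logistic sigmoid is an assumed choice of activation here, and treating the first input as a constant 1 (so that its weight acts as a threshold) is a convention adopted only for this illustration.

```python
import math


def sigmoid(v):
    """Logistic activation, one common choice for the transfer function."""
    return 1.0 / (1.0 + math.exp(-v))


def processing_element(inputs, weights, activation=sigmoid):
    """Weighted sum of the inputs followed by the activation function."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return activation(v)


# inputs[0] = 1.0 plays the role of the bias input in this sketch.
y = processing_element([1.0, 2.0], [-1.0, 0.5])
```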

1.3.1 Learning Schemes

1.3.1.1 Supervised Learning

Supervised learning occurs when the network parameters are adapted under the combined

influence of the error signal and the input vector. This adjustment is often made by an

iterative procedure until the output produced follows the target signal closely in some

statistical terms according to either a least mean square or steepest descent algorithm

using the instantaneous estimate of the gradient of the error surface.

1.3.1.2 Unsupervised Learning

Supervised learning is controlled by a teacher or an external target signal, and the

network weights are updated based on the ability to follow the target

signal. For unsupervised learning, learning is carried out not in terms of a cost function,

which is expressed in terms of an error signal, but rather based on a target-independent

measure. Network parameters are adjusted based on an independent measure of

performance which captures statistical regularities of the input vector and forms classes

of patterns automatically.

1.3.1.3 Competitive Learning

This occurs when the processing units compete amongst themselves to respond to a

stimulus or input. In Hebbian learning the units produce outputs simultaneously, but in

competitive learning only one unit responds at a time and is therefore often referred to as

the winner-take-all method. For a collection of processing units, certain units specialize

on certain groups or classes of patterns and therefore work well to detect certain features

of an input set.

In this learning scheme the weights are normalized, and the input

vectors, when also properly scaled, lie on an N-dimensional unit hypersphere. The learning

moves the weight vector toward the center of gravity of the cluster it discovers.
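A winner-take-all update of this kind can be sketched as follows. The learning rate, the use of the inner product to pick the winner, and the renormalization after each update are assumptions of this illustration rather than details given in the text.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def competitive_step(weights, x, rate=0.1):
    """Update only the winning unit in place and return its index."""
    x = normalize(x)
    # With unit-length vectors, the largest inner product marks the closest unit.
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    winner = scores.index(max(scores))
    moved = [w_i + rate * (x_i - w_i) for w_i, x_i in zip(weights[winner], x)]
    weights[winner] = normalize(moved)   # keep the weight vector on the unit sphere
    return winner
```

Repeated over many inputs, each weight vector drifts toward the mean direction of the inputs it wins, which is the "center of gravity" behavior described above.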

1.4 Artificial Neural Networks for Pattern Recognition

Artificial neural networks applied to pattern recognition problems have clearly emerged

as a major contender to other statistical-based approaches. The applicability of one

approach over the other is of course application specific. Statistical-based work in

pattern recognition is based on discriminant analysis or on a class of models that attempt

to provide an estimate of the joint distribution of the features within each class. These

are approaches that are characterized by having an explicit underlying probability model

of being in each class. The main advantages of applying neural network techniques are

as follows:

• Only weak assumptions need to be made about the data set as opposed to

statistically-based approaches where more restrictive assumptions of the

underlying distributions are made to obtain the optimal classifiers.

• Classifiers based on Bayes decision theory can be shown to be optimal

classifiers. But for Bayesian classifiers, the conditional probability density

function for each class must be known. Even when the densities are known,

they may be difficult to implement. A classifier is designed by assuming the

mathematical form of the classifier (linear, quadratic or piecewise linear) and

the parameters have to be estimated. But the performance of the classifiers

depends on certain conditions of the density functions and their covariance

structures. It may be mentioned that the Bayesian linear classifier is optimal

only when the distribution is normal and the covariances are equal. Based on

the application, when the equal covariance assumption is inappropriate, the

Bayesian classifier is not optimum.

• Even though there are nonparametric statistical techniques that can be applied

to pattern recognition to avoid making assumptions about the distribution of

input features, these techniques are difficult to apply to on-line applications as

they are computationally intensive and require extensive amounts of storage

space.

• Nonparametric techniques also suffer from the 'curse of dimensionality',

where although with enough samples, convergence to an arbitrary

complicated unknown density function is assured, the number of samples

required may be very large. The demand for a large number of samples grows

exponentially with the dimensionality of the input feature space. This

limitation severely restricts the practical application of nonparametric

techniques. Work with MLF networks (Baron 1991, 1992) shows that the rate

of convergence expressed as a function of the training set size N is of the

order $(1/N)^{1/2}$ (times a logarithmic factor). This makes the application of

neural network techniques feasible for problems that require on-line

implementation.

• Since the relationships in ANN are expressed as a set of interconnections,

they are distributed in nature and thereby more robust and computationally

efficient.

1.5 Artificial Neural Network Architectures

1.5.1 Multi-layer Feed Forward Neural Network

The multilayer feedforward neural network consists of an input layer, one or more

non-linear hidden layers, and an output layer. It employs an error-correcting

back-propagation algorithm for training. In the case of backpropagation, there are two distinct

phases: a forward pass and a backward pass. In the forward pass the input is propagated

through layers of processing units, and in the backward pass the errors computed are

propagated backwards and the parameters are adjusted to minimize the mean square error (Rumelhart

and McClelland, 1986).

Each processing element in the hidden layers and the output layer includes a non-linear

transfer function. The hidden

layers progressively extract more and more features from the input data. The layers are

interconnected through a set of weights. Due to this distributed form of non-linear

processing, these structures are able to produce highly non-linear mappings between

inputs and outputs. The weights in these networks are adjusted according to the well

known back-propagation algorithm that minimizes the mean square error between the

outputs produced by the network and a set of desired outputs.

The error is computed as,

$$e_j(n) = d_j(n) - y_j(n)$$ for the output unit j

Eq. 1-4

The sum of the square of the errors is,

$$E(n) = \frac{1}{2}\sum_j e_j^2(n)$$ for all output units j

Eq. 1-5

and the mean error over N training patterns,

$$E_{mean} = \frac{1}{N}\sum_{n=1}^{N} E(n)$$

Eq. 1-6

The objective is to adjust the parameters or weights. The errors computed for every

pattern are thus summed over the entire set of patterns to produce an estimate of the

overall error, which is used as a measure to adjust the weights. Considering a typical

processing unit as shown in Figure 3-3, the net output produced is,

$$v_j(n) = \sum_{i=0}^{p} w_{ji}(n)\,y_i(n)$$ for output unit j from all p inputs

Eq. 1-7

According to the back-propagation algorithm, the weight adjustment $\Delta w_{ji}(n)$ is

proportional to the instantaneous gradient,

$$\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)}\,\frac{\partial v_j(n)}{\partial w_{ji}(n)}$$

Eq. 1-8

where,

$$\frac{\partial E(n)}{\partial e_j(n)} = e_j(n), \quad \frac{\partial e_j(n)}{\partial y_j(n)} = -1, \quad \frac{\partial y_j(n)}{\partial v_j(n)} = \varphi'(v_j(n)), \quad \frac{\partial v_j(n)}{\partial w_{ji}(n)} = y_i(n)$$

Eq. 1-9

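[Figure: two panels. Multi-layer Feed Forward Network (forward propagation phase): inputs x0, x1, x2, ..., xn pass through three layers of summation (Σ) and activation (φ) units with weights and thresholds W1,θ1, W2,θ2 and W3,θ3 to give outputs y1, y2, ..., ym. Backpropagation phase: the local gradients δ1, δ2, ..., δm are propagated back through the transposed weights W3T and W2T and the derivatives φ'.]

Figure 1-3. Multilayer Feedforward Neural Network

Forward/Backward pass

(Haykin, 1992)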

Therefore,

$$\Delta w_{ji}(n) = -\eta\,\frac{\partial E(n)}{\partial w_{ji}(n)} = \eta\,\delta_j(n)\,y_i(n)$$

Eq. 1-10

where,

$$\delta_j(n) = -\frac{\partial E(n)}{\partial e_j(n)}\,\frac{\partial e_j(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)} = e_j(n)\,\varphi'(v_j(n))$$

Eq. 1-11

[Figure: signal flow from y_i(n) through w_ji(n), v_j(n) and φ(.) to y_j(n) at hidden unit j, and on through w_kj(n), v_k(n) and φ(.) to y_k(n) at output unit k, where the error e_k(n) is formed from d_k(n) and y_k(n)]

Figure 1-4. Hidden and Output Layer Processing Elements

For hidden layer units:

$$\delta_j(n) = -\frac{\partial E(n)}{\partial y_j(n)}\,\frac{\partial y_j(n)}{\partial v_j(n)} = -\frac{\partial E(n)}{\partial y_j(n)}\,\varphi'(v_j(n))$$ for hidden unit j

Eq. 1-12

$$\frac{\partial E(n)}{\partial y_j(n)} = \sum_k e_k(n)\,\frac{\partial e_k(n)}{\partial v_k(n)}\,\frac{\partial v_k(n)}{\partial y_j(n)}$$ summed over the output units k

Eq. 1-13

where:

$$\frac{\partial e_k(n)}{\partial v_k(n)} = -\varphi'(v_k(n)), \qquad \frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n)$$

and,

$$\frac{\partial E(n)}{\partial y_j(n)} = -\sum_k e_k(n)\,\varphi'(v_k(n))\,w_{kj}(n) = -\sum_k \delta_k(n)\,w_{kj}(n)$$

so that:

$$\delta_j(n) = \varphi'(v_j(n)) \sum_k \delta_k(n)\,w_{kj}(n)$$ for hidden unit j

Therefore the learning rule may be summarized as:

$$\begin{pmatrix}\text{Weight correction}\\ \Delta w_{ji}(n)\end{pmatrix} = \begin{pmatrix}\text{Learning parameter}\\ \eta\end{pmatrix} \cdot \begin{pmatrix}\text{Local gradient}\\ \delta_j(n)\end{pmatrix} \cdot \begin{pmatrix}\text{Input of unit } j\\ y_i(n)\end{pmatrix}$$

Eq. 1-14
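The learning rule summarized above can be sketched for a network with one hidden layer as follows. This is an illustrative pattern-by-pattern implementation under stated assumptions: logistic activations, a constant 1 as the first input of each layer to carry the threshold, and hypothetical function names and learning rate.

```python
import math


def phi(v):
    return 1.0 / (1.0 + math.exp(-v))


def phi_prime(v):
    s = phi(v)
    return s * (1.0 - s)


def forward(w_hidden, w_out, x):
    """Forward pass; x[0] is assumed to be a constant 1 acting as the bias input."""
    v_h = [sum(w * xi for w, xi in zip(row, x)) for row in w_hidden]
    y_h = [1.0] + [phi(v) for v in v_h]      # bias input for the output layer
    v_o = [sum(w * yj for w, yj in zip(row, y_h)) for row in w_out]
    y_o = [phi(v) for v in v_o]
    return v_h, y_h, v_o, y_o


def backprop_step(w_hidden, w_out, x, d, eta=0.5):
    """One update for a single pattern; returns the error E(n) before the update."""
    v_h, y_h, v_o, y_o = forward(w_hidden, w_out, x)
    e = [dk - yk for dk, yk in zip(d, y_o)]                   # output errors
    delta_o = [ek * phi_prime(vk) for ek, vk in zip(e, v_o)]  # output local gradients
    # Hidden local gradients use the output weights before they are changed;
    # index j + 1 skips the bias weight of each output unit.
    delta_h = [phi_prime(v_h[j]) * sum(delta_o[k] * w_out[k][j + 1]
                                       for k in range(len(w_out)))
               for j in range(len(w_hidden))]
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] += eta * delta_o[k] * y_h[j]   # eta * local gradient * input
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * x[i]
    return 0.5 * sum(ek * ek for ek in e)
```

Calling `backprop_step` repeatedly on the same pattern should drive the returned error down, which is a simple way to check that the signs of the updates are right.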

1.5.2 Projection Neural Network

Multilayer Feedforward (MLF) networks are capable of mapping any continuous

bounded function of N input dimensions to the network outputs by dividing the input space into regions

using hyperplanes. The locations of the hyperplanes are determined by the weights and

thresholds of the hidden layer nodes. Non-linear combinations of these hyperplanes can

bound regions by hyperplanes or curved surfaces, either open or closed. But to develop

these complex boundaries more regions are required. For large N, a large number

of hidden layer nodes, and therefore large networks, are required. On the other hand, there is

a different set of neural network architectures that train very fast but do not attempt to

minimize error. Networks such as Kohonen networks, Radial Basis Functions, and a few

others attempt to place prototypes within closed decision boundaries around training data

points and adjust their parameters. These networks place hyperspheres around

prototypes and adjust their radii.

A third set of neural networks has evolved that combines the ability to form closed

boundaries with error minimization. An example of this type of network is

the Projection network. This network can form both closed and open decision regions.

Training will cause closed boundaries to open if required and vice versa. Details of this

network will be presented in the next section. The advantage of this type of network lies

in its ability to initialize rapidly to a good starting point, which substantially speeds up

training.

The projection network is based on the concept of projecting the inputs to one higher

dimension to form a hypersphere, where the weight vectors will also lie on this

hypersphere. With a hidden layer, such networks are capable of forming either an open

or a closed region within the original input space. The idea of projecting inputs to a

higher dimension and using the inner product of the input and the weights to determine the

closeness of these two vectors has been used by other network architectures such as the

Radial Basis Functions. In the case of Radial Basis Functions (RBF), however, the

framework of clustering is used to form closed prototypes, whereas in the case of the

Projection Network these boundaries are formed within the original framework of back

propagation (discussed in the MLF section).

As mentioned earlier, the idea is to project the input vector from N dimensions onto a

higher dimension (N+1) by transforming the input vector x to x', subject to |x'|=R.

An example of such a transformation is

$$x' = R\left(\frac{h}{\sqrt{h^2 + |x|^2}},\; \frac{x}{\sqrt{h^2 + |x|^2}}\right)$$

Eq. 1
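The projection can be sketched as below. The function name `project` and the default values of `h` and `radius` are hypothetical; the code simply places the extra component first and rescales so that the projected vector has length R.

```python
import math


def project(x, h=1.0, radius=1.0):
    """Map an N-vector x to the (N+1)-vector R * (h, x) / sqrt(h^2 + |x|^2)."""
    scale = radius / math.sqrt(h * h + sum(c * c for c in x))
    return [scale * h] + [scale * c for c in x]


x_proj = project([3.0, 4.0], h=1.0, radius=1.0)
length = math.sqrt(sum(c * c for c in x_proj))   # equals the radius by construction
```

Inputs close to the origin land near the pole of the sphere, while inputs far from the origin land near its equator, so distance in the original space is encoded in the extra component.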

These projections then serve as inputs to an MLF network. Here h is the distance between

the origin of the original input space and the (N+1) space. One component of the projected input

vector lies along the extra dimension (N+1) and the others along the original dimensions;

therefore the weights that connect the (N+1) component to a hidden unit also lie along the

extra dimension (N+1). These weights are also constrained to |w|=R. Figure 1

shows the projection of a 2-dimensional space onto a

3-dimensional space. The vector X starts at the origin of the 2-dimensional space, and

the 3-dimensional vector X' connects the center of the sphere and the point X, and

extends the line to intersect the surface of the sphere. As X' lies in (N+1)-dimensional

space, the weight vector W' that connects the modified input X' to the hidden layer node is also

3-dimensional. The net input to the hidden layer node is,

$$\text{Net input to the hidden layer node} = \mathbf{w}' \cdot \mathbf{x}' - w_0 = \text{constant}$$

As described for MLF, the output produced by input vector x and weights w with bias or
threshold $w_0$ is passed through a non-linear function, often sigmoidal. The threshold
determines the location of the decision boundary formed and is proportional to the
distance from the origin to the decision surface or hyperplane.

[Figure: the N-D input space and the projection sphere of radius R, showing an input x and its projection x'.]

Figure: Projection transformation and the formation of boundary surfaces
(Wilensky and Manukian, 1992)


In (N+1) dimensions, each hidden layer node still draws a hyperplanar decision
boundary, which intersects the hypersphere around X'. This intersection is a circle around
X', and its position is determined by the threshold. Therefore, the projection of the
surface resulting from the intersection back onto the original N-dimensional space is a
function of the threshold. As shown in the projection transformation figure, if the
threshold is large, the resulting intersection circle is small and lies on one side of the
original 2-dimensional plane, so that its projection back onto the plane is a closed
boundary. If the threshold is small, the intersection circle approaches a great circle on the
sphere, and its projection back onto the 2-dimensional plane is a curve, an open boundary,
or a line. It is this ability of the projection network to form hyperplanar or hyperspherical
prototypes that allows the Logicon network to be initialized rapidly to a good starting
point. During learning, closed boundaries can become open boundaries and vice versa as
the weights and the thresholds are adjusted.
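The effect of the threshold on the boundary type can be illustrated numerically. This is only a sketch under stated assumptions: a 2-dimensional input plane, the extra component stored first, a node treated as active when its net input w'.x' - w0 is non-negative, and a region called closed when no point on a very large circle is active. All names and values (`project`, `is_active`, `region_is_closed`, the weight vector, the two thresholds) are hypothetical.

```python
import math

def project(x, h, R):
    """Project x onto the hypersphere of radius R; the extra component comes first."""
    norm = math.sqrt(h * h + sum(v * v for v in x))
    return [R * h / norm] + [R * v / norm for v in x]

def is_active(x, w, w0, h, R):
    """Hidden node is treated as active when the net input w'.x' - w0 is non-negative."""
    xp = project(x, h, R)
    return sum(a * b for a, b in zip(w, xp)) - w0 >= 0.0

def region_is_closed(w, w0, h, R, far=1e6, n_dirs=360):
    """Numerical test: the active region is bounded if no far-away point is active."""
    for k in range(n_dirs):
        a = 2.0 * math.pi * k / n_dirs
        if is_active([far * math.cos(a), far * math.sin(a)], w, w0, h, R):
            return False
    return True

h, R = 1.0, 1.0
# Weight vector with |w'| = R, tilted 0.3 rad away from the extra axis.
w = [R * math.cos(0.3), R * math.sin(0.3), 0.0]

closed_large = region_is_closed(w, w0=0.9, h=h, R=R)   # large threshold
closed_small = region_is_closed(w, w0=0.1, h=h, R=R)   # small threshold
inside = is_active([math.tan(0.3), 0.0], w, 0.9, h, R)  # point under the weight vector
print(closed_large, closed_small, inside)
```

With these particular values the large threshold yields a bounded active region and the small threshold an unbounded one, in line with the description above.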

The training of the projected weights is based on back-propagation, which minimizes the
error by changing the weight vector in the direction of maximum error decrease. In the
projection network, the weight vector is likewise moved in the direction of maximum error
decrease, but the change is constrained to be tangent to the hypersphere surface in order
to keep the weight vectors on the hypersphere. The change in weights is expressed as:

$$\delta \mathbf{w}' = \frac{\mathbf{w}'}{R} \times \left( \frac{\mathbf{w}'}{R} \times \alpha \nabla e \right)$$


where e is the error between the desired output and the output produced by the network,
$\nabla e$ is the error gradient with respect to the weights, and $\alpha$ is the gain. The
weights need to be normalized to have magnitude R to prevent the vectors from moving
away from the hypersphere.
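A tangent-constrained update of this kind can be sketched for the 3-dimensional case, where the cross product is defined. The sketch assumes the weight change is the double cross product of w'/R with the scaled gradient, followed by rescaling to magnitude R; the helper names and the gradient values are hypothetical. For a unit vector n, n x (n x g) equals (n.g)n - g, so the step is the negative of the gradient component tangent to the sphere.

```python
import math

def cross(a, b):
    """Cross product of two 3-D vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def tangent_step(w, grad, alpha, R):
    """Weight change (w/R) x ((w/R) x alpha*grad), assuming |w| = R."""
    n = [c / R for c in w]
    return cross(n, cross(n, [alpha * g for g in grad]))

def update(w, grad, alpha, R):
    """Apply the tangent step, then rescale so the weights have magnitude R again."""
    dw = tangent_step(w, grad, alpha, R)
    moved = [a + b for a, b in zip(w, dw)]
    scale = R / norm(moved)
    return [scale * c for c in moved]

R, alpha = 2.0, 0.1
w = [0.0, 0.0, 2.0]            # weight vector on the sphere
grad = [1.0, -2.0, 3.0]        # hypothetical error gradient
dw = tangent_step(w, grad, alpha, R)
w_new = update(w, grad, alpha, R)
print(dw, w_new)
```

The step has no component along w', and the rescaling returns the updated weights to the hypersphere.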

1.1.2 Modularity of Neural Network Models

A modular architecture allows decomposition and assignment of tasks to several
modules. Separate networks can therefore each be developed to solve a sub-task with the
best possible architecture, and the individual modules or building blocks may be
combined to form a comprehensive system. The modules decompose the problem into
two or more subsystems that operate on inputs without communicating with each other.
The outputs of the modules are mediated by an integrating unit that is not permitted to
feed information back to the modules (Jacobs, Jordan, Nowlan, and Hinton, 1991). The
modular architecture combines two learning schemes, supervised and competitive. The
supervised learning scheme is used to train the different modules of the network, and a
gating network operates in a competitive mode to assign different patterns of the task to a
module through a mechanism that acts as a 'mediator'.

Neural networks are commonly designed at the level of processing elements or units

which represent the finest level. Layers of processing elements are at a coarser level.

Adding networks adds an even coarser level to the classification. However, there may be

significant practical and theoretical advantages to be gained by considering modularity at


the network level. The advantages of modularizing are described in the following

section.

1.1.2.1 Why Modularize?

Proper Assignment of Tasks or Functions

Networks composing the modular architecture compete with each other and learn the

training patterns. As a result, they learn different functions or tasks by partitioning the

function into independent tasks and allocating a distinct network to learn each task. In

addition, the architecture allows the allocation of a topologically appropriate network to

each task. Often there is a natural way to decompose a complex task into a set of simple
tasks. For example, the non-linear function $y = |x|$ can be approximated either with a
neural network using one layer of hidden units or by assigning two networks, one for
each linear function when $x \ge 0$ and $x < 0$, each with a single linear unit without any
hidden units, and by setting an appropriate switching or gating mechanism. This example

shows that if the data supports a discontinuity in the function being described, then it may

be more effective to fit separate models on both sides of the discontinuity. Similarly, if a

function is simple in some region, then a global model could be fitted in that region

rather than approximating the function locally with a large number of local models. By

proper assignment of tasks, the network could be simplified in structure to perform the

same set of tasks. Therefore, modular structures perform local generalization by learning

patterns of a particular region. This ensures that the performance of a single module does

not affect the other modules of the structure. In this example, the task decomposition
simplified the structure by removing the need for hidden layers, thereby reducing
computational cost.
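The absolute-value decomposition can be sketched as two single linear units plus a hard switch. This is an illustration only: the weights (+1 and -1) and the switching rule are set by hand here rather than learned, and all function names are hypothetical.

```python
def module_pos(x):
    """Single linear unit for the region x >= 0: y = 1 * x."""
    return 1.0 * x

def module_neg(x):
    """Single linear unit for the region x < 0: y = -1 * x."""
    return -1.0 * x

def gate(x):
    """Hand-set switching mechanism: returns (g_pos, g_neg), exactly one of which is 1."""
    return (1.0, 0.0) if x >= 0 else (0.0, 1.0)

def modular_abs(x):
    """Combine the two linear modules through the gate."""
    g_pos, g_neg = gate(x)
    return g_pos * module_pos(x) + g_neg * module_neg(x)

samples = [-2.5, -1.0, 0.0, 0.5, 3.0]
outputs = [modular_abs(x) for x in samples]
print(outputs)
```

Neither module needs a hidden layer, because each one only has to represent a straight line on its own side of the origin.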

Speed of Learning

As the example demonstrates, a modular network made up of the two simpler networks
would train faster in the absence of hidden layer processing elements. Modularization is
able to take advantage of function decomposition and can also reduce the conflicting
information that tends to retard learning. In the literature, this is termed 'crosstalk', and it
may be spatial or temporal in nature (Jacobs, Jordan and Barto, 1986). In the MLF,
crosstalk may occur when backpropagation is applied to two or more outputs, as shown in
Figure 3-6.

[Figure: two panels, each with hidden units h1, h2, h3 and output units o1, o2.]

Figure 3-6. (a) MLF neural network (b) a modular equivalent

If the hidden unit $h_1$ in Figure 3-6(a) has positive weights to output units $O_1$ and
$O_2$, and the first output $O_1$ is 'too large' while the second output $O_2$ is 'too
small', then the backpropagation derivative information for $O_1$ will specify that the
hidden layer output should be smaller, while that for $O_2$ will suggest that the hidden
layer output should be larger. This conflict in derivative information is referred to as
spatial crosstalk. Modular architectures, as noted by Plaut and Hinton (1987), are immune
to spatial crosstalk.
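The conflict can be made concrete with a small numerical sketch. It assumes linear output units and a squared-error measure, neither of which is specified in the text, and the function name, weights and targets are hypothetical.

```python
def hidden_output_gradients(h1, weights, targets):
    """For a shared hidden output h1 feeding linear outputs o_k = w_k * h1,
    return d(0.5 * (o_k - t_k)^2) / d(h1) separately for each output unit."""
    grads = []
    for w_k, t_k in zip(weights, targets):
        o_k = w_k * h1
        grads.append((o_k - t_k) * w_k)
    return grads

# Both weights are positive; O1 (0.5) is above its target, O2 (0.5) is below its target.
grad_o1, grad_o2 = hidden_output_gradients(h1=1.0, weights=[0.5, 0.5], targets=[0.2, 0.9])
conflict = grad_o1 * grad_o2 < 0
print(grad_o1, grad_o2, conflict)
```

A positive gradient asks for a smaller hidden output and a negative one for a larger hidden output, so opposite signs indicate that the two output units pull the shared hidden unit in different directions.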

Besides spatial crosstalk, there may be temporal crosstalk. For example, if a network is

initially trained to learn a pattern, its hidden units become useful in performing that

function. But when another pattern is trained, the performance on the first pattern may

deteriorate. With backpropagation training, this is often overcome by adding more
hidden layer units. But hidden units are added at a computational cost in speed and
complexity. Use of modular networks may eliminate this need.

Representation

Modular structures can also provide a method of representation that is natural or easy to

understand. The modules can be viewed as 'building blocks' for more comprehensive and

complex tasks, where the idea is to 'divide and conquer'. This philosophy has long been

used in computer science, and in numerical methods such as finite element analysis. The

modular structure can provide a means of decomposition at a broad level which

suppresses the activation of a large number of processing elements, while activating a

smaller number of processing elements in a single module.

This structure also allows domain specific knowledge to be incorporated. For example,
Jacobs, Jordan and Barto (1990) in their work assign the 'what' and 'where' vision tasks
to two different multi-layer feedforward neural networks. Knowledge can


also be incorporated into the design of a structure by deciding on how to divide the input

information between the gating networks and the modules since there may be a natural

context of division. Another way to incorporate domain specific knowledge is in the

design when part of the functional properties may be known. For example, a linear or

non-linear portion may be identified, and therefore different networks may be designed to

possess different topologies, weights, activation functions, error functions, different input

variables, etc.

[Figure: Modules 1, 2, ..., N with outputs y1, y2, ..., yN; a gating network with outputs g1, g2, ..., gn (winner-take-all); input vector X; output vector Y.]

Figure 3-7. Modular Architecture

1.1.2.1.1 Algorithms

The problem of decomposing training cases into a set of sub-tasks was addressed by
Hampshire and Waibel (1989) for the case where these sub-tasks can naturally be
identified beforehand. Jacobs et al. (1991) first presented the idea of a system that learns
to allocate sub-tasks to different networks (experts or modules). The idea in this method is
that a gating mechanism encourages one module or network to be assigned to a subtask,
and that these modules perform their tasks locally, decoupled from one another.
Therefore no crosstalk or interference between the weights occurs, as shown in Figure
3-6(b). Hampshire and Waibel (1989) presented a method in which the modules were not
decoupled, since the final output was a linear combination of the outputs produced by the
individual modules. The interaction between modules caused several of the modules to be
used for a subtask. Jacobs and Jordan (1991), however, proposed another error function
to encourage competition between modules using the gating network. The gating network
makes a stochastic decision about which single expert to use on each occasion. Each
module works as a feedforward neural network, and all modules receive the same input
and have the same number of outputs. The gating network is also a feedforward network
and typically receives the same input as the other modules (Figure 3-7). The system
output y is the sum of the module outputs $\mathbf{y}_i$ weighted by the gating outputs $g_i$:

$$\mathbf{y} = \sum_{i=1}^{n} g_i \mathbf{y}_i \qquad \text{(Eq. 3-16)}$$
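The weighted combination in Eq. 3-16 can be sketched as follows. The function name `combine` and the module outputs and gating values are hypothetical; the second call uses a one-hot gating vector to show the winner-take-all limit, in which the system output is the output of a single module.

```python
def combine(gates, module_outputs):
    """System output y = sum_i g_i * y_i, where each y_i is a vector of equal length."""
    n_out = len(module_outputs[0])
    y = [0.0] * n_out
    for g_i, y_i in zip(gates, module_outputs):
        for k in range(n_out):
            y[k] += g_i * y_i[k]
    return y

# Three hypothetical modules, each with two outputs.
module_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
soft = combine([0.5, 0.25, 0.25], module_outputs)   # graded gating outputs
hard = combine([0.0, 1.0, 0.0], module_outputs)     # one-hot gating outputs
print(soft, hard)
```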

During training, the weights of all the networks are modified simultaneously, as in the
multilayer feedforward network. The training of the modules and of the gating network is
based on minimization of their error functions. The error function for the modules is,

$$J_y = \frac{1}{2} (\mathbf{y}^* - \mathbf{y})^T (\mathbf{y}^* - \mathbf{y})$$

where $\mathbf{y}^*$ is the desired output and $\mathbf{y}$ is the output of the system.


The error function of the gating mechanism is more elaborate. For each training pattern,
one module comes closer to producing the desired output than the others. If, for a given
training pattern, the system's performance is significantly better than in the past, the
weights of the gating network are adjusted to make the output of the winner increase
towards 1 and the outputs of the losers decrease towards 0. If the system's performance
does not improve, the gating weights are adjusted to move all of the outputs toward some
neutral value. The function that determines whether the performance of the model has
improved is,

$$\bar{J}_y(t) = \alpha J_y(t) + (1 - \alpha)\, \bar{J}_y(t-1)$$

where $0 < \alpha < 1$. This determines how rapidly the past values are forgotten by
exponentially weighting the average of $J_y$. The binary variables $\lambda_{WTA}$ and
$\lambda_{NT}$ indicate whether the performance has improved. That is,

$$
\begin{aligned}
&\text{If } J_y(t) < \gamma\, \bar{J}_y(t-1) \\
&\quad \text{Then } \lambda_{WTA} = 1 \text{ and } \lambda_{NT} = 0 \\
&\quad \text{Else } \lambda_{WTA} = 0 \text{ and } \lambda_{NT} = 1
\end{aligned}
$$

where γ is a multiplicative factor that determines how much less the current error must be

than the measure of the past.
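The switching rule and the running average can be sketched together. This sketch assumes that the average weights the current error by alpha and the previous average by (1 - alpha), and that the comparison uses the average from the previous step; the function names, the settings for alpha and gamma, and the error sequence are all hypothetical.

```python
def performance_flags(J_now, J_avg_prev, gamma):
    """Return (lambda_WTA, lambda_NT): winner-take-all mode if the current error is
    below gamma times the previous running average, neutral mode otherwise."""
    if J_now < gamma * J_avg_prev:
        return 1, 0
    return 0, 1

def update_average(J_now, J_avg_prev, alpha):
    """Exponentially weighted running average of the system error (assumed form)."""
    return alpha * J_now + (1.0 - alpha) * J_avg_prev

alpha, gamma = 0.5, 0.9      # hypothetical settings
J_avg = 1.0                  # hypothetical starting average
history = []
for J_now in [0.8, 0.85, 0.4]:
    flags = performance_flags(J_now, J_avg, gamma)
    J_avg = update_average(J_now, J_avg, alpha)
    history.append((flags, J_avg))
print(history)
```

In this run the second pattern does not beat the scaled average, so the rule falls back to the neutral mode for that step and returns to winner-take-all on the third.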


If the architecture's performance improves significantly, the module whose output is
closest to the desired output is determined. The error for the module is then defined as
the sum of squared error between the module's output $\mathbf{y}_i$ and the desired output
$\mathbf{y}^*$. The error is then,

$$J_{y_i} = \frac{1}{2} (\mathbf{y}^* - \mathbf{y}_i)^T (\mathbf{y}^* - \mathbf{y}_i)$$
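The per-module error and the choice of the closest module can be sketched as below. The function name `module_error`, the desired output and the module outputs are hypothetical values chosen for illustration.

```python
def module_error(y_star, y_i):
    """Half the sum of squared differences between desired output y* and module output y_i."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(y_star, y_i))

y_star = [1.0, 0.0]                                   # hypothetical desired output
module_outputs = [[0.5, 0.5], [1.0, 0.5], [0.0, 0.0]]  # hypothetical module outputs
errors = [module_error(y_star, y_i) for y_i in module_outputs]

# Index of the module whose output is closest to the desired output.
winner = min(range(len(errors)), key=errors.__getitem__)
print(errors, winner)
```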

The module that wins is the module with the smallest error. If network or module i wins,
then the desired value $g_i^*$ of the ith output of the gating network $g_i$ is set to 1, and
otherwise it is set to 0. If the system's performance does not improve significantly, that is
when $\lambda_{NT} = 1$, then the weights are adjusted so that the outputs of the gating
network move towards a neutral value 1/n, where n is the number of modules. The gating
network's error function is defined as,

$$J_G = \lambda_{WTA} \frac{1}{2} \sum_{i=1}^{n} (g_i^* - g_i)^2 + \lambda_{WTA} \frac{1}{2} \left(1 - \sum_{i=1}^{n} g_i\right)^2 + \lambda_{WTA} \frac{1}{2} \sum_{i=1}^{n} g_i (1 - g_i) + \lambda_{NT} \frac{1}{2} \sum_{i=1}^{n} \left(\frac{1}{n} - g_i\right)^2$$

where the first three terms of the gating error function contribute to the error when the
performance of the system improves, and the fourth term contributes when the
performance has not improved significantly. The first term is the sum of the squared
errors between the desired outputs and the actual outputs of the gating network. The
second term takes its smallest value when the outputs of the gating network sum to one,
the third term takes its smallest value when the outputs of the gating network are binary
valued, and the fourth term is the sum of the squared errors between the neutral value and
the actual outputs of the gating network. When the performance of the system does not improve
