Parameter settings for the neural networks (using Weka [64a]) were left at the default values and are as
follows:

Momentum: 0.5
Learning rate: 0.2
Weight initialisation: random (between ±0.5)

The number of hidden units used in the nets was chosen according to the number of input/output
units, based on the following criteria (as suggested in [44a]): at least one hidden unit per
output, at least one hidden unit for every ten inputs, and a minimum of five hidden units in total.
The number of epochs was based on the size of the problem (i.e. the number of examples
and parameters). The following guidelines were followed (both heuristics are sketched in code below):

Small problems (< 250 examples): 60-80 epochs
Mid-sized problems (250 to 500 examples): 40 epochs
Large problems (> 500 examples): 20 to 40 epochs
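
As a concrete illustration, the two heuristics above can be written down directly. The function names below are illustrative only (they are not part of Weka [64a]); the values returned are simply the minima and ranges implied by the stated criteria:

```python
def min_hidden_units(n_inputs, n_outputs):
    """Minimum implied by the criteria above: at least one hidden unit per
    output, at least one per ten inputs, and at least five overall."""
    return max(n_outputs, n_inputs // 10, 5)

def epoch_range(n_examples):
    """Epoch guideline used above, returned as a (low, high) range."""
    if n_examples < 250:
        return (60, 80)     # small problems
    elif n_examples <= 500:
        return (40, 40)     # mid-sized problems
    else:
        return (20, 40)     # large problems

# Breast cancer: 9 inputs, 1 output, 699 examples -> 5 hidden units, 20-40 epochs
print(min_hidden_units(9, 1), epoch_range(699))
```

These values agree with the settings listed for the Breast cancer dataset in Table 1.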


5.1.1.1 Phase One Results

Original question posed earlier in this paper: Is AdaBoost able to produce good classifiers when
using ANNs or RBFs as base learners?
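
For reference, the boosting procedure evaluated throughout this section is AdaBoost [27a, 28a]. The sketch below is a minimal two-class version with a generic base learner, using weighted resampling so that the base learner does not need to handle example weights directly. The experiments themselves used Weka [64a]; the `base_learner_factory` interface here is an assumption made purely for the sketch.

```python
import numpy as np

def adaboost_fit(X, y, base_learner_factory, n_rounds=100, rng=None):
    """Discrete AdaBoost for labels y in {-1, +1}, via weighted resampling.

    base_learner_factory() must return an object with fit(X, y) and
    predict(X) -> array of -1/+1 (interface assumed for illustration).
    """
    rng = np.random.default_rng(rng)
    n = len(y)
    w = np.full(n, 1.0 / n)            # example weights D_t(i)
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Draw a bootstrap-style sample according to the current weights.
        idx = rng.choice(n, size=n, replace=True, p=w)
        h = base_learner_factory()
        h.fit(X[idx], y[idx])
        pred = h.predict(X)
        eps = np.sum(w[pred != y])     # weighted training error on the full set
        if eps <= 0 or eps >= 0.5:     # stop if the learner is perfect or too weak
            if eps <= 0:
                learners.append(h); alphas.append(1.0)  # any positive weight will do
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        learners.append(h); alphas.append(alpha)
        # Increase the weights of misclassified examples, decrease the rest.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
    return learners, np.array(alphas)

def adaboost_predict(X, learners, alphas):
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```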

A summary of the datasets used in this phase of experimentation can be seen in Table 1, while the
test set error rates are shown in Table 2. It is clear from the results that applying AdaBoost to
neural networks offers better results than a single neural network, reducing the error rate for
almost all of the datasets. Figure 5.1 shows the percentage reduction in error for the AdaBoosted
neural networks, taking the single unboosted neural net as the baseline. For example, a single neural
network giving a test error of 10% and an AdaBoosted neural network giving a test error of 5%
would be displayed as a 50% reduction in error.
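
As a quick check of this metric, the relative reduction can be computed directly; the Letter figures used below are taken from Table 2:

```python
def percent_reduction(single_error, boosted_error):
    """Relative reduction in test error, taking the single net as the baseline."""
    return 100.0 * (single_error - boosted_error) / single_error

print(percent_reduction(19.0, 4.9))   # Letter: ~74.2%, i.e. "almost 75%"
print(percent_reduction(10.0, 5.0))   # worked example above: 50%
```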

The reduction in error is quite substantial in a couple of cases, improving the Letter dataset by
almost 75% (down from 19% to 4.9%) and the Segmentation dataset by almost 50% (down from
6.9% to 3.5%). However, it also appears that AdaBoost is responsible for an increase in error for
a couple of tests, increasing the error by just over 21% for both the Heart-Clev and Cancer
datasets. This increase may be a consequence of noise in the data causing AdaBoost to overfit.
The effects of noisy data are looked at in greater detail in Section 5.1.3.


Data Set         Instances   Output    Features             Neural Network
                             Classes   Cont.   Discrete     Inputs   Outputs   Hidden   Epochs
Breast cancer        699        2        9        -             9        1        5        20
Segmentation        2310        7       19        -            19        7       15        20
Credit-g            1000        2        7       13            63        1       10        30
Diabetes             768        2        9        -             8        1        5        30
Letter             20000       26       16        -            16       26       40        30
Promoters-936        936        2        -       57           228        1       20        30
Satellite           6435        6       36        -            36        6       15        30
Splice              3190        3        -       60           240        2       25        30
Heart-Clev           303        2        8        5            13        1        5        40
Hypo                3772        5        7       22            55        5       15        40
Sick                3772        2        7       22            55        1       10        40
Soybean              683       19        -       35           134       19       25        40
Vehicle              846        4       18        -            18        4       10        40

Table 1: Summary of the Datasets used in this paper.


                     Single NN            AdaBoosted NN
Dataset            Error      S.D.      Error      S.D.
Breast cancer        3.3       0.3        4.0       0.3
Segmentation         6.9       0.5        3.5       0.2
Credit-g            29.2       1.0       26.5       0.1
Diabetes            24.9       0.9       24.2       1.0
Letter              19.0       0.4        4.9       0.1
Promoters-936        5.2       0.6        4.6       0.4
Satellite           13.2       0.3       10.1       0.3
Splice               4.6       0.3        4.2       0.2
Heart-Clev          18.3       0.9       22.2       1.0
Hypo                 6.3       0.2        6.3       0.1
Sick                 5.8       0.6        4.5       0.3
Soybean              9.3       1.3        6.6       0.7
Vehicle             24.9       1.1       19.9       1.1

Table 2: Test set error rates (%) with standard deviations for
(a) a single MLP neural network and (b) an AdaBoosted neural network. The better (lower)
error for each dataset is the best result; the Hypo dataset is a draw.











































[Figure 5.1: horizontal bar chart of the change in test error (%), roughly -40% to +80%, one bar per dataset listed in Table 2]



Figure 5.1 Reduction in error for AdaBoosted neural network over that of
a single unboosted neural network. An increase in performance is shown by a bar to the
right of zero and a decrease in performance is shown by a bar to the left of zero.
AdaBoosted NNs produce a result of 10-1-2 (wins-draws-losses)
5.1.2 Phase Two

Having established that AdaBoost does in fact offer an improvement over an unboosted ANN, this
experiment aimed to determine whether altering the number of training epochs affects the
overall performance of the classifier. It has been noted [18a] that situations can arise whereby the
training error continues to decrease while the validation error starts to increase. This is known as
overfitting: the network becomes too finely tuned to the training patterns and, as a result,
generalisation suffers (see Figure 5.2).














Figure 5.2 Test and training error rate versus number of training iterations


At the time of carrying out this experiment I became aware of a similar test that had already been
carried out [55a], showing that an increase in the number of epochs had little effect on the
efficiency; as a result this experiment aims to replicate those results.

The dataset used in this experiment is a set of online handwritten digits [54a] collected using a
WACOM A5 tablet with a cordless pen to allow natural writing. Two hundred and three students
wrote down isolated digits that are divided into a learning set (1200 instances) and a test set
(830 instances). Figure 5.3 below shows the variety of writing styles contained within the
dataset.















Figure 5.3 Some examples of the on-line handwritten digits data set (test set) [55a]

The parameters for the neural networks were left at their default settings (again, using Weka
[64a]) and are as follows:

Momentum: 0.5
Learning rate: 0.2
Weight initialisation: random (between ±0.5)

Some simple pre-processing was applied to the dataset: characters were resampled to 11 points,
centred and size-normalised, giving an (x, y) coordinate sequence in [-1, 1]^22.
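
A minimal sketch of this kind of pre-processing is given below, assuming each raw character arrives as an (N, 2) array of pen coordinates. The arc-length resampling used here is an assumption made for illustration and is not necessarily the exact scheme of [54a]:

```python
import numpy as np

def preprocess_digit(stroke, n_points=11):
    """stroke: (N, 2) array of raw (x, y) pen positions.
    Returns a length-22 feature vector in [-1, 1]."""
    stroke = np.asarray(stroke, dtype=float)
    # Resample to n_points spaced evenly along the pen trajectory (arc length).
    seg = np.linalg.norm(np.diff(stroke, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])
    t = np.linspace(0.0, arc[-1], n_points)
    resampled = np.column_stack([np.interp(t, arc, stroke[:, 0]),
                                 np.interp(t, arc, stroke[:, 1])])
    # Centre on the character's mean position.
    resampled -= resampled.mean(axis=0)
    # Size-normalise so the coordinates lie in [-1, 1].
    scale = np.abs(resampled).max()
    if scale > 0:
        resampled /= scale
    return resampled.ravel()          # 11 points x 2 coordinates = 22 inputs
```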


5.1.2.1 Phase Two Results

Original questions posed earlier in this paper: Does altering the number of training epochs affect
the efficiency of the classifier when using ANNs or RBFs as base learners? Does altering the
number of hidden units have any effect?

Table 3 shows the results of the boosted multi-layer perceptrons with 10, 30 and 50 hidden units
trained for 100, 200, 500, 1000, 2000 and 5000 epochs combining 100 neural networks in each
case. The results for a fully connected unboosted MLP are included as a control.


Topology
22-10-10
22-30-10
22-50-10
Unboosted MLP

9.4
3.6
3.0
AdaBoosted MLP
100

3.5

2.1

2.2
200
3.0
2.0
2.0
500
2.9
1.7
1.7
1000
2.8
1.8
1.7
2000
2.8
1.8
1.6
5000
2.8
1.8
1.6

Table 3: Test error rates for AdaBoosted MLPs with differing topologies.
Unboosted MLP is included as a control.



The results obtained weren’t quite as good as previous experiments [54a] but displayed the same
characteristics in improving the generalisation error of the MLPs in all cases. As can be seen
from table 3 above, increasing the number of training epochs of individual classifiers had little
effect after about 500 epochs, although the test error did fall slightly for two out of the three
topologies. There was also a slight increase in error for the 22-30-10 network when the number
of epochs was increased past 500 (from 1.7% to 1.8%).

Figure 5.4 below shows the test error for the various network topologies. The test error of the
unboosted classifier is only included once (trained for 200 epochs) as increases in the number of
training rounds only produced small oscillations. For this reason it can be taken as a single
reading across the board for comparison with AdaBoosted neural nets.

There is a considerable improvement in generalisation error when increasing the number of
hidden units from 10 (22-10-10) to 30 (22-30-10) for all variations in training rounds. However,
moving from 30 to 50 hidden units offers a much smaller improvement in generalisation, and the
increase in time required to perform the tests negates the advantage unless time is of no
importance.



[Figure 5.4: test error (%) versus number of training epochs (unboosted, 100, 200, 500, 1000, 2000, 5000), one curve per topology: 22-10-10, 22-30-10, 22-50-10]





















Figure 5.4 Test error for three different network topologies using 100, 200, 500, 1000,
2000 and 5000 training epochs. A single reading for each topology is given for an unboosted neural
network as a comparison.





5.1.3 Phase Three

This section looks at the effects of introducing noise into the data. It took the longest to perform,
as it involved changing the settings on several occasions and required a lot of CPU time to process.

Freund and Schapire [28a] suggested that the occasional poor performance of boosting results
from overfitting of the data due to an over-emphasis of examples that are 'noisy', thus creating
poor classifiers. This argument seems particularly plausible for two reasons. The first is that their
method for updating the probabilities focuses on the hard-to-classify examples, which
may over-emphasise noisy instances within a dataset. The second is that classifiers are
combined using weighted voting and, as Sollich and Krogh [57b] have shown, optimising the
combining weights can lead to overfitting.

According to [48a], noisy data has at least one of the following properties: (a) overlapping class
probability distributions, (b) outliers and (c) mislabelled patterns. Since all three types of noise
appear often in data analysis a test is necessary to determine how effective AdaBoost would be in
real world applications.

In order to determine whether AdaBoost is noise robust, and to help explain why it does or
doesn't exhibit sub-optimal generalisation ability, a series of numerical simulations have been
performed on toy data with an asymptotic number (10⁴) of boosting steps. Training data was
generated from several (non-linearly transformed) Gaussian and uniform blobs⁵, which were
distorted by uniformly distributed noise U(0.0, σ²). All simulations were performed using 300 test
patterns and σ² set to 0%, 9% or 16%. RBF networks with adaptive centres are used as base learners
(cf. Appendix A.1 or [43b] for a detailed description).⁶
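
The actual generator is the toy-data code of Rätsch [48b] (see footnote 5). Purely to illustrate the recipe described above, a stand-in generator might look as follows; the specific blob placements, the non-linear transform and the reading of σ² as the amplitude of the uniform distortion are all assumptions of this sketch, not a reproduction of [48b]:

```python
import numpy as np

def toy_data(n, sigma2, rng=None):
    """Illustrative two-class 2-D toy generator: Gaussian and uniform blobs,
    a non-linear transform, then uniform distortion of level sigma2."""
    rng = np.random.default_rng(rng)
    y = rng.choice([-1, 1], size=n)
    # Gaussian blob for class +1, uniform blob for class -1.
    gauss = rng.normal(loc=[1.0, 0.0], scale=0.6, size=(n, 2))
    unif = rng.uniform(low=[-2.5, -1.5], high=[0.5, 1.5], size=(n, 2))
    X = np.where((y == 1)[:, None], gauss, unif)
    # Non-linear transform: bend the point cloud.
    X = np.column_stack([X[:, 0], X[:, 1] + 0.5 * X[:, 0] ** 2])
    # Distort the inputs with uniform noise U(0, sigma2).
    X += rng.uniform(0.0, sigma2, size=X.shape)
    return X, y

X_train, y_train = toy_data(300, sigma2=0.16)   # e.g. the 16% noise setting
```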


In carrying out these simulations AdaBoost is expected to increase the weights of outliers and
mislabelled patterns as they are the hardest to classify. Since the outliers and mislabelled patterns
are not representative of the whole set, they should mislead the classifier, thereby decreasing its
generalisation ability. The findings of this experiment are presented in the following section
(5.1.3.1).


⁵ A description of the generation of the toy data used in the simulations can be found in Appendix A.1 and is
based on code provided by Gunnar Rätsch [48b].

⁶ All simulations performed using RBF nets as base learners use the AdaBoost implementation written by
Gunnar Rätsch [48c].
5.1.3.1 Phase Three Results


Figures 5.5a-5.5c show the overfitting behaviour in the generalisation error as a function of the
number of iterations generated by AdaBoost (10⁴ iterations) using RBF networks (30 centres) in
the case of noisy data (300 patterns with σ² = 0%, 9% and 16% respectively). For the toy data
with 0% noise (σ² = 0%) the generalisation error suffers from a slight overfitting of the data, in
that the final error produced is not the lowest value obtained (although it is still decreasing after
10⁴ iterations). In contrast, the toy data containing 9% and 16% noise (σ² = 9% and 16%) displays
a large overfitting of the data, with a steady increase in generalisation error as the number of
iterations increases. This is in line with expectations, with the noisiest of the three simulations
displaying the worst case of overfitting.

















Figure 5.5a Figure 5.5b
















Figure 5.5c


Figures 5.5a-5.5c Generalisation error as a function of the number of iterations
(log scale) generated by AdaBoost (10⁴ iterations) using RBF networks (30 centres),
in the case of noisy data (300 patterns, σ² = 0%, 9% and 16% respectively)


In order to understand why AdaBoost overfits for higher noise levels it is important to understand
the margin distribution.

The first analysis of AdaBoost in connection with margin distributions was carried out by Schapire et al.
[52a]. It was stated that the reason for the success of AdaBoost over other ensemble learning
techniques (e.g. Bagging [4a]) is the maximisation of the margin. The authors observed
experimentally that AdaBoost maximises the margin of the patterns that are hardest to classify, i.e.
those that have the smallest margin, and that by increasing the minimum margin on a few patterns,
the margins of the other patterns are also reduced.
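
For reference, the margin under discussion is the usual normalised voting margin. For a pattern z_i = (x_i, y_i) with y_i ∈ {±1} and a combined hypothesis c built from base hypotheses h_t with weights b_t (the notation used in Eqs. (8)-(10) below), it can be written as

    \rho(z_i, c) \;=\; \frac{y_i \sum_{t} b_t\, h_t(x_i)}{\sum_{t} b_t} \;\in\; [-1, 1].

A positive margin means the weighted vote classifies the pattern correctly, and its magnitude measures how decisive that vote is.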

According to Rätsch et al. [48a, Eq. (7)], an AdaBoost-type algorithm will asymptotically (t → ∞)
generate margin distributions with a margin η which is bounded from below by

    \eta \;\ge\; \frac{\ln\big(\varepsilon^{-1}(1-\varepsilon)^{-1}\big) \,-\, \ln\big(\phi^{-1}(1-\phi)^{-1}\big)}
                     {\ln\big(\varepsilon^{-1}(1-\varepsilon)\big) \,+\, \ln\big(\phi^{-1}(1-\phi)\big)}          (7)

where ε = max_t ε_t, provided ε ≤ (1 − η)/2 is satisfied.

The interaction between φ and ε can be seen in Eq. (7): a small difference between φ and ε
will cause the right hand side of (7) to be small, and the smaller φ is, the more important the
difference becomes. Theorem 7.2 of Breiman [4b] also gives us the weaker bound η ≥ 1 − 2φ, so if
φ is small then η must be large. This means that choosing a small value for φ will result in a
larger margin on the training patterns. However, it also means that an increase in the complexity
of the base learner (i.e. an increase in the number of centres) will lead to an increased η, because
the error ε_t will decrease.

Rätsch et al. [48a, Appendix A.3] show that during the learning process of AdaBoost, the smallest
margin of the training patterns of each class will asymptotically converge to the same value, i.e.

    \lim_{t \to \infty} \, \min_{i:\, y_i = 1} \rho(z_i, c_t) \;=\; \lim_{t \to \infty} \, \min_{i:\, y_i = -1} \rho(z_i, c_t)          (8)

if the following assumptions are fulfilled:

1. the weight b_t of each hypothesis is bounded from below and above by

       0 \;<\; \gamma \;<\; b_t \;<\; \Gamma \;<\; \infty,          (9)

   and

2. the learning algorithm must (in principle) be able to classify all patterns to one class
   c ∈ {±1} if the sum over the weights of the patterns of class c is larger than a constant δ, i.e.

       \sum_{i:\, y_i = c} w(z_i) \;>\; \delta \;\;\Rightarrow\;\; h(x_i) = c \qquad (i = 1, \ldots, l).          (10)
By ensuring that the classifier is (in principle) able to misclassify every single pattern
(assumption 2), something like a bias term b is introduced which is automatically adjusted if the
smallest margins of one class are significantly different from the other class.

This means that the AdaBoost learning process converges, under rather mild assumptions, to a
solution where a subset of the training patterns asymptotically has the same smallest margin. In
effect AdaBoost becomes a hard competition case, whereby patterns with the smallest margin
get high weights and the other patterns are basically neglected in the learning process.
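
In equation form, the mechanism behind this hard competition is the standard AdaBoost example-weight update, written here for binary labels y_i ∈ {±1} and hypothesis weights b_t (up to the chosen normalisation of the b_t):

    w_{t+1}(i) \;=\; \frac{w_t(i)\, \exp\big(-b_t\, y_i\, h_t(x_i)\big)}{Z_t},
    \qquad
    Z_t \;=\; \sum_{j=1}^{l} w_t(j)\, \exp\big(-b_t\, y_j\, h_t(x_j)\big).

Iterating this update gives w_{T+1}(i) ∝ exp(−ρ(z_i, c) Σ_t b_t), so the patterns with the smallest margin come to dominate the weight distribution, which is exactly the hard competition described above.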

In order to support the theoretical analysis, more simulations were performed on the same toy data
as used previously. All simulations used 10⁴ iterations (φ = ½) and 300 patterns. The first test
was designed to see how changing the complexity of the base learner (RBF net) affected the
margin distribution. This was performed with constant noise (σ² = 16%) for 7, 13 and 30 centres
in the base hypotheses.

The second test was designed to see how changing the noise affected the margin distribution and
was performed with constant complexity (13 centres) for noise σ² = 0%, 9% and 16%. The
results can be seen in Figures 5.6 and 5.7 respectively.⁷























Figure 5.6 Margin distribution of AdaBoost for different complexities of base learner,
with 7 (blue), 13 (green) and 30 (red) centres, for constant noise σ² = 16%


⁷ Margin distribution graphs were generated using code kindly supplied by Gunnar Rätsch.



















Figure 5.7 Margin distribution of AdaBoost for different noise levels in the data,
with σ² = 0% (blue), σ² = 9% (green), and σ² = 16% (red)



Looking at the figures above, it becomes clear that the margin distribution asymptotically makes a
step at a specific margin size, and that a subset of the training patterns must have similar margins
corresponding to the minimal margin discussed earlier. It can be concluded that low complexity
in the base learner or high noise in the patterns causes a higher error ε_t and thus a smaller margin.
These results support the findings of Rätsch et al. [48a] and tie in nicely with previous work by
Schapire et al. [52a], Maclin and Opitz [41a] and others discussed in previous sections of this
paper (Section 3).
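
The margin distributions plotted in Figures 5.6 and 5.7 can be computed directly from any trained ensemble. A minimal sketch, reusing the illustrative adaboost_fit/adaboost_predict interface from Section 5.1.1.1 and assuming ±1 labels:

```python
import numpy as np

def margin_distribution(X, y, learners, alphas):
    """Normalised margins rho(z_i) = y_i * sum_t b_t h_t(x_i) / sum_t b_t,
    returned sorted so that plotting (margins, fraction) gives the
    cumulative margin distribution."""
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    margins = np.sort(y * votes / np.sum(alphas))
    fraction = np.arange(1, len(margins) + 1) / len(margins)
    return margins, fraction
```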

Interestingly, the asymptotic convergence of AdaBoost to a hard margin is comparable to the
solution obtained by the original SVM (support vector machine) approach of Boser et al. [2b].

5.2 Results Summary

We can now look back and provide a brief summary of the findings of this paper:


AdaBoost improves the generalisation ability of a base learner, quite substantially in some cases.
For neural networks, an increase in the number of hidden units results in an improvement in
test error. Altering the number of training epochs has a similar effect, although there
comes a point where the increase in time required to process the data outweighs the small gain in
efficiency.

Using RBF nets we were able to show that in the presence of noise AdaBoost suffers from
overfitting of the data, which complements the findings of others mentioned in this paper [52a,
56a, 4b, 41a]. AdaBoost aims to minimise a functional which depends on the margin distribution,
and this is done by means of a constrained gradient descent with respect to the margin. It was also
noted that, asymptotically, AdaBoost reaches a hard margin comparable to the one obtained by
the original SVM approach [2b].


5.3 Future Work

Since a hard margin is clearly a sub-optimal strategy in the presence of noise, future work could
investigate the possibility of implementing a regularised version of AdaBoost that introduces
mistrust in the data, to alleviate the distortions that outliers and mislabelled data can cause to the
margin distribution.

Appendix A.1

RBF nets with adaptive centres

The RBF nets used in the experiments are an extension of the method of Moody and Darken
[43ab], since centres and variances are also adapted (see also Bishop [2a] and Müller et al. [43b]).
The following explanation is taken directly from Rätsch et al. [43b]:

The output of the network is computed as a linear superposition of K basis functions

    f(x) \;=\; \sum_{k=1}^{K} w_k\, g_k(x),          (A.1)

where w_k, k = 1, ..., K, denotes the weights of the output layer. The Gaussian basis functions g_k are
defined as

    g_k(x) \;=\; \exp\!\left( -\frac{\lVert x - \mu_k \rVert^2}{2\sigma_k^2} \right),          (A.2)

where μ_k and σ_k² denote means and variances, respectively. In a first step, the means μ_k are
initialised with K-means clustering and the variances σ_k are determined as the distance between
μ_k and the closest μ_i (i ≠ k, i ∈ {1, ..., K}). Then, in the following steps, we perform a gradient
descent in the regularised error function (weight decay)

    E \;=\; \frac{1}{2} \sum_{i=1}^{l} \big(y_i - f(x_i)\big)^2 \;+\; \frac{\lambda\, l}{2} \sum_{k=1}^{K} w_k^2.          (A.3)

Taking the derivative of Eq. (A.3) with respect to the RBF means μ_k and variances σ_k we obtain

    \frac{\partial E}{\partial \mu_k} \;=\; -\sum_{i=1}^{l} \big(y_i - f(x_i)\big)\, \frac{\partial f(x_i)}{\partial \mu_k}
    \quad\text{with}\quad
    \frac{\partial f(x_i)}{\partial \mu_k} \;=\; w_k\, g_k(x_i)\, \frac{x_i - \mu_k}{\sigma_k^2}          (A.4)

and


    \frac{\partial E}{\partial \sigma_k} \;=\; -\sum_{i=1}^{l} \big(y_i - f(x_i)\big)\, \frac{\partial f(x_i)}{\partial \sigma_k}
    \quad\text{with}\quad
    \frac{\partial f(x_i)}{\partial \sigma_k} \;=\; w_k\, g_k(x_i)\, \frac{\lVert x_i - \mu_k \rVert^2}{\sigma_k^3}.          (A.5)

These two derivatives are employed in the minimisation of Eq. (A.3) by a conjugate gradient
descent with line search, where we always compute the optimal output weights in every
evaluation of the error function during the line search. The optimal output weights
w = [w_1, ..., w_K]^T can be computed in closed form, in matrix notation, by

    w \;=\; \big( G^{\top} G + \lambda\, l\, I \big)^{-1} G^{\top} y,
    \qquad \text{where}\quad G_{ik} = g_k(x_i),          (A.6)
and y = [y_1, ..., y_l]^T denotes the output vector, and I an identity matrix. For λ = 0, this corresponds
to the calculation of a pseudo-inverse of G.

So, we simultaneously adjust the output weights and the RBF centres and variances (see Fig. A.1
for pseudo-code of this algorithm). In this way, the network fine-tunes itself to the data after the
initial clustering step; of course, overfitting has to be avoided by careful tuning of the
regularisation parameter, the number of centres K and the number of iterations [2a]. In all
experiments λ = 10⁻⁶ and up to ten CG iterations are used.
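
A compact sketch of the whole procedure (K-means initialisation of the centres, nearest-centre initialisation of the widths, closed-form output weights, and gradient steps on the centres and widths) is given below. It simplifies the original in two ways: plain gradient steps with a fixed learning rate replace the conjugate gradient line search, and the regularisation term follows Eqs. (A.3) and (A.6) as reconstructed above. It is an illustration, not Rätsch's implementation [48c].

```python
import numpy as np

def rbf_fit(X, y, K=13, lam=1e-6, n_steps=10, lr=0.01, rng=None):
    """RBF net with adaptive centres (simplified sketch of Appendix A.1).
    X: (l, d) inputs, y: (l,) targets in {-1, +1}. Returns (mu, sigma, w).
    lr is an arbitrary illustration value, not taken from the original."""
    rng = np.random.default_rng(rng)
    l, d = X.shape

    # K-means initialisation of the centres.
    mu = X[rng.choice(l, size=K, replace=False)].copy()
    for _ in range(20):
        assign = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    # sigma_k = distance from mu_k to the closest other centre.
    dists = np.sqrt(((mu[:, None, :] - mu[None]) ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    sigma = dists.min(axis=1)

    def design(mu, sigma):
        # G_ik = g_k(x_i) = exp(-||x_i - mu_k||^2 / (2 sigma_k^2))   (Eq. A.2)
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    def output_weights(G):
        # Closed-form weights as in Eq. (A.6) (as reconstructed above).
        return np.linalg.solve(G.T @ G + lam * l * np.eye(K), G.T @ y)

    for _ in range(n_steps):
        G = design(mu, sigma)
        w = output_weights(G)
        r = y - G @ w                       # residuals y_i - f(x_i)
        diff = X[:, None, :] - mu[None]     # (l, K, d)
        # Gradients of Eq. (A.3) w.r.t. centres and widths (Eqs. A.4, A.5).
        dmu = -np.einsum('i,ik,ikd->kd', r, G * w,
                         diff / sigma[None, :, None] ** 2)
        dsig = -np.einsum('i,ik,ik->k', r, G * w,
                          (diff ** 2).sum(-1) / sigma[None, :] ** 3)
        mu -= lr * dmu
        sigma = np.maximum(sigma - lr * dsig, 1e-3)

    G = design(mu, sigma)
    return mu, sigma, output_weights(G)

def rbf_predict(X_new, mu, sigma, w):
    sq = ((X_new[:, None, :] - mu[None]) ** 2).sum(-1)
    return np.sign(np.exp(-sq / (2.0 * sigma ** 2)) @ w)
```

The default of 13 centres, λ = 10⁻⁶ and ten optimisation steps mirrors the settings quoted in the text.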
































Figure A.1 Pseudo-code description of the RBF net algorithm, which is used as the base
learning algorithm in the simulations with AdaBoost.

Appendix A.2

References

[1a] P. Baldi and K. Hornik. Neural networks and principal component analysis: learning
from examples without local minima. (1989, Neural Networks, 2.p.53-58).

[2a] C. Bishop. Neural Networks for Pattern Recognition. (1995, Oxford University Press)

[2b] B. Boser, I. Guyon and V. Vapnik. A training algorithm for optimal margin classifiers.
(1992, 5th Annual ACM Workshop on COLT, p.144-152).

[3a] H. Bourland and Y. Kamp. Auto-association by multi-layer perceptrons and singular
value decomposition. (1988, Biological Cybernetics, 59.p.291-294).

[4a] L. Breiman. Bias, variance, and Arcing classifiers. (1996, Technical Report 460,
Statistics Department, University of California at Berkeley)

[4b] L. Breiman. Prediction games and arcing algorithms. (1997, Technical report 504,
Statistics Department, University of California).

[5a] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants. (Machine Learning, to appear).

[6a] E. Baum and D. Haussler. What size net gives valid generalisation? (1989, Neural
Computation, 1(1).p.151-160)

[7a] Beale and Jackson. Neural Computing an introduction (1990, Institute of Physics
Publishing, Bristol and Philadelphia, p. 41)

[8a] L. Breiman. Arcing classifiers. (1998, Annals of Statistics, 26(3).p.801-849)

[9a] L. Breiman. Bagging predictors. (1996, Machine Learning, 24(2).p.123-140)

[10a] L. Breiman. Prediction games and arcing classifiers. (1997, Technical Report 504,
Statistics Department, University of California)

[11a] L. Breiman. J.H. Friedman, R.A. Olshen and C.J. Stone. Classification and Regression
Trees. (1984, Wadsworth & Brooks).

[12a] A.Collins. Evaluating the Performance of AI Techniques in the Domain of Computer
Games (2001, University of Sheffield, p. 10)

[13a] Darpa. Neural Network Study (1998, AFCEA International Press, p. 60)

[14a] T. Dietterich. An experimental comparison of three methods for constructing ensembles
of decision trees: Bagging, boosting, and randomisation. (1998)

[15a] T. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting
output codes. (Jan 1995, Journal of Artificial Intelligence Research, p.263-286)

[16a] Delve Datasets. Collections of data for developing, evaluating, and comparing learning
methods (www.cs.toronto.edu/~delve/data/datasets.html)

[17a] L. Devroye, L. Györfi and G. Lugosi. A probabilistic Theory of Pattern Recognition.
(1996, Springer)

[18a] H. Drucker. Combining Artificial Neural Nets: Ensemble and Modular Learning, ed:
Amanda J.C. Sharkey (1999, p. 51-77)


[19a] H. Drucker, C. Cotes, L. Jackel, Y. LeCun and V. Vapnik. Boosting and other ensemble
methods. (1994, Neural Computation, 6(6).p.1289-1301).

[20a] H. Drucker, R. Schapire, and P. Simard. Boosting performance in neural networks.
(1993, International Journal of Pattern Recognition and Artificial Intelligence, 7(4):p.
707-719)

[21a] H. Drucker and C. Cortes. Boosting decision trees. (1996, In NIPS*8, p. 479-485)

[22a] H. Drucker, R. Schapire, and P. Simard. Improving performance in neural networks
using a boosting algorithm. (1993, Advances in Neural Information Processing Systems
5, p. 42-49)

[23a] R. Duda and P. Hart. Pattern Classification and Scene Analysis. (1973, Wiley).

[24a] Y. Freund. Boosting a weak learning algorithm by majority. (1995, Information and
Computation, 121(2): 256-285)

[25a] Y. Freund. Data Filtering and Distribution Modelling Algorithms for Machine Learning.
(1993, PhD thesis, University of California, Santa Cruz).

[26a] Y. Freund and R. Schapire. Game theory, on-line prediction and boosting. (1996, In
Proceedings of the Ninth Annual Conference on Computational Learning Theory, p. 325-
332)

[27a] Y. Freund and R. Schapire. A decision theoretic generalisation of on-line learning and an
application to boosting. (1995, http://www.research.att.com/ ~schapire)

[28a] Y. Freund and R. Schapire. Experiments with a new boosting algorithm. (1996, In
Machine Learning: Proceedings of Thirteenth International Conference, p. 148-156)

[29a] V. Guruswami and A. Sahai. Multiclass learning, boosting, and error-correcting codes.
(1999, In Proceedings of the Twelfth Annual Conference on Computational Learning
Theory, p.145-155).

[30a] K. Hornik, M. Stinchcombe and H. White. Multilayer feedforward networks are
universal approximators. (1989, Neural Networks, 2.p.359-366).

[31a] W. Iba and P. Langley. Polynomial learnability of probabilistic concepts with respect to
the Kullback-Liebler divergence. (1992, In Machine Learning: Proceedings of the Ninth
International Conference, p.233-240).

[32a] J. Jackson and M. Craven. Learning sparse perceptrons. (1996, In Advances in Neural
Information Processing Systems 8, p.654-660).

[33a] M. Kearns and L. Valiant. Learning Boolean formulae or finite automata is as hard as
factoring. (1988, Harvard University Aiken Computation Laboratory. Technical Report
TR-14-88).

[34a] M. Kearns and L. Valiant. Cryptographic limitations on learning Boolean formulae and
finite automata. (1994, Journal of the Association for Computing Machinery, 41(1):p.
67-95).

[35a] M. Kearns and U. Vazirani. An introduction to Computational Learning Theory. (1994,
MIT Press).

[36a] T. Kohonen. Self-Organization and Associative Memory. (1989, Springer, New York)

[37a] M. Kramer. Nonlinear principal components analysis using auto-associative neural
networks. (1991, AIChe Journal, 37.p.233-243)

[38a] N. Littlestone. Learning when irrelevant attributes abound: A new linear-threshold
algorithm. (1988, Machine Learning, 2.p.285-318).

[39a] N. Littlestone and M. Warmuth. The weighted majority algorithm. (Oct 1989, In 30th
Annual Symposium on Foundations of Computer Science, p.256-261).

[40a] D. Livingstone and D. Salt. Neural networks in the search for similarity and structure-
activity. (1995, In Molecular Similarity in Drug Design, p.187-214)

[41a] R. Maclin and D. Opitz. An Empirical Evaluation of Bagging and Boosting. (1997, In
The Fourteenth National Conference on Artificial Intelligence, p.546-551).

[42a] D. Margineantu and T. Dietterich. Pruning adaptive boosting. (1997, In Machine Learning:
Proceedings of the Fourteenth International Conference, p.211-218).

[43a] C. Merz and P. Murphy. UCI repository of machine learning databases (1998,
www.ics.uci.edu/~mlearn/MLRepository.html.)

[43ab] J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units.
(1989, Neural Computation, 1(2).p281-293).

[43b] K. Müller, A. Smola, G. Rätsch, B. Schölkopf, J.Kohlmorgen and V. Vapnik. Using
support vector machines for time series prediction. (1998, Advances in Kernel Methods
– Support Vector Learning, MIT Press).

[44a] D. Opitz, R. Maclin. Popular Ensemble Methods: An Empirical Study. (1999, Journal of
Artificial Intelligence Research 11, p.169-198)

[45a] M. Orr. Introduction to Radial Basis Function Networks. (1996, Centre for Cognitive
Sciences, University of Edinburgh).

[46a] J. Quinlan. Bagging, Boosting and C4.5. (1996, In Proceedings of Fourteenth National
Conference on Artificial Intelligence)

[47a] J. Quinlan. C4.5: Programs for Machine Learning. (1993, Morgan Kaufmann).

[48a] G. Rätsch, T. Onoda, K. Muller. Soft margins for AdaBoost. (2000, Machine Learning,
p.1-35).

[48b] G. Rätsch. Code for generating toy data. (Available at
http://www.first.gmd.de/~raetsch/data/banana.txt).

[48c] G. Rätsch. AdaBoost-Reg Implementation. (Available from
http://ida.first.gmd.de/homepages/raetsch/abrsurvey.html).

[49a] D. Rumelhart and J. McClelland. Parallel Distributed Processing, Volume 1. (1986,
MIT Press, Cambridge).

[50a] R. Schapire. The strength of weak learnability. (1990, Machine Learning, 5(2): 197-227)

[51a] R. Schapire. The Design and Analysis of Efficient Learning Algorithms. (1992, MIT
Press).

[52a] R. Schapire, Y. Freund, P. Bartlett and W. Lee. Boosting the margin: a new explanation
for the effectiveness of voting methods. (1997, In Proceedings of the 14th International
Conference on Machine Learning, p.322-330)

[53a] R. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated
predictions. (1998, In proceedings of the Eleventh Annual Conference on Computational
Learning Theory, p.80-91).

[54a] H. Schwenk and M. Milgram. Online handwritten digits. (1996, Paris 6 University).

[55a] H. Schwenk and Y. Bengio. Boosting Neural Networks. (paper No 1806, to appear in
Neural Computation)

[56a] H. Schwenk and Y. Bengio. Training methods for adaptive boosting of neural networks.
(1998, In Advances in Neural Information Processing Systems 10, p.647-653).

[57a] H. Schwenk & Y. Bengio. Adaptive Boosting of Neural Networks for Character
Recognition (1997, Departement d’Informatique et Recherche Operationnelle, University
de Montreal, p. 1)

[57b] P. Sollich and A. Krogh. Learning with ensembles: How over-fitting can be useful.
(1996, Advances in Neural Information Processing Systems, Vol 8, p.190-196)

[58a] Statlog. Machine Learning Databases. (www1.ics.uci.edu/pub/machine-learning-
databases/statlog/ )

[59a] L.G. Valiant. A theory of the learnable. Communications of the ACM (November 1984,
27(11):1134-1142).

[60a] L.G. Valiant. Learning is Computational. (Oct 1997, Knuth Prize Lecture, In 38th Annual
Symposium on Foundations of Computer Science)

[61a] V. Vapnik. Estimation of Dependencies Based on Empirical Data. (1982, Springer-
Verlag).

[62a] V. Vapnik. The Nature of Statistical Learning Theory. (1995, Springer)

[63a] V. Vovk. A game of prediction with expert advice. (1998, Journal of Computer and System
Sciences, 56(2).p.153-173)

[64a] Waikato Environment for Knowledge Analysis (1999-2000, University of Waikato, New
Zealand. http://www.cs.waikato.ac.nz/~ml/index.html)

