International Journal of Geographical Information Science, 2003,
vol. 17, no. 1, 69–92
Review Article
Is inductive machine learning just another wild goose
(or might it lay the golden egg)?
MARK GAHEGAN
GeoVISTA Center, Department of Geography, The Pennsylvania State
University, 302 Walker Building, University Park, PA 16802, USA;
email: mng1@psu.edu
(Received 26 November 2001; accepted 29 April 2002)
Abstract. The research reported here contrasts the roles, methodologies and
capabilities of statistical methods with those of inductive machine learning
methods, as they are used inferentially in geographical analysis. To this end,
various established problems with statistical inference applied in geographical
settings are reviewed, based on Gould’s (1970) critique. Possible solutions to the
problems outlined by Gould are suggested via reviews of: (i) improved statistical
methods, and (ii) recent inductive machine learning techniques. Following this,
some newer problems with inference are described, emerging from the increased
complexity of geographical datasets and from the analysis tasks to which we put
them. Again, some solutions are suggested by pointing to newer methods. By way
of results, questions are posed, and answered, relating to the changes brought
about by adopting inductive machine learning methods for geographical analysis.
Specifically, these questions relate to analysis capabilities, methodologies, the role
of the geographer and consequences for teaching and learning. Conclusions argue
that there is now a strong need, motivated from many perspectives, to give
geographical data a stronger voice, thus favouring techniques that minimize the
prior assumptions made of a dataset.
International Journal of Geographical Information Science
ISSN 1365-8816 print/ISSN 1362-3087 online © 2003 Taylor & Francis Ltd
http://www.tandf.co.uk/journals
DOI: 10.1080/13658810210157778
1. Introduction
In his famous article critiquing the use of inferential statistics, ‘Is statistix inferens
the geographical name for a wild goose?’, Peter Gould (1970) lays bare the many
premises upon which inferential statistical analysis is founded, alternately
questioning their validity and the blind faith placed in them by geographers. These
questions are revisited here in the light of a digital revolution that is providing
torrents of data where once was only a trickle (Miller and Han 2001). Consequently,
we are confronted with the difficulty of scaling up our analysis to embrace datasets
that are both voluminous in terms of numbers of records or samples represented (n),
and deep in terms of the number of separate attribute dimensions over which data
are gathered (p). As well as making additional demands on existing analysis methods,
these datasets also generate the need for new types of analysis procedure, to support
exploration, mining and knowledge discovery (Buttenfield et al. 2001, Gahegan et al.
2001). It is not always clear that traditional statistical techniques can address these
new challenges, and where they can, there may be severe consequences in terms of
computational burden, significance testing, demands for sample data and so forth.
Openshaw and Openshaw (1997, p. 3) describe the current situation thus: ‘Sadly,
nearly all of the available methods for analysis, modelling and processing to extract
value date from an earlier period of history where data were scarce and the analyst
had to rely on his or her intuitive skills aided by an intimate knowledge of what
little information was available to formulate analysis tasks’.
Within the domain of geographical analysis, the use and capabilities of traditional
inferential statistics are here contrasted with an alternative form of computational
inference based on inductive machine learning. The discussion is restricted to
inference used for predicting some unknown characteristics or properties, as opposed to
the identification of underlying processes or models. The latter is possible also with
machine learning, for example by utilizing tools to automatically construct Bayesian
Belief Networks, but falls outside the scope of this paper. Philosophically, statistical
inference and machine learning (ML) are based, to differing extents, around a style
of inference known as induction, which allows the analyst to infer some generic outcomes
from specific examples, to wit: ‘By induction, we conclude that facts, similar to
observed facts, are true in cases not examined’ (Peirce 1878). This contrasts with
deduction, in which facts are asserted as true by computation against some a priori
model. Section 2 below describes the process of inductive inference in detail.
Machine learning and inferential statistics typically differ in their use of prior
knowledge. Inferential statistics uses observations to condition (shape) the form of a
distribution model that is usually provided by the analyst. This prior assumption
represents a self-imposed limit in terms of model complexity and the ability to adapt
to the data. By contrast, many machine learning techniques construct a distribution
model using evidence gleaned from the data alone, i.e. they are data-driven. This
difference leads to major methodological disparities affecting training, accuracy
analysis, goodness of fit and significance testing. Thus it can appear at first glance
that these two types of inference are for quite different purposes, yet we see a growing
trend to employ neural, genetic and rule-based induction methods in place of more
traditional forms of geographic analysis (Benediktsson et al. 1990, Byungyong and
Landgrebe 1991, Lees and Ritman 1991, Civco 1993, Openshaw 1993, Fisher 1994,
Yoshida and Omatu 1994, Paola and Schowengerdt 1995, Foody et al. 1995, German
and Gahegan 1996, Friedl and Brodley 1997, Fischer and Leung 1998, Bennett et al.
1999, Openshaw and Abrahart 2000). The reasons for this are largely concerned
with practicality.
Firstly, we can substitute a model that must be provided beforehand for a learned
model that is derived when needed from sample data. This can lead to greater
flexibility, and less reliance on expert knowledge for configuration. Such
flexibility may well prove crucial; as geographers integrate ever more data to study
complex phenomena such as human-environment interaction or population
demographics and epidemiology, the difficulties in specifying a reliable model in advance
rise accordingly. Discovering (or inducing) such a model from a limited set of
observations may provide a practical alternative.
Secondly, in many complex systems with non-axiomatic components, models
may either be too elaborate to define or else too susceptible to variation in
preconditions; for example, data gathered from a different place requires a different model.
Gould points out (p. 444) that a geographer should expect this latter problem since:
‘...all phenomena of interest to the geographer are never independent in the
fundamental dimensions of his enquiry’. We must then decide if this interdependence can
be expressed axiomatically (cf. spatial regression or autocorrelation, Cressie 1993)
or whether a more adaptive approach is needed instead.
Statistical research has had an influence on geography that is both broad and
deep, shaping the way analysis is conducted (and how systems are understood and
communicated) and having itself been shaped by many researchers who have revised
and refined techniques to better suit the nature of geographical space (Moran 1948,
Ord 1975, Getis and Boots 1978, Anselin 1988, Kulldorff 1999). We now turn
attention to the potential for inductive machine learning to do likewise. Two general
questions are examined in this regard:
1. How might inductive machine learning change the way we conduct
geographical analysis?
And at a deeper level:
2. How does inductive machine learning change the way we conceptualize and
describe geographical systems?
It is not my intention (and neither was it Gould’s) to dismiss inferential statistics
as inadequate or to insinuate that its day has passed. Research in spatial statistics
has made huge progress in the last couple of decades, starting from a number of
disparate breakthroughs across a variety of fields and weaving the many separate
strands together into a cohesive body of knowledge that can be brought to bear
across a wide range of problems (Diggle 1983, Isaacs and Srivastava 1989, Haining
1990, Cressie 1993, Lawson 2001). In my opinion it is needed more than ever. It is
my intention, however, to show that there now exists a range of geographical problems
and datasets that require us to reassess the methods of analysis that are best suited.
Over the same timeframe, the machine learning community has made equally vast
strides, progressing from rule-based, deductive approaches to sophisticated concept
learning and function optimization methods (Stewart et al. 1994, Mitchell 1997,
Luger and Stubblefield 1998, Bremaud 1999) that hold great potential for a wide
range of geographical problems.
Bailey (1994) provides a very useful overview of the progress that spatial statistics
has made, including a taxonomy of the methods and approaches that have developed.
In the same article, Bailey also refers to some of the (then) more radical approaches
sanctioned by Openshaw (1991) that are more in line with machine learning than
statistics, correctly pointing out (at that time) that they carry their own set of
problems, are too computationally demanding and that they are ‘...not yet developed
to the stage where they are widely applicable’. In the intervening time, the problems
alluded to have been more thoroughly investigated (Openshaw and Openshaw 1997,
Kanellopoulos and Wilkinson 1997, Gahegan et al. 1999) and are touched upon
later; the computational performance issues, relevant then, have been largely overcome
(Moller 1993, Birkin et al. 1995, Fischer and Staufer 1999); and the applicability, as
argued convincingly by Miller and Han (2001) and Buttenfield et al. (2001), arises
from the data and applications we are now faced with.
Hence it is time to revisit this debate. We do so by first examining the progress
made by statistics and machine learning that relate to Gould’s original critique (§3),
following from which some additional problems are described, arising mainly from
the wealth and richness of datasets now routinely available and the corresponding
complexity of the questions currently being pursued in our efforts to understand the
Earth’s intricate systems (§4). Taken together, these difficulties form the motivation
for expanding our arsenal of inferential tools to include machine learning methods.
By doing so we are able to discard some problematic underlying assumptions. But
we must also modify and declare some in addition, all of which have a direct impact
on the questions we can investigate, the methodology we must use and our
interpretation of the results produced (§5). The conclusions present a summary of the findings
and outline the major research themes still to be addressed in this arena.
2. The process of inductive inference
Figure 1 depicts the inductive process, beginning with a set of observations {X},
each consisting of a value $x$ (univariate case) or vector of values $x_1, x_2, \ldots, x_p$
(multivariate case) and an outcome or target ($y$), drawn from the set {Y}. During
learning or training a function is constructed that maps inputs X to desired outcomes
Y ($X \to Y$); this is referred to as a mapping function, or target function (V). The first
stage in an inductive methodology is then to acquire this mapping function (figure 1(a)).
In machine learning, it is learned directly from a limited set of examples; in statistical
inference it takes the distributional form chosen by the analyst, which may require
some parameterization that is calibrated from the data. The second stage is a
generalization step, where the acquired function is applied to a (usually much larger) dataset K
($X \subset K$), for which Y is unknown and must be predicted (figure 1(b)). Although not
shown in the figure, in the ML case it is possible for Y to also be a vector, signifying
the learning of two or more objective functions simultaneously.
Figure 1. The inductive learning methodology. (a) The target function (V) is learned from
examples, and (b) then applied to predict unknown values.
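The two stages of the inductive methodology can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a method taken from the literature reviewed here: the learner is a one-dimensional threshold rule, class 1 is assumed to lie above the threshold, and the names `learn_mapping`, `X`, `Y` and `K` are hypothetical, chosen to mirror the notation above.

```python
from typing import Callable, List


def learn_mapping(X: List[float], Y: List[int]) -> Callable[[float], int]:
    """Stage (a): acquire a mapping function from labelled examples.

    Illustrative assumption: a single threshold on x, with class 1 above it.
    """
    values = sorted(set(X))
    best_threshold, fewest_errors = values[0], len(X) + 1
    # Candidate thresholds are the midpoints between neighbouring values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2.0
        errors = sum(int(x > threshold) != y for x, y in zip(X, Y))
        if errors < fewest_errors:
            best_threshold, fewest_errors = threshold, errors
    return lambda x: int(x > best_threshold)


# Training examples: inputs X with known outcomes Y.
X = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
Y = [0, 0, 0, 1, 1, 1]
mapping = learn_mapping(X, Y)

# Stage (b): generalize to observations whose outcome is unknown.
K = [0.5, 4.0, 6.0, 10.0]
predictions = [mapping(x) for x in K]
```

The function returned by `learn_mapping` plays the role of the acquired mapping; nothing about its form was fixed beyond the (assumed) choice of a threshold rule.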
2.1. Learning as a search process
Many of the tasks undertaken in conventional analysis or modelling can be
tackled inductively by recasting them in terms of a search problem, whether it be
for the identification of suitable parameters for configuring a statistical function
(calibration), or for the construction of useful functions themselves to form into
more complex models (Openshaw and Openshaw 1997). Classification too can be
expressed as a search for discriminant functions or characterizing distributions that
demarcate a category in feature space. In many forms of learning, the number of
possible states to be searched through is prohibitively large, so stochastic
approximation methods are used to avoid exhaustive enumeration (Stewart et al. 1994,
Mitchell 1997). Stochastic search uses the idea of a performance metric (such as
predictive error or explanatory power) that can be calculated for each possible state
the tool can take. These states may be conceptualized as comprising a surface (usually
a hypersurface), where the lowest point represents the best configuration. The aim
is to iteratively move towards this point of least error, but bearing in mind that an
exhaustive search (enumerating the performance metric for each point on the surface)
is computationally intractable. ML techniques differ as to how this search is
performed (Sonka et al. 1993 and Openshaw and Openshaw 1997 give further details).
A feed-forward neural network with back propagation, for example, employs a
neighbourhood search on the error surface, and at each iteration the centroid of this
neighbourhood is moved in the direction offering the largest apparent performance
improvement (Benediktsson et al. 1993). By contrast, decision trees use an information
gain measure to find a new decision rule that, when added, contributes the most
to the desired outcome (Hunt et al. 1966, Quinlan 1993). In both cases the search
terminates either after a predetermined number of iterations, or when the
performance gain falls below some threshold. Consequently, it is not possible to say if
the solution found is indeed the optimal choice, but instead we must establish its
superiority through application (§3.6).
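The search idea can be illustrated with a deliberately simplified sketch. It is not back propagation and it is deterministic rather than stochastic; it only assumes an error surface that can be evaluated at any point, examines a small neighbourhood around the current position, moves to the neighbour with the lowest error, and stops under the two conditions named above. The function names, the step size and the toy data are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def neighbourhood_search(
    error: Callable[[List[float]], float],
    start: List[float],
    step: float = 0.25,
    max_iterations: int = 1000,
    min_gain: float = 1e-9,
) -> Tuple[List[float], float, int]:
    """Move repeatedly to the best neighbouring point on the error surface."""
    current = list(start)
    current_error = error(current)
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        # Neighbourhood: one step up or down along each parameter axis.
        neighbours = []
        for axis in range(len(current)):
            for delta in (-step, step):
                candidate = list(current)
                candidate[axis] += delta
                neighbours.append(candidate)
        best_error, best = min(
            ((error(c), c) for c in neighbours), key=lambda pair: pair[0]
        )
        # Stop when the performance gain falls below the threshold.
        if current_error - best_error < min_gain:
            break
        current, current_error = best, best_error
    return current, current_error, iterations


# Toy error surface: squared error of the line y = a + b*x against three points
# generated from y = 1 + 2x (an assumed example, not data from the paper).
xs = [-1.0, 0.0, 1.0]
ys = [-1.0, 1.0, 3.0]


def squared_error(params: List[float]) -> float:
    a, b = params
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys))


solution, final_error, used = neighbourhood_search(squared_error, [0.0, 0.0])
```

Only a handful of points on the surface are ever evaluated, which is the practical appeal of such a search; the price is that the result depends on the starting point and step size, and on less benign surfaces it may settle somewhere other than the global minimum.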
Once constructed, the model can be tested by requiring it to infer outcomes for
cases where Y is already known, but is withheld; its effectiveness at doing so gives
one measure of the inferential accuracy of the learned model (see further details
in §3.5). Practically speaking, X and Y may be discrete or continuous, since statistical
and inductive learning methods have been developed to operate across the full range
of statistical scales.
2.2. Constructing the mapping function
As described previously, the major difference between statistical and machine
induction is the degree to which a priori knowledge is used in the learning phase.
In statistical methods, the form of the mapping function used is specified
beforehand, for example a straight line, $y = a + bx$, or a Gaussian curve
$n(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left[(x-\mu)/\sigma\right]^{2}}$
with the parameters ($a$ and $b$ in the former case,
$\mu$ and $\sigma$ in the latter) derived from the presented data. In machine learning, an
iterative process is used to approximate the desired outcomes, usually involving
many simple components working together to construct the required mapping
function in a piecewise form. Thus the overall function is highly parameterized, being
constructed from a number of more primitive functions that are summed together
(e.g. hyperplanes in a neural network) or arranged in a hierarchy (e.g. decision rules
in a decision tree) so as to operate cohesively. The learning capacity of the tool is
governed by the number of these small functions used, and the mechanisms by which
they are combined.
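The contrast between a pre-specified form and a piecewise construction can be made concrete with a small sketch. It assumes a toy V-shaped relationship and uses a piecewise-constant model as a stand-in for the primitive functions of a neural network or decision tree; neither the data nor the helper names (`fit_line`, `fit_piecewise`) come from the paper.

```python
from typing import Callable, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fixed form y = a + b*x: only the parameters a and b come from the data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - b * mean_x, b


def fit_piecewise(xs: List[float], ys: List[float], pieces: int) -> Callable[[float], float]:
    """Piecewise-constant model: one simple function per equal-width interval of x.

    Assumes xs are not all identical (otherwise the interval width is zero).
    """
    low, high = min(xs), max(xs)
    width = (high - low) / pieces

    def piece_of(x: float) -> int:
        return min(max(int((x - low) / width), 0), pieces - 1)

    members: List[List[float]] = [[] for _ in range(pieces)]
    for x, y in zip(xs, ys):
        members[piece_of(x)].append(y)
    overall = sum(ys) / len(ys)  # fallback level for intervals with no examples
    levels = [sum(m) / len(m) if m else overall for m in members]
    return lambda x: levels[piece_of(x)]


def sum_squared_error(model: Callable[[float], float], xs: List[float], ys: List[float]) -> float:
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys))


# A V-shaped relationship that no single straight line can follow.
xs = [float(x) for x in range(-4, 5)]
ys = [abs(x) for x in xs]

a, b = fit_line(xs, ys)
line_error = sum_squared_error(lambda x: a + b * x, xs, ys)
coarse_error = sum_squared_error(fit_piecewise(xs, ys, 4), xs, ys)
fine_error = sum_squared_error(fit_piecewise(xs, ys, 8), xs, ys)
```

Increasing `pieces` raises the learning capacity and lowers the error on the training examples; whether that also improves predictions on unseen data is a separate question that has to be tested on withheld cases.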
In many ML methods there is no requirement for the same overall functional
form to be used throughout the entire range of the data, nor indeed to assume that
just one functional form is adequate. Thus, irregular and multimodal distributions
cause no additional complications, provided enough learning capacity is available
in the tool, since they can be constructed by the piecewise combination of more
primitive functions. The additional flexibility is very useful in situations where
relationships between variables are complex and/or unknown.
2.3. Assumptions and testing
Clearly, statistical inference requires the assumption that the expert-supplied
function is suitable for the problem. This assumption can be tested with a goodness
of fit statistic (Walpole and Myers 1989, p. 344), which is a measure of distance
between the observed values and the function used to describe their situation. It
does not establish that the function is somehow the ‘right’ one, but merely provides
a metric by which alternatives may be ranked. The ML method requires a different
set of assumptions, namely that there is sufficient expressive power available (via the
summed primitive functions) and that a good parameterization of these functions
can be found (via the stochastic search). Goodness of fit measures make no sense
for an ML method, since the data distribution is not assumed. Instead, the learned
model must be validated by the quality of its outcomes, as described above.
Both statistical and machine learning methods use a generalization step, thereby
assuming that a finite set of values (the sample) is sufficient to build an effective
general model. In this sense, both employ induction, though clearly the ML
methods rely on induction to a larger extent, having greater capacity to adapt to
the presented data.
3. Old problems with the use of inferential statistics
The original argument made by Gould catalogues problems with statistical
inference according to the validity of certain underlying assumptions. By making
these assumptions and fixing certain properties the analyst can concentrate on those
data characteristics she wishes to study and ignore all other aspects. Some
assumptions are made to simplify the mathematics, others might be reasonable given certain
circumstances. Note that these problems are not so much a consequence of bad
underlying theories as they are a result of careless or thoughtless application in a
geographic setting; they arise when underlying assumptions are untested or
unquestioned. Each of Gould’s original problems with inferential statistics (the functional
form, the sample, independence of observations and residuals, the distribution of the
variables and error terms, and the level of significance) is described briefly in the
following subsections, along with an overview of the developments that have occurred
in the meantime to address them.
3.1. The form of the function
Gould’s first argument is that functional relationships between variables are often
oversimplified for convenience, for example assumed to be linear, or at least linear
over the range of the data. This simplifies the computation associated with analysis,
although in practice it may also lower accuracy.
All too often there may be absolutely no logical reason why linearity, or some
other simplistic relationship, should be assumed. Gould argues (in 1970!) that with
improvements in computational capacity, and in associated software, there is no
longer a reason to strive for simplicity where it is not warranted. In the meantime,
research in statistics has made significant progress in the support provided for more
complex functions (McGarigal and Marks 1995), hierarchies of functions that better
integrate scale-based analysis (Kreft and DeLeeuw 1998, Johnson et al. 1999) and
extreme value theory to address very rare events (Smith 1990). Geographically
weighted regression (Brunsdon et al. 1998) addresses this same issue by making local
subsets where the functional form is the same, but the parameterization differs.
However, more simplistic statistical models are still in widespread use, possibly
reflecting the ease with which they can be applied and understood, rather than the
need for computational simplicity.
Large families of ML methods have also been developed to address the modelling
of complex functional forms. As described above in §2.2, complex functions can be
simulated by ML methods through the combination of many simpler, low-level functions,
such as decision rules or hyperplanes. Neural networks are perhaps the most widely
used method in this regard. For example, the General Regression Neural Network
(GRNN: Specht 1991) provides a more flexible form of regression, where distances
from the fitted line are applied piecewise, locally rather than globally, allowing more
complex functional relationships to be modelled with ease.
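As a rough illustration of the local weighting behind a GRNN, the sketch below uses the kernel-weighted average with which the method is commonly formulated: each training outcome contributes in proportion to a Gaussian function of its distance from the query point. The one-dimensional setting, the function name `grnn_predict`, the smoothing parameter values and the toy data are assumptions for illustration only.

```python
import math
from typing import Sequence


def grnn_predict(x: float, train_x: Sequence[float], train_y: Sequence[float], sigma: float) -> float:
    """Kernel-weighted average of training outcomes around the query point x.

    sigma is the smoothing parameter: small values give a very local fit,
    large values approach the global mean. If x is so far from every training
    point that all weights underflow to zero, the division below will fail.
    """
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * sigma ** 2)) for xi in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)


# Toy non-linear relationship (y = x squared) sampled at four points.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 1.0, 4.0, 9.0]

local_estimate = grnn_predict(2.0, train_x, train_y, sigma=0.1)
global_estimate = grnn_predict(2.0, train_x, train_y, sigma=100.0)
```

With a small `sigma` the estimate at a training location is dominated by that location's own outcome; with a very large `sigma` every example counts almost equally and the estimate collapses towards the overall mean, which is the behaviour of a global model.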
3.2. The sample
Assumptions include the randomness of sample selection, problems of
generalizing from a sample to a population and the chances that the sample contains unwanted
bias of some sort. These problems still pervade spatial statistics; for example, a
semi-variogram (a graphical tool for exploring spatial dependence in data) will produce
misleading results when samples are preferentially clustered or data shows significant
heteroskedasticity (Isaacs and Srivastava 1989, p. 527). Improvements in sampling
strategies help to alleviate some of these problems (Kalton and Anderson 1986,
Thompson 1992) and simulation techniques such as the Monte Carlo method can
help to explore randomness and bias problems (Bremaud 1999). Using relative
variograms, or other locally-calculated measures of variance, can help offset the
effects of heteroskedasticity.
In part, ML methods overcome this problem by avoiding assumptions about the
sample, though its representativeness is tacitly assumed. The whole area of sampling
theory and bias associated with both the data and the generalization methods used
have formed central strands in the development of machine learning methods
(Benjamin 1990, Briscoe and Caelli 1996), and are well understood.
3.3. The independence of observations and residuals
Assumptions here include that the sample is representative and that each
observation is independent, though Tobler’s first law (‘Everything is related to everything
else, but near things are more related than distant things’, Tobler 1970) advises
us that independence is not likely in a geographical setting. Tackling the second
part of this rule, the spatial statistics community has made great progress in
providing much better means of dealing with spatial dependence, from measures of
global autocorrelation (Moran 1948, Cliff and Ord 1973, 1981) to sophisticated,
locally-computed measures of spatial dependence and change in relationships over
geographical space (Anselin 1995, Brunsdon et al. 1996, Assuncao and Reis 1999).
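A global autocorrelation measure of the kind cited above can be computed directly. The sketch below implements Moran's I in its usual form, I = (n / S0) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where S0 is the sum of all weights. The binary contiguity weights along a one-dimensional transect and the two toy value sequences are assumptions chosen to keep the example small.

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Moran's I for a set of values and a spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross_products = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / total_weight) * cross_products / sum(d * d for d in deviations)


def transect_weights(n: int) -> List[List[float]]:
    """Binary contiguity for n cells in a row: w_ij = 1 when cells i and j touch."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


w = transect_weights(6)
trend = morans_i([1, 2, 3, 4, 5, 6], w)        # neighbours are similar
alternating = morans_i([1, 6, 1, 6, 1, 6], w)  # neighbours are dissimilar
```

A smooth trend gives a positive value (similar values sit next to each other), while the alternating sequence gives a strongly negative one; values near zero would suggest little spatial dependence under the chosen weights.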
As above, ML methods do not rely on assumptions of independence; the reliance
on evidence is based solely on how useful it is in predicting a desired outcome;
indeed, metrics describing this utility (such as information gain, Quinlan 1993) are
used to control the inductive learning process by evaluating each possible next move
(§2.1). Any form of correlation affects the utility of parts of the feature vector X in
predicting Y, since if $x_a$ and $x_b$ are strongly correlated, then after using $x_a$ there is
likely to be little information gain when using $x_b$. Thus, dependence structures in
data are implicitly ‘learned’ in the training phase.
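The effect of correlation on information gain can be checked with a small calculation. The sketch assumes a toy table in which attribute `xb` duplicates attribute `xa` exactly (the extreme case of correlation) and uses the standard entropy-based definition of information gain; the record layout and function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

Record = Dict[str, int]


def entropy(labels: List[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows: List[Record], feature: str, target: str) -> float:
    """Reduction in entropy of the target obtained by splitting on one feature."""
    groups: Dict[int, List[int]] = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def gain_after(rows: List[Record], first: str, second: str, target: str) -> float:
    """Gain still offered by `second` once the data are already split on `first`."""
    groups: Dict[int, List[Record]] = {}
    for row in rows:
        groups.setdefault(row[first], []).append(row)
    return sum(
        len(g) / len(rows) * information_gain(g, second, target)
        for g in groups.values()
    )


# Toy data: xb is a copy of xa; y mostly follows xa.
rows = [
    {"xa": 0, "xb": 0, "y": 0}, {"xa": 0, "xb": 0, "y": 0},
    {"xa": 0, "xb": 0, "y": 0}, {"xa": 0, "xb": 0, "y": 1},
    {"xa": 1, "xb": 1, "y": 1}, {"xa": 1, "xb": 1, "y": 1},
    {"xa": 1, "xb": 1, "y": 1}, {"xa": 1, "xb": 1, "y": 0},
]

gain_xa = information_gain(rows, "xa", "y")
gain_xb = information_gain(rows, "xb", "y")
gain_xb_after_xa = gain_after(rows, "xa", "xb", "y")
```

Taken alone, each attribute offers the same gain; once `xa` has been used, `xb` offers none, so a learner driven by this metric would not select it. With weaker correlation the residual gain would be small rather than zero.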
3.4. The distribution of the variables and the error terms
Error terms particularly are often assumed to be normally distributed, without
any physical or logical basis for such an assumption, and with potential to add error
into the analysis. Gould argues that these assumptions (normality of data and error,
unimodality, homoskedasticity) are untenable in many settings and again a result of
laziness or an over-enthusiastic zeal for simplicity. Here again, progress has been
significant, with the development of spatial statistical techniques that can specifically
model autocorrelation in error terms (Cressie 1993, chapter 5), as well as in the
signal, and reliable means to test for heteroskedasticity (Breusch and Pagan 1979).
Kriging (Krige 1951) and other forms of geostatistical analysis are able to specifically
calculate measures of spatial dependence (e.g. via a semi-variogram) that can be used
to improve interpolation and estimation in the presence of noise. However, these
too become problematic, for example if the range of different distances between
observations is not adequately sampled (as noted above in §3.2).
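The dependence of a semi-variogram on how well each separation distance is sampled can be seen in a minimal sketch. It uses the classical estimator, gamma(h) = (sum of squared differences between pairs separated by h) / (2 * number of such pairs), on an assumed one-dimensional transect; the function name, the lag binning rule and the data are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def empirical_semivariogram(
    positions: Sequence[float], values: Sequence[float], lag_width: float
) -> Dict[int, Tuple[float, int]]:
    """Return {lag_bin: (semivariance, pair_count)} for 1-D sample positions.

    Lag bin k collects pairs whose separation d satisfies
    k * lag_width <= d < (k + 1) * lag_width.
    """
    squared_sums: Dict[int, float] = defaultdict(float)
    pair_counts: Dict[int, int] = defaultdict(int)
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            separation = abs(positions[i] - positions[j])
            lag_bin = int(separation // lag_width)
            squared_sums[lag_bin] += (values[i] - values[j]) ** 2
            pair_counts[lag_bin] += 1
    return {
        k: (squared_sums[k] / (2 * pair_counts[k]), pair_counts[k])
        for k in sorted(pair_counts)
    }


# Five evenly spaced samples along a transect with a steady trend.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
variogram = empirical_semivariogram(positions, values, lag_width=1.0)
```

The pair counts returned alongside each estimate fall from four at the shortest lag to one at the longest, so the estimates at large separations rest on very little evidence; clustered sampling would distort the counts in a similar way.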
Again, ML methods do not start from any such distributional assumptions so
largely avoid these pitfalls. However, ML methods can exhibit some undesirable bias
because they assume that reducing error, or increasing information gain, are valid
measures by which to prioritize the learning process. Consequently, learning
concentrates on those denser regions of feature space where the greatest gains can be
made, typically those with the largest number of samples. Other regions may be
neglected until later in the learning process, by which time the solution thus far may
not be able to accommodate these remaining cases. Figure 2 depicts this situation.
Figure 2. For this distribution of samples, using only three hyperplanes or oblique decision
rules, the feature space cannot be subdivided so that a perfect classification results.
The two diamond samples inside the dashed oval will likely be misclassified, since
this represents a minimization of error. Any bias in the distribution of such ‘difficult
to train on’ samples will propagate into the result.
Solving bias problems requires careful initial calibration, to ensure enough learning
capacity is available, though only just enough, otherwise overtraining may occur
(Gahegan 2000). Utgoff (1986) describes how the bias exhibited during training can
itself be learned, so that it might be better understood.
3.5. The level of significance
Questions are raised about the selection of significance levels for testing; these
are often motivated by the reliability of the data, not the reliability required in the
prediction. The fact that a significance value is itself only a likelihood of reliability
seems to be overlooked in our enthusiasm to achieve a positive result, and has been
widely criticized recently within statistics (Nester 1996). Brunsdon (2001) brings to
light the debate within the statistics community regarding the validity of significance
testing from a methodological perspective (Wang 1993). The problem of significance
testing has recently taken on a new form with the popularisation of exploratory and
data mining techniques that perform thousands, or even millions of tests, a problem
taken up later in §4.4.
As mentioned already, significance tests make no sense for ML methods;
assessments of performance must instead be made from outcomes. This usually involves
holding back some percentage of the training data to independently test the
learned model, requiring modification to the underlying experimental methodology
(Fitzgerald and Lees 1994). Various validation methods have been reported for this
purpose (Congalton 1991, Schaffer 1993, Stehman 1997).
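A hold-out assessment of this kind might look as follows. The sketch is an assumption-laden illustration rather than any of the cited validation schemes: it withholds a fixed fraction of labelled samples at random, trains a very simple nearest-class-mean classifier on the remainder, and reports the proportion of withheld cases predicted correctly. All names, the split fraction and the toy data are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[float, int]  # (attribute value, class label)


def hold_out_split(samples: List[Sample], test_fraction: float, seed: int) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle reproducibly and withhold a fraction of the samples for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_class_means(training: List[Sample]) -> Dict[int, float]:
    """Learn one mean attribute value per class from the training samples."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, label in training:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(class_means: Dict[int, float], x: float) -> int:
    """Assign the class whose mean is closest to x."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))


# Two well-separated classes of 20 samples each (toy data).
samples: List[Sample] = [(i / 10.0, 0) for i in range(20)]
samples += [(10.0 + i / 10.0, 1) for i in range(20)]

training, withheld = hold_out_split(samples, test_fraction=0.25, seed=0)
model = train_class_means(training)
correct = sum(predict(model, x) == label for x, label in withheld)
accuracy = correct / len(withheld)
```

Because the withheld cases play no part in training, `accuracy` measures outcomes rather than fit. On realistic, overlapping classes it would fall below 1 and would vary with the random split, which is one reason repeated or stratified splits are often preferred.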
3.6. How machine learning techniques restate these problems
In summary, the form of the function, including patterns of covariance and
distribution of error terms, is not assumed, but is learned. If the data provides
evidence (examples) of a relationship between location and some value, then,
provided this relationship is useful in predicting the desired outcome, the ML
technique will attempt to learn this pattern. Even if the relationship changes over
space, that too can be learned if it is encoded in the examples presented. For example,
a neural network deals with covariance (spatial or otherwise) by learning that the
covarying attributes together over-predict an outcome, so connection weights are
adjusted to reduce the strength of the signal. The whole notion of empirically
modelling these relationships is put aside, thus any problems associated with the
selection or accuracy of statistical functions do not apply. Likewise, the distribution
of error terms is never assumed, so demands no special treatment.
There are, of course, caveats: these relate to the data themselves (they are
required to contain evidence of the trends that help to predict the desired outcome)
and to the learning capacity of the tool (it must be able to detect and represent the
useful trends). Openshaw and Openshaw (1997) and Gahegan (2000) give more
details relating to the machine learning of geographical pattern.
3.7. Progress in statistics to address these problems
In the years since Gould’s paper was originally published, a good deal of ground
has been covered to address the above problems. Brunsdon (2001), in a recent
editorial review of Gould’s original paper, points out areas where statistical research
has resulted in real progress, by tools that can relax or better account for one or
more of the above problems, including ‘...generalized additive modelling,
non-parametric regression, kernel density estimation, randomization tests and regression models
with autocorrelated errors...’. Useful reviews of these, and other waymarkers to
progress, can be found in Wand and Jones (1995), Hox (1995) and Longley and
Batty (1996). Mainstream acceptance of these newer techniques seems to be assured,
but until they are routinely available, Gould’s original warnings still apply. In part,
a slow uptake may be due to limited availability of the new statistics in established
software, though marked progress is reported by Bao et al. (2000). Furthermore,
dedicated software packages such as SpaceStat™ (http://www.spacestat.com/) and
SpatialAnalyst™
(http://www.esri.com/software/arcgis/arcgisxtensions/spatialanalyst/index.html),
and the interest they stimulate, signify a trend for spatially-aware statistical
methods to become more accessible.
4. Emerging problems with the use of inferential statistics
It is not just the theory and available tools that have changed radically in the
last thirty years; geographical data have changed too, as have the tasks to which
we put them! With the advent of vast, digital geospatial datasets, of ever-increasing
subtlety and collected at geometric rates, additional analysis problems arise as
new challenges (Buttenfield 1998, Kahn and Braverman 1999). This section
introduces a number of new problems arising from the changing nature of the data we
use, in terms of: (1) size and non-intuitive nature of a high-dimensional feature
space, (2) data reduction, (3) computational complexity, (4) significance testing, and
(5) increasing demands for training data.
4.1. Size and non-intuitive nature of high-dimensional feature space
The size of a feature space is determined by the number of unique positions that
it comprises, given $p$ attribute dimensions each measured with a precision $\rho$. If we
assume for simplicity that $\rho$ is the same for all dimensions and measured as the
number of bits by which data is encoded, then the number of unique positions in
feature space is given by $(2^{\rho})^{p}$.
Using three attribute dimensions, each represented by a single byte, the size of
the feature space is $(2^{8})^{3} \approx 16.7$ million unique locations, a common size for many
remote sensing problems. Obviously, this number rises very rapidly if either the number
of dimensions or the precision increases. For the AVIRIS hyperspectral remote sensing
platform, which uses 12-bit data precision and 224 spectral channels, this equation
becomes $(2^{12})^{224} \approx$ 1.47e+809,
an astronomical number. Considering the United States 2000 census Demographic
Profile, we obtain 98 variables with around 32 bit precision, making a feature space
with a truly staggering 3.9e+1926 locations. Even when the number of observations
is very large (massive n), the vast majority of these possible values will not be realized,
so the feature space will be largely empty (sparse).
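The first two figures can be reproduced with exact integer arithmetic. The sketch below assumes only the expression given above (two raised to the number of bits, raised to the number of dimensions); the helper names `feature_space_size` and `scientific` are hypothetical, and the mantissa is truncated rather than rounded.

```python
def feature_space_size(bits_per_dimension: int, dimensions: int) -> int:
    """Number of unique positions: (2 ** bits) ** dimensions, as an exact integer."""
    return (2 ** bits_per_dimension) ** dimensions


def scientific(n: int, digits: int = 3) -> str:
    """Format a large positive integer as d.dde+N, truncating the mantissa."""
    text = str(n)
    return f"{text[0]}.{text[1:digits]}e+{len(text) - 1}"


three_bytes = feature_space_size(8, 3)      # three 8-bit attributes
aviris = feature_space_size(12, 224)        # 224 channels at 12-bit precision
aviris_text = scientific(aviris)
```

Python integers are arbitrary precision, so no overflow occurs; floating-point types could not hold the second value, which is why the size is formatted from its decimal digits.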
We are familiar with conceptualizing analysis in two or three dimensions, where
distribution functions exhibit a highly recognizable form. However, we should be
cautious in the way we generalize these conceptualizations to higher dimensional
spaces, since these familiar functions become less intuitive, and consequently more
difficult to model, as p increases. By way of a simple example (after Scott 1992),
consider the case of a square and a circle, specifically a circular cluster of points
modelled using a square box, as would be the case with a parallelepiped classifier, or
as could be modelled with four decision rules or linear discriminant functions.
Figure 3 depicts this situation.
In two dimensions the model seems to be an acceptable approximation, since
the ratio of the area of the circle to that of the square is reasonably close at 0.79, so
the model used does not generalize too far beyond the observed properties of the
data. However, if p is increased, this ratio does not stay constant, but decreases
rapidly to a state where the surrounding box is almost entirely empty and is a very
poor representation of the data. By p=4 the ratio of the volumes is well below 50%
and at p=7 the hypersphere only accounts for about 4% of the volume of the
hypercube. In other words, the hypercube is certainly no longer a useful approximator
of any spherical cluster of data points, since it is 96% empty.
Figure 3. Comparing simple geometric shapes and fractional intersection of their volume in
a p dimensional feature space, after Scott (1992) and Landgrebe (1999).
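The ratios quoted above follow from the standard closed-form volume of a p-dimensional ball. As a minimal sketch (the function name is an illustrative assumption), the fraction of a bounding hypercube occupied by its inscribed hypersphere can be computed as:

```python
import math

def sphere_to_cube_ratio(p: int) -> float:
    """Volume of a unit-radius p-ball divided by that of its bounding hypercube."""
    ball = math.pi ** (p / 2) / math.gamma(p / 2 + 1)  # volume of the unit p-ball
    cube = 2.0 ** p                                     # hypercube of side 2
    return ball / cube

for p in (2, 3, 4, 7, 10):
    print(f"p={p:2d}  ratio={sphere_to_cube_ratio(p):.4f}")
```

For p=2 this gives the 0.79 quoted in the text, and the value falls to under 4% by p=7; the radius cancels, so the result does not depend on the size of the cluster.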
Were this problem to be confined to only rectangular or orthonormal structures
then it would simply require that we choose statistical models with greater care as
p increases. But unfortunately, the same geometric problems occur with other
distribution functions too; in fact it can be generally shown that for an arbitrary shape,
as dimensionality is increased, more of the volume of the object becomes concentrated
in an outer shell, and less in the centre. So, when considering a Gaussian distribution,
the volume of the curve migrates quickly from the centre to the tails of the
distribution, producing a rather counter-intuitive flat shape. Note that this effect is not a
result of a lack of training examples, high variance or poor model choice, but simply
a consequence of geometry. An insightful explanation of this phenomenon is given
by Landgrebe (1999), who also points out the following two important consequences:
that the space is largely empty and that the migration of volume to the outer shell
or corners causes great difficulties for multivariate density estimation (Scott 1992,
Wand and Jones 1995, Jimenez and Landgrebe 1998).
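The migration of volume to the outer shell can be illustrated with one line of arithmetic: shrinking any solid about an interior point by a factor s scales its volume by s to the power p, so the part left outside the shrunken copy is 1 - s^p. The sketch below assumes a shell defined as the outer 10% of the distance from centre to boundary; that thickness, and the function name, are illustrative choices rather than values from the cited works.

```python
def outer_shell_fraction(p: int, shell_thickness: float = 0.1) -> float:
    """Fraction of a p-dimensional solid's volume lying in its outer shell.

    The shell is what remains after removing a copy of the solid shrunk
    by (1 - shell_thickness) about an interior point.
    """
    inner_scale = 1.0 - shell_thickness
    return 1.0 - inner_scale ** p

for p in (1, 2, 10, 50, 200):
    print(f"p={p:3d}  fraction in outer shell={outer_shell_fraction(p):.4f}")
```

In one dimension the shell holds 10% of the volume, as expected, but by p=50 it holds more than 99%, which is the sense in which the centre of a high-dimensional object is almost empty.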
The point here is that familiar distributional forms do not perform well in high
dimensional settings; they were never designed to. It becomes vital, instead, to take
a piecewise or hierarchical approach, tackling the problem by fragmenting the space
into lower dimensional partitions only where the feature space contains useful
information, and ignoring other empty portions. This is why neural networks and
decision trees often meet with success in these settings (§2.2).
4.2. Data reduction
Another way to deal with feature space complexity is to use tools that reduce
the space to a manageable form, for example by classification or clustering. Recent
interest in data mining and knowledge discovery (DM/KD) as applied to geography
(Miller and Han 2001, Buttenfield et al. 2001) is evidence of this need. Not
surprisingly, many of the newer tools for data reduction harness inductive machine learning
methods (Cohen 1995, Gehrke et al. 1999).
In direct contrast to this ‘reductionist’ approach, Openshaw (1994, p. 87) cautions
that such preprocessing may well remove important information, and suggests that
‘A worthwhile general principle should be to develop methods of analysis that impose
as few as possible additional, artificial, and arbitrary selections on the data’. However,
many commercial systems still appear to offer limited support for higher-dimensional
data, encouraging us to be wasteful, since we are expected to renounce many
attributes in order to concentrate analysis on the small handful that appear to carry
the most information. Techniques such as Principal Components Analysis (PCA)
and Multi-Dimensional Scaling (MDS) have been specifically developed to help us
with this task. There are two important problems with such approaches:
1. It is assumed that the phenomena of interest can be adequately expressed with
a small number of variables. However, complex processes, such as land-use
change or gentrification, may possess a ‘signature’ that extends over many
different attribute domains and is not adequately explained in any small subset.
2. Generally speaking, data reduction methods such as PCA and MDS assume
that global variance is a sufficient measure of an attribute’s utility, which, it
could be argued, is rather ungeographical. We should be intimately concerned
with the spatial structure within attribute data, i.e. within the context of place
(Abler et al. 1971, chapter 1), and less with globally aggregated measures.
By reducing dimensionality, we trade accuracy for simplicity, and in doing so risk a
corresponding loss of explanatory power. In cases where variables are highly
correlated and processes are simple, this loss of accuracy might be small or even insignificant,
but that is yet another assumption brought about by the now outdated need for
computational simplicity.
There is now a large body of evidence, both inside and outside of geography,
that demonstrates the abilities of machine learning techniques, and particularly
decision trees and neural networks, to deal effectively with tasks involving high
dimensional data (p>10, p>100) (Benediktsson et al. 1993, Ripley 1996, German
and Gahegan 1996, Di and Khorram 1999). Reduction to just two or three variables
is an outdated notion that in most cases is no longer required.
In addition to machine learning approaches, a number of statistically-based
techniques have been proposed to tackle the same problem, including the notion of
projection pursuit for data exploration (Asimov 1985, Cook et al. 1995) and a variety
of pooled-covariance techniques to reduce the complexity of constructing a high
dimensional distributional model (see §4.6).
Perhaps another factor here is the desire for conceptual simplicity and
transparency in our underlying models? There may be good cause for this, such as ease of
communication or for pedagogic reasons. But I am aware of no reason why good
geographic models should, by nature, involve only a small number of simple
relationships. Perhaps it is time at last to embrace the first part of Tobler’s first law (§3.3)?
4.3. Computational complexity
Larger datasets imply an increase in the number of cases (n) or the number of
attributes associated with each case (p), or possibly both. When addressing datasets
with either large n or large p, the time required by the machine to perform the
necessary computations can become a limiting factor for all forms of analysis. For
example, it may render impractical any exhaustive search for the best solution,
i.e. one where all possible alternatives are evaluated.
Computational complexity is usually expressed in terms of the number of
iterations of an algorithm required to complete the calculation, in the best, worst or
average case (Moret and Shapiro 1991). Obviously, any increase in n or p directly
impacts complexity. Many machine learning techniques scale somewhere between
$O(n^{2})$ and $O(n \log n)$ in terms of run-time computational burden (Martin 1991), with
p being a constant term determining the complexity of each iteration. By contrast,
closed form statistical techniques are nominally of $O(n)$, though techniques such
as maximum likelihood require the additional derivation of a covariance matrix
(see §4.5 below). Non-linear statistical functions are more expensive because the
approximation techniques used, such as Newton-Raphson (Judge et al. 1988), are
computationally demanding and typically of the order of $O(n^{3})$.
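To give a feel for what these orders of growth imply, the sketch below compares how much additional work each order demands when the number of cases grows tenfold. The orders are those named in the text; the sample sizes and function names are illustrative assumptions, and constant factors (including the per-iteration cost that depends on p) are deliberately ignored.

```python
import math

# Growth functions for the orders of complexity discussed in the text.
GROWTH = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: n ** 2,
    "O(n^3)": lambda n: n ** 3,
}

def relative_cost(order: str, n_small: int, n_large: int) -> float:
    """How many times more work is needed when n grows from n_small to n_large."""
    growth = GROWTH[order]
    return growth(n_large) / growth(n_small)

for order in GROWTH:
    print(f"{order:11s} x{relative_cost(order, 1_000, 10_000):,.1f}")
```

A tenfold increase in n costs a linear method ten times the effort but a cubic method a thousand times, which is why the order of an algorithm matters more than raw machine speed as datasets grow.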
By abandoning a deterministic approach in favour of stochastic search (§2.1),
machine learning techniques are able to reduce computational demands significantly
for non-linear distributions, a factor that becomes increasingly vital as the feature
space enlarges (Openshaw et al. 1999). In doing so, they remain computationally
tractable for large values of p, as noted above.
Whereas many ML techniques are able to analyse datasets with tens or even
hundreds of dimensions, further increases in p, perhaps with associated increases in
n as is common in data mining, currently cause a performance bottleneck. Significant
advances in computational efficiency are currently being sought to enable these
techniques to scale up further. Proposed solutions usually involve increasing the
number of prior assumptions in order to reduce the time complexity, so that it
approaches $O(n)$. Examples include RIPPER (Cohen 1995) and BOAT (Gehrke et al.
1999), both based on optimistic construction of a decision tree.
4.4. Further problems with significance testing
As datasets become ever more complex, we must rely on exploratory methods
to bring to light useful knowledge. Data mining aims to uncover unknown patterns
by repeated application of a (usually local) test. One of the earliest examples of
data mining in geography is Openshaw’s Geographical Analysis Machine
(GAM: Openshaw et al. 1990) that performs a clustering test for each cell on a
gridded surface over a number of spatial scales. Philosophically, it is debatable
whether such repeated testing constitutes real hypothesis testing, in the sense of setting
up and evaluating a null ($H_0$) and alternative ($H_1$) at a given level of significance.
To make their downgraded status clear, such tests are sometimes referred to as indicators
instead (Anselin 1995). But algorithmically, the mining method is indeed choosing
between $H_0$ and $H_1$ at every iteration: Gould (1999, p. 224) later refers to GAM as
conducting ‘...eight million rigorous Poisson-based tests...’.
When large numbers of hypotheses are evaluated, the problem of significance
testing described above (§3.5) becomes even more vexing. If we perform only one
test, say at a (high) significance level of 1%, then we must acknowledge one chance
in a hundred that our results might be significant only by a chance arrangement of
data values, and not arising from any noteworthy cause. Conducting a million tests,
we should anticipate 10000 or so such ‘errors’ and so forth. In fact the number of
these commission errors rapidly rises to the point where they become a significant
distraction; the user is faced with a mountain of results to sift through with no way
to distinguish the good from the bad. New forms of significance testing have been
put forward to address this problem, which can take into account the volume of tests
when reporting significance (Glymour et al. 1996, Smythe 2000). Nowhere is this
more necessary than in spatial or spatio-temporal data mining where the physical
dimensions add considerably to the number of tests to be applied (Ester et al. 1998,
Koperski et al. 1999).
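The arithmetic of repeated testing can be sketched as follows. The expected number of chance 'discoveries' is taken directly from the reasoning above; the Bonferroni adjustment shown alongside is a standard textbook correction introduced here purely for illustration, and is not claimed to be the method proposed in the works cited. Function names are assumptions, and the family-wise calculation assumes independent tests.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests significant by chance alone when every null is true."""
    return n_tests * alpha

def family_wise_error_rate(n_tests: int, alpha: float) -> float:
    """Probability of at least one chance 'discovery', assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test significance level that holds the family-wise rate at or below alpha."""
    return alpha / n_tests

n_tests, alpha = 1_000_000, 0.01
print(expected_false_positives(n_tests, alpha))  # chance findings to expect
print(family_wise_error_rate(100, alpha))        # already large for only 100 tests
print(bonferroni_threshold(n_tests, alpha))      # much stricter per-test level
```

The price of such a correction is a much stricter threshold for every individual test, which is one reason why methods that summarize learning outcomes, rather than assess each case separately, are attractive in exploratory settings.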
To summarize, traditional statistical methodologies can experience difficulties in
exploratory settings where they are put to use in a manner for which they were
never designed. Machine learning researchers have tackled this vexing issue by
providing techniques that can summarize and generalize from learning outcomes,
thus avoiding a case-by-case assessment of significance (Gains 1996, Brazdil and
Konolige 1990). Significance testing may also prove unreliable if distributions
cannot be conditioned accurately because of a lack of training examples, as
discussed next.
4.5. Increased demands for sample or training data
Fukunaga (1990) shows that for a linear statistical classifier, the number of
training samples required depends directly on p, but for a quadratic classifier, such
as maximum likelihood, this rises to $p^{2}$. More precisely, a Gaussian distribution
requires the formulation of a covariance matrix that describes relationships between
dependent attributes. The covariance matrix is symmetric (elements are mirrored
across the diagonal), so only its triangular half needs to be estimated, and the number
of coefficients that require estimation is given by: $c\,p(p+1)/2$, where c is the number
of classes to be delineated and p the number of dimensions in feature space (as before).
Five classes and five attribute dimensions require a reasonable 75 covariance values to
be estimated, but ten classes and 100 dimensions would require 50500 values. Each of
these coefficients is estimated from the data sample, so the data must contain enough
observations to allow all these coefficients to be estimated reliably. Clearly, this fast
becomes an entirely impractical requirement.
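The count of covariance values can be checked directly from the formula given above. This is a minimal sketch; the function name is an illustrative assumption.

```python
def covariance_coefficients(n_classes: int, n_dimensions: int) -> int:
    """Distinct covariance values to estimate: c * p * (p + 1) / 2.

    Each class needs one symmetric p x p matrix, of which only the
    diagonal and one triangle (p * (p + 1) / 2 values) are distinct.
    """
    return n_classes * n_dimensions * (n_dimensions + 1) // 2

print(covariance_coefficients(5, 5))     # five classes, five attributes
print(covariance_coefficients(10, 100))  # ten classes, one hundred attributes
```

These two calls reproduce the figures of 75 and 50500 quoted in the text, and show how the quadratic term in p quickly dominates the number of classes.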
By making assumptions regarding covariance (pooling), the number of samples
required to construct a Gaussian curve can be reduced to around 30–100 independent
examples per attribute dimension (Mardia et al. 1979). What does this mean in
practice? To construct such a well-conditioned curve in a socio-demographic setting
using 10 attributes would require 300–1000 examples, or to use a supervised classifier
on hyperspectral remote sensing data from the AVIRIS sensor would require between
224×30=6720 and 224×100=22400 independent training samples, though one
could question whether supervised classification is really a suitable way to interpret
such data (Goetz and Curtiss 1996). These illustrative examples are somewhat
contrived, but nevertheless we can expect growing numbers of attributes to become
available within all areas of geographical analysis in the future, so they serve as a
useful indicator of the increasing demand for so-called ‘ground truth’. This might be
good news for geography graduates in search of employment in the field!
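The sample-size ranges above come from multiplying the number of attribute dimensions by the 30–100 examples per dimension quoted from Mardia et al. (1979). A minimal sketch of that rule of thumb, with an illustrative function name, is:

```python
def training_samples_needed(n_dimensions: int, per_dimension=(30, 100)):
    """Lower and upper bound on independent training samples under pooled covariance."""
    low, high = per_dimension
    return n_dimensions * low, n_dimensions * high

print(training_samples_needed(10))   # socio-demographic example with 10 attributes
print(training_samples_needed(224))  # AVIRIS example with 224 spectral channels
```

The requirement grows only linearly with p under this pooling assumption, in contrast to the quadratic growth of the unpooled covariance count.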
Unlike parametric methods, ML methods are not required to build complete
models, in the sense that no effort needs to be applied to regions of feature space
that are empty; and as pointed out above, this is usually the vast majority of the
space. Ehrenfeucht et al. (1989) show that for inductive machine learning, the amount
of training data required depends on the complexity of the learning task, so is more
difficult to define beforehand. In the case of classification, this complexity depends
on the number of classes required and the intricacy of the separation task, which
itself depends only partly on the dimensionality (Cybenko 1993). In short, many
inductive learning techniques manage better than a linear relationship with p, in
terms of data requirements, allowing them to extend to very large feature spaces
without acquiring a voracious appetite for data.
4.6. The $n \ll p$ problem
Generally speaking, multivariate statistical inference assumes that p<n, in that
n samples are generalized to form a p-dimensional distributional model. But where
p>n, these distributions cannot be constructed or are degenerate. For example, to
construct a sample covariance matrix (S) requires that n>p. If n is not greater than
p, then the rank of S is less than p, so the matrix becomes singular (after Press 1982).
That being the case, the inverse of S does not exist and the corresponding probability
distribution cannot be calculated.
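The rank deficiency described above can be demonstrated on a toy example. The sketch below uses exact rational arithmetic so that the rank is not blurred by rounding; the data values and function names are illustrative assumptions, and the elimination routine is a plain textbook implementation rather than anything taken from Press (1982).

```python
from fractions import Fraction

def sample_covariance(rows):
    """Sample covariance matrix S (p x p) from n observations of length p."""
    n, p = len(rows), len(rows[0])
    means = [Fraction(sum(r[j] for r in rows), n) for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def matrix_rank(matrix):
    """Rank by exact Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# n = 3 observations in p = 4 dimensions: S cannot reach full rank.
few = [[1, 2, 0, 4], [2, 0, 1, 3], [0, 1, 5, 2]]
# n = 5 observations in p = 4 dimensions: S is of full rank here.
many = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(matrix_rank(sample_covariance(few)), matrix_rank(sample_covariance(many)))
```

With three observations the rank of S is at most two, since centring on the mean removes one degree of freedom, so the 4 x 4 matrix is singular and cannot be inverted.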
There are various statistical shortcuts that can be taken to construct S and they
fall into two types: either reduce p or increase n. Increasing n can be effectively
achieved by assuming some prior knowledge of a distribution, so that fewer samples
are needed to condition it properly. One possibility, mentioned above, is to assume
that covariance is constant for a particular class or indeed for all classes (pooled
covariance). Landgrebe (1999) presents a useful summary of possible methods for
pooling, and discusses their likely effects on predictive accuracy. Reducing p is usually
achieved using principal components or factor analysis.
New solutions to this problem are offered by inductive methods. For example, a
Self-Organizing Map (SOM, Kohonen 1997) reduces a highly multivariate space
into a lower dimensional structure (typically two-dimensional) by training a set of
neurons (v, $v \ll n$) to represent the salient properties of the original data. The neurons
capture the variance and important trends in the data. In doing so, they reduce n
and p to v and 2, respectively. One advantage here is that the form of the problem
is not changed; we still have a set of (albeit transformed) observations within a
(transformed) feature space. Another advantage is that the mapping from n to v aims
to preserve topology, so relative positions in the transformed feature space still have
meaning.
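To make the reduction from (n, p) to (v, 2) concrete, the following is a deliberately small sketch of a self-organizing map in plain Python. It is a simplified illustration under stated assumptions (linear decay of learning rate and neighbourhood radius, a Gaussian neighbourhood, synthetic data, and hypothetical function names) and should not be read as Kohonen's reference implementation.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights,
               key=lambda k: sum((wj - xj) ** 2 for wj, xj in zip(weights[k], x)))

def train_som(data, grid_rows, grid_cols, epochs=50, seed=0):
    """Train a grid of neurons; returns {(row, col): weight vector of length p}."""
    rng = random.Random(seed)
    p = len(data[0])
    # Initialise each neuron from a randomly chosen observation.
    weights = {(r, c): list(rng.choice(data))
               for r in range(grid_rows) for c in range(grid_cols)}
    initial_radius = max(grid_rows, grid_cols) / 2
    for epoch in range(epochs):
        fraction = epoch / epochs
        rate = 0.5 * (1 - fraction)                          # decaying learning rate
        radius = max(initial_radius * (1 - fraction), 0.5)   # shrinking neighbourhood
        for x in rng.sample(data, len(data)):                # random presentation order
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_dist_sq = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                for j in range(p):
                    w[j] += rate * influence * (x[j] - w[j])
    return weights

# Synthetic data (an assumption): n = 200 observations, p = 5 attributes.
rng = random.Random(1)
data = [[rng.gauss(centre, 0.5) for _ in range(5)]
        for centre in (0.0, 10.0) for _ in range(100)]
som = train_som(data, grid_rows=3, grid_cols=3)              # v = 9 neurons
print(best_matching_unit(som, data[0]), best_matching_unit(som, data[-1]))
```

Each observation is now summarized by the two grid coordinates of its best matching neuron, so 200 observations in five dimensions are represented by nine neurons on a two-dimensional grid, and because neighbouring neurons are updated together, nearby grid positions tend to hold similar weight vectors.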
5. Questions about induction
To highlight the relevance of the above discussion to geographical analysis, this
section is structured around several questions related to the consequences of using
machine induction, addressing how it might change our capabilities, methodologies,
understanding, our role and even the way we approach teaching.
5.1. Can we address previously intractable problems?
The answer here is clearly yes; the problems that were once intractable because
of dataset complexity, computational burden or for lack of a model (§4) are now
feasible. As additional progress is made within the machine learning and data mining
communities, providing more reliable search and optimization methods, the frontiers
of possibility will be pushed back still further (Dietterich 1997, Gehrke et al. 1999).
5.2. Does the method of investigation change?
From a methodological perspective, we see that inferential statistics requires a
model to be specified beforehand, with unknown examples then evaluated against
it. By contrast, machine learning requires examples to be available that represent
the functioning of the model, but not the model itself. By generalizing from these
known examples, a model is induced. A major difference then, between these two
styles of analysis, concerns the requirement for prior knowledge. It is not necessary
to have a procedural understanding of a problem before using ML to predict or
infer new results.
By adopting machine induction, we move from an explicit model constructed by
a human expert (perhaps indirectly from observations or theory) to an implicit model
constructed directly from examples by an algorithm. Methodology changes
accordingly (§2). In all cases, reliance on the human expert is never fully relinquished since
machine learning algorithms require a variety of hands-on intervention to assure
their correct functioning. While one goal is to remove this reliance, because it
demands a level of computational knowledge, another is to build expertise from
the user into the method, as it relates to the domain of application (German and
Gahegan 1999). These goals are not in conflict, though they may appear to be so at
first glance.
5.3. Are we able to examine new kinds of questions and if so, how?
Again the answer is yes; the ability to operate in the absence of prior knowledge
is enabled by substituting data for expertise (Openshaw 2000), with examples used
as a surrogate for this understanding. So, questions can be generated from our
extended ability to extract patterns from data, to categorize and to generalize. These
questions can take the form of hypotheses that shape the start of a more
traditional investigation. To this end, inductive learning is being applied within data
mining tools, to uncover previously unknown relationships and patterns in complex
geographical datasets (Ester et al. 1998).
5.4. Does our approach to science need to change to accommodate induction?
At a more philosophical level, we need to embrace induction as a valid form of
scientific inference, that is different from the deductive approach used in ‘normal’
science (Popper 1959), that achieves a different purpose and that needs to be verified
in a different manner. The validity of induction seems to be a matter for the domain
scientists to resolve, since within the philosophy of science it is widely acknowledged
and has been for more than a century (Peirce 1878, Mechelen et al. 1993).
Computational methods simulate the act of induction by applying complex
algorithms containing a degree of non-determinism. One problematic consequence
is that results may vary, even when the same algorithm is applied to the same
dataset. In a scientific sense this is troublesome, because it challenges the notion of
repeatability in experimentation. Since repeatability has long been regarded as one
of the three pillars of science (cf. communicable, repeatable, refutable) the
consequences for analysis are both philosophic and practical. However, it could be
argued that any deviation in the result is simply a reflection of the indeterminate
nature of the problem itself; in other words, we delude ourselves to think that there
is a single ‘right’ answer that we can know with decimal precision. So even though
repeatability provides a yardstick by which results can be directly compared, in
many cases it may hide the uncertainty present. Stochastic methods leave the
uncertainty within the result and force us to deal with it. Fuzzy and probabilistic
approaches to combining evidence also do the same (Fisher 1994). The variance in
the results is then a measure of the uncertainty in the data combined with the
learning deficiencies of the algorithm used (i.e. uncertainty in the constructed model),
often with some small element of chance due to the randomized start conditions
used. By contrast, the error term in inferential statistics is a measure of the
goodness-of-fit of the data to the predefined model and not how appropriate the model itself
might be. The simplest way to account for variance in results of ML methods is to
compute an average value over several consecutive training and validation cycles.
Many appropriate measures have been proposed (Schaffer 1993).
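Averaging over repeated training and validation cycles can be sketched with a toy stochastic learner. Everything in the example is an illustrative assumption: the one-dimensional data, the random search for a decision threshold, and the function names stand in for whatever learning method is actually used.

```python
import random
import statistics

def make_samples(n, seed):
    """Toy 1-D data (an assumption): label is True when a noisy x exceeds 0.5."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = rng.random()
        samples.append((x, x + rng.gauss(0, 0.1) > 0.5))
    return samples

def train_and_validate(train, valid, seed, n_candidates=20):
    """Stochastic search for a threshold on the training set; returns validation accuracy."""
    rng = random.Random(seed)

    def accuracy(threshold, samples):
        return sum((x > threshold) == label for x, label in samples) / len(samples)

    candidates = [rng.random() for _ in range(n_candidates)]  # randomized start conditions
    best = max(candidates, key=lambda t: accuracy(t, train))
    return accuracy(best, valid)

train, valid = make_samples(200, seed=1), make_samples(200, seed=2)
scores = [train_and_validate(train, valid, seed) for seed in range(10)]
mean_accuracy = statistics.mean(scores)
spread = statistics.stdev(scores)
print(f"validation accuracy: {mean_accuracy:.3f} +/- {spread:.3f} over {len(scores)} runs")
```

Reporting the spread alongside the mean keeps the uncertainty visible instead of hiding it behind a single figure, and fixing the seed of any one run restores repeatability for that run when it is needed for comparison.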
5.5. Who knows the most, the geographer or the data?
A function describing the basis of a statistical model will usually include
parameters that allow adaptation to the current dataset, but the model itself remains
invariant. A model of this kind has many advantages: it is simple, can be easily
understood and communicated, and leads to repeatable analysis. On the negative
side it may be inaccurate (the underlying relationship might have a complex
covariance structure) and in highly multivariate datasets it might also be difficult to
‘discover’ in the first place. Furthermore, because the model is fixed, it cannot readily
adapt to subtle differences in the data used that may occur within or between specific
places. We must either assume it is universally true or else we must redevelop it each
time it is applied. Fully-inductive methods take the latter approach, automatically
reformulating new relationships for each dataset presented.
By assuming a fixed relationship holds true, we remove the possibility of
discovering something new and significant about the study region. Such over-reliance
on a logico-deductive approach to science has been widely criticized. For example,
Kuhn (1962) asserts that such models can never in themselves lead to new knowledge,
and only when they are seen to fail can new knowledge follow, since this implies the
model represents an invalid hypothesis. Furthermore, deduction, by itself, precludes
the development of a new or refined model. True induction does not suffer from this
disadvantage.
To sum up, the argument between inferential and machine inductive approaches
can be stated as follows: ‘Do we know enough about our systems (or are they so
simple and predictable) that a deterministic approach is adequate, or do these
systems contain local subtleties and complexities that would favour a more adaptable
approach?’
Perhaps more radically, the question can be re-expressed as: ‘Do our data
represent a better approximation of system behaviour than our expertise?’ This
statement is challenging, and emphasizes different aspects of the role of the
geographer. To take it to the extreme: in the first instance, the geographer is the
theoretician who imposes structure on the data directly and thus shapes the outcome;
in the second, the geographer is the field expert who must carefully gather
representative samples so that a valid model can emerge from them. Across the discipline, we
see stark evidence of both of these roles.
5.6. Are there implications for teaching and learning about geographical analysis?
One ramification for education is that learned models may be difficult to recover
and to communicate, even if they do lead to improvements in predictive power. The
simple parametric form of many common statistical functions makes the nature of
relationships easy to comprehend and to explain, whereas most machine learning
methods have little or no facility to describe the models they learn in any way that
makes immediate sense to a human. This is not an insurmountable problem; even a
complex model can be progressively reduced to a simpler, more generalized form
for presentation and examination: learning outcomes can be visualised and internal
structures can be summarized (Gains 1996, Laffan 1998, Ankerst et al. 1999).
However, one could also make the counter-argument, namely: is such
simplification ultimately helpful and/or does it act as a barrier to understanding, rather than
an aid? The complexity of learned models may well depict geography as inherently
complex, and thus challenge our tendency to simplify it. Clearly, there are pedagogic
consequences to face.
6. Summary: an even wilder goose chase?
Inductive machine learning offers considerable promise to improve our predictive
capabilities in complex settings, but is not yet a magic bullet (or a golden egg). The
answers it provides are only as good as: (1) the data are representative, and (2) the
methods are capable of learning the trends contained therein.
Statistical analysis is good for some classes of problem, where the solution is
largely deterministic and the underlying model is well understood. However,
geographical science that is entrenched only in statistics is short-sighted. To address the
challenges of richer and more voluminous data, geographers will need new tools
employing different inferential techniques. ML reacts, via learning, to the specific
properties of a complex dataset and one could therefore argue that it is more
‘geographic’ since it is able to respond specifically to the nuances of place, provided
of course that place is encoded in the data. This is both a strength and a weakness.
It is a strength because the models produced are unique from place to place. It is a
weakness because the notion of remaining objective to some wholly external frame
of reference is sacrificed.
As a summary of the techniques described above, some of the more common
analysis tasks are shown in table 1, with suitable tools shown for each task drawn
from statistics and machine learning. It highlights that there are many methods
with common goals and points to some of the alternative ML methods that can be
substituted for their more established statistical counterparts as datasets and tasks
become more complex.

Table 1. Various analysis tasks with their statistical and machine learning counterparts.

Analysis task            Statistical technique               ML technique
-----------------------  ----------------------------------  ----------------------------------
Data reduction           Principal components,               Self-organizing map
                         multidimensional scaling
Clustering               k-means, ISODATA                    Self-organizing map,
                                                             association rules
Modelling simple         Regression, correlation             General regression neural
relationships                                                network (GRNN)
Classification           Maximum likelihood,                 Discrete output neural
                         discriminant analysis               network, decision tree
Function approximation   Non-linear least squares and        Continuous output feed-forward
                         likelihood estimation               neural network
Parameter estimation     Least squares,                      Stochastic search,
                         maximum likelihood,                 genetic algorithms,
                         expectation maximization,           gradient ascent (descent)
                         best linear unbiased estimator
Rule-based inference     First order logic,                  Decision tree, rule induction
                         linear discriminants
By increasing our reliance on induction we change the role of the expert, since
many initial assumptions need now not be made or tested, but we must instead rely
directly on the ‘truth’ (representativeness) contained within the dataset. Although
such a goal is perhaps not entirely laudable, since it is probably a good thing to be
intimately familiar with one’s data, this is an increasingly impractical requirement
due to the escalating size and complexity of datasets (Openshaw and Openshaw
1997, p. 3).
Difficulty of use is still a real issue with many forms of machine learning; it is
not always straightforward to make informed choices regarding parameter
configuration. However, this situation is also common for more advanced spatial analysis
tools. Configuration of neural networks, for instance, is no more complex a task
than conducting a geostatistical interpolation: the appropriate use of kriging requires
quite a deep knowledge of available methods, as well as selection of suitable
transformations (spherical, etc.).
To make the descriptions clearer I have contrasted the simpler techniques from
statistics and machine learning. There are many other techniques that merit
description, but space considerations have precluded their mention. It is important to point
out that there is by now a good deal of convergence between statistics and machine
learning, especially with more advanced techniques where the need to search through
solution spaces efficiently is a common thread in both disciplines (Moller 1993,
Stewart et al. 1994, Simoudis et al. 1996). For example, Kernel Discriminant Analysis
(Lissoir and Rasson 1998), a statistical classification technique, constructs decision
boundaries by employing a non-linear mapping of the data into some feature space,
via a series of ‘kernel’ transformation functions. This new space introduces distortions
to allow a cleaner delineation of the classes. Although the theoretical foundation
differs from that of a neural classifier, the functionality and many of the configuration
and training issues are similar. This trend towards convergence between machine
learning and statistical analysis is likely to continue, so their distinction will become
less clear as time passes.
Acknowledgments
This paper is dedicated to the memory of Peter Robin Gould (1929–2000), whose
many insights are a continuing source of inspiration.
References
Abler, R., Adams, J.S., and Gould, P., 1971, Spatial Organization: The Geographer’s View
of the World (Prentice Hall: Englewood Cliffs, New Jersey).
Ankerst, M., Elsen, C., Ester, M., and Kriegel, H.P., 1999, Visual classification: An
interactive approach to decision tree construction. In KDD’99 Proc., Fifth ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining (New
York: ACM Press), pp. 392–396.
Anselin, L., 1988, Spatial Econometrics: Methods and Models (Kluwer: Dordrecht).
Anselin, L., 1995, Local indicators of spatial association – LISA. Geographical Analysis, 27,
93–115.
Asimov, D., 1985, The grand tour: a tool for viewing multidimensional data. SIAM Journal
of Science and Statistical Computing, 6, 128–143.
Assunção, R.M., and Reis, E.A., 1999, A new proposal to adjust Moran’s I for population
density. Statistics and Computing, 18, 2147–2162.
Bailey, T.C., 1994, A review of statistical spatial analysis in geographical information systems.
In Spatial Analysis and GIS, edited by S. Fotheringham and P. Rogerson (London:
Taylor and Francis).
Bao, S., Anselin, L., Martin, D., and Stralberg, D., 2000, Seamless integration of spatial
statistics and GIS: the S-Plus for ArcView and the S+Grassland links. Journal of
Geographical Systems, 2, 287–306.
Benediktsson, J.A., Swain, P.H., and Ersoy, O.K., 1990, Neural network approaches
versus statistical methods in classification of multisource remote sensing data. IEEE
Transactions on Geoscience and Remote Sensing, 28, 540–551.
Benediktsson, J.A., Swain, P.H., and Ersoy, O.K., 1993, Conjugate gradient neural
networks in classification of multisource and very high dimensional remote sensing
data. International Journal of Remote Sensing, 14, 2883–2903.
Benjamin, D.P. (editor), 1990, Change in Representation and Inductive Bias (Boston, MA:
Kluwer Academic Press).
Bennett, D.A., Wade, G.A., and Armstrong, M.P., 1999, Exploring the solution space of
semi-structured geographical problems with genetic algorithms. Transactions in GIS,
3, 51–72.
Birkin, M., Clarke, G., and George, F., 1995, The use of parallel computers to solve
non-linear spatial optimisation problems: an application in network planning. Environment
and Planning A, 27, 1049–1068.
Brazdil, P.B., and Konolige, K. (editors), 1990, Meta-Learning, Meta-Reasoning and Logics
(Boston, MA: Kluwer Academic Press).
Brémaud, P., 1999, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues (New
York: Springer).
Breusch, T.S., and Pagan, A.R., 1979, A simple test for heteroskedasticity and random
coefficient variation. Econometrica, 47, 1287–1294.
Briscoe, G., and Caelli, T., 1996, A Compendium of Machine Learning (volume 1: Symbolic
Machine Learning) (Norwood, New Jersey: Ablex Publishing Corporation).
Brunsdon, C., 2001, Is ‘statistics inferens’ still the geographical name for a wild goose?
Transactions in GIS, 5, 1–3.
Brunsdon, C., Fotheringham, A.S., and Charlton, M.E., 1996, Geographically weighted
regression: A method for exploring spatial nonstationarity. Geographical Analysis,
28, 281–298.
Brunsdon, C., Fotheringham, A.S., and Charlton, M.E., 1998, Spatial nonstationarity
and autoregressive models. Environment and Planning A, 30, 957–973.
Buttenfield, B.P., 1998, Looking forward: geographic information services and libraries in
the future. Cartography and GIS, 25, 161–171.
Buttenfield, B., Gahegan, M., Miller, H., and Yuan, M., 2001, Geospatial data mining
and knowledge discovery. UCGIS Emerging Themes White Paper:
URL: http://www.ucgis.org/emerging/.
Byungyong, K., and Landgrebe, D.A., 1991, Hierarchical decision tree classifiers in high
dimensional and large class data. IEEE Transactions on Geosciences and Remote
Sensing, 29, 518–528.
Civco, D.L., 1993, Artificial neural networks for land-cover classification and mapping.
International Journal of Geographical Information Systems, 7, 173–186.
Cliff, A., and Ord, J., 1973, Spatial Autocorrelation (London: Pion).
Cliff, A., and Ord, J., 1981, Spatial Processes: Models and Applications (London: Pion).
Cohen, W.W., 1995, Fast, effective rule induction. In Proceedings of 12th International
Conference on Machine Learning (San Francisco, California: Morgan Kaufmann),
pp. 115–123.
Congalton, R., 1991, A review of assessing the accuracy of classification of remotely sensed
data. Remote Sensing of the Environment, 37, 35–45.
Cook, D., Buja, A., Cabrera, J., and Hurley, C., 1995, Grand tour and projection pursuit.
Computational and Graphical Statistics, 4, 155–172.
Cressie, N.A.C., 1993, Statistics for Spatial Data, revised edition (New York: John Wiley
and Sons).
Cybenko, G., 1990, Complexity theory of neural networks and classification problems. In
Proceedings of Neural Networks EURASIP Workshop, edited by L.B. Almeida and
C.J. Wellekens, Sesimbra, Portugal (Berlin: Springer-Verlag), pp. 24–44.
Dai, X., and Khorram, S., 1999, Data fusion using artificial neural networks: a case study
on multitemporal change analysis. Computers, Environment and Urban Systems, 23,
19–31.
Dietterich, T.G., 1997, Machine learning research: four current directions. AI magazine,
Winter, pp. 97–136.
Diggle, P.J., 1983, Statistical Analysis of Spatial Point Patterns (London: Academic Press).
Ehrenfeucht, A., Haussler, D., Kearns, M., and Valiant, L., 1989, A general lower bound
on the number of examples needed for learning. Information and Computation, 82,
247–261.
Ester, M., Kriegel, H.P., and Sander, J., 1998, Algorithms for characterization and trend
detection in spatial databases. In Proceedings of 4th International Conference on
Knowledge Discovery and Data Mining (KDD’98), New York, USA (Menlo Park, CA:
American Association for Artificial Intelligence), pp. 44–50.
Fisher, P.F., 1994, Probable and fuzzy models of the viewshed operation. In Innovations in
GIS 1, edited by M. Worboys (London: Taylor and Francis), pp. 161–175.
Fischer, M.M., and Leung, Y., 1998, A genetic-algorithms based evolutionary computational
neural network for modeling spatial interaction data. Annals of Regional Science,
32, 437–458.
Fischer, M.M., and Staufer, P., 1999, Optimization in an error backpropagation neural
network environment with a performance test on a pattern classification problem.
Geographical Analysis, 31, 89–108.
Fitzgerald, R.W., and Lees, B.G., 1994, Assessing the classification accuracy of multisource
remote sensing data. Remote Sensing of the Environment, 47, 362–368.
Foody, G.M., McCulloch, M.B., and Yates, W.B., 1995, Classification of remotely sensed
data by an artificial neural network: issues relating to training data characteristics.
Photogrammetric Engineering and Remote Sensing, 61, 391–401.
Friedl, M.A., and Brodley, C.E., 1997, Decision tree classification of land-cover from
remotely sensed data. International Journal of Remote Sensing, 18, 711–725.
Fukunaga, K., 1990, Introduction to Statistical Pattern Recognition (San Diego, California:
Academic Press).
Gahegan, M., 2000, On the application of inductive machine learning tools to geographical
analysis. Geographical Analysis, 32, 113–139.
Gahegan, M., German, G., and West, G., 1999, Some solutions to neural network
configuration problems for the classification of complex geographic datasets. Geographical
Systems, 6, 3–22.
Gahegan, M., Harrower, M., Rhyne, T.M., and Wachowicz, M., 2001, The Integration
of Geographic Visualization with Databases, Data Mining, Knowledge Construction
and Geocomputation. Cartography and Geographic Information Science, 28, 29–44.
Gaines, B.R., 1996, Transforming Rules and Trees into Comprehensive Knowledge
Structures. In: Advances in Knowledge Discovery and Data Mining, edited by U. Fayyad,
G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Cambridge, MA: AAAI/MIT
Press), pp. 205–228.
Getis, A., and Boots, B., 1978, Models of Spatial Processes (Cambridge, UK: Cambridge
University Press).
Gehrke, J., Ganti, V., Ramakrishnan, R., and Loh, W.Y., 1999, BOAT—Optimistic
decision tree construction. Proc. SIGMOD 1999 (New York: ACM Press), pp. 169–180.
German, G., and Gahegan, M., 1996, Neural network architectures for the classification of
temporal image sequences. Computers and Geosciences, 22 (9), 969–979.
Glymour, C., Madigan, D., Pregibon, D., and Smyth, P., 1996, Statistical inference and
data mining. Communications of the ACM, 39, 35–41.
Goetz, A.F.H., and Curtiss, B., 1996, Hyperspectral imaging of the earth: remote analytical
chemistry in an uncorrelated environment. Field Analytical Chemistry and Technology,
1, 67–76.
Gould, P.R., 1970, Is Statistix Inferens the geographical name for a wild goose? Economic
Geography, 46, 539–548.
Gould, P.R., 1999, Becoming a Geographer (New York: Syracuse University Press).
Haining, R.P., 1990, Spatial Data Analysis in the Social and Environmental Sciences
(Cambridge: Cambridge University Press).
Hox, J., 1995, Applied Multilevel Analysis (TT-Publikaties: Amsterdam).
Hunt, E.B., Marin, J., and Stone, P.J., 1966, Experiments in Induction (New York, USA:
Academic Press).
Isaaks, E.H., and Srivastava, R.M., 1989, An Introduction to Applied Geostatistics (New
York: Oxford University Press).
Jimenez, L., and Landgrebe, D., 1998, Supervised classification in high dimensional space:
geometrical, statistical and asymptotical properties of multivariate data. IEEE
Transactions on System, Man and Cybernetics, 28C, 39–54.
Johnson, G.D., Myers, W.L., Patil, G.P., and Taillie, C., 1999, Multiresolution
fragmentation profiles for assessing hierarchically structured landscape patterns. Ecological
Modelling, 116, 293–301.
Judge, G.G., Carter Hill, R., Griffiths, W.E., Lütkepohl, H., and Lee, T.C., 1988,
Introduction to the Theory and Practice of Econometrics (New York: John Wiley
and Sons).
Kalton, G., and Anderson, D.W., 1986, Sampling rare populations. Journal of the Royal
Statistical Society (A), 149 (1), 65–82.
Kahn, R., and Braverman, A., 1999, What shall we do with the data we are expecting
from upcoming earth observation satellites? Journal of Computational and Graphical
Statistics, 8, 575–588.
Kanellopoulos, I., and Wilkinson, G., 1997, Strategies and best practice for neural network
image classification. International Journal of Remote Sensing, 18, 711–725.
Kohonen, T., 1997, Self-organizing maps (Berlin: Springer-Verlag).
Koperski, K., Han, J., and Adhikary, J., 1999, Mining knowledge in geographic data.
Communications of the Association for Computing Machinery.
URL: http://db.cs.stu.ca/sections/publication/kdd/kdd.html.
Kreft, I.G.G., and De Leeuw, J., 1998, Introducing Multilevel Modeling (London: Sage).
Krige, D.G., 1951, A statistical approach to some basic mine valuation problems on the
Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society of South
Africa, 52, 119–139.
Kuhn, T.S., 1962, The structure of scientific revolutions (Chicago: University of Chicago Press).
Kulldorff, M., 1999, Spatial scan statistics: models, calculations, and applications. In Scan
Statistics and Applications, edited by J.B. Glaz (Boston: Boston Press), pp. 303–322.
Laffan, S., 1998, Visualising neural network training in geographic space. In Proceedings of
3rd International Conference on GeoComputation, University of Bristol, United Kingdom,
17–19 September 1998, URL: http://www.geocomputation.org/1998/48/gc_48.htm.
Landgrebe, D., 1999, Information extraction principles and methods for multispectral and
hyperspectral image data. In Information Processing for Remote Sensing, edited by
C.H. Chen (River Edge, NJ, USA: World Scientific), pp. 3–38.
Lawson, A.B., 2001, Statistical Methods in Spatial Epidemiology (London: John Wiley and
Sons).
Lees, B.G., and Ritman, K., 1991, Decision tree and rule induction approach to
integration of remotely sensed and GIS data in mapping vegetation in disturbed or hilly
environments. Environmental Management, 15, 823–831.
Lissoir, S., and Rasson, J.P., 1998, Symbolic kernel discriminant analysis. In Advances in
Data Science and Classification, edited by A. Rizzi, M. Vichi and H.H. Bock (Berlin:
Springer-Verlag), pp. 417–423.
Longley, P., and Batty, M. (editors), 1996, Spatial Analysis: Modelling in a GIS Environment
(New York: John Wiley & Sons).
Luger, G.F., and Stubblefield, W.A., 1998, Artificial Intelligence: structures and strategies
for complex problem solving (Reading, MA: Addison-Wesley).
Mardia, K.V., Kent, T., and Bibby, J.M., 1979, Multivariate Analysis (London: Academic
Press).
Martin, J.C., 1991, Introduction to Languages and the Theory of Computation (New York:
McGraw Hill).
Matheron, G., 1963, Principles of geostatistics. Economic Geology, 58, 1246–1266.
McGarigal, K., and Marks, B., 1995, FRAGSTATS: Spatial pattern analysis program for
quantifying landscape structure. General Technical Report PNW-GTR-351, Portland,
OR, US Department of Agriculture, Forest Service, Pacific Northwest Research
Station.
Mechelen, I.V., Hampton, J., Michalski, R.S., and Theuns, P. (editors), 1993, Categories
and Concepts: theoretical views and inductive data analysis (New York: Academic Press).
Miller, H., and Han, J. (editors), 2001, Knowledge Discovery with Geographic Information
(London: Taylor and Francis).
Mitchell, T.M., 1997, Machine Learning (New York: McGraw Hill).
Møller, M.F., 1993, A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks, 6, 525–533.
Moran, P., 1948, The interpretation of statistical maps. Journal of the Royal Statistical Society
B, 10, 243–251.
Moret, B.M.E., and Shapiro, H.D., 1991, Algorithms from P to NP (Redwood, CA:
Benjamin-Cummings).
Nester, M., 1996, An applied statistician’s creed. Applied Statistics, 45, 401–410.
Openshaw, S., 1991, A spatial analysis research agenda. In Handling Geographic Information,
edited by I. Masser and M. Blakemore (London: Longman), pp. 18–37.
Openshaw, S., 1993, Modelling spatial interaction using a neural net. In Geographic Information
Systems, Spatial Modelling and Policy Evaluation, edited by M.M. Fischer and
P. Nijkamp (London: Springer-Verlag), pp. 147–164.
Openshaw, S., 1994, Exploratory space-time-attribute pattern analysers. In Spatial Analysis
and GIS, edited by S. Fotheringham and P. Rogerson (London: Taylor and Francis).
Openshaw, S., 2000, GeoComputation. In GeoComputation, edited by S. Openshaw and
A.J. Abrahart (London: Taylor and Francis), pp. 1–31.
Openshaw, S., Cross, A., and Charlton, M., 1990, Building a prototype geographical
correlates exploration machine. International Journal of Geographical Information
Systems, 4, 297–311.
Openshaw, S., and Abrahart, B. (editors), 2000, GeoComputation (London: Taylor and
Francis).
Openshaw, S., and Openshaw, C., 1997, Artificial Intelligence in Geography (Chichester, UK:
John Wiley and Sons).
Openshaw, S., Turner, A., Turton, I., Macgill, J., and Brunsdon, C., 1999, Testing
space-time and more complex hyperspace geographical analysis tools. In GIS Research UK
’99 (Southampton, UK: University of Southampton), pp. 89–102.
Ord, J.K., 1975, Estimation methods for models of spatial interaction. Journal of the American
Statistical Association, 70, 120–126.
Paola, J.D., and Schowengerdt, R.A., 1995, A detailed comparison of backpropagation
neural networks and maximum-likelihood classifiers for urban land use classification.
IEEE Transactions on Geosciences and Remote Sensing, 33, 981–996.
Peirce, C.S., 1878, Deduction, induction and hypothesis. Popular Science Monthly, 13, 470–482.
Popper, K.R., 1959, The Logic of Scientific Discovery (New York: Harper and Row).
Press, S.J., 1982, Applied Multivariate Analysis, including Bayesian and Frequentist Methods
of Inference (Malabar, Florida: Krieger Publishing Co).
Quinlan, R., 1993, C4.5: Programs for Machine Learning (San Mateo, CA: Morgan Kaufmann).
Ripley, B.D., 1996, Pattern Recognition and Neural Networks (Cambridge, UK: Cambridge
University Press).
Schaffer, C., 1993, Selecting a classification method by cross validation. Machine Learning,
13, 135–143.
Scott, D., 1992, Multivariate Density Estimation (London: John Wiley and Sons).
Simoudis, E., Livezey, B., and Kerber, R., 1996, Integrating inductive and deductive reasoning
for data mining. In Advances in Knowledge Discovery and Data Mining, edited by
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Cambridge, Mass.:
AAAI/MIT Press), pp. 353–374.
Smith, R., 1990, Extreme value theory. Handbook of Applicable Mathematics (supplement)
(New York: John Wiley and Sons).
Smyth, P., 2000, Data mining: Data analysis on a grand scale? Statistical Methods in Medical
Research, September 2000.
Sonka, M., Hlavac, V., and Boyle, R., 1993, Image Processing, Analysis and Machine Vision
(London, UK: Chapman and Hall).
Specht, D.F., 1991, A general regression neural network. IEEE Transactions on Neural
Networks, 2, 568–576.
Stehman, S.V., 1997, Selecting and interpreting measures of thematic classification accuracy.
Remote Sensing of the Environment, 62, 77–89.
Stewart, B.S., Liaw, C.F., and White, C.C., 1994, A bibliography of heuristic search
through 1992. IEEE Transactions on Systems, Man and Cybernetics, 24, 268–293.
Thompson, S.K., 1992, Sampling (New York: John Wiley and Sons).
Tobler, W., 1970, A computer movie simulating urban growth in the Detroit region. Economic
Geography, 46, 234–240.
Utgoff, P.E., 1986, Machine Learning of Inductive Bias (Boston, MA: Kluwer Academic Press).
Walpole, R.E., and Myers, R.H., 1989, Probability and Statistics for Scientists and Engineers
(4th Edition) (New York: Macmillan).
Wand, M.P., and Jones, M.C., 1995, Kernel Smoothing (London: Chapman and Hall).
Wang, C., 1993, Sense and Nonsense of Statistical Inference (New York: Dekker).
Yoshida, T., and Omatu, S., 1994, Neural network approaches to land-cover mapping. IEEE
Transactions on Geosciences and Remote Sensing, 32, 1103–1109.