Int. J. Geographical Information Science, 2003, vol. 17, no. 1, 69–92

Review Article

Is inductive machine learning just another wild goose

(or might it lay the golden egg)?

MARK GAHEGAN

GeoVISTA Center, Department of Geography, The Pennsylvania State University, 302 Walker Building, University Park, PA 16802, USA; e-mail: mng1@psu.edu

(Received 26 November 2001; accepted 29 April 2002)

Abstract. The research reported here contrasts the roles, methodologies and capabilities of statistical methods with those of inductive machine learning methods, as they are used inferentially in geographical analysis. To this end, various established problems with statistical inference applied in geographical settings are reviewed, based on Gould’s (1970) critique. Possible solutions to the problems outlined by Gould are suggested via reviews of: (i) improved statistical methods, and (ii) recent inductive machine learning techniques. Following this, some newer problems with inference are described, emerging from the increased complexity of geographical datasets and from the analysis tasks to which we put them. Again, some solutions are suggested by pointing to newer methods. By way of results, questions are posed, and answered, relating to the changes brought about by adopting inductive machine learning methods for geographical analysis. Specifically, these questions relate to analysis capabilities, methodologies, the role of the geographer and consequences for teaching and learning. Conclusions argue that there is now a strong need, motivated from many perspectives, to give geographical data a stronger voice, thus favouring techniques that minimize the prior assumptions made of a dataset.

International Journal of Geographical Information Science
ISSN 1365-8816 print/ISSN 1362-3087 online © 2003 Taylor & Francis Ltd
http://www.tandf.co.uk/journals
DOI: 10.1080/13658810210157778

1. Introduction

In his famous article critiquing the use of inferential statistics, ‘Is statistix inferens the geographical name for a wild goose?’, Peter Gould (1970) lays bare the many premises upon which inferential statistical analysis is founded, alternately questioning their validity and the blind faith placed in them by geographers. These questions are revisited here in the light of a digital revolution that is providing torrents of data where once was only a trickle (Miller and Han 2001). Consequently, we are confronted with the difficulty of scaling up our analysis to embrace datasets that are both voluminous in terms of numbers of records or samples represented (n), and deep in terms of the number of separate attribute dimensions over which data are gathered (p). As well as making additional demands on existing analysis methods, these datasets also generate the need for new types of analysis procedure, to support exploration, mining and knowledge discovery (Buttenfield et al. 2001, Gahegan et al. 2001). It is not always clear that traditional statistical techniques can address these new challenges, and where they can, there may be severe consequences in terms of computational burden, significance testing, demands for sample data and so forth. Openshaw and Openshaw (1997, p. 3) describe the current situation thus: ‘Sadly, nearly all of the available methods for analysis, modelling and processing to extract value date from an earlier period of history where data were scarce and the analyst had to rely on his or her intuitive skills aided by an intimate knowledge of what little information was available to formulate analysis tasks’.

Within the domain of geographical analysis, the use and capabilities of traditional inferential statistics are here contrasted with an alternative form of computational inference based on inductive machine learning. The discussion is restricted to inference used for predicting some unknown characteristics or properties, as opposed to the identification of underlying processes or models. The latter is possible also with machine learning, for example by utilizing tools to automatically construct Bayesian Belief Networks, but falls outside the scope of this paper. Philosophically, statistical inference and machine learning (ML) are based, to differing extents, around a style of inference known as induction, allowing the analyst to infer some generic outcomes from specific examples, to wit: ‘By induction, we conclude that facts, similar to observed facts, are true in cases not examined’ (Peirce 1878). This contrasts with deduction, in which facts are asserted as true by computation against some a priori model. Section 2 below describes the process of inductive inference in detail.

Machine learning and inferential statistics typically differ in their use of prior knowledge. Inferential statistics uses observations to condition (shape) the form of a distribution model that is usually provided by the analyst. This prior assumption represents a self-imposed limit in terms of model complexity and the ability to adapt to the data. By contrast, many machine learning techniques construct a distribution model using evidence gleaned from the data alone, i.e. they are data-driven. This difference leads to major methodological disparities affecting training, accuracy analysis, goodness of fit and significance testing. Thus it can appear at first glance that these two types of inference are for quite different purposes, yet we see a growing trend to employ neural, genetic and rule-based induction methods in place of more traditional forms of geographic analysis (Benediktsson et al. 1990, Byungyong and Landgrebe 1991, Lees and Ritman 1991, Civco 1993, Openshaw 1993, Fisher 1994, Yoshida and Omatu 1994, Paola and Schowengerdt 1995, Foody et al. 1995, German and Gahegan 1996, Friedl and Brodley 1997, Fischer and Leung 1998, Bennett et al. 1999, Openshaw and Abrahart 2000). The reasons for this are largely concerned with practicality.

Firstly, we can substitute a model that must be provided beforehand for a learned model that is derived when needed from sample data. This can lead to greater flexibility, and less reliance on expert knowledge for configuration. Such flexibility may well prove crucial; as geographers integrate ever more data to study complex phenomena such as human-environment interaction or population demographics and epidemiology, the difficulties in specifying a reliable model in advance rise accordingly. Discovering, or inducing, such a model from a limited set of observations may provide a practical alternative.

Secondly, in many complex systems with non-axiomatic components, models may either be too elaborate to define or else too susceptible to variation in preconditions; for example, data gathered from a different place requires a different model. Gould points out (p. 444) that a geographer should expect this latter problem since: ‘...all phenomena of interest to the geographer are never independent in the fundamental dimensions of his enquiry’. We must then decide if this interdependence can be expressed axiomatically (cf. spatial regression or autocorrelation, Cressie 1993) or whether a more adaptive approach is needed instead.

Statistical research has had an influence on geography that is both broad and deep; shaping the way analysis is conducted (and how systems are understood and communicated) and having itself been shaped by many researchers who have revised and refined techniques to better suit the nature of geographical space (Moran 1948, Ord 1975, Getis and Boots 1978, Anselin 1988, Kulldorff 1999). We now turn attention to the potential for inductive machine learning to do likewise. Two general questions are examined in this regard:

1. How might inductive machine learning change the way we conduct geographical analysis?

And at a deeper level:

2. How does inductive machine learning change the way we conceptualize and describe geographical systems?

It is not my intention (and neither was it Gould’s) to dismiss inferential statistics as inadequate or to insinuate that its day has passed. Research in spatial statistics has made huge progress in the last couple of decades, starting from a number of disparate breakthroughs across a variety of fields and weaving the many separate strands together into a cohesive body of knowledge that can be brought to bear across a wide range of problems (Diggle 1983, Isaacs and Srivastava 1989, Haining 1990, Cressie 1993, Lawson 2001). In my opinion it is needed more than ever. It is my intention, however, to show that there now exists a range of geographical problems and datasets that require us to reassess the methods of analysis that are best suited. Over the same timeframe, the machine learning community has made equally vast strides, progressing from rule-based, deductive approaches to sophisticated concept learning and function optimization methods (Stewart et al. 1994, Mitchell 1997, Luger and Stubblefield 1998, Bremaud 1999) that hold great potential for a wide range of geographical problems.

Bailey (1994) provides a very useful overview of the progress that spatial statistics has made, including a taxonomy of the methods and approaches that have developed. In the same article, Bailey also refers to some of the (then) more radical approaches sanctioned by Openshaw (1991) that are more in line with machine learning than statistics, correctly pointing out (at that time) that they carry their own set of problems, are too computationally demanding and that they are ‘...not yet developed to the stage where they are widely applicable’. In the intervening time, the problems alluded to have been more thoroughly investigated (Openshaw and Openshaw 1997, Kanellopoulos and Wilkinson 1997, Gahegan et al. 1999) and are touched upon later; the computational performance issues, relevant then, have been largely overcome (Moller 1993, Birkin et al. 1995, Fischer and Staufer 1999); the applicability, as argued convincingly by Miller and Han (2001) and Buttenfield et al. (2001), arises from the data and applications we are now faced with.

Hence it is time to revisit this debate. We do so by first examining the progress made by statistics and machine learning that relate to Gould’s original critique (§3), following from which some additional problems are described, arising mainly from the wealth and richness of datasets now routinely available and the corresponding complexity of the questions currently being pursued in our efforts to understand the Earth’s intricate systems (§4). Taken together, these difficulties form the motivation for expanding our arsenal of inferential tools to include machine learning methods. By doing so we are able to discard some problematic underlying assumptions. But we must also modify and declare some in addition, all of which have a direct impact on the questions we can investigate, the methodology we must use and our interpretation of the results produced (§5). The conclusions present a summary of the findings and outline the major research themes still to be addressed in this arena.

2. The process of inductive inference

Figure 1 depicts the inductive process, beginning with a set of observations {X}, each consisting of a value $x$ (univariate case) or vector of values $x_1, x_2, \ldots, x_p$ (multivariate case) and an outcome or target ($y$), drawn from the set {Y}. During learning or training a function is constructed that maps inputs X to desired outcomes Y ($X \rightarrow Y$); this is referred to as a mapping function, or target function (V). The first stage in an inductive methodology is then to acquire this mapping function (figure 1(a)). In machine learning, it is learned directly from a limited set of examples; in statistical inference it is the distributional form chosen by the analyst, but which may require some parameterization that is calibrated from the data. The second stage is a generalization step, where the acquired function is applied to a (usually much larger) dataset K ($X \subset K$), for which Y is unknown and must be predicted (figure 1(b)). Although not shown in the figure, in the ML case it is possible for Y to also be a vector, signifying the learning of two or more objective functions simultaneously.

Figure 1. The inductive learning methodology. (a) The target function (V) is learned from examples, and (b) then applied to predict unknown values.

2.1. Learning as a search process

Many of the tasks undertaken in conventional analysis or modelling can be tackled inductively by recasting them in terms of a search problem, whether it be for the identification of suitable parameters for configuring a statistical function (calibration), or for the construction of useful functions themselves to form into more complex models (Openshaw and Openshaw 1997). Classification, too, can be expressed as a search for discriminant functions or characterizing distributions that demark a category in feature-space. In many forms of learning, the number of possible states to be searched through is prohibitively large, so stochastic approximation methods are used to avoid exhaustive enumeration (Stewart et al. 1994, Mitchell 1997). Stochastic search uses the idea of a performance metric (such as predictive error or explanatory power) that can be calculated for each possible state the tool can take. These states may be conceptualized as comprising a surface (usually a hyper-surface), where the lowest point represents the best configuration. The aim is to iteratively move towards this point of least error, but bearing in mind that an exhaustive search (enumerating the performance metric for each point on the surface) is computationally intractable. ML techniques differ as to how this search is performed (Sonka et al. 1993 and Openshaw and Openshaw 1997 give further details). A feed-forward neural network with back propagation, for example, employs a neighbourhood search on the error surface, and at each iteration the centroid of this neighbourhood is moved in the direction offering the largest apparent performance improvement (Benediktsson et al. 1993). By contrast, decision trees use an information gain measure to find a new decision rule that, when added, contributes the most to the desired outcome (Hunt et al. 1966, Quinlan 1993). In both cases the search terminates either after a pre-determined number of iterations, or when the performance gain falls below some threshold. Consequently, it is not possible to say if the solution found is indeed the optimal choice, but instead we must establish its superiority through application (§3.6).
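The search just described can be illustrated with a minimal sketch. It is not the back propagation or decision tree procedure itself, only a generic neighbourhood search under stated assumptions: the function names, the quadratic error surface and all parameter values are hypothetical and chosen purely for illustration. The two stopping rules mirror those named in the text (an iteration limit, and a performance gain that falls below a threshold).

```python
import random


def stochastic_search(error, start, step=0.5, n_neighbours=20,
                      max_iter=1000, min_gain=1e-9, seed=0):
    """Neighbourhood search on an error surface (illustrative helper).

    Stops after max_iter iterations, or as soon as the best sampled
    neighbour improves the error by less than min_gain.
    """
    rng = random.Random(seed)
    current = list(start)
    current_err = error(current)
    iteration = 0
    for iteration in range(max_iter):
        # Sample candidate states in a box around the current position.
        candidates = [
            [c + rng.uniform(-step, step) for c in current]
            for _ in range(n_neighbours)
        ]
        best = min(candidates, key=error)
        best_err = error(best)
        gain = current_err - best_err
        if gain < min_gain:
            break  # no worthwhile improvement found nearby
        current, current_err = best, best_err
    return current, current_err, iteration


def bowl_error(w):
    # Hypothetical error surface with its lowest point (error 0) at (1, -2).
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2


start = [5.0, 5.0]
best, best_err, n_iter = stochastic_search(bowl_error, start)
print(best, best_err, n_iter)
```

Because only a sample of neighbouring states is ever evaluated, the returned state is simply the best one found; nothing in the procedure certifies it as the global optimum, which is the point made above.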

Once constructed, the model can be tested by requiring it to infer outcomes for cases where Y is already known, but is withheld; its effectiveness at doing so gives one measure of the inferential accuracy of the learned model (see further details in §3.5). Practically speaking, X and Y may be discrete or continuous, since statistical and inductive learning methods have been developed to operate across the full range of statistical scales.
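A minimal sketch of testing on withheld cases follows. The learner is a simple nearest-neighbour rule and the data are synthetic; both are assumptions made for illustration, as are the helper names and the 30% test fraction.

```python
import random


def holdout_split(samples, test_fraction=0.3, seed=0):
    """Shuffle the samples and withhold a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def nearest_neighbour_predict(train, x):
    """Predict the label of the training sample closest to x."""
    nearest = min(
        train,
        key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)),
    )
    return nearest[1]


def holdout_accuracy(train, test):
    """Fraction of withheld cases whose known label is inferred correctly."""
    hits = sum(nearest_neighbour_predict(train, x) == y for x, y in test)
    return hits / len(test)


# Synthetic two-class data: two well-separated clusters in two dimensions.
rng = random.Random(1)
data = (
    [((rng.gauss(0, 0.5), rng.gauss(0, 0.5)), "a") for _ in range(50)]
    + [((rng.gauss(5, 0.5), rng.gauss(5, 0.5)), "b") for _ in range(50)]
)
train, test = holdout_split(data)
accuracy = holdout_accuracy(train, test)
print(len(train), len(test), accuracy)
```

The accuracy obtained this way describes performance on the withheld cases only; it is one measure of inferential accuracy, not a guarantee for the larger dataset to which the model is later applied.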

2.2. Constructing the mapping function

As described previously, the major difference between statistical and machine induction is the degree to which a priori knowledge is used in the learning phase. In statistical methods, the form of the mapping function used is specified beforehand, for example a straight line, $y = a + bx$, or a Gaussian curve $n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}[(x-\mu)/\sigma]^2}$, with the parameters ($a$ and $b$ in the former case, $\mu$ and $\sigma$ in the latter) derived from the presented data. In machine learning, an iterative process is used to approximate the desired outcomes, usually involving many simple components working together to construct the required mapping function in a piecewise form. Thus the overall function is highly parameterized, being constructed from a number of more primitive functions that are summed together (e.g. hyperplanes in a neural network) or arranged in a hierarchy (e.g. decision rules in a decision tree) so as to operate cohesively. The learning capacity of the tool is governed by the number of these small functions used, and the mechanisms by which they are combined.

In many ML methods there is no requirement for the same overall functional form to be used throughout the entire range of the data, nor indeed to assume that just one functional form is adequate. Thus, irregular and multi-modal distributions cause no additional complications, provided enough learning capacity is available in the tool, since they can be constructed by the piecewise combination of more primitive functions. The additional flexibility is very useful in situations where relationships between variables are complex and/or unknown.
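As a small illustration of piecewise construction, the sketch below builds a two-peaked (multi-modal) curve by summing six simple hinge functions. The hinge (rectified linear) unit is an assumption made here for illustration; the weights are set by hand rather than learned, and the function names are hypothetical.

```python
def hinge(z):
    """A primitive function: zero below the hinge point, linear above it."""
    return max(0.0, z)


def hat(x, centre, half_width):
    """A single peak of height 1 built by summing three hinge functions."""
    return (
        hinge(x - (centre - half_width))
        - 2.0 * hinge(x - centre)
        + hinge(x - (centre + half_width))
    ) / half_width


def bimodal(x):
    """Two peaks of different heights: a sum of six hinges in total."""
    return 1.0 * hat(x, centre=2.0, half_width=1.0) \
        + 0.5 * hat(x, centre=6.0, half_width=1.0)


for x in (0.0, 2.0, 4.0, 6.0, 8.0):
    print(x, bimodal(x))
```

Adding further peaks only requires adding further primitives, which is the sense in which learning capacity depends on the number of small functions available.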

2.3. Assumptions and testing

Clearly, statistical inference requires the assumption that the expert-supplied function is suitable for the problem. This assumption can be tested with a goodness of fit statistic (Walpole and Myers 1989, p. 344), which is a measure of distance between the observed values and the function used to describe their situation. It does not establish that the function is somehow the ‘right’ one, but merely provides a metric by which alternatives may be ranked. The ML method requires a different set of assumptions, namely that there is sufficient expressive power available (via the summed primitive functions) and that a good parameterization of these functions can be found (via the stochastic search). Goodness of fit measures make no sense for an ML method, since the data distribution is not assumed. Instead, the learned model must be validated by the quality of its outcomes, as described above.

Both statistical and machine learning methods use a generalization step, thereby assuming that a finite set of values (the sample) is sufficient to build an effective general model. In this sense, both employ induction, though clearly the ML methods rely on induction to a larger extent, having greater capacity to adapt to the presented data.

3. Old problems with the use of inferential statistics

The original argument made by Gould catalogues problems with statistical inference according to the validity of certain underlying assumptions. By making these assumptions and fixing certain properties the analyst can concentrate on those data characteristics she wishes to study and ignore all other aspects. Some assumptions are made to simplify the mathematics, others might be reasonable given certain circumstances. Note that these problems are not so much a consequence of bad underlying theories as they are a result of careless or thoughtless application in a geographic setting; they arise when underlying assumptions are untested or unquestioned. Each of Gould’s original problems with inferential statistics (the function form, the sample, independence of observations and residuals, the distribution of the variables and error terms, and the level of significance) is described briefly in the following sub-sections, along with an overview of the developments that have occurred in the meantime to address them.

3.1. The form of the function

Gould’s first argument is that functional relationships between variables are often oversimplified for convenience, for example assumed to be linear, or at least linear over the range of the data. This simplifies the computation associated with analysis, although in practice it may also lower accuracy.

All too often there may be absolutely no logical reason why linearity, or some other simplistic relationship, should be assumed. Gould argues (in 1970!) that with improvements in computational capacity, and in associated software, there is no longer a reason to strive for simplicity where it is not warranted. In the meantime, research in statistics has made significant progress in the support provided for more complex functions (McGarigal and Marks 1995), hierarchies of functions that better integrate scale-based analysis (Kreft and DeLeeuw 1998, Johnson et al. 1999) and extreme value theory to address very rare events (Smith 1990). Geographically weighted regression (Brunsdon et al. 1998) addresses this same issue by making local subsets where the functional form is the same, but the parameterization differs. However, more simplistic statistical models are still in widespread use, possibly reflecting the ease with which they can be applied and understood, rather than the need for computational simplicity.

Large families of ML methods have also been developed to address the modelling of complex functional forms. As described above in §2.2, complex functions can be simulated by ML methods by combining many simpler, low-level functions, such as decision rules or hyperplanes. Neural networks are perhaps the most widely used method in this regard. For example, the General Regression Neural Network (GRNN: Specht 1991) provides a more flexible form of regression, where distances from the fitted line are applied piecewise, locally rather than globally, allowing more complex functional relationships to be modelled with ease.
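The local character of this kind of regression can be sketched as a kernel-weighted average of the training outcomes, which is the usual way of describing the GRNN estimate. The code below is a univariate illustration only; the function name, the Gaussian kernel width `sigma` and the data are assumptions, not values from any cited study.

```python
import math


def grnn_predict(x, train_x, train_y, sigma=0.5):
    """Kernel-weighted average of training outcomes around x.

    Each training sample contributes with a Gaussian weight that falls off
    with its distance from x, so the fit is governed locally, not by one
    global functional form. If x is extremely far from every sample the
    weights can underflow to zero, which this sketch does not guard against.
    """
    weights = [
        math.exp(-((x - xi) ** 2) / (2.0 * sigma ** 2)) for xi in train_x
    ]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)


train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 1.0, 0.0, 1.0]

# A narrow kernel follows the local data; a very wide one tends to the mean.
print(grnn_predict(1.0, train_x, train_y, sigma=0.1))
print(grnn_predict(1.0, train_x, train_y, sigma=100.0))
```

The single smoothing parameter controls how far the influence of each sample extends, so the zig-zag pattern in this toy dataset is reproduced without specifying any functional form in advance.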

3.2. The sample

Assumptions include the randomness of sample selection, problems of generalizing from a sample to a population and the chances that the sample contains unwanted bias of some sort. These problems still pervade spatial statistics; for example, a semi-variogram (a graphical tool for exploring spatial dependence in data) will produce misleading results when samples are preferentially clustered or data shows significant heteroskedasticity (Isaacs and Srivastava 1989, p. 527). Improvements in sampling strategies help to alleviate some of these problems (Kalton and Anderson 1986, Thompson 1992) and simulation techniques such as the Monte Carlo method can help explore for randomness and bias problems (Bremaud 1999). Using relative variograms, or other locally-calculated measures of variance, can help offset the effects of heteroskedasticity.

In part, ML methods overcome this problem by avoiding assumptions about the sample, though its representativeness is tacitly assumed. The whole area of sampling theory and bias associated with both the data and the generalization methods used have formed central strands in the development of machine learning methods (Benjamin 1990, Briscoe and Caelli 1996), and are well understood.

3.3. The independence of observations and residuals

Assumptions here include that the sample is representative and that each observation is independent, though Tobler’s first law (‘Everything is related to everything else, but near things are more related than distant things’, Tobler 1970) advises us that independence is not likely in a geographical setting. Tackling the second part of this rule, the spatial statistics community has made great progress in providing much better means of dealing with spatial dependence; from measures of global autocorrelation (Moran 1948, Cliff and Ord 1973, 1981) to sophisticated, locally-computed measures of spatial dependence and change in relationships over geographical space (Anselin 1995, Brunsdon et al. 1996, Assuncao and Reis 1999).

As above, ML methods do not rely on assumptions of independence; the reliance on evidence is based solely on how useful it is in predicting a desired outcome; indeed, metrics describing this utility (such as information gain, Quinlan 1993) are used to control the inductive learning process by evaluating each possible next move (§2.1). Any form of correlation affects the utility of parts of the feature vector X in predicting Y, since if $x_a$ and $x_b$ are strongly correlated, then after using $x_a$ there is likely to be little information gain when using $x_b$. Thus, dependence structures in data are implicitly ‘learned’ in the training phase.
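The effect of correlation on information gain can be shown with a small worked example. The sketch uses the standard entropy-based definition of information gain; the toy dataset, in which the second attribute simply duplicates the first, and the helper names are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Reduction in label entropy obtained by splitting on one attribute."""
    n = len(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Attribute 0 plays the role of x_a, attribute 1 of x_b (perfectly correlated).
rows = [(0, 0)] * 4 + [(1, 1)] * 4
labels = ["n", "n", "n", "y", "y", "y", "y", "n"]

gain_a = information_gain(rows, labels, 0)

# After splitting on x_a, look at what x_b adds within one branch.
branch = [(r, y) for r, y in zip(rows, labels) if r[0] == 0]
branch_rows = [r for r, _ in branch]
branch_labels = [y for _, y in branch]
gain_b_after_a = information_gain(branch_rows, branch_labels, 1)

print(gain_a, gain_b_after_a)
```

Some uncertainty about the outcome remains within the branch, yet the correlated attribute contributes nothing further, so a learner driven by this metric would not select it.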

3.4. The distribution of the variables and the error terms

Error terms particularly are often assumed to be normally distributed, without any physical or logical basis for such an assumption, and with potential to add error into the analysis. Gould argues that these assumptions (normality of data and error, unimodality, homoskedasticity) are untenable in many settings and again a result of laziness or an over-enthusiastic zeal for simplicity. Here again, progress has been significant, with the development of spatial statistical techniques that can specifically model autocorrelation in error terms (Cressie 1993, chapter 5), as well as in the signal, and reliable means to test for heteroskedasticity (Breusch and Pagan 1979). Kriging (Krige 1951) and other forms of geostatistical analysis are able to specifically calculate measures of spatial dependence (e.g. via a semi-variogram) that can be used to improve interpolation and estimation in the presence of noise. However, these too become problematic, for example if the range of different distances between observations is not adequately sampled (as noted above in §3.2).

Again, ML methods do not start from any such distributional assumptions so largely avoid these pitfalls. However, ML methods can exhibit some undesirable bias because they assume that reducing error, or increasing information gain, are valid measures by which to prioritize the learning process. Consequently, learning concentrates on those denser regions of feature space where the greatest gains can be made, typically those with the largest number of samples. Other regions may be neglected until later in the learning process, by which time the solution thus far may not be able to accommodate these remaining cases. Figure 2 depicts this situation.

Figure 2. For this distribution of samples, using only three hyperplanes or oblique decision rules, the feature space cannot be subdivided so that a perfect classification results. The two diamond samples inside the dashed oval will likely be mis-classified, since this represents a minimization of error. Any bias in the distribution of such ‘difficult to train on’ samples will propagate into the result.

Solving bias problems requires careful initial calibration, to ensure enough learning capacity is available, though only just enough, otherwise over-training may occur (Gahegan 2000). Utgoff (1986) describes how the bias exhibited during training can itself be learned, so that it might be better understood.

3.5. The level of significance

Questions are raised about the selection of significance levels for testing; these are often motivated by the reliability of the data, not the reliability required in the prediction. The fact that a significance value is itself only a likelihood of reliability seems to be overlooked in our enthusiasm to achieve a positive result, and has been widely criticized recently within statistics (Nester 1996). Brunsdon (2001) brings to light the debate within the statistics community regarding the validity of significance testing from a methodological perspective (Wang 1993). The problem of significance testing has recently taken on a new form with the popularisation of exploratory and data mining techniques that perform thousands, or even millions of tests, a problem taken up later in §4.4.

As mentioned already, significance tests make no sense for ML methods; assessments of performance must instead be made from outcomes. This usually involves holding back some percentage of the training data to independently test on the learned model, requiring modification to the underlying experimental methodology (Fitzgerald and Lees 1994). Various validation methods have been reported for this purpose (Congalton 1991, Schaffer 1993, Stehma 1997).

3.6. How machine learning techniques restate these problems

In summary, the form of the function, including patterns of covariance and distribution of error terms, is not assumed, but is learned. If the data provides evidence (examples) of a relationship between location and some value, then, provided this relationship is useful in predicting the desired outcome, the ML technique will attempt to learn this pattern. Even if the relationship changes over space, that too can be learned if it is encoded in the examples presented. For example, a neural network deals with covariance (spatial or otherwise) by learning that the co-varying attributes together over-predict an outcome, so connection weights are adjusted to reduce the strength of the signal. The whole notion of empirically modelling these relationships is put aside, thus any problems associated with the selection or accuracy of statistical functions do not apply. Likewise, the distribution of error terms is never assumed, so demands no special treatment.

There are, of course, caveats: these relate to the data themselves (they are required to contain evidence of the trends that help to predict the desired outcome) and to the learning capacity of the tool (it must be able to detect and represent the useful trends). Openshaw and Openshaw (1997) and Gahegan (2000) give more details relating to the machine learning of geographical pattern.

3.7. Progress in statistics to address these problems

In the years since Gould’s paper was originally published, a good deal of ground has been covered to address the above problems. Brunsdon (2001), in a recent editorial review of Gould’s original paper, points out areas where statistical research has resulted in real progress, by tools that can relax or better account for one or more of the above problems, including ‘...generalized additive modelling, nonparametric regression, kernel density estimation, randomization tests and regression models with autocorrelated errors...’. Useful reviews of these, and other way-markers to progress, can be found in Wand and Jones (1995), Hox (1995) and Longley and Batty (1996). Mainstream acceptance of these newer techniques seems to be assured, but until they are routinely available, Gould’s original warnings still apply. In part, a slow uptake may be due to limited availability of the new statistics in established software, though marked progress is reported by Bao et al. (2000). Furthermore, dedicated software packages such as SpaceStat™ (http://www.spacestat.com/) and SpatialAnalyst™ (http://www.esri.com/software/arcgis/arcgisxtensions/spatialanalyst/index.html), and the interest they stimulate, signify a trend for spatially-aware statistical methods to become more accessible.

4. Emerging problems with the use of inferential statistics

It is not just the theory and available tools that have changed radically in the last thirty years; geographical data have changed too, as have the tasks to which we put them! With the advent of vast, digital geospatial datasets, of ever-increasing subtlety and collected at geometric rates, additional analysis problems arise as new challenges (Buttenfield 1998, Kahn and Braverman 1999). This section introduces a number of new problems arising from the changing nature of the data we use, in terms of: (1) size and non-intuitive nature of a high-dimensional feature space, (2) data reduction, (3) computational complexity, (4) significance testing, and (5) increasing demands for training data.

4.1.Size and non-intuitive nature of high dimensional feature space

The size of a feature space is determined by the number of unique positions that

it comprises,given p attribute dimensions each measured with a precision p.If we

assume for simplicity that p is the same for all dimensions and measured as the

number of bits by which data is encoded,then the number of unique positions in

feature space is given by (2p)p.

Using three attribute dimensions,each represented by a single byte,the size of

the features space is (28)3#16.7 million unique locations—a common size for many

remote sensing problems.Obviously,this number arises very rapidly if either p or n

increase.For the AVIRIS hyperspectral remote sensing platform,which uses 12-bit

data precision and 224 spectral channels,this equation becomes (212)224#1.47e+809,

an astronomical number. Considering the United States 2000 census Demographic
Profile, we obtain 98 variables with around 32 bit precision, making a feature space
with a truly staggering (2^32)^98 ≈ 1.1e+944 locations. Even when the number of
observations is very large (massive n), the vast majority of these possible values
will not be realized, so the feature space will be largely empty (sparse).
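The arithmetic above can be checked with a few lines of Python. This is a minimal sketch rather than anything taken from the article: the function names are illustrative, and the size is handled through its base-10 logarithm because the numbers overflow ordinary floating point.

```python
import math


def feature_space_size_log10(bits_per_dimension: int, dimensions: int) -> float:
    """Return log10 of (2**bits)**dimensions, the count of unique positions."""
    return bits_per_dimension * dimensions * math.log10(2)


def format_size(bits_per_dimension: int, dimensions: int) -> str:
    """Format the feature space size in the 'mantissa e+exponent' style used in the text."""
    exponent = feature_space_size_log10(bits_per_dimension, dimensions)
    whole = math.floor(exponent)
    mantissa = 10 ** (exponent - whole)
    return f"{mantissa:.2f}e+{whole}"


# Three single-byte attributes: exactly 16,777,216 positions (about 16.7 million).
print((2 ** 8) ** 3)
# AVIRIS: 12-bit precision, 224 spectral channels.
print(format_size(12, 224))  # 1.47e+809
```

Python integers are unbounded, so `(2 ** 12) ** 224` could also be computed exactly; the logarithmic form is used only to keep the output readable.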

We are familiar with conceptualizing analysis in two or three dimensions, where
distribution functions exhibit a highly recognizable form. However, we should be
cautious in the way we generalize these conceptualizations to higher dimensional
spaces, since these familiar functions become less intuitive, and consequently more
difficult to model, as p increases. By way of a simple example (after Scott 1992),
consider the case of a square and a circle: specifically, a circular cluster of points
modelled using a square box, as would be the case with a parallelepiped classifier, or
as could be modelled with four decision rules or linear discriminant functions.
Figure 3 depicts this situation.

Figure 3. Comparing simple geometric shapes and fractional intersection of their volume in
a p dimensional feature space, after Scott (1992) and Landgrebe (1999).

In two dimensions the model seems to be an acceptable approximation, since
the ratio of the area of the circle to that of the square is reasonably close to one,
at 0.79, so the model used does not generalize too far beyond the observed properties
of the data. However, if p is increased, this ratio does not stay constant, but decreases
rapidly to a state where the surrounding box is almost entirely empty and is a very
poor representation of the data. By p=4 the ratio of the volumes is well below 50%
and at p=7 the hypersphere only accounts for about 4% of the volume of the
hypercube. In other words, the hypercube is certainly no longer a useful approximator
of any spherical cluster of data points, since it is 96% empty.
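The ratios quoted here follow from the standard formula for the volume of a p-dimensional ball. The sketch below assumes a unit-radius hypersphere inscribed in a hypercube of side 2; the function name is illustrative.

```python
import math


def inscribed_ball_fraction(p: int) -> float:
    """Fraction of a p-dimensional hypercube's volume filled by its inscribed hypersphere."""
    # Volume of a unit-radius ball in p dimensions: pi^(p/2) / Gamma(p/2 + 1).
    ball_volume = math.pi ** (p / 2) / math.gamma(p / 2 + 1)
    # The enclosing hypercube has side length 2 (the ball's diameter).
    cube_volume = 2.0 ** p
    return ball_volume / cube_volume


for p in (2, 3, 4, 7, 10):
    print(f"p={p:2d}  fraction={inscribed_ball_fraction(p):.4f}")
```

For p=2 this gives roughly 0.79, for p=4 roughly 0.31 and for p=7 roughly 0.04, in line with the figures in the text.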

Were this problem to be confined to only rectangular or orthonormal structures
then it would simply require that we choose statistical models with greater care as
p increases. But unfortunately, the same geometric problems occur with other
distribution functions too; in fact it can be generally shown that for an arbitrary shape,
as dimensionality is increased, more of the volume of the object becomes concentrated
in an outer shell, and less in the centre. So, when considering a Gaussian distribution,
the volume under the curve migrates quickly from the centre to the tails of the
distribution, producing a rather counter-intuitive flat shape. Note that this effect is
not a result of a lack of training examples, high variance or poor model choice, but
simply a consequence of geometry. An insightful explanation of this phenomenon is given
by Landgrebe (1999), who also points out the following two important consequences:
that the space is largely empty and that the migration of volume to the outer shell
or corners causes great difficulties for multi-variate density estimation (Scott 1992,
Wand and Jones 1995, Jimenez and Landgrebe 1998).
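The migration of volume towards the outer shell can be illustrated with a short calculation. Shrinking any shape by a linear factor (1 - t) scales its volume by (1 - t)^p, so the shell of relative thickness t holds the remainder. The 10% thickness below is an arbitrary choice made for illustration, not a value from the article.

```python
def outer_shell_fraction(p: int, shell_thickness: float = 0.1) -> float:
    """Fraction of a p-dimensional object's volume lying in its outer shell.

    shell_thickness is the shell depth as a fraction of the object's linear
    size (an illustrative assumption); the inner core keeps (1 - t)**p of the volume.
    """
    return 1.0 - (1.0 - shell_thickness) ** p


for p in (2, 10, 100):
    print(f"p={p:3d}  outer 10% shell holds {outer_shell_fraction(p):.1%} of the volume")
```

In two dimensions the shell holds 19% of the volume; by p=100 it holds practically all of it.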

The point here is that familiar distributional forms do not perform well in
high-dimensional settings; they were never designed to. It becomes vital, instead, to take
a piecewise or hierarchical approach, tackling the problem by fragmenting the space
into lower dimensional partitions only where the feature space contains useful
information, and ignoring other empty portions. This is why neural networks and
decision trees often meet with success in these settings (§2.2).

4.2. Data reduction

Another way to deal with feature space complexity is to use tools that reduce
the space to a manageable form, for example by classification or clustering. Recent
interest in data mining and knowledge discovery (DM/KD) as applied to geography
(Miller and Han 2001, Buttenfield et al. 2001) is evidence of this need. Not
surprisingly, many of the newer tools for data reduction harness inductive machine
learning methods (Cohen 1995, Gehrke et al. 1999).

In direct contrast to this ‘reductionist’ approach, Openshaw (1994, p. 87) cautions
that such pre-processing may well remove important information, and suggests that
‘A worthwhile general principle should be to develop methods of analysis that impose
as few as possible additional, artificial, and arbitrary selections on the data’. However,
many commercial systems still appear to offer limited support for higher-dimensional
data, encouraging us to be wasteful, since we are expected to renounce many
attributes in order to concentrate analysis on the small handful that appear to carry
the most information. Techniques such as Principal Components Analysis (PCA)
and Multi-Dimensional Scaling (MDS) have been specifically developed to help us
with this task. There are two important problems with such approaches:

1. It is assumed that the phenomena of interest can be adequately expressed with
a small number of variables. However, complex processes, such as landuse
change or gentrification, may possess a ‘signature’ that extends over many
different attribute domains and is not adequately explained in any small subset.
2. Generally speaking, data reduction methods such as PCA and MDS assume
that global variance is a sufficient measure of an attribute’s utility, which, it
could be argued, is rather un-geographical. We should be intimately concerned
with the spatial structure within attribute data, i.e. within the context of place
(Abler et al. 1971, chapter 1), and less with globally aggregated measures.
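The second problem can be made concrete with a toy example. The sketch below is a deliberate simplification: it ranks attributes by their global variance instead of running a full PCA, and the place names, attribute values and function names are all hypothetical. It shows an attribute with large global variance that does not differ between places, alongside a low-variance attribute that separates them cleanly.

```python
from statistics import mean, pvariance

# Hypothetical observations: (place, attribute_a, attribute_b).
OBSERVATIONS = [
    ("north", 10.0, 1.0), ("north", 90.0, 1.2), ("north", 50.0, 0.9), ("north", 30.0, 1.1),
    ("south", 15.0, 2.0), ("south", 85.0, 2.1), ("south", 45.0, 1.9), ("south", 35.0, 2.2),
]


def global_variance(column: int) -> float:
    """Variance of one attribute over all observations, ignoring place."""
    return pvariance([row[column] for row in OBSERVATIONS])


def between_place_gap(column: int) -> float:
    """Absolute difference between the per-place means of one attribute."""
    north = mean(row[column] for row in OBSERVATIONS if row[0] == "north")
    south = mean(row[column] for row in OBSERVATIONS if row[0] == "south")
    return abs(north - south)


for name, column in (("attribute_a", 1), ("attribute_b", 2)):
    print(name, "variance:", global_variance(column), "place gap:", between_place_gap(column))
```

A variance-driven reduction would retain attribute_a and discard attribute_b, even though only attribute_b carries any information about place.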

By reducing dimensionality, we trade accuracy for simplicity, and in doing so risk a
corresponding loss of explanatory power. In cases where variables are highly
correlated and processes are simple, this loss of accuracy might be small or even
insignificant, but that is yet another assumption brought about by the now outdated
need for computational simplicity.

There is now a large body of evidence, both inside and outside of geography,
that demonstrates the abilities of machine learning techniques, and particularly
decision trees and neural networks, to deal effectively with tasks involving high
dimensional data (p>10, p>100) (Benediktsson et al. 1993, Ripley 1996, German
and Gahegan 1996, Di and Khorram 1999). Reduction to just two or three variables
is an outdated notion that in most cases is no longer required.

In addition to machine learning approaches, a number of statistically-based
techniques have been proposed to tackle the same problem, including the notion of
projection pursuit for data exploration (Asimov 1985, Cook et al. 1995) and a variety
of pooled-covariance techniques to reduce the complexity of constructing a
high-dimensional distributional model (see §4.6).

Perhaps another factor here is the desire for conceptual simplicity and
transparency in our underlying models? There may be good cause for this, such as ease of
communication or for pedagogic reasons. But I am aware of no reason why good
geographic models should, by nature, involve only a small number of simple
relationships. Perhaps it is time at last to embrace the first part of Tobler’s first law (§3.3)?

4.3. Computational complexity

Larger datasets imply an increase in the number of cases (n) or the number of
attributes associated with each case (p), or possibly both. When addressing datasets
with either large n or large p, the time required by the machine to perform the
necessary computations can become a limiting factor for all forms of analysis. For
example, it may render impractical any exhaustive search for the best solution,
i.e. one where all possible alternatives are evaluated.

Computational complexity is usually expressed in terms of the number of
iterations of an algorithm required to complete the calculation, in the best, worst or
average case (Moret and Shapiro 1991). Obviously, any increase in n or p directly
impacts complexity. Many machine learning techniques scale somewhere between
O(n log n) and O(n^2) in terms of runtime computational burden (Martin 1991), with
p being a constant term determining the complexity of each iteration. By contrast,
closed form statistical techniques are nominally of O(n), though techniques such
as maximum likelihood require the additional derivation of a covariance matrix
(see §4.5 below). Non-linear statistical functions are more expensive because the
approximation techniques used, such as Newton-Raphson (Judge et al. 1988), are
computationally demanding and typically of the order of O(n^3).
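To give a feel for what these orders of growth mean in practice, the sketch below compares how much more work each class implies when n grows tenfold. It is an illustration only: constant factors and the per-iteration cost attributable to p are ignored, and the baseline of 1000 cases is an arbitrary assumption.

```python
import math

# Idealised operation counts for each order of growth (constant factors ignored).
GROWTH_RATES = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    "O(n^2)": lambda n: n ** 2,
    "O(n^3)": lambda n: n ** 3,
}


def relative_cost(n: int, baseline: int = 1_000) -> dict:
    """Work needed for n cases relative to the work needed for `baseline` cases."""
    return {name: rate(n) / rate(baseline) for name, rate in GROWTH_RATES.items()}


for name, factor in relative_cost(10_000).items():
    print(f"{name:11s} tenfold more data -> about {factor:,.0f} times the work")
```

A tenfold increase in n costs ten times the work at O(n), roughly thirteen times at O(n log n), a hundred times at O(n^2) and a thousand times at O(n^3).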

By abandoning a deterministic approach in favour of stochastic search (§2.1),
machine learning techniques are able to reduce computational demands significantly
for non-linear distributions, a factor that becomes increasingly vital as the feature
space enlarges (Openshaw et al. 1999). In doing so, they remain computationally
tractable for large values of p, as noted above.

Whereas many ML techniques are able to analyse datasets with tens or even
hundreds of dimensions, further increases in p, perhaps with associated increases in
n as is common in data mining, currently cause a performance bottleneck. Significant
advances in computational efficiency are currently being sought to enable these
techniques to scale up further. Proposed solutions usually involve increasing the
number of prior assumptions in order to reduce the time complexity, so that it
approaches O(n). Examples include RIPPER (Cohen 1995) and BOAT (Gehrke et al.
1999), both based on optimistic construction of a decision tree.

4.4. Further problems with significance testing

As datasets become ever more complex, we must rely on exploratory methods
to bring to light useful knowledge. Data mining aims to uncover unknown patterns
by repeated application of a (usually local) test. One of the earliest examples of
data mining in geography is Openshaw’s Geographical Analysis Machine
(GAM: Openshaw et al. 1990), which performs a clustering test for each cell on a
gridded surface over a number of spatial scales. Philosophically, it is debatable
whether such repeated testing constitutes a real hypothesis test, in the sense of setting
up and evaluating a null (H0) and alternative (H1) hypothesis at a given level of
significance. To make their downgraded status clear, such tests are sometimes referred
to as indicators instead (Anselin 1995). But algorithmically, the mining method is indeed
choosing between H0 and H1 at every iteration: Gould (1999, p. 224) later refers to GAM as
conducting ‘...eight million rigorous Poisson-based tests...’.

When large numbers of hypotheses are evaluated, the problem of significance
testing described above (§3.5) becomes even more vexing. If we perform only one
test, say at a (high) significance level of 1%, then we must acknowledge one chance
in a hundred that our results might be significant only by a chance arrangement of
data values, and not arising from any noteworthy cause. Conducting a million tests,
we should anticipate 10000 or so such ‘errors’, and so forth. In fact the number of
these commission errors rapidly rises to the point where they become a significant
distraction; the user is faced with a mountain of results to sift through with no way
to distinguish the good from the bad. New forms of significance testing have been
put forward to address this problem, which take into account the volume of tests
when reporting significance (Glymour et al. 1996, Smythe 2000). Nowhere is this
more necessary than in spatial or spatio-temporal data mining, where the physical
dimensions add considerably to the number of tests to be applied (Ester et al. 1998,
Koperski et al. 1999).
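The scale of the problem is easy to compute. The sketch below first reproduces the expected number of chance ‘errors’, then shows one simple way of accounting for the volume of tests: the Bonferroni correction. Bonferroni is used here purely as a familiar illustration and is not claimed to be the method of the works cited above; the function names are illustrative.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of tests significant by chance alone when every null is true."""
    return num_tests * alpha


def bonferroni_threshold(num_tests: int, family_alpha: float) -> float:
    """Per-test significance level keeping the family-wise error rate at family_alpha."""
    return family_alpha / num_tests


# One million tests at the 1% level: about 10000 chance 'errors'.
print(expected_false_positives(1_000_000, 0.01))
# Per-test level needed to keep the overall error rate at 1%: about 1e-08.
print(bonferroni_threshold(1_000_000, 0.01))
```

The corrected threshold is very strict, which is one reason less conservative approaches are of interest in exploratory settings.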

To summarize, traditional statistical methodologies can experience difficulties in
exploratory settings where they are put to use in a manner for which they were
never designed. Machine learning researchers have tackled this vexing issue by
providing techniques that can summarize and generalize from learning outcomes,
thus avoiding a case-by-case assessment of significance (Gains 1996, Brazdil and
Konolige 1990). Significance testing may also prove unreliable if distributions
cannot be conditioned accurately because of a lack of training examples, as
discussed next.

4.5. Increased demands for sample or training data

Fukunaga (1990) shows that for a linear statistical classifier, the number of
training samples required depends directly on p, but for a quadratic classifier, such
as maximum likelihood, this rises to p^2. More precisely, a Gaussian distribution
requires the formulation of a covariance matrix that describes relationships between
dependent attributes. The covariance matrix is symmetric across its diagonal, so
only one triangle (including the diagonal) needs to be estimated, and the number of
coefficients that require estimation is given by c p (p+1)/2, where c is the number of
classes to be delineated and p the number of dimensions in feature space (as before).
Five classes and five attribute dimensions require a reasonable 75 covariance values
to be estimated, but ten classes and 100 dimensions would require 50500. Each of
these coefficients is estimated from the data sample, so the data must contain enough
observations to allow all these coefficients to be estimated reliably. Clearly, this fast
becomes an entirely impractical requirement.
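The coefficient count can be reproduced directly from the formula, with one symmetric p-by-p covariance matrix per class. The function name below is illustrative.

```python
def covariance_coefficients(num_classes: int, num_dimensions: int) -> int:
    """Distinct covariance values to estimate: c * p * (p + 1) / 2."""
    # A symmetric p x p matrix has p * (p + 1) / 2 distinct entries (one triangle
    # plus the diagonal); one such matrix is needed for each class.
    return num_classes * num_dimensions * (num_dimensions + 1) // 2


print(covariance_coefficients(5, 5))     # 75
print(covariance_coefficients(10, 100))  # 50500
```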

By making assumptions regarding covariance (pooling), the number of samples
required to construct a Gaussian curve can be reduced to around 30–100 independent
examples per attribute dimension (Mardia et al. 1979). What does this mean in
practice? To construct such a well-conditioned curve in a socio-demographic setting
using 10 attributes would require 300–1000 examples, and to use a supervised classifier
on hyperspectral remote sensing data from the AVIRIS sensor would require between
224×30=6720 and 224×100=22400 independent training samples, though one
could question whether supervised classification is really a suitable way to interpret
such data (Goetz and Curtiss 1996). These illustrative examples are somewhat
contrived, but nevertheless we can expect growing numbers of attributes to become
available within all areas of geographical analysis in the future, so they serve as a
useful indicator of the increasing demand for so called ‘ground truth’. This might be
good news for geography graduates in search of employment in the field!
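The sample-size figures follow from multiplying the number of attribute dimensions by the 30–100 rule of thumb. A minimal sketch, with an illustrative function name:

```python
def training_sample_range(dimensions: int, per_dimension=(30, 100)):
    """Lower and upper bounds on independent training samples for `dimensions` attributes."""
    low, high = per_dimension
    return dimensions * low, dimensions * high


print(training_sample_range(10))   # (300, 1000)   socio-demographic example
print(training_sample_range(224))  # (6720, 22400) AVIRIS, 224 spectral channels
```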

Unlike parametric methods, ML methods are not required to build complete
models, in the sense that no effort needs to be applied to regions of feature space
that are empty; and as pointed out above, this is usually the vast majority of the
space. Ehrenfeucht et al. (1989) show that for inductive machine learning, the amount
of training data required depends on the complexity of the learning task, so is more
difficult to define beforehand. In the case of classification, this complexity depends
on the number of classes required and the intricacy of the separation task, which
itself depends only partly on the dimensionality (Cybenko 1993). In short, many
inductive learning techniques manage better than a linear relationship with p, in
terms of data requirements, allowing them to extend to very large feature spaces
without acquiring a voracious appetite for data.

4.6. The n ≪ p problem

Generally speaking, multivariate statistical inference assumes that p<n, in that
n samples are generalized to form a p-dimensional distributional model. But where
p>n, these distributions cannot be constructed or are degenerate. For example, to
construct a sample covariance matrix (S) requires that n>p. If n is not greater than p,
then the rank of S is less than p, so the matrix becomes singular (after Press 1982).
That being the case, the inverse of S does not exist and its probability distribution
cannot be calculated.
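The rank deficiency is easy to demonstrate on a tiny example. The sketch below builds a sample covariance matrix with exact rational arithmetic and computes its rank by Gaussian elimination. The data values and function names are made up for illustration, with n=3 samples in p=4 dimensions for the degenerate case.

```python
from fractions import Fraction


def sample_covariance(samples):
    """Sample covariance matrix S (p x p) of n observations, using exact fractions."""
    n, p = len(samples), len(samples[0])
    means = [Fraction(sum(row[j] for row in samples), n) for j in range(p)]
    return [
        [sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]


def matrix_rank(matrix):
    """Rank via Gaussian elimination; exact because the entries are Fractions."""
    rows = [list(r) for r in matrix]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


few_samples = [[1, 2, 0, 3], [4, 0, 1, 1], [2, 5, 3, 0]]   # n=3, p=4
enough_samples = [[1, 2], [2, 1], [3, 5], [4, 3]]          # n=4, p=2

print(matrix_rank(sample_covariance(few_samples)))     # 2, less than p=4: singular
print(matrix_rank(sample_covariance(enough_samples)))  # 2, equal to p=2: invertible
```

With n samples the centred data span at most n-1 directions, so S cannot reach rank p unless n exceeds p.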

There are various statistical short-cuts that can be taken to construct S and they
fall into two types: either reduce p or increase n. Increasing n can be effectively
achieved by assuming some prior knowledge of a distribution, so that fewer samples
are needed to condition it properly. One possibility, mentioned above, is to assume
that covariance is constant for a particular class or indeed for all classes (pooled
covariance). Landgrebe (1999) presents a useful summary of possible methods for
pooling, and discusses their likely effects on predictive accuracy. Reducing p is usually
achieved using principal components or factor analysis.

New solutions to this problem are offered by inductive methods. For example, a
Self Organising Map (SOM, Kohonen 1997) reduces a highly multivariate space
into a lower dimensional structure (typically two-dimensional) by training a set of
neurons (v, v ≪ n) to represent the salient properties of the original data. The neurons
capture the variance and important trends in the data. In doing so, they reduce n
and p to v and 2, respectively. One advantage here is that the form of the problem
is not changed; we still have a set of (albeit transformed) observations within a
(transformed) feature space. Another advantage is that the mapping from n to v aims
to preserve topology, so relative positions in the transformed feature space still have
meaning.
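To make the reduction from (n, p) to (v, 2) concrete, here is a deliberately small self-organising map in plain Python. It is a sketch under stated assumptions and not Kohonen's reference implementation: the grid size, learning-rate schedule, Gaussian neighbourhood function and synthetic data are all illustrative choices.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid coordinates of the neuron whose weight vector is closest to observation x."""
    return min(weights, key=lambda cell: math.dist(weights[cell], x))


def train_som(data, rows, cols, epochs=10, seed=0):
    """Train a rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    p = len(data[0])
    # One neuron per grid cell, initialised from randomly chosen observations.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # visit observations in random order
            progress = step / total_steps
            learning_rate = 0.5 * (1.0 - progress)                   # decays towards 0
            radius = max(rows, cols) / 2.0 * (1.0 - progress) + 0.5  # shrinking neighbourhood
            winner = best_matching_unit(weights, x)
            for cell, w in weights.items():
                # Neurons near the winner on the grid are pulled towards x as well,
                # which is what encourages the map to preserve topology.
                influence = math.exp(-math.dist(cell, winner) ** 2 / (2.0 * radius ** 2))
                for j in range(p):
                    w[j] += learning_rate * influence * (x[j] - w[j])
            step += 1
    return weights


# Synthetic data: n=60 observations in p=5 dimensions (illustrative only).
_rng = random.Random(1)
observations = [[_rng.random() for _ in range(5)] for _ in range(60)]

som = train_som(observations, rows=3, cols=3)  # v = 9 neurons, with v much smaller than n
projected = [best_matching_unit(som, x) for x in observations]
print(len(som), "neurons;", "first observation maps to grid cell", projected[0])
```

Each observation ends up described by a two-dimensional grid position, and the nine weight vectors stand in for the original sixty observations.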

5. Questions about induction

To highlight the relevance of the above discussion to geographical analysis, this
section is structured around several questions related to the consequences of using
machine induction, addressing how it might change our capabilities, methodologies,
understanding, our role and even the way we approach teaching.

5.1. Can we address previously intractable problems?

The answer here is clearly yes; problems that were once intractable because
of dataset complexity, computational burden or for lack of a model (§4) are now
feasible. As additional progress is made within the machine learning and data mining
communities, providing more reliable search and optimization methods, the frontiers
of possibility will be pushed back still further (Dietterich 1997, Gehrke et al. 1999).

5.2. Does the method of investigation change?

From a methodological perspective, we see that inferential statistics requires a
model to be specified beforehand, with unknown examples then evaluated against
it. By contrast, machine learning requires examples to be available that represent
the functioning of the model, but not the model itself. By generalizing from these
known examples, a model is induced. A major difference then, between these two
styles of analysis, concerns the requirement for prior knowledge. It is not necessary
to have a procedural understanding of a problem before using ML to predict or
infer new results.

By adopting machine induction, we move from an explicit model constructed by
a human expert (perhaps indirectly from observations or theory) to an implicit model
constructed directly from examples by an algorithm. Methodology changes
accordingly (§2). In all cases, reliance on the human expert is never fully relinquished since
machine learning algorithms require a variety of hands-on intervention to assure
their correct functioning. While one goal is to remove this reliance, because it
demands a level of computational knowledge, another is to build expertise from
the user into the method, as it relates to the domain of application (German and
Gahegan 1999). These goals are not in conflict, though they may appear to be so at
first glance.

5.3. Are we able to examine new kinds of questions and if so, how?

Again the answer is yes; the ability to operate in the absence of prior knowledge
is enabled by substituting data for expertise (Openshaw 2000), with examples used
as a surrogate for this understanding. So, questions can be generated from our
extended ability to extract patterns from data, to categorize and to generalize. These
questions can take the form of hypotheses that shape the start of a more
traditional investigation. To this end, inductive learning is being applied within data
mining tools, to uncover previously unknown relationships and patterns in complex
geographical datasets (Ester et al. 1998).

5.4. Does our approach to science need to change to accommodate induction?

At a more philosophical level, we need to embrace induction as a valid form of
scientific inference, one that is different from the deductive approach used in ‘normal’
science (Popper 1959), that achieves a different purpose and that needs to be verified
in a different manner. The validity of induction seems to be a matter for the domain
scientists to resolve, since within the philosophy of science it is widely acknowledged,
and has been for more than a century (Peirce 1878, Mechelen et al. 1993).

Computational methods simulate the act of induction by applying complex
algorithms containing a degree of non-determinism. One problematic consequence
is that results may vary, even when the same algorithm is applied to the same
dataset. In a scientific sense this is troublesome, because it challenges the notion of
repeatability in experimentation. Since repeatability has long been regarded as one
of the three pillars of science (cf. communicable, repeatable, refutable) the
consequences for analysis are both philosophic and practical. However, it could be
argued that any deviation in the result is simply a reflection of the indeterminate
nature of the problem itself; in other words, we delude ourselves to think that there
is a single ‘right’ answer that we can know with decimal precision. So even though
repeatability provides a yardstick by which results can be directly compared, in
many cases it may hide the uncertainty present. Stochastic methods leave the
uncertainty within the result and force us to deal with it. Fuzzy and probabilistic
approaches to combining evidence do the same (Fisher 1994). The variance in
the results is then a measure of the uncertainty in the data combined with the
learning deficiencies of the algorithm used (i.e. uncertainty in the constructed model),
often with some small element of chance due to the randomized start conditions
used. By contrast, the error term in inferential statistics is a measure of the
goodness-of-fit of the data to the pre-defined model and not how appropriate the model itself
might be. The simplest way to account for variance in results of ML methods is to
compute an average value over several consecutive training and validation cycles.
Many appropriate measures have been proposed (Schaffer 1993).
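Averaging over repeated cycles can be sketched as follows. The helper `toy_train_and_score` is a hypothetical stand-in for a real training and validation cycle (it simply returns a noisy accuracy), and reporting the standard deviation alongside the mean is an illustrative choice, not a measure prescribed by the works cited.

```python
import random
from statistics import mean, stdev


def summarise_runs(train_and_score, seeds):
    """Run one training/validation cycle per seed; return (mean score, standard deviation)."""
    scores = [train_and_score(seed) for seed in seeds]
    return mean(scores), stdev(scores)


def toy_train_and_score(seed):
    """Hypothetical stand-in for a stochastic learner: accuracy near 0.80 that varies by seed."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.05, 0.05)


average, spread = summarise_runs(toy_train_and_score, seeds=range(10))
print(f"mean accuracy {average:.3f}, standard deviation {spread:.3f} over 10 runs")
```

Fixing the seeds makes an individual run repeatable, while the spread across seeds keeps the remaining uncertainty visible instead of hiding it behind a single number.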

5.5. Who knows the most, the geographer or the data?

A function describing the basis of a statistical model will usually include
parameters that allow adaptation to the current dataset, but the model itself remains
invariant. A model of this kind has many advantages: it is simple, can be easily
understood and communicated, and leads to repeatable analysis. On the negative
side it may be inaccurate (the underlying relationship might have a complex
covariance structure) and in highly multivariate datasets it might also be difficult to
‘discover’ in the first place. Furthermore, because the model is fixed, it cannot readily
adapt to subtle differences in the data used that may occur within or between specific
places. We must either assume it is universally true or else we must redevelop it each
time it is applied. Fully-inductive methods take the latter approach, automatically
reformulating new relationships for each dataset presented.

By assuming a fixed relationship holds true, we remove the possibility of
discovering something new and significant about the study region. Such over-reliance
on a logico-deductive approach to science has been widely criticized. For example,
Kuhn (1962) asserts that such models can never in themselves lead to new knowledge,
and only when they are seen to fail can new knowledge follow, since this implies the
model represents an invalid hypothesis. Furthermore, deduction, by itself, precludes
the development of a new or refined model. True induction does not suffer from this
disadvantage.

To sum up, the argument between inferential and machine inductive approaches
can be stated as follows: ‘Do we know enough about our systems (or are they so
simple and predictable) that a deterministic approach is adequate, or do these
systems contain local subtleties and complexities that would favour a more adaptable
approach?’

Perhaps more radically, the question can be re-expressed as: ‘Do our data
represent a better approximation of system behaviour than our expertise?’ This
question is challenging, and emphasizes different aspects of the role of the
geographer. To take it to the extreme: in the first instance, the geographer is the
theoretician who imposes structure on the data directly and thus shapes the outcome;
in the second, the geographer is the field expert who must carefully gather
representative samples so that a valid model can emerge from them. Across the discipline, we
see stark evidence of both of these roles.

5.6. Are there implications for teaching and learning about geographical analysis?

One ramification for education is that learned models may be difficult to recover
and to communicate, even if they do lead to improvements in predictive power. The
simple parametric form of many common statistical functions makes the nature of
relationships easy to comprehend and to explain, whereas most machine learning
methods have little or no facility to describe the models they learn in any way that
makes immediate sense to a human. This is not an insurmountable problem; even a
complex model can be progressively reduced to a simpler, more generalized form
for presentation and examination: learning outcomes can be visualised and internal
structures can be summarized (Gains 1996, Laffan 1998, Ankerst et al. 1999).

However, one could also make the counter-argument, namely: is such
simplification ultimately helpful and/or does it act as a barrier to understanding, rather than
an aid? The complexity of learned models may well depict geography as inherently
complex, and thus challenge our tendency to simplify it. Clearly, there are pedagogic
consequences to face.

6. Summary: an even wilder goose chase?

Inductive machine learning offers considerable promise to improve our predictive
capabilities in complex settings, but is not yet a magic bullet (or a golden egg). The
answers it provides are only as good as: (1) the representativeness of the data, and
(2) the capability of the methods to learn the trends contained therein.

Statistical analysis is good for some classes of problem, where the solution is
largely deterministic and the underlying model is well understood. However,
geographical science that is entrenched only in statistics is short sighted. To address the
challenges of richer and more voluminous data, geographers will need new tools
employing different inferential techniques. ML reacts, via learning, to the specific
properties of a complex dataset and one could therefore argue that it is more
‘geographic’ since it is able to respond specifically to the nuances of place, provided
of course that place is encoded in the data. This is both a strength and a weakness.
It is a strength because the models produced are unique from place to place. It is a
weakness because the notion of remaining objective to some wholly external frame
of reference is sacrificed.

As a summary of the techniques described above, some of the more common
analysis tasks are shown in table 1, with suitable tools shown for each task drawn
from statistics and machine learning. It highlights that there are many methods
with common goals and points to some of the alternative ML methods that can be
substituted for their more established statistical counterparts as datasets and tasks
become more complex.

Table 1. Various analysis tasks with their statistical and machine learning counterparts.

Analysis task                   Statistical technique              ML technique
Data reduction                  Principal components,              Self-organizing map
                                multi-dimensional scaling
Clustering                      k-means, ISODATA                   Self-organizing map,
                                                                   association rules
Modelling simple relationships  Regression, correlation            General regression neural
                                                                   network (GRNN)
Classification                  Maximum likelihood,                Discrete output neural
                                discriminant analysis              network, decision tree
Function approximation          Non-linear least squares and       Continuous output feed-
                                likelihood estimation              forward neural network
Parameter estimation            Least squares,                     Stochastic search,
                                maximum likelihood,                genetic algorithms,
                                expectation maximization,          gradient ascent (descent)
                                best linear unbiased estimator
Rule-based inference            First order logic,                 Decision tree, rule induction
                                linear discriminants

By increasing our reliance on induction we change the role of the expert, since
many initial assumptions need no longer be made or tested, but we must instead rely
directly on the ‘truth’ (representativeness) contained within the dataset. Such a goal
is perhaps not entirely laudable, since it is probably a good thing to be
intimately familiar with one’s data, but that familiarity is an increasingly impractical
requirement due to the escalating size and complexity of datasets (Openshaw and
Openshaw 1997, p. 3).

Difficulty of use is still a real issue with many forms of machine learning; it is
not always straightforward to make informed choices regarding parameter
configuration. However, this situation is also common for more advanced spatial analysis
tools. Configuration of neural networks, for instance, is no more complex a task
than conducting a geostatistical interpolation: the appropriate use of kriging requires
quite a deep knowledge of available methods, as well as selection of suitable
transformations (spherical, etc.).

To make the descriptions clearer I have contrasted the simpler techniques from
statistics and machine learning. There are many other techniques that merit
description, but space considerations have precluded their mention. It is important to point
out that there is by now a good deal of convergence between statistics and machine
learning, especially with more advanced techniques where the need to search through
solution spaces efficiently is a common thread in both disciplines (Moller 1993,
Stewart et al. 1994, Simoudis et al. 1996). For example, Kernel Discriminant Analysis
(Lissoir and Rasson 1998), a statistical classification technique, constructs decision
boundaries by employing a non-linear mapping of the data into some feature space,
via a series of ‘kernel’ transformation functions. This new space introduces distortions
to allow a cleaner delineation of the classes. Although the theoretical foundation
differs from that of a neural classifier, the functionality and many of the configuration
and training issues are similar. This trend towards convergence between machine
learning and statistical analysis is likely to continue, so their distinction will become
less clear as time passes.

Acknowledgments

This paper is dedicated to the memory of Peter Robin Gould (1929–2000), whose
many insights are a continuing source of inspiration.

References

Abler, R., Adams, J.S., and Gould, P., 1971, Spatial Organization: The Geographer’s View of the World (Englewood Cliffs, New Jersey: Prentice Hall).
Ankerst, M., Elsen, C., Ester, M., and Kriegel, H.P., 1999, Visual classification: An interactive approach to decision tree construction. In KDD’99 Proc., Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (New York: ACM Press), pp. 392–396.
Anselin, L., 1988, Spatial Econometrics: Methods and Models (Dordrecht: Kluwer).
Anselin, L., 1995, Local indicators of spatial association - LISA. Geographical Analysis, 27, 93–115.
Asimov, D., 1985, The grand tour: a tool for viewing multidimensional data. SIAM Journal of Science and Statistical Computing, 6, 128–143.
Assunção, R.M., and Reis, E.A., 1999, A new proposal to adjust Moran’s I for population density. Statistics and Computing, 18, 2147–2162.
Bailey, T.C., 1994, A review of statistical spatial analysis in geographical information systems.

In Spatial Analysis and GIS,edited by S.Fotheringham and P.Rogerson (London:

Taylor and Francis).

B,S.,A,L.,M,D.,and S,D.,2000,Seamless integration of spatial

statistics and GIS:the S-Plus for ArcView and the S+Grassland links.Journal of

Geographical Systems,2,287–306.

B,J.A.,S,P.H.,and E,O.K.,1990,Neural network approaches

versus statistical methods in classiﬁcation of multisource remote sensing data.IEEE

Transactions on Geoscience and Remote Sensing,28,540–551.

B,J.A.,S,P.H.,and E,O.K.,1993,Conjugate gradient neural

networks in classiﬁcation of multisource and very high dimensional remote sensing

data.International Journal of Remote Sensing,14,2883–2903.

B,D.P.(editor),1990,Change in Representation and Inductive Bias (Boston,MA:

Kluwer Academic Press).

B,D.A.,W,G.A.,and A,M.P.,1999,Exploring the solution space of

semi-structured geographical problems with genetic algorithms.Transactions in GIS,

3,51–72.

B,M.,C,G.,and G,F.,1995,The use of parallel computers to solve non-

linear spatial optimisation problems:an application in network planning.Environment

and Planning A,27,1049–1068.

B,P.B.,and K,K.(editors),1990,Meta-Learning,Meta-Reasoning and Logics

(Boston,MA:Kluwer Academic Press).

B,P.,1999,Markov Chains:Gibbs Fields,Monte Carlo Simulation,and Queues (New

York:Springer).

B,T.S.,and P,A.R.,1979,A simple test for heteroskedasticity and random

coeﬃcient variation.Econometrica,47,1287–1294.

B,G.,and C,T.,1996,A Compendium of Machine Learning (volume 1:Symbolic

Machine Learning) (Norwood,New Jersey:Ablex Publishing Corporation).

B,C.,2001,Is ‘statistics inferens’ still the geographical name for a wild goose?

Transactions in GIS,5,1–3.

B,C.,F,A.S.,and C,M.E.,1996,Geographically weighted

regression:A method for exploring spatial nonstationarity.Geographical Analysis,

28,281–298.

B,C.,F,A.S.,and C,M.E.,1998,Spatial non-stationarity

and autoregressive models.Environment and Planning A,30,957–973.

B,B.P.,1998,Looking forward:geographic information services and libraries in

the future.Cartography and GIS,25,161–171.

B,B.,G,M.,M,H.,and Y,M.,2001,Geospatial data mining

and knowledge discovery.UCGIS Emerging Themes White Paper:

URL:http://www.ucgis.org/emerging/.

B,K.,and L,D.A.,1991,Hierarchical decision tree classiﬁers in high

dimensional and large class data.IEEE Transactions on Geosciences and Remote

Sensing,29,518–528.

C,D.L.,1993,Artiﬁcial neural networks for landcover classiﬁcation and mapping.

International Journal of Geographical Information Systems,7,173–186.

C,A.,and O,J.,1973,Spatial Autocorrelation (London:Pion).

C,A.,and O,J.,1981,Spatial Processes:Models and Applications (London:Pion).

C,W.W.,1995,Fast,eﬀective rule induction.In Proceedings of 12th International

Conference on Machine Learning (San Francisco,California:Morgan-Kaufmann),

pp.115–123.

C,R.,1991,A review of assessing the accuracy of classiﬁcation of remotely sensed

data.Remote Sensing of the Environment,37,35–45.

C,D.,B,A.,C,J.,and H,C.,1995,Grand tour and projection pursuit.

Computational and Graphical Statistics,4,155–172.

C,N.A.C.,1993,Statistics for Spatial Data,revised edition (New York:John Wiley

and Sons).

C,G.,1990,Complexity theory of neural networks and classiﬁcation problems.In

Proceedings of Neural Networks EURASIP Workshop,edited by L.B.Almeida and

C.J.Wellekens,Sesimbra,Portugal (Berlin:Springer-Verlag),pp.24–44.

D,X.,and K,S.,1999,Data fusion using artiﬁcial neural networks:a case study

on multitemporal change analysis.Computers,Environment and Urban Systems,23,

19–31.

D,T.G.,1997,Machine learning research:four current directions.AI magazine,

Winter,pp.97–136.

D,P.J.,1983,Statistical Analysis of Spatial Point Patterns (London:Academic Press).

E,A.,H,D.,K,M.,and V,L.,1989,A general lower bound

on the number of examples needed for learning.Information and Computation,82,

247–261.

E,M.,K,H.-P.,and S,J.,1998,Algorithms for characterization and trend

detection in spatial databases.In Proceedings of 4th International Conference on

Knowledge Discovery and Data Mining (KDD’98),New York,USA (Menlo Park,CA:

American Association for Artiﬁcial Intelligence),pp.44–50.

F,P.F.,1994,Probable and fuzzy models of the viewshed operation.In Innovations in

GIS 1,edited by M.Worboys (London:Taylor and Francis),pp.161–175.

F,M.M.,and L,Y.,1998,A genetic-algorithms based evolutionary computational

neural network for modeling spatial interaction data.Annals of Regional Science,

32,437–458.

F,M.M.,and S,P.,1999,Optimization in an error backpropagation neural

network environment with a performance test on a pattern classiﬁcation problem.

Geographical Analysis,31,89–108.

F,R.W.,and L,B.G.,1994,Assessing the classiﬁcation accuracy of multisource

remote sensing data.Remote Sensing of the Environment,47,362–368.

F,G.M.,MC,M.B.,and Y,W.B.,1995,Classiﬁcation of remotely sensed

data by an artiﬁcial neural network:issues relating to training data characteristics.

Photogrammetric Engineering and Remote Sensing,61,391–401.

F,M.A.,and B,C.E.,1997,Decision tree classiﬁcation of landcover from

remotely sensed data.International Journal of Remote Sensing,18,711–725.

F,K.,1990,Introduction to Statistical Pattern Recognition (San Diego,California:

Academic Press).

G,M.,2000,On the application of inductive machine learning tools to geographical

analysis.Geographical Analysis,32,113–139.

G,M.,G,G.,and W,G.,1999,Some solutions to neural network conﬁgura-

tion problems for the classiﬁcation of complex geographic datasets.Geographical

Systems,6,3–22.

G,M.,H,M.,R,T.-M.,and W,M.,2001,The Integration

of Geographic Visualization with Databases,Data Mining,Knowledge Construction

and Geocomputation.Cartography and Geographic Information Science,28,29–44.

G,B.R.,1996,Transforming Rules and Trees into Comprehensive Knowledge

Structures.In:Advances in Knowledge Discovery and Data Mining,edited by U.Fayyad,

G.Piatetsky-Shapiro,P.Smyth and R.Uthurusamy (Cambridge,MA:AAAI/MIT

Press),pp.205–228.

G,A.,and B,B.,1978,Models of Spatial Processes (Cambridge,UK:Cambridge

University Press).

G,J.,G,V.,R,R.,and L,W.-Y.,1999,BOAT—Optimistic

decision tree construction.Proc.SIGMOD 1999 (New York:ACM Press),pp.169–180.

G,G.,and G,M.,1996,Neural network architectures for the classiﬁcation of

temporal image sequences.Computers and Geosciences,22 (9),969–979.

G,C.,M,D.,P,D.,and S,P.,1996,Statistical inference and

data mining.Communications of the ACM,39,35–41.

G,A.F.H.,and C,B.,1996,Hyperspectral imaging of the earth:remote analytical

chemistry in an uncorrelated environment.Field Analytical Chemistry and Technology,

1,67–76.

Gould,P.R.,1970,Is Statistix Inferens the geographical name for a wild goose?Economic

Geography,46,539–548.

Gould,P.R.,1999,Becoming a Geographer (New York:Syracuse University Press).

H,R.P.,1990,Spatial Data Analysis in the Social and Environmental Sciences

(Cambridge:Cambridge University Press).

H,J.,1995,Applied Multilevel Analysis (TT-Publikaties:Amsterdam).

H,E.B.,M,J.,and S,P.J.,1966,Experiments in Induction (New York,USA:

Academic Press).

I,E.H.,and S,R.M.,1989,An Introduction to Applied Geostatistics (New

York:Oxford University Press).

J,L.,and L,D.,1998,Supervised classiﬁcation in high dimensional space:

geometrical,statistical and asymptotical properties of multivariate data.IEEE

Transactions on System,Man and Cybernetics,28C,39–54.

J,G.D.,M,W.L.,P,G.P.,and T,C.,1999,Multi-resolution fragmenta-

tion proﬁles for assessing hierarchically structured landscape patterns.Ecological

Modelling,116,293–301.

J,G.G.,C-H,R.,G,W.E.,L,H.,and L,T.C.,1988,

Introduction to the Theory and Practice of Econometrics (New York:John Wiley

and Sons).

K,G.,and A,D.W.,1986,Sampling rare populations.Journal of the Royal

Statistical Society (A),149 (1),65–82.

K,R.,and B,A.,1999,What shall we do with the data we are expecting

from upcoming earth observation satellites?Journal of Computational and Graphical

Statistics,8,575–588.

K,I.,and W,G.,1997,Strategies and best practice for neural network

image classiﬁcation.International Journal of Remote Sensing,18,711–725.

K,T.,1997,Self-organizing maps (Berlin:Springer-Verlag).

K,K.,H,J.,and A,J.,1999,Mining knowledge in geographic data.

Communications of the Association for Computing Machinery.

URL:http://db.cs.stu.ca/sections/publication/kdd/kdd.html.

K,I.G.G.,and DL,J.,1998,Introducing Multilevel Modeling (London:Sage).

K,D.G.,1951,A statistical approach to some basic mine valuation problems on the

Witwatersrand.Journal of the Chemical,Metallurgical and Mining Society of South

Africa,52,119–139.

K,T.S.,1962,The structure of scientiﬁc revolutions (Chicago:University of Chicago Press).

K,M.,1999,Spatial scan statistics:models,calculations,and applications.In Scan

Statistics and Applications,edited by J.B.Glaz (Boston:Boston Press),pp.303–322.

L,S.,1998,Visualising neural network training in geographic space.In Proceedings of

3rd International Conference on GeoComputation,University of Bristol,United Kingdom,

17–19 September 1998,URL:http://www.geocomputation.org/1998/48/gc_48.htm.

L,D.,1999,Information extraction principles and methods for multispectral and

hyperspectral image data.In Information Processing for Remote Sensing,edited by

C.H.Chen (River Edge,NJ,USA:World Scientiﬁc),pp.3–38.

L,A.B.,2001,Statistical Methods in Spatial Epidemiology (London:John Wiley and

Sons).

L,B.G.,and R,K.,1991,Decision tree and rule induction approach to integ-

ration of remotely sensed and GIS data in mapping vegetation in disturbed or hilly

environments.Environmental Management,15,823–831.

Lissoir,S.,and Rasson,J.-P.,1998,Symbolic kernel discriminant analysis.In Advances in

Data Science and Classiﬁcation,edited by A.Rizzi,M.Vichi and H.H.Bock (Berlin:

Springer-Verlag),pp.417–423.

L,P.,and B,M.(editors),1996,Spatial Analysis:Modelling in a GIS Environment

(New York:John Wiley & Sons).

L,G.F.,and S,W.A.,1998,Artiﬁcial Intelligence:structures and strategies

for complex problem solving (Reading,MA:Addison-Wesley).

M,K.V.,K,T.,and B,J.M.,1979,Multivariate Analysis (London:Academic

Press).

M,J.C.,1991,Introduction to Languages and the Theory of Computation (New York:

McGraw Hill).

M,G.,1963,Principles of geostatistics.Economic Geology,58,1246–1266.

MG,K.,and M,B.,1995,FRAGSTATS:Spatial pattern analysis program for

quantifying landscape structure.General Technical Report PNW-GTR-351 Portland,

OR,US Department of Agriculture,Forest Service,Paciﬁc Northwest Research

Station.

M,I.V.,H,J.,M,R.S.,and T,P.(editors),1993,Categories

and Concepts:theoretical views and inductive data analysis (New York:Academic Press).

M,H.,and H,J.(editors),2001,Knowledge Discovery with Geographic Information

(London:Taylor and Francis).

M,T.M.,1997,Machine Learning (New York:McGraw Hill).

Moller,M.F.,1993,A scaled conjugate gradient algorithm for fast supervised learning.

Neural Networks,6,525–533.

M,P.,1948,The interpretation of statistical maps.Journal of the Royal Statistical Society

B,10,243–251.

M,B.M.E.,and S,H.D.,1991,Algorithms from P to NP (Redwood,CA:

Benjamin-Cummings).

N,M.,1996,An applied statistician’s creed.Applied Statistics,45,401–410.

O,S.,1991,A spatial analysis research agenda.In Handling Geographic Information,

edited by I.Masser and M.Blakemore (London:Longman),pp.18–37.

O,S.,1993,Modelling spatial interaction using a neural net.In Geographic Information

Systems,Spatial Modelling and Policy Evaluation,edited by M.M.Fischer and

P.Nijkamp (London:Springer-Verlag),pp.147–164.

O,S.,1994,Exploratory space-time-attribute pattern analysers.In Spatial Analysis

and GIS,edited by S.Fotheringham and P.Rogerson (London:Taylor and Francis).

O,S.,2000,GeoComputation.In GeoComputation,edited by S.Openshaw and

A.J.Abrahart (London:Taylor and Francis),pp.1–31.

O,S.,C,A.,and C,M.,1990,Building a prototype geographical

correlates exploration machine.International Journal of Geographical Information

Systems,4,297–311.

O,S.,and A,B.(editors),2000,GeoComputation (London:Taylor and

Francis).

O,S.,and O,C.,1997,Artiﬁcial Intelligence in Geography (Chichester,UK:

John Wiley and Sons).

O,S.,T,A.,T,I.,MG,J.,and B,C.,1999,Testing space-

time and more complex hyperspace geographical analysis tools.In GIS Research UK

’99 (Southampton,UK:University of Southampton),pp.89–102.

O,J.K.,1975,Estimation methods for models of spatial interaction.Journal of the American

Statistical Association,70,120–126.

P,J.D.,and S,R.A.,1995,A detailed comparison of backpropagation

neural networks and maximum-likelihood classiﬁers for urban land use classiﬁcation.

IEEE Transactions on Geosciences and Remote Sensing,33,981–996.

P,C.S.,1878,Deduction,induction and hypothesis.Popular Science Monthly,13,470–482.

P,K.R.,1959,The Logic of Scientiﬁc Discovery (New York:Harper and Row).

P,S.J.,1982,Applied Multivariate Analysis,including Bayesian and Frequentist Methods

of Inference (Malabar,Florida:Krieger Publishing Co).

Q,R.,1993,C4.5:Programs for Machine Learning (San Mateo,CA:Morgan Kaufmann).

R,B.D.,1996,Pattern Recognition and Neural Networks (Cambridge,UK:Cambridge

University Press).

S,C.,1993,Selecting a classiﬁcation method by cross validation.Machine Learning,

13,135–143.

S,D.,1992,Multivariate Density Estimation (London:John Wiley and Sons).

Simoudis,E.,Livezey,B.,and Kerber,R.,1996,Integrating inductive and deductive reasoning

for data mining.In Advances in Knowledge Discovery and Data Mining,edited by

U.Fayyad,G.Piatetsky-Shapiro,P.Smyth and R.Uthurusamy (Cambridge,Mass.:

AAAI/MIT Press),pp.353–374.

S,R.,1990,Extreme value theory.Handbook of Applicable Mathematics (supplement)

(New York:John Wiley and Sons).

S,P.,2000,Data mining:Data analysis on a grand scale?Statistical Methods in Medical

Research,September 2000.

S,M.,H,V.,and B,R.,1993,Image Processing,Analysis and Machine Vision

(London,UK:Chapman and Hall).

S,D.F.,1991,A general regression neural network.IEEE Transactions on Neural

Networks,2,568–576.

S,S.V.,1997,Selecting and interpreting measures of thematic classiﬁcation accuracy.

Remote Sensing of the Environment,62,77–89.

Stewart,B.S.,Liaw,C.F.,and White,C.C.,1994,A bibliography of heuristic search

through 1992.IEEE Transactions on Systems,Man and Cybernetics,24,268–293.

T,S.K.,1992,Sampling (New York:John Wiley and Sons).

T,W.,1970,A computer movie simulating urban growth in the Detroit region.Economic

Geography,46,234–240.

U,P.E.,1986,Machine Learning of Inductive Bias (Boston,MA:Kluwer Academic Press).

W,R.E.,and M,R.H.,1989,Probability and Statistics for Scientists and Engineers

(4th Edition) (New York:Macmillan).

W,M.P.,and J,M.C.,1995,Kernel Smoothing (London:Chapman and Hall).

W,C.,1993,Sense and Nonsense of Statistical Inference (New York:Dekker).

Y,T.,and O,S.,1994,Neural network approaches to landcover mapping.IEEE

Transactions on Geosciences and Remote Sensing,32,1103–1109.
