UNIVERSIDADE NOVA DE LISBOA
INSTITUTO SUPERIOR DE ESTATÍSTICA E GESTÃO DE INFORMAÇÃO
Geospatial Data Mining
Fernando Lucas Bação
CONTENTS
1 INTRODUCTION
2 THE FRAMEWORK OF GEOSPATIAL DATA MINING
1.1. The Present Situation
1.2. New Approaches to the Theoretical Constriction
3 THE DIFFICULTIES OF A QUANTITATIVE GEOGRAPHY AND SPATIAL ANALYSIS
4 MODEL TYPES
5 AN EXPERIMENTAL APPROACH TO GEOSPATIAL DATA MINING
6 GEOREFERENCED DATA AND ITS UNIQUE FEATURES
1 INTRODUCTION
We should like to emphasize, forthwith, the importance of the bibliographical
references to an understanding of the subject dealt with here. Though they may
not be indispensable, they are an invaluable source for those who seek a better
understanding of some of the topics that time and space do not allow us to
analyze in greater detail here.
This text introduces the issues of Geospatial Data Mining (GSDM) and places
it within the broader setting of GISc (Geographical Information Science).
The term GSDM is adopted to emphasize that what is being dealt with here is
the application of data mining tools to the specific field of geographical data, i.e.
data referenced to a position on the earth’s surface.
The term ‘geospatial’ is used to underline the importance of space defined
as the relationship between a set of objects. Contrary to what might be expected,
data mining based on georeferenced data has different features from that
carried out on business data. Though the two coincide to a considerable
extent, there are also certain differences that, though restricted in number, are
very important and cannot be neglected. Thus we not only use the expression
Geospatial Data Mining to indicate the origin of the data but also to stress the
fact that there are important differences.
We shall begin the text by providing a framework for GSDM in which we try
to trace out the present situation of GISc (Geographical Information Science),
stressing the aspects that lead us to think that GSDM is of importance to the
subject. This task also helps us to reach a more concrete definition of the
GSDM concept itself or, rather, in presenting the difficulties facing GISc in
spatial analysis terms, we deduce some of the features that GSDM should,
desirably, present.
Topic three provides a more detailed treatment of the theoretical question
connected with the object of study and attempts to give the main reasons
contributing to the difficulty of creating a quantitative geography (perhaps we
can use the term ‘scientific’ rather than ‘quantitative’). Another issue broached
concerns the definition of spatial analysis, a particularly significant aspect in that
the principal aim of the tools we seek to present in this course is to solve the
problems of spatial analysis.
The objective of the fourth topic is to draw attention to the different types of
models that science normally employs. We try to show how they differ and
particularly how and why new models incorporating most DM tools can support
GISc in developing new tools and new knowledge.
With the fifth topic, An Experimental Approach to GSDM, we attempt to
establish certain differences in the ways that DM tools can be used in GISc.
These differences are based, fundamentally, on a purely inductive vision and a
vision in which inductive tools complement the deductive effort that is more
general in nature and more capable of providing theories and laws.
The sixth topic deals with the question of what is special in spatial data. It is
extremely important to understand that spatial data displays features that
distinguish it from other types of data. These differences have serious
implications, creating constrictions at the analysis level. It is precisely these
differences that have triggered the appearance of GSDM.
In brief, we could say that this module covers essential issues relating to
GISc itself and spatial analysis, along with their practical consequences. The
issues we refer to are: What is the object of study of GISc? Can laws or
theories be constructed around spatial distributions? Are the inductive
approaches provided by DM tools enough to overcome the absence of theory?
In our view it is important to stress that if we consider space and distance
relevant variables, then it is indispensable for this to become evident when we
apply our methods of analysis. If we are convinced that spatial distribution plays
an important role in the phenomena we are studying, then we have to find a
way of explicitly including information of a geospatial nature in our analyses.
2 THE FRAMEWORK OF GEOSPATIAL DATA MINING
When we speak of the relationship between geography and data mining
(Fayyad et al., 1996) it is easy to detect an enormous paradox. On the one
hand, a large part of the GISc (Geographical Information Science) community at
present entertains great expectations of DM (data mining) as the means of
increasing the quality and sophistication of spatial analysis. On the other hand,
we can argue that geography, and more precisely cartography, have long been
doing what today is termed as DM.
In fact, if we accept that the main function of DM is to improve the interfaces
between data storage systems and human beings, allowing the exploration,
summary and modelling of large databases (Fayyad, 1998), then geography
has been data mining for centuries. An example of this is the following map by
Charles Minard (figure 1):
Figure 1 – Map of Napoleon’s Russian campaign (adapted from Burch and Grudnitski,
1989)
This masterpiece of the cartographer’s art illustrates Napoleon’s campaign
in Russia. We consider that it provides a good picture of the capacities of maps
to summarize information and to act as an interface between human beings and
large databases.
Drawn up by Charles Minard in 1869, the map provides a clear and simple
illustration of the disaster that Napoleon’s Russian campaign in 1812 turned out
to be, unambiguously revealing the devastating losses that France suffered
during the campaign. The map shows the route taken by the column of troops
as it advanced on Russia (the light colour) and retreated (in black) and the size
of the column in terms of men (indicated by the width of the line). The line of
retreat is combined with a temperature scale showing the temperatures these
men had to endure when retreating. The map also shows the level of casualties
whenever there was a river to cross or a battle, as for example with the crossing
of the River Berezina during the retreat. The army finally managed to reach
Poland with the catastrophic number of 10 000 men, out of the 422 000 who
had left. As can be observed, this map gives a highly detailed picture where we
can follow the development of five variables: the size of the army, its position on
a two-dimensional surface, the direction the military column was moving in and
the temperature on various dates during the retreat from Moscow.
Another good example of this argument is the work of Dr John Snow at the
time of the great cholera epidemic in London in 1854. This is perhaps the most
fantastic discovery that GIS has produced, paradoxically long before computers
existed.
Figure 2 – Map indicating cholera victims and wells supplying water in London (adapted
from Dodson 1992)
This London physician suspected that the agent provoking the disease was
being transmitted through the water being used in London. He devised quite a
simple method to test his hypothesis. First of all he made a plan of the zone in
London where the epidemic was concentrated, showing the houses and the
streets. Then he started to identify the victims’ place of residence from a spatial
perspective, marking each victim’s address with a dot. When a plan of the wells
supplying the water was placed over this map (figure 2), it was quickly
concluded that not only was his hypothesis correct but that a certain well was
responsible for spreading the disease.
The second part of the paradox that we mentioned initially concerns the fact
that it has long been noted that there is a discrepancy between the capabilities
for storage, management and access and the analytical tools provided by
Geographical Information Systems (Maguire, 1991). In fact, the sophistication
offered by developments in computer science with regard to geographical data
management and access has not been reflected in spatial analysis
(Aangeenbrug, 1991; Openshaw, 1993). That is why we have GIS with the
capability to store and manage great quantities of georeferenced data but do
not possess the tools that allow us to transform this data into information and
this information into knowledge (Openshaw, 1991).
On the one hand the map or, more generally, the visualization tools that GIS
offers are a central tool in the domain of visualizing complex phenomena, in
particular those that are to be seen on the earth’s surface (for more on this
subject see MacEachren et al., 1994; Openshaw et al., 1994). It is to be hoped
that GISc continues to contribute to the expansion of DM visualization
methodologies.
On the other hand, it appears essential for GISc to pay attention to the new
DM tools emerging, which are capable of making a decisive contribution to a
new era in spatial analysis (Openshaw, 1999; Miller et al., 2001; Yuan et al.,
2001; Gahegan, 2003). This new era should be characterized by a greater
ability to predict the development of the phenomena under study.
1.1. The Present Situation
One of the most notable characteristics of GISc today is the explosion of
georeferenced data produced by recent IT developments (Openshaw, 1991;
Miller et al., 2001). The technology for gathering information with geographical
references, which ranges from digital cartography to remote sensing and LBS
(location-based services), has been inundating the databases. This fact
demonstrates the importance of developing tools capable of effectively dealing
with large quantities of georeferenced data.
Without doubt the visualization tools now available represent an enormous
plus for the exploration of this data but they are still inadequate. The complex
and multidimensional nature of today’s databases creates new challenges for
which visualization does not yet have effective solutions. At the same time,
visualization is always an assisted process which has great difficulty in allowing
anything more than an exploratory analysis of the data.
As we have already mentioned, GIS today permits great quantities of data to
be stored and nowadays it is quite simple to manage and access it using highly
sophisticated tools. If we add to this the above developments in the technology
for gathering geographical information, we arrive at the present situation in
which we probably have the data to respond to many urgent issues (social and
environmental) but only buffers and overlays to analyze them.
The use and importance of operations like buffers and overlays are not in
question but these analytical tools date from an earlier time than the
Geographical Information Systems themselves. Today we need tools of a
different class, able to handle the highly varied nature of georeferenced data
and to explore, relate and forecast.
To attain this objective there are two routes available. On the one hand the
tools can emerge as a consequence of the theoretical bases of the subject; on
the other (if these theoretical bases do not exist) we can resort to inductive
approaches that allow the practical problems to be dealt with and possibly help
to boost the theory.
Whatever the case, it is always preferable – as it is more robust and
scientifically more correct – that there are laws, knowledge and theory on the
object of study or, in the specific case of GISc, on spatial and spatial-temporal
distributions and patterns. Yet that is precisely what is missing.
The question can be reformulated, declaring that in GISc terms we are living
in an environment that is “rich in data and poor in theory” (Openshaw, 1991).
Today, taking the theoretical difficulties into consideration, DM offers GISc an
excellent opportunity to raise spatial analysis to new levels of sophistication by
promoting more effective exploration of the databases available.
1.2. New Approaches to the Theoretical Constriction
Inductive, data-driven approaches applied to modelling and spatial analysis
may be a way of facilitating the creation of new knowledge and contributing to
the process of scientific discovery. DM tools that have been developed may
play a central role, helping the process of exploring great quantities of data in
the search for recurring patterns and persistent relationships.
It is important to stress that DM is not to be viewed as a panacea for all the
problems of data analysis and, more specifically, spatial data analysis.
Deterministic models and well-grounded theory on the nature of phenomena are
always preferable to DM. The belief that DM will work in any case,
independently of the data, can only damage the development of this area of
knowledge.
When the use of DM methodologies in GISc is assessed, two essential
aspects should be taken into consideration. The first, already presented above,
is connected with the scarcity of theoretical results and models, which makes
GISc a suitable candidate for the benefits of DM tools.
Secondly, the fact that spatial information possesses special characteristics
should be taken into account (Anselin, 1989), as we shall see below in greater
detail. As a preliminary conclusion we can say that DM is capable of becoming
a very useful tool in spatial analysis, provided that the special aspects of the
data are properly treated and safeguarded.
When the specificity of spatial data is taken into account, it must not be
forgotten that the software packages available were specifically developed for
the purpose of analysing large databases, with a view to modelling and
forecasting consumer behaviour. Given their origin, it is not difficult to
understand that the specific needs of spatial analysis are not addressed. This
situation has led certain authors (Openshaw, 1991) to argue that specific
software needs to be developed for exploring georeferenced data.
In spite of the limitations imposed by spatial data, it may not be inevitable
that we have to wait for a new Geospatial Data Mining technology to be
developed before we can take advantage of these new tools. With alterations to
the process, particularly greater emphasis on the preprocessing of data, we will
probably be able to use existing software packages to produce new forms of
spatial analysis. At the present moment adapting the process underlying DM
(Pyle, 1999) to allow appropriate, reliable exploration of spatial data seems
more useful than waiting for specific tools for Geospatial Data Mining.
Geospatial Data Mining should be understood as a special type of DM that
seeks to carry out generic functions similar to those of conventional DM, though
modified to safeguard the special aspects of geoinformation. However, the
element of real importance in Geospatial Data Mining is its object of study,
namely spatial distributions and spatial-temporal patterns: it is this that confers
on it a unique view of reality and the unique features that it possesses (Abler et al.,
1977).
3 THE DIFFICULTIES OF A QUANTITATIVE GEOGRAPHY AND SPATIAL ANALYSIS
In this topic we attempt to underline some of the theoretical aspects that can
facilitate an understanding of GSDM’s specificities and its potential for resolving
certain theoretical constrictions. It is also important to understand what is meant
by spatial analysis insofar as it is the area in which the tools we present later
are applied.
Unsurprisingly, spatial analysis has various definitions. From a more
traditional and restrictive perspective, we can define spatial data analysis as
"the statistical study of phenomena that manifest themselves in space" (Anselin,
1993). The definitions chosen by Openshaw (1993) are more global and not so
useful in practical terms, though broader and richer in theoretical terms: “Spatial
analysis is usually defined as the description and analysis of the 0, 1, 2, 2 1/2
and 3 dimensional features found on maps (i.e. points, lines, areas, surfaces). It
also includes the analysis and modelling of all map-related or map relateable
information” or, again, “Another definition would be that any analysis, modelling,
and policy application in which space or location is important, or which makes
use of spatial information in the broadest possible sense, is also spatial
analysis”. Lastly, the following definition can also be considered: “a general
ability to manipulate spatial data into different forms and extract additional
meaning as a result” (Bailey, 1994).
Independently of more philosophical reflections, which do not fall within the
scope of this text, we can quite easily draw two conclusions from the group of
definitions presented. First of all, it is not hard to ascertain that there is no
consensus on the question “What is spatial analysis?” and that defining it is
extremely difficult, a situation that results in definitions that are too general and
not very objective.
The second conclusion is that though these definitions are quite different
there is a common thread running through them all, that is, the inherent concern
with stressing space as the central factor of the definition. For the rest, all the
definitions put forward are rooted in a very personal vision of what spatial
analysis or spatial data analysis should be, as we shall see below.
This second conclusion gives rise to a third, one of extreme importance
which clearly indicates that location, area, topology, spatial arrangement,
distance and position represent the centre or nucleus of research, i.e. of spatial
analysis. This is implicitly expressed in the First Law of Geography,
"everything is related to everything else, but near things are more related than
distant things" (Tobler, quoted by Anselin 1993). In fact, if there is anything
useful in spatial analysis definitions, it is certainly the recognition of space as a
source of explanation for the patterns presented by the different phenomena
within it.
It is easy to understand that the law just stated results in a disappointing
conclusion: to a great extent, it is not possible to use the traditional methods of
classical statistics to analyze spatial data, given the important role that location
plays in understanding phenomena observed in space.
Accordingly, if we consider that, as a rule, observations will tend to be
spatially clustered or, in other words, geographical data samples will not be
independent, then we are faced with a conflict in that, in statistical terms,
observations are usually assumed to be independent and identically distributed.
This spatial effect is termed spatial dependence. When present in spatial data
this dependence is normally referred to as spatial autocorrelation.
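Spatial dependence can be made concrete with a small numerical sketch. The example below uses Moran's I, a widely used autocorrelation statistic that the text itself does not name, on a hypothetical 4 x 4 grid with rook (edge-sharing) neighbours and equal weights. The grid values, the neighbour rule and the function names are all illustrative assumptions.

```python
def rook_neighbours(n_rows, n_cols):
    """Map each cell index to the indices of the cells sharing an edge with it."""
    neighbours = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbours[r * n_cols + c] = adjacent
    return neighbours


def morans_i(values, neighbours):
    """Moran's I with binary weights (1 for neighbours, 0 otherwise)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(len(adj) for adj in neighbours.values())
    # Sum of products of deviations over every ordered neighbour pair.
    cross = sum(dev[i] * dev[j] for i, adj in neighbours.items() for j in adj)
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * cross / variance_sum


# Hypothetical data: a clustered pattern and a checkerboard pattern.
GRID = rook_neighbours(4, 4)
CLUSTERED = [1, 1, 0, 0] * 4
CHECKERBOARD = [(r + c) % 2 for r in range(4) for c in range(4)]

print("clustered:   ", round(morans_i(CLUSTERED, GRID), 3))
print("checkerboard:", round(morans_i(CHECKERBOARD, GRID), 3))
```

A positive value indicates that neighbouring cells tend to hold similar values, which is the situation the text describes for geographical data; a negative value indicates that neighbours tend to differ.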
There is, however, a second spatial effect, emerging from the importance of
location in this type of data. It is related to spatial differentiation, the result of the
intrinsic originality of each place, and can be termed spatial heterogeneity.
There are two other factors that, to a great extent, rule out using the
methodologies of classical statistics. The first relates to the fact that the central
concern of the methods to use should be space, the location. As is well known,
this does not count among the relevant variables in statistical processes. “Many
kinds of analysis look only at features’ attributes without making explicit
reference to location, and in this class we would have to include the vast
majority of standard statistical techniques, ranging from simple tests of means
and analysis of variance to multiple regression and factor analysis” (Goodchild,
1986). Moreover, quite comprehensibly, existing statistical methods were not
designed with the specific features of space as an explanatory variable in mind.
Indeed, Openshaw (1993) refers to this in the following way: “These (aspatial
statistical methods) are not usually spatial analysis relevant methods because
they treat spatial information as if it were equivalent to survey data, and
generally totally fail to both handle any of the key features that make spatial
data special (i.e. spatial dependency, non sample nature, modifiable areal units
effects, noise and multivariate complexity) or provide results that are sensitive
to the nature and needs of geographical analysis”.
The second factor involves the fact that today more than ever, as far as
GISc is concerned, there is no valid reason to test the significance of particular
indicators or try to construct “significant” samples. In the field of geography
significance tests are of little use, as Gould (quoted by Goodchild, 1986) states
in the case of spatial autocorrelation: “Why we should expect independence in
spatial observations which are of the slightest intellectual interest or importance
in geographic research I cannot imagine. All our efforts to understand spatial
pattern, structure and process have indicated that it is precisely the lack of
independence – the interdependence – of spatial phenomena that allows us to
substitute pattern, and therefore predictability and order, for chaos and apparent
lack of interdependence of things in time and space”. Goodchild (1986)
emphasizes this idea, stating that: “It is impossible for a geographer to imagine
a world in which spatial autocorrelation could be absent: there could be no
regions of any kind, since the variation of all phenomena would have to occur
independently of location, and places in the same neighbourhood would be as
different as places a continent apart”.
We can extend this argument beyond the concept of spatial autocorrelation,
taking into consideration that if concepts exist and are accepted by the majority,
as such it is not worthwhile testing them. In fact, it makes no sense to test
concepts like the distance effect or spatial association. In the absence of any
other reason, we can argue that the null hypothesis (the absence of spatial
autocorrelation or the friction of distance effect), necessary to proceed with the
test, makes no sense in geographical terms. As Openshaw (1994) argues: “So
why not operationalize this geographical concept as a concept rather than as a
precise, assumption-dependent statistical test of a hypothesis? The statistical
test approach requires a much greater input of knowledge than is contained in
the original theoretical notion”.
Moreover, in the case of GISc, the universe under study can generally be
explored to its full extent without the need to create “statistically
significant samples” (a concept that does not exist in geography). Essentially,
this has resulted from the dizzying advance of computing, which has also
provided statistics with new ways, e.g. bootstrapping, of approaching old
problems such as sampling.
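As an illustration of the bootstrapping idea mentioned above, the sketch below resamples a small data set with replacement and uses the spread of the resampled means as an interval estimate. The sample values, the number of resamples and the function names are hypothetical, and the percentile interval shown is only one of several bootstrap variants.

```python
import random
import statistics

# Hypothetical observations (e.g. a measurement taken at ten locations).
SAMPLE = [12.1, 9.8, 11.4, 10.9, 13.2, 9.5, 10.2, 12.7, 11.1, 10.6]


def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the means of n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Return the values found at the given lower and upper quantile positions."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


means = bootstrap_means(SAMPLE)
low, high = percentile_interval(means)
print(f"sample mean: {statistics.fmean(SAMPLE):.2f}")
print(f"approximate 95% bootstrap interval: [{low:.2f}, {high:.2f}]")
```

The point of the sketch is that no distributional formula is needed: computing power replaces the analytical sampling theory. Note that this plain resampling still treats observations as independent, so it does not by itself address the spatial dependence discussed earlier.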
Forecasting methods may be the point where DM has its most obvious
impact. In fact, with regard to the use of traditional statistical methods, there are
upstream problems that are still a long way from being solved and that appear
at present to be much more pressing and more reliably resolvable. Openshaw
(1993) acknowledges this very fact: “Indeed knowing how to develop new and
more relevant methods looks like being too difficult for current technology
particularly when it has to be performed in a rigorous fashion within a classical
statistical framework ... Maybe the statistical problems are just too hard to be
resolved at present, so why not change the nature of the problem to make it
easier to solve?”
Over the years concern with the use of less restrictive methodologies, where
DM fits in, has come to constitute the research object of certain GISc
researchers. This has touched on areas as different as neural networks
(Mitchell, 1997; Bishop, 1995) and genetic algorithms (Mitchell, 1997), statistical
measures for spatial correlation (Goodchild, 1986), exploratory data analysis
(Anselin, 1993; Openshaw, 1993) and scientific visualization (MacEachren
1994). This type of research has produced a range of results and is registering
ever increasing recognition of its advantages, which represents an important
step in raising the visibility of the field both in funding terms and in terms of its
curricular importance within GISc.
4 MODEL TYPES
The aim of this topic is to set out the operating framework of DM and, more
specifically, its framework within the scope of GISc. In connection with what has
already been discussed, it is worth recalling the emphasis placed on the need to
adopt analytical tools that allow a less restrictive approach (with fewer
assumptions about data distribution) in order to “get round” the particular
difficulties imposed by the spatial data.
In the course of this topic we shall try to go a little further and produce a
framework for DM within the scope of other model types that science employs.
We consider this discussion important in that it will allow a clearer
understanding of the way DM can fit in with GISc so as to encourage the
theoretical development that is indispensable to any science. In order to explain
some of the concepts relevant for this discussion we will make use of a model
classification proposed by Kennedy et al. (1998) and their illustrative examples.
Accordingly, we begin by proposing to divide the models used in science
into three fundamental types:
Deterministic models;
Parametric models;
Nonparametric models;
As with other tasks carried out in computing, the modelling requires a
“program” that gives detailed instructions on the way the process develops.
These instructions are typically mathematical equations that characterize the
relationship between input and output. It is in the formulation of these equations
that the central problem of the modelling lies.
The best way of modelling consists of formulating “closed” equations that
deterministically define the way the outputs are obtained from the inputs. As all
the features are constant we refer to them as deterministic models. This type of
model is suitable for dealing with problems that are simple and perfectly
understood.
An example of this type of model is the calculation of how long a stone takes
to fall. In fact, we possess sufficient knowledge to describe this relationship and
all we need to know is the height from which the stone begins to fall:
$$ t = \sqrt{\frac{2h}{9.8}} $$
where 9.8 is normally represented as g, the acceleration constant of gravity,
and h is the height in metres.
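The deterministic model above translates directly into a few lines of code. This is a minimal sketch; the function name and the constant name are assumptions made for illustration.

```python
import math

G_EARTH = 9.8  # acceleration of gravity in m/s^2, the value used in the text


def fall_time(height_m, g=G_EARTH):
    """Return the time in seconds a stone takes to fall height_m metres."""
    if height_m < 0 or g <= 0:
        raise ValueError("height must be non-negative and g must be positive")
    return math.sqrt(2.0 * height_m / g)


for h in (0.5, 4.0, 9.7):
    print(f"h = {h:4.1f} m -> t = {fall_time(h):.2f} s")
```

Nothing is estimated from data here: once the height is given, the output is fully determined by the equation.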
The conceptual elegance and richness of this type of model inspired some
authors to try to adopt models from physics to explain social phenomena, in a
stream of thought that became known as “social physics” (Johnston, 1986).
Geography was no exception and also presented theoretical, geometry-based
developments as a work tool (the most widely known probably being the
gravitational model, which theorizes on applying the recognized physics model
to spatial phenomena).
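The text does not give a formula for the gravitational model. As a hedged sketch, the code below uses a common textbook form in which the interaction between two places is proportional to the product of their "masses" (for example populations) and inversely proportional to a power of the distance between them. The constant k, the exponent beta, the place values and the function name are all assumptions for illustration.

```python
def gravity_interaction(mass_i, mass_j, distance, k=1.0, beta=2.0):
    """Interaction between places i and j: k * mass_i * mass_j / distance**beta."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return k * mass_i * mass_j / distance ** beta


# Hypothetical populations (in thousands) and distances (in km).
near = gravity_interaction(100, 50, distance=10)
far = gravity_interaction(100, 50, distance=20)
print(f"interaction at 10 km: {near:.1f}")
print(f"interaction at 20 km: {far:.1f}")
```

With beta = 2, doubling the distance divides the predicted interaction by four, mirroring the inverse-square behaviour of the physical law from which the model borrows its name.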
These theoretical and conceptually rich models represented one of the faces
of the Quantitative Revolution in Geography (Johnston, 1986; Bird, 1993). Their
main difficulty, and the reason for abandoning them, lay in the difficulty of
applying the concepts to the discipline’s practical problems. The language of
geometry is so rigid that it created practical difficulties and prevented the
practical applications from drawing on the benefits of these developments.
It is unfortunate that, especially in the humanities and social sciences, most
problems do not lend themselves to such a simple description or are not so well
known as those that characterize Newtonian physics (Macmillan, 1997).
(Nowadays, apparently, we also know that the object of study in physics is
actually not so linear and explicable as was once believed). The lack of “formal
knowledge” for most geographical problems drastically restricts the use of this
type of model.
Now is the time to think of a slightly different problem in order to illustrate the
way a parametric model functions. The problem is to ascertain how long a stone
takes to fall, though on a different planet for which we do not know the constant
acceleration of gravity g. The relationship is expressed as follows:
$$ t = \sqrt{\frac{2h}{g}} $$
Now we are confronted with an estimation problem. Though we have a good
idea of the way the inputs and outputs are related, we lack the degree of
certainty necessary in a deterministic model. Thus, in this case, we need to
provide a value for g. One way of dealing with the matter may be to carry out an
experiment. If we drop the stone from different heights, timing how long it takes
to fall, the experiment will result in a table like this:
Height (m)    Time to fall (s)
0.5           0.19
1.3           0.40
2.8           0.46
4.0           0.68
7.3           0.70
9.7           1.04
Table 1 – Experiments on the time a stone takes to fall on Planet xpto
Here we have a table of examples defining the relationship between height
and falling time. An example is defined by an input vector and the desired
output. In this particular case the input vector consists of a single variable: the
height from which the stone falls. The output consists of the time the stone
takes to fall.
The next step involves selecting a suitable value for g so that the model can
produce estimates that are as close as possible to reality. In other words, we
want to define a value for g that, given an input value for the height, allows us to
estimate the falling time as accurately as possible. The following graph (figure
3) shows the measurements we made and the response the models provided
for g values of 9.8, 20 and 35:
[Plot: measured falling times and model curves; vertical axis from 0 to 1.6, horizontal axis from 0 to 11]
Figure 3 – Adapting the model: the different curves correspond to the different g values
In this case it is quite easy to assign a good value to g, though it is not
always so, especially with more complex problems involving complex
relationships and multiple variables. As we can see in the graph, the middle
line, where a value of 20 was assigned to the constant, seems to be the one
that best fits the description of the time a stone takes to fall on this particular
planet.
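The visual comparison of candidate values can also be carried out numerically. The sketch below takes the measurements from Table 1, computes the sum of squared errors for the three g values discussed (9.8, 20 and 35), and then estimates g by least squares. The error measure and the function names are assumptions; the text itself only compares the curves by eye.

```python
import math

# Measurements from Table 1: (height in m, time to fall in s).
MEASUREMENTS = [(0.5, 0.19), (1.3, 0.4), (2.8, 0.46),
                (4.0, 0.68), (7.3, 0.7), (9.7, 1.04)]


def predict_time(height_m, g):
    """Parametric model: t = sqrt(2h / g)."""
    return math.sqrt(2.0 * height_m / g)


def sum_squared_error(g, data=MEASUREMENTS):
    """Sum of squared differences between measured and predicted times."""
    return sum((t - predict_time(h, g)) ** 2 for h, t in data)


def estimate_g(data=MEASUREMENTS):
    """Least-squares estimate of g.

    Writing t = k * sqrt(h) with k = sqrt(2 / g) makes the model linear in k,
    so the best k is sum(t * sqrt(h)) / sum(h), and g follows as 2 / k**2.
    """
    k = sum(t * math.sqrt(h) for h, t in data) / sum(h for h, _ in data)
    return 2.0 / k ** 2


for g in (9.8, 20, 35):
    print(f"g = {g:>4}: sum of squared errors = {sum_squared_error(g):.4f}")
print(f"least-squares estimate of g: {estimate_g():.2f}")
```

Of the three candidates, g = 20 gives the smallest error, in line with the reading of the graph, and the fitted value lies close to it.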
We may also observe that, although the phenomenon is described quite
faithfully, discrepancies still exist between the estimated values and those
actually measured in the experiments (shown by the lines). These inaccuracies
may be the result of two fundamental causes: on the one hand, structural
deficiencies in the model and, on the other, measurement errors. Quite clearly,
the usefulness of the model always depends on the magnitude of such errors,
or, in other words, the difference between the values given by the model and
those actually measured.
The basic aspect of parametric models lies in the fact that explicit
mathematical equations characterize the structure of the relationship between
inputs and outputs, though there are some unspecified parameters. The latter
are chosen by means of an analysis of the examples available, i.e. they are
estimated on the basis of a sample.
Linear regression is one of the best known applications of parametric
modelling. The hypothesis on which this is based assumes a linear relationship
between inputs and outputs, with the following expression normally being used:
$$y = c_0 + c_1 x_1 + c_2 x_2 + c_3 x_3$$
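As an illustration of how the unspecified parameters of the linear expression above are estimated from a sample, the following sketch fits the coefficients by ordinary least squares, solving the normal equations with a small Gaussian elimination routine. The function names and the sample are hypothetical. The data are generated without noise from known coefficients so that the estimate can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                      # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear(xs, ys):
    """Least-squares estimates of c0..ck for y = c0 + c1*x1 + ... + ck*xk."""
    rows = [[1.0] + list(x) for x in xs]              # prepend the constant term
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic sample generated from y = 1 + 2*x1 - 1*x2 + 0.5*x3 (no noise).
xs = [(1, 0, 2), (0, 1, 4), (2, 1, 0), (3, 2, 2), (1, 3, 6), (4, 0, 2)]
ys = [1 + 2 * x1 - 1 * x2 + 0.5 * x3 for x1, x2, x3 in xs]
coeffs = fit_linear(xs, ys)  # estimates of [c0, c1, c2, c3]
```

The structure of the relationship (a weighted sum) is fixed in advance; only the weights are learned from the examples.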
Another fundamental aspect to be taken into account with parametric
modelling is that it demands a sound knowledge of the phenomenon that we
intend to model. In the first place, the fundamental aspects of the relationship
must be known or, rather, not only must we know what variables are relevant to
the process but also the way they influence it.
The geographers of the Quantitative Revolution used these methods in the
search for solutions to the practical problems confronting them, for which the
geometrical formalism of physics-inspired models could not provide an answer.
The solution lay in adopting the tools of econometrics and traditional statistics to
deal with spatial phenomena.
The problem, as already noted, is that statistical methods are not exactly
suitable for dealing with spatial phenomena and, furthermore, most researchers
did not use any kind of spatial reference for the phenomena studied. What
tended to happen in practice was that the analysis was begun based on a map,
statistical methods were applied to the socioeconomic data associated with the
geographical units and, finally, a map with the results was produced. The
problem with this methodology is that it in no way differs from what other social
sciences do and it would be difficult to expect to find empirical regularities
respecting the influence of space, in that this was not explicitly included in the
analyses.
This is the point where we enter the field of nonparametric models, which
can be defined as models that essentially depend on the use of data rather than
on specific knowledge of the domain of the problem. They are also usually
called data-driven models. This type of model has enjoyed great success,
especially in resolving complex problems, as it usually uses large data sets
containing large numbers of examples.
The essential premise of nonparametric methods is that relationships
occurring consistently in the data set will be repeated in future observations; we
can call this an inductive approach. Thus, by obtaining a large enough set of
examples and modelling their behaviour we can establish a model of arbitrary
complexity that will allow the repetition of the behaviour observed.
A significant benefit of this model type is that it does not require an in-depth
knowledge of the phenomenon being modelled, an aspect that is particularly
useful when addressing spatial problems. However, though it is theoretically
possible to solve any problem, irrespective of its complexity, certain problems
are still too complex. This is due to the data collection and processing time
limitations that exist for the building of a nonparametric model.
This is a particularly important aspect as it draws attention to the fact that
possessing the data and software is not enough, though the vendors of DM
solutions would often like us to think differently. A minimum knowledge of the
phenomenon being modelled is indispensable, particularly as regards what the
most important inputs are. We normally possess a certain amount of information
or even an intuition about what the most important variables are. So knowledge
of the phenomenon is fundamental if we are to “reduce the scope of the search”
for possible solutions.
The same aspect is also particularly interesting from the point of view of its
usefulness to spatial analysis, in that it is precisely at this moment (when we
choose the relevant variables for modelling a phenomenon) that we are
constructing the hypothesis. In the light of this highly experimental nature, we
can predict that DM methods are going to make important theoretical
contributions to GISc.
5 AN EXPERIMENTAL APPROACH TO GEOSPATIAL DATA MINING
The process of using GSDM should include the formulation of hypotheses,
as is implicitly assumed in the topic above – hypotheses on what variables
influence the phenomenon being studied and what algorithm should be applied
to confirm or negate a hypothesis proposed. This means that before applying
algorithms there should always be a time for reflection on the problem and the
way the independent variables interact with the dependent variable.
This is the context in which the concept of preprocessing (Pyle, 1999;
Bishop, 1995) appears, which is nothing more than “removing the irrelevant
information and extracting the essential characteristics in order to simplify the
problem”. Nonparametric models with preprocessing fit in here, too, and may
be seen as the halfway house between parametric and nonparametric models.
Figure 4 presents a diagram of this type of model.
[Diagram: Raw input vector → Preprocessor → Preprocessed input vector → Nonparametric model → Output]
Figure 4 – Diagram of the nonparametric model with preprocessing
This may be seen as dividing the development of the model into two distinct
parts: the application of specific knowledge of the domain (the preprocessor)
and the aspects that are less well understood (the model). In other words, in the
first phase this specific methodology allows us to use the prior knowledge we
possess about the phenomenon. As an example, if we want to assess the
interaction between two cities, we need to know the size of the population in
each and the distance between them (the gravitational model).
In the second phase, the model is applied in order to specify the way in
which the variables relate to each other and obtain the desired output. So, on
the one hand, it is necessary to select and adapt the relevant variables and, on
the other, to specify the way in which each of them contributes to the final
result.
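The two phases can be sketched as follows, using the city interaction example. In the first phase a preprocessor keeps only the variables suggested by the gravity model and combines them into a single feature. In the second phase a simple data-driven model (here a k-nearest-neighbour average, chosen only for brevity) learns how that feature relates to the output. The field names, the records and the choice of model are all assumptions made for the illustration.

```python
import math

def preprocess(record):
    """Phase 1 (domain knowledge): keep only the variables the gravity model
    suggests are relevant and combine them into one feature."""
    p1, p2, d = record["pop_a"], record["pop_b"], record["distance_km"]
    return math.log(p1 * p2 / d ** 2)

def knn_predict(train_x, train_y, x, k=2):
    """Phase 2 (data-driven): average the outputs of the k nearest examples."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

# Hypothetical records; "mayor" stands for an irrelevant raw field and
# "flow" for an observed interaction between the two cities.
raw = [
    {"pop_a": 100_000, "pop_b": 50_000, "distance_km": 100, "mayor": "A", "flow": 10.0},
    {"pop_a": 200_000, "pop_b": 50_000, "distance_km": 100, "mayor": "B", "flow": 20.0},
    {"pop_a": 100_000, "pop_b": 50_000, "distance_km": 50, "mayor": "C", "flow": 40.0},
    {"pop_a": 400_000, "pop_b": 100_000, "distance_km": 100, "mayor": "D", "flow": 80.0},
]
train_x = [preprocess(r) for r in raw]
train_y = [r["flow"] for r in raw]

# Estimate the interaction for a pair of cities not present in the sample.
query = {"pop_a": 150_000, "pop_b": 50_000, "distance_km": 100, "mayor": "E"}
estimate = knn_predict(train_x, train_y, preprocess(query), k=2)
```

The preprocessor encodes a hypothesis (population and distance matter, the other fields do not), while the form of the relationship is left to the data.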
Deterministic and nonparametric approaches to modelling can be viewed as
two extremes of a continuum. A deterministic approach just uses the existing
knowledge of a phenomenon whereas nonparametric models use large
amounts of data and processing capacity and little knowledge. Nonparametric
models with preprocessing and parametric models can be seen as
compromises between these two extremes, as figure 5 illustrates.
[Diagram: a continuum running from Fixed, through Parametric and Nonparametric with preprocessing, to Nonparametric; more knowledge towards the left, more data towards the right]
Figure 5 – The continuum between deterministic and nonparametric models
Whatever the case, success will always mean progressively transporting
scientific problems from the right side of the figure (nonparametric approach) to
the left (deterministic approach). In other words, success lies in the ability to use
data and computing power to produce generalizable knowledge.
This may well be the ideal but that does not mean it is attainable in all
sciences. A particularly relevant example can be seen in the development of
physics, which is often considered the paradigm of modern science. Though
initially a science essentially based on a deterministic vision of phenomena,
physics has progressively adopted probabilistic approaches. In their
explanations, deterministic models have been unable to accommodate new
facts that essentially derive from improved measuring and observation
instruments such as new space telescopes.
The mechanistic view of natural phenomena, which used to characterize
science, has been progressively put into question while the expansion of the
biological sciences has very significantly changed our perception of what
science is (Macmillan 1997). More importantly, as happened in the past with
physics, this development in biological science has provided some extremely
useful concepts for the social sciences.
Among others, the concepts of complex adaptive systems, distributed
processing and agents as against the global analysis of processes have been
adopted and exploited in the social sciences (for more information on this topic
the reader is directed to the Santa Fe Institute website http://www.santafe.edu/).
They originated in the observation of biological systems and have been adapted
and adopted in the study of human social systems such as the economy, cities
and others (Macmillan 1997).
There are differing views on the way that GISc can benefit from nonparametric
models. There is apparently, however, a certain unanimity in the
GISc community that at this stage neither deterministic nor parametric models
have a great deal to offer the development of spatial analysis. As we have
already stated, this type of model demands a kind of knowledge and theoretical
formulation about the phenomena under study that at the moment GISc purely
and simply does not possess.
Accordingly, the discussion should be centred on the use of nonparametric
models to solve the practical problems facing GISc at present and, more
importantly, to develop a robust theoretical framework to serve as a basis for
developing a new GISc. More specifically, this discussion involves two
perspectives that at first sight appear quite similar but in fact conceal a
divergence that is not only profound but also carries important practical
implications.
There are those who defend the position that DM can be used within GISc
as a “black box”: we limit ourselves to gathering all the input variables available
and leave the tools to process all the information and present us with the
solution – without our access to the specifications that let us understand how
the model works. But there is another camp with a different position: it defends
the use of DM but assigns an important role to the preprocessing stage, at
which it reduces and perhaps transforms the input space. This allows greater
control of the specification of the problem, which will lead to a clearer
understanding of it.
Data preprocessing is essentially the stage where we concern ourselves
with the selection and transformation of the input variables (independent
variables) to be used in the modelling. Figure 6 provides a diagram representing
the two views of the use of DM in GISc.
[Diagram, first view: Raw input vector → Preprocessor → Preprocessed input vector → Nonparametric model → Output]
[Diagram, second view: Raw input vector → Big Black Box of Magic → Output]
Figure 6 – Two approaches to the use of nonparametric models in GISc
As you will have realized, this is where the fundamental difference lies
between nonparametric models with preprocessing and nonparametric
models. This difference reflects two completely different philosophies. When
using nonparametric models we accept our lack of knowledge about the
problem and hope that the technology will present the best solution through a
“brute force” search of the solution space. When using nonparametric
models with preprocessing, we are effectively formulating a hypothesis about
the variables that influence the phenomenon under study.
At the theoretical development level, the consequences of the two
approaches are completely different. If we adopt the first (figure 7), we cannot
expect to learn very much about the phenomena we are studying – we can only
hope to respond to practical problems case by case. At the same time we run
the risk of arriving at spurious relationships, i.e. those found by chance that do
not correspond to any actual relationships. In fact, the probability of spurious
relationships cropping up increases in proportion to the rise in the number of
independent variables (Mitchell, 1997; Bishop 1995).
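The risk of spurious relationships can be illustrated with a small simulation. The sketch below draws an output variable and a pool of input variables entirely at random, so any correlation between them is due to chance, and records the strongest correlation found as more inputs are considered. The sample size, the number of variables and the use of the Pearson coefficient are assumptions made for the illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so that the run is repeatable
n_obs = 20
y = [random.gauss(0, 1) for _ in range(n_obs)]
# 200 input variables drawn independently of y: any correlation is due to chance.
inputs = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(200)]

# Strongest absolute correlation found among the first m candidate inputs.
strongest = {}
for m in (5, 50, 200):
    strongest[m] = max(abs(pearson(x, y)) for x in inputs[:m])
```

Because each larger pool contains the smaller ones, the strongest chance correlation can only stay the same or grow as inputs are added, which is why an unrestricted search over many variables calls for caution.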
[Diagram: Raw input vector → Big Black Box of Magic → Output. Annotation: not much knowledge; small contribution to geographic knowledge]
Figure 7 – The process of using nonparametric models. In this case there is no place for
hypotheses and the possibility of finding spurious relationships is high.
In contrast, in the second case (figure 8) theoretical advances can probably be
achieved in that formulating and testing hypotheses on the behaviour of the
phenomena being studied can contribute to a better understanding of the way
the different variables interact. This experimental approach represents a
compromise between the knowledge necessary to specify a parametric model
and the nonparametric approach. In the parametric case, we need to specify a
model that not only identifies the variables involved but also the way they relate
to each other. In the case of the nonparametric model with preprocessing we
only specify the variables to be used and let the model deal with the task of
specifying the relationships that exist between the different variables.
[Diagram: Raw input vector → Preprocessor → Preprocessed input vector → Nonparametric model → Output. Annotation: knowledge; contribution to geographic knowledge]
Figure 8 – The process of using nonparametric models with preprocessing.
6 GEOREFERENCED DATA AND ITS UNIQUE FEATURES
Some of the aspects we shall be mentioning under this topic have already
been touched on earlier, especially in the topic “The difficulties of a Quantitative
Geography and Spatial Analysis”. However, these aspects were introduced to
give a picture of the present situation in terms of geographical theory and were
neither sufficiently explored nor presented in an organized manner. This is
precisely what we intend to do now.
It is generally agreed today that geographical data possesses special
characteristics that should guide the exploration of data on spatial phenomena.
The most important aspect to be taken into consideration is the First Law of
Geography (everything is related to everything else, but near things are more
related than distant things (Tobler, quoted by Anselin 1993)), which not only
indicates the unique features of spatial data but also provides a rich conceptual
framework for its exploration.
The first and obvious result of the First Law of Geography is that the
observations within a geographical framework are not independent, whereas
independence is an assumption that underlies a great many statistical methods.
In fact, the observations depend on one another and reveal high levels of
interaction and interdependence. This seriously conditions the use of traditional statistical
methods in spatial analysis. However, by using DM methods, which do not
assume hypotheses about data distribution, this characteristic can be used to
improve analysis results.
The idea consists of incorporating data that can help to improve the
neighbourhood characterization of an individual unit (an individual unit normally
corresponds to some type of geographical unit). It can be done by taking the
averages of the variables for all contiguous units or, as an option, all the units
located at a certain distance. According to the First Law of Geography, it is
highly probable that what happens “close by” influences the behaviour of the
individual unit in question. That is the reason why this type of variable (average
values for the neighbourhood) tends to be of importance in the problem.
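A minimal sketch of this idea follows. Neighbourhoods are defined either by an explicit contiguity list or by a distance threshold, and each unit receives the average of a variable over its neighbours as an additional input. The unit names, coordinates and the "income" variable are hypothetical.

```python
import math

def neighbours_within(coords, radius):
    """Neighbours defined by distance: all other units no further than `radius`."""
    return {u: [v for v in coords if v != u and math.dist(coords[u], coords[v]) <= radius]
            for u in coords}

def neighbourhood_average(values, neighbours):
    """For each unit, average the variable over its neighbouring units."""
    result = {}
    for unit, nbrs in neighbours.items():
        # A unit with no neighbours has no neighbourhood average.
        result[unit] = sum(values[n] for n in nbrs) / len(nbrs) if nbrs else None
    return result

# Hypothetical contiguity list for four units arranged in a row: A-B-C-D.
neighbours = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (2.0, 0.0), "D": (3.0, 0.0)}
income = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}

lagged_income = neighbourhood_average(income, neighbours)
# Each unit can now be described by its own value and its neighbourhood's value.
features = {u: (income[u], lagged_income[u]) for u in income}
```

In this arrangement a distance threshold of 1.0 reproduces the contiguity list, so either definition yields the same neighbourhood variable.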
Another very important aspect is the nonstationarity that should be
expected. This means that the process underlying a phenomenon in a particular
area is not necessarily the same in another area for the same phenomenon. It is
to be expected that the process underlying the phenomenon changes in space
– and, accordingly, global explanations may be wrong and mask local
specificities.
Another important feature of georeferenced data, which to a great extent is
an outcome of the First Law of Geography, is that uncertainty and errors are
grouped together spatially. Expressed in another way, this means that the
quality of the data and its representation varies over the study area.
Furthermore, it is normal that the quality of the representation is similar in zones
that are close to each other, i.e. there are some zones where the representation
is good and others where the quality is bad.
A situation that is obviously related to data nonstationarity is the strong
probability of locally observing relationships that are not substantiated overall.
This fact explains the irrelevance of most statistics relating to overall maps. For
example, there is little use in measuring the overall spatial autocorrelation of a
map – we should concentrate on evaluating the local spatial autocorrelation
instead, measuring the relationship between a spatial unit and its neighbours.
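One common way of measuring the relationship between a unit and its neighbours is a local Moran statistic. The text does not name a specific measure, so this choice is an assumption. The sketch below computes it with row-standardized neighbour weights for a hypothetical row of six units and also reports the average of the local values, which under this weighting equals the global Moran coefficient.

```python
def local_moran(values, neighbours):
    """Local Moran's I per unit, using row-standardised neighbour weights."""
    n = len(values)
    mean = sum(values.values()) / n
    z = {u: v - mean for u, v in values.items()}   # deviations from the mean
    m2 = sum(d * d for d in z.values()) / n        # variance of the variable
    local = {}
    for u, nbrs in neighbours.items():
        lag = sum(z[v] for v in nbrs) / len(nbrs)  # mean deviation of neighbours
        local[u] = z[u] * lag / m2
    return local

# Hypothetical 1-D arrangement of six units: the left end is homogeneous
# (similar neighbours), the right end alternates (dissimilar neighbours).
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
values = {0: 10.0, 1: 10.0, 2: 10.0, 3: 2.0, 4: 10.0, 5: 2.0}

local_i = local_moran(values, neighbours)
global_i = sum(local_i.values()) / len(local_i)   # average of the local values
```

In this example the single global figure is negative, while the local values on the left are positive: the overall statistic masks a local cluster of similar units.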
The most important aspect to be taken into account when we are working
with georeferenced data is that at the heart of spatial analysis lie location, area,
topology, spatial arrangement, distance and position. Any GSDM project should
conform to this principle. There are different ways of integrating this kind of
information and it is perhaps here that the need for research is the greatest and
the most pressing. At the present moment, this seems a more important issue
than the construction of a new GSDM technology.
In our opinion, current DM tools have created what GISc has long needed:
nonparametric tools that are less conditioned by hypotheses about data
distribution, that can be freely used and that are not so demanding in terms of
presuppositions about the data.
From the perspective of their application to GISc, DM tools are much
more interesting than traditional statistics, for the simple reason that they do not
assume hypotheses, a priori, about the data. However, if all the benefits of
these tools are to be reaped, we have to develop appropriate methodologies for
applying them to this specific area. Rather than constructing what we may call
“black-box type search models” we should be implementing solid
GISc-applicable methodologies that incorporate geographical references in the
modelling process.
GSDM may represent an excellent opportunity for developing a body of work
that leads, if not to greater theoretical robustness, at least to enhanced
knowledge of empirical GISc regularities. Experimentation is of particular
importance to this task. As we know, DM tools deal especially well with large
amounts of data and are well suited to the type of knowledge we are looking for,
namely the role of space in the different phenomena.
It seems that we need to formulate hypotheses that can be tested and
devise inventive ways of including geographical references in data and
methodologies. Distance and connectivity matrices are just two examples of the
different forms we can use to contextualize spatial data.
An example would probably best help to explain this point. Cluster analysis
has long been an important tool in the geographer’s armoury. Recently it has
been widely used in what is normally called geodemographics.
Cluster analysis is particularly interesting for GISc in that, in contrast to other
statistical methods, it assumes no hypotheses about the type of data
distribution. In this sense, we can consider it quite a secure tool when used with
georeferenced data.
The problem lies in the way most geographers use it. Geographers and
other spatial scientists would be expected to introduce some kind of spatial
measure or indicator into their analysis. Instead of analyzing the proximity of
individual units in terms of a space constructed by alphanumerical variables,
why not introduce geographical distance? This very simple observation helps us
better understand what we call the superficiality of geographical theory.
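The suggestion can be sketched with a plain k-means clustering in which the coordinates of each unit, scaled by a weight, are added to the attribute space. This is only one of several possible ways of introducing geographical distance. The units, the weight and the deterministic starting centres are hypothetical, and in practice the variables would normally be standardized first.

```python
import math

def kmeans(points, centres, iterations=10):
    """Plain k-means: returns the cluster label of each point."""
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        labels = [min(range(len(centres)), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        # Move every centre to the mean of its members.
        for c in range(len(centres)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster(units, geo_weight):
    """Cluster on the attribute plus coordinates scaled by `geo_weight`."""
    points = [[u["attr"], geo_weight * u["x"], geo_weight * u["y"]] for u in units]
    centres = [points[0][:], points[-1][:]]   # deterministic start, k = 2
    return kmeans(points, centres)

# Hypothetical units: U1/U2 have similar attributes but lie far apart,
# while U1/U3 are neighbours with different attributes.
units = [
    {"name": "U1", "attr": 1.0, "x": 0.0, "y": 0.0},
    {"name": "U2", "attr": 1.1, "x": 10.0, "y": 0.0},
    {"name": "U3", "attr": 5.0, "x": 0.0, "y": 1.0},
    {"name": "U4", "attr": 5.1, "x": 10.0, "y": 1.0},
]
attribute_only = cluster(units, geo_weight=0.0)
with_geography = cluster(units, geo_weight=1.0)
```

With a weight of zero the units are grouped purely by attribute similarity; with geography included the groups follow spatial proximity instead. The weight makes the trade-off between the two an explicit modelling decision.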
7 BIBLIOGRAPHY
Aangeenbrug, R. T. (1991), A critique of GIS, In: Maguire D J,
Goodchild M F, Rhind D W (eds) Geographical Information Systems
Vol 1 Overview Longman Scientific & Technical, Harlow, pp 101-107
Abler, R. Adams, J. S. Gould, P. (1977), Spatial Organization, The
Geographer's View of the World, Prentice-Hall International, Inc,
London
Anselin, L. (1989), What is Special About Spatial Data? Alternative
Perspectives on Spatial Data Analysis, Technical Paper, NCGIA,
Geography Department, University of California Santa Barbara,
California
Anselin, L., (1993), Exploratory Spatial Data Analysis and
Geographic Information Systems. Proceedings of the workshop on
New Tools for Spatial Analysis, ISEGI, Lisbon
Bailey, Trevor C. (1994), A review of statistical spatial analysis in
geographical information systems, In: Fotheringham, A. S., Rogerson,
P. A. (eds) Spatial analysis and GIS Taylor and Francis Ltd. 1900
Frost Road, Suite 101, Bristol PA 19007, pp 13-44
Bird, James (1993), The changing worlds of geography, a critical
guide to concepts and methods, Second Edition, Clarendon Press,
Oxford
Bishop, C. (1995) Neural Networks for Pattern Recognition. Oxford
University Press, Oxford.
Burch, J. Grudnitski, G. (1989), Information Systems - Theory and
Practice, Fifth Edition, John Wiley & Sons
Dodson, R. NCGIA, Santa Barbara. 1992, from the map included in
the book by John Snow: "Snow on Cholera...", London: Oxford
University Press, 1936. Scale of source map is approx. 1:2000.
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data
mining to knowledge discovery in databases. AI Magazine, Fall 1996,
pp. 37-54.
Fayyad, U., (1998) “Mining Databases: Towards Algorithms for
Knowledge Discovery”, Data Eng. Bull., vol. 21, no. 1, pp. 39-48.
Gahegan, M. (2003) Is inductive machine learning just another wild
goose (or might it lay the golden egg)? International journal of
geographical information science, Vol. 17. No. 1. pp. 69-92
Gatrell, A. C. (1991), Concepts of space and geographical data, In:
Maguire D J, Goodchild M F, Rhind D W (eds) Geographical
Information Systems Vol 1 Principles, Longman Scientific &
Technical, Harlow, pp 119-134
Goodchild, Michael F. (1986), Spatial Autocorrelation, Geo Books -
CATMOG 47, London
Johnston, R. J. (1986), Geografia e Geógrafos, Difel, São Paulo
Kennedy, Ruby L.; Lee, Yuchun; Van Roy, Benjamin; Reed,
Christopher D.; Lippmann, Richard P. (1998), Solving Data Mining
Problems Through Pattern Recognition, Prentice Hall, New Jersey
MacEachren, A., Bishop, J., Dykes, J., Dorling, D., Gatrell, A.,
(1994), Introduction to advances in visualizing spatial data, In:
Hearnshaw, H. M., Unwin, D. J.(eds), Visualization In Geographical
Information Systems, John Wiley & Sons Ltd., Baffins Lane,
Chichester, West Sussex PO19 1UD, England, pp 51-60
Macmillan, W., (1997). Computing and the science of Geography:
the postmodern turn and the geocomputational twist in proceedings
of the second annual conference of GeoComputation ‘97 & SIRC ‘97,
University of Otago, New Zealand
Maguire, D. J. Dangermond J, (1991) The functionality of GIS, In:
Maguire D J, Goodchild M F, Rhind D W (eds) Geographical
Information Systems Vol 1 Principles, Longman Scientific &
Technical, Harlow, pp 319-335
Miller, H. J. and Han, J. (2001), Geographic data mining and
knowledge discovery: An overview. In H. J. Miller and J. Han (eds.)
Geographic Data Mining and Knowledge Discovery, Taylor and
Francis, London.
Mitchell, T. M. (1997). Machine Learning, New York, USA,
McGraw-Hill.
Openshaw, S. (1991), Developing appropriate spatial analysis
methods for GIS, In: Maguire D J, Goodchild M F, Rhind D W (eds)
Geographical Information Systems Vol 1 Principles, Longman
Scientific & Technical, Harlow, pp 389-402
Openshaw, S. (1993), What is gisable spatial analysis?, Proceedings
of the workshop on New Tools for Spatial Analysis, ISEGI, Lisbon
Openshaw, S., Waugh, D., Cross, A., (1994), Some ideas about the
use of map animation as a spatial analysis tool, In: Hearnshaw, H. M.,
Unwin, D. J.(eds), Visualization in Geographical Information Systems,
John Wiley & Sons Ltd., Baffins Lane, Chichester, West Sussex
PO19 1UD, England, pp 131-138
Openshaw, S. (1999), Geographical data mining: key design issues,
GeoComputation 99, Mary Washington College in Fredericksburg,
VA, USA, 25-28 July 1999. URL:
http://www.geovista.psu.edu/sites/geocomp99/Gc99/051/gc_051.htm
Popper, K. (1959). The logic of scientific discovery, Basic Books:
New York, 479pp.
Pyle D., (1999) Data Preparation for Data Mining. Morgan Kaufmann
Publishers, San Francisco, California
Yuan, M., Buttenfield, B. Gahegan, M. and Miller, H. (2001)
Geospatial Data Mining and Knowledge Discovery. A UCGIS White
Paper on Emergent Research Themes. URL:
http://www.ucgis.org/emerging/