UNIVERSIDADE NOVA DE LISBOA

INSTITUTO SUPERIOR DE ESTATÍSTICA E GESTÃO DE INFORMAÇÃO

Geospatial Data Mining

Fernando Lucas Bação


CONTENTS

1 INTRODUCTION

2 THE FRAMEWORK OF GEOSPATIAL DATA MINING

1.1. The Present Situation

1.2. New Approaches to the Theoretical Constriction

3 THE DIFFICULTIES OF A QUANTITATIVE GEOGRAPHY AND SPATIAL ANALYSIS

4 MODEL TYPES

5 AN EXPERIMENTAL APPROACH TO GEOSPATIAL DATA MINING

6 GEOREFERENCED DATA AND ITS UNIQUE FEATURES


1 INTRODUCTION

We would like to emphasize, from the outset, the importance of the bibliographical

references to an understanding of the subject dealt with here. Though they may

not be indispensable, they are an invaluable source for those who seek a better

understanding of some of the topics that time and space do not allow us to

analyze in greater detail here.

This text introduces the issues of Geospatial Data Mining (GSDM) and fits them into the broader setting of GISc (Geographical Information Science). The term GSDM was adopted to emphasize that what is being dealt with here is the application of data mining tools to the specific field of geographical data, i.e. data referenced to its position on the earth’s surface.

The term ‘geospatial’ is used to underline the importance of space defined

as the relationship between a set of objects. Contrary to what we might think, data mining based on geo-referenced data has different features from data mining carried out on business data. Though the two coincide to a considerable

extent, there are also certain differences that, though restricted in number, are

very important and cannot be neglected. Thus we not only use the expression

Geospatial Data Mining to indicate the origin of the data but also to stress the

fact that there are important differences.

We shall begin the text by providing a framework for GSDM in which we try

to trace out the present situation of GISc (Geographical Information Science),

stressing the aspects that lead us to think that GSDM is of importance to the

subject. This task also helps us to reach a more concrete definition of the

GSDM concept itself or, rather, by presenting the difficulties facing GISc in

spatial analysis terms, we deduce some of the features that GSDM should,

desirably, present.

Point three provides a more detailed treatment of the theoretical question

connected with the study objective and attempts to give the main reasons


contributing to the difficulty of creating a quantitative geography (perhaps we

can use the term ‘scientific’ rather than ‘quantitative’). Another issue broached

concerns the definition of spatial analysis, a particularly significant aspect in that

the principal aim of the tools we seek to present in this course is to solve the

problems of spatial analysis.

The objective of the fourth topic is to draw attention to the different types of

models that science normally employs. We try to show how they differ and

particularly how and why new models incorporating most DM tools can support

GISc in developing new tools and new knowledge.

With the fifth topic, An Experimental Approach to GSDM, we attempt to

establish certain differences in the ways that DM tools can be used in GISc.

These differences are based, fundamentally, on a purely inductive vision and a

vision in which inductive tools complement the deductive effort that is more

general in nature and more capable of providing theories and laws.

The sixth topic deals with the question of what is special in spatial data. It is

extremely important to understand that spatial data displays features that

distinguish it from other types of data. These differences have serious

implications, creating constrictions at the analysis level. It is precisely these

differences that have triggered the appearance of GSDM.

In brief, we could say that this module covers essential issues relating to

GISc itself and spatial analysis, along with their practical consequences. The

issues we refer to are: What is the object of studying GISc? Can laws or

theories be constructed around spatial distributions? Are the inductive

approaches provided by DM tools enough to overcome the absence of theory?

In our view it is important to stress that if we consider space and distance

relevant variables, then it is indispensable for this to become evident when we

apply our methods of analysis. If we are convinced that spatial distribution plays

an important role in the phenomena we are studying, then we have to find a

way of explicitly including information of a geospatial nature in our analyses.


2 THE FRAMEWORK OF GEOSPATIAL DATA MINING

When we speak of the relationship between geography and data mining

(Fayyad et al., 1996) it is easy to detect an enormous paradox. On the one

hand, a large part of the GISc (Geographical Information Science) community at

present entertains great expectations of DM (data mining) as the means of

increasing the quality and sophistication of spatial analysis. On the other hand,

we can argue that geography, and more precisely cartography, have long been

doing what today is termed DM.

In fact, if we accept that the main function of DM is to improve the interfaces

between data storage systems and human beings, allowing the exploration,

summary and modelling of large databases (Fayyad, 1998), then geography

has been data mining for centuries. An example of this is the following map by

Charles Minard (figure 1):

Figure 1 – Map of Napoleon’s Russian campaign (adapted from Burch and Grudnitski,

1989)


This masterpiece of the cartographer’s art illustrates Napoleon’s campaign

in Russia. We consider that it provides a good picture of the capacities of maps

to summarize information and to act as an interface between human beings and

large databases.

Drawn up by Charles Minard in 1869, the map provides a clear and simple

illustration of the disaster that Napoleon’s Russian campaign in 1812 turned out

to be, unambiguously revealing the devastating losses that France suffered

during the campaign. The map shows the route taken by the column of troops

as it advanced on Russia (the light colour) and retreated (in black) and the size

of the column in terms of men (indicated by the width of the line). The line of

retreat is combined with a temperature scale showing the temperatures these

men had to endure when retreating. The map also shows the level of casualties

whenever there was a river to cross or a battle, as for example with the crossing

of the River Berezina during the retreat. The army finally managed to reach

Poland with the catastrophic number of 10 000 men, out of the 422 000 who

had left. As can be observed, this map gives a highly detailed picture where we

can follow the development of five variables: the size of the army, its position on

a two-dimensional surface, the direction the military column was moving in and

the temperature on various dates during the retreat from Moscow.

Another good example of this argument is the work of Dr John Snow at the

time of the great cholera epidemic in London in 1854. This is perhaps the most

fantastic discovery that GIS has produced, paradoxically long before computers

existed.


Figure 2 – Map indicating cholera victims and wells supplying water in London (adapted

from Dodson 1992)

This London physician suspected that the agent provoking the disease was

being transmitted through the water being used in London. He devised quite a

simple method to test his hypothesis. First of all he made a plan of the zone in

London where the epidemic was concentrated, showing the houses and the

streets. Then he started to identify the victims’ place of residence from a spatial

perspective, marking each victim’s address with a dot. When a plan of the wells

supplying the water was placed over this map (figure 2), it was quickly

concluded that not only was his hypothesis correct but that a certain well was

responsible for spreading the disease.

The second part of the paradox that we mentioned initially concerns the fact

that it has long been noted that there is a discrepancy between the capabilities

for storage, management and access and the analytical tools provided by


Geographical Information Systems (Maguire, 1991). In fact, the sophistication

offered by developments in computer science with regard to geographical data

management and access has not been reflected in spatial analysis

(Aangeenbrug, 1991; Openshaw, 1993). That is why we have GIS with the

capability to store and manage great quantities of geo-referenced data but do

not possess the tools that allow us to transform this data into information and

this information into knowledge (Openshaw, 1991).

On the one hand the map or, more generally, the visualization tools that GIS

offers are central tools in the domain of visualizing complex phenomena, in

particular those that are to be seen on the earth’s surface (for more on this

subject see MacEachren et al., 1994; Openshaw et al., 1994). It is to be hoped

that GISc continues to contribute to the expansion of DM visualization

methodologies.

On the other hand, it appears essential for GISc to pay attention to the new

DM tools emerging, which are capable of making a decisive contribution to a

new era in spatial analysis (Openshaw, 1999; Miller et al., 2001; Yuan et al., 2001; Gahegan, 2003). This new era should be characterized by a greater

ability to predict the development of the phenomena under study.

1.1. The Present Situation

One of the most notable characteristics of GISc today is the explosion of

geo-referenced data produced by recent IT developments (Openshaw, 1991;

Miller et al., 2001). The technology for gathering information with geographical

references, which varies from digital cartography to remote sensing and LBS

(location-based services), has been inundating the databases. This fact

demonstrates the importance of developing tools capable of effectively dealing

with large quantities of geo-referenced data.


Without doubt the visualization tools now available represent an enormous

plus for the exploration of this data but they are still inadequate. The complex

and multi-dimensional nature of today’s databases creates new challenges for

which visualization does not yet have effective solutions. At the same time,

visualization is always an assisted process which has great difficulty in allowing

anything more than an exploratory analysis of the data.

As we have already mentioned, GIS today permits great quantities of data to

be stored and nowadays it is quite simple to manage and access it using highly

sophisticated tools. If we add to this the above developments in the technology

for gathering geographical information, we arrive at the present situation in

which we probably have the data to respond to many urgent issues (social and

environmental) but only buffers and overlays to analyze them.

The use and importance of operations like buffers and overlays are not in

question but these analytical tools date from an earlier time than the

Geographical Information Systems themselves. Today we need tools of a

different class, able to handle the highly varied nature of geo-referenced data

and to explore, relate and forecast.

To attain this objective there are two routes available. On the one hand the

tools can emerge as a consequence of the theoretical bases of the subject; on

the other (if these theoretical bases do not exist) we can resort to inductive

approaches that allow the practical problems to be dealt with and possibly help

to boost the theory.

Whatever the case, it is always preferable – as it is more robust and

scientifically more correct – that there are laws, knowledge and theory on the

study objective or, in the specific case of GISc, spatial and spatial-temporal

distributions and patterns. Yet that is precisely what is missing.

The question can be reformulated by declaring that in GISc terms we are living

in an environment that is “rich in data and poor in theory” (Openshaw, 1991).

Today, taking the theoretical difficulties into consideration, DM offers GISc an


excellent opportunity to raise spatial analysis to new levels of sophistication by promoting more effective exploration of the databases available.

1.2. New Approaches to the Theoretical Constriction

Inductive, data-driven approaches applied to modelling and spatial analysis

may be the way of facilitating the creation of new knowledge and contributing to

the process of scientific discovery. DM tools that have been developed may

play a central role, helping the process of exploring great quantities of data in

the search for recurring patterns and persistent relationships.

It is important to stress that DM is not to be viewed as a panacea for all the

problems of data analysis and, more specifically, spatial data analysis.

Deterministic models and well-grounded theory on the nature of phenomena are

always preferable to DM. The belief that DM will work in any case,

independently of the data, can only damage the development of this area of

knowledge.

When the use of DM methodologies in GISc is assessed, two essential

aspects should be taken into consideration. The first, already presented above,

is connected with the scarcity of theoretical results and models, which makes

GISc a suitable candidate for the benefits of DM tools.

Secondly, the fact that spatial information possesses special characteristics

should be taken into account (Anselin, 1989), as we shall see below in greater

detail. As a preliminary conclusion we can say that DM is capable of becoming

a very useful tool in spatial analysis, provided that the special aspects of the

data are properly treated and safeguarded.

When the specificity of spatial data is taken into account, it must not be

forgotten that the software packages available were specifically developed for

the purpose of analysing large databases, with a view to modelling and

forecasting consumer behaviour. Given their origin, it is not difficult to


understand that the specific needs of spatial analysis are not addressed. This

situation has led certain authors (Openshaw, 1991) to argue that specific

software needs to be developed for exploring geo-referenced data.

In spite of the limitations imposed by spatial data, it may not be inevitable

that we have to wait for a new Geospatial Data Mining technology to be

developed before we can take advantage of these new tools. With alterations to

the process, particularly greater emphasis on the pre-processing of data, we will

probably be able to use existing software packages to produce new forms of

spatial analysis. At the present moment adapting the process underlying DM

(Pyle, 1999) to allow appropriate, reliable exploration of spatial data seems

more useful than waiting for specific tools for Geospatial Data Mining.

Geospatial Data Mining should be understood as a special type of DM that

seeks to carry out generic functions similar to those of conventional DM, though

modified to safeguard the special aspects of geo-information. However, the

elements of real importance in Geospatial Data Mining are the study objective,

spatial distributions and spatial-temporal patterns: it is this that confers on it a

unique view of reality and the unique features that it possesses (Abler et al.,

1977).


3 THE DIFFICULTIES OF A QUANTITATIVE GEOGRAPHY AND SPATIAL

ANALYSIS

In this topic we attempt to underline some of the theoretical aspects that can

facilitate an understanding of GSDM’s specificities and its potential for resolving

certain theoretical constrictions. It is also important to understand what is meant

by spatial analysis insofar as it is the area in which the tools we present later

are applied.

Unsurprisingly, spatial analysis has various definitions. From a more

traditional and restrictive perspective, we can define spatial data analysis as

"the statistical study of phenomena that manifest themselves in space" (Anselin,

1993). The definitions chosen by Openshaw (1993) are more global and not so

useful in practical terms, though broader and richer in theoretical terms: “Spatial

analysis is usually defined as the description and analysis of the 0, 1, 2, 2 1/2

and 3 dimensional features found on maps (i.e. points, lines, areas, surfaces). It

also includes the analysis and modelling of all map-related or map relateable

information” or, again, “Another definition would be that any analysis, modelling,

and policy application in which space or location is important, or which makes

use of spatial information in the broadest possible sense, is also spatial

analysis”. Lastly, the following definition can also be considered: “a general

ability to manipulate spatial data into different forms and extract additional

meaning as a result” (Bailey, 1994).

Independently of more philosophical reflections, which do not fall within the

scope of this text, we can quite easily draw two conclusions from the group of

definitions presented. First of all, it is not hard to ascertain that there is no

consensus on the question “What is spatial analysis?” and that defining it is

extremely difficult, a situation that results in definitions that are too general and

not very objective.


The second conclusion is that though these definitions are quite different

there is a common thread running through them all, that is, the inherent concern

with stressing space as the central factor of the definition. For the rest, all the

definitions put forward are rooted in a very personal vision of what spatial

analysis or spatial data analysis should be, as we shall see below.

This second conclusion gives rise to a third, one of extreme importance

which clearly indicates that location, area, topology, spatial arrangement,

distance and position represent the centre or nucleus of research, i.e. of spatial

analysis. This is implicitly expressed in the First Law of Geography,

"everything is related to everything else, but near things are more related than

distant things" (Tobler, quoted by Anselin 1993). In fact, if there is anything

useful in spatial analysis definitions, it is certainly the recognition of space as a

source of explanation for the patterns presented by the different phenomena

within it.

It is easy to understand that the law just stated results in a disappointing

conclusion: to a great extent, it is not possible to use the traditional methods of

classical statistics to analyze spatial data, given the important role that location

plays in understanding phenomena observed in space.

Accordingly, if we consider that, as a rule, observations will tend to be

spatially clustered or, in other words, geographical data samples will not be

independent, then we are faced with a conflict in that, in statistical terms,

observations are usually assumed to be independent and identically distributed.

This spatial effect is termed spatial dependence. When present in spatial data

this dependence is normally referred to as spatial autocorrelation.
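This dependence can be quantified. A common statistic for it is Moran’s I, which compares each observation’s deviation from the mean with the deviations of its spatial neighbours. The sketch below is a minimal illustration; the five zone values and the simple chain of neighbour weights are assumptions made for the example, not data from this text:

```python
# Minimal sketch of Moran's I for spatial autocorrelation.
# Zone values and neighbour weights below are illustrative assumptions.

def morans_i(values, weights):
    """values: list of observations per zone;
    weights: dict mapping (i, j) -> spatial weight w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(weights.values())  # total weight
    num = sum(w * dev[i] * dev[j] for (i, j), w in weights.items())
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Five zones along a line, each neighbouring the next (weight 1, symmetric):
values = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = {}
for i in range(4):
    weights[(i, i + 1)] = 1.0
    weights[(i + 1, i)] = 1.0

print(round(morans_i(values, weights), 2))  # positive: values cluster in space
```

A positive value (here 0.5) indicates that neighbouring zones tend to have similar values, i.e. the spatial dependence described above; values near zero would indicate spatial independence.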

There is, however, a second spatial effect, emerging from the importance of

location in this type of data. It is related to spatial differentiation, the result of the

intrinsic originality of each place, and can be termed spatial heterogeneity.

There are two other factors that, to a great extent, rule out using the

methodologies of classical statistics. The first relates to the fact that the central


concern of the methods to use should be space, the location. As is well known,

this does not count among the relevant variables in statistical processes. “Many

kinds of analysis look only at features’ attributes without making explicit

reference to location, and in this class we would have to include the vast

majority of standard statistical techniques, ranging from simple tests of means

and analysis of variance to multiple regression and factor analysis” (Goodchild,

1986). Moreover, quite comprehensibly, existing statistical methods were not

designed with the specific features of space as an explanatory variable in mind.

Indeed, Openshaw (1993) refers to this in the following way: “These (aspatial

statistical methods) are not usually spatial analysis relevant methods because

they treat spatial information as if it were equivalent to survey data, and

generally totally fail to both handle any of the key features that make spatial

data special (i.e. spatial dependency, non sample nature, modifiable areal units

effects, noise and multivariate complexity), or provide results that are sensitive to the nature and needs of geographical analysis”.

The second factor involves the fact that today more than ever, as far as

GISc is concerned, there is no valid reason to test the significance of particular

indicators or try to construct “significant” samples. In the field of geography

significance tests are of little use, as Gould (quoted by Goodchild, 1986) states

in the case of spatial autocorrelation: “Why we should expect independence in

spatial observations which are of the slightest intellectual interest or importance

in geographic research I cannot imagine. All our efforts to understand spatial

pattern, structure and process have indicated that it is precisely the lack of

independence - the interdependence - of spatial phenomena that allows us to

substitute pattern, and therefore predictability and order, for chaos and apparent

lack of interdependence of things in time and space”. Goodchild (1986)

emphasizes this idea, stating that: “It is impossible for a geographer to imagine

a world in which spatial autocorrelation could be absent: there could be no

regions of any kind, since the variation of all phenomena would have to occur

independently of location, and places in the same neighbourhood would be as

different as places a continent apart”.


We can extend this argument beyond the concept of spatial autocorrelation,

taking into consideration that if concepts exist and are accepted by the majority as such, it is not worthwhile testing them. In fact, it makes no sense to test

concepts like the distance effect or spatial association. In the absence of any

other reason, we can argue that the null hypothesis (the absence of spatial

autocorrelation or the friction of distance effect), necessary to proceed with the

test, makes no sense in geographical terms. As Openshaw (1994) argues: “So

why not operationalize this geographical concept as a concept rather than as a

precise, assumption-dependent statistical test of a hypothesis? The statistical

test approach requires a much greater input of knowledge than is contained in

the original theoretical notion”.

On the one hand, in the case of GISc, the universe under study can

generally be explored to its full extent without the need to create “statistically significant samples” (a concept that does not exist in geography). Essentially, this has resulted from the development of computing which, with its dizzying advance, has also provided statistics with new approaches to old problems such as sampling, e.g. bootstrapping.

Forecasting methods may be the point where DM has its most obvious

impact. In fact, with regard to the use of traditional statistical methods, there are

up-stream problems that are still a long way from being solved and that appear

at present to be much more pressing and more reliably resolvable. Openshaw

(1993) acknowledges this very fact: “Indeed knowing how to develop new and

more relevant methods looks like being too difficult for current technology

particularly when it has to be performed in a rigorous fashion within a classical

statistical framework ... Maybe the statistical problems are just too hard to be

resolved at present, so why not change the nature of the problem to make it

easier to solve?”

Over the years concern with the use of less restrictive methodologies, where

DM fits in, has come to constitute the research object of certain GISc

researchers. This has touched on areas as different as neural networks


(Mitchell, 1997; Bishop, 1995) and genetic algorithms (Mitchell, 1997), statistical

measures for spatial correlation (Goodchild, 1986), exploratory data analysis

(Anselin, 1993; Openshaw, 1993) and scientific visualization (MacEachren

1994). This type of research has produced a range of results and is registering

ever-increasing recognition of its advantages, which represents an important

step in raising the visibility of the field both in funding terms and in terms of its

curricular importance within GISc.


4 MODEL TYPES

The aim of this topic is to provide an operating framework for DM and, more specifically, to frame it within the scope of GISc. In connection with what has

already been discussed, it is worth recalling the emphasis placed on the need to

adopt analytical tools that allow a less restrictive approach (with fewer

assumptions about data distribution) in order to “get round” the particular

difficulties imposed by the spatial data.

In the course of this topic we shall try to go a little further and produce a

framework for DM within the scope of other model types that science employs.

We consider this discussion important in that it will allow a clearer

understanding of the way DM can fit in with GISc so as to encourage the

theoretical development that is indispensable to any science. In order to explain

some of the concepts relevant for this discussion we will make use of a model

classification proposed by Kennedy et al. (1998) and their illustrative examples.

Accordingly, we begin by proposing to divide the models used in science

into three fundamental types:

- Deterministic models;
- Parametric models;
- Non-parametric models.

As with other tasks carried out in computing, modelling requires a

“program” that gives detailed instructions on the way the process develops.

These instructions are typically mathematical equations that characterize the

relationship between input and output. It is in the formulation of these equations

that the central problem of the modelling lies.

The best way of modelling consists of formulating “closed” equations that

deterministically define the way the outputs are obtained from the inputs. As all


the features are constant we refer to them as deterministic models. This type of

model is suitable for dealing with problems that are simple and perfectly

understood.

An example of this type of model is the calculation of how long a stone takes

to fall. In fact, we possess sufficient knowledge to describe this relationship and

all we need to know is the height from which the stone begins to fall:

t = √(2h / 9.8)

where 9.8 is normally represented as g, the acceleration constant of gravity,

and h is the height in metres.
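As a sketch, this deterministic model can be written directly as a function; the `fall_time` name and the 4.9 m test height are our own illustrative choices:

```python
import math

def fall_time(h, g=9.8):
    """Deterministic model: time (s) for a stone to fall from height h (m)."""
    return math.sqrt(2 * h / g)

print(fall_time(4.9))  # sqrt(2 * 4.9 / 9.8) = sqrt(1.0) = 1.0 s
```

Nothing is estimated here: every feature of the relationship is fixed in advance, which is precisely what makes the model deterministic.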

The conceptual elegance and richness of this type of model inspired some

authors to try to adopt models from physics to explain social phenomena, in a

stream of thought that became known as “social physics” (Johnston, 1986).

Geography was no exception and also presented theoretical, geometry-based

developments as a work tool (the most widely known probably being the

gravitational model, which theorizes on applying the recognized physics model

to spatial phenomena).

These theoretical and conceptually rich models represented one of the faces

of the Quantitative Revolution in Geography (Johnston, 1986; Bird, 1993). Their

main difficulty, and the reason for abandoning them, lay in the difficulty of

applying the concepts to the discipline’s practical problems. The language of

geometry is so rigid that it created practical difficulties and prevented the

practical applications from drawing on the benefits of these developments.

It is unfortunate that, especially in the humanities and social sciences, most

problems do not lend themselves to such a simple description or are not so well

known as those that characterize Newtonian physics (Macmillan, 1997).

(Nowadays, apparently, we also know that the object of study in physics is


actually not so linear and explicable as was once believed). The lack of “formal

knowledge” for most geographical problems drastically restricts the use of this

type of model.

Now is the time to think of a slightly different problem in order to illustrate the

way a parametric model functions. The problem is to ascertain how long a stone

takes to fall, though on a different planet for which we do not know the constant

acceleration of gravity g. The relationship is expressed as follows:

t = √(2h / g)

Now we are confronted with an estimation problem. Though we have a good

idea of the way the inputs and outputs are related, we lack the degree of

certainty necessary in a deterministic model. Thus, in this case, we need to

provide a value for g. One way of dealing with the matter may be to carry out an

experiment. If we drop the stone from different heights, timing how long it takes

to fall, the experiment will result in a table like this:

Height (m)    Time to fall (s)
0.5           0.19
1.3           0.40
2.8           0.46
4.0           0.68
7.3           0.70
9.7           1.04

Table 1 – Experiments on the time a stone takes to fall on Planet xpto

Here we have a table of examples defining the relationship between height

and falling time. An example is defined by an input vector and the desired

output. In this particular case the input vector consists of a single variable: the


height from which the stone falls. The output consists of the time the stone

takes to fall.

The next step involves selecting a suitable value for g so that the model can

produce estimates that are as close as possible to reality. In other words, we

want to define a value for g that, given an input value for the height, allows us to

estimate the falling time as accurately as possible. The following graph (figure

3) shows the measurements we made and the response the models provided

for g values of 9.8, 20 and 35:


Figure 3 – Adapting the model: the different curves correspond to the different g values

In this case it is quite easy to assign a good value to g, though it is not

always so, especially with more complex problems involving complex

relationships and multiple variables. As we can see in the graph, the middle

line, where a value of 20 was assigned to the constant, seems to be the one

that best fits the description of the time a stone takes to fall on this particular

planet.
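The value chosen by eye can also be estimated from the data in Table 1. A minimal sketch, assuming we linearize t = √(2h/g) to t² = (1/g)·2h and fit the slope by least squares; this particular estimation route is our choice, not the text’s:

```python
# Experimental (height, time-to-fall) pairs from Table 1.
data = [(0.5, 0.19), (1.3, 0.40), (2.8, 0.46),
        (4.0, 0.68), (7.3, 0.70), (9.7, 1.04)]

def estimate_g(samples):
    """Estimate g by linearizing t = sqrt(2h/g) to t^2 = (1/g) * 2h
    and fitting the slope by least squares (no intercept)."""
    num = sum(2 * h * t ** 2 for h, t in samples)   # sum(x * y) with x = 2h, y = t^2
    den = sum((2 * h) ** 2 for h, t in samples)     # sum(x^2)
    return den / num                                # g = 1 / slope

print(round(estimate_g(data), 1))
```

On the Table 1 data this yields a value close to 20, consistent with the middle curve chosen by eye in figure 3.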


We may also observe that, although the phenomenon is described quite

faithfully, discrepancies still exist between the estimated values and those

actually measured in the experiments (shown by the lines). These inaccuracies

may be the result of two fundamental causes: on the one hand, structural

deficiencies in the model and, on the other, measurement errors. Quite clearly,

the usefulness of the model always depends on the magnitude of such errors,

or, in other words, the difference between the values given by the model and

those actually measured.

The basic aspect of parametric models lies in the fact that explicit

mathematical equations characterize the structure of the relationship between

inputs and outputs, though there are some unspecified parameters. The latter

are chosen by means of an analysis of the examples available, i.e. they are

estimated on the basis of a sample.

Linear regression is one of the best known applications of parametric

modelling. The hypothesis on which this is based assumes a linear relationship

between inputs and outputs, with the following expression normally being used:

y = c₀ + c₁x₁ + c₂x₂ + c₃x₃
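As an illustration, the coefficients of the simplest case (a single input, y = c₀ + c₁x) can be estimated in closed form from a sample; the data below is invented for the sketch:

```python
# Invented sample: one input variable x and the observed output y
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [5.1, 7.9, 11.2, 13.8, 17.1]  # roughly y = 2 + 3x, with small errors

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form ordinary-least-squares estimates of the two parameters
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
c1 = sxy / sxx             # slope
c0 = mean_y - c1 * mean_x  # intercept
```

With these points the estimates come out near c₀ ≈ 2 and c₁ ≈ 3: the structure of the relationship is fixed in advance, and only the unspecified parameters are estimated from the examples, exactly as described above.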

Another fundamental aspect to be taken into account with parametric

modelling is that it demands a sound knowledge of the phenomenon that we

intend to model. In the first place, the fundamental aspects of the relationship

must be known or, rather, not only must we know what variables are relevant to

the process but also the way they influence it.

The geographers of the Quantitative Revolution used these methods in the

search for solutions to the practical problems confronting them, for which the

geometrical formalism of physics-inspired models could not provide an answer.

The solution lay in adopting the tools of econometrics and traditional statistics to

deal with spatial phenomena.

The problem, as already noted, is that statistical methods are not well suited
to dealing with spatial phenomena and, furthermore, most researchers did not
use any kind of spatial reference for the phenomena studied. What

tended to happen in practice was that the analysis was begun based on a map,

statistical methods were applied to the socio-economic data associated with the

geographical units and, finally, a map with the results was produced. The

problem with this methodology is that it in no way differs from what the other
social sciences do; empirical regularities concerning the influence of space
could hardly be expected to emerge, since space was not explicitly included in
the analyses.

This is the point where we enter the field of non-parametric models, which

can be defined as models that essentially depend on the use of data rather than

on specific knowledge of the domain of the problem. They are also usually

called data-driven models. This type of model has enjoyed great success,
especially in solving complex problems, as it typically draws on large data
sets containing many examples.

The essential premise of non-parametric methods is that relationships

occurring consistently in the data set will be repeated in future observations; we

can call this an inductive approach. Thus, by obtaining a large enough set of

examples and modelling their behaviour we can establish a model of arbitrary

complexity that will allow the repetition of the behaviour observed.

A significant benefit of this model type is that it does not require an in-depth

knowledge of the phenomenon being modelled, an aspect that is particularly

useful when addressing spatial problems. However, though it is theoretically

possible to solve any problem, irrespective of its complexity, certain problems

are still too complex. This is due to the data collection and processing time

limitations that exist for the building of a non-parametric model.

This is a particularly important aspect as it draws attention to the fact that

possessing the data and software is not enough, though the vendors of DM

solutions would often like us to think differently. A minimum knowledge of the

phenomenon being modelled is indispensable, particularly as regards what the

most important inputs are. We normally possess a certain amount of information

or even an intuition about what the most important variables are. So knowledge

of the phenomenon is fundamental if we are to “reduce the scope of the search”

for possible solutions.

The same aspect is also particularly interesting from the point of view of its

usefulness to spatial analysis, in that it is precisely at this moment (when we

choose the relevant variables for modelling a phenomenon) that we are

constructing the hypothesis. In the light of this highly experimental nature, we

can predict that DM methods are going to make important theoretical

contributions to GISc.

5 AN EXPERIMENTAL APPROACH TO GEOSPATIAL DATA MINING

The process of using GSDM should include the formulation of hypotheses,

as is implicitly assumed in the topic above – hypotheses on what variables

influence the phenomenon being studied and what algorithm should be applied

to confirm or refute the proposed hypothesis. This means that before applying

algorithms there should always be a time for reflection on the problem and the

way the independent variables interact with the dependent variable.

This is the context in which the concept of pre-processing (Pyle, 1999;

Bishop, 1995) appears, which is nothing more than “removing the irrelevant

information and extracting the essential characteristics in order to simplify the

problem”. Non-parametric models with pre-processing fit in here, too, and may

be seen as the half-way house between parametric and non-parametric models.

Figure 4 presents a diagram of this type of model.

[Diagram: Raw input vector → Pre-processor → Pre-processed input vector →
Non-parametric model → Output]

Figure 4 – Diagram of the non-parametric model with pre-processing

This may be seen as dividing the development of the model into two distinct

parts: the application of specific knowledge of the domain (the pre-processor)

and the aspects that are less well understood (the model). In other words, in the

first phase this specific methodology allows us to use the prior knowledge we

possess about the phenomenon. As an example, if we want to assess the

interaction between two cities, we need to know the size of the population in

each and the distance between them (the gravitational model).
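A sketch of this division of labour, with invented numbers: the pre-processor encodes the gravity-model knowledge (interaction proportional to P₁P₂/d²) as a single derived feature, and the data-driven part is left only the scale constant to estimate.

```python
# Hypothetical city pairs, with invented figures:
# (pop_a, pop_b, distance_km, observed_interaction)
pairs = [
    (100_000, 50_000, 40.0, 3_125_000.0),
    (200_000, 80_000, 100.0, 1_600_000.0),
    (60_000, 30_000, 20.0, 4_500_000.0),
]

def preprocess(pop_a, pop_b, dist):
    # Domain knowledge (gravity model): interaction ~ P1 * P2 / d^2
    return pop_a * pop_b / dist ** 2

features = [preprocess(pa, pb, d) for pa, pb, d, _ in pairs]
targets = [y for *_, y in pairs]

# The model now only has to learn the scale constant k in y = k * feature
# (least-squares estimate through the origin)
k = sum(f * y for f, y in zip(features, targets)) / sum(f * f for f in features)
```

The pre-processor carries all the prior knowledge we possess; the model is left with the far easier, less well understood part of the problem.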

In the second phase, the model is applied in order to specify the way in

which the variables relate to each other and obtain the desired output. So, on

the one hand, it is necessary to select and adapt the relevant variables and, on

the other, to specify the way in which each of them contributes to the final

result.

Deterministic and non-parametric approaches to modelling can be viewed as

two extremes of a continuum. A deterministic approach just uses the existing

knowledge of a phenomenon whereas non-parametric models use large

amounts of data and processing capacity and little knowledge. Non-parametric

models with pre-processing and parametric models can be seen as

compromises between these two extremes, as figure 5 illustrates.

[Diagram: Fixed → Parametric → Non-parametric with pre-processing →
Non-parametric, arranged along an axis running from more knowledge (left) to
more data (right)]

Figure 5 – The continuum between deterministic and non-parametric models

Whatever the case, success will always mean progressively transporting

scientific problems from the right side of the figure (non-parametric approach) to

the left (deterministic approach). In other words, success lies in the ability to use

data and computing power to produce generalizable knowledge.

This may well be the ideal but that does not mean it is attainable in all

sciences. A particularly relevant example can be seen in the development of

physics, which is often considered the paradigm of modern science. Though

initially a science essentially based on a deterministic vision of phenomena,

physics has progressively adopted probabilistic approaches. In their

explanations, deterministic models have been unable to accommodate new

facts that essentially derive from improved measuring and observation

instruments such as new space telescopes.

The mechanistic view of natural phenomena, which used to characterize

science, has been progressively put into question while the expansion of the

biological sciences has very significantly changed our perception of what

science is (Macmillan 1997). More importantly, as happened in the past with

physics, this development in biological science has provided some extremely

useful concepts for the social sciences.

Among others, the concepts of complex adaptive systems, distributed

processing and agents as against the global analysis of processes have been

adopted and exploited in the social sciences (for more information on this
topic the reader is directed to the Santa Fe Institute website,
http://www.santafe.edu/).

They originated in the observation of biological systems and have been adapted

and adopted in the study of human social systems such as the economy, cities

and others (Macmillan 1997).

There are differing views on the way that GISc can benefit from non-

parametric models. There is apparently, however, a certain unanimity in the

GISc community that at this stage neither deterministic nor parametric models

have a great deal to offer the development of spatial analysis. As we have

already stated, this type of model demands a kind of knowledge and theoretical

formulation about the phenomena under study that at the moment GISc purely

and simply does not possess.

Accordingly, the discussion should be centred on the use of non-parametric

models to solve the practical problems facing GISc at present and, more

importantly, to develop a robust theoretical framework to serve as a basis for

developing a new GISc. More specifically, this discussion involves two

perspectives that at first sight appear quite similar but in fact conceal a

divergence that is not only profound but also carries important practical

implications.

There are those who defend the position that DM can be used within GISc

as a “black box”: we limit ourselves to gathering all the input variables available

and leave the tools to process all the information and present us with the

solution – without access to the specifications that would let us understand how

the model works. But there is another camp with a different position: it defends

the use of DM but assigns an important role to the pre-processing stage, at

which it reduces and perhaps transforms the input space. This allows greater

control of the specification of the problem, which will lead to a clearer

understanding of it.

Data pre-processing is essentially the stage where we concern ourselves

with the selection and transformation of the input variables (independent

variables) to be used in the modelling. Figure 6 provides a diagram representing

the two views of the use of DM in GISc.

[Diagram, top: Raw input vector → Pre-processor → Pre-processed input vector →
Non-parametric model → Output. Bottom: Raw input vector → Big Black Box of
Magic → Output]

Figure 6 – Two approaches to the use of non-parametric models in GISc

As you will have realized, this is where the fundamental difference lies

between non-parametric models with pre-processing and non-parametric

models. This difference reflects two completely different philosophies. When

using non-parametric models we accept all the lack of knowledge about the

problem and hope that the technology will present the best solution, by the

“brute force” of the search in the solutions space. When using non-parametric

models with pre-processing, we are effectively formulating a hypothesis about

the variables that influence the phenomenon under study.

At the theoretical development level, the consequences of the two

approaches are completely different. If we adopt the first (figure 7), we cannot

expect to learn very much about the phenomena we are studying – we can only

hope to respond to practical problems case by case. At the same time we run

the risk of arriving at spurious relationships, i.e. those found by chance that do

not correspond to any actual relationships. In fact, the probability of spurious

relationships cropping up increases in proportion to the rise in the number of

independent variables (Mitchell, 1997; Bishop 1995).
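The effect is easy to reproduce. Below is a small simulation (pure noise, seeded for repeatability) in which screening ever more irrelevant candidate variables can only inflate the best apparent correlation with the target:

```python
import random

random.seed(42)
n = 30  # number of observations

def corr(a, b):
    # Pearson correlation coefficient, computed from first principles
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

target = [random.gauss(0, 1) for _ in range(n)]
# 200 candidate inputs, all pure noise, unrelated to the target
candidates = [[random.gauss(0, 1) for _ in range(n)] for _ in range(200)]

best_few = max(abs(corr(c, target)) for c in candidates[:2])
best_many = max(abs(corr(c, target)) for c in candidates)
```

Because the two-variable pool is a subset of the full pool, `best_many` can never be smaller than `best_few`; in practice it is usually much larger, even though every candidate here is unrelated to the target.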

[Diagram: Raw input vector → Big Black Box of Magic → Output; not much
knowledge goes in and only a small contribution to geographic knowledge comes
out]

Figure 7 – The process of using non-parametric models. In this case there is no place for

hypotheses and the possibility of finding spurious relationships is high.

In contrast, in the second case (figure 8) theoretical advances can probably be

achieved in that formulating and testing hypotheses on the behaviour of the

phenomena being studied can contribute to a better understanding of the way

the different variables interact. This experimental approach represents a

compromise between the knowledge necessary to specify a parametric model

and the non-parametric approach. In the parametric case, we need to specify a

model that not only identifies the variables involved but also the way they relate

to each other. In the case of the non-parametric model with pre-processing we

only specify the variables to be used and let the model deal with the task of

specifying the relationships that exist between the different variables.

[Diagram: Raw input vector → Pre-processor → Pre-processed input vector →
Non-parametric model → Output; knowledge feeds the pre-processor, and a
contribution to geographic knowledge results]

Figure 8 – The process of using non-parametric models with pre-processing.

6 GEOREFERENCED DATA AND ITS UNIQUE FEATURES

Some of the aspects we shall be mentioning under this topic have already

been touched on earlier, especially in the topic “The difficulties of a Quantitative

Geography and Spatial Analysis”. However, these aspects were introduced to

give a picture of the present situation in terms of geographical theory and were

neither sufficiently explored nor presented in an organized manner. This is

precisely what we intend to do now.

It is generally agreed today that geographical data possesses special

characteristics that should guide the exploration of data on spatial phenomena.

The most important aspect to be taken into consideration is the First Law of

Geography (everything is related to everything else, but near things are more

related than distant things (Tobler, quoted by Anselin 1993)), which not only

indicates the unique features of spatial data but also provides a rich conceptual

framework for its exploration.

The first and obvious result of the First Law of Geography is that the

observations within a geographical framework are not independent, an
assumption that underlies a great many statistical methods. In fact, the
observations depend on one another and reveal high levels of interaction and

interdependence. This seriously conditions the use of traditional statistical

methods in spatial analysis. However, by using DM methods, which do not

assume hypotheses about data distribution, this characteristic can be used to

improve analysis results.

The idea consists of incorporating data that can help to improve the

neighbourhood characterization of an individual unit (an individual unit normally

corresponds to some type of geographical unit). It can be done by taking the

averages of the variables for all contiguous units or, as an option, all the units

located at a certain distance. According to the First Law of Geography, it is

highly probable that what happens “close-by” influences the behaviour of the

individual unit in question. That is the reason why this type of variable (average

values for the neighbourhood) tends to be of importance in the problem.
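A minimal sketch of this derivation, with a hypothetical contiguity list and invented attribute values: each unit receives the average of its neighbours' values as an extra input variable.

```python
# Hypothetical contiguity structure: each unit lists its adjacent units
neighbours = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}
# An attribute observed in each geographical unit (invented values)
income = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}

# Derived variable: the average of the attribute over contiguous units
neighbour_mean = {
    unit: sum(income[v] for v in adj) / len(adj)
    for unit, adj in neighbours.items()
}
```

The same construction works with any neighbourhood definition, e.g. all units within a given distance rather than only contiguous ones.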

Another very important aspect is the non-stationarity that should be

expected. This means that the process underlying a phenomenon in a particular

area is not necessarily the same in another area for the same phenomenon. It is

to be expected that the process underlying the phenomenon changes in space

– and, accordingly, global explanations may be wrong and mask local

specificities.

Another important feature of georeferenced data, which to a great extent is

an outcome of the First Law of Geography, is that uncertainty and errors are

grouped together spatially. Expressed in another way, this means that the

quality of the data and its representation varies over the study area.

Furthermore, the quality of the representation is normally similar in zones
that are close to each other, i.e. there are some zones where the
representation is good and others where it is poor.

A situation that is obviously related to data non-stationarity is the strong

probability of locally observing relationships that are not substantiated overall.

This fact explains the irrelevance of most statistics relating to overall maps. For

example, there is little use in measuring the overall spatial autocorrelation of a

map – we should concentrate on evaluating the local spatial autocorrelation

instead, measuring the relationship between a spatial unit and its neighbours.
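One common formulation of such a local measure is the local Moran statistic, sketched below on invented values: each unit's standardised value is multiplied by the average of its neighbours' standardised values, so positive scores flag local clusters of similar values.

```python
# Invented values on four units plus an adjacency (contiguity) list
values = [1.0, 2.0, 8.0, 9.0]
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

n = len(values)
mean = sum(values) / n
var = sum((v - mean) ** 2 for v in values) / n
z = [(v - mean) / var ** 0.5 for v in values]  # standardised values

# Local Moran's I: a unit's standardised value times the average of its
# neighbours' standardised values; positive scores mark local clusters
local_i = [
    z[i] * sum(z[j] for j in adjacency[i]) / len(adjacency[i])
    for i in range(n)
]
```

Here the low pair (1, 2) and the high pair (8, 9) both yield positive scores, i.e. local clusters that a single global statistic would average away.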

The most important aspect to be taken into account when we are working

with georeferenced data is that at the heart of spatial analysis lie location, area,

topology, spatial arrangement, distance and position. Any GSDM project should

conform to this principle. There are different ways of integrating this kind of

information and it is perhaps here that the need for research is the greatest and

the most pressing. At the present moment, this seems a more important issue

than the construction of a new GSDM technology.

In our opinion, current DM tools have created what GISc has long needed:

non-parametric tools that are less conditioned by hypotheses about data

distribution, that can be freely used and that are not so demanding in terms of

presuppositions about the data.

From the perspective of their application to GISc, DM tools are much more
interesting than traditional statistics, for the simple reason that they do not

assume hypotheses, a priori, about the data. However, if all the benefits of

these tools are to be reaped, we have to develop appropriate methodologies for

applying them to this specific area. Rather than constructing what we may call

“black-box type search models” we should be implementing solid GISc-

applicable methodologies that incorporate geographical references in the

modelling process.

GSDM may represent an excellent opportunity for developing a body of work

that leads, if not to greater theoretical robustness, at least to enhanced

knowledge of empirical GISc regularities. Experimentation is of particular

importance to this task. As we know, DM tools deal especially well with large
amounts of data, and the kind of knowledge we are looking for concerns the
role of space in the different phenomena.

It seems that we need to formulate hypotheses that can be tested and

devise inventive ways of including geographical references in data and

methodologies. Distance and connectivity matrices are just two examples of the

different forms we can use to contextualize spatial data.

An example would probably best help to explain this point. Cluster analysis

has long been an important tool in the geographer’s armoury. Recently it has

been widely used in what is normally called geodemographics.

Cluster analysis is particularly interesting for GISc in that, in contrast to other

statistical methods, it assumes no hypotheses about the type of data

distribution. In this sense, we can consider it quite a secure tool when used with

georeferenced data.

The problem lies in the way most geographers use it. Geographers and

other spatial scientists would be expected to introduce some kind of spatial

measure or indicator into their analysis. Instead of analyzing the proximity of

individual units in terms of a space constructed by alpha-numerical variables,

why not introduce geographical distance? This very simple observation helps us

better understand what we call the superficiality of geographical theory.
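A sketch of the idea with hypothetical units: a dissimilarity measure that mixes attribute distance with geographical distance, controlled by a weight w, which any standard clustering algorithm could then consume.

```python
import math

# Hypothetical units: (x, y) map coordinates plus one attribute value
units = {
    "A": ((0.0, 0.0), 10.0),
    "B": ((0.1, 0.0), 12.0),
    "C": ((50.0, 50.0), 10.5),
}

def combined_distance(u, v, w=0.5):
    # Dissimilarity mixing attribute distance with geographical distance;
    # w = 0 reproduces the usual a-spatial clustering practice
    (xu, yu), au = units[u]
    (xv, yv), av = units[v]
    geo = math.hypot(xu - xv, yu - yv)
    attr = abs(au - av)
    return (1 - w) * attr + w * geo
```

With w = 0 (the usual a-spatial practice) unit A pairs with C, whose attribute value is nearly identical; once geography is weighted in, A pairs with its actual neighbour B. In real use the attribute and geographic distances should first be rescaled to comparable ranges, so that w has a meaningful interpretation.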

7 BIBLIOGRAPHY

Aangeenbrug, R. T. (1991), A critique of GIS, In: Maguire D J,

Goodchild M F, Rhind D W (eds) Geographical Information Systems

Vol 1 Overview Longman Scientific & Technical, Harlow, pp 101-107

Abler, R., Adams, J. S., Gould, P. (1977), Spatial Organization, The

Geographer's View of the World, Prentice-Hall International, Inc,

London

Anselin, L. (1989), What is Special About Spatial Data? Alternative

Perspectives on Spatial Data Analysis, Technical Paper, NCGIA,

Geography Department, University of California Santa Barbara,

California

Anselin, L., (1993), Exploratory Spatial Data Analysis and

Geographic Information Systems. Proceedings of the workshop on

New Tools for Spatial Analysis, ISEGI, Lisbon

Bailey, Trevor C. (1994), A review of statistical spatial analysis in

geographical information systems, In: Fotheringham, A. S., Rogerson,

P. A. (eds) Spatial analysis and GIS Taylor and Francis Ltd. 1900

Frost Road, Suite 101, Bristol PA 19007, pp 13-44

Bird, James (1993), The changing worlds of geography, a critical

guide to concepts and methods, Second Edition, Clarendon Press,

Oxford

Bishop, C. (1995) Neural Networks for Pattern Recognition. Oxford

University Press, Oxford.

Burch, J. Grudnitski, G. (1989), Information Systems - Theory and

Practice, Fifth Edition, John Wiley & Sons

Dodson, R. NCGIA, Santa Barbara. 1992, from the map included in

the book by John Snow: "Snow on Cholera...", London: Oxford

University Press, 1936. Scale of source map is approx. 1:2000.

Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996). From data

mining to knowledge discovery in databases. AI Magazine, Fall 1996,

pp. 37-54.

Fayyad, U., (1998) “Mining Databases: Towards Algorithms for

Knowledge Discovery”, Data Eng. Bull., vol. 21, no. 1, pp. 39-48.

Gahegan, M. (2003) Is inductive machine learning just another wild

goose (or might it lay the golden egg)? International journal of

geographical information science, Vol. 17. No. 1. p. 69-92

Gatrell, A. C. (1991), Concepts of space and geographical data, In:

Maguire D J, Goodchild M F, Rhind D W (eds) Geographical

Information Systems Vol 1 Principles, Longman Scientific &

Technical, Harlow, pp 119-134

Goodchild, Michael F. (1986), Spatial Autocorrelation, Geo Books -

CATMOG 47, London

Johnston, R. J. (1986), Geografia e Geógrafos, Difel, São Paulo

Kennedy, Ruby L.; Lee, Yuchun; Van Roy, Benjamin; Reed,

Christoper D.; Lippmann, Richard P. (1998), Solving Data Mining

Problems Through Pattern Recognition, Prentice Hall, New Jersey

MacEachren, A., Bishop, J., Dykes, J., Dorling, D., Gatrell, A.,

(1994), Introduction to advances in visualizing spatial data, In:

Hearnshaw, H. M., Unwin, D. J.(eds), Visualization In Geographical

Information Systems, John Wiley & Sons Ltd., Baffins Lane,

Chichester, West Sussex PO19 1UD, England, pp 51-60

Macmillan, W., (1997). Computing and the science of Geography:

the postmodern turn and the geocomputational twist in proceedings

of the second annual conference of GeoComputation ‘97 & SIRC ‘97,

University of Otago, New Zealand

Maguire, D. J. Dangermond J, (1991) The functionality of GIS, In:

Maguire D J, Goodchild M F, Rhind D W (eds) Geographical

Information Systems Vol 1 Principles, Longman Scientific &

Technical, Harlow, pp 319-335

Miller, H. J. and Jiawei, H., 2001, Geographic data mining and

knowledge discovery: An overview. In H. J. Miller and J. Han (eds.)

Geographic Data Mining and Knowledge Discovery, Taylor and

Francis, London.

Mitchell, T. M. (1997). Machine Learning, New York, USA, McGraw

Hill.

Openshaw, S. (1991), Developing appropriate spatial analysis

methods for GIS, In: Maguire D J, Goodchild M F, Rhind D W (eds)

Geographical Information Systems Vol 1 Principles, Longman

Scientific & Technical, Harlow, pp 389-402

Openshaw, S. (1993), What is gisable spatial analysis?, Proceedings

of the workshop on New Tools for Spatial Analysis, ISEGI, Lisbon

Openshaw, S., Waugh, D., Cross, A., (1994), Some ideas about the

use of map animation as a spatial analysis tool, In: Hearnshaw, H. M.,

Unwin, D. J.(eds), Visualization in Geographical Information Systems,

John Wiley & Sons Ltd., Baffins Lane, Chichester, West Sussex

PO19 1UD, England, pp 131-138

Openshaw, S. (1999), Geographical data mining: key design issues,

GeoComputation 99, Mary Washington College in Fredericksburg,

VA, USA, 25-28 July 1999. URL:

http://www.geovista.psu.edu/sites/geocomp99/Gc99/051/gc_051.htm

Popper, K. (1959). The logic of scientific discovery, Basic Books:

New York, 479pp.

Pyle D., (1999) Data Preparation for Data Mining. Morgan Kaufmann

Publishers, San Francisco, California

Yuan, M., Buttenfield, B. Gahegan, M. and Miller, H. (2001)

Geospatial Data Mining and Knowledge Discovery. A UCGIS White

Paper on Emergent Research Themes. URL:

http://www.ucgis.org/emerging/
