Geographic Data Mining

Marc van Kreveld

Seminar for GIVE

Block 1, 2003/2004

About …


A form of geographical analysis


Current topic of interest in GIS research
(and database research and AI
research)


Finding hidden information in large
collections of geographic data

This seminar


Learning about a topic together


Presenting to each other + interaction


Added value by good examples:


for important concepts, algorithms


possibly self-devised, or extended


referring to GIS data and issues (hence the
GIS course prerequisite)


Written assignment: joint survey


Material


Book by Harvey Miller and Jiawei Han
(editors): selected chapters


Possibly: papers from conference
proceedings


Mostly provided by me

Weeks


Weeks 36-46


Probably:


Not September 4 (this Thursday)


Not in week 40 (Sept. 29 & Oct. 2)


Not October 23


The above depending on participation!

Overview of Geographic
Data Mining & Knowledge
Discovery




Chapter 1 of the book



KDD: knowledge discovery in databases


Data warehouses


Data mining


Geographic aspects of the above

Knowledge Discovery in
Databases (KDD)


Large databases contain interesting patterns: non-random properties and
relationships that are:


valid (general enough to apply to new data)


novel (non-trivial and unexpected)


useful (leads to effective action: decision
making or investigation)


ultimately understandable (simple, and
interpretable by humans)

Knowledge Discovery in
Databases (KDD)


Because of quantity of data nowadays


Because we want information, not data


Because computing power allows it


KDD as opposed to statistics


Statistics


small and clean numeric database


scientifically sampled


specific questions in mind


KDD: none of the above

KDD techniques


Statistics


Machine learning


Pattern recognition


Numeric search (?)


Scientific visualization

Data warehouse


Large repository of data


For analytical processing


(DB: transactional processing)


Heterogeneous: different sources and formats


(DB: homogeneous)


Supports OLAP tools (OnLine Analytical Processing)


OLAP Example


Measure of interest: sales


Dimensions of interest: item, store, week


(item, store, week) → money [quantity sold times price]


OLAP Example


2-dim. aggregation: (item, store, . ) → money


Another 2-dim. aggregation: sales by store and by week


1-dim. aggregation: sales by week (all items and stores)


Data cube: all 2^d possible aggregations, different types of summaries
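
The aggregations above can be sketched in a few lines of Python. This is only an illustration under assumptions: the fact table `sales`, its values, and the helper names `aggregate` and `data_cube` are made up here and do not come from the book. With d = 3 dimensions the cube holds 2^3 = 8 aggregations.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical toy fact table: (item, store, week, money).
sales = [
    ("apple", "north", 1, 10.0),
    ("apple", "south", 1, 5.0),
    ("pear", "north", 1, 4.0),
    ("apple", "north", 2, 6.0),
]
DIMENSIONS = ("item", "store", "week")


def aggregate(rows, keep):
    """Sum money over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in idx)
        totals[key] += row[-1]
    return dict(totals)


def data_cube(rows):
    """Return one aggregation per subset of the dimensions (2^d in total)."""
    cube = {}
    for k in range(len(DIMENSIONS) + 1):
        for keep in combinations(DIMENSIONS, k):
            cube[keep] = aggregate(rows, keep)
    return cube


cube = data_cube(sales)
print(len(cube))                # number of aggregations in the cube
print(cube[("item", "store")])  # (item, store, . ) -> money
print(cube[("week",)])          # sales by week, all items and stores
```

Real OLAP tools precompute or index such summaries; this sketch recomputes each one from the raw rows.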


KDD steps


Data selection: which records, variables chosen?


Data pre-processing: removing noise, duplicate records, handling missing
data, …


Data enrichment: combining the selected data with external data


Data reduction and projection: reduction in number, reducing dimension


Data mining: uncovering information, interesting patterns


Interpretation and reporting: evaluating, understanding, communicating

Presence of steps and order not fixed
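
A toy walk through the six steps, as a minimal sketch only. The records, the external area table, and all function names are hypothetical, and the "mining" step is deliberately trivial (a maximum); real KDD would use the techniques listed on the following slides.

```python
# Hypothetical raw records: (city, year, population); None marks missing data.
raw = [
    ("city_a", 2003, 200000),
    ("city_a", 2003, 200000),  # duplicate record
    ("city_b", 2003, None),    # missing value
    ("city_c", 2003, 90000),
    ("city_a", 1990, 150000),
]
# Hypothetical external data used for enrichment: city -> area in km^2.
area_km2 = {"city_a": 100.0, "city_c": 60.0}


def select(rows, year):
    """Data selection: keep only the records of one year."""
    return [r for r in rows if r[1] == year]


def preprocess(rows):
    """Pre-processing: drop duplicates and records with missing values."""
    seen, clean = set(), []
    for r in rows:
        if None in r or r in seen:
            continue
        seen.add(r)
        clean.append(r)
    return clean


def enrich(rows, areas):
    """Enrichment: join each record with the external area table."""
    return [r + (areas[r[0]],) for r in rows if r[0] in areas]


def reduce_project(rows):
    """Reduction and projection: keep the city plus one derived attribute."""
    return [(city, pop / area) for city, _year, pop, area in rows]


def mine(rows):
    """Data mining (trivially): find the record with the highest density."""
    return max(rows, key=lambda r: r[1])


def report(result):
    """Interpretation and reporting: phrase the finding for a human."""
    city, density = result
    return f"{city} is the densest city ({density:.0f} inhabitants per km^2)"


steps = reduce_project(enrich(preprocess(select(raw, 2003)), area_km2))
summary = report(mine(steps))
print(summary)
```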

Data mining


Segmentation


Dependency analysis


Deviation and outlier analysis


Trend detection


Generalization and characterization

DM - segmentation

Description:


Clustering: finding a finite set of implicit classes


Classification: mapping data items into pre-defined classes

Techniques:


Cluster analysis


Bayesian classification


Decision or classification trees


Artificial neural networks

DM - segmentation

[Figure: clustering contrasted with classification into given classes]
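
The difference between the two can be sketched as follows. This is an illustration under assumptions: it uses a plain k-means loop for clustering and a nearest-centroid rule for classification, neither of which is prescribed by the slides, and the coordinates and class names are made up.

```python
import math


def mean(points):
    """Centroid of a non-empty list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def k_means(points, centers, rounds=10):
    """Clustering: the classes are implicit and found from the data."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers


def classify(point, labelled):
    """Classification: map a point to the pre-defined class whose
    training examples have the nearest centroid."""
    names = list(labelled)
    centroids = [mean(labelled[name]) for name in names]
    return names[nearest(point, centroids)]


# Hypothetical toy coordinates.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers = k_means(points, centers=[(0, 0), (10, 10)])
print(centers)

# Hypothetical pre-defined classes with training examples.
training = {"village": [(0, 0), (1, 1)], "city": [(9, 9), (10, 10)]}
print(classify((8, 8), training))
```

Clustering receives only the points; classification also receives the given classes.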

DM - dependency analysis

Description:


Finding rules to predict the value of some attribute based on other
attributes

Techniques:


Bayesian networks


Association rules

[Figure: example tuples; ??? marks a value to be predicted]

(4, 12, 0.24)
(3, 14, 0.21)
(7, 13, 0.43)
(2, 9, 0.78)
(11, 11, 0.55)
(5, 11, ???)
(???, 12, 0.51)

DM - dependency analysis


Confidence and support measures for association rules of the form
[ if X then Y ]:


confidence = #(X and Y in DB) / #(X in DB)

support = #(X and Y in DB) / #(all in DB)
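
The two measures translate directly into code. A minimal sketch, assuming the database is a list of records and X and Y are given as predicates; the function name and the market-basket records are hypothetical.

```python
def confidence_and_support(db, x, y):
    """Return (confidence, support) of the rule [if x then y].

    `db` is a list of records; `x` and `y` are predicates on a record.
    """
    n_x = sum(1 for r in db if x(r))
    n_xy = sum(1 for r in db if x(r) and y(r))
    confidence = n_xy / n_x if n_x else 0.0   # #(X and Y) / #(X)
    support = n_xy / len(db) if db else 0.0   # #(X and Y) / #(all)
    return confidence, support


# Hypothetical market-basket records.
db = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter"},
]
conf, supp = confidence_and_support(
    db, x=lambda r: "bread" in r, y=lambda r: "butter" in r
)
print(conf, supp)
```

Here 3 records contain bread and 2 of those also contain butter, so the rule [if bread then butter] has confidence 2/3 and support 2/5.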


DM - deviation & outlier analysis

Description:


Finding data with unusual deviations (= errors, or data of particular
interest)

Techniques:


Clustering, other mining methods


Outlier analysis
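
One simple way to flag unusual deviations, given only as a sketch: compare each value with the median, using the median absolute deviation (MAD) as the yardstick. The choice of this test, the threshold of 3, and the readings are assumptions for illustration; the slides do not name a specific outlier test.

```python
from statistics import median


def mad_outliers(values, threshold=3.0):
    """Return the values whose distance from the median exceeds
    `threshold` times the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # simplification: no spread to measure deviations against
    return [v for v in values if abs(v - med) / mad > threshold]


# Hypothetical elevation readings in metres.
readings = [12.0, 13.5, 11.8, 12.4, 250.0, 12.9, 13.1]
print(mad_outliers(readings))
```

Whether the flagged value is an error or a point of particular interest is left to the interpretation step.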

DM - trend detection

Description:


Finding lines, curves, summarizing the database (often as a function over
time)

Techniques:


Regression


Sequential pattern extraction
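
Regression in its simplest form fits a straight line through the data by least squares. A minimal sketch with made-up yearly values; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the xs are not all equal
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical yearly measurements.
years = [1999, 2000, 2001, 2002, 2003]
values = [10.0, 12.0, 14.0, 16.0, 18.0]
slope, intercept = fit_line(years, values)
print(slope, intercept)
```

The slope summarizes the whole series in one number: the change per year.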

DM - generalization and characterization

Description:


Obtaining compact descriptions of the data

Techniques:


Summary rules


Attribute-oriented induction

[Figure: concept hierarchy, from low level concept to higher level concept]
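
The core idea of climbing a concept hierarchy can be sketched as below. This shows only the generalization step, not the full attribute-oriented induction procedure, and the hierarchy and land-cover values are hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy: low level concept -> higher level concept.
HIERARCHY = {
    "oak": "deciduous forest",
    "beech": "deciduous forest",
    "pine": "coniferous forest",
    "spruce": "coniferous forest",
    "deciduous forest": "forest",
    "coniferous forest": "forest",
}


def generalize(value, levels=1):
    """Climb `levels` steps up the hierarchy; stop at the top."""
    for _ in range(levels):
        value = HIERARCHY.get(value, value)
    return value


def summarize(values, levels=1):
    """Replace each value by its ancestor and count the merged tuples."""
    return Counter(generalize(v, levels) for v in values)


land_cover = ["oak", "pine", "beech", "oak", "spruce"]
print(summarize(land_cover, levels=1))
print(summarize(land_cover, levels=2))
```

Each step up the hierarchy merges more tuples, giving a more compact but coarser description.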

Visualization and
knowledge discovery


KDD is difficult to automate


steered by human intelligence


Visualization helps to understand the
data and which data mining techniques
to try

KD + geography


Special case of KDD


Other special cases


marketing


biology


astronomy


Main features: location, distance, dimensionality (with dependent
dimensions)

KD + geography

(attr1, attr2, attr3, attr4); attr’s are numbers
and (relatively) independent: statistics


(attr1, attr2, attr3, attr4); attr’s can also be on
other measurement scales: KDD


(attr1, attr2, attr3, attr4); attr’s are often
dependent and can be shapes: KD + geography


Often: (lat., long., attr1, attr2, …)

or: (shape description, attr1, attr2, …)

KD + geography


Study of scalable versions of DM tasks
(in lat. and long.)


Certain dimensions can be non-metric
(travel time need not be symmetric)
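
A small illustration of the symmetry remark, with made-up travel times: a metric requires d(p, q) = d(q, p), and a one-way street already breaks that. The table and the helper name are hypothetical.

```python
# Hypothetical travel times in minutes between three places; a one-way
# street makes the trip from "a" to "b" shorter than the way back.
travel_time = {
    ("a", "b"): 5, ("b", "a"): 12,
    ("a", "c"): 8, ("c", "a"): 8,
    ("b", "c"): 4, ("c", "b"): 4,
}


def asymmetric_pairs(times):
    """Return the pairs (p, q), p < q, whose two directions differ."""
    return sorted(
        (p, q) for (p, q) in times if p < q and times[(p, q)] != times[(q, p)]
    )


print(asymmetric_pairs(travel_time))
```

Any mining method that assumes a metric distance would need care with such data.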


DM in data that is not in the form of
tuples: sets of thematic map layers

Geographic data mining


Spatial segmentation (clustering,
classification)


Spatial dependency (spatial association
rules)


Spatial trend detection


Geographic characterization and
generalization

GDM - spatial association rules


Example: If a location is within 500 m from water and the average winter
temperature is at least 2 degrees then there are frogs around

("within 500 m from water" is a distance relationship)
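
The frog rule can be checked against data in the same way as an ordinary association rule, once the distance relationship is computed. A sketch under assumptions: the observation sites, the single water point, and the helper names are all made up, and water is reduced to points for simplicity.

```python
import math

# Hypothetical observations: (x, y) in metres, average winter temperature
# in degrees, and whether frogs were seen there.
sites = [
    ((100, 0), 3.0, True),
    ((400, 0), 2.5, True),
    ((450, 0), 2.0, False),
    ((900, 0), 4.0, False),
    ((200, 0), 1.0, False),
]
water = [(0, 0)]  # hypothetical water locations (points, for simplicity)


def near_water(location, limit=500.0):
    """Distance relationship: within `limit` metres of some water point."""
    return any(math.dist(location, w) <= limit for w in water)


def antecedent(site):
    """The 'if' part: near water and winter temperature at least 2."""
    location, winter_temp, _frogs = site
    return near_water(location) and winter_temp >= 2.0


matching = [s for s in sites if antecedent(s)]
with_frogs = [s for s in matching if s[2]]
confidence = len(with_frogs) / len(matching)
support = len(with_frogs) / len(sites)
print(confidence, support)
```

The spatial part is the predicate itself: it is computed from geometry rather than read from a stored attribute.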

GDM - spatial trend detection


Patterns of change with respect to neighborhood of some object


Example: (North America) Further from Pacific ocean → fewer earthquakes

GDM - applications


Map interpretation


Remote sensing interpretation


Environmental mapping (soil type, etc.)


Extracting spatio-temporal patterns for cyclones, crimes


Spatial interaction (movement/flow of
people, capital, goods)

Conclusions


GDM & GKD is an extension of (tool for)
geographical analysis


GDM is different from DM due to


Geographic spaces, not attribute space


Neighborhood is extremely important


Scale issues


Data is different


Applications (interesting patterns to mine
for) are different

This seminar on GDM


First: chapters from the book


CH 1: GDM & KD: an overview (today)

CH 2: Paradigms for spatial and spatio-temporal DM (11-9)

CH 3: Fundamentals of spatial DW for GKD (15-9)

CH 7: Algorithms and applications of SDM (Ronny) (18-9)

CH 8: Spatial clustering in DM (22-9)

CH 6: Modeling spatial dependencies (25-9)

(not: 29-9 and 2-10)

CH 9: Detecting outliers (6-10)

CH 10: Knowledge construction based on GVis and KDD

CH 14: Mining mobile trajectories

This seminar


All PowerPoint presentations on the
Web page of the course


Survey paper or written exam; possible
topics for survey:


Hierarchical clustering


Clustering with obstacles


Proximity relationship mining





Or: joint survey of (geometric)
algorithms for GDM

Each presentation


The chapter contents


Additional (spatial) examples
(from the Web links or self-constructed)


Detect and present algorithmic problems that appear

→ together: report on algorithmic issues in GDM



Present your chapter; don’t be afraid of
overlap with other chapters