Towards the Geo-computation of Real-Time

savagelizardAI and Robotics

Nov 25, 2013 (3 years and 9 months ago)

62 views


Towards the Geo
-
computation of Real
-
Time
Geodemographics





Muhammad Adnan


Introduction



BSc. in Information Technology from N.E.D. University, Karachi,
Pakistan


MSc. in Software Engineering from Queen Mary, University of
London


Working as Computing and Research Assistant at SPLINT, Dept.
of Geography, UCL


Part
-
time PhD. Student


Research interest in optimizing the performance of clustering
algorithms

Recent Projects at UCL




Worldnames Profiler (
www.publicprofiler.org/worldnames
)


National Trust Names (
www.nationaltrustnames.org.uk
)


Onomap Website (
www.onomap.org
)


Google Maps Shortest Path (
www.publicprofiler.org/travel
)


Google Street View (
www.publicprofiler.org/streetview
)


A bit of work for London Profiler (
www.londonprofiler.org
)

Worldnames Profiler



Online at
www.publicprofiler.org/worldnames



Extraction of data from 22 telephone directories for different
countries


Names Cleansing and standardising the format


Geo
-
coding individual households using different gazatteers



Oracle database with 1 billion names around 26 different countries


Geo
-
visualisation using Flash maps instead of slippy maps
(Google Maps, ArcGIS server etc)

Worldnames Profiler



Increased interest in creating visualisation techniques and dealing
with text mining algorithms


PhD. Focus is on the creation of bespoke real time
Geodemographic classifications

Concentration of surname “SINGLETON” in North America

Concentration of surname “SINGLETON” in the World

What will I be talking about today ?


Geodemographic Classication


Introduction


What data goes in ?


Standardising the data


Clustering the data


Naming the clusters


Real time Geodemographics


Need for real time Geodemographics ?


What are real time Geodemographics ?


Computational Challenges


Clustering Algorithms


K
-
means


PAM (Partitioning Around Mediods)


CLARA (Clustering Large Applications)


GA (Genetic Algorithm)


Comparison of Clustering Algorithm


Creating bespoke Geodemographics…demo


Parallel Processing in real time Geodemographics

What is a Geodemographic classification ?


“A segmentation system which groups similar neighbourhoods
into categories, based on

the characteristics of their residents”.
(Vickers, 2006)



Generalised classification of areas based on characteristics of
population.

What data goes in?



Census data


Demographic attributes (Age, Ethnicity, Country of birth etc.)


Household composition (Family type, Family size etc.)


Housing characteristics (Tenure, Type & Size etc.)


Socio
-
economic attributes (Education, Car ownership etc.)


Employment attributes (Economic activity, Economic class etc.)


Lifestyle Surveys


Credit cards data


Commercial companies use a mix of all these data sources


Academic research has based on census data only


Standardising the data


Z
-
Scores


Widely used variable normalisation technique


Can create outliers in the datasets



Range Standardisation


Standardise values between a range of 0
-
1


Can erase interesting patterns in the data



Principal Component Analysis


Reduces the dimensions of a data set


Focuses on the part of dataset having maximum variance


Can erase interesting patterns in the data




Clustering the data


K
-
means clustering algorithm is used to cluster data into
homogeneous groups


Multiple runs of k
-
means due to its instability


10,000 times (Singleton, 2008)


Different classification systems produce different number of
groups


MOSAIC classifies data into12 lifestyle groups


ACORN classifies data into 17 lifestyle groups



Naming the clusters


MOSAIC classifies data into 12 life style groups.

1.
High Income Families

2.
Suburban Semis

3.
Blue Collar Owners

4.
Low Rise Council

5.
Council Flats

6.
Victorian Low Status

7.
Town Houses and Flats

8.
Stylish Singles

9.
Independent Elders

10.
Mortgaged Families

11.
Country Dwellers

12.
Institutional Areas



Need for real time Geodemographics


Current classifications are created using static data sources


Rate and scale of current population change is making large
surveys (census) increasingly redundant


Significant hidden value in transactional data


Data is increasingly available in near real time


e.g. ONS (Office of National Statistics) NESS API


Application specific (bespoke) classifications have
demonstrated utility (Longley & Singleton, 2009)



What are real time Geodemographics ?

Real time
feeds of data

Online
Specification
of inputs

Clustering

Visualisation

Specification Estimation Testing

Computational challenges



Integration of large and possibly disparate databases


E.g. NHS data; Census data


Data normalisation and optimization for fast transactions


Minimizing computational time of clustering algorithms (Very
Important)!


Common protocol


XML (SOAP)


Use of non traditional data sources. (Singleton, 2008)


E.g. Flickr; Facebook


Important Challenge: Selection of clustering
algorithm



K
-
Means


PAM (Partitioning Around Medoids)


CLARA (Clustering Large Applications)


GA (Genetic Algorithm)


k
-
means



Attempts to find out cluster centroids by minimising within sum
of squares distance.


K
-
means is unstable due to its initial seeds assignment.


Sensitive to outliers in the data set.


Creating a Geodemographic classification requires running
algorithm multiple times.


10,000 times (Singleton, 2008)


Computationally expensive in a real time environment.





k
-
means variants




Hartigan’s k
-
means algorithm


Lloyd’s k
-
means algorithm


Forgy’s k
-
means algorithm


McQueen’s k
-
means algorithm





k
-
means variants




Hartigan’s k
-
means algorithm


Lloyd’s k
-
means algorithm


Forgy’s k
-
means algorithm


McQueen’s k
-
means algorithm





Clustering efficiency of k
-
means variants

K
-
means

(100 runs of k
-
means on OAC data set for k=4)






An example of bad clustering result (K
-
means)


An example of bad clustering result (K
-
means)


An example of bad clustering result (K
-
means)


Alternate Clustering Algorithms




PAM (Partitioning around medoids)


CLARA (Clustering Large Applications)


GA (Genetic Algorithm)




Alternate Clustering Algorithms…



PAM (Partitioning around medoids)


It tries to minimize the sum of dissimilarities of the data points to
their cluster centers.


Less sensitive to outliers than K
-
means.


Cannot handle larger data sets.


Produces better results than k
-
means for smaller data sets.


Alternate Clustering Algorithms…



CLARA (Clustering Large Applications)


It draws multiple samples of the dataset, applies PAM to each
sample and returns the best result.


Can handle large data sets as it operates on samples rather than on actual
data set.


Sometimes it gives bad clustering results due to its procedure of
sample selection.


Alternate Clustering Algorithms…



GA (Genetic Algorithm)


It is inspired by models of biological evolution. It produces
results through a breeding procedure.


Creates hierarchies of generations and then merge the
hierarchies in homogeneous groups having similar
characteristics.


Can be time consuming due to the creation of generation
hierarchies.



Comparing the computational efficiency of


K
-
means


Clara


GA


By using three data normalisation techniques



Z
-
Scores


Range Standardisation


Principal Component Analysis


Data normalisation techniques


Z
-
Scores


Widely used variable normalisation technique


Can create outliers in the datasets



Range Standardisation


Standardise values between a range of 0
-
1


Can erase interesting patterns in the data



Principal Component Analysis


Reduces the dimensions of a data set


Focuses on the part of dataset having maximum variance


Can erase interesting patterns in the data




Comparing computational efficiency (Z
-
scores)

PAM, and GA on the three geographic aggregations of a dataset covering London.


OA (Output Area) level results

LSOA (Lower Super Output Area) level results

Ward level results

Comparing computational efficiency (Range Standardisation)

PAM, and GA on the three geographic aggregations of a dataset covering London.


OA (Output Area) level results

LSOA (Lower Super Output Area) level results

Ward level results

Comparing computational efficiency (PCA)

PAM, and GA on the three geographic aggregations of a dataset covering London.


OA (Output Area) level results

LSOA (Lower Super Output Area) level results

Ward level results

Algorithm Stability (w.r.t. Computational time)

Running k
-
means on OA (Output Area) for 120 times on each iteration

Running CLARA on OA (Output Area) for 120 times on each iteration


Running GA on OA (Output Area) for 120 times on each iteration

0
0.5
1
1.5
2
2.5
3
3.5
4
1
5
9
13
17
21
25
29
33
37
41
45
49
53
57
61
65
69
73
77
81
85
89
93
97
Time (s)

K
-
means

0
0.5
1
1.5
2
2.5
3
3.5
4
1
5
9
13
17
21
25
29
33
37
41
45
49
53
57
61
65
69
73
77
81
85
89
93
97
Time (s)

CLARA

0
0.5
1
1.5
2
2.5
3
3.5
4
1
5
9
13
17
21
25
29
33
37
41
45
49
53
57
61
65
69
73
77
81
85
89
93
97
Time (s)

GA

K
-
means and Principal Component Analysis


PCA can be used to facilitate K
-
means clustering by reducing
dimensions.


(Ding, C., He, X., 2004)



K
-
means result for 41 “OAC variables”


K
-
means result for 26 “OAC Principal Components”


K=4 (99% similar)

K
-
means and Principal Component Analysis


PCA can be used to facilitate K
-
means clustering by reducing
dimensions.


(Ding, C., He, X., 2004)



K
-
means result for 41 “OAC variables”


K
-
means result for 26 “OAC Principal Components”


0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
1
4
7
10
13
16
19
22
25
28
31
34
37
40
43
46
49
No. of clusters
Time (s)
Kmeans
PCA_Kmeans
Intermediate results


Clara is plausible alternative to k
-
means in a real time
Geodemographic classification system.



K
-
means might be combined with PCA for enhanced computation
power.



In an online environment k
-
means is better for small data sets.



Exploration of non traditional data sources.




Bespoke Geodemographics….demo



Web based clustering.


Users interact with Java Servlets to submit the requests.


Java Servlets interact with R, which in turn does the clustering.


Java Servlets interact with R by using JRI.




Bespoke Geodemographics….demo

Bespoke Geodemographics….demo

Parallel Processing in Geodemographics



Distribution of computation on multiple computers or processors.


Effectively using the idle time of processors.


Reduces computational time.


Parallel Processing in Geodemographics



Graphics cards can be used for parallel processing


Latest graphics cards are coming with multiple GPUs (Graphics
Processing Units)


CUDA


Parallel computing architecture


Uses graphic processing units of NVIDIA graphics cards


Graphics cards can be used for running clustering algorithms in parallel


All the latest NVIDIA graphics cards are CUDA enabled


E.g. GeForce GTX 295, Tesla S1070, Quadro FX 5800 etc.



Parallel Processing in Geodemographics



Running k
-
means on OA dataset for London (10 times for each
value of k).


Result shows an increase in computation power by approx 30%
while maintaining the clustering efficiency.








Can be approx 70% if we run k
-
means for 10,000 times.


Future work



Investigation of clustering algorithms in more detail


Investigation of CUDA in more detail for running clustering
algorithms in less time.


More testing for other clustering algorithms


Visualisation of the results produced.


Merging Google Maps with the existing visualisation techniques of
WorldNames (
www.publicprofiler.org/worldnames
) and National Trust
Names (
www.nationaltrustnames.org.uk
) websites.


Testing the usability for bespoke classifications.








Future work

Thank you for listening

Any Questions?