How to download and load CSO data into PostgreSQL

needmoreneedmoreΔιαχείριση Δεδομένων

28 Νοε 2012 (πριν από 4 χρόνια και 24 μέρες)

211 εμφανίσεις

Lab 4

DT786

Tuesday
28
-
2
-
2012

Getting CSO data into
R



In this lab we will cover:

1)

Describe
data.f
rame

and
SpatialPolygonsDataFrame

objects.

2)

How to load shape files for the Dublin Electoral Division (dublin.eds).

3)

Adding an additional column to a data.frame
.

4)

Saving the changes in a shape file.

5)

Plot the contents of a shape file

6)

Downloading
education and car ownership data from CSO
.

7)

Adding the education data from the CSO

to R
data.frame
.

8)

Adding the
car ownership data from the CSO

to R
data.frame
.

9)

Location Quo
tient

10)

Printing neighbours


If you are doing this lab on your own machine use C:
\
My
-
R
-
Dir or in college use
U:
\
My
-
R
-
Dir


First we must load the required packages:

library(spdep)

library(maptools)

library(RColorBrewer)

library(classInt)

If they are not insta
lled then follow the instructions to install packages in Lab2.


1.1)
In R data frames (
data.frame
) can take several vectors of different types and
store them in the same variable. The vectors can be of all different types. For
example, a data frame may con
tain many lists, and each list might be a list of factors,
strings, or numbers. DFs are
tightly coupled collections of variables which share
many of the properties of matrices and of lists, used as the fundamental data structure
by most of R's modeling sof
tware
. There are different ways to create and manipulate
data frames. Here are some examples.

L3 <
-

LETTERS[1:3]

df1 <
-

data.frame(cbind(x=1, y=1:10), fac=sample(L3, 10, replace=TRUE))

The same with automatic column names:

df2 <
-

dat
a.frame(cbind( 1, 1
:10),
sample(L3, 10, replace=TRUE))

is.data.frame(df1)

The
data.frame

structure is
used as
part of

spatial
class
es such as
SpatialPolygonsDataFrame

and
SpatialPointsDataFrame

to hold
attributes

of spatial objects















A
SpatialPolygonsDataFrame

is

made up of a data.frame and an object of
class SpatialPolygons, which is a list of objects of class Polygons, which is made up
of a list with Polygon class objects.
You should examine

the structure of spatial
objects

using
is
,

slots
,
names
,
@
,
$
.

nrow
,
ncol
,
rownames
.




The
re

is also a
SpatialPoint
sDataFrame

and
SpatialGrid
DataFrame

which are

st
ructured as follows
:




2)

How to load shape files for the Dublin Electoral Division (dublin.eds).

The Dublin Electoral Division data was exported from Postgr
eSQL a
s a shape file. It
contains al

the geometric and
some of
the
attribute
s

(MALE1_1 and FEMALE1_1)
data for Dublin EDs.

The file should be copied from
student distrib to
C:
\
My
-
R
-
Dir
.


In college you can get to distrib from your Windows
Explorer (i.e. f
ile
browser
)

with:

The necessary data files are in:

\
\
offaly.cs.dit.ie
\
student_distrib
\
PBrowne
\
SPATIAL
-
DATABASES
-
SOFTWARE
-
DATA
\
Data



The file can be loa
ded into R with the command:

dublin.eds <
-

readShapePoly("C:
\
\
My
-
R
-
Dir
\
\
dublinEds.shp")

Note this is the same data as used on the Spatial Databases course but in R it is called
dublin.eds
.

Examine

dublin.eds

names(
dublin.eds
)

is(
dublin.eds
)

is(dublin.eds$
MALE1_1)

dublin.eds$
FE
MALE1_1

coordinates(
dublin.eds
)

getClass(
dublin.eds
)

slotNames(
dublin.eds
)

#Examine slots

slotNames(dublin.eds)

dublin.eds
@data

dublin.eds
@polygons

View(dublin.eds)



3)

Add a new colu
mn called POP to
dublin.eds

to store
the
sum

of
male
and female

population
s

for each ED.


dublin.eds@data$POP <
-

dublin.eds@data$MALE1_1 + dublin.eds@data$FEMALE1_1

Get the sum of the the populations of all Eds

sum(dublin.eds$POP,na.rm = TRUE)

There is one NA in the population.


4)

To save the
above
changes

in
a file set called
dubeds1
.


details <
-

paste("C:
\
\
My
-
R
-
Dir", "dubeds1", sep="/")


writePolyShape(dublin.eds,details)


A new file will be created in your current working directory
C:/
My
-
R
-
Dir

called
dubeds1.shp
.

This new file will have an population column
calculated in the
at was
calculated in 3) (which is the
current R session
)
.


5)

Plot the contents of

dubeds1
.shp


Load the file that you saved from step 4. It should contain the data on Dublin ED and
a total populatio
n:

dublin.eds <
-

read
ShapePoly("C:
\
\
My
-
R
-
D
ir
\
\
dube
ds
1
.shp")


Load a colour library

library(RColorBrewer)

Set some colours
using

8 intervals


pop8 <
-

brewer.pal(8,'Set2')

Use ranges for thematic colouring

spplot(dublin.eds, "POP", col.regions=pop8, at=c(500,1000,2000,3000,4000,5000,6000,7000,800
0),main='Dublin
Population')

You should get a map as below:



To label each electorial division use the following command.

l1 = list("sp.text", coordinates(dublin.eds), as.character(dublin.eds$SAPS_LABEL))
,col="red",font=2
,cex=0.5
)

spplot(dublin.eds, "PO
P", col.regions=pop8, sp.layout=list(l1),
at=c(500,1000,2000,3000,4000,5000,6000,7000,8000),main='Dublin Population')


6)

Downloading education and car ownership data from CSO.


The CSO census for 2006 consists of over 70 themes. Each of these themes contains

several components.
We will use three steps to get CSO data into
R.

1)

Download a particular topic for a given area. We will download the
education
and car ownership data

for the Dublin Electoral Divisions.

2)

Load the
new
topic into a

data.frame

specifically d
esigned for that topic.

3)

Move the
new topic

to the Electoral Division (
dublin.eds
) table that contains
all topics and the geometry of each ED
.




Download
Education information


Go

to the CSO
Cens
us 2006
r
eports

page

for loading into PostgreSQL
:

http://census.cso.ie/census/ReportFolders/ReportFolders.aspx


Follow the following screen shots:














Scroll to
Dublin City







Select Dublin City and download as a CSV file
.




Set the dimension order.



Save the file as
DublinEducation
.csv

in

C:
\
My
-
R
-
Dir





7)Adding the education data from the CSO to R
data.frame
.



Edit
DublinEducation
.csv

in TextPad or Notepad and delete the header
:

"Theme 10
-

4 : Persons aged 15 an
d over by sex, principal economic status and highest level of education completed, 2002"



Each generic theme has one or more components. It is the theme components

that
actually get downloaded. We are interests only in
the educational
p
art
s

of Theme 1
0
-
4
.

Note the data from he CSO
does not contain geometry. Selected
columns

from the
downloaded
table
will later

be loa
ded into the big ED frame

dublin.
eds
.

You
must have
dublin.
eds
loaded into R
.

Now in R read in the file:

dubeduc
=read.csv(file="DublinEducati
on
.csv"
,header=TRUE
)


> names(
dubeduc
)

>
names(dublin.eds)


names(dubeduc)

names(dublin.eds)

"Geographic.Area"

"SAPS_LABEL"

"No.formal.education"

"FORMAL_EDU"

"Primary.education"

"PRIMARY_ED"

"Lower.secondary.education"

"LOWER_SECO"

"Upper.secondary"

"UPPER_SECO"

"Technical.or.vocational.qualification"

"TECHNICAL1"

"Upper.secondary.and.technical.or.vocational"

"UPPER_S_01"

"Non.degree"

"NON_DEGREE"

"Primary.degree"

"PRIMARY_DE"

“Pr
潦敳獩潮慬⹱畡汩晩捡瑩潮⸮摥杲敥⹳瑡瑵献o

≐剏䙅卓䥏丢

≂潴栮摥杲敥⹡湤⹰牯晥獳楯湡氮煵慬楦楣慴楯渢

≂佔䡟䑅䝒䔢

≐潳琮杲慤畡瑥⹣敲瑩晩捡瑥⹯爮摩灬潭愢†††"

≐体呇剁䑕䄢

≐潳琮杲慤畡瑥⹤敧牥攮⹭慳瑥牳⸢"

≐体呇剁弰ㄢ

≄潣瑯牡瑥⸮偨䐮∠†††††††"
††††††

≄佃呏剁呅ㄢ

≎潴⹳瑡瑥搢"


≔潴慬"




There are other names in

dublin_eds

that are not of interest at the moment.

Add the
CSO data to
dublin.eds

as follows


dublin.eds@data$
FORMAL_EDU <
-

dubeduc
$"No.formal.education"

dublin.eds@data$
PRIMAR
Y_ED <
-

dubeduc$"Primary.education"

dublin.
eds@data$PROFESSION <
-

dubeduc$
"
Professional.qualification..degree.status."



Check that the has been correctly updates


sideBySide <
-
paste(dublin.eds@data$SAPS_LABEL,

dublin.eds@data$PROFESSION, dubeduc$"Geogra
phic.Area",

dubeduc$"Professional.qualification..degree.status.")


writeLines(sideBySide)


You do not have to check the rest of the data.

Add the rest of the
education
data in a similar fashion.


You can save the newley entered data as follows:

details <
-

paste("C:
\
\
My
-
R
-
Dir", "dubeds2", sep="/")

writePolyShape(dublin.eds,details)


Plot primary degress. Are the intervals appropriate?

spplot(dublin.eds, "PRIMARY_DE", col.regions=pop8,
at=c(50,100,200,300,400,500,600,700,800),main='Dublin Primary Degree')




Print the map of Professionals

Load a colour library if necessary

library(RColorBrewer)


Set some colours and

pop8 <
-

brewer.pal(8,'Set2')

Experiement with other colours.


Get the ranges for thematic colouring

lower =
min(dublin.eds@data$PROFESSION)


u
pper =
max(dublin.eds@data$PROFESSION)

intrv

= (lower+upper)/8


Now plot the map with the above intervals
:

spplot(dublin.eds, "PROFESSION", col.regions=pop8, at=c(intrv, intrv*2, intrv*3,
intrv*4, intrv*5, intrv*6, intrv*7,
intrv*8),main='Dublin Profession
als
')


You should get a map

depicting the number of professionally qualified per ED
.

H
ow

would you make
a map displaying
the density of professionals per ED.



The following updates all the dublin.eds education fields with the CSO values.


dublin.eds@data$
"FORMAL_EDU" <
-

dubeduc$No.formal.education

dublin.eds@data$"PRIMARY_ED" <
-

dubeduc$Primary.education

dublin.eds@data$"LOWER_SECO" <
-

dubeduc$Lower.secondary.education

dublin.eds@data$"UPPER_SECO" <
-

dubeduc$Upper.secondary

dublin.eds@data$"TECHNICAL1" <
-

dubeduc$Technical.or.vocational.qualification

dublin.eds@data$"UPPER_S_01" <
-

dubeduc$Upper.secondary.and.technical.or.vocational

dublin.eds@data$"NON_DEGREE" <
-

dubeduc$Non.degree

dublin.eds@data$"PRIMARY_DE" <
-

dubeduc$Primary.degree

dublin.eds@data$"PRO
FESSION" <
-

dubeduc$"Professional.qualification..degree.status."

dublin.eds@data$"BOTH_DEGRE" <
-

dubeduc$"Both.degree.and.professional.qualification"

dublin.eds@data$"POSTGRADUA" <
-

dubeduc$"Post.graduate.certificate.or.diploma"

dublin.eds@data$"POSTGRA_0
1" <
-

dubeduc$"Post.graduate.degree..masters."

dublin.eds@data$"DOCTORATE1" <
-

dubeduc$"Doctorate..PhD."

dublin.eds@data$"NOT_STATED" <
-

dubeduc$"Not.stated"


It is a good idea to save your data af
t
er c
hanges
.

It is a good idea to increment file version
as follows:

writePolyShape(
"
dublin.eds,
C:
\
\
My
-
R
-
Dir
\
\
", "dubeds3
.shp
")




8)Adding the car ownership data from the CSO to R
data.frame
.


Add

the
car
data

called:

"Theme 15
-

1 : Number of households with cars, 2006"

in a similar fashion.

















9)

Location Quotient
.

##Writing a function for spatial data: The Location Quotient.

## Download shape file from

## http://www.r
-
bloggers.com/writing
-
a
-
spatial
-
function
-
the
-
location
-
quotient/

## http://spatialanalysis.co.uk/r/



## From http://faculty.washing
ton.edu/krumme/gloss/l.html#lq

## Location Quotient is a measure of the relative significance of a
phenomenon (e.g. employment in software activities) in a region

## (e.g. Dublin) compared with its significance in a larger
("benchmark") region (the countr
y as a whole)

## A high location quotient for a specific activity implies
specialization and the export of the goods or services produced by
the activity.

## The viability of the "1" (i.e. the hypothetical location quotient
of 1 representing selfsufficien
cy)

## as a dividing line between regional specialization (exports) and
regional deficiency (imports)

## depends of the "realism" of the assumptions made:

## homogeneous production and consumption patterns, no cross
-
hauling,
no national exports or impor
ts, etc.




##In some cases it is necessary to conduct the same analysis multiple
times

##on either the same or different data. In such circumstances it is
worth

##writing a function to simplify the code. In this example the
location quotient

##provides

a simple calculation easily written in to a function.


##Background: The location quotient (LQ) is an index for comparing a
region's

##share of a particular activity with the share of that same activity
found at

##a more aggregate spatial level. In thi
s example we take a shapefile
of London

##Boroughs that contains information on the population of each
borough and the

##percentage of sports participation in each borough. In this case
there is

##little point in calculating the LQ as the percentage alon
e would be
more meaningful.

##The focus here is how to undertake the methods, not their
appropriate use, or the

##validity of the results.






## Data Requirements:

## London Sport Participation Shapefile.

## Install the following packages (if you have
n't already done so):

## maptools, RcolourBrewer.

## Code (Comments are preceded by ##)


## Load the packages

library(maptools)

library(RColorBrewer)


## Set the working directory (the folder containing your shapefile).


setwd("C:
\
\
My
-
R
-
Dir")


## load the
shapefile


sport<
-

readShapePoly("london_sport.shp")


##have a look at the attribute table headings


names(sport)


## Extract the attribute data of interest (in this case its the %
participation and borough population.


partic_per<
-
sport$Partic_Per


bor_p
op<
-

sport$Pop_2001


## with these two columns the LQ can be calculated for each borough.

##This is a useful calculation and we may wish to apply it over a
number of

##datasets to see temporal change. We can therefore write a function.


"LQ"<
-

function(
partic_per, bor_pop) {



sdp<
-

partic_per/bor_pop


dp<
-

sum(partic_per)/ sum(bor_pop)


sdp/dp

}


## apply the function above: to sports paticipents and borough
populations.

bor_LQ<
-

data.frame(sport$ons_label, LQ(partic_per, bor_pop))

colnames(bor_LQ)<
-

c
("ons_label", "bor_LQ")


## Examine bor_LQ$bor_LQ

bor_LQ$bor_LQ


## Using the resulting LQ values we can plot a map.


## Specify the colour breaks in the data (how many colours and what
values the colour change at)


brks<
-
c(0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6)


## Use the RColorBrewer package to define a colour palette.

colours <
-

brewer.pal(6, "Blues")


## Plot map, using 6 intervals 0.4
-
0.6, 0.6
-
0.8, 0.8
-
1, 1
-
1.2, 1.2
-
1.4, 1.4
-
1.6

## uses findInterval(x,vec) which finds the indices of x in vec,
where vec mus
t be sorted (non
-
decreasingly);

## all.inside logical; if true, the returned indices are coerced into
1,...,N
-
1, i.e., 0 is mapped to 1 and N to N
-
1.




brks<
-
c(0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6)

colours <
-

brewer.pal(6, "Blues")

plot(sport, col=colours[fin
dInterval(bor_LQ$bor_LQ, brks,


all.inside=TRUE)], axes=F)

box()



## Add map annotations etc...


SpatialPolygonsRescale(layout.north.arrow(1), offset=
c(505100,164800), scale = 6000, plot.grid=F)

title(paste ("LQ of London Sports Participation"))


legend(
x=503800, y=164800, legend=leglabs(brks), fill=colours,
bty="n")


##Done!

##License: cc
-
by
-
nc
-
sa. Contact: james@spatialanalysis.co.uk



10)Printing neighbours

## Find eire.shp on your
own machine.

## It may be in “Program Files” or “
Program files (x86)


or


C:
\
Documents and Settings



eireMap <
-

readShapePoly("C:
\
\
Program Files
\
\
R
\
\
R
-
2.14.1
\
\
library
\
\
spdep
\
\
etc
\
\
shapes
\
\
eire.shp"[1],ID="names",
proj4string=CRS("+proj=utm +zone=30 +units=km"))

plot(eireMap)

names(eireMap)

eireMap $names


# Get

the neighbours of each county.

eire.nb <
-

poly2nb(eireMap)

# Examine contiguity

summary(eire.nb)

plot(eire.nb, coordinates(eire
Map
), add=TRUE)

# Draw Eire with county names

plot(eireMap)

text(coordinates(eireMap), labels=as.character(eireMap$names),
cex=
0.4)

# You can check what a function does by using help.

# e.g. help(invisible)

or ?
invisible