Intermediate Postgresql and SQL - Michigan State University

hornbeastcalmΔιαχείριση Δεδομένων

27 Νοε 2012 (πριν από 4 χρόνια και 10 μήνες)

190 εμφανίσεις

Page |
1



GEO 425: SPRING 2012

LAB 10: Intermediate Postgresql and SQL


Objectives
: In

the

introductory

SQL

lab

you

worked

with

basic

queries

on

two

tables.

This

lab

builds

on

that

one

by

providing

more

advanced

queries.

You

will

modify

the

returned

query

results,

calculat
e

sums

and

do some
other

more

complex

operations.

You'll

create

new

fields

and

new

tables.

You

will

also

experiment

with

joins

to
enable

analysis

across

both

tables.

Fi
nally, you will map query outputs. This

exercise

covers:


1.

Obtaining

more

complex

query

output


2. Implementing subqueries


3. Creating new fields (columns)


4. Creating new tables


5. Table joins


6. Manually exporting and mapping quer
y outputs in Quantum GIS


Introduction


About Subqueries

Subqueries are simply nested queries which help you set conditions which are themselves queries. For
example, to find all counties in the census table with greater than the average number of
households
, run
:
SELECT name, househ FROM census WHERE househ < (SELECT AVG(househ) FROM census);


About Joins

Joins are tools to combine multiple tables in an SQL command. Typically you do not physically create a new
table; you just link them for the purp
oses of your query. To implement a join, you need a common identifying
key column in both tables. They do not have to have the same name, but they have to associate common
records. There are several kinds of joins; this lab uses inner joins.


Instructions


Step 1: Startup

1.
Create a new directory under your labs folder called “lab
10
”. You don't need any new data, but it's a good
place to put your results for the report.


2.
Open a terminal window on your desktop. Start the postgresql

interpreter using the

psql


command from
last time. Use

\
d


to remind yourself about the tables in your database and their columns.


Step 2: Modifying query output

1. There are a variety of ways to summarize query results. For example, instead of return
ing the votes for each
county voting for Obama, you can add them together and return a single value with the SUM operator:

SELECT SUM(obama) FROM election;


2. What about printing all the totals?

Remember from last week that you can save time by using the up arrow
and editing previous commands.

SELECT SUM(obama), SUM(mccain), SUM(other) FROM election;



The column headings are not very intuitive. Try renaming the columns:

SELECT SUM(obama) AS Obam
a, SUM(mccain) AS McCain, SUM(other) AS Other FROM election;


Note that the capitalization didn't make any difference
-

the columns are lower case.

Page |
2




3. Other operators work too. Determine Obama's winning margin with this command. Write it down:

SELECT SUM(
obama)
-

SUM(mccain) AS margin FROM election;


Try some things with the census data now. MAX and MIN do what you expect. What is the latitude of the
southernmost Michigan county centroid?

SELECT MIN(lat) FROM census;


4. You can print better looking numbers

by rounding. Try:

SELECT ROUND(3.14159, 3);

That rounds to the third decimal place. You don't have to specify the decimal place; if you don't, the value is
rounded to the nearest integer. Write this answer down:

SELECT ROUND(MIN(lat)) FROM census;


Step 3
: Subqueries

1. The last example returned a latitude, but what is the name of the county with that latitude? Try this:

SELECT name, MIN(lat) FROM census;


That didn't work, because name is 83 rows long and MIN(lat) is one row long
-

not a valid relation. I
nstead you
need a subquery. Read the next command carefully, then input it:

SELECT name,lat FROM census

WHERE lat = (SELECT MIN(lat) FROM census);


Note that there are
two

select queries here. The second, following the = operator, is identical to the original
query. The first selects the desired fields from the table where the second query is met. This means that
the
result of a SELECT is a temporary table
; you can do addit
ional queries and other operators on that temporary
table!


Use a subquery when you want a selection condition that is the result of a query.


2. Here is one more example. Calculate all the
Michigan households

in counties
where

the land area (in miles) is
less than the average water area (in miles). These are generally coastal counties:

SELECT name, househ FROM census

WHERE land_area_mi < (SELECT AVG(wat_area_mi) FROM census);


Can you revise that query to return just the total number of households in those

counties? Write the query and
result here:


Step 4: Creating new fields

1. The election table has columns with the raw vote totals, but it might be good to store the percent of the vote
each candidate received in new columns. To do this, start by describi
ng the table:
\
d election


2. Adding a field is a two step process. First you define and create a blank column, then you fill it. Enter the
following SQL statement:

ALTER TABLE census ADD COLUMN obama_per real;


That

should be mostly
self
-
explanatory.
'real' means the data type is a real number. Do
\
d census to see if it is
there.


3. Oops, that wasn't the right table. obama_per should really go in election. Get rid of the column:

Page |
3



ALTER TABLE census DROP COLUMN obama_per;


4. And now add the column to t
he election table. Remember, you can hit the up arrow at the prompt a couple
of times to retrieve a query that is almost like the one you want to input. Then you can modify it and execute
without typing so much. Write your command here: ___________________
________________


5. Do you have the new field? Let's populate it with a formula using the UPDATE command:

UPDATE election SET obama_per = ((100.0 * obama) / (obama + mccain + other));


6. Finally, view your shiny new table:

SELECT * FROM election;


Step
5: Creating new tables

1. It would be nice to have a table that listed just the
Upper Peninsula

counties, since they may be quite
different demographically than the
Lower Peninsula
. That can be done! This will be a very simple table with one
column: UP cou
nty names. Like the field problem, you will solve this in two steps. The first creates the new
table (empty), and the second populates it. Let's make a table:

CREATE TABLE up (county varchar(30));


The table name is up. It has one field: county. varchar(30
) is the datatype: character, with a variable width of up
to 30 characters.


2. Now let's populate it. At the prompt you can do this one record at a time. Fortunately there aren't too many
counties in the UP:

INSERT INTO up (county) VALUES('Gogebic');


Use

the up arrow to retrieve the previous SQL statement and modify it to add these counties, too:

'Ontonagon', 'Houghton', 'Keweenaw', 'Baraga', 'Iron', 'Marquette', 'Dickinson', 'Menominee', 'Alger', 'Delta',
'Schoolcraft', 'Luce', 'Chippewa', 'Mackinac'


3.

Check that you entered them right with:

SELECT * FROM up;


4. You can delete records:

DELETE FROM up WHERE county = 'Ingham';

Note that you got no
errors;

although nothing was changed (DELETE 0 was returned, since there is no 'Ingham'
in there).


5. Yes,
there are better ways to input lots of records! Check out the COPY command if you're interested.


Step 6: Joins

1. There are many types of join. Here you'll use an inner join to pull together the census table with the election
table. But what to use as a k
ey? County names seem like the obvious key, but the fields have different names.
Here's one way to deal with it:

SELECT * FROM census INNER

JOIN election

ON census.name=election.county;


Note that this is an inner join, and that the two equivalent fields
are communicated to the database with the
ON command. Further, you use <table>.<field> to specify the key column name.


Page |
4



2. Let's do a trickier join. You probably know that Michigan's urban areas voted heavily for Obama. But how
heavily? Well, we can come
up with a definition of urban counties
-

let's say any county with more than 75,000
people in it. What you want to calculate is the percentage of the vote that Obama won in those counties. Of
course the Obama numbers are in one table and the population num
bers are in the other table, so a join is
necessary. A multi
-
line select seems good here, because it is very long:

SELECT ROUND(100*SUM(obama)/SUM(obama + mccain + other),2)

AS percentObama FROM election

INNER JOIN census ON census.name=election.county

WHE
RE pop > 75000;


Read through that carefully and make sure you understand the logic of this query. Then run it!


3. Your last join will involve a subquery. Our definition of urban counties was kind of arbitrary. Why 75,000 and
not 80,000? Another more stat
istical definition would be to pick all counties above the statewide county mean
and total the vote in those. Calculating the mean will happen in the subquery. The join component of the
statement will look the same as in the previous example:

SELECT ROUND(
100*SUM(obama)/SUM(obama + mccain + other),2)

AS percentObama FROM election

INNER JOIN census ON census.name=election.county

WHERE pop > (SELECT AVG(pop) FROM census);


Step 7: Making maps with SQL output

Yes, this is GIS class, so let's see how to get
this data into a GIS for display purposes. There are a number of ways
to do this. Here is one. The method will be to manually copy query results into a dbf file (part of a shapefile) of
Michigan counties, and then display that in Quantum GIS. There are muc
h tighter database
-
GIS integrations
possible, too. Next week we'll get into those.


1. Generate a query that will give you a single column of numbers, one for each county in Michigan. Examples:
percent vote for Obama (or McCain, or Other); proportion of to
tal county population voting in the election,
average household size. An example of the output in the format you need (there should be 83 lines):


obvote

---------


45.3633


52.0420


43.5046


...


2. Open the Kwrite text editor

(Utilities
-
>Editors)
.
Then
, b
ack in the Konsole window, use your mouse to select
all of
the rows in this column. Middle
-
click in the text editor window to paste the data into it. Note that you
don't need to hit any special copy or paste keys
-

just press the middle button. 83 recor
ds should drop in there.
Notice that there's a space before each number. Remove this using Replace Text (hit Ctrl
-
R to bring up this
menu). "Text to find" should be a space. "Replacement text" should be nothing at all. Hit Replace, and all the
leading spac
es should disappear. Now the spreadsheet should read these as numbers instead of text strings
pretending to be numbers.




Page |
5



3. Create a new spreadsheet in Calc

(
-
>Office)
. In Kwrite, select all of your edited numbers. Then paste them
into the spreadsheet.
T
he defaults in the import window should be fine, so click OK.
If
numbers

are right
-
aligned, they are numeric! Round numbers to a single decimal point

(just like in Excel, right
-
click and “Format
Cells…”
.


4. Now off to the web to obtain a shapefile of Michigan counties. I suggest the Michigan Center for Geographic
Information. Here is a
fairly

direct link:

http://www.mcgi.state.mi.us/mgdl/?
rel=ext&action=sext



Under Political Features, grab the counties shapefile. Save it to your working directory. In a Konsole window
(make sure you are in the right directory) list the files, then type

unzip county_mi.zip
”.



5. Open the dbf file in OpenOffice Calc. These are the attribute data for each county. Relying on the fact that
your numbers in the other Calc spreadsheet are in alphabetical order, the same as the records in this dbf file,
copy and paste those numbers ove
r to the leftmost blank column in the dbf, leaving the very first row blank.
Make sure you are pasting numbers, not formulas! In the first row, type a
header in the same syntax as the
other headers. So, first type a
short field name in all
-
caps,
then a co
mma
.
After that, continue to

add the

data
type

N

, number of decimal places (4 should be fine) and where the decimal goes

1

. My line looks like this:
OBPVOTE,N,4,1. Save the dbf file.


6. Start Quantum GIS and load the county shapefile. Double click its

name in the legend to open layer
properties. Under symbology, hit 'New symbology'. Renderer should be

Graduated

, column will be your new
field name. Pick a color ramp and number of classes that you like. Hit Classify. Finally, hit Apply. Your map
should

appear, with your variable colored in! You can add labels by
going to the
“Layer” menu and selecting
“Labeling”.


7. If you find the number of color ramps dissatisfying (you should!), this is easily fixed.
Go to the “
Settings

menu and select “S
tyle manag
er
…”
. Pick Color ramp from the list at the top. Then hit the + button to add a new
ramp. You have a couple of ramp type choices here. ColorBrewer, developed by an MSU Geography alumna, is a
good option for cartographer
-
approved ramps. You can create your o
wn ramp as well by clicking Gradient. Set
up the end colors as you like, and even add color stops in between. I made a red
-

gray
-

blue ramp for
displaying election outcomes (red
-

purple
-

blue might be even better). I now have a variety of ramps, both
s
elf
-
created and selected from
ColorBrewer that

I can choose in the Symbology menu.


IMPORTANT: Make sure you log out every time you
finish in the lab
!!!


Report

For each of the following questions, write down the postgresql

command you used as well as the answer. Do
this in a text editor. Then, still in the text editor, turn it into html by putting in some tags (<html>, </html>,
<body>, </body>, and <br> for line breaks would be a good start). Save it in your lab
10

director
y and also
somewhere under your
personal website

directory as lab
10
.html. Finally, place a link to it from your index page.
Make sure you can view it with a web browser, and then email your TA with the link
to the page.


1. Identify the percentage of the t
otal vote that Obama got. Round to the nearest tenth (e.g. 52.5)

2. What is the average latitude for the Michigan counties?

3. How many people live at or north of the average latitude? Use a subquery!

4. How many people live south of the average latitude?

5. What proportion of the Michigan population lives south of that latitude?

6. Create new columns with the percent vote McCain got in each county, and with "Other's" vote.

7. What percent of the vote did McCain get in northern counties (those north of the
average latitude)? Hint:
Page |
6



this takes a join and a subquery!

8. Calculate the vote totals for each candidate in the UP.
(
Hint: This can be done either with a join or with a
subquery!
)

9. Calculate the vote totals for each candidate in the
Lower Peninsula
. Us
e a subquery, and look at the NOT
function (listed below).

10. What percent of the total population voted?
You'll need a join to do this.

Warning: the census data is from
the year 2000, so it's pretty out of date!

11. Create a nice
-
looking map. Post it to

your web page, with explanatory text about what it represents.



Hints

When you experience an error message, spend a moment trying to figure it out. Problems mostly involve typing
mistakes or being situated in the wrong location.

Web searching

SQL statements can be helpful to learn more about them.


The meanings of Postgresql prompt symbols:


“=>” means ready to start new command


-
>” means continuing a command

“(
-
“ means there was an unclosed parenthesis in the last command, which it cannot ex
ecute.


Some
helpful SQL

commands:


AVG
(<expression>)


Average value. <expr> could be a field

MAX
(<expression>)

Maximum value.

MIN
(<expression>)

Minimum value.

SUM
(<expression>)

Sum the values.

ROUND
(<expr>,<digits>)

Round to the nearest digit.

<expr1>
IN

(<expr2>)

Checks to see if expr1 is contained in expr2

NOT

<expr>

Inverse of a command, e.g. NOT IN

<output stuff>
AS

<name>

Rename an output column as <name>