Data-Mining the Web


Nov 7, 2013 (3 years and 7 months ago)


Data-Mining the Web Using Perl

Burt L. Monroe

Director, Quantitative Social Science Initiative

Department of Political Science

The Pennsylvania State University

Data-Mining the Web


Examples



Election Returns in Luxembourg


Luxembourg Official Election Results, 2004


http://qssi.psu.edu/files/luxembourg.pl



Parliamentary Speech


The Congressional Record

How’d You Do That?


There are several programming languages
with “straightforward” facilities for doing
this. Most notably,


Perl


Python


Java


I’m going to talk about Perl, because


it’s the most established


it’s the one I know


It appears that Python may be preferable,
but that’s for someone else to say.

What’s Perl?


Open source (free / flexible / extensible / a little wild and woolly, like Linux, R) programming language.


It is very very good at processing text.


note, webpages are just texts.


note, datasets (like a flat spreadsheet or Stata file) are
just texts.


Social scientists might have some use for turning one
into the other, no?


It has very useful facilities for building


Spiders


Scrapers


(and “agents”, “robots”, “crawlers”, etc.)

What’s a Spider?


A spider is a program designed to
automatically gather webpages.



If, for example, you want to automatically download all of the speeches delivered in Congress today (without manually clicking on every one, cutting and pasting, etc.), you might want to build a spider.

What’s a scraper?


A scraper (or “screen-scraper”) extracts the information you want (whatever you consider to be data) from a given webpage.



If you want to know who said
“health” and how many times, you
might want to build a scraper.

BEWARE!


Spiders (and other similar types of programs: “robots”, “crawlers”) can be put to nefarious use:


appropriating copyrighted materials


extracting email addresses for spammers


overwhelming servers to create “denial of service”


generally violating a site’s “terms of service” or
“acceptable use policy”


If you are not careful to use legal and ethical
good practices, you can


be denied access to a website altogether


get yourself or the university sued or even subjected to
criminal penalties

Perl


Open-source

Cross-platform

(Windows: I recommend “ActivePerl” from http://www.activestate.com)


There are many websites with resources:


http://www.cpan.org (Comprehensive Perl Archive Network)

http://www.perlmonks.org (PerlMonks)

http://www.perl.org

http://perl.oreilly.com (O’Reilly Publishing)


Lots of mailing lists, etc.

Books


Basics of Perl


The best books are put out by O’Reilly Publishing and
are generally known by the animal on the cover.


Learning Perl (the Llama)

or, Learning Perl on Win32 Systems (the Gecko)

Programming Perl (the Camel)

Web-mining

Perl & LWP (the Blesbok, apparently)


Spidering Hacks


These books, and some others, are or will be
available in the “QuaSSI Library” (in Pond 216).

Running Perl


For machines with approved ActivePerl
installations in Pond ...


Perl is located in c:/Perl/


For today,


we will operate entirely in the directory c:/Perl/eg/


To get there,


open Programs -> Accessories -> Command Prompt

At the prompt, type c:

Type cd Perl/eg


(In your particular installation, or on a Mac, or on something like Unix on high performance computing, these details will be different.)

The First Perl Program


Go to the QuaSSI Website for the example scripts for today’s workshop:


http://qssi.psu.edu/files/howdy.pl


Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg\

Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl


That’s a complete Perl program.


Note: that’s all a program is: a text file.

Running a Perl Program


Go back to your command prompt.


Type perl -w howdy.pl

(The -w tells perl to give you warnings about what might be wrong if the program is broken. It goes before the script name; anything after the script name is handed to the script itself.)

Modifying a program


Go back to WinEdt


Edit the text between the quotation
marks to say something new


Click File
-
> Save


Go back to the command prompt


Hit the up arrow (to get the last command, perl -w howdy.pl) and press Enter


Look at that: you’re a programmer!

Break the program


Go back to WinEdt


Delete the semicolon at the end of
the line


Save the file


Go back to the command prompt and run the program, with -w, again


What happened?

Perl at 30,000 feet


Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com


(I’m skipping even more than he
did.)

Some generalities about Perl


Statements in Perl are, or usually can be, constructed in a fairly natural English-like way.


There are many ways to do any one thing.


The syntax can be offputting and hard to
read, especially at first. It is easy to
“obfuscate” Perl code and this is
sometimes done intentionally.


Main syntax rule: end every statement with ;

Data Types


Scalars


Arrays and Lists


Hashes


References


Filehandles


Objects

Scalars


Numbers


Generally decimal floating point


(Can be made integer, octal,
hexadecimal)


Strings


Can contain any character


Can be null: ""


Can be arbitrarily large

Strings


Single-quoted

characters are as shown with only two exceptions.

single-quote in a single-quoted string requires \'

backslash in a single-quoted string requires \\


Double-quoted

it will interpolate (that is, evaluate) variables and control sequences.


For example


$foo = "myfile";

$datafile = "$foo.txt";

will result in the variable $datafile holding the string "myfile.txt"


Another example


print 'Howdy\n'; will print: Howdy\n

print "Howdy\n"; will print: Howdy

(\n is a control sequence, standing for “new line”).
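The quoting rules above can be tried out directly. A minimal sketch, reusing the $foo and $datafile names from the slide; the $literal variable is added here only for contrast.

```perl
use strict;
use warnings;

my $foo      = "myfile";
my $datafile = "$foo.txt";      # double quotes interpolate $foo
my $literal  = '$foo.txt';      # single quotes keep the text as typed

print "$datafile\n";            # myfile.txt
print $literal, "\n";           # $foo.txt
print 'It\'s a backslash: \\', "\n";   # the two exceptions inside single quotes
print length('Howdy\n'), "\n";  # 7: the backslash and the n are two ordinary characters
print length("Howdy\n"), "\n";  # 6: \n is a single newline character
```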

Scalar operators


Math


*, /, % (for modulo), ** (for exponentiation),
etc.


Strings


x to repeat the thing on the left

"b" x 10 gives "bbbbbbbbbb"

. concatenates strings

("na" x 16)." Batman!" gives ...

Perl knows to convert when mixing these two types:

"3"*4 gives 12

"3".4 gives "34"
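A short sketch of the operators on this slide, run together. The variable names are chosen here for illustration only.

```perl
use strict;
use warnings;

my $row   = "b" x 10;                    # repetition
my $theme = ("na" x 16) . " Batman!";    # repetition, then concatenation
my $prod  = "3" * 4;                     # a string used as a number
my $cat   = "3" . 4;                     # a number used as a string
my $rem   = 17 % 5;                      # modulo
my $pow   = 2 ** 10;                     # exponentiation

print "$row\n$theme\n";
print "product: $prod, concatenation: $cat\n";
print "17 % 5 = $rem, 2 ** 10 = $pow\n";
```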

Comparing Scalars


Comparison        Numeric   String
Equal             ==        eq
Not equal         !=        ne
Less than         <         lt
Greater than      >         gt
Less / equal      <=        le
Greater / equal   >=        ge

8 < 25            TRUE!
"8" lt "25"       FALSE!
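The difference between the two columns shows up most often when sorting. The sketch below repeats the two comparisons from the slide and then sorts the same values both ways; the <=> and cmp operators are not on the slide but are the standard numeric and string sort comparators in Perl.

```perl
use strict;
use warnings;

my $numeric = (8 < 25)      ? "TRUE" : "FALSE";   # compared as numbers
my $string  = ("8" lt "25") ? "TRUE" : "FALSE";   # compared character by character

print "8 < 25 is $numeric\n";
print "'8' lt '25' is $string\n";

my @values  = (8, 25, 100);
my @by_num  = sort { $a <=> $b } @values;   # numeric order
my @by_text = sort { $a cmp $b } @values;   # string (dictionary) order

print "numeric: @by_num\n";
print "string:  @by_text\n";
```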

Variables


A sign, followed by a letter, followed by pretty much
whatever.


Sign determines the type:


$foo is a scalar

@foo is a list

%foo is a hash


Variables default to global (they apply in all parts of your
program). This can be problematic.


local $var gives a global variable a temporary value for the current “block” of code (and anything called from it).

my $var creates a variable that exists only within the current block, and is the more usual construction.


The very common use strict; at the beginning of code forces good practice in the use of local variables (creates more syntax errors, but prevents more whoppers that could blow everything up).
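A small sketch of what my and use strict buy you. The variable name $count is arbitrary.

```perl
use strict;
use warnings;

my $count = 1;              # visible from here to the end of the file

{
    my $count = 99;         # a separate variable, visible only inside this block
    print "inside:  $count\n";
}

print "outside: $count\n";  # the outer variable is untouched

# With "use strict" in force, a misspelled, undeclared variable such as
#     $cuont = 2;
# stops the program at compile time instead of quietly creating a new global.
```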

Lists and Arrays


A list is an ordered set of (usually)
scalars.


An array is a variable holding a list.


my @foo = (1,2,3);

my @bar = ("elephant", 3.14);

Can be constructed as lists of scalar variables:

my @data = ($name, $address, $SSN);

Using Arrays


Elements are indexed, from 0.


my @animals = ("frog", "bear", "elephant");

print $animals[2]; # prints elephant


Note: element is a scalar, so $ rather than @


Subsections are “slices”.


my @mammals = @animals[1,2];



Lots of functions for


using as a stack (moving things on and off the right or left side
of the array).


sorting


joining two arrays


splitting a scalar string into an array


my $sentence = "This is my sentence.";

my @words = split(" ", $sentence);

# now @words contains ("This", "is", "my", "sentence.")
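The array operations listed on this slide, collected into one runnable sketch. push, shift, sort, split and join are Perl built-ins; the extra animal name is made up.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");
my @mammals = @animals[1, 2];          # slice: ("bear", "elephant")

push @animals, "newt";                 # add to the right-hand end
my $first = shift @animals;            # remove from the left-hand end

my @sorted = sort @animals;            # default sort is by string
my @all    = (@mammals, @sorted);      # joining two arrays into one list

my $sentence = "This is my sentence.";
my @words    = split(" ", $sentence);  # split on whitespace
my $rejoined = join("-", @words);

print "mammals: @mammals\n";
print "first:   $first\n";
print "sorted:  @sorted\n";
print "all:     @all\n";
print "words:   ", scalar(@words), "\n";
print "joined:  $rejoined\n";
```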

Programming Controls


Control structures


if / elsif / else (there is no “then” keyword)


while


do {} while


do {} until


for ()


foreach() # loops over a list


Errors / warnings


die “message” kills program and prints
“message”.


warn “message” prints message and keeps
going.
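A sketch that exercises several of these controls at once. The word counts in @speeches are invented for illustration.

```perl
use strict;
use warnings;

my @speeches = (120, 0, 45);     # hypothetical word counts

foreach my $n (@speeches) {
    if ($n == 0) {
        warn "empty speech, skipping\n";   # prints to STDERR and keeps going
        next;
    }
    elsif ($n > 100) {
        print "$n words: long\n";
    }
    else {
        print "$n words: short\n";
    }
}

my $total = 0;
for (my $j = 0; $j < @speeches; $j++) {    # C-style loop over the indexes
    $total += $speeches[$j];
}

my $i = 0;
while ($i < 3) {                           # repeats until the condition fails
    $i++;
}

die "no words at all\n" if $total == 0;    # would stop the program here
print "total: $total after $i passes\n";
```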

Hashes


“Associative arrays”


A set of values (any scalar), indexed by keys (strings)


Example


my %info;

$info{ "name" } = "Burt Monroe";

$info{ "age" } = 39;


With hashes and arrays you can create almost
any arbitrary data structure (even arrays of
arrays, arrays of hashes, hashes of arrays, etc.)
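As a sketch of how these pieces combine, here is the %info hash from the slide followed by a hash of arrays. The party names and vote counts are made up for illustration.

```perl
use strict;
use warnings;

my %info;
$info{"name"} = "Burt Monroe";
$info{"age"}  = 39;

foreach my $key (sort keys %info) {
    print "$key: $info{$key}\n";
}

# A hash of arrays: each key holds a reference to an array.
my %votes = (
    "Party A" => [1200, 950],
    "Party B" => [800, 1100],
);
push @{ $votes{"Party A"} }, 400;      # add one more count to Party A's array

my %sum;
foreach my $party (sort keys %votes) {
    $sum{$party} += $_ for @{ $votes{$party} };
    print "$party: $sum{$party}\n";
}
```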

File Handling


open() function opens a file for
processing.


Prefix the filename to define how

"<" for input from existing file (read)

">" to create for output (write)

">>" to append to a file (that may not yet exist)

open (IN, "<myfile.txt") or die "Can't open myfile.txt";


Can then use <> to refer to the file. The
above would be <IN>.
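A sketch of all three modes in sequence, using the same style of open as the slide. The filename is hypothetical, and the file is deleted at the end so the example leaves nothing behind.

```perl
use strict;
use warnings;

my $file = "myfile_example.txt";    # hypothetical filename

open(OUT, ">$file") or die "Can't create $file: $!";
print OUT "first line\n";
close(OUT);

open(OUT, ">>$file") or die "Can't append to $file: $!";
print OUT "second line\n";
close(OUT);

open(IN, "<$file") or die "Can't open $file: $!";
my $lines = 0;
while (my $line = <IN>) {    # <IN> returns one line per pass
    chomp $line;             # drop the trailing newline
    $lines++;
    print "$lines: $line\n";
}
close(IN);

unlink $file;                # clean up the example file
```

Newer Perl code often writes the same thing as open(my $in, "<", $file), which keeps the mode separate from the filename; the behaviour shown here is the same.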

Matching string patterns using regular expressions


This is where much of the power of Perl lies.


m/pattern/ will check the default variable ($_) for pattern.

$var =~ m/pattern/; will check $var for pattern.


If the pattern is in $var, then $var =~ m/pattern/ is TRUE.

If you “group” part of the pattern and it is present, $var =~ m/(pattern)/ is true, AND now a variable named $1 contains the text that the group matched.

Group more pieces of the pattern and the matches are stored in $2, $3, etc.


This only grabs the *first* match. To grab all, say


my @matches = ($var =~ m/(pattern)/g);


This will store every match in the array @matches.
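A sketch of each form of match described above, applied to an invented sentence.

```perl
use strict;
use warnings;

my $var = "Sen. Smith: health care, public health, and health again.";

if ($var =~ m/health/) {
    print "mentions health\n";
}

my ($speaker, $word) = ("", "");
if ($var =~ m/Sen\. (\w+): (\w+)/) {
    ($speaker, $word) = ($1, $2);            # $1 and $2 hold the two groups
    print "speaker: $speaker, first word: $word\n";
}

my @matches = ($var =~ m/(health)/g);        # every match, not just the first
print "count: ", scalar(@matches), "\n";

# With no "=~", the match is applied to $_.
foreach ("health", "wealth") {
    print "$_ starts with h\n" if m/^h/;
}
```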

What’s a “regular expression”?


Combination of

any literal character, number, etc.
.          any single character
*          zero or more of the previous
+          one or more of the previous
?          zero or one of the previous
[aeiou]    character class (this is the vowels)
^          beginning of the line
$          end of the line
\b         word boundary
\d \D      digit / non-digit
\s \S      space / non-space
\w \W      word character / non-word character
|          or (match this or that)
()         grouping


See handout for more.

Examples


Romeo|Juliet                                 “Romeo” or “Juliet”
\d\d\d-\d\d\d\d                              a phone number
(\d\d\d-)?\d\d\d-\d\d\d\d                    phone #, maybe w/ area
\b[aeiou]\w+                                 a word starting w/ a vowel
\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b    email add.
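The patterns above can be tested against a line of made-up text. Three small adjustments are assumptions of this sketch rather than part of the slide: the area-code group is written (?:...) so that it groups without capturing, the @ is escaped as \@ so Perl cannot mistake it for an array, and the email match uses the i flag because the pattern lists only capital letters.

```perl
use strict;
use warnings;

my $text = 'Call 555-1234 or 814-555-9876, or write to someone@example.com.';

my @names       = ("Romeo and Juliet" =~ m/(Romeo|Juliet)/g);
my @phones      = ($text =~ m/((?:\d\d\d-)?\d\d\d-\d\d\d\d)/g);
my @vowel_words = ($text =~ m/(\b[aeiou]\w+)/g);
my @emails      = ($text =~ m/(\b[A-Z0-9._%-]+\@[A-Z0-9.-]+\.[A-Z]{2,4}\b)/gi);

print "names:  @names\n";
print "phones: @phones\n";
print "vowels: @vowel_words\n";
print "emails: @emails\n";
```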




Modules


Hundreds of modules / packages available through CPAN.


ActivePerl gives a GUI for installing
them in its “Perl Package Manager”.


A basic Perl example


Counting words.



counter1.pl
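The workshop script counter1.pl is not reproduced in these slides. As a rough idea of what a word counter can look like, here is a minimal sketch built from the pieces covered so far; the sample text is made up, and the s/// substitution operator used to strip punctuation is standard Perl that the slides have not introduced.

```perl
use strict;
use warnings;

my $text = "Health care matters. Public health matters too, and health is wealth.";

my %count;
foreach my $word (split(" ", lc $text)) {
    $word =~ s/[^a-z']//g;        # strip punctuation
    next if $word eq "";
    $count{$word}++;              # one hash entry per distinct word
}

# Most frequent first; ties broken alphabetically.
foreach my $word (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$word\t$count{$word}\n";
}
```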

Grabbing from the web


The basic idea is simply to have Perl act as an “agent”, in the way a browser like Explorer or Firefox does: requesting and interpreting webpages.



There are a few basic modules that
can do this.

LWP::Simple



lwpsimpleget.pl
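The workshop script lwpsimpleget.pl is not shown here. A minimal sketch of the same idea, assuming LWP::Simple is installed: get(), getstore() and is_success() are functions the module exports, while fetch_page, the script name in the usage message and the output filename are names invented for this example. Nothing is requested unless you pass a URL on the command line.

```perl
use strict;
use warnings;
use LWP::Simple;

# Fetch one page and return its text; dies if the request fails.
sub fetch_page {
    my ($url) = @_;
    my $page = get($url);          # get() returns undef on failure
    die "Couldn't get $url\n" unless defined $page;
    return $page;
}

if (@ARGV) {
    my $url  = $ARGV[0];
    my $page = fetch_page($url);
    print "Fetched ", length($page), " characters from $url\n";

    # getstore() saves straight to a file and returns the HTTP status code.
    my $status = getstore($url, "saved_page.html");
    print "Saved a copy to saved_page.html\n" if is_success($status);
}
else {
    print "usage: perl lwp_sketch.pl URL\n";
}
```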

LWP::UserAgent


More elaborate than LWP::Simple.


I’m going to skip that one today, but it’s covered in detail in the main books


Perl & LWP


Spidering Hacks


Pretty much all of the functionality
has been wrapped more intuitively
into ...

WWW::Mechanize



mechanizeget.pl
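The workshop script mechanizeget.pl is not shown here. A minimal sketch of fetching a page and listing its links, assuming WWW::Mechanize is installed; list_links and the script name in the usage message are invented for this example, and nothing is requested unless a URL is given on the command line.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Return the absolute address of every link on one page.
sub list_links {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die if a request fails
    $mech->get($url);
    return map { $_->url_abs->as_string } $mech->links;
}

if (@ARGV) {
    print "$_\n" for list_links($ARGV[0]);
}
else {
    print "usage: perl mechanize_sketch.pl URL\n";
}
```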

Scraping


At its base, this is just extracting
information from the page(s) you
download.



Simple example:


freshair.pl
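The workshop script freshair.pl is not shown here. To illustrate the step from page to dataset, the sketch below scrapes a made-up HTML fragment (standing in for a downloaded page) into tab-delimited rows. The commune names and figures are invented.

```perl
use strict;
use warnings;

# A made-up fragment standing in for a downloaded results page.
my $page = <<'HTML';
<tr><td>Commune A</td><td>1,204</td></tr>
<tr><td>Commune B</td><td>987</td></tr>
HTML

my @rows;
while ($page =~ m{<tr><td>([^<]+)</td><td>([\d,]+)</td></tr>}g) {
    my ($name, $votes) = ($1, $2);
    $votes =~ s/,//g;                      # "1,204" becomes 1204
    push @rows, [$name, $votes];
    print join("\t", $name, $votes), "\n"; # one tab-delimited line per row
}
```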


Your agent can interact ...


For example, what if the webpage
involves a form ...



Example


abstracts.pl



You can authenticate with username
and password, run through proxy
servers, and so on.
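The workshop script abstracts.pl is not shown here. A hedged sketch of filling in a form with WWW::Mechanize: submit_form() with form_number and fields is part of the module, but search_form, the field name "q" and the script name are placeholders, since the real names depend on the page being queried.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Load a page, fill in the first form on it, and return the resulting page.
sub search_form {
    my ($url, $query) = @_;
    my $mech = WWW::Mechanize->new(autocheck => 1);   # die if a request fails
    $mech->get($url);
    $mech->submit_form(
        form_number => 1,                  # the first form on the page
        fields      => { q => $query },    # "q" is a made-up field name
    );
    return $mech->content;
}

if (@ARGV == 2) {
    print search_form(@ARGV);
}
else {
    print "usage: perl form_sketch.pl URL QUERY\n";
}
```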

Spiders


Type 1 Requester


Requests a few items with known urls from a website.


Type 2 Requester


Requests a few items, then requests (some set of) pages to
which those items link.


Type 3 Requester


Starts at a given url, and then requests everything linked, everything linked by that, etc. at the same host server. The idea here is usually to download an entire website.


Type 4 Requester


Starts at a given url, requests everything linked anywhere, everything linked by that, etc. until it, perhaps, visits the entire web.


YOU (I am talking to YOU) in all likelihood have no business writing Type 3 or Type 4 spiders. These can easily go seriously awry, causing mayhem of many sorts. Write only spiders with known finite scope.
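One way to keep a Type 2 requester finite is to cap both the depth and the total number of requests, and to pause between requests. The sketch below is an illustration under those assumptions, not code from the workshop: bounded_crawl, its parameters and the fake pages are all invented, and the "web" is a hash so that running it requests nothing.

```perl
use strict;
use warnings;

# Follow links one level out from the start pages, never fetching more than
# $max_pages in total. $fetch is a code reference that takes a URL and
# returns the page text (or undef on failure).
sub bounded_crawl {
    my ($fetch, $max_pages, $pause, @start) = @_;
    my (%seen, @fetched);
    my @queue = map { [$_, 0] } @start;      # pairs of [url, depth]

    while (@queue and @fetched < $max_pages) {
        my ($url, $depth) = @{ shift @queue };
        next if $seen{$url}++;               # never request a page twice
        my $page = $fetch->($url);
        next unless defined $page;
        push @fetched, $url;
        sleep $pause if $pause;              # be gentle with the server
        next if $depth >= 1;                 # do not follow links of links
        while ($page =~ m/href="([^"]+)"/g) {
            push @queue, [$1, $depth + 1];
        }
    }
    return @fetched;
}

# Stand-in for the web: a hash of made-up pages.
my %fake_web = (
    "a.html" => '<a href="b.html">B</a> <a href="c.html">C</a>',
    "b.html" => '<a href="d.html">D</a>',
    "c.html" => '<a href="a.html">A</a>',
    "d.html" => 'never reached',
);
my @got = bounded_crawl(sub { $fake_web{ $_[0] } }, 10, 0, "a.html");
print "@got\n";
```

With a real fetcher in place of the hash lookup, a pause of a second or more between requests is a reasonable starting point.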

Back to the Luxembourg Miner


Commune-level election results from Luxembourg.



luxembourg.pl

More on Scraping


All of the examples scraped / parsed using
regular expressions.



More structured data like HTML is often better (or
only) addressed with more specialized tools:


HTML::TokeParser


HTML::TreeBuilder



There are modules for scraping from XML,
spreadsheets, databases, Word docs, PDFs.
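As a sketch of the more structured approach, here is HTML::TokeParser pulling links out of a made-up fragment, assuming the module is installed. get_tag() and get_trimmed_text() are methods of the module; the HTML itself is invented. Note that the nested <b> tag, which would complicate a regular expression, is handled without extra work.

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<p>See <a href="one.html">the first page</a> and '
         . '<a href="two.html">the <b>second</b> page</a>.</p>';

my $parser = HTML::TokeParser->new(\$html) or die "Can't parse";
my @links;
while (my $tag = $parser->get_tag("a")) {
    my $href = $tag->[1]{href};                    # attributes arrive as a hash reference
    my $text = $parser->get_trimmed_text("/a");    # text up to the closing </a>
    push @links, [$href, $text];
    print "$href\t$text\n";
}
```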