Data transformation using Perl
(or ‘My favourite quick and nasty machine learning hacks’)
Simon Rawles, November 2005.
Overview
• Why do we need to transform?
• What is Perl?
• The most useful bits of Perl
• Some example problems and solutions
• Perl resources and final advice
• There is a lot to Perl; this talk is a taster!
Data is never how you want it
• Input data is often not how the learner wants it
• Adding additional summary attributes can help the learner
• People want their results to be summarised before they draw conclusions.
The Swiss Army Chainsaw
• “exceedingly powerful but ugly and noisy and prone to belch noxious fumes”
  – Henry Spencer (author of a classic regex library)
  – i.e. highly versatile but inelegant
• C-like syntax, many operators/functions
• Originally intended for text processing but now many more uses
Features of Perl for preprocessing
• Regular expressions and their operators
• Data structures
  – hashes
  – arrays
  – hashes of hashes…
  – hashes of arrays…
• File handling and shell/OS integration
• Lots of built-in functions:
  – split and join, chomp, foreach, sort, globbing
• Fast and robust (enough)
• Development in Perl is usually quick
Some of the things I use Perl for
• Text-based data conversion
  – Filtering out bad records
  – Converting between data formats
  – Parsing the output of programs
• Summarising and reporting from big result sets (Perl’s original application)
• As a flexible scripting tool
Data structures (perldsc)
• The array @a is accessed as $a[$x].
  – The last index is $#a, so the last element is $a[$#a]. @a = (1,2,3,4,5);
• Hashes (keys/values) are often more useful
  – The hash %h is accessed as $h{$key}.
  – defined(), undef, keys, using foreach
• Multidimensional
  – $a[$x1][$x2], $h{$k1}{$k2}, $ha{$k}[$a]
  – Reference syntax: $a->[$x1][$x2].
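A runnable sketch of these structures (the variable names and data here are invented for illustration):

```perl
use strict;
use warnings;

# Arrays: $#a is the last *index*, so the last element is $a[$#a]
my @a = (1, 2, 3, 4, 5);
print "last index $#a, last element $a[$#a]\n";   # last index 4, last element 5

# Hashes: access by key, test for a value with defined()
my %h = (colour => 'red', size => 'large');
print "size is $h{size}\n" if defined $h{size};

# Multidimensional: chained subscripts walk nested structures,
# e.g. a hash of arrays
my %ha = (evens => [2, 4, 6]);
print "second even is $ha{evens}[1]\n";           # 4
```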
Handling a stream of data
• while (<STDIN>) {
      chomp; @f = split(/\s+/);
      $j = join(',', $f[0], $f[2]);
      print "$j\n"; }
• File operations
  – open(OUT, ">out"); print OUT; close OUT
• $_ holds the currently read line
  – e.g. $x = substr($_, 2, 1)
How to make a word frequency table for a set of text files

@files = <*.txt>; %count = ();
foreach $f ( @files ) {
    open(FILE, $f);
    while (<FILE>) {
        @words = split(/\s+/, $_);
        foreach $word ( @words ) {
            $count{$word}++;
        }
    }
    close(FILE);
}
foreach $k ( sort keys %count ) {
    print "$count{$k}\t$k\n";
}
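The report above lists words alphabetically; to see the most frequent words first, sort the keys by their counts instead (a minimal variant, using an invented %count in place of the table built above):

```perl
use strict;
use warnings;

# Stand-in for the %count table built from the files
my %count = (cat => 3, dog => 5, mouse => 1);

# Sort keys by descending count rather than alphabetically
for my $k ( sort { $count{$b} <=> $count{$a} } keys %count ) {
    print "$count{$k}\t$k\n";   # dog (5) first, mouse (1) last
}
```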
Regular expressions (perlre, perlrequick)

.    any character
^    start of a line
$    end of a line
|    alternation
()   grouping
[]   character class
*    match 0 or more
+    match 1 or more
?    match 0 or 1

\t = tab
\n = new line
\r = return
\w = word character
\W = non-word character
\s = whitespace
\S = non-whitespace
\d = digit
\D = non-digit
Three useful regexp operators (perlop)
• Does pattern occur?
  – $x =~ /PATTERN/
  – /i for case-insensitive
  – /m for multiple lines
• Transliteration lists
  – $x =~ tr/FIND/REP/
  – /c for complement
  – /d ‘replace or delete’
  – /s squash duplicates
• Search and replace
  – $x =~ s/PAT/REPL/
  – bracketed groups show up as $1, $2 on the right
  – /g for all occurrences
• Split operator
  – split /regex/, string;
• =~ applies one of these to a string
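The operators above combine naturally in a pipeline; here is a small runnable sketch (the sample string is invented):

```perl
use strict;
use warnings;

my $x = "The cat   sat on   the mat";

# m// with /i: case-insensitive match test
print "mentions a cat\n" if $x =~ /CAT/i;

# s/// with /g: replace every run of whitespace with a single space
$x =~ s/\s+/ /g;

# tr///: transliterate lowercase letters to uppercase
$x =~ tr/a-z/A-Z/;

# split: break the cleaned string on spaces
my @words = split / /, $x;
print "$x (", scalar @words, " words)\n";   # THE CAT SAT ON THE MAT (6 words)
```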
Examples of regular expression code
• $x =~ /\<!DOCTYPE/m
• $x =~ s/\s+/ /g;
• $x =~ tr/a-z/A-Z/;
• $time =~ /(\d\d):(\d\d):(\d\d)/;
  $hours = $1; $minutes = $2; $seconds = $3;
Examples of regular expression code (more nerdcore examples)
• $x =~ /(\w\w\w)\s\1/;
• You can iterate over successive matches:
  $x = "cat dog house";
  while ($x =~ /(\w+)/g) {
      print "Word is $1, ends at position ", pos $x, "\n";
  }
A few of my favourite Perl functions

chomp, chop, (r)index, length, substr, rand, pop, push, shift, unshift, keys, values, glob, unlink, stat, exec, system, date, sprintf, splice, map…

man perlfunc for more!
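A few of these in action (a sketch with invented data):

```perl
use strict;
use warnings;

my @nums = (5, 3, 8, 1);
push @nums, 9;                          # add to the end (cf. pop/shift/unshift)
my @sorted = sort { $a <=> $b } @nums;  # numeric sort: 1 3 5 8 9
splice(@sorted, 3);                     # keep only the first three elements
my @labels = map { sprintf "n=%02d", $_ } @sorted;   # map + sprintf
print join(' ', @labels), "\n";         # n=01 n=03 n=05
```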
Real-world example: Running some text mining experiments
• Ten-fold cross-validation
  – Take a sample of available text files, size S
  – Shuffle the examples in the sample
  – Divide into 10 folds
• Extract the N most common words
  – Make into a table and output for learner
• Call learner and get the accuracy
• Different syntax/approaches coming up!
  – Don’t worry if you don’t understand it all
Part 1: Prepare for cross-validation

@selection = (); $thisfold = 0; $accu = 0;
@files = <*/*.txt>;
for ($i = 0; $i < S; $i++) {
    $ir = rand @files;
    push @selection, $files[$ir];
    splice(@files, $ir, 1);
}
foreach $f ( @selection ) {
    $fold{$f} = $thisfold++ % 10;
}
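The same sampling step can be written more idiomatically with List::Util’s shuffle (a sketch; $S stands for the sample size S, and the value 50 is invented):

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

my $S = 50;                                  # sample size (placeholder value)
my @files = glob "*/*.txt";
my @selection = (shuffle @files)[0 .. $S - 1];   # shuffle, then take the first S

# Assign each selected file to one of 10 folds, round-robin
my %fold;
my $thisfold = 0;
$fold{$_} = $thisfold++ % 10 for @selection;
```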
Part 2: Extract the N most common words

for ( $cf = 0; $cf < 10; $cf++ ) {
    %train = (); %test = (); %count = ();
    foreach $f ( @selection ) {
        ($cl{$f}, $name) = split(/\//, $f);
        $ref = ($fold{$f}==$cf) ? \%test : \%train;
        open(FILE, $f);
        while (<FILE>) {
            @words = split(/\s+/, $_);
            foreach $w ( @words ) {
                ${$ref}{$w}++; $count{$f}{$w}++;
            }
        }
        close(FILE);
    }
Part 3: Make the table, invoke the learner

    @wlist = sort { $train{$b} <=> $train{$a} } ( keys %train );
    splice(@wlist, N);
    open(TR, ">tr$cf"); open(TE, ">te$cf");
    foreach $f ( @selection ) {
        $handle = ($fold{$f}==$cf) ? TE : TR;
        print $handle join(',', map {$count{$f}{$_}+0} @wlist).",$cl{$f}\n";
    }
    close(TR); close(TE);
    $accu += `learner tr$cf te$cf`;
}
print "Accuracy: ".($accu/10)."\n";
Further reading
• Books
  – Learning Perl (‘the llama’) by Schwartz.
  – Programming Perl (‘the camel’) by Wall, Schwartz, Christiansen.
  – Perl Cookbook (some kind of sheep?) by Christiansen, Torkington.
• http://perldoc.perl.org/ contains great documentation (or just man perldoc)
• comp.lang.perl.misc
Final advice
• Use my to scope variables.
  – Use strict! It enforces my and will also disallow accidental “symbolic dereferencing”.
• Structure your programs with subroutines (sub) and modules.
• “There’s More Than One Way To Do It.”
• Don’t reinvent the wheel: cpan.org has modules to do just about anything
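What use strict and my buy you, in miniature (an invented example):

```perl
use strict;
use warnings;

my $total = 0;              # 'my' confines $total to this file/block
for my $n (1 .. 10) {
    $total += $n;
}
print "$total\n";           # 55
# Under 'use strict', a typo such as $totla += $n is a compile-time
# error instead of silently creating a fresh global variable.
```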
Exercises: How would you...
• … change the delimiter in a text file?
• … add a new identifier to each example?
• … remove all bad records from a file?
• … add a new attribute giving the mean of three others?
• … discretise a continuous attribute for a discrete-valued learner?
• … convert a series of text-based data files to interlinked Prolog ground facts?
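As a starting point, the first exercise might look like this (the delimiters and file names are just examples):

```perl
# delim.pl - change a file's delimiter, e.g. tab-separated to CSV
# Usage: perl delim.pl < input.tsv > output.csv
use strict;
use warnings;

while (my $line = <STDIN>) {
    chomp $line;
    my @fields = split /\t/, $line;    # old delimiter: tab
    print join(',', @fields), "\n";    # new delimiter: comma
}
```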