Using SAS and Perl for Large Datasets

March 21, 2007

The Strong Points of SAS


SAS is designed to handle large datasets.


The SAS language is robust:


PROC SORT


PROC SUMMARY


PROC SQL


PROC REG


The Strong Points of Perl


Free


Well-documented


CPAN (Comprehensive Perl Archive Network)


Efficient at file handling.


The Netflix Data


Download as a tar.gz file.


ReadMe file.


movie_titles.txt…


RMSE Perl script…


Training Set…


Movie_Titles.txt

1,2003,Dinosaur Planet

2,2004,Isle of Man TT 2004 Review

3,1997,Character

4,1994,Paula Abdul's Get Up & Dance

5,2004,The Rise and Fall of ECW

6,1997,Sick
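
Each line of movie_titles.txt holds a movie ID, a year, and a title. A minimal Perl sketch for splitting such a line is shown here; the helper name parse_title_line is hypothetical, and limiting the split to three fields is a precaution in case a title itself contains a comma.

```perl
use strict;
use warnings;

# Split one movie_titles.txt line into ID, year, and title.
# The limit of 3 keeps any commas inside the title intact.
sub parse_title_line {
    my ($line) = @_;
    chomp $line;
    my ($id, $year, $title) = split /,/, $line, 3;
    return { id => $id, year => $year, title => $title };
}

# Example using one of the lines shown above.
my $movie = parse_title_line("4,1994,Paula Abdul's Get Up & Dance\n");
print "$movie->{id} | $movie->{year} | $movie->{title}\n";
```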

Movies


Ziggy Stardust and the Spiders From Mars: The Motion Picture


Learning HTML: No Brainers


Godzilla vs. The Sea Monster


Frank Lloyd Wright


Rabbit-Proof Fence

RMSE


Root Mean Square Error


$$S^2 = \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2$$

where:


n = number of observations


$X_i$ = i-th observation out of n


$\bar{X}$ = mean of X (or, in our case, the predicted X)


$$\mathrm{RMSE} = \sqrt{S^2}$$
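
The definition above can be sketched in Perl as follows. This is an illustrative version only, not the RMSE script shipped with the Netflix download: the function name rmse and the predicted values are made up for the example, and the predicted rating takes the place of the mean for each observation.

```perl
use strict;
use warnings;

# Root mean square error between actual and predicted ratings.
# Both arguments are array references of equal, non-zero length.
sub rmse {
    my ($actual, $predicted) = @_;
    my $n = scalar @$actual;
    die "need two non-empty lists of equal length\n"
        unless $n > 0 && $n == scalar @$predicted;

    my $sum_sq = 0;
    for my $i (0 .. $n - 1) {
        my $diff = $actual->[$i] - $predicted->[$i];
        $sum_sq += $diff ** 2;
    }
    return sqrt($sum_sq / $n);    # square root of S^2
}

# Hypothetical example: four actual ratings and four made-up predictions.
my @actual    = (3, 5, 4, 4);
my @predicted = (3.5, 4.5, 4.0, 3.0);
printf "RMSE = %.4f\n", rmse(\@actual, \@predicted);    # prints "RMSE = 0.6124"
```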




The Training Set


17,770 files, one per movie

The Training Set

1:

1488844,3,2005-09-06

822109,5,2005-05-13

885013,4,2005-10-19

30878,4,2005-12-26

823519,3,2004-05-03

893988,3,2005-11-17

124105,4,2004-08-05

1248029,3,2004-04-22


Movie:

Person,Rating,Date

Person,Rating,Date

Person,Rating,Date

.

.

.


Data Marts


Data marts are subsets of a larger data set.


Contents are determined by the problem at hand.


Contents may change over time or remain static.



Why Use Data Marts


Increases query performance.


Decreases storage costs.


Decreases risk.


Proving ground for new code or equipment.


Focuses effort on problem solving instead of data management.



Perl for Data Mart Assembly

What we have…

1:

1488844,3,2005-09-06


What we want…

1,1488844,3,2005-09-06




Perl2.pl

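#!/usr/bin/perl
use strict;
use warnings;

# Print the name and contents of every .txt file in the current directory.
foreach my $file (glob '*.txt') {
    print "$file\n";
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        print $line;
    }
    close $in;
}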


Perl3.pl

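#!/usr/bin/perl
use strict;
use warnings;

# Prefix every rating line with its movie ID and write all rows to one file.
# The output name must not match *.txt, or the glob below would read it back in.
open(my $out, '>', 'output.tx') or die "Cannot open output.tx: $!";
foreach my $file (glob '*.txt') {
    print "$file\n";    # progress message
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    my $movie_id = '';
    while (my $line = <$in>) {
        if ($line =~ /^(\d+):/) {
            $movie_id = $1;    # header line such as "1:" carries the movie ID
        } else {
            print $out "$movie_id,$line";
        }
    }
    close $in;
}
close $out;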
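
Once the ratings are in the combined movie,person,rating,date layout, a data mart can be carved out by filtering on the movie ID. The sketch below is one possible way to do that; the function name select_movies and all sample rows except the first are hypothetical.

```perl
use strict;
use warnings;

# Keep only the rows whose first field (movie ID) is in the wanted set.
# Rows are strings of the form "movie,person,rating,date".
sub select_movies {
    my ($rows, $wanted_ids) = @_;
    my %wanted = map { $_ => 1 } @$wanted_ids;
    return grep {
        my ($movie_id) = split /,/, $_, 2;
        $wanted{$movie_id};
    } @$rows;
}

# Hypothetical rows in the combined layout; only the first comes from the slides.
my @rows = (
    "1,1488844,3,2005-09-06",
    "2,1000001,4,2005-01-01",
    "3,1000002,5,2005-02-02",
);
print "$_\n" for select_movies(\@rows, [1, 3]);
```

For the full training set it would likely be better to apply the same test line by line while reading the combined file, rather than loading every row into memory first.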

SAS Tips and Tricks


Getting started


OPTIONS OBS=0;


WHERE…


Avoid temporary data sets.


Make permanent SAS data sets.


Access data sets with SQL.