PERL Files and file handling

whooploafSoftware and s/w Development

Dec 13, 2013 (3 years and 6 months ago)

Perl for biologists
Getting data in/out of Perl programs
• It is inconvenient to type data directly into programs. Instead, use Perl's input/output (I/O) features to transfer the data.
• Perl defines two main routes for data transfer:
1. Standard input (STDIN) and standard output (STDOUT).
2. Named files (usually kept on the computer's hard disk).
• By default STDIN accepts input from the keyboard and STDOUT writes to the screen, but both can be changed (in UNIX and in Perl).
Using standard input
# counting bases by reading in a sequence from the keyboard
print "Enter DNA sequence:\n";
$DNA = <STDIN>;    # read in bases
chomp $DNA;        # remove newline char from input
# reset counters
$A_count = 0;
$C_count = 0;
$G_count = 0;
$T_count = 0;
for ($i = 0; $i < length($DNA); $i++) {
    $base = substr($DNA, $i, 1);
    if ($base eq 'A') {
        $A_count++;
    } elsif ($base eq 'C') {
        $C_count++;
    .....
Example run:
Enter DNA sequence:
CGTGGCACACTGTCAACGTATGCCAAAA
no. of A =9
no. of C =8
no. of G =6
no. of T =5
The <> operator reads in data from a specified source, in this case STDIN.
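The fragment above stops partway through the if/elsif chain. A minimal complete sketch follows, assuming the elided part treats G and T the same way as A and C and then prints the four totals. The subroutine name count_bases is hypothetical (it is not in the slides), and the sequence is given as a literal rather than read from STDIN so that the sketch runs unattended.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): count A, C, G and T in a string.
sub count_bases {
    my ($dna) = @_;
    my ($a_count, $c_count, $g_count, $t_count) = (0, 0, 0, 0);
    for (my $i = 0; $i < length($dna); $i++) {
        my $base = substr($dna, $i, 1);
        if    ($base eq 'A') { $a_count++; }
        elsif ($base eq 'C') { $c_count++; }
        elsif ($base eq 'G') { $g_count++; }
        elsif ($base eq 'T') { $t_count++; }
    }
    return ($a_count, $c_count, $g_count, $t_count);
}

# The sequence from the example run, as a literal instead of <STDIN>.
my ($na, $nc, $ng, $nt) = count_bases('CGTGGCACACTGTCAACGTATGCCAAAA');
print "no. of A =$na\n";
print "no. of C =$nc\n";
print "no. of G =$ng\n";
print "no. of T =$nt\n";
```

Characters other than A, C, G and T (for example lower-case letters) are silently skipped in this sketch.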
Using standard input
The previous example reads in only one line of text. What if we want to read in many? Use a loop:
# counting bases by reading in sequences from the keyboard
print "Enter DNA sequence:\n";
while ($DNA = <STDIN>) {
    chomp($DNA);
    # reset counters
    $A_count = 0;
    $C_count = 0;
    $G_count = 0;
    $T_count = 0;
    for ($i = 0; $i < length($DNA); $i++) {
        $base = substr($DNA, $i, 1);
        .....
Example run:
Enter DNA sequence:
CGTGGCACACTGTCAACGTATGCCAAAA
no. of A =9
no. of C =8
no. of G =6
no. of T =5
CCGTGAACGCTACAAACA
no. of A =9
no. of C =10
no. of G =5
no. of T =5
Using standard input
# counting bases by reading in sequences from the keyboard
print "Enter DNA sequence:\n";
while ($DNA = <STDIN>) {
    chomp $DNA;    # remove newline char
    # reset counters
    $A_count = 0;
    $C_count = 0;
    $G_count = 0;
    $T_count = 0;
    for ($i = 0; $i < length($DNA); $i++) {
        $base = substr($DNA, $i, 1);
        .....
The condition in while ($DNA = <STDIN>) is a logical expression: the while loop stops when it is no longer true, i.e. when $DNA becomes undefined.
Pressing Enter does NOT make $DNA undefined: instead (in UNIX) you need to type CTRL-D.
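To watch the loop end at end of input without typing CTRL-D, the sketch below reads from an in-memory file handle that stands in for STDIN. The handle $in, the sample text and the @lengths array are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Sample input standing in for lines typed at the keyboard (illustrative only).
my $text = "ACGT\nGGCC\n";
open(my $in, '<', \$text) or die "Cannot open in-memory input: $!\n";

my @lengths;
# <$in> returns undef at end of input, which ends the loop,
# just as CTRL-D does for STDIN on UNIX.
while (my $DNA = <$in>) {
    chomp $DNA;
    push @lengths, length($DNA);
    print "read '$DNA'\n";
}
close($in);
print "lines read: ", scalar(@lengths), "\n";
```

An empty line (just Enter) would arrive as "\n", which is defined, so the loop would carry on; only end of input stops it.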
Further standard input
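Occasionally you may see this
while (<STDIN>) {
    ...
or even this
while (<>) {
    ...
Both read each line into the special variable $_. The first is equivalent to:
while ($_ = <STDIN>) {
    ...
and so is the second when no file names are given on the command line (otherwise <> reads from those files instead of STDIN).
Perl allows many shortcuts like this. Which you choose is a question of style, but I would avoid $_ because its name doesn't indicate the data it contains.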
Further standard input
Note the use of the chomp function. Each line that Perl reads still ends with its \n (newline) character. In some cases (e.g. pattern matching) this can cause problems, but in any case we are unlikely to want it in our data.
# print the length of the input before and after chomp
$DNA = <STDIN>;
print length($DNA), "\n";
chomp($DNA);
print length($DNA), "\n";
Example run:
CGTCCGTC
9
8
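The same effect can be reproduced without the keyboard. In this sketch the string literal stands in for a line read from STDIN, and the names $before and $after are illustrative.

```perl
use strict;
use warnings;

# A line as it would arrive from <STDIN>: the text plus its newline.
my $DNA = "CGTCCGTC\n";
my $before = length($DNA);   # counts the trailing newline
chomp($DNA);                 # removes the trailing newline, if present
my $after = length($DNA);
print "$before\n$after\n";

# chomp is safe to call again: with no newline left it removes nothing.
chomp($DNA);
print length($DNA), "\n";
```

Unlike chop, which always removes the last character, chomp only removes a trailing newline, so calling it twice does not eat into the data.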
Standard output
Standard output is easy: by default the print command sends its output to STDOUT. You may specify STDOUT explicitly if you wish, therefore
print STDOUT "Writing results..\n";
is equivalent to
print "Writing results..\n";
but normally you wouldn't bother.
Re-directing STDIN and STDOUT
• By default STDIN takes input from the keyboard and STDOUT writes to the screen.
• Sometimes it is useful to take input from a file instead (e.g. for repetitive inputs, in batch mode) or to write output intended for the screen to a file (e.g. to keep a log, in batch mode). This is called redirection.
• Redirection is usually done in the UNIX environment. For example, in csh or tcsh ($ is the shell prompt):
STDIN:
$ myprog.pl < mydata.input
STDOUT:
$ myprog.pl > screen.output
both STDIN and STDOUT:
$ prog.pl < prog.input > prog.output
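The shell is the usual place to redirect, but STDOUT can also be re-opened from inside a Perl program. The sketch below is one way to do it, under stated assumptions: the temporary file stands in for screen.output, and the variable names ($saved, $check, $line) are illustrative. It saves a copy of STDOUT, points STDOUT at the file, prints, and then restores the original.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Temporary file standing in for screen.output (illustrative only).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
close($tmp_fh);

# Keep a copy of the original STDOUT so it can be restored later.
open(my $saved, '>&', \*STDOUT) or die "Cannot duplicate STDOUT: $!\n";

# Re-open STDOUT on the file: a plain print now writes there.
open(STDOUT, '>', $tmp_name) or die "Cannot redirect STDOUT: $!\n";
print "Writing results..\n";
close(STDOUT);

# Restore the original STDOUT.
open(STDOUT, '>&', $saved) or die "Cannot restore STDOUT: $!\n";

# Read the file back to confirm where the output went.
open(my $check, '<', $tmp_name) or die "Cannot open $tmp_name: $!\n";
my $line = <$check>;
close($check);
print "file contains: $line";
```

For one-off runs, redirecting on the command line is simpler, because the program itself does not need to change.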
Reading and writing files
As well as using standard input and output, Perl allows you to read from or write to named files.
# get name of file
$filename = <STDIN>;
chomp $filename;    # remove the newline, otherwise the file will not be found
# open the file and read the data
open(DNAFILE, "$filename") or die "Cannot open file $filename\n";
$DNA = <DNAFILE>;
# now close the file
close(DNAFILE);
....
open associates a label (file handle) with a file.
If it cannot open the file, die exits the program with a message.
close puts the file in a "safe" condition (not strictly necessary but recommended).
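A file name read from STDIN still carries its newline, so it needs chomp before it is passed to open. The sketch below shows this, using the three-argument form of open with a lexical file handle, a common modern alternative to the bareword handle on the slide. The temporary file, its contents and the variable names are assumptions made so that the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small file to read (stands in for a file the user would name).
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print $tmp_fh "CGTGGCACAC\n";
close($tmp_fh);

# A name read from <STDIN> would end in a newline; simulate that and remove it.
my $typed = "$filename\n";
chomp $typed;

# Three-argument open; on failure $! holds the system's reason.
open(my $dnafile, '<', $typed) or die "Cannot open file $typed: $!\n";
my $DNA = <$dnafile>;
close($dnafile);
chomp $DNA;
print "read: $DNA\n";

# Opening a file that does not exist fails; this is what "or die" guards against.
my $opened = open(my $missing, '<', "$filename.does-not-exist");
print "second open ", ($opened ? "succeeded" : "failed"), "\n";
```

Including $! in the die message reports why the open failed (for example a missing file or a permissions problem), which the fixed message alone cannot.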
the open command
# open for reading
open(FH, "info.dat");
# alternative form for reading
open(FH, "<info.dat");
# open file for writing (creates it if it doesn't exist,
# destroys its contents if it does)
open(FH, ">output");
# append to file (creates it if it doesn't exist)
open(FH, ">>output");
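The difference between the modes is easiest to see by running them in sequence on one file. This is a sketch under stated assumptions: it uses a temporary directory so that nothing real is overwritten, the three-argument form of open rather than the two-argument form above, and illustrative names ($dir, $file, @lines).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/output";

# '>' creates the file, or empties it if it already exists.
open(my $fh, '>', $file) or die "Cannot write $file: $!\n";
print $fh "first\n";
close($fh);

# '>>' appends, keeping what is already there.
open($fh, '>>', $file) or die "Cannot append to $file: $!\n";
print $fh "second\n";
close($fh);

# '<' opens for reading.
open($fh, '<', $file) or die "Cannot read $file: $!\n";
my @lines = <$fh>;
close($fh);
print @lines;

# Opening with '>' again destroys the previous contents.
open($fh, '>', $file) or die "Cannot write $file: $!\n";
close($fh);
print "after re-opening with '>': ", (-z $file ? "empty" : "not empty"), "\n";
```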
context sensitive input
Note that the <> operator is context sensitive.
• If the variable is a scalar, one line at a time is read.
• If the variable is an array, all the lines of input are read.
Therefore we have another method of reading all the lines of a file:
# read whole file into array
open(PROTEINFILE, "proteins.db");
@protein = <PROTEINFILE>;
close(PROTEINFILE);
This "slurping" may be convenient in some cases but obviously uses more memory. For large files it is best to read one line at a time with a scalar.
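The two contexts can be compared side by side. In this sketch an in-memory string stands in for proteins.db; its contents and the variable names are illustrative only.

```perl
use strict;
use warnings;

# In-memory text standing in for proteins.db (contents are illustrative only).
my $text = "line one\nline two\nline three\n";

# Scalar context: each read returns the next single line.
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!\n";
my $first  = <$fh>;
my $second = <$fh>;
close($fh);

# List context: one read returns all the lines.
open($fh, '<', \$text) or die "Cannot open in-memory file: $!\n";
my @all = <$fh>;
close($fh);

chomp($first, $second);
chomp(@all);    # chomp also works on a whole array
print "scalar reads: $first / $second\n";
print "array read:   ", scalar(@all), " lines\n";
```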
writing to files
Easy, just add the file handle to the print statement:
# write results to file
open(OUTPUT, ">results.txt");
print OUTPUT "% identity = $pcid\n";
print OUTPUT "score = $score\n";
close(OUTPUT);
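A runnable sketch of the same pattern follows, with error checks on open and close. The values of $pcid and $score are hypothetical (the slides do not show how they are computed), and the file is written to a temporary directory and read back only to show what was written.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/results.txt";

# Hypothetical values; the slides do not show how these are computed.
my ($pcid, $score) = (87.5, 42);

open(my $out, '>', $file) or die "Cannot open $file for writing: $!\n";
# Note: there is no comma between the file handle and the text to print.
print $out "% identity = $pcid\n";
print $out "score = $score\n";
close($out) or die "Cannot close $file: $!\n";

# Read the file back to show what was written.
open(my $in, '<', $file) or die "Cannot open $file: $!\n";
my @written = <$in>;
close($in);
print @written;
```

Checking the result of close matters more for output than for input, because buffered data may only be written when the file is closed.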
Using files summary
• Perl defines two "special files", STDIN and STDOUT, which by default read from the keyboard and write to the screen.
• The defaults for STDIN and STDOUT can be changed to use files in UNIX.
• Reading input is done with the <> operator; the print command writes output.
• Use chomp to remove the \n from every line of input.
• The open command associates a file handle with a file name; the handle can be used with <> for input and added to print for output.
• When opening a file for reading it is good practice to use the die command with a message to exit the program if the file is not found.
• It is also good practice to close the file when it is no longer needed.