Perl Notes

Howard Ross


PERL NOTES: GETTING STARTED

Howard Ross

School of Biological Sciences


Perl Resources


The main Perl site (owned by O’Reilly publishing company)
http://www.perl.com/

This site has links at which you can download Perl, if your operating system doesn't already include Perl.


Learning Perl


http://learn.perl.org/tutorials/

http://learn.perl.org/books.html


Standard Perl Documentation


http://perldoc.perl.org/


Downloadable Modules


CPAN (Comprehensive Perl Archive Network)

http://www.cpan.org/


BioPerl


http://www.bioperl.org/


BioPerl Documentation


http://doc.bioperl.org/


BioPerl HowTo and Tutorials


http://bio.perl.org/wiki/HOWTOs

http://bio.perl.org/wiki/Bptutorial.pl

http://bio.perl.org/wiki/Tutorials


A Script Editor


You will need an editor, as opposed to a word processor or SimpleText.

It is most important that the editor is syntax- or language-aware. That means it colours each item in the script according to what "part of speech" it is, and will alert you when braces or parentheses are not matched.

Try to Google "perl editor yourOperatingSystem"

If you get more adventurous, you will use an Integrated Development Environment (IDE) such as Eclipse with the EPIC add-on for Perl.



To run a Perl script


Write your Perl script in an editor.

Save it with a .pl extension (myScript.pl) in your workFolder.

In the Terminal or Command window, change the default directory to the work folder:

    cd workFolder

Execute it with:

    perl myScript.pl


Syntax Overview


Perl statements end in a semicolon. Comments start with a hash symbol and run to the end of the line.

    print "Hello, world"; # comment to EOL

Whitespace is irrelevant:

    print
        "Hello, world"
        ;

... except inside quoted strings:

    # this would print with a line break in the middle
    print "Hello
    world";



Double quotes or single quotes may be used around literal strings:

    print "Hello, world";
    print 'Hello, world';

But only double quotes "interpolate" variables and special characters such as newlines (\n):

    print "Hello, $name\n"; # works fine
    print 'Hello, $name\n'; # prints $name\n literally
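The difference between the two quoting styles can be checked with a short script. This is a minimal sketch; the value given to $name is made up for illustration.

```perl
use strict;
use warnings;

my $name = "world";              # hypothetical value for illustration

my $double = "Hello, $name\n";   # $name and \n are both interpolated
my $single = 'Hello, $name\n';   # kept as the literal characters typed

print $double;
print $single, "\n";             # add a real newline so the output ends cleanly
```

The single-quoted string is longer than it looks on screen, because the backslash and the n are stored as two separate characters.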


Variable Types & their Names


Scalar ($name)
    type set by context: string, int, real

Array of Scalars (@name)
    indexed from [0], elements $array[$i]
    last index of array = $#array

Associative Array of Scalars [Hash] (%name)
    unordered
    $hash{ $key } = $value



Examples of Variables


Scalars

    my $animal = "camel";
    my $answer = 42;

Arrays

    my @animals = ("camel", "llama", "owl");
    my @numbers = (23, 42, 69);
    my @mixed = ("camel", 42, 1.23);

The value of $numbers[2] is 69, and the value of $animals[2] is "owl".

Hashes

    my %fruit_colour = ("apple", "red", "banana", "yellow");

$fruit_colour{"apple"} has value "red".

A more common method to load values:

    my %fruit_colour = (
        apple => "red",
        banana => "yellow",
    );

Extracting keys and values:

    my @fruits = keys %fruit_colour;
    my @colours = values %fruit_colour;
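As a small runnable sketch of the array and hash operations described here, the script below reuses the same example data. The keys are sorted before printing only because hash order is not guaranteed.

```perl
use strict;
use warnings;

my @animals = ("camel", "llama", "owl");
my %fruit_colour = (apple => "red", banana => "yellow");

my $last_index = $#animals;           # index of the final element
my $count      = scalar @animals;     # number of elements
my $last       = $animals[$last_index];

# Hashes are unordered, so sort the keys for repeatable output.
my @fruits = sort keys %fruit_colour;

print "There are $count animals; the last is $last\n";
foreach my $fruit (@fruits) {
    print "$fruit is $fruit_colour{$fruit}\n";
}
```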


Variable Scoping


my creates lexically scoped variables.

Variables are scoped to the block (i.e. the set of statements surrounded by { }) in which they are defined.

This allows the creation of global and local variables.

    my $a = "foo";
    if ($some_condition) {
        my $b = "bar";
        print $a; # prints "foo"
        print $b; # prints "bar"
    }
    print $a; # prints "foo"
    print $b; # prints nothing; $b has fallen out of scope
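The sketch below shows the same scoping rule in a form that runs under use strict. With strict enabled, referring to a variable after its block has ended is a compile-time error rather than an empty print, so the example records values inside the block instead. The variable names are illustrative.

```perl
use strict;
use warnings;

my $outer = "foo";            # visible in the rest of the file
my @seen;

{
    my $inner = "bar";        # visible only inside this block
    push @seen, $outer, $inner;

    my $outer = "shadow";     # a new variable that hides the outer one
    push @seen, $outer;
}

push @seen, $outer;           # the original $outer is untouched
print join(",", @seen), "\n";
```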


Conditionals and Loops


if (negated is unless), while (negated is until)

    if ( condition ) {
        ...
    } elsif ( other condition ) { # NOTE SPELLING!!!
        ...
    } else {
        ...
    }


Post-conditionals

Syntax: perform something conditional condition;

    print "Yow!" if $zippy;
    print "We have no bananas" unless $bananas;

    do {
        $line = <STDIN>;
        chomp $line;
        ...
    } until $line eq "."; # no \n here, because chomp has already removed it


for and foreach loops

C-type syntax in a for loop:

    for ($i=0; $i <= $max; $i++) {
        ...
    }

but foreach loops over all members of a list:

    # loop over every key in %hash and print its value
    foreach my $key (keys %hash) {
        print "The value of $key is $hash{$key}\n";
    }

or

    # loop over every integer in a range
    foreach my $number (2 .. 12) {
        print "$number\n";
    }
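The loop forms above can be combined in one short script. This is only a sketch with made-up limits; it shows a C-style loop, a range loop with a post-conditional, and a post-conditional until.

```perl
use strict;
use warnings;

my @squares;
for (my $i = 0; $i <= 3; $i++) {      # C-style loop
    push @squares, $i * $i;
}

my $total = 0;
foreach my $number (2 .. 5) {          # range loop
    next unless $number % 2 == 0;      # post-conditional: skip odd numbers
    $total += $number;
}

my $countdown = 3;
$countdown-- until $countdown == 0;    # post-conditional loop

print "squares: @squares\n";
print "sum of even numbers: $total\n";
```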


Operators


Arithmetic:             + - * / ** ++ -- += -=

Numeric comparison:     == != > < <= >=

String
    comparison:         eq ne lt gt le ge
    concatenation:      .

Boolean Logic:          && || !

infix dereference operator, as in C and C++:    ->

pattern binding operator:    =~ !~
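A common source of bugs is using a numeric operator where a string operator is needed, or the reverse. The sketch below compares the same two values both ways; the values are chosen only to make the difference visible.

```perl
use strict;
use warnings;

my ($x, $y) = ("10", "9");

my $numeric_gt = $x >  $y ? 1 : 0;   # 10 > 9 when compared as numbers
my $string_gt  = $x gt $y ? 1 : 0;   # "10" sorts before "9" when compared as text

my $codon    = "AT" . "G";           # concatenation
my $power    = 2 ** 10;              # exponentiation
my $is_start = ($codon eq "ATG" && $power == 1024) ? "yes" : "no";

print "numeric: $numeric_gt, string: $string_gt\n";
print "$codon $power $is_start\n";
```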





Files and File Handles


Opening a file for input or output


    # open for read
    open INFILE, "<", "input.txt" or die "Can't open input: $!";

    # or
    my $infile = "$replicate.txt"; # "" causes interpolation
    open INFILE, "<", $infile or die "Can't open $infile: $!";

    # open for write, overwrite file contents
    open OUTFILE, ">", "output.txt" or die "Can't open output: $!";

    # open for write, append to file contents
    open LOGFILE, ">>", "my.log" or die "Can't open logfile: $!";


Reading from a filehandle


Read from an open filehandle using the <> operator.

In scalar context it reads a single line, so use a loop to read the whole file.

In list context it reads the whole file in, assigning each line to an element of the list.

    my $line = <INFILE>;   # reads one line into a string
    my @lines = <INFILE>;  # reads all lines into an array

    # or
    while (defined($line = <INFILE>)) {
        chomp $line; # remove trailing \n
        # assumes the id and the time are separated by whitespace
        ($id, $time) = ($line =~ /^(\S+)\s+(\d+).*$/);
        print OUTFILE "$id\t$time\n";
    }


Writing and Closing


Naming the filehandle for output is optional, the default being STDOUT.

    print OUTFILE $record;
    print LOGFILE $logmessage;

Close the filehandle when you are finished:

    close LOGFILE;
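The open, read, write and close steps can be seen end to end in the sketch below. It departs from the notes in two labelled ways: it uses lexical filehandles (my $in, my $out), a common alternative to bareword handles such as INFILE, and it writes to a scratch file from the core File::Temp module so that no real data is touched. The record contents are invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the example does not touch real data.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write two tab-separated records.
open(my $out, ">", $path) or die "Can't open $path for writing: $!";
print $out "seq1\t42\n";
print $out "seq2\t7\n";
close $out;

# Read them back one line at a time.
my %time_for;
open(my $in, "<", $path) or die "Can't open $path for reading: $!";
while (defined(my $line = <$in>)) {
    chomp $line;
    my ($id, $time) = split /\t/, $line;
    $time_for{$id} = $time;
}
close $in;

print "$_ => $time_for{$_}\n" for sort keys %time_for;
```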



“Magic” variables


The default variable for function input or output is $_
    allows the writing of English-like code
    opportunity to obfuscate

    my @pets = ("dog", "cow", "sheep", "horse", "chicken");
    print sort @pets;

or

    @sorted_pets = sort @pets;
    for $i (0 .. $#sorted_pets) {
        print $sorted_pets[$i];
    }

Array of arguments passed to a subroutine is @_

Almost anything matching the pattern $x, where x is a number, piece of punctuation or control character ($" $& $< $? $^C $5 etc.), is a magic variable. Use only with extreme caution!
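To make the role of $_ more concrete, the sketch below uses it as the implicit loop variable and as the implicit target of a pattern match, grep and map. The pet names are the same illustrative data as above.

```perl
use strict;
use warnings;

my @pets = ("dog", "cow", "sheep");

my @with_o;
foreach (sort @pets) {              # no loop variable named, so each item lands in $_
    push @with_o, $_ if /o/;        # a bare pattern match tests $_
}

my @long  = grep { length($_) > 3 } @pets;   # grep sets $_ to each element in turn
my @upper = map  { uc($_) } @pets;           # so does map

print "@with_o\n@long\n@upper\n";
```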


Subroutines



    sub razzle {
        print "OK, you've been razzled.\n";
    }

and call with

    razzle();

Subroutines are placed either in the current file or in a called module (use modulename;).

Function parameters and multiple return values are passed as single, flat lists of scalars.


Passing Arguments and Returning Values


Incoming arguments are in @_

    razzle($victim, $con);

    sub razzle {
        my ($vic, $c) = @_; # $vic = $_[0], $c = $_[1]
        # or
        my @deal = @_; # $deal[0] = $_[0], $deal[1] = $_[1]
    }

The return value is the value of the last expression evaluated, or use an explicit return statement.
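Both ways of returning a value can be sketched with two small subroutines. The names seq_stats and gc_fraction are hypothetical helpers written for this illustration; the first returns a flat list with an explicit return, the second relies on the last expression evaluated.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the length and GC count of a DNA string.
sub seq_stats {
    my ($seq) = @_;                      # copy the first argument out of @_
    my $gc = ($seq =~ tr/GCgc//);        # tr with an empty replacement only counts
    return (length($seq), $gc);          # multiple values come back as a flat list
}

# Without an explicit return, the last expression evaluated is the result.
sub gc_fraction {
    my ($seq) = @_;
    my ($len, $gc) = seq_stats($seq);
    $len ? $gc / $len : 0;
}

my ($len, $gc) = seq_stats("TTACGTAT");
my $fraction   = gc_fraction("TTACGTAT");
print "length $len, GC $gc, fraction $fraction\n";
```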


References

Hard (not symbolic) references

    $scalarref = \$foo;

$$scalarref returns the value of $foo

    @array = ('1', 'b', '2', '3', 'e');
    $arrayref = \@array;

$arrayref->[4] has value 'e'

Passing a reference to a subroutine

    $total = sum(\@a);

    sub sum {
        my ($aref) = @_;
        my $total = 0;
        foreach (@$aref) { $total += $_; }
        return $total;
    }
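One consequence of passing a reference is that the subroutine works on the caller's data rather than on a copy. The sketch below illustrates this with a hypothetical helper, double_all, and a hash that stores an array reference; the sample name and read counts are invented.

```perl
use strict;
use warnings;

my @reads  = (12, 30, 8);
my %sample = (name => "gut01", reads => \@reads);   # hash holding an array reference

# Hypothetical helper: doubles every element of the referenced array in place.
sub double_all {
    my ($aref) = @_;
    $_ *= 2 foreach @$aref;     # modifies the caller's array through the reference
    return scalar @$aref;       # number of elements processed
}

my $n     = double_all(\@reads);
my $first = $sample{reads}->[0];      # arrow dereference
my $last  = ${ $sample{reads} }[-1];  # equivalent block form

print "$n values; first $first, last $last\n";
```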


Lots of Functions


Selected Functions of Special Use in Bioinformatics


chomp - removes \n from end of string

    $line = <INPUT>; # say $line contains "TTACGTAT\n"
    chomp $line;     # removes trailing \n
                     # now $line is "TTACGTAT"

chop - removes and returns last character from a string

    $base = chop $line; # $base holds 'T'
                        # now $line is "TTACGTA"

pop - removes and returns last value of an array, shortening the array by one element

    @guinea_pigs = ("dog", "cow", "sheep", "horse");
    $next = pop @guinea_pigs; # $next holds "horse"
    # @guinea_pigs holds ("dog", "cow", "sheep")

push - adds an element to the end of an array, increasing the length of the array by one

    $new = 'chicken';
    push @guinea_pigs, $new;
    # @guinea_pigs holds ("dog", "cow", "sheep", "chicken")

reverse - returns list or string in reverse order

    my $string = 'backwards';
    $str1 = reverse $string; # returns 'sdrawkcab'

substr - extracts a substring, based on place and length

    $str2 = substr($string, 4, 5); # returns 'wards'

join - joins a list of strings into a single string

    $str3 = join('', $str1, $str2); # returns 'sdrawkcabwards'

split - splits a string into a list of strings

    $sequence = 'AACTGC';
    @bases = split //, $sequence;
    # // means on each character of string
    # array @bases contains ('A','A','C','T','G','C')
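Several of these functions are often chained when editing a sequence. The following is a minimal sketch on a made-up six-base string, not a recommended way to edit real sequence data.

```perl
use strict;
use warnings;

my $line = "AACTGC\n";
chomp $line;                               # "AACTGC"

my @bases    = split //, $line;            # one element per base
my $last     = pop @bases;                 # removes and returns "C"
push @bases, "G";                          # put "G" in its place
my $edited   = join('', @bases);           # "AACTGG"
my $reversed = reverse $edited;            # scalar context reverses the string
my $codon    = substr($edited, 0, 3);      # first three bases

print "$edited $reversed $codon $last\n";
```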





Regular Expressions & Pattern Matching (REGEX)

Find a substring, based on a pattern, and extract or modify it

Simple matching

    if (m/foo/) { ... }       # true if $_ contains "foo"
    if (/foo/) { ... }        # m is implied
    if ($a =~ /foo/) { ... }  # true if $a contains "foo"
    if ($a !~ /foo/) { ... }  # true if $a lacks "foo"

Simple substitution

    s/foo/bar/;               # replaces the first foo with bar in $_
    $a =~ s/foo/bar/;         # replaces the first foo with bar in $a
    $a =~ s/foo/bar/g;        # replaces ALL INSTANCES of foo with bar in $a

Transliteration

    tr/abc/123/;              # replaces a with 1, b with 2, c with 3 in $_
    $a =~ tr/A-Z/a-z/;        # converts all upper-case letters in $a to lower case
    ($count) = ($a =~ tr/A-Z/a-z/);
                              # converts all upper-case letters in $a to lower case
                              # AND returns the number of times it did this
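In sequence work, tr is frequently used just for its return value. The sketch below, on an invented eight-base string, counts G and C bases without changing the string, and applies tr and s///g to copies so the original is preserved.

```perl
use strict;
use warnings;

my $dna = "ATGCGCTA";

(my $rna = $dna) =~ tr/T/U/;          # copy, then transliterate the copy
my $gc   = ($dna =~ tr/GC//);         # empty replacement: count only, $dna unchanged

(my $masked = $dna) =~ s/GC/nn/g;     # substitute every GC pair in the copy
my $has_start = $dna =~ /^ATG/ ? 1 : 0;

print "$rna $gc $masked $has_start\n";
```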


Specifying type of character(s) to match

    .              a single character
    \s             a whitespace character (space, tab, newline)
    \S             a non-whitespace character
    \d             a digit (0-9)
    \D             a non-digit
    \w             a word character (a-z, A-Z, 0-9, _)
    \W             a non-word character
    [aeiou]        matches a single character in the given set
    [^aeiou]       matches a single char outside the given set
    (foo|bar|baz)  matches any of the alternatives specified
    ^              start of string
    $              end of string

Modifying the number of occurrences (eg /foo\d*/)

    *        zero or more of the previous thing
    +        one or more of the previous thing
    ?        zero or one of the previous thing
    {3}      matches exactly 3 of the previous thing
    {3,6}    matches between 3 and 6 of the previous thing
    {3,}     matches 3 or more of the previous thing



Modifying the type of search

    /foo/i    ignore alphabetic case
    /foo/m    treat string as multiple lines
    /foo/s    treat string as single line
    /foo/g    globally find all matches

Examples of use

    /^>/                 string starts with ">" (header line in FASTA)
    /\$\d+/              matches a price (or salary) in a file
    /[-]?\d+[.]\d+/      matches a real number
    /(\d\s){3}/          three digits, each followed by a whitespace character (eg "3 4 5 ")
    /(a.)+/              matches string in which every odd-numbered letter is a (eg "abacadaf")
    /(taa|tga|tag)+/i    matches any STOP codon


Capturing from a pattern


Matches to bits in parentheses go into $1, $2, etc.

    if ($novel =~ /(War) and (Peace)/) { ... }

$1 holds "War" and $2 holds "Peace"

Or ...

    if (($seqid, $species) = ($line =~ /ID: (\d+) SP: (\w+ \w+)/)) { ... }

then for data

    ID: 398371 SP: Balaenoptera edeni

$seqid holds "398371" and $species holds "Balaenoptera edeni"
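The capture rules can be tried out on the data line above. The sketch also adds two assumptions of its own for illustration: a made-up FASTA header, and a /g match in list context that returns every stop-codon-like triplet it finds (without regard to reading frame).

```perl
use strict;
use warnings;

my $line = "ID: 398371 SP: Balaenoptera edeni";

my ($seqid, $species);
if ($line =~ /ID: (\d+) SP: (\w+ \w+)/) {
    ($seqid, $species) = ($1, $2);      # copy captures before the next match overwrites them
}

my $header = ">seq42 hypothetical protein";       # invented header line
my ($id, $desc) = $header =~ /^>(\S+)\s+(.*)$/;   # list context returns the captures

my @stops = ("ttataaggtga" =~ /(taa|tga|tag)/gi); # /g in list context returns every match

print "$seqid | $species | $id | $desc | @stops\n";
```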



BioPerl


This is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications.

Bioperl is Modular

    Perl is (can be used as) an Object-Oriented language
    Bioperl modules give access to public objects and methods
    Private (behind-the-scenes) objects and methods implement the functionality.

Bioperl is based on Objects, which have associated Attributes and Actions

    Sequence Objects
        o sequences, alignments, sequence details
    Location Objects
        o associated with sequence feature details regarding where sequence occurs on chromosome

Objects are defined in a hierarchical way. More specific objects (DNA sequence database in Fasta format) have both methods inherited from a more general object (DNA sequence database) and methods specific to themselves. Consequently the methods available in an object vary tremendously. They are not only defined in the object, but also inherited from parental objects.
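The idea of inherited and class-specific methods can be illustrated in plain Perl without BioPerl installed. The two classes below, SeqDatabase and FastaDatabase, are hypothetical toys written for this sketch; they are not BioPerl modules and do not reflect how BioPerl is implemented.

```perl
use strict;
use warnings;

# Hypothetical classes for illustration; these are not BioPerl modules.
package SeqDatabase;
sub new {
    my ($class, %args) = @_;
    my $self = { name => $args{name}, seqs => {} };
    return bless $self, $class;
}
sub add { my ($self, $id, $seq) = @_; $self->{seqs}{$id} = $seq; return $self; }
sub seq { my ($self, $id) = @_; return $self->{seqs}{$id}; }

package FastaDatabase;
our @ISA = ('SeqDatabase');            # inherit new, add and seq
sub header {                           # method specific to the subclass
    my ($self, $id) = @_;
    return ">$id";
}

package main;

my $db = FastaDatabase->new(name => "demo");
$db->add("s1", "ATGC");
my $seq    = $db->seq("s1");      # inherited from SeqDatabase
my $header = $db->header("s1");   # defined in FastaDatabase
print "$header\n$seq\n";
```

The object $db answers to seq even though FastaDatabase never defines it, which is why the list of methods available on a BioPerl object is usually longer than the list in that module's own documentation.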


Documentation


Basic documentation (http://doc.bioperl.org/)
    Links to standard documentation for each module or method

BioPerl Deobfuscator (http://bioperl.org/cgi-bin/deob_interface.cgi)
    Helps you determine what methods are available to you in each module

HOWTOs (http://bioperl.org/wiki/HOWTOs)
    Instructions on how to perform many different common tasks
    Especially http://www.bioperl.org/wiki/HOWTO:Beginners

FAQ (http://bioperl.org/wiki/FAQ)
    More instructions on how to perform many different common tasks


General Syntax

You apply or use a method with an object, and usually this returns a value:

    $newValue = $myObject->$theMethod;

An example

    # first declare what modules are being used
    use Bio::DB::Fasta;

    # initialize some variables
    my $file = "filename.fasta"; my $id = "someID";

    # create a new instance of the object
    my $db = Bio::DB::Fasta->new($file);

    # apply or use methods on the object
    # get a sequence as string
    my $seqstring = $db->seq($id);

    # get the header, or description, line
    my $desc = $db->header($id);

new, seq, and header are all methods from module Bio::DB::Fasta

Another example

    # A script for converting formats
    use Bio::SeqIO;

    $in = Bio::SeqIO->new('-file' => "inputfilename",
                          '-format' => 'Fasta');
    $out = Bio::SeqIO->new('-file' => ">outputfilename",
                           '-format' => 'EMBL');

    while ( my $seq = $in->next_seq() ) {
        $out->write_seq($seq);
    }


Accessing Sequences


Sequences can be accessed in two logically different ways:

1. From a file, in which they are read sequentially
       The file needs to be local, i.e. on the current system

2. From a database, in which they can be accessed in random order
       The database can be local or remote
       Local = indexed flat file on current system
       Remote = on different Internet host
           o Genbank, genpept, RefSeq, swissprot, and EMBL databases


Accessing a local file

Generally use the methods available in Bio::SeqIO and Bio::Seq

    #!/usr/bin/perl -w
    # Access a local file containing sequences in FASTA format
    use strict;
    use Bio::Seq;
    use Bio::SeqIO;
    my $file = 'data.fasta';
    my $in = Bio::SeqIO->new(-format => 'fasta',
                             -file => $file);
    while (my $seq = $in->next_seq) {
        my $id = $seq->display_id;
        my $length = $seq->length;
        print "Sequence $id has length $length\n";
    }
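For comparison, the same report (one line per sequence with its id and length) can be sketched in plain Perl with no BioPerl dependency. This is a deliberately simplified reader written for illustration, not how Bio::SeqIO works internally; the FASTA text is invented and held in a string so the script needs no input file.

```perl
use strict;
use warnings;

# Hypothetical FASTA data held in a string so the example needs no files.
my $fasta = ">seq1 first record\nATGC\nGGA\n>seq2\nTTTT\n";
open(my $in, "<", \$fasta) or die "Can't open in-memory file: $!";

my (%length_of, @order, $id);
while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^>(\S+)/) {          # header line: start a new record
        $id = $1;
        push @order, $id;
        $length_of{$id} = 0;
    } elsif (defined $id) {            # sequence line: add to the current record
        $length_of{$id} += length($line);
    }
}
close $in;

print "Sequence $_ has length $length_of{$_}\n" for @order;
```

A hand-written reader like this handles only the simplest files; the BioPerl modules are the better choice once other formats or annotations are involved.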


Accessing a local database

Accessing a local file as a database is much faster, and allows you to move back and forth in the file at will. Also, you can load segments of sequence much more efficiently than if you use the Bio::Seq methods. The Bio::DB::Fasta module indexes the database files when they are first used, and as required thereafter. If you alter the database files, by other scripts, then you need to re-index them using

    $db->index_file($filename);

    use Bio::DB::Fasta;
    use strict;

    # one file or many files
    my $db = Bio::DB::Fasta->new($path_to_files);

    # get a sequence as string
    my $seqstring = $db->seq($id);

    # Is it dna, rna or protein?
    my $type = $db->alphabet($id);

    # get the header, or description, line
    my $desc = $db->header($id);


Accessing a remote database, in this case Swissprot

    use Bio::DB::SwissProt;

    $sp = Bio::DB::SwissProt->new;

    $seq = $sp->get_Seq_by_id('KPY1_ECOLI');
    # SwissProt ID:
    # <4-letter-identifier>_<species 5-letter code>

    # or ...

    $seq = $sp->get_Seq_by_acc('P43780');
    # SwissProt Accession Number: [OPQ]xxxxx


Querying a remote database, in this case GenBank


Compose a query string, just as you would in the interactive interface to GenBank, and then receive a stream of sequence objects.

    use Bio::DB::GenBank;
    use Bio::DB::Query::GenBank;

    $query = "Arabidopsis[ORGN] AND topoisomerase[TITL] ";
    $query_obj = Bio::DB::Query::GenBank->new(-db => 'nucleotide',
                                              -query => $query );
    $gb_obj = Bio::DB::GenBank->new;
    $stream_obj = $gb_obj->get_Stream_by_query($query_obj);
    while ($seq_obj = $stream_obj->next_seq) {
        # do something with the sequence object
    }





Manipulating sequences




    retrieving sequence information
        o find specific sequences
        o screen sequences for particular attributes
    altering sequence information
    creating sequences and their annotations


Retrieving sequence information


    use Bio::Seq;

    $seqobj->display_id();         # human readable sequence id
    $seqobj->seq();                # sequence as string
    $seqobj->subseq(5,10);         # part of the sequence as a string
    $seqobj->accession_number();   # the accession number
    $seqobj->alphabet();           # 'dna', 'rna', or 'protein'
    $seqobj->primary_id();         # a unique id for this sequence
    $seqobj->desc();               # a description of the sequence


Retrieving sequence features


Get the 'top level' sequence features

    @topfeatures = $seqobj->get_SeqFeatures();

or all sequence features, including subsequence features

    @allfeatures = $seqobj->all_SeqFeatures();

then perform actions based on their values

    foreach $feat_object (@allfeatures) {
        if ($feat_object->has_tag('translation')) {
            if ($feat_object->has_tag('protein_id')) {
                # do something based on protein identifier given
                # that an amino acid translation is provided
            }
        }
    }

or

    foreach my $tag ($feat_object->get_all_tags) {
        # do something with $tag
        foreach my $value ($feat_object->get_tag_values($tag)) {
            # do something with $value
        }
    }





Transforming sequences


Sequences may be read or written in many different formats. You need to investigate the documentation for the current version of BioPerl to determine whether the desired format is supported for the action (read or write) that you intend to perform.


Altering sequence features


    $subseq = $seqobj->trunc(5,10);       # truncate or take a subsequence
    $mocrev = $seqobj->revcom;            # take reverse complement
    $translation = $seqobj->translate;    # translate DNA/RNA to Protein


Searching for “similar” Sequences


BLAST (blastp, blastn, blastx, tblastn, tblastx) series of programs for searching protein and nucleotide databases

Use Bio::Tools::Run::RemoteBlast to submit a query and retrieve report. Default database is Genbank.

Alternatively, use interactive BLAST and save results in text format, or run a command-line version and save output to a local file.

Then use the Bio::SearchIO module for parsing output from sequence-similarity-searching programs (e.g., BLAST). The BLAST results are organised hierarchically. The result contains one or more hits, each of which contains one or more high scoring pairs (HSPs).


    use Bio::SearchIO;
    use strict;
    my $parser = Bio::SearchIO->new(-format => 'blast',
                                    -file => 'file.bls');
    while( my $result = $parser->next_result ) {
        while( my $hit = $result->next_hit ) {
            while( my $hsp = $hit->next_hsp ) {
                # do something with $result, $hit and $hsp
            }
        }
    }


There are several commands for retrieving information about each of the results, the hits and the HSPs (see http://www.bioperl.org/wiki/HOWTO:SearchIO).


Sequence Alignments


    TaxonA GAAGAAGATGTAGTAATTAGATCTGAAAATTT
    TaxonB GAAGAAGAGGTAGTAATTAGATCTGAAGATTT
    TaxonC GATGAAGAGATAGTAATTAGGTCTGAAAATCT
    TaxonD GAAGCAGAGGTAGTGA-TAGATCTGAAAATTT
    TaxonE GAAGAAGAGGTAGTAA-TAGATCTGAAAATTT
    TaxonF GAAGAAGAGGTAGTAATTAGATCCGAAAATTT

    positions in alignment represent evolutionary homology
    import existing alignments or estimate using external software



Read and Write an Alignment


    use strict;
    use Bio::AlignIO;
    my $informat = 'fasta';
    my $outformat = 'nexus';
    my $in = Bio::AlignIO->new(-format => $informat,
                               -file => 'hits.fa');
    my $out = Bio::AlignIO->new(-format => $outformat,
                                -file => '>hits.nex');
    while( my $aln = $in->next_aln ) {
        $out->write_aln($aln);
    }


Getting and Managing Information about Alignment with Bio::AlignIO

    $aln->consensus_string() & $aln->consensus_iupac()
        make a consensus sequence from DNA and RNA

    $aln->percentage_identity()
        average pairwise sequence similarity based on identity at each position

    $aln->slice($start,$end)
        a "slice", a subalignment between start and end columns

    $aln->column_from_residue_number($seq_id,$residue_number)
        find column where specified residue is located in particular sequence.