Large Scale Data Manipulation:

moredwarfΒιοτεχνολογία

1 Οκτ 2013 (πριν από 3 χρόνια και 10 μήνες)

112 εμφανίσεις


1


Perl in one hour: a brief introduction


Perl, “practical extraction and report language,” is the Swiss
-
army knife of computer
languages known for its flexibility and power. Its syntax allows people to quickly start
writing programs. But, as a person gai
ns competence, more complex problems can be
easily addressed as well.

It is particularly suited to manipulating text, which nicely coincides with the way we
represent sequence data. Often you will hear people denigrating Perl, compared to
more structur
ed languages such as Java, C++, and Python, but this is usually in terms of
the ability to accomplish extremely large complex projects, which is not the goal for most
individuals who are writing code for themselves in a lab.



00. Sources of Perl


Perl on
Linux and Macs :
should already be there (e.g. hpcc)

Windows:
http://www.activestate.com/activeperl/



0. Souces of Help




perldoc


command line extensive built in help files



Beginning Perl for Bioinformatics by James Tisdall



Mastering Perl for Bioinformati
cs by James Tisdall


--
good guide for using BioPerl



Perl Cookbook by Tom Christiansen and Nathan Torkington


--
wide
-
range of how
-
to examples/snippets of code



Learning Perl by Randal Schawartz et al.



Programming Perl by Larry Wall, et al., (“the camel”)




G
ood introductory course (for informatics from
2004)
http://stein.cshl.org/genome_informatics/



Bioperl Tutorials:
http://bio.perl.org/wiki/Tutorials







Disclaimer:
This hour can only give you an overview and a sense of Perl.
Learning to program takes practice and application to specific situations just like
any language.


2

AN INTRODUCTORY PROGRAM


#!/usr/bin/perl

w
#this is the location of pe
rl (in linux/unix)

use strict;
#this forces definition of variables and good coding#


if (! defined $ARGV[0] ) {


#check if file name provide


print 'Usage: fasta_split.pl <fasta file>';


die ;

}

###open a file for reading ###

open (IN, $ARGV[0]) or die
"Can not read file ($ARGV[0])!
\
n";


## data will look like this again and again ###

#>gi|2226439|gb|AC002038.1|HUAC002038 Homo sapiens chromosome 2 clone 101B6 map 2

p11, complete sequence

#AAGCTTGTTCAACATATGCAAATCAATAAATGTAATCCAGCATATAAACAGAGCCAAAGACA

#AT
GATTATCTCAATAGATGCAGAAAGGGCCTTTGACAAAATTCAACAAGCCTTCATGCTAAT

#ATAAATTAGGTATTGATGGGACGTATGTCAAAATAGTAAGAGCTATCTATGACAAACCCACA


###declare and set variables

my $sequence ='';

my $header='';

my $seqname='';

my $count=0;


###loop until the file has been comple
tely read###

while (my $line=<IN>) {
#loads one line into $line at a time


chomp $line;
#remove the newline character at end of text#


if ($line=~ /^>/ ) {
#check for new fasta header



if ($header ne '') {
#check if previous sequence to dump




open (
OUT, ">$seqname");




print OUT "$header
\
n";




print OUT "$sequence
\
n";




close OUT ;




$count++;



}



$sequence='';



$header=$line;



#extract sequence name from header matching AC002038.1



if ( $header =~ /
\
|([A
-
Z][A
-
Z0
-
9.]+)
\
|/ ) {




$seqname=$
1;



}


} else {



$sequence = $sequence . $_;
#paste sequences together


}

}

close IN;

print "Separated ($count) fasta records into individual files.
\
n";






3





I. Basic Data (Numbers and Text)


Integers or floating point
: 2, 3, 2, 23.233, 4.22


strings
:

"atatta
\
tccttgc"


atatta[TAB]ccttgc


'atatta
\
tccttgc'


atatta
\
ttccttgc



II. Functions (verbs)


Functions create, alter, or output data. Perl has a large number of built in functions.


Type:
perldoc

f [function name]


print



print output to the screen or a gi
ven FILEHANDLE


print "Hello World!
\
n";


print OUT “>seq1” (sends to file represented by
OUT)

system



execute a system command


system "repeatmasker AC002038
-
v";



system "blastall
-
i AC002038

p blastall

o x
-
v";


$x=`blastall

i AC002038

p bla
stall`;


All of the functions:

accept alarm atan2 bind binmode caller chdir chmod chop chown chroot close closedir connect cos
crypt dbmclose dbmopen defined delete die do dump each eof eval exec exit exp fcntl fileno flock
fork getc getgrent getgrgid getg
rnam gethostbyname gethostent getlogin getnetbyaddr
getnetbyname getnetent getpeername getpgrp getppid getpriority getprotobyname
getprotobynumber getprotoent getpwent getpwnam getpwuid getservbyname getservent
getsockname getsockopt gmtime goto grep hex i
ndex int ioctl join keys kill length link listen local
localtime log lstat m mkdir my oct open opendir ord pack pipe pop print printf push rand read
readdir readlink recv rename require reset return reverse rewinddir rindex rmdir s scaler seek
seekdir sele
ct semctl semget semop send setpgrp setpriority setsockopt shift shmctl shmget
shmread shmwrite shutdown sin sleep socket socketpair sort splice split sprintf sqrt srand stat
study substr symlink syscall sysread system syswrite tell telldir time times tr t
runcate umask undef
unlink unpack unshift utime values vec wait waitpid wantarray warn write


4


III. Simple Operators (verbs)



Mathematical

3 words or less

Illustrations

+

add

3+3


6 (use . for strings)

-

subtract

3
-

3


0

/

divide

3/3


1

*

multiply

2 * 3


6

**

power (exponent)

2 ** 3


8

%

remainder

4 % 3


1




Comparisons




<, lt

less than

3<4


1, 'y' < 'x'


0

>, gt

greater than

3>4


0, 'y' > 'x'


1

<=, l
e

less and equal

3<=4


1, 4<=4


1,

>=, ge

greater and equal

3 >= 4


0, 4 >=4


1

<=>, cmp

compare

3 <=> 4


-
1 , 4 <=> 4


0

==, eq

equals

4==3


0, bob eq bob


1







Logical



and , &&

logical and

3 > 4 and 4 < 3


0, 0 and 1


0

or , ||

l
ogical or

3 > 4 and 4 < 3


1, 0 or 1


1

!

Logical not

! 1


0, ! 10 =


0, ! 0


1




Assignment



=

assign value

$x=4; $y=55; $z='string';

+=

add to value

$x+=4

equals
$x=$x+4;

-
=

sub from value

$x
-
=4

equals
$x=$x
-
4;

*=

multiply value

$x*=4

equals
$x=$x*4;

/=

divide value

$x/=4

equals
$x=$x/4;

++

increment

$x++

raise value by 1

--

decrement

$x
--

decrease value by 1




File Tests


Return true (1) or false (0) or a appropriate value

-
e

File exists?

-
e filename



returns true or false

-
d

Is a directory?

-
d dirname



returns true or false

-
s

What size?

-
s filename



returns the size of file









5


IV. Data Variables (Nouns)

The names of all variables can contain letters (tradition is lower case), numbers, and
underscores. Define vari
ables that start with letters not numbers.

my


declares a variable (gives the program a heads up)




if
use strict

then program complains if not declared


A. scalar (a single number or string of text)

$
represents a scalar variable
, e.g.

$x; m
y $x=4;

$x=4; $y=0.4323221; $fruit = 'apple'; $newline ="
\
n";


numerical functions

abs($num)



returns the absolute value
-
3 → 3, 3 → 3

int($num)



returns an integer value of $num rounded down 3.2 → 3

rand($num)



returns a random floating number b
etween 0 and the number;


$die_roll=int(rand(5))+1;

#roll a six side die

Others simple functions:
sin, cos, log, atan, etc.


Useful string operators and functions

. (the period)


concatenates strings

"
atcgg"."ccttg
"


"atcggccttg"

length($string)




ret
urns number of characters in a string



length "atcggccttg"


10


length 234


3 (treated as string of 3 characters “234”)

reverse($string)



reverses the order of characters in a string



reverse "atcggccttg"


gttccggcta

$string =~ /regular expression

(RE) pattern/


match a pattern


$seq =~ /N+/

will find runs of Ns



$header =~/([A
-
Z]{1,2}_?[0
-
9]{4,5}
\
.[0
-
9]{1,2})/;


the parentheses () will capture the matching patterns in
$1,$2,…



the entire function will return true (1) if match is found


http://en.wikipedia.org/wiki/Regular_expression


http://www.perl.com/doc/manual/html/pod/perlre.html

$string =~ s/find RE/replace
/



find and replace a pattern


$seq =~ s/cg/nn/;

#replace cg with nn


$seq =~ s/(gaa)(ttc)/$1
\
n$2/;

#cut a sequence with EcoRI



$
string =~tr/chars2find/char2substitute/

#char by char (fast)


$seq =~ tr/atcgATCG/tagcTAGC/;

# the complement of $seq


$count= $seq=~/cgCG/cgCG/
; #fast way to count GC content

substr ($var, start pos, # characters)

access char within string


substr (“actgc”, 2,2)
→ “tg”


6

B. array


a list of scalars


@ is the array

$ for element in array followed by square brackets


my @bases=('a', 'c', 't' , 'g', 'c', 't', 'g');

$bases[0]



a

(counting starts at zero)

$bases[4]



c

@files=("AC002038", "AC004021", "U52111", "
AC004234"),

$files[2]


AC004021;


Functions for Arrays

@list=split(/RE/,$string)



splits a string into a list based on regular
expression pattern e.g.
/
\
t/
for tab,
/ +/

for spaces

@columns = split /
\
t/, $line;

#break tab
-
delimited text into columns

@columns= split / +/, $line;

#split spaced data into columns#

sort(@list)


sorts the elements of a list

join(spacer_string, @list)


merges a @list into a string


scalar(@list)



returns the size (number of elements) of a @list

$col_number = scala
r @columns;

#returns the number of columns

reverse(@list)



reverses the order of a @list

shift (@list)



removes and returns the first element off of a LIST

push(@list,$var )



push a variable $scalar or another list onto the end of
the @list.

@new = gr
ep { /RE/} @old



filters a @array keeping only matches


Special Arrays:


@ARGV



program command line input broken by space


@_



subroutine arguments




7

C. hash or associative array (random access list)


Allows you to access data in a random order:


my

%hash=(key=>value, key2=>value2);

$hash{key2}


value2


%sequence_lengths=('AC002038' => 156000, 'AC002041' =>
123000, 'AC005252' => 4000);

print $sequence_lengths{‘AC002038’}


156000


%codons=('AAA'=>’K’, 'AAG'=>'K', 'AAC'=>'N',

'AAT'=>'K');

print $codo
ns{'AAG'}

→ K


Useful hash functions:

keys %hash



returns a list of all of the keys;

delete $hash{key}



remove a key from a HASH

exists $hash{key}



checks to see if a key exists


D. filehandles


File handles are traditionally capital while scalar, arr
ay and hash variables are lower
case.


INFILE, OUTFILE, DIR2


open
(INFILE, $filename) or die "Can’t read ($filename)!
\
n";

$x=<INFILE>; #read a line

@alllines=<INFILE>; #reads all of the lines remaining;

close

IN;


open(OUT, ">$filename")

open(APPEND,
">>$filename")

print OUT "Bioinformatics rocks!
\
n

" ;


opendir
(DIR, $dirname)

my @files=grep { /[a
-
zA
-
Z]/ } readdir DIR;

close DIR;



Special File Handles: STDIN, STDOUT, STDERR


8


IV. Subroutines (creating your own verbs/functions)


Subroutines are user defined functions, which are usually found after the main body o f
the program. Subroutines can return a scalar or a array, but not hashes.


sub lowercase2ns {


my $seq=shift;


$seq =~ tr/actg/nnnn/;


return $seq;

}


sub revcomp {


my

$seq=shift;


$seq = reverse $seq;


$seq=~ tr/ACTGNatcgn/TGACNtagcn/;


return $seq

}



sub translate_sequence {



my $seq=shift;



my $protein='';


my %codons=(
'AAA'=>’K’, 'AAG'=>'K', 'AAC'=>'N', etc.)



if ( length($seq) % 3 == 0 ) {





print "
WARNING: s
eq length not a multiple of
3!
\
n
";




}



while (length($seq) >=3) {




if ($seq =~ s/([atcgATCGNn]{3})// ) {




my $c=$1;




$c=uc($codon);





$protein = $protein . $codons{$c};




}



}



return $protein;


}


9


IV. Controling the Flow (making choices and doing it

again and again)


while (
TEST

) {
BLOCK OF CODE
}


while (<FILE>) { …. }

foreach $var (
LIST

) {
BLOCK OF CODE
}


for (start_value; testval; increment value) { BLOCK OF
CODE}

for (my $i=0; $i<10; $i++ ) {print $i}

if (
TEST

) {
BLOCK OF CODE
}

if ( TES
T ) { BLOCK OF CODE

} elsif (TEST2) { BLOCK OF CODE

} else { BLOCK OF CODE

}


next



go back to beginning and re
-
evaluate the code

last


break out of the loop exiting any block of code


while($x< $base_length) {


#do something


$x++;

}


#batch job

fo
reach $f (@files) {


print "repeatmasking $f
\
n";


system "repeatmasker $f

qq

xsmall";


system "megablast

i $f

D2

o $f.bo";


system "blastparser

in $f.bo

o $f.bo.parse";


system "parasight
-
align $f.bo.parse";

}


#processing tab
-
delimited file#

whil
e (<IN>) {


s/
\
r
\
n/
\
n/;



#de ms
-
dos line breaks


chomp;




#remove newline

@c =split /
\
t/;


#use a space for repeatmasker

for(my $i=0; $i< scalar @c; $i++) {


print "$i)$c[$i] ";

}

print "
\
n";

my $pause=<STDIN>;


#pause and wait for enter

}



1
0

V. Mo
dules (let's others do the work)


http://search.cpan.org/


use Data::Dumper;

use Bio::SeqIO;

use GD::Window;




Modules for almost anything.



Even modules to help build modules.



Systematic Modules for Biology

http://www
.bioperl.org/wiki/Main_Page



let's you do things in just a few lines of code



however, it may take you a few hours to find them


The Bioperl toolkit: Perl modules for the life sciences.

Genome Res 2002 Oct;12(10):1611
-
8

Stajich JE, Block D, Boulez K, Brenner

SE, Chervitz SA, Dagdigian C, Fuellen G,
Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock
MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E.




KEY: Start with examples from others:

http://www.biop
erl.org/wiki/HOWTOs

http://www.bioperl.org/Core/Latest/bptutorial.html



use Bio::SeqIO;


my $inseq = Bio::SeqIO
-
>new(
-
file => $ARGV[0],


-
format => 'fasta' );


while ( my $seq=$inseq
-
>next_seq ) {


my $filename='';



if ($seq
-
>id =~ /
\
|([A
-
Z][A
-
Z0
-
9.]+)
\
|/ ) {



$filename=$1;


}


my $outseq= Bio::SeqIO
-
>new(
-
file => ">$filename",





-
format => 'fasta');



$outseq
-
>write_seq($seq);


}


1
1



Final Tips:



Use lots of comments. (Other people looking at your code can never get
enough).



use strict; It makes debugging much easier. Helps catch typos.



$_ is the default variable that will be defined or acted on by many
functions if no variable is explicit


this can make code simpler but
sometimes harder to understand especially for a begin
ner



complex data structures:
http://www.perl.com/doc/FMTEYEWTK/pdsc/



e.g. matrices, databases, complex relationships, etc.






One Perl liners can help with quick data manipulation:
http://www.theperlreview.com/Articles/v0i1/one
-
liners.pdf



Regular Expressions: A very powerful tool within Perl and many other programs.
A good introduction is at http://en.wikipedia.org/w
iki/Regular_expression