Programming and Perl

whooploafSoftware and s/w Development

Dec 13, 2013


Programming and Perl for Bioinformatics

Part I

Why Write Programs?


Automate computer work that you do by hand: save time and reduce errors

Run the same analysis on lots of similar data files (scale up)

Analyze data, make decisions
    e.g. sort BLAST results by e-value and/or species of best match


Build a pipeline


Create new analysis methods


Why Perl?


Fairly easy to learn the basics


Many powerful functions for working with text: search & extract, modify, combine

Can control other programs

Free and available for all major operating systems

One of the most popular languages in bioinformatics

Many pre-built "modules" are available that do useful things

Get Perl


You can install Perl on any type of computer


Just log in: you don't even need to type any command to make Perl active.

Download and install Perl on your own computer: www.perl.org

Programming Concepts


Program = a text file that contains instructions for the computer to follow

Programming Language = a set of commands that the computer understands (via a "command interpreter")

Input = data that is given to the program

Output = something that is produced by the program

Programming


Write the program (with a text editor)


Run the program


Look at the output


Correct the errors (debugging)


Repeat

Computers are VERY dumb: they do exactly what you tell them to do, so be careful what you ask for…

Strings


Text is handled in Perl as a string

This basically means that you have to put quotes around any piece of text that is not an actual Perl instruction.

Perl has two kinds of quotes: single '...' and double "..."
(they are different; more about this later)

Print


Perl uses the term "print" to create output

Without a print statement, you won't know what your program has done

You need to tell Perl to put a newline at the end of a printed line

Use the \n (newline) escape sequence

Put it inside the double quotes

The "\" character is called an escape character; Perl uses it a lot

A Taste of Perl: print a message


hello_world.pl: Greet the entire world.

#!/usr/bin/perl -w
#greet the entire world
$x = 6e9;
print "Hello world!\n";
print "All $x of you!\n";

The parts of this program:

#!/usr/bin/perl -w         command interpretation header
#greet the entire world    a comment
$x = 6e9;                  variable assignment statement
print "...";               function calls (output statements)

Variables


Up till now, we've been telling the computer exactly what to print. But in order for the program to generate what is printed, we need to use variables.

A scalar variable name starts with "$"

It can store either a string or a number.

Basic Syntax and Data Types


Whitespace doesn't matter to Perl (outside of quoted strings). One can write all statements on one line

All Perl statements end in a semicolon ";" just like C

Comments begin with '#' and Perl ignores everything after the # until end of line.

Example: #this is a comment

Perl has three basic data types:


scalar


array (list)


associative array (hash)

Variables


To be useful at all, a program needs to be able to store information from one line to the next

Perl stores information in variables

A scalar variable name starts with the "$" symbol, and it can store strings or numbers

Variables are case sensitive

Give them sensible names

Use the "=" sign to assign values to variables

$one_hundred = 100;
$my_sequence = "ttattagcc";

Scalars


Scalar variables begin with $ followed by an identifier

Example: $this_is_a_scalar;

An identifier is composed of upper or lower case letters, numbers, and the underscore '_'; a name you choose cannot start with a number. Identifiers are case sensitive (like all of Perl)

$progname = "first_perl";
$numOfStudents = 4;

= ("gets") sets the content of $progname to be the string "first_perl" and $numOfStudents to be the integer 4

Scalar Values


Numerical Values

integer: 5, "3", 0, -307

floating point: 6.2e9, -4022.33

NOTE: Perl converts freely between integers, floating-point numbers ("double" precision) and numeric strings such as "3", so you rarely need to care how a number is stored
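A minimal sketch of how the values above behave in arithmetic. The variable names are made up for illustration, and int() is a built-in that has not been introduced yet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A string that looks like a number is converted when used in arithmetic.
my $sum = "3" + 5;

# Scientific notation is just another way to write a number.
my $people = 6.2e9;

# Dividing two integers gives a floating-point result, not a truncated one.
my $ratio = 7 / 2;

# int() throws away the fractional part when a whole number is needed.
my $whole = int($ratio);

print "sum = $sum\n";        # sum = 8
print "people = $people\n";  # people = 6200000000
print "ratio = $ratio\n";    # ratio = 3.5
print "whole = $whole\n";    # whole = 3
```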



A program with variables

#!/usr/bin/perl -w
#this program uses variables containing numbers
my $two = 2;
my $three = $two + 1;
print "\$two * \$three = $two * $three = ", ($two * $three);
print "\n";


Do the Math


Mathematical operators work pretty much as you would expect:

4+7
6*4
43-27
256/12
2/(3-5)

Example

#!/usr/bin/perl
print "4+5\n";
print 4+5 , "\n";
print "4+5=" , 4+5 , "\n";
$myNumber = 88;

Note: use commas to separate multiple items in a print statement

What will be the output?

4+5
9
4+5=9

Scalar Values


String values


Example:


$day = "Monday";
print "Happy Monday!\n";
print "Happy $day!\n";
print 'Happy Monday!\n';
print 'Happy $day!\n';

Double-quoted: interpolates (replaces variable name/control character with its value)

Single-quoted: NO interpolation done (as-is)

What will be the output?

print "Happy Monday!\n";    Happy Monday!<newline>
print "Happy $day!\n";      Happy Monday!<newline>
print 'Happy Monday!\n';    Happy Monday!\n
print 'Happy $day!\n';      Happy $day!\n

String Manipulation

Concatenation



$dna1 = "ACTGCGTAGC";
$dna2 = "CTTGCTAT";

Juxtapose in a string assignment or print statement:

$new_dna = "$dna1$dna2";

Or use the concatenation operator "." :

$new_dna = $dna1 . $dna2;

Substring

$dna = "ACTGCGTAGC";
$exon1 = substr($dna,2,5);   # TGCGT

substr(STRING, OFFSET, LENGTH): positions are counted from 0, so 2 is the offset of the first character to take and 5 is the length of the substring
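A small sketch that combines substr and the concatenation operator on the same sequence. The variable names $first, $tail and $joined are made up for illustration.

```perl
use strict;
use warnings;

my $dna = "ACTGCGTAGC";

# substr(STRING, OFFSET, LENGTH): offsets are counted from 0.
my $exon1 = substr($dna, 2, 5);   # 5 characters starting at offset 2
my $first = substr($dna, 0, 1);   # the first base
my $tail  = substr($dna, 7);      # no LENGTH: everything from offset 7 on

# Join the pieces back together with the concatenation operator.
my $joined = $first . $exon1 . $tail;

print "exon1  = $exon1\n";   # TGCGT
print "first  = $first\n";   # A
print "tail   = $tail\n";    # AGC
print "joined = $joined\n";  # ATGCGTAGC
```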

Substitution

DNA transcription: T -> U

Substitution operator s/// :

$dna = "GATTACATACACTGTTCA";
$rna = $dna;
$rna =~ s/T/U/g;   # "GAUUACAUACACUGUUCA"

=~ is a binding operator that tells Perl to examine the contents of $rna for a match to the pattern; "g" means global (replace every match, not just the first)

Ex: Start with $dna = "gaTtACataCACTgttca"; and do the same as above. What will be the output?

Example


transcribe.pl:

$dna = "gaTtACataCACTgttca";
$rna = $dna;
$rna =~ s/T/U/g;
print "DNA: $dna\n";
print "RNA: $rna\n";

Does it do what you expect? If not, why not?

Patterns in substitution are case-sensitive! What can we do?

Convert all letters to upper/lower case (preferred when possible)

If we want to retain mixed case, use the transliteration/translation operator tr///

$rna =~ tr/tT/uU/;   #replace all t by u, all T by U
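The three approaches can be compared side by side on the mixed-case sequence. This is only a sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $dna = "gaTtACataCACTgttca";

# Plain substitution: only the upper-case T's are replaced.
my $rna_plain = $dna;
$rna_plain =~ s/T/U/g;

# Option 1: convert to upper case first, then substitute.
my $rna_upper = uc $dna;
$rna_upper =~ s/T/U/g;

# Option 2: keep the mixed case and transliterate both t and T.
my $rna_mixed = $dna;
$rna_mixed =~ tr/tT/uU/;

print "plain: $rna_plain\n";  # gaUtACataCACUgttca
print "upper: $rna_upper\n";  # GAUUACAUACACUGUUCA
print "mixed: $rna_mixed\n";  # gaUuACauaCACUguuca
```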

Case conversion

$string = "acCGtGcaTGc";

Upper case:

$dna = uc($string);   # "ACCGTGCATGC"
or $dna = uc $string;
or $dna = "\U$string";   # \U : string directive

Lower case:

$dna = lc($string);   # "accgtgcatgc"
or $dna = "\L$string";

Sentence case:

$dna = ucfirst(lc($string));   # "Accgtgcatgc" (ucfirst alone would give "AcCGtGcaTGc")
or $dna = "\u\L$string";

Reverse Complement

5'- A C G T C T A G C . . . . G C A T -3'
3'- T G C A G A T C G . . . . C G T A -5'

Reverse: reverses a string

$string = "ACGTCTAGC";
$string = reverse($string);   # "CGATCTGCA"

Complementation: use transliteration operator

$string =~ tr/ACGT/TGCA/;


What’s Wrong?


$DNA = "ACGTCTAGC";
print "$DNA\n\n";
$revcom = reverse $DNA;
# Next substitute all bases by their complements,
# A->T, T->A, G->C, C->G
$revcom =~ s/A/T/g;
$revcom =~ s/T/A/g;
$revcom =~ s/G/C/g;
$revcom =~ s/C/G/g;
# Print the reverse complement DNA onto the screen
print "$revcom\n";
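One way to see the problem: each s///g runs on the result of the previous one, so the second substitution undoes the first. The sketch below runs the broken version next to a tr/// version, which maps every character once in a single pass. The name $broken is made up for illustration.

```perl
use strict;
use warnings;

my $DNA = "ACGTCTAGC";

# Broken version: each substitution also hits bases changed by the one before.
my $broken = reverse $DNA;
$broken =~ s/A/T/g;
$broken =~ s/T/A/g;   # turns the new T's back into A's, plus the original T's
$broken =~ s/G/C/g;
$broken =~ s/C/G/g;   # same problem for G and C

# Working version: tr/// maps every character once, in a single pass.
my $revcom = reverse $DNA;
$revcom =~ tr/ACGT/TGCA/;

print "original: $DNA\n";     # ACGTCTAGC
print "broken:   $broken\n";  # GGAAGAGGA
print "revcom:   $revcom\n";  # GCTAGACGT
```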

More on String Manipulation

String length:



length( $dna )


Index:



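# index STR,SUBSTR,POSITION
index( $strand, $primer, 2 )

POSITION is optional (default 0). index returns the position of the first match at or after POSITION, or -1 if SUBSTR is not found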
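A short sketch of length and index on a sample strand. The sequence, the primer and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $strand = "GATTACATACACTGTTCA";
my $primer = "ACA";

my $len    = length($strand);             # number of characters
my $first  = index($strand, $primer);     # search from position 0
my $second = index($strand, $primer, 5);  # search from position 5 onward
my $none   = index($strand, "GGG");       # -1 means "not found"

print "length: $len\n";          # length: 18
print "first match: $first\n";   # first match: 4
print "next match: $second\n";   # next match: 8
print "GGG: $none\n";            # GGG: -1
```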

Flow Control

Conditional Statements


Parts of code are executed depending on the truth value of a logical statement

"Truth" (logical) values in Perl:

false = {0, 0.0, 0e0, "", "0", undef}; a false comparison returns ""

true = anything else; a true comparison returns 1

($a, $b) = (75, 83);
if ( $a < $b ) {
    $a = $b;
    print "Now a = b!\n";
}
if ( $a > $b ) { print "Yes, a > b!\n" }   # Compact
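The truth rules can be checked with a small helper. This is only a sketch: truth() is a made-up name, and "sub" (a user-defined function) is not covered in this part.

```perl
use strict;
use warnings;

# Report whether a value counts as true or false in a condition.
sub truth {
    my ($value) = @_;
    if ($value) {
        return "true";
    }
    else {
        return "false";
    }
}

print "0     is ", truth(0),     "\n";  # false
print "''    is ", truth(""),    "\n";  # false
print "'0'   is ", truth("0"),   "\n";  # false
print "undef is ", truth(undef), "\n";  # false
print "'0.0' is ", truth("0.0"), "\n";  # true: a non-empty string other than "0"
print "'ACG' is ", truth("ACG"), "\n";  # true
print "-1    is ", truth(-1),    "\n";  # true
```

Note the difference between the number 0.0 (false) and the string "0.0" (true): only the empty string and the string "0" are false strings.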

Comparison Operators

Comparison                 String   Number
Equality                   eq       ==
Inequality                 ne       !=
Greater than               gt       >
Greater than or equal to   ge       >=
Less than                  lt       <
Less than or equal to      le       <=
Comparison (three-way)     cmp      <=>

The first six operators return 1 (true) or null, i.e. the empty string (false); cmp and <=> return -1, 0, or 1
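Choosing the string or the number column matters, because the same two values can compare differently. A minimal sketch, with made-up variable names and a gene name used only as sample text:

```perl
use strict;
use warnings;

my ($x, $y) = (9, 10);

# Numeric comparison: 9 is less than 10.
if ($x < $y)  { print "$x < $y as numbers\n"; }

# String comparison goes character by character, so "9" sorts after "10".
if ($x gt $y) { print "$x gt $y as strings\n"; }

# <=> and cmp return -1, 0 or 1.
my $numeric_order = $x <=> $y;   # -1
my $string_order  = $x cmp $y;   #  1
print "numeric: $numeric_order, string: $string_order\n";

# Names are text, so compare them with eq, not ==.
my $gene = "BRCA1";
if ($gene eq "BRCA1") { print "found $gene\n"; }
```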

Logical Operators

Operation   Computerese   English version
AND         &&            and
OR          ||            or
NOT         !             not

if/else/elsif


allows for multiple branching/outcomes


$a = rand();
if ( $a < 0.25 ) {
    print "A";
}
elsif ( $a < 0.50 ) {
    print "C";
}
elsif ( $a < 0.75 ) {
    print "G";
}
else {
    print "T";
}
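The same if/elsif/else chain works with string tests and logical operators. A small sketch with a fixed input so the result is predictable; $base and $kind are made-up names.

```perl
use strict;
use warnings;

my $base = "G";
my $kind;

# Only the first branch whose test is true is executed.
if ($base eq "A" || $base eq "G") {
    $kind = "purine";
}
elsif ($base eq "C" || $base eq "T") {
    $kind = "pyrimidine";
}
else {
    $kind = "unknown";
}

print "$base is a $kind\n";   # G is a purine
```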

What’s a block?


In the case of an "if" statement:

If the test is true, execute all the command lines inside the { } braces. If not, then go on past the closing } to the statements below.

You can also do stuff in a block over and over again using a loop.

Conditional Loops

while ( statement ) { commands … }

repeats commands until statement is no longer true

do { commands } while ( statement );

same as while, except commands are executed at least once

NOTE the ";" after the while statement

Short-circuiting commands: next and last

next;   #jumps to end, do next iteration
last;   #jumps out of the loop completely
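A sketch of next and last inside a while loop. The sequence is made up: "N" stands for a base to skip, and "X" is an invented stop marker used only for this example.

```perl
use strict;
use warnings;

my $dna      = "ACGTNNACGTAGXACGT";
my $position = 0;
my $counted  = 0;

while ($position < length($dna)) {
    my $base = substr($dna, $position, 1);
    $position = $position + 1;

    if ($base eq "N") { next; }   # skip this base, go on to the next iteration
    if ($base eq "X") { last; }   # leave the loop completely

    $counted = $counted + 1;
}

print "Counted $counted bases after reading $position characters\n";
# Counted 10 bases after reading 13 characters
```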

While-Loop

Loops test a condition and repeat a block of code based on the result

while loops repeat while the condition is true

$count = 1;
while ($count <= 10) {
    print "$count bottles of pop\n";
    $count = $count + 1;
}
print "POP!\n";

[Try this program yourself]

for and foreach loops


Execute a code loop a specified number of times, or for a specified list of values

for and foreach are identical: use whichever you want

Incremental loop ("C style"):

for ( $i=0 ; $i < 50 ; $i++ ) {
    $x = $i*$i;
    print "$i squared is $x.\n";
}

Loop over list ("foreach" loop):

foreach $name ( "Billy", "Bob", "Edwina" ) {
    print "$name is my friend.\n";
}
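Both loop styles applied to sequence data, as a rough sketch. The sequences and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $dna = "GATTACATACACTGTTCA";
my $gc  = 0;

# C-style loop: visit every position in the string.
for (my $i = 0; $i < length($dna); $i++) {
    my $base = substr($dna, $i, 1);
    if ($base eq "G" || $base eq "C") {
        $gc++;
    }
}
print "GC count: $gc of ", length($dna), "\n";   # GC count: 6 of 18

# foreach loop: do the same thing for each item in a list.
my $total = 0;
foreach my $seq ("ACGT", "GGCC", "AT") {
    $total = $total + length($seq);
    print "$seq has ", length($seq), " bases\n";
}
print "Total: $total bases\n";                   # Total: 10 bases
```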

Standard Input


To make the program do something, we need
to input data.


The angle bracket operator ( < > ) tells Perl to expect input, by default from the keyboard.

Usually this is assigned to a variable

print "Please type a number: ";
$num = <STDIN>;
print "Your number is $num\n";

chomp


When data is entered from the keyboard, the program waits for you to press the Enter (Return) key.

But the string which is captured includes a newline at its end

You can use the chomp function to remove the newline character:

print "Enter your name: ";
$name = <STDIN>;
print "Hello $name, happy to meet you!\n";
chomp $name;
print "Hello $name, happy to meet you!\n";
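The effect of chomp can be seen without typing anything by using a string that already ends in a newline. A minimal sketch; $line stands in for a line read with <STDIN>, and the name is made up.

```perl
use strict;
use warnings;

# Stand-in for a line read with <STDIN>: the text plus the newline from Enter.
my $line = "Rosalind\n";

my $before = length($line);   # 9: eight letters plus the newline
chomp $line;
my $after  = length($line);   # 8: the newline is gone

# chomp only removes a trailing newline, so a second call changes nothing.
chomp $line;

print "Hello $line, happy to meet you!\n";
print "Length went from $before to $after\n";   # Length went from 9 to 8
```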


Basic Data Types


Perl has three basic data types:


scalar


array (list)


associative array (hash)