Perl for bioinformatics


Oct 2, 2013


Chapter 5 Motifs and Loops


Chapter goals

Search for motifs in DNA or protein

Interact with users at the keyboard

Write data to files

Use loops

Use basic regular expressions

Take different actions depending on the
outcome of conditional tests

Examine sequence data in detail by operating
on strings and arrays

5.1 Flow Control

Flow control is the order in which the statements
of a program are executed.

By default, statements run in order from top to bottom.
There are two ways to tell a program to do
otherwise: conditional statements and loops.

A conditional statement executes a group of
statements only if the conditional test succeeds;
otherwise, it just skips the group of statements.

A loop repeats a group of statements until an
associated test fails. (This is the difference between
conditional statements and loops.)

5.1.1 Conditional Statements


The if, if-else, and unless conditional statements
are three such testing mechanisms in Perl.

The main feature of these kinds of constructs is
the testing for a conditional. A conditional
evaluates to a true or false value. If the
conditional is true, the statements following are
executed; if the conditional is false, they are
skipped (or vice versa).

What Truth Means to Perl

The rules are as follows:

The number 0 is false.

The empty string ("") and the string "0" are false.
The undefined value undef is false.

Everything else is true.

True or False Examples
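The rules for truth can be checked directly. The sketch below is only an illustration: the helper name truth and the sample values are assumptions chosen for this example, not taken from the chapter.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether Perl treats a value as true.
sub truth {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

# Label/value pairs chosen for illustration.
my @examples = (
    [ 'number 0',       0     ],
    [ 'empty string',   ''    ],
    [ 'string "0"',     '0'   ],
    [ 'undef',          undef ],
    [ 'number 1',       1     ],
    [ 'string "0.0"',   '0.0' ],
    [ 'string "00"',    '00'  ],
    [ 'a single space', ' '   ],
);

foreach my $example (@examples) {
    my ( $label, $value ) = @$example;
    printf "%-16s is %s\n", $label, truth($value);
}
```

Only the first four values are false. Strings such as "0.0" and "00" are true, because the only false strings are the empty string and exactly "0".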



unless is the opposite of if. It works like the
English word "unless":
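A minimal sketch of unless, assuming a made-up DNA string and a hypothetical helper check_sequence that are not part of the chapter:

```perl
use strict;
use warnings;

my $dna = 'ACGTTGCA';    # illustrative sequence, not from the text

# Hypothetical helper: the block runs only when the test is false,
# that is, when the sequence does NOT contain an N.
sub check_sequence {
    my ($sequence) = @_;
    my $message = 'ambiguous base found';
    unless ( $sequence =~ /N/i ) {
        $message = 'no ambiguous bases';
    }
    return $message;
}

print check_sequence($dna), "\n";
```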

If the conditional evaluates to true, no
action is taken; if it evaluates to false, the
associated statements are executed.

Conditional tests and matching braces

The string comparison operators (such as lt and gt) decide
"greater than" and "less than" by examining each character
left to right and comparing them in ASCII order. This means
that strings sort in ascending order: most punctuation first,
then numbers, then uppercase, and finally lowercase. For
example, 1506 compares less than Happy, which compares less
than happy.
Having the same number of left and right
braces in the right places is essential for a
Perl program to run correctly.

5.1.2 Loops

There are several ways to loop in Perl:
while loops, for loops, foreach loops, and
more.

open and unless
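One common use of unless is to test whether open succeeded. The following is a sketch under assumptions: the helper read_lines and the filename no_such_file.pep are hypothetical, and the filehandle name PROTEINFILE simply follows the one used later in these notes.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a file, or an empty list
# (after printing a message) if the file cannot be opened.
sub read_lines {
    my ($filename) = @_;
    unless ( open( PROTEINFILE, $filename ) ) {
        print "Could not open file $filename!\n";
        return;
    }
    my @lines = <PROTEINFILE>;
    close PROTEINFILE;
    return @lines;
}

my @protein = read_lines('no_such_file.pep');    # hypothetical filename
print "Read ", scalar(@protein), " lines\n";
```

open returns a false value on failure, so the block after unless runs only when the file could not be opened.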

Conditionals allow you to tailor a program
to several alternatives.

Loops harness the speed of the computer
so that in a few lines of code, you can
handle large amounts of input or
continually iterate and refine a computation.
5.2 Code Layout

Good formatting makes code easy to read.

5.3 Finding Motifs

Perl has a handy set of features for finding
things in strings. This, as much as anything, has
made it a popular language for bioinformatics.

Getting user input from the keyboard

Joining lines of a file into a single scalar variable

Regular expressions and character classes

do-until loops

Pattern matching

5.3.1 Getting User Input from the Keyboard
A filehandle and the angle bracket input
operator are used to read in data from an
opened file into an array, like so:

@protein = <PROTEINFILE>;

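The same operator, applied to the STDIN filehandle, reads
one line typed at the keyboard into a scalar:

$proteinfilename = <STDIN>;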


chomp removes the trailing newline from the input
collected from the user at the keyboard.
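A sketch of reading a line and removing its newline with chomp. So that it runs without a person typing, the keyboard is simulated here with an in-memory filehandle; that stand-in and the filename are assumptions for illustration, and a real program would read from <STDIN>.

```perl
use strict;
use warnings;

# Stand-in for the keyboard: an in-memory filehandle holding one typed line.
# With a real user, read from <STDIN> instead.
my $typed = "NM_021964fragment.pep\n";    # hypothetical filename
open( my $keyboard, '<', \$typed ) or die "Cannot open in-memory handle: $!";

my $proteinfilename = <$keyboard>;    # still ends with "\n" at this point
chomp $proteinfilename;               # drop the trailing newline

print "Filename: [$proteinfilename]\n";
```

Without chomp, the newline would stay attached to the filename and a later open on that name would most likely fail.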

5.3.2 Turning Arrays into Scalars with join


join collapses an array @protein by
combining all the lines of data into a single
string stored in a new scalar variable $protein:

$protein = join( '', @protein);

You specify the empty string to be placed
between the lines of the input file. The
empty string is represented with the pair of
single quotes ''.
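A short sketch of join, using made-up protein lines (the data and the variable $with_dash are assumptions for illustration):

```perl
use strict;
use warnings;

# Lines as they might come from a file; each still ends with a newline.
my @protein = ( "MNIDDKLEGL\n", "FLKCGGIDEM\n", "QSSRTMVVMG\n" );

# Nothing is placed between the elements, so the lines are simply concatenated.
my $protein = join( '', @protein );

# For comparison, a visible separator between elements.
my $with_dash = join( '-', 'A', 'B', 'C' );

print $protein;
print "$with_dash\n";
```

The joined string still contains the newlines that were inside each line; join with '' adds nothing but also removes nothing.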

5.3.3 do-until Loops

A do-until loop first executes a block and then does a
conditional test, so the block always runs at least once.
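A sketch of a do-until loop that keeps collecting motifs until an empty response arrives. The list @responses simulates what a user might type and is an assumption for illustration; a real program would read each response from <STDIN>.

```perl
use strict;
use warnings;

# Simulated user responses; the final bare newline means "stop".
my @responses = ( "AAC\n", "GGT\n", "\n" );
my @motifs;
my $motif;

do {
    $motif = shift @responses;    # stands in for: $motif = <STDIN>;
    chomp $motif;
    push @motifs, $motif if $motif ne '';
} until ( $motif eq '' );

print "Motifs entered: @motifs\n";
```

Because the test comes after the block, the prompt-and-read step happens once before the program ever checks whether to stop.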

5.3.4 Regular Expressions

Regular expressions let you easily
manipulate strings of all sorts, such as
DNA and protein sequence data.

Regular expressions and character classes

Regular expressions are ways of matching one
or more strings using special wildcard-like operators.
$protein =~ s/\s//g;

\s is one of several metasymbols; it matches a
whitespace character. \s can also be written as
a character class: [ \t\n\f\r]
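A sketch of removing whitespace from sequence data, assuming the substitution intended here is s/\s//g and using a made-up protein string. The character class version is shown alongside for comparison.

```perl
use strict;
use warnings;

# Made-up sequence containing a newline, a space, and a tab.
my $protein = "MNIDDKLEGL\nFLKCGGIDEM QSSR\tTMVV\n";

# Copy the string, then delete every whitespace character from the copy.
( my $clean = $protein ) =~ s/\s//g;

# The same deletion written with an explicit character class.
( my $clean_class = $protein ) =~ s/[ \t\n\f\r]//g;

print "$clean\n";
```

The g modifier matters: without it, only the first whitespace character would be removed.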

if ( $motif =~ /^\s*$/ )

This test succeeds when the string, from its
beginning (indicated by the ^), is zero or more
(indicated by the *) whitespace characters
(indicated by the \s) until the end of the string
(indicated by the $).

Pattern matching with =~ and regular expressions

Search for an A followed by a D or S, followed
by a V: A[DS]V

Search for K, N, zero or more D's, and two or
more E's (note that {2,} means "two or more"): KND*E{2,}

Search for two E's, followed by anything,
followed by another two E's: EE.*EE

Notice that a period stands for any character
except a newline, and ".*" stands for zero or
more such characters.
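The patterns A[DS]V, KND*E{2,}, and EE.*EE can be tried against a string with the =~ operator. This is a sketch; the protein sequence is made up for illustration, and the hash names are assumptions.

```perl
use strict;
use warnings;

my $protein = 'MKNDDEEEASVLLEEGGEEK';    # made-up sequence for illustration

# The three patterns, keyed by their written form.
my %patterns = (
    'A[DS]V'    => qr/A[DS]V/,
    'KND*E{2,}' => qr/KND*E{2,}/,
    'EE.*EE'    => qr/EE.*EE/,
);

# Record the matched substring for each pattern that is found.
my %found;
foreach my $name ( sort keys %patterns ) {
    if ( $protein =~ /($patterns{$name})/ ) {
        $found{$name} = $1;
        print "$name matched: $1\n";
    }
    else {
        print "$name not found\n";
    }
}

# A blank-input test: only whitespace from start (^) to end ($).
my $is_blank     = ( "  \n" =~ /^\s*$/ ) ? 1 : 0;
my $is_not_blank = ( 'AAC'  =~ /^\s*$/ ) ? 1 : 0;
```

Both * and {2,} are greedy, so each takes as many characters as it can while still letting the whole pattern match.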

5.4 Counting Nucleotides

Explode the DNA into an array of single
bases, and iterate over the array (that is,
deal with the elements of the array one by one).

Use the substr Perl function to iterate over
the positions in the string of DNA while counting.

5.5 Exploding Strings into Arrays

Explode the string of DNA into an array with split.

This is the inverse of the join function.

split with an empty string as the
first argument causes the string to explode
into individual characters:

@DNA = split( '', $DNA);
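A sketch of split with an empty first argument, using a made-up DNA string. It also joins the pieces back together to show that the two functions undo each other.

```perl
use strict;
use warnings;

my $DNA = 'ACGTTGCA';    # illustrative sequence

# One array element per character.
my @DNA = split( '', $DNA );

# join with an empty string reverses the split.
my $rejoined = join( '', @DNA );

print "Bases: @DNA\n";
print "Number of bases: ", scalar(@DNA), "\n";
```

Note that $DNA and @DNA are two separate variables in Perl, even though they share a name.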

5.6 Operating on Strings

The loop test checks whether the position reached in the
string is less than the length of the string. It uses
the length Perl function.

for ( $position = 0 ; $position < length $DNA ; ++$position )

This for loop is equivalent to the following while loop:

$position = 0;
while ( $position < length $DNA ) {
    # the same statements in the block, plus ...
    ++$position;
}

For loops vs While loops


The for loop brings the initialization and
increment of a counter ($position) into the
loop statement, whereas in the while loop
they are separate statements.
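To compare the two forms side by side, here is a sketch that counts the A's in a made-up DNA string once with a for loop and once with a while loop. The counter names are assumptions for illustration.

```perl
use strict;
use warnings;

my $DNA = 'ACGTACGTAA';    # illustrative sequence

# for loop: initialization, test, and increment all sit in the loop statement.
my $count_for = 0;
for ( my $position = 0 ; $position < length $DNA ; ++$position ) {
    ++$count_for if substr( $DNA, $position, 1 ) eq 'A';
}

# while loop: initialization comes before, increment goes inside the block.
my $count_while = 0;
my $position    = 0;
while ( $position < length $DNA ) {
    ++$count_while if substr( $DNA, $position, 1 ) eq 'A';
    ++$position;
}

print "for: $count_for, while: $count_while\n";
```

Forgetting the ++$position line in the while version produces an infinite loop, which is one reason the for form is often preferred for counters.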


$base = substr($DNA, $position, 1);

You look at just one character at a time, so you call
substr on the string $DNA, ask it to look in
position $position for one character, and
save the result in the scalar variable $base.

By default, Perl assumes that a string
begins at position 0 and its last character
is at a position that's numbered one less
than the length of the string.
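A sketch of zero-based positions with substr, using a made-up four-base string:

```perl
use strict;
use warnings;

my $DNA = 'ACGT';    # illustrative sequence

# Position 0 is the first character.
my $first = substr( $DNA, 0, 1 );

# The last character sits at position length - 1.
my $last = substr( $DNA, length($DNA) - 1, 1 );

# The third argument is a length, so this takes two characters from position 1.
my $middle = substr( $DNA, 1, 2 );

print "first=$first last=$last middle=$middle\n";
```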

5.7 Writing to Files

To write to a file, you do an open call, just
as when reading from a file, but with a
difference: you prepend a greater-than
sign > to the filename.
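A sketch of opening a file for writing. To avoid overwriting anything real, it writes to a scratch file created with the core File::Temp module; the filehandle name COUNTBASE and the counts written are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so the sketch does not overwrite anything real.
my ( $tmp_fh, $outputfile ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Prepending > to the filename opens it for writing, replacing old contents.
unless ( open( COUNTBASE, ">$outputfile" ) ) {
    die "Cannot open file '$outputfile' to write to: $!";
}
print COUNTBASE "A=4 C=2 G=2 T=2\n";    # made-up counts
close COUNTBASE;

# Read the file back to confirm what was written.
open( my $check, '<', $outputfile ) or die "Cannot reopen $outputfile: $!";
my $written = <$check>;
close $check;

print "Wrote: $written";
```

Modern Perl code often uses the three-argument form, open( my $fh, '>', $outputfile ), which keeps the mode separate from the filename.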

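while ( $dna =~ /a/ig ) { $a++ }

i is a modifier that makes the match case-insensitive,
which means the pattern matches a or A. The g modifier
makes the match global, so each pass through the loop
finds the next match and increments the count in $a.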
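A sketch of counting bases with a global, case-insensitive match. The DNA string is made up, and the counters are named $count_a and $count_c here (an assumption for this example) because $a and $b are also the special variables Perl uses inside sort.

```perl
use strict;
use warnings;

my $dna = 'AcgtAAtgcatA';    # illustrative mixed-case sequence

my ( $count_a, $count_c ) = ( 0, 0 );

# Each successful match resumes where the previous one ended (g),
# and matches either case (i).
while ( $dna =~ /a/ig ) { $count_a++ }
while ( $dna =~ /c/ig ) { $count_c++ }

print "A=$count_a C=$count_c\n";
```

When the pattern finally fails, the search position resets, so the second loop starts again from the beginning of the string.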