Perl for Science Majors


Midwestern State University

Page
1



Perl for Science Majors

Designed to support CMPS 1023

V .6
1


Image from National Human Genome Research Institute



By Richard Simpson and Tina Johnson

Midwestern State University



2010





CONTENTS

1. What is Perl and DOS
2. Installing and using Notepad++
3. Installing and running Perl
4. Variables and Data Types in Perl
5. Input/Output
6. Mathematical operators
7. Simple Programs
8. The IF statement and Logical operators
9. The WHILE statement
10. File Input/Output
11. Regular Expressions
12. Searching Text files using RE's
13. More Arrays in Perl
14. Hashes and their uses



















1. What is Perl

Perl (not PERL) is a programming language developed by Larry Wall in 1987. Although it was originally intended as a scripting language for UNIX systems, it has evolved into a widely used programming language for text processing. This is not to say that it cannot handle numerical applications, but it has features that make the processing of text files relatively easy. Computer scientists refer to a language that can be used to solve almost any problem as a general-purpose language, which is what Perl is. Besides its applications in graphics programming, system administration, database systems, the Web and networks, the field of Bioinformatics has embraced Perl as its preferred (or at least its most popular) programming tool for DNA and other text processing. This little book will restrict its focus to applications in Biology, Chemistry, Physics, Mathematics and Geology, which is what the majority of science students major in.


We will be working this semester in an environment that is probably unfamiliar to you. This environment, which is our interface to the operating system, is called command line DOS. It looks very much like the command line interface found in the Unix and Linux operating systems, so what you learn here will help you if and when you work on those systems.

DOS (Disk Operating System) has been around since 1980 or so. It was developed to work with the new desktop microcomputers that hit the shelves about that time. The version that Bill Gates sold was called MS-DOS. Its popularity, and that of the Windows operating systems that followed, is what made Microsoft and its co-founder Bill Gates so wealthy. The fact that DOS is command line based means that the user interacts with the OS by entering commands on a line, one at a time. There is no mouse or GUI (graphical user interface) with icons to click on. It's all done through the keyboard. Although the original DOS was displayed on the entire screen, modern DOS is normally run within a window. To create this DOS window, click on the Start button of Windows, select Run from the menu, enter cmd in the popup that opens and click OK. You should get a window that looks like this

The line

    c:\Documents and Settings\richard.simpson>

is called the prompt. It indicates the directory that DOS is presently accessing, also known as the present working directory (pwd). In order to display the actual contents of this directory, just type in the command dir. The results of executing dir within DOS on a laptop are given below.


The directory contents and the initial directory will most likely be different on your computer. In this case let's look at what is displayed. Note the line

    06/20/2010 12:04 PM 11,149 gsview32.ini

in the display. This line gives the date and time the file gsview32.ini was last modified, as well as its size in bytes. The date is updated each time the file is modified. Other lines such as

    02/02/2009 11:46 AM <DIR> mydir

refer to directories, as indicated by the <DIR> field.


As you first learn to use DOS it might be wise to hide your mouse so it is not within easy reach. Remember, you are not supposed to use the mouse while interacting with the DOS interface. Of course, if you need to work in another window you can use the mouse to click on some other part of the Windows GUI in the background.

So what can we do in this command line window? The basic process is to type a command and press Enter to see its effect, and to do this over and over again. Although there are quite a few commands you can use, as given in the DOS command appendix, we will look at a few really useful ones here. As we discuss these commands, don't forget that pwd is shorthand for the present working directory.

There is one command that allows the user to move around the directory tree. This command is called cd (change directory) and can be used in several ways, as shown by the following examples.

    cd ..                      change to the parent dir (the .. is shorthand for the parent)
    cd species                 change to the species dir of the pwd (note that there is no slash)
    cd /bin                    change to the bin directory of the root
    cd /Simpson/Files/Perl/    change to the indicated directory if it exists within the dir tree

Each time you execute one of the above you should probably follow it with a dir command to see the contents of the new directory. The command cd species shown above will only work if a species directory is displayed when the dir command is executed, i.e. species is a dir in the pwd.

As an aside, if you want to change drives (for example C: to D:, where D: is your thumb drive) just type D: at the prompt without the cd.

The creation and deletion of directories is straightforward as well. Just type mkdir dir_name to create a new dir within the pwd. To delete an empty directory, type rmdir dir_name. You can create as many directories and subdirectories as you like with this method. If you want to create a subdirectory in, say, directory FileList, you must make FileList the pwd before you execute mkdir.

Another useful command is the type command (as in type file_name). This command is used to display the contents of text files (those made from ASCII codes only) that you see within the pwd. It will not work properly on .exe files, .doc files, .pdf files or many others. If you don't believe me, just type a .exe file and see what you get. If things go crazy while attempting something like this, just press Ctrl-C to kill the confusion and get back to the prompt.



The final set of commands that we will discuss here is the move and copy commands. The move command lets you move a file from one directory to another, with the original being deleted. The copy command does the same but keeps BOTH copies. In order to give some examples, assume that we have the tree displayed in homework 1.3 below and that the pwd is Exams. We will only discuss copy since move works similarly. The command that would copy A.txt to the Exams directory (i.e. the pwd) would be

    copy "\Text files\Letters\A.txt" .        (Note: the . is shorthand for the pwd)

In this case the entire path \Text files\Letters\A.txt, starting at the root, was used to select the file we want to copy. The path is written with backslashes because copy treats a forward slash as the start of an option, and it is placed in quotes because the directory name Text files contains a space. You could also have given the full path of the receiving directory, as in

    copy "\Text files\Letters\A.txt" "\Text files\Exams"

Assuming that we are in Letters, here is a command that will copy a file to its parent directory.

    copy b.txt ..        (Remember that .. always represents the parent directory of the pwd)

There are many variations of this command. What do you think this means?

    copy ..\Letters\A.txt .        (assuming that we are in Exams)

You start where you are and back up one directory, then go down into Letters to retrieve A.txt, copying it to the pwd. (GOT that?)

Homework

1.1 Open a DOS window and note the pwd. Remember its path is displayed in the prompt. Now move to the root by running cd \ and note that the new prompt should be C:\> Now move around the directory tree by executing dir and cd commands: cd name to go down into a directory and cd .. to back up. Move around the tree until you become comfortable with this process.

1.2 Insert a thumb drive (AKA geek stick) (AKA computer sticky thingy from a recent movie) into one of the USB ports. Use one that has no important information you want to keep. Move DOS to this drive by entering the drive letter at the prompt. For example, if D is the drive letter that the OS assigned to the thumb drive then just type D: at the prompt and press Enter. After running dir to see what's on the drive, delete all the files and directories on the drive, one at a time. Note that if you try to delete a directory that is not empty DOS will inform you of this. If so, change to that dir and delete the files in it first and then back out, via cd .. , to the parent again. Now you can delete it.

1.3 Starting with an empty thumb drive build the following dir tree. The rectangles are directories and the circles are files. Go to drive C and find a small .exe file and copy it to the bin directory as shown in the tree. Also copy a .doc (or .docx) file from drive C to your Exams directory. Within the Letters directory run Notepad at the command line and create two files, each with a sentence or two of data. Now go to the root directory and run the tree command (i.e. type tree and press Enter). You should see this tree drawn on its side.




    Root
        Other files
        Text files
            Letters
                A.txt            (file)
                B.txt            (file)
            Exams
                Any .doc file    (file)
        bin
            Any .exe file        (file)
        Perl programs


1.4 Starting with the above tree do the following

    a) copy the A.txt file to the Perl programs dir
    b) move the .exe file to the root dir and to the Other files directory.
    c) copy the B.txt file to Text files.


2. Installing and using Notepad++

In order to write and run Perl programs, there are two programs that will need to be downloaded and installed on your home system. The computers in our class and in many of the labs here at MSU already have these tools loaded for your convenience.

The first tool we need to download is called Notepad++. This is a FREE text editor that we will use to write Perl programs. Many of you are already familiar with Notepad, which comes with Windows and was used in a previous homework. Notepad++ is a greatly improved upgrade. At this point in time you may download the software from http://notepad-plus-plus.org. If the link goes away, just Google notepad++ and you will be given a variety of download sites that you can use. On the above site just click on the Download tab and then click Download the current version. Select the Installer.exe version. This will grab the most recent release. Release 6.1.8 is the current one at the time of this writing. By the time you read this the release number may well have increased. No worries, just click on the most recent offering. Once the installer is downloaded you may execute it. This should install Notepad++. When you first execute this little editor you should get a screen that looks like the following.


Don't be discouraged by all the options; we will use only a few of them. Notepad++ basically works just like any other editor that you have worked with. You type in some text and then save the file after giving it a unique name. Check out the installation by typing in the following Perl program and saving it under the name check.pl, where the pl extension tells us that this is a Perl program.


In order to get this to display instructions using colored syntax you might need to click on the Language tab and select Perl. This editor has been designed to work with a lot of different languages. In this case you will note that comments (everything after a # symbol) are colored green and the print instructions are colored blue. Other things such as variable names will be colored differently. This is a great help to those trying to read a Perl program. In order to save the above code, click on File and then Save As and select the correct directory. Normally in class we will use D:\Students or something similar. I don't have a drive D: at the moment so I am using drive C:. Use drive D: in the lab during class. After saving the file it should look like the following. Note the name at the top. C:\Students\check.pl gives you both the directory the file is in and its name. Saving to a jump drive is also an option.




We will assume that you have had enough experience with editors such as Microsoft Word to use the above applications. If you have any problems please ask the Lab assistant or the Instructor. In order to run the above program we will need to install Perl, which is what we will do in the next section.

3. Installing and running Perl

There are many versions of Perl on the web but we will be using ActivePerl in the labs in class. The website is http://www.activestate.com/activeperl/downloads. Here you can download either the regular 32 bit version or the 64 bit one. If you have Windows 7 (64 bit) or Windows XP 64 then you can download the 64 bit version. If you don't have a clue what you have, just download the 32 bit version (i.e. x86) and it should work in either case. The file you will download is a Microsoft Windows Installer package (note the .msi extension). Save the file and run it. It should be named something like ActivePerl-5-12.2.1202-MSWin32-x86-293621.msi if you are downloading the 32 bit (x86) version. It will lead you through the installation process. The version number may well have changed by the time you read this. If you want to know if you have a 64 bit system, just go to Computer and click System properties at the top of the window. The system type will indicate whether you have a 64 bit operating system.
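
Once the installer has finished, a saved program is run from the DOS prompt by typing perl followed by the file name. As a rough check that the installation worked, something like the sketch below could be saved and run with perl version.pl. The file name version.pl and the message text are assumptions made for illustration. The lines use strict and use warnings and the word my are optional safety habits that this booklet does not otherwise use.

```perl
# version.pl - hypothetical file name; run it with:  perl version.pl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

# $^V is a built-in variable holding the version of the running Perl.
# The %vd format prints it as dotted numbers, for example 5.12.2.
my $version = sprintf("%vd", $^V);

print "Perl version $version is installed.\n";
```

If the prompt answers with a version number, Perl is installed and on the path. If DOS reports that perl is not recognized, the installation (or the path setting) needs another look.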


4. Variables and Data Types in Perl

A variable is a reference to (or name of) a unique memory location that is used to temporarily store information. Unlike many other languages, variables in Perl do not have to be declared before they are used in a program. Variables in Perl can be scalars, arrays, or hashes. Hashes will be discussed later in this document. Scalars and arrays are discussed below.

Scalars

A scalar variable holds a single piece of information, such as a number, a character, or a group of characters (referred to as a string). A scalar is the simplest data type in Perl. A scalar variable name in Perl must begin with a dollar sign, followed by a letter or underscore, and then (optionally) followed by one or more letters, digits, or underscores. Perl is case sensitive; $this_variable is not the same as $This_Variable. As with any other programming language, a good variable name should give an indication of the variable's purpose. For example, the variable $sum is preferred over $s because it is more descriptive.



The following examples demonstrate the use of scalar variables within the context of an assignment statement. In Perl an assignment statement is an instruction that contains a variable on the left, followed by an equal sign (=), followed by either a value or an expression. The meaning of this is quite simple. The value of the expression on the right of the equal sign is stored in the variable on the left. Do not get this confused. The variable you are modifying is ALWAYS on the left. Capice? All instructions are terminated by a semicolon (;). Make sure you understand that an assignment statement is NOT an equation, i.e. one of those things you studied in algebra. It is an action, not a statement of a relationship! Although a variable that has not yet been given a value is treated as zero when it is used as a number, it is good practice for the programmer to perform the initialization explicitly. This is shown in the first example below. The following examples should suffice.


    $sum = 0;                                     # Start the variable out at zero.
    $university = 'Midwestern State';             # Stores a string into the variable $university.
    $pick = 'B';                                  # Stores a single character B into the variable $pick.
    $num1 = 52;                                   # Stores 52 into the variable $num1.
    $pi = 3.1415926535;                           # Initialize pi to 10 decimal places.
    $total = $subtotal + $tax_amount;             # The sum on the right is stored in $total.
    $avogadro = 6.022E+23;                        # You can even use scientific notation.
    print "I attend $university University\n";    # See note below.

The last example illustrates an important point in Perl: a double-quoted string is variable interpolated. This means that the variable name is replaced with its current value when it is printed. In the example above, I attend Midwestern State University would be printed, assuming that $university holds the string literal value 'Midwestern State'. Single-quoted strings in a print statement will not perform variable interpolation. In other words, print 'Hello $name' will print Hello $name as is.
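
The difference between the two kinds of quotes can be seen side by side in a small sketch like the following. It reuses $university from the examples above; the names $double and $single are made up for illustration, and use strict, use warnings and my are optional additions not used elsewhere in this section.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

my $university = 'Midwestern State';

# Double quotes: $university is replaced by its value.
my $double = "I attend $university University";

# Single quotes: the text is kept exactly as typed.
my $single = 'I attend $university University';

print "$double\n";    # prints: I attend Midwestern State University
print "$single\n";    # prints: I attend $university University
```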

Arrays

Scalar values are limited to a single piece of data. Many times it is helpful to store a collection of data in a single data structure that can be manipulated as a single unit. An array is a data type that holds a list of scalar values. Arrays are indexed with integer values beginning with zero. Use the @ symbol followed by the array name to refer to an entire array. To refer to an individual element in an array, use a dollar sign, followed by the name of the array, followed by the element number in brackets. For example, @words could be the name of an array of words. The first element in the array is called $words[0], the second is called $words[1] and so on. An array can be visualized as shown below. Here we have an array called @taxonomy that contains the well-known taxonomic categories.


@taxonomy

    index    0          1         2        3        4
    value    Kingdom    Phylum    Class    Order    Family

The statement:

    $taxonomy[2] = $taxonomy[0];      # copies the value in slot 0 to slot 2

would produce:

@taxonomy

    index    0          1         2          3        4
    value    Kingdom    Phylum    Kingdom    Order    Family

The statement:

    @taxonomycopy = @taxonomy;

would produce (assuming the original array):

@taxonomy

    index    0          1         2        3        4
    value    Kingdom    Phylum    Class    Order    Family

@taxonomycopy

    index    0          1         2        3        4
    value    Kingdom    Phylum    Class    Order    Family

Array variables can easily be initialized by using an assignment statement as well. Here are several examples using the required syntax.

    @numbers = (1, 2, 3, 4, 5, 6, 7);                    # all numbers
    @names = ('Harry', 'Bob', 'Tom', 'John', 'Bill');    # all strings (names)
    @mixed = ('One', 2, 'Three', 4, 'Five');             # we can mix strings and numbers.
    @bases = ("A", "C", "G", "T");                       # Remember your DNA?

From these it should be clear that $numbers[4] is 5, $names[0] is Harry and $mixed[1] is 2.
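
The taxonomy statements above can be tried out with a short sketch such as this one. The call scalar(@taxonomy), which gives the number of elements, is an addition that the text has not introduced, as are use strict, use warnings and my.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

my @taxonomy = ('Kingdom', 'Phylum', 'Class', 'Order', 'Family');

my @taxonomycopy = @taxonomy;     # copies every element into a second array
$taxonomy[2] = $taxonomy[0];      # overwrites slot 2 with the value in slot 0

my $count = scalar(@taxonomy);    # number of elements in the array

print "Slot 2 is now $taxonomy[2]\n";             # prints: Slot 2 is now Kingdom
print "The copy still has $taxonomycopy[2]\n";    # prints: The copy still has Class
print "There are $count elements\n";              # prints: There are 5 elements
```

Note that the copy was made before slot 2 was changed, so the two arrays are independent: changing one does not change the other.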

5. Input/Output

Perl programs, at least the ones we will write, can communicate with the outside world in basically two ways: standard I/O (Input/Output) or file I/O.

    $x = <STDIN>;
    print "Hello there";

The first method, referred to as standard I/O, either reads data typed in at the keyboard (<STDIN>) or writes data to the DOS screen (STDOUT). Programs that operate in this fashion are said to be interactive. Probably the simplest program is one that writes output to the screen using a print instruction. The syntax for the print instruction is

    print data_type, data_type, ..., data_type;

Here data_type can be any of the usual types such as variables, strings or arrays. It can also be an expression that results in one of the data types. Here are some simple examples of printing strings. The associated comment (anything after #) explains the output. The commas are required if you have multiple types.

    print "Hello world";          # This writes Hello world to the screen leaving the cursor at the end.
    print "Hello world\n";        # This writes Hello world to the screen moving the cursor to the next line.
    print "Hello", " world\n";    # This writes Hello world using two separated types in the print statement.

In all of the above cases we are printing a string; the only differences are that the second print includes the \n formatting character and the third is separated into two strings. You can think of \n as being a carriage return (or in our case the Enter key). In other words, any time a print instruction encounters this character within a string, the output moves to the start of the next line. If we execute the following little program

    print "If knowledge can create problems,\nIt is not through ignorance\n";
    print "that we can solve them.\nIsaac Asimov\n";

we see this.

    If knowledge can create problems,
    It is not through ignorance
    that we can solve them.
    Isaac Asimov

The output is heavily controlled by the careful placement of \n's throughout the strings.

Another data_type is of course variables. A variable can be printed in two ways. The following sequence of instructions shows several examples of printing variables together with supporting strings. Look very closely at the output and the associated syntax used to create that output.


    $a = 10;                                      # Assign the value 10 to $a

    # Here we are printing a string, then a number, and finally a string
    print "The answer is ", $a, "\n";             # the output is The answer is 10

    # The following just prints a single string. Since the string is double quoted the enclosed
    # variable is variable interpolated as discussed earlier.
    print "This is another way to print $a\n";    # the output is This is another way to print 10

    # The following is a print of a string using single quotes. Here NO interpolation occurs!!!
    print 'This is another way to print $a\n';    # the output is This is another way to print $a\n


From the above you can see that you can print a variable embedded in a double-quoted string or by itself on the print line, separated by commas.

It is also possible to read numbers or strings from the keyboard (i.e. <STDIN>). In order to do this we normally use two instructions. The first instruction is called the prompt. It is just a print statement telling the user what he/she is supposed to type in. The second instruction is an assignment statement that retrieves the typed-in value and stores it into a variable of your choice. Here is an example.

    print "Enter a number between 1 and 10:";
    $selection = <STDIN>;


The second instruction contains <STDIN>, which stands for standard input and is synonymous with the keyboard. The program actually stops at this instruction and waits for the user (i.e. you) to type something in and hit Enter. The main thing you have to remember here is that everything you type goes into the variable, INCLUDING the Enter key (i.e. \n). Every variable read this way will have the \n on the end. This is quite often a pain and generally one wants to remove it. It can easily be done using the chomp command. The following examples should demonstrate its usage.



    print "Enter your name:";
    $name = <STDIN>;
    chomp($name);         # removes the \n from the end.

    print "Enter your age:";
    $age = <STDIN>;
    chomp($age);          # you need to remove the \n from both numbers and strings.
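
To see what chomp actually removes without having to type anything, the sketch below uses a fixed string in place of <STDIN>. That stand-in string, and the use of the length function (covered in the section on mathematical operators) to count characters, are assumptions made for illustration, as are use strict, use warnings and my.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

# Stand-in for  $name = <STDIN>;  after the user types Charles and presses Enter.
my $name = "Charles\n";

my $before = length($name);    # 8 characters: seven letters plus the \n
chomp($name);                  # removes the trailing \n, if there is one
my $after = length($name);     # 7 characters: just the letters

print "Length before chomp: $before\n";
print "Length after chomp:  $after\n";
print "Hello $name, welcome.\n";    # the comma now follows the name on the same line
```

Without the chomp, the last print would break the line right after the name and the comma would start the next line.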



Homework Problems

5.1 Give the output of the following program EXACTLY.

    $a = 23;
    print "The answer is $a\n";
    print 'The answer is $a\n';     #note the
    print "I have called this principle,\n by which each slight variation,\n";
    print "if useful, is preserved, by the term of Natural Selection.\n";     # by Charles Darwin

5.2 Give the output of the following program exactly. Note the first variable is not chomped. What would happen if it were?

    print "Enter a number :";
    $num = <STDIN>;
    print "Enter another number :";
    $val = <STDIN>;
    chomp($val);
    print "The second number was $val and the first number was $num. Interesting huh!";



6. Mathematical operators

Perl has a large number of built-in operators for performing operations on numbers. Perl considers all numbers as real (i.e. floating point numbers). As mentioned previously, a variable can be initialized with these values. Here $PI and $cube10 are initialized.

    $PI = 3.1415926535;    # a real number with decimal
    $cube10 = 1000;        # an integer value

Real numbers and integers can be processed using the standard operators +, -, * and /, and can be mixed with little worry. In other words, you can add reals (decimals) to integers and it will work properly. These operations can be applied to any combination of numeric literals (things like 3.14 and 2) and variables, using parentheses in the usual way. When writing expressions for the right hand side of an assignment statement one must always pay attention to the order of operations. This is exactly the same order of operations that you learned in elementary school. Remember, you were told to multiply and divide before you add and subtract. If you write an expression such as $a + $b*$c, it will be evaluated according to these rules, i.e. the product $b*$c will be done first and then $a will be added on.
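
A quick way to convince yourself of the order of operations is to compute the same expression with and without parentheses. The variable names and values in this sketch are made up for illustration, and use strict, use warnings and my are optional additions not used in the text.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

my $x = 2;
my $y = 3;
my $z = 4;

my $no_parens   = $x + $y * $z;      # multiplication first: 2 + 12 = 14
my $with_parens = ($x + $y) * $z;    # parentheses first:    5 * 4  = 20
my $mixed       = $x + 0.5;          # an integer plus a real gives 2.5

print "Without parentheses: $no_parens\n";
print "With parentheses:    $with_parens\n";
print "Integer plus real:   $mixed\n";
```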


For example, the commands in the following program are acceptable and perform the obvious operations. Note that this is a complete program that executes from top to bottom.

    # A simple program of operational processing.
    $PI = 3.1416;
    $r = 5;
    $TwoPi = 2 * $PI;
    $Area = $PI * $r * $r;
    $Circum = $TwoPi * $r;
    print "A circle of radius $r has a circumference of $Circum and an area of $Area.\n";

Output is:

    A circle of radius 5 has a circumference of 31.416 and an area of 78.54.



There are quite a few other operators in Perl, some of which work with numbers and others that work on strings. The first we will look at is the exponentiation operator **. It can be used with whole or real numbers.

    $Area = $PI * $r**2;             # This is of course pi times r squared.
    $c = ($a**2 + $b**2)**.5;        # Remember the Pythagorean theorem?
    $c = sqrt($a**2 + $b**2);        # Same thing but using the function sqrt()

The sqrt() used above is called a function. In fact in this case it is a built-in function. This means that it is already in Perl, ready for us to use. Using a function like this is referred to as calling the function. The call is replaced by the value that results. For example, if you have sqrt(4) in a program, it will be replaced by 2. Although appendix B contains a list of functions that may be useful in this class, there are in fact many more that can be used. You can even download entire libraries that are specially designed for a specific area, such as BioPerl.
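
As a small check that the ** operator and the sqrt() function agree, the sketch below works out the hypotenuse of a 3-4-5 right triangle both ways. The variable names and the side lengths are chosen only for illustration, and use strict, use warnings and my are optional additions not used in the text.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

my $side_a = 3;
my $side_b = 4;

# ** is applied before +, so each side is squared and then the squares are added.
my $sum_of_squares = $side_a**2 + $side_b**2;    # 9 + 16 = 25

my $c_power = $sum_of_squares ** 0.5;    # raising to the power 0.5 takes the square root
my $c_sqrt  = sqrt($sum_of_squares);     # calling the built-in function

print "Using **:     $c_power\n";
print "Using sqrt(): $c_sqrt\n";
```

Both lines should report a hypotenuse of 5.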

Another operator that is very useful is the modulo operator %. This returns the remainder that occurs when you divide one number by another.

    $y = 231;
    $w = $y % 3;     # sets $w to 0
    $z = $y % 2;     # sets $z to 1

Remember when you first learned to divide, before you worked with decimals: you divided a large number by a small number, and if it divided evenly then you obtained a 0 remainder. If it did not divide evenly you obtained a nonzero remainder. The % operator returns this remainder. You can use this to determine if a number is even (i.e. if $num % 2 is 0), as well as in some other applications which we will run into. Can you think of a way to get the quotient instead of the remainder?
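
One place remainders show up in biology is when a DNA sequence is read three bases at a time. The sketch below assumes a made-up sequence length of 232 bases and asks how many bases are left over after grouping them in threes; the names $bases, $leftover and $parity are hypothetical, and use strict, use warnings and my are optional additions not used in the text.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

my $bases = 232;               # assumed sequence length, for illustration only

my $leftover = $bases % 3;     # 232 = 77 * 3 + 1, so one base is left over
my $parity   = $bases % 2;     # 0 means the number is even, 1 means it is odd

print "Bases left over after grouping in threes: $leftover\n";
print "Remainder after dividing by 2: $parity\n";
```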


There is a very useful operator for strings called concatenation. It is a single period (.). It is used to connect two strings together, thus creating a longer string. For example

    $Name = "Richard " . "Simpson";        # creates the single string "Richard Simpson"
    $genus = "Homo";
    $species = "Sapiens";
    $comment = $genus . " " . $species;    # Note a blank was added in between the two words.
    $len = length($comment);               # length is another useful function. Here $len is set to 12!
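
Concatenation is just as handy for DNA as for names. In the sketch below a start codon is joined to the rest of a short sequence; the bases in $rest are made up for illustration, and use strict, use warnings and my are optional additions not used in the text.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

my $start_codon = "ATG";
my $rest        = "GCCTGA";    # made-up bases, for illustration only

my $gene = $start_codon . $rest;    # joins the two strings end to end
my $len  = length($gene);           # counts every character in the result

print "Gene: " . $gene . "\n";      # prints: Gene: ATGGCCTGA
print "It is $len bases long.\n";   # prints: It is 9 bases long.
```

Unlike the name example, no blank is inserted here, because the two pieces belong to one continuous sequence.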



Homework 6

6.1 What value is assigned to the variables on the left assuming $x = 5, $y = 3 and $n = "Hello"?

    a) $a = 3 + $x**2 - 4*$x;
    b) $m = $x*$x + sqrt($x*20) + 75/$x;
    c) $sentence = $n . " World" . " " . " How are you doing?";
    d) $t = length($sentence);
    e) $d = 100 % $x + 27 % $x;
    f) $c = sqrt($x**2 + $y**2);

6.2 How do you determine if a number is even? Is a multiple of 10?


7. Simple Programs

In this section we will look at some complete programs. All of these programs are straight line programs where each instruction is executed just once, one after another. It is very important that you think about the execution of a Perl program in this dynamic way. The instructions execute one after another, modifying and storing the variable values at each step. The first example is a simple program that reads in the radius of a circle (entered at the keyboard by the person running the program) and then prints out the diameter, circumference and area of the circle as calculated from the entered value. Note that no mention is made of the unit of measure. Is it inches, feet, meters? Since the formulas apply to all of these, it was not necessary.

    print "Enter the radius of a circle:";
    $radius = <STDIN>;                          # Get the radius from the keyboard
    $diameter = 2 * $radius;
    $circumference = 2 * 3.14159 * $radius;
    $area = 3.14159 * $radius ** 2;
    print "The radius is $radius\n";
    print "The diameter is $diameter\n";
    print "The circumference is $circumference\n";
    print "The area is $area\n";

The output, assuming we enter 3 for the radius, for the above program is given below. The blank line after the radius appears because $radius was not chomped and so still ends with the \n from the Enter key.

    Enter the radius of a circle:3
    The radius is 3

    The diameter is 6
    The circumference is 18.84954
    The area is 28.27431

If we run the program again and enter 10 instead we obtain the following output.

    Enter the radius of a circle:10
    The radius is 10

    The diameter is 20
    The circumference is 62.8318
    The area is 314.159



The above program is indicative of most Perl programs. It is a program that uses I/O (Input and Output). Data is requested from the user, which is then processed, and the resulting information is printed out for the user to read. Let's look at another example, one that works on strings as opposed to the numbers processed in the previous example.

    print "Enter your last name:";
    $last = <STDIN>;
    chomp($last);                       # Remove the CR (carriage return, i.e. Enter key)
    print "Enter your first name:";
    $first = <STDIN>;
    chomp($first);
    $name = $first . " " . $last;       # The . is the concatenation operator!
    print "Hello $name it is nice to see you.\n";

and its output.

    Enter your last name:Darwin
    Enter your first name:Charles
    Hello Charles Darwin it is nice to see you.


This program has two features that need discussing. First, we use the chomp function. It is a simple command that removes the CR from an inputted word. Not doing so has a tendency to foul up our printouts. If you do not remove it, every time you print the string an automatic CR will be printed immediately after it, something that you do not normally want. The next feature of note is the period (.) that we see in the next to last line. Recall that this is the concatenation operator for strings. It allows us to connect strings together (in this case three) in order to create a single string. The command

    $name = $first . " " . $last;

connects $first, a blank " " and $last into one string, placing the result in the variable $name. The blank is necessary so that CharlesDarwin will be split with a separating blank. It would be instructive for you to run the above program without the chomps and then again without the inserted blank to see the result. This will help you with debugging when you make errors in the future. If you forget to use a chomp sometime in the future, the error in the printout will hopefully remind you of the cause.
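
The same read, process and print pattern carries over to other formulas. The sketch below applies it to a sphere, using the standard formulas 4 pi r^2 for surface area and (4/3) pi r^3 for volume. So that it runs without any typing, a fixed string stands in for <STDIN>; that stand-in, the variable names, and the use of use strict, use warnings and my are all assumptions made for illustration.

```perl
use strict;      # optional safety checks; they require variables to be declared with my
use warnings;

# Stand-in for  $radius = <STDIN>;  after the user types 2 and presses Enter.
my $radius = "2\n";
chomp($radius);                    # remove the \n before using or printing the value

my $pi      = 3.14159;
my $surface = 4 * $pi * $radius ** 2;          # ** is done before *, so the radius is squared first
my $volume  = (4 / 3) * $pi * $radius ** 3;

print "The radius is $radius\n";
print "The surface area is $surface\n";
print "The volume is $volume\n";
```

For a radius of 2 the surface area comes out near 50.27 and the volume near 33.51. To make the program interactive, replace the stand-in line with a prompt and $radius = <STDIN>; as in the circle program above.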





Exercises (Formulas that you do not remember can be found on Wikipedia! Also, before you accept
your results, check them by hand. You may have your order of operations wrong in your program.)

7.1 Write a Perl program that will request a temperature in Celsius and calculate and print out the
temperature in Fahrenheit. Do it again, but in this case ask for Fahrenheit and convert it to Celsius.

7.2 Write a Perl program that will request the three coefficients of a quadratic (y = ax^2 + bx + c)
as well as the value for x and then have it print out the resulting value for y.

7.3 Recall Einstein’s mass-energy equation E = mc^2. Write a program that will read in the mass of
an object and have it print out the energy that that mass represents. The variable c is the speed of
light in meters/sec. Look up its value on the internet and use scientific notation when typing its
value into your program.

7.4 This is a well-known formula, a^2 + b^2 = c^2, often referred to as the Pythagorean Theorem.
Write a program that will ask for b and c and have it give you a. You will need to use the built-in
square root function in this program. It is possible for bad things to happen when you run this
program. Explain.

7.5 At home you have power sockets all over the place. Most of them run at 120 volts. You may recall
that your electric dryer and electric range run at 240 volts. Write a program that will calculate how
much it costs to run a 100 watt light bulb for 7 straight days. Have your program request the wattage
of the bulb and the cost per kWh (kilowatt hour). What do we pay here in WF per kWh? The method of
calculation is as follows.

wattage x hours used ÷ 1000 x price per kWh = cost of electricity

7.6 Write a program that will read in the lengths of the three sides of a rectangular prism (box).
Have your program print out the volume and surface area with an associated comment.

7.7 A "molecular clock" is a gene that evolves at a steady rate and is present in many related
species. The percent similarity of this gene between any pair of species is given by the percentage
of base positions in the gene that are the same between the two species. The time that has passed
since the point when two species diverged varies approximately with the percent difference between
the two; that is, the time since divergence of two species is given by

(100 - percent sequence similarity) / (percent change / years)

Write a program that reads in the sequence similarity, percent change and number of years. Have the
program print out the time since divergence.



8. The IF statement and Logical operators



Straight line programs as discussed in the previous section are quite limited. Although many simple
problems can be solved this way, the majority cannot. For example, suppose we want to solve a
quadratic equation. Recall that the solution of the equation ax^2 + bx + c = 0 is given by the
quadratic formula

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

This formula creates several problems for us. First is the plus/minus. It is really shorthand for the
following two solutions, which need to be calculated separately.

\[ x_1 = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad x_2 = \frac{-b - \sqrt{b^2 - 4ac}}{2a} \]

The second issue concerns the radical (i.e. the square root). You may recall that we cannot
calculate the square root of a negative number (without using complex values). If we execute the
command sqrt(-3) in Perl, an error will occur that will kill our program. We must make sure that the
value under the radical (aka the discriminant) is not negative before we attempt to take its root.
One of the instruction types in Perl that allows us to check this value is called the IF statement.
This statement allows us to change the order in which instructions are executed in the program, a
process called Flow Control. In its simplest form the IF statement looks like the following.



if (conditional test) {
    Instruction 1
    Instruction 2
    . . .
    Instruction n
}
Next instruction
Etc.


The flow of control works as follows. If the conditional test is TRUE, then the instructions
Instruction 1 thru Instruction n are executed, followed by the Next instruction. If the conditional
test is FALSE, then the instructions enclosed in the braces {…} are SKIPPED and the Next instruction
is executed.
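As a small illustration of this flow, here is a sketch with a made-up temperature reading. The
variable names and the 38 degree cutoff are assumptions chosen only for the example.

```perl
# Hypothetical reading; in a real program it would come from <STDIN>.
$temperature = 39.2;    # body temperature in degrees Celsius
$status = "normal";

if ($temperature > 38) {
    # Runs only when the conditional test is TRUE.
    $status = "fever";
}

# The next instruction runs whether or not the braces were skipped.
print "Temperature $temperature C: $status\n";
```

Change the 39.2 to 37 and the instruction inside the braces is skipped, so $status stays "normal".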

All this of course depends on the conditional test. There are many kinds of conditional tests.
Probably the most used are comparisons between numbers, or between numbers and variables containing
numbers. Here is an example list of numerical comparisons.



Numerical Comparisons    Meaning

3 > 0                    is always true since 3 is always greater than 0
$x > 10                  is true if the value in $x is greater than 10
$y <= 100                is true if the value in $y is less than or equal to 100
$num != $value           is true if $num is not equal to $value
$count == $y+1           is true if the value in $count is equal to the value in $y + 1
$count%2 == 0            is true if $count is an even number


In each of the above comparisons we use a special operator to define the specific comparison we
desire. Here is the complete list that can be used to compare numbers, together with the matching
operators for strings!

Numerical Comparison Operators    Meaning                  String Comparison Operators

>                                 Greater than             gt
>=                                Greater than or equal    ge
<                                 Less than                lt
<=                                Less than or equal       le
!=                                Not equal                ne
==                                Equal                    eq


It is very important that you notice the double = signs when comparing numbers. A single = sign used
in an instruction such as $num=1 has an entirely different meaning (semantics) than does $num==1.
The former is an assignment statement that puts a 1 into the variable $num, whereas the second is a
comparison that is either TRUE or FALSE. The value of $num is NOT modified in the comparison case.
Weird things happen when you forget to use the double = in a comparison. Also take particular note
of the operators required to be used with strings. You cannot compare strings using the numerical
operators. Capisce!?
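To see why the numerical operators are the wrong tool for strings, consider the following sketch.
The names $numeric_result and $string_result are made up for the illustration, and the warnings are
switched off on purpose so that the odd behavior is not hidden behind a warning message.

```perl
no warnings;    # silence the "isn't numeric" warning so the effect is visible

$answer = "Watson";

# Numeric comparison: both strings are treated as the number 0, so this is TRUE.
if ($answer == "Franklin") {
    $numeric_result = "match";
} else {
    $numeric_result = "no match";
}

# String comparison: the characters themselves are compared, so this is FALSE.
if ($answer eq "Franklin") {
    $string_result = "match";
} else {
    $string_result = "no match";
}

print "Using ==: $numeric_result\n";
print "Using eq: $string_result\n";
```

The == test claims that Watson and Franklin match, which is exactly the kind of weird thing to
watch out for.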

True and false have actual values. In Perl 0 (along with the empty string) is false and anything
else is true, and we normally consider 1 to be the representation for true. Consider the following
code segment.

if (1) {
    print "This will always print";
}

Since the value in the parentheses is 1, i.e. TRUE, the print statement will always execute. A
comparison can be thought of as an operation that returns (is replaced by) a value: 1 when it is
true and an empty value, which counts as 0, when it is false. The comparison $n<10 in the following
command sequence

if ($n<10) {
    print "This will print out only if the value of $n is less than 10";
}

is replaced by a true or a false value depending on the value of $n.
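You can look at the value a comparison produces by storing it in a variable. This is only a sketch;
the names $less and $greater are made up for the example.

```perl
$n = 7;
$less    = ($n < 10);    # a true comparison
$greater = ($n > 10);    # a false comparison

# The brackets show exactly what each variable holds.
print "(\$n < 10) gives [$less]\n";
print "(\$n > 10) gives [$greater]\n";

# Used as a number, the false value behaves like 0.
print "As a number the false value is ", $greater + 0, "\n";
```

The true comparison holds a 1. The false one prints as nothing at all between the brackets, yet
adds up as 0 when used in arithmetic.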



There is another version of the IF statement that includes an ELSE section. Its format is as follows.

if (conditional test) {
    Instruction 1
    Instruction 2
    …                # True part
    Instruction n
} else {
    Instruction a
    Instruction b
    …                # False part
    Instruction z
}
# Continuing Code

[Flow chart: the Conditional Test branches to either the True Code or the False Code, and both paths
rejoin at the Continuing Code.]

Instructions 1 thru n are executed if the test is true; otherwise Instructions a thru z will be
executed. In both of the above instances it is important to note the braces { and }. These must be
used to delimit both the true and false sections. Leaving them out will result in a syntax error. In
order to make it easier to see whether or not all the braces are in their proper place, it is
advisable to develop a style or form that is adhered to. The above is a reasonable form. Note how
the braces line up. Readability is enhanced by indenting the instructions enclosed within a pair of
braces.
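Here is a minimal sketch of the IF-ELSE form in action, using the even number test from the
comparison list. The value 7 and the variable $kind are assumptions made up for the example.

```perl
$count = 7;    # hypothetical value chosen for the example

if ($count % 2 == 0) {
    $kind = "even";    # true part
} else {
    $kind = "odd";     # false part
}

# Continuing code: runs after whichever part was chosen.
print "$count is $kind\n";
```

Exactly one of the two parts runs on every pass thru this code, never both and never neither.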

In order to clarify this instruction let’s continue looking at the quadratic formula. The main thing
we need to do in the program is to check the value under the radical. If it is negative we will just
print out that there are no real solutions. So here we go.



# This program evaluates the quadratic formula
print "Enter a:";
chomp($a=<STDIN>);    # Read in the a coefficient. Note the two instructions in one!
print "Enter b:";
chomp($b=<STDIN>);    # Read in the b coefficient
print "Enter c:";
chomp($c=<STDIN>);    # Read in the c coefficient

# Determine the discriminant
$disc = $b**2 - 4*$a*$c;
if ($disc<0){
    print "There are no real solutions where a=$a, b=$b and c=$c\n";
    print "This is because the discriminant is $disc\n";
    exit;    # This little instruction causes the program to exit.
}
$x1 = (-$b+sqrt($disc))/(2*$a);
$x2 = (-$b-sqrt($disc))/(2*$a);
print "There are two real solutions and they are $x1 and $x2\n";

Running the above program for a=1, b=2 and c=3 gives

Enter a:1
Enter b:2
Enter c:3
There are no real solutions where a=1, b=2 and c=3
This is because the discriminant is -8

Here is another example that has real values.

Enter a:2
Enter b:5
Enter c:2
There are two real solutions and they are -0.5 and -2


From the above example you can see the utility of the IF statement. It allows the program to execute
different instructions depending on the result of each comparison. One must be very careful when
writing programs that have one or more IF statements, because the path a program takes thru the code
depends on the value returned by each comparison. A program that has a lot of IF-ELSE statements has
a very large number of different paths that the execution might take. This is even clearer when one
realizes that IF statements can be nested within other IF statements. Programs can become very
complicated indeed.


Let’s look at another example, one that contains nested IFs. Suppose we want to request a string
from the user (keyboard) and print out whether the length of the string is 5, less than 5 or more
than 5. To do this problem we will need to use the length( ) function. This function will return the
length of the string that it is applied to. In other words, if $str="Hello" then length($str) will
return 5.


print "Enter a string:";
chomp($str=<STDIN>);
$len = length($str);
if ($len == 5){
    print "The length of the string is equal to 5\n";
} else {
    if ($len <5){
        print "The length of the string is less than 5\n";
    } else {
        print "The length of the string is greater than 5\n";
    }
}
#Continuing Code

[Flow chart: the test $len == 5 leads to its True Code when true; when false, the nested test
$len < 5 selects between its true and false branches. All paths rejoin at the Continuing Code.]


Note that the inner IF-ELSE statement in the above program is nested within the else part of the
enclosing IF. Follow the path thru the program for all three cases and convince yourself that this
indeed works. In working with this program you have probably already noticed that when typing braces
within Notepad++, matching braces are highlighted in red as soon as the second matching brace is
typed. This helps you make sure every brace matches up properly. This is a cool feature, so use it
to your advantage.
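Nesting also lets us tidy up the quadratic program. The following is only a sketch under assumed
coefficients (a=1, b=-6, c=9, picked so that the discriminant is exactly zero); it adds a third case
for a single repeated solution, and the variable $answer is a made-up name.

```perl
# Hypothetical coefficients; a real program would read them from <STDIN>.
$a = 1; $b = -6; $c = 9;
$disc = $b**2 - 4*$a*$c;

if ($disc < 0) {
    $answer = "no real solutions";
} else {
    # We only get here when it is safe to take the square root.
    if ($disc == 0) {
        $x = -$b / (2*$a);
        $answer = "one repeated solution $x";
    } else {
        $x1 = (-$b + sqrt($disc)) / (2*$a);
        $x2 = (-$b - sqrt($disc)) / (2*$a);
        $answer = "two solutions $x1 and $x2";
    }
}

print "For a=$a, b=$b and c=$c: $answer\n";
```

Try other coefficients, such as the 2, 5, 2 used earlier, to send the program down the other two
paths.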


The next example is a simple program that reads in three numbers and then prints out the largest.
There are really two ways to do this. The first way, which we shall explain at this point, uses an
extra variable to hold the largest so far. Let’s look at it.




# Determine the largest of three numbers
print "Enter the first number:";  chomp($n1=<STDIN>);
print "Enter the second number:"; chomp($n2=<STDIN>);
print "Enter the third number:";  chomp($n3=<STDIN>);
$max = $n1;                       # $n1 is the max so far
if ($n2>$max){ $max = $n2;}       # if $n2 is bigger make it max
if ($n3>$max){ $max = $n3;}       # if $n3 is bigger make it max
print "The largest value of all three is $max\n";


The main point to note in the above program is the use of the variable $max. This variable is used
to contain the largest value we have seen so far. Every time a new number is checked against this
variable, the contents of $max will be replaced by the new number if it is indeed larger. If the
program were written using only IF statements without the use of an extra variable such as $max, its
structure becomes a little more complicated. See Exercise 8.2.


Our last example involves strings. Here we are to write a program that asks the user 3 DNA related
questions, obtains the answers from the user, indicates Right or Wrong for each case and then prints
out the percentage correct. For the sake of simplicity the questions are hard coded.


# A three question quiz on DNA
$num_correct=0;
print "What year was the structure of DNA discovered? ";
$ans =<STDIN>; chomp ($ans);
if ($ans == 1953){    # Note the == for comparing numbers
    print "Awesome, you are correct!\n";
    $num_correct = $num_correct + 1;
}else{
    print "I am sorry. You must be really ignorant!\n";
}

print "\nThere was a woman who also deserves credit for the discovery\n";
print "of the structure of DNA. What was her last name? ";
$ans =<STDIN>; chomp ($ans);
if ($ans eq "Franklin"){    # Rosalind 1920-1958. Note the eq for comparing strings!
    print "Awesome, you are correct!\n";
    $num_correct = $num_correct + 1;
}else{
    print "Wrong, Wrong ! You must be really stupid!\n";
}

print "\nWhat is the structure of the DNA molecule ";
$ans =<STDIN>; chomp ($ans);
if ($ans eq "double helix"){    # Another eq here!
    print "Awesome, you are correct!\n";
    $num_correct = $num_correct + 1;
}else{
    print "What an idiot! Can't you do anything right?\n";
}
$per = $num_correct/3*100;
print "You were $per percent correct!\n";


The above program is straightforward. Each question section is structured the same as the others.
The main things to note, of course, are the use of eq and == as commented in the program. The other
is the counting variable $num_correct, which is incremented every time the user gets an answer
correct.


It turns out that there are many instances where we need to enter one of several options (say in a
menu) and process the appropriate option. We can do this with the previous if statement by nesting
them, i.e. an if within an if within an if and so on. This becomes complicated because of all the
braces that are required. In order to simplify the coding of these cases the designers of Perl
created a special instruction called elsif. It is mainly used to select from a list of options
entered at the keyboard. The format is as follows


if (first test) {
    # instructions to do if first test is true
} elsif (second test) {
    # instructions to do if second test is true
} elsif (third test) {
    # instructions to do if third test is true
} else {
    # instructions to do if none of the above are true.
}


Although the above shows only 3 tests there can be as many as desired. The last one should have an
else only and be the catch-all for anything that’s not caught by the previous cases. This sequence
is quite handy and is used very often. A nice example where this is used is in menu systems. In
these cases a user is requested to enter a selection from a menu listing possible options. The
following is a program that will convert either base 8 or base 16 numbers to base 10 (decimal). It
prints a menu on the screen and the user selects either a or b. The program also uses two built-in
functions, hex() and oct(), to do the conversions for us.


print "\n Menu options for acceptable conversions\n";
print " a. Base 8(octal) to decimal\n";
print " b. Base 16(hexadecimal) to decimal\n";
print " Enter the requested conversion, (a or b) :";
$selection= <STDIN>;
chomp($selection);    # get rid of the \n or the following test won't work!
if($selection eq "a"){
    #convert the octal value
    print " Enter the octal number :";
    $bin = <STDIN>;
    $result= oct($bin);
} elsif ($selection eq "b"){
    #convert the hexadecimal value
    print " Enter the hexadecimal number :";
    $hex = <STDIN>;
    $result= hex($hex);
} else {
    # value entered is incorrect, ie it's not a or b
    die "\nERROR :please enter a or b only!\n\n";
}
print " The value in decimal is $result\n";


Here is an example run where the user requests a hexadecimal conversion.

 Menu options for acceptable conversions
 a. Base 8(octal) to decimal
 b. Base 16(hexadecimal) to decimal
 Enter the requested conversion, (a or b) :b
 Enter the hexadecimal number :3C5
 The value in decimal is 965
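The elsif chain is not limited to menu letters. As a sketch, here it is sorting a number into one of
three ranges. The pH value is a made-up measurement and $class is a name invented for the example.

```perl
$ph = 8.1;    # hypothetical measurement, roughly that of sea water

if ($ph < 7) {
    $class = "acidic";
} elsif ($ph == 7) {
    $class = "neutral";
} else {
    $class = "basic";    # catch-all: anything above 7
}

print "A pH of $ph is $class\n";
```

Because the tests are tried from the top down, the else part only runs when every test above it has
failed.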








Several of the exercises below will give you practice with using if and elsif instructions. Proper
use of these instructions is critical to Perl programming since they define the intended logic of
the programmer.


Exercises

8.1 Copy and paste the DNA program from the previous page into Notepad++ and create two additional
DNA questions, one that has a numerical answer and the other that has a string answer. Test and run.

8.2 Write a program that reads in three numbers and prints out the largest. Use only the nesting of
multiple IF statements to do this. Do not use any support variables such as $max. Run and test every
possible scenario, i.e. the first is largest, the second is largest, the third is largest, etc. How
many relative data input scenarios are there?

8.3 Convert the above program so that it reads in last names instead of numbers. We are now ordering
alphabetically, so print out the name that occurs deeper in the alphabet. Test carefully.

8.4* Write a program that reads in three numbers and prints out the numbers in numerical order from
small to large. Use only nested IF statements to accomplish this. Draw a comparison tree on a piece
of paper to help you keep the logic straight.

8.5 Write a program that reads in 5 exam grades (0-100) and have it print out each grade and the
letter grade associated with it. Also have the program print out the average and the number of
students that passed (# of students that made above 59). You will need an extra variable or two to
accomplish this. Do this problem in steps. First read in the numbers and print the average. After
this is working have the program print the letter grade for the first exam, etc. Incremental
development is really a nice way to develop programs. Try it, you will like it!

8.6 (Use the elsif in this problem) Write a program that will first request two numbers from the
user. Then have the program display a menu of the four different operations (add, subtract,
multiply, and divide). Then perform that operation and print the result. Here is what an example run
should look like.

Enter the first number: 4
Enter the second number: 5
Menu options for operations
 a. Add
 b. Subtract
 c. Multiply
 d. Divide
Enter the requested operation: b
The answer is -1

8.7 Write a program that converts meters to either kilometers, centimeters, or millimeters. Use a
menu to obtain the user’s choice.

8.8 Write a program that will either convert Celsius to Fahrenheit or Fahrenheit to Celsius
depending on the user’s choice from a menu.

8.9 HARDY-WEINBERG EQUILIBRIUM: Used to determine if allele frequency in a population is changing.

\[ p^2 + 2pq + q^2 = 1 \quad \text{and} \quad p + q = 1 \]

p = frequency of the dominant allele in the population
q = frequency of the recessive allele in the population
p^2 = percentage of homozygous dominant individuals
q^2 = percentage of homozygous recessive individuals
2pq = percentage of heterozygous individuals

Write a program that will read in p and q and have it print out whether or not the population is in
equilibrium. Just how close to 1 do you need to be? Make a decision.


9. The While Statement

In this section we add a new statement that greatly increases the power of our programs. This
statement is called the WHILE statement and it will allow us to run sections of our program
repeatedly. This is called iteration and is heavily used in almost every program on earth. The basic
structure of a WHILE is as follows



# previous instructions
while (conditional test) {
    Instruction 1;
    Instruction 2;
    . . .
    Instruction n;
}
# subsequent instructions


When the above while is encountered for the first time the conditional test is performed. If true,
then instructions 1 thru n are executed. At this point the conditional test is performed again. If
it is still true then instructions 1 thru n are again executed. This process is repeated over and
over again until the test is false, at which time the path of execution continues with the
subsequent instructions. Loops can be executed thousands of times, making this process very code
efficient. Let’s look at the following simple example.



$n=1;
while($n<=5){            # n less than or equal to 5
    print "$n ";
    $n = $n + 1;         # increment $n
    sleep(0);            # pause here zero seconds
}
print "\nDone\n";        # Throw in a couple of new lines and a Done


The program initializes $n to 1 in preparation for the upcoming loop. Then the while is encountered
and the test $n <= 5 is performed. It is true so 1 is printed and $n is set to 2. Back at the top of
the loop the test $n<=5 is done again. It is still true since 2 <= 5 so 2 is printed. Continuing we
obtain the following output.

1 2 3 4 5
Done

Running the program after changing the 5 to a 10 will print 1 thru 10 and so on.

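It is interesting to slow the program down some using the sleep command. Change the sleep(0) to
sleep(1) and the program will pause for 1 second every time it hits this instruction. This can be
quite educational since it allows you to slow down programs so you can actually see the execution
flow. (If all the numbers show up at once at the end, the output is being buffered; adding the line
$| = 1; at the top of the program turns the buffering off.)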
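Loops do not have to count by ones. As a sketch, the following loop doubles a population of cells
until it passes 1000 and counts the passes thru the loop along the way. The starting population, the
limit of 1000 and the variable names are all assumptions made up for the example.

```perl
$population  = 1;       # start with a single cell
$generations = 0;

while ($population <= 1000) {
    $population  = $population * 2;      # every cell divides
    $generations = $generations + 1;     # count this pass thru the loop
}

print "It took $generations generations to reach $population cells\n";
```

Notice that we did not need to know ahead of time how many passes would be required. The
conditional test decides when to stop.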

What do you think would happen if we inadvertently left out the $n=$n+1 command? Hmmm. This is
called an infinite loop. It will just keep on running since $n never changes and consequently it
will never get larger than the limit, 5 in this case. Remove the instruction and run it. Of course
if you do so I am sure that you will want to know how to kill (stop) the program. Just hit Ctrl-C
and it should stop.


There are basically three ways to read in data. The first way, reading from the keyboard (i.e.
STDIN), we have already examined. Here we will look at another, very convenient method. In this
technique the data is added on to the end of the program code after the __END__. Note the double
underlines on each side. Here is an example.


# Read data from the end of the file
# The <DATA> file handle refers to the data after __END__
while($val=<DATA>){
    $sum = $sum + $val;
    $n = $n + 1;
}
$ave = $sum/$n;
print "There are $n values with an average of $ave\n";
# The data section that the above program reads from starts after __END__
__END__
34
45
62
12
81
78
66

OUTPUT
There are 7 values with an average of 54


The thing to note is that each time $val=<DATA> is executed an entire line is read (a line is all
text up to AND INCLUDING the \n). This implies that we should place one data value per line. There
are ways to extract multiple values from a single input line but we will leave that technique for a
later section.


Let’s look at another example. Suppose that we would like to read in <DATA> values until the end of
the file and print out the largest and smallest values in the list.


# Read data and print the largest and smallest.
# The <DATA> file handle refers to the data after __END__
# We first read in one value and call it the smallest and also the largest
$val = <DATA>;
chomp($val);            # remove the newline before saving the value
$smallest = $val;       # This is the smallest we have seen so far.
$largest = $val;        # This is the largest we have seen so far.
while($val=<DATA>){     # keep on reading and looping until we run out of numbers
    chomp($val);
    if ($val < $smallest) { $smallest = $val; }
    if ($val > $largest)  { $largest = $val; }
}
print "The largest is $largest and the smallest is $smallest\n";
__END__
34
45
62
12
81
78
66


Homework

In each of the following programs read your data from a DATA list after the __END__ statement.
Assume that the data is 1 item per line.

9.1 Write a program that will read in the included data and have the code print out the number of
data values that are greater than 40.


9.2

Write a program that will




10. File Input/Output


The third location that we can read data from is a file. In order to do this programmers use a file
handle that is linked to an actual file. For example, suppose the file info.txt contains a list of
numbers, one per line. We can access this file from a Perl program by first creating a file handle
name, say FILEHDL, and linking it to the file info.txt using the open command as follows.

open(FILEHDL, "info.txt");

Once this is done, the usual $var = <FILEHDL> will read a line from the file info.txt and place it
into the variable $var.


The previous example that averages data values can be easily converted to read from a file. Assume
we have a file named info.txt that contains the previous values, one per line.


# Read data from a file
# The <FILEHDL> file handle refers to the data in info.txt
open(FILEHDL, "info.txt");
while($val=<FILEHDL>){
    $sum = $sum + $val;
    $n = $n + 1;
}
$ave = $sum/$n;
print "There are $n values with an average of $ave\n";

OUTPUT
There are 7 values with an average of 54



Let’s now look at a program that reads in a text file that contains a book and prints the book to
the screen. In order to have a book to read you first need to download one from the web.
Gutenberg.org is a wonderful site to do this, so let’s download a really interesting book by Alfred
Russel Wallace entitled The Malay Archipelago, Volume I. (of II.). Name the file ARWMalayV1.txt. You
may recall that Wallace developed a theory of natural selection at the same time that Darwin did. He
and Darwin were good friends and in fact Wallace was one of Darwin’s pall bearers.

The following program will dump (print) this file to the screen whilst counting the number of lines.



open(FILE, "ARWMalayV1.txt");
while ($line = <FILE>){
    print $line;                      # no CR is needed here since the $line already has one.
    $line_count = $line_count + 1;    # just in case we want to know the number of lines.
}


Each time the above while iterates, a new line from the file is read and printed to the screen.

A programmer, more often than not, needs to write the output to a file instead of the screen. Perl
allows the opening and creation of a file in one command. For example the command

open(FILEOUT, ">newfile.txt");

will create a new file named newfile.txt and define FILEOUT as its file handle. (If a file with that
name already exists, its old contents are erased.) The following program will write the numbers 1
thru 10 to a new file named numbers.txt and close it.


open(FILEOUT, ">numbers.txt");    # note the > sign. This is required for output files.
$n=1;
while ($n<=10){
    print FILEOUT "$n\n";         # writes the number to the file.
    print "$n\n";                 # prints the number to the screen
    $n++;
}
close FILEOUT;



Enter the above program (or copy and paste) into Notepad++ and save. Create a DOS window in the
directory where the program is to run and note the contents of this directory. Now run the program
and note again the contents of this directory. You should see a new file, named numbers.txt, that
has been created by the above program. Look inside this file by running the DOS type command. You
should see the numbers 1 thru 10 displayed one per line. If you do not see this file make sure you
are in the correct directory. The file will be created in the directory you run the program from,
which here is the same directory that the Perl program is in.


When opening a file that should exist on the drive there is a special die command that can be used
to kill the program in the case this file does not exist. It is common practice to use this command
since the input file name may be misspelled, the file may be in the wrong directory, or the input
file may never have been created. Here is an example of the syntax for the die as used in
conjunction with the open.

open(FILEIN, "data.txt") or die "Input file does not exist!\n";

If the file data.txt exists then the instruction will fall thru to the subsequent instruction; if
not, then the program will print Input file does not exist! and exit. This is a safety trap that
should always be used when opening files.
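Putting the pieces of this section together, here is a sketch that writes a small file and then
reads it back, with both opens guarded by die. The file name squares.txt and the variable names are
made up for the example.

```perl
# Write a few values to a file (squares.txt is a made-up name for this sketch).
open(FILEOUT, ">squares.txt") or die "Cannot create squares.txt!\n";
$n = 1;
while ($n <= 5) {
    print FILEOUT $n * $n, "\n";    # one value per line
    $n++;
}
close FILEOUT;

# Read the values back, guarding the open with die.
open(FILEIN, "squares.txt") or die "Input file does not exist!\n";
$total = 0;
$lines = 0;
while ($val = <FILEIN>) {
    chomp($val);
    $total = $total + $val;
    $lines = $lines + 1;
}
close FILEIN;

print "Read $lines values with a total of $total\n";
```

To watch the safety trap fire, change the name in the second open to a file that does not exist
and run the program again.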


Perl is a programming language that was designed from the get-go to process strings. One string type
that we may well have interest in is one that contains a DNA sequence. Recall that a DNA strand is
made up of the four nucleotides adenine, cytosine, guanine and thymine. These are usually
represented by the letters a, c, g and t respectively. A DNA snippet such as acggtattcgttaaaccgt can
be processed by Perl if it is stored in a string variable, such as $dna.

$dna = "acggtattcgttaaaccgt";




In order to facilitate string processing Perl has a built-in function called substr() that allows
one to access parts (substrings) of the string. A substring of $dna is any contiguous subsequence
containing one or more characters of the string. For example ggtatt is a substring of $dna that
starts at position 2 of the $dna string. The first letter ‘a’ is at position 0, the second letter
‘c’ is at position 1 and so on. The Perl command

$ss = substr($dna, 2, 6);

will load the variable $ss with "ggtatt". The first parameter in the substr function is the string
to be accessed, the second parameter is the starting position and the third parameter is the number
of characters to extract (or access). So the above command starts at position 2 (the third
character) of $dna and copies 6 characters, which are subsequently placed in $ss. The variable $dna
is unharmed by this function. Here is an example program that will print the nucleotides from the
$dna string, one per line. Although we would not normally do this, it does indicate how the command
works.


$dna = "acggtattcgttaaaccgt";
$pos=0;
while($pos < length($dna)){
    $nucleotide = substr($dna,$pos,1);
    print "$nucleotide\n";
    $pos++;    # go to the next character.
}


Copy and paste this into Notepad++ and run it. Note: when copying and pasting from Word, or a
browser for that matter, your double quotes may be incorrect (the wrong ones). Retype them within
Notepad++ if this occurs.
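The third parameter of substr does not have to be 1. As a sketch, the loop below walks along the
same $dna string three characters at a time, which is how one might pull out triplets (codons). The
variables $codon, $codons and $count are made-up names for the illustration.

```perl
$dna = "acggtattcgttaaaccgt";
$pos = 0;
$codons = "";
$count = 0;

# Stop when fewer than three characters remain.
while ($pos + 3 <= length($dna)) {
    $codon  = substr($dna, $pos, 3);    # three characters starting at $pos
    $codons = $codons . $codon . " ";   # build up a list separated by blanks
    $count++;
    $pos = $pos + 3;                    # jump ahead to the next triplet
}

print "$count codons: $codons\n";
```

The string has 19 characters, so the final t is left over and is not part of any triplet.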

With the substr function you can perform a large number of DNA processing tasks, such as counting
the number of each nucleotide type. As a simple example let’s count the number of c’s that occur in
$dna.


$dna = "acggtattcgttaaaccgt";
$pos=0;
$count_c=0;    # no c's counted yet
while($pos < length($dna)){
    $nucleotide = substr($dna,$pos,1);
    if($nucleotide eq 'c'){ $count_c++; }    # count each c we pass by
    $pos++;                                  # go to the next character.
}
print "There are $count_c cytosine nucleotides in the dna strand.\n";


Now that we are getting into processing DNA we need a web site to download from. We will use the
National Center for Biotechnology Information (NCBI) site, which can be found at

http://www.ncbi.nlm.nih.gov

This is a huge web site, containing a large number of databases maintained by NIH (National
Institutes of Health). We will restrict ourselves initially to the GenBank database. See

http://www.ncbi.nlm.nih.gov/genbank/

Some of the following homework uses data from this site, as will many of the remaining sections of
this document.



Homework


Midwestern State University

Page
33


10.1
Create a file entitled data.txt using notepad++ and enter a few lines data. Now write a program
that will read this file and copy it over to a new file whose name is data.dat. After th
is program run
check the contents of data.dat to make sure it is the same as data.txt.


10.2 Write a program called copy.pl that will request the user to enter two file names. The first is called
the from file and the second is called the to file. The from file should already exist. The program
should open the from file and copy the contents to the to file. If the from file does not exist then a
message should be sent to the user indicating so, and then the program should die.


10.3 Write a Perl program that will add up the numbers from 1 to 100 and print out the result.


10.3 Write a program that will request the user to enter two numbers and add up the integers that
occur from the smaller to the larger inclusive. Your program should handle integer inputs where the
first number entered is the largest and the second number is the smallest and vice-versa.


10.4 Go to the NCBI web site and download a copy of Mammoth mitochondria dna and save it in the file
mam.txt. Open this file with Notepad++ and delete all text that is not dna, leaving only the base
characters a, c, t and g. Resave. Now write a Perl program that reads in the mam.txt file and counts the
number of a’s, c’s, g’s and t’s, printing out the number of each together with the percentage of each
within the total dna sequence.


10.5 Write a Perl program that will print the reverse complement of a dna strand contained in $dna.
For comparison purposes print out both $dna and its reverse complement on two different lines.
Recall that the reverse complement is the dna sequence that matches this sequence on the other
strand of the dna helix. All you need to do here is reverse the string, converting c’s to g’s and t’s to a’s
(and vice-versa). For example, the reverse complement of acgggaggacg is cgtcctcccgt. By the way,
there is a reverse function, reverse($str), that will reverse the string parameter $str. You can use this if
you would like.




11. Regular Expressions

One of the features that makes Perl so useful to people who work in the fields of biology and chemistry is
its ability to process large text files rapidly. This feature is greatly enhanced by a built-in pattern
recognition system called regular expressions (RE). Regular expressions are extremely powerful and
allow the programmer to search a string for virtually any data pattern that they are interested in. A
pattern is enclosed in slashes. For example, if the programmer would like to look for the string "good",
they would use the format /good/ where the pattern in this case is just the four letters good. As an
example suppose that the string variable $s is initialized to "Now is the time for all good men to come
to the aid of their country" and a short program is required to print out whether or not $s contains the
substring good. This can be done in the following way.

if ($s =~ /good/){
    print "The substring good is in the string.\n";
}else{
    print "The substring good is not in the string.\n";
}



The operator used in the above if, =~, compares $s with the pattern given between slashes (ie
/pattern/) and returns true if the pattern is found and false if not. The output for the above program is

The substring good is in the string.

There are a variety of methods for defining patterns in Perl. The simplest is the single character or string
pattern. Here the programmer places the sequence of characters that is to be searched for between the
slashes. For example if we want to search for Bothriolepis then we use /Bothriolepis/. The RE
/Devonian period/ could be used to search for the word pair Devonian period. Note that RE’s are case
sensitive and the space is also a character and consequently must be matched. There are many times
when patterns will contain don’t cares. If the programmer wants to specify a don’t care, it is done by
inserting a . (period!) in the pattern. For example the pattern /A.G/ could be used to match AIG, AUG,
A;G, AEG, A4G, etc.
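
The two ideas above (literal strings and the . don't care) can be tried out with a short sketch such as the following. The sub names and the test strings are made up for illustration only, and the "use strict" and "my" lines are optional safety habits rather than something this text requires.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

# Returns 1 if the string contains an A, then any one character, then a G.
sub has_a_dot_g {
    my ($s) = @_;
    return $s =~ /A.G/ ? 1 : 0;
}

# Returns 1 if the string contains the two words separated by exactly one blank.
sub has_devonian_period {
    my ($s) = @_;
    return $s =~ /Devonian period/ ? 1 : 0;
}

# Hypothetical test strings.
foreach my $s ("AUG", "A4G", "AG", "aug", "the Devonian period", "Devonianperiod") {
    print "$s: /A.G/=", has_a_dot_g($s),
          " /Devonian period/=", has_devonian_period($s), "\n";
}
```

Note that AG fails because the . must match exactly one character, and aug fails because RE's are case sensitive.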


A character class is a list of characters between a set of brackets. One and only one of these characters
needs to be present at the corresponding part of the string for the pattern to match. For example
[aeiou] would match a single vowel. [0123456789] or equivalently [0-9] will match a single digit. If we
are interested in matching only letters, either caps or not, we would use [a-zA-Z].

The class [a-zA-Z0-9] matches any alphanumeric character. Due to the frequent use of some of these
classes the following special predefined character class abbreviations were created.

\d    is equivalent to [0-9]
\w    is equivalent to [a-zA-Z0-9_]
\s    matches white space [ \r\t\n\f], ie space, return, tab, newline and form-feed characters.
\b    matches a word boundary, ie the position between a word character (\w) and anything else,
      such as a space, period, comma, or the start or end of the string.

It may be the case you would like to match everything but small vowels. The ^ negation operator is
used to do this within a character class definition. In the vowel case we would write [^aeiou] to match
anything that is not a small vowel or [^0-9] to match anything that is not a digit. Suppose that we would
like to match an upper case A followed by anything except x, y or z. Recall that [^xyz] is anything except
x, y or z so /A[^xyz]/ will match Aq or Ab but not Ax etc. The match for a string is the first (leftmost)
place in the string where the RE matches. If several matches of different length are possible starting at
that place, the multipliers described below will take the longest one.
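
As a rough illustration of character classes and negation, the sketch below sorts single characters into the classes just described. The sub name classify and the sample characters are assumptions made for this example only.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

# Classify a single character using the classes described above.
sub classify {
    my ($ch) = @_;
    return "digit"          if $ch =~ /\d/;
    return "small vowel"    if $ch =~ /[aeiou]/;
    return "other letter"   if $ch =~ /[a-zA-Z]/;
    return "white space"    if $ch =~ /\s/;
    return "something else";
}

foreach my $ch ("7", "e", "Q", " ", ";") {
    print "'$ch' is a ", classify($ch), "\n";
}

# The negated class: an A followed by anything except x, y or z.
foreach my $s ("Aq", "Ab", "Ax") {
    if ($s =~ /A[^xyz]/) {
        print "$s matches /A[^xyz]/\n";
    } else {
        print "$s does not match /A[^xyz]/\n";
    }
}
```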

Suppose that we download Charles Darwin’s The Origin of the Species 1st ed from Gutenberg.org and
name it oots.txt. The following program will search for the word evolution in this document and
display the appropriate result. Note that we are looking for a line that has either Evolution or evolution
embedded within it. Since \b is placed only at the front of the RE it is possible that we might match
something like evolutionary. Do this exercise and see what you find out.

# Here we open the file and read one line at a time and check for the word evolution.
open(FILE,"oots.txt");
$ct=0;
# Every time the following line is executed the next line in the file
# is loaded into $line
while($line = <FILE>){
    $ct=$ct+1;
    # Recall the \b in the following regular expression matches on a word boundary (space etc)
    if($line =~ /\b[Ee]volution/)
    {
        print "Evolution is in the Origin at line number $ct\n";
        exit;
    }
}
print "Evolution is not in the Origin\n";

If we are interested in matching something at the beginning of a string we use a ^ as well. There is no
ambiguity because we are between /'s and not []'s. This is referred to as context sensitive, ie the meaning
of ^ is determined by its surrounding context. As an example /^Joseph Hooker/ would match any string
that begins with Joseph Hooker such as "Joseph Hooker was a good friend of Charles Darwin" but not
"It is well known that Joseph Hooker was a friend of Darwin". Similarly we use a $ to indicate the end of
the string. As an example /office$/ would match "Who got Einstein’s office" but not "Where was the
office of Einstein".

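The two anchors can be checked with a small sketch like the one below, which uses the example sentences from the paragraph above. The sub names are made up for illustration. One extra fact worth knowing: $ also matches just before a final \n, so a line read from a file still matches /office$/.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

# Returns 1 if the string begins with Joseph Hooker, otherwise 0.
sub starts_with_hooker {
    my ($s) = @_;
    return $s =~ /^Joseph Hooker/ ? 1 : 0;
}

# Returns 1 if the string ends with office, otherwise 0.
sub ends_with_office {
    my ($s) = @_;
    return $s =~ /office$/ ? 1 : 0;
}

print starts_with_hooker("Joseph Hooker was a good friend of Charles Darwin"), "\n";
print starts_with_hooker("It is well known that Joseph Hooker was a friend of Darwin"), "\n";
print ends_with_office("Who got Einstein's office"), "\n";
print ends_with_office("Where was the office of Einstein"), "\n";
print ends_with_office("Who got Einstein's office\n"), "\n";   # line still ending in \n
```
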
Multipliers

Another feature of regular expressions is the use of multipliers. These are special characters that allow
the specification of multiple instances of a character or pattern. For example we use an * to represent
zero or more copies of the immediately preceding character. The DNA pattern /a*cgt/ would match
aaacgt or cgt or acgt etc. If we want one or more copies we use a + instead of the *. So
the pattern /RA+T/ would match RAT or RAAT or RAAAT etc. RT is not matched in the + case.

A ? is used if we want to have one or no copies of the preceding character. A pattern such as /Haa?t/
would match Haat or Hat and nothing else. What do you think /fo+ba?r/ matches? It matches an f
followed by one or more o’s followed by b then by an a or not and finally an r. How about /^\s*$/? This
is an often used RE since it matches blank lines. Why? Note that these operators will match the longest
string that they can find. For example /t[A-Za-z ]+d/ will keep going to the last d it can reach even
though there is an earlier d. How to see exactly what it matches is the subject of a later section.
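
To see the "longest match" behaviour in action, here is a minimal sketch. It peeks ahead slightly by using Perl's built-in $& variable, which holds the text of the last successful match. The sub names and the test sentence are made up for illustration.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

# Returns the text matched by /t[A-Za-z ]+d/, or "" if there is no match.
sub greedy_match {
    my ($s) = @_;
    if ($s =~ /t[A-Za-z ]+d/) {
        return $&;    # $& holds the text of the last successful match
    }
    return "";
}

# Returns 1 for an empty or all white space line, otherwise 0.
sub is_blank {
    my ($line) = @_;
    return $line =~ /^\s*$/ ? 1 : 0;
}

my $sentence = "the sand and the mud are wet";    # hypothetical test string
print "Matched: '", greedy_match($sentence), "'\n";
print "aaacgt matches /a*cgt/\n"    if     "aaacgt" =~ /a*cgt/;
print "RT does not match /RA+T/\n"  unless "RT" =~ /RA+T/;
print "Blank line found\n"          if     is_blank("   \t\n");
```

The first line printed is Matched: 'the sand and the mud'. The match runs to the d in mud, not to the earlier d in sand.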

Alternation Operators

The symbol | is used to match exactly one of a set of alternatives. For example if we want to
match a or b or c we would write /a|b|c/. For single characters you probably should use /[abc]/
which is the same thing. This operator really is useful if you want to match certain words, ie
/rat|mouse/ would match rat or mouse but nothing else. What does /a|b*/ match? What about
/(a|b)*/? Here the parens define a precedence grouping. Precedence rules are as follows, with the
parentheses being at the top.

Name                      Representation
Parentheses               ( )
Multipliers               + * ? {m,n}
Sequence and anchoring    abc ^ $ \b \B
Alternation               |

Recall that \b matches a word boundary, ie /\bHi\b/ would match Hi but not High. \B matches a
non word boundary.

Here are some examples that may help you see the pattern.

     Regular Expression       Example matches
1.   /ab?c/                   abc, ac
2.   /^0x[0-9A-F]+$/          0x4FA, 0xFFF8
3.   /abc*/                   ab, abcccccccc
4.   /(abc)+/                 abc, abcabcabc
5.   /(a|b)(c|d)/             ac, ad, bc, bd
6.   /(song|blue)bird/        songbird, bluebird
7.   /ab{2,4}c/               abbc, abbbc, abbbbc
8.   /100\s*mk/               100mk, There are 100 mk
9.   /[yY][eE][sS]/           yes, YES, Yes, YeS
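
One way to convince yourself that a table like this is right is to let Perl check it. The sketch below is only an illustration: the sub name check is made up, and it uses qr/.../, a Perl operator not covered in this text that stores a regular expression in a variable so it can be passed to a sub.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

my $failures = 0;    # number of example strings that did not match

# Try a pattern against each of its example strings and report the result.
sub check {
    my ($re, @examples) = @_;
    foreach my $text (@examples) {
        if ($text =~ $re) {
            print "ok    '$text'\n";
        } else {
            print "FAIL  '$text'\n";
            $failures++;
        }
    }
}

check(qr/ab?c/,            "abc", "ac");
check(qr/^0x[0-9A-F]+$/,   "0x4FA", "0xFFF8");
check(qr/abc*/,            "ab", "abcccccccc");
check(qr/(abc)+/,          "abc", "abcabcabc");
check(qr/(a|b)(c|d)/,      "ac", "ad", "bc", "bd");
check(qr/(song|blue)bird/, "songbird", "bluebird");
check(qr/ab{2,4}c/,        "abbc", "abbbc", "abbbbc");
check(qr/100\s*mk/,        "100mk", "There are 100 mk");
check(qr/[yY][eE][sS]/,    "yes", "YES", "Yes", "YeS");

print "Number of failures: $failures\n";
```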

Suppose we have a file that contains a lot of ordered pairs and would like to extract them from
the file. Recall that an ordered pair looks like (number,number). For example (2,3), (-45,23)
and (-1,-4) are ordered pairs. The first thing we need to do is construct the regular expression.
We need to match a ‘(‘ then an integer, then a comma, then another integer and finally a ‘)’. Don’t
forget that each number can be negative. This can be accomplished with this expression.

/\(-?\d+,-?\d+\)/

Recall that parentheses are used as special characters in regular expressions. If we want to actually
search for a paren we must place a \ in front of it. The characters ‘(‘ and ‘)’ are regular expression
grouping symbols while \( and \) are just the characters ‘(‘ and ‘)’. Capisce?

Let’s write a Perl program that counts the ordered pairs of the above format in a file, along with the
number of lines read. Pay particular attention to the regular expression in the following application.

open(FILE,"pairs.txt")||die "Could not open file: pairs.txt\n";
$c = 0;          # number of ordered pairs found
$cttotal = 0;    # number of lines read
while($line=<FILE>){    # For every line we will do the following
    while($line =~ /\(-?\d+\,-?\d+\)/g) {$c+=1;}
    $cttotal++;
}
print "Number of ordered pairs=$c\n";
print "Total number of lines=$cttotal\n";


If the above program is run on the following data set, aka pairs.txt,

This is a file of ordered pairs
(2,3), (4,5), (-2,34), (65,2)
and (-2,-3) order pair
as well as this one (-234,342)
and this one (-3, 3)
How about that.


The output would be

Number of ordered pairs=6
Total number of lines=6


Why do we count only 6 ordered pairs when it’s clear there are 7? Look closely at the last one. Notice
anything different? This point was not matched (ie counted) because there is a space in front of the
second number and we DID NOT allow that case in the above regular expression. If we wanted to allow
up to two blanks in front of a number, then placing \s{0,2} in front of each -? will allow it to find and
count (-3, 3)!
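
Here is a minimal sketch of that relaxed expression, run on the same data held in a string instead of a file so it can be tried without creating pairs.txt. The sub name count_pairs is made up for illustration, and \s{0,2} is just one reasonable way of allowing the optional blanks.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

# Count the ordered pairs in a piece of text, allowing up to two
# blanks in front of each number.
sub count_pairs {
    my ($text) = @_;
    my $count = 0;
    while ($text =~ /\(\s{0,2}-?\d+,\s{0,2}-?\d+\)/g) {
        $count++;
    }
    return $count;
}

# The same data as pairs.txt, one line per string.
my @lines = (
    "This is a file of ordered pairs\n",
    "(2,3), (4,5), (-2,34), (65,2)\n",
    "and (-2,-3) order pair\n",
    "as well as this one (-234,342)\n",
    "and this one (-3, 3)\n",
    "How about that.\n",
);

my $total = 0;
foreach my $line (@lines) {
    $total += count_pairs($line);
}
print "Number of ordered pairs=$total\n";
```

With the relaxed expression all 7 pairs are counted.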


12. Searching text files using regular expressions

The previous section spent some time explaining how to search text files. Since this is such an
important topic we will expand on this concept considerably. Why? Because this is what you will
probably do most. There are many file formats that scientists run up against. We already looked at
NCBI files so let’s continue with these and learn to extract a variety of information from them.
First go to the NCBI web site (http://www.ncbi.nlm.nih.gov/) and download the FM866397 file on
Neanderthal mitochondria DNA. Name the file ncbi_dna.txt. Here is what it looks like.

LOCUS       FM866397                 367 bp    DNA     linear   PRI 06-NOV-2009
DEFINITION  Homo sapiens neanderthalensis mitochondrial D-loop hypervariable
            region 1, isolate Sidron 1351e.
ACCESSION   FM866397
VERSION     FM866397.1  GI:262527002
KEYWORDS    .
SOURCE      mitochondrion Homo sapiens neanderthalensis
  ORGANISM  Homo sapiens neanderthalensis
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
REFERENCE   1
  AUTHORS   Briggs,A.W., Good,J.M., Green,R.E., Krause,J., Maricic,T.,
  TITLE     Targeted sequence capture and analysis of multiple Neandertal
            mitochondrial genomes
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 367)
  AUTHORS   Briggs,A.W.
  TITLE     Direct Submission
  JOURNAL   Submitted (05-NOV-2008) Briggs A.W., Human Evolutionary Genetics,
            MPI-EVA, Deutscher Platz 6, Leipzig, 04103, GERMANY
FEATURES             Location/Qualifiers
     source          1..367
                     /organism="Homo sapiens neanderthalensis"
     D-loop          1..367
                     /note="hypervariable region 1"
ORIGIN
        1 gggagcagat ttgggtacca cccaagtatt gactcaccca tcagcaaccg ctatgtattt
       61 cgtacattac tgccagccac catgaatatt gtacagtacc ataattactt gactacctgc
      121 agtacataaa aacctaatcc acatcaaacc ccccccccca tgcttacaag caagcacagc
      181 aatcaacctt caactgtcat acatcaacta caactccaaa gacgccctta cacccactag
      241 gatatcaaca aacctaccca cccttgacag tacatagcac ataaagtcat ttaccgtaca
      301 tagcacatta cagtcaaatc ccttctcgcc cccatggatg acccccctca gataggggtc
      361 ccttgat
//


Although the above is a short example, all of these files have the same format. The NCBI web site
defines each of the subsections such as LOCUS, FEATURES and ORIGIN. For our purposes here we
see, just by observation, that the actual DNA string begins after the word ORIGIN. Our first program
will read in this file, extract only the dna that occurs after the word ORIGIN and then write that out to
a file. We will remove spaces and numbers after ORIGIN but keep the CR’s (ie \n).

This program uses a flag variable called $inseq to tell the program when we pass the ORIGIN line.
Before we see ORIGIN its value is 0 and after it is 1. Read the code very carefully and see if you can
follow the logic.


open (FILE,"ncbi_dna.txt")or die "File not there";
open(OUT,">rawdna.txt"); # Don’t forget the > sign for output files.
$inseq = 0;               # 0 until we pass the ORIGIN line
while ($line=<FILE>)
{
    if ($line=~/^ORIGIN/){
        $inseq = 1; # when we pass the ORIGIN line turn on the inseq flag
    } elsif($line=~/^\/\/\n/) { # look for the last line with a //
        last; # make this the last loop in the while
    } elsif ($inseq == 1){ # We are in the DNA section
        $line =~ s/[0-9]//g; # remove any digits
        $line =~ s/[\t ]//g; # remove tabs and blanks leaving \n
        # What would happen if we used $line =~ s/\s//g; instead?
        print OUT "$line";
    }
}
close(OUT);




The contents of rawdna.txt as generated by the above program are


gggagcagatttgggtaccacccaagtattgactcacccatcagcaaccgctatgtattt

cgtacattactgccagccaccatgaatattgtacagtaccataattacttgactacctgc

agtacataaaaacctaatccacatcaaacccccccccccatgcttacaagcaagcacagc

aatcaaccttcaactgtcatacatcaactacaactccaaagacgcccttacacccactag

gatatcaacaaacctacccacccttgacagtacatagcacataaagtcatttaccgtaca

tagcacattacagtcaaatcccttctcgcccccatggatgacccccctcagataggggtc

ccttgat


Here is another short example from the National Center for Biotechnology Information (NCBI). It is a
gene from the extinct Tasmanian wolf (Thylacinus cynocephalus). You can see it is quite similar to the
previous Neanderthal mitochondria file. If you look closely you see that the entire name for this animal
is found immediately after the word ORGANISM and stops before the word REFERENCE. The first line
contains the genus and species and the remaining lines the complete taxonomic name. This will be true
of all the NCBI dna files.



LOCUS       EU091365                 388 bp    DNA     linear   MAM 31-DEC-2008
DEFINITION  Thylacinus cynocephalus interphotoreceptor binding protein gene,
            partial cds.
ACCESSION   EU091365
VERSION     EU091365.1  GI:158668312
KEYWORDS    .
SOURCE      Thylacinus cynocephalus (Tasmanian wolf)
  ORGANISM  Thylacinus cynocephalus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Metatheria; Dasyuromorphia; Thylacinidae; Thylacinus.
REFERENCE   1  (bases 1 to 388)
  AUTHORS   Westerman,M., Young,J. and Krajewski,C.
  TITLE     Molecular relationships of species of Pseudantechinus,
            Parantechinus and Dasykaluta (Marsupialia: Dasyuridae)
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 388)
  AUTHORS   Westerman,M., Young,J. and Krajewski,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-AUG-2007) Genetics, Latrobe University, Bundoora,
            Melbourne, Victoria 3086, Australia
FEATURES             Location/Qualifiers
     source          1..388
                     /organism="Thylacinus cynocephalus"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9275"
     mRNA            <1..>388
                     /product="interphotoreceptor binding protein"
     CDS             <1..>388
                     /note="IRBP"
                     /codon_start=1
                     /product="interphotoreceptor binding protein"
                     /protein_id="ABW76674.1"
                     /db_xref="GI:158668313"
                     /translation="STSKAPQHDSKFTNATQEELLALFQQIIKYQVLEGNVGYLRVDY
                     IPGREMIEEVGEFLVNDIWKKVMETSSLVLDLQHSSGGEVSGIPFVISYLHQGDILLH
                     VDTIYDRPSNTTTEIWTLPQVLGERYS"
ORIGIN
        1 agcacctcca aggctcctca gcacgactcc aaattcacca atgccactca ggaagagcta
       61 ctcgccttat tccagcaaat aatcaagtac caggtactgg agggtaacgt cggttaccta
      121 agagtggact acatccctgg ccgggagatg atagaggaag ttggggagtt cctggtgaat
      181 gacatctgga agaaggtcat ggagacctcc tctctcgtgt tggatctcca gcacagcagc
      241 ggaggtgaag tttcaggaat cccctttgtc atttcctacc tccaccaggg ggatatcctg
      301 ctccacgtag acaccattta cgaccggcca tcaaacacca ctactgagat ctggaccctg
      361 ccccaggtgc tgggggagag gtacagtg
//


Let’s write a program that will read the above file and extract the genus and species from the data. Our
first attempt is this.

open (FILE,"tasmwolf.txt")or die "File not there";
while ($line=<FILE>)
{
    if ($line=~/ORGANISM/){
        print $line;
    }
}


Note that all it really does is print out the line that contains the word ORGANISM, which includes the
genus and species. If this is all we need then cool, otherwise a little more work might be required.
Suppose we need just the genus and species in separate variables. This shouldn’t be too hard since it’s a
part of the line we just printed. A little fancy RE work will do this for us. All we really need to do here is
change the $line=~/ORGANISM/ regular expression to something like the following




$line=~/ORGANISM\s+(\w+)\s(\w+)/


This will match a line that has this sequence of characters

ORGANISM spaces word space word and then the remaining text until the \n.

You may note the extra parentheses around the two \w+’s. The RE will work just fine without these but
by including them it is possible to get from the RE what matches these. The matched text for each RE
subsection that is surrounded by parentheses is automatically placed in a variable for us to
access. The matched text for the first parenthesized piece is placed in the variable $1, the second in $2
and so on. This gives us a way of knowing what matches what, a very useful feature of Perl.


Here is the updated version of the above program and its output.


open (FILE,"tasmwolf.txt")or die "File not there";
while ($line=<FILE>)
{
    if ($line=~/ORGANISM\s+(\w+)\s(\w+)/){
        print $line;
        print "$1\n$2\n"; # print genus on one line and species on the other.
    }
}


Output


ORGANISM Thylacinus cynocephalus

Thylacinus

cynocephalus


Running the above program on the Neanderthal file ncbi_dna.txt gives this output

ORGANISM Homo sapiens neanderthalensis

Homo

sapiens


Note that it basically grabbed the first two words after ORGANISM and printed them. The subspecies
name neanderthalensis was not processed. If we want this as well we would have to include another
(\w+) in the RE sequence. This of course would not work with organisms that have only two names on
this line. If we want our program to work with any file, whether it contains two or three names here,
another technique will be required. This will be discussed in the section on arrays where the problem
becomes quite easy.


In general data mining of these text files is quite easy using RE’s. For a more interesting example we
will work on a much larger file, say the complete mitochondrion genome for the ornate kangaroo tick.
Download it using the NCBI Reference Sequence: NC_005963.1 and call it kangtick.txt. Browse the file
carefully. Here is a section of the FEATURES portion of this file


FEATURES             Location/Qualifiers
     source          1..14740
                     /organism="Amblyomma triguttatum"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /isolate="SB1"
                     /db_xref="taxon:65637"
                     /sex="male"
     tRNA            1..62
                     /product="tRNA-Met"
                     /anticodon=(pos:31..33,aa:Met)
     gene            64..1029
                     /gene="ND2"
                     /db_xref="GeneID:2866202"
     CDS             64..1029
                     /gene="ND2"
                     /codon_start=1
                     /transl_table=5
                     /product="NADH dehydrogenase subunit 2"
                     /protein_id="YP_044778.1"
                     /db_xref="GI:49619213"
                     /db_xref="GeneID:2866202"
                     /translation="MNFNILMKWLILMTIMISMSVNSWFIFWMMMEMNLMFFIPILNK
                     QKMTNSNSMITYFVIQSFSSTIFIMMAILNFITYFYMFKILMIISIMIKLAIIPFHFW
                     LISISEMIEFNSLFFILSLQKFIPLFILSKFNSQFMIMFALASAILGSLSAMNSKMLK
                     KMLIFSSISHQGWMIMLIMMKSNFWISYLLIYSIMIYKVTSLMKMFKFNYISEFFNYN
                     KNSLSKISLIMMMMSLSGMPPFMGFTLKIISIIILLTYFNFSIIILILSSMLNIYFYL
                     NSIQSFFLLNLIKFKKMIMKTYMFKNMILNFNIFMIIFLFNLMIF"
     tRNA            1031..1089
                     /product="tRNA-Trp"
                     /anticodon=(pos:1060..1062,aa:Trp)
     tRNA            complement(1090..1151)
                     /product="tRNA-Tyr"
                     /anticodon=(pos:complement(1119..1121),aa:Tyr)
     gene            1160..2683
                     /gene="COX1"
                     /db_xref="GeneID:2866206"
     CDS             1160..2683
                     /gene="COX1"
                     /codon_start=1
                     /transl_table=5


The names on the left such as gene and CDS are annotation names. This file contains the entire
mitochondria sequence for this organism, which has quite a few genes that are labeled as such. The line

     gene            64..1029

indicates that the authors think there is a gene that goes from nucleotide 64 up
to nucleotide 1029. If you look thru the file you will see there is another
gene that occurs from 1160 to 2683 and so on. Rather than look thru this file
by hand it is possible to write a small program that goes thru the file and
prints out all the genes. All we really need to do is print out each line
that starts with the word gene. Note there are other lines that have the word
gene in them that we DO NOT want so our RE needs to be carefully designed.


open(FILE,"kangtick.txt");
$ct=0;
while($line = <FILE>){
    # look for the word gene that has spaces before and after.
    # Try it without the \s's and see what happens.
    if($line =~ /^\s+gene\s/){
        print "$line";
        $ct++;    # counts the number of genes that we find
    }
}
print "There are $ct genes in the mitochondria of this animal\n";
close(FILE);


The RE /^\s+gene\s/ matches a line that starts out with a bunch of white space followed by the word
gene followed by a space. This makes sure we will not grab any of the other cases. Here is its output.





gene 64..1029


gene 1160..2683


gene 2688..3361


gene 3489..3653


gene 3647..4309


gene 4314..5091


gene 5154..5493


gene complement(5748..6749)


gene complement(9281..10935)


gene complement(10998..12325)


gene complement(12319..12594)


gene 12731..13159


gene 13164..14239

There are 13 genes in the mitochondria of this animal


From this we can see that there are 13 genes four of which are on the
complement strand going from right to left. Now we are finally doing some
real data mining.
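
Going one step further, the parentheses introduced earlier can pull the start and end positions out of each gene line. The following is only a sketch: the sub name parse_gene_line is made up, and the sample lines are copied from the listing above rather than read from kangtick.txt.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

# Pull the start, end and strand out of one gene line.
# Returns an empty list if the line is not a gene line.
sub parse_gene_line {
    my ($line) = @_;
    if ($line =~ /^\s+gene\s+(complement\()?(\d+)\.\.(\d+)/) {
        my $strand = defined($1) ? "complement" : "forward";
        return ($2, $3, $strand);
    }
    return ();
}

# Sample lines in the layout of the FEATURES section.
my @lines = (
    "     gene            64..1029\n",
    "                     /gene=\"ND2\"\n",
    "     gene            complement(5748..6749)\n",
);

foreach my $line (@lines) {
    my ($start, $end, $strand) = parse_gene_line($line);
    next unless defined($start);          # skip lines that are not gene lines
    my $length = $end - $start + 1;
    print "gene from $start to $end ($strand strand), $length nucleotides\n";
}
```

The first gene, for example, is reported as 966 nucleotides long.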


If you are interested in the meaning of individual sections of this GenBank
file see

http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html


This contains an example file for the organism Saccharomyces cerevisiae, and the meanings of the
annotation labels are at the bottom of that document.


Homework

12.1


13. Using Arrays in Perl

Although arrays were introduced early in this document they were not used in processing of the
previous examples. We will correct that omission at this point but before doing so let’s do a little
review. Recall that an array is really a named list of elements kept in a variable that has a @ prefix
instead of a $. A variable such as @words contains a list of words while a variable such as $word would
contain only one word. Recall that we access an array using subscripting. If we would like to print the
first word in the array @words we would do something like

print $words[0];

Note that we use the $ prefix when we reference a
single item in the array @words.


We can initialize an array variable by assigning it an explicitly written list (aka array literal). An array
literal is nothing more than a list of elements separated by commas and enclosed in parentheses. The
elements may be strings or numbers and may be mixed in a list. For example (1,2,3) is a numeric array
literal, ("genes","dna","complement") is a string array literal, and ("Stangl","Fred", 7, 8.5) is a mixed
example. The empty array is represented by (). Since numeric lists are often used there is a shorthand
that makes creation of these easy. For example (1..5) is shorthand for (1,2,3,4,5) and (2..6,10,12) is
shorthand for (2,3,4,5,6,10,12).

Variables may also be used in initialization of a literal list. If $x=5 and $y=10 then the list
($x,2,3,$y) defines the literal (5,2,3,10).
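
The literals above can be tried directly. This is a minimal sketch with made-up variable names. It relies on one Perl feature not described above: an array placed inside a double quoted string is printed with its elements separated by blanks.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

my @a = (1..5);                 # same as (1,2,3,4,5)
my @b = (2..6, 10, 12);         # same as (2,3,4,5,6,10,12)
my ($x, $y) = (5, 10);
my @c = ($x, 2, 3, $y);         # same as (5,2,3,10)
my @mixed = ("Stangl", "Fred", 7, 8.5);
my @empty = ();

print "a: @a\n";
print "b: @b\n";
print "c: @c\n";
print "mixed: @mixed\n";
print "empty has ", scalar(@empty), " elements\n";
```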

An array variable is a variable that holds lists as defined by the above literals. Its name looks
like a normal variable name except it starts with the @ symbol instead of the $. Examples
include @list, @people and @Species.

The easiest way to initialize an array is by assigning a list (array literal) to it. The following is a
list of examples that demonstrate this



@list = (1,2,3,4,5);
@People = ("Bob","Tom","Sally");
@list2 = @list;    # you can copy one list into another.
@Species = qw(Sapiens Canus Bothriolepus);    # qw is a quoting operator that saves you from typing the quotes and commas.

The individual values of @list can be accessed using subscripts starting at 0. When you do this
you use a $ instead of the @ symbol. For example



$list[0] is the value 1



$People[2] is "Sally"

Here is a simple Perl program that initializes an array and prints out the values one per line.

@list=(1..10);
$i=0;
while($i<=9){
    print "$list[$i]\n";
    $i=$i+1;
}

Here is another example




@words = ("Home", "went", "House","Bill");
print "$words[3] $words[1] $words[0]\n";

which prints the string "Bill went Home"



The size of an array (ie its length) can be easily obtained by just assigning the array to a single
variable. For example $size=@words will assign 4 to the variable $size. We say we are using
the array in scalar context when we do this.
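
A short sketch of scalar context follows. It also shows $#words, a built-in Perl notation not covered above that gives the subscript of the last element, which is always one less than the size. The variable names are made up for illustration.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

my @words = ("Home", "went", "House", "Bill");

my $size       = @words;              # scalar context: the number of elements
my $last_index = $#words;             # subscript of the last element
my $last_word  = $words[$size - 1];   # the last element itself

print "The array holds $size words\n";
print "The last subscript is $last_index\n";
print "The last word is $last_word\n";
```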

There is a really interesting way that an array can be loaded. It is possible to load an array with
every line of an entire file. If FILE is a file handle for some opened file then the command

@lines = <FILE>; # This is called slurping the file.

will copy the ENTIRE file into the array @lines one line at a time. Each slot in the array is a
string that contains the associated line in the file. Recall that lines are determined by where
the \n’s are. Let’s look at some examples that make this clear. The first example is a rewriting
of the previous gene search program.

open(FILE,"mammoth.gb");
@linearray = <FILE>;     # loads the array with the lines in the file (slurps it)
$len = @linearray;       # Assigning an array to a variable gets length of the array
$ct = 0;
for($i=0;$i<$len;$i++){
    if($linearray[$i] =~ /^\s+gene\s/){
        print $linearray[$i];    # each line already ends with a \n
        $ct++;
    }
}
print "There are $ct genes in the mitochondria of this animal\n";
close(FILE);

Note that this loads the entire file into the array @linearray and then loops thru the array
printing out the matched lines. Although this is rather cool you must take note that the
ENTIRE file is loaded into memory (within the array). Some files are quite large and hence may
use up most if not all of your memory. In cases such as these it is advisable to process the file
a line at a time as we did earlier.

A simpler way of processing the array is by using the foreach construct. Its semantics should
be clear from this example.

open(FILE,"kangtick.txt");
@linearray = <FILE>;     # slurp it up
$ct = 0;
# Each line in the array is loaded into $line and then processed by the following loop
foreach $line (@linearray){
    if($line =~ /^\s+gene\s/){
        print $line;     # prints only those lines that have the word gene in it.
        $ct++;
    }
}
print "There are $ct genes in the mitochondria of this animal\n";
close(FILE);

The foreach loop construct is very handy for processing every slot in an array. If you need
only to process a few specific ones then the array subscripting method is probably required.

There is a very cool (and useful) command called split that will take a string of words and
break it up into its individual elements. An array will be used to hold these elements. The split
operation is written using a regular expression for its first parameter and a string as its second.
The regular expression defines the pattern that separates the string into its individual
elements. For example $line="Now:is:the:time" is a string whose words are separated by :'s.
The split command split(/:/,$line) will return the list ("Now","is","the","time") and this list can
be assigned to the array variable @list by the command

@list=split(/:/,$line);

Note that if the string is separated by blanks (which is more usual) then the command would
be @list=split(/ /,$line) where the : is replaced with a blank. If the words are separated by
multiple blanks and other white space (tabs, \n etc) then we could use

@list=split(/\s+/,$line)
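
The difference between the three separators is easiest to see by counting the pieces. In this sketch the test strings and variable names are made up for illustration. Note what happens with / / when two blanks appear in a row: split returns an empty string for the gap between them.

```perl
use strict;      # optional safety nets; with them, variables are declared using my
use warnings;

my @by_colon     = split(/:/,   "Now:is:the:time");
my @single_blank = split(/ /,   "Now  is the time");     # two blanks after Now
my @white_space  = split(/\s+/, "Now  is\tthe time\n");  # blanks, a tab and a \n

print "Split on colons:      ", scalar(@by_colon),     " pieces\n";
print "Split on one blank:   ", scalar(@single_blank), " pieces\n";
print "Split on white space: ", scalar(@white_space),  " pieces\n";

foreach my $piece (@single_blank) {
    print "[$piece]\n";    # the second piece is empty
}
```

Splitting on a single blank gives 5 pieces here, while /\s+/ gives the 4 words we actually want.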


So what can we do with this?

Suppose that Dr. Shipley had a class of 25 students that went out in the field and collected
data that involved counting species in different counties. He gave out the specification for
typing this information into an ascii file. This specification requested that each species be
typed in using the following format, one species per line.

genus species population location

After all 25 students did this it was discovered that the format was incorrect for the database
program being used. This program was designed to read text files in the format

species genus : location : population

Note that the genus and species are reversed and that colons are required instead of spaces in the
latter two locations. Mean Dr. Shipley told everyone to just retype the files since each student
had only to retype 500 different species lines. Rather than follow this advice, student Mr.
Awesome decided to write a short Perl program that would do the conversion. He gave
the program to the other students in the class and consequently became a Hero. Here is the
program that he wrote. Read it carefully.



#This program will reformat a data file. The original file
#looks like the following
# genus species population location(county)
# and should look like
# species genus : location : population
#There are several ways to do this. This method uses arrays and split
open(FILE,"data.txt")|| die "Sorry, I could not open the file data.txt";    # for input
open(OUT,">newformat.txt");    # for output
while($line=<FILE>){    # Remember that this reads one line at a time.
    @list=split(/\s+/,$line);    # split every line into its individual blank separated words
    print OUT "$list[1] $list[0] : $list[3] : $list[2]\n";    # write it to a file
    print "$list[1] $list[0] : $list[3] : $list[2]\n";        # and to the screen as well.
}
close(OUT);

The above program will convert the following data which is contained in data.txt

Equus caballus 232 Wichita

Equus asinus 221 Ford

Phascolarctos cinereus 23 Crocket

Myrmecobius fasciatus 456 Harris

Tamiasciurus douglasii 18 Smith

and write it to the file newformat.txt in the following format.

caballus Equus : Wichita : 232

asinus Equus : Ford : 221

cinereus Phascolarctos : Crocket : 23

fasciatus Myrmecobius : Harris : 456

douglasii Tamiasciurus : Smith : 18


Do you or do you not think that this is a lot easier than retyping everything in?
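As the comment in the program notes, there are several ways to do this conversion. One possible variation, given here only as a sketch, puts the conversion of a single line into a subroutine (a named block of code) and gives each field a name instead of a subscript. The subroutine name reformat_line and the variable names are made up for this illustration and are not part of Mr. Awesome's program.

```perl
# A sketch of the same conversion using named fields.
# reformat_line is a hypothetical helper name chosen for this example.
sub reformat_line {
    my ($line) = @_;
    # split ' ' skips leading blanks and splits on runs of white space
    my ($genus, $species, $population, $location) = split(' ', $line);
    return "$species $genus : $location : $population";
}

# Try it on two of the sample lines from data.txt
foreach my $line ("Equus caballus 232 Wichita\n", "Equus asinus 221 Ford\n") {
    print reformat_line($line), "\n";
}
```

Naming the fields makes the print statement easier to check against the required format, at the cost of a few extra variables.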

Split has many uses, and one of them solves a problem previously discussed. Recall that we were
trying to read in the name of an organism from a GenBank file and that the name appeared after
the word ORGANISM. Our problem was that there may be three names (genus, species and
subspecies) or only two (genus and species). If the line has been read into the variable $line
then here is a way to handle either case.


    @names = split(/\s+/,$line);
    $ct = @names;       # Get the length of this array.
    foreach $n (@names){
        print "$n\n";   # print one name per line.
    }



The above code will print out either two names or three names, depending on how long the array
@names is. Note that if $line still contains the word ORGANISM, that word is stored in @names
too. If you need to print only some of these words use subscripting.
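As a small sketch of subscripting, suppose the line still contains the word ORGANISM. The subroutine name organism_names and the sample lines below are assumptions made for this illustration. Splitting on ' ' instead of /\s+/ skips any blanks at the front of the line, so ORGANISM lands in $words[0] and the names follow it.

```perl
# Sketch: keep only the names that follow the word ORGANISM.
# organism_names is a hypothetical helper name used for this example.
sub organism_names {
    my ($line) = @_;
    my @words = split(' ', $line);     # ' ' ignores leading blanks
    return @words[1 .. $#words];       # everything after $words[0]
}

# Made-up sample lines with two and three names
foreach my $line ("  ORGANISM  Homo sapiens\n",
                  "  ORGANISM  Canis lupus familiaris\n") {
    my @names = organism_names($line);
    my $ct = @names;                   # number of names found
    print "$ct names: @names\n";
}
```

The slice @words[1 .. $#words] works for either case because $#words is the last subscript of the array, whatever its length.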


Homework

13.1 Download the 3NYU.pdb Mycoplasma genitalium MG289 file from the PDB web site. This
is information about a specific binding protein that occurs on the bacterium Mycoplasma
genitalium. It is a very large file. Your job is to write a program that will count a variety of
things. Count and print the number of lines that begin with each of the following words.

ATOM, CONECT, HELIX, REMARK, HETATM and SEQRES. Print out the percentage of the total
number of lines that each of these consume.

Here is an example output for a different molecule. Tabs were used in the print statement to get
the percentages to line up. Also, if you cannot get the numbers to truncate to the indicated
values don't worry. There are a variety of ways to do this. HINT: There is a function called int()
that will truncate a fractional number to an integer by dropping the decimal part.

Out of a total of 1692 lines we have

REMARK lines :432     Percentage is 25.5
HELIX  lines :6       Percentage is 0.3
ATOM   lines :1087    Percentage is 64.2
CONECT lines :7       Percentage is 0.4
SEQRES lines :12      Percentage is 0.7
HETATM lines :73      Percentage is 4.3
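The int() hint can be sketched as follows. This is only one of the several possible ways to truncate to one decimal place, and the subroutine name truncate1 is made up for this example. The idea is to multiply by 10, drop the fractional part with int(), and then divide by 10 again.

```perl
# Sketch: truncate a number to one decimal place using int().
# truncate1 is a hypothetical helper name used for this example.
sub truncate1 {
    my ($x) = @_;
    return int($x * 10) / 10;   # multiply, drop the fraction, divide back
}

# Counts taken from the REMARK row of the example output
my $total   = 1692;
my $remark  = 432;
my $percent = 100 * $remark / $total;
print "REMARK lines :$remark\tPercentage is ", truncate1($percent), "\n";
```

Because int() drops the decimal part rather than rounding, a value such as 0.35 comes out as 0.3, which agrees with the HELIX row in the example output.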


13.2 Use the same molecule as in problem 13.1 for this exercise. If you study the file section
that starts with the word ATOM you will notice that the last character on the line is a letter C,
O, N, S etc. These represent the name of each atom in the molecule. Carbon is C, oxygen is O,
nitrogen is N, sulfur is S and so on. You are to write a Perl program that processes this file and
prints out the number of C's, O's and N's that occur in the molecule.

HINT: Note that each letter is on the end of the line. It is followed by some white space and then a \n.
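The hint can be illustrated with a regular expression anchored to the end of the line. This is a sketch only: the subroutine name last_letter and the shortened sample lines are assumptions made for illustration and are not real records from the PDB file. Counting the letters is left to you.

```perl
# Sketch: pick out the letter at the end of a line.
# last_letter is a hypothetical helper name used for this example.
sub last_letter {
    my ($line) = @_;
    # one capital letter, then optional white space (including \n), then end of line
    if ($line =~ /([A-Z])\s*$/) {
        return $1;
    }
    return "";    # no letter found at the end of the line
}

# Shortened, made-up lines that only imitate the end of an ATOM line
print last_letter("ATOM      1  N   MET A   1   1.00 20.00           N  \n"), "\n";
print last_letter("ATOM      2  CA  MET A   1   1.00 20.00           C  \n"), "\n";
```

The \s* in the pattern absorbs both the trailing blanks and the \n, so the captured character is the last letter on the line.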


14. Hashes and their uses


15. Important Web Reference Sites



Protein Data Bank (PDB) (http://www.pdb.org/)



Project Gutenberg (http://www.gutenberg.org/)

National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/)


Appendix A: Perl instructions

Appendix B: Functions