BINF 634 Bioinformatics Programming

sparrowcowardBiotechnology

Oct 2, 2013 (3 years and 9 months ago)

79 views

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

1

Thanks to John Grefenstette for Many of These Slides !!

Topics


Quiz 5


Midterm


Perl Modules


Module

Getopt::Std


Range Operator


Restriction Maps


Parsing Rebase File


BINF 634 FALL 2011 LECTURE 8 Modules and Maps

2

Perl Modules
-

I


You can put useful subroutines into modules for future use


Only write and debug subroutines once




Creating a module:


Put subroutines into a file with extension .pm, e.g.
MySubs.pm



Include "
1;
" as last line in file

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

3

Perl Modules
-

II


Using a module:


Include statement:


use Mysubs;


note: leave off the
.pm

from the module name in use statement


Then use any subroutine defined in
MySubs.pm


Modules are included at compile time



Example:


BeginPerlBioinformatics.pm

contains subroutines from
Tisdall textbook (see web page)


http://oreilly.com/catalog/9780596000806/


BINF 634 FALL 2011 LECTURE 8 Modules and Maps

4

Perl Modules

Where does Perl find the modules?


Perl looks in the list of directories in built
-
in array
@INC


includes paths to standard modules such as

strict.pm
and

warnings.pm



@INC

automatically includes the current directory



You can prepend directories to @INC by using the
-
I switch

% perl
-
I"/mypath/libdir" prog.pl



You also can tell Perl where to look as follows:

use lib "/mypath";

use MyFASTASubs; # file is: "/mypath/MyFASTASubs.pm"

use MyRandomSubs;

% cat RAND.pm

#

# subroutines for using random numbers

#


sub random_uniform {


my ($lower, $upper) = @_;


return $lower + rand ($upper
-

$lower);

};


sub random_int {


my ($lower, $upper) = @_;


return int($lower + rand ($upper
-

$lower + 1));

}


sub shuffle_array {


my (@a) = @_;


my $length = scalar @a;


for (my $i = 0; $i < $length; $i++) {


my $j = random_int($i, $length
-
1);


($a[$i], $a[$j]) = ($a[$j], $a[$i]);


}


return @a;

}


sub shuffle_string {


my ($s) = @_;


my @a = split "", $s;


return join "", shuffle_array(@a);

}


1;

5

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

#!/usr/bin/perl

use strict;

use warnings;

# file: test_myrand.pl

use lib 'C:
\
Documents and
Settings
\
Owner
\
workspace
\
binf634_book_examples';

use RAND;

my $dna = "AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC";

print "$dna
\
n";

$dna = shuffle_string($dna);

print "$dna
\
n";


% test_myrand.pl

AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC

GCTACGCAGAGGTTAGTCATCTCGACATACTGCTT


% test_myrand.pl

AAAAAAAATTTTTTTTTTGGGGGGGGCCCCCCCCC

TCTTCGGAACCTACGACGTTCTATACAATGCTGGG


6

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

7

use Getopt::Std;

Suppose prog.pl contains the statements:

use Getopt::Std;

my %opts = (); # a hash to hold the options

getopts('bnf:',
\
%opts);


#
-
b &
-
n are boolean flags


#
-
f takes an argument (indicated by ":")


Hash keys will be
x

(where
x

is the switch name) with key values

the value of the argument or 1 if no argument is specified.

The following set %opts = (n=>1, f=>"infile")

% prog.pl
-
f infile
-
n

% prog.pl
-
nf infile

% prog.pl
-
n
-
f infile

#!/usr/bin/perl

use strict;

# file: opt.pl

use Getopt::Std;

my %options=(); # empty hash of options


# "d:" means d takes an argument

getopts("od:fF",
\
%options);


print "
-
o $options{o}
\
n" if defined $options{o};

print "
-
d $options{d}
\
n" if defined $options{d};

print "
-
f $options{f}
\
n" if defined $options{f};

print "
-
F $options{F}
\
n" if defined $options{F};


print "Unprocessed by Getopt::Std:
\
n" if @ARGV;

foreach (@ARGV) {


print "$_
\
n";

}


% opt.pl
-
Fd5
-
o 123 foo

-
o 1

-
d 5

-
F 1

Unprocessed by Getopt::Std:

123 foo

8

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

9

Restriction Maps (Tisdall Ch. 9)


Restriction Enzymes

are “chemical scissors” that cut DNA in
specific places specified by short sequence patterns


HindII binds and cuts at GTY^RAC


Y = C or T (pyrimidines)


R = A or G (purines)


^ indicates cleave site



A
Restriction Map

of a DNA sequence shows all the positions at
which a given Restriction Enzyme will cut



Several hundred restriction enzymes are known


Database REBASE:
http://www.neb.com



REBASE version 104 bionet.104




=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=


REBASE, The Restriction Enzyme Database http://rebase.neb.com


Copyright (c) Dr. Richard J. Roberts, 2001. All rights reserved.


=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=
-
=



Rich Roberts Mar 30 2001



AaaI (XmaIII) C^GGCCG

AacI (BamHI) GGATCC

AaeI (BamHI) GGATCC

AagI (ClaI) AT^CGAT

AaqI (ApaLI) GTGCAC

AarI CACCTGCNNNN^

AarI ^NNNNNNNNGCAGGTG

AatI (StuI) AGG^CCT

AatII GACGT^C

AauI (Bsp1407I) T^GTACA

AbaI (BclI) T^GATCA

AbeI (BbvCI) CC^TCAGC

AbeI (BbvCI) GC^TGAGG

AbrI (XhoI) C^TCGAG

AcaI (AsuII) TTCGAA

AcaII (BamHI) GGATCC

AcaIII (MstI) TGCGCA

AcaIV (HaeIII) GGCC

AccI GT^MKAC

# The IUB ambiguity codes

R = G or A

Y = C or T

M = A or C

K = G or T

S = G or C

W = A or T

B = not A (C or G or T)

D = not C (A or G or T)

H = not G (A or C or T)

V = not T (A or C or G)

N = A or C or G or T

10

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

11

Restriction Maps

Problem:


Given a DNA sequence and the name of a restriction enzyme in
REBASE


Find all location of the restriction sites on the DNA


Pseudocode:


read in rebase data from file “bionet”


create hash with key =
enzyme name

and value =
regular
expression


for each user query (in the form of an enzyme name)



if query is defined in the hash




get positions that match regular expr in DNA



report positions, if any


}

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

12

Range Operator

start .. end


In list context, returns a slice of an array:


print "@A[10 .. 20]
\
n"; # print items 10 thru 20



In a scalar context, return true if $. (line number)
is currently between start and end values:



# print out first 20 lines of file:


while (<FILEHANDLE>) {




print if (1 .. 20);


}


# print out up to first line containing "foo":


while (<FILEHANDLE>) {




print if (1 .. /foo/);


}


BINF 634 FALL 2011 LECTURE 8 Modules and Maps

13

Finding matches with regular expressions

Reminder: If we use the global modifier
g
, then
pos($string)

returns
position after the match:


$string = "A
TCG
CA
TGG
AA";

$string =~ /T.G/g;

print "$& ends at position ", pos($string)
-
1, "
\
n";

$string =~ /T.G/g;

print "$& ends at position ", pos($string)
-
1, "
\
n";


Output:

TCG ends at position 3

TGG ends at position 8

# Example 9
-
2 Subroutine to parse a REBASE datafile

# parseREBASE
-
Parse REBASE bionet file

#

# A subroutine to return a hash where

# key = restriction enzyme name

# value = blank
-
separated recognition site and regular expression


sub parseREBASE {


my($rebasefile) = @_;



use BeginPerlBioinfo;

# see Chapter 6 about this module



# Declare variables


my @rebasefile = ( );


my %rebase_hash = ( );


my $name;


my $site;


my $regexp;



# Read in the REBASE file


# note: incorrect on p. 191


my $rebase_filehandle =
open_file
($rebasefile);

14

BINF 634 FALL 2011 LECTURE 8 Modules and Maps


while(<$rebase_filehandle>) { # note: incorrect on p. 191


# Discard header lines



next if ( 1 .. /Rich Roberts/ );



# Discard blank lines


next if /^
\
s*$/;



# Split the fields


my @fields = split( " ", $_);



# Get and store the name and the recognition site


# grab just the first and last fields


$name = shift @fields;



$site = pop @fields;



# Translate the recognition sites to regular expressions


$regexp = IUB_to_regexp($site);



# Store the data into the hash



$rebase_hash{$name} = "$site $regexp";


}



# Return the hash containing the reformatted REBASE data


return %rebase_hash;

}

15

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

#############################################################

# Subroutine IUB_to_regexp

#

# Translate IUB ambiguity codes to regular expressions

#

# The IUB ambiguity codes

# R = G or A Y = C or T M = A or C K = G or T

# S = G or C W = A or T B = not A (C or G or T)

# D = not C (A or G or T) H = not G (A or C or T)

# V = not T (A or C or G) N = A or C or G or T


sub IUB_to_regexp {


my($iub) = @_;


my $regular_expression = '';


my %iub2character_class = (


A => 'A', C => 'C', G => 'G', T => 'T', R => '[GA]', Y => '[CT]',


M => '[AC]', K => '[GT]', S => '[GC]', W => '[AT]', B => '[CGT]',


D => '[AGT]', H => '[ACT]', V => '[ACG]', N => '[ACGT]',


);



# Remove the ^ signs


$iub =~ s/
\
^//g;



# Translate the iub sequence


for (my $i = 0; $i < length($iub); ++$i ){


$regular_expression .= $iub2character_class{substr($iub,$i,1)};


}



return $regular_expression;

}

16

BINF 634 FALL 2011 LECTURE 8 Modules and Maps



Sample Main Program to test parseREBASE subroutine:


#!/usr/bin/perl

use warnings;

use strict;

my %rebasehash = ();

%rebasehash = parseREBASE("bionet");

for my $key (sort keys %rebasehash) {


my ($site, $regexp) = split " ", $rebasehash{$key};


print "enzyme = $key site = $site regexp = $regexp
\
n";

}


sub parseREBASE {


...

}


Output:

enzyme = AaaI site = C^GGCCG regexp = CGGCCG

enzyme = AacI site = GGATCC regexp = GGATCC

enzyme = AaeI site = GGATCC regexp = GGATCC

enzyme = AagI site = AT^CGAT regexp = ATCGAT

enzyme = AaqI site = GTGCAC regexp = GTGCAC

enzyme = AarI site = ^NNNNNNNNGCAGGTG regexp =

[ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGTG

17

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

18

Restriction Maps

Problem:


Given a DNA sequence and the name of a restriction enzyme in
REBASE


Find all location of the restriction sites on the DNA


Pseudocode:


read in rebase data from file “bionet”


create hash with key =
enzyme name

and value =
regular expression


for each user query (in the form of an enzyme name)



if query is defined in the hash




get positions that match regular expr in
DNA



report positions, if any


}

#####################################################################

# Subroutine match_positions

# Find locations of a match of a regular expression in a string

# Return an array of positions where the regular expression

# appears in the string


sub match_positions {


my($regexp, $sequence) = @_;


my @positions = ( );



# Determine positions of regular expression matches


while ( $sequence =~ /$regexp/ig ) {


push @positions, pos($sequence)
-

length($&) + 1;


}


return @positions;

}


# sample main program:

my $dna = “AT
TGA
TAAG
TCA
”;

my @pos = match_positions(“T[GC]A”, $dna);

print “@pos
\
n”;

exit;

Output: 3 10

19

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

20

Restriction Maps

Problem:


Given a DNA sequence and the name of a restriction
enzyme in REBASE


Find all location of the restriction sites on the DNA


Pseudocode:


read in rebase data from file “bionet”


create hash with key =
enzyme name

and value =
regular expression


for each user query (in the form of an enzyme name)



if query is defined in the hash




get positions that match regular expr in
DNA



report positions, if any


}

#!/usr/bin/perl
-
w

# Make restriction map from user queries on names of

# restriction enzymes


use strict;

use warnings;

use BeginPerlBioinfo; # see Chapter 6 about this module


# Declare and initialize variables

my %rebase_hash = ( );

my @file_data = ( );

my $query = '';

my $dna = '';

my $recognition_site = '';

my $regexp = '';

my @locations = ( );


# Read in the file "sample.dna"

@file_data = get_file_data("sample.dna");


# Extract the DNA sequence data from the file "sample.dna"

$dna = extract_sequence_from_fasta_data(@file_data);


# Get the REBASE data into a hash, from file "bionet"

%rebase_hash = parseREBASE("bionet");


21

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

# Prompt user for restriction enzyme names, create restriction map

do {



print "Search for what restriction site (or quit)?: ";


$query = <STDIN>;


chomp $query;



# Exit if empty query


if ($query =~ /^
\
s*$/ ) { exit;}



# Perform the search in the DNA sequence


if ( exists $rebase_hash{$query} )

{


($recognition_site, $regexp) = split " ", $rebase_hash{$query};



# Create the restriction map


@locations = match_positions($regexp, $dna);



# Report the restriction map to the user


if (@locations) {


print "Searching for $query $recognition_site $regexp
\
n";


print "A restriction site for $query at locations:
\
n";


print join(" ", @locations), "
\
n";


} else {


print "A restriction enzyme $query is not in the DNA:
\
n";


}


}


print "
\
n";

} until ( $query =~ /quit/ );

exit;

22

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

23

% rebase.pl

Search for what restriction site (or quit)?: AceI

Searching for AceI G^CWGC GC[AT]GC

A restriction site for AceI at locations:

54 94 582 660 696 702 840 855 957


Search for what restriction site (or quit)?: AceII

A restriction enzyme AceII is not in the DNA:


Search for what restriction site (or quit)?: Acc36I

Searching for Acc36I ^NNNNNNNNGCAGGT
[ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT][ACGT]GCAGGT

A restriction site for Acc36I at locations:

166


Search for what restriction site for (or quit)?: quit


%


The Program in Action!!

BINF 634 FALL 2011 LECTURE 8 Modules and Maps

24

HW


Read Tisdall Appendix Chapter 9 and
Appendix B