Pattern Handling


2 Oct 2013


Lecture 7: Perl pattern handling features

Pattern Matching


Recall that =~ is the pattern matching operator.



A first simple match example


print "A methionine amino acid is found\n" if $AA =~ /m/;

This means: if the string $AA contains the letter m, print that a methionine amino acid was found.

What is inside the / / is the pattern, and =~ is the pattern matching operator.



It could also be written as

if ($AA =~ /m/) {
    print "A methionine amino acid is found\n";
}


Met.pl











Pattern Matching


If we want to check for the start codon we could use:



if ($seq =~ /ATG/) {
    print "a start codon was found on line number $.\n";
}



Or we could write /ATG/i (where the i modifier makes the match case-insensitive).



If we want to see if there is an A or T or G or C in the sequence, use:

$seq =~ /[ATGC]/



The main way to use the Boolean OR is the | symbol inside the pattern:

if ($dna =~ /GAATTC|AAGCTT/) {
    print "EcoRI or HindIII site found!!!";
}



(Note: GAATTC is the recognition site of the restriction enzyme EcoRI, an important DNA sequence; AAGCTT is the HindIII site.)
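The matches on this slide can be tried out together in a short script. This is a minimal sketch, not one of the lecture files: the test string in $dna is invented for the illustration, and the three result variables are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical test sequence, invented for this illustration.
my $dna = "ccatgGAATTCtt";

# /ATG/ is case-sensitive, so the lowercase "atg" above is not found.
my $has_atg_exact = ($dna =~ /ATG/) ? 1 : 0;

# The i modifier ignores case, so "atg" now counts as a match.
my $has_atg_any = ($dna =~ /ATG/i) ? 1 : 0;

# | means OR: the match succeeds if either site is present.
my $has_site = ($dna =~ /GAATTC|AAGCTT/) ? 1 : 0;

print "ATG (exact case): $has_atg_exact\n";
print "ATG (any case):   $has_atg_any\n";
print "GAATTC or AAGCTT: $has_site\n";
```

Changing the letters in $dna and re-running is a quick way to see which pattern reacts to which change.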




Sequence size example

File_size_2 example


#!/usr/bin/perl

# file size2.pl

$length = 0; $lines = 0;

while (<>) {
    chomp;
    $length = $length + length $_ if $_ =~ /[GATCgatc]/;
    $lines = $lines + 1;
}

print "LENGTH = $length\n"; print "LINES = $lines\n";



The above is a modification of the file length example so that it counts only input lines that contain a G, A, T or C.

However, this will lead to problems for FASTA files, as the descriptor line will be included. Why?


Pattern Matching


A NOT Boolean operator, such as testing whether the sequence contains letters that are not vowels, can be represented in a pattern by using the ^ symbol at the start of a character class, e.g.

if ($seq =~ /[^aeiou]/) { print "contains a character that is not a vowel"; }



More flexible pattern syntax:

It is quite common to check for words or numbers, so Perl has shorthand for them:

/[0-9]/ or /\d/ is a digit

A word character is /[a-zA-Z0-9_]/ and is represented by /\w/ (word)

/\s/ represents a white space

Inverting the case of the letter gives the reverse meaning; e.g. /\S/ (non white space)

A more complete list of what are referred to as “metacharacters” is shown in the next slide (you must of course use =~ in the expression).
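The shorthand classes can be checked against a small string. The sketch below is only an illustration: the string in $text and the variable names are made up, and the counting lines use the common Perl idiom of assigning a global match to an empty list to get the number of matches.

```perl
use strict;
use warnings;

# Hypothetical input string, invented for this illustration.
my $text = "clone 333a";

my $has_digit     = ($text =~ /\d/)       ? 1 : 0;  # same as /[0-9]/
my $has_space     = ($text =~ /\s/)       ? 1 : 0;  # the blank between the words
my $has_non_word  = ($text =~ /\W/)       ? 1 : 0;  # the blank is not a word character
my $has_non_vowel = ($text =~ /[^aeiou]/) ? 1 : 0;  # ^ inside [] negates the class

# A global match in list context returns every match; "= () =" counts them.
my $digits = () = $text =~ /\d/g;
my $words  = () = $text =~ /\w+/g;

print "digits: $digits, words: $words\n";
```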


Pattern matching: metacharacters

Metacharacter   Description
.               Any character except newline
\.              Full stop character
^               The beginning of a line
$               The end of a line
\w              Any word character (letter, digit or underscore)
\W              Any non-word character
\s              White space (spaces, tabs, newlines, carriage returns)
\S              Non-white space
\d              Any digit
\D              Any non-digit




You can also specify the number of times a pattern may occur (once, several times or a specific number of times).


More information on variations of metacharacters here: metacharacters





Pattern matching: Quantifiers


Quantifier   Description
?            0 or 1 occurrence
+            1 or more occurrences
*            0 or more occurrences
{N}          Exactly N occurrences
{N,M}        Between N and M occurrences
{N,}         At least N occurrences
{,M}         No more than M occurrences (Perl 5.34 and later; write {0,M} on older versions)



Pattern matching: Quantifiers


Consider the following pattern:

DT249 4 (your class code) consists of one or more word characters, then a space and then a digit, so the match is:

=~ /\w+\s\d/



If the sequence has the following format:

Pu-C-X(40-80)-Pu-C

where Pu is [AG] and X is [ATGC]:

$sequence =~ /[AG]C[GATC]{40,80}[AG]C/;



Quantify.pl
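Both quantifier patterns above can be exercised with a few test strings. This is a sketch under stated assumptions and is not the contents of Quantify.pl: make_seq is a hypothetical helper that builds an artificial Pu-C-X(n)-Pu-C sequence so that the {40,80} limits can be probed at their edges.

```perl
use strict;
use warnings;

# Class code: one or more word characters, a white space, then a digit
my $code_ok  = ("DT249 4" =~ /\w+\s\d/) ? 1 : 0;
my $code_bad = ("DT2494"  =~ /\w+\s\d/) ? 1 : 0;   # no space, so no match

# Hypothetical helper: builds "AC", then n Ts, then "GC".
# The x operator repeats a string, so "T" x 40 is forty Ts.
sub make_seq {
    my ($n) = @_;
    return "AC" . ("T" x $n) . "GC";
}

my $motif = qr/[AG]C[GATC]{40,80}[AG]C/;

# Try spacer lengths just outside and just inside the allowed range
my %result;
for my $n (39, 40, 80, 81) {
    $result{$n} = (make_seq($n) =~ $motif) ? 1 : 0;
    print "spacer of $n bases: ", ($result{$n} ? "match" : "no match"), "\n";
}
```

Only the spacers of 40 and 80 bases match, because the test sequences contain no other position where the motif could start.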

Pattern Matching


To determine where to look for a “pattern” in a sequence, use anchors.

The start of line anchor ^

/^>/     only those lines beginning with >

The end of line anchor $

/>$/     only where the last character is >

/^$/     what does this mean?



The boundary anchor \b

E.g. matching a word exactly:

/\bword\b/

where \b is a word boundary: this matches “word” only as a whole word, not where the letters w, o, r and d occur inside a longer word such as “swordfish”.



The non-boundary anchor is \B

/\Bworth\B/ matches inside words like unworthy and trustworthy, but not worthy or worth. Likewise, /\Bword\B/ matches inside swordfish but not word or wordy.
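The anchors can be compared side by side on a few strings. The sketch below is illustrative only; the test strings and variable names are invented for it.

```perl
use strict;
use warnings;

# ^ and $ tie the pattern to the start or the end of the line
my $is_header = (">gi|171361" =~ /^>/) ? 1 : 0;   # begins with >
my $ends_gt   = ("a>b"        =~ />$/) ? 1 : 0;   # > is not the last character

# \b requires a word boundary on each side of "word"
my $whole  = ("a word here" =~ /\bword\b/) ? 1 : 0;
my $inside = ("swordfish"   =~ /\bword\b/) ? 1 : 0;

# \B requires the opposite: no boundary on either side
my $embedded = ("swordfish"   =~ /\Bword\B/)  ? 1 : 0;
my $alone    = ("word"        =~ /\Bword\B/)  ? 1 : 0;
my $worthy   = ("trustworthy" =~ /\Bworth\B/) ? 1 : 0;

print "header=$is_header ends_gt=$ends_gt whole=$whole inside=$inside\n";
print "embedded=$embedded alone=$alone worthy=$worthy\n";
```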


Sequence Size example: modified


File_size_2 example


#!/usr/bin/perl

# file size2.pl

$length = 0; $lines = 0;

while (<>) {
    chomp;
    # Count the line only if it consists entirely of bases
    $length = $length + length $_ if $_ =~ /^[GATCgatc]+$/;
    # Alternative: $length += length if /^[GATCN]+$/i;
    $lines = $lines + 1;
}

print "LENGTH = $length\n"; print "LINES = $lines\n";


The pattern in the line that updates $length is modified (it now ends in +$): explain why this modification is necessary.
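One way to investigate the question is to run both kinds of pattern over the same lines and compare the totals. This is a minimal sketch: the three input lines are hypothetical stand-ins for a FASTA file, and $loose and $anchored are names chosen for the illustration.

```perl
use strict;
use warnings;

# Hypothetical FASTA-style input, made up for this comparison.
my @lines = (">seq1 example clone", "GATTACA", "gattaca");

my ($loose, $anchored) = (0, 0);
for my $line (@lines) {
    # Unanchored: true if any single character is one of the bases
    $loose += length $line if $line =~ /[GATCgatc]/;

    # Anchored: true only if every character from start to end is a base
    $anchored += length $line if $line =~ /^[GATCgatc]+$/;
}

print "unanchored LENGTH = $loose\n";
print "anchored LENGTH   = $anchored\n";
```

The two totals differ by exactly the length of the descriptor line.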






Extracting Patterns


The second aspect of Perl pattern handling is pattern extraction.

Consider a descriptor line like

> M185580, clone 333a, complete sequence

M18… is the sequence ID

clone 333a, com…. : optional comments

We need to store some of the elements of the descriptor line:

$seq =~ / (\S+)/;

The part of the match inside the parentheses is extracted and put into the variable $1.
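As an illustration of extraction with more than one pair of parentheses, the sketch below splits the descriptor line above into an ID and a comment. The pattern is one possible choice, not the lecture's own, and $id and $comments are hypothetical variable names.

```perl
use strict;
use warnings;

my $seq = "> M185580, clone 333a, complete sequence";

my ($id, $comments);

# After the leading > and any blanks, capture everything up to the first
# comma or blank as the ID, then capture the rest of the line as comments.
if ($seq =~ /^>\s*([^\s,]+),?\s*(.*)$/) {
    ($id, $comments) = ($1, $2);
    print "ID:       $id\n";
    print "Comments: $comments\n";
}
```

Testing the match inside an if means $1 and $2 are only read when the match actually succeeded.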


Extracting patterns


#!/usr/bin/perl -w

# demonstrates the effect of parentheses.

while ( my $line = <> )
{
    $line =~ /\w+ (\w+) \w+ (\w+)/;

    print "Second word: '$1' on line $..\n" if defined $1;
    print "Fourth word: '$2' on line $..\n" if defined $2;
}


Change it to capture the first and the third word of a sentence.

More examples in ExtractExample1.pl


Search and replace



s/t/u/       replace (t) thymine with (u) uracil; once only

s/t/u/g      (g = global) so scan the whole string

s/t/u/gi     (global and case insensitive)

What about the following:

s/^\s+//

s/\s+$//

s/\s+/ /g    (where g stands for global)

See examples in SearchReplace.pl
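The substitutions can be tried on small strings to see what each one does. This sketch is not the contents of SearchReplace.pl; the strings and variable names are invented, and each substitution works on a copy so the original stays unchanged.

```perl
use strict;
use warnings;

my $dna = "gattaca";

# (my $copy = $dna) =~ s/.../.../ copies first, then edits the copy
(my $once = $dna) =~ s/t/u/;            # first t only
(my $all  = $dna) =~ s/t/u/g;           # every t
(my $any  = "GATTACA") =~ s/t/u/gi;     # every t or T

# Hypothetical untidy input line
my $line = "   clone   333a  \t ";
$line =~ s/^\s+//;     # remove white space at the start
$line =~ s/\s+$//;     # remove white space at the end
$line =~ s/\s+/ /g;    # squeeze each remaining run to a single space

print "$once $all $any\n";
print "[$line]\n";
```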



Search/replace/extract

Write a program that removes the > from the FASTA descriptor line and assigns each element to appropriate variables.

Example: Fastafile_replace.txt

>gi|171361, Saccharomyces cerevisiae, cystathionine gamma-lyase


GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC


GCTACAGAGCCAACCCGGTGGACAAACTCGAAGTCATTGTGGACCGAATGAGGCTCAATAACGAGATTAGCG


ACCTCGAAGGCCTGCGCAAATATTTCCACTCCTTCCCGGGTGCTCCTGAGTTGAACCCGCTTAGAGACTCCG


AAATCAACGACGACTTCCACCAGTGGGCCCAGTGTGACCGCCACACTGGACCCCATACCACTTCTTTTTGTT


ATTCTTAAATATGTTGTAACGCTATGTAATTCCACCCTTCATTACTAATAATTAGCCATTCACGTGATCTCA


GCCAGTTGTGGCGCCACACTTTTTTTTCCATAAAAATCCTCGAGGAAAAGAAAAGAAAAAAATATTTCAGTT


ATTTAAAGCATAAGATGCCAGGTAGATGGAACTTGTGCCGTGCCAGATTGAATTTTGAAAGTACAATTGAGG


CCTATACACATAGACATTTGCACCTTATACATATAC


Functions and Modules


In Perl, functions (subroutines) take the following format:

sub subname
{
    my $var1 = $_[0];
    my $var2 = $_[1];

    statements

    return value
}

After all the subs are defined they are then called as required:

my $variable = subname($var1, $var2);

See SubroutineExample.pl
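A concrete subroutine in this format might look as follows. It is a sketch only and not the contents of SubroutineExample.pl: count_matches is a hypothetical name, and the sequence passed to it is invented.

```perl
use strict;
use warnings;

# Hypothetical subroutine: counts non-overlapping occurrences of a
# pattern in a sequence.
sub count_matches
{
    my $sequence = $_[0];
    my $pattern  = $_[1];

    # A global match in list context returns all matches; "= () =" counts them
    my $count = () = $sequence =~ /$pattern/g;

    return $count;
}

my $hits = count_matches("ATGCCATGA", "ATG");
print "ATG found $hits times\n";
```

The arguments arrive in the special array @_, which is why the first two are read as $_[0] and $_[1].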


Exercises


Write a script that:

1. Confirms if the user has input the code in the following format:

   Classcode_yearcode(papercode)

   E.g. dt249 4(w203c)

2. Many important DNA sequences have specific patterns, e.g. TATA. Write a script to find the position of this sequence in a FASTA file sequence.

Exercises

3. Write a script that can find the reverse complement of a DNA sequence without using the tr function. (Hint: a global search and replace will give an incorrect answer.)

4. Coding regions begin with the AUG (ATG) codon and end with a stop codon. Write a Perl script that extracts a coding sequence from a FASTA file.




Exercise

5. Modify the Sequence size example from earlier to:

   Allow the user to input a file name and determine its length.

   Write the script using one or more subroutines.

Exam Questions


Perl is an important bioinformatics language. Explain the main features of Perl that facilitate its ability to handle downloaded DNA data files. Write a Perl script that illustrates its ability to access DNA sequences.

(refer to assignment/previous papers' Perl scripts)