Lecture 9:


Oct 2, 2013


Bioinformatics Lecture 8: Perl pattern matching features

Questions to think about


Create a hash table that performs the codon to amino acid (AA) conversion and use it to convert codons entered from the keyboard into their corresponding amino acids.

Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.

Questions to think about


Write a script that reads in the DNA sequences from two FASTA files, assumes the sequence length is the same for both, and determines the number of alignment matches to non-matches.

Introduction


Pattern matching

Pattern extraction

Pattern substitution

Split and join functions

Unpack function


Pattern Matching


Recall =~ is the pattern matching operator

A first simple match example

print "EcoRI site found!" if $dna =~ /gat/;

It means: if $dna (a string) contains the pattern gat, then print "EcoRI site found!". What is inside the two slashes (/ /) is the pattern, and =~ is the pattern matching operator.
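The simple match above can be tried without any input file. The following is a minimal sketch; the helper name `has_gat` and the two test strings are made up for illustration. It also shows that a plain pattern is case sensitive.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $dna contains the literal pattern "gat".
sub has_gat {
    my ($dna) = @_;
    return $dna =~ /gat/ ? 1 : 0;
}

# Two made-up strings: the first contains "gat", the second does not.
for my $dna ('ccgatta', 'ccgcta') {
    print "$dna: ", (has_gat($dna) ? "pattern found" : "no match"), "\n";
}

# Without the i modifier the match is case sensitive, so "GAT" is not found.
print "GAT: ", (has_gat('GAT') ? "pattern found" : "no match"), "\n";
```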


More patterns



if ($dna =~ /[GATCgatc]/)

if /^[GATC]/i

if ($dna =~ /GAATTC|AAGCTT/) { print "EcoRI site found!!!"; }

Here | is the Boolean "or" symbol: the pattern matches GAATTC or AAGCTT.
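A short sketch of the alternation and character-class patterns listed here. The helper names and sample strings are assumptions made for this example, and the `i` modifier is added so that lower-case sequences also match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1 if $dna contains GAATTC or AAGCTT, in either case.
sub has_either_site {
    my ($dna) = @_;
    return $dna =~ /GAATTC|AAGCTT/i ? 1 : 0;
}

# Hypothetical helper: 1 if the string starts with one of G, A, T or C.
sub starts_with_base {
    my ($text) = @_;
    return $text =~ /^[GATC]/i ? 1 : 0;
}

# Made-up inputs: two sequences and one descriptor-style line.
for my $text ('ttgaattcgg', 'ccAAGCTTcc', '>header line') {
    printf "%-14s site: %-3s starts with base: %s\n",
        $text,
        has_either_site($text)  ? 'yes' : 'no',
        starts_with_base($text) ? 'yes' : 'no';
}
```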









Pattern Matching


A more flexible pattern:

print "EcoRI site found!" if $dna =~ /GAA[GATC]TTC/;

Pattern where the 4th letter is any letter within the square brackets

[^GATC] means any character other than G, A, T or C

[0-9] or \d (digit) [a-z] [-A-Z] /[AT][GC][TG]/

/[a-zA-Z0-9_]/ or /\w/ (word)

/\s/ (white space); to invert \s, uppercase the letter: \S (non-white space)
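The character classes above can be checked one character at a time. This is a sketch only; `classify_char` and `has_non_base` are hypothetical names chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: names the first class that a single character matches.
# \d is tested before \w because every digit is also a word character.
sub classify_char {
    my ($c) = @_;
    return 'digit'       if $c =~ /^\d$/;
    return 'word'        if $c =~ /^\w$/;
    return 'white space' if $c =~ /^\s$/;
    return 'other';
}

# Hypothetical helper: 1 if $seq holds any character outside G, A, T, C.
# The ^ inside the brackets negates the class.
sub has_non_base {
    my ($seq) = @_;
    return $seq =~ /[^GATC]/i ? 1 : 0;
}

for my $c ('7', 'a', '_', ' ', '-') {
    print "'$c' is classified as: ", classify_char($c), "\n";
}
print "GATNACA has a non-base character\n" if has_non_base('GATNACA');
```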


Pattern matching: metacharacters

Metacharacter   Description
.               Any character except newline
\.              Full stop character
^               The beginning of a line
$               The end of a line
\w              Any word character (non-punctuation, non-white space)
\W              Any non-word character
\s              White space (spaces, tabs, carriage returns)
\S              Non-white space
\d              Any digit
\D              Any non-digit

You can also specify the number of times a pattern occurs [single, multiple or specific multiple]

More information on variations of metacharacters here: metacharacters





Pattern matching: Quantifiers


Quantifier   Description
?            0 or 1 occurrence
+            1 or more occurrences
*            0 or more occurrences
{N}          N occurrences
{N,M}        Between N and M occurrences
{N,}         At least N occurrences
{0,M}        No more than M occurrences (the short form {,M} is only accepted from Perl 5.34 onwards)



Pattern matching: Quantifiers


Pattern match the following format:

M58200.2 { =~ /\w+\.\d+/ }

If the sequence is: Pu-C-X(40-80)-Pu-C

Pu = [AG] and X = [ATGC]

$sequence =~ /[AG]C[GATC]{40,80}[AG]C/;
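The two quantifier patterns can be exercised with synthetic strings. In this sketch the helper names and the artificial spacers of repeated T are assumptions. The accession check is anchored with ^ and $, which the slide's pattern is not, so that the whole string must have the format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: 1 if the whole string looks like M58200.2
# (word characters, a literal full stop, then digits).
sub looks_like_accession {
    my ($text) = @_;
    return $text =~ /^\w+\.\d+$/ ? 1 : 0;
}

# Hypothetical helper: 1 if $sequence contains Pu-C-X(40-80)-Pu-C.
sub has_motif {
    my ($sequence) = @_;
    return $sequence =~ /[AG]C[GATC]{40,80}[AG]C/ ? 1 : 0;
}

# Build test sequences with spacers of 39, 40, 80 and 81 bases.
for my $n (39, 40, 80, 81) {
    my $sequence = 'AC' . ('T' x $n) . 'GC';
    printf "spacer of %2d bases: %s\n", $n,
        has_motif($sequence) ? 'motif found' : 'no motif';
}
print "M58200.2 has the expected format\n" if looks_like_accession('M58200.2');
```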





Extracting pattern to variables


Anchors


E.g. matching a word exactly:

/\bword\b/

\b is a word boundary: the pattern matches only the whole word "word", not the letters w, o, r, d inside a longer word

The start of line anchor ^

/^>/ matches only those lines beginning with >

The end of line anchor $

/>$/ matches only where the last character is >

/^$/ : what does this mean?
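The anchors can be compared side by side on a few made-up lines. This is a sketch with hypothetical helper names; it also shows one reading of /^$/, namely a line with nothing between its start and its end.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers, one per anchor discussed above.
sub begins_with_gt { my ($l) = @_; return $l =~ /^>/ ? 1 : 0; }
sub ends_with_gt   { my ($l) = @_; return $l =~ />$/ ? 1 : 0; }
sub is_empty_line  { my ($l) = @_; return $l =~ /^$/ ? 1 : 0; }

# \b matches at a word boundary; \Q...\E quotes any special characters in $word.
sub has_whole_word {
    my ($text, $word) = @_;
    return $text =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# Made-up lines covering each case.
for my $line ('>seq1 test', 'a > b', 'ends with >', '') {
    printf "%-14s ^>: %d  >\$: %d  ^\$: %d\n", "'$line'",
        begins_with_gt($line), ends_with_gt($line), is_empty_line($line);
}
print "whole word found\n" if has_whole_word('a word here', 'word');
print "not found inside swordfish\n" unless has_whole_word('swordfish', 'word');
```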




Further examples


File_size_base_only.pl example

#!/usr/bin/perl
# file size2.pl
$length = 0; $lines = 0;
while (<>) {
    chomp;
    # add the length of every line that contains at least one base letter
    $length = $length + length $_ if $_ =~ /[GATCgatc]/;
    # Alternative: $length += length if /^[GATCN]/i;
    $lines = $lines + 1;
}
print "LENGTH = $length\n"; print "LINES = $lines\n";
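One thing to watch with the /[GATCgatc]/ test is that a descriptor line usually contains one of those letters too, so its length is counted as well. The sketch below is a variant, not the lecture's script: it works on an in-memory list instead of <> so it can be run without a file, and the sub name `count_bases` and the sample lines are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (number of bases, number of lines).
# Only lines made up entirely of G, A, T, C or N are counted as sequence.
sub count_bases {
    my @lines = @_;
    my ($length, $lines) = (0, 0);
    for my $line (@lines) {
        chomp $line;
        $lines++;
        next if $line =~ /^>/;                      # skip descriptor lines
        $length += length $line if $line =~ /^[GATCN]+$/i;
    }
    return ($length, $lines);
}

# Made-up FASTA-style input: one descriptor line and two sequence lines.
my @sample = (">id1 example descriptor\n", "GATC\n", "gattaca\n");
my ($length, $lines) = count_bases(@sample);
print "LENGTH = $length\n";
print "LINES = $lines\n";
```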




FASTA files

Write and test (file_size_bases_only.pl) using a FASTA file as input: FASTADNA1.txt:

example of FASTA file

>2L52.1 CE20433 Zinc finger, C2H2 type (CAMBRIDGE) protein id:CAA21776.1
GCAGCGCACGACAGCTGTGCTATCCCGGCGAGCCCGTGGCAGAGGACCTCGCTTGCGAAAGCATCGAGTACC



sample of file in EMBL format

gccacagatt acaggaagtc atatttttag acctaaatca ctatcctcta tctttcagca 60
agaaaagaac atctacttgg tttcgttccc tatccaagat tcagatggtg aaacgagtga 120
tcatgcacct gatgaacgtg caaaaccaca gtcaagccat gacaaccccg atctacagtt 180
tgatgttgaa actgccgatt ggtacgccta cagtgaaaac tatggcacaa gtgaagaaaa 240



Sample of an NCBI record format:

  1 atgaacccca acctgtgggt cgacgcgcag agcacttgca agagggaatg cgacgctgac
 61 ctggagtgcg agacctttga gaagtgctgc cccaatgtct gtggaaccaa gagctgtgtg
121 gctgctcggt acatggacat caaggggaag aaggggcctg tggggatgcc caaagaggca
181 acctgtgacc gcttcatgtg catccagcaa ggctcagagt gcgacatctg ggacgggcag
241 cctgtctgca agtgcaagga caggtgtgag aaggagccga gctttacctg cgcctcggac



Extracting Patterns


Consider a sequence like

>M185580 clone 333a, complete sequence

> marks the start of the descriptor line

M18… is the sequence ID

Clone 33a, com…. : optional comments

Need to store some elements of the descriptor line:

=~ /(\S+)/ : the part of the match inside the parentheses is extracted and put into the variable $1
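A sketch of how the parentheses could be used on the descriptor line shown above. The sub name `parse_descriptor` is hypothetical, and treating everything after the first white space as the comment is an assumption about the layout.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (sequence ID, comment) for a descriptor line,
# or an empty list if the line does not start with ">".
sub parse_descriptor {
    my ($header) = @_;
    # $1 = run of non-white-space after ">", $2 = the rest of the line
    if ($header =~ /^>(\S+)\s*(.*)$/) {
        return ($1, $2);
    }
    return;
}

my ($id, $comment) = parse_descriptor('>M185580 clone 333a, complete sequence');
print "ID:      $id\n";
print "Comment: $comment\n";
```

Testing the match inside an if before reading $1 and $2 avoids picking up values left over from an earlier successful match.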


Extracting patterns


#!/usr/bin/perl -w

# demonstrates the effect of parentheses.

while ( my $line = <> )
{
    $line =~ /\w+ (\w+) \w+ (\w+)/;

    # $. holds the current input line number
    print "Second word: '$1' on line $..\n" if defined $1;
    print "Fourth word: '$2' on line $..\n" if defined $2;
}

Change it to catch the first and the 3rd word of a sentence


Search and replace



s/t/u/ replaces (t) thymine with (u) uracil; once only

s/t/u/g (g = global) so scan the whole string

s/t/u/gi (global and case insensitive)

What about the following:

s/^\s+//

s/\s+$//

s/\s+$/ /g (where g stands for global)
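The substitutions can be tried on short strings. This is a sketch with hypothetical helper names. Note that s/t/u/gi replaces an upper-case T with a lower-case u, so the example uses two substitutions to keep the case of each base.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: replaces every thymine with uracil, keeping the case.
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ s/t/u/g;   # copy first, then substitute on the copy
    $rna =~ s/T/U/g;
    return $rna;
}

# Hypothetical helper: removes leading and trailing white space.
sub trim {
    my ($text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    return $text;
}

print transcribe('gattaca'), "\n";
print transcribe('GATTaca'), "\n";
print "'", trim("  gat \n"), "'\n";
```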






Write a Perl script that reads in the DNA sequences from the FastaDNA1file.txt and replaces all the thymine bases with the corresponding uracil bases



Splits and joins


To transform strings into arrays: split

Line 1 looks like:

192a8,The Stranger DNA ,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC

Consider the following code:

chomp($line = <>); # read the line into $line

@fields = split ',', $line;

($clone, $laboratory, $left_oligo, $right_oligo) = split ',', $line;

This reads in line 1 and puts each part that precedes a delimiter, e.g. 192a8, into an element of the array.

To transform arrays (lists) into strings: join

$tab = join "\t", @fields;

192a8    The Sanger Centre    GGGTTCCGATTTCCAA    CCTTAGGCCAAATTAAGGCC
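A runnable sketch of split followed by join, using the comma-separated line shown earlier as a literal string instead of reading it from <>. Splitting on /\s*,\s*/ rather than ',' is a choice made for this example so that stray spaces around the commas are dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample line from the text, held in a string so no input file is needed.
my $line = "192a8,The Stranger DNA ,GGGTTCCGATTTCCAA,CCTTAGGCCAAATTAAGGCC\n";
chomp $line;

# split: string -> list; the pattern also swallows spaces around each comma.
my @fields = split /\s*,\s*/, $line;
my ($clone, $laboratory, $left_oligo, $right_oligo) = @fields;

# join: list -> string, here with a tab between the fields.
my $tab = join "\t", @fields;

print "clone:      $clone\n";
print "laboratory: $laboratory\n";
print "$tab\n";
```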



# initialize an array

my @perlFunc = ("substr", "grep", "defined", "undef");

my $perlFunc = join " ", @perlFunc;

print "Perl Functions: $perlFunc\n";

See example split_file.pl








Other useful functions

Unpack syntax:

@triplets = unpack("a3" x (length($line)/3), $line);

Frame shift (1 position to the right):

@triplets = unpack('a' . "a3" x (length($line)/3), $line);
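A sketch of the unpack idea with an explicit reading frame. The sub name `codons` and the short test sequence are assumptions. In the frame-shift template above, the x operator binds tighter than the dot, so the template is "a" followed by repeated "a3" and the single leading base comes back as the first element; the sketch drops the leading bases with substr instead and returns only complete triplets.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the complete triplets of $seq,
# starting $frame positions (0, 1 or 2) from the left.
sub codons {
    my ($seq, $frame) = @_;
    $frame = 0 unless defined $frame;
    my $rest = substr($seq, $frame);      # drop the bases before the frame
    my $n    = int(length($rest) / 3);    # number of complete triplets
    return unpack('a3' x $n, $rest);
}

my $line = 'ATGGCCTAA';   # made-up sequence of three codons
print "frame 0: @{[ codons($line) ]}\n";
print "frame 1: @{[ codons($line, 1) ]}\n";
```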



Unpack_codons.pl



Questions


Modify the file_bases_size_only.pl to count the number of bases for a file in EMBL format and one in NCBI format.

Using the FASTADNA1.txt: extract the sections of the descriptor line to appropriate scalar variables.

Assuming the DNA sequence of FastaDNA1file.txt is the complementary or anti-sense strand, print the mRNA produced when the primary strand (sequence) is transcribed.

Exam Questions


Perl is an important bioinformatics language. Explain the main features of Perl that make it appealing to the field of bioinformatics.

Write a script that extracts the gene ID and gene name from the descriptor header of a DNA FASTA file.

Write a Perl script that only reads and prints DNA sequences from a FASTA file.

Write a script that reads in the DNA sequences from two FASTA files, assumes the sequence length is the same for both, and determines the number of alignment matches to non-matches.


FastaDNA1file.txt


Write a script that reads in the DNA sequences from two FASTA files, assumes the sequence length is the same for both, and illustrates the number of alignment matches to non-matches.