Playing with DNA Sequences (and words) Using Regular Expressions


Stephen Sontum 2011


modified from “Perl for Exploring DNA,” MD LeBlanc and BD Dyer 2007


Introduction


One could argue that almost all DNA sequence analysis is based on pattern matching. Whether
you are comparing sequences to build phylogenetic trees or to determine their identities, you are
looking for similar patterns. Perl has a powerful syntax for pattern matching called “regular
expressions,” or “regex” for short. Many different programming languages use regexes; however,
they fit especially well within the syntax of Perl.


In the examples of this exercise, the regexes will be embedded within Perl statements. Don’t worry
that you do not yet know Perl well. At this point you will be using verbatim a Perl program that is
already written, and you will just focus on and play with the entertaining parts of these programs,
the regexes. Look at almost any other book on Perl and you will find regular expressions in the
later chapters because they have traditionally been considered an advanced topic. However, for
DNA sequence analysis, regexes may be exactly what you hoped for. You will need to install
Perl (http://www.activestate.com/) and a Perl IDE programming environment (http://www.vim.org/)
on your computer before you can play with these regex examples.


The particular datasets that you will explore are a list of all 7-mers (heptamers) composed of
As, Cs, Gs, and Ts from the entire genome of the bacterium Escherichia coli (E. coli) and an
English word dictionary. A list of words from an English dictionary and a list of 7-letter DNA
motifs from E. coli are not directly comparable at all. However, they are useful as introductory
datasets for learning about regex syntax and thinking about how pattern matching might be
applied to DNA sequences.


Regex syntax


Using regular expressions for the exploratory analysis of texts involves two steps:

(1) define a pattern to match in a text using regular expression syntax, e.g.,
    (a) to find all words ending in “omics”

    $regex = 'omics$';

(2) apply that regular expression to the text in question, e.g., to each of the words in a
    dictionary, poem, or story or to all putative regulatory motifs upstream of a particular
    gene.

    $nextWord =~ /$regex/;

This exercise is focused on the first step, the syntax of regular expressions that you can use to
define your pattern.
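
The two steps can be tried together in a few lines of Perl. The following is only a minimal sketch: the word list is made up for illustration and stands in for a real dictionary, and the variable names @words and @matches are assumptions, not part of the course programs.

```perl
use strict;
use warnings;

# Step 1: define the pattern. Single quotes keep the $ as a literal
# character in the Perl string, so the regex engine sees it as an anchor.
my $regex = 'omics$';

# Made-up word list standing in for a dictionary file (an assumption).
my @words = ('genomics', 'proteomics', 'comical', 'economics', 'genome');

# Step 2: apply the pattern to each word and keep the ones that match.
my @matches = grep { $_ =~ /$regex/ } @words;

print "$_\n" for @matches;   # prints genomics, proteomics, economics
```

Note that “comical” contains the letters “omic” but is not reported, because the $ anchor requires “omics” to sit at the very end of the word.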


The following is an example featuring a rich set of regular expression syntax. Do not feel daunted
by the syntax at this point; rather, try to get a sense of the power of finding words that contain
certain patterns. Assuming we are searching through a file of English words (e.g., a dictionary),
the regex below matches all words in the file that “start with ‘ge’ and end with either ‘ne’ or ‘me’.”

/^ge.*[nm]e$/


More explicitly, this pattern matches any word that starts (^) with the letters ‘ge’, followed by any
number of letters (.*), and the word ends ($) with either an ‘n’ or ‘m’ ([nm]) followed by the
letter ‘e’. This pattern would match many words, a partial list of which is shown below:

^ge.*[nm]e$

gelatine
gene
genome
genuine
germane
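
To see the pattern accept and reject words, one could run a short check like the sketch below. The five matching words are the ones listed in this handout; the three non-matching words (gender, agene, games) are made-up test cases added here for contrast.

```perl
use strict;
use warnings;

my $regex = '^ge.*[nm]e$';

# First five words: from the handout's list. Last three: made-up
# counterexamples that should NOT match.
my @words = qw(gelatine gene genome genuine germane gender agene games);

my (@hits, @misses);
for my $word (@words) {
    if ($word =~ /$regex/) { push @hits,   $word }
    else                   { push @misses, $word }
}

print "match:    @hits\n";     # gelatine gene genome genuine germane
print "no match: @misses\n";   # gender agene games
```

“gender” fails the ending ([nm]e$), while “agene” and “games” fail the beginning (^ge).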

Again, do not feel daunted by the syntax at this point. The point of this exercise is to introduce
the pattern matching syntax of regular expressions in a step-by-step fashion with lots of
opportunities for you to practice.

Here are a few more syntax examples.




Regular Expressions

/END/       matches the string ‘END’ anywhere within the line
/^END/      matches END at the beginning of the line
/TAG$/      matches TAG at the end of the line
/[AG]T/     matches AT or GT
/[a-z]/     matches any lowercase letter
/[^a-z]/    matches any character except the lowercase letters a-z



Predefined character classes

\d    digit              same as [0-9]
\D    non-digit          same as [^0-9]
\s    white space        same as [ \t\n\r\f]
\S    non-white space    same as [^ \t\n\r\f]
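
As a small illustration of the predefined classes, the sketch below pulls digits, non-white-space chunks, and white-space characters out of one line. The sample line and the variable names are made up for this example.

```perl
use strict;
use warnings;

# Made-up sample line mixing letters, digits, a space, and a tab.
my $line = "ecoli K12\tstrain 7";

my @digits = $line =~ /\d/g;          # every single digit
my @chunks = $line =~ /\S+/g;         # runs of non-white-space characters
my $spaces = () = $line =~ /\s/g;     # count of white-space characters

print "digits: @digits\n";                     # digits: 1 2 7
print "chunks: @chunks\n";                     # chunks: ecoli K12 strain 7
print "white-space characters: $spaces\n";     # 3 (two spaces and one tab)
```

The /g modifier, which is not covered in the table above, asks Perl to return every match in the line rather than only the first.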




Quantifiers

{n}      exactly n times           /a{5}/      aaaaa
{n,m}    between n and m times     /a{2,3}/    aa, aaa
*        0 or more times           /ab*c/      ac, abc, abbc, …
+        1 or more times           /ab+c/      abc, abbc, …
?        0 or 1 time               /ab?c/      ac, abc
.        match any 1 character     /a.c/       abc, aac, …
|        alternate patterns        /ac|ca/     acb, cab, …
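
The rows of the quantifier table can be checked against a handful of candidate strings. This is a sketch under stated assumptions: the candidate list and the helper name matching are invented here, and each pattern is wrapped in ^(?:...)$ so that the whole string must match rather than just a part of it.

```perl
use strict;
use warnings;

# Made-up candidate strings to test each pattern against.
my @candidates = qw(ac abc abbc aac aa aaa aaaa);

# Hypothetical helper: return the candidates that the pattern matches
# in full. (?:...) groups the pattern without capturing anything.
sub matching {
    my ($pattern) = @_;
    return grep { /^(?:$pattern)$/ } @candidates;
}

for my $pattern ('ab*c', 'ab+c', 'ab?c', 'a.c', 'a{2,3}') {
    printf "%-8s %s\n", $pattern, join(', ', matching($pattern));
}
# ab*c     ac, abc, abbc
# ab+c     abc, abbc
# ab?c     ac, abc
# a.c      abc, aac
# a{2,3}   aa, aaa
```

Without the ^ and $ anchors, /a{2,3}/ would also report “aaaa”, because a match anywhere within a string counts as a match.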



Capture

(.)     match any character and remember it
(.*)    match any character 0 or more times and remember what was matched
\1      recall the first captured (parenthesized) group
\2      recall the second captured group
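
Capture and recall are easiest to understand by example. In the sketch below, the word and 7-mer lists are made up for illustration; they are not taken from the course data files.

```perl
use strict;
use warnings;

# Made-up examples: three English words and three 7-mers.
my @words = qw(letter gene book GATTACA ACGTACG AACCGGT);

# (.)\1 : any character followed immediately by the same character.
my @doubled = grep { /(.)\1/ } @words;

# (..).*\1 : any two characters that occur again later in the string.
my @repeated = grep { /(..).*\1/ } @words;

print "doubled letter: @doubled\n";    # letter book GATTACA AACCGGT
print "repeated pair:  @repeated\n";   # ACGTACG
```

In ACGTACG the pair AC is captured at the start, .* skips over GT, and \1 then demands the same two letters again.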



Escape

/\^/    match a metacharacter literally (here, ^)
\n      newline
\r      carriage return



Setting up regex.pl





(1) Download the zipped file (regexfiles.zip) and extract it onto your Desktop, or move the folder
to the Desktop from our class folder.

(2) You should see a new directory, regexfiles, on your Desktop. Open the directory. You
should see the following five files:

regex.pl           Perl program to use regex to find patterns (use first)
anagram.pl         Perl program to use regex to find anagrams (used later)
english.txt        English dictionary (list of words, one per line)
big_english.txt    much larger English dictionary
ecoli.txt          7-mers found in the E. coli genome

(3) Right-click “regex.pl” and edit it with Vim.


(4) Each time you would like to try a new regex or use a different dictionary, you will only
need to modify the following two lines in the program.

To use a new regex, change the pattern between the single quotes. Note: you must keep the
single quotes!

my $regex = 'q[^u]';

By the way, this particular regex means: “match all words that contain a ‘q’ that is
immediately followed by some character other than the letter ‘u’.”

To change the library to search, put a comment symbol (#) in front of the lines
that you do not want to use and remove the comment symbol (#) from the one you want to use. In
this example “big_english.txt” will be searched.

my $filename = "big_english.txt";
# my $filename = "ecoli.txt";
# my $filename = "english.txt";

(5) Once you have entered the regex and selected the file to search, RUN the Perl program
using the Perl menu in Vim (Perl/Run/update, run script).


[Figure: regex.pl, with the callouts “Change the regex between single quotes” and “Pick a file to search by moving the # (comment)”.]



(6) To continue, follow the prompt “Hit any key to close this window.” You can cut and paste
from the command window by right-clicking the blue bar on top of the window and selecting
Edit/Select All, followed by pressing the Enter key. This copies the content of the window to the
clipboard.

(7) If you make a syntax error, the Perl program will complain. Perl/Run/update and check
syntax will highlight the source of the error.
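
The source of regex.pl is not reproduced in this handout. Purely as an illustration of what a program of this kind might do, here is a minimal sketch, assuming it reads one word per line, tests each word against $regex without regard to case, and prints the matches with a count. The in-memory word list, the variable @found, and the output format are all assumptions made so that the sketch runs without the course files.

```perl
use strict;
use warnings;

# The line you would edit to try a new pattern (name follows the handout).
my $regex = 'q[^u]';

# ASSUMPTION: a tiny made-up word list replaces the real dictionary file.
# To read a real file instead, pass its name (e.g. $filename) to open
# in place of \$dictionary.
my $dictionary = "quick\nqat\nIraqi\nqueen\ntranquil\nFaqir\nIraq\n";

open(my $fh, '<', \$dictionary) or die "Cannot open word list: $!";

my @found;
while (my $word = <$fh>) {
    chomp $word;                                # drop the trailing newline
    push @found, $word if $word =~ /$regex/i;   # /i ignores case
}
close $fh;

print "$_\n" for @found;                   # qat, Iraqi, Faqir
print scalar(@found), " matches\n";        # 3 matches
```

In this sketch the made-up entry “Iraq” is not reported: after chomp nothing follows the final ‘q’, and [^u] still requires one character to be present.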


Basic rules of regular expression behavior


Before we begin, a simple description of the basic rules of regular expressions is in order. As the
exercise proceeds, we gently introduce you to the various types of regex behavior and encourage
your hands-on practice with each new set of features.

1. Most characters match themselves, e.g., a G matches a G, a T matches a T.

2. A match anywhere within a string is a match. That is, unless explicitly requested to do so, a
regex need not match an entire word or line.

3. Anchors (^ and $) can restrain where a match takes place, e.g., $ means at the end of a word.

4. Some symbols (e.g., ^) have multiple meanings depending on the context of their use within
the regex.

5. Quantifiers (such as + and *) modify the previous character in the regular expression.

6. The match is case sensitive, i.e., lowercase c does not match uppercase C. Note however that
our program regex.pl in this chapter is case insensitive.
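
Several of these rules can be observed directly. The following sketch uses made-up DNA strings and variable names; each test stores 1 for a match and 0 for no match.

```perl
use strict;
use warnings;

# Rule 2: a match anywhere within the string counts.
my $anywhere = 'GATTACA' =~ /TTA/ ? 1 : 0;            # 1

# Rule 3: anchors restrict where the match may occur.
my $at_start = 'GATTACA' =~ /^TTA/ ? 1 : 0;           # 0
my $at_end   = 'GATTACA' =~ /ACA$/ ? 1 : 0;           # 1

# Rule 4: inside brackets, ^ negates the class instead of anchoring.
my $non_acgt = 'GATNACA' =~ /[^ACGT]/ ? 1 : 0;        # 1 (the N)

# Rule 5: + applies to the character just before it (only the T).
my $run_of_t = 'GATTTACA' =~ /AT+A/ ? 1 : 0;          # 1

# Rule 6: case matters unless the /i modifier is added.
my $case_sensitive   = 'gattaca' =~ /TTA/  ? 1 : 0;   # 0
my $case_insensitive = 'gattaca' =~ /TTA/i ? 1 : 0;   # 1

print "anywhere=$anywhere start=$at_start end=$at_end\n";
print "non-ACGT=$non_acgt run-of-T=$run_of_t\n";
print "case-sensitive=$case_sensitive case-insensitive=$case_insensitive\n";
```

The /i modifier in the last line is one way a program can be made case insensitive; whether regex.pl does it this way is an assumption.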



Exercises:


For the following questions give your regex statement, the number of answers, and an
example of an answer.


Word Play

Note: If you ever listen to National Public Radio (NPR) on Sunday mornings (Weekend
Edition), you’ve likely heard Will Shortz’s seven-minute feature where he (the “Puzzle
Master”) challenges listeners with word puzzles.


1. Are there any words in the English dictionary where the letter ‘q’ is not immediately
followed by the letter ‘u’? Try this in the big and small English dictionary files.

2. From Shortz (1996), #44 “On the Money.” The currency name MARK is hidden in the
word REMARKABLE, and RAND is hidden in GRANDMA. What common words does
DINAR appear in?

3. From Shortz (1996), #162 “Salty Language.” There are common words in the English
language that contain the consecutive letters N-A-C-L. Name any three of them.

4. From Shortz (2003), #124 “D-Plus.” If you were asked to name a familiar word that
contains two S’s, followed by another letter, and then two more S’s, you might say
ASSESSOR. What common word contains two D’s, followed by another letter, and then
two more D’s?

5. From Shortz (1996), #38 “Vowel Play.” (slightly modified) There’s a well-known
old puzzle to name words that contain the vowels A, E, I, O, and U in order. What are they?


“Scrabble” in DNA Land


1. All 7-mers in E. coli that contain GTGAC.

2. Are there any 7-mers in E. coli where a sequence of three nucleotides is repeated
elsewhere in the motif? Hint: use (.)

3. Are there any 7-mers in E. coli that are seven-letter palindromes, where the first three
letters at the beginning of the motif are the same as the reverse of those letters at the end of
the motif?

4. Are there any 7-mers in E. coli which have the hexameric sequence AT ATorTA AT