Playing with DNA Sequences (and words) Using Regular Expressions
Stephen Sontum 2011
modified from “Perl for Exploring DNA” MD LeBlanc and BD Dyer 2007
Introduction
One could argue that almost all DNA sequence analysis is based on pattern matching. Whether
you are comparing sequences to build phylogenetic trees or to determine their identities, you are
looking for similar patterns. Perl has a powerful syntax for pattern matching called “regular
expressions,” or “regex” for short. Many different programming languages use regexes; however,
they fit especially well within the syntax of Perl.
In the examples of this exercise, the regexes will be embedded within Perl statements. Don’t worry
that you do not yet know Perl well. At this point you will be using verbatim a Perl program that is
already written, and you will just focus on and play with the entertaining parts of these programs,
the regexes. Look at almost any other book on Perl and you will find regular expressions in the
later chapters because they have traditionally been considered advanced topics. However, for
DNA sequence analysis, regexes may be exactly what you hoped for. You will need to install
Perl (http://www.activestate.com/) and a Perl programming environment, the Vim editor
(http://www.vim.org/), on your computer before you can play with these regex examples.
The particular datasets that you will explore are a list of all possible 7-mers (heptamers)
composed of A’s, C’s, G’s, and T’s from the entire genome of the bacterium Escherichia coli
(E. coli) and an English word dictionary. A list of words from an English dictionary and a list
of 7-letter DNA motifs from E. coli are not directly comparable at all. However, they are useful
as introductory datasets for learning about regex syntax and thinking about how pattern matching
might be applied to DNA sequences.
Regex syntax
Using regular expressions for the exploratory analysis of texts involves two steps:
(1) define a pattern to match in a text using regular expression syntax, e.g.,
(a) to find all words ending in “omics”
    $regex = 'omics$';
(2) apply that regular expression to the text in question, e.g., to each of the words in a
dictionary, poem, or story or to all putative regulatory motifs upstream of a particular
gene.
    $nextWord =~ /$regex/;
This exercise is focused on the first step, the syntax of regular expressions that you can use to
define your pattern.
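The two steps can be tried together in a few lines of Perl. The following is only a minimal sketch: the word list is a made-up stand-in for a dictionary file, and the variable names @words and @matches are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# Step 1: define the pattern. Single quotes keep the $ as a literal
# character in the string, so Perl does not try to interpolate a variable.
my $regex = 'omics$';

# Hypothetical sample words standing in for a dictionary file.
my @words = qw(genomics proteomics economics comical genome);

# Step 2: apply the pattern to each word with the =~ operator.
my @matches = grep { $_ =~ /$regex/ } @words;

print "$_\n" for @matches;
```

Only the words that end in “omics” are printed; “comical” contains the letters “omic” but is rejected because the $ anchor requires “omics” at the end of the word.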
The following is an example featuring a rich set of regular expression syntax. Do not feel daunted
by the syntax at this point; rather, try to get a sense of the power of finding words that contain
certain patterns. Assuming we are searching through a file of English words (e.g., a dictionary),
the regex below matches all words in the file that
“start with ‘ge’ and end with either ‘ne’ or ‘me’.”
/^ge.*[nm]e$/
More explicitly, this pattern matches any word that starts (^) with the letters ‘ge’, followed by any
number of characters (.*), and ends ($) with either an ‘n’ or an ‘m’ ([nm]) followed by the
letter ‘e’. This pattern matches many words, a partial list of which is shown below:
^ge.*[nm]e$
gelatine
gene
genome
genuine
germane
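To see this pattern at work, here is a short hedged sketch that tests it against a handful of words. The first five words come from the list above; the last three (gem, general, hygiene) are added here purely for illustration as words that should be rejected.

```perl
use strict;
use warnings;

my $regex = '^ge.*[nm]e$';

# The first five words are from the list above; the rest are
# illustrative words chosen because they should NOT match.
my @candidates = qw(gelatine gene genome genuine germane gem general hygiene);

for my $word (@candidates) {
    my $verdict = $word =~ /$regex/ ? 'match' : 'no match';
    printf "%-10s %s\n", $word, $verdict;
}

# Keep the matching words in a list as well.
my @hits = grep { /$regex/ } @candidates;
```

“gem” is too short to hold both ‘ge’ and a final ‘ne’ or ‘me’, “general” has the wrong ending, and “hygiene” has the right ending but does not start with ‘ge’.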
Again, do not feel daunted by the syntax at this point. The point of this exercise is to introduce
the pattern matching syntax of regular expressions in a step-by-step fashion with lots of
opportunities for you to practice. Here are a few more syntax examples.
• Regular Expressions
  – /END/      matches the string ‘END’ anywhere within the line
  – /^END/     matches a beginning END
  – /TAG$/     matches an ending TAG
  – /[AG]T/    matches AT or GT
  – /[a-z]/    matches any lowercase letter
  – /[^a-z]/   matches any character except a lowercase letter a-z
• Predefined character classes
  – \d    digit             same as [0-9]
  – \D    non-digit         same as [^0-9]
  – \s    white space       same as [ \t\n\r\f]
  – \S    non white space   same as [^ \t\n\r\f]
• Quantifiers
  – {n}     exactly n times                          /a{5}/     aaaaa
  – {n,m}   between n and m times                    /a{2,3}/   aa, aaa
  – *       0 or more times                          /ab*c/     ac, abc, abbc, …
  – +       1 or more times                          /ab+c/     abc, abbc, …
  – ?       0 or 1 time                              /ab?c/     ac, abc
  – .       match any 1 character (except newline)   /a.c/      abc, aac, …
  – |       alternate patterns                       /ac|ca/    acb, cab, …
• Capture
  – (.)     match any character and remember it
  – (.*)    match any character 0 or more times and remember the match
  – \1      recall the first captured (parenthesized) group
  – \2      recall the second captured group
• Escape
  – /\^/    matches a literal metacharacter (here, ^)
  – \n      newline
  – \r      carriage return
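The syntax summarized above can be explored with a small table-driven sketch. The test strings are made up for illustration, and the names @tests and %matched are assumptions, not part of the course files.

```perl
use strict;
use warnings;

# Each entry pairs a pattern with a hypothetical test string.
my @tests = (
    [ 'a{2,3}',     'caab'   ],  # quantifier range: "aa" is present
    [ 'ab*c',       'ac'     ],  # * allows zero b's
    [ 'ab+c',       'ac'     ],  # + needs at least one b, so no match
    [ '^\d+$',      '2011'   ],  # the whole string is digits
    [ '[AG]T',      'CCGTA'  ],  # character class: GT is present
    [ '(.)\1',      'letter' ],  # backreference: a doubled character (tt)
    [ '(.)(.)\2\1', 'noon'   ],  # two captures recalled in reverse order
    [ '\^',         'a^b'    ],  # escaped metacharacter: a literal ^
);

my %matched;
for my $t (@tests) {
    my ($pattern, $string) = @$t;
    $matched{$pattern} = $string =~ /$pattern/ ? 1 : 0;
    my $result = $matched{$pattern} ? 'matches' : 'does not match';
    printf "%-14s %-8s %s\n", "/$pattern/", $string, $result;
}
```

The patterns are written in single quotes so that backslashes such as \d and \1 reach the regex engine unchanged. Changing a test string (for example, 'ac' to 'abc' for /ab+c/) is a quick way to check your understanding of a quantifier.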
Setting up regex.pl
(1) Download the zipped file (regexfiles.zip) and extract it onto your Desktop, or move the
folder to the Desktop from our class folder.
(2) You should see a new directory, regexfiles, on your Desktop. Open the directory. You
should see the following five files:
regex.pl          Perl program to use regex to find patterns (use first)
anagram.pl        Perl program to use regex to find anagrams (used later)
english.txt       English dictionary (list of words, one per line)
big_english.txt   much larger English dictionary
ecoli.txt         7-mers found in the E. coli genome
(3) Right-click “regex.pl” and edit it with Vim.
(4) Each time you would like to try a new regex or use a different dictionary, you will only
need to modify the following two lines in the program.
To use a new regex, change the pattern between the single quotes. Note: you must keep the
single quotes!
my $regex = 'q[^u]';
By the way, this particular regex means: “match all words that contain a ‘q’ immediately
followed by some character other than the letter ‘u’.” Note that a word ending in ‘q’ will not
match, because [^u] must match a character.
To change the library to search, put a comment symbol (#) in front of each line
that you do not want to use and remove the comment symbol (#) from the one you want to use. In
this example “big_english.txt” will be searched.
my $filename = "big_english.txt";
# my $filename = "ecoli.txt";
# my $filename = "english.txt";
(5) Once you have entered the regex and selected the file to search, run the Perl program
using the Perl menu in Vim (Perl/Run/update, run script).
[Screenshot callouts: “Change the regex between single quotes”; “Pick a file to search by
moving the # (comment)”]
(6) When the program finishes, the command window displays “Hit any key to close this window”;
press any key to continue. Before closing the window, you can copy its output by right-clicking
the blue bar on top of the window, selecting Edit/Select All, and then pressing the Enter key.
This copies the content of the window to the clipboard.
(7) If you make a syntax error, the Perl program will complain. Choosing Perl/Run/update and
check syntax will highlight the source of the error.
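The source of regex.pl is not reproduced in this handout, so the following is only a hedged sketch of what a program of this kind might do: read a word list one line at a time, test each word against the regex (case-insensitively, as regex.pl is described as doing), and report the matches and their count. The in-memory word list and the names $dictionary and @found are assumptions made so the sketch runs without any files.

```perl
use strict;
use warnings;

my $regex = 'q[^u]';

# Hypothetical stand-in for a dictionary file: one word per line.
my $dictionary = "queen\nqat\nQatar\nquiz\nIraq\nqintar\n";

# Opening a filehandle on a string reference lets the sketch run without a
# file; with a real file you would open $filename for reading instead.
open(my $fh, '<', \$dictionary) or die "Cannot open dictionary: $!";

my @found;
while (my $word = <$fh>) {
    chomp $word;                                # drop the trailing newline
    push @found, $word if $word =~ /$regex/i;   # /i ignores case
}
close $fh;

print "$_\n" for @found;
print scalar(@found), " matches\n";
```

With this sample list the sketch reports qat, Qatar, and qintar. “Iraq” is not reported even though its ‘q’ is not followed by ‘u’, because [^u] needs some character after the ‘q’ to match.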
Basic rules of regular expression behavior
Before we begin, a simple description of the basic rules of regular expressions is in order. As the
exercise proceeds, we gently introduce you to the various types of regex behavior and encourage
your hands-on practice with each new set of features.
1. Most characters match themselves, e.g., a G matches a G, a T matches a T.
2. A match anywhere within a string is a match. That is, unless explicitly requested to do so, a
regex need not match an entire word or line.
3. Anchors (^ and $) can restrain where a match takes place, e.g., $ means at the end of the
string being searched (here, a word).
4. Some symbols (e.g., ^) have multiple meanings depending on the context of their use within
the regex.
5. Quantifiers (such as + and *) modify the previous character (or the previous bracketed or
parenthesized group) in the regular expression.
6. The match is case sensitive, i.e., lowercase c does not match uppercase C. Note, however, that
our program regex.pl in this exercise is case insensitive.
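Several of these rules can be checked directly with one-line tests. This is a minimal sketch using made-up DNA-like strings; the variable names are chosen here for illustration only.

```perl
use strict;
use warnings;

# Rule 2: a match anywhere in the string counts.
my $anywhere = 'GATTACA' =~ /TTA/ ? 1 : 0;

# Rule 3: anchors restrict where the match may occur.
my $at_start = 'GATTACA' =~ /^TTA/ ? 1 : 0;   # TTA is not at the start
my $at_end   = 'GATTACA' =~ /ACA$/ ? 1 : 0;   # ACA is at the end

# Rule 4: ^ is an anchor outside brackets but means "not" inside them.
my $negated = 'GATTACA' =~ /^[^CT]/ ? 1 : 0;  # starts with neither C nor T

# Rule 5: + applies only to the T that precedes it.
my $plus = 'GATTTACA' =~ /AT+A/ ? 1 : 0;

# Rule 6: matching is case sensitive unless the /i modifier is added.
my $case_sensitive   = 'gattaca' =~ /TTA/  ? 1 : 0;
my $case_insensitive = 'gattaca' =~ /TTA/i ? 1 : 0;

print "anywhere=$anywhere at_start=$at_start at_end=$at_end negated=$negated\n";
print "plus=$plus case_sensitive=$case_sensitive case_insensitive=$case_insensitive\n";
```

A value of 1 means the pattern matched and 0 means it did not.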
Exercises:
For the following questions give your regex statement, the number of answers, and an
example of an answer.
Word Play
Note: If you ever listen to National Public Radio (NPR) on Sunday mornings (Weekend
Edition), you’ve likely heard Will Shortz’s seven-minute feature where he (the “Puzzle
Master”) challenges listeners with word puzzles.
1. Are there any words in the English dictionary where the letter ‘q’ is not immediately
followed by the letter ‘u’? Try this in the big and small English dictionary files.
2. From Shortz (1996), #44 “On the Money.” The currency name MARK is hidden in the
word REMARKABLE, and RAND is hidden in GRANDMA. What common words does
DINAR appear in?
3. From Shortz (1996), #162 “Salty Language.” There are common words in the English
language that contain the consecutive letters N-A-C-L. Name any three of them.
4. From Shortz (2003), #124 “D-Plus.” If you were asked to name a familiar word that
contains two S’s, followed by another letter, and then two more S’s, you might say
ASSESSOR. What common word contains two D’s, followed by another letter, and then
two more D’s?
5. From Shortz (1996), #38 “Vowel Play” (slightly modified). There’s a well-known
old puzzle to name words that contain the vowels A, E, I, O, and U in order. What are they?
“Scrabble” in DNA Land
1. Find all 7-mers in E. coli that contain GTGAC.
2. Are there any 7-mers in E. coli where a sequence of three nucleotides is repeated
elsewhere in the motif? Hint: use (.)
3. Are there any 7-mers in E. coli that are seven-letter palindromes, where the first three
letters at the beginning of the motif are the same as the reverse of the three letters at the
end of the motif?
4. Are there any 7-mers in E. coli which have the hexameric sequence AT, followed by AT or
TA, followed by AT?