Bioinformatics Functions: Categorical List

powerfultennesseeΒιοτεχνολογία

2 Οκτ 2013 (πριν από 3 χρόνια και 10 μήνες)

77 εμφανίσεις

COMP 5115 Programming Tools in Bioinformatics

Week 6

Elements of an Expression

Character Classes
:
Character classes represent either a
specific set of characters (e.g., uppercase) or a certain type of character (e.g.,
non
-
white
-
space).



Operator

Usage

.

Any single character, including white space

[c
1
c
2
c
3
]

Any character contained within the brackets: c
1

or c
2

or c
3

[^c
1
c
2
c
3
]

Any character not contained within the brackets: anything but c
1

or c
2

or c
3

[c
1
-
c
2
]

Any character in the range of c
1

through c
2

\
s

Any white
-
space character; equivalent to [
\
f
\
n
\
r
\
t
\
v]

\
S

Any non
-
white
-
space character; equivalent to
[^
\
f
\
n
\
r
\
t
\
v]

\
w

Any alphabetic, numeric, or underscore character; equivalent to [a
-
zA
-
Z_0
-
9]

\
W

Any character that is not alphabetic, numeric, or underscore; equivalent to
[^a
-
zA
-
Z_0
-
9]

\
d

Any numeric digit; equivalent to [0
-
9]

\
D

Any non
-
digit character; equivalent to [^0
-
9]

Any Character
.


Use '
..ain
' in an expression to match a sequence of five
characters ending in '
ain
'. (Note that
.

matches white
-
space
characters as well):


str = 'The rain in Spain falls mainly on the plain.';


regexp(str, '..ain')



ans = 4 13 24 39


Matches ' rain', 'Spain', ' main', and 'plain'.


Returning Strings Rather than Indices:

Here is the same
example, this time specifying the command qualifier 'match'.
In this case, regexp returns the
text

of the matching strings
rather than the starting index:


regexp(str, '..ain', 'match')



ans = ' rain' 'Spain' ' main' 'plain'

Examples

Character Classes


Selected Characters
[c1c2c3
]


Use [c1c2c3] in an expression to match selected characters r, p, or m
followed by 'ain'. Specify two qualifiers this time, 'match' and 'start', along
with an output argument for each, mat and idx. This returns the matching
strings and the starting indices of those strings:


[mat idx] = regexp(str, '[rpm]ain', 'match', 'start')



mat =




'rain' 'pain' 'main'




idx =




5 14 25


Range of Characters
[c1
-

c2]


Use [c1
-
c2] in an expression to find words that begin with a letter in the
range of A through Z:


[mat idx] = regexp(str, '[A
-
Z]
\
w*', 'match', 'start')



mat =




'The' 'Spain'




idx =




1 13

Examples

Character Classes


Word and White
-
Space Characters
\
w,
\
s

Use
\
w and
\
s in an expression to find words that end with the letter n
followed by a white
-
space character. Add a new qualifier, 'end', to return
the str index that marks the end of each match:


[mat ix1 ix2] = regexp(str, '
\
w*n
\
s', 'match', 'start', 'end')


mat =



'rain ' 'in ' 'Spain ' 'on '


ix1 =



5 10 13 32


ix2 =



9 12 18 34


Numeric Digits
\
d

Use
\
d to find numeric digits in the following string:


numstr = 'Easy as 1, 2, 3';


[mat idx] = regexp(numstr, '
\
d', 'match', 'start')


mat =



'1' '2' '3'


idx =



9 12 15

Examples

Character Classes

Character Representation:
The following character
combinations represent specific character and numeric values:


Operator

Usage

\
a

Alarm (beep)

\
b

Backspace

\
e

Escape

\
f

Form feed

\
n

New line

\
r

Carriage return

\
t

Horizontal tab

\
v

Vertical tab

\
oN o
r
\
o{N}

Character of octal value
N

\
xN

or
\
x{N}

Character of hexadecimal value
N

\
char

If a character has special meaning in a regular
expression, precede it with backslash (
\
) to
match it literally.


Octal and Hexadecimal
\
o,
\
x

Use
\
x and
\
o in an expression to find a comma (hex 2C) followed by a space
(octal 40) followed by the character 2:


numstr = 'Easy as 1, 2, 3';


[mat idx] = regexp(numstr, '
\
x2C
\
o{40}2', 'match', 'start')


mat =



', 2'


idx =



10


Special Characters
\
char

Use
\

before a character that has a special meaning to the regular expression
functions if you want that character to be interpreted literally. The intention in this
example is to have the string '(ab[XY|Z]c)' interpreted literally. The first
expression does not do that because
regexp

interprets the parentheses and
|

sign as the special characters for grouping and logical OR:


regexp('(ab[XY|Z]c)', '(ab[XY|Z]c)', 'match')


ans =



'ab[XY' 'Z]c'

This next expression uses a
\

before any special characters. As a result the
entire string is matched:


regexp('(ab[XY|Z]c)', '
\
(ab
\
[XY
\
|Z
\
]c
\
)', 'match')


ans =



'(ab[XY|Z]c)'

Examples

Character Representation

Logical Operators
:
Logical operators do not match any
specific characters. They are used to specify the context for matching an
accompanying regular expression.

Operator

Usage

(expr)

Group regular expressions and capture tokens.

(?:expr)

Group regular expressions, but do not capture tokens.

(?>expr)

Group atomically.

(?#expr)

Insert a comment into the expression. Comments are
ignored in matching.

expr
1
|expr
2

Match expression expr
1

or expression expr
2
.

^expr

Match the expression only at the beginning of the
string.

expr$

Match the expression only at the end of the string.

\
<expr

Match the characters when they start a word.

expr
\
>

Match the characters when they end a word.

\
<expr
\
>

Match an exact word.


Grouping and Capture

(expr)
: elements can be grouped together using either (expr) to
group and capture (Week
-
5 lecture:
Using Tokens
--

Example 1
) or (?:expr) for grouping
alone (see below)


Grouping Only
(?:expr)

(1)

Use (?:expr) to group a consonant followed by a vowel in the palindrome pstr. Specify at
least two consecutive occurrences ({2,}) of this group. Return the starting and ending indices
of the matched substrings:


pstr = 'Marge lets Norah see Sharon''s telegram';


expr = '(?:[^aeiou][aeiou]){2,}';


[mat ix1 ix2] = regexp(pstr, expr, 'match', 'start', 'end')


mat =



'Nora' 'haro' 'tele'


ix1 =



12 23 31


ix2 =



15 26 34

(2)

Remove the grouping, and the {2,} now applies only to [aeiou]. The command is entirely
different now as it looks for a consonant followed by at least two consecutive vowels:


expr = '[^aeiou][aeiou]{2,}';


[mat ix1 ix2] = regexp(pstr, expr, 'match', 'start', 'end')


mat =



'see'


ix1 =



18


ix2 =



20

Examples

Logical Operators


Including Comments
(?#expr)

Use (?#expr) to add a comment to this expression that matches
capitalized words in pstr. Comments are ignored in the process of finding
a match:


regexp(pstr, '(?# Match words in caps)[A
-
Z]
\
w+', 'match')


ans =



'Marge' 'Norah' 'Sharon'



Alternative Match
expr1|expr2

Use p1|p2 to pick out words in the string that start with let or tel:


regexpi(pstr, '(let|tel)
\
w+', 'match')


ans =



'lets' 'telegram'



(!) Note that using the '|' operator within square brackets in MATLAB
regular expressions. Because the '|' operator has a higher precedence
than '[ ]', MATLAB interprets the expression '[A|B]' not as 'A' or 'B', but as
'[A' or 'B]'. The recommended way to match "the letter A or the letter B" in
a MATLAB regexp expression is to use '[AB]'.

Examples

Logical Operators


Start and End of String Match
^expr, expr$

Use ^expr to match words starting with the letter m or M only when it
begins the string, and expr$ to match words ending with m or M only
when it ends the string:


regexpi(pstr, '^m
\
w*|
\
w*m$', 'match')


ans =



'Marge' 'telegram'



Start and End of Word Match
\
<expr, expr
\
>

Use
\
<expr to match any words starting with n or N, or ending with e or E:


regexpi(pstr, '
\
<n
\
w*|
\
w*e
\
>', 'match')


ans =



'Marge' 'Norah' 'see'



Exact Word Match
\
<expr
\
>

Use
\
<expr
\
> to match a word starting with an n or N and ending with an h
or H:


regexpi(pstr, '
\
<n
\
w*h
\
>', 'match')


ans =



'Norah'

Examples

Logical Operators

Lookaround Operators


Lookaround operators have two components: a
match pattern

and a
test pattern
. If
you call the match pattern p1 and the test pattern p2, then the simplest form of
lookaround operator looks like this:


p1(?=p2)



The match pattern p1 is just like any other element in an expression. For example, it
can be '
\
<[A
-
Za
-
z]+
\
>' to make
regexp

find any word.


The test pattern p2 places a condition on this match. There can be a match for p1
only if there is also a match for p2, and the p2 match must immediately precede (for
lookbehind

operators) or follow (for
lookahead

operators) the match for p1.


In the following expression, the match pattern is '
\
<[A
-
Za
-
z]+
\
>' and the test pattern is
'
\
S'. The entire expression can be read as


"Find those words that are followed by a non
-
white
-
space character":


'
\
<[A
-
Za
-
z]+
\
>(?=
\
S)'



When used on the following string, this lookahead expression matches the letters of
the words
Raven

and
Nevermore
:

str = 'Quoth the Raven, "Nevermore"';

regexp(str, '
\
<[A
-
Za
-
z]+
\
>(?=
\
S)', 'match')

ans =

'Raven' 'Nevermore'


One important characteristic of lookaround operators is how they affect the parsing of
the input string. The parser can be said to "consume" pieces of the string as it looks
for matching phrases. With lookaround operators, only the match pattern p1 affects
the current parsing location. Finding a match for the test pattern p2 does not move
the parser location.

Four lookaround expressions
:


1. lookahead


2. negative lookahead


3. lookbehind


4. negative lookbehind

Operator

Usage

expr
1
(?=expr
2
)

Match expression
expr
1

if followed by
expression
expr
2
.

expr
1
(?!expr
2
)

Match expression
expr
1

if not followed by
expression
expr
2
.

(?<=expr
1
)expr
2

Match expression
expr
2

if preceded by
expression
expr
1
.

(?<!expr
1
)expr
2

Match expression
expr
2

if not preceded by
expression
expr
1
.



Lookahead
expr1(?=expr2)

Use p1(?=p2) to find all words of this string that precede a comma:


poestr = ['While I nodded, nearly napping, ' ... 'suddenly there came a tapping,'];


[mat idx] = regexp(poestr, '
\
w*(?=,)', 'match', 'start')


mat =



'nodded' 'napping' 'tapping'


idx =



9 24 55



Negative Lookahead
expr1(?!expr2)

Use p1(?!p2) to find all words that do not precede a comma:


[mat idx] = regexp(poestr, '
\
w+(?!
\
w*,)', 'match', 'start')


mat =



'While' 'I' 'nearly' 'suddenly' 'there' 'came' 'a'


idx =



1 7 17 33 42 48 53




Lookbehind
(?<=expr1)expr2

Use (?<=p1)p2 to find all words that follow a comma and zero or more spaces:


[mat idx] = regexp(poestr, '(?<=,
\
s*)
\
w*', 'match', 'start')


mat =



'nearly' 'suddenly'


idx =



17 33

Examples

Lookaround Operators



Negative Lookbehind
(?<!expr1)expr2

Use (?<!p1)p2 to find all words that do not follow a comma and zero or more
spaces:


[mat idx] = regexp(poestr, '(?<!,
\
s*
\
w*)
\
w*', 'match', 'start')


mat =



'While' 'I' 'nodded' 'napping' 'there' 'came' 'a' 'tapping'


idx =



1 7 9 24 42 48 53 55




Using Lookaround as a Logical Operator

You can use lookaround operators to perform a logical AND. The expression used
here finds all words that contain a sequence of two letters under the condition that
the two letters are identical
and

are in the range a through m. (The expression
'(?=[a
-
m])' is a lookahead test for the range a through m, and the expression '(.)
\
1'
tests for identical characters using a
token
: see Week
-
5 lecture notes):


[mat idx] = regexp(poestr, '
\
<
\
w*(?=[a
-
m])(.)
\
1
\
w*
\
>', ... 'match', 'start')


mat =



'nodded' 'suddenly'


idx =



9 33

(!) Note that when using a lookahead operator to perform an AND, you need to
place the match expression expr1
after

the test expression expr2:


(?=expr2)expr1

or

(?!expr2)expr1

Examples

Lookaround Operators

Quantifiers


Quantifiers can be used to specify how many
instances of an element are to be matched.


When used alone, they match as much of the string
as possible. Thus these are sometimes called
greedy

quantifiers.


When one of these quantifiers is followed by a plus
sign (e.g., '
\
w*+'), it is known as a
possessive

quantifier.


Possessive quantifiers also match as much of the
string as possible, but they do not rescan any
portions of the string should the initial match fail.


When you follow a quantifier with a question mark
(e.g., '
\
w*?'), it is known as a
lazy

quantifier. Lazy
quantifiers match as little of the string as possible.

Operator

Usage

expr?

Match the preceding element 0 times or 1 time. Equivalent to
{0,1}
.

expr*

Match the preceding element 0 or more times. Equivalent to
{0,}
.

expr+

Match the preceding element 1 or more times. Equivalent to
{1,}
.

expr{n}

Must match exactly
n

times. Equivalent to
{n,n}
.

expr{n,}

Must occur at least
n

times.

expr{n,m}

Must occur at least
n

times but no more than
m

times.

qu_expr?

Match the quantified expression according to the guidelines
stated above for lazy quantifiers, where
qu_expr

represents any one of the expressions shown in the top six
rows of this table.

qu_expr+

Match the quantified expression according to the guidelines
stated above for possessive quantifiers, where
qu_expr

represents any one of the expressions shown in the top six
rows of this table.

Quantifiers


Zero or One
expr?

(1) Use ? to make the HTML <code> and </code> tags optional in the
string. The first string, hstr1, contains one occurrence of each tag. Since
the expression uses ()? around the tags, one occurrence is a match:



hstr1 = '<td><a name="18854"></a><code>%%</code><br></td>';



expr = '</a>(<code>)?..(</code>)?<br>';


regexp(hstr1, expr, 'match')


ans =



'</a><code>%%</code><br>'


(2) The second string, hstr2, does not contain the code tags at all. Just
the same, the expression matches because ()? allows for zero
occurrences of the tags:



hstr2 = '<td><a name="18854"></a>%%<br></td>';


expr = '</a>(<code>)?..(</code>)?<br>';


regexp(hstr2, expr, 'match')


ans =



'</a>%%<br>'

Examples

Quantifiers


Zero or More
expr*

Use * to match strings having any number of line breaks, including no line breaks
at all.


hstr1 = '<p>This string has <br><br>line breaks</p>';


expr = '<p>.*(<br>)*.*</p>';


regexp(hstr1, expr, 'match')


ans =



'<p>This string has <br><br>line breaks</p>'




hstr2 = '<p>This string has no line breaks</p>';



regexp(hstr2, expr, 'match')


ans =



'<p>This string has no line breaks</p>'



One or More
expr+

Use + to verify that the HTML image source is not empty. This looks for one or
more characters in the gif filename:


hstr = '<a href="s12.html"><img src="b_prev.gif" border=0>';


expr = '<img src="
\
w+.gif';


regexp(hstr, expr, 'match')


ans =



'<img src="b_prev.gif'

Examples

Quantifiers


Exact, Minimum, and Maximum Quantities
{min,max}

Use {m}, {m,}, and {m,n} to verify the href syntax used in HTML. This
statement requires the href to have at least one non
-
white
-
space
character, followed by exactly one occurrence of .html, optionally
followed by # and five to eight digits:



hstr = '<a name="18749"></a><a href="s13.html#18760">';


expr = '<a href="
\
w{1,}(
\
.html){1}(
\
#
\
d{5,8}){0,1}"';


regexp(hstr, expr, 'match')


ans =



'<a href="s13.html#18760"'




Greedy Quantifiers
expr*

Use * to match as many characters as possible between any < and >
signs in the string. Because of the .* in the expression, regexp reads all
characters in the string up to the end. Finding no closing > at the end,
regexp then backs up to the </a> and ends the phrase there:



hstr = '<tr valign=top><td><a name="19184"></a>xyz';


regexp(hstr, '<.*>', 'match')


ans =



'<tr valign=top><td><a name="19184"></a>'

Examples

Quantifiers


Possessive Quantifiers
expr*+

Except for the possessive *+ quantifier, this expression is the same as that used in the
last example. Unlike the greedy quantifier, possessive quantifiers do not reevaluate parts
of the string that have already been evaluated. This command scans the entire string
because of the .* quantifier, but then cannot back up to locate the </a> sequence that
would satisfy the expression. As a result, no match is found and regexp returns an empty
cell array:


regexp(hstr, '<.*+>', 'match')


ans =



{ }



Lazy Quantifiers
expr*?

This example shows the difference between lazy and greedy quantifiers. The first
expression uses lazy .*? to match the minimum number of characters between <tr, <td, or
</td tags:


hstr = '<tr valign=top><td><a name="19184"></a><br></td>';


regexp(hstr, '</?t.*?>', 'match')


ans =



'<tr valign=top>' '<td>' '</td>'


The second expression uses greedy .* to match all characters from the opening <tr to the
ending </td:


regexp(hstr, '</?t.*>', 'match')


ans =



'<tr valign=top><td><a name="19184"></a><br></td>'

Examples

Quantifiers