Bioinformatics Functions: Categorical List

abalonestrawBiotechnology

Oct 2, 2013


COMP 5115 Programming Tools in Bioinformatics

Week 5


Regular Expressions


A regular expression is a string of characters that defines a certain
pattern. You would normally use a regular expression in searching
through text for a group of words that matches this pattern, perhaps
while parsing program input, or while processing a block of text.



The string 'Joh?n\w*' is an example of a regular expression. It
defines a pattern that starts with the letters Jo, is optionally followed
by the letter h (indicated by 'h?'), is then followed by the letter n, and
ends with any number of word characters, that is, letters, digits, or
underscores (indicated by '\w*'). This pattern matches any of the following:



Jon, John, Jonathan, Johnny
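
The pattern can be tried directly with regexp. This is a minimal sketch: the test string simply joins the four names listed above, and the variable names are illustrative.

```matlab
% Sample text built from the four names listed above.
str = 'Jon John Jonathan Johnny';
pat = 'Joh?n\w*';

% 'match' asks regexp for the matched text instead of the start indices.
names = regexp(str, pat, 'match');
disp(names)   % all four names are matched
```

Each name is matched as a whole because '\w*' keeps consuming word characters until it reaches a space or the end of the string.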



MATLAB supports most of the special characters, or meta-characters,
commonly used with regular expressions and provides several functions
to use in searching and replacing text with these expressions.

See
http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_prog/ch_com15.html#regular_expressions_in_matlab
for further details (Regular Expressions section of the MATLAB documentation).


Several MATLAB functions support searching and
replacing characters using regular expressions:






REGEXP is a function used to match a regular expression.

S = REGEXP(STRING,EXPRESSION) matches the
regular expression, EXPRESSION, in the string, STRING.
The indices of the beginning of the matches are returned.


In EXPRESSION, patterns are specified using
combinations of meta-characters and literal characters**.

*See http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref for further details.


**See
http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_prog/ch_com15.html#regular_expressions_in_matlab
for further details (Regular Expressions section of the MATLAB documentation).

Function         Description
---------------  --------------------------------
regexp           Match regular expression
regexpi          Match regular expression, ignoring case
regexprep        Replace string using regular expression
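
A short sketch of how the three functions differ. The sample sentence and the variable names are made up for illustration.

```matlab
% Made-up sample text containing three spellings of the same word.
str = 'MATLAB and Matlab and matlab';

% regexp is case-sensitive: only the lowercase spelling matches.
idxExact = regexp(str, 'matlab');

% regexpi ignores case: all three spellings match.
idxAnyCase = regexpi(str, 'matlab');

% regexprep replaces every match of the pattern with the given text.
newStr = regexprep(str, 'and', 'or');
```

Here idxExact is 23, idxAnyCase is [1 12 23], and newStr is 'MATLAB or Matlab or matlab'.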

Classes of meta-characters (1)

The following meta-characters each match exactly one
character from their respective set of characters:







Meta-character   Meaning
---------------  --------------------------------
.                Any character
[ ]              Any character contained within the brackets
[^]              Any character not contained within the brackets
\w               A word character [a-z_A-Z0-9]
\W               Not a word character [^a-z_A-Z0-9]
\d               A digit [0-9]
\D               Not a digit [^0-9]
\s               Whitespace [ \t\r\n\f\v]
\S               Not whitespace [^ \t\r\n\f\v]
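
The classes above can be checked on a small string. This is only a sketch; the sample string and the variable names are invented for the illustration.

```matlab
% Invented sample string: letters, digits, one space, one underscore.
str = 'gene42 exon_7';

digits   = regexp(str, '\d', 'match');       % every single digit
spaceIdx = regexp(str, '\s');                % index of each whitespace character
nonWord  = regexp(str, '\W');                % the underscore counts as a word character
vowels   = regexp(str, '[aeiou]', 'match');  % any character listed in the brackets
others   = regexp(str, '[^a-z]', 'match');   % any character NOT in the range a-z
```

Each match is exactly one character long, because none of these patterns contains a repetition operator.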

Classes of meta-characters (2)

The following meta-characters are used to logically group subexpressions
or to specify context for a position in the match. These meta-characters do
not match any characters in the string:






Meta-character   Meaning
---------------  --------------------------------
()               Group subexpression
|                Match subexpression before or after the |
^                Match expression at the start of string
$                Match expression at the end of string
\<               Match expression at the start of a word
\>               Match expression at the end of a word
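
A minimal sketch of the position meta-characters and of alternation, using an invented three-word string (the variable names are illustrative as well):

```matlab
% Invented sample: 'cat' appears as a word, as a suffix, and as a prefix.
str = 'cat concat cats';

wordStart = regexp(str, '\<cat');            % 'cat' at the start of a word
wordEnd   = regexp(str, 'cat\>');            % 'cat' at the end of a word
strStart  = regexp(str, '^cat');             % 'cat' at the start of the string
strEnd    = regexp(str, 'cats$');            % 'cats' at the end of the string
either    = regexp(str, 'cat|con', 'match'); % 'cat' or 'con', wherever they occur
```

With this string, wordStart is [1 12] and wordEnd is [1 8]: the anchors restrict where a match may begin or end without consuming any characters themselves.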





The following meta-characters specify the number of times the previous
meta-character or grouped subexpression may be matched:






Meta-character   Meaning
---------------  --------------------------------
*                Match zero or more occurrences
+                Match one or more occurrences
?                Match zero or one occurrence
{n,m}            Match between n and m occurrences
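
The four repetition operators can be compared side by side. The sketch below uses a made-up string of A runs followed by T; the variable names are illustrative.

```matlab
% Made-up sample: runs of one to four A's followed by T, plus a lone T.
str = 'AT AAT AAAT AAAAT T';

zeroOrMore = regexp(str, 'A*T', 'match');      % also matches the lone 'T'
oneOrMore  = regexp(str, 'A+T', 'match');      % needs at least one 'A'
zeroOrOne  = regexp(str, 'A?T', 'match');      % at most one 'A' before the 'T'
twoToThree = regexp(str, 'A{2,3}T', 'match');  % two or three 'A's before the 'T'
```

Note that 'A{2,3}T' still finds 'AAAT' inside 'AAAAT': a match may start at the second A, since nothing in the pattern anchors it to the start of a word.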


Characters that are not special meta-characters
are all treated literally in a match. To match a
character that is a special meta-character,
escape that character with a '\'. For example '.'
matches any character, so to match a '.'
specifically, use '\.' in your pattern.
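
A small sketch of the difference between '.' and '\.'. The two file-like names in the sample string are hypothetical.

```matlab
% Hypothetical sample: the second name has an 'x' where the first has a dot.
str = 'seq1.fasta seq1xfasta';

% Unescaped '.' matches any character, so both names match.
loose = regexp(str, 'seq1.fasta', 'match');

% Escaped '\.' matches only a literal dot.
strict = regexp(str, 'seq1\.fasta', 'match');
```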


Example:


str = 'bat cat can car coat court cut ct caoueouat';


pat = 'c[aeiou]+t';


regexp(str, pat)

returns [5 17 28 35]


When one of STRING or EXPRESSION is a cell array of strings, REGEXP
matches the string input with each element of the cell array input.

Example:

str = {'Madrid, Spain' 'Romeo and Juliet' 'MATLAB is great'};

pat = '\s';

regexp(str, pat)

returns {[8]; [6 10]; [7 10]}





When both STRING and EXPRESSION are cell arrays
of strings, REGEXP matches the elements of STRING
and EXPRESSION sequentially. The number of
elements in STRING and EXPRESSION must be
identical.



Example:

str = {'Madrid, Spain' 'Romeo and Juliet' 'MATLAB is great'};

pat = {'\s', '\w+', '[A-Z]'};

regexp(str, pat)

returns {[8]; [1 7 11]; [1 2 3 4 5 6]}


REGEXP supports up to six outputs. These outputs may be
requested individually or in combinations by using additional input
keywords. The order of the input keywords corresponds to the order
of the results. The input keywords and their corresponding results in
the default order are:






Keyword          Result
---------------  --------------------------------
'start'          Row vector of starting indices of each match
'end'            Row vector of ending indices of each match
'tokenExtents'   Cell array of extents of tokens in each match
'match'          Cell array of the text of each match
'tokens'         Cell array of the text of each token in each match
'names'          Structure array of each named token in each match






Example:



str = 'regexp helps you relax';



pat = '\w*x\w*';

m = regexp(str, pat, 'match')

returns

m = {'regexp', 'relax'}
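
Several keywords can be requested in one call. The following sketch reuses the string and pattern from the example above; the output variable names are illustrative.

```matlab
str = 'regexp helps you relax';
pat = '\w*x\w*';

% The outputs come back in the same order as the keywords.
[s, e, m] = regexp(str, pat, 'start', 'end', 'match');

% Reordering the keywords reorders the outputs.
[m2, s2] = regexp(str, pat, 'match', 'start');
```

Here s is [1 18] and e is [6 22], the first and last index of 'regexp' and 'relax' in the string.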


Tokens are created by parenthesized subexpressions within
EXPRESSION.



Example:



str = 'six sides of a hexagon';



pat = 's(\w*)s';

t = regexp(str, pat, 'tokens')

returns

t = {{'ide'}}


Named tokens are denoted by the pattern (?<name>...). The 'names'
result structure will have fields corresponding to the named tokens in
EXPRESSION.



Example:



str = 'John Davis; Rogers, James';



pat = '(?<first>\w+)\s+(?<last>\w+)|(?<last>\w+),\s+(?<first>\w+)';

n = regexp(str, pat, 'names')

returns

n(1).first = 'John'
n(1).last = 'Davis'
n(2).first = 'James'
n(2).last = 'Rogers'


By default, REGEXP returns all matches. To find just the first match,
use REGEXP(STRING,EXPRESSION,'once').
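
A brief sketch of 'once', reusing the string and pattern from the first example in this section. The variable names are illustrative.

```matlab
str = 'bat cat can car coat court cut ct caoueouat';
pat = 'c[aeiou]+t';

allStarts  = regexp(str, pat);                   % every match
firstStart = regexp(str, pat, 'once');           % first match only
firstMatch = regexp(str, pat, 'match', 'once');  % text of the first match
```

With 'once', the 'match' result is the matched text itself ('cat') rather than a cell array containing it.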


Tokens: Parentheses used in a regular expression not only group
elements of that expression together, but also designate any
matches found for that group as tokens. You can use tokens to
match other parts of the same string. One advantage of using tokens
is that they remember what they matched, so you can recall and
reuse matched text in the process of searching or replacing.


Operators Used with Tokens

Operator              Usage
--------------------  --------------------------------
(expr)                Capture in a token all characters matched by the
                      expression within the parentheses.
\N                    Match the Nth token generated by this command. That is,
                      use \1 to match the first token, \2 to match the
                      second, and so on.
$N                    Insert the match for the Nth token in a replacement
                      string. Used only by the regexprep function.
(?<name>expr)         Capture in a token all characters matched by the
                      expression within the parentheses. Assign a name to
                      the token.
\k<name>              Match the token referred to by name.
(?(tok)expr)          If token tok is generated, then match expression expr.
(?(tok)expr1|expr2)   If token tok is generated, then match expression expr1.
                      Otherwise, match expression expr2.
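
A minimal sketch of three of these operators in use. The sample strings, the token name unit, and the variable names are all invented for the illustration.

```matlab
% \N: find a word that is immediately repeated.
% \1 must match the same text that the first token captured.
dup = regexp('this is is a test', '\<(\w+)\s+\1\>', 'match');

% $N: reuse captured tokens in a replacement string (regexprep only).
swapped = regexprep('Rogers, James', '(\w+),\s+(\w+)', '$2 $1');

% (?<name>expr) with \k<name>: the same idea, with a named token.
rep = regexp('abcabc abcabd', '(?<unit>abc)\k<unit>', 'match');
```

The word anchors \< and \> in the first pattern keep it from matching the 'is' at the end of 'this' followed by the next 'is'.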


Example of how tokens are assigned values. Suppose
that you are going to search the following text:


andy ted bob jim andrew andy ted mark


You choose to search the above text with the following
search pattern: and(y|rew)|(t)e(d)


This pattern has three parenthetical expressions that
generate tokens. When you finally perform the search,
the following tokens are generated for each match:


Match     Token 1   Token 2
andy      y
ted       t         d
andrew    rew
andy      y
ted       t         d



Only the highest level parentheses are used. For
example, if the search pattern and(y|rew) finds the text
andrew, token 1 is assigned the value rew. However, if
the search pattern (and(y|rew)) is used, token 1 is
assigned the value andrew.
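
The effect of the outer parentheses can be checked with the two patterns just described. This is only a sketch; the variable names are illustrative.

```matlab
% Inner parentheses only: the token is the alternative that matched.
tokInner = regexp('andrew', 'and(y|rew)', 'tokens');

% An extra outer pair: the token is now the whole outer group.
tokOuter = regexp('andrew', '(and(y|rew))', 'tokens');

disp(tokInner{1}{1})   % first token of the first match
disp(tokOuter{1}{1})
```

The 'tokens' result is a cell array with one cell per match, and each of those cells holds the tokens for that match, hence the double indexing.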

Examples-1

Use (expr) and \N to capture pairs of matching HTML tags (e.g., <a> and
</a>) and the text between them. The expression used for this example is

expr = '<(\w+).*?>.*?</\1>';


The first part of the expression, '<(\w+)', matches an opening bracket (<)
followed by one or more alphabetic, numeric, or underscore characters. The
enclosing parentheses capture token characters following the opening
bracket.


The second part of the expression, '.*?>.*?', matches the remainder of this
HTML tag (characters up to the >), and any characters that may precede
the next opening bracket.


The last part, '</\1>', matches all characters in the ending HTML tag. This
tag is composed of the sequence </tag>, where tag is whatever characters
were captured as a token.

hstr = '<!comment><a name="752507"></a><b>Default</b><br>';

expr = '<(\w+).*?>.*?</\1>';

[mat tok] = regexp(hstr, expr, 'match', 'tokens');

mat{:}

ans =
    <a name="752507"></a>

ans =
    <b>Default</b>

tok{:}

ans =
    'a'

ans =
    'b'

Examples-2

Tokens That Are Not Matched: For those tokens specified in the regular
expression that have no match in the string being evaluated, regexp and
regexpi return an empty string ('') as the token output, and an extent that
marks the position in the string where the token was expected.

The example shown here executes regexp on the path string str returned
from the MATLAB tempdir function. The regular expression expr includes
six token specifiers, one for each piece of the path string. The third specifier
[a-z]+ has no match in the string because this part of the path, Profiles,
begins with an uppercase letter:

str = tempdir

str =
    C:\WINNT\Profiles\bpascal\LOCALS~1\Temp\

expr = ['([A-Z]:)\\(WINNT)\\([a-z]+)?.*\\' ...
        '([a-z]+)\\([A-Z]+~\d)\\(Temp)\\'];

[tok ext] = regexp(str, expr, 'tokens', 'tokenExtents');


When a token is not found in a string, MATLAB still returns a token string
and token extent. The returned token string is an empty character string ('').
The first number of the extent is the string index that marks where the token
was expected, and the second number of the extent is equal to one less
than the first.


In the case of this example, the empty token is the third specified in the
expression, so the third token string returned is empty:

tok{:}

ans =
    'C:'    'WINNT'    ''    'bpascal'    'LOCALS~1'    'Temp'
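
Because the tempdir example depends on one particular Windows path, here is a smaller, portable sketch of the same behavior. The two-letter string and the three-token pattern are invented for the illustration.

```matlab
str  = 'ab';
expr = '(a)(x)?(b)';   % the optional second token has nothing to match in 'ab'

[tok, ext] = regexp(str, expr, 'tokens', 'tokenExtents');

tok{1}   % tokens of the only match; the second one is empty
ext{1}   % one [start end] row per token
```

As described above, the extent row for the unmatched token should have an end index one less than its start index.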

Examples-3

Conditional Expressions -- (?(token)expr1|expr2)


With conditional regular expressions, you can select
which pattern to match, depending on whether a token
elsewhere in the string is found. The expression appears
as


(?(token)expr1|expr2)


This expression can be translated as an if-then-else
statement, as follows:

if the specified token is found
    then match expression expr1
    else match expression expr2


This example uses the conditional expression expr to
match the string regardless of the gender used. The
expression creates a token if Mr is followed by the letter s.
It later matches either her or his, depending on
whether this token was found. The phrase (?(1)her|his)
means that if token 1 is found, then match her, else match his.

Examples-4

expr = 'Mr(s?)\..*?(?(1)her|his) son';

[mat tok] = regexp('Mr. Clark went to see his son', ...
    expr, 'match', 'tokens')

mat =
    'Mr. Clark went to see his son'

tok =
    {1x2 cell}

tok{:}

ans =
    ''    'his'



In the second part of the example, the token s is found and MATLAB
matches the word her:


[mat tok] = regexp('Mrs. Clark went to see her son', ...
    expr, 'match', 'tokens')

mat =
    'Mrs. Clark went to see her son'

tok =
    {1x2 cell}

tok{:}

ans =
    's'    'her'


Examples-4 (cont.)