Intro-bio 102 Lab #1 Perl Programming for Pattern Matching.

Dec 13, 2013

BIO 102 Renn_Lab#1 (in-lab handout)

Name ___________________


There are 9 exercises in total:

USE PERL PROGRAM regex.pl

1. Scrabble in English
2. “Scrabble” in DNA Sequences
3. Direct Repeats
4. Direct Repeats in DNA Sequences
5. Mirror Repeats, also called palindromes in English (English Word Play)
6. Mirror Repeats in DNA Sequences

USE PERL PROGRAM book_search.pl

7. Pattern Matching in Large Texts

USE PERL PROGRAM omic_search.pl

8. Pattern Matching for Protein Sequences
9. Pattern Matching for Protein Sequences in DNA

There is more if you finish early and are still enjoying yourself!

Note: If you run a Perl program and it runs on and on ... and on and on, you’ll need to HALT the program.

Hold down the keys Option-Apple-Escape, select the TextWrangler application, and click "Force Quit".


In your lab notebook keep a record of the regexes that you try. Keep track of whether or not they worked the way you expected them to. Write down your errors as diligently as your successes; the errors are what you will learn from. Whenever you see gray highlighted text, you should be writing in your lab notebook. At the end of lab, you will trade notebooks to learn from each other’s efforts.

__ Log on to your computer using your own name and password.

__ Connect to the courses server and open the week1_Perl_lab folder in Behavior-Renn.

__ Drag the folder regexPlay onto your desktop.

__ Open this folder and you should see the following files:


anagram.pl                  Perl program to use regex to find anagrams
book_search.pl              Perl program to use regex to return full paragraphs
ecoli_K12_genes.fasta       DNA coding sequence for predicted E. coli proteins in FASTA format
ecoli_K12_proteins.fasta    amino acid sequence for predicted E. coli proteins
ecoli.txt                   7-mers found in E. coli genome
english.txt                 English dictionary (list of words, one per line)
omic_search.pl              Perl program to use regex to return genes with specified pattern
origin.txt                  Full text of Darwin’s “Origin of Species”
regex.pl                    Perl program to use regex to find patterns


Perl is already installed on the intro-bio lab computers for you.




__ Open the application TextWrangler™ from the Dock (Gold “W”).

__ From the File menu in TextWrangler, select Open (or use ⌘-O) and open the Perl program regex.pl from the folder on your desktop.

The Perl program will appear in your programming environment. It will be color coded according to Perl syntax. The text that you will work with will be the color coded pink text under the words “ENTER HERE”. Notice the “#” sign is used to frame the code and include comments.

__ You should also see a “Documents Drawer” on the right. If not, go to the View menu and select Show Documents Drawer.

__ Before making any changes, make your own version of this script by saving it as regex_yourname.pl. From the File menu select Save As, give the file its new name, and check that you are saving it in the regexPlay folder on the desktop.

__ If you do not see line numbers next to the code, use the Edit menu to open Text Options and check the box under Display: for Line Numbers.

This is the only section you will be changing:

######################################################################
# ==========ENTER HERE============

my $regex    = 'q[^u]';

my $filename = "english.txt";
# my $filename = "ecoli.txt";

######################################################################


(Note: The word “my” is Perl’s way of declaring a variable. It is necessary here.)

To write a new regex, you will change the pink text between the single quotes.

Note: you must keep the single quotes!


The initial regex supplied in this Perl program means:

“match all words that contain a ‘q’ that is followed by a letter that is not ‘u’.”
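As a rough illustration of the search described above, the sketch below applies the starting regex to a handful of words. It is not the code inside regex.pl: the helper name matching_words and the five-word list are assumptions made for this example, and the /i flag stands in for the case-insensitive matching that the handout says the script performs.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from regex.pl): return the words that
# match $regex, ignoring upper vs. lower case.
sub matching_words {
    my ($regex, @words) = @_;
    return grep { /$regex/i } @words;
}

# A tiny stand-in for english.txt, one word per element.
my @words = qw(queen Iraqi quiz Qatar antique);

my @hits = matching_words('q[^u]', @words);
print scalar(@hits), " match(es): @hits\n";
```

In this sketch a word that ends in ‘q’ would not be returned, because [^u] has to match an actual character after the ‘q’.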


____ You must also choose the file to search: either the English dictionary (english.txt) or the list of 7-mers in E. coli (ecoli.txt).

You will make this selection by putting a comment (#) symbol in front of the line that you do not want to use.

The initial program is set to search the file "english.txt", thus there is a # to “comment out” (shut off) the "ecoli.txt" file.

Note: After you change the regex, you must SAVE your changes to the file before you re-run the program. Your program is NOT saved if it has a dark diamond next to the title in the Documents panel. You can save your Perl program with ⌘-S or by using the pull-down menu from File.

____ To run your program from TextWrangler use the #! menu and select Run (this combination of symbols is called hashbang or she-bang).

Your program will run, and a new file with your results will open.

You will see a new document name added to the Documents panel called “Unix Script Output.” This file is not currently saved, and you do not need to save it. Subsequent runs will be added below, and it may be a very long file by the end of the day. You should record all of your coding attempts in your lab notebook as you work. When your attempted code fails to give the expected results, try to figure out why and correct it. Record this learning process in your lab notebook.

___ Run your program now before you change anything in the code.

How many words were there with a ‘q’ that is followed by a letter that is not ‘u’? (Write your findings in your lab notebook.)

____ To return to the script, click on the document icon to the left of your saved regex_yourname.pl in the Documents Drawer, or use the back-arrow button.





OUTPUT of initial program:

Work your way through the rest of the handout.

Always start at the top of each left hand page. This is a demonstration using the English dictionary and English letters. Each page then contains a “going back for more” section that you must solve, recording your solution in your lab notebook. You are asked to be creative using the regular expressions that you have learned on each page.

Then, move to the facing page, the right hand page. This is a demonstration using DNA or protein code rather than English. Again, there is an example followed by one or more assigned problems, and an opportunity to be creative.

The answers are available from your instructor. Working with your fellow students is encouraged. Please ask for and offer help to each other before looking up the answer.

[Figure: annotated TextWrangler screenshot. Labels: pink regex text; Documents Drawer (open diamond = saved file, black diamond = unsaved file, bold text = working file); yellow highlighted text at cursor; back-arrow button; counts of words returned.]



Scrabble in English

Question: Are there any words in an English dictionary where the letter ‘q’ is not immediately followed by the letter ‘u’?

Pattern to match expressed in English: Search for two adjacent characters: the letter ‘q’ immediately followed by any character that is not the letter ‘u’.


[…]        match any one of the characters in the set
[^…]       match any one character that is not in the set
{n}        match the preceding item exactly n times
{n,}       match the preceding item at least n times
{x,y}      match the preceding item at least x and not more than y times

these can be combined

[…]{x,y}   match any of the characters in the set at least x and not more than y times

|          a vertical bar separates alternatives; this is OR in Boolean logic.
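The sketch below tries each kind of operator from the table on a short word list, to show how sets, repeat counts, and alternatives combine. The word list, the patterns, and the helper name words_matching are all assumptions chosen for illustration; none of them is an answer to the exercises.

```perl
use strict;
use warnings;

# Hypothetical helper: return the words in @list that match $regex
# (case-insensitive).
sub words_matching {
    my ($regex, @list) = @_;
    return grep { /$regex/i } @list;
}

my @words = qw(beautiful rhythm bookkeeper cat dogma);

# Each entry pairs a pattern with a plain-English reading of it.
my @patterns = (
    [ '[aeiou]{3}',   'three vowels in a row' ],
    [ '[^aeiou]{4,}', 'four or more non-vowel characters in a row' ],
    [ 'o{2}k{2}',     'two o followed by two k' ],
    [ 'cat|dog',      'either cat or dog' ],
);

for my $pair (@patterns) {
    my ($regex, $meaning) = @$pair;
    my @hits = words_matching($regex, @words);
    print "$regex ($meaning): @hits\n";
}
```

Note that ‘y’ is not in the set [aeiou], so it counts as a non-vowel character in the second pattern.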



regex = ‘q[^u]’

(1) Iraqi

(2) Iraqis

(3) Qatar


Note: In these facing page examples, we are assuming that the Perl program is instructing each regex to ignore the case of letters. You need not worry about upper vs. lower case in your regex. In the example above, “q” is really handling both upper and lower case ‘q’.

This is part of the work that the Perl program is doing for you. After you play with regex, we can look at the code if you are interested.



Going back for more:

Select regex_yourname.pl in the Documents Drawer to return to your Perl script. Edit your new regex in pink text between the single quotes, and then save your changes before running the script from the hash-bang menu #!. If you forget to save, it will run the previous regex.

(1a) All words that contain the three-letter string “ghi”.

Note: Do not confuse this with the request to find words with any one of the letters ‘g’, ‘h’, or ‘i’. [ghi] is not the regex you want. [ghi] finds words with one of the letters in the set.

(1b) All words that contain “fin” or “phin”.

(1c) All words with ‘yz’ that are not immediately followed by an ‘e’ or an ‘i’.

(1d) All words with.... (make up your own).


“Scrabble” in DNA Sequences

Our context for DNA sequences is a file with one DNA motif (or word) per line. The file comprises 16,384 7-mers, each line holding one DNA sequence or motif of length seven (7), where a motif is defined as “a pattern with putative biological meaning”. The 7-mers were collected by us from NCBI’s publicly available DNA sequence for the E. coli bacterium.

Change the file that will be searched by inserting and deleting the hash sign # as shown.

# my $filename = "english.txt";

my $filename = "ecoli.txt";



Question: Are there any 7-mers in E. coli where the nucleotide ‘G’ (guanine) is not immediately followed by the nucleotide ‘T’ (thymine)?

Search for two adjacent nucleotides: guanine ‘G’ followed by any nucleotide that is not thymine ‘T’.



regex = ‘G[^T]’

TAAAGAA
ACGTGCC
GATATTT … 11267 total



The program is checking all motifs for a G followed by a letter that is not T:

TCAGTGT no
GTTCACG no
TAAAGAA yes
AGTAGTG no
CTTTTTT no
ACGTGCC yes
ACTCATT no
etc.
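The yes/no checks listed above can be reproduced with a few lines of Perl. This is only a sketch: the helper name g_not_t is an assumption, and the seven motifs are the ones shown in the list rather than the full ecoli.txt file.

```perl
use strict;
use warnings;

# Hypothetical helper: 'yes' if the motif contains a G that is followed
# by some character other than T, otherwise 'no'.
sub g_not_t {
    my ($motif) = @_;
    return $motif =~ /G[^T]/i ? 'yes' : 'no';
}

my @motifs = qw(TCAGTGT GTTCACG TAAAGAA AGTAGTG CTTTTTT ACGTGCC ACTCATT);
print "$_ ", g_not_t($_), "\n" for @motifs;
```

GTTCACG is a "no" even though it ends in G, because nothing follows that final G. One caution when reading motifs from a file: a line that still carries its trailing newline would count as a match, since a newline is a character that is not T, so a script of this kind needs to remove the newline (for example with chomp) before testing.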


Going back for more:

(2a) All 7-mers that contain GTGAC.

(2b) All 7-mers that contain the dimer GC, where this dimer is immediately followed by a letter that is neither a G nor a C.

(2c) All 7-mers where ... (write your own)?



Direct Repeats

Question: Are there any words in which a sequence of three letters is repeated elsewhere in the word?

Note: Don’t forget to switch back to the English text.

Search for words where a sequence of three letters is repeated. This pattern might appear in the middle of a longer word.



.      Match any character except a new line
(.)    Match any character and remember it
.*     Match any character zero or more times
.+     Match any character one or more times
.?     Match any character zero or one time
\1     Recall the character from the 1st match
\2     Recall the character from the 2nd match
\3     Recall the character from the 3rd match


regex = ‘(.)(.)(.).*\1\2\3’

(1) acclimatization
(2) agglutinating
(3) aggressiveness
(4) Albuquerque
(5) alfalfa
(6) allegorically
(7) amalgamate
(8) amalgamated
(9) amalgamates
(10) amalgamating … 305 total
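To see which three letters the parentheses remembered, a short sketch like the one below can print them. The helper name repeated_triplet is an assumption for this example, and the words are taken from the list above plus one non-matching word (“banana”) added for contrast.

```perl
use strict;
use warnings;

# Hypothetical helper: return the remembered three letters ($1$2$3) if
# they occur again later in the word, otherwise undef.
sub repeated_triplet {
    my ($word) = @_;
    return $word =~ /(.)(.)(.).*\1\2\3/i ? "$1$2$3" : undef;
}

for my $word (qw(alfalfa Albuquerque amalgamate banana)) {
    my $triplet = repeated_triplet($word);
    print "$word: ", (defined $triplet ? $triplet : 'no repeat'), "\n";
}
```

“banana” does not match: “ana” appears twice, but the two copies overlap, and the second copy has to start after the three remembered letters.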


Going back for more:


(3a) Are there any words in which a sequence of two letters is directly repeated at least three times in the word?

(3b) Are there any words ... (write your own)?

(Note: “directly” refers to “in the same order”. It is different from “immediately” in that it allows for additional characters to come between the repeated motifs. How would you change the regex to require an immediately repeated 3-letter motif?)







Direct Repeats in DNA Sequences



Question: Are there any 7-mers in the DNA sequence of E. coli where a sequence of three nucleotides is repeated elsewhere in the motif?

Search for motifs where a sequence of three nucleotides is repeated.



regex = ‘(.)(.)(.).*\1\2\3’

(1) AAAAAAA
(2) AAAAAAC
(3) AAAAAAG
(4) AAAAAAT
(5) AAACAAA
(6) AAACAAC
(7) AAAGAAA
(8) AAAGAAG
(9) AAATAAA
(10) AAATAAT … 700 total


Going back for more:



(4a) Are there any motifs in which a sequence of two nucleotides is repeated at least three times in the motif?

(4b) Are there any motifs ... (write your own)






Mirror Repeats (also called palindromes in English)

Question: Are there any words in which the first three letters of the word reappear, in reverse order, at the end of the word?

Search for words that begin with a sequence of three letters and end with the reverse of those letters.



regex = ‘^(.)(.)(.).*\3\2\1$’

(1) despised
(2) detected
(3) deteriorated
(4) detested
(5) foolproof
(6) Greenberg
(7) Hannah
(8) redder
(9) reviver
(10) revolver
(11) rotator


remember:

^ab     Match ‘ab’ at the beginning of a word
xyz$    Match ‘xyz’ at the end of a word
\2      Recall the character from the 2nd match
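The following sketch tests the mirror-repeat regex on a few words to show how ^ and $ pin the pattern to both ends of the word. The helper name is_mirror_word is an assumption, and “reader” and “level” are extra words added here for contrast; the others come from the list above.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if the word starts with three letters and ends
# with the same three letters in reverse order, otherwise 0.
sub is_mirror_word {
    my ($word) = @_;
    return $word =~ /^(.)(.)(.).*\3\2\1$/i ? 1 : 0;
}

for my $word (qw(redder Hannah detected foolproof reader level)) {
    print "$word: ", (is_mirror_word($word) ? 'yes' : 'no'), "\n";
}
```

“level” reads the same in both directions but still does not match, because the remembered letters and their reversed copy cannot overlap, so this regex needs a word of at least six letters.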



Going back for more:

(5a) Are there any words in which three consecutive letters anywhere in a word are followed by the reverse of those letters anywhere in the word? (Hint: you do not need ^ (start of word) and $ (end of word) for this.)

(Note: the following two examples combine both direct and mirror patterns.)

(5b) Are there any words in which a pair of letters is followed by a direct repeat of that pair, followed by two occurrences of the reverse of the pair? (Each of the pairs can be zero or more letters apart, e.g., “AB...AB...BA...BA”.)

(5c) Are there any words in which this pattern occurs twice: a pair of letters is followed by the reverse of that pair? (Each of the pairs can be zero or more letters apart.)



Mirror Repeats in DNA Sequences

Question: Are there any motifs in which the first three nucleotides of the motif reappear, in reverse order, at the end of the motif?

Search for motifs where the reverse of the initial three nucleotides is repeated at the end of the motif.



regex = ‘^(.)(.)(.).*\3\2\1$’

(1) AAAAAAA
(2) AAACAAA
(3) AAAGAAA
(4) AAATAAA
(5) AACACAA
(6) AACCCAA
(7) AACGCAA
(8) AACTCAA
(9) AAGAGAA
(10) AAGCGAA … 256 total



remember:

^ATG    Match ‘ATG’ at the beginning of a motif
TAC$    Match ‘TAC’ at the end of a motif
(.)     Match any character and remember it
\2      Recall the character from the 2nd match



Going back for more:

(6a) Are there any motifs in which three consecutive nucleotides anywhere in a motif are followed by the reverse of those nucleotides anywhere in the motif? (Hint: you do not need ^ (start of motif) and $ (end of motif) for this.)

(Note: the following two examples combine both direct and mirror patterns.)

(6b) Are there any motifs in which a pair of adjacent nucleotides is followed by a direct repeat of that pair, followed by two occurrences of the reverse of the pair? (Each of the pairs can be zero or more nucleotides apart, e.g., “GT...GT...TG...TG”.)

(6c) Are there any motifs in which this pattern occurs twice: a pair of adjacent nucleotides is followed by the reverse of that pair? (Each of the pairs can be zero or more nucleotides apart.)





Pattern Matching in Large Texts.

Perl can be a powerful tool for searching and manipulating larger text documents.

We have downloaded the full text of The Origin of Species by means of Natural Selection, 6th Edition, by Charles Darwin.

You will write regexes to find interesting words in this text. Choose words or concepts from Bob Kaplan’s lectures first semester.

In TextWrangler close (⌘-W) the regex.pl script and open (⌘-O) book_search.pl.

This script will return the entire paragraph that contains your interesting word so you can learn how Charles Darwin used these terms.
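One way a script could return whole paragraphs is to split the text at blank lines and keep each piece that matches the regex. The sketch below shows that idea only; it is not the code in book_search.pl, the helper name matching_paragraphs is an assumption, and the sample text is made up for the example rather than quoted from origin.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into paragraphs at blank lines and
# return the paragraphs that match $regex (case-insensitive).
sub matching_paragraphs {
    my ($regex, $text) = @_;
    my @paragraphs = split /\n\s*\n/, $text;
    return grep { /$regex/i } @paragraphs;
}

# Made-up sample text with three paragraphs separated by blank lines.
my $text = "Pigeons vary in beak shape.\nBreeders select for it.\n\n"
         . "Climate changes slowly.\n\n"
         . "Selection acts on small variations.\n";

my @hits = matching_paragraphs('select', $text);
print scalar(@hits), " paragraph(s) matched\n";
print "---\n$_\n" for @hits;
```

Because the match ignores case and is not tied to word boundaries, the pattern select is found in both “select” and “Selection”.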

Search the text origin.txt.

You could search any other downloaded text by changing line #13 in the program.

( my $filename = "origin.txt";   #enter the text file name here )


(9a) How many times do you think Darwin used the word “evolution” in his text? Write your guess in your lab notebook.

(9b) Write a regex to find the correct answer.

This would be easy to do in a simple text program like Word using the Find tool, but now try to find any term related to evolution (evolve, evolution, evolved).

(9c) Write a regex that will find any term related to evolution.

(9d) How many are there?

(9e) Look carefully. Did you get words you weren’t expecting (“revolved”, “revolution”)? Look back at your answer to 9b now also.

(9f) Write a regex that will avoid these words. (Hint: \s means “white space” and \b means “word boundary”.)


Several special characters can be helpful regex tools in this type of search in Perl.

\n    new line character
\t    tab
\s    whitespace
\S    non-whitespace
\d    digit
\D    non-digit
\b    word boundary
\B    not word boundary
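A small sketch can show the difference a word boundary makes. The sentence, the search word “art”, and the helper name all_matches are assumptions picked for illustration, so that the example does not give away the exercise.

```perl
use strict;
use warnings;

# Hypothetical helper: return every piece of $text matched by $regex
# (case-insensitive). Intended for patterns without their own parentheses.
sub all_matches {
    my ($regex, $text) = @_;
    my @found = $text =~ /($regex)/gi;
    return @found;
}

my $text = "The art of the party started in 1859; art is smart.";

my @loose  = all_matches('\w*art\w*', $text);  # any word containing "art"
my @strict = all_matches('\bart\b',   $text);  # "art" as a whole word only
my @digits = all_matches('\d+',       $text);  # runs of digits

print "containing art: @loose\n";
print "whole word art: @strict\n";
print "numbers:        @digits\n";
```

Without \b the pattern also picks up “party”, “started”, and “smart”; with \b on both sides only the two stand-alone occurrences of “art” remain.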

If you are finding it difficult to find your word in the full printed paragraph, comment out line 52 and use line 53 by removing that comment mark.

52   print @paragraph;      # and print the paragraph
53 # print "$match\n";      # will print only the line that matched


Pattern Matching for Protein Sequences

Remember Janis Shampay’s lectures last semester?

A protein is a string of amino acids, of which there are 20. Each amino acid has a three-letter abbreviation but also a one-letter abbreviation used for writing protein sequences.


* G - Glycine (Gly)
* P - Proline (Pro)
* A - Alanine (Ala)
* V - Valine (Val)
* L - Leucine (Leu)
* I - Isoleucine (Ile)
* M - Methionine (Met)
* C - Cysteine (Cys)
* F - Phenylalanine (Phe)
* Y - Tyrosine (Tyr)
* W - Tryptophan (Trp)
* H - Histidine (His)
* K - Lysine (Lys)
* R - Arginine (Arg)
* Q - Glutamine (Gln)
* N - Asparagine (Asn)
* E - Glutamic Acid (Glu)
* D - Aspartic Acid (Asp)
* S - Serine (Ser)
* T - Threonine (Thr)

What amino acid sequence would be represented by the protein sequence GENE?

In TextWrangler close (⌘-W) book_search.pl; open (⌘-O) omic_search.pl.

This program will return either DNA or protein sequence for genes. It is expecting a FASTA formatted file, rather than a book.

Use the file E_coli_K12_proteins.fasta by commenting out the appropriate file line.

If you want to see only one matching line of protein code, comment out line 43 and use line 44.

(10a) Write a regex to find how many proteins in E. coli include the protein code GENE.

(10b) Write a regex to find how many proteins in E. coli include the amino acids DARWIN in any order.

(10c) Write a regex to find proteins in E. coli that include ... (make up your own).

(10d) Is your name part of the E. coli proteome? (If you don’t know what the word proteome means, look it up in Wikipedia quickly: http://en.wikipedia.org/wiki/Proteome)


Regular expressions can be used to search for secondary structure.

Different amino acids have different chemical characteristics. The secondary structure of a protein depends on the number and spacing of these different chemical characteristics. Different amino-acid sequences have different propensities for forming α-helical structure. Methionine, alanine, leucine, glutamic acid, and lysine ("MALEK" in the amino-acid 1-letter codes) all have high helix-forming propensities. Proline (“P”) tends to break or kink helices. However, proline is often seen as the first residue of a helix.

(10e) Write a regex for an amino acid pattern that is likely to form a long (10 amino acid) alpha helix secondary structure.




There are many web-based tools that search for complex protein or DNA patterns such as Neuropeptide cleavage predictors (see appendix for an example). These tools rely on programming in languages like Perl or Python. One advantage of learning to write your own programs is that you can search multiple sequences (or whole genomes as you have done today). You do not need an internet connection, you do not have to wait for results, and you can format the output to be compatible with your own files. You can adjust the parameters. Often it is possible to obtain the program code that is used by these web-based tools, and you can then adjust specific parameters to meet your own needs.



Pattern Matching for Protein Sequences in DNA

Each amino acid is encoded by 3 letters in the genomic DNA (translated via mRNA). These three bases are called a “codon”. Remember that some of the amino acids can be encoded by more than one codon, so the code is called “degenerate”.

Do you remember the Universal Genetic Code? Of course not, nobody does.

Use the Perl script omic_search.pl.

Use the file E_coli_K12_genes.fasta by commenting out the appropriate file line.


A regular expression for alanine is

regex = ‘gc[acgt]’

A regular expression for arginine is a bit tougher. Remember, the vertical line symbol | means OR.

regex = ‘(cg[acgt])|(ag[ag])’
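The two codon patterns above can be checked against individual codons with a sketch like the one below. The helper name codon_is is an assumption; it wraps the pattern in ^(?:...)$ (a grouping that does not remember anything) so that the whole three-letter codon must match, which is an addition for this example and not part of the handout's regexes.

```perl
use strict;
use warnings;

my $alanine  = 'gc[acgt]';
my $arginine = '(cg[acgt])|(ag[ag])';

# Hypothetical helper: 1 if the entire codon matches $regex, otherwise 0.
sub codon_is {
    my ($regex, $codon) = @_;
    return $codon =~ /^(?:$regex)$/i ? 1 : 0;
}

for my $codon (qw(GCA GCT GGA CGT AGA AGG AGC)) {
    my @amino_acids;
    push @amino_acids, 'alanine'  if codon_is($alanine,  $codon);
    push @amino_acids, 'arginine' if codon_is($arginine, $codon);
    print "$codon: ", (@amino_acids ? "@amino_acids" : 'neither'), "\n";
}
```

AGC is reported as neither: it starts with AG like two of the arginine codons, but its third letter is outside the set [ag].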



(11a) Write a regular expression that will search the genomic sequence for the possible amino acid sequence GENE.

(11b) Do you find this sequence the same number of times in the DNA as you do when you search the protein file for GENE? If not, what is the reason?

(11c) Write a regular expression to find a lysine-rich region (3 or more in a row). Or make it harder: find a region where every other amino acid is a lysine. Think about how you could specify which amino acids are allowed to be interspersed with the lysines.

(11d) Why might a researcher be interested in looking for secondary structure in a given DNA sequence rather than looking directly at an amino acid sequence?

(11e) Why might a researcher want to write their own program rather than using a web-based tool?


When you have finished

__ The answer key is available in lab if you wish to check your own work.

__ There is an additional exercise on finding anagrams if you are having fun, have finished early and want to try more difficult regular expressions.

__ If you want to save your output file, use the Save As command, give it a name and save it to your Home Server. This file contains only the output, not the regular expressions you used to get what you were looking for. Your lab notebook is the real record of your efforts.

__ Before you leave, find another student who has also finished the exercises. Trade notebooks with this student. Complete the post-lab evaluation form while you learn what creative regular expressions your classmate has attempted.

__ Take your own lab notebook home with you.

__ Leave your post-lab evaluation form with Carey or Ned.




Appendix

This lab is based on Exercise 3 from “Programming in PERL for Biology”, a text by Mark LeBlanc and Betsy Dreyer from Wheaton College. The full text will be published later in 2007 and is recommended for any student wishing to go further with this work. Until that book is published, I recommend Beginning Perl for Bioinformatics from O'Reilly & Associates as it addresses the needs of biologists who want to learn Perl programming.




If you want to install Perl on your own computer, it comes installed with Mac OS X or it can be downloaded for free from http://www.perl.com/download.csp

If you want to install it on your own computer, TextWrangler™ is available at http://www.barebones.com/products/textwrangler/download.shtml
TextWrangler is a freeware programming environment provided by Bare Bones Software. TextWrangler™ is a “lighter” version of BBEdit®, the industry standard Perl programming environment for Mac OS X.



If you want to learn more Perl, there are several books available as well as several online courses. Perl is an open source language, and many scripts and modules can be freely found and downloaded from the web. One good online tutorial that I have seen is available at http://bioportal.weizmann.ac.il/course/prog/




The english.txt dictionary and 7mer text that you used in class can be downloaded from
http://cs.wheatonma.edu/mleblanc/dna/




The full text of “On the Origin of Species” (and other works) can be downloaded for free from Project Gutenberg at http://www.gutenberg.org/etext/2009
There are >20,000 free books for download from Project Gutenberg. Project Gutenberg is the first and largest single collection of free electronic books, or eBooks. It was founded by Michael Hart, who invented eBooks in 1971. http://www.gutenberg.org/wiki/Main_Page



Whole genomes, collections of genes, protein coding regions, upstream regulatory regions, etc. can be downloaded from many sources. For this lab, the DNA coding sequence file and the protein file for E. coli strain K12 were downloaded from http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi (The Institute for Genomic Research, Comprehensive Microbial Resource). C. elegans genomes can be downloaded from biomart http://www.biomart.org/index.html if you want to play with larger genomes.



NeuroPred is a web application designed to predict cleavage sites at basic amino acid locations in neuropeptide precursor sequences. The user can study one amino acid sequence or multiple sequences simultaneously, selecting from several prediction models and optional, user-defined functions.
http://neuroproteomics.scs.uiuc.edu/cgi-bin/neuropred.py
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1538825



FASTA format

In bioinformatics, FASTA format is a text-based format for representing either nucleic acid sequences or protein sequences, in which nucleotides or amino acids are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Perl.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
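The description above translates fairly directly into a few lines of Perl. The sketch below is one possible reading of it and is not the code used by omic_search.pl: the helper name parse_fasta, the record fields (id, description, sequence), and the two short sample records are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn FASTA text into a list of records, each a
# hash reference with the keys id, description, and sequence.
sub parse_fasta {
    my ($text) = @_;
    my (@records, $current);
    for my $line (split /\n/, $text) {
        if ($line =~ /^>(\S*)\s*(.*)$/) {
            # A ">" in the first column starts a new sequence.
            $current = { id => $1, description => $2, sequence => '' };
            push @records, $current;
        }
        elsif (defined $current) {
            # Any other line is sequence data for the current record.
            $line =~ s/\s+//g;
            $current->{sequence} .= $line;
        }
    }
    return @records;
}

# Made-up sample data: two short records, the first spread over two lines.
my $fasta = ">gene1 hypothetical protein\nATGGCG\nTTAA\n>gene2\nATGCCC\n";

for my $record (parse_fasta($fasta)) {
    print "$record->{id} [$record->{description}] $record->{sequence}\n";
}
```

Joining the sequence lines of each record into one string matters for pattern matching, because a regex applied line by line would miss any match that straddles a line break.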