emboss/srs

Oct 4, 2013


ID   X03006; SV 1; linear; mRNA; STD; MAM; 620 BP.
XX
AC   X03006;
XX
SV   X03006.1
XX
DT   28-JAN-1986 (Rel. 08, Created)
DT   12-SEP-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Bovine mRNA for lens beta-s-crystallin
XX
KW   beta-crystallin; beta-gamma-crystallin; crystallin.
XX
OS   Bos taurus (cow)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
OC   Bovinae; Bos.
XX
RN   [1]
RP   1-620
RX   PUBMED; 4054100.
RA   Quax-Jeuken Y.E.F.M., Driessen H., Leunissen J., Quax W.J., de Jong W.,
RA   Bloemendal H.;
RT   "Beta-s-crystallin: structure and evolution of a distinct member of the
RT   beta-gamma-superfamily";
RL   EMBO J. 4(10):2597-2602(1985).
XX
CC   Data kindly reviewed (06-MAR-1986) by Y. Quax-Jeuken
XX
...
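To make the flatfile layout concrete, here is a minimal Perl sketch of how such an entry could be split by its two-letter line codes before indexing. It is not the SRS parser; the sub name `parse_embl_lines` is a hypothetical helper, and the sample text is a few lines copied from the entry above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SRS or EMBOSS): group the lines of one
# EMBL-style entry by their two-letter line code.
sub parse_embl_lines {
    my ($text) = @_;
    my %fields;
    for my $line (split /\n/, $text) {
        # Keep only lines of the form "CODE   value"; bare XX spacers are skipped.
        next unless $line =~ /^([A-Z]{2})\s+(\S.*)$/;
        next if $1 eq 'XX';
        push @{ $fields{$1} }, $2;
    }
    return \%fields;
}

# A few lines taken from the entry above.
my $entry = <<'END';
ID   X03006; SV 1; linear; mRNA; STD; MAM; 620 BP.
XX
AC   X03006;
XX
DE   Bovine mRNA for lens beta-s-crystallin
XX
OS   Bos taurus (cow)
END

my $fields = parse_embl_lines($entry);
print "Accession:   $fields->{AC}[0]\n";
print "Description: $fields->{DE}[0]\n";
print "Organism:    $fields->{OS}[0]\n";
```

A real index would also split multi-valued lines (KW, OC) on ";" and remember the file offset of each entry; that is left out here.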

[Figure: indexing and retrieval of a flatfile databank. Index: EMBL flatfile, parser, index. Retrieve: index, parser, display entries.]

SRS

Sequence Retrieval System: an indexing and retrieval system for flat file databases


http://srs.bioinformatics.nl


http://srs.ebi.ac.uk

Q: Which sequences in EMBL [do not] encode a protein for which the 3D structure is known?

Command line SRS

Using getz

Retrieve the UniProt entry for the protein with accession number P19558:

getz "[uniprot-acc:P19558]" -e


Count the human proteins in the UniProt database:

getz "[uniprot-org:human]" -c



Print sequence of the rice proteins in the UniProt database that have a length between 10 and 50 aa:

getz "[uniprot-org:rice]&[uniprot-sl#10:50]" -f sl







Give the id and description for all A. thaliana proteins that have at least 8 transmembrane domains:

getz '[swissprot-org:arabidopsis thaliana]<([swissprot-CountedItem:transmem]&[swissprot-CountedN#8:])' -f "id des"




Count the human protein sequences in the NCBI RefSeq database:

getz "[refseqp-org:human]" -c


Count the human mRNA sequences in the NCBI RefSeq database:

getz "[refseq-org:human]&[refseq-mol:mrna]" -c



Retrieve the mRNA sequences for all human proteins in the NCBI RefSeq database in fasta format:

getz "[refseqp-org:human]>[refseq-mol:mrna]" -d -sf fasta
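The queries above all follow the pattern "[databank-field:value]", joined with "&" when several conditions must hold. As a minimal sketch, a Perl script could assemble such query strings before handing them to getz. The helper names `srs_term` and `srs_and` are hypothetical; the code only builds and prints the command and does not run getz.

```perl
use strict;
use warnings;

# Hypothetical helpers: they only build query strings in the
# "[databank-field:value]" style used by the getz examples above.
sub srs_term {
    my ($db, $field, $value) = @_;
    return "[${db}-${field}:${value}]";
}

# Combine terms with the "&" (AND) operator seen in the examples.
sub srs_and {
    return join '&', @_;
}

my $query = srs_and(
    srs_term('refseq', 'org', 'human'),
    srs_term('refseq', 'mol', 'mrna'),
);

# Print the command line instead of executing it.
print qq{getz "$query" -c\n};
```

Range conditions such as "sl#10:50" and the link operators "<" and ">" use a different syntax and are not covered by this sketch.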






MRS: A fast and compact retrieval system for biological data. Hekkelman M.L., Vriend G.
http://mrs.cmbi.ru.nl/


EMBOSS

"European Molecular Biology Open Software Suite"

http://emboss.sourceforge.net/



Toolbox with bioinformatics applications





http://emboss.bioinformatics.nl/

http://main.g2.bx.psu.edu/

command line / shell

Useful EMBOSS commands

command      description
showdb       Displays information on the currently available databases
wossname     Finds programs by keywords in their one-line documentation
tfm          Displays the manual entry of an EMBOSS program
seealso      Finds programs related to a given program
seqret       Reads and writes (returns) sequences
entret       Reads and writes (returns) flatfile entries
extractfeat  Extracts features from a sequence
extractseq   Extracts regions from a sequence
transeq      Translates nucleic acid sequences

Get help from EMBOSS itself

# showdb
Shows the currently available databases

# tfm wossname
How to use an EMBOSS command? Just (r)tfm it

# wossname alignment
Which commands can handle alignments?

# seealso seqret
Are there other commands that do something similar?

Command line options

All EMBOSS programs react to a number of command line options. The most important ones are:

-help            Get help
-help -verbose   Get elaborate help
-auto            “no questions asked”
-stdout          Write to standard output
-filter          Read stdin, write stdout



SEQRET parameters

zonnebloem> seqret -help
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop

   General qualifiers:
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose


SEQRET parameters

zonnebloem> seqret -help -verbose
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory



   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
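The qualifiers listed in the help output can be combined on one command line. The following Perl sketch assembles (but does not run) a seqret call that reads a sub-range of an entry and writes it in a chosen format. The sub name `seqret_command` and the example values are illustrative assumptions; only qualifiers shown in the help output above are used.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a seqret argument list from a USA and
# optional begin/end/format settings.
sub seqret_command {
    my (%opt) = @_;
    my @cmd = ('seqret', '-auto', '-stdout');
    push @cmd, '-sbegin1',   $opt{begin}  if defined $opt{begin};
    push @cmd, '-send1',     $opt{end}    if defined $opt{end};
    push @cmd, '-osformat2', $opt{format} if defined $opt{format};
    push @cmd, '-sequence',  $opt{usa};
    return @cmd;
}

my @cmd = seqret_command(
    usa    => 'embl:X13776',
    begin  => 10,
    end    => 50,
    format => 'fasta',
);
print join(' ', @cmd), "\n";

# To actually run it (needs a local EMBOSS installation with an "embl"
# database configured), one could use:
#   system(@cmd) == 0 or die "seqret failed: $?";
```

Passing the command as a list to system avoids shell quoting problems when a USA contains characters such as "*" or "[".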

Universal Sequence Address

Type                          Example           Description
filename                      xxx.seq           A sequence file "xxx.seq" in any format
format::filename              fasta::xxx.seq    A sequence file "xxx.seq" in fasta format
db:IDname                     embl:paamir       EMBL entry PAAMIR, using whatever access method is defined
                                                locally for the EMBL database
db:AccessionNumber            embl:X13776       EMBL entry X13776, using whatever access method is defined
                                                locally for the EMBL database and searching by accession
                                                number and entry name (X13776 is the accession number in
                                                this case)
db-acc:AccessionNumber        embl-acc:X13776   EMBL entry X13776, using whatever access method is defined
                                                locally for the EMBL database and searching by accession
                                                number only
db-id:IDname                  embl-id:paamir    EMBL entry PAAMIR, using whatever access method is defined
                                                locally for the EMBL database, and searching by ID only
db-searchfield:word           embl-des:lectin   EMBL entries containing the word 'lectin' in the
                                                Description line
db-searchfield:wildcard-word  embl-org:*human*  EMBL entries containing the wildcarded word 'human' in the
                                                Organism fields
db:wildcard-ID                embl:paami*       EMBL entries PAAMIB, PAAMIE and so on, usually in
                                                alphabetical order, using whatever access method is defined
                                                locally for the EMBL database

Type                    Example                       Description
db or db:*              embl or EMBL:*                All sequences in the EMBL database
@listfile               @mylist                       Reads file mylist and uses each line as a separate USA.
                                                      List files can contain references to other list files
                                                      or any other standard USA.
list:listfile           list:mylist                   Same as "@mylist" above
'program parameters |'  'getz -e [embl-id:paamir] |'  The pipe character "|" causes EMBOSS to fire up getz
                                                      (the SRS sequence retrieval program) to extract entry
                                                      PAAMIR from EMBL in EMBL format. Any application or
                                                      script which writes one or more sequences to stdout
                                                      can be used in this way.
asis::sequence          asis::atacgcagttatctgaccat    So far the shortest USA we could invent. In 'asis'
                                                      format the name is the sequence so no file needs to be
                                                      opened. This is a special case. It was intended as a
                                                      joke, but could be quite useful for generating command
                                                      lines.

Each of the above can have '[start : end]' or '[start : end : r]' appended to them.

The 'file' and 'dbname' forms of USA can have 'format::' in front of them (although a database knows which format it is and so this is redundant and error-prone)
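The optional parts of a USA described above (a "format::" prefix and a "[start:end]" or "[start:end:r]" suffix) can be peeled off with two regular expressions. This is a minimal Perl sketch under that reading of the table, not the parser EMBOSS itself uses; `split_usa` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper, not EMBOSS's own parser: split a USA string into an
# optional "format::" prefix, the main part, and an optional range suffix.
sub split_usa {
    my ($usa) = @_;
    my %part = (format => undef, start => undef, end => undef, reverse => 0);

    # Optional range suffix such as [10:50] or [10:50:r].
    if ($usa =~ s/\[\s*(\d+)\s*:\s*(\d+)\s*(?::\s*(r))?\s*\]$//) {
        @part{qw(start end)} = ($1, $2);
        $part{reverse} = defined $3 ? 1 : 0;
    }

    # Optional format prefix such as fasta::
    if ($usa =~ s/^(\w+):://) {
        $part{format} = $1;
    }

    $part{rest} = $usa;
    return \%part;
}

for my $usa ('fasta::xxx.seq', 'embl:X13776[10:50:r]', 'embl-des:lectin') {
    my $p = split_usa($usa);
    my $range = defined $p->{start}
        ? "$p->{start}..$p->{end}" . ($p->{reverse} ? ' (reversed)' : '')
        : '-';
    printf "%-22s format=%s rest=%s range=%s\n",
        $usa, $p->{format} // '-', $p->{rest}, $range;
}
```

The remaining part ("rest") is either a file name or a database query such as "embl-des:lectin"; telling those apart needs knowledge of the locally defined databases, which showdb reports.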

Walk through exercise

For a protein with UniProt accession number Q5ZKN6, find the nucleotide sequence that encodes this (repeated) amino acid fragment:

VAEEVAEE


Getting the sequence

seqret -auto uniprot:Q5ZKN6 -stdout

>Q5ZKN6_CHICK Q5ZKN6 SubName: Full=Putative uncharacterized protein;
MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK
ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR
LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES
SHDSASGESTLESGKTEAALSEISAQEPKKITYSQIVREGRRFNIDLVSKLLYSRGLLIE
LLIKSNVSRYAEFKNATRILAFREGKVEQVPCSRADVFNSRQLAMVEKRMLMKFLTFCLE
YEQHPDEYQDYKNSTFAQFLKTRKLTPSLQHFILHSIAMVSEKDCNTLEGLQATRKFLQC
LGRYGNTPFLFPLYGQGEIPQCFCRMCAVFGGIYCLRHSVQCLVVDKESGRCKAVVDHFG
QRISANYFIVEDSYLSESVCENVCYRQLSRAVLITDQSVLKTDSEQQVSILMVPPVDLGQ
PAVCVIELCSSTMTCMKDTYLVHLTCPSTKTAREDLEPVVQKLFSLNAETEKETEDEVLE
KPRVLWALYFNMRDSSGIDRNSYSGLPSNVYVCSGPDSALGNDCAVKQAETIFQEMFPTE
EFCPAPPNPEDIIYDEDEIASEETGFNNSPETKPESSLQESSSRGSSTAVKEHIEE


Getting the sequence

seqret -auto uniprot:Q5ZKN6 -stdout

The fragment VAEEVAEE is found in the third sequence line of the output:

LPAYGAS VAEEVAEE PEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES


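Run a program within Perl: 3 ways

# 1. Backticks: run the program and capture its output
$seq = `seqret -auto uniprot:Q5ZKN6 -stdout`;

# 2. system: run the program; its output is not captured
system("seqret -auto uniprot:Q5ZKN6 -stdout");

# 3. open with a trailing pipe: read the output line by line
open SEQRET, "seqret -auto uniprot:Q5ZKN6 -stdout |" or die "Cannot run seqret: $!";
while (my $line = <SEQRET>) {
    if ($line !~ /^>/) {      # skip the FASTA header line
        chomp($line);
        $seq .= $line;
    }
}
close SEQRET;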
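The three approaches above differ mainly in where the program's output ends up. The following sketch makes that visible; it uses "echo" as a stand-in for seqret (an assumption, so that it runs without EMBOSS installed), and the variable names are illustrative.

```perl
use strict;
use warnings;

# "echo" stands in for seqret so the sketch runs without EMBOSS installed.
my $cmd = 'echo MADNLPSEFD';

# 1. Backticks: capture everything the command prints.
my $captured = `$cmd`;
chomp $captured;
print "backticks captured: $captured\n";

# 2. system: the output goes straight to the terminal; only the exit
#    status comes back to Perl.
my $status = system($cmd);
print "system exit status: ", $status >> 8, "\n";

# 3. open with a trailing pipe: read the output line by line.
open(my $pipe, "$cmd |") or die "Cannot run '$cmd': $!";
my $lines = 0;
while (my $line = <$pipe>) {
    chomp $line;
    $lines++;
}
close $pipe;
print "read $lines line(s) through the pipe\n";
```

Backticks are convenient for short output, while the pipe form avoids holding a large result in memory and lets each line be filtered as it arrives.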

my $lsOutput = `ls -l`;
Put shell commands or programs in backticks to run them from Perl. The output can be stored in a variable.

open LS, "ls -l |";
The open function can run a program and read its output. The pipe symbol "|" links the output to a filehandle.

Find the fragment’s position

my $seq = "";
open SEQRET, "seqret -auto uniprot:Q5ZKN6 -stdout |" or die "Cannot run seqret: $!";
while (my $line = <SEQRET>) {
    if ($line !~ /^>/) {      # skip the FASTA header line
        chomp($line);
        $seq .= $line;
    }
}
close SEQRET;

# look for location of the repeat (index is 0-based, so add 1)
my $position = index($seq, "VAEEVAEE") + 1;

# print the offset
print "Position = ", $position, "\n";

!~ is the opposite of =~ : it gives true if the search found no hits.
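The position lookup can be tried without EMBOSS by pasting in the first three sequence lines shown earlier. In this sketch the sub name `fragment_position` is a hypothetical helper; unlike the one-liner above, it also reports when the fragment is absent (index returns -1 in that case, which the "+ 1" would otherwise silently turn into 0).

```perl
use strict;
use warnings;

# First three lines of the Q5ZKN6 sequence shown earlier, joined into one string.
my $seq = join '',
    'MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK',
    'ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR',
    'LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES';

# index returns a 0-based offset, or -1 when the fragment is absent.
sub fragment_position {
    my ($sequence, $fragment) = @_;
    my $offset = index($sequence, $fragment);
    return $offset < 0 ? undef : $offset + 1;   # convert to 1-based
}

my $position = fragment_position($seq, 'VAEEVAEE');
print defined $position ? "Position = $position\n" : "Fragment not found\n";
# prints "Position = 128"
```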

Get a cross-reference to EMBL

entret uniprot:Q5ZKN6 -auto -stdout | grep "DR "

Get the feature table of this protein entry










Understand the cross-reference

DR   EMBL; AJ720048; CAG31707.1; -; mRNA.

DR           Database cross reference
EMBL         Link to EMBL
AJ720048     EMBL accession number
CAG31707.1   Protein ID (the corresponding cross reference in EMBL)
-            Status identifier
mRNA         Molecule type

Read the detailed documentation of UniProt cross references:
http://www.expasy.org/sprot/userman.html#DR_line

Get a cross-reference to EMBL

entret uniprot:Q5ZKN6 -auto -stdout | grep "DR " | grep "EMBL;"

In Perl, use a regular expression to locate the EMBL reference line, and extract the EMBL accession number and the protein ID
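One possible regular expression for this step is sketched below, applied to the DR line shown earlier. The sub name `parse_embl_dr` is hypothetical; in a full pipeline the lines would come from the entret output rather than from a literal string.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the accession number and protein ID out of a
# UniProt "DR   EMBL; ..." line. Returns an empty list for other lines.
sub parse_embl_dr {
    my ($line) = @_;
    if ($line =~ /^DR\s+EMBL;\s*([^;\s]+);\s*([^;\s]+);/) {
        return ($1, $2);
    }
    return;
}

my $line = 'DR   EMBL; AJ720048; CAG31707.1; -; mRNA.';
my ($accession, $protein_id) = parse_embl_dr($line);
print "EMBL accession: $accession\n";
print "Protein ID:     $protein_id\n";
```

Anchoring the pattern on "DR" followed by "EMBL;" replaces the two grep calls, and the two capture groups pick up the second and third ";"-separated fields.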






Link protein to coding DNA

extractfeat embl:AJ720048 -value CAG31707.1 -stdout

Returns the DNA coding for protein CAG31707.1 (= Q5ZKN6)











Figure out the offset in DNA

Offset in amino acid sequence: 128

Offset in corresponding nucleotide sequence: ((128 - 1) x 3) + 1 OR (128 x 3) - 2 = 382

Position is from 382 to (382 + 8 x 3) = 406

Figure out the position of its corresponding coding DNA sequence (is there anything wrong here?)











Extract the DNA sequence

extractfeat embl:AJ720048 -value CAG31707.1 -stdout | extractseq -filter -reg "382-406"

Now we have the DNA sequence corresponding to the protein fragment.

It should be: “gttgctgaggaggttgctgaagaac”

But is that correct? Let's translate it for verification…









Verify the result

extractfeat embl:AJ720048 -value CAG31707.1 -stdout | extractseq -filter -reg "382-406" | transeq -filter

The result is “VAEEVAEEX”, not “VAEEVAEE”.

What’s wrong here?

Always try to verify your results: computers make very few errors, but that is not true for people...
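One reading of the discrepancy is an off-by-one in the end position: 8 amino acids correspond to 24 nucleotides, so an inclusive range starting at 382 ends at 405, while 382 to 406 covers 25 nucleotides and leaves a single base of an incomplete codon. The sketch below works through that arithmetic in Perl. The sub names are hypothetical, and the codon table deliberately holds only the four codons that occur in this fragment; it is not a full genetic code.

```perl
use strict;
use warnings;

# Map a 1-based amino acid position and fragment length to the 1-based,
# inclusive nucleotide range within the coding sequence.
sub aa_to_nt_range {
    my ($aa_pos, $aa_len) = @_;
    my $start = ($aa_pos - 1) * 3 + 1;
    my $end   = $start + $aa_len * 3 - 1;   # inclusive end, hence the -1
    return ($start, $end);
}

# Only the four codons that occur in the fragment; not a full genetic code.
my %codon = (gtt => 'V', gct => 'A', gag => 'E', gaa => 'E');

# Translate codon by codon; unknown or incomplete codons become X.
sub translate {
    my ($dna) = @_;
    my $protein = '';
    for (my $i = 0; $i < length($dna); $i += 3) {
        my $triplet = lc substr($dna, $i, 3);
        $protein .= $codon{$triplet} // 'X';
    }
    return $protein;
}

my ($start, $end) = aa_to_nt_range(128, 8);
print "Range: $start-$end\n";                    # Range: 382-405

my $dna_25 = 'gttgctgaggaggttgctgaagaac';        # region 382-406 from the slide
print translate($dna_25), "\n";                  # VAEEVAEEX (trailing partial codon)
print translate(substr($dna_25, 0, 24)), "\n";   # VAEEVAEE  (region 382-405)
```

With the end position corrected to 405, the extractseq region would be "382-405" and the translation should match the original fragment.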




Exercise

Build a pipeline in Perl to perform the previous steps of the walkthrough (from slide 34)

Test it with the UniProt protein A0L7N9

Find the fragment at offset 305 that is 8 aa long

Find out the coding DNA of this amino acid fragment and verify it