Pæano – A pipeline for the annotation and expression analysis of ESTs from non-model organisms
A user guide
Compiled by Helen Collett, with assistance from Nicolas Immelman
CONTENTS
Introduction
How to access the system (mancala.cbio)
Some basic Linux for getting around
  Directory structure and syntax
  Some useful commands
  Some tips
Setting your environment variables
Getting your chromatogram files ready
  The NERC naming convention and how to distinguish between 5’ and 3’ sequences
  Copying your files across to mancala
  Renaming your sequences according to the NERC naming scheme
Trace2dbest
  What trace2dbest does
  Input files
  Running trace2dbest
  Output files
PartiGene
  What PartiGene does
  Input files
  Running PartiGene
    The preliminaries – making a PartiGene directory
    The preliminaries – reverse complementing 3’ sequences
    Running PartiGene from est_solutions/Xerophyta_humilis/PartiGene
    PartiGene Clustering
    PartiGene cluster assembly
    PartiGene blasting
    Creating HTML tables in PartiGene
    Constructing relational databases in PartiGene
  Output files
Prot4EST
  What Prot4EST does
  Input files
  Running Prot4EST
  Output files
Annot8er_Blast2GO
  What Annot8er_Blast2GO does
  Input files
  Running Annot8er_BLAST2_GO
  Output files
PostgreSQL
InterProScan
  What InterProScan does
  Input files
  Running InterProScan
  Output files
OrthoLog
  What OrthoLog does
  Input files
  Running OrthoLog
  Output files
GEOExpress
  What GEOExpress does
  Input files
  Running GEOExpress
  Output files
How to set up and update your WebPartiGene results
  Pæano web interface (WebPartiGene)
  How to update your WebPartiGene results
  Note that the results from InterProScan, OrthoLog, and GEOExpress will automatically be updated on your wwwPartiGene page.
References
  User guides
Acknowledgements
  Thank you to Arthur Shen for trialling the system.
Troubleshooting/Getting help
Introduction
Welcome to Pæano – a pipeline for the annotation and expression analysis of ESTs from non-model organisms.
The system makes use of publicly available software, accessed from http://zeldia.cap.ed.ac.uk/PartiGene/index.html, to which we have introduced some extra features and modifications.
The Pæano suite of programs for sequence processing and basic annotation (phases 1-3) includes the following (briefly summarised):
Trace2dbest – which converts raw sequence data (standard chromatogram format (scf) or .abd files, referred to as ‘traces’) to text files (in FASTA format) that are ready for submission to GenBank dbEST. In the process it removes vector sequence, adapter sequence, poly(A) tails, etc.
PartiGene – which:
o clusters redundant and/or overlapping sequences.
o assembles these sequences into a contig.
o allows you to BLAST the contigs to assign a gene identity/best match.
In addition, PartiGene has facilities for creating SQL relational databases and HTML tables (so that you can view your results via a web interface, generated by WebPartiGene).
Prot4EST – which translates EST nucleotide sequences.
Annot8er_blast2_GO – which allows you to assign Gene Ontology (GO) terms to your EST sequences on the basis of sequence homology.
While Pæano phases 1-3 involve basic sequence annotation, phases 4-6 specialise in mapping the EST sequence data for the non-model organism of interest (e.g. Xerophyta humilis) to the genomic, transcriptomic and proteomic data for relevant model organisms (e.g. Arabidopsis and rice). Phases 4-6 therefore constitute secondary-level annotations for functional and comparative genomics, and are outlined below. (Again, they make use of publicly available tools and databases.)
The Pæano suite of programs (phases 4-6) includes the following (briefly summarised):
InterProScan – which allows you to identify protein domains/signatures for functional classification of your protein sequences.
OrthoLog – which identifies putative/candidate gene orthologues in relevant model organisms.
GEOExpress – which captures expression data for putative gene orthologues.
Pæano is installed on a computer housed at the medical campus: mancala.cbio.uct.ac.za (which you can access via PuTTY once you have a login; see ‘How to access the system (mancala.cbio)’). Pæano processing is therefore done remotely. The operating system is Linux and most of the programs are scripted in Perl.
User guides (PDFs) for the individual programs are available at /usr/local/paeano/doc/userguide_supplements (still to be made available).
How to access the system (mancala.cbio)
First you need to apply for a user account. Contact Nicky Mulder (email: nmulder@science.uct.ac.za) or Cathal Seioghe (cathal@science.uct.ac.za) for authorisation, then apply to the systems manager, Rodger Duffet (email: rodger@curie.uct.ac.za, phone: 406 6375), for your user account.
Once you have a user ID, you can access the system via PuTTY. (Go to http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html to download putty.exe.)
To log in, the host name is mancala.cbio.uct.ac.za. (You can save this as the default setting so that you needn’t type it in each time.) For protocol, select SSH-2 only.
While you’re at it, also install winscp376 (available from http://winscp.net/eng/download.php) on your desktop. This will allow you to transfer files from your computer to the remote computer. You can also view the mancala directory structure via winscp376 (RHS screen).
Some basic Linux for getting around
Directory structure and syntax:
If you are unfamiliar with the Linux directory structure, first get a Windows-like view via winscp376 (RHS screen). (Here, the directories are the equivalent of Windows folders.) Click on the top folder (the one with the arrow) until you contract the directories as far as you can. The usr and home directories are the two that you’ll be accessing.
Open up usr/local/paeano (have a look to see what’s in there; most of the Pæano software is stored here).
Open up usr/home/yourname (this is your home directory, which you will work from).
Then in PuTTY, on mancala, you will see: yourname@mancala:~>
You are in your home directory by default.
To view the directories within your home directory, type ls.
To go to the directory public_html, type cd public_html and press Enter. (You need only type the first letter/few letters of the directory and then press the Tab key to automatically complete the directory name.)
To change to the paeano directory (in the local directory), type cd /usr/local/paeano (directories are separated by /; note that there is a space between the command ‘cd’ and the directory definition).
To see what’s in the paeano directory, and get a feel for the directory structure, type ls.
To practise moving around within and between directories:
o To go to the bin (binaries) directory in paeano, type cd bin, and then ls.
o To go to the blastdb directory in paeano, type cd .. (takes you up one directory notch), then cd blastdb.
Type ls. You’ll see that Linux has a colour system. Directories are shown in blue text, files in white, zipped files in red.
To return to your home directory, type cd or cd ~/ (~ can be substituted for your home directory); these are shortcuts. The long way would be cd /usr/home/yourname.
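The navigation rules above (absolute versus relative paths, .. for the parent directory, ~ for home) can be traced with a small Python sketch. This is only an illustration of how the paths combine, not how the shell is implemented: the function name cd and the default home directory /usr/home/yourname are assumptions made for the example, and symbolic links are ignored.

```python
import posixpath

def cd(current_dir, target, home="/usr/home/yourname"):
    """Return the directory that typing `cd target` would leave you in.

    Illustration only: `home` is a placeholder for your own home directory.
    """
    if target in ("", "~", "~/"):
        return home  # plain `cd` or `cd ~/` goes straight home
    if target.startswith("~/"):
        target = posixpath.join(home, target[2:])
    # An absolute target (leading /) replaces current_dir;
    # a relative target is resolved against current_dir.
    return posixpath.normpath(posixpath.join(current_dir, target))

here = cd("/usr/home/yourname", "/usr/local/paeano")
here = cd(here, "bin")      # now in /usr/local/paeano/bin
here = cd(here, "..")       # up one notch, back in /usr/local/paeano
here = cd(here, "blastdb")  # now in /usr/local/paeano/blastdb
print(here)
print(cd(here, "~/"))
```

If you lose track of where a sequence of cd commands has taken you on mancala itself, pwd gives the real answer.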
Some useful commands:
cd – change directory, e.g. cd public_html [takes you into the public_html directory]
o cd .. – will take you up one level, to the parent directory
o cd – will take you straight back (express route) to your home directory
pwd – print working directory [shows you which directory you’re in, if you lose track!]
mkdir – make directory, e.g. mkdir PartiGene [creates a directory PartiGene]
rm – remove, e.g. rm test.fsa will delete this file
rm -r – e.g. rm -r PartiGene will delete the directory PartiGene and its contents
mv – move or rename, e.g. mv test.fsa followed by the new name or the destination directory
ls – lists the files in the directory you are in
logout – to exit
who – for the curious, to see who’s logged in at the time
man – calls up the help manual
Some tips:
^c (Ctrl-C) – to escape (very useful if you get stuck in one of the Pæano programs!)
Remember you can use the Tab key to complete your directory or program name and avoid having to type it out in full each time.
Linux is case sensitive, i.e. it distinguishes between ‘a’ and ‘A’.
To edit (or view) files, type: vi filename. To exit the vi text editor (without saving any changes), type :q!
See ‘List of Commands for vi - An Unix Editor’ (to be stored in /usr/local/paeano/doc/userguide_supplements) for navigating in vi.
To view files, type: more filename or less filename (to exit from view, type :q).
You should now be able to get yourself around (even if a bit slowly at first). For more help with Linux, try the following site: xxx.
Setting your environment variables
Many of the tools used in Pæano require certain variables associated with the shell, which is the interface to the operating system. (You will be using the bash shell.) These variables tell the tools where to find certain files, etc. These environment variables are stored in a file called .profile.
As a first-time user of Pæano, you need to run the paeano_setup.sh script. Do this by going to the paeano/bin directory by typing cd /usr/local/paeano/bin. Once there, type: paeano_setup.sh. You should see something like this:
helen@mancala:~> paeano_setup.sh
Welcome to paeano @ mancala
This scripts sets up any files that you will need for the paeano pipeline
At a later stage you will need to upload your data to a SQL database
What database would you like to use?
If you are not sure, enter your login name (helen): helen
[Note: by defining your database name, you are keeping your own processed sequences separate, even if, for example, more than one of you are sequencing from the same species.]
Your data will be uploaded to helen. You may change this by running this script again at any time.
Copying required files to your home directory
Setup is complete. Please logout and back in again for changes to come into effect.
Forward any queries to immelman@science.uct.ac.za
helen@mancala:/usr/local/paeano/bin>
This setup script creates the files and adds the environment variables you need to run Pæano. In order for the environment variable settings to take effect, you need to log out of the system by typing logout, and then log in again.
Return to your home directory by typing: cd
(If this fails, you can copy the script to your home directory by typing: cp paeano_setup.sh /usr/home/helen)
For your interest, the following environment variables have to be set for Pæano:
BLASTDB=/usr/local/paeano/blastdb
BLASTMAT=/usr/local/blast/data
ESTSCANDIR=/usr/local/paeano/BTLib-2.0b
PGDATABASE=paeano
PGUSER=paeano
PHRED_PARAMETER_FILE=/usr/local/paeano/bin/phredpar.dat
PATH=/usr/local/jdk/bin:/usr/local/ant/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/jre/bin:/usr/local/mysql/bin:/usr/local/paeano/bin/:/usr/local/pgsql/bin:/usr/local/blast-2.2.9/bin/:/usr/sbin/:/usr/local/src/ncbitoolbox/ncbi/bin/:/usr/local/paeano/BTLib-2.0b/ESTScan/MkTables/
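After logging back in, it can be reassuring to confirm that the variables listed above are actually set. The following Python sketch is one way to do so; it is not part of Pæano, and the helper names missing_variables and path_contains are invented for this illustration. The variable names are taken from the list above.

```python
import os

# Variable names copied from the list in this guide.
REQUIRED_VARIABLES = [
    "BLASTDB",
    "BLASTMAT",
    "ESTSCANDIR",
    "PGDATABASE",
    "PGUSER",
    "PHRED_PARAMETER_FILE",
]

def missing_variables(environment, required=REQUIRED_VARIABLES):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environment.get(name)]

def path_contains(environment, directory):
    """Check whether `directory` is one of the colon-separated PATH entries."""
    entries = environment.get("PATH", "").split(":")
    # Trailing slashes are ignored, so /usr/local/paeano/bin/ matches
    # /usr/local/paeano/bin.
    return directory.rstrip("/") in [entry.rstrip("/") for entry in entries]

# Report on the environment of the current session.
print("Missing variables:", missing_variables(os.environ))
print("paeano/bin on PATH:", path_contains(os.environ, "/usr/local/paeano/bin"))
```

If anything is reported missing, the guide's own remedy applies: run paeano_setup.sh again, then log out and back in.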
Getting your chromatogram files ready
The NERC naming convention and how to distinguish between 5’ and 3’ sequences:
Trace2dbest uses a controlled naming convention for sequences. For the Pæano system we have opted for the NERC naming scheme, which consists of three tags separated by underscores and looks like this: Xh_RD_01B04, dissected as follows:
The first tag consists of two characters derived from the species name, e.g. Xh for Xerophyta humilis.
The second tag is 3-5 letters long and indicates the library from which the EST was derived, e.g. RD for Root Dehydration library. If you are sequencing the ESTs in both the forward (5’) and reverse (3’) direction, you can distinguish between the orientations by including ‘f’ and ‘r’ in the second tag, e.g. Xh_RDf_01B04 and Xh_RDr_01B04.
The third tag indicates the co-ordinates for the microtitre plate in which the EST clone is stored, e.g. 01B04 for plate 01, row B, column 04. (If your clones are not in microtitre plates, substitute ‘Z’ for the row and apply your own numbering system.)
It is best to name your clones according to the NERC convention when you send them for sequencing. If, however, you already have sequences that do not follow the naming scheme, or you prefer to use a less clumsy Sequence Identifier (e.g. HC1) that you can then map back to the EST Identifier, you can either:
rename them manually (if you have only a few). You will also need to remove the .abd extensions.
rename them using a Pæano script by following the instructions below (see ‘Renaming your sequences according to the NERC naming scheme’).
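The three-tag structure described above can be checked mechanically. The Python sketch below splits a name into its tags; it is an illustration only and is not the check that trace2dbest itself performs. The regular expression is an assumed reading of the convention: in particular it accepts library tags of 2-5 letters so that the guide's own example (RD) passes, and the function name parse_nerc_name is invented here.

```python
import re

# Assumed pattern: two-letter species tag, 2-5 letter library tag,
# then plate (two digits), row (one capital letter), column (two digits).
NERC_PATTERN = re.compile(r"^([A-Za-z]{2})_([A-Za-z]{2,5})_(\d{2})([A-Z])(\d{2})$")

def parse_nerc_name(name):
    """Split a NERC-style name into its parts, or return None if it does not fit."""
    match = NERC_PATTERN.match(name)
    if match is None:
        return None
    species, library, plate, row, column = match.groups()
    # A trailing lowercase f or r in the library tag marks the orientation.
    orientation = {"f": "5'", "r": "3'"}.get(library[-1])
    return {
        "species": species,
        "library": library,
        "orientation": orientation,
        "plate": plate,
        "row": row,
        "column": column,
    }

for example in ["Xh_RD_01B04", "Xh_RDr_01B04", "HC1"]:
    print(example, "->", parse_nerc_name(example))
```

A name such as HC1 returns None, which is the situation the renaming scripts in the next sections are meant to fix.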
Copying your files across to mancala:
In your home directory, make a directory called e.g. traces (by typing mkdir traces – it is usually convenient to keep directory names in lower case). (If you will be processing the sequences in batches, you might also like to make e.g. a Batch1 directory within traces.)
Via WinSCP, copy your sequence chromatograms across to the traces directory.
(Alternatively, you can also create the directories via WinSCP, by clicking on the RHS panel and then selecting ‘Files’ followed by ‘Create a Directory’ from the pop-up menu.)
Renaming your sequences according to the NERC naming scheme:
If your sequences don’t follow the naming convention, you can use a renaming script. There are two options in Pæano. Use:
paeano_rename if your sequences were named according to a Sequence Identifier (e.g. HC1) rather than a 96-well co-ordinate. For this option, you have to provide a mapping file in comma-delimited (csv) format consisting of two columns: 1) the Sequence Identifier (e.g. HC1) and 2) the corresponding clone ID or EST ID (e.g. XH_LD_04F05).
rename_file.pl if your sequences were named according to a 96-well co-ordinate.
Option 1
Run the paeano_rename script as follows:
Create a directory called Rename.
Copy into it your mapping file and the files you want to rename.
Run the script paeano_rename in the Rename directory by typing: paeano_rename mapping_filename.csv (NB. You have to provide the mapping file!)
You should get something like this:
Renaming HC29_-_M13_F rerun.abd to Xh_LD_27D12
Renaming HC56_-_M13_F rerun.abd to Xh_LD_04C06
Renaming HC25_-_M13_F rerun.abd to Xh_RD_27C05
Renaming HC23_-_M13_F rerun.abd to Xh_RD_27B12
Renaming HC50_-_M13_F rerun.abd to Xh_LR_01B05
Renaming HC21-_M13_F rerun.abd to Xh_RD_27B07
Renaming HC22_-_M13_ rerun.abd to Xh_RD_27B11
If there are any problems with any of the matches, you will get something like this:
Could not find match for HC20_-_M13_F rerun.abd in name mapping file.
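To make the role of the mapping file concrete, here is a minimal Python sketch of the lookup it enables. It is not the paeano_rename script and makes no claim about how that script works internally: the function names are invented, and the rule that the Sequence Identifier is everything before the first underscore or hyphen in the file name is an assumption based on the example output above. It only plans the renames and does not touch any files.

```python
import csv
import io
import re

def load_mapping(csv_text):
    """Read 'Sequence Identifier,EST ID' rows into a dictionary."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and row[0].strip():
            mapping[row[0].strip()] = row[1].strip()
    return mapping

def plan_renames(file_names, mapping):
    """Pair each file name with its new name, or None when nothing matches."""
    plan = []
    for file_name in file_names:
        # Assumption: the identifier is the text before the first "_" or "-".
        identifier = re.split(r"[_-]", file_name, maxsplit=1)[0]
        plan.append((file_name, mapping.get(identifier)))
    return plan

mapping = load_mapping("HC29,Xh_LD_27D12\nHC56,Xh_LD_04C06\n")
for old_name, new_name in plan_renames(
    ["HC29_-_M13_F rerun.abd", "HC20_-_M13_F rerun.abd"], mapping
):
    if new_name is None:
        print("No match for", old_name)
    else:
        print("Would rename", old_name, "to", new_name)
```

Checking a mapping file this way before running the real script can reveal identifiers that are missing from it.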
OR
Option 2
Run the rename_file.pl script as follows:
Create a directory called Rename.
Copy into it the files you want to rename.
Exit to your home directory.
Run the script by typing: rename_file.pl -dir Rename -txt text you want to remove -sub text you want to add
e.g. for conversion of a set of sequences named ‘LRxxXxx-RM13’ (e.g. LR01G12-RM13) to ‘Xh_LR_xxXxx’ (e.g. Xh_LR_01G12), type:
rename_file.pl -dir Rename -txt LR -sub Xh_LR
and then
rename_file.pl -dir Rename -txt -RM13
For help, type rename_file.pl -help
Then, in preparation for processing your sequences, create a directory called e.g. data in your home directory (and a subdirectory if necessary, e.g. testset) and move your renamed sequences there. To do this, in the Rename directory, type e.g. mv Xh* ~/data/testset.
You’re now ready to process your sequences via Trace2dbest.
Trace2dbest
What trace2dbest does:
Trace2dbest performs basic processing of your EST sequences, using the raw data chromatograms (traces) as the input. To submit EST sequences to GenBank, you have to submit the sequences in a specified format consisting of four files:
1. A Library (Lib) file – giving details of EST origins (organism, source, etc.).
2. A Contact (Cont) file – giving your contact details.
3. A Publish (Pub) file – giving information about the associated publication.
[Once you have entered information for 1-3 for your EST sequencing project, you will not have to repeat these entries for the same sequencing project, i.e. you can access a saved file for subsequent trace2dbest runs.]
4. An EST file – containing all your EST sequences in FASTA format.
[For each batch of sequences you process via trace2dbest, you will generate an EST file. Trace2dbest gives you the option to submit your EST sequences as soon as you have processed them.]
In processing your sequences to generate the EST file, trace2dbest makes use of the following programs/software:
Phred – for base-calling (assigns a quality score to each base).
Cross-match – to identify and trim away vector sequence and poly-A tails.
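As a rough picture of what trimming a poly-A tail means, the Python sketch below removes a trailing run of A's once it reaches a minimum length. This is a deliberately simplified illustration and not the procedure trace2dbest or cross_match actually uses; the function name is invented, and the default of 8 simply mirrors the default offered by the poly(A) prompt shown later in the trace2dbest session.

```python
def trim_poly_a_tail(sequence, min_tail_length=8):
    """Remove a trailing run of A's if it is at least min_tail_length bases long.

    Simplified illustration only; real tools also tolerate sequencing errors
    within the tail.
    """
    stripped = sequence.rstrip("Aa")
    tail_length = len(sequence) - len(stripped)
    if tail_length >= min_tail_length:
        return stripped
    return sequence  # tail too short to be treated as a poly-A tail

with_tail = "GATTACAGGC" + "A" * 12
print(trim_poly_a_tail(with_tail))       # tail of 12 is removed
print(trim_poly_a_tail("GATTACAGGCAA"))  # tail of 2 is left alone
```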
Input files:
The input files are your chromatogram files, named according to the NERC convention (see ‘The NERC naming convention and how to distinguish between 5’ and 3’ sequences’).
Test data for processing (96 earthworm sequences) are available at /usr/local/paeano/testdata (copy them to your own directory).
Running trace2dbest:
Run this from your home directory.
Type trace2dbest.pl, e.g.
helen@mancala:~> trace2dbest.pl
If you get a warning, ignore it.
You should see the following; then follow the prompts.
####################################################
###                                              ###
###          trace2dbest Version 2.1             ###
###     trace file processing and dbEST          ###
###         sequence submission tool             ###
###                                              ###
####################################################
Section 1 - Lib, Cont, Pub and EST file information
Library file
Each batch of EST submissions to dbEST must have an associated Library file.
Please choose one of the following options:
1 - Library file already submitted
2 - Enter information for a new Library file now
Or use a saved file...
3 - Xerophyta_humilis
Don't fancy one of those? Enter h for help or q to quit.
Library file
As a first-time user, select ‘2’ and enter as prompted [if you are using the X. humilis library, select 3]:
Information for new Library file
Please answer the following questions about the Library used to generate the ESTs you wish to submit.
What is the name of the library? Xerophyta_humilis (use underscore here)
What is the name of the organism? Xerophyta humilis (use space, don’t use underscore)
In a similar manner, select option 2 to generate the Contact file and Publication file and enter information as prompted. (You will only have to do this once for each sequencing project):
Contact file
Each batch of EST submissions to dbEST must have an associated Contact file.
Please choose one of the following options:
1 - Contact file already submitted
2 - Enter information for a new Contact file now
Publication file
Each batch of EST submissions to dbEST must have an associated Publication file.
Please choose one of the following options:
1 - Publication file already submitted
2 - Enter information for a new Publication file now
Processing your sequences to generate the EST file (example entries given):
EST File
Each sequence submitted to dbEST must have an associated EST file.
Please choose one of the following options for the creation of this file:
1 - Enter information for a new EST file now
Or use a saved file...
2 - Xerophyta humilis Helen
3 - 400+R
4 - 400+F
Don't fancy one of those? Enter h for help or q to quit.
Options 3 and 4 are for M13forward and M13reverse, respectively, i.e. select 3 for 3’ X. humilis sequences and 4 for 5’ X. humilis sequences. Or create your own file by selecting 1:
EST File
Each sequence submitted to dbEST must have an associated EST file.
Please choose one of the following options for the creation of this file:
1 - Enter information for a new EST file now
1
What was the sequencing primer used? [enter the primer name, optionally followed by the sequence in brackets()].
E.g. SAC(GGGAACAAAAGCTGGAG): M13F
What was the forward PCR primer? NA
What was the reverse PCR primer? NA
Which end was sequenced (5'/3')? If neither, just hit return: 3'
Please enter the date you would like your data to be made public. This date will be inserted into the PUBLIC field of the submission file.
Enter the date in the format MM/DD/YYYY. For immediate release, just press enter.
Please enter a comment about the ESTs for inclusion in the "COMMENT" field of the EST file.
Please check the data you have entered:
SEQ_PRIMER: M13F
PCR_F: NA
PCR_B: NA
P_END: 3'
PUBLIC:
COMMENT:
Are you happy to continue? (Y/N): y
Would you like to save this file for future use?(y/n):y
Please enter a name for this file: UG test
Are you happy to continue? (Y/N): File saved to /usr/local/paeano/db/ESTfile.db
Section 2 - trace2dbest processing information
The following information is required to allow trace2dbest to process your traces efficiently.
Adapter
If you would like trace2dbest to trim off an adapter sequence (and everything upstream of it), please enter the adapter sequence here or hit return to continue: GAATTCGGCACGAGG
(This is the adapter sequence for the X. humilis library; it is determined by the library kit used.)
Vector file
The default location for the vector.seq file is /usr/local/paeano/db/vector.seq
Hit return to use this file or enter a different path here:
Would you like to see a list of all the vector sequences in /usr/local/paeano/db/vector.seq? (y/n): y
The following 35 vectors were found in /usr/local/paeano/db/vector.seq
>bluescript
>lorist2
>lorist6
>loristb
>pbs
>pjb8
>pphc79
>prs313
>prs423
>prs424
>pwe15
>pyac4
>scos
>m13mp18
>m13mp19
>mg3_left
>mg3_right
>puc118
>puc18
>pSC
>pcDNA3.1 V5-His-Topo
>pBeloBAC11
>pBACe3.6
>PSPORTI
>pSCREEN-1b(+) from Novagen
>pDNR_LIB
>pCR4-TOPO Invitrogen TA cloning vector
>pGEM(R)T-easy Promega cloning vector
>pGEMT.fasta Promega TA cloning vector
>pBLUESCRIPT SK+
>pCMV-PCR
>gi|1017801|gb|U37573.1|XXU37573 Shuttle expression vector pBKCMV
> NAME = pTriplEx2_seq.fa : TYPE = DNA
>
>pCR2.1TOPOTA
Are you satisfied that the relevant vector is here? (y/n): y
E.coli sequence information
Do you want to screen for E.coli sequence in your ESTs? (y/n): n
Trace file naming scheme
In order for your traces to be processed, the file names must follow one of these schemes:
1 - NERC Environmental Genomics scheme
2 - STRESSGENES scheme
Please enter the appropriate number: 1
Trace file directory
Please enter the full path of the directory containing the trace files to be processed e.g. /usr/home/helen/traces
(Test data for processing (96 earthworm sequences) are available at /usr/local/paeano/testdata (copy to local).)
There are 2 files that match this naming scheme in /usr/home/helen/data/HC356andHC357
Is this correct? (y/n): y
Section 3 - trace2dbest parameters
You now have the opportunity to set the various parameters that control how the traces will be processed.
trace2dbest has default values for all the parameters, to use these defaults enter 's' (skip), otherwise hit return to alter parameters:
For each of the parameters, enter the value you wish to use or hit return to use the default shown in brakets().
phred
Number of high quality bases required in sequence (150):
cross_match
cross_match for vector sequence - minmatch (10):
minmatch value of 10 selected
minscore (20):
minscore value of 20 selected
Poly(A) tail
Enter number bases in poly(A) tail (8):
Spliced leader 1
If you wish to trim the nematode spliced leader sequence, type 'yes', or
If you do not wish to trim any spliced leader sequence just hit return:
Section 4 - Annotation of sequences
Would you like to add BLAST-based preliminary annotation to the sequences?
Note: this will slow the process considerably (y/n): n
Section 5 - Trace processing
Running phred (basecalling from raw traces) - please wait...Done
Running cross_match (screening for vector sequence) - please wait...Done
Section 6 - Sequence processing
Creating submission files...
Done
Statistics
2 Traces processed
2 (100%) 'Good quality' traces
2 (100%) Submissable sequences after trimming
699 Average length of submissable sequences
475 Average number of high quality bases for submissable sequences
Section 7 - Submission and saving of files
The EST submission records have been merged into one file.
Would you like to view this completed submission file? [y/n] y
(This takes you into text editor mode to view the file. To exit, type :q!)
The EST submission file may now be sent by e-mail to NCBI dbEST.
Enter yes to send file to dbEST now, or any other key to continue without submitting (your file will be saved):
Ok, file not submitted to dbEST.
The EST submission file and the other output directories have been saved in the following location:
/usr/home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted
Trace2dbest has finished and will now exit. Bye.
Output files:
Trace2dbest creates a directory called est_solutions, and within this a directory with the name of your project (i.e. organism of interest, e.g. Xerophyta_humilis), and within this a directory called trace2dbest.
Once you have run trace2dbest and processed your sequences, you can look at the output:
Go to the trace2dbest directory created, i.e. type e.g. cd est_solutions/Xerophyta_humilis/trace2dbest
Type ls, and you should get a directory with a date stamp (according to the time of processing), e.g. 2005-03-11T09:43unsubmitted. To look inside, type cd 2005-03-11T09:43unsubmitted (remember that you can use the Tab key to avoid having to type out the whole thing).
Within this directory are a number of sub-directories. fastafiles contains your processed sequences; these files have a .fsa extension. (Theoretically, the PartiGene directory contains the input files for PartiGene, but you will be using the files from the fastafiles directory.)
15
Also
have a look at the
file:
dbEST_submission.txt
.
You are now ready to cluster your sequences and assemble the contigs via
PartiGene
.
PartiGene
What PartiGene does:
PartiGene clusters together overlapping or redundant sequences via CLOBB. It then assembles your clustered sequences into a consensus sequence or contig via Phrap. It also allows you to Blast the contig (either Blastn or Blastx) against selected databases to obtain a putative gene identity. We have included an option that allows you to reverse complement your sequences if they are in the reverse (i.e. 3’) direction. It works best to reverse complement 3’ sequences before clustering.
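Input files:
The .fsa input files for PartiGene are stored in the fastafiles sub-directory of your date-stamped trace2dbest run directory, e.g. ~/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted/fastafiles.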
Running PartiGene:
The preliminaries - making a PartiGene directory
Trace2dbest has made you an est_solutions directory, and within that a species or organism directory, e.g. Xerophyta_humilis, and within that a trace2dbest directory (see illustration of directory structure above). You now need to make a PartiGene directory within the e.g. Xerophyta_humilis directory, from which to run PartiGene. Do this as follows.
From your home directory, type cd est_solutions/Xerophyta_humilis.
Once there, type mkdir PartiGene, then cd PartiGene.
The preliminaries - copying the files to the PartiGene directory
You then need to copy the relevant files across from the trace2dbest to the PartiGene directory. Do this by typing:
copy_fastafiles.pl
You will get something like this:
Species Directory List
1) /home/helen/est_solutions/Lumbricus_rubellus
2) /home/helen/est_solutions/Xerophyta_humilis
Please select an option from 1 to 2:
First select according to organism/sequencing project, e.g. 2. Then select according to date stamp, i.e. batch of processed sequences, e.g. 1:
Run Directory List
1) /home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2004-11-10T14:27unsubmitted
You will then see something like this:
[Figure: directory structure. est_solutions contains the species directories (Xerophyta_humilis, Lumbricus_rubellus); a species directory contains the trace2dbest and PartiGene directories.]
Copying /home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted/fastafiles/Xh_LR_01G12.fsa to
/home/helen/est_solutions/Xerophyta_humilis/PartiGene/sequences...
Copying /home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted/fastafiles/Xh_LR_01H01.fsa to
/home/helen/est_solutions/Xerophyta_humilis/PartiGene/sequences...
Done.
helen@mancala:~/est_solutions/Xerophyta_humilis/PartiGene>
Once you have done this, type ls. You will see that you now have a temporary directory called sequences that contains your trace2dbest-processed sequences, ready for PartiGene processing. (You can have a look by typing cd sequences, then ls).
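For readers curious about what this copying step amounts to, here is a minimal Python sketch. It assumes that copy_fastafiles.pl simply copies every .fsa file from the chosen run's fastafiles directory into PartiGene/sequences; it is not the actual script. The function names copy_fasta_files and demo, and the run directory name run1, are made up for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_fasta_files(run_dir: Path, partigene_dir: Path) -> list[str]:
    """Copy every .fsa file from run_dir/fastafiles to partigene_dir/sequences."""
    source = run_dir / "fastafiles"
    target = partigene_dir / "sequences"
    target.mkdir(parents=True, exist_ok=True)  # create 'sequences' if missing
    copied = []
    for fasta in sorted(source.glob("*.fsa")):
        shutil.copy2(fasta, target / fasta.name)
        copied.append(fasta.name)
    return copied


def demo() -> list[str]:
    """Build a throwaway directory tree and copy two dummy files across."""
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp) / "trace2dbest" / "run1"  # hypothetical run directory
        (run_dir / "fastafiles").mkdir(parents=True)
        for name in ("Xh_LR_01G12.fsa", "Xh_LR_01H01.fsa"):
            (run_dir / "fastafiles" / name).write_text(">" + name[:-4] + "\nACGT\n")
        return copy_fasta_files(run_dir, Path(tmp) / "PartiGene")
```

Calling demo() works entirely inside a temporary directory, so it can be tried without touching your est_solutions tree.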
The preliminaries - reverse complementing 3’ sequences
For any sequences that you have in the reverse (3’) direction, it works best to reverse complement them at this stage. Do this as follows:
Change into the sequences directory.
If all your sequences are in the 3’ direction, type:
revcomp -w *.fsa
to reverse complement (revcomp) and overwrite (-w) all the sequences in the sequences directory.
If you have sequences in both directions, you will have to have identified the sequence direction during naming, e.g. Xh_RDf_01B04 and Xh_RDr_01B04 to distinguish between 5’ and 3’ sequences, respectively. You can then selectively reverse complement the 3’ sequences using wildcards as follows:
revcomp -w *_*r_*.fsa
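To make the reverse complement step concrete, the following Python sketch shows what the operation does to a sequence and how the wildcard *_*r_*.fsa picks out only the 3’ files. It illustrates the idea and is not the revcomp program itself; the function names are hypothetical, and the sketch handles only the bases A, C, G, T and N.

```python
from fnmatch import fnmatchcase

# Complement table for DNA bases, including N for unknown bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Complement every base, then reverse the order of the bases."""
    return sequence.translate(_COMPLEMENT)[::-1]


def select_reverse_reads(filenames, pattern="*_*r_*.fsa"):
    """Return the file names matching the case-sensitive wildcard pattern."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


print(reverse_complement("ATGCCN"))
print(select_reverse_reads(["Xh_RDf_01B04.fsa", "Xh_RDr_01B04.fsa"]))
```

The match is case-sensitive, as it is in the shell, which is why the lowercase r in the name matters: Xh_RDr_01B04.fsa is selected, whereas Xh_RDf_01B04.fsa is not.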
Running PartiGene from est_solutions/Xerophyta_humilis/PartiGene
In PartiGene, type PartiGene_v2.2.pl
(e.g. helen@mancala:~/est_solutions/Xerophyta_humilis/PartiGene> PartiGene_v2.2.pl)
You will then get this:
################################################################
###                                                          ###
###  PartiGene - a script to convert individual sequences    ###
###  (typically ESTs) into a Partial Genome. Vs 2.2.0        ###
###                                                          ###
###  Ralf Schmid and colleagues for the EGTDC 2004           ###
###                                                          ###
###  News, upgrades and help: nematode.bioinf@ed.ac.uk       ###
###  Help for EG-Awardees: helpdesk@envgen.nox.ac.uk         ###
###                                                          ###
################################################################
Enter the number corresponding to the part of the PartiGene
process you want to perform:
1. Download sequences from EBI for analysis.
2. Pre-process sequences.
3. Cluster sequences.
4. Assemble clusters.
5. Perform BLASTs.
6. Create HTML tables of results.
7. Construct relational database of results.
8. Quit.
PartiGene Clustering
Select Option 3 to cluster:
##### SEQUENCE CLUSTERING #####
This software clusters datasets of EST and other sequences
such that each cluster represents one putative gene.
The sequences need to be available as individual fasta files
in a directory called 'sequences'.
The PartiGene sequence download process does this automatically.
The cluster process uses a program called CLOBB to do the
clustering. CLOBB allows new sequences to be added to old
clusters while maintaining previous cluster identities
For more information on CLOBB see :
Parkinson J, Guiliano DB, Blaxter M.
Making sense of EST sequences by CLOBBing them.
BMC Bioinformatics 2002 3(1):31.
Enter the three letter cluster ID you would like to use
(typically this is the first letters of the genus and species
followed by 'C' eg. for Zeldia punctata you might use ZPC).
Type the cluster ID, e.g. XHC, and you will then get something like this:
0 sequences remaining
Clustering done - now splitting cluster file into individual
cluster files. This will create files in directory 'Clus'
Creating individual cluster files, please wait...
SUMMARY OF CLUSTERING FOR XHC
=============================================================
Number of sequences = 414
Total number of clusters = 385
Number of clusters with 1 member = 367
Number of clusters with >1 member
(derived from 47 sequences) = 18
=============================================================
This data and additional information on each cluster,
has been saved in the file:
OUT/CLOBB_XHC_03-18-05+11:56.txt
Would you like to continue with the PartiGene process?
Y
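The figures in the clustering summary can be cross-checked with a little arithmetic: singletons plus sequences in multi-member clusters should add up to the number of sequences, and singleton clusters plus multi-member clusters should add up to the total number of clusters. The Python sketch below does this for the example summary above. The function name and the "redundancy" measure (the percentage of sequences that collapse into an existing cluster) are illustrative choices, not something PartiGene reports.

```python
def summarise_clustering(n_sequences: int, n_singletons: int,
                         n_multi_clusters: int, n_multi_sequences: int) -> dict:
    """Cross-check a clustering summary and derive a few simple figures."""
    if n_singletons + n_multi_sequences != n_sequences:
        raise ValueError("sequence counts do not add up")
    total_clusters = n_singletons + n_multi_clusters
    return {
        "total_clusters": total_clusters,
        # Average number of sequences in a cluster with more than one member.
        "mean_multi_cluster_size": n_multi_sequences / n_multi_clusters,
        # Percentage of sequences that did not found a new cluster.
        "redundancy_percent": 100.0 * (n_sequences - total_clusters) / n_sequences,
    }


# Values taken from the example summary for XHC.
summary = summarise_clustering(414, 367, 18, 47)
print(summary)
```

For the example run this gives 385 clusters, which agrees with the total reported by PartiGene.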
PartiGene cluster assembly
Back in main menu, select Option 4 to assemble clusters:
##### CLUSTER ASSEMBLY #####
The sequences grouped in clusters can be 'assembled' to yield a
consensus sequence. PartiGene uses a program called 'phrap' for this
(written by Phil Green and colleagues; see http://www.phrap.org/).
You have previously used XHC as the cluster identifier
Is this correct ?
[y/n] :
Y
7 Clusters will be assembled or updated
Before assembling the clusters the sequences need to pre-processed
phrap uses quality information from the sequencing chromatographs,
if this is available.
You have three options :
1) Attempt to use original quality files for all clusters
2) Attempt to use original quality files only for clusters containing 2 sequences
3) Skip preprocessing (if quality files are unavailable).
NB. phrap can generate multiple consensuses from large clusters.
We have found that option 2 reduces this less-than useful feature
Please select 1,2 or 3 :
3
(NB select option 3 - selecting option 1 or 2 is likely to generate concatamerised contigs).
0 clusters remaining
Assembly process finished.
Now creating input files for the protein prediction pipeline prot4EST.
A report on the phrap based assembly process has been saved in the file:
OUT/phrap_XHC_03-18-05+12:13.txt
Would you like to continue with the PartiGene process?
y
PartiGene blasting
Back in main menu, select Option 5 to blast:
Consensus sequences needed for BLAST searches within PartiGene
have been processed and stored in the directory 'blast'.
A concatenated file which can be used as input file for BLAST
searches outside of PartiGene has been saved as 'blast_input_XHC.txt'
in the main directory.
Would you like to continue by BLASTing against your custom databases ?
[y/n] :
y
### Available protein databases ###
1 ATH1.nr.pep 2 kyva 3 nr
### Available nucleotide databases ###
1 ATH1.nr.cds 2 ego8_112404.seq 3 est.00
4 est.01 5 est.02 6 est.03
7 GEO_FASTA.txt
Please enter the number of the database you would like to blast and
the type of blast you would like to perform separated by a comma.
For example '1, blastn' followed by 'q' for finish would perform a
single BLASTN against nucleotide database 1.
If you would like to specify several databases, separate the numbers
by using '+'. E.g. entering 1+2+4, blastx, would perform a single BLASTX
search for each sequence against the combined protein databases 1,2 and 4
Upto 5 different BLASTs are allowed - enter 'q' to finish
If you would like to run more than one BLAST enter e.g '1 , blastx' RETURNKEY
'2 , blastx' RETURNKEY '4, blastx' RETURNKEY followed by 'q' to finish.
This would run three BLASTs (one against each of the protein databases 1,2 and 4)
PartiGene has the facility for you to do Blastn or Blastx searches. (Note that Blastx searches are the most informative for EST annotation.) To run a Blastx search, select nr from the Available protein databases by typing:
3, blastx
(NB. If the Pæano system is populated with more databases, the number associated with nr may change; select the number matching nr)
q
(don’t forget to type q)
You have selected the following blasts -
/usr/local/blast-2.2.9/bin//blastall -p blastx -d /usr/local/paeano/blastdb/nr ....
Is this correct ?
[y/n] :
y
A Blastx run takes about 1 min per sequence. You will see the % progress. When the run is finished, you will get:
100% Completed
Would you like to continue with the PartiGene process?
[y/n] :
y
OK: Back to main menu
Note that you can do both types of blast (Blastx and Blastn) in the same session, and you can run a maximum of 5 BLASTs in one session. For example, you can do a blastx run against nr and a blastn run against each of three est databases as follows (use the numbers shown in your own lists of available databases):
3, blastx
4, blastn
5, blastn
6, blastn
q
You’ll then get something like this:
You have selected the following blasts -
/usr/local/blast-2.2.9/bin//blastall -p blastn -d /usr/local/paeano/blastdb/nt.01 ....
/usr/local/blast-2.2.9/bin//blastall -p blastn -d /usr/local/paeano/blastdb/nt.02 ....
/usr/local/blast-2.2.9/bin//blastall -p blastn -d /usr/local/paeano/blastdb/nt.03 ....
Is this correct ?
y
OR, to make the search quicker, blastn search the databases collectively, rather than individually, by typing:
6+7+8, blastn
q
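The entry format described in the prompt (database numbers joined by '+', a comma, then the BLAST type, with 'q' to finish and at most 5 BLASTs) can be summarised in a short Python sketch. This only illustrates the syntax; it is not how PartiGene itself reads your input, and the names parse_selection and parse_session are hypothetical.

```python
VALID_PROGRAMS = {"blastn", "blastx"}
MAX_BLASTS = 5  # limit stated in the PartiGene prompt


def parse_selection(line: str) -> tuple[list[int], str]:
    """Parse one entry such as '1+2+4, blastx' into ([1, 2, 4], 'blastx')."""
    numbers, _, program = line.partition(",")
    program = program.strip().lower()
    if program not in VALID_PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program!r}")
    # Databases joined with '+' are searched together in a single BLAST.
    databases = [int(part) for part in numbers.split("+")]
    return databases, program


def parse_session(lines: list[str]) -> list[tuple[list[int], str]]:
    """Collect entries until 'q', allowing at most MAX_BLASTS searches."""
    selections = []
    for line in lines:
        if line.strip().lower() == "q":
            break
        selections.append(parse_selection(line))
    if len(selections) > MAX_BLASTS:
        raise ValueError(f"at most {MAX_BLASTS} BLASTs are allowed")
    return selections


print(parse_session(["3, blastx", "6+7+8, blastn", "q"]))
```

Note how '6+7+8, blastn' counts as one BLAST against three combined databases, whereas three separate lines would count as three BLASTs towards the limit of 5.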
Creating HTML tables in PartiGene
Back in main menu, select option 6 to create HTML tables. Running this sub-menu, you will get something like this:
##### Creating HTML Tables #####
This facility creates a series of HTML format results files.
This is recommended only for smaller datasets (<1000 sequences).
Do you want to continue?
[y/n] :
y
First select the BLASTs that you would like to include from the following list
1 nr
2 nt
Please enter a comma separated list of numbers from the list above:
1
You have selected the following BLASTs :
nr
Please wait while the tables are being generated
100% Completed
Tables have been generated - if you are running this program remotely,
you will need to copy the following directories and contents into
a web accessible directory (typically "public_html" in your home directory -
ask your system administrator for further details). Directories to copy are :
html, blast and phrap
If you are running a local copy of this program you could view the results now ?
[y/n] :n
To view the results open up a web browser and open the file :
/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/html/Results.html
Would you like to continue with the PartiGene process?
[y/n] :
y
NB: To be able to view your results via the Pæano web interface when you open up a web browser, you need to copy the directories html, blast and phrap to your public_html directory. You can either exit PartiGene and do it now, or do it once you’ve completed Option 7. Do this as follows:
Type cp -r html/ ~/public_html. Repeat for blast and phrap.
Open up a web browser and go to http://mancala.cbio.uct.ac.za/~yourname/html/Results.html to view your results!
Constructing relational databases in PartiGene
Back in main menu, select option 7 to construct relational databases of your results. Note that for a group of users all sequencing from the same organism, you can share a database, although it is safer initially for each user to define their own database. Pæano is/will be set up so that, provided all users working on the same species use the same cluster ID, e.g. XHC for X. humilis (see below), cluster IDs will be unique even if there are redundant sequences between users. Also note that you need to revisit this option to upload your downstream Prot4EST results. (If you prefer, you can run this option once all your processing is completed.)
Running this sub-menu, you will get something like this:
##### Databasing #####
This facility offers the ability to hold your data in a
SQL database using the public domain databasing software
postgreSQL. PostgreSQL is typically packaged with many Linux
distributions and is also freely available from :
http://www.postgresql.org/
In order to use this databasing facility, you will need to
ensure that postgres is running and that you have permissions
to create new databases - see the website above for more details.
You have already defined a database - paeano would you like to use it ?
[y/n] :
y
(See below if you select n)
Enter three letter cluster ID you have previously defined : XHC
Cluster entries already exist for this cluster ID - Update the db ?
[y/n] :
y
If you define a new database (i.e. select n as an option above), you will get something like this:
Please enter the name of the database you would like to create
arthur
Use arthur ? :
[y/n] :y
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "cluster_pkey" for table "cluster"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "est_pkey" for table "est"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "clone_name_pkey" for table "clone_name"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "p4e_ind_pkey" for table "p4e_ind"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "p4e_hsp_pkey" for table "p4e_hsp"
Use this as your default database in future ?
[y/n] :y
Enter three letter cluster ID you have previously defined : XHC
Inserting cluster entries
Inserting sequence entries
Do you want to add or update clone name entries ?
(Please note, this feature works only for ESTs downloaded from EBI in option 1
which use the EGTDC naming scheme)
[y/n] :n
Inserting BLAST info
Please choose the BLAST results to include in the database :
Include ATH1 ?
[y/n] :n
Do you want to insert results from prot4EST into arthur?
[y/n] :n
Would you like to continue with the PartiGene process?
[y/n] :
To view your paeano relational database results, see ‘PostgreSQL’ p. 34.
Output files:
PartiGene generates, amongst others, the following files:
Cluster results are in PartiGene/Clus.
o Singletons.fsa is a fasta format file of all unclustered sequences.
o Individual files, e.g. XHC00251, group together in fasta format those sequences that cluster.
Contig assembly results are in PartiGene/phrap. XHC00251.contigs (e.g.) presents the consensus sequence in fasta format.
Blast results for individual clusters, e.g. XHC00251.out, are in PartiGene/blast/nr.
Prot4EST
What Prot4EST does:
Prot4EST translates your EST sequences, attempting to compensate for the fact that ESTs often represent partial mRNAs and that their sequence quality usually isn’t perfect (sequence errors tend to disrupt the open reading frame (ORF)). Prot4EST aims for the best possible polypeptide prediction via a number of similarity-based and de novo prediction methods. The tiered or hierarchical system is executed as follows:
1. A Blastn search is performed against a database of ribosomal RNA (rRNA) sequences to identify and filter out rRNA sequences from your EST-translatable set.
2. A Blastx search is performed against proteins encoded by mitochondrial genomes. Sequences identified as mitochondrial-encoded genes are translated using the appropriate mitochondrial genetic code. (For plant sequences we plan to make available a search against chloroplast genomes also).
3. Nuclear-encoded sequences are translated according to Blastx matches. Those ESTs for which there is no significant sequence similarity are processed further via de novo prediction methods.
4. ESTScan uses Hidden Markov models (HMM), which can predict coding sequence in a probabilistic manner. The HMM must be built using a complete set of CDS entries from e.g. EMBL for your species of interest, or a related species. ESTScan can compensate for sequence errors and can distinguish between the UTRs and CDS of a cDNA. Sequences that fail to be translated by ESTScan are passed on to the next step.
5. DECODER predicts CDS using sequence quality information (i.e. the phrap quality files). Sequences that fail to be translated by DECODER are passed on to the next step.
6. As a last resort, a putative polypeptide translation is generated from the longest ORF in all 6 frames.
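The last-resort step (step 6) can be illustrated with a small Python sketch of a six-frame translation. It rests on several assumptions: the standard genetic code is used, an "ORF" is taken to mean the longest stretch without a stop codon (ESTs are often partial, so no start codon is required), and ties go to the first frame examined. Prot4EST's own rules may differ; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (table 1), codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons; codons with unknown bases become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_orf_six_frames(seq: str) -> tuple[int, str]:
    """Return (frame, peptide) for the longest stop-free stretch.

    Frames 1 to 3 are on the given strand, -1 to -3 on the reverse complement.
    """
    seq = seq.upper()
    reverse = seq.translate(_COMPLEMENT)[::-1]
    best = (0, "")
    for strand, strand_seq in ((1, seq), (-1, reverse)):
        for offset in range(3):
            protein = translate(strand_seq[offset:])
            for segment in protein.split("*"):  # '*' marks a stop codon
                if len(segment) > len(best[1]):
                    best = (strand * (offset + 1), segment)
    return best


print(longest_orf_six_frames("ATGAAATTTGGGTAA"))
```

For this toy sequence the longest stop-free stretch is found on the reverse strand (frame -1), which shows why all six frames have to be examined when the orientation of an EST is uncertain.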
Input files:
The input file for Prot4EST is prot4EST_input.fsa in your PartiGene directory. Decoder makes use of .seq and .qlt files in PartiGene/protein.
N.B. If you are using the pre-computed blast results option (see p. 25 below), which we recommend, you need to do the following:
o The cluster IDs in the prot4EST_input.fsa file are incompatible with the cluster IDs in the PartiGene blast results files (anomalously, the cluster IDs in the input file contain .1 extensions, e.g. XHC00001.1). To fix this, run the Prot4EST fix script in the PartiGene directory by typing: prot4est_input_fix prot4EST_input.fsa before you run Prot4EST.
o Also, anomalously, some of the cluster IDs in the blast result files contain a .Contig1 extension, e.g. XHC00001.Contig1. (Consequently, Prot4EST cannot recognise and use these particular blast reports for similarity-based translations). To fix this, run the following patch: pg_blast_fix (e.g.) XHC* from your PartiGene/blast/nr directory.
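To show the kind of change the first fix makes, here is a minimal Python sketch that removes a trailing .1-style extension from the IDs on FASTA header lines, so that XHC00001.1 becomes XHC00001. It is only an illustration of the ID mismatch described above. How prot4est_input_fix actually works is not documented here, and the function name strip_id_extension is hypothetical.

```python
import re

# Matches a FASTA header ID that ends in '.<digits>', e.g. '>XHC00001.1'.
_VERSIONED_ID = re.compile(r"^(>\S+?)\.\d+(\s|$)")


def strip_id_extension(fasta_text: str) -> str:
    """Remove a numeric extension from the ID on each FASTA header line."""
    fixed_lines = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Keep the ID and any description, drop only the '.<digits>' part.
            line = _VERSIONED_ID.sub(r"\1\2", line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines) + "\n"


example = ">XHC00001.1\nACGT\n>XHC00002\nGGCC\n"
print(strip_id_extension(example), end="")
```

Sequence lines and headers without an extension are left untouched, so running the function twice gives the same result as running it once.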
Running Prot4EST:
Make a directory called Prot4EST in e.g. est_solutions/Xerophyta_humilis. Run Prot4EST from your Prot4EST directory by typing prot4EST.pl. You’ll get this:
starting prot4EST checks...
Unable to find the pico text editor. To create the configuration file use
the example as a template in whatever text editor you prefer.
(We must still get Rodger to install pico)
##################################################
###                                            ###
###                  prot4EST                  ###
###                                            ###
###        version: 2.1.1 - January '05        ###
###                                            ###
### a script that converts EST sequence into   ###
### amino acid sequence taking frame shift,    ###
### substitutions et al into consideration.    ###
###                                            ###
##################################################
Please set up the config file:
1. Create a configuration file.
2. Use or Edit an existing configuration file.
3. Get Help.
4. Exit Program.
As a first-time user, select 1 to create the configuration file in which you specify paths to input files and customise analysis options. You will get this:
1
Now select OPTION 2 to load the configuration file
##################################################
###                                            ###
###                  prot4EST                  ###
###                                            ###
###        version: 2.1.1 - January '05        ###
###                                            ###
### a script that converts EST sequence into   ###
### amino acid sequence taking frame shift,    ###
### substitutions et al into consideration.    ###
###                                            ###
##################################################
Please set up the config file:
1. Create a configuration file.
2. Use or Edit an existing configuration file.
3. Get Help.
4. Exit Program.
4
At present, the pico editor is not installed and therefore you have to edit the config file using the vi text editor (which is a bit tricky to use). So instead of selecting option 2, select ‘4’ to exit and type: vi config.
helen@mancala:~/est_solutions/Xerophyta_humilis/Prot4EST> vi config
The configuration file template looks like this, with the custom or definable entries in bold:
Config file for PROT4EST created Mon Mar 7 10:23:39 SAST 2005
For help on any of these please consult the README file
#Full path to fasta input file, e.g. /home/joe/EST/rubellus.fsa
1. Input File [fasta format]:/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/prot4EST_input.fsa
#prot4EST will create this directory, e.g. 'output' will be created in the
#directory p4e is launched from
2. Output Directory: output
#e.g. Lumbricus rubellus
3. Organism Name (full): Xerophyta humilis
4. Location of genetic code file: /usr/local/paeano/db/gc.prt
#Fasta and BLAST files containing these sequences are included in the prot4EST release.
#Enter the full path.
5. Ribosomal RNA BLAST database: /usr/local/paeano/db/rRNA.fsa
6. Mitochondria BLAST database [protein]: /usr/local/paeano/db/mito_viridiplantae.fsa
#The defaults are shown
7. Evalue for rRNA search (BLASTN): 1e-65
8. Evalue for BLASTX: 1e-8
#If you have previous carried out BLASTx search on these sequences then enter the path to the report file
#or directory containing only these files
#If left blank then prot4EST assumes you wish to carry out a BLASTx search on these sequences
#You are advised to read the userguide regarding this option
9. Location of pre-computed BLASTX report files/directory: /home/helen/est_solutions/Xerophyta_humilis/PartiGene/blast/nr
#Fill in all entries for 9a-c OR just 9d (if DECODER has already been run on these sequences)
(This should read 10a-c, 10d)
#e.g. /home/joe/partigene/protein
10a. Path to sequence and quality files [protein directory]:/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/protein/
#defaults shown
10b. Suffix for EST sequence files: seq
10c. Suffix for EST quailty files: qlt
or
10d. Path to pre-computed DECODER results:
11. ESTScan Matrix File [optional]:
12. Codon Usage Table (gcg format) [optional]:
Fill in the entries for your config file, following the example above. Make use of the following to navigate in vi:
o To move the cursor around, use the arrow keys (or h, j, k and l for left, down, up and right).
o To insert text after the cursor, type a, then enter your text. Use the Esc key to exit text insertion mode.
o To delete the character under the cursor, type x.
o To save and exit vi, type ZZ or :wq.
Entries for 11 and 12 are optional. If you leave out the path for the pre-computed Blastx results (generated in PartiGene), Prot4EST will do a new Blastx search.
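If you want to double-check a config file before starting a run, the following Python sketch reads the "number. label: value" lines of the template shown above and skips the # comment lines. It is a reading aid under the assumption that each entry sits on a single line; it is not how Prot4EST itself parses the file, and the function name parse_config is hypothetical.

```python
import re

# An entry looks like '10b. Suffix for EST sequence files: seq'.
_ENTRY = re.compile(r"^(\d+[a-d]?)\.\s*([^:]*):\s*(.*)$")


def parse_config(text: str) -> dict[str, str]:
    """Map entry numbers ('1', '10a', ...) to their values; skip comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _ENTRY.match(line)
        if match:
            number, _label, value = match.groups()
            entries[number] = value.strip()
    return entries


sample = """#e.g. Lumbricus rubellus
3. Organism Name (full): Xerophyta humilis
8. Evalue for BLASTX: 1e-8
9. Location of pre-computed BLASTX report files/directory:
10b. Suffix for EST sequence files: seq
"""
entries = parse_config(sample)
# An empty entry 9 means that a new Blastx search will be run.
print("new Blastx search needed:", entries["9"] == "")
```

In this sample, entry 9 is left blank, so the check reports that a new Blastx search would be needed.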
Once you have entered the relevant information in the configuration file, run prot4EST.pl again and select Option 2 to generate the translations. You will be prompted to make some selections during the course of the run:
2
Please provide path to config file...(current directory: /usr/home/helen/est_solutions/Xerophyta_humilis/Prot4EST )
/usr/home/helen/est_solutions/Xerophyta_humilis/Prot4EST/config
Configuration file found
Would you like to Use or Edit this file?
[U/E]
U
Created prot4EST_050405_121845.log and prot4EST_050405_121845.errorlog
/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/prot4EST_input.fsa accepted and format verified
/usr/local/paeano/db/rRNA.fsa accepted and format verified
/usr/local/paeano/db/mito_viridiplantae.fsa accepted and format verified
checking sequence and quality files...
All components for DECODER located
Config file has been read and all variables accepted
Starting prot4EST
You need to choose which Genetic Codes to use for nuclear and mitochondrial translations
Select a nuclear genetic code [default=1]
1: Standard
2: Vertebrate Mitochondrial
3: Yeast Mitochondrial
4: Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
5: Invertebrate Mitochondrial
6: Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear
9: Echinoderm Mitochondrial
10: Euplotid Nuclear
11: Bacterial and Plant Plastid
12: Alternative Yeast Nuclear
13: Ascidian Mitochondrial
14: Flatworm Mitochondrial
15: Blepharisma Macronuclear
16: Chlorophycean Mitochondrial
21: Trematode Mitochondrial
22: Scenedesmus obliquus mitochondrial
23: Thraustochytrium mitochondrial code
1
[unless sequencing mitochondrial genes, select from options 1, 6, 10, 12, 15]
Select a mitochondrial genetic code [default=5]
2: Vertebrate Mitochondrial
3: Yeast Mitochondrial
4: Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
5: Invertebrate Mitochondrial
9: Echinoderm Mitochondrial
13: Ascidian Mitochondrial
14: Flatworm Mitochondrial
16: Chlorophycean Mitochondrial
21: Trematode Mitochondrial
22: Scenedesmus obliquus mitochondrial
23: Thraustochytrium mitochondrial code
16
[e.g. if you are working with sequences of plant origin]
You have selected nuclear genetic code: 1 and mitochodrial genetic code: 16
--12:25:54-- http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-e+-vn+2+[embl-Organism:Xerophyta%20humilis]%20&%20[embl-Description:complete]%20&%20[embl-Description:cds]%20&%20[embl-Molecule:RNA%20%7C%20mRNA]
=> `Xh_embl.db'
Resolving http-proxy.uct.ac.za... 137.158.128.106, 137.158.128.107, 137.158.128.105
Connecting to http-proxy.uct.ac.za[137.158.128.106]:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
[ <=> ] 9,092 781.10B/s
12:26:10 (780.95 B/s) - `Xh_embl.db' saved [9092]
Build ESTScan Tables for Xh_embl.conf
-------------------------------------
Current parameters:
- organism: Xerophyta humilis
- database files are: Xh_embl.db
- UniGene data is in:
- ESTs for testing:
- data directory: .
- mRNA file is: ./mrna.seq
- EST file is: ./ests.seq
- ESTs with coding: ./Evaluate/estcds.seq
- ESTs without coding: ./Evaluate/estutr.seq
- training file is: ./training.seq
- test file is: ./test.seq
- clean UTR file is: ./Evaluate/rnautr.seq
- clean CDS file is: ./Evaluate/rnacds.seq
- HMM paramters file: ./Matrices/6_00030_0000001_4242.smat
- tuple size: 6
- min redundancy mask: 30
- added pseudocounts: 1
- minimum score: -100
- start profile length/preroll: 4/2
- stop profile length/preroll: 4/2
- nb of isochores: 1
Extracting mRNA entries...
- processing EMBL-file Xh_embl.db...
found 3 sequences, 906 coding nucleotides
- overall found 3 sequences, 906 coding nucleotides
Analyzing GC contents...
- generating GC-content histogram...
read 3 sequences
- written datafile and gnuplot script
- computing isochore borders...
isochores used: 0-100
Split mRNA data into isochores...
3 sequences found in isochore 0-100
Masking redundancy from isochores...
- masking redundancy in isochore 0-100
masked 0 of 1734 nucleotides (0%)
Writing codon usage tables...
- computing for isochore 0-100...
Xh_embl.conf done.
### WARNING ###
Have found only 1734 coding nucleotides for Xerophyta humilis
We suggest that at least 150000 coding nucleotides are used.
Using fewer *may* compromise construction of ESTScan matrix.
These may be complemented with pseudo entries built by prot4EST from the similarity stages.
It is a NON-fatal issue. Do you wish to use these results and utilise ESTScan?
[Y/N]
N
[For non-model organisms, it is unlikely that there will be enough sequence data available for constructing an ESTScan matrix. You can build one for your species at a later stage, once you have generated enough sequence data. Alternatively, you can build a matrix for a related species, e.g. you could use a rice matrix for X. humilis sequences.]
(Still to include notes on building an ESTScan matrix.)
fetching a codon bias table for Xerophyta humilis.
this may take a few seconds
Searching, please wait...
### Warning! ###
Sorry, there are 0 matches for Xerophyta humilis.
Is the organism name correct?