Pæano

A pipeline for the annotation and expression analysis of ESTs from non-model organisms

A user guide





Compiled by Helen Collett, with assistance from Nicolas Immelman

















CONTENTS

Introduction
How to access the system (mancala.cbio)
Some basic Linux for getting around
  Directory structure and syntax:
  Some useful commands:
  Some tips:
Setting your environment variables
Getting your chromatogram files ready
  The NERC naming convention and how to distinguish between 5’ and 3’ sequences:
  Copying your files across to mancala:
  Renaming your sequences according to the NERC naming scheme:
Trace2dbest
  What trace2dbest does:
  Input files:
  Running trace2dbest:
  Output files:
PartiGene
  What PartiGene does:
  Input files:
  Running PartiGene:
  The preliminaries - making a PartiGene directory
  The preliminaries - reverse complementing 3’ sequences
  Running PartiGene from est_solutions/Xerophyta_humilis/PartiGene
  PartiGene Clustering
  PartiGene cluster assembly
  PartiGene blasting
  Creating HTML tables in PartiGene
  Constructing relational databases in PartiGene
  Output files:
Prot4EST
  What Prot4EST does:
  Input files:
  Running Prot4EST:
  Output files:
Annot8er_Blast2GO
  What Annot8er_Blast2GO does:
  Input files:
  Running Annot8er_BLAST2_GO:
  Output files:
PostgreSQL
InterProScan
  What InterProScan does:
  Input files:
  Running InterProScan:
  Output files:
OrthoLog
  What OrthoLog does:
  Input files:
  Running OrthoLog
  Output files:
GEOExpress
  What GEOExpress does:
  Input files:
  Running GEOExpress
  Output files:
How to set up and update your WebPartiGene results.
  Pæano web interface (WebPartiGene)
  How to update your WebPartiGene results
  Note that the results from InterProScan, OrthoLog, and GEOExpress will automatically be updated on your wwwPartiGene page.
References
  User guides
Acknowledgements
  Thank you to Arthur Shen for trialling the system.
Troubleshooting/Getting help




Introduction


Welcome to Pæano, a pipeline for the annotation and expression analysis of ESTs from non-model organisms.

The system makes use of publicly available software, accessed from http://zeldia.cap.ed.ac.uk/PartiGene/index.html, to which we have introduced some extra features and modifications.


The Pæano suite of programs for sequence processing and basic annotation (phases 1-3) includes the following, briefly summarised:

• Trace2dbest, which converts raw sequence data (standard chromatogram format (scf) or .abd files, referred to as ‘traces’) to text files (in FASTA format) that are ready for submission to GenBank dbEST. In the process it removes vector sequence, adapter sequence, poly-A tails, etc.
• PartiGene, which:
    o clusters redundant and/or overlapping sequences.
    o assembles these sequences into a contig.
    o allows you to blast the contigs to assign a gene identity/best match.
  In addition, PartiGene has facilities for creating SQL relational databases and HTML tables (so that you can view your results via a web interface, generated by WebPartiGene).
• Prot4EST, which translates EST nucleotide sequences.
• Annot8er_blast2_GO, which allows you to assign Gene Ontology (GO) terms to your EST sequences on the basis of sequence homology.


While Pæano phases 1-3 involve basic sequence annotation, phases 4-6 specialise in mapping the EST sequence data for the non-model organism of interest (e.g. Xerophyta humilis) to the genomic, transcriptomic and proteomic data for relevant model organisms (e.g. Arabidopsis and rice). Phases 4-6 therefore constitute secondary-level annotations for functional and comparative genomics, and are outlined below. (Again, they make use of publicly available tools and databases.)


The Pæano suite of programs (phases 4-6) includes the following, briefly summarised:

• InterProScan, which allows you to identify protein domains/signatures for functional classification of your protein sequences.
• OrthoLog, which identifies putative/candidate gene orthologues in relevant model organisms.
• GEOExpress, which captures expression data for putative gene orthologues.


Pæano is installed on a computer housed at the medical campus: mancala.cbio.uct.ac.za (which you can access via PuTTY once you have a login; see ‘How to access the system’ below). Pæano processing is therefore done remotely. The operating system is Linux and most of the programs are scripted in Perl.

User guides (PDFs) for the individual programs will be kept at /usr/local/paeano/doc/userguide_supplements (still to be made available).


How to access the system (mancala.cbio)


First you need to apply for a user account. Contact Nicky Mulder (email: nmulder@science.uct.ac.za) or Cathal Seoighe (cathal@science.uct.ac.za) for authorisation, then apply to the systems manager, Rodger Duffet (email: rodger@curie.uct.ac.za, phone: 406 6375) for your user account.

Once you have a user ID, you can access the system via PuTTY. (Go to http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html to download putty.exe.)

To log in, the host name is: mancala.cbio.uct.ac.za. (You can save this as the default setting so that you needn’t type it in each time.) For protocol, select SSH-2 only.

While you’re at it, also install winscp376 (available from http://winscp.net/eng/download.php) on your desktop. This will allow you to transfer files from your computer to the remote computer. You can also view the mancala directory structure via winscp376 (RHS screen).



Some basic Linux for getting around

Directory structure and syntax:

If you are unfamiliar with the Linux directory structure, first get a Windows-like view via winscp376 (RHS screen). (Here, the directories are the equivalent of Windows folders.) Click on the top folder (the one with the arrow) until you contract the directories as far as you can. The usr and home directories are the two that you’ll be accessing.

• Open up usr/local/paeano (have a look to see what’s in there; most of the Pæano software is stored here).
• Open up usr/(local)/home/yourname (this is your home directory, which you will work from).

Then in PuTTY, on mancala, you will see: yourname@mancala:~>

• You are in your home directory by default.
• To view the directories within your home directory, type ls.
• To go to the directory public_html, type cd public_html. (You need only type the first letter/few letters of the directory and then press the Tab key to automatically complete the directory name.)
• To change to the paeano directory (in the local directory), type cd /usr/local/paeano (directories are separated by /; note there is a space between the command ‘cd’ and the directory path).
• To see what’s in the paeano directory, and get a feel for the directory structure, type ls.
• To practise moving around within and between directories:
    o To go to the bin (binaries) directory in paeano, type cd bin, and then ls.
    o To go to the blastdb directory in paeano, type cd .. (takes you up one directory level), then cd blastdb. Type ls. You’ll see that Linux has a colour system. Directories are shown in blue text, files in white, zipped files in red.
• To return to your home directory, type cd or cd ~/ (~ can be substituted for your home directory); these are shortcuts. The long form would be cd /usr/home/yourname.
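If you would like to try these moves without touching the real directories, a minimal sketch along the following lines may help. It assumes mktemp is available, and it builds a throwaway tree whose directory names (paeano, bin, blastdb) merely mimic the ones above; nothing in it is part of Pæano itself.

```shell
# Build a throwaway directory tree so the commands can be tried safely.
sandbox=$(mktemp -d)            # hypothetical practice area, not a Pæano directory
mkdir -p "$sandbox/paeano/bin" "$sandbox/paeano/blastdb"

cd "$sandbox/paeano"            # full path: works from wherever you are
ls                              # lists the two directories: bin and blastdb
cd bin                          # relative path: down into bin
cd ..                           # up one level, back to paeano
cd blastdb                      # down into blastdb
here=$(pwd)                     # pwd reports the directory you are in
echo "$here"
cd                              # no argument: straight back to your home directory
```

The same sequence works on mancala if you start with cd /usr/local/paeano instead of the sandbox path.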


Some useful commands:

• cd   change directory e.g. cd public_html [takes you into the public_html directory]
    o cd .. will take you up to the parent directory
    o cd will take you straight back (express route) to your home directory
• pwd   print working directory [shows you which directory you’re in, if you lose track!]
• mkdir   make directory e.g. mkdir PartiGene [creates a directory PartiGene]
• rm   remove e.g. rm test.fsa will delete this file
• rm -r   e.g. rm -r PartiGene will delete the directory PartiGene and its contents
• mv   move or rename e.g. mv test.fsa followed by the new name or the destination directory
• ls   lists files in the directory you are in
• logout   to exit
• who   for the curious, to see who’s logged in at the time
• man   calls up the help manual
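The file-handling commands above can be practised safely in a scratch directory. The sketch below assumes mktemp is available; the file name renamed.fsa is made up for illustration.

```shell
work=$(mktemp -d)               # scratch directory, so nothing real is touched
cd "$work"

mkdir PartiGene                 # make a directory
echo ">seq1" > test.fsa         # create a small placeholder file
mv test.fsa renamed.fsa         # rename: old name first, new name second
mv renamed.fsa PartiGene/       # move: file first, destination directory second
ls PartiGene                    # shows renamed.fsa
rm PartiGene/renamed.fsa        # delete a single file
rm -r PartiGene                 # delete the directory and anything left in it
```

Note that rm and rm -r delete immediately and there is no recycle bin, so check with pwd and ls before removing anything.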


Some tips:

• ^c (Ctrl+C) to escape (very useful if you get stuck in one of the Pæano programs!)
• Remember you can use the Tab key to complete your directory or program name and avoid having to type it out in full each time.
• Linux is case sensitive i.e. it distinguishes between a and A.
• To edit (or view) files, type: vi filename. To exit the vi text editor (without saving any changes) type :q! See List of Commands for vi - An Unix Editor (to be stored in /usr/local/paeano/doc/userguide_supplements) for navigating in vi.
• To view files, type: more filename or less filename (to exit from the view, type q).

You should now be able to get yourself around (even if a bit slowly at first). For more help with Linux, try the following site: xxx.


Setting your environment variables

Many of the tools used in Pæano require certain variables to be set in the shell, which is the interface to the operating system. (You will be using the bash shell.) These variables tell the tools where to find certain files, etc. These environment variables are stored in a file called .profile. As a first-time user of Pæano, you need to run the paeano_setup.sh script. Do this by going to the /paeano/bin directory by typing cd /usr/local/paeano/bin. Once there, type: paeano_setup.sh. You should see something like this:


    helen@mancala:~> paeano_setup.sh
    Welcome to paeano @ mancala
    This scripts sets up any files that you will need for the paeano pipeline

    At a later stage you will need to upload your data to a SQL database
    What database would you like to use?
    If you are not sure, enter your login name (helen): helen

[Note: by defining your database name, you are keeping your own processed sequences separate, even if, for example, more than one of you are sequencing from the same species.]

    Your data will be uploaded to helen. You may change this by running this script again at any time.

    Copying required files to your home directory

    Setup is complete. Please logout and back in again for changes to come into effect.
    Forward any queries to immelman@science.uct.ac.za
    helen@mancala:/usr/local/paeano/bin>



This setup script creates the files and adds the environment variables you need to run Pæano. In order for the environment variable settings to take effect, you need to log out of the system by typing logout, and then log in again.

Return to your home directory by typing: cd

(If this fails, you can copy the script to your home directory by typing: cp paeano_setup.sh /usr/home/helen)

For your interest, the following environment variables have to be set for Pæano:


    BLASTDB=/usr/local/paeano/blastdb
    BLASTMAT=/usr/local/blast/data
    ESTSCANDIR=/usr/local/paeano/BTLib-2.0b
    PGDATABASE=paeano
    PGUSER=paeano
    PHRED_PARAMETER_FILE=/usr/local/paeano/bin/phredpar.dat
    PATH=/usr/local/jdk/bin:/usr/local/ant/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/java/jre/bin:/usr/local/mysql/bin:/usr/local/paeano/bin/:/usr/local/pgsql/bin:/usr/local/blast-2.2.9/bin/:/usr/sbin/:/usr/local/src/ncbitoolbox/ncbi/bin/:/usr/local/paeano/BTLib-2.0b/ESTScan/MkTables/
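After logging back in, one way to confirm that the settings took effect is to check each variable in turn. The sketch below is only an illustration: the function name missing_paeano_vars is made up, and it checks that the variables are non-empty, not that their values are correct.

```shell
# Print the name of every Pæano variable that is unset or empty.
# (missing_paeano_vars is a hypothetical helper, not part of Pæano.)
missing_paeano_vars() {
    for name in BLASTDB BLASTMAT ESTSCANDIR PGDATABASE PGUSER PHRED_PARAMETER_FILE; do
        eval "value=\${$name:-}"        # look up the variable called $name
        [ -n "$value" ] || echo "$name"
    done
}

missing_paeano_vars                     # prints nothing if all six are set
```

To inspect a single value, echo $BLASTDB (and likewise for the others) shows what is currently set.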



Getting your chromatogram files ready

The NERC naming convention and how to distinguish between 5’ and 3’ sequences:

Trace2dbest uses a controlled naming convention for sequences. For the Pæano system we have opted for the NERC naming scheme, which consists of three tags separated by underscores, and looks like this: e.g. Xh_RD_01B04, dissected as follows:

• The first tag consists of two characters derived from the species name e.g. Xh for Xerophyta humilis.
• The second tag is 3-5 letters long and indicates the library from which the EST was derived e.g. RD for Root Dehydration library. If you are sequencing the ESTs in both the forward (5’) and reverse (3’) direction, you can distinguish between the orientations by including ‘f’ and ‘r’ in the second tag e.g. Xh_RDf_01B04 and Xh_RDr_01B04.
• The third tag indicates the co-ordinates for the microtitre plate in which the EST clone is stored e.g. 01B04 for plate 01, row B, column 04. (If your clones are not in microtitre plates, substitute ‘Z’ for the row and apply your own numbering system.)
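Before running anything on a batch, it can be worth checking that your file names really have this three-tag shape. The function below is a rough sketch, not part of Pæano: the name is_nerc_name is made up, and the pattern is an assumption based on the examples above (it accepts 2-5 letters in the second tag because the example uses ‘RD’, and expects two digits, a capital letter and two digits in the third tag).

```shell
# Succeeds if the name looks like Xh_RD_01B04: species tag, library tag, plate co-ordinate.
# (Hypothetical helper; the exact pattern is an assumption drawn from the examples.)
is_nerc_name() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z]{2}_[A-Za-z]{2,5}_[0-9]{2}[A-Z][0-9]{2}$'
}

for name in Xh_RD_01B04 Xh_RDf_01B04 HC1; do
    if is_nerc_name "$name"; then
        echo "$name: looks like a NERC name"
    else
        echo "$name: does not match"
    fi
done
```

If you use your own numbering in the third tag, the last part of the pattern would need adjusting to suit.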

It is best, when you send your clones for sequencing, to name them according to the NERC convention. If, however, you already have sequences that do not follow the naming scheme, or you prefer to use a less clumsy Sequence Identifier (e.g. HC1) that you can then map back to the EST Identifier, you can either:

• rename them manually (if you have only a few). You will also need to remove the .abd extensions.
• rename them using a Pæano script by following the instructions below (see ‘Renaming your sequences according to the NERC naming scheme’).


Copying your files across to mancala:

In your home directory, make a directory called e.g. traces by typing mkdir traces (it is usually convenient to keep directory names in lower case). (If you will be processing the sequences in batches, you might also like to make e.g. a Batch1 directory within traces.) Via WinSCP, copy your sequence chromatograms across to the traces directory. (Alternatively, you can also create the directories via WinSCP, by clicking on the RHS panel and then selecting ‘Files’ followed by ‘Create a Directory’ from the pop-up menu.)
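If you do want a batch directory, both levels can be made in one step from the command line. The sketch below uses a temporary directory as a stand-in for your home directory so that it can be tried anywhere; on mancala you would simply type the mkdir line from your home directory.

```shell
home_dir=$(mktemp -d)           # stand-in for your home directory (assumption for this demo)
cd "$home_dir"
mkdir -p traces/Batch1          # -p creates traces and Batch1 within it in one go
ls -d traces/Batch1             # confirms the directory now exists
```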


Renaming your sequences according to the NERC naming scheme:

If your sequences don’t follow the naming convention, you can use a renaming script. There are two options in Pæano. Use:

• paeano_rename if your sequences were named according to a Sequence Identifier e.g. HC1, rather than a 96-well co-ordinate. For this option, you have to provide a mapping file in comma-delimited (csv) format consisting of two columns: 1) the Sequence Identifier (e.g. HC1) and 2) the corresponding clone ID or EST ID (e.g. Xh_LD_04F05).
• rename_file.pl if your sequences were named according to a 96-well co-ordinate.
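To make the mapping file idea concrete, here is a small sketch of a two-column csv and a lookup against it. It is an illustration only and does not reproduce how paeano_rename works internally; the function name lookup_est_id and the second row of the file are made up.

```shell
# Print the EST ID that the mapping file gives for a Sequence Identifier.
# $1 = mapping file (csv), $2 = Sequence Identifier. Hypothetical helper.
lookup_est_id() {
    while IFS=, read -r seq_id est_id; do
        est_id=$(printf '%s' "$est_id" | tr -d '\r')    # tolerate Windows line endings
        if [ "$seq_id" = "$2" ]; then
            printf '%s\n' "$est_id"
            return 0
        fi
    done < "$1"
    return 1                                            # no row for this identifier
}

map=$(mktemp)
printf 'HC1,Xh_LD_04F05\nHC2,Xh_LD_04F06\n' > "$map"    # example rows; the HC2 row is invented
lookup_est_id "$map" HC1                                # prints Xh_LD_04F05
```

Checking a few identifiers this way before renaming can catch rows that are missing from the mapping file.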


Option 1

Run the paeano_rename script as follows:

• Create a directory called Rename.
• Copy into here your mapping file and the files you want to rename.
• Run the script paeano_rename in the Rename directory by typing: paeano_rename mapping_filename.csv (NB. You have to provide the mapping file!)
• You should get something like this:

    Renaming HC29_-_M13_F rerun.abd to Xh_LD_27D12
    Renaming HC56_-_M13_F rerun.abd to Xh_LD_04C06
    Renaming HC25_-_M13_F rerun.abd to Xh_RD_27C05
    Renaming HC23_-_M13_F rerun.abd to Xh_RD_27B12
    Renaming HC50_-_M13_F rerun.abd to Xh_LR_01B05
    Renaming HC21-_M13_F rerun.abd to Xh_RD_27B07
    Renaming HC22_-_M13_ rerun.abd to Xh_RD_27B11

• If there are any problems with any of the matches, you will get something like this:

    Could not find match for HC20_-_M13_F rerun.abd in name mapping file.


OR

Option 2

Run the rename_file.pl script as follows:

• Create a directory called Rename.
• Copy into here the files you want to rename.
• Exit to your home directory.
• Run the script by typing: rename_file.pl -dir Rename -txt text you want to remove -sub text you want to add
  e.g. for conversion of a set of sequences named ‘LRxxXxx-RM13’ (e.g. LR01G12-RM13) to ‘Xh_LR_xxXxx’ (e.g. Xh_LR_01G12), type:

    rename_file.pl -dir Rename -txt LR -sub Xh_LR

  and then

    rename_file.pl -dir Rename -txt RM13

• For help, type rename_file.pl -help
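The effect of the two-step conversion on a single name can be pictured with plain shell text handling. This is only a sketch of the intended result (LR01G12-RM13 becoming Xh_LR_01G12); it does not call rename_file.pl or show how that script works, and the function name to_nerc_name, along with the handling of the underscore and hyphen, is an assumption made for the illustration.

```shell
# Turn a name like LR01G12-RM13 into Xh_LR_01G12. Hypothetical helper for illustration.
to_nerc_name() {
    name=${1#LR}                    # drop the leading library code "LR"
    name=${name%-RM13}              # drop the trailing primer suffix "-RM13"
    printf 'Xh_LR_%s\n' "$name"     # add the species and library tags
}

to_nerc_name LR01G12-RM13           # prints Xh_LR_01G12
```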


Then, in preparation for processing your sequences, create a directory called e.g. data in your home directory (and a subdirectory if necessary e.g. testset) and move your renamed sequences there. To do this, in the Rename directory, type e.g. mv Xh* ~/data/testset.

You’re now ready to process your sequences via Trace2dbest.




Trace2dbest

What trace2dbest does:

Trace2dbest performs basic processing of your EST sequences, using the raw data chromatograms (traces) as the input. To submit EST sequences to GenBank, you have to submit the sequences in a specified format consisting of four files:

1. A Library (Lib) file, giving details of EST origins (organism, source, etc.).
2. A Contact (Cont) file, giving your contact details.
3. A Publish (Pub) file, giving information about the associated publication.
   [Once you have entered information for 1-3 for your EST sequencing project, you will not have to repeat these entries for the same sequencing project i.e. you can access a saved file for subsequent trace2dbest runs.]
4. An EST file, containing all your EST sequences in FASTA format.
   [For each batch of sequences you process via trace2dbest, you will generate an EST file. Trace2dbest gives you the option to submit your EST sequences as soon as you have processed them.]

In processing your sequences to generate the EST file, trace2dbest makes use of the following programs:

• Phred, for base-calling (assigns a quality score to each base).
• Cross-match, to identify and trim away vector and poly-A tails.

Input files:

• The input files are your chromatogram files, named according to the NERC convention (see ‘The NERC naming convention and how to distinguish between 5’ and 3’ sequences’ above). Test data for processing (96 earthworm sequences) are available at /usr/local/paeano/testdata (copy them to your own directory).


Running trace2dbest:

• Run this from your home directory.
• Type trace2dbest.pl e.g.

    helen@mancala:~> trace2dbest.pl

• If you get a warning, ignore it. You should see this; then follow the prompts.



    ####################################################
    ###                                              ###
    ###           trace2dbest Version 2.1            ###
    ###       trace file processing and dbEST        ###
    ###           sequence submission tool           ###
    ###                                              ###
    ####################################################

    Section 1 - Lib, Cont, Pub and EST file information

    Library file

    Each batch of EST submissions to dbEST must have an associated
    Library file.

    Please choose one of the following options:

    1 - Library file already submitted
    2 - Enter information for a new Library file now

    Or use a saved file...

    3 - Xerophyta_humilis

    Don't fancy one of those? Enter h for help or q to quit.
    Library file




As a first-time user, select ‘2’ and enter the information as prompted [if you are using the X. humilis library, select 3]:

    Information for new Library file
    Please answer the following questions about the Library used to generate the ESTs
    you wish to submit.
    What is the name of the library? Xerophyta_humilis    (use underscore here)
    What is the name of the organism? Xerophyta humilis    (use space, don’t use underscore)

In a similar manner, select option 2 to generate the Contact file and Publication file and enter information as prompted. (You will only have to do this once for each sequencing project):


    Contact file

    Each batch of EST submissions to dbEST must have an associated
    Contact file.

    Please choose one of the following options:

    1 - Contact file already submitted
    2 - Enter information for a new Contact file now

    Publication file

    Each batch of EST submissions to dbEST must have an associated
    Publication file.

    Please choose one of the following options:

    1 - Publication file already submitted
    2 - Enter information for a new Publication file now


Processing your sequences to generate the EST file (example entries given):

    EST File

    Each sequence submitted to dbEST must have an associated EST file.

    Please choose one of the following options for the creation of this file:

    1 - Enter information for a new EST file now

    Or use a saved file...

    2 - Xerophyta humilis Helen
    3 - 400+R
    4 - 400+F

    Don't fancy one of those? Enter h for help or q to quit.

Options 3 and 4 are for M13forward and M13reverse, respectively i.e. select 3 for 3’ X. humilis sequences and 4 for 5’ X. humilis sequences. Or create your own file by selecting 1:

    EST File

    Each sequence submitted to dbEST must have an associated EST file.

    Please choose one of the following options for the creation of this file:

    1 - Enter information for a new EST file now

    1



    What was the sequencing primer used? [enter the primer name, optionally followed by
    the sequence in brackets()].
    E.g. SAC(GGGAACAAAAGCTGGAG): M13F
    What was the forward PCR primer? NA
    What was the reverse PCR primer? NA
    Which end was sequenced (5'/3')? If neither, just hit return: 3'

    Please enter the date you would like your data to be made public. This date
    will be inserted into the PUBLIC field of the submission file.
    Enter the date in the format MM/DD/YYYY. For immediate release, just press enter.

    Please enter a comment about the ESTs for inclusion in the "COMMENT" field of
    the EST file.

    Please check the data you have entered:

    SEQ_PRIMER: M13F
    PCR_F: NA
    PCR_B: NA
    P_END: 3'
    PUBLIC:
    COMMENT:

    Are you happy to continue? (Y/N): y
    Would you like to save this file for future use?(y/n):y
    Please enter a name for this file: UG test
    Are you happy to continue? (Y/N): File saved to /usr/local/paeano/db/ESTfile.db


    Section 2 - trace2dbest processing information

    The following information is required to allow trace2dbest to process your
    traces efficiently.

    Adapter
    If you would like trace2dbest to trim off an adapter sequence (and everything
    upstream of it), please enter the adapter sequence here or hit
    return to continue:
    GAATTCGGCACGAGG

(This is the adapter sequence for the X. humilis library; it is determined by the library kit used.)

    Vector file
    The default location for the vector.seq file is
    /usr/local/paeano/db/vector.seq
    Hit return to use this file or enter a different path here:

    Would you like to see a list of all the vector sequences in
    /usr/local/paeano/db/vector.seq? (y/n): y

    The following 35 vectors were found in /usr/local/paeano/db/vector.seq

    >bluescript
    >lorist2
    >lorist6
    >loristb
    >pbs
    >pjb8
    >pphc79
    >prs313
    >prs423
    >prs424
    >pwe15
    >pyac4
    >scos
    >m13mp18
    >m13mp19
    >mg3_left
    >mg3_right
    >puc118
    >puc18
    >pSC
    >pcDNA3.1 V5-His-Topo
    >pBeloBAC11
    >pBACe3.6
    >PSPORTI
    >pSCREEN-1b(+) from Novagen
    >pDNR_LIB
    >pCR4-TOPO Invitrogen TA cloning vector
    >pGEM(R)T-easy Promega cloning vector
    >pGEMT.fasta Promega TA cloning vector
    >pBLUESCRIPT SK+
    >pCMV-PCR
    >gi|1017801|gb|U37573.1|XXU37573 Shuttle expression vector pBKCMV
    > NAME = pTriplEx2_seq.fa : TYPE = DNA
    >
    >pCR2.1TOPOTA

    Are you satisfied that the relevant vector is here? (y/n): y


    E.coli sequence information
    Do you want to screen for E.coli sequence in your ESTs? (y/n): n

    Trace file naming scheme
    In order for your traces to be processed, the file names must follow one of
    these schemes:

    1 - NERC Environmental Genomics scheme
    2 - STRESSGENES scheme

    Please enter the appropriate number: 1

    Trace file directory
    Please enter the full path of the directory containing the trace files
    to be processed

e.g. /usr/home/helen/traces (Test data for processing (96 earthworm sequences) are available at /usr/local/paeano/testdata; copy them to your own directory.)

    There are 2 files that match this naming scheme in
    /usr/home/helen/data/HC356andHC357
    Is this correct? (y/n): y


    Section 3 - trace2dbest parameters

    You now have the opportunity to set the various parameters that control how the
    traces will be processed.
    trace2dbest has default values for all the parameters, to use these defaults
    enter 's' (skip), otherwise hit return to alter parameters:

    For each of the parameters, enter the value you wish to use or hit return to
    use the default shown in brakets().

    phred
    Number of high quality bases required in sequence (150):

    cross_match
    cross_match for vector sequence - minmatch (10):
    minmatch value of 10 selected
    minscore (20):
    minscore value of 20 selected

    Poly(A) tail
    Enter number bases in poly(A) tail (8):

    Spliced leader 1
    If you wish to trim the nematode spliced leader sequence, type 'yes', or
    If you do not wish to trim any spliced leader sequence just hit return:




Section 4
-

Annotation of sequences

Would you like to add BLAST
-
based preliminary annotation to the sequences?

Note: t
his will slow the process considerably (y/n):

n


Section 5
-

Trace processing

Running phred (basecalling from raw traces)
-

please wait...Done

Running cross_match (screening for vector sequence)
-

please wait...Done


Section 6
-

Sequence processing


Creati
ng submission files...

Done


Statistics

2 Traces processed
2 (100%) 'Good quality' traces
2 (100%) Submissable sequences after trimming
699 Average length of submissable sequences
475 Average number of high quality bases for submissable sequences




Section 7 - Submission and saving of files

The EST submission records have been merged into one file.
Would you like to view this completed submission file? [y/n] y

This takes you into text editor mode to view the file. To exit, type: :q!

The EST submission file may now be sent by e-mail to NCBI dbEST.
Enter yes to send file to dbEST now, or any other key to continue without
submitting (your file will be saved):
Ok, file not submitted to dbEST.

The EST submission file and the other output directories have been
saved in the following location:
/usr/home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted

Trace2dbest has finished and will now exit. Bye.


Output files:

Trace2dbest creates a directory called est_solutions, and within this a directory with the name of your project (i.e. organism of interest e.g. Xerophyta_humilis) and within this a directory called trace2dbest. Once you have run trace2dbest and processed your sequences, you can look at the output:

Go to the trace2dbest directory created, i.e. type e.g. cd est_solutions/Xerophyta_humilis/trace2dbest

Type ls, and you should get a directory with a date stamp (according to the time of processing) e.g. 2005-03-11T09:43unsubmitted. To look inside, type cd 2005-03-11T09:43unsubmitted (remember that you can use the tab key to avoid having to type out the whole thing).

Within this directory are a number of sub-directories. fastafiles contains your processed sequences; these files have a .fsa extension. (Theoretically, the PartiGene directory contains the input files for PartiGene, but you will be using the files from the fastafiles directory).

Also have a look at the file: dbEST_submission.txt.

You are now ready to cluster your sequences and assemble the contigs via PartiGene.
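If you have processed several batches, the steps above can be wrapped in a small helper. This is only a sketch: latest_run is a hypothetical name, not part of Pæano, and it assumes that the trace2dbest directory contains nothing but date-stamped run directories, each with a fastafiles sub-directory as described above.

```shell
# Hypothetical helper (not part of Paeano): print the newest date-stamped
# run directory under a trace2dbest directory, followed by the number of
# .fsa files in its fastafiles sub-directory.
latest_run() {
    t2d_dir=$1    # e.g. ~/est_solutions/Xerophyta_humilis/trace2dbest
    # Date stamps such as 2005-03-11T09:43unsubmitted sort chronologically
    # as plain text, so the last name in sorted order is the newest run.
    run=$(ls -1 "$t2d_dir" | sort | tail -n 1)
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    count=$(ls -1 "$t2d_dir/$run/fastafiles" | grep -c '\.fsa$' || true)
    echo "$run $count"
}

# Usage (commented out so that nothing runs by accident):
# latest_run ~/est_solutions/Xerophyta_humilis/trace2dbest
```

The number printed should match the number of submissable sequences reported in the trace2dbest statistics for that run.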




PartiGene


What PartiGene does:

PartiGene clusters together overlapping or redundant sequences via CLOBB. It then assembles your clustered sequences into a consensus sequence or contig via Phrap. It also allows you to Blast the contig (either Blastn or Blastx) against selected databases to obtain a putative gene identity. We have included an option that allows you to reverse complement your sequences if they are in the reverse (i.e. 3’) direction. It works best to reverse complement 3’ sequences before clustering.


Input files:

The .fsa input files for PartiGene are stored in the fastafiles directory of your trace2dbest run, e.g. ~/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted/fastafiles.


Running PartiGene:

The preliminaries: making a PartiGene directory

Trace2dbest has made you an est_solutions directory, and within that a species or organism directory e.g. Xerophyta_humilis, and within that a trace2dbest directory (see the illustration of the directory structure). You now need to make a PartiGene directory within the e.g. Xerophyta_humilis directory, from which to run PartiGene. Do this as follows.

From your home directory, type cd est_solutions/Xerophyta_humilis.

Once there, type mkdir PartiGene, then cd PartiGene.
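The two steps can also be combined so that they are safe to repeat. The sketch below uses a hypothetical function name, make_partigene_dir, and relies on mkdir -p, which does nothing if the directory already exists; it assumes the species directory has already been created by trace2dbest.

```shell
# Hypothetical helper: create the PartiGene directory inside a species
# directory (if it is not there yet), move into it and print where we are.
make_partigene_dir() {
    species_dir=$1    # e.g. ~/est_solutions/Xerophyta_humilis
    mkdir -p "$species_dir/PartiGene" && cd "$species_dir/PartiGene" && pwd
}

# Usage (commented out so that nothing runs by accident):
# make_partigene_dir ~/est_solutions/Xerophyta_humilis
```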


The preliminaries: copying the files to the PartiGene directory

You then need to copy the relevant files across from the trace2dbest to the PartiGene directory. Do this by typing: copy_fastafiles.pl

You will get something like this:

Species Directory List
1) /home/helen/est_solutions/Lumbricus_rubellus
2) /home/helen/est_solutions/Xerophyta_humilis
Please select an option from 1 to 2:

First select according to organism/sequencing project e.g. 2.

Then select according to date stamp i.e. batch of processed sequences e.g. 1:

Run Directory List
1) /home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2004-11-10T14:27unsubmitted


You will then see something like this:


[Illustration: directory structure, showing est_solutions, the species directories Xerophyta_humilis and Lumbricus_rubellus, and the trace2dbest and PartiGene directories.]

Copying /home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted/fastafiles/Xh_LR_01G12.fsa to
/home/helen/est_solutions/Xerophyta_humilis/PartiGene/sequences...
Copying /home/helen/est_solutions/Xerophyta_humilis/trace2dbest/2005-03-18T11:45unsubmitted/fastafiles/Xh_LR_01H01.fsa to
/home/helen/est_solutions/Xerophyta_humilis/PartiGene/sequences...
Done.
helen@mancala:~/est_solutions/Xerophyta_humilis/PartiGene>




Once you have done this, type ls. You will see that you now have a temporary directory called sequences that contains your trace2dbest-processed sequences, ready for PartiGene processing. (You can have a look by typing cd sequences, then ls).


The preliminaries: reverse complementing 3’ sequences

For any sequences that you have in the reverse (3’) direction, it works best to reverse complement them at this stage. Do this as follows:

Change into the sequences directory.

If all your sequences are in the 3’ direction, type: revcomp -w *.fsa to reverse complement (revcomp) and overwrite (-w) all the sequences in the sequences directory.

If you have sequences in both directions, you will have to have identified the sequence direction during naming e.g. Xh_RDf_01B04 and Xh_RDr_01B04 to distinguish between 5’ and 3’ sequences, respectively. You can then selectively reverse complement the 3’ sequences using wildcards as follows: revcomp -w *_*r_*.fsa.
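Because revcomp -w overwrites your files, it can help to check first which files a wildcard will select, and to be clear about what reverse complementing does to a sequence. The sketch below is for illustration only: revcomp_string and preview_3prime are hypothetical names, and the first function uses the standard rev and tr utilities rather than the revcomp program itself.

```shell
# Reverse complement a single DNA string: reverse the order of the bases,
# then swap A<->T and C<->G (other characters, such as N, are left as is).
revcomp_string() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# List, without changing anything, the files in a directory that the
# wildcard *_*r_*.fsa would select. Prints nothing if there are none.
preview_3prime() {
    ( cd "$1" && ls -1 *_*r_*.fsa 2>/dev/null )
}

revcomp_string AACGT    # prints ACGTT
```

Running preview_3prime on your sequences directory before revcomp -w *_*r_*.fsa shows exactly which files are about to be overwritten.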


Running PartiGene from est_solutions/Xerophyta_humilis/PartiGene

In PartiGene, type PartiGene_v2.2.pl
(e.g. helen@mancala:~/est_solutions/Xerophyta_humilis/PartiGene> PartiGene_v2.2.pl)

You will then get this:



################################################################
###                                                          ###
### PartiGene - a script to convert individual sequences     ###
### (typically ESTs) into a Partial Genome. Vs 2.2.0         ###
###                                                          ###
### Ralf Schmid and colleagues for the EGTDC 2004            ###
###                                                          ###
### News, upgrades and help: nematode.bioinf@ed.ac.uk        ###
### Help for EG-Awardees: helpdesk@envgen.nox.ac.uk          ###
###                                                          ###
################################################################

Enter the number corresponding to the part of the PartiGene
process you want to perform:

1. Download sequences from EBI for analysis.
2. Pre-process sequences.
3. Cluster sequences.
4. Assemble clusters.
5. Perform BLASTs.
6. Create HTML tables of results.
7. Construct relational database of results.
8. Quit.


PartiGene Clustering

Select Option 3 to cluster:


##### SEQUENCE CLUSTERING #####

This software clusters datasets of EST and other sequences
such that each cluster represents one putative gene.
The sequences need to be available as individual fasta files
in a directory called 'sequences'.
The PartiGene sequence download process does this automatically.

The cluster process uses a program called CLOBB to do the
clustering. CLOBB allows new sequences to be added to old
clusters while maintaining previous cluster identities
For more information on CLOBB see :

Parkinson J, Guiliano DB, Blaxter M.
Making sense of EST sequences by CLOBBing them.
BMC Bioinformatics 2002 3(1):31.

Enter the three letter cluster ID you would like to use
(typically this is the first letters of the genus and species
followed by 'C' eg. for Zeldia punctata you might use ZPC).




Type the cluster ID e.g. XHC, and you will then get something like this:



0 sequences remaining

Clustering done - now splitting cluster file into individual
cluster files. This will create files in directory 'Clus'

Creating individual cluster files, please wait...

SUMMARY OF CLUSTERING FOR XHC
=============================================================
Number of sequences = 414
Total number of clusters = 385
Number of clusters with 1 member = 367
Number of clusters with >1 member
(derived from 47 sequences) = 18
=============================================================

This data and additional information on each cluster,
has been saved in the file:

OUT/CLOBB_XHC_03-18-05+11:56.txt

Would you like to continue with the PartiGene process?
Y
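The figures in the clustering summary can be cross-checked with a little arithmetic: singleton clusters plus multi-member clusters should equal the total number of clusters, and singletons plus the sequences in multi-member clusters should equal the total number of sequences. The sketch below uses a hypothetical function name, check_clobb_summary, and takes the five numbers from the summary as arguments.

```shell
# Hypothetical helper: check that the numbers in a CLOBB summary add up.
# Arguments: sequences, clusters, singleton clusters, multi-member
# clusters, sequences contained in the multi-member clusters.
check_clobb_summary() {
    sequences=$1; clusters=$2; singletons=$3; multi=$4; multi_seqs=$5
    [ $((singletons + multi)) -eq "$clusters" ] ||
        { echo "cluster counts disagree"; return 1; }
    [ $((singletons + multi_seqs)) -eq "$sequences" ] ||
        { echo "sequence counts disagree"; return 1; }
    # Average size of the multi-member clusters, to one decimal place.
    awk -v n="$multi_seqs" -v c="$multi" \
        'BEGIN { printf "consistent; %.1f sequences per multi-member cluster\n", n / c }'
}

# The example run above: 367 + 18 = 385 clusters, 367 + 47 = 414 sequences.
check_clobb_summary 414 385 367 18 47
# prints: consistent; 2.6 sequences per multi-member cluster
```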


PartiGene cluster assembly

Back in main menu, select Option 4 to assemble clusters:


##### CLUSTER ASSEMBLY #####

The sequences grouped in clusters can be 'assembled' to yield a
consensus sequence. PartiGene uses a program called 'phrap' for this
(written by Phil Green and colleagues; see http://www.phrap.org/).

You have previously used
XHC as the cluster identifier
Is this correct ?
[y/n] : Y

7 Clusters will be assembled or updated

Before assembling the clusters the sequences need to pre-processed
phrap uses quality information from the sequencing chromatographs,
if this is available.
You have three options :
1) Attempt to use original quality files for all clusters
2) Attempt to use original quality files only for clusters containing
2 sequences
3) Skip preprocessing (if quality files are unavailable).

NB. phrap can generate multiple consensuses from large clusters.
We have found that option 2 reduces this less-than useful feature

Please select 1,2 or 3 : 3
(NB select option 3; selecting option 1 or 2 is likely to generate concatamerised contigs).

0 clusters remaining

Assembly process finished.
Now creating input files for the protein prediction pipeline prot4EST.

A report on the phrap based assembly process has been saved in the file:

OUT/phrap_XHC_03-18-05+12:13.txt

Would you like to continue with the PartiGene process?
y



PartiGene blasting

Back in main menu, select Option 5 to blast:


Consensus sequences needed for BLAST searches within PartiGene
have been processed and stored in the directory 'blast'.
A concatenated file which can be used as input file for BLAST
searches outside of PartiGene has been saved as 'blast_input_XHC.txt'
in the main directory.

Would you like to continue by BLASTing against your custom databases ?
[y/n] : y

### Available protein databases ###

1 ATH1.nr.pep    2 kyva    3 nr

### Available nucleotide databases ###

1 ATH1.nr.cds    2 ego8_112404.seq    3 est.00
4 est.01         5 est.02             6 est.03
7 GEO_FASTA.txt

Please enter the number of the database you would like to blast and
the type of blast you would like to perform separated by a comma.
For example '1, blastn' followed by 'q' for finish would perform a
single BLASTN against nucleotide database 1.

If you would like to specify several databases, separate the numbers
by using '+'. E.g. entering 1+2+4, blastx, would perform a single BLASTX
search for each sequence against the combined protein databases 1,2 and 4
Upto 5 different BLASTs are allowed - enter 'q' to finish

If you would like to run more than one BLAST enter e.g '1 , blastx' RETURNKEY
'2 , blastx' RETURNKEY '4, blastx' RETURNKEY followed by 'q' to finish.
This would run three BLASTs (one against each of the protein databases 1,2 and 4)




PartiGene has the facility for you to do Blastn or Blastx searches. (Note that Blastx searches are the most informative for EST annotation).

To run a Blastx search, select nr from the Available protein databases by typing:

3, blastx
(NB. If the Pæano system is populated with more databases, the number associated with nr may change; select the number matching nr)
q
(don’t forget to type q)


You have selected the following blasts -
/usr/local/blast-2.2.9/bin//blastall -p blastx -d /usr/local/paeano/blastdb/nr ....

Is this correct ?
[y/n] : y

A Blastx run takes about 1 min per sequence. You will see the % progress. When the run is finished, you will get:

100% Completed
Would you like to continue with the PartiGene process?
[y/n] : y

OK: Back to main menu
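The figure of about 1 minute per sequence gives a quick way to judge whether a Blastx run should be left overnight. The sketch below is only a rough estimate under that assumption; estimate_blastx_time is a hypothetical name, and the example assumes one consensus sequence per cluster, which may not hold if phrap produces several consensuses for a cluster.

```shell
# Hypothetical helper: rough Blastx run time from the number of sequences.
estimate_blastx_time() {
    n=$1                    # number of sequences to search
    minutes_each=${2:-1}    # the guide's rough figure: about 1 minute each
    total=$((n * minutes_each))
    echo "$((total / 60)) h $((total % 60)) min"
}

# The 385 clusters from the example clustering run:
estimate_blastx_time 385    # prints: 6 h 25 min
```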




Note that you can do both types of blast (Blastx and Blastn) simultaneously and you can blast against a maximum of 5 databases simultaneously e.g. you can do a blastx run against nr and a blastn against the three est databases as follows:

3, blastx
4, blastn
5, blastn
6, blastn


You’ll then get something like this:

You have selected the following blasts -
/usr/local/blast-2.2.9/bin//blastall -p blastn -d /usr/local/paeano/blastdb/nt.01 ....
/usr/local/blast-2.2.9/bin//blastall -p blastn -d /usr/local/paeano/blastdb/nt.02 ....
/usr/local/blast-2.2.9/bin//blastall -p blastn -d /usr/local/paeano/blastdb/nt.03 ....

Is this correct ?
y


OR, to make the search quicker, blastn search the databases collectively, rather than individually, by typing:

6+7+8, blastn
q


Creating HTML tables in PartiGene

Back in main menu, select option 6 to create HTML tables. Running this sub-menu, you will get something like this:



##### Creating HTML Tables #####

This facility creates a series of HTML format results files.
This is recommended only for smaller datasets (<1000 sequences).

Do you want to continue?
[y/n] : y

First select the BLASTs that you would like to include from the following list
1 nr
2 nt

Please enter a comma separated list of numbers from the list above: 1

You have selected the following BLASTs :
nr

Please wait while the tables are being generated
100% Completed
Tables have been generated - if you are running this program remotely,
you will need to copy the following directories and contents into
a web accessible directory (typically "public_html" in your home directory -
ask your system administrator for further details). Directories to copy are :
html, blast and phrap
If you are running a local copy of this program you could view the results now ?
[y/n] :n
To view the results open up a web browser and open the file :

/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/html/Results.html

Would you like to continue with the PartiGene process?
[y/n] : y


NB: To be able to view your Results via the Pæano web interface when you open up a web browser, you need to copy across the directories: html, blast and phrap to your public_html directory. You can either exit PartiGene and do it now, or once you’ve completed Option 7. Do this as follows:

Type cp -r html/ ~/public_html. Repeat for blast and phrap.

Open up a web browser and go to http://mancala.cbio.uct.ac.za/~yourname/html/Results.html to view your results!
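The three copy commands can be written as one loop. The sketch below uses a hypothetical function name, publish_results; it assumes that the html, blast and phrap directories all exist in the PartiGene directory, and it creates the destination directory if it is missing.

```shell
# Hypothetical helper: copy the html, blast and phrap directories from a
# PartiGene directory into a web-accessible directory.
publish_results() {
    partigene_dir=$1                   # e.g. ~/est_solutions/Xerophyta_humilis/PartiGene
    web_dir=${2:-$HOME/public_html}    # destination; defaults to ~/public_html
    mkdir -p "$web_dir"
    for d in html blast phrap; do
        cp -r "$partigene_dir/$d" "$web_dir/" || return 1
    done
}

# Usage (commented out so that nothing is copied by accident):
# publish_results ~/est_solutions/Xerophyta_humilis/PartiGene
```

Copying all three together matters because the results pages in html are described as needing the blast and phrap directories alongside them.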


Constructing relational databases in PartiGene

Back in main menu, select option 7 to construct relational databases of your results. Note that for a group of users all sequencing from the same organism, you can share a database, although it is safer initially for each user to define their own database. Pæano is/will be set up so that, provided all users working on the same species use the same cluster ID e.g. XHC for X. humilis (see below), cluster IDs will be unique even if there are redundant sequences between users.

Also note, that you need to revisit this option to upload your downstream Prot4EST results. (If you prefer, you can run this option once all your processing is completed.) Running this sub-menu, you will get something like this:



##### Databasing #####

This facility offers the ability to hold your data in a
SQL database using the public domain databasing software
postgreSQL. PostgreSQL is typically packaged with many Linux
distributions and is also freely available from :
http://www.postgresql.org/
In order to use this databasing facility, you will need to
ensure that postgres is running and that you have permissions
to create new databases - see the website above for more details.

You have already defined a database - paeano would you like to use it ?
[y/n] : y
(See below if you select n)
Enter three letter cluster ID you have previously defined : XHC

Cluster entries already exist for this cluster ID - Update the db ?
[y/n] : y


If you define a new database (i.e. select n as an option above), you will get something like this:


Please enter the name of the database you would like to create
arthur
Use arthur ? :
[y/n] :y
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "cluster_pkey" for table "cluster"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "est_pkey" for table "est"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "clone_name_pkey" for table "clone_name"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "p4e_ind_pkey" for table "p4e_ind"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "p4e_hsp_pkey" for table "p4e_hsp"

Use this as your default database in future ?
[y/n] :y
Enter three letter cluster ID you have previously defined : XHC
Inserting cluster entries
Inserting sequence entries

Do you want to add or update clone name entries ?
(Please note, this feature works only for ESTs downloaded from EBI in option 1
which use the EGTDC naming scheme)
[y/n] :n

Inserting BLAST info
Please choose the BLAST results to include in the database :

Include ATH1 ?
[y/n] :n

Do you want to insert results from prot4EST into arthur?
[y/n] :n

Would you like to continue with the PartiGene process?
[y/n] :




To view your paeano relational database results, see ‘PostgreSQL’ p. 34.


Output files:

PartiGene generates, amongst others, the following files:

Cluster results are in PartiGene/Clus.
o Singletons.fsa is a fasta format file of all unclustered sequences
o Individual files e.g. XHC00251 group together in fasta format those sequences that cluster

Contig assembly results are in PartiGene/phrap. XHC00251.contigs (e.g.) presents the consensus sequence in fasta format.

Blast results for individual clusters e.g. XHC00251.out are in PartiGene/blast/nr.
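A quick way to inspect these fasta files is to count the records they hold, since every fasta record starts with a header line beginning with '>'. The sketch below uses a hypothetical function name, count_fasta.

```shell
# Hypothetical helper: count the records in a fasta file by counting the
# header lines, which begin with '>'.
count_fasta() {
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    grep -c '^>' "$1" || true
}

# Usage, from the PartiGene directory (commented out):
# count_fasta Clus/Singletons.fsa
# count_fasta Clus/XHC00251
```

If Singletons.fsa holds one record per unclustered sequence, as described above, its count should agree with the "Number of clusters with 1 member" line in the clustering summary.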




Prot4EST


What Prot4EST does:

Prot4EST translates your EST sequences, attempting to compensate for the fact that ESTs often represent partial mRNAs and that their sequence quality usually isn’t perfect (sequence errors tend to disrupt the open reading frame (ORF)). Prot4EST aims for the best possible polypeptide prediction via a number of similarity-based and de novo prediction methods. The tiered or hierarchical system is executed as follows:

1. A Blastn search is performed against a database of ribosomal RNA (rRNA) sequences to identify and filter out rRNA sequences from your EST-translatable set.

2. A Blastx search is performed against proteins encoded by mitochondrial genomes. Those sequences identified as mitochondrial-encoded genes, for example, are translated using the appropriate mitochondrial genetic code. (For plant sequences we plan to make available a search against chloroplast genomes also).

3. Nuclear-encoded sequences are translated according to Blastx matches. Those ESTs for which there is no significant sequence similarity are processed further via de novo prediction methods.

4. ESTScan uses Hidden Markov models (HMM) which can predict coding sequence in a probabilistic manner. The HMM must be built using a complete set of CDS entries from e.g. EMBL for your species of interest, or a related species. ESTScan can compensate for sequence errors and can distinguish between the UTRs and CDS of a cDNA. Sequences that fail to be translated by ESTScan are passed on to the next step.

5. DECODER predicts CDS using sequence quality information (i.e. the phrap quality files). Sequences that fail to be translated by DECODER are passed on to the next step.

6. As a last resort, a putative polypeptide translation is generated from the longest ORF in all 6 frames.


Input files:

The input file for Prot4EST is prot4EST_input.fsa in your PartiGene directory.

Decoder makes use of .seq and .qlt files in PartiGene/protein.

N.B. If you are using the pre-computed blast results option (see entry 9 of the configuration file below), which we recommend, you need to do the following:

o The cluster IDs in the prot4EST_input.fsa file are incompatible with the cluster IDs in the PartiGene blast results files (anomalously, the cluster IDs in the input file contain .1 extensions e.g. XHC00001.1). To fix this, run the Prot4EST fix script in the PartiGene directory by typing: prot4est_input_fix prot4EST_input.fsa before you run Prot4EST.

o Also, anomalously, some of the cluster IDs in the blast result files contain a .Contig1 extension e.g. XHC00001.Contig1. (Consequently, Prot4EST cannot recognise and use these particular blast reports for similarity-based translations). To fix this, run the following patch: pg_blast_fix (e.g.) XHC* from your PartiGene/blast/nr directory.
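To see whether the input file still needs fixing, you can count how many of its sequence IDs carry the .1 extension. The sketch below does not reproduce prot4est_input_fix, whose inner workings are not described here; count_ids_with_suffix is a hypothetical name, and it assumes that the cluster ID is the first word on each fasta header line.

```shell
# Hypothetical helper: count the fasta headers in a file whose ID (the
# first word after '>') ends with a given suffix, e.g. '.1'.
count_ids_with_suffix() {
    fasta=$1; suffix=$2
    # The suffix is compared literally, not as a regular expression.
    awk -v s="$suffix" '/^>/ { id = substr($1, 2)
        if (length(id) >= length(s) &&
            substr(id, length(id) - length(s) + 1) == s) n++ }
        END { print n + 0 }' "$fasta"
}

# Usage, from the PartiGene directory (commented out):
# count_ids_with_suffix prot4EST_input.fsa .1
```

A count of 0 after running the fix script would suggest that the IDs now match those in the blast results; any other number means that some IDs still carry the extension.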


Running Prot4EST:

Make a directory called Prot4EST in e.g. est_solutions/Xerophyta_humilis.

Run Prot4EST from your Prot4EST directory, by typing prot4EST.pl.

You’ll get this:

starting prot4EST checks...
Unable to find the pico text editor. To create the configuration file use
the example as a template in whatever text editor you prefer.
(We must still get Rodger to install pico)


##################################################
###                                            ###
###                  prot4EST                  ###
###                                            ###
###        version: 2.1.1 - January '05        ###
###                                            ###
### a script that converts EST sequence into   ###
### amino acid sequence taking frame shift,    ###
### substitutions et al into consideration.    ###
###                                            ###
##################################################

Please set up the config file:

1. Create a configuration file.
2. Use or Edit an existing configuration file.
3. Get Help.
4. Exit Program.




As a first-time user, select 1 to create the configuration file in which you specify paths to input files and customise analysis options. You will get this:

1

Now select OPTION 2 to load the configuration file




##################################################
###                                            ###
###                  prot4EST                  ###
###                                            ###
###        version: 2.1.1 - January '05        ###
###                                            ###
### a script that converts EST sequence into   ###
### amino acid sequence taking frame shift,    ###
### substitutions et al into consideration.    ###
###                                            ###
##################################################

Please set up the config file:

1. Create a configuration file.
2. Use or Edit an existing configuration file.
3. Get Help.
4. Exit Program.

4





At present, the pico editor is not installed and therefore you have to edit the config file using the vi text editor (which is a bit tricky to use). So instead of selecting option 2, select ‘4’ to exit and type: vi config.

helen@mancala:~/est_solutions/Xerophyta_humilis/Prot4EST> vi config



The configuration file template looks like this, with the custom or definable entries in bold:


Config file for PROT4EST created Mon Mar 7 10:23:39 SAST 2005

For help on any of these please consult the README file

#Full path to fasta input file, e.g. /home/joe/EST/rubellus.fsa
1. Input File [fasta format]:/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/prot4EST_input.fsa

#prot4EST will create this directory, e.g. 'output' will be created in the
#directory p4e is launched from
2. Output Directory: output

#e.g. Lumbricus rubellus
3. Organism Name (full): Xerophyta humilis

4. Location of genetic code file: /usr/local/paeano/db/gc.prt

#Fasta and BLAST files containing these sequences are included in the prot4EST release.
#Enter the full path.
5. Ribosomal RNA BLAST database: /usr/local/paeano/db/rRNA.fsa
6. Mitochondria BLAST database [protein]: /usr/local/paeano/db/mito_viridiplantae.fsa

#The defaults are shown
7. Evalue for rRNA search (BLASTN): 1e-65
8. Evalue for BLASTX: 1e-8

#If you have previous carried out BLASTx search on these sequences then enter the path to the report file
#or directory containing only these files
#If left blank then prot4EST assumes you wish to carry out a BLASTx search on these sequences
#You are advised to read the userguide regarding this option
9. Location of pre-computed BLASTX report files/directory: /home/helen/est_solutions/Xerophyta_humilis/PartiGene/blast/nr

#Fill in all entries for 9a-c OR just 9d (if DECODER has already been run on these sequences)
(This should read 10a-c, 10d)
#e.g. /home/joe/partigene/protein
10a. Path to sequence and quality files [protein directory]:/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/protein/
#defaults shown
10b. Suffix for EST sequence files: seq
10c. Suffix for EST quailty files: qlt

or
10d. Path to pre-computed DECODER results:

11. ESTScan Matrix File [optional]:

12. Codon Usage Table (gcg format) [optional]:
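Since most problems with this file come from mistyped paths, it can be worth checking that every full path in it exists before starting Prot4EST. The sketch below is not part of Prot4EST: check_config_paths is a hypothetical name, and it assumes the layout shown above, in which each entry is a numbered line of the form "label: value" and comment lines start with #.

```shell
# Hypothetical helper: report whether each full path named in a prot4EST
# config file exists. Only values that start with "/" are checked.
check_config_paths() {
    # Skip comment lines, keep numbered entries whose value is a full
    # path, strip trailing spaces, then test each path in turn.
    grep -v '^#' "$1" |
        sed -n 's|^[0-9][0-9a-d]*\. [^:]*:[[:space:]]*\(/.*\)$|\1|p' |
        sed 's/[[:space:]]*$//' |
        while IFS= read -r path; do
            if [ -e "$path" ]; then
                echo "ok      $path"
            else
                echo "MISSING $path"
            fi
        done
}

# Usage, from the Prot4EST directory (commented out):
# check_config_paths config
```

Entries whose value is not a full path, such as the output directory name or the E-values, are ignored by this check.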




Fill in the entries for your config file, following the example above. Make use of the following to navigate in vi:

o To move the cursor around, use the arrow keys (or h, j, k and l) while in command mode.

o To insert text after the cursor, type a, then enter your text. Use the Esc key to exit text insertion mode.

o To delete the character under the cursor, type x.

o To save and exit vi, type ZZ or :wq.


Entries for 11 and 12 are optional. If you leave out the path for the pre-computed Blastx results (generated in PartiGene), Prot4EST will do a new Blastx search.

Once you have entered the relevant information in the configuration file, select Option 2 to generate the translations. You will be prompted to make some selections during the course of the run:


2
Please provide path to config file...(current directory: /usr/home/helen/est_solutions/Xerophyta_humilis/Prot4EST )
/usr/home/helen/est_solutions/Xerophyta_humilis/Prot4EST/config

Configuration file found
Would you like to Use or Edit this file?
[U/E]
U
Created prot4EST_050405_121845.log and prot4EST_050405_121845.errorlog

/usr/home/helen/est_solutions/Xerophyta_humilis/PartiGene/prot4EST_input.fsa accepted and format verified
/usr/local/paeano/db/rRNA.fsa accepted and format verified
/usr/local/paeano/db/mito_viridiplantae.fsa accepted and format verified
checking sequence and quality files...
All components for DECODER located

Config file has been read and all variables accepted




Starting prot4EST
You need to choose which Genetic Codes to use for nuclear and mitochondrial translations
Select a nuclear genetic code [default=1]
1: Standard
2: Vertebrate Mitochondrial
3: Yeast Mitochondrial
4: Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
5: Invertebrate Mitochondrial
6: Ciliate Nuclear; Dasycladacean Nuclear; Hexamita Nuclear
9: Echinoderm Mitochondrial
10: Euplotid Nuclear
11: Bacterial and Plant Plastid
12: Alternative Yeast Nuclear
13: Ascidian Mitochondrial
14: Flatworm Mitochondrial
15: Blepharisma Macronuclear
16: Chlorophycean Mitochondrial
21: Trematode Mitochondrial
22: Scenedesmus obliquus mitochondrial
23: Thraustochytrium mitochondrial code
1
[unless sequencing mitochondrial genes, select from options 1, 6, 10, 12, 15]

Select a mitochondrial genetic code [default=5]
2: Vertebrate Mitochondrial
3: Yeast Mitochondrial
4: Mold Mitochondrial; Protozoan Mitochondrial; Coelenterate Mitochondrial; Mycoplasma; Spiroplasma
5: Invertebrate Mitochondrial
9: Echinoderm Mitochondrial
13: Ascidian Mitochondrial
14: Flatworm Mitochondrial
16: Chlorophycean Mitochondrial
21: Trematode Mitochondrial
22: Scenedesmus obliquus mitochondrial
23: Thraustochytrium mitochondrial code
16
[if for e.g. you are working with sequences of plant origin]

You have selected nuclear genetic code: 1 and mitochodrial genetic code: 16

--12:25:54--  http://srs.ebi.ac.uk/srs7bin/cgi-bin/wgetz?-e+-vn+2+[embl-Organism:Xerophyta%20humilis]%20&%20[embl-Description:complete]%20&%20[embl-Description:cds]%20&%20[embl-Molecule:RNA%20%7C%20mRNA]
=> `Xh_embl.db'
Resolving http-proxy.uct.ac.za... 137.158.128.106, 137.158.128.107, 137.158.128.105
Connecting to http-proxy.uct.ac.za[137.158.128.106]:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]

[ <=> ] 9,092 781.10B/s

12:26:10 (780.95 B/s) - `Xh_embl.db' saved [9092]


Build ESTScan Tables for Xh_embl.conf
-------------------------------------

Current parameters:
- organism: Xerophyta humilis
- database files are: Xh_embl.db
- UniGene data is in:
- ESTs for testing:
- data directory: .
- mRNA file is: ./mrna.seq
- EST file is: ./ests.seq
- ESTs with coding: ./Evaluate/estcds.seq
- ESTs without coding: ./Evaluate/estutr.seq
- training file is: ./training.seq
- test file is: ./test.seq
- clean UTR file is: ./Evaluate/rnautr.seq
- clean CDS file is: ./Evaluate/rnacds.seq
- HMM paramters file: ./Matrices/6_00030_0000001_4242.smat
- tuple size: 6
- min redundancy mask: 30
- added pseudocounts: 1
- minimum score: -100
- start profile length/preroll: 4/2
- stop profile length/preroll: 4/2
- nb of isochores: 1


Extracting mRNA entries...
- processing EMBL-file Xh_embl.db...
found 3 sequences, 906 coding nucleotides
- overall found 3 sequences, 906 coding nucleotides

Analyzing GC contents...
- generating GC-content histogram...
read 3 sequences
- written datafile and gnuplot script
- computing isochore borders...
isochores used: 0-100

Split mRNA data into isochores...
3 sequences found in isochore 0-100

Masking redundancy from isochores...
- masking redundancy in isochore 0-100
masked 0 of 1734 nucleotides (0%)

Writing codon usage tables...
- computing for isochore 0-100...
Xh_embl.conf done.

### WARNING ###
Have found only 1734 coding nucleotides for Xerophyta humilis
We suggest that at least 150000 coding nucleotides are used.
Using fewer *may* compromise construction of ESTScan matrix.
These may be complemented with pseudo entries built by prot4EST from the similarity stages.
It is a NON-fatal issue. Do you wish to use these results and utilise ESTScan?
[Y/N]
N

[For non-model organisms, it is unlikely that there will be enough sequence data available for constructing an ESTScan matrix. You can build one for your species at a later stage, once you have generated enough sequence data. Alternatively, you can build a matrix for a related species e.g. you could use a rice matrix for X. humilis sequences.]

(Still to include notes on building an ESTScan matrix.)



fetching a codon bias table for Xerophyta humilis.

this may take a few seconds

Searching, please wait...


### Warning! ###

Sorry, there are 0 matches for Xerophyta humilis.

Is the organism name correct?