Maintenance Manual - Student Home Pages - - Newcastle University









MAINTENANCE MANUAL

for

DRUTEX

Using Workflow technology to identify new bacterial drug-targets



Group A,
M.Sc. Bioinformatics, Newcastle University



March, 2010








Maintenance Manual

Authorization Memorandum



I have carefully assessed the Maintenance Manual for the DRUTEX system. This document has been completed using the Maintenance Manual template from Maura Lilienfeld, CHM Team, U.S. Department of Housing and Urban Development, 2005 (http://www.hud.gov/offices/cio/sdm/devlife/tempcheck.cfm).



MANAGEMENT CERTIFICATION - Please check the appropriate statement.

______ The document is accepted.



______ The document is accepted pending the changes noted.



______ The document is not accepted.



We fully accept the changes as needed improvements and authorize initiation of work to proceed. Based on our authority and judgment, the continued operation of this system is authorized.


_______________________________

_____________________

NAME


DATE

Project Leader


_______________________________

_____________________

NAME


DATE

Operations Division Director


_______________________________

_____________________

NAME


DATE

Program Area/Sponsor Representative


_______________________________

_____________________

NAME


DATE

Program Area/Sponsor Director






MAINTENANCE MANUAL

TABLE OF CONTENTS

1.0 GENERAL INFORMATION
    1.1 System Overview
    1.2 Points of Contact
        1.2.1 Technical support
2.0 SYSTEM DESCRIPTION
    2.1 System Architecture
    2.2 Security
3.0 ENVIRONMENT
    3.1 Taverna
    3.2 BioCatalogue
    3.3 myExperiment
    3.4 Compatibility
    3.5 Support Software Environment
    3.6 Setting up the system/reinstallation
4.0 SYSTEM MAINTENANCE PROCEDURES
    4.1 Responsibilities
    4.2 Performance Verification Procedure/Quality control
    4.3 Handling performance problems/errors
        4.3.1 System raises an error message
        4.3.2 System is producing unexpected output
        4.3.3 Program producing unexpected output
5.0 INFORMATION ABOUT EACH WORKFLOW UNIT
    5.1 Workflow 1: Read in two files in GenBank or EMBL format
        5.1.1 Overview
        5.1.2 Detailed description
    5.2 Workflow 2: Compare relatedness of two strains
        5.2.1 Overview
        5.2.2 Detailed description
    5.3 Workflow 3: Compare proteins to find those unique to pathogen
        5.3.1 Overview
        5.3.2 Detailed description
    5.4 Workflow 4: Pathogen proteins that are potential drug targets
        5.4.1 Overview
        5.4.2 Detailed description
    5.5 Workflow 5: Pin-point pathogen enzymes in KEGG diagrams
        5.5.1 Overview
        5.5.2 Detailed description







The Maintenance Manual presents information on the DRUTEX system. It is written for personnel who are responsible for the maintenance of the system and who need to understand the operating environment, security, and control requirements. It describes the programs in technical detail to assist the maintenance programmer.


1.0 GENERAL INFORMATION

1.1 System Overview

The system aims to identify new targets for existing drugs. It does this by first comparing the genomes of a pathogen and a non-pathogen to see whether they are closely related. It then finds the proteins in the pathogen that are unique (not present in the non-pathogen). These proteins are probably the cause of its pathogenicity. They are compared for sequence similarity to proteins which are targets of existing drugs. Those with a high similarity to targets of existing drugs are the potential new bacterial drug-targets. This would be a significant discovery, as we are running out of antibiotics because pathogens are becoming resistant to them.


Outline of its working:

1) The system reads in at least two files (EMBL or GenBank format): the source (non-pathogen) and the target (pathogen).

2) The system tests whether the two strains are closely related enough to produce a meaningful comparison.

3) The system outputs a list of proteins encoded by the target genome that do not have sequence similarity to those encoded by the source genome.

4) The system compares the unique protein list to a list of protein sequences that are known to be the targets of existing drugs.

5) The system produces a list of those proteins in the target organism that may be targets for known drugs, based on protein similarity.

6) The system pin-points the position of those proteins that are enzymes in KEGG pathway diagrams.


1.2 Points of Contact

1.2.1 Technical support

For further assistance or information about the maintenance of this product, please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456.








2.0 SYSTEM DESCRIPTION

2.1 System Architecture

The workflow will run on Taverna Server. e-Drugfinders can start the workflow on Taverna Server by using a web interface where the sequence files can be uploaded. For each part of the workflow, Taverna Server accesses an appropriate internet web service and submits a job to it. The web service processes the job and sends the results back to Taverna Server. When the workflow is complete, Taverna Server will display the results through the web interface. It will also connect to e-Drugfinders’ Ondex warehouse and update it with any new potential drug targets it discovers.

The software consists of workflows created using the Taverna platform. Each workflow, once executed, supplies an output that is accepted as an input by the next workflow in the sequence.
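The chaining described above, where each workflow's output becomes the next workflow's input, can be illustrated outside Taverna with a minimal Python sketch. It is not part of DRUTEX: `run_chain` and the two stand-in stages are hypothetical names chosen for illustration, and the real stages are Taverna workflows rather than Python functions.

```python
from typing import Any, Callable, List, Tuple

# A stage is a (name, function) pair; both are hypothetical stand-ins
# for the Taverna workflows described in section 5.0.
Stage = Tuple[str, Callable[[Any], Any]]


def run_chain(stages: List[Stage], initial_input: Any) -> Any:
    """Run each stage in order, feeding its output to the next stage."""
    data = initial_input
    for name, stage in stages:
        data = stage(data)
        print(f"{name}: finished")
    return data


def parse_sequences(paths: List[str]) -> List[str]:
    """Stand-in for a parsing step: pretend each path yields one sequence."""
    return [path.upper() for path in paths]


def count_sequences(sequences: List[str]) -> int:
    """Stand-in for a step that reports how many sequences it received."""
    return len(sequences)


if __name__ == "__main__":
    stages = [("parse", parse_sequences), ("count", count_sequences)]
    print(run_chain(stages, ["source.embl", "target.embl"]))
```

Keeping the stages in an ordered list makes it easy to run a prefix of the chain when a problem has to be narrowed down to one workflow.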



Figure representing System Architecture


2.2 Security

Although there are currently no provisions for keeping this system secure, the company implementing the software is free to apply them. Usage of this product for illegal purposes might lead to [...]. SPLUGE will accept no liability for loss or damage of the product.






3.0 ENVIRONMENT

3.1 Taverna

Taverna Workbench is an open source tool for designing and executing workflows. It allows users to connect a number of bioinformatics services into one process. Workflows are written in the Scufl workflow language. The myExperiment social web site is a repository for Taverna workflows and has special support for Scufl workflows. See http://www.taverna.org.uk/.


3.2 BioCatalogue

The BioCatalogue is a curated registry of biological Web Services. It was launched in June 2008 at the Intelligent Systems for Molecular Biology Conference. The project is a collaboration between the myGrid project at the University of Manchester, led by Carole Goble, and the European Bioinformatics Institute, led by Rodrigo Lopez. See http://www.biocatalogue.org/.


3.3 myExperiment

myExperiment is a social networking website for scientists sharing Research Objects such as scientific workflows and experiment plans. It was launched in November 2007. See http://www.myexperiment.org/.


3.4 Compatibility

System requirements for Taverna:

- Taverna is freely available online and works with all web browsers and operating systems (Windows XP, Windows Vista, Windows 7, Mac OS X 10.4 and higher).
- 1 GB of memory is recommended for Taverna to work efficiently. It may run with less memory, but performance will be slower than expected.
- Java 1.5 or higher must be installed.
- The GraphViz application (not required for Windows users).

For more details, see http://www.taverna.org.uk/download/taverna-2-1/system-requirements/

DRUTEX should be capable of working on all operating systems, although testing of this is a work in progress.
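The Java requirement listed above (1.5 or higher) can be checked by hand by running `java -version`. As a rough illustration, the sketch below interprets a version string of the kind that command reports. The function names are hypothetical, and the sketch assumes purely numeric version strings such as `1.5.0_22` or `11.0.2`; other formats, such as early-access builds, are not handled.

```python
def java_major_version(version: str) -> int:
    """Return the major Java version from a string such as '1.5.0_22' or '11.0.2'."""
    parts = version.strip().split(".")
    first = int(parts[0])
    # Legacy scheme: '1.5.0_22' means Java 5. Modern scheme: '11.0.2' means Java 11.
    if first == 1 and len(parts) > 1:
        return int(parts[1])
    return first


def meets_taverna_requirement(version: str, minimum_major: int = 5) -> bool:
    """Check the version against the 'Java 1.5 or higher' requirement."""
    return java_major_version(version) >= minimum_major


if __name__ == "__main__":
    for candidate in ["1.4.2_19", "1.5.0_22", "1.6.0_45", "11.0.2"]:
        print(candidate, meets_taverna_requirement(candidate))
```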


3.5 Support Software Environment

- M-GCAT, a genome comparison and alignment tool, is used in Workflow 2 to compare the two genomes.
- Java has been used to write the supporting code.
- GlassFish uses a derivative of Apache Tomcat as the servlet container for serving Web content, with an added component called Grizzly which uses Java NIO for scalability and speed.

3.6 Setting up the system/reinstallation

All instructions for getting DRUTEX running on your system are outlined in section III, Setup in the User Manual. For further details, please go to the website www.taverna.org/ or, alternatively, follow the step-by-step instructions on the installation CD.






4.0 SYSTEM MAINTENANCE PROCEDURES

This section provides information about the specific procedures necessary for the programmer to maintain the collective software units that make up the system.

4.1 Responsibilities

The person responsible for maintaining the software should be qualified in computing. If that person is unable to fix the problem, SPLUGE will cover the costs of any replacement within 1 year of the warranty period.


4.2 Performance Verification Procedure/Quality control

We recommend that you perform system maintenance as a prerequisite to using the system. This will ensure that the system is functioning optimally before it is used to generate biological data. To do this, follow the steps outlined on the maintenance CD. The CD will guide you through a procedure which uses standardized test input values to test whether the system is working as expected.


4.3 Handling performance problems/errors

4.3.1 System raises an error message

Possible causes are:

- Unexpected input in the dialogue box asking for the threshold value.
- Incorrect file path or file format. If you have confirmed that the file paths and formats are correct, please reinstall the software by following the instructions in 3.6 Setting up the system/reinstallation.
- The BLAST or KEGG servers are down. For the Taverna workflow to run, it must access two external servers, BLAST (http://blast.ncbi.nlm.nih.gov/) and KEGG (http://www.genome.jp/kegg/). If you think these servers might be causing a problem, please navigate to the relevant website to confirm whether it is working.
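The manual check of the two external servers can also be scripted. The following is a minimal sketch using only the Python standard library and the two URLs given above. The function names are hypothetical, and a successful HTTP response only shows that the website answers, not that the web services used by the workflow are healthy.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict

# URLs taken from the list above.
SERVERS = {
    "BLAST": "http://blast.ncbi.nlm.nih.gov/",
    "KEGG": "http://www.genome.jp/kegg/",
}


def is_reachable(url: str,
                 opener: Callable = urllib.request.urlopen,
                 timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        return False


def check_servers(opener: Callable = urllib.request.urlopen) -> Dict[str, bool]:
    """Check every server in SERVERS and report which ones responded."""
    return {name: is_reachable(url, opener) for name, url in SERVERS.items()}
```

Calling `check_servers()` returns a dictionary mapping each server name to `True` or `False`. The `opener` parameter exists only so that the check can be exercised without network access.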


For any other errors, please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456.

4.3.2 System is producing unexpected output

In the general case, please test individual workflow components to check whether they work in isolation. The next step is to connect up all the workflows that are working. This will allow you to isolate the problem to a particular workflow. If you are unable to progress further with a solution to the problem, please email your query to J.S.Steyn@ncl.ac.uk or call our customer support line on 0191 123456, quoting the particular workflow the problem has been isolated to.
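The isolation procedure above amounts to running each workflow on a known input and comparing the result with a known expected output. A minimal sketch of that idea follows. The names `WorkflowCheck` and `first_failing` are hypothetical, and the callables stand in for however each Taverna workflow is actually launched.

```python
from typing import Any, Callable, List, NamedTuple, Optional


class WorkflowCheck(NamedTuple):
    """One isolated test: a workflow, a known input and the expected output."""
    name: str
    run: Callable[[Any], Any]
    test_input: Any
    expected: Any


def first_failing(checks: List[WorkflowCheck]) -> Optional[str]:
    """Run each workflow in isolation; return the first failing name, or None."""
    for check in checks:
        try:
            actual = check.run(check.test_input)
        except Exception as error:  # any failure counts as a failing workflow
            print(f"{check.name}: raised {error!r}")
            return check.name
        if actual != check.expected:
            print(f"{check.name}: expected {check.expected!r}, got {actual!r}")
            return check.name
        print(f"{check.name}: ok")
    return None
```

The returned name is the workflow to quote when contacting support.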

4.3.3 Program producing unexpected output

Please navigate to section 4.2 Performance Verification Procedure. This will ask you to test the system with standardized input to see whether the output is as expected. If this is not the case, please contact support.






5.0 INFORMATION ABOUT EACH WORKFLOW UNIT

This section provides a detailed description of each software unit. This allows the maintainer of the workflow to understand how each workflow functions so that, if there is a problem with the workflow proper, each component workflow can be isolated and tested individually.

5.1 Workflow 1: Read in two files in GenBank or EMBL format

5.1.1 Overview

This workflow reads in two files in either GenBank or EMBL format and parses this information into two output files containing only the genomic sequences.

This is the first step in the workflow proper. The inputs are 2 files in GenBank or EMBL format. One file is a genome from a non-pathogenic bacterium; the other is from a (supposedly) related bacterium that is pathogenic. The outputs are 2 files containing the sequences only.

5.1.2 Detailed description

The base directory contains the list of proteins in the target organism. The remaining two inputs are the actual EMBL files for the target and source organisms. The ReadEMBLDatabase service uses BioJava to retrieve the protein sequences from each EMBL file. Each protein from the target genome is written to a separate file so that it can be BLASTed against the source database in Workflow 2. The list of protein sequences from the source organism in the original file will later represent this database. The output from the workflow is an integer: the number of proteins in the target organism, and therefore the total number of files to be BLASTed against the source database.
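The step that writes each target protein to its own file and returns the count can be sketched as follows. This is an illustration in plain Python rather than the BioJava-based ReadEMBLDatabase service. The function name, the `protein_N.fasta` file naming and the use of FASTA for the per-protein files are assumptions, and the EMBL parsing itself is left out.

```python
import os
import tempfile
from typing import List, Tuple


def write_protein_files(proteins: List[Tuple[str, str]], base_dir: str) -> int:
    """Write each (identifier, sequence) pair to its own FASTA file.

    Returns the number of files written, which is the number of BLAST
    searches the next stage would have to run.
    """
    os.makedirs(base_dir, exist_ok=True)
    for index, (identifier, sequence) in enumerate(proteins, start=1):
        # Hypothetical naming scheme: protein_1.fasta, protein_2.fasta, ...
        path = os.path.join(base_dir, f"protein_{index}.fasta")
        with open(path, "w") as handle:
            handle.write(f">{identifier}\n{sequence}\n")
    return len(proteins)


if __name__ == "__main__":
    # Two made-up proteins written to a temporary directory.
    with tempfile.TemporaryDirectory() as working_dir:
        count = write_protein_files([("p1", "MKTAYIAK"), ("p2", "MSLNEQ")], working_dir)
        print(count)  # prints 2
```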








5.2 Workflow 2: Compare relatedness of two strains

5.2.1 Overview

This workflow compares the genomes of the non-pathogen and the pathogen to see whether the two strains are related. The threshold level of similarity can be set by the user using the dialogue box that appears when the workflow is run.

This is the second step in the workflow proper. The inputs are two files containing sequences in GenBank or EMBL format. The output is ‘1’ or ‘0’. ‘1’ indicates that the two strains are sufficiently similar and the workflow can progress to the next stage, which compares the proteins of the two genomes. ‘0’ indicates that the two strains are too distantly related for a meaningful comparison, so the system exits.


5.2.2 Detailed description

The inputs are the file paths to the FASTA files containing the full genome sequences from each organism. INI_Path is the full path to the configuration file used by M-GCAT. A number of regular expressions parse the file. MGCAT_Path is the path to the M-GCAT executable. INI_File is the name of the configuration file (not the path). Working_Dir is the directory in which the output of M-GCAT will be stored. The workflow runs M-GCAT, which aligns the two genomes and produces a log file. The service Read_Text_File_2 reads the log file and extracts the genome similarity. A dialogue box asks the user to choose the minimum genome similarity required. Finally, the Thresholder tests whether the similarity is greater than or equal to the threshold set by the user.
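The last two steps, extracting the similarity from the log and applying the threshold, can be sketched as below. The log line format matched by the regular expression is an assumption made for illustration, as the real M-GCAT log may look different. Only the comparison rule (greater than or equal to the threshold gives 1, otherwise 0) comes from the description above.

```python
import re
from typing import Optional

# Hypothetical log format, e.g. "Genome similarity: 87.5".
# The pattern must be adapted to the actual M-GCAT log file.
SIMILARITY_PATTERN = re.compile(
    r"similarity\s*[:=]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE
)


def extract_similarity(log_text: str) -> Optional[float]:
    """Return the first similarity value found in the log, or None."""
    match = SIMILARITY_PATTERN.search(log_text)
    return float(match.group(1)) if match else None


def thresholder(similarity: float, threshold: float) -> int:
    """Return 1 if the similarity is >= the user's threshold, else 0."""
    return 1 if similarity >= threshold else 0


if __name__ == "__main__":
    similarity = extract_similarity("Alignment done\nGenome similarity: 87.5\n")
    if similarity is not None:
        print(thresholder(similarity, 80.0))
```

Returning `None` when no value is found lets the caller distinguish a malformed log from a genuinely low similarity.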

5.3 Workflow 3: Compare proteins to find those unique to pathogen

5.3.1 Overview

This workflow takes three inputs (e-value threshold, number of files, and basedir, the working directory) and extracts the proteins that are unique, i.e. those that exist in the pathogen (accounting for pathogenicity) but not in the source (non-pathogen).

The output is a file containing a list naming the proteins that are unique to the pathogen.
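One plausible reading of "unique" here is that a pathogen protein has no BLAST hit in the source proteome at or below the e-value threshold. The sketch below illustrates that rule only. It is an assumption, not the workflow's confirmed logic, and the function name and input structure are hypothetical. Each protein name is mapped to its best (lowest) e-value against the source, or to `None` when there was no hit at all.

```python
from typing import Dict, List, Optional


def unique_to_pathogen(best_evalues: Dict[str, Optional[float]],
                       evalue_threshold: float) -> List[str]:
    """Return the proteins with no source hit at or below the threshold.

    A lower e-value means a more significant match, so a protein counts
    as shared with the source when its best e-value is <= the threshold.
    """
    unique = []
    for protein, evalue in best_evalues.items():
        if evalue is None or evalue > evalue_threshold:
            unique.append(protein)
    return unique


if __name__ == "__main__":
    # Made-up example values for three target proteins.
    hits = {"protA": 1e-50, "protB": None, "protC": 0.5}
    print(unique_to_pathogen(hits, 1e-5))
```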

5.3.2 Detailed description

This is a work in progress. This workflow runs on Unix systems but currently has problems running on Windows; our experts are working on this.






5.4 Workflow 4: Pathogen proteins that are potential drug targets







5.4.1 Overview

This workflow compares the unique protein list to a list of protein sequences that are known to be the targets of existing drugs. This results in a list of pathogen proteins which are potential targets for known drugs, based on protein similarity that passes a chosen e-value threshold.

This is the fourth step in the workflow proper. The input is a list of proteins that are unique to the pathogen. The output is a list naming the subset of proteins unique to the pathogen that also show global sequence similarity to the targets of known drugs.

5.4.2 Detailed description

The inputs to the workflow are:

- path_proteins: a single path to a file containing a list of file paths of single protein sequences, in FASTA format, that are unique to the pathogen,
- drugs_file: a path to a file containing the amino acid sequences, in FASTA format, of all known drug-affected proteins, which was supplied by e-Drugfinders,
- DBSavePath: a path to the working directory storing the files required for the workflow to run.

FormatDB_Path and Blastall_path contain only strings specifying paths to the location of the local BLAST database and BLAST output. The workflow creates a local BLAST database, then runs the nested workflows. The first nested workflow runs a BLAST search of each protein unique to the pathogen against the database. The output from this goes into the second nested workflow. A dialogue box prompts the user for a threshold value, which is then used to filter the BLAST results.

The only output is a flattened list of GI numbers of proteins that are similar to the drug-affected proteins according to the threshold entered by the user, which is the e-value cut-off they are willing to accept (lower e-values indicate stronger matches).
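The filtering and flattening step can be sketched as follows, assuming the BLAST results are available in the 12-column tab-separated format produced by legacy `blastall -m 8`, where the e-value is the eleventh column. Whether DRUTEX uses that output format is an assumption, as are the function names and the choice to report the GI number of the query (pathogen) protein.

```python
from typing import List


def gi_from_id(sequence_id: str) -> str:
    """Extract the GI number from an identifier such as 'gi|111|ref|NP_1|'."""
    fields = sequence_id.split("|")
    if len(fields) >= 2 and fields[0] == "gi":
        return fields[1]
    return sequence_id  # fall back to the raw identifier


def filter_hits(tabular_outputs: List[str], evalue_threshold: float) -> List[str]:
    """Flatten per-query BLAST outputs into one list of passing GI numbers.

    Each element of tabular_outputs is the tabular result for one query.
    A hit passes when its e-value is at or below the threshold.
    """
    kept: List[str] = []
    for output in tabular_outputs:
        for line in output.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            columns = line.split("\t")
            query_id, evalue = columns[0], float(columns[10])
            gi = gi_from_id(query_id)
            if evalue <= evalue_threshold and gi not in kept:
                kept.append(gi)
    return kept
```

Each GI number appears at most once in the result, even when a protein has several qualifying hits.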


















5.5 Workflow 5: Pin-point pathogen enzymes in KEGG diagrams

5.5.1 Overview

This workflow takes the list from the previous workflow, which names pathogen proteins that are potential drug targets, and pin-points their position in KEGG pathway diagrams. If the proteins are not enzymes then no KEGG pathway will apply.

This is the final step in the workflow proper. The inputs are 1) a three-letter code representing the species’ name, 2) the species id, and 3) background and foreground colours (green and red, respectively). If the protein is an enzyme, the outputs are a list of KEGG Pathway IDs indicating the pathways in which the enzyme participates and a coloured image pin-pointing the position of the enzyme in these pathways. If the protein is not an enzyme, then the workflow will return an empty list.








5.5.2 Detailed description

In the nested workflow, the species name and id are concatenated into a single query string for use by KEGG. For example:

hsa (species name) + string1_value (colon) = hsa:
hsa: + 1487 (gene id) = hsa:1487 (query string)

The workflow uses the query string to check in KEGG whether there is a corresponding enzyme. If one is present, the position of the enzyme is pin-pointed in the KEGG image with a green background and a red foreground. The workflow also uses the query string to find the KEGG Pathway IDs in which the enzyme is involved.
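The two-step concatenation in the example above can be written as a short function. This is only a sketch of the string handling. The function name is hypothetical, and the lookup in KEGG itself is not shown.

```python
def kegg_query(species_code: str, gene_id) -> str:
    """Build a query string such as 'hsa:1487' from its two parts."""
    with_colon = species_code + ":"    # e.g. 'hsa' + ':'  -> 'hsa:'
    return with_colon + str(gene_id)   # e.g. 'hsa:' + '1487' -> 'hsa:1487'


if __name__ == "__main__":
    print(kegg_query("hsa", 1487))  # prints hsa:1487
```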







5.6 Combined Workflows 4-6