How to install and use CBSU P-IPRSCAN on a CCS or HPC 2008 cluster


13 Dec 2013





Questions and comments:
Jarek Pillardy (jarekp@tc.cornell.edu)
Robert Bukowski (bukowski@tc.cornell.edu)

Overview

CBSU P-IPRSCAN is a set of codes designed for InterProScan analysis of large numbers of sequences. The analysis is run in parallel, with each compute node executing an instance of IPRSCAN on a single query sequence using the databases stored locally on this node. The parallel processing is facilitated by an MPI program called pdriver_iprscan.exe. One process spawned by this program serves as a "master" process supervising the work of the "worker" processes. Submission of a pdriver_iprscan.exe job to the queue reserves a number of nodes on which the MPI processes run. The master process then takes care of splitting the original query file into one-sequence chunks and "farming out" those chunks to the worker processes. Once a worker process has finished processing a chunk, it reports this fact to the master, which collects the output and sends another chunk of work to this worker. In this way, all workers are kept busy all the time (with the exception of the final stage of the calculation, when the number of workers exceeds the number of query chunks left to process).
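The chunking and farming scheme described above can be sketched as follows. This is an illustrative Python model, not the actual MPI code of pdriver_iprscan.exe; the function names and the simulated per-chunk run times are ours.

```python
import heapq

def split_fasta(text):
    """Split a multi-FASTA query into one-sequence chunks, as the
    master process does with the original query file."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def farm_out(chunk_costs, num_workers):
    """Model of dynamic farming: each chunk goes to whichever worker
    becomes idle first, so one slow sequence never stalls the queue.
    chunk_costs are simulated per-chunk analysis times."""
    workers = [(0.0, w) for w in range(1, num_workers + 1)]  # (idle_at, id)
    heapq.heapify(workers)
    schedule = []                                 # (chunk_index, worker_id)
    for i, cost in enumerate(chunk_costs):
        idle_at, w = heapq.heappop(workers)       # first worker to report back
        schedule.append((i, w))
        heapq.heappush(workers, (idle_at + cost, w))
    makespan = max(t for t, _ in workers)         # time when all workers finish
    return schedule, makespan
```

With one expensive chunk and two workers, the model shows the second worker absorbing all the cheap chunks while the first is busy, which is exactly why the dynamic scheme beats a static pre-partitioning of the query file.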


Currently, P-IPRSCAN operates with version 4.4 of IPRSCAN.


The rest of this manual describes the details of the installation of P-IPRSCAN. A lot of useful information about how the program operates can also be found in the form of comments in the doc\example\run_aa.bat and doc\example_dna\run_nuc.bat scripts.


Files in this distribution


P-IPRSCAN consists of two basic parts: 1) the IPRSCAN suite ported by us to Windows (this is what you need to run an IPRSCAN analysis on one machine) and 2) a set of P-IPRSCAN tools facilitating the parallelization. The tar archive CBSU_PIPRSCAN.tgz will decompress into the following files and directories:



- PIPRSCAN_execs: precompiled executables of the P-IPRSCAN tools

- CBSUtools: the Visual Studio 2008 project containing the sources of all P-IPRSCAN tools (except for the actual IPRSCAN suite); this directory should not be needed unless the programs have to be recompiled.

- WIN_iprscan: Windows version of IPRSCAN, ported by CBSU staff:

  o iprscan_4_4.tgz: the IPRSCAN suite ported to Windows; contains all the perl scripts and executables needed to run IPRSCAN on a single Windows machine

  o perlmod_iprscan.tgz: collection of perl modules used by IPRSCAN

  o add_libs.tgz: additional DLL files used by EMBOSS executables called from within IPRSCAN

- doc: contains this document and the subdirectories example and example_dna with sample input, submission scripts (slightly different for the cases of amino acid and DNA input), and output from a sample run.

In addition to the tgz archive described above, you will need to download the database files iprscan_4_4_data_NOPHTR_22.0_nomatch.tgz (3 GB), iprscan_4_4_data_PTHR_14.0.tgz (849 MB), and match_complete_22.0.xml.gz (1.5 GB) from ftp://cbsuftp.tc.cornell.edu/software/CBSUtools/WIN_IPRSCAN. The names of the files reflect the current versions of the databases and may change in the future. For more information about the databases, including updating instructions, please consult the file WINDOWS_INSTALL.txt (present in iprscan_4_4.tgz). The database files will have to be propagated and unpacked on each node, as described later in this document.


Prerequisites



- You need to have a network share accessible to all the nodes in your cluster. In our case, this share is mapped (through the "net use" command) to drive H:.

- Somewhere on H: you will need to set up directories where certain executables will be stored. In our case these are H:\CBSU\bin_x64 for serial executables and H:\CBSU\ptools\mpims for MPI applications. This distinction is not essential, so you can have all executables in one (shared) directory with a name of your choice.

- A subdirectory of H: will also be used as the job directory, where all the files needed to launch a job will be deposited and where the output will eventually appear.

- On each compute node there must be a local scratch drive called T:. The sample script file doc\example\run.bat contains instructions on what to do if your local scratch drive is somewhere else than on T:\.

- The permissions on T:\ must be set to FULL CONTROL for any user who submits the P-IPRSCAN job. There must be a directory T:\CBSU with all files and subdirectories readable and executable by everyone (but not necessarily writable).

- Perl 5 has to be installed on each and every node (complete with the path setup, so that perl scripts can be run directly from the command line). Our installations work with version 5.8.7 of perl.

- CYGWIN has to be installed on each and every node in C:\CYGWIN. IPRSCAN uses UNIX commands taken from C:\CYGWIN\bin. We have not tested it, but it is likely that the program will work with any Windows implementation of UNIX tools as long as they are in the search PATH. The version of CYGWIN used on our clusters is 5.2 1.5.10.

- The command pskill has to be available on each node. This command is part of PsTools, which can be downloaded (see http://www.microsoft.com/technet/sysinternals/utilities/pskill.mspx) and simply put somewhere in the PATH.

- Although not essential, it is also useful (especially during installation) to have a set of UNIX tools for Windows (such as tar, gzip, etc.), which can be obtained from http://projects.mindtel.com/2005/1213.UnixUtils.distro/

- If you need to compile your own version of the P-IPRSCAN-related executables (the sources are included), the latest version of Visual Studio 2008 will have to be installed somewhere, not necessarily on the cluster. There is a good chance, though, that the ready-made executables provided by us will work.

Installation



- Place the P-IPRSCAN tools executables from the PIPRSCAN_execs directory (see the next section for more details) in some shared directory on H:. Here at CBSU we use H:\CBSU\bin_x64 for sequential auxiliary programs (like clean_t.exe and MachineMaker.exe) and H:\CBSU\ptools\mpims for MPI applications (pdriver_iprscan.exe and pbatch.exe).



- The Windows-ported IPRSCAN suite (including databases) needs to be installed on each node where P-IPRSCAN will run. To do this, transfer the files iprscan_4_4.tgz, add_libs.tgz, and perlmod_iprscan.tgz to the directory T:\CBSU and unpack each file. As a result, the following directories will be created:

  o T:\CBSU\iprscan_4_4: location of IPRSCAN, its databases, and working directories

  o T:\CBSU\add_libs: location of the additional DLL files used by EMBOSS executables called by IPRSCAN

  o T:\CBSU\perlmod_iprscan: location of perl modules needed by IPRSCAN



- Transfer the database files iprscan_4_4_data_NOPHTR_22.0_nomatch.tgz and iprscan_4_4_data_PTHR_14.0.tgz to T:\CBSU and unpack. This will create the database directories T:\CBSU\iprscan_4_4\data and T:\CBSU\iprscan_4_4\data\Panther. Place the file match_complete_22.0.xml.gz in T:\CBSU\iprscan_4_4\data and unpack; this will create the file match_complete.xml. We recommend using the 7za.exe program (http://www.7-zip.org/) for unpacking the database files (for more details on unpacking the database files, see ftp://cbsuftp.tc.cornell.edu/software/CBSUtools/WIN_IPRSCAN/WINDOWS_INSTALL.txt).



- Note that the database files are large, and during unpacking the compressed versions will have to coexist with the uncompressed ones. Thus, if disk space is limited, it is important to install the files one by one and delete "leftover" compressed files when they are no longer needed. If executed this way, the peak disk requirement of the database installation will be about 30 GB. Once fully installed (and after deleting all the "leftover" archive files), the current databases occupy 28.5 GB of disk space.



- Make sure that the directory T:\CBSU\iprscan_4_4\tmp is readable and writable by anyone who submits a job (temporary IPRSCAN files will be stored in this directory, and each job will attempt to clean it at the beginning).

P-IPRSCAN tools

These executables are collected in the PIPRSCAN_execs directory of the distribution and should work right away (after having been placed in the proper shared directories). The sources are in the VS2008 solution CBSUtools and can be recompiled, if needed (after compilation the executables will end up in CBSUtools\debug). Of course, Visual Studio 2008 needs to be properly set up before compilation is attempted.



- MPI executables (in some shared directory, e.g., H:\CBSU\ptools\mpims\):

  o pdriver_iprscan.exe: the parallel driver

  o pbatch.exe: the parallel batch utility

- Serial executables (in some shared directory, e.g., H:\CBSU\bin_x64\):

  o clean_t.exe: the T: drive cleaner (removes everything from T:\ except for the T:\CBSU directory)

  o MachineMaker.exe: produces the machines file(s); it is copied from the shared directory to T:\ on the master node


Windows version of IPRSCAN



IPRSCAN is quite a complicated system of perl scripts which launch various sequence analysis tools, parse the outputs, and present them to the user in a unified fashion. The CBSU staff ported all these scripts (originally designed for UNIX systems) to Windows. On each node, IPRSCAN is installed in the directory T:\CBSU\iprscan_4_4, which contains, besides scripts and binaries, the databases (in the subdirectory data) and the working directory tmp, where temporary files are stored during the IPRSCAN run. The binaries of the sequence analysis programs used by IPRSCAN (like blast, hmmpfam, pfscan, etc.) are taken from their respective official distributions. In some cases (hmmpfam, hmmsearch), the original version had to be ported to Windows. Our IPRSCAN distribution directory contains a few original README files and comments related to the UNIX installation, most of them of no relevance to the Windows platform; it is better not to pay too much attention to those files. Instructions specific to the Windows distribution can be found in the file WINDOWS_INSTALL.txt.


How to run a P-IPRSCAN job

Once all the prerequisites are met, you can proceed to submitting a P-IPRSCAN job. Here is what you need to do:



- create a job directory on the shared drive, for example, H:\jobdir



- place the sample files run_aa.bat (for amino acid input) or run_nuc.bat (for DNA input), ccp.bat, and options from the doc\example (or doc\example_dna) directory in H:\jobdir.



- copy your FASTA query file (e.g., doc\example\iprscan_job) to H:\jobdir



- modify run_aa.bat (or run_nuc.bat) by adjusting the following environment variables (look for "set" commands):

  o CBSUH: network share where the binaries and the job directory are located

  o CBSUW: job directory (on the network share)

  o CBSUT: local scratch drive on compute nodes

  o INFILE: name of the input FASTA file

  o NUMNODES: the number of nodes (not processors!) the job will be submitted on

  o NUMPROCS: should be equal to %NUMNODES% + 1

  o LOCJOBDIR: local directory (on %CBSUT%) where the job's files will be stored on compute nodes

  o MPIBINDIR: location of the MPI binaries on the network share (be sure to express this in terms of %CBSUH%, as in the sample script)

  o BINDIR: location of the sequential executables on the network share (be sure to express this in terms of %CBSUH%, as in the sample script)
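The two settings most often gotten wrong are NUMPROCS (which must be NUMNODES + 1, one worker per node plus the master) and MPIBINDIR/BINDIR written as literal paths instead of in terms of %CBSUH%. A small sanity-check sketch (the variable names come from the sample script; the checking function and the example values are ours):

```python
def check_job_settings(env):
    """Return a list of problems found in the 'set' variables above."""
    errors = []
    if int(env["NUMPROCS"]) != int(env["NUMNODES"]) + 1:
        errors.append("NUMPROCS must equal NUMNODES + 1 "
                      "(one worker per node plus the master process)")
    for var in ("MPIBINDIR", "BINDIR"):
        if "%CBSUH%" not in env[var]:
            errors.append(var + " should be expressed in terms of %CBSUH%")
    return errors
```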



- edit the file options and modify it as needed (see the description of options below).



- submit run_aa.bat (or run_nuc.bat) to the queue in exclusive mode (i.e., make the scheduler allocate a number of nodes for the job, making sure that no other jobs will be scheduled to these nodes). The number of requested nodes should be equal to the number of MPI processes (specified with the "-np" option of mpiexec while launching the pdriver_iprscan.exe program) minus 1. This way, each allocated node will be running one instance of IPRSCAN, and one node will additionally run the master MPI process. While submitting the job, specify the value of %CBSUH%:\jobdir as the task's working directory (i.e., the variable %CBSUH% should be expanded). Due to the multithreading capability of the HMMER executables, IPRSCAN instances running on multi-core nodes will use all the available cores (or as many as configured in the IPRSCAN configuration files).



- During a run, a number of files will appear in the job directory, most of them log files from various stages of the job. The progress of the run is recorded in the master log file (machine_name_0000.log, where machine_name is the name of the machine the master process is running on). The actual IPRSCAN output will be collected in the file iprscan_job.out (where, in our example, iprscan_job is the query file name). The directories doc\example and doc\example_dna show what a job directory should look like once a job is finished. A few special files to look out for:

  o iprscan_job.problems: for some sequences the analysis fails, usually in the BLAST part. This file collects (in FASTA format) all the sequences that caused problems.

  o iprscan_job.nomatch: this FASTA file collects all sequences for which the analysis succeeded but gave no significant hits. This feature works only when the "txt" output format is selected (format parameter "1" in the file options; see below).

  o iprscan_job.restart: a checkpoint file updated each time a sequence is analyzed. Thanks to this file, a P-IPRSCAN job can be easily restarted.



- If for any reason the job does not finish (for example, it times out or the cluster crashes), it can be restarted simply by resubmitting the same run.bat script on the same number of processors (absolutely no changes in the job directory are required). At the beginning, pdriver_iprscan.exe always checks for the presence of the checkpoint file, and if it finds it, the calculation is restarted from where it left off instead of starting from scratch. If instead you need to restart the calculation from the beginning in the same directory, all the output files (and in particular the checkpoint file) from the previous run should be removed first.




Contents of the options file


Example:

15                             minimum sequence length
"-cli -iprlookup -goterms "    options passed on to IPRSCAN
3                              timeout treatment
600                            initial timeout (in seconds)
20                             multiplier
1                              output format



Meaning of the parameters (if not obvious):

- minimum sequence length: sequences shorter than this will not be analyzed at all



- timeout treatment: determines the timeout for a single sequence analysis:

  0: multiplier times the average one-sequence analysis time (averaged over all sequences analyzed so far)

  1: multiplier times the maximum one-sequence analysis time (maximized over all sequences analyzed so far)

  2: fixed to the initial timeout

  3: multiplier times the average one-sequence analysis time (averaged over all sequences analyzed so far), but no less than the initial timeout



- multiplier: see timeout treatment above
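Numerically, the four treatments amount to the following. This is a reimplementation sketch of the rule as described, not the driver's actual code; falling back to the initial timeout when no sequence has finished yet is our assumption:

```python
def sequence_timeout(treatment, multiplier, initial_timeout, times_so_far):
    """Timeout (in seconds) for the next one-sequence analysis."""
    if treatment == 2 or not times_so_far:
        return initial_timeout                 # fixed, or no history yet
    average = sum(times_so_far) / len(times_so_far)
    if treatment == 0:
        return multiplier * average
    if treatment == 1:
        return multiplier * max(times_so_far)
    if treatment == 3:
        return max(multiplier * average, initial_timeout)
    raise ValueError("timeout treatment must be 0, 1, 2, or 3")
```

With the sample options file above (treatment 3, initial timeout 600 s, multiplier 20), each sequence is given 20 times the running average analysis time, but never less than 600 s.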



- output format: an integer selected according to the following table:

  1     2     3     4      5
  txt   raw   xml   html   ebixml