Supplementary Data - Bioinformatics


Example Next Generation Sequencing Workflow using KNIME

Table of Contents

Example Next Generation Sequencing Workflow using KNIME
Table of Contents
General Workflow Description
Loading workflow
Dark blue - Data cleansing and alignment to reference genome
White - 1st steps for data analysis of aligned sequences
Orange - unique sequences, R-integration
Pink - mutation analysis
Green - ROIs
Light Blue - BED/BEDGraph files
Integration into Galaxy/Mobyle
REFERENCES

General Workflow Description

The workflow described here in detail is an example of how to use KNIME for Next Generation Sequencing (NGS) data analysis. In particular, we describe typical parts of a data analysis workflow from the areas of RNAseq, DNAseq, and ChIPseq. It does not try to show a complete or perfect workflow, but rather to point out some of the things that can be done, and how they can be done, with explicit usage of the NGS-relevant tools. Other topics, like the integration of R or the generation of analysis reports, are not in the scope of this application note.

In general, this workflow reads in FastQ formatted data, cleans and filters it, and then aligns it to a reference genome (hg19) using bowtie (dark blue section). The results are read and filtered (nodes without background color) and then converted into a BED/BEDGraph file (http://genome.ucsc.edu/goldenPath/help/bedgraph.html) for visualization using e.g. GBrowse or the UCSC genome browser (light blue). Regions of interest are identified (green), mutations are analyzed (pink), and uniquely aligning sequences are identified and then handed over to an R instance (orange). We use a publicly available dataset for the examples outlined below (WoldLab, 2008).



Figure 1. Screenshot of the complete workflow. See text for further information.

From the first screenshot (Figure 1) we can already see that workflows can be annotated using Annotations, i.e. by coloring the background of different sections and writing text into them (dark blue). We will now describe the different steps in detail.

Loading workflow

In order to use the workflow that is provided as supplementary material and described here, one first has to download the workflow file (NGS_sample_workflow.zip). Since it was not possible to upload the full file to the Bioinformatics web page or to the myExperiment web page, the version for download is limited to the first 100,000 entries from the FastQ file. The workflow with the partial data can be downloaded through the Bioinformatics web site or through the following link: http://www.myexperiment.org/workflows/2183.html. KNIME can be downloaded from http://knime.org/downloads/overview. Depending on the downloaded file, it has to be executed or extracted; then the application is ready to start. KNIME will ask for a workflow directory. This is the root folder where the workflows are going to be stored, and there should be sufficient space available at this location. Then the additional modules have to be loaded. Here, select the new software installation tools from KNIME's help: select "Help" in the main menu, then "Install new software", then "Add", enter a name, e.g. "KNIME-nightly", and the following URL: http://tech.knime.org/update/community-contributions/nightly. Then select at least "KNIME NGS tools" and "KNIME R Scripting extensions", click "Next" twice, and accept the license agreement. There will then be a warning about unsigned content where you have to click "OK". KNIME will then restart, but not without asking you first. You can now select "Open KNIME workbench (yellow triangle)" from the welcome screen and import the workflow. This can be done by selecting "File" in the main menu, then "Import KNIME workflow", checking "Select archive file", pointing to the previously downloaded file, and clicking "Finish". You have now imported the workflow. It can be opened by double-clicking on the appropriate name (NGS_sample_workflow).

Dark blue - Data cleansing and alignment to reference genome

The first two nodes that provide data are the "File Reader" and "FastQReader" nodes. The File Reader node reads in the parameters for the AdapterRemovalAdv node, i.e. the adapter sequences, similarity threshold, quality threshold, minimum overlap, and an integer value (0 or 1) to indicate whether partial comparisons should be carried out. The details on these parameters are described below with the AdapterRemovalAdv node. Those parameters have to be located in a text file, and each set of parameters has to be on its own line. The File Reader node then automatically identifies whether the columns are separated using tabulators, commas, or semicolons. Figure 2 shows the resulting table, which is available by right-clicking on the node and then selecting "File Table".


Figure 2. Visualization of data read in by the File Reader node.

Column names are set automatically. They could also be set within the File Reader node or through the Rename node. For simplicity, and because the columns are used only once, they are not changed. Within the FastQReader node the file name, the number of rows to be read in, and the type (Solexa, Solexa (pre 1.3)/Illumina, or Sanger) can be selected (Figure 3). This implementation uses BioJava (version 1.7.1) for reading the data. In addition to the cluster ID, sequence, and quality string, a sum of the ASCII character code values is calculated (last column in Figure 4).

Figure 3. Parameter selection from the FastQReader node. NGS data coming from the NCBI is usually stored using the Sanger format for encoding the quality values for the bases. The names for these options are carried over from the BioJava implementation. The Solexa and Illumina options refer to the different methods Illumina used to calculate quality values and how they are represented.

Figure 4. FastQReader output.
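The checksum in the last column of Figure 4 could be computed in a Java Snippet along these lines (a minimal sketch; we assume the summed string is the quality string and that the column is named "Quality", neither of which is spelled out above):

// Sum of the ASCII codes of the quality string, one summand per character
int sum = 0;
for (char c : $Quality$.toCharArray()) {
    sum += c;
}
return sum;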

The next node is the AdapterRemovalAdv node. This node compares each sequence from the FastQ file (target) with all sequences from the parameters file (query) in the following fashion: a sliding window technique is used to compare all possible pairwise alignments without in/dels. The sliding window starts from the 5' ends of the sequences. For each alignment only overlapping regions are compared, i.e. if the target sequence that is generated by the sliding window is smaller than the query, only those nucleotides are compared that have a counterpart. If the quality value of a target position is less than the quality threshold, that position is counted as a match. The minimum length for comparison is set by the minimum overlap column; any sequence comparison involving shorter sequences is not computed. If the fraction of the number of matched sequence positions over the length of the compared sequence (which can be shorter than the length of either the target or subject sequence, see the minimum overlap argument) is greater than the similarity threshold, then the two sequences match, and that part of the target sequence and anything following it is removed. If a value greater than zero is present in the "Do partial comparisons" column, then the subject sequence is successively truncated from the 5' end and the comparison is repeated as discussed before until the minimum length parameter is reached. We have seen such cases in our experiments and therefore introduced this option. One has to keep in mind that using this option will increase computation time significantly.
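To make the comparison scheme concrete, here is a minimal sketch of the matching step under the rules just described (all names are invented for illustration; the node's actual implementation differs at least in its loop-breaking optimizations and in the partial-comparison handling):

// Slides the query (adapter) along the target (read) without in/dels and
// returns the first offset whose match fraction exceeds the similarity
// threshold, or -1 if no adapter is found. Low-quality target positions
// count as matches.
static int findAdapterStart(String target, String query, int[] quality,
                            int qualityThreshold, int minOverlap,
                            double similarityThreshold) {
    for (int offset = 0; offset <= target.length() - minOverlap; offset++) {
        // only overlapping positions are compared
        int overlap = Math.min(query.length(), target.length() - offset);
        if (overlap < minOverlap) {
            break;  // remaining overlap too short, see minimum overlap argument
        }
        int matches = 0;
        for (int i = 0; i < overlap; i++) {
            if (target.charAt(offset + i) == query.charAt(i)
                    || quality[offset + i] < qualityThreshold) {
                matches++;
            }
        }
        if ((double) matches / overlap > similarityThreshold) {
            return offset;  // the read is truncated from this position onwards
        }
    }
    return -1;
}

Everything from the returned offset to the 3' end of the read would then be removed.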

This implementation has some noteworthy features: 1. Any concatenations of adapter sequences are recognized, which also implies that if the adapter sequence is part of the reference sequence, chances are that those parts will be filtered out as well. 2. The quality value is the ASCII code. 3. Poly-N sequences can be removed with this method as well, for example by using a poly-A query sequence.

The code has been optimized to a certain degree to break any loop once it is clear that no match can be found, but we also recognize that this is not in any way an optimized implementation.

The resulting table has an additional column that indicates which row from the parameters table caused sequences to be truncated (Figure 5).


Figure 5. Result from AdapterRemovalAdv.

Next, we calculate the sequence length of the cleansed data using a simple Java Snippet. Figure 6 shows how easy it is to implement/prototype missing functionality. Using the Java Snippet we append a new column called "mySeqlen" of type Integer: we take the column named "Sequence" ($Sequence$), calculate its length (.length()), and return that value.

Figure 6. Java Snippet configuration for calculating the sequence length.
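The snippet body configured in Figure 6 amounts to a single expression:

// Append a new Integer column "mySeqlen" holding the length of each sequence
return $Sequence$.length();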

We then use the Row Filter node (Figure 7) to remove any sequence that is shorter than 15 nucleotides. Instead of using the Java Snippet and the Row Filter, we could have also used the Java If (Table) node, thereby removing one extra step. The extra step here has the added benefit of keeping the sequence length with the table.


Figure 7. Filter sequences by length: the minimum length is 15 and the maximum length is very high (1,000,000), i.e. this step removes small sequences from the data.



Now we have the data cleaned and ready to be compared to the reference genome. But first we have to write the data to a file. This is done using the FastQWriter node (Figure 8). In our case we write to a file called /tmp/testFastQwriter.txt.


Figure 8. FastQWriter node. Using the configuration user interface, shown here, the user has to select the sequence, quality, and description columns from the input table, as well as an appropriate output file name.

Once the data is written, we can use, for example, bowtie to align it to the reference genome. Here we use the simplest invocation of bowtie, with only standard parameters, and execute the following command using the Bash node (Figure 9, Source code 1):

bowtie -S /tmp/bowtie_genomes/hg19 testFastQwriter.txt >test.sam

Source code 1. Command line used in the Bash node to execute the bowtie program.

This command will be executed in a directory called "/tmp". The standard error stream is not interpreted as an error here, because bowtie writes important statistics to this port; when this option is checked, any message printed to standard error is interpreted as an execution problem and the workflow is stopped (Figure 10). Since the Bash node has no input, we connect it to the previous node using the Flow Variable port (red connection). This type of connection ensures that variables from previous nodes are inherited. This mechanism can also be used, as in our case, to control the execution order.

When there is more than one command to be executed, the CmdwInput node can be used. It behaves just like the Bash node, except that the command and working directory parameters are read in from two columns. An explicit example for this node is not provided in the workflow presented here.


Figure 9. Bash node: executes external commands and collects standard output and standard error messages.

Figure 10. Standard error output from bowtie captured in a table.

White - 1st steps for data analysis of aligned sequences

Now that we have a SAM formatted file with the information from the alignment, we read this into KNIME (SAMReader, Figure 11), select only those sequences that align to the reference genome (Row Filter, Figure 12), and further transform the data in a sub-workflow (MetaNode 1:1, transformations, Figure 13).


Figure 11. Configuration user interface for the SAMReader node. Parameters include the file name, whether inconsistencies within the file format should be interpreted as errors or warnings or be ignored, and how many entries to read.


Figure 12. Row Filter node. Here we filter out (exclude) rows where no reference sequence is associated with the entry ("*" in the ref seq name column).


Figure 13. Sub-workflow that performs a series of data transformations as explained in the text.

Within this sub-workflow we first interpret the information from the flags column (Li et al., 2009) into the strandedness of the sequences and append a column called "strand" that holds the value "+" if the query sequence matches the forward strand and "-" if it matches the reverse strand (Java Snippet, strand column, Source code 2). We also want to make clear that we are only using Java Snippets because we are more familiar with Java. The same functionality could be created by writing a Perl or Python snippet, or even an R node.

// The SAM flag is a decimal bit field; bit 0x10 is set when the read
// aligns to the reverse strand (Li et al., 2009).
int flag = $flag$;
byte strand = 0x10;
if ((flag & strand) == strand) {
    return "-";
} else {
    return "+";
}

Source code 2. Source code from the Java Snippet (strand column).


The SAM file format allows for arbitrary information in the last column. This is collected in an array type that is called a "collection" in KNIME language. This collection holds information on the mutations in the "MD:" column. This is a specific feature of bowtie, and other alignment tools handle this kind of information in different ways; therefore this workflow has to be adapted at this position when changing the alignment tool. The Split Collection Column node transforms this collection column into individual columns. Figure 14 shows the top portion of the resulting table.


Figure 14. Output from the Split Collection Column node.

The interesting information for us is in the "Split Value 2" column. We convert this string into a regular CIGAR expression (Li et al., 2009) that can be used in the following nodes, using a Java Snippet (Source code 3).


String valueType = $Split Value 2$;
if (valueType == null) {
    return null;
}
int idx = valueType.indexOf("MD:Z:");
if (idx < 0) return null;
String out = valueType.substring(idx + 5);
String[] outArr = out.split("[^A-Z0-9]");
/*
 * since the mismatch string holds the bases from the reference sequence and we
 * are interested in the mutation we have to substitute these...
 */
String mismatch = outArr[0];
String[] chrs = mismatch.split("[0-9]+");
String[] nums = mismatch.split("[A-Z]");
int currPos = -1;
String outString = "";
for (int nNum = 0; nNum < nums.length; nNum++) {
    currPos += Integer.parseInt(nums[nNum]);
    outString += nums[nNum];
    if (chrs.length > nNum + 1) {
        // replace the reference base with the base found in the read
        for (int k = 0; k < chrs[nNum + 1].length(); k++) {
            currPos++;
            outString += $query sequence$.substring(currPos, currPos + 1);
        }
    }
}
return outString;

Source code 3. Conversion of the MD string into a CIGAR format.

For example, this code will translate the MD string "MD:Z:24C2T1A3", associated with the sequence TCTGAACCCGACTCCCTTTCGATCGGCCGCGGG, into the following CIGAR string: 24G2C1C3. Please note that the nucleotides in the MD string refer to the reference sequence, whereas we are interested in the mutations. Therefore we have to identify the nucleotides in the query sequence, which makes the source a bit more complicated. See Source code 4 for an easy-to-read representation of the query sequence, which reflects the positions.


         1         2         3
123456789012345678901234567890123
TCTGAACCCGACTCCCTTTCGATCGGCCGCGGG
                        |  | |

Source code 4. Four lines representing the sequence given as an example in the text. The first line gives the position of the 10th, 20th, and 30th nucleotides. Line 2 shows the individual decimal digits. Line 3 represents the query sequence and line 4 indicates the positions of interest.

Next, we prepare the sequences for calculating the counts per position. For this, each sequence of length N is translated into N different entries or rows. We want to keep the read header to be able to identify where a count came from and to keep track of the mutations. The read header allows us to link information from previous nodes; this linking is usually accomplished using the Join or JoinSorted nodes. Figure 15 shows the configuration user interface, whereas Figure 16 shows part of the resulting table.
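A minimal sketch of this per-position expansion (all names are invented for illustration; the actual Seq2PosIncidents node keeps the read header as the row name and works on KNIME tables):

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// A read of length N aligned at 1-based position `start` on `chrom` yields
// N rows. The position key concatenates chromosome and position with "_";
// the mismatch column holds "." for a match and the read base otherwise.
static List<String[]> seqToPositions(String readId, String chrom, int start,
                                     String sequence, Set<Integer> mismatchOffsets) {
    List<String[]> rows = new ArrayList<>();
    for (int i = 0; i < sequence.length(); i++) {
        String position = chrom + "_" + (start + i);
        String mismatch = mismatchOffsets.contains(i)
                ? sequence.substring(i, i + 1) : ".";
        rows.add(new String[]{readId, position, mismatch});
    }
    return rows;
}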


Figure 15. Configuration user interface for the Seq2PosIncidents node.


Figure 16. Results from the Seq2PosIncidents node. The row name holds the identifiers of the originally sequenced sequences. The position is kept as a string formed by concatenating the chromosome name with the position, linking both with the character "_". The Mismatch column holds the sequenced base in case it is not the same as in the reference chromosome; a "." represents a match.

The position string (Position column) now has to be split into the chromosome name and its corresponding position. This is done using the PositionStr2Position node. Alternatively, one could use two Java Snippets to perform the same task, as sketched below.
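A minimal sketch of the two-snippet alternative, assuming position strings like "chr1_10794" (Figure 16); splitting at the last "_" keeps chromosome names that themselves contain underscores intact:

// Snippet 1: chromosome name (String column) - everything before the last "_"
return $Position$.substring(0, $Position$.lastIndexOf("_"));

// Snippet 2: position (Integer column) - everything after the last "_"
return Integer.parseInt($Position$.substring($Position$.lastIndexOf("_") + 1));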


Figure 17. Configuration for the Sorter node in the transformation sub-workflow.
workflow.

The last step in the transformation sub-workflow is the sorting of the individual positions. From the configuration user interface (Figure 17) we see that we first sort by chromosome, then by position, and then by the mismatch column. This concludes this sub-workflow, and the data is now handed over to the output of that meta-node.

This output is used in three different places. Two of those go directly or indirectly into the pileup meta-node, which, in this case, is called Variables Loop (Data). As the name implies, this sub-workflow iterates over the data. The first input is the original data and the second one holds information on how the iteration shall be carried out. In our case we want to iterate over individual chromosomes. This reduces the size of the tables and thus can speed up calculations. The Value Counter node (Node 156) counts the different chromosome names and thereby identifies unique names. As a side effect we also get a general coverage per chromosome (Figure 18). As the RowID is a special column that we cannot use as input for the Variable Loop sub-workflow, we have to copy it into a regular column. This is done using the RowID node (Node 157).


Figure 18. Output from the Value Counter node (Node 156): chromosome names and respective counts.

An overview of the pileup sub-workflow is shown in Figure 19. The first node on the second input of this meta-node is the TableRow to Variable Loop Start node (Node 3), which iterates over all rows of the input table and translates all values of each row into variables that can be used in any of the following nodes. This node works together with the Loop End node to carry out the iterations. The output port is a Variable port, which we used earlier to connect nodes with no input; here we use this type of connection to transfer data between nodes. The Inject Variables (Data) node then inserts this information into the first input of the meta-node.


Figure 19. Overview of the pileup meta-node. This sub-workflow iterates over all chromosomes and counts how often an alignment to a given position on the reference genome was found.

In the Row Filter node we can now use this information. All nodes have a tab called "Flow Variables" that handles the usage of variables. Here we can assign a given variable to a configuration parameter. Figure 20 shows what this assignment looks like in KNIME.


Figure 20. The variable "chr" is assigned to the parameter with the name "Pattern".

In the configuration user interface for the Row Filter node (Figure 21) we now see that the pattern field under "use pattern matching" is grayed out and shows the value of the current variable.


Figure 21. Row Filter configuration for sub-selecting rows that belong to one chromosome.

Flow variables are a very useful feature in KNIME. There are also workflow variables, which don't need to be passed through a given workflow. Unfortunately, they are not copied when copying nodes between workflows. To overcome this we introduced the OneString node, which allows creating a table with one cell (one row and one column). This can then be copied between workflows and also easily changed for batch executions (see below). This node is also not used in this example workflow.

The following CountSorted node only takes one column as input, i.e. the column for which it should calculate the occurrences of the values. This is in principle just a faster version of the Value Counter node, as it assumes a presorted table and doesn't enable "hiliting" (see the KNIME documentation for more information on hiliting); a sketch of the underlying idea follows below. As before, we now have to translate the row ID into a regular column (RowID node) and extract the position from the position string (PositionStr2Position). The Loop End node collects the results of all iterations in one output table that is handed over to the output of the MetaNode (pileup). This concludes the description of the nodes in the unmarked area of the workflow.
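Counting a presorted column needs only a single pass and no lookup table, which is where the speed advantage over Value Counter comes from. A minimal sketch of the idea (not the node's actual implementation):

// Counts runs of equal values in a presorted list in one pass and prints
// value/count pairs; no hash map is needed and no hiliting is supported.
static void countSorted(java.util.List<String> sortedValues) {
    String current = null;
    int count = 0;
    for (String value : sortedValues) {
        if (!value.equals(current)) {
            if (current != null) {
                System.out.println(current + "\t" + count);
            }
            current = value;
            count = 0;
        }
        count++;
    }
    if (current != null) {
        System.out.println(current + "\t" + count);  // flush the last run
    }
}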

Orange - unique sequences, R-integration

In this part of the workflow we select sequences that align to only one position in the reference genome and then show how those results can be transmitted to R and back. The integration of R is not part of this application note, as we have not contributed to these nodes. But since this integration is essential for some of the later analysis steps usually carried out in NGS data analysis, we include a very simple example.

Identifying uniquely aligning sequences (Figure 22): we assume that the bowtie program we are using for the alignment produces files in which sequences that align to more than one position in the reference genome are written consecutively. Thus we can use the fast CountSorted node to calculate how often a given sequence identifier occurs. We then have to use the RowID node to convert the row ID back to a regular column.


Figure 22. Meta workflow for identifying sequences that align uniquely to the reference genome.

Then we can use the JoinSorted node (Figure 23) to link this information to the original data. For this we just have to tell the node which columns it should use for the linking process. The final node in this meta-workflow uses an upper bound of "1" in a range check to include only sequences that align to exactly one position in the reference genome. With this node one can, of course, also select sequences that align to more than one sequence position.


Figure 23. Configuration for the JoinSorted node.

Next, this output is directed to the R-computation meta-node. Figure 24 shows what happens inside this node.


Figure 24. Sub-workflow for the integration with R.


Since the R related nodes don't know how to handle the KNIME collection data type, we have to transform this column into regular columns. This is done using the Split Collection Column node. The next Row Filter node limits the number of rows, as there are installation-dependent limitations on the amount of data that can be interchanged with R. Then we convert the data into the R calculation space, execute some R commands, and import the results back into KNIME.

Pink - mutation analysis

Here we just want to briefly mention that we can very easily create sections within a workflow for different types of analyses. In this case we use the Row Filter node to select only mutations from our input data. From here we could further investigate statistics on which mutations might actually be SNPs, or other things. For example, we can group by sequence position (GroupBy) and concatenate the mutations. Then we can determine the length of that mutation string (Java Snippet, sketched after Figure 25) and sort by that length (Sorter). This will give us a list of positions where most of the mutations occur. Figure 25 shows the final result of this exploratory analysis.


Figure 25. Result from the mutation analysis section. A list of sequence positions, sorted by the number of mutations, with the actual mutations and their counts is shown. The "Concatenate(Mismatch)" column can be expanded to show all nucleotides.
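The length-determining Java Snippet is again a one-liner (a sketch, assuming the concatenated column is named "Concatenate(Mismatch)" as in Figure 25):

// Number of observed mutations at a position = length of the concatenated
// mismatch string produced by the GroupBy node
return $Concatenate(Mismatch)$.length();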

Green - ROIs

One task, especially in the analysis of small RNAs, is to identify regions of coverage. The MetaNode called ROIs performs some interesting exploratory analysis (Figure 26). As input we use the result of the pileup analysis: a list of positions in the reference genome with associated coverage, already sorted by chromosome and position. The GetRegions node identifies regions of interest (ROIs). A ROI is defined by entries in the input table with the same chromosome name and increasing (by one) positions, i.e. consecutive regions of coverage. The count values are stored in a string column, concatenated using a space as a separator.
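A minimal sketch of this region-merging logic under the stated definition (all names are invented; the actual GetRegions node works on KNIME tables rather than lists):

import java.util.ArrayList;
import java.util.List;

// Merges consecutive positions on the same chromosome into one region and
// collects the per-position counts into a space-separated string. Emits
// chromosome, start, stop, counts, and region length for each region.
static List<String> getRegions(List<String> chrom, List<Integer> pos, List<Integer> count) {
    List<String> regions = new ArrayList<>();
    int start = 0;
    for (int i = 1; i <= pos.size(); i++) {
        boolean regionEnds = i == pos.size()
                || !chrom.get(i).equals(chrom.get(start))
                || pos.get(i) != pos.get(i - 1) + 1;
        if (regionEnds) {
            StringBuilder counts = new StringBuilder();
            for (int j = start; j < i; j++) {
                if (j > start) counts.append(' ');
                counts.append(count.get(j));
            }
            regions.add(chrom.get(start) + "\t" + pos.get(start) + "\t"
                    + pos.get(i - 1) + "\t" + counts + "\t" + (i - start));
            start = i;
        }
    }
    return regions;
}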


Figure 26. Regions of interest sub-workflow.

Figure 27 shows a typical result for this node. The chromosome name, the start and stop positions of the regions, the counts from the individual positions, as well as the lengths of the regions are given.



Figure 27. Result from the GetRegions node. See text for explanations.

Next, we use a Java Snippet to retrieve the maximum count value from the count string (Source code 5) and sort by that value, so that the regions with the highest coverage become the top entries in the table.

// Split the space-separated count string and return the maximum value
String[] strArr = $count$.split(" ");
int len = strArr.length;
Integer[] countArr = new Integer[len];
int max = 0;
for (int i = 0; i < len; i++) {
    countArr[i] = Integer.parseInt(strArr[i]);
    if (max < countArr[i]) {
        max = countArr[i];
    }
}
return max;

Source code 5. Java code for calculating the maximum value from a string of counts.

Light Blue - BED/BEDGraph files

Within this section we describe the steps involved in creating a BED file out of the pileup table. This is basically a simple reformatting task, which involves the adjustment of the start positions (Java Snippet, sketched below), selection of the columns to write (Column Filter), reordering of the columns (Column Resorter), and sorting by chromosomal position (Sorter) (Figure 28). In our case the input table is already sorted correctly, but we still show this step for completeness.
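BED intervals are zero-based and half-open, while the pileup positions are one-based; a sketch of the start-position adjustment snippet (the column name is an assumption):

// BED start = one-based position - 1; the stop column keeps the position
// itself, so each covered nucleotide becomes one [position-1, position) interval.
return $Position$ - 1;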


Figure 28. Meta workflow for creating BED files.

This result is then written into a tab-delimited text file, the first few lines of which are shown in Source code 6. The start and stop positions are given for each nucleotide of a region with coverage.

1 10793 10794 1
1 10794 10795 1
1 10795 10796 1
1 10796 10797 1
1 10797 10798 1
1 10798 10799 1

Source code 6. First lines of the BED file. The first column is the chromosome name, followed by the start and stop positions and the counts.

A more efficient way of storing this information is to combine regions with equal coverage into one line. This is done using the BedGraphWriter node. Source code 7 shows the first few lines of this file.

1 10794 10811 1.0
1 12800 12818 1.0
1 13742 13756 1.0
1 14701 14715 1.0
1 16270 16288 1.0
1 16592 16606 1.0
1 18284 18298 1.0

Source code 7. Output from the BedGraphWriter node. Regions with consecutive equal counts are combined on one line.


Integration into Galaxy/Mobyle

One important aspect for us is the possibility to execute workflows from the command line. We give here a short example of how to execute the given workflow from the command line and use a different input file for the execution (Source code 8).
.


knimeRoot={PATH to knime}/eclipse_knime_2.3.0
knimeClassPath=${knimeRoot}/plugins/org.eclipse.equinox.launcher_1.1.1.R36x_v20101122_1400.jar

time ${JAVA_HOME}/jre/bin/java \
  -Dknime.expert.mode=true \
  -Dknime.disable.rowid.duplicatecheck=true \
  -Xmx16G \
  -classpath ${knimeClassPath} org.eclipse.equinox.launcher.Main \
  -launcher ${knimeRoot}/eclipse \
  -application org.knime.product.KNIME_BATCH_APPLICATION \
  -workflowDir=${knimeWorkflowDir}/SampleScript \
  -destDir=/tmp/sampleOutputDir \
  -option=38,FQR_fpname,{PATH}/anotherFile.sam,String \
  2>&1 | tee $logfile

Source code 8. Bash script for executing the workflow from the command line.

In the first two lines of this Bash script we define variables with the location of the KNIME installation and the KNIME classpath. We execute the following KNIME batch job using the time command to keep track of the different resources that are consumed during execution. $JAVA_HOME is the variable that is usually set during the Java installation and that points to the Java executable. What follows are options to KNIME that make extra functionality available (knime.expert.mode) and turn the duplicate row ID checking off (knime.disable.rowid.duplicatecheck). We also set the maximum amount of memory to be used to 16 gigabytes. The following options (classpath, launcher, application) are needed for running KNIME in batch mode. The workflowDir directive tells KNIME which workflow should be executed; the knimeWorkflowDir variable should hold the root of the workspace that is defined when first starting KNIME. The destDir directive stores the resulting executed workflow in that specific directory, from where it can be imported into KNIME again after execution, so all intermediate results can be analyzed. The option directive enables us to change parameters within the workflow. In this case we change the name of the FastQ file that should be analyzed. The number 38 can be identified in the configuration dialog of the respective node (FastQReader), in the upper left corner. The name FQR_fpname is the name of the field in the configuration we want to change; this is the same name as is used in the Flow Variables tab of the configuration dialog (Figure 29). Then we set the new file name and tell KNIME that this is a String variable. Further information on how to use the batch execution mode can be found on the KNIME.org web page under the help and FAQ sections.


Figure 29. FastQReader configuration of flow variables. Important for the command line execution are the numbers next to "Dialog" in the very top left corner and the names of the variables.

Being able to execute workflows from the command line allows KNIME to be integrated into other tools like Mobyle (Néron et al., 2010) or Galaxy (Goecks et al., 2010).

This concludes our short example. We hope to have shown how easy it is to develop and prototype new workflows.


REFERENCES

Goecks, J. et al. (2010) Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol., 11, R86.

Li, H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-2079.

Néron, B. et al. (2010) Mobyle: a new full web bioinformatics framework. Bioinformatics, 25, 3005-3011.

WoldLab (2008) SRX000350: Illumina sequencing of mouse brain transcript fragment library (SRA001030, released on May 30, 2008; updated on Jan 30, 2009).