

Building a Bioinformatics Pipeline for the Design of Pichia pastoris Whole Genome Microarrays




Alexandra Graf


FH main supervisor: Prof. Dr. Diethard Mattanovich




1 Introduction

DNA microarray analysis is a widely used technology for generating genome-wide expression profiles. It is also the only application that allows the efficient use of information from whole-genome sequencing projects in answering biologically relevant questions. The genome of the yeast Pichia pastoris was sequenced only recently, making it possible to use microarray techniques to address problems in the secretion of complex heterologous proteins produced in this organism. This work deals with the selection of candidate ORFs and the design of oligonucleotides for a P. pastoris whole-genome microarray.

At the time of this work no DNA microarray for Pichia pastoris existed. The aim of this work was to establish a pipeline for the design of whole-genome chips for P. pastoris. Starting from an only partly annotated genome of P. pastoris, a set of P. pastoris-specific oligomers was designed and then uploaded to Agilent for the production of microarrays. In the course of designing these oligos, appropriate software and parameters were selected and connected via Perl scripts. The microarray was a two-color, 60-mer oligonucleotide custom expression array from Agilent in the 4x44k format.


2 Results

De novo predictions of P. pastoris genes were created with the program GeneMark, followed by a homology search with BLAST using S. cerevisiae as the model organism. The data were parsed and filtered with Perl scripts developed by the author of this work. The resulting ORFs were merged with ORFs predicted by Integrated Genomics, and the program cd-hi was used to cluster the sequences according to their similarity. The ORFs, together with the information from BLAST and cd-hi, were then used to select candidates for probe design. Oligo design was done with OligoArray 2.1. Each array carried 17,161 different oligomers, so that 2 replicates of each oligomer could be used per array. Of these oligos 10,134 were specific; for the other 7,027 there was a risk of cross-hybridization.
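
The oligo counts above can be checked with a few lines of arithmetic. The following is only a sketch: the figure of 44,000 features per array is an assumption read off the format name "4x44k", and the number of features actually available for custom probes on the Agilent design may be different.

```python
# Oligo counts reported in the text.
specific_oligos = 10_134
cross_hyb_risk_oligos = 7_027
distinct_oligos = 17_161
replicates_per_oligo = 2

# Assumption: "44k" is taken as a nominal 44,000 features per array.
# The real number of usable features is not given in the text.
nominal_features_per_array = 44_000


def occupied_features(distinct, replicates):
    """Number of features needed to spot every oligo `replicates` times."""
    return distinct * replicates


total = occupied_features(distinct_oligos, replicates_per_oligo)

# The two classes of oligos add up to the number of distinct oligos.
print(specific_oligos + cross_hyb_risk_oligos == distinct_oligos)
# Two replicates of each oligo stay within the nominal capacity.
print(total, total <= nominal_features_per_array)
```

Under this assumption a third replicate of every oligo would no longer fit, which is consistent with the choice of 2 replicates per oligomer.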

Hybridization of the first set of arrays was conducted according to the IAM SOP for RNA extraction and Agilent's hybridization protocols. A variety of P. pastoris samples were pooled in order to obtain a high proportion of active genes. The experimental part of the work was conducted at the IAM RNA lab. Twelve arrays (3 slides) were hybridized in a same-same manner, scanned and quantified. The slides were scanned with a GenePix 4000B scanner, and Agilent's Feature Extraction software was used to convert the images into intensity values. The purpose of the first batch of arrays is to determine the number of genes that hybridize to P. pastoris targets and to refine the next batch accordingly. The results of the analysis of the extracted data, together with the prediction data, will determine the oligo content of the next generation of microarrays.


3 Discussion

Selecting a good gene finder for yeast was surprisingly hard, given that Saccharomyces cerevisiae has been completely sequenced since 1996. The reason is that current gene prediction software focuses either on prokaryotes, which have no introns, or on higher eukaryotes, which have a large number of rather long introns. In lower eukaryotes like yeast only a small proportion of genes contain introns, and most of these have very few, short introns. Additionally, the yeast genome includes fewer repeats and is overall more compact than the genomes of complex eukaryotes. For example, the yeast genome contains about 70% protein-coding regions, whereas in the human genome this number is only 2% [Kellis et al., 2003]. To see which type of gene finder is best suited for yeast genomes, a test using three de novo gene finders (GeneMark, Glimmer3 and GlimmerHMM) was carried out. S. cerevisiae was the logical choice for the test since it is the organism most closely related to P. pastoris for which genome and ORF data are available. The results confirmed that a gene finder written for eukaryotes (GlimmerHMM) was not adequate for yeast, since it introduced far too many introns into the predicted genes. The prokaryotic gene finders performed much better, with GeneMark predicting fewer false negatives but more false positives than Glimmer3. Since a higher rate of false positives was preferable to a higher rate of false negatives, GeneMark was used for gene prediction in P. pastoris.
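
A comparison of this kind can be sketched as follows. This is a minimal illustration, not the evaluation script used in this work: it assumes that a predicted ORF counts as correct only if contig, start, end and strand match a reference ORF exactly, and the names `finder_a`, `finder_b` and all coordinates are hypothetical toy values.

```python
def evaluate_predictions(predicted, reference):
    """Compare predicted ORFs with a reference annotation by exact match.

    Both arguments are iterables of (contig, start, end, strand) tuples.
    Exact matching is a simplifying assumption of this sketch.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = predicted & reference
    false_pos = predicted - reference   # predicted, but not annotated
    false_neg = reference - predicted   # annotated, but missed
    return {
        "tp": len(true_pos),
        "fp": len(false_pos),
        "fn": len(false_neg),
        "sensitivity": len(true_pos) / len(reference) if reference else 0.0,
        "precision": len(true_pos) / len(predicted) if predicted else 0.0,
    }


# Hypothetical toy annotation and two hypothetical gene finders.
reference = {("chr1", 100, 400, "+"), ("chr1", 900, 1500, "-"),
             ("chr2", 50, 650, "+")}
finder_a = reference | {("chr2", 700, 820, "+"), ("chr2", 900, 990, "-")}
finder_b = {("chr1", 100, 400, "+"), ("chr2", 50, 650, "+")}

stats_a = evaluate_predictions(finder_a, reference)
stats_b = evaluate_predictions(finder_b, reference)
print(stats_a)
print(stats_b)
```

In the toy data `finder_a` misses nothing but adds two spurious ORFs, while `finder_b` adds none but misses one. When missed genes are the more costly error, as in the array design described here, the first profile is the preferable one.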

The lowest threshold that GeneMark allowed for the de novo predictions was used to get the largest possible set of ORFs. The resulting sequences, as well as the ORFs predicted by Integrated Genomics (IG), were then searched with BLAST against the coding sequences of S. cerevisiae. Sequences with HSPs (high-scoring segment pairs) that had a total length greater than 100 and an identity greater than 50 were considered similar, and of these only the HSP with the highest e-value per sequence was recorded. This information only indicates conserved regions between the two yeasts, not the actual length of the coding regions. Moreover, the similarity between S. cerevisiae and S. bayanus, both members of the Saccharomyces group, is about the same as that between human and mouse (62% and 66% respectively in orthologous regions) [Kellis et al., 2003]. It follows that for the more distantly related P. pastoris a notable number of genes will not have orthologs in S. cerevisiae. For these reasons BLAST scores were taken into account, but only as an ancillary criterion in the selection of oligo candidates.
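
The HSP filter described above could look roughly like the sketch below. It is written in Python rather than Perl and rests on several assumptions that are not in the text: the BLAST hits are available in the standard 12-column tabular format, the length cutoff is applied to each HSP individually, and identity is a percentage. The text records the HSP with the highest e-value; the sketch instead keeps the lowest e-value per query, which is the conventional reading of "best hit", so the comparison has to be flipped to follow the text literally. All IDs and values in the toy input are made up.

```python
import csv
import io

# Column names of the standard 12-column tabular BLAST output.
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hsp_per_query(handle, min_length=100, min_identity=50.0):
    """Keep one HSP per query among those passing both cutoffs."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hsp = dict(zip(FIELDS, row))
        length = int(hsp["length"])
        identity = float(hsp["pident"])
        evalue = float(hsp["evalue"])
        # Cutoffs from the text: length > 100 and identity > 50.
        if length <= min_length or identity <= min_identity:
            continue
        current = best.get(hsp["qseqid"])
        # Assumption: lowest e-value wins (see the note above).
        if current is None or evalue < current[1]:
            best[hsp["qseqid"]] = (hsp["sseqid"], evalue, length, identity)
    return best


# Made-up toy hits: query, subject, % identity, length, ..., e-value, score.
rows = [
    ["orf1", "subjA", "62.5", "240", "90", "0", "1", "240", "1", "240", "1e-40", "180"],
    ["orf1", "subjB", "60.0", "150", "60", "0", "1", "150", "1", "150", "1e-10", "90"],
    ["orf2", "subjC", "45.0", "300", "165", "0", "1", "300", "1", "300", "1e-5", "60"],
    ["orf3", "subjD", "80.0", "90", "18", "0", "1", "90", "1", "90", "1e-20", "110"],
]
toy = io.StringIO("\n".join("\t".join(r) for r in rows))
hits = best_hsp_per_query(toy)
print(hits)
```

Here `orf2` is dropped for low identity and `orf3` for a short alignment, so only `orf1` is reported as similar.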

To avoid cross-hybridization the sequences were run through cd-hi, a program that clusters sequences according to their similarity and outputs a file containing the longest sequence of each cluster. A similarity threshold of 90% was used in our case. The more recent version (cd-hit) was deliberately not used since it allows for a certain amount of redundancy. Both versions (cd-hi, cd-hit) were developed and used to build non-redundant protein databases. Since the program was written for protein sequences and involves alignments between these sequences, it might introduce a bias when used with nucleotide sequences. The program was used under the assumption that this bias, if present, is negligible. Still, a nucleotide version of cd-hit (cd-hit-est) is now available, and it would be interesting to compare its results with those of the version used in this work.
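
The idea of clustering by similarity and keeping the longest sequence of each cluster can be illustrated with a greedy scheme: process the sequences from longest to shortest, and let each one either join the first cluster whose representative it resembles or found a new cluster. The sketch below is an illustration only, not a reimplementation of cd-hi. In particular, the similarity measure (the `ratio()` of Python's `difflib.SequenceMatcher`) is a stand-in chosen for brevity and is not the alignment-based identity that cd-hi computes, and the toy sequences are invented.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT the measure used by cd-hi."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Cluster sequences greedily, longest first.

    `sequences` maps an ID to a sequence string. Returns a list of
    clusters; each cluster is a list of IDs whose first element is the
    representative, i.e. the longest sequence of that cluster.
    """
    order = sorted(sequences, key=lambda sid: len(sequences[sid]),
                   reverse=True)
    clusters = []
    for sid in order:
        for cluster in clusters:
            representative = cluster[0]
            if similarity(sequences[representative],
                          sequences[sid]) >= threshold:
                cluster.append(sid)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append([sid])
    return clusters


# Invented toy sequences: orf_b is orf_a without its last two bases.
orf_a = "ATGGCTAGCAAGGATCCGTTCGATCGTAGCTACCTAGGCA"
toy = {"orf_a": orf_a, "orf_b": orf_a[:-2], "orf_c": "T" * 20}
clusters = greedy_cluster(toy, threshold=0.9)
representatives = [cluster[0] for cluster in clusters]
print(clusters)
print(representatives)
```

Only the representatives would be passed on to probe design, so near-identical ORFs do not each receive their own probe.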

After the QC and normalization steps it will be decided which probes are deemed true and will be placed on the next chip. Probes that are not marked as outliers and that have an intensity value higher than the average of the negative probes plus the standard deviation are considered positive. Since the array contains specific as well as unspecific probes, it will also be necessary to check for cross-hybridization. This is done by comparing the distributions of the intensity values of the two groups as well as the total number of positive hybridizations in each group. Additionally, supervised learning techniques can be used to see whether there is a significant difference between the two classes of probes. The positive probes, all probes that have a high prediction value, and probes from complex clusters (which were not used on this microarray due to time limitations) will make up the next round of arrays. For the complex clusters a multiple sequence alignment has to be prepared before it can be decided which sequences to include in the next design. The sequences that were present on the first array can easily be selected by their IDs alone, but as new sequences are introduced into the design a new run of OligoArray 2.1 is required to check again for cross-hybridization in the set. After that the probes can be uploaded to Agilent for the production of a new chip. Depending on the performance of the new array, I estimate that at least one more iteration will be necessary to produce a chip that contains mainly true probes.
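
The rule for calling a probe positive can be written down compactly. The sketch below assumes that each probe is represented by a record with an ID, a normalized intensity, an outlier flag and its group ("specific" or "unspecific"). These field names are hypothetical, the intensities are toy values, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is meant.

```python
from collections import Counter
from statistics import mean, stdev


def positive_cutoff(negative_intensities):
    """Mean of the negative probes plus one (sample) standard deviation."""
    return mean(negative_intensities) + stdev(negative_intensities)


def call_positive(probes, negative_intensities):
    """Return the probes that are not outliers and exceed the cutoff."""
    cutoff = positive_cutoff(negative_intensities)
    return [p for p in probes
            if not p["outlier"] and p["intensity"] > cutoff]


# Toy data with hypothetical field names and values.
negatives = [100.0, 120.0, 80.0, 100.0]
probes = [
    {"id": "p1", "intensity": 500.0, "outlier": False, "group": "specific"},
    {"id": "p2", "intensity": 110.0, "outlier": False, "group": "specific"},
    {"id": "p3", "intensity": 900.0, "outlier": True, "group": "unspecific"},
    {"id": "p4", "intensity": 300.0, "outlier": False, "group": "unspecific"},
]

positives = call_positive(probes, negatives)
positive_ids = [p["id"] for p in positives]
# Number of positive hybridizations per probe class.
positives_per_group = Counter(p["group"] for p in positives)
print(positive_ids)
print(dict(positives_per_group))
```

The per-group counts are the simplest of the comparisons mentioned above. Comparing the full intensity distributions of the two groups would require an additional statistical test, which is not shown here.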


4 Literature

Sauer M., Branduardi P., Gasser B., Valli M., Maurer M., Porro D., Mattanovich D., Differential gene expression in recombinant Pichia pastoris analysed by heterologous DNA microarray hybridisation, Microb Cell Fact, 2004, Volume 3, Number 1

Stekel D., Microarray Bioinformatics, Cambridge University Press, 2003

Kreil D.P., Russell R.R., Russell S., Microarray Oligonucleotide Probes, Methods in Enzymology, 2006, Volume 410

Kellis M., Patterson N., Endrizzi M., Birren B., Lander E.S., Sequencing and comparison of yeast species to identify genes and regulatory elements, Nature, 2003, Volume 423