Abstract_Graf - Bioinformatics Vienna


1 Oct 2013


Building a Bioinformatics Pipeline for the Design of Pichia pastoris Whole Genome Microarrays

Alexandra Graf


Prof. Dr. Diethard Mattanovich



DNA microarray analysis is a widely used technology to generate genome-wide expression profiles. It is also the
only application that allows the efficient use of information from whole genome sequencing projects in answering
biologically relevant questions. Not very long ago the genome of the yeast Pichia pastoris was sequenced, making it
possible to use microarray techniques to address problems in the excretion of complex heterologous proteins
produced in this organism. This work deals with the selection of candidate ORFs and the design of oligonucleotides
for a P. pastoris genome microarray.

At the time of this work no DNA microarray of Pichia pastoris existed. The aim of this work was to establish a
pipeline for the design of whole genome chips of P. pastoris. Starting from an only partly annotated genome, a set
of P. pastoris-specific oligomers was designed, which were then uploaded to Agilent to produce microarrays. In the
process of designing these oligos, appropriate software and parameters were selected and connected via Perl
scripts. The type of microarray was a two-color, 60-mer oligomer custom expression array from Agilent in the
format 4x44k.



De novo predictions of P. pastoris genes were created using the program GeneMark, followed by a homology
search with BLAST using S. cerevisiae as the model organism. Data were parsed and filtered with Perl scripts
developed by the author of this work. The resulting ORFs were merged with data predicted by Integrated Genomics
and the program cd-hi was used to cluster the sequences according to their similarity. The ORFs as well as
information from BLAST and cd-hi were then used to select candidates for probe design. Oligo design was done with
OligoArray 2.1. The number of different oligomers on each array was 17,161, so that 2 replicates of each oligomer
could be used per array. Of the oligos, 10,134 were specific; for the other 7,027 a risk of cross-hybridization
was given.
Hybridization of the first set of arrays was conducted according to the IAM SOP for RNA extraction and Agilent's
protocols. A variety of P. pastoris samples were pooled together to have a high proportion of active genes. The
experimental part of the work was conducted at the IAM RNA lab. Twelve arrays (3 slides) were hybridized in the
same manner, scanned and quantified. The slides were scanned with a GenePix 4000B scanner and Agilent's Feature
Extraction software was used to convert the image into intensity values. The first batch of arrays serves to
determine the number of genes that hybridize to P. pastoris targets and to refine the next batch accordingly. The
results of the analysis of the extracted data together with prediction data will determine the oligo content of
the next generation of microarrays.
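
The numbers quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check; the nominal capacity of 44,000 features per array is an assumption read literally from the "44k" in the format name, not a figure given in the text.

```python
# Numbers taken from the text.
SPECIFIC_OLIGOS = 10_134
CROSS_HYB_RISK_OLIGOS = 7_027
REPLICATES_PER_OLIGO = 2
SLIDES = 3
ARRAYS_PER_SLIDE = 4                   # the "4x" in the 4x44k format
# Assumption: "44k" read literally as the nominal number of features per array.
NOMINAL_FEATURES_PER_ARRAY = 44_000


def array_layout():
    """Derive the layout figures implied by the counts above."""
    distinct = SPECIFIC_OLIGOS + CROSS_HYB_RISK_OLIGOS
    spotted = distinct * REPLICATES_PER_OLIGO
    return {
        "distinct_oligos": distinct,
        "spotted_features": spotted,
        "nominal_free_features": NOMINAL_FEATURES_PER_ARRAY - spotted,
        "arrays_total": SLIDES * ARRAYS_PER_SLIDE,
    }


layout = array_layout()
print(layout)
```

The two oligo classes add up to the 17,161 distinct oligomers, two replicates of each fit within the nominal array capacity, and 3 slides of 4 arrays give the twelve arrays mentioned.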



The selection of a good gene finder for yeast was surprisingly hard given that Saccharomyces cerevisiae has
been completely sequenced since 1996. The reason for this is that, to date, gene prediction software focuses
either on prokaryotes, which have no introns, or on higher eukaryotes, which have a large number of rather long
introns. In lower eukaryotes like yeast only a small proportion of genes contain introns, and of these most have
very few and short introns. Additionally, the yeast genome includes fewer repeats and is overall more compact than
the genome of complex eukaryotes. For example, the yeast genome contains about 70% protein coding regions whereas
in the human genome this number is only 2% [Kellis et al., 2003]. To see which type of gene finder is best suited
for yeast genomes, a test using three de novo gene finders (GeneMark, Glimmer3 and GlimmerHMM) was carried out.
S. cerevisiae was the logical choice for the test since it is the closest related organism to P. pastoris for
which genome and ORF data are available. The results confirmed that a gene finder written for eukaryotes
(GlimmerHMM) was not adequate for yeast organisms since it introduced far too many introns into the predicted
genes. The prokaryotic versions performed much better, with GeneMark predicting fewer false negatives but more
false positives than Glimmer3. As stated before, a higher rate of false positives was preferable over a higher
rate of false negatives, therefore GeneMark was used in the gene prediction of P. pastoris.
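
A comparison of this kind can be sketched as counting false positives and false negatives of each gene finder against a reference ORF set. The code below is a minimal illustration, not the evaluation used in this work: the exact-coordinate matching criterion, the function name `evaluate_predictions` and all coordinates are assumptions made up for the example.

```python
def evaluate_predictions(predicted, reference):
    """Compare predicted ORFs with reference ORFs by exact coordinate match.

    Each ORF is a (sequence_id, strand, start, end) tuple. Exact matching is
    an assumption of this sketch; a real evaluation may allow partial overlap.
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted and annotated
    fp = len(predicted - reference)   # predicted but not annotated
    fn = len(reference - predicted)   # annotated but missed
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy data (hypothetical coordinates, for illustration only).
reference = {("chrI", "+", 100, 400), ("chrI", "-", 900, 1500), ("chrII", "+", 50, 650)}
permissive_finder = reference | {("chrII", "-", 800, 1000), ("chrI", "+", 2000, 2300)}
strict_finder = {("chrI", "+", 100, 400), ("chrI", "-", 900, 1500), ("chrII", "-", 800, 1000)}

permissive = evaluate_predictions(permissive_finder, reference)
strict = evaluate_predictions(strict_finder, reference)
print(permissive)
print(strict)
```

In this toy case the permissive finder misses nothing but adds two spurious ORFs, while the strict finder adds one spurious ORF and misses one real gene. When missing a gene is considered worse than designing a probe for a spurious ORF, the permissive profile is the preferable one.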

The lowest threshold for the de novo predictions that GeneMark allowed was used to get the largest possible set
of ORFs. The resulting sequences as well as the ORFs that were predicted by IG were then blasted against coding
sequences of S. cerevisiae. Sequences with HSPs that had a total length greater than 100 and an identity greater
than 50% were considered similar, and of these only the HSP with the highest e-value per sequence was recorded.
This information only indicates conserved regions between the two yeast organisms but not the actual length of
the coding regions. Moreover, considering that the similarity between S. cerevisiae and S. bayanus, both members
of the Saccharomyces sensu stricto group, is about the same as between human and mouse (62% and 66% respectively
in orthologous regions) [Kellis et al., 2003], it follows that a notable number of P. pastoris genes will not have
orthologs in S. cerevisiae. Due to these facts BLAST scores were taken into account but were only an ancillary
criterion in the selection of oligo candidates.
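
The HSP filter described above can be sketched as follows. This is a Python illustration, not the Perl scripts used in this work. It assumes BLAST results in the standard 12-column tabular layout (query, subject, percent identity, alignment length, ..., e-value, bit score); the function name `filter_hsps`, the `pick` parameter and the example hits are hypothetical. Because the text says "highest e-value", the default takes that literally with `max`; passing `pick=min` selects the most significant HSP instead.

```python
import csv
from collections import defaultdict


def filter_hsps(lines, min_length=100, min_identity=50.0, pick=max):
    """Keep one HSP per query from BLAST tabular (12-column) lines.

    An HSP passes if its alignment length exceeds min_length and its percent
    identity exceeds min_identity. Among the passing HSPs of a query, `pick`
    (max or min) chooses one by e-value.
    """
    kept = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        hsp = {
            "query": row[0],
            "subject": row[1],
            "identity": float(row[2]),
            "length": int(row[3]),
            "evalue": float(row[10]),
        }
        if hsp["length"] > min_length and hsp["identity"] > min_identity:
            kept[hsp["query"]].append(hsp)
    return {q: pick(hsps, key=lambda h: h["evalue"]) for q, hsps in kept.items()}


# Toy input (made-up hits, for illustration only).
example = [
    "orf1\tYAL001C\t72.5\t300\t80\t2\t1\t300\t10\t309\t1e-50\t200",
    "orf1\tYBR002W\t55.0\t150\t66\t1\t5\t154\t20\t169\t1e-10\t80",
    "orf2\tYCR003W\t45.0\t400\t210\t5\t1\t400\t1\t400\t1e-20\t90",  # identity too low
    "orf3\tYDL004W\t90.0\t80\t8\t0\t1\t80\t1\t80\t1e-15\t70",       # too short
]
literal = filter_hsps(example)
most_significant = filter_hsps(example, pick=min)
print(literal)
print(most_significant)
```

Only `orf1` survives the thresholds in this toy input; which of its two HSPs is recorded depends on the `pick` choice.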

To avoid cross-hybridization the sequences were run through cd-hi, a program that clusters sequences according to
their similarity and outputs a file containing the longest sequence of each cluster. A similarity threshold of 90%
was used in our case. The more recent version (cd-hit) was deliberately not used since it allows for a certain
amount of redundancy. Both versions (cd-hi, cd-hit) were developed and used to build up non-redundant protein
databases. Since the program was written for protein sequences and involves alignments between these sequences,
it might introduce a bias when used with nucleotide sequences. The program was used under the assumption that
the bias, if there is one, is negligible. Still, a nucleotide version of cd-hit (cd-hit-est) is now available and
it would be interesting to compare the results of the nucleotide version to the one used in this work.
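
The clustering idea (process sequences longest first, let each sequence either join an existing cluster or found a new one, and report the longest member as representative) can be sketched in a few lines. This is only an illustration of the greedy principle, not the cd-hi implementation: the similarity measure here is `difflib.SequenceMatcher.ratio()` from the Python standard library, a crude stand-in for a real alignment-based identity, and the function names and toy sequences are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude stand-in for sequence identity (NOT an alignment score)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold=0.9):
    """Greedy incremental clustering, longest sequence first.

    `sequences` maps names to sequence strings. Returns a dict that maps each
    representative (the longest member of its cluster) to its member names.
    """
    order = sorted(sequences, key=lambda name: len(sequences[name]), reverse=True)
    clusters = {}
    for name in order:
        for representative in clusters:
            if similarity(sequences[name], sequences[representative]) >= threshold:
                clusters[representative].append(name)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[name] = [name]
    return clusters


# Toy sequences (made up for illustration).
toy = {
    "short": "ACGT" * 9 + "ACG",
    "long": "ACGT" * 10,
    "other": "T" * 20 + "G" * 10,
}
clusters = greedy_cluster(toy)
print(clusters)
```

Keeping only the representatives (`list(clusters)`) corresponds to the non-redundant output file described above.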

After the QC and normalization steps it will be decided which probes are deemed true and will be on the next chip.
Probes that are not marked as outliers and that have an intensity value higher than the average of the negative
probes plus the standard deviation are considered positive. Since the array contains specific as well as
unspecific probes it will also be necessary to check for cross-hybridization. This is done by comparing the
distribution of the intensity values of the two groups as well as the total number of positive hybridizations for
each group. Additionally, it is possible to use supervised learning techniques to see if there is a significant
difference between the two classes of probes. The positive probes, plus all probes that have a high prediction
value as well as probes from complex clusters (which were not used on this microarray chip due to time
limitations), will make up the next round of arrays. For the complex clusters a multiple sequence alignment has to
be prepared before the decision can be made which sequences to include in the next design. The sequences that were
present on the first array can be easily selected by their IDs alone, but as new sequences are introduced into the
design a new run of OligoArray 2.1 is required to check again for cross-hybridization in the set. After that the
probes can be uploaded to Agilent for the production of a new chip. Depending on the performance of the new array,
I estimate that at least one more iteration will be necessary to produce a chip that contains mainly true probes.
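
The positive-call rule and the per-group comparison described above can be sketched as follows. The record layout (`id`, `group`, `intensity`, `outlier`), the function names and the intensity values are assumptions for illustration; the text also does not say which standard deviation is meant, so the sample standard deviation is used here.

```python
from statistics import mean, stdev


def call_positive(probes, negative_intensities):
    """Flag probes as positive: not an outlier and intensity above the cutoff.

    The cutoff is mean + one sample standard deviation of the negative probes
    (the choice of the sample standard deviation is an assumption).
    """
    cutoff = mean(negative_intensities) + stdev(negative_intensities)
    calls = {
        p["id"]: (not p["outlier"]) and p["intensity"] > cutoff
        for p in probes
    }
    return calls, cutoff


def positive_fraction_by_group(probes, calls):
    """Fraction of positive probes within each group (e.g. specific/unspecific)."""
    totals, positives = {}, {}
    for p in probes:
        totals[p["group"]] = totals.get(p["group"], 0) + 1
        positives[p["group"]] = positives.get(p["group"], 0) + int(calls[p["id"]])
    return {group: positives[group] / totals[group] for group in totals}


# Toy data (made-up intensities, for illustration only).
negatives = [90, 100, 110]
probes = [
    {"id": "p1", "group": "specific", "intensity": 500, "outlier": False},
    {"id": "p2", "group": "specific", "intensity": 105, "outlier": False},
    {"id": "p3", "group": "unspecific", "intensity": 300, "outlier": True},
    {"id": "p4", "group": "unspecific", "intensity": 111, "outlier": False},
]
calls, cutoff = call_positive(probes, negatives)
fractions = positive_fraction_by_group(probes, calls)
print(cutoff, calls, fractions)
```

A markedly higher positive fraction in the unspecific group than in the specific group would be one hint that cross-hybridization contributes to the signal.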



Sauer M., Branduardi P., Gasser B., Valli M., Maurer M., Porro D., Mattanovich D., Differential gene expression in
recombinant Pichia pastoris analysed by heterologous DNA microarray hybridisation, Microb Cell Fact, 2004,
Volume 3, Number 1

Stekel D., Microarray Bioinformatics, Cambridge University Press 2003

Kreil D.P., Russell R.R., Russell S., Microarray Oligonucleotide Probes, Methods in Enzymology, 2006, Volume 410

Kellis M., Patterson N., Endrizzi M., Birren B., Lander E. S., Sequencing and comparison of yeast species to identify
genes and regulatory elements, Nature, 2003, Volume 423