BITS 2010 Abstracts Genomics - Bioinformatics Italian Society



BITS 2010

Abstracts

Genomics

Finding new genes for non-syndromic hearing loss through an
in silico prioritization study

Accetturo M(1), Creanza TM(1), Giordano A(1), Leo P(1), Santoro C(1), Scioscia G(1),
Tria G(1), Vaccina A(1) *

Accetturo M et al.

Motivation

Prioritizing genes is a major concern for complex disorders whose genetic causes are
not yet completely understood. Owing to its extremely heterogeneous genetics,
non-syndromic Hereditary Hearing Loss (HHL) is one of the best candidates for such an
approach: 51 genes are already known to cause this phenotype when mutated, and 111
chromosomal regions, each containing one or more HHL-causing genes, have been linked to
the disease over the years. These regions are often large and contain hundreds of
genes, which makes systematic screening of all the genes they contain (candidate genes)
for causative mutations unfeasible. In this scenario, computational support is strongly
needed to rank the candidate genes by their probability of causing the disease when
mutated. To address this issue we built a gene scoring system based on Gene Ontology
(GO), which scores the candidate genes for HHL by comparing them with the 51 HHL
disease genes, on the rationale that genes whose dysfunction causes the same disease
tend to be functionally related.

Methods

We defined a semantic similarity measure that exploits the information contained in GO
to quantify the functional similarity between genes. Starting from Lin's metric (Lin
1998), which measures the semantic similarity between two GO terms through their
Information Content (IC), i.e. a measure of how specific and informative a term is, we
estimated the semantic similarity between two gene products from their GO annotations,
measuring the information they share normalized by the information contained in their
total descriptions. Because GO allows multiple parents for each concept, two terms can
share parents through multiple paths; for each pair of GO terms we therefore chose the
most specific parent term shared by both. We defined the set of candidate genes for HHL
as all the genes contained in the susceptibility loci known so far, and we prioritized
them for association with the disease by measuring their similarity to the disease gene
set. The semantic similarity measure was computed for each candidate-disease gene pair;
the final score used to prioritize each candidate was then obtained as the mean of the
scores estimated for that candidate against all the disease genes.

* (1) IBM GBS BAO Advanced Analytics Services and MBLab, Via P. Leonida Laforgia, 14 -
70125 Bari, Italy
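
The scoring scheme described in the Methods can be illustrated with a minimal Python
sketch. The toy ontology, the IC values and the gene names are hypothetical, and the
gene-level aggregation (a best-match average over annotation pairs) is an assumption:
the abstract only states that shared information is normalized by total information.
Only the term-level formula follows Lin (1998).

```python
from statistics import mean

# Toy GO-like DAG (child -> parents) with made-up IC values; purely illustrative.
PARENTS = {
    "root": [],
    "transport": ["root"],
    "ion_transport": ["transport"],
    "k_transport": ["ion_transport"],
    "cytoskeleton_org": ["root"],
    "actin_org": ["cytoskeleton_org"],
}
IC = {"root": 0.0, "transport": 1.0, "ion_transport": 2.0,
      "k_transport": 3.0, "cytoskeleton_org": 1.5, "actin_org": 2.5}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def lin_similarity(t1, t2):
    """Lin (1998): 2 * IC(most informative shared ancestor) / (IC(t1) + IC(t2))."""
    denominator = IC[t1] + IC[t2]
    if denominator == 0:
        return 0.0
    shared = ancestors(t1) & ancestors(t2)
    return 2 * max(IC[t] for t in shared) / denominator


def gene_similarity(terms_a, terms_b):
    """Assumed aggregation: best-match average over both annotation sets."""
    best_a = [max(lin_similarity(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(lin_similarity(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


def prioritize(candidates, disease_genes):
    """Rank candidates by their mean similarity to all disease genes."""
    scores = {
        name: mean(gene_similarity(terms, d) for d in disease_genes.values())
        for name, terms in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `prioritize({"g1": ["k_transport"], "g2": ["actin_org"]}, {"d1": ["ion_transport"]})`
ranks the hypothetical gene `g1` first, because its annotation shares an informative
ancestor with the disease gene while that of `g2` shares only the root.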

Results

The twenty top-scored genes were then examined to evaluate their possible involvement
in HHL. Ten of them are reported in the literature to be expressed in the human inner
ear and/or cochlea, six are reported to be expressed in the inner ear and/or cochlea of
other organisms, mainly mouse or chicken, and four have no gene expression data for
these tissues. Given the limited availability of gene expression data for the human
inner ear, these findings support the quality of the ranking we produced with respect to
HHL. Moreover, most of the top-ranked genes play roles compatible with a possible
involvement in the HHL phenotype, such as a) remodeling and organization of actin, an
essential component of the hair-cell bundle; b) formation and maintenance of cilia, the
sensory organelles that receive the mechanical stimulus; c) K+ cycling and pH
homeostasis in cochlear fluids, essential for the generation and maintenance of the
endocochlear potential; d) signal transduction. These functions again support our
ranking. We also validated our results by adding to the candidates 15% of the disease
genes, randomly drawn 1000 times from the disease gene set, and testing whether the
number of disease genes ranked in the top 100, 75, 50 and 25 genes was significantly
greater than expected when 100, 75, 50 and 25 genes were randomly extracted from the
total set. In all cases we observed a p-value smaller than 0.05. We are therefore
confident that our metric can suggest strong candidate genes for HHL to be screened in
patients and controls for causative mutations.
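
The abstract does not name the statistical test used in the validation. One common way
to compare the number of spiked-in disease genes found in a top-k list with a random
extraction of k genes is the upper tail of the hypergeometric distribution. The sketch
below assumes that choice; the function name and the numbers in the usage note are
hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, total, marked, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `total` genes, `marked` of which are spiked-in disease genes."""
    first = max(k, 0)
    last = min(marked, drawn)
    favourable = sum(
        comb(marked, i) * comb(total - marked, drawn - i)
        for i in range(first, last + 1)
    )
    return favourable / comb(total, drawn)
```

For example, with 8 spiked-in genes among 1,000 ranked genes, observing 5 of them in the
top 100 would be evaluated as `hypergeom_upper_tail(5, 1000, 8, 100)`, and the result
compared with the 0.05 threshold.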

Contact e-mail

pietro_leo@it.ibm.com


Generation of SNPs in eggplant (Solanum melongena L.)

Barchi L(1), Lanteri S(1), Portis E(1), Acquadro A(1), Valè G(2), Toppino L(3),
Rotino GL(3) *

Barchi L et al.

Motivation

Single nucleotide polymorphisms (SNPs) are the most abundant type of DNA sequence
polymorphism. Their high availability provides enhanced possibilities for genetic and
breeding applications such as cultivar identification, construction of genetic maps and
marker-assisted breeding. Furthermore, the development of high-throughput genotyping
methods makes SNPs highly attractive as genetic markers. Several methodologies are
available for SNP discovery. Recently, Miller et al. (Genome Research, 17, 2007)
proposed the use of the Restriction-site Associated DNA (RAD) method in association with
next-generation sequencing (Illumina) for rapid and cost-effective SNP mining. RAD
technology creates a reduced representation of the genome (RAD tags), allowing the
identification of a high number of SNPs suitable for genetic studies. At present, no
information has been reported on SNP marker development in eggplant (Solanum melongena
L.). Here we report the identification of a wide set of SNPs by sequencing RAD tags
derived from two eggplant inbred lines, ‘305E40’ and ‘67/3’, which we used as parents
of an F2 intra-specific mapping population. The newly identified SNP-derived markers
will contribute substantially to saturating the already developed intra-specific
genetic map and will provide powerful tools for comparative genetics studies within the
Solanaceae family.

Methods

Plant materials: the parental line ‘305E40’, characterized by elongated fruit, is a
doubled haploid (DH) obtained through anther culture of an advanced introgression line
(BC7) derived from an interspecific somatic hybrid Solanum aethiopicum gr. Gilo x S.
melongena cv. Dourga. The parental line ‘67/3’ is a selection from the intraspecific
cross cv. ‘Purpura’ x cv. ‘CIN2’ followed by seven cycles of selfing, and is
characterized by round fruit. These parents were crossed to obtain F1 hybrids which,
after selfing, generated an F2 population of 141 individuals.

RAD library preparation, sequencing, assembly and SNP discovery: genomic DNA from the
parental lines ‘305E40’ and ‘67/3’ was digested with the restriction endonucleases SgrAI
(SgrAI round) and PstI (PstI round), and an adapter (P1) was ligated to the generated
fragments. The P1 adapter contained a forward amplification primer site, an Illumina
sequencing primer site, and a barcode. P1 adapter-ligated fragments were combined,
sheared and ligated to a second adapter (P2). RAD tags carrying P1 and P2 adapters were
selectively PCR amplified. RADs from each parent were sequenced on the Genome Analyzer
II (Illumina) platform using paired-end 54 bp sequence reads (2 x 54 bp). Paired-end
sequences from each parent were pooled and segregated by single-read RAD sequences.
Velvet was used to assemble consensus LongRead contigs from the paired-end data. For
the SgrAI round, SNPs were called at a minimum of 4x coverage, while for the PstI round
SNPs were called at a minimum of 6x coverage.

Sequence annotation: the CAP3 algorithm was used to identify sequences in common
between the parents. A new dataset (A) was constituted for further analyses: it included
singlets from ‘67/3’ and ‘305E40’ as well as contigs deriving from both RAD rounds. The
standalone BLAST tool was used to identify the best annotation for each dataset. A
BLASTX search was carried out against TAIR9, adopting an E-value threshold of 1e-09,
while a BLASTN search was performed against the SGN Cornell unigene database with a
cut-off of 1e-20. Gene Ontology categorization of non-redundant sequences was inferred
from the TAIR9 best hits.

* (1) DIVAPRA Plant Genetics and Breeding, University of Torino, 10095 Grugliasco
(Torino), Italy; (2) CRA-GPG Genomic Research Centre, 29017 Fiorenzuola d'Arda
(Piacenza), Italy; (3) CRA-ORL Research Unit for Vegetable Crops, 26836 Montanaso
Lombardo (Lodi), Italy
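
The best-hit selection with an E-value threshold described under "Sequence annotation"
could be sketched as follows. The sketch assumes BLAST results in the standard
12-column tabular layout (query, subject, ..., E-value in column 11, bit score in
column 12); the abstract does not state which output format the authors used, and the
function name is hypothetical.

```python
def best_hits(tabular_lines, max_evalue):
    """Keep, for each query, the highest-scoring hit whose E-value does not
    exceed `max_evalue`. Input lines use the 12-column BLAST tabular layout."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

With this helper, the two searches described above would correspond to calls such as
`best_hits(blastx_lines, 1e-09)` and `best_hits(blastn_lines, 1e-20)`.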

Results

SNP discovery: a total of 38,941 and 38,935 contigs were obtained for parents ‘67/3’
and ‘305E40’, respectively. Overall, almost 15.45 Mb of high-quality de novo eggplant
sequence was obtained. The average contig length was 353.9 bp for ‘67/3’ and 340.8 bp
for ‘305E40’, with an overall value of 347.3 bp. A total of 11,580 SNPs and 1,664
InDels were identified considering both lines. The complete SNP panel was screened to
identify alleles free of flanking polymorphisms, and 2,435 SNPs were found to be
candidates for genotyping assays.

Sequence annotation: dataset A, consisting of 43,795 sequences (30,187 in common
between both parents, 5,535 singlets from ‘67/3’ and 5,603 singlets from ‘305E40’), was
searched against the TAIR9 protein database using BLASTX. Overall, 11,588 sequences
(26.45%) matched at the E-value threshold of 1E-09, clustering with 6,875 Arabidopsis
unigenes. The latter were successfully GO categorised. A BLASTN search against the
Cornell unigene database revealed a total of 17,631 sequences with a significant hit
(E-value threshold 1E-20), clustering with 14,338 SGN unigenes: these data will be
useful for anchoring eggplant genetic maps to the tomato genome scaffold. In
conclusion, the de novo sequencing of 15.45 Mb of the eggplant genome via RAD
technology provided 2,435 SNPs suitable for a wide range of genomic studies.

Contact e-mail

alberto.acquadro@unito.it


SNP array data and quantitative determination of cell fraction
bearing Copy Neutral-LOH regions in tumoral samples

Capizzi C(1), Musso N(1), Barresi V(1,2), Condorelli DF(1,2) *

Capizzi C et al.

Motivation

Molecular cytogenetic technology based on last-generation single nucleotide
polymorphism (SNP) arrays provides high-resolution mapping of two different
tumor-associated genetic alterations that are undetectable by conventional metaphase
cytogenetics: submicroscopic copy number abnormalities (CNAs) and copy-neutral loss of
heterozygosity (CN-LOH), also known as acquired uniparental disomy. Recent advances in
the analysis of SNP-array data allow quantification of the tumoral cell fraction bearing
specific CNAs or CN-LOH regions. In the present work we set up a method that exploits a
novel parameter for the representation of SNP array data, the so-called “allele
difference”, to provide a quantitative estimate of the fraction of cells bearing
specific CN-LOH regions, and we report some applications in clinical oncological cases.

Methods

High-resolution genome-wide DNA copy number and SNP genotyping analysis was performed
with Affymetrix SNP 6.0 arrays, which interrogate 906,600 SNPs and 945,826 copy number
probes (SNP/CNV array). The “allele difference value” is the difference between the
allele A signal and the allele B signal, each standardized with respect to its median
value in the reference HapMap population. Mathematical simulations of the results
obtained with different mixtures of cells bearing and not bearing CN-LOH were used to
derive a calibration curve for determining the cell fraction bearing the CN-LOH in an
unknown sample.
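
The mixture simulation can be illustrated with a deliberately simplified Python sketch.
It assumes that the standardized signal of each allele is proportional to its copy
number, that the SNP is heterozygous (AB) in normal cells, and that CN-LOH cells carry
two copies of allele A and none of allele B. These assumptions, the function names and
the linear interpolation are illustrative; they are not taken from the authors' script.

```python
def expected_allele_difference(fraction):
    """Allele A minus allele B signal for a mixture in which `fraction` of the
    cells carry the CN-LOH (AA) and the remaining cells are heterozygous (AB)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    allele_a = (1 - fraction) * 1 + fraction * 2
    allele_b = (1 - fraction) * 1 + fraction * 0
    return allele_a - allele_b


def calibration_curve(steps=10):
    """Simulate mixtures from 0% to 100% CN-LOH cells in equal steps."""
    return [(i / steps, expected_allele_difference(i / steps))
            for i in range(steps + 1)]


def estimate_fraction(observed_difference, curve):
    """Read the cell fraction off the curve by linear interpolation on the
    absolute allele difference (the sign only tells which allele was lost)."""
    x = abs(observed_difference)
    for (f0, d0), (f1, d1) in zip(curve, curve[1:]):
        if d0 <= x <= d1:
            return f0 + (f1 - f0) * (x - d0) / (d1 - d0)
    raise ValueError("observed value lies outside the calibration range")
```

Under these assumptions the curve is a straight line, so an observed allele difference
of 0.8 maps to a CN-LOH cell fraction of about 0.4; a real calibration would have to
account for signal saturation and noise.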

Results

Using the implemented script we analysed SNP array data from hematological malignancies
(acute myeloid leukemias, myelodysplastic syndromes) and colorectal cancer samples
containing different amounts of normal and pathological cells. A good correlation was
found with independent methods able to evaluate the cell fraction bearing a specific
chromosomal abnormality.

Contact e-mail

carmencapizzi@libero.it

* (1) Laboratory on Complex Systems, Scuola Superiore di Catania, University of Catania,
Italy; (2) Department of Chemical Sciences, Section of Biochemistry and Molecular
Biology, University of Catania, Italy


Renewing bioinformatics workflow system by using a Web 2.0
approach

Colella R(1), Quinto V(2), Vaccina A(1), Scioscia G(1), Leo P(1) *

Colella R et al.

Motivation

In this work, Web 2.0 technologies and approaches are adopted to enhance a key field of
bioinformatics platform research: the management and automation of analysis workflows.
We considered recent Web 2.0 technologies such as mashup platforms, which enable the
rapid creation, sharing and discovery of reusable application building blocks (widgets,
feeds, mashups), also known as consumables, as an alternative environment to support
the design and execution of bioinformatics workflows. The use of mashups is expanding in
the business environment. Business mashups, for instance, are adopted to integrate
business and data services by making it possible to develop new integrated services
quickly. Typically, a mashup gives organizations a much more flexible way to combine
internal and external services and thereby create new services, which are usually
accessed through user-friendly Web browser interfaces. We applied mashup principles to
the bioinformatics workflow context, with the final aim of collecting insights for
developing a new kind of bioinformatics workflow system. We ran our experiments by
prototyping a number of widgets and bioinformatics consumables that can be mashed up in
a typical mashup environment, and we built some bioinformatics pipelines to validate the
effectiveness of our approach.

Methods

Consumables (widgets and services) were developed using the Lotus Widget Factory, an
Eclipse plug-in that provides an easy-to-use development environment enabling
developers of all skill levels to create dynamic widgets rapidly, almost without
writing code (except for a bit of Java and JavaScript). Lotus Widget Factory is a
component of IBM® Mashup® Center™, which also includes: 1) the mashup builder, used to
configure and wire widgets on a mashup page; pages can be published to the catalogue and
shared with other users; 2) MashupHub, a catalogue for mashup objects: feeds, data
mashups, mashup pages, REST services, spaces, and widgets. The catalogue includes
community features for sharing information with other users; objects stored in the
catalogue can be tagged, rated and commented on.




* (1) IBM Italia S.p.A., Bari; (2) Exhicon S.r.l., Bari


Results

A first set of widgets was developed to perform basic data manipulation operations,
such as uploading flat files (i.e. files containing biosequences and associated
information, stored in specific formats) from the local file system, selecting the
sequences to be input to the workflow, executing a REST (Representational State
Transfer) service call, and some others. The core widget receives data (as XML) from
one or more widgets, invokes a generic WSDL-described web service and sends the results
(as XML) to all the widgets that the user has wired to it, iterating the web service
call over the input data set according to its size and to a user-chosen parameter. Some
prototype workflows were assembled and tested with a number of these basic widgets,
making use of some of the two hundred algorithms of the EMBOSS suite, exposed as web
services.
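
The wiring behaviour of the core widget can be mimicked in a few lines of Python. This
is only a conceptual sketch: the class names, the XML layout and the `batch_size`
parameter (standing in for the user-chosen parameter) are hypothetical, and a plain
Python callable replaces the WSDL-described web service. It does not use the Lotus
Widget Factory or IBM Mashup Center APIs.

```python
import xml.etree.ElementTree as ET


class Widget:
    """A building block that forwards XML messages to the widgets wired to it."""

    def __init__(self, name):
        self.name = name
        self.targets = []
        self.inbox = []

    def wire(self, target):
        self.targets.append(target)

    def send(self, xml_text):
        for target in self.targets:
            target.receive(xml_text)

    def receive(self, xml_text):
        self.inbox.append(xml_text)


class CoreWidget(Widget):
    """Calls a service on the incoming records, `batch_size` records per call."""

    def __init__(self, name, service, batch_size=1):
        super().__init__(name)
        self.service = service        # stand-in for the remote web service
        self.batch_size = batch_size  # stand-in for the user-chosen parameter

    def receive(self, xml_text):
        records = [e.text for e in ET.fromstring(xml_text).findall("sequence")]
        for start in range(0, len(records), self.batch_size):
            batch = records[start:start + self.batch_size]
            results = ET.Element("results")
            for value in self.service(batch):
                ET.SubElement(results, "result").text = value
            self.send(ET.tostring(results, encoding="unicode"))
```

A workflow is then assembled by wiring an upload widget to the core widget and the core
widget to one or more viewer widgets; every message sent by the uploader triggers one
service call per batch, and each result message reaches all wired viewers.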

Contact e-mail

roberto_colella@it.ibm.com


Challenging an ensemble approach (GENTES) with the Gene-Environment
iNteraction Simulator (GENS)

D’Andrea D(1), Amato R(1,2,3), Pinelli M(1,4), Tagliaferri R(1,5), Cocozza S(1,4),
Miele G(1,2,3) *

D’Andrea D et al.

Motivation

Complex diseases are multifactorial traits caused by both genetic and environmental
factors. They account for the major part of human diseases and include those with the
largest prevalence and mortality (cancer, heart disease, obesity, etc.). Despite the
large amount of information that has been collected on both genetic and environmental
risk factors, there are few examples of studies on their interactions in the
epidemiological literature. One reason may be the incomplete knowledge of the
statistical power of the Feature Selection Methods (FSMs) usually used to identify risk
factors and their interactions in data sets. As is well known, each FSM has its own
performance and weaknesses and performs better under particular conditions. An
improvement in this direction would clearly lead to a better understanding and
description of gene-environment interactions. To this end, a possible strategy is to
challenge the different statistical methods against data sets where the underlying
phenomenon is completely known and fully controllable, for example simulated ones; to
determine rules that improve the performance of each method; and to combine FSMs in an
ensemble approach that adds up the strengths of each method while diluting their
weaknesses.

Methods

We present a mathematical approach that models gene-environment interactions. With this
method it is possible to generate simulated populations having gene-environment
interactions of any form, involving any number of genetic and environmental factors and
also allowing non-linear interactions such as epistasis. In particular, we implemented a
simple version of this model in the Gene-Environment iNteraction Simulator (GENS), a
tool designed to simulate case-control data sets in which a one gene-one environment
interaction influences the disease risk. The main aim was to allow population
characteristics to be input using standard epidemiological measures and to implement
constraints that make the simulator's behavior biologically meaningful. We then
developed new software, the Gene-Environment iNteraction Exploration System (GENTES),
which implements an ensemble of FSMs aimed at identifying the relevant genetic and
non-genetic features involved in a given complex disease. The ensemble can be composed
of any type of FSM; we present the implementation of four of them, namely Binary
Logistic Regression, Linear Discriminant Analysis, Multifactor Dimensionality Reduction
and a univariate χ2 or t-test. We optimized the performance of the ensemble in
identifying gene-environment interactions through challenges on simulated datasets.

* (1) Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università
di Napoli "Federico II" - Università di Salerno, Italy; (2) Dipartimento di Scienze
Fisiche, Università di Napoli "Federico II", Napoli, Italy; (3) INFN Sezione di Napoli,
Napoli, Italy; (4) Dipartimento di Biologia e Patologia Cellulare e Molecolare "L.
Califano", Napoli, Italy; (5) Dipartimento di Matematica e Informatica, Università di
Salerno, Fisciano (SA), Italy
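
A one gene-one environment simulation of the kind GENS performs can be sketched with a
logistic risk model and Monte Carlo sampling. The sketch below is a simplification under
stated assumptions: it takes raw logistic coefficients as input (GENS instead accepts
standard epidemiological measures), draws genotypes under Hardy-Weinberg proportions and
uses a standard normal environmental exposure. Function and parameter names are
hypothetical.

```python
import math
import random


def simulate_population(n, allele_freq, beta0, beta_g, beta_e, beta_ge, seed=0):
    """Return n tuples (genotype, exposure, diseased) drawn from the model
    logit(risk) = beta0 + beta_g*g + beta_e*e + beta_ge*g*e."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        # Genotype = number of risk alleles (0, 1 or 2), two independent draws.
        genotype = sum(rng.random() < allele_freq for _ in range(2))
        exposure = rng.gauss(0.0, 1.0)
        logit = (beta0 + beta_g * genotype + beta_e * exposure
                 + beta_ge * genotype * exposure)
        risk = 1.0 / (1.0 + math.exp(-logit))
        diseased = rng.random() < risk  # Monte Carlo assignment of status
        people.append((genotype, exposure, diseased))
    return people


def case_control_sample(people, n_cases, n_controls, seed=0):
    """Draw a case-control data set from the simulated population."""
    rng = random.Random(seed)
    cases = [p for p in people if p[2]]
    controls = [p for p in people if not p[2]]
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)
```

Because the interaction coefficient `beta_ge` is set by the user, the true underlying
phenomenon is known, which is what allows feature selection methods to be challenged
against the simulated data.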

Results

With the multi-logistic model implemented in GENS it is possible to simulate
case-control samples of a complex disease in which gene-environment interactions
influence the disease risk. The user has full control of the main characteristics of
the simulated population, and a Monte Carlo process introduces random variability. A
knowledge-based approach reduces the complexity of the mathematical model by using
reasonable biological constraints and makes the simulation more understandable in
biological terms. Simulated data sets were used to evaluate the statistical power of
four widely used FSMs. Moreover, the same simulated data sets were used to compare the
performance of the ensemble with that of the single FSMs. The ensemble generally
performed better than, or comparably to, each of its components.
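
The abstract does not say how GENTES combines the outputs of its FSMs. One simple way to
build an ensemble, shown here purely as an assumption, is to keep the features selected
by at least a minimum number of methods. The function name and the feature labels in the
usage note are hypothetical.

```python
from collections import Counter


def ensemble_select(selections, min_votes):
    """Return the features chosen by at least `min_votes` of the methods.
    `selections` holds one iterable of selected feature names per method."""
    votes = Counter(
        feature for selected in selections for feature in set(selected)
    )
    return sorted(f for f, count in votes.items() if count >= min_votes)
```

For instance, if four methods return `["snp1", "smoking"]`, `["snp1"]`,
`["snp1", "snp7"]` and `["smoking", "snp1"]`, a threshold of three votes keeps only
`snp1`, while a threshold of two also keeps `smoking`.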

Contact e-mail

daniel.dandrea@gmail.com


Functional and structural annotation of human protein variants
originated from alternative splicing in human

D’Antonio M(1), Martelli PL(2), Castrignanò T(1), Fariselli P(2), Casadio R(2),
Zauli A(3), Pesole G(4,5) *

D’Antonio M et al.

Motivation

Alternative splicing has been suggested as a key mechanism for expanding the functional
landscape of the human genome. To detect structural and functional features of
alternative protein variants originating from the same gene, we implemented a pipeline
that annotates all the alternative splicing variants included in the ASPicDB database
by integrating different state-of-the-art tools for similarity search and for the
prediction of structural and functional features of a protein from its residue
sequence.

Methods

For each of the 254,195 protein variants coming from 17,142 human genes, a first layer
of annotation consists in searching with BLAST for similar sequences that are annotated
in the SwissProt database (rel. 53.0) or endowed with a resolved three-dimensional
structure in the PDB database (rel. Apr 09). Moreover, remote homology searches are
performed by mapping onto the sequence the structural and functional domains collected
in the PFAM database (rel. 23.0). For this purpose we adopted the pfam_scan.pl program,
downloaded from the PFAM ftp site (ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/).
The second layer of annotation results from a decision tree integrating several
predictive tools developed by the Bologna Biocomputing Group. As a first step,
N-terminal signal peptides and C-terminal GPI-anchor propeptides are predicted with Spep
(Fariselli et al., 2003) and PredGPI (Pierleoni et al., 2008), respectively. When
present, they are cleaved from the protein sequence, and the presence and localization
of alpha-helical transmembrane domains are predicted with ENSEMBLE (Martelli et al.,
2003). Secondary structure and cysteine bonding state are predicted with SecPred
(Jacoboni et al., 2000) and CysPres (Martelli et al., 2004), respectively. The
subcellular localization of globular proteins is predicted with BaCelLo (Pierleoni et
al., 2006).
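
The control flow of the second annotation layer can be summarised in a short Python
sketch. The predictors are passed in as plain callables with hypothetical interfaces
(a cleavage position or `None`, a list of helices, a localization label); they stand in
for Spep, PredGPI, ENSEMBLE and BaCelLo and do not reproduce the real interfaces of
those tools. Secondary structure and cysteine state prediction are omitted for brevity.

```python
def annotate(sequence, predict_signal_end, predict_gpi_start,
             predict_tm_helices, predict_localization):
    """Apply the decision tree: cleave predicted propeptides, look for
    transmembrane helices, and localize the protein only if it is globular."""
    annotation = {}
    signal_end = predict_signal_end(sequence)  # index where the mature chain starts, or None
    gpi_start = predict_gpi_start(sequence)    # index where the propeptide starts, or None
    annotation["signal_peptide"] = signal_end is not None
    annotation["gpi_anchor"] = gpi_start is not None

    # Remove the predicted N-terminal and C-terminal segments, when present.
    start = signal_end if signal_end is not None else 0
    end = gpi_start if gpi_start is not None else len(sequence)
    mature = sequence[start:end]

    helices = predict_tm_helices(mature)
    annotation["tm_helices"] = helices
    if helices:
        annotation["type"] = "transmembrane"
    else:
        annotation["type"] = "globular"
        annotation["localization"] = predict_localization(mature)
    return annotation
```

Passing the predictors as arguments keeps the decision logic testable on its own: stub
functions can replace the real tools when checking that cleavage happens before the
transmembrane and localization steps.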




* (1) Consorzio per le Applicazioni di Supercalcolo per Università e Ricerca, Rome,
Italy; (2) Biocomputing Group, University of Bologna, Bologna, Italy; (3) BioDec srl,
Bologna; (4) Istituto Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche,
Bari, Italy; (5) Dipartimento di Biochimica e Biologia Molecolare, University of Bari,
Bari, Italy


Results

As a result of the first annotation layer, 228,737 protein variants share similarity
with a SwissProt sequence with an E-value lower than 10^-5. Of these proteins, 129,828
also share similarity with a sequence included in the PDB database. In these cases, the
transfer of functional and structural annotations by similarity is feasible, at least
for the aligned regions. In addition, 350,870 PFAM domains are mapped onto 159,538
protein variants with an E-value lower than 10^-5. The second layer of annotation
identifies 28,991 and 1,679 sequences endowed with signal peptides and GPI-anchor
propeptides, respectively. A total of 41,594 variants are predicted to be
transmembrane and, among globular proteins, BaCelLo classifies 69,251 sequences as
nuclear, 90,267 as cytoplasmic, 19,514 as mitochondrial and 31,996 as secreted. The
structural and functional annotations of the proteins encoded by the transcript
variants were added to the ASPicDB database (Castrignanò et al. 2008) and can be
browsed through a graphical search interface, which also allows retrieval of all the
genes whose splicing variants encode proteins with specific structural or functional
properties (e.g. PFAM or TM domain, protein type, etc.) (Fig. 1A) or show differences
in specific features (Fig. 1B). ASPicDB, supplemented with the annotations for the
protein-coding transcripts, is available at http://www.caspur.it/ASPicDB/.

Availability

http://www.caspur.it/ASPicDB/

Contact e-mail

graziano.pesole@biologia.uniba.it

Figure 1. Protein Search form in ASPicDB


Revealing the chromosome organization of the emerging tomato
genome

Di Filippo M(1), Traini A(1), D’Agostino N(1), Frusciante L(1), Chiusano ML(1) *

Di Filippo M et al.

Motivation

In an attempt to provide further insights into the nature, composition and role of
heterochromatin and euchromatin in plant genomes, we analyzed the first tomato (Solanum
lycopersicum) genome draft (Mueller et al., 2009), which was obtained by a BAC
(Bacterial Artificial Chromosome) based sequencing effort focused on the euchromatic
region (~250 Mb) of the tomato genome, which is also expected to be richer in genes
(Peterson et al., 1996).

Methods

ISOL@ is conceived as a multi-level computational environment accessible through
different gateways. The 'transcriptome' gateway provides an access point to explore
publicly available EST collections from Solanaceae species. The 'genome' gateway allows
users to browse the EST-based annotation of the tomato BAC sequences. The resulting
identification of the 'expressed' loci is exploited for the definition of gene models
and for the identification of putative alternative transcripts. Annotations of
repetitive elements, based on the Plant Repeats database at Michigan State University,
are included as well. Both gene and repeat content for each BAC were then calculated as
the percentage of nucleotides covered by S. lycopersicum ESTs and by annotated repeats,
respectively.
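
The per-BAC coverage calculation amounts to taking the union of the annotated intervals
and dividing its length by the BAC length. A minimal Python sketch follows; it assumes
1-based inclusive coordinates, and the function name is hypothetical.

```python
def percent_covered(intervals, bac_length):
    """Percentage of a BAC covered by the union of the given intervals
    (1-based, inclusive start and end), counting overlaps only once."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end + 1:
            # A gap separates this interval from the running block: close the block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent interval: extend the running block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return 100.0 * covered / bac_length
```

Calling the function once with the EST alignment intervals and once with the repeat
annotation intervals of the same BAC gives its gene content and its repeat content,
respectively.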

Results

Solanaceae transcriptome data that are part of the Italian Solanaceae Platform (ISOL@,
Chiusano et al., 2008, accessible at http://biosrv.cab.unina.it/isola/), integrated
with the available tomato genome sequences, allowed us to perform preliminary
investigations of structural features of the tomato genome. We considered the repeat
content and gene distribution for each BAC, providing a quick, high-resolution preview
of the genome composition. Despite the generally accepted idea that euchromatic regions
are gene rich, we found a number of BACs showing low gene and high repeat content
outside the pericentromeric regions. In addition, the screening of the gene and repeat
content per BAC revealed that: i) repeat and gene coverage both vary widely along
chromosomes; ii) in spite of this variability, repeat-richer or gene-richer BACs are
generally organized as blocks sharing similar composition; iii) repeat-richer or
gene-richer blocks can be associated with heterochromatic or euchromatic regions,
respectively; iv) repeat-richer blocks, which are evident in both pericentromeric and
extra-pericentromeric regions, show similar compositional properties; v) regions of
inverted repeats are found mainly associated with heterochromatin, suggesting a role in
chromatin compaction. In conclusion, although the tomato genome sequence analysed here
comprises only 120 Mb, an integrated bioinformatics platform and novel computational
strategies made it possible to reveal a typical design of the emerging tomato
chromosomes and pave the way for further investigations of the relationship between DNA
primary structure and chromatin organization in Solanaceae genomes.

* (1) Dept. of Soil, Plant, Environmental and Animal Production Sciences, University of
Naples Federico II

Availability

http://biosrv.cab.unina.it/isola/

Contact e-mail

miriam.difilippo@gmail.com


Polyketide and non-ribosomal peptide synthetases in
Aspergillus carbonarius genome: a strategy for identification of
secondary metabolite clusters

Gallo A(1), Baker SE(2,3), Mulè G(1), Susca A(1), Logrieco A(1), Perrone G(1) *

Gallo A et al.

Motivation

One of the major aims of fungal genomics is the identification of genes involved in
the biosynthesis of secondary metabolites, which include important mycotoxins. These
compounds are relevant to human and animal health, so understanding how they are
produced is important in order to limit the contamination of food and feed. On the
other hand, there is a significant opportunity for the discovery of novel bioactive
natural products that could be exploited for beneficial applications, such as
antibiotics. Genes encoding enzymes likely to be involved in natural product
biosynthesis can be readily located in sequenced genomes with computational sequence
comparison tools. Many fungal secondary metabolites are polyketides and non-ribosomal
peptides, and in recent years the genome analysis of many filamentous fungi has revealed
an unexpectedly large number of genes encoding polyketide synthases (PKS) and
non-ribosomal peptide synthetases (NRPS). The identification of these genes could aid
the prediction of secondary metabolite biosynthetic gene clusters. Coupling genome
sequencing with transcriptional analyses and genetic manipulation accelerates the
elucidation of the biosynthetic pathway and of the regulatory mechanism of production.
The recent sequencing and annotation of the Aspergillus carbonarius genome, generated in
collaboration with the US DOE Joint Genome Institute (JGI), will enable a variety of
bioinformatic analyses in this important organism, which has been reported to be the
main agent of ochratoxin A (OTA) contamination of wine, grapes, grape juice and dried
vine fruits. OTA, a widespread mycotoxin produced by several species of Aspergillus and
Penicillium, is a potent nephrotoxin and also displays hepatotoxic, teratogenic and
immunosuppressive properties. It has been classified in group 2B (a possible human
carcinogen) by the International Agency for Research on Cancer. The OTA biosynthetic
pathway has not yet been completely elucidated. So far, the majority of studies have
focused on Penicillium species and A. ochraceus; combined with the molecular structure
of the mycotoxin, they point toward a PKS as the key enzyme catalyzing the first step of
OTA biosynthesis.




* (1) National Research Council, Institute of Science of Food Production (ISPA), Bari,
Italy; (2) Pacific Northwest National Laboratory, Richland, Washington, US; (3) DOE
Joint Genome Institute, Walnut Creek, California, US


Methods

For the A. carbonarius sequencing project both the 454 and the Sanger sequencing
technologies were used. The annotation was performed on the basis of a consensus gene
set predicted by the JGI annotation pipeline, using a variety of cDNA-based,
protein-based and ab initio gene modelers, and a filtering step based on homology and
EST support. PKSs and NRPSs constitute a class of multifunctional proteins that use a
very similar strategy for the biosynthesis of two distinct classes of natural products.
They have complex modular structures. The main domains of each NRPS module are the
adenylation (A), the peptidyl carrier (T) and the condensation (C) domains, whereas the
ketoacyl synthase (KS), the acyltransferase (AT) and the acyl carrier protein (ACP)
domains are the main domains of PKSs. To retrieve pks and nrps genes, the consensus
sequences of the principal domains were used as queries in BLAST searches of the A.
carbonarius genome assembly. Homology analysis of the retrieved pks genes was carried
out to identify putative pks genes involved in the biosynthesis of OTA. Analysis of the
annotated genes belonging to the putative OTA biosynthetic cluster was initiated in an
attempt to clarify the biosynthetic pathway.
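
The domain composition given above suggests a simple rule for labelling a predicted
protein once its domains have been detected. The Python sketch below is only an
illustration of that rule: the function name is hypothetical, the domain labels are the
abbreviations used in the text, and requiring all three core domains is an assumption
rather than the criterion applied by the authors.

```python
PKS_CORE = {"KS", "AT", "ACP"}  # ketoacyl synthase, acyltransferase, acyl carrier protein
NRPS_CORE = {"A", "T", "C"}     # adenylation, peptidyl carrier, condensation


def classify_synthase(domains):
    """Label a protein from the set of domain abbreviations detected in it."""
    found = set(domains)
    has_pks = PKS_CORE <= found
    has_nrps = NRPS_CORE <= found
    if has_pks and has_nrps:
        # Both core sets present; the abstract does not discuss this case.
        return "PKS+NRPS"
    if has_pks:
        return "PKS"
    if has_nrps:
        return "NRPS"
    return "unclassified"
```

A protein with domains `["KS", "AT", "ACP"]` would be labelled `PKS`, one with
`["A", "T", "C"]` would be labelled `NRPS`, and a protein lacking part of either core
set would be left unclassified for manual inspection.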

Results

The genome assembly sequence of A. carbonarius and its automated annotation are now
open to the public (http://genome.jgi-psf.org/Aspca1), while a new annotated version is
currently restricted to registered users (http://genome.jgi-psf.org/Aspca3). The genome
sequence was the result of a hybrid assembly of 963 scaffolds spanning a total of 36.3
Mbp. A total of 11,624 genes, with an average length of 2,241 bp, were structurally and
functionally annotated. The search tools available at the genome portal allowed us to
establish the presence in A. carbonarius of 33 NRPS-encoding and 28 PKS-encoding genes,
most of which are located in clusters. The analysis of their domain and modular
structures confirmed the high diversity of the two enzyme classes, reflecting the
different roles they may have in fungal metabolism. A pks gene of about 8000 bp,
encoding a protein of about 2500 aa, showed high similarity to the putative OTA pks gene
of A. niger, which was identified on the basis of homology to the A. ochraceus pks
involved in the biosynthesis of OTA. The putative OTA pks of A. carbonarius belongs to a
cluster which, according to the annotation data, includes some genes that may have a
role in the OTA synthesis mechanism. Among them are genes encoding an NRPS, a cytochrome
P450-type monooxygenase and a halogenase, which interestingly match the predictions made
on the basis of the OTA molecular structure and the results obtained so far in other
OTA-producing fungal species.



Contact e-mail

antonia.gallo@ispa.cnr.it

Supplementary information

This research was partially supported by the Italian Ministry of Research (MIUR)
project (MBLab, DM19410), Fondo FAR - Legge 297/1999 Art. 12/lab



aCGH Segmentation: analysis of a male breast cancer dataset

Iannelli G1, Mangia A1, Chiarappa P1, Paradiso A1, Tommasi S1*

Iannelli G et al.

Motivation

In recent years DNA copy number has acquired a new role as a determinant of
diagnosis and therapy in cancer. Comparative genomic hybridization (CGH) is a
technique by which it is possible to detect and map genetic changes that involve
gain or loss of segments of genomic DNA. Microarray formats of CGH provide copy
number information at thousands of locations distributed throughout the genome.
The aim of this study was to find the best-performing algorithm to systematically
identify deleted or amplified genomic regions in a set of tumors. This algorithm
should provide an accurate methodology for detecting the breakpoints delimiting
altered regions in genomic patterns.

Methods

The pathological tissues of 25 male breast cancer patients enrolled at the NCC of
Bari were hybridized on high-density oligonucleotide aCGH arrays. aCGH was
performed using the Agilent Human Genome CGH Microarray Kit (Agilent
Technologies, Santa Clara, California, USA) with a resolution of about 100 kb. Data
analysis was performed with Nexus Copy Number 5.0 software (Biodiscovery, Inc.,
El Segundo, CA, USA). This software uses the Rank Segmentation algorithm, a
proprietary and much faster variation on Circular Binary Segmentation
(CBS), together with the statistical Significance Testing for Aberrant Copy number
(STAC) method, to identify non-random genomic amplifications and deletions
across multiple experiments. We used the modified CBS algorithm to improve the
processing speed. It uses a normal distribution function to test for change points,
as opposed to the original algorithm, which is based on non-parametric permutation.
It is a recursive algorithm that keeps dividing the genome into smaller and smaller
segments until no region can be further segmented. The result is a segmentation of
the genome into clusters of uniform ratios. The algorithm has a single parameter,
called Significance Threshold, that controls whether a region is to be segmented out
or not. At the completion of the segmentation process, the entire genome can be
represented as a series of segments, each having a cluster value,
which is the median log-ratio value of all the probes in that region. The calling
algorithm then uses the cluster values and the user-defined threshold to establish
regions of copy number variation. According to this algorithm, two regions are
considered different when p values are lower than 1.0E-6. Genomic regions of



* 1 National Cancer Centre “Giovanni Paolo II” - Bari (I)



gains were defined as those with an averaged log2 CGH fluorescence ratio above 0.2,
and losses as those with an averaged log2 CGH ratio below -0.2. Frequency
significance testing, instead, helps to
identify the areas of the genome where there is a statistically significant high
frequency of aberrations over the baseline level of aberrations. The algorithm
identifies sets of aberrations that are stacked on top of each other more often than
would occur at random. To complement our analysis, we compared our dataset with
a female breast cancer dataset deposited in the Gene Expression Omnibus
(http://www.ncbi.nlm.nih.gov/projects/geo//query/acc.cgi?acc=GSE12659), applying
the same algorithms.
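
As a rough illustration of the calling step described above, the following Python
sketch computes the cluster value of a segment as the median of its probe log2 ratios
and labels the segment as a gain or a loss using the 0.2 and -0.2 thresholds. It is not
the Nexus Copy Number implementation: the function name `call_segment`, the use of
strict inequalities and the probe values are assumptions made only for illustration.

```python
from statistics import median

# Thresholds on the averaged log2 ratio, as quoted in the text.
GAIN_THRESHOLD = 0.2
LOSS_THRESHOLD = -0.2


def call_segment(probe_log2_ratios, gain=GAIN_THRESHOLD, loss=LOSS_THRESHOLD):
    """Return (cluster_value, state) for one segment.

    The cluster value is the median log2 ratio of the probes in the segment.
    Strict inequalities are an assumption of this sketch.
    """
    cluster_value = median(probe_log2_ratios)
    if cluster_value > gain:
        state = "gain"
    elif cluster_value < loss:
        state = "loss"
    else:
        state = "neutral"
    return cluster_value, state


# Hypothetical probe values for three segments (not data from the study).
segments = {
    "seg1": [0.10, 0.30, 0.50],
    "seg2": [-0.50, -0.30, -0.25],
    "seg3": [0.05, -0.10, 0.00],
}
calls = {name: call_segment(values)[1] for name, values in segments.items()}
```

Counting, for each genomic position, how many samples carry a "gain" or "loss" call
would then give the per-region aberration frequencies that the frequency significance
test evaluates.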

Results

All the 25 males displayed chromosomal instability: 760 gains, 711 losses, 223
high copy gains, 8 homozygous copy losses. An average of 68 aberrations was found
in each patient in this study. Amplifications were more frequent on chromosomes 7
(50%), 11 (50%), 16 (40%) and X (70%), while chromosomal deletions were on
chromosomes 1 (60%), 2 (70%), 4 (50%), 5 (40%), 14 (53.33%), 15 (46.66%), 19
(40%) and Y (40%). The aberrations were unequally distributed among the patients,
with 4 patients having fewer than 10 aberrations. The number of aberrations does not
seem to depend on the age of the patients. In the female dataset we found 738
losses, 804 gains, 228 high copy gains and 5 homozygous copy losses. An average of
110.93 aberrations was found in each patient; amplifications were more
recurrent on chromosomes 1 (85.5%), 2 (68.75%), 3 (68.75%), 8 (43.75%), 11
(56.25%), 17 (56.25%) and 20 (50%), and deletions on chromosomes 2 (43.75%), 3
(62.5%), 7 (50%), 15 (81.25%), 16 (50%) and 17 (68.75%). The genomic aberration
profiles of the two datasets are quite different, with very few regions in common
between males and females. In our experience, the modified CBS algorithm together
with the statistical Significance Testing for Aberrant Copy number can be used as a
valid evaluation method for copy number variation detection.

Contact e-mail

s.tommasi@oncologico.bari.it



The repetitive landscape of Wheat Chromosome 5A. A
preliminary study based on low-coverage NGS technologies

Lamontanara A1, Vitulo N2, Albiero A2,3, Forcato C2, Campagna D2, Dal Pero F3,
Cattivelli L1, Bagnaresi P1, Colaiacovo M1, Faccioli P1, Simkova H6, Dolezel J6,
Perrotta G5, Giuliano G4, Valle G2, Stanca M1*

Lamontanara A et al.

Motivation

Next generation sequencing (NGS) technologies are evolving at a very quick pace.
While for most small genomes accurate characterization of the genome landscape is
no longer a challenging task in terms of both time and costs, complex eukaryotic
genomes such as cereal plant genomes pose dramatic constraints due to both their size
and the abundance of repeats and transposable elements (TEs), which especially
hamper the final assembly steps. Nonetheless, acquiring preliminary information
concerning the repetitive landscape of complex genomes is of obvious interest in
order to gain insights into the composition and dynamics of this sizeable genome
fraction. Furthermore, gaining an early insight into TE composition may
prove useful in order to counteract technical issues which can potentially arise
when finalizing genome assembly. Undertaking a low-coverage NGS sequencing
approach on discrete genome fractions (such as different arms of a given chromosome)
can provide a first insight into these issues while keeping experimental efforts and
costs under a desirable threshold.

Methods

Wheat (Triticum aestivum) chromosome 5A short and long arms (5AS and 5AL,
respectively) were independently isolated by flow cytometry. The DNA from the
sorted chromosome arms was amplified with the GenomiPhi amplification kit (GE
Healthcare), processed for DNA fragment analysis and run on a Roche 454-Titanium
sequencer. A coverage of about 2X was produced. For 5AS and 5AL,
2,407,89 and 3,324,512 reads were respectively obtained. Long and short arm
reads were independently blasted against Triticeae genomic repeat sequences
(TREP complete database, BLASTN, Expect value < 10e-6; [1]) and matching
reads were assigned to the “known TE families” group. In order to identify
candidate novel TEs (novel TEs, while bearing scarce homology at the DNA level



* 1 Genomics Research Centre, Italian Agricultural Research Council, via S.Protaso 302,
I-29017 Fiorenzuola d’Arda (Pc), Italy
2 CRIBI Biotechnology Center, University of Padova via U.Bassi 58/b, 35131 Padova, Italy
3 Bmr-genomics srl via Redipuglia 21/A, 35131 Padova, Italy
4 ENEA, Research Center CASACCIA, S.M. Galeria, 00163 Rome, Italy
5 ENEA, TRISAIA Research Center, S.S. 106 Ionica, 75026 Rotondella (Matera), Italy
6 Laboratory of Molecular Cytogenetics and Cytometry, Institute of Experimental Botany,
Sokolovska 6, CZ-77200 Olomouc, Czech Republic



to known TEs may nonetheless exhibit substantial homologies in the CDS to
known TEs) the leftover reads were further screened by blasting against the TREP
protein division (PTREP, BLASTX, Expect value < 10e-6). The resulting hits were
grouped in the “novel TE families” fraction. Within the group of remaining reads, a
further class of ill-defined “repeats” was identified and an approximate
quantification was attempted on the basis of their participation in contigs with high
coverage (> 20-fold). In fact, given the 2-fold coverage in this study, a > 20-fold
coverage should by definition represent repeats [1]. When populating this further
class of repeats, in-house Python scripts were developed to parse ACE files and
subsequently select only reads participating in contigs devoid of reads assigned to
already classified TEs or genes. The “others” group fraction refers to the leftover
reads and should include, among various fractions, nuclear genes, organellar DNA,
low-complexity DNA and further components.
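
The order in which reads are assigned to the four groups can be sketched in Python
as follows. This is not the authors' in-house code: the function names, the read
identifiers and the representation of BLAST and assembly results as plain sets of read
IDs are assumptions, and the additional requirement that high-coverage contigs be
devoid of already classified reads is left out for brevity.

```python
from collections import Counter


def classify_reads(all_reads, blastn_hits, blastx_hits, high_coverage_reads):
    """Assign each read to the first group it qualifies for, in the order used in the text."""
    labels = {}
    for read in all_reads:
        if read in blastn_hits:              # DNA-level match to a known repeat
            labels[read] = "known TE families"
        elif read in blastx_hits:            # protein-level match only
            labels[read] = "novel TE families"
        elif read in high_coverage_reads:    # member of a high-coverage contig
            labels[read] = "repeats"
        else:
            labels[read] = "others"
    return labels


def group_percentages(labels):
    """Percentage of reads falling in each group."""
    counts = Counter(labels.values())
    total = len(labels)
    return {group: 100.0 * n / total for group, n in counts.items()}


# Hypothetical read identifiers, for illustration only.
reads = ["r1", "r2", "r3", "r4", "r5"]
labels = classify_reads(
    reads,
    blastn_hits={"r1", "r2", "r3"},
    blastx_hits={"r3", "r4"},
    high_coverage_reads={"r5"},
)
percentages = group_percentages(labels)
```

Because the checks are applied in sequence, a read such as `r3`, which matches at both
the DNA and the protein level, is counted only once, in the “known TE families” group.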

Results

Known TE families (identity at the DNA level) reached 72.67% and 71.14% for 5AS
and 5AL, respectively. Novel TE families (i.e. identity detected solely at the protein
level) amounted to 2.48% for 5AS and 2.60% for 5AL. Uncharacterized repeats
were 10.35% and 7.97% for 5AS and 5AL, respectively, leaving an “others” fraction
summing up to 14.49% for 5AS and 18.29% for 5AL. The quantitative distribution of
TE families was substantially uniform along the two 5A arms, apart from some discrete
families several-fold more abundant in the long arm. A few minor families were only
detectable in one of the two arms, possibly reflecting recent bursts of transposition
or, alternatively, classification artifacts.

Contact e-mail

antonella.lamontanara@entecra.it

Supplementary information

References

[1] Wicker T, Taudien S, Houben A, Keller B, Graner A, Platzer M, Stein N: A whole-genome
snapshot of 454 sequences exposes the composition of the barley genome and provides
evidence for parallel evolution of genome size in wheat and barley. Plant J 2009,
59(5):712-722.




Enabling a Multivariate Strategy for Genotyping Quality Control
as a Grid Service

Malovini A1,3, Nuzzo A2, Puca AA3, Bellazzi R1*

Malovini A et al.

Motivation

The vast amount of molecular data generated by high-throughput techniques
requires robust computational approaches already at the pre-processing stage,
which is a heavily error-prone process. In particular, quality control of genotyping
data is based on parameters whose setting is often unclear and subjective, leading
to a lack of reproducibility, and heavily affecting the final results. A formal approach
is therefore needed for parameter tuning. As for other pre-processing tasks involved in
high-throughput experiments, computational implementations greatly benefit from a
High-Performance Computing infrastructure.

Methods

Experimental genotyping errors in genome-wide association studies (GWAS) can
lead to false positive findings and therefore to spurious associations. The Quality
Control (QC) phase, which is needed to minimize the effects of this kind of error,
relies on filtering procedures aimed at identifying: i) individual samples with errors
across multiple markers (problems with the DNA), and ii) SNPs yielding errors in
multiple individuals (marker-affecting errors). Several criteria (genotyping rate,
Hardy-Weinberg Equilibrium, sample heterozygosity rate, minor allele frequency,
genomic inflation factor, etc.) can be used to evaluate the effect of the removal of
SNPs and individuals, but the choice of the most appropriate threshold for this
filtering is usually based on visual inspection of the data plots, looking for a tradeoff
between losing samples and missing potentially associated SNPs. In order to make
this process more reproducible we propose two strategies based on Multi-Criteria
Decision Making theory for setting appropriate genotyping call rate (CR)
thresholds, with the final goal of maximizing the study power, which means
removing as few individual samples and markers as possible while minimizing the
genotyping error rate. In the first strategy, called Simple Multi-Attribute Rating
Technique (SMART), the decision maker is required to answer pairwise
comparison questions about the relative importance of a set of QC criteria. The
second strategy implements a different procedure for assigning criteria weights,
based on direct elicitation of user preferences (D-MCDM) using a 0 to 10 scale.
The best alternative for both strategies is the one with the highest score. These
methods are based on a comparison of different combinations of sample and SNP CR
thresholds, which even for a small GWAS (300K SNPs for a few hundred



* 1 Department of Computer Engineering and Systems Science, University of Pavia, Italy
2 Centre for Tissue Engineering, University of Pavia, Italy
3 IRCCS Multimedica, Milan, Italy



patients) requires a large computational effort. Thus, a parallelization strategy has
been studied and applied to the overall analysis process in order to increase
computing performance on a Grid infrastructure and make the results available in a
reasonable time. The service module submits commands to the statistical
programs R and Plink through an automated pipeline, exploiting the available high
performance computing resources. The Grid portal providing this service is
interfaced with two different environments: an IBM cluster based on the Platform
LSF scheduler (http://www.platform.com) and the gLite
(http://glite.web.cern.ch/glite) middleware. This environment has been chosen as it
has already been developed and validated for microarray gene expression data
and clinical data for survival analysis.
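
To make the idea of scoring alternative threshold combinations more concrete, here is
a minimal Python sketch of a direct-elicitation scheme. It assumes, purely for
illustration, that the 0 to 10 importance ratings are normalized to sum to 1 and that
each alternative is scored by a weighted sum of criterion values already scaled to the
0-1 range; the abstract does not specify the aggregation rule, and the criterion
names, ratings and values below are invented.

```python
def normalize_weights(raw_weights):
    """Rescale 0-10 importance ratings so that they sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return {name: value / total for name, value in raw_weights.items()}


def score_alternatives(alternatives, raw_weights):
    """Weighted-sum score of each alternative; higher is better."""
    weights = normalize_weights(raw_weights)
    return {
        label: sum(weights[name] * values[name] for name in weights)
        for label, values in alternatives.items()
    }


# Hypothetical ratings and criterion values (scaled to 0-1, higher is better).
raw_weights = {"samples_retained": 5, "snps_retained": 3, "error_reduction": 2}
alternatives = {
    "A": {"samples_retained": 1.0, "snps_retained": 0.5, "error_reduction": 0.0},
    "B": {"samples_retained": 0.0, "snps_retained": 0.5, "error_reduction": 1.0},
}
scores = score_alternatives(alternatives, raw_weights)
best = max(scores, key=scores.get)  # the highest scored alternative is selected
```

In a real analysis each alternative would be one combination of sample and SNP CR
thresholds, and every score requires re-running the QC filters, which is why the
comparison lends itself to parallel execution.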

Results

A genetic association module has been integrated in the HPC platform, whose
front-end, based on the EnginFrame Grid Portal, provides users with customized Web
interfaces, increasing application usability and productivity. We validated our
methods on (i) a real dataset generated by an Arterial Hypertension GWAS on 734
cases and 486 controls genotyped using Illumina 317k SNPs and (ii) a larger
simulated dataset. The results of the two strategies were comparable (best SMART
alternative: “samples CR >95% and SNP CR >96%”; best D-MCDM alternative:
“samples CR >95% and SNP CR >97%”). In particular, the two score profiles were
very similar for samples with CR <95%. For
samples with CR >96% the interpretation of the two profiles is more complex: D-MCDM
appears more “conservative”, penalizing stringent CR thresholds
corresponding to a decrease in statistical power (the elicitation process of the
criteria weights is done independently for each criterion), while SMART is able to
take into account correlations between criteria and is therefore associated with
smoother score functions. We also tested computational time requirements by
simulating GWAS datasets with different sample sizes (1000, 2000, 3000 and 4000
samples) and marker densities (370K and 550K SNPs). The parallelization strategy
leads to a decrease of one order of magnitude in computation time, from tens of
hours required on a standard PC to about 15 minutes for a 500 cases-500 controls-370K
SNPs dataset, and to about 2 hours for a 2000 cases-2000 controls-550K
SNPs dataset.

Availability

http://ada.dist.unige.it:8080/enginframe/bioinf

Contact e-mail

angelo.nuzzo@unipv.it



Supplementary information

We gratefully acknowledge Dr. Livia Torterolo and Prof. Marco Fato from the
University of Genoa for their support in deploying the algorithms on their Grid
Platform.



Gene functional clustering for improved prediction of Gene
Ontology annotations

Masseroli M1, Tagliasacchi M1*

Masseroli M et al.

Motivation

To annotate biomolecular entities, several controlled vocabularies and ontologies,
including the Gene Ontology (GO), are available and routinely used. They provide
a computable and shareable description of the increasing knowledge of structural,
functional and phenotypic features of genes in different organisms. Availability of
such controlled annotations is crucial to support interpretation of experimental
results and derive new biomedical knowledge. Unfortunately, only a subset of
genes of sequenced organisms has been annotated, mainly through automatic
annotation procedures. Indeed, considerable effort and time are required to obtain
reliable curated annotations. In this context, the curation of annotation data is
supported by the use of computational tools, e.g. in the assessment of the
relevance of inferred annotations or in the prediction of missed annotations with
high reliability. Some algorithms have been proposed to predict GO annotations.
Among them, the work by Khatri et al., based on singular value decomposition
(SVD) of the gene-to-term annotation matrix, seems to outperform other methods.
We propose a novel method which extends that algorithm by incorporating gene
clustering based on gene functional similarity computed by means of Gene
Ontology annotations.

Methods

Let the matrix A, with m rows corresponding to genes and n columns
corresponding to GO terms, represent all annotations of a specific GO ontology for
a given organism. The entry A(i,j) assumes value 1 if gene i is annotated to term j
or to any descendant of j in the GO structure, and 0 otherwise. The SVD-based
annotation prediction is performed by computing a reduced rank approximation Ak
of the matrix A by means of the singular value decomposition. Ak contains real
valued entries related to the likelihood that gene i shall be annotated to GO term j.
For a defined threshold t, if Ak(i,j) > t, gene i is predicted to be annotated to term j.
The SVD method implicitly adopts a global term-to-term correlation matrix T = A’A,
estimated from the whole corpus of available annotations. Instead, we propose an
adaptive approach, named SIM method, which clusters genes based on their
original annotation profile and estimates a set of distinct correlation matrices Tc.



* 1 Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo
da Vinci 32, 20133 Milano, Italy



For each matrix Tc, a predicted annotation profile for the gene i is computed. The
selected predicted annotation profile for the gene i is the one that minimizes the
variation, measured by the ell-2 norm, with respect to the original annotation profile
of the gene. To estimate the correlation matrices Tc, we cluster genes based on
their functional similarity, expressed through their annotations, by exploiting the
SVD of the matrix A. Thus, each gene might belong to more than one cluster with
different degrees of membership. To estimate Tc, for each cluster, first we
generate a modified gene-to-term matrix Ac, in which the i-th row of A is weighted
by the membership score of the corresponding gene to the c-th cluster. Then, we
compute Tc = Ac’Ac. To obtain a more accurate clustering, we also incorporate the
functional similarity between GO terms, computed by using Lin’s similarity
metric. To assess the performance of the SVD and SIM methods, we considered
the GO annotations of different organisms, including Saccharomyces cerevisiae
and Drosophila melanogaster, excluding annotations with evidence code IEA
(inferred from electronic annotation), since they have not been checked by a manual
curator. We performed k-fold cross-validation, confining our analysis to GO terms
used to annotate (directly or indirectly) at least 3 or 10 genes in order to obtain
more reliable predictions, and heuristically setting a fixed number of 5 clusters for
all ontologies.
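
The two matrix operations at the core of the description above, the reduced rank
approximation Ak and the cluster-weighted correlation matrix Tc = Ac’Ac, can be
sketched with NumPy as follows. This is only an illustration under stated
assumptions: the function names, the toy annotation matrix, the choice of k and t,
and the membership vector are invented, and the sketch does not reproduce how the
SIM method derives cluster memberships or selects the final profile.

```python
import numpy as np


def reduced_rank_approximation(A, k):
    """Rank-k approximation Ak of A obtained by truncating its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


def predict_annotations(A, k, t):
    """Boolean matrix that is True where Ak(i,j) > t."""
    return reduced_rank_approximation(A, k) > t


def cluster_correlation(A, membership):
    """Tc = Ac'Ac, where row i of A is scaled by the membership of gene i in the cluster."""
    Ac = A * membership[:, np.newaxis]
    return Ac.T @ Ac


# Toy gene-to-term matrix (4 genes x 3 terms), invented for illustration.
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
])
A2 = reduced_rank_approximation(A, k=2)
predicted = predict_annotations(A, k=2, t=0.5)
# With every membership equal to 1, Tc reduces to the global matrix T = A'A.
T = cluster_correlation(A, np.ones(A.shape[0]))
```

Entries of `predicted` that are True where A is 0 correspond to candidate new
annotations; lowering t yields more candidates at the cost of more false positives.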

Results

For each possible gene-term pair, our method produces a ranking score indicating
the likelihood of gene i being annotated to term j based on the whole corpus of
available annotations. Evaluation results demonstrate that our SIM method
generally outperforms the SVD method for all GO ontologies, showing that
clustering based on the functional similarity between terms might be beneficial.
Nevertheless, most of the performance gain of SIM over SVD stems from the
adaptive nature of SIM, regardless of how clustering is actually performed. In fact,
the SVD method, which computes similarities between clusters in terms of
frequency of co-annotation, is bound to be biased towards the larger clusters, since
it is unnormalized. The SIM method counterbalances such a bias with its adaptive
approach of clustering genes according to their original annotation profile. The
more likely predicted annotations provided by our method can help boost the
performance of data analyses that rely on existing annotations, such as
annotation enrichment analysis. Furthermore, although we considered only GO
annotations, our framework can be extended to handle different and also multiple
ontologies, as well as to provide predictions based on multiple data sources.

Contact e-mail

masseroli@elet.polimi.it



An optimized web server for metagenomics data analysis

Paoletti D1, D’Antonio M1, D’Onorio De Meo P1, Chillemi G1, Desideri A2,
Castrignanò T1, Pesole G3,4*

Paoletti D et al.

Motivation

The advent of next-generation sequencing (NGS) platforms has given an amazing
boost to metagenomics, a new and rapidly expanding discipline addressing the analysis
of the genetic complexity of environmental samples, allowing for the first time the
identification and functional characterization of the huge number of so far unknown
microorganisms which cannot be cultured in the lab. Indeed, metagenomic
analyses now make possible the full exploitation of the products of the evolution of
life in different environments and conditions, with unprecedented impact in several
biotechnological and medical areas. However, the large size of NGS data and the
complexity of their analyses involve computational workloads requiring
high-performance computing systems.

Methods

A high-throughput pipeline has been developed to provide high-performance
computing to automate the taxonomic and functional assignments of short reads
obtained by pyrosequencing technology through extensive similarity searches
against both protein and nucleotide databases. The pipeline is implemented in
PHP and involves three main open source components: NCBI BLAST [1], MySQL
[2], and Apache [3]. The server takes a multi-fasta file as input and performs an
optimized pipeline of customizable BLAST searches on daily updated databases of
microbial and eukaryotic species. The BLAST searches incrementally add data to a
"job directory" that contains all job-relevant data in XML files. PHP scripts parse the
result files and classify hit reads based on configurable thresholds given by the user
(e-value, identity, read overlapping, maximum residual read length), collecting
results into a MySQL database. As for the alignment tasks, the
workflow dispatches several parallelized BLAST searches on different computing
nodes in order to achieve an overall high-performance computing time.
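
The threshold-based classification of hits can be illustrated with a short sketch.
The actual pipeline is written in PHP; the example below uses Python only for
illustration, and the `Hit` record, the function names and the default threshold
values are assumptions rather than the server's real settings.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    read_id: str
    subject_id: str
    evalue: float
    identity: float       # percent identity of the alignment
    read_overlap: float   # fraction of the read covered by the alignment


def passes_thresholds(hit, max_evalue, min_identity, min_read_overlap):
    """True when the hit satisfies every user-configurable threshold."""
    return (
        hit.evalue <= max_evalue
        and hit.identity >= min_identity
        and hit.read_overlap >= min_read_overlap
    )


def classify_hits(hits, max_evalue=1e-5, min_identity=90.0, min_read_overlap=0.8):
    """Keep only the hits that pass all thresholds (default values are hypothetical)."""
    return [
        h for h in hits
        if passes_thresholds(h, max_evalue, min_identity, min_read_overlap)
    ]


# Invented hits, for illustration only.
hits = [
    Hit("read1", "subjectA", 1e-30, 98.5, 0.95),
    Hit("read2", "subjectB", 1e-3, 99.0, 0.90),   # fails the e-value threshold
    Hit("read3", "subjectC", 1e-20, 85.0, 0.90),  # fails the identity threshold
]
accepted = classify_hits(hits)
```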

Results

The web server offers a full view of all the analyzed data by querying the results
database through some specific search forms. Specifically, the pipeline is



* 1 Consorzio per le Applicazioni di Supercalcolo per Università e Ricerca, Rome, Italy
2 Department of Biology, University of Rome Tor Vergata, Rome, Italy
3 Istituto Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche, Bari, Italy
4 Dipartimento di Biochimica e Biologia Molecolare, University of Bari, Bari, Italy



structured in three different phases: 1) detection of host reads (this phase is
necessary in the case of metagenomic analyses of clinical samples); 2) detection
of reads unambiguously assignable to known species; 3) assignment of residual
reads to higher taxonomic ranks. This latter phase may be accomplished by available
software like MEGAN [4], using the BLAST output as input. The web service
provides: i) an alignment view of each identified read, including the organism
description, taxonomy ID and relative taxonomic tree recognition; ii) a taxonomic
map of the unidentified reads, showing a wide-range taxonomic tree and
highlighting the lowest common ancestor for reads that have been aligned on
multiple organisms; iii) global statistics of species and organism distribution
among the samples, including the identification of the host reads for samples
extracted from animal environments or tissues.
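
The lowest common ancestor mentioned in point ii) can be computed, in its simplest
form, by comparing the lineages of all organisms hit by a read. The sketch below is
not the web server's code or MEGAN's algorithm; it assumes that each lineage is
available as a root-to-leaf list of taxon names with the same ranks at the same
positions, and the lineages shown are simplified.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all lineages, each listed from root to leaf."""
    if not lineages:
        raise ValueError("at least one lineage is required")
    lca = None
    for taxa_at_rank in zip(*lineages):
        if all(taxon == taxa_at_rank[0] for taxon in taxa_at_rank):
            lca = taxa_at_rank[0]
        else:
            break  # lineages diverge here; keep the last shared taxon
    return lca


# Simplified lineages of two organisms hit by the same read.
escherichia_coli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                    "Enterobacteriaceae", "Escherichia", "Escherichia coli"]
salmonella_enterica = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
                       "Enterobacteriaceae", "Salmonella", "Salmonella enterica"]
lca = lowest_common_ancestor([escherichia_coli, salmonella_enterica])
```

A read hitting both organisms would thus be placed at the family level rather than
being assigned to either species.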

Contact e-mail

graziano.pesole@biologia.uniba.it

Supplementary information

References

[1] Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped
BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic
Acids Res. 1997;25:3389-3402. doi: 10.1093/nar/25.17.3389.

[2] MySQL. http://www.mysql.com

[3] Apache. http://www.apache.org

[4] Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data.
Genome Res 2007;17(3):377-86.



Microsatellites mined in Globe Artichoke EST database: linkage
analysis and relation to gene function

Scaglione D1, Acquadro A1, Portis E1, Lanteri S1, Taylor CA2, Knapp SJ2*

Scaglione D et al.

Motivation

Italy is the world's leading producer of globe artichoke (Cynara cardunculus var.
scolymus). Despite its economic relevance, knowledge of its genomic
organization is limited, thus hampering the application of marker-assisted breeding
programs. In order to fill this gap, the development of new molecular markers is
required to make possible the construction of dense genetic maps suitable for the
identification of QTLs (Quantitative Trait Loci). Microsatellite markers (SSRs,
simple sequence repeats) can be easily mined by analysing single-pass sequence
data, and their map positioning might serve as a primary scaffold for future genome
sequencing projects. A Cynara cardunculus EST database (36321 ESTs released
in NCBI by the Compositae Genome Project - CGP) was assembled and mined for
the identification of SSRs which can be employed as putative functional marker loci
to easily tag the corresponding functional genes. Furthermore, by a Gene Ontology
analysis, we highlighted which gene categories preferentially contain
microsatellites.

Methods

A customised bioinformatic pipeline was built. The 36321 ESTs were cleaned by
the SeqClean script querying the UniVec database and assembled using the
TGICL script. The resulting sequences were annotated by a batch BlastX process
against the Arabidopsis thaliana protein database (TAIR8), with a
threshold E-value of 10e-29. Unigenes were analysed for the presence of
microsatellites using the SSR Identification Tool (SSRIT) perl script, adopting the
following parameters: 5 repeats for dinucleotides, 4 for tri-, tetra- and
pentanucleotides and 3 for hexanucleotides. Flanking primers were designed by
means of the BatchPrimer3 web tool. Three hundred primer pairs were selected for a
first screening using a 28-genotype panel and subsequently mapped on a C.
cardunculus genetic map obtained by crossing a genotype of globe artichoke
(“Romanesco C3”) with one of cultivated cardoon (C. cardunculus var. altilis,
“A41”). Parsing the BlastX output, together with the data obtained by an ORF
(Open Reading Frame) predictor, we estimated the position of SSRs along the



* 1 DIVAPRA Plant Genetics and Breeding, University of Turin, via L. da Vinci 44,
10095 Grugliasco (Turin), Italy
2 Center for Applied Genetic Technologies, University of Georgia, 111 Riverbend Rd.,
30605 Athens, Georgia (U.S.A.)



transcripts. The Arabidopsis-based annotation allowed the categorization of each
unigene at different hierarchical levels of the Gene Ontology (GO). Combining the
aforementioned data with a Fisher’s exact test we were able to identify specific
gene categories in which specific SSRs were highly represented, with regard to
motifs (di- to hexa-nucleotidic) and positions (CDS, 5’- and 3’-UTRs).
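
The microsatellite search with the repeat thresholds listed above can be approximated
with regular expressions. The sketch below is not SSRIT and may differ from it in how
overlapping or imperfect repeats are handled; the function names and the example
sequence are invented for illustration.

```python
import re

# Minimum number of repeat units per motif length, as listed in the text.
MIN_REPEATS = {2: 5, 3: 4, 4: 4, 5: 4, 6: 3}


def is_primitive(motif):
    """False when the motif is itself a repetition of a shorter unit (e.g. ATAT)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    found = []
    for motif_len, min_repeats in MIN_REPEATS.items():
        # A motif followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):
                repeat_count = len(match.group(0)) // motif_len
                found.append((match.start(), motif, repeat_count))
    return sorted(found)


# Invented sequence with an (AG)5 and an (ATC)4 microsatellite.
example = "CC" + "AG" * 5 + "TT" + "ATC" * 4 + "GG"
ssrs = find_ssrs(example)
```

The primitive-motif check prevents a dinucleotide repeat from being reported a second
time as a tetra- or hexanucleotide repeat, and discards mononucleotide runs.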

Results

A total of 19055 unigenes were generated by the assembly process, while the
SSRIT script harvested 4219 microsatellites in 3308 unigenes. Sufficient flanking
sequence for primer design was present for 2311 SSRs in 1974 unigenes. A
total of 238 primer pairs (out of the 300 tested) produced clear PCR amplicons, of
which 236 were polymorphic; the estimated PIC (polymorphic information content)
values ranged from 0.035 to 0.891, with an average of 0.660. Polymorphic markers
segregating in the “Romanesco C3” (globe artichoke) x “A41” (cultivated cardoon)
F1 progeny were mapped, leading to the construction of an SSR-based consensus
map. Each parental map appreciably increased its coverage, merging into a number
of linkage groups (17) equal to the haploid chromosome number of C. cardunculus
(n=17). By analyzing the GO terms linked to specific microsatellite motifs, and their
relative position, significant enrichments in the dataset were discovered. The
majority of them belong to the ‘nucleic acids binding’ as well as ‘metabolism and
gene expression’ GO terms, with regard to ATC/GAT in the coding regions (CDS)
and AG/CT repeats in the 5’-UTR. On the other hand, the AG/CT motif in CDS seemed
to be involved in the ‘response to stress’ and ‘DNA damage’. These data
highlighted a close relation between specific microsatellites and gene function,
thereby confirming their role as cis-regulatory elements.
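
For readers unfamiliar with the PIC values quoted above, the following sketch shows
how such a value can be derived from allele frequencies. The abstract does not state
which formulation was used; the code assumes the widely used definition of Botstein
et al. (1980), PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2, and the allele
frequencies are invented.

```python
def polymorphic_information_content(allele_frequencies):
    """PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2 for one marker."""
    total = sum(allele_frequencies)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    squares = [p * p for p in allele_frequencies]
    homozygosity = sum(squares)
    cross_term = sum(
        2 * squares[i] * squares[j]
        for i in range(len(squares))
        for j in range(i + 1, len(squares))
    )
    return 1.0 - homozygosity - cross_term


# A marker with two equally frequent alleles (hypothetical frequencies).
pic_two_equal_alleles = polymorphic_information_content([0.5, 0.5])
```

Under this definition a marker with two equally frequent alleles has a PIC of 0.375,
so values well above that threshold imply more than two alleles at the locus.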

Contact e-mail

alberto.acquadro@unito.it



The human NumtS revised compilation, RHNumtS.2: custom
tracks, polymorphisms and validation by amplification and
sequencing

Simone D1, Calabrese FM1, Mineccia G1, Lang M2, Gasparre G2, Attimonelli M1*

Simone D et al.

Motivation

In our group in Bari we have a long tradition in the study of human mtDNA
variability. We have thus developed the HmtDB database [1] (www.hmtdb.uniba.it)
storing the published and unpublished human mitochondrial genomes (about 7000)
annotated with value-added data concerning population samples and DNA
variability. In recent years we have moved our focus to human nuclear DNA
regions where traces of human mitochondrial DNA can be detected: the NumtS
[2,3]. By comparing the reference human mtDNA with the reference human nuclear
DNA through database similarity searching tools (Blastn+MegaBlast+Blat), we
have produced and published the RHNumtS compilation [4], reporting 190 different
NumtS. To validate the in silico results, some of the NumtS reported in the
compilation showing a higher risk of being false positives have been sequenced from
a European sample. In order to complete the validation, a more systematic protocol
to search primers and hence to amplify and sequence all the NumtS in RHNumtS
has been set up. Incidentally, in the step of the protocol aimed at designing the
primers, by applying the Primer-BLAST software, we detected in the flanking
regions of the annotated NumtS DNA regions that, in the application of Blastn,
MegaBlast and BLAT, had not shown any similarity with the human mtDNA, while
the sensitivity of Primer-BLAST succeeded in detecting them. These results
prompted us to fully revise the RHNumtS compilation by applying Blastn with more
relaxed parameters, thus allowing the detection of NumtS regions less conserved
compared to their mitochondrial counterpart although showing traces of their mt
origin. Here we present the revised RHNumtS compilation (RHNumtS.2),
supporting tools and variability data concerning human NumtS.

Methods

RHNumtS.2 annotates the following categories of NumtS: (i) NumtS already
present in the first release (numeric code + A, B, C); (ii) NumtS flanking those
of the previous category, detected during primer design and identified with
the “nn” suffix; (iii) NumtS obtained by applying BLASTn (query: J01415.2, the
reference human mt genome) with more relaxed parameters (gap opening penalty = -5, gap



* (1) Dipartimento di Biochimica e Biologia molecolare Ernesto Quagliariello, University of Bari, Italy
(2) Unità di Genetica Medica, Policlinico Universitario S. Orsola-Malpighi, Bologna, Italy


extend penalty = -2, match reward = 2, mismatch penalty = -3, e-value = 1e-04).
They are identified with the “r” prefix. The UCSC custom tracks function has been
used to develop both the human nuclear and mitochondrial custom tracks, allowing
each RHNumtS sequence to be mapped on both the human nuclear and mt genomes.
The revised RHNumtS sequences, selected on the custom tracks among those not
falling in repeated regions, have been amplified and sequenced after the selection
of specific and unique primer pairs with Primer-Blast.
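
As an illustration of the relaxed search described above, the following Python
sketch assembles a BLAST+ blastn command line with the scoring values quoted in
the text. The file names (J01415.2.fasta, hg18_db, numts_hits.tsv), the choice of
the classic blastn task and the tabular output format are hypothetical, and the
exact invocation used by the authors is not given in the abstract. BLAST+ expects
gap costs as positive integers, so the gap penalties of -5 and -2 are passed as
5 and 2.

```python
import shlex


def relaxed_blastn_command(query_fasta, nuclear_db, out_path):
    """Build a blastn (BLAST+) argument list using the scoring values quoted in the text.

    query_fasta, nuclear_db and out_path are placeholders chosen by the caller.
    """
    return [
        "blastn",
        "-task", "blastn",        # classic blastn instead of the megablast default (assumption)
        "-query", query_fasta,    # e.g. the J01415.2 mitochondrial reference
        "-db", nuclear_db,        # nucleotide database built from the nuclear genome
        "-reward", "2",           # match reward = 2
        "-penalty", "-3",         # mismatch penalty = -3
        "-gapopen", "5",          # gap opening penalty of -5, given as a positive cost
        "-gapextend", "2",        # gap extension penalty of -2, given as a positive cost
        "-evalue", "1e-4",        # e-value threshold = 1e-04
        "-outfmt", "6",           # tabular output (assumption, convenient for parsing)
        "-out", out_path,
    ]


cmd = relaxed_blastn_command("J01415.2.fasta", "hg18_db", "numts_hits.tsv")
print(shlex.join(cmd))
```

The list form can be passed directly to subprocess.run once BLAST+ is installed
and the database has been built.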

Results

Blastn returned 766 hits; closely spaced hits were concatenated following the
criteria described by Graur [5]. The RHNumtS.2 compilation thus reports 624 NumtS.
Primer design was performed on 265 NumtS. Of these, 223 (84%) have been
amplified and sequenced from a European sample, showing the efficiency of our
protocol and the good quality of the data annotated in RHNumtS. Most of the
NumtS that were not sequenced were amplified but, owing to the presence of
heterozygous insertions, it was difficult to obtain the sequence. The sequences
produced have been multialigned with the mtDNA counterpart from the same sample,
with the hg18 NumtS, and with the corresponding fragment of the rCRS (J01415.2).
A Java script applied to each NumtS multialignment extracted the variant sites
due to mt/nu, nu/nu or mt/mt SNPs. The resulting data are being checked against
dbSNP for the nu/nu SNPs and against Phylotree and HmtDB for the mt/nu SNPs,
with the aim of contributing to the dating of NumtS (see Calabrese FM et al., in
this book). Finally, browsing the NumtS custom track allows the user to
characterize each NumtS and to know in which genomic context it is located:
whether in a genic or intergenic region, which SNPs have been mapped inside it,
whether the NumtS lies within a repetitive region, the syntenic region in other
organisms, and so on. The full extended RHNumtS custom tracks are available upon
request and can be loaded on the UCSC genome browser as personal custom tracks.
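
The concatenation of closely spaced hits mentioned above can be pictured as a
merge of sorted intervals. The sketch below is only an illustration: the criteria
of reference [5] are not reproduced in this abstract, so the max_gap threshold,
the tuple layout and the toy coordinates are all hypothetical, and the real
criteria may also take the mitochondrial coordinates and the strand into account.

```python
def concatenate_hits(hits, max_gap):
    """Merge hits on the same chromosome whose distance is at most max_gap bases.

    hits: iterable of (chrom, start, end) tuples with start <= end.
    max_gap: hypothetical distance threshold; not a value from the abstract.
    """
    merged = []
    for chrom, start, end in sorted(hits):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            # Close enough to the previous hit: extend it instead of adding a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


toy_hits = [("chr1", 100, 200), ("chr1", 250, 400), ("chr1", 5000, 5100), ("chr2", 10, 90)]
numts = concatenate_hits(toy_hits, max_gap=100)
print(numts)
```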

Contact e-mail

dome.simone@gmail.com

Supplementary information

References

[1] Attimonelli M. et al., BMC Bioinformatics, 2005, 6:S4.

[2] Lopez JV et al., JME, 1997, 39(2):174-190.

[3] Parr RL et al., BMC Genomics, 2006, 7:185.

[4] Lascaro D. et al., BMC Genomics, 2008, 9(1):267.

[5] Hazkani-Covo E. and Graur D., Mol. Biol. and Evolution, 2007, 24(1):13.


Enhanced Reference Guided Assembly

Vezzi F(1,2), Policriti A(1,2), Cattonaro F(2)*

Vezzi F et al.

Motivation

The availability of several complete and draft genomes, together with the advent
of Next Generation Sequencing (NGS) technologies, allows analyses thought
infeasible only a few years ago. Despite that, de-novo assembly remains a hard
task, and assembling a new organism using only short reads is in practice
extremely difficult. A reference sequence from a related organism, when
available, can be used to assist the assembly of the new organism. In this case
the sequences are first aligned against the reference and then a consensus
sequence must be extracted. The only way to align the huge amount of sequences
produced by NGS instruments is to use fast aligners like SOAP2. These aligners
tend to be highly conservative: reads can be aligned only with a low number of
errors and usually without gaps. While this allows us to reconstruct the regions
that are similar between the two organisms, we are unable to reconstruct the
divergent parts. A new strategy is clearly needed to assemble new organisms in
the presence of the sequence of a closely related species. This is especially
true when using NGS data, as both reference-guided and de-novo assembly present
very peculiar problems in this case. Here we propose a pipeline named Enhanced
Reference Guided Assembly (e-RGA) that combines reference-guided and de-novo
assembly in order to obtain an improved assembly for a newly sequenced organism
in the presence of a closely related reference.

Methods

Let R be the collection of short reads produced in the sequencing effort for a given
higher organism B, and let A be the reference sequence belonging to an organism
closely related to B. As shown in the attached figure, there are essentially two
possible ways to perform Reference Guided Assembly (RGA). The standard way,
here named s-RGA, consists in aligning all the reads in R on the reference A and
then extracting the consensus sequence s-A. The other way, here named dn-RGA,
consists in performing de-novo assembly on R, then aligning the resulting contigs
on the reference sequence, and hence extracting the consensus sequence dn-A.
This approach has the significant advantage of allowing the use of BLAST-like
tools for the alignment, thereby permitting low similarity parameters and partial
hits. The resulting assemblies are composed of a set of ordered and oriented contigs



* (1) Department of Mathematics and Informatics, University of Udine, Udine
(2) IGA Istituto di Genomica Applicata, Udine


separated by gaps. As shown in Figure 1, once s-A and dn-A are available we
have to merge the two sequences. This situation is similar to the one already
studied in the so-called Assembly Reconciliation (AR). The aim of AR is to merge
the outputs of two different assemblers run on the same set of reads in order to
obtain an improved final assembly. The merging step is the most computationally
demanding one. The hard task is the identification of the areas to be merged.
Using an approach similar to AR and exploiting some specific constraints, we are
able to solve this task in time linear in the length of A, without performing a
global alignment between s-A and dn-A.
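
To give an intuition for why a merge can avoid global alignment, note that both
s-A and dn-A are laid out on the coordinates of the reference A. The sketch below
is not the e-RGA algorithm, whose constraints are not detailed in this abstract.
It is a simplified two-pointer sweep, under the assumption that each assembly is
given as a sorted list of non-overlapping (start, end) contig intervals on A, and
it only reports the regions covered by either assembly in a single pass.

```python
def merge_on_reference(s_contigs, dn_contigs):
    """Union of two sorted, non-overlapping interval lists in one linear sweep.

    Each list holds (start, end) half-open intervals in reference coordinates.
    The interval representation is an assumption made for this illustration.
    """
    merged = []
    i = j = 0
    while i < len(s_contigs) or j < len(dn_contigs):
        # Take the next interval, by start coordinate, from either list.
        if j >= len(dn_contigs) or (i < len(s_contigs) and s_contigs[i][0] <= dn_contigs[j][0]):
            start, end = s_contigs[i]
            i += 1
        else:
            start, end = dn_contigs[j]
            j += 1
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the last merged region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


s_a = [(0, 100), (150, 300)]      # toy contig placements for the s-RGA assembly
dn_a = [(80, 160), (400, 500)]    # toy contig placements for the dn-RGA assembly
covered = merge_on_reference(s_a, dn_a)
print(covered)
```

Each interval is visited once, so the running time grows linearly with the
number of contigs rather than with the product of the two sequence lengths.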

Results

The e-RGA pipeline has been implemented and tested on two different datasets.
The first one consisted of four conifer chloroplast genomes (Black, White, Red and
Norway spruce) sequenced with an Illumina GAII instrument producing 45 bp reads.
A reference for pine chloroplast is available (Pinus thunbergii), while there is no
reliable one for spruce chloroplast. Therefore, e-RGA was applied using the pine
chloroplast as reference. The second dataset consists of a set of 32 bp reads from
a microbial genome of length 2.7 Mb, sequenced using an Illumina GAI instrument.
A reference sequence is available, but the reads were sequenced from an enriched
meta-population highly divergent from the reference. The results obtained show
that e-RGA is able to produce an enhanced assembly: for all five genomes, both
the number of reads aligned and the N50 contig size are improved in e-A with
respect to dn-A and to s-A (see Figure 1 for details). The core of e-RGA is written
in Perl and is easily extensible to complex genomes.
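
The N50 contig size used above as a quality measure can be computed as follows.
The sketch uses the usual definition (the length of the contig at which the
running sum of lengths, taken from the longest contig downwards, first reaches
half of the total assembled length); the contig lengths in the example are made
up and are not results from this work.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 for an empty collection)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


example_lengths = [80, 70, 50, 40, 30, 20, 10]  # toy values, total length 300
print(n50(example_lengths))
```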

Contact e-mail

francesco.vezzi@dimi.uniud.it


Image




Molecular Evolution and Comparative Genomics

Genes involved in vitamin D pathway and schizophrenia show
signature of latitude-dependent adaptation

Amato R(1,2,3), Pinelli M(1,4), Monticelli A(5), Miele G(1,2,3), Cocozza S(1,4)*

Amato R et al.

Motivation

Genetic differences are present in humans at both the individual and the
population level. Human genetic variations are studied for their evolutionary
relevance and for their potential medical applications. In particular, high levels
of population differentiation suggest the action of positive selection on
advantageous alleles in one or more populations. These studies can help
scientists understand ancient human population migrations as well as how
selective forces act on human beings. Moreover, loci resulting from an adaptation
to particular environmental factors can be of great interest when studying
complex diseases. Adaptations to spatially varying selective pressures may be
particularly important in human populations. Since the human species first arose
in equatorial Africa, many latitude-related phenomena likely became important
selective forces as humans spread out of equatorial Africa to regions at higher
latitudes. This scenario is likely to apply to a range of selective pressures,
e.g. exposure to UV radiation, climate, diet, day-night cycle, etc.

Methods

To explore, at the genome-wide level, possible genetic adaptations to latitude, in
this work we defined a set of Latitude-Related Genes (LRGs) following a two-step
approach. We used genotype data at about 600000 SNP loci of 938 unrelated
individuals from 51 populations of the Human Genome Diversity Panel. First, we
identified a set of SNPs with high levels of population differentiation. From this set
we then extracted those SNPs showing a high correlation of allelic frequencies with
geographical latitude. We finally obtained a set of 2239 SNPs corresponding to
1336 unique genes. We characterized this set of genes by exploring both their
tissue localization and their functional characterization. We then investigated their



* (1) Gruppo Interdipartimentale di Bioinformatica e Biologia Computazionale, Università di Napoli Federico II - Università di Salerno, Italy.
(2) Dipartimento di Scienze Fisiche, Università di Napoli Federico II, Napoli, Italy.
(3) INFN Sezione di Napoli, Napoli, Italy.
(4) Dipartimento di Biologia e Patologia Cellulare e Molecolare L. Califano, Napoli, Italy.
(5) Istituto di Endocrinologia ed Oncologia Sperimentale, CNR Napoli, Napoli, Italy


enrichment in other sets of genes associated with physiological and pathological
phenotypes.
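
The two-step selection described above can be sketched as follows. The abstract
does not name the differentiation statistic, the correlation measure or the
cut-offs, so the choices made here (Wright's F_ST computed from the variance of
the population allele frequencies, the Pearson correlation, and the fst_min and
r_min thresholds) are assumptions made purely for illustration, as are the toy
frequencies and latitudes.

```python
from math import sqrt


def fst(freqs):
    """Wright's F_ST for a biallelic SNP from per-population allele frequencies."""
    mean = sum(freqs) / len(freqs)
    if mean in (0.0, 1.0):
        return 0.0  # monomorphic across populations: no differentiation
    variance = sum((p - mean) ** 2 for p in freqs) / len(freqs)
    return variance / (mean * (1.0 - mean))


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def latitude_related_snps(freq_by_snp, latitudes, fst_min, r_min):
    """Step 1: keep highly differentiated SNPs. Step 2: keep those tracking latitude."""
    differentiated = {s: f for s, f in freq_by_snp.items() if fst(f) >= fst_min}
    return sorted(s for s, f in differentiated.items()
                  if abs(pearson(f, latitudes)) >= r_min)


latitudes = [0.0, 20.0, 40.0, 60.0]        # toy population latitudes
freq_by_snp = {
    "snp_a": [0.1, 0.3, 0.6, 0.9],         # differentiated and latitude-correlated
    "snp_b": [0.5, 0.5, 0.5, 0.5],         # no differentiation
    "snp_c": [0.9, 0.1, 0.9, 0.1],         # differentiated, not latitude-correlated
}
selected = latitude_related_snps(freq_by_snp, latitudes, fst_min=0.1, r_min=0.8)
print(selected)
```

Mapping the selected SNPs to genes would then require an annotation table, which
is outside the scope of this sketch.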

Results

Both functional characterization and expression localization of LRGs revealed a
strong enrichment in neural-related processes. Among the four diseases tested
(namely Parkinson's disease, Alzheimer's disease, multiple sclerosis and
schizophrenia), we found a significant enrichment of LRGs in genes related to
schizophrenia (Fisher's exact test, Bonferroni-adjusted p-value 1.6E-5).
We thus looked for possible latitude-dependent biological mechanisms
linking latitude to neural development. A very important factor hypothesized to be
latitude dependent and known to be related to neural development is vitamin D.
We found a significant enrichment of LRGs in the set of vitamin D related genes
(p-value 3.5E-8, Fisher's exact test). This result suggests, for the first time at a
genetic level, an adaptation of vitamin D related genes driven by geographical
latitude. We also found a significant overlap between schizophrenia and
vitamin D related genes (p-value 1.4E-6, Fisher's exact test), confirming the role
of vitamin D in schizophrenia pathogenesis. Our results provide the first evidence,
at a molecular level, of a previously hypothesized relationship among these
phenomena. This result can also be useful for the identification of new candidate
genes for this and other related pathologies.
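
The enrichment tests reported above can be illustrated with a one-sided Fisher's
exact test computed from the hypergeometric distribution, followed by a
Bonferroni correction. The abstract does not state whether one-sided or two-sided
tests were used, nor the size of the background gene set, so the one-sided form
and all the counts below are toy assumptions and do not reproduce the reported
p-values.

```python
from math import comb


def fisher_enrichment_p(overlap, set_a, set_b, universe):
    """One-sided Fisher's exact test: P(overlap >= observed) under random sampling.

    overlap: genes shared by the two sets; set_a, set_b: sizes of the two sets;
    universe: size of the background gene set.
    """
    total = comb(universe, set_b)
    upper = min(set_a, set_b)
    tail = sum(comb(set_a, k) * comb(universe - set_a, set_b - k)
               for k in range(overlap, upper + 1))
    return tail / total


def bonferroni(p_value, n_tests):
    """Bonferroni adjustment, capped at 1."""
    return min(1.0, p_value * n_tests)


# Toy numbers: 20 background genes, 5 disease genes, 5 LRGs, 3 shared.
p = fisher_enrichment_p(overlap=3, set_a=5, set_b=5, universe=20)
p_adj = bonferroni(p, n_tests=4)  # four diseases were tested in the text
print(p, p_adj)
```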

Contact e-mail

roamato@na.infn.it

Image



A Bioinformatic Workflow for Grapevine Viral Diseases Analysis
with Reference to Grapevine Leafroll Complex

Balech B(1), Creanza TM(1), Di Tota F(1), Scioscia G(1), Leo P(1)*

Balech B et al.

Motivation

Grapevines (Vitis spp.) are affected by many viral diseases. The most harmful and
widespread ones are fanleaf degeneration, Leafroll complex, rugose wood, and
fleck. Leafroll disease occurs in all major grape-growing areas worldwide and is
one of the most destructive viral diseases of grapevines. Grapevine
Leafroll-associated Viruses (GLRaVs) are a complex of viruses in the genus
Ampelovirus, family Closteroviridae, of which GLRaV-3 is the predominant species
worldwide. At least nine serologically distinct viruses are associated with
Leafroll disease. This disease impacts both vine health and grape quality, and
yield losses may reach as much as 40%. The International Committee on Taxonomy
of Viruses describes GLRaV-3 in its database, the Universal Virus Database
(ICTVdb), as the type species of the Leafroll complex, and provides morphological
descriptions and general properties of this virus complex, as well as records of
genomic and protein sequences located in GenBank (NCBI). The present study
illustrates a bioinformatic workflow for studying one of the most represented
genes of the GLRaVs complex, namely the Heat shock Protein (HSP70), and a
preliminary comparative genomics analysis of the available complete genome
sequences. The aim of these analyses was mainly the identification of tip
association traits between an HSP70-based phylogenetic inference and categorical
characters, namely isolate geographical origin and host-virus adaptation.

Methods

A bioinformatic workflow has been adopted to conduct this study. The first
analyses concerned the investigation of ICTVdb files to outline the main
properties of Leafroll complex viruses and the genes that are informative for their
taxonomy. The links to the NCBI taxonomy database were exploited to retrieve
complete and partial genomic and gene nucleotide sequences. A nucleotide dataset
corresponding to HSP70 gene sequences (complete and/or partial) was constructed
from all Leafroll