Supplementary Information for Neurodevelopmental and neuropsychiatric disorders represent an interconnected molecular

taxidermistplateSoftware and s/w Development

Nov 7, 2013 (3 years and 9 months ago)

87 views

1


Supplementary
Information for


Neuro
developmental

and
neuropsychiatric

disorders represent an interconnected molecular
system


Alexandre S
.

Cristino
1*
, Sarah
M.
Williams
1
,
Ziari
h Hawi
1,2
, Joon
-
Y
ong

An
1
, Mark
A.
Bellgrove
1
,2
,
Charles
E.
Schwartz
3
, Luciano d
a F
.

Costa
4
, Charles Claudianos
1*



1
-

Queensland Brain Institute,
The
University of Queensland, Brisbane, QLD, Australia

2
-

School of Psychology and Psychiatry, Monash University, Melbourne, Victoria, Australia

3
-

Greenwood Genetic Center
, Greenwood, South Carolin
a, USA


4
-

Instituto de Física de São Carlos, Universidade de São Paulo, São Paulo, Brazil

* Correspondence to C. Claudianos (
c.claudianos@uq.edu.au
) and A.S. Cristino
(
a.c
ristino@uq.edu.au
)

Tel: +61 7 33466319 Fax: +61 7 33466301



Supplemental Methods and Materials


Genes and databases used to source primary candidate disorder genes

The primary candidate genes associated with four
disorder
s, Autism Spectrum Disorder (ASD
),
X
-
Linked Intellectual Disability (XLID)
,
Attention Deficit Hyperactivity Disorder (ADHD)

and
schizophrenia (SZ), were retrieved from many different studies (Supplementary Table S1). We used
these candidate genes to create the ASD (361 genes),
XLID (98 g
enes)
,

ADHD (158 genes)

and

SZ

(218 genes)

databases. A total of 700 non
-
redundant genes are identified in this study as primary
candidate

genes and all subsequent analyses focus on the functional relationship between these
disorder genes with other genes
and/or DNA sequences found in the human genome. We
specifically examined three functional relationships of 700 primary
candidate

genes using: (1)
protein
-
protein interactions, (2) transcriptional regulation via analysis of cis
-
DNA elements in 5’
intergenic

regions and (3) miRNA regulation via analysis of target sequences identified in the 3’
untranslated regions.

Construction and analysis of
neurodevelopmental

and neuropsychiatric

disorder

PPI

networks

Graph theory and complex network concepts are increas
ingly being used to represent and analyse
many different biological systems, and measurements and formal description of the methods have
been previously described
(
Costa, et al., 2007
)
. We first created a comprehensive database of
2


protein
-
protein interactions (PPI) by joining two principal protein interaction databases, BioGrid
(
Stark, et al., 2011
)

and HPRD
(
Kandasamy, et al., 2009
)

(Supplementary Table S2). The disorder
PPI networks were the
n constructed by retrieving proteins that had empirical support for directly
interacting with encoded proteins of 700 primary candidate genes. Only proteins with formal
accession numbers, names and symbols recognised by the National Centre for Biotechnolog
y
Information (NCBI) and HUGO Gene Nomenclature Committee (HGNC)
(
Seal, et al., 2011
)

were
used in our analyses.
We

built the
AXAS
-
PPI

network
(acronym for
A
SD,
X
LID,
A
DHD,
S
Z)

by
connecting
all
first
-
degree
interacting protein
neighbours

of

the primary
candidate
disorde
r
genes
using an integrated BioGrid
-
HPRD database.

The analysis of

the

network properties and structure
was carried out using the
igraph

module (igraph.sourceforge.net) with all necessary scripts written
in Python programming language (
www.python.org
). The modular structure of the networks was
estimated using a ‘greedy optimization algorithm’ to assess possible divisions in the networks.
Network visualization and functional analysis was performed using Cytoscape
(
http://cytoscape.org/
)
(
Shannon,
et al., 2003
)
.
ClueGO (Bindea, et al., 2009)
was used
to find Gene
Ontology (GO) (http://geneontology.org/) categories statistically overrepresented in the PPI
network
s.



Construction and analysis of cis
-
regulatory networks

We identified cis
-
DNA regulat
ory elements using computational analyses designed to discover and
identify overrepresented DNA patterns that characterise transcription factor (TF) binding sites
(
MacIsaac and Fraenkel, 2006
)
. Two approaches were implemented: (1) identifying enriched TF
binding sites as

described in the TRANSFAC
(
Wingender, et al., 2000
)

and JASPAR
(
Vlieghe, et
al., 2006
)

database, and (2) discovering DNA motifs based on pattern recognition strategies that
have been modified to analyse eukaryotic genomes
(
Barchuk, et al., 2007
;
Cristino, et al., 2006
)
.
Python programming language was used to integrate the various scoring algorithms with the TAMO
module for
d
e novo

discovery of DNA motifs
(
Gordon, et al., 2005
)
. The 6kb 5’
-
UCR regions were
extracted from the human genome sequence based on predicted start methionine codon of all genes
described in the Ensembl database (GRCh37) (www.ensembl.org). Three different DNA pattern
discovery programs were used: AlignAce
(
Roth, et al., 1998
)
, MDscan
(
Liu, et al., 2002
)

and
MEME
(
Bailey and Elkan, 1994
)
. Default parameters were used for the discovery programs; GC
-
content background for the human intergenic regions was appropriately set for each program.
Enrichment score
(
Gordon, et al., 2005
)

and ROC
-
AUC plots (area under the curve for a receiver
operator characteristic plot)
(
Clarke and Granek, 2003
)

were used to evaluate DNA motif
overrepresentation. Two specific score metric
s, MAP score for AlignAce and MDscan, and E
-
value
for MEME, were used as a first filter for selecting the most significant motifs (MAP > 5 and E
-
3


value < 0.01). A second filter was used to decrease the amount of spurious hits among the
discovered DNA motifs

(Church < 0.01, ROC
-
AUC > 0.5 and Enrichment < 0.01).

MicroRNA target predictions were performed using the miRANDA algorithm
(
John, et al., 2004
)
.
We considered
all mature sequences described in miRBase (release 16)
(
Kozomara and Griffiths
-
Jones, 2011
)

and 3’UTRs sequences of gene transcripts described in the Ensembl database
(GRCh37). An additional 20
bp upstream of the 3’UTR start was included, and only predicted sites
that fall within the 3’UTR were used. A network of miRNAs and their respective target genes was
created by recording the presence of at least one predicted target site with free energy o
f <=
-
20. An
enrichment score based on hypergeometric statistics was used to calculate which miRNAs are
overrepresented in
all

four

disorder
s

PPI

(
ASD
-
PPI
, XLID
-
PPI
, ADHD
-
PPI

and SZ
-
PPI
)

networks

and the 30 structural modules (or communities) found in the

AXAS
-
PPI network. A hypergeometric
probability
<

0.01 and >

1.5
-
fold increased incidence was used as a threshold for enrichment. Only
those miRNAs enriched in at least one of the 34 gene lists were used for further analysis.

Enrichment analysis of cis
-
reg
ulatory motifs in SNP data

Single nucleotide polymorphisms (SNPs) associated with ASD and schizophrenia (95%
significance level or
P
-
value
<

0.05) were retrieved from the literature. We used data from eight
comprehensive (two autism and six schizophrenia)

GWAS studies. The SNP sequences were
retrieved from NCBI dbSNP (build 132) and trimmed down to 100nt upstream and downstream of
the associated substitution. We examined the occurrence of GWAS SNP sequences with the location
of regulatory elements (52 TF b
inding and 683 miRNAs target sites) that were found to be
overrepresented in the
AXAS
-
PPI network using the same 5’ UCR and 3’ UTR analyses described
above. Associations were scored in comparison to 50,000 randomly selected SNPs from NCBI
dbSNP.

Cross
-
v
alidation of
the
AXAS
-
PPI network

To validate our
AXAS
-
PPI network we used
genetic screening

data from six studies
:
ASD

(
Iossifov,
et al., 2012
;
Neale, et al., 2012
;
O'Roak, et al., 2012
)
,
XLID
(
Isrie, et al., 2012
)
,
ADHD
(
Will
iams,
et al., 2012
)

and

SZ
(
Need, et al., 2012
)
.
We selected

lists of candidate genes

based on

causal
variants

(frameshift
, splice variants, non
-
synonymous SNPs
) with significant level of association
(
P
-
value

<

0.05) (Supplementary Table S12
).
We used a binomial test to evaluate if the genes
identified in all six datasets were overrepresented in any of the four disorder PPI networks.
As the
total number of varian
ts in both datasets and the PPI networks are greater than 10 we can use
normal approximation and binomial distribution to test the null
-
hypothesis


the proportion of gene
variants is not different from what is expected by chance. We used a
one
-
tailed Z
-
te
st
(equation 1)
to test this hypothesis at the
P
-
value

<

0.05 with critical
Z

score

of
1.
6
5

(Supplementary Table S13)
.

4


pq
N
E
O
Z
)
1
(




(equation 1)

Where:

O = observed number of gene variants in the
AXAS
-
PPI network

E = expected number
of gene variants in the
AXAS
-
PPI network

N = total number of gene variants

p = expected frequency of gene variants

q = 1
-
p


Statistical analyses

We tested the statistical significance of the connections between primary
candidate

genes for
protein
-
protein i
nteraction network, 5’cis
-
regulatory network and miRNA
-
target network. We used
the same method for the three types of networks by creating networks from 100 genes sampled from
the primary candidates and from the whole genome (control). For the PPI networks

we measured
three basic properties: (1) average degree, (2) average path length and (3) density of the network
(
Assenov, et al., 2008
)
; and checked whether the distribution of those structural properti
es were the
same in the
AXAS
-
PPI

networks and in the random networks. We used the non
-
parametric
Kolmogorov
-
Smirnov test as implemented in R (
www.r
-
project.org
) to test the hypothesis of
random connections among
ca
ndidate genes.

The distribution of the primary genes in each structural module of the
AXAS
-
PPI network was
evaluated by a chi
-
squared test assuming
equal distribution of primary candidate genes in each
AXAS
-
PPI module.


To test the significance of the e
nrichment analysis of the miRNAs and DNA regulatory motifs in the
PPI networks, we sampled 100 genes with 500 iterations from each one of the 34 gene sets, and also
from the whole genome as control sets. As the 5’cis
-
regulatory and miRNA
-
target networks ar
e
bipartite networks with two different types of nodes we only used the distribution of the average
degree for the statistical analysis. The distribution of average degree of disorder networks was
compared to random networks using Kolmogorov
-
Smirnov tests.


We checked the suitability of miRNA target prediction thresholds (hypergeometric probability
<

0.01 and at least 1.5

-

fold overrepresented) by estimating 100 enrichment calculations on 100 gene
subsets from each of th
e

disorder

genes and random genes. T
he distributions of number of miRNAs
enriched are then compared by Kolmogorov
-
Smirnov test. The overrepresented miRNAs in the
gene sets were combined into a single non
-
redundant enriched miRNA list.

The Pearson’s correlation analysis as implemented in R w
as used to test the significance of
association between the occurrence of DNA motifs and miRNA target sites in the 5’ and 3’
intergenic regions.

5


Supplementary Figures



Figure S1
. Flow diagram of the computational approach used to create and analyse
AXAS
-
PPI

network.



6


Figure S
2
.

Comparison

of structural properties
observed in

PPI networks
. We use the

number of
nodes (proteins), number of edges (interactions), average number of neighbours (degree), the
normalised average
number of neighbours (density) to

compare

all
disorder
s PPI

networks (in blue)
with random
PPI
networks (in red).





7


Figure S
3
.

Comparison of

structural properties
observed in

transcription networks. We used the
number of nodes (TF binding sites and target genes), the number of edges (
regulation of
transcription), the average number of neighbours (degree), the normalised average number of
neighbours (density) to compare
disorder

networks (in blue) with random networks (in red).





8


Figure S
4
.

Comparison of

structural properties
observ
ed in
miRNA networks
. We used the number
of nodes (miRNAs and target genes), the number of edges (regulation of transcription), the average
number of neighbours (degree), the normalised average number of neighbours (density) to compare
the
disorder

network
s (in blue) with random networks (in red).



9



Figure S
5
.

Correlation between regulatory elements
found
in the 5’
-
UCR and 3’
-
UTR

of candidate
genes
. Pearson’s correlation R

= 0.64 (t = 4.67; df = 32;
P
-
value < 0.01). Two modules, M4 and
M11, have an excess
ive number of miRNA target sites compared to number of the 5’ DNA motifs.
The correlation between miRNA target sites and 5’ DNA motifs is even higher when these two
outliers are removed (R = 0.88; t

=

10.33; df

=

30;
P
-
value < 0.01).





10


Figure S
6
.

(A)
F
requency of SNPs in the functional modules of the
AXAS
-
PPI

network
, high
abundance of SNP hits

indicated by *
. (B and C)
Distribution of SNPs
in intergenic
regions of
NTRK3

and
NRXN1
.






















11


Supplementary References


Ass
enov, Y.
, et al.

(2008) Computing topological parameters of biological networks.,
Bioinformatics
;

24
:

282
-
284.

Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in
biopolymers.,
Proc Int Conf Intell Sy
st Mol Biol
;

2
:

28
-
36.

Barchuk, A.R.
, et al.

(2007) Molecular determinants of caste differentiation in the highly eusocial honeybee
Apis mellifera.,
BMC Dev Biol
;

7
:

70.

Clarke, N.D. and Granek, J.A. (2003) Rank order metrics for quantifying the associatio
n of sequence features
with gene regulation.,
Bioinformatics
;

19
:

212
-
218.

Costa, L.d.F.
, et al.

(2007) Characterization of complex networks: A survey of measurements.,
Advances in
Physics
;

56
:

167
-
242.

Cristino, A.S.
, et al.

(2006) Caste development and r
eproduction: a genome
-
wide analysis of hallmarks of
insect eusociality.,
Insect Mol Biol
;

15
:

703
-
714.

Gordon, D.B.
, et al.

(2005) TAMO: a flexible, object
-
oriented framework for analyzing transcriptional
regulation using DNA
-
sequence motifs.,
Bioinformati
cs
;

21
:

3164
-
3165.

Iossifov, I.
, et al.

(2012) De novo gene disruptions in children on the autistic spectrum,
Neuron
;

74
:

285
-
299.

Isrie, M.
, et al.

(2012) Sporadic male patients with intellectual disability: contribution of X
-
chromosome
copy number varian
ts,
Eur J Med Genet
;

55
:

577
-
585.

John, B.
, et al.

(2004) Human MicroRNA targets,
PLoS Biol
;

2
:

e363.

Kandasamy, K.
, et al.

(2009) Human Proteinpedia: a unified discovery resource for proteomics research,
Nucleic Acids Res
;

37
:

D773
-
781.

Kozomara, A. and G
riffiths
-
Jones, S. (2011) miRBase: integrating microRNA annotation and deep
-
sequencing data,
Nucleic Acids Res
;

39
:

D152
-
157.

Liu, X.S., Brutlag, D.L. and Liu, J.S. (2002) An algorithm for finding protein
-
DNA binding sites with
applications to chromatin
-
im
munoprecipitation microarray experiments.,
Nat Biotechnol
;

20
:

835
-
839.

MacIsaac, K.D. and Fraenkel, E. (2006) Practical strategies for discovering regulatory DNA sequence
motifs.,
PLoS Comput Biol
;

2
:

e36.

Neale, B.M.
, et al.

(2012) Patterns and rates of
exonic de novo mutations in autism spectrum disorders,
Nature
;

485
:

242
-
245.

Need, A.C.
, et al.

(2012) Exome sequencing followed by large
-
scale genotyping suggests a limited role for
moderately rare risk factors of strong effect in schizophrenia,
Am J Hum
Genet
;

91
:

303
-
312.

O'Roak, B.J.
, et al.

(2012) Sporadic autism exomes reveal a highly interconnected protein network of de
novo mutations,
Nature
;

485
:

246
-
250.

Roth, F.P.
, et al.

(1998) Finding DNA regulatory motifs within unaligned noncoding sequences c
lustered by
whole
-
genome mRNA quantitation.,
Nat Biotechnol
;

16
:

939
-
945.

Seal, R.L.
, et al.

(2011) genenames.org: the HGNC resources in 2011,
Nucleic Acids Res
;

39
:

D514
-
519.

Shannon, P.
, et al.

(2003) Cytoscape: a software environment for integrated mode
ls of biomolecular
interaction networks.,
Genome Res
;

13
:

2498
-
2504.

Stark, C.
, et al.

(2011) The BioGRID Interaction Database: 2011 update,
Nucleic Acids Res
;

39
:

D698
-
704.

Vlieghe, D.
, et al.

(2006) A new generation of JASPAR, the open
-
access repository
for transcription factor
binding site profiles.,
Nucleic Acids Res
;

34
:

D95
-
97.

Williams, N.M.
, et al.

(2012) Genome
-
wide analysis of copy number variants in attention deficit
hyperactivity disorder: the role of rare variants and duplications at 15q13.3,
A
m J Psychiatry
;

169
:

195
-
204.

Wingender, E.
, et al.

(2000) TRANSFAC: an integrated system for gene expression regulation.,
Nucleic
Acids Res.
;

1
:

316
-
319.