Supplementary Material 2 Biclustering




Background. The canonical data model for clustering assumes a set of observations, each of which is described by a set of features or variables. The clustering data is typically arranged in a matrix where rows represent observations and columns represent features or variables. In classical clustering the goal is to group similar observations into clusters, where similarity is defined or computed over the full set of variables. An alternative is to identify clusters consisting of observations that are similar based only on a subset of the original variables. Biclustering attempts to identify clusters of the latter type. Additionally, in biclustering the conceptual distinction between observations and variables need not be made, and because of that biclustering is often described as the simultaneous clustering of both observations and variables.


Current applications of clustering are characterized by massive, high-dimensional data sets (with a very large number of variables), which in turn pose several challenges that cannot be addressed using traditional clustering techniques. The main difficulties arise from the presence of noisy variables, the possibility that different variables are relevant to different clusters, and the fact that the concept of distance or similarity (e.g., Euclidean distance) becomes meaningless as the dimension increases (1). Clustering of drugs and adverse events (AEs), the subject of our paper, falls into this group of applications, since typical spontaneous reporting systems (SRS) contain thousands of drugs and thousands of AEs, creating a clustering space of several thousand dimensions. In addition, this clustering space is very sparse (most cells in the data matrix contain zero), since drugs are typically reported or associated with far fewer AEs than the full set of AEs found in the SRS. Studies have shown (2) that the density (proportion of non-zero cells) of such drug-AE data matrices is around 30%, and much less if the terminology used is left unprocessed. As mentioned in the main paper, of the 308,520,126 (28,341 drugs × 10,886 AEs) cells of the initial data clustering matrix used in this study, over 99% contained zero.
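As a quick check of the figures quoted above, the following minimal Python sketch computes the number of cells in the initial matrix and the density (proportion of non-zero cells) of a small matrix. The function name `density` and the toy matrix are illustrative assumptions and are not taken from the study.

```python
def density(matrix):
    """Return the proportion of non-zero cells in a 2-D list of numbers."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for cell in row if cell != 0)
    return nonzero / total if total else 0.0


# Dimensions of the initial clustering matrix as reported in the text.
n_drugs, n_aes = 28_341, 10_886
n_cells = n_drugs * n_aes  # total number of drug-AE cells

# Hypothetical toy matrix (3 drugs x 4 AEs), for illustration only.
toy = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
toy_density = density(toy)  # 2 non-zero cells out of 12
```

The product of the two reported dimensions reproduces the cell count given in the text.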



Biclustering is a relatively new clustering technique designed mainly to address these challenges. It originated in bioinformatics, where it is used to identify groups of genes that may be functionally related under certain environmental conditions (e.g., cancer samples) based on expression levels measured in a microarray experiment (1, 3). The microarray data is arranged in a matrix where rows correspond to genes, columns to environmental conditions, and each cell represents the expression level of a specific gene under a specific condition. Biclustering is then applied to identify sub-matrices representing genes that are similarly expressed under a subset of conditions. Biclustering has also been used in the domain of text mining to identify groups of documents that can be defined or described by a similar group of words, i.e., topics, where a text corpus is represented by a data matrix whose rows represent documents, whose columns represent words, and whose cells contain the co-occurrence frequency of a particular word in a particular document (4). Similarly, biclustering can be used to cluster drugs and AEs in SRS, where drugs are analogous to genes or documents, AEs to environmental conditions or words, and drug-AE association strength to gene expression levels or document-word co-occurrence frequencies. A bicluster of drugs and AEs is then interpreted as a set of drugs that are related to each other by their common AE associations.


Many biclustering algorithms have been proposed in recent years. Although they all share the general bicluster concept, each formulates a different objective, problem setting, and data model (1, 3). The general objective of biclustering is to identify sub-matrices (biclusters) of the data matrix such that each bicluster satisfies a specific homogeneity criterion, which generally varies from approach to approach. The most common homogeneity criteria, or so-called bicluster models, are (3): biclusters with constant values, biclusters with constant values on rows or columns, and biclusters with coherent or correlated values. A relatively new biclustering model, which has gained popularity in recent years primarily due to its application in finding transcription factor modules responsible for gene regulation (5), assumes, as in this work, a binary data model. The binary model assumes two possible states: in gene expression analysis, a gene is either “on” (expressed) or “off” (not expressed) with respect to a control or condition, and in our case a drug is either associated with an adverse effect or not.
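To make the bicluster models listed above concrete, the following Python sketch checks a candidate sub-matrix against each homogeneity criterion. It is only an illustration: the function names are hypothetical, and "coherent values" is interpreted here as the additive model (each cell equals a constant plus a row effect plus a column effect), which is one of several coherence models and is an assumption made for this example.

```python
def is_constant(sub):
    """Constant-value bicluster: every cell holds the same value."""
    first = sub[0][0]
    return all(cell == first for row in sub for cell in row)


def has_constant_rows(sub):
    """Constant values on rows: each row is constant, rows may differ."""
    return all(all(cell == row[0] for cell in row) for row in sub)


def has_constant_columns(sub):
    """Constant values on columns: check the transposed sub-matrix."""
    return has_constant_rows([list(col) for col in zip(*sub)])


def is_additive_coherent(sub, tol=1e-9):
    """Additive coherence (assumed model): a_ij = mu + alpha_i + beta_j.

    Equivalent to a_ij - a_i0 - a_0j + a_00 == 0 for every cell.
    """
    base = sub[0][0]
    return all(
        abs(sub[i][j] - sub[i][0] - sub[0][j] + base) <= tol
        for i in range(len(sub))
        for j in range(len(sub[0]))
    )


# Hypothetical sub-matrices, for illustration only.
binary_bicluster = [[1, 1], [1, 1]]        # all cells equal 1
row_constant = [[1, 1, 1], [2, 2, 2]]      # constant within each row
shifted_rows = [[1, 2, 4], [3, 4, 6]]      # second row = first row + 2
```

Under these definitions a bicluster in the binary model is simply a constant-value bicluster whose value is 1.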


Disproportionality analysis in preparation for biclustering. Prior to clustering, the association strength value must be computed for each drug-AE pair in the data. Then, based on this value, biclustering will group drugs that are strongly associated with a common set of AEs.


Disproportionality analysis is the approach used by most current computerized pharmacovigilance signal detection methods to quantify the association strength of a drug-AE combination and highlight strong associations as potential adverse drug events (ADEs). Among the set of disproportionality measures is the relative reporting ratio (RR) (6), which is defined as the ratio of the observed incidence rate of a drug-AE association to its “baseline” expected rate under the assumption that the drug and AE occur independently. Based on the entries of Table 1 in this document, the expected number of reports containing a certain drug and AE combination under the assumption of independence is equal to t × (n/t) × (m/t) = nm/t, where n/t and m/t are estimates of the probabilities of the drug and the AE, respectively. Given that the observed number of such reports is equal to a, RR is defined as




$$RR = \frac{a}{n m / t}$$
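The definition of RR can be sketched in a few lines of Python using the cell labels of the contingency table (a, b, c, d). The function name and the counts in the example are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_reporting_ratio(a, b, c, d):
    """Compute RR = a / (n * m / t) from the four contingency-table cells.

    a: reports with the target drug and the target AE
    b: reports with the target drug and any other AE
    c: reports with any other drug and the target AE
    d: reports with any other drug and any other AE
    """
    n = a + b          # reports mentioning the target drug
    m = a + c          # reports mentioning the target AE
    t = a + b + c + d  # all reports
    expected = n * m / t  # expected count under independence
    if expected == 0:
        raise ValueError("expected count is zero; RR is undefined")
    return a / expected


# Hypothetical counts: n = 100, m = 100, t = 1000, so 10 reports are expected.
rr = relative_reporting_ratio(a=30, b=70, c=70, d=830)
```

With these made-up counts, 30 reports are observed against 10 expected, so the observed count is three times the expected one.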




A value of RR close to 1 indicates that there is no association between the drug and the AE. A value of 3, for example, indicates that there are 3 times as many drug-AE reports in the database as would be expected, and might support the hypothesis of an ADE association. The Gamma Poisson Shrinker (GPS) (2), which is used in this work, is a pharmacovigilance signal detection method endorsed and used by the FDA (7) to monitor safety signals in its SRS. GPS is based on a Bayesian approach that attempts to account for the uncertainty in RR associated with small observed and expected counts by “shrinking” RR towards the baseline case of no association (a value of 1) by an amount that is proportional to the variability of the RR statistic. The result of this shrinkage is a reduction of spurious associations when there is not enough data to support them. The general idea underlying GPS is to assume that the true value of RR is unknown, and to begin by making a prior assumption about the distribution of the RR values in the database. Then, based on a modeling framework called empirical Bayes, it uses the observed RR values for the entire set of associations in the data to estimate the parameters of this distribution. Having estimated these parameters, GPS then computes a measure called EBGM (empirical Bayes geometric mean), which is essentially an estimate of the posterior expected value of RR for a particular drug-AE pair.
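The shrinkage effect described above can be illustrated with a deliberately simplified Python sketch. It is not the GPS method itself: it assumes a single gamma prior with fixed, arbitrarily chosen hyperparameters (`alpha = beta = 1`, giving a prior mean of 1), whereas GPS as specified in (2) uses a mixture of two gamma distributions whose parameters are estimated from the data. Under the simplified assumption, the observed count N is Poisson with mean λE, the posterior of λ is Gamma(alpha + N, beta + E), and the geometric mean is exp(E[log λ]). All function names are hypothetical.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0.

    Uses the recurrence psi(x) = psi(x + 1) - 1/x to shift x upward,
    then a standard asymptotic series.
    """
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


def ebgm_single_gamma(observed, expected, alpha=1.0, beta=1.0):
    """Posterior geometric mean of lambda under ONE gamma prior (simplification).

    Posterior: lambda | N ~ Gamma(shape=alpha + N, rate=beta + E), so
    E[log lambda] = digamma(alpha + N) - log(beta + E).
    """
    return math.exp(digamma(alpha + observed)) / (beta + expected)


# Both hypothetical pairs have a raw RR of 3, but very different counts.
small_counts = ebgm_single_gamma(observed=3, expected=1)
large_counts = ebgm_single_gamma(observed=300, expected=100)
```

In this sketch the pair with small counts is pulled much further towards 1 than the pair with large counts, even though both have the same raw ratio, which is the qualitative behaviour the text attributes to GPS.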


In this work, the GPS method was implemented exactly as specified in the original paper, including stratification (step 3 in the main paper) and the recommended seeding parameters (2).


Table 1. Contingency table specifying the number of reports mentioning a specific drug and a specific adverse effect (AE).

                  Target AE    All other AEs    Total
Target drug       a            b                n = a + b
All other drugs   c            d                c + d
Total             m = a + c    b + d            t = a + b + c + d

Biclustering in AERS. Let $A_{m \times n}$ be our data matrix of m drugs and n AEs representing drug-AE associations in AERS, i.e., each cell $a_{ij}$ in this matrix contains the GPS EBGM association strength value computed for the pair formed by the i-th drug and the j-th AE. In order to obtain a binary data model, this matrix was then transformed into a binary data matrix $B_{m \times n}$, where each cell $b_{ij}$ contained either a 1 or a 0, representing the states of “strongly associated” or “weakly associated” respectively. The transformation was performed by selecting an “association strength threshold” T, which is used to qualify each association as either “strong” or “weak”. That is,

$$b_{ij} = \begin{cases} 1 & \text{if } a_{ij} \ge T \\ 0 & \text{if } a_{ij} < T \end{cases}$$

Given this matrix, a bicluster (D, E) corresponds to a sub-matrix of B in which all cells equal 1, i.e., a subset of drugs D that are jointly associated with a subset of events E, as in Figure 1 of this document.

The problem of biclustering can be posed in a graph-theoretic setting of searching for bicliques (complete bipartite graphs), as depicted in Figure 1 of the main paper, where one set of nodes in the graph corresponds to drugs, the other set of nodes to AEs, and where an edge connects nodes between the two disjoint sets and represents a strong statistical association (denoted by 1 in the data matrix) between a drug and an AE. The bicluster is then interpreted as a set of drugs which are each associated with the same set of AEs, or alternatively a group of drugs that potentially all cause the same set of AEs. Binary inclusion-maximal biclustering (Bimax) (8), which assumes the binary data model and is used in this work, uses a divide-and-conquer approach to identify biclusters (bicliques). The general idea behind Bimax is to decompose the binary data matrix into three sub-matrices, one of which contains only 0's and can therefore be discarded. The algorithm is then applied recursively to the two remaining sub-matrices, and the recursion ends when one of the sub-matrices represents a bicluster (contains only 1's). As mentioned in the main paper, in contrast to most clustering algorithms, Bimax is an exact algorithm able to find all of the biclusters that exist in the data.
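The thresholding step and the notion of an inclusion-maximal bicluster can be sketched in Python as follows. This is not the Bimax divide-and-conquer procedure: for clarity it enumerates biclusters exhaustively over subsets of drugs, which is exponential in the number of rows and only usable on toy inputs. The function names, the EBGM values, the threshold of 2.0, and the use of "greater than or equal to" at the threshold are assumptions made for illustration.

```python
from itertools import combinations


def binarize(a, threshold):
    """Map association strengths to 1 (strong) or 0 (weak)."""
    return [[1 if cell >= threshold else 0 for cell in row] for row in a]


def is_bicluster(b, drugs, events):
    """True if every (drug, event) cell in the selected sub-matrix is 1."""
    return all(b[i][j] == 1 for i in drugs for j in events)


def maximal_biclusters(b):
    """Exhaustively enumerate inclusion-maximal all-1 sub-matrices.

    For each non-empty subset of rows, take the columns they all share,
    then extend the row set to every row containing those columns.
    """
    n_rows = len(b)
    n_cols = len(b[0]) if b else 0
    row_sets = [frozenset(j for j in range(n_cols) if row[j]) for row in b]
    found = set()
    for size in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), size):
            events = frozenset.intersection(*(row_sets[i] for i in rows))
            if not events:
                continue  # no shared AE, so no bicluster for this subset
            drugs = frozenset(i for i in range(n_rows) if events <= row_sets[i])
            found.add((drugs, events))
    return found


# Hypothetical EBGM values for 4 drugs (rows) and 3 AEs (columns).
ebgm = [
    [3.1, 2.5, 0.4],
    [2.2, 4.0, 0.9],
    [0.3, 2.8, 5.0],
    [0.2, 0.1, 0.5],
]
b = binarize(ebgm, threshold=2.0)
biclusters = maximal_biclusters(b)
```

On this toy matrix the first two drugs share the first two AEs, so the pair ({0, 1}, {0, 1}) is among the biclusters returned. Note that a single drug with its full AE set also counts as a bicluster here; no minimum size is imposed in this sketch.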



Fig. 1. Drug-AE biclusters in the binary data model. Left, the initial binary data matrix before clustering. Right, two biclusters identified (sub-matrices where all cells equal 1). The top bicluster is ({d1,d3,d6}, {e1,e5,e2,e4}) and the bottom bicluster is ({d2,d5}, {e2,e4,e3}).


Reference List

(1) Kriegel H, Kröger P, Zimek A. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Trans Knowl Discov Data 2009;3(1):1-58.

(2) DuMouchel W. Bayesian data mining in large frequency tables, with an application to the FDA Spontaneous Reporting System. Am Stat 1999;53(3):177-90.

(3) Madeira SC, Oliveira AL. Biclustering Algorithms for Biological Data Analysis: A Survey. IEEE/ACM Trans Comput Biol Bioinformatics 2004;1(1):24-45.

(4) Long B, Zhang Z, Yu PS. Co-clustering by block value decomposition. Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining 2005;635-40.

(5) van Uitert M, Meuleman W, Wessels L. Biclustering sparse binary genomic data. J Comput Biol 2008;15(10):1329-45.

(6) Hauben M, Madigan D, Gerrits CM, Walsh L, van Puijenbroek EP. The role of data mining in pharmacovigilance. Expert Opin Drug Saf 2005 September;4(5):929-48.

(7) Szarfman A, Machado SG, O'Neill RT. Use of screening algorithms and computer systems to efficiently signal higher-than-expected combinations of drugs and events in the US FDA's spontaneous reports database. Drug Saf 2002;25(6):381-92.

(8) Prelic A, Bleuler S, Zimmermann P et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006 May 1;22(9):1122-9.