Expert Systems with Applications




Application of statistics and machine learning for risk stratification
of heritable cardiac arrhythmias

Saw Swee Hock School of Public Health, National University of Singapore, MD3, 16 Medical Drive, Singapore 117546, Singapore
Department of Statistics and Applied Probability, National University of Singapore, Blk S16, Level 7, 6 Science Drive 2, Singapore 117546, Singapore
Defence Medical & Environmental Research Institute, DSO National Laboratories, 27 Medical Drive, Singapore 117510, Singapore
Article info
Keywords:
Risk stratification
Long QT
Machine learning
Abstract
In the clinical management of heritable cardiac arrhythmias (HCAs), risk stratification is of prime
importance. The ability to predict the likelihood of individuals within a sub-population contracting a
pathology potentially resulting in sudden death gives subjects the opportunity to put preventive
measures in place, and make the necessary lifestyle adjustments to increase their chances of survival.
In this paper, we review classical methods that have commonly been used in clinical studies for risk
stratification in HCA, such as odds ratios, hazard ratios, Chi-squared tests, and logistic regression,
discussing their benefits and shortcomings. We then explore less common and more recent statistical and
machine learning methods adopted by other biological studies and assess their applicability in the
study of HCA. These methods, such as decision trees, neural networks, support vector machines and
Bayesian classifiers, typically support the multivariate analysis of risk factors. They have been
adopted for feature selection of predictor variables in risk stratification studies and, in some
cases, prove better than classical methods.
© 2012 Elsevier Ltd. All rights reserved.
Sudden cardiac death, as its name implies, is an unexpected fatality that occurs within an hour of
the onset of acute symptoms. While this can be brought about by a variety of pathological causes
such as coronary artery disease, cardiomyopathies, congestive heart failure, conduction disorders,
valvular disorders and myocarditis, in this paper we focus on non-pathological heritable cardiac
arrhythmias (HCAs), in particular the two most common: Brugada and Long QT syndromes. These HCAs,
unlike the predominant ischaemic cardiac conditions, account for a relatively small number of
sudden cardiac death victims. Due to the lack of structural cardiac damage, these
electrophysiological conditions pose a greater challenge in their identification, particularly
within the apparently normal population. It is for this reason that diagnostic screening should be
done to stratify the population of normal individuals, as well as those presenting symptoms,
according to their risk profiles. To date, methods that have been used in the diagnosis of HCA
include electrocardiogram monitoring, with and without invasive electrophysiology studies, surface
mapping, magnetoelectrocardiography, genetic testing, and testing for suggestive clinical triggers
such as unheralded syncope, while taking personal and family history into consideration. These data
come primarily from patients who have experienced a cardiac event and from their family members.
While the quality of the data collected is of prime importance, it is equally imperative that the
methods selected for risk stratification suit the types of data being analyzed.
This paper reviews the common methods used to stratify risk within populations that are susceptible
to HCAs, and assesses the suitability of these models. Alternative methods that have been used
successfully in other domains are then proposed, and their benefits relative to the former methods
discussed.
To identify suitable publications for inclusion in this review, the following repositories were
used:
i. Web of Science
v. Google Scholar
Publications were searched for using the phrases “risk stratification and Long QT syndrome”, “risk
stratification and Brugada”, “risk stratification and sudden cardiac death”, “risk stratification
and machine learning”, “statistical risk stratification”, “machine learning and Brugada” and
“machine learning and Long QT”. Results were then tagged as genetic, clinical, or
electrocardiographic for inclusion in the relevant portions of the review.

Corresponding author at: Defence Medical & Environmental Research Institute, DSO National
Laboratories, 27 Medical Drive, Singapore 117510, Singapore. Tel.: +65 6485 7238; fax: +65 6485 7262.
E-mail (P.H. Yap).
Expert Systems with Applications 40 (2013) 2476–2486
3. Risk stratification
In order to decide on the best set of analytical methods to be used, we first need to understand
the primary purpose of risk stratification. Risk stratification involves splitting a population of
subjects based on their likelihood of a particular clinical outcome. This requires accurate
identification of the risk factors common to the members within each stratum, and proof of the
association of these factors with the clinical endpoint. This ultimately facilitates the profiling
of a new subject, classification into the appropriate risk category based on the extent to which
these risk factors are expressed, and the recommendation of appropriate intervention to circumvent
or mitigate the effects of the clinical endpoint.
Both classical and new models follow a very similar workflow, from the acquisition of data through
to the finalization of the model used. Fig. 1 illustrates the stages involved in any risk
stratification.
Experimental results are first extracted and pre-processed into an appropriate format amenable to
analysis. Exploratory analyses are then carried out to summarize the extracted data and to look for
patterns within all variables measured, via both numerical and graphical analysis. Input from this
exploratory analysis, along with information from the available literature, is then used to select
which variables are informative enough to retain and which are removed. This feature selection step
requires the use of univariate and multivariate statistical methods and tests, which determine the
features that will be included in the model. The model is then trained using a subset of the data,
or using a permutative test methodology, with a pre-defined set of predictive statistical
methodologies and machine learning techniques. The results from these tests are assessed for
correctness through measures derived from the true positives, true negatives, false positives and
false negatives of the results, through validation methods that include positive and negative
predictive values, and with the use of Receiver Operating Characteristic curves (Zweig & Campbell,
1993). The model is then iteratively fine-tuned by removing or adding features and validating the
model again. The optimum model is then finalized and used to predict risk for new subjects based on
their features.
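
The validation measures named in this workflow can be sketched in a few lines. The example below is only an illustration, not the procedure of any study reviewed here: the counts and risk scores are invented, and the names `ConfusionCounts`, `validation_metrics` and `roc_auc` are hypothetical. The area under the ROC curve is computed through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case).

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # high-risk subjects correctly flagged as high risk
    tn: int  # low-risk subjects correctly flagged as low risk
    fp: int  # low-risk subjects wrongly flagged as high risk
    fn: int  # high-risk subjects wrongly flagged as low risk


def validation_metrics(c):
    """Derive the usual validation measures from the four confusion counts."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),  # positive predictive value
        "npv": c.tn / (c.tn + c.fn),  # negative predictive value
    }


def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Invented counts and model scores, for illustration only.
counts = ConfusionCounts(tp=40, tn=45, fp=5, fn=10)
metrics = validation_metrics(counts)
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(metrics, auc)
```

The pairwise AUC is quadratic in the number of subjects, which is acceptable for the small cohorts typical of rare diseases; a sort-based implementation would be preferable for large samples.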
Within this workflow, the types of tests and methods used at each step for effective risk
stratification depend heavily on the type of data collected from the subjects of the study. The
various types of data collected include electrocardiogram (ECG) data, genetic data, clinical
measurements, and historical data obtained through questionnaires. Many genetic studies have been
carried out to better predict the risk of HCA, both on their own and combined with clinical
variables for better diagnostic and prognostic predictability.
The type of data collected from these studies varies in accordance with the design of the studies.
Prospective cohort studies are commonly carried out by recruiting suitable subjects into national
registries and monitoring them over time (Chinushi et al., 2007; Delise et al., 2011; Gehi, Duong,
Metz, Gomes, & Mehta, 2006; Giustetto et al., 2008; Goldenberg et al., 2008, 2011; Hobbs et al.,
2006; Letsas et al., 2011; Priori, 2002; Probst et al., 2010; Raju et al., 2011; Takagi et al.,
2007; Tatsumi, Takagi, Nakagawa, Yamashita, & Yoshiyama, 2006). In such studies, time-to-event
measurements predominate, whereas retrospective and cross-sectional studies rely more on historical
data and on prior occurrence of HCA.
This combination of continuous, categorical and time-to-event data has the potential to facilitate
the identification of risk predictors of HCA, provided the appropriate data analysis methods are
used. Current methods should be re-assessed and compared with alternative and more recent methods
via a quantitative measurement, and the best methods selected to increase the diagnostic and
prognostic predictability of HCA.
3.1. Risk stratification using electrocardiographic and clinical data
Electrocardiograms, graphs of the electrical signals that control the rhythm of the heart, are the
prime tools used to detect arrhythmias. The electrocardiogram reveals information on a galvanometer
showing electrical signals detected by electrodes placed on the surface of the subject’s skin. From
these electrocardiograms, such as that illustrated in Fig. 2, one can obtain information about a
cardiac condition through the amplitude of certain waves, the length of intervals, and the shape of
the
Fig. 1. Workflow of analysis steps in risk stratification.

Parts of the electrocardiogram interpreted by trained medical professionals include the P wave,
which reflects atrial depolarization, the QRS complex, which shows ventricular depolarization, and
the T wave, which shows ventricular repolarization (Milhorn, 2005). Heritable cardiac arrhythmias
such as Long QT and Short QT syndromes (along with their variants, the Romano–Ward and Jervell and
Lange-Nielsen syndromes) can be identified by observing the QT interval, while Brugada syndrome is
detectable through the elevation of the ST segment of the resting ECG, or after a pharmacological
challenge. The Type 1 ECG pattern is diagnostic of the Brugada syndrome and presents as a coved
ST-segment elevation ≥2 mm followed by a negative T wave in leads V1–V3 (Benito, Brugada, Brugada,
& Brugada, 2008). The Long QT syndrome (LQTS) is characterized on the ECG by a prolongation of the
QT interval, which reflects a lengthened ventricular repolarization. The threshold for prolongation
varies between children, adult males and adult females, corresponding to a Bazett-corrected QTc of
more than 460, 450 and 470 ms respectively (Brugada & Campuzano, 2009). Not all HCAs can be
detected through ECG analysis alone. Catecholaminergic Polymorphic Ventricular Tachycardia (CPVT)
is one that cannot be detected through a resting ECG and, as such, is assessed during exercise
along with other clinical factors (Veltmann, Schimpf, Borggrefe, & Wolpert, 2009). Once the
electrocardiograms have been interpreted, they can be reduced to a set of categorical and
continuous variables that are more amenable to statistical analysis. While manual interpretation
remains the predominant method of reading electrocardiograms and detecting arrhythmias, automated
methods have also been developed and tested (Green et al., 2012; Koeppel, Labarre, & Zitoun, 2012;
Martis et al., 2012). This paper focuses on analyzing ECG data that has been manually interpreted
by experts, for reasons of increased accuracy, and not on data derived from automated waveform
detection methods, which may be prone to errors.
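
As a small worked example of how an interpreted ECG becomes a categorical variable, the sketch below applies Bazett's correction (QTc = QT / sqrt(RR), with RR in seconds) and the three QTc limits quoted above. It is a minimal illustration under stated assumptions: the group labels, the function names `bazett_qtc_ms` and `is_prolonged`, and the sample measurements are all hypothetical and not taken from any cited study.

```python
import math

# QTc limits in ms as quoted in the text for children, adult males and adult females.
# The dictionary keys are hypothetical labels chosen for this sketch.
QTC_LIMITS_MS = {"child": 460.0, "adult_male": 450.0, "adult_female": 470.0}


def bazett_qtc_ms(qt_ms, rr_s):
    """Bazett-corrected QT interval: QT divided by the square root of the RR interval (s)."""
    return qt_ms / math.sqrt(rr_s)


def is_prolonged(qt_ms, rr_s, group):
    """Binary feature: True when the corrected QT exceeds the limit for the group."""
    return bazett_qtc_ms(qt_ms, rr_s) > QTC_LIMITS_MS[group]


# Invented measurement: QT of 400 ms at an RR interval of 0.64 s (about 94 beats/min).
qtc = bazett_qtc_ms(400.0, 0.64)
flag_fast = is_prolonged(400.0, 0.64, "adult_male")
flag_rest = is_prolonged(400.0, 1.0, "adult_female")
print(round(qtc, 1), flag_fast, flag_rest)
```

The same raw QT of 400 ms is flagged at the faster heart rate but not at an RR interval of 1 s, which is why the corrected rather than the raw interval is used as the predictor.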
In the stratification of risk in HCA, classical statistical methods suited to continuous,
categorical, time-to-event and survival data are to date the most commonly adopted. These
statistical methods have been selected in accordance with the study design, which is predominantly
either a retrospective cross-sectional or a prospective cohort design, and in some cases a
combination (Takagi et al., 2007). In cross-sectional studies (Babaee Bigi, Aslani, & Aslani, 2008;
Chen et al., 2004; Kotta, Anastasakis, Gatzoulis, Manolis, & Stefanadis, 2010), categorical
variables between normal controls and probands are compared using the Chi-squared test, Fisher’s
exact test, and logistic regression, while paired/unpaired t-tests, Mann–Whitney U tests, ANOVA
(followed by Fisher’s PLSD post hoc test), Kruskal–Wallis tests and linear regression are used for
continuous data that meet statistical prerequisites. In multi-observer environments, intra- and
inter-observer variability is assessed using the Bland and Altman method (Doi et al., 2009; Takagi
et al., 2007). Prospective cohort studies generally utilize Kaplan–Meier survival curves, log-rank
tests and Cox regression models to compare two groups over the course of the study (Ajiro,
Hagiwara, & Kasanuki, 2005; Benito, Sarkozy, Mont, Henkens, & Berruezo, 2008; Brugada, 2003; Doi
et al., 2009; Giustetto et al., 2008; Ikeda et al., 2005; Nakano et al., 2010; Probst et al.,
2010). Aside from the selection of the right analytical method, the appropriate use of these
univariate and multivariate methods is also crucial to the discovery of the right features that are
responsible for
Univariate comparisons of continuous data from normally distributed populations commonly use the
Student’s t-test when comparing two groups (Delise et al., 2011; Garcia-Alvarez et al., 2009;
Giustetto et al., 2008; Goldenberg et al., 2008, 2011) and ANOVA when comparing three or more
(Delise et al., 2011; Jouven, Desnos, Guerot, & Ducimetière, 1999; Jouven, Zureik, Desnos, Guerot,
& Ducimetière, 2001; Takagi et al., 2007), while categorical variables are compared using the
Pearson chi-square test (Delise et al., 2011; Giustetto et al., 2008; Goldenberg et al., 2008,
2011; Jouven et al., 1999, 2001; Probst et al., 2010; Tatsumi et al., 2006) and Fisher’s exact test
(Garcia-Alvarez et al., 2009; Giustetto et al., 2008; Probst et al., 2010; Takagi et al., 2007).
Odds ratios (Straus et al., 2004) and relative risk (Jouven et al., 1999; Nunn, Bhar-Amato, &
Lambiase, 2010) were also used to assess the categorical variables.
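
To make the two categorical effect measures concrete, the sketch below computes an odds ratio and a relative risk from a 2×2 table, together with a 95% confidence interval for the odds ratio based on the standard log (Woolf) approximation. The counts are invented and the function names are hypothetical; it is an illustration, not a reconstruction of any cited analysis.

```python
import math


def odds_ratio(a, b, c, d):
    """a: exposed with event, b: exposed without, c: unexposed with, d: unexposed without."""
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Ratio of the event risk in the exposed group to that in the unexposed group."""
    return (a / (a + b)) / (c / (c + d))


def odds_ratio_ci95(a, b, c, d):
    """Approximate 95% interval from the standard error of the log odds ratio."""
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se)


# Invented table: 20 of 100 subjects with a risk marker had an event, versus 10 of 100 without.
a, b, c, d = 20, 80, 10, 90
or_value = odds_ratio(a, b, c, d)
rr_value = relative_risk(a, b, c, d)
ci_low, ci_high = odds_ratio_ci95(a, b, c, d)
print(or_value, rr_value, (round(ci_low, 2), round(ci_high, 2)))
```

With cell counts this small the interval is wide and includes 1, which illustrates the power problem that recurs in studies of rare arrhythmias. The approximation breaks down when any cell is zero or very small, where an exact method such as Fisher’s exact test is the usual choice.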
For variables in which normality prerequisites are not met, the non-parametric equivalents are
adopted, in particular the Mann–Whitney U test (Garcia-Alvarez et al., 2009; Goldenberg et al.,
2008, 2011; Probst et al., 2010; Tatsumi et al., 2006) and the Kruskal–Wallis test (Probst et al.,
2010). Other studies take the analysis a step further and conduct multivariate analyses to better
understand the combined effect of the predictor variables on the outcome. Multivariate Cox
regression and logistic regression were most often used in such cases to build the model and
determine the variables of prime importance, whilst controlling for the others (Ajiro et al., 2005;
Brugada, 2003; Delise et al., 2011; Goldenberg et al., 2011; Ikeda et al., 2005; Nakano et al.,
2010; Priori, 2002; Takagi et al.,
In studies that collected patient data over time, time-to-event analyses were carried out using the
Cox proportional hazards regression model (Delise et al., 2011; Elliott et al., 2000;
Garcia-Alvarez et al., 2009; Goldenberg et al., 2008, 2011; Hobbs et al., 2006; Ikeda et al., 2005;
Takagi et al., 2007), with risk being quantified through hazard ratios (Delise et al., 2011; Ikeda
et al., 2005; Straus et al., 2004; Takagi et al., 2007). Survival curves were also constructed with
such data, and the curves of the probands and the references compared using the log-rank test
(Delise et al., 2011; Garcia-Alvarez et al., 2009; Giustetto et al., 2008; Ikeda et al., 2005). For
the prediction of events in such cases, ROC curves are used with C-statistics to assess
discriminatory strength (Ajiro et al., 2005; Delise et al., 2011; Goldberger et al., 2011; Takagi
et al., 2007).
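
For readers less familiar with time-to-event data, the following is a minimal sketch of the Kaplan–Meier product-limit estimator that underlies the survival curves mentioned above. The follow-up times are invented, the function name `kaplan_meier` is hypothetical, and censoring is handled in the simplest way (a censored subject leaves the risk set without counting as an event).

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times  : follow-up time of each subject
    events : True if the event occurred at that time, False if the subject was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Invented follow-up (years) for six subjects; False marks a censored observation.
times = [2, 3, 3, 5, 8, 8]
events = [True, True, False, True, False, True]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Comparing two such curves (for example probands against references) is what the log-rank test formalizes; that comparison is not implemented here.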
While the above-mentioned methods have been used for risk stratification, many of them have
restrictive prerequisites and require the data to adhere to some linear model. This is apparent not
just in the analysis of clinical and ECG data but also in that of genetic data, which is discussed
in the following section.
3.2. Risk stratification using genetic data
While electrophysiological methods for identifying subjects at risk of HCA have been widely used,
incomplete penetrance and the manifestation of the disorders in the absence of these distinctive
ECG patterns have led to genetic screening to identify the presence of potential underlying
mutations. Mutations in SCN5A causing loss-of-function, in genes regulating SCN5A: GPD1-L, SCN1B
and SCN3B (London et al., 2007; Watanabe et al., 2009), in genes encoding the α and β subunits of
the cardiac calcium channel: CACNA1c and CACNAB2b (Napolitano & Antzelevitch, 2011), and in other
gain-of-function genes: KCNE3, KCNE5 and KCND3, have been linked to Brugada syndrome. Mutations in
many genes have also been linked to the Long QT syndrome, but KCNQ1, KCNH2 and SCN5A still account
for 80–90% of cases (Napolitano, Bloise, Monteforte, & Priori, 2012). This penetrance level has led
to the recommendation of mutation-specific genetic testing for family members following the
identification of an LQTS-causative mutation in index cases (Ackerman et al., 2011). Brugada
syndrome, due to the low yield, approximately 25% (Kapplinger et al., 2010), of SCN5A genetic
testing in robust clinical cases, requires additional supporting clinical evidence before genetic
testing for family members is recommended (Ackerman et al., 2011).
Risk stratification using genetic data alone becomes a bigger concern when diagnosis is dependent
on genotype confirmation in clinically asymptomatic cases. Genotype-confirmed patients with
normal-range QTc make up about 25% of the at-risk LQTS population (Goldenberg et al., 2011), cases
that would go undiagnosed using clinical methods alone. Genetic testing and its use in risk
stratification are additionally important as the manifestations of specific ECG patterns can also
be intermittent and escape
Fig. 2. The electrocardiogram generated by a heart during polarization and

Before advances in technology and increased knowledge from the sequencing of the human genome
(Lander et al., 2001; McPherson, Marra, Hillier, & Waterston, 2001; Venter et al., 2001) brought
about genome wide association studies (GWAS) (Hirschhorn, 2005), associations between genes and
phenotypes were made predominantly through the study of a relatively small list of candidate genes,
or through linkage studies that associated a genomic region with the phenotype at a family level.
In the identification of risk in genotype-confirmed patients without prolonged QTc intervals,
candidate genes selected for study included KCNQ1, KCNH2 and SCN5A (Goldenberg et al., 2011). The
statistical methods used were chi-square tests for categorical variables, and t-tests and
Mann–Whitney–Wilcoxon tests for continuous variables. The Kaplan–Meier estimator was also used in
this longitudinal study to assess the time to a first life-threatening event. The candidate gene
method remains popular due to its economy, especially when the number of patients is small, which
is often the case in the study of any rare disease. In a study of 15 Greek patients, novel
mutations in a single candidate gene, SCN5A, were analyzed and, due to the small number of
subjects, no statistical methods beyond frequencies were adopted (Kotta et al., 2010). Studies with
larger sample sizes have also used the candidate gene method when the hypothesis was targeted at
studying the effect of the location of the causative mutation on the clinical course of LQTS and
its interactions with clinical factors such as QTc and demographic factors such as gender (Priori,
Schwartz, Napolitano, & Bloise, 2003), or at identifying the combination of factors within Brugada
syndrome patients that detects subjects at risk of cardiac arrest (Priori, 2002). Candidate genes
KCNQ1, KCNE1, KCNH2, KCNE2, SCN5A, ANK2, KCNJ2, CAV3, CASQ2, and hRyR2 were also used to study LQTS
and Brugada syndrome, along with other causes of sudden arrhythmic death syndrome, with 20% of
Brugada and 38% of Long QT cases carrying mutations that were not found in the controls, while
novel mutations were identified in the SCN5A and hRyR2 genes (Behr et al., 2008). This method can,
however, also be used to test potential novel targets such as MOG1, which was identified through
its interaction with the cardiac sodium channel and its coexpression with SCN5A to regulate the
surface expression of the sodium channel (Kattygnarath et al., 2011), a path similar to that taken
in the identification of SCN1B for human arrhythmia susceptibility (Watanabe et al., 2009). Despite
the identification of mutations within these genes, the number of Brugada and LQTS cases that
bypass these mutational criteria presents the need to identify variability in novel
This brings into question the limitations of the candidate gene methodology for HCAs, rare diseases
that do not show as clear a Mendelian segregation through genetic variations as Huntington’s
disease (Gilliam, Tanzi, Haines, & Bonner, 1987) and cystic fibrosis (Kerem et al., 1989). In this
case of rare complex diseases, genome wide association studies (GWAS) may be conducted as an
alternative to the hypothesis-driven candidate gene approach. Several GWAS have been conducted to
identify variations in the NOS1AP gene and their association with LQTS risk (Arking et al., 2006;
Becker et al., 2009). These studies used staged GWAS designs, where results from an initial GWAS in
one sample are tested in a second independent sample and the combined statistical evidence
analyzed. Results from such studies later become part of other candidate gene analyses for the
purpose of validation in independent populations (Eijgelsheim, Aarnoudse, Rivadeneira, Kors, &
Witteman, 2009a; Eijgelsheim, Newton-Cheh, Aarnoudse, van Noord, & Witteman, 2009b). Independent
GWAS have also been carried out to confirm variants at NOS1AP along with nine other loci mapping to
KCNQ1, KCNH2, SCN5A, KCNJ2, ATP1B1, PLN and RNF207 (Pfeufer et al., 2009).
While both the candidate gene and GWAS methodologies use methods from classical statistics for
analysis, including ANOVA, t-tests and linear regression for continuous data and chi-square or
Fisher’s exact tests for categorical data, the large number of genetic variants and the large
sample sizes of GWAS require many statistical considerations in the selection of cases, controls
and genetic markers, and in sample size requirements, to ensure experiments are adequately powered
(Zondervan, 2007). Additionally, prior to any statistical testing for association between a genetic
variant and a phenotype, data quality control measures need to be put in place to avoid the
addition or removal of DNA samples and markers that may introduce bias. Such methods involve
checking the data at an individual level to identify individuals with discordant sex information,
outlying missing genotype or heterozygosity rates, duplicated information, or divergent ancestry,
and at a marker level to identify excessive missing genotypes, deviation from Hardy–Weinberg
equilibrium (HWE), different missing genotype rates between cases and controls, and to remove
markers with low minor allele frequency (Anderson et al., 2010).
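
One of the marker-level checks listed above, deviation from Hardy–Weinberg equilibrium, can be illustrated with a one-degree-of-freedom chi-square goodness-of-fit test on the three genotype counts of a biallelic marker. This is a minimal sketch rather than the protocol of the cited quality-control guidelines: the genotype counts are invented, the function name is hypothetical, and published pipelines often prefer an exact test, particularly for rare alleles.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of observed genotype counts against HWE expectations.

    Assumes a biallelic marker with both alleles present (otherwise an expected
    count is zero and the statistic is undefined).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Invented genotype counts among 100 controls for two hypothetical markers.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)     # matches HWE expectations exactly
chi2_bad, p_bad = hwe_chi_square(40, 20, 40)   # strong heterozygote deficit
print(chi2_ok, p_ok, chi2_bad, p_bad)
```

A marker like the second one would typically be flagged for removal, since such a departure in controls often points to genotyping error rather than biology.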
Several limitations stand out from the studies reviewed above. Due to the rarity of the diseases,
sample sizes tend to be small, leaving the studies underpowered. Outputs from genetic studies
generally have unique high-dimensional analytic requirements, as these studies involve more
variables than samples. While most of the studies use parametric methods for the association
analyses, few describe any tests conducted to ensure prerequisites were met. Interactions between
genes were also poorly explored, or not explored at all, in most of the studies, and correction for
multiple testing was rarely used, leading to potential false positives. Protocols have been
published with recommendations on maintaining data quality in case–control association studies
(Anderson et al., 2010; Balding, 2006), but the use of classical statistical methods makes such
high-dimensional studies difficult if not impossible to analyze at a multivariate level. The
challenge is further exacerbated when genetic data is combined with clinical and environmental
covariates to study their combined association with the phenotypic endpoints, bringing about a need
for alternative analytic methods.
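
The false-positive concern raised above can be made concrete with the simplest multiple-testing correction, the Bonferroni adjustment, in which the significance level is divided by the number of tests. The sketch below is illustrative only: the p-values are invented, the function name is hypothetical, and Bonferroni is named here as one common choice, not as the correction used or recommended by the studies reviewed.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests that remain significant after dividing alpha by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Invented p-values from five hypothetical single-marker association tests.
p_values = [0.001, 0.012, 0.03, 0.2, 0.04]
naive = [p <= 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)
print(naive)
print(corrected)
```

Four of the five markers pass the uncorrected 0.05 level, but only one survives the correction. The adjustment is conservative, and the loss of power grows with the number of markers tested, which compounds the small-sample problem noted above.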
3.3. Risk stratification combining clinical and/or environmental
covariates and genetic data
Genetic data is rarely studied in isolation, usually being associated with a clinical phenotype. In
the case of HCAs, randomized and blinded studies do not exist, most available data being derived
from registries that have monitored patients over time (Ackerman et al., 2011). Genetic data has
been obtained from subjects who are pre-screened clinically and, in other cases, pre-screened
individuals are further stratified by another clinical or environmental variable to test its
association with the phenotypic endpoint. In the case of LQTS and Brugada, this involves subjects
meeting electrocardiographic, clinical history and family history requirements (Ackerman et al.,
2011). In LQTS, known genetic variants leave 15–20% of cases unexplained; in Brugada syndrome, over
65% (Kapplinger et al., 2010). These percentages do not include the potential HCA victims that do
not have clinical manifestations of the pathology.
This has brought about the need to combine data already at hand, to look for covariance,
interactions, and confounding, not just within the genetic and clinical/environmental subsets
separately, but also in their impact on each other. The power to detect genetic predictors may be
decreased if their effects differ according to clinical or environmental conditions and this
interaction is not accounted for (Cordell, 2009). Studies that have succeeded at this in HCA
include one that found an association between a variant of the NOS1AP gene and increased mortality
in calcium channel blocker users (Becker et al., 2009), and one in LQTS showing that the locus of
the causative mutation affects the clinical course and modulates the effects of QTc and gender on
clinical manifestations (Priori et al., 2003). Considering both the genetic and environmental
factors in a joint analysis, however, increases the dimensionality of the problem and introduces
the need to consider potentially large numbers of interacting factors. These multi-factorial
interactions pose challenges of their own within the limits of the classical statistics paradigm,
and thus new methods to deal with such high-dimensional interactions must be
3.4. Multivariate issues in risk stratification
While many methods have been developed for the association of risk with single factors, this may
not suffice, primarily because clinical and genetic susceptibility factors seldom act independently
on their associated pathologies. Interactions between these risk factors are therefore of great
importance and need to be addressed appropriately. From a statistical point of view, interaction
occurs when the impact a variable, genetic or clinical, has on a phenotypic endpoint changes based
on the state of one or more other variables. Interaction is absent when the combined impact of two
or more variables, or risk factors, adheres to a mathematical model, such as an additive or
multiplicative model, and does not depend on the particular combination of levels each variable
takes on (Clayton, 2009).
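
A brief numerical sketch may help separate the two reference models. For two binary risk factors with relative risks RR_A and RR_B (each measured against subjects with neither factor), the joint relative risk expected without interaction is RR_A + RR_B - 1 on the additive scale and RR_A × RR_B on the multiplicative scale. The relative risks below are invented and the function names hypothetical; the example only shows that the presence of "interaction" depends on which scale is taken as the baseline.

```python
def expected_joint_rr(rr_a, rr_b):
    """Joint relative risk expected under each no-interaction model."""
    return {
        "additive": rr_a + rr_b - 1.0,
        "multiplicative": rr_a * rr_b,
    }


def interaction_contrast(rr_a, rr_b, rr_ab):
    """Departure of the observed joint relative risk from each reference model."""
    expected = expected_joint_rr(rr_a, rr_b)
    return {
        "additive_excess": rr_ab - expected["additive"],                # 0 means additive fit
        "multiplicative_ratio": rr_ab / expected["multiplicative"],     # 1 means multiplicative fit
    }


# Invented relative risks for two hypothetical factors and their combination.
result = interaction_contrast(rr_a=2.0, rr_b=3.0, rr_ab=6.0)
print(result)
```

Here the joint effect fits the multiplicative model exactly yet exceeds the additive expectation by 2, so the same data show no interaction on one scale and positive interaction on the other.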
One situation in which we commonly see interaction is epistasis, which refers to a deviation from
additivity in the effect of two or more alleles at different loci with respect to their
contribution to a phenotypic endpoint (Cordell, 2002). While such interaction occurs in epistasis,
it is very common in other pathologies for risk factors to be additive or even multiplicative.
Siemiatycki and Thomas (1981) used a multistage model of carcinogenesis to show that some risk
factors fit an additive model, others a multiplicative model, and some neither. As such, the advice
is to use inductive and deductive reasoning to select an appropriate analytical method for the
The choice of measurement of interactions is seldom mentioned. Most of the studies looked at each
risk factor independently, and seldom in combination. Traditional statistical methods become
limited as the number of interactions grows: increasing the number of factors in an interaction
increases the combinatorial complexity and the dimensionality, thus requiring commensurately larger
sample sizes, which may make the study prohibitively expensive (McKinney, Reif, Ritchie, & Moore,
2006). Another challenge in using classical statistical methods for identifying interactions is the
need to fit the data to a model for the interaction, which becomes particularly difficult as the
number of factors to consider within the interaction increases (McKinney et al., 2006).
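
The combinatorial growth referred to here is easy to quantify: among n candidate factors there are C(n, k) possible k-way interactions. The short sketch below counts them for a few invented study sizes; the function name and the chosen numbers of factors are hypothetical.

```python
from math import comb


def interaction_terms(n_factors, max_order):
    """Number of distinct interaction terms of order 2 up to max_order."""
    return sum(comb(n_factors, k) for k in range(2, max_order + 1))


# Invented study sizes: a small clinical panel and a modest set of genetic markers.
pairs_10 = interaction_terms(10, 2)
upto3_10 = interaction_terms(10, 3)
pairs_100 = interaction_terms(100, 2)
upto3_100 = interaction_terms(100, 3)
print(pairs_10, upto3_10, pairs_100, upto3_100)
```

Even 100 factors already yield thousands of pairwise and over a hundred thousand three-way terms, far more than the number of subjects available in a typical HCA registry.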
4. Alternative methods in stratifying risk
An alternative way of looking at multivariate data, and of avoiding the risk of using a suboptimal
statistical method, is to turn to machine learning. This refers to the use of algorithms that allow
a system to learn from the data it is exposed to, based on either Bayesian or classical statistics,
or a combination of both. In the case of a supervised learning system, input variables along with
the known results are used to train the system. Once trained, the system is able to predict a
result from a set of novel input variables, based on an internal set of rules developed during
training.
4.1. Artificial neural networks (ANN)
Artificial neural networks (ANNs) are mathematical models based on the workings of the biological
nervous system. An ANN comprises a network of nodes and the edges connecting them, and is used to
model complex nonlinear relationships between independent and dependent variables. There are
various types of ANNs that have applications in supervised and unsupervised learning, but the focus
here is on the supervised Feed-Forward Back-Propagation Neural Network, which is a type of
multi-layer perceptron, and the unsupervised Kohonen Network. The supervised method consists of
three main layer types: an input layer, one or more hidden layers and an output layer, as
illustrated in Fig. 3. Each layer is made up of a set of nodes connected to other layers through
edges. Each node within the network has an activation function associated with it, through which a
combination of the inputs passes. Associated with each edge in turn is a weight, which is
iteratively adjusted during the training of the network. Training of an ANN involves modifying the
weights according to a learning algorithm, such as the back-propagation method (Rumelhart, Hinton,
& Williams, 1986), which uses gradient descent to adjust the weights, minimizing the error between
the predicted and the true output. ANNs have been used in many fields, including biological and
medical research, for example to detect gene–gene interactions (McKinney et al., 2006), for
classification using microarray data (Pirooznia, Yang, Yang, & Deng, 2008) and in the prediction of
dementia (Maroco et al., 2011).
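
The layer structure and weight updates described above can be sketched with a very small multi-layer perceptron. The example is a toy under explicit assumptions: one hidden layer of three nodes, sigmoid activations, a squared-error loss, per-sample gradient descent, and the XOR problem as a stand-in for a nonlinear interaction between two binary risk factors. The class name `TinyMLP`, the learning rate and the number of epochs are arbitrary choices for illustration, not settings from any cited study.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, sigmoid activations, trained by back-propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each node stores its input weights followed by a bias term (last entry).
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def forward(self, x):
        h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
             for ws in self.w_hidden]
        y = sigmoid(sum(w * v for w, v in zip(self.w_out[:-1], h)) + self.w_out[-1])
        return h, y

    def train_step(self, x, target, lr):
        h, y = self.forward(x)
        # Gradient of 0.5 * (y - target)^2 with respect to the output pre-activation.
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas are computed with the output weights before they are updated.
        delta_hidden = [delta_out * self.w_out[j] * h[j] * (1.0 - h[j])
                        for j in range(len(h))]
        for j in range(len(h)):
            self.w_out[j] -= lr * delta_out * h[j]
        self.w_out[-1] -= lr * delta_out
        for j, ws in enumerate(self.w_hidden):
            for i in range(len(x)):
                ws[i] -= lr * delta_hidden[j] * x[i]
            ws[-1] -= lr * delta_hidden[j]

    def loss(self, data):
        """Mean squared error over a list of (inputs, target) pairs."""
        return sum((self.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# XOR: the outcome depends on the combination of the two inputs, not on either alone.
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
net = TinyMLP(n_in=2, n_hidden=3)
loss_before = net.loss(xor_data)
for _ in range(5000):
    for x, t in xor_data:
        net.train_step(x, t, lr=0.5)
loss_after = net.loss(xor_data)
print(loss_before, loss_after)
```

A single-layer model cannot represent XOR, which is why the hidden layer matters here; how closely the trained network fits depends on the random initialization, so no particular final error is claimed.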
Unlike its supervised counterparts,Kohonen Networks,also
known as Self-Organizing Maps (Kohonen,2007),are classifiers
that do not depend on tuples of input and desired output data.
They are classifiers that map multi-dimensional input vectors to
a set of clusters based on the inherent structure within the input
data.Input patterns close to each other in the input space should
thus be clustered together.These classifiers have successfully been
used in the analysis of ECGs,to detect arrhythmias (Leite et al.,
2010) and in the localization of the source of ventricular focal
arrhythmias,using intravenous catheter measurements as inputs
(Sunay & Cunediog
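The clustering behaviour of a Kohonen Network can be illustrated with a small sketch. This is a simplified one-dimensional map written under several assumptions that are not taken from the cited works: a Gaussian neighbourhood function, linearly decaying learning rate and radius, and input values scaled to the range [0, 1]. All function and parameter names are hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x (squared Euclidean)."""
    dists = [sum((w - v) ** 2 for w, v in zip(node, x)) for node in weights]
    return dists.index(min(dists))


def train_som(data, n_nodes=4, epochs=50, lr0=0.5, radius0=1.0, seed=0):
    """Train a 1-D self-organizing map; no target outputs are required."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # decays linearly from 1 towards 0
        lr = lr0 * frac
        radius = max(radius0 * frac, 1e-3)
        for x in data:
            bmu = best_matching_unit(weights, x)
            for j, node in enumerate(weights):
                # Nodes near the winner on the 1-D grid move more than distant ones.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for k in range(dim):
                    node[k] += lr * h * (x[k] - node[k])
    return weights
```

After training, `best_matching_unit(weights, x)` assigns an input vector to the node, and hence the cluster, that represents it best.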
4.2.Decision trees and random forests
A decision tree is another method used in machine learning for
automated decision making. Its representation is similar to that of
an inverted tree, starting at the top with a root node, branching
down through a series of internal nodes, and finally reaching the
decision point at the bottom-most leaf nodes, which represent a class
label or a class label distribution. The trees are built through a
process of recursive partitioning based on the values of certain
attributes. At each node, a particular attribute needs to be
selected to separate the samples at that node into two classes.
Goodness functions are used to build the tree, via methods such as
the Classification and Regression Tree (CART) (Breiman, Friedman,
Stone, & Olshen, 1984), Chi-squared automatic interaction detector
(CHAID), Quick, Unbiased, Efficient Statistical Tree (QUEST), C4.5,
and Iterative Dichotomiser 3 (ID3) (Ture & Tokatli, 2009).
Fig. 3. Structure of a multi-layer perceptron, with x as the inputs and y as the
2480 P.S.Wasan et al./Expert Systems with Applications 40 (2013) 2476–2486
Decision trees have been used in the resolution of many
biological and clinical challenges, such as class prediction using
microarray datasets (Pirooznia et al., 2008), genome wide
association studies (Szymczak et al., 2009), and the prediction of
symptoms of Parkinson’s Disease (Exarchos et al., 2012), to name
a few.
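To make the role of a goodness function in recursive partitioning concrete, the sketch below scores candidate binary splits of a single continuous attribute with the Gini impurity, the criterion commonly associated with CART. It is an illustrative fragment rather than a full tree builder, and the function names and the example values used with it are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one attribute.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated by a threshold
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

For instance, with six hypothetical interval measurements `[400, 410, 420, 470, 480, 500]` labelled `low, low, low, high, high, high`, `best_split` selects the threshold 445.0, which separates the two classes perfectly. A tree builder would apply the same search recursively to the samples on each side of the chosen threshold.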
In some cases, several decision trees are combined into an
ensemble of classifiers to form a random forest, which then
aggregates their results to reach a consensus decision; it has been
shown that such ensembles can outperform their individual base
classifiers (Dietterich, 2000). In principle, a heterogeneous mix of
learning algorithms, such as decision trees and support vector
machines, may be used to construct distinct base learners, but we
will focus on methods that train base learners of the same type on
slightly different data sets, namely bootstrap aggregation
(Breiman, 1996) and Random Forests (Breiman, 2001). These methods
have been used in a variety of studies, for example to predict
disease status using SNPs (Sun et al., 2007), to rank SNP predictors
(Schwarz, Szymczak, Ziegler, & König, 2007; Sun, Bielak, Turner,
Sheedy, & Peyser, 2008; Sun, Bielak, & Peyser, 2008), to identify the
epistatic effects related to human diseases (García-Magariños,
López-de-Ullibarri, Cao, & Salas, 2009), and to identify multi-SNP
interactions in data obtained through genome wide analyses
(Wan et al., 2009).
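The idea of training base learners of the same type on slightly different data sets can be sketched as follows. The example shows plain bootstrap aggregation of one-split trees (decision stumps) on a single attribute with majority voting; it deliberately omits the random feature selection that distinguishes Random Forests from bagging, and all names and defaults are assumptions made for illustration.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-split tree on (x, label) pairs with 0/1 labels.

    The stump predicts 1 when x exceeds the returned threshold.
    """
    xs = sorted({x for x, _ in sample})
    candidates = ([xs[0] - 1.0]
                  + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])

    def errors(t):
        return sum((x > t) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def bagged_stumps(data, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(data) for _ in data]
        stumps.append(train_stump(bootstrap))
    return stumps


def predict(stumps, x):
    """Aggregate the individual stump predictions by majority vote."""
    votes = Counter(int(x > t) for t in stumps)
    return votes.most_common(1)[0][0]
```

Each stump sees a different resample of the data, so the thresholds vary from tree to tree, and the vote in `predict` plays the part of the consensus decision described above.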
4.3.Support vector machines (SVMs)
SVMs (Vapnik, 1995) are a supervised machine learning method
that has commonly been used to solve classification problems
within and beyond the biomedical field. SVMs discriminate between
two classes of data in a high-dimensional space through the
formulation of a hyperplane. As SVMs are not restricted by any
requirement to adhere to a statistical model but are instead
data driven, they are especially useful when the data comprise a
small sample size relative to the number of variables.
In a given binary classification problem, the objective of the
SVM is to find the optimal separating hyperplane. A vector of
predictors is mapped to a higher-dimensional feature space through
linear or non-linear kernel functions, and the hyperplane that best
divides the two groups is sought in that space. SVMs have been used
extensively in the biological and clinical sciences and, through
their innate classification ability, have implicitly been used in
risk stratification. Such studies include predictions for patients
with neuroblastoma using expression profiles as predictors (Schramm
et al., 2005; Schulte et al., 2010), the early diagnosis and
recurrence of prostate cancer (Çınar, Engin, Engin, & Ziya Ateşçi,
2009) and, combined with genetic algorithms, the detection of
unstable angina (Sepulveda-Sanchis et al., 2002). Further
applications are discussed in a later section.
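A minimal sketch of how a separating hyperplane can be learned is given below. It assumes a linear kernel only and fits the hyperplane by full-batch sub-gradient descent on the regularised hinge loss, which is one of several ways to train an SVM and is not necessarily the solver used in the cited studies. The function names and the default values of `lam`, `lr` and `epochs` are hypothetical.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w and b so that sign(w.x + b) separates labels in {-1, +1}."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]      # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wk * xk for wk, xk in zip(w, x)) + b)
            if margin < 1.0:                 # inside the margin or misclassified
                for k in range(dim):
                    grad_w[k] -= y * x[k] / n
                grad_b -= y / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1
```

Only the points that violate the margin contribute to each update, which reflects the fact that the final hyperplane is determined by the samples closest to the class boundary. Non-linear kernels would require a different formulation from the one sketched here.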
4.4.Bayesian classifiers
As our main intention in risk stratification is to create
sub-groups within a cohort that have different risks of HCA based on
clinical and genetic factors, Bayesian classifiers can be adopted
alongside classical frequentist methods. Bayesian classifiers have
also been shown to be more flexible than classical ones, with no
loss in precision (Bigi, Gregori, Cortigiani, & Desideri, 2005).
These classifiers, adapted from conditional probability laws, are
based on the fundamental equation below:
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
where P(H|E) is the probability of the hypothesis H given the
evidence E, P(E|H) is the probability of the evidence E given the
hypothesis H, P(H) is the prior probability of the hypothesis H, and
P(E) is the probability of the evidence E regardless of the
hypothesis H (Han, Kamber, & Pei, 2011).
In practice, the ability of a Bayesian learning system to
integrate new evidence as it becomes available, and so to make
decisions on newly available data, has made its use attractive. In
fact, it has even been argued that the natural statistical framework
for evidence-based medicine is a Bayesian approach (Ashby & Smith,
2000). Table 1 below summarizes the advantages and disadvantages of
the various machine learning methods described above.
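The way in which Bayes' theorem integrates new evidence can be illustrated with a short sketch. Here P(E) is expanded with the law of total probability over H and its complement, and successive pieces of evidence are assumed to be conditionally independent given H, which is a simplifying assumption of this example. The function names and the probabilities quoted afterwards are hypothetical and are not drawn from any clinical data.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H|E) from Bayes' theorem, with P(E) expanded over H and not-H."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    return p_e_given_h * prior / evidence


def update_sequentially(prior, likelihood_pairs):
    """Fold in evidence one item at a time; each posterior is the next prior.

    Each pair is (P(E|H), P(E|not H)) for one piece of evidence.
    """
    belief = prior
    for p_e_given_h, p_e_given_not_h in likelihood_pairs:
        belief = posterior(belief, p_e_given_h, p_e_given_not_h)
    return belief
```

With a hypothetical prior of 0.01, P(E|H) = 0.9 and P(E|not H) = 0.05, a single observation raises the probability of H to roughly 0.15, and a second, independent observation of the same kind raises it further.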
5.Alternative methods in risk stratification using
electrocardiographic data
While ECG data is unique in its ability to describe cardiac
rhythms in detail, it is usually used in a form that has been
translated by one or more experts to represent the presence or
absence of an arrhythmia. This is done either by matching the
morphology of the ECG with pre-defined patterns that have been
associated with arrhythmias, or by calculating the length of
intervals, an example being the QT interval to look for signs of the
Long QT syndrome. The ECG is mostly used in its raw waveform in
studies focused on automating the translation process currently
performed by human experts. Methodologies adopted from signal
processing, such as Fourier transforms of the digitized ECG, have
been used to study the underlying mechanisms of ventricular
fibrillation (Latcu et al., 2011). Such methods are only able to
help extract features from the ECG; the extracted features still
need to be classified into one of many arrhythmias. The feature
extraction phase is set to extract one or more of the main
morphological features of the ECG, such as the ST segment duration,
R amplitude, PR, RR and QT intervals, and the detection of QRS
complexes, P and T waves. These features can then be used to
classify the arrhythmia using, for example, Fuzzy Clustered
Probabilistic and Multi Layered Feed Forward Neural Networks
(Haseena, Mathew, & Paul, 2009), or intelligent automation
algorithms (Green et al., 2012).
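As a simple illustration of interval-based features, the sketch below derives RR intervals and heart rate from the sample indices of detected R peaks, and corrects a QT interval for heart rate. The peak detection step itself is assumed to have been done elsewhere. Bazett's formula is used for the correction as a well-known example; it is not prescribed by the studies cited here, and the function names are hypothetical.

```python
import math


def rr_intervals(r_peak_samples, sampling_rate_hz):
    """Seconds between consecutive R peaks, given their sample indices."""
    return [(b - a) / sampling_rate_hz
            for a, b in zip(r_peak_samples, r_peak_samples[1:])]


def heart_rate_bpm(rr_seconds):
    """Mean heart rate in beats per minute from a list of RR intervals."""
    return 60.0 / (sum(rr_seconds) / len(rr_seconds))


def qtc_bazett(qt_seconds, rr_seconds):
    """Heart-rate-corrected QT interval using Bazett's formula, QT / sqrt(RR)."""
    return qt_seconds / math.sqrt(rr_seconds)
```

Features of this kind, once computed, are continuous values that can be passed to any of the classifiers described in Section 4.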
Syed and Guttag (2011) show how symbolic mismatch-based
algorithms can be used to classify patients according to their risk
of major cardiac events by analyzing their ECGs over long periods
and comparing them to those of a normal population. Sekkal, Chikh,
and Settouti (2011) developed a neural network classifier evolved by
a genetic algorithm to classify premature ventricular contraction
beats using ECGs sampled at 360 Hz, while Milpied, Dubois, Roussel,
Henry, and Dreyfus (2011) designed support vector machines to
classify arrhythmias. Zadeh, Khazaee, and Ranaee (2010) split their
analysis of ECG data into two discrete tasks: feature extraction,
using a discrete wavelet transform to extract the morphological
features of the ECG, and classification, using a multi-class support
vector machine. The following two sections go into further detail on
alternative methods used in risk stratification. While many of the
methods described below may seem to be machine learning methods,
note should be taken of the gradual merging of statistics and
machine learning with the recent popularity of Data Mining due to
the influx of Big
Data coming especially from new sequencing technologies (Metz-
5.1.Alternative methods of risk stratification using clinical data
Once information has been extracted from the ECG and interpreted
by qualified practitioners, the spectral data is effectively broken
down into a set of continuous and categorical data points. This also
reduces the data to the same form as clinical data, facilitating
their combination. While many, as shown above, have tackled this
challenge using classical statistical methods, newer machine
learning methods are proving to be a more accurate means of risk
stratification. The main machine learning methods used in the
literature are Bayesian models, classification trees, support vector
machines and neural networks, all of which were discussed in
Section 4.
5.2.Alternative methods of risk stratification using genetic data
Machine learning methods have several advantages, including
robustness to parametric assumptions, high power, accuracy, the
ability to model non-linear phenomena, the availability of many
well-developed and validated algorithms, and the ability to model
high-dimensional data (Cosgun, Limdi, & Duarte, 2011). The last of
these is especially relevant in GWAS, where genotypes of more than a
million SNPs are measured in a number of subjects that is one or two
orders of magnitude lower. Moreover, when a large number of SNPs are
genotyped in GWAS, linkage disequilibrium is likely, resulting in
high collinearity between SNPs. It is for these reasons that
classical multivariate statistical methods such as linear or
logistic regression are deemed unsuitable (Szymczak et al., 2009).
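The collinearity between SNPs mentioned above can be quantified in a simple way. The sketch below computes the squared Pearson correlation between two genotype vectors coded as 0, 1 or 2 copies of the minor allele, which is often used as a genotype-level approximation of linkage disequilibrium. The coding scheme, the function name and the example vectors are assumptions of this illustration rather than details from the cited studies.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two equally long numeric vectors.

    Both vectors must show some variation, otherwise the ratio is undefined.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov * cov / (var_a * var_b)
```

Values close to 1 indicate that two SNPs carry largely redundant information, which is the situation that destabilizes the coefficient estimates of classical regression models.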
Cosgun et al. (2011) proposed the use of random forest
regression, boosted regression trees and support vector regression
to predict the appropriate warfarin maintenance dose in a cohort of
African Americans, building predictive models from selected SNPs and
testing the models in a cross-validation framework. Jiang et al.
(2007) utilized the ensemble learning scheme used in random forests
to develop a rule voting method, and used it as the underlying
scoring mechanism in their multiple selection rule voting (MSRV)
system for the prioritization of candidate non-synonymous
single-nucleotide polymorphisms. Yue and Moult (2006), and
Calabrese, Capriotti, Fariselli, Martelli, and Casadio (2009)
developed methods based on support vector machines (SVMs) to
identify candidate non-synonymous SNPs that have a deleterious
effect on protein function, thus shedding light on disease
susceptibility. Krishnan and Westhead (2003), and Yue and Moult
(2006), in turn, combined SVMs and decision trees to differentiate
between neutral and disease-causing SNPs. Szymczak et al. (2009)
proposed a different set of methods to overcome the limitations of
classical methods, including penalized regression, which can be used
with correlated variables, and non-parametric approaches such as
neural networks and ensemble methods, which are not constrained by
adherence to any particular model. Another study, by Tuana et al.
(2011), used a large number of machine learning methods to classify
dendritic cell phenotypes from gene expression data: ZeroR, Nearest
Neighbor, C4.5, logistic regression, multi-layer perceptron, Naïve
Bayes, random forest, support vector machines and Tree Augmented
Naïve Bayes. Pirooznia et al. (2008) conducted another study to
compare machine learning methods for identifying significant
differentially expressed genes in eight publicly available datasets.
The methods the group compared comprised SVMs, Radial Basis Function
and Multi-Layer Perceptron Neural Nets, Bayesian methods, decision
trees and random forests, with a combination of SVM Recursive
Feature Elimination as a feature selection method and SVM as a
classifier giving the best
6.Resampling and diagnostic tests
While the main reasons given in this paper for including machine
learning methods relate to the flexibility they promise, their
ability to work with data that have a relatively small
sample-to-variable ratio and their freedom from model adherence,
resampling methods have also proven useful in dealing with similar
analytical issues. Resampling facilitates the reuse of data, which
can be especially necessary when the sample sizes collected are
small, and can be used with both statistical and machine learning
methods.
Bootstrapping is one such method of resampling. It entails
repeated sampling with replacement from the original observed data
set; a parameter is estimated on each resample, and the sample mean
and variance of these estimates are then computed. Jackknifing is a
related resampling method that adopts a leave-one-out methodology in
the selection of the data subsets. Cosgun et al. (2011) used a
bootstrap-based cross-validation method to improve the performance
and prevent over-fitting of their random forest regression for
predicting warfarin maintenance dose in a cohort of African
Americans, while Yue and Moult (2006) used it with SVMs to predict
deleterious human SNPs that cause monogenic diseases. Jackknifing
has also been used in the prediction of risk for life-threatening
cardiac events (Goldenberg et al., 2011) and aborted cardiac arrest
in the Long QT syndrome (Hobbs et al., 2006).
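The two resampling schemes can be contrasted in a short sketch. The bootstrap function draws resamples of the same size as the data with replacement, whereas the jackknife function recomputes the statistic with each observation left out in turn; the standard error uses the usual jackknife scaling factor (n − 1)/n. The function names and the default number of resamples are hypothetical choices for this illustration.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_estimates(data, statistic, n_resamples=1000, seed=0):
    """Statistic evaluated on resamples drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([rng.choice(data) for _ in range(n)])
            for _ in range(n_resamples)]


def jackknife_estimates(data, statistic):
    """Statistic evaluated on each leave-one-out subset of data."""
    return [statistic(data[:i] + data[i + 1:]) for i in range(len(data))]


def jackknife_standard_error(data, statistic):
    """Jackknife standard error of the statistic."""
    n = len(data)
    estimates = jackknife_estimates(data, statistic)
    centre = mean(estimates)
    return math.sqrt((n - 1) / n * sum((e - centre) ** 2 for e in estimates))
```

For the sample mean, the jackknife standard error coincides with the familiar s/√n, which makes the mean a convenient check. Any other statistic, such as a median or a regression coefficient, can be passed in its place.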
Table 1
Summarized advantages and disadvantages of various machine learning methods.
Method | Advantages | Disadvantages
Feed-forward neural networks | Non-linear adaptability, no assumptions required on probability density and distribution, certain configurations have been proven to be universal approximators | Loss of generality, over-fitting, prone to sub-optimal local minima, inability to extract features responsible for results, black box presents uncertainties for mission critical applications (Tufféry, 2011)
Kohonen networks | Ability to model non-linear relationships, facilitates identification of clusters, works with both qualitative and quantitative data | Sensitive to outliers, black box, large samples required, unable to extract
Decision trees | Self-explanatory, easy to interpret, works with categorical and continuous data, works with missing values, makes no distribution assumptions, | Dependent variable is restricted to categorical data, may not perform well in the presence of many complex interactions, overfitting may lead to instability (Rokach & Maimon, 2008)
Random forests | Works well with high dimensional small sample sizes, robust to noise, fast computation (Yang, Hwa Yang, Zhou, & Zomaya, 2010) | Difficult to interpret, prone to overfitting in certain datasets (Segal, 2004), do not handle large number of irrelevant features as well as other ensemble methods (Gashler, Giraud-Carrier, & Martinez, 2008)
Support vector machines | Can be used to classify complex biological non-linear data, not prone to local minima, works well with high dimensional small data sets, avoids overfitting, robust to noise | Black box, lack of transparency, restricted to pairwise classification, can’t be used directly for feature selection
Bayesian classifiers | Results are easy to interpret, works well with missing data, avoids overfitting of data, facilitates the learning of causal relationships, easily updatable with new information | Sufficient prior information must be provided for all unknown parameters

Cross-validation is yet another type of resampling, which uses
principles similar to jackknifing but for a different purpose. While
jackknifing is used to estimate a parameter, cross-validation is
primarily concerned with evaluating the predictive error of a
function used to map one or more independent variables to a
dependent variable, similar to the use of the R² statistic and
likelihood ratios in linear and logistic regression models
respectively (Akobeng, 2007a,b,c). Through cross-validation,
datasets are split into training sets, on which the prediction
functions are based, and test sets, which are then used to test the
validity of the method used. This provides the researcher with
metrics such as true positives, true negatives, false positives and
false negatives, from which the sensitivity and specificity of the
methods used can be calculated. A common tool for visualizing this
performance is the Receiver Operating Characteristic (ROC) curve,
which plots the true positive rate (sensitivity) against the false
positive rate (1 − specificity), as used by Bailey, Berson,
Handelsman, and Hodges (2001) to predict risk of major arrhythmic
events after myocardial infarction.
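The steps described above, splitting the data into training and test sets, counting the four outcome types and deriving the ROC curve, can be sketched as follows. The example assumes binary labels coded as 1 (event) and 0 (no event), contiguous folds without shuffling, and at least one sample of each class when computing the rates. The function names are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) from true and predicted 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(y_true, scores):
    """(false positive rate, true positive rate) at each distinct score threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        sens, spec = sensitivity_specificity(y_true, y_pred)
        points.append((1.0 - spec, sens))
    return points
```

In a typical use, a model is fitted on the training indices of each fold, its risk scores for the test indices are collected, and the pooled scores are passed to `roc_points` to trace the curve.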
Studies have been conducted using both statistical and machine
learning methods. Both evidently have their strengths and
limitations, and neither should replace the other. It is also worth
noting that statisticians have been responsible for the development
of many machine learning methods. It is through this review that we
come to the conclusion that the walls between the two fields should
be broken down, and that statistical and machine learning methods
should be used together for the analysis and interpretation of HCA
data. Table 2 below lists the various methods recommended for use at
the different stages, previously illustrated in Fig. 1.
This can be done by using traditional statistical and machine
learning methods concurrently, for different data subsets within the
experiment, or at separate stages in a workflow of tools, for
integrative and meta-analyses. Combining methods in this way has
been made easier in this era of data-driven science by the
availability of workflow tools that provide friendly graphical user
interfaces, such as Accelrys’ Pipeline Pilot, KNIME (Berthold et
al., 2008), Orange (Demšar, Zupan, Leban, & Curk, 2004), Taverna
(Oinn et al., 2004), Kepler (Altintas et al., 2004), Galaxy (Goecks,
Nekrutenko, & Taylor, 2010), and Weka (Holmes, Donkin, & Witten,
1994).
Another observation from the methods selected was that
researchers have their own analytical toolboxes that they use in
their studies. These generally comprise tried and tested methods
that have been used successfully in publications and are generally
accepted by the community of arrhythmia researchers. Most authors of
these papers are clinical and biological researchers, who are more
focused on the collection of high quality data, which is then
analyzed using their toolbox. To increase the motivation to explore
new methods of data analysis, collaborations with Biostatistics and
Bioinformatics professionals would prove beneficial.
The points mentioned above lead us to conclude that, irrespective
of historical success and knowledge limitations, effort needs to be
made to identify the best method to analyze the data at hand. Given
the lack of statistical knowledge within biological and clinical
laboratories, it is also important that bioinformaticians and
biostatisticians are involved at the design stage, prior to the
running of experiments, so that the experiments are optimized to
extract maximum information from the data collected.
References
Ackerman, M. J., Priori, S. G., Willems, S., Berul, C., Brugada, R., Calkins, H., et al. (2011). HRS/EHRA expert consensus statement on the state of genetic testing for the channelopathies and cardiomyopathies: This document was developed as a partnership between the heart rhythm society (HRS) and the European heart rhythm association (EHRA). Europace, 13(8), 1077–1109. http://
Ajiro, Y., Hagiwara, N., & Kasanuki, H. (2005). Assessment of markers for identifying patients at risk for life-threatening arrhythmic events in Brugada syndrome. Journal of Cardiovascular Electrophysiology, 16(1), 45–51.
Akobeng, A. K. (2007a). Understanding diagnostic tests. 1: Sensitivity, specificity and predictive values. Acta Paediatrica, 96(3), 338–341.
Akobeng, A. K. (2007b). Understanding diagnostic tests. 2: Likelihood ratios, pre- and post-test probabilities and their use in clinical practice. Acta Paediatrica,
Akobeng, A. K. (2007c). Understanding diagnostic tests. 3: Receiver operating characteristic curves. Acta Paediatrica, 96(5), 644–647.
Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., & Mock, S. (2004). In Presented at the proceedings of 16th international conference on scientific and statistical database management (pp. 423–424): IEEE.
Zondervan, K. T. (2010). Data quality control in genetic case–control association studies. Nature Protocols, 5(9), 1564–1573.
Arking, D. E., Pfeufer, A., Post, W., Kao, W. H. L., Newton-Cheh, C., Ikeda, M., et al. (2006). A common genetic variant in the NOS1 regulator NOS1AP modulates cardiac repolarization. Nature Genetics, 38(6), 644–651.
Ashby, D., & Smith, A. F. M. (2000). Evidence-based medicine as Bayesian decision-making. Statistics in Medicine, 19(23), 3291–3305.
Babaee Bigi, M. A., Aslani, A., & Aslani, A. (2008). Significance of cardiac autonomic neuropathy in risk stratification of Brugada syndrome. Europace, 10(7),
Bailey, J. J., Berson, A. S., Handelsman, H., & Hodges, M. (2001). Utility of current risk stratification tests for predicting major arrhythmic events after myocardial infarction. Journal of the American College of Cardiology, 38(7), 1902–1911.
Balding, D. (2006). A tutorial on statistical methods for population association studies. Nature Reviews Genetics (Abstract).
Witteman, J. C. M., et al. (2009). A common NOS1AP genetic polymorphism is associated with increased cardiovascular mortality in users of dihydropyridine calcium channel blockers. British Journal of Clinical Pharmacology, 67(1), 61–67.
Behr, E. R., Dalageorgou, C., Christiansen, M., Syrris, P., Hughes, S., Tome Esteban, M. T., et al. (2008). Sudden arrhythmic death syndrome: Familial evaluation identifies inheritable heart disease in the majority of families. European Heart
Benito, B., Brugada, R., Brugada, J., & Brugada, P. (2008). Brugada syndrome. Progress in Cardiovascular Diseases, 51(1), 1–22.
Benito, B., Sarkozy, A., Mont, L., Henkens, S., Berruezo, A., Tamborero, D., et al. (2008). Gender differences in clinical manifestations of Brugada syndrome. Journal of the American College of Cardiology, 52(19), 1567–1573. http://
Berthold, M. R., Cebron, N., Dill, F., Gabriel, T. R., Kötter, T., Meinl, T., et al. (2008). KNIME: The Konstanz information miner. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, & R. Decker (Eds.), Data analysis, machine learning and applications (pp. 319–326). Berlin, Heidelberg: Springer Berlin Heidelberg.

Table 2
List of statistical and machine learning methods that are applicable at the different stages of a risk stratification analysis.
Stage of analysis | Methods applicable
Exploratory analysis | Summary statistics, frequency comparisons, relative risk, odds ratio, outlier detection, distribution adherence, clustering, PCA, Kohonen maps
Experimental design | Cross-validation, bootstrapping, jackknifing
Feature selection | T-test, ANOVA, linear/logistic regression, decision trees (CART), ensemble methods (e.g. random forests)
Model building (training/ | Support vector machines, multi-layer perceptrons, Naive Bayesian modeling, decision trees, ensemble methods
Model validation/scoring | Sensitivity analysis, receiver operator characteristic curves, C statistics, coefficients of determination, F-statistics

Bigi, R., Gregori, D., Cortigiani, L., & Desideri, A. (2005). Artificial neural networks and robust Bayesian classifiers for risk stratification following uncomplicated myocardial infarction. International Journal of Cardiology, 101(3), 481–487.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees (1st ed.). Chapman and Hall/CRC.
Brugada, J. (2003). Determinants of sudden cardiac death in individuals with the electrocardiographic pattern of Brugada syndrome and no previous cardiac
Brugada, R., & Campuzano, O. (2009). In R. Brugada (Ed.), Clinical approach to sudden cardiac death syndromes (pp. 121–129). London: Springer London. http://
Calabrese, R., Capriotti, E., Fariselli, P., Martelli, P. L., & Casadio, R. (2009). Functional annotations improve the predictive score of human disease-related mutations in proteins. Human Mutation, 30(8), 1237–1244.
Chen, J.-Z., Xie, X.-D., Wang, X.-X., Tao, M., Shang, Y.-P., & Guo, X.-G. (2004). Single nucleotide polymorphisms of the SCN5A gene in Han Chinese and their relation with Brugada syndrome. Chinese Medical Journal, 117(5), 652–656.
Chinushi, M., Komura, S., Izumi, D., Furushima, H., Tanabe, Y., Washizuka, T., et al. (2007). Incidence and initial characteristics of pilsicainide-induced ventricular arrhythmias in patients with Brugada syndrome. Pacing and Clinical
Çınar, M., Engin, M., Engin, E. Z., & Ziya Ateşçi, Y. (2009). Early prostate cancer diagnosis by using artificial neural networks and support vector machines. Expert Systems with Applications, 36(3), 6357–6361.
Clayton, D. G. (2009). Prediction and interaction in complex disease genetics: Experience in type 1 diabetes. M. I. McCarthy (Ed.). PLoS Genetics, 5(7), e1000540.
Cordell, H. J. (2002). Epistasis: What it means, what it does not mean, and statistical methods to detect it in humans. Human Molecular Genetics, 11(20), 2463.
Cordell, H. J. (2009). Estimation and testing of gene-environment interactions in family-based association studies. Genomics, 93(1), 5–9.
Cosgun, E., Limdi, N. A., & Duarte, C. W. (2011). High-dimensional pharmacogenetic prediction of a continuous trait using machine learning techniques with application to warfarin dose prediction in African Americans. Bioinformatics,
Delise, P., Allocca, G., Marras, E., Giustetto, C., Gaita, F., Sciarra, L., et al. (2011). Risk stratification in individuals with the Brugada type 1 ECG pattern without previous cardiac arrest: Usefulness of a combined clinical and electrophysiologic approach. European Heart Journal, 32(2), 169–176. http://
Demšar, J., Zupan, B., Leban, G., & Curk, T. (2004). Orange: From experimental machine learning to interactive data mining. Knowledge Discovery in Databases:
Dietterich, T. (2000). Multiple classifier systems.
Doi, A., Takagi, M., Maeda, K., Tatsumi, H., Shimeno, K., & Yoshiyama, M. (2009). Conduction delay in right ventricle as a marker for identifying high-risk patients with Brugada syndrome. Journal of Cardiovascular Electrophysiology, 21(6),
Hofman, A., et al. (2009a). Identification of a common variant at the NOS1AP locus strongly associated to QT-interval duration. Human Molecular Genetics,
Eijgelsheim, M., Newton-Cheh, C., Aarnoudse, A. L. H. J., van Noord, C., Witteman, J. C. M., Hofman, A., et al. (2009b). Genetic variation in NOS1AP is associated with sudden cardiac death: Evidence from the Rotterdam study. Human Molecular
Elliott, P. M., Poloniecki, J., Dickie, S., Sharma, S., Monserrat, L., Varnava, A., et al. (2000). Sudden death in hypertrophic cardiomyopathy: Identification of high risk patients. Journal of the American College of Cardiology, 36(7), 2212–2218.
Exarchos, T. P., Tzallas, A. T., Baga, D., Chaloglou, D., Fotiadis, D. I., Tsouli, S., et al. (2012). Using partial decision trees to predict Parkinson’s symptoms: A new approach for diagnosis and therapy in patients suffering from Parkinson’s disease. Computers in Biology and Medicine, 42(2), 195–204.
et al. (2009). Early risk stratification of patients with cardiogenic shock complicating acute myocardial infarction who undergo percutaneous coronary intervention. The American Journal of Cardiology, 103(8), 1073–1077.
García-Magariños, M., López-de-Ullibarri, I., Cao, R., & Salas, A. (2009). Evaluating the ability of tree-based methods and logistic regression for the detection of SNP–SNP interaction. Annals of Human Genetics, 73(3), 360–369. http://
Gashler, M., Giraud-Carrier, C., & Martinez, T. (2008). Decision tree ensemble: Small heterogeneous is better than large homogeneous. In Seventh international conference on machine learning and applications, ICMLA’08 (pp. 900–905).
Gehi, A. K., Duong, T. D., Metz, L. D., Gomes, J. A., & Mehta, D. (2006). Risk stratification of individuals with the Brugada electrocardiogram: A meta-analysis. Journal of Cardiovascular Electrophysiology, 17(6), 577–583. http://
Gilliam, T., Tanzi, R., Haines, J., & Bonner, T. (1987). Localization of the Huntington’s disease gene to a small segment of chromosome 4 flanked by D4S10 and the
Giustetto, C., Drago, S., Demarchi, P. G., Dalmasso, P., Bianchi, F., Masi, A. S., et al. (2008). Risk stratification of the patients with Brugada type electrocardiogram: A community-based prospective study. Europace, 11(4), 507–513. http://
Goecks, J., Nekrutenko, A., & Taylor, J. (2010). Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology.
Goldberger, J. J., Buxton, A. E., Cain, M., Costantini, O., Exner, D. V., Knight, B. P., et al. (2011). Risk stratification for arrhythmic sudden cardiac death: Identifying the
Goldenberg, I., Horr, S., Moss, A. J., Lopes, C. M., Barsheshet, A., McNitt, S., et al. (2011). Risk for life-threatening cardiac events in patients with genotype-confirmed long-QT syndrome and normal-range corrected QT intervals. Journal of the American College of Cardiology, 57(1), 51–59.
Goldenberg, I., Moss, A. J., Peterson, D. R., McNitt, S., Zareba, W., Andrews, M. L., et al. (2008). Risk factors for aborted cardiac arrest and sudden cardiac death in children with the congenital long-QT syndrome. Circulation, 117(17),
Green, C. L., Kligfield, P., George, S., Gussak, I., Vajdic, B., Sager, P., et al. (2012). Detection of QT prolongation using a novel electrocardiographic analysis algorithm applying intelligent automation: Prospective blinded evaluation using the cardiac safety research consortium electrocardiographic database. American Heart Journal, 163(3), 365–371.
Han, J., Kamber, M., & Pei, J. (2011). Data mining: Concepts and techniques. Morgan
Haseena, H. H., Mathew, A. T., & Paul, J. K. (2009). Fuzzy clustered probabilistic and multi layered feed forward neural networks for electrocardiogram arrhythmia classification. Journal of Medical Systems, 35(2), 179–188.
Hirschhorn, J. (2005). Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics (Abstract).
Hobbs, J. B., Peterson, D. R., Moss, A. J., McNitt, S., Zareba, W., Goldenberg, I., et al. (2006). Risk of aborted cardiac arrest or sudden cardiac death during adolescence in the long-QT syndrome. JAMA: The Journal of the American Medical Association, 296(10), 1249.
Holmes, G., Donkin, A., & Witten, I. H. (1994). In Proceedings of ANZIIS ’94 - Australian New Zealand intelligent information systems conference. Presented at the ANZIIS ’94 - Australian New Zealand intelligent information systems conference
Ikeda, T., Takami, M., Sugi, K., Mizusawa, Y., Sakurada, H., & Yoshino, H. (2005). Noninvasive risk stratification of subjects with a Brugada-type electrocardiogram and no history of cardiac arrest. Annals of Noninvasive
Jiang, R., Yang, H., Zhou, L., Kuo, C. C. J., Sun, F., & Chen, T. (2007). Sequence-based prioritization of nonsynonymous single-nucleotide polymorphisms for the study of disease mutations. The American Journal of Human Genetics, 81(2),
Jouven, X., Desnos, M., Guerot, C., & Ducimetière, P. (1999). Predicting sudden death in the population: The Paris prospective study I. Circulation, 99(15), 1978–1983.
Jouven, X., Zureik, M., Desnos, M., Guerot, C., & Ducimetière, P. (2001). Resting heart rate as a predictive risk factor for sudden death in middle-aged men. Cardiovascular Research, 50(2), 373.
Kapplinger, J. D., Tester, D. J., Alders, M., Benito, B., Berthet, M., Brugada, J., et al. (2010). An international compendium of mutations in the SCN5A-encoded cardiac sodium channel in patients referred for Brugada syndrome genetic testing. Heart Rhythm: The Official Journal of the Heart Rhythm Society, 7(1),
Kattygnarath, D., Maugenre, S., Neyroud, N., Balse, E., Ichai, C., Denjoy, I., et al. (2011). MOG1: A new susceptibility gene for Brugada syndrome. Circulation. Cardiovascular Genetics, 4(3), 261–268.
Kerem, B., Rommens, J., Buchanan, J., Markiewicz, D., Cox, T., Chakravarti, A., et al. (1989). Identification of the cystic fibrosis gene: Genetic analysis. Science,
Koeppel, F., Labarre, D., & Zitoun, P. (2012). Quickly finding a needle in a haystack: A new automated cardiac arrhythmia detection software for preclinical studies. Journal of Pharmacological and Toxicological Methods.
Kohonen, T. (2007). Kohonen network. Scholarpedia.
Kotta, C.-M., Anastasakis, A., Gatzoulis, K., Manolis, A. S., & Stefanadis, C. (2010). Novel sodium channel SCN5A mutations in Brugada syndrome patients from Greece. International Journal of Cardiology, 145(1), 45–48.
Krishnan, V. G., & Westhead, D. R. (2003). A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein
2484 P.S.Wasan et al./Expert Systems with Applications 40 (2013) 2476–2486
Lander,E.S.,Linton,L.M.,Birren,B.,Nusbaum,C.,Zody,M.C.,Baldwin,J.,et al.
(2001).Initial sequencing and analysis of the human genome.Nature,
Latcu,G.D.,Meste,O.,Duparc,A.,Mondoly,P.,Rollin,A.,Delay,M.,et al.(2011).
Temporal and spectral analysis of ventricular fibrillation in humans.Journal of
Interventional Cardiac Electrophysiology,30(3),199–209.
Leite,C.R.M.,Martin,D.L.,Sizilio,G.R.M.A.,Santos,dos,K.E.A.,de Araújo,B.G.,de
M Valentim,R.A.,et al.(2010).In Annual international conference of the IEEE
engineering in medicine and biology.Presented at the 2010 32nd annual
international conference of the IEEE engineering in medicine and biology society
(EMBC 2010) (pp.1386–1389).IEEE.
Charalampous,C.,et al.(2011).Long-term prognosis of asymptomatic
individuals with spontaneous or drug-induced type 1 electrocardiographic
phenotype of Brugada syndrome.Journal of Electrocardiology,44(3),346–349.
London,B.,Michalec,M.,Mehdi,H.,Zhu,X.,Kerchner,L.,Sanyal,S.,et al.(2007).
Mutation in glycerol-3-phosphate dehydrogenase 1-like gene (GPD1-L)
decreases cardiac Na+ current and causes inherited arrhythmias.
Maroco,J.,Silva,D.,Rodrigues,A.,Guerreiro,M.,Santana,I.,& de Mendonça,A.
(2011).Data mining methods in the prediction of dementia:A real-data
comparison of the accuracy,sensitivity and specificity of linear discriminant
analysis,logistic regression,neural networks,support vector machines,
classification trees and random forests.BMC Research Notes,4,299.http://
et al.(2012).Automated screening of arrhythmia using wavelet based machine
learning techniques.Journal of Medical Systems,36(2),677–688.http://
McKinney,B.A.,Reif,D.M.,Ritchie,M.D.,& Moore,J.H.(2006).Machine learning
for detecting gene-gene interactions:A review.Applied Bioinformatics,5(2),
McPherson,J.,Marra,M.,Hillier,L.,& Waterston,R.(2001).A physical map of the
human genome.Nature (Abstract).
Metzker,M.L.(2009).Sequencing technologies — the next generation.Nature
Reviews Genetics,11(1),31–46.
Milhorn,H.T.(2005).Electrocardiography for the family physician:The essentials.
Brown Walker Press.
Milpied,P.,Dubois,R.,Roussel,P.,Henry,C.,& Dreyfus,G.(2011).Arrhythmia
discrimination in implantable cardioverter defibrillators using support vector
machines applied to a new representation of electrograms.IEEE Transactions on
Biomedical Engineering,58(6),1797–1803.
Nakano,Y.,Shimizu,W.,Ogi,H.,Suenari,K.,Oda,N.,Makita,Y.,et al.(2010).A
spontaneous type 1 electrocardiogram pattern in lead V2 is an independent
predictor of ventricular fibrillation in Brugada syndrome.Europace,12(3),
Napolitano,C.,& Antzelevitch,C.(2011).Phenotypical manifestations of mutations
in the genes encoding subunits of the cardiac voltage-dependent L-type calcium
Napolitano,C.,Bloise,R.,Monteforte,N.,& Priori,S.G.(2012).Sudden cardiac death
and genetic ion channelopathies:Long QT,Brugada,short QT,
catecholaminergic polymorphic ventricular tachycardia,and idiopathic
ventricular fibrillation.Circulation,125(16),2027–2034.
Nunn,L.M.,Bhar-Amato,J.,& Lambiase,P.D.(2010).Brugada syndrome:
Controversies in risk stratification and management.Indian Pacing and
Electrophysiology Journal,10(9),400.
Oinn,T.,Addis,M.,Ferris,J.,Marvin,D.,Senger,M.,Greenwood,M.,Carver,T.,et al.
(2004).Taverna:A tool for the composition and enactment of bioinformatics
Pfeufer,A.,Sanna,S.,Arking,D.E.,Müller,M.,Gateva,V.,Fuchsberger,C.,et al.
(2009).Common variants at ten loci modulate the QT interval duration in the
QTSCD study.Nature Genetics,41(4),407–414.
Pirooznia,M.,Yang,J.Y.,Yang,M.Q.,& Deng,Y.(2008).A comparative study of
different machine learning methods on microarray gene expression data.BMC
Genomics,9(Suppl 1),S13.
Priori,S.G.(2002).Natural history of Brugada syndrome:Insights for risk
stratification and management.Circulation,105(11),1342–1347.http://
Priori,S.,Schwartz,P.,Napolitano,C.,& Bloise,R.(2003).Risk stratification in the
long-QT syndrome.The New England Journal of Medicine.
Probst,V.,Veltmann,C.,Eckardt,L.,Meregalli,P.G.,Gaita,F.,Tan,H.L.,et al.(2010).
Long-term prognosis of patients diagnosed with Brugada syndrome:Results
from the FINGER Brugada syndrome registry.Circulation,121(5),635–643.
Raju,H.,Papadakis,M.,Govindan,M.,Bastiaenen,R.,Chandra,N.,O’Sullivan,A.,et al.
(2011).Low prevalence of risk markers in cases of sudden death due to Brugada
syndrome.Journal of the American College of Cardiology,57(23),2340–2345.
Rokach,L.,& Maimon,O.Z.(2008).Data mining with decision trees.World Scientific
Pub Co Inc.
Rumelhart,D.E.,Hinton,G.E.,& Williams,R.J.(1986).Learning representations by
back-propagating errors.Nature,323(6088),533–536.
et al.(2005).Prediction of clinical outcome and biological characterization of
neuroblastoma by expression profiling.Oncogene,24(53),7902–7912.http://
Schulte,J.H.,Schowe,B.,Mestdagh,P.,Kaderali,L.,Kalaghatgi,P.,Schlierf,S.,et al.
(2010).Accurate prediction of neuroblastoma outcome based on miRNA
expression profiles.International Journal of Cancer.Journal International du
Schwarz,D.F.,Szymczak,S.,Ziegler,A.,& König,I.R.(2007).Picking single-
nucleotide polymorphisms in forests.BMC Proceedings,1(Suppl 1),S59.
Segal,M.(2004).Machine learning benchmarks and random forest regression.
Sekkal,M.,Chikh,M.A.,& Settouti,N.(2011).Evolving neural networks using a
genetic algorithm for heartbeat classification.Journal of Medical Engineering and
Calzon,C.,Sanz-Romero,G.,et al.(2002).Computers in cardiology.Presented at
the computers in cardiology (Vol.29).IEEE.doi:10.1109/CIC.2002.1166797.
Siemiatycki,J.,& Thomas,D.C.(1981).Biological models and statistical
interactions:An example from multistage carcinogenesis.International Journal
of Epidemiology,10(4),383–387.
Straus,S.M.J.M.,Bleumink,G.S.,Dieleman,J.P.,van der Lei,J.,’t Jong,G.W.,
Kingma,J.H.,et al.(2004).Antipsychotics and the risk of sudden cardiac death.
Archives of Internal Medicine,164(12),1293.doi:10.1001/archinte.164.12.1293.
Sun,Y.V.,Cai,Z.,Desai,K.,Lawrence,R.,Leff,R.,Jawaid,A.,et al.(2007).
Classification of rheumatoid arthritis status with candidate gene and genome-
wide single-nucleotide polymorphisms using random forests.BMC Proceedings,
Sun,Y.V.,Bielak,L.F.,Peyser,P.A.,Turner,S.T.,Sheedy,P.F.II,et al.(2008).
Application of machine learning algorithms to predict coronary artery
calcification with a sibship-based design.Genetic Epidemiology,32(4),
Sun,Y.,Bielak,L.,& Peyser,P.(2008).Application of machine learning algorithms to
predict coronary artery calcification with a sibship-based design.Genetic
Epidemiology.
Sunay,A.,& Cunedioğlu,U.(2009).Feasibility of probabilistic neural networks,
Kohonen self-organizing maps and fuzzy clustering for source localization of
ventricular focal arrhythmias from intravenous catheter measurements.Expert
Systems.
Syed,T.F.,& Guttag,J.V.(2011).Unsupervised similarity-based risk stratification
for cardiovascular events using long-term time-series data.
& Sun,Y.V.(2009).Machine learning in genome-wide association studies.In J.
W.MacCluer,L.A.Cupples,& L.Almasy (Eds.),Genetic Epidemiology (Vol.33(S1),
Takagi,M.,Yokoyama,Y.,Aonuma,K.,Aihara,N.,& Hiraoka,M.for the Japan
Idiopathic Ventricular Fibrillation Study (J-IVFS) Investigators.(2007).Clinical
characteristics and risk stratification in symptomatic and asymptomatic
patients with Brugada syndrome:Multicenter study in Japan.Journal of
Cardiovascular Electrophysiology,18(12),1244–1251.
Tatsumi,H.,Takagi,M.,Nakagawa,E.,Yamashita,H.,& Yoshiyama,M.(2006).Risk
stratification in patients with Brugada syndrome:Analysis of daily fluctuations
in 12-lead electrocardiogram (ECG) and signal-averaged electrocardiogram
(SAECG).Journal of Cardiovascular Electrophysiology,17(7),705–711.http://
Tuana,G.,Volpato,V.,Ricciardi-Castagnoli,P.,Zolezzi,F.,Stella,F.,& Foti,M.(2011).
Classification of dendritic cell phenotypes from gene expression data.BMC
Tufféry,S.(2011).Data mining and statistics for decision making.Wiley.
Ture,M.,& Tokatli,F.(2009).Using Kaplan–Meier analysis together with decision
tree methods (C&RT,CHAID,QUEST,C4.5 and ID3) in determining recurrence-
free survival of breast cancer patients.Expert Systems with Applications,23.
Vapnik,V.(1995).Machine learning.Springer.20(3).
Veltmann,C.,Schimpf,R.,Borggrefe,M.,& Wolpert,C.(2009).Risk stratification in
electrical cardiomyopathies.Herz,34(7),518–527.
Venter,J.C.,Adams,M.D.,Myers,E.W.,Li,P.W.,Mural,R.J.,Sutton,G.G.,et al.
(2001).The sequence of the human genome.Science,291(5507),1304.
Wan,X.,Yang,C.,Yang,Q.,Xue,H.,Tang,N.L.S.,& Yu,W.(2009).MegaSNPHunter:
A learning approach to detect disease predisposition SNPs and high level
interactions in genome wide association study.BMC Bioinformatics,10,13.
S.,Kannankeril,P.J.,et al.(2009).Mutations in sodium channel β1- and β2-
subunits associated with atrial fibrillation.Clinical Perspective.
Yang,P.,Hwa Yang,Y.,Zhou,B.B.,& Zomaya,A.Y.(2010).A review of ensemble
methods in bioinformatics.
Yue,P.,& Moult,J.(2006).Identification and analysis of deleterious human SNPs.
Journal of Molecular Biology,356(5),1263–1274.
Zadeh,A.E.,Khazaee,A.,& Ranaee,V.(2010).Classification of the electrocardiogram
signals using supervised classifiers and efficient features.Computer Methods and
Programs in Biomedicine,99(2),179–194.
Zondervan,K.(2007).Designing candidate gene and genome-wide case–control
association studies.Nature Protocols (Abstract).
Zweig,M.H.,& Campbell,G.(1993).Receiver-operating characteristic (ROC) plots:
A fundamental evaluation tool in clinical medicine.Clinical Chemistry,39(4),