Victorian Partnership for Advanced Computing
Expertise Program Grants Scheme
Round Three – 1st January, 2002 to 31st December, 2002
PROJECT
Parallel Regression Trees
for Multilevel Information Systems and Data Mining
Fabrizio Carinci
Centre for Health Systems Research
Monash Institute of Health Services Research
Correspondence:
Monash Institute of Health Services Research
Level 1, Block E, Locked Bag 29
Monash Medical Centre
Clayton, Victoria 3168, Australia
Tel: +61 3 9594 7504
Fax: +61 3 9594 7554
Email: Fabrizio.Carinci@med.monash.edu.au
Web: www.med.monash.edu.au/healthservices/
REIGN
RECPAM Information Generation Network
CONTENTS
Aim
Background
Details of the basic research strategy
Study design
Significance of the project
Milestones
Outcomes
Budget
References
RECPAM Information Generation Network (REIGN):
Parallel Regression Trees for Multilevel Information Systems and Data Mining
Fabrizio Carinci and Antonio Ciampi
AIM
The aim of this project is to advance the methodology and software for the regression tree approach known as RECPAM (RECursive Partitioning and AMalgamation).
This is a one-year project in which we plan to design original software using parallel computing technology as a strategic solution to support research streams that will aim to: a) advance the RECPAM model; and b) transform RECPAM into a powerful tool supporting the development of multilevel information systems.
The present proposal will focus on applications in health services and systems research, to identify efficient methods for rapidly extracting information from shared data and collaborative networks.
The central objective of our investigation is to generate information from multiple sources; therefore, we will refer to the present proposal as “ReIGN”, an acronym for “RECPAM Information Generation Network”.
BACKGROUND
Nowadays, tree-based techniques have become a familiar tool in data analysis. They are frequently used to support research through better understanding of the relationships and possible causal mechanisms influencing the occurrence of outcomes in many different fields.
A tree structure offers a simple and attractive presentation of the information about a response variable contained in a set of predictors. A chain of statements concerning the predictors, one for each node, uniquely assigns an individual to a leaf, or terminal node, of the tree.
RECPAM [1-8] is a generalization of the well-known CART [9] algorithm, a technique that builds a prediction tree from data of the form (y,z), where y is a one-dimensional response variable and z is a vector of predictors. Within the CART framework, generalizations to multivariate responses have recently been developed, where y, even if multivariate, is still considered a response.
The RECPAM model
The RECPAM model, summarized in figure 1, was developed for data having the general structure (u,z), where u is a vector of variables not necessarily to be thought of as a 'response'. They may, in fact, consist of a mixture of random and non-random variables, corresponding, for example, to response and treatment.
In RECPAM, it is assumed that the distribution of u is at least partially specified, and that the object of the prediction is a parameter of this distribution, in general multidimensional. This parameter is termed the criterion, to emphasize that any predictor is assessed according to how well it predicts that parameter. For the same reason, the components of u are referred to as criterion variables.
Additional classes of parameters, defined as global, local and determinants, may be fitted using the RECPAM approach, by applying a broad range of regression models (e.g. proportional hazards and generalized regression models) to deliver a final model that is a mix of a traditional regression approach plus an additional set of informative parameters that may be defined by a tree structure.
Tree-growing in RECPAM proceeds through a well-established series of steps, including a binary search and partitioning algorithm, strategies for complexity reduction (pruning and amalgamation), validation (cross-validation) and final model specification (backward elimination).
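The binary search step named above can be illustrated with a minimal sketch. It assumes a single continuous predictor and a squared-error criterion in the style of CART; RECPAM itself works with more general, model-based criteria, so the function names and the criterion used here are illustrative assumptions rather than the RECPAM/SAS implementation.

```python
from typing import List, Optional, Tuple


def sum_of_squares(values: List[float]) -> float:
    """Within-node sum of squared deviations from the node mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(z: List[float], y: List[float]) -> Optional[Tuple[float, float]]:
    """Return (cut_point, total_sse) for the best split 'z <= cut_point'.

    Returns None when all predictor values are identical (no split exists).
    Squared error is an assumed stand-in for the RECPAM criterion.
    """
    pairs = sorted(zip(z, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical predictor values cannot be separated
        left = [p[1] for p in pairs[:i]]
        right = [p[1] for p in pairs[i:]]
        sse = sum_of_squares(left) + sum_of_squares(right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
        if best is None or sse < best[1]:
            best = (cut, sse)
    return best


if __name__ == "__main__":
    ages = [40, 45, 50, 70, 75, 80]           # hypothetical predictor values
    outcome = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]  # hypothetical response values
    print(best_binary_split(ages, outcome))
```

Applying the same search recursively within each resulting subgroup grows the tree; pruning, amalgamation and cross-validation then operate on the tree so obtained.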
Development of the technique was inspired by situations frequently occurring in biostatistics. Particularly in this field, using a tree structure can be very informative, since parameters will refer to subgroups of observations, such as individuals, doctors, hospitals or specific health episodes that could be predicted by specific characteristics or conditions.
Adding a tree component to a traditional regression model could represent an ideal compromise, where the investigator uses personal ‘skills and art’ in model fitting to improve accuracy by tuning the desired balance between the different components.
Because of its unique features and flexibility, the RECPAM approach has been adopted frequently in health services and outcomes research [10-15] as a criterion-driven solution for applying data mining to very large epidemiologic databases.
Such applications usually required consistency with the study design and so needed to be based on the definition of an a priori primary endpoint and the clear specification of the potential set of confounders to be included as predictors in the statistical model.
RECPAM software development
In the last 15 years, specialised RECPAM software has been implemented using Fortran [1,4,6], C++ [6-8], and SAS [16].
The SAS implementation was developed by F. Carinci to apply this technique to the exploration of very large epidemiologic databases: RECPAM/SAS has been applied in AIDS research at the Harvard School of Public Health, in diabetes [13,15] and cardiovascular diseases [10,14] at Mario Negri Sud, and in oncology and social research at McGill University.
Overall, the software has proved to be flexible, stable and reliable, even when working on very large numbers of observations and variables, sharing the strengths of the SAS programming language.
Figure 1. The RECPAM approach: developing a SAS prototype
Open Questions
Despite the availability of a complete RECPAM implementation in SAS, the existing software does not address a number of open questions at both the methodological and the applied level, which we believe merit further understanding, development and field testing.
In this proposal we will focus on aspects that rely on innovative approaches making use of parallel processing.
Our research questions belong to two different research streams: advancing the RECPAM model and using RECPAM as a tool for multilevel information systems.
Stream 1. Advancing the RECPAM model
1.1) Extending the range of applicable models: multilevel models
Estimation of statistical models is useful on many occasions, providing a simple way to synthesise the results of analyses of value for decision makers.
Multilevel models are a specific class of statistical models that are ideal for applications in social research, since they allow estimating parameters at different levels in the hierarchy of nested structures.
For example, we could fit single parameters predicting the risk of individual patients, at the same time taking into account the correlation among patients belonging to the same doctor, or even fitting parameters for the single doctors that will then add a random component to the overall model. Using the same model, we could even explore patterns in repeated measurements on a single patient, where the correlation to be accounted for is in this case not at the doctor level, but longitudinal, at the patient level.
The number of potential applications is extremely large, because multilevel models offer the flexibility to extend generalised regression models to multilevel structures.
How can the methodology of multilevel models be applied to regression trees?
The RECPAM approach is quite general, thus offering an ideal base for new solutions that are at the moment still generally lacking.
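As an illustration of the doctor and patient example above, a two-level random-intercept model for a continuous outcome can be written as follows. The notation is ours and is only a sketch: i indexes patients, j indexes doctors, x_ij is a hypothetical patient covariate, and the normality assumptions are the usual textbook ones, not a specification taken from RECPAM.

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1 x_{ij} + u_j + e_{ij}, \\
u_j &\sim N(0, \sigma_u^2), \qquad e_{ij} \sim N(0, \sigma_e^2), \\
\operatorname{corr}(y_{ij}, y_{i'j}) &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} \quad (i \neq i').
\end{aligned}
```

The doctor effect u_j is the random component added to the overall model, and the last line gives the correlation it induces between two patients of the same doctor. In the longitudinal case, j would index the patient and i the repeated measurements.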
1.2) Addressing stability
Along with the increased practical use of recursive partitioning models have come doubts concerning the reliability and stability of the procedure. Methods like RECPAM may have a tendency to overfit the data; among the causes that have been identified are the unnatural dichotomization of continuous covariates and the dependence of the final tree structure, and of all subsequent operations, on a very small number of splits at the start of the recursive partitioning process.
RECPAM includes a cross-validation step that closely resembles the CART technique; however, recent developments in the area [17-20] need to be addressed and tested using more advanced approaches than those of the original work by Breiman et al.
1.3) Reducing the computational burden
The RECPAM approach is very general in nature: its computational burden is the price to be paid for the availability of a broad spectrum of applications. At the same time, the RECPAM model shares limitations that are typical of local search strategies, such as the number of category combinations to be examined for nominal variables with many categories, which grows exponentially with the number of categories. Since this is often the case in health services research, the identification and reliability of approximate solutions is an important point that should be tested in practice. We believe that parallel computing would be a crucial factor, since its impact on speed and performance is likely to be substantial in such applications.
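To give a sense of scale, the short sketch below counts the binary splits available for an unordered nominal predictor with k categories. The formula 2^(k-1) - 1 is the standard combinatorial count for an exhaustive search; that the RECPAM search enumerates exactly this set is our assumption.

```python
def count_binary_splits(k: int) -> int:
    """Number of ways to divide k unordered categories into two non-empty groups."""
    if k < 2:
        return 0  # fewer than two categories cannot be split
    return 2 ** (k - 1) - 1


if __name__ == "__main__":
    # e.g. a hypothetical 'hospital' variable with up to 50 distinct codes
    for k in (2, 5, 10, 20, 50):
        print(k, count_binary_splits(k))
```

Each additional category roughly doubles the number of candidate splits, which is why exhaustive search quickly becomes impractical and approximate or parallel strategies are needed.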
Stream 2. Using RECPAM as a tool for multilevel information systems
A second stream follows work conducted by the chief investigator on health information systems.
F. Carinci has been working for the last five years on the development of original software to perform statistical analysis on administrative health databases, applying sophisticated approaches to different disease areas and exploratory multidimensional models. In this framework, he has developed original approaches to link and analyse very large health databases, conceiving and implementing RISS (Reporting-by-Intranet Statistical System) [21-27].
The RISS software was inspired by a very simple logic, which consisted basically in making few or no changes at all to the original data. RISS was the result of an Italian research project sponsored by SAS Institute and SUN Microsystems; the project delivered its output as public domain software that was later adopted by different health departments and public health observatories in several Italian regions.
The system has been tested in practice and used to automate the production of web reports for various disease areas, using a flexible design that can be easily customized to meet different needs.
2.1) RISS and RECPAM together: matching the multilevel structure
Health information systems are increasingly based on multilevel structures, i.e. problems in which by default different levels interact with patients, influencing their outcomes within the broader framework of complex health systems.
Health systems are a perfect field of application for such approaches: analysts are frequently involved in processing data at multiple levels; typically, some archives may be stored centrally as a very large or massive database, while others are geographically distributed, and finally more detailed and fragmented files are managed by small organisational units or even single users.
From a different point of view, health information systems usually involve different levels of data ownership and data security, adding a complexity factor that affects the possibility of obtaining completely linked datasets at both the individual and the structural level.
Data systems such as those described here are referred to as multilevel information systems.
The RISS system was inspired by a similar multilevel design and was built to be highly customisable, so that it could be used to answer complex queries and obtain entirely customisable tables, graphs, and maps. But RISS was also meant to be extensible, in the sense of providing access to advanced multivariate methods.
In particular, RISS was ideal for writing modules linking population archives to administrative databases, extracting tables to build prediction models available through the SAS system, such as Generalized Linear Models (GENMOD) and Proportional Hazards (PHREG).
The availability of RECPAM routines linked to the RISS system would represent a major opportunity to deliver supporting solutions for informed decision-making and the realization of evidence-based health policy.
As a matter of fact, tree-based structures are ideal for any population-based information system, since they would use parameters such as risk estimates or correlation measures to isolate specific subgroups that can be targeted by specific actions.
However, the implementation of multilevel models, albeit essential and definitely possible, should be conceptualised to work in the framework of complex information systems: this issue is still largely unexplored and so it would need to be exactly specified and tested in real-life situations.
2.2) RISS and RECPAM together: a distributed, meta-analytical approach
RISS (Reporting-by-Intranet Statistical System) is a SAS application package that was built over three consecutive years by five staff members, making extensive use of macros. The core program was then linked to HTML and JavaScript code, automatically producing interactive reports to be accessed locally and remotely through an intranet.
How can the RISS design and the RECPAM approach be made to work together on intranet networks?
RISS relies on distributed data and is highly decentralized, allowing the exchange of only aggregate data and ensuring a minimum transfer of information at the individual level.
Such a design, although offering some advantages, is not compatible with the current design of RECPAM, which needs to access all the available information from a complete database.
Different solutions for the RECPAM approach should be explored in order to align its potential use with intranet networks that would participate in the development of multilevel information systems while agreeing to transfer only a limited amount of information.
ReIGN as a RECPAM parallel solution for stream 1 and stream 2
The present proposal is based on the adoption of innovative methodology that relies on a substantial re-design and development of RECPAM using a parallel processing approach.
The parallel approach is proposed not only because it is thought to be more efficient, but also because it closely resembles the reality of today’s world of Internet and Intranet networks.
For this reason a new project deploying software for regression trees in a parallel environment would represent an important step to support systems for the generation of real-time information.
RECPAM solutions to be used on multiple processors or cluster computers would represent a substantial innovation in the landscape of data mining, particularly for applications in health services, where no comparable product seems to be as flexible or as specifically tuned to answer the relevant questions and convincingly support health policy.
Therefore, the objectives of the present proposal are:
1) to develop software that will advance the RECPAM approach through the following innovations:
• applicability of multilevel models
• enhanced stability of RECPAM trees
• reduced computational burden
2) to design software that could be embedded in multilevel information systems, characterized by the following points:
• multilevel structure
• access and processing of distributed databases
The first point, corresponding to stream 1, will be studied in great detail during this one-year study. The ReIGN project will produce software of the same name, based on parallel computing.
The second point, corresponding to stream 2, is intended as a feasibility study. The ReIGN software developed to accomplish stream 1 will be compatible with these objectives, and the structure will be in place, although deployment of ReIGN as a fully functional information system will need additional resources or an extension beyond the expected length of one year.
DETAILS OF THE BASIC RESEARCH STRATEGY
To be successful, the ReIGN project should be able to innovate in the theoretical field as well as improve the existing numerical algorithms.
As described above, innovations will explore the use of the system to perform two main operations:
a) empowering the RECPAM approach to model complex relationships in very large databases
b) advancing the RECPAM approach to support multilevel information systems
With regard to the first stream, i.e. ReIGN as a means to empower the RECPAM approach, our proposal will be focused on three fundamental points:
1. Regression trees for multilevel modelling
2. Improving techniques for validation and stabilization of RECPAM models
3. Reducing overall RECPAM computational burden
As for the first point, multilevel regression trees, a parallel environment would represent an ideal setting to simulate and evaluate innovative solutions for modern data warehouses.
The RECPAM approach is itself a generalisation of regression tree analysis and as such would offer the same interpretational advantages as hierarchical models, a situation that by definition is coherent with the application of binary trees.
There are many potential approaches to implementing multilevel models in RECPAM.
One design would be to extend the generalized linear model (GLIM), which is widely known and used for a single (time-independent) outcome variable. GLIM makes it possible to treat the most important types of simple outcomes: continuous, binary (by extension, polytomous) and count, and to assess the effect of a covariate vector on the outcome. Several generalizations to time-dependent outcomes have been proposed [28]. The main distinctions are between marginal and conditional models and between fully and partially specified models, leading to likelihood-based and GEE-based parameter estimation, where GEE stands for generalized estimating equations. Experience shows that the most appropriate way of generalizing GLIM depends on both the scientific question underlying the analysis and the statistical properties of the estimate one wants to ensure, e.g. the bias-variance trade-off. Comparative studies of the various available approaches are highly desirable.
A. Ciampi and F. Carinci have recently presented [11-12] a GEE-based extension of tree growing to correlated continuous and discrete outcome variables. Some 2-level systems can be satisfactorily treated by the GEE approach to correlated outcome variables, the multi-level feature being represented by an appropriate choice for the correlation matrix. Applications to family-individual systems, matched case-control studies and time-dependent outcomes were shown to be feasible.
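For reference, the role of the correlation matrix can be seen in the usual textbook form of the generalized estimating equations. The notation below is ours and is given only as a sketch; it is not reproduced from the cited presentations.

```latex
\sum_{i=1}^{N} D_i^{\top} V_i^{-1} \bigl( y_i - \mu_i(\beta) \bigr) = 0,
\qquad
D_i = \frac{\partial \mu_i}{\partial \beta^{\top}},
\qquad
V_i = \phi \, A_i^{1/2} R(\alpha) A_i^{1/2}
```

Here y_i is the vector of outcomes in cluster i (for instance a family, a matched set or the repeated measurements on one subject), mu_i is its mean under the regression parameters beta, A_i is the diagonal matrix of variance functions, phi is a dispersion parameter and R(alpha) is the working correlation matrix. Choosing R(alpha) to reflect the grouping is what encodes the second level of the system.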
A better understanding of the issues discussed above in connection with linear regression and multi-level systems is essential in order to develop tree-growing algorithms for multilevel systems. Here we only discuss an example of a situation not covered above, but extremely important in applications. Suppose we consider individuals on whom we measure both a time-dependent outcome and a number of time-dependent covariates. A GEE-tree structure can easily be constructed from the data, and to each leaf we may attach, say, a growth curve. However, this is only one option: here a leaf represents not a collection of subjects but a collection of subject-time units; also, the decision nodes involve values of the covariates at particular times. This means, for instance, that the same individual may, in time, ‘jump’ from one leaf to another: this happens whenever certain key covariates vary in such a way as to change the decision at some nodes. The other option would be that a leaf represents a collection of subjects: this makes sense only if the decision nodes of the tree are based on the whole time development of the predictors.
To implement such an algorithm, a deeper analysis of the problem is required and it is necessary to consider decision nodes based on functional data.
Following the most recent ideas from L. Breiman [20], one of the most interesting developments in tree validation and stabilisation would involve using random outputs, so that a single RECPAM tree might be obtained as a result of growing RECPAM forests.
Construction of validated and stable regression trees can be dramatically improved by using parallel systems. When processing is strictly sequential, growing n trees requires a time equal to the sum of all computations involved for all trees, plus the procedures to sort and transform individual outputs and merge the results into a unified model.
Using parallel processing with at least n processors, the required time should be approximately equal to the time of the most complex tree, since as computations for individual trees are completed, further steps can be performed while other trees are still growing.
Parallel processing would help in developing some fundamental innovations in regression tree methodology, such as globally optimised algorithms. In this case too, parallel processing would be fundamental to extend the search to a high-dimensional set of parameters, finding the best model at each step of tree partitioning through parallel estimation of a very large number of candidate models.
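The timing argument for growing several trees can be made concrete with a small sketch. It assumes that trees are independent tasks, that each tree is handed to the first idle processor, and that the cost of merging the results is ignored; the function names are ours.

```python
import heapq
from typing import List


def sequential_time(tree_times: List[float]) -> float:
    """Total time when trees are grown one after another."""
    return sum(tree_times)


def parallel_time(tree_times: List[float], workers: int) -> float:
    """Completion time when each tree goes to the first idle worker.

    Merging costs are ignored (an assumption of this sketch).
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    finish = [0.0] * workers  # time at which each worker becomes idle
    heapq.heapify(finish)
    for t in tree_times:
        earliest = heapq.heappop(finish)      # first worker to become free
        heapq.heappush(finish, earliest + t)  # it grows the next tree
    return max(finish)


if __name__ == "__main__":
    times = [3.0, 1.0, 4.0, 2.0]  # hypothetical growing times for four trees
    print(sequential_time(times))
    print(parallel_time(times, workers=2))
    print(parallel_time(times, workers=4))
```

With as many workers as trees, the parallel time reduces to the time of the slowest tree; with fewer workers it lies between that value and the sequential total.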
Because of its generality, the computational burden is still an issue in the application of RECPAM. The third point, reducing RECPAM execution time and improving its numerical limits, will improve the feasibility of tree growing in specific conditions, by designing improved algorithms and offering new options to apply approximate methods.
There are again many possible ideas that we have been considering in recent years, in terms of algorithms whose detailed explanation falls beyond the scope of the present outline.
As an example, an interesting use of parallel capabilities would consist in dynamically allocating split search processes at every local node, so that problems of high numerical order, such as an extremely high number of combinations/permutations, or the fitting of a large number of model parameters, would become feasible in critical applications where the limit is rapidly approached.
This is the case for very large databases of hospital episodes, where parameters should be fitted at the hospital level and the number of possible interaction terms grows explosively.
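One way to picture the allocation of split searches at a node is to treat the search over each predictor as an independent task. The sketch below is an assumption-laden illustration: it uses Python threads and a squared-error criterion purely to show the task decomposition, whereas ReIGN is planned around concurrent SAS sessions and RECPAM criteria. In CPython, threads will not speed up this pure-Python arithmetic; separate processes or machines would be needed for a real gain.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def search_predictor(name: str, z: List[float], y: List[float]) -> Tuple[float, str, float]:
    """Best cut 'z <= cut' for one predictor, returned as (criterion, name, cut)."""
    best = (float("inf"), name, float("nan"))
    for cut in sorted(set(z))[:-1]:  # the largest value would leave the right side empty
        left = [yi for zi, yi in zip(z, y) if zi <= cut]
        right = [yi for zi, yi in zip(z, y) if zi > cut]
        score = sse(left) + sse(right)
        if score < best[0]:
            best = (score, name, cut)
    return best


def parallel_split_search(predictors: Dict[str, List[float]], y: List[float],
                          workers: int = 4) -> Tuple[float, str, float]:
    """Search every predictor as an independent task and keep the overall best."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(search_predictor, n, z, y) for n, z in predictors.items()]
        results = [f.result() for f in futures]
    return min(results, key=lambda r: r[0])


if __name__ == "__main__":
    data = {"age": [40, 45, 50, 70, 75, 80], "noise": [1, 2, 1, 2, 1, 2]}  # hypothetical
    print(parallel_split_search(data, [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]))
```

Because the per-predictor searches share no state, they satisfy the "no dependencies between tasks" condition under which independent sessions can run concurrently.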
With regard to the second stream, i.e. ReIGN as a RECPAM engine for multilevel information systems, our proposal will explore two main areas:
1. Meta-analytical generation of regression trees.
2. Linking ReIGN to the RISS model.
The first point, meta-analytical regression trees, is based on the recognition that parallel regression trees may be useful to generate information from collaborative networks, because while allowing estimation of very general statistical models, they would not require centralized databases for parameter estimation. In other words, final estimates could be directly obtained by stacking models from a series of remote nodes.
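The idea of combining results from remote nodes without moving individual records can be illustrated with the simplest pooling rule from meta-analysis. The sketch below uses fixed-effect inverse-variance weighting, which is a technique we substitute here for illustration; it is not the stacking procedure the proposal refers to, and the function name is hypothetical.

```python
from typing import List, Tuple


def pool_estimates(estimates: List[float], variances: List[float]) -> Tuple[float, float]:
    """Inverse-variance weighted mean of node estimates and its variance.

    Each remote node contributes only an estimate and its variance,
    never its individual-level data.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate, and at least one estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]  # more precise nodes weigh more
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, 1.0 / total


if __name__ == "__main__":
    # hypothetical estimates of the same parameter from three remote nodes
    print(pool_estimates([0.8, 1.1, 0.9], [0.04, 0.09, 0.02]))
```

Only aggregate quantities cross the network in this scheme, which matches the constraint of exchanging aggregate data only.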
Such a process closely resembles meta-analysis, a technique that has been extensively used for systematic reviews of clinical trials and has become quite popular with the establishment of evidence-based medicine.
Several ideas that have recently appeared in the specialised literature may be successfully applied to construct a meta-analytical RECPAM approach [18-20,29-32]. In particular we are referring to born again trees [18], random forests [20] and stacked regression [32], mostly originally intended as a means to obtain more stable models.
A meta-tree would represent quite an interesting approach for health databases, a situation where ownership and privacy issues usually play a major role, hampering the applicability of traditional models that rely on access to individual data for the overall system.
There are very few theoretical solutions and definitely no software available that would be able to deal with this degree of complexity, and we believe that the realization of ReIGN would represent quite a step forward in the field.
As for the second point, linking ReIGN to the RISS model, we would use a parallel environment to create a platform for the application of ReIGN in collaborative projects using intranet networks to connect databases. In the present study, we would explore how to run SAS sessions concurrently on different stations to create parallel trees. We would then define an approach for pooling parameters and parallel trees from different databases, addressing the possibility that regression models, and in particular multilevel models, could be fitted without using all the information available from centralized databases.
Clearly, the development of the software as a fully stand-alone information system would not be possible in the framework of the present project. However, we will test the feasibility of the project and identify possible strategies using distributed databases, represented by sample datasets assigned to distinct remote units.
STUDY DESIGN
In the first phase of this project, we will organize a brainstorming day and workshop with the team of researchers, including the chief investigators, Prof. Ciampi and Prof. Gibberd. This event will provide a presentation of the project, also allowing for a restricted meeting that will serve to define the study design in greater detail.
At the same time, Prof. Ciampi will visit Monash University for two weeks and the team will refine the strategy related to the statistical theory as well as the outline for statistical programming using parallel computing.
Then, an in-depth analysis of available RECPAM software will be conducted; core programming will be finalised in the following six months.
During the third quarter, the team will meet again to revise the software, identifying a list of applications on very large databases.
Thanks to Prof. Gibberd, who is a national authority in the analysis of health care databases in Australia, the ReIGN team will have the opportunity to approach typical problems in health systems research at the national level. Typical examples will include problems encountered in the analysis of administrative databases, as in the case of clinical indicators using inpatient episodes databases [33], or in large population surveys to support community prevention programs [34]. Such applications will serve to finalise stream 1 of our proposal.
We will then finally revise, debug and evaluate the software.
At the end of this step, we will assess the feasibility of adopting the software for distributed databases, using the same applications to approach stream 2. This phase will provide the team with indications for further development towards the use of ReIGN for multilevel information systems.
Finally, ReIGN software will be released and results will be published in various forms.
Software specifications
Version 8 of SAS software (through its module SAS/CONNECT) now includes the Multi-Process Connect facility, which exploits the multi-processor capabilities of symmetric multiprocessor and massively parallel processor systems by allowing parallel processing of self-contained tasks and the coordination of all the results in the original SAS session. SAS now provides independent parallelism [35-39], which means that when there are no dependencies between tasks, separate SAS sessions may be run concurrently to complete the overall job.
Building upon the experience of RECPAM/SAS, we will use its know-how to re-engineer the software and reuse its functionalities in a parallel environment.
ReIGN will be made up of multiple SAS macros linked together and operating on multiple processors and/or multiple computers, thus sharing the weight of the computations involved in tree growing. Outputs will be available in HTML and PDF format, with tables and graphs in both low- and high-resolution graphics, so that results will be widely accessible.
Efforts on the graphical user interface will be minimized, although end users should be able to run the product, even with a restricted number of options. The product will be developed at Monash, tested and finally installed on VPAC systems, where source and compiled software will reside by the end of the project. Source code will be available at the end of the project in commented form, and a tutorial and programmers’ guide will be published with the release of the software. Only an incomplete random sample of the de-identified data used to test the software for this project will be available at the end of the project, to facilitate its use and understanding.
SIGNIFICANCE OF THE PROJECT
We believe that our project would be a valuable opportunity since it offers new alternatives to
link research with an improved everyday use of information technology for modern
society.
We will now illustrate our view using an example relative to health system analysis.
The RECPAM approach can be described as a technique that allows extending the ordinary
regression approach by mixing fixed effects regression with recursive p
artitioning. By slight
abuse of terminology, we could say that it is a sort of ‘a
d
justed’ clustering with respect to a
target measure of outcome.
Such a feature is of fundamental importance in health research.
In the case of acute myocardial infarction (f
igure 2), for instance, we may be asked to solve a
particular question that is thought to be relevant, such as ‘optimising’ hospital beds on the
ground of ev
i
dence from health outcomes. Therefore, we can choose to use available data from
epidemiologic stud
ies to est
i
mate the needs of our population.
The RECPAM algorithm can provide a solution to stratify patients, identifying profiles
characterised by extremely different prognoses in terms of in-hospital mortality.
A particular step of the RECPAM approach, called amalgamation, finally identifies a total of six
classes: in the case of AMI, every patient will be allocated to a single class by asking a few
questions at admission in the emergency department.
This guideline may have an immediate impact on hospitalisation in terms of length of stay and
use of procedures. For instance, Class VI has a .010 mortality rate, as opposed to Class I, which
showed in our sample a mortality rate equal to .5. For a subgroup of Class VI (patients with a
low risk profile, identified by a Killip score equal to 1 and an age below 53 years), thrombolysis
would actually be increasing the risk, and it is unlikely that this subgroup of patients would really
benefit from the treatment.
At the same time, the average length of stay for this class was found to be less than 4 days.
This kind of modelling may be very effective at the system level, particularly when results are
derived from standard hospitals that may provide evidence of ‘good practice’, which can later be
extended to the entire system.
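As a rough illustration of the amalgamation idea described above, the following Python sketch
merges the terminal groups of a tree until a fixed number of classes remains. It is a simplified
stand-in, not the RECPAM algorithm: here groups are merged when their observed mortality rates
are closest, whereas RECPAM relies on its own statistical criterion. The leaf labels, the counts
and the function name `amalgamate` are hypothetical.

```python
def amalgamate(leaves, n_classes):
    """Greedily merge groups with the closest event rates until n_classes remain.

    leaves: dict mapping a leaf label to (deaths, patients).
    Returns a list of (deaths, patients, member_labels), ordered by increasing rate.
    """
    # Start with one group per leaf, ordered by observed mortality rate.
    groups = sorted(
        ((deaths, n, [label]) for label, (deaths, n) in leaves.items()),
        key=lambda g: g[0] / g[1],
    )
    while len(groups) > n_classes:
        # Gap in rate between each pair of neighbouring groups.
        gaps = [
            groups[i + 1][0] / groups[i + 1][1] - groups[i][0] / groups[i][1]
            for i in range(len(groups) - 1)
        ]
        i = gaps.index(min(gaps))
        a, b = groups[i], groups[i + 1]
        # The pooled rate lies between the two merged rates, so the order is kept.
        groups[i:i + 2] = [(a[0] + b[0], a[1] + b[1], a[2] + b[2])]
    return groups


# Hypothetical leaves: (deaths, patients); not taken from the study data.
example_leaves = {"A": (50, 100), "B": (48, 100), "C": (10, 100), "D": (5, 100)}
for deaths, n, members in amalgamate(example_leaves, 2):
    print(members, round(1000 * deaths / n), "per 1000")
```

With these made-up counts the four leaves collapse into one low-risk and one high-risk class,
which is the kind of short, ordered classification the text describes for AMI.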
[Figure 2: diagram not reproduced. Panels: Global Objective, Exploratory/Confirmatory Study
(e.g. Modelling Demand and Supply of services for Acute Myocardial Infarction, Prognostic
Stratification at Admission); Databases; Selection of Methods/Algorithms (e.g. RECPAM);
Classification (RECPAM classes I to VI, from a tree splitting on Killip class, age, infarct site
and SBP, with mortality rate x 1000 and 95% c.i. for each class and subgroup); Clinical
Guidelines; Population/Administrative Data (no. of incident cases, prevalence for each class,
patterns of primary, secondary and tertiary care, length of stay, geographical variability);
Health System.]
Figure 2. Knowledge Improvement Cycle in Health Services Research
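Figure 2 reports, for each class and subgroup, a mortality rate x 1000 with a 95% c.i. The
Python sketch below shows how such figures can be obtained from a pair of counts. It rests on
two assumptions that the proposal does not state: that the two counts printed at each node are
deaths and survivors, and that a normal-approximation (Wald) interval was used. Under those
assumptions, the node showing the counts 183 and 400 reproduces the printed value 314 (276-352).
The function name is hypothetical.

```python
import math


def mortality_per_1000(deaths, survivors, z=1.96):
    """Return (rate, lower, upper) per 1000 patients, using a Wald interval."""
    n = deaths + survivors
    p = deaths / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return tuple(round(1000 * x) for x in (p, p - half_width, p + half_width))


print(mortality_per_1000(183, 400))  # (314, 276, 352)
print(mortality_per_1000(226, 212))  # (516, 469, 563)
```

The second call matches the highest-risk node in the figure, 516 (469-563), which is
consistent with the mortality rate of about .5 quoted for Class I.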
MILESTONES
ReIGN will be developed at the Centre for Health Systems Research, Monash Institute of
Health Services Research.
The overall length is one year, corresponding to year 2002, which will be divided into four quarters
of three months each, corresponding to the following milestones:
First Quarter
Pre-study Investigators Meeting, to carry out a detailed plan of project actions
Seminar on data mining: illustrating the background of the project, including a public presentation of the research program
Publication of the conference materials and the project details on a thematic web site, linked to the VPAC and participating institutions
Appointment of the statistical programmer
Analysis of the RECPAM/SAS software
Second Quarter
Core programming work
Updates to statistical routines for multilevel modelling
Planning and development of routines for parallel and cluster computing
Implementation of a local ‘in-house’ parallel version on two stations
Debugging
Transfer and extensive testing on VPAC systems
Debugging VPAC version
Fitting
Benchmarking parallel version through applications
Third Quarter
Visit of Prof. Ciampi.
Revision of the status of the project
Debug of statistical procedures
Visit of Prof. Gibberd
Collaborative data analysis for the selected applications.
Final investigators meeting
Fourth Quarter
Last run on practical applications in health services research
Preparation and presentation of the results
Web publication of source code and applications
Preparation of articles and monograph
Submissions
OUTCOMES
ReIGN will result in the delivery of original software that will be made publicly available.
The software will be released as source and compiled code in the
SAS macro programming language, using at least the BASE, STAT, GRAPH, IML and MP
CONNECT modules and adding up to a total of more than 20,000 lines.
It is anticipated that the software will be copyright of F. Carinci and A. Ciampi and will be
distributed under the GPL (General Public License). It will then be published on the Monash web
site and will be available for public use and general research software development as well as
for applied work.
A manual will also be produced, and a tutorial will be included along with sample data to be
used for educational purposes.
Results will be published in specialised journals and presented at international conferences.
In summary, we expect the following outcomes:
Conference, presentations and papers
• One paper in an international statistical journal
• One paper in an international computational and programming journal
• At least two papers in international biomedical journals presenting applications in health
  services and outcomes research
• Presentations at international conferences on statistical computing, cluster programming
  and health services research
• A monograph on the REIGN approach, including:
  – a summary of the theoretical foundations of the RECPAM approach
  – advancements and technical solutions for the realisation of REIGN
  – applications
  – a detailed user’s and programmer’s guide.
Grants Submitted and/or Received
• Tendering to the Victorian Department of Human Services, NSW Health, Department of
  Health and Aged Care
• International collaboration with the European and North American partners working on
  health information systems
• Submitting the product to the attention of the WHO to develop new programs translating
  the REIGN approach into open source stand-alone programs to develop free health
  information systems for developing countries
Industry funding received
• Support from SAS Institute to transfer the prototype to the health sector internationally
Other
• Development and release of REIGN 1.0, a General Public License (GPL) open source SAS
  software, prototyping the RECPAM approach for cluster computing systems and intranet
  networks.
BUDGET
The overall budget, in accordance with the VPAC regulations, is below 100,000 AUD.
Expenses are expected for design, management and programming, for travel, and for relieving
the senior professorial researchers involved from academic activities, together with the extra
time they dedicate to the development of the research project.
Items                                                          Institution   Priority*   2002 ($)
(List all items individually, together with a                                (A, B, C)
justification for any non-personnel expenses)

Personnel
  SAS Statistical Programmer F/T                               MIHSR         A             68,000
    (to be appointed at MIHSR)
  Prof. A.Ciampi P/T: .10                                      McGill        A             11,500
    Statistical Planning (rate: $ 115,000)
  Prof. A.Gibberd P/T: .05                                     U.Newcastle   A              5,750
    Applications in Australian Health Care (rate: $ 115,000)

Equipment / Software (items costing more than $1,000 each)
  1 yr SAS license 2 PCs                                       MIHSR         A              2,000

Maintenance / Materials (items costing less than $1,000 each)
  1 Travel+Accommodation Prof. Ciampi                                        A              6,000
    (Montreal-Melbourne ret)
  1 Travel Prof. Carinci                                                     B              1,500
    (Melbourne-Newcastle ret)
  1 Travel+Accommodation Prof. Gibberd                                       B              1,500
    (Newcastle-Melbourne ret)

Total                                                                                      96,250
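The arithmetic behind the budget can be checked with a few lines of Python. This is only a
verification sketch: the amounts are copied from the table, the part-time fractions (.10 and .05)
are applied to the quoted rate of $115,000, and the variable names are illustrative.

```python
RATE = 115_000      # annual rate quoted for both professors (AUD)
CEILING = 100_000   # upper limit mentioned in the text (AUD)

budget = {
    "SAS Statistical Programmer F/T": 68_000,
    "Prof. Ciampi, 10% part time": RATE * 10 // 100,
    "Prof. Gibberd, 5% part time": RATE * 5 // 100,
    "1 yr SAS license, 2 PCs": 2_000,
    "Travel Prof. Ciampi (Montreal-Melbourne)": 6_000,
    "Travel Prof. Carinci (Melbourne-Newcastle)": 1_500,
    "Travel Prof. Gibberd (Newcastle-Melbourne)": 1_500,
}

total = sum(budget.values())
print(total, total < CEILING)  # 96250 True
```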