Parallel Regression Trees




Victorian Partnership for Advanced Computing


Expertise Program Grants Scheme

Round Three


1st January, 2002 to 31st December, 2002


PROJECT








Parallel Regression Trees

for Multilevel Information Systems and Data Mining



Fabrizio Carinci

Centre for Health Systems Research

Monash Institute of Health Services Research



Correspondence:


Monash Institute of Health Services Research

Level 1, Block E, Locked Bag 29

Monash Medical Centre

Clayton, Victoria 3168, Australia

Tel: +61 3 9594 7504
Fax: +61 3 9594 7554
Email: Fabrizio.Carinci@med.monash.edu.au
Web: www.med.monash.edu.au/healthservices/


REIGN
RECPAM Information Generation Network

CONTENTS


Aim
Background
Details of the basic research strategy
Study design
Significance of the project
Milestones
Outcomes
Budget
References


RECPAM Information Generation Network (REIGN):

Parallel Regression Trees for Multilevel Information Systems and Data Mining

Fabrizio Carinci and Antonio Ciampi


AIM


The aim of this project is to advance the methodology and software for the regression tree approach known as RECPAM (RECursive Partitioning and AMalgamation).

This is a one-year project in which we plan to design original software using parallel computing technology as a strategic solution to support research streams that aim to: a) advance the RECPAM model; and b) transform RECPAM into a powerful tool supporting the development of multilevel information systems.

The present proposal will focus on applications in health services and systems research, to identify efficient methods for rapidly extracting information from shared data and collaborative networks.

The central objective of our investigation is generating information from multiple sources; we therefore refer to the present proposal as “ReIGN”, an acronym for “RECPAM Information Generation Network”.


BACKGROUND


Tree-based techniques have become a familiar tool in data analysis. They are frequently used to support research through a better understanding of the relationships and possible causal mechanisms influencing the occurrence of outcomes in many different fields.

A tree structure offers a simple and attractive presentation of the information about a response variable contained in a set of predictors. A chain of statements concerning the predictors, one for each node, uniquely assigns an individual to a leaf, or terminal node, of the tree.
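The way a chain of statements assigns an individual to a leaf can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the predictor names (`age`, `killip`) and the cut-offs are hypothetical and are not taken from any RECPAM software.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    # Internal nodes carry a yes/no statement about the predictors;
    # leaves carry only a label (statement is None).
    label: str
    statement: Optional[Callable[[dict], bool]] = None
    yes: Optional["Node"] = None
    no: Optional["Node"] = None


def assign_leaf(node: Node, z: dict) -> str:
    """Follow the chain of statements from the root down to a leaf."""
    while node.statement is not None:
        node = node.yes if node.statement(z) else node.no
    return node.label


# Hypothetical two-level tree on two predictors.
tree = Node(
    label="root",
    statement=lambda z: z["age"] > 70,
    yes=Node(label="leaf A"),
    no=Node(
        label="inner",
        statement=lambda z: z["killip"] >= 2,
        yes=Node(label="leaf B"),
        no=Node(label="leaf C"),
    ),
)

print(assign_leaf(tree, {"age": 60, "killip": 1}))  # prints "leaf C"
```

Each individual follows exactly one path, so the leaves partition the population into non-overlapping subgroups.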

RECPAM [1-8] is a generalization of the well-known CART [9] algorithm, a technique that builds a prediction tree from data of the form (y,z), where y is a one-dimensional response variable and z is a vector of predictors. Within the CART framework, generalizations to multivariate responses have recently been developed, where y, even if multivariate, is still considered a response.


The RECPAM model


The RECPAM model, summarized in figure 1, was developed for data having the general structure (u,z), where u is a vector of variables not necessarily to be thought of as a 'response'. They may, in fact, consist of a mixture of random and non-random variables, corresponding, for example, to response and treatment.

In RECPAM, it is assumed that the distribution of u is at least partially specified, and that the object of the prediction is a parameter of this distribution, in general multidimensional. This parameter is termed the criterion, to emphasize that any predictor is assessed according to how well it predicts that parameter. For the same reason, the components of u are referred to as criterion variables.

Additional classes of parameters, defined as global, local and determinants, may be fitted using the RECPAM approach, by applying a broad range of regression models (e.g. proportional hazards and generalized regression models) to deliver a final model that is a mix of a traditional regression approach plus an additional set of informative parameters that may be defined by a tree structure.

Tree-growing in RECPAM proceeds through a well-established series of steps, including a binary search and partitioning algorithm, strategies for complexity reduction (pruning and amalgamation), validation (cross-validation) and final model specification (backward elimination).
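The binary search step can be illustrated with a minimal sketch for one continuous predictor. It assumes a least-squares splitting criterion, as in CART regression trees; RECPAM derives its criterion from the chosen statistical model, which is not reproduced here. The function names are hypothetical.

```python
from typing import List, Optional, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_binary_split(z: List[float], y: List[float]) -> Optional[Tuple[float, float]]:
    """Exhaustively search thresholds t for the split z <= t versus z > t.

    Returns (threshold, reduction in SSE), or None if z has one distinct value.
    """
    parent = sse(y)
    best = None
    # The largest value is skipped: it would leave the right branch empty.
    for threshold in sorted(set(z))[:-1]:
        left = [yi for zi, yi in zip(z, y) if zi <= threshold]
        right = [yi for zi, yi in zip(z, y) if zi > threshold]
        gain = parent - (sse(left) + sse(right))
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


print(best_binary_split([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0]))  # prints (2, 16.0)
```

Recursive partitioning repeats this search within each resulting branch; pruning, amalgamation and cross-validation then act on the fully grown tree.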

Development of the technique was inspired by situations frequently occurring in biostatistics.

Particularly in this field, using a tree structure can be very informative, since parameters will refer to subgroups of observations, such as individuals, doctors, hospitals or specific health episodes, that could be predicted by specific characteristics or conditions.

Adding a tree component to a traditional regression model could represent an ideal compromise, in which the investigator uses personal ‘skills and art’ in model fitting to improve accuracy by tuning the desired balance between the different components.

Because of its unique features and flexibility, the RECPAM approach has been adopted frequently in health services and outcomes research [10-15] as a criterion-driven solution for applying data mining to very large epidemiologic databases.

Such applications usually required consistency with the study design and so needed to be based on the definition of an a priori primary endpoint and the clear specification of the potential set of confounders to be included as predictors in the statistical model.


RECPAM software development


In the last 15 years, specialised RECPAM software has been implemented in Fortran [1,4,6], C++ [6-8], and SAS [16].

The SAS implementation was developed by F. Carinci to apply this technique to the exploration of very large epidemiologic databases: RECPAM/SAS has been applied in AIDS research at Harvard School of Public Health, in diabetes [13,15] and cardiovascular diseases [10,14] at Mario Negri Sud, and in oncology and social research at McGill University.

Overall, the software has proved to be flexible, stable and reliable, even when working on very large numbers of observations and variables, sharing the strengths of the SAS programming language.




Figure 1. The RECPAM approach: developing a SAS prototype



Open Questions


Despite the availability of a complete RECPAM implementation in SAS, the existing software does not address a number of open questions at both the methodological and the applied level, which we believe merit attention for further understanding, development and field testing.

In this proposal we will focus on aspects that rely on innovative approaches making use of parallel processing.

Our research questions belong to two different research streams: advancing the RECPAM model and using RECPAM as a tool for multilevel information systems.


Stream 1. Advancing the RECPAM model


1.1) Extending the range of applicable models: multilevel models


Estimating statistical models is useful on many occasions, providing a simple way to synthesise the results of analyses of value to decision makers.

Multilevel models are a specific class of statistical models that are ideal for applications in social research, since they allow parameters to be estimated at different levels in a hierarchy of nested structures.

For example, we could fit single parameters predicting the risk of individual patients while taking into account the correlation among patients belonging to the same doctor, or even fit parameters for the single doctors that then add a random component to the overall model. Using the same model, we could also explore patterns in repeated measurements on a single patient, where the correlation to be accounted for is not at the doctor level, but longitudinal at the patient level.
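As an illustration of the patient-within-doctor example, the simplest two-level model is a random-intercept model. The notation below is assumed for this sketch and is not taken from the RECPAM literature: $y_{ij}$ is the outcome of patient $i$ of doctor $j$, $x_{ij}$ a vector of patient covariates, and $u_j$ the doctor-level random component.

```latex
\begin{align}
  y_{ij} &= \beta_0 + \beta^{\top} x_{ij} + u_j + \varepsilon_{ij}, \\
  u_j &\sim N(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2), \\
  \rho &= \frac{\sigma_u^2}{\sigma_u^2 + \sigma_\varepsilon^2}.
\end{align}
```

Here $\rho$ is the intraclass correlation, i.e. the correlation between two patients of the same doctor. For a binary risk outcome the same structure would be placed on the scale of a link function such as the logit, and for repeated measurements the index $j$ would denote the patient and $i$ the occasion.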

The number of potential applications is extremely large, because multilevel models offer the flexibility to extend generalised regression models to multilevel structures.

How can the methodology of multilevel models be applied to regression trees?

The RECPAM approach is quite general, thus offering an ideal base for new solutions that are at the moment still generally lacking.


1.2) Addressing stability


Along with the increased practical use of recursive partitioning models have come doubts concerning the reliability and stability of the procedure. Methods like RECPAM may have a tendency to overfit the data; among the causes that have been identified are the unnatural dichotomization of continuous covariates and the dependence of the final tree structure, and of all subsequent operations, on a very small number of splits at the start of the recursive partitioning process.

RECPAM includes a cross-validation step that closely resembles the CART technique; however, recent developments in the area [17-20] need to be addressed and tested using more advanced approaches than those of the original work by Breiman et al.


1.3) Reducing the computational burden


The RECPAM approach is very general in nature: its computational burden is the price that has to be paid for the availability of a broad spectrum of applications. At the same time, the RECPAM model shares limitations that are typical of local search strategies, such as the number of permutations of nominal variables with a large number of categories, which quickly becomes computationally intractable. Since this is often the case in health services research, the identification and reliability of approximate solutions is an important point that should be tested in practice. We believe that parallel computing would be a crucial factor, since its impact on speed and performance would clearly be substantial in such applications.
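To make the growth concrete, the sketch below counts the distinct binary splits of an unordered nominal predictor with k categories, assuming that this is the quantity an exhaustive split search has to enumerate at a node. The count 2^(k-1) - 1 is a standard combinatorial result; the function name is hypothetical.

```python
def n_binary_splits(k: int) -> int:
    """Number of ways to divide k unordered categories into two non-empty groups."""
    if k < 2:
        return 0
    return 2 ** (k - 1) - 1


for k in (3, 10, 30, 100):
    print(k, n_binary_splits(k))
```

With 10 categories there are 511 candidate splits, with 30 categories more than five hundred million. A variable such as a hospital identifier can have far more categories than that, which is why approximate or parallel search strategies become necessary.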


Stream 2. Using RECPAM as a tool for multilevel information systems


A second stream follows work conducted by the chief investigator on health information systems.

F. Carinci has been working for the last five years on the development of original software to perform statistical analysis on administrative health databases, applying sophisticated approaches to different disease areas and exploratory multidimensional models.

In this framework, he has developed original approaches to link and analyse very large health databases, conceiving and implementing RISS (Reporting-by-Intranet Statistical System) [21-27].

The RISS software was inspired by a very simple logic, which consisted basically in making few or no changes at all to the original data. RISS was the result of an Italian research project sponsored by SAS Institute and SUN Microsystems; the project delivered its output as public domain software that was later adopted by different health departments and public health observatories in several Italian regions.

The system has been tested in practice and used to automate the production of web reports for various disease areas, using a flexible design that can be easily customized to meet different needs.


2.1) RISS and RECPAM together: matching the multilevel structure


Health information systems are increasingly based on multilevel structures, i.e. problems in which different levels by default interact with patients, influencing their outcomes within the broader framework of complex health systems.

Health systems are a perfect field of application for such approaches: analysts are frequently involved in processing data at multiple levels; typically, some archives are stored centrally as a very large or massive database, others are geographically distributed, and more detailed and fragmented files are managed by small organisational units or even single users.

From a different point of view, health information systems usually involve different levels of data ownership and data security, adding a complexity factor that affects the possibility of obtaining completely linked datasets at both the individual and the structural levels.

Data systems such as those described here are referred to as multilevel information systems.


The RISS system was inspired by a similar multilevel design and built to be highly customisable, so that it could be used to answer complex queries and obtain entirely customisable tables, graphs, and maps. But RISS was also meant to be extensible, in the sense of providing access to advanced multivariate methods.

In particular, RISS was ideal for writing modules linking population archives to administrative databases, extracting tables to build prediction models available through the SAS system, such as Generalized Linear Models (GENMOD) and Proportional Hazards (PHREG).


The availability of RECPAM routines linked to the RISS system would represent an exceptional opportunity to deliver supporting solutions for informed decision-making and the realization of evidence-based health policy.

As a matter of fact, tree-based structures are ideal for any population-based information system, since they use parameters such as risk estimates or correlation measures to isolate specific subgroups that can be targeted by specific actions.

However, the implementation of multilevel models, albeit essential and definitely possible, should be conceptualised to work in the framework of complex information systems: this issue is still largely unexplored, and so it would need to be exactly specified and tested in real-life situations.



2.2) RISS and RECPAM together: a distributed, meta-analytical approach


RISS (Reporting-by-Intranet Statistical System) is a SAS application package that was built over three consecutive years by five staff members, making extensive use of macros. The core program was then linked to HTML and JavaScript code, automatically producing interactive reports to be accessed locally and remotely through an intranet.


How can the RISS design and the RECPAM approach be made to work together on intranet networks?

RISS relies on distributed data and is highly decentralized, allowing the exchange of aggregate data only and ensuring a minimal transfer of information at the individual level.

Such a design, although offering some advantages, is not compatible with the current design of RECPAM, which needs to access all the available information from a complete database.


Different solutions for the RECPAM approach should be explored in order to align its potential use with intranet networks that would participate in the development of multilevel information systems while agreeing to transfer only a limited amount of information.


ReIGN as a RECPAM parallel solution for stream 1 and stream 2


The present proposal is based on the adoption of innovative methodology relying on a substantial re-design and development of RECPAM using a parallel processing approach.

The parallel approach is proposed not only because it is thought to be more efficient, but also because it closely resembles the reality of today’s world of Internet and intranet networks.

For this reason, a new project deploying software for regression trees in a parallel environment would represent an important step in supporting systems for the generation of real-time information.

RECPAM solutions for multiple processors or cluster computers would represent a substantial innovation in the landscape of data mining, particularly for applications in health services, where no comparable product seems to be as flexible or as specifically tuned to answering specific questions and convincingly supporting health policy.


Therefore, the objectives of the present proposal are:


1) to develop software that will advance the RECPAM approach through the following innovations:

- applicability of multilevel models
- enhanced stability of RECPAM trees
- reduced computational burden

2) to design software that could be embedded in multilevel information systems, characterized by the following points:

- multilevel structure
- access to and processing of distributed databases


The first point, corresponding to stream 1, will be studied in great detail during this one-year study. The ReIGN project will produce software of the same name, based on parallel computing.

The second point, corresponding to stream 2, is intended as a feasibility study. The ReIGN software developed to accomplish stream 1 will be compatible with these objectives, and the structure will be in place, although deployment of ReIGN as a fully functional information system will need additional resources or an extension beyond the expected length of one year.




DETAILS OF THE BASIC RESEARCH STRATEGY


To be successful, the ReIGN project should be able to innovate in the theoretical field as well as improve the existing numerical algorithms.


As outlined above, the innovations will explore the use of the system for two main purposes:


a) empowering the RECPAM approach to model complex relationships in very large databases

b) advancing the RECPAM approach to support multilevel information systems


With regard to the first stream, i.e. ReIGN as a means to empower the RECPAM approach, our proposal will focus on three fundamental points:


1. Regression trees for multilevel modelling

2. Improving techniques for validation and stabilization of RECPAM models

3. Reducing overall RECPAM computational burden


As for the first point, multilevel regression trees, a parallel environment would represent an ideal setting to simulate and evaluate innovative solutions for modern data warehouses.

The RECPAM approach is itself a generalisation of regression tree analysis and as such would offer the same interpretational advantages as hierarchical models, a situation that by definition is coherent with the application of binary trees.


There are many potential approaches to implementing multilevel models in RECPAM.

One design would be to extend the generalized linear model (GLIM), which is widely known and used for a single (time-independent) outcome variable. GLIM makes it possible to treat the most important types of simple outcomes (continuous, binary and, by extension, polytomous, and count) and to assess the effect of a covariate vector on the outcome. Several generalizations to time-dependent outcomes have been proposed [28]. The main distinctions are between marginal and conditional models and between fully and partially specified models, leading to likelihood-based and GEE-based parameter estimation, where GEE stands for generalized estimating equations. Experience shows that the most appropriate way of generalizing GLIM depends on both the scientific question underlying the analysis and the statistical properties of the estimate one wants to ensure, e.g. the bias-variance trade-off. Comparative studies of the various available approaches are highly desirable.
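For reference, the standard marginal GEE formulation can be written as follows. The notation is assumed for this sketch: $y_i$ is the vector of correlated outcomes in cluster $i$ (for example a subject measured over time), $\mu_i(\beta)$ its mean under a GLIM with link $g$, and $R_i(\alpha)$ a working correlation matrix.

```latex
\begin{align}
  g(\mu_{it}) &= x_{it}^{\top} \beta, \\
  \sum_{i=1}^{K} D_i^{\top} V_i^{-1} \bigl( y_i - \mu_i(\beta) \bigr) &= 0,
  \qquad D_i = \frac{\partial \mu_i}{\partial \beta}, \\
  V_i &= \phi \, A_i^{1/2} R_i(\alpha) A_i^{1/2}.
\end{align}
```

Here $A_i$ is the diagonal matrix of variance functions and $\phi$ a dispersion parameter. Only the mean and the working covariance are specified, which is what "partially specified" means above; choosing $R_i(\alpha)$, e.g. exchangeable or autoregressive, is how the grouping structure enters the model.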


A. Ciampi and F. Carinci have recently presented [11-12] a GEE-based extension of tree growing to correlated continuous and discrete outcome variables. Some 2-level systems can be satisfactorily treated by the GEE approach to correlated outcome variables, the multilevel feature being represented by an appropriate choice of the correlation matrix. Applications to family-individual systems, matched case-control studies and time-dependent outcomes were shown to be feasible.


A better understanding of the issues discussed above in connection with linear regression and multilevel systems is essential in order to develop tree-growing algorithms for multilevel systems. Here we only discuss an example of a situation not covered above, but extremely important in applications. Suppose we consider individuals on whom we measure both a time-dependent outcome and a number of time-dependent covariates. A GEE-tree structure can easily be constructed from the data, and to each leaf we may attach, say, a growth curve. With this straightforward construction, however, there is only one option: a leaf represents not a collection of subjects but a collection of subject-time units; also, the decision nodes involve values of the covariates at particular times. This means, for instance, that the same individual may, in time, ‘jump’ from one leaf to another: this happens whenever certain key covariates vary in such a way as to change the decision at some nodes. The alternative would be that a leaf represents a collection of subjects: this makes sense only if the decision nodes of the tree are based on the whole time development of the predictors.

To implement such an algorithm, a deeper analysis of the problem is required and it is necessary to consider decision nodes based on functional data.


Following the most recent ideas from L. Breiman [20], one of the most interesting developments in tree validation and stabilisation would involve using random outputs, so that a single RECPAM tree might be obtained as the result of growing RECPAM forests.

The construction of validated and stable regression trees can be dramatically improved by using parallel systems. When processing is restricted to sequential execution, growing n trees requires a time equal to the sum of all computations involved for all trees, plus the procedures needed to sort and transform individual outputs and merge the results into a unified model.

Using parallel processing, the required time should be approximately equal to the time of the most complex tree, since as computations for individual trees are completed, further steps can be performed while other trees are still growing.
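The independence of the individual trees is what makes this scheme parallel. The sketch below grows one-split trees ("stumps") on bootstrap samples and submits each as a separate task. It is a toy illustration under several assumptions: least-squares splitting stands in for the RECPAM criterion, the merge step is a plain average, and all function names are hypothetical. In CPython, threads show only the task structure; a real speed-up for this kind of computation would need separate processes or machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def grow_stump(seed, z, y):
    """Grow a one-split tree on a bootstrap sample.

    Returns (threshold, mean of y on the left, mean of y on the right).
    """
    rng = random.Random(seed)
    idx = [rng.randrange(len(z)) for _ in range(len(z))]
    zb = [z[i] for i in idx]
    yb = [y[i] for i in idx]
    best = None
    for t in sorted(set(zb))[:-1]:
        left = [v for u, v in zip(zb, yb) if u <= t]
        right = [v for u, v in zip(zb, yb) if u > t]
        ml, mr = mean(left), mean(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # the bootstrap sample had a single distinct z value
        m = mean(yb)
        return (float("inf"), m, m)
    return best[1:]


def grow_forest(z, y, n_trees=20, workers=4):
    """Grow the trees as independent tasks; no tree depends on another."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda seed: grow_stump(seed, z, y), range(n_trees)))


def predict(forest, z_new):
    """Merge step: average the predictions of the individual trees."""
    return mean(left if z_new <= t else right for t, left, right in forest)


z = list(range(20))
y = [0.0] * 10 + [10.0] * 10
forest = grow_forest(z, y)
print(predict(forest, 0), predict(forest, 19))
```

Because no task waits on another, the elapsed time with enough workers is governed by the slowest tree plus the merge step, rather than by the sum over all trees.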

Parallel processing would also help in developing some fundamental innovations in regression tree methodology, such as globally optimised algorithms. In this case too, parallel processing would be fundamental to extend the search to a high-dimensional set of parameters, finding the best model at each stage of tree partitioning through the parallel estimation of a very large number of candidate models.


Because of its generality, the computational burden is still an issue in the application of RECPAM. The third point, reducing RECPAM execution time and improving its numerical limits, will improve the feasibility of tree growing in specific conditions, by designing improved algorithms and offering new options to apply approximate methods.

There are again many possible ideas that we have been considering in recent years, in terms of algorithms whose detailed explanation would fall beyond the scope of the present outline.

As an example, an interesting use of parallel capabilities would consist in dynamically allocating split search processes at every local node, so that problems of high numerical order, such as an extremely high number of combinations/permutations or the fitting of a large number of model parameters, would become feasible in critical applications where the limit is rapidly approached.

This is the case for very large databases of hospital episodes, where parameters should be fitted at the hospital level and the number of possible interaction terms quickly becomes intractably large.


With regard to the second stream, i.e. ReIGN as a RECPAM engine for multilevel information systems, our proposal will explore two main areas:

1. Meta-analytical generation of regression trees.

2. Linking ReIGN to the RISS model.


The first point, meta-analytical regression trees, is based on the recognition that parallel regression trees may be useful to generate information from collaborative networks because, while allowing the estimation of very general statistical models, they would not require centralized databases for parameter estimation. In other words, final estimates could be obtained directly by stacking models from a series of remote nodes.

Such a process closely resembles meta-analysis, a technique that has been extensively used for systematic reviews of clinical trials and became quite popular with the establishment of evidence-based medicine.
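The analogy with meta-analysis can be illustrated with the classical fixed-effect, inverse-variance pooling of a single parameter. This is a standard meta-analysis technique used here only as an example of combining aggregate results; the proposal does not specify how ReIGN would pool or stack models, and the function name and input format are hypothetical.

```python
from math import sqrt
from typing import List, Tuple


def pool_fixed_effect(estimates: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Pool (estimate, standard error) pairs reported by remote nodes.

    Each node sends only two numbers; no individual records are exchanged.
    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return pooled, sqrt(1.0 / total)


# Hypothetical estimates of the same parameter from three remote databases.
print(pool_fixed_effect([(0.80, 0.20), (1.10, 0.30), (0.95, 0.25)]))
```

More precise nodes receive more weight. Pooling whole tree structures rather than a single parameter is a harder problem, which is the subject of the approaches discussed next.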

Several ideas that have appeared recently in the specialised literature may be successfully applied to construct a meta-analytical RECPAM approach [18-20,29-32]. In particular, we are referring to born again trees [18], random forests [20] and stacked regression [32], mostly originally intended as a means to obtain more accurate and stable models.

A meta-tree would represent quite an interesting approach for health databases, a situation where ownership and privacy issues usually play a major role, hampering the applicability of traditional models that rely on access to individual data for the overall system.

There are very few theoretical solutions, and definitely no software available, able to deal with this degree of complexity, and we believe that the realization of ReIGN would represent quite a step forward in the field.


As for the second point, linking ReIGN to the RISS model, we would use a parallel environment to create a platform for the application of ReIGN in collaborative projects using intranet networks to connect databases. In the present study, we would explore how to run SAS sessions concurrently on different stations to create parallel trees. We would then define an approach that allows pooling parameters and parallel trees from different databases, addressing the possibility that regression models, and in particular multilevel models, could be fitted without using all the information available from centralized databases.

Clearly, the development of the software as a fully stand-alone information system would not be possible in the framework of the present project. However, we will test the feasibility of the project and identify possible strategies using distributed databases, represented by sample datasets assigned to distinct remote units.


STUDY DESIGN


In the first phase of this project, we will organize a brainstorming day and workshop with the team of researchers, including the chief investigators, Prof. Ciampi and Prof. Gibberd. This event will provide a presentation of the project, also allowing for a restricted meeting that will serve to define the study design in greater detail.

At the same time, Prof. Ciampi will visit Monash University for two weeks, and the team will refine the strategy related to the statistical theory as well as the outline for statistical programming using parallel computing.

Then, an in-depth analysis of the available RECPAM software will be conducted; core programming will be finalised in the following six months.


During the third quarter, the team will meet again to revise the software, identifying a list of applications on very large databases.

Thanks to Prof. Gibberd, who is a national authority in the analysis of health care databases in Australia, the ReIGN team will have the opportunity to approach typical problems in health systems research at the national level. Typical examples will include problems encountered in the analysis of administrative databases, as in the case of clinical indicators using inpatient episodes databases [33], or in large population surveys to support community prevention programs [34]. Such applications will serve to finalise stream 1 of our proposal.

We will then revise, debug and evaluate the software.

At the end of this step, we will assess the feasibility of adopting the software for distributed databases, using the same applications to approach stream 2. This phase will provide the team with indications for further development towards the use of ReIGN for multilevel information systems.

Finally, the ReIGN software will be released and results will be published in various forms.


Software specifications


Version 8 of SAS software (through its module SAS/CONNECT) now includes the Multi-Process Connect facility, which exploits the multi-processor capabilities of symmetric multiprocessor and massively parallel processor systems by allowing parallel processing of self-contained tasks and the coordination of all the results in the original SAS session. SAS now provides independent parallelism [35-39], which means that when there are no dependencies between tasks, separate SAS sessions may be run concurrently to complete the overall job.

Building upon the experience of RECPAM/SAS, we will use that know-how to re-engineer the software and reuse its functionalities in a parallel environment.

ReIGN will be made up of multiple SAS macros linked together and operating on multiple processors and/or multiple computers, thus sharing the weight of the computations involved in tree growing. Outputs will be available in HTML and PDF format, with tables and graphs in both low- and high-resolution graphics, so that results will be widely accessible.

Effort on the GUI will be minimized, although end users should be able to run the product, even with a restricted number of options. The product will be developed at Monash, tested and finally installed on VPAC systems, where source and compiled software will reside by the end of the project. Source code will be available at the end of the project in commented form, and a tutorial and programmers’ guide will be published with the release of the software. Only an incomplete random sample of the de-identified data used to test the software for this project will be available at the end of the project, to facilitate its use and understanding.


SIGNIFICANCE OF THE PROJECT


We believe that our project would be a valuable opportunity, since it offers new alternatives to link research with an improved everyday use of information technology for modern society.


We will now illustrate our view using an example relative to health system analysis.


The RECPAM approach can be described as a technique that allows extending the ordinary
regression approach by mixing fixed effects regression with recursive p
artitioning. By slight
abuse of terminology, we could say that it is a sort of ‘a
d
justed’ clustering with respect to a
target measure of outcome.

Such a feature is of fundamental importance in health research.

In the case of acute myocardial infarction (figure 2), for instance, we may be asked to solve a particular question that is thought to be relevant, such as ‘optimising’ hospital beds on the grounds of evidence from health outcomes. We can therefore choose to use available data from epidemiologic studies to estimate the needs of our population.

The RECPAM algorithm can provide a solution to stratify patients, identifying profiles characterised by extremely different prognoses in terms of in-hospital mortality.

A particular step of the RECPAM approach, called amalgamation, finally identifies a total of six classes: in the case of AMI, every patient can be allocated to a single class by asking a few questions at admission to the emergency department.

This guideline may have an immediate impact on hospitalisation in terms of length of stay and use of procedures. For instance, Class VI has a mortality rate of .010, as opposed to Class I, which showed a mortality rate of .5 in our sample. For a subgroup of Class VI (patients with a low risk profile, identified by a Killip score equal to 1 and an age below 53 years), thrombolysis would actually increase the risk, and it is unlikely that this subgroup of patients would really benefit from the treatment.

At the same time, the average length of stay for this class was found to be less than 4 days.

This kind of modelling may be very effective at the system level, particularly when results are derived from standard hospitals that may provide evidence of ‘good practice’, which can later be extended to the entire system.
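
The amalgamation step mentioned above, in which the subgroups found by the tree are combined into a small number of classes, can be pictured with the following sketch. It merges, one pair at a time, the two groups whose union costs the least in binomial deviance, until the requested number of classes remains. This is a simplified greedy scheme written for illustration under our own assumptions, not the RECPAM implementation, and the leaf labels and counts are invented.

```python
import math
from itertools import combinations


def loglik(deaths, n):
    """Binomial log-likelihood of a group at its observed death rate."""
    p = deaths / n
    if p in (0.0, 1.0):
        return 0.0
    return deaths * math.log(p) + (n - deaths) * math.log(1 - p)


def merge_loss(pair):
    """Deviance lost when two groups are forced to share one death rate."""
    (_, d1, n1), (_, d2, n2) = pair
    return 2.0 * (loglik(d1, n1) + loglik(d2, n2) - loglik(d1 + d2, n1 + n2))


def amalgamate(leaves, n_classes):
    """Greedily merge leaves {label: (deaths, n)} down to n_classes groups."""
    groups = [([label], d, n) for label, (d, n) in leaves.items()]
    while len(groups) > n_classes:
        a, b = min(combinations(groups, 2), key=merge_loss)
        groups = [g for g in groups if g is not a and g is not b]
        groups.append((a[0] + b[0], a[1] + b[1], a[2] + b[2]))
    # Highest mortality first, as with Class I in the text.
    return sorted(groups, key=lambda g: g[1] / g[2], reverse=True)


# Hypothetical terminal nodes of a tree: label -> (deaths, patients).
leaves = {"A": (50, 100), "B": (48, 100), "C": (10, 100),
          "D": (12, 100), "E": (1, 100)}
classes = amalgamate(leaves, n_classes=3)
for labels, deaths, n in classes:
    print(labels, round(1000 * deaths / n), "per 1000")
```

Leaves with nearly identical mortality (A with B, C with D) are joined first, so the final classes remain well separated in prognosis while being few enough to use as a simple admission guideline.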





[Figure 2. Knowledge Improvement Cycle in Health Services Research. The diagram links Databases (population/administrative data: number of incident cases, prevalence for each class, patterns of primary, secondary and tertiary care, length of stay, geographical variability), Selection of Methods/Algorithms (e.g. RECPAM), Classification, Clinical Guidelines and the Health System around a Global Objective, namely an exploratory/confirmatory study (e.g. modelling demand and supply of services for acute myocardial infarction, prognostic stratification at admission). The embedded RECPAM tree splits on Killip class, age, site of infarction and systolic blood pressure (SBP), and reports, for each of the six RECPAM classes (I to VI), the defining subgroups and the mortality rate x 1000 with 95% c.i.]




MILESTONES


ReIGN will be developed at the Centre for Health Systems Research, Monash Institute of
Health Services Research.

The project will run for one year, corresponding to 2002, divided into four quarters of three months each, with the following milestones:


First Quarter




- Pre-study Investigators Meeting, to carry out a detailed plan of project actions
- Seminar on data mining: illustrating the background of the project, including a public presentation of the research program
- Publication of the conference materials and the project details on a thematic web site, linked to the VPAC and participating institutions
- Appointment of the statistical programmer
- Analysis of the RECPAM/SAS software


Second Quarter




- Core programming work
- Updates to statistical routines for multilevel modelling
- Planning and development of routines for parallel and cluster computing
- Implementation of a local ‘in-house’ parallel version on two stations
- Debugging
- Transfer and extensive testing on VPAC systems
- Debugging VPAC version
- Fitting
- Benchmarking parallel version through applications


Third Quarter




- Visit of Prof. Ciampi
- Review of the status of the project
- Debugging of statistical procedures
- Visit of Prof. Gibberd
- Collaborative data analysis for the selected applications
- Final investigators meeting


Fourth Quarter




- Last run on practical applications in health services research
- Preparation and presentation of the results
- Web publication of source code and applications
- Preparation of articles and monograph
- Submissions






OUTCOMES


ReIGN will result in the delivery of original software that will be freely available to the public. The software will be released as source and compiled code in the SAS macro programming language, using at least the BASE, STAT, GRAPH, IML and MP CONNECT modules and adding up to a total of more than 20,000 lines.

It is anticipated that the software will be copyright of F. Carinci and A. Ciampi and will be released under the GPL (General Public License). It will then be published on the Monash web site and will be available for public use and general research software development, as well as for applied work.

A manual will also be produced, and a tutorial will be included along with sample data to be used for educational purposes.

Results will be published in specialised journals and presented at international conferences.


In summary, we expect the following outcomes:


Conferences, presentations and papers

- One paper in an international statistical journal
- One paper in an international computational and programming journal
- At least two papers in international biomedical journals presenting applications in health services and outcomes research
- Presentations at international conferences on statistical computing, cluster programming and health services research
- A monograph on the REIGN approach, including:
  - a summary of the theoretical foundations of the RECPAM approach
  - advancements and technical solutions for the realisation of REIGN
  - applications
  - a detailed user’s and programmer’s guide.


Grants Submitted and/or Received

- Tendering to the Victorian Department of Human Services, NSW Health, Department of Health and Aged Care
- International collaboration with the European and North American partners working on health information systems
- Submitting the product to the attention of the WHO to develop new programs translating the REIGN approach into open source stand-alone programs, to develop free health information systems for developing countries


Industry funding received

- Support from SAS Institute to transfer the prototype to the health sector internationally


Other

- Development and release of REIGN 1.0, a General Public License (GPL) open source SAS software, prototyping the RECPAM approach for cluster computing systems and intranet networks.





BUDGET


The overall budget, in accordance with the VPAC regulations, is below 100,000 AUD.

Expenses are expected for design, management and computer programming, for travel, and for relieving the senior researchers involved from academic activities and for the extra time dedicated to the development of the research project.


Items (list all items individually, together with a justification for any non-personnel expenses)

Item                                                              Institution   Priority*   Year 2002
                                                                                (A, B, C)   ($)

Personnel
  SAS Statistical Programmer F/T (to be appointed at MIHSR)       MIHSR         A           68,000
  Prof. A. Ciampi P/T: .10
    Statistical Planning (rate: $115,000)                         McGill        A           11,500
  Prof. A. Gibberd P/T: .05
    Applications in Australian Health Care (rate: $115,000)       U. Newcastle  A            5,750

Equipment / Software (items costing more than $1,000 each)
  1 yr SAS license 2 PCs                                          MIHSR         A            2,000

Maintenance / Materials (items costing less than $1,000 each)
  1 Travel+Accommodation Prof. Ciampi (Montreal-Melbourne ret)                  A            6,000
  1 Travel Prof. Carinci (Melbourne-Newcastle ret)                              B            1,500
  1 Travel+Accommodation Prof. Gibberd (Newcastle-Melbourne ret)                B            1,500

Total                                                                                       96,250
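
The figures in the budget can be cross-checked with a few lines of arithmetic. The short Python sketch below recomputes the two part-time salaries from the stated rate and fractions, and sums the items overall and by priority; the variable names are ours, and the amounts are copied from the table above.

```python
RATE = 115_000  # annual rate quoted for the two professorial positions ($)

# Part-time salaries: 10% and 5% of the annual rate (integer arithmetic).
ciampi_salary = RATE * 10 // 100
gibberd_salary = RATE * 5 // 100

# (amount in $, priority) for every item in the table.
items = {
    "SAS statistical programmer F/T": (68_000, "A"),
    "Prof. Ciampi P/T .10": (ciampi_salary, "A"),
    "Prof. Gibberd P/T .05": (gibberd_salary, "A"),
    "1 yr SAS license, 2 PCs": (2_000, "A"),
    "Travel+accommodation Prof. Ciampi": (6_000, "A"),
    "Travel Prof. Carinci": (1_500, "B"),
    "Travel+accommodation Prof. Gibberd": (1_500, "B"),
}

total = sum(amount for amount, _ in items.values())
by_priority = {}
for amount, priority in items.values():
    by_priority[priority] = by_priority.get(priority, 0) + amount

print(ciampi_salary, gibberd_salary)  # 11500 5750
print(total)                          # 96250
print(by_priority)                    # {'A': 93250, 'B': 3000}
```

The total agrees with the stated 96,250 AUD and remains below the 100,000 AUD ceiling mentioned above.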


REFERENCES


1. Ciampi A., Hogg S., McKinney S., Thiffault J., RECPAM: A Computer Program for Recursive Partition and Amalgamation for Censored Survival Data and other situations frequently occurring in biostatistics, I. Methods and Program Features, Computer Methods and Programs in Biomedicine, 1988; 26, 239-256.

1. Ciampi A., Lawless F., McKinney M., Singhal K., Regression and Recursive Partition Strategies in the analysis of medical Survival Data, Journal of Clinical Epidemiology, 1988; 41 (8), 737-748.

2. Ciampi A., Thiffault J., Sagman U., RECPAM: A Computer Program for Recursive Partition and Amalgamation for Censored Survival Data and other situations frequently occurring in biostatistics, II. Applications to data on small cell carcinoma of the lung (SCCL), Computer Methods and Programs in Biomedicine, 1989; 30, 283-296.

3. Ciampi A., Thiffault J., Pruning Regression Trees for censored survival data: the RECPAM approach, Communications in Statistics - Theory and Methods, 1989; 18 (9), 3373-3388.

4. Ciampi A., duBerger R., Taylor H., Thiffault J., RECPAM: A Computer Program for Recursive Partition and Amalgamation for Censored Survival Data and other situations frequently occurring in biostatistics, III. Classification according to a multivariate construct. Applications to data on Haemophilus influenzae type b meningitis, Computer Methods and Programs in Biomedicine, 1991; 36, 51-61.

5. Ciampi A., Generalized Regression Trees, Computational Statistics and Data Analysis, 1991; 12, 57-78.

6. Ciampi A., Constructing Prediction Trees from Data: the RECPAM approach, in Computational Aspects of Model Choice, 105-152. Physica-Verlag, Heidelberg, 1992.

7. Ciampi A., Hendricks L., Lou Z., Discriminant Analysis for mixed variables: integrating trees and regression models, in Multivariate Analysis: future directions, 2, Cuadras C.M. and Rao C.R. eds., Elsevier Science, 1993.

8. Ciampi A., Negassa A., Lou Z., Tree-structured prediction for censored survival data and the Cox model, Journal of Clinical Epidemiology, 1995; 48 (5), 675-689.

9. Breiman L., Friedman J., Olshen R. and Stone C., Classification and Regression Trees, CRC Press, Berkeley, 1984.

10. Carinci F., Nicolucci A., Ciampi A., Labbrozzi, Bettinardi O., Zotti A.M., Tognoni G. on behalf of the GISSI Investigators, Role of interactions between psychological and clinical factors in determining 6-month mortality among patients with acute myocardial infarction. Application of recursive partitioning techniques to the GISSI-2 data-base. European Heart Journal, 1997; 18, 835.

11. Ciampi A., Carinci F., Couturier A. and Infante Rivard C., GEE Regression Trees: the RECPAM approach. Application to logistic regression for matched case-control studies, The 19th Conference of the International Society for Clinical Biostatistics, Dundee, Scotland, 1998.

12. Ciampi A., Carinci F., Couturier A. and Infante Rivard C., GEE Regression Trees: the RECPAM approach, Trees for correlated outcome variables, The 19th Conference of the International Society for Clinical Biostatistics, Dundee, Scotland, 1998.

13. Nicolucci A., Carinci F., Ciampi A., Stratifying patients at risk of diabetic complications. An integrated look at clinical, socio-economic and care-related factors. Diabetes Care, 1998; 21, 1439.

14. Fresco C., Carinci F., Maggioni A.P., Ciampi A., Nicolucci A., Santoro E., Tavazzi L. and Tognoni G., on behalf of GISSI Investigators, Very early assessment of the risk of in-hospital death in 11,483 patients with acute myocardial infarction, American Heart Journal, 1999; 138 (6 Pt 1), 1058-64.


15. Carinci F, Nicolucci A, Pellegrini F, Regression trees in health services and outcomes research: an application of the RECPAM approach using quality of care as a criterion, Technical Report, Monash Institute of Health Services Research, Monash University, Melbourne, 2001.

16. Carinci F, Pellegrini F, RECPAM/SAS (Recursive Partitioning and Amalgamation): a statistical tool for criterion-driven data-mining, Technical Report, Monash Institute of Health Services Research, Monash University, Melbourne, 2001.

17. Dannegger F, Tree stability diagnostics and some remedies for instability, Statist Med, 2000; 19: 475-491.

18. Breiman L, Shang N. Born again trees, Technical Report, 1996, University of California, Berkeley, ftp://ftp.stat.berkeley.edu/pub/users/breiman/BAtrees.ps, accessed 16/9/2001.

19. Breiman L., Bagging Predictors, Machine Learning, 1996; 24: 123-140.

20. Breiman L., Random Forests - Random Features, Technical Report 567, 1999, University of California, Berkeley, ftp://ftp.stat.berkeley.edu/pub/users/breiman/randomforests.html, accessed 16/9/2001.

21. Carinci F, Health Services Epidemiology in Diabetes. From outcomes research to population-based health planning: the integrated approach of RISS system, http://statbone.cmns.mnegri.it/software/rissq_en.html, accessed 4 July 2001.

22. Carinci F, Averaging and Profiling SF-36+various Clinical Characteristics. The Report List Procedure, http://statbone.cmns.mnegri.it/Samples/qued/Reports/RL_SF36_per_Centro[RL1SF36].HTML, accessed 4 July 2001.

23. Carinci F, Averaging and Profiling SF-36+various Clinical Characteristics. The Report Index Procedure, http://statbone.cmns.mnegri.it/software/Samples/qued/Reports/RI_SF36_per_Centro[RI1SF36].HTML, accessed 4 July 2001.

24. Carinci F, RISS SAMPLES. Integrating Multicentric Outcomes Research with everyday practice, http://statbone.cmns.mnegri.it/software/riss_samples.html, accessed 4 July 2001.

25. Carinci F., Health Services Epidemiology in Diabetes. From health care research to public health: the integrated approach of the RISS system, http://statbone.cmns.mnegri.it/software/rissh_en.html, accessed 4 July 2001.

26. Churches T, Carinci F, Open source at the interface between policy and academia: towards evidence-based information systems, 4th International Conference on the Scientific Basis of Health Services Research, Sydney, 22-25 September 2001.

27. Carinci F, Corrado D, Dettorre A, Pellegrini F, A multilevel approach to health systems analysis using RISS (Reporting-by-Intranet Stat System), 4th International Conference on the Scientific Basis of Health Services Research, Sydney, 22-25 September 2001.

28. Diggle PJ, Liang KY, and Zeger SL. The analysis of Longitudinal Data. Oxford, England: Oxford University Press, 1994.

29. Shannon W, Banks D. Combining classification trees using MLE, Statist Med, 1999; 18, 727-740.

30. LeBlanc M, Tibshirani R. Combining estimates in regression and classification, JASA, 1996; 91 (436), 1641-1650.

31. Oliver J, Hand D. Averaging over Decision Trees, Journal of Classification, 1996; 13: 281-297.

32. Breiman L. Stacked regressions, Machine Learning, 1996; 24: 49.

33. Gibberd R, Pathmeswaran A, Burtenshaw K, Using clinical indicators to identify areas for quality improvement, J Qual Clin Practice, 2000; 20, 136-144.

34. Hancock L, Sanson-Fisher R, Perkins J, McClintock A, Howley P, Gibberd R, Effect of a Community action Program in Rural Australian Towns: The CART Project, Preventive Medicine, 2001; 32, 118-127.


35. Garner C. Multiprocessing with Version 8 of the SAS System, Paper 16, Proceedings of the 25th SAS Users Group International, Indianapolis, Indiana, 2000, http://www.sas.com/usergroups/sugi/sugi25/25p016.pdf, accessed 16/9/2001.

36. Bentley J. SAS Multi-Process Connect: What, When, Where, How, and Why, Paper 269, Proceedings of the 26th SAS Users Group International, Long Beach, California, 2001, http://www2.sas.com/proceedings/sugi26/p269-26.pdf, accessed 16/9/2001.

37. Bentley J. An Introduction to Parallel Computing, Paper 283, Proceedings of the 25th SAS Users Group International, Indianapolis, Indiana, 2000, http://www2.sas.com/proceedings/sugi25/25/sy/25p283.pdf, accessed 16/9/2001.

38. Doninger C. The %Distribute System for Large-Scale Parallel Computation in the SAS System, Presentation at the 26th SAS Users Group International, Long Beach, California, 2001, http://www.sas.com/rnd/app/papers/distConnect.pdf, accessed 16/9/2001.

39. Olsen K, West JT. SAS Software and the Performance Effects of Parallel Architectures, Proceedings of the 24th SAS Users Group International, Miami Beach, Florida, 1999, http://www2.sas.com/proceedings/sugi24/Sysarch/p290-24.pdf, accessed 16/9/2001.