Rough Notes on my experience with Microsoft Amalga UIS and ...

splashburger, Internet and Web Development

22 Oct. 2013

I recently had to watch my mom die of cancer. Like many who have
to see the same thing and have the same experiences, I find it
difficult to put into words the confusion, anger and frustration
with a disease so unrelenting. At the same time, I also watched
the 'waste' and ridiculous, near-fraudulent behavior surrounding
the implementation of Microsoft Amalga at the University of
Washington Medical Center.

During the last months of her life I was an employee of the
University of Washington Medical Center working at Harborview
(the primary trauma center serving the Pacific Northwest and the
'county hospital' for King County). As an employee at the UW,
from July 2010 to May 2011, I worked directly and indirectly
with a system called Microsoft Amalga.

You might not know what Microsoft Amalga is. Frankly, given the
number of re-brandings of this system in the last few years, I
doubt Microsoft knows what it is. While interviewing at
Microsoft recently, I received puzzled stares when I mentioned
it. It seems that, other than the marketing pabulum available on
the web, no one really knows much about this system. What is
Microsoft Amalga? Microsoft would describe it as a clinical data
repository which supports the querying and analysis of patient
data. Basically, it is a unified data warehouse for the health
care system.

What is Microsoft Amalga? It is a disaster and a waste of ...

To say it has flaws is to seem trite. What are its main ...

A parser/ETL system that is more complex (complex being bad
in this case) than any Rube Goldberg device.

A set of sub-standard HL7 APIs.

A group of 'admin tools' which are badly designed from a
user interface perspective and badly performing from an
engineering perspective.

A serialization model that is essentially like calling a
group of spreadsheets a database.

There is an assumption, amongst the political class in this
country, that someone who self-identifies as 'liberal', and is
in public service, must have a 'sense of right and wrong', a
calling of public service and stewardship of the people's
money. My experiences at the University of Washington since July
of last year (2010) have gone a long way toward showing this to
be FALSE. This would be 'expected' by many, but this was health
care, this was work where people's lives depend upon the UW
making wise decisions, not just about care, but also about the
way in which money is spent. Having watched my mom die recently
of metastatic cancer and seeing how few financial resources
she had towards the end, it made me sick to know that even ONE
DIME of my money went to Microsoft for the abomination called
Amalga.
I would like to believe that this was an exception. In February
of 2011 I switched jobs to a grant based role in the Dept. of
Biomedical and Healthcare Informatics, working on a project
called CBR. A central tool that we 'must use' was i2b2, a
clinical informatics tool developed by Harvard University to
allow researchers to do 'de-identified' queries (sometimes
referred to as 'noodling'). There are many features of i2b2 that
are flawed, few that make much sense and a general model of
construction that aligns with NOTHING I learned in any of my
computer science classes. What's worse, it does not even look
like a system that could survive outside of publicly funded ...
The following essay is a remembrance, a call to arms and a
'hope' that someone might make the right decision with respect
to one or both of these systems before they spend precious time
and money on them. I am assuming that most are aware of the
constraints of current budgets and the tough decisions ordinary
Americans (like my Mom for instance) are having to make every
day. ... wrong, review both the Amalga and i2b2 systems with
expert computer scientists and engineers in the fields of data
warehousing and data storage. These are two very different
systems (i2b2 and Amalga), but they both have one thing in
common: it is doubtful they would pass a consistent review or
set of tests for scalability, usability, efficiency, TCO (total
cost of ownership) or accuracy/audit worthiness. While it is
true that i2b2 is 'free', this means less when you consider the
labor cost of fixing and supporting their equally bad ...


Most of the men and women I have worked with here at the UW
(with the exception of some management) have been hard
working, intelligent and critical assets to the UW. In my
view, the UW is lucky to have them and it is sad that they
spend so much time keeping Amalga from crashing (i2b2 isn't
really in the same boat as far as criticality; if it were
expected to be 'up' all the time then I suspect the UW
would need to invest as much as Harvard has had to, to keep
it from blowing up).

I don't claim to know everything. In fact, I think what is
troubling here is not simply my 'opinions' but the obvious
absence of any structured, consistent review of either
system. I don't know WHAT the CDROC committee does, but
clearly it is having no impact upon Amalga's quality and
delivery. I was told, when I was hired, that Amalga had
been tested/evaluated. This seems disingenuous. The best
part is you DON'T need to take my word for it: go beyond
the Amalga marketing materials and do your own discovery.
Something I wish the UWMC had done.


Though I can draw a distinction between the direct payment
of money to Microsoft versus the indirect costs of working
with i2b2, I don't think i2b2 should be off the hook simply
because it seems 'free'. Nothing is free if you must
expend unreasonable resources in order to support it.

Section 1: Microsoft Amalga

High Level: Amalga Design Flaws



Amalga comes with a few data models built in. Some of these
models are rather naively based on HL7. HL7 is an EDI
messaging format. Message formats are supposed to be ...
Data warehouses (or clinical repositories or data aggregators
or whatever the Microsoft marketing folks are calling this
terrible system now) ought to store data accurately,
efficiently and with as little redundancy as is possible.
Because of the way azADT is designed, there are MANY fields
which NEVER get populated.


All Amalga primary keys (clustered indexes) are
NON-SEQUENTIAL random strings. I say random; they are really
a 'hash' of actually meaningful information like MRN (Medical
Record Number), date of service, and proprietary system
episode or visit keys. This use of a 'non-sequential' string
key in a high throughput system hammers SANs and is very
expensive. One estimate I performed before resigning showed
that 50% of all unique string tokens in the UWMC data
warehouse were composed of Amalga IDs.
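
As a rough illustration of why non-sequential keys are hard on a
clustered index, here is a minimal Python sketch. It counts how
many inserts land somewhere other than the end of a sorted key
list, which is only a crude proxy for the page touches and splits
a real database would incur. The `hashed_key` function is a
hypothetical stand-in (SHA-1 over MRN, date and visit number); the
real Amalga key algorithm is not described in these notes.

```python
import bisect
import hashlib


def mid_index_inserts(keys):
    """Count inserts that land before the current end of a sorted list.

    An insert that is not an append would force a clustered index to
    touch (and possibly split) an existing page. This is a rough proxy.
    """
    ordered = []
    count = 0
    for key in keys:
        pos = bisect.bisect_left(ordered, key)
        if pos < len(ordered):
            count += 1
        ordered.insert(pos, key)
    return count


def hashed_key(mrn, service_date, visit_no):
    """Hypothetical hash-style ID; NOT the real Amalga algorithm."""
    raw = f"{mrn}|{service_date}|{visit_no}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


n = 1000
sequential_keys = list(range(n))  # e.g. an identity/sequence column
hashed_keys = [hashed_key(f"U{i:07d}", "2010-09-02", i) for i in range(n)]

sequential_mid = mid_index_inserts(sequential_keys)
hashed_mid = mid_index_inserts(hashed_keys)
print("sequential keys, non-append inserts:", sequential_mid)
print("hashed keys, non-append inserts:", hashed_mid)
```

With sequential keys every insert is an append; with hash-like
keys nearly every insert lands in the middle of the existing key
range.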


The 'flexible' part of Amalga's data model amounts to
database building by spreadsheet. Because there is very
little or no sound data design in Amalga, the expansion of
databases usually amounts to importing large, unwieldy and
non-dimensionalized data.


Amalga, because of the way the azAEID database is designed,
is very difficult if not IMPOSSIBLE to federate. What does
this mean? It means that Amalga, to quote a former
'manager', really is just a 'giant garbage can for data'.
The core tables get bigger and bigger and there is no
rational or feasible way to archive data.


The HL7 parsers and parser development are a joke. The
tools which support HL7 interface work are buggy and in
general do not live up to expectations. Someone told me
that Microsoft marketers sell Amalga as a system where the
HL7 interfaces can be developed in 4 hours. I don't know
in what bizarre universe this is true, but the estimate I
was given initially when I started on the team was 80
hours, not 4. I am very productive and I was working on an
e-referral interface. This interface took at least 80
hours to complete (not including testing). When you have a
problem building an interface in Amalga, you pretty much
have to start from scratch. There are bugs in the SEE
(Script Engine Explorer: the tool used to help build these
interfaces), which crashes often.


Amalga was almost ALWAYS in a failure state when I was
there. During the winter we had serious problems with
replication and missing data. Microsoft 'support' was
difficult to find. I suppose if you want to buy an
enterprise system and get very little help, then buy
Amalga; you can feel left out in the cold.


The Amalga 'connector' does not work as advertised.
Their materials state that this connector can work
with LDAP/AD; none of this is true.


I don't believe Amalga was ever properly tested prior
to being deployed at the UWMC. I was told they had 'vetted'
the system in the year and a half before my arrival. I
don't know what was evaluated, but it WAS NOT Amalga's
ability to reliably ingest and store large volumes of data.


I proved that IF you moved from the 'post relational'
Amalga way to a Kimball model, you would reduce your data
footprint by roughly 50%. If you further broke free of the
"Amalga ID" this reduction MIGHT be as big as 70%. What
does this mean? One estimate for DB warehousing drive needs
for the next year or so at UWMC for Amalga storage was 258
TB (to provide the uninitiated with a frame of reference,
in 2005 I worked for a company whose database had HALF this
much data in it, and at the time it was one of the largest
SQL Server databases in the USA). Drives, energy and computer
equipment are NOT getting cheaper; they are actually
beginning to respond to the same inflationary pressures
that exist in the wider economy. To shrug off a 50%
reduction in hardware costs seems reckless and not in
keeping with the stewardship of the people's money.

Microsoft Amalga conforms to NONE of the standard features you
would expect to find in a contemporary, high volume, large
scale data warehouse. It does NOT have dimensions. Data in
Amalga (the exception, of course, being the core azyxxi HL7
databases and tables, which are BAD for their own reasons) is
stored as a collection of spreadsheets. The data is redundant
and could be improved by adopting standard approaches to data
storage. As such, Amalga has VERY prohibitive features from a
TCO (total cost of ownership) perspective. See below, an
estimate for JUST the labor costs for managing 7500 patients in
Amalga for 1 year:

ADO.NET connector does not work, see below:

Let's say you have 1,000,000 patients per year. Assuming a
roughly 9K cost per 7500 patients (not including servers and
Amalga licenses), this is a variable cost of 1.2 million dollars
per year for just the maintenance. Add in the fixed costs of
licenses and servers and the REAL cost of Amalga for a 1 million
patient a year enterprise is closer to 10-15 million dollars for
the first 5 years... This may not be the MOST expensive
solution, but given how unwieldy and under-performing the system
is, it also does not seem like a bargain.
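
The arithmetic above can be checked with a short Python sketch. It
scales the quoted figure of roughly $9K per 7,500 patients
linearly, which is an assumption, and the "implied fixed cost"
range is derived here by subtraction from the 10-15 million total;
it is not a figure stated in these notes.

```python
def annual_variable_cost(patients_per_year, cost_per_block=9_000, block_size=7_500):
    """Yearly labor/maintenance cost, scaled linearly from ~$9K per 7,500 patients."""
    return patients_per_year * cost_per_block / block_size


patients = 1_000_000
variable_per_year = annual_variable_cost(patients)
variable_five_years = 5 * variable_per_year

# Derived by subtraction from the quoted 5-year total of $10M-$15M
# (an inference for illustration, not a number from the text).
implied_fixed_low = 10_000_000 - variable_five_years
implied_fixed_high = 15_000_000 - variable_five_years

print(f"variable cost per year: ${variable_per_year:,.0f}")
print(f"variable cost over 5 years: ${variable_five_years:,.0f}")
print(f"implied fixed costs: ${implied_fixed_low:,.0f} to ${implied_fixed_high:,.0f}")
```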

The HL7 parsing is NO substitute for Cloverleaf. Cloverleaf is
not perfect, but at least it has the right to call itself an
HL7 integration engine. I estimated, several months ago, that
the true cost of building parsers in Amalga IS NOT the 4 hours
they tell future customers. In fact, it is non-deterministic.
The SEE (Script Engine Explorer) is SO buggy that you are often
faced with doing the same tasks OVER and OVER again. God forbid
you need to update a parser; you are better off starting from
scratch.

We were told, in October (by Microsoft) that with the 'next
version' building HL7 interfaces would be 'easier'. See the
comments below from someone who had to work with the new system:

Amalga is NOT federated. What does this mean? It means that
there is no way to archive off or split the clinical repository
into smaller sub groups. Normally, in large scale data
warehouses, people choose MEANINGFUL partition points in the
data space. Logically, in the hospital context, Facility,
Location and Date of Service are LOGICAL and semantically
MEANINGFUL ways of breaking the very large monolithic databases
into smaller units. Microsoft's advice to our management was
'buy another license'... This is a really great solution

Amalga, because of its design, is almost not capable of being
audited. This may seem like a slight concern to some, but to
ANYONE familiar with the risks and opportunities of healthcare
informatics, this is not a good feature. In a system with
conforming dimensions, it is possible to ask existential
questions LONG before you need to query core facts. For
instance, you can ask what procedure codes exist or what clinics
exist without writing a very inefficient query against a 'post
relational' table. Because of Amalga's design, there is
NO easy auditing query. One of my reasons for leaving the UW
Amalga team was that I felt Microsoft had a responsibility to
assist us in ensuring data quality by designing a system which
allowed for these checks.
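
To make the auditing point concrete, here is a minimal sketch
using Python's built-in `sqlite3` module. The table names
(`dim_procedure`, `fact_visit`) and the codes are hypothetical and
have nothing to do with Amalga's actual schema; the sketch only
shows the kind of cheap "what exists?" and "what is orphaned?"
queries a dimensional model allows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_procedure (
        procedure_key INTEGER PRIMARY KEY,
        code TEXT UNIQUE
    );
    CREATE TABLE fact_visit (
        visit_key INTEGER PRIMARY KEY,
        procedure_key INTEGER
    );
    -- Made-up codes for illustration only.
    INSERT INTO dim_procedure (procedure_key, code)
        VALUES (1, 'P001'), (2, 'P002');
    -- Visit 13 points at a procedure key with no dimension row.
    INSERT INTO fact_visit (visit_key, procedure_key)
        VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

# Existential question: which procedure codes exist?
# Answered from the small dimension table, never touching the facts.
codes = [row[0] for row in conn.execute(
    "SELECT code FROM dim_procedure ORDER BY code")]

# Audit question: which facts reference a procedure that is not defined?
orphans = [row[0] for row in conn.execute("""
    SELECT f.visit_key
    FROM fact_visit AS f
    LEFT JOIN dim_procedure AS d ON d.procedure_key = f.procedure_key
    WHERE d.procedure_key IS NULL
""")]

print("known codes:", codes)
print("visits with undefined procedures:", orphans)
```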

Amalga is a LOCK-IN system. The likelihood, given the
design of the 'amalga id' and azAEID structures, is VERY low
that any hospital system could easily switch to something else.
If Microsoft gets this terrible system into a hospital system,
it is UNLIKELY that the hospital system would be able to
disentangle itself. It doesn't take a math genius or Warren
Buffett to figure out that IF Microsoft is successful in
'pushing' this bad system it could become a 1 billion dollar
product line in 10 years. If they get into the DOD or VA system,
it is unlikely that Amalga could be stopped.

Amalga's index-to-data space ratios are atrocious in most cases.
And yet, for all the TERRIBLE indexing schemes (or lack of any
sense of designing proper indexes), the queries are still more
often than not slower than you would expect. Here is a snippet
of data/index space analysis performed on the live system:

The Amalga ID is a nightmare. It makes for a TERRIBLE clustered
index, because no matter how you manipulate the 'guid like' key,
it will NEVER be an efficient sequential key. The excuse (and a
weak one at that) is that it will be 'unique' for external data
sharing. I would like to believe this, but since it is only
'guid like' and since I can ONLY assume it is deterministic, it
seems UNLIKELY that this argument is true.

Let's say I have a deterministic hash function:

Hospital (A) is the University of Wisconsin Medical Center
-> UWMC

Hospital (B) is the University of Washington Medical Center
-> UWMC (I don't know if such a hospital exists)

Hospital (C) is the Union Wilmington Medical Center
-> UWMC (this one is made up)

MRNs are not unique across organizations, and they are OFTEN,
via roll-over, not unique within a hospital EMR.

If each of these hospitals has an MRN of U2345445 with a visit
on Sept 2, 2010, the 'amalga key generator' would be using
precisely the same inputs at each hospital to generate the EID
(visit level key).

If it is not deterministic, then you have a whole slew of other
problems. If it is deterministic, then by definition the same
EID would be generated for each institution. How is this
cross-facility/system unique?
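
The argument above can be sketched in a few lines of Python. The
`visit_key` function is a hypothetical stand-in that uses SHA-256;
the real Amalga key generator is not documented in these notes, so
this only demonstrates the general property of any deterministic
function fed identical inputs.

```python
import hashlib


def visit_key(mrn, service_date, facility=None):
    """Hypothetical deterministic key; NOT the real Amalga algorithm."""
    parts = [mrn, service_date]
    if facility is not None:
        # Adding an organization identifier is one way to separate key spaces.
        parts.append(facility)
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


hospitals = [
    "University of Wisconsin Medical Center",
    "University of Washington Medical Center",
    "Union Wilmington Medical Center",
]

# Same MRN and date of service at every hospital, facility not in the inputs.
keys_without_facility = {h: visit_key("U2345445", "2010-09-02") for h in hospitals}

# Same inputs, but with the facility name included.
keys_with_facility = {
    h: visit_key("U2345445", "2010-09-02", facility=h) for h in hospitals
}

print(len(set(keys_without_facility.values())), "distinct key(s) without facility")
print(len(set(keys_with_facility.values())), "distinct key(s) with facility")
```

Without a facility-specific input all three hospitals produce one
and the same key, so the key cannot be unique across systems.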

See question/response below with respect to the Amalga ID:

Section 2: i2b2

High level design flaws:

A). Ontology Storage:

The 'ontology' or structured
classification scheme is stored primarily in 2 tables (in 2
different databases) in i2b2. The 'tree' itself is stored as all
possible paths in the tree. For instance, take the following
simple ontology:

The 'i2b2' way of storing this ontology is as follows:

For Each Terminal Node:

Store the complete path from root node to terminal node. For
formatting reasons, store it 3 times (yes, in both the metadata
and concept_dimension, this same path is stored multiple times).

For this, it stores (and more than once!) the following:
The standard way of storing topologies, of which an ontology is
a structural sub-class, is as one of the following:

Edge List (my preferred technique)

Jagged Array / Linked List (also workable)

However, the 2 techniques above attempt a balance between
mathematical complexity and usability. As an edge list, the
above would look like this:

> Trucks

> Chevy

> Dodge

> Cars

> Color

> Cost

> Red

> Black

> Mustang

> Corvette

> Speed

> Steering

> >30K



> <5K


> Used

> Used

> Wreck

My Edge List Cost: 19 x 2 ==> 38 memory units (an abstraction;
ceteris paribus, assume node size is equivalent)

I2B2 Cost: 46 Units!
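
The comparison can be sketched in Python on a small tree. The tree
below is hypothetical (it borrows a few node names from the
example but is NOT the original figure, so it does not reproduce
the 38 vs 46 numbers). Cost is counted the same abstract way: one
unit per node reference, 2 units per edge, and one unit per node
on every root-to-leaf path.

```python
# Hypothetical ontology: parent -> list of children.
tree = {
    "Vehicles": ["Trucks", "Cars"],
    "Trucks": ["Chevy", "Dodge"],
    "Cars": ["Mustang", "Corvette"],
    "Mustang": ["Red", "Black"],
}


def edge_list(tree):
    """Every (parent, child) pair, stored exactly once."""
    return [(parent, child) for parent, kids in tree.items() for child in kids]


def leaf_paths(tree, root):
    """Full root-to-leaf path for every terminal node."""
    paths = []
    stack = [[root]]
    while stack:
        path = stack.pop()
        kids = tree.get(path[-1], [])
        if not kids:
            paths.append(path)
        for kid in kids:
            stack.append(path + [kid])
    return paths


edges = edge_list(tree)
paths = leaf_paths(tree, "Vehicles")

edge_cost = 2 * len(edges)                      # 2 node references per edge
path_cost_one_copy = sum(len(p) for p in paths)  # 1 unit per node on each path
path_cost_three_copies = 3 * path_cost_one_copy  # same path stored 3 times

print("edge list cost:", edge_cost)
print("path storage cost (1 copy):", path_cost_one_copy)
print("path storage cost (3 copies):", path_cost_three_copies)
```

Even on this tiny tree the path form is no cheaper per copy, and
the gap widens as the tree gets deeper because every extra level
is repeated in every path beneath it.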

The difference between these two seems small. Please understand
that, with real ontologies being MUCH more complex than my
example, this difference becomes much worse. My edge list
version of the 'i2b2 ontology' took SIGNIFICANTLY less space,
did not have an asinine 'only 700 characters in length'
requirement and is orthogonal/generic (good in systems which
live in the real world and not fantasy ...

As stated above, the maximum string length of the 'ontology'
paths is 700 characters. It seems ridiculous to have to explain
why this is bad. What's worse is that the SAME ontology is
stored in MULTIPLE fields in different tables (wanna say update
inconsistency?). This ontology model has the following features:
a) inefficient, b) difficult to manage, and c) ONLY WORKS if the
semantic ontology string depth will NEVER EXCEED 700 chars in
DEPTH! (not a great feature of a system designed for the complex
sciences of medicine and biology)

B). Over-Design / difficult to disentangle:

The documentation and install bits are overbearing. In reality,
the primary use case for this at other institutions (other than
Harvard) is essentially as a set theory engine. As such, there
are very few tables that are 'needed'. An analysis of LIVE data
from our i2b2 system showed MOSTLY empty tables and empty
structures. This does not create much of a data footprint cost,
but from a design perspective it is the equivalent of the 8
headlights on the family truckster: not really necessary and
probably counter-productive. Over design has costs in
documentation and the risk of engendering FUTURE design flaws.
I2B2 reminds me of the freeways in Seattle during the late 70's:
many of them went NOWHERE. I2B2 is 'sold' as a cellular model,
which would imply an a la carte architecture. This is FAR from
the truth. While a person COULD re-program and re-design
portions of it to make it more flexible (one reason for my
resignation was the desire, in order to meet deadlines, to fix
and remedy some of the worst aspects of this), it is frowned
upon. I think EGO rules for I2B2.

C). Indexing is amateurish:

Load factor of 7-10 times input data footprint. I was told by a
PhD at the UW not to be concerned by this. I'm glad a PhD in
informatics allows someone to discount a memory leak. Please,
download and look (you can download i2b2, although it has a
strange version of OFT MENTIONED 'open source' licensing).
Observation_Fact, for I2B2 1.6, has so many indexes that SQL
SERVER rejected the DDL. There were indexes on almost every
field individually (though given how empty some of these fields
were it made little or no sense), but on top of this there is a
covering index on EVERY field! A db tuning class is ...

D). UI is uninspiring and derivative:

Set theory tabs: drag a node of the tree, and it allows basic
AND, OR and NOT operations. It's not bad, but not that
impressive either. Plus, if you DO NOT update your ontology to
reflect new concepts you will suffer from ORPHAN CONCEPTS... I
do not need to explain how difficult it can be to maintain a
system with this feature. What it is missing is dynamic
connectedness. Tableau and systems like Tableau support a much
more dynamic and less labor intensive means of noodling. A
person could BUILD a set theory engine that would outperform
i2b2 for large scale systems and this alternative could be
completed in 2-3 weeks. You could then call it a 'cell' and
everyone could save face. This correct approach was not
acceptable.
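
For readers unfamiliar with the term, the core of a "set theory
engine" is small. The sketch below uses Python's built-in sets
over a handful of made-up concepts and patient numbers. It is an
illustration of the AND / OR / NOT operations described above, not
i2b2's implementation and not the alternative engine mentioned in
the text.

```python
# Hypothetical data: concept -> set of (de-identified) patient numbers.
concept_patients = {
    "diabetes": {1, 2, 3, 5},
    "hypertension": {2, 3, 4},
    "smoker": {3, 5, 6},
}

# NOT needs a universe to subtract from.
all_patients = set().union(*concept_patients.values())


def and_(a, b):
    return a & b


def or_(a, b):
    return a | b


def not_(a):
    return all_patients - a


# (diabetes AND hypertension) AND NOT smoker
result = and_(
    and_(concept_patients["diabetes"], concept_patients["hypertension"]),
    not_(concept_patients["smoker"]),
)
either = or_(concept_patients["diabetes"], concept_patients["hypertension"])

print("diabetic, hypertensive non-smokers:", sorted(result))
print("diabetic or hypertensive:", sorted(either))
```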

E). Method for de-id is limited and buggy:

a) i2b2 randomizes the counts for results below a certain
threshold and b) if you query multiple times you are 'locked
out'. The theory: use a randomizing function and then block
users from using statistical narrowing. The problem: this
doesn't really help when the populations you are dealing with
drop below a certain level. At best it then becomes a noise
generator.
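
A small simulation shows both halves of this argument. The noise
model here (Gaussian, standard deviation 3) is purely an
assumption for illustration; it is not taken from i2b2. The first
part shows why repeated queries must be blocked: averaging many
noisy answers recovers the true count. The second shows why the
same noise swamps a small population.

```python
import random


def noisy_count(true_count, rng, sigma=3.0):
    """Return the true count plus random noise (assumed Gaussian here)."""
    return round(true_count + rng.gauss(0.0, sigma))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Statistical narrowing: ask the same question many times and average.
true_large = 500
samples = [noisy_count(true_large, rng) for _ in range(2000)]
estimate = sum(samples) / len(samples)

# Relative size of the noise for a large and a small population.
sigma = 3.0
relative_noise_large = sigma / true_large  # well under 1%
relative_noise_small = sigma / 4           # 75% for a population of 4

print("averaged estimate of 500:", round(estimate, 2))
print("relative noise, n=500:", relative_noise_large)
print("relative noise, n=4:", relative_noise_small)
```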

F). Data stored is stored in an unjustifiably redundant way
(especially given its nature as a noodling/de-id tool):

At first you might think, "Hey, it's as simple as storing
concepts in Observation_Fact and ensuring the metadata and
concept_dimension are kept consistent (remember, the ontology is
stored and used in multiple places; can I say update anomaly
again?)". However, take a look at the hard coded and stored SQL
in the I2B2 table: very revealing. I think all a person should
do once they download the i2b2 system is to examine the
following table ... and pay attention to the generated SQL and
the closure anomalies that exist. This kind of code undermines a
modern database, creates 'false' sub-queries and is frankly VERY
BAD PRACTICE. It is also unnecessary when you consider HOW SQL
could be generated in the middle tier. There is a Patient,
Provider and Visit dimension. I am wary of using the word
dimension, because even though they are 'dimensions' of the
data, the actual concept codes that represent the stored de-id
values are stored ALSO. So, if I want to use I2B2, I not only
have to store the concept code in the Observation_Fact which
represents this code, but I must ALSO store the same data in one
of the core dimensions.

Conclusions: How does this happen?


Dr. Harvard (or Docs PRACTICING computer science without a
license): If I decided to begin practicing 'neighborhood
cardiology' I would be arrested. I have no MD degree. I
have no license to practice. And yet, in the healthcare
system today, there are doctors selling themselves as
'experts' who know little or nothing of the discipline of
computer science. Amalga was not designed by folks who
understand complexity and data structures; it was
designed by docs, with some software developers, and it is
the outcome of malpractice. I will make the docs who
invented this a promise: if they stop practicing computer
science, I will not pretend to be an ER doc. We need to ask
better questions of ANYONE selling an idea and not simply
assume because they are docs they must be good at ...
Sorry, but this is not generally the case. I
joked once that given the current culture of hospitals, a
doc from 'Harvard' could sell electronic mail as his or her
own invention; it might even make it to contract review
before someone calls bullshit.



Why good systems and good ideas don't make it:
The circuitous route by which systems get built in a large
healthcare system is both byzantine and painful. Instead of
taking advantage of 'crowd' intelligence, the result is
a gray, mediocre half measure.


No cost accountability: I don't think there is currently
any solid cost review or cost accounting with respect to
these large healthcare systems. In fairness, this is common
with MANY enterprise systems in and outside of healthcare.
We must, as professionals, develop better and more
objective metrics for measuring the performance and
benefits of systems like Amalga and I2b2.


No transparency in WHY a system is selected: Amalga is one
of the least transparent systems I have come across. It's
ironic, because it is a terribly simplistic system with
only 3 legs to the stool. Leg 1: the HL7 parsers (crap),
Leg 2: Amalga ID and azAEID (super expensive and terribly
convoluted crap) and Leg 3: the 'console' or UI tools. I
used to beat myself up for cutting corners on UI
development and design; now that I've seen Amalga and
considered the raw dollars it has consumed, I am much more
forgiving of myself. I doubt ONE of the UI tools (including
the console) would pass a thorough review by an HCI (Human
Computer Interaction) specialist. The UI tools which
support development are buggy and have memory leaks. I
figured out months ago that you MUST shut down your
development environment periodically in order to avoid
catastrophic memory leaks which can (and do) lead to lost
parser work. I don't really know why or how the UWMC picked
Amalga or I2B2, but what I am fairly certain of is that NO
competent computer scientist dug very deeply into either
system. If they had, I doubt these systems would have ever
been acquired.


Informatics IS NOT Computer Science, at least not yet: I
have, in the last year, met MANY PhDs in Informatics. I
remember my brother-in-law telling me about the very
difficult tests he had to take (in the Computer Science
Department) at Indiana University in Bloomington in order
to move from the Masters program to the PhD portion of the
program. The tests he described seemed extremely difficult;
I don't know if I would have passed (my brother-in-law was a
graduate of Rose-Hulman, so not a lightweight when it comes
to the science of information). I have no reason to
believe, given all the blank stares, given the fact that
these 'experts' can look at Amalga and I2B2 and not
immediately be concerned, that the PhD in Informatics
provides any real background in the foundational skills of
being an information scientist in the Biological and Medical
realm. Maybe Informatics has a place, but only if the
stewards of the profession take seriously the basic skills
they need to do their job. I have very little reason to
believe that the 'information scientists' in Informatics
with PhDs actually have the background to do their work.
This is too bad.

No one needs to take my word for it. If you are a hospital CIO,
do yourself a favor: evaluate Amalga or i2b2 yourself and do a
good job of it. Microsoft sales folks will tell you about all
the 'success' stories; if they tell you the UWMC is a success
they are being dishonest. Before you 'buy' Amalga, load it with
10 TB of test data. If Microsoft says 'we can't let you do
that', then you should RUN from this disaster as far as you can.
If they do consent, do your own analysis. Look at ingestion
speed and replication, and run ingestion while you are writing
queries. In all likelihood, you will have to have or purchase
servers on this scale, or you may have them already. Bottom
line, Amalga doesn't get really ugly until it's almost too late
to turn back. But it is never too late to do the right thing!

So, in recap: at small scales of data Amalga is unnecessary and
a burden; at large scales of data Amalga (and this applies to
i2b2 as well) is unwieldy, buggy, provably inefficient and just
a waste of money, time and human resources.


The Joy of Querying Amalga... (JOY is SARCASM here)

The confusion over querying Amalga (our expert on our team
gave me 3 explanations in 3 weeks during September of last
year)

The funny thing is, Microsoft originally recommended
that MRN be thrown away. I'm glad that is one
recommendation that was NOT followed (despite the most
recent statement of what is 'correct' when querying ...)

Converting from Amalga to Kimball.