Overview

perchmysteriousData Management

Dec 1, 2012 (4 years and 11 months ago)

246 views

1


Multivalued Columns

in LabKey 10.3

Author:

Adam Rauch

Last updated:

August
1
4
,
2010

Overview

LabKey
currently

supports
standard
columns that display a single value per row. These
columns take two forms:



Value columns
.

A single value
per row
is stored in

the column.

Each value is
independent of all other values; changing one value has no
e
ffect on any other.
For
example:

Adults

RowId

Name



1

Ad慭



2

M慴t



3

M慲a






Lookup columns
.

A
single
key
per row
is stored in the column,
and it
link
s

to a
second table that contains
keys and
often

other columns
. For example:

Adults

Neighborhoods

RowId

Name

Neighborhood

1

Adam

1001

2

Matt

1002

3

Mark

1000

4

Karl

1001


Note that multiple rows (Adam & Karl, in this example) in the base table (Adults)
link to the same row in the lookup table (Neighborhoods).
This is a
many
-
to
-
one
relationship

between Adults and Neighborhoods. Since the lookup table rows are
shared by the

base table rows,
chang
ing the value “
Leschi
” would affect both the
Adam & Karl rows.

LabKey 10.3 adds general support for “multivalued” columns, which
(conceptually)
hold
and display multiple values per row.
(Note: A PostgreSQL script for creating and
po
pulating all tables for the multivalued column examples is included in Appendix A.)
Multivalued
columns also take two forms:



Multiv
alue
d

columns
. In the simple case, a multivalued column holds 0


n distinct
values per row.
An

example

as it might be dis
played
:

Hood
Id

Name

1000

Queen Anne

1001

Leschi

1002

Laurelhurst

2


Adults

RowId

Name

Children

1

Adam

Rachel, Connor

2

Matt

Evie, John, Helen

3

Mark

Alex, Aaron


Like the value column in the first example, the values here are all independent of
each other; if
Adam’s
daughter
Rachel change
s her name to Raquel,
all other Rachels
remain unchanged
.

Multivalued columns are implemented
in the database
as
two tables, much like the
lookup column above, but with
a
one
-
to
-
many relationship

instead:

Adults

Children

RowId

Name

1

Adam

2

Matt

3

Mark






Every new child associated with an Adults row results in a new row inserted into
Children.
The row in Adults “owns” all associated rows in Children.
Deleting an
Adults row requires deleting

all associated Children rows. The Children table could
have additional columns (Age, School, etc.)

whose values are
also
private to that
child
.



Multivalued l
ookup columns
.
In this case, the column holds 0


n lookup values
per row. This might
appear

similarly

to the simple multivalued column
:

Adults

RowId

Name

Hobbies

1

Adam

Snowboarding
,

Playing
Bluegrass, Biking

2

Matt

Skiing, Biking

3

Mark

Skiing
,
Traveling

4

Karl

Biking, Mixing Mai Tais


The difference is that the lookup rows are shared between
many

base table rows.
This

“column” is

implemented in the database
with

three

tables
:

the base table, the
Adult
Id

Name

1

Rachel

1

Connor

2

Evie

2

John

2

Helen

3

Alex

3

Aaron

3


lookup table, and a junction table

that connects them
. This results in a
many
-
to
-
many rela
tionship

between
the
base table and lookup table:

Adults

Junction

Hobbies

RowId

Name

1

Adam

2

Matt

3

Mark

4

Karl







When Mark takes up a new hobby, “Reading Dostoevsky,” we
must

insert

it into
Hobbies if not present, retrieve the RowId from
Hobbies

(assume it’s 1007)
, and
insert a new row
[3, 1007]
into
Junction.


If Karl decides to give up “Mixing Mai Tais”
we would delete the
[
4, 1006
]

row from Junction. If someone decides to change
“Biking” to “Cycling,” the change would affect all three Adults who engage in that
hobby.

Note that the
Children table in the first multivalued column example and the Junction table
in the second are structurally identical; each participates in a one
-
to
-
many relationship
with its base table. The only structural difference is that Children holds values (Name

is a
value column) and Junction holds keys (HobbyId is a lookup column to Hobbies

that
displays Name
).

This similarity is key to the selected implementation approach.

Implementation

LabKey uses one
-
to
-
many and many
-
to
-
many relationships
throughout the pr
oduct
, and
multivalued columns appear in several places, most notably the Viability and Reagent
modules that Kevin wrote. The goal
for

10.3 is to

provide core infrastructure that

make
s

it
easier
for module developers
to add
these
multivalued columns
to their applications.
Eventually, we
would like to

make
the
multivalued column and multivalued lookup column
first
-
class data types; users
could then
add them
arbitrarily
to
any
assay, dataset, list
,
external schema table, etc., with the result being
app
ropriate UI appearing

(grid, details,
insert, update)
, junction tables getting created

if needed, a
nd all required inserts and
deletes happening
automatically.

The 10.3 implementation
, however,

focuses on building
basic
infrastructure

that programmers can

use to add and manipulate multivalued columns

for the key 10.3
scenarios (genotyping, ehr, viability, reagent, and experiment runs)
.

Important
elements

include:

RowId

Name

1001

Snowboarding

1002

Playing Bluegrass

1003

Biking

1004

Skiing

1005

Traveling

1006

Mixing Mai Tais

AdultId

HobbyId

1

1001

1

1002

1

1003

2

1004

2

1003

3

1004

3

1005

4

1003

4

1006

4




MultiValuedColumn. A new ColumnInfo that
generates SQL to join the tables and
handle groupin
g, sorting, and aggregating of multiple values. The ColumnInfo
constructor takes three ColumnInfos: the base table primary key column, the
corresponding key column in the child table, and the value column, which itself can
be a lookup column. In the mult
ivalued column examples above, these would be:

o

Multivalued column: Adults.RowId, Children.
Adult
Id, Children.Name

o

Multivalued lookup columns: Adults.RowId, Junction.
Adult
Id,
LookupColumn(Junction.H
obbyId, Hobbies.RowId, Hobbies.
Name)

T
he MultiValuedColumn c
lass handles both multivalued columns and multivalued
lookup columns. The only
difference is that
the “value” column is
a simple value
column in the first case and a lookup column in the second.
If it becomes necessary
(or if we
want

to increase cla
rity
)

we could
break out a subclass
(Mul
t
iValuedLookupColumn)
,
but at the moment a single column class seems simple
and flexible.

The MultiValuedColumn accepts any standard ColumnInfo, including those from
hard tables, ontology manager tables, external schemas,

etc.
However,

the tables

and any desired indexes
and foreign keys
must
already exist; they will not be
created automatically
(
in 10.3
)
.

Note that key column in the child table
(Children
.AdultId

and Junction
.AdultId

in these examples) must not
be a single
column primary key or have a unique constraint. However,
the key column should
be a part of a multi
-
column primary key, multi
-
column unique constraint, or single
-
column non
-
unique index.

The purpose of the MultiValuedColumn is to select all va
lues
associated with

a single
row in the base table and aggregate them in some way to return a single value. The
default
MultiValuedColumn
aggregate function returns
all the values as a comma
-
delimited string, like the examples above. (This

is

sometimes
referred to as
“GROUP_CONCAT,” after the MySQL aggregate function.) MultiValuedColumn
subclasses can override getAggregateFunction() and getSqlTypeName() to apply
different aggregate functions such as COUNT, AVG, MIN, MAX, SUM, etc.

The SQL generated for
the first multivalued column example looks like this:

SELECT adults.rowid AS rowid, adults.name AS name,


adults$rowid$.name AS "rowid$name"

FROM mvc.adults adults


LEFT OUTER JOIN


(


SELECT kt.adultid, GROUP_CONCAT(name) AS name


F
ROM mvc.children kt


GROUP BY kt.adultid


) AS adults$rowid$ ON (adults.rowid = adults$rowid$.adultid)

G
ROUP_CONCAT in the SQL will be replaced by the appropriate SQL syntax for the
dialect (see below and discussion in Appendix A).

5



The SQL gene
rated for the multivalued lookup column example looks like this:

SELECT adults.rowid AS rowid, adults.name AS name,


adults$rowid$.hobbyid$name AS "rowid$hobbyid$name"

FROM mvc.adults adults


LEFT OUTER JOIN


(


SELECT kt.adultid,
GROUP_CONCAT
(kt$hobbyid$.name) AS


"hobbyid$name"


FROM mvc.junction kt

L
EFT OUTER JOIN

mvc.hobbies AS




kt$hobbyid$ ON (kt.hobbyid = kt$hobbyid$.rowid)



GROUP BY kt.adultid


) AS adults$rowid$ ON (adults.rowid = adul
ts$rowid$.adultid)

Note that in the second example the lookup join is buried inside the sub
-
select since
the results of this join need to be grouped before joined to the base table.

More than one

MultiValuedColumn columns can be added to a query to support

different multivalued column
s

and/or multiple aggregate functions operating on the
same multivalued column.

In addition to the values, the MultiValuedColumn
provide
s

the underlying keys i
n
some form (perhaps as another common
-
delimited column?) to permit
links

to
individual value records
.
The MultiValuedColumn sorts the values, by default in
natural ascending order with some coding provision for other sorting options
.



Internal support for
GROUP_CONCAT
aggregate function
on

Post
g
reSQL and SQL
Server 2005+.

GROUP_CONCAT is not a standard aggregate function on either
server
. We need to create this aggregate function on PostgreSQL and emit custom
SQL that provides similar functionality on SQL Server 2005 and later
.
Kevin has
implemented solutions for both d
atabase servers

already
, so it should be a simple
matter of moving these solutions into
the
core
schema
and the
standard
SQL
dialects. We will not support GROUP_CONCAT on SQL Server 2000; instead, we will
fall back on
MIN

(not ideal, but better than crash
ing).



MultiValuedDisplayColumn. A new DisplayColumn that
displays the results of a
MultiValuedColumn
that uses

the default GROUP_CONCAT aggregate

function
. This
DisplayColumn renders separate hyperlinks on each value (if
requested
) and
provides
alternate

filtering UI for these columns. It also handles export to TSV and
Excel

(which shouldn’t require extra work, but need to be tested thoroughly)
.

When
using MultiValuedColumns with
other aggregate functions (MIN, COUNT, SUM, etc.)
,

standard DataColumns ca
n be used since normal display and filtering options
should be appropriate

for their values
.



Filtering
.
Filtering on
a
multivalued column
’s concatenated values

using standard
string
filters (contains, starts with, etc.) works reasonably well.
Also useful

would
be special f
ilter UI
: for shorter lookup tables we could
provide

a list of unique

values
with check boxes (see Excel filter UI)
; for longer lookup tables,
an auto
-
complete
solution
would

be helpful.
Users could then pick include/exclude and any/all
. This

6


filter
UI
(minus any/all) w
ould
also
be useful for standard lookup columns
and even
value columns.

Generating the SQL for a filter that includes or excludes one or more values is fairly
straightforward.
If we want to display names and hobbies of
all adults who enjoy
Snowboarding or Mixing Mai Tais we would generate the following:

SELECT adults.rowid AS rowid, adults.name AS name,


adults$rowid$.hobbyid$name AS "rowid$hobbyid$name"

FROM mvc.adults adults


LEFT OUTER JOIN


(


SELEC
T kt.adultid,
GROUP_CONCAT
(kt$hobbyid$.name)

AS



"hobbyid$name"


FROM mvc.junction kt

LEFT OUTER JOIN mvc.hobbies AS


kt$hobbyid$ ON (kt.hobbyid = kt$hobbyid$.rowid)



GROUP BY kt.adultid


) AS adults$rowid$ ON (adult
s.rowid = adults$rowid$.adultid)

WHERE adults.rowid IN


(


SELECT j.AdultId FROM mvc.Junction j



LEFT OUTER JOIN

mvc.Hobbies h ON

(j.hobbyid = h.rowid) AND


h.Name IN ('Snowboarding', 'Mixing Mai Tais')


GROUP BY j.AdultI
d


HAVING COUNT(h.Name) >

0


)


Changing the HAVING COUNT() comparison to = 0 would switch this to an exclusive
filter, showing all adults who enjoy
neither
Snowboarding
nor

Mixing Mai Tais.
(Using NOT IN would also work.) The SQL for “at least two,” “exactly one,” and
similar filters is trivially easy to generate, although perhaps difficult to support in
the UI.
Two s
eparate join paths through Junction and Hobbies are req
uired here
because the grouping for filtering and grouping for display conflict with each other
in most cases.
A WHERE clause with a sub
-
select fits well with our current filtering
model
, however, it
could be translated to an INNER JOIN
with
the sub
-
selec
t, if we
find performance to be an iss
ue.



Add
GROUP_CONCAT

aggregate function to LabKey SQL to allow within custom
queries
. Ideally, support a DISTINCT qualifier as well

(need to verify we can support
this on SQL Server)
. In addition, MySQL also supports

SEPARATOR and ORDER BY,
though ORDER BY
could
be handled
via an explicit

sort by the query writer. Note
that
GROUP_CONCAT

alone will be useful for displaying *
-
to
-
many relationships, but
parameterized queries are required to let users filter these column
s as anything but
a single string.



Add support for defining multivalued columns to XML meta data.

Current p
roposal
is
to add single
new XML element
to <fk> called
m
ultiValued
KeyColumnName.
To
add a m
ulti
v
alue
d column, define a new <c
olumn> specifying a n
ame, a
wrappedColumnName
attribute to the
PK, and any other elements you want. (We
could perhaps default to the table’s PK if
m
ultiValuedKeyColumnName is specified
7


but wrappedColumnName is missing.)
The
m
ultiValuedKeyColumnName
element
specifies the key
column in the child table, the “many” side of the relationship. If
fkColumnName is a lookup column (it has an fk defined

within itself
) then the
resulting virtual column is a multivalued lookup column. If not, it’s just a
multivalued column.

This is probably better explained with an example.
Defining the hobbies
relationship above

would

look
something
like this:

<column columnName="
H
obbies"

wrappedColumnName="
RowId
"
>

<fk>

<fkDbSchema>
myschema
</fkDbSchema>

<fkTable>
Junction
</fkTable>

<
fkColumnName>
HobbyId
</fkColumnName>

<m
ultiValued
KeyColumnName>AdultId</multiValuedKeyColumnName>

</fk>

</column>


As mentioned above, in 10.3, programmers are responsible for creating all necessary
tables, indexes, and foreign keys before attempting to use

multivalued columns.
They are
responsible for loading the tables, either from files or from UI that they provide.
They must
issue the appropriate selects from and inserts to all tables, in the proper order, when
inserting rows to the base table. They m
ust delete from all tables, in the proper order,
when deleting from the base table. They are responsible for transacting these operations, if
desired.

Future Improvements

The following features of multivalued columns (or, more
generally
, one
-
to
-
many and m
any
-
to
-
many relationships) are
currently considered
out
-
of
-
scope for 10.3. They will be
prioritized for inclusion in future releases
. In no particular order
:



General
-
purpose d
ata entry options.

Kevin has implemented multivalued entry UI
for reagents; th
is may be useful elsewhere.



Special LabKey SQL syntax to handle filtering
of

multivalued columns.



“Cascading” of
inserts

and
deletes
. When inserting to the base table, automatically
insert to child, junction, and even lookup tables. When deleting a row f
rom the base
table, delete all child rows owned by that row. Wrap complete insert and delete
operations in transactions.



Create supporting database objects on demand.
Automatic
ally create
junction table
,
indexes,

and
foreign key
s on first reference
.



Alternate views of parent/child relationships. For example, display a h
ierarchical
view of a one
-
to
-
many relationship, with child rows appearing as an expandable
sub
-
grid below each parent row.

(At times we’ve referred to this as a “sublist”.)

8


This is p
articularly interesting for child tables that contain
many other
interesting
columns.
The

MS2
module expanded
protein

and protein prophet
view
s

are
example
s

of this.

Note that
,
implementation
-
wise
,

this doesn’t have much in
common with the 10.3 multivalu
ed column implementation (which
simply
provides
a single column
that evaluates
an aggregate function

over all child rows
)
. However,
most of the future improvements discussed here overlap with
sublist needs.



Customize view

support. In 10.3, the aggregate
column provided by the
programme
r will appear as a “leaf column
” in customize view. In the future, we
could add the ability for users to
follow
multivalued column
joins, select columns of
interest,
and choose from
a selection of aggregate functions.



Advanced filtering. Might also appear in customize view. Ideas that have been
suggested include filtering on columns other than the display value, more advanced
set operations,
and
“includes n or more of the following”.

Appendix A: SQL Script for Example
s

The PostgreSQL script below creates and populates all tables for the multivalued column
examples above.
The

sample queries should work against this schema, HOWEVER, you will
need to use a recent 10.3 database and replace

instances of

GROUP_CONCAT(x) wit
h

this
:



array_to_string(core.array_accum(x), ',')



SELECT core.fn_dropifexists('*', 'mvc', 'SCHEMA', NULL);


CREATE SCHEMA mvc;


CREATE TABLE mvc.Adults

(


RowId SERIAL,


Name VARCHAR(200) NOT NULL,




CONSTRAINT PK_Adults PRIMARY KEY (RowId)

);


CREATE TABLE mvc.Children

(


AdultId INT NOT NULL,


Name VARCHAR(100) NOT NULL

);


CREATE INDEX IX_Children ON mvc.Children(AdultId);


CREATE TABLE Hobbies

(


RowId SERIAL,


Name VARCHAR(100) NOT NULL,



9



CONSTRAINT PK_Hobbies PRIMARY KEY (RowId)

);


CREATE TABLE Junction

(


AdultId INT NOT NULL,


HobbyId INT NOT NULL

);


CREATE INDEX IX_Junction ON mvc.Junction(AdultId);


INSERT INTO mvc.Adults (RowId, Name) V
ALUES (1, 'Adam');

INSERT INTO mvc.Adults (RowId, Name) VALUES (2, 'Matt');

INSERT INTO mvc.Adults (RowId, Name) VALUES (3, 'Mark');

INSERT INTO mvc.Adults (RowId, Name) VALUES (4, 'Karl');


INSERT INTO mvc.Children (AdultId, Name) VALUES (1, 'Rachel');

IN
SERT INTO mvc.Children (AdultId, Name) VALUES (1, 'Connor');

INSERT INTO mvc.Children (AdultId, Name) VALUES (2, 'Evie');

INSERT INTO mvc.Children (AdultId, Name) VALUES (2, 'John');

INSERT INTO mvc.Children (AdultId, Name) VALUES (2, 'Helen');

INSERT INTO

mvc.Children (AdultId, Name) VALUES (3, 'Alex');

INSERT INTO mvc.Children (AdultId, Name) VALUES (3, 'Aaron');


INSERT INTO mvc.Hobbies (Rowid, Name) VALUES (1001, 'Snowboarding');

INSERT INTO mvc.Hobbies (Rowid, Name) VALUES (1003, 'Biking');

INSERT INTO

mvc.Hobbies (Rowid, Name) VALUES (1005, 'Traveling');

INSERT INTO mvc.Hobbies (Rowid, Name) VALUES (1006, 'Mixing Mai Tais');

INSERT INTO mvc.Hobbies (Rowid, Name) VALUES (1004, 'Skiing');

INSERT INTO mvc.Hobbies (Rowid, Name) VALUES (1002, 'Playing Blueg
rass');


INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (1, 1001);

INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (1, 1002);

INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (1, 1003);

INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (2, 1004);

INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (2, 1003);

INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (3, 1004);

INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (3, 1005);

INSERT INTO mvc.Junction (AdultId, HobbyId) VALUES (4, 1003);

INSERT
INTO mvc.Junction (AdultId, HobbyId) VALUES (4, 1006);