
SAS Data Sets on Atlas
10.1 Specification

Author: Adam Rauch
Co-Authors: Geoff Snyder, Julie Stofel, Sarah Ramsay, Iraj Mohebalian

Original Draft: September 10, 2009
Last Updated: November 18, 2009

Introduction

This document describes the plan to enable publishing of SAS data sets to Atlas. This is a two-phase project that spans multiple releases of LabKey Server: v9.3 and v10.1. In phase one (v9.3, released to Atlas on November 12, 2009), we used a small amount of funding to implement the basic publishing infrastructure. In phase two (v10.1, likely releasing in March 2010), we will research and implement additional features, possibly including support for SAS formats, an improved administration user interface, and infrastructure improvements.

Overview and Scenarios

The ability to publish SAS data sets on Atlas provides secure, dynamic access to data sets residing in a SAS repository. Published SAS data sets appear on Atlas as directly accessible data sets. They are dynamic, meaning Atlas treats the SAS repository as a live database; any modifications to data within a SAS data set are immediately viewable on Atlas. The data sets are visible only to those who are authorized to see them.

Authorized users view published data sets using the familiar, easy-to-use grid user interface used throughout Atlas. They can customize their views with filters, sorts, and column lists. They can use the data sets in custom queries and reports. They can export the data to Excel, web query, or TSV formats. They can access the data sets from JavaScript, SAS, R, and Java client libraries.

A few concrete scenarios from SCHARP:



- Make analysis datasets available for SCHARP analysts (faculty, staff scientists, PIs) who are not necessarily SAS users and may not be familiar with the directory structure (where to find files and data sets). The HPTN 035 example, where PIs want access to analysis datasets created by an SRA, typifies this scenario. They want to be able to see the formatted data in a typical grid layout to confirm that a given dataset has the information they need. They want to export formatted data to use with their software of choice. They’ll combine the data, if needed, using their tools of choice. Access to the Atlas folder is limited to the analysts working on the project and the programmers, if any, who are supporting it. The SRA or programmer adds or removes datasets available in a folder with minimal steps. Changes to the data are viewable immediately by Atlas users.



- Make plate- or form-level SAS datasets and raw or derived assay data available on Atlas. Anticipated use: data for HVTN studies using VRC products. VRC wants to pull CRF and lab data from SCHARP to build their data warehouse of VRC-related trial data. SCHARP makes form-level and assay positivity call SAS data sets available on Atlas, both as attached SAS datasets with accompanying format catalogs and as a standard Atlas grid view of the formatted data. Access is restricted to VRC-designated staff on protocol-level folders. SCHARP adds and updates both viewable and attached datasets as needed.



- Make sites’ form-level SAS datasets available on Atlas for site reference and use. Example: MTN 003 (VOICE). SCHARP committed to making a subset of formatted form-level data from each site available to that site for their use. SCHARP updates the data on a nightly basis. Sites are restricted to view and access only their site-specific folder (and data) on Atlas. Sites can view their data in a grid or download it for their own purposes.

Several layers keep the data secure. SAS administrators expose selected SAS libraries to Atlas. Atlas administrators then selectively publish entire SAS libraries as schemas available within a specific folder. The folder is protected using standard Atlas security; only users who have been granted permission to that folder can view the published data sets.

Solution Overview

The SAS data set publishing solution integrates several components:



- SAS/SHARE server. This server runs as part of the SCHARP SAS installation (it does not run on the Atlas server itself). SAS/SHARE allows Atlas to retrieve SAS data sets over the internal SCHARP network.



- SAS/SHARE JDBC driver. This driver allows Atlas to connect to SAS/SHARE and treat SAS data sets as if they were tables in a relational database.



- Atlas SAS Administration. This is the user interface that Atlas administrators use to select the SAS libraries and data sets to publish in each folder.


- Atlas SAS Integration. This server code links SAS data sets with the Atlas query service. It also translates generic queries into the specific SQL syntax that SAS/SHARE understands.

Each of these components is described in more detail below.



SAS/SHARE Server and JDBC Driver

The SAS/SHARE server must be configured and maintained as part of the SCHARP SAS installation. The Atlas production, staging, and test servers must be able to connect to SAS/SHARE; this requires high-speed network connectivity, firewall access, and authentication credentials. SAS/SHARE must be configured to predefine all SAS libraries that Atlas needs to access.

The SAS/SHARE JDBC driver must be installed on all three Atlas servers. This requires copying two .jar files into the tomcat/common/lib directory of each server. It also requires adding to each labkey.xml file a new DataSource entry containing several connection settings (e.g., SAS/SHARE URL and credentials).
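As a sketch, such a DataSource entry in labkey.xml might look like the following. The resource name, host, port, and credentials shown here are placeholder values, not Atlas’s actual configuration; the driver class and URL scheme are those documented for the SAS/SHARE Driver for JDBC.

```xml
<!-- Sketch of a SAS/SHARE DataSource entry for labkey.xml.
     Host, port, and credentials are placeholders. -->
<Resource name="jdbc/sasDataSource" auth="Container" type="javax.sql.DataSource"
          driverClassName="com.sas.net.sharenet.ShareNetDriver"
          url="jdbc:sharenet://sasshare.example.org:8551"
          username="atlas_user" password="********"
          maxActive="20" maxIdle="10" maxWait="120000"/>
```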

Note: A very preliminary, incomplete set of steps to configure a sample SAS/SHARE installation can be found on the Configure SAS Access From LabKey Server page.

Atlas SAS Administration

Note: This section describes the administration UI implemented for v9.3. See the “Potential Improvements in v10.1” section for possible enhancements.

A site administrator chooses which data sets to publish in an Atlas folder by defining an external schema in that folder. (Terminology note: when published, a SAS library is synonymous with “schema” and a SAS data set is synonymous with “table.”)

The “Define External Schema” page is accessed from the query “Schema Administration” page. Previously, all external schemas had to reside in the Atlas database. A new “Data Source” drop-down list has been added to this page to allow selection of any data source defined in the labkey.xml file. If the SAS module is installed and a SAS data source is defined in labkey.xml, then the drop-down list contains the SAS/SHARE data source name as an option. The image below shows a portion of the v9.3 administration UI; this particular server defines “sasDataSource” in labkey.xml, which results in the name “sas” appearing in the list:




After selecting a data source, the “Database Schema Name” drop-down is populated with all schemas in that data source. The administrator selects a schema name to publish it. After clicking the “Create” button, the tables from that schema are published in the folder. In other words, defining a schema from a SAS data source publishes all data sets in that library; they can be viewed by anyone with read permissions in the folder.

Atlas SAS Integration

Note: This section describes the SAS integration implemented for v9.3. See the “Potential Improvements in v10.1” section for possible enhancements.

Once defined, a SAS library is treated like any other database schema (with a couple of important exceptions listed below). The schema browser lists all of its data sets as “built-in queries & tables.” A query web part can be added to the folder’s home page to display links to a library’s data sets. Links to key data sets can be added to wiki pages, posted on message boards, or published via email. Clicking any of these links displays the data set in the standard Atlas grid with filtering, sorting, exporting, paging, customizing views, etc. all enabled. Queries that operate on these datasets can be written, including queries that join multiple data sets together. The data sets can be retrieved using client APIs (Java, JavaScript, R, and SAS).
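For instance, a custom query joining two data sets within one published library might look like the following LabKey SQL sketch; the schema name “hvtn035” and the data set and column names are purely hypothetical:

```sql
-- Hypothetical example: join two data sets in a published SAS library.
-- "hvtn035" is the schema name chosen when the library was published.
SELECT d.ptid, d.site, r.visit, r.result
FROM hvtn035.demographics d
INNER JOIN hvtn035.lab_results r ON d.ptid = r.ptid
WHERE r.result IS NOT NULL
```

Both data sets in this sketch come from the same SAS library; such joins must stay within a single data source.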

The two major limitations with SAS data sets are:

- SAS data sets can be joined to each other but not joined to data in the Atlas database or other external data sources. Any attempt to join between different databases (e.g., a table in PostgreSQL with a data set in SAS) results in an error message.

- SAS integration provides read-only access to SAS data sets. Atlas is not able to insert, update, or delete data in SAS data sets.

Using SAS Data Sets

The v9.3 implementation focused on supporting SAS as a generic external data source. No effort was made to integrate SAS data sets with other areas of the product (such as the study module) or to improve the user experience of interacting with these data sets, beyond the pre-existing, generic query features.

See the section below for improvements that may be made in v10.1.

Potential Improvements in v10.1

The “SAS Data On Atlas” team has discussed and prioritized improvements that could be implemented during the v10.1 development cycle. These are all listed and described below. The team has prioritized many of the features into High, Medium, and Low groupings, based on the scenarios above; these groupings will help development focus research initially on the highest-priority areas. After development has researched possible solutions, the team will discuss and reprioritize the tasks.

Atlas production was just upgraded to v9.3, and the SAS integration feature has not yet been used in production. The team has deferred discussion of some of the potential improvements (e.g., administration UI changes) to ensure some real-world production experience first. As a result, some of the tasks are currently listed as “unprioritized”; the team will discuss and prioritize these with the other features after several weeks of production use.

As we continue to use, test, and develop the SAS integration feature, we are likely to discover additional opportunities or SAS shortcomings that we will consider addressing.

Features are not prioritized within each grouping yet.

High Priority Improvements



- Support SAS formats. In the ideal world, we would support SAS “display formats” (e.g., number and date formats) and SAS “lookup formats” (e.g., substituting words for numerical values) when displaying or exporting SAS data. We would “round trip” SAS-specific formats through import and export; e.g., when using the SAS client API to retrieve a data set that resides in a SAS repository, the resulting data set should have the SAS display format applied. In some cases we may want to see the formats applied, and in others we may want the underlying values, so we may want an option (on the library? on the data set? on a query?). We’ll need to discuss the implications for filtering and sorting data sets with and without formats applied. This feature will likely require extensive research and work to implement, and the ideal scenario described above may not be possible.



- Ability to reconnect to data sources that go offline and return. Currently, once the web server is unable to connect to a data source, it has no ability to reconnect. So, if the SAS/SHARE server goes down and comes back up, Atlas won’t be able to reconnect; the web server must be restarted to re-establish the connection. Now that Atlas is connecting to multiple data sources, we need to investigate ways to improve this.



- Revamp caching and namespace handling for external schemas. This is an internal infrastructure change that must be addressed. Currently, all internal schemas share a namespace and all external schemas share another namespace. This was fine when all schemas came from a single data source (the database ensured that schema names were unique), but now schemas from different data sources could have the same name, resulting in a conflict. One potential approach: each data source (DbScope internally) holds a “db schema name -> DbSchema” map of its schemas, and each folder holds the “user name -> db schema name” mapping for that folder. Among other concerns, we need to handle 1) folder-specific metadata (where does that get applied and cached?) and 2) cross-container queries.
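The proposed mapping could be sketched as follows. This is an illustrative Python model only; the class names and structures are invented and do not reflect the actual DbScope/DbSchema implementation.

```python
# Illustrative sketch of the proposed namespace handling; names are invented
# and do not reflect LabKey's actual DbScope/DbSchema classes.

class DataSource:
    """Each data source owns its own 'db schema name -> schema' map."""
    def __init__(self, name):
        self.name = name
        self.schemas = {}          # db schema name -> schema metadata

    def add_schema(self, db_name, schema):
        self.schemas[db_name] = schema


class Folder:
    """Each folder maps user-visible schema names to (data source, db name)."""
    def __init__(self):
        self.published = {}        # user schema name -> (DataSource, db name)

    def publish(self, user_name, source, db_name):
        self.published[user_name] = (source, db_name)

    def resolve(self, user_name):
        source, db_name = self.published[user_name]
        return source.schemas[db_name]


# Two data sources may define schemas with the same db name without conflict,
# because lookup always goes through a specific data source.
postgres = DataSource("labkeyDataSource")
sas = DataSource("sasDataSource")
postgres.add_schema("study", {"tables": ["participants"]})
sas.add_schema("study", {"tables": ["demographics"]})

folder = Folder()
folder.publish("study", postgres, "study")
folder.publish("sas_study", sas, "study")
```

In this model, folder-specific metadata and cross-container queries would still need a home, which is exactly the open question noted above.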

Medium Priority Improvements



- Report major issues with the SAS 9.2 Driver for JDBC to the SAS Institute. We hope that SAS will fix this driver so we don’t have to implement any workarounds. See further details about this item under “Low Priority Improvements.”



- Support SAS special missing values. All SAS missing values currently appear as null. We may be able to retrieve the special missing values; this will require research.



- Need a mechanism for automatically reloading data set metadata. For performance reasons, Atlas caches all schema metadata (the list of columns, including names, descriptions, data types, formats, etc.). Atlas doesn’t see any changes to table structure (e.g., added or removed columns) until the schema is reloaded. The external schema administration UI has a “reload” link that does this, but adding a more automated mechanism would be helpful. Ideas to investigate: clear all schema caches as part of nightly maintenance, add a configurable timer to each schema (e.g., ranging from 5 minutes to 24 hours), or investigate whether a “schema last changed” date is available from SAS.



- Load/reload data sets into the study module from external data sources. For example, nightly study reload could be configured to load data sets into the Atlas database directly from SAS, eliminating the need to export to TSV first.
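One of the ideas above, a configurable per-schema reload timer, could be sketched as follows. This is a hypothetical Python illustration, not Atlas code; all names and the TTL policy are invented.

```python
import itertools
import time

class CachedSchemaMetadata:
    """Hypothetical sketch: cache a schema's metadata and reload it once a
    configurable interval has elapsed (e.g., 5 minutes to 24 hours)."""
    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self.loader = loader            # callable that fetches fresh metadata
        self.ttl = ttl_seconds
        self.clock = clock
        self._meta = None
        self._loaded_at = float("-inf")

    def metadata(self):
        now = self.clock()
        if self._meta is None or now - self._loaded_at >= self.ttl:
            self._meta = self.loader()  # reload from the data source
            self._loaded_at = now
        return self._meta

# Demonstration with a fake clock so the reload is deterministic.
fake_time = [0.0]
loads = itertools.count(1)
cache = CachedSchemaMetadata(loader=lambda: {"version": next(loads)},
                             ttl_seconds=300,          # 5 minutes
                             clock=lambda: fake_time[0])

v1 = cache.metadata()          # first access loads the metadata
v2 = cache.metadata()          # within the TTL: served from cache
fake_time[0] = 301.0
v3 = cache.metadata()          # TTL elapsed: reloaded
```

The same shape would work for the nightly-maintenance idea by treating the maintenance task as a cache clear rather than a timer.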

Low Priority Improvements



- Allow insert, update, and delete within SAS data sets from Atlas. In most cases, Atlas administrators want SAS data sets to be read-only; this will continue to be the default setting when defining any external schema. However, allowing Atlas users (or users of applications written to the APIs) who have the appropriate authorization to insert, update, and delete records in selected SAS data sets would be a very powerful, useful feature. Enabling this may be possible; it’s another area that will require research.



- Fix and/or work around major issues with the SAS 9.2 Driver for JDBC. Most development of phase one took place with the SAS 9.1 Driver for JDBC; this driver seems to work fine with both SAS 9.1 and SAS 9.2. Late in the development cycle, we tested the functionality against the newer JDBC driver, the 9.2 version. This driver is fundamentally broken in several ways; for example, it reports that all columns (even character variables) are of type “double”. We need to report this problem to the SAS Institute and either get a fix or find a way to work around this issue.



- We need to detect and prevent cross-data-source joins. The lookup UI, for example, always displays schemas and tables from the main database, which means one can define a lookup from a SAS data set to a table in PostgreSQL. Viewing this column will result in a SQL exception, since cross-database joins will not work. Likewise, using this UI to define a lookup to another SAS data set is impossible.



- The query UI should check table permissions in the underlying database before rendering edit/insert/delete buttons/links. This keeps us from misleading users and giving them SQL exceptions.



- Dynamic SAS data sets in a study. It might be interesting to add SAS data sets directly to a study. Instead of taking a snapshot of the data on a nightly basis (see above), Atlas would retrieve the data from the SAS server as needed. This could be “just in time,” as requests are made to display or export the data set, or (more interestingly) we could cache SAS data sets in temp tables to allow joins with core study information, and refresh the temp tables on a frequent basis. This feature requires substantial design, research, and costing work.

Unprioritized Improvements



- Allow selecting individual tables/data sets to publish from the schema/library.



- Provide a better way to publish a schema to multiple folders. One mechanism would be to (optionally) inherit a schema definition through the folder hierarchy. A step beyond that: we could create the definition at one level and allow administrators to restrict that definition further down the tree (e.g., publish all data sets from a library at the project level, then limit sub-folders to subsets of data sets).



- Add a new security role, “Data Manager,” and corresponding permissions, for determining who can create, modify, and delete schema definitions.



- Filter system schemas (e.g., pg_toast_temp_1 and information_schema on PostgreSQL; SASADMIN and SASUSER on SAS) out of the schema list when defining or updating an external schema.



- Tailor the UI so it uses “library” and “data set” when a SAS data source is selected.





- Provide a way for a site admin to set a “friendly name” associated with each data source. This name would appear in the data source drop-down and potentially in the schema browser.



- Partition the schemas in the schema browser by data source. Also, display the data source friendly names associated with each schema.



- Add defining, editing, and deleting schemas to the audit trail.



- Choosing data sets to display. The admin UI can limit the data sets available in a folder, but it may be desirable to customize a query web part to display links to a subset of the available data sets.



- Specific support for SCHARP’s ADM facility.



- Support for automated testing of SAS data set publishing.



- Performance improvements.



- Documentation.

v9.3 Limitations

Most of these limitations have been mentioned elsewhere in this document, but for completeness, here is a list of the major limitations in v9.3:



- No support for SAS-specific features (e.g., SAS display formats and SAS lookup formats).

- No support for SAS data set snapshots (i.e., pulling data into the Atlas database).

- No prevention or detection of cross-database join attempts; attempts to join between different databases will result in very unfriendly server error messages.

- No performance testing.

- Limited documentation.

- Very limited functional testing.

- No ability to selectively publish data sets (entire libraries only).