Jan 31, 2013 (4 years and 4 months ago)

© 2008 LabKey Software

www.labkey.com

LabKey Server: An Open-Source Platform for Scientific Data Integration

Mark Igra

marki@labkey.com

LabKey Software

2010

Rough Plan for this Talk


LabKey Background


Quick Case Studies


Architecture


Extending and Customizing LabKey Server


Future Directions


Why Scientific Data Integration


Data Volume


Hundreds of millions of results from thousands of high-throughput assay runs


Data Variety


Clinical, Demographic, Assay & Specimen Data


Collaboration


Investigators aren’t all in the same place


Need secure data sharing


Scientific Data is not business data


Standard database & web toolkits aren’t optimized for
the work scientists do

What LabKey Server Does


A web-based platform to Organize, Analyze & Share Scientific Data


Data integration analysis for assays


Proteomics, Flow Cytometry, Genotyping, etc.


Study Data Management


Combine Demographic, Clinical, Assay & Specimen
Data


Secure data sharing


Fine control over permissions to view and write data


Within an organization


Between multiple organizations


Share data in existing databases

Open Source Solution


LabKey Server is Open Source Software


Server can be downloaded and installed for free


Source is publicly available


Development process is open


Why Open Source?


Visibility leads to quality


Openness to contribution leads to rapid evolution


Minimize IP issues


Eases collaboration with scientists


Company Overview


LabKey Software is a consulting company


Spun off from the McIntosh Lab (partly owned by FHCRC)


Professional software engineers from Amazon, Microsoft, BEA, etc.


Development & support around LabKey Server


Extending the base LabKey Server platform


Creating customized lab-specific solutions


Hosting LabKey Server


Support


Work in partnership with scientists


For-profit fee-for-service contracts


Non-profit grant subawards


Co-investigators with a shared research agenda


All development approved by and relevant to FHCRC

Examples


Proteomics.fhcrc.org (FHCRC Shared Resources)


Atlas Science Portal (SCHARP)


Studies


Specimen Management


Lab Data


Primate Electronic Health Record
(Wisconsin/NCRR)


MHC Genotyping (Wisconsin/NIAID)


Proteomics.fhcrc.org


Problems Faced


The number of reads from MS2-based proteomics assays was exploding & difficult to handle with existing tools


Many assay runs needed to be done to fulfill NCI contract for
early biomarker detection


Running multistep analysis on cluster was difficult to
optimize


Solution


The original LabKey Server (CPAS)


Toolkit to run MS2 analysis jobs via web browser and load
results into Database


Web based analysis tools to view, combine & share results


Run by FHCRC Shared Resources


Lots of Data


More than 75,000 MS2 Runs


More than 600,000,000 peptide identifications

Proteomics Screen Shots

Atlas


SCHARP Data Portal


Problems faced


Combine many data types for HIV Vaccine studies


Case Report Forms (CRF), Specimens, Many Assays


Enable secure collaboration for scientists worldwide


Provide management tools for allocating & distributing
valuable specimens


Solution


Secure web portal for organizing and sharing HIV Vaccine
Enterprise Data


Used by several networks to share data


CHAVI, CAVD, HVTN, HPTN


Core software was written by LabKey


SCHARP runs Atlas


Defines available data and relationships


Manages security and permissions


Manages data loading


Builds custom modules


Atlas Data Flows

Data Providers: Labs, LIMS, Assays, Forms (CRF: DOB, BP sys, BP dia, Notes), Sample Info

Atlas Web + Database Servers

Data Viewers: Leadership, Collaborators, Labs

Sample Info (example rows shown on the slide):

labid  uspeci  txtpid    parusp  drawdm  drawdd  drawdy
262    4633    9.99E+08  4632    3       23      2005
308    13472   9.99E+08  13471   3       23      2005
262    4641    9.99E+08  4640    4       14      2005
262    4650    9.99E+08  4649    5       9       2005
262    4652    9.99E+08  4651    5       11      2005
308    13480   9.99E+08  13479   5       11      2005
262    4668    9.99E+08  4667    4       13      2005
308    13486   9.99E+08  13485   4       13      2005
262    4684    9.99E+08  4683    5       11      2005
262    4751    9.99E+08  4750    6       21      2005
308    13560   9.99E+08  13559   6       21      2005
262    4769    9.99E+08  4768    6       28      2005
308    13578   9.99E+08  13577   6       28      2005
262    4850    9.99E+08  4849    6       1       2005
262    4897    9.99E+08  4896    8       17      2005
262    4898    9.99E+08  4896    8       17      2005
262    4914    9.99E+08  4913    8       30      2005
308    13933   9.99E+08  13932   8       30      2005
262    4922    9.99E+08  4921    9       1       2005
308    13941   9.99E+08  13940   9       1       2005
262    4934    9.99E+08  4933    9       14      2005
308    13953   9.99E+08  13952   9       14      2005
262    4935    9.99E+08  4933    9       14      2005
308    13954   9.99E+08  13952   9       14      2005
262    4950    9.99E+08  4949    9       28      2005
308    13969   9.99E+08  13968   9       28      2005
262    4951    9.99E+08  4949    9       28      2005
308    13970   9.99E+08  13968   9       28      2005
307    773     9.99E+08  772     8       24      2005
307    774     9.99E+08  772     8       24      2005
307    800     9.99E+08  799     8       30      2005
307    801     9.99E+08  799     8       30      2005

Data Sizes


Example: CHAVI 001


50 Datasets (CRFs and Assay Results)


800+ Participants


12000+ Participant/Visits


75000 CRF responses


250000 Specimen Vials


Overall


>20K Participants/Subjects


>30K Assay runs


>800K Specimens


>1200 Datasets

Usage Statistics, June 2008 to Present


~2,800 Registered Users


~300 organizations


~1,400 Unique Visitors/Month


~35 countries


~5,000 Visits/Month


Average visit length 13 minutes

Primate Electronic Health Record


Problem


30 years of daily data on thousands of animals


Clinical staff & vets need health record


Researchers need scientific data


Colony Management is an ongoing problem


Solution


EHR is a specialized, ongoing “study”


LabKey enhanced scalability of study solution


LabKey wrote ETL tools to transfer data into new EHR
within minutes of entry into old EHR


Wisconsin Primate Center built custom views & reports
to analyze the data


Ongoing project
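
The near-real-time transfer from the old EHR to the new one could be approximated by an incremental sync that copies only records changed since the last pass. The sketch below is a minimal illustration under stated assumptions, not LabKey's actual ETL code: the `Record` type, the `sync_once` function, the integer timestamps and the sample payloads are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Record:
    """Hypothetical EHR row; field names are illustrative only."""
    record_id: int
    modified: int  # modification time, e.g. seconds since the epoch
    payload: str


def sync_once(source: Iterable[Record], target: Dict[int, Record], last_synced: int) -> int:
    """Copy records modified after last_synced into target; return the new high-water mark."""
    new_mark = last_synced
    for rec in source:
        if rec.modified > last_synced:
            target[rec.record_id] = rec  # insert or overwrite (upsert)
            new_mark = max(new_mark, rec.modified)
    return new_mark


# The "old EHR" is modeled as an append-only change log.
old_ehr = [
    Record(1, 100, "weight 7.2 kg"),
    Record(2, 160, "TB test negative"),
]
new_ehr: Dict[int, Record] = {}

mark = sync_once(old_ehr, new_ehr, last_synced=0)     # first pass copies both rows
old_ehr.append(Record(1, 220, "weight 7.4 kg"))       # record 1 is edited later
mark = sync_once(old_ehr, new_ehr, last_synced=mark)  # second pass copies only the edit
```

Running `sync_once` on a short timer keeps the target within one polling interval of the source, which is one plausible way to get the "within minutes" behavior described on the slide.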



Primate Electronic Health Record


Genotyping (Wisconsin)


Problem


High throughput sequencing assay


Building workflows in the Galaxy sequence analysis tool (Giardine et al., Genome Res 2005)


Custom pipeline to match sequences to known MHC alleles


Need to store and analyze data from many runs & tie
back to other animal information


Solution


Initiate assay runs in LabKey


LabKey communicates with Galaxy to perform Analysis


LabKey loads results & provides custom views


Solution will be available to all LabKey users


Genotyping Pipeline

More Assays


Neutralizing Antibodies & Flow Cytometry


Problems


Existing tools work for one run at a time


Need support for large data volumes


Need better tracking of Lab Workflows


Solutions


Load data from existing


Calculate (or sometimes just load result)


Provide custom analysis online

Neutralizing Antibody Assay View


Details of a single sample

Basic Architecture


Java web application


Runs on Apache Tomcat web server


Compatible with Windows, Linux, Solaris, Mac, etc


Incorporates open-source libraries


Relational database server


PostgreSQL: open-source, all common operating systems


Microsoft SQL Server: commercial product, Windows only


Abstraction layer allows other database servers in future


Network file storage


Analysis pipeline: conversion, search, processing


Secure WebDAV access to pipeline and data


LabKey Basic Architecture

Tomcat Web Server

File System

SQL Database (PostgreSQL, MS SQL)

LabKey Schemas

LabKey Data Connectivity

Tomcat Web Server

File System

LabKey PostgreSQL/MS SQL Database

LabKey Schemas

More Schemas

File System 2

SAS Share

Data 1

Data 3

Other PostgreSQL Database

MS SQL Database

MySQL

LabKey Architecture

Base Services (Security, Database Access, Full Text Search)

Site Admin

Data Storage (Relational Database + File System)

Experiment Services

Portal / Wiki

CPAS (Proteomics)

Flow Cytometry

Text Assays

Custom Assay 1

Custom Assay 2

Interaction Layer: Query, APIs, Custom Views

Studies (Integrate Many Data Types)

Legend (color-coded on the slide): shared services, modules

Sites, Projects & Folders


A Site is an installation of LabKey


Web Server, Database Server, File System


Each Site contains multiple projects


Project usually represents a group of people working
together


Project could be for 1 lab or an entire research consortium


Projects contain folders


Like a standard Mac/Windows folder, a LabKey folder can
contain files


LabKey folders can contain tabular data as well as files


Every folder has a “home page”


Folders are the basic unit of security


Study security is even more fine-grained


Projects are just special top-level folders
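
The containment rules on this slide (a site holds projects, projects hold folders, a project is just a top-level folder) can be sketched as a small tree. This is an illustrative model only, not LabKey's implementation: the `Folder` class, its fields, and the project, folder, file and list names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Folder:
    """Hypothetical container that can hold files, tabular data and sub-folders."""
    name: str
    parent: Optional["Folder"] = field(default=None, repr=False, compare=False)
    files: List[str] = field(default_factory=list)
    tables: Dict[str, List[dict]] = field(default_factory=dict)
    children: Dict[str, "Folder"] = field(default_factory=dict)

    def add_folder(self, name: str) -> "Folder":
        child = Folder(name=name, parent=self)
        self.children[name] = child
        return child

    @property
    def path(self) -> str:
        # The site root is "/"; every other folder appends its name to its parent's path.
        if self.parent is None:
            return "/"
        return f"{self.parent.path.rstrip('/')}/{self.name}"

    @property
    def is_project(self) -> bool:
        # A project is simply a folder directly under the site root.
        return self.parent is not None and self.parent.parent is None


site = Folder(name="site")                         # one installation
project = site.add_folder("VaccineConsortium")     # top-level folder = project
lab = project.add_folder("FlowLab")                # ordinary sub-folder
lab.files.append("run_001.fcs")                    # folders hold files ...
lab.tables["Inventory"] = [{"reagent": "buffer A", "count": 12}]  # ... and tabular data
```

Because every folder has a unique path, the path is a natural key for attaching a security policy to a folder.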


Folders, Files and Data

Data and files are visible in folders

Diagram labels: Folder 1, Folder 2, Files, Proteomics, Flow

Creating a Project and Adding Data


Create Project


Set Folder Type & Security


Create Sub-folder


Customize Web Parts


Create Lists


Analytes


Inventory


Sort & Filter


Custom Views


Custom Queries


LabKey Data Model


Semantic Web (RDF) has some great ideas


Any resource can have arbitrary properties


Properties can be described with ontologies


All resources have unique names


Sophisticated inference-based query model


But LabKey Presents a Relational Model


Tables of related data


Familiar SQL Queries


Transactions for data integrity


Plus some semantic web concepts


Easy to define & extend data types on the fly


All fields can be annotated with concepts, descriptions


Data storage is a mix of “hard tables” and a property store


Moving more toward “hard tables”
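
The mix of "hard tables" and a property store can be pictured as a fixed-column table plus a name/value table for fields added on the fly. The sketch below uses SQLite from Python's standard library purely as a stand-in; the table names, column names and sample values are assumptions made for illustration and do not describe LabKey's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- "Hard table": fixed, typed columns known in advance.
CREATE TABLE specimen (
    id      INTEGER PRIMARY KEY,
    subject TEXT NOT NULL
);
-- Property store: arbitrary name/value pairs attached to any row.
CREATE TABLE property (
    object_id INTEGER NOT NULL,
    name      TEXT NOT NULL,
    value     TEXT,
    PRIMARY KEY (object_id, name)
);
""")
conn.execute("INSERT INTO specimen (id, subject) VALUES (1, 'A21')")
conn.executemany(
    "INSERT INTO property (object_id, name, value) VALUES (?, ?, ?)",
    [(1, "volume_ml", "0.5"), (1, "tube_type", "EDTA")],
)


def load_specimen(conn, specimen_id):
    """Merge the hard-table columns with any extra properties into one record."""
    row = conn.execute(
        "SELECT id, subject FROM specimen WHERE id = ?", (specimen_id,)
    ).fetchone()
    if row is None:
        return None
    record = {"id": row[0], "subject": row[1]}
    for name, value in conn.execute(
        "SELECT name, value FROM property WHERE object_id = ? ORDER BY name",
        (specimen_id,),
    ):
        record[name] = value
    return record
```

The trade-off is the usual one: the property store makes new fields cheap to add, while hard tables give typed columns and simpler, faster queries, which may be why the slide describes a move toward hard tables.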

LabKey SQL


Most analysis & viewing of data done without SQL


Easy online tools to join, filter & sort data


Sometimes people need more access


LabKey SQL presents a broad subset of SQL for
querying underlying data model


Joins, Grouping, Functions


SQL runs against a “synthesized schema”


Visible data is dependent on user permissions


Modules can present “views” of data that do not
physically exist in the database


LabKey translates queries into the native SQL dialect of the database server


Prevents SQL Injection
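
As a rough illustration of the kind of query meant by "Joins, Grouping, Functions", here is a join with aggregation run against an in-memory SQLite database. SQLite and its dialect are only a stand-in, since LabKey SQL is its own dialect, and the tables, columns, site names and values are invented for the example. The bound parameter shown here is a different mechanism from LabKey's query translation, but it illustrates the same goal of keeping user input out of the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE participant (participant_id TEXT PRIMARY KEY, site TEXT);
CREATE TABLE assay_result (participant_id TEXT, assay TEXT, value REAL);
INSERT INTO participant VALUES ('A21', 'Seattle'), ('A22', 'Seattle'), ('A23', 'Madison');
INSERT INTO assay_result VALUES
    ('A21', 'Flow', 1.5), ('A21', 'Flow', 2.5),
    ('A22', 'Flow', 4.0), ('A23', 'ELISA', 0.7);
""")


def mean_by_participant(conn, assay):
    """Join participants to their results and aggregate per participant."""
    sql = """
        SELECT p.participant_id, p.site, COUNT(*) AS n, AVG(r.value) AS mean_value
        FROM participant AS p
        JOIN assay_result AS r ON r.participant_id = p.participant_id
        WHERE r.assay = ?              -- bound parameter, never string-concatenated
        GROUP BY p.participant_id, p.site
        ORDER BY p.participant_id
    """
    return conn.execute(sql, (assay,)).fetchall()


flow_summary = mean_by_participant(conn, "Flow")
```

Passing a hostile string such as `"Flow' OR '1'='1"` as the assay name simply matches no rows, because the value is compared as data rather than parsed as SQL.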


Security Model


All data access is mediated by a standard security
model


Authentication


Password based


Will link to an LDAP server for internal users (e.g. HutchNET)


Passwords for external users are stored encrypted


Authorization


Every resource has a security policy


Policy maps users & groups to roles for that resource


Role defines a standard set of permissions


Modules can define new permissions & roles


Data


Individual data sets can be secured


Data can be partitioned across folders for security
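
The authorization model described above (each resource has a policy, a policy maps users and groups to roles, a role is a standard set of permissions) can be sketched in a few lines. This is a simplified illustration, not LabKey's API: the class names, the role names "Reader" and "Editor", the permission names, and the user and group names are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, FrozenSet, Iterable, Set


class Permission(Enum):
    READ = auto()
    INSERT = auto()
    UPDATE = auto()
    DELETE = auto()


# A role is a named, standard set of permissions (names are illustrative).
ROLES: Dict[str, FrozenSet[Permission]] = {
    "Reader": frozenset({Permission.READ}),
    "Editor": frozenset(
        {Permission.READ, Permission.INSERT, Permission.UPDATE, Permission.DELETE}
    ),
}


@dataclass
class SecurityPolicy:
    """Maps principals (user or group names) to role names for one resource."""
    resource: str
    assignments: Dict[str, Set[str]] = field(default_factory=dict)

    def assign(self, principal: str, role: str) -> None:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        self.assignments.setdefault(principal, set()).add(role)

    def permissions_for(self, user: str, groups: Iterable[str] = ()) -> Set[Permission]:
        # Union the permissions of every role held directly or through a group.
        granted: Set[Permission] = set()
        for principal in (user, *groups):
            for role in self.assignments.get(principal, ()):
                granted |= ROLES[role]
        return granted

    def allows(self, user: str, permission: Permission, groups: Iterable[str] = ()) -> bool:
        return permission in self.permissions_for(user, groups)


policy = SecurityPolicy(resource="/VaccineConsortium/FlowLab")
policy.assign("lab-staff", "Editor")        # a group
policy.assign("bob@example.org", "Reader")  # an individual user
```

A module that needs a new permission would, in this model, add a member to `Permission` and a new entry to `ROLES`; users with no assignment on a resource get an empty permission set and so see nothing.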


Where’s the Science?


Experiment/Assay Support


Data Processing Pipeline


Studies


Specimen Support


Extended Data Types Support


Missing Values


Out of Range Values


Multi-valued field display


Pre-built modules


Study


Proteomics (CPAS)


Flow Cytometry


Many more

Assay Support


FuGE experimental annotation model


General purpose way of describing assays connecting
samples, data and results


Built in Assay Structure


Batches, Runs, Results


Study and Specimen Integration


Given a specimen ID, assay will infer subject/date


Data can be “contributed” to a study


General Purpose Assay Tool


Load tabular data from text or Excel files


Customized Assays


Built by LabKey or Others


Can be created without Java Code
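
The "given a specimen ID, assay will infer subject/date" step amounts to a lookup from each result row into the specimen records. A minimal sketch follows; the `Specimen` type, the `annotate_results` function, the specimen IDs and the `titer` field are hypothetical names chosen for the example, not LabKey identifiers.

```python
from datetime import date
from typing import Dict, List, NamedTuple, Tuple


class Specimen(NamedTuple):
    subject: str
    draw_date: date


def annotate_results(
    results: List[dict], specimens: Dict[str, Specimen]
) -> Tuple[List[dict], List[dict]]:
    """Attach subject and date to each result row via its specimen ID.

    Rows whose specimen ID is unknown are returned separately so they can be reviewed.
    """
    resolved, unresolved = [], []
    for row in results:
        specimen = specimens.get(row["specimen_id"])
        if specimen is None:
            unresolved.append(row)
        else:
            resolved.append(
                {**row, "subject": specimen.subject, "date": specimen.draw_date}
            )
    return resolved, unresolved


SPECIMENS = {
    "S-100": Specimen("A21", date(2008, 1, 1)),
    "S-101": Specimen("A22", date(2008, 2, 1)),
}
RESULTS = [
    {"specimen_id": "S-100", "titer": 320},
    {"specimen_id": "S-999", "titer": 40},  # not in the specimen records
]
resolved, unresolved = annotate_results(RESULTS, SPECIMENS)
```

Once a row carries a subject and a date, it has the same key as the rest of the study data, which is what makes it possible to contribute assay results to a study.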

Study Data Integration


LabKey server accepts data from a variety of
systems


Case Report Form data


Specimen management systems


Assay Data


Directly from hardware or via other software tools


Ties study data together into a unified
representation of the study


Scientists can query, chart & analyze the data


Includes comprehensive online analysis tools


Integrates R statistical package


Users can export data for offline analysis

View Data By Data Type

MHC: A21 (1/1/08); A22 (1/1/08)

Clinical: A21 (1/1/08, 2/1/08, 3/1/08, 6/1/08, 7/15/08, 8/31/08); A22 (1/1/08, 2/1/08); A23 (6/1/08, 7/15/08, 8/31/08)

Flow: A21 (1/1/08, 6/1/08, 7/15/08, 8/31/08); A22 (1/1/08); A23 (6/1/08, 7/15/08, 8/31/08)

Or View By Subject

A21: ELISA (1/1/08, 2/1/08, 3/1/08, 6/1/08, 7/15/08, 8/31/08); FLOW (1/1/08, 6/1/08, 7/15/08, 8/31/08); MHC (1/1/08)

A22: ELISA (1/1/08, 2/1/08); FLOW (1/1/08); MHC (1/1/08)

A23: ELISA (6/1/08, 7/15/08, 8/31/08); FLOW (6/1/08, 7/15/08, 8/31/08)
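
The two views (by data type, then by subject) are two groupings of the same underlying records. The sketch below shows that pivot on a small subset of the subject/date pairs from the slides; the function names and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str, str]  # (data type, subject, date)

# A subset of the rows shown on the slides.
RECORDS: List[Record] = [
    ("MHC", "A21", "1/1/08"),
    ("MHC", "A22", "1/1/08"),
    ("Flow", "A21", "1/1/08"),
    ("Flow", "A21", "6/1/08"),
    ("Flow", "A22", "1/1/08"),
    ("Flow", "A23", "6/1/08"),
]


def by_data_type(records: List[Record]) -> Dict[str, List[Tuple[str, str]]]:
    """One list of (subject, date) rows per data type."""
    view = defaultdict(list)
    for data_type, subject, day in records:
        view[data_type].append((subject, day))
    return dict(view)


def by_subject(records: List[Record]) -> Dict[str, Dict[str, List[str]]]:
    """For each subject, the dates available under each data type."""
    view = defaultdict(lambda: defaultdict(list))
    for data_type, subject, day in records:
        view[subject][data_type].append(day)
    return {subject: dict(types) for subject, types in view.items()}
```

Both functions read the same `RECORDS` list, so a row loaded once shows up in either view without being stored twice.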

Simplified Schema

Participant

Participant Visits

Site

Forms

Assay Results

Specimens

Specimen Requests

Lab

Repository

Extending LabKey Server


LabKey Platform has many levels of
customization


Folder Home Page Customization


Look & Feel


Custom Data Types


Lists, Assays, Study Datasets


Custom Views


R


SQL


JavaScript


API Access (JavaScript, Java, R, SAS, Perl)


Custom Modules


All of the above + portability & code packaging


LabKey API


Designed for easy use from JavaScript (AJAX)


Almost everything on the server can be accessed via a consistent, data-centric model


Use Ext JS framework for building custom user
interfaces


LabKey has custom versions of Ext components for
easy connection to LabKey Server


Client Libraries for a variety of languages


R, SAS, Perl, Java, JavaScript


Files accessible via standard WebDAV protocol


Mac Finder can treat LabKey as a file store


Use Web Drive to connect from Windows

Future Directions


Visualization


Easier Out of the Box Experience


Task Management


Easier Data Management

Some Relevant LabKey Papers


Rauch, et al. Computational Proteomics Analysis System (CPAS): an extensible, open-source analytic system for evaluating and publishing proteomic data and high throughput biological experiments. J Proteome Res. 2006 Jan;5(1):112-21.


Shulman, et al. Development of an automated analysis system for data from flow cytometric intracellular cytokine staining assays from clinical vaccine trials. Cytometry A. 2008 Sep;73(9):847-56.


Jones, et al. The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol. 2007 Oct;25(10):1127-33.



Acknowledgements


LabKey Development


Britt Piehler


Adam Rauch


Josh Eckels


Matthew Bellew


Karl Lum


Peter Hussey


Kevin Krouse


Nick Arnold


Michael Newton


More LabKey


Elizabeth Nelson (Docs)


Steven Hanson (Docs)


Trey Chaddick (Test)


Ren Lis (Administration)


Ex LabKey


Brendan MacLean


David Stearns


Nick Shulman





Martin McIntosh


Computational Biology


Steve Self


SCHARP


Sarah Ramsay


Cory Nathe


Cassy Jarvis


Funders


Early Detection Initiative (NCI)


Canary Foundation


SCHARP


CAVD (Gates Foundation, via SCHARP)


HVTN


CHAVI (NIH)


University of Wisconsin (NIAID, NCRR)



