Scalable Data Management on High Throughput MALDI TOF Mass Spectrometer Poster Number WP 687


13 Dec 2013


[Figure: File Space Usage. X axis: Bins,Peaks (50k,25; 100k,25; 100k,50; 100k,250; 100k,1500; 250k,25; 250k,250; 250k,1500). Y axes: KiB / Spectrum (Peaks), ticks 0 to 60; KiB / Spectrum (RAW), ticks 0 to 2500. Series: Local mzML Peaks, Local SQLite Peaks, Local mzML RAW, Local SQLite RAW.]
George Mills, Matthew Gabeler-Lee
Virgin Instruments Corporation, Sudbury, MA
Introduction

Data management on high throughput mass spectrometers
rapidly becomes difficult due to the volume of data. With simple
file-based storage, organizing and locating data is difficult,
especially for audit and migration, and the large number of files
often causes performance problems. With storage wholly in a
database, organization is improved, but scalability over time is
poor: the monolithic system requires ever-increasing
knowledge and storage capacity to keep online, and data snapshots
for offline work, demos, and project transfers are difficult. We
describe a hybrid solution that pairs a central metadata
database with distributed file storage for raw data. It seeks to
overcome these scalability issues while also allowing offline
snapshots to ease data archival and migration.
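
The hybrid layout described above can be illustrated with a minimal sketch. It is not the authors' implementation: Python's built-in `sqlite3` module stands in for the central metadata database, and the table name `spectrum`, the functions `open_catalog`, `save_spectrum`, and `load_spectrum`, and the float32 `.bin` file layout are all hypothetical choices made for illustration.

```python
import sqlite3
import uuid
from array import array
from pathlib import Path


def open_catalog(path):
    """Open the metadata catalog (sqlite3 is a stand-in for a central server)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS spectrum ("
        " id INTEGER PRIMARY KEY,"
        " project TEXT NOT NULL,"
        " bins INTEGER NOT NULL,"
        " data_file TEXT NOT NULL)"
    )
    return db


def save_spectrum(catalog, data_root, project, intensities):
    """Write raw data to a file; record only fixed-size metadata centrally."""
    rel_path = Path(project) / f"{uuid.uuid4().hex}.bin"
    full_path = Path(data_root) / rel_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    # Hypothetical raw layout: consecutive 32-bit floats, one per bin.
    full_path.write_bytes(array("f", intensities).tobytes())
    cur = catalog.execute(
        "INSERT INTO spectrum (project, bins, data_file) VALUES (?, ?, ?)",
        (project, len(intensities), rel_path.as_posix()),
    )
    catalog.commit()
    return cur.lastrowid


def load_spectrum(catalog, data_root, spectrum_id):
    """Look up the file location in the catalog, then read the raw data."""
    (rel_path,) = catalog.execute(
        "SELECT data_file FROM spectrum WHERE id = ?", (spectrum_id,)
    ).fetchone()
    values = array("f")
    values.frombytes((Path(data_root) / rel_path).read_bytes())
    return list(values)
```

Because the catalog stores only a relative path, a project directory can be copied together with a catalog export to form an offline snapshot, which is the property the introduction emphasizes.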

Performance of saving acquisitions to local vs. network
shares. A slow fileserver was used for this test to
demonstrate the effectiveness of the outgoing spectrum
cache, which allows use of a slow or congested file
server at near-local speed (not fully optimized).
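
One way an outgoing cache of this kind could work is as a write-behind spool: the acquisition writes to local disk and returns, while a background thread moves files to the slow share. The sketch below assumes that design; the poster does not describe the actual mechanism, and the class `OutgoingCache` and its methods are hypothetical names.

```python
import queue
import shutil
import threading
from pathlib import Path


class OutgoingCache:
    """Write-behind spool: save locally now, copy to the share later."""

    def __init__(self, spool_dir, remote_dir):
        self.spool_dir = Path(spool_dir)
        self.remote_dir = Path(remote_dir)
        self.spool_dir.mkdir(parents=True, exist_ok=True)
        self.remote_dir.mkdir(parents=True, exist_ok=True)
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._upload_loop, daemon=True)
        self._worker.start()

    def save(self, name, payload):
        """Return as soon as the fast local write completes."""
        local_path = self.spool_dir / name
        local_path.write_bytes(payload)
        self._pending.put(local_path)

    def _upload_loop(self):
        while True:
            local_path = self._pending.get()
            try:
                if local_path is None:  # sentinel queued by close()
                    return
                # Copy under a temporary name so readers never see a partial file.
                partial = self.remote_dir / (local_path.name + ".part")
                shutil.copyfile(local_path, partial)
                partial.replace(self.remote_dir / local_path.name)
                local_path.unlink()
            except OSError:
                pass  # file stays in the spool; retry logic is omitted here
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every queued file has been processed."""
        self._pending.join()

    def close(self):
        """Process remaining files, then stop the worker thread."""
        self._pending.put(None)
        self._worker.join()
```

With this design the acquisition rate is limited by local disk speed as long as the spool does not fill, which is consistent with the near-local speed reported for the cached case.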

Our SQLite implementation is considerably faster than
the HUPO mzML format when the system is I/O
bound. At larger bin counts, the system becomes
CPU bound doing peak detection, and the choice of
format no longer affects performance.

Database usage is essentially flat with respect to peak
and bin count, as the database stores a fixed amount of metadata
per spectrum. The one negative point shows where
the database server automatically ran internal cleanup
and freed space.

Data storage requirement tradeoffs of HUPO mzML
vs. our SQLite format for both Peaks and Raw data.
The SQLite format is somewhat less efficient in some
scenarios, but is equivalent in the more common
cases.
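
A per-spectrum storage figure such as KiB / Spectrum can be obtained by dividing the size of the data file by the number of spectra it holds. The sketch below does this for a SQLite file, assuming a hypothetical single-table schema (`raw`, one float32 BLOB per spectrum); the schema actually used on the poster is not given, and `kib_per_spectrum` is an illustrative name.

```python
import os
import sqlite3
from array import array


def kib_per_spectrum(db_path, spectra):
    """Store each spectrum as a float32 BLOB and report KiB used per spectrum."""
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE raw (id INTEGER PRIMARY KEY, intensities BLOB)")
    db.executemany(
        "INSERT INTO raw (intensities) VALUES (?)",
        ((array("f", spectrum).tobytes(),) for spectrum in spectra),
    )
    db.commit()
    db.close()
    # File size includes SQLite page and overflow overhead, not just payload.
    return os.path.getsize(db_path) / len(spectra) / 1024
```

For 50,000 bins at 4 bytes per bin the payload alone is about 195 KiB, so any excess in the measured value reflects container overhead for this particular schema.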

Data rate tradeoffs of using HUPO mzML vs. our
SQLite format for both Peaks and Raw data. At large
bin counts, the system becomes CPU bound and the
file format no longer impacts performance.

[Figure: Basic Format Rates. X axis: Bins (0 to 300000). Y axis: Spectra / second (0 to 300). Series: Local mzML RAW, Local SQLite RAW, Local mzML 25 Peaks, Local SQLite 25 Peaks.]
[Figure: Detailed Peaks Speed. X axis: Bins,Peaks (50k,25; 100k,25; 100k,50; 100k,250; 100k,1500; 250k,25; 250k,250; 250k,1500). Y axis: Spectra / second (0 to 300). Series: Local mzML Peaks, Local SQLite Peaks.]
[Figure: Cache Effectiveness. X axis: Bins,Peaks,Form (100k,25,RAW; 250k,1500,RAW; 100k,250,Peaks; 250k,1500,Peaks). Y axis: Spectra / second (0 to 160). Series: Remote SQLite, Cached SQLite, Local SQLite.]
[Figure: DB Space Usage. X axis: Bins,Peaks (50k,25; 100k,25; 100k,50; 100k,250; 100k,1500; 250k,25; 250k,250; 250k,1500). Y axis: KiB / Spectrum (-0.15 to 0.3). Series: Local SQLite RAW, Local SQLite Peaks.]
[Chart annotation: System becomes CPU bound at large peak and bin counts.]

Conclusions and Future Work

This work demonstrates the feasibility of operating a hybrid storage solution for a high throughput
MALDI TOF mass spectrometer, even on a modestly performing intranet. It keeps
the overall projects organized in a central database without bogging that database down with
massive amounts of spectral data, while keeping administration fairly simple. Further work is
underway to build a secondary data analysis cache to facilitate high-performance 3D
chromatograms for applications such as imaging.

References/Technologies

1. SQLite is a public domain open source project; see http://www.sqlite.org/
2. PostgreSQL is released under a BSD-style license; see http://www.postgresql.org/

Acknowledgements

This work was supported in part by the National Institutes of Health under grants RR025705 and
GM079832.