Extreme Scale Analytics on Spatio-Temporal Datasets


Oct 16, 2013



Joel Saltz

Center for Comprehensive Informatics &
Biomedical Informatics Department

Emory University

Morphometric Image Analysis Pipeline


Preprocessing: normalization, tiling, etc.


Segmentation: identify nuclei as objects


Feature Extraction: compute morphometric features


Classification: unsupervised learning (k-means) after patient-level aggregation and analysis
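
The four stages above can be sketched end to end in a few lines. This is a minimal illustration in plain Python, not the pipeline's actual implementation: tiling is omitted, segmentation is assumed to be a simple intensity threshold followed by connected-component labeling, the only features are object area and mean intensity, and all function names (`normalize`, `segment`, `extract_features`, `aggregate_patient`, `kmeans`, `run_pipeline`) are hypothetical.

```python
from collections import deque
from statistics import mean


def normalize(image):
    """Preprocessing: rescale pixel intensities to the range [0, 1]."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero on a uniform image
    return [[(v - lo) / span for v in row] for row in image]


def segment(image, threshold=0.5):
    """Segmentation: 4-connected components of above-threshold pixels."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    objects = []
    for r in range(h):
        for c in range(w):
            if image[r][c] < threshold or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill of one object
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and image[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(pixels)
    return objects


def extract_features(image, objects):
    """Feature extraction: (area in pixels, mean intensity) per object."""
    return [(float(len(px)), mean(image[y][x] for y, x in px)) for px in objects]


def aggregate_patient(features):
    """Patient-level aggregation: mean of each feature over all objects."""
    return tuple(mean(col) for col in zip(*features))


def kmeans(points, k, iterations=20):
    """Classification: plain Lloyd's k-means seeded with the first k points."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(mean(col) for col in zip(*members))
    return labels


def run_pipeline(images_by_patient, k=2):
    """Run all stages; returns {patient id: cluster label}."""
    patients = sorted(images_by_patient)
    summaries = []
    for patient in patients:
        image = normalize(images_by_patient[patient])
        summaries.append(aggregate_patient(extract_features(image, segment(image))))
    return dict(zip(patients, kmeans(summaries, k)))
```

Clustering runs on one summary vector per patient rather than on individual nuclei, which mirrors the "after patient-level aggregation" ordering in the last bullet.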

Satellite Data Analysis for Monitoring and Change Analysis

Subsurface Reservoir Management


Numerical models of porous media


Fluids flow from one region of the reservoir to another


Rock and sediment properties change over time


Simulate multiple realizations of multiple models and
management strategies


Evaluate geologic uncertainty and management
strategies simultaneously


Enable on-demand exploration and comparison of multiple scenarios
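
The idea of evaluating geologic uncertainty and management strategies together can be sketched as a cross product of model realizations and strategies. In the sketch below, `simulate` is a toy stand-in with made-up dynamics, not a porous-media model, and the parameter names (`permeability`, `injection_rate`) and helper functions are assumptions chosen only for illustration.

```python
from itertools import product
from statistics import mean, pstdev


def simulate(permeability, injection_rate, steps=10):
    """Toy stand-in for a reservoir simulator (NOT real physics).

    Returns the cumulative produced volume over `steps` time steps.
    """
    pressure, produced = 1.0, 0.0
    for _ in range(steps):
        flow = permeability * pressure
        produced += flow
        # Injection raises pressure; production depletes it.
        pressure = max(pressure + injection_rate - 0.1 * flow, 0.0)
    return produced


def evaluate_ensemble(realizations, strategies):
    """Run every (realization, strategy) pair; returns {(r, s): outcome}."""
    results = {}
    for (r_name, perm), (s_name, rate) in product(realizations.items(), strategies.items()):
        results[(r_name, s_name)] = simulate(perm, rate)
    return results


def summarize_by_strategy(results):
    """Per strategy: (mean outcome, spread across geologic realizations)."""
    by_strategy = {}
    for (_, s_name), value in results.items():
        by_strategy.setdefault(s_name, []).append(value)
    return {s: (mean(v), pstdev(v)) for s, v in by_strategy.items()}
```

Keeping the full `results` dictionary, rather than only the summary, is what allows any pair of scenarios to be pulled up and compared on demand.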


Core Operation Categories and Patterns

Core Operation Category: Data Cleaning and Low Level Transformations
Operations: Transformations to reduce effects of sensor/measurement artifacts. Transform sensor acquired measurements to domain specific variables.
Data Access Patterns and Computational Complexity: Mainly local and regular data access patterns. Moderate computational complexity.

Core Operation Category: Data Subsetting, Filtering, Subsampling
Operations: Select portions of a dataset corresponding to regions in atlas and/or time intervals. Select portions of a dataset based on value ranges (e.g., regions with temperature larger than X degrees). Subsample data to reduce resolution and data size.
Data Access Patterns and Computational Complexity: Local data access patterns as well as indexed access. Low to moderate, mainly data intensive computations.

Core Operation Category: Spatio-temporal Mapping and Registration
Operations: Map datasets to an atlas. Resolve data redundancy at tile boundaries to form mosaics. Create composite dataset from multiple spatially co-incident datasets. Create derived dataset from spatially co-incident datasets obtained at different times.
Data Access Patterns and Computational Complexity: Irregular local and global data access patterns. Moderate to high computational complexity.

Core Operation Category: Object Segmentation
Operations: Segment “base level” objects such as nuclei, buildings, lakes. Extract features from “base level” objects.
Data Access Patterns and Computational Complexity: Irregular, but primarily local, data access patterns. High computational complexity.

Core Operation Category: Object Classification
Operations: Classify “base level” objects through possibly iterative combination of clustering, machine learning and human input (active learning).
Data Access Patterns and Computational Complexity: Irregular and global data access patterns. High computational complexity.

Core Operation Category: Spatio-temporal Aggregation
Operations: Construct “high level” objects composed of classified “base level” object aggregates, e.g., residential areas vs industrial complexes. Compute time-series aggregates over a given imaged area.
Data Access Patterns and Computational Complexity: Primarily local with a crucial global component for aggregation. Moderate/high computation complexity.

Core Operation Category: Change Detection, Comparison, and Quantification
Operations: Quantify changes over time in domain specific low level variables, base level objects and high level objects. Construct “change objects” to describe changes in low level domain specific variables, base level and high level objects. Spatial queries for selecting and comparing segmented regions and objects.
Data Access Patterns and Computational Complexity: Compute and data-intensive computations. Mixture of local and global data access patterns as well as indexed access.
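
Two of the categories above, subsetting/filtering/subsampling and change detection, are simple enough to illustrate directly. The sketch below assumes a dataset is a small in-memory 2D grid stored as a list of rows and that the two grids compared are already registered to the same atlas. The function names are hypothetical, and real datasets would be disk-resident and indexed rather than held in lists.

```python
def subset(grid, row_range, col_range):
    """Select the sub-grid covering rows [r0, r1) and columns [c0, c1)."""
    (r0, r1), (c0, c1) = row_range, col_range
    return [row[c0:c1] for row in grid[r0:r1]]


def filter_by_value(grid, threshold):
    """Return (row, col) positions whose value is larger than `threshold`."""
    return [
        (r, c)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value > threshold
    ]


def subsample(grid, step):
    """Keep every `step`-th row and column to reduce resolution and size."""
    return [row[::step] for row in grid[::step]]


def detect_change(before, after, min_delta):
    """Map each cell that changed by at least `min_delta` to its signed change.

    Assumes `before` and `after` are co-registered grids of equal shape.
    """
    return {
        (r, c): a - b
        for r, (row_b, row_a) in enumerate(zip(before, after))
        for c, (b, a) in enumerate(zip(row_b, row_a))
        if abs(a - b) >= min_delta
    }
```

The first three functions touch each cell once with regular, local access, while `detect_change` needs two datasets at the same location, which is where registration and indexed access come in.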

Challenges


Spatial-temporal disk-resident, on-the-fly, dynamically updated datasets


Access and manipulate multiple datasets generated and stored on multiple, distributed systems


Analysis of raw data can generate millions to trillions of
features (e.g., millions of cells and nuclei in high resolution
tissue images) to be mined and compared


Take advantage of hardware platforms for analysis


Clusters containing hybrid CPU-GPU nodes


Extreme scale machines consisting of hundreds of thousands of CPU
cores


Systems with deep memory and storage hierarchies


Cloud computing platforms

Using Hybrid CPU-GPU Systems

Data Structures: Region Templates


Describes 2D/3D static and temporal regions.


Provides a container for points, arrays, regions, and object sets within a spatial and temporal bounding box.


A region template can represent collections of spatial areas and objects where these entities vary from one another in size and shape; e.g., regions generated by segmenting cells in microscopy images, man-made structures or hurricanes in satellite imagery.


Primary datasets are defined as point data elements and arrays, and derived datasets as sets of regions and objects.


Region templates may be related to one another in a defined manner.
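
One way to picture a region template is as a container class. The sketch below is an assumption-laden illustration, not the actual region template API: the class and method names (`BoundingBox`, `RegionTemplate`, `add_region`, `relate`, `regions_within`) are hypothetical, regions are stored as lists of (x, y) vertices, and space is limited to 2D plus a time interval for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BoundingBox:
    """Spatial (x, y) extent plus a temporal interval [t_min, t_max]."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    t_min: float = 0.0
    t_max: float = 0.0

    def contains_point(self, x, y):
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


@dataclass
class RegionTemplate:
    """Container for primary and derived data inside one bounding box."""
    name: str
    bbox: BoundingBox
    points: list = field(default_factory=list)   # primary: point data elements
    arrays: dict = field(default_factory=dict)   # primary: named arrays
    regions: dict = field(default_factory=dict)  # derived: id -> vertex list
    objects: dict = field(default_factory=dict)  # derived: id -> attributes
    related: dict = field(default_factory=dict)  # relation name -> template

    def add_region(self, region_id, vertices):
        """Store a derived region; regions may differ in size and shape."""
        if not all(self.bbox.contains_point(x, y) for x, y in vertices):
            raise ValueError(f"region {region_id!r} lies outside the bounding box")
        self.regions[region_id] = list(vertices)

    def relate(self, relation, other):
        """Record a named relationship to another region template."""
        self.related[relation] = other

    def regions_within(self, query):
        """Ids of regions whose vertices all fall inside the query box."""
        return sorted(
            rid for rid, verts in self.regions.items()
            if all(query.contains_point(x, y) for x, y in verts)
        )
```

A template for one image tile could, for example, be related to the template for the same tile at a later time under a relation name such as `"next_time_step"`.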


Programming Abstractions and Runtime Middleware Services


Programming abstractions


Multi-level dataflow pipelines


MapReduce-style programs


Spatial query capabilities


I/O and Storage Services


Indexing and metadata management for ensembles of datasets


I/O support for retrieving data from multiple storage systems and for streaming data


Query capabilities


Memory Management


Careful management and staging of large data structures across memory hierarchies. Masking data movement costs with computation.


Execution Services


Distributing and rearranging computations and data to minimize data movement


Coordinated scheduling and mapping of analysis operations to heterogeneous and hybrid (CPU cores and GPUs) systems to increase overall application throughput


Quality of service/data requirements


Function variants


Provenance tracking, fault detection and tolerance
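
The interplay of function variants and coordinated CPU/GPU scheduling listed under Execution Services can be illustrated with a small simulation. The sketch assumes each operation comes with an estimated run time per device kind (its "variants") and uses a simple greedy earliest-finish-time heuristic. This is an illustrative stand-in, not the middleware's actual scheduler, and the `schedule` function and its argument shapes are hypothetical.

```python
def schedule(tasks, devices):
    """Greedily map operations to devices by earliest estimated finish time.

    tasks:   list of (operation name, {device kind: estimated seconds}),
             where each dict entry is one function variant
    devices: {device id: device kind}, e.g. {"cpu0": "cpu", "gpu0": "gpu"}

    Returns (plan, makespan), where plan is a list of
    (operation name, device id, estimated finish time).
    """
    free_at = {device: 0.0 for device in devices}  # when each device is idle
    plan = []
    for name, variants in tasks:
        # Only devices for which the operation has a variant are candidates.
        candidates = [
            (free_at[device] + variants[kind], device)
            for device, kind in devices.items()
            if kind in variants
        ]
        if not candidates:
            raise ValueError(f"no device can run operation {name!r}")
        finish, device = min(candidates)
        free_at[device] = finish
        plan.append((name, device, finish))
    return plan, max(free_at.values())
```

An operation with only a CPU variant is never sent to a GPU, while an operation with both variants goes wherever it is estimated to finish first, so a busy GPU can lose work to an idle CPU core even when the GPU variant is faster in isolation.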

End