Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs


Visiting Faculty: Jianting Zhang, The City College of New York (CCNY), jzhang@cs.ccny.cuny.edu

ORNL Host: Dali Wang, Climate Change Science Institute (CCSI), wangd@ornl.gov


Overview

As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs get higher, in-situ and/or post-processing of such large amounts of geospatial data increasingly becomes a bottleneck in scientific inquiries into Earth systems and their human impacts. Existing geospatial techniques based on outdated computing models (e.g., serial algorithms and disk-resident systems), as implemented in many commercial and open-source packages, are incapable of processing large-scale geospatial data and achieving the desired level of performance. Partially supported by DOE's Visiting Faculty Program (VFP), we are investigating a set of parallel data structures and algorithms capable of utilizing the massively data-parallel computing power available on commodity Graphics Processing Units (GPUs). We have designed and implemented a popular geospatial technique called Zonal Statistics, i.e., deriving histograms of points or raster cells in polygonal zones. Results have shown that our GPU-based parallel Zonal Statistics technique, applied to 3000+ US counties over 20+ billion NASA SRTM 30-meter resolution Digital Elevation Model (DEM) raster cells, has achieved impressive end-to-end runtimes: 101 seconds and 46 seconds on a low-end workstation equipped with an Nvidia GTX Titan GPU using cold and hot cache, respectively; 60-70 seconds using a single OLCF Titan computing node; and 10-15 seconds using 8 nodes.

[Figure: overall workflow. High-resolution satellite imagery, in-situ observation sensor data, and global/regional climate model outputs are combined through data assimilation; Zonal Statistics over ecological, environmental, and administrative zones (regions of interest, ROIs) produces temporal trends on a high-end computing facility, with work mapped to GPU thread blocks.]

Background & Motivation

Zonal Statistics on NASA Shuttle Radar Topography Mission (SRTM) Data


SQL:

SELECT COUNT(*)
FROM T_O, T_Z
WHERE ST_WITHIN(T_O.the_geom, T_Z.the_geom)
GROUP BY T_Z.z_id;


Point-in-Polygon Test

For each county, derive its histogram of elevations from raster cells that are in the polygon.


SRTM: 20*10^9 (billion) raster cells (~40 GB raw, ~15 GB compressed TIFF)

Zones: 3141 counties, 87,097 vertices

Brute-force point-in-polygon test

RT = (# of points) * (# of vertices) * (# of ops per point-in-polygon test) / (# of ops per second)
   = 20*10^9 * 87,097 * 20 / (10*10^9)
   ≈ 3.7*10^6 seconds ≈ 40 days


Using up all Titan’s 18,688 nodes: ~200 seconds


FLOPS utilization is typically low: it can be <1% for data-intensive applications (and is typically <10% in HPC).
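
To make the baseline cost concrete, below is a minimal CUDA sketch (an illustration, not the authors' code) of brute-force counting for a single zone polygon: one thread per raster cell, each running a standard even-odd (ray-crossing) point-in-polygon test over all polygon vertices. The names brute_force_count and pnpoly_even_odd are hypothetical.

// Brute-force baseline sketch: every cell is tested against every vertex of
// one polygon, which is exactly the O(#cells * #vertices) cost estimated above.
// Assumes a single zone polygon; names and data layout are illustrative.

// Standard even-odd (ray-crossing) point-in-polygon test over n vertices.
__device__ bool pnpoly_even_odd(int n, const float2 *v, float px, float py) {
    bool inside = false;
    for (int i = 0, j = n - 1; i < n; j = i++) {
        if (((v[i].y > py) != (v[j].y > py)) &&
            (px < (v[j].x - v[i].x) * (py - v[i].y) / (v[j].y - v[i].y) + v[i].x))
            inside = !inside;
    }
    return inside;
}

// One thread per raster cell; matching cells are counted with a global atomic.
__global__ void brute_force_count(const float2 *cells, int num_cells,
                                  const float2 *poly, int num_vertices,
                                  unsigned long long *count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_cells) return;
    if (pnpoly_even_odd(num_vertices, poly, cells[i].x, cells[i].y))
        atomicAdd(count, 1ULL);
}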

Hybrid Spatial Databases + HPC Approach

Observation: only points/cells that are close enough to polygons need to be tested.

Question: how do we pair neighboring points/cells with polygons?

Minimum Bounding Boxes (MBRs)

Step 1: divide a raster tile into blocks and generate per-block histograms.

Step 2: derive polygon MBRs and pair MBRs with blocks through a box-in-polygon test (inside/intersect; see the classification sketch after the figure below).

Step 3: aggregate per-block histograms into per-polygon histograms if blocks are within polygons.

Step 4: for each intersected polygon/block pair, perform a point (cell)-in-polygon test for all the raster cells in the block and update the respective polygon histogram.

[Figure: block/polygon spatial relationships: (A) intersect, (B) inside, (C) outside.]
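
A minimal CUDA device-side sketch of the Step 2 classification pictured above (inside/intersect/outside), assuming a simple polygon without holes and blocks no larger than the polygon; the names classify_box and seg_intersect are hypothetical, and pnpoly_even_odd refers to the test in the brute-force sketch earlier.

// Classify one raster block's bounding box against one zone polygon by
// combining edge/edge intersection tests with a corner point-in-polygon test.
// Degenerate collinear/touching cases are ignored for brevity.

enum BoxClass { BOX_OUTSIDE = 0, BOX_INSIDE = 1, BOX_INTERSECT = 2 };

struct Box { float xmin, ymin, xmax, ymax; };

// Even-odd point-in-polygon test (defined in the brute-force sketch above).
__device__ bool pnpoly_even_odd(int n, const float2 *v, float px, float py);

// 2D cross product of (a - o) and (b - o), used for orientation tests.
__device__ float cross2(float2 o, float2 a, float2 b) {
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

// Proper segment/segment intersection via orientation tests.
__device__ bool seg_intersect(float2 p1, float2 p2, float2 q1, float2 q2) {
    float d1 = cross2(q1, q2, p1), d2 = cross2(q1, q2, p2);
    float d3 = cross2(p1, p2, q1), d4 = cross2(p1, p2, q2);
    return ((d1 > 0) != (d2 > 0)) && ((d3 > 0) != (d4 > 0));
}

__device__ int classify_box(Box b, int n, const float2 *poly) {
    float2 c[4] = {{b.xmin, b.ymin}, {b.xmax, b.ymin},
                   {b.xmax, b.ymax}, {b.xmin, b.ymax}};
    // Any polygon edge properly crossing a box edge means "intersect".
    for (int i = 0, j = n - 1; i < n; j = i++)
        for (int k = 0; k < 4; k++)
            if (seg_intersect(poly[j], poly[i], c[k], c[(k + 1) % 4]))
                return BOX_INTERSECT;
    // No crossings: the whole box lies on one side, so one corner decides.
    return pnpoly_even_odd(n, poly, c[0].x, c[0].y) ? BOX_INSIDE : BOX_OUTSIDE;
}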


GPU Implementations

Identifying parallelisms and mapping to hardware:

(1) Deriving per-block histograms (sketch below)
(2) Block-in-polygon test
(3) Aggregating histograms for "within" blocks
(4) Point-in-polygon test for individual cells and histogram update (sketch further below)

[Figure: GPU thread blocks mapped to raster blocks; per-block histograms are updated with atomicAdd.]
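
A minimal sketch of Step 1, along the lines of the figure above: one GPU thread block builds the histogram of one raster block in shared memory with atomicAdd and then writes it out. The bin mapping, data layout, and the name per_block_histogram are assumptions, not the authors' kernel.

// Step 1 sketch: one GPU thread block builds the histogram of one raster block.
#define NUM_BINS 5000          // maximum histogram bins (see Parameters)

__global__ void per_block_histogram(const short *cells,   // elevation values
                                    int cells_per_block,  // e.g., 360*360
                                    int *block_hists)     // [gridDim.x][NUM_BINS]
{
    // Shared-memory histogram for this raster block.
    __shared__ int hist[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        hist[b] = 0;
    __syncthreads();

    // Threads stride over the block's cells; consecutive threads read
    // consecutive cells, giving coalesced global-memory accesses.
    const short *block_cells = cells + (size_t)blockIdx.x * cells_per_block;
    for (int i = threadIdx.x; i < cells_per_block; i += blockDim.x) {
        int bin = block_cells[i];              // assume value maps directly to a bin
        if (bin >= 0 && bin < NUM_BINS)
            atomicAdd(&hist[bin], 1);          // shared-memory atomic
    }
    __syncthreads();

    // Flush the per-block histogram to its slot in global memory.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        block_hists[(size_t)blockIdx.x * NUM_BINS + b] = hist[b];
}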










[Figure: point-in-polygon tests for each of a cell's 4 corners and all-pair edge intersection tests between polygon and cell (polygon vertices vs. cells), with perfectly coalesced memory accesses utilizing GPU floating-point power.]
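
A minimal sketch of Step 4 for a single "intersect" polygon/block pair: one thread per cell, a point-in-polygon test (on the cell center here, rather than all 4 corners as in the figure), and an atomicAdd update of the polygon's histogram. Names and data layout are hypothetical; pnpoly_even_odd is the test from the brute-force sketch.

// Step 4 sketch: refine one "intersect" (polygon, raster block) pair.
__device__ bool pnpoly_even_odd(int n, const float2 *v, float px, float py);
// (defined in the brute-force sketch above)

__global__ void refine_intersect_pair(const short *block_cells,   // elevations of this block
                                      const float2 *cell_centers, // matching cell coordinates
                                      int cells_per_block,
                                      const float2 *poly, int num_vertices,
                                      int *poly_hist,             // this polygon's histogram
                                      int num_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= cells_per_block) return;

    float2 p = cell_centers[i];
    if (!pnpoly_even_odd(num_vertices, poly, p.x, p.y))
        return;                               // cell falls outside the polygon

    int bin = block_cells[i];                 // assume elevation maps directly to a bin
    if (bin >= 0 && bin < num_bins)
        atomicAdd(&poly_hist[bin], 1);        // global-memory atomic update
}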

BPQ-Tree based raster compression to save disk I/O

Idea: chop an M-bit raster into M binary bitmaps and then build a quadtree for each bitmap (Zhang et al. 2011); a bitplane sketch follows below.

BPQ-Tree achieves a competitive compression ratio but is much more parallelization-friendly on GPUs.

Advantage 1: compressed data is streamed from disk to the GPU without requiring decompression on CPUs, reducing CPU memory footprint and data transfer time.

Advantage 2: it can be used in conjunction with CPU-based compression to further improve the compression ratio.
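
To illustrate the first half of the BPQ-Tree idea, here is a minimal CUDA sketch that chops a 16-bit raster into 16 binary bitmaps (each thread packs 32 cells of every bit plane into one 32-bit word); building the quadtree over each bitmap is omitted. The name to_bitplanes and the packing layout are assumptions.

// BPQ-Tree idea, first half: split an M-bit raster (M = 16 here) into M
// binary bitmaps, one bit plane each, packed 32 cells per 32-bit word.
#define M_BITS 16

__global__ void to_bitplanes(const unsigned short *cells, // M-bit raster values
                             size_t num_cells,
                             unsigned int *bitplanes)      // [M_BITS][num_words]
{
    size_t num_words = (num_cells + 31) / 32;
    size_t w = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= num_words) return;

    unsigned int packed[M_BITS] = {0};
    for (int k = 0; k < 32; k++) {
        size_t c = w * 32 + k;
        if (c >= num_cells) break;
        unsigned short v = cells[c];
        for (int b = 0; b < M_BITS; b++)          // scatter each bit to its plane
            packed[b] |= ((v >> b) & 1u) << k;
    }
    for (int b = 0; b < M_BITS; b++)
        bitplanes[(size_t)b * num_words + w] = packed[b];
}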

Experiments and Evaluations

Tile #   Dimension       Partition Schema
1        54000*43200     2*2
2        50400*43200     2*2
3        50400*43200     2*2
4        82800*36000     2*2
5        61200*46800     2*2
6        68400*111600    4*4
Total    20,165,760,000 cells    36 partitions

Data Format                    Volume (GB)
Original (raw)                 38
TIFF compression               15
gzip compression               8.3
BPQ-Tree compression           7.3
BPQ-Tree + gzip compression    5.5

Single Node Configuration 1: Dell T5400 workstation
Intel Xeon E5405 dual quad-core processor (2.00 GHz), 16 GB RAM, PCI-E Gen2, 3*500 GB 7200 RPM disks with 32 MB cache ($5,000)
Nvidia Quadro 6000 GPU, 448 Fermi cores (574 MHz), 6 GB, 144 GB/s ($4,500)

Single Node Configuration 2: do-it-yourself workstation
Intel Core i5-650 dual-core, 8 GB RAM, PCI-E Gen3, 500 GB 7200 RPM disk with 32 MB cache (recycled) ($1,000)
Nvidia GTX Titan GPU, 2688 Kepler cores (837 MHz), 6 GB, 288 GB/s ($1,000)

Parameters:

Raster chunk size (coding): 4096*4096
Thread block size: 256
Maximum histogram bins: 5000
Raster block size: 0.1*0.1 degree = 360*360 cells (resolution is 1*1 arc-second)

GPU Cluster: OLCF Titan

Results

Single Node    Cold Cache    Hot Cache
Config 1       180 s         78 s
Config 2       101 s         46 s

Runtime (s)                                         Quadro 6000   GTX Titan   8-core CPU
(Step 0) Raster decompression                       16.2          8.30        131.9
Step 1: Per-block histogramming                     21.5          13.4        /
Step 2: Block-in-polygon test                       0.11          0.07        /
Step 3: "Within-block" histogram aggregation        0.14          0.11        /
Step 4: Cell-in-polygon test and histogram update   29.7          11.4        /
Total (major steps)                                 67.7          33.3        /
Wall-clock end-to-end                               85            46          /

Titan

# of nodes     1      2      4      8      16
Runtime (s)    60.7   31.3   17.9   10.2   7.6

Data and Pre-processing