Polygon Rasterization on GPGPUs


2011 Workshop on High Performance and Distributed Geographic Information Systems (HPDGIS’11)

19th ACM SIGSPATIAL GIS: Chicago, IL, Nov 1-4, 2011

Speeding Up Large-Scale Geospatial Polygon Rasterization on GPGPUs


Jianting Zhang

Department of Computer Science, the City College of New York

jzhang@cs.ccny.cuny.edu


Outline


Introduction and Motivations


Background and Related Works


The Serial Scan-Line Fill Algorithm


Preprocessing Polygon Collections



Efficient Polygon Rasterization on GPGPUs



Experiments and Results


Conclusion and Future Work


Introduction: Personal HPC-G

“Despite all these initiatives the impact of parallel GIS research has remained slight…”

“…fundamental problem remains the fact that creating parallel GIS operations is non-trivial and there is a lack of parallel GIS algorithms, application libraries and toolkits.”

A. Clematis, M. Mineter, and R. Marciano. High performance computing with geographical data. Parallel Computing, 29(10):1275-1279, 2003

Marrying GPGPU with GIS


The next generation: High-Performance GIS in a Personal Computing Environment (Zhang 2010, HPDGIS)


Every personal computer is now a parallel machine: CMPs and GPUs


Multi-core CPUs have become mainstream; the more cores they have, the more GPU features they have


NVIDIA alone shipped almost 220 million CUDA-capable GPUs from 2006 to 2010 (CACM 2010/11)


Introduction: Personal HPC-G


Chip-Multiprocessors (CMP): http://en.wikipedia.org/wiki/Multi-core_processor

Cores per chip: Dual-core -> Quad-core -> Six-core -> 8/10/12

Chips per node: 1 -> 2 -> 4/8

Intel MIC (32 cores)

UIUC Rigel Design (1024 cores)

Massively parallel GPGPU computing: hundreds of GPU cores in a GPU card

Nvidia GTX480 (03/2010): 480 cores, 1.4 GHz, 1.5 GB, 177.4 GB/s memory bandwidth, 1.35 TFlops

Nvidia GTX590 (03/2011): 1024 cores, 1.2 GHz, 3 GB, 327.74 GB/s memory bandwidth, 2.49 TFlops

Parallel hardware is more affordable than ever before …


Introduction: Personal HPC-G


Geospatial data volumes never stop growing


Satellite: e.g., from GOES to GOES-R (2016)

http://www.goes-r.gov/downloads/GOES-R-Tri.pdf

Spectral (3X) * spatial (4X) * temporal (5X) = 60X

Derived thematic data products (vector)

http://www.goes-r.gov/products/baseline.html

http://www.goes-r.gov/products/option2.html




Species distributions and movement data

E.g., 300+ million occurrence records (GBIF)

E.g., 717,057 polygons and 78,929,697 vertices for the distribution data of 4148 bird species (NatureServe)

Animals can move across space and time

Event locations, trajectories and O-D data

E.g., taxi trip records (traces or O-D locations)

0.5 million in NYC and 1.2 million in Beijing per day

From O-D to shortest paths to flow patterns

[Figure labels, related publications by venue: COM.GEO’10, SSDBM’10, ACMGIS’10, ACMGIS’11, ACMGIS’08, ACMGIS’09, GeoInformatics’09, HPDGIS’11, COM.GEO’10, HPDGIS’10, ???]


Motivations


GPU-based parallel algorithm design to efficiently manage large-scale species distribution data (overlapped polygons)

Part 1: Extended quadtree to represent overlapped polygons (GeoInformatics’09 and ACMGIS’09)

Part 2: Efficient conversion from real-world geospatial polygons to quadtrees

Step 1: from polygons to scan-line segments. Step 2: from scan-line segments to quadtrees

Part 3: Query-driven visual exploration (ACMGIS’08 and ACMGIS’09)


Background and Related Works


Polygon rasterization on GPUs

State of the art: OpenGL GL_POLYGON

Problems

Fixed-function, proprietary, black-box

Does not support complex (e.g., concave) polygons: results may be incorrect (although acceptable for display purposes)

GL_POLYGON is much slower than GL_TRIANGLES

Requires a hardware context to read back rasterization results

Accuracy is limited by screen resolution

Difficult for GIS developers to implement using graphics languages


GPGPU comes to the rescue


Being able to use GPU parallel computing power


Using C/C++ languages is more intuitive


Directly generating spatial data structures can be more efficient (than using
rasterized images to construct quadtrees)


More client-server computing friendly


No previous work on polygon rasterization on GPGPUs for geospatial applications.


Background and Related Works


Spatial data structures on GPUs for computer graphics applications

KD-Tree (Zhou et al 2008, Hou et al 2001), Octree (Zhou 2011)

They are designed to efficiently render triangles, not to query polygons

Software rasterization of triangles

(Laine and Karras 2011), (Pantaleoni 2011), (Schwarz and Seidel 2011)

Results are encouraging when compared to hardware rasterization (2-8x gap)

Again, they are designed for rasterizing/rendering triangles, not for querying polygons


Background and Related Works


Geospatial Data Processing on GPUs


Pre-GPGPU:

Using graphics data structures and primitives for spatial selection and spatial join queries (Sun et al 2003)

Difficult and unintuitive

Post-GPGPU:

Spatial similarity join (Lieberman et al 2008)

Density-based spatial clustering (Bohm et al 2009)

Min-Max quadtree for large-scale raster data (Zhang et al 2010)

Decoding quad-tree encoded bitplane bitmaps of large-scale raster data (Zhang et al 2011)



The Serial Scan-Line Fill Algorithm

For each scan line y from ymin to ymax
    1. Compute the intersection points with all edges
    2. Sort the intersection points and form the scan line segments
    3. (Fill the raster cells in the scan line segments)
End

Intersection point between scan line y=y’ and the edge from (x1,y1) to (x2,y2):

x’ = x1 + (y’ - y1) / (y2 - y1) * (x2 - x1)

GDAL/GRASS codebases
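The serial algorithm above can be sketched on the CPU as follows. This is a minimal illustration, not the GDAL or GRASS code: the names Point, Segment, scanLineIntersections and scanLineFill are hypothetical, and the half-open rule used to decide whether an edge crosses a scan line is one common convention that the slides do not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
struct Segment { int y; double xBegin, xEnd; };

// Step 1 and the sort of step 2: sorted x-coordinates where scan line y = ys
// crosses the boundary of a closed ring.
std::vector<double> scanLineIntersections(const std::vector<Point>& ring, double ys) {
    std::vector<double> xs;
    const std::size_t n = ring.size();
    for (std::size_t i = 0; i < n; ++i) {
        const Point& a = ring[i];            // (x1, y1)
        const Point& b = ring[(i + 1) % n];  // (x2, y2)
        // Assumed half-open rule: an edge counts only if min(y1,y2) <= ys < max(y1,y2).
        // Horizontal edges are skipped and a shared vertex is counted once.
        if ((a.y <= ys) == (b.y <= ys)) continue;
        xs.push_back(a.x + (ys - a.y) / (b.y - a.y) * (b.x - a.x));
    }
    std::sort(xs.begin(), xs.end());
    return xs;
}

// Rest of step 2: pair up the sorted intersections into scan-line segments.
std::vector<Segment> scanLineFill(const std::vector<Point>& ring, int ymin, int ymax) {
    std::vector<Segment> segments;
    for (int y = ymin; y <= ymax; ++y) {
        const std::vector<double> xs = scanLineIntersections(ring, y);
        assert(xs.size() % 2 == 0);  // a closed ring is crossed an even number of times
        for (std::size_t i = 0; i + 1 < xs.size(); i += 2)
            segments.push_back({y, xs[i], xs[i + 1]});
    }
    return segments;
}
```

Filling raster cells (step 3) would then iterate over the cells between xBegin and xEnd of each segment; it is left out here, as in the parenthesized step on the slide.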


Polygon Rasterization on GPGPUs - Challenges


Unique hardware characteristics (e.g., Nvidia Tesla C2050)



large number of threads (1024 per SM, 14 SMs)



limited shared memory: 48K per SM (shared by 1024 threads)



limited registers: 32768 per SM, i.e., 32 per thread


Need explicit shared memory management to fully utilize the memory hierarchy

Parallelizing the Scan-Line Fill Algorithm

Mimicking the CPU algorithm (assigning a polygon to a thread) will NOT work

Uncoalesced accesses to global memory are extremely inefficient

Insufficient registers and shared memory

How to assign computing blocks and threads to scan lines and polygon edges?


Polygon Rasterization on GPGPUs - Design

[Figure: GPU architecture diagram labeled SM 1, SM 2, …, SM n, L1, L2 and GPU Global Memory]

The GPU SMs are divided into 14*4 computing blocks

A computing block has 256 threads and processes one polygon

All threads in a computing block loop through scan lines cooperatively
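The work assignment described above can be emulated on the host to make the mapping concrete. The sketch below is not the author's kernel: the loops over b and t stand in for what the GPU runs in parallel, the function names are hypothetical, and two details are assumptions the slides do not state: that a block strides over the polygon list when there are more polygons than blocks, and that a thread strides over the edges when a polygon has more edges than threads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
using Polygon = std::vector<Point>;

constexpr int kBlocks = 14 * 4;  // computing blocks, from the slide
constexpr int kThreads = 256;    // threads per computing block, from the slide

// Emulates one computing block working on one polygon and returns the number
// of scan-line/edge intersections it finds.
long blockCountIntersections(const Polygon& poly, int ymin, int ymax) {
    long count = 0;
    const int n = static_cast<int>(poly.size());
    for (int y = ymin; y <= ymax; ++y) {        // all threads step through scan lines together
        for (int t = 0; t < kThreads; ++t) {    // runs in parallel on a GPU
            for (int e = t; e < n; e += kThreads) {  // assumed: thread t takes edges t, t+256, ...
                const Point& a = poly[e];
                const Point& b = poly[(e + 1) % n];
                if ((a.y <= y) != (b.y <= y)) ++count;  // edge crosses this scan line
            }
        }
    }
    return count;
}

// Emulates the grid: block b handles polygons b, b + kBlocks, ... (assumed scheduling).
std::vector<long> gridCountIntersections(const std::vector<Polygon>& polys, int ymin, int ymax) {
    std::vector<long> counts(polys.size(), 0);
    for (int b = 0; b < kBlocks; ++b)           // runs in parallel on a GPU
        for (std::size_t p = static_cast<std::size_t>(b); p < polys.size(); p += kBlocks)
            counts[p] = blockCountIntersections(polys[p], ymin, ymax);
    return counts;
}
```

When a polygon has no more edges than the block has threads, the inner loop runs at most once per thread, so each thread tests a single edge per scan line.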


Polygon Rasterization on GPGPUs - Design

[Figure: a polygon with vertices 1-6 and edges a-f. The vertex X/Y coordinates 1-6 are copied from Global Memory to Shared Memory. For each scan line y from ymin to ymax, an Intersection step marks each entry (O X O O X in the example) and a Sorting step groups the marked entries (O O O X X).]

X/Y coordinates in shared memory are re-used (ymax-ymin-1) times


Polygon Rasterization on GPGPUs - Sorting

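__device__ inline ushort scan4(ushort num) {
    // Scratch buffer: the lower half is zero padding, so reads at idx - offset
    // never need a bounds check; the upper half holds the values being scanned.
    // SYNC, MAX_PT and Tn are defined elsewhere; SYNC is presumably a
    // __syncthreads() macro and Tn the number of threads per block.
    __shared__ ushort ptr[2 * MAX_PT];
    ushort val = num;
    uint idx = threadIdx.x;
    ptr[idx] = 0;      // zero padding
    idx += Tn;
    ptr[idx] = num;    // this thread's input
    SYNC
    // Doubling steps: after the step with offset k, ptr[idx] holds the sum of
    // the last 2k inputs ending at this thread.
    val += ptr[idx - 1]; SYNC ptr[idx] = val; SYNC
    val += ptr[idx - 2]; SYNC ptr[idx] = val; SYNC
    val += ptr[idx - 4]; SYNC ptr[idx] = val; SYNC

    // Shift by one to turn the inclusive sums into an exclusive scan.
    val = ptr[idx - 1];
    return val;
}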

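Example with 4 threads; the left half of each row is the zero padding:

Input                 0 1 1 0
Step 0    0 0 0 0  |  0 1 1 0
Step 1    0 0 0 0  |  0 1 2 1
Step 2    0 0 0 0  |  0 1 2 2
Step 3    0 0 0 0  |  0 1 2 2

Result of exclusive scan: each thread reads ptr[idx-1] from the last row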


GPGPUs are extremely good at sorting

Sorting in shared memory is extremely fast

Benefits

Only true intersection results are written back to global memory

Saves GPU memory footprint and I/O costs
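One plausible reading of this slide is that the exclusive scan turns per-edge intersection flags into output positions, so that only true intersections are written out. The host-side sketch below illustrates that idea under this assumption. exclusiveScan mirrors the doubling steps of scan4 without the padding trick, and compact is a hypothetical helper that is not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum computed the way a thread block would compute it:
// doubling steps in which every element adds the value `offset` places to its
// left, using the previous step's values (a Hillis-Steele scan).
std::vector<unsigned> exclusiveScan(const std::vector<unsigned>& flags) {
    const std::size_t n = flags.size();
    std::vector<unsigned> cur(flags), next(n);
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        for (std::size_t i = 0; i < n; ++i)  // one iteration per "thread"
            next[i] = cur[i] + (i >= offset ? cur[i - offset] : 0u);
        cur.swap(next);                      // plays the role of the barrier between steps
    }
    std::vector<unsigned> out(n, 0u);
    for (std::size_t i = 1; i < n; ++i) out[i] = cur[i - 1];  // inclusive -> exclusive
    return out;
}

// Keeps only the entries whose flag is 1; the scan result gives each kept
// entry its slot in the compacted output. Flags are assumed to be 0 or 1.
std::vector<double> compact(const std::vector<double>& xs, const std::vector<unsigned>& flags) {
    assert(xs.size() == flags.size());
    const std::vector<unsigned> pos = exclusiveScan(flags);
    std::size_t total = 0;
    for (unsigned f : flags) total += f;
    std::vector<double> out(total);
    for (std::size_t i = 0; i < xs.size(); ++i)
        if (flags[i]) out[pos[i]] = xs[i];
    return out;
}
```

For the flags 0 1 1 0, the scan yields the positions 0 0 1 2, so the two flagged entries land in output slots 0 and 1.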


Experiments and Results


Data:


NatureServe Western Hemisphere bird species distributions: http://www.natureserve.org/getData/birdMaps.jsp

4148 bird species: http://geoteci.engr.ccny.cuny.edu/geoteci/SPTestMap.html

717,057 polygons, 1,199,799 rings

78,929,697 vertices (1.3 GB of shp files)

Total number of scan-line/polygon edge intersections: 200+ billion


Experiments and Results

Group #           1       2       3       4       5
Min # vertices    32      64      128     256     512
Max # vertices    64      128     256     512     1024
# Threads         64      128     256     512     1024
# Polygons        46509   23880   9666    5076    3146
CPU time (ms)     526     995     1803    4490    9387
GPU time (ms)     88      49      88      224     528
Speedup           6.0X    20.1X   20.5X   20.0X   17.8X


Discussions - handling large polygons

The current implementation cannot process polygons with more than a few thousand vertices

8n bytes for x coordinates

8n bytes for y coordinates

4n bytes for x coordinates of the intersections

~100 extra bytes

(20n+100) < 48K, so n ~ 2000 (using a whole SM as a computing block)
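The budget above can be checked with a few lines of arithmetic. The sketch assumes that 48K means 48 * 1024 bytes and takes the ~100 bytes of overhead as exactly 100. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Shared memory needed for a polygon with n vertices, following the slide:
// 8n (x) + 8n (y) + 4n (intersection x) + about 100 bytes of overhead.
constexpr std::size_t sharedBytesNeeded(std::size_t n) { return 20 * n + 100; }

// Largest n whose data fits in the given budget; 0 if not even the overhead fits.
constexpr std::size_t maxVertices(std::size_t budgetBytes) {
    return budgetBytes < 100 ? 0 : (budgetBytes - 100) / 20;
}

// With a 48 * 1024 byte budget the bound works out to 2452 vertices.
static_assert(maxVertices(48 * 1024) == 2452, "upper bound on n for 48 KB");
```

The exact bound of 2452 is consistent with the slide's rounder figure of about 2000, which leaves some headroom for other per-block data.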


We have limited the number of points to the number of threads (1024): having one thread process a few vertices is not scalable

We need a better way to handle scalability


Discussions - handling large polygons

Proposed solution: chunking the edge list, computing separately and then assembling

[Figure: vertices 1-6 (X/Y) are moved from Global Memory to shared Memory in chunks (Chunking). Each chunk yields its own intersections, e.g. (x2,y2), (x1,y1) and (x4,y2), (x3,y1) (Computing). The per-chunk results are combined into (x3,y1), (x1,y1), (x4,y2), (x2,y2) (Assembling), with Sorting done using a separate kernel.]
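The chunk, compute, assemble and sort sequence can be sketched on the host as follows. This is an illustration of the proposal rather than an implementation of it: the names are hypothetical, the chunk size is a free parameter, and std::sort stands in for the separate sorting kernel. The half-open crossing rule is an assumption, as the slides do not specify one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Point { double x, y; };
struct Hit { int y; double x; };  // one scan-line/edge intersection

// Computing: intersections of scan lines ymin..ymax with edges [first, last)
// of a closed ring. The last edge of a chunk reads one vertex of the next
// chunk, so neighbouring chunks must overlap by one vertex.
std::vector<Hit> chunkIntersections(const std::vector<Point>& ring, std::size_t first,
                                    std::size_t last, int ymin, int ymax) {
    std::vector<Hit> hits;
    const std::size_t n = ring.size();
    for (std::size_t e = first; e < last; ++e) {
        const Point& a = ring[e];
        const Point& b = ring[(e + 1) % n];
        for (int y = ymin; y <= ymax; ++y) {
            if ((a.y <= y) == (b.y <= y)) continue;  // assumed half-open crossing rule
            hits.push_back({y, a.x + (y - a.y) / (b.y - a.y) * (b.x - a.x)});
        }
    }
    return hits;
}

// Chunking, assembling and sorting: the result does not depend on chunkEdges.
std::vector<Hit> rasterizeChunked(const std::vector<Point>& ring, std::size_t chunkEdges,
                                  int ymin, int ymax) {
    assert(chunkEdges > 0);
    std::vector<Hit> all;
    for (std::size_t first = 0; first < ring.size(); first += chunkEdges) {
        const std::size_t last = std::min(first + chunkEdges, ring.size());
        const std::vector<Hit> part = chunkIntersections(ring, first, last, ymin, ymax);
        all.insert(all.end(), part.begin(), part.end());  // assembling
    }
    // Order by scan line, then by x, so consecutive pairs form scan-line segments.
    std::sort(all.begin(), all.end(), [](const Hit& l, const Hit& r) {
        return l.y != r.y ? l.y < r.y : l.x < r.x;
    });
    return all;
}
```

Because every edge is processed exactly once regardless of how the edge list is cut, the assembled and sorted output is the same for any chunk size.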


Summary and Conclusion


Introduced a GPGPU-accelerated software rasterization framework to rasterize and index large-scale geospatial polygons

Provided a GPGPU-based design and implementation for computing intersection points

Achieved about 20X speedup for groups of polygons with between 64 and 1024 vertices, using the bird species distribution data for the Western Hemisphere, which has about 3/4 million polygons and more than 78 million vertices

Discussed extending the current implementation to support polygons with arbitrarily large numbers of vertices by extensively using efficient sorting

The work reported is preliminary: several important components in realizing a dynamically integrated vector-raster data model for high-performance geospatial analysis on GPGPUs are still under development.


Future Work


Extend our current implementation to support large polygons with arbitrary numbers of vertices

Implement the quadtree construction (step 2) based on the GPGPU-computed scan-line segments (CPU/GPU)

Perform a comprehensive performance comparison with that of commercial spatial database indexing

Integrate with front end modules in spatial databases (e.g., query parser and optimizer)


Q&A

jzhang@cs.ccny.cuny.edu
