Accelerating sequential computer vision algorithms using commodity parallel hardware

Platform Parallel Netherlands
GPGPU-day, 28 June 2012
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
Van de Loosdrecht Machine Vision BV
Limerick Institute of Technology
Overview
•Introduction
•Computer vision algorithms and parallelization
•Benchmarking
•Run-time prediction of whether parallelization is beneficial
•Progress
•Future work
•Summary and preliminary conclusions
•Questions
Introduction
•Manager NHL Centre of Expertise in Computer Vision
•University of professional education, Leeuwarden
•4.5 FTE
•Since 1996: 160 industrial projects
•Managing director Van de Loosdrecht Machine Vision BV
•VisionLab: development environment for Computer Vision with
pattern matching, neural networks and genetic algorithms
•Portable library, > 100,000 lines of ANSI C++
Windows, Linux and Android
x86, x64, ARM and PowerPC
•Student Limerick Institute of Technology (Ireland)
•Research master project, 1 September 2011 – 1 July 2013
VisionLab: development environment for Computer Vision
Motivation
Apply parallel programming techniques to meet the challenges posed
in computer vision by the limits of sequential architectures
Aims and objectives
•Compare existing programming languages and environments for
parallel computing
•Choose one standard for
•Multi-core CPU programming
•GPU programming
•Re-implement a number of standard and well-known algorithms
•Compare performance to existing sequential implementation of
VisionLab
•Evaluate test results, benefits and costs of parallel approaches to
implementation of computer vision algorithms
Requirements
•Primary target system
•Conventional PC or intelligent camera
•Windows or Linux, on x86 or x64
•Important option: easy porting (Android, ARM, PowerPC)
•Existing scripts and applications should not have to be modified in
order to benefit from parallelization
•Run-time prediction of whether parallelization is beneficial
•Chosen standards must be
•An industry standard
•Vendor independent
•For CPU
•ANSI C++ based
•Efficient parallelization for majority of code
Related research
Other research projects
•Compare best sequential with best parallel algorithm
•Often specific to one domain and hardware platform
•Frameworks for automatic parallelization
•Still in research, not yet generically applicable
Special points of interest in my project
•Generic library
•Portability and vendor independence
•Run-time prediction of whether parallelization is beneficial
•Variance in execution times
•100,000 lines of ANSI C++
Choice of standard for multi-core CPU programming (1 Oct 2011)
Standard | Industry standard | Maturity | Acceptance by market | Future developments | Vendor independence | Portability | Scalable to ccNUMA (optional) | Vector capabilities (optional) | Effort for conversion
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Array Building Blocks | No | Beta | New, not ranked | Good | Poor | Poor | No | Yes | Huge
C++11 Threads | Yes | Partly new | New, not ranked | Good | Good | Good | No | No | Huge
Cilk Plus | No | Good | Rank 6 | Good | Reasonable (no MSVC) | Reasonable | No | Yes | Low
MCAPI | No | Poor | Not ranked | Poor | Poor | Poor | Yes | No | Huge
MPI | Yes | Excellent | Rank 7 | Good | Good | Good | Yes | No | Huge
OpenMP | Yes | Excellent | Rank 1 | Good | Good | Good | Yes, only GNU | No | Low
Parallel Patterns Library | No | Reasonable | New, not ranked | Good | Poor (only MSVC) | Poor | No | No | Huge
Posix Threads | Yes | Excellent | Not ranked | Poor | Good | Good | No | No | Huge
Thread Building Blocks | No | Good | Rank 3 | Good | Reasonable | Reasonable | No | No | Huge
Choice of standard for GPU programming (1 Oct 2011)
Standard | Industry standard | Maturity | Acceptance by market | Future developments | Expected familiarization time | Hardware vendor independence | Software vendor independence | Portability | Heterogeneous
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Accelerator | No | Good | Not ranked | Bad | Medium | Bad | Bad | Poor | No
CUDA | No | Excellent | Rank 5 | Good | High | Bad | Bad | Bad | No
Direct Compute | No | Poor | Not ranked | Unknown | High | Bad | Bad | Bad | No
HMPP | No | Poor | Not ranked | Plan for open standard | Medium | Reasonable | Bad | Good | Yes
OpenCL | Yes | Good | Rank 2 | Good | High | Excellent | Good | Good | Yes
PGI Accelerator | No | Reasonable | Not ranked | Unknown | Medium | Bad | Bad | Bad | No
Computer vision algorithms and parallelization
Classification of image operators
•Low level image operators
•Point operators
•Local neighbour operators
•Global operators
•Connectivity based operators
•High level image operators
•Often built on the low level operators
•“Specials”
•Pattern matcher, neural network, genetic algorithm, etc.
Idea: design and implement skeletons for parallelizing a
representative operator of each class
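To make the skeleton idea concrete, here is a minimal sketch for the simplest class, the point operators. It is not VisionLab code: the names PointOperatorSkeleton and Threshold, and the flat std::vector image representation, are assumptions chosen for illustration. The skeleton owns the parallel loop, and each operator of the class only supplies its per-pixel function.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical skeleton for the "point operator" class: every output pixel
// depends only on the input pixel at the same position, so all iterations
// are independent and the loop can be divided over the cores.
template <typename Pixel, typename PixelFunc>
void PointOperatorSkeleton(const std::vector<Pixel>& src,
                           std::vector<Pixel>& dest,
                           PixelFunc func) {
    dest.resize(src.size());
    const long long n = static_cast<long long>(src.size());
    // Without OpenMP support the pragma is ignored and the loop runs sequentially.
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i) {
        dest[i] = func(src[i]);
    }
}

// Example representative of the class: binary threshold on an 8-bit image.
// Pixels inside [low, high] become 1, all others 0.
inline std::vector<std::uint8_t> Threshold(const std::vector<std::uint8_t>& src,
                                           std::uint8_t low, std::uint8_t high) {
    std::vector<std::uint8_t> dest;
    PointOperatorSkeleton(src, dest, [low, high](std::uint8_t p) {
        return static_cast<std::uint8_t>((p >= low && p <= high) ? 1 : 0);
    });
    return dest;
}
```

Other point operators would reuse the same skeleton with a different per-pixel function. Local neighbour, global and connectivity based operators need their own skeletons, because their iterations are not independent in the same way.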
Benchmarking
•Benchmark protocol
•Data analysis with R
•Speedup graphs
•Speedup tables
•Median of execution time tables
•Best work-group size tables
•Violin plots
OpenMP
Progress
•Script commands added
•Framework for benchmarking
•> 160 operators parallelized
•Run-time prediction of whether parallelization is beneficial
•Calibration procedure
Example speedup graph (i7 2600)
Run-time prediction of whether parallelization is beneficial
The speed-up depends on
•Size of image
•Pixel type
•Content of the image
•Parameters like size of neighbourhood
•Etc.
Calibration procedure for OpenMP
•Simple and fast procedure for global optimization
•Complex and slow procedure for finer-grained optimization of
each (sub) operator
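One possible shape of such a prediction, sketched under simplifying assumptions: calibration yields a break-even image size, and at run time the OpenMP `if` clause decides whether the loop is actually executed in parallel. The slides do not describe the real VisionLab mechanism, so CalibrationPoint, BreakEvenPixels, ParallelIsBeneficial and Invert are hypothetical names, and only the image size is taken into account here (not pixel type, image content or operator parameters).

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// One calibration measurement (hypothetical structure): image size in
// pixels and the median execution times measured for that size.
struct CalibrationPoint {
    std::size_t pixels;
    double sequentialTime;
    double parallelTime;
};

// Sentinel meaning "parallel execution was never faster during calibration".
const std::size_t kNeverParallel = std::numeric_limits<std::size_t>::max();

// Returns the smallest calibrated size from which on the parallel version
// was faster for that size and every larger calibrated size.
// Expects the points sorted by ascending image size.
std::size_t BreakEvenPixels(const std::vector<CalibrationPoint>& points) {
    std::size_t breakEven = kNeverParallel;
    for (auto it = points.rbegin(); it != points.rend(); ++it) {
        if (it->parallelTime < it->sequentialTime) {
            breakEven = it->pixels;
        } else {
            break;  // first size (from the top) where parallel does not pay off
        }
    }
    return breakEven;
}

// Run-time prediction: parallelize only at or above the break-even size.
bool ParallelIsBeneficial(std::size_t pixels, std::size_t breakEven) {
    return pixels >= breakEven;
}

// Example operator: invert an 8-bit image in place. The OpenMP "if" clause
// makes the loop run on a single thread when the prediction is negative.
void Invert(std::vector<unsigned char>& image, std::size_t breakEven) {
    const long long n = static_cast<long long>(image.size());
    #pragma omp parallel for if (ParallelIsBeneficial(image.size(), breakEven))
    for (long long i = 0; i < n; ++i) {
        image[i] = static_cast<unsigned char>(255 - image[i]);
    }
}
```

Because the decision is taken inside the operator, calling scripts and applications stay unchanged, which matches the requirement that existing code should benefit from parallelization without modification.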
Variation in execution times, violin plot
OpenCL
Progress
•Toolbox for using OpenCL kernels from scripts and C++
•Script commands added for host API
•Framework for benchmarking
•Implementation of first kernels
OpenCL development in VisionLab
Optimal number of local histograms per work-group (GTX 560 Ti)
Speedup graph of Histogram,
16 local histograms per work-group (GTX 560 Ti)
Violin plot for Histogram, 16 local histograms per work-group (GTX 560 Ti)
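The local-histogram idea behind these measurements can be illustrated on the CPU. The sketch below is not the OpenCL kernel from the slides: it is a C++ analogue in which each chunk of the image accumulates into its own private histogram, and the private histograms are merged afterwards. The function name HistogramWithLocalCopies and the parameter nrLocal are assumptions. On a GPU, using several local histograms per work-group serves a similar purpose, spreading the updates so that fewer of them contend for the same bin.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Histogram = std::array<std::uint32_t, 256>;

// Histogram of an 8-bit image computed via nrLocal private histograms.
// Each chunk of the image updates only its own histogram, so no two
// loop iterations ever write to the same memory.
Histogram HistogramWithLocalCopies(const std::vector<std::uint8_t>& image,
                                   std::size_t nrLocal) {
    if (nrLocal == 0) nrLocal = 1;
    std::vector<Histogram> local(nrLocal, Histogram{});  // all bins start at 0
    const std::size_t chunk = (image.size() + nrLocal - 1) / nrLocal;
    const long long groups = static_cast<long long>(nrLocal);

    // Without OpenMP support the pragma is ignored and the chunks run in turn.
    #pragma omp parallel for
    for (long long g = 0; g < groups; ++g) {
        const std::size_t begin =
            std::min(image.size(), static_cast<std::size_t>(g) * chunk);
        const std::size_t end = std::min(image.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i) {
            ++local[g][image[i]];
        }
    }

    // Sequential merge step: sum the private histograms bin by bin.
    Histogram result{};
    for (const Histogram& h : local) {
        for (std::size_t bin = 0; bin < h.size(); ++bin) {
            result[bin] += h[bin];
        }
    }
    return result;
}
```

The number of private histograms is a tuning parameter: more copies reduce contention but increase the cost of the merge step, which is why the best value has to be measured per device.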
X-files: OpenCL versus OpenMP
Future work OpenCL
Near future
•Memory transfers
•Pinned
•Zero copy APU
•Implementing more vision operators
More distant future
•Intelligent buffer management?
•Automatic tuning of parameters?
•Run-time prediction of whether parallelization is beneficial?
•Heterogeneous computing?
•OpenMP 4.0, OpenACC, C++ AMP?
The future: XIMEA Currera-G (APU based)
Summary and preliminary conclusions
•Standards chosen: OpenMP and OpenCL
•Integration of OpenMP and OpenCL in VisionLab
•Benchmark environment
•OpenMP
•Embarrassingly parallel algorithms are easy to convert
•More than 160 operators parallelized
•Run-time prediction implemented
•OpenCL
•Scripting host-side code shortens development time
•Portable functionality
•Portable performance is not easy
•Still have to learn a lot about GPUs
•Searching for sparring partners
Questions ?
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
j.van.de.loosdrecht@tech.nhl.nl
www.nhl.nl/computervision
Van de Loosdrecht Machine Vision BV
jaap@vdlmv.nl
www.vdlmv.nl