ppt - Computer Science


4 Oct 2013

The Cell Broadband Engine (B.E.) processor is a multi-core chip comprising a 64-bit Power Architecture processor core and eight synergistic processor cores. It is capable of massive floating-point processing and is optimized for compute-intensive workloads and broadband rich-media applications.

Dense matrix multiplication is one of the most common and important numerical operations.

The Cell B.E. excels at processing compute-intensive workloads such as single-precision matrix multiplication through its powerful SIMD capabilities.

Computational micro-kernels are architecture-specific codes. Systematic analysis of the problem, combined with exploitation of the low-level features of the Cell B.E.'s synergistic processing units, leads to dense matrix multiplication kernels that achieve peak performance.
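As a concrete illustration of the blocking idea behind such micro-kernels, the following is a minimal Python sketch of a tiled matrix multiplication. The function name and tile size are illustrative only; the actual Cell B.E. kernels are hand-tuned SIMD code operating on tiles staged in the SPEs' local stores, not Python.

```python
def blocked_matmul(A, B, n, bs):
    """Multiply two n x n matrices (lists of lists) using bs x bs blocks.

    The blocking mirrors how a micro-kernel stages tiles in an SPE's local
    store; this plain-Python version only illustrates the loop structure.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Accumulate the contribution of block pair (ii,kk) x (kk,jj).
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Because each block pair is finished before the next is loaded, the working set at any moment is only three bs x bs tiles, which is what lets a real kernel fit in a small local store.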

Highly optimized Cell B.E. implementations of two classic dense linear algebra factorizations, one of which is QR factorization, are introduced.

Work has been done to show that a single silicon chip can provide great performance for compute-intensive scientific workloads through short single-instruction-multiple-data (SIMD) processing.

The SPEs allow the implementation of complex synchronization mechanisms and task-level parallelism.

Hybrid platforms containing both multi-core CPUs and GPUs provide an effective solution to the challenges of power consumption and the widening gap between compute and communication speeds; hence the trend toward GPUs and hybrid combinations of GPUs with CPUs. This approach is appreciated because it can freeze the clock frequency while escalating the number of cores, and it provides data parallelism and high bandwidth.

Dense linear algebra algorithms for GPUs are developed using a hybrid approach: in general, small, non-parallelizable tasks are executed on the CPU while data-parallel tasks are executed on the GPU. CUDA is used to develop both low-level kernels and high-level libraries in the style of LAPACK and BLAS.

An approach to developing high-performance BLAS for GPUs is presented; such BLAS are essential to enable GPU-based and hybrid approaches in the area of dense linear algebra.

Important issues for the design of these kernels, such as blocking and coalesced memory access, are discussed.

Three implementation optimization techniques are discussed: pointer redirecting, padding and auto-tuning.
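To illustrate the padding technique, here is a minimal Python sketch (the helper name `pad_matrix` is hypothetical): the matrix is zero-padded so that both dimensions become multiples of the kernel's block size, letting a blocked kernel run without edge-case cleanup code.

```python
def pad_matrix(A, rows, cols, bs):
    """Zero-pad a rows x cols matrix so both dimensions are multiples of bs.

    Padding trades a little extra arithmetic on zeros for a kernel with no
    boundary special cases, which is one of the GPU BLAS techniques above.
    """
    padded_rows = (rows + bs - 1) // bs * bs
    padded_cols = (cols + bs - 1) // bs * bs
    P = [[0.0] * padded_cols for _ in range(padded_rows)]
    for i in range(rows):
        for j in range(cols):
            P[i][j] = A[i][j]
    return P
```

Pointer redirecting achieves the same effect without the copy, by redirecting out-of-range accesses back into valid memory; padding is the simpler of the two to sketch.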

Sparse matrix-vector multiplication (SpMV) is an interesting computation, as it appears in scientific and engineering, financial and economic modeling, and information retrieval applications.
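The compressed sparse row (CSR) format underlies many SpMV implementations. For reference, a minimal Python sketch of the sequential kernel follows; the optimized GPU and Cell versions reorganize this loop heavily for their respective memory systems.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros in values/col_idx.
    """
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]
        y[i] = s
    return y
```

Each output element touches only that row's nonzeros, which is why SpMV performance is dominated by how fast those irregular reads can be streamed, i.e. by memory bandwidth.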

The level of performance achieved depends on the diversity of architectural designs and input matrix structures, and calls for a complex combination of architecture-specific and matrix-specific techniques.

A performance comparison is made on different platforms across a suite of matrices. It is evident that the optimized implementations deliver better performance, and it is also observed that bandwidth is the determining performance factor.

The accurate simulation of real-world phenomena in computational science is based on mathematical models comprising sets of partial differential equations, and finite element methods are considered among the most promising approaches for the numerical treatment of partial differential equations.

Graphics processing units are considered to work well in such cases. To achieve peak performance, proper data structures and parallelization techniques must be selected, especially when combining coarse-grained parallelism at the cluster level with medium- and fine-grained parallelism between CPU cores and within accelerators such as GPUs.

A way of applying fine-grained parallelization techniques to robust, numerically strong solvers is presented, targeting the sparse, ill-conditioned linear systems of equations that arise from grid-based discretization techniques such as finite differences, finite volumes and finite elements.

Parallelization techniques are implemented on graphics processors as representatives of throughput-oriented, wide-SIMD many-core architectures, since GPUs offer a tremendous amount of fine-grained parallelism. NVIDIA CUDA is used here, and the concepts of memory coalescing, warps, shared memory and thread blocks are employed.

The design of an efficient parallel implementation of the Fast Fourier Transform (FFT) on the Cell/B.E. is presented. The FFT is a fundamental kernel in computationally intensive scientific applications such as computed tomography, data filtering, fluid dynamics, spectral analysis of speech, sonar, radar, seismic and vibration detection, digital filtering, signal decomposition and PDE solvers.

An iterative approach is used to solve the 1D FFT that divides the work among the SPEs to parallelize the FFT computation efficiently. It requires synchronization among the SPEs after each stage of the FFT, and the computation on the SPEs is combined with other optimization techniques such as loop unrolling and double buffering.
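For reference, a minimal single-threaded Python sketch of the iterative radix-2 FFT follows. Each pass of the outer `while` loop is one butterfly stage; in the parallel Cell B.E. version the SPEs would synchronize after each such stage. This is an illustrative sketch, not the optimized SPE code.

```python
import cmath

def fft_iterative(a):
    """In-place iterative radix-2 Cooley-Tukey FFT; len(a) must be a power of 2."""
    n = len(a)
    # Bit-reversal permutation puts the input in the order the stages expect.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly stages: log2(n) passes, doubling the transform length each time.
    length = 2
    while length <= n:
        w = cmath.exp(-2j * cmath.pi / length)
        for start in range(0, n, length):
            wk = 1.0 + 0j
            for k in range(length // 2):
                u = a[start + k]
                v = a[start + k + length // 2] * wk
                a[start + k] = u + v
                a[start + k + length // 2] = u - v
                wk *= w
        length <<= 1
    return a
```

The butterflies within one stage are independent, which is what allows them to be divided among the SPEs, with a barrier between stages.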

A way in which the FFT can exploit the typical parallel resources of such architecture platforms to achieve near-optimal performance is described; designers have to adopt a systematic approach that takes into account the attributes of both the application and the target system.

A successful implementation relies on a deep understanding of data access patterns, computation properties and available hardware resources, and can take advantage of generalized performance planning techniques to produce successful implementations across a wide variety of platforms.


Combinatorial algorithms play an important role in scientific computing: in the efficient parallelization of linear algebra, computational physics and numerical optimization computations, in massive data analysis routines, in systems biology, and in the study of natural phenomena involving networks and complex systems.
A complexity model to simplify the design of algorithms on the Cell/B.E. architecture, together with a systematic procedure to evaluate their performance, is presented. To estimate the execution time of an algorithm, the computational complexity, memory access patterns and complexity of branching instructions are taken into account.
The application of auto-tuning to the 7-point and 27-point stencils on a wide range of architectures is discussed; the chip multiprocessors considered lie at the extremes of a spectrum of design tradeoffs that ranges from replication of existing core technology to employing large numbers of simple cores and novel memory hierarchies.
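As a point of reference for what is being auto-tuned, a minimal Python sketch of one 7-point stencil sweep follows. The simple neighbour average used here is an assumption for illustration; the tuned benchmarks use weighted coefficients, and the auto-tuner searches over blockings of exactly this loop nest.

```python
def stencil7(grid, n):
    """One Jacobi sweep of a 7-point stencil on an n x n x n grid.

    The grid is a flat list of n**3 values.  Interior points become the
    average of themselves and their six axis neighbours; boundary points
    are left untouched.
    """
    idx = lambda i, j, k: (i * n + j) * n + k
    out = grid[:]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                out[idx(i, j, k)] = (grid[idx(i, j, k)]
                                     + grid[idx(i - 1, j, k)] + grid[idx(i + 1, j, k)]
                                     + grid[idx(i, j - 1, k)] + grid[idx(i, j + 1, k)]
                                     + grid[idx(i, j, k - 1)] + grid[idx(i, j, k + 1)]) / 7.0
    return out
```

Every output point reads seven inputs but performs only a handful of flops, so, as with SpMV, the tuning space is dominated by memory-hierarchy choices rather than arithmetic.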

Important aspects are parallelism discovery, selecting from the various forms of hardware parallelism, and enabling memory-hierarchy optimizations, all made more challenging by the separate address spaces, software-managed local stores and NUMA features that appear on these architectures.


Multi-core, many-core and heterogeneous micro-architectures are very important in the hardware landscape. Specialized processing units such as commodity graphics processing units have proved to be compute accelerators capable of solving specific scientific problems orders of magnitude faster than conventional CPUs.

Hyperthermia is a relatively new treatment modality used as a complementary therapy to radio- or chemotherapy. The optimization of a computational kernel appearing within a biomedical application for hyperthermia treatment on NVIDIA's graphics processing units is studied here.

The implementation and results of two bioinformatics applications, one of them FASTA, are presented. The results show that the Cell/B.E. is an attractive avenue for bioinformatics applications. The Cell/B.E. is considered a power-efficient platform provided that its total power consumption is less than that of a superscalar processor.

An implementation of one of these applications running on the Cell/B.E. that uses software caches inside the SPEs for data movement is also described. Using the software caches enhances programmer productivity without a major decrease in performance.
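The idea of an SPE software cache can be sketched as follows. This is a minimal, assumed direct-mapped design in Python: on the Cell B.E. a miss would trigger a DMA transfer from main memory into the local store, whereas here the "DMA" is just a slice of a backing list, so only the hit/miss bookkeeping is illustrated.

```python
class SoftwareCache:
    """A direct-mapped software cache sketch.

    tags[slot] records which memory line a slot currently holds; a mismatch
    is a miss and triggers a (simulated) DMA of that line into the cache.
    """

    def __init__(self, backing, line_size=4, num_lines=8):
        self.backing = backing
        self.line_size = line_size
        self.tags = [None] * num_lines
        self.lines = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        line_no = addr // self.line_size
        slot = line_no % len(self.tags)
        if self.tags[slot] != line_no:
            # Miss: "DMA" the whole line in from backing memory.
            self.misses += 1
            base = line_no * self.line_size
            self.lines[slot] = self.backing[base:base + self.line_size]
            self.tags[slot] = line_no
        else:
            self.hits += 1
        return self.lines[slot][addr % self.line_size]
```

The productivity gain noted above comes from exactly this: the program issues plain `read(addr)` calls and the cache decides when a transfer is needed, instead of the programmer scheduling every DMA by hand.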

Efficient and scalable strategies to orchestrate all-pairs computations on the Cell architecture, based on a decomposition of the computations and the input entries, are described. The general case is to schedule the computations on the Cell processor and then extend the strategies to cases in which the number of input entries is large or the size of individual entries exceeds the memory limitations of the SPEs.
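The tiled decomposition can be sketched in Python as follows. The tile size stands in for how many entries fit in one SPE's local store; the function name and pairwise function `f` are illustrative, not the paper's interface.

```python
def all_pairs_tiled(items, f, tile=2):
    """Evaluate f on every unordered pair of items, one tile pair at a time.

    Only two tiles of entries are "resident" at any moment, mimicking the
    local-store limits that motivate the decomposition.
    """
    n = len(items)
    results = {}
    for bi in range(0, n, tile):
        for bj in range(bi, n, tile):
            # Process every pair (i, j) with i < j drawn from this tile pair.
            for i in range(bi, min(bi + tile, n)):
                for j in range(max(i + 1, bj), min(bj + tile, n)):
                    results[(i, j)] = f(items[i], items[j])
    return results
```

Each tile pair is an independent unit of work, so in the Cell version distinct tile pairs can be assigned to distinct SPEs.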

The performance results showed that the Cell processor is a good platform for accelerating various kinds of applications dealing with such workloads. The all-pairs computation strategies can be applied to many applications from a wide range of areas that require these computations to be performed.

The main applications of drug design are outlined, and two practical case studies, a docking application and Moldy, a molecular dynamics application, are discussed. The advantages of using the Cell B.E. in drug design are noted. For the docking application, a 3x speedup is achieved compared to a parallel version running on a system with two 1.5 GHz POWER5 chips and 16 GB of RAM.

Moldy on the Cell B.E. consumes less power and takes the same time as an MPI parallelization on four Itanium Montecito processors of an SGI system.


GPUs are parallel computing devices capable of accelerating a wide variety of data-parallel algorithms. Their tremendous computing capabilities help accelerate molecular modeling applications, enabling molecular dynamics simulations and their analyses to run much faster than before and allowing the use of scientific techniques that are impractical on conventional hardware.

The most computationally expensive algorithms used in molecular modeling are presented, along with an explanation of how they may be reformulated as arithmetic-intensive, data-parallel algorithms capable of achieving high performance on GPUs. In the coming years, GPU hardware architecture is expected to continue to evolve rapidly and become increasingly sophisticated.

Biomedical applications are an important focus for high-performance computing (HPC) researchers. The use of accelerators, with their low cost and high performance, is a possible solution for investigating methods to provide high performance for such applications.

It is clear that the data-flow programming model and associated runtime systems can, at multiple application and hardware granularities, ease the implementation of challenging biomedical applications on these types of computational resources. The GPU is designed to deliver maximum performance through its SIMD architecture.

The Charm++ parallel programming model and runtime system, with support for accelerators and for heterogeneous clusters that include accelerators, is presented. Several extensions to the Charm++ programming model are also presented, including a SIMD instruction abstraction, accelerated entry methods and accelerated blocks.

Importantly, support for CUDA-based GPUs is presented; all of these extensions continue to be developed and improved as support for heterogeneous clusters in Charm++ grows.

Modern many-core GPUs are massively parallel processors, and the CUDA programming model provides a straightforward way of writing scalable parallel programs to execute on the GPU. Data-parallel techniques provide a convenient way of expressing such parallelism.

The design of efficient scan and segmented-scan routines, which are essential primitives in a broad range of data-parallel algorithms, is presented. By tailoring the existing algorithms to the natural granularities of the machine and by minimizing synchronization, some of the fastest scan and segmented-scan algorithms for the GPU are obtained.
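For reference, the sequential semantics of the two primitives can be stated in a few lines of Python. The GPU versions compute the same results with parallel tree-based passes over blocks sized to the machine's natural granularities.

```python
def inclusive_scan(xs):
    """Inclusive prefix sum: out[i] = xs[0] + ... + xs[i]."""
    out, total = [], 0
    for x in xs:
        total += x
        out.append(total)
    return out

def segmented_scan(xs, flags):
    """Inclusive segmented prefix sum: flags[i] == 1 starts a new segment,
    resetting the running total at that position."""
    out, total = [], 0
    for x, f in zip(xs, flags):
        total = x if f else total + x
        out.append(total)
    return out
```

Segmented scan is what lets one GPU kernel process many irregular-length sequences at once, which is why it appears in so many data-parallel algorithms.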

The performance of an intra-node communication mechanism for modern multi-core CPUs is analyzed. It is observed that while the streaming instructions are expected to deliver good performance, the current implementation generates a high number of resource stalls and hence low performance.

It is also found that intra-node communication performance is highly dependent on the memory and cache architecture. The way in which improvements in processor and interconnect technology have affected the balance of computation to communication performance is also presented.