The first-generation Cell Broadband Engine (Cell/B.E.) processor is a multi-core chip comprising a 64-bit Power Architecture processor core and eight synergistic processor cores. It is capable of massive floating-point processing and is optimized for compute-intensive workloads and broadband rich-media applications.
Dense matrix multiplication is one of the most common numerical operations and one of the most important algorithms.
The Cell/B.E. excels at compute-intensive workloads such as single-precision matrix multiplication, thanks to its powerful SIMD capabilities.
Computational micro-kernels are architecture-specific codes. Combined with a systematic analysis of the problem and exploitation of the low-level features of the Cell/B.E. synergistic processing unit, they lead to dense matrix multiplication kernels that achieve peak performance.
Highly optimized Cell/B.E. implementations of two classic dense linear algebra computations are introduced: Cholesky factorization and QR factorization.
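For orientation, Cholesky factorization computes a lower-triangular matrix L with A = L * L^T for a symmetric positive definite matrix A. The following is a minimal, unoptimized Python sketch of the textbook algorithm, not the Cell/B.E. implementation; the function name `cholesky` is chosen here purely for illustration.

```python
import math


def cholesky(a):
    """Return lower-triangular L (list of lists) such that L * L^T == a.

    Assumes `a` is a square, symmetric, positive definite matrix.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: remove the contribution of columns already computed.
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (a[i][j] - s) / L[j][j]
    return L
```

Optimized implementations typically operate on blocks of the matrix rather than on single entries, so that most of the work maps onto matrix multiplication kernels.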
Work has been done to show that a single silicon chip can provide great performance for compute-intensive scientific workloads by combining short-vector single instruction, multiple data (SIMD) processing with a multicore architecture.
The synergistic processing elements (SPEs) allow the implementation of complex synchronization mechanisms and task-level parallelism.
Hybrid GPU-based multicore platforms, which combine homogeneous multicores with GPUs, provide an effective solution to the challenges of the appetite for power and the gap between compute and communication speeds. The trend toward GPUs, and toward hybrid combinations of GPUs with homogeneous multicores, is therefore welcomed: it allows the clock frequency to be frozen while the number of cores escalates, and it provides data parallelism and high bandwidth.
Dense linear algebra algorithms for GPUs are developed as hybrid algorithms: in general, small, non-parallelizable tasks are executed on the CPU and data-parallel tasks are executed on the GPU. The approach uses CUDA to develop low-level kernels and relies on high-level libraries such as LAPACK and BLAS.
An approach to developing high-performance BLAS for GPUs is presented; such a BLAS is essential for enabling GPU-based hybrid approaches in dense linear algebra.
Two important issues in kernel design, blocking and coalesced memory access, are discussed.
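To illustrate blocking, the sketch below multiplies two matrices tile by tile, so that each small tile is reused many times before the computation moves on. It is a plain Python illustration of the loop structure only (it cannot show coalesced memory access, which is a property of how GPU threads address memory); the name `matmul_blocked` and the default block size are assumptions made for this example.

```python
def matmul_blocked(a, b, bs=2):
    """Compute c = a * b using bs-by-bs tiles.

    a is n x k, b is k x m (lists of lists). Edge tiles are handled
    with min(), so the dimensions need not be multiples of bs.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):          # tile row of c
        for j0 in range(0, m, bs):      # tile column of c
            for k0 in range(0, k, bs):  # tile along the shared dimension
                for i in range(i0, min(i0 + bs, n)):
                    for kk in range(k0, min(k0 + bs, k)):
                        aik = a[i][kk]
                        for j in range(j0, min(j0 + bs, m)):
                            c[i][j] += aik * b[kk][j]
    return c
```

On a GPU, the two outer tile loops would typically map to thread blocks and the tiles of `a` and `b` would be staged in shared memory; that mapping is not shown here.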
Three optimization techniques for BLAS implementations are discussed: pointer redirecting, padding, and auto-tuning.
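As a small illustration of padding, the sketch below extends a matrix with zeros so that both dimensions become multiples of a block size, which lets a blocked kernel avoid special handling of edge tiles. This assumes padding is meant in that usual sense; the helper name `pad_to_multiple` is hypothetical and not taken from any BLAS library.

```python
def pad_to_multiple(matrix, block):
    """Return a zero-padded copy of `matrix` whose row and column
    counts are both multiples of `block`."""
    if block <= 0:
        raise ValueError("block must be positive")
    rows, cols = len(matrix), len(matrix[0])
    padded_rows = rows + (-rows % block)
    padded_cols = cols + (-cols % block)
    out = [[0.0] * padded_cols for _ in range(padded_rows)]
    for i in range(rows):
        out[i][:cols] = matrix[i]  # copy the original row, keep zeros after it
    return out
```

Because the extra rows and columns are zero, a matrix product computed on padded operands agrees with the original product in the unpadded region.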
Sparse matrix-vector multiplication (SpMV) is an interesting computation because it appears in scientific and engineering applications, in financial and economic modeling, and in information retrieval.
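For readers unfamiliar with the kernel, the following minimal sketch computes y = A * x for a matrix stored in compressed sparse row (CSR) form. CSR is one common storage format, chosen here as an assumption; the argument names `values`, `col_idx`, and `row_ptr` are illustrative.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row r owns entries row_ptr[r] .. row_ptr[r + 1] - 1
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            # Indirect access to x: the source of irregular memory traffic.
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y
```

Each nonzero is read once and used in a single multiply-add, so the kernel does little arithmetic per byte moved from memory.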
Because of the diversity of architectural designs and input matrix characteristics, a high level of performance is achieved only through a complex combination of architecture-specific and matrix-specific techniques.
Performance is compared on different platforms across a suite of matrices. The optimized implementations clearly deliver better performance, and bandwidth is observed to be the determining performance factor.
The accurate simulation of real-world phenomena in computational science is based on mathematical models consisting of sets of partial differential equations. Finite element methods are considered to be among the most promising approaches for the numerical treatment of partial differential equations.
Graphics processing units are considered to work well in such cases. Achieving peak performance requires careful selection of data structures and parallelization techniques, especially when combining coarse-grained parallelism at the cluster level with medium-grained and fine-grained parallelism between CPU cores and within accelerators such as GPUs.
Fine-grained parallelization techniques are applied to robust, numerically strong multigrid solvers for the sparse, ill-conditioned linear systems of equations that arise from grid-based discretization techniques such as finite differences, finite volumes, and finite elements.
The parallelization techniques are implemented on graphics processors as representatives of throughput-oriented, wide-SIMD many-core architectures, since GPUs offer a tremendous amount of fine-grained parallelism. NVIDIA CUDA is used, which brings in the concepts of memory coalescing, warps, shared memory, and thread blocks.
The design of an efficient parallel implementation of the Fast Fourier Transform (FFT) on the Cell/B.E. is presented. The FFT is a fundamental kernel in computationally intensive scientific applications such as computer tomography, data filtering, fluid dynamics, spectral analysis of speech, sonar, radar, seismic and vibration detection, digital filtering, signal decomposition, and the solution of PDEs.
An iterative approach is used to solve the 1D FFT. The work is divided among the SPEs to parallelize the FFT computation efficiently, which requires synchronization among the SPEs after each stage of the computation. The computation on the SPEs is fully vectorized and uses further optimization techniques such as loop unrolling and double buffering.
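The stage structure of an iterative FFT can be sketched as follows. This is a textbook in-place radix-2 decimation-in-time FFT in plain Python, given only to show where the stages fall; it is not the Cell/B.E. code, and the names `bit_reverse` and `fft_iterative` are chosen for this example. Within one stage all butterflies are independent, which is where work could be divided among processing elements, with a synchronization point between stages.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def fft_iterative(x):
    """Return the DFT of x; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    # Reorder the input so that each stage combines adjacent blocks.
    a = [0j] * n
    for i in range(n):
        a[bit_reverse(i, bits)] = complex(x[i])
    size = 2
    while size <= n:  # one loop iteration per FFT stage
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                # Butterfly: independent of every other butterfly in this stage.
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= w_step
        # A parallel version would synchronize here before the next stage.
        size *= 2
    return a
```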
For the FFT to exploit the parallel resources typical of multicore architecture platforms and achieve near-optimal performance, designers have to adopt a systematic approach that takes into account the attributes of both the application and the target system.
A successful implementation depends on a deep understanding of data access patterns, computation properties, and available hardware resources. It can take advantage of generalized performance planning techniques to produce successful implementations across a wide variety of multicore architectures.
Combinatorial algorithms play an important role in scientific computing: in the efficient parallelization of linear algebra, computational physics, and numerical optimization computations, in massive data analysis routines, in systems biology, and in the study of natural phenomena involving networks and complex interactions.
A complexity model that simplifies the design of algorithms on the Cell/B.E. multicore architecture is presented, together with a systematic procedure for evaluating performance. To estimate the execution time of an algorithm, the model considers its computational complexity, its memory access patterns, and the complexity of its branching instructions.
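As a rough illustration of how three such cost terms could be folded into a single estimate, consider the sketch below. The combining rule is an assumption of this example, not something stated above: it supposes that computation and memory transfers can overlap (so the larger of the two dominates) and that branch cost is added on top. The function and parameter names are hypothetical.

```python
def estimated_time(t_compute, t_memory, t_branch, overlap=True):
    """Combine three cost terms into one execution-time estimate.

    ASSUMPTION: with overlap=True, computation and memory transfers
    proceed concurrently, so only the larger of the two counts;
    branch cost is always added. With overlap=False, all terms add.
    """
    if min(t_compute, t_memory, t_branch) < 0:
        raise ValueError("cost terms must be non-negative")
    if overlap:
        return max(t_compute, t_memory) + t_branch
    return t_compute + t_memory + t_branch
```

Under this assumed rule, an algorithm whose memory term exceeds its compute term gains nothing from faster arithmetic until the memory access pattern is improved.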
Auto-tuning is applied to the 7-point and 27-point stencils on a wide range of multicore architectures. These chip multiprocessors lie at the extremes of a spectrum of design tradeoffs that ranges from replicating existing core technology to employing large numbers of simple cores and novel memory hierarchies.
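For reference, a 7-point stencil updates each interior point of a 3D grid from its own value and its six face neighbors. The sketch below is a naive, untuned Python version; the weights `alpha` and `beta` and the choice to leave boundary points unchanged are assumptions made for this example.

```python
def stencil7(grid, alpha, beta):
    """Apply one 7-point stencil sweep to a 3D grid (nested lists,
    indexed grid[z][y][x]). Boundary points are copied unchanged."""
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                neighbors = (
                    grid[z - 1][y][x] + grid[z + 1][y][x]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z][y][x - 1] + grid[z][y][x + 1]
                )
                out[z][y][x] = alpha * grid[z][y][x] + beta * neighbors
    return out
```

A 27-point stencil has the same structure but also reads the edge and corner neighbors of the surrounding 3x3x3 cube. An auto-tuner would search over variants of these loops (for example different blocking factors) rather than hand-pick one.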
Important aspects are parallelism discovery, selection among the various forms of hardware parallelism, and enabling memory hierarchy optimizations. These are made more challenging by the separate address spaces, software-managed local-store memories, and NUMA features that appear in multicore systems.
Multicore, many-core, and heterogeneous microarchitectures are very important in the hardware landscape. Specialized processing units such as commodity graphics processing units have proved to be compute accelerators capable of solving specific scientific problems orders of magnitude faster than conventional CPUs.
Hyperthermia is a relatively new treatment modality that is used as a complementary therapy to radiotherapy or chemotherapy. Here we study the optimization of a computational kernel that appears in a biomedical application, hyperthermia cancer treatment, on NVIDIA graphics processing units.
The implementation and results of two bioinformatics applications are presented: FASTA, for its Smith-Waterman kernel, and ClustalW. The results show that the Cell/B.E. is an attractive avenue for bioinformatics applications. The Cell/B.E. is considered a power-efficient platform, provided that its total power consumption is less than that of a superscalar processor.
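The Smith-Waterman kernel computes the best local alignment score between two sequences by dynamic programming. The sketch below is a minimal Python version with a linear gap penalty; the scoring values and the linear (rather than affine) gap model are assumptions for illustration and are not taken from FASTA.

```python
def smith_waterman_score(s, t, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings s and t.

    Only two rows of the dynamic programming table are kept,
    since each cell depends on the current and previous row.
    """
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the 0 floor.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Cells on the same anti-diagonal of the table do not depend on each other, which is the property SIMD implementations commonly exploit.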
An implementation of ClustalW running on the Cell/B.E. that uses software caches inside the SPEs for data movement is also described. Using software caches improves programmer productivity without a major decrease in performance.
Efficient and scalable strategies for orchestrating all-pairs computations on the Cell architecture, based on decomposition of the computations and of the input entries, are described. The general case is to schedule the computations on the Cell processor; the strategies are then extended to cases where the number of input entries is large and where individual entries are too large to fit within the memory limits of the SPEs.
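One simple way to decompose an all-pairs computation is to cut the index space into tiles, so that each task needs only two small groups of input entries at a time. The sketch below illustrates that idea in plain Python; it is not the scheduling strategy of the work summarized above, and the names `tile_tasks` and `all_pairs_tiled` are hypothetical.

```python
def tile_tasks(n, tile):
    """Yield (row_range, col_range) tiles that together cover every
    index pair (i, j) with i < j exactly once."""
    starts = range(0, n, tile)
    for r in starts:
        for c in starts:
            if c >= r:  # tiles below the diagonal contain no pair with i < j
                yield range(r, min(r + tile, n)), range(c, min(c + tile, n))


def all_pairs_tiled(items, f, tile):
    """Apply f to every unordered pair of items, one tile at a time."""
    results = {}
    for rows, cols in tile_tasks(len(items), tile):
        # Each tile is an independent task touching at most 2 * tile items.
        for i in rows:
            for j in cols:
                if i < j:
                    results[(i, j)] = f(items[i], items[j])
    return results
```

The tile size would be chosen so that the entries of one tile fit in the local memory of a processing element; entries that are individually too large need a further split, which this sketch does not cover.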
The performance results show that the Cell processor is a good platform for accelerating various kinds of applications that involve pairwise computations.
The all-pairs computation strategies can be applied to many applications, from a wide range of areas, that require such computations.
The main applications in drug design are outlined, and two practical case studies are discussed: FTDock, a docking application, and Moldy, a molecular dynamics application. The advantages of using the Cell/B.E. in drug design are noted.
For FTDock, a 3x speedup is achieved compared with a parallel version running on a POWER5 multicore system with two 1.5 GHz POWER5 chips and 16 GB of RAM.
Moldy on the Cell/B.E. consumes less power and takes the same time as an MPI parallelization on four Itanium Montecito processors of an SGI Altix 4700.
GPUs are parallel computing devices capable of accelerating a wide variety of data-parallel algorithms. Their tremendous computing capabilities help accelerate molecular modeling applications, enabling molecular dynamics simulations and their analyses to run much faster than before and allowing the use of scientific techniques that are impractical on conventional hardware platforms.
The most computationally expensive algorithms used in molecular modeling are presented, along with an explanation of how they may be reformulated as arithmetic-intensive, data-parallel algorithms capable of achieving high performance on GPUs. In the coming years, we expect GPU hardware architecture to continue to evolve rapidly and become increasingly sophisticated.
Biomedical applications are an important focus for high-performance computing (HPC) researchers. Accelerators, with their low cost and high performance, are one possible way of providing the performance these applications need.
It is clear that the dataflow programming model and its associated runtime systems can, at multiple application and hardware granularities, ease the implementation of challenging biomedical applications on these types of computational resources. The GPU is designed to deliver maximum performance through its SIMD architecture.
Support for accelerators, and for heterogeneous clusters that include accelerators, in the Charm++ parallel programming model and runtime system is presented. Several extensions to the Charm++ programming model are also presented, including a SIMD instruction abstraction, accelerated entry methods, and accelerated blocks.
Importantly, support for CUDA-based GPUs is also presented. All of these extensions continue to be developed and improved as we increase support for heterogeneous clusters in Charm++.
Modern many-core GPUs are massively parallel processors, and the CUDA programming model provides a straightforward way of writing scalable parallel programs to execute on them. Data-parallel techniques provide a convenient way of expressing such parallelism.
The design of efficient scan and segmented scan routines, which are essential primitives in a broad range of data-parallel algorithms, is presented. By tailoring existing algorithms to the natural granularities of the machine and by minimizing synchronization, some of the fastest scan and segmented scan algorithms for the GPU are designed.
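To make the two primitives concrete, the sketch below shows an inclusive scan written as a sequence of data-parallel steps, and a sequential reference version of segmented scan. The step-wise formulation is a textbook one (often attributed to Hillis and Steele) and is not the tailored GPU algorithm described above; the function names are chosen for this example.

```python
def scan_data_parallel(values):
    """Inclusive prefix sum in about log2(n) steps.

    Within one step every element update is independent, so a
    parallel machine could perform the whole step at once and
    synchronize only between steps.
    """
    out = list(values)
    offset = 1
    while offset < len(out):
        out = [
            out[i] + out[i - offset] if i >= offset else out[i]
            for i in range(len(out))
        ]
        offset *= 2
    return out


def segmented_inclusive_scan(values, head_flags):
    """Sequential reference: prefix sums that restart wherever
    head_flags marks the first element of a new segment."""
    out, running = [], 0
    for v, head in zip(values, head_flags):
        running = v if head else running + v
        out.append(running)
    return out
```

For example, with values 3, 1, 7, 0, 4, 1, 6, 3 and segments starting at positions 0, 3, and 5, the segmented scan restarts its running sum at each of those positions.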
The performance of interprocess communication mechanisms on modern multicore CPUs is evaluated. Streaming instructions are expected to deliver good performance, but the current implementation is observed to generate a high number of resource stalls and hence low performance.
Intra-node communication performance is also found to be highly dependent on the memory and cache architecture. The way in which improvements in processor and interconnect technology have affected the balance of computation to communication performance is also presented.