Multi-Core Hardware and Software Development - University of ...


Multi-Core Hardware and Software Development



Bruce K Botcher II

Software Engineer

University of Wisconsin Platteville

botcherb@uwplatt.edu



Abstract


Multi-core computing has been around for decades, but it was almost always associated with
supercomputers used primarily for research or extremely rigorous computing. As such, the vast
majority of software developers didn't need to understand multi-core development or
parallelism. With the recent rise of general-use computers containing multi-core processors,
software development has needed to shift. This paper will show why the shift to multi-core
processors was necessary and what is being done in software development to keep up with the
paradigm shift. It discusses the languages, platforms, and tools being developed and used by
software developers to make better use of parallel computing.





Definitions


Before delving into why the paradigm is shifting to multi-core and how we will develop
software for it, it is necessary to explain what multi-core means in this paper, and to
define some other terms that will be used throughout.


Multi-Core Hardware: As stated above, parallel computing has been around for decades, but it
was accomplished by tying many single-core chips together to form a single unit. That is not
multi-core but multiprocessor. When I describe hardware in this paper I mean multi-core, not
multiprocessor. A multi-core system is one in which multiple processors have been placed on
a single die, or integrated circuit. Each processor is considered a separate core.


Parallel Computing: The use of multiple processors, whether they are cores on one chip or
completely separate processors as in the multiprocessor model, to solve a single problem or,
looking ahead, to run a single application more efficiently and effectively.


Concurrency: Concurrency is very similar to parallelism, but it is not the same. Concurrency
includes scenarios in which the programmer has created the illusion of parallelism, for
example by running several threads on a single processor. If you were developing for an
original Pentium processor, which has only one core, and used threads, you would be using
concurrency. Concurrency still includes all parallel programming, however, much as a square
is a parallelogram but a parallelogram is not necessarily a square.



Why Multi-Core?


In 1965 Gordon E. Moore observed that the number of transistors that could be placed on an
integrated circuit was doubling at a steady rate, which he later put at roughly every 2 years
(some people quote 18 months, a figure attributed to David House), and hardware developers
such as Intel have delivered on this. A side effect of more transistors on a chip was higher
clock speed. Software engineers often rode on the coattails of hardware developers for better
performance: higher clock speeds made legacy applications run faster and let developers
create more intensive applications without much development of new algorithms. Over the last
decade, however, Moore's law as it applies to clock speed has come to what appears to be a
screeching halt, due in large part to thermal problems: we can no longer increase the clock
speed of a processor while also keeping it cool enough not to burn up the chip. [1]


With clock speeds seemingly frozen at around 4 GHz, hardware developers needed to look
elsewhere for performance increases. While Moore's law still holds for transistor counts,
hardware developers have switched paradigms to multi-core processors. Some people have
predicted that the number of cores will now double around every 2 years in place of clock
speed increases. In retrospect this appears to have been a bit of an overestimate, but the
principle appears to hold that we will see more and more cores in household processors. A
Sandia National Laboratories paper predicts that if the number of cores were to increase with
Moore's law, we would be seeing thousand-core processors by the middle of the next decade. [2]
Though this seems unlikely at the current pace, it does bring to light the need for software
developers to adapt to the changing hardware.



Why Not Earlier?


If there is a question as to why software engineers and developers did not switch to parallel
computing before this paradigm shift, there are two answers. The first is that parallel
programming is not easy, and it takes a lot of work on the part of developers to make the
switch. Until the last few years programmers had to do memory and thread management
themselves, as there weren't tools to do it for them. This made the programming tedious and
difficult, and it required a large knowledge base.


The second answer is Amdahl's Law. In 1967 Gene M. Amdahl observed that programs could gain
only so much speedup from being parallelized. He realized that almost no program could be
made entirely parallel and that the remaining serial portion would limit performance gains.
For example, if 50% of a program can be run in parallel, the time spent processing can be cut
by no more than half, no matter how many processors are used, so the speedup approaches but
never reaches 2. This observation discouraged the practice. We now see what Amdahl probably
didn't envision: that clock speed would eventually hit a wall, as it has. It is interesting
to note that this discussion occurred some 40 years before the multi-core paradigm became the
new future of computing.
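
The limit Amdahl described can be made concrete with a short calculation. The sketch below is
illustrative only: it assumes the usual statement of the law, speedup = 1 / ((1 - p) + p / n)
for a parallel fraction p and n processors, and the class and method names are hypothetical
rather than taken from any cited source.

```java
// Illustrative sketch of Amdahl's Law; class and method names are hypothetical.
class AmdahlSketch {
    /**
     * Speedup predicted by Amdahl's Law.
     *
     * @param parallelFraction fraction of the run time that can be parallelized (0.0 to 1.0)
     * @param processors       number of processors (at least 1)
     */
    static double speedup(double parallelFraction, int processors) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || processors < 1) {
            throw new IllegalArgumentException("invalid arguments");
        }
        double serialFraction = 1.0 - parallelFraction;
        return 1.0 / (serialFraction + parallelFraction / processors);
    }

    /** Upper bound on the speedup as the processor count grows without limit. */
    static double maxSpeedup(double parallelFraction) {
        // Evaluates to Infinity for a fully parallel program (parallelFraction == 1.0).
        return 1.0 / (1.0 - parallelFraction);
    }

    public static void main(String[] args) {
        for (int n : new int[] {2, 4, 16, 1024}) {
            System.out.printf("50%% parallel, %4d processors: %.3fx%n", n, speedup(0.5, n));
        }
        System.out.printf("limit for 50%% parallel: %.1fx%n", maxSpeedup(0.5));
    }
}
```

With half of the program parallel, 2 processors give a speedup of about 1.33, and even 1,024
processors stay just under 2, which is the ceiling described above.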



What is Being Done By Software Developers Now?


Herb Sutter wrote an article entitled "The Free Lunch Is Over", in which he explains that
developers can no longer ride solely on the clock speed increases of hardware. [3] Developers
must change the way they build applications to account for the new hardware paradigm. Since
multi-core processors became prevalent around 2006, new tools and platforms have been
developed and languages have been adapted. Some programming languages have made relatively
seamless transitions, while most others are simply not designed for high-performance parallel
computing.



Platforms



Numerous new platforms have been designed specifically for multi-core development. The two
most famous are CUDA and OpenCL, but both are used mainly to program GPUs (Graphics
Processing Units) rather than CPUs (Central Processing Units). A few other platforms, built
on top of languages for CPUs, are Cilk++ and pMATLAB, among others listed in the table below.
Since 2009 there has been significant development in another very popular framework, .NET.



Figure 1: This image shows multi-core technologies as of 2009 [4]




CUDA


CUDA is a platform designed by NVIDIA for their GeForce graphics cards. This clearly comes
with some drawbacks, chiefly a lack of portability: anything written on the CUDA platform
must be run on an NVIDIA graphics card that supports CUDA. Even so, CUDA is still one of the
most widely used platforms for non-graphics processing on the GPU. CUDA is built on top of C
and C++ and supports parallelism with relative ease. It is designed to support as many cores
as NVIDIA can fit onto a card, and thus scales well into the future. CUDA also automates the
management of parallelism and of individual threads, which in the past has been up to the
programmer. Task scheduling and management has in the past been a somewhat painful process
that not all programmers could grasp. With these tasks handled automatically, writing code on
the CUDA platform is much easier.


Most programmers and developers have been trained to design and implement sequential
programs, and as such have a difficult time transitioning to parallel programming. CUDA has
attempted to make this transition easier by abstracting away these difficulties. The code
segment in Figure 2 shows a vector algorithm that computes y = a*x + y, where x and y are
arrays and a is a constant, written in both sequential and parallel forms. The first is
standard sequential C++ with a for loop. As you can see, the parallel code is very similar to
the sequential code. [5]


Figure 2: This figure shows the use of CUDA vs sequential programming [5]
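
The listing referred to as Figure 2 is not reproduced in this text. As a rough stand-in, the
sketch below shows the same computation, a constant times x plus y over two arrays, first as
a sequential loop and then in a data-parallel form. It is written in Java with parallel
streams (a Java 8 feature) rather than CUDA C, so it only illustrates the idea of one
independent unit of work per array element and is not the code from [5]. The class and
method names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical stand-in for the missing listing: y = a*x + y over two arrays.
// Both methods assume x and y have the same length.
class SaxpySketch {
    /** Sequential form: one loop visits every element in order. */
    static void saxpySequential(float a, float[] x, float[] y) {
        for (int i = 0; i < y.length; i++) {
            y[i] = a * x[i] + y[i];
        }
    }

    /**
     * Data-parallel form: each index is an independent unit of work that writes
     * only its own element, so the runtime may spread indices across cores.
     */
    static void saxpyParallel(float a, float[] x, float[] y) {
        IntStream.range(0, y.length)
                 .parallel()
                 .forEach(i -> y[i] = a * x[i] + y[i]);
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] y = {10f, 20f, 30f, 40f};
        saxpyParallel(2f, x, y);
        System.out.println(Arrays.toString(y));
    }
}
```

The body of the loop is identical in both versions; only the way the index range is
traversed changes.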



CILK++


Cilk++ is an extension of C++ built specifically for parallelism. MIT designed the Cilk
language, and Cilk Arts based Cilk++ on it. In its most basic form Cilk++ adds three new
keywords: cilk_spawn, cilk_sync, and cilk_for. These keywords mark where parallelism may
occur in a program. cilk_spawn starts a child function that can run in parallel with its
caller, cilk_sync states that the currently executing function may not continue until all of
its spawned functions have returned, and cilk_for executes a regular C++ for loop in
parallel, utilizing the multi-core hardware.


Using Cilk++ is rather simple, which makes it a nice platform for those who still want to
write sequential programs, though precautions must be taken to avoid race conditions when
editing shared data. Beyond these precautions, all that must be done is to add a Cilk keyword
before a function call that should run in parallel, or at the point where you want your
threads to synchronize. This allows a seamless transition for some of the simple tasks we
would want to parallelize, such as the quick sort shown in Figure 3.


Cilk++ also has some very useful tools and other features. First, all programs are written
such that if you take out the keywords you have a functioning sequential C++ program. This
allows us to do two things: debug our program in sequential form before parallelizing it, and
parallelize some forms of legacy code, as evidenced by the aforementioned quick sort. Cilk++
also uses what is called work stealing, developed as part of the Cilk project at MIT, which
allows a processor that has finished its own work to steal part of another processor's queue.
This relieves some of the overhead that used to come from work-sharing scheduling. The last
major tool is Cilkscreen, a race condition detector, which makes it easy to detect race
conditions so that the developer can quickly fix the problem. [6]
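
The kind of bug a race detector looks for can be shown in a few lines. The sketch below uses
plain Java threads, not Cilk++ or Cilkscreen, and all names in it are hypothetical: several
threads increment a shared counter, once through an unprotected field and once through an
AtomicInteger.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a race on shared data; not Cilk++ code.
class RaceSketch {
    static int unsafeCounter = 0;                          // shared and unprotected
    static final AtomicInteger safeCounter = new AtomicInteger();

    /** Starts `threads` workers that each increment both counters `perThread` times. */
    static void run(int threads, int perThread) {
        unsafeCounter = 0;
        safeCounter.set(0);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    unsafeCounter++;               // read-modify-write: updates can be lost
                    safeCounter.incrementAndGet(); // atomic: no update is lost
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join(); // wait for every worker before reading the totals
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        run(4, 100_000);
        System.out.println("expected: " + (4 * 100_000));
        System.out.println("atomic:   " + safeCounter.get());
        System.out.println("unsafe:   " + unsafeCounter);
    }
}
```

The atomic counter always reaches the expected total, while the unprotected counter may fall
short by a different amount on each run, which is what makes such bugs hard to find without
tool support.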


Figure 3: Quick Sort using Cilk++ [6]
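
The Figure 3 listing is not reproduced in this text. As a loose analogue, and not the Cilk++
code from [6], the sketch below expresses the same spawn-then-sync shape with Java's
fork/join framework: fork() plays the role that cilk_spawn plays in Cilk++, and join() the
role of cilk_sync. The class name, the cutoff value, and the partition scheme are
assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Hypothetical fork/join quick sort; an analogue of the spawn/sync pattern, not Cilk++.
class ParallelQuickSort extends RecursiveAction {
    private static final int CUTOFF = 1_000; // assumed size below which sorting is sequential
    private final int[] data;
    private final int lo, hi;                // sorts data[lo..hi], both ends inclusive

    ParallelQuickSort(int[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo < CUTOFF) {
            Arrays.sort(data, lo, hi + 1);   // small or empty range: sort sequentially
            return;
        }
        int p = partition(data, lo, hi);
        ParallelQuickSort left = new ParallelQuickSort(data, lo, p - 1);
        ParallelQuickSort right = new ParallelQuickSort(data, p + 1, hi);
        left.fork();      // comparable to cilk_spawn: left half may run on another core
        right.compute();  // keep working on the right half in this thread
        left.join();      // comparable to cilk_sync: wait for the spawned half
    }

    // Lomuto partition around the last element; returns the pivot's final index.
    // This naive pivot choice degrades badly on already-sorted input.
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) {
                int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++;
            }
        }
        int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;
        return i;
    }

    static void sort(int[] data) {
        if (data.length > 1) {
            ForkJoinPool.commonPool().invoke(new ParallelQuickSort(data, 0, data.length - 1));
        }
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(20, 0, 100).toArray();
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

Replacing left.fork() with left.compute() and dropping left.join() leaves an ordinary
sequential quick sort, which mirrors the Cilk++ property that removing the keywords yields a
working sequential program.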



.NET


In the .NET Framework 4 Microsoft has added parallel computing support to all of their
supported languages: C#, Visual Basic, F#, and C++. This is a significant advancement for the
cause of parallel computing. Many of the world's software applications are written on the
.NET Framework, and this advancement will allow new programs to be parallelized. .NET now has
Parallel LINQ, which allows data queries over lists and arrays, as well as databases, to be
parallelized. This can bring significant performance increases for data-intensive programs
written in C# and Visual Basic. Also added were task libraries that allow the programmer to
create tasks to be run on all available cores. This makes parallelization much easier to do.



Languages


This section will discuss some popular languages and what is being done to make them more
useful for concurrency. As stated earlier, not every language has made the transition easily:
some high-level languages like Ruby have had difficulties, while C and C++, as shown by Cilk
and CUDA, have been used as base languages for platforms and are now themselves starting to
support more parallelism.



C/C++


For most of their history C and C++ had no built-in support for threading and parallelism,
and the C11 and C++11 standards have only recently added threads to the languages themselves,
so many platforms such as Cilk have been developed for them. Visual C++ uses the Parallel
Patterns Library (PPL). [7] It is used in much the same way as Cilk but has far more
functions and appears more complicated. It appears that this trend of C++ serving as the base
with APIs layered on top will continue well into the future.



Java


Java is a language that is famous for its portability, being used on everything from PCs to
smart phones. With smart phones having had dual-core chips for a while now and quad-core
chips on the way, it would follow that Java ought to support parallelism. Java supported
threading and concurrency from very early in its life cycle, but that was not parallelism. As
stated in the definitions, the two are not one and the same. Until Java SE 7 there was no
support built directly into the language for parallelism. Threads would be put into a waiting
state until another thread was done processing or was forced to wait and put back into the
queue. This created the illusion of parallelism, but was not actually parallel. Java SE 5 and
6 added more concurrency functionality with the java.util.concurrent package, but still did
not support parallelism. It wasn't until the middle of 2011 that Java started to support
parallelism directly: with the new fork/join framework, Java is able to handle what are known
as divide-and-conquer algorithms very well. Fork/join is also scalable in that it will run on
as many cores as the processor chip provides. Depending on the amount of parallelism and the
number of cores, you can see significant speedup with this development. [8]



Ruby



Ruby is a high-level language in which everything is an object. As a whole Ruby doesn't
support parallelism; however, much like Java in its earlier versions, Ruby supports
concurrency through the use of threads. These threads are blocking, that is to say they have
a wait state, and they are not able to run in parallel on multiple cores. There are some
'gems', as they are called in the Ruby world, that add parallelism, but they are not
necessarily reliable. Ruby on Rails, however, is able to achieve parallelism because it is
designed for the web, which runs best as multiple separate processes that can then be spread
across multiple cores. [9]



Tools


As with all of programming, there need to be tools for design and debugging to make
development easier. Without tools, development time will rise in unpredictable ways. If we
are unable to debug properly, there will be bugs in the code that we may not be able to
understand. The conditions that arise with parallelism, namely many pieces of a program
running at the same time, make debugging extremely difficult. We may not know which thread is
the problem, and even if we did, would we be able to do an in-depth debug of a single thread?
Of course, it is human nature to invent new things to make our lives easier, and so it
follows that there are tools that can do these things. This section discusses Visual Studio,
both 2010 and the recently released 11 beta, which have added new concurrency and
parallelization tools for debugging and visualization, as well as Intel Parallel Studio. The
two are similar to each other, and to many other IDEs, but both are specifically designed for
parallel coding and debugging.



Visual Studio


Visual Studio is seemingly updated every year or two, always adding new features. With a
clear need for development tools for parallel programming, Microsoft has stepped up to the
plate and offered developers many new tools for parallel development. As discussed above,
they have added the Parallel Patterns Library for coding, but maybe more important are their
new debugging tools. Visual Studio 2010 offered two great new additions. One is the Parallel
Stacks Window, which shows, in a debugging session, all of the current threads and where they
are in the program. It shows the call stack for each active thread, for example that the
first thread started in main, then ran function x and then function y. All of this helps the
programmer know exactly where in the code each active thread is.


Figure 4: This is the Parallel Stacks Window [10]


To go even further in depth, we can look at each task that has been scheduled via the
Parallel Tasks Window, the other addition to Visual Studio 2010. The Parallel Tasks Window
gives the programmer access to things such as the status of each task, which could be
anything from running to deadlocked. Finding out that a task is deadlocked is extremely
useful to a programmer: deadlock should never happen to a thread, so it is a clear sign that
something is wrong. This tool also allows the programmer to freeze all threads except a
single thread that appears to be a problem. You can then step through that single thread and
find your bug. This eliminates one of the old problems of concurrency, namely that any of the
threads could be causing the problem and you want to find the right one.


Further still, Visual Studio 11 has a new feature, the Concurrency Visualizer, under which
you can run a program. This new addition isn't so much for debugging as for showing your
utilization, such as what threads spend their time doing, be it sleeping or executing. You
are able to see how many cores are under load at a given time in your program, and exactly
which logical cores are in use at that time. The visualizer runs the program; if you are
running a GUI application, you as the programmer perform some tasks and close it. The
visualizer then does the analysis and shows the graphs and visuals. This is a very handy tool
for seeing exactly what your software is doing at the thread level.



Intel Parallel Studio


Intel has also developed tools for parallel programming. Parallel Studio has many tools that
a developer can use to develop high-performance parallel as well as serial programs, and it
integrates with Visual Studio. Parallel Amplifier is one of the tools that comes with the
studio. It analyzes how well your software performs and finds bottlenecks that can occur in
your software.


Another tool in Parallel Studio is Parallel Advisor, which offers a process that takes you
through a serial program and shows you where you can add threading to C and C++ programs. It
advises places to put parallelism, then runs experiments, analyzes the performance of the
experiments, and checks for errors such as race conditions and deadlock in the experimental
code. This can help speed up the parallelization of legacy systems.


Parallel Inspector is able to detect memory and threading errors. This helps to find bugs
early and increase the quality of a product. Finding memory leaks and corruption will
increase the security and reliability of your product. [11]




Conclusion


Parallel programming is the way of the future. With multi-core hardware becoming ever more
ubiquitous, software engineers must change the way we do things. We are now being forced to
adapt to the changing paradigm. There have been many advancements in tools and platforms for
parallel programming. Languages are beginning to adapt to the paradigm shift, and moving
forward we may see new languages designed specifically for parallel programming. The key to
this adaptation is to realize it is happening, accept it, and move forward, as we as humans
have done every time there is a new innovation.




References

[1] Mehrara, M., Jablin, T., Upton, D., August, D., Hazelwood, K., & Mahlke, S.
(2009, November). Multicore compilation strategies and challenges. IEEE Signal Processing
Magazine, 55.


[2] Pedretti, K., Kelly, S., & Levenhagen, M. U.S. Department of Energy, Office of Scientific
and Technical Information. (2008). Sandia report: Summary of multi-core hardware and
programming model investigations. Albuquerque: Sandia National Laboratories.


[3] Sutter, H. (2005). The free lunch is over: A fundamental turn toward concurrency in
software. Dr. Dobb's Journal, 30(3). Retrieved from
http://www.gotw.ca/publications/concurrency-ddj.htm



[4] Kim, H., & Bond, R. (2009). Multicore software technologies. IEEE, 26(6), 80-89.


[5] Buck, I., Nickolls, J., Garland, M., & Skadron, K. (2008). Scalable parallel programming
with CUDA. ACM Queue, 6(2), 41-53. Retrieved from
http://queue.acm.org/detail.cfm?id=1365500


[6] Leiserson, C. E., & Mirman, I. B. (2008). How to survive the multicore software
revolution (or at least survive the hype). (p. 24). Cilk Arts.


[7] Campbell, C., & Miller, A. (2011). Parallel programming with Microsoft Visual C++: Design
patterns for decomposition and coordination on multicore architectures (patterns &
practices). (1 ed.). Microsoft Press. Retrieved from
http://msdn.microsoft.com/en-us/library/gg675934.aspx


[8] Ponge, J. (2011, July). Retrieved from
http://www.oracle.com/technetwork/articles/java/fork-join-422606.html


[9] Hansson, D. H. (n.d.). Interview by R. Seidner. Is Ruby on Rails a crown jewel?.
Intelligence in Software. Retrieved from
http://www.intelligenceinsoftware.com/feature/expert_insight/is_ruby_on_rails_a_crown_jewel/


[10] Moth, D., & Toub, S. (2009, September). Debugging task-based parallel applications in
Visual Studio 2010. MSDN Magazine. Retrieved from
http://msdn.microsoft.com/en-us/magazine/ee410778.aspx


[11] The ultimate all-in-one performance toolkit: Intel® Parallel Studio 2011. (2011).
Retrieved from
http://software.intel.com/sites/products/collateral/studio/Intel_Parallel_Studio_Brief_081610_HighRes.pdf