Title of the research paper:
Performance Analysis of Multicore Systems
akhvinder Singh & Harmeet Kaur
Name of the Institution:
Dev Engg College
One constant in computing is that the world's hunger for faster performance is never satisfied. Every new performance advance in processors leads to another level of greater performance demands from businesses and consumers. Today these performance demands are not just for speed, but also for smaller, more powerful mobile devices, longer battery life, quieter desktop PCs, better price/performance per watt, and lower cooling costs. People want improvements in productivity, security, data protection, game performance, and many other capabilities. There is also a growing demand for more convenient form factors for the home, the data center, and on the go.
Through advances in silicon technology, microarchitecture, software, and platform technologies, Intel is on a fast-paced trajectory to continuously deliver new generations of multi-core processors with the superior performance and energy efficiency necessary to meet these demands for years to come. In 2006, we reached new levels of energy-efficient performance with our Intel® Core™2 Duo processors and Dual-Core Intel® Xeon® processor 5100 series, both produced with our latest 65-nanometer (nm) silicon technology, and with the world's first mainstream quad-core processors for desktops and mainstream servers: the Intel® Core™2 Quad and Intel® Core™2 Extreme processors.
This paper explains the advantages and challenges of multi-core processing and the direction in which Intel is taking multi-core processors in the future. We discuss many of the benefits you will see as we continue to increase processor performance, energy efficiency, and capabilities.
For years, Intel customers came to expect a doubling of processor performance roughly every 18 months in accordance with Moore's Law. Most of these performance gains came from dramatic increases in frequency (from 5 MHz to 3 GHz in the years from 1983 to 2002) and through process technology; improvements also came from increases in instructions per cycle (IPC). By 2002, however, increasing power densities and the resultant heat began to reveal some limitations in using predominately frequency as a way of improving performance. So, while transistor density continues to increase per Moore's Law, and IPC gains continue to play an important role in performance increases, new thinking is also required.
The best example of this new thinking is multi-core processing. By putting multiple execution cores into a single processor (as well as continuing to increase clock frequency), Intel is able to provide even greater multiples of performance. With multi-core processors, Intel can dramatically increase a computer's capabilities and computing resources, providing better responsiveness, improving multitasking, and delivering the advantages of parallel computing to multi-threaded mainstream applications.
While manufacturing technology continues to improve, reducing the size of single gates, the physical limits of semiconductor-based microelectronics have become a major design concern. Some effects of these physical limitations can cause significant heat dissipation and data synchronization problems. The demand for more capable microprocessors causes CPU designers to use various methods of increasing performance. Some instruction-level parallelism (ILP) methods, like superscalar pipelining, are suitable for many applications but are inefficient for others that tend to contain difficult-to-predict code. Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common method of increasing a system's overall TLP. A combination of the increased space available due to refined manufacturing processes and the demand for increased TLP is the logic behind the creation of multi-core processors.
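The ILP/TLP contrast can be made concrete with a small sketch. The Python fragment below is a hypothetical illustration (not code from any vendor): it splits one workload into independent per-thread chunks, which is exactly the kind of decomposition that TLP-oriented hardware such as a multi-core processor rewards. Note that CPython's global interpreter lock prevents a real CPU speedup here, so the structure, not the timing, is the point.

```python
import threading

def parallel_sum(data, n_threads=4):
    """Sum `data` by giving each thread an independent slice (TLP-style)."""
    results = [0] * n_threads
    chunk = (len(data) + n_threads - 1) // n_threads  # ceiling division

    def worker(i):
        start = i * chunk
        results[i] = sum(data[start:start + chunk])  # independent work, no sharing

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)

print(parallel_sum(list(range(1000))))  # 499500, same as the serial sum
```

Because the chunks share no data, each could in principle run on its own core with no synchronization beyond the final join, which is why such workloads scale well on multi-core systems.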
How can the performance of a multi-core processor be increased? The performance of a processor can be increased in two main ways: a larger cache memory increases the speed of the processor, and additional transistors increase its capability. According to Moore's Law, "the number of transistors that can be integrated on a single chip keeps increasing exponentially," and a processor is considered better if it delivers its speed using the minimum number of transistors.
A fundamental principle of the multi-core processor is that it takes advantage of the fundamental relationship between power and frequency. By incorporating multiple cores, each core is able to run at a lower frequency, dividing among them the power normally given to a single core.
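This relationship can be sketched with the standard dynamic-power approximation for CMOS, P ≈ C·V²·f, where lowering frequency also allows a lower supply voltage. The parameters below are made-up illustrative values, not Intel's figures.

```python
def dynamic_power(cap, volts, freq_ghz):
    """Approximate dynamic CMOS power: P = C * V^2 * f (illustrative units)."""
    return cap * volts**2 * freq_ghz

# Assumed, made-up parameters for illustration only.
single = dynamic_power(cap=1.0, volts=1.2, freq_ghz=3.0)       # one core at full speed
dual = 2 * dynamic_power(cap=1.0, volts=0.9, freq_ghz=1.5)     # two cores at half speed, lower V

print(dual < single)  # True: same aggregate frequency (3.0 GHz), less total power
```

Because voltage enters the formula squared, the two slower cores match the aggregate clock throughput of the single fast core while drawing noticeably less power, which is exactly the leverage the text describes.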
Researchers have found that, since most microprocessors spend a significant amount of time idly waiting for memory, software parallelism can be leveraged to hide memory latency. Since memory stalls typically take on the order of 100 processor cycles, a processor pipeline is idle for a significant amount of time. Table 1 shows the amount of time spent waiting for memory in some typical applications on 2 GHz processors. For example, we can see that for a workload such as a Web server, there are sufficient memory stalls that the average number of machine cycles per instruction is 2.5, resulting in the pipeline waiting for memory up to 50% of the time. In Figure 3, we can see that less than 50% of the processor's pipeline is being used to process instructions; the remainder is spent waiting for memory.
By providing additional sets of registers per processor pipeline, multiple software jobs can be multiplexed onto the pipeline, a technique known as simultaneous multi-threading (SMT). Threads are switched onto the pipeline when another blocks or waits on memory, thus allowing the pipeline to be utilized potentially to its full capacity. Consider an example with four threads per core: in each core, when a memory stall occurs, the pipeline switches to another thread, making good use of the pipeline while the previous memory stall is serviced. The tradeoff is latency for bandwidth; with enough threads, we can completely hide memory latency, provided there is enough memory bandwidth for the added requests. Successful SMT systems typically allow for very high memory bandwidth from DRAM as part of their balanced design.
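As a rough software analogy to this stall-hiding behavior (an illustration, not a simulation of pipeline hardware), the sketch below uses `time.sleep` to stand in for a memory stall. With several threads in flight, total elapsed time approaches the longest single stall rather than the sum of all of them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def stalling_task(stall_s):
    time.sleep(stall_s)   # stand-in for a ~100-cycle memory stall
    return stall_s

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Four "threads per core": while one waits, the others make progress.
    results = list(pool.map(stalling_task, [0.1, 0.1, 0.1, 0.1]))
elapsed = time.perf_counter() - start

print(elapsed < 0.3)  # overlapped: well under the serial total of 0.4 s
```

Run serially, the four stalls would cost about 0.4 s; overlapped, the wall time is close to 0.1 s, which is the latency-for-bandwidth trade the text describes.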
SMT has a high return on performance relative to additional transistor count. For example, a 50% performance gain may be realized by adding just 10% more transistors with an SMT approach, in contrast to making the pipeline more complex, which typically affords a 10% performance gain for a 100% increase in transistors. Also, implementing multi-threading alone doesn't yield an optimal design; a good design is typically a balance of multi-core and SMT.
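Using the text's own numbers, a quick back-of-envelope comparison shows why SMT's return on transistor budget is attractive:

```python
def gain_per_transistor(perf_gain, transistor_increase):
    """Performance gain per unit of added transistor budget (both as fractions)."""
    return perf_gain / transistor_increase

smt = gain_per_transistor(0.50, 0.10)       # +50% performance for +10% transistors
pipeline = gain_per_transistor(0.10, 1.00)  # +10% performance for +100% transistors

print(smt > pipeline)  # True: SMT yields roughly 50x more gain per transistor here
```

The numbers are the paper's illustrative figures, not measurements; the point is the order-of-magnitude gap between the two approaches.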
Efficient Performance: Processor Transistors
• Intel's second-generation strained silicon improves transistor performance by 10 to 15 percent without increasing leakage.
• Compared to 90 nm transistor technology, Intel's high-performance 65 nm transistors provide over a 20% improvement in switching speed and over a 30% reduction in transistor switching power.
This fundamental relationship between power and frequency can be effectively used to multiply the number of cores from two to four, and then to eight and more, to deliver continuous increases in performance without increasing power usage. To do this, though, many advancements must be made that are only achievable by a company like Intel:
• Continuous advances in silicon process technology, from 65 nm to 45 nm and beyond, to increase transistor density. In addition, Intel is continuing to deliver superior transistor performance.
• Enhancing the performance of each core and optimizing it for multi-core processing through the introduction of new advanced microarchitectures about every two years.
• Enhancing the memory subsystem and optimizing data flow in ways that ensure data can be shared as fast as possible among all cores. This minimizes latency and improves efficiency and speed.
• Optimizing the interconnect fabric that connects the cores to improve performance between cores and memory units.
Scope for future work (if any):
Network-on-chip (NoC) has emerged as a new paradigm for designing multi-core systems. NoC helps in designing multi-core systems where large numbers of Intellectual Property (IP) cores are connected to the communication fabric (a router-based network) using network interfaces. The network is used for packet-switched on-chip communication among cores. It supports a high degree of reusability and scalability. In this work, a scalable network based on the Mesh of Trees (MoT) topology has been presented. The MoT interconnection network has the advantage of a small diameter as well as a large bisection width, and it has a nice recursive structure. These characteristics make it more powerful than other interconnection networks like meshes and binary trees. A generic NoC simulator is designed for performance evaluation in terms of network throughput, latency, and power of different topologies under different traffic situations.
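To make the topology comparison concrete, the hypothetical helper below (not the paper's simulator) computes the diameter in hops of an n×n mesh NoC by breadth-first search from every router; the same approach could be applied to an MoT graph to verify its smaller diameter.

```python
from collections import deque

def mesh_neighbors(x, y, n):
    """Routers adjacent to (x, y) in an n x n mesh (no wraparound links)."""
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ax, ay = x + dx, y + dy
        if 0 <= ax < n and 0 <= ay < n:
            yield ax, ay

def mesh_diameter(n):
    """Longest shortest path (in hops) between any two routers, via BFS."""
    diameter = 0
    for sx in range(n):
        for sy in range(n):
            dist = {(sx, sy): 0}
            q = deque([(sx, sy)])
            while q:
                x, y = q.popleft()
                for nb in mesh_neighbors(x, y, n):
                    if nb not in dist:
                        dist[nb] = dist[(x, y)] + 1
                        q.append(nb)
            diameter = max(diameter, max(dist.values()))
    return diameter

print(mesh_diameter(4))  # 6: corner-to-corner hop count of a 4x4 mesh
```

A 4×4 mesh has diameter 2(n−1) = 6, growing linearly with the side length; tree-based fabrics such as MoT keep the diameter logarithmic in the node count, which is the advantage the text cites.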
Intel has also demonstrated an 80-core research processor utilizing an input power of 78.35 W. When all of its cores are not needed, the processor powers down the unused ones, thus saving power. This would serve as the near future of multi-core processing.
The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip. Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (alternatively: bus snooping) operations. Put simply, this means that signals between different CPUs travel shorter distances, and therefore those signals degrade less. These higher-quality signals allow more data to be sent in a given time period, since individual signals can be shorter and do not need to be repeated as often.
The satisfaction and euphoria that accompany the successful completion of any task would be incomplete without mention of the people who made it possible, and whose constant guidance and encouragement crowned all our efforts with success.
We consider it our privilege to express our gratitude and respect to all those who guided, inspired, and helped us in the completion of this project; the credit for this project belongs to those listed below.
We are deeply indebted to our project guide for consenting to guide us and for providing invaluable suggestions during the course of the project work.
We are deeply thankful to Prof. S. Arvind, head of the department, for providing us the necessary facilities to complete the project successfully.
We would like to express our deep sense of gratitude to our principal for his continuous effort in creating a competitive environment in our minds and encouraging us to bring out the best in us.