
ECE462/562

Fall 2012

Pointers

You are encouraged to come up with your own topic. For example, if you have an interest in compilers, then code scheduling for instruction-level parallelism might be a good topic. If you are interested in VLSI design, a project related to pipeline clocking or low-power architecture would be good. If you are interested in databases, quantifying the architectural characteristics of database workloads and comparing them with the characteristics of other workloads (e.g., SPEC) might be good. Some simulators (e.g., SimpleScalar) and benchmark programs (e.g., SPEC2K) will be made available for carrying out simulation studies.

The following is a sampling of projects at other schools. Though the descriptions alone convey little meaningful information, they should give you an idea of what you might want to pursue.



- Select a paper that interests you from a recent ASPLOS or ISCA proceedings. Construct a simulator that will allow you to reproduce their main results, and validate your simulator using their workload or a similar one. Are there any major assumptions the authors didn't mention in the paper? Use your simulator to evaluate their technique under a new workload, or improve their technique and quantify your improvements.



- As CPU cache miss times approach thousands of cycles, it seems likely that, during the time a miss gets serviced, the processor could execute a cache-replacement-optimization program "in the background" without slowing down any unblocked data flows of execution (Yale Patt calls this sort of optimization code "micro-threads"). This project has two parts. First, estimate an upper bound on the performance that could be gained, as follows: simulate a k-way set-associative cache where each cache set uses random, FIFO, LRU, and OPT replacement. Current caches use k = 1 to 8 and one of the simple replacement policies, and the best your system could do would be to approximate a fully associative cache with OPT replacement. The gap between those two cases is a reasonable upper bound on the benefit this scheme could achieve. This experiment will also tell you what level of associativity and replacement policy to aim for in your design. You may want to run this experiment for L1, L2, and L3 caches to see where to focus your efforts. Second, design a cache microarchitecture that would allow for more sophisticated replacement policies. My intuition is that it will be important to make sure your design does not slow down hits or the time it takes to issue the miss request to memory, but it can probably burn a lot of cycles deciding which cache line to replace when the data comes back, or moving data between different cache entries.
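
A minimal sketch of the first experiment, in Python, may help make it concrete. The line size, set count, and synthetic trace are assumptions of this sketch, not prescribed by the project; 'opt' implements Belady's optimal policy by precomputing each reference's next use, which is possible offline on a trace but not in real hardware.

    from collections import defaultdict

    LINE_BYTES = 64   # assumed line size
    NUM_SETS = 64     # assumed set count; use NUM_SETS = 1 and large
                      # 'ways' to approximate a fully associative cache

    def miss_rate(addrs, ways, policy):
        """Simulate one k-way cache over a list of byte addresses.
        policy: 'lru' (evict least recently used) or 'opt'
        (Belady: evict the line whose next use is farthest away)."""
        lines = [a // LINE_BYTES for a in addrs]
        # Precompute, for each reference, the index of the next
        # reference to the same line (used only by OPT).
        next_use = [float('inf')] * len(lines)
        seen = {}
        for i in range(len(lines) - 1, -1, -1):
            next_use[i] = seen.get(lines[i], float('inf'))
            seen[lines[i]] = i

        sets = defaultdict(dict)   # set index -> {line: rank}
        misses = 0
        for i, ln in enumerate(lines):
            cset = sets[ln % NUM_SETS]
            if ln not in cset:
                misses += 1
                if len(cset) >= ways:
                    # LRU ranks are last-use times (evict smallest);
                    # OPT ranks are next-use times (evict largest).
                    victim = (min if policy == 'lru' else max)(cset, key=cset.get)
                    del cset[victim]
            cset[ln] = i if policy == 'lru' else next_use[i]
        return misses / len(lines)

    # Example: the LRU/OPT gap across associativities, synthetic trace.
    trace = [(i % 100) * 64 + (i % 7) * 4096 for i in range(20000)]
    for ways in (1, 2, 4, 8):
        print(ways, miss_rate(trace, ways, 'lru'), miss_rate(trace, ways, 'opt'))

Real address traces (e.g., produced with SimpleScalar) would replace the synthetic one when generating the actual upper-bound numbers.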



- Along with the number of transistors, the complexity of microprocessor architectures continues to grow exponentially, with very complex out-of-order processors. It is still not readily apparent how much performance is really being delivered to applications compared to simpler in-order designs. On a spectrum of benchmarks, quantify (through simulation; SimpleScalar is suggested) the performance difference between an out-of-order processor and a simpler in-order processor, taking into account not only CPI, but also clock rate and power consumption.
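
Not a substitute for the SimpleScalar runs, but as a toy illustration of where the out-of-order win comes from, the sketch below counts issue cycles for a single-issue machine under both disciplines. The three-field instruction format and the 10-cycle load latency are invented for this example.

    def run(prog, out_of_order):
        """prog: list of (dest_reg, src_regs, latency). Single-issue
        machine with perfect renaming; returns the cycle when the last
        result becomes available."""
        avail = {}                       # register -> cycle value is ready
        issued = [False] * len(prog)
        cycle = finish = n_issued = 0
        while n_issued < len(prog):
            for i, (dst, srcs, lat) in enumerate(prog):
                if issued[i]:
                    continue
                if all(avail.get(s, 0) <= cycle for s in srcs):
                    issued[i] = True
                    n_issued += 1
                    avail[dst] = cycle + lat
                    finish = max(finish, cycle + lat)
                    break                # at most one issue per cycle
                if not out_of_order:
                    break                # in-order: stall behind oldest
            cycle += 1
        return finish

    # A load miss, a dependent add, then independent work.
    prog = [('r1', [], 10),       # load, assumed 10-cycle miss latency
            ('r2', ['r1'], 1),    # depends on the load
            ('r3', [], 1),        # independent
            ('r4', ['r3'], 1)]
    print('in-order    :', run(prog, False))   # independent work waits
    print('out-of-order:', run(prog, True))    # independent work overlaps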



- DRAMs are highly optimized for accesses that exhibit locality. Examine a memory-interface architecture that reorders memory accesses to better exploit the column, page, and pipeline modes of modern DRAM implementations.
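
One simple reordering heuristic to start from is row-hit-first scheduling (in the spirit of FR-FCFS). This sketch compares it against strict FIFO service on an interleaved pair of streams; the row size and latencies are assumed values.

    from collections import deque

    ROW_BYTES = 8192           # assumed DRAM row (page) size
    T_HIT, T_MISS = 10, 40     # assumed cycles: open-row hit vs. activate

    def total_cycles(addrs, reorder):
        """Service a request queue; optionally let the oldest request
        that hits the currently open row jump ahead (row-hit-first)."""
        queue, open_row, cycles = deque(addrs), None, 0
        while queue:
            pick = queue[0]                      # FIFO default
            if reorder:
                for r in queue:                  # oldest row hit, if any
                    if r // ROW_BYTES == open_row:
                        pick = r
                        break
            queue.remove(pick)
            row = pick // ROW_BYTES
            cycles += T_HIT if row == open_row else T_MISS
            open_row = row
        return cycles

    # Two streams in different rows, interleaved request by request:
    # FIFO ping-pongs between rows; reordering batches the row hits.
    s0 = [i * 64 for i in range(16)]
    s1 = [a + (1 << 20) for a in s0]
    mixed = [x for pair in zip(s0, s1) for x in pair]
    print('FIFO         :', total_cycles(mixed, reorder=False))
    print('row-hit-first:', total_cycles(mixed, reorder=True))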



- Select an embedded application (such as interactive multimedia) and design and evaluate an architecture that executes it in a mobile environment. Address issues of functionality, performance (or at least providing the illusion of sufficient performance), and power consumption.



- Compare alternatives for embedding processing power in a DRAM chip (i.e., reconfigurable logic vs. a highly custom processor vs. hardwired logic for a given application) on a suite of data-intensive and computationally demanding benchmarks.



- Characterize the benefits and costs of value prediction vs. other predictive techniques, such as instruction reuse. In the best cases, what is the maximum performance benefit?
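
To make the baseline concrete, here is a minimal stride value predictor, one of the simplest value-prediction schemes. The table size, indexing, and PC values are assumptions of this sketch.

    TABLE_ENTRIES = 1024    # assumed predictor table size

    class StridePredictor:
        """Predict a load's next value as last_value + stride."""
        def __init__(self):
            self.table = {}            # table index -> (last_value, stride)

        def predict(self, pc):
            entry = self.table.get(pc % TABLE_ENTRIES)
            if entry is None:
                return None            # no history yet
            last, stride = entry
            return last + stride

        def update(self, pc, actual):
            idx = pc % TABLE_ENTRIES
            entry = self.table.get(idx)
            stride = 0 if entry is None else actual - entry[0]
            self.table[idx] = (actual, stride)

    # An induction variable (e.g., successive array addresses) is
    # predicted perfectly after a two-access warm-up.
    p = StridePredictor()
    hits = 0
    for i in range(100):
        value = 0x1000 + 8 * i         # assumed load at PC 0x400, stride 8
        hits += p.predict(0x400) == value
        p.update(0x400, value)
    print(hits, 'correct out of 100')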



- Compare the performance of a deep cache hierarchy (multiple levels) vs. a flatter organization (only one level) on a family of scientific and data-intensive applications. Devise strategies to get the benefits of both.



- In large, out-of-order cores, loads have to be held back when an earlier store's address is unknown (because it might be the same). Dependence prediction guesses which load/store pairs are going to have dependences, and which aren't. These predictors have also been used to communicate values from stores to loads and to do prefetching. Lots of interesting stuff here!
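
To make the mechanism concrete, here is a sketch of a simple load-wait predictor in the spirit of a wait table (as used in the Alpha 21264), though every detail here, including the table size and method names, is an assumption of the sketch.

    WAIT_TABLE_BITS = 10    # assumed: 1024-entry table indexed by load PC

    class WaitTable:
        def __init__(self):
            self.bits = [False] * (1 << WAIT_TABLE_BITS)

        def _idx(self, load_pc):
            return load_pc & ((1 << WAIT_TABLE_BITS) - 1)

        def should_wait(self, load_pc):
            """True -> hold the load until older store addresses resolve."""
            return self.bits[self._idx(load_pc)]

        def train_on_violation(self, load_pc):
            """Called when a load speculated past a store it depended on."""
            self.bits[self._idx(load_pc)] = True

        def periodic_clear(self):
            """Real designs clear the table occasionally so stale
            entries don't serialize loads forever."""
            self.bits = [False] * len(self.bits)

    # Usage: the first execution speculates, is squashed, and trains
    # the table; later executions of the same load wait instead.
    wt = WaitTable()
    print(wt.should_wait(0x40c0))     # False: speculate
    wt.train_on_violation(0x40c0)     # memory-order violation detected
    print(wt.should_wait(0x40c0))     # True: wait next time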



- Because of wire delays and register file bandwidth, processor designers have started looking at (and building; cf. the Alpha 21264) clustered designs, in which groups of functional units are associated with separate register files within a core. How to schedule work on these clusters, and their implications for future architectures, is a hot topic.



- Simultaneous multithreading (SMT) processors run multiple tasks in an out-of-order core at the same time, sharing the dynamic resources (physical registers, issue slots, cache pipes). Experiments with how resource usage conflicts in the different shared resources, with different combinations of workloads, would be interesting (there is a lot of work going on in this area, so a literature search would be crucial).



- When multiple threads are running in an SMT core, how many extra cache misses are caused by the intersections of the threads' working sets? Quantify this for different workload combinations.
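
A starting-point measurement is sketched below: run each thread's trace alone, then interleaved, and attribute the extra misses to working-set interference. The cache parameters, helper names, and synthetic traces are all assumptions of the sketch.

    from collections import defaultdict

    def lru_misses(addrs, sets=64, ways=8, line=64):
        """Count misses in a small LRU set-associative cache
        (assumed: 64 sets x 8 ways x 64 B = 32 KB)."""
        cache, miss = defaultdict(dict), 0
        for t, a in enumerate(addrs):
            ln = a // line
            s = cache[ln % sets]
            if ln not in s:
                miss += 1
                if len(s) >= ways:
                    del s[min(s, key=s.get)]    # evict least recently used
            s[ln] = t
        return miss

    def interleave(a, b):
        return [x for pair in zip(a, b) for x in pair]

    t0 = [i * 64 for i in range(4096)] * 4                   # streaming
    t1 = [(i % 512) * 64 + (1 << 22) for i in range(16384)]  # fits alone

    alone = lru_misses(t0) + lru_misses(t1)
    together = lru_misses(interleave(t0, t1))
    print('extra misses from sharing:', together - alone)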



- When a number of instructions waiting to execute in an out-of-order core are ready to go, but there are too many for (a) the issue width or (b) the particular functional-unit types available to issue in a single cycle, the hardware must choose among them. Oldest-first is the usual strategy. Other selection algorithms may be better. It would be interesting to try a few.
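
A sketch of the choice being made, with two illustrative alternatives to oldest-first; the 'dependents' count used as a criticality proxy is an assumption of this example, not an established policy.

    import random

    def select(ready, width, policy):
        """Pick up to 'width' instructions from the ready pool.
        Each entry: {'id': seq number (smaller = older), 'dependents': n}."""
        if policy == 'oldest':
            chosen = sorted(ready, key=lambda u: u['id'])
        elif policy == 'most_dependents':   # crude criticality proxy
            chosen = sorted(ready, key=lambda u: -u['dependents'])
        else:                               # 'random' baseline
            chosen = random.sample(ready, len(ready))
        return chosen[:width]

    random.seed(0)
    ready = [{'id': i, 'dependents': random.randrange(8)} for i in range(10)]
    for pol in ('oldest', 'most_dependents', 'random'):
        print(pol, [u['id'] for u in select(ready, 4, pol)])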



- One of the key problems in architecture is that it is often more difficult to improve latency than bandwidth. Prefetching is one technique that can hide latency. Here are some possible prefetching topics (a small prefetcher sketch follows these sub-topics):

  o Quantify the limits of history-based prefetching. Prediction by partial matching (PPM, originally developed for text compression) has been shown to provide optimal prediction of future values based on past values. Using PPM, what are the limits of memory or disk prefetching? What input information (e.g., last k instruction addresses, last j data addresses, distances between the last k addresses, last value loaded, ...) best predicts future fetches? What is the best trade-off between the state used to store history information and prefetch performance?

  o Add a small amount of hardware to DRAM memory chips to exploit DRAM-internal bandwidth to avoid DRAM latencies. Evaluate the performance benefits that can be gained and the costs of modifying the hardware.
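
As a concrete starting point for the history-based sub-topic above, here is an order-1 context model, i.e., PPM cut down to a single context length; real PPM blends several context lengths with escape probabilities. The class name and miss pattern are invented for illustration.

    from collections import defaultdict, Counter

    class MarkovPrefetcher:
        """Order-1 context model: remember which miss address tends to
        follow each miss address; prefetch the most common successor."""
        def __init__(self):
            self.succ = defaultdict(Counter)   # addr -> Counter of next addrs
            self.prev = None

        def observe(self, addr):
            """Feed one miss address; return a prefetch candidate."""
            if self.prev is not None:
                self.succ[self.prev][addr] += 1
            self.prev = addr
            nxt = self.succ[addr]
            return nxt.most_common(1)[0][0] if nxt else None

    # A repeating miss pattern is predicted correctly once each
    # transition has been seen (the first pass trains the table).
    pf = MarkovPrefetcher()
    pattern = [0x100, 0x200, 0x180] * 6
    good = sum(pf.observe(a) == pattern[i + 1]
               for i, a in enumerate(pattern[:-1]))
    print(good, 'of', len(pattern) - 1, 'next misses predicted')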



- Over the past 2 decades, memory sizes have increased by a factor of 1000, and page sizes by only a factor of 2 to 4. Should page sizes be dramatically larger, or are a few large "superpages" sufficient to offset this trend in most cases?



- Extend Transparent Informed Prefetching (Patterson et al., SOSP '95), which was designed for page-level prefetching/caching, to balance cache-line hardware prefetching vs. hardware caching.



- Cooperative caching uses fast networks to access remote memory in lieu of disk accesses. One drawback is that a user's data may be stored on multiple machines, potentially opening security holes (eavesdropping, modification). Encryption and digital signatures may solve the problem, but could slow down the system. Evaluate the performance impact of adding encryption and digital signatures to cooperatively cached data, and project this performance into the future as processor speeds improve and as companies like Intel propose adding encryption functions to their processors.



- As memory latencies increase, cache miss times could run to thousands of instruction-issue opportunities. This is nearly the same ratio of memory access times as was seen for early VM paging systems. As miss times become so extreme, is it time to give control of cache replacement to the software? Will larger degrees of associativity be appropriate for caches?



- Achieve fault tolerance by running 2 copies of instructions in unused cycles in a superscalar machine (e.g., a 4-way machine may commit fewer than 4 instructions due to dependences) and do instruction replication only in those cycles.



- Compare Qureshi and Patt's insertion policies in ISCA 2007 to victim caches.
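
For orientation, the heart of the insertion-policy idea can be sketched in a few lines. This shows only the LIP component (insert incoming lines at the LRU position instead of MRU); the paper's bimodal insertion and set-dueling mechanisms are omitted, and the 8-way set is an assumed parameter.

    def access(stack, line, policy, ways=8):
        """One access to a single set; stack is ordered MRU -> LRU.
        Returns True on a miss."""
        if line in stack:
            stack.remove(line)
            stack.insert(0, line)       # promote to MRU on a hit
            return False
        if len(stack) >= ways:
            stack.pop()                 # evict the LRU line
        if policy == 'lru':
            stack.insert(0, line)       # traditional: insert at MRU
        else:                           # 'lip': insert at LRU position
            stack.append(line)
        return True

    # A cyclic working set one line larger than the set thrashes
    # under LRU but stays mostly resident under LIP.
    for policy in ('lru', 'lip'):
        stack, misses = [], 0
        for _ in range(100):
            for line in range(9):       # 9 lines into an 8-way set
                misses += access(stack, line, policy)
        print(policy, misses)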



- Re-evaluate the schemes in the high-bandwidth cache paper discussed in the lectures (e.g., the line buffer).



- Use old register values to predict the addresses of subsequent memory accesses. This allows the pipeline to do the cache access early in the pipeline, avoiding load-use stalls.



- To help circuit designers reduce the di/dt problem, out-of-order processors can monitor the commit rate and "even out" the rate without losing performance.



- By looking for phases in applications where fewer physical registers may suffice, we can cut down the amount of energy consumed by the register file.



- Attempt to quantify how much of processor performance gain in the past decade has come from faster clocks and how much from ILP.



- Methods to improve the fetch bandwidth of trace caches

- Cache enhancements, including victim caches, stream buffers, and hash addressing

- Implement and compare victim caches and skewed-associative caches.

- Implement and compare two recent prefetching schemes.

- Architectural support for operating systems (e.g., user-level traps for lightweight threads)

- Prefetching methods (hardware and/or software) and their impact on performance



- Architectural characteristics of database workloads

- Cache behavior of networking (or other) applications or algorithms, with modifications to exploit caches and memory hierarchies

- An implementation study of register renaming logic

- In-order vs. out-of-order superscalar processors

- A study of dynamic branch prediction schemes for superscalar processors

- Performance study of two-level on-chip caches in a fixed area

- An analysis of hardware prefetching techniques

- Performance evaluation of caches using PatchWrx instruction traces



- Skewed D-way K-column set-associative caches

- The history and use of pipelining in computer architecture

- The effect of context switching on history-based branch predictors

- Bounding worst-case performance for real-time applications

- Branch prediction methods and performance

- Performance of TLB implementations

- Trace-driven simulation of cache enhancements

- Timing analysis and caching for real-time systems



- A Survey of VLIW Processors

- Evaluating Caches with Multiple Caching Strategies

- Survey/Comparison of VLIW and Superscalar Processors

- Comparison Study of Multimedia-Extended Microprocessors

- Synchronous DRAM

- Cache Performance Study of the SPEC95 Benchmark Suite

- An Investigation of Instruction Fetch Behavior in Microprocessors

- The picoJava-I Microprocessor and Implementation Decisions

- Register Renaming in Java Microprocessors

- Optimizing Instruction Cache Performance with Profile-Directed Code Layout

- Simulation of a Victim Cache for SPEC95 Integer Benchmarks



- Code scheduling for ILP

- Instruction/data encoding

- Cache-based enhancements (e.g., trace cache, filter cache, loop cache, victim cache, stream buffers, etc.)

- Pipeline clocking

- Low-power architectures

- Quantifying architectural characteristics of database workloads and comparing them to other workloads






- Power/Energy/Performance in a Branch Predictor of a Superscalar Processor

- Workload Characterization of Network Processor Benchmarks

- Dynamic Phase Behavior of Programs

- Cache Miss Pattern and Miss Predictability Analysis

- Low Power Cache Design

- Analysis of Architectural Support for Virtual Machines

- A Framework for Power-Efficient Instruction Encoding in Deep-Submicron Application-Specific Processors

- Cache Optimization for Signal Processing Applications

- Design and Evaluation of Advanced Value Prediction Methods in Multi-Issue Superscalar Pipelined Architectures

- A New ISA to Efficiently Support the Object-Oriented Programming (OOP) Paradigm

- Benchmarking HPC Workloads

- Your good idea here...


Some reference papers to get started:



- A Case for MLP-Aware Cache Replacement, M. K. Qureshi et al., ISCA 2006.

- Increasing the Size of Atomic Instruction Blocks Using Control Flow Assertions, S. Patel, MICRO 2000.

- Selective Value Prediction, B. Calder et al., ISCA 1999.

- Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth, Bracy et al., MICRO 2004.

- Efficient Dynamic Scheduling Through Tag Elimination, Dan Ernst and Todd Austin, ISCA 2002.

- Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power, S. Kaxiras et al., ISCA 2001.

- Scalable Store-Load Forwarding via Store Queue Index Prediction, S. Stone et al., MICRO 2005.

- NUCA: A Non-Uniform Cache Access Architecture for Wire-Delay Dominated On-Chip Caches, Changkyu Kim, Doug Burger, and Stephen W. Keckler, ASPLOS 2002.

- Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching, Eric Rotenberg, Steve Bennett, and James E. Smith, MICRO 1996.