Parallel programming with TRANSACTIONAL MEMORY Brilley ...

Dec 1, 2013


Seminar Report
Parallel programming with Transactional Memory
Submitted by
Brilley Batley C
In partial fulfillment of the requirements for the degree of
Master of Technology (M.Tech) in Software Engineering
This is to certify that the seminar report entitled
“Parallel programming with Transactional memory”, submitted by
Brilley Batley C in partial fulfillment of the requirements for the
award of M.Tech in Software Engineering, is a bonafide record of the
seminar presented by her during the academic year 2009.
Dr. Sumam Mary Idicula          Prof. Dr. K. Paulose Jacob
Reader                          Director
Dept. of Computer Science       Dept. of Computer Science
I wish to express my deep sense of gratitude to Prof. Dr. K Poulose Jacob,
Director, Dept. of Computer Science, for giving me the opportunity to
present this seminar and for providing the infrastructure and
encouragement for the work. I would like to record my sincere thanks and
indebtedness to Dr. Sumam Mary Idicula, Reader, Dept. of Computer
Science, for all the guidance and support extended to me. I would also
like to thank Mr. G. Santhosh Kumar, all the teaching and non-teaching
staff of the department, and my friends for their warm kindness and help.
Above all, I would like to thank my parents, without whose blessings and
support I would not have been able to accomplish my goal. Finally, I thank
the Almighty for guidance and blessings.
Parallel programming with TRANSACTIONAL MEMORY
A primary challenge of parallel programming is to find better
abstractions for expressing parallel computation and for writing parallel
programs. Parallel programming encompasses all of the difficulties of
sequential programming, but also introduces the hard problem of
coordinating interactions among concurrently executing tasks. Today, most
parallel programs employ low-level programming constructs that are just a
thin veneer over the underlying hardware. These constructs consist of
threads, which are abstract processors, and explicit synchronization (for
example, locks, semaphores, and monitors) to coordinate thread execution.
Parallel programs written with these constructs are difficult to design,
program, debug, and maintain.
Transactional Memory was created to simplify parallel programming
and relieve software developers from the difficulties associated with
lock-based parallel programming. With TM, programmers simply mark code
segments as transactions that should execute atomically and in isolation with
respect to other code, and the TM system manages the concurrency control
for them. All TM systems use either hardware-based or software-based
approaches to implement the two basic TM mechanisms: data versioning
and conflict detection.

Brilley Batley C
M Tech(SE)

1. Transactional memory replaces waiting for locks with concurrency. It allows
non-conflicting updates to shared data to proceed in parallel, which improves the
scalability of short critical regions.
2. The promise of transactional memory: programs are written with coarse-grained
transactions but perform like programs that use fine-grained locks.
3. Transactional memory lets the programmer focus on correctness first and tune for
performance afterwards.
Parallel programming poses many difficulties, but one of the most serious
challenges in writing correct code is coordinating access to data shared by several
threads. Data races, deadlocks, and poor scalability are consequences of trying to ensure
mutual exclusion with too little or too much synchronization. TM offers a simpler
alternative to mutual exclusion by shifting the burden of correct synchronization from a
programmer to the TM system. In theory, a program’s author only needs to identify a
sequence of operations on shared data that should appear to execute atomically to other,
concurrent threads. Through the many mechanisms discussed here, the TM system then
ensures this outcome.
As computers evolve, programming changes as well. The past few years mark the
start of a historic transition from sequential to parallel computation in the processors used
in most personal, server, and mobile computers. This shift marks the end of a remarkable
30-year period in which advances in semiconductor technology and computer
architecture improved the performance of sequential processors at an annual rate of
40%–50%. This steady performance increase benefited all software, and this progress was a
key factor driving the spread of software throughout modern life.
This remarkable era stopped when practical limits on the power dissipation of a
chip ended the continual increases in clock speed and limited instruction-level parallelism
diminished the benefit of increasingly complex processor architectures. The era did not
stop because Moore’s Law ended. Semiconductor technology is still capable of doubling
the transistors on a chip every two years. However, this flood of transistors now increases
the number of independent processors on a chip, rather than making an individual
processor run faster. The resulting computer architecture, named Multicore, consists of
several independent processors (cores) on a chip that communicate through shared
memory. Today, two-core chips are common and four-core chips are coming to market,
and there is every reason to believe that the number of cores will continue to double for a
number of generations. On one hand, the good news is that the peak performance of a
Multicore computer doubles each time the number of cores doubles. On the other hand,
achieving this performance requires that a program execute in parallel and scale as the
number of processors increases.
Few programs today are written to exploit parallelism effectively. In part, this is because
most programmers did not have access to parallel computers, which were limited to domains
with large, naturally parallel workloads, such as servers, or huge computations, such as
high-performance computing. Because mainstream programming was sequential
programming, most existing programming languages, libraries, design patterns, and
training do not address the challenges of parallel programming. Obviously, this
situation must change before programmers in general will start writing parallel programs
for Multicore processors.
A primary challenge is to find better abstractions for expressing parallel
computation and for writing parallel programs. Parallel programming encompasses all of
the difficulties of sequential programming, but also introduces the hard problem of
coordinating interactions among concurrently executing tasks. Today, most parallel
programs employ low-level programming constructs that are just a thin veneer over the
underlying hardware. These constructs consist of threads, which are abstract
processors, and explicit synchronization (for example, locks, semaphores, and monitors) to
coordinate thread execution. Considerable experience has shown that parallel programs
written with these constructs are difficult to design, program, debug, and maintain and,
to add insult to injury, often do not perform well.
Transactional memory (TM), proposed by Lomet and first practically
implemented by Herlihy and Moss, is a new programming construct that offers a
higher-level abstraction for writing parallel programs. In the past few years, it has
engendered considerable interest, as transactions have long been used in databases to
isolate concurrent activities. TM offers a mechanism that allows portions of a program to
execute in isolation, without regard to other, concurrently executing tasks. A programmer
can reason about the correctness of code within a transaction and need not worry about
complex interactions with other, concurrently executing parts of the program. TM offers
a promising, but as yet unproven, mechanism to improve parallel programming.
What is Transactional Memory?
A transaction is a form of program execution borrowed from the database
community. Concurrent queries conflict when they read and write an item in a database,
and a conflict can produce an erroneous result that could not arise from a sequential
execution of the queries. Transactions ensure that all queries produce the same result as if
they executed serially (a property known as “serializability”). Decomposing the
semantics of a transaction yields four requirements, usually called the “ACID”
properties: atomicity, consistency, isolation, and durability.
TM provides lightweight transactions for threads running in a shared address
space. TM ensures the atomicity and isolation of concurrently executing tasks. (In
general, TM does not provide consistency or durability.) Atomicity ensures program
state changes effected by code executing in a transaction are indivisible from the
perspective of other, concurrently executing transactions. In other words, although the
code in a transaction may modify individual variables through a series of assignments,
another computation can only observe the program state immediately before or
immediately after the transaction executes. Isolation ensures that concurrently executing
tasks cannot affect the result of a transaction, so a transaction produces the same answer
as when no other task was executing. Transactions provide a basis to construct parallel
abstractions, which are building blocks that can be combined without knowledge of their
internal details, much as procedures and objects provide composable abstractions for
sequential code.
When implementing a TM system, there are many choices in how transactions
interact with other programming abstractions and in what their semantics should be.
This section looks at several of these choices, beginning with implicit versus explicit
barriers and weak versus strong isolation.
4.1 Implicit vs. Explicit Barriers
Memory addresses can be added to read and write sets either implicitly or
explicitly. To do so implicitly requires either compiler or hardware support to find all the
memory accesses within all the atomic blocks. Without this support, programmers have
to explicitly add barriers to their code to annotate memory accesses to shared variables.
Implicit barriers have the advantage of allowing programmers to compose existing non-
transactional code to create transactions. However, with explicit barriers, an expert
programmer may be able to perform optimizations that lead to smaller read and write sets
than those achievable with implicit barriers. For example, barriers for local and private
variables can be optimized away.
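As an illustration of explicit barriers, the following C sketch hand-instruments a small transaction body. The names tx_t, tx_read, tx_write, and add_sample are hypothetical and do not come from any particular TM library. The barriers here only record addresses in fixed-size read and write sets and then perform the access directly; conflict detection and versioning are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TX_SET_MAX 32   /* assumed capacity, enough for this sketch */

/* Hypothetical transaction descriptor: just the read and write sets. */
typedef struct {
    const void *read_set[TX_SET_MAX];
    size_t      n_reads;
    void       *write_set[TX_SET_MAX];
    size_t      n_writes;
} tx_t;

/* Explicit read barrier: record the address, then load from it. */
long tx_read(tx_t *tx, const long *addr) {
    assert(tx->n_reads < TX_SET_MAX);
    tx->read_set[tx->n_reads++] = addr;
    return *addr;
}

/* Explicit write barrier: record the address, then store to it. */
void tx_write(tx_t *tx, long *addr, long value) {
    assert(tx->n_writes < TX_SET_MAX);
    tx->write_set[tx->n_writes++] = addr;
    *addr = value;
}

long shared_total;   /* shared: every access goes through a barrier */
long shared_count;   /* shared: every access goes through a barrier */

/* Body of a transaction with hand-placed barriers. */
void add_sample(tx_t *tx, long sample) {
    long scaled = sample * 2;   /* private local: no barrier needed */
    tx_write(tx, &shared_total, tx_read(tx, &shared_total) + scaled);
    tx_write(tx, &shared_count, tx_read(tx, &shared_count) + 1);
}
```

One call to add_sample leaves two entries in each set. The local variable scaled appears in neither, which is the kind of saving a programmer can obtain by placing barriers by hand.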
4.2 Weak vs. Strong Isolation
The distinction between weak isolation and strong isolation is that the former
guarantees transactional semantics only among atomic blocks. In comparison, strong
isolation also guarantees transactional semantics between transactional and
non-transactional code (in addition to among just transactional code blocks). Essentially,
with strong isolation all non-transactional instructions effectively execute as
one-instruction atomic blocks.
From the programmer’s point of view, strong isolation makes it easier to reason
about correctness, especially when shared data becomes private (privatization) or when
private data is transferred into a shared domain (publication). However, in spite of the
advantages of strong isolation, some TM implementations may choose to implement
weak isolation as it can lead to a design with higher performance.
4.3 Nested Transactions
Nested transactions occur when the dynamic extent of a transaction is fully
enclosed by that of another transaction. For example, nested transactions can arise when
an application programmer writes a transaction that contains a library call that executes
transactions itself. There are three kinds of nested transactions: flattened, closed, and
open. With flattened nested transactions, any nested transaction boundary annotations are
simply ignored and aborting the child transaction causes the parent transaction to also
abort. On the other hand, with closed nested transactions, aborting the child transaction
only causes re-execution of the child transaction. With both flattened and closed nested
transactions, modifications made by the child transaction are only externally visible after
the parent transaction commits. In contrast, open nested transactions behave like closed
nested transactions, except that the changes of the child transaction are externally visible
as soon as it commits. Because of this, to properly roll back an open nested transaction
that has already committed, compensating actions must be executed when the enclosing
transaction aborts.
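Flattened nesting is the simplest of the three to implement. A minimal sketch, assuming a hypothetical per-thread descriptor tx_t that carries a nesting depth counter: inner begin/end pairs only adjust the counter, and an abort at any depth dooms the outermost transaction. The functions tx_begin, tx_abort, and tx_end are illustrative names, and the actual versioning and rollback are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-thread transaction descriptor. */
typedef struct {
    int  depth;     /* current nesting depth, 0 = not in a transaction */
    bool aborted;   /* set by an abort at any nesting level */
} tx_t;

/* Entering a nested block only bumps the depth. */
void tx_begin(tx_t *tx) {
    tx->depth++;
}

/* An abort in a child marks the whole flattened transaction as aborted. */
void tx_abort(tx_t *tx) {
    tx->aborted = true;
}

/* Leaving a block. Returns true only when the outermost block ends and
 * no abort happened anywhere inside it; inner ends never publish anything. */
bool tx_end(tx_t *tx) {
    assert(tx->depth > 0);
    tx->depth--;
    if (tx->depth > 0)
        return false;              /* still inside the parent */
    bool committed = !tx->aborted;
    tx->aborted = false;           /* ready for a retry or the next transaction */
    return committed;
}
```

A closed-nesting implementation would instead need a checkpoint per level so that only the child is re-executed, which is why it costs more bookkeeping.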
4.4 Lazy vs. Eager Data Versioning
When a transaction performs a write access to memory, the update to the memory
address can be performed either lazily or eagerly. In lazy data versioning, updates to
memory are deferred by buffering them on the side. When a transaction reaches its
commit point, the lazy TM system then updates memory by applying all the deferred
updates in its data versioning buffer. On the other hand, eager data versioning applies
memory updates directly to memory after recording the old value in an undo log. If a
transaction aborts in an eager TM system, the undo log is used to revert all the updates
made by the transaction.
The type of data versioning scheme used gives a TM system different advantages
and disadvantages. With lazy data versioning, transaction aborts are fast because main
memory does not contain any speculative updates from the transaction. In contrast, eager
data versioning schemes pay a performance penalty on transaction abort as they must
process their undo logs. However, lazy schemes have slower transaction commits as they
defer all transactional updates till the commit point. Finally, for software-based TMs,
lazy versioning leads to slower read accesses than eager versioning. This is because the
software data versioning buffer may contain values newer than those in main memory,
necessitating a search of the buffer during each speculative read access.
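The trade-offs described above show up directly in code. The following single-threaded C sketch contrasts the two schemes, using one log type both as a write buffer of new values (lazy) and as an undo log of old values (eager). All names (tx_log_t, lazy_write, eager_abort, and so on) are hypothetical, the log capacity is an arbitrary assumption, and conflict detection is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_MAX 16   /* assumed capacity for the sketch */

typedef struct { long *addr; long value; } log_entry_t;
typedef struct { log_entry_t entries[LOG_MAX]; size_t n; } tx_log_t;

/* ---- Lazy versioning: the log buffers NEW values on the side ---- */

void lazy_write(tx_log_t *buf, long *addr, long value) {
    for (size_t i = 0; i < buf->n; i++)
        if (buf->entries[i].addr == addr) { buf->entries[i].value = value; return; }
    assert(buf->n < LOG_MAX);
    buf->entries[buf->n++] = (log_entry_t){ addr, value };
}

/* Every read must search the buffer first: this is the slow read path. */
long lazy_read(const tx_log_t *buf, const long *addr) {
    for (size_t i = 0; i < buf->n; i++)
        if (buf->entries[i].addr == addr) return buf->entries[i].value;
    return *addr;
}

void lazy_commit(tx_log_t *buf) {          /* slow: apply deferred updates */
    for (size_t i = 0; i < buf->n; i++)
        *buf->entries[i].addr = buf->entries[i].value;
    buf->n = 0;
}

void lazy_abort(tx_log_t *buf) {           /* fast: memory was never touched */
    buf->n = 0;
}

/* ---- Eager versioning: the log records OLD values for undo ---- */

void eager_write(tx_log_t *undo, long *addr, long value) {
    assert(undo->n < LOG_MAX);
    undo->entries[undo->n++] = (log_entry_t){ addr, *addr };
    *addr = value;                         /* update memory in place */
}

long eager_read(const long *addr) {        /* fast: memory holds the latest value */
    return *addr;
}

void eager_commit(tx_log_t *undo) {        /* fast: just drop the undo log */
    undo->n = 0;
}

void eager_abort(tx_log_t *undo) {         /* slow: replay the log in reverse */
    while (undo->n > 0) {
        undo->n--;
        *undo->entries[undo->n].addr = undo->entries[undo->n].value;
    }
}
```

Note that eager_abort walks the log backwards so that the oldest recorded value of a location is the one that finally survives.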
4.5 Optimistic vs. Pessimistic Conflict Detection
TM systems can take either an optimistic or pessimistic approach to performing
conflict detection. With optimistic conflict detection, a TM system optimistically assumes
that a transaction will commit successfully and only performs conflict detection late at
the end of the transaction. On the other hand, pessimistic conflict detection checks for
conflicts on each memory access during transaction execution. Because pessimistic
conflict detection is performed early, less work is potentially wasted if a transaction
aborts; however, optimistic conflict detection allows more transaction interleavings to
successfully commit and guarantees forward progress. Finally, lazy data versioning is
usually combined with optimistic conflict detection and eager data versioning with
pessimistic conflict detection, as these are the synergistic combinations.
All TM systems must implement data versioning and conflict detection; however,
these tasks can be implemented in either hardware (HTM) or software (STM).
The initial paper on STM by Shavit and Touitou showed it was possible to
implement lock-free, atomic, multi-location operations entirely in software, but it
required a program to declare in advance the memory locations to be accessed by a
transaction.
To convince the reader that TM is worth all the trouble, let’s look at a little
example. This is not meant to reflect realistic code but instead illustrates problems that
can happen in real code:
#include <time.h>

long counter1;
long counter2;
time_t timestamp1;
time_t timestamp2;

void f1_1(long *r, time_t *t) {
  *t = timestamp1;
  *r = counter1++;
}
void f2_2(long *r, time_t *t) {
  *t = timestamp2;
  *r = counter2++;
}
void w1_2(long *r, time_t *t) {
  *r = counter1++;
  if (*r & 1)
    *t = timestamp2;
}
void w2_1(long *r, time_t *t) {
  *r = counter2++;
  if (*r & 1)
    *t = timestamp1;
}
Assume this code has to be made thread-safe. This means that multiple threads
can concurrently execute any of the functions and that doing so must not produce any
invalid result. An invalid result is defined here as returning counter and timestamp
values that don’t belong together.
It is certainly possible to define one single mutex lock and require that this mutex
be taken in each of the four functions. Verifying that this would generate the expected
results is easy, but the performance is potentially far from optimal.
Assume that most of the time only the functions f1_1 and f2_2 are used. In this
case there would never be any conflict between callers of these functions: callers of f1_1
and f2_2 could peacefully coexist. This means that using one single lock slows down the
code unnecessarily.
The obvious refinement is to use two locks, but how should they be defined? The
semantics would have to be “when counter1 and timestamp1 are used” in the one case
and “when counter2 and timestamp2 are used” in the other. This might work
for f1_1 and f2_2, but it won’t work for the other two functions. Here the pairs
counter1/timestamp2 and counter2/timestamp1 are used together. So we have to go yet
another level down and assign a separate lock to each of the variables.
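Assuming we would do this, we could easily be tempted to write something like
this (only two functions are shown here; the other two are mirror images, and lock()
and unlock() stand for acquiring and releasing the named mutex):
void f1_1(long *r, time_t *t) {
  lock(l_timestamp1); lock(l_counter1);
  *t = timestamp1;
  *r = counter1++;
  unlock(l_counter1); unlock(l_timestamp1);
}
void w1_2(long *r, time_t *t) {
  lock(l_counter1);
  *r = counter1++;
  if (*r & 1) {
    lock(l_timestamp2); *t = timestamp2; unlock(l_timestamp2);
  }
  unlock(l_counter1);
}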
The code for w1_2 in this example is wrong. We cannot delay getting the
l_timestamp2 lock because it might produce inconsistent results. Even though it might be
slower, we always have to get the lock:
void w1_2(long *r, time_t *t) {
  lock(l_counter1); lock(l_timestamp2);
  *r = counter1++;
  if (*r & 1) {
    *t = timestamp2;
  }
  unlock(l_timestamp2); unlock(l_counter1);
}
It’s a simple change, but the result is also wrong. Now we try to lock the required
locks in w1_2 in a different order from f1_1. This potentially will lead to deadlocks. In
this simple example it is easy to see that this is the case, but with just slightly more
complicated code it is a very common occurrence.
What this example shows is: (1) it is easy to get into a situation where many
separate mutex locks are needed to allow for enough parallelism; and (2) using all the
mutex locks correctly is quite complicated by itself.
The previous example could be rewritten using TM. In the following example we
are using nonstandard extensions to C that in one form or another might appear in a TM-
enabled compiler. The extensions are easy to explain.
void f1_1(long *r, time_t *t) {
  tm_atomic {
    *t = timestamp1;
    *r = counter1++;
  }
}
void f2_2(long *r, time_t *t) {
  tm_atomic {
    *t = timestamp2;
    *r = counter2++;
  }
}
void w1_2(long *r, time_t *t) {
  tm_atomic {
    *r = counter1++;
    if (*r & 1)
      *t = timestamp2;
  }
}
void w2_1(long *r, time_t *t) {
  tm_atomic {
    *r = counter2++;
    if (*r & 1)
      *t = timestamp1;
  }
}
All we have done in this case is enclose the operations within a block called
tm_atomic. The tm_atomic keyword indicates that all the instructions in the following
block are part of a transaction. Calling functions is a challenge since the called functions
also have to be transaction-aware. Therefore, it is potentially necessary to provide two
versions of the compiled function: one with and one without support for transactions. In
case any of the transitively called functions uses a tm_atomic block by itself, nesting has
to be handled. For each of the memory accesses inside the block, the compiler could
generate code that does the following:
1. Check whether the same memory location is part of another transaction.
2. If yes, abort the current transaction.
3. If no, record that the current transaction referenced the memory location so that step 2
in other transactions can find it.
4. Depending on whether it is a read or write access, either (a) load the value of the
memory location if the variable has not yet been modified or load it from the local
storage in case it was already modified, or (b) write it into a local storage for the variable.
Step 3 can fall away if the transaction previously accessed the same memory
location. For step 2 there are alternatives. Instead of aborting immediately, the transaction
can be performed to the end and then the changes undone. This is called the lazy
abort/lazy commit method, as opposed to the eager/eager method found in typical
database transactions.
What is needed now is a definition of the work that is done when the end of the
tm_atomic block is reached (i.e., the transaction is committed). This work can be
described as follows:
1. If the current transaction has been aborted, reset all internal state, delay for some short
period, then retry, executing the whole block.
2. Store all the values of the memory locations modified in the transaction for which the
new values are placed in local storage.
3. Reset the information about the memory locations being part of a transaction.
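The per-access steps and the commit steps above can be sketched in C as follows. This is a single-threaded toy under loud assumptions: the ownership table owners, the descriptor tx_t, and the functions tx_claim, tx_load, tx_store, and tx_end are hypothetical names, the table holds only a handful of entries, and nothing here is safe against truly concurrent threads. It is meant only to show the bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define N_SLOTS 8   /* assumed: at most 8 tracked locations */

/* Global table: which transaction currently references an address. */
typedef struct { long *addr; int owner; } owner_t;   /* owner 0 = free */
owner_t owners[N_SLOTS];

typedef struct {
    int    id;                /* nonzero transaction id */
    bool   aborted;
    long  *waddr[N_SLOTS];    /* local storage for written locations */
    long   wval[N_SLOTS];
    size_t n_writes;
} tx_t;

/* Access steps 1-3. Returns false if another transaction owns addr. */
bool tx_claim(tx_t *tx, long *addr) {
    size_t free_slot = N_SLOTS;
    for (size_t i = 0; i < N_SLOTS; i++) {
        if (owners[i].owner != 0 && owners[i].addr == addr) {
            if (owners[i].owner != tx->id) {   /* steps 1 and 2 */
                tx->aborted = true;
                return false;
            }
            return true;                       /* seen before: step 3 falls away */
        }
        if (owners[i].owner == 0 && free_slot == N_SLOTS)
            free_slot = i;
    }
    assert(free_slot < N_SLOTS);
    owners[free_slot] = (owner_t){ addr, tx->id };   /* step 3 */
    return true;
}

/* Access step 4(a): read from local storage if modified, else from memory. */
long tx_load(tx_t *tx, long *addr) {
    if (!tx_claim(tx, addr)) return 0;   /* value is meaningless after abort */
    for (size_t i = 0; i < tx->n_writes; i++)
        if (tx->waddr[i] == addr) return tx->wval[i];
    return *addr;
}

/* Access step 4(b): write into local storage only. */
void tx_store(tx_t *tx, long *addr, long value) {
    if (!tx_claim(tx, addr)) return;
    for (size_t i = 0; i < tx->n_writes; i++)
        if (tx->waddr[i] == addr) { tx->wval[i] = value; return; }
    assert(tx->n_writes < N_SLOTS);
    tx->waddr[tx->n_writes] = addr;
    tx->wval[tx->n_writes++] = value;
}

/* End of the atomic block. Returns true on commit, false if the caller
 * has to retry the whole block (commit step 1). */
bool tx_end(tx_t *tx) {
    bool ok = !tx->aborted;
    if (ok)                                         /* commit step 2 */
        for (size_t i = 0; i < tx->n_writes; i++)
            *tx->waddr[i] = tx->wval[i];
    for (size_t i = 0; i < N_SLOTS; i++)            /* commit step 3 */
        if (owners[i].owner == tx->id) owners[i].owner = 0;
    tx->n_writes = 0;                               /* reset internal state */
    tx->aborted = false;
    return ok;
}
```

A caller would wrap the block in a loop and re-run it, after a short delay, for as long as tx_end returns false.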
The description is simple enough; the real problem is implementing everything
efficiently. Before we discuss this, let’s take a brief look at whether all this is correct and
fulfills all the requirements.
Herlihy et al.’s Dynamic STM (DSTM) was the first STM system that did not
require a program to declare the memory locations accessed by a transaction. DSTM is an
object-granularity, deferred-update STM system, which means that a transaction modifies
a private copy of an object and only makes its changes visible to other transactions when
it commits. The transaction exclusively accesses the copy without synchronization.
However, another transaction can access the original, underlying object while the first
transaction is still running, which causes a logical conflict that the STM system detects
and resolves by aborting one of the two transactions.
An STM system can detect a conflict when a transaction first accesses an object
(early detection) or when the transaction attempts to commit (late detection). Both
approaches yield the same results, but may perform differently and, unfortunately, neither
is consistently superior. Early detection prevents a transaction from performing
unnecessary computation that a subsequent abort will discard. Late detection can avoid
unnecessary aborts, as when the conflicting transaction itself aborts because of a conflict
with a third transaction.
Another complication is a conflict between a transaction that only reads an object
and another that modifies the object. Since reads are more common than writes, STM
systems only clone objects that are modified. To reduce overhead, a transaction tracks the
objects it reads and, before it commits, ensures that no other transaction
modified them.
DSTM is a library. An object manipulated in a transaction is first registered with
the DSTM system, which returns a TMObject wrapper for the object (as illustrated in
Figure 1). Subsequently, the code executing the transaction can open the
TMObject for read-only or read-write access, which returns a pointer to the original or
cloned object, respectively. Either way, the transaction manipulates the object directly,
without further synchronization.
A transaction ends when the program attempts to commit the transaction’s
changes. If the transaction successfully commits, the DSTM system atomically replaces,
for all modified objects, the old object in a Locator structure with its modified version.
Figure 1: A transacted object in the DSTM System
A transaction T can commit successfully if it meets two conditions. The first is
that no concurrently executing transaction modified an object read by T. DSTM tracks
the objects a transaction opened for reading and validates the entries in this read set when
the transaction attempts to commit. An object in the read set is valid if its version is
unchanged since transaction T first opened it. DSTM also validates the read set every
time it opens an object, to avoid allowing a transaction to continue executing in an
erroneous program state in which some objects changed after the transaction started.
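The read-set validation just described can be sketched in C. This is not DSTM's actual interface: tm_object_t, tx_t, tx_validate, tx_open_read, and commit_update are hypothetical names, and representing an object's version as a counter that is bumped on every committed update is an assumption made to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define READ_SET_MAX 16   /* assumed capacity for the sketch */

/* Hypothetical shared object with a version bumped on each committed update. */
typedef struct { unsigned long version; long value; } tm_object_t;

/* One read-set entry: the object and the version seen when it was opened. */
typedef struct { const tm_object_t *obj; unsigned long seen_version; } read_entry_t;

typedef struct { read_entry_t reads[READ_SET_MAX]; size_t n_reads; } tx_t;

/* True if every object in the read set still has the version first seen. */
bool tx_validate(const tx_t *tx) {
    for (size_t i = 0; i < tx->n_reads; i++)
        if (tx->reads[i].obj->version != tx->reads[i].seen_version)
            return false;
    return true;
}

/* Open for reading: validate everything read so far, then record this object.
 * Returns false if the transaction must abort. */
bool tx_open_read(tx_t *tx, const tm_object_t *obj, long *out) {
    if (!tx_validate(tx))
        return false;
    assert(tx->n_reads < READ_SET_MAX);
    tx->reads[tx->n_reads++] = (read_entry_t){ obj, obj->version };
    *out = obj->value;
    return true;
}

/* Stand-in for some other transaction committing an update to obj. */
void commit_update(tm_object_t *obj, long value) {
    obj->value = value;
    obj->version++;
}
```

A transaction would call tx_validate once more when it attempts to commit; validating on every open as well keeps it from running on in an inconsistent state.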
The second condition is that transaction T is not modifying an object that another
transaction is also modifying. DSTM prevents this type of conflict by only allowing one
transaction to open an object for modification. When a write-write conflict occurs, DSTM
aborts one of the two conflicting transactions and allows the other to proceed. DSTM
rolls the aborted transaction back to its initial state and then allows it to re-execute. The
policy used to select which transaction to abort can affect system performance, including
liveness, but it should have no effect on the semantics of the STM system.
The performance of DSTM, like other STM systems, depends on the details of the workload.
In general, the large overheads of STM systems make them more expensive than locking on a
small number of processors. However, as the number of processors increases, so does the
contention for a lock and the cost of locking. When this occurs and conflicts are rare,
STMs have been shown to outperform locks on small benchmarks.
The interest in full hardware implementation of TM (HTM) dates to the initial
two papers on TM, by Knight and by Herlihy and Moss, respectively. HTM systems require
no software instrumentation of memory references within transaction code. The hardware
manages data versions and tracks conflicts transparently as software performs ordinary
read and write accesses. Eliminating instrumentation reduces program overhead and
eliminates the need to specialize function bodies so they can be called within and outside
of a transaction. HTM systems rely on a computer’s cache hierarchy and the cache
coherence protocol to implement versioning and conflict detection. Caches observe all
reads and writes issued by a processor, can buffer a significant amount of data, and can
be searched efficiently because of their associative organization. All HTMs modify the
first-level caches, but the approach extends to higher-level caches, both private and
shared. To illustrate the organization and operation of HTM systems, we will describe the
TCC architecture in some detail.
Lazy transactional memory systems achieve high performance through optimistic
concurrency control (OCC), which was first proposed for database systems. Using OCC,
a transaction runs without acquiring locks, optimistically assuming that no other
transaction operates concurrently on the same data. If that assumption is true by the end
of its execution, the transaction commits its updates to shared memory. Dependency
violations are checked lazily at commit time. If conflicts between transactions are
detected, the non-committing transactions are violated: their local updates are rolled
back and they are re-executed. OCC allows for non-blocking operation and performs very
well in situations where there is ample concurrency and conflicts between transactions
are rare, which is the common-case transactional behavior of scalable multithreaded
programs.
Execution with OCC consists of three phases:
• Execution Phase: The transaction is executed, but all of its speculative write-state is
buffered locally. This write-state is not visible to the rest of the system.
• Validation Phase: The system ensures that the transaction executed correctly and is
serially valid (consistent). If this phase does not complete successfully, the transaction
aborts and restarts. If this phase completes, the transaction cannot be violated by other
transactions.
• Commit Phase: Once a transaction completes the validation phase, it makes its
write-state visible to the rest of the system during the commit phase.
Kung et al. outline three conditions under which these phases may overlap in time
while maintaining correct transactional execution. To validate a transaction, there must be a
serial ordering for all transactions running in the system. Assume that each transaction is
given a Transaction ID (TID) at some point before its validation phase. For each
transaction with TID = j (Tj) and for all Ti with i < j one of the following three conditions
must hold:
1. Ti completes its commit phase before Tj starts its execution phase.
2. Tj did not read anything Ti wants to validate, and Ti finished its commit phase before
Tj starts its commit phase.
3. Tj did not read nor is it trying to validate anything Ti wants to commit, and Ti finished
its execution phase before Tj completed its execution phase.
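The three conditions can be written down as predicates over a pair of transactions. The C sketch below is one reading of them, not code from any TCC implementation: the record txn_t, its logical phase timestamps, and the representation of read and write sets as 64-bit masks (one bit per address) are all assumptions, and "what Ti wants to validate or commit" is interpreted as Ti's write set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of one transaction's sets and phase times. */
typedef struct {
    uint64_t read_set;                  /* bit k set: address k was read */
    uint64_t write_set;                 /* bit k set: address k was written */
    unsigned exec_start, exec_end;      /* execution phase (logical time) */
    unsigned commit_start, commit_end;  /* commit phase (logical time) */
} txn_t;

/* Condition 1: Ti completes its commit phase before Tj starts executing. */
bool cond1(const txn_t *ti, const txn_t *tj) {
    return ti->commit_end <= tj->exec_start;
}

/* Condition 2: Tj read nothing Ti writes, and Ti finishes committing
 * before Tj starts committing. */
bool cond2(const txn_t *ti, const txn_t *tj) {
    return (ti->write_set & tj->read_set) == 0
        && ti->commit_end <= tj->commit_start;
}

/* Condition 3: Tj neither read nor writes anything Ti writes, and Ti
 * finishes executing before Tj finishes executing. */
bool cond3(const txn_t *ti, const txn_t *tj) {
    return (ti->write_set & (tj->read_set | tj->write_set)) == 0
        && ti->exec_end <= tj->exec_end;
}

/* For i < j, at least one of the three conditions must hold. */
bool serializable_pair(const txn_t *ti, const txn_t *tj) {
    return cond1(ti, tj) || cond2(ti, tj) || cond3(ti, tj);
}
```

A validator would apply serializable_pair to Tj against every Ti with a lower TID and violate Tj if any pair fails.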
Under condition 1, there is no execution overlap: each transaction can start
executing only after the transaction before it has finished committing, yielding no
concurrency whatsoever. Under condition 2, execution can be overlapped, but only one
transaction is allowed to commit at a time. The original, or “small-scale”, TCC design,
for example, operates under condition 2: each transaction arbitrates for a token and uses
an ordered bus to ensure its commits finish before any other transaction starts
committing.
The sequential commits limit concurrency and become a serial bottleneck for the
small-scale TCC system at high processor counts. Condition 3 allows the most
concurrency: if there are no conflicts, transactions can completely overlap their execution
and commit phases. Scalable TCC operates under condition 3, which allows parallel
commits; however, this requires a more complex implementation. Furthermore,
additional mechanisms are required to accommodate the distributed nature of a
large-scale parallel system, specifically its distributed memory and unordered
interconnection network.
The small-scale TCC model buffers speculative data while in its execution phase.
During validation, the processor requests a commit token which can only be held by one
processor at a time. Once the commit token is acquired, the processor proceeds to flush
its commit data to a shared non-speculative cache via an ordered bus. The small-scale
TCC model works well within a chip-multiprocessor where commit bandwidth is
plentiful and latencies are low. However, in a large-scale system with tens of processors,
it will perform poorly. Since commits are serialized, the sum of all commit times places a
lower bound on execution time. Write-through commits with broadcast messages will
also cause excessive network traffic that will likely exhaust the network bandwidth of a
large-scale system.
The Scalable TCC protocol leverages the directories in the DSM system to
overcome the scaling limitations of the original TCC implementation while maintaining
the same execution model from the programmer’s point of view. The directories allow for
several optimizations. First, even though each directory allows only a single transaction
to commit at a time,multiple transactions can commit in parallel to different directories.
Thus, increasing the number of directories in the system provides a higher degree of
concurrency. Parallel commit relies on locality of access within each transaction.Second,
they allow for a write-back protocol that moves data between nodes only on true sharing
or cache evictions. Finally, they allow filtering of commit traffic and eliminate the need
to broadcast invalidation messages.
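The difference between a single commit token and per-directory commits can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions that are not in the report: each transaction commits to exactly one directory, commit times are given as plain numbers, and the function names are hypothetical.

```python
from collections import defaultdict

def token_commit_bound(commits):
    """Single global commit token: every commit is serialized,
    so the sum of all commit times bounds execution time from below."""
    return sum(time for _, time in commits)

def directory_commit_bound(commits):
    """One committer per directory: only commits that target the same
    directory serialize, so the busiest directory sets the bound."""
    per_directory = defaultdict(int)
    for directory, time in commits:
        per_directory[directory] += time
    return max(per_directory.values(), default=0)

# (directory id, commit time) pairs; the values are made up for illustration.
commits = [(0, 4), (1, 3), (0, 2), (2, 5), (1, 1)]
print(token_commit_bound(commits))      # 15
print(directory_commit_bound(commits))  # 6
```

In this model, adding directories helps only when transactions spread their commits across them, which is the locality condition mentioned above.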
Directories are used to track processors that may have speculatively read shared
data. When a processor is in its validation phase, it acquires a TID and doesn’t proceed to
its commit phase until it is guaranteed that no other processor can violate it. It then sends
its commit addresses only to directories responsible for data written by the transaction.
The directories generate invalidation messages to processors that were marked as having
read what is now stale data. Processors receiving invalidation messages then use their
own tracking facilities to determine whether to violate or simply invalidate the line.
Message         Description
Load            Load a cache line
TID Request     Request a Transaction Identifier (TID)
Skip Message    Instructs a directory to skip a given TID
NSTID Probe     Probes for a Now Serving TID (NSTID)
Mark            Marks a line intended to be committed
Commit          Instructs a directory to commit marked lines
Abort           Instructs a directory to abort a given TID
Write Back      Writes back a committed cache line, removing it from the cache
Flush           Writes back a committed cache line, leaving it in the cache
Data Request    Instructs a processor to flush a given cache line to memory
Table 1: Messages used in the Scalable TCC system
For our protocol to be correct under condition 3 of OCC, two rules must be
enforced. First, conflicting writes to the same address by different transactions are
serialized. Second, a transaction with TID i (Ti) cannot commit until all transactions with
lower TIDs that could violate it have already committed. In other words, if Ti has read a
word that a transaction with a lower TID may have written, Ti must first wait for those
other transactions to commit. Each directory requires transactions to send their
commit-address messages in the order of the transactions’ TIDs. This means that if a
directory allows T5 to send commit-address messages, then T5 can infer that transactions
T0 to T4 have already communicated any commit information to this particular directory.
Each directory tracks the TID currently allowed to commit in the Now Serving TID (NSTID)
register. When a transaction has nothing to send to a particular directory by the time it is
ready to commit, it informs the directory by sending it a Skip message.
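The two rules above can be read as simple checks against each directory's NSTID. The following is a minimal sketch of that reading, not the protocol's actual implementation; the function names and the dictionary that maps a directory id to its NSTID are assumptions made for illustration.

```python
def may_send_commit_addresses(tid, nstid):
    """A directory accepts commit-address messages only from the
    transaction whose TID it is currently serving."""
    return tid == nstid

def may_complete_commit(tid, write_dirs, read_dirs, nstid_of):
    """nstid_of maps a directory id to its Now Serving TID (hypothetical layout).

    Rule 1: every directory being written is serving this TID, so
            conflicting writes to that directory are serialized.
    Rule 2: every directory that was read has finished all lower TIDs,
            so no older transaction can still violate this one.
    """
    writes_serialized = all(nstid_of[d] == tid for d in write_dirs)
    older_done = all(nstid_of[d] >= tid for d in read_dirs)
    return writes_serialized and older_done

# T5 writes to directory 0 and has read from directories 0 and 1.
print(may_complete_commit(5, {0}, {0, 1}, {0: 5, 1: 3}))  # False: directory 1 still serves T3
print(may_complete_commit(5, {0}, {0, 1}, {0: 5, 1: 6}))  # True: T0..T4 are done everywhere
```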
Figure 2 presents a simple example of the protocol where a transaction in
processor P1 successfully commits with one directory while a second transaction in
processor P2 violates and restarts. The figure is divided into six parts. Changes in state
are circled and events are numbered to show ordering, meaning all events numbered 1
can occur at the same time and an event labeled 2 can only occur after all events labeled
1. The description makes use of Table 1 which lists the coherence messages used to
implement the protocol.
In part a, processors P1 and P2 each load a cache line (1) and are subsequently
marked as sharers by Directory 0 and Directory 1 respectively. Both processors write to
data tracked by Directory 0, but this information is not communicated to the directory
until they commit.
In part b, processor P1 loads another cache line from Directory 0, and then starts
the commit process. It first sends a TID Request message to the TID Vendor (3), which
responds with TID 1 (4), and processor P1 records it (5).
Figure 2: Execution with the Scalable TCC protocol
In part c, P1 communicates with Directory 0, the only directory it wants to write
to, in order to commit its write. First, it probes this directory for its NSTID using an
NSTID Probe message (1). In parallel, P1 sends Directory 1 a Skip message since
Directory 1 is not in its write-set, causing Directory 1 to increase its NSTID to 2 (2).
Meanwhile, P2 has also started the commit process. It requests a TID, but can also start
probing for an NSTID (1) from Directory 0, since probing does not require the processor
to have acquired a TID. P2 receives TID 2 (2) and records it internally (3).
In part d, both P1 and P2 receive answers to their NSTID probes. P2 also sends a
Skip message to Directory 1 (1), causing the directory’s NSTID to change to 3. P2 cannot
send any commit-address messages to Directory 0 because the NSTID answer it received
is lower than its own TID. P1’s TID, on the other hand, matches Directory 0’s NSTID,
thus it can send commit-address messages to that directory. Note that we are able to
serialize the potentially conflicting writes from P1 and P2 to data from Directory 0. P1
sends a Mark message (2), which causes line X to be marked as part of the committing
transaction’s write-set (3). Without using Mark messages, each transaction would have to
complete its validation phase before sending the addresses it wants to commit. Mark
messages allow transactions to pre-commit addresses to the subset of directories that are
ready to service the transaction. Before P1 can complete its commit, it needs to make sure
no other transaction with a lower TID can violate it. For that, it must make sure that
every directory in its read-set (0 and 1) is done with younger transactions (those with
lower TIDs). Since it is currently serviced by Directory 0, it can be certain that all
transactions with lower TIDs have already been serviced by this directory. However, P1
also needs to probe Directory 1 (3). P1 receives NSTID 3 as the answer (4), hence it can be
certain all transactions younger than TID 3 have already been serviced by Directory 1.
Thus, P1 cannot be violated by commits to any directory.
In part e, P1 sends a Commit message (1), which causes all marked (M) lines to
become owned (O) (2). Each marked line that transitions to owned generates invalidations
that are sent to all sharers of that line except the committing processor, which becomes the
new owner (3). P2 receives the invalidation, discards the line, and violates because its
current transaction had read it.
In part f, P2 attempts to load an owned line (1); this causes a data request to be sent
to the owner (2); the owner then writes back the cache line and invalidates the line in its
cache (3). Finally, the directory forwards the data to the requesting processor (5) after
marking it in the sharers list for this line (4).
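The commit and invalidation steps of parts e and f can be mimicked with a small toy model. It is only a sketch: the class and function names are hypothetical, processors are plain strings, and message passing is reduced to method calls and a returned list of invalidations.

```python
class Line:
    """Directory-side state for one cache line (illustrative names)."""
    def __init__(self):
        self.sharers = set()   # processors that speculatively read the line
        self.owner = None      # last processor to commit the line
        self.marked = False    # part of an ongoing commit

class Directory:
    def __init__(self):
        self.lines = {}

    def load(self, proc, addr):
        # A load registers the processor as a sharer of the line.
        self.lines.setdefault(addr, Line()).sharers.add(proc)

    def mark(self, addr):
        # Mark message: the line belongs to the committing write-set.
        self.lines.setdefault(addr, Line()).marked = True

    def commit(self, committer):
        """Turn marked lines into owned lines and return the
        (processor, address) invalidations that would be sent."""
        invalidations = []
        for addr, line in self.lines.items():
            if line.marked:
                line.marked = False
                line.owner = committer
                invalidations += [(p, addr) for p in sorted(line.sharers)
                                  if p != committer]
                line.sharers = {committer}
        return invalidations

def on_invalidate(read_set, addr):
    """A processor violates only if its current transaction read the line."""
    return "violate" if addr in read_set else "invalidate"

d0 = Directory()
d0.load("P1", "X")
d0.load("P2", "X")
d0.mark("X")                          # P1 sends Mark for line X
msgs = d0.commit("P1")                # P1 sends Commit
outcome = on_invalidate({"X"}, "X")   # P2 had read X in its transaction
print(msgs, outcome)
```

Here P2 is the only other sharer of X, so it receives the single invalidation and restarts because X is in its read-set.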
Directory State
Figure 3 shows the directory organization for the Scalable TCC protocol. The
directory tracks information for each cache line in the physical memory housed in the
local node. First, it tracks the nodes that have speculatively read the cache line (sharers
list). This is the set of nodes that will be sent invalidations whenever the line gets
committed. Second, the directory tracks the owner for each cache line, which is the last
node to commit updates to the line until it writes it back to physical memory (eviction).
The owner is indicated by setting a single bit in the sharers list and the Owned bit. The
Marked bit is set for cache lines involved in an ongoing commit to this directory. Finally,
we include a TID field. Each directory also has an NSTID register and a Skip Vector,
described below.
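One possible encoding of the per-line directory state described above is sketched here. The field names, the use of an integer as the sharers bit vector, and the helper methods are all assumptions for illustration; the report does not say how the TID field is used, so the sketch only stores it.

```python
from dataclasses import dataclass

@dataclass
class DirectoryEntry:
    sharers: int = 0       # bit i set => node i speculatively read the line
    owned: bool = False    # Owned bit
    marked: bool = False   # Marked bit: line is part of an ongoing commit
    tid: int = 0           # TID field (its exact use is not detailed in the text)

    def add_sharer(self, node):
        self.sharers |= 1 << node

    def sharer_list(self):
        return [i for i in range(self.sharers.bit_length())
                if (self.sharers >> i) & 1]

    def set_owner(self, node):
        # The owner is the single bit left in the sharers list plus the Owned bit.
        self.sharers = 1 << node
        self.owned = True

    def owner(self):
        return self.sharers.bit_length() - 1 if self.owned else None

entry = DirectoryEntry()
entry.add_sharer(0)
entry.add_sharer(2)
print(entry.sharer_list())  # [0, 2]
entry.set_owner(2)
print(entry.owner())        # 2
```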
Figure 3: The directory structure for Scalable TCC.
NSTID and Skip Vector
Directories control access to a contiguous region of physical memory. At any
time, a single directory will service one transaction whose TID is stored in the directory’s
NSTID register. For example, a directory might be serving T9; this means that only T9
can send state-altering messages to the memory region controlled by that directory. If T9
has nothing to commit to that directory, it will send a skip message with its TID attached.
This will cause the directory to mark the TID as completed. The key point is that each
directory will either service or skip every transaction in the system. If two transactions
have overlapping write-sets, then the concerned directory will serialize their commits.
Each directory tracks skip messages using a Skip Vector. The Skip Vector allows
the directory to buffer skip messages sent early by transactions with TIDs higher than the
one in the NSTID. Figure 5 shows how the Skip Vector is maintained. While a directory
with NSTID 0 is busy processing commit-related messages from T0, it can receive and
process skip messages from transactions T1, T2, and T4. Each time a skip message with
TID x is received, the directory sets the bit in location (x − NSTID) of the Skip Vector.
When the directory is finished serving T0, it marks the first bit of the Skip Vector, and
then proceeds to shift the Skip Vector left till the first bit is no longer set. The NSTID
is increased by the number of bits shifted.
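The Skip Vector bookkeeping described above can be sketched as follows. This is an illustrative model only: the class name, the fixed vector width, and the choice to advance immediately when the currently served TID sends a skip are assumptions, not details taken from the report.

```python
class SkipDirectory:
    def __init__(self, width=8):
        self.nstid = 0
        # skip[k] is set when transaction (nstid + k) is finished with this directory.
        self.skip = [False] * width

    def receive_skip(self, tid):
        offset = tid - self.nstid
        if not 0 <= offset < len(self.skip):
            raise ValueError("TID outside the Skip Vector window")
        self.skip[offset] = True
        self._advance()

    def finish_current(self):
        """The transaction being served has committed (or aborted)."""
        self.skip[0] = True
        self._advance()

    def _advance(self):
        # Shift left while the first bit is set; NSTID grows by the shift count.
        while self.skip[0]:
            self.skip.pop(0)
            self.skip.append(False)
            self.nstid += 1

d = SkipDirectory()
for t in (1, 2, 4):      # early skips arrive while T0 is still being served
    d.receive_skip(t)
print(d.nstid)           # 0
d.finish_current()       # T0 is done: shift past T0, T1 and T2
print(d.nstid)           # 3
d.receive_skip(3)        # T3 skips: shift past T3 and the buffered T4
print(d.nstid)           # 5
```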
Beyond the implementation issues discussed here, TM faces a number of
challenges that are the subject of active research. One serious difficulty with optimistic
TM is that a transaction that executed an I/O operation may roll back at a conflict. I/O in
this case consists of any interaction with the world outside of the control of the TM
system. If a transaction aborts, its I/O operations should roll back as well, which may be
difficult or impossible to accomplish in general. Buffering the data read or written by a
transaction permits some rollbacks, but buffering fails in simple situations, such as a
transaction that writes a prompt and then waits for user input. A more general approach is
to designate a single privileged transaction that runs to completion, by ensuring it
triumphs over all conflicting transactions. Only the privileged transaction can perform
I/O (but the privilege can be passed between transactions), which unfortunately limits the
amount of I/O a program can perform.
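The buffering approach mentioned above can be illustrated with a small sketch in which output is held back until commit. The class and method names are hypothetical, and a plain list stands in for the outside world.

```python
class BufferedOutputTransaction:
    """Holds output back until commit so an abort has nothing to undo."""
    def __init__(self, sink):
        self.sink = sink       # stands in for a device or stream
        self.pending = []      # output written inside the transaction

    def write(self, text):
        self.pending.append(text)

    def commit(self):
        self.sink.extend(self.pending)
        self.pending.clear()

    def abort(self):
        self.pending.clear()   # nothing reached the sink, so rollback is trivial

sink = []
tx = BufferedOutputTransaction(sink)
tx.write("first attempt")
tx.abort()                     # conflict: buffered output is discarded
tx.write("second attempt")
tx.commit()
print(sink)                    # ['second attempt']
```

The sketch also shows why buffering fails for the prompt-and-wait case: a prompt written inside the transaction stays in the buffer until commit, so the user never sees it while the transaction is waiting for input.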
Another major issue is strong and weak atomicity. STM systems generally
implement weak atomicity, in which non-transactional code is not isolated from code in
transactions. HTM systems, on the other hand, implement strong atomicity, which
provides a more deterministic programming model in which non-transactional code does
not affect the atomicity of a transaction. This difference presents several problems.
Beyond the basic question of which model is a better basis for writing software, the
semantic differences make it difficult to develop software that runs on both types of
systems. The least common denominator is the weak model, but erroneous programs will
produce divergent results on different systems. An alternative viewpoint is that
unsynchronized data access between two threads is generally an error, and if only one
thread is executing a transaction, then there is insufficient synchronization between the
threads. Therefore, the programming language, tools, runtime system, or hardware should
prevent or detect unsynchronized sharing between transactional and non-transactional
code, and a programmer should fix the defect.
Weakly atomic systems also face difficulties when an object is shared between
transactional and non-transactional code. Publication occurs when a thread makes an
object visible to other threads (for example, by adding it to a global queue) and
privatization occurs when a thread removes an object from the global shared space.
Private data should be manipulatable outside of a transaction without synchronization,
but an object’s transition between public and private must be coordinated with the TM
system, lest it attempt to roll back an object’s state while another thread assumes it has
sole, private access to the data.
Finally, TM must coexist and interoperate with existing programs and libraries. It
is not practical to require programmers to start afresh and acquire a brand new set of
transactional libraries to enjoy the benefits of TM. Existing sequential code should be
able to execute correctly in a transaction, perhaps with a small amount of annotation and
recompilation. Existing parallel code that uses locks and other forms of synchronization
must continue to operate properly, even if some threads are executing transactions.
Before everybody gets too excited about the prospects of TM, we should
remember that it is still very much a topic of research. First implementations are
becoming available, but we still have much to learn. The VELOX project,
for example, has as its goal a comprehensive analysis of
all the places in an operating system where TM technology can be used. This extends
from lock-free data structures in the operating-system kernel to high-level uses in the
application server. The analysis includes TM with and without hardware support.
The VELOX project will also research the most useful semantics of the TM
primitives that should be added to higher-level programming languages. In the previous
example it was a simple tm_atomic keyword. This does not necessarily have to be the
correct form; nor do the semantics described need to be optimal.
A number of self-contained STM implementations are available today. One
possible choice for people to get experience with is TinySTM. It
provides all the primitives needed for TM while being portable, small, and depending on
only a few services, which are available on the host system.
Based on TinySTM and similar implementations, we will soon see language
extensions such as tm_atomic appear in compilers. Several proprietary compilers have
support, and the first patches for the GNU compilers are also available.
With these changes it will be possible to collect
experience with the use of TM in real-world situations to find solutions to the remaining
issues.
Transactional memory by itself is unlikely to make multicore computers readily
programmable. Many other improvements to programming languages, tools, runtime
systems, and computer architecture are also necessary. TM, however, does provide a
time-tested model for isolating concurrent computations from each other. This model raises the
level of abstraction for reasoning about concurrent tasks and helps avoid many insidious
parallel programming errors. However, many aspects of the semantics and
implementation of TM are still the subject of active research. If these difficulties can be
resolved in a timely fashion, TM will likely become a central pillar of parallel
programming.
• Ulrich Drepper, "Parallel Programming with Transactional Memory,"
Communications of the ACM, Vol. 52, No. 2, pp. 38-43, February 2009.
• James Larus and Christos Kozyrakis, "Transactional Memory,"
Communications of the ACM, Vol. 51, No. 7, pp. 80-88, July 2008.
• Jared Casper and Brian D. Carlstrom, "A Scalable, Non-blocking Approach to
Transactional Memory," 2007 IEEE 13th International Symposium on
High Performance Computer Architecture (HPCA), pp. 97-108, 2007.