D2.1.1 – ÆTHER

ÆTHER FET Integrated Project IST-027611
D2.1.1 – First annual research report on abstract SANE processors and system
environment (T2.1.1) and SANE interactions and adaptation (T2.1.2)

D2.1.1_UvA_12.2006_v21 PUBLIC Page 1 of 82
Future Emerging Technologies





“Self-Adaptive Embedded Technologies for Pervasive Computing Architectures”

Contract Number: IST-027611

FIRST ANNUAL RESEARCH REPORT ON
ABSTRACT SANE PROCESSORS AND SYSTEM ENVIRONMENT (T2.1.1)
AND SANE INTERACTIONS AND ADAPTATION (T2.1.2)
Deliverable D2.1.1

SP2/WP2.1 – SANE interaction and runtime

Author(s): Chris Jesshope (UvA), Jean-Philippe Diguet (CNRS), Katarina Paulsson (UNIKARL)
Reviewer(s): Chris Jesshope (UvA, Editor), Christian Gamrat (CEA)
WP/Task No: WP2.1/T2.1.1-2    Number of pages: 78
Identifier: D2.1.1_UvA_12.2006_v21
Issue Date: 31 December 2006
Dissemination level: Public

Keywords:
SVP SANE Virtual Processor Model, System Environment, Abstract Machine, µTC,
Interfaces, Scheduling, Synchronisation, Communication.
Abstract:

Subproject 2 has as its major objective the definition of an abstract model that
supports the project's goals of self-adaptation at all levels within a system, from the
hardware to the application layer. Moreover, it must provide a firm definition of a
realisation of that model that will eventually enable the development of various
prototype tools to support the model at each of these levels. In particular, that
realisation must provide a compiler target for the work in SP3 and also support various
implementation paradigms. The abstract model may be implemented in hardware,
software or both. These implementations will range from programming conventional
processors through to synthesising configurations of hardware using various
reconfigurable architectures. In order to meet this goal, a thread-based model has been
adopted, in which the threads may range from single machine operations through to
complete programs, and in which the unit that supports the goal of self-adaptation is
not the single thread, which may just define a machine operation, but the abstraction
of a family of related threads. This abstraction captures concurrency, locality,
communication and synchronisation, as well as more abstract notions such as mission
requirements and resource definitions. Both the model and its realisation have been
developed and defined in internal working documents. Substantial progress has been
made in defining interfaces to the higher-level work in SP3, and some in the
directions required in defining interfaces to the lower-level work in SP1.

Approved by the Project Coordinator:    Date: 01/02/2007


Document history
When        Who                           Comments
15/11/2006  Chris Jesshope/UvA            Abstract and table of contents; delegation of sections
19/11/2006  Chris Jesshope/UvA            Added SVP stuff and µTC definition
24/11/2006  Chris Jesshope/UvA            Added section on networks and protocols
14/12/2006  Chris Jesshope/UvA            Added background section (V11)
16/12/2006  Jean-Philippe Diguet/CNRS     Added section on the systems environment
23/12/2006  Chris Jesshope/UvA            Removed ITIV contribution, more suitable for SP1
27/12/2006  Chris Jesshope/UvA            General editing and tying up loose ends
27/12/2006  Jean-Philippe Diguet/CNRS     Updated section on Systems environment
30/12/2006  Chris Jesshope/UvA            Final sections added
23/1/2007   Chris Jesshope/UvA            Final editing of report (not)
5/1/2007    Chris Jesshope/UvA            Additions from CNRS & reduce length of report
10/1/2007   Chris Jesshope/UvA            Added subsection on places and strengthened section on
                                          implementation issues
10/1/2007   Katarina Paulsson/UNIKARL     Added section on low-level adaptation
11/1/2007   Chris Jesshope                Final final editing


Table of Contents
Document history................................................................................................................................2
Table of Contents................................................................................................................................3
I. Executive summary..........................................................................................................................5
II. Introduction....................................................................................................................................7
II.1. Definition of the Problem....................................................................................................................7
II.1.1. Abstract machine model..................................................................................................................................7
II.1.2. OE-level self-adaptation..................................................................................................................................8
II.2. State of the Art.....................................................................................................................................8
II.2.1. Scheduling computation..................................................................................................................................8
II.2.2. Threads models in a a processors ISA...........................................................................................................9
II.2.3. Commercial threaded architecture models....................................................................................................10
II.2.4. Reconfigurable Architecture models.............................................................................................................11
II.2.5. Threaded language models............................................................................................................................11
III. The Abstract Model: SANE Virtual Processor (SVP)..................................................13
III.1. An introduction to the SANE virtual processor - SVP.................................................................13
III.2. Informal Semantics of the SANE Virtual processor - SVP..........................................................15
III.2.1. The state of a SANE processor....................................................................................................................15
III.2.2. A family of threads......................................................................................................................................17
III.2.3. A thread........................................................................................................................................................17
III.2.4. Synchronising a family of threads................................................................................................................17
III.2.5. Breaking the execution of a family of threads.............................................................................................17
III.2.6. Killing a family of threads...........................................................................................................................18
III.2.7. Squeezing a family of threads......................................................................................................................18
III.3. µTC: a Realisation of the SANE Virtual Processor.......................................................19
III.3.1. What is a realisation of the SVP?.................................................................................................................19
III.3.2. Additional constructs and concepts in µTC.................................................................................................19
III.3.3. Memory and synchronisation in µTC...........................................................................................................20
III.3.4. Creating families of threads.........................................................................................................................21
III.3.5. Named threads..............................................................................................................................................22
III.3.6. Places...........................................................................................................................................................23
III.3.7. Squeezable threads.......................................................................................................................................24
III.3.8. Synchronising a family on termination........................................................................................................24
III.3.9. Breaking the execution of a family..............................................................................................................25
III.3.10. Squeezing a family.....................................................................................................................................27
III.3.11. Killing a family..........................................................................................................................................28
III.4. Examples of using the µTC memory model...................................................................................28
III.4.1. Communication between threads with synchronising shared variables.......................................................29
III.4.2. Resource management on thread creation....................................................................................................32
III.4.3. Communication between threads with asynchronous shared memory.........................................................33
III.5. SANE Networks and Communications..........................................................................................33
III.5.1. Delegation network......................................................................................................................................34
III.5.2. Shared-memory network..............................................................................................................................36
III.5.3. Synchronisation network..............................................................................................................................36
IV. The Systems Environment Process............................................................................................37
IV.1. SANE OE requirements..................................................................................................................38
IV.2. SANE OE: Concepts.......................................................................................................38
IV.2.1. Basic Components.......................................................................................................................................38
IV.2.2. Overview......................................................................................................................................................39
IV.3. OE blueprint....................................................................................................................39
IV.3.1. Application Manager...................................................................................................................................41
IV.3.2. Resource Manager.......................................................................................................................................42
IV.3.3. Adaptor........................................................................................................................................................45
IV.3.4. OE - a hierarchical concept..........................................................................................................................45
IV.3.5. Simulator model...........................................................................................................................................46
IV.4. Task Graph.......................................................................................................................................46
IV.5. Protocol for application deployment over a network of SANE processors.................................47
IV.6. Metrics and credits overview for mapping decision.....................................................................48
IV.6.1. Metrics.........................................................................................................................................................48
IV.6.2. Credits..........................................................................................................................................................52
IV.7. Mapping / Partitioning....................................................................................................................52
IV.7.1. Concept........................................................................................................................................................52
IV.7.2. System schedulability aspects - hard, soft and non real-time.......................................................................53
IV.7.3. Real time schedulability analysis.................................................................................................................56
V. Adaptation Algorithms and Implementation...............................................................................62
V.1. Overall model and issues...................................................................................................................62
V.2. Models for Application/Architecture Abstraction..........................................................................62
V.2.1. Application description model:.....................................................................................................................63
V.2.2. Model for Architecture Abstraction..............................................................................................................64
V.3. Implementations of SVP – General issues.......................................................................65
V.4. High-level adaptation in microthreaded microgrids......................................................................66
V.4.1. Parametric Concurrency................................................................................................................................68
V.4.2. Deterministic Behavior.................................................................................................................................69
V.5. Low-level adaptation in reconfigurable logic..................................................................................70
V.5.1. Advanced scheduling techniques..................................................................................................................70
V.5.2. Monitoring run-time information in the hardware layers..............................................................................72
VI. Interrelationships between SP2 and SP1, SP3, SP4..................................................................74
VI.1. The relationship between Snet and SVP........................................................................................75
VII. Discussions and future directions for SP2...............................................................................76
VIII. References................................................................................................................................78
Appendix I Glossary.......................................................................................................................82

I. Executive summary
Subproject 2 has had the task of providing a layer of abstraction between the software-engineering layer of the project, subproject 3, and the architectural layer, subproject 1. The following properties are required in the abstractions defining a SANE virtual processor (SVP) if it is to support the project's goals of self-adaptation:
· to capture all of a program's concurrency: this is required as scheduling concurrency is the only mechanism we have to support self-adaptation. We also note that sequencing a concurrent description is trivial when compared to extracting concurrency from a sequential one;
· to capture locality of communication: this is required in order to reflect future constraints on communication in silicon systems; finally
· to keep everything as dynamic as possible: this also supports the project's goals of self-adaptivity in a dynamically changing environment and for the widest range of applications.
In defining the SVP, a top-down approach has been adopted, which has meant that during the first year of the project our focus has been to support the concepts being defined in the development of Snet, the streaming language that has been designed in subproject 3. Snet will support concurrent software engineering and provide applications with any support they need for self-adaptation. Despite this top-down approach, we have also ensured that the abstractions defined can be implemented in hardware with efficient and scalable structures. This has been demonstrated at one extreme of the different potential implementations of the SANE virtual processor (the most dynamic and concurrent one).
Within the first phase of the project, the subproject's goals have been twofold:
· the first has been to define the SANE virtual processor (SVP), to which applications written in a variety of languages, including Snet, could be targeted;
· the second has been to define the structure and algorithms of various Systems Environment Processes, which will provide the glue to bridge the SVP model with the various potential implementations of it in reconfigurable hardware.
Work on the SVP definition is now complete. This has been remarkably successful, and a working document, now incorporated in this deliverable, has defined the concurrency and synchronisation abstractions. It has also realised these abstractions as a language based on C, namely µTC, which stands for microthreaded C. This language will be used in developing tools and demonstrations later in the project.
SVP has a concurrency model that is close to hardware but which is recursively defined, allowing it to be used to build complete systems in a coherent and unified manner. At its lowest levels, the concept of a microthread represents a function that will probably be implemented in reconfigurable hardware. Microthreads are defined in families and communicate strictly locally within a family, to reflect the fact that hardware will be their target. Communication within a family is dataflow-like, which also reflects the likely hardware implementation. Higher levels of communication in the model are captured using the concept of an asynchronous, globally-shared memory. Synchronisation here is on the termination of a complete family of microthreads, which defines a task in this model. This is a bulk synchronisation, which supports communication at a range of different granularities between distributed memories, if required by the implementation. The model therefore provides universally applicable abstractions from hardware to distributed systems. Because threads can create new families of threads, we have a mechanism for composing concurrent components concurrently, in order to build systems that are specified concurrently from the lowest levels upwards. This composition is dynamic, giving the maximum flexibility in the application domain. SVP also supports a small but complete range of concurrency controls over this concurrency tree. The controls are a destructive kill of a task defined as a concurrency tree, the preemption (squeeze) of a task defined as a concurrency tree, and the ability to define infinite families of threads and to dynamically break their execution.
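The family abstraction and its bulk synchronisation can be sketched in plain Python. The Family class, its create/sync methods and the example workload below are illustrative inventions, not the µTC or SVP API; they only mirror the visible ordering of the model, in which no thread's effects are relied upon until the whole family has terminated.

```python
import threading

class Family:
    """Illustrative sketch (not the actual SVP/µTC API): a parameterised
    family of threads indexed over a range, with bulk synchronisation
    on the termination of the whole family."""
    def __init__(self, thread_body, start, limit):
        # one thread per index value, as in an index-parameterised family
        self.threads = [threading.Thread(target=thread_body, args=(i,))
                        for i in range(start, limit)]

    def create(self):
        for t in self.threads:
            t.start()
        return self

    def sync(self):
        # bulk synchronisation: wait for every member of the family
        for t in self.threads:
            t.join()

results = [0] * 8

def body(i):
    # each microthread is a small unit of work writing to its own slot
    results[i] = i * i

Family(body, 0, 8).create().sync()
print(results)  # all writes are visible only after the bulk sync
```

In SVP proper, creation and synchronisation are architectural primitives (ultimately machine instructions) rather than library calls; the sketch stands in for that behaviour using ordinary OS threads.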
In terms of achievements, on the first SP2 goal, SVP has been fully defined and realised as a compiler target language, µTC. A compiler is being written for that language, which will support a range of different SVP implementations, including hardware. Currently we can parse and perform semantic checks on all of the additional constructs defined in µTC. The compiler is based on the open-source framework gcc, which has been adopted and supported by the HiPEAC network of excellence. Finally, we have developed an emulation of a direct implementation of SVP based on the Alpha instruction set. This is an architecture development, but as it realises SVP directly, it is included in this report. It implements a sea of processors that can be configured into clusters to execute families of threads. The processors are specialised to directly implement all SVP concepts as instructions. A full cycle-accurate implementation of all concurrency controls has demonstrated that these concepts have hardware realisations which are scalable and can support, in current technology, around half a million concurrent threads per chip (using floating-point operations).
The work on the Systems Environment is proceeding. A substantial amount of research has been undertaken in defining models and adapting algorithms to the requirements of the ÆTHER project, namely self-adaptation at all levels within a system where resource
type, amount and availability are changing according to interactions in the environment. One operating environment (OE) has been proposed at the level between user jobs and the system of SANE processors. Mapping algorithms have been investigated, and protocols for a single level of adaptation at the OE level have been proposed. This report describes a global methodology for application partitioning over a set of self-reconfigurable architectures; it includes important and original stages, such as resource negotiation and soft/hard real-time analysis, based on credit exchanges and matching metrics. Negotiation aims to converge towards a mapping solution where each available SANE processor tries to compute applications, or parts of applications, for which it is efficient. Real-time analysis is based on a global approach considering communications scheduling and a smart multiprocessor schedulability analysis, where hard, soft and non-real-time constraints can be addressed. A substantial amount of detail defining the various metrics on which the OE-level adaptation will be based is also presented in this report. These concepts and methodologies are currently being implemented in a SystemC simulator, where different mapping and scheduling strategies and architecture models will be tested. Finally, we have started to consider the application and processor models that will be used in the management of resources to support all of the above.
This work will therefore continue on two fronts. The first will be SystemC simulation of the OE; the second will be a more specific implementation of the SEP based on microgrids of microthreaded processors.
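The credit-and-metric negotiation can be caricatured in a few lines of Python. The matching metric, the bidding rule and every name below are invented for illustration; they are not the metrics and credits defined later in this report, only a minimal sketch of the idea that each processor bids for the work it is efficient at.

```python
# Illustrative sketch of credit-based mapping negotiation: each SANE
# processor bids for a task in proportion to how well its capabilities
# match the task's requirements, weighted by its available credits.
# All names, fields and the matching rule are invented for illustration.

def match_metric(task, processor):
    # crude matching metric: fraction of required features the processor offers
    required = set(task["needs"])
    offered = set(processor["features"])
    return len(required & offered) / len(required)

def negotiate(task, processors):
    # the task is mapped to the processor with the strongest weighted bid
    bids = {p["name"]: p["credits"] * match_metric(task, p) for p in processors}
    return max(bids, key=bids.get)

processors = [
    {"name": "SP-A", "credits": 10, "features": ["fpu", "dsp"]},
    {"name": "SP-B", "credits": 6,  "features": ["fpu", "crypto"]},
]
task = {"needs": ["fpu", "crypto"]}
print(negotiate(task, processors))  # SP-B wins: a better match despite fewer credits
```

A real OE negotiation would iterate this exchange over many tasks and processors, updating credits as resources come on- and off-line; the sketch shows only a single round.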
II. Introduction
The overall goal of the ÆTHER project is to develop self-adaptive embedded technologies for pervasive computing architectures. There are two distinct issues here: the first is self-adaptation, i.e. the ability of a computing system to reflect and configure itself dynamically in response to changes in its environment in order to fulfil its self-defined mission goals, for example by adapting its behaviour, performance, power budget, reliability, etc. The second issue is the nature of pervasive computing architectures. What this means is the dynamic availability of processing resources in the system's environment, i.e. processors going on- and off-line at run time. What can be inferred from these high-level goals is that the project will certainly require the capture of concurrency, but more specifically that this concurrency must be at a very low level, as we are targeting hardware as well as software in the project's solutions to this problem. More specifically, the project will need dynamic control over that concurrency in terms of its distribution to processing resources and the scheduling of units of work according to data availability.

To quote from the description of work for this subproject: “The objective ... is to develop an abstract machine and systems environment to support self-adaptive systems ... It will define interfaces between SANE components and provide independence between the definition of a component in the high-level framework and the implementation of that component on the reconfigurable hardware platform ... The system will be abstract, flexible and extensible to provide a stable platform on which to build software for self-adaptive systems and through which efficient implementations across a range of architectures can be implemented.”

These are very ambitious objectives considering the bridge that this subproject must make between the high-level proposals in subproject 3 and the pragmatic goals of subproject 1. However, we have responded to that challenge, and within the first year of the project we have defined an abstract SANE virtual processor model and have provided a realisation of that as a high-level programming language called µTC, for which we provide a compiler. These definitions have been the result of a top-down approach to the solution of this problem. This has ensured from the start that there is compatibility between this subproject's deliverables and the concepts and abstractions being defined within subproject 3 – the SANE software architecture.
II.1. Definition of the Problem
There are two sub-problems that need to be addressed:
1. The definition of the abstract machine model (SVP).
2. An operating environment (OE) to support self-adaptation: i.e. issues of system modelling and a decision framework for the adaptive mapping of an application (or part of one) over a network of SANE processors (SPs).
II.1.1. Abstract machine model
Any abstract machine model or virtual processor definition must define the semantics and the operation of the underlying processor(s). It is made clear in the introduction above that this model must capture concurrency, while ignoring issues such as scheduling, in order to make the model fully dynamic. This means that the model must define concurrency, communication and synchronisation in abstract terms. However, this is not a complete definition of the problem, as emerged from early discussions between SP2 and SP3. In order to achieve self-adaptation at the software level, the model must also provide mechanisms for the dynamic control of concurrently executing programs. This problem is non-trivial, and today's operating systems only provide control of interleaved concurrent programs, which have no pending synchronisation state.
The more general problem of dynamically managing a distributed system, where there may be multiple concurrent outstanding synchronisations, is the one that must be solved by the models and abstractions defined in SP2. This requires a mechanism to kill, i.e. destructively terminate, a concurrent program so that its state is destroyed completely. More importantly, it also requires a mechanism to preempt a concurrently executing program, such that all of the processor resources it uses can be freed up with as little latency as possible, but while maintaining a consistent state for that program, so that it may eventually be restarted on other resources and executed to completion. This latter mechanism is at the heart of self-adaptation at both the software and hardware level. We know of no other work where such a model has been developed.
There is a trade-off here; on the one hand, the program could be terminated immediately while maintaining its synchronising state, i.e. any communication event or remote operation that has been started but has not yet completed. This, however, requires a significant amount of state to be saved and restored when the concurrent program is rescheduled. By analogy to existing operating systems, this is equivalent to the operating system capturing the register state of an executing program prior to preempting it. With a concurrent program, especially if it is implemented in hardware, e.g. on a reconfigurable processor architecture, this would mean capturing the state of all registers in the distributed system as well as the state of any communication between components, such as incomplete handshakes in an asynchronous communication. It is not clear that this would give the best latency for preemption; indeed, an answer to this problem is probably undecidable. What is clear is that this solution is complex, indeed arbitrarily so. SP2/SP3 discussions therefore focussed on a mechanism for squeezing the synchronising state from the concurrently executing program in order to preempt it more efficiently. This solution collapses the concurrency, capturing only sufficient information to regenerate the concurrency tree again when the program is rescheduled. It is for this reason that the microthreaded model was chosen as a basis for SVP. In microthreading, a computation is captured as parameterised families of threads. Squeezing the synchronising state from such a model requires identifying only the first unexecuted thread, allowing all others to complete and adapting the parameters of the family to reflect this partition of the family. This gives a simple and efficient method of preempting concurrent programs. It means waiting for any outstanding synchronisations but not initiating any new ones, then simply restarting the code again having captured the set of new parameters. In short, the program executes and re-executes as many times as necessary on new resources until the family's parameters are satisfied.
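The squeeze mechanism on an index-parameterised family, allowing started threads to complete, capturing the first unexecuted index, and re-creating the family with the remaining parameters, can be caricatured in Python. The representation of a family as an index range, the budget cut-off standing in for preemption, and all names are illustrative assumptions rather than the SVP definition.

```python
# Illustrative sketch of squeezing a parameterised family of threads.
# A family is abstracted as an index range [start, limit); 'squeeze' is
# modelled by a budget: threads already started run to completion, no new
# ones begin, and the first unexecuted index becomes the new start
# parameter when the family is re-created on fresh resources.

def run_family(body, start, limit, budget):
    """Execute threads start..limit-1, but preempt (squeeze) after
    'budget' threads have run; return the first unexecuted index."""
    i = start
    while i < limit and budget > 0:
        body(i)       # thread i completes; no new synchronisations begun
        i += 1
        budget -= 1
    return i          # captured parameter for the squeezed family

done = []
start, limit = 0, 10
while start < limit:
    # each iteration models re-creating the family on new resources
    start = run_family(done.append, start, limit, budget=4)

print(done)  # every index executes exactly once, across several placements
```

The point of the mechanism is visible in the sketch: the only state carried between placements is the family's remaining parameter range, not the register or communication state of its threads.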
Having agreed on a broad description of concurrency mechanisms, the subproject proceeded to define the SVP model within that overall framework and to define a realisation of that as extensions to the C language. The internal SVP definition document has progressed through 5 versions and is now incorporated into this deliverable. Tools are now being developed based on this definition.
II.1.2. OE-level self-adaptation
In order to provide high-performance computational power to serve the increasing needs of large applications, people have striven to improve a single machine's capacity or have constructed distributed systems composed of a scalable set of processors. Compared to the former, where the improvement is mainly in hardware technology development, the construction of distributed systems for resource collaboration is more complex. Some well-known existing distributed systems comprising heterogeneous resources are Condor [50], NetSolve [51], Nimrod [52], and Globus and the Grid [53] computation environment. In the ÆTHER project, the system is hierarchical at both the functional and architectural level. At the functional level, an application self-adapts to suit its objective and to cope with dynamic changes in its environment. These applications are quite dynamic in nature, and the available parallelism must be mapped dynamically according to the resources available in the system. This means that the architecture of the system must also be hierarchical and dynamic. We can identify two kinds of resources, i.e. static and dynamic, that must be managed. Static resources are those which are always available in a given environment, while dynamic resources are less reliable in terms of their availability to the system. Moreover, each resource is not necessarily traditional, i.e. a single unit of computation. The resources are networks of a given number of SANEs, where a SANE is a self-adaptive networked entity which has the capability of executing all hierarchical functionalities, i.e. it can execute all or part of a task or unit of work.
II.2. State of the Art
II.2.1. Scheduling computation
Large distributed computing systems as envisaged by the ÆTHER project have the potential for unprecedented computing power. However, delivering this computing power to users is still an elusive problem. One of the major issues is how to schedule a large application in such a dynamic and self-adaptive distributed system. In general, scheduling applications in a distributed system is an NP-hard problem. Many heuristic scheduling algorithms and systems have been proposed to address situations where resources are distributed but where the availability of the computing resources is static. Unfortunately, most of the scheduling algorithms proposed so far are for dedicated resources. By dedicated, we mean that the resources are dedicated to a given application. However, most current distributed systems are non-dedicated, i.e. they are shared environments and are dynamically varying.
It is well known that the complexity of the general scheduling problem is NP-complete [5]. At the highest level,
a distinction is drawn between local and global scheduling. A local scheduling discipline determines how the processes resident on a single CPU are allocated and executed; a global scheduling policy, on the other hand, uses information about the system to allocate processes to multiple processors and optimize a system-wide performance objective. The next level in this hierarchy is a choice between static and dynamic scheduling. This choice indicates the time at which the scheduling decisions are made. In the case of static scheduling, information regarding all resources in the system, as well as all the tasks in an application, is assumed to be available at the time the application is scheduled. By contrast, in the case of dynamic scheduling, the basic idea is to perform task allocation on the fly as the application executes. This is useful when it is impossible to determine the execution time, the direction of branches and the number of iterations in a loop, as well as when jobs and/or resources arrive in real time. Static scheduling algorithms are studied in [55] and [56], while dynamic scheduling algorithms are presented in [57] and [58].
A further improvement is static-dynamic, or hybrid, scheduling. The main idea behind hybrid techniques is to take advantage of static schedules and at the same time capture any uncertain behaviour of an application and its resources. For the scenario of an application with uncertain behaviour, static scheduling is applied to those parts that always execute. At run time, scheduling is then done using statically computed estimates that reduce any run-time overhead. That is, static scheduling is done on the always-executed tasks, and dynamic scheduling on the others. For example, in those cases where some tasks have special QoS requirements, the static phase can be used to map the tasks with QoS requirements and dynamic scheduling can be used for the remaining tasks.
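As a hedged illustration of the hybrid approach (the task structure, resource names and greedy policy below are inventions of this example, not taken from the cited work), the static phase pins tasks with QoS requirements to pre-computed resources, and the dynamic phase assigns the remaining tasks to the least-loaded resource at run time:

```python
# Illustrative hybrid static/dynamic scheduler: tasks that always
# execute (here, those with QoS requirements) are placed statically;
# the rest are assigned on the fly to the least-loaded resource.

def hybrid_schedule(tasks, resources):
    load = {r: 0 for r in resources}
    placement = {}
    # Static phase: pin QoS tasks to their pre-computed resource.
    for t in tasks:
        if t.get("qos_resource"):
            placement[t["name"]] = t["qos_resource"]
            load[t["qos_resource"]] += t["cost"]
    # Dynamic phase: greedy least-loaded assignment for the others.
    for t in tasks:
        if t["name"] not in placement:
            r = min(load, key=load.get)
            placement[t["name"]] = r
            load[r] += t["cost"]
    return placement

tasks = [
    {"name": "video", "cost": 4, "qos_resource": "dsp"},  # QoS task
    {"name": "log", "cost": 1},
    {"name": "ui", "cost": 2},
]
plan = hybrid_schedule(tasks, ["cpu0", "dsp"])
```

Here the QoS task lands on its statically chosen resource regardless of load, while the others avoid it at run time.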
In dynamic-scheduling scenarios, the responsibility for making global scheduling decisions may lie with a centralised scheduler, or be shared by multiple distributed schedulers. Also, in any computational system, there may be many applications submitted for (re)scheduling simultaneously. The centralised strategy has the advantage of ease of implementation but suffers from a lack of scalability and fault tolerance, and from the possibility of becoming a performance bottleneck. For example, Sabin et al. [59] propose a centralised metascheduler, which uses backfill to schedule parallel jobs across multiple heterogeneous sites. By contrast, Arora et al. [60] present a completely decentralised, dynamic and sender-initiated scheduling and load-balancing algorithm for Grid environments.
Scheduling algorithms adopting an application-centric scheduling objective function aim to optimize the performance of each individual application, as do application-level schedulers. Scheduling algorithms adopting resource-centric scheduling objective functions aim to optimize the performance of the resources. Resource-centric objectives are usually related to some specific aspect of resource utilization, e.g. throughput, which is the ability of a resource to process a certain number of jobs in a given period, or utilization, which is the percentage of time a resource is busy.
Self-adaptive algorithms enable applications to adapt themselves to their operating environments. Adaptive examples are demonstrated in [61] and [62].
II.2.2. Thread models in a processor's ISA
Adaptation, if it is to be implemented at the hardware level, requires scheduling support within the machine model, and this in turn requires the dynamic description and management of concurrency to be central to that model. This rules out approaches where concurrency is compiler-scheduled. The SANE virtual processor must therefore provide a description of, and instructions to manage, concurrency at run time. To guide the definition of the virtual processor we have considered a range of thread-based architecture models. These include threaded architecture models that have the required dynamic scheduling characteristics implemented at the hardware level, commercial CMP designs based on the thread model, reconfigurable architecture models, including one based on software threads, and thread-based languages.
Intra-threads, or inthreads, are tiny threads running in the processor, allowing extremely low-level parallelization of the code [9]. They are small and low-level, with little to no overhead on context switching, and are used primarily to exploit loop-level parallelism. Thread suspension and wakeup are performed on data availability and must be explicitly programmed. This model requires extensions to the ISA but compilers can be developed to support legacy code. These threads are used to parallelize a single OS context.
The SuperThreaded architecture [10] [13] also exploits thread-level parallelism with multiple threads of control. A SuperThreaded processor consists of a number of processing elements connected to each other in a ring. Each element is able to execute a single thread and has its own register file, instruction cache and functional units. There is also a shared register file, and all processing elements share the level-1 data cache. The SuperThreaded architecture extends an ISA with a number of instructions used for managing threads (creation, termination,
etc.). Execution of a program starts from its entry thread, which can then spawn “successor” threads using the “fork” instruction. Like Intra-threads, this uses threads at a low level to exploit loop-level parallelism, with limited speedup and no scalability, using extensions to the ISA. However, neither model supports contextualisation or hierarchy, which would be required to capture maximal concurrency and give composition of concurrent programs, as would be required by ÆTHER.
The WELD architecture model is somewhat different [14] [15] and introduces multithreading in VLIW processors as a means to provide a compiler with more parallelisation options, so that it can schedule instructions more efficiently. In particular, every scheduling region in this model is a potential thread, which the compiler can use to fill the empty issue slots of a VLIW MultiOp (which would otherwise be no-ops) and thus achieve higher resource utilization and performance. Threads in this model are generated from a single, high-level context. Thread creation and synchronization are done by the compiler, which means that the model adds extensions to a base ISA for that purpose. Threads in this model are mainly used to speculatively execute different control paths; threads are interleaved or can be executed concurrently on a single pipeline (by filling the empty slots of a VLIW MultiOp from different threads). However, threads are speculative and there are significant problems with scaling in this model.
Microthreading is a thread model developed by one of the ÆTHER partners. It started as a model similar to those described above, i.e. as a low-level thread model suitable for hardware assist; the original paper on this work was published in 1996 [17]. However, subsequent research showed the limitations of such a model and it was generalised in later publications to provide contextualisation of threads, with ISA instructions supporting the creation of families of related threads, which could be executed concurrently on multiple processors and which exposed locality issues to the compiler [18] [20]. A keynote by Prof. Jesshope showed the advantages and potential disruption caused by such a revolutionary model [21]. These additions alone make this model a better candidate for the support of dynamic configuration through self-adaptation; there are also publications on the scalability of performance and implementation of this model [22] [25].
Even this model was still not suitable to support self-adaptation in the ÆTHER project: the concurrency expressed in all the above publications is of a single level and, even though a family may be potentially infinite, it fails to capture the many different levels of concurrency found in applications, namely at the task level, loop level and instruction level. Microthreading, as it existed prior to the start of the ÆTHER project, captured loop-level parallelism, with some ad hoc solutions to capture instruction-level parallelism. What is required is a more general extension to this model that applies composition of concurrency to all levels of parallelism in a system. Another limitation of this model was the lack of any self-awareness. It had mechanisms to define concurrency dynamically but did not provide any mechanisms for reflection. As will be seen in this report, both limitations have been addressed in the definition of the SVP.
II.2.3. Commercial threaded architecture models
A number of CMPs have been developed with threaded architecture models. For example, Sun's Niagara chip multiprocessor [40] supports up to 32 threads of execution, which it organizes into groups of four that share the same pipeline, referred to as the Sparc Pipe. Eight pipelines result in 32 simultaneous threads of execution. All pipelines share the same 3 Mbyte level-2 cache. This shared on-chip cache eliminates coherence misses found in conventional shared-memory multiprocessor systems and replaces them with low-latency shared-cache communication. Communication between the Sparc pipes and cache banks is achieved using a crossbar interconnect. The crossbar also provides a port for communication with the I/O subsystem. Arbitration for destination ports uses a simple age-based priority scheme that ensures fair scheduling.
This thread model is based on speculative threads, where sections of the code, such as loops and procedure calls, are executed speculatively and either contribute to the more rapid execution of the program or are quashed. In this thread model, each thread has a unique set of registers, instruction and store buffers, so there is support for contextualisation, albeit limited to four threads per processor.
IBM's Cell Broadband Engine processor [41] [44] is a multicore chip comprising a 64-bit Power Architecture processor core and eight synergistic processor cores. These cores, the thread processors, are capable of massive floating-point processing performance and are optimized for compute-intensive workloads and broadband rich-media applications. A high-speed memory controller and a high-bandwidth bus interface are also integrated on chip. The multicore architecture has a high-speed communications network that is a slotted ring. Four slots are available, two in each of the clockwise and counterclockwise directions. The Cell processor architecture is OS-neutral and supports multiple operating systems simultaneously. Currently the thread model is software-based and each
core supports only one thread of execution. The thread model can be programmed in a range of programming paradigms, and the threads are distributed one per processor and execute concurrently on the eight processors. There is no context switching between threads; threads are effectively placeholders for computation and necessarily static in nature. Neither this nor the Niagara provides any real support for the goals of the ÆTHER project.
II.2.4. Reconfigurable architecture models
A number of reconfigurable architecture models that have some relevance to the SVP definition have been described in the literature. Most are partitioned conventional/accelerator approaches. ADRES [26] [27] is an architecture combining tightly coupled VLIW processors and coarse-grained reconfigurable arrays. The coarse-grained reconfigurable array is intended to execute only the computationally intensive kernels of applications, while a host processor, typically a RISC processor, executes the remaining parts of the application. By using a VLIW instead of a RISC, the limited parallelism available in the parts of the code that cannot be mapped to the reconfigurable part can be exploited.
MorphoSys [28] [30] is a reconfigurable architecture for computation-intensive applications that combines both coarse-grain and fine-grain reconfiguration techniques to optimize hardware, based on the application domain. M2, the current implementation, is developed as an IP core. The approach uses both software and hardware components, and the latter are targeted to a data-parallel model of computation.
All of these are low-level paradigms that do not directly support the ÆTHER goals of self-adaptivity. The hThreads architecture [34] [38] is more interesting. This is also a hybrid CPU/FPGA architecture, but it allows both software threads and hardware threads to be defined in the same code and provides support in the model to overlap the execution of both, which is atypical in such models. The model provides a thread manager, a scheduler and hardware support for mutex management and condition variables. This programming model is compatible with POSIX threads and makes it easy for new developers who are already familiar with POSIX threads to develop applications for hThreads. However, this thread model does not have the advantages of the microthreaded one.
The XPP [39] is a runtime-reconfigurable data processing architecture that has many of the characteristics required for self-adaptivity. It is based on a hierarchical array of coarse-grain, adaptive computing elements, and a packet-oriented communication network. The strength of the XPP technology originates from the combination of array processing with unique and powerful runtime reconfiguration mechanisms. Parts of the array can be configured rapidly in parallel while neighboring computing elements are processing data. Reconfiguration is triggered externally or even by special event signals originating within the array, enabling self-reconfiguring designs. However, while the XPP architecture is designed to support different types of parallelism, it has no consistent programming model.
II.2.5. Threaded language models
Split-C is a language model based on the C programming language [45], developed at Berkeley (see: http://www.eecs.berkeley.edu/Research/Projects/CS/parallel/castle/splitc/). It is a parallel extension of C that supports access to a global address space on distributed-memory multiprocessors. The programmer specifies concurrency in the code using new keywords. The compiler is based on gcc, and the language captures certain useful elements of shared-memory, message-passing and data-parallel programming to provide efficient access to the underlying machine. It differs from previous shared-memory languages by providing a rich set of synchronisations on memory. It supports thread creation but has no notion of mapping threads to hardware. There is also a barrier construct that controls termination of related families of threads.
OpenMP [46] is the most popular non-message-passing concurrency model. It is an open-source project that targets shared-memory multiprocessors (see: http://www.openmp.org/). Unlike Split-C, OpenMP uses pragmas to identify concurrency and synchronisation in the code (either C, C++ or FORTRAN). It requires a combination of compiler directives (the pragmas), associated library routines, and environment variables, which specify shared-memory parallelism in programs.
UPC [47] is another extension to C developed at George Washington University and Berkeley (see http://upc.lbl.gov/ for the language and http://www.intrepid.com/ for an example compiler). UPC uses a combination of keywords and pragmas to specify concurrency in C code. The compiler is based on Open64 and gcc. UPC is a descendant of the Split-C language and extends the C programming language, ISO C99, for high-performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware.
III. The Abstract Model SANE Virtual Processor – SVP
III.1. An introduction to the SANE virtual processor SVP
This section defines the abstract SANE virtual processor model that supports and enables the ÆTHER project's goals. Section III.2 gives an informal semantics of the model and the following section, Section III.3, describes a more detailed realisation of it, based on introducing the abstract concurrency controls defined in the model definition into the C language. Other realisations, both more abstract and more concrete, are possible.
SVP is a model of concurrent computation. It deals with the composition and management of dynamically created concurrent programs. One of the key goals is to have a model that can be concurrently and safely composed. This means programs must be free of deadlock by design and must allow safe composition, so that given two or more SVP programs, those programs can be composed into a third SVP program and the resulting program will be well behaved, i.e. deadlock-free, and deterministic if the two programs were composed deterministically. Determinism means that the program should always give the same result, a key property of the sequential paradigm. This is illustrated in Figure 1. Here two programs A and B are concurrently composed into a third, A||B, where the nodes in the tree represent threads (leaf nodes typically perform computation) and where branching at a node represents concurrent subordinate threads.
Computation in SVP is captured and communicated as a packet of information, which identifies a place, somewhere either local or remote, where the concurrent program will execute. The communication, if required, comprises data, metadata (e.g. mission goals) and some definition of its functionality. This packet defines a unit of work, which is an invocation from a thread of all its subordinate threads. This is illustrated in Figure 2.
Figure 1. Composition of concurrent programs: Program A and Program B composed as A||B (dependencies may exist, but only between nodes at one level).
Figure 2. Terminology used in the SVP model:
· Family of threads: all threads at one level.
· Unit of work: a sub-tree, i.e. a thread's subordinate threads; may be considered as a job or a task.
· Place: the processor(s) where a unit of work is sent to execute.
In this model, self-adaptation is achieved using one of two different mechanisms. The first mechanism is delegation, where the place is defined remotely; it is analogous to a remote procedure call (RPC). The unit of work is communicated to the new computing resources for execution. To achieve this, protocols must be defined in an implementation to define the place, i.e. resource allocation, as well as to implement the delegation. Depending on the implementation, the protocols may be in hardware or software or a combination of both. It should be noted that a place is an abstract concept, which may be a processor or group of processors, some FPGA cells, etc. The concept is also overloaded to include a service at that place and to achieve security in the system. Thus a place represents the union of the following concepts and is a very powerful abstraction:
· an address for communication, defining the physical place in an implementation (may be local);
· a capability to execute the unit of work at that place;
· a service provided at that place, which may implement mutual exclusion if required, e.g. resource allocation.
The second mechanism for implementing self-adaptation in the model is the notion of preemption and termination of a concurrent program. This allows a place where a program is executing to be retracted and, in the case of preemption, allows the program to continue executing at another place.
It is clear that processors, clusters of processors, SANE processors, and even groups of dumb FPGA cells can all be considered to be defined by this abstraction of a place.
To be more specific, in SVP the unit of work is represented as a parameterised family of threads that has collective properties defining its computation and mission goals. These goals will be the triggers for adaptation in response to the environment in which it is executed. Because these threads will be required to capture simple hardware semantics, the thread in this model is blocking, which means that it will only proceed provided that its operations have data to consume. This dataflow model has been adopted as an important concept in SP1. This choice of thread will allow low-level implementations, in configurable logic, of the leaf threads in the concurrency tree (see Figure 1 and Figure 2). These are typically the computationally intensive threads. However, in order to provide a scale-free model, the notion of a thread is invariant throughout the hierarchy in the concurrency tree.
Many low-level thread models have been identified in Section II.2.2, but the model we have adopted here is based on microthreads, which were first proposed ten years ago [17], which have since evolved, and which are known to have efficient implementations that support threads as small as a single arithmetic operation in conventional processors. In this model, families of threads are created dynamically, assigned some computational resources, perform some work and then return some results. It is this grouping of computation into parameterised families of threads that distinguishes it from all other thread-based models of computation and which defines and enables the mechanisms for self-adaptation.
A family's functionality is specified by an ordered set of identical concurrent threads, where each thread has knowledge of an index value defining its position within that order. Using this index value, it is possible to define concurrent units of work that are dynamic in extent and homogeneous, as well as static in extent and heterogeneous (a combination of the two may also be defined). Heterogeneity is provided in the latter case by branching on a thread's index value.
A thread's definition also specifies the data inputs and outputs of the family, and family termination defines a synchronisation point with respect to the family's outputs. Within a family, threads may communicate and synchronise with each other on scalar items of data, allowing the family to encapsulate regularity and locality. Good analogies to the family can be found in the sequential or von Neumann model of computation, i.e. families are analogous to loops and function calls, which capture the regularity and locality in that model. The model in [17] has been extended in this project to allow hierarchy, i.e. to allow threads to create further families of threads, and to provide the concurrency controls defining preemption and forced termination of families.
The protocol for the creation of a family in the most general case, i.e. the communication between a thread and a place required to execute the family the thread creates, proceeds as follows (see also Figure 3):
1. the thread acquires a place to execute from the resource manager, a part of the system environment (this step is optional, as it may use existing resources);
2. if necessary, the thread copies the family's inputs to the resource's memory if it is distributed, or ensures its consistency if shared;
3. the thread then defines the family by identifying its code, parameters, metadata and even abstract mission goals. It then continues until it has to synchronise on the results from the family it created;
4. in the meantime the place creates the ordered set of threads, subject to resource constraints, where each thread has a context of synchronising memory and where each thread may communicate and synchronise with its neighbour in create order;
5. finally, when the family completes, it copies the results back to the creating thread's memory if distributed, or ensures consistency if shared. It then notifies the creating thread of the completion of the unit of work;
6. the creating thread may then use the results, perhaps relinquishing the additional resources it may have acquired by a further communication with the resource manager.



Figure 3. Diagram illustrating the communication required when one thread acquires resources and delegates the creation of a family to those resources.
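The six-step protocol can be sketched as follows. This is a minimal, purely illustrative model: the Place and ResourceManager classes and the thread-body shape are assumptions of the sketch, not part of the SVP realisation.

```python
# Illustrative walk-through of the six-step family-creation protocol
# with a distributed-memory place. All names are hypothetical.

class Place:
    def __init__(self, name):
        self.name = name
        self.memory = {}          # the place's local (distributed) memory

    def run_family(self, code, start, limit):
        # Step 4: create the ordered set of threads in index order.
        for index in range(start, limit):
            code(self.memory, index)

class ResourceManager:
    def __init__(self, places):
        self.free = list(places)

    def acquire(self):            # used in step 1
        return self.free.pop()

    def release(self, place):     # used in step 6
        self.free.append(place)

def creating_thread(manager, inputs):
    place = manager.acquire()                 # 1. acquire a place
    place.memory.update(inputs)               # 2. copy family inputs

    def body(mem, index):                     # 3. define the family
        mem[index] = mem["scale"] * index

    place.run_family(body, 0, 4)              # 4. place runs threads
    results = {i: place.memory[i] for i in range(4)}  # 5. copy back
    manager.release(place)                    # 6. relinquish place
    return results

mgr = ResourceManager([Place("p0"), Place("p1")])
out = creating_thread(mgr, {"scale": 3})
```

In a real implementation steps 3 and 4 overlap (the creating thread continues until it must synchronise); the sequential sketch only shows the data movement at each step.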
III.2. Informal Semantics of the SANE Virtual Processor SVP
This section gives an informal semantics of the model by defining operations on the state of the SVP. It should be emphasised that it is only the operations managing concurrency, communication and synchronisation which require specific definition; all arithmetic and logical operations within the model have the same semantics that they would have in a sequential model for a given data representation. There is only one addition to their semantics, and that involves synchronisation. All operations must synchronise on their operands, and a thread stalls until an operation's operands have been defined. This is important in a concurrently executing model, as operands may come from other threads or data may be distributed. If the data is not available, then the thread performing that operation will block and no subsequent operations will proceed until the prior operation has obtained its data and has completed. Note that the SVP refers to a collection of memories defining its state and the processing agents that modify that state.
III.2.1. The state of a SANE processor
The state of a SANE processor at any given instant in time is defined by two abstractions. The first and most persistent is an asynchronous shared memory, which comprises a number of addressed locations, each of which may be read and written by a thread, subject to certain constraints. These constraints are imposed by the weakly consistent nature of this memory, due to its asynchronous nature and concurrent update by multiple threads. No guarantees can be given about the access time to this memory. The second abstraction is a context of synchronising memory associated with each thread in a family, which provides synchronisation of scalar values between threads in the family and between a thread and any input data from the shared memory.
III.2.1.a. Shared memory
As described in the introduction, the model is based on units of work, which comprise families of threads. A
family has inputs and outputs defined by the thread body used to create it; these inputs and outputs are the locations in shared memory that the threads read and write. This state is completely defined at the termination of a unit of work. At any other time, because we are dealing with a potentially asynchronous concurrent system, there will always be a subset of those locations whose state cannot be determined. The weak memory consistency is managed with bulk synchronisation on shared memory at the termination of a family of threads. Values written by threads in a family are only guaranteed to be well defined at this point. Thus outputs from the family are only defined when the family has completed, i.e. when all of its threads have terminated. The locations that cannot be fully defined at any time are determined by the writes to memory made by those families that have been created and which have not yet completed.
Note that this abstraction of shared memory may be implemented in any manner, including using distributed memories and message passing between them. In the latter case, before a family of threads is created at a place, a synchronising communication must provide the place with the data that the family will read as input, and before the family can be deemed to have synchronised, all of its output must be returned. Figure 3 showed a distributed view of family creation; Figure 4 gives the corresponding shared memory view.


Figure 4. Shared memory view of creating threads in the model.
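A minimal sketch of this bulk synchronisation, assuming a simple dictionary-based memory (the class and method names are inventions of this example): writes made by a family are buffered and published to the visible shared state only when the whole family has terminated.

```python
# Illustrative bulk-synchronised shared memory: a family's writes are
# buffered per family and become visible (well defined) only at the
# family's termination barrier.

class BulkSyncMemory:
    def __init__(self):
        self.visible = {}   # state defined at family termination
        self.pending = {}   # per-family buffered writes

    def write(self, family, addr, value):
        self.pending.setdefault(family, {})[addr] = value

    def read(self, addr):
        # Pending writes of incomplete families are not observable.
        return self.visible.get(addr)

    def family_terminated(self, family):
        # Barrier: publish all of the family's writes at once.
        self.visible.update(self.pending.pop(family, {}))

mem = BulkSyncMemory()
mem.write("f1", 0, 42)
before = mem.read(0)          # family f1 has not completed yet
mem.family_terminated("f1")
after = mem.read(0)           # output defined at termination
```

The point of the sketch is the ordering guarantee, not the data structure: between creation and termination the location's state is simply undetermined.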
III.2.1.b. Synchronising memory
Synchronising memory provides the mechanism by which threads within a family can synchronise with each other and with their creating thread. It also provides the mechanism by which the shared memory and processing agents can synchronise; remember that no assumptions can be made about the access times to shared memory. The concept of synchronising memory is dynamic and transient. An implementation must provide semantics equivalent to threads being allocated a context of local scalar variables, which are discarded when a thread completes. It is assumed that in an implementation this memory is fast, distributed and “close” to the processor or logic cells implementing the thread. Threads may also be sequentialised in their index order if this memory is limited. So, during the execution of a family, there may be only a subset of the family's threads active and contributing to this synchronising state. Thread creation must be in index order, and the amount of synchronising memory defines the extent of concurrency at that place.
All arithmetic and logical operations in SVP are performed between values stored in synchronising memory, and values in shared memory are transferred to synchronising memory prior to the execution of these operations. Each location provides a blocking read, or dataflow-like synchronisation. Dependencies between operations in different threads are enforced by this synchronising memory, as are synchronisations on loads from shared memory. A constraint is imposed on the dependencies that can be defined between threads in a family, which arises from the combined requirements of locality and deadlock freedom. A thread may only have a dependency on values produced by its predecessor in the family's index sequence. This ensures that dependencies between operations in a family of threads can be represented as an acyclic graph, which in turn ensures freedom from communication deadlock in the model.
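The predecessor-only rule can be illustrated with a sequential sketch (the function name and shape are assumptions of the example, not the SVP realisation): each thread reads one scalar carried from its predecessor, the chain is initialised from the creating thread's context, and the last thread's value is returned on termination; a running sum makes the linear dependency chain explicit.

```python
# Illustrative predecessor-only dependency chain: thread i may read one
# scalar produced by thread i-1, so the dependency graph is a line and
# therefore acyclic. Here the chain computes a running sum.

def run_dependent_family(initial, values):
    # `initial` plays the role of the scalar from the creating
    # thread's context that initialises the dependency chain.
    carried = initial
    outputs = []
    for v in values:              # threads executed in index order
        carried = carried + v     # read predecessor's value, write own
        outputs.append(carried)
    # On termination the last thread's value is returned to the
    # creating thread via the location that initialised the chain.
    return outputs, carried

partials, total = run_dependent_family(10, [1, 2, 3, 4])
```

Because each thread depends only on its immediate predecessor, a hardware implementation can overlap the independent parts of neighbouring threads; the sketch serialises them only for clarity.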
These acyclic dependency graphs are initialised and synchronised with scalar variables from the creating thread's context, and dependent values generated by the last thread are available to the creating thread on the family
termination, via the same location or locations that initialised the dependency graph. It is not possible to synchronise between threads in two families except in initialising a dependency chain, i.e. between the creating thread and the family it creates.
III.2.2. A family of threads
A family of threads is an ordered set of identical threads, where each thread has knowledge of its own index value in the index sequence. This knowledge allows both homogeneous and heterogeneous families to be defined on the same thread, as the index value may be used in controlling the statically defined action of a thread. This therefore supports the following models of concurrency:
· heterogeneous: where the ordered set of threads is statically defined;
· homogeneous: where the ordered set of threads may be defined dynamically; or
· a combination of both in a single family.
The index values over the ordered set of threads are defined by an arithmetic sequence specified by a start index, a constant difference between successive index values and an optional maximum index value.
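Anticipating the µTC notation defined in Section III.3, such an index sequence might be specified at family creation. The sketch below is illustrative only: the ordering of the create parameters and the thread name are assumptions, not the definitive syntax.

```
/* Illustrative sketch only: a homogeneous family whose index
   sequence is 0, 2, 4, 6, 8 (start 0, limit 8, step 2).      */
thread void touch(void)
{
    index i;      /* this thread's value in the index sequence */
    /* the thread's statically defined action may depend on i  */
}

family fid;
create(fid; 0; 8; 2;) touch();   /* start; limit; step (assumed order) */
sync(fid);                       /* wait for the family to terminate   */
```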
A family may be considered to be a task or job that is executed at a place and reads and writes to shared memory. The processor, processors or cells that represent the place at which the family is executed are identified by a specific property associated with the family's definition. Another property associated with the family will define its functionality, i.e. what the thread does. This may be defined as an FPGA bit string, code compiled for a particular ISA, source code or even an abstract description of its behaviour, where this can be used to define or program one of the processing agents that comprise the place. There will be other properties or metadata associated with a family of threads, which may define its goals in terms of performance, power budgets, reliability etc. All properties, including its functionality, may be dynamically defined, i.e. written by the thread creating it. How dynamic the model is will depend on the implementation.
When a family of threads is created, a unique value is associated with that family that is used to identify and manage it. That value includes a capability, which can be used to authenticate any action another thread may impose upon the identified family; thus security is implemented at the lowest level of the system. The actions that are permitted on a family by the capability are forced termination and a form of pre-emption, called squeezing the family.
III.2.3. A thread
A thread is a sequence of operations (including reading and writing shared memory) defined on a collection of local scalar variables, which are its context in synchronising memory, and which become available when it is created and are discarded when it is complete. Dependencies between threads in the same family must be defined on this context. The unique index value identifying a thread is also defined as a part of this context and may be used in defining its functionality. A thread executes its operations only when values are defined in its context of synchronising memory. A thread may create subordinate families of threads but cannot terminate without first obtaining a synchronisation on the termination of the subordinate family.
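In the µTC terms of Section III.3, this last rule might be sketched as follows. The create/sync syntax shown is illustrative and the thread names inner and outer are hypothetical.

```
/* Illustrative sketch: a thread creating a subordinate family. */
thread void inner(void) { /* ... */ }

thread void outer(void)
{
    index  i;
    family sub;
    create(sub;) inner();  /* subordinate family of this thread    */
    sync(sub);             /* outer cannot terminate before this
                              synchronisation on sub's termination */
}
```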
III.2.4. Synchronising a family of threads
A synchronisation on a family of threads is the event defined by the termination of all threads within the family. Threads complete either by exhausting their operations or through the execution of one of a number of signals sent to their family. On synchronisation of a family, all of its dynamic synchronising state is lost and all of the shared memory state that it has modified becomes defined. A synchronisation yields a return code and a return value to the creating thread. The return code identifies what signal, if any, caused termination, and the return value is set when breaking the execution of a family or when squeezing a family.
III.2.5. Breaking the execution of a family of threads
Any thread may execute a break signal to its family. The result for a family receiving this signal is to cease the creation of any new threads, terminate all currently active threads and to discard all of the synchronising state for the family. A return value is set by the thread executing the break signal (which may require arbitration) and the return code identifies that the family was terminated with a break signal. Both the return code and the return
value are available at the synchronisation of the family in the creating thread. This concept is an important one, as it allows for the creation of an "infinite" number of threads in a family, which is terminated when some dynamic condition is met. This construct, then, is analogous to dynamically terminated loops in the sequential machine model, e.g. while loops.
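For example, a while-loop-like computation might be expressed in the µTC notation of Section III.3 as an unbounded family terminated by break. The sketch is illustrative only; next and converged are hypothetical helpers standing for the loop body and termination test.

```
/* Illustrative sketch: an "infinite" family terminated dynamically,
   analogous to a while loop in the sequential machine model.       */
thread void step(shared int state)
{
    index i;
    state = next(state);      /* next() is a hypothetical transition   */
    if (converged(state))     /* dynamic termination condition         */
        break;                /* ceases thread creation, terminates
                                 active threads, sets the return value */
}
```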
III.2.6. Killing a family of threads
Any thread that can identify a family and the place where it is executing, and can provide the capability generated on its creation, may send a kill signal to that family and force its termination. Higher-level protocols for implementing this signal, i.e. not at the hardware level, may provide further restrictions on the capability to issue this signal for the security of the model; this is not considered further in this report. The result for a family receiving this signal is to cease the creation of new threads, terminate all currently active threads and to discard all of the synchronising state for the family. In addition, the family will send a kill signal to any child families created by its threads, resulting in the termination of all families defined below the killed family in the concurrency tree. No return value is set but a default value can be specified, and the return code identifies that the family has been terminated with a kill signal. Both the return code and the return value are available at the synchronisation of the family in the creating thread.
III.2.7. Squeezing a family of threads
Any thread that can identify a family and the place where it is executing, and can provide the capability generated on its creation, may send a squeeze signal to that family and force its termination, while maintaining all of the family's state. This is a form of pre-emption of the unit of work that the family represents, and it allows the family to be restarted by recreating it using the state captured when it was squeezed. Higher-level protocols for implementing this signal, i.e. not at the hardware level, may provide further restrictions on the capability to issue this signal for the security of the model; this is not considered further in this report. The result for a family receiving this signal is to cease the creation of new threads, allow all currently active threads to complete normally and then to discard all of the synchronising memory for that family. In addition, the family will send a squeeze signal to any child families created by its threads, but only if the thread is labelled as being squeezable. This will result in the termination of families defined below the squeezed family in the concurrency tree, but only to a user-defined level.
The return value set on a squeeze signal depends on whether the thread is squeezable, i.e. whether it passes the squeeze signal on to its subordinate threads. If the thread is squeezable, the return value is defined as the index of the first thread in index sequence to have been allocated a context of synchronising memory but which has not yet terminated normally. This index partitions the family into threads that have completed and threads that must be re-executed, some of which will have been partially executed and some of which may even have completed. A squeezable thread will update the index values of its subordinate family after propagating the squeeze signal to them, so that when re-executed the program remains deterministic.
Figure 5 illustrates the three partitions defined on a family on receiving a squeeze signal. The first identifies those threads that have all completed; these are not re-executed. The second comprises those threads that have been allocated contexts and which may already have completed; these are allowed to complete, if necessary by squeezing their subordinate families, but they will all be re-executed when the family is recreated. This partition is identified by at least one thread (the first) that has not yet completed. Finally, there are those threads in the family not yet allocated a context of synchronising memory, which will of course need re-execution when the family is recreated. Both a return code that indicates the family was squeezed and a return value, which is the index of the first thread allocated that did not terminate normally, are available at the synchronisation for the family for a squeezable thread. This code is used to adjust the thread's control structure for re-creation.
Figure 5: The return value of a squeeze operation on a squeezable thread, the first allocated thread that did not terminate normally, partitioning the thread index sequence into (1) threads terminated normally, (2) threads allocated contexts, and (3) threads not allocated contexts.

If the family is not squeezable, i.e. it will not pass the squeeze signal on to a subordinate family, then all threads in the middle
partition will eventually terminate normally, and the return value can be set to the index of the first thread not to have been allocated a context of synchronising memory. Thus the only threads that need to be re-executed in a non-squeezable (computational) thread are those that have not been allocated their context.
On synchronisation, the value of any dependency at the squeeze return index is saved to the variable that initialised the dependency chain in the creating thread, again allowing the family to be recreated from the squeeze index position to complete it.
Squeezing a family of threads allows the task defined by that family to be pre-empted and restarted at a different place. By using the concept of a squeezable thread, a disciplined approach to the rapid pre-emption of a concurrently executing program can be provided in the SANE model. This notion is at the heart of the model and provides one of the most important features for implementing self-adaptivity.
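This pre-empt-and-restart pattern might be sketched as follows in the µTC notation of Section III.3. The parameter order of create, the way the return value is obtained at synchronisation, and the names work, p1, p2, N and restart are all illustrative assumptions.

```
/* Illustrative sketch: squeezing a family at place p1 and
   recreating it at place p2 from the squeeze return index. */
family fid;
create(fid; p1; 0; N-1; 1;) work(s);   /* family running at p1 */

/* elsewhere, a thread holding fid's capability: */
squeeze(fid);

/* in the creating thread, the synchronisation delivers the return
   code (squeezed) and return value: the index of the first allocated
   thread that did not terminate normally. Assume it is in restart.  */
create(fid; p2; restart; N-1; 1;) work(s);  /* resume at a new place */
sync(fid);
```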
III.3. µTC: a Realisation of the SANE Virtual Processor
III.3.1. What is a realisation of the SVP?
A realisation of the SVP model is a definition of it at some level of abstraction to support the creation of tools to implement it. It is not necessarily an implementation of the model per se, although it may be more concrete in its definition. For example, this report defines a realisation of SVP based on the addition to the C language of a set of concurrency controls supporting the above model. Microthreaded microprocessors [17] [25] are another realisation, defined at a processor's ISA level. We have called the language defined here microthreaded C, or µTC for short. It has been specified for the purpose of further defining the model by experimentation and exploration, and to act as a compiler target for the work being undertaken in SP2, SP3, and possibly in SP1.
µTC is similar to other concurrent languages based on C, such as OpenMP for C [46], UPC [47] and others described in Section II.2.5. However, no other language that we know of implements concurrency in such a dynamic manner, using the concept of identified families of threads with pre-emption.
Numerous meetings and exchanges of ideas have occurred between SP2 and SP3 in order to ensure that this model and its realisation support the concepts for self-adaptive software development being defined in SP3. The goal is to use µTC as the compiler target both for conventional languages, such as C, in which basic algorithms may be defined, and for the coordination language S-Net, being developed in SP3, which supports the project's goals of self-adaptation at the software level. µTC is also a relatively high-level description of a computation captured in the SVP model, which can be used by compilers or synthesisers in implementing that computation on reconfigurable devices. Thus, µTC is a target for two or more conceptually different languages, and compilation from µTC can be targeted to conventional processors, embedded processors in FPGAs, arrays of reconfigurable microthreaded processors [48] or to logic for the SP1 FPGA platform.
Both SVP and µTC are based on the concepts developed in UvA's prior work on microthreaded microprocessors, and in that work it has been shown that the structures required to efficiently manage families of threads are modest and scalable [25]. That work can therefore provide a guide to the efficient implementation of schedulers and synchronisers in other hardware implementations of the SVP model.
III.3.2. Additional constructs and concepts in µTC
The additions to C that define µTC and capture the SVP abstraction are defined in this section. These constructs are all based on the concepts described in Section III.2, i.e. being able to dynamically create and identify a unit of work that is a parameterised family of threads. The basic type that supports this concept is the family identifier, which is a new type added to the C language. An implementation of µTC will encode this to identify family, location and capability, the latter being a random number generated on family creation. The constructs defined below allow the creation and termination of identified families of threads, which may be defined hierarchically. The concurrency captured at any stage in the execution of the program is then defined by the tree of families with the original job or task at its root.
µTC adds the following keywords to standard C. An informal syntax and semantics of each construct is given in the following sections. The new keywords are:
create Construct used to create a family of concurrent microthreads;
thread Type qualifier identifying functions as threads;
squeezable Function qualifier identifying threads that propagate a squeeze signal to subordinate families;
shared Type qualifier for variables of a type that will be shared between microthreads;
index Type specifier for variables that will represent the index value of a thread;
sync Construct used to identify the termination of a specified family;
break Construct used to terminate the execution of a family from one of its threads;
squeeze Construct used to pre-empt the execution of a specified family so that it may be restarted without loss of state;
kill Construct used to terminate a specific family with prejudice;
family Type specifier used to specify a variable that identifies a family of threads, n.b. no operations are defined on this type;
place Type specifier used to specify a variable that identifies a place at which to execute a family of threads, n.b. no operations are defined on this type.
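To illustrate how several of these keywords combine, the following sketch defines a hypothetical inner-product computation as a family of threads carrying a dependency through a shared variable. The exact create parameter order is an illustrative assumption.

```
/* Illustrative sketch: an inner product as a family of threads.
   Each thread depends on its predecessor through the shared sum. */
thread void dot(shared double sum, double *a, double *b)
{
    index i;              /* this thread's index value           */
    sum += a[i] * b[i];   /* dependency on the predecessor's sum */
}

double ip(double *a, double *b, int n)
{
    family fid;
    double sum = 0.0;     /* initialises the dependency chain     */
    create(fid; 0; n - 1; 1;) dot(sum, a, b);
    sync(fid);            /* sum is defined after synchronisation */
    return sum;
}
```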
III.3.3. Memory and synchronisation in µTC
µTC supports two kinds of memory (analogous to registers and main memory in the sequential machine model). They are called the synchronising memory, which holds the threads' scalar contexts, and shared memory.
Shared memory supports bulk synchronisation between families of threads and provides the permanent state of a computation. Logically it is accessed by a common address space, although in practice this may not be the case.