Self-Learning, Adaptive Computer Systems - ICRI-CI

16 Oct 2013

Self-Learning, Adaptive Computer Systems

Intel Collaborative Research Institute: Computational Intelligence

Yoav Etsion, Technion CS & EE
Dan Tsafrir, Technion CS
Shie Mannor, Technion EE
Assaf Schuster, Technion CS

Adaptive Computer Systems

The complexity of computer systems keeps growing:
- We are moving towards heterogeneous hardware
- Workloads are getting more diverse
- Process variability affects the performance/power of different parts of the system

Human programmers and administrators cannot handle this complexity.

The goal: adapt to workload and hardware variability.



Predicting System Behavior

When a human observes the workload, she can typically identify cause and effect:
- The workload carries inherent semantics
- The problem is extracting them automatically...

Key issues with machine learning:
- Huge datasets (performance counters; execution traces)
- Extremely fast response times needed (in most cases)
- Rigid space constraints for ML algorithms






Memory + Machine Learning: Current State of the Art

Architectures are tuned for structured data, managed using simple heuristics:
- Spatial and temporal locality
- Frequency and recency (ARC)
- Block and stride prefetchers

Real data is not well structured:
- The programmer must transform the data
- Unrealistic for program-agnostic management (swapping, prefetching)



Memory + Machine Learning: Multiple Learning Opportunities

Identify patterns using machine learning; bring data to the right place at the right time.

The memory hierarchy forms a pyramid:
- Caches / DRAM, PCM / SSD, HDD
- Different levels require different learning strategies
- Top: smaller, faster, costlier [prefetching to caches]
- Bottom: bigger, slower, cheaper [fetching from disk]

Both hardware and software support are needed.




Research Track: Predicting Latent Faults in Data Centers

Moshe Gabel, Assaf Schuster


Failures and misconfiguration happen in large datacenters. Do they cause performance anomalies?

A sound statistical framework to detect latent faults:
- Practical: non-intrusive, unsupervised, no domain knowledge
- Adaptive: no parameter tuning, robust to system/workload changes

Latent Fault Detection



Applied to a real-world production service of 4.5K machines:
- Over 20% of machine/software failures were preceded by latent faults
- Slow response times; network errors; long disk access times
- Failures predicted 14 days in advance, with 70% precision and 2% FPR

Latent Fault Detection in Large Scale Services, DSN 2012



Research Track: Task Differentials

Dynamic, inter-thread predictions using memory access footsteps

Adi Fuchs, Yoav Etsion, Shie Mannor, Uri Weiser

Motivation

We are in the age of parallel computing:
- Programming paradigms are shifting towards task-level parallelism
- Tasks are supported by libraries such as TBB and OpenMP
- Implicit forms of task-level parallelism include GPU kernels and parallel loops

Task behavior tends to be highly regular, making it a target for learning and adaptation.


...
GridLauncher<InitDensitiesAndForcesMTWorker> &id =
    *new (tbb::task::allocate_root())
        GridLauncher<InitDensitiesAndForcesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(id);

GridLauncher<ComputeDensitiesMTWorker> &cd =
    *new (tbb::task::allocate_root())
        GridLauncher<ComputeDensitiesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(cd);
...

Taken from: PARSEC.fluidanimate, TBB implementation

How do things currently work?

- The programmer codes a parallel loop
- SW maps multiple tasks to one thread
- HW sees a sequence of instructions
- HW prefetchers try to identify patterns between consecutive memory accesses

There is no notion of program semantics, i.e. that the execution consists of a sequence of tasks, not instructions.


[Figure: the hardware sees a sequence of task instances: A, B, C, A, B, C, D, E, E]

Task Address Set




Given the memory trace of task instance A, the task address set T_A is the unique set of addresses, ordered by access time:

Trace:
START TASK INSTANCE(A)
R 0x7f27bd6df8
R 0x61e630
R 0x6949cc
R 0x7f77b02010
R 0x6949cc
R 0x61e6d0
R 0x61e6e0
W 0x7f77b02010
STOP TASK INSTANCE(A)

T_A: 0x7f27bd6df8, 0x61e630, 0x6949cc, 0x7f77b02010, 0x61e6d0, 0x61e6e0

Address Differentials

Motivation: task instance address sets are, by themselves, usually meaningless.


T_A:            Δ:           T_B:            Δ:           T_C:
7F27BD6DF8   + 0        =  7F27BD6DF8   + 0        =  7F27BD6DF8
61E630       + 8000480  =  DBFA10       + 8000480  =  1560DF0
6949CC       + 54080    =  6A1D0C       + 54080    =  6AF04C
7F77B02010   + 8770090  =  7F7835F23A   + 8770090  =  7F78BBC464
61E6D0       + 456      =  61E898       + 456      =  61EA60
61E6E0       - 1808     =  61DFD0       - 1808     =  61D8C0

The differences tend to be compact and regular, and can thus represent state transitions.

Address Differentials

Given instances A and B, the differential vector is the element-wise difference of their address sets:

Δ_AB[i] = T_B[i] - T_A[i]

Example:


T_A:  10000  60000  8000000  7F00000  FE000
T_B:  10020  60060  8000008  7F00040  FE060

Δ_AB: 32, 96, 8, 64, 96

Differentials Behavior: Mathematical Intuition

- Using differentials is beneficial in cases of high redundancy.
- The application's distribution functions provide intuition on vector repetitions:
- Non-uniform CDFs imply highly regular patterns.
- Uniform CDFs imply noisy patterns (differential behavior cannot be exploited).

[Figure: non-uniform vs. uniform differential CDFs]

Differentials Behavior: Mathematical Intuition

- Given N distinct vectors, a straightforward dictionary index needs R = log2(N) bits per vector.
- The entropy H is a theoretical lower bound on the representation size, based on the distribution:

  H = -Σ_i p_i · log2(p_i)

- Example, assuming 1000 vector instances with 4 possible values: R = 2.
- The Differential Entropy Compression Ratio (DECR) is used as the repetition criterion:

  DECR = (R - H) / R




Differential Value           #instances   p
(20, 8000, 720, 100050)      700          0.7
(16, 8040, -96, 50)          150          0.15
(0, 0, 14420, 100)           50           0.05
(0, 0, 720, 100050)          100          0.1

Benchmark               Suite     Implementation   Differential    Differential   DECR (%)
                                                   representation  entropy
FFT.128M                BOTS      OpenMP           19.4            14.4           25.5
NQUEENS.N=12            BOTS      OpenMP           11.8            8.4            28.7
SORT.8M                 BOTS      OpenMP           16.4            16.3           0.1
SGEFA.500x500           LINPACK   OpenMP           14.1            0.9            93.6
FLUIDANIMATE.SIMSMALL   PARSEC    TBB              16.4            8.0            51.3
SWAPTIONS.SIMSMALL      PARSEC    TBB              17.9            13.1           26.6
STREAMCLUSTER.SIMSMALL  PARSEC    TBB              19.6            8.9            54.4

Possible Differential Application: Cache-Line Prefetching

First attempt: a prefix-based predictor. Given a differential prefix, predict the suffix.

Example: A and B have finished running (Δ_AB is stored); now C is running.

Stored Δ_AB: 0, 8000480, 54080, 8770090, 456, -1808

T_A:          T_B:          T_C (observed → predicted):
7F27BD6DF8    7F27BD6DF8    7F27BD6DF8
61E630        DBFA10        1560DF0
6949CC        6A1D0C        6AF04C?
7F77B02010    7F7835F23A    7F78BBC464?
61E6D0        61E898        61EA60?
61E6E0        61DFD0        61D8C0?

Δ_C: 0, 8000480, 54080?, 8770090?, 456?, -1808?

The observed prefix of Δ_C (0, 8000480) matches the stored Δ_AB, so the remaining differentials, and hence the remaining addresses of T_C, are predicted.

Possible Differential Application: Cache-Line Prefetching

Second attempt: a PHT (pattern history table) predictor. Based on the last X differentials, predict the next differential.

Example (history of differential vectors):

Δ:    32, 96, 8, 64, 96
Δ:    32, 96, 8, 64, 96
Δ_CD: 10, 16, 0, 16, 32
Δ:    32, 96, 8, 64, 96
Δ:    32, 96, 8, 64, 96
Δ:    10, 16, 0, 16, 32
Δ:    32, 96, 8, 64, 96
Δ:    32, 96, 8, 64, 96
Δ_IJ: 10?, 16?, 0?, 16?, 32?

Twice before, two occurrences of (32, 96, 8, 64, 96) were followed by (10, 16, 0, 16, 32), so the same vector is predicted for Δ_IJ.

Possible Differential Application: Cache-Line Prefetching

- Prefix policy: the differential DB is a prefix tree; a prediction is made once the differential prefix is unique.
- PHT policy: the differential DB holds the history table; a prediction is made upon task start, based on the history pattern.

Possible Differential Application: Cache-Line Prefetching

The predictors are compared against two models: Base (no prefetching) and Ideal (a theoretical predictor that accurately predicts every repeating differential).

[Figures: misses per 1K instructions for Base, Prefix, PHT, and Ideal, per benchmark]
Cache Miss Elimination (%)

Benchmark        Prefix   PHT     Ideal
NQUEENS.N=12     19.4     11.4    62.1
SWAPTIONS        18.3     0.1     49.2
FLUIDANIMATE     14.9     26.0    46.0
SGEFA.500        0.0      97.6    99.9
STREAMCLUSTER    21.7     36.5    82.3
FFT.128M         45.0     -1.0    87.9
SORT.8M          3.3      0.0     0.1

Future Work

- Hybrid policies: which policy to use when? (PHT is better for complete vector repetitions; prefix is better for partial vector repetitions, i.e. suffixes.)
- Regular-expression-based policy (for pattern matching, beyond the "ideal" model).
- Predict other functional features using differentials (e.g. branch prediction, PTE prefetching, etc.).

Conclusions (so far...)

When we look at the data, patterns emerge; there is quite a large headroom for optimizing computer systems.

Existing predictions are based on heuristics:
- A machine that does not respond within 1s is considered dead
- Memory prefetchers look for blocked and strided accesses

Goal: use ML, not heuristics, to uncover behavioral semantics.