COMP 422, Guest Lecture: High Performance Fortran

Chuck Koelbel


Department of Computer Science

Rice University

chk@rice.edu

COMP 422 Lecture, 21 February 2008


A Quick History of HPF

1991: Supercomputing '91 Birds of a Feather session

Ken Kennedy, Geoffrey Fox, David Loveman propose standardizing a parallel version of Fortran

1992-1993: High Performance Fortran Forum (HPFF) meetings

30-40 people defining the language

1993: HPF 1.0 standard published; the High Performance Fortran Handbook shortly thereafter

1994: Further HPFF meetings lead to HPF 1.1

1995-1996: HPFF 2.0 meetings

HPF 2.0 released January 1997

1994-now: HPF 1.x compilers

HPF Goal: Portable Parallel Programming

[The original slides showed the same diagram on four animation builds, with labeled arrows between the machine classes; only the machine entries and arrow labels survive here.]

Single (Vector) Processor: Cray-1 (1 processor), FORTRAN 77

SIMD: Connection Machine CM-2 (65,536 processors), CM Fortran / Fortran 90

Distributed Memory MIMD: Intel Paragon (2048 processors), Intel message passing / MPI

Shared Memory MIMD: Convex C3800 (8 processors), PCF Fortran / OpenMP

Arrow labels from the builds: "Vectorization (?)", "Scalarization", "Easy, but uses extra copies", and "???" for the hard transitions.


What's the Owner-Computes Rule?

A compilation strategy suggested by

Callahan & Kennedy (thought paper), Fox (vague musings)

Vienna Fortran (Gerndt thesis), Kali (Koelbel thesis), Id Nouveau (Rogers thesis), Fortran D (Tseng thesis), …

CM Fortran, MasPar Fortran, DEC Fortran, COMPASS compilers

Given

A (sequential or parallel) program using (large) data structures

A partitioning of the data structures among processors

Produce a Single Program Multiple Data (SPMD) program (i.e. the same program text on each processor) that

Allocates only a part of the data

Restricts the computation to assignments to its own data

Inserts communication and synchronization as needed for non-local data
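
As a minimal sketch of the rule applied to a single statement (hypothetical helpers OWNER and MY_ID; the "Implementing Data Mapping" slide below shows the optimized bounds-reduction form):

REAL A(1000), B(1000)                ! both partitioned among the processors
DO I = 2, 999
   IF (OWNER(A, I) .EQ. MY_ID) THEN  ! only the owner of A(I) performs the assignment
      A(I) = B(I-1) + B(I+1)         ! non-local B elements must be communicated first
   END IF
END DO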


Other HPF Goals and Constraints


Aimed at scientific programming, circa 1990


FORTRAN 77, Fortran 90


Arrays as the main data structure


Mostly regular computations


Defined by committee


Consensus where possible, votes elsewhere


Good: Not tied to particular architectures


Bad: Can’t exploit a particular architecture


Trying to raise the level of programming on HPC systems


Good: Programmers would thank us


Bad: Compiler writers wouldn’t



HPF Overview


Fortran 90


Base language (controversial in 1992)


Data Mapping


Align


Distribute


Data Parallel Statements and Functions


Forall statement


Independent


HPF Library


Advanced Topic


Extrinsic interfaces


Fortran 90

Major update of FORTRAN 77 (and FORTRAN 66)

Officially released in 1992

Since supplanted by Fortran 95 and Fortran 2003

All of FORTRAN 77

Arrays, DO, IF-THEN-ELSE, CALL, DOUBLE PRECISION, …

The ugly stuff: storage and sequence association

Some really ugly stuff was "deprecated"

Important additions for HPF

Array assignments, arithmetic, and intrinsics

Explicit subprogram interfaces

Many other new features

POINTER, ALLOCATE, CASE, derived types, free source form, …


Fortran 90 Array Features

Treat arrays as whole objects

Array assignment

LHS and RHS must conform in size (or assign a scalar to an array)

Array expressions

Elementwise arithmetic

Elementwise intrinsics

Array sections

Array intrinsics

Queries

Reductions

Transformations

REAL A(100,100), B(100,100)
REAL U(100), V(100), X
INTEGER IDX(100), I

! whole-array assignment
A = B
U = 0.0

! array expressions
A = A * B + 1.0
V = MATMUL(A,U)
U = SIN(V)

! array sections
U(1:50) = V(2:100:2)
A(1:100,1) = V(100:1:-1)
U = U(IDX(1:100))

! array intrinsics
I = SIZE(U)
X = SUM(A(2:99,2:99))
V = CSHIFT(V,-1)
A(1:10,1:10) = RESHAPE(V,(/10,10/))


Fortran 90 Interfaces

Old "implicit" interfaces still there

Like the name says, no declarations needed

Sequence association: arguments passed in column-major order

Explicit interfaces added

Type and number of dimensions checked at compile time, size at runtime (maybe)

Required only for using certain features (assumed-shape arrays)

New style, with an explicit interface and assumed-shape arrays:

INTERFACE
   SUBROUTINE SOLVE(A,IPVT,B)
      REAL A(:,:), B(:)
      INTEGER IPVT(:)
   END SUBROUTINE
END INTERFACE

Old FORTRAN 77 style, relying on sequence association:

SUBROUTINE SOLVE(A,LDA,N,IPVT,B)
INTEGER LDA,N,IPVT(1)
REAL A(LDA,1),B(1)

REAL X(2048,2048), Y(2048)
INTEGER IDX(2048)
CALL SOLVE(X,2048,100,IDX,Y)
CALL SOLVE(X(100,100),IDX(100),Y(100))


HPF Data Mapping Model

Accomplished by directives

Structured comments beginning with !HPF$

Did not change the output semantics of the program

Ignore them on single-processor machines

Different implementations possible

Two-level scheme

ALIGN arrays to a fine-grain TEMPLATE

Allows "the same" mappings, handles corner cases & different sizes

DISTRIBUTE onto PROCESSORS

Splits up all aligned data structures


DISTRIBUTE Directive

!HPF$ DISTRIBUTE array(pat1, pat2, …) [ONTO proc]

Patterns (pat1, …) applied orthogonally

BLOCK: N/P contiguous elements each

CYCLIC: every Pth element

CYCLIC(m): every Pth block of m elements

*: don't distribute this dimension

Number of distributed dimensions must match proc shape

Acts like a declaration

See also, Fortran DIMENSION statement


Examples of DISTRIBUTE

REAL W(12,12), X(12,12)
REAL Y(12,12), Z(12,12)
!HPF$ DISTRIBUTE W(BLOCK,*)
!HPF$ DISTRIBUTE X(*,CYCLIC)
!HPF$ DISTRIBUTE Y(BLOCK,BLOCK)
!HPF$ DISTRIBUTE Z(CYCLIC(2),CYCLIC(3))
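
A sketch of the resulting layouts, assuming a hypothetical one-dimensional arrangement !HPF$ PROCESSORS P(4) (none is declared on the slide):

! W(BLOCK,*):             P(k) holds three whole rows: rows 1-3 on P(1), 4-6 on P(2), …
! X(*,CYCLIC):            whole columns dealt round-robin: P(1) holds columns 1, 5, 9
! Y(BLOCK,BLOCK):         needs a 2-D arrangement, e.g. P2(2,2); each holds a 6x6 quadrant
! Z(CYCLIC(2),CYCLIC(3)): 2-row by 3-column tiles dealt round-robin in both dimensions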


ALIGN Directive

!HPF$ ALIGN array1(i1, …) WITH array2(lin-fcn1, …)

i1, … were dummy variables

lin-fcn1, … could be

An integer constant

An integer linear function of one dummy

*, meaning all elements of the dimension

array2 could be an array or a template

No circular ALIGN chains allowed

Acts like a declaration

See also, Fortran DIMENSION statement


Examples of ALIGN + DISTRIBUTE
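
The figures for this slide and the next did not survive extraction. As an illustrative sketch of the two-level scheme (hypothetical arrays, not the original figures):

REAL A(10,10), B(12), D(6)
!HPF$ TEMPLATE T(12,12)
!HPF$ ALIGN A(I,J) WITH T(I+1,J+1)   ! offset: embed A in the interior of T
!HPF$ ALIGN B(I) WITH T(I,*)         ! replicate B(I) across all of row I
!HPF$ ALIGN D(I) WITH T(2*I-1,1)     ! stride-2 linear function into column 1
!HPF$ DISTRIBUTE T(BLOCK,BLOCK)      ! one DISTRIBUTE maps every aligned array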


More examples of ALIGN + DISTRIBUTE


Implementing Data Mapping

Memory management

Allocate local section

Generate global-to-local addressing: A(I) becomes A_local[I - LBOUND_local(A)]

Generate local-to-global mapping: A_local[I] names A(I + LBOUND_local(A))

Bounds reduction on array assignments

Run only over (part of) the local section

May generate as local or global indices

REAL A(1000)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P(1:10)
A(2:99) = 6.66

becomes, on each processor:

REAL MY_A(0:99)
PARAMETER (MY_LO=MY_ID*100,…)

MY_LB = MAX(2,MY_LO+1) - MY_LO - 1
MY_UB = MIN(99,MY_HI+1) - MY_LO - 1
DO I = MY_LB, MY_UB
   MY_A(I) = 6.66
ENDDO
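
A worked instance of the addressing formulas above (a sketch): with MY_ID = 2 and MY_LO = 200, global A(250) is stored at MY_A(250 - MY_LO - 1) = MY_A(49); conversely, MY_A(49) names global A(49 + MY_LO + 1) = A(250).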


Dynamic Mapping

Some programs need to change distributions mid-stream

Dynamic memory allocation

Alternating direction methods (first by rows, then by columns)

Dependences on input data (data sizes, irregular meshes)

Support by dynamic data mapping directives

DYNAMIC - marks array for (possible) remapping

REALIGN - "executable" version of ALIGN

REDISTRIBUTE - "executable" version of DISTRIBUTE

Use caused all-to-all data movement

Not popular with either users or vendors
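
A sketch of the alternating-direction case (illustrative only):

      REAL A(100,100)
!HPF$ DYNAMIC A
!HPF$ DISTRIBUTE A(BLOCK,*)      ! whole rows are local: row-direction sweeps need no communication
!     ... first half-step: recurrences along each row ...
!HPF$ REDISTRIBUTE A(*,BLOCK)    ! all-to-all remap; whole columns become local
!     ... second half-step: recurrences along each column ...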


Mapping in Subprogram Interfaces

Prescriptive

"Wherever it was, move it here! I'll move it back when I'm done"

System will move data on entry and exit (if needed)

Descriptive

"It's supposed to be here!"

System will check, and move data if needed

Transcriptive

"Whatever!"

System will deal with data where it is

Warning: This will create lots of run-time overhead

SUBROUTINE EXAMPLE(A,B,C)
REAL A(100), B(100), C(100)
!HPF$ DISTRIBUTE A(BLOCK)      ! prescriptive
!HPF$ DISTRIBUTE B *(BLOCK)    ! descriptive
!HPF$ DISTRIBUTE C *           ! transcriptive
A = B
A = C
B = C
END SUBROUTINE

REAL A(100), B(100), C(100)
!HPF$ DISTRIBUTE A(CYCLIC)
!HPF$ DISTRIBUTE B(BLOCK)
!HPF$ DISTRIBUTE C(CYCLIC)
CALL EXAMPLE(A,B,C)
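
Tracing the call above (a sketch of the intended semantics): A arrives CYCLIC but the dummy prescribes BLOCK, so the system remaps A on entry and restores CYCLIC on exit; B's descriptive *(BLOCK) matches the caller's BLOCK mapping, so no movement is needed (an implementation may simply verify the claim); C is transcriptive, so the routine works with whatever mapping arrives, CYCLIC here.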


HPF 2.0 - More DISTRIBUTE Patterns

Added SHADOW directive for regular computations

Added two new options primarily for irregular computations

GEN_BLOCK: blocks of differing sizes

INDIRECT: value-based mapping
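
A sketch of both options (hypothetical sizes and map array; syntax as in the HPF 2.0 approved extensions):

!HPF$ PROCESSORS P(4)
      REAL A(16), B(16)
      INTEGER, PARAMETER :: SIZES(4) = (/ 6, 4, 4, 2 /)
      INTEGER MAP(16)                          ! MAP(I) names the processor that owns B(I)
!HPF$ DISTRIBUTE A(GEN_BLOCK(SIZES)) ONTO P    ! contiguous blocks of 6, 4, 4, and 2 elements
!HPF$ DISTRIBUTE B(INDIRECT(MAP)) ONTO P       ! fully value-based mapping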


HPF Parallel Statements and Directives

Parallelism came from access to many elements of large data structures

E.g. update all elements of an array

Several methods

Fortran 90 array operations

FORALL statement

Syntactic sugar for array assignment

Also allows non-rectangular regions and a few other tricks

INDEPENDENT assertion

"This loop has no carried dependences"

If used correctly, doesn't change output semantics

EXTRINSIC subroutines

Escape from data parallelism


FORALL Statement

Initially,

a = [0, 1, 2, 3, 4]
b = [0, 10, 20, 30, 40]
c = [-1, -1, -1, -1, -1]

FORALL ( i = 2:4 )
   a(i) = a(i-1) + a(i+1)
   c(i) = b(i) * a(i+1)
END FORALL

Afterwards,

a = [0, 2, 4, 6, 4]
b = [0, 10, 20, 30, 40]
c = [-1, 40, 120, 120, -1]

(Each statement runs as a whole array assignment: every right-hand side is evaluated before any element is assigned, and the second statement sees the a values updated by the first.)

Adopted (word-for-word) by Fortran 95 and later


Why Use FORALL?

[Figure: index vectors jlo/jhi and the array sections they select; not preserved in this transcript.]

Assignments to (extended) array sections

FORALL ( i = 1:4, j = 2:4 )  a(i,j) = i*4 + j - 3

FORALL ( i = 1:4 )  a(i,i) = a(i,i) * scale

FORALL ( i = 1:4 )
   FORALL ( j = i:4 )  a(i,j) = a(i,j) / a(i,i)
END FORALL

FORALL ( i = 1:4 )
   FORALL ( j = ilo(i):ihi(i) )  x(j) = x(j)*y(i)
END FORALL



PURE Subprograms

PURE functions (and subroutines) had no side effects

Had to be declared so with an explicit interface

Purity ensured by a long list of syntactic conditions

No globals or dummy parameters on the left-hand side

No calls to non-PURE functions or subroutines

No HPF remapping, no I/O, …

This made them safe to call from FORALL

Which was the whole idea

Useful to call them from FORALL for certain things

Internal control flow (e.g. element-wise iteration)

Local variables

Adopted (word-for-word) by Fortran 95 and later
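
As a sketch of the kind of use this enables (ROW_NORM is a hypothetical function, not from the slide):

INTERFACE
   PURE REAL FUNCTION ROW_NORM(ROW)
      REAL, INTENT(IN) :: ROW(:)
   END FUNCTION
END INTERFACE
REAL a(100,100), scale(100)
FORALL ( i = 1:100 )  scale(i) = ROW_NORM(a(i,:))   ! safe: no side effects

PURE REAL FUNCTION ROW_NORM(ROW)
   REAL, INTENT(IN) :: ROW(:)
   ROW_NORM = SQRT(SUM(ROW*ROW))   ! locals and internal control flow are fine inside PURE
END FUNCTION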


INDEPENDENT Directive

Initially,

a = [0, 2, 4, 6, 1, 3, 5, 7]
b = [6, 5, 4, 3, 2, 3, 4, 5]
c = [-1, -1, -1, -1, -1, -1, -1, -1]

!HPF$ INDEPENDENT
DO j = 1, 3
   a(j) = a(b(j))
   c(a(j)) = a(j)*b(a(j))
END DO

Afterwards,

a = [3, 1, 6, 6, 1, 3, 5, 7]
b = [6, 5, 4, 3, 2, 3, 4, 5]
c = [6, -1, 12, -1, -1, 18, -1, -1]

(The assertion holds here: the loop writes a(1:3) and the distinct elements c(3), c(1), c(6), while it reads a only at positions b(j) = 6, 5, 4.)


Why Use INDEPENDENT?

Express application-dependent information

! "Colors" don't mix with each other
!HPF$ INDEPENDENT, NEW(ix, i1, i2, f12)
DO i = 1, ncolor
   DO ix = color_beg(i), color_end(i)
      i1 = icolor(ix,1)
      i2 = icolor(ix,2)
      f12 = w(i1) - w(i2)
      x(i1) = x(i1) + f12
      x(i2) = x(i2) - f12
   END DO
END DO


INDEPENDENT Is an Assertion

[Figure: two sample data sets for iblue, ired, and x. In the first, each ired(j) names a node the loop does not update, so the assertion is True; in the second, some ired(j) name nodes the loop also writes, so the assertion is False.]

!HPF$ INDEPENDENT, NEW (j, n1)
DO i = 1, nblue
   n1 = iblue(i)
   DO j = ibegin(n1), ibegin(n1+1)-1
      x(n1) = x(n1) + y(j)*x(ired(j))
   END DO
END DO


Implementing an INDEPENDENT Loop

Implement data mapping

See the "Implementing Data Mapping" slide above

Choose reference(s) to partition loop bounds

Left-hand-side owner-computes rule

Majority owner-computes rule

Minimize communication (if you can)

Derive communication sets

Apply subscript expression to partitioned bounds

Intersect with the data range the processor owns

Anything left over must be sent/received

Allocate buffers (or ghost regions)

Insert communication operations

RHS references: before the loop; LHS references: after the loop


An INDEPENDENT Loop

HPF source:

!HPF$ DISTRIBUTE (BLOCK) :: ired, x, y
!HPF$ ALIGN iblue(i) WITH ired(*)
!HPF$ ALIGN ibegin(i) WITH ired(*)
      . . .
!HPF$ INDEPENDENT, NEW (j, n1)
      DO i = 1, nblue
         n1 = iblue(i)
         DO j = ibegin(n1), ibegin(n1+1)-1
            x(n1) = x(n1) + y(j)*x(ired(j))
         END DO
      END DO

Compiled (SPMD pseudocode):

      DO i = 1, nblue
         if iblue(i) local then
            add i to the local iteration list
            check for non-local y(j)
            check for non-local x(j)
         end if
      END DO

      Communicate nonlocal x, y to x_buf, y_buf

      DO i = local iterations
         n1 = iblue(i) - my_lo
         DO j = ibegin(n1), ibegin(n1+1)-1
            x(n1) = x(n1) + y_buf(j)*x_buf(j)
         END DO
      END DO





An INDEPENDENT Loop (on Shared Memory)

HPF source (same as the previous slide):

!HPF$ DISTRIBUTE (BLOCK) :: ired, x, y
!HPF$ ALIGN iblue(i) WITH ired(*)
!HPF$ ALIGN ibegin(i) WITH ired(*)
      . . .
!HPF$ INDEPENDENT, NEW (j, n1)
      DO i = 1, nblue
         n1 = iblue(i)
         DO j = ibegin(n1), ibegin(n1+1)-1
            x(n1) = x(n1) + y(j)*x(ired(j))
         END DO
      END DO

Compiled (SPMD pseudocode):

      DO i = 1, nblue
         if iblue(i) local then
            add i to the local iteration list
         end if
      END DO

      Synchronize after the last uses of x and y

      DO i = local iterations
         n1 = iblue(i) - my_lo
         DO j = ibegin(n1), ibegin(n1+1)-1
            x(n1) = x(n1) + y(j)*x(ired(j))
         END DO
      END DO





A Complex INDEPENDENT Loop

HPF source:

!HPF$ DISTRIBUTE (BLOCK) :: ired, x, y
!HPF$ ALIGN iblue(i) WITH ired(*)
!HPF$ ALIGN ibegin(i) WITH ired(*)
      . . .
      DO iter = 1, 100
!HPF$ INDEPENDENT, NEW (j, n1)
         DO i = 1, nblue
            n1 = iblue(i)
            DO j = ibegin(n1), ibegin(n1+1)-1
               x(n1) = x(n1) + y(j)*x(ired(j))
            END DO
         END DO
      END DO

Compiled (SPMD pseudocode):

      DO i = 1, nblue
         if iblue(i) local then
            add i to the local iteration list
            check for non-local y(j)
            check for non-local x(j)
         end if
      END DO

      Communicate nonlocal y to y_buf   (y never changes, so it is sent once, outside the iteration loop)

      DO iter = 1, 100
         Communicate nonlocal x to x_buf   (x is updated every iteration, so it must be re-sent)
         DO i = local iterations
            n1 = iblue(i) - my_lo
            DO j = ibegin(n1), ibegin(n1+1)-1
               x(n1) = x(n1) + y_buf(j)*x_buf(j)
            END DO
         END DO
      END DO



A Complex INDEPENDENT Loop (on Shared Memory)

HPF source (same as the previous slide):

!HPF$ DISTRIBUTE (BLOCK) :: ired, x, y
!HPF$ ALIGN iblue(i) WITH ired(*)
!HPF$ ALIGN ibegin(i) WITH ired(*)
      . . .
      DO iter = 1, 100
!HPF$ INDEPENDENT, NEW (j, n1)
         DO i = 1, nblue
            n1 = iblue(i)
            DO j = ibegin(n1), ibegin(n1+1)-1
               x(n1) = x(n1) + y(j)*x(ired(j))
            END DO
         END DO
      END DO

Compiled (SPMD pseudocode):

      DO i = 1, nblue
         if iblue(i) local then
            add i to the local iteration list
         end if
      END DO

      DO iter = 1, 100
         Synchronize
         DO i = local iterations
            n1 = iblue(i) - my_lo
            DO j = ibegin(n1), ibegin(n1+1)-1
               x(n1) = x(n1) + y(j)*x(ired(j))
            END DO
         END DO
      END DO



HPF Library

HPF included a substantial library

MODULE HPF_LIBRARY

USE HPF_LIBRARY

We thought of it as extending the Fortran 90 intrinsics

Fortran 95 accepted some of our ideas as new intrinsics

Criteria for HPF_LIBRARY

User requested the function (or a closely related one)

Can be implemented in parallel, or needed for parallel programs

But a parallel implementation is not easy and/or not portable


HPF Library Categories

Reductions

For all associative & commutative built-ins, e.g. PARITY

Scans

Prefix and suffix, for each reduction

Scatters

With reductions at collision

Sorting

SORT_UP, SORT_DOWN, GRADE_UP, GRADE_DOWN

System and data distribution inquiry

NUMBER_OF_PROCESSORS, …

Elemental bit operations

E.g. POPCNT, also known as the NSA instruction


HPF Library Examples

X = SUM(A)                    computes x = a_1 + a_2 + … + a_n (reduction over all i = 1, …, n)

Z = SUM_SCATTER(C, Z, IND)    computes z_j = z_j + (sum of c_i over all i with ind_i = j) (scatter with sum at collisions)

Y = SUM_PREFIX(B)             computes y_j = b_1 + b_2 + … + b_j for each j (running prefix sum)
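
For instance, a small sketch of a scan call (illustrative only):

USE HPF_LIBRARY
INTEGER :: B(5) = (/ 1, 2, 3, 4, 5 /)
INTEGER Y(5)
Y = SUM_PREFIX(B)    ! running sums: Y = (/ 1, 3, 6, 10, 15 /)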




EXTRINSIC Subprograms

An escape hatch into lower-level programming paradigms (e.g. MPI, machine-specific hacks)

General mechanism and syntax: EXTRINSIC( model-name )

Particular cases defined as examples

HPF_LOCAL

Constituted a contract

On the caller (HPF) side:

Explicit interface with the EXTRINSIC directive

Remapping occurs (if needed) to meet distribution specifications

System synchronizes all processors before the call

System calls the "local" routine on every processor

On the callee (non-HPF) side:

INTENT(IN) and INTENT(OUT) must be obeyed

If variables are replicated, the callee must make them consistent before return

Processors can access their own section of distributed arrays


EXTRINSIC Example

[The original slide's side-by-side code, Global HPF vs. Local HPF, is not preserved in this transcript.]
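
A minimal sketch of the contract, with a hypothetical routine LOCAL_SCALE (not the slide's original code):

Global HPF side:

      INTERFACE
         EXTRINSIC(HPF_LOCAL) SUBROUTINE LOCAL_SCALE(A, F)
            REAL, INTENT(INOUT) :: A(:)
            REAL, INTENT(IN)    :: F
!HPF$ DISTRIBUTE A(BLOCK)
         END SUBROUTINE
      END INTERFACE
      REAL X(1000)
!HPF$ DISTRIBUTE X(BLOCK)
      CALL LOCAL_SCALE(X, 2.0)   ! system synchronizes, then runs the local routine on every processor

Local (non-HPF) side, which sees only its own section of the distributed argument:

EXTRINSIC(HPF_LOCAL) SUBROUTINE LOCAL_SCALE(A, F)
REAL, INTENT(INOUT) :: A(:)   ! just this processor's section; SIZE(A) is the local size
REAL, INTENT(IN)    :: F
A = A * F                     ! purely local work, no communication needed
END SUBROUTINE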


HPF's Legacy

HPF didn't take over the world as we once hoped

MPI started later, but came out with implementations sooner

Lesson: early (portable, efficient) implementation needed!

(Perceived) lack of support for irregular applications

Lesson: listen to users!

Performance inconsistency among the first HPF implementations

Lesson: an environment is needed (both conceptual & software)!

But it was important

JAHPF was critical to the success of the Earth Simulator

"14.9 Tflops Three-dimensional Fluid Simulation for Fusion Science with HPF on the Earth Simulator" won the Gordon Bell prize for Language in 2002

HPCS languages all have data parallel support

Including distributions inspired by HPF