SICSA Concordance Challenge: Using Groovy and the JCSP Library


Jon Kerridge

Software Environment

Groovy
  A Java based scripting language
  Direct support for Lists and Maps
  Executes on a standard JVM

JCSP
  A CSP based library for Java
  Process definitions independent of how the system will be executed
  Enables multicore parallelism
  Parallelism over a distributed system with TCP/IP interconnect
  Executes on a standard JVM

A set of Groovy helper classes has been created to permit easier access to the JCSP library (a flavour of their use is sketched below)
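A minimal sketch of the style, assuming the org.jcsp packages and the PAR helper used later in these slides; Producer, Consumer and the channel wiring are illustrative, not part of the concordance system:

    import org.jcsp.lang.*
    import org.jcsp.groovy.*

    // a process that writes the numbers 1..5 to its output channel end
    class Producer implements CSProcess {
        ChannelOutput outChannel
        void run() {
            for ( i in 1..5 ) outChannel.write(i)
        }
    }

    // a process that reads five values from its input channel end and prints them
    class Consumer implements CSProcess {
        ChannelInput inChannel
        void run() {
            for ( i in 1..5 ) println inChannel.read()
        }
    }

    One2OneChannel chan = Channel.one2one()
    new PAR( [ new Producer( outChannel: chan.out() ),
               new Consumer( inChannel:  chan.in()  ) ] ).run()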


Hardware Environment

A network of multicore PCs
Network is 100 Mbits/sec
Each processor has
  Intel Core 2 Duo E8400 operating at 3.0 GHz
  2 cores
  2 threads (no hyper-threading)
  FSB 1333 MHz
  L2 cache 6 MB
  2 GB memory

Why Use a Distributed System?


Regardless of the application, more than one processing node may be needed


Better to start with an inherently parallel distributed design


Bolting on distributed parallelism afterwards is always very difficult


Scalability


Enables easier overlapping of operations, particularly file I/O

Architecture

[Diagram: a Read File process connected to a set of Worker processes by bi-directional CSP channels in a Client-Server design. There can be any number of workers; in these experiments 4, 8 and 12 were used.]

Primary Design Criteria

Ensure all data structures are separable in some parameter
  N in this case
  Reduces contention for memory access
  Hence easier to parallelise

Keep loops simple
  Easier to parallelise

Read File process

Reads parameters
  input file name, N value, minimum number of repetitions to be output
  number of workers and block size

Operation
  Reads the input file and tokenises it into space-delimited words
  Forms a block of such words, ensuring an overlap of N-1 words between blocks (see the sketch below)
  Sends a block to each worker in turn
  Merges the final partial concordance of each worker and writes the final concordance to an output file
    This merge step will be removed in the final version
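A minimal sketch of the block formation, assuming inputFile is a java.io.File, and words and blockSize are illustrative names rather than the actual implementation:

    def words = inputFile.text.split( /\s+/ ).toList()   // tokenise on spaces
    def blocks = []
    int start = 0
    while ( start < words.size() ) {
        int end = Math.min( start + blockSize, words.size() )
        blocks << words.subList( start, end )
        if ( end == words.size() ) break
        start = end - ( N - 1 )   // overlap of N-1 words so no N-word sequence straddles a boundary
    }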

Initial Experiments

The relationship between Block Size and the Number of Workers governs how much processing can be overlapped with the initial file input

It was discovered that a Block Size of 6144 gave the best performance for 4 or 8 workers
  Provided the only work undertaken was
    removal of punctuation and
    the initial calculation of the equivalent integer value for each word


Worker - Initial Phase

Reads input blocks from the Read File process
Removes punctuation, saving the result as bare words
Calculates an integer equivalent value for each word by summing its ASCII characters
  This is also the N = 1 sequence value
These operations are overlapped with input and with the same processing in each worker

For each block (see the sketch below)
  Calculate the integer value for each sequence of length 2 up to N by adding word values, and store it in a Sequence list

The integer values generated by this processing will include duplicates: different words and different sequences can produce the same value
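A minimal sketch of the value calculation, where bareWords is the punctuation-free word list of one block and the other names are illustrative:

    // N = 1 value: the sum of the ASCII codes of the characters of a word
    def wordValue = { String w -> w.toCharArray().collect { (int) it }.sum() }

    def wordValues = bareWords.collect { wordValue(it) }

    // values for sequences of length 2 up to N, formed by adding word values
    def sequenceList = []
    for ( n in 2..N ) {
        sequenceList[n] = ( 0 .. bareWords.size() - n ).collect { i ->
            wordValues[ i ..< i + n ].sum()
        }
    }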




Worker - Local Map Generation

For each Sequence in each Block
  Produce a Map from the Sequence value to an inner Map, which maps each corresponding word string to the places where that word string is found in the input file
  Save this in a structure that is indexed by N, where each element contains a list of the Maps produced above

For each worker, produce a composite Map combining the individual Maps (the shape of the structure is sketched below)
  Save this in a structure indexed by N
  This is the Concordance for this worker
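A minimal sketch of the shape of one such map; the key 633 is the ASCII sum of both 'the cat' and 'cat the', illustrating how different sequences can share a key, and addOccurrence is an illustrative helper:

    // outer key: sequence integer value; inner key: the actual word sequence;
    // inner value: the list of positions where that sequence occurs in the file
    def primaryMap = [ 633 : [ 'the cat' : [5, 97], 'cat the' : [61] ] ]

    // adding an occurrence, creating the inner map and list on demand
    def addOccurrence = { int seqValue, String sequence, int position ->
        primaryMap.get( seqValue, [:] ).get( sequence, [] ) << position
    }
    addOccurrence( 633, 'the cat', 133 )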

Worker - Merge Phase

For each of the N partial Concordances
  Sort the integer keys into descending order
  For each Key in the Nth partial Concordance
    Send the corresponding Map Entry to the Reader
      The Map Entry contains a Map of the word sequences and their locations within the file

This will be modified in the final version, which overlaps the merge / output phase
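A minimal sketch of the descending key ordering and send, assuming toReader is the channel output end to the Reader (an assumed name):

    // sort the integer keys into descending order, then stream the entries
    def sortedKeys = primaryMap.keySet().sort { -it }
    for ( key in sortedKeys ) {
        // each entry maps the word sequences with that value to their file locations
        toReader.write( [ key, primaryMap[key] ] )
    }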

Worker - Parallelisation

Each Worker can be parallelised by N
  Data structures indexed by N can be written to in parallel
    Provided each parallel process only accesses a single value of N
  Access to any shared structures is read only

Thus, depending on the number of available Threads (T) in the Worker's processor, each of these operations can be carried out in parallel

Thus the design is scalable in N and T

Parallelising the Worker's Join

def localNPrimaryMapList = []                 // holds concordance for each N value
for ( n in 1..N ) localNPrimaryMapList[n] = new PrimaryKeyMap()

for ( s in 0 ..< startIndexes.size() ) {
    /* sequential version
    for ( n in 1..N ) {
        defs.initPrimaryMap( localNPrimaryMapList[n],
                             localEqualWordMapListN[n][s] )
    }
    */
    def procNet = (1..N).collect { n ->
        new InitialJoiner( primaryMap: localNPrimaryMapList[n],
                           otherMap:   localEqualWordMapListN[n][s] ) }
    new PAR( procNet ).run()
}

InitialJoiner Process Definition

class InitialJoiner implements CSProcess {
    // this is a non-standard CSP process as it has no channels!
    // relies on the fact that the primaryMap can be written to by this
    // process exclusively
    def primaryMap
    def otherMap
    void run() {
        defs.initPrimaryMap( primaryMap, otherMap )
    }
}

Creating Equal Block Maps - Sequential

def localEqualWordMapListN = []               // contains an element for each N value
for ( i in 1..N ) localEqualWordMapListN[i] = []

def maxLength = BL - N

for ( WordBlock wb in wordBlocks ) {
    // sequential version that iterates through the sequenceBlockList
    for ( SequenceBlock sb in wb.sequenceBlockList ) {
        // one sb for each value of N
        def length = maxLength
        def sequenceLength = sb.sequenceList.size()
        if ( sequenceLength < maxLength ) length = sequenceLength
        def equalMap = defs.extractEqualValues( length,
                                                wb.startIndex, sb.sequenceList )
        def equalWordMap = defs.extractUniqueSequences( equalMap,
                                                        sb.Nvalue, wb.startIndex, wb.bareWords )
        localEqualWordMapListN[sb.Nvalue] << equalWordMap
    }
}

Creating Equal Block Maps - Parallel

def localEqualWordMapListN = []               // contains an element for each N value
for ( i in 1..N ) localEqualWordMapListN[i] = []

def maxLength = BL - N

for ( WordBlock wb in wordBlocks ) {
    def procNet = (1..N).collect { n ->
        new ExtractEqualMaps( n:            n,
                              maxLength:    maxLength,
                              startIndex:   wb.startIndex,
                              sequenceList: wb.sequenceBlockList[n-1].sequenceList,
                              words:        wb.bareWords,
                              localMap:     localEqualWordMapListN[n] )
    }
    new PAR( procNet ).run()
}

ExtractEqualMaps Process Definition

class ExtractEqualMaps implements CSProcess {
    def n
    def maxLength
    def startIndex
    def sequenceList
    def words
    def localMap
    void run() {
        def length = maxLength
        def sequenceLength = sequenceList.size()
        if ( sequenceLength < maxLength ) length = sequenceLength
        def equalMap = defs.extractEqualValues( length, startIndex, sequenceList )
        def equalWordMap = defs.extractUniqueSequences( equalMap, n, startIndex, words )
        localMap << equalWordMap
    }
}

Parallelisation Effect

The presented results have the parallel version of the InitialJoiner deployed in both versions

The effect of the previous parallelisation is immediately observable in the results
  Worker Style 1 uses the sequential version to create the Equal Maps
  Worker Style 2 uses the parallel version to create the Equal Maps

The system does allow the user to choose whether to output sequences that occur only once
  All results presented do NOT output a sequence if it occurs only once



Results times in msecs (Bible)

| Worker Style | Workers | N | Worker Distribute | Worker Equal | Worker Join | Worker Merge | Worker Total | Reader Distribute | Reader Merge | Reader Total | Output File Size (KB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4  | 3 | 3,749 | 138,263 | 1,147 | 19,263 | 162,421 | 3,045 | 159,205 | 162,250 | 17,798 |
| 1 | 8  | 3 | 3,389 | 69,584  | 708   | 21,702 | 95,383  | 2,997 | 92,269  | 95,266  | 17,798 |
| 2 | 4  | 3 | 3,046 | 53,600  | 1,030 | 18,647 | 76,323  | 2,592 | 73,701  | 76,293  | 17,798 |
| 2 | 8  | 3 | 4,629 | 27,559  | 597   | 21,758 | 54,543  | 3,761 | 50,159  | 53,920  | 17,798 |
| 2 | 8  | 6 | 4,245 | 65,481  | 1,291 | 53,736 | 124,753 | 3,308 | 121,186 | 124,494 | 25,810 |
| 2 | 12 | 2 | 6,209 | 11,772  | 221   | 11,756 | 29,957  | 4,790 | 23,807  | 28,597  | 12,593 |
| 2 | 12 | 3 | 4,750 | 17,957  | 319   | 21,008 | 44,034  | 4,026 | 39,423  | 43,449  | 17,798 |
| 2 | 12 | 4 | 4,870 | 25,560  | 670   | 30,945 | 62,044  | 4,042 | 57,393  | 61,435  | 21,412 |
| 2 | 12 | 5 | 5,030 | 34,292  | 651   | 42,030 | 82,003  | 4,089 | 77,928  | 82,017  | 23,926 |
| 2 | 12 | 6 | 5,057 | 43,544  | 1,048 | 53,247 | 102,896 | 4,041 | 98,287  | 102,328 | 25,810 |

Commentary

Worker Equal Speedup
  Speedup is T(slower) / T(faster)

For N = 3, Workers = 4 and 8
  Speedup of Worker Style 2 (parallel) over Worker Style 1 (sequential)
    W = 4: 138,263 / 53,600 = 2.58 and W = 8: 69,584 / 27,559 = 2.52
  Solely due to the parallelisation of Extract Equal Maps using the available threads (2)

For N = 3 and Workers = 4, 8 and 12
  Speedup due to additional workers

|       | W = 8 | W = 12 |
|-------|-------|--------|
| W = 4 | 1.94  | 2.98   |
| W = 8 |       | 1.53   |

Commentary - Overall

Merge Effects
  For N = 3 the Merge time is very similar across worker counts
  Demonstrates that the Reader is the bottleneck

Merge Parallelisation
  There is an option here to parallelise more by undertaking the merges in parallel

Worker Total Time Speedup

|       | W = 8 | W = 12 |
|-------|-------|--------|
| W = 4 | 1.40  | 1.73   |
| W = 8 |       | 1.24   |

Overlapped Merge / Output Architecture

[Diagram: the Reader and the Workers connected to one Merge process per N value (Merge N = 1, Merge N = 2, Merge N = 3).]

Commentary on Revised Architecture

The workers output each of the N Primary maps in parallel to the respective Merge process
  Each worker has N processes that output the entries in each primary key map in descending sorted order
  One merge process per N value
  Each Merge process writes its own file

When the worker has finished
  It sends a message to the Reader informing it of termination
  This enables calculation of the overall time

The architecture implements the CSP Client-Server design pattern, thereby guaranteeing freedom from deadlock (a sketch of one worker's output side follows)
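A minimal sketch of one worker's output side under the revised architecture; SortedMapSender, toMerge and toReader are assumed names for illustration, not the actual implementation:

    // one sender process per N value, each writing its primary key map entries,
    // in descending key order, to the corresponding Merge process
    def senders = (1..N).collect { n ->
        new SortedMapSender( primaryMap: localNPrimaryMapList[n],
                             outChannel: toMerge[n - 1] )
    }
    new PAR( senders ).run()

    // then tell the Reader this worker has terminated, so overall time can be calculated
    toReader.write( 'finished' )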

Results

Overlapped Merge (msecs)

| Input | Worker Style | Workers | N | Worker Distribute | Worker Equal | Worker Join | Worker Merge | Worker Total | Reader Distribute | Reader Merge | Reader Total | Output File Size (KB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bible | 2 | 12 | 3 | 4,750 | 17,957 | 319   | 21,008 | 44,034  | 4,026 | 39,423 | 43,449  | 17,798 |
| Bible | 2 | 12 | 6 | 5,057 | 43,544 | 1,048 | 53,247 | 102,896 | 4,041 | 98,287 | 102,328 | 25,810 |
| Bible | 3 | 12 | 3 | 2,969 | 18,124 | 324   | 10,902 | 32,319  | 2,731 | 29,501 | 32,232  | 6,297 (N = 1) |
| Bible | 3 | 12 | 6 | 3,202 | 44,342 | 1,082 | 15,239 | 63,866  | 2,715 | 61,049 | 63,809  | 6,297 (N = 1) |
| WaD   | 3 | 12 | 6 | 1,338 | 17,090 | 308   | 8,625  | 27,361  | 1,140 | 26,162 | 27,302  | 2,044 (N = 1) |

(For Worker Style 3 each Merge process writes its own file; the quoted output file size is for the N = 1 file.)

Speedup Calculations - Worker Style 2 to Worker Style 3

Merge Speedup (Bible)

| N     | Speedup |
|-------|---------|
| N = 6 | 3.49    |
| N = 3 | 1.93    |

Speedup on Input File
  Compare Bible to WaD
  Overall times (W = 12, N = 6)

|       | Words   | Time (msecs) |
|-------|---------|--------------|
| Bible | 802,300 | 63,809       |
| WaD   | 268,500 | 27,302       |
| Ratio | 2.99    | 2.34         |

Conclusion

The utilisation of shared memory, and the pattern of access to it, needs to be considered when designing the algorithm
  This was done from the outset with the choice of data structures
  The parallelisation of sequential sections is then relatively straightforward
    Provided there are no memory access violations between parallel processes
  The JCSP library made this particularly easy

The resulting system is scalable in
  the number of Workers
  the value of N and the number of available Threads
    31 threads were used in this implementation

The creation of Equal Maps needs to be further parallelised

Further Work

The School has recently installed a new multi-node system
  18 nodes, each with a dual quad-core hyper-threading processor
    16 threads in each node
    Hence N = 16 can be undertaken in one pass
  16 GB memory
  250 GB local disk
  Gigabit Ethernet communications infrastructure

It's obvious what I shall be doing!