
Feb 5, 2013


a starting point for:

“Using simulation in parallel computing
for faster sample size calculations in
complex random effects models”

Toni Price, University of Bristol

MLPowSim


Developed in a separate ESRC-funded project



Generates both MLwiN macro code and R language
code for performing sample size calculations on
multilevel models



Works for a selection of multilevel nested and crossed
designs



Text-based interface



Uses C code to gather user input and generate output


Initial objective:


Use MLPowSim as a basis and extend to
support a broader range of models



Good starting point, but would benefit from an
automated way of testing that generated code
matches expected output (especially as new
and more complex models are added)
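One way such a check could look, as a minimal sketch rather than the project's actual test harness: compare the generated code with a stored expected listing line by line, ignoring trailing whitespace, and report the first mismatch. The helper name `first_mismatch` and the two short R-like strings are hypothetical and only illustrate the idea.

```ruby
# Compare generated code with an expected listing, line by line.
# Trailing whitespace is ignored. Returns nil when the two match,
# otherwise a hash describing the first line that differs.
def first_mismatch(generated, expected)
  gen_lines = generated.lines.map(&:rstrip)
  exp_lines = expected.lines.map(&:rstrip)
  count = [gen_lines.size, exp_lines.size].max

  count.times do |i|
    next if gen_lines[i] == exp_lines[i]
    return { line: i + 1, expected: exp_lines[i], generated: gen_lines[i] }
  end
  nil
end

# Illustrative strings only; real tests would read both from files.
expected  = "set.seed(1)\nn_sims <- 1000\n"
generated = "set.seed(1)   \nn_sims <- 100\n"

mismatch = first_mismatch(generated, expected)
puts mismatch ? "Differs at line #{mismatch[:line]}" : "Generated code matches"
```

Reporting the line number of the first difference keeps failures easy to read as the number of supported models grows.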



First step

Put into a cohesive framework:



Streamline duplicated code (e.g. for user input which is
similar across different models)


Also improves code maintenance (e.g. bug fixes impacting
fewer lines of code)



Improve input validation


Makes for a better user experience and reduces crashes



Automate testing of generated code and results



Add multiple user interfaces, e.g. command line / file input / web-based
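Validation can be shared by every interface if it works on parameters that have already been parsed. A minimal sketch, assuming the parameter names used in the file-input examples (`sig_level`, `n_sims`, `output_lang`); the rules themselves, the function name `validation_errors` and the `r` value for `output_lang` are assumptions for illustration, not MLPowSim's actual checks.

```ruby
# Hypothetical validation rules for a few general input parameters.
# Returns a list of human-readable error messages (empty when valid),
# so a command-line, file or web front end can all report them.
def validation_errors(params)
  errors = []

  sig = params["sig_level"]
  unless sig.is_a?(Numeric) && sig > 0 && sig < 1
    errors << "sig_level must be a number between 0 and 1"
  end

  n_sims = params["n_sims"]
  unless n_sims.is_a?(Integer) && n_sims > 0
    errors << "n_sims must be a positive integer"
  end

  # Assumed set of output languages: MLwiN macro code or R code.
  unless %w[mlwin r].include?(params["output_lang"])
    errors << "output_lang must be 'mlwin' or 'r'"
  end

  errors
end

good = { "sig_level" => 0.025, "n_sims" => 1000, "output_lang" => "mlwin" }
bad  = { "sig_level" => 2.5,   "n_sims" => 0,    "output_lang" => "c" }

puts "good: #{validation_errors(good).size} error(s)"
puts "bad:  #{validation_errors(bad).size} error(s)"
```

Collecting every problem before reporting, rather than stopping at the first, lets the user fix all inputs in one pass.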

Ruby is …


Much like Python in a number of ways



Cross-platform



A good choice for metaprogramming



Excellent for text processing


… though in the end it boils down to personal preference

… moving to Ruby

In the words of the official Ruby site (http://www.ruby-lang.org/en/), Ruby is


“A dynamic, open source programming
language with a focus on simplicity and
productivity. It has an elegant syntax that
is natural to read and easy to write.”

(… I agree!)

Input methods


Command line


Current input method



File input


Useful during development


Facilitates automated testing



Web interface


Familiar mode of input


‘Easy’ to use

File input


Example for a 1-level model

    # Input params
    #
    # Example 1 (p. 8 in MLPowSim user manual)
    # MLwiN code output

    general:
      output_lang: mlwin
      rnd_num_seed: 1
      sig_level: 0.025
      n_sims: 1000

    model:
      n_levels: 1
      response_type: normal
      est_method: igls
      include_fixed_intercept: yes
      n_explanatory_vars: 0

    estimates:
      beta_0: -0.140
      sigma_sq_e: 1.051

    sample_size:
      level_1:
        low: 20
        hi: 600
        step: 20

File input


Example for a 2-level model

    # Input params
    #
    # Example 8 (p. 39 in MLPowSim user manual)
    # MLwiN code output

    general:
      output_lang: mlwin
      rnd_num_seed: 1
      sig_level: 0.025
      n_sims: 1000

    model:
      n_levels: 2
      is_balanced: yes
      structure: nested  #=> nested | cross-classified
      response_type: normal
      est_method: igls
      include_fixed_intercept: yes
      include_random_intercept: yes
      n_explanatory_vars: 0

    estimates:
      beta_0: -0.177
      sigma_sq_u: 0.151
      sigma_sq_e: 0.916

    sample_size:
      level_2:
        low: 10
        hi: 50
        step: 10
      level_1:
        low: 10
        hi: 60
        step: 10
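A file like the ones above could be read with Ruby's standard YAML library and each `low` / `hi` / `step` entry expanded into the sample sizes it describes. The following is a minimal sketch, not the project's code: the nesting of the keys is assumed, `expand_range` is a hypothetical helper, and pairing every level-2 size with every level-1 size is an assumption about how the ranges are used.

```ruby
require "yaml"

# Expand one low/hi/step entry into the list of sample sizes it describes.
def expand_range(spec)
  low, hi, step = spec.values_at("low", "hi", "step")
  raise ArgumentError, "step must be a positive integer" unless step.is_a?(Integer) && step > 0
  raise ArgumentError, "low must not exceed hi" if low > hi
  low.step(hi, step).to_a
end

# Inline copy of the sample_size section from the 2-level example.
input = YAML.safe_load(<<~INPUT)
  sample_size:
    level_2:
      low: 10
      hi: 50
      step: 10
    level_1:
      low: 10
      hi: 60
      step: 10
INPUT

sizes  = input["sample_size"].transform_values { |spec| expand_range(spec) }
# Assumption: every level-2 size is paired with every level-1 size.
combos = sizes["level_2"].product(sizes["level_1"])

puts "level 2 sizes: #{sizes['level_2'].join(', ')}"
puts "level 1 sizes: #{sizes['level_1'].join(', ')}"
puts "#{combos.size} combinations"
```

Because the same parsed structure can come from a file, the command line or a web form, keeping the expansion separate from the parsing makes it easy to test on its own.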

Advantages of adding a Web interface


More accessible


No download required


Indexed by search engines


Cross-platform (Windows/Mac/Linux)



Up-to-date version available as soon as deployed


Centralised bug fixes


New features



No distribution overhead



Opportunity to collect usage information


E.g. model parameters


… aligned with e-Stat objectives


Disadvantages of Web interface



“Constrained” by browser functionality



Need to be online to use it



Needs hosting resources



… fine for code-generation app as it stands, but would be too resource-intensive to run simulations and model-fitting on server


[Demo of command-line and Web-based interfaces for MLPowSim]


Improving speed


Another, parallel (so to speak) objective is using parallelization to speed up run-time for generated power calculation code



Have taken an initial look at using capabilities of multi-core processors by executing more than one run simultaneously



Exploratory code makes use of Unix (Linux) ‘forking’ to create sub-processes



This approach will not work on Windows (since Windows does not support forks)


Precludes possibility of using this approach for MLwiN


For now, doing tests on R code in Linux
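The forking approach can be sketched in Ruby as follows. This is an illustrative sketch under stated assumptions, not the exploratory code itself: `run_forked` is a hypothetical helper, and each job here is a stand-in lambda where the real setting would launch one run of the generated R code. At most `max_procs` children are alive at once, and each child sends its result back to the parent through a pipe.

```ruby
# Run each job in a forked child process, with at most max_procs
# children in flight. Results come back in job order.
# Unix-only: Kernel#fork is not implemented on Windows.
def run_forked(jobs, max_procs: 2)
  results = []
  running = [] # FIFO of [pid, reader] pairs for children still in flight

  collect = lambda do
    pid, reader = running.shift
    # Read to end-of-file before waiting, so a child with a large
    # result is never blocked on a full pipe.
    results << Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
  end

  jobs.each do |job|
    collect.call if running.size >= max_procs
    reader, writer = IO.pipe
    reader.binmode
    writer.binmode
    pid = fork do
      reader.close
      writer.write(Marshal.dump(job.call))
      writer.close
      exit!(0) # leave the child without running the parent's exit handlers
    end
    writer.close # the parent keeps only the read end
    running << [pid, reader]
  end

  collect.call until running.empty?
  results
end

# Stand-in jobs, one per sample size; each reports the process it ran in.
sample_sizes = [400, 500, 600]
jobs = sample_sizes.map { |n| -> { [n, Process.pid] } }

results = run_forked(jobs, max_procs: 2)
results.each { |n, pid| puts "sample size #{n} handled by process #{pid}" }
```

Waiting for the oldest child first keeps the bookkeeping simple; a version that waits for whichever child finishes first would use the processors slightly better when run times vary.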


Initial results (very rough, just a starting point):



Model: 1-Level, Normal response, Fixed intercept, No explanatory variables



R code with sample sizes from 400 to 600 in steps of
100 (i.e. 400, 500, 600)

Improving speed … contd.

    Run number   Sequential time elapsed (secs)   Up to 2 processes time elapsed (secs)
    01           17.68667507                      12.83575702
    02           17.58543682                      13.25761509
    03           17.81376505                      12.79697299
    04           17.61011219                      12.83477187
    05           18.78166199                      12.75112796
    06           17.83562708                      13.84314704
    07           21.42540908                      13.58644199
    08           17.81408715                      13.47865105
    09           22.32437897                      14.23557377
    10           22.47374010                      13.32088089
    11           22.37987995                      14.26538086
    12           20.43303204                      13.26997709
    13           17.97804618                      13.83806705
    14           19.54052210                      12.82744908
    15           17.61324906                      13.11621904
    16           17.60317612                      13.17418599
    17           17.74178410                      14.06648993
    18           18.01628709                      12.95062184
    19           17.85149908                      13.40029383
    20           18.02299380                      12.85679889
    21           17.92268395                      12.86912799

Improving speed … contd.

Summary

Number of runs: 20 (excluding 1st run)
Sequential time (ave): 18.94 secs
Forked time (ave): 13.34 secs
Percentage reduction: 29.58 %
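The summary figures can be reproduced from the per-run timings for runs 02 to 21. The short Ruby sketch below does that arithmetic; the variable and method names are illustrative, and the speed-up factor on the last line is an extra derived figure that the summary itself does not report.

```ruby
# Elapsed times in seconds for runs 02 to 21 (run 01 is excluded).
sequential = [
  17.58543682, 17.81376505, 17.61011219, 18.78166199, 17.83562708,
  21.42540908, 17.81408715, 22.32437897, 22.47374010, 22.37987995,
  20.43303204, 17.97804618, 19.54052210, 17.61324906, 17.60317612,
  17.74178410, 18.01628709, 17.85149908, 18.02299380, 17.92268395
]
forked = [
  13.25761509, 12.79697299, 12.83477187, 12.75112796, 13.84314704,
  13.58644199, 13.47865105, 14.23557377, 13.32088089, 14.26538086,
  13.26997709, 13.83806705, 12.82744908, 13.11621904, 13.17418599,
  14.06648993, 12.95062184, 13.40029383, 12.85679889, 12.86912799
]

def mean(values)
  values.sum / values.size
end

seq_mean    = mean(sequential)
forked_mean = mean(forked)
# Reduction relative to the sequential average, as a percentage.
reduction   = (seq_mean - forked_mean) / seq_mean * 100

puts format("Sequential time (ave): %.2f secs", seq_mean)
puts format("Forked time (ave):     %.2f secs", forked_mean)
puts format("Percentage reduction:  %.2f %%", reduction)
puts format("Speed-up factor:       %.2f", seq_mean / forked_mean)
```

With up to two processes, the ideal reduction would be 50 %, so the observed figure suggests a noticeable share of each run is not parallelised or is spent on process overhead.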

Where to from here?

… this is just a small start …



Extend MLPowSim to support more models


Add test cases for code generation to cope with more models



Add automated tests for verifying actual numerical output



Further develop Web interface



Continue investigating speed improvements through
parallelization