Lecture2-ClusterBuildingBlocks - UCLA

Nov 18, 2013

Lecture Overview


Introduce:


Parallel Processing


Clusters & Grids


Schedulers


Hands On:


Parallel Processing


Forking, Threading


Clusters & Grids


Sun Grid Engine

Next Week


Intermediate Topics:


More on Clusters, Grids & Schedulers


Distributed Filesystems


Introduction to Hadoop


Hands On:


Parallel Processing


Pthreads, OpenMP (as time permits)


Clusters/Grids


Sun Grid Engine (Intermediate Topics)


MPI (as time permits)


Distributed Filesystems


Lustre

Processes


Wikipedia


"In computing, a process is an instance of a
computer program that is being executed. It
contains the program code and its current
activity."


"A computer program is a passive collection of
instructions, a process is the actual execution of
those instructions."


A process is a running program, made up of one or
more instructions


An instruction is a CPU operation (add, and, or)

Processes


Wikipedia


"For security and reliability reasons most
modern operating systems prevent direct
communication between independent
processes, providing strictly mediated and
controlled inter-process communication
functionality."


Processes are encapsulated and cannot freely talk to each other


They consist of code, stack, and data


Need some special mechanism for
communication


"Depending on the operating system (OS), a
process may be made up of multiple threads of
execution that execute instructions
concurrently"


Instructions can be run in parallel for performance


Adding 4 numbers: each of 2 cores/CPUs adds 2 of them, then the partial sums are combined

Parallel Processes


AMD announced 48 core servers (4 x 12core)


< $20k (not super high-end)


Now what?


Options


Run lots of processes


How do we break a program into logical
programs?


Forking


Copy ourselves


Threading


Light Weight Processes

Lots of Processes


Multiple Users


Local Users


Remote Users


SSH, Web Server, etc


Multiple Tasks


Firefox


Email


Kernel

Lots of Processes


What if we only have one 'main' program?
Batch jobs

./find_odd_then_add file_with_numbers


Data Parallel


Break up the data

./find_odd_then_add data1.txt >> output.tmp

./find_odd_then_add data2.txt >> output.tmp

./add_all_numbers output.tmp > result.txt


Task Parallel


Break up the task

./find_odd data1.txt >> output.tmp

./add_all_numbers output.tmp > result.txt
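The data-parallel layout above can be sketched in C. This is only an illustration under assumptions: the source for find_odd_then_add and add_all_numbers is not shown in the lecture, so the functions below are hypothetical stand-ins (sum the odd values in a chunk, then add the partial results), and data_parallel_sum is a made-up name. The two chunk calls are run one after the other here, but they are independent of each other, so each could be its own process or batch job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ./find_odd_then_add:
 * sum the odd values found in one chunk of data. */
long find_odd_then_add(const int *data, size_t n) {
    assert(data != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] % 2 != 0) {   /* remainder is 1 or -1 for odd values */
            sum += data[i];
        }
    }
    return sum;
}

/* Hypothetical stand-in for ./add_all_numbers:
 * combine the partial results (the lines of output.tmp). */
long add_all_numbers(const long *partials, size_t n) {
    assert(partials != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += partials[i];
    }
    return total;
}

/* Data parallel: break the data into two chunks, run the same
 * work on each chunk, then combine the partial results. */
long data_parallel_sum(const int *data, size_t n) {
    size_t half = n / 2;
    long partials[2];
    partials[0] = find_odd_then_add(data, half);            /* "data1.txt" */
    partials[1] = find_odd_then_add(data + half, n - half); /* "data2.txt" */
    return add_all_numbers(partials, 2);
}
```

Splitting works here because addition does not care which chunk a number came from; a task that needs to see neighbouring values would need a more careful split.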

Forks


The fork() system call in UNIX causes
creation of a new process. The new
process (called the child process) is an
exact copy of the calling process
(called the parent process) except for
the following:


The child process has a unique
process ID.


The child process has a different
parent process ID (i.e., the
process ID of the parent process).


The child process has its own
copy of the parent's descriptors.
These descriptors reference the
same underlying objects, so that,
for instance, file pointers in file
objects are shared between the
child and the parent, so that an
lseek(2) on a descriptor in the
child process can affect a
subsequent read(2) or write(2) by
the parent. This descriptor copying
is also used by the shell to
establish standard input and
output for newly created
processes as well as to set up
pipes.


The child process' resource
utilizations are set to 0; see
setrlimit(2).


All interval timers are cleared; see
setitimer(2).


http://gauss.ececs.uc.edu/Users/Franco/ForksThreads/forks.html
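Because the child gets its own copy of the parent's descriptors, a pipe created before fork() gives the two processes a channel to talk over. The sketch below is not the lecture's demo code; add_four_with_child is a hypothetical name, and the example simply has the child add two numbers and send the partial sum back while the parent adds the other two.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: compute a + b + c + d, with the child process
 * computing c + d and sending it to the parent through a pipe.
 * Returns 0 on success (sum stored in *result), -1 on failure. */
int add_four_with_child(int a, int b, int c, int d, int *result) {
    int fds[2];                       /* fds[0] = read end, fds[1] = write end */
    assert(result != NULL);
    if (pipe(fds) == -1) {
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {                  /* fork failed: no child exists */
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: holds copies of both descriptors; it only writes. */
        close(fds[0]);
        int partial = c + d;
        ssize_t written = write(fds[1], &partial, sizeof partial);
        close(fds[1]);
        _exit(written == (ssize_t)sizeof partial ? 0 : 1);
    }

    /* Parent: only reads. */
    close(fds[1]);
    int mine = a + b;
    int theirs = 0;
    ssize_t got = read(fds[0], &theirs, sizeof theirs);
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        return -1;
    }
    if (got != (ssize_t)sizeof theirs) {
        return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        return -1;
    }
    *result = mine + theirs;
    return 0;
}
```

Each side closes the pipe end it does not use, so the reader sees end-of-file if the writer exits early.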

Fork: Example


http://genome.ucla.edu/~jordan/teaching/spring2010/LinuxCloudComputing/lecture2/demos/fork.c


Notes:


ps -ef


Start with just the parent process


On fork() we have a child too


Fork returns:


To Child: 0


To Parent: child's PID


Error: -1


Wait on child PID to reap it (avoids zombie and orphaned children)
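The three return values of fork() and the wait on the child can be sketched as follows. This is a minimal illustration, not the contents of the fork.c demo linked above; fork_and_wait is a hypothetical helper name.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with exit_code, wait
 * for it in the parent, and return the exit code the parent observed.
 * Returns -1 if fork or waitpid fails or the child did not exit normally. */
int fork_and_wait(int exit_code) {
    assert(exit_code >= 0 && exit_code <= 255);

    pid_t pid = fork();
    if (pid == -1) {
        /* Error: fork returned -1, there is no child. */
        return -1;
    }
    if (pid == 0) {
        /* To child: fork returned 0. */
        _exit(exit_code);
    }

    /* To parent: fork returned the child's PID. Waiting on that PID
     * collects the child's exit status so it does not linger. */
    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

While the child is alive, ps -ef would list two processes with the same program name, the child's PPID column showing the parent's PID.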

Threads


Wikipedia:


"In computer science, a thread of execution
results from a fork of a computer program into
two or more concurrently running tasks"


"Multiple threads can exist within the same
process and share resources such as memory,
while different processes do not share these
resources."

Threads vs. Processes


Wikipedia:


processes are typically independent, while
threads exist as subsets of a process


processes carry considerable state information,
whereas multiple threads within a process
share state as well as memory and other
resources


processes have separate address spaces,
whereas threads share their address space


processes interact only through system-provided
inter-process communication mechanisms.


Context switching between threads in the same
process is typically faster than context switching
between processes.

Threads: Example


P(OSIX)threads is a standard for threading


Threads used to be platform specific


Java has its own Thread system



Compile: gcc -pthread pthreads.c


Notes:


ps -ef


We thread, but still just one process


Threads don't always end in the same order!
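A minimal pthreads sketch, assuming nothing about the lecture's own pthreads.c: each thread sums its own slice of an array that all threads can see, since they share the process's address space. The names threaded_sum, sum_slice, struct slice and the MAX_THREADS limit are all hypothetical. It should build with the same gcc -pthread flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8   /* arbitrary limit for this sketch */

/* Work description handed to one thread. */
struct slice {
    const int *data;   /* start of this thread's part of the shared array */
    size_t len;        /* number of elements in this part */
    long sum;          /* result, written only by the owning thread */
};

/* Thread entry point: sum one slice. */
static void *sum_slice(void *arg) {
    struct slice *s = arg;
    s->sum = 0;
    for (size_t i = 0; i < s->len; i++) {
        s->sum += s->data[i];
    }
    return NULL;
}

/* Hypothetical helper: sum data[0..n) using nthreads threads.
 * Returns 0 on success (sum in *total), -1 if a thread could not start. */
int threaded_sum(const int *data, size_t n, int nthreads, long *total) {
    assert(total != NULL);
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    pthread_t tids[MAX_THREADS];
    struct slice slices[MAX_THREADS];
    size_t chunk = n / (size_t)nthreads;
    int started = 0;
    int rc = 0;

    for (int t = 0; t < nthreads; t++) {
        size_t begin = (size_t)t * chunk;
        slices[t].data = data + begin;
        /* The last thread also takes any leftover elements. */
        slices[t].len = (t == nthreads - 1) ? n - begin : chunk;
        slices[t].sum = 0;
        if (pthread_create(&tids[t], NULL, sum_slice, &slices[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }

    /* Join in creation order; the threads may finish in any order. */
    *total = 0;
    for (int t = 0; t < started; t++) {
        pthread_join(tids[t], NULL);
        *total += slices[t].sum;
    }
    return rc;
}
```

No locking is needed in this case because every thread writes only to its own struct slice.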

Race Conditions


Why don't our threads end in order?


There is no control structure to mediate threads


Does it matter?


Not here


What is an example of when it would?


2 threads writing to file at once? Maybe


Others?


Protection


Locks, Mutexes, Waiting/Blocking


Careful of deadlocks!
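One case where ordering does matter is several threads updating the same variable. The sketch below protects a shared counter with a pthread mutex; count_with_mutex, bump, struct counter and MAX_WORKERS are hypothetical names chosen for the example. Without the lock/unlock pair, two threads could read the same old value and one increment would be lost.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WORKERS 8   /* arbitrary limit for this sketch */

/* State shared by all threads. */
struct counter {
    pthread_mutex_t lock;
    long value;
    int increments;     /* how many times each thread bumps the counter */
};

/* Thread entry point: increment the shared counter under the mutex. */
static void *bump(void *arg) {
    struct counter *c = arg;
    for (int i = 0; i < c->increments; i++) {
        pthread_mutex_lock(&c->lock);     /* blocks until the lock is free */
        c->value++;                       /* critical section */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Hypothetical helper: run nthreads threads that each add 1 to a shared
 * counter `increments` times; return the final counter value. */
long count_with_mutex(int nthreads, int increments) {
    assert(nthreads >= 1 && nthreads <= MAX_WORKERS);
    assert(increments >= 0);

    struct counter c;
    c.value = 0;
    c.increments = increments;
    pthread_mutex_init(&c.lock, NULL);

    pthread_t tids[MAX_WORKERS];
    int started = 0;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&tids[t], NULL, bump, &c) != 0) {
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++) {
        pthread_join(tids[t], NULL);
    }

    pthread_mutex_destroy(&c.lock);
    return c.value;
}
```

Only one mutex is involved here, so this sketch cannot deadlock; the risk appears once threads hold one lock while waiting for another.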

Review


For parallelism within a machine:


Run lots of programs


Break 1 program into multiple programs


Or break up data set


Run 1 program that forks multiple processes


Run 1 program that spawns multiple threads


Some hybrid approach

Clusters


What about across machines?


Still need to divide labor (e.g. jobs)


Submit hosts


Still need to run each process
somewhere


Execution hosts


Also need to figure out what to run where


Scheduler, Master Node


Also need a way for machines to communicate


Need to share memory and 'messages' between
hosts


Think threading over the network


Message Passing Interface (MPI)

Sun Grid Engine


One of the many leading open-source tools


Alternatives include LSF (commercial), PBS, Condor, Maui, etc


Coordination of a cluster


Queuing System


Take script(s) from user(s), turn into job(s)


Maintain jobs in queue until ready


Submit 100k jobs, without waiting for resources


Scheduler


Run pending jobs as resources become
available


Which job do I run first and on which node?


Could teach a whole class on scheduling, more
next time

Sun Grid Engine


Scheduling Workflow


Submit host

submits a job to 1 or more queues


(Along with environment information)


Master

queues the job


Master

finds empty slots on the system


Each host in a Queue has 0 or more slots for jobs


*Scheduler

figures out which job to run next


*Scheduler

finds optimal host(s) for the job


*Scheduler

finds appropriate slot(s) to use


(*) These are all combined into a formula

Sun Grid Engine


Execution Workflow


Execution host

is assigned job and runs it


Instantiate a shell, and run script


Stdout, Stderr are redirected to files


Slot(s) are marked as used


Master

monitors host to detect errors or failure


On failure can flag job or host


Master does all kinds of logging and analytics


Host can fail, job can fail, storage can fail


Failure handling/resubmission varies

Using Sun Grid Engine


Commands:


qsub: submit a job to the queue


qrsh/qlogin: get a login on a machine


qstat: examine a job’s current status


qacct: get a job’s accounting information


qalter: change the attributes of pending jobs


qdel: delete jobs from the queue


qmon: graphical front-end


qmod/qconf: more for admin

Using Sun Grid Engine


Examples:


Submit jobs


qsub foo.sh


echo "hostname ; sleep" | qsub -N pipe


qsub -o /home/jordan/stdout -e /home/jordan/stderr …


Default shell is CSH, override with -S


qsub -S /bin/bash ./my_job.sh


Monitor jobs


qstat, qstat -f, qstat -j, qstat -g c, qstat -F


| awk, grep, cut, sort, etc


Setting variables, environment


qsub -v key, qsub -v key[=value]


Deleting jobs


qdel 7 10 92, qdel -u jordan (careful), qdel `qstat | grep $NAME | cut ...`


Setting priorities


Quiz. How do I do it?

Lab Time


Demo cluster


ssh ec2-174-129-166-96.compute-1.amazonaws.com


Username is student#, where # is the number of
your machine


Parallel Exercises:


http://genome.ucla.edu/~jordan/teaching/spring2010/LinuxCloudComputing/lecture2/


lab_problems.txt


demos/*


SGE Exercises:


https://www.wiki.ed.ac.uk/download/attachments/20743956/sge_workbook5.pdf?version=2


Follow the tutorial from pages 5-18


As time permits:


http://blog.bioteam.net/2009/09/28/sge-training-slides/


http://science.larc.nasa.gov/ceres/presentations/05-SGE-6-Usage-For-Users.pdf


Ignore Array Jobs and quotas for now