Parallel and Distributed Computing


Cloud Computing Lecture #1

Parallel and Distributed Computing

Jimmy Lin

The iSchool

University of Maryland


Monday, January 28, 2008

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

Material adapted from slides by Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed Computing Seminar, 2007 (licensed under Creative Commons Attribution 3.0 License)


Today’s Topics


Course overview


Introduction to parallel and distributed processing


What’s the course about?


Integration of research and teaching


Team leaders get help on a tough problem


Team members gain valuable experience


Criteria for success: at the end of the course


Each team will have a publishable result


Each team will have a paper suitable for submission to
an appropriate conference/journal


Along the way:


Build a community of hadoopers at Maryland


Generate lots of publicity


Have lots of fun!



Hadoop Zen


Don’t get frustrated (take a deep breath)…


This is bleeding edge technology


Be patient…


This is the first time I’ve taught this course


Be flexible…


Lots of uncertainty along the way


Be constructive…


Tell me how I can make everyone’s experience better


Things to go over…


Course schedule


Course objectives


Assignments and deliverables


Evaluation


My Role


To hack alongside everyone


To substantively contribute ideas where
appropriate


To serve as a facilitator and a resource


To make sure everything runs smoothly


Outline


Web-Scale Problems


Parallel vs. Distributed Computing


Flynn's Taxonomy


Programming Patterns


Web-Scale Problems


Characteristics:


Lots of data


Lots of crunching (though the computation itself is not necessarily complex)


Examples:


Obviously, problems involving the Web


Empirical and data-driven research (e.g., in HLT)


“Post-genomics era” in life sciences


High-quality animation


The serious hobbyist



It all boils down to…


Divide-and-conquer


Throwing more hardware at the problem

Simple to understand… a lifetime to master…


Parallel vs. Distributed


Parallel computing generally means:


Vector processing of data


Multiple CPUs in a single computer


Distributed computing generally means:


Multiple CPUs across many computers


Flynn’s Taxonomy

                          Instructions
                          Single (SI)                  Multiple (MI)
Data   Single (SD)        SISD:                        MISD:
                          Single-threaded process      Pipeline architecture
       Multiple (MD)      SIMD:                        MIMD:
                          Vector processing            Multi-threaded programming


SISD

[Figure: a single processor runs one instruction stream over one stream of data items.]


SIMD

[Figure: a single processor applies one instruction stream to many data elements (D0, D1, …, Dn) in lockstep.]


MIMD

[Figure: multiple processors each run their own instruction stream over their own data.]
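
A minimal sketch of the SISD/SIMD distinction in code, assuming NumPy is available; the array contents and sizes are illustrative. The explicit loop touches one data element per instruction (SISD-style), while the NumPy expression applies one logical operation to the whole array, which NumPy can map onto vectorized (SIMD) hardware instructions. MIMD corresponds to running several such instruction streams at once, e.g., in separate threads or processes.

    import numpy as np

    data = np.arange(100_000, dtype=np.float64)

    # SISD-style: one instruction operates on one data element at a time
    total = 0.0
    for x in data:
        total += x * 2.0

    # SIMD-style: one logical operation applied across many data elements
    total_vec = (data * 2.0).sum()

    assert np.isclose(total, total_vec)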


Parallel vs. Distributed

Parallel: Multiple CPUs within a shared-memory machine

Distributed: Multiple machines, each with its own memory, connected over a network

[Figure: parallel processors operating on a shared memory vs. distributed processors with their own memory, linked by a network connection for data transfer.]
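
A rough single-machine analogue of this distinction, as a sketch: threads share one address space (the shared-memory, parallel case), while separate processes have their own memory and must transfer data explicitly (closer to the distributed case). The task functions and messages here are illustrative.

    import threading
    import multiprocessing

    shared = []

    def thread_task():
        shared.append("hello")    # threads write directly into shared memory

    def process_task(conn):
        conn.send("hello")        # a separate process must send its data over a channel

    if __name__ == "__main__":
        t = threading.Thread(target=thread_task)
        t.start(); t.join()
        print("via shared memory:", shared)

        parent, child = multiprocessing.Pipe()
        p = multiprocessing.Process(target=process_task, args=(child,))
        p.start()
        print("via data transfer:", parent.recv())
        p.join()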


Divide and Conquer

[Figure: the “work” is partitioned into pieces w1, w2, w3; each piece goes to a “worker”, producing partial results r1, r2, r3, which are combined into the final “result”.]
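
A minimal sketch of this partition/worker/combine flow using Python's standard library; the chunking scheme and the summing task are illustrative placeholders.

    from concurrent.futures import ProcessPoolExecutor

    def do_work(chunk):
        return sum(chunk)                     # each "worker" handles one partition

    if __name__ == "__main__":
        work = list(range(1_000))             # the overall "work"
        partitions = [work[i:i + 250] for i in range(0, len(work), 250)]

        with ProcessPoolExecutor() as workers:
            partial_results = list(workers.map(do_work, partitions))

        result = sum(partial_results)         # combine the partial results
        print(result)                         # 499500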


Different Workers


Different threads in the same core


Different cores in the same CPU


Different CPUs in a multi-processor system


Different machines in a distributed system


Parallelization Problems


How do we assign work units to workers?


What if we have more work units than workers?


What if workers need to share partial results?


How do we aggregate partial results?


How do we know all the workers have finished?


What if workers die?

What is the common theme of all of these problems?


General Theme?


Parallelization problems arise from:


Communication between workers


Access to shared resources (e.g., data)


Thus, we need a synchronization system!


This is tricky:


Finding bugs is hard


Fixing bugs is even harder
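
As one concrete illustration of the shared-resource problem above, here is a small sketch (the counter and thread counts are illustrative) in which several threads update a shared counter without synchronization; updates can be lost, and wrapping the read-modify-write in a threading.Lock fixes it.

    import sys
    import threading

    sys.setswitchinterval(1e-6)    # switch threads often, to make the race easier to see

    counter = 0                    # shared resource, no synchronization

    def worker(iterations):
        global counter
        for _ in range(iterations):
            current = counter      # read...
            counter = current + 1  # ...modify and write: another thread can interleave here

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Expected 400000; without a lock the total is usually smaller.
    print(counter)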


Multi-Threaded Programming


Difficult because


Don’t know the order in which threads run


Don’t know when threads interrupt each other


Thus, we need:


Semaphores (lock, unlock)


Condition variables (wait, notify, broadcast)


Barriers


Still, lots of problems:


Deadlock, livelock


Race conditions


...


Moral of the story: be careful!
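
A minimal sketch of these primitives using Python's threading module; the worker logic and thread counts are illustrative, not part of the original slides.

    import threading

    N_WORKERS = 3
    sem = threading.Semaphore(1)            # semaphore used as a lock (acquire/release)
    reported = threading.Condition()        # condition variable (wait/notify)
    barrier = threading.Barrier(N_WORKERS)  # all workers rendezvous here

    shared_total = 0
    finished = 0

    def worker(wid):
        global shared_total, finished
        barrier.wait()                      # proceed only once every worker has arrived

        sem.acquire()                       # lock the shared counter
        shared_total += wid
        sem.release()                       # unlock

        with reported:                      # signal completion to the main thread
            finished += 1
            reported.notify_all()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
    for t in threads:
        t.start()

    with reported:
        while finished < N_WORKERS:         # wait until every worker has reported
            reported.wait()

    print("total:", shared_total)           # 0 + 1 + 2 = 3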


Patterns for Parallelism


Several programming methodologies exist to
build parallelism into programs


Here are some…


Master/Workers


The master initially owns all data


The master creates workers and assigns tasks


The master waits for workers to report back

[Figure: a master dispatching tasks to a set of workers and collecting their reports.]
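
A minimal master/workers sketch using multiprocessing.Pool, as one way to realize this pattern; square() and the input data are illustrative placeholders.

    from multiprocessing import Pool

    def square(x):
        return x * x                        # the task each worker performs

    if __name__ == "__main__":
        data = range(10)                    # the master initially owns all the data
        with Pool(processes=4) as workers:  # the master creates the workers
            results = workers.map(square, data)  # assign tasks and wait for reports
        print(results)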


Producer/Consumer Flow


Producers create work items


Consumers process them


Can be daisy-chained

[Figure: producers (P) feeding work items to consumers (C); stages can be daisy-chained.]
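
A minimal daisy-chained producer/consumer sketch using queue.Queue; the three stages, the item values, and the None end-of-stream marker are illustrative conventions, not part of the original slides.

    import queue
    import threading

    q12 = queue.Queue()   # stage 1 -> stage 2
    q23 = queue.Queue()   # stage 2 -> stage 3

    def stage1():
        for i in range(5):
            q12.put(i)                # produce work items
        q12.put(None)                 # signal "no more items"

    def stage2():
        while (item := q12.get()) is not None:
            q23.put(item * item)      # consume, then act as a producer for stage 3
        q23.put(None)

    def stage3():
        while (item := q23.get()) is not None:
            print("result:", item)    # final consumer

    threads = [threading.Thread(target=s) for s in (stage1, stage2, stage3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()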



Work Queues

Any available consumer should be able to process data from any producer

Work queues remove the 1:1 relationship between producers and consumers

[Figure: producers (P) placing work items on a shared queue, from which consumers (C) pull work.]
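
A minimal shared work-queue sketch with several producers and consumers; the item values, thread counts, and the None "poison pill" shutdown convention are illustrative.

    import queue
    import threading

    work = queue.Queue()              # the shared queue

    def producer(pid):
        for i in range(3):
            work.put((pid, i))        # any consumer may pick this item up

    def consumer(cid):
        while True:
            item = work.get()
            if item is None:          # poison pill: time to shut down
                break
            print(f"consumer {cid} handled {item}")
            work.task_done()

    producers = [threading.Thread(target=producer, args=(p,)) for p in range(3)]
    consumers = [threading.Thread(target=consumer, args=(c,)) for c in range(2)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    work.join()                       # wait until every queued item has been processed
    for _ in consumers:
        work.put(None)                # one poison pill per consumer
    for t in consumers:
        t.join()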


And finally…


The above solutions represent general patterns


In reality:


Lots of one-off solutions, custom code


Burden on the programmer to manage everything


Can we push the complexity onto the system?


MapReduce…for next time