Hadoop: Data Intensive Scalable Computing at Yahoo

Owen O'Malley

3-4:30, Tuesday, November 11, 2008, LH 3

PLATO Royalty Lecture Series



As Brin and Page say in their classic 1998 paper on web search, engineering a search engine is challenging. The web represents a huge and ever-increasing source of information, and indexing it requires processing extremely large data sets. Current search engines have information on hundreds of billions of web pages and more than a trillion links between them.

And yet the data is constantly changing and must be re-indexed continually. Processing hundreds of terabytes of data in a reasonable amount of time requires a multitude of computers. However, using a lot of computers, especially commodity Linux PCs, means that machines are always failing, creating an operations nightmare.

In this talk, Owen will describe how search engines scale to the necessary size using software frameworks, an area now known as web-scale data-intensive computing. In particular, he will show us how many computers can be reliably coordinated to address such problems using Apache Hadoop, which is largely developed by Yahoo, and how to program such solutions using a programming model called Map/Reduce.
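The Map/Reduce model mentioned above can be sketched in a few lines of plain Python, independent of Hadoop itself: a map function emits key/value pairs, the framework groups the pairs by key (the "shuffle"), and a reduce function combines each group. The word-count program below is a conceptual illustration only; the function names and single-machine driver are this sketch's own, not Hadoop's actual Java API.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit a (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: combine all the counts emitted for one word.
    return (word, sum(counts))

def mapreduce(documents):
    # Shuffle: group mapped values by key, as the framework
    # would do between distributed map and reduce tasks.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_phase(doc_id, text):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

docs = {"a.txt": "the web is big", "b.txt": "the web changes"}
print(mapreduce(docs))
# e.g. {'the': 2, 'web': 2, 'is': 1, 'big': 1, 'changes': 1}
```

In a real Hadoop cluster the map and reduce calls run as separate tasks on many machines, and the framework handles the grouping, data movement, and retrying of failed tasks; that fault tolerance is what makes the model practical on commodity hardware.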



Owen O'Malley is a software architect on Hadoop working for Yahoo's Grid team, which is part of Yahoo's Cloud Computing & Data Infrastructure group. He has been contributing patches to Hadoop since before it was separated from Nutch, and is the chair of the Hadoop Project Management Committee. Although specializing in developing tools, he has wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), model checking (NASA), and distributed computing (Yahoo). He received his PhD in Software Engineering from the University of California, Irvine.

Companion Readings

1. Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW7/Computer Networks 30 (1-7): 107-117, 1998.

2. (optional) Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004. http://labs.google.com/papers/mapreduce.html

3. (optional) Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.

4. Peruse http://hadoop.apache.org/, http://public.yahoo.com/gogate/hadoop-tutorial/, and





This Lecture Series is sponsored by Evergreen's PLATO Royalty Fund, a fund established with royalties from computer-assisted instruction (CAI) software written by Evergreen faculty John Aikin Cushing and students in the early 1980s for the Control Data PLATO system.