
Hadoop: Data Intensive Scalable Computing at Yahoo

Owen O'Malley

3-4:30, Tuesday, November 11, 2008, LH 3

PLATO Royalty Lecture Series¹


Abstract:



As Brin and Page say in their classic 1998 paper on web search, engineering search engines is challenging. The web represents a huge and ever-increasing source of information, and indexing it requires processing extremely large data sets. Current search engines have information on hundreds of billions of web pages and more than a trillion links between them. And yet the data is constantly changing and must be re-indexed continually. Processing hundreds of terabytes of data in a reasonable amount of time requires a multitude of computers. However, using a lot of computers, especially commodity Linux PCs, means that computers are always failing, creating an operations nightmare.


In this talk, Owen will describe how search engines scale to the necessary size using software frameworks, an area now known as web-scale data intensive computing. In particular, he will show us how many computers can be reliably coordinated to address such problems using Apache Hadoop, which is largely developed by Yahoo, and how to program such solutions using a programming model called Map/Reduce.
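
To make the Map/Reduce model concrete, below is a minimal sketch of the classic word-count job in Hadoop's Java API. It is written against the org.apache.hadoop.mapreduce package of later Hadoop releases; the API available at the time of this talk differed in some details, and the jar and directory names in the usage note are placeholders. The map function emits a (word, 1) pair for every token it sees, and the reduce function sums the counts grouped under each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for each line of input, emit (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts the framework has grouped under each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is submitted with something along the lines of "hadoop jar wordcount.jar WordCount <input dir> <output dir>". The framework splits the input across many mappers, shuffles the intermediate (word, count) pairs to reducers, and re-executes tasks on machines that fail, which is exactly the coordination problem described above.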


The Speaker:



Owen O'Malley is a software architect on Hadoop working for Yahoo's Grid team, which is part of Yahoo's Cloud Computing & Data Infrastructure group. He has been contributing patches to Hadoop since before it was separated from Nutch, and is the chair of the Hadoop Project Management Committee. Although specializing in developing tools, he has wandered between testing (UCI), static analysis (Reasoning), configuration management (Sun), model checking (NASA), and distributed computing (Yahoo). He received his PhD in Software Engineering from the University of California, Irvine. http://people.apache.org/~omalley


Companion Readings:



1. Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW7/Computer Networks 30 (1-7): 107-117, 1998. http://infolab.stanford.edu/~backrub/google.html

2. (optional) Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004. http://labs.google.com/papers/mapreduce.html

3. (optional) Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003. http://labs.google.com/papers/gfs-sosp2003.pdf

4. Peruse http://hadoop.apache.org/, http://public.yahoo.com/gogate/hadoop-tutorial/, and possibly:

   http://developer.yahoo.net/blogs/hadoop/

   http://developer.yahoo.net/blogs/hadoop/2008/09/scaling_hadoop_to_4000_nodes_a.html

   http://www.hackszine.com/blog/archive/2008/09/write_a_hadoop_mapreduce_job_i.html





¹ This Lecture Series is sponsored by Evergreen's PLATO Royalty Fund, a fund established with royalties from computer-assisted instruction (CAI) software written by Evergreen faculty John Aikin Cushing and students in the early 1980s for the Control Data PLATO system.