Lecture 5: Google Search Engine Infrastructure

Internet and Web Applications


Building a Computing System for
the World’s Information

Yonggang Wang

Google Software Engineer

Plan for today


Behind-the-scenes look at systems infrastructure

Computing platform

Distributed systems

Software infrastructure

Google products

Google’s Mission

To organize the world’s
information

and make it universally

accessible and useful

How much information is out there?


How large is the Web?


Tens of billions of documents? Hundreds?


~10KB/doc => 100s of Terabytes


Then there’s everything else


Email, personal files, closed databases,
broadcast media, print, etc.


Estimated 5 Exabytes/year (growing at 30%)*


800MB/year/person


~90% in magnetic media


Web is just a tiny starting point

Source: How much information 2003

Google takes its mission seriously


Started with the Web (html)


Added various document formats


Images


Commercial data: ads and shopping (Froogle)


Enterprise (corporate data)


News


Email (Gmail)


Scholarly publications


Local information


Maps


Yellow pages


Satellite images


Instant messaging and VoIP


Communities (Orkut)


Printed media




Ever-Increasing Computation Needs

[Diagram: “more queries”, “better results”, “more data”]


Every Google service
sees continuing growth
in computational needs


More queries


More users, happier users


More data


Bigger web, mailbox, blog,
etc.


Better results


Find the right
information, and find it
faster

Systems Infrastructure


Goal: Create very large scale, high
performance computing infrastructure


Hardware + software systems to make it easy to build
products



Focus on price/performance, and ease of use


Enables better products:


indices containing more documents


updated more often


faster queries


faster product development cycles





Computing platform


Cost-efficiency


Server design


Networking


Datacenter technology


Hardware Design Philosophy


Prefer low-end server/PC-class designs


Build lots of them!



Why?


Single machine performance is not interesting


Our smaller problems are too large for any single system


Large problems are easily partitioned into multiple threads



“Ultra-reliable” hardware makes programmers lazy


Even the most reliable platform will still fail: fault-tolerant software is needed

Fault-tolerant software enables use of commodity components


Interesting systems can be designed with commodity components

google.stanford.edu (circa 1997)

google.com (1999)

Google Data Center (circa 2000)

google.com (new data center 2001)

google.com (3 days later)

Current Design


In-house rack design

PC-class motherboards

Low-end storage and networking hardware

Linux + in-house software

Systems Infrastructure


Google file system (GFS)


MapReduce


BigTable

GFS: Google File System


Why develop our own file system?


Google has unique FS requirements


Huge read/write bandwidth


Reliability over thousands of nodes


Mostly operating on large data blocks


Need efficient distributed operations


Unfair advantage


We have control over applications,
libraries and operating system


GFS Setup


Master manages metadata


Data transfers happen directly between
clients/chunkservers


Files broken into chunks (typically 64 MB)

[Diagram: GFS clients contact the GFS master (with master replicas and misc. servers) for metadata, then read and write chunks (C0, C1, C2, C3, C5, …) directly from chunkservers 1…N; each chunk is replicated on several chunkservers]
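To make the read path above concrete, here is a minimal sketch of how a client might locate and fetch data: the master is consulted only for chunk locations, and the bytes come directly from a chunkserver. The names Master.lookup and Chunkserver.read are hypothetical stand-ins, not the real GFS client API.

# Minimal sketch of the GFS read path (hypothetical API, for illustration).

CHUNK_SIZE = 64 * 1024 * 1024              # files are split into 64 MB chunks

def gfs_read(master, filename, offset, length):
    # Read `length` bytes of `filename` starting at byte `offset`.
    data = []
    while length > 0:
        chunk_index = offset // CHUNK_SIZE           # which chunk holds this offset
        chunk_offset = offset % CHUNK_SIZE           # position inside that chunk
        n = min(length, CHUNK_SIZE - chunk_offset)   # stay within the chunk

        # 1. Ask the master for metadata only: the chunk handle and the
        #    chunkservers holding replicas of that chunk.
        handle, replicas = master.lookup(filename, chunk_index)

        # 2. Transfer the bytes directly from a chunkserver; the master
        #    is never on the data path.
        data.append(replicas[0].read(handle, chunk_offset, n))

        offset += n
        length -= n
    return b"".join(data)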




GFS Usage @ Google

200+ clusters

Filesystem clusters of up to 5000+ machines

Pools of 10000+ clients

5+ PB filesystems

40 GB/s read/write load in a single cluster

(in the presence of frequent HW failures)

MapReduce + BigTable

Okay, GFS lets us store lots of data… now what?

We want to process that data in new and interesting ways!

MapReduce:

a programming model and library to simplify large-scale computations on large clusters

BigTable:

a large-scale storage system for semi-structured data

Database-like model, but data stored on thousands of machines

MapReduce


What is MapReduce?


MapReduce usage at Google


How to write a MapReduction


MapReduce internals


Features and tricks


Conclusions

What is MapReduce?


For processing lots of data


Map phase extracts relevant information from each record of the input

Reduce phase collects data together, produces final output

Has a shuffle (resharding) and a sort (by key) between the map and the reduce


Good for batch operations


User writes two simple functions: map and reduce


Underlying library takes care of messy details



Example: Word Frequencies


Have doclogs or docjoins with one document per record


Specify a map function that takes a key/value pair

key = docid

value = protocol message with document text


Output of map function is (potentially many) key/value
pairs.

In our case, output (word, “1”) once per word in the
document

http://shakespeare.org/hamlet.html, “to be or not to be”

“to”, “1”

“be”, “1”

“or”, “1”



Example (cont): word frequencies


MapReduce library gathers together all pairs with the
same key


The reduce function combines the values for a key

In our case, compute the sum






Output of reduce (usually 0 or 1 value) paired with key
and saved

“be”, “2”

“not”, “1”

“or”, “1”

“to”, “2”

key = “or”, values = [“1”] → “1”

key = “be”, values = [“1”, “1”] → “2”

key = “to”, values = [“1”, “1”] → “2”

key = “not”, values = [“1”] → “1”
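The same example sketched as code: one possible shape of the two user-written functions, in Python. The emit callback stands in for the library’s output collector; Google’s actual MapReduce interface is a C++ library, so treat this purely as an illustration.

def word_count_map(key, value, emit):
    # key = docid (e.g. "http://shakespeare.org/hamlet.html")
    # value = document text (e.g. "to be or not to be")
    for word in value.split():
        emit(word, "1")                      # one ("word", "1") pair per occurrence

def word_count_reduce(key, values, emit):
    # key = a word; values = every "1" emitted for that word
    emit(key, str(sum(int(v) for v in values)))

For the Hamlet line above, the map emits (“to”, “1”) twice and the reduce then outputs (“to”, “2”).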

Why use MapReduce?


Fast: locality optimization, optimized sorter, lots of tuning work done...

Robust: handles machine failure, bad records, …

Easy to use: little boilerplate, supports many formats, …

Scalable: can easily add more machines to handle more data or reduce the run-time

Widely applicable: can solve a broad range of problems

Monitoring: status page, counters, …


MapReduce users


Depended on by lots of projects at Google

Sawmill (Logs Analysis)

Search My History

Search quality

Spelling

Web search indexing

…many other internal projects ...

Ads

Froogle

Google Earth

Google Local

Google News

Google Print

Machine Translation

Lots of uses inside Google

MapReduce Programs in Google’s Source Tree

[Chart: number of MapReduce programs in Google’s source tree, Jan 2003 – Jul 2006 (y-axis 0–6000)]

[Chart: new MapReduce programs per month, Jan 2003 – Jul 2006 (y-axis 0–600), annotated “summer intern effect”]

Usage Statistics Over Time

                              Aug ’04    Mar ’05    Mar ’06
Number of jobs                 29,423     72,229    171,834
Avg completion time (secs)        634        934        874
Machine years used                217        981      2,002
Input data read (TB)            3,288     12,571     52,254
Intermediate data (TB)            758      2,756      6,743
Output data written (TB)          193        941      2,970
Avg worker machines per job       157        232        268
Avg worker deaths per job         1.2        1.9        5.0

Parallel MapReduce

[Diagram: input data split across parallel Map tasks; their output is shuffled and sorted, then consumed by Reduce tasks that write partitioned output; a Master coordinates the tasks]
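As a rough illustration of the flow in that diagram, the sketch below runs the maps, partitions intermediate pairs across R reduce shards by hashing the key (the shuffle), sorts each shard by key, and applies the reduce. It is a single-process toy under those assumptions, not how the distributed library is implemented.

from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, num_reducers=3):
    # Map phase: partition intermediate (key, value) pairs into R shards
    # by hash(key) % R -- the shuffle in the diagram above.
    shards = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in inputs:
        map_fn(key, value,
               lambda k, v: shards[hash(k) % num_reducers][k].append(v))

    # Reduce phase: each shard becomes one partitioned output, keys sorted.
    outputs = []
    for shard in shards:
        part = []
        for k in sorted(shard):
            reduce_fn(k, shard[k], lambda k, v: part.append((k, v)))
        outputs.append(part)
    return outputs

Running it with the word-count functions sketched earlier, e.g. run_mapreduce([("hamlet.html", "to be or not to be")], word_count_map, word_count_reduce), yields the (word, count) pairs spread across three output partitions.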

[Screenshots: MapReduce status page for MR_Indexer-beta6-large-2003_10_28_00_03, shown at successive points during the job]

Under-the-Covers Optimizations


Locality: map tasks scheduled near the data they read


Can often read all data from local disk


Shuffle stage is pipelined with mapping


Fast sorter


Many more tasks than machines, for load balancing


Backup copies of map & reduce tasks (avoids stragglers)


Compress intermediate data


Re-execute tasks on machine failure

MapReduce Summary


MapReduce has proven to be a useful abstraction


Greatly simplifies large-scale computations at Google


Fun to use: focus on problem, let library deal with
messy details


See the published paper on the systems issues:

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat

OSDI'04: Sixth Symposium on Operating System Design and Implementation

(Search Google for “MapReduce”)

BigTable


Higher-level API than a raw file system

Somewhat like a database, but not as full-featured

Useful for structured/semi-structured data

URLs:

Contents, crawl metadata, links, anchors, pagerank, …

Per-user data:

User preference settings, recent queries/search results, …

Geographic data:

Physical entities, roads, satellite imagery, annotations, …

Scales to large amounts of data

billions of URLs, many versions/page (~20K/version)

Hundreds of millions of users, thousands of q/sec

100TB+ of satellite image data

Why not just use commercial DB?


Scale is too large for most commercial databases



Even if it weren’t, cost would be very high


Building internally means system can be applied across
many projects for low incremental cost



Low-level storage optimizations help performance significantly


Much harder to do when running on top of a database layer


Also fun and challenging to build large-scale systems :)

BigTable Features


Distributed multi-level map

With an interesting data model

Fault-tolerant, persistent

Scalable

Thousands of servers

Terabytes of in-memory data

Petabytes of disk-based data

Millions of reads/writes per second, efficient scans

Self-managing

Servers can be added/removed dynamically

Servers adjust to load imbalance

Basic Data Model

Distributed multi-dimensional sparse map

(row, column, timestamp) → cell contents

[Diagram: for row “www.cnn.com” and column “contents:”, the cell holds timestamped versions t3, t11, t17 of the value “<html>…”]

Good match for most of our applications
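A toy sketch of that data model: a sparse map keyed by (row, column, timestamp), where a read returns the newest version at or before the requested timestamp. The class and method names are invented for illustration and are not the Bigtable client API.

class ToyBigtable:
    def __init__(self):
        self.cells = {}                      # {(row, column): {timestamp: value}}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        # Return the newest version at or before `timestamp` (latest if None).
        versions = self.cells.get((row, column), {})
        usable = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(usable)] if usable else None

table = ToyBigtable()
table.put("www.cnn.com", "contents:", 3, "<html>...")      # version at t3
table.put("www.cnn.com", "contents:", 11, "<html>...")     # version at t11
print(table.get("www.cnn.com", "contents:", timestamp=5))  # returns the t3 version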

Tablets


Large tables broken into tablets at row boundaries

Tablet holds a contiguous range of rows

Clients can often choose row keys to achieve locality

Aim for ~100MB to 200MB of data per tablet

Serving machine responsible for ~100 tablets

Fast recovery:

100 machines each pick up 1 tablet from a failed machine

Fine-grained load balancing:

Migrate tablets away from an overloaded machine

Master makes load-balancing decisions
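Because each tablet covers a contiguous, sorted range of rows, finding the tablet (and hence the serving machine) for a row key amounts to a binary search over tablet end keys. A minimal sketch, with made-up boundary keys and server names:

import bisect

# End key of each tablet's row range and the server currently holding it
# (both lists are made up for illustration).
tablet_end_keys = ["cnn.com", "cnn.com/sports.html", "website.com", "zuppa.com/menu.html"]
tablet_servers  = ["ts-17",   "ts-02",               "ts-45",       "ts-09"]

def locate_tablet(row_key):
    # The first tablet whose end key is >= row_key covers this row.
    i = bisect.bisect_left(tablet_end_keys, row_key)
    return tablet_servers[min(i, len(tablet_servers) - 1)]

print(locate_tablet("cnn.com/politics.html"))   # falls in the tablet ending at
                                                # "cnn.com/sports.html" -> "ts-02"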

Tablets & Splitting

[Diagram: a table with rows “aaa.com”, “cnn.com”, “cnn.com/sports.html”, …, “website.com”, “yahoo.com/kids.html”, “yahoo.com/kids.html\0”, …, “zuppa.com/menu.html”, with columns such as “contents:” (“<html>…”) and “language:” (EN), split at row boundaries into tablets]

System Structure

[Diagram: a Bigtable cell. Bigtable clients use the client library to Open() tables, performing metadata ops against the Bigtable master and read/write against Bigtable tablet servers, which serve the data. The master performs metadata ops and load balancing. Underneath, a lock service holds metadata and handles master election, GFS holds tablet data and logs, and a cluster scheduling system handles failover and monitoring]

Tablet Representation

[Diagram: a tablet is an append-only log on GFS plus several SSTables on GFS (mmap), fronted by a write buffer in memory (random-access); writes go to the log and the buffer, reads consult the buffer and the SSTables]

SSTable: immutable on-disk ordered map from string → string

string keys: <row, column, timestamp> triples
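A simplified, in-memory sketch of that read/write path: writes append to the log and update the write buffer; reads check the buffer first, then the immutable SSTables from newest to oldest. Plain Python dicts stand in for SSTables and the GFS-backed log here, so this is only an illustration of the structure above.

class ToyTablet:
    def __init__(self):
        self.log = []            # stands in for the append-only log on GFS
        self.memtable = {}       # write buffer in memory (random access)
        self.sstables = []       # immutable SSTables, oldest first

    def write(self, key, value):
        self.log.append((key, value))    # 1. persist the mutation to the log
        self.memtable[key] = value       # 2. then apply it to the write buffer

    def read(self, key):
        if key in self.memtable:                   # freshest data first
            return self.memtable[key]
        for sstable in reversed(self.sstables):    # then newest SSTable to oldest
            if key in sstable:
                return sstable[key]
        return None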

Compactions


Tablet state represented as set of immutable
compacted SSTable files, plus tail of log
(buffered in memory)



Minor compaction:


When in-memory state fills up, pick tablet with most data and write contents to SSTables stored in GFS


Separate file for each locality group for each tablet



Major compaction:


Periodically compact all SSTables for tablet into new
base SSTable on GFS


Storage reclaimed from deletions at this point
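In the same simplified representation (a write buffer plus a list of immutable SSTables, each modeled as a sorted dict), the two kinds of compaction might look as follows; a None value stands in for a deletion marker. This sketches the idea only, not Bigtable's implementation.

def minor_compaction(memtable, sstables):
    # Freeze the full in-memory write buffer into a new immutable SSTable.
    sstables.append(dict(sorted(memtable.items())))
    memtable.clear()

def major_compaction(sstables):
    # Merge all SSTables into a single base SSTable; later (newer) values win,
    # and storage for deleted entries (None) is reclaimed here.
    merged = {}
    for table in sstables:               # oldest first, so newer updates overwrite
        merged.update(table)
    return [dict(sorted((k, v) for k, v in merged.items() if v is not None))]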


Locality Groups

[Diagram: for row “www.cnn.com”, one locality group holds the “contents:” column (“<html>…”); a second locality group holds “language:” (EN) and “pagerank:” (0.65)]

Status


Design/initial implementation started beginning of
2004


Currently ~500 BigTable cells


Production use or active development for ~70
projects:


Google Print


My Search History


Orkut


Crawling/indexing pipeline


Google Maps/Google Earth


Blogger





Largest bigtable cell manages ~3000TB of data
spread over several thousand machines (larger
cells planned)

Google products


Web search


Ads


Machine Translation





Data + CPUs = Playground


Substantial fraction of internet
available for processing


Easy-to-use teraflops/petabytes

Cool problems, great fun…

Summary


Behind every Google service there are lots
of challenging technical problems at all
levels:


Hardware, networking, distributed systems, fault tolerance, data structures, algorithms, machine learning, information retrieval, AI, user interfaces, compilers, programming languages, statistics, product design, mechanical eng., …



The right hardware and software
infrastructure matters tremendously


Allows small product teams to accomplish large
things

Thank you!


More info:


Google File System


S. Ghemawat, H. Gobioff, S.-T. Leung; SOSP’03



“Web Search For A Planet”


L. Barroso, J. Dean, U. Hoelzle; IEEE Micro ’03



MapReduce: Simplified Data Processing on Large Clusters


J. Dean, S. Ghemawat; OSDI'04



BigTable: A Distributed Storage System for Structured Data


F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T.
Chandra, A. Fikes, R. Gruber; OSDI’06



See http://labs.google.com/papers