The Google Legacy - Chapter Three: Google Technology

“Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results.... Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.” – Sergey Brin and Lawrence Page, 1997
In the beginning, there was BackRub, the service that became Google. Today, Google is most closely associated with its PageRank algorithm. PageRank is a voting algorithm weighted for importance. One indicator of a Web page’s importance is the number of pages that link to it.
Messrs. Brin and Page soon added another factor that voted for the importance of a Web page: the number of people who click on it. The more clicks on a Web page, the more weight that Web page was given. Over time, still other factors have been added to the PageRank algorithm; for example, the frequency with which content on a page is updated.
Google’s PageRank technology is closely allied with Internet search. Voting algorithms are less effective in enterprise search, for instance. The attention given to Google and its search technology dominates popular thinking about the company. Google search is like a nova: the luminescence makes it difficult for the observer to see other aspects of the phenomenon clearly or easily.

1. From “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” www.-
Radiance aside, Google is a technology company. Some of that technology, described in technical papers such as the earliest one, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” is demanding. The later papers, such as “MapReduce: Simplified Data Processing on Large Clusters,” can be a slow read. Since Google is technology, explaining what Google does in an easily digestible way is difficult. The diagram below provides an unauthorized snapshot of Google’s computing framework.
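The MapReduce paper mentioned above describes a two-phase model: a map step emits intermediate key-value pairs, and a reduce step combines all values for each key. A toy, single-process word count – the canonical example from the paper – can sketch the idea. This is an illustration of the programming model only, not Google’s cluster implementation.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit an intermediate (word, 1) pair for every word in a document.
    for word in text.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Reduce: combine all intermediate values emitted for one key.
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle: group intermediate pairs by key before the reduce step.
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_phase(doc_id, text):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

docs = {"d1": "the web is big", "d2": "the web grows"}
print(mapreduce(docs))  # {'the': 2, 'web': 2, 'is': 1, 'big': 1, 'grows': 1}
```

On a Google cluster the map and reduce calls run on thousands of machines in parallel; the single-process version above preserves only the structure of the computation.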

2. The annex to this monograph contains a listing of more than 60 Google patents. The list is not all-inclusive; however, it does provide the patent number and a brief description for some of Google’s most important patents. The PageRank patent belongs to the trustees of Stanford University. Google’s patent efforts have focused on systems and methods for relevance, advertising, and other core foci of the company. Google is creating a patent fence to protect its intellectual property.
3. Jeff Dean, former Alta Vista researcher and a Google senior engineer, has been an advocate of MapReduce. His most recent papers are available on his Web page at http://
Important Google technologies that underlie this diagram of the Googleplex
include: [a] modifications to Linux to permit large file sizes and other functions so
as to accelerate the overall system; [b] a distributed architecture that allows
applications and scaling to be “plugged in” without the type of hands-on set-up
other operating systems require; [c] a technical architecture that is similar at every
level of scale; [d] a Web-centric architecture that allows new types of applications
to be built without a programming language limitation.
Google’s technology has emerged from a series of continuous improvements, or what Japanese management consultants call kaizen. Each Google technical change may be inconsequential to the average user of Google. But taken as a whole, Google’s “technological advantage” comes from Google’s incremental innovations, clever adaptations of research-computing concepts, and Byzantine tweaks to Linux. Some day, a historian of technology will be able to identify, from the hundreds of improvements that Google has engineered in the last nine years, one or two that stand with PageRank in importance. Critics of Google will see that the company has grafted onto its core technology processes from many different sources.
To illustrate, the structure of Google’s data centers and the messages passed to and from these data centers are in many ways a variant of grid computing. Google’s ability to read data from many computers simultaneously is reminiscent of BitTorrent’s technology. Google’s use of commodity or “white box” hardware in its data centers is an indication of Google’s hacker ethos. The use of memory and discs to store multiple copies of data comes from the frontiers of computing.
Google’s approach to technology, then, is eclectic and in many ways represents a building block approach to large-scale systems. Google benefits from that eclecticism in several ways. First, Google’s computational framework delivers sizzling performance from low-cost hardware. Second, Google worked around the bottlenecks of such operating systems as Solaris, Windows Advanced Server, and off-the-shelf Linux. Third, Google took good programming ideas from other languages, implementing new functions and libraries to eliminate most of the manual coding required to parallelise an application across Google’s infrastructure.

According to Jeff Dean, one of Google’s senior engineers, “Google engineering is sort of …” This is neither surprising nor necessarily a negative. The Googleplex is a toy box for engineers and programmers. The tools are sophisticated. The challenges of the problems and peers make Google “the place to be” for the best and brightest technical talent in the world. The nature of creativity combined with Google’s approach to innovation makes it difficult to predict the next big thing from Google.
Before reviewing selected parts of Google’s technology in somewhat more detail, the diagram
“Google’s Computing Framework” provides an overview of the Googleplex and some of its
technologies. These will be touched upon in this section.
4. Grid computing is applying resources from many computers in a network to a single problem or application. Google uses grid-like technology in its distributed computing system.
5. BitTorrent is a peer-to-peer file distribution tool written by programmer Bram Cohen in 2001. The reference implementation is written in Python and is released under the MIT License.
6. Google has anywhere from 100,000 to 165,000 or more servers. Servers are organized into clusters. Clusters may reside within one rack or across multiple racks of servers. Some Google functions are distributed across data centers.
7. From Dr Dean’s speech at the University of Washington in October 2003. See http://
PageRank requires a great deal of computing horsepower to work. When Google got underway in 1996, Messrs. Brin and Page had limited computing horsepower. In order to make PageRank work, they had to figure out how to get the PageRank algorithm to run on the garden-variety computers available to them.

From the beginning – and this is an important issue with regard to Google’s almost-certain collision course with Microsoft – Google had to solve both software engineering and hardware engineering issues to make Google Search viable. In fact, when discussing Google technology, it is important to keep in mind that PageRank is important only because it can run quickly in the real world, not in a sterile computer lab illuminated with the blue glow of monitors.
The figure Google’s Fusion: Hardware and Software Engineering shows that Google’s technology framework has two areas of activity. There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub-one-second response times that Google is able to maintain despite its surging growth in usage, applications and data processing.
The other effort focuses on hardware. Google has refined server racks, cable placement, cooling devices, and data center layout. The payoff is lower operating costs and the ability to scale as demand for computing resources increases. With faster turnaround and the elimination of such troublesome jobs as backing up data, Google’s hardware innovations give it a competitive advantage few of its rivals can equal as of mid-2005.

Figure: Google’s Fusion: Hardware and Software Innovations. The Google phenomenon comes from the fusion occurring when PageRank’s software and hardware engineering interact. Google’s technology delivers supercomputer applications for mass audiences.
PageRank, with its layering of additional computations added over the years, is a software problem of considerable difficulty. The Google system must find Web pages and perform dozens, if not hundreds, of analyses of those Web pages. Consider the links pointing to a Web page. Google must keep track of them for more than eight billion Web pages. For a single Web page with one link pointing to it, the problem is trivial. One link equals one pointer. But what happens when a site has 10,000 links pointing to it? The problem becomes many times larger and more computationally demanding. Some of these links are likely to come from sites that have more traffic than others. Some of the links may come from sites that have spoofed Google for fun or profit. The calculations to sort out the “value” of each of these links add to the computational work associated with PageRank. Keeping track of these factors is a big job. Sizing up different factors against one another for a single page can be hard without a calculator to help. Take the same task and apply it to a couple of billion Web pages, and the computing task becomes one for a supercomputer.
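The voting idea behind this can be sketched in a few lines. The toy power-iteration below follows the published 1998 formulation (damping factor 0.85); it is a simplification for illustration, not Google’s production algorithm, which layers many additional factors on top.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal rank
    for _ in range(iterations):
        # Teleport term: every page keeps a small base amount of rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # A dangling page spreads its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                # Each link is a "vote" weighted by the linker's own rank.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
# "c", with two in-links, ends up with the highest rank.
```

The iteration converges to a stable assignment in which a page linked to by highly ranked pages is itself highly ranked – the recursive “importance” described above.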
Yet this task is everyday stuff for Google and its PageRank process. Users do not give much thought to what technology underpins a routine query or the 300 million queries Google handles each day. In a single second, Google’s technology handles around 3,500 queries in dozens of languages from users worldwide.
Google’s technology cannot be separated from search. Search was the prime mover in the
Google universe. Once Messrs. Brin and Page were able to fiddle with a limited number of
commodity computers and make their PageRank algorithm work, Google was headed down a
road that it still follows.
The software requires a suitable hardware and network infrastructure in which to operate.
Without Google’s hardware and software, there would be no Google. Hardware and software
are inextricably linked at Google. With each new advance in software, Google’s engineers
must make correspondingly significant advances in hardware. And when hardware engineers
come up with an advance, the software engineers greedily use that advance to up the
functionality of their software.
What Google owns is its own snappy, turbocharged supercomputer, interesting software tools,
and several thousand people trying to figure out what else the Googleplex can do. Some of the
tinkerers come at the problem from bits and bytes, writing code, and weaving applications out
of the available functions. The result is a brilliant product.
Others come at the problem from the soldering iron and screwdriver angle. These engineers
look for ways to build hardware and physical systems that can perform the calculations needed
to make PageRank work. Google’s approach to data centers, the racks in the data centers, and
the devices in the racks in the data centers is as clever as the company’s search system. The
hardware has to be more than clever. The hardware has to work 24x7, under continuous load,
and in locations from Switzerland to Beijing. The synergy between software and hardware is
perhaps one of Google’s major accomplishments.
How Google Is Different from Microsoft and Yahoo
Google’s technology is simultaneously just like other online companies’ technology, and very
different. A data center is usually a facility owned and operated by a third party where
customers place their servers. The staff of the data center manage the power, air conditioning
and routine maintenance. The customer specifies the computers and components. When a data
center must expand, the staff of the facility may handle virtually all routine chores and may
work with the customer’s engineers for certain more specialized tasks.
Before looking at some significant engineering differences between Google and two of its
major competitors, review this list of characteristics for a Google data center.
- Google data centers now number about two dozen, although no one outside Google knows the exact number or their locations. They come online and automatically, under the direction of the Google File System, start getting work from other data centers. These facilities, sometimes filled with 10,000 or more Google computers, find one another and configure themselves with minimal human intervention.
- The hardware in a Google data center can be bought at a local computer store. Google uses the same types of memory, disc drives, fans and power supplies as those in a standard desktop.
- Each Google server comes in a standard case called a pizza box with one important change: the plugs and ports are at the front of the box to make access faster and easier.
- Google racks are assembled for Google to hold servers on their front and back sides. This effectively allows a standard rack, normally holding 40 pizza box servers, to hold 80.
- A Google data center can go from a stack of parts to online operation in as little as 72 hours, unlike more typical data centers that can require a week or even a month to get additional resources online.
- Each server, rack and data center works in a way that is similar to what is called “plug and play.” Like a mouse plugged into the USB port on a laptop, Google’s network of data centers knows when more resources have been connected. These resources, for the most part, go into operation without human intervention.
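The plug-and-play behaviour described above can be sketched as a registry that absorbs new capacity the moment a node announces itself, with no operator assigning work. The class and method names here are invented for illustration; they are not Google’s actual interfaces.

```python
class ClusterRegistry:
    """Toy model of a data centre that absorbs new resources automatically."""

    def __init__(self):
        self.nodes = {}        # node id -> remaining capacity in work units
        self.pending_work = [] # tasks waiting for a machine

    def announce(self, node_id, capacity):
        # A new rack "plugs in": it is registered and put to work at once,
        # with no human operator assigning it tasks.
        self.nodes[node_id] = capacity
        self.rebalance()

    def submit(self, task):
        self.pending_work.append(task)
        self.rebalance()

    def rebalance(self):
        # Hand pending tasks to whichever node has the most spare capacity.
        assignments = {}
        while self.pending_work and self.nodes:
            node = max(self.nodes, key=self.nodes.get)
            if self.nodes[node] <= 0:
                break   # no spare capacity anywhere; tasks stay queued
            task = self.pending_work.pop(0)
            assignments[task] = node
            self.nodes[node] -= 1
        return assignments

registry = ClusterRegistry()
registry.submit("crawl-batch-1")                   # queued: no capacity yet
registry.announce("rack-17-node-03", capacity=4)   # new node joins, work starts
```

A real system must also handle node departure and replication, but the essential point survives the simplification: capacity joins the pool by announcing itself, and the software does the rest.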
Several of these factors are dependent on software. This overlap between the hardware and software competencies at Google, as previously noted, illustrates the symbiotic relationship between these two different engineering approaches. At Google, from its inception, Google software and Google hardware have been tightly coupled. Google is not a software company, nor is it a hardware company; it is a company that owes its existence to both hardware and software, with a business model that is advertiser supported. Technically, Google is conceptually closer to a combined hardware and software company than it is to Microsoft (primarily a software company) or Yahoo! (an integrator of multiple softwares).
Software and hardware engineering cannot be easily segregated at Google. At Microsoft and Yahoo, hardware and software are more loosely coupled. Two examples will illustrate these differences.
Microsoft – with some minor excursions into the Xbox game machine and peripherals – develops operating systems and traditional applications. Microsoft has multiple operating systems, and its engineers are hard at work on the company’s next generation of operating systems. Microsoft does not design or make its own hardware. Its operating systems are coded, for example, for processors that evolved from the Intel chips for personal computers. Recently Microsoft embarked on a new path with its game machine, the Xbox 360. The new Xbox uses a processor from IBM’s family of Power chips, also used in the Macintosh computer, the Sony PS/3, and Nintendo’s next-generation game machine. Microsoft’s applications run on Microsoft operating systems, although a version of Microsoft Office and Internet Explorer run on Apple’s Macintosh.
In addition, Microsoft buys hardware from various suppliers to run its online systems. Most of
these suppliers, not surprisingly, are certified by Microsoft. Examples include Microsoft’s use
of Dell Computers. Microsoft’s engineers use these machines in configurations required by the
Microsoft operating systems and applications. For example, Microsoft servers often require a
load balancing feature. Microsoft implements its load balancing via software. When more
performance is required, Microsoft upgrades the hardware, adds memory, or shifts to higher-
speed hard drive technology instead of recoding the operating system itself to deliver higher
performance as Google does. Once a function is released to customers, Microsoft’s engineers
focus on stamping out bugs. Re-engineering a software application for higher performance is
not typically a priority.
Several observations are warranted:
Unlike Google, Microsoft does not focus on performance as an end in itself. As a result,
Microsoft gets performance the way most computer users do. Microsoft buys or
upgrades machines. Microsoft does not fiddle with its operating systems and their
subfunctions to get that extra time slice or two out of the hardware.
Unlike Google, Microsoft has to support many operating systems and invest time and energy in making certain that important legacy applications such as Microsoft Office or SQL Server can run on these new operating systems. Microsoft has a boat anchor tied to its engineers’ ankles. The boat anchor is the need to ensure that legacy code works in Microsoft’s latest and greatest operating systems.
Unlike Google, Microsoft has no significant track record in designing and building
hardware for distributed, massively parallelised computing. The mice and keyboards
were a success. Microsoft has continued to lose money on the Xbox, and the sudden
demise of Microsoft’s entry into the home network hardware market provides more
evidence that Microsoft does not have a hardware competency equal to Google’s.
In terms of technology, Google has the hardware and software engineering expertise to build
applications rapidly, perform computationally-intensive applications quickly, and deliver
high-reliability services from low-cost, commodity hardware.
Yahoo! operates differently from both Google and Microsoft. Yahoo! is in mid-2005 a direct competitor to Google for advertising dollars. Yahoo! has grown through acquisitions. In search, for example, Yahoo acquired 3721.com to handle Chinese language search and retrieval. Yahoo bought Inktomi to provide Web search. Yahoo bought Stata Labs in order to provide users with search and retrieval of their Yahoo! mail. Yahoo! also owns AlltheWeb.com, a Web search site created by Fast Search & Transfer. Yahoo! owns the Overture search technology used by advertisers to locate key words to bid on. Yahoo! owns Alta Vista, the Web search system developed by Digital Equipment Corp. Yahoo! licenses InQuira search for customer support functions. Yahoo has a jumble of search technology; Google has one search technology.
Historically Yahoo has acquired technology companies and allowed each company to operate
its technology in a silo. Integration of these different technologies is a time-consuming,
expensive activity for Yahoo. Each of these software applications requires servers and systems
particular to each technology. The result is that Yahoo has a mosaic of operating systems,
hardware and systems. Yahoo!’s problem is different from Microsoft’s legacy boat-anchor
problem. Yahoo! faces a Balkan-states problem.
There are many voices, many needs, and many opposing interests. Yahoo! must invest in
management resources to keep the peace. Yahoo! does not have a core competency in
hardware engineering for performance and consistency. Yahoo! may well have considerable
competency in supporting a crazy-quilt of hardware and operating systems, however. Yahoo!
is not a software engineering company. Its engineers make functions from disparate systems
available via a portal.
Google also acquires technology. A good example is Picasa. The photo management software runs on the user’s Windows PC. The program has been integrated with several of Google’s network-centric applications:
- Gmail. The user’s images can be uploaded and sent via email to friends, colleagues and family. A Picasa user without a Gmail account is able to register and receive a user name and password. The Gmail account can also be used, if the user wishes, for other Google services, including Fusion, which is Google’s personalized portal, and the search history function, which saves a registered user’s Google queries for later reference.
- Blog Publishing. The user can post pictures to a Google property, Blogger.com. The image publishing function is simplified to one or two clicks. Posting images on some Web log systems is beyond the expertise of many computer users.
- Image Printing. The user can send images to online photo processing services.
In sharp contrast to Yahoo’s approach, Google integrated the Picasa application into the Googleplex. The “hooks” are painless to the user. Google has bundled into one free application point-and-click solutions to make management of digital still images intuitive and fluid. Yahoo!’s acquisitions, in general, are not woven into a seamless experience with other Yahoo! services. Consider the 3721.com search system. That service remains a separate Chinese language operation available from mostly non-English Yahoo pages. Google constructs an application using some code on the user’s PC and other software running on the Googleplex somewhere on the Internet.
These three companies, different in structure and technical focus, are on a collision course. Like vessels in the America’s Cup, each is going toward the same goal, but subject to forces difficult for their helmsmen to control. Even though there is market space between the three, collisions are inevitable. The figure below provides an overview of the mid-2005 technical orientation of Google, Microsoft and Yahoo.

8. Picasa requires a download. The installation process is smooth. Indexing speed was about five times faster than ACDSee’s image management program, a competitive product. With Picasa, Google’s technologists demonstrate a rapid, trouble-free installation and an intuitive interface.

Figure: Picasa offers one-click access to network services available as part of the user’s virtual application, one-click access to functions performed on the user’s local machine, and recently-viewed images.
MSN, and by extension Microsoft Corporation, has a core competency in software. The company has grown from its operating system roots to provide a range of products for mobile devices, desktop and notebook computers, and enterprise-class servers. Looking forward, the company’s Dot Net technology is Microsoft’s framework for virtual applications. In some ways, Dot Net is a less-open version of the AJAX technology that Google uses in the Google Maps and Gmail products. Microsoft has expended great effort to push Windows downward to mobile devices and outward to network-centric computers in an effort to increase revenue. For Microsoft to continue to be the dominant force in software in the future, the company must be able to capture a commanding share of the market for network-centric applications. However, Microsoft’s weakness (whether real or perceived) is its products’ vulnerability to security breaches. Patch after patch, problem after problem, then promise after promise have done little to bolster the firm’s credibility for delivering secure systems and software. Looking forward over the next 12 to 18 months, Microsoft’s prospects hinge on security, cost and its developer community. The growth of open source alternatives is hard proof that die-hard Microsoft users are willing to shift for security, cost savings and functionality. Microsoft has weaknesses that can be attacked by Google and other competitors.
Yahoo’s situation is typical of many American organizations. Most large US corporations are a hotch-potch of different systems, incompatible architectures and a Tower of Babel of data formats. For Yahoo to deliver specific markets to its advertisers, Yahoo must integrate information from disparate systems and be able to segment and deliver ads to those users efficiently. Yahoo is now spending money to break down the walls of its data silos and integrate its user data. If Yahoo cannot deliver narrowly segmented markets, advertisers may abandon Yahoo for services that offer more targeted marketing opportunities. After years of flirting with becoming a New Age America Online, Yahoo is beginning to behave like a traditional media company.
MSN and Yahoo! are becoming ad-supported versions of general-interest portals like America Online and Tiscali. In contrast, Google is focusing on applications that tie users to its Googleplex. The company’s focus on hardware and software engineering gives it a cost and performance advantage over Microsoft and Yahoo, among others competing in Web search. Google’s high-performance, homogeneous Googleplex means that the company does not struggle with some integration, performance and cost issues that bedevil Microsoft and Yahoo.

Google may not be doing everything right from a computer science point of view. But compared to Microsoft or Yahoo, Google is doing less wrong than these two aggressive competitors.
The Technology Precepts
Google’s technology uses concepts and techniques from the leading edge of computer science. Most of these innovations are difficult to explain to engineers steeped in traditional approaches to massively distributed, highly parallelized computing. The eclectic footnotes and references in the earlier BackRub paper have been sharpened in Google’s later technical presentations. Readers without a first-hand understanding of NOW-Sort, River, and BAD-FS are unlikely to craft dinner conversation from Google’s explanations of the influence of these research computing demonstrations.

For the purposes of this monograph and understanding the nature of Google’s technology, five precepts thread through Google’s technical papers and presentations. The following snapshots are extreme simplifications of complex, yet extremely fundamental, aspects of the Googleplex.
Cheap Hardware and Smart Software
Google’s use of commodity hardware for high-demand, 24x7 systems has existed as a core precept since 1996. Most of its competitors’ online systems combine branded hardware from IBM, Sun Microsystems, Hewlett-Packard, and Dell Computers with specialized peripherals. The operating systems in use are a combination of Unix and Microsoft operating systems with some Linux and open source components.
Google approaches the problem of reducing the costs of hardware, set up, burn-in and
maintenance pragmatically. A large number of cheap devices using off-the-shelf commodity
controllers, cables and memory reduces costs. But cheap hardware fails.
In order to minimize the “cost” of failure, Google conceived of smart software that would
perform whatever tasks were needed when hardware devices fail. A single device or an entire
rack of devices could crash, and the overall system would not fail. More important, when such
a crash occurs, no full-time systems engineering team has to perform technical triage at 3 a.m.
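The “smart software” precept can be sketched as a scheduler that simply re-runs a task on another machine when one fails, loosely in the spirit of MapReduce’s re-execution of work from failed workers. The function names and failure model here are invented for illustration.

```python
import random

def run_with_failover(task, workers):
    """Run `task` on the first worker that succeeds, skipping failed machines."""
    last_error = None
    for worker in workers:
        try:
            return worker(task)
        except RuntimeError as err:
            # No 3 a.m. triage: note the failure and move to the next machine.
            last_error = err
    raise RuntimeError(f"all workers failed; last error: {last_error}")

def flaky_worker(task):
    # Simulated commodity box: cheap hardware fails now and then.
    if random.random() < 0.3:
        raise RuntimeError("disk failure")
    return f"done: {task}"

random.seed(0)  # deterministic demo
workers = [flaky_worker] * 5
result = run_with_failover("sort-chunk-42", workers)
```

Because the data behind each task is replicated (see below), a failed machine costs only a retry elsewhere, not lost work, and no engineer is paged.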
9. See, for example, Andrea C. Arpaci-Dusseau, et al., “High-Performance Sorting on Networks of Workstations,” in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997; and John Bent, et al., “Explicit Control in a Batch-Aware Distributed File System,” in Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation, March 2004.
The focus on low-cost, commodity hardware and smart software is part of the Google culture. In one presentation at a December 2004 technical conference, a Google spokesman joked that anyone in the room could buy the same hardware that Google uses at Fry’s Electronics, a retail chain with stores in Palo Alto and other cities in California.
Logical Architecture
Google’s technical papers do not describe the architecture of the Googleplex as self-similar, but they provide tantalizing glimpses of an approach to online systems that makes a single server share features and functions of a cluster of servers, a complete data center, and a group of Google’s data centers.

The diagram below shows a representation of the Googleplex’s tightly organized, highly regular organization of files, servers, clusters, and more than two dozen data centers in a stable organizational pattern.

The diagram illustrates that Google’s technical infrastructure is similar at every level in the Googleplex. The collection of servers running Google applications on the Google version of Linux is a supercomputer. The Googleplex can perform mundane computing chores like taking a user’s query and matching it to documents Google has indexed. Furthermore, the Googleplex can perform the side calculations needed to embed ads in the results pages shown to users, execute parallelized, high-speed data transfers like computers running state-of-the-art storage devices, and handle necessary housekeeping chores for usage tracking and billing.
10. The illustration is a Sierpinski Triangle, chosen because it conveys how each component in Google’s infrastructure replicates other larger combinations of servers and data centers. The overall structure – in this illustration an equilateral triangle – expresses the stability of the Google approach to its system. This famous fractal connotes how Google scales without altering the micro or macro structure of the Googleplex.
Figure: a Sierpinski Triangle illustrating self-similarity at every scale – a single Google file, a single pizza box server, a Google cluster, and a data centre each embody the same organizational principle, and the Googleplex is a larger instance of the organization of a single pizza box server.
What is of interest is that Google does this with low-cost commodity hardware running on
Google’s version of Linux. Google has infused the Googleplex with logic that allows software
to handle data recovery, to streamline messages passed from server to server, and to grab
additional computing resources in order to complete a job quickly. When Google needs to add
processing capacity or additional storage, Google’s engineers plug in the needed resources.
Due to self-similarity, the Googleplex can recognize, configure and use the new resource.
Google has an almost unlimited flexibility with regard to scaling and accessing the capabilities
of the Googleplex. Unlike a collection of different building materials, Google’s approach
delivers a homogeneous computing system.
A good example is bringing a new rack of 40 or more pizza box servers online and creating
one of the many types of servers Google users.
Servers, according to the fractal architecture,
consist of two or more clusters of pizza boxes. A cluster allows data to be replicated and work
shared among pizza boxes with spare capacity. A rack is assembled and then Google’s pizza
box servers are “plugged in.” Cables are attached among the pizza boxes and the rack is then
plugged into a network hub. An engineer turns on the power, and the other devices become
aware of the new rack’s resources. Master servers – Google’s term for the pizza box that is in
charge of one or more clusters – instruct other servers to copy data to the new cluster and begin
using the clusters to do work.
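The rack bring-up sequence described above can be sketched in a few lines of Python. This is an illustrative simplification only, not Google's actual code; the class names, the replica count, and the rebalancing policy are assumptions for the example.

```python
# Illustrative sketch: a master server absorbs a newly powered-on rack and
# copies existing data chunks onto the fresh capacity.

class PizzaBox:
    """One commodity server; holds replicated data chunks."""
    def __init__(self, name):
        self.name = name
        self.chunks = set()

class MasterServer:
    """Tracks which servers hold which chunks and rebalances replicas."""
    REPLICAS = 3  # each chunk is written several times (three to six at Google)

    def __init__(self):
        self.servers = []

    def register_rack(self, rack):
        # Auto-discovery: the rack announces itself; no certification step.
        self.servers.extend(rack)
        self.rebalance()

    def rebalance(self):
        # Copy under-replicated chunks onto the least-loaded servers.
        chunk_count = {}
        for s in self.servers:
            for c in s.chunks:
                chunk_count[c] = chunk_count.get(c, 0) + 1
        for chunk, count in chunk_count.items():
            for s in sorted(self.servers, key=lambda s: len(s.chunks)):
                if count >= self.REPLICAS:
                    break
                if chunk not in s.chunks:
                    s.chunks.add(chunk)
                    count += 1

master = MasterServer()
seed = PizzaBox("rack1-box1")
seed.chunks.update({"chunk-a", "chunk-b"})
master.register_rack([seed])
# Plug in a new rack of empty pizza boxes; the master copies data over.
master.register_rack([PizzaBox(f"rack2-box{i}") for i in range(4)])
replicas = sum("chunk-a" in s.chunks for s in master.servers)
print(replicas)  # 3
```

The point of the sketch is the absence of human intervention: registering capacity and re-replicating data are the same operation, whatever the scale of the unit plugged in.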
In Google’s self-similar architecture, the loss of an individual device is irrelevant. In fact, a
rack or a data center can fail without data loss or taking the Googleplex down. The Google
operating system ensures that each file is written three to six times to different storage devices.
When a copy of that file is not available, the Googleplex consults a log for the location of the
copies of the needed file. The application then uses that replica of the needed file and
continues with the job’s processing. Redundancy and other engineering tweaks to Linux give
the Googleplex ways to eliminate or reduce the bottlenecks associated with traditional online
computer systems’ operation. The Google technical recipe includes distributed computing,
optimized file handling, and embedded logic to make the servers working on tasks smarter.
This architecture allows Google to expand its computational capacity, its storage and its
supported applications with an ease and price point rivals cannot easily match. According to
Jeff Dean, one of Google’s senior engineers, “At Google, everything is about scale.”
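The failover behaviour described in this section reduces to a simple idea: when one copy of a file is unreachable, consult a log of replica locations and read from another copy. The following is a toy sketch of that idea; the data structures and names are invented for illustration.

```python
# Hedged sketch: reading a file whose primary copy has been lost.

replica_log = {
    "index-shard-17": ["server-a", "server-b", "server-c"],  # 3+ copies
}
live_servers = {"server-b": b"shard data", "server-c": b"shard data"}
# server-a has failed and is absent from live_servers.

def read_file(name):
    """Try each replica in turn; the job continues as long as one survives."""
    for location in replica_log[name]:
        data = live_servers.get(location)
        if data is not None:
            return data
    raise IOError(f"all replicas of {name} lost")

print(read_file("index-shard-17"))  # b'shard data' (served by server-b)
```

No tape restore and no operator are involved: the read path itself absorbs the failure.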
Speed and Then More Speed
Google Search is fast with most results coming back to the user in less than one second. In
commercial data centers, speed has traditionally been achieved by buying high-end, high-
performance hardware from manufacturers such as Sun Microsystems and using
advanced storage devices connected to the servers by exotic fibre optics.
11.Data centers use computer cases that are shaped like the boxes used to hold pizzas. The
term pizza boxes has been appropriated by engineers to describe one of the standard form
factors for servers housed in rack mounts in data centers.
12.Statement made at the University of Washington, October 2004
Not Google. Google uses commodity pizza box servers organized in a cluster. A cluster is a
group of computers that are joined together to create a more robust system. Instead of using
exotic servers with eight or more processors, Google generally uses servers that have two
processors similar to those found in a typical home computer.
Through proprietary changes to Linux and other engineering innovations, Google is able to
achieve supercomputer performance from components that are cheap and widely available.
The table below provides some data from 2002 about the speed with which Google can read
data from hard drives:

[Table: read and write throughput for two Google clusters, 2002; see footnote 13.]

To put these data in the context of 2002 technology, consider that a typical high-end storage
device available in 2002 could read data in burst mode at the rate of about 58 MB per second.
Google’s read rate in 2002 averaged ten times that figure, and the write rate is comparable.
The cost of a single high-end storage device in 2002 was about $18,000 for 360 gigabytes of
storage, excluding controller and cables. Google’s cost for comparable storage and the higher
performance was about $1,000. For greater speed, Google spends less. In the world of ever-
increasing demands for speed and storage, Google has a strong one-two punch.
Advances in
commodity storage devices translate to even faster performance for Google. Google has not
updated its read rate data, but engineers familiar with Google believe that read rates may in
some clusters approach 2,000 megabytes a second. When commodity hardware gets better,
Google runs faster without paying a premium for that performance gain.
Google engineers for computational speed. Google’s approach has been to focus on making its
software engineering produce the turbocharged performance. Speed is crucial to Google’s
PageRank and other analytic processes. If Google’s computational throughput were slow,
Google could not perform the work needed to know that for a particular query, a particular set
13.From “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak
Leung (Google) ACM SOSP 2003 Conference Proceedings
1-58113-757-5/03/0010, page 12
14.With Google’s advanced programming tools, Google is able to increase the productivity of
its engineers. Combined with hardware speed and performance, Google squeezes out more
productivity by applying its engineering talents to application development. This is a one-two-
three punch to which Google’s competitors have to respond.
These data show the results of two clusters’ performance. Google’s read throughput has
gone up since 2002. Based on increases in commodity drive throughput, Google’s read
rate may be close to 2,000 megabytes per second, though that figure may reflect Google
watchers’ enthusiasm boosting already-robust figures.
of indexed Web pages is the best match. Without fast response to a query, users would not be
willing to run multiple queries and interact fluidly with the Google applications.
Google does not mindlessly match key words in a user’s query to the terms in the Google
index. Google’s approach is more subtle and computationally involved, although term
matching is an important part of the Google process. Google reviews data, various scores or
values from certain algorithms. Google then uses these different values in other algorithms to
find search results, identify the best match (Google’s “Feeling Lucky” link), extract matching
ads from its advertising server, and continuously update values as Google users click on
links. Once these various query and ad matching processes are complete, Google displays the
results page to the user, typically in less than one second across a public network.
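The blending of signals described above can be illustrated with a toy scoring function. This is not Google's algorithm; the signals chosen and the weights are invented purely to show the shape of the computation: several per-page values combined into one rank, with the top score surfacing as the best-match candidate.

```python
# Simplified illustration (not Google's actual ranking): a term-match score is
# blended with a link-based signal and a click-based signal.

def combined_score(term_match, page_rank, click_rate,
                   w_terms=0.5, w_links=0.3, w_clicks=0.2):
    # The weights are invented for this example.
    return w_terms * term_match + w_links * page_rank + w_clicks * click_rate

pages = {
    "page-1": combined_score(0.9, 0.2, 0.1),  # strong term match only
    "page-2": combined_score(0.6, 0.9, 0.8),  # strong links and clicks
}
best = max(pages, key=pages.get)  # the "Feeling Lucky" candidate
print(best)  # page-2
```

Even this toy version shows why pure term matching is not enough: the page with the weaker term match wins once the other signals are counted.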
Google is a hot rod computer that can perform the basic mathematics needed to deliver most
search results in less than a half second, display maps with the speed of a dedicated desktop
application like Encarta, and look at a Web page matching a user’s query and, in some
applications, insert additional hyperlinks to related content before displaying the results page
to the user. The Googleplex does experience slowdowns. When these occur, the Googleplex
allocates additional resources to eliminate the brownout.
Speed has many meanings at Google. Speed means that users can interact with the Google
products and services as if the Google application were running on a dedicated computer in
front of
the user. Speed also means that Google must be able to expand its computational and storage
capacity quickly. Speed also means rapid development and deployment of new products.
Speed, like Google’s ability to scale, is a core functionality of the Googleplex.
Google applies its high-speed technology to search and to other types of servers. Among the
servers using Google’s go-fast technology are those shown below:

• Advertising server: Delivers text and other paid advertisements for AdWords and AdSense.
• Chunkserver: Schedules and delivers blocks of data for further processing.
• Image servers: Serve images for the Google Image, Print and Video services.
• Index server: The workhorse of search; handles search-and-retrieval.
• Mail server: Delivers the Gmail service.
• News server: Gathers, analyses and displays news.
• Web server: Orders results and makes them available to users.

What does the combination of go-fast technology plus multiple types of Google data allow the
company to do? Google can engage in fast new product development. One example is Google
Maps. Google developed a basic mapping product over the course of 2004. In late 2004,
Google purchased Keyhole. By June 30, 2005, Google had:

• Released a basic mapping product.
• Integrated information from Google Local in early 2005.
• Hooked Keyhole satellite imagery into Google Maps in early May 2005.
• Announced Google Earth in May 2005.
• Upgraded the system to integrate two-dimensional point-to-point routes on top of
satellite imagery.
• Demonstrated a function that accepts a query in another language, translates the results
to the user’s language, and displays the data in a three-dimensional mode.
The image below shows that Google’s Map and Earth service pushes the functions of online
map and data integration to another level. In the span of several days, Google integrated
Keyhole technology, launched, upgraded and redefined online mapping services.
Another key notion of speed at Google concerns writing computer programs to deploy to
Google users. Google has developed short cuts to programming. An example is Google’s
creating a library of canned functions to make it easy for a programmer to optimize a program
to run on the Googleplex computer. At Microsoft or Yahoo, a programmer must write some
15.The source for this image was
These are the results of a Japanese language Google Maps-Earth query for the location of Wendy’s
restaurants in New York City. The addition of Japanese language support, the three-dimensional
view of the section of Manhattan where the user wants directions, and the integration of hot links, the
two-dimensional map, and information about the restaurants were part of Google’s fast-cycle launch
and enhancement program designed to beat Microsoft to the market.
code or fiddle with code to get different pieces of a program to execute simultaneously using
multiple processors. Not at Google. A programmer writes a program, uses a function from a
Google bundle of canned routines, and lets the Googleplex handle the details. Google’s
programmers are freed from much of the tedium associated with writing software for a
distributed, parallel computer. What does increased programmer productivity mean? In terms
of money, Google makes each engineering dollar go farther. If a single programmer can reduce
by 10 percent the time required to code a program, the savings could be several thousand
dollars. If a programmer can slash coding time in half, Google gets twice the potential
productivity out of each of its 3,000 plus programmers.
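The "canned routine" idea described above can be sketched with a MapReduce-style toy. This is only an analogy under stated assumptions: Python's standard multiprocessing pool stands in for the Googleplex, and the function names are invented. The point is the division of labour: the programmer writes only a map step and a reduce step, and the library owns the parallelism.

```python
# Toy version of the "canned routine" idea: the programmer supplies a map
# function and a reduce function; the library splits the work across workers.
# (multiprocessing stands in for the Googleplex in this sketch.)

from multiprocessing import Pool
from collections import Counter

def count_words(doc):           # the programmer's "map" step
    return Counter(doc.split())

def merge(counters):            # the programmer's "reduce" step
    total = Counter()
    for c in counters:
        total += c
    return total

def run_on_cluster(docs):
    """The canned routine: callers never touch processes or messaging."""
    with Pool(2) as pool:
        return merge(pool.map(count_words, docs))

if __name__ == "__main__":
    totals = run_on_cluster(["google scales", "google runs fast"])
    print(totals["google"])  # 2
```

At Google the same pattern runs across thousands of machines rather than two local processes, but the programmer-facing contract is the same: no explicit process management, message passing, or failure handling in application code.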
Eliminate or Reduce Certain System Expenses
Some lucky investors jumped on the Google bandwagon early. Nevertheless, Google was
frugal, partly by necessity and partly by design. The focus on frugality influenced many
hardware and software engineering decisions at the company. Spending money wisely does
not mean spending cheaply. Examples of how Google eliminates or reduces certain system
expenses include:
• Google eliminates the costs associated with backing up and restoring data when a
hardware failure occurs. The fractal principle requires that Google replicate data three to
six times elsewhere in the Googleplex. When a device fails, the “master server” for a
task looks at a file that tells where the other copies of the data or the programs are. The
“master server” then uses those data or those processes to complete a task. No tape, no
human intervention, and no downtime; Google does not have these costs due to its
engineering acumen.
• Google does not have to certify new hardware. When additional storage or
computational capacity is required, Google technicians assemble one or more racks of
Google “pizza boxes.” Once in the rack, the Googleplex recognizes the new resources in
a way that is similar to how a laptop knows when a user plugs in a USB
mouse. The
expensive certification processes otherwise required for some high-end hardware are
eliminated. Google engineers plug in resources and let the Googleplex handle the other details.
• Google innovation uses open source code as a starting point. Many of Google’s most
striking technical advances are based on modifying open source software to benefit from
insights gained from experimental results in supercomputing. Google does not have to
work around known bottlenecks in some commercial operating systems. Unlike
Microsoft, Google did not write a complete operating system for its Googleplex. Google
made key changes to Linux, adding necessary services and functions to meet the specific
requirements of Google applications. Google’s approach is pragmatic and less time-consuming
than Microsoft’s “death march” to get Longhorn shipped by late 2006. Compared with
Yahoo, Google’s approach is more cohesive. Yahoo faces integration drudgery as a result of
its multiple systems and heterogeneous hardware and data. Google has used Linux,
standards, and open source software for virtually all of its core services and thus spends less
time pounding disparate systems and data into a standard framework.
16.Some Google programmers have complained about the peer pressure to perform. Google
management faces a challenge in managing its programming talent. Staff burnout or defections
could impair Google’s technical resources.
• Google does not spend money for high-performance devices to make its system perform well.
To illustrate the financial payoff from the use of commodity hardware, Google engineers
revealed a back-of-the-envelope calculation. Although dated, it underscores the economies of
the Google approach:
The cost advantages of using inexpensive, PC-based clusters over high-end
multiprocessor servers can be quite substantial, at least for a highly parallelisable
application like ours. For example, a $278,000 rack contains 176 2-GHz Xeon CPUs,
176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based
server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk
space; it costs about $758,000. In other words, the multi-processor server is
about three times more expensive but has 22 times fewer CPUs, three times
less RAM, and slightly more disk space. Much of the cost difference derives from the
much higher interconnect bandwidth and reliability of a high-end server, but again,
Google’s highly redundant architecture does not rely on either of these attributes.
[Emphasis added]
This means that when Microsoft or Yahoo! spends US$3.00 for better performance, Google
spends less than US$1.00.
Over time, competitors such as Microsoft or Yahoo may
implement similar features into their network-centric services. Until then, Google has a cost
advantage, at least with regard to scaling online operations. If these 2002 data can be
accepted, Google spends one-third as much for more computing horsepower and disc space
than companies spend using a traditional server architecture.
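The ratios in the quoted comparison can be checked directly from the figures it gives. The short calculation below uses only the numbers quoted above (rack: $278,000, 176 CPUs, 176 GB RAM; server: $758,000, 8 CPUs, 64 GB RAM).

```python
# Checking the arithmetic in the quoted rack-versus-server comparison.

rack = {"cost": 278_000, "cpus": 176, "ram_gb": 176}
server = {"cost": 758_000, "cpus": 8, "ram_gb": 64}

cost_ratio = server["cost"] / rack["cost"]     # ~2.7x more expensive
cpu_ratio = rack["cpus"] / server["cpus"]      # 22x more CPUs in the rack
ram_ratio = rack["ram_gb"] / server["ram_gb"]  # ~2.75x more RAM in the rack

print(round(cost_ratio, 1), cpu_ratio, round(ram_ratio, 2))
# 2.7 22.0 2.75
```

The quoted "three times more expensive", "22 times fewer CPUs" and "three times less RAM" all hold, which is why the per-unit-of-compute gap is far larger than the headline 3:1 price gap.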
Snapshots of Google Technology
Google engineers generate a large volume of technical information. Some of the data are in
the form of patents, often written in a style that communicates little of the patent’s substance
to a lay reader. The link for Google’s publications can shift unexpectedly.
17.Google does not explicitly state that it has embraced a services-oriented architecture.
However, many of Google’s practices illustrate an informed use of certain features of that approach.
18.Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, “Web Search for a Planet: The Google
Cluster Architecture”, IEEE Computer Society 0272-1732/03, March-April 2003.
19.A review of Google’s cost estimates for this monograph revealed that Google is understating
its cost advantage by one or two orders of magnitude. As the performance of commodity
hardware goes up, the cost of that hardware goes down. Bulk purchasing chops as much as 50
percent off the cost of some hardware. Google can replicate its data and give away free
gigabytes of email storage. The cost to Google can be as low as a few cents a gigabyte.
20.Accessed June 1, 2005.
biographies of Google executives and Google Web logs can yield some useful technical
information. For example, one Google biography linked to more than 36 personal projects,
including the lex project described in footnote 21. Surprisingly, Google’s search engine does
a hit-and-miss job of indexing Google’s own technical information.
Useful engineering information appears on the Google Web site. The topics covered in various
monographs, white papers and technical notes concern a wide range of subjects. For example,
in mid-2005, papers were available on such topics as algorithms, compiler optimization,
information retrieval, artificial intelligence, file system design, data mining, genetic
algorithms, software engineering and design, and operating systems and distributed systems,
among others. Google explains its use of very large files as well as how the Google-modified
version of Linux automatically allocates work and avoids the file system bottlenecks that can
plague Solaris and Windows Advanced Server 2003, among others.
Google’s technical papers and Google patents provide some insight into areas of interest at
Google. For example, Google is posting more information about operating systems and
applications. The thrust of Google’s innovation is to build out the search platform and expand
the functionality of its back-office programs such as those used for advertising services.
The annex to this monograph provides information about more than 60 patents for which
Google is believed to be the assignee. To provide a more fine-grained look at Google
technology, the table below identifies selected examples of innovations documented by
Google engineers or researchers close to the company. Most of these papers appeared prior to
Google’s receiving a patent for the technology referenced in these reports:
21.This is the lex project that “helps write programs whose control flow is directed by instances
of regular expressions in the input stream. It is well suited for editor-script type transformations
and for segmenting input in preparation for a parsing routine.”
To Learn More

• Google Suggest: Helps users find needed information by analysing queries and suggesting
other queries. Services Computing, 2004 IEEE International Conference on (SCC'04),
Stephen Davies, Serdar Badem, Michael D. Williams, Roger King, September 2004.
• Video Object Search: User types an object name and Google finds that object in a video.
Ninth IEEE International Conference on Computer Vision, Volume 2, Josef Sivic,
Andrew Zisserman, October 2003.
• MapReduce: New functions in Google Linux to speed programming and other processes
involving large data sets. OSDI Proceedings, December 2004.
• Google File System: Extension to Google Linux to allow high-speed data reads and writes
from commodity drives. ACM Publication 1-58113-757-5/03.
Drawbacks of the Googleplex
The coaching mantra, “No pain, no gain” is true for Google. Google does make mistakes,
and some big ones. The example fresh in news headlines is Web Accelerator. The product was
introduced in May 2005 and withdrawn less than six weeks later. Speed and nimbleness aside,
Web Accelerator was technology that ran head on into “issues.” Of greater consequence are the
periodic slowdowns for Gmail. The Googleplex is scalable, but until more servers are online,
users may face annoying delays.
Going Too Fast: The Google Web Accelerator
The Web Accelerator software was supposed to use Google servers to store Web pages a user
viewed. Web Accelerator parsed a page in the user’s browser. The Web Accelerator function
then followed each link on that specific page. The page was then stored in a Google cache.
When the user clicked on a link, the user would see the page from the Google cache, thus
reducing the time required to display the page to the user.
Web Accelerator worked fine on sites that make minimal
use of advanced Web services. Unfortunately, the Web Accelerator function followed links
that transmitted instructions to Web applications. For example, Web Accelerator would click
on “delete” links, causing some Web applications such as Backpack to remove the user’s
preferences or content.
Web Accelerator blithely ignored confirmations generated by
JavaScript so that unintentional instructions were transmitted. Some Google watchers raised
questions about caching data as well as privacy and copyright issues. Before these concerns
reached a crescendo, Google reported that Web Accelerator had reached its capacity. Google
blocked downloads for the product.
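The failure mode described above is easy to reproduce in miniature. The sketch below is not Web Accelerator's code; it is an invented Web application and an invented prefetcher that illustrate why blindly following every link is dangerous when some links carry instructions.

```python
# Why blind prefetching is dangerous, in miniature: a prefetcher that follows
# every link will also "click" links that mutate server-side state.

preferences = {"theme": "dark"}  # the user's server-side state

def handle_link(url):
    # A Web application where some plain links carry instructions.
    if url == "/delete?item=theme":
        preferences.pop("theme", None)   # destructive side effect
        return "deleted"
    return "page body"

def naive_prefetch(links):
    """Fetch every link on the page ahead of time, to warm a cache."""
    for url in links:
        handle_link(url)

naive_prefetch(["/view", "/delete?item=theme"])
print(preferences)  # {} -- the preference is gone without a user click
```

The underlying design lesson, which the Backpack incident made vivid, is that prefetchers can only be safe when link-following requests are guaranteed to have no side effects.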
(Continuation of the “To Learn More” table above:)

• Identify Authoritative or High-Value Sources in Web Content: Uses pattern mining to
generate a numeric value to indicate an authoritative source as an indication of content
quality. Seventh International Database Engineering and Applications Symposium
(IDEAS'03), Haofeng Zhou, Yubo Lou, Qingqing Yuan, Wilfred Ng, Wei Wang, Baile Shi,
July 2003.
• MetaCrystal: Metasearch technology to allow a single query to retrieve and organize
results in a visual display. Second International Conference on Coordinated & Multiple
Views in Exploratory Visualization (CMV'04), Anselm Spoerri, July 2004.

22.Backpack is a Web application that sends a user the contents of any page as email.

The Laws of Physics: Heat and Power 101
Google does not reveal the number of servers it uses, but the number is believed to be in the
150,000 to 170,000 range as of June 30, 2005. Conflicting information surfaces in Web logs
and in talks at conferences. In reality, no one knows. Google has a rapidly expanding number
of data centers. The data center near Atlanta, Georgia, is one of the newest deployed. This
state-of-the-art facility reflects what Google engineers have learned about heat and power
issues in its other data centers. Within the last 12 months, Google has shifted from
concentrating its servers at about a dozen data centers, each with 10,000 or more servers, to
about 60 data centers, each with fewer machines.
The change is a response to the heat and
power issues associated with larger concentrations of Google servers.
The most failure prone components are:
• Fans.
• Hard drives, which fail at the rate of one per 1,000 drives per day.
• Power supplies, which fail at a lower rate.
Repairs are batch operations. Scheduling the fixes is a major job and work is underway to
improve the Google-developed scheduling capability. Google has to locate hosting facilities
that can meet the company’s heat and power requirements.
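The quoted drive failure rate explains why repairs must be batched. A back-of-the-envelope calculation makes the scale concrete; the fleet size here is an assumption based on the server estimate above, and one drive per server is an assumed simplification.

```python
# Back-of-the-envelope math on the quoted failure rate: at one failure per
# 1,000 drives per day, a large fleet produces a steady repair workload.

drives = 150_000                 # assumed: roughly one drive per server
daily_failure_rate = 1 / 1_000   # one failure per 1,000 drives per day

failures_per_day = drives * daily_failure_rate
print(failures_per_day)  # 150.0 -- enough to justify batched, scheduled repairs
```

At roughly 150 dead drives a day, per-incident manual repair is impractical; scheduling fixes as batch operations is the only workable approach.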
Other Data Center Issues
Google data centers have access to multiple high-speed lines
and normal data center functions such as redundant power,
traffic routing and strict rules governing access to the physical
plant. One Web log contained a posting of a photograph
allegedly taken inside a Google data center. If true, the physical
layout of the racks holding an estimated 2,000 or more servers
squeezes a large amount of hardware into a tightly-packed space.
This type of dense configuration helps explain the comments about Google’s heat and power
concerns. Most data centers were not designed to handle dense concentrations of thousands of
servers. Heat contributes to hard drive failures. On the plus side, the dense configuration
makes set up and maintenance somewhat easier. Google packs servers on two sides of a rack.
A unique property of the data centers is that replicated content can be written from one data
center to another. Google data within the data center are replicated on other servers and other
clusters running in the racks.
The Google “plug and play” engineering philosophy appears to be used in and across data
centers. If a data center, such as the one shown above, needs additional index server capacity,
the technicians in that center can build a Google rack of 40 pizza box servers. These servers
are connected to the network. When the rack is powered up, it becomes available to the master
servers for that data center. These master servers then mark the rack’s resources as available.
Master servers then begin sending work to the new devices. The information about data
23.These data appear at
centers indicates that this “plug and play” concept and automatic discovery of new resources
applies to new data centers, not just the racks within them.
It may be an exaggeration that a Google rack, and the data center in which the rack resides,
work like a USB mouse, but the general concept seems to be what Google engineers have
tried to achieve. By eliminating such tasks as certifying and configuring Small Computer
System Interface (SCSI) storage devices, Google is content to let the auto-discovery
functionality alert a “master server” to a new resource, master servers to alert other master
servers, masters to notify clients of tasks, and data centers to pass information that racks,
clusters or a new data center are available for use.
As a Google engineer said, “Wherever we put a cluster, we have heat, cooling and power
issues. When we put in a data center, that data center operator faces new challenges. We use
each day four megawatts of electric power.”
The problems include:
• Heat. Special racks with fans that cool the core of the rack are used.
• Power. The power demand at load is greater than data centers typically sustain. “Our
cages are custom built and there’s a lot of work done by us and the data center people
before we can flip the switch,” said Jeff Dean, a senior Google engineer.
• Network management tools. Google has had to create network management tools to
manage its self-healing, automatic failover operating system.
What’s Up, Sergey?
The Google data centers are concentrated in North America with other data centers located in
Switzerland, the Pacific Rim, and Beijing.
Because the Googleplex
is self-healing, the operating system and the various “master computers” in a
cluster know what device is online and what device is dead. Off-the-shelf network
management tools are not tailored to Google’s requirements. Therefore, Google is developing
network management and monitoring tools so that the information in the Google operating
system log files can be displayed in a meaningful way to Google network engineers.
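The monitoring task described above, condensing raw log entries into a device-status view an engineer can act on, can be sketched briefly. The log format, timeout value, and function name below are invented for illustration; they are not Google's tooling.

```python
# Sketch of the monitoring problem (invented log format): condense raw
# heartbeat log lines into a per-device status view.

def summarize(log_lines, now, timeout=30):
    """A device that has not reported within `timeout` seconds is dead."""
    last_seen = {}
    for line in log_lines:
        timestamp, device = line.split()     # e.g. "100 rack7-box3"
        last_seen[device] = int(timestamp)
    return {d: ("online" if now - t <= timeout else "dead")
            for d, t in last_seen.items()}

log = ["100 rack7-box3", "110 rack7-box4", "170 rack7-box4"]
print(summarize(log, now=180))
# {'rack7-box3': 'dead', 'rack7-box4': 'online'}
```

The real difficulty, which off-the-shelf tools do not address, is doing this continuously across hundreds of thousands of devices while the system is simultaneously healing itself.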
The overall Googleplex works and continues working even if a device, rack or data center
goes dark or dies. Network management tools have to provide a broad range of monitoring
and support functions for the global network, devices, data flows, work loads and potential
problem areas. Google is developing needed network management tools specifically for its
global network.
24.The Beijing data center was purpose-built to conform to the ruling body’s requirements for
online access, monitoring and related issues. Google complied in order to do business in China.
Yahoo! made an acquisition in order to accelerate its effort in China.
Unanticipated Faults Could Derail Google’s Juggernaut
Google’s network uses a number of concepts from the fringes of computer innovation as well
as its hands-on knowledge gained from the Googleplex itself. The result is a highly-
resilient network that may breed problems not previously encountered. Although Google has
operated for more than five years without downtime from system failure, the possibility –
however remote – does exist that something unanticipated could occur. A sufficiently large
problem could deal Google a severe blow. The advanced technology of Google’s MapReduce
tool and its 400-module library could pose as yet unforeseen technical problems.
Summary of Google’s Drawbacks
Critics of Google can point to three “problems” with Google’s approach to performance.
First, Google is a one-trick pony. The changes to Linux and the other technical modifications
are little more than hackers’ attempts to squeeze a small performance gain.
Second, Google’s use of commodity hardware and cheap storage is a risky solution. Unknown
problems may lurk when cheap components are used in a mission-critical system. Increasing
the potential risk are the changes Google makes to speed up program execution.
The diagram shows how Google’s approach eliminates the bottleneck in parallelized systems
produced by excessive message traffic flowing through a server coordinating work among different
computers. This is a diagram produced by Google engineers.
Finally, other operating systems – including those from computer research laboratories and
even Microsoft – do the same things and have for years.
Leveraging the Googleplex
Google has demonstrated that search is just one application that can run in the Google
environment. There are many other applications that can benefit from Google’s approach to
online services:
• Applications that require a high performance payoff for a low cost, such as electronic
mail.
• Applications that can run in Google’s redundant environment where there is no
private-state replication, such as that found in the IBM AS/400 operating environment.
• Computationally-intensive, stateless applications.
• Applications that require request-level parallelism, a characteristic exploitable by
running individual requests on separate servers, such as Google Earth.
There is little to be gained by trotting out war-horses to trample Google. The user experience
speaks for itself. Google’s approach to massively-parallel distributed computing works, even
on dial-up networks.
Google fused the type of thinking associated with small, cash-strapped companies with
techniques from advanced computer systems. Commodity products keep costs down. A
modified Linux delivers fast performance at a bargain basement cost. Google is taking a
strategic risk with commodity hardware and a souped up version of Linux. Each day Google
bets that its technologists can keep the system humming.
Another reason why Google’s approach to technology is paying off is that Google employs
the same pragmatism and cleverness in application development. Google uses standard
engineering practices, proprietary knowledge, and off-the-shelf techniques such as its use of
Web services. Google uses the same Web programming techniques that millions of Web
developers use. The payoff is that it is easy for Google to hire people who can code for the
Googleplex. Google so far has not had to spend money for developer marketing programs or
train new hires to work in the Googleplex.
The biggest boost to Google’s technical approach is that its competitors are following
different, more expensive approaches. Yahoo is a fruit cake of hardware, operating systems,
and applications coded at different times in different languages by different people. Microsoft
uses its own operating systems but relies on other operating systems as well, including Solaris.
Microsoft must invest in hardware to squeeze performance out of its platforms. Yahoo
wrestles with its many different platforms. Microsoft seems powerless to enhance the speed of
its operating system. Both are digital ostriches burying their heads in their own marketing sand.
Google’s technology is one major challenge to Microsoft and Yahoo. So to conclude this
cursory and vastly simplified look at Google technology, consider these items:
• Google is fast anywhere in the world.
• Google learns. When the heat and power problems at dense data centers surfaced,
Google introduced cooling and power conservation innovations to its data centers.
• Programmers want to work at Google. “Google has cachet,” said one recent University
of Washington graduate.
• Google’s operating and scaling costs are lower than most other firms offering similar
services.
• Google squeezes more work out of programmers and engineers by design.
• Google does not break down, or at least it has not gone offline since 2000.
• Google’s Googleplex can deliver desktop-server applications now.
• Google’s applications install and update without burdening the user with gory details
and messy crashes.
• Google’s patents provide basic technology insight pertinent to Google’s core
technology.
A young programmer in Osaka or Beijing is very likely to have been influenced by Google.
Skilled programmers want to work at Google, develop for the Googleplex, and, if possible,
create their own Google killer. The mantra is, “Be like Sergey and Larry”.
Google has a next-generation computing platform. That platform is optimised to deliver
virtual applications to its users worldwide. Google uses standard Web technologies in clever
ways. Although the technical challenges facing Google are formidable, the company has
advanced the art of online computing.