The Age of Exabytes
WRITTEN BY AUDREY WATTERS
TOOLS AND APPROACHES FOR
MANAGING BIG DATA
Introduction: The Rise and Scope of Big Data
Innovations in Storage
Storage: At the Chip Level
Storage: At the Data Center Level
Storage: Virtualization and the Cloud
Storage: Big Data, New Databases
Speed: Big Data, Real-Time
The Demand for Big Data Analytics
Accessing the Data
Via the API
Over the Network
Use Cases
Distributed Computing with CouchDB at CERN
Real-Time Retail Analytics
Millions of Farmvilles Mean Petabytes of Data Daily: How Zynga Handles Social Gaming Big Data
The Big Data Marketplace
Bigger Data and a Better Response: Earthquake Detection & Crisis Response
This premium report has been brought to you courtesy of HP Networking. As you
explore networking solutions for your enterprise, don’t miss HP’s helpful resource
located at the end of this report,
, which explores the next-generation,
highly scalable data center fabric architecture.
ReadWriteWeb | The Age of Exabytes | 1
Introduction: The Rise and Scope of Big Data
To bytes, the basic unit of computing, we have rapidly added
new prefixes as advances in computer technology have
pushed storage into ever-larger units. From kilobytes (1000 bytes),
we’ve moved on to megabytes (1000 KB), gigabytes (1000 MB),
and terabytes (1000 GB) of data to “big data,” petabytes (1000
TB), exabytes (1000 PB), zettabytes (1000 EB), and to the as yet
unfathomable yottabyte (1000 ZB). This year, estimates put the
amount of information in existence at 1.27 zettabytes. One page
of typed text, by comparison, is roughly 2 kilobytes of data,
while all the books catalogued in the U.S. Library of Congress
total around 15 terabytes. Dwarfing that is the approximately 1
petabyte of data processed per hour by Google.
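The prefix ladder above is simple arithmetic, and a short sketch can make the scale concrete. The helper names below (`bytes_in`, `humanize`) are hypothetical, and the sketch assumes the decimal (factor of 1000) prefixes used in the text rather than binary ones.

```python
# Decimal (SI) prefixes as used in the text: each step is a factor of 1000.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]


def bytes_in(prefix: str) -> int:
    """Number of bytes in one <prefix>byte, e.g. bytes_in("exa") == 10**18."""
    return 1000 ** (PREFIXES.index(prefix) + 1)


def humanize(num_bytes: int) -> str:
    """Express a byte count using the largest prefix that keeps the value >= 1."""
    value, unit = float(num_bytes), "bytes"
    for prefix in PREFIXES:
        if value < 1000:
            break
        value /= 1000
        unit = prefix + "bytes"
    return f"{value:.1f} {unit}"


# Figures quoted in the text.
library_of_congress = 15 * bytes_in("tera")
all_information = 127 * bytes_in("zetta") // 100  # 1.27 zettabytes

print(humanize(library_of_congress))
# How many Library of Congress book collections fit into 1.27 zettabytes.
print(all_information // library_of_congress)
```

On these figures, the estimated total of information in existence is roughly 85 million times the size of the Library of Congress book collection.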
These numbers, while almost mind-boggling, are nonetheless growing at an
exponential rate. Eight years ago, there were only around 5 exabytes of data online.
Just two years ago, that amount of data passed over the Internet in the course of a single month.
And recent estimates put monthly Internet data flow at around 21 exabytes.
Certainly, some industries, such as science and finance, have long had to wrestle with storing and
processing massive amounts of data. But even there, the need for more speed and more storage has
grown. Walmart, for example, must handle more than 1 million customer transactions per hour. The
process of decoding the human genome required the computing power to analyze 3 billion base pairs —
something that took 10 years the first time it was done in 2003, but can now be achieved in one week.
Clearly, to meet these sorts of needs, computing power and storage have improved substantially, a
marker of Moore’s Law, which holds that the processing power and storage capacity of computer chips
double, or their prices halve, roughly every 18 months. And the technology has in turn facilitated this
explosion of data. But that’s only part of the picture.
The data that is being generated today isn’t just “big,” it’s different, and much of it is unstructured. Older
collections of data are now being digitized, such as the efforts of Project Gutenberg to digitize and
archive the world’s literary works. And many more people than ever before have access to technology
tools. The UN estimates there are 5 billion mobile phone subscriptions worldwide
(although many people have more than one, so that doesn’t quite mean that mobile phones have
so completely saturated the world market of 6.8 billion people). Billions of people use the Internet,
and with the rise of digital literacy and of social networking, more and more people are creating and
uploading more and more data. There are 500 million registered Facebook users, for example, sharing
3.5 billion pieces of content weekly and uploading 2.5 billion photos every month, which Facebook
in turn serves up at a rate of about 1.2 million photos per second.
With the increase in mobile device use in particular, human data creation has soared. Add to that
the input from radio-frequency identification (RFID) and wireless sensors — the 35-some-odd billion
devices connected to the Internet that are a source of information that is predicted to outpace the
generation of data from humans — and clearly data gathering has become ubiquitous.
This explosion of data, in both its size and form, causes a multitude of challenges for both people
and machines. No longer is data something accessed by a small number of people. No longer is the
data that’s created simply transactional information. And no longer is the data predictable, whether in
how or when it is written or in who or what will read it. Furthermore, much of this data is
unstructured, meaning that it does not clearly fall into a schema or database. How can this data move
across networks? How can it be processed? The size of the data, along with its complexity, demands
new tools for storage, processing, networking, analysis and visualization.
This report will survey some of the developments underway to address these challenges: the
challenges of computing in the exabyte era.
Innovations in Storage
STORAGE: AT THE CHIP LEVEL
Gordon Moore, the co-founder of Intel, predicted in a research paper in 1965 that “the number of
transistors incorporated in a chip will approximately double every 24 months.” Moore’s Law, as it’s known,
is generally accepted by the computer industry, which has seen steady growth in the processing power and
storage capacity of computer chips. Many analysts, however, predict that data is now being created
at a pace that will exceed Moore’s Law.
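To get a feel for what a fixed doubling period implies, the compounding can be written out directly. This is a minimal sketch: the function name `growth_factor` is hypothetical, and the two doubling periods compared are simply the 18-month and 24-month figures quoted in this report.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Factor by which a quantity grows if it doubles every `doubling_months`."""
    return 2 ** (elapsed_months / doubling_months)


forty_years = 40 * 12  # months

# Doubling every 24 months for 40 years is 20 doublings.
print(f"24-month doubling over 40 years: {growth_factor(forty_years, 24):,.0f}x")
# Doubling every 18 months for 40 years is about 26.7 doublings.
print(f"18-month doubling over 40 years: {growth_factor(forty_years, 18):,.0f}x")
```

The gap between the two results (about a million-fold versus about a hundred-million-fold) shows how sensitive long-range projections are to the assumed doubling period.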
This poses a challenge to chip-makers who are researching new storage and storage reduction
technologies. After all, there are physical limitations to the miniaturization of transistors, a point that
some predict could be reached by 2020. So while Moore’s Law has driven the computer industry for over
40 years, if the storage capacity and processing power are to continue, innovations must occur not just in
terms of dimensions and scaling but in terms of alternate computing mechanisms and logic devices.
Hewlett Packard, for example, has reported advances in the design of a new class of diminutive switches
that would be capable of replacing transistors and help shrink computer chips closer to
the atomic scale. The devices, known as memristors, or memory resistors, are modeled along the lines
of biological systems. These are purportedly simpler than today’s semiconducting transistors, can store
information even in the absence of an electrical current and can be used for both data processing and
storage. Researchers also say they have devised a new method for storing and retrieving
information from a vast three-dimensional array of memristors, something that could allow designers to
stack switches beyond the limitations of two-dimensional scaling.
A different approach is being taken by researchers at IBM, Intel, and others, who are investigating a
type of storage called “phase-change memory.” PCM offers high performance along with low power
consumption, combining the best attributes of NOR, NAND and RAM — fast read and write speed, non-
volatility, bit-alterability and good scalability, for example — within a single chip. Unlike flash memory
technology, for example, PCM allows stored information to be switched from one to zero or zero to one
without a separate erase step. And unlike RAM, PCM does not require a constant energy supply.
And earlier this year, researchers at the Tyndall National Institute in Cork, Ireland announced they had
created the world’s first junction-less transistor. Current transistors are based on junctions, which are
formed by placing two pieces of silicon with different polarities side-by-side. Controlling the junction
allows the current in the device to be switched on and off. The new transistor technology uses a control
gate around a silicon nanowire that can tighten around the wire to the point of closing down the passage
of electrons without the use of junctions or doping.
As researchers pursue different solutions to the question of building computer chips with better
processing and storage capabilities, they must address not just performance, but cost and power
consumption.
STORAGE: AT THE DATA CENTER LEVEL
The impact of Moore’s Law does not occur simply at a chip level, of course. The increase in computer
power at lower cost has, in part, spurred this data explosion, which in turn has demanded the building
of more computers, more servers, more data centers. So at the other end of the spectrum from the
innovations happening to storage at the chip level are the massive data centers that house thousands of
chips on thousands of servers.
While computing power has increased and the cost of chips has fallen, the cost of building and powering
data centers has increased dramatically. An analysis of Facebook’s spending posits that the company will
spend about $50 million this year on data centers — a figure that has more than doubled since similar
estimates for 2009. No longer is the bulk of the expense of those facilities merely a question of large and
powerful equipment. (In fact, those figures from Facebook do not include equipment.) Rather, it is this
equipment’s skyrocketing demand for electricity for both power and cooling.
According to some calculations, for every watt of server power used at a well-managed data center,
an additional watt is consumed by the chillers, air handlers, and so on. But in many cases the energy
consumed is much higher. According to Greenpeace, at current growth rates data centers and
telecommunication networks will consume about 1,963 billion kilowatt-hours of electricity in 2020,
more than triple their current consumption and more than the current electricity consumption of France,
Germany, Canada and Brazil combined.
Energy consumption is prompting the search for more efficient ways of powering and cooling.
Data centers are being located in areas near alternative sources of energy, such as Google’s recent
announcement of a new center in Finland that will be cooled by sea water. Other facilities are
experimenting with using offset heat to warm nearby offices. Some researchers are investigating ways
that data centers can utilize energy from the heat to fuel cooling mechanisms, for example. Others
are building new and different containers for the servers so that they are less capital-intensive and can be
powered and cooled more efficiently.
STORAGE: VIRTUALIZATION AND THE CLOUD
One of the factors that has contributed to the explosion of data is the increasing adoption of
virtualization. Virtualization allows companies to take advantage of greater storage and processing
capabilities without having to run their own physical machines. Virtualization, or cloud computing, has
created many opportunities for businesses to leverage elastic computing to do things otherwise
not possible because of the costs of building and maintaining their own hardware infrastructure.
Although it’s common practice for many companies to move to dedicated data centers once they
reach a certain size, many companies are running quite sizable businesses on public clouds. Playfish,
for example, one of the largest social gaming companies, runs its operations with Amazon Web
Services.
Cloud computing facilitates the speed with which new companies and new processes can be set up, as
new servers can be launched and scaled with ease. As cloud computing allows for scaling to happen
horizontally and not just vertically, it has, along with other developments in distributed computing,
provided new ways for thinking about how data can be stored and processed.
STORAGE: BIG DATA, NEW DATABASES
It’s no surprise that as data has grown, databases have had to adapt. One example of the innovation
occurring in recent years is the number of new databases that break from the relational database
management system (RDBMS) model. The latter has a long history, dating back to the 1970s. In a
relational database, data is stored in the form of tables, as is the relationship among the data. This
system has worked well to handle transactional and structured data.
But as the amount of information, the kind of information, and the number of users accessing the
information have grown, the relational database has faced some challenges. With new data comes new
storage demands. And the traditional RDBMS is not optimized for the kind of environment that big
data and cloud computing have created — one that’s elastic and distributed.
Traditional RDBMS software, such as MySQL, can handle huge amounts of data but often requires
extensive knowledge to manage. MySQL in particular is well known by many developers and has
remained the data storage choice for many people. But a growing number of “NoSQL” — “Not Only
SQL” — alternatives have been developed in the last year or so. These databases are designed to be
Web-scale. They can be characterized as non-relational, distributed and horizontally scalable. Many of
them are open source.
Examples of NoSQL databases include CouchDB, MongoDB, Membase, and Redis.
Perhaps due to the acronym containing “No,” there has been skepticism about some of these new
technologies by those who do not want to abandon the relational database. Often, it’s not a choice
between only one or the other as many businesses operate with a combination, where some data is
stored in an RDBMS with other data better suited to a NoSQL datastore.
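The contrast between the two models can be sketched with nothing but the Python standard library. This is an illustration under stated assumptions, not any particular product’s API: `sqlite3` stands in for a relational database, a plain dictionary serialized as JSON stands in for a document-style NoSQL record, and the table names, field names and values are all made up.

```python
import json
import sqlite3

# Relational model: data lives in tables, and relationships are expressed
# through keys that are joined at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    "id INTEGER PRIMARY KEY, "
    "customer_id INTEGER REFERENCES customers(id), "
    "total REAL)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(10, 1, 25.0), (11, 1, 40.0)]
)
relational_result = conn.execute(
    "SELECT c.name, SUM(o.total) "
    "FROM customers c JOIN orders o ON o.customer_id = c.id "
    "GROUP BY c.id"
).fetchone()

# Document model: one self-contained, schema-less record. Related data is
# nested inside it, and fields can vary from one document to the next.
document = {
    "_id": "customer:1",
    "name": "Ada",
    "orders": [{"id": 10, "total": 25.0}, {"id": 11, "total": 40.0}],
    "notes": "free-form field that needs no schema change",
}
document_total = sum(order["total"] for order in document["orders"])
serialized = json.dumps(document)  # what would be sent to a document store

print(relational_result, document_total)
```

The relational version enforces structure up front and joins at read time; the document version trades that structure for records that are easy to distribute across many machines, which is the property the NoSQL systems above emphasize.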
Speed: Big Data, Real-Time
The storing of exabytes of data is only part of the challenge,
as the demands aren’t merely to be able to warehouse big
data, but to be able to process and analyze it. Furthermore, the
demands for read and write access are often real-time.
As with the necessity for the development of better storage, big data requires better processing power,
something accomplished at the level of the processor and up through the system. With the advent of
networking, one of the ways in which computational power is increased is by distributed computing.
That is, processing is not necessarily done in a single powerful mainframe computer, but is instead
distributed to a number of computers in clusters or nodes. With distributed computing, a problem is
divided into many tasks, each of which is solved by one computer.
According to one report, for example, an ordinary Google search query involves between 700 and
1,000 servers, all so that a response can come back in under half a second.
To perform tasks like this, Google has built MapReduce. MapReduce is a framework for processing
huge datasets by using a large number of computer nodes applied to certain kinds of distributable
problems. In this way, computational processing can occur on structured or unstructured data.
The advantage of MapReduce is that it allows for distributed processing of the map and reduction
operations. The terms “map” and “reduce” refer to steps the tool takes to distribute, or map, the input
for parallel processing, and then reduce, or aggregate, the processed data into output files. In other
words, during the map step, a master node takes the input and chops it up into small sub-problems,
then distributes those to worker nodes. In the reduce step, the master node then takes all the answers
to the sub-problems and combines them to get the answer to the problem it was originally trying to
solve.
Some have posited that MapReduce is inefficient, but a large server farm like those operated by
Google can use MapReduce to purportedly sort a petabyte of data in only a few hours. And the
MapReduce framework has been incredibly influential on the development of other new tools to
handle big data.
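The map and reduce steps described above can be illustrated with a toy word count that runs in a single process. This is a sketch of the programming model only, not Google’s implementation: the function names are hypothetical, and a real framework would run each map and reduce call on a different worker node and handle the grouping step over the network.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: turn one chunk of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Group all values emitted for the same key, across every chunk."""
    groups: dict[str, list[int]] = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks: list[str]) -> dict[str, int]:
    # Each map_phase call is independent, so a cluster could run them in parallel.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    # Each reduce_phase call is also independent of the others.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(map_reduce(["big data big", "data center"]))
```

Because no map call depends on another, and no reduce call depends on another, the work scales out by adding machines rather than by buying a faster one.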
Another important tool recently developed to handle large amounts of data is Hadoop. Derived
from MapReduce, Hadoop is an open source project that, like MapReduce, handles large files across
multiple machines. Hadoop consists of two key services: MapReduce and a data-storage system called
the Hadoop Distributed File System (HDFS). A key feature of Hadoop is that for effective scheduling
of work, every filesystem should provide location awareness — the name of the rack where a worker
node is. Hadoop applications can use this information to run work on the node where the data is, and,
failing that, on the same rack/switch, so as to reduce backbone traffic. The filesystem uses this when
replicating data, keeping different copies of the data on different racks with the goal of reducing the
impact of a rack power outage or switch failure. Even if these occur, the data may still be readable.
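The two uses of location awareness described above, running work near the data and spreading copies across racks, can be sketched in a few lines. This is a simplified illustration and not Hadoop’s actual scheduler or replica placement policy; the function names, node names and rack names are all hypothetical.

```python
def pick_worker(replica_nodes, idle_nodes, rack_of):
    """Return (node, locality) for the idle node closest to a copy of the data."""
    # 1. Best case: an idle node that already stores the data.
    for node in idle_nodes:
        if node in replica_nodes:
            return node, "node-local"
    # 2. Next best: an idle node on the same rack as some copy of the data.
    replica_racks = {rack_of[node] for node in replica_nodes}
    for node in idle_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Otherwise the data has to cross the backbone.
    return (idle_nodes[0], "off-rack") if idle_nodes else (None, "none")


def spread_replicas(candidate_nodes, rack_of, copies=3):
    """Choose nodes for the copies, using as many distinct racks as possible."""
    chosen, used_racks = [], set()
    # First pass: at most one node from each rack.
    for node in candidate_nodes:
        if len(chosen) < copies and rack_of[node] not in used_racks:
            chosen.append(node)
            used_racks.add(rack_of[node])
    # Second pass: if there are fewer racks than copies, reuse racks.
    for node in candidate_nodes:
        if len(chosen) < copies and node not in chosen:
            chosen.append(node)
    return chosen


rack_of = {"a1": "rackA", "a2": "rackA", "b1": "rackB", "c1": "rackC"}
print(pick_worker(["a1", "b1"], ["a2", "c1"], rack_of))
print(spread_replicas(["a1", "a2", "b1", "c1"], rack_of))
```

With copies on different racks, losing one rack’s power or switch still leaves at least one readable copy elsewhere.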
To illustrate: Hadoop was recently utilized to calculate the 2,000,000,000,000,000th digit of pi, more
than doubling the record of the previous longest calculation. Using a cluster of 1000 computers at
Yahoo, it took 23 days to calculate, something that would have taken over 500 years on a standard PC.
Rather than calculating each digit in turn, Hadoop allowed computers to work with a formula that turned
a complex equation for pi into a small set of mathematical steps. And then, in the end, the formula
returned just one specific piece of pi, that record-breaking digit (which is, incidentally, “0”).
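The idea of computing one digit of pi without computing the digits before it can be illustrated with the well-known Bailey-Borwein-Plouffe (BBP) formula, which yields hexadecimal digits. This is offered as an example of the general technique only; it is an assumption that it resembles the record calculation, which is not described here in enough detail to reproduce. The sketch uses ordinary floating-point arithmetic, so it is only reliable for modest positions.

```python
def _series(j: int, n: int) -> float:
    """Fractional part of sum over k of 16**(n - k) / (8*k + j)."""
    total = 0.0
    # Terms with k <= n: modular exponentiation keeps the numbers small.
    for k in range(n + 1):
        denom = 8 * k + j
        total = (total + pow(16, n - k, denom) / denom) % 1.0
    # Terms with k > n shrink by a factor of 16 each time; stop when negligible.
    k = n + 1
    while True:
        term = 16.0 ** (n - k) / (8 * k + j)
        if term < 1e-17:
            break
        total += term
        k += 1
    return total % 1.0


def pi_hex_digit(position: int) -> str:
    """Hexadecimal digit of pi at `position` places after the point (1-based)."""
    n = position - 1
    x = (4 * _series(1, n) - 2 * _series(4, n) - _series(5, n) - _series(6, n)) % 1.0
    return "0123456789ABCDEF"[int(x * 16)]


# Pi in hexadecimal begins 3.243F6A88...
print("".join(pi_hex_digit(i) for i in range(1, 9)))
```

Because each position is computed independently, different positions (or different parts of the sum for one position) can be handed to different machines, which is what makes this kind of calculation a natural fit for a cluster.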
But Hadoop and MapReduce are batch processes, and as such can have high latency. At the scale of
big data, speed is assessed in terms of performance — the speed with which a system answers a query.
But just as important is the idea of “speed to insight,” that is, the amount of time it takes for analysts to
glean insights from these massive data sets.
The Demand for Big Data Analytics
“Success” in big data isn’t simply a matter of building and
implementing better storage or processing tools. Success
involves being able to gain insights from the big data, and
to gain them quickly. But the scale of the data does make search,
analysis, and visualization challenging, even more so with
the demands of real-time.
Analytics have often accompanied data warehousing for sectors like finance, retail, and research.
But just as big data creates challenges for databases and processing, it also poses new problems for
analytics. Traditional databases struggle with the complexity and poor performance that result from
trying to express complex analytics in SQL. So until recently, many advanced analytics were handled
outside the database. In other words, analytics procedures and models were run on statistical analysis
platforms — and so optimizations to the database wouldn’t necessarily speed up the analysis.
Furthermore, data needed to be copied and moved from the data warehouse to a statistical platform.
Between the constraints of disk speed and network bandwidth, moving big data out of a warehouse
can be slow, a problem further compounded by the time it takes a statistical platform to process the data.
These challenges have been so severe that in many cases, the depth of the analysis is compromised.
This has occurred when big data is reduced — via sampling, for example — to smaller subsets for
computation, meaning that critical insights may be overlooked. Furthermore, developers have been
forced to spend a significant amount of time modifying complex analytics in order to fit with the
limitations of traditional databases. Arguably, traditional business intelligence applications are not
designed to handle the amount or the complexity of the data, nor are they necessarily built to handle
real-time reporting. As a result, the quality of the analytics suffers.
Rather than being based on reports about past events, analytics should be based on real-time data. And
rather than arriving in periodic reports created by statisticians, this information needs to
be open for constant and on-demand analysis.
Big data analysis is changing in part due to in-database analytics, as database vendors like Aster Data
begin to add analytics to their feature lists. These vendors now support a range of analytic
queries that can be written in or converted to SQL, as well as those written in C/C++, Java, Python, Perl,
R, and other languages inside their database.
In addition to demands to deliver complex analyses on big data, there is also increased interest
in visualizations. And again, as with database and analytic technologies, many of the existing
tools have not been designed to handle the massive quantities of data. Efforts like CalTech’s Large
Data Visualization Initiative are seeking to develop multiresolution visualization and modeling
The ability to perform analytics on big data in near-real-time will become increasingly important for
organizations, and the market opportunities are substantial for companies and data scientists who can
provide these services.
Accessing the Data
VIA THE API
Moving large volumes of data around can be difficult for all the reasons explained above. The
requirements for moving data have necessitated development on a couple of levels: in terms of
networking and in terms of the API.
APIs aren’t necessarily designed to solve a company’s big data problem. Nonetheless, they can be
utilized in a number of ways to offer developers access to all or part of a company’s data. And
as companies generate and store more data and as data becomes a more important commodity,
having an API becomes more and more important. An API allows companies to open access to this
information not simply for internal analyses and processes, but for third-party developers as well.
Having an API has become “BizDev 2.0”. In other words, in a Web-oriented world, it’s the way business
development is done. APIs facilitate business-to-business relations by opening data and systems
to business partners. And having an API makes new queries possible (if not easier), enhancing
information discovery for companies.
OVER THE NETWORK
The amount of data that is being generated taxes network capabilities, even with the best broadband
infrastructure. With a T1 (1.544Mbps) Internet connection, it would take approximately 82 days to
upload one terabyte of data. Even at 10Mbps, it would take almost two full weeks to do so.
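The arithmetic behind transfer-time estimates like these is straightforward. The sketch below is hedged in two ways: the function name `transfer_days` is hypothetical, and the 80% link utilization and the binary terabyte (2**40 bytes) are assumptions chosen because they come close to the quoted figures, not values stated in the text.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Days needed to move `size_bytes` over a link of `link_mbps` megabits per second."""
    bits = size_bytes * 8
    effective_bits_per_second = link_mbps * 1_000_000 * utilization
    return bits / effective_bits_per_second / SECONDS_PER_DAY


ONE_TERABYTE = 2 ** 40  # assumption: binary terabyte

for label, mbps in [("T1 (1.544Mbps)", 1.544), ("10Mbps", 10.0)]:
    print(f"{label}: {transfer_days(ONE_TERABYTE, mbps):.1f} days")
```

Under a more optimistic assumption (a decimal terabyte and a fully utilized link), the T1 figure drops to about 60 days, which is still far too slow for moving big data routinely.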
But it isn’t just the size of the data that makes portability a problem. It’s also the rapidly increasing
number of machines that are connecting to the Internet. In August 2010, wireless analyst Chetan
Sharma reported on figures for the U.S. wireless data market, noting that mobile phone subscription
penetration had crossed 95% at the end of the second quarter of 2010. Excluding those aged 5 and
under, this means that the mobile penetration for the U.S. is now past 100%. But the increase in new
mobile phone subscriptions is only part of the picture. Outpacing these new human subscriptions for
the same quarter were those of “connected devices.” Even as the U.S. nears full penetration of mobile
devices, an array of other devices and everyday objects are coming online via sensors and RFID chips:
the “Internet of Things.”
The pressures from more devices coming online are leading governments and organizations to rethink
how Internet bandwidth, wireless spectrum and Internet addresses are allocated and managed.
Use Cases
DISTRIBUTED COMPUTING WITH COUCHDB AT CERN
Scientific research has long had to wrestle with capturing, storing, managing and analyzing massive
amounts of data, but the rise of big data has taxed even the systems designed to study the intricacies
of genomes, weather patterns, outer space, and so on.
One such facility is CERN, the European Organization for Nuclear Research. Situated on the Franco-
Swiss border, CERN is the world’s largest particle physics laboratory and the site of the Large Hadron
Collider, a global scientific project that researches particle collisions using the world’s largest and
most powerful particle accelerator. The LHC produces an enormous amount of data — around 15
petabytes a year. And when the LHC was in its planning stages, CERN’s IT department quickly realized
that that amount of data was more than a data center — and perhaps even the Geneva power grid —
could handle. Instead of one large data warehouse facility, they opted for a grid computing solution,
distributing the collider data to a dozen or so data centers. CERN’s grid consists of 100,000 processors
at 140 scientific institutions in 33 countries.
One of the LHC experiments is the Compact Muon Solenoid. In order to manage the roughly 10
petabytes of data it collects, CERN announced that it plans to deploy the NoSQL database CouchDB.
This particular experiment requires a database solution that not only can handle large amounts of data
— often without metadata — but can distribute the data quickly in an environment in which incoming
database connections are frequently impossible. CouchDB is specifically designed for distributed
environments, and one of its key benefits is its replication and syncing features. Furthermore, the
researchers have pointed to the speed with which they can prototype tools using CouchDB.
REAL-TIME RETAIL ANALYTICS
Big data is poised to deliver tremendous insights about consumers’ spending patterns. Retailers have
long tracked when people spend and what they buy. After all, past shopping behavior is the best way
to predict future purchases. But marketing efforts, as the term “mass marketing” implies, have been
imprecise. Now, an incredible amount of information can be gathered about consumers’ shopping
habits: how they browse online, where they shop, when they shop, what brands they buy with what
frequency. And rather than just general demographic information gleaned after-the-fact — knowing,
for example, that a certain coupon worked well with women in their 40s — companies can drill down
into an individual consumer’s profile, and be able to serve them specifically targeted offers in real-time.
For example, as Akamai’s network has grown to encompass more than 450 brands and multi-channel
Internet retailers, it has run into challenges delivering the right ad at the right time to the right
audience. Akamai must deal with up to 75 million daily events, and as its core business value relies on
being able to data-mine that information for advertisers, it needs to be able to analyze data quickly.
With the number of users, profiles and transactions increasing the number of models that must be run for
these records, Akamai found that daily reporting was being delayed by up to 20 hours. Akamai recently
moved its database to Aster Data to take advantage of the company’s nCluster in order to reduce
MILLIONS OF FARMVILLES MEAN PETABYTES OF DATA DAILY: HOW
ZYNGA HANDLES SOCIAL GAMING BIG DATA
One part of social networking that has seen a meteoric rise is social gaming. Some 65 million
people play Zynga’s online games every day. According to Zynga CTO Cadir Lee, 10% of the world’s
population has played a Zynga game. That’s millions of Web browsers open to millions of farms and
millions of frontiers. They take turns; they tend crops; they send gifts. They buy millions of objects and
upgrades. Zynga says its technology supports 3 billion neighbor connections throughout its games.
And all told, it moves around 1 petabyte of data daily, using a combination of its own data centers and
a hybrid public/private cloud.
It’s a mind-boggling amount of data. And it’s a new kind of data, more than simply transactional
data, accessed in many ways by many millions of users. This has necessitated not simply massive
server resources (the company says it adds as many as 1,000 new servers every week to accommodate
traffic), but also the development of a new sort of database management system.
Zynga has been a major contributor to the open source Membase project, taking some of the concepts
of Memcached — low cost, high performance, schema-less caching — in order to develop a database
that works with similar speed, flexibility and simplicity.
Zynga needs to be able not only to serve up all this data to its millions of users, but also to
undertake analytics on the gameplay in order to, for example, design engaging and viral games and to
ascertain the points at which players are willing to purchase virtual goods.
THE BIG DATA MARKETPLACE
The amount of data being produced — by science, governments and social networks — has given rise
to a number of companies that are specifically geared towards the storage, sale, and analysis of data.
For example, Infochimps, a startup based out of Austin, Texas, describes itself as a marketplace for data:
“A site to find, sell, or share any dataset in the world.” Infochimps makes a variety of datasets available,
including massive data scraped from Twitter. (A recent scrape contains data about 35 million users, 500
million tweets, and 1 billion relationships between users). Some of the datasets are available for free,
and some for a price. Infochimps also makes some of the data available via an API, in lieu of sending
an entire dataset.
Factual is another startup that is offering access to massive datasets, in this case
geolocation data, alongside an API and other tools for building geolocation applications.
BIGGER DATA AND A BETTER RESPONSE: EARTHQUAKE DETECTION &
CRISIS RESPONSE
Although big data is often touted for its scientific and commercial implications, it is also becoming
an important tool for humanitarian purposes, as responses to recent natural disasters have
demonstrated. Open data advocates and developers have formed groups like CrisisCommons and
projects like OpenStreetMap in order to build tools that serve the public good. The World Bank, for
example, has made a substantial amount of its data open, and has encouraged people to build tools to
help understand the information to be able to better respond to natural disasters and other crises.
We marvel at the fact that today our smartphones have far
more RAM than our first personal computers did. But with
these phones, PCs, and with other connected devices, we
are generating almost unfathomable amounts of data, and
generating a demand, in turn, for ever more storage. The
average person is uploading over 15 times more data to the
Internet today than they did just three years ago. Yet the
information uploaded by humans is dwarfed by the Internet of
Things, the networking of everyday objects.
The explosion in data is creating challenges and prompting innovation in computer storage and
processing, in terms of software, hardware and data center architecture. The desire to be able to glean
insights from all this data is also set to be a boon for analysts and statisticians. And it’s creating many
opportunities for new companies who can deliver technology products and services to help solve
some of the challenges associated with big data.
And there are plenty of challenges. Moore’s Law has so far proven accurate — processing power has
increased and costs of manufacturing computer chips have gone down. But the cost of powering the
machines has soared. And when you are handling data on an exabyte scale, the energy costs to power
and cool machines — particularly those in the massive data centers — are substantial.
Beyond the problems of power consumption, the amount of data being generated also
taxes network infrastructure. As the Internet struggles to maintain speeds and bandwidth, broadband
and wireless continue their penetration into new areas.
We have only begun to develop the tools to manage and analyze all this data. As the majority of this
data is unstructured, it has often remained beyond the scope of analysis. As the data is classified,
questions of interoperability are raised — how can we structure and classify this information so it is
usable within companies and across industries?
But some people are cautious about the race to create and network all this data — to make this data
available and useful — particularly when it comes to personal information. How will organizations
ensure that data is kept private and secure? What sorts of controls will people have over the data they
create, over the data their personal objects create?
As we continue generating almost inconceivable amounts of information, it is clear that the data
explosion will bring about challenges for businesses and for IT departments. Big data will be a
problem that all organizations will need to address, whether “big” is on the scale of terabytes or
exabytes of data. As companies increasingly look for solutions to their big data problems, this will in
turn create opportunities for others to develop technologies and practices to best store, manage and
analyze big data.
Virtualize network connections and capacity—From the edge to the core
An HP Converged Infrastructure innovation primer
Table of contents
Data center networking dynamics 3
Introducing HP FlexFabric 3
HP FlexFabric benefits 4
The key attributes of HP FlexFabric 5
The FlexFabric evolution path 6
Deliver “networking as a service” to the Converged Infrastructure
Data center networking dynamics
The fundamental nature of data center computing
is rapidly changing. The traditional model of
separately provisioned and maintained server,
storage, and network resources is constraining
data center agility and pushing budget envelopes
to the limit. IT organizations recognize that these
static pools of isolated resources are being
underutilized—a problem that can be exacerbated
when dedicated infrastructure or computer
systems are used to support different classes of
data center workloads. One response has been
for IT organizations to adopt virtualization and
blade technologies, which enable a more flexible
and highly utilized infrastructure. These new,
more scalable technologies can be dynamically
provisioned to meet continuously evolving business
requirements. At the same time, these technologies
apply new pressures to the multiple networks in
the data center, further worsening spend issues.
And it increases the burden on the IT teams that
• A proliferation of virtual machines is driving much
more frequent changes to network configurations.
• Data center network processes must be
coordinated through multiple IT teams and are too
• Increases in server utilization require more network
bandwidth per server.
• Traditional hierarchical network designs can neither
scale nor provide the performance, low latency,
availability, and quality of service demanded by
a virtualized data center.
• Blade technology is further escalating the number
of connections to be managed and increasing
Network teams are faced with a race to build out
data center network capacity and to effectively
provision connectivity at an increasing speed.
To keep pace, IT organizations need a network
architecture that is more coherent, flexible, and
agile. But they don’t want to give up the stability,
high availability, and security offered by the proven
compute and storage networks currently installed in
their data centers.
HP is creating a new balance by combining some of
the best, new, standards-based technologies with a
streamlined, modular architecture that fully optimizes
virtualized resources, while meeting business
requirements for low total cost of ownership,
faster time-to-service, and critical requirements for
reliability, IT governance, and compliance.
Introducing HP FlexFabric
HP FlexFabric is the next-generation, highly scalable
data center fabric architecture of an HP Converged
Infrastructure. With FlexFabric, you can provision
your network resources efficiently and securely to
accelerate deployment of virtualized workloads.
With highly-scalable platforms and advanced
networking and management technologies,
FlexFabric network designs are simpler, flatter, and
easier to manage and grow over time. This open
architecture uses industry standards to simplify
server and storage network connections while
providing seamless interoperability with existing
core data center networks. FlexFabric combines
intelligence at the server edge with a focus on
centrally managed connection policy to
enable virtualization-aware networking and security,
predictable performance, and rapid, business-driven
provisioning of data center resources.
[Figure: FlexFabric architecture diagram. Labeled layers: VM Edge Access (flexible virtual I/O, hypervisor agnostic, emerging VEPA); Intelligent Server Access (flexible form factors, storage-server I/O consolidation, optimized for data center workload mobility and utilization); High performance Layer 2/Layer 3 Interconnect (high bandwidth, compatible with existing Layer 3 cores); Data center management and orchestration (multi-site, multi-vendor network resource management, policy-driven resource provisioning, threat management).]
FlexFabric can enable your IT organization to
build a wire-once data center that responds to
application and workload mobility, and provides
resource elasticity. You can move your network
connections with your workloads as you migrate
them across or between data centers. Also, the
fabric can stretch and reclaim pools of resources to
meet rapidly changing needs. High-performance
threat management tools unify physical and virtual
security into a common, extensible framework.
Dynamic provisioning capabilities fully exploit
virtualized connections to achieve new levels of
data center efficiency and accelerate time-to-service.
The FlexFabric management and provisioning tools
help align the fabric with governance policies and
service-level agreements (SLAs), while reducing the
cost of operations.
HP FlexFabric benefits
• Improved business agility, faster time-to-service
and higher resource utilization by dynamically
and securely scaling capacity and provisioning
connections to meet virtualized application
demands “on the fly”
• Breakthrough cost reductions by converging
and consolidating server, storage, and network
connectivity onto a common fabric with a flatter
topology and fewer switches
• Predictable performance and low latency to
support some of the most demanding
• Modular, scalable, industry standards-based
platforms and multi-site, multi-vendor management
tools to connect and manage thousands of server
and storage devices using industry-standard
• Investment protection for existing Layer 3 core
systems with seamless compatibility and support
for open standards
• Flexibility to manage and administer server,
storage, and network resources in any
organizational model—from completely separate
to fully integrated—while consistently enforcing
governance, security and SLA policies
• Removal of costly and time-consuming change
management processes, while reducing the
number of error-prone or conflicting
configuration steps
• Support for a wide range of data center
FlexFabric delivers true “networking-as-a-service”
to the various consumers of connectivity within
the data center and accelerates deployment of
applications and services. It provides a unified
connectivity infrastructure—across servers, storage,
and networking—that dynamically adapts to the
demands of the heavily virtualized and more flexible
data center architectures of tomorrow, while meeting
increasing pressures for price/performance and
HP FlexFabric overview
HP FlexFabric brings together a highly-scalable, high performance, secure network infrastructure with comprehensive management and
policy-driven connectivity provisioning integrated into a data center converged infrastructure
The key attributes of HP FlexFabric
By radically simplifying and flattening network
designs and using emerging data center networking
standards, HP FlexFabric creates a more robust,
flexible, and efficient data center network
infrastructure. Rather than relying on a traditional
hierarchical networking architecture, FlexFabric
offers a flatter data center topology with edge
intelligence, designed to complement the intelligent
virtualized network interfaces offered by the latest
HP data center servers and storage systems. This flat
fabric interconnect is more fungible and provides
superior network performance and quality of service.
To manage the FlexFabric network, you can design
and centrally manage fully-virtualized network
connections and resources that allow for dynamic
provisioning from the edge to the core and support
for application mobility, enabling connections to
move with workloads as they migrate across the
fabric. This allows resources to be created, moved,
and scaled from centralized connection pools “on
the fly,” putting to work an integrated resource and
provisioning management toolset.
To secure the FlexFabric network,
a virtualization-integrated security framework
provides business continuity with a unified,
high-performance physical/virtual server network security
architecture. This framework enables seamless threat
management and leverages a global threat
intelligence network to block bad traffic in virtual
and physical environments.
FlexFabric is designed to support a much wider
set of data center architectures, workloads, and
requirements than is otherwise possible with
traditional data center networking approaches.
It supports specialized back office, cloud, web,
or high-performance computing models. Instead
of locking organizations into a proprietary
end-to-end solution, FlexFabric gives them the
flexibility to incrementally deploy a heterogeneous
data center network that meets their workload needs
and protects existing investments.
Predictable performance supports diverse
A highly scalable, flat network domain enables
HP FlexFabric to deliver flexible provisioning,
ultra-low latency, high performance, and fast
workload mobility. The architecture provides
breakthrough cost structures by removing
networking layers and complexity, and applying
new technologies including higher speed Ethernet
links, active load balancing, and link aggregation
within the server edge and advanced multi-switch
virtualization and management in the interconnect.
Multiple server edge and interconnect switches
can be virtualized and managed as single logical
devices with improved utilization, high availability,
scalability, and flexibility to handle virtualized
workloads with very high throughput. Capacity can
be dynamically scaled or divided.
FlexFabric networks are designed to meet the
security, resiliency, and reliability requirements
expected in today’s data center.
Open and standards-based for investment protection
FlexFabric is designed to interoperate with existing
third-party Layer 3 core switches to protect existing
investments and enable smooth network migration.
This standards-based approach removes the
risk of vendor lock-in and lets your organization
incrementally deploy a FlexFabric network without
disruptive forklift upgrades. You can mix and match
existing operational processes with new approaches
using industry-leading HP products to coordinate IT
teams. Finally, this approach helps your organization
manage the high purchase, support, and operations
costs associated with proprietary environments.
Pragmatic deployment of new technologies
HP FlexFabric utilizes the latest emerging industry
standards, including higher speed Ethernet links,
Virtual Ethernet Port Aggregation (VEPA), Fibre
Channel over Ethernet (FCoE), and Converged
Enhanced Ethernet (CEE). The CEE standard enables
Ethernet to deliver a “lossless” transport technology
with congestion management and flow control
features needed in storage environments. Leveraging
FCoE today, FlexFabric server edge platforms allow
for sensible storage-server I/O consolidation with
assured compatibility with existing Fibre Channel
Storage Area Networks (FC-SANs). This allows users
to reduce cost and complexity without jeopardizing
business continuity. HP is championing many of these
and other emerging standards in the IEEE
and other organizations, to give users a data
center fabric that protects their technology
investments instead of proprietary approaches that
can cause organizational disruption and wholesale
Data center-integrated management and
provisioning for business agility
With management and provisioning integrated
down to the component level—including networking
and virtual I/O—HP is revolutionizing data center
provisioning and operation. Comprehensive
network resource management tools allow users to
administer networks across multiple sites and against
a combination of HP and multi-vendor platforms
from a single pane of glass. Integrated FlexFabric
provisioning capabilities reduce time to service
and the chance of costly errors while accelerating
IT alignment with business demands and goals.
FlexFabric enables administrators to centrally
define connection and network policies that can be
dynamically matched to workloads and provisioned
“on the fly” from pools of available resources. The
FlexFabric model allows a “design once, replicate
many” approach to provisioning that is optimized for
workload mobility, streamlines network provisioning,
and reduces the number of error-prone or possibly
conflicting configuration steps that make change
management time-consuming and costly.
FlexFabric removes a major barrier to automation
and orchestration—the “all-or-nothing” proposition
organizations face with other data center
management frameworks. Designed to support
a wide range of IT organizational models,
FlexFabric offers interfaces designed specifically
for each operator type found in IT teams. Network
administrators can provision resources in advance
and make them available to server and storage teams
to utilize instantly when needed, saving time and
FlexFabric management integrates seamlessly across
the entire spectrum of HP data center management
systems to streamline the activities of your data
center IT teams without requiring extensive overhauls
of organizational structure and processes. This
powerful system can automate and coordinate
network services with application deployment, and
free up data center administrators from repetitive
operational activities that drain IT budgets.
FlexFabric provides open interfaces for third-party
functionality that integrates application delivery and
virtualization engines. Finally, FlexFabric management
is fully integrated with industry-leading IT orchestration
and management systems from HP, giving your IT staff
unprecedented control that spans networks, servers,
applications, and even physical plant attributes.
The FlexFabric evolution path
Deliver “networking as a service” to
the Converged Infrastructure
FlexFabric is more than just an aspirational model
of the ideal data center network. Users can deploy
networks today that deliver on the FlexFabric value
proposition—aggressively or incrementally—in
keeping with overall technology and business
objectives. This evolutionary and flexible approach
to data center deployment across the infrastructure
puts real user needs, investment protection, and
business continuity at the top of the list of
principles guiding our vision for a Converged Infrastructure.
Today—A network foundation for
First introduced in 2006, Virtual Connect technology
is a key enabler of an integrated, data
center-aligned network, and delivers against
foundational HP FlexFabric principles by providing
some of the simplest, most flexible ways in the world
to provide high-performance, secure server
connectivity. With reduced complexity, improved
agility, and reduced cost, Virtual Connect radically
simplifies network infrastructure and provisioning
without disrupting “upstream” network operations.
HP Virtual Connect virtualizes server edge I/O,
enabling server administrators to provision Local
Area Network (LAN) and Storage Area Network
(SAN) resources in advance, and then enable
them when needed. Virtual Connect enables
server administrators to move workloads and
virtual machines, or add, move, or replace servers
transparently to LANs and SANs in minutes without
having to engage LAN and SAN administrators.
Attacking head-on the expensive proliferation
of Ethernet connections caused by increased
network capacity requirements for virtual machines,
HP Virtual Connect FlexFabric modules and adaptors
can reduce sprawl at the edge by 95%. Virtual
Connect FlexFabric modules provide up to four
physical connections for each network port, with
the unique ability to fine-tune bandwidth to adapt to
virtual server workload demands on the fly.
The system administrator can now define the
hardware personalities of these connections as
FlexNICs to support only Ethernet traffic or as
FlexHBAs that combine Ethernet and Fibre Channel
or iSCSI protocol support. Each connection has
100 percent hardware-level performance and
provides the I/O connections needed to take full
advantage of multi-core processors and to support
more virtual machines per physical server. Each
server can support many more connections—
up to 40—with less investment in expensive network
equipment on the server, in the enclosure and in the
The bandwidth of each connection can be
fine-tuned and adapted with 100 Mb increments
up to 10 Gb as workload demands change. The
server comes with 10 Gb capability built into it,
ready for today’s investments in 10 Gb networks and
converged fabric technologies like Fibre Channel
over Ethernet. Virtual Connect FlexFabric modules
allow users to take advantage of edge convergence
by providing Fibre Channel over Ethernet (FCoE)
downlinks to the blades while maintaining standard
and proven Ethernet LAN, Fibre Channel SAN, and
iSCSI external connections with their associated
IT practices. This allows system administrators to
simplify enclosure infrastructure and lower costs
by combining Ethernet, Fibre Channel, and iSCSI
protocols over one wire and managing them from
a single management application and interface.
For any virtual server environment, Virtual Connect
FlexFabric modules and adapters are simply some
of the most affordable, flexible, and power-efficient
solutions available from any blade portfolio.
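The bandwidth tuning described above can be illustrated with a minimal Python sketch. It is not HP software or any Virtual Connect interface; the names `Connection` and `validate_port_plan` are hypothetical. The figures it uses (up to four connections per port, 100 Mb increments, a 10 Gb port) come from the text, while the rule that the connections on one port cannot together exceed the port's 10 Gb capacity is an assumption made for the example.

```python
from dataclasses import dataclass

PORT_CAPACITY_MB = 10_000       # one 10 Gb port, expressed in Mb (figure from the text)
STEP_MB = 100                   # tuning increment stated in the text
MAX_CONNECTIONS_PER_PORT = 4    # connections per network port stated in the text


@dataclass(frozen=True)
class Connection:
    """One connection carved out of a port (hypothetical model)."""
    name: str
    bandwidth_mb: int


def validate_port_plan(connections):
    """Return a list of problems with a plan; an empty list means it fits."""
    problems = []
    if len(connections) > MAX_CONNECTIONS_PER_PORT:
        problems.append(
            f"{len(connections)} connections exceed the limit of "
            f"{MAX_CONNECTIONS_PER_PORT} per port"
        )
    for conn in connections:
        # Each share must be a positive multiple of the 100 Mb increment.
        if conn.bandwidth_mb <= 0 or conn.bandwidth_mb % STEP_MB != 0:
            problems.append(
                f"{conn.name}: {conn.bandwidth_mb} Mb is not a positive "
                f"multiple of {STEP_MB} Mb"
            )
    # Assumption: the shares on one port cannot add up to more than the port.
    total = sum(conn.bandwidth_mb for conn in connections)
    if total > PORT_CAPACITY_MB:
        problems.append(
            f"total {total} Mb exceeds port capacity of {PORT_CAPACITY_MB} Mb"
        )
    return problems


if __name__ == "__main__":
    plan = [
        Connection("management", 500),
        Connection("vm-traffic", 5500),
        Connection("live-migration", 2000),
        Connection("storage", 2000),
    ]
    print(validate_port_plan(plan) or "plan fits the port")
```

Re-tuning a workload then amounts to changing one connection's share and re-running the check before applying the change.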
For organizations preferring a traditional server
edge implementation, network management and
design methodology, HP offers scalable blade-based
switching. For users looking to achieve high levels
of server connectivity consolidation, HP offers top-of-rack
switch platforms that deliver high performance,
advanced multi-switch virtualization, and flexible
connectivity, along with options like FCoE that provide
cost-effective storage-server I/O consolidation and
1 Gb to 10 Gb migration. With the
6120 series of blade switches or the A5820 series
of fixed and semi-modular top-of-rack switches,
users have multiple ways to incrementally deploy
a FlexFabric server edge that are in keeping with
traditional network designs.
Complementing the FlexFabric Server Edge offering,
HP offers a complete portfolio of enterprise-class
interconnect and backbone platforms that deliver
aggregation, core switching, and enterprise
routing functionality. These platforms are built
on cutting-edge technology and provide
industry-leading performance, lower power
consumption, and lower TCO with a unified switch
operating system that lets users build simpler, flatter
networks with comprehensive management.
Complete feature functionality and mission-critical
high availability mean that users can deploy a wide
variety of designs to accommodate existing Layer
3 core investments or to radically simplify the network
in collapsed aggregation/core designs. Advanced
multi-switch virtualization technologies allow users
to build cost-effective, large layer 2 aggregation
layers ideally suited for large-scale virtualization
installations. With a continued commitment to open
standards-based interoperability, users can easily
integrate proven third-party data center applications
and technologies, and avoid vendor lock-in. These
data center networking products include the
A-series of switches and routers, such as A6600/
A8800 enterprise routers and the industry’s highest
performance A12500 series switches.
HP provides powerful tools for managing
large-scale FlexFabric networks both in advanced
Virtual Connect-based and traditional network
server edge deployments. With HP Virtual Connect
Enterprise Manager, users can manage the setup
and migration of up to 16,000 Virtual
Connect-based servers from a single pane of glass.
As the foundation for comprehensive network
resource management across the entire enterprise
network, Intelligent Management Center (IMC)
lets users manage an entire multi-site, multi-vendor
network, edge to core, from a single
Securing the FlexFabric is a set of tools that
brings threat management for both virtual and
physical networking together into a single,
enterprise-class architecture. The HP TippingPoint
Secure Virtualization Framework lets users leverage
highly scalable appliance-based Intrusion Prevention
Systems (IPS) to comprehensively secure VM-to-VM
as well as inter-server and inter-network traffic from
a common IPS infrastructure. Combined with a wide
range of security subscription services that leverage
a global threat intelligence network to block bad
traffic in virtual and physical environments, users
can provide continuity as they scale out server
Tomorrow—A new model for deploying
networking as a service
With a vision toward provisioning of network
connectivity and resources completely synchronized
in an end-to-end data center orchestration layer,
HP has developed the Data Center Connection
Manager (DCM) appliance as a proof-point for
how networking can be enabled to accelerate
deployment of virtualized server workloads.
HP Data Center Connection Manager begins to
implement the HP FlexFabric dynamic provisioning
vision. DCM allows network architects to
preconfigure server connection policies that are
enforced at the network edge through common
RADIUS and DHCP standards. Virtual and physical
server interfaces are individually associated or
subscribed to connection profiles from a pool of
resources by the server administrator at build time,
allowing rapid, secure provisioning and workload
mobility without the repetitive manual tasks and
turnaround time associated with provisioning today.
These policies can drive events directly to the HP
BSA Network Automation software product suite,
enabling deep levels of dynamic automation to
provision firewalls or application delivery controllers
in response to server provisioning, de-provisioning
or configuration changes. These capabilities
give network administrators the power to deploy,
manage, and evolve server connectivity flexibly,
quickly, and in line with business policy
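The workflow described above (policies preconfigured by the network architect, server interfaces subscribed to them at build time, and the policy applied at the network edge) can be sketched as a small Python data model. This is an illustrative assumption only, not the DCM product's interface: `ConnectionProfile`, `ProfilePool`, and every field name are hypothetical, and the RADIUS and DHCP exchanges that the text says enforce the policy are not modeled.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConnectionProfile:
    """A preconfigured connection policy (all fields are illustrative)."""
    name: str
    vlan_id: int
    max_bandwidth_mb: int


@dataclass
class ProfilePool:
    """Pool of profiles that a network architect defines ahead of time."""
    profiles: dict = field(default_factory=dict)       # profile name -> profile
    subscriptions: dict = field(default_factory=dict)  # interface MAC -> profile name

    def define(self, profile):
        # Network architect step: publish a profile to the pool.
        self.profiles[profile.name] = profile

    def subscribe(self, mac, profile_name):
        # Server administrator step at build time: bind an interface to a profile.
        if profile_name not in self.profiles:
            raise KeyError(f"unknown profile: {profile_name}")
        self.subscriptions[mac.lower()] = profile_name

    def policy_for(self, mac):
        # Edge step: look up the policy when an interface appears on the network.
        # Returns None for an interface that was never subscribed.
        name = self.subscriptions.get(mac.lower())
        return self.profiles.get(name) if name is not None else None


if __name__ == "__main__":
    pool = ProfilePool()
    pool.define(ConnectionProfile("web-tier", vlan_id=120, max_bandwidth_mb=2000))
    pool.subscribe("00:1A:2B:3C:4D:5E", "web-tier")
    print(pool.policy_for("00:1a:2b:3c:4d:5e"))
```

Keeping the subscription keyed on the interface rather than on a switch port is what lets the policy follow a workload when it moves, which is the mobility property the text emphasizes.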
Beyond—The evolution to a fully-converged,
synchronized FlexFabric network
HP is committed to serving the diverse needs of
modern data centers without imposing a specific
operating model, proprietary architecture, or
network fabric. With advances in next generation
high-speed connectivity including 10GBASE-T
(10 Gbps over copper) and 40 Gb/100 Gb fiber,
FlexFabric can evolve to allow your organization to
build single, large Layer 2 domains with thousands
of direct, low-cost 10 Gbps Ethernet-connected
servers, in virtual or non-virtual, rack mount or
blade environments, all with equal ultra-low latency
paths. The fabric supports Converged Enhanced
Ethernet (CEE) either from the server edge or through
the aggregation layers, offers full support for Fibre
Channel over Ethernet (FCoE), and is capable of
providing active load balancing across converged
and traditional Ethernet-only connections.
To drive next generation security and forwarding
capabilities, FlexFabric uses emerging industry
standards to build and support virtual switches and
virtual I/O adapters. HP has co-authored the IEEE
Virtual Ethernet Port Aggregator (VEPA) proposal,
which aims to provide multi-vendor, standardized
discovery, configuration, and forwarding for
virtual switching. FlexFabric plans to be capable of
managing VEPA and other virtual I/O components
from day one. This standards-based approach
gives your IT organization a choice of virtualization
vendors and approaches.
Most importantly, FlexFabric allows the rest of the
data center infrastructure to exploit the benefits of
server, storage, and network virtualization going
forward. The nature of I/O buses and adapters is
expected to change dramatically in the next five
years; as the portion of server deployments whose
I/O is completely virtualized increases, the nature
of server I/O itself can evolve. No vendor is better
positioned for this new world—from a skill set and
intellectual property perspective—than HP, because
HP is the only company with deep intellectual
property in servers, blade servers, networking,
storage, and virtualized I/O.
Ultimately, our goal is to allow IT to deploy new
systems into a converged infrastructure that can
automatically discover capacity, add it to resource
pools, and put it to work to support the needs of
business applications. As IT takes advantage of
application convergence and uses cloud computing,
HP can be a comprehensive partner to help you
drive down maintenance costs, change economics,
and enable your data center network and IT staff
help your organization thrive and respond to
© Copyright 2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only
warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing
herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained
4AA0-7725ENN, Created June 2010, Rev. 1
Your next step
To learn more about the HP vision of Converged Infrastructure and how the HP FlexFabric plays a key role in it, visit
If you liked this report, check out our other reports:
Guide to Online Community Management
Edited by Marshall Kirkpatrick
Our first premium report for businesses comes in two parts:
a 75-page collection of case studies, advice and discussion concerning
the most important issues in online community; and a companion online
aggregator that delivers the most-discussed articles each day written by
experts on community management from around the Web.
The Real-Time Web and its Future
Edited by Marshall Kirkpatrick
Real-time Web technologies and applications have the potential to change
everything at a real-time pace. If you are a CTO, work in development or
marketing, or are planning your next website or mobile application
upgrade, you need to know about the real-time Web.
Augmented Reality for Marketers and Developers:
Analysis of the Leaders, the Challenges and the Future
Written by Chris Cameron
AR offers a new paradigm for high impact, high value customer
experience. Decrease your AR development time to market by learning
from the first wave of early adopters of this new technology. In this
ReadWriteWeb Premium Report we profile successful companies and their
campaigns as well as development lessons learned.