Grounding High Efficiency Cloud Computing Architecture: HW-SW Co-Design and Implementation of a Stand-alone Web Server on FPGA




Jibo Yu1, Yongxin Zhu1, Liang Xia1, Meikang Qiu2, Yuzhuo Fu1, Guoguang Rong1
1 School of Microelectronics, Shanghai Jiao Tong University, Shanghai, China
2 Dept. of Electrical and Computer Engineering, University of Kentucky, Lexington, KY, USA
Abstract—With the advent of cloud computing, web servers, as
the major channel of cloud computing, need to be redesigned to
meet performance and power constraints. Considerable efforts have
been invested in distributed web servers and web caching with
different optimization strategies, but few existing studies have
focused directly on improving the web server itself, let alone
complete hardware-favored web services. In this paper, we propose
a novel web server architecture and implement it on FPGA. After
tackling significant difficulties in design and implementation, we
completed an evaluation system which confirms that the
hardware-favored architecture brings higher throughput, lower
power consumption and stand-alone web service functionality,
owing to the direct pipelined execution of web service protocols
in hardware without an operating system.
Key words: cloud computing; architecture; web server; FPGA
The Web appeared in 1989, shortly after the invention of the
Internet in 1984. Web applications have been the motivator of
Internet applications ever since, e.g. the Mosaic browser in 1993,
e-commerce in 1995, the semantic web in 1999, utility computing
in 2000, and the Wiki encyclopedia (Web 2.0) in 2001. Even in the
era of cloud computing, since IBM and Google proposed the blue
cloud in 2007, the Web remains the major channel of cloud
computing. As the number of users explodes with varieties of
applications, web servers bear ever tougher workloads as well as
requirements on delays and bandwidth. This situation becomes
worse as users request multimedia data more frequently, which
requires much larger network bandwidth than text data [1].
With the growth in e-commerce as well as the increasing
volume of information available on the Web, the number of
users of the WWW is growing rapidly. Typically an online
shop employs catalogues and transaction handling databases.
Clients perform browsing as well as financial transactions
while shopping online. This means that the web server has to
handle a great deal of both static and dynamic web page
requests. Other applications that impose a heavy demand on a
web server include movie clips, extremely large audio and
video files, and dynamic pages generated through CGI scripts.
978-1-4244-9825-3/11/$26.00 ©2011 IEEE
Earlier efforts have put more emphasis on improving the web
performance by solving the problems caused by network traffic.
Several modifications to the Hyper Text Transfer Protocol have
been proposed. Equally important is the fact that as the
communication bandwidth available to client increases, the size
of web documents will tend to increase and each client would
generate more and more web page requests to the server, thus
pushing the performance bottleneck to the server system. This
in turn increases the client's perceived latency for a web page
request [2].
Increasing the network bandwidth involves governmental
policy on national infrastructure which requires a long-term
investment of resources. The gap between network traffic
demand and the network bandwidth capacity is widening [3].
Web servers are anticipated to be the bottleneck in hosting
network-based services [4]. With the advent of cloud
computing, this situation could become even worse. Therefore, it is
urgent to improve the quality of web service.
There are three ways for a web site to handle high traffic,
namely replication (mirroring), distributed caching, and
improving server performance [5]. Replication is simply
distributing the same web information to multiple machines
that are either a cluster [6], or distributed in different locations
using various kinds of load balancing strategies [7]. Since any
of the machines could serve requests independently, the load of
each individual web server is reduced. Distributed caching
includes client-side caching [8], proxy caching [9] or dedicated
cache servers [1]. These methods transparently cache remote
web pages on local storages or a cache machine that is close to
the clients, therefore reducing the traffic seen by the original
server. Finally, improving server performance consists of
enhancement of hardware efficiency, adoption of better web
server software techniques, as well as utilization of
high-bandwidth network connections.
Considerable efforts have been invested in studying
replication and distributed caching. Many interesting and
effective approaches have been proposed and implemented. On
the other hand, less attention has been paid to improving the
web server's own performance [4]. The author of [10] presents a
design of a new web server architecture called the asymmetric
multi-process event-driven architecture. The author of [11]
proposes the use of main memory compression techniques to increase
the available memory and mitigate the disk bandwidth
problem. The author of [12] realizes adaptive web server
performance optimization through historical experience and a
feedback mechanism. The studies mentioned above adopted
software-based approaches; the authors of [13,14,15] provide
hardware-based approaches in order to obtain smaller and faster
embedded web servers.
Though these embedded web servers can meet
requirements of embedded applications, their performance is
much lower than that of generic CPU based web servers for
cloud computing applications. Other than these, there are few
published studies concerning the web server itself. In this
study, we present a novel architecture of a web server system
and implement it on FPGA in order to improve the throughput
and power efficiency.
A. Motivation
Over the past several years, a number of architectures have
been proposed to overcome limitations in the original models,
as well as to improve performance and cope with the
increasing popularity of Web-based services [16].
Nevertheless, these architectures are software-based
solutions to web servers, i.e. they rely on CPUs to do
everything: the operating system (OS), network interface
drivers, the TCP/IP protocol stack, the web server,
and the web services. These practices imply poor power
efficiency, which is critical in the IT industry.
Some typical software-based web server architectures, with
various software improvements, are as follows. The
single-process event-driven (SPED) architecture uses a single
event-driven server process to perform concurrent processing
of multiple HTTP requests [10]. The Asymmetric Multi-Process
Event-Driven (AMPED) architecture combines the
event-driven approach of the SPED architecture with multiple
helper processes that handle blocking disk I/O operations [10].
The Staged Event-Driven Architecture (SEDA) [17] works
like a pipelined server that consists of multiple stages, each of
which is associated with a pool of threads. The Symmetric
Multi-Process Event-Driven (SYMPED) architecture extends the
SPED model by employing multiple processes, each of which
acts as a SPED server, to mitigate blocking file I/O operations
and to utilize multiple processors [18].
Cloud computing is quickly becoming one of the most
popular and trendy phrases in today's technology world [19].
To achieve a significant overhaul in efficiency under the huge
workloads of cloud computing, we propose to take all layers of
web service protocols out of the CPU and implement them in
the hardware partition of an FPGA-based web server. Due to
the pipelined implementation of web service protocols in
hardware, hardware-based web servers, compared with
software-based ones, are able to accelerate web processing,
shorten web processing time, and enhance throughput as well
as efficiency. Another overhead saved in hardware is the OS
layer underneath application software on the CPU.
B. Contributions
To the best of our knowledge, our work in this paper is the
first contribution to the web service domain in the form of a
novel architecture for a stand-alone hardware-based web server
whose performance and efficiency are better than mainstream
generic CPU-based solutions.
To be fair and accurate, we should clarify that there is a
soft-core processor in the FPGA in our stand-alone
hardware-based web server. We will present the HW/SW
co-design details of the proposed web server architecture to let
the audience understand how the soft-core processor initializes
the system and the modules in the hardware partition, which
carries out the actual tough work.
We will present an evaluation of the performance and
efficiency of our implementation of the web server completely
on FPGA. The results indicate that the hardware-based approach
shortens the time of network computing [20], which is one of
three key factors that influence the quality of web service, and
enhances efficiency.
The rest of the paper is organized as follows. The basic
architecture of the web server system is described in section II.
Section III presents the HW/SW co-design of the system. The
performance of the system is evaluated in section IV.
Conclusions and future research directions are given in section V.
The overall architecture of the web server system is shown
in Figure 1. A MicroBlaze soft-core processor running
software, and web processing module (WPM), which is the
hardware partition implementing a simple web server, are
connected by a register bank. Both the software and hardware
partitions can access DDR RAM through the multi-port
memory controller (MPMC). The architecture of WPM is
shown in Figure 2. WPM consists of five submodules, namely
the TCP packet decomposer, URL parser, file splitter, TCP/IP
processing, and timing service. Since HTTP GET is a typical
HTTP request from Web clients, it is selected as the only
request type we process in the prototype design.
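As an illustration of what the URL parser stage must do for a GET request, here is a hypothetical software analogue (the function and its behavior are ours, not the paper's RTL):

```python
def parse_get_request(raw: bytes):
    """Extract the requested URL from an HTTP GET request line.

    Returns the path, or None if the request is not a well-formed GET.
    Illustrative software analogue of the WPM's URL parser stage.
    """
    try:
        # HTTP request line is the first CRLF-terminated line:
        # "GET /path HTTP/1.1"
        request_line = raw.split(b"\r\n", 1)[0].decode("ascii")
        method, url, version = request_line.split(" ")
    except (UnicodeDecodeError, ValueError):
        return None
    if method != "GET" or not version.startswith("HTTP/"):
        return None
    return url
```

In hardware this check is a small state machine over the incoming byte stream rather than string splitting, but the accepted input is the same.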
WPM is a hardware module which implements the MAC,
TCP/IP and HTTP protocols and session management in a hardware
pipeline, instead of having the MAC, TCP/IP protocol stack and
HTTP protocol executed as software by a CPU sharing time
slices. WPM dramatically shortens the
processing time of the TCP/IP protocols. We use a zero-copy
technique for efficient data transfer, which reduces data
processing time and saves memory bandwidth. Besides, this
design has no operating system, which simplifies the
whole process. As a result, this design shortens the web
processing time and enhances throughput.
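The zero-copy idea can be sketched in software: segmenting a page for TCP without duplicating the payload bytes. A minimal Python sketch, assuming a fixed maximum segment size, using memoryview slices that reference the original buffer:

```python
def segment_file(page: bytes, mss: int = 1460):
    """Split a web page into TCP-payload-sized segments without copying.

    memoryview slices reference the original buffer, mimicking the
    zero-copy transfer: per-segment headers are generated fresh, but
    payload bytes are never duplicated in memory.
    """
    view = memoryview(page)
    return [view[off:off + mss] for off in range(0, len(page), mss)]
```

In our hardware the file splitter reads each segment directly from DRAM via NPI; this sketch only illustrates why no second buffer is needed.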
The whole system is implemented on an FPGA, whose power
consumption is so low that our hardware-based web server
consumes little power as well.
Figure 1. Overview of the architecture of our hardware-favoured web server
Figure 2. Architecture of the web processing module (WPM) in a pipelining way. The pipeline runs from the TCP packet decomposer through request FIFOs to the URL parser, the connection manager (GET request, connection request, disconnection request, disconnection accept, and data acknowledgement processing), the file splitter reading memory files (web pages), the TCP packet encapsulating and sending FIFOs, and the timing service.
In this paper, WPM is the hardware partition we designed
to improve the quality of web service, while the other
components, such as the MicroBlaze Debug Module (MDM),
interrupt controller (INTC), timer, IIC, UART and MPMC, are
IP cores provided by the Xilinx Embedded Development Kit.
The platform of our hardware/software co-design is the BEE3
prototyping board from BEEcube Inc. There are four
FPGAs on the board, each of which has two independent
DDR2 channels. In this design we manage to implement
everything using FPGA A only. The whole design can be
replicated to FPGAs B, C and D; in other words, 4 separate
web servers can be implemented on the BEE3 prototyping
board. The layout of the prototyping board is shown in Figure
3. The design tool we used is Xilinx ISE Design Suite 12.2.
Figure 3. BEE3 prototyping board
A. Interfaces between HW/SW partitions
We design a register bank as the interface between the
hardware and the software partitions. The hardware partition
can access DRAM via Native Port Interface (NPI) through
MPMC and the software partition can access DRAM via
Processor Local Bus (PLB) interface. The MDM, interrupt
controller, timer, IIC, UART and the register bank are all
connected to MicroBlaze via the PLB. Since the performance of
the register bank is sufficient for the hardware and software
partitions, the bottleneck of this design is actually memory
access, because all five submodules in WPM tend to read/write
DRAM. To mitigate the impact of the bottleneck, we propose
to use two independent DDR2 channels per FPGA on the board.
To evaluate the performance improvement, we start with single
memory channel system.
1) Single memory channel system
The single memory channel system uses one of the two
independent DDR2 channels to access DRAM. Sub modules in
WPM are connected to the NPI via an arbiter. Although the
control logic is easy to design and implement, there is fierce
contention for the single memory channel. Figure 4 shows the
diagram of the single memory channel system.
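The paper does not specify WPM's arbitration policy; assuming a round-robin arbiter, a common fair choice for a shared memory port, the grant logic can be sketched as:

```python
def round_robin_arbiter(requests, last_granted):
    """Grant one of N requesters access to the shared memory port.

    requests: list of bools, requests[i] is True if submodule i wants
    the NPI port this cycle. last_granted: index granted last time.
    Returns the next index to grant, or None if nobody is requesting.
    (Round-robin is an assumption; the paper does not name the policy.)
    """
    n = len(requests)
    for step in range(1, n + 1):
        i = (last_granted + step) % n  # scan starting after the last winner
        if requests[i]:
            return i
    return None
```

Rotating the starting point after each grant prevents any one submodule (e.g. the file splitter streaming large pages) from starving the others.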
Figure 4. Diagram of the single memory channel system (submodules share one channel through the multi-port memory controller)
2) Dual memory channel system
Each FPGA on BEE3 board is allowed to enable two
independent DDR2 channels, so we can use both of them to
access DRAM simultaneously. We modified WPM to use an
exclusive NPI port to read web page file data from DRAM in
order to reduce contention. The diagram of the dual memory
channel system is shown in Figure 5.
Figure 5. Diagram of the dual memory channel system
B. Register bank
We designed two types of register in the register bank. One
type is configuration (config) registers, which are written by
the software but read-only in the hardware domain; the other
type is statistic registers, which can only be written by the
hardware and are read by the software. The first type is used
by the software to initialize the hardware, while the second
type is designed to collect statistics from the hardware. The
usage of the register bank is shown in Table I. Configuration
registers for system initialization are used to initialize the
system resources. Major registers in the register bank are
listed in Table II.
Table I. Usage of the register bank

                 config registers      statistic registers
                 allocated   used      allocated   used
TCP packet          16         3          16        14
URL parser          16        11          16        16
file splitter       16         2          16         3
TCP/IP              16         9          16         3
timing              16         3          16         3
system              16        12         none
Table II. Major registers in the register bank

module            config registers       statistic registers
TCP packet        handshake overtime     link num
decomposer        eot overtime           arrival tcp packet num
                  idle overtime          arrival get packet num
URL parser        sys_time               url_msg_rx_num
file splitter     ackpacket_overtime     retrans_tcp_packet_num
TCP/IP            server mac addr        rcv mac num
                  server ip              rcv ip num
                  server port            rcv seg num
                  server ip mask         tcp num
                  router ip              ping respon lost num
                  router mac addr        arp respon lost num
timing            ackpacket_overtime     time out
system            tcp rcv stack          none
initialization    free url msg stack
                  send file msg stack
                  http head stack
                  free tcp head stack
                  url file index
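The access discipline of the register bank can be modeled in software. A hypothetical sketch (class and method names are ours, not the RTL's) in which config registers are software-written and hardware-read, while statistic registers flow the other way:

```python
class RegisterBank:
    """Model of the HW/SW interface register bank (illustrative).

    Config registers: written by software (initialization values such
    as timeouts and the server IP/MAC), read by hardware. Statistic
    registers: written by hardware (event counters), read by software.
    """
    def __init__(self, n_config=16, n_stat=16):
        self._config = [0] * n_config
        self._stat = [0] * n_stat

    # --- software (MicroBlaze) side ---
    def sw_write_config(self, idx, value):
        self._config[idx] = value

    def sw_read_stat(self, idx):
        return self._stat[idx]

    # --- hardware (WPM) side ---
    def hw_read_config(self, idx):
        return self._config[idx]

    def hw_write_stat(self, idx, value):
        self._stat[idx] = value
```

Keeping each register single-writer avoids any synchronization between the two clock domains beyond the register itself.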
C. Hardware partition
The components in the hardware partition of this system are
as follows: IIC to collect local time information from the
EEPROM, MDM to provide a debug interface for software,
the interrupt controller to receive interrupt signals from the
timer and UART and send an interrupt request to MicroBlaze,
the timer to provide an interrupt signal every second, UART to
communicate between the FPGA and the host computer, WPM to
improve the quality of web service, and MPMC to provide
interfaces for accessing DRAM. WPM is the kernel component
in the partition. It is written in Verilog and implemented to get
higher throughput for the web server. The architecture of WPM
is illustrated in Figure 2.
D. Software partition
In the system, the software partition is designed to initialize
the hardware and application-specific data. The whole system
starts to work as soon as the initialization is done. Besides, the
software reads statistics from the register bank and sends them
to the host computer via UART every second. Local time is also
read by the software from the EEPROM and transferred to the
hardware, which labels it in every TCP packet. The co-operation
of hardware and software is shown in Figure 6.
Figure 6. Co-operation of hardware and software (start signal from software, interrupt signal back to software)
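The software partition's once-per-second duties can be sketched as a loop; the four callables below are assumed stand-ins for the real register-bank, UART and EEPROM drivers, not names from the paper:

```python
import time

def software_main_loop(read_stat, send_uart, read_local_time,
                       write_config, cycles=1):
    """One-second service loop of the software partition (sketch).

    Each cycle: collect the statistic registers and send them to the
    host over UART, then forward local time to the hardware, which
    stamps it into outgoing TCP packets. All four parameters are
    hypothetical driver stand-ins.
    """
    for n in range(cycles):
        stats = [read_stat(i) for i in range(16)]
        send_uart(stats)                       # report to the host computer
        write_config(0, read_local_time())     # local time for TCP labeling
        if n + 1 < cycles:
            time.sleep(1.0)  # in the real system: wait for the timer interrupt
    return cycles
```

On the real MicroBlaze the cadence comes from the one-second timer interrupt rather than sleeping, but the per-tick work is the same.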
The system implemented in FPGA A runs at 125 MHz
and is evaluated with Web test equipment, i.e. an
Avalanche 2900 and Spirent TestCenter 3.51. Results are
compared with Apache 2.2 and Nginx 0.7.61 running on a
mainstream quad-core processor: the Intel Xeon 5520. The speed
of the physical Ethernet port on the FPGA board is 1 Gbps. All
the web pages under test are present in the DDR memory of
the testing system as well as the main memory of the reference
Xeon 5520 platform. Throughputs of the different systems are
shown in Figure 7.

Figure 7. Throughputs of different systems (x-axis: web page size in bytes, from 4K to 1M)
The throughput of the dual memory channel system is higher
than that of the single memory channel system because the web
page processing time in the dual channel system is less than
that in the single channel system. For larger web page sizes,
throughputs are constrained by the physical Ethernet port.
Therefore, throughputs are about the same for the single channel
and dual channel systems when the web page size is equal to or
greater than 100KB. With a faster Ethernet port on the board, we
could obtain better evaluation results for our systems.
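A back-of-the-envelope bound shows why the curves flatten: once the 1 Gbps port is the bottleneck, the page rate is capped by the line rate regardless of memory channels. A rough Python calculation, ignoring Ethernet/IP/TCP header overhead:

```python
def max_pages_per_second(page_bytes, link_bps=1_000_000_000):
    """Upper bound on pages served per second once the Ethernet port,
    not the server, is the bottleneck. Approximate: header overhead
    and inter-frame gaps are ignored."""
    return link_bps / (page_bytes * 8)

# A 100 KB page can cross a 1 Gbps link at most ~1220 times per second,
# so single and dual channel systems converge at large page sizes.
```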
The power consumption of single memory channel FPGA
system, dual memory channel FPGA system, apache on CPU
system and nginx on CPU system is nw, nw, 280W and
255W respectively. The power efficiency of each system is
shown in Figure 8.



Figure 8. Power efficiency of different systems (x-axis: web page size in bytes, from 10K to 1M)
In the HW/SW co-design of the paper, we show that the
hardware domain carries out the major task to achieve better
performance and power efficiency, although software domain
is still required to manage the initialization of hardware and
application specific data. Due to the hardware pipeline
implementation of web service protocols and direct execution
without OS in hardware, our hardware-favored architecture
brings higher throughput, lower power consumption as well as
stand-alone web service functionalities. The power efficiency
of our systems is about 4 times that of Web service software,
i.e. Apache and Nginx over Linux on a CPU. Careful calibration
of memory management can further improve the overall
performance and power efficiency. The significant
improvement in power efficiency indicates that reconfigurable
hardware approach to cloud computing is promising. With our
experimental results, we believe that more researchers and
developers will be convinced to convert more mature
components of cloud computing into hardware-favored
implementations to save precious energy.
This paper is partially sponsored by the National
High-Technology Research and Development Program of China
(863 Program) (No. 2009AA012201) and the Shanghai
International Science and Technology Collaboration Program
(09540701900) as well as NSFC 61071061 and the University
of Kentucky Start Up Fund.
[1] D. Lee, K. J. Kim. A Study on Improving Web Cache Server
Performance Using Delayed Caching. Information Science and
Applications (ICISA), 2010 International Conference: 1-5, 2010.
[2] S. Nadimpally and S. Majumdar. Techniques for Achieving High
Performance Web Servers. Parallel Processing: 233-241, 2000.
[3] Jeffrey K. MacKie-Mason and Hal R. Varian, "Some Economics of
the Internet," in 10th Michigan Public Utility Conference at Western
Michigan University, November 1992.
[4] V. Cardellini, E. Casalicchio, M. Colajanni, and P. S. Yu. The State of
the Art in Locally Distributed Web-Server Systems. ACM Computing
Surveys (CSUR), 34(2): 263-311, 2002.
[5] Y. Hu, A. Nanda, and Q. Yang. Measurement, Analysis and
Performance Improvement of the Apache Web Server. Performance,
Computing and Communications Conference: 261-267, 1999.
[6] T. Schroeder, S. Goddard, and B. Ramamurthy. Scalable Web Server
Clustering Technologies. Network, IEEE, 14(3): 38-45, 2000.
[7] M. Swain, Y. Kim. A Study of Data Source Selection in Distributed
Web Server Systems. SOUTHEASTCON '09, IEEE: 311-316, 2009.
[8] A. Bestavros, R. L. Carter, M. E. Crovella, C. R. Cunha, A. Heddaya,
and S. A. Mirdad, "Application-level document caching in the Internet,"
in Proceedings of the Second Intl. Workshop on Services in Distributed
and Networked Environments (SDNE '95), 1995.
[9] P. Cao and S. Irani, "Cost-aware WWW proxy caching algorithms," in
USENIX Symposium on Internet Technologies and Systems (USITS),
Dec. 1997.
[10] V. S. Pai, P. Druschel and W. Zwaenepoel. Flash: An efficient and
portable Web server. ATEC '99 Proceedings of the Annual Conference
on USENIX Annual Technical Conference.
[11] V. Beltran, J. Torres and E. Ayguade. Improving Web Server
Performance Through Main Memory Compression. Proc. of 14th IEEE
International Conference on Parallel and Distributed Systems.
[12] Z. Qu, W. Wang and Z. Li. Web Server Optimization Model Based on
Performance Analysis. Proc. of 6th International Conference on
Wireless Communications Networking and Mobile Computing.
[13] J. Riihijarvi, P. Mahonen, M. J. Saaranen, J. Roivainen and J. Soininen.
Providing Network Connectivity for Small Appliances: A Functionally
Minimized Embedded Web Server. IEEE Communications
Magazine, 39(10): 74-79, 2001.
[14] N. N. Joshi, P. K. Dakhole, P. P. Zode. Embedded Web Server on Nios
II Embedded FPGA Platform. Proc. of 2nd International Conference on
Emerging Trends in Engineering and Technology (ICETET): 372-.
[15] M. Choi, H. Ju, H. Cha, S. Kim and J. W. Hong. An Efficient Embedded
Web Server for Web-based Network Element Management. Proc. of
IEEE/IFIP Network Operations and Management Symposium (NOMS).
[16] F. Azzedin and K. Al-Issa. A Self-Adapting Web Server Architecture:
Towards Higher Performance and Better Utilization. High Performance
Computing & Simulation: 96-105, 2009.
[17] M. Welsh, D. Culler and E. Brewer, "SEDA: An Architecture for
Well-Conditioned, Scalable Internet Services," Proceedings of the 18th
Symposium on Operating Systems Principles (SOSP 2001), Oct. 2001.
[18] D. Pariag, T. Brecht, A. Harji, P. Buhr and A. Shukla,
"Comparing the Performance of Web Server Architectures," the 2007
EuroSys Conference, Mar. 2007.
[19] F. Hu, M. Qiu, J. Li, T. Grant, D. Tyloy, S. McCaleb, L. Butler, and R.
Hamner, "A Review on Cloud Computing: Design Challenges in
Architecture and Security," Journal of Computing and Information
Technology (CIT), Vol. 19, No. 1, Page 25-55, Mar. 2011.
[20] M. Wang and Z. Qi. Research and Practice of Web Server Optimization.
Second International Symposium on Electronic Commerce and Security.