An Application Framework for High Available Systems in Node.JS


Nov 12, 2013 (3 years and 9 months ago)


Master of Science Thesis
Stockholm, Sweden 2011
TRITA-ICT-EX-2011:215
Sergio Avalos Contreras
avalos(@)kth.se
Royal Institute of Technology
School of Information and Communication Technology
September 14, 2011
Supervisor: Claudijo Borovic
Examiner: Prof. Johan Montelius
Abstract
“Node.JS”, an event-oriented framework for writing JavaScript programs on the server side, is
emerging as a technology for creating efficient and scalable network applications with high
performance and low memory consumption. Yet its characteristic of handling several, even
thousands, of connections in one single process can become a vulnerability when creating
highly available applications. Thus, research has been conducted to determine whether this
framework is capable of meeting such requirements despite the odds.
During the course of this investigation, a study of failures in Internet services was
conducted, showing that the chosen technology is not the most common cause of service
disruptions. In addition, a prototype, based on Fault Model Enforcement and design patterns
for fault-tolerant software, was developed to monitor an Instant Messaging service (also written
in JavaScript) at system and application level and to provide redundancy by communicating
with other nodes within a cluster whenever it crashes.
The results obtained through a series of fault-injection tests demonstrate the functionality of
the newly created system, suggesting that Node.JS can meet the requirements needed to
develop a highly available program. Further testing with regard to stability and CPU usage,
together with the implementation of better monitoring tools, can improve the robustness of
the system.
To Juan and Kena
and to their small-town but invaluable saying
“Para atrás ni para tomar aire” (“Backward, not even to catch your breath”)
Acknowledgements
My special thanks to...
CONACYT, National Council of Science and Technology,
for their financial aid during the two years of my studies in this master's program
Claudijo Borovic and Niclas Holm, founders of Wussap,
for giving me the chance to be part of something authentic and original
Johan Montelius, Associate Professor at KTH,
for being patient and accessible
Vasileios Trigonakis, Researcher at EPFL,
for his observations, which made an incredibly valuable contribution to this document
My family and relatives,
for providing a clear perspective on my goals and showing me the way to achieve them
And finally to Tatiana,
because without her encouragement and support I wouldn’t have completed this project
Contents
1 Introduction
2 Project Description
2.1 Description of the Company
2.2 Initial Conditions
2.3 Requirements
2.3.1 Functional Requirements
2.3.2 Non-functional Requirements
2.4 Fault Tolerance Analysis
3 Theoretical Background
3.1 Fault-Tolerance
3.1.1 Basic Concepts
3.1.2 Fault Classifications
3.1.3 General Fault Tolerance Procedure
3.1.4 Redundancy
3.1.5 Dependability
3.2 Analysis of Fault Tolerance Systems in Internet Services
3.3 Fault Model Enforcement
3.4 Patterns for Fault Tolerant Systems
4 Methodology
4.1 Node.JS
4.2 Event-driven programming
4.3 Node and the ecosystem
4.4 Database Research
4.5 JavaScript
4.5.1 Must-to-know features
4.5.2 The bad parts
4.6 Assumptions and Limitations
5 Implementation
5.1 Wussap Prototype
5.1.1 System Design
5.1.2 Publish-Subscriber Model
5.2 Fault-Tolerant Framework
5.2.1 System Design
5.2.2 Description of Components
6 Testings
6.1 Prototype Functionality Tests
6.2 Framework Functionality Test
6.3 Stress Testing
7 Conclusion
8 Discussion and Limitations
References
1 Introduction
Internet services are expected to operate all the time (twenty-four hours a day, seven days a
week) with no downtime, to behave consistently regardless of the number of users, and to
respond quickly enough that the interaction feels smooth. This cannot always be achieved, but
these are the main targets when building such services, so that they seem ubiquitous wherever
they are accessed from.
Building this kind of system presents a set of challenges, and success depends heavily on the
technology chosen. One technology that has lately attracted a lot of attention is Node.JS,
for its speed, scalability and efficiency. Indeed, a recent study reported that it was capable
of handling 10 thousand simultaneous connections [24], a limit where many other technologies
have failed.
Nevertheless, the stability of Node.JS is still questionable because of its immaturity and
because it uses a dynamic programming language, JavaScript, running in one single process. Even
Joe Armstrong, creator of Erlang, stated that Node.JS was not designed for building
fault-tolerant systems [13].
For this reason, this document presents research into the methodologies needed to create a
fault-tolerant system focused on High Availability, an important feature that any Internet
service requires and that Node.JS alone cannot provide. The document starts by describing the
project carried out at Wussap, a start-up IT company from Stockholm, followed by the
preliminary investigations. These comprise a summary of the most common failures found in big
Internet companies and a study of design patterns for fault-tolerant software and of Fault
Model Enforcement. Then a prototype, an instant messaging browser plugin, is proposed along
with a framework that provides the capabilities needed to operate indefinitely. To verify our
findings, a series of tests was performed to check the system behavior against the expected
results.
2 Project Description
2.1 Description of the Company
Wussap is a start-up company (part of the STING company incubator located at KTH) that was
recognized as one of the most innovative IT companies of 2010 [17]. As the company promotes
it, Wussap takes social co-browsing to the next level by allowing users to visit a website
together and chat with any visitor of any particular site. All you need is to install a plugin
in your browser and, as soon as it is on, you can continue surfing a website as you did before,
with the difference that you will see other visitors and can even chat with them.
In addition, you can create a shared surfing session (called a “Surftrain”) in which the
browsers of the invited users are redirected to whatever website the creator goes to. All
events along the way, such as opening a picture or scrolling up and down, are shown as well,
so that all participants see the same screen.
2.2 Initial Conditions
The software that Wussap offers is an Instant Messaging application very similar to a group
chat except for a few differences. First, the activity takes place on a website, and second,
there are two roles: the leader, who decides where everyone's browser goes, and viewers, who
follow along and chat with other participants. These sessions will be referred to from now on
as Surftrains.
Other functionalities (among several) that this program has are:
• Finding people and chatting with any other visitor of the website the user is on,
• Creating a Surftrain that will start later (i.e. scheduling),
• Subscribing to and being notified about other scheduled Surftrains,
• Reviewing the most popular websites visited with the tool,
• Reviewing the most popular past Surftrains, etc.
Currently the system is developed as a web browser plugin (so far only for Internet Explorer
and Firefox) and the user is asked to install it in order to use it. Part of the future plan is
to remove this dependency on installing it in a particular browser and to attach the service
directly to the website.
Part of this project is to develop a much simpler version using Node, WebSockets and other
related web technologies. The main objective was to develop some of the functionalities of the
current version, followed by a series of tests that analyze the advantages and disadvantages
of using this framework.
2.3 Requirements
2.3.1 Functional Requirements
• Wussap allows users to interact by exchanging text messages when two or more people
are visiting the same website at the same time.
• Wussap allows users to interact by exchanging text messages when two or more people
are participating in a Surftrain.
• Wussap allows only registered users to interact.
• Wussap persists user information by saving a username and password.
• Wussap notifies other visitors when a new user is on a website.
• Wussap notifies other visitors when a user is no longer visiting a website.
• Wussap allows a user to create a Surftrain.
• Wussap notifies when a new user has joined a Surftrain.
• Wussap notifies when a user is no longer participating in a Surftrain.
• Wussap allows only the creator to change the website of the Surftrain.
• Wussap allows only the creator to conclude the Surftrain.
• Wussap notifies the current location of a Surftrain exclusively to its followers.
• Wussap notifies its participants exclusively when a Surftrain changes website.
• Wussap notifies its participants exclusively when a Surftrain has concluded.
• Wussap keeps track of the created Surftrains, including starting and ending times.
• Wussap provides a list of current Surftrains.
2.3.2 Non-functional Requirements
• Wussap is written in Node.js.
• Wussap uses Socket.IO for communication.
• Wussap interface is developed through a set of tests in QUnit.
• Wussap client program is loaded using RequireJS.
• Wussap uses a NoSQL database system.
• Wussap can be used in any web browser that supports WebSockets.
2.4 Fault Tolerance Analysis
In order to build a fault-tolerant application, it is essential to know the possible threats
that the software could suffer from. The key question that any developer should continuously
ask throughout development is “What could go wrong?”, followed by how the system can be
protected against it. The following list describes these threats, grouping the possible faults
under the failure they lead to:

Failure: Incorrect information
Possible faults:
• Invalid User information (see below)
• Invalid Surftrain information (see below)
• Invalid Web Site information (see below)
• Invalid Message information in Surftrain/Website (see below)
• Wussap contains duplicated Surftrains (having the same ID)
• Wussap contains duplicated Websites (having the same URL)
• Surftrain/Website contains duplicated Messages

Failure: Untimely information
Possible faults:
• Wussap displays different current/ongoing Surftrains
• Wussap fails to save/fetch information
• Wussap receives more requests than it can handle
• Response time is too long
• Wussap ignores requests

Failure: Server unavailability
Possible faults:
• Wussap fails to respond (i.e. the process has crashed)
• Wussap’s server has crashed
• Wussap fails to connect to the database
• Wussap runs out of memory
• Wussap closes connection sockets to current clients
• Wussap is being upgraded

Invalid User Information:
• User does not have a unique username
• User does not have a password
Invalid Surftrain Information:
• Surftrain does not contain a title
• Surftrain does not contain a URL, or the URL is not formatted according to RFC 1738
• Surftrain contains a URL/station with no registered time
• Surftrain contains a URL/station with a registered time later than the arrival date
• Surftrain does not contain a registered user as a leader
• Surftrain has a registered arrival time later than the current time
• Surftrain contains followers that are not registered in the system
Invalid Web Site Information:
• Website contains a URL that is not formatted according to RFC 1738
• Website contains users that are not registered in the system
Invalid Message Information:
• Message does not contain content, sender or registered time
• Message contains a non-registered user as sender
• Message contains a registered time later than the current time
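The message checks listed above can be sketched as a small validation function. This is only an illustration: the message shape ({ content, sender, time }), the function name validateMessage and the use of a Set of registered usernames are assumptions made for this example, not part of the Wussap code.

```javascript
// Hypothetical message shape: { content, sender, time }, with time in ms since epoch.
// Returns the list of faults found; an empty list means the message is valid.
function validateMessage(msg, registeredUsers, now = Date.now()) {
  const faults = [];
  if (!msg.content) {
    faults.push("missing content");
  }
  if (!msg.sender) {
    faults.push("missing sender");
  } else if (!registeredUsers.has(msg.sender)) {
    faults.push("sender not registered");
  }
  if (typeof msg.time !== "number") {
    faults.push("missing registered time");
  } else if (msg.time > now) {
    faults.push("registered time later than current time");
  }
  return faults;
}

// Usage with a fixed "now" so the outcome does not depend on the clock.
const users = new Set(["ana"]);
const validFaults = validateMessage({ content: "hi", sender: "ana", time: 1000 }, users, 2000);
const invalidFaults = validateMessage({ content: "", sender: "bob", time: 3000 }, users, 2000);
```

Returning a list of faults instead of throwing on the first one keeps every violated rule visible, which is convenient when the checks are later used in tests.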
3 Theoretical Background
3.1 Fault-Tolerance
3.1.1 Basic Concepts
The following terms are listed in [15]:
Failure – a system behavior that deviates from the specified behavior. For example, a
server that crashes when it should never crash, or that produces a miscalculation.
Error – the incorrect system behavior from which a failure may occur, either by value or
timing. Errors are important because, if detected in time, they can be prevented from
turning into failures.
Fault – the defect that is present in the system and can cause an error (colloquially called
a “bug” in software). It is normally present due to an incorrect requirement specification,
an incorrect design or a coding error.
Figure 1: Relation between fault, error and failure
Although related, a fault does not always turn into an error. For instance, a line of code
could be erroneously written but never executed, or a faulty piece of hardware might never be
touched by other components in a complex system.
In the same way, an error does not necessarily turn into a failure every time. The typical
example is when a server crashes and is replaced by a backup. Likewise, in software an
exception can be caught by an exception handler and hidden. Although the error occurred, it
was imperceptible to the end user.
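As a minimal sketch of the last case, the following hypothetical functions show an error that is caught by an exception handler and therefore never becomes a failure visible to the caller. The names parseMessage and safeParseMessage are invented for this illustration.

```javascript
// Fault: nothing guarantees that the input is well-formed JSON.
// When it is not, JSON.parse throws, which is the error.
function parseMessage(raw) {
  return JSON.parse(raw);
}

// The handler catches the error and returns a well-defined result,
// so the caller never observes a crash (no failure).
function safeParseMessage(raw) {
  try {
    return { ok: true, message: parseMessage(raw) };
  } catch (err) {
    return { ok: false, message: null };
  }
}

const good = safeParseMessage('{"text":"hi"}');
const bad = safeParseMessage("{not json");
```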
3.1.2 Fault Classifications
By duration
Permanent – a fault that remains unless it is removed by some external agency. From an
engineering point of view, these are easy to diagnose.
Transient – a fault that eventually disappears without any apparent intervention or
cause. These are also considered unpredictable.
By cause
Design faults – due to an incorrect requirement specification or a bad design while coding.
In practice, even with a carefully designed system, it is assumed that errors might
appear, so some mechanisms are put in place to protect the system.
Operational faults – faults that occur during the lifetime of the system.
By failed component behavior
Crash faults – the component stops operating completely.
Omission faults – the component fails to perform its service.
Timing faults – the component does not complete its service on time.
Byzantine faults – the component fails in an arbitrary manner.
3.1.3 General Fault Tolerance Procedure
There are different ways in which a system can deal with a fault:
Fault prevention – The use of good engineering methods and industry best practices helps
to prevent the presence of potential faults in the system.
Fault removal – The system is verified to provide the right result according to the
requirements. If it does not, the fault is diagnosed and corrected. This discovery is done
statically during analysis or dynamically while the system is executing.
Fault tolerance – As mentioned above, in some cases the presence of a fault does not lead
to a bad execution, as long as it is found within certain time limits.
In addition, fault treatment procedures can be grouped into four different activities [15]:
Error detection – Detect that an error is present in the system (the manifestation of a fault).
Damage confinement – Isolate the failed component to stop it from propagating the error.
Error recovery – Restore the system to a valid state.
Fault treatment – Analyze and verify the fault that caused the error.
The procedure is carried out in this order because the first priority is to restore the
system to the state it was in prior to the failure. Although this may seem backwards, in
practice diagnosing the failure is a lengthy and complex process (especially considering that
the error can have multiple root causes) and is therefore left to the end, once the safety of
the system has been guaranteed [16].
3.1.4 Redundancy
Most errors in a system are treated by redundancy, where the failed component is replaced by
a non-failed copy to mask the failure from the end user. The speed at which the copy takes
over falls into the following three categories [15]:
Cold standby – A non-operational component remains inactive until it is needed.
Although it is a cheap method, it introduces a delay to start up the system, called the
recovery time. In the case of a large database that has to be created from scratch, this
can take a very long time and therefore be very expensive.
Warm standby – Checkpoints are created at certain time intervals, at which the active data
is saved. Then, if the main active component crashes, the copy performs a backward
recovery and starts from the last checkpoint. Although it is more effective than the
previous category because it shortens the recovery process, it adds some overhead to
the system when taking these checkpoints.
Hot standby – The replica is fully active, duplicating all information obtained by the
primary one. That makes the recovery time minimal, close to instantaneous. There are no
checkpoints because the backup process is continuously working, and the overhead added
to the system is evidently higher than in either of the previous methods.
Choosing the method depends a lot on the level of dependability that the system requires and,
most importantly, on how much the client is willing to pay. For instance, a banking system
with thousands of transactions per day cannot afford any failures, because this “down time”
can translate into large amounts of money lost. For our purposes, where an Instant Messaging
application is used, a warm standby copy is more than enough because the data guarantee is
not high (i.e. it is not so critical if some messages are not received), but it is important
for the system to be restarted rapidly.
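The warm standby idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not the thesis framework: the class name WarmStandby is invented, the checkpoint is kept in memory rather than on another node, and the state is assumed to be JSON-serializable.

```javascript
// Minimal warm-standby sketch: the standby only knows the last checkpoint.
class WarmStandby {
  constructor() {
    this.checkpoint = null;
  }
  // Serializing makes a deep copy, so later changes to the primary's
  // state do not leak into the saved checkpoint.
  save(state) {
    this.checkpoint = JSON.stringify(state);
  }
  // Backward recovery: resume from the last checkpoint, if any.
  recover() {
    return this.checkpoint === null ? null : JSON.parse(this.checkpoint);
  }
}

const standby = new WarmStandby();
const primaryState = { messages: ["hello"] };
standby.save(primaryState);

// Work done after the checkpoint is lost if the primary crashes now.
primaryState.messages.push("sent after the checkpoint");
const recovered = standby.recover();
```

The recovered state contains only the first message, which matches the trade-off described above: losing a few chat messages is acceptable as long as the service comes back quickly.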
3.1.5 Dependability
Dependability in a system is defined as the characteristic of performing the service for
which it has been designed. It can be decomposed into four aspects [15]:
Reliability – The probability that a system works correctly.
Availability – The probability that a system is up and running at any point in time.
Safety – The ability to avoid catastrophic failures that involve human life or excessive costs.
Security – The ability of a system to prevent unauthorized access.
Reliability is commonly confused with availability; although they are related, they are two
different concepts.
Reliability is a measure of the continuous delivery of service in the absence of failure,
expressed as the “Mean Time Between Failures” (MTBF). For example, for a space shuttle it is
extremely important that it is completely failure free from the time it ignites until it
reaches its destination. This is not just very expensive but also very challenging, and such
highly reliable systems are normally found in life-critical fields such as avionics, military
and aerospace, where a failure can threaten human life.
Figure 2: Example of a highly reliable system
On the other hand, availability is defined as the probability that a service is running at
any given instant. It tolerates system failures on the presumption that the recovery time
will be minimal. An example is a website: the end user does not care whether there have been
failures in the site being visited. All that matters is that the site is available whenever
the end user wants to browse it.
Figure 3: Example of a highly available system
Evidently, availability is a characteristic that depends strongly on the time it takes for a
system to be restored after a failure (“Mean Time To Restore”, or MTTR):
$$\%\,\text{Availability} = \frac{\text{MTBF}}{\text{MTTR} + \text{MTBF}}$$
For our purposes, high availability is the characteristic being studied, and this formula
will be addressed in the following sections.
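As a worked example of the formula, the sketch below computes availability as a fraction for two made-up scenarios. The function name and the figures (99 hours between failures, 1 hour versus 0.1 hours to restore) are illustrative assumptions, not measurements from this project.

```javascript
// Availability as a fraction in [0, 1]; both arguments use the same time unit.
function availability(mtbf, mttr) {
  return mtbf / (mtbf + mttr);
}

// Same failure rate, different recovery times (hours).
const slowRecovery = availability(99, 1);   // 99 / 100
const fastRecovery = availability(99, 0.1); // 99 / 99.1
```

With the time between failures held constant, shortening the recovery time alone raises availability, which is the lever a highly available (rather than highly reliable) design relies on.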
3.2 Analysis of Fault Tolerance Systems in Internet Services
Nowadays, Internet services are expected to run 24/7 and, as a matter of fact, it is taken
for granted that these services will be available every time we access them; we do not
consider the local time or day to decide whether a site is up and running, we just type the
URL and expect to see the website. Since this field has been studied for a while, it is
important to review the contributions made by others, especially by the big companies.
In [23] there is a study of the common faults reported for two big Internet companies
(CNET.com and eWeek.com, to be precise). The purpose of that investigation was to find in
their reports any information about the causes of failures on their websites. These are the
categories of root causes listed:
Software failures: Mainly due to system complexity, inadequate testing and/or poor
understanding of system dependencies.
Operator error: Classified into configuration errors, procedural errors and miscellaneous
accidents.
Hardware and environmental failures: Due to several reasons such as wear and tear of
mechanical parts, loose wiring, etc.
Security violations: Common security violations such as password disclosures, denial of
service attacks, worms and viruses, authentication failures, etc.
These failures are presented to the end users as:
• Partial or entire site unavailability, e.g. a 404 file not found error;
• System exceptions and access violations, when an executing process terminates abruptly
because a system exception is thrown;
• Incorrect results, when an executing process does not terminate but returns erroneous
results;
• Data loss or corruption, when users are unable to access data from a previously
functioning computer system;
• Performance slowdowns.
In Figure 4 it can be observed that most of the failures are due to human errors and
application software failures, while hardware errors account for a smaller portion.
Figure 4: Causes of failures (Source: [23])
In addition to this article, [22] also supports these results by showing that, in the
majority of cases, the failures are caused by humans. In this case, the authors studied
three types of Internet services: an online service/Internet portal (Online), a
bleeding-edge global content hosting service (Content), and a mature read-mostly Internet
service (ReadMostly). For all of them, the architecture is composed of a load balancer, a
stateless front-end and a back-end to persist the data (see Figure 5).
Figure 5: Architecture of an Internet Service
The failures were studied individually, with special attention drawn to their causes and
locations. The causes were categorized as node hardware, network hardware, node software and
network software, while the locations were front-end node, back-end node, network or unknown.
Figures 6 and 7 show the registered failures for the Content and Online sites. As can be
observed, not every failure that occurs in the system is visible to the end users; some can
be masked. Therefore, a distinction is made between component failures and service failures,
the latter being those that are noticeable to the public. In these figures it can be observed
that in both scenarios, Online and Content, the errors produced by operators are the hardest
to mask. Meanwhile, Figure 7 shows that failures in node hardware are numerous, but only a
very small portion turn into service failures. This shows that mitigation procedures such as
redundancy (allowing a backup component to run while the main one is down) work very well
for hardware failures.
Figure 6: Number of component failures and resulting service failures for Content (Source:
[22])
Finally, Table 1 lists the time to repair for failures, broken down by component (node or
network) and by type of cause (operator error, hardware, software, unknown or environment).
Once again, operator errors account for the major portion of the time in three of the
presented scenarios.
From these studies it can be concluded that operator error is the leading cause of failures.
This also suggests that neither the platform nor the technology is entirely the source of the
unavailability of a system. Thus, to enhance availability, as proposed in [22], the designer
should focus on creating better tools for performing online testing, monitoring component
failures and sanity checking of configuration files. As cited from [22], “Today, this
coordination is handled almost entirely manually, via telephone calls to contacts at the
various points”. Evidently, the efforts for creating a better system should be focused on
automating these tasks rather than on the system itself.
Figure 7: Number of component failures and resulting service failures for Online (Source: [22])
            Operator  Operator  H/W      H/W      S/W      S/W      Unknown  Unknown
            Node      Net       Node     Net      Node     Net      Node     Net
Online      8.3 (16)  29 (1)    2.5 (5)  0.5 (1)  4.0 (9)  0.8 (1)  2.0 (1)  N/A (0)
Content     1.2 (8)   N/A (0)   N/A (0)  N/A (0)  0.2 (4)  N/A (0)  N/A (0)  1.2 (2)
ReadMostly  0.2 (1)   0.1 (3)   N/A (0)  6.0 (2)  N/A (0)  1.0 (4)  N/A (0)  0.1 (6)
Table 1: Average Time To Repair (TTR) for failures by component and type of cause, in hours.
(Source: [22])
3.3 Fault Model Enforcement
The article [21] presents a contemporary model for creating fault-tolerant systems. The
authors argue that creating a highly reliable system, where every single fault is prevented,
is very hard for an Internet service and in some cases even impossible due to the complexity
of the system. It is easy to see why this statement is true. A web application is composed
of many components, which are in turn composed of many other subcomponents. To give just one
example, when considering the transmission of a packet using TCP, there are many reasons why
it can fail: poor wiring, a problem in the network interface, a delay in the transmission,
etc. Trying to figure out the reason for each problem and to prevent it from happening is
not just very time consuming but also exhausting if every detail has to be covered.
For this reason, the authors propose a new methodology for creating a fault-tolerant system
called Fault Model Enforcement. They describe two strategies: first, the failure of any
subcomponent is treated as the failure of the whole component (i.e. the symptom), and second,
after a given symptom is observed, the expected fault behavior is forced to happen (hence
the word enforcement). In other words, every fault is mapped to a failure that the system
knows how to recover from, so that the system can be designed in terms of recovery actions.
This strategy is applicable to almost every component in the model and makes the planning of
the architecture much easier.
This model can be applied to our purposes because, when creating a highly available system,
as mentioned in the previous sections, there are two variables that can be adjusted: either
increase the mean time between failures (MTBF) or reduce the mean time to repair (MTTR).
Evidently, the choice here is to reduce the MTTR.
$$\%\,\text{Availability} = \frac{\text{MTBF}}{\text{MTTR} + \text{MTBF}}$$
To illustrate with an example, the following list shows faults along with their mitigation
procedures. It is taken from the tests performed in [21], whose objective was to improve the
availability of a web application called PRESS:
• Link down: Reboots the node that was cut off from the main cluster.
• Switch down: Reboots all nodes.
• SCSI timeout: Reboots the node with the faulty disk.
• Node crash: Nothing. This fault was included in the abstract model.
• Node freeze: Reboots the faulty node.
...etc.
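The mapping from faults to enforced recovery actions listed above can be sketched as a simple lookup table. The entries are transcribed from the list; the function name recoveryActionFor and the fallback action for unlisted symptoms are assumptions made only for this illustration and are not taken from [21].

```javascript
// Fault -> enforced recovery action, transcribed from the list above.
const recoveryActions = new Map([
  ["link down", "reboot the node cut off from the main cluster"],
  ["switch down", "reboot all nodes"],
  ["scsi timeout", "reboot the node with the faulty disk"],
  ["node crash", "nothing (already part of the abstract fault model)"],
  ["node freeze", "reboot the faulty node"],
]);

// Assumption for this sketch only: unlisted symptoms fall back to a reboot
// of the affected node, so every symptom maps to some known action.
const DEFAULT_ACTION = "reboot the affected node";

function recoveryActionFor(symptom) {
  const key = symptom.trim().toLowerCase();
  return recoveryActions.has(key) ? recoveryActions.get(key) : DEFAULT_ACTION;
}

const switchAction = recoveryActionFor("Switch down");
const unknownAction = recoveryActionFor("fan failure");
```

Keeping the table as data rather than as branching code reflects the point of the model: the design effort goes into a small set of recovery actions, not into diagnosing each fault.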
Although this model seems very simplistic, the reported results look quite positive: the
performance of the system improved by over 50% compared with the normal run (without Fault
Model Enforcement) in all of the tests carried out in [21]. The system also became more
robust and stable against transient errors.
Additionally, bringing simplicity to the design of the software is a big advantage not just
for the implementation but also for testing, especially considering that, as shown before,
complexity is one of the main causes of visible failures. Finally, the requirements for this
system are not so strict in terms of reliability, since the presence of failures is accepted
as long as the availability as a whole remains high (by reducing recovery time).
3.4 Patterns for Fault Tolerant Systems
The design of this project is based on design patterns for fault-tolerant software, found in
[16]. This decision was made for convenience. Firstly, patterns solve the problem in small
pieces rather than trying to solve everything at once, where a single solution could hardly
fit. Secondly, it is always advisable to follow industry practices, simply because they have
been tested numerous times.
As mentioned by Robert Hanmer in [16], “Software patterns are an effective way to capture
proven design information and to communicate this information to the reader”. What he means
is that the problems we normally face are hardly unique, nor are they appearing for the
first time. Under different circumstances and scenarios, the essence of a problem remains
intact, and what is left for the designer is to apply the solution in the particular context
given.
In other words, what is presented here is nothing but the techniques that engineers normally
use when building a system with the characteristics discussed. Each of them is applied under
different circumstances and needs, depending on what the objectives are. The choice is made
according to the error-handling stage (detection, recovery, mitigation and fault treatment)
and the particular requirements of the project. Using this method is convenient in terms of
the time saved by “not reinventing the wheel” and also gives the assurance that the best
practices developed by others are being applied here as well.
For example, a type of question that an engineer normally deals with is: Does the system need
to be running as much as possible? Or does the system need to meet a certain failure rate,
where only 1 out of 100 000 transactions can fail? The answer determines whether the system
is built to recover very fast or to be very robust and prevent any failure from happening.
Another one is: Is the server stateless or stateful? That is, is the information (the value
of the variables used during the execution of the program) kept in the server while it is
running, or is it discarded right after a request has been dispatched, as normally occurs in
a web server? The answer determines whether checkpoints are needed or not.
Another example is found in Figure 8, which shows a diagram of different ways an error can
be isolated and prevented from spreading: it may be stopped even before it enters the
system (Complete Parameter Checking), it may be detected at system level (System Monitor),
its existence may be decided by checking with other servers (Voting), it may be temporary
and simply ignored (Riding Over Transients), it may be checked during execution as a
background task (Routine Audits), etc.
Figure 8: Design patterns for error detection (Source: [16])
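As a rough illustration of one of these patterns, Riding Over Transients, the sketch below reports an error only after it has been observed several times in a row, so that a momentary glitch is ignored. The function name createTransientFilter and the threshold of three consecutive observations are assumptions chosen for this example; [16] should be consulted for the actual pattern description.

```javascript
// Returns an observer that is fed health-check results and answers
// whether the problem has persisted long enough to be treated as real.
function createTransientFilter(threshold) {
  let consecutiveFailures = 0;
  return function observe(healthy) {
    // Any healthy observation resets the count: the problem was transient.
    consecutiveFailures = healthy ? 0 : consecutiveFailures + 1;
    return consecutiveFailures >= threshold;
  };
}

const observe = createTransientFilter(3);
const decisions = [
  observe(false), // 1st failure: ignored
  observe(false), // 2nd failure: ignored
  observe(true),  // recovered: counter reset
  observe(false), // 1st failure again
  observe(false), // 2nd
  observe(false), // 3rd in a row: treated as a real error
];
```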
This is just a very brief summary and more information can be found in [16].It is in the follow-
ing sections,Implementation,where the reader will find the chosen patterns that fulfilled our
demands.As any other research project,part of the task was review in detail each of them,
chose the ones that are more appropriate to a specific problemaccording to the fault tolerance
phase and combine them all to make them work to their best.
4 Methodology
Working with Node and JavaScript deserves special attention. For most programmers, it is a
technology that brings new paradigms to the way people code and that is sometimes not well
understood, as noted by Douglas Crockford, a software engineer well known for his
contributions to this programming language. That is why this section takes a closer look at
the tools used for the development of this project, the challenges they brought and how these
were overcome to get the best out of them.
4.1 Node.JS
Node.JS is an event-based, non-blocking I/O framework for creating scalable network programs
that has caught the attention of many developers and companies for its high performance and
efficiency at handling thousands of concurrent connections [1]. It is influenced by systems
such as Ruby's EventMachine and Python's Twisted, runs on Google's V8 JavaScript engine and is
executed on the server side.
In contrast to technologies like Apache that scale by spawning threads, Node handles every
request as an event within one single process. Figure 9 shows how nginx, another event-based
technology, outperforms Apache, a thread-based server, as the number of connections grows.
While the former stabilizes after 1500 connections, the latter drops considerably.
Figure 9: Benchmark test between Apache and Nginx. (Source: [19])
Moreover, the difference in memory consumption is even more striking when the two approaches
are compared. Figure 10 clearly shows that the number of concurrent connections does not
affect Nginx at all; it always uses the same amount of memory (2.5 MB).
Figure 10: Memory consumption test between Apache and Nginx. (Source: [19])
Here is an example of a lightweight HTTP server written in Node that serves files from disk.
var http = require('http'),
    url = require('url'),
    path = require('path'),
    fs = require('fs');

http.createServer(function (request, response) {
  // Map the request path onto a file below the current working directory.
  var uri = url.parse(request.url).pathname;
  var filename = path.join(process.cwd(), uri);
  // readFile is asynchronous: other requests are served while the disk is read.
  fs.readFile(filename, function (err, data) {
    if (err) {
      // Missing or unreadable file.
      response.writeHead(404);
      response.end();
      return;
    }
    response.writeHead(200);
    response.end(data);
  });
}).listen(8080);
Two important aspects for which Node has attracted a lot of attention can be observed in this
example. Firstly, the application does not block on any I/O operation, such as listening on a
TCP socket or reading a file; and secondly, thanks to the syntax of JavaScript, even a
high-performance application remains simple and easy to understand.
4.2 Event-driven programming
Concurrent programming is a topic that has been studied for a long time, especially nowadays
when nearly every computer has more than one processor. Commonly, multi-threading is the
paradigm used to achieve this type of task. Nevertheless, as mentioned in [1], for many
developers multi-threading is “anything but easy”; issues such as liveness and deadlock still
have to be dealt with.
Instead, event-driven programming offers a more efficient alternative that allows much more
control over switching between application activities. The possible drawback is that the
behavior of asynchronous calls depends strictly on the context, i.e. the values of the
variables, in which they are executed. For a novice developer this concept takes time to learn
and, if it is not treated carefully, the code can easily turn into unmaintainable “spaghetti
code” that is hard to understand. See the piece of code below:
async1(function (result1) {
  async2(function (result2) {
    async3(function (result3) {
      // do something with results
    });
  });
});
Nevertheless, frameworks that handle asynchronous flow control, such as “Step”, can help to
improve the readability of a program. For example, the code below shows how the asynchronous
calls can be arranged in a more understandable way.
Step(
  function loadUser() {
    db.getUser(user_id, this);
  },
  function findItems(err, user) {
    if (err) throw err;
    var sql = "SELECT * FROM store WHERE type=?";
    db.query(sql, user.favoriteType, this);
  },
  function done(err, items) {
    if (err) throw err;
    // Do something with items
  }
);
“Step” is a function that receives as parameters the functions that define the control flow of
the program and makes sure that they are executed one after another. Other features, such as
executing tasks in parallel and grouping, follow the same syntax and help to make asynchronous
code easy to read and understand. Like Node, this is an open project; it can be found at
https://github.com/creationix/step.
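To make the mechanism more concrete, the following is a minimal sketch of how a sequencer in the spirit of Step could work. It is not the source code of the library: the names runInSequence and fakeQuery are hypothetical and were introduced only for this illustration, and parallel execution and grouping are left out.

```javascript
// Hypothetical helper: runs the given functions one after another.
// Each step receives the (err, result) arguments of the previous one
// and continues the chain by passing `this` as its callback.
function runInSequence() {
  var steps = Array.prototype.slice.call(arguments);
  function next() {
    var step = steps.shift();
    if (!step) { return; }
    try {
      step.apply(next, arguments);
    } catch (err) {
      // A thrown error is forwarded to the following step as `err`.
      next(err);
    }
  }
  next();
}

// Stand-in for an asynchronous database call (illustrative only).
function fakeQuery(value, callback) {
  setTimeout(function () { callback(null, value); }, 0);
}

var trace = [];
runInSequence(
  function loadUser() {
    trace.push('loadUser');
    fakeQuery({ favoriteType: 'books' }, this);
  },
  function findItems(err, user) {
    if (err) throw err;
    trace.push('findItems:' + user.favoriteType);
    fakeQuery(['item1', 'item2'], this);
  },
  function done(err, items) {
    if (err) throw err;
    trace.push('done:' + items.length);
  }
);
```

The design choice worth noting is that the continuation is passed as this, which is what lets every step be written as a flat, named function instead of a nested callback.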
4.3 Node and the ecosystem
It would be unfair to speak about Node without mentioning the growing community of developers
supporting it. Many libraries have been created around this framework in order to interact
with other services such as relational databases (for example, node-mysql), along with many
frameworks for web development such as Express. According to [12] there are 1600 modules and
the list keeps on growing. The most popular can be found on the wiki of Node at
http://github.com/joyent/node/wiki/modules and they can be easily installed using the Node
Package Manager, npm. As a matter of fact, some of these modules were used for the development
of the prototype and framework presented in the following sections.
Module      Creator          Description
Forever     Charlie Robbins  A simple CLI tool for ensuring that a given script runs continuously (i.e. forever)
Cradle      Alexis Sellier   A high-level CouchDB client for Node.js
Socket.IO   LearnBoost       Node.JS project that makes WebSockets and real-time possible in all browsers
Step        Tim Caswell      An async control-flow library that makes stepping through logic easy
Express     visionmedia      Sinatra-inspired web development framework for node.js, insanely fast and flexible
Nodeunit    Caolan McMahon   Easy unit testing in node.js and the browser, based on the assert module
js-yaml     visionmedia      CommonJS YAML parser: fast, elegant and tiny YAML parser for JavaScript
Just like Node, these modules can be found in the social code repository GitHub
(http://www.github.com) or on the npm website (http://www.npmjs.org), where so far 3000 are
registered and the number keeps on growing.
4.4 Database Research
To persist the data generated by the prototype application, different “NoSQL” databases were
reviewed, with horizontal scalability as the main requirement. The use of a non-relational
database was encouraged by the company, in order to explore alternatives to MySQL, which is
the one currently used.
After some research on the Internet, the options were narrowed to MongoDB, Riak, Cassandra
and CouchDB because of their integration with JavaScript, among other features.
MongoDB was discarded for not being “truly” scalable. It uses a master-slave architecture [2],
where the data from the “master” is replicated to different “slaves”. Thus, the storage
capacity cannot be increased by adding more machines. Moreover, having different machines with
the same data might suit read-intensive applications but not write-intensive ones (such as
logging), which is our case.
Riak seemed like an adequate solution since it is partly written in JavaScript and is
fault-tolerant; it can be replicated in master-less mode and sharding is done automatically.
However, those features come with the “enterprise” version and are not free as in other
similar services [4].
Cassandra was found very useful because of the following features. Horizontally scalable:
read and write throughput both increase linearly as new machines are added, with no downtime
or interruption to applications. Decentralized: every node in the cluster is identical. Fault
tolerant: data is automatically replicated to the available nodes and failed nodes can be
replaced without any interruption to the running application. Although it seemed like the best
fit for the needs of the prototype, the number of client libraries written for Node is very
limited. So far there is just a Thrift protocol implementation in Node.JS (called node-thrift)
and, even worse, it is not maintained regularly. Consequently, it was decided to look for
another database with more popular and better supported client libraries.
Finally, CouchDB was reviewed. Like MongoDB, it is a document-based database system; it uses
HTTP/REST as its protocol. It is written in Erlang and is highly distributed while preserving
consistency. Moreover, CouchDB is a fault-tolerant database system: when a failure occurs, it
happens in a controlled environment, which ensures its availability. Finally, it uses view
functions to perform computation on documents and to read and query data.
In the end, CouchDB was found suitable for our Instant Messaging prototype in terms of
simplicity and also because there are many client libraries written for Node.JS. Cradle, the
client library designed by Alexis Sellier, is being used. It was chosen mainly because of its
high rating and the constant maintenance it receives in its public GitHub repository.
4.5 JavaScript
The JavaScript programming language is referred to by Douglas Crockford as “the most
misunderstood language” in his book [8] because of the bad reputation it has among the
developer community. It is not hard to see why, since JavaScript is a language of contrasts:
it has many powerful features along with many weaknesses in its design; it is class-free but
functions can act as constructors; it does not have classical inheritance but it does have
prototype inheritance.
Yet it has succeeded where Java has failed and has become one of the most popular languages
[5], having at least one interpreter in every browser. The reason, as previously said, is that
it has a richness of features that can be used in a very convenient way. Knowing them provides
many benefits during the development of any project.
4.5.1 Must-know features
Giving an introduction or tutorial on this language is out of the scope of this document.
Instead, a list of the most important features that any developer should master to be
proficient in JavaScript is shown:
• Deep knowledge of closures
• Deep knowledge of prototype inheritance
• Knowledge of callback functions and of apply and call
• Clear understanding of how the “this” variable works
• Understanding of timers and asynchronous execution
• Understanding of the Object type and the use of instanceof and typeof
• Understanding of the JSON notation
• Being up to date with the changes and improvements in ECMAScript 5
Being familiar with these helps a lot to understand how this programming language works and
prevents confusion when trying to use it in a way it is not supposed to be used. For example,
it is very easy to be confused by the variable “this” because it is also present in other
languages such as Java, but the way it works in each of them is very different. In JavaScript
it can be used in different contexts, as opposed to Java where it can only appear within an
object.
4.5.2 The bad parts
This section gives a brief description of some of the design flaws of JavaScript, in
particular those that affected the project. The intention is not to diminish the language but
to emphasize that there are workarounds for these flaws.
When working with JavaScript, it is very important to be aware of these flaws and especially
of how to avoid them. Many of them were encountered during the development of this project.
They never became an obstacle, but they did change the way we normally program.
Classical Inheritance
A feature that novice developers often miss is classical inheritance, because it is commonly
found in many modern programming languages. Nowadays many, but not all, programs are designed
under the object-oriented paradigm and this project was no exception. As will be shown in the
following section, Implementation, both the Instant Messaging prototype and the framework were
planned according to this methodology before being well informed about this matter.
Fortunately, contrary to the initial concerns, JavaScript does not lack inheritance at all. It
is present, but through a different mechanism: in this environment objects can inherit
attributes at runtime or, in other words, dynamically. This means that even when an instance
has been created with certain properties, these can be edited, deleted or even added during
the execution of the program. For this reason, it is possible to create an object, called
child, which inherits attributes from another one, called parent, while the program is running.
The helper functions used to implement this functionality in this project were written by
Douglas Crockford in [7], where the source code and more details can be found.
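As an illustration of this dynamic mechanism, the following minimal sketch uses Object.create from ECMAScript 5 rather than the helper functions of [7], which are not reproduced here. The object names parent and child follow the text; the properties name, greet and leave are made up for the example.

```javascript
var parent = {
  name: 'parent',
  greet: function () { return 'hello from ' + this.name; }
};

// child has no methods of its own; lookups fall through to parent.
var child = Object.create(parent);
child.name = 'child';
var greeting = child.greet();   // inherited method, child's own name

// A property added to parent while the program runs is visible in child.
parent.leave = function () { return this.name + ' leaves'; };
var farewell = child.leave();

// Deleting child's own property uncovers the inherited value again.
delete child.name;
var fallback = child.name;
```

Since the link between the two objects is a live reference, no copying takes place: every change to parent is seen by child immediately.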
Equality and comparisons
One of the most common sources of errors in JavaScript is the comparison operators. Because of
weak typing, the interpreter coerces the values of the operands to a common type before
comparing them, which can lead to unexpected results. For example, both " \r\n" == 0 and
0 == "" evaluate to true. In addition, this also impacts performance due to the extra work the
interpreter has to do when converting the values.
Therefore, it is highly recommended to use the strict equality operator, represented by three
equal signs (“===”). With it, the interpreter returns false if the types of the operands are
not the same.
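The difference between the two operators can be checked with a few comparisons. The operand pairs below are illustrative choices; the variable names looseResults and strictResults are introduced only for this sketch.

```javascript
// Loose equality coerces the operands before comparing them.
var looseResults = [
  ' \r\n' == 0,        // whitespace-only string is converted to 0
  0 == '',             // empty string is converted to 0
  '1' == 1,            // numeric string is converted to a number
  null == undefined    // special case defined by the language
];

// Strict equality never converts: different types are never equal.
var strictResults = [
  ' \r\n' === 0,
  0 === '',
  '1' === 1,
  null === undefined
];
```

Every entry of looseResults is true, while every entry of strictResults is false.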
Using this variable
The variable this is not a design flaw at all; it just works in a different way compared to
other languages. At the beginning of the project, this concept was not completely understood
and caused many mistakes that were eventually corrected. Thus, a brief explanation is given
here.
Unlike in other languages, this can be used not only within the method of an object but also
in other scenarios. [18] presents the five different ways it can be used:
• In global scope, this is bound to the global object.
• In a function call, this is still bound to the global object.
• When calling a method, this is bound to the object.
• When calling a constructor, this refers to the newly created object.
• When calling the call or apply methods, the value of this inside the called function refers
to the first argument passed.
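Four of these bindings can be observed with a single function. The sketch below is only an illustration: the names whoAmI, train and Station are hypothetical. The global-scope case is left out because its result depends on the environment (inside a Node module, top-level this is not the global object).

```javascript
function whoAmI() { return this; }

var train = { name: 'Surftrain', whoAmI: whoAmI };
function Station(name) { this.name = name; }

// Plain function call: the global object (undefined in strict mode).
var asFunction = whoAmI();
// Method call: the object the method was called on.
var asMethod = train.whoAmI();
// Constructor call: the newly created object.
var asConstructor = new Station('first station');
// call and apply: the first argument passed.
var viaCall = whoAmI.call(train);
var viaApply = whoAmI.apply(train, []);
```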
Below there is an example of a common mistake, followed by its correction. In the first
listing, when the function callback is executed, this refers to the global scope and not to
the object. The second listing shows the workaround frequently used in Wussap. Thanks to
closures in JavaScript, the variable self gives access to the attributes of the object
Surftrain. Both listings assume that they are run in the global scope.
Incorrect use of this:
var a = 1;
var Surftrain = {
  a: 1,
  join: function () {
    function callback() {
      this.a = 2;   // this is the global object here
    }
    callback();
  }
};
Surftrain.join();
a == 2;             // true
Surftrain.a == 1;   // true
Corrected:
var a = 1;
var Surftrain = {
  a: 1,
  join: function () {
    var self = this;   // keep a reference to the object
    function callback() {
      self.a = 2;
    }
    callback();
  }
};
Surftrain.join();
a == 1;             // true
Surftrain.a == 2;   // true
Miscalculations with Floating point
Another problem that arises when working with JavaScript is the use of floating point numbers.
The typical example is:
0.1 + 0.2 != 0.3 // true
The actual result of this operation is 0.30000000000000004. This is not entirely a problem of
the language, because it follows the IEEE standard for this type of number, which behaves
differently from the decimal arithmetic taught in school. Therefore, it is very important to
take into consideration the precision of these arithmetic operations [8].
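Two common ways of taking the precision into account are comparing with a tolerance and working in integer units. The sketch below shows both; the helper nearlyEqual and its default tolerance are assumptions made for this example, not something taken from [8].

```javascript
// Hypothetical helper: treats two numbers as equal when they differ
// by no more than the given tolerance.
function nearlyEqual(a, b, tolerance) {
  if (tolerance === undefined) { tolerance = 1e-9; }
  return Math.abs(a - b) <= tolerance;
}

var sum = 0.1 + 0.2;
var exactMatch = (sum === 0.3);        // false: binary rounding error
var closeEnough = nearlyEqual(sum, 0.3);

// Alternative: keep the values as integers (e.g. tenths) and divide
// only when the result is presented.
var tenths = 1 + 2;
var integerMatch = (tenths === 3);
```

A suitable tolerance depends on the magnitude of the values involved, so the default used here should not be taken as a general recommendation.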
Auto semicolon insertion
JavaScript does not strictly require semicolons to terminate every statement because it has a
feature called automatic semicolon insertion that adds them for the programmer. However, it
does not always do so in the intended way. This is another example shown in Crockford's book:
Works well in JavaScript:
return {
  ok: true
};
Silent error!
return
{
  ok: true
}
Although the two pieces of code look very similar, they behave very differently: the first
returns an object and the second returns undefined. What happens is that automatic semicolon
insertion transforms the second one in the following way:
return;      // semicolon inserted
{
  ok: true;  // semicolon inserted
}
Even though the unreachable code that follows the return statement should at least produce a
warning, JavaScript simply ignores it. Therefore, it is highly recommended not to rely on this
feature and to include semicolons where they are intended to be.
Strategies for fault detection and correction
As previously mentioned, design patterns are used to overcome these threats. They are divided
according to the phases of the life cycle of a fault: detection, recovery, mitigation and
treatment. Additionally, another type of pattern is considered that does not fit in any of
these categories; it is called architecture because it influences the design of the whole
system and represents the patterns that are already used by highly available systems.
In order to use them, it is important to be familiar with them and to keep them in mind when
designing the system. The book Patterns for Fault Tolerant Software contains many patterns
(probably all types) and of course not all of them apply to the needs of this prototype, so
just a few were selected according to the threats found.
This table lists the patterns that are used not only in the prototype but also in the
framework that helps to protect the system:
Architecture                 Detection                    Recovery                   Treatment
Redundancy                   Complete parameter checking  Fail-over                  Root cause analysis
Recovery block               Riding over transients       Return to reference point  Reproducible Error
Minimize human intervention  System Monitor               Checkpoints                Software Update
Maintenance Interface        Heartbeat                    What to save               Reintegration
Fault Observer               Acknowledgement              Remote Storage
                             Watchdog                     Error Handler
Fault detector and error handler: Parameter checking is done in the monitored application.
Exceptions are caught and tracked depending on the severity of the fault.
System monitor, Heartbeat and Watchdog: The application framework monitors Wussap at system,
request and application level, keeping track of any fault present.
Redundancy, Recovery block and Fail-over: An active copy (a.k.a. hot copy) on a remote server
takes over when the application has crashed.
Riding over transients: Although all faults are tracked, not all of them are corrected
immediately if they are considered temporary and the damage severity is low (e.g. request
overload).
Minimize human intervention and Maintenance interface: The application framework takes
autonomous decisions to perform actions and notifies the user about the status of the system.
Remote Storage: Both persistent data (e.g. database) and in-memory data are spread between
the active server and the hot copy to reduce the recovery time.
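The “Riding over transients” decision can be sketched as a counter over a time window: low-severity faults are only recorded, and action is taken once too many of them accumulate. This is an illustration under assumed names and values (createTransientFilter, the severity labels, the limit of two faults per second), not the logic implemented in the framework.

```javascript
// Hypothetical filter: returns a function that decides whether a fault
// must be handled now (true) or can be ridden over (false).
function createTransientFilter(limit, windowMs) {
  var timestamps = [];
  return function shouldEscalate(fault, now) {
    // Severe faults are never treated as transient.
    if (fault.severity === 'high') { return true; }
    timestamps.push(now);
    // Forget occurrences that fell out of the observation window.
    while (timestamps.length && now - timestamps[0] > windowMs) {
      timestamps.shift();
    }
    return timestamps.length > limit;
  };
}

var shouldEscalate = createTransientFilter(2, 1000);
var overload = { type: 'request overload', severity: 'low' };
var decisions = [
  shouldEscalate(overload, 0),      // first occurrence: ignored
  shouldEscalate(overload, 100),    // second occurrence: ignored
  shouldEscalate(overload, 200),    // third within one second: escalate
  shouldEscalate(overload, 5000),   // window has emptied: ignored again
  shouldEscalate({ type: 'crash', severity: 'high' }, 5001)
];
```

Passing the current time as an argument, instead of reading the clock inside the function, keeps the decision deterministic and easy to test.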
4.6 Assumptions and Limitations
Finally, this section states the limitations and scope of this project.
• Although it is not a restriction, for the moment just one server of the cluster is assumed
to be listening.
• The current implementation of the prototype does not consider any cache mechanism.
• Any matter related to improving application performance has been left out of the project.
• Approaches to make Node scalable were not reviewed either.
• Although scalability has been kept in mind during the development of the project, it was
not explored further owing to the high performance of Node. The tools for inter-server
communication are implemented, but the logic to coordinate them is left for future
improvements.
• Connectivity problems between servers are not considered; the connections are assumed to
work without any problem.
5 Implementation
5.1 Wussap Prototype
5.1.1 System Design
This section presents the prototype, a simplified version of Wussap, which is nothing but an
Instant Messaging service as described in the previous section. Figure 11 shows the system
design with the main components of this program.
Figure 11: Components of client and server
According to their functionality, the services offered by the server can be divided into three
categories:
User Management
Authentication Manager: Handles “Register new user” requests: receives a username and
password and creates a new account. If the username already exists, an error notice is
returned. Handles “Login” requests: receives the username and password sent by the user and
authenticates them against the database. If they are valid, the Authentication Manager
requests the Session Manager to create a new session for the user. Once the session is
created, the Authentication Manager returns the list of existing surftrains and the detailed
information of the chat places and surftrains that the user has subscribed to (forwarded by
the Session Manager).
Session Manager: Receives the request from the Authentication Manager and returns the list of
existing surftrains and the detailed information of the chat places and surftrains that the
user has subscribed to.
Publish/Subscribe (PubSub) Management
Surftrain Manager: Handles the requests “create/stop surftrain”, “join/leave surftrain”,
“surftrain go to new station” and “chat in the surftrain” from users, and the request “list of
all existing surftrains” from both the Session Manager and users. In addition, it makes sure
that privileged functions such as stopping or leading a surftrain are performed by the creator
and no one else.
Chat Manager: Handles the request “create/join/leave a chat place” from both the Surftrain
Manager and users. Handles the request “chat in chat place” and broadcasts the chat content to
all subscribers of that chat place. Handles the request “chat log and participants of chat
place” from the Session Manager.
I/O Management
Database Manager: Handles connection establishment, connection close and reads/writes to the
database from the application.
5.1.2 Publish-Subscriber Model
The abstract Publisher/Subscriber model is described by the diagram shown in Figure 12:
Figure 12: The Pub/Sub model
The controller contains a list of topics and each topic contains a description text and a list
of subscribers. Their interaction is described by these events:
• On Subscription: the controller adds a new subscriber to the respective topic, creating the
topic if it does not exist yet
• On Unsubscription: the controller removes a user from the subscriber list of a specific topic
• On Publish: whenever the controller receives a message, it searches for the targeted topic,
verifies that the sender is also a subscriber (see below) and finally broadcasts the message
to the rest of the subscribers.
In comparison to other pub/sub approaches, there is one restriction in our implementation:
only subscribers are allowed to publish and no one else.
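The three events and the publishing restriction can be sketched as follows. This is not the code of the prototype: the constructor name Controller, the method names and the representation of a user as an object with a name and an inbox are assumptions made to keep the example self-contained.

```javascript
function Controller() {
  this.topics = {};   // topic name -> { description, subscribers }
}

// On Subscription: create the topic if needed, then add the user once.
Controller.prototype.subscribe = function (topicName, user) {
  var topic = this.topics[topicName];
  if (!topic) {
    topic = this.topics[topicName] = { description: topicName, subscribers: [] };
  }
  if (topic.subscribers.indexOf(user) === -1) { topic.subscribers.push(user); }
};

// On Unsubscription: remove the user from the topic, if present.
Controller.prototype.unsubscribe = function (topicName, user) {
  var topic = this.topics[topicName];
  if (!topic) { return; }
  var index = topic.subscribers.indexOf(user);
  if (index !== -1) { topic.subscribers.splice(index, 1); }
};

// On Publish: only subscribers may publish; the message is delivered
// to every other subscriber of the topic.
Controller.prototype.publish = function (topicName, sender, text) {
  var topic = this.topics[topicName];
  if (!topic || topic.subscribers.indexOf(sender) === -1) { return false; }
  topic.subscribers.forEach(function (user) {
    if (user !== sender) { user.inbox.push({ from: sender.name, text: text }); }
  });
  return true;
};

var controller = new Controller();
var alice = { name: 'alice', inbox: [] };
var bob = { name: 'bob', inbox: [] };
var eve = { name: 'eve', inbox: [] };   // never subscribes
controller.subscribe('news', alice);
controller.subscribe('news', bob);
var accepted = controller.publish('news', alice, 'hello');
var rejected = controller.publish('news', eve, 'spam');
```

In the prototype the delivery step would write to a network connection; the inbox array stands in for it here.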
In the case of the Surftrain Manager, the composition of a topic (which is in fact an analogy
of a Surftrain) is the same except for two additional attributes called “leader” and
“currentStation”.
Figure 13: Topic described for Surftrains
Evidently, additional events must be included too:
• Go to station: updates the value of “currentStation” and broadcasts this change to all
subscribers
• Stop Surftrain: sends an end notification for the surftrain and deletes this topic from the
controller list
5.2 Fault-Tolerant Framework
Once the prototype has been constructed, it is time to build the framework that monitors the
resident application (the Wussap application in this case) externally. It was decided to code
it as a standalone program, separate from the main application, for the following reasons:
first, since the stability of Node.JS is still in question, it is important to divide the
components into two separate processes to increase robustness and prevent the whole system
from crashing if an error occurs; and second, to make it as independent as possible so that it
can be used for other projects and be improved unobtrusively.
5.2.1 System Design
The architecture of the framework is described by Figure 14 and is influenced by the method
called encapsulated cluster found in [11]. It is called encapsulated because the front end is
not exposed to the public but kept in a private network, while a router with an assigned
domain name receives all requests coming from the Internet. To reduce the possibility of
having a single point of failure, the router is deployed along with a backup that replaces the
active one in case it crashes.
Figure 14: Diagram of Fault Tolerant Systems
The advantages of this model, compared to another method such as Round Robin DNS [11], are
better, fine-grained load balancing and the removal of the problem of clients caching the IP
address of a front-end server that may be down, which would occur if the front end were
public. No matter which server is down, the router takes care of redirecting incoming requests
to the active server.
Possible issues are that the router could become a bottleneck and impact the performance.
Also, having the router as the only entry point from the outside world turns it into a
potential single point of failure and, even though there is a backup ready to replace it, it
is not completely immune, because a catastrophe such as a power outage in the area could
affect both instances.
Due to the scope of this research, load balancing is not addressed in detail. However,
considering that in any Web application it is essential to make the system scalable, the
architecture was chosen so that it will not become an obstacle to further improvements,
especially when more servers have to be added and two of them have to be active to handle the
incoming load.
Next, let us take a closer look at the internal design of the framework. Figure 15 shows the
components involved: application monitor, system monitor, leader elector, maintenance
interface and global fault manager. The latter is the component that orchestrates and decides
according to the status of the system, while the function of the other components is to
monitor their context and report its status. Details of the connections to the servers, the
CPU usage and memory consumption thresholds, and the operation mode are passed in a
configuration file written in YAML format.
Figure 15: Diagram of Fault Tolerant Systems
The application framework starts by running the Global Fault Manager, which first initializes
the Maintenance Interface and the Leader Elector, passing the latter a list of servers. Then
the Leader Elector communicates with the other instances and they decide which server will run
as the active one. This method is deterministic and is based on an algorithm found in [14]. In
summary, it works in the following way: all the servers exchange heartbeat messages containing
their ID and the number of times they have been restarted, also known as the epoch. Once all
responses have been received, they choose the server with the lowest epoch or the lowest
server ID.
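The selection rule can be sketched as a pure function over the values collected from the heartbeats. The sketch assumes that the lowest epoch wins and that the lowest server ID is used only to break ties; this reading, the function name electLeader and the record fields id and epoch are assumptions, and the algorithm of [14] is not reproduced here.

```javascript
// Hypothetical helper: picks the leader among the known servers.
// Every server runs the same function on the same data, so all of
// them arrive at the same answer without further coordination.
function electLeader(servers) {
  return servers.slice().sort(function (a, b) {
    // Fewer restarts first; on a tie, the lower ID first.
    return (a.epoch - b.epoch) || (a.id - b.id);
  })[0];
}

var leader = electLeader([
  { id: 3, epoch: 0 },
  { id: 1, epoch: 2 },
  { id: 2, epoch: 0 }
]);
```

Here servers 2 and 3 share the lowest epoch, so server 2 is selected because of its lower ID.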
Once the selection is done, if the current server is chosen as the leader, the Global Fault
Manager initializes the rest of the components. The Application Monitor runs the target
application and restarts it in case it crashes, whereas the System Monitor constantly checks
that the CPU and memory usage do not exceed a certain limit.
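A threshold check of this kind can be sketched with the os module bundled with Node. How the framework actually measures CPU usage is not described in this section, so the use of the one-minute load average divided by the number of cores is an assumption, as are the function names sampleSystem and findViolations and the limit values.

```javascript
var os = require('os');

// Takes one sample of the system; both values are fractions.
function sampleSystem() {
  var cores = Math.max(1, os.cpus().length);
  return {
    // Rough CPU indicator (load average is always 0 on Windows).
    cpu: os.loadavg()[0] / cores,
    memory: 1 - os.freemem() / os.totalmem()
  };
}

// Returns the names of the resources that exceed their limit.
function findViolations(sample, limits) {
  var violations = [];
  if (sample.cpu > limits.cpu) { violations.push('cpu'); }
  if (sample.memory > limits.memory) { violations.push('memory'); }
  return violations;
}

var limits = { cpu: 0.9, memory: 0.8 };   // illustrative thresholds
var current = findViolations(sampleSystem(), limits);
var simulated = findViolations({ cpu: 0.95, memory: 0.4 }, limits);
```

A monitor would call sampleSystem periodically, for instance with setInterval, and report any non-empty result.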
All these events are notified to the Global Fault Manager asynchronously via events, using the
module of the same name (events) that comes bundled with Node.JS. The rest of the components
are mainly inspired by different open source projects found on GitHub (http://github.com) and
in the Node Package Manager registry (http://npmjs.org/). Here the benefits of working with a
technology that is supported by such a big and growing community of developers can be seen.
Thanks to their contributions, the development of this framework was much easier.
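The notification mechanism can be sketched with the EventEmitter class of the events module. The event names threshold and crash, the payload fields and the variables below are hypothetical; only the EventEmitter API itself (on and emit) comes from Node. Note that emit invokes the listeners synchronously, so the asynchrony comes from the monitoring activity that triggers the events, not from the call itself.

```javascript
var EventEmitter = require('events').EventEmitter;

// The monitors are the event sources.
var systemMonitor = new EventEmitter();
var applicationMonitor = new EventEmitter();

// State kept by a hypothetical Global Fault Manager.
var eventLog = [];
var restarts = 0;

systemMonitor.on('threshold', function (info) {
  eventLog.push('threshold exceeded: ' + info.resource);
});
applicationMonitor.on('crash', function (info) {
  restarts += 1;
  eventLog.push('application restarted after exit code ' + info.code);
});

// In the framework these calls would be made by the monitors when they
// detect a problem; here they are triggered by hand.
systemMonitor.emit('threshold', { resource: 'memory' });
applicationMonitor.emit('crash', { code: 1 });
```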
5.2.2 Description of Components
These are the tasks of each of the components shown above (Figure 15):
System Monitor
• Checks the memory consumption of the system
• Checks the CPU usage
• Informs the Global Fault Manager when a critical point has been reached
Application Monitor
• Checks that the application process is alive and not idle
• Checks the error output of the application
• Restarts the application if it is idle or has crashed
• Keeps track of the exceptions thrown by the application and the actions performed in the
application
Leader Elector
• Constantly sends heartbeats to the other servers
• Checks whether a server is down
• Notifies the Global Fault Manager if there are no backup servers
• Notifies the Global Fault Manager if the current server becomes the leader
Global Fault Manager
• Receives notifications from the other components about any possible fault found in the
system
• Takes further action whenever a fault occurs
• Keeps a count of the number of times the application has been restarted
• Logs all events received from the other components
Maintenance Interface
• Presents all events that occurred in the system
• Presents the status of the server in use
• Presents the status of the backup servers
• Displays all the errors output by the application and the number of times it has been
restarted
Configuration Details
The configuration file contains the following information:
• Command/file to execute the application
• Connection details of all other servers
• CPU usage threshold
• Memory consumption threshold
• Port on which the maintenance interface listens (e.g. 8080)
• Operation mode (e.g. debug)
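As an illustration of the configuration details listed above, the following sketch shows how the settings might look once the YAML file has been parsed into a JavaScript object, together with a basic validation step. All key names, the command, the addresses and the values are hypothetical; the actual file format of the framework is not shown in this section.

```javascript
// Hypothetical configuration, as it could look after parsing.
var config = {
  command: 'node wussap.js',
  servers: [
    { host: '10.0.0.2', port: 9000 },
    { host: '10.0.0.3', port: 9000 }
  ],
  cpuThreshold: 0.9,       // fraction of available CPU
  memoryThreshold: 0.8,    // fraction of total memory
  maintenancePort: 8080,
  mode: 'debug'
};

// Returns the names of the settings that are missing or malformed.
function validateConfig(cfg) {
  var errors = [];
  if (typeof cfg.command !== 'string' || cfg.command === '') { errors.push('command'); }
  if (!Array.isArray(cfg.servers)) { errors.push('servers'); }
  ['cpuThreshold', 'memoryThreshold'].forEach(function (key) {
    if (typeof cfg[key] !== 'number' || cfg[key] <= 0 || cfg[key] > 1) { errors.push(key); }
  });
  var port = cfg.maintenancePort;
  if (typeof port !== 'number' || port % 1 !== 0 || port < 1 || port > 65535) {
    errors.push('maintenancePort');
  }
  if (typeof cfg.mode !== 'string') { errors.push('mode'); }
  return errors;
}

var problems = validateConfig(config);
```

Validating the file once at start-up lets the framework refuse to run with a broken configuration instead of failing later, while it is supposed to be protecting the application.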
6 Testings
This section presents the tests performed. The main idea is simply to test the functionality
developed in the prototype and the framework. Additionally, some stress tests were introduced
with the intention of finding potential areas for enhancement. For all of these, an AMD Turion
64 dual-core computer with 2.9 GB of memory running Ubuntu 10.10 (Maverick) was used.
6.1 Prototype Functionality Tests
Test-Driven Development was the methodology used during the course of this project. Indeed,
the initial plan of developing a client with a graphical user interface was dismissed for
lacking scientific interest. Instead, a series of test cases was coded to characterize the
behavior of the system.
Test-Driven Development is also very helpful for maintainability because it avoids the
repetitive task of manually checking the outputs to verify that the system behaves correctly
when new changes are applied or when a fault is accidentally introduced due to a bad design.
Fortunately, there are tools in Node to carry out this task, starting with the “assert” module
that comes bundled with it. Nevertheless, using solely the assertion functions would be
complicated and time consuming. Therefore, the nodeunit module (again, an open project,
designed by Caolan McMahon and published on GitHub) was also used to simplify this task.
Even better, [20] offers a great tutorial on how to set everything up. Once again, it confirms
how useful it is to work with a technology that is supported by such a big community of
developers.
The following are the tests that were coded to develop the Instant Messaging prototype; they are run from a browser (Google Chrome), as indicated by “nodeunit”:
User operations
 Valid registration
 Failed registration - user duplicated
 Failed registration - parameters missing
 Valid authentication
 Failed authentication - credentials mismatch
Surftrain operations
 Valid registration
 Failed registration - invalid parameters
 Valid subscription
 Failed subscription - invalid ID provided
 Valid list retrieval
 Valid operation ‘send a message’
 Failed operation ‘send a message’ - invalid ID provided
 Valid operation ‘go to station’
 Failed operation ‘go to station’ - invalid ID provided
 Valid operation ‘stop surftrain’
 Failed operation ‘stop surftrain’ - invalid ID provided
 Valid unsubscription
 Failed unsubscription - invalid ID provided
Web Chat operations
 Valid subscription
 Failed subscription - invalid URL provided
 Valid operation ‘send a message’
 Failed operation ‘send a message’ - invalid URL provided
 Valid operation ‘stop surftrain’
 Failed operation ‘stop surftrain’ - invalid URL provided
 Failed subscription - invalid ID provided
As can be observed, the aim of these test cases is to cover all the possible scenarios and use cases that were documented prior to the implementation of the prototype.
There are also cases in which messages are not triggered by user actions. Instead, the server pushes messages directly to the client without any notice, just like in any other publish-subscribe system. These are known as Notifications and are sent when:
 A message has been received in a Website
 A message has been received in a Surftrain
 A new user joins a Surftrain
 A participant (user subscribed to a Surftrain) leaves
 A new user joins a Website
 A participant (user subscribed to a Website) leaves
 The Surftrain has gone to a new station
 The Surftrain in use has concluded
The last two notifications mentioned above are particularly important because they dictate the behavior of the browser and where it has to be redirected when the user is participating in a Surftrain.
These tests made it possible to verify that the prototype was developed according to the specifications provided by the company, and they confirm the validity of the program.
6.2 Framework Functionality Test
For testing the high-availability framework, virtualization software (“VirtualBox 3.2.8”) was used to simulate 2 different servers that replace each other in case of a disruption in the system. This might not be the most convenient setup because the real circumstances present in a private network (e.g. transmission delay, loss of connectivity, etc.) are missing in this scenario. Unfortunately, the tests could not be done on 2 separate servers, as was desired, due to the lack of resources. Yet, at this early point of development, the goal is only to confirm the behavior of the framework.
First, the functionalities of the system monitor were tested to confirm that the following is detected:
 The CPU usage of the hosted application reached a limit
 The memory usage reached a limit
To do so, a fault was injected into the application. It is triggered after a certain timeout and simply runs a useless piece of code that either keeps increasing the size of an array or hangs the application in an infinite loop, affecting the system in the two aspects mentioned above. As expected, the system monitor successfully registered these changes and alerted the global fault manager in both situations.
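The two injected faults and a threshold check can be sketched as follows. All function names, sizes and the shape of the report are assumptions made for this example. In addition, checkMemory inspects the memory of its own process through process.memoryUsage() for simplicity, whereas the system monitor of the framework observes the hosted application from outside.

```javascript
// Hypothetical fault injectors; names and sizes are illustrative assumptions.
const leaked = [];

// Memory fault: each call appends more data to an array that is never freed.
function injectMemoryFault(chunks) {
  for (let i = 0; i < chunks; i++) {
    leaked.push(new Array(1024).fill('x'));
  }
  return leaked.length;
}

// CPU fault: an infinite loop that blocks Node's single thread. It is defined
// only for illustration; calling it hangs the process.
function injectCpuFault() {
  while (true) {}
}

// Arms a fault so that it fires once after the given timeout.
function armFault(fault, delayMs) {
  return setTimeout(fault, delayMs);
}

// Monitor side: compares the resident set size of this process with a limit
// and reports a breach through the callback, which stands in for the global
// fault manager.
function checkMemory(limitBytes, onBreach) {
  const rss = process.memoryUsage().rss;
  if (rss >= limitBytes) {
    onBreach({ type: 'memory', value: rss, limit: limitBytes });
    return true;
  }
  return false;
}
```

A memory fault would be armed with a call such as armFault(function () { injectMemoryFault(100000); }, 10000), while the monitor calls checkMemory periodically with the threshold taken from the configuration.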
Later, the application monitor and the leader elector were tested. Again, the aim was to confirm the functionality of the system and to observe, from the client's perspective, how requests are affected in two possible scenarios: when the target application is being restarted (called the server test), and when the server has crashed and the service fails over (called the cluster test). All of them followed the same procedure: a client (executed in the browser) sends messages to the server via WebSockets with a delay in between to set the frequency (200 ms, 100 ms and 50 ms for our three cases respectively). Then, the acknowledgements were recorded and finally plotted in a graph where the distribution of the number of acknowledgements received can be observed.
Please be aware that, in this case, an acknowledgement means that the system is up and running (or “alive”), but it does not necessarily mean that the system is ready and can recover the state the client previously had. This is because sessions are currently kept in memory; the server loses this data if it crashes and requires the client to authenticate again. Although the obvious solution would be to persist them, this procedure might affect the performance dramatically. Therefore, better support from the client program to handle this situation must be considered and, for the moment, any enhancements on the server side have been left as future work.
Initially, the server test was carried out, where the hosted application is only restarted by the framework; this scenario was repeated at different frequencies (5, 10 and 20 requests/second). To force the application to be restarted, an exception was inserted that is triggered by a setTimeout function every 5 seconds after the program starts running.
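A minimal sketch of such an injected crash is shown below. The function name and the error message are assumptions, and the optional schedule parameter exists only so that the sketch can be exercised without waiting for the timer.

```javascript
// Schedules an uncaught exception so that the hosted application crashes and
// the framework has to restart it. `schedule` defaults to setTimeout.
function scheduleCrash(delayMs, schedule) {
  const timer = schedule || setTimeout;
  return timer(function crash() {
    throw new Error('Injected fault: forcing an application restart');
  }, delayMs);
}

// Usage inside the hosted application (terminates the process after 5 s):
// scheduleCrash(5000);
```

Since the exception is thrown from a timer callback and nothing catches it, Node terminates the process, which is exactly the condition the application monitor is expected to detect.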
The results in Figure 16 show an odd distribution in which high peaks appear several times. This is mainly for two reasons:
 First, since “WebSockets” are being used, which run on top of TCP, the transmission of a packet is repeated several times to guarantee delivery. That is why, after the application is restarted, it responds to those messages that attempted to arrive while it was down. As a matter of fact, no message was lost during the tests.
Figure 16: Server tests at 5, 10 and 20 requests/second
 Second, since the client was run in a browser, a single-threaded environment, the delay used to send the messages was not respected. In other words, if a timer fires to send a message while the client is “busy” (sending a message or processing one that has arrived), the action is put in a queue and executed later. That is why some of these peaks are higher than others. Although this could be improved by running the client in another environment, it was decided not to do so since the browser is where the client program will have to run.
Next, the cluster test is presented. It covers the case in which one server is unavailable and must be replaced. This normally happens either when the number of times the application has been restarted reaches a limit or simply because the application cannot be started. Via heartbeat messages, the servers within the cluster detect this failure, proceed to choose a new leader and start up the application on this new instance.
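The heartbeat bookkeeping described above can be sketched as follows. The class name, the node identifiers and the rule that the live node with the lowest identifier becomes leader are assumptions made for this example; the actual election rule of the framework may differ.

```javascript
// Hypothetical heartbeat bookkeeping for one member of the cluster.
class HeartbeatTracker {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.lastSeen = new Map(); // nodeId -> timestamp of the last heartbeat
  }

  // Records a heartbeat received from a node at time `now` (milliseconds).
  beat(nodeId, now) {
    this.lastSeen.set(nodeId, now);
  }

  // Nodes whose last heartbeat is recent enough to be considered alive.
  alive(now) {
    const result = [];
    for (const [nodeId, seen] of this.lastSeen) {
      if (now - seen <= this.timeoutMs) {
        result.push(nodeId);
      }
    }
    return result.sort();
  }

  // Picks the leader among the live nodes, or null when none is alive.
  leader(now) {
    const nodes = this.alive(now);
    return nodes.length > 0 ? nodes[0] : null;
  }
}
```

With a timeout of 1000 ms, a node that was last heard from 1500 ms ago drops out of alive() and the next call to leader() returns the following candidate, which is then responsible for starting the application.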
To execute these tests, the same procedure was followed, with the difference that only the frequency of 20 requests/second was used, as it was considered the most critical of the ones used above. Figure 17 shows the results obtained. It displays a behavior similar to the previous graph except for a longer downtime (a maximum of 4.89 seconds).
Figure 17: Cluster tests with 1 and 0.1 seconds of Heartbeat
Of course, this result was expected considering that the work involved is also greater: the server has to stop restarting the application, shut down the leader elector of the framework in use, and be detected by the other members of the cluster, which then start up the application on the new server, and so on. What affected this downtime the most was the timeout employed when sending a heartbeat message (1 second for the left graph in Figure 17), so it was decided to reduce it to 0.1 seconds and run the test again. As expected, the downtime was lower (a maximum of 2.67 seconds).
Considering that virtual machines were employed for these tests, the delay used to send the heartbeat is not very representative of a real scenario. As a matter of fact, this value depends on the specific circumstances in which the operation takes place, such as the type of network, the distance between the servers, congestion, etc. Thus, it is impossible to determine a single value, and guessing it is hard and error prone as well. A possible solution (proposed in [14] as the eventual leader elector) is to set a very low timeout and increase it every time a new leader is selected. Initially, the premise of having a unique “leader” among the servers is not guaranteed while this variable keeps growing. But once it stops changing (that is, once it has stabilized), a unique leader is eventually determined (hence the name of the algorithm).
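The idea of growing the timeout on every leader change can be sketched as below. This is a simplified illustration loosely modeled on the approach described in [14], not a faithful implementation of it: the class name, the increment and the rule of choosing the lowest identifier are assumptions made for this example.

```javascript
// Simplified eventual leader detector with an increasing timeout.
class EventualLeaderDetector {
  constructor(initialDelayMs, incrementMs) {
    this.delayMs = initialDelayMs;   // current heartbeat timeout
    this.incrementMs = incrementMs;  // added whenever the leader changes
    this.leader = null;
  }

  // Called when the current timeout expires, with the identifiers of the
  // nodes whose heartbeat arrived during the period that just ended.
  // Returns the delay to use for the next period.
  onTimeout(heardFrom) {
    const candidates = heardFrom.slice().sort();
    const newLeader = candidates.length > 0 ? candidates[0] : null;
    if (newLeader !== this.leader) {
      // A leader change suggests the timeout was too short: be more patient.
      this.delayMs += this.incrementMs;
      this.leader = newLeader;
    }
    return this.delayMs;
  }
}
```

Starting from 100 ms with an increment of 50 ms, two consecutive leader changes raise the delay to 200 ms, and it stays there for as long as the same leader keeps being selected.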
6.3 Stress Testing
In addition to the basic functionalities, it was decided to observe the behavior of the system in critical situations. Therefore, another scenario was tested in which the frequency of message exchange is considerably high. In the previous section it was shown that the database is hosted on the same server where the application logic runs, and that it is accessed via HTTP requests made by the client “cradle”.
To do so, a JavaScript program was implemented in which 3 clients join a newly created Surftrain and start sending messages at different time intervals. The messages do not differ in size, and they are sent in a round-robin manner to ensure that each client sends the same amount.
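The round-robin distribution and the relation between frequency and sending interval can be sketched as follows. The function names and the send callback are assumptions made for this example, and the timers that would space the messages in the real test program are left out.

```javascript
// Sends `total` messages, rotating through the clients so that each one
// sends the same share (up to a difference of one message).
// `send(client, sequenceNumber)` is a hypothetical callback.
function sendRoundRobin(clients, total, send) {
  const sentPerClient = clients.map(function () { return 0; });
  for (let i = 0; i < total; i++) {
    const index = i % clients.length;
    send(clients[index], i);
    sentPerClient[index] += 1;
  }
  return sentPerClient;
}

// Interval in milliseconds between messages for a target frequency given
// in requests/second.
function intervalMs(requestsPerSecond) {
  return 1000 / requestsPerSecond;
}
```

For 10,000 messages and 3 clients the shares are 3334, 3333 and 3333 messages, and a frequency of 20 requests/second corresponds to an interval of 50 ms.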
The following shows the results obtained:
Frequency            Number of        Number of messages
(requests/second)    messages sent    successfully received
1000                 10,000           8
500                  10,000           8
100                  10,000           98
50                   10,000           135
5                    10,000           3198
2                    2,511            2765
Figure 18: Results from testing messages sent in a Surftrain
There is an explanation for the catastrophic scenario shown in Figure 18: due to the version control mechanism built into the database, by default the database rejects an update to a document that another request is modifying at the same time and responds with a “Document update conflict” message [6]. This message was reported by the program as well. As a matter of fact, the prototype itself behaved well despite the high frequency. It neither crashed nor showed any symptom indicating that its performance was affected.
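The effect can be reproduced with a small in-memory stand-in for a revision-checked document store. This is not CouchDB and does not use its API; the class and its methods are assumptions that only imitate the rule that an update must be based on the current revision of the document.

```javascript
// In-memory stand-in for a revision-checked document store (NOT CouchDB).
class RevisionedStore {
  constructor() {
    this.doc = { rev: 1, messages: [] };
  }

  // Returns a copy of the document together with its current revision.
  read() {
    return { rev: this.doc.rev, messages: this.doc.messages.slice() };
  }

  // Accepts the update only when it is based on the latest revision.
  update(basedOnRev, messages) {
    if (basedOnRev !== this.doc.rev) {
      return { ok: false, error: 'Document update conflict' };
    }
    this.doc = { rev: this.doc.rev + 1, messages: messages.slice() };
    return { ok: true, rev: this.doc.rev };
  }
}
```

When two clients read the same revision and both try to append a message, only the first update is accepted; the second one is rejected and its message is lost unless the client reads the document again and retries. At high frequencies most updates fall into the second group, which is consistent with the low numbers of received messages in the table.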
To solve this problem, an extra parameter, batch=ok, can be included in the URL query that is sent to the database when performing an update operation. As described in [6], it works in the following way: “When a PUT (...) is sent using this option, it is not immediately written to disk. Instead it is stored in memory on a per-user basis for a second”. The evident risk is a lower guarantee of the data being persisted: documents that remain in memory are lost if the server crashes.
Frequency            Number of        Number of messages
(requests/second)    messages sent    successfully received
1000                 10,000           Application restarted!
500                  10,000           10,000
100                  10,000           10,000
50                   10,000           10,000
5                    10,000           10,000
2                    2,511            10,000
Figure 19: Results from testing messages sent in a Surftrain (using batch=ok)
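As an illustration of where the batch=ok parameter goes, the following sketch builds the options for an HTTP PUT request using only Node's built-in modules. The prototype accessed the database through “cradle”, whose API is not reproduced here; the function name, the host and the database and document names are placeholders assumed for this example.

```javascript
const querystring = require('querystring');

// Builds the options object for http.request() to store a document in batch
// mode. Host, database and document names are hypothetical placeholders;
// 5984 is the default CouchDB port.
function buildBatchPut(database, docId) {
  return {
    host: 'localhost',
    port: 5984,
    method: 'PUT',
    path: '/' + encodeURIComponent(database) + '/' + encodeURIComponent(docId) +
      '?' + querystring.stringify({ batch: 'ok' }),
    headers: { 'Content-Type': 'application/json' }
  };
}
```

For example, buildBatchPut('surftrains', 'train-1').path yields /surftrains/train-1?batch=ok. In this mode CouchDB acknowledges the request before the document reaches the disk, which is the source of the lower durability guarantee mentioned above.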
The results shown in Figure 19 improved: all sent messages were successfully persisted at all the frequencies tested (except for the first one). However, two more issues were discovered. First, the responsiveness of the system was reduced considerably because all messages were saved as they arrived, which eventually turned into a bottleneck. Second, the extra memory used by the database with the batch parameter was registered by the system monitor, and the framework restarted the prototype. This latter case occurred only at the highest frequency.
From these tests it can be concluded that it is necessary to keep the database in a different environment, outside of the middleware (also called “business logic”), to handle the load in a better way. Also, since the operation is quite simple (updating a document by adding one message), some support can be added in the application logic so that the response is created without waiting for the acknowledgement of the database.
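The last suggestion could look roughly like the sketch below. It was not implemented in the prototype; handleMessage, persist, reply and onPersistError are hypothetical names chosen for this example.

```javascript
// Answers the client before the database has acknowledged the write.
// `persist(message, callback)` is a hypothetical asynchronous save function,
// `reply(message)` answers the client, and `onPersistError(err, message)` is
// told about writes that fail after the reply has already been sent.
function handleMessage(message, persist, reply, onPersistError) {
  // Start the write but do not wait for its acknowledgement.
  persist(message, function (err) {
    if (err) {
      onPersistError(err, message);
    }
  });
  // Answer the client right away.
  reply(message);
}
```

The trade-off is that a client may receive an acknowledgement for a message whose write later fails, so the error callback would have to log or retry such writes.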
7 Conclusion
This document has answered the question of how to build a highly available system in Node. As has been shown, previous studies of other Internet services indicate that the capability of keeping a cluster running despite failures relies more on the tools used for monitoring and configuration than on the software itself. In addition, it is the design patterns that help a system become more robust and persistent. Therefore, the objective of this project was to experiment with this new technology and find out whether it has what is required to meet these expectations.
To do so, a study of the possible weaknesses and expected faults was made, together with the mitigation procedures that the system should follow, based on the Fault Model Enforcement presented in [21]. After this, a lightweight replica of the Wussap application was developed in JavaScript, along with a framework that provides system monitoring and redundancy. Finally, the prototype was tested by injecting faults and confirming that the expected behavior was observed.
As has been presented, building a highly available system in Node is not an obstacle if the right design is chosen and if JavaScript is handled properly by taking advantage of its best features. It is also important to say that, when starting to use Node, the amount of help found on the Internet is impressive because there are so many libraries and open projects from which one can benefit. This support was certainly substantial for the development of this project.
Among the issues encountered in this research project, the main ones were due to the change of paradigm that comes with a programming language that has a different model of inheritance and a powerful feature like closures, both of which were new to us. Also, working on a new platform where all the events are asynchronous requires some effort to get used to if one comes from the old school, where programs were entirely synchronous and, of course, easier to read. Nevertheless, the benefits in throughput and performance shown in previous sections are quite high and, since it was shown how high availability can be achieved, the time invested is definitely worthwhile.
An area for improvement is the monitoring tools and the interface for automated tasks, which would avoid failures caused by the operator. These can only be developed as the project goes on and the specific needs of the system are revealed. The idea is to develop these tools with the objective of minimizing the intervention of the human operator in the system. Also, considering that Node is still at an early stage, more testing with regard to stability is required. Statistics on CPU and memory usage, reasons for failures and load handling, obtained from long-running executions of the prototype and the framework, would be greatly helpful.
8 Discussion and Limitations
Split-Brain Syndrome
One of the potential issues that the reader may have foreseen is the split-brain syndrome [9]. This term is used in high-availability clusters (like the one presented in this document) when the communication network between the nodes is down. Each node then becomes active (declares itself leader), thinking that there are no other instances. This leads to several services running at the same time (when there is supposed to be just one) and possibly to data corruption.
Even though this scenario is likely to occur in our implementation, the risk of data corruption due to this malfunction in the network is disregarded for the following reasons. First, the router that receives the external requests may be instructed to redirect them to one single server. Second, even if the router were distributing the load, the data in the back end is constantly replicated by the database manager, and CouchDB even contains a feature to handle conflicts between different versions of a document [3].
Furthermore, the time when more than one server is needed to handle the load will surely come. It is therefore preferable to propose an architecture that is not very restrictive and, even better, allows the system to scale in the easiest way.
Eventual Consistency
Another potential issue is that the replication of the data is not coordinated with the logic of the application at all. That is, the information saved in the back end is distributed among the other databases without any notice being received by the front end. If, for any reason, the Global Fault Manager decides to fail over, i.e. to pass the role of leader, without the data having been replicated beforehand, there is a chance that some data will be temporarily lost.
Nevertheless, we disregarded the solution to this problem found in [10], called global synchronization, where the nodes constantly make checkpoints and stop the others until there is a global state of the cluster. As can be seen, this proposal is blocking and may impact the performance of the service if the internal communication is slow. Moreover, one of the design philosophies behind Node is to develop programs that are unobtrusive and non-blocking, so it was desired to keep the same pattern. Besides, the biggest loss the end user can suffer is that some messages will appear in a different order or at different times. Although this affects the quality of the service, it is a disruption that can be afforded in exchange for keeping the system unrestricted.
Unnecessary router?
In the implementation section, a method called encapsulated cluster is used, where a router receives the incoming requests and then distributes them to the nodes. Considering that, for our purposes, just one node will be running the service, the need for an external device may seem expensive and somewhat useless.
However, exposing the nodes publicly through a Domain Name Server that maps a domain name to several IP addresses is problematic because the Time To Live (TTL) parameter is normally not respected by the client, which could end up pointing to a server that has crashed [11]. In addition, the idea of having several instances running to handle the load can be easily implemented using this method.
References
[1] Scaling Instant Messaging Communication Services: A Comparison of Blocking and Non-Blocking techniques, The Sixteenth IEEE Symposium on Computers and Communications, May 2011.
[2] 10gen Inc. mongodb. http://www.mongodb.org, 2011.
[3] J. C. Anderson, J. Lehnardt, and N. Slater. CouchDB: The Definitive Guide. O’Reilly Series. O’Reilly Media, 2009.
[4] Basho Technologies Inc. Welcome to the riak wiki. http://wiki.basho.com/, 2011.
[5] TIOBE Software BV. Tiobe programming community index for july 2011. http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html, July 2011.
[6] Apache CouchDB. The apache couchdb project. http://couchdb.apache.org/, 2008-2011.
[7] Douglas Crockford. Classical inheritance in javascript. http://www.crockford.com/javascript/inheritance.html.
[8] Douglas Crockford. JavaScript: The Good Parts. O’Reilly Media, Inc., 2008.
[9] S. K. M. N. Deshpande. Distributed Systems. Technical Publications, 2009.
[10] Vijay Dialani, Simon Miles, Luc Moreau, David De Roure, and Michael Luck. Transparent fault tolerance for web services based architectures. In Eighth International Europar Conference (EUROPAR02), Lecture Notes in Computer Science, Paderborn, pages 889–898. Springer-Verlag, 2002.
[11] D. M. Dias, W. Kish, R. Mukherjee, and R. Tewari. A scalable and highly available web server. In Proceedings of the 41st IEEE International Computer Conference, COMPCON ’96, pages 85–, Washington, DC, USA, 1996. IEEE Computer Society.
[12] Klint Finley. Node.js creator ryan dahl’s keynote from nodeconf. http://www.readwriteweb.com/hack/2011/07/nodejs-creator-ryan-dahls-keyn.php, 2011.
[13] Google Groups. Erlang programming: node.js compared to erlang. http://groups.google.com/group/erlang-programming/browse_thread/thread/142aed19df0decd9/a6fbf0414b50c8ee?pli=1, 2010.
[14] Rachid Guerraoui and Luís Rodrigues. Introduction to Reliable Distributed Programming. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[15] HA Forum. Providing Open Architecture High Availability Solutions, 2001.
[16] Robert Hanmer. Patterns for Fault Tolerant Software. Wiley Publishing, 2007.
[17] InternetWorld. Årets webbentreprenörer. http://internetworld.idg.se/2.1006/1.316986/arets-webbentreprenorer-claudijo-borovic-och-niclas-holm, 2010.
[18] Zhang Yi Jiang and Ivo Wetzel. Javascript garden. http://bonsaiden.github.com/JavaScript-Garden/, 2011.
[19] Swarma Limited. Benchmark testing nginx vs apache. http://blog.webfaction.com/a-little-holiday-present, 2008.
[20] Caolan McMahon. Unit testing in node.js. http://caolanmcmahon.com/posts/unit_testing_in_node_js, 2010.
[21] Kiran Nagaraja, Ricardo Bianchini, Richard P. Martin, and Thu D. Nguyen. Using fault model enforcement to improve availability. In Proceedings of the Second Workshop on Evaluating and Architecting System dependabilitY (EASY), 2002.
[22] David Oppenheimer, Archana Ganapathi, and David A. Patterson. Why do internet services fail, and what can be done about it? In Proceedings of the 4th conference on USENIX Symposium on Internet Technologies and Systems - Volume 4, USITS’03, pages 1–1, Berkeley, CA, USA, 2003. USENIX Association.
[23] Priya Narasimhan and Soila Pertet. Causes of failure in web applications. Technical report, Carnegie Mellon University, December 2005.
[24] Stefan Tilkov and Steve Vinoski. Node.js: Using javascript to build high-performance network programs. IEEE Internet Computing, 14:80–83, November 2010.