Integration of Web-Driver Technologies into a Design of the Content Aggregation System


University of Waterloo

Faculty of Engineering

Department of Electrical and Computer Engineering







Integration of Web-Driver Technologies into a Design of the Content Aggregation System










Wishabi

302 The East Mall, Suite 500

Toronto, ON, M9B 6C7, Canada













Prepared by

Satyam Gupta

ID 20370860

Userid s52gupta

2A

Computer Engineering

13 November 2013





Satyam Gupta

157 Cedarbrae Avenue

Waterloo, Ontario, Canada

N2L 4S1

November 13, 2013

Manoj Sachdev, Chair

Electrical and Computer Engineering

University of Waterloo

Waterloo, Ontario

N2L 3G1

Dear Sir:

This report, entitled "Integration of Web-Driver Technologies into a Design of the Content Aggregation System", was prepared as my 2A Work Report with Wishabi. This report is in fulfillment of the course WKRPT 201. The purpose of this report is to analyze different Web-Driver technologies and integrate the most suitable one into one of Wishabi's backend data intelligence systems, the Content Aggregation system.

Wishabi is working on generating next-generation digital circulars using patent-pending technologies, allowing consumers to get the richness of both online and offline marketing content and making their shopping experience more personalized and interactive.

The Development team, in which I was employed, is managed by David Wang and is primarily involved with driving the use of new technologies to create a fast, fluid and engaging user experience for the dynamic circular platform, injected with e-commerce and real-time content such as daily deals, social networking streams and 1:1 personalization.

I would like to thank Brendan Cone for his supervision and guidance during my work on this project. I also wish to acknowledge the brilliance of Elijah Andrews in helping with the investigation of the Web-Driver technologies, and for insightful design discussions for the Content Aggregation system.

Finally, I appreciate the report guidelines outlined by Douglas W. Harder, which taught me several useful tools for writing technical reports using Microsoft Word. I hereby confirm that I have received no further help other than what is mentioned above in writing this report. I also confirm this report has not been previously submitted for academic credit at this or any other academic institution.

Sincerely,

Satyam Gupta

ID 20370860



Contributions


The team I worked for was relatively small. It consisted of seven full-time software developers and three co-ops, including myself. The number of people involved in this specific project was limited to three: my supervisor, a fellow co-op and myself.

The development team is primarily focused on working with merchants to transform their digitized print into a personalized and interactive online experience. The different areas that the team works on are developing Wishabi's backend data intelligence architecture, driving the use of leading-edge technologies to create a digital circular platform capable of integrating all online and offline marketing content for merchants into one seamless experience, and taking this new approach to digital circulars onto various mobile platforms.


My tasks were to maintain and configure existing systems that supported the backend data intelligence architecture, develop and test new features to improve the internal workings of the Operations team, and resolve bugs and issues with the existing backend systems.

For the first two months of the term my responsibilities concerned Wishabi's Comparison Shopping platform, which required maintaining and developing Ruby code that would either index a merchant's website or integrate their data feeds into Wishabi's own data warehouse. This process would help online shopping customers find the best deals and compare products from different dealers, and provide useful analytics to merchants to help track customer engagement with their products. I was also given the authority to push and deploy changes to the production servers for these Comparison Shopping systems. The latter half of the term was directed more towards working with the Operations team at Wishabi, who are responsible for automating the production process of building the next-generation circulars. I met several feature requests to improve the interface used by the Operations team, making their daily tasks of automation and quality assurance more efficient. An interesting and challenging part of these responsibilities was having design discussions and code reviews with senior developers, working with team members to resolve merge conflicts in the codebase, and pushing changes to a common repository.


The relation between this report and my job is that the report discusses a design solution proposal to improve the functionality of one of the backend data intelligence systems, whilst maintaining its current state of reliability and performance. The project required extensive research on different technologies that were suitable to improve the quality of data being aggregated by the Content Aggregation system for digital circulars on the Wishabi platform. I was part of several brainstorming and design sessions that progressively improved the quality of the solution concept, thanks to open communication and constructive feedback between the members working on this project. I feel that this report has helped me to improve my professional communication skills with team members; apply engineering principles such as problem definition and scope; realize the importance of well-defined project requirements; learn effective ways to perform research; and generate and evaluate solution concepts. The report has also benefited me by improving my technical writing skills through following the guidelines for writing a work term report set by the ECE department, which I am sure can be used to add more insight to my solutions of engineering problems in academics.

In the broader scheme of things, the design proposed by this report will be used by a software developer to improve the Content Aggregation system by integrating a Web-Driver technology. This report provides useful comparisons between different technologies that were explored to advance the system, and proposes a suitable choice following an engineering analysis. This has saved valuable time for the developer(s) that work on this project by providing them with a document for reference when working on this project.


















Summary


The main purpose of this report is to document the analysis of different Web-Driver technologies and their integration into the Content Aggregation system, one of Wishabi's backend data intelligence systems. This involves the selection of a suitable Web-Driver technology and then proposing a re-design of the Content Aggregation system enabled with this new framework. This report is intended for readers with a basic understanding of Web Automation Frameworks and an interest in software design.


The major points covered in this document are as follows. The first section provides a background on Web-Drivers and an overview of the working of the Content Aggregation system to help define the scope of this project. The next section explains the various objectives and constraints that were outlined while working on this project. The third section analyzes various types of Web-Driver technologies against a set of predefined criteria to decide on the best option for use in the system. Finally, an overview of a proposed design solution that integrates the Web-Driver technology into the Content Aggregation system is given.


The major conclusions in this report are that the chosen Web-Driver technology would integrate seamlessly with the Content Aggregation system, involved the least installation overhead on a virtual Ubuntu server, and that the proposed design solution was able to meet all set objectives and constraints.


The major recommendations for the use of the Web-Driver technology would be to perform extensive load and performance testing on webpage loading, navigation, memory requirements and timing issues. It is also recommended that the design solution be gradually modified into a cleaner and more effective design as availability of time and resources in the Development team permits.











Table of Contents


Contributions

Summary

List of Figures

List of Tables

1 Introduction

1.1 Background

1.2 Project Scope

2 Requirements

2.1 Objectives

2.2 Constraints

3 Options and Analysis

3.1 Introduction to different Web Automation Frameworks

3.2 Criteria for selection of the Web-Driver Technology

3.3 Evaluating Solution Concepts

3.3.1 Capybara DSL

3.3.2 Selenium Web-Driver

3.3.3 Watir Browser API

3.4 Decision Matrix for selection of the Web-Driver Technology

4 Design of the Content Aggregation system

Conclusions

Recommendations

Glossary

References







List of Figures


Figure 1: The current Content Aggregation system Design

Figure 2: Three kinds of Web Automation Frameworks

Figure 3: Design of the Watir Integrated Content Aggregation System



List of Tables


Table 1: Evaluating Web-Driver Solution Concepts



1 Introduction

1.1 Background


A Web-Driver is a technological tool used in automation testing of web applications. Web-Drivers are usually part of Web Automation Test frameworks, which are designed to automate a suite of test cases on web applications for regression, smoke or sanity purposes [1]. Web-Drivers allow one to control a web browser and hence mimic actions that a user may perform on a webpage. Such actions include filling in web forms, navigating pages and searching for information on a webpage. It is very common for quality assurance teams in a company involved in web development to employ an automation framework which uses a Web-Driver at its core. There are different categories of Web-Drivers, each of which will be thoroughly investigated in this report before a suitable candidate is recommended.

1.2 Project Scope



The Content Aggregation system is one of the many in the Wishabi Flyer Administration application, a Ruby on Rails application responsible for the management of information relevant to digital circulars [2]. This system is responsible for gathering valuable content from merchants' websites so that Wishabi Flyer applications can display this data in flyer item pop-up windows on digital circulars.

The process of Content Aggregation rests on a well-developed system, as shown in Figure 1. The procedure of gathering information involves pre-processing, parsing, de-duplication, conversion and processing of data that is retrieved from a webpage(s). The Gatherer can be thought of as a supervisor of the entire aggregation process; it is a medium for each of the individual modules to communicate and exchange information with each other, and it provides valuable debugging information at each step. The input to the system is a simple data structure containing information on the different 'content types' that are to be gathered, with associated URLs. A content type is used to differentiate between the various kinds of product information, such as item reviews, features, product specifications or even related items, that are available on a product page belonging to a merchant's website. The URL is important because it points to the webpage that will be fetched by the Retriever module, as shown in Figure 1.
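As a rough illustration, the input data structure and the Gatherer's supervisory role described above might be sketched as follows. All names and stage bodies here are illustrative stubs, not Wishabi's actual code:

```ruby
# Illustrative sketch of the aggregation flow described above.
# Stage names mirror Figure 1; the bodies are stand-in stubs.

# Input: a simple structure mapping content types to associated URLs.
INPUT = {
  "item_reviews"   => "http://merchant.example.com/product/123#reviews",
  "specifications" => "http://merchant.example.com/product/123/specs"
}

# Each stage is modeled as a callable that transforms the data.
PIPELINE = [
  ->(data) { data },       # Preprocessor: may rewrite URLs per content type
  ->(data) { data },       # Retriever + Parser: fetch pages, extract content
  ->(data) { data.uniq },  # De-duplicator: drop repeated entries
  ->(data) { data }        # Converter + Processor: normalize and process
]

# The Gatherer supervises: it threads the data through every stage.
def gather(input, pipeline)
  pipeline.reduce(input.to_a) { |data, stage| stage.call(data) }
end

results = gather(INPUT, PIPELINE)
```

In the real system each module also reports debugging information back to the Gatherer at every step, which this sketch omits.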


Figure 1: The current Content Aggregation system Design.

The Retriever module takes advantage of the Mechanize library, widely used in Ruby to automate interaction with websites. Mechanize is excellent at serving its purpose with high speed and accuracy [3]. However, it has several limitations, one of which is the inability to render HTML content hidden behind dynamic webpages. JavaScript and Ajax are two web development techniques which allow the client side, i.e. a browser, to load dynamic content onto a webpage by communicating with web servers. Mechanize also does not have the capability of driving a web browser to replicate user actions in real time. It is clear that an alternate solution needs to be investigated which can overcome these shortcomings to increase the quantity of data gathered, without compromising the performance of the existing system. The scope of this project will be limited to discussing just the first two steps of the aggregation process, i.e. the Preprocessor and the Parser, and we will see that this will eventually lead us to an optimal solution.


The proposed system will be able, in effect, to deal with pages that are manipulated via JavaScript and (or) injected with asynchronously loaded data through Ajax. These changes must be easy to incorporate into the existing application, be backward compatible with technologies used in the existing system, retain a respectable level of performance when run remotely, be easy to maintain and scale to the future needs of the Content Aggregation system.

[Figure 1 diagram: the Gatherer coordinates the Retriever, Preprocessor, Parser, De-duplicator, Converter and Processor modules, taking a data structure of content to be gathered as input.]



We begin by listing the project objectives and constraints, followed by an analysis of possible Web-Driver technologies, and finally recommend a design to improve the Content Aggregation system.

2 Requirements

2.1 Objectives


1. The Content Aggregation system must be able to gather content hidden behind JavaScript walls. It should also be able to recognize dynamically loaded data on a webpage via the use of web technologies such as Ajax.

2. Merchants must be given the option to toggle between the current system and the improved Data Piping (Content Aggregation) system. This is important since not all merchants require the use of a Web-Driver technology to gather content from their website.

3. The system should be easy to maintain, and hence unit tests, "specs" in Ruby on Rails terminology, must be incorporated into each of its modules to improve the quality control levels [4]. Since Web-Drivers can be used to communicate with remote servers, logger information should be readily available for debugging purposes.

4. The new design should address some of the minor shortcomings encountered when configuring Content Aggregation for merchants by increasing the flexibility and extensibility of certain modules in the system.

2.2 Constraints



1. It is necessary to keep design changes to the Content Aggregation system itself to a minimum, since the current system is reliable and delivers results with high performance. Hence, it should be possible to revert back to the old Content Aggregation system by making very minimal changes to the application. An easy fail-back mechanism from the improved Data Piping platform to the existing system must exist, due to the high criticality index of the risks involved with integrating new Web-Driver technologies.

2. The system must be designed keeping the amount of changes to the existing specs at a minimum, since it is a very time-consuming and tedious process to modify each and every spec file in the current Content Aggregation system.

3 Options and Analysis


Perhaps one of the most crucial aspects of this project was the selection of an open-source library, a Web-Driver technology, developed in Ruby which would meet the first objective mentioned in Section 2.1. The differences between Web Automation frameworks are better understood when broken down into three categories. Figure 2 portrays the relative distinctness of each category from the perspective of a carnivore lover, and these differences are briefly explained below [5].

Figure 2: Three kinds of Web Automation Frameworks [5].

The first is a Web Driver API, which starts up a local web server to drive a web browser and allows the user of the API to control and navigate the desired browser. The second kind is a Browser API, which has a higher abstraction layer over the Web Driver API. It provides a lot more control and functionality to the user in the form of searching and waiting for browser elements and advanced interactions with the webpage, and it also supports multiple browsers [6]. The final category of web testing frameworks is the Web Form DSL, which is a very high level API that provides the user with specific methods to automate web forms and elements [5]. Each category of frameworks underwent an engineering analysis based on criteria which will be expanded upon in Section 3.2.


3.1 Introduction to different Web Automation Frameworks


Capybara is a Web Form DSL developed by a team known as Thoughtbot. The greatest flexibility provided by Capybara is the ease with which one can switch between different Web-Drivers, namely: rack, selenium and web-kit. During our investigation the web-kit driver took prime focus since it was completely headless, supported execution of JavaScript and provided wrappers for performing browser actions such as clicking links and buttons and filling in forms [7].

The second framework that we investigated was Selenium. The Selenium project began in 2004 by a member of ThoughtWorks and has been continually developed and improved, with notable contributions from a Google engineer two years later. This is a low level API with support extending across several programming languages, including Ruby, and running on almost every browser possible [8].

Finally, the browser API that was explored was the Watir Web-Driver, short for Web Application Testing in Ruby (pronounced 'water'). It has an active and growing community behind it and hence continually expands its support across various browsers. The Watir API consists of open source Ruby libraries under the BSD license [9].

3.2 Criteria for selection of the Web-Driver Technology


We assume that the choice of a Web-Driver technology would not impact the software re-design aspect of the Content Aggregation system. Three primary criteria are listed below to allow for comparison between the three different options that were introduced in the previous sub-section.

1. The primary criterion was the reliability and quality assurance that each Web-Driver provided. A test suite of scripts was created which served as important smoke and sanity checks for each of the technologies used. These Ruby scripts ensured that the API allowed us to explore the JavaScript space, perform Ajax calls and hence aggregate dynamically loaded content.

2. The second criterion was the installation of the Web-Driver into the Wishabi Flyer Administration application. This involves compatibility with the version of Ruby on Rails used by the application, installation of required gems (libraries in Ruby) and their development and runtime dependencies. Another point of interest here was the ease of setup involved when each framework was set up on a virtual Ubuntu server, to estimate the installation overhead of each Web-Driver on production servers.

3. The final criterion is the impact on performance of the Content Aggregation system by each of these Web Automation frameworks, primarily in terms of speed of execution of the test suite of Ruby scripts.


Now, we will discuss how each potential technology fared across these criteria.



3.3 Evaluating Solution Concepts

3.3.1 Capybara DSL


Beginning with Capybara, it was observed that the use of this API against the suite of test cases did not always give accurate Data Piping results. Despite being able to explore the JavaScript space and perform simple Ajax actions on a webpage, this DSL was unable to provide consistent results across several runs of the same suite of tests. To understand why this was the case, one must understand that remote servers can vary their response time per browser request depending on factors which are outside the scope of the scripts in the test suite. It is then logical to assume that the Capybara framework would provide some sort of mechanism to handle this unknown 'wait' period involved in communication with servers. The primary cause of unpredictable failures in the test suite was the poor definition of this 'wait' in the Capybara API, which led to erroneous and duplicate data being gathered. These situations were encountered when race conditions occurred due to the execution of a portion of a Ruby script before a browser request was completely processed by the remote server. Most web testing frameworks, including Capybara, provide methods which allow the user to assert the validity of expected data and wait for a default time period before declaring a failed test. However, since our system was designed to gather dynamic data on a webpage, there was no elegant solution to this problem of gathering invalid content via the use of assertion of data.
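The 'wait' problem discussed above can be illustrated with a plain-Ruby polling helper. This is only the rough shape of the mechanism these frameworks provide internally, not Capybara's actual implementation:

```ruby
# A minimal polling wait: re-check a condition until it holds or a
# timeout expires. Note that a fixed timeout cannot distinguish
# "content not loaded yet" from "content legitimately absent",
# which is the root of the flaky, duplicated results described above.
def wait_until(timeout: 2, interval: 0.05)
  deadline = Time.now + timeout
  loop do
    return true if yield
    return false if Time.now >= deadline
    sleep interval
  end
end

# Simulated page whose dynamic content "arrives" on the third poll.
polls = 0
content_present = -> { (polls += 1) >= 3 }

ok      = wait_until { content_present.call }   # condition met in time
gave_up = wait_until(timeout: 0.2) { false }    # times out, returns false
```

When the condition never becomes true, the helper can only give up after the timeout, and a data-gathering system, unlike a test suite, has no expected value to assert against.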

Installation and setting up of the development environment for Capybara was seamless, both locally and on the Ubuntu server. The overhead seen in running Capybara on the server was installing a required browser such as Firefox, and an X server such as Xvfb to suppress any visible graphical application windows [10]. The performance of this library in terms of speed was found to be consistently very high due to the nature of the headless webkit gem used, which runs a browser without a GUI. It is important to consider that Capybara-webkit was relatively new and unstable, only a month old at the time this report was written, and so still in developmental stages with a competent but comparatively small community driving it [7]. This was a qualitative reason taken into account when making a final decision on the choice of the Web-Driver technology.

3.3.2 Selenium Web-Driver


Moving on to the Selenium Web-Driver, the quality of data seen upon execution of scripts in the test suite was significantly higher when compared to the Capybara DSL. Race conditions were encountered far less frequently than with the Capybara gem, and the API had an in-built mechanism to handle the 'wait' period more elegantly. However, the waiting mechanism for Selenium also continues to have issues with loading HTML elements after JavaScript execution [1].

It is important to note that Selenium has been around for a much longer period of time, with a great community of developers behind it to maintain and improve the API.

This low level API required one to install a Selenium server in the form of a Java jar file, along with the necessary Ruby gem installations [11]. Hence an extra overhead would be involved in setting up the web testing framework on production servers. It was also observed that running the smoke and sanity tests using Selenium took significantly longer. This may be attributed to the fact that a real browser was opened when a Selenium Web-Driver instance was created, and so the time to load the GUI increased the setup period involved in running scripts.



3.3.3 Watir Browser API


The Watir Web-Driver was seen to match the reliability of the Selenium API in terms of its ability to execute JavaScript and Ajax. The quality of data seen on repeated runs of scripts in the test suite using Watir was marginally better than that seen with Selenium. This can be attributed to the advanced waiting mechanism used by the Watir API, which ensured that any requests sent by a Watir browser to a remote server were completed before control passed back to the execution of a Ruby script [6]. This further reduced the possibility of encountering race conditions when the Content Aggregation system would run to collect data. Watir again required one to install Ruby gems for the framework, along with all gem dependencies involved [12]. In addition, when run on an Ubuntu server, it is necessary to install a browser such as Firefox (Watir uses this by default) so that any application using Watir can be run. Note that Watir allows the user to use a headless gem which suppresses the browser window on the server side. This makes its performance almost comparable to Capybara, while also having an excellent API to execute JavaScript and Ajax calls.

3.4 Decision Matrix for selection of the Web-Driver Technology


Based on the above analysis of the three different technologies, a decision matrix has been created for every criterion as listed in Section 3.2. Table 1 summarizes the discussions in Sections 3.3.1 to 3.3.3 by assigning numerical values to each solution for every pre-established criterion. We see that the Watir Browser API has the highest score, and hence this was the technology chosen to implement the changes required to meet the objectives in Section 2.1 and improve the Content Aggregation system in the Wishabi Flyer Administration application.

Table 1: Evaluating Web-Driver Solution Concepts.

Criteria                              | Weight | Selenium Web Driver | Watir Browser API | Capybara DSL
Reliability/Quality Control           | 40%    | 8                   | 8.5               | 6
Installation into current application | 30%    | 5                   | 9                 | 8
Performance                           | 30%    | 7                   | 8                 | 9
Total                                 |        | 6.8                 | 8.5               | 7.5
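The weighted totals in Table 1 can be reproduced with a short calculation, using the weights and scores taken directly from the table:

```ruby
# Weighted decision matrix from Table 1.
WEIGHTS = { reliability: 0.4, installation: 0.3, performance: 0.3 }

SCORES = {
  "Selenium Web Driver" => { reliability: 8,   installation: 5, performance: 7 },
  "Watir Browser API"   => { reliability: 8.5, installation: 9, performance: 8 },
  "Capybara DSL"        => { reliability: 6,   installation: 8, performance: 9 }
}

# Total per option: sum of score * weight over all criteria.
totals = SCORES.transform_values do |scores|
  scores.sum { |criterion, score| score * WEIGHTS[criterion] }.round(2)
end
# e.g. totals["Watir Browser API"] => 8.5
```

The highest weighted total belongs to the Watir Browser API, matching the conclusion above.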


4 Design of the Content Aggregation system


To understand the proposed design we first need a high level understanding of the current system as it was. A simple example will be used to explain the entire software design. The process begins with the Gatherer, a module designed to co-ordinate the different tasks that the system would perform. A sample digital flyer could have tens to a hundred flyer items within it, and Content Aggregation would run on every flyer item having a product page on the merchant's website associated with it. The product URL, along with associated content types, is passed into the Gatherer to begin Content Aggregation.

The first task, to preprocess URLs for every content type, primarily accounts for cases where certain content types have information on a URL different from the one passed into the Gatherer. So, the Preprocessor may modify the input URL or completely override it with a new URL as required for each content type. Hence, one would expect that the output of the Preprocessor would be a data structure container of URLs and content types, with the possibility that certain URLs may differ from the URL given as input to this module. This output is passed on to the Parser, which is responsible for retrieving the webpage and collecting information pertaining to each of the content types using Ruby libraries such as Mechanize and Nokogiri. This information is then forwarded on to different modules where conversion of data into a standard format, de-duplication and processing of data are taken care of.

The following design solution, shown in Figure 3, was brainstormed over several design sessions, keeping the objectives and constraints outlined in Section 2 in perspective. Note that certain details for the Preprocessor, Parser and Gatherer displayed in Figure 1 have been omitted here, but nevertheless still exist in the proposed solution. We begin by giving a high level overview of the new Content Aggregation system. First, a Boolean flag would need to be created at the merchant level, which we call JS_Enabled for the sake of discussion. This flag would allow merchants to toggle between the current and the proposed system. In the flow chart below, all modules of the Data Piping system are shown and would be initialized in the Gatherer. Every module communicates and exchanges information with the Gatherer, similar to Figure 1. It should be noted that the JS Preprocessor interacts with both the Retriever and the Watir Web-Driver. The Watir Web-Driver should not be confused as a separate module; rather, it is shown to highlight the fact that the JS Preprocessor simply makes use of this new technology.
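The toggle described above can be sketched as a simple dispatch on the merchant flag. The class names follow the report's terminology, but the method bodies here are stubs, not the real pre-processing logic:

```ruby
# Sketch of the JS_Enabled toggle: the Gatherer picks the
# pre-processing path based on a per-merchant Boolean flag.
class Preprocessor
  def run(url)
    { url: url, via: :static }   # existing Mechanize-based path
  end
end

class JSPreprocessor
  def run(url)
    { url: url, via: :watir }    # would drive a headless Watir browser
  end
end

Merchant = Struct.new(:name, :js_enabled)

def preprocessor_for(merchant)
  merchant.js_enabled ? JSPreprocessor.new : Preprocessor.new
end

static  = preprocessor_for(Merchant.new("A", false)).run("http://example.com")
dynamic = preprocessor_for(Merchant.new("B", true)).run("http://example.com")
```

Because both paths expose the same interface, reverting a merchant to the old system is a matter of flipping the flag, which is what the fail-back constraint in Section 2.2 calls for.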














Figure 3: Design of the Watir Integrated Content Aggregation System.

[Figure 3 diagram: the Gatherer checks JS_Enabled? and routes through either the JS_Preprocessor and JS_Parser (backed by the Retriever and the Watir Web-Driver) or the existing Preprocessor and Parser, before the Deduplicator, Converter and Processor.]


The proposed design ensures that the existing Preprocessor, Parser and their respective spec files would not have to be modified. In addition, the new JavaScript modules merge back into the existing system once parsing of data is completed, which minimizes the amount of changes that need to be made to the existing system. We will now provide an in-depth analysis of each module that needs to be introduced.

1. JS Preprocessor

On initialization, the Preprocessor is given the merchant, a flyer item and an options hash as input. The flyer item contains a Content Aggregation URL hash, with URLs and arrays of content types as key-value pairs. Since we are dealing with a Preprocessor which should handle both JavaScript and regular pre-processing, it was thought necessary to differentiate between the ways each content type is pre-processed.

Here we would create two hashes, both holding key-value pairs of URLs and content types, but the first would comprise the JavaScript-enabled content types and the latter the JavaScript-disabled types.
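A minimal sketch of this split follows, assuming an illustrative set of JavaScript-enabled content types; in practice that set would come from merchant configuration.

```ruby
# Hypothetical set of content types known to require JavaScript;
# in the real system this would come from merchant configuration.
JS_ENABLED_TYPES = [:daily_deals, :social_stream].freeze

# Split a { url => [content_types] } hash into JavaScript-enabled
# and JavaScript-disabled halves, as described above.
def split_url_hash(url_hash)
  js_hash, plain_hash = {}, {}
  url_hash.each do |url, types|
    js, plain = types.partition { |t| JS_ENABLED_TYPES.include?(t) }
    js_hash[url]    = js    unless js.empty?
    plain_hash[url] = plain unless plain.empty?
  end
  [js_hash, plain_hash]
end
```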
From here, a content type can be pre-processed in one of the following two methods.

i. An existing Watir browser session is used to navigate to the URL associated with that content type. This method would now have access to a headless browser instance from the Watir Web-Driver API, which can be used to click links and buttons and to submit forms, changing the state of the webpage to different 'snapshots' of the page. A snapshot may be defined as the saved state of a webpage in the JS Preprocessor once certain scripts have been executed and/or Ajax calls have been made. This snapshot may be stored in a suitable format: a Nokogiri node collection, a Watir HTML collection of nodes, or simply a temporary HTML file. The best option is to use a Nokogiri node collection, a data type of the Nokogiri API, as the final form of the snapshot because it is most compatible with the existing system.
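The navigation step might look roughly as follows. The "Show all deals" link is a placeholder for whatever element changes the page state; in the real system `browser` would be a `Watir::Browser` instance and the returned HTML would be wrapped with `Nokogiri::HTML` to obtain the node collection.

```ruby
# Sketch of preprocessing method (i). `browser` is assumed to respond
# to Watir's #goto, #link and #html; the "Show all deals" link is a
# placeholder for whatever element drives the page to the desired state.
def take_snapshot(browser, url)
  browser.goto(url)                         # navigate to the content URL
  link = browser.link(text: "Show all deals")
  link.click if link.exists?                # change the state of the page
  browser.html                              # the saved state, i.e. the 'snapshot'
end
# In the real system: Nokogiri::HTML(take_snapshot(browser, url))
```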

Before discussing what exactly the JS Preprocessor module outputs, the notion of a 'Snapshot Struct' is introduced. A Snapshot Struct is a Ruby Struct containing three attributes, namely the URL, the snapshot and an options hash. This Struct could be created for each pair of URL and content type, and the preprocess method discussed above would be passed the newly created Snapshot Struct instead of simply a URL, as in the current system. The Snapshot Struct gives the developer the additional flexibility to assign each snapshot a certain 'type' in the options hash, which serves as valuable information to the JS Parser module. The 'type' of a snapshot is necessary in cases where a content type has multiple snapshots, with any two snapshots not having the same HTML structure and layout. Such distinguished snapshots are referred to as different 'types' and can be accounted for in the options hash of the Snapshot Struct.



So, the JS Preprocessor module outputs an array of Snapshot Structs, where the options hash provides a medium to inform the JS Parser module of the type of snapshot it encountered, allowing developers to write more elegant and readable code.
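Declared in Ruby, the Snapshot Struct described above might look like this; the :grid_layout value is a hypothetical layout 'type', and a string stands in for the Nokogiri node collection.

```ruby
# The three-attribute Snapshot Struct from the text. The snapshot
# attribute would hold a Nokogiri node collection in the real system;
# a string stands in for it here.
SnapshotStruct = Struct.new(:url, :snapshot, :options)

snap = SnapshotStruct.new(
  "http://merchant.example/deals",
  "<div class='deal'>...</div>",     # placeholder for parsed page nodes
  { type: :grid_layout }             # hypothetical snapshot 'type'
)
```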

ii. This method to preprocess a content type is used if it is known that all data to be gathered is rendered without any JavaScript. Here again, a newly created Snapshot Struct would be passed to the regular JS Preprocessor methods, which would allow the developer to create an array of snapshots if required. The Retriever module would be used to fetch the pages associated with each URL in the array and parse them into Nokogiri node collections. Here again, the JS Preprocessor returns an array of Snapshot Structs, where the snapshots are retrieved webpages parsed into Nokogiri elements, the URLs are the pre-processed URLs for that content type, and the options hash may contain information on whether different 'types' of snapshots were observed by the developer for the content type.
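Method (ii) could be sketched as below, assuming a Retriever object whose fetch method returns a parsed page (a Nokogiri node collection in the real system); all names here are illustrative.

```ruby
SnapshotStruct = Struct.new(:url, :snapshot, :options)

# Build one Snapshot Struct per pre-processed URL via the Retriever;
# `retriever` is assumed to respond to #fetch(url) and return the
# parsed page (Nokogiri elements in the real system).
def preprocess_without_js(retriever, urls, options = {})
  urls.map { |url| SnapshotStruct.new(url, retriever.fetch(url), options) }
end
```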


2. JS Parser

On initialization, the JS Parser is given the merchant, the flyer item, the hash of pre-processed Snapshot Structs and an options hash. Each pair of Snapshot Struct and content type in the pre-processed hash would be parsed using logic very similar to the existing system. Every Snapshot Struct is parsed for the content type that it contains using the JS Parser methods. The options hash in the Snapshot Struct gives information on whether different types of snapshots were created in the JS Preprocessor, hence meeting the objective of adding flexibility to the Parser module. The gathered content for all snapshots is stored in a format compatible with existing tables in the database, which allows the system to merge back into the Deduplication module.
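Dispatching on the snapshot 'type' inside the JS Parser might look like this; the layout names and the per-layout parsing methods are hypothetical.

```ruby
SnapshotStruct = Struct.new(:url, :snapshot, :options)

# Choose a parsing routine based on the snapshot 'type' recorded by
# the JS Preprocessor; the else branch handles untyped snapshots.
def parse_snapshot(snap)
  case snap.options[:type]
  when :grid_layout then parse_grid(snap.snapshot)
  when :list_layout then parse_list(snap.snapshot)
  else parse_default(snap.snapshot)
  end
end

# Placeholder per-layout parsers; each would query the Nokogiri
# nodes for that layout's HTML structure in the real system.
def parse_grid(nodes);    { layout: :grid,    items: nodes }; end
def parse_list(nodes);    { layout: :list,    items: nodes }; end
def parse_default(nodes); { layout: :default, items: nodes }; end
```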

Both the JS Preprocessor and JS Parser were given an options hash as input. The options hash contains a key which can activate each module to log important debugging information and messages in case of unsuccessful results. Certain cases to be accounted for could include poor network connections, invalid server responses, Watir browser session timeouts and invalid XPath queries, all of which could lead to undesirable content being gathered.
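A sketch of this debug key follows, using a plain Ruby Logger; the :debug and :logger key names and the failure message are assumptions.

```ruby
require "logger"

# Log a failure only when the options hash enables debugging; the
# :debug and :logger key names are assumptions, not part of the design.
def log_failure(options, message)
  return unless options[:debug]
  logger = options[:logger] || Logger.new($stdout)
  logger.warn(message)
end
```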



Conclusions


From the analysis in the report body, it was concluded that the Watir Browser API was the most suitable technology for integration into the Content Aggregation system.

Section 3.3.3 discussed the installation of the Watir Web-Driver framework, both in terms of compatibility with the Wishabi Flyer Administration application and in terms of setup on a virtual Ubuntu server. It was seen that the Watir Web-Driver Ruby gems would integrate seamlessly with the Content Aggregation system, and that installing the framework on the server involved the least amount of overhead. Watir Web-Driver was also the most reliable and quality-assured of the three technologies investigated in terms of aggregating valid content, owing to the built-in advanced waiting mechanisms in the framework.

Table 1 was used to evaluate the different Web-Driver solution concepts based on a set of predefined weighted criteria. The decision matrix suggested that the most viable option would be to select the Watir Web-Driver framework to build the JavaScript-enabled Content Aggregation system.

Section 4 discussed the software design changes that were needed to incorporate the addition of Web-Driver technology to preprocess and parse JavaScript content. This re-design of the Content Aggregation system also met the objectives and constraints outlined in Section 2.

Recommendations


Based on the analysis and conclusions in this report, it is recommended that Watir Web-Driver be integrated into the current Content Aggregation system to enable gathering of JavaScript content on merchants' websites. Recommended areas of improvement are:

1. Extensive load and performance testing of the Watir Web-Driver on page loading, navigation, memory requirements and timing issues, using the Watir Web-Driver performance gem.

2. The design of the JavaScript-enabled Content Aggregation system should be modified to have only one Preprocessor and one Parser module. The reason for the fork in the design solution was to allow a smooth transition of all specs in the current Content Aggregation system onto the new system. Existing spec files may be put into a hypothetical priority queue, to be modified and transferred into the new system according to the availability of developer time and resources and the urgency of relocating a merchant onto the new JavaScript-enabled Data Piping platform. Once this process is completed, a cleaner design solution will need to be developed which merges the JS modules with their existing counterparts.






























Glossary


API: Application Programming Interface; a specification intended to be used as an interface by software components to communicate with each other.

BSD: Berkeley Software Distribution.

DSL: Domain Specific Language.

ECE: Electrical and Computer Engineering.

GUI: Graphical User Interface; allows users to interact with electronic devices through images rather than text commands.

Hash: a collection of key-value pairs, similar to an array except that indexing may be via arbitrary keys of any object type.

Headless: an attribute of Web-Drivers which allows a browser to run on a machine without any graphical user interface.

HTML: HyperText Markup Language; the main markup language for webpages.

Nokogiri: a Ruby library that can search documents via the use of XPath selectors.

Struct: a convenient way to bundle a number of attributes together, using accessor methods.

URL: Uniform Resource Locator; a specific character string that constitutes a reference to an Internet resource.

Wrapper: a function in a computer program whose main purpose is to call a second function.

WKRPT: Work-term report.

XML: Extensible Markup Language; a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

XPath: a language used to navigate through elements and attributes in an XML document.

Xvfb: an X server that performs all graphical operations in memory, not showing any screen output.



