EDViC
Acknowledgements

I would like to thank my supervisor Goran Nenadic for his support and feedback throughout the course of the project, and for the energy invested into ensuring I push the limits.
I would also like to thank George Karystianis for his much valued input, and for sharing with me the domain-specific knowledge needed in this project.
Finally, I would like to thank Karl Sutt for his constant help and motivation, and for always being the person I can bounce ideas off.
Contents

1 Introduction
1.1 Project aim and objectives
1.2 Report structure

2 Background and literature review
2.1 The state of epidemiological literature
2.1.1 PubMed
2.1.2 Organising epidemiological data
2.1.3 Text mining tools
2.2 Visualisations
2.2.1 Word clouds
2.2.2 Streamgraphs
2.3 Building interactive web applications
2.3.1 Single-page web applications
2.3.2 D3.js

3 Identifying requirements
3.1 Functional requirements
3.2 Non-functional requirements
3.3 Use cases

4 Design
4.1 System architecture
4.1.1 The data
4.1.2 The API
4.1.3 The Backbone MVC framework in the front-end
4.1.4 The Graphical User Interface
4.2 Dimensions in the word cloud
4.3 Visual design
4.3.1 The Search Results page: design iterations
4.3.2 Visual consistency

5 Implementation
5.1 Choice of tools, languages and external libraries
5.2 Manipulating the data
5.2.1 Main API endpoint – retrieving a list of related entities
5.3 Implementing visualisations
5.3.1 Extending D3.js
5.3.2 Interactive context menus for each term in the cloud
5.3.3 Normalising font sizes
5.4 Coping with loading time
5.4.1 Asynchronous events
5.4.2 Infinite scroll for articles list
5.4.3 Autocompletion for search terms
5.4.4 Caching
5.4.5 Displaying only selected categories in the cloud upon load
5.5 Testing
5.5.1 Keeping track of test cases and logging issues
5.5.2 Testing activity

6 Results and evaluation
6.1 The final product – EDViC
6.1.1 The Welcome page
6.1.2 The Search Results page
6.1.3 Additional pages
6.2 Evaluation
6.2.1 Task completion
6.2.2 General discussion
6.3 Meeting the requirements
6.4 Known issues
6.4.1 Speed issues
6.4.2 Suitability of data used
6.4.3 User experience

7 Conclusions
7.1 Lessons learned
7.2 Future work
7.3 Concluding remarks

A Endpoints in the API
B System validation test suite
Chapter 1
Introduction
When looking at scientific publishing, it can be seen that the rate of growth compared to its centuries-old history is rising rapidly – it has been estimated that about 60 000 academic journals existed between the 17th and 20th centuries, while this number was reported to be nearly a million in 2002 (Larsen and von Ins, 2010). Within the sciences, biomedicine is a field that has especially grown in importance since the Second World War, when the emergence of modern medicine and the intensification of research into the life sciences took place, accompanied by growing investment in the field (Quirke and Gaudillière, 2008). While the abundance of new publication channels alongside traditional ones, the most prominent and accessible of which is the Internet, has indeed increased the availability of all this data, it simultaneously creates problems with navigating, organising and making sense of it. For scientists in biomedicine, maintaining an up-to-date overview of breakthroughs, discoveries or simply current trends in a field of interest is, under these conditions, severely hampered.
1.1 Project aim and objectives
This project attempts to address the issue of navigating the ever-growing amount of unstructured data, focusing especially on epidemiology. Students as well as graduates should be able to conduct research in their specific fields of interest in a way that reveals insight into current trends and topics. This can be achieved by providing a tool for intuitively interacting with and exploring data about existing epidemiological literature. The aim of the project is thus formulated as follows: to design, implement and evaluate a system that responds to searches for epidemiological concepts by providing a means of exploring relevant data aggregations.
Following from this, a set of objectives was identified. These objectives were reviewed throughout the course of the project in response to academic and technical research, and finally defined as follows:

1. Allow users to perform searches for epidemiological concepts and entities that retrieve useful information on the state of those concepts as represented in the literature.
2. Use data visualisations to display search results, and add an interactive element to enable intuitive exploration of concepts.
3. Accommodate customisation of the search results to fit user-specific needs.
4. In order to take advantage of a wide range of platform-independent technologies, realise the project as a web application, thus allowing maximal accessibility.
5. To cope with the abundance of data without compromising the insightfulness of the results, place high priority on the simplicity and clarity of the user interface.
1.2 Report structure
This report aims to give an account of how and to what extent the objectives described have been achieved, discussing the full project from conception to delivery. The topics described in the chapters that follow are outlined below.
Chapter 2 – Background and literature review
This chapter describes the preliminary research conducted in order to define the most relevant set of specific requirements for this project, and to gain the necessary domain knowledge for making a relevant contribution. Both existing work on the topic and studies about different types of visualisations are analysed, and a technical review is given.
Chapter 3 – Identifying requirements
Taking the research findings into consideration, a more specific set of requirements for the application is outlined in Chapter 3.

Chapter 4 – Design
Chapter 4 explains the design decisions made throughout the course of the project. Visual design and the user experience, determining the optimal characteristics of the visualisations, and the system architecture are discussed in detail.

Chapter 5 – Implementation
A more detailed description of the implementation is given in this chapter, along with key challenges faced and how they were overcome. An overview of the choice of tools and technologies used for implementation is also provided.

Chapter 6 – Results and evaluation
This chapter presents the resulting product and discusses to what extent the initial objectives have been achieved, and whether what has been built satisfies the overall goal of the project.

Chapter 7 – Conclusions
The final chapter outlines possible future work and provides a personal reflection on the course of the project and the lessons learned as a result.
Chapter 2
Background and literature review
To guarantee relevance in defining the specific requirements of the project, comprehensive preliminary research was performed. This chapter provides both a review of the current state of epidemiological literature and how issues in the field have been addressed, and an analysis of different types of data visualisations, with specific examples given.
2.1 The state of epidemiological literature
Closely related to biomedicine is the field of epidemiology, which attempts to explore the causes and impacts of health problems and diseases in populations. One of the prevailing health problems of the 21st century is obesity, and as such, epidemiological studies related to obesity have grown in abundance – more than 20 000 such studies have been published (Karystianis et al., 2013). Other diseases characteristic of modern society and widely studied in the epidemiological literature include, for example, diabetes, cardiovascular disease, and viral diseases. The following sections discuss how the growing amount of epidemiological data surrounding these and other topics is collected and curated, and how the issue of navigating it has been addressed by different text mining tools.
2.1.1 PubMed
PubMed (www.ncbi.nlm.nih.gov/pubmed) is the key online database to ac-
cess biomedical and life sciences literature,maintained by the NCBI (National
Center for Biotechnology Information).According to NCBI,it contains more
than 22 million citations fromMEDLINE and other life science journals.In ad-
dition to providing online searches for publications based on advanced search
criteria,PubMed (and other NCBI databases) can be accessed systematically via
Entrez Programming Utilities (E-Utilities),making it especially useful when de-
veloping text mining and information retrieval tools.In addition to allowing
requests for specific properties of a publication (e.g.,the author,journal,pub-
lisher),it also provides functions to search the database for articles based on a
given term.
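As an illustration of why this suits programmatic access, the E-Utilities are plain HTTP endpoints returning structured results. The sketch below shows a hypothetical client-side ESearch query (the endpoint and parameters are the standard public ones; the callback body is illustrative only, not code from this project):

// Illustrative sketch: querying E-Utilities' ESearch for PubMed IDs of
// articles matching a term. Error handling is omitted for brevity.
$.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
      {db: "pubmed", term: "obesity", retmax: 20},
      function (xml) {
        // matching PubMed IDs are returned inside <IdList><Id>...</Id>
        var ids = $(xml).find("IdList Id").map(function () {
          return $(this).text();
        }).get();
        console.log(ids);
      });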
2.1.2 Organising epidemiological data
A notion that makes the task of systematic information extraction from epidemiological studies more conceivable is that they follow a relatively structured form to improve internal collaboration in the field (Karystianis et al., 2013). Most studies include common elements to characterise the findings and the study itself; these elements are usually outcome, exposure, study design and population, but can also include covariate and effect size.
An outcome refers to a consequence of some exposure to a population, such as the tendency of smoking (exposure) to result in lung cancer (outcome). The population in a given study is the group of people that the outcomes and exposures apply to, and can include properties such as gender, age, ethnicity and other demographic information. The study design describes the methodology or approach taken, or the protocol followed during the study; examples include observational studies and cross-sectional studies. Finally, covariates refer to variables that may further affect an outcome, and effect size is used to illustrate the extent of a phenomenon. As an example, Table 2.1 illustrates some of the epidemiological characteristics extracted from an article by a system developed by Karystianis et al.
In addition to these standard elements, another useful tool that aids knowledge extraction and representation in epidemiology (and biomedicine in general) is the Unified Medical Language System (UMLS). UMLS is developed by the US National Library of Medicine, and provides a set of hierarchical biomedical vocabularies in an attempt to standardise annotations in the domain, integrating over 2 million concepts and 12 million relations among these concepts (Bodenreider, 2004). As an example, terms appearing in brackets in Table 2.1 indicate the appropriate UMLS group for a given epidemiological element.

Publication title:  “Prevalence and association between obesity and metabolic syndrome among Chinese elementary school children: a school-based survey.”
Exposures:          childhood obesity (DISORDERS), body mass index (PHYSIOLOGY)
Outcomes:           obesity (DISORDERS), metabolic syndrome (DISORDERS)
Population:         1844 children aged 7-14 years (936 males and 906 females)
Study design:       cross-sectional study

Table 2.1: Epidemiological characteristics extracted from an article by the system developed by Karystianis et al. (terms in brackets indicate UMLS groups)
2.1.3 Text mining tools
In order to address the need to organise and make sense of the vast amount of data in the biomedical and epidemiological domain, a range of text mining applications and services are available. Karystianis et al. (2013) have developed a system that automatically extracts from given publications the epidemiological study characteristics described in the previous section. For each epidemiological role, the system stores metadata about the article (year, PubMed article ID), and extracts various additional information related to the specific epidemiological characteristic. Table 2.2 provides a complete overview of the data that the system identifies.
Outcome          Exposure          Covariate          Study design         Population          Effect size

Partially overlapping fields:
pmid             pmid              pmid               pmid                 pmid                pmid
year             year              year               year                 year                year
outcome          exposure          covariate          type_of_study        population          effect_size_concept
other_outcome    other_exposure    other_covariate    other_studydesign    other_population    -
umls_group       umls_group        umls_group         -                    -                   -
umls_cat         umls_cat          umls_cat           -                    -                   -

Role-specific fields:
-                -                 -                  case_control_type    age                 effect_size_number
                                                      treatment_response   stage               effect_size_concept
                                                      type_of_review       gender
                                                      time_attribute       nationality
                                                      cohort_type          ethnicity

Table 2.2: Metadata associated with each epidemiological characteristic as represented by the system developed by Karystianis et al. (pmid refers to PubMed ID)
Other examples of text mining tools in the biomedical domain include BioContext (www.biocontext.org; see Figure 2.1), which introduces itself as “a text mining system for extracting information about molecular processes in biomedical articles”; TerMine (www.nactem.ac.uk/software/termine; see Figure 2.2), which highlights genes and other medical terms in pieces of text that users can submit; and WikiPain (www.wiki-pain.org; see Figure 2.3), which presents users with detailed information about “molecular interactions and single events relevant to pain that have been automatically extracted from all of the biomedical literature”. While all these tools offer functionality for making sense of the knowledge represented in biomedical publications of a specific narrow scope, there does not seem to be one that provides a way to aggregate and explore very general data about concepts and trends in the available literature.
ALTHOUGH THE FUNCTIONALITY of the text mining tools examined was different from the objectives of this project, the applications did provide valuable examples of the norm for representing and interacting with biomedical data. Chiefly, the author experienced first-hand that although navigating dense, information-packed tables may be suitable for retrieving specific pieces of information, such interfaces are not likely to be useful in situations where a general overview of a topic is required. As illustrated in Figures 2.1, 2.2 and 2.3, tables appear less comfortable to scan: they require either that the whole of the contents be read through, or that the content be organised in some layout that enables simple referencing of specific pieces of information (such as alphabetical sorting). In order to formulate a solution to this issue and identify means of representing data in a more general way, different visualisation techniques were investigated and analysed.
Figure 2.1: The search results page on BioContext
Figure 2.2: Results of input text analysis on TerMine
Figure 2.3: A wiki page on WikiPain
2.2 Visualisations
While the text mining tools mentioned in section 2.1.3 used tables to represent data and search results, the context of this project requires a focus on visualisation techniques that offer more scannability, and means of casually browsing and exploring data. Because a core objective is the exploration of epidemiological concepts, and users would rarely need to navigate numbers, the appropriate visualisations are those that represent textual information.
Examples of the visualisation types considered include traditional bar charts, bubble charts and line graphs, and the more modern and experimental word trees, word clouds and streamgraphs. All present various unique properties, but the author eventually decided on using word clouds and streamgraphs, because of both the requirements set for this project and the nature of the data to be visualised. Word clouds would be suitable for displaying related concepts in different publications, and streamgraphs could be used to illustrate the popularity of concepts over time. The following sections describe these types of visualisations in more detail, and provide further analysis of their use.
2.2.1 Word clouds
The most basic word clouds (also referred to as tag clouds or term clouds depending on the context) are “clouds” or groups of words arranged randomly, where words with a higher frequency are displayed in larger sizes, making it easier to spot more popular terms. Depending on the implementation, colour, font weight and positioning may also be used to indicate other properties of the words. This makes word clouds suitable for representing textual data that is characterised by some ranking, such as search results ordered from most to least relevant.
Historically, tag clouds first emerged as “tagging” tools on community-oriented websites (e.g., Delicious and Flickr) to enable users to present their interests compactly (Lohmann et al., 2009). It has been claimed that their use thus far has primarily been popular because of simple attractiveness and the pleasure of illustrating a “portrait of one’s interests” in an unconventional way (Feinberg, 2010). In addition, the use of tag clouds has been said to “convey a sense of activity in a Web community” (Lohmann et al., 2009). Utilising word clouds as a search interface for information retrieval, however, has been less popular. This may be because they fail to present data in a systematic way, which makes it less efficient for the user to quickly find what they are looking for (Kuo et al., 2007; Clough and Sen, 2008). Nonetheless, several studies do support the use of word clouds in situations where detailed information retrieval is not a priority, and the goal is instead to provide a very general overview of a domain. For example, results have indicated that tag clouds are scanned rather than read (Feinberg, 2010); that they are useful for summarisation (Lohmann et al., 2009); and that they are satisfying to use and, when illustrating search results, may reveal unexpected but useful terms (Kuo et al., 2007).
Perhaps most relevant to this project are studies carried out with PubCloud (see Figure 2.4), a search extension for PubMed which presents results as a word cloud. Kuo et al. (2007) showed that the cloud interface allows for better performance when answering descriptive questions as opposed to more complex relational ones. Clough and Sen (2008) conducted a similar experiment where certain types of questions had to be answered using PubMed and PubCloud; participants found that tag clouds proved most useful for gaining an impression of a subject area. One participant in particular commented:

“It felt as if the tag cloud would be at its most useful if I was to conduct my very first/early searches on a topic.” (as reported in Clough and Sen (2008))

Finally, the positioning of words in a tag cloud was considered an important aspect of this project. Lohmann et al. (2009) conducted experiments with four different tag cloud layouts: sequential, circular, clustered and reference (see Figure 2.5). The results suggested that circular and clustered layouts were preferred in tasks where popular terms or specific categories of terms needed to be identified.
Figure 2.4: Search results as presented by PubCloud and PubMed (Kuo et al., 2007)
Figure 2.5: Different layouts for tag clouds: (a) sequential (alphabetic sorting), (b) circular (decreasing popularity), (c) clustered (thematic clusters), (d) reference (sequential, alphabetic sorting, no weighting of tags) (Lohmann et al., 2009)
Figure 2.6: A dynamic word cloud with controls on www.jasondavies.com/wordcloud
A state of the art example

While visualisations have existed on the web for more than a decade (Viegas et al., 2007), it has been customary (as well as technically feasible) to build them separately offline and then publish them on the web, often as static images. Interactive visualisations have only recently become more popular, likely because of an increase in the availability of new and suitable tools and technologies. The author has not been able to locate many interactive implementations of word clouds in online applications. One popular tool that aims for attractiveness above all is Wordle (Feinberg, 2010), which allows users to generate word clouds of different styles based on textual input. While Wordle generates static images, the tool inspired the creation of an interactive version built by Jason Davies, using a JavaScript library known as D3.js (Davies, 2012) (see Figure 2.6).
Jason Davies’s word cloud allows users to dynamically reposition the words in their clouds by manipulating a small set of variables such as rotation, number of words, and typeface. The tool also enables direct interaction with the cloud: clicking on a word regenerates the cloud with the chosen word centred.
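Davies also published the underlying layout engine as an open-source D3 plugin (d3.layout.cloud). The sketch below is a hedged illustration of how a cloud of this kind is typically driven with it; the sample data and the draw callback are assumptions, not code from this project:

// Illustrative sketch using Jason Davies's d3.layout.cloud plugin.
// "terms" is an assumed array of {text, freq} objects; draw() appends
// one SVG <text> node per placed word.
var terms = [{text: "obesity", freq: 52}, {text: "smoking", freq: 31}];

d3.layout.cloud()
    .size([600, 400])                                  // canvas dimensions
    .words(terms.map(function (t) {
      return {text: t.text, size: 10 + t.freq};        // font size from frequency
    }))
    .rotate(0)                                         // keep all words horizontal
    .fontSize(function (d) { return d.size; })
    .on("end", draw)                                   // fired once the layout finishes
    .start();

function draw(words) {
  d3.select("#cloud").append("svg")
      .attr("width", 600).attr("height", 400)
    .append("g").attr("transform", "translate(300,200)")
    .selectAll("text").data(words).enter().append("text")
      .style("font-size", function (d) { return d.size + "px"; })
      .attr("transform", function (d) {
        return "translate(" + d.x + "," + d.y + ")";
      })
      .text(function (d) { return d.text; });
}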
2.2.2 Streamgraphs

Streamgraphs (also referred to as “theme rivers” or, largely because of popular typographical errors, “steamgraphs” (Kirk, 2010)) are a relatively new way of visualising data. They are a type of stacked graph consisting of layers, where the “thickness” of each layer represents its frequency at the corresponding x-coordinate. This type of visualisation was first used by Lee Byron to illustrate his listening history on Last.fm (Byron, 2008); soon after, he helped create a similar chart for the New York Times to illustrate box office history (Byron and Wattenberg, 2008).
A streamgraph can be seen as an extension of the traditional stacked graph (such as a bar chart where each bar contains several “stacks”), but because individual layers are positioned to flow continuously, they are believed to be easier to trace and emphasise in the case of large data sets (see Figure 2.7). Another key factor in streamgraphs is their attention to aesthetics, the goal being to create something with a less “statistical” and more organic feel (Byron and Wattenberg, 2008).
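For concreteness, the layer geometry just described can be computed with D3's stack layout, whose “wiggle” offset produces the floating baseline characteristic of streamgraphs. This is a hedged sketch against the D3 v3 API; the sample data, scales and target element are assumptions:

// Sketch: computing streamgraph geometry with D3 v3's stack layout.
// Each inner array is one concept's publication counts per year.
var layers = [
  [{x: 2000, y: 3}, {x: 2001, y: 7}, {x: 2002, y: 5}],
  [{x: 2000, y: 1}, {x: 2001, y: 4}, {x: 2002, y: 9}]
];

// The "wiggle" offset floats the baseline instead of pinning it to the
// x-axis, which is what turns a stacked chart into a streamgraph.
var stacked = d3.layout.stack().offset("wiggle")(layers); // adds y0 to each point

var x = d3.scale.linear().domain([2000, 2002]).range([0, 600]);
var y = d3.scale.linear()
    .domain([0, d3.max(stacked, function (l) {
      return d3.max(l, function (p) { return p.y0 + p.y; });
    })])
    .range([300, 0]);

var area = d3.svg.area()
    .x(function (d) { return x(d.x); })
    .y0(function (d) { return y(d.y0); })
    .y1(function (d) { return y(d.y0 + d.y); });

d3.select("#stream").append("svg").attr("width", 600).attr("height", 300)
  .selectAll("path").data(stacked).enter().append("path")
    .attr("d", area);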
In addition to the above, streamgraphs have been used successfully to visualise trends over time in several cases (e.g., Kraker et al., 2011; Riedhammer et al., 2011). Riedhammer et al. (2011) argue that the graph's distinctive layout allows for easier comparison between data layers, since it is possible to grasp intuitively how dominant a layer is not just at one point, but throughout the extent of the x-axis (most commonly the time dimension). All this makes a good case for the use of streamgraphs to visualise the popularity of epidemiological concepts in publications over time.
Figure 2.7: An example of a streamgraph with labels (Byron and Wattenberg, 2008)

Feedback for this unconventional type of visualisation, however, has been controversial. While some users have called it “brilliant” and “intuitive”, others express confusion over the unconventional baseline (the layers are not stacked on the x-axis, but rather float to preserve continuity) and the irrelevance of the vertical scale (Byron and Wattenberg, 2008). Byron and Wattenberg argue:
“Since the heights of the individual layers add up to the height of the overall graph, it is possible to satisfy both goals at once. At the same time, this involves certain trade-offs. There can be no spaces between the layers, since this would distort their sum. As a consequence [...], changes in a middle layer will not necessarily cause wiggles in all other surrounding layers, which have nothing to do with the underlying data of those affected time series.” (Byron and Wattenberg, 2008)
Byron and Wattenberg draw attention to other minor trade-offs as well, but argue that the aesthetic quality of the graph is what makes it engaging to audiences. However, Kirk (2010) proposes in his comprehensive review of the legibility of streamgraphs that their trade-offs are most exposed when they are displayed as static visualisations; interactivity is therefore crucial to enable users to explore data points and make sense of them more clearly.
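As a hedged illustration of the kind of interactivity Kirk advocates, a hover handler can be attached to each layer so that the underlying value is revealed on demand. The selector, the x scale, the tooltip element, and the assumption that each layer object carries a name are all illustrative:

// Sketch: revealing underlying data points on hover (D3 v3's d3.mouse).
// Assumes a #tooltip element and the x scale from the drawing code above.
d3.selectAll("#stream path")
    .on("mousemove", function (layer) {
      var year = Math.round(x.invert(d3.mouse(this)[0]));
      d3.select("#tooltip").text(layer.name + ", " + year);
    })
    .on("mouseout", function () {
      d3.select("#tooltip").text("");
    });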
2.3 Building interactive web applications
Based on the preceding literature review, several tools and technologies were investigated in order to make an informed decision about the main languages and libraries this application would benefit from.
2.3.1 Single-page web applications
When interacting with and exploring data, a certain degree of complexity is inevitable if meaningful insight is to be gained. In order to achieve a user interface that allows rich interactions with multiple components on the page simultaneously, the goal was formulated that the tool would be implemented as a single-page web application (SPA). Takada (2013) writes in his book “Single page apps in depth” that what distinguishes single-page applications is “their ability to redraw any part of the UI [User Interface] without requiring a server round-trip to retrieve the HTML”. This is usually achieved by segmenting the application code into models and views, with models handling the data and state of a component, and views reading from these models and responding to user interactions, triggering changes in the data structures.
Consequently, the user is never redirected to a new page, and is thus never in a situation where the entire interface is new and unfamiliar. All interactions take place on a single page, and events initiated by the user result in dynamic changes to the state of the interface, making it easier to see the outcome of each action.
Because of the author’s lack of experience writing single-page applications, a review was made of the available front-end Model-View-Controller (MVC) frameworks. Most such frameworks are written in JavaScript; Angular.js, Backbone.js and Knockout.js were considered, but Backbone.js appeared to be the most suitable for both this project and the author, for the following reasons.

• The functionality it offers is not overly comprehensive, promising a less steep learning curve in the context of the limited time frame of this project. Meanwhile, the rigid structure it gives to the system appears to be extendable regardless of the size and complexity of the system itself.
• Based on research, the online community seems to agree that Angular.js, although probably the most powerful of the three, feels like “magic”, with the programmer unaware of much of what happens “behind the curtain”. The author felt that using a framework like this would limit her learning opportunities.
Above all, Backbone.js provides programmers with the means to structure and organise their JavaScript code appropriately in an SPA. The library encourages the use of Models, Collections and Views. Collections are groups of Models that hold data, state and properties, and Views are constructs placed as a layer between these Models and Collections and the DOM (Document Object Model), orchestrating the events initiated by users in the interface and propagating updates and changes to templates, Models and Collections. Additionally, Backbone.js provides a Router object that is used to update the URL of the page.
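As a minimal sketch of this structure (all names and the endpoint are hypothetical, not taken from this project), a View can subscribe to a Collection and re-render its part of the DOM whenever the data changes:

// Hypothetical sketch of the Model/Collection/View pattern described above.
var Article = Backbone.Model.extend({});

var Articles = Backbone.Collection.extend({
  model: Article,
  url: "/api/articles"   // illustrative endpoint only
});

var ArticlesView = Backbone.View.extend({
  initialize: function () {
    // re-render whenever the Collection's contents change
    this.listenTo(this.collection, "add reset change", this.render);
  },
  render: function () {
    // redraw only this component's region of the DOM
    this.$el.html(this.collection.map(function (article) {
      return "<li>" + article.get("title") + "</li>";
    }).join(""));
    return this;
  }
});

var articles = new Articles();
new ArticlesView({el: "#articles", collection: articles});
articles.fetch({reset: true});  // triggers "reset", which triggers render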
2.3.2 D3.js
Several technologies are available that provide the tools to implement web-based visualisations; first-hand assessment was required to make an educated decision regarding the pros and cons of each.
D3.js is an open-source JavaScript library written by Michael Bostock (Bostock, 2012). It is released under the BSD licence, which states that the software may be freely used for commercial purposes and can be modified as needed, as long as the original copyright notice is retained in the code (The Linux Information Project, 2005). It is a comprehensive library that includes not only functionality for visualisations, but also tools for the data manipulation that often precedes them, including functions for statistical analysis.
Other visualisation tools considered by the author were Processing.js (www.processingjs.org; a JavaScript wrapper for Processing), Raphael (www.raphaeljs.org), the R language, and Paper.js (www.paperjs.org). However, the following properties of D3.js were considered by the author to be advantages over these other tools:
• Explicit support and examples for implementations of both word clouds and streamgraphs;
• The widespread use and popularity of the library, which creates a strong online community for help and support;
• Comprehensive functionality that may be useful for areas of the application beyond the visualisations;
• No need to learn a new language (the author was relatively familiar with JavaScript);
• The extended functionality of JavaScript that enables the implementation of interactivity;
• General attractiveness compared to other tools and libraries.
Above all, D3.js allows the creation of visualisations in a way that integrates smoothly with the rest of the DOM, as it makes use of SVG (Scalable Vector Graphics) elements to populate a canvas with shapes, colours and text. This approach differs from using the recently introduced HTML5 canvas element, as it performs actual manipulation of the DOM by inserting and deleting nodes, whereas in the case of the HTML5 canvas, only pixels are drawn. Because a node exists for each element in the visualisation, these can be easily targeted with CSS or JavaScript – and here is precisely where the appeal of D3.js lies. However, trade-offs exist too: adding large numbers of DOM elements is bound to impact the performance of rendering it all to the user.
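A small hedged example of what “targetable nodes” means in practice (element names and data are illustrative): each datum below becomes its own SVG text node, which ordinary CSS selectors can then style:

// Sketch: D3 creates one addressable DOM node per datum.
var freqs = [{term: "obesity", n: 52}, {term: "smoking", n: 31}];

d3.select("#chart").append("svg")
    .attr("width", 300).attr("height", 80)
  .selectAll("text")
    .data(freqs)
  .enter().append("text")                      // one <text> node per datum
    .attr("x", 10)
    .attr("y", function (d, i) { return 25 * (i + 1); })
    .attr("class", "term-label")               // now targetable from CSS
    .text(function (d) { return d.term + " (" + d.n + ")"; });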
THIS CHAPTER HAS PROVIDED an overview of how the growing amount of data about epidemiological literature is currently addressed. An analysis of the pros and cons of using word clouds and streamgraphs has been given, and finally, a technical review introduced techniques for implementing these visualisations interactively.
Chapter 3
Identifying requirements
Based on the background research and the technologies reviewed in Chapter 2, a more detailed set of requirements enabling the core objectives to be realised was identified and refined during iterations of the project. While the number of enhancing features that this kind of application could benefit from is quite large, it was decided that feature bloat should be avoided at all costs in order to maintain simplicity.
Because the author does not have any background in biomedicine, the specific expectations of the users of the application-to-be were difficult to predict. Throughout the design and development of this project, occasional discussions were held with George Karystianis, a PhD student at the University of Manchester with a background in Medical Informatics. Karystianis's input was used to keep ideas in line with realistic expectations of the real-world usage of the application, and some specific feature suggestions were given, such as users' ability to customise visualisations according to their needs.
3.1 Functional requirements
A description of the core functionality and components of the application follows.

COMPONENT I: Searching by epidemiological concepts

A comprehensive search feature was identified as the most important component of the system. Requirements for the search component are listed in Table 3.1.
FR1  Users are able to perform a search by inputting an epidemiological concept or term into the system. This search retrieves (1) a list of publications where the concept occurs, (2) core concepts discussed in those publications, and (3) the popularity of the concept(s) searched for in biomedical publications over time.
FR2  The search term given by the user may be further specified to have a specific role in a study: either outcome, exposure, covariate, population, effect size or study design type (as an example, users should be able to search for the term “smoking” appearing as an “exposure” in studies).
FR3  Users are able to add more than one search term, resulting in an intersection of the results for each; terms may also be removed at any time to allow exploration.
FR4  Users are able to access their search history.

Table 3.1: Requirements for COMPONENT I
COMPONENT II: An interactive word cloud as the main means of navigation

The core concepts appearing in the matching documents retrieved are the central element of the results – they provide an overview of the topics discussed in the community that relate most closely to the user's search term. In order to maximise the benefit gained from this data, the requirements listed in Table 3.2 were set.
FR5  The set of related concepts and entities is displayed in a word cloud, where the more popular entities appear larger and more prominently.
FR6  Each entity in the cloud is characterised by the role it has in the publication it appears in, similar to the roles users are able to specify for their search terms. Entities with the same role are differentiated from surrounding entities by colour and position, with all terms sharing a role placed close together in the cloud (based on the tag cloud layout studies discussed in Chapter 2).
FR7  Users are able to hide and un-hide either individual elements or groups of elements from the cloud based on a common role they have in the publications they appear in (for example, hiding all terms that appear as covariates in their respective studies).
FR8  Users can interact with each term in the cloud to get a set of further actions:
FR8.1  Add the chosen term along with its role to the current set of search terms, which fetches a new set of appropriate results;
FR8.2  Use the chosen term along with its role to start a new search;
FR8.3  Hide the chosen term from the cloud;
FR8.4  Visualise the popularity of the chosen term in publications in comparison with the popularity of the search terms;
FR8.5  See the chosen term in context, by displaying sentences from matching documents that it occurs in;
FR8.6  See links to relevant databases or external resources that provide further information about the chosen term.

Table 3.2: Requirements for COMPONENT II
COMPONENT III: A list of relevant documents with details available on demand

A list of documents matching the current search term(s) would provide the reader with context and enable them to explore further if they so wished; this list would need to satisfy the requirements in Table 3.3.
FR9  Users can click on any article title in the list to retrieve more detailed information. This includes:
FR9.1  Common metadata such as the author, journal, date of publication and abstract, with key concepts and entities appearing in that article highlighted;
FR9.2  Key epidemiological concepts additionally presented in a table, categorised by their respective roles in the study;
FR9.3  A link to the corresponding PubMed article.

Table 3.3: Requirements for COMPONENT III
COMPONENT IV: A streamgraph illustrating the popularity of concepts over time

The second visualisation in the results is the streamgraph, which represents the popularity of concepts in publications over time; see Table 3.4 for a list of specific requirements for this component.
FR10  Automatically display layers in a streamgraph corresponding to each of the user's specified search terms, visualising the popularity of those terms in publications over time.
FR11  Users can see detailed numeric data for each point in the streamgraph as they hover over the layers, to aid legibility.
FR12  Users can add and remove layers corresponding to other concepts of interest in the streamgraph.

Table 3.4: Requirements for COMPONENT IV
3.2 Non-functional requirements
A complementary set of non-functional requirements is listed in Table 3.5.
NFR1  The tool is built as an online web-based application in order to facilitate wide accessibility.
NFR2  The application has acceptable response times.
NFR3  To aid ease of use and provide a more fluid experience, the tool is implemented as a single-page application (SPA).
NFR4  Provide an easy-to-learn experience with a simplistic and intuitive user interface.

Table 3.5: Non-functional requirements
3.3 Use cases
The use cases identified summarise the expected functionalities and interactions with the system, and are illustrated in Figure 3.1. There is a single type of user, and no administrators or moderators. The main activities in the application lie in exploratory tasks, as search results are examined rather than manipulated. The use cases defined do not represent these exploratory aspects, but specific actions that the user can perform.
Figure 3.1: Use case diagram. A single User interacts with the system through the following use cases: (UC1) add arbitrary search terms to a search; (UC2) add search terms specified by an epidemiological role to a search; remove any search terms from a search; hide terms or categories from the word cloud; add layers to the streamgraph; remove layers from the streamgraph.
Chapter 4
Design
Based on the background research and the identification of requirements, a concept for a specific application was formulated. EDViC, which stands for Epidemiological Data Visualised in Clouds, would be a tool that meets the real-world needs of the biomedical community. The sections in this chapter describe the design of EDViC. After a description of the system architecture, the principles and iterations of the visual design are explored. Additionally, a more detailed look into the design of the word cloud is given.
4.1 System architecture
Taking into account the specific requirements of the project and the requirements placed on single-page web systems, the high-level system architecture seen in Figure 4.1 was designed. The diagram illustrates the data layer of the application with two core databases and an API, the Backbone models and controllers in the business layer, and finally the Graphical User Interface in the presentation layer. The following sections describe and illustrate each of these components in more detail.
Figure 4.1: The flow of data through the system
4.1.1 The data
The two databases in the system provide all of the necessary raw data for users' search results.
The database of PubMed documents holds detailed information on all publications listed in MEDLINE and other journals. These can be accessed using the E-Utilities API provided by the NCBI. This database is used in the system to retrieve the initial set of relevant publications corresponding to a user's search in the case of UC1 (see section 3.3), as well as to retrieve details about a given article.
The second database used in the system (referred to as the “epidemiology database” throughout this report) holds epidemiological data collected by the system developed by Karystianis et al. (described in section 2.1.3). This data is used to define the final concepts and entities, along with their properties, that relate to a user's search term, and ultimately provides users with the means of exploring epidemiological concepts in publications in detail. This database is also used to define the set of publications relevant to a user's search term in the case of UC2.
4.1.2 The API
In an SPA, most of the logic of the system is handled not on the server side as with traditional architectures, but on the client side in the web browser. The complexity of the “back-end” of the system is thus reduced significantly, making the server's primary role to handle all data manipulation in the form of an Application Programming Interface (API).
RESTful (Representational State Transfer) APIs are traditionally used in applications to handle the creation, reading, updating and deleting of models and collections corresponding to the system logic; for example, deleting forum posts (e.g., an HTTP DELETE request made to the endpoint /posts/<post_id>) or retrieving all profiles from the user profiles collection (e.g., a GET request made to the endpoint /users). The API designed for this system, however, differs from traditional web-service APIs in that the data in the databases is never changed, only read. Therefore, only HTTP GET requests are ever sent to API endpoints, and these endpoints always respond with the relevant data retrieved from the databases, having performed complex data manipulation procedures beforehand. The API in this system is therefore not a layer for accessing and updating different data collections and models, but rather an interface for Remote Procedure Calls (RPC) handling certain functional aspects of the application logic. Remote procedures are initiated by code on the client side sending an AJAX (asynchronous JavaScript and XML) request to an address on the server, which then executes the functionality behind that endpoint, and the results are sent back to the client.
There are several response formats that RPCs may use. XML, JSON and HTML are among the most popular, but for this application it was decided that all API endpoints should return JSON, as this can easily be manipulated using JavaScript, and it separates the data manipulation logic from how results appear in the DOM (Document Object Model – the underlying structure of the HTML).
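Concretely, such an RPC boils down to a single GET request whose JSON response is handed straight to the client-side logic. Below is a hedged sketch using jQuery; the parameter shown is illustrative, though the endpoint itself is the one described in section 5.2.1:

// Sketch of a client-side RPC: one GET request, a JSON response.
$.getJSON("/api/entities/related", {exposure: "smoking"}, function (data) {
  // "data" holds the related-entities structure computed on the server,
  // ready to be loaded into the Backbone Collections.
  console.log(data);
});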
The API's most important function in this system is providing the client with relevant responses to the user's searches – based on the current search terms, a list of matching articles and a list of related epidemiological concepts, along with their roles and frequencies, is given. However, the API also handles other, less significant requests, such as:

• retrieving frequencies of concept occurrence in publications by year (to satisfy FR10);
• retrieving various details and information about specific articles (to satisfy FR9);
• retrieving sentences from a set of publications in which a given term occurs (to satisfy requirement FR8.5).
During implementation, minor functionality was added as the need for additional data manipulation procedures arose, while the behaviour of others was modified slightly. All API endpoints, along with their functionality, are listed in Appendix A.
4.1.3 The Backbone MVC framework in the front-end
The business logic that does not involve manipulating data in the databases is handled using JavaScript and Backbone.js in the front-end of the system. Keeping in mind the guidelines for structuring Backbone.js Models and Views described in section 2.3.1, the extension to the high-level architecture of the system seen in Figure 4.2 was designed for the business logic of the application.
Four main Collections were defined to hold the data retrieved from the API:

1. The Filters Collection, consisting of separate Filter models, each corresponding to one of the user's search terms;
2. The Articles Collection, consisting of separate Article models, each corresponding to a matching article retrieved by a search;
3. The RelatedEntities Collection, consisting of separate Entity models, each holding an epidemiological concept along with the accompanying properties and metadata that correspond to the user's search results;
4. The Frequencies Collection, consisting of Frequency models holding data about the frequencies of occurrence for a given concept.
Four main Views were defined to correspond to the main Collections and the components visible to users:

1. The Filters View to manage users' search terms, along with the roles the user may or may not specify for a search term;
2. The Articles View to manage the list of articles matching a given search, as well as details on each individual article;
3. The Cloud View to manage entities shown in the cloud and interactions with these entities;
4. The Streamgraph View to manage the display of the streamgraph and interactions with its different layers.

Each View has several smaller sub-views to handle sub-components of the areas in the interface.

Figure 4.2: The structure of and interactions between Backbone Collections and Views. Dashed green arrows indicate API calls, solid black arrows indicate Object --updates-> Object, and dotted red arrows indicate Object --listensTo-> Object.
Finally, a Router Model was added to simulate a changing URL in spite of the website never navigating to a new page; this allows users to bookmark searches, and partially satisfies requirement FR4.
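A minimal hedged sketch of this behaviour (the URL fragment shown is illustrative): Backbone's router can rewrite the address bar without triggering any navigation, so the current search stays bookmarkable:

// Sketch: silently updating the URL so searches remain bookmarkable.
var AppRouter = Backbone.Router.extend({});
var router = new AppRouter();
Backbone.history.start({pushState: true});

// Called after the Filters Collection changes; trigger:false means no
// route handler fires and no page load occurs – only the URL changes.
router.navigate("search?exposure=smoking", {trigger: false});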
4.1.4 The Graphical User Interface
A traditional web interface built with HTML and CSS is suitable for this application. Base HTML would be provided for the static parts of the system, while Backbone Views handle dynamic templating.
4.2 Dimensions in the word cloud
As the requirements specify the word cloud as the central element of the system, the specific design decisions regarding how and which terms are placed in it are now described.
As mentioned, terms in the word cloud form the set of all concepts and entities that occur in the intersection of documents matching the user's given search terms; the size of a term, as the primary dimension of the cloud, indicates its frequency in this set. Other dimensions of the cloud could be made use of as well. Colour was chosen to represent each term's epidemiological role in the publication it occurs in, and all terms with the same role are clustered together to simplify scanning (based on the results of the cloud layout experiments described in section 2.2.1). Additionally, more frequent (and hence larger) terms are placed closer to the centre of the cloud than less frequent ones. Together, these characteristics create a circular clustered layout (see Figure 4.3), combining the advantages of being able to quickly detect popularity and to identify connected topics. (During background research, the author was not able to locate information on how the orientation of individual terms in the cloud affects users' perceptions of what they convey; several options were considered for conveying subliminal information using term orientation, but due to the lack of supporting material, this idea was abandoned.)

Figure 4.3: A minimal example of a circular clustered layout where different epidemiological categories appear in clusters, and larger terms appear nearer to the centre
4.3 Visual design
A simplistic and attractive user interface was a self-evident goal from the very beginning of this project. While much of the intuitiveness can be provided by the dynamic interactions that SPAs enable, it was decided that there was no need to build the entire tool into a single page, and the application would consist of four separate pages:

• the Welcome page, serving as the main entry point to the system, where users are introduced to the application and can begin their search;
• the Search Results page – the core of the application, where all data exploration takes place;
• an “About” page with information about this project;
• and a “Guidelines” page with instructions for the usage of the application (although it is hoped that many of the design decisions eliminate the need for this).
4.3.1 The Search Results page: design iterations
The Search Results page would act as the SPA of this tool, and is solely responsible for satisfying the functional requirements set. Four main components were identified that the user would need to interact with: the search, the list of articles, the word cloud and the streamgraph. When placing these components on the “canvas”, priority was given to the visualisations; the search area would occupy the top of the page, and the articles list would be a secondary area on the left. Figure 4.4 illustrates this basic layout, which remained the same throughout all iterations of the design (although the search bar was later moved to the bottom of the page for more optimal use of space).
Because the user is never directed to a new page during their interactions, dynamic state changes become an important part of the visual design, and single areas of the page must have the capacity to present different interaction options while in different states. In this regard, issues arose with the specifics of the search component. In order to facilitate all user interactions, the following options need to be available and visible to the user at all times:

• The user has to be aware of the search term(s) that have produced the currently displayed results (“current search terms”);
• The user has to be able both to add an additional term to this list (“filtering the current search”) and to clear the list and add a new term (“performing a new search”);
• The user has to be able to revisit previous searches (defined by combinations of search terms accompanied by specified roles);
• Finally, users' searches should be accompanied by autocomplete functionality based on entries in the epidemiology database in order to ensure the availability of results.

All this introduces significant complexity that has to be handled by a single compact area of the results page. Several solutions were considered, but the main issues were always excessive space usage and requiring users to perform an unreasonable number of clicks to manipulate their searches.

Figure 4.4: A first draft of the layout of the main components
Figure 4.5: The final design of interactions in the search bar
The final version of the search area (see Figure 4.5) was designed with the idea in mind that there is no need for separate elements to represent interactions that are fundamentally connected. The notion of having separate areas for the epidemiology options and the search history was therefore discarded, and all intended functionality would be conveyed by the concept of search “labels”, characterised as follows.

• A search label consists of the search term and the epidemiological role associated with it (chosen from each label's individual drop-down menu).
• A blank search label is always displayed at the end of the list to indicate input options.
• Labels can be removed.
• Depending on whether the “Filter this search” or “New search” option is selected, existing labels are either visually “active” or appear disabled.
• The prominent visual design of search labels makes the search bar a constant reference for what the current search represents – the need for a title area is thus eliminated.

As the user can easily revisit searches by removing terms from the search bar, by using the back button, or by bookmarking more interesting search URLs (functionality provided by the Backbone Router), it was decided at this time that a search history would not explicitly be provided.
4.3.2 Visual consistency
In an application where the same type of data is displayed in several different visual as well as text-based contexts, consistency is crucial to enable the user to create associations between different components of the application. This had to be kept in mind most with the epidemiological entities in the interface, as these act as the central navigational elements of the system, appearing interactively in all components of the application. Entities were therefore designed consistently, always having the same typeface and roughly the same size, and always accompanied by the main role they had in their respective study. As an example, Figure 4.6 shows how the entity “smoking”, appearing as a covariate in studies, is displayed across different areas of the finished application.

Figure 4.6: The visual language of the term “smoking” appearing as a covariate (from top to bottom) in the cloud pop-up menu, the article details modal, the streamgraph, and the search bar
THIS CHAPTER ILLUSTRATED an appropriate design for a system capable of satisfying the requirements described in Chapter 3. An architectural structure was defined that best allows the development of an SPA, and an API was described that provides Remote Procedure Calls for the core data manipulation operations. The visual design of the system was explored as well, with an overview given of the decisions underlying the appearance of the Search Results page interface.
Chapter 5
Implementation
While the implementation of the application went through several iterations, this chapter does not intend to be a comprehensive overview of them all; rather, it presents the reader with a set of implementation details that were challenging or interesting to the author, or simply significant in the context of the project.
5.1 Choice of tools, languages and external libraries
Based on the background research, the application requirements, and the analysis of different visualisation libraries and JavaScript frameworks, a set of tools and languages was chosen to implement the system.

• Flask (www.flask.pocoo.org) with Python was used to implement the web server, which hosts both the API and the user-facing web pages where interaction takes place. Flask is a minimal web framework written in Python that is especially celebrated for its suitability for implementing APIs.
• BioPython (www.biopython.org) was used to access the E-Utilities API from Python. In addition, a local MySQL database was set up to host the epidemiology database, and Python's MySQLdb library was used to access it.
• JavaScript was used to implement the application logic, with the added support of Underscore.js (www.underscorejs.org) for functional programming and jQuery (www.jquery.com) for more efficient manipulation of the DOM.
• Backbone.js was used to structure the JavaScript code clearly and modularly, as is appropriate for a single-page application.
• D3.js was used to create the visualisations (see section 2.3.2).
• HTML and SASS were used for creating the templates. SASS (www.sass-lang.com) is a CSS preprocessor that allows the writing of more functional CSS with nested statements and variables. All stylesheets were written in SASS and compiled to CSS.
• Kube.css (imperavi.com/kube) is a front-end HTML and CSS framework that provided built-in support for common GUI elements, a responsive grid, and a more consistent visual language. It also ensured fundamental cross-browser compatibility.
5.2 Manipulating the data
As explained in the previous chapter, the API forms a set of remote functions that manipulate the data before the application logic uses it. Because of the complexity of the data and the different facets of it that had to be highlighted in the application, implementing a strong API as the fundamental component of the application was one of the most challenging aspects of this project. The main problem areas were introduced by the fact that the different epidemiological roles in the epidemiology database all had various different metadata associated with them. The application, however, still needed to handle all search results consistently regardless of the roles specified, while still displaying relevant information to users based on the role they chose to accompany their search term. To illustrate this inconsistency in the data, note in Table 2.2 in Chapter 2 the different properties with which each epidemiological role is further defined.
As shown, no single piece of metadata apart from the year of publication and PubMed ID can be associated with all six epidemiological roles. Outcomes, exposures and covariates are the most unified, with all three accompanied by a specifying UMLS group and category. Population, effect size, and study design indicators, however, cannot semantically be described by a UMLS group or category, and are instead accompanied by a range of other information.
When handling the data, all these different pieces of metadata had to be made consistently representable in data structures in order to allow for smooth logic in the client-side JavaScript. For the sake of this consistency, even though it introduced application-wide additions to areas of functionality where data parsing is involved, the following hash table construct was decided upon after several iterations (illustrated with an example exposure):
{'value': 'obesity',
 'epidem_category': 'exposure',
 'article_id': '10643682',
 'year': '1999',
 'other_details': {
     'other_exposure': 'child obesity',
     'umls_group': 'DISORDERS',
     'umls_category': 'Finding'}}
As shown, the only keys shared by every type of entity are now value, epidem_category, article_id, year, and other_details, the last of which is used to hold any other unshared keys relevant to that entity. This construct is used consistently throughout the implementation layers of the application.
5.2.1 Main API endpoint - retrieving a list of related entities
A significant portion of the implementation effort went into creating the API endpoint that retrieves a data construct holding all related entities based on a set of search terms that may or may not be accompanied by a specified epidemiological role. This is essentially the core function behind the entire application, as it powers the search, the results of which initiate all Backbone Views, such as occurrence frequencies or details on matching articles. The main issue with this endpoint was facilitating multiple search terms – making a separate request for each search term from the front-end would firstly result in excessive database calls, and secondly force the Backbone Models to handle combining these separate search results.
It was therefore decided that all search terms would be passed to the /api/entities/related endpoint via a query string, which is a set of key-value pairs specifying additional information to the endpoint. For example, if the user performs a search for all publications where the term “England”, the term “smoking” as an exposure, and the term “lung cancer” as an outcome occur, a request would be made to the following endpoint:

/api/entities/related?keyword=England&exposure=smoking&outcome=lung+cancer

The argument-parsing engine in the endpoint is built such that duplicate keys can also be passed. For example, passing the query string ?exposure=smoking&exposure=diabetes&outcome=lung+cancer would result in a new hash table being created with a single key exposure whose value is a list: {'outcome': 'lung cancer', 'exposure': ['smoking', 'diabetes']}.

This query string is in line with the query string used on the search results page itself (/search?keyword=England&exposure=smoking&outcome=lung+cancer), which ensures consistency. Upon first navigating to the search results page, the query string is parsed in order to make an appropriate request to the API endpoint; additional adjustment of search terms will then trigger these requests from the Filters Collection, while the URL is updated silently by the Router.

Having received the query parameters, the API endpoint then performs the data manipulation required to retrieve relevant results. The basic high-level algorithm for this function can be seen in Pseudo code 5.1.
Pseudo code 5.1 Retrieving related entities based on search terms
Precondition: epidem_roles

function GETCOMBINEDRELATEDENTITIES(filters)
    article_ids ← list()
    keywords ← filters.pop('keyword')                      ▷ Arbitrary keywords without a role
    for all term in keywords do
        related_articles ← get_related_articles(term)
        article_ids.append(set(related_articles))
    for all role, terms in filters do                      ▷ Filters with a role
        for all term in terms do                           ▷ Several terms for a role
            related_articles ← get_related_articles(term, given_role = role)
            article_ids.append(set(related_articles))
    intersection ← intersection of sets in article_ids
    related_entities ← list()
    if intersection then
        for all role in epidem_roles do
            related_to_type ← get_entities_of_type_in_docs(intersection, role)
            related_entities.extend(related_to_type)
    return related_entities
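To illustrate how the front-end might consume this endpoint, the following is a minimal sketch (assuming jQuery, which the application already uses; renderResults() is a hypothetical entry point, not part of the actual implementation). Note that jQuery’s “traditional” serialisation turns the array into the repeated keys described above:

// Query the related-entities endpoint for the worked example above.
// traditional:true serialises the array as repeated keys, producing
// ?keyword=England&exposure=smoking&exposure=diabetes&outcome=lung+cancer
$.ajax({
  url: '/api/entities/related',
  data: {
    keyword: 'England',
    exposure: ['smoking', 'diabetes'],
    outcome: 'lung cancer'
  },
  traditional: true,
  dataType: 'json',
  success: function (relatedEntities) {
    // relatedEntities: a list of hash tables in the format of section 5.2
    renderResults(relatedEntities); // hypothetical rendering entry point
  }
});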
5.3 Implementing visualisations
Implementing the visualisations and making them interactive was an interesting challenge for the author, considering her unfamiliarity with the relevant technologies.
5.3.1 Extending D3.js
As described in Chapter 2, D3.js uses SVG elements to populate a section of the DOM with nodes that can be targeted with JavaScript and CSS just like other elements of the page. Being able to target the individual SVG elements that collectively formed the visualisations enabled much of the basic interactivity to be implemented quite easily using just CSS and jQuery. Examples include the following:
• SVG <text> elements holding the terms in the cloud could be hidden and shown by invoking a simple CSS rule using the display property (display: none or display: block);

• CSS classes could be added to each individual <text> element based on epidemiological roles, and the terms then colour-coded accordingly using CSS rules;

• the different layers in the streamgraph, defined by SVG <path> elements, could be given additional styling, such as transparency on hover, so that the user could more clearly see the underlying graph when focusing on a particular layer.
Apart from requiring a considerable amount of effort to master the basics, the only other issue the author encountered with D3.js was adapting it to the specifics of the word cloud required in this application. Mainly, the following problem points emerged when attempting to meet the requirements.
I. Additional properties to each term

Prior to placement, the frequency defined for each word is used to calculate the font size in which it should appear. However, the author intended to also control other dimensions (rotation and colour) of each term based on its epidemiological role, as well as display extra information in a dedicated interactive pop-up menu. An extension to the basic D3.js cloud-drawing algorithm was implemented, whereby the hash table d, containing the terms in the cloud, received an additional set of keys:

• epidem_category to represent the term’s role;

• other_details (explained in section 5.2);

• id to match the term to a unique context menu (further explained in section 5.3.2);

• is_filter to verify whether a word in the cloud appears in the current search terms.

As the algorithm loops over each term in the hash table, it can consequently check specific properties of each element prior to assigning it a colour and rotation, and use the element’s other_details to store and display extra information.
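A minimal sketch of this extension, assuming the d3-cloud layout plugin commonly used for D3.js word clouds (the input mapping, the mapToFontSize() helper, and the drawCloud() callback are illustrative, not the exact implementation):

// Attach the extra keys to each term before handing the data to the layout,
// so rotation and colour can later be derived from the term's role.
var words = relatedEntities.map(function (d, i) {
  return {
    text: d.value,
    size: mapToFontSize(d.frequency),        // hypothetical scaling helper
    epidem_category: d.epidem_category,
    other_details: d.other_details,
    id: i,
    is_filter: currentFilters.indexOf(d.value) !== -1
  };
});

d3.layout.cloud()
    .size([800, 500])
    .words(words)
    .rotate(function (d) { return d.epidem_category === 'exposure' ? 0 : 90; })
    .fontSize(function (d) { return d.size; })
    .on('end', drawCloud)                    // hypothetical rendering callback
    .start();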
II. Word placement

The basic algorithm for building a word cloud, which is also used in D3.js, places words onto the canvas randomly, checking for overlaps with words already present and repositioning them if necessary. According to the requirements, however, terms would also need to appear clustered based on their epidemiological role. While checking the properties of input terms to assign them a colour, rotation and size was a straightforward addition (D3.js’s core algorithm was not changed, simply extended), due to time limitations the author was not able to go into enough depth to implement a possible solution that also clusters terms based on their properties.
5.3.2 Interactive context menus for each term in the cloud
As per the requirements, each term in the cloud needs to have additional functionality associated with it. The best way of implementing this was to use a small pop-up menu that opens when the user clicks on a term, and displays both extra information about that term and further actions, such as performing a new search.
It was contemplated whether to implement this menu as part of the SVG shapes, or whether to use traditional HTML elements. There are issues with both. Although SVG elements can be targeted with JavaScript and CSS, their prime usage lies in drawing, and they are not meant to enable very complex interactions or styling, such as is required by a pop-up menu. In addition, menus coded into SVG nodes would be confined inside the boundaries of the surrounding canvas, hence preventing terms closer to the edges of the cloud from accommodating full-sized menus. On the other hand, with HTML elements, associating a corresponding menu with each term would be difficult. HTML elements would need to be placed outside the canvas in the DOM, and are hence disjoint from the <text> nodes holding the terms. (SVG does include one specific element, <foreignObject>, which is meant to hold HTML content inside surrounding SVG shapes, thus solving both issues described. However, foreign objects are not allowed as direct children of SVG <text> nodes, which is unfortunately the only type of element that the word cloud contains.)
Because ultimate interactivity and freedom regarding the placement of menus was considered a priority, HTML elements were eventually decided on. The issue of how to associate each menu with a term was solved by adding an additional ID property to all entities. While looping over the entities hash table d when building the cloud, the algorithm then uses this ID property to create a new ContextMenu View, and assigns it the same ID. All ContextMenu Views trigger the addition of a menu template to the DOM, appended at the very end of the <body> and hidden using CSS.

In order to reveal a menu when a term is clicked, the user’s current cursor coordinates are used to first position the menu with the correct ID on the screen, and then un-hide the element. The result is a smooth experience that neither involves the rendering lag of SVG elements, nor introduces excessive complexity in associating a menu with a term.
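A minimal sketch of this reveal logic, assuming jQuery and an id-based naming scheme for the menu elements (both the selectors and the naming scheme are illustrative):

// When a term in the cloud is clicked, move its matching (hidden) menu
// to the cursor position and reveal it.
$('svg.cloud').on('click', 'text', function (event) {
  var menu = $('#context-menu-' + $(this).data('id')); // hypothetical id scheme
  menu.css({ position: 'absolute', left: event.pageX, top: event.pageY })
      .show();
});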
5.3.3 Normalising font sizes
An issue that was not explicitly considered during the design process was the different ways in which it is possible to map entity frequencies to specific font sizes in the word cloud. At first, a simple linear function was implemented by the author, whereby frequencies would map to a specific domain of font sizes in which the largest size would not take up an unreasonable amount of space in the cloud, while the smallest size would still maintain readability (e.g., from 12pt to 45pt) (see Pseudo code 5.2).
Pseudo code 5.2 Linearly mapping frequencies to font sizes
Precondition: min_font_size = 12, max_font_size = 45, all_freqs

function MAPTOFONTSIZE(frequency)
    min_freq ← min(all_freqs)
    max_freq ← max(all_freqs)
    range ← max_font_size − min_font_size
    domain ← max_freq − min_freq
    font_size_unit ← domain > 0 ? range / domain : 1
    return min_font_size + (frequency − min_freq) × font_size_unit

Although robust, this shallow approach introduced the following issues.

• The algorithm does not take into account the number of words in the cloud. If the result set is small and has a fairly uniform distribution of frequencies, the resulting cloud would occupy only a portion of the canvas available.

• While a linear scaling function replicates exactly the proportions between different frequencies, this correctness may actually not be desired. In order to visually emphasise the differences in frequencies, which in this case is a priority to aid scanning, exponentially larger font sizes, for example, would be more suitable.

• At the same time, a linear scaling function can result in a cloud where the majority of terms are displayed in the smallest allowed font size and a select few in the largest, if the distribution of frequencies is not uniform (a common situation with the data available).
Consider the following set of frequencies:

freqs = [1, 1, 1, 1, 1, 1, 1, 5, 6, 20, 187, 200]

Terms in the cloud, if mapped to font sizes linearly, would in this case occupy a rather small area of the available space because of (1) the small number of terms, and (2) the fact that most of these terms are displayed in the smallest font size allowed (12pt).
A better scaling function would both take into account the number of terms and emphasise even the slightest differences in frequencies more prominently. As a solution, D3.js’s native scaling functions were utilised, and instead of mapping frequencies linearly, an exponential scale was used. In addition, a set of font-size intervals was defined that map to different amounts of terms – the fewer terms there are to display, the larger the font sizes in the range defined.

To illustrate the impact of this, consider the differences in how the set of frequencies given in the above example maps to a set of font sizes with the two algorithms (the same range of font sizes is used in both cases to better illustrate the differences in the scaling function).
Mapping with a linear scale:

font_sizes = [12, 12, 12, 12, 12, 12, 12, 12.67, 12.9, 15.15, 42.8, 45]

Mapping with an exponential scale with 0.3 used as the exponent factor:

font_sizes = [12, 12, 12, 12, 12, 12, 12, 17.24, 18, 24.31, 44, 45]
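The exponential mapping above can be reproduced with D3’s built-in power scale (a sketch against the D3 v3 API that the project builds on; the variable names are illustrative):

// Map raw frequencies to font sizes with a power scale.
var freqs = [1, 1, 1, 1, 1, 1, 1, 5, 6, 20, 187, 200];

var fontScale = d3.scale.pow()
    .exponent(0.3)                              // the 0.3 exponent factor
    .domain([d3.min(freqs), d3.max(freqs)])     // observed frequency range
    .range([12, 45]);                           // allowed font sizes in pt

var fontSizes = freqs.map(fontScale);
// e.g. fontScale(5) ≈ 17.24, fontScale(20) ≈ 24.31, fontScale(200) = 45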
5.4 Coping with loading time
As the code base for the system grew large, issues with speed and loading time became apparent. Visual feedback indicating to users that certain areas of the DOM are currently being updated is common and was implemented here, but there were situations where lag was introduced not by database requests, but by the rendering of SVG elements for large sets of results. These situations were particularly difficult to solve, as the way a browser renders complex SVG elements could not be easily controlled by the author. This section describes the different approaches taken to eliminate speed issues caused by both API response time and SVG rendering.
5.4.1 Asynchronous events
Both the fetch() method on Backbone Models and Collections and the ajax() method provided by jQuery enable the use of asynchronous events and requests. Although executing parts of the code asynchronously does not reduce loading time, it enables the application interface to remain interactive and feel snappy while resources are loaded in the background. Asynchronous requests were therefore implemented throughout the application.
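For example, a Backbone fetch is asynchronous by default, with callbacks invoked once the data arrives (a minimal sketch; the collection and view names are illustrative):

// Fetch related entities without blocking the UI; re-render on arrival.
this.relatedEntities.fetch({
  success: function (collection) {
    cloudView.render(collection);              // hypothetical view re-render
  },
  error: function () {
    messagesView.showError('Could not load results.');
  }
});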
5.4.2 Infinite scroll for articles list
In cases where very few generic search terms are specified and the resulting data sets are large, the number of matching articles usually ranges from several hundred to a few thousand. Once a list of PMIDs has been retrieved, the system performs an asynchronous request against the E-Utilities API to retrieve the title and publication date of each; the loading of articles therefore does not limit the user’s ability to begin exploring the visualisations. Nonetheless, it makes sense to interrupt the loading of hundreds of article titles to save resources and prevent potentially annoying updates to the DOM before explicit user requests. This was achieved by implementing infinite scrolling of the list of articles – initially, only as many documents are loaded as are needed to reveal a scroll bar in the articles list; then, as scroll events are captured, more articles are gradually loaded. Pseudo code 5.3 shows a high-level overview of this algorithm.
Pseudo code 5.3 Infinitely scrolling articles list

function LOADINITIALARTICLES
    loaded_height ← 0
    while loaded_height ≤ container_height + 100 do
        if articles_loaded ≥ articles_amt then
            break
        fetch_an_article
        loaded_height += single_article_height

function DECIDELOADMORE                              ▷ Executes on scroll
    current_position ← dist_from_top + container_height
    available_height ← container_scrollheight
    if current_position + trigger_point ≥ available_height then
        fetch_some_articles(5)
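In the browser, this boils down to comparing scroll offsets, roughly as in the following sketch (assuming jQuery; the element ID, threshold, and helper function are illustrative):

// Fetch five more articles when the user scrolls close to the bottom.
var TRIGGER_POINT = 150; // px before the bottom at which loading starts

$('#articles-list').on('scroll', function () {
  var nearBottom = this.scrollTop + this.clientHeight + TRIGGER_POINT
                   >= this.scrollHeight;
  if (nearBottom) {
    fetchSomeArticles(5); // hypothetical wrapper around the E-Utilities request
  }
});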
5.4.3 Autocompletion for search terms
Loading autocompletion data for the search input was initially implemented by requesting a list of all unique entities from the epidemiology database. This was problematic not only because of speed issues, but also because of the inability to distinguish the different roles each retrieved entity appeared in. As a solution, the author implemented an extension to the API endpoint providing the list of entities, whereby a role could be specified. Consequently, instead of performing a request to /api/entities, a request could be sent to /api/entities?requested_type=outcome, where “outcome” could be replaced by any of the six main roles. A new request is then performed for autocompletion data every time the user selects a role for their search term. This reduces the use of resources, as certain data are loaded only when needed.
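A sketch of the role-specific request (jQuery again; the selector and the helper feeding the autocompletion widget are illustrative):

// Reload autocompletion suggestions whenever a new role is selected.
$('#role-dropdown').on('change', function () {
  $.getJSON('/api/entities', { requested_type: $(this).val() },
    function (entities) {
      updateAutocomplete(entities); // hypothetical: feeds the search input
    });
});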
5.4.4 Caching
In an attempt to reduce loading time for large sets of results, basic caching was implemented for the main /api/entities/related endpoint. This was achieved with the help of the JavaScript plugin backbone-fetch-cache by Andrew Appleton (available online at www.github.com/mrappleton/backbone-fetch-cache). The plugin adds caching support to Backbone.js by saving all AJAX request results to the client browser’s local storage, a feature introduced by HTML5 that enables key-value pairs to be stored locally and accessed later even if the user has navigated away from the page. Keys represent the URLs that data has been fetched from, and values represent the actual data fetched. An additional step was then added to the fetch() function of the RelatedEntities Collection – whenever changes occur in the Filters, a decide_fetch() function is invoked that first establishes whether the requested set is already cached, using this cached data from the localStorage object if possible.
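The idea behind decide_fetch() can be sketched as follows (a simplified illustration using the raw localStorage API rather than the plugin’s own mechanics; all names are illustrative):

// Serve a cached response when one exists; otherwise fetch and cache it.
function decideFetch(collection, url) {
  var cached = window.localStorage.getItem(url);
  if (cached !== null) {
    collection.reset(JSON.parse(cached)); // reuse the stored result set
  } else {
    collection.fetch({
      url: url,
      success: function (coll, response) {
        window.localStorage.setItem(url, JSON.stringify(response));
      }
    });
  }
}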
Although the caching functionality was implemented successfully, the loading time of the word cloud was not significantly improved, and the root cause was traced to the SVG rendering engine, which lags when having to place numerous elements onto the canvas. The misdiagnosis of this issue may have resulted from insufficient performance profiling, but in the end, the caching implemented still enabled a more economical use of resources.
5.4.5 Displaying only selected categories in the cloud upon load
To cope with the SVG rendering time of the word cloud, the best strategy that the author was able to successfully implement was a compromise – rendering time decreases if the number of elements to be rendered decreases, too. Meanwhile, it was brought to the author’s attention during testing and evaluation that epidemiologists may benefit from seeing visualisations with less distracting data, and that concepts appearing as outcomes and exposures are the most relevant. A decision was therefore made to initially render only those words in the cloud that appear in publications as outcomes or exposures, a solution that benefited the system from both a usability and a performance perspective.
5.5 Testing
Manual testing of all functionality was performed progressively throughout all implementation iterations of the application. This conforms with the agile development approach taken in this project, as bugs were addressed when they arose rather than tracked during separate testing phases.
5.5.1 Keeping track of test cases and logging issues
BitBucket (www.bitbucket.org) was used for hosting a version-controlled repository with the source code. The service also offers comprehensive issue tracking functionality, and as the code base grew, this proved to be a useful resource in regression testing, when critical functionality areas were reviewed following each major addition. In addition, the tool BitBucketCards (www.bitbucketcards.com) served as a useful extension to the basic bug tracking functionality provided in BitBucket by visualising issues into different subcategories and priorities (see Figure 5.1) – useful in just-in-time and agile development practices.

Figure 5.1: BitBucketCards laying out issues and bugs from BitBucket into scannable prioritised stacks
5.5.2 Testing activity
The most comprehensively tested area of the application is the API, as it is the fundamental component that everything else in the system depends on. Hence, systematic testing was undertaken for each endpoint before utilising it in the front-end. Methods included adding debug messages to the server logs, manually verifying the match between endpoint response data and the data in the epidemiology database as well as in PubMed, and querying with data that would return large result sets in order to test performance.
During the development of each individual component, rigorous attempts were constantly made to reveal combinations of input that would result in invalid output. These exploratory attempts proved most useful for pinpointing areas of the system where user activity should be restricted; although the application was behaving as expected, with certain inputs the output simply would not make sense. Examples include restricting the use of identical search terms and disabling the option to add an entity to the streamgraph if it already exists.
System validation testing
System validation refers to the process of confirming that core functionality is error-free before the final release of an application. In this project, system validation was performed during the final stages of implementation, where the satisfaction of all requirements was validated in detail. Seeing as the tool was developed for web use, different modern browsers were required to provide the same functionality and roughly the same user experience – system validation testing was therefore performed separately on Google Chromium 23, Opera 12.15, and Mozilla Firefox 20.0. The complete test suite and results of system validation are provided in Appendix B.
While it was confirmed that the functionality of the system is not affected by the choice of web browser, it was also necessary to ensure that the GUI maintains its layout and appearance across browsers. Test results show that this is indeed the case; however, slight differences do appear in how Firefox renders the visualisations:

1. the SVG <text> elements are rendered without anti-aliasing and therefore appear fuzzy;

2. the units on the x-axis of the streamgraph are incorrectly displayed and appear as times of day (“1PM” is repeated across the axis) rather than years. When hovering on the layers of the graph, however, the correct year is shown at each point (the bug does therefore not block system usage).
AS DESCRIBED, SEVERAL CHALLENGING and interesting issues were encountered during the implementation stage of this project. While building most components was a straightforward process that ran smoothly, coping with the slow loading time of the application required continuous experimentation with various solutions. The next chapter discusses specific results of the implementation, combined with design and preliminary research, and describes the evaluation of the finished application.
Chapter 6
Results and evaluation
The aim of this chapter is to provide an overview of the developed system and to discuss whether it satisfies the requirements defined, by presenting the results of a formal evaluation.
6.1 The final product – EDViC
As a result of the research and design work undertaken, the author has managed to successfully implement an application that visualises epidemiological data based on user queries. At the time of writing, EDViC is due to be publicly available at gnode1.mib.man.ac.uk/projects/edvic. To present what EDViC achieves, a walkthrough is given in this section that illustrates its main features.
6.1.1 The Welcome page
The entry point to the system is the Welcome page (Figure 6.1), where the prominent call-to-action element is the form that simultaneously gives an idea of what the system does and allows users to begin their search. Example searches are also provided for users who may initially want to simply explore the application.
Figure 6.1: The Welcome page
6.1.2 The Search Results page
The Search Results page (Figure 6.2) is where the core interactions take place. The user is presented with:

• a distinctive search bar that acts both as a reference to the context of the data presented and as a tool for modifying their search;

• an infinitely scrolling list of publications that match their search;

• a word cloud where key related concepts are visualised;

• and a streamgraph that illustrates trends involving these concepts throughout time.
The search bar
The current search can be modified using the search bar (Figure 6.3). Search terms are added by entering a term into the empty search “label” at the end of the existing labels, and optionally specifying an epidemiological role using the drop-down menu. Users can choose to either add a new search term or start a new search entirely.

Figure 6.2: The Search Results page

Figure 6.3: Adding a search label in the search bar
The list of articles
Users can browse the list of articles and click on any of them to reveal a modal window with further details on the chosen document (Figure 6.4).
The word cloud
Figure 6.5: The cloud pop-up menu

Epidemiological concepts can be explored in the visualised term cloud by interacting either with the legend above the cloud specifying the different epidemiological roles, or by clicking on a term in the cloud to reveal further actions. Choosing any category in the legend will either show or hide all terms of that category in the term cloud (all currently hidden terms are saved in the “Hidden terms” drop-down menu (Figure 6.6), which can be used to un-hide them separately). The context menu for each term (Figure 6.5) allows modifying the search with that term, hiding the term from the cloud, adding a layer corresponding to the term to the streamgraph, and displaying a modal window with sentences containing the term (Figure 6.7). The “Redraw cloud” button can be used to re-render the cloud when an excessive amount of space has been freed by hiding terms, or simply to gain a new point of view.
The streamgraph
An interactive streamgraph is displayed below the word cloud as a secondary source of visualised results. The streamgraph’s layers represent the popularity of a concept in epidemiological publications throughout time. By default, layers are shown for each search term, but concepts may be compared further by adding them from the cloud. Hovering over each layer displays the year and number of publications corresponding to the current coordinates of the mouse. Added layers may be cleared using the “Reset” button.

Figure 6.4: Viewing article details in a modal window

Figure 6.6: The hidden terms drop-down menu

Figure 6.7: Viewing terms in the cloud in context
6.1.3 Additional pages
A Help page (Figure 6.8) is provided that explains how to interact with the application, as well as an About page that gives details on the nature of the project and its motivation (Figure 6.9).

Figure 6.8: The Help page

Figure 6.9: The About page
6.2 Evaluation
In order to address whether the application developed meets the specific needs of actual target users, basic user testing was performed with George Karystianis, whose profile is well aligned with that of the average user of the application. To evaluate the application, a set of tasks was first given to Karystianis (listed in Table 6.1), and observations were made based on whether he was able to perform them. An informal discussion followed in which the participant was given a chance to comment on different aspects of the application.
6.2.1 Task completion
Table 6.1 shows the results of the task performance. As illustrated, most tasks were straightforward to the user and were performed without issues. Observations indicated, however, that a key weakness of the interface lies in the search bar – although the user successfully entered their search term and chose a role of interest from the drop-down, confusion followed as to how to submit the term to the system. The issue here is one of design, as the search bar was intended to be used in a different order: (1) the user chooses a category from the drop-down, if required; (2) the system updates the autosuggestion domain; (3) the user types their query with the aid of the autosuggestions, and can conveniently press Enter when they have found a match to initiate the search. The task revealed that if the user types their term before choosing a category, the interface immediately becomes less intuitive, as they would need to reposition their cursor into the text box and press Enter to submit their input. The simple detail left unimplemented that this issue brought attention to was a designated “Search” button.
Task description | User performance and notes
Initiating a search on the Welcome page | Performed successfully
Adding a search term on the results