End of Project Review
Usability Plan for 2012
Dissemination Plans for 2012
WordTree Development Posts (2011)
Summary of Outputs from Stakeholder Meetings (2011)
1. Project overview
This is a JISC ’rapid innovation’ project for the creation of adaptable and learnable
user interfaces. It involves building and piloting a Word Tree interface which can be
adapted for use with any corpus.
The most important information contained in a corpus concerns patterns of
language use, but these patterns are often hard to discern when corpus data is
presented in the standard way, using Key Words in Context (KWIC) concordance
lines. This project will build a Word Tree interface to present the patterns visually.
Following an “onion” model, users will be able to interact
with the surface layer of data, but will be offered opportunities to enter increasingly
complex digital environments where they can examine phraseological patterns in
their wider contexts and gather statistical evidence to support research hunches.
An IBM ‘ManyEyes’ version of a Word Tree interface has been available for some
time and has already been demonstrated to potential
users at national and international events. Our project will build on the
ic response to this very limited interface, and work with a range of
stakeholders to design, develop and trial a whole corpus version with a far richer
range of search options.
Aims and objectives
The aim of this project is to develop a multi
Word Tree interface which
will allow users to search and browse within documents and across a corpus, and
access instant visual representation of the language patterns surrounding any given
word or phrase.
Our goal is to increase access and usage of corpus resources, both
by corpus linguists and by language teachers and learners. In
order to reach our goal the interface will have to be accessible and fun for
non-experts, whilst providing
useful pattern information for all levels of stakeholder.
Success will be measured in terms of the number of interactions with the Word Tree,
the number of repeat visits to the Word Tree site (monitored via user profiles),
positive feedback from stakeholder groups, and, ultimately, evidence of extended
use of the Word Tree interface code with other corpora. Use of the Word Tree
interface will be measured against current
use of visualisations at
The project team
Hilary Nesi, Coventry University
Emma Moreton, Coventry University
Simon Scarle /
Simon Hawkins, Coventry University
Conail Stewart, Freelance Consultant
2. End of project review
The WordTree had the goal of providing an alternative, interactive user interface to
analytical tools like Key Words in Context (KWIC). The interface
produced allows users to generate word trees for individual terms, starting from the
searched term at the leftmost edge with branches of following words extending to
the right. In most cases,
the number of potential branches that could be returned is
vastly greater than the number of nodes that can be legibly printed.
Thus, we allow the user to select one or more filtering functions to discard the
majority of terms.
The default setting is to select the top n terms ordered by
frequency under the selected node.
Other options include lexicographic ordering
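The default filter described above can be sketched as follows. This is a minimal illustration, not the project's code: the `Node` class, its field names and the function names are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One term in a word tree (hypothetical structure for illustration)."""
    term: str
    count: int  # number of occurrences passing through this node
    children: list[Node] = field(default_factory=list)


def prune_top_n(node: Node, n: int) -> Node:
    """Return a pruned copy keeping the n most frequent children at every level."""
    # Sort by descending count; ties fall back to alphabetical order for stable output.
    ranked = sorted(node.children, key=lambda c: (-c.count, c.term))
    return Node(node.term, node.count, [prune_top_n(c, n) for c in ranked[:n]])


def order_lexicographically(node: Node) -> Node:
    """Return a copy with the children of every node sorted alphabetically."""
    ordered = sorted(node.children, key=lambda c: c.term)
    return Node(node.term, node.count, [order_lexicographically(c) for c in ordered])
```

Both functions return new trees rather than modifying the input, so the complete tree stays available if the user relaxes the filter.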
Original project goals and progress made
User clicks from one term to another.
As they click, the
Word Tree narrows to show the remainder of the tree.
The user may clear the document subset that they are searching within to show
complete trees for clicked terms.
The number of occurrences is shown below each
node and the area around the node is shaded proportionally to the number
: Only forward-facing trees are
Documents may be filtered by level, genre or discipline.
Not currently possible.
Zoom and pan
Currently usable in WebKit-based browsers only.
Images can be downloaded in SVG format.
JSON Return latency
It is difficult to persist trees or large hierarchies of data in a traditional web
application development stack.
The tree data structure itself does not fit neatly
within a relational schema, whilst high out-degrees for commonly used terms make
even loading a complete tree a time-consuming operation.
Our word tree
application is further complicated by the requirement to filter trees by
document subset.
As a workaround, we marshal complete trees
into and out of memory using SQLite as the database back-end.
This allows versions
of the database to be swapped, or loaded dynamically, to allow split testing.
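One way to marshal whole trees through SQLite is to store each tree as a serialised blob keyed by its root term. The sketch below assumes a JSON encoding and a single `trees` table; the table name, columns and function names are illustrative assumptions, not the project's schema.

```python
import json
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a tree store; ':memory:' keeps everything in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trees (term TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_tree(conn: sqlite3.Connection, term: str, tree: dict) -> None:
    """Serialise a complete tree and store it under its root term."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO trees (term, body) VALUES (?, ?)",
            (term, json.dumps(tree)),
        )


def load_tree(conn: sqlite3.Connection, term: str) -> Optional[dict]:
    """Load and deserialise the tree for a term, or None if it is not stored."""
    row = conn.execute("SELECT body FROM trees WHERE term = ?", (term,)).fetchone()
    return json.loads(row[0]) if row else None
```

Because each store is a single file (or an in-memory database), swapping one version of the data for another amounts to opening a different path.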
Speed of Tree Transition
Moving linearly between trees by clicking successive terms creates only a small
change in the overall tree; however, the entire new tree needs to be recreated.
Unlike previous works, our trees are computed at the server
due to the requirement to perform document filtering (and to avoid transmitting the entire
corpus during front-end interface loading).
The quantity of information returned is difficult to process cognitively.
Trees can be pruned in many ways and, in order to be useful, the
structure of the corpus being analysed needs to be somewhat internalised.
The primary difficulty in displaying Word Trees is the amount of space they require.
This may be improved in future releases by allowing layers of metadata to be overlaid on
the tree, similar to the current method of displaying term frequencies only on hover.
Pruning filters are built iteratively; it is better to start with the complete tree.
For large trees, pruning must be very aggressive.
JSON downloads greater
than 200 KB slow the rendering process significantly, as the complete data
structure must be loaded before drawing can begin;
it cannot be streamed.
A better version would be able to stitch together sub-trees using the cached
document set that the user has built iteratively.
Passing only sub-trees would allow animation when moving between trees or
Images could be rasterised at the server for downloading.
SVG does not work
well in older browsers (< IE8) which unfortunately are pervasive in
Set operations in SQL are not well suited to determining which sentences to
include. For common terms in the word tree (‘of’, ‘then’, ‘the’), they can take
up to 12s to return on a typical VPS configuration (1 GB RAM, 1 GHz CPU).
This is before pruning begins.
Traversals in graph databases (e.g. neo4j) do not perform well in memory
Document-oriented storage is not useful where hierarchies are deeply nested.
To generate faceted word trees, nodes need to be traversed deeply
within documents, and query languages that offer this feature tend to be
cumbersome (e.g. MongoDB).
Map-reduce was experimented with briefly (CouchDB) and may provide a
nt query environment, but the learning curve to integrate it
with the rest of the software stack proved too daunting.
The biggest potential win in improving perceived
performance stems from how the data structure is rendered visually.
The use of JSON to transmit the entire tree between server and client prevents nodes
from being streamed and rendered in the browser as they are loaded.
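One possible alternative, shown here only as a sketch and not something the project implemented, is to flatten the tree into one small JSON record per node, parents before children, so that a client could draw each node as its line arrives. The record fields (`id`, `parent`, `term`, `count`) and the nested-dict input format are assumptions for the example.

```python
import json
from typing import Iterator


def stream_nodes(tree: dict) -> Iterator[str]:
    """Yield one JSON line per node in pre-order, so parents precede children."""
    next_id = 0
    stack = [(tree, None)]  # (node, id of its parent)
    while stack:
        node, parent_id = stack.pop()
        node_id = next_id
        next_id += 1
        yield json.dumps(
            {"id": node_id, "parent": parent_id,
             "term": node["term"], "count": node["count"]}
        )
        # Push children in reverse so the first child is emitted first.
        for child in reversed(node.get("children", [])):
            stack.append((child, node_id))
```

Each line is a complete JSON document, so the receiver never has to wait for the whole structure before it can start drawing.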
3. Source Code
Source code is available from:
4. Beta version
The beta version of the Word Tree is available at
This works best with Firefox and Google Chrome, and
not so well with Internet Explorer.
We will shortly activate the filter buttons for level, discipline and genre. Other
modifications are ongoing, in response to user feedback.
More information about the beta version:
: Faceted search features are hidden from the basic search;
they can be enabled by clicking on the ‘Advanced’ button.
By laying out terms along strict levels, it is possible to remap
child nodes to multiple parents.
The original menu allowed users to finely specify search parameters.
However, it was difficult to specify the exact set that would
produce the most useful tree without knowing the frequency distribution of
terms within the corpus beforehand.
Furthermore, parameters were set
globally, which led to very shallow trees being created: parameters that
suited common terms at the tree base did not lend themselves adequately to
returning sparser terms toward the leaves.
Subsequently, these parameters
were selected automatically using characteristics of the base term, which were
then adjusted at each level.
Determining the sentences which make up the
requested tree is by far the most computationally intensive part of the
process. Various inverted index strategies have been attempted, the
most efficient being a series of in-memory set operations on a small fraction
of document metadata.
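The general idea of an inverted index combined with in-memory set operations can be sketched as below. The input format (sentence id, document id, text), the whitespace tokenisation and the function names are assumptions for illustration; the project's actual index layout is not described in this report.

```python
from collections import defaultdict


def build_index(sentences):
    """Build term -> sentence ids and document -> sentence ids lookups.

    `sentences` is an iterable of (sentence_id, doc_id, text) tuples.
    """
    term_index = defaultdict(set)
    doc_index = defaultdict(set)
    for sentence_id, doc_id, text in sentences:
        doc_index[doc_id].add(sentence_id)
        for token in text.lower().split():
            term_index[token].add(sentence_id)
    return term_index, doc_index


def sentences_for(term, doc_ids, term_index, doc_index):
    """Return ids of sentences containing `term` within the chosen documents."""
    # Union of the sentence ids belonging to the selected document subset...
    in_subset = set().union(*(doc_index.get(d, set()) for d in doc_ids))
    # ...intersected with the sentences that contain the term.
    return term_index.get(term.lower(), set()) & in_subset
```

Filtering by level, genre or discipline would then reduce to choosing which document ids are passed in as `doc_ids`.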
5. Usability plan for 2012
Following on from Stakeholder discussions, in 2012 we will be assessing the
effectiveness of WordTree through a combination of quantitative and qualitative
methods.
Before making it publicly available, we have circulated the web address for the
WordTree to Stakeholder Group One, who will explore and experiment with the
system in order to give feedback on how easy it is to use, how effective it is and
the extent to which it meets the stated objectives of the project. To help
facilitate this, feedback forms are embedded on each page within the system.
Stakeholders will use these forms to comment on any particular issues that arise
as they navigate around. The system will log the activity they were undertaking
or the search they had made at the point that the form was submitted. The
feedback from stakeholders will be compiled for review in February 2012, and an
action plan for system amendments and enhancement created.
Once these changes have been implemented we will carry out a further
study, this time involving people who are completely new to the system. This will
take place in the testing facilities at Coventry University’s Serious Games
Institute. Each subject will be asked to undertake a series of pre-defined
tasks and their performance will be recorded using “Silverback” software.
Silverback creates a video recording of the computer screen and where people
are clicking as they complete the tasks. It also records their facial expressions via
a web cam.
The purpose of this study will be to identify any common points of
confusion or misunderstanding that people experience when navigating through
the WordTree interface. The subjects will also be asked to complete a more
general questionnaire and give feedback on their experience of using the system.
The results of the usability study and the Silverback videos will be published on
the project blog. The results will also be analysed, with recommendations made
for future amendments to the software.
Quantitative recording tools are also built into this version of the WordTree, in
order to monitor the behaviour of users on an ongoing basis. In particular, the
s that people use will be logged and made available for future
Google Analytics will also be embedded into the system. This will
provide the project team with detailed reports concerning:
The number of people using the WordTree
The typical paths that people take through the system
How long people typically spend using the system
The geographical location of users
How they found the site
The searches that they made
Which links and buttons are most commonly clicked on each page
The speed of the s
6. Dissemination plans
We will be
the Word Tree at the following conferences
Asia Pacific Corpus Linguistics Conference. Auckland, New
Zealand, February 15th to 19th, 2012. See
: Corpora at the centre and crossroads of English linguistics. Leuven,
June 3rd 2012.
Huddersfield University, April 18th 2012.
University of Birmingham, May 9
(CorpLing in the Midlands:
and in August 2012 for staff/students on the Birmingham presessional programme.
7.1 Data Structures
Having integrated the full-text transcriptions and metadata headers
from the BAWE
corpus, a suitable data structure needs to be derived.
Given a set of documents and a term to examine, the corresponding set of
concordances should be returned for rendering as quickly as possible. The data
structure employed should be optimised for reading and concordances can be
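The requirement above (a set of documents and a term in, a set of concordances out) can be illustrated with a naive scan. This sketch is not optimised for reading in the way the text calls for; it only shows the expected input and output. The function name, the `width` parameter and the punctuation handling are assumptions.

```python
from __future__ import annotations

from typing import Iterator


def concordances(
    documents: dict[str, str], term: str, width: int = 4
) -> Iterator[tuple[str, str, str, str]]:
    """Yield (doc_id, left context, keyword, right context) for each hit."""
    target = term.lower()
    for doc_id, text in documents.items():
        tokens = text.split()
        for i, token in enumerate(tokens):
            # Compare case-insensitively, ignoring attached punctuation.
            if token.lower().strip(".,;:!?\"'") == target:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                yield doc_id, left, token, right
```

For example, with `width=2` the text "The results show that the model fits." yields two lines for "the": one with an empty left context and one centred on the second occurrence.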
Drawing word trees for unknown corpora limits the degree of control over
visualisation. For large graphs, there may be upwards of 1,000 nodes on
screen at one time. In addition, client resources also limit the range of technologies that may be
employed to render a tree.
Potential technologies include:
Scalable Vector Graphics (SVG)
Flash graphics API
Processing & Processing.js
Related project r
The rendering technology should be:
Capable of producing high-quality print output
Able to render on older client hardware OR render to a bitmap on the application
server
SVG was chosen as the display technology because it is:
no difference between on-screen and print output.
based text JSON format can be mapped directly to SVG
Addressable via CSS: separation of content and design/layout eases future
Circular Drawings of Rooted Trees [Melançon, Herman] 1998
Improving Walker’s Algorithm to Run in Linear Time [Buchheim, Jünger, Leipert]
Animated Exploration of Dynamic Graphs with Radial Layout
[Yee, Fisher, Dhamija,
[Bruls, Huizing, van Wijk] 2000,
Analysis and Visualization of Network Data
Visualisation: Walker’s Algorithm in d3 and Canvas
Word Trees are connected sets of terms linked by the ordering inherent to
In other words, nodes connected by edges
The starting point for a Word Tree is the term under examination.
This word becomes the root
for two trees, one for each direction (forward and back).
Nodes are not
interconnected, which simplifies processing.
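The two-tree structure can be sketched with nested dictionaries. This is an illustrative assumption about representation, not the project's implementation: each node holds a `count` and a `children` mapping, tokenisation is plain whitespace splitting, and `depth` limits how far each branch extends.

```python
from __future__ import annotations


def _new_node() -> dict:
    return {"count": 0, "children": {}}


def insert_path(tree: dict, path: list[str]) -> None:
    """Add one occurrence of the root, followed by the given terms, to a tree."""
    node = tree
    node["count"] += 1
    for term in path:
        node = node["children"].setdefault(term, _new_node())
        node["count"] += 1


def build_word_trees(sentences: list[str], root: str, depth: int = 3):
    """Return (forward, back) trees rooted at every occurrence of `root`."""
    forward, back = _new_node(), _new_node()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token == root:
                # Forward tree: the terms that follow the root, in order.
                insert_path(forward, tokens[i + 1:i + 1 + depth])
                # Back tree: the terms that precede it, nearest first.
                insert_path(back, tokens[max(0, i - depth):i][::-1])
    return forward, back
```

Because every path starts at the root and nodes are never shared between branches, each structure is a plain tree and can be built in a single pass over the sentences.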
D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then
apply data-driven transformations to the document. As a trivial example, you can
use D3 to generate a basic HTML table from an array of numbers. Or, use the same
data to create an interactive SVG bar chart with smooth transitions and interaction.
D3 is not a traditional visualization framework. Rather than provide a monolithic
system with all the features anyone may ever need, D3 solves only the crux of the
problem: efficient manipulation of documents based on data. This gives D3
extraordinary flexibility, exposing the full capabilities of underlying technologies such
as CSS3, HTML5 and SVG. It avoids learning a new intermediate proprietary
representation. With minimal overhead, D3 is extremely fast, supporting large
datasets and dynamic behaviors for interaction and animation. And, for those
common needs, D3’s functional style allows code reuse through a diverse collection
of optional modules.
Improving Walker’s Algorithm to Run in Linear Time [Buchheim, Jünger, Leipert]
Data Engineering: Suffix Trees/Arrays
Suffix trees and arrays are useful data structures for
solving string problems
elegantly and efficiently.
Proper use of suffix trees often
speeds up string processing algorithms from O(n²) to
linear time.
In its simplest instantiation, a suffix tree is simply a trie of the n suffixes
of an n-character string S. A trie is a tree structure, where each edge represents
one character, and the root represents the null string. Thus, each path from the
root represents a string, described by the characters labeling the edges traversed.
Any finite set of words defines a trie, and two words with common prefixes branch
off from each other at the first distinguishing character. Each leaf denotes the end
of a string.
Tries are useful for testing whether a given query string q is in the set.
We traverse the trie from the root along branches defined by successive characters
of q. If a branch does not exist in the trie, then q cannot be in the set of
strings. Otherwise we find q in |q| character comparisons regardless of how many
strings are in the trie. Tries are very simple to build (repeatedly insert new strings)
and very fast to search, although they can be expensive in terms of memory.
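The trie described above can be written in a few lines. The class and method names below are chosen for illustration. The `suffix_trie` helper builds the naive version (one insertion per suffix), which uses quadratic space in the worst case; it is not a compressed suffix tree or a linear-time construction.

```python
class Trie:
    """A character trie: each edge is one character, the root is the empty string."""

    def __init__(self) -> None:
        self.children: dict = {}
        self.is_end = False  # True if a stored string ends at this node

    def insert(self, word: str) -> None:
        """Add a string by walking or creating one edge per character."""
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def _walk(self, query: str):
        """Follow the query from the root; return the final node or None."""
        node = self
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, query: str) -> bool:
        """True if the query was inserted as a whole string."""
        node = self._walk(query)
        return node is not None and node.is_end

    def has_path(self, query: str) -> bool:
        """True if the query is a prefix of any inserted string."""
        return self._walk(query) is not None


def suffix_trie(text: str) -> Trie:
    """Naive suffix trie: insert every suffix of the text."""
    trie = Trie()
    for i in range(len(text)):
        trie.insert(text[i:])
    return trie
```

Since every substring of a text is a prefix of one of its suffixes, `suffix_trie(text).has_path(q)` answers a substring query in |q| character comparisons.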
An Incomplex Algorithm for Fast Suffix Array Construction [Schürmann, Stoye]
Meeting One: May 13 2011
On May 13th 2011 we held our first Stakeholder Group meeting, to ask expert users
what they would like the new interface to provide. Their responses are below:
1) Useful features for Corpus Linguists and ELTs
To have the statistics for a particular search word (word frequency,
dispersion across texts and genres, number and type of texts
the word appears in, size of sub-corpora etc.). This statistical information
could be accessed by hovering over a particular
can we incorporate this into the interface?]
(Font size could also represent frequency, as in Many Eyes Word Tree).
The option to choose which metadata and stats are required.
Ability to do on the spot markup/annotation.
Ability to request a random selection of concordance lines.
the interface should have two levels: 1) raw frequencies 2) the ability
to turn on normalised frequencies, lemmatisation,
Ability to access the original text.
Lemmatisation + split screen (lemma search + split screen for results for
Ability to search for POS using CQL.
Search button should be case sensitive.
Option to send findings to another programme (AntConc, Sketch Engine etc.
via URL link).
Different tabs for different software
To be able to print out (good quality, readable) word trees (would need a
template to do this).
Ability to output data in different ways (perhaps an open field for other
people to add their
The ability to export to an Excel or Word file etc.
Ability to manipulate/customise the data (for the purpose of a specific
lesson). Need an output that is manipulable (ability to remove items, colour
code POS, highlight specific
Ability to send screen shots to Facebook, email etc.
Stakeholder Group One
Meeting 1 (13 May 2011)
Full screen display containing as much info. as possible. Ability to zoom in and
out on certain parts of the word tree.
Ability to resize parts of the text so that students can see it on the interactive
Ability to use a number of visualisations for the same data set (a library of
different ways of viewing the same data set).
Visualisations of the corpora (a pie chart, for example, which shows the
portions/sections of the corpus). Ability to hover the mouse over a
particular segment of the corpus to see the stats. A single click
would automatically load the sub-corpus (rather
than having a drop down menu).
Type in a search word and the visualisation would demonstrate “hotspots” of
where that word occurs in the corpus.
Ability to visualise the corpus in different ways.
Ability to interact with the display.
Toggle: simple view and advanced view.
Ability to flip between concordance lines and word trees.
Ability to go in and out of the corpus: move between corpus and word tree.
Ability to drag and drop outputs into a separate area (ability to save
results in a separate screen).
Layered screens and/or separate windows to compare searches across
More interesting vision of collocates
target word in the middle, stronger
less strong collocates are further away.
Trees fading gently into place, and animation occurs after each additional
search term is added, as in Many Eyes Word Tree.
Need to develop ‘ways in’ to the data. Instructions for users on how to
frequency lists and branch data (deepest branches, most
words have the most connections?).
Ability to produce frequency lists + lists of n-grams for sub-corpora.
Try to avoid using lots of corpus software in the classroom:
an interface with
Plans for stakeholder meetings in July
In July we will be presenting a first version of the
Word Tree interface to all four stakeholder groups.
Group 2 will have remote access
to the interface website, and
additionally Hilary Nesi and Emma Moreton will discuss version one of the Word
Tree face to face with
academics attending the Lancaster University Corpus
Linguistics Summer School.
We will demonstrate the
to the foll
earchers (Stakeholder Group
visiting English language lecturers
from two Chinese universities
University of Finance and Economics, Nanjang, and Zhejiang University of
Finance and Economics, Hangzhou
e Department of English at the University of
to the following groups of
EAP learners attending the Lancaster University presessional programme
in the Department of English at the Univ
Our next meeting with Stakeholder Group 1 is scheduled for July 25 2011.
Meeting with future interface users at Lancaster University
Emma Moreton and Hilary Nesi met with participants at the UCREL Summer School
in Corpus Linguistics, Lancaster University, on July 13 2011, to demonstrate a
prototype of the Word Tree interface and to discuss issues surrounding its design.
Here are our responses to the most frequently asked questions:
Q. Will we be able to use the interface with our own corpora?
A. At the end of the project (November 2011) the source code will be made
available for use with any other corpus in plain text or XML.
Q. Will it function for languages other than English?
A. We plan for it to work
with almost all languages, including those with other
Q. Does the interface show every instance of the occurrence of a given word or
pattern, or only a selection?
A. It will show every instance.
Q. You can already create word trees with Many Eyes. What is the point of creating a
A. The Many Eyes word trees only work with a small amount of data in a single file,
which then remains accessible on the Many Eyes site. The new interface will enable
users to work with much larger collections of text, differentiated by file so that the
provenance of each instance of use can be identified.
Users will also be able to
create their own customised subcorpora, view corpus statistics, and compare
patterns of use
simultaneously in differen
will be able to
the interface on their own servers for private access to their own text collections.
Meeting Two: July 25 2011
On July 25th we held our second meeting with expert users, to review progress with
interface development and
to consider ways in which the Word Tree might be linked
to other corpus resource tools.
Serge Sharoff demonstrated features of IntelliText,
which is being developed at the University of Leeds with AHRC funding. IntelliText
will automate the downloading of
large collections of texts from the web, and will
provide tools for automatic
part-of-speech annotation, term extraction, synonym
identification etc. Like the Word Tree, IntelliText will be distributed as open source
software to academic and industrial users, who will be free to extend it for the
benefit of the research community. We envisage that researchers will be able to
and annotate corpora
using IntelliText tools, and then
drawing on the resources of both
demonstrated SKYLIGHT, a classroom corpus resource he is developing with
a simple interface which does not assume any prior user experience,
we discussed the possibility of offering joint access to SKYLIGHT and the Word
Tree, to extend the range of both tools.
The group agreed that these
be made to the prototype interface.
Our next meeting with Stakeholder Group 1 is
scheduled for September 22 2011.