Final Report

Project blog:
http://cuba.coventry.ac.uk/wordtree/



1. Project Overview
2. End of Project Review
3. Source Code
4. Beta Version
5. Usability Plan for 2012
6. Dissemination Plans for 2012
7. WordTree Development Posts (2011)
8. Summary of Outputs from Stakeholder Meetings (2011)



1. Project overview


This is a JISC ’rapid innovation’ project for the creation of adaptable and learnable
user interfaces. It involves building and piloting a Word Tree interface which can be
adapted for use with any corpus.


The most important information contained in a corpus concerns patterns of
language use, but these patterns are often hard to discern when corpus data is
presented in the standard way, using Key Words in Context (KWIC) concordance
lines.


This project will build a Word Tree interface to present the patterns visually
(see example screenshot). Following an “onion” model, users will be able to interact
with the surface layer of data, but will be offered opportunities to enter increasingly
complex digital environments where they can examine phraseological patterns in
their wider contexts and gather statistical evidence to support research hunches.

An IBM ‘ManyEyes’ version of a Word Tree interface has been available for some
time at www.coventry.ac.uk/bawe. It has already been demonstrated to potential
end-users at national and international events. Our project will build on the
enthusiastic response to this very limited interface, and work with a range of
stakeholders to design, develop and trial a whole-corpus version with a far richer
range of search options.


Aims and objectives

The aim of this project is to develop a multi-dimensional Word Tree interface which
will allow users to search and browse within documents and across a corpus, and
access instant visual representation of the language patterns surrounding any given
word or phrase.


Our goal is to increase access to and usage of corpus resources, both
by corpus linguists and by language teachers and learners. In order to reach our goal
the interface will have to be accessible and fun for non-experts, whilst providing
useful pattern information for all levels of stakeholder.


Success will be measured in terms of the number of interactions with the Word Tree,
the number of repeat visits to the Word Tree site (monitored via user profiles),
positive feedback from stakeholder groups, and, ultimately, evidence of extended
use with other corpora of the Word Tree interface code. Use of the Word Tree
interface will be measured against current use of visualisations at
www.coventry.ac.uk/bawe.


The project team

Hilary Nesi, Coventry University

Emma Moreton, Coventry University

Simon Scarle / Simon Hawkins, Coventry University

Conail Stewart, Freelance Consultant


2. End of project review


2.1 Recap


The WordTree had the goal of providing an alternative, interactive user interface to
traditional text-analytical tools such as Key Words in Context (KWIC) concordances.
The website produced allows users to generate word trees for individual terms,
starting from the searched term at the leftmost edge, with branches of succeeding
words extending to the right. In most cases, the number of potential branches that
could be returned is vastly greater than the number of nodes that can be legibly
displayed in the visualisation. We therefore allow the user to select one or more
filtering functions to discard the majority of terms. The default setting selects the
top n terms, ordered by frequency, under the selected node. Other options include
lexicographic ordering and part-of-speech filtering.
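As an illustration of the default filter described above, the following is a minimal sketch in Python of a forward-facing word tree with top-n frequency pruning. It is not taken from the project's source code; the function names (`build_word_tree`, `_grow`), the whitespace tokenisation and the sample sentences are all assumptions made for the example.

```python
from collections import defaultdict

def build_word_tree(sentences, root, depth=3, top_n=2):
    """Build a forward-facing word tree rooted at `root`.

    Each node is a dict: {"term": str, "count": int, "children": [...]}.
    Only the `top_n` most frequent continuations are kept under each node.
    """
    # Collect the words that follow every occurrence of the root term.
    continuations = []
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive tokenisation (assumption)
        for i, token in enumerate(tokens):
            if token == root:
                continuations.append(tokens[i + 1 : i + 1 + depth])
    return _grow(root, continuations, top_n)

def _grow(term, continuations, top_n):
    node = {"term": term, "count": len(continuations), "children": []}
    # Group the continuations by their first word.
    groups = defaultdict(list)
    for cont in continuations:
        if cont:
            groups[cont[0]].append(cont[1:])
    # Default filter: keep the top-n branches by frequency
    # (ties are broken alphabetically so the result is deterministic).
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for word, rest in ranked[:top_n]:
        node["children"].append(_grow(word, rest, top_n))
    return node

sentences = [
    "the results of the study show a clear trend",
    "the results of the survey were mixed",
    "the results suggest that more work is needed",
    "the results of this experiment are unclear",
]
tree = build_word_tree(sentences, "results", depth=3, top_n=2)
```

Swapping the sort key for the term itself would give the lexicographic ordering option instead of frequency ordering.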


2.2
Original project goals and progress made




- Clickthrough Paths: The user clicks from one term to another. As they click, the Word Tree narrows to show the remainder of the tree. Alternatively, the user may clear the document subset that they are searching within to show complete trees for clicked terms.
- Frequency Visualisation: The number of occurrences is shown below each node and the area around the node is shaded proportionally to the number of occurrences.
- Directionality (forward/backward-facing trees): Only forward-facing trees are functional.
- Document filtering: Documents may be filtered by level, genre or discipline.
- Multi-corpora: Not currently possible.
- Zoom and pan: Currently usable in WebKit-based browsers only.
- Image Downloading: Images can be downloaded in SVG format.


2.3 Metrics


Usability

1. Session (frequency/duration)
2. Comments
3. Clickthrough path depth


Performance

1. JSON return latency
2. Render time



2.4 Obstacles


Tree Persistence


It is difficult to persist trees or large hierarchies of data in a traditional web
application development stack. The tree data structure itself does not fit neatly
within a relational schema, whilst high out-degrees for commonly used terms make
even loading a complete tree a time-consuming operation. Our word tree
application is further complicated by the requirement to filter trees to include only
occurrences within document subsets. As a workaround, we marshal complete trees
into and out of memory, using SQLite as the database back-end. This allows versions
of the database to be swapped, or loaded dynamically, to allow split testing.
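The marshalling workaround can be pictured roughly as follows. This is only a sketch under stated assumptions: the table name `trees`, the choice of JSON as the serialisation format and the helper names are hypothetical, and the project's actual schema may differ.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a SQLite file holding one serialised tree per term."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trees (term TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn

def save_tree(conn, term, tree):
    # The whole tree is stored as a single value, sidestepping the problem
    # of mapping a deep hierarchy onto relational rows.
    conn.execute(
        "INSERT OR REPLACE INTO trees (term, body) VALUES (?, ?)",
        (term, json.dumps(tree)),
    )
    conn.commit()

def load_tree(conn, term):
    row = conn.execute("SELECT body FROM trees WHERE term = ?", (term,)).fetchone()
    return json.loads(row[0]) if row else None

conn = open_store()
example = {"term": "results", "count": 2, "children": [
    {"term": "of", "count": 2, "children": []},
]}
save_tree(conn, "results", example)
```

Because each database is a single file, pointing `open_store` at a different path is enough to swap in another version of the data.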


Speed of Tree Transition


Moving between trees linearly, by clicking succeeding terms, creates only a small
change in the overall tree; however, the entire new tree needs to be recreated and
pruned server-side. Unlike previous work, our trees are computed at the server
because of the requirement to perform document filtering (and to avoid transmitting
the entire corpus during front-end interface loading).


Information Overload


The quantity of information returned is difficult to process cognitively and
computationally. Trees can be pruned in many ways and, in order to be useful, the
structure of the corpus being analysed needs to be somewhat internalised. The
primary difficulty in displaying Word Trees is the amount of space they require. This
may be improved in future releases by allowing layers of metadata to be overlaid on
the tree, similar to the current method of displaying term frequencies only on hover.


2.5 Lessons Learned


1. Pruning filters are built iteratively, so it is better to start with the complete tree precomputed.
2. For large trees, pruning must be very aggressive. JSON downloads greater than 200 KB slow the rendering process significantly, as the complete data structure must be loaded before drawing can begin; it cannot be streamed.
3. A better version would be able to stitch together sub-trees using the cached document set that the user has built iteratively.
4. Passing only sub-trees would allow animation when moving between trees or collapsing branches.
5. Images could be rasterised at the server for downloading. SVG does not display well in older browsers (< IE8), which unfortunately are pervasive in academic settings.
6. Set operations in SQL are not well suited to determining which sentences to show. For common terms in the word tree (‘of’, ‘then’, ‘the’), they can take up to 12 s to return on a typical VPS configuration (1 GB RAM, 1 GHz CPU). This is before pruning begins.
7. Traversals in graph databases (e.g. Neo4j) do not perform well in memory-constrained environments.
8. Document-oriented storage is not useful where hierarchies are deeply nested. To generate faceted word trees, nodes need to be traversed deeply within documents, and query languages that offer this feature tend to be cumbersome (e.g. MongoDB).
9. Map-reduce was experimented with briefly (CouchDB) and may provide a more performant query environment, but the learning curve to integrate it with the rest of the software stack proved too daunting.
10. The biggest potential win in improving perceived performance stems from how the data structure is rendered visually. Currently, the use of JSON to transmit the entire tree between server and client prevents nodes from being streamed and rendered in the browser as they are loaded.
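One possible way around the limitation in lesson 10, offered here only as an assumption and not as something the project implemented, is to flatten the tree into one record per node and send the records as newline-delimited JSON, so that a client could draw each node as it arrives. The record fields (`id`, `parent`) and the function names below are hypothetical.

```python
import json

def iter_nodes(tree, parent_id=None, counter=None):
    """Yield one flat record per node, parents before their children."""
    if counter is None:
        counter = [0]  # shared mutable counter for assigning node ids
    node_id = counter[0]
    counter[0] += 1
    yield {
        "id": node_id,
        "parent": parent_id,
        "term": tree["term"],
        "count": tree["count"],
    }
    for child in tree.get("children", []):
        yield from iter_nodes(child, node_id, counter)

def stream_ndjson(tree):
    """Generate one JSON document per line; each line can be parsed on its own."""
    for record in iter_nodes(tree):
        yield json.dumps(record) + "\n"

tree = {"term": "results", "count": 4, "children": [
    {"term": "of", "count": 3, "children": [
        {"term": "the", "count": 2, "children": []},
    ]},
    {"term": "suggest", "count": 1, "children": []},
]}
lines = list(stream_ndjson(tree))
```

Because every record names its parent, and parents are always emitted first, a receiver never has to wait for the full structure before attaching a node.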


3. Source Code


Source code is available from: https://github.com/conail/wordtree


4. Beta version


The beta version of the Word Tree is available at http://beta.thewordtree.net/

This works best with Firefox and Google Chrome, and not so well with Internet
Explorer 8. We will shortly activate the filter buttons for level, discipline and
genre. Other modifications are ongoing, in response to user feedback.


More information about the beta version:

- Faceted search: Faceted search features are hidden from the basic search screen; they can be enabled by clicking on the ‘Advanced’ button.
- Strict Levels: By laying out terms along strict levels, it is possible to remap child nodes to multiple parents.
- Original Menu: The original menu allowed users to specify search parameters finely. However, it was difficult to specify the exact set that would produce the most useful tree without knowing the frequency distribution of terms within the corpus beforehand. Furthermore, parameters were set globally, which led to very shallow trees being created: parameters which suited common terms at the tree base did not lend themselves adequately to returning sparser terms toward the leaves. Subsequently, these parameters were selected automatically using characteristics of the base term, which are then adjusted at each level.
- Redis In-Memory Search: Determining the sentences which make up the requested tree is by far the most computationally intensive part of the application. Various inverted index strategies have been attempted, the most efficient being a series of in-memory set operations on a small fraction of document metadata.
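The set-based approach can be sketched as below. Plain Python sets stand in for the Redis sets used by the beta, and the miniature corpus, facet names and function names are invented for the example, so this shows the general technique rather than the project's implementation.

```python
from collections import defaultdict

# Hypothetical miniature corpus: sentence id -> (document id, text).
SENTENCES = {
    1: ("doc_a", "the results of the study"),
    2: ("doc_a", "the aim of this essay"),
    3: ("doc_b", "the results suggest otherwise"),
    4: ("doc_c", "results of the survey"),
}
# Hypothetical document metadata used for faceted filtering.
DOC_FACETS = {
    "doc_a": {"discipline": "biology", "level": "1"},
    "doc_b": {"discipline": "history", "level": "1"},
    "doc_c": {"discipline": "biology", "level": "3"},
}

def build_indexes(sentences, doc_facets):
    term_index = defaultdict(set)   # term -> ids of sentences containing it
    facet_index = defaultdict(set)  # (facet, value) -> ids of sentences in matching documents
    for sid, (doc, text) in sentences.items():
        for term in set(text.split()):
            term_index[term].add(sid)
        for facet, value in doc_facets[doc].items():
            facet_index[(facet, value)].add(sid)
    return term_index, facet_index

def matching_sentences(term_index, facet_index, terms, facets):
    """Intersect the posting sets for every requested term and facet value."""
    sets = [term_index.get(t, set()) for t in terms]
    sets += [facet_index.get(f, set()) for f in facets]
    return set.intersection(*sets) if sets else set()

term_index, facet_index = build_indexes(SENTENCES, DOC_FACETS)
hits = matching_sentences(
    term_index, facet_index, ["results", "of"], [("discipline", "biology")]
)
```

Only sentence ids are held in the sets; the sentence text itself is fetched afterwards for the ids that survive the intersection.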


5. Usability plan for 2012


Following on from Stakeholder discussions, in 2012 we will be assessing the
effectiveness of WordTree through a combination of quantitative and qualitative
usability studies.


5.1

Before making it publicly available, we have circulated the web address for the
WordTree to Stakeholder Group One, who will explore and experiment with the
system in order to give feedback on how easy it is to use, how effective it is and
the extent to which it meets the stated objectives of the project. To help
facilitate this, feedback forms are embedded on each page within the system.
Stakeholders will use these forms to comment on any particular issue that arises
as they navigate around. The system will log the activity they were undertaking
or the search they had made at the point that the form was submitted. The
feedback from stakeholders will be compiled for review in February 2012, and an
action plan for system amendments and enhancements created.


5.2

Once these changes have been implemented we will carry out a further usability
study, this time involving people who are completely new to the system. This will
take place in the testing facilities at Coventry University’s Serious Games
Institute. Each subject will be asked to undertake a series of pre-determined
tasks and their performance will be recorded using “Silverback” software.
Silverback creates a video recording of the computer screen and of where people
are clicking as they complete the tasks. It also records their facial expressions via
a webcam. The purpose of this study will be to identify any common points of
confusion or misunderstanding that people experience when navigating through
the WordTree interface. The subjects will also be asked to complete a more
general questionnaire and give feedback on their experience of using the system.

The results of the usability study and the Silverback videos will be published on
the project blog. The results will also be analysed, with recommendations made
for future amendments to the software.


5.3

Quantitative recording tools are also built into this version of the WordTree, in
order to monitor the behaviour of users on an ongoing basis. In particular, the
search terms that people use will be logged and made available for future
evaluation. Google Analytics will also be embedded into the system. This will
provide the project team with detailed reports concerning:

- The number of people using the WordTree
- The typical paths that people take through the system
- How long people typically spend using the system
- The geographical location of users
- How they found the site
- The searches that they made
- Which links and buttons are most commonly clicked on each page
- The speed of the site


6. Dissemination plans for 2012


Conference presentations and workshops

We will be presenting the Word Tree at the following conferences and workshops:

Asia Pacific Corpus Linguistics Conference. Auckland, New Zealand, February 15th-19th, 2012. See abstract.pdf

ICAME 33: Corpora at the centre and crossroads of English linguistics. Leuven, Belgium, May 30th - June 3rd 2012.

Huddersfield University, April 18th 2012.

University of Birmingham, May 9th (CorpLing in the Midlands:
https://sites.google.com/site/corpuslinguisticsinthemidlands/) and in August 2012
for staff/students on the Birmingham presessional programme.


7. WordTree development posts (2011)


7.1 Data Structures


With the full-text transcriptions and metadata headers from the BAWE corpus
integrated, a suitable data structure needs to be derived.

Given a set of documents and a term to examine, the corresponding set of
concordances should be returned for rendering as quickly as possible. The data
structure employed should be optimised for reading, and concordances can be
precomputed.
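A read-optimised structure of this kind might look like the sketch below, which precomputes every concordance line once and then answers queries with a dictionary lookup and a document filter. The function names, the fixed context window and the sample documents are assumptions for illustration only.

```python
from collections import defaultdict

def precompute_concordances(documents, window=3):
    """Map each term to a list of (doc_id, left, keyword, right) tuples."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        tokens = text.lower().split()  # naive tokenisation (assumption)
        for i, token in enumerate(tokens):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            index[token].append((doc_id, left, token, right))
    return index

def concordances(index, term, doc_ids):
    """Read-only lookup: lines for `term` restricted to the chosen documents."""
    return [line for line in index.get(term, []) if line[0] in doc_ids]

documents = {
    "essay_1": "The results of the study show a trend",
    "essay_2": "These results suggest a different view",
}
index = precompute_concordances(documents, window=2)
lines = concordances(index, "results", {"essay_2"})
```

The trade-off is memory: every token position is stored with its context, which is acceptable when the corpus is fixed and reads vastly outnumber writes.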


7.2 Drawing Wordtrees


Problem: Drawing word trees for unknown corpora limits the degree of control over
visualisation. For large graphs, there may be upwards of 1,000 nodes on-screen at
any time. In addition, client resources limit the range of technologies that may be
employed to render a tree.


Solution: Potential technologies include:

- Canvas
- Scalable Vector Graphics (SVG)
- Flash graphics API
- Processing & Processing.js


Related project requirements: The rendering technology should be:

- Cross-platform
- Capable of producing high-quality print output
- Able to render on older client hardware OR render to a bitmap on the application server


Evaluation: SVG was chosen as the display technology because it is:

- Vector-based: no difference between on-screen and print output.
- XML-based: the tree-based JSON text format can be mapped directly to SVG syntax.
- Addressable via CSS: separation of content and design/layout eases future interface updates.
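To make the JSON-to-SVG mapping concrete, here is a small sketch that walks a nested tree and emits nested SVG groups, one `<g>` per node with a `<text>` label inside it. The project itself renders in the browser; this Python version, its function name, the `node` class attribute and the one-row-per-node outline placement are assumptions chosen purely to show the correspondence between the two tree-shaped formats.

```python
import xml.etree.ElementTree as ET

def tree_to_svg(tree, x_step=80, y_step=20):
    """Serialise a nested {"term", "children"} dict as nested SVG groups."""
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg")
    next_row = [0]  # each node is placed on its own row, outline style

    def emit(node, parent_el, depth):
        # One <g> per node; a class attribute keeps styling in CSS.
        group = ET.SubElement(parent_el, "g", {"class": "node"})
        label = ET.SubElement(
            group, "text",
            x=str(depth * x_step),
            y=str((next_row[0] + 1) * y_step),
        )
        label.text = node["term"]
        next_row[0] += 1
        for child in node.get("children", []):
            emit(child, group, depth + 1)  # child groups nest inside the parent group

    emit(tree, svg, 0)
    return ET.tostring(svg, encoding="unicode")

svg_text = tree_to_svg({"term": "results", "children": [
    {"term": "of", "children": []},
    {"term": "suggest", "children": []},
]})
```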


7.3 Drawing Graphs


Suitable tools:

- Processing: platform Java; also available in JavaScript
- Prefuse: platform Java
- Flare: platform AS3
- Native Canvas: platform HTML5
- d3: platform JavaScript


References

Circular Drawings of Rooted Trees [Melançon, Herman] 1998

Improving Walker’s Algorithm to Run in Linear Time [Buchheim, Jünger, Leipert] 2002

Animated Exploration of Dynamic Graphs with Radial Layout [Yee, Fisher, Dhamija, Hearst] 2001

Squarified Treemaps [Bruls, Huizing, van Wijk] 2000, http://www.win.tue.nl/~vanwijk/stm.pdf

Analysis and Visualization of Network Data using JUNG [O’Madadhain, Fisher, Smyth, White, Boe] XXXX? http://jung.sourceforge.net/doc/JUNG_journal.pdf


7.4 Visualisation: Walker’s Algorithm in d3 and Canvas


Word Trees are connected sets of terms linked by the ordering inherent to a given
corpus; in other words, nodes connected by edges: directed graphs. The starting
point for a Word Tree is the term under examination. This word becomes the root
for two trees, one for each direction (forward and back). Nodes are not
interconnected, which simplifies processing.
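Because the nodes form a plain tree, positions can be assigned with a single recursive pass. The sketch below is deliberately much simpler than Walker's algorithm or its linear-time refinement: it gives every leaf its own row and centres each parent between its first and last child, which wastes space but cannot produce overlaps. The function name and the use of term paths as keys are assumptions for the example.

```python
def layout(tree):
    """Return {path: (depth, y)} where path is the tuple of terms from the root."""
    positions = {}
    next_row = [0]  # next free row for a leaf

    def place(node, depth, path):
        children = node.get("children", [])
        if children:
            ys = [place(c, depth + 1, path + (c["term"],)) for c in children]
            y = (ys[0] + ys[-1]) / 2  # centre the parent over its children
        else:
            y = float(next_row[0])    # leaves take consecutive rows
            next_row[0] += 1
        positions[path] = (depth, y)
        return y

    place(tree, 0, (tree["term"],))
    return positions

tree = {"term": "results", "children": [
    {"term": "of", "children": [
        {"term": "the", "children": [
            {"term": "study", "children": []},
            {"term": "survey", "children": []},
        ]},
        {"term": "this", "children": []},
    ]},
    {"term": "suggest", "children": []},
]}
pos = layout(tree)
```

Paths rather than bare terms are used as keys because the same word can appear on several branches. A backward-facing tree could reuse the same routine with the depth negated.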


About d3.js

http://mbostock.github.com/d3/

D3 allows you to bind arbitrary data to a Document Object Model (DOM), and then
apply data-driven transformations to the document. As a trivial example, you can
use D3 to generate a basic HTML table from an array of numbers. Or, use the same
data to create an interactive SVG bar chart with smooth transitions and interaction.

D3 is not a traditional visualization framework. Rather than provide a monolithic
system with all the features anyone may ever need, D3 solves only the crux of the
problem: efficient manipulation of documents based on data. This gives D3
extraordinary flexibility, exposing the full capabilities of underlying technologies such
as CSS3, HTML5 and SVG. It avoids learning a new intermediate proprietary
representation. With minimal overhead, D3 is extremely fast, supporting large
datasets and dynamic behaviors for interaction and animation. And, for those
common needs, D3’s functional style allows code reuse through a diverse collection
of optional modules.


References

Improving Walker’s Algorithm to Run in Linear Time [Buchheim, Jünger, Leipert] 2002, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.8757


7.5 Data Engineering: Suffix Trees/Arrays


Suffix trees and arrays are useful data structures for solving string problems
elegantly and efficiently.


Suffix Trees

Proper use of suffix trees often speeds up string processing algorithms from O(n^2)
to linear time. In its simplest instantiation, a suffix tree is simply a trie of the n
suffixes of an n-character string S. A trie is a tree structure, where each edge
represents one character, and the root represents the null string. Thus, each path
from the root represents a string, described by the characters labeling the edges
traversed. Any finite set of words defines a trie, and two words with common
prefixes branch off from each other at the first distinguishing character. Each leaf
denotes the end of a string.

Tries are useful for testing whether a given query string q is in the set. We
traverse the trie from the root along branches defined by successive characters of
q. If a branch does not exist in the trie, then q cannot be in the set of strings.
Otherwise we find q in |q| character comparisons regardless of how many strings
are in the trie. Tries are very simple to build (repeatedly insert new strings) and
very fast to search, although they can be expensive in terms of memory.
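The description above translates almost directly into code. The following sketch implements a character trie with insertion and membership testing, then builds the naive suffix trie of a string by inserting each of its suffixes. The class and method names are our own; note that this naive construction takes quadratic time and space, unlike a true compressed suffix tree.

```python
class Trie:
    """Character trie: each edge is one character, the root is the empty string."""

    def __init__(self):
        self.children = {}   # character -> child Trie
        self.is_end = False  # True if an inserted string ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_end = True

    def _walk(self, query):
        """Follow `query` from the root; return the node reached, or None."""
        node = self
        for ch in query:
            node = node.children.get(ch)
            if node is None:  # missing branch: query cannot be present
                return None
        return node

    def contains(self, query):
        node = self._walk(query)
        return node is not None and node.is_end

    def has_prefix(self, prefix):
        return self._walk(prefix) is not None

def suffix_trie(s):
    """Naive suffix trie: insert every suffix of s."""
    trie = Trie()
    for i in range(len(s)):
        trie.insert(s[i:])
    return trie

banana = suffix_trie("banana")
```

Every substring of S is a prefix of some suffix, so `banana.has_prefix(q)` answers "does q occur anywhere in S?" in |q| steps.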


Suffix Arrays

See: An Incomplex Algorithm for Fast Suffix Array Construction [Schurmann, Stoye]
2007, http://www.siam.org/meetings/alenex05/papers/07kschuermann.pdf
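For orientation, a suffix array is simply the list of suffix start positions sorted by the suffixes they begin. The sketch below builds one by direct sorting and finds a pattern with two binary searches. This is the naive construction, not the algorithm from the paper cited above, and the function names are our own.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order."""
    # Naive construction: sorting full suffixes is simple but not linear time.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return the sorted start positions at which `pattern` occurs in `text`."""
    m = len(pattern)

    def prefix(k):
        # The first m characters of the k-th smallest suffix.
        return text[sa[k]:sa[k] + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = build_suffix_array(text)
```

Compared with a trie, the array stores one integer per character, which addresses the memory cost at the price of a logarithmic factor per lookup.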


8. Summary of outputs from stakeholder meetings (2011)


Meeting One: May 13 2011


On May 13th 2011 we held our first Stakeholder Group meeting, to ask expert users
what they would like the new interface to provide. Their responses are below:


1) Useful features for Corpus Linguists and ELTs

- To have the statistics for a particular search word (word frequency, distribution and dispersion across texts and genres, number and type of texts the word appears in, size of sub-corpora etc.). This statistical information could be accessed by hovering over a particular word. [Show Conail features of Wordsmith: can we incorporate this into the interface?]
- (Font size could also represent frequency, as in Many Eyes Word Tree.)
- The option to choose which metadata and stats are required.
- Ability to do on-the-spot markup/annotation.
- Ability to request a random selection of concordance lines.
- Underlying principles: the interface should have two levels: 1) raw frequencies; 2) the ability to turn on normalised frequencies, lemmatisation, capitalisation etc.
- Ability to access the original text.


2) Searchability

- Lemmatisation + split screen (lemma search + split screen for results for different word forms).
- Wildcard ability.
- Ability to search for POS using CQL.
- Search button should be case sensitive.


3) Interoperability

- Option to send findings to another programme (AntConc, Sketch Engine etc. via URL link).
- Different tabs for different software: modularity.


4) Output

- To be able to print out (good quality, readable) word trees (would need a template to do this).
- Ability to output data in different ways (perhaps an open field for other people to add their own findings): practical outlets.
- The ability to export to an Excel or Word file etc.
- Ability to manipulate/customise the data (for the purpose of a specific lesson). Need an output that is manipulable (ability to remove items, colour-code POS, highlight specific features etc.).
- Ability to send screenshots to Facebook, email etc.


5) Display

- Full screen display containing as much information as possible. Ability to zoom in and out on certain parts of the word tree.
- Ability to resize parts of the text so that students can see it on the interactive whiteboard.
- Ability to use a number of visualisations for the same data set (a library of visualisations: different ways of viewing the same data set).
- Visualisations of the corpora (a pie chart, for example, which shows the different portions/sections of the corpus). Ability to hover the mouse over a particular segment of the corpus to see the stats. A single click would then automatically load the sub-corpus (rather than having a drop-down menu).
- Type in a search word and the visualisation would demonstrate “hotspots” of where that word occurs in the corpus.
- Ability to visualise the corpus in different ways.
- Ability to interact with the display.
- Toggle: simple view and advanced view.
- Ability to flip between concordance lines and word trees.
- Ability to go in and out of the corpus: move between corpus and word tree.
- Ability to drag and drop outputs into a separate area (i.e. ability to save different search results in a separate screen).
- Layered screens and/or separate windows to compare searches across different sub-corpora.
- More interesting vision of collocates: target word in the middle, stronger collocates closer, less strong collocates further away.
- Trees fading gently into place, with animation occurring after each additional search term is added, as in Many Eyes Word Tree.


6) Usability

- Need to develop ‘ways in’ to the data. Instructions for users on how to generate simple frequency lists and branch data (deepest branches, most connected branches: which node words have the most connections?).
- Ability to produce frequency lists + lists of n-grams for sub-corpora.
- Pre-lessons.
- Kibbitzers.
- Try to avoid using lots of corpus software in the classroom: an interface which does everything.


Plans for stakeholder meetings in July 2011


In July we will be presenting a first version of the Word Tree interface to all four
stakeholder groups. Group 2 will have remote access to the interface website, and
additionally Hilary Nesi and Emma Moreton will discuss version one of the Word
Tree face to face with academics attending the Lancaster University Corpus
Linguistics Summer School.

We will demonstrate the interface to the following groups of novice researchers
(Stakeholder Group 3):

- visiting English language lecturers from two Chinese universities: Jiangxi University of Finance and Economics, Nanchang, and Zhejiang University of Finance and Economics, Hangzhou
- lecturers in the Department of English at the University of Dhaka, Bangladesh

and to the following groups of language learners (Stakeholder Group 4):

- EAP learners attending the Lancaster University presessional programme
- students in the Department of English at the University of Dhaka, Bangladesh.


Our next meeting with Stakeholder Group 1 is scheduled for July 25 2011.


Meeting with future interface users at Lancaster University


Emma Moreton and Hilary Nesi met with participants at the UCREL Summer School
in Corpus Linguistics, Lancaster University, on July 13 2011, to demonstrate a
prototype of the Word Tree interface and to discuss issues surrounding its design
and use.


Here are our responses to the most frequently asked questions:


Q. Will we be able to use the interface with our own corpora?

A. Yes, at the end of the project (November 2011) the source code will be made
available for use with any other corpus in plain text or XML.

Q. Will it function for languages other than English?

A. Yes, we plan for it to work with almost all languages, including those with other
scripts.

Q. Does the interface show every instance of the occurrence of a given word or
pattern, or only a selection?

A. It will show every instance.

Q. You can already create word trees with Many Eyes. What is the point of creating a
new interface?

A. The Many Eyes word trees only work with a small amount of data in a single file,
which then remains accessible on the Many Eyes site. The new interface will enable
users to work with much larger collections of text, differentiated by file so that the
provenance of each instance of use can be identified. Users will also be able to
create their own customised subcorpora, view corpus statistics, and compare
patterns of use displayed simultaneously in different trees. They will be able to run
the interface on their own servers for private access to their own text collections.


Meeting Two: July 25 2011


On July 25th we held our second meeting with expert users, to review progress with
interface development and to consider ways in which the Word Tree might be linked
to other corpus resource tools. Serge Sharoff demonstrated features of IntelliText,
which is being developed at the University of Leeds with AHRC funding. IntelliText
will automate the downloading of large collections of texts from the web, and will
provide tools for automatic part-of-speech annotation, term extraction, synonym
identification etc. Like the Word Tree, IntelliText will be distributed as open-source
software to academic and industrial users, who will be free to extend it for the
benefit of the research community. We envisage that researchers will be able to
create and annotate corpora using IntelliText tools, and then upload them to the
Word Tree, drawing on the resources of both interfaces. Andrew Dickinson
demonstrated SKYLIGHT, a classroom corpus resource he is developing with Gill
Francis. This is a simple interface which does not assume any prior user experience,
and we discussed the possibility of offering joint access to SKYLIGHT and the Word
Tree, to extend the range of both tools. The group agreed that these changes should
be made to the prototype interface. Our next meeting with Stakeholder Group 1 is
scheduled for September 22 2011.