Collocate Cloud: See Collocations in a New Way




This poster demonstrates a web-based, interactive data visualisation, allowing users
to quickly inspect and browse the collocates present in a corpus. The software is
inspired by tag clouds, first popularised by on-line photograph sharing website
Flickr (www.flickr.com).


Tag clouds allow the user to browse, rather than search for specific pieces of
information. Flickr encourages its users to add tags (keywords) to each photograph
they upload. The tags associated with each individual photograph are aggregated; the
most frequent go on to make the cloud. The cloud consists of these tags presented in
alphabetical order, with their frequency displayed as colour, or more commonly as
font size. Figure 1 is an example of the most popular tags at Flickr:



Figure 1. Flickr tag cloud showing 125 of the most popular photograph keywords

http://www.flickr.com/photos/tags/ (accessed 14 May 2007)
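
The aggregation described above can be sketched in a few lines of Python. This is
only an illustration, not Flickr's implementation: the function name
`build_tag_cloud`, the sample photographs and the linear pixel scale are all
assumptions made for the example.

```python
from collections import Counter

def build_tag_cloud(photo_tags, top_n=125, min_px=10, max_px=36):
    """Aggregate per-photograph tags into (tag, count, font_px) entries.

    photo_tags: iterable of tag lists, one list per photograph (hypothetical input).
    The linear pixel scale is an assumption made for illustration.
    """
    # Count each tag at most once per photograph.
    counts = Counter(tag for tags in photo_tags for tag in set(tags))
    top = counts.most_common(top_n)            # keep only the most frequent tags
    if not top:
        return []
    lo = min(c for _, c in top)
    hi = max(c for _, c in top)
    span = (hi - lo) or 1                      # avoid division by zero
    cloud = [
        (tag, c, round(min_px + (c - lo) / span * (max_px - min_px)))
        for tag, c in top
    ]
    return sorted(cloud)                       # alphabetical order for display

photos = [["beach", "sunset"], ["beach", "family"], ["beach", "sunset", "sea"]]
cloud = build_tag_cloud(photos, top_n=2)
```

Sorting is applied only after the most frequent tags have been selected, so the
alphabetical order affects presentation and not which tags make the cloud.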


The cloud offers two ways to access the information. If the user is looking for a
specific term, the alphabetical ordering of the information allows it to be quickly
located or eliminated. More importantly, as a tool for browsing, frequent tags stand
out visually, giving the user an immediate overview of the data. Clicking on a tag
name will display all photographs which contain that tag.


The cloud-based visualisation has been successfully applied to language. McMaster
University’s Tapor Tools (http://taporware.mcmaster.ca/) features a ‘Word Cloud’
module, currently in beta testing. In addition to other linguistic metrics, internet
bookseller Amazon provides a word cloud (see figure 2).



Figure 2. Amazon.com’s ‘Concordance’ displaying the 100 most frequent words in Romeo and Juliet

http://www.amazon.com/Romeo-Juliet-Folger-Shakespeare-Library/dp/sitb-next/0743477111/ref=sbx_con/104-4970220-2133519?ie=UTF8&qid=1179135939&sr=1-1#concordance (accessed 14 May 2007)


In this instance a word frequency list is the data source, showing the 100 most
frequent words. As with the tag cloud, this list is alphabetically ordered, with the
font size of each word proportionate to its frequency of usage. It has all the
benefits of a tag cloud; in this instance clicking on a word will produce a
concordance of that term.
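
As a rough illustration of what a concordance of a term involves, the sketch below
produces keyword-in-context lines from a plain string. The function name
`concordance`, the `width` parameter and the sample sentence are assumptions for the
example; nothing here reflects how Amazon generates its display.

```python
import re

def concordance(text, term, width=30):
    """Return keyword-in-context lines for every occurrence of `term`."""
    lines = []
    # Whole-word, case-insensitive match of the search term.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = ("But soft, what light through yonder window breaks? "
          "It is the east, and Juliet is the sun.")
kwic = concordance(sample, "is", width=12)
for line in kwic:
    print(line)
```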


This method of visualisation and interaction brings another tool to corpus linguists.
As developer for an online corpus project, I have found that usability and
sophisticated tools have been important to our success and positive public profile.
Cloud-like displays of information would complement our other advanced methods, such
as geographic mapping and transcription synchronisation.


The word clouds produced by Tapor Tools and Amazon are, for browsing, an
improvement over tabular statistical information. Other corpus data could also be
enhanced by using a cloud. Linguists often use collocational information as a tool to
examine language use. Figure 3 demonstrates a typical corpus tool output:



Figure 3. Variation in English Words and Phrases searching the British National Corpus for ‘bank’, showing collocates

http://view.byu.edu/ (accessed 15 May 2007)


The data contained in the table lends itself to visualisation as a cloud. As with the
word cloud, the list of collocates can be displayed alphabetically. Co-occurrence
frequency, like word frequency, can be illustrated by font size. This would produce an
output visually similar to the word cloud. Instead of showing all corpus words, the
cloud would be limited to those surrounding the chosen node word.
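
A minimal sketch of how co-occurrence frequencies around a node word might be
gathered is given below. The window size, the function name `collocates` and the toy
sentence are assumptions; the poster does not state what span its software uses.

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        # Tokens before and after the node word, excluding the node itself.
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts

tokens = "she sat on the river bank while the bank manager counted money".split()
co = collocates(tokens, "bank", window=2)
for word in sorted(co):                 # alphabetical, as in the cloud
    print(word, co[word])
```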


Another valuable statistic obtainable via collocates is that of collocational
strength, measured here by MI (Mutual Information). This adds a further dimension to
the data, one which would be very useful if it were included in the cloud. An
additional visual cue needs to be introduced to the cloud, one which can convey the
continuous data of an MI score. This is solved by varying the colour, or brightness,
of the collocates forming the cloud. The end result is shown in figure 4:




Figure 4. Demonstration of collocate cloud, showing node word of ‘bank’
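
For readers unfamiliar with MI, one common pointwise formulation compares the
observed co-occurrence frequency with the frequency expected if the two words were
independent: MI = log2(f(n,c) × N / (f(n) × f(c))). The sketch below assumes this
variant. The poster does not specify which formula or span correction is used, and
the frequencies in the example are invented for illustration.

```python
import math

def mutual_information(f_pair, f_node, f_collocate, corpus_size):
    """Pointwise MI: log2 of observed over expected co-occurrence frequency."""
    if min(f_pair, f_node, f_collocate, corpus_size) <= 0:
        raise ValueError("all frequencies must be positive")
    # Expected co-occurrences if node and collocate were independent.
    expected = f_node * f_collocate / corpus_size
    return math.log2(f_pair / expected)

# Illustrative numbers only, not taken from any corpus.
mi = mutual_information(f_pair=50, f_node=1000, f_collocate=2000,
                        corpus_size=1_000_000)
print(round(mi, 2))
```

A score of zero means the pair co-occurs exactly as often as chance would predict;
higher scores indicate a stronger association.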


The collocate cloud inherits all the advantages of previous cloud visualisations: a
collocate, if known, can be quickly located due to the alphabetical nature of the
display. Frequently occurring collocates stand out, as they are shown in a larger
typeface, with collocationally strong pairings highlighted using brighter formatting.
Therefore bright, large collocates are likely to be of interest, whereas dark, small
collocates are perhaps less so. Hovering the mouse over a collocate will display
statistical information (co-occurrence frequency and MI score), as one would find
from the tabular view.
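
The mapping from the two statistics to the two visual cues can be sketched as
follows. This is an assumed rendering, not the poster's software: the function name
`render_cloud`, the pixel range, the HSL lightness range and the use of the HTML
`title` attribute for the hover text are all choices made for the example, and the
input values are invented.

```python
import html

def render_cloud(entries, min_px=12, max_px=40):
    """entries: list of (collocate, frequency, mi) tuples; returns an HTML string."""
    if not entries:
        return ""
    freqs = [f for _, f, _ in entries]
    mis = [m for _, _, m in entries]

    def scale(value, lo, hi):
        # Normalise to 0..1; fall back to 1.0 when all values are equal.
        return (value - lo) / (hi - lo) if hi > lo else 1.0

    spans = []
    for word, freq, mi in sorted(entries):                      # alphabetical
        size = round(min_px + scale(freq, min(freqs), max(freqs)) * (max_px - min_px))
        light = round(20 + scale(mi, min(mis), max(mis)) * 50)  # 20% dark to 70% bright
        spans.append(
            f'<span style="font-size:{size}px;color:hsl(210,80%,{light}%)" '
            f'title="frequency: {freq}, MI: {mi:.2f}">{html.escape(word)}</span>'
        )
    return " ".join(spans)

page = render_cloud([("river", 120, 7.5), ("account", 300, 6.1), ("the", 900, 0.4)])
```

In this toy input the most frequent collocate is also the weakest by MI, so it would
appear large but dark, which is the kind of contrast the two cues are meant to
expose.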


The use of collocational data also presents additional possibilities for interaction.
A collocate can be clicked upon to produce a new cloud, with the previous collocate as
the new node word. This gives endless possibilities for corpus exploration and the
investigation of different domains. Instances of polysemy can be easily identified and
expanded upon by following the different collocates.


This visualisation may be appealing to members of the public, students or teachers, or
those seeking a more practical introduction to corpus linguistics. Collocate searches
across different corpora or document sets may be visualised side by side, facilitating
quick identification of differences. The number of collocates displayed is limited to
a fixed number to prevent overloading the user with too many matches. This may hide
otherwise valuable linguistic data from the user. Filtering out stop-words may help
free up space that would otherwise be used.
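
The trade-off between a fixed display limit and stop-word filtering can be
illustrated with a short sketch. The stop-word list, the limit, the function name
`select_for_display` and the counts are all assumptions for the example.

```python
STOP_WORDS = {"the", "of", "and", "a", "in", "to"}   # illustrative list only

def select_for_display(counts, limit=50, stop_words=STOP_WORDS):
    """Drop stop-words, then keep the `limit` most frequent collocates."""
    kept = {w: c for w, c in counts.items() if w.lower() not in stop_words}
    # Rank by descending frequency, breaking ties alphabetically.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:limit])

counts = {"the": 900, "river": 120, "of": 400, "account": 300, "holiday": 80}
shown = select_for_display(counts, limit=2)
```

Without the filter, the two available slots would go to "the" and "of"; with it,
they go to content words.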


While the collocate cloud is not a substitute for raw data, it does provide a fast and
convenient way to navigate linguistic data. The ability to generate new clouds from
existing collocates extends this further. It is this iterative nature that gives these
collocate clouds greater value for linguistic research than previous cloud
visualisations.