Tools for Textual Data


John Unsworth


May 20, 2009

http://monkproject.org/

MONK: a case study


Texts as data


Texts from multiple sources


Texts reprocessed into a new representation


Different tools using the same data


Interaction between tools and data


Interaction between and among tools


Interaction between users and data


Questions for discussion

Texts as Data (1)


“The computer has no understanding of what a word is, but it follows instructions to 'count as' a word any string of alphanumerical characters that is not interrupted by non-alphabetical characters, notably blank space, but also punctuation marks, and some other symbols. 'Tokenization' is the name for the fundamental procedure in which the text is reduced to an inventory of its 'tokens' or character strings that count as words. This is an extraordinarily reductive procedure. It is very important to have a grasp of just how reductive it is in order to understand what kinds of inquiry are disabled and enabled by it.”
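
As a concrete illustration of how reductive the procedure is, here is a minimal tokenizer sketch in Python (not MONK's actual code): any uninterrupted run of alphanumeric characters counts as a word, and everything else disappears.

import re

def tokenize(text):
    # 'Count as' a word any maximal run of alphanumeric characters;
    # blank space, punctuation, and other symbols become boundaries.
    return re.findall(r"[A-Za-z0-9]+", text)

print(tokenize("hee louyd hir, depely."))
# ['hee', 'louyd', 'hir', 'depely']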

Texts as Data (2)


“A word token is the spelling or surface form of a word. MONK performs a variety of operations that supply each token with additional 'metadata'. Take something like 'hee louyd hir depely'. This comes to exist in the MONK textbase as something like

hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep

Because the textbase 'knows' that the surface 'louyd' is the past tense of the verb 'love', the individual token can be seen as an instance of several types: the spelling, the part of speech, and the lemma or dictionary entry form of a word.” (Martin Mueller)
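
A small sketch of reading that notation back apart; the underscore format here simply follows the slide's example (MONK's internal storage is richer than this).

def parse_adorned(adorned):
    # spelling_pos_lemma, as in the example above
    spelling, pos, lemma = adorned.split("_")
    return {"spelling": spelling, "pos": pos, "lemma": lemma}

for token in "hee_pns31_he louyd_vvd_love hir_pno31_she depely_av-j_deep".split():
    print(parse_adorned(token))
# {'spelling': 'hee', 'pos': 'pns31', 'lemma': 'he'}
# {'spelling': 'louyd', 'pos': 'vvd', 'lemma': 'love'}
# ...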


Texts as Data (3)



Texts represent language, which changes over time (spellings)

Comparison of texts as data requires some normalization (lemma)

Counting as a means of comparison requires units to count (tokens)

Treating texts as data will usually entail a new representation of those texts, to make them comparable and to make their features countable.
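
A toy example of why the lemma layer matters for comparison: two passages with different spellings become directly comparable once their tokens are reduced to lemmata (the token lists below are invented for illustration).

from collections import Counter

# (spelling, lemma) pairs for two hypothetical passages
passage_a = [("hee", "he"), ("louyd", "love"), ("hir", "she")]
passage_b = [("he", "he"), ("loved", "love"), ("her", "she")]

lemmas_a = Counter(lemma for spelling, lemma in passage_a)
lemmas_b = Counter(lemma for spelling, lemma in passage_b)

print(lemmas_a == lemmas_b)  # True: identical once normalized to lemmata
print(Counter(s for s, l in passage_a) == Counter(s for s, l in passage_b))  # False: spellings differ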

Texts from Multiple Sources

Five aphorisms about textual data (causing tool-builders to weep):

Scholars are interested in texts first, data second

Tools are only useful if they can be applied to texts that are of interest

No single collection has all texts

No two collections will be identical in format

No one collection will be internally consistent in format

Public MONK Texts

Documenting the American South from UNC-Chapel Hill (1.5 Gb, 8.5 M words)

Early American Fiction from the University of Virginia (930 Mb, 5.2 M words)

Wright American Fiction from Indiana University (4 Gb, 23 M words)

Shakespeare from Northwestern University (170 Mb, 850 K words)

About 7 Gb, 38 M words

Restricted MONK Texts

Eighteenth Century Collections Online (ECCO) from the Text Creation Partnership (6 Gb, 34 M words)

Early English Books Online (EEBO) from the Text Creation Partnership (7 Gb, 39 M words)

Nineteenth-Century Fiction (NCF) from Chadwyck-Healey (7 Gb, 39 M words)

About 20 Gb, 112 M words

Texts reprocessed into a new representation (1)

MONK ingest process:

1. TEI source files (from various collections, with various idiosyncrasies) go through Abbot, a series of XSL routines that transform the input format into TEI-Analytics (TEI-A for short), with some curatorial interaction.

2. “Unadorned” TEI-A files go through Morphadorner, a trainable part-of-speech tagger that tokenizes the texts into sentences, words and punctuation, assigns ids to the words and punctuation marks, and adorns the words with morphological tagging data (lemma, part of speech, and standard spelling).
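
To make the shape of the adornment step concrete, here is a toy stand-in in Python (not Morphadorner itself; the lemma and part-of-speech values are dummies, where the real tool assigns them from a trained model) that wraps each token in a <w> element of the kind shown in the sample output two slides below.

import re

def adorn(text, doc_id="doc"):
    # Tokenize, assign sequential ids, and emit one <w> element per
    # token; lem and pos are placeholders in this sketch.
    words = []
    for i, tok in enumerate(re.findall(r"[A-Za-z0-9]+", text), start=1):
        words.append(
            f'<w xml:id="{doc_id}-{i:06d}" lem="{tok.lower()}" pos="unk" '
            f'reg="{tok}" spe="{tok}" tok="{tok}">{tok}</w>'
        )
    return "\n".join(words)

print(adorn("ENTERED according to Act of Congress", doc_id="allen"))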

Texts reprocessed into a new representation (2)

MONK ingest process (cont.):

3. Adorned TEI-A files go through Acolyte, a script that adds curator-prepared bibliographic data.

4. Bibadorned files are processed by Prior, using a pair of files defining the parts of speech and word classes, to produce tab-delimited text files in MySQL import format, one file for each table in the MySQL database.

5. cdb.csh creates a MONK MySQL database and imports the tab-delimited text files.
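
A minimal sketch of the output format in step 4: one tab-delimited file per table, suitable for a MySQL bulk import of the kind cdb.csh performs. The column layout here is invented for illustration; Prior's real schema differs.

import csv

# (word id, spelling, lemma, part of speech) for a hypothetical table
rows = [
    ("allen-000600", "ENTERED", "enter", "vvn"),
    ("allen-000610", "according", "accord", "vvg"),
]

with open("tokens.tsv", "w", newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)
# tokens.tsv can then be loaded with MySQL's LOAD DATA INFILE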

Texts reprocessed into a new representation (3)

<docImprint>ENTERED according to Act of Congress, in the year 1867, by A. SIMPSON &amp; CO.,<lb/>
in the Clerk's Office of the District Court of the United States<lb/>
for the Southern District of New York.</docImprint>

becomes:

<docImprint>
  <w eos="0" lem="enter" pos="vvn" reg="ENTERED" spe="ENTERED" tok="ENTERED" xml:id="allen-000600" ord="33" part="N">ENTERED</w>
  <c> </c>
  <w eos="0" lem="accord" pos="vvg" reg="according" spe="according" tok="according" xml:id="allen-000610" ord="34" part="N">according</w>
  <c> </c>

The representation is about 10x the original in size, so 150 Mb becomes 1.5 Gb (90% metadata).
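
Once a text is in this representation, the adornment can be read back programmatically; a small sketch using Python's standard library (the TEI namespace is omitted for brevity):

import xml.etree.ElementTree as ET

sample = '''<docImprint>
  <w eos="0" lem="enter" pos="vvn" reg="ENTERED" spe="ENTERED"
     tok="ENTERED" xml:id="allen-000600" ord="33" part="N">ENTERED</w>
  <c> </c>
  <w eos="0" lem="accord" pos="vvg" reg="according" spe="according"
     tok="according" xml:id="allen-000610" ord="34" part="N">according</w>
</docImprint>'''

for w in ET.fromstring(sample).iter("w"):
    print(w.get("tok"), w.get("lem"), w.get("pos"))
# ENTERED enter vvn
# according accord vvg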


Problems Arising

“In‏the‏MONK‏project‏we‏used‏texts‏from‏TCP‏EEBO‏and‏
ECCO, Wright American Fiction, Early American Fiction,
and DocSouth
--

all of them archives that proclaimed
various degrees of adherence to the earlier Guidelines.


Our overriding impression was that each of these archives
made perfectly sensible decisions about this or that within
its own domain, and none of them paid any attention to
how its texts might be mixed and matched with other
texts. That was reasonable ten years ago. But now we live
in a world where you can multiple copies of all these
archives on the hard drive of a single laptop, and people
will‏want‏to‏mix‏and‏match.”


...and Aris-ing

“Soft hyphens at the end of a line or page were the greatest sinners in terms of unnecessary variance across projects, and they caused no end of trouble. . . . The texts differed widely in what they did with EOL phenomena. The DocSouth people were the most consistent and intelligent: they moved the whole word to the previous line.... DocSouth texts also observe linebreaks but don't encode them explicitly. The EAF texts were better at that and encoded line breaks explicitly. The TCP texts were the worst: they didn't observe line breaks unless there was a soft hyphen or a missing hyphen, and then they had squirrelly solutions for them. The Wright archive used an odd procedure that, from the perspective of subsequent tokenization, would make the trailing word part a distinct token.”
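
One hedged sketch of the kind of normalization a tool-builder ends up writing for this problem: rejoin a word split across a line break before tokenization, so the trailing word part does not become a distinct token. Real projects mark (or fail to mark) the break in different ways, as the quotation makes clear.

import re

def join_eol_hyphens(text):
    # Drop a hyphen plus line break that falls between two word
    # fragments, rejoining "depe-\nly" into "depely".
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

print(join_eol_hyphens("hee louyd hir depe-\nly"))  # hee louyd hir depely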

Different tools using the same data

MONK Datastore

Flamenco Faceted Browsing

MONK extension for Zotero

TeksTale Clustering and Word Clouds

FeatureLens

SEASR

The MONK Workbench (Public)

The MONK Workbench (Restricted)

Each of these is (at least one) separate application; some are actually several.

Workbench Architecture (1)

The MONK Workbench is a browser-based application written in Ext JS, a JavaScript library for building richly interactive web applications using techniques such as AJAX, DHTML and DOM.

The Workbench has components, like the component for creating a workset, and components often have a workflow: a notion of events that need to occur in a certain order.

Workbench Architecture (2)

The Workbench communicates by HTTP with middleware, which is Java code that interprets events occurring in components and translates those events into terms that the datastore or the analytics engine (SEASR) can understand.

The MONK middleware also translates in the other direction, taking output from queries to the datastore, or from analytics operations, and giving it back to components in the Workbench.
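
An illustrative sketch of that translating role, in Python rather than the middleware's actual Java, with all names and the event shape hypothetical: a component event comes in, a datastore query goes out, and the result is repackaged for the component.

def query_datastore(genre):
    # Stub standing in for a real query against the MONK datastore.
    works = [("w1", "fiction"), ("w2", "poetry"), ("w3", "fiction")]
    return [work_id for work_id, g in works if g == genre]

def handle_event(event):
    # Translate a Workbench component event into a query, then
    # translate the result back into terms the component understands.
    if event["type"] == "create_workset":
        return {"component": event["component"],
                "items": query_datastore(event["genre"])}
    raise ValueError("unhandled event type: " + event["type"])

print(handle_event({"type": "create_workset",
                    "component": "workset-builder",
                    "genre": "fiction"}))
# {'component': 'workset-builder', 'items': ['w1', 'w3']}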

Interaction between tools and data

Tools can't operate on features unless those features are made available: for example,

In order to find an author's favorite adjectives, you need an interface for asking that question and you need data that can answer it.

In order to find patterns, the data and the interface have to support pattern-finding.

In order to find all the fiction by women in a collection, your data has to include information about genre and gender, and your interface has to allow you to select those facets (see the sketch below).
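
A toy illustration of the last point: the question "all the fiction by women" is only answerable if genre and gender are present as facets in the data. The records below are invented.

records = [
    {"title": "A", "genre": "fiction", "author_gender": "female"},
    {"title": "B", "genre": "poetry",  "author_gender": "female"},
    {"title": "C", "genre": "fiction", "author_gender": "male"},
]

fiction_by_women = [r["title"] for r in records
                    if r["genre"] == "fiction"
                    and r["author_gender"] == "female"]
print(fiction_by_women)  # ['A']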

Interaction between/among tools

Flamenco requires a slightly different data source from other tools in MONK, partly because it is meant to feed Zotero, so it needs COinS metadata.

The HTML interface to the MONK datastore uses the same data source that is used by the MONK Workbench and by TeksTale.

FeatureLens needs a unique index, and it needs one index per collection.

Interaction between users and data

Users like simple interfaces, but simple interfaces limit complex operations

Users may want to operate on features that are not available in the data representation

Users create data by using tools, not only as an end result but all along the way: state information, for example, or information about the series of operations performed in order to produce a result

Users may also want to correct or improve data

Questions for Discussion

If different tools require different data representations, how should those representations be related, derived, maintained?

What might be the characteristics of a “lowest-common-denominator” format for data that will need to be reprocessed into other representations?

What principles would allow you to answer the following questions in particular cases?

Questions for Discussion

How much manual/curatorial intervention is acceptable, and what options do you have if what's acceptable is less than what's necessary?

Under what circumstances could tools have a normative impact on the practices of people who build and maintain collections?

Under what circumstances could data have a normative impact on the practices of people who build and maintain tools?

Questions for Discussion

Should users be allowed to change, correct, or improve data? If so, under what constraints or conditions?

Should those who provide collections also host the computational tools that will be used on them? Why or why not?

Should those who provide collections also collect the results of work done on their collections? Why or why not?

What is the purpose of data curation?

Workbench Screen Captures


Classification


Comparison

http://monkproject.org/