SEL3053: Analyzing Geordie

matchmoaningAI and Robotics

Nov 17, 2013 (3 years and 10 months ago)


Lecture 4. Digital electronic corpora


This lecture introduces digital electronic natural language corpora, one of
which will be analyzed subsequently.


The discussion is in four parts:

(i) the first part distinguishes language from language representation,

(ii) the second sketches the history of language representation
technology to the present day,

(iii) the third shows how language is electronically represented,

(iv) and the fourth outlines the development and current state of printed
and electronic text and text collections.



1. Language and language representation


Language and language represented as text are often confused, and many
people aren't even aware of the distinction.


There is, however, a fundamental distinction:


Language is a genetically determined aspect of human cognition. No one knows
when this cognitive faculty developed beyond the communicative capabilities of
other animals, but humans have certainly had it for tens of thousands of years.





Representation of language as text is a humanly-invented technology.


It works by

(i) identifying the phonemic structure of the language of interest, and

(ii) associating each phoneme with a symbol: the English phoneme /c/ is
represented by the symbol C, the phoneme /a/ by A, and /t/ by T, thereby
permitting the representation of the word /cat/ as CAT in writing or print.


Such language representation is referred to as 'alphabetic' to distinguish it from,
for example, pictographic systems, which do not represent language but
physical reality.





The distinction between language and language representation is easily seen in
young children and non-literate adults.


Both have language but are incapable of representing it; the ability to do so
must be explicitly learned.




2. Outline history of language representation technology

2.1 Mesopotamia


As far as we know, the idea of representing the phonemic structure of language
symbolically arose only once, in southern Mesopotamia (present-day Iraq) about
4000 years ago, and all the world's alphabetic writing systems derive from the
Mesopotamian one.


The symbol system used to represent the early Mesopotamian language,
Sumerian, is known as cuneiform, and consisted of marks made by pressing the
end of a triangular stylus into a wet clay surface.


Examples:



2.2 Egypt

Egypt had a well-developed pictographic system known as hieroglyphic, but
gradually supplemented and eventually replaced it with an alphabetic system
based on the Mesopotamian one. Examples of hieroglyphic pictograms:


2.3 The Mediterranean world in Antiquity

The Greeks and, later, the Romans adopted and further developed the originally
Mesopotamian alphabetic system; by Roman times the alphabet we currently
use had been developed.


2.4 The medieval West

During the Middle Ages the Roman alphabetic system was used.


2.5 The advent of printing


From Mesopotamian times until the fifteenth century, language representation
involved a human putting the symbols of the alphabetic system onto some
physical surface by hand, be it clay (Mesopotamia), papyrus (Egypt, Greece,
Rome), or parchment (European Middle Ages).


Then, in 1440, Johannes Gutenberg invented print technology, which allowed
for much faster book production.


It was based on using individual letters cast in lead, which were assembled into
matrices that were then placed into a printing press and inked, thereby leaving
an impression of the text matrix on a piece of paper.



Print was the primary language representation technology for the five
centuries between the fifteenth and the mid-twentieth century.


It has since then been increasingly superseded by electronic language
representation.




3. Digital electronic representation of language


To understand digital electronic representation of language it is necessary
to be clear about the nature of symbols and how language can be
symbolized.




What a symbol is: some physical thing




What a symbol does: representation




The arbitrariness of a symbol relative to what it represents




Since the invention of language representation technology, language has
been symbolized using visible marks on some surface: stone, clay,
papyrus, parchment, paper.


But, given the nature of symbols, this is not in principle the only way to
symbolize language.


Any physical medium will do, and that includes electricity.



3.1 The first step: the telegraph and Morse Code


i. History

Scientists had been working on an electrical device for communication, the
telegraph, since the mid-18th century, but it was an American named
Samuel Morse who proposed the first workable system in 1838, and with it
the idea of electronic representation of language.


The usefulness of this invention for fast, long-distance communication was
quickly appreciated. By 1854, there were 23,000 miles of telegraph wire in
operation in the US.


In 1851, Western Union was founded, and in 1866, the first lasting
trans-Atlantic cable link was established.





ii. How Morse Code works

In an alphabetic writing system, language is represented, or encoded, by
assigning a symbol to every phoneme of a language.


In the West, this has for many centuries been done using the familiar
alphabet:

/a/ is represented as A

/b/ is represented as B

and so on. But the shape of the symbols used to represent phonemes is
entirely arbitrary, and the result of a particular historical development.
Morse's idea was to use a different representation. For every letter in the
conventional alphabet, he proposed a corresponding symbol consisting of
dots and dashes:




Using this system, the word CAB would look like this (shown here in the
International Morse Code, writing * for a dot and - for a dash):


C    - * - *

A    * -

B    - * * *
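
The letter-to-symbol lookup that Morse Code performs can be sketched in a few lines of Python. This is only an illustration: the table holds a handful of International Morse Code entries rather than the full alphabet, and the names MORSE and to_morse are hypothetical, not part of the lecture.

```python
# A few entries from the International Morse Code table; "*" is a dot and
# "-" is a dash, matching the notation used in these notes.
MORSE = {
    "A": "*-",
    "B": "-***",
    "C": "-*-*",
    "D": "-**",
    "E": "*",
    "T": "-",
}


def to_morse(word):
    """Return the Morse symbol for each letter of `word`, separated by spaces."""
    return " ".join(MORSE[letter] for letter in word.upper())


print(to_morse("cab"))  # -*-* *- -***
```

The point of the sketch is that the encoding is nothing more than a table lookup: swapping the table for a different one changes the symbols but not the principle.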



This recoding of phonemes looks superfluous at best (we already have a
perfectly good alphabetic system) and silly at worst, but in fact it is
fundamental to computational language representation technology, as we
shall see.


iii. How a telegraph works




iv. Telegraph and Morse code combined


The key insight in this marriage stems once again from the nature of
symbols, and in particular from the arbitrariness of symbols relative to what
they represent.


We have seen that, for each letter in the conventional alphabet, Morse
proposed a symbol consisting of a sequence of dots and dashes.





Now, there is no particular reason why the dots and dashes should be marks
on a piece of paper rather than electrical pulses: a dot could be a short
pulse, and a dash a long pulse.


In other words, Morse Code can be translated from a visual code directly
into an electronic code.


This is the crucial step. For the first time, there was an alternative to the
traditional representation of language as visible marks on some surface, and
that alternative was an electronic representation.




And how can such an electronic representation be generated? By using a
telegraph:




By pressing the telegraph key for a short time, and so allowing the
electrical contacts to come together only briefly, the operator generates a
short electrical pulse; by holding the key down for longer, the operator
generates a long one. For a short pulse the buzzer sounds briefly, and for a
long one it sounds for longer.
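
The translation from dots and dashes into short and long pulses can be sketched as follows. The sketch assumes the International Morse convention that a dash lasts three times as long as a dot; the function name to_pulses and the representation of the signal as a list of 1s (current on) and 0s (current off) are hypothetical choices made for illustration.

```python
DOT_UNITS = 1   # a short pulse lasts one time unit
DASH_UNITS = 3  # a long pulse lasts three time units (assumed 1:3 ratio)


def to_pulses(symbols):
    """Turn one letter's dots ("*") and dashes ("-") into on/off samples.

    Each dot or dash becomes a run of 1s (current on) followed by a
    single 0 (current off) that separates it from the next one.
    """
    signal = []
    for mark in symbols:
        length = DOT_UNITS if mark == "*" else DASH_UNITS
        signal.extend([1] * length)  # key held down: contacts closed
        signal.append(0)             # key released: contacts open
    return signal


print(to_pulses("-**"))  # the letter D: [1, 1, 1, 0, 1, 0, 1, 0]
```

Read left to right, the output is one long pulse followed by two short ones, which is what the buzzer at the receiving end would sound out.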



Thus, the telegraph version of the Morse Code for the letter D looks (or
rather sounds) like this:



An operator who is familiar with Morse Code can therefore encode and
send any text message as a sequence of beeeeeeeep and beep keystrokes.


All one needs is a network of electrical lines that the electronic pulses can
travel along.


In fact, such a network was quickly constructed in 19th-century America,
and a cable was laid across the Atlantic to allow electronic communication
with Europe, as already noted.


3.2 Generalization of Morse Code: ASCII


ASCII has been a standard encoding scheme for the representation of text in
computers since the 1960s.


It differs from Morse in two ways:


(i) It uses 0 and 1 instead of dots and dashes to make letter codes.

(ii) The code length is constant: each character is stored in 8 places,
whereas in Morse the number of dots and dashes varies from letter to letter.
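
One practical consequence of the constant code length can be sketched in Python. Because every code occupies exactly 8 places, a continuous run of 0s and 1s can be cut into letters without any separator, whereas Morse needs a pause between letters. The function name decode_ascii_stream is hypothetical.

```python
def decode_ascii_stream(bits):
    """Split a continuous string of 0s and 1s into 8-place codes and decode them."""
    chunks = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return "".join(chr(int(chunk, 2)) for chunk in chunks)


stream = "010000110100000101000010"  # three 8-place codes run together
print(decode_ascii_stream(stream))  # CAB
```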




Though different in detail, however, ASCII is no different in principle from
Morse.



In ASCII, the word CAB looks like this:
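
As a minimal sketch, the 8-place codes for CAB can be computed in Python. The function name to_ascii_bits is hypothetical; the codes themselves are the standard ASCII values written out in 8 binary places.

```python
def to_ascii_bits(word):
    """Return the 8-place binary code of each character in `word`."""
    return [format(ord(ch), "08b") for ch in word]


for letter, bits in zip("CAB", to_ascii_bits("CAB")):
    print(letter, bits)
# C 01000011
# A 01000001
# B 01000010
```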


3.3 Text storage in computers


A computer is an electronic device, and
can only store data in electronic form in
its memory.


A computer memory is, in essence, just a
very long sequence of numbered storage
bins.





Each bin, or slot, on the right-hand side
of the memory can contain one piece of
electronic data.


The computer gets at that piece of data
by going to the corresponding address.


How the computer knows the address,
and what it does with the data once it has
it, leads into the issue of how computers
work, which is both beyond the scope of
this module and unnecessary for present
purposes.




We have seen that ASCII codes can be converted to
electronic form by interpreting 1 as 'electrical on' and
0 as 'electrical off', and also that a computer memory
is a sequence of storage slots, where each slot
contains one item of electronic data.


That data can be ASCII codes.


Storing text in a computer memory is therefore simply
a matter of putting the relevant codes in known
memory locations in the right sequence.



Thus, the word CAB would look like this in memory.
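
The idea of memory as a sequence of numbered slots holding ASCII codes can be sketched with a Python list, where the list index plays the role of the address. The memory size, the starting address 2, and the function names store_text and load_text are hypothetical choices made for illustration; real computer memory is organized in a more complicated way.

```python
MEMORY_SIZE = 8  # hypothetical, tiny memory for illustration
memory = ["00000000"] * MEMORY_SIZE  # every slot starts out 'all off'


def store_text(memory, start_address, text):
    """Write the 8-place ASCII code of each character into consecutive slots."""
    for offset, ch in enumerate(text):
        memory[start_address + offset] = format(ord(ch), "08b")


def load_text(memory, start_address, length):
    """Read `length` slots back and decode them into characters."""
    slots = memory[start_address:start_address + length]
    return "".join(chr(int(bits, 2)) for bits in slots)


store_text(memory, 2, "CAB")
for address, bits in enumerate(memory):
    print(address, bits)  # slots 2, 3 and 4 now hold the codes for C, A, B
print(load_text(memory, 2, 3))  # CAB
```

Reading the text back requires only the starting address and the number of slots, which is why the codes must be stored in known locations and in the right sequence.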


3.4 How text gets into computer memory


Text gets into computer memory by means of an input device.


There are various such devices, but the most familiar and commonly-used is the
keyboard, so we look at that.


As with memory itself, the operation of a computer keyboard is conceptually very
simple: every time a letter key is pressed, the electronic ASCII code corresponding
to the key is generated and sent up the wire connecting the keyboard to the
computer.


When it arrives at the computer, it is placed into the memory.



4. Corpora


4.1 Print corpora


4.1.1 Naturally-evolving corpora


i. Accumulation of printed documents


ii. Examples: library collections, historical archives, the law, the canon of
English literature, etc.





4.1.2 Explicitly-designed corpora


i. Motivated by the appearance of scientific linguistics


ii. Research agendas: historical, dialectological etc.


iv. The nature of print-based corpora: document collections





4.2 Electronic corpora


4.2.1 The current position


Worldwide generation of text leading to implicitly constructed corpora

Explicit construction of corpora for linguistic research




Standards: XML




Examples





4.3 Advantages of electronic over print corpora

i. Efficiency of production



Keyboard, scanner, voice recognition



Contrast with print and manuscript production


ii. Efficiency of storage



Capacity of electronic media



Contrast with storage of books


iii. Efficiency of reference



Locating and searching electronic text



Contrast with locating and searching of books





iv. Efficiency of transmission



Electronic dissemination of text



Contrast with physical dissemination of books


v. Cost: electronic text is VERY cheap


vi. Suitability for analysis