
Masaryk University
Faculty of Arts

Department of English and American Studies

English Language and Literature


Bc. Lenka Chylíková

Building Specialized Corpora for Term Extraction

Master's Diploma Thesis


Supervisor: PhDr. Jarmila Fictumová

2013



























I declare that I have worked on this thesis independently, using only the primary and secondary sources listed in the bibliography.


……………………………………………..
Author's signature


























































I would like to thank my supervisor, PhDr. Jarmila Fictumová, for her patient guidance and valuable advice. Special thanks go to RNDr. Vojtěch Kovář and his department at the Faculty of Informatics, Masaryk University, for help with advanced corpus tools.






Table of Contents

Introduction
1. The Field of Corpus Linguistics
1.1. History of the Branch
1.2. Types of Corpora
1.2.1. Time Dependence
1.2.2. Field Specification
1.2.3. Multilingual Sources
1.2.4. Educational Purposes
2. Corpus Tools
2.1. Labelling Procedures
2.1.1. Markup
2.1.2. Annotation
2.1.2.1. Morpho-syntactic Annotation
2.1.2.2. Parsing
2.1.2.3. Semantic Annotation
2.1.2.4. Discoursal Level Annotation
2.2. Query Search Tools
2.2.1. Concordance
2.2.2. Frequency
2.2.3. Collocation
2.2.4. Phrase Patterns
2.2.5. Word Sketch
2.2.6. Word List
2.3. Alignment Tool
3. Building a Corpus
3.1. Size
3.2. Content
3.3. Balance and Representativeness
3.4. Permanence
3.5. Compilation Procedures
4. Extracting Terms
5. Glossary Making
6. Language for Specific Purposes Corpus
6.1. Real Texts
6.2. Compilation of LSP Corpus
6.2.1. Mammut English LSP Corpus
6.2.2. Horolezecká metodika Corpus
6.2.3. Mammut Czech Corpus
6.2.3.1. Translation Difficulties
7. Comparison of Corpora
7.1. Distribution in English Specialized Corpus
7.1.1. Nouns
7.1.2. Verbs
7.1.3. Adjectives
7.2. Distribution in Czech Specialized Corpus of Translations
7.2.1. Nouns
7.2.2. Verbs
7.3. Distribution in Czech Specialized Corpus of Methodology
7.3.1. Nouns
7.3.2. Verbs
7.3.3. Adjectives
7.4. Comparison across Corpora
8. Terminology Extraction and Glossary Compilation
8.1. Materials and Technology Glossary
8.2. Sports Types Glossary
8.3. Repetitive Phrases Glossary
8.4. Terminology Bank
8.5. Conclusion
9. Conclusions
References
List of Images and Tables
List of Abbreviations
Summary
Resumé


Introduction

The field of corpus linguistics is a branch of applied linguistics that connects linguistic theories with practical applications in translation or natural language processing. Although the beginnings of the branch are fairly recent, information technology has had a great effect on the progress of corpus programs and tools. It is now unimaginable that corpus compilation or searching would be conducted manually. Technology has also opened up greater access to data through the World Wide Web, unless copyright protection applies. Moreover, specialized corpora are scarcer but also more sought after than traditional general corpora with billions of tokens, because of the precise data that can be retrieved from them.

Researchers need to think the corpus work through, because major decisions concerning the type, size, balance and number of languages have to be made. Determining the purpose of the corpus analysis is an elementary task; the outcome of the corpus search is relevant only if a suitable corpus is chosen. Once the corpus is approved of, one may explore it from different angles – frequency lists, collocations or part-of-speech co-occurrence. Everything is possible if one becomes familiar with the coded language of the corpus, meaning the tagging abbreviations. The vital phase of the corpus-building procedure is tagging, during which various features are assigned to the so-called tokens. This procedure cannot be skipped, because it is absolutely essential for further research. There are also different types of glossaries, depending on the corpora chosen for the extraction of keyword and term candidates. Compiling a glossary is manual work with automatically retrieved data.

The hypothesis of this Master's thesis is that a series of effective glossaries can be compiled out of three comparable specialized corpora with similar thematic content. If the outcomes of all three term candidate lists are compared, glossary entries can then be paired. I propose that extracting terms from parallel corpora is not the only way to create a bilingual glossary.

In the more practical part, the process of compiling the corpora is described. The data prepared for the corpora are analysed in detail, because each corpus's data are retrieved differently. Although one corpus consists of translations, the possible subjectivity of such a corpus is counterbalanced by creating another specialized corpus in the same language. All three corpora are then compared in order to create a complex bilingual glossary with descriptions. The glossary is divided into four thematic parts, which are subsequently analysed as well. There are bilingual as well as monolingual explanatory glossaries, because all of them will be used as translating tools for actual translation in the particular specialized field. The hypothesis is verified or discarded according to the final outcome of the analysis.











1. The Field of Corpus Linguistics

Corpus linguistics is an approach within applied linguistics that makes use of corpora, collections of various texts, to analyse linguistic data from the semantic, syntactic, morphological or statistical point of view. It should be noted here that not only written texts are included but also spoken data, although the former prevail over the latter. The massive impact of corpus linguistics has emerged fairly recently in comparison with other linguistic disciplines, due to its dependence on machine-readable materials. Lynne Bowker (1996), Professor and Director of the School of Translation and Interpretation at the University of Ottawa, states that although the term was coined "in early 1980s", the field has a long history behind it (p. 303). In addition, corpus linguistics aids other disciplines such as lexicography, sociolinguistics, terminology-building, historical linguistics, etymology, translation studies and teaching science. What is more, it extends beyond the linguistic field into information technology and statistics. Professionals in all these branches may take advantage of building a corpus, asking particular queries or compiling a glossary of specialized terminology.


1.1. History of the Branch

Charles Meyer (2002), Professor at the University of Massachusetts Boston, explains the breakthrough of the linguistic branch. He states that a reaction was needed against generative grammar theory, which focused on theoretical rather than practical issues of linguistics and claimed that "the only legitimate source of grammatical knowledge" is a native speaker (p. 17). Notably, "the native speaker" is a very broad category, considering the variability in native speakers' dialects and active vocabulary. In comparison, corpora are not restricted to native speakers of a language; a learner corpus, for example, captures second-language speakers.

Noam Chomsky is claimed to be the forefather of generative grammar. After he introduced his three types of adequacy – observational, descriptive and explanatory – the theoretical aims became clearer. This theoretical orientation has been the key concept of generative grammarians' aims, but corpus linguistics, on the other hand, has focused on descriptive adequacy (Meyer, 2002, p. 17). In other words, corpus linguistics prefers to compare and analyse real data, while the so-called "armchair linguistics" explicates theories and hypotheses. Meyer condenses Leech's work "Corpora and Theories in Linguistic Performance" in order to highlight the usefulness of applied corpus linguistics. Other Chomskyan terms, competence and performance, are analysed there as well:

The main argument against the use of corpora in generative grammar, Leech observes, is that the information they yield is biased more towards performance than competence and is overly descriptive rather than theoretical. However, Leech argues that this characterization is overstated: the distinction between competence and performance is not as great as is often claimed. (Meyer, 2002, p. 20)

Geoffrey Leech, Emeritus Professor in the Department of Linguistics and English Language at Lancaster University, and Meyer both affirm the status of corpus linguistics as a true source of academic linguistic science, again emphasising practicality over hypothetical assumptions. Furthermore, Leech asserts the precision of corpus research on the basis of his own work:

"all of the criteria applied to scientific endeavours can be satisfied in a corpus study, since corpora are excellent sources for verifying the falsifiability, completeness, simplicity, strength, and objectivity of any linguistic hypothesis" (as cited in Meyer, 2002, p. 20).

In addition to the realistic representation of language, Meyer (2002) characterizes the corpus approach as functional, since "all corpus-based research is functional in the sense that it is grounded in the belief that linguistic analysis will benefit [because] it is based on real language used in real contexts" (p. 27). Functional analysis, another key concept of modern linguistics, describes "the use of language as a communicative tool" (Meyer, 2002, p. 21). Thanks to corpora, researchers may observe the functionality of a certain language variety, for example a regional, ethnic or social one, as long as the communication fulfils its main purpose.

To conclude the section on the history of corpus linguistics, it can be argued that this branch of linguistics is functional and descriptive. It shows precise and objective results grounded in real-life situations. Thus a linguist may verify or discard a hypothesis about any kind of vocabulary discipline. Similar procedures have been undertaken by Leech, Quirk, Greenbaum and many others (Meyer, 2002, p. 30).
.


Moreover, it is the kind of field that has been developing hand in hand with the advancement of computers and IT science. Prior to computerized corpora, all the data had to be processed manually. Lynne Bowker has also suggested that there have been three, potentially four in the near future, generations of corpus linguists. The stages are marked by the tools and procedures used for compiling and further exploiting the corpora. The first machine-readable corpus was "the Standard Corpus of Present-Day American English" (Bowker, 1996, p. 304), which has since been referred to as the Brown corpus. It was created by W. Nelson Francis and Henry Kučera in 1963-1964 (Meyer, 2002, p. 17). In Europe, it was followed by "the Lancaster-Oslo/Bergen (LOB) Corpus" of British English exclusively. At that time, creating a body comprehensive in scope and content must have required manual labour so demanding and time-consuming that a current corpus user can hardly imagine it. The second generation began to use computers for corpus compilation during the 1980s, and the amount of data grew accordingly. Examples include "the Birmingham Collection of English Texts (BCET) and the Longman/Lancaster English Language Corpus (LLELC)" (Bowker, 1996, p. 304). The third generation corresponds to the current situation: thanks to World Wide Web technology, virtually anyone can create a large corpus, but the major problem is not the data but access to them, due to copyright (Bowker, 1996, p. 304).

The particular field of terminography, the approach to corpus linguistics from the viewpoint of extracting terminology, gained full "attention since the publication of the COBUILD Dictionary in 1987" (Bowker, 1996, p. 304). With the growing need for translations across the spectrum of professions, glossary-making is becoming ever more important and frequent as well.


1.2. Types of Corpora

A corpus is not a unitary collection; there are a number of combinations of prerequisites determined by the particular demands of the research purpose, the language usage and its speakers. As suggested above, another distinction is the inclusion of written and spoken data, because the two display different degrees of formality and more or less complex vocabulary respectively. According to all these criteria, corpora vary.


1.2.1. Time Dependence

A corpus can also trace the development of the language among a certain group of people differing in gender, nationality or age, if the corpus is constantly and regularly updated with current data. Such a diachronic approach is used for the monitor corpus. Another diachronic type is the historical corpus, which contains "samples of writing representing earlier dialects and periods" (Meyer, 2002, p. 36) of a language. The purpose of these two is considered similar, but the monitor corpus tracks current changes, while the historical one applies to a larger time span. The opposite is called a synchronic or "traditional linguistic corpus" (Meyer, 2002, p. 31); it represents language at a certain period of time. For example, the Helsinki Corpus contains texts from the years 700 to 1700, that is, from the Old English up to the Early Modern English period (Hunston, 2002, p. 16).

1.2.2. Field Specification

The simplest corpus to create is the general corpus, where different kinds of data can be added – spoken and written, full texts or only samples – basically anything that fulfils two criteria: the same language and the widest possible scope of content. An example of this may be the British National Corpus (http://www.natcorp.ox.ac.uk/). Such a corpus may be used as a reference corpus for translators, teachers or terminographers. "Reference" means that the general language is supposed to be compared with the specialized corpus, or vice versa, depending on the purpose of the research.

The specialized corpus aims "to investigate a particular type of language" (Hunston, 2002, p. 14). It should represent a certain field, occupation or even a time frame. Therefore, it is essential to choose the correct types of data carefully.

Bowker refers directly to the language of both types of corpora. She uses the abbreviations LGP and LSP, which stand for language for general and specific purposes respectively.


1.2.3. Multilingual Sources

So far, the monolingual, that is, single-language corpus has been discussed, but bilingual or even multilingual corpora are created in two types – comparable and parallel. The comparable corpus does not necessarily deal with two foreign languages; it may also be compiled out of varieties of the same language. Another feature of comparable corpora is that the texts on the two sides of the corpus scheme do not have the same content. Parallel corpora, on the other hand, necessarily have the same content, because they are translations, and the texts are further marked and aligned. Although the latter type is more demanding to obtain, its results are instantly ready to use, while the comparable corpus's results need further processing, meaning that the equivalents are then assigned to the terms manually. Meyer (2002) summarizes the advantages of parallel corpora: "[they] facilitate contrastive analyses of English and other languages, advance developments in translation theory, and enhance foreign language teaching" (p. 38).

1.2.4. Educational Purposes

A corpus is often a neglected tool in language courses; however, it may serve a teacher as well as a student well. The learner corpus is a collection of students' original work. Not only does this type of corpus show the general tendency of vocabulary choice, but it also reveals mistakes that may occur repeatedly, usually due to the influence of a mother tongue. Second, the pedagogic corpus is composed of the sources that a student may be exposed to (Hunston, 2002, p. 16) during their foreign language courses. The official audio recordings along with the relevant textbook are the proper pedagogic corpus sources, while the essays, stories and conversations produced by students are valuable for a learner corpus.

2. Corpus Tools

The results of occurrence searches within the collection of texts are accessible thanks to corpus engine programs or online on the World Wide Web. The basis for obtaining the results are the tools that enable an analyst to gain information on statistical or syntactic phenomena – frequency, concordance or collocation. But prior to asking queries, tagging, parsing, annotating or lemmatization needs to take place. Fortunately, there are computer programs and applications that help accomplish all the above-mentioned tasks.


2.1. Labelling Procedures

First, a distinction between the markup and the annotation of a corpus should be made. Although both add certain extra information to simple text, the former focuses more on computational processes, while the latter is more linguistically oriented.

2.1.1. Markup

Markup is based on adding information, or rather metadata, about the inserted text. A detailed description of the text source, its header file and author should be included. This covers not only extra-linguistic data, such as the nationality and age of the speaker, but also para-linguistic data, like laughter after a particular expression, which applies particularly to spoken transcripts. In order to prevent misunderstanding across different languages, a special markup language, TEI, has been developed and is used in computational text analysis as well, but for the purposes of corpus-linguistic analyses a simplified CES language has been devised. This method marks any kind of input: "an element can be any unit of text, for example, chapter, paragraph, sentence or word", and even punctuation (McEnery, Xiao and Tono, 2006, p. 24). However, the method does not work ad hoc; on the contrary, it operates systematically on three levels: "Three levels of text standardization are specified in CES: 1) the metalanguage level, 2) the syntactic level and 3) the semantic level." (McEnery, Xiao and Tono, 2006, p. 26). Although it may seem useless or too profession-specific, without the above-mentioned procedure further annotation and subsequent research would be impossible. It is important for morpho-syntactic tagging, such as part-of-speech tagging, and for the alignment of parallel texts in parallel corpora. Analogously, it is also vital for machine translation research.

2.1.2. Annotation

Annotation, by contrast, is used to capture purely linguistic information, "referring solely to the encoding of linguistic analyses such as part-of-speech (POS) tagging and syntactic parsing" (McEnery, Xiao and Tono, 2006, p. 29). The advantages are obvious – primarily, systematic analyses based on an elementary, more or less clear-cut categorization – but there are also disadvantages. The major point against the use of POS tagging is not only the copyright permission prerequisite, which can be violated by making the tagged corpora public, but more importantly the criticism "related to the accuracy and consistency of corpus annotation" (McEnery, Xiao and Tono, 2006, p. 32). Of course, the borderline cases should be consulted; still, the work should be carried out under unitary directives.

Analysts have three choices of how to annotate their corpus: automatically, manually or by a computer-assisted method (McEnery, Xiao and Tono, 2006, p. 32). By all means, all three methods may generate some sort of mistakes; none of them is virtually foolproof. If mistakes are generated by automated annotation, post-editing by hand is necessary. Nevertheless, the algorithms and rules that drive the annotation are beyond the linguistic scope of this thesis.

McEnery et al. (2006) outline the types of corpus annotation in five layers: phonological, morphological, lexical, syntactic and discoursal. The first level corresponds to the annotation of phonetic and prosodic features. The second level refers to marking stems, prefixes and suffixes. The third and fourth will be analysed in more detail below. The fifth layer covers stylistic features or pragmatic speech acts (pp. 33-34). Despite the vast variety of tagging types, some are neglected, for example pragmatic or discoursal annotation, while others, such as POS tagging, lemmatization or parsing, thrive.

2.1.2.1. Morpho-syntactic Annotation

The most elementary and widely used type of tagging assigns part-of-speech marks, because almost all further research in corpus linguistics would be impossible without it. Nowadays, there are fully automatic programs that accomplish the task very quickly. However, different programs exist for different languages. The British National Corpus was tagged with CLAWS (Constituent-Likelihood Automatic Word Tagging System), which had been developed at Lancaster University (McEnery, Xiao and Tono, 2006, p. 34). The Sketch Engine employs the TreeTagger for English or German, which is not readily compatible with Czech, for example; for Czech, the tagger is called Desamb.

First, the process of dividing the text into separate words, or rather tokenization, is initiated (McEnery, Xiao and Tono, 2006, p. 35). Even though it may seem a straightforward task thanks to the spaces and punctuation between words, McEnery et al. (2006) point out problematic issues such as multiword expressions, mergers and informal word forms, and hyphenated or unhyphenated compounds (p. 35).

Second, it is necessary to group the inflected word forms under the same category based on the stem. This is the method of lemmatization. It depends on the complexity of the morphology of a particular language – English, for instance, inflects comparatively little – and the design of the lemmatization program should respect the morphological demands of each language.
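Purely as an illustration of these two steps (and not of the TreeTagger or Desamb pipelines mentioned above), a minimal Python sketch using the NLTK library might look as follows; the sample sentence is invented and the NLTK resources ('punkt', 'wordnet') are assumed to be downloaded.

    # A minimal sketch of tokenization followed by lemmatization (NLTK).
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

    sentence = "The climbers were checking their worn ropes."
    tokens = word_tokenize(sentence)            # split the text into tokens
    lemmatizer = WordNetLemmatizer()
    # a real pipeline would pass each token's POS tag to the lemmatizer;
    # the default here treats every token as a noun
    lemmas = [lemmatizer.lemmatize(t.lower()) for t in tokens]
    print(list(zip(tokens, lemmas)))            # e.g. ('ropes', 'rope')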

2.1.2.2. Parsing

Building on this complex marking, the procedure of parsing brings "the morpho-syntactic categories into higher-level syntactic relationships" (McEnery, Xiao and Tono, 2006, p. 36). The phrase structure (noun and verbal phrases) is labelled, as well as the syntactic relationships between the phrases. An analyst may again choose from three types of processing – automatic, manual or a combination of both. McEnery et al. (2006) elucidate the strategy of computer-controlled parsing programs: "Syntactic parsing, no matter whether it is automatic or manual, is typically based on some form of context-free grammar, for example, phrase-structure grammar, dependency grammar or functional grammar." (p. 37). But handcrafted or post-edited parses are mostly preferred; a human source is often considered more reliable than computers. In addition, phrases or clauses can be labelled, such as comparative and relative ones (McEnery, Xiao and Tono, 2006, p. 37). Again, if the marking is extremely detailed, it is "full parsing"; if the details are limited, it is "skeleton or shallow parsing" (McEnery, Xiao and Tono, 2006, p. 37). Parsing is particularly helpful for natural language processing (NLP) and machine translation (McEnery, Xiao and Tono, 2006, p. 37), because both rely on word-for-word or phrase-for-phrase patterns for translating or understanding a language expression.
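To give a concrete, hedged sketch of what shallow (skeleton) parsing amounts to, the snippet below chunks noun phrases out of POS-tagged tokens with NLTK's regular-expression chunker; the tags are written in by hand so that the example does not depend on any tagger model, and the chunk grammar is a deliberately simplified assumption.

    # Shallow parsing sketch: extract noun-phrase chunks from tagged tokens.
    import nltk

    tagged = [("the", "DT"), ("worn", "JJ"), ("climbing", "NN"),
              ("rope", "NN"), ("failed", "VBD")]
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"     # determiner + adjectives + nouns
    chunker = nltk.RegexpParser(grammar)
    tree = chunker.parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))
    # prints: the worn climbing rope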

2.1.2.3. Semantic Annotation

This kind of labelling assigns codes indicating the semantic features or semantic fields of words (McEnery, Xiao and Tono, 2006, p. 37), such as agent, patient or instrument. To clarify the definition, a distinction between two types is important. The first type marks the elements only within a sentence, and thus it should be regarded as syntactic-level annotation. The second type, true semantic parsing, applies to the whole text provided, and it is currently more common (McEnery, Xiao and Tono, 2006, p. 38). It is an intricate issue reaching beyond the scope of this thesis; for a beginner's purposes, it is sufficient to note that semantic annotation may be knowledge-based, statistical and machine-reliant.

2.1.2.4. Discoursal Level Annotation

The discoursal level of annotation is a rapidly developing field of computational analysis, and the term covers several types of annotation. First, coreference or anaphoric annotation aims at the "identification of coreferential relationships between pronouns and noun phrases" (McEnery, Xiao and Tono, 2006, p. 41). Its major focus is the cohesion of texts.

Second, pragmatic annotation currently dominates the field of spoken corpora. McEnery et al. (2006) refer to speech-act, or rather "dialogue act" (Leech and Weisser, 2003, p. 1), annotation, which was achieved by Geoffrey Leech and Martin Weisser in the Speech Act Annotated Corpus (SPAAC) (p. 41). They have defined forty-one categories of speech acts, based on different speech-act values, primarily the form, polarity, topic and mode categories (Leech and Weisser, 2003, pp. 1-2). Though Leech and Weisser have made immense progress, McEnery et al. believe that this type of annotation is not yet ready to be fully automated; only a semi-automated option, like the one Leech and Weisser used, is possible.

Furthermore, stylistic annotation is associated with the "long tradition of focusing on the representation of speech and thought in written fiction" (McEnery, Xiao and Tono, 2006, p. 41). It is the counterpart to the annotation of speech, or rather dialogue, corpora. "Because surface syntax cannot reliably indicate the stylistic features, the automatic annotation of such categories is difficult" (McEnery, Xiao and Tono, 2006, p. 41). Consequently, stylistic annotation by hand is the safest and most reliable procedure.


2.2. Query Search Tools

To begin with, it is crucial to determine what information an investigator is searching for. First, the major approach to finding tokens in the near environment of the query is contextual, which is the domain of the collocation pattern display. Second, the sphere of statistical information is prevalent in the frequency display. Then only the textual reference remains, which is the basic concordance display. Given the aims of this thesis, conceptual information is of particular value, because it refers to the knowledge, semantic or even pragmatic, behind the words – which is what extracting terminology relies on.

2.2.1. Concordance

As mentioned above, searching for any non-specific concordance is the simplest action. Nonetheless, the form of the query and its tagged details are fundamental; the view of the results, for example in a larger context, can then be set. But the very first thing is the concordancer – "the program that searches a corpus for a selected word or phrase" (Hunston, 2002, p. 39). The searched node word is displayed in the concordance lines along with its nearest neighbours in a particular entry, or rather its co-text. Then the central and typical information in the concordance lines can be observed. "'Typical' might be used to describe the most frequent meanings or collocates of phraseology of an individual word or phrase" (Hunston, 2002, p. 42), whereas "the concept of 'centrality' can be applied to categories of things rather than to individual words" (Hunston, 2002, p. 43). This means that typicality is more commonly observed than centrality, despite the fact that the central concept refers to wider senses of various words or to an extralinguistic entity.

First, the types of queries are discussed. The most basic form of a query is a lemma; it is essentially the base form of a word, from which other morphological inflexion alternatives (noun declension, adjectival comparison, verb conjugation) or different word classes are derived, and all of these variants are consequently displayed. However, to specify the search, the exact word or phrase may be entered. In the case of a word or lemma query, the part-of-speech option may narrow the exploration. As far as advanced search is concerned, the Corpus Query Language (CQL) should be explained. It is useful for complex searches and greater control over the provided results. CQL aids in searching for grammatical constructions, part-of-speech-specific information, gapped constructions or words that contain a particular string of characters (CQL Help, 2012, p. 3). Particular signs, such as square brackets, asterisks or quotation marks, are inserted in order to obtain the required concordance.

Second, after the desired examples are retrieved, one may adjust the display of the node words, to see either the whole sentence or the keyword in its immediate context: the options keyword in context (KWIC) or sentence view are offered. The former shows the immediate environment of the node word, which can also be sorted by frequency or alphabetical order on either side, while the latter contains the whole contextual information of a sentence, which can then be expanded in order to get more background information.
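As a rough sketch of what a concordancer does internally, the following fragment prints KWIC lines for a node word from a tokenized text; a corpus manager such as the Sketch Engine performs the same task over an indexed, tagged corpus and additionally accepts CQL queries (for instance [lemma="rope"]), but the toy text and window size below are invented.

    # A toy KWIC (keyword in context) display: the node word with a fixed
    # number of tokens of co-text on each side.
    def kwic(tokens, node, window=4):
        for i, tok in enumerate(tokens):
            if tok.lower() == node.lower():
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                print(f"{left:>30}  [{tok}]  {right}")

    text = "Check the rope before climbing . Never load a damaged rope .".split()
    kwic(text, "rope")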

2.2.2. Frequency

Showing the statistical information about an occurrence is important for the purposes of sociolinguistics, because it reveals the tendencies of speakers' word choice and the associated bias rooted in language. A raw count of occurrences is not a precise frequency, because it depends on the size of the corpus; hence the count is converted to a measure per million tokens. The most frequent tokens are arranged in the frequency list, in which grammatical words such as articles or prepositions outnumber the lexical words (Hunston, 2002, p. 3). "Frequency lists from corpora can be useful for identifying possible differences between corpora that can be studied in more detail. Another approach is to look at the frequency of given words compared across corpora" (Hunston, 2002, p. 5). Notably, the list's findings can sometimes be filtered by adding a limit, for example a frequency threshold, register or regional variety, so that it meets individual demands. Frequency comparison also plays a prominent role in the extraction of terminology, which will be discussed later on.
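The conversion itself is simple arithmetic; a minimal sketch with invented counts:

    # Relative frequency: hits per million tokens, so that corpora of
    # different sizes can be compared directly.
    def per_million(hits, corpus_size):
        return hits * 1_000_000 / corpus_size

    # e.g. 42 hits in a 350,000-token specialized corpus
    print(per_million(42, 350_000))     # 120.0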

2.2.3. Collocation

Alan Partington (1998) introduces the linguist J. R. Firth, who coined the term collocation and its famous definition: "you shall know a word by the company it keeps" (p. 15). Firth emphasised the role of collocation in grasping meaning and, consequently, fuelled the debate over the contextual versus conceptual value of collocation. Arguably, both the opinion of his student Sinclair, promoting the textual worth, and Leech's definition of collocative meaning based on associations seem valid (Partington, 1998, p. 15). Nevertheless, the span, the distance between the node word and its collocate, cannot be neglected. Susan Hunston (2002) stresses the role of calculation in collocational observation:

"A list of the collocates of a given word can yield similar information to that provided by concordance lines, with the difference that more information can be processed more accurately by the statistical operations of the computer than can be dealt with by the human observer" (p. 12).

Most concordancing programs show the statistically most frequent collocates of a node word, but the Sketch Engine also shows logDice, a metric of the salience of the given lemma in context, MI (mutual information between the lemma and its collocates) and the T-score, a statistical measure of the likelihood that two or more words co-occur (Glossary of Sketch Engine Terminology, 2012). These features are used in the Word Sketch as well and will be mentioned later.
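All three association scores can be computed from four counts: the corpus size, the node frequency, the collocate frequency and their co-occurrence frequency within the chosen span. The sketch below uses the textbook formulas with invented counts; the Sketch Engine's own implementation may differ in details such as smoothing.

    import math

    # N: corpus size; f_node, f_coll: frequencies of node and collocate;
    # f_pair: co-occurrences of the two within the chosen span.
    def association(N, f_node, f_coll, f_pair):
        expected = f_node * f_coll / N
        mi = math.log2(f_pair * N / (f_node * f_coll))              # mutual information
        t_score = (f_pair - expected) / math.sqrt(f_pair)           # T-score
        log_dice = 14 + math.log2(2 * f_pair / (f_node + f_coll))   # logDice
        return mi, t_score, log_dice

    print(association(N=1_000_000, f_node=400, f_coll=250, f_pair=30))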

2.2.4. Phrase Patterns

If a phrase is known, it can easily be sought in the corpora via the search engine; if it is unknown, however, it can be found only through a collocation search. Both collocations and phrases are multi-word expressions carrying a particular, primarily figurative rather than literal, meaning. Collocations and phrases often overlap, because some collocations are so fixed in speech that they should be considered phrases. Partington (1998) introduces a division of these phrases or collocations into two groups – canonical and non-canonical – based on whether their grammatical structure resembles or differs from the standard English structure respectively (p. 25). Partington (1998) also presents what Sinclair found about phrases – the idiom, or collocational, principle as opposed to the open-choice principle. The fixed expressions of the former principle, such as idioms, phrasal verbs or phrases in general, constitute the majority of any utterance, compared with the diminishing creative participation in the case of the latter principle (p. 19). This tendency occurs not only because of time pressure but also because of the speed of encoding and decoding a message. Hunston (2002) advances the thought by saying that "each word in the text is used in a common phraseology" (p. 143). It seems that the individual's performance relies on combining pieces of phrases, which raises the question of limited vocabulary in corpora. But Hunston (2002) later argues against Sinclair's proposal: "practical application of Sinclair's theory is not without problems. As might be expected, the idiom principle is easier to demonstrate than the open-choice one" (p. 147).

Another problematic issue is the unsettled boundary of a phrase. "There is evidence that much of the language met in everyday life does not consist of one phrase tacked on to the end of another, but that one phrase overlaps with the next" (Hunston, 2002, p. 146). By this, it is assumed that a phrase is systematically generated out of separate words of the previous phrase in the utterance.

Moreover, other types of phrases are the so-called frames. These triplets share the same first and last items, but the middle position is occupied by different constituents, the new phrasal variants. In some concordancers, these repetitive patterns can be found without difficulty; others may require a refined strategy to compare and evaluate the results of two lemmas or words within the nearest span. "Frames are particularly useful when looking at a specialised corpus, and can be used as the basis for investigating the language of a discipline" (Hunston, 2002, p. 50). Therefore, frames are especially suitable for terminologists.

A more complex example of a grammatical frame would be the lexicalized sentence stem, which "consists of a sentence or part of a sentence in which one or more of the structural elements is a class rather than a particular discrete item" (Partington, 1998, p. 22). These structures can be inflected, for instance when tenses change or nouns are pluralized. That makes them a mixture of Sinclair's idiom and open-choice principles, because they combine creativity with permanence. Similar features characterize the schema, which "shares some of the qualities of a fixed phrase but which also contains variable parts capable of capturing context dependent information" (Partington, 1998, p. 22). Arguably, these semi-creative constructions help in language acquisition, though a learner may fail to recognize that they are patterns.
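A minimal sketch of how such frames could be harvested automatically (the token list is invented): collect all trigrams, group them by their first and last items, and keep the slots with more than one middle variant.

    from collections import defaultdict

    # Group trigrams by (first, last) item and list the middle variants.
    def frames(tokens, min_variants=2):
        slots = defaultdict(set)
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            slots[(w1, w3)].add(w2)
        return {frame: sorted(mids) for frame, mids in slots.items()
                if len(mids) >= min_variants}

    tokens = "a lot of , a number of , a couple of".split()
    print(frames(tokens))   # {('a', 'of'): ['couple', 'lot', 'number']}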

2.2.5. Word Sketch

According to the Sketch Engine Glossary of Terminology (2012), a Word Sketch is "a corpus-based summary of a word's grammatical and collocational behaviour". As mentioned above, this technology uses the statistical scores, combined into an overall score, but its most advantageous feature is that it shows all, or at least the most frequent, collocations, synonyms or modifiers of an inserted lemma and its part-of-speech classification in separate lists. Thanks to this, all the possible collocations and phrases can be extracted within a minute.

2.2.6. Word List

The Word List is another application of the Sketch Engine. It enables corpus linguists to search for any attribute (word, tag or lemma) or pattern in order to show a list of words sorted by frequency, from the highest rate downwards. The output may be adjusted so that the desired attributes are shown as well, for example lemmas, concrete tags and word forms. The list may then be sorted alphabetically or organised according to the tagging pattern. It is helpful for showing the distribution of parts of speech or lemmas within a corpus.
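In essence, the word list is a frequency-sorted table of whatever attribute has been chosen; a minimal sketch over invented lemma–tag pairs, restricted to nouns, is shown below.

    from collections import Counter

    # (lemma, tag) pairs, as a tagged corpus would provide them
    items = [("rope", "NN"), ("climb", "VB"), ("rope", "NN"),
             ("harness", "NN"), ("climb", "VB"), ("rope", "NN")]
    nouns = Counter(lemma for lemma, tag in items if tag.startswith("NN"))
    for lemma, freq in nouns.most_common():     # sorted by descending frequency
        print(lemma, freq)                      # rope 3, harness 1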


2.3. Alignment Tool

The most complex corpus tool is the alignment tool, which is used in bi- or multi-lingual corpus linguistics. Although major linguists cannot agree on the terminology, for the sake of this thesis the most frequent terms are chosen. As explained above, on the one hand, a parallel corpus is a type of text collection consisting of translations, the more precise the better. On the other hand, there is the comparable corpus, containing non-translations, that is, texts from different languages. If the corpus consists of varieties of the same language, it is referred to as a "comparative corpus" (McEnery, Xiao and Tono, 2006, p. 48).

The alignment tool links the sentences and words from one corpus to their respective partners in the other corpus. The accuracy is measured by the precision rate, expressed as a percentage. The preparation for alignment is vital: the tool is unlikely to be "capable of identifying [a] translation without the aid of the annotation" (McEnery, Xiao and Tono, 2006, p. 50), because the alignment must work on several levels, sentential and sub-sentential, "notably phrase (multi-word unit) or word level alignment" (McEnery, Xiao and Tono, 2006, p. 50). To achieve all this, there are three ways to approach sentence alignment: statistical (probabilistic), linguistic (knowledge-based) and hybrid.

McEnery et al. (2006) explain in detail that

"the statistical approach to sentence alignment is generally based on sentence length in terms of words or characters per sentence while the lexical approach uses morpho-syntactic information to explore similarities between languages" (p. 50).

But the most suitable method, according to McEnery et al. (2006), is the combination of the two previously mentioned methods – a hybrid which "integrates linguistic knowledge into probabilistic algorithm" (p. 51). Therefore, alignment programs are the domain of true computational linguistics professionals.
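To make the length-based idea concrete, here is a deliberately simplified sketch: each candidate sentence pair is scored by the ratio of its character lengths, which is the core signal of the statistical approach. Real aligners of the Gale–Church type embed such a score in a dynamic-programming search over 1:1, 1:2 and 2:1 matches; the example sentences are invented.

    # Toy length-based scoring for sentence alignment.
    def length_cost(src, tgt):
        longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
        return longer / max(shorter, 1)          # 1.0 means identical lengths

    src = ["Check the rope before every climb.", "Store it in a dry place."]
    tgt = ["Před každým výstupem lano zkontrolujte.", "Skladujte je na suchém místě."]
    for s, t in zip(src, tgt):                   # naive 1:1 pairing
        print(f"{length_cost(s, t):.2f}  {s} <-> {t}")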

In addition to aligned corpora, the sources should be adequately selected. Linguists disagree over the disputable issue of whether to use mirror translations, because these are considered to present a biased view of the text – "only one version of translation in the target language only represents one individual's introspection, albeit contextually and cotextually informed" (McEnery, Xiao and Tono, 2006, p. 92) – and of its vocabulary selection – "translated language is at best unrepresentative special variant of the target language and serves alone as the basis for contrastive studies, the results are clearly misleading" (McEnery, Xiao and Tono, 2006, p. 93). To prevent subjectivity, it is advisable to add as many versions of the same translated text as possible, but this is hardly applicable in practice, particularly in a specialized field, due to the lack of multiple versions of the same source text. Therefore, the best solution is to study comparable corpora.

McEnery et al. (2006) are apologetic about studying corpora from the translation point of view: "for practical reasons alone, we will often be working with corpora that, while they are useful, are not ideal for either translation or contrastive studies" (p. 95). At any rate, aligned parallel or comparable corpora are still useful and based on real-life situations. They aid in creating or advancing translation programs for glossaries, machine translation or computer-assisted translation tools. Such tools are seen as the future of corpus linguistics and natural language processing.




To summarize, "text corpora are welcome as an inexhaustible source of empirical information, a polygon for testing various linguistic tools – spell-checkers, OCRs [optical character recognition], machine translation systems, NLP [natural language processing] systems, etc." (Mihailov and Tommola, 2001, p. 70). But first the markup and detailed annotation, e.g. POS tagging or parsing, should be provided to make frequency and collocation searches possible and efficient.


3. Building a Corpus

Having introduced the basic tools and concepts of corpus linguistics, the corpus-building procedures should now be described. As mentioned several times above, the purpose of the corpus is crucial; from these demands the size and sources derive. Susan Hunston (2002) lists four features of a quality corpus design – size, content, balance and representativeness, and permanence (p. 25).

3.1. Size

The first issue is already disputable, because the appropriate size of a corpus is a vague concept to grasp. The quest for the optimum corpus has always preoccupied linguists, but no agreement has been reached. Some have preferred a small corpus, which makes tagging easier, the data more accessible and the concordance results simpler, yet the frequency figures are generally lower. On the other hand, data for a big corpus are obtained with difficulty and the concordance lines display too much information. By all means, both types are necessary: the larger as a reference corpus, the smaller as a specialized source of information.

3.2. Content

There are two key issues concerning the content of a corpus – the purpose of the corpus and the availability of the data – both of which are interrelated. A large scope of data is nowadays easily available thanks to the Internet; however, the specialized texts should be carefully chosen, because the corpus builder may include data that seem professional but in fact contain repetitive advertisements, and the results are then misleading. Again, some data are protected by copyright. Hunston (2002) states that the selection of texts is very individual, but it must fulfil the criteria of a well-balanced corpus (p. 27).

3.3. Balance and Representativeness

Both concepts should be considered a framework; they interpenetrate each other. Representativeness is associated with the content as well, by reason of the chosen topic and field of knowledge. The designer of the corpus should aim at covering the whole chosen field. Moreover, the authorship of the texts should be representative, covering a variety of genders, ages or nationalities. That does not mean that as much data as possible should be incorporated; rather, the balance of the distribution should be respected. The concept of balance is assumed to operate with approximately similar amounts of data. Arguably, balance is connected to the quantity of data, while representativeness is associated with the quality of the selected field data. If balance is disregarded, one may consider the corpus biased, because the optimal amount of data has been exceeded (Hunston, 2002, p. 29).


3.4. Permanence

The criterion of permanence is optional, unless the corpus is designed to track changes over a period of time: "One aspect of representativeness that is sometimes overlooked is the diachronic aspect" (Hunston, 2002, p. 30). For a corpus to be permanent, it is necessary to constantly update its materials. Thanks to this condition, the translation tools called term banks – "lists of technical terms from science and technology in the target language" (Hunston, 2002, p. 31) – or glossaries may be compiled.

3.5. Compilation Procedures

In spite of the variety of corpus-building programs, there are three basic ways to create a corpus – from the World Wide Web, by pasting a plain text, or by uploading specific documents, preferably plain text or tagged PDF files. By all means, the concrete language and the character encoding must be chosen in order to ensure correct tagging. These methods have their own advantages and disadvantages, mentioned previously in the thesis, yet the latter two require no advanced computing skills. The former, using the tool WebBootCaT (or just BootCaT), offers a choice of input method. First, the so-called seed words are filled in, that is, the keywords that will guide the search engine queries. This technique needs subsequent refinement of the displayed web addresses in order to provide the required data and prevent unwanted content from entering the corpus. Second, the specific addresses or URLs (uniform resource locators) from the command line of the search engine program are inserted. Again, depending on the purpose and the demanded size of the corpus, the respective method is chosen. There is just one pitfall: some web pages are built with Java or Flash Player, which is, strictly speaking, a video, so the displayed words cannot be read by the WebBootCaT program and the method of copying and pasting the text is necessary. Then the corpus is ready to be compiled and searched through with a concordancer and other tools.
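A rough sketch of the seed-word stage is given below; the seed words, the tuple length and the HTML cleaning are simplified assumptions, since WebBootCaT handles the search-engine queries, downloading and cleaning internally.

    import itertools, re

    # 1) combine seed words into search-engine query tuples, BootCaT-style
    seeds = ["carabiner", "belay", "crampon", "harness"]
    queries = [" ".join(t) for t in itertools.combinations(seeds, 3)]
    print(queries[0])                            # carabiner belay crampon

    # 2) once candidate pages are downloaded, strip the markup to plain text
    def strip_html(html):
        html = re.sub(r"<(script|style).*?</\1>", " ", html, flags=re.S | re.I)
        text = re.sub(r"<[^>]+>", " ", html)     # drop the remaining tags
        return re.sub(r"\s+", " ", text).strip()

    print(strip_html("<p>Always <b>check</b> your harness.</p>"))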

4. Extracting Terms

The major aim of both parallel and comparable corpora is to extract a list of words that are recognized as terminology list candidates. These do not have to be only single-unit terms but also multi-word terms. Jennifer Pearson (1998), a specialist in the domain of terminology and terminography from Dublin City University, describes several cases of term retrieval done by hand, but the most common types of extraction are those based on statistical (i.e. frequency and collocational) patterns and on the morpho-syntactic approach (pp. 122-123). Both approaches can be utilized. Namely, if single-word terms are sought, identification "on the basis of their frequencies of occurrence and distribution" (Pearson, 1998, p. 123) is possible. Pearson (1998) argues that the frequency display is not the only reliable resource, because some terms may occur infrequently, or the corpus may not be large enough to contain many repetitions of the node words (p. 123). Multi-word terms, on the other hand, are preferably identified thanks to their collocational behaviour (Pearson, 1998, p. 123).

In comparison, other researchers, such as Bourigault et al. (1996), use the morpho-syntactic approach to extract multi-word terms by revealing "the possible grammatical structures of complex terms" (as cited in Pearson, 1998, p. 123). They developed a tool for extracting noun phrases, because it had been believed that keywords occur within the scheme of a noun and adjective phrase. However, Pearson (1998) clarifies that further refinement was needed, because a large number of non-terms had been revealed (p. 123).

The stage of refinement consists of two phases. The first is to decide whether a concept is generic or just an individual object. Generic concepts are a framework of abstract notions and associations, while an object is only a tangible thing, but definitely not the representation of a name (Pearson, 1998, p. 129). Arguably, the generic terms are to be considered possible terms. The second phase is the question whether a term is flagged, meaning that it is preceded by the definite article, or unflagged, meaning that it can be preceded by an indefinite article or none at all (Pearson, 1998, p. 129). The former then refers to a specific concept, while the latter should be considered a generic reference, or rather the true term. All in all, Pearson herself recognizes these procedures as insecure, hence she advises the terminologist to focus on the unflagged terms for the sake of certainty.

There is another precondition for extracting the correct terms. What Pearson (1998) calls "linguistic signals" (p. 130), Bowker (2002) refers to as "knowledge probes", which are lexical phrases defining relations within a phrase. This means that the desired terms are supposed to co-occur with defining keywords such as the term, denotes, called or known as (p. 310). These form parts of a search pattern according to which term candidates are then indicated by the respective corpus analysis tools.

An automated method is also available. The safest, yet not absolutely perfect, option for retrieving the keywords is to rely on the computer. "If the aim is that of constructing a corpus big enough to allow terminology extraction, then an automated process to bootstrap corpora from the Web is the best solution to speed up the process" (Fantinuoli, 2006, p. 173). It is based on the statistical comparison of the frequencies in a specialized corpus with the frequencies in a reference corpus. The utility itself creates a pattern that matches the construction of the keywords, on the premise that the same sketch grammar, either in a file or from a URL, is used (Sketch Engine Help).
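A hedged sketch of the underlying comparison follows; the smoothing constant and the counts are purely illustrative, and the Sketch Engine's documented keyword score is only approximated here.

    # Keyness: compare a word's normalized frequency in the specialized corpus
    # with its normalized frequency in a large reference corpus.
    def keyness(freq_spec, size_spec, freq_ref, size_ref, n=1.0):
        per_m_spec = freq_spec * 1_000_000 / size_spec
        per_m_ref = freq_ref * 1_000_000 / size_ref
        return (per_m_spec + n) / (per_m_ref + n)    # n smooths zero frequencies

    # 'carabiner': 75 hits in a 250,000-token climbing corpus versus
    # 120 hits in a 100-million-token reference corpus (invented numbers)
    print(round(keyness(75, 250_000, 120, 100_000_000), 1))    # 136.8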



In addition to the processes behind this almost immediate computerized retrieval, Claudio Fantinuoli, a lecturer at the Department of Translation and Interpretation of the Johannes Gutenberg University in Mainz, Germany, states: "The tool uses an iterative algorithm to bootstrap corpora (from the Web) and extract unigram terms. It then proceeds to extract multi-word terms on the basis of the downloaded corpus and of the unigram terms list extracted in the previous phase" (Fantinuoli, 2006, p.176).
By applying a probabilistic algorithm, single-word and multi-word terms are extracted simultaneously. A unigram is the result of the unigram language model, which is "the simplest form of language model [that] simply throws away all conditioning context and estimates each term independently" (Companion to Introduction to Information Retrieval). Such language-model calculations also serve disciplines like natural language processing or machine translation. Then, according to Fantinuoli (2006), if a term contains at least one of the extracted unigrams and corresponds to a specific morpho-syntactic pattern, it can be added to the candidate list (p.176). In addition to the POS patterning, there is a widely held assumption that terms occur within a noun phrase. The list should then be confirmed by a professional in the specific area before a glossary is made out of it.
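The filtering condition described by Fantinuoli can be sketched as follows. This is only an illustration added here; the unigram list, the candidate phrases and the noun-phrase POS pattern are assumptions, not Fantinuoli's implementation.

```python
# A minimal sketch of the candidate filter described above: a multi-word
# candidate is kept if it contains at least one extracted unigram term AND its
# POS sequence matches a simple noun-phrase pattern. All data are assumptions.
import re

unigram_terms = {"carabiner", "belay", "rope"}

# Each candidate: (tokens, POS tags) as produced by a tagger such as TreeTagger.
candidates = [
    (["locking", "carabiner"], ["JJ", "NN"]),
    (["belay", "device"], ["NN", "NN"]),
    (["very", "quickly"], ["RB", "RB"]),
]

# Adjective/noun sequence ending in a noun, e.g. JJ NN, NN NN, JJ JJ NN.
NP_PATTERN = re.compile(r"^(JJ |NN )*NN$")

term_candidates = [
    " ".join(tokens)
    for tokens, tags in candidates
    if set(tokens) & unigram_terms and NP_PATTERN.match(" ".join(tags))
]
print(term_candidates)   # ['locking carabiner', 'belay device']
```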


To introduce a critical note, Bowker (2002) has expressed discontent with how far statistical techniques and concordancing actually simplify terminographers' work: "terminographers would find it highly useful to have a tool that allowed them to search for conceptual data as well as linguistic data" (p.319). The so-called text analyzer (TA) is an advanced tool that is assumed to fulfil these criteria. Such a tool handles four kinds of conceptual information: generic and specific concepts (described above), parts, functions and explanations (Bowker, 2002, p.321). All are extracted thanks to particular knowledge probes that search for contextual, functional or explanatory surrounding expressions.
5. Glossary Making

Compiling a glossary is a very useful and usually fundamental task for translators and interpreters who concentrate on specific domains requiring deep knowledge of professional vocabulary. Nonetheless, the research into textual resources and the corpus compilation described above must come first. This section summarizes the compilation process and the pitfalls concerning glossaries.
First, the extractor tool chooses the possible term candidates based on a morpho-syntactic analysis of the corpus. As mentioned above, frequency plays a key role; however, even terms that occur only once should be included in the candidate list. Chodkiewicz et al. (2002) explain that "It should be noted that the terminologists kept more hapaxes as candidate terms than they discarded, casting doubt on the generally accepted view that important terms always occur frequently" (p.348). A hapax legomenon is a token that occurs only once within a particular corpus, but it can obviously be a term as well.
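Hapaxes are easy to collect from a word-frequency list, as in the following sketch added here for illustration; the input file name is an assumption.

```python
# A minimal sketch: finding hapax legomena (tokens occurring exactly once) so
# that they are not lost from the candidate list. The file name is an assumption.
from collections import Counter

with open("climbing_corpus.txt", encoding="utf-8") as f:
    tokens = f.read().lower().split()

freq = Counter(tokens)
hapaxes = [word for word, count in freq.items() if count == 1]
print(len(hapaxes), "hapax legomena, e.g.:", hapaxes[:10])
```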


Second, the matching stage comes after the extraction phase. Parallel corpora are aligned automatically, but in the case of comparable corpora, terminologists are expected to match the terms manually. Then the issue of multiple equivalents arises. Váradi and Kiss (2007) refer to unidirectional and bidirectional equivalents (p.159).
Terminologists have found ways to discard non-term equivalents. First, multi-word expressions should be identified: "Identification of multi-word terms is one means of limiting the number of equivalents" (Chodkiewicz et al., 2002, p.349). Again, a professional in the field should review the results in order to create a reliable glossary. "Single-word terms are often more ambiguous than multi-word terms" (Chodkiewicz et al., 2002, p.350), so the context of these terms should be researched in detail.
Another complication in searching for terms is anaphora. A term would be retrieved more frequently if pronouns or deictic expressions had not been used in the texts to make them more cohesive.
The researcher should be aware of several pitfalls concerning term retrieval. Apart from anaphoric reference, synonymous and syntactic variants are also reflected in the corpus results (Chodkiewicz et al., 2002, p.351). Although the meaning is the same, the words may differ completely, even in their roots; this can be mitigated by searching for similar contexts. However, if the word order is rearranged, the collocations and even the terms may not be recognized.
6. Language for Specific Purposes Corpus
A specialized corpus is created for the analysis of a particular field of knowledge and may also be called a Language for Specific Purposes (LSP) corpus. Bowker (1996) regrets the scarcity of LSP corpora: "One reason for the lack of investigation into MR [machine-readable] LSP corpus work might be because it is currently more difficult to find MR corpora dealing with specialized domains than it is to find general-language corpora" (p.306). Since then, many MR and even specialized corpora have been created; in fact, terminologists prefer to compile their own corpora. Both a reference corpus and a documentation corpus are then needed: the former to compare the general language with the specific one, the latter to identify terms and concepts of the subject field. Bowker (1996) declares that "a well-constructed MR LSP corpus could potentially be very valuable as both" (p.306). The final step is to look for node words and extract potential terms. Nowadays, all these steps can be done at once by a corpus program such as the Sketch Engine.
6.1. Real Texts
Linguists across the corpus-linguistics spectrum emphasise that corpus analysis should describe real language use. Nicoletta Calzolari (2007) highlights the practice of compiling corpora out of data used in real-life situations: it is necessary to explore the potential of "robust components (lexicons, grammars, etc.) based on an inventory and description of the whole variety of phenomena occurring in real texts, in the different communicative contexts" (p.116). In her experience, such corpora reveal all kinds of structures (collocations, idioms or multi-word units) and phenomena (acronyms, repetitions or abbreviated styles), which are more important for linguistics and language processing now than ever before (Calzolari, 2007, p.116). She then introduces the demand for creating lexicons of sublanguages: "The study of corpora is also essential for the identification and characterization of sublanguages: when natural language is used in specific domains or communicative contexts" (Calzolari, 2007, p.116). Such lexicons are needed not only for the purposes of Natural Language Processing, but also for glossary making by human translators and interpreters. Arguably, the exploitation of specific sublanguage corpora, including the corpora and glossaries produced for this thesis, advances all the disciplines mentioned above in some way.
6.2. Compilation of LSP Corpus
As explained in the theoretical part, there are several criteria concerning corpus creation. An LSP corpus does not need to be as big as a general corpus, but the questions of balance and representativeness "remain open" (Bowker, 1996, p.318). Arguably, representativeness needs little discussion here, because if the purpose of the corpus is to extract terminology, terminologists use the widest possible range of specific materials.
Language is another important consideration. If the author writes in their native language, the text is free of grammatical mistakes and is likely to be more idiomatic, full of collocations and clearly expressed ideas (Bowker, 1996, p.317). Translations are therefore not preferred and should be used only when original materials are impossible to obtain. As Mihailov and Tommola (2001) note, "many linguists do not consider translations as a source of 'linguistic evidence'" (p.71), because such a corpus may be asymmetrical, as opposed to a symmetrical one, in which it is ensured that the translation and source words are generally interchangeable from the semantic and linguistic viewpoint despite being in different languages. It may be said that the words work on the same paradigmatic level.
The influenced target language is called translationese, because it possesses "features of the translated text more characteristic of the source language" than of the target language (Meyer, 2002, p.39). It is a complex problem involving not only word order but also the functionality of an utterance. According to Meyer (2002), there occur "not just particular syntactic and morphological differences but pragmatic differences as well" (p.39) throughout the text. Although the linguistic community is aware of this abnormality of the target language, it is tolerated and more or less understandable that translations must be used for corpus compilation when parallel materials within the specific subject field are lacking.
In addition to the scarcity of materials, "a terminographical LSP corpus cannot afford to have restrictions on the length of individual texts…because terms can appear anywhere" (Bowker, 1996, p.316), but there should be restrictions concerning authorship. Bowker (1996) differentiates between texts designed for experts and texts designed for lay readers: "A text written by an expert for other experts is linguistically and conceptually quite different from a text written by an expert for layperson" (p.318). It may be assumed that a simpler text is useful for grasping the basic knowledge, while a complex article may be a more valuable source of terminological information: "On a linguistic level, terminographers generally find that learned papers contain formal terms and neologisms that have just been coined by scientists, while popular-science texts may contain more established terms" (Bowker, 1996, p.318). Furthermore, Bowker introduces Ahmad's distinction of text types: instructional, advanced and popularized. Popular-science texts or advertisements belong to the last category, while technical manuals would be instructional texts.
For the sample corpora in this thesis, the selection of texts was limited by a precise task: English-Czech terminology of the climbing and mountaineering domain. Three corpora were created, two in Czech and one in English. They consist of manufacturers' descriptions and technical methodology, so they cover the instructional and popularized types of texts.
6.2.1. Mammut English LSP Corpus
First, the only English corpus is named Mammut_EN, after the Swiss brand of clothing and outdoor equipment (Mammut Sports Group, http://www.mammut.ch/). The English texts were written by Swiss professionals in the field of high-altitude sports; however, mistakes in collocations and spelling sometimes occur. Therefore, the term "Euro-English" (Bowker, 1996, p.318) would be entirely appropriate in this context, because the authors are not native speakers of English, although the level of their English appears advanced. The texts are partly instructional and partly popularized in content, because specific products are described and promoted. Bowker (1996) approves of the necessity for terminographers "to consult manufacturers' guides" (p.318), which is exactly the case here.
Whole collections could not be accessed on the Swiss home page; however, the in-stock offer of a German outlet store (Outdoor Works, http://www.outdoor-works.de/) was downloaded, with only Mammut-brand items retained. The e-shop offers the current collection as well as leftover items from previous summer and winter collections. The major pitfall in downloading the data was the Flash Player on which the pages run: it is not a machine-readable format, only an image or, alternatively, a video. Therefore, a manual downloading procedure was carried out and the data were collected in a single Microsoft Word file. Before the corpus compilation, the text was tagged with the latest version of TreeTagger for English. The total number of tokens in this corpus is 97,561.
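For illustration only, the tagging step could look roughly like the sketch below, which uses the third-party treetaggerwrapper package for Python; the thesis does not state how TreeTagger was actually invoked, and the installation path and input file name are assumptions.

```python
# A minimal sketch of POS-tagging an English text with TreeTagger via the
# treetaggerwrapper package. TAGDIR and the input file name are assumptions;
# this is not necessarily how the corpus described above was tagged.
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG="en", TAGDIR="/opt/treetagger")

with open("mammut_en.txt", encoding="utf-8") as f:
    tagged = tagger.tag_text(f.read())

# Each line has the form "word<TAB>POS<TAB>lemma", the "vertical" format
# commonly loaded into corpus managers such as Sketch Engine.
for line in tagged:
    print(line)
```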

6.2.2. Horolezecká metodika Corpus
he first Czech corp
us

is based on the online companion to the book
Horolezecká metodika
9

written by
Tomáš Kublák
.
The Web pages cite
the
majority of
the book’s content, though some chapters are not available
due to a copyright
protection.

The download of data was not problematic, the URL addresses were inserted
into Sketch Engine and the program extracted texts on its own.
The morphological



8

Outdoor Works, http://www.outdoor
-
works.de/

9

Mountaineering methodology
,
http://www.horolezeckametodika.cz/

-

38
-


markup was done by the
Desamb

for
Czech
.
T
he corpus has
123,626
tokens

and that
makes him the largest of all three corpora