Digame Initial Spec






Final Year Project
Definition Document


Digame!

Department Of Information Technology


This document details the specifications for a final year project undertaken with the
Department of Information Technology. The project aims to create a piece of software
that will enable linguists to analyse the orthography and phonology of languages by
searching through corpora to ascertain the frequency of certain strings within a text.


2010/2011


Chris Hurley


Electrical & Electronic Engineering

College of Engineering and Informatics

National University of Ireland, Galway


Submitted in fulfilment of the requirements for the

B.E. (Hons) Degree in
Electronic and Computer Engineering


Project Timeframe: September 2010 to April 2011



Academic Supervisor: Pat Byrne, Dept of Information Technology



Initial Spec Submission date: 15th October


Digame! Initial spec


Chris Hurley


Page
2


Contents

Project Description
    1.1 Initial Requirements
Project Requirements
    2.1 Technologies
    2.2 Algorithms
    2.3 Implementation
    2.4 Research
    2.5 Input Formats
Deliverables
    3.1 Web-based GUI Program
    3.2 Milestones
        3.2.1 College Set Milestones
        3.2.2 Personal Milestones
    3.3 Benefits of the System
Wish list
    4.1 Different Character Sets
    4.2 Greater Variety of Input Formats
Appendix A: Table of References






Project Description

1.1 Initial Requirements

The project aims to produce a set of tools to aid a linguist in studying the orthography and
phonology of a given language. The tools should be easily accessible to non-technical users
via a web-based GUI.


The final product will establish the frequency and distribution of characters within a given
corpus in order to determine the characteristics of the language in question. The user will
enter a set of characters; the system will then search for combinations of those characters
and output the position of each combination found within its respective word. The system
will also allow for (language-dependent) fuzzy searching based on sound occurrences. This
will be done using the String::Approx module in Perl, in conjunction with its amatch
function [1].
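The idea behind approximate matching can be illustrated with a minimal sketch. It is written in Python rather than Perl and does not use String::Approx itself; it simply compares whole words by Levenshtein edit distance. The function names edit_distance and fuzzy_matches, and the max_distance threshold, are hypothetical and chosen for illustration only. The sound-based, language-dependent rules described above are not modelled here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_matches(pattern: str, words: list[str], max_distance: int = 1) -> list[str]:
    """Return the words within max_distance edits of pattern (illustrative helper)."""
    return [w for w in words if edit_distance(pattern, w) <= max_distance]
```

For instance, fuzzy_matches("colour", ["color", "colour", "collar"]) keeps "color" and "colour" but rejects "collar", which is two edits away.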

The system will be able to accommodate character sets other than that of English, including
those with diacritical marks (accents), for example Irish or German. It may also include
character sets completely different from that of English, such as Arabic or Japanese.

The system will be distributed on the web with a user-friendly GUI. This will suit the
non-technical target market.







Project Requirements

2.1 Technologies


Certain technologies were researched for potential use in the project. As the project is
mostly software based, the main technological research was done in the area of programming
languages. It was determined that there was a need for a language suitable for file searching,
one for building the GUI and one for distributing the system on the web. Research will be
done into integrating these languages with each other. Also, some languages may be suitable for
more than one function (e.g. Java may be used for both the GUI and the web distribution).


Perl Programming Language

Perl is a highly capable, feature-rich programming language with over 22 years of
development. [2]


Perl is a high-level, dynamic programming language developed by Larry Wall. It was
originally developed to make report processing easier. It borrows features from other
languages such as C, and from UNIX tools. It provides powerful text-processing commands,
allowing for easy searching and manipulation of text files [3]. For example, from UNIX it
takes the “grep” function (and variants of it), which will be of great use for character
matching in this project. This ability with character strings makes the language suitable
for use in this project.


Lisp was another language that was researched for use in this project. It is an old
programming language designed for LISt Processing (it is the second-oldest high-level
language still in common use, after Fortran). It is used in quite a few computational
linguistics projects. It was not chosen for the project, as the student already had a
knowledge of Perl and of its suitability for use in the project.



Java Programming Language


Java is another very powerful programming language. It has many uses, from gaming to
3D image viewing. Its strong support for GUI programming means that it will be well suited
to the user interface aspect of the project.



Java can also be used to write web servlets, which could be used to deploy the system
online.



HTML


HTML is a simple markup language used in basic web design. It will be useful for setting up
the project status web page and posting updates on the project. It may also prove useful, if
necessary, for implementing the system online.



2.2 Algorithms


At this stage, not much research has gone into the algorithm to be used. However, given
Perl’s text-processing facilities, finding a suitable one should not be too difficult a task.


The client [4] has also suggested a couple of potential algorithms. The first is to read
sequentially through a document and pick out the combinations that occur. This is a simple
method, but it needs a large corpus (and therefore a long run time) to encounter all possible
combinations of the set. The second is to establish rules for which clusters can appear in
each particular position. For example, cl or thr will rarely be found in the final position
of a word. The client then defines sets of improbable positions for a set of sound clusters.
Ignoring these positions will make the system more complicated to implement, but should
reduce its run time.


2.3 Implementation


The implementation that will most likely be used is as follows:

Designate the set of input characters as a vector, e.g. V = (a,e,i,o,u,y,w). The system will
then scan the text character by character, checking whether each character is in the set. If
it is, the system will check whether the next character is also in the set, and will repeat
this until it finds a character that does not belong to the set. When this occurs it will
output the string of characters detected. Because output only occurs once a character outside
the set is found, each combination is recorded only once, i.e. eau is output as “eau”, not as
“e, a, u, ea, eau”.
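The scan described above can be sketched as follows. This is a minimal illustration in Python rather than the Perl the project intends to use; the function name find_combinations is hypothetical, and lower-casing the text before scanning is an assumption not stated in the specification.

```python
def find_combinations(text: str, char_set: set[str]) -> list[str]:
    """Return every maximal run of consecutive characters drawn from char_set."""
    combinations = []
    run = []
    for ch in text.lower():  # assumption: matching is case-insensitive
        if ch in char_set:
            run.append(ch)   # still inside a run, keep collecting
        elif run:
            # A character outside the set ends the run: output it once.
            combinations.append("".join(run))
            run = []
    if run:                  # the text may end while a run is still open
        combinations.append("".join(run))
    return combinations
```

With V = set("aeiouyw"), find_combinations("beau", V) yields ["eau"] only. Note that under this sketch a single character from the set standing alone is also reported, as a run of length one.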


In order to ascertain the frequency of each combination, a counter can simply be created and
incremented for each combination found (e.g. eau_count). With this data, it will be easy to
determine the relative frequency of each combination (by adding up the total occurrences of
all combinations and dividing each individual count by the total). The data would then be
output as a graph listing all the combinations in order of frequency, together with their
relative frequencies.
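The counting step can be sketched in a few lines. This is an illustration only, assuming the combinations have already been extracted into a list; the function name relative_frequencies is hypothetical, and a single counter keyed by combination stands in for the per-combination variables (such as eau_count) mentioned above.

```python
from collections import Counter


def relative_frequencies(combinations: list[str]) -> list[tuple[str, int, float]]:
    """Return (combination, count, relative frequency) rows, most frequent first."""
    counts = Counter(combinations)   # one counter entry per distinct combination
    total = sum(counts.values())     # total occurrences of all combinations
    return [(combo, count, count / total)
            for combo, count in counts.most_common()]
```

The rows come back already ordered by frequency, which is the order the graph is intended to use.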




The system must also determine the position of the combination within the word. This
can be done using a simple count of the letters before and after the combination until a
whitespace character is found.
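A minimal sketch of this position count is given below. It assumes that words are separated by whitespace and that punctuation has already been stripped (otherwise punctuation marks would be counted as letters); the function name combination_positions and the tuple layout are hypothetical.

```python
def combination_positions(text: str, char_set: set[str]) -> list[tuple[str, str, int, int]]:
    """For each combination, return (combination, word, letters before, letters after)."""
    results = []
    for word in text.lower().split():       # whitespace delimits words
        start = None
        # A trailing space acts as a sentinel so a run at the end of a word is closed.
        for i, ch in enumerate(word + " "):
            if ch in char_set:
                if start is None:
                    start = i               # a new combination begins here
            elif start is not None:
                before = start              # letters preceding the combination
                after = len(word) - i       # letters following the combination
                results.append((word[start:i], word, before, after))
                start = None
    return results
```

A combination with zero letters before it is in the initial position, and one with zero letters after it is in the final position.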

With this functionality the system should then be able to estimate which combinations of
sets are possible in specific parts of words, by comparing two sets of input and outputting
the position of each combination. For example, given two sets T = (p,t,c/k,b,d,g,f,th) and
R = (r,l), it would check for a letter of either set and then check whether it is followed by
a letter of the other set. The system could then output the positions of these new
combinations and the user could infer where combinations are possible. For example, if
combinations of the form TR are possible in the initial position, but combinations of the
form TT, RR or RT are not, the output would show some combinations of the form TR in the
initial position but none of the form TT, RR or RT. From this the user can conclude that it
is impossible (or, assuming a large enough corpus, highly unlikely) for these latter
combinations to occur in that position.
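The two-set check can be sketched as follows. This is an illustration under stated assumptions: set members may be longer than one letter (such as th), the longest member is preferred when several match, and c and k are listed as separate members. The names match_unit and cluster_positions are hypothetical.

```python
def match_unit(word: str, index: int, units: set[str]):
    """Return the longest member of units that starts at word[index], or None."""
    for unit in sorted(units, key=len, reverse=True):  # try "th" before "t"
        if word.startswith(unit, index):
            return unit
    return None


def cluster_positions(text: str, first_set: set[str], second_set: set[str]):
    """Return (cluster, word, start index) for each first_set + second_set pair."""
    found = []
    for word in text.lower().split():
        for i in range(len(word)):
            first = match_unit(word, i, first_set)
            if first is None:
                continue
            # Look for a member of the other set immediately after the first unit.
            second = match_unit(word, i + len(first), second_set)
            if second is not None:
                found.append((first + second, word, i))
    return found
```

Calling cluster_positions(text, T, R) lists combinations of the form TR, and swapping the arguments lists those of the form RT; a start index of 0 marks the initial position.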



2.4 Research


During the research phase of the project a number of existing products and projects were
reviewed.


The most relevant of these was a paper on Large-Scale Persistent Object Systems for
Corpus Linguistics and Information Retrieval [5]. This system analyses the similarities
between words by using a large real-world text as a corpus. It searches through this corpus
and, using a bootstrapping method, matches the contexts of different words by examining the
words surrounding them. It outputs the words, grouped by similarity, in a hierarchical tree.





This paper is relevant as it uses a digestive approach to ascertain information about a
language, similar to the methods this project will use, i.e. it searches through a text and
infers meaning based solely on information available in the text. It differs in that it is
focused on the meaning of words rather than their sounds.


Another relevant paper was found on Visualizing the Performance of Computational
Linguistics Algorithms [6]. This paper described some statistical tools that could
potentially be used to help visualize the performance of the system. These include:

- Confusion Matrices: A matrix that lines up actual results against predicted results.
- ROC curves: A graphical plot of the sensitivity, or true positive rate, vs. false positive rate.
- Precision and Recall: A measure of exactness and completeness.


This paper was a good example of displaying statistics from a computational linguistics
system.
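As an illustration of how two of these measures relate to the cells of a confusion matrix, a minimal sketch follows. The function name precision_recall is hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than something taken from the paper.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Compute precision (exactness) and recall (completeness) from counts."""
    predicted = true_positives + false_positives  # everything the system reported
    actual = true_positives + false_negatives     # everything it should have reported
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall
```

For example, a system that reports 10 matches of which 8 are correct, and misses 4 genuine matches, has a precision of 0.8 and a recall of 8/12.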


A number of other less relevant texts were also reviewed, including opinion matching and
blog tracking software used to ascertain people’s moods from what they write in blogs etc.
There is a lot of this sort of opinion-related research in computational linguistics.

2.5 Input Formats


The final system should cater for a number of input formats. The primary input will be
via files, but it should also cater for input directly from the keyboard in order to
accommodate situations where file input is impossible (e.g. when the corpus exists only in
written form, or is in an unsupported file format).


File formats initially supported will be .txt, .doc and .rtf, but research will be done on
including more formats, including .pdf files. Consideration will also be given to the fact
that the input may be in Unicode or ASCII. This could be problematic, as both would have to
be catered for: Unicode is more popular online and supports more character sets than just
English [7], but ASCII is tried and tested.
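One way of catering for both encodings is sketched below. It relies on the fact that any pure ASCII file is also valid UTF-8, so a single UTF-8 decode handles both cases. The function name decode_corpus is hypothetical, and the Latin-1 fallback and the NFC normalisation step (so that an accented letter is always a single character) are assumptions made for illustration.

```python
import unicodedata


def decode_corpus(raw: bytes) -> str:
    """Decode uploaded bytes to text, accepting ASCII or UTF-8 input."""
    try:
        text = raw.decode("utf-8")    # ASCII input decodes here unchanged
    except UnicodeDecodeError:
        text = raw.decode("latin-1")  # assumed fallback for legacy 8-bit files
    # Compose base letter + combining accent into one code point where possible,
    # so that characters with diacritical marks compare consistently.
    return unicodedata.normalize("NFC", text)
```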







Deliverables

3.1 Web-based GUI Program


The final system will be a simple GUI program distributed via the web. The web aspect
should be straightforward, as the site will consist only of a front end and no database will
be required for its operation. It should simply take input and output results directly, with
no need to store large amounts of data. The site will be hosted on the NUIG I.T. website.


The GUI will be a simple, easy-to-use platform. It should consist of a text box for the
user to input their character sets, a dropdown box to select the input format and an upload
field to upload the text files. It could also include a text box so that the user can cut and
paste their corpus directly into the site. This would help accommodate unsupported input
formats. The final solution should look similar to a more sophisticated version of the GUI
program shown in [8], but with more suitable input fields (e.g. Character set, Format…).




3.2 Milestones


The college has given a set of deadlines to ensure that the project moves as swiftly as
possible. On top of these, it was decided to add additional personal milestones. These were
added to further ensure that enough work is being done and that a suitable conclusion to the
project is reached.



3.2.1 College Set Milestones

College deadlines are as follows; they describe the main areas on which the project will
be assessed.


Friday 15th October 2010: Project Definition Document

This document outlines all work to be completed on the project. It gives an outline of the
system to be created as agreed upon by the project supervisor.



Friday 22nd October 2010: Project Website


A simple initial website that will be used to track the project’s progress. Updates will be
made regarding achievements and deadlines met.


Friday 25th March 2011: Project Final Report


A thesis-like summary of all work done on the project, including research, technical
aspects, results and conclusions.


Week beginning 28th March 2011: Project Bench Demonstration


A demonstration of the final working system including a verbal report.


Week beginning 4th April 2011: Viva Voce


Oral examination detailing the features and functionality of the system.

3.2.2 Personal Milestones


On top of the college-appointed milestones, it was agreed that a working prototype should
be developed by Christmas. Over the Christmas holidays, a meeting will be set up with the
client in order to refine and improve the system according to their needs.

3.3 Benefits of the System


The system will greatly aid linguistic research. The position of the sound clusters within
a word is critical to the stress in the language. With this tool, the stress in a word can be
easily identified, which will aid in determining pronunciation based solely on the letters
used.





Wish list

There are certain things which would be very useful in this project, but may not be possible
within the timeframe given. They are detailed below in case the scope of the project changes
and they become possible. They may also be used for future work on the project if necessary.


4.1 Different Character Sets


It would benefit the reusability of the final product to be able to deal with character sets
that differ greatly from the English one, e.g. Japanese or Arabic. This idea was mentioned
previously in this report but may not be possible. The implementation should be easy enough
once the initial system is working for the English character set (and those variants which
include diacritical marks). However, it is necessary to ensure that a working system is
developed for the English character set (as requested) before moving on to such ideas; thus
it is not included in the main part of the design specification.

4.2 Greater Variety of Input Formats


Research should be done into including more input formats. The ones provided should be
sufficient, but the more formats the software can support, the better the overall system will
be. The main additional input format required would be .pdf files (as mentioned in the main
specification).






Appendix A


Table of References

1. http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm

2. http://www.perl.org

3. http://en.wikipedia.org/wiki/Perl

4. Cormac Anderson, Department of Celtic, Adam Mickiewicz University (Poznan),
   Cormac.anderson@ireland.com

5. http://www.jcdl.org/archived-conf-sites/dl94/paper/futrelle.html?searchterm=linguistics
   - Futrelle and Zhang

6. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4035760
   - Eick, Mauger and Ratner

7. http://news.cnet.com/8301-13580_3-9936329-39.html

8. http://netbeans.org/kb/docs/java/gui-automatic-i18n.html