LASI Product Description V2

Lab 1


LASI Description
1


Running head: LAB 1


LASI DESCRIPTION



LASI Product Description V2

Linguistic Analysis for Subject Identification

Determining common themes across multiple documents

CS411 Red Team

Dustin Patrick

3/18/2013




Contents


List of Figures
1 Introduction
2 LASI Product Description
  2.1 Key Product Features and Capabilities
  2.2 Major Functional Components
3 Identification of Case Study
4 LASI Prototype Description
  4.1 Major Prototype Functional Components
  4.2 Prototype Features and Capabilities
Glossary






List of Figures

Figure 1. Top Results view (tornado chart)
Figure 2. Word relationships view
Figure 3. Textual output view
Figure 4. Major functional components of the real-world product
Figure 5. Current AID process
Figure 6. Contribution of LASI to the AID process
Figure 7. Differences between the real-world solution and the prototype





1. Introduction

LASI, which stands for Linguistic Analysis for Subject Identification, is a tool designed to assist researchers in finding common themes across a document or multiple documents. It relies on an algorithm that identifies themes based on a word’s weight in a document or documents, as well as its frequency. A word’s weight is determined by the relationships it has with other words in the sentences in which it is used. Linguistic analysis in this case refers to the abstract understanding of an author’s intention for a document based on a syntactic and semantic evaluation of the text of the document(s). LASI is a decision support tool, not a decision making tool. This means that it will not output a single theme for a document, but a list of possible themes, along with their likelihood of accurately reflecting the theme of a document. The user will still need to analyze the document to determine which of the themes is most accurate. A theme in this case refers to the main idea of an entire document. In normal literary analysis it is determined by answering the following questions about a document: who, what, when, where, why, and how. LASI takes a different approach, one that is less subjective, to finding the theme or themes. Finding the theme of a document or multiple documents is important because understanding the main idea of what others say and write is the foundation of human communication. LASI does not reinvent the wheel, but it makes the process of finding the themes of large documents significantly less time-consuming.

The problem with manually determining themes is that it is difficult to do so across multiple documents in an objective, consistent, and timely manner. Determining themes is something that most people do automatically, but two people often read the same work and derive different themes from it. LASI seeks to resolve that by using an algorithmic approach to determining themes rather than a subjective one. It can also be difficult for people to consistently gather the same theme across multiple documents. A person can read something more than once and get a different main idea each time because of the subjective nature of the way people read. LASI does not follow this subjective approach: when a document is analyzed, it will display the same results each time. The process of manually determining the theme of a document is also incredibly time-consuming. It requires multiple read-throughs of the same documents to make sure that the author’s meaning is not lost, and it can take hours or days to determine this information. LASI cuts this process down to a matter of minutes.

2. Product Description

LASI as a real-world product will be a computer application designed to run on the Windows platform. It will be written in C# and will be a stand-alone, client-side desktop application. It will be efficient enough to run on a high-end laptop, and it will be built as an engine that plugins can extend with additional functionality. As a real-world application, LASI will be able to determine themes across as many documents as can be provided and can provide cross-document analysis to determine single themes across multiple documents. It will give the user the ability to create custom dictionaries to increase accuracy and the statistical likelihood of determining a theme. It will also incorporate multi-threading to shorten processing time and will be optimized to limit resource usage. It will have a minimalistic user interface to avoid confusing the user. It will parse documents in each of the following formats: PDF, DOC, DOCX, PPT, PPTX, and TXT. LASI will generate themes across all documents, as well as for each individual document.

LASI will also be an open-source project released under the GNU Lesser General Public License (LGPL).



LASI will be useful for many different audiences. Teachers can use it to identify plagiarism, as well as to assist with the grading of papers; the teacher just needs to run the papers through LASI and wait for the output. Students will find this tool useful as well because it will make reading through books and papers for research significantly faster. They can also use LASI to make sure that what they are submitting does not contain exactly the same information as something someone else is submitting. Researchers will be able to use LASI because their job consists of reading countless documents thoroughly, often in fields with which they may not be familiar. Lawyers and contractors can use LASI to read through legal documents and identify exactly what is being agreed to in a specific contract.

2.1. Key Product Features and Capabilities

LASI will determine themes across multiple documents by using semantic and syntactic evaluation of a set of documents. It will determine each word’s part of speech, as well as that word’s location in a sentence (whether it is part of a subject, verb, or object). It will then assign a weight based on a count of that word with that part of speech, a generic count of the word over the whole document, and where the word falls in a specific document. It will evaluate the weight of all words in a document and then output the results in multiple views.
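
The three weighting inputs named above can be illustrated with a minimal Python sketch. This is not the LASI algorithm: the function name, the three factor values, and the "earliness" position score are assumptions chosen only to show how the counts might be combined.

```python
from collections import Counter


def word_weights(tagged, pos_factor=1.0, word_factor=0.5, position_factor=0.25):
    """Weigh each (word, part of speech) pair in one document.

    tagged is a list of (word, part_of_speech) pairs in document order.
    The factor values are hypothetical and not taken from LASI.
    """
    total = len(tagged)
    # Count of the word with this specific part of speech.
    pair_counts = Counter((word.lower(), pos) for word, pos in tagged)
    # Generic count of the word regardless of part of speech.
    word_counts = Counter(word.lower() for word, _ in tagged)
    # Where the pair first appears in the document.
    first_seen = {}
    for index, (word, pos) in enumerate(tagged):
        first_seen.setdefault((word.lower(), pos), index)

    weights = {}
    for key, pair_count in pair_counts.items():
        word, _ = key
        # Assumption: pairs that first appear earlier score slightly higher.
        earliness = 1.0 - first_seen[key] / total
        weights[key] = (pos_factor * pair_count
                        + word_factor * word_counts[word]
                        + position_factor * earliness)
    return weights
```

For a four-token input such as "Budget limits budget growth", the noun "budget" receives the highest weight because it leads on all three terms.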

The real-world product will be able to accept multiple file types as input and will parse them all the same way. Additional documents can be added to a project after analysis has begun, and the user will be able to add custom dictionaries to the project to increase efficiency and accuracy. LASI will not make any assumptions about content by default, but the user can specify a type of document (for example strategic documents, literary documents, or scientific reports, among many others), which will allow LASI to parse documents of that type more accurately.



There will be multiple levels of output that LASI will display to the user after it completes. The user interface will be able to display the top results, which is essentially a weighted word count displayed as a tornado chart. A visual representation of this can be found in Figure 1 below. The Top Results page with the weighted output can be displayed for all documents or for individual documents.
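
As a rough text-only illustration of the Top Results idea, the sketch below renders weighted words as horizontal bars sorted in descending order. The function name, the bar character, and the default sizes are assumptions; the actual LASI view is graphical.

```python
def tornado_chart(weights, top=5, width=30):
    """Return text lines for the highest-weighted words, largest first.

    weights maps a word to a positive numeric weight. The layout is an
    assumption made for illustration only.
    """
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:top]
    if not ranked:
        return []
    largest = ranked[0][1]
    label_width = max(len(word) for word, _ in ranked)
    lines = []
    for word, weight in ranked:
        # Scale each bar relative to the largest weight, at least one mark.
        bar = "#" * max(1, round(width * weight / largest))
        lines.append(f"{word.rjust(label_width)} | {bar} {weight:g}")
    return lines
```

Calling `print("\n".join(tornado_chart(weights)))` on the output of a weighting step gives a quick console approximation of the chart.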

Figure 1.


There is also a word relationships page which will display an individual document, as well as a color-coded representation of the words in the document. The colors correspond to a word’s part of speech. This is demonstrated in Figure 2.





Figure 2.


The last view of the output would be an in-depth textual representation of each word, its weight, its count, and its part of speech. A very basic implementation of this is shown in Figure 3. This output view would be the most informative, but it will likely be more difficult to read than the other two output views.





Figure 3.


The results of LASI will also need to be exportable for use as presentation materials and visual aids and for further analysis. Exporting the results in multiple file formats also provides a level of convenience for the user. LASI will be able to export results in PDF, XLS, and several image formats so that anyone using the project can make the results portable.

2.2. Major Functional Components

LASI will be efficient enough to run on the virtual machine provided by the university. However, for the real-world product, there will be some hardware requirements. It will require a quad-core or better Intel Core CPU and 8 GB or more of DDR3 SDRAM. It will also require the user to provide secondary storage space. The exact amount of storage required will be specified at a later time.

The major functional components of the real-world product can be seen in Figure 4.



Figure 4.


There will also be several software components of LASI. The external tools LASI will use to assist with development and functionality include the SharpNLP part-of-speech tagger. This is an open-source C# port of the OpenNLP tools, which were developed in Java. Because SharpNLP is written in C#, the team considers it more secure than its Java counterpart, and it is also easier to incorporate into LASI, which is also written in C#. LASI will also use WordNet, a lexical database compiled at Princeton that serves as a thesaurus, containing a very large number of words along with their known synonyms and antonyms. It will be incredibly useful for binding synonyms together to improve the accuracy of results. LASI also takes advantage of a doc2x tool that converts Microsoft Word 1997-2003 document files (.doc) into a more manageable format, Microsoft Word 2007 document files (.docx). This conversion from .doc to .docx is necessary because .docx files are actually a compressed format that contains an easily parsed XML file holding all the text of a document.
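
The point about .docx being a compressed container can be shown with a short Python sketch that uses only the standard library. A .docx file is a ZIP archive whose main text lives in word/document.xml; the function name below is an assumption, and the sketch ignores tables, headers, footnotes, and other parts that a full parser would handle.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the non-empty paragraphs of a .docx as plain strings.

    source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each w:p element is a paragraph; its text is split across w:t runs.
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

The older binary .doc format has no such structure, which is why converting it to .docx first keeps the parsing step simple.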

Several key data structures will be incorporated into LASI. Each word and phrase will be stored in a C# List, which is comparable to a vector in C++. Each word will be assigned a type and then added to a list during the initial parsing of a document. Words will be assigned a part of speech before being added to a list, and phrases will be assigned a location in a sentence, meaning that a phrase will be determined to be part of either the subject or the object of a sentence. These lists will be traversable word by word. Each individual word will also link to another list of associated words.
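
A minimal sketch of these structures, written in Python rather than the C# that LASI itself will use, might look like the following. The class names, field names, and the `link` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursion through links
class Word:
    text: str
    part_of_speech: str
    # Each word links to its own list of associated words.
    associated: List["Word"] = field(default_factory=list, repr=False)


@dataclass(eq=False)
class Phrase:
    words: List[Word]
    # Location in the sentence: "subject" or "object" once determined.
    role: Optional[str] = None


def link(first, second):
    """Record a two-way association between two words (hypothetical helper)."""
    first.associated.append(second)
    second.associated.append(first)
```

A document then becomes a plain list of `Word` and `Phrase` objects that can be traversed in order, with each word's `associated` list available for the binding step.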


The underlying algorithm of LASI will consist of several key parts: the element binding process, the weighting process, and the high-level analysis process. An element is either a word or a phrase. There will be both direct binding and indirect binding of elements to other elements. Direct binding of word elements will consist of binding nouns and verbs together, adverbs to verbs, adjectives to nouns, and determiners to nouns. Direct binding of phrase elements will consist of binding phrases to subjects, binding phrases to objects, and then breaking phrases down and binding them to each word inside them. Indirect binding will consist of binding synonyms together so that they count as a single noun, and binding pronouns to the nouns they refer to.
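
Part of the direct word binding described above can be sketched as follows. The nearest-neighbor rules here are a deliberately naive assumption, not the binding logic LASI will use: adjectives and determiners attach to the next noun, and adverbs attach to the closest verb. The tag names are also assumed.

```python
def bind_modifiers(tagged):
    """Return (modifier, target) pairs for one sentence.

    tagged is a list of (word, tag) pairs using the assumed tags
    "DET", "ADJ", "NOUN", "ADV", and "VERB".
    """
    bindings = []
    for i, (word, tag) in enumerate(tagged):
        if tag in ("ADJ", "DET"):
            # Bind to the first noun that follows the modifier.
            target = next((w for w, t in tagged[i + 1:] if t == "NOUN"), None)
        elif tag == "ADV":
            # Bind to the verb with the smallest distance in tokens.
            verbs = [(abs(j - i), w) for j, (w, t) in enumerate(tagged) if t == "VERB"]
            target = min(verbs)[1] if verbs else None
        else:
            continue
        if target is not None:
            bindings.append((word, target))
    return bindings
```

A real implementation would work from the phrase structure produced by the tagger rather than from token distance, which fails on sentences with intervening clauses.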

The weighting process will be handled by two separate processes. The first is a raw weighting system, which analyzes simple word frequency, word and phrase frequency with part of speech and location in a sentence considered, and synonym-aware word frequency. This will provide the foundation for the second process, in which LASI modifies weights via comparison with other words. This relative comparison will count the relationships between words and modify the weight of each word and phrase accordingly. It will measure the lexical distance between associated words in the document set. It will also produce a pronoun-aware word frequency to increase the word counts of the associated nouns.
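
Synonym-aware and pronoun-aware frequency can be sketched together in a few lines. Two assumptions stand in for real components: a small hand-written synonym map plays the role of a WordNet lookup, and each pronoun is credited to the most recent noun, which is a crude substitute for actual pronoun resolution.

```python
from collections import Counter


def aware_frequency(tagged, synonyms):
    """Count words so that synonyms and pronouns credit a single noun.

    tagged is a list of (word, tag) pairs with assumed tags such as
    "NOUN", "VERB", and "PRON". synonyms maps a word to its canonical form.
    """
    counts = Counter()
    last_noun = None
    for word, tag in tagged:
        token = word.lower()
        if tag == "PRON":
            # Assumption: a pronoun refers to the most recent noun.
            if last_noun is not None:
                counts[last_noun] += 1
            continue
        canonical = synonyms.get(token, token)
        counts[canonical] += 1
        if tag == "NOUN":
            last_noun = canonical
    return counts
```

With "firm" mapped to "company", a passage that mentions "company", then "it", then "firm" counts three occurrences of "company" instead of one.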

It will also provide a high-level analysis of each element’s weight and then order the highest-weighted words and phrases to form a list of coherent sentences. The algorithm will also determine the optimal overlap of weighting metrics to produce the most accurate results. It will then employ a process for resolving conflicts between highly weighted words.

3. Identification of Case Study

Dr. Patrick Hester and Dr. Tom Meyers work for an organization called NCSOSE, which stands for the National Center for Systems of Systems Engineering. NCSOSE works with organizations and companies to improve workflow and optimize efficiency. It also generates and provides training materials to these organizations.


At NCSOSE, Dr. Hester and Dr. Meyers currently use a process known as the AID process to identify problem statements from groups of strategic documents. AID stands for Assessment Improvement Design. The Assessment phase of this process is what LASI will improve upon. This phase involves analyzing multiple strategic documents, in a range of domains in which the analyst may not be an expert, based on several criteria and the identification of key components. This requires extensive, unnecessary research and is an incredibly time-consuming and in-depth process.

Dr. Hester and Dr. Meyers then take the analyzed data from these documents and formulate a concise and accurate problem statement. This problem statement is used to identify organizational issues and then to offer the company consulting with Dr. Hester and Dr. Meyers ways to optimize its operations and improve efficiency. Figure 5 outlines the current AID process.

Figure 5.


There is currently a bottleneck at the Assessment part of the process that LASI would be able to assist with. LASI will eliminate much of the thorough analysis that Dr. Hester performs on the documentation provided and will output the same results to him quickly. All he will need to do is analyze the results of LASI and determine the most appropriate theme from them. The objective nature of LASI will remove much of the guesswork and give him a strong means of defending his findings. LASI will provide him with a group of themes in an objective, consistent, and timely manner. This will resolve the bottleneck in his system and allow him to interact with his clients while the “Assessment” step in AID is being processed.

Figure 6 demonstrates what LASI will contribute to this.



Figure 6.


4. Prototype Description

Because the algorithm is complex while both the user interface and the hierarchy of users are simple, the main part of the real-world product that needs to be scaled down for the prototype is the algorithm. The algorithm’s output will be approximately the same. It will still accept multiple documents and search for common themes across them. It will still interact with the UI the same way. The results will still be displayed graphically on the results page. The difference is that the algorithm will be a bit less versatile, as it will make a few more assumptions about the documents through which it searches. The only two acceptable input types will be .doc and .docx files. The prototype will also limit the number of documents that it can analyze, and it will limit interpretable documents to 10 pages or fewer. Lastly, rather than mapping every part of speech to related parts of speech, LASI will focus on subjects, verbs, direct objects, and indirect objects.
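
The input limits described above could be enforced with a small pre-check such as the sketch below. The function name and message wording are assumptions, and because the document does not state the maximum number of documents, that limit is left as a required parameter rather than given a value.

```python
from pathlib import PurePath

ALLOWED_EXTENSIONS = {".doc", ".docx"}  # the two accepted input types
MAX_PAGES = 10                          # page limit stated for the prototype


def check_inputs(documents, max_documents):
    """Return a list of problems found in the proposed input set.

    documents is a list of (filename, page_count) pairs. max_documents is
    supplied by the caller because the source does not specify it.
    """
    problems = []
    if len(documents) > max_documents:
        problems.append(f"too many documents: {len(documents)} > {max_documents}")
    for name, pages in documents:
        if PurePath(name).suffix.lower() not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type")
        if pages > MAX_PAGES:
            problems.append(f"{name}: {pages} pages exceeds {MAX_PAGES}")
    return problems
```

An empty result means the input set is within the prototype's stated limits.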

The prototype’s limitation on the subject matter of the documents stems from the fact that NCSOSE deals exclusively with strategic documents. Being specific about the nature of the documents parsed by LASI will also make the results more accurate for NCSOSE. For instance, LASI can search for keywords like “Mission,” “Vision,” and “Goals” and then place increased emphasis on the content below those words to create a more accurate weighting algorithm. This also saves the algorithm some work because it leads to fewer passes over the document.
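
One way to picture this keyword emphasis is the sketch below, which boosts the counts of words that appear under a "Mission", "Vision", or "Goals" heading. The heading detection (a line consisting of a single keyword), the blank line that ends a section, and the boost value are all assumptions made for illustration.

```python
from collections import Counter

SECTION_KEYWORDS = {"mission", "vision", "goals"}


def emphasized_counts(lines, boost=2.0):
    """Count words, weighting those below a keyword heading more heavily."""
    counts = Counter()
    factor = 1.0
    for line in lines:
        words = [w.strip(".,:;").lower() for w in line.split()]
        words = [w for w in words if w]
        if len(words) == 1 and words[0] in SECTION_KEYWORDS:
            # A keyword on its own line starts an emphasized section.
            factor = boost
            continue
        if not words:
            # Assumption: a blank line ends the emphasized section.
            factor = 1.0
            continue
        for word in words:
            counts[word] += factor
    return counts
```

Because the emphasis is applied while the words are being counted, the document only needs to be read once for this step.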

The decision to limit the input types to .doc and .docx files is due to the fact that finding a suitable mechanism for parsing other file types, some of which require optical character recognition, would require learning an entire API and would increase the likelihood of parsing errors. Optical character recognition would introduce the possibility of LASI reading the same word or phrase differently based on the input file extension. A homogeneous input format for parsing resolves that. Also, .docx files are incredibly simple to parse, and .doc files can be converted easily to .docx.

Limiting the number of documents in the prototype will result in a simplified output. If the prototype allowed too many documents to be parsed, it would need to contain some sort of algorithm to remove irrelevant subject/object/verb associations. Time constraints prevent this from being feasible, so the input pool must remain small; otherwise, the results will be confusing and difficult to read. Another reason to limit the number of input documents is that this prototype will be a multi-threaded application. The more documents that are allowed as input, the more memory will need to be devoted to this tool.

Limiting the number of pages in a document accomplishes essentially the same thing as limiting the number of documents. The program will need to rely heavily on its weighting algorithm to determine the themes of these documents. Not assuming a fixed maximum length increases the amount of required testing exponentially, and without limiting input there will simply not be enough time to debug the prototype in the 18 weeks available.

Lastly, the decision to limit the prototype to identifying subjects, verbs, direct objects, and indirect objects was made because themes can still be derived from these at their core, while this approach produces fewer incorrect associations than going deeper and analyzing every single part of speech. If the weighting algorithm focuses on subject/object relationships, it can determine valid themes with an increased statistical likelihood of accuracy, and it will not need to go back through and remove false associations. This change will also remove the need to search for synonyms individually, as they can be determined not just by verb associations, but by the associations with the other parts of the sentence.
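
To illustrate how candidate themes might be derived from subject, verb, and object alone, the sketch below counts identical triples across sentences and reports each one's share of the total. It assumes the roles have already been identified by an earlier step, and the use of relative frequency as a stand-in for likelihood is an assumption, not the LASI scoring method.

```python
from collections import Counter


def rank_themes(sentences):
    """Rank (subject, verb, object) triples by how often they recur.

    sentences is a list of dicts with "subject", "verb", and "object"
    keys. Sentences missing any of the three roles are skipped.
    """
    triples = Counter()
    for sentence in sentences:
        roles = [sentence.get(key) for key in ("subject", "verb", "object")]
        if all(roles):
            triples[tuple(role.lower() for role in roles)] += 1
    total = sum(triples.values())
    # Each candidate theme is paired with its share of all counted triples.
    return [(triple, count / total) for triple, count in triples.most_common()]
```

The result is an ordered list of candidate themes rather than a single answer, in keeping with LASI's role as a decision support tool.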

4.1. Major Prototype Functional Components

Course requirements, as well as the need to limit the amount of testing, created the need for a homogeneous hardware platform. This means a number of changes are needed to the hardware platform, the algorithm, and what the prototype will accept as input. These changes are mainly intended to decrease the time required for testing, as well as to decrease potential resource usage. The specific changes are outlined below.




One of the requirements for the semester is that the project run on the virtual machine provided to each group by the CS department. As a result, LASI will be developed and tested on this virtual machine. However, there may be hardware limitations on this virtual machine which prevent it from running the code optimally. That being said, the virtual machine contains 8 GB of virtual RAM and a quad-core Intel Core CPU.

The software will differ from the real-world product in that the prototype will keep the graphical interface more or less the same but will be missing some of the underlying complexity of the algorithm. The GUI will still be able to save and load past analyses, select new documents, and display results, but it will not be able to add documents during analysis.

The prototype will be able to convert DOC and DOCX files to a usable format, but it will not be able to handle PPT, PPTX, and PDF files. The reason for this is that optical character recognition is incredibly difficult and not accurate enough for what we are trying to do. Also, PPT and PPTX files can contain a plethora of different content formats, most of which are not traversable by a tool that parses by sentence.

The algorithm will still contain part-of-speech tagging and a simplified weighting algorithm that focuses on subject-object relationships, rather than the robust relationships between individual words. It will still bind phrases to words and determine whether a phrase is a subject or an object of a sentence. It will also still bind pronouns and synonyms to nouns to increase accuracy.

4.2. Prototype Features and Capabilities

Because of the time constraint this semester, we will need to scale back our original product from a marketable real-world product to a prototype that is missing some functionality. The real-world product would contain everything listed in the product description section of this paper. The prototype will need to make a few assumptions about the nature of the documents being analyzed, and it will also need to be a bit less robust. This will result in decreased accuracy, but it also provides more time to implement a solution that meets all of the requirements set by NCSOSE and can be created in the time frame that was set up last semester.

Certain functionality must be eliminated from the prototype. It will need to limit the type of input documents to the DOC, DOCX, and TXT file formats. The file length must be limited as well; the exact length of the files will be determined when the prototype reaches the testing stage. The number of files that can be input will also need to be limited for the sake of testing the accuracy of the output generated by LASI. Limiting the number of input files will also decrease the resource usage of the program. The prototype will not provide a visual representation of the logic of LASI, as that would require modifying the output format. The LASI prototype will exclude scanned text recognition because incorporating and testing an optical character recognition tool would be more time-consuming than it would be worth. Including it would also result in a decrease in the accuracy of the results. Certain optional refinement tools must also be removed from the prototype, including user-added items like dictionaries and keywords, as well as content assumptions, because implementing such features would require months of testing. There is also a reduction in the number of times that LASI will search through documents in an effort to improve load time and decrease testing. Figure 7 is a visual representation of the differences between the real-world solution and our prototype.



Figure 7.







Glossary of Terms

Theme: The subject-object-verb relationships that LASI is attempting to generate from the input set.

LASI: Linguistic Analysis for Subject Identification.

Parser: Takes in DOC and DOCX files and converts them to TXT files.

WordNet: The lexical database, compiled at Princeton, that provides our thesaurus.

Phrase: A group of words standing together as a conceptual unit, typically forming a new component.

Analysis: Detailed examination of the elements or structure of something, typically as a basis for interpretation.

Linguistic Analysis: The process of gathering information about a document’s content from the language of that document.

Tag: A label, or the act of attaching a label, that specifies the role (such as part of speech or location) of a selected element in a document.

Document: A document herein refers to a formally written, expository paper which expounds, via a declarative approach, on a relatively quantifiable issue, goal, or area of research.

Word Weight: A numeric value, associated with each syntactically and lexically unique word in a written work, which indicates the relative significance of that word.

Tornado chart: A visualization similar to a horizontal bar graph, representing the relative frequency or significance of elements, sorted in descending order by magnitude.

Head word: The locally distinct word within a phrase which, by its syntactic associations, determines the syntactic category of the phrase itself.

Word Binding: The process of binding a part of speech to a word.

SharpNLP: C# natural language processing tool used to parse text and tag parts of speech.

Tagged Word Object: A word that has an associated part of speech.

Optical Character Recognition: Conversion of scanned images to text.

Tagged Set: A group of words whose part of speech and location in a sentence have been identified by our parser.

Lexer: A piece of our parsing tool that isolates each word, along with its part of speech and location in a sentence, into machine-readable tokens. These are stored as elements in an XML file.

Syntactic Analysis: A form of linguistic analysis that focuses on grammar in sentences and identifies themes based on sentence structure and formatting. Unlike semantic analysis, it identifies key words based on their location in the sentence, rather than their overall meaning throughout the document.

Subject Identification: In the narrow sense, the process of identifying the main actor in a sentence. In a broader sense, the word subject is synonymous with the theme of a document, so subject identification is the process of determining the subjects, or themes, of a document or documents.

Part of Speech Tagger: Software utility that associates words with their parts of speech (e.g. noun, verb) in a sentence.

Semantic Analysis: Relating the syntactical structure of words to their language-independent meanings.

A.I.D. Process: Assessment Improvement Design. A process that provides a quantitative and qualitative basis to identify problems and determine the feasibility of solutions.

Strategic Document: A document produced by a client that defines their goals, visions, and missions.