
Charles University in Prague
Faculty of Mathematics and Physics


Master's thesis


Bc. Martin Suchan

Semantics Detection in Partially Structured Sources


Department of Software Engineering
Supervisor: RNDr. Filip Zavoral, Ph.D.
Specialization: Computer Science, System architectures

2010


I would like to thank RNDr. Filip Zavoral, Ph.D. for supervising my master's
thesis, for his tuition and many valuable pieces of advice; thanks also go to
my family for their ongoing support.


I declare that I wrote my thesis independently and exclusively with the use of
the quoted sources. I agree with the lending of my work and with its publication.


In Prague, 1st April 2010

Martin Suchan


Contents

Contents
Introduction
  The Problem
  About this thesis
  About the sample application
1 Semantic analysis
  1.1 Basic theory
  1.2 First look at a sample document
  1.3 Semantic analysis
    1.3.1 Simple pattern lookup
    1.3.2 Statistics-based learning
  1.4 Real-life use of semantic analysis
  1.5 Semantic web
  1.6 Another problem
  1.7 Future of semantic analysis
2 Analysis of partially structured sources
  2.1 Implicit and explicit structure
  2.2 Common signs in implicit structure
  2.3 Problems with the explicit structure
  2.4 Structure of an email message
  2.5 Structure of an HTML page
  2.6 Other common and noticeable data formats
  2.7 The time factor of gathered data
3 Concepts of analysis
  3.1 First attempt: keyword gathering
  3.2 Gather clauses from implicit structure
  3.3 Other methods
4 Performance, time/space demands issues
  4.1 Caching, what to store
  4.2 Too much information?
  4.3 Filtering, evaluating
5 Analysis methods and description
  5.1 Naive approach and first results
  5.2 Better way
  5.3 Another way
  5.4 Utilizing both methods
6 Method comparison and results
  6.1 Relative scores
  6.2 Absolute scores
7 Tables, graphs, document statistics
8 Conclusion
9 References
10 Content of the enclosed software package


Title: Semantics Detection in Partially Structured Sources

Author: Martin Suchan

Department: Department of Software Engineering

Supervisor: RNDr. Filip Zavoral, Ph.D.

Supervisor's e-mail address: zavoral@ksi.mff.cuni.cz

Abstract: The goal of this thesis is to compare approaches to the analysis of
structured data sources such as emails, HTML pages and others. The work focuses
on the practical assessment of common characteristics of these documents, which
can be used for analysis, cataloging and data mining for subsequent use. The
work also includes a sample implementation of a program for cataloging emails
and tracing related data.

Keywords: semantic analysis, data mining




Title (in Czech): Sémantická analýza částečně strukturovaných zdrojů dat

Author: Martin Suchan

Department: Department of Software Engineering

Supervisor: RNDr. Filip Zavoral, Ph.D.

Supervisor's e-mail: zavoral@ksi.mff.cuni.cz

Abstract (translated from Czech): This thesis compares the possibilities of
analysing structured data sources such as e-mails, HTML pages and others. The
work focuses on a practical assessment of the common features of these
documents, which can be used for analysis, cataloguing or knowledge mining for
subsequent use. The work also contains a sample implementation of a program for
cataloguing emails and tracing related data.

Keywords: semantic analysis, data mining






Introduction


In computer science, semantic analysis can have several different meanings. In
linguistic analysis it can be described as the process of relating syntactic
structures from an input stream of text, dividing it into paragraphs, blocks or
clauses and giving meaning to each of these objects. Semantic analysis in
compilers has a similar purpose: to specify and give meaning to the symbols in
a parse tree, finding out which statements are legal and which are bogus (like
assigning a value to a constant). A special kind of analysis called latent
semantic analysis is a technique used in natural language processing for
finding and gathering relations between sets of documents.

Although it sounds like a complicated topic, it is actually present in our
everyday life, and mostly it is performed not by computers but by humans.
People use it in everyday activities like reading books and understanding
written text in general, looking for context, and recognizing new perceptions
based on already learned experience. Although this works for humans in quite a
natural way, it is not that simple to actually implement similar behavior as a
computer algorithm. The human brain is the most sophisticated machine ever
created, and our only way to match such ingenious engineering is through speed
and parallelism.


The Problem

The idea of this thesis basically started with a problem which can be handled
quite successfully using a manual approach, but which could perhaps be
automated using some clever algorithms combining semantic analysis, data
mining, classification and feedback learning. The problem can be described
this way:

Let's say we have a specific data source from which we are receiving partially
formatted documents. These documents may or may not contain valuable
information, and we need to gather that information and store it in some way.
The information has a specific semi-structured format, but the format is not
fixed and may occur in many slightly modified forms. That is the first part of
the problem; now comes the hard part. We need to use the gathered information
in context and update it according to current changes, like dates, places,
persons, etc.

In practical use an algorithm solving this problem might become quite useful.
The most typical use could be monitoring changes about a given subject on the
Internet, or on some kind of often-updated data source.






About this thesis

The goal of this thesis is a wider analysis of possible approaches to solving
the mentioned problem: using several semantic analysis methods on partially
structured sources to gather valuable data and keep them updated throughout
the process. For this purpose we introduce several more or less applicable
methods, each using a slightly different approach (basic keyword lookup,
statistics-based analysis, autonomous learning based on previous valuable
matches, Bayes-like evaluation of found clauses, etc.). These methods have
been implemented in the included sample application, tested on a medium-sized
data source and evaluated by their effectiveness, robustness and level of
customizability.

This work also includes some interesting statistical data acquired during the
testing and programming phase, such as an analysis of the typical email body
structure, the structure of web site content…

[stub]

About the sample application

Noticeable

part of this thesis is also the enclosed sample Application which
implements he
re described algorithms and methods and also solves the
mentioned Problem in best currently possible way.

This application was developed in C#/Microsoft .NET Framework; it’s ea
sily
customizable and expandable with further possible al
go
rithm updates and could
be used in real life use. [stub]






1 Semantic analysis

1.1 Basic theory

Before we start, we should establish some basic terms and facts and answer
several important questions about the organization of our work set.

First of all, this thesis studies analysis on a large set of input data. By a
large set we mean hundreds or thousands of documents, each a few kilobytes in
size. Each of those documents is written as simple text. This text can contain
any basic kind of data: sentences, words, abbreviations, numbers, dates, even
simply formatted tables, paragraphs, bullets or numbered lists. The documents
we are working on should have meaning, in the sense that they should not be
randomly generated.

The analysis is also focused specifically on documents and texts written in
international English. It is likely that the mentioned approaches and
algorithms would work with the same success on any other language written in
Latin letters. Using non-standard letter systems like Cyrillic, Chinese or
katakana could lead to different results and the need for different
approaches. The encoding used in the whole application is presumed to be
UTF-8.

Some questions for the start:

The target area of this thesis is a better understanding of partially
structured documents: the way they are formed, and whether there are some
kinds of unwritten rules which could be used for our benefit. Another question
is how to find a structure in a document, or whether it is possible to pick
out documents with a specific structure from a group containing multiple types
of documents. Is there some kind of measure for comparing how structured a
document is? Also, if we identify some kind of semantic similarity across a
group of documents, could we benefit from such knowledge? And what about
mining data across different types of data sources: is there some kind of
implicit and explicit structure for a specific kind of data format?






1.2 First look at a sample document

Let's take a sample partially structured document. For this purpose we'll use
emails from the dbworld [ref] mailing list.

    International Conference on Multimedia Computing and Information Technology (MCIT10)

    We invite you to participate in the International Conference on Multimedia Computing and
    Information Technology (MCIT10) conducted by the Department of Computer Science,
    College of Sciences, University of Sharjah, U.A.E.
    MCIT will be an international forum, aiming to promote the Information Technology and
    Multimedia Computing by bringing together leading researchers, academics, and practitioners
    to present the state-of-the-art advancements in their field. MCIT seeks original papers
    that will advance the state of the art in the field, and foster an increased interaction
    and cooperation between academics, business and engineering communities.

    The scope covers various areas in Information and Communications Technologies. Topics of
    interest include, but are not limited to:
    ---- Information Systems and Applications
    ---- Multimedia Systems and Processing
    ---- Network & Wireless Technology and Applications
    ---- Pattern Recognition and Artificial Intelligence

    The conference website provides necessary information along the vision and the scope of
    the conference.
    www.sharjah.ac.ae/mcit
    --------------------------------------------------------------------------
    Important Dates:
    -- Submission of full papers     October 10th, 2009
    -- Notification of Acceptance    November 20th, 2009
    -- Camera Ready Papers Due       December 13th 2009
    -- Main Conference               March 2-4, 2010
    _______________________________________________
    Please do not post msgs that are not relevant to the database community at large. Go to
    www.cs.wisc.edu/dbworld for guidelines and posting forms.
    To unsubscribe, go to https://lists.cs.wisc.edu/mailman/listinfo/dbworld

Fig. 1 Sample email document

Fig. 1 shows a sample document from the mentioned mailing list.

From a human point of view it's an invitation to a conference. It includes
some facts about the topic and specialization of the event, some important
dates and also a web address for further information. It also contains a link
to the source mailing list.

From a computer point of view it's a string of chars. That's basically the
first thing we know. Using simple document analysis we could gather some other
important information: it contains 219 words, 28 lines and 3 groups of lines,
also several empty lines, 4 dates, 2 web addresses…

As we can see, the computer point of view is fairly limited. It tells us
nothing about the useful content, only a "few numbers". This is actually the
main problem when using algorithms and artificial intelligence for document
analysis. Without a better understanding of the actual data, the only thing we
have is just a bunch of statistics.
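The surface statistics described above (word counts, lines, groups of lines,
addresses) are easy to compute. The following is a minimal illustrative sketch
in Python (the thesis application itself is written in C#; the function name
`basic_stats` and the regular expressions are assumptions, not taken from the
application):

```python
import re

def basic_stats(text: str) -> dict:
    """Collect the kind of surface statistics a program sees in a raw document."""
    lines = text.splitlines()
    words = text.split()
    # A "group of lines" is a run of non-empty lines separated by blank lines.
    groups = [g for g in re.split(r"\n\s*\n", text) if g.strip()]
    urls = re.findall(r"(?:https?://|www\.)\S+", text)
    return {
        "words": len(words),
        "lines": len(lines),
        "groups": len(groups),
        "empty_lines": sum(1 for line in lines if not line.strip()),
        "urls": len(urls),
    }

sample = "Call for papers\n\nSee www.example.org\nDeadline: October 10th, 2009\n"
print(basic_stats(sample))
# {'words': 9, 'lines': 4, 'groups': 2, 'empty_lines': 1, 'urls': 1}
```

Exactly as the text argues, such numbers say nothing about what the document
means; they only describe its shape.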


1.3 Semantic analysis

For a better understanding of the actual data content we need to use semantic
analysis. But what does that mean? What's the starting point and what is the
goal? Speaking of semantic analysis, there are actually several ways of
approaching this problem.


1.3.1 Simple pattern lookup

The first approach we could probably think of is teaching the program to look
for specific patterns. These patterns are based on actually valuable
information recognized by humans. For example, we know that dates could be
important for us, and we know that dates have a specific recognizable format.
So let's create automatic rules for finding dates and the facts related to
them. Another example could be the gathering of web addresses: they are
recognizable because of their specific format. These addresses can simply be
gathered using some regular expression search.

We could go on and look for bullets, numbered lists, headers, simple boxes
made from special chars and so on. This approach is fast and simple, although
exhaustive. We need to specify rules for each kind of important data, and we
need to be sure that such rules are well-formed enough.

The major problem of this exhaustive method is the ratio of false positive and
false negative results. A great example is looking for the best regular
expression for finding web addresses: we could look for addresses of the
format "http://(.*/)+", but it won't catch addresses starting with https:// or
just www.web.com. We could use a more generic and complicated expression, but
it would also catch a lot of non-web strings. The same problem applies to
searching for dates and related texts (what is a found date actually related
to?).
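The strict-versus-loose trade-off can be shown directly in code. This is a
minimal illustrative sketch (the patterns are examples for this discussion,
not the rules used in the thesis application): the strict pattern misses
`https://` and bare `www.` addresses, while the looser one has to be watched
for over-matching:

```python
import re

# Strict rule: only plain http:// addresses, as in the "http://(.*/)+" example.
STRICT = re.compile(r"http://[\w./-]+")
# Looser rule: also catches https:// and bare www. addresses.
LOOSE = re.compile(r"(?:https?://|www\.)[\w./-]+")

text = ("See http://example.org/cfp, https://lists.cs.wisc.edu/mailman "
        "and www.sharjah.ac.ae/mcit for details.")

print(STRICT.findall(text))  # ['http://example.org/cfp'] -- two addresses missed
print(LOOSE.findall(text))   # all three addresses found
```

Widening the character class further (e.g. allowing `?`, `=`, `&`) would catch
more real URLs but would also start swallowing trailing punctuation and
non-web strings, which is exactly the false-positive problem described above.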


1.3.2 Statistics-based learning

Another interesting approach could be the automatic creation of rules based on
data previously selected by the user. The prerequisite of such an algorithm is
a really thorough statistical analysis of the input document: the number of
words per line or group of lines, the occurrence of specific letters/numbers
in parts of the document, etc.

The user then selects important parts of the document, and the algorithm takes
such selections into account, storing information about the position, length,
contained words and so on. After a reasonable number of learned facts it can
start gathering its own fragments and sentences from new document sources.

A similar method is also used in today's spam filters. A so-called Bayes
filter gathers words from all emails and gives each a specific mark about the
probability that the word came from a spam message or from a real message.
[find a more precise description elsewhere]
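The Bayes-filter idea mentioned above can be sketched in a few lines. This is
a minimal naive-Bayes-style word scorer, not the evaluation used in the thesis
application; the training sentences and the smoothing constant `alpha` are
illustrative assumptions:

```python
import math
from collections import Counter

def train(docs):
    """Count word occurrences over a class of example documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

def score(text, relevant, other, alpha=1.0):
    """Log-likelihood ratio with add-alpha smoothing.

    Positive result: text looks more like the 'relevant' class.
    """
    total_r = sum(relevant.values())
    total_o = sum(other.values())
    vocab = len(set(relevant) | set(other))
    s = 0.0
    for w in text.lower().split():
        p_r = (relevant[w] + alpha) / (total_r + alpha * vocab)
        p_o = (other[w] + alpha) / (total_o + alpha * vocab)
        s += math.log(p_r / p_o)
    return s

relevant = train(["submission deadline october", "conference deadline papers"])
other = train(["lunch menu today", "weather report today"])

print(score("deadline for papers", relevant, other) > 0)  # looks relevant
print(score("today lunch", relevant, other) > 0)          # looks irrelevant
```

Once user selections feed the `relevant` counts, the same scoring can rank
fragments of new documents, which is the learning loop this section describes.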




1.4 Real-life use of semantic analysis

It should be mentioned where the mechanism of structured document analysis is
used nowadays. We don't need to look hard to find several examples.

[screenshot: news.google.com]

news.google.com is a well-known example. It gathers data from news articles
all around the Internet, aggregates them and shows them as a unified news
page.

Another use of semantic analysis is in mechanisms that gather news from
semi-structured webs and turn them into RSS feeds or mail notifications. It's
a handy feature if somebody wants to be informed about new articles but the
web in question does not provide any update mechanism.

Probably the major use of semantic analysis is at the heart of the Internet:
the search engines. Building a search engine is an enormously complicated
task. It's not just about gathering words from web content and showing them
upon request. [add a description of how it really works]




1.5

Semantic web

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.








1.6

Another problem

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.


1.7

Future of semantic analysis

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.






2

Analysis of partially structured sources


We've already described several basic approaches to gathering useful data from
data sources: the way a human gathers data, and how to implement some of those
methods as computer algorithms. Before actually implementing such algorithms,
we should better understand the structure of some common data sources, find
their shared properties and specific problems, and develop mechanisms to
overcome those issues.


2.1

Implicit and explicit structure

The first important fact when analysing documents is that documents of the
same type contain a similar implicit structure. In other words, if there are,
for example, two received emails, it doesn't matter what information they
contain or what their purpose is: they are somehow similar. The same holds for
HTML pages, wiki pages, RTF documents and most other common sources.

The structure or format specific to all documents of a certain type is called
the implicit document format. For emails the format is defined by the email
RFC document [ref]. For HTML pages an XML-like structure is typical, though
mostly invalid.



2.2

Common signs in implicit structure

The purpose of the implicit structure was the necessity of standardizing the
ways of communication and data processing. For example, the structure of an
email body is well known, and therefore one can be sure that a well-formed
message sent from any machine will be validly received on any other computer.
[write down why adherence to formats and standards is a necessity]


2.3

Problems with the explicit structure

The explicit document structure is a rather complicated topic. The structure
of documents is based on common, mostly unwritten standards shared by a
similar group of users. These standards differ between countries, but they can
also differ between nations, between native speakers of different languages,
even between companies.

The classic example could be a formal letter. The structure of a formal letter
in our country could be really different from the formal letters used, for
example, in the U.S. or in China. The position of the title, subject, author's
name, important facts and other items can simply differ.




2.4 Structure of an email message

The implicit structure of an email message is specified in the RFC document
[ref] and also in other, newer documents specifying additional features of
emails, like attachments, encoding, header elements etc.

[outline: header with strong implicit structure, body with no typical
structure, mention of MIME, attachments, encoding / character set problems]
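The contrast between the strongly structured header and the free-form body can
be illustrated with standard tooling. A minimal sketch using Python's standard
`email` module (the thesis application is written in C#; the message content
here is a made-up example):

```python
from email import message_from_string

raw = """\
From: announce@example.org
To: dbworld@cs.wisc.edu
Subject: CFP: MCIT10
Date: Mon, 5 Oct 2009 10:00:00 +0000

We invite you to participate in the conference.
"""

msg = message_from_string(raw)
# The header is strongly structured: named fields with RFC-defined syntax,
# so "Subject", "From" and "Date" can be read directly.
print(msg["Subject"])           # CFP: MCIT10
# The body is free text: any structure in it is only implicit.
print(msg.get_payload().strip())
```

Everything the RFC-defined header gives us comes for free; the hard analysis
work of this thesis starts only below the blank line, in the body.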



2.5 Structure of an HTML page

The structure of HTML pages is well known (the html, body, title, h1 etc.
tags), but even so it is actually quite complicated to use the HTML structure
to our advantage.

HTML originates from the W3C consortium. HTML is defined as a recommendation,
and up to today there have been about a dozen different HTML/XHTML
recommendations. [excerpt from the wiki on the HTML format and its history]

The main problem when accessing and parsing (X)HTML is the non-validity of
most HTML pages. Some of the problems are just small deviations from the
standard; some are major issues like non-well-formed XML, missing or malformed
tags, specific tags or attributes used for specific browsers and so on.

In the rare case of a fully valid page there is a strong implicit structure we
could use for our analysis. The beauty of HTML is the presence of
self-describing tags like title, h1 for the main header, table, h2, meta tags
and so on. Using these tags we can gather really important and useful
information without the need for specific data mining algorithms. Even when
the HTML is invalid, we can still use these tags for navigation across the
document and hopefully find valuable data.

With the use of new "Web 2.0" features like dynamic loading, JavaScript
processing, AJAX, Flash/Silverlight elements or other non-default parts, it
sometimes becomes hard or even impossible to gather the requested data
directly from the source code. Other problems when accessing a web address are
the possibility of web redirection, static or dynamic, the invalidity of the
given address, or the necessity of using an encrypted channel to access the
requested site.

As we can see, even though the HTML structure is user friendly and may help us
gather much important data, there are still many problems which can prevent us
from gathering useful data, or even any data, on the requested sites.
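Collecting the self-describing tags even from broken HTML can be sketched with
a tolerant, event-based parser. A minimal illustrative example using Python's
standard `html.parser`, which does not require well-formed XML (the class name
and the chosen tag set are assumptions for this sketch, not the thesis
application's code):

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the text of self-describing tags (title, h1, h2) even in broken HTML."""
    INTERESTING = {"title", "h1", "h2"}

    def __init__(self):
        super().__init__()
        self.current = None   # interesting tag we are currently inside
        self.found = {}       # tag name -> list of collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.INTERESTING:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.setdefault(self.current, []).append(data.strip())

# Note the unclosed <p> and missing </body></html>: the parser still recovers
# the title and the main header.
page = "<html><title>MCIT10</title><body><h1>Call for Papers</h1><p>Deadline soon"
c = TagCollector()
c.feed(page)
print(c.found)  # {'title': ['MCIT10'], 'h1': ['Call for Papers']}
```

This is exactly the navigation-by-tags idea from the text: even without valid
XML, the named tags still point at the most valuable fragments of the page.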








2.6 Other common and noticeable data formats

The previously mentioned data formats, HTML and mail, are probably the most
widely used formats in everyday life, but they are not the only ones.

One of the most widely used formats for data storage is plain XML. Each XML
document is defined syntactically by a specific DTD, or by a more thorough XSD
document specification. XML documents in this sense are not partially but
fully structured. Semantic analysis of such documents is often unnecessary,
because all data in them are stored in a well-defined format specific to the
given XML document type.

Another data format widely used today, and also based on XML, is the RSS
format [ref spec]. RSS is a format for short news messages, often based on
whole messages from some internet site. It is mostly used for notifying about
changes or just for abbreviating full-scale content. The implicit structure of
RSS feeds is well defined; the explicit structure can be similar to emails,
but a feed item may also contain a whole HTML or other kind of document. RSS
feeds are by their nature sometimes even interchangeable with email messages.

Another interesting data format is the Wikipedia [ref link] source. The
purpose of the Wikipedia website is to provide free, up-to-date information
about all possible topics. The interesting fact is that all pages there are
created by common people and can be updated by anybody. The good side is that
anybody can correct invalid information or even add more related information;
the bad side is that no one can guarantee the validity of such information.
The implicit structure of wiki pages is quite strong and can be used for our
benefit.

The last noticeable format mentioned here is the Twitter [ref link] feed.
Although it is quite a new thing, it has proved to be a valuable source of
real-time information. Twitter messages are 140-character messages containing
text-based data, mostly with references to some webs, images or news sources.
Although the implicit structure of Twitter messages is really weak (we can use
only references to other sources, the '@' sign, or references to specific
topics, the '#' sign), it can be used as an invaluable source of the most
up-to-date information.
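Because RSS is well-defined XML, its implicit structure can be consumed
directly, with none of the tolerance tricks HTML requires. A minimal sketch
using Python's standard `xml.etree` module (the feed content is a made-up
example):

```python
import xml.etree.ElementTree as ET

feed = """<rss version="2.0"><channel>
  <title>Example news</title>
  <item>
    <title>First article</title>
    <link>http://example.org/1</link>
    <pubDate>Mon, 05 Oct 2009 10:00:00 GMT</pubDate>
  </item>
</channel></rss>"""

root = ET.fromstring(feed)
# Every item carries its title, link and date in fixed, named elements,
# so no pattern lookup or learning is needed to extract them.
items = [(i.findtext("title"), i.findtext("link")) for i in root.iter("item")]
print(items)  # [('First article', 'http://example.org/1')]
```

The contrast with section 2.5 is the point: the same "title of an article"
fact that HTML hides behind invalid markup is here a guaranteed, named
element.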


2.7 The time factor of gathered data

When analyzing data from different sources there is a problem we need to face:
the time relevance factor. The time relevance factor is, in other words, the
fact that some documents are stable while others are periodically changing.

The stable documents are in this case documents which are released once and
never changed after that. Mails are a good example, as are RSS feeds or
Twitter messages.

The time-changing documents are mostly documents accessible online, like web
pages. Web pages can change dramatically over just a small period of time, so
monitoring of new versions is necessary.

Speaking of the time factor, we can also find some kinds of hybrids. The best
example is Wikipedia. Every page can be seen as a dynamic data source,
changing from time to time, but a page can also be seen as the newest document
in a row of stable documents: all previous versions are still accessible.
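In the simplest case, monitoring a time-changing source reduces to comparing
snapshots. A minimal sketch using content hashes (an illustrative technique,
not the monitoring mechanism of the thesis application):

```python
import hashlib

def fingerprint(content: str) -> str:
    """A stable document keeps a stable fingerprint; any edit changes it."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

old = fingerprint("Conference deadline: October 10th, 2009")
new = fingerprint("Conference deadline: October 24th, 2009")
print(old != new)  # True: the document changed and should be re-analysed
```

Stable sources (mails, feed items) are fingerprinted once; time-changing pages
are re-fetched and re-analysed only when the fingerprint differs, which keeps
the time factor from forcing a full re-analysis on every visit.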






3

Concepts of analysis

3.1 First attempt: keyword gathering

[stub]

Describe the first attempts when writing the app.


Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.


3.2

Gather clauses from implicit structure

[stub]

Use simple fact gathering from the implicit data structure


Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.


3.3

Other methods

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.






4

Performance, time/space
demands issues

4.1

Caching, what to store

[stub]

What to preserve for further running of the algorithms…


Lorem ipsum
dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia
deserunt mollit anim id est laborum.


4.2

Too much information?

[stub]

How to tell what information to gather and what to omit? User-driven
evaluation?


Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.

4.3

Filtering, evaluating

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor
incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat.
Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat
nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa
qui officia deserunt mollit anim id est laborum.





5 Analysis methods and description

5.1 Naive approach and first results

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


5.2 A better way

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


5.3 Another way

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


5.4 Utilizing both methods

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.



6 Method comparison and results

6.1 Relative scores

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


6.2 Absolute scores

[stub]

Lorem ipsum dolor sit amet, consectetur adipisici elit, sed eiusmod tempor incidunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea commodi consequat. Quis aute iure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.


7 Tables, graphs, document statistics


[stub]


Lots of tables about the structure of emails (length per mail, lines per mail, words per mail; distribution of words, empty lines, and blocks; distribution of numbers, special characters, links, and dates… )
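As a sketch of how such per-mail statistics could be collected, the following Python function covers a few of the measurements listed above. The regular expressions for links and dates are deliberately rough illustrative assumptions, not the thesis' actual counters:

```python
import re

def mail_stats(raw_mail: str) -> dict:
    """Compute simple structural statistics for one email body.

    A minimal sketch of the per-mail measurements listed above
    (length, lines, words, links, dates); the field names and
    patterns are illustrative only.
    """
    lines = raw_mail.splitlines()
    words = raw_mail.split()
    return {
        "length": len(raw_mail),  # characters per mail
        "lines": len(lines),
        "empty_lines": sum(1 for l in lines if not l.strip()),
        "words": len(words),
        "links": len(re.findall(r"https?://\S+", raw_mail)),
        # very rough date pattern, e.g. 1.4.2010 or 2010-04-01
        "dates": len(re.findall(r"\b\d{1,4}[.\-/]\d{1,2}[.\-/]\d{1,4}\b", raw_mail)),
    }

sample = "Call for papers\n\nDeadline: 2010-04-01\nSee http://example.org/cfp\n"
stats = mail_stats(sample)
```

Aggregating such records over a whole mailbox yields exactly the kind of distributions the tables in this chapter would present.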

8 Conclusion


[stub]


TBA






9 References


dbworld email conference

docs.google.com, other current uses

email RFC

other email-related RFCs

HTML W3C recommendation




[stub]

Example references from the author's Bc. thesis:


1. RFC 1319 – specification of the MD2 function: http://www.ietf.org/rfc/rfc1319.txt

2. RadioGatún – specification: http://radiogatun.noekeon.org/

3. S. Bono, M. Green, A. Stubblefield, A. Juels, A. Rubin, M. Szydlo (2005). "Security Analysis of a Cryptographically-Enabled RFID Device", Security '05.

4. J. Daemen, C. S. K. Clapp (1998). "Fast Hashing and Stream Encryption with PANAMA", Fast Software Encryption: 5th International Workshop, FSE'98, Paris, France, LNCS 1372.

5. Dobbertin, H. (1997). "RIPEMD with Two-Round Compress Function is Not Collision-Free". Journal of Cryptology, Vol 10, Number 1, 1997, pp. 51-70.

6. Klíma, V. (2006). "Nový koncept hašovacích funkcí SNMAC s využitím speciální blokové šifry a konstrukcí NMAC/HMAC" [A new concept of hash functions SNMAC using a special block cipher and NMAC/HMAC constructions].

7. X. Wang, D. F. (2004). "Collisions for Hash Functions MD4, MD5, HAVAL-128 and RIPEMD". Cryptology ePrint Archive, Report 2004/199.

8. X. Wang, H. Y. (2005). "How to Break MD5 and Other Hash Functions". Advances in Cryptology - Eurocrypt'2005, LNCS 3494, pp. 19-35.

9. X. Wang, X. L. (2005). "Cryptanalysis of the Hash Functions MD4 and RIPEMD". Advances in Cryptology - Eurocrypt'2005, LNCS 3494, pp. 1-18.

10. X. Wang, Y. L. (2005). "Finding Collisions in the Full SHA-1".






10 Content of the enclosed software package


Application ‘Seeqer’ for cataloging emails

The main content of the included software package is the application Seeqer. The main purpose of this application is to work as a POP3 mail client: it downloads emails, gathers user-chosen important data from the received emails, catalogs this data in logical units, and also tracks updates to the relevant data in linked web sites.
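The download-and-catalog pipeline can be sketched as follows. This is only an illustration of the POP3 step, assuming an SSL server and a trivially small logical unit (sender, subject, body); it is not Seeqer's actual code, and the host and credentials are placeholders:

```python
import email
import poplib
from email.message import Message

def to_logical_unit(msg: Message) -> dict:
    """Reduce a parsed message to a small record for cataloging.

    The chosen fields are an illustrative guess at the
    'user-chosen important data'; Seeqer's real logical units
    are user-configurable.
    """
    body = ""
    if not msg.is_multipart():
        payload = msg.get_payload(decode=True)
        if payload is not None:
            body = payload.decode(errors="replace")
    return {
        "from": msg.get("From", ""),
        "subject": msg.get("Subject", ""),
        "body": body,
    }

def fetch_all(host: str, user: str, password: str) -> list[dict]:
    """Download every message over POP3 and catalog it."""
    conn = poplib.POP3_SSL(host)  # hypothetical server
    conn.user(user)
    conn.pass_(password)
    units = []
    count, _ = conn.stat()
    for i in range(1, count + 1):
        _, mail_lines, _ = conn.retr(i)
        msg = email.message_from_bytes(b"\r\n".join(mail_lines))
        units.append(to_logical_unit(msg))
    conn.quit()
    return units
```

In the real application, the resulting units would additionally be filtered and linked to the web sites that are traced for updates.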

The process of data mining and cataloging is largely automatic, but it supports user-defined overrides and fine-grained settings for each logical unit.



[more information about the final application]



The included software package also contains complete programmer's documentation generated from source-code comments, as well as a thorough description of the application infrastructure. The program is licensed under the GNU GPL and is free to use or modify.



Master’s thesis as a PDF document

An electronic version of this master’s thesis is also included in the software package.