Text Analysis for picture/movie generation


Professors in charge:

Prof. Dr. Ing. Stefan Trausan

As. Drd. Ing. Costin Chiru


January 2012



Architecture
Execution flow
Testing
Project management
Semantic matching with Wordnet 2.0
Recognized phrases
Position management
Dimension management
Performance considerations
Comparison with similar projects
Conclusion and possible improvements
Results
References


To solve this problem we use the following software. Our main programming language is Java and our development environment is NetBeans. All the images have been retrieved through MATLAB. This retrieval includes the extraction of each image and of a companion file in which the image is annotated in XML format. To convert these XML files into Java objects we used the XOM package. For processing the images we use the standard AWT classes. Our ontology editor is Protégé 4.1 and our RDF engine is OWLIM Lite. The upper ontology that we are using is [...]. The web server of our project is Apache Tomcat 7.0.

It is important to mention that we are documenting our project with the Google Code wiki; the project web page is: [...]. Finally, we are versioning with SVN. The final webpage can be found at: [...]

Two projects are provided. One is NLPainterWebsite and the other is LabelMeImport, which imports data directly from LabelMe XML files. A bug in NetBeans prevented us from assigning more than 256 MB of Java heap memory to stand-alone programs inside the NLPainterWebsite project, which is why we separated the two.

The semantic repository is handled by OWLRepo. It can execute queries with a simple template system we implemented (which substitutes the XXX strings present inside queries with a parameter), but the new CustomQuery compiler we made allows for more sophisticated and useful constructions. The main class for parsing text and generating an image is ImageGenerator. It contains the findRealImage() method, which issues one big SPARQL query to the semantic repository. There are two entity Java classes, Entity and Animal. They help decide which constraints to put in the SPARQL query. Many files contain main functions, but only as quick tests for the file they reside in. There is only one page for the website, index.jsp.

Instructions on how to load the data and manage the project can be found in the Google Code wiki.
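
The simple template system mentioned above can be pictured as follows. This is a minimal sketch, assuming that every XXX occurrence is replaced by the same single parameter; the class name QueryTemplate, the method fill and the sample query are hypothetical and are not taken from the OWLRepo code.

```java
// Hypothetical sketch of an XXX-placeholder template; not the actual OWLRepo code.
final class QueryTemplate {
    private static final String PLACEHOLDER = "XXX";

    // Replaces every occurrence of the placeholder with the given parameter.
    static String fill(String template, String parameter) {
        if (template == null || parameter == null) {
            throw new IllegalArgumentException("template and parameter must not be null");
        }
        // String.replace performs a literal (non-regex) replacement.
        return template.replace(PLACEHOLDER, parameter);
    }

    public static void main(String[] args) {
        // The predicate and resource URIs below are invented for the example.
        String template = "SELECT ?image WHERE { ?image <http://example.org/depicts> XXX }";
        System.out.println(fill(template, "<http://example.org/lion>"));
    }
}
```

A template of this kind can only vary one value per query, which is consistent with the need for a more flexible compiler such as CustomQuery.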

The main goal of this stage has been to have a first version of our project, called Text Analysis for Picture Generation. Until now we have described this project through the previous 4 milestones. We have mainly worked on 6 processes. The next diagram describes these processes:


Done (100%). We mainly deal with animals, but because we put a lot of information in the database, we can also run queries on other subjects not related to animals, in particular landscapes.

Obtain pictures:

Done (100%). We already have more than 70,000 shapes and more than 4000 pictures of animals. Generally speaking, our pictures show animals, landscapes and cities (these are the general topics of the pictures in LabelMe).

Obtain text:

Done (100%). We already implemented the class that obtains the text, and we modified it to match the goals of our project.

Create ontology and data:

Done (100%). We already have the ontology. The data knowledge base is all [...]

Done (100%)

Each time that we write something in the web page we obtain a query.

Semantic Web: Done (100%). We already have the webpage. It is already working, and the user can enter the text that they want to convert to a picture. The results are shown in this report.

Image processing: Done (100%). We match the picture with the text, and using the tags of the pictures we are able to find the correct picture. We think this part can be improved in future work: we could take only the shape of the animals and not the complete picture. Still, we are satisfied with the results.

Image representing the text:

Done (100%). This is the final state. Our final version can show something, but we consider the results still limited because we can still obtain more pictures in order to provide an answer for more queries. For example, we don't have a picture where a dog is playing with a cat, so when we type this the result is an image saying that we don't have that picture. We can improve this if we have more pictures in our database.


Execution flow

First, POS tagging is performed and the Action, Entity, Figure and Path structures are created. Then the program performs semantic matching to determine the best Java class to host each entity. Currently only the Animal and Entity classes are supported. In the next phase one big SPARQL query is issued, containing the statements that represent the constraints present in the user's input text. A project like this is broad in scope, and we soon discovered that SPARQL doesn't provide the flexibility we need. So we decided to implement a small query compiler called CustomQuery to produce the needed query according to decisions taken by the Java classes Animal and Entity. Their job is to produce a SPARQL fragment to be inserted in the custom query. Animal derives from Entity and scans for actions like

animal is in a <place>

a carnivore animal is flying

Entity has a more general task and interprets positional phrases like

a door is to the left of a bed

Two modes of execution are supported: a fast one suited for website usage and a slow one for offline computations. The fast one does not calculate all the hyponyms of the entities but only the possible immediate synsets of a word. Obviously it returns far fewer results than the slow mode. To keep execution time bounded, in both cases only the first matching picture found is returned. Note that the query execution time is currently limited to 30 seconds in any case, after which the website returns an "I don't know" message.
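
The division of work between Entity and Animal described above can be sketched as follows. This is only an illustration under stated assumptions: the class names SketchEntity and SketchAnimal, the method sparqlFragment(), and the ex: prefix and predicates are invented for this example and are not the project's real classes or vocabulary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the project's Entity class.
class SketchEntity {
    protected final String variable; // SPARQL variable name, e.g. "?e0"
    protected final String synset;   // identifier of the matched synset

    SketchEntity(String variable, String synset) {
        this.variable = variable;
        this.synset = synset;
    }

    // Generic constraint: a shape in the image must depict the given synset.
    String sparqlFragment() {
        return variable + " ex:inImage ?image ; ex:depicts " + synset + " .";
    }
}

// Hypothetical stand-in for Animal, which adds animal-specific constraints.
class SketchAnimal extends SketchEntity {
    private final List<String> behaviours = new ArrayList<>();

    SketchAnimal(String variable, String synset) {
        super(variable, synset);
    }

    void addBehaviour(String behaviourSynset) {
        behaviours.add(behaviourSynset);
    }

    @Override
    String sparqlFragment() {
        StringBuilder sb = new StringBuilder(super.sparqlFragment());
        for (String b : behaviours) {
            sb.append(' ').append(variable).append(" ex:hasBehaviour ").append(b).append(" .");
        }
        return sb.toString();
    }
}

final class FragmentDemo {
    // Joins the fragments of all entities into the WHERE clause of one query.
    static String buildQuery(List<? extends SketchEntity> entities) {
        StringBuilder where = new StringBuilder();
        for (SketchEntity e : entities) {
            where.append(e.sparqlFragment()).append(' ');
        }
        return "PREFIX ex: <http://example.org/> SELECT ?image WHERE { " + where + "}";
    }

    public static void main(String[] args) {
        SketchAnimal animal = new SketchAnimal("?e0", "ex:animal");
        animal.addBehaviour("ex:flying");
        SketchEntity tree = new SketchEntity("?e1", "ex:tree");
        System.out.println(buildQuery(Arrays.asList(animal, tree)));
    }
}
```

Because each class only contributes its own fragment, a new kind of entity could in principle be supported by adding a subclass without touching the code that assembles the query.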


JUnit testing in NetBeans broke for this project, so we developed our own custom test classes. Useful examples of tests can be found in the TesterDavid class.

Project management

We wrote 3 pages in the wiki on Google Code:

[...]: instructions for setting up the project

Ontology: notes about the ontology we used

[...]: project log

The Issues page also lists some bugs currently present in the program, and enhancements.

Semantic matching with Wordnet 2.0

All the data from both the Animal Diversity Web and LabelMe is matched with WordNet 2.0 data. The current version of WordNet is 3.1 at the time of this writing, but the only available W3C conversion draft to OWL is based on 2.0 data. See [WNO] for reference. When a perfect match is found with the common name of an animal (as in the case of 'lion'), the same WordNet synset is used to refer to the animal (so for lion we would use the synset [...]). Otherwise the common name is progressively shortened from the left until a generic synset is found. For example, for 'howling monkey' a full-text match is attempted and fails; the algorithm then cuts 'howling' and tries to match 'monkey', which succeeds. The algorithm then creates a synset, which is declared

wn20schema:hyponymOf wn20instances:synset

If all attempts fail, the animal is simply declared as

wn20schema:hyponymOf wn20instances:synset
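
The left-to-right shortening of common names can be sketched as follows. This is a minimal sketch, assuming that the WordNet lookup can be replaced by membership in a set of known names; the class NameShrinker and its match method are hypothetical names introduced for the example, and the real lookup against the repository is not shown.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the name-shortening match; the set stands in for WordNet.
final class NameShrinker {

    // Returns the longest suffix of commonName (in whole words) that is a known name,
    // or Optional.empty() if not even the last word is known.
    static Optional<String> match(String commonName, Set<String> knownNames) {
        String[] words = commonName.trim().toLowerCase().split("\\s+");
        for (int start = 0; start < words.length; start++) {
            // Drop the first 'start' words and try the remainder.
            String candidate = String.join(" ", Arrays.copyOfRange(words, start, words.length));
            if (knownNames.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("lion", "monkey"));
        System.out.println(match("lion", known));           // full match
        System.out.println(match("howling monkey", known)); // match after cutting 'howling'
        System.out.println(match("blue whale", known));     // no match at all
    }
}
```

When the result is empty, the caller would fall back to the generic declaration described in the text.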

Images from ADW are added along with a single shape with no associated points, which refers to the synset used for the animal. Other characteristics such as behaviours (flying, swimming, etc.) are matched to WordNet synsets. Mass and length are collected as numerical values and converted to kilograms and meters if they are expressed in other units.

For LabelMe a much simpler matching approach was used. Since the LabelMe database is large, as a first approximation we just matched each label [...] with wn20schema:synset [...] 1, where [...] is a sanitized version of the label that prevents encoding conflicts. To improve the matching it would be enough to exploit the content of the file [...] created by the authors of LabelMe, which contains disambiguation information to solve this issue.

Recognized phrases

The new kinds of phrases that we introduced, and for which some inference is done, are the following (note that most of them work well with test data; results with big data may be slow to compute):

animal in a <place>

where <place> is a habitat (forest, savannah, etc.)

the animal is ( nocturnal | walking | running | flying | swimming | social | carnivore | herbivore )

( small | medium-sized | big | huge )

It is better to put adjectives in a second phrase, as the parser sometimes treats an adjective as a noun (a nocturnal animal is mistakenly parsed as having two entities, nocturnal and animal). A <dimension> instead works without a separate phrase.

object_1 in front of object_2

object_1 behind of object_2

object_1 to the left of object_2

object_1 to the right of object_2



The following paths were added to the Romanian parser:

to the left of
to the right of
in front of

It is now possible to write phrases like

a door is to the left of a bed

Note that the spatial constraints in the text always refer to the observer who took the picture, as currently the observer's orientation is the only one known. So, for example, the previous query returns a picture in which the door is to the left of the bed as seen by the observer.
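
An observer-relative test of this kind can be sketched on the annotated shapes. This is only an illustration: it assumes that each shape is a polygon in image coordinates with x growing to the right (as in java.awt), and it compares the mean x coordinate of the vertices. The class ObserverSpace, its methods and this criterion are assumptions made for the example, not the check actually performed by the project's query.

```java
// Hypothetical sketch: "left of" as seen by the observer, using polygon x coordinates.
final class ObserverSpace {

    // Mean x coordinate of the polygon's vertices, in image pixels.
    static double meanX(int[] xs) {
        if (xs.length == 0) {
            throw new IllegalArgumentException("a shape needs at least one point");
        }
        long sum = 0;
        for (int x : xs) {
            sum += x;
        }
        return (double) sum / xs.length;
    }

    // True if shape A appears to the left of shape B in the picture.
    static boolean isLeftOf(int[] xsA, int[] xsB) {
        return meanX(xsA) < meanX(xsB);
    }

    public static void main(String[] args) {
        int[] door = {10, 40, 40, 10};    // x coordinates of the door outline
        int[] bed = {200, 320, 320, 200}; // x coordinates of the bed outline
        System.out.println(isLeftOf(door, bed));
    }
}
```

The same comparison with the arguments swapped gives "to the right of"; "in front of" and "behind" cannot be decided from x coordinates alone.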

To fully process spatial information in a natural way we would need information about the orientation of objects and about the way people perceive the 'natural orientation' of an object. For example, if we say

the car is to the left of a man

then the car should be on the same side as the man's left arm, but this is only because for humans the 'natural' way to imagine a man is from the front (we usually want to see the face of the person we are thinking about).

Dimension management

It is possible to enter phrases like

a big animal is walking

Supported dimensions are: small, medium-sized, big, huge.

We only support dimensions for the word 'animal' (and not, for example, 'mammal'). The reason is that dimension is a relative concept, and it depends on how the user perceives a given category of objects. A big pencil is not the same size as a big animal. One way to automatically obtain a plausible natural notion of size could be to look in the database at the distribution of, say, the weight of all the mammals (as not all animals have 'length' and 'height' associated with them). But at this point the algorithm would become database dependent. With mammals it might work, as we have all the mammals stored in our database and people usually have a faithful idea of the distribution of their size, but what about a generic animal? Most animals are insects, but people usually don't consider this. A user would think the average animal is as big as a dog, but the algorithm would say the average animal is the size of a mouse. This is why we needed to pre-encode our human knowledge in the program and accept only what we specifically supported, in our case the concept 'animal'.
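
Pre-encoding this knowledge can be pictured as a fixed table of size classes that applies only to the concept 'animal'. The sketch below is hypothetical: the class SizeClassifier and, above all, the mass thresholds are illustrative assumptions chosen for the example and are not the values used in the project.

```java
import java.util.Optional;

// Hypothetical sketch of hand-encoded size classes for the concept 'animal' only.
final class SizeClassifier {
    enum Size { SMALL, MEDIUM_SIZED, BIG, HUGE }

    // Illustrative upper bounds in kilograms (assumed values, not from the project).
    private static final double SMALL_MAX_KG = 5.0;
    private static final double MEDIUM_MAX_KG = 50.0;
    private static final double BIG_MAX_KG = 2000.0;

    // Returns a size class for 'animal', and nothing for any unsupported concept.
    static Optional<Size> classify(String concept, double massKg) {
        if (!"animal".equals(concept)) {
            return Optional.empty();
        }
        if (Double.isNaN(massKg) || massKg < 0) {
            throw new IllegalArgumentException("mass must be a non-negative number");
        }
        if (massKg <= SMALL_MAX_KG) {
            return Optional.of(Size.SMALL);
        }
        if (massKg <= MEDIUM_MAX_KG) {
            return Optional.of(Size.MEDIUM_SIZED);
        }
        if (massKg <= BIG_MAX_KG) {
            return Optional.of(Size.BIG);
        }
        return Optional.of(Size.HUGE);
    }

    public static void main(String[] args) {
        System.out.println(classify("animal", 1000.0)); // within the assumed 'big' range
        System.out.println(classify("mammal", 1000.0)); // unsupported concept
    }
}
```

Fixed thresholds keep the behaviour independent of what happens to be stored in the database, which is the trade-off argued for in the text.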

Performance considerations

When we started developing the project we worked on test data, and the program behaved well in the presence of generalizations like

an animal is near a mammal

a big animal is swimming

which returned an American crocodile (which weighs ~1 ton). Adding big data brought many problems. At present the method we follow to perform entity/shape matching is the following: all the hyponyms of the possible synsets of a word are found in a subquery, and those synsets are matched with the shapes representing them in the image database. This approach usually works in a short time (< 30 sec) when the phrase contains terms that are not too general ([...]). If we start using terms like [...], then the search space apparently grows too large for OWLIM, and results become very dependent on the current state of the cache and/or the (hard to know) order of search. We also discovered that SPARQL as implemented in OWLIM is sensitive to the order of the statements in a query, and this produced weird results, like getting all possible instances or none at all by simply swapping two statements. If nothing else, since we have a query parser it is at least easy to try out different orderings and experiment with different behaviours. The caching performed by OWLIM introduces a lot of variability, though.
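
Trying out different statement orderings can be automated by enumerating the permutations of the statements of a query. The following is a minimal sketch under the assumption that a query body is available as a list of statement strings; the class StatementOrderings and the sample statements are hypothetical, and sending each ordering to the repository is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that enumerates every ordering of a list of query statements.
final class StatementOrderings {

    static List<List<String>> permutations(List<String> statements) {
        List<List<String>> result = new ArrayList<>();
        permute(new ArrayList<>(statements), 0, result);
        return result;
    }

    // Classic swap-based recursion: fix position k, then permute the rest.
    private static void permute(List<String> items, int k, List<List<String>> out) {
        if (k == items.size()) {
            out.add(new ArrayList<>(items));
            return;
        }
        for (int i = k; i < items.size(); i++) {
            Collections.swap(items, k, i);
            permute(items, k + 1, out);
            Collections.swap(items, k, i); // undo the swap before the next choice
        }
    }

    public static void main(String[] args) {
        // Invented statements, used only to show the enumeration.
        List<String> statements = Arrays.asList(
                "?shape ex:depicts ?synset .",
                "?synset ex:hyponymOf ex:animal .",
                "?shape ex:inImage ?image .");
        for (List<String> ordering : permutations(statements)) {
            System.out.println(String.join(" ", ordering));
        }
    }
}
```

The number of orderings grows factorially, so this is only practical for queries with a handful of statements.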

Comparison with similar projects

While performing image processing and statistical analysis of labels is an effective way to obtain results automatically, it can't be used alone to deal with a domain as complex as image generation. Since existing systems, [TPSS] and [SPE], already provided this kind of analysis, we preferred to concentrate on using an ontology and finding the best way to link it to existing concepts. While we can't say we implemented a lot of logic in the system, we certainly laid the basis for a richer interaction between the user and the program by means of linked data and a tool such as LabelMe. See the conclusions for more details about this. [TPSS] contains a layout engine which we regret not having found the time to implement, as it would have been an interesting addition. WordsEye [WE] is the king of text-to-picture systems, so the comparison with it dwarfs our project. WordsEye uses 3D objects to compose new scenes from scratch, and it considers an array of aspects during scene creation, plus an ontology to help resolve conflicting constraints. It can't provide existing real pictures as ours does, though.


Conclusion and possible improvements

Much of the effort we put into the project went into collecting the data and setting up a good infrastructure to work with. We were conscious that projects like this can quickly become patchy bunches of rules and nested if-then-else code, so we refactored the code until we were satisfied with the result. Care was also taken to make it easy to add or delete test data or big data from various sources. On one hand this led us to a solution to which it is relatively easy to add new features and run experiments, but on the other hand the refactoring took away time we could have spent implementing them.

Possible improvements: coreference

Coreference should be implemented. Currently, to get two cars in a picture a phrase like this has to be written:

a car and a vehicle are in a street.

Otherwise, writing for example

a car and another car are in a street.

is likely to retrieve pictures with just one car.

Possible improvements: performance

The performance issues were very bad news for us, since we found ourselves in a situation in which, even with a good codebase, we couldn't exploit it to add features because the database was sloppy and/or unpredictable. There can be many solutions to this performance problem: some indexing/caching techniques could be implemented, or a closer reading of the OWLIM documentation could give insights into the way the search is performed in SPARQL. We obtained from the OWLIM authors the jar of OWLIM SE, which could improve performance but at the cost of having a non-[...] codebase. We did not try it out, though. As OWLIM is just an implementation of the OpenRDF Sesame interface, in theory it should be easy to switch repository and try out other libraries, maybe open source ones. During development we kept two separate servers, ours and the Sesame server. This eases the management of the database a bit, but also imposes longer waiting times to send queries and results back and forth between the servers. OWLIM can run as a library inside our server, and if repeated queries had to be made (instead of our current 'one big query' model) performance would definitely improve by choosing this option.

Possible improvements: ranking

To improve image results a ranking system could be adopted. For example, if a user requests

an animal is in a forest

the program could add several scores to the picture: one if the given animal actually has the forest as its habitat, one for each tree present in the picture, and one given by previous users for the same query. In this regard it could be interesting to show the user the image together with the input phrase, with the words replaced by links to their synsets. In case of a wrong picture the user could select a word and give it a negative score, indicating for example that there is no forest in the picture. Since the semantic meaning of forest is known, possible alternative and useful synsets (habitats) could be suggested. Implementing a ranking system would mean asking the semantic engine to fetch all the possible results and then ordering them by rank: this could put severe stress on the system, so particular care would be needed to deal with performance. Using the same LabelMe webpage software to allow users to trace and tag new shapes would greatly help improve the database quality.
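
The scoring idea can be sketched as follows. This is a hypothetical illustration: the classes Candidate and Ranker, their fields, and the choice of giving every component the same weight are assumptions made for the example, since the ranking system was not implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical candidate picture with the three score components named in the text.
final class Candidate {
    final String imageId;
    final boolean habitatMatches; // the animal actually lives in the requested habitat
    final int supportingObjects;  // e.g. number of trees for the habitat 'forest'
    final int userScore;          // sum of votes from previous users for the same query

    Candidate(String imageId, boolean habitatMatches, int supportingObjects, int userScore) {
        this.imageId = imageId;
        this.habitatMatches = habitatMatches;
        this.supportingObjects = supportingObjects;
        this.userScore = userScore;
    }

    // Equal weights are an assumption; a real system would probably tune them.
    int score() {
        return (habitatMatches ? 1 : 0) + supportingObjects + userScore;
    }
}

final class Ranker {
    // Returns a new list ordered from the highest to the lowest score.
    static List<Candidate> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(Candidate::score).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        List<Candidate> ranked = rank(Arrays.asList(
                new Candidate("img-1", false, 0, 0),
                new Candidate("img-2", true, 3, 1),
                new Candidate("img-3", true, 1, -2)));
        for (Candidate c : ranked) {
            System.out.println(c.imageId + " " + c.score());
        }
    }
}
```

Sorting requires all candidates to be fetched first, which is exactly the performance concern raised in the text.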

Possible improvements: image composition

In the absence of good matches in the database, shapes could be extracted from existing LabelMe images and composed, maybe on a background image. The problem with this approach is that shapes normally have ragged edges. To overcome this issue, pixels along the borders could be made partially transparent. Another interesting solution could be to apply some artistic filter to the final composition (we found, for example, [JHF]) in order to blur the contours of objects. The final image wouldn't be photorealistic anymore, but maybe we would still obtain a pleasant result.
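
Making border pixels partially transparent can be sketched with the standard java.awt.image.BufferedImage class. The sketch assumes that the extracted shape is stored as an ARGB image whose pixels outside the shape are fully transparent; the class EdgeFeather and the choice of halving the alpha of a one-pixel border are assumptions for the example, not an implemented feature of the project.

```java
import java.awt.image.BufferedImage;

// Hypothetical sketch: soften ragged edges by halving the alpha of border pixels.
final class EdgeFeather {

    static BufferedImage feather(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int argb = src.getRGB(x, y);
                int alpha = argb >>> 24;
                // A visible pixel next to the outside of the shape is a border pixel.
                if (alpha != 0 && touchesOutside(src, x, y)) {
                    alpha = alpha / 2;
                }
                out.setRGB(x, y, (alpha << 24) | (argb & 0x00FFFFFF));
            }
        }
        return out;
    }

    // True if a 4-neighbour is fully transparent or lies beyond the image bounds.
    private static boolean touchesOutside(BufferedImage img, int x, int y) {
        int[][] offsets = {{1, 0}, {-1, 0}, {0, 1}, {0, -1}};
        for (int[] o : offsets) {
            int nx = x + o[0];
            int ny = y + o[1];
            if (nx < 0 || ny < 0 || nx >= img.getWidth() || ny >= img.getHeight()) {
                return true;
            }
            if ((img.getRGB(nx, ny) >>> 24) == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        BufferedImage shape = new BufferedImage(5, 5, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < 5; y++) {
            for (int x = 0; x < 5; x++) {
                shape.setRGB(x, y, 0xFFFF0000); // opaque red square
            }
        }
        BufferedImage soft = feather(shape);
        System.out.println("centre alpha: " + (soft.getRGB(2, 2) >>> 24));
        System.out.println("corner alpha: " + (soft.getRGB(0, 0) >>> 24));
    }
}
```

A wider or smoother falloff could be obtained by repeating the pass or by scaling alpha with the distance from the border.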


We have a platform that integrates data from the Animal Diversity Web and from LabelMe. We developed a basic website to allow interaction and implemented basic text processing capabilities to retrieve meaningful images.

We developed the website to allow users to enter text and see the returned images. The website is designed to be simple and easy to use. We experimented with some UI ideas, and we like the idea of the website returning messages as if painted on a canvas.

Database of LabelMe

Images are annotated with the shapes of the objects contained in the scene.

Labeling was done by unpaid [...]

More than 70,000 shapes were obtained!

More than 1 000 points

Database of Animal Diversity Web

We fetched nearly 10 000 pages.

545 contained information about animals, 500 were picture pages of animals (from each picture page we extracted ~5 picture links), and 5 000 were simply hierarchy pages, needed to reach the information at the leaves.

We fetched mammals, birds, bony fishes, insects, echinoderms, arthropods.


Some results:

Let us show you some examples of our project.

A horse is near a tree

An elephant is near a tree

The lake is in the mountain


Some other interesting results:

The car and the sky, and the street.

The bike is at left of the car.


person in the hotel.

A person is

The tree and a person.


person in the water.

There is a window and a bed.

The mountain, a tree and the road.

There is a car in the street


[WNO] Mark van Assem (Vrije Universiteit Amsterdam), Aldo Gangemi (ISTC-CNR, Rome), Guus Schreiber (Vrije Universiteit Amsterdam), [...]

[JHF] Jerry Huxtable, Java Image Filters. Oily effect: [...]

[LM] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy and William T. Freeman, LabelMe: A database and web-based tool for image annotation, MIT AI Lab Memo, 2005.

[DBP] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, Sebastian Hellmann, DBpedia - A Crystallization Point for the Web of Data, Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Issue 7, pp. 154-165, 2009.

[TRA] Mihalcea, R., and Tarau, P., TextRank: Bringing order into texts, in Proc. Conf. Empirical Methods in Natural Language Processing, 2004.

[CAPS] Ken Xu, James Stewart and Eugene Fiume, Constraint-Based Automatic Placement for Scene Composition, Proc. Graphics Interface, May 2002, Calgary, Alberta, pp. 25-34.

[ADW] Myers, P., R. Espinosa, C. S. Parr, T. Jones, G. S. Hammond, and T. A. Dewey, The Animal Diversity Web (online), 2006. Accessed October 25, 2011 and November 01, 2011 at http://an

[SPE] Dhiraj Joshi, James Z. Wang and Jia Li, The Story Picturing Engine - A System for Automatic Text Illustration, ACM Transactions on Multimedia Computing, Communications and Applications, special issue on Multimedia Information Retrieval, vol. 2, no. 1, pp. [...]-89, 2006.

[TPSS] Xiaojin Zhu, Andrew B. Goldberg, Mohamed Eldawy, Charles R. Dyer and Bradley Strock, A Text-to-Picture Synthesis System for Augmenting Communication, in The Integrated Intelligence Track of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), 2007.

[WE] Bob Coyne and Richard Sproat, WordsEye: An Automatic Text-to-Scene Conversion System, SIGGRAPH Proceedings 2001, Los Angeles, CA.

[WN] Princeton University, "About WordNet." WordNet. Princeton University. 2010.

[STA]: []



[PEN]: []