Natural Language driven Image Generation

24 Oct 2013


Prepared by: Shreya Agarwal

Guide: Mrs. Nirali Nanavati

Introduction


Natural Language driven Image
Generation, as the name suggests,
refers to the task of mapping a natural
language text to a scene.


The general processes involved in
achieving this task are


Natural Language Understanding


Image Retrieval and Positioning

Natural Language
Understanding


Natural Languages are those used by
humans to communicate with each
other on a daily basis. Example: English


Computers cannot understand Natural
Language unless it is parsed and
represented in a predefined template-like form.

Image Retrieval and
Positioning


This part of the process involves
retrieving images from the local
database or the internet relating to the
text.


The final task is to position the images
in a manner such that all elements are
in their correct places in accordance
with the natural language text.

Systems and Techniques


NALIG (NAtural Language driven Image
Generation) [1]


Text-to-Picture Synthesis Tool [2]


WordsEye [3]


Carsim [4]


Suggested Technique

NALIG


Generated images of static scenes


Proposes a theory for equilibrium and
stability.


Based on description in the form of the
following phrase:


<subject> <preposition> <object>


[ <reference> ]
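
The phrase form above can be sketched as a tiny parser. This is a minimal illustration, not NALIG's actual implementation; the preposition list and function names are assumptions.

```python
# Minimal sketch of parsing a NALIG-style phrase of the form
# "<subject> <preposition> <object>". The preposition set below is an
# illustrative assumption, not NALIG's actual lexicon.
PREPOSITIONS = {"on", "in", "under", "near", "above", "beside"}

def parse_phrase(text):
    """Split a phrase at its first preposition into (subject, preposition, object)."""
    words = text.lower().split()
    for i, w in enumerate(words):
        if w in PREPOSITIONS:
            subject = " ".join(words[:i])
            obj = " ".join(words[i + 1:])
            return subject, w, obj
    return None  # no recognized preposition: phrase is outside the template

print(parse_phrase("the book on the table"))
# ('the book', 'on', 'the table')
```

An optional `<reference>` would extend the template with a second prepositional phrase, parsed the same way.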

NALIG: Object Taxonomy and
Spatial Primitives


Defines “primitive relationships”


Example, H_SUPPORT(a,b)


Attributes like FLYING, REPOSITORY,
etc. associated with each object


Conditions like CANFLY are used.


Example: "the airplane on the desert"
vs. "the airplane on the runway"


NALIG: Object Instantiation


All objects mentioned in natural
language text are initialized.


If existence of an object depends on
another one, it is also instantiated.


Such dependence is stored in relation
HAS(a,b) which defines the strict
relationship.


Example: "branch blocking the window"



NALIG: Consistency Checking
and Qualitative Reasoning


Rules known as "naïve statics" are
defined to check for equilibrium and
stability.




Law of gravity is checked.


Space conditions (object positioning)
are checked. Example: "The book is on
the table".

NALIG: Advantages


Successful for limited static scene
generation


Checks equilibrium, space and stability
conditions


Instantiates implied objects

NALIG: Limitations


Works for a predefined form of phrases;
not suitable for full-blown natural
language texts


Fails to construct dynamic scenes


Low success rate for complex scenes

Text-to-Picture Synthesis Tool


The technique has the following
processes:


Selecting Keyphrases


Selecting Images


Picture Layout


Example

Selecting Keyphrases


Uses keyword-based text summarization


Keywords and Phrases extracted based
on lexicosyntactic rules


Unsupervised learning approach based
on TextRank algorithm


The stationary distribution of a random
walk is used to determine the relative
importance of words.
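
The random-walk ranking can be sketched in a few lines. This is a simplified illustration of the TextRank idea (co-occurrence graph plus damped power iteration), not the tool's actual code; the window size and damping factor are assumed values.

```python
# Illustrative TextRank-style keyword ranking: build a co-occurrence graph
# over words, then approximate the stationary distribution of a damped random
# walk by power iteration. Parameters are assumptions, not the paper's values.
from collections import defaultdict

def textrank(words, window=2, damping=0.85, iters=50):
    # Words within `window` positions of each other share an (undirected) edge.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):  # power iteration toward the stationary distribution
        score = {
            w: (1 - damping) + damping * sum(
                score[v] / len(graph[v]) for v in graph[w])
            for w in graph
        }
    return sorted(score, key=score.get, reverse=True)

tokens = "dog chases cat cat climbs tree dog barks tree".split()
print(textrank(tokens)[:3])
```

The highest-scoring words become the keyphrase candidates for image search.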

Selecting Images


Two sources are used in the search for
images for the selected keyphrases


Local database of images


Internet based image search engine


15 images are retrieved and image processing is
used to select the correct one.


Picture Layout


The technique aims to convey the gist
of the text. Hence, a good layout is
characterized as having:


Minimum Overlap


Centrality


Closeness

A Monte Carlo randomized algorithm is used
to solve this highly non-convex
optimization problem
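
The randomized search can be sketched as below. This is a hedged illustration under assumed scoring weights; the real objective and sampler differ, but the idea of sampling placements and keeping the best-scoring one is the same.

```python
# Sketch of a Monte Carlo randomized layout search: sample random placements,
# score each for overlap and centrality, keep the best. The scoring function
# and weights are illustrative assumptions, not the paper's exact objective.
import random

def overlap(a, b):
    # a, b are (x, y, w, h) rectangles; returns the overlapping area.
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(dx, 0) * max(dy, 0)

def layout_score(rects, canvas=(100, 100)):
    # Lower is better: total pairwise overlap plus distance from canvas center.
    cx, cy = canvas[0] / 2, canvas[1] / 2
    ov = sum(overlap(rects[i], rects[j])
             for i in range(len(rects)) for j in range(i + 1, len(rects)))
    central = sum(abs(x + w / 2 - cx) + abs(y + h / 2 - cy)
                  for x, y, w, h in rects)
    return ov + central

def monte_carlo_layout(sizes, trials=500, canvas=(100, 100), seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(trials):  # random restarts instead of gradient steps
        rects = [(rng.uniform(0, canvas[0] - w), rng.uniform(0, canvas[1] - h), w, h)
                 for w, h in sizes]
        s = layout_score(rects, canvas)
        if s < best_score:
            best, best_score = rects, s
    return best

print(monte_carlo_layout([(30, 20), (25, 25), (20, 30)]))
```

Random restarts suit a highly non-convex objective because they cannot get trapped in a single local minimum the way gradient descent can.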

Advantages


Successfully conveys the gist of the
natural language text


Searches for images online, thus
delivering an output for every natural
language input


Capable of processing complex
sentences


Fit to represent action sequences

Limitations


Does not render a cohesive image


Does not work well for all inputs
without a reliable internet connection


Slower than other methods as it spends
time on generating a TextRank graph
and a co-occurrence matrix


WordsEye


This system generates a high quality 3D
image from a natural language description.


It utilizes a large database of 3D
models and poses.

WordsEye: Linguistic Analysis


Utilizes a Part-of-Speech (POS) tagger
and a statistical parser to generate a
Dependency Representation of the
input text.



WordsEye: Linguistic Analysis


This Dependency Representation is then
converted into a Semantic Representation.


It describes the entities in the scene and the
relations between them.


WordsEye: Semantic
Representation


WordNet is used to find relations
between different words.


Personal names are mapped to
male/female humanoid bodies.


Spatial propositions are handled by
semantic functions which look at the
dependents and generate semantic
representation accordingly.

WordsEye: Depictors


Depictors are low-level graphical
specifications used to specify scenes.


They control 3D object visibility, size,
position, orientation, surface color and
transparency.


They are also used to specify poses,
control Inverse Kinematics (IK) and
modify vertex displacements for facial
expression.
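
A depictor record might look like the sketch below. The field names are my paraphrase of the properties listed above, not WordsEye's actual data structures.

```python
# Sketch of a low-level depictor record. Field names mirror the properties
# described in the slides (visibility, size, position, orientation, color,
# transparency, pose) but are assumptions, not WordsEye's real API.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Depictor:
    obj: str                                            # 3D object it applies to
    visible: bool = True
    size: float = 1.0                                   # uniform scale factor
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    orientation: Tuple[float, float, float] = (0.0, 0.0, 0.0)
    color: Optional[str] = None                         # surface color override
    transparency: float = 0.0                           # 0.0 = fully opaque
    pose: Optional[str] = None                          # e.g. a named IK pose

daisy = Depictor("daisy", position=(0.0, 1.2, 0.0), color="white")
print(daisy.obj, daisy.position)
```

The depiction process emits a list of such records, which the renderer applies one by one to build the scene.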

WordsEye: Models


Models are stored in the database and have
the following associated information:


Skeletons


Shape Displacements


Parts


Color Parts


Opacity Parts


Default Size


Functional Properties


Spatial Tags

WordsEye: Prepositions
denote the layout


If we say "The daisy is in the test tube",
the system finds the cup tag for the test
tube and the stem tag for the daisy.
Hence, it puts the stem into the cupped
opening of the test tube.
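
A toy version of this tag matching is sketched below. The tag vocabulary and offsets are invented for illustration; WordsEye's real spatial-tag database is far richer.

```python
# Toy sketch of spatial-tag matching for "X in Y": align the figure's "stem"
# tag with the ground's "cup" tag. Tag names and offsets are assumptions.
SPATIAL_TAGS = {
    "test tube": {"cup": (0.0, 0.5, 0.0)},   # cupped opening, local offset
    "daisy": {"stem": (0.0, -0.3, 0.0)},     # stem base, local offset
}

def place_in(figure, ground):
    """Return the figure's origin so its stem tag sits at the ground's cup tag."""
    cup = SPATIAL_TAGS.get(ground, {}).get("cup")
    stem = SPATIAL_TAGS.get(figure, {}).get("stem")
    if cup is None or stem is None:
        return None  # no matching tags: a fallback placement would be used
    # figure origin = cup position minus the stem's local offset
    return tuple(c - s for c, s in zip(cup, stem))

print(place_in("daisy", "test tube"))  # (0.0, 0.8, 0.0)
```

The same mechanism generalizes: each spatial preposition selects a pair of tags to align.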


WordsEye: Poses


Poses are used to depict a character in
a configuration which suggests a
particular action being performed.


They are categorized here as:


Standalone pose


Specialized Usage pose


Generic Usage pose


Grip pose


Bodywear pose

WordsEye: Pose examples




Specialized Usage pose (Cycling)









Grip pose





(hold wine bottle)










Generic Usage pose





(throw small object)

WordsEye: Depiction Process


Process to convert the high-level semantic
representation into low-level depictors.


Consists of the following tasks:


Convert semantic representation from the
node structure to a list of typed semantic
elements where all references have been
resolved



Interpret the semantic representation



Assign depictors to each semantic element


WordsEye: Depiction Process


Resolve implicit and conflicting constraints
of depictors.



Read in the referenced 3D models



Apply each assigned depictor to
incrementally build up the scene while
maintaining constraints.


Add background environment, ground
plane, lights.



Adjust the camera (automatically or by
hand)



Render

WordsEye: Depiction Rules


Many constraints and conditions are
applied so as to generate a coherent
scene.


Constraints are explicit and implicit.


Sentences which cannot be depicted
are handled by using one of
Textualization, Emblematization,
Characterization, Conventional Icons or
Literalization.

WordsEye: Advantages


Generates high quality 3D models


The ability to use poses, grips,
constraints and IK makes the
picture coherent.


Depiction rules help in mapping
linguistically analyzed text to exact
depictors.


Semantic representation lets the
depiction process truly understand what
is being conveyed.

WordsEye: Limitations


Works on high-quality 3D models and
hence requires a lot of memory and a
fast search algorithm.


Because it is restricted to its own
database, the system does not
guarantee an output for every natural
language text input.


Carsim


Developed to convert text descriptions
of road accidents into 3D scenes



2-tier architecture communicating with
a formal representation of the accident.

Carsim: Formalism


The tabular structure generated after
parsing the natural language text has
the following information:


Location of accident and configuration of
roads


List of road objects


Event chains for objects and movements


Collision description
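
The tabular structure might be modeled as below. The field names are my paraphrase of the slide's list, not Carsim's exact schema.

```python
# Sketch of Carsim's formal accident representation: location and road
# configuration, road objects, event chains, and collisions. Field names
# are assumptions based on the list above, not the system's real schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Accident:
    location: str                                          # where it happened
    road_config: str                                       # e.g. "crossroads"
    objects: List[str] = field(default_factory=list)       # road objects
    events: List[Tuple[str, str]] = field(default_factory=list)      # (object, movement)
    collisions: List[Tuple[str, str]] = field(default_factory=list)  # (actor, victim)

acc = Accident("crossroads", "two intersecting roads",
               objects=["car A", "car B"],
               events=[("car A", "drive north"), ("car B", "drive west")],
               collisions=[("car A", "car B")])
print(acc.collisions)
```

The information-extraction module fills this structure from the text; the synthesis side reads it to plan trajectories.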

Carsim: Information Extraction
Module


Utilizes tokenizing, part-of-speech
tagging, splitting into sentences,
detecting noun groups, named entities,
non-recursive clauses and domain-specific
multiwords for:


Detecting the participants


Marking the events


Detecting the roads

Carsim: Scene Synthesis and
Visualization


The previously generated template is
taken as input.


Rule-based modules are used to check
consistency of the scene.


A planner is used to generate vehicle
trajectories.


A temporal module is used to assign
time intervals to all segments of these
trajectories.

Suggested Technique


This technique is a hybrid of the
techniques we have seen so far along
with a few additions.


It is a theoretical technique and has not
been implemented yet.

Natural Language
Understanding


Words of interest will be categorized
into the following groups using a
part-of-speech (POS) tagger and a named
entity recognizer (NER).


OBJECT


STATE


SIZE


RELATIVITY

The template and the co-relation matrix


A co-relation matrix specifies the position of
each object in the scene with respect to
every other object.


The template for each object in the list
of objects to be instantiated contains
the following information.


Size


Co-ordinates


Image Location
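
The template and matrix could be modeled as below. This is a sketch of the proposed (unimplemented) design; the relation labels are illustrative assumptions.

```python
# Sketch of the proposed per-object template (size, co-ordinates, image
# location) and the co-relation matrix. Relation labels such as "above"
# are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ObjectTemplate:
    name: str
    size: Tuple[int, int]                    # (width, height) in pixels
    coords: Optional[Tuple[int, int]] = None # filled in by the position determiner
    image_path: str = ""                     # local path or URL of the chosen image

def make_corelation_matrix(objects, relations):
    """relations: dict mapping (a, b) -> label such as 'above' or 'left_of'."""
    m = {a: {b: None for b in objects if b != a} for a in objects}
    for (a, b), label in relations.items():
        m[a][b] = label
    return m

matrix = make_corelation_matrix(["sun", "tree"], {("sun", "tree"): "above"})
print(matrix["sun"]["tree"])  # above
```

Unfilled cells stay `None`, meaning the layout is free to place those two objects anywhere relative to each other.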

Image Selection Module


This module finds images using two
sources:


Internal database of images


Internet based image search engines


First 10 images are retrieved


Image processing is used to find the correct
image


This image is stored in the database for future
use

Position Determiner and
Synthesis Module


The Position Determiner computes the
co-ordinates of each image to be placed
based on the input template (which has
the image size and location paths).


The synthesis module resizes all images
and places them at the co-ordinates in
the template (supplied by the position
determiner module).
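
The position-determiner step might look like the sketch below. For brevity it uses a simple row layout; the proposed module would instead consult the co-relation matrix, which is omitted here.

```python
# Minimal sketch of a position determiner: lay images out left-to-right,
# wrapping to a new row when the canvas width is exceeded. The real module
# would place images according to the co-relation matrix (not shown).
def determine_positions(templates, canvas_width=800, gap=10):
    """templates: list of (name, (w, h)); returns name -> (x, y) co-ordinates."""
    coords, x, y, row_h = {}, 0, 0, 0
    for name, (w, h) in templates:
        if x + w > canvas_width:   # wrap to the next row when this one is full
            x, y, row_h = 0, y + row_h + gap, 0
        coords[name] = (x, y)
        x += w + gap
        row_h = max(row_h, h)      # row height = tallest image in the row
    return coords

coords = determine_positions([("sky", (400, 200)), ("tree", (300, 250)),
                              ("house", (300, 250))])
print(coords)
```

The synthesis module would then resize each image to its template size and paste it at the returned co-ordinates.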

Introducing Machine Learning


The ultimate aim is to make a computer think
like a human. We can greatly enhance our
system by using machine learning
techniques.


The system can be made to learn the objects
through unsupervised learning (clustering).


The system can be feedback-controlled,
letting the user point out the meanings of
terms (SIZE, RELATIVITY, STATE) not
previously known.

Advantages


Linguistic analysis is efficient since
there is no statistical/rule-based parser
being used.


Searching for images on the internet
ensures that an image is generated
for every natural language input.


Introducing machine learning makes
the system coachable (also, user
feedback and instant adaptation)

Limitations


It might not generate coherent images
for complex sentences since we do not
make use of an advanced NLU
technique.


It depends on internet availability for
finding images not within its local
database.


Summary


All the methods developed to date for
tackling this problem have been
explained.


A technique based on some additions
and the positives of the existing
techniques has been specified.


A lot of research is still required to make
a computer achieve this task as simply
as a human brain does.

References


[1] Giovanni Adorni, Mauro Di Manzo, Fausto Giunchiglia (University of
Genoa), "Natural Language driven Image Generation", Proceedings of the
10th International Conference on Computational Linguistics (ACL 1984).


[2] Xiaojin Zhu, Andrew B. Goldberg, Mohamed Eldawy, Charles R. Dyer,
Bradley Strock (University of Wisconsin, Madison), "A Text-to-Picture
Synthesis System for Augmenting Communication".


[3] Bob Coyne, Richard Sproat (AT&T Labs Research), "WordsEye: An
Automatic Text-to-Scene Conversion System", Proceedings of the 28th
Annual Conference on Computer Graphics and Interactive Techniques, 2001.


[4] Richard Johansson, David Williams, Pierre Nugues, "Converting Texts
of Road Accidents into 3D Scenes", TextMean Proceedings, 2004.

Thank You!