Natural Language Processing for Video Analysis



Natural Language Processing for Action Recognition


JHU Summer School


Evelyne Tzoukermann, Ph.D.

Friday, June 11, 2010

What is the role of Natural Language in Action Recognition?

1. Provide temporal information
   Where in the video is the action happening?
2. Provide semantic information
   Parse the phrasal constituents to determine action type and human interaction through objects, instruments, and other contextual information
   E.g.: cut potatoes
   semantic representation:
     <instrument> knife
     <human interaction> hands
     <location> cutting board


Function of Natural Language in Action Recognition?

1. Facilitate action recognition from the video
2. Ground video processing
3. Extract relevant entities and semantics associated with them
4. Allow fusion of knowledge from text with action primitives

Leverage already existing techniques and knowledge


Completed

Dataset domains:
  Cooking
  Crafts
Classification of Actions
Categorization of Actions



Cooking domain

1. DVDs:
   Cook like a chef
   Martha’s Favorite Family Dinners
   Joanne Weir’s cooking class
2. CMU Kitchen dataset
3. Food Network: 12 consecutive hours of recorded time
4. PBS Kids: Sprout
   5 shows
5. URADL: U. of Rochester Activities of Daily Living
   12 activities, 5 individuals, 3 recordings each

Craft domain

PBS Kids: Sprout
  over 25 shows


Tuples of Entities

Time stamps for temporal information
Verbs - capture actions
Objects - what is acted upon
Instruments - with what tool
Location - for recognition
Camera position - for scalability
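The entity tuple described on this slide could be held in a small record type. The following is a minimal sketch only: the class name, field names, and the example times are assumptions made for illustration and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionTuple:
    """One annotated action. Field names are illustrative, not from the slides."""
    verb: str                          # captures the action
    obj: str                           # what is acted upon
    instrument: Optional[str] = None   # with what tool
    location: Optional[str] = None     # where the action happens
    camera: Optional[str] = None       # camera position, if recorded
    begin_s: float = 0.0               # start time in seconds
    end_s: float = 0.0                 # end time in seconds

    @property
    def duration_s(self) -> float:
        # Rounded to tenths to hide floating-point noise.
        return round(self.end_s - self.begin_s, 1)


# Hypothetical values for the "cut potatoes" example; the times are made up.
cut = ActionTuple("cut", "potatoes", instrument="knife",
                  location="cutting board", begin_s=12.0, end_s=15.5)
print(cut.verb, cut.obj, cut.instrument, cut.location, cut.duration_s)
```

Keeping the time stamps inside the same record as the verb and its arguments is one way to give every tuple its temporal dimension; other layouts are equally possible.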



Information Extraction

Extract structured information from unstructured documents
  Ex: "Yesterday, New-York based Foo Inc. announced their acquisition of Bar Corp."
Entity identification and recognition
Goal of IE: allow computation to be performed on unstructured data.
More specific goal: allow logical reasoning to draw inferences based on the logical content of the input data.
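To make "structured information from unstructured documents" concrete, here is a toy sketch that turns the example sentence into a record. The regular expression, the function name, and the record keys are all assumptions for illustration; practical IE systems rely on trained entity and relation extractors rather than a single hand-written pattern.

```python
import re
from typing import Dict, Optional

# A capitalized name, possibly several words, each optionally ending in "."
NAME = r"[A-Z][\w-]*\.?(?: [A-Z][\w-]*\.?)*"

# Toy pattern for "<acquirer> announced their acquisition of <target>".
ACQUISITION = re.compile(
    rf"(?P<acquirer>{NAME}) announced their acquisition of (?P<target>{NAME})"
)


def extract_acquisition(sentence: str) -> Optional[Dict[str, str]]:
    """Return a structured record, or None when the pattern does not apply."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    return {
        "relation": "acquisition",
        "acquirer": match.group("acquirer"),
        "target": match.group("target"),
    }


sentence = ("Yesterday, New-York based Foo Inc. announced "
            "their acquisition of Bar Corp.")
record = extract_acquisition(sentence)
print(record)
```

Once the sentence is reduced to such a record, computation and simple inference (for example, "who acquired whom") can run on the fields instead of the raw text.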

Entity Recognition for Video

Can be considered an IE task with a list of entities
  Find a tuple or an ordered list with a temporal dimension
Goal of text-based Information Extraction:
  “Who did what to whom where”
  Find the different entities that fill these slots
Goal of video and text IE
  Find the temporal and other entities

Angelina’s Ballet Slippers

1. Video
2. Web page

Angelina’s Ballet Slippers

Ingredients
  1 red pepper, cut in half with seeds removed
  1⁄2 cup quick cook brown rice
  1⁄2 cup vegetable stock
  1 cup canned mixed vegetables, no added salt
  1⁄4 tsp. black pepper
  1 tsp. chopped fresh parsley
  1 tsp. extra virgin olive oil
  1 lemon
  Decorative cabbage
  1⁄4 cup shredded cheddar cheese, divided

Supplies
  Measuring cups and spoons
  Cutting board & knife
  Cooking pot
  Small cooking pot
  Mixing spoons
  Slotted spoon
  High-sided baking dish
  Pastry brush
  Large serving plate


| Nr | Action | Objects | Human Interaction | Begin Time | End Time | Duration |
|---|---|---|---|---|---|---|
| 1 | Washing | Sink, Soap | Washing Hands | 00:38.2 | 00:40.6 | 00:02.4 |
| 2 | Drying | Hand Towel | Drying Hands | 00:40.6 | 00:44.4 | 00:03.7 |
| 3 | Filling | Sink, Pot | Hands fill pot with water | 00:45.3 | 00:47.2 | 00:01.9 |
| 4 | Pouring | Bowl, Broth, Pot | Child pours broth from bowl to pot | 00:48.2 | 00:51.4 | 00:03.2 |
| 5 | Firing | Stove, Pot | Hand turns on the burner | 00:54.1 | 00:57.1 | 00:03.0 |
| 6 | Cutting | Red Pepper, Knife, Cutting Board | Adult Male cuts red pepper | 00:58.1 | 01:00.0 | 00:01.9 |
| 7 | Deseeding | Red Pepper, scoop | Adult and child deseed red pepper | 01:03.0 | 01:03.9 | 00:00.8 |
| 8 | Placing | Pot, Spoon, Red Pepper | Adult places red pepper in pot | 01:09.7 | 01:12.2 | 00:02.5 |
| 9 | Adding | Bowl of Rice, Pot | Adult adds rice to pot | 01:14.2 | 01:17.7 | 00:03.4 |
| 10 | Opening | Can Opener, Can | Hands open a can | 01:20.2 | 01:23.3 | 00:03.0 |
| 11 | Tearing | Parsley, Measuring cup | Child tears off parsley leaves | 01:24.2 | 01:27.4 | 00:03.2 |
| 12 | Adding | Can, Pot | Hand adds can of veggies to pot | 01:32.0 | 01:35.0 | 00:03.0 |
| 13 | Adding | Measuring cup, Pot | Child adds parsley to pot | 01:35.6 | 01:38.2 | 00:03.0 |
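The Duration column can be derived from the Begin Time and End Time columns. The sketch below assumes the time stamps have the form MM:SS.t (minutes, seconds, tenths); the function names are made up for illustration. Working in integer tenths of a second avoids floating-point drift.

```python
def parse_timestamp(stamp: str) -> int:
    """Convert 'MM:SS.t' into tenths of a second."""
    minutes, seconds = stamp.split(":")
    whole, tenths = seconds.split(".")
    return (int(minutes) * 60 + int(whole)) * 10 + int(tenths)


def format_tenths(total: int) -> str:
    """Convert tenths of a second back into 'MM:SS.t'."""
    minutes, rest = divmod(total, 600)
    return f"{minutes:02d}:{rest // 10:02d}.{rest % 10}"


def duration(begin: str, end: str) -> str:
    return format_tenths(parse_timestamp(end) - parse_timestamp(begin))


# Three rows taken from the annotation table (action, begin, end).
rows = [
    ("Washing", "00:38.2", "00:40.6"),
    ("Cutting", "00:58.1", "01:00.0"),
    ("Placing", "01:09.7", "01:12.2"),
]
for action, begin, end in rows:
    print(action, duration(begin, end))
```

For these three rows the recomputed values agree with the table. A few other rows list a duration that differs slightly from end minus begin; the sketch does not try to account for that.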

Sprout - Alphabet book

| Action Verb | Freq | Direct Object | Instrument | Human Interaction | Location |
|---|---|---|---|---|---|
| To Thread | 1 | Thread | Hand | Both Hands | Construction Paper |
| To Tie | 1 | Thread | Hand | Both Hands | Construction Paper |
| To Write | 1 | Ink | Pen | Both Hands | Paper |
| To Decorate | 2 | Ink | Pen | Both Hands | Paper |
| To Color | 2 | Ink | Pen | Both Hands | Paper |
| To Draw | 1 | Ink | Pen | Both Hands | Paper |


Baby Picture Frames

| Crafts | Freq | Direct Object | Instrument | Human Interaction | Location |
|---|---|---|---|---|---|
| To Tape | 2 | Picture | Hand | Both Hands | Frame |
| To Glue | 2 | Glue | Hand | Both Hands | Popsicle sticks |
| To Decorate | 1 | Ink | Pen | Both Hands | Popsicle sticks |

Action Recognition and Complexity

Input
1. transcripts and closed captions
2. text transcripts alone
3. list of ingredients and utensils

Evaluation can follow these levels


Sprout - Elmo’s Funny Face Pizza

| Cooking | Freq | Direct Object | Instrument | Human Interaction | Location |
|---|---|---|---|---|---|
| To Wash | 1 | Hands | Faucet/Soap | Both Hands In action | Sink |
| To Dry | 1 | Hands | Paper Towels | Both Hands In action | Work Space |
| To Place | 1 | Bagels | Hands | Both Hands In action | Baking Sheet |
| To Spread | 1 | Sauce | Knife | Both Hands In action | Bagel |
| To Top | 1 | Olives | Hands | Both Hands In action | Bagel |
| To Cut | 1 | Peppers | Knife | Both Hands In action | Cutting Board |
| To Top | 1 | Peppers | Hands | Both Hands In action | Bagel |
| To Bake | 1 | Sheet Pan | Hands | Both Hands In action | Oven |
| To Clean | 1 | Food | Hands | Both Hands In action | Work Space |
| To Sponge | 1 | Food | Sponge | Both Hands In action | Work Space |
| To Remove | 1 | Sheet Pan | Oven Mitts | Both Hands In action | Oven |

Sprout - Caillou’s Crunchy Carrot Salad

| Cooking | Freq | Direct Object | Instrument | Human Interaction | Location |
|---|---|---|---|---|---|
| To Peel | 1 | Carrots | Peeler | Both Hands In action | Work Space |
| To Add | 1 | Apples | Hands | Both Hands In action | Bowl |
| To Measure | 1 | Raisins | Hands | Both Hands In action | Measuring Cup |
| To Mix | 1 | Salad | Spoons | Both Hands In action | Salad Bowl |
| To Cut | 1 | Lemon | Knife | Both Hands In action | Cutting Board |
| To Squeeze | 1 | Lemon | Hands | Both Hands In action | Salad Bowl |
| To Measure | 1 | Honey | Bottle | Both Hands In action | Measuring Spoon |
| To Refrigerate | 1 | Bowl | Hands | Both Hands In action | Refrigerator |
| To Clean | 2 | Food | Hands | Both Hands In action | Table |

Martha Stewart Episode 2

| Cooking | Frequency | Direct Object | Instrument | Human Interaction | Location |
|---|---|---|---|---|---|
| To Stir | 5 | Chili | Wooden Spoon | One hand | Pot |
| To Pour | 1 | Vinegar | Measuring Cup | Both hands | Food Processor |
| To Pour | 1 | Orange juice | Ramekin | Both hands | Pan |
| To Add | 1 | Salt | Hand | One hand | Pan |
| To Cut | 1 | Butter | Knife | Both hands | Butter Boat |
| To Beat | 1 | Egg | Fork | Both hands | Bowl |
| To Mix | 6 | Meatloaf | Hand | Both hands | Bowl |
| To Remove | 2 | Roast | Hand | Both hands | Crock Pot |
| To Slice | 7 | Roast | Knife | Both hands | Cutting Board |
| To Spoon | 1 | Dressing | Spoon | One hand | Plate of Oranges |
| To Spread | 2 | Mix | Hand | Both hands | Baking Dish |

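Martha Stewart

191 action verbs

| Action verb | Freq | Action verb | Freq |
|---|---|---|---|
| to pour | 33 | to spoon | 4 |
| to add | 20 | to measure | 4 |
| to stir | 17 | to glaze | 3 |
| to slice | 17 | to garnish | 2 |
| to cut | 11 | to spread | 2 |
| to place | 11 | to cover | 2 |
| to mix | 6 | to tie | 2 |
| to remove | 6 | to scrape | 2 |
| to rub | 6 | to dry | 1 |
| to turn | 6 | to beat | 1 |
| to deglaze | 6 | to broil | 1 |
| to serve | 5 | to sear | 1 |
| to whisk | 5 | to wrap | 1 |
| to top | 4 | to grate | 1 |
| to process (in a food processor) | 4 | to bake | 1 |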

Semantic Categorization of Actions

To Apply Heat: to bake, to broil, to sear
To Combine: to add, to mix, to process, to beat, to pour, to deglaze, to whisk
To Separate into one or more parts: to cut, to slice, to grate, to tear, to peel, to score
To Decorate: to top, to garnish, to spread, to glaze, to spoon, to rub
To Sanitize: to wash, to dry
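A categorization like this one can be stored as a lookup table from verb to category. The sketch below assumes one particular reading of the slide's two-column layout (five categories: apply heat, combine, separate, decorate, sanitize); the grouping, the short category keys, and the function name are assumptions, not something the slide states explicitly.

```python
from typing import Dict, List

# Assumed grouping, read off the slide's two-column table.
CATEGORIES: Dict[str, List[str]] = {
    "apply heat": ["bake", "broil", "sear"],
    "combine": ["add", "mix", "process", "beat", "pour", "deglaze", "whisk"],
    "separate": ["cut", "slice", "grate", "tear", "peel", "score"],
    "decorate": ["top", "garnish", "spread", "glaze", "spoon", "rub"],
    "sanitize": ["wash", "dry"],
}

# Invert the table so each verb points at its category.
VERB_TO_CATEGORY: Dict[str, str] = {
    verb: category
    for category, verbs in CATEGORIES.items()
    for verb in verbs
}


def categorize(verb: str) -> str:
    """Map 'To Slice' or 'slice' to its category, or 'unknown'."""
    normalized = verb.strip().lower()
    if normalized.startswith("to "):
        normalized = normalized[3:]
    return VERB_TO_CATEGORY.get(normalized, "unknown")


for v in ["To Slice", "to whisk", "To Wash", "walk"]:
    print(v, "->", categorize(v))
```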

CMU Kitchen Set - Verbs

take
put
open
fill
crack
beat
stir
pour
clean
switch_on
read
spray
close
walk
twist_on
twist_off


NLP Tools

Part-of-speech tagger or phrase chunker
Dependency parser for Verb-Object relations
We have tuples of Verb, Object, Instrument, Location
  Ex: Stir (v) chili (o) with a wooden spoon (instr) in a pot (loc)
Collocations for Instrument and Location
Co-occurrence from Google
  Ex: “place a wooden spoon across the pot to keep it from boiling”
And more
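The Verb, Object, Instrument, Location tuple from the "Stir chili" example can be illustrated with a small rule-based extractor. This is a sketch under strong assumptions: it only handles short imperative sentences of the form "verb object [with a instrument] [in/on a location]", and the pattern and function name are invented here. The tools named on the slide (a tagger, chunker, or dependency parser) would replace the regular expression in practice.

```python
import re
from typing import Dict, Optional

# verb, object, then optional "with a <instrument>" and "in/on a <location>"
PATTERN = re.compile(
    r"^(?P<verb>\w+)\s+(?P<object>.+?)"
    r"(?:\s+with\s+(?:a|an|the)\s+(?P<instrument>.+?))?"
    r"(?:\s+(?:in|on)\s+(?:a|an|the)\s+(?P<location>.+?))?$",
    re.IGNORECASE,
)


def extract_tuple(sentence: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the (verb, object, instrument, location) slots, or None."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return {
        "verb": match.group("verb").lower(),
        "object": match.group("object"),
        "instrument": match.group("instrument"),  # None when absent
        "location": match.group("location"),      # None when absent
    }


print(extract_tuple("Stir chili with a wooden spoon in a pot"))
print(extract_tuple("Cut potatoes"))
```

When the sentence does not name an instrument or location, as in "Cut potatoes", those slots stay empty; filling them from collocations is the role of the co-occurrence step listed on the slide.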


Ontology

Need to capture:
  Concepts
  Relationships
  Properties
  Timestamps (video_name [beg_time, end_time])
Validation

Ontology for cooking and craft

Need to capture:
  Actions
  Food (including the state and transformation)
  or Objects (paper, paper roll, …)
  Instruments: kitchen utensils, scissors, crayons
  Location
  Timing
  (Recipes)


Ontology

Use of Protégé (http://protege.stanford.edu/): ontology editor and knowledge-base framework.
Knowtator: Protégé plug-in for annotation
  can be used for evaluating or training a variety of NLP systems.
Write a plug-in that takes the output of a syntactic parser and connects it to visual frames

Protégé knowledge-base

class: represents the concepts of a domain, organized in a subsumption hierarchy
instance: corresponds to an individual of a class
slot: defines properties of a class or instance
facet: constrains the values that slots can have
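The four frame types can be mimicked in a few lines of plain Python to show how they relate. This is not the Protégé API: every class, method, and value below is a hypothetical stand-in, and the only facet modeled is a set of allowed values.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Slot:
    name: str
    allowed_values: Optional[Set[str]] = None  # facet: restricts slot values

    def accepts(self, value: str) -> bool:
        return self.allowed_values is None or value in self.allowed_values


@dataclass
class FrameClass:
    name: str
    parent: Optional["FrameClass"] = None       # subsumption hierarchy
    slots: Dict[str, Slot] = field(default_factory=dict)

    def is_a(self, other: "FrameClass") -> bool:
        node: Optional[FrameClass] = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False

    def all_slots(self) -> Dict[str, Slot]:
        inherited = self.parent.all_slots() if self.parent else {}
        return {**inherited, **self.slots}


@dataclass
class Instance:
    cls: FrameClass
    values: Dict[str, str] = field(default_factory=dict)

    def set(self, slot_name: str, value: str) -> None:
        slot = self.cls.all_slots().get(slot_name)
        if slot is None:
            raise KeyError(f"{self.cls.name} has no slot {slot_name!r}")
        if not slot.accepts(value):
            raise ValueError(f"{value!r} violates a facet of {slot_name!r}")
        self.values[slot_name] = value


# Hypothetical mini-ontology: Cut is a kind of Action and inherits its slot.
action = FrameClass("Action", slots={
    "instrument": Slot("instrument", {"knife", "wooden spoon"}),
})
cut = FrameClass("Cut", parent=action)
cut_potatoes = Instance(cut)
cut_potatoes.set("instrument", "knife")
print(cut.is_a(action), cut_potatoes.values)
```

Setting the slot to a value outside the allowed set raises `ValueError`, which is the behavior a facet is meant to enforce.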

Dependency Parser

Input Sentence: “Next we need to open the can of veggies”

ROOT [next-1] (
  SBAR [next-1] (
    next-1(Next)/IN
    S [need-6] (
      NP [we-3] (
        we-3/PRP
      )
      VP [need-6] (
        need-6/VBP
        S [to-8] (
          VP [to-8] (
            to-8/TO
            VP [open-10] (
              open-10/VB
              NP [can-14] (
                NP [can-14] (
                  the-12/DT
                  can-14/NN
                )
                PP [of-17] (
                  of-17/IN
                  NP [veggy-19] (
                    veggy-19(veggies)/NNS
                  )
                )
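A head-annotated tree like the one above already contains the Verb-Object relation: the VP headed by "open" has an NP child headed by "can". The sketch below transcribes the S subtree by hand into nested tuples and walks it. The tuple encoding and the function names are assumptions made for illustration and do not reflect the parser's actual output format.

```python
from typing import List, Tuple

# Each node is either a leaf string "word-index/TAG"
# or a tuple (label, head, children).
TREE = ("S", "need-6", [
    ("NP", "we-3", ["we-3/PRP"]),
    ("VP", "need-6", [
        "need-6/VBP",
        ("S", "to-8", [
            ("VP", "to-8", [
                "to-8/TO",
                ("VP", "open-10", [
                    "open-10/VB",
                    ("NP", "can-14", [
                        ("NP", "can-14", ["the-12/DT", "can-14/NN"]),
                        ("PP", "of-17", [
                            "of-17/IN",
                            ("NP", "veggy-19", ["veggy-19(veggies)/NNS"]),
                        ]),
                    ]),
                ]),
            ]),
        ]),
    ]),
])


def word(head: str) -> str:
    """Strip the position suffix: 'open-10' -> 'open'."""
    return head.rsplit("-", 1)[0]


def verb_object_pairs(node) -> List[Tuple[str, str]]:
    """Collect (verb, object head) for every VP with a verb leaf and an NP child."""
    if isinstance(node, str):
        return []
    label, head, children = node
    pairs: List[Tuple[str, str]] = []
    if label == "VP":
        has_verb = any(isinstance(c, str) and c.split("/")[1].startswith("VB")
                       for c in children)
        if has_verb:
            for child in children:
                if not isinstance(child, str) and child[0] == "NP":
                    pairs.append((word(head), word(child[1])))
    for child in children:
        pairs.extend(verb_object_pairs(child))
    return pairs


print(verb_object_pairs(TREE))
```

The VP headed by "need" yields no pair because its complement is a clause rather than an NP, so only the open/can relation is reported.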




Action concept and relations with other concepts

[Diagram; node labels: Action, Verb, Human Interaction, Instrument, Location, Time, Vn,t1,t2, Object]

Knowtator: Annotation Plug-in

General purpose annotation tool
Facilitates creation of training and evaluation corpora for language processing tasks
Ease of use
Straightforward to incorporate domain knowledge

Knowtator: an example

Processes

Syntactic Parser
Ontology Creation
Ontology Annotation
Corpus enrichment using collocations
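Corpus enrichment through collocations can be pictured as counting how often an instrument and a location are mentioned in the same sentence. The sketch below uses a three-sentence toy corpus: the first two sentences come from examples in the slides, the third is made up, and the term lists and function names are assumptions. It does not query a search engine.

```python
import re
from collections import Counter
from itertools import product
from typing import Iterable, List

INSTRUMENTS = ["knife", "wooden spoon"]
LOCATIONS = ["cutting board", "pot"]


def mentioned(terms: List[str], text: str) -> List[str]:
    """Terms that occur in the text as whole words or phrases."""
    return [t for t in terms if re.search(rf"\b{re.escape(t)}\b", text)]


def cooccurrences(sentences: Iterable[str]) -> Counter:
    """Count (instrument, location) pairs mentioned in the same sentence."""
    counts: Counter = Counter()
    for sentence in sentences:
        text = sentence.lower()
        counts.update(product(mentioned(INSTRUMENTS, text),
                              mentioned(LOCATIONS, text)))
    return counts


corpus = [
    "Place a wooden spoon across the pot to keep it from boiling",
    "Stir chili with a wooden spoon in a pot",
    "Slice the roast with a knife on the cutting board",  # made-up sentence
]
counts = cooccurrences(corpus)
for pair, n in counts.most_common():
    print(pair, n)
```

Pairs with high counts are candidates for filling a missing Instrument or Location slot when a transcript names only the verb and object.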

Related Research

1. Ontology and cooking
2. Parsing “restricted” languages
3. Connecting text with images

Related Research

Dina Demner-Fushman, Sameer Antani, Matthew Simpson, George R. Thoma, “Annotation and retrieval of clinically relevant images”, 2009
Ricardo Ribeiro, Fernando Batista, Joana Paulo Pardal, Nuno J. Mamede, and H. Sofia Pinto, “Cooking an Ontology?”, 2008
Fernando Batista, Joana Paulo, Nuno Mamede, Paula Vaz, Ricardo Ribeiro, “Ontology construction: cooking domain”, 2006
Joana Paulo Pardal, “Dynamic Use of Ontologies in Dialogue Systems”, 2009

Related Research

Mutsuo Sano, Ichiro Ide, Kenzaburo Miyawaki, “Overview of the ACM Multimedia 2009 Workshop on Multimedia for Cooking and Eating Activities (CEA’09)”
Keigo Kitamura, Toshihiko Yamasaki, Kiyoharu Aizawa, “FoodLog: Capture, Analysis and Retrieval of Personal Food Images via Web”, 2009
  distinguishes food images from other images
Dan Tasse and Noah Smith (CMU), “SOUR CREAM: Toward Semantic Processing of Recipes”, 2008
  new techniques for semantic parsing by focusing on the domain of cooking recipes
  first order logic