hands - Center for Language and Speech Processing

Natural Language Processing for
Action Recognition
JHU Summer School
Evelyne Tzoukermann, Ph.D.
Friday, June 11, 2010
What is the role of Natural Language
in Action Recognition?
1. Provide temporal information
– Where in the video is the action happening?
2. Provide semantic information
– Parse the phrasal constituents to determine
action type and human interaction through
objects, instruments, and other contextual
information
– E.g.: cut potatoes → semantic representation
• <instrument> knife
• <human interaction> hands
• <location> cutting board
Function of Natural Language
in Action Recognition
1. Facilitate action recognition from the video.
2. Ground video processing
3. Extract relevant entities and semantics
associated with them
4. Allow fusion of knowledge from text with
action primitives
→ Leverage already existing techniques and knowledge
Completed
• Dataset domains:
– Cooking
– Crafts
• Classification of Actions
• Categorization of Actions
Cooking domain
1. DVDs:
– Cook like a chef
– Martha’s Favorite Family Dinners
– Joanne Weir’s Cooking Class
2. CMU Kitchen dataset
3. Food Network: 12 consecutive hours of recorded
time
4. PBS Kids: Sprout – 5 shows
5. URADL: U. of Rochester Activities of Daily Living
– 12 activities, 5 individuals, 3 recordings each
Craft domain
• PBS Kids: Sprout – over 25 shows
Tuples of Entities
– Time stamps for temporal information
– Verbs - capture actions
– Objects - what is acted upon
– Instruments - with what tool
– Location – for recognition
– Camera position – for scalability
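The tuple of entities listed above can be pictured as a small record type. This is only a sketch: the class name `ActionTuple`, its field names, and the clip identifier are assumptions made for illustration, not the annotation format used in the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ActionTuple:
    """One annotated action; all field names are illustrative assumptions."""
    video: str                        # hypothetical clip identifier
    begin: float                      # time stamp: start, in seconds
    end: float                        # time stamp: end, in seconds
    verb: str                         # captures the action
    obj: str                          # what is acted upon
    instrument: Optional[str] = None  # with what tool
    location: Optional[str] = None    # where the action happens
    camera: Optional[str] = None      # camera position, if annotated


# Example values modeled on the cooking domain.
cut = ActionTuple(
    video="example_clip",
    begin=58.1,
    end=60.0,
    verb="cut",
    obj="red pepper",
    instrument="knife",
    location="cutting board",
)
print(cut.verb, cut.obj, cut.instrument, cut.location)
```

Optional fields default to `None`, so a tuple can be stored even when the instrument, location, or camera position is not annotated.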
Information Extraction
• Extract structured information from unstructured
documents
Ex: “Yesterday, New York-based Foo Inc. announced
their acquisition of Bar Corp.”
Entity identification and recognition
• Goal of IE: allow computation to be performed on
unstructured data.
• More specific goal: allow logical reasoning to
draw inferences based on the logical content of
the input data.
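To make the idea concrete, the acquisition sentence can be turned into a structured record. The sketch below uses a single hand-written regular expression, which is an assumption chosen for brevity; real IE systems rely on entity recognizers and parsers rather than one pattern. The names `COMPANY`, `PATTERN`, and `extract_acquisition` are hypothetical.

```python
import re

SENTENCE = (
    "Yesterday, New York-based Foo Inc. announced "
    "their acquisition of Bar Corp."
)

# A capitalized word followed by a company suffix, e.g. "Foo Inc."
COMPANY = r"[A-Z][A-Za-z]*(?: (?:Inc|Corp|Ltd)\.)"
PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY}) announced (?:their|its) "
    rf"acquisition of (?P<acquired>{COMPANY})"
)


def extract_acquisition(text):
    """Return a structured record for an acquisition, or None if absent."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {
        "relation": "acquisition",
        "acquirer": match.group("acquirer"),
        "acquired": match.group("acquired"),
    }


record = extract_acquisition(SENTENCE)
print(record)
```

Once the sentence is in this form, a program can compute over it, for instance by listing every company that a given acquirer has bought.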
Entity Recognition for Video
• Can be considered an IE task with a list of
entities
• Find a tuple or an ordered list with a temporal
dimension
• Goal of text-based Information Extraction:
“Who did what to whom where”
– Find the different entities that fill these slots
• Goal of video and text IE
– Find the temporal and other entities
Angelina’s Ballet Slippers
1. Video
2. Web page
Angelina’s Ballet Slippers
Ingredients
• 1 red pepper, cut in half with seeds removed
• 1⁄2 cup quick cook brown rice
• 1⁄2 cup vegetable stock
• 1 cup canned mixed vegetables, no added salt
• 1⁄4 tsp. black pepper
• 1 tsp. chopped fresh parsley
• 1 tsp. extra virgin olive oil
• 1 lemon
• Decorative cabbage
• 1⁄4 cup shredded cheddar cheese, divided
Supplies
• Measuring cups and spoons
• Cutting board & knife
• Cooking pot
• Small cooking pot
• Mixing spoons
• Slotted spoon
• High-sided baking dish
• Pastry brush
• Large serving plate
Nr | Action | Objects | Human Interaction | Begin Time | End Time | Duration
1 | Washing | Sink, Soap | Washing Hands | 00:38.2 | 00:40.6 | 00:02.4
2 | Drying | Hand Towel | Drying Hands | 00:40.6 | 00:44.4 | 00:03.7
3 | Filling | Sink, Pot | Hands fill pot with water | 00:45.3 | 00:47.2 | 00:01.9
4 | Pouring | Bowl, Broth, Pot | Child pours broth from bowl to pot | 00:48.2 | 00:51.4 | 00:03.2
5 | Firing | Stove, Pot | Hand turns on the burner | 00:54.1 | 00:57.1 | 00:03.0
6 | Cutting | Red Pepper, Knife, Cutting Board | Adult Male cuts red pepper | 00:58.1 | 01:00.0 | 00:01.9
7 | Deseeding | Red Pepper, Scoop | Adult and child deseed red pepper | 01:03.0 | 01:03.9 | 00:00.8
8 | Placing | Pot, Spoon, Red Pepper | Adult places red pepper in pot | 01:09.7 | 01:12.2 | 00:02.5
9 | Adding | Bowl of Rice, Pot | Adult adds rice to pot | 01:14.2 | 01:17.7 | 00:03.4
10 | Opening | Can Opener, Can | Hands open a can | 01:20.2 | 01:23.3 | 00:03.0
11 | Tearing | Parsley, Measuring cup | Child tears off parsley leaves | 01:24.2 | 01:27.4 | 00:03.2
12 | Adding | Can, Pot | Hand adds can of veggies to pot | 01:32.0 | 01:35.0 | 00:03.0
13 | Adding | Measuring cup, Pot | Child adds parsley to pot | 01:35.6 | 01:38.2 | 00:03.0
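The Duration column relates to the begin and end time stamps by simple subtraction. The sketch below assumes the stamps are in `mm:ss.s` form, as they appear in the table; the function names are hypothetical.

```python
def to_seconds(stamp: str) -> float:
    """Convert an 'mm:ss.s' time stamp into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


def duration(begin: str, end: str) -> float:
    """Length of an action in seconds, rounded to one decimal place."""
    return round(to_seconds(end) - to_seconds(begin), 1)


washing = duration("00:38.2", "00:40.6")   # row 1: Washing
cutting = duration("00:58.1", "01:00.0")   # row 6: Cutting
print(washing, cutting)
```

A few durations in the table differ slightly from this difference (row 2 lists 00:03.7 where the displayed stamps give 3.8 seconds), which may be because the displayed stamps are themselves rounded.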
Sprout - Alphabet book
Action Verb | Freq | Direct Object | Instrument | Human Interaction | Location
To Thread | 1 | Thread | Hand | Both Hands | Construction Paper
To Tie | 1 | Thread | Hand | Both Hands | Construction Paper
To Write | 1 | Ink | Pen | Both Hands | Paper
To Decorate | 2 | Ink | Pen | Both Hands | Paper
To Color | 2 | Ink | Pen | Both Hands | Paper
To Draw | 1 | Ink | Pen | Both Hands | Paper
Baby Picture Frames
Crafts | Freq | Direct Object | Instrument | Human Interaction | Location
To Tape | 2 | Picture | Hand | Both Hands | Frame
To Glue | 2 | Glue | Hand | Both Hands | Popsicle sticks
To Decorate | 1 | Ink | Pen | Both Hands | Popsicle sticks
Action Recognition and Complexity
Input
1. transcripts and closed captions
2. text transcripts alone
3. list of ingredients and utensils
→ Evaluation can follow these levels
Sprout – Elmo’s Funny Face Pizza
Cooking | Freq | Direct Object | Instrument | Human Interaction | Location
To Wash | 1 | Hands | Faucet/Soap | Both Hands in action | Sink
To Dry | 1 | Hands | Paper Towels | Both Hands in action | Work Space
To Place | 1 | Bagels | Hands | Both Hands in action | Baking Sheet
To Spread | 1 | Sauce | Knife | Both Hands in action | Bagel
To Top | 1 | Olives | Hands | Both Hands in action | Bagel
To Cut | 1 | Peppers | Knife | Both Hands in action | Cutting Board
To Top | 1 | Peppers | Hands | Both Hands in action | Bagel
To Bake | 1 | Sheet Pan | Hands | Both Hands in action | Oven
To Clean | 1 | Food | Hands | Both Hands in action | Work Space
To Sponge | 1 | Food | Sponge | Both Hands in action | Work Space
To Remove | 1 | Sheet Pan | Oven Mitts | Both Hands in action | Oven
Sprout – Caillou’s Crunchy Carrot Salad
Cooking | Freq | Direct Object | Instrument | Human Interaction | Location
To Peel | 1 | Carrots | Peeler | Both Hands in action | Work Space
To Add | 1 | Apples | Hands | Both Hands in action | Bowl
To Measure | 1 | Raisins | Hands | Both Hands in action | Measuring Cup
To Mix | 1 | Salad | Spoons | Both Hands in action | Salad Bowl
To Cut | 1 | Lemon | Knife | Both Hands in action | Cutting Board
To Squeeze | 1 | Lemon | Hands | Both Hands in action | Salad Bowl
To Measure | 1 | Honey | Bottle | Both Hands in action | Measuring Spoon
To Refrigerate | 1 | Bowl | Hands | Both Hands in action | Refrigerator
To Clean | 2 | Food | Hands | Both Hands in action | Table
Martha Stewart Episode 2
Cooking | Frequency | Direct Object | Instrument | Human Interaction | Location
To Stir | 5 | Chili | Wooden Spoon | One hand | Pot
To Pour | 1 | Vinegar | Measuring Cup | Both hands | Food Processor
To Pour | 1 | Orange juice | Ramekin | Both hands | Pan
To Add | 1 | Salt | Hand | One hand | Pan
To Cut | 1 | Butter | Knife | Both hands | Butter Boat
To Beat | 1 | Egg | Fork | Both hands | Bowl
To Mix | 6 | Meatloaf | Hand | Both hands | Bowl
To Remove | 2 | Roast | Hand | Both hands | Crock Pot
To Slice | 7 | Roast | Knife | Both hands | Cutting Board
To Spoon | 1 | Dressing | Spoon | One hand | Plate of Oranges
To Spread | 2 | Mix | Hand | Both hands | Baking Dish
Martha Stewart – 191 action verbs
Verb | Count
to pour | 33
to add | 20
to stir | 17
to slice | 17
to cut | 11
to place | 11
to mix | 6
to remove | 6
to rub | 6
to turn | 6
to deglaze | 6
to serve | 5
to whisk | 5
to top | 4
to process (in a food processor) | 4
to spoon | 4
to measure | 4
to glaze | 3
to garnish | 2
to spread | 2
to cover | 2
to tie | 2
to scrape | 2
to dry | 1
to beat | 1
to broil | 1
to sear | 1
to wrap | 1
to grate | 1
to bake | 1
Semantic Categorization of Actions
• To Apply Heat: bake, broil, sear
• To Combine: add, mix, process, beat, pour, deglaze, whisk
• To Separate into one or more parts: cut, slice, grate, tear, peel, score
• To Decorate: top, garnish, spread, glaze, spoon, rub
• To Sanitize: wash, dry
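A categorization like this one can be applied to annotated verbs with a simple lookup. The sketch below is illustrative: the assignment of verbs to categories is one reading of the two-column slide layout, and the short category keys and the function name `categorize` are assumptions.

```python
# Category membership as read from the slide (an assumption where the
# flattened layout is ambiguous).
CATEGORIES = {
    "apply heat": ["bake", "broil", "sear"],
    "combine": ["add", "mix", "process", "beat", "pour", "deglaze", "whisk"],
    "separate": ["cut", "slice", "grate", "tear", "peel", "score"],
    "decorate": ["top", "garnish", "spread", "glaze", "spoon", "rub"],
    "sanitize": ["wash", "dry"],
}

# Invert the mapping so each verb points at its category.
VERB_TO_CATEGORY = {
    verb: category
    for category, verbs in CATEGORIES.items()
    for verb in verbs
}


def categorize(verb: str):
    """Return the semantic category of a verb such as 'To Slice', or None."""
    key = verb.lower().strip()
    if key.startswith("to "):
        key = key[3:]
    return VERB_TO_CATEGORY.get(key)


print(categorize("To Slice"), categorize("to sear"), categorize("walk"))
```

Verbs outside the listed categories, such as "walk", return `None` and could be flagged for manual review.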
CMU Kitchen Set - Verbs
– take
– put
– open
– fill
– crack
– beat
– stir
– pour
– clean
– switch_on
– read
– spray
– close
– walk
– twist_on
– twist_off
NLP Tools
• Part-of-speech tagger or phrase chunker
• Dependency parser for Verb-Object relations
– We have tuples of Verb, Object, Instrument, Location
– Ex: Stir (v) chili (o) with a wooden spoon (instr) in a
pot (loc)
• Collocations for Instrument and Location
– Co-occurrence from Google
– Ex: “place a wooden spoon across the pot to keep it
from boiling”
• And more
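The Verb, Object, Instrument, Location tuple can be illustrated on the "Stir chili" example. The sketch below is a deliberately simplified stand-in, not a dependency parser: it assumes imperative sentences in which "with" introduces the instrument and "in" introduces the location. All names are hypothetical.

```python
import re

# Imperative pattern: VERB OBJECT [with INSTRUMENT] [in LOCATION]
PATTERN = re.compile(
    r"^(?P<verb>\w+)\s+(?P<obj>.+?)"
    r"(?:\s+with\s+(?P<instrument>.+?))?"
    r"(?:\s+in\s+(?P<location>.+?))?$",
    re.IGNORECASE,
)
ARTICLE = re.compile(r"^(?:a|an|the)\s+", re.IGNORECASE)


def clean(phrase):
    """Drop a leading article and lowercase; pass None through."""
    return ARTICLE.sub("", phrase).lower() if phrase else None


def extract_tuple(sentence):
    """Return (verb, object, instrument, location) or None if no match."""
    match = PATTERN.match(sentence.strip().rstrip("."))
    if match is None:
        return None
    return tuple(
        clean(match.group(name))
        for name in ("verb", "obj", "instrument", "location")
    )


result = extract_tuple("Stir chili with a wooden spoon in a pot")
print(result)
```

A parser-based version would read the same roles off Verb-Object and prepositional relations, which handles word orders that this pattern cannot.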
Ontology
• Need to capture:
– Concepts
– Relationships
– Properties
– Timestamps (video_name [beg_time, end_time])
– Validation
Ontology for cooking and craft
• Need to capture:
– Actions
– Food – including the state and transformation
or
– Objects – paper, paper roll, …
– Instruments: kitchen utensils, scissors, crayons
– Location
– Timing
– (Recipes)
Ontology
• Use of Protégé http://protege.stanford.edu/
– ontology editor and knowledge-base framework.
• Knowtator: Protégé plug-in for annotation
– can be used for evaluating or training a variety of NLP systems.
• Write a plug-in that takes the output of a
syntactic parser and connects it to visual frames
Protégé knowledge-base
• Classes
– represent the concepts of a domain
– are organized in a subsumption hierarchy
• Instances: correspond to individuals of a class
• Slots: define properties of a class or instance
• Facets: constrain the values that slots can have
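The four notions above (class, instance, slot, facet) can be mimicked in a few lines of plain Python. This is a toy model for intuition only and does not use or reproduce the Protégé API; `Slot`, `FrameClass`, and the example classes are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass
class Slot:
    """A property of a class; value_type and allowed act as facets."""
    name: str
    value_type: type
    allowed: Optional[FrozenSet[str]] = None


@dataclass
class FrameClass:
    """A concept in a subsumption hierarchy (child is-a parent)."""
    name: str
    parent: Optional["FrameClass"] = None
    slots: Dict[str, Slot] = field(default_factory=dict)

    def all_slots(self) -> Dict[str, Slot]:
        # Slots are inherited from ancestors in the hierarchy.
        inherited = self.parent.all_slots() if self.parent is not None else {}
        return {**inherited, **self.slots}

    def instantiate(self, **values):
        """Create an instance after checking every value against its facets."""
        slots = self.all_slots()
        for key, value in values.items():
            if key not in slots:
                raise KeyError(f"{self.name} has no slot {key!r}")
            slot = slots[key]
            if not isinstance(value, slot.value_type):
                raise TypeError(f"{key} must be {slot.value_type.__name__}")
            if slot.allowed is not None and value not in slot.allowed:
                raise ValueError(f"{value!r} is not allowed for {key}")
        return {"class": self.name, **values}


action = FrameClass("Action", slots={"verb": Slot("verb", str)})
cutting = FrameClass(
    "Cutting",
    parent=action,
    slots={"instrument": Slot("instrument", str, frozenset({"knife", "scissors"}))},
)
instance = cutting.instantiate(verb="cut", instrument="knife")
print(instance)
```

Here `Cutting` inherits the `verb` slot from `Action`, and the facet on `instrument` rejects values outside the allowed set.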
Dependency Parser
Input Sentence: “Next we need to open the can of veggies”
ROOT [next-1]
( SBAR [next-1]
  ( next-1(Next)/IN
    S [need-6] (
      NP [we-3] (
        we-3/PRP
      )
      VP [need-6] (
        need-6/VBP
        S [to-8] (
          VP [to-8] (
            to-8/TO
            VP [open-10] (
              open-10/VB
              NP [can-14] (
                NP [can-14] (
                  the-12/DT
                  can-14/NN
                )
                PP [of-17] (
                  of-17/IN
                  NP [veggy-19] (
                    veggy-19(veggies)/NNS
                  )
                )
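A Verb-Object pair can be read off the parse above with a simple heuristic. The sketch below assumes the head-annotated `LABEL [head-index]` notation shown on the slide and treats the last VP in the listing as the action and the first NP after it as its object; this is an illustrative shortcut rather than the method used in the project, and the names are hypothetical.

```python
import re

# Phrase nodes copied from the parse of
# "Next we need to open the can of veggies".
PARSE = """
ROOT [next-1]
SBAR [next-1]
S [need-6]
NP [we-3]
VP [need-6]
S [to-8]
VP [to-8]
VP [open-10]
NP [can-14]
NP [can-14]
PP [of-17]
NP [veggy-19]
"""

# Matches phrase nodes such as "VP [open-10]".
NODE_RE = re.compile(r"\b([A-Z]+) \[([a-z]+)-(\d+)\]")


def verb_object(parse_text):
    """Return (verb, object_head) from the last VP and the next NP."""
    nodes = [(m.group(1), m.group(2)) for m in NODE_RE.finditer(parse_text)]
    vp_positions = [i for i, (label, _) in enumerate(nodes) if label == "VP"]
    if not vp_positions:
        return None
    last_vp = vp_positions[-1]
    verb = nodes[last_vp][1]
    obj = next(
        (head for label, head in nodes[last_vp + 1:] if label == "NP"),
        None,
    )
    return verb, obj


pair = verb_object(PARSE)
print(pair)
```

For this sentence the heuristic skips the auxiliary material ("need", "to") and lands on the verb "open" with the object head "can".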
Action concept and relations with
other concepts
Action
Verb
Object
Human Interaction
Instrument
Location
Time
Vn,t1,t2
Knowtator: Annotation Plug-in
• General purpose annotation tool
• Facilitates creation of training and evaluation
corpora for language processing tasks
• Ease of use
• Straightforward to incorporate domain
knowledge
Knowtator: an example
Processes
• Ontology creation
• Syntactic parser
• Ontology annotation
• Corpus enrichment using collocations
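Corpus enrichment using collocations can be pictured as counting which domain terms appear together in the same sentence. The sketch below is a minimal illustration over a made-up three-sentence corpus; the vocabulary, the sentences, and the function name are assumptions, and a real system would use much larger counts (for example from web search).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences, vocabulary):
    """Count sentence-level co-occurrences of vocabulary terms."""
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower().replace(",", " ").replace(".", " ")
        present = sorted(set(text.split()) & vocabulary)
        # Each unordered pair of terms in the sentence counts once.
        counts.update(combinations(present, 2))
    return counts


SENTENCES = [
    "Place a wooden spoon across the pot.",
    "Stir the chili in the pot with a spoon.",
    "Cut the pepper with a knife.",
]
VOCABULARY = {"spoon", "pot", "knife", "chili", "pepper"}

counts = cooccurrence_counts(SENTENCES, VOCABULARY)
print(counts.most_common(3))
```

Pairs with high counts, such as "pot" and "spoon" here, suggest likely instrument and location fillers for an action.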
Related Research
1. Ontology and cooking
2. Parsing “restricted” languages
3. Connecting text with images
Related Research
• Dina Demner-Fushman, Sameer Antani, Matthew
Simpson, George R. Thoma, “Annotation and
retrieval of clinically relevant images”, 2009
• Ricardo Ribeiro, Fernando Batista, Joana Paulo
Pardal, Nuno J. Mamede, and H. Sofia Pinto,
“Cooking an Ontology?”, 2008
• Fernando Batista, Joana Paulo, Nuno Mamede,
Paula Vaz, Ricardo Ribeiro, “Ontology
construction: cooking domain”, 2006
• Joana Paulo Pardal, “Dynamic Use of Ontologies
in Dialogue Systems”, 2009
Related Research
• Mutsuo Sano, Ichiro Ide, Kenzaburo Miyawaki, “Overview of
the ACM Multimedia 2009 Workshop on Multimedia for
Cooking and Eating Activities (CEA’09)”
• Keigo Kitamura, Toshihiko Yamasaki, Kiyoharu Aizawa,
“FoodLog: Capture, Analysis and Retrieval of Personal
Food Images via Web”, 2009
– distinguishes food images from other images
• Dan Tasse and Noah Smith (CMU), “SOUR
CREAM: Toward Semantic Processing of Recipes”,
2008
– new techniques for semantic parsing by focusing on the
domain of cooking recipes
– first order logic