
How Machines Learn to Talk
Amitabha Mukerjee
IIT Kanpur
work done with:
Computer Vision: Profs. C. Venkatesh, Pabitra Mitra
Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela
Natural Language: Prof. Achla Raina,
V. Shreeniwas
Robotics
Collaborations:
IGCAR Kalpakkam
Sanjay Gandhi PG Medical Hospital
Visual Robot Navigation
Time-to-Collision based Robot Navigation
Hyper-Redundant Manipulators
The same manipulator can work in changing workspaces
• Reconfigurable Workspaces / Emergency Access
• Optimal Design of Hyper-Redundant Systems – SCARA and 3D
Planar Hyper-Redundancy
4-link Planar Robot
Motion Planning
Micro-Robots
• Micro Soccer Robots (1999-)
• 8 cm Smart Surveillance Robot – 1 m/s
• Autonomous Flying Robot (2004)
• Omnidirectional platform (2002)
Omni-Directional Robot
Sponsor: DST
[email protected]
Flying Robot
Start-Up at IIT Kanpur: Whirligig Robotics
heli-flight.wmv
Test Flight of UAV. Inertial Measurement Unit (IMU) under commercial production
Tracheal Intubation Device
Assists surgeon while inserting breathing tube during general anaesthesia
[Device schematic labels: aperture for oxygenation tube, ball-and-socket joint, aperture for fibre-optic video cable, endotracheal tube aperture, hole for suction tube, control cables, attachment points]
Device for Intubation during General Anaesthesia
Sponsor: DST / SGPGM
[email protected]
Draupadi’s Swayamvar
Can the Arrow hit the rotating mark?
Sponsor: Media Lab Asia
High DOF Motion Planning
• Accessing hard-to-reach spaces
• Design of Hyper-Redundant Systems
• Parallel Manipulators
Sponsor: BRNS / MHRD
[email protected]
10-link 3D Robot – Optimal Design
Multimodal Language Acquisition
Consider a child observing a scene together with adults talking about it.
Grounded Language: symbols are grounded in perceptual signals.
Use of simple videos with boxes and simple shapes – of the kind commonly used in social psychology.
Objective
To develop a computational framework for Multimodal Language Acquisition:
• acquire the perceptual structure corresponding to verbs
• use Recurrent Neural Networks as a biologically plausible model for temporal abstraction
• adapt the learned model to interpret activities in real videos
Visually Grounded Corpus
• Two psychological research films, one based on the classic Heider & Simmel (1944) animation and the other based on Hide & Seek
• These animations portray the motion paths of geometric figures (Big Square, Small Square and Circle)
• Chase Alt
Cognate clustering
• Similarity Clustering: different expressions for the same action, e.g. "move away from center" vs. "go to a corner"
• Frequency: remove infrequent lexical units
• Synonymy: sets of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents
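A minimal sketch of how such cognate clustering could be implemented, assuming each lexical unit is represented by the set of frame indices in which it is uttered; the function names and thresholds below are illustrative, not the actual parameters used in this work.

```python
def jaccard(a, b):
    """Overlap between two sets of frame indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_cognates(usage, min_count=3, sim_threshold=0.6):
    """usage: lexical unit -> set of frame indices in which it is uttered."""
    # Frequency: remove infrequent lexical units
    frequent = {w: f for w, f in usage.items() if len(f) >= min_count}
    # Synonymy: greedily group units used consistently over the same intervals
    clusters = []
    for word, frames in frequent.items():
        for cluster in clusters:
            if jaccard(frames, cluster["frames"]) >= sim_threshold:
                cluster["words"].add(word)
                cluster["frames"] |= frames
                break
        else:
            clusters.append({"words": {word}, "frames": set(frames)})
    return clusters

# Two expressions for the same action end up in one cluster
usage = {
    "move away":    {10, 11, 12, 13, 40, 41},
    "go to corner": {11, 12, 13, 40, 41, 42},
    "hit":          {70, 71, 72},
    "bump":         {70},                       # too infrequent: dropped
}
print(cluster_cognates(usage, min_count=2))
```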
Perceptual Process
[Pipeline diagram] Multimodal input (video + descriptions): Video → Feature Extraction → Features; Descriptions → Cognate Clustering → Events; Features + Events → VICES → Trained Simple Recurrent Network
Design of Feature Set
• The features selected here are related to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity, etc.
• Use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones (an illustrative computation follows the feature lists below)
Monadic Features
Dyadic Predicates
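A rough illustration of what such monadic and dyadic features might look like when computed from per-frame bounding boxes; the particular quantities, names and time step below are assumptions, not the exact feature set used in the work.

```python
import math

def centroid(box):
    """box = (x, y, w, h) in image coordinates."""
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def monadic_features(prev_box, cur_box, dt=1.0):
    """Per-object features: position, size and velocity between two frames."""
    (px, py), (cx, cy) = centroid(prev_box), centroid(cur_box)
    vx, vy = (cx - px) / dt, (cy - py) / dt
    return {"x": cx, "y": cy, "area": cur_box[2] * cur_box[3],
            "speed": math.hypot(vx, vy), "heading": math.atan2(vy, vx)}

def dyadic_features(box_a, box_b, prev_dist=None):
    """Pairwise features: distance, relative bearing, approach/recede."""
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    dist = math.hypot(bx - ax, by - ay)
    feats = {"distance": dist, "bearing": math.atan2(by - ay, bx - ax)}
    if prev_dist is not None:
        feats["closing_speed"] = prev_dist - dist   # > 0 when objects approach
    return feats
```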
Video and Commentary for Event Structures [VICES]
The classification problem
• The problem is one of time-series classification
• Possible methodologies include:
  – Logic-based methods
  – Hidden Markov Models
  – Recurrent Neural Networks
Elman Network
• Commonly a two-layer network with feedback from the first-layer output to the first-layer input
• Elman networks detect and generate time-varying patterns
• They are also able to learn spatial patterns
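A bare-bones forward pass of an Elman network in NumPy, to make the context feedback explicit; the layer sizes, logistic activations and the single output unit per verb are illustrative, and training (e.g. backpropagation through time) is omitted.

```python
import numpy as np

class ElmanSRN:
    """Simple recurrent network: the hidden state at time t-1 is copied into
    a context layer and fed back together with the input at time t."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.1, (n_hidden, n_in))       # input  -> hidden
        self.W_ctx = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden
        self.W_out = rng.normal(0, 0.1, (n_out, n_hidden))     # hidden -> output
        self.context = np.zeros(n_hidden)

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def step(self, x):
        """Process one time step of the feature vector x."""
        h = self._sigmoid(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h                      # feedback for the next time step
        return self._sigmoid(self.W_out @ h)  # e.g. degree to which the verb applies

    def run(self, sequence):
        self.context[:] = 0.0
        return np.array([self.step(x) for x in sequence])

# Illustrative use: 12 kinematic features per frame, one "chase" output unit
srn = ElmanSRN(n_in=12, n_hidden=20, n_out=1)
outputs = srn.run(np.random.rand(100, 12))    # 100-frame feature sequence
```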
Feature Extraction in Abstract Videos
• Each image is read into a 2D matrix
• Connected Component Analysis is performed
• A bounding box is computed for each such connected component
• Dynamic tracking is used to keep track of each object
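One possible implementation of this per-frame step, using SciPy's connected-component labelling; the frame encoding (grey-level matrix with a uniform background value) and the nearest-centroid association used for the dynamic tracking are assumptions.

```python
import numpy as np
from scipy import ndimage

def extract_boxes(frame, background_value=255):
    """Label connected components in one frame (a 2D grey-level matrix)
    and return a bounding box (x, y, w, h) for each component."""
    mask = frame != background_value            # foreground = non-background pixels
    labels, n = ndimage.label(mask)
    boxes = []
    for sl in ndimage.find_objects(labels):     # one pair of slices per component
        ys, xs = sl
        boxes.append((xs.start, ys.start, xs.stop - xs.start, ys.stop - ys.start))
    return boxes

def associate(prev_tracks, boxes):
    """Very simple dynamic tracking: assign each box to the nearest previous centroid."""
    def c(b):
        return np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])
    tracks = {}
    for box in boxes:
        if prev_tracks:
            obj_id = min(prev_tracks,
                         key=lambda k: np.linalg.norm(c(prev_tracks[k]) - c(box)))
        else:
            obj_id = len(tracks)                # first frame: assign fresh ids
        tracks[obj_id] = box
    return tracks
```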
Working with Real Videos
• Challenges
  – Noise in real-world videos
  – Illumination changes
  – Occlusions
  – Extracting depth information
• Our Setup
  – Camera is fixed at head height
  – Angle of depression is 0 degrees (approx.)
  – Video
Background Subtraction
• Background Subtraction
  – Learn on still background images
  – Find pixel intensity distributions
  – Classify each pixel (x, y) as background if |P(x,y) - µ(x,y)| < k*σ(x,y)
• Remove Shadows
  – Special case of reduced illumination: S = k*P where k < 1.0
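A compact NumPy version of this per-pixel test, assuming a stack of still background frames is available for learning µ and σ; the shadow step simply checks whether a pixel is a uniformly darkened copy of the background (S = k*P with k < 1). The thresholds are illustrative.

```python
import numpy as np

def learn_background(still_frames):
    """still_frames: array (N, H, W) of grey-level background images."""
    stack = np.asarray(still_frames, dtype=np.float64)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6    # mu(x,y), sigma(x,y)

def segment(frame, mu, sigma, k=2.5, shadow_lo=0.5, shadow_hi=0.95):
    """Classify each pixel: 0 = background, 1 = foreground, 2 = shadow."""
    frame = frame.astype(np.float64)
    foreground = np.abs(frame - mu) >= k * sigma            # |P - mu| < k*sigma => background
    ratio = frame / np.maximum(mu, 1e-6)
    shadow = foreground & (ratio > shadow_lo) & (ratio < shadow_hi)  # darkened copy of background
    labels = np.zeros(frame.shape, dtype=np.uint8)
    labels[foreground] = 1
    labels[shadow] = 2
    return labels
```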
Contd..
• Extract Human Blobs
  – By Connected Component Analysis
  – A bounding box is computed for each person
• Track Human Blobs
  – Each object is tracked using a mean-shift tracking algorithm
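An OpenCV-flavoured sketch of these two steps; cv2.connectedComponentsWithStats and cv2.meanShift are standard OpenCV calls, but the glue code (area threshold, hue-histogram initialisation) is an assumption about details the slides do not give.

```python
import cv2
import numpy as np

def human_blobs(fg_mask, min_area=500):
    """Connected-component analysis on a binary foreground mask;
    returns one bounding box (x, y, w, h) per sufficiently large blob."""
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fg_mask.astype(np.uint8))
    return [tuple(stats[i, :4]) for i in range(1, n)        # label 0 is the background
            if stats[i, cv2.CC_STAT_AREA] >= min_area]

def track_blob(frames, init_box):
    """Follow one person across frames with mean-shift on a hue histogram."""
    x, y, w, h = init_box
    hsv0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2HSV)
    roi_hist = cv2.calcHist([hsv0[y:y + h, x:x + w]], [0], None, [180], [0, 180])
    cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window, path = (x, y, w, h), []
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
        _, window = cv2.meanShift(back_proj, window, criteria)
        path.append(window)                                  # bounding box per frame
    return path
```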
Depth Estimation
• Two approximations
  – Using Gibson's affordances
  – Camera geometry
• Affordances: visual cues
  – The action of a human is triggered by the environment itself
  – A floor offers walk-on-ability
  – Every object affords certain actions, perceived along with their anticipated effects
  – A cup's handle affords grasping-lifting-drinking
Contd..
Gibson’s model
• Horizon is fixed at the head height of the observer
Monocular Depth Cues
• Interposition: an object that occludes another is closer
• Height in the visual field: the higher an object appears, the further away it is
Depth Estimation
• Pinhole Camera Model: mapping (X, Y, Z) to (x, y)
  x = X*f/Z
  y = Y*f/Z
• For the point of contact with the ground:
  Z ∝ 1/y,  X ∝ x/y
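A worked sketch of this recovery for a person standing on the floor, assuming a camera at head height with a horizontal optical axis; the focal length, principal point and camera height below are hypothetical values.

```python
def ground_point_to_XZ(x_img, y_img, f=700.0, cam_height=1.6, cx=320.0, cy=240.0):
    """Recover ground-plane coordinates (X, Z) from a foot (ground-contact) point.

    Pinhole model: x = X*f/Z, y = Y*f/Z.  With the camera at head height and a
    horizontal optical axis, a ground-contact point lies cam_height below the
    axis, so Z = f*cam_height / (y_img - cy) and X = (x_img - cx) * Z / f,
    i.e. Z is proportional to 1/y and X to x/y, as on the slide.
    """
    y = y_img - cy                       # image rows below the principal point / horizon
    if y <= 0:
        raise ValueError("foot point must lie below the horizon (y_img > cy)")
    Z = f * cam_height / y
    X = (x_img - cx) * Z / f
    return X, Z

# Points lower in the image (larger y) map to smaller depths Z
print(ground_point_to_XZ(400.0, 460.0))   # nearby, slightly to the right
print(ground_point_to_XZ(300.0, 250.0))   # far away, just below the horizon
```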
Depth plot for A chase B
• Top view (Z-X plane)
Results
• Separate SRN for each action
  – Trained & tested on different parts of the abstract video
  – Trained on abstract video and tested on real video
• Single SRN for all actions
  – Trained on synthetic video and tested on real video
Basis for Comparison
Let the total time of the visual sequence for each verb be t time units.
E  : intervals when subjects describe an event as occurring;  Ē = t - E
E' : intervals when VICES describes an event as occurring;    Ē' = t - E'
FM : intervals classified as Focus Mismatches

(|·| denotes the total duration of a set of intervals)
True Positives   = |E ∩ E'| / |E|
False Positives  = |E' - E| / |E|
False Negatives  = |E - E'| / |E|
Focus Mismatches = |FM| / |E|
Accuracy         = ( |E ∩ E'| + |Ē ∩ Ē'| ) / t
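These interval measures can be computed directly over sets of frame indices. A small sketch, where the focus-mismatch intervals FM are assumed to be supplied separately (they require knowing which agents an utterance refers to):

```python
def evaluate(E, E_pred, FM, t):
    """E, E_pred, FM: sets of frame indices (subject events, VICES events,
    focus-mismatch frames); t: total number of frames for the verb."""
    E_bar = set(range(t)) - E
    E_pred_bar = set(range(t)) - E_pred
    return {
        "true_positives":   len(E & E_pred) / len(E),
        "false_positives":  len(E_pred - E) / len(E),
        "false_negatives":  len(E - E_pred) / len(E),
        "focus_mismatches": len(FM) / len(E),
        "accuracy": (len(E & E_pred) + len(E_bar & E_pred_bar)) / t,
    }

# Example: a 100-frame clip where the subject marks frames 20-39 as "chase"
E = set(range(20, 40))
E_pred = set(range(25, 45))
print(evaluate(E, E_pred, FM=set(), t=100))
```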
Separate SRN for each action
Framework: Abstract video

Verb        | True Positives | False Positives | False Negatives | Focus Mismatches | Accuracy
hit         | 46.02%         | 3.06%           | 53.98%          | 2.4%             | 92.37%
chase       | 24.44%         | 0%              | 75.24%          | 0.72%            | 93.71%
come closer | 25.87%         | 14.61%          | 73.26%          | 16.77%           | 63.66%
move away   | 46.34%         | 7.21%           | 52.33%          | 15.95%           | 73.37%
spins       | 82.54%         | 0%              | 16.51%          | 24.7%            | 97.03%
moves       | 68.24%         | 0.12%           | 31.76%          | 1.97%            | 77.33%
Verb        | True Positives | False Positives | False Negatives | Focus Mismatches
hit         | 3              | 3               | 1               | 1
chase       | 6              | 0               | 3               | 4
come closer | 6              | 20              | 7               | 24
move away   | 8              | 3               | 0               | 14
spins       | 22             | 0               | 1               | 9
moves       | 5              | 1               | 2               | 7
Timeline comparison for Chase
Separate SRN for each action
Real video (action recognition only)
Verb      | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall
A Chase B | 237       | 140      | 135            | 96              | 5               | 58.4%     | 96.4%
B Chase A | 76        | 130      | 76             | 0               | 56              | 100%      | 58.4%
Single SRN for all actions
Framework: Real video

Verb       | Retrieved | Relevant | True Positives | False Positives | False Negatives | Precision | Recall
Chase      | 239       | 270      | 217            | 23              | 5               | 91.2%     | 80.7%
Going Away | 21        | 44       | 13             | 8               | 31              | 61.9%     | 29.5%
Conclusions & Future Work
• The sparse nature of the video makes visual analysis easy
• Event structures are learned directly from the perceptual stream
• Extensions: learn the fine nuances between event structures of related action words
• Learn morphological variations
• Extend the work towards using Long Short-Term Memory (LSTM)
• Hierarchical acquisition of higher-level action verbs