How Machines Learn to Talk
Amitabha Mukerjee
IIT Kanpur
work done with:
Computer Vision: Profs. C. Venkatesh, Pabitra Mitra
Prithvijit Guha, A. Ramakrishna Rao, Pradeep Vaghela
Natural Language: Prof. Achla Raina,
V. Shreeniwas
Robotics
Collaborations:
IGCAR Kalpakkam
Sanjay Gandhi PG Medical Hospital
Visual Robot Navigation
Time-to-Collision based Robot Navigation
Hyper-Redundant Manipulators
The same manipulator can work in changing workspaces
•
Reconfigurable Workspaces / Emergency Access
•
Optimal Design of Hyper-Redundant Systems
– SCARA and 3D
Planar Hyper-Redundancy
4-link Planar Robot
Motion Planning
Micro-Robots
• Micro Soccer Robots (1999-)
• 8 cm Smart Surveillance Robot – 1 m/s
• Autonomous Flying Robot (2004)
• Omnidirectional platform (2002)
Omni-Directional Robot
Sponsor: DST
[email protected]
Flying Robot
Start-Up at
IIT Kanpur
Whirligig
Robotics
heli-flight.wmv
Test flight of UAV. Inertial Measurement Unit (IMU) under commercial production.
Tracheal Intubation Device
Assists the surgeon while inserting the breathing tube during general anaesthesia
Labelled components: aperture for oxygenation tube; ball-and-socket joint; aperture for fibre-optic video cable; endotracheal tube aperture; hole for suction tube; control cables; attachment points.
Device for Intubation during General Anaesthesia
Sponsor: DST / SGPGM
[email protected]
Draupadi’s Swayamvar
Can the Arrow hit the rotating mark?
Sponsor: Media Lab Asia
High DOF Motion Planning
• Accessing hard-to-reach spaces
• Design of Hyper-Redundant Systems
• Parallel Manipulators
Sponsor: BRNS / MHRD
[email protected]
10-link 3D Robot – Optimal Design
Multimodal Language Acquisition
Consider a child observing a scene together with adults talking about it.
Grounded Language: symbols are grounded in perceptual signals.
Simple videos with boxes and simple shapes are used, as is standard in social psychology.
Objective
To develop a computational framework for Multimodal Language Acquisition:
• acquiring the perceptual structure corresponding to verbs
• using Recurrent Neural Networks as a biologically plausible model for temporal abstraction
• adapting the learned model to interpret activities in real videos
Visually Grounded Corpus
Two psychological research films: one based on the classic Heider & Simmel (1944) animation, the other based on Hide & Seek.
These animations portray the motion paths of geometric figures (big square, small square and circle).
Chase Alt
Cognate clustering
Similarity clustering: different expressions for the same action, e.g. “move away from center” vs “go to a corner”.
Frequency: remove infrequent lexical units.
Synonymy: a set of lexical units used consistently in the same intervals, to mark the same action, for the same set of agents.
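A minimal sketch of how such cognate clustering could be implemented, assuming each lexical unit has been reduced to the list of video frame intervals over which it is uttered; the function names, thresholds and Jaccard-overlap criterion are illustrative, not the exact VICES procedure:

```python
# Hypothetical sketch: cluster lexical units ("cognates") by the video
# intervals in which they are uttered. Thresholds are illustrative only.

def frames(intervals):
    """Expand (start, end) frame intervals into a set of frame indices."""
    return {f for start, end in intervals for f in range(start, end + 1)}

def jaccard(a, b):
    """Overlap between two sets of frame indices."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cognate_clusters(lexicon, min_freq=3, sim_threshold=0.6):
    """lexicon: dict mapping lexical unit -> list of (start, end) intervals."""
    # Frequency: remove infrequent lexical units
    units = {w: frames(iv) for w, iv in lexicon.items() if len(iv) >= min_freq}
    clusters = []
    for word, fset in units.items():
        # Synonymy: attach the word to a cluster used over the same intervals
        for cluster in clusters:
            if jaccard(fset, cluster["frames"]) >= sim_threshold:
                cluster["words"].add(word)
                cluster["frames"] |= fset
                break
        else:
            clusters.append({"words": {word}, "frames": set(fset)})
    return clusters

# e.g. "move away from center" and "go to a corner" end up in one cluster
# if they are consistently uttered over the same stretches of the video.
```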
Perceptual Process
[Pipeline diagram: the multi-modal input consists of the video and the accompanying descriptions. The video passes through feature extraction to give features; the descriptions pass through cognate clustering to give events. Both feed VICES, a trained Simple Recurrent Network.]
Design of Feature Set
The features selected here are related to spatial aspects of conceptual primitives in children, such as position, relative pose, velocity, etc.
We use features that are kinematic in nature: temporal derivatives or simple transforms of the basic ones.
Monadic Features
Dyadic Predicates
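For illustration, monadic features and dyadic predicates of this kind could be computed from tracked centroids roughly as below; the specific features (speed, pairwise distance, approach velocity) are assumptions standing in for the actual feature set:

```python
import numpy as np

def monadic_features(track):
    """track: (T, 2) array of one object's centroid positions over time.
    Returns per-frame position, velocity and speed (simple kinematic features)."""
    velocity = np.gradient(track, axis=0)            # temporal derivative
    speed = np.linalg.norm(velocity, axis=1)
    return {"position": track, "velocity": velocity, "speed": speed}

def dyadic_features(track_a, track_b):
    """Pairwise (dyadic) predicates between two tracked objects."""
    diff = track_b - track_a
    distance = np.linalg.norm(diff, axis=1)
    approach = -np.gradient(distance)                # > 0 when the objects get closer
    return {"distance": distance, "approach_velocity": approach}
```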
Video and Commentary for Event Structures [VICES]
The classification problem
The problem is one of time-series classification.
Possible methodologies include:
• Logic-based methods
• Hidden Markov Models
• Recurrent Neural Networks
Elman Network
Commonly a two-layer network with feedback from the first-layer output to the first-layer input.
Elman networks detect and generate time-varying patterns; they are also able to learn spatial patterns.
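A minimal Elman-style simple recurrent network, sketched here in PyTorch for concreteness; the layer sizes, the use of nn.RNN, and the per-frame sigmoid output are assumptions, not the original implementation:

```python
import torch
import torch.nn as nn

class ElmanClassifier(nn.Module):
    """Simple recurrent (Elman) network: the hidden layer feeds back into
    itself at the next time step, giving the network a short-term memory."""
    def __init__(self, n_features, n_hidden=16):
        super().__init__()
        self.rnn = nn.RNN(n_features, n_hidden, batch_first=True)  # Elman RNN
        self.out = nn.Linear(n_hidden, 1)

    def forward(self, x):                 # x: (batch, time, n_features)
        hidden_seq, _ = self.rnn(x)
        # one event / non-event score per time step
        return torch.sigmoid(self.out(hidden_seq)).squeeze(-1)

# Usage sketch: in the separate-SRN setting, one such network per verb,
# fed the frame-wise feature vectors.
# model = ElmanClassifier(n_features=8)
# scores = model(torch.randn(1, 200, 8))   # (1, 200) per-frame event scores
```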
Feature Extraction in Abstract Videos
Each image is read into a 2D matrix.
Connected Component Analysis is performed.
A bounding box is computed for each connected component.
Dynamic tracking is used to keep track of each object.
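A sketch of this extraction step using scipy's connected-component labelling; the fixed intensity threshold and the assumption that figures are brighter than the background are illustrative:

```python
import numpy as np
from scipy import ndimage

def extract_objects(frame, threshold=128):
    """frame: 2D grayscale image as a numpy array.
    Returns a list of bounding boxes (min_row, min_col, max_row, max_col),
    one per connected component (assumed: figures brighter than background)."""
    mask = frame > threshold                     # foreground mask
    labels, n = ndimage.label(mask)              # connected component analysis
    boxes = []
    for sl in ndimage.find_objects(labels):      # one pair of slices per component
        boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))
    return boxes
```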
Working with Real Videos
Challenges
Noise in real world videos
Illumination Changes
Occlusions
Extracting Depth Information
Our Setup
Camera is fixed at head height.
Angle of depression is 0 degrees (approx.).
Background Subtraction
Learn on still background images.
Find pixel intensity distributions.
Classify each pixel as background if |P(x,y) − µ(x,y)| < k·σ(x,y)²
Remove Shadows
Special case of reduced illumination: S = k·P, where k < 1.0
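A per-pixel version of this background model might look like the sketch below; the threshold k, the use of a k·σ deviation test, and the shadow ratio band are illustrative assumptions:

```python
import numpy as np

def fit_background(still_frames):
    """still_frames: (N, H, W) stack of background-only grayscale frames.
    Returns per-pixel mean and standard deviation of intensity."""
    stack = np.asarray(still_frames, dtype=np.float64)
    return stack.mean(axis=0), stack.std(axis=0) + 1e-6

def foreground_mask(frame, mu, sigma, k=2.5, shadow_low=0.5, shadow_high=0.9):
    """Pixels far from the background model are foreground, except those that
    look like a uniformly darkened background (S = k*P, k < 1): shadows."""
    frame = frame.astype(np.float64)
    deviates = np.abs(frame - mu) > k * sigma
    ratio = frame / (mu + 1e-6)
    shadow = (ratio > shadow_low) & (ratio < shadow_high)   # reduced illumination
    return deviates & ~shadow
```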
Contd..
Extract Human Blobs
By connected component analysis.
A bounding box is computed for each person.
Track Human Blobs
Each object is tracked using a mean-shift tracking algorithm.
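OpenCV ships a mean-shift tracker; a sketch of tracking one extracted blob from its initial bounding box could look like this (the hue-histogram back-projection and termination criteria are assumptions about the setup):

```python
import cv2

def track_blob(frames, init_box):
    """frames: list of BGR frames; init_box: (x, y, w, h) from blob extraction.
    Yields the tracked window in each subsequent frame using mean-shift."""
    x, y, w, h = init_box
    roi = cv2.cvtColor(frames[0][y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([roi], [0], None, [32], [0, 180])      # hue histogram
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    window = (x, y, w, h)
    for frame in frames[1:]:
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        prob = cv2.calcBackProject([hsv], [0], hist, [0, 180], 1)
        _, window = cv2.meanShift(prob, window, criteria)      # shift to the mode
        yield window
```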
Contd..
Depth Estimation
Two approximations
Using Gibson’s affordances
Camera Geometry
Affordances: Visual Clues
The action of a human is triggered by the environment itself: a floor offers walk-on-ability.
Every object affords certain actions to the perceiver, along with anticipated effects: a cup's handle affords grasping, lifting and drinking.
Contd..
Gibson’s model
The horizon is fixed at the head height of the observer.
Monocular Depth Cues
Interposition: an object that occludes another is closer.
Height in the visual field: the higher an object appears in the visual field, the farther away it is.
Depth Estimation
Pinhole Camera Model
Mapping (X, Y, Z) to (x, y):
x = X·f/Z
y = Y·f/Z
For the point of contact with the ground:
Z ∝ 1/y
X ∝ x/y
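Under these assumptions (known focal length, camera at head height above a flat ground plane, zero depression angle), depth can be read off the image row of the ground-contact point; a hypothetical sketch:

```python
def ground_depth(x_px, y_px, f, cam_height, cx, cy):
    """Pinhole-model depth for the point where a person touches the ground.
    x_px, y_px : image coordinates of the foot (contact) point, in pixels
    f          : focal length in pixels; (cx, cy): principal point (≈ horizon)
    cam_height : camera height above the ground plane (head height), in metres
    Assumes zero depression angle, so the horizon lies at image row cy."""
    dy = y_px - cy                  # pixels below the horizon; must be > 0
    if dy <= 0:
        raise ValueError("contact point must lie below the horizon")
    Z = f * cam_height / dy         # Z is proportional to 1/y
    X = (x_px - cx) * Z / f         # X is proportional to x/y
    return X, Z
```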
Depth plot for A chase B
Top view (Z-X plane)
Results
Separate-SRN-for-each-action
Trained & tested on different parts of the
abstract video
Trained on abstract video and tested on
real video
Single-SRN-for-all-actions
Trained on synthetic video and tested on
real video
Basis for Comparison
Let the total time of the visual sequence for each verb be t time units.
E : intervals when human subjects describe an event as occurring; Ē = t − E
E′ : intervals when VICES describes an event as occurring; Ē′ = t − E′
FM : intervals classified as Focus Mismatches
True Positives = |E ∩ E′| / |E|
False Positives = |E′ − E| / |Ē|
False Negatives = |E − E′| / |E|
Focus Mismatches = |FM| / |E|
Accuracy = (|E ∩ E′| + |Ē ∩ Ē′|) / t
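These quantities can be computed directly from per-frame boolean masks; a sketch, assuming E, E′ and FM are given as numpy boolean arrays of length t:

```python
import numpy as np

def interval_metrics(E, E_pred, FM):
    """E, E_pred, FM: boolean arrays of length t (one entry per time unit).
    Returns the fractions defined in the comparison above."""
    t = len(E)
    notE, notE_pred = ~E, ~E_pred
    return {
        "true_positives":   (E & E_pred).sum() / E.sum(),
        "false_positives":  (E_pred & notE).sum() / notE.sum(),
        "false_negatives":  (E & notE_pred).sum() / E.sum(),
        "focus_mismatches": FM.sum() / E.sum(),
        "accuracy": ((E & E_pred).sum() + (notE & notE_pred).sum()) / t,
    }
```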
Separate SRN for each action
Framework : Abstract video
Verb          True Positives   False Positives   False Negatives   Focus Mismatches   Accuracy
hit           46.02%           3.06%             53.98%            2.4%               92.37%
chase         24.44%           0%                75.24%            0.72%              93.71%
come Closer   25.87%           14.61%            73.26%            16.77%             63.66%
move Away     46.34%           7.21%             52.33%            15.95%             73.37%
spins         82.54%           0%                16.51%            24.7%              97.03%
moves         68.24%           0.12%             31.76%            1.97%              77.33%
Verb          True Positives   False Positives   False Negatives   Focus Mismatches
hit           3                3                 1                 1
chase         6                0                 3                 4
come Closer   6                20                7                 24
move Away     8                3                 0                 14
spins         22               0                 1                 9
moves         5                1                 2                 7
Time Line comparison for Chase
Separate SRN for each action
Real video (action recognition only)
Verb        Retrieved   Relevant   True Positives   False Positives   False Negatives   Precision   Recall
A Chase B   237         140        135              96                5                 58.4%       96.4%
B Chase A   76          130        76               0                 56                100%        58.4%
Single SRN for all actions
Framework : Real video
Verb         Retrieved   Relevant   True Positives   False Positives   False Negatives   Precision   Recall
Chase        239         270        217              23                5                 91.2%       80.7%
Going Away   21          44         13               8                 31                61.9%       29.5%
Conclusions & Future Work
The sparse nature of the videos makes visual analysis easier.
Event structures are learned directly from the perceptual stream.
Extensions:
• Learn the fine nuances between event structures of related action words.
• Learn morphological variations.
• Extend the work towards Long Short-Term Memory (LSTM) networks.
• Hierarchical acquisition of higher-level action verbs.