
Michael Arbib
with
Erhan Oztop (Modeling)
Giacomo Rizzolatti (Neurophysiology)
The Mirror Neuron System for Grasping
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
1
Part 1
Classic Concepts of Grasp Control
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
2
Preshaping While Reaching to Grasp
Jeannerod &
Biguer 1979
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
3
A Classic Coordinated Control Program for
Reaching and Grasping
Schematic: perceptual schemas (top) pass parameters to motor schemas (bottom).
Perceptual schemas, driven by visual input: visual location (activated by visual search under recognition criteria) yields the target location and activation of reaching; size recognition yields size; orientation recognition yields orientation.
Motor schemas for hand reaching: a fast phase movement followed by a slow phase movement, guided by visual and kinesthetic input.
Motor schemas for grasping: hand preshape and hand rotation, followed by the actual grasp, guided by visual, kinesthetic, and tactile input.
(Arbib 1981)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
4
Opposition Spaces and Virtual Fingers
The goal of a successful
preshape, reach and grasp
is to match the opposition
axis defined by the virtual
fingers of the hand with
the opposition axis defined
by an affordance of the
object
(Iberall and Arbib 1990)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
5
Hand State
Our current representation of hand state defines a 7-dimensional trajectory F(t) with the
following components
F(t) = (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t)):
d(t): distance to target at time t
v(t): tangential velocity of the wrist
a(t): Aperture of the virtual fingers involved in grasping at time t
o1(t): Angle between the object axis and the (index finger tip – thumb tip) vector
[relevant for pad and palm oppositions]
o2(t): Angle between the object axis and the (index finger knuckle – thumb tip) vector
[relevant for side oppositions]
o3(t), o4(t): The two angles defining how close the thumb is to the hand as measured
relative to the side of the hand and to the inner surface of the palm.
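The slide lists the components of F(t) but not the geometric conventions used to compute them. As a rough illustration only, the sketch below computes one sample of the hand state from 3-D marker positions; the helper angle_between and the arguments hand_side_axis and palm_normal are our own assumptions, not part of the MNS specification.

```python
import numpy as np

def angle_between(u, v):
    """Angle in radians between two 3-D vectors."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def hand_state(prev_wrist, wrist, thumb_tip, index_tip, index_knuckle,
               hand_side_axis, palm_normal, object_center, object_axis, dt):
    """One sample of F(t) = (d, v, a, o1, o2, o3, o4) -- an illustrative reading."""
    prev_wrist, wrist = np.asarray(prev_wrist, float), np.asarray(wrist, float)
    thumb_tip, index_tip = np.asarray(thumb_tip, float), np.asarray(index_tip, float)
    index_knuckle = np.asarray(index_knuckle, float)
    object_center = np.asarray(object_center, float)

    d  = np.linalg.norm(object_center - wrist)                  # distance to target
    v  = np.linalg.norm(wrist - prev_wrist) / dt                # tangential wrist velocity
    a  = np.linalg.norm(index_tip - thumb_tip)                  # virtual-finger aperture
    o1 = angle_between(object_axis, index_tip - thumb_tip)      # pad/palm oppositions
    o2 = angle_between(object_axis, index_knuckle - thumb_tip)  # side oppositions
    thumb_dir = thumb_tip - wrist
    o3 = angle_between(thumb_dir, hand_side_axis)               # thumb vs. side of hand
    o4 = angle_between(thumb_dir, palm_normal)                  # thumb vs. inner palm surface
    return np.array([d, v, a, o1, o2, o3, o4])
```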
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
6
Part 2
Basic Neural Mechanisms of Grasp Control
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
7
Visual Control of Grasping in Macaque Monkey
A key theme of
visuomotor coordination:
parietal affordances
(AIP)
drive
frontal motor
schemas
(F5)
F5 - grasp
commands in
premotor cortex
Giacomo Rizzolatti
AIP - grasp
affordances
in parietal cortex
Hideo Sakata
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
8
Grasp Specificity in an F5 Neuron
Precision pinch (top)
Power grasp (bottom)
(Data from Rizzolatti et
al.)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
9
A Broad Perspective on F5-AIP Interactions
Schematic:
PP (posterior parietal) supplies affordances for PM, telling PM the possible categories of action, and later supplies precise motor coordinates for motor cortex.
PM (premotor cortex) codes the overall action fairly abstractly and replies "here's my choice".
Motor cortex handles the details: the action parameters.
Cerebellum: tuning and coordinating MPGs.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
10
FARS (Fagg-Arbib-Rizzolatti-Sakata)
Model Overview
AIP extracts affordances: features of the object relevant to physical interaction with it. Prefrontal cortex provides "context" so F5 may select an appropriate affordance.
Dorsal stream (via AIP): affordances — ways to grab this "thing".
Ventral stream (via IT): recognition — "it's a mug".
The prefrontal context reaching F5 includes task constraints (F6), working memory (46?), and instruction stimuli (F2).
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
11
Part 3
Neural Mechanisms of Grasp Control
that Support a “Mirror System”
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
12
Mirror Neurons
Rizzolatti, Fadiga, Gallese, and Fogassi, 1995:
Premotor cortex and the recognition of motor
actions
Mirror neurons form the subset
of grasp-related premotor
neurons of F5 which discharge
when the monkey observes
meaningful hand movements
made by the experimenter or
another monkey.
F5 is endowed with an
observation/execution matching
system
[The non-mirror grasp neurons of F5 are
called F5 canonical neurons.]
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
13
What is the mirror system (for grasping) for?
Mirror neurons: The cells that selectively discharge when the monkey executes particular actions as well as when the monkey observes another individual executing the same action.
Mirror neuron system (MNS): The mirror neurons and the brain regions involved in eliciting mirror behavior.
Interpretations:
• Action recognition
• Understanding (assigning meaning to others' actions)
• Associative memory for actions
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
14
Computing the Mirror System Response
The FARS Model:
Recognize object affordances and determine appropriate grasp.
The Mirror Neuron System (MNS) Model:
We must add recognition of the trajectory and of the hand preshape to recognition of object affordances, and ensure that all three are congruent.
There are parietal systems other than AIP adapted to this task.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
15
Further Brain Regions Involved
cIPS (caudal intraparietal sulcus): axis orientation and surface orientation.
STS (superior temporal sulcus): detection of biologically meaningful stimuli (e.g. hand actions); motion-related activity (MT/MST part).
7b (PF), rostral part of the posterior parietal lobule: mainly somatosensory; mirror-like responses.
7a (PG), caudal part of the posterior parietal lobule: spatial coding for objects; analysis of motion during the interaction of objects and self-motion.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
16
cIPS cell response
Surface orientation selectivity
of a cIPS cell
Sakata et al. 1997
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
17
Key Criteria for Mirror Neuron Activation
When Observing a Grasp
a) Does the preshape of the hand correspond to the grasp encoded by the mirror neuron?
b) Does this preshape match an affordance of the target object?
c) Do samples of the hand state indicate a trajectory that will bring the hand to grasp the
object?
Modeling Challenges:
i) To have mirror neurons self-organize to learn to recognize grasps in the monkey’s
motor repertoire
ii) To learn to activate mirror neurons from smaller and smaller samples of a trajectory.
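Challenge (ii) — committing to a grasp from ever shorter trajectory prefixes — can be illustrated with a toy training scheme: replicate each hand-state trajectory at several prefix lengths and train a single classifier on all of them. This is a sketch under our own assumptions (fixed-length resampling, an off-the-shelf MLP), not the learning rule used in the MNS model.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def resample(traj, n=30):
    """Linearly resample a (T, k) hand-state trajectory to n time steps and flatten."""
    traj = np.asarray(traj, float)
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n)
    return np.column_stack([np.interp(t_new, t_old, traj[:, j])
                            for j in range(traj.shape[1])]).ravel()

def prefix_training_set(trajectories, labels, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Replicate each trajectory at several prefix lengths so the classifier
    learns to commit to a grasp type earlier and earlier."""
    X, y = [], []
    for traj, lab in zip(trajectories, labels):
        for f in fractions:
            cut = max(2, int(f * len(traj)))
            X.append(resample(traj[:cut]))
            y.append(lab)
    return np.array(X), np.array(y)

# Usage sketch (trajectories: list of (T, 7) hand-state arrays; labels: grasp types):
# X, y = prefix_training_set(trajectories, labels)
# clf = MLPClassifier(hidden_layer_sizes=(15,)).fit(X, y)
# clf.predict([resample(observed_prefix)])
```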
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
18
Hypothesis on Mirror Neuron Development
The development of the (grasp) mirror neuron system in a healthy infant is
driven by the visual stimuli generated by the actions (grasps) performed by the
infant himself.
The infant (with maturation of visual acuity) gains the ability to map other individuals' actions into his internal motor representation.
[In the MNS model, the hand state provides the key representation for this
transfer.]
Then the infant acquires the ability to create (internal) representations for
novel actions observed.
Parallel to these achievements, the infant develops an action prediction
capability (the recognition of an action given the prefix of the action and the
target object)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
19
The Mirror Neuron System (MNS) Model
Schematic of the model:
Visual cortex and cIPS supply object features; MIP/LIP/VIP supply object location.
AIP: object affordance extraction.
STS: hand shape recognition and hand motion detection.
7b: object affordance – hand state association.
7a: hand–object spatial relation analysis.
F5 canonical: motor program (grasp). F4: motor program (reach). M1: motor execution.
F5 mirror: integrates the temporal association for action recognition (mirror neurons). Mirror feedback: from the motor program (F5 canonical).
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
20
Part 4
Implementing the Basic Schemas of the Mirror Neuron
System (MNS) Model
using Artificial Neural Networks
(The Work of Erhan Oztop)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
21
Curve recognition
The general problem: associate N-dimensional space curves with object affordances
A special case: The recognition of two (or three) dimensional trajectories in physical space
Simplest solution: map the temporal information into the spatial domain, then apply known pattern recognition techniques.
Problem with the simplest solution: the speed of the moving point. The spatial representation may change drastically with the speed.
Scaling can overcome the problem; however, the scaling must preserve the generalization ability of the pattern recognition engine.
Solution: fit a cubic spline to the sampled values, then normalize and resample from the spline curve.
Result: very good generalization, and better performance than using Fourier coefficients to recognize curves.
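The slide does not spell out the normalization, so the following is one plausible reading using SciPy: fit a cubic spline to the sampled trajectory, resample it at equal arc-length steps (removing the dependence on the speed of the moving point), and normalize translation and scale.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def normalize_curve(samples, n_points=30):
    """Fit a cubic spline to a sampled 2-D trajectory, resample it at equal
    arc-length steps, and normalize position and scale, so that the resulting
    pattern is (approximately) independent of movement speed."""
    samples = np.asarray(samples, float)                  # shape (T, 2)
    t = np.linspace(0.0, 1.0, len(samples))
    spline = CubicSpline(t, samples, axis=0)

    dense = spline(np.linspace(0.0, 1.0, 20 * n_points))  # dense polyline on the spline
    seg = np.linalg.norm(np.diff(dense, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])           # cumulative arc length
    targets = np.linspace(0.0, s[-1], n_points)
    idx = np.clip(np.searchsorted(s, targets), 0, len(dense) - 1)
    curve = dense[idx]

    curve -= curve.mean(axis=0)                           # translation invariance
    scale = np.abs(curve).max()
    return curve / scale if scale > 0 else curve
```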
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
22
Curve recognition
Curve recognition system demonstrated for hand drawn numeral recognition (successful
recognition examples for 2, 8 and 3).
Spatial resolution: 30
Network input size: 30
Hidden layer size: 15
Output size: 5
Training: back-propagation with momentum and an adaptive learning rate.
(Figure: sampled points, the points used for spline interpolation, and the fitted spline.)
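For concreteness, the sizes and training options listed above map onto an off-the-shelf multilayer perceptron roughly as follows; the exact encoding that turns a curve into 30 input values is not given on the slide, so treat this purely as an illustration.

```python
from sklearn.neural_network import MLPClassifier

# Hypothetical setup matching the slide's sizes: 30 inputs, 15 hidden units,
# 5 outputs, back-propagation (SGD) with momentum and an adaptive learning rate.
clf = MLPClassifier(hidden_layer_sizes=(15,),
                    solver="sgd",
                    momentum=0.9,
                    learning_rate="adaptive",
                    learning_rate_init=0.1,
                    max_iter=2000)

# X: (n_samples, 30) normalized curve features; y: one of 5 numeral classes.
# clf.fit(X, y); clf.predict(X_new)
```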
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
23
STS hand shape recognition
Color Coded Hand
Feature Extraction
Step 1 of hand shape
recognition: system
processes the color-coded
hand image and generates a
set of features to be used by
the second step
Model Matching
Step 2: The feature vector
generated by the first step is used
to fit a 3D-kinematics model of the
hand by the model matching
module. The resulting hand
configuration is sent to the
classification module.
Precision grasp
Hand Configuration
Classification
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
24
STS hand shape recognition 1:
Color Segmentation and Feature Extraction
Preprocessing
Color Expert
(Network weights)
Training phase: A color expert is generated by training a feed-forward network to approximate human
perception of color.
Features
NN augmented
segmentation system
Actual processing: The hand image is fed to the augmented segmentation system. The color decision during segmentation is made by consulting the color expert.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
25
STS hand shape recognition 2:
3D Hand Model Matching
Feature Vector
Error minimization
Result of feature
extraction
A realistic drawing of hand
bones. The hand is
modelled with 14 degrees
of freedom as illustrated.
Grasp Type
Classification
The model matching algorithm minimizes the error
between the extracted features and the model hand.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
26
Virtual Hand/Arm and Reach/Grasp Simulator
A precision pinch
A power grasp
and
a side grasp
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
27
Power grasp time series data
Legend: +: aperture; *: angle 1; x: angle 2; remaining traces: axisdisp1, axisdisp2, speed, and distance.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
28
Core Mirror Circuit
Schematic of the core circuit:
Inputs: object affordances; the hand state (from hand shape recognition and hand motion detection); and hand–object spatial relation analysis.
Object affordance – hand state association neurons (7b) combine these inputs.
Mirror neurons (F5 mirror) integrate the temporal association to produce action recognition and the mirror neuron output.
Mirror feedback from the motor program (F5 canonical, with motor execution) reaches the association stage.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
29
Connectivity pattern
Object affordance (AIP)
STS
F5mirror
7b
Motor Program
(F5canonical)
Mirror Feedback
7a
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
30
A single grasp trajectory viewed from three different
angles
The wrist trajectory during the
grasp is shown by square traces,
with the distance between any
two consecutive trace marks
traveled in equal time intervals.
How the network classifies the
action as a power grasp. Empty
squares: power grasp output;
filled squares: precision grasp;
crosses:
side grasp output
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
31
Power and precision grasp resolution
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
32
Part 5
Quo Vadis?
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
33
Future Directions
1) Technology: Increasing the robustness and learning rates of the
schemas using improved learning algorithms for artificial neural nets.
2) Neuroscience: Implementing the schemas with biologically plausible
neural nets to model neurophysiological data.
3) Learning to recognize variations on, and assemblages of, familiar
actions.
4) Imitation
5) Language!
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
34
Michael Arbib: CS564 - Brain Theory and Artificial Intelligence
University of Southern California, Fall 2001
Lecture 10.
The Mirror Neuron System Model (MNS) 1
Reading Assignment:
Schema Design and Implementation of
the Grasp-Related Mirror Neuron System
Erhan Oztop and Michael A. Arbib
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
35
Visual Control of Grasping in Macaque Monkey
A key theme of
visuomotor coordination:
parietal affordances
(AIP)
drive
frontal motor
schemas
(F5)
F5 - grasp
commands in
premotor cortex
Giacomo Rizzolatti
AIP - grasp
affordances
in parietal cortex
Hideo Sakata
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
36
Mirror Neurons
Rizzolatti, Fadiga, Gallese, and Fogassi, 1995:
Premotor cortex and the recognition of motor
actions
Mirror neurons form the subset
of grasp-related premotor
neurons of F5 which discharge
when the monkey observes
meaningful hand movements
made by the experimenter or
another monkey.
F5 is endowed with an
observation/execution matching
system
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
37
F5 Motor Neurons
F5 Motor Neurons include all F5 neurons whose firing is related to motor activity.
We focus on grasp-related behavior. Other F5 motor neurons are related to oro-facial movements.
F5 Mirror Neurons form the subset of grasp-related F5 motor neurons which discharge when the monkey observes meaningful hand movements.
F5 Canonical Neurons form the subset of grasp-related F5 motor neurons which fire when the monkey sees an object whose affordances match the grasp they encode.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
38
What is the mirror system (for grasping) for?
Mirror neurons: The cells that selectively discharge when the monkey executes particular actions as well as when the monkey observes another individual executing the same action.
Mirror neuron system (MNS): The mirror neurons and the brain regions involved in eliciting mirror behavior.
Interpretations:
• Action recognition
• Understanding (assigning meaning to others' actions)
• Associative memory for actions
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
39
Computing the Mirror System Response
The FARS Model:
Recognize object affordances and determine appropriate grasp.
The Mirror Neuron System (MNS) Model:
We must add recognition of the trajectory and of the hand preshape to recognition of object affordances, and ensure that all three are congruent.
There are parietal systems other than AIP adapted to this task.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
40
Further Brain Regions Involved
cIPS (caudal intraparietal sulcus): axis orientation and surface orientation.
STS (superior temporal sulcus): detection of biologically meaningful stimuli (e.g. hand actions); motion-related activity (MT/MST part).
7b (PF), rostral part of the posterior parietal lobule: mainly somatosensory; mirror-like responses.
7a (PG), caudal part of the posterior parietal lobule: spatial coding for objects; analysis of motion during the interaction of objects and self-motion.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
41
cIPS cell response
Surface orientation selectivity
of a cIPS cell
Sakata et al. 1997
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
42
Key Criteria for Mirror Neuron Activation
When Observing a Grasp
a) Does the preshape of the hand correspond to the grasp
encoded by the mirror neuron?
b) Does this preshape match an affordance of the target object?
c) Do samples of the hand state indicate a trajectory that will bring the
hand to grasp the object?
Modeling Challenges:
i) To have mirror neurons self-organize to learn to recognize grasps in
the monkey’s motor repertoire
ii) To learn to activate mirror neurons from smaller and smaller
samples of a trajectory.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
43
Initial Hypothesis on Mirror Neuron Development
The development of the (grasp) mirror neuron system in a healthy
infant is driven by the visual stimuli generated by the actions
(grasps) performed by the infant himself.
The infant (with maturation of visual acuity) gains the ability to map other individuals' actions into his internal motor representation.
[In the MNS model, the hand state provides the key representation for this
transfer.]
Then the infant acquires the ability to create (internal) representations for
novel actions observed.
Parallel to these achievements, the infant develops an action prediction
capability (the recognition of an action given the prefix of the action and the
target object)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
44
The Mirror Neuron System (MNS) Model
Schematic of the model:
Visual cortex and cIPS supply object features; MIP/LIP/VIP supply object location.
AIP: object affordance extraction.
STS: hand shape recognition and hand motion detection.
7b: object affordance – hand state association.
7a: hand–object spatial relation analysis.
F5 canonical: motor program (grasp). F4: motor program (reach). M1: motor execution.
F5 mirror: integrates the temporal association for action recognition (mirror neurons). Mirror feedback: from the motor program (F5 canonical).
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
45
Implementing the Basic Schemas of the Mirror Neuron
System (MNS) Model
using Artificial Neural Networks
(Work of Erhan Oztop)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
46
Opposition Spaces and Virtual Fingers
The goal of a successful
preshape, reach and grasp
is to match the opposition
axis defined by the virtual
fingers of the hand with
the opposition axis defined
by an affordance of the
object
(Iberall and Arbib 1990)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
47
Hand State
Our current representation of hand state defines a
7-dimensional trajectory F(t) with the following components
F(t) = (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t)):
d(t): distance to target at time t
v(t): tangential velocity of the wrist
a(t): Aperture of the virtual fingers involved in grasping at time t
o1(t): Angle between the object axis and the (index finger tip – thumb
tip) vector [relevant for pad and palm oppositions]
o2(t): Angle between the object axis and the (index finger knuckle –
thumb tip) vector [relevant for side oppositions]
o3(t), o4(t): The two angles defining how close the thumb is to the
hand as measured relative to the side of the hand and to the inner
surface of the palm.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
48
Curve recognition
The general problem: associate N-dimensional space curves with object
affordances
A special case: The recognition of two (or three) dimensional trajectories
in physical space
Simplest solution: map the temporal information into the spatial domain, then apply known pattern recognition techniques.
Problem with the simplest solution: the speed of the moving point. The spatial representation may change drastically with the speed.
Scaling can overcome the problem; however, the scaling must preserve the generalization ability of the pattern recognition engine.
Solution: fit a cubic spline to the sampled values, then normalize and resample from the spline curve.
Result: very good generalization, and better performance than using Fourier coefficients to recognize curves.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
49
Curve recognition
Curve recognition system demonstrated for hand drawn numeral recognition
(successful recognition examples for 2, 8 and 3).
Spatial resolution: 30
Network input size: 30
Hidden layer size: 15
Output size: 5
Training: back-propagation with momentum and an adaptive learning rate.
(Figure: sampled points, the points used for spline interpolation, and the fitted spline.)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
50
STS hand shape recognition
Color Coded Hand
Feature Extraction
Step 1 of hand shape
recognition: system
processes the color-coded
hand image and generates a
set of features to be used by
the second step
Model Matching
Step 2: The feature vector
generated by the first step is used
to fit a 3D-kinematics model of the
hand by the model matching
module. The resulting hand
configuration is sent to the
classification module.
Precision grasp
Hand Configuration
Classification
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
51
STS hand shape recognition 1:
Color Segmentation and Feature Extraction
Preprocessing
Color Expert
(Network weights)
Training phase: A color expert is generated by training a feed-forward network to approximate
human perception of color.
Features
NN augmented
segmentation system
Actual processing: The hand image is fed to the augmented segmentation system. The color decision during segmentation is made by consulting the color expert.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
52
STS hand shape recognition 2:
3D Hand Model Matching
Feature Vector
Error minimization
Result of feature
extraction
A realistic drawing of hand
bones. The hand is
modelled with 14 degrees
of freedom as illustrated.
Grasp Type
Classification
The model matching algorithm minimizes the error
between the extracted features and the model hand.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
53
Virtual Hand/Arm and Reach/Grasp Simulator
A precision pinch
A power grasp
and
a side grasp
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
54
Power grasp time series data
Legend: +: aperture; *: angle 1; x: angle 2; remaining traces: axisdisp1, axisdisp2, speed, and distance.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
55
Core Mirror Circuit
Schematic of the core circuit:
Inputs: object affordances; the hand state (from hand shape recognition and hand motion detection); and hand–object spatial relation analysis.
Object affordance – hand state association neurons (7b) combine these inputs.
Mirror neurons (F5 mirror) integrate the temporal association to produce action recognition and the mirror neuron output.
Mirror feedback from the motor program (F5 canonical, with motor execution) reaches the association stage.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
56
Connectivity pattern
Object affordance (AIP)
STS
F5mirror
7b
Motor Program
(F5canonical)
Mirror Feedback
7a
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
57
A single grasp trajectory viewed from three different
angles
The wrist trajectory during the
grasp is shown by square traces,
with the distance between any
two consecutive trace marks
traveled in equal time intervals.
How the network classifies the
action as a power grasp. Empty
squares: power grasp output;
filled squares: precision grasp;
crosses:
side grasp output
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
58
Power and precision grasp resolution
Note that the modeling yields novel predictions for the time course of activity across a population of mirror neurons.
Panels (a), (b): a precision pinch mirror neuron and a power grasp mirror neuron.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
59
Research Plan
Development of the Mirror System
Development of Grasp Specificity in F5 Motor and
Canonical Neurons
Visual Feedback for Grasping: A Possible Precursor of the
Mirror Property
Recognition of Novel and Compound Actions and their
Context
The Pliers Experiment: Extending the Visual Vocabulary
Recognition of Compounds of Known Movements
From Action Recognition to Understanding: Context and
Expectation
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
60
Michael Arbib: CS564 - Brain Theory and Artificial Intelligence
University of Southern California, Fall 2001
Lecture 23: MNS Model 2
Michael Arbib
and
Erhan Oztop
The Mirror Neuron System for Grasping:
Visual Processing for the MNS model
The Virtual Arm
The Core Mirror Neuron Circuit
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
61
The Mirror Neuron System (MNS) Model
Schematic of the model:
Visual cortex and cIPS supply object features; MIP/LIP/VIP supply object location.
AIP: object affordance extraction.
STS: hand shape recognition and hand motion detection.
7b: object affordance – hand state association.
7a: hand–object spatial relation analysis.
F5 canonical: motor program (grasp). F4: motor program (reach). M1: motor execution.
F5 mirror: integrates the temporal association for action recognition (mirror neurons). Mirror feedback: from the motor program (F5 canonical).
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
62
Visual Processing for the MNS model
How much should we attempt to solve?
Even though computers are getting more powerful every day, the vision problem in its general form is an unsolved problem in engineering.
There exist gesture recognition systems for human-computer interaction and sign language interpretation (Holden).
Our vision system must at least extract the hand state components required by the model.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
63
Simplifying the problem
We attempt to recognize the hand and its configuration by simplifying the problem: we place color markers on the articulation points of the hand.
If we can extract the marker positions reliably, then we can try to extract some of the features that make up the hand state by estimating the 3D pose of the hand from its 2D pose.
Thus we have 2 steps: (1) locate the color markers and extract a feature vector, and (2) fit a 3D hand model to those features.
64
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
Reminder: Hand State components
For most components we need to know the (3D) configuration of the hand.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
65
A simplified video input
• The Vision task is
simplified using colored
tapes on the joints and
articulation points
• The first step of hand configuration analysis is to locate the color patches unambiguously (not easy!).
Use color segmentation. But we have to compensate for lighting,
reflection, shading and wrinkling problems: Robust color detection
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
66
Robust Detection of the Colors – RGB space
 A color
image in a computer is composed of a matrix of pixel triplets (Red, Green, Blue) that define the color of each pixel.
We want to label a given pixel color as belonging to one of the color patches we used to mark the hand, or as not belonging to any class.
A straightforward way to detect whether a given target color (R',G',B') matches the pixel color (R,G,B) is to look at the squared distance
(R-R')^2 + (G-G')^2 + (B-B')^2
and apply a threshold to do the classification.
This does not work well, because shading and different lighting conditions affect the R,G,B values a lot, and our simple nearest-neighbor method fails. For example, an orange patch under shadow is very close to red in RGB space.
Solution: There are better color spaces for color classification which are
more robust to shading and lighting effects.
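A minimal sketch of the naive scheme just described (squared RGB distance to each target color, with a rejection threshold); the marker palette and threshold below are illustrative, not the values used in the actual system.

```python
import numpy as np

def classify_rgb(pixel, marker_colors, threshold=20000.0):
    """Naive marker labeling: nearest target color in RGB space, by squared
    distance (R-R')^2 + (G-G')^2 + (B-B')^2, with a rejection threshold.
    As the slide notes, this breaks down under shading and lighting changes."""
    pixel = np.asarray(pixel, float)
    best_name, best_d = None, None
    for name, rgb in marker_colors.items():
        d = float(np.sum((pixel - np.asarray(rgb, float)) ** 2))
        if best_d is None or d < best_d:
            best_name, best_d = name, d
    return best_name if best_d is not None and best_d <= threshold else None

# Hypothetical marker palette; the real tape colors are not given on the slide.
markers = {"thumb": (255, 0, 0), "index": (0, 200, 0), "wrist": (0, 0, 255)}
print(classify_rgb((180, 40, 30), markers))   # -> "thumb"
```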
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
67
Robust Detection of the Colors - HSV space
HSV stands for (Hue, Saturation, Value); for classifying colors it is the Hue component that carries the information we want.
The HSV color model is more suitable for classifying colors in terms of their perceived color.
Thus, in labeling the pixels, we can simply compare the pixel's Hue to a representative Hue (template) for each marker and assign the pixel to the marker with the closest value, unless the difference exceeds a threshold, in which case we label it as a non-marker color.
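A minimal sketch of this template-matching idea, assuming stored template hues for each marker; the hue tolerance and palette are illustrative only.

```python
import colorsys

def classify_hsv(rgb, marker_hues, max_hue_diff=0.05):
    """Label a pixel by comparing its hue to stored template hues.
    Hue lives on a circle in [0, 1), so compare with wrap-around distance."""
    r, g, b = [c / 255.0 for c in rgb]
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    best, best_d = None, None
    for name, template_h in marker_hues.items():
        d = abs(h - template_h)
        d = min(d, 1.0 - d)                    # circular hue distance
        if best_d is None or d < best_d:
            best, best_d = name, d
    return best if best_d is not None and best_d <= max_hue_diff else None

# Hypothetical template hues (red and green markers).
hues = {"thumb": 0.0, "index": 0.33}
print(classify_hsv((200, 40, 40), hues))   # -> "thumb"
```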
 But we can do better:
Train a neural network that can do the labeling for us

Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
68
Robust Detection of Colors – the Color Expert
Create a training set using a test image by manually picking colors from the image and specifying their labels.
Create a NN (in our case a one-hidden-layer feed-forward network) that accepts the R,G,B values as input and outputs the marker label, or 0 for a non-marker color.
Make sure that the network is not too "powerful", so that it does not simply memorize the training set (as opposed to generalizing).
Train it, then use it: when given a pixel to classify, apply the RGB values of the pixel to the trained network and use the output as the marker that the pixel belongs to.
One then needs a segmentation system to aggregate the pixels into a
patch with a single color label.
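A sketch of such a color expert using an off-the-shelf feed-forward classifier; the hand-picked training pixels, labels, and network size below are illustrative stand-ins for the manually constructed training set described above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Illustrative hand-picked pixels: label 1 = red marker, 2 = green marker,
# 0 = non-marker background (the real training set is picked from a test image).
X_train = np.array([[250, 30, 20], [120, 10, 10],     # reds under bright / dark light
                    [30, 200, 40], [10, 90, 20],      # greens under bright / dark light
                    [200, 200, 200], [20, 20, 20]])   # background samples
y_train = np.array([1, 1, 2, 2, 0, 0])

# Small one-hidden-layer network so it generalizes rather than memorizes.
color_expert = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000,
                             random_state=0).fit(X_train / 255.0, y_train)

def label_pixel(rgb):
    """Consult the trained color expert for one pixel; returns a marker label or 0."""
    return int(color_expert.predict([np.asarray(rgb, float) / 255.0])[0])

print(label_pixel((230, 40, 30)))   # expected: 1 (red marker)
```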
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
69
STS hand shape recognition 1:
Color Segmentation and Feature Extraction
Preprocessing
Color Expert
(Network weights)
Training phase: A color expert is generated by training a feed-forward network to approximate human
perception of color.
Features
NN augmented
segmentation system
Actual processing: The hand image is fed to the augmented segmentation system. The color decision during segmentation is made by consulting the color expert.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
70
Hand Configuration Estimation
• Given a color-coded hand image, the first step of the hand
configuration extraction is to find the position of the center of
color markers.
• Then the marker center positions are converted into a feature vector with a corresponding confidence vector. To build the feature vector, the wrist position is simply subtracted from all the centers found, and the resulting relative x,y coordinates for each marker are placed in the feature vector.
• The second step of the hand configuration extraction is to find a pose of a 3D hand model such that the features of the given hand image and of the 3D hand model are as close as possible. The contribution of each component to the distance is weighted by the corresponding entry of the confidence vector.
• This is an optimization problem which can be solved using gradient descent or hill climbing over the hand-model joint angles (simulated gradient descent).
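The weighted model matching step can be sketched as a generic least-squares fit over the hand-model parameters; project_model below is a hypothetical callback that produces model features from joint angles, and the derivative-free Nelder-Mead search stands in for the gradient descent / hill climbing mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

def match_hand_model(observed_features, confidences, project_model, x0):
    """Fit hand-model joint angles so the projected model features match the
    observed (wrist-relative) marker features, each term weighted by the
    confidence of the corresponding marker detection.

    project_model(angles) -> predicted feature vector, same layout as observed.
    A generic least-squares sketch, not the original matching code."""
    w = np.asarray(confidences, float)
    obs = np.asarray(observed_features, float)

    def error(angles):
        pred = project_model(angles)
        return float(np.sum(w * (pred - obs) ** 2))

    result = minimize(error, x0, method="Nelder-Mead")   # derivative-free "hill climbing"
    return result.x, result.fun
```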
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
71
STS hand shape recognition
Color Coded Hand
Feature Extraction
Step 1 of hand shape
recognition: system
processes the color-coded
hand image and generates a
set of features to be used by
the second step
Model Matching
Step 2: The feature vector
generated by the first step is used
to fit a 3D-kinematics model of the
hand by the model matching
module. The resulting hand
configuration is sent to the
classification module.
Precision grasp
Hand Configuration
Classification
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
72
STS hand shape recognition 2:
3D Hand Model Matching
Feature Vector
Error minimization
Result of feature
extraction
A realistic drawing of hand
bones. The hand is
modelled with 14 degrees
of freedom as illustrated.
Grasp Type
Classification
The model matching algorithm minimizes the error
between the extracted features and the model hand.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
73
Virtual Hand/Arm and Reach/Grasp Simulator
A precision pinch
A power grasp
and
a side grasp
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
74
Virtual Arm
 A Kinematics
model of an arm and hand.
19 DOF: shoulder (3), elbow (1), wrist (3), fingers (4×2), thumb (3).
Implementation Requirements
Rendering: given the 3D positions of the links' start and end points, generate a 2D representation of the arm/hand (easy).
Forward kinematics: given the 19 joint angles, compute the position of each link (easy).
Inverse kinematics: given a desired position in space for a particular link, what are the joint angles that achieve the desired position? (semi-hard)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
75
A 2D, 3DOF arm example
Forward kinematics: given joint angles A, B, C and link lengths a, b, c, compute the end-effector position P(x,y):
X = a*cos(A) + b*cos(B) + c*cos(C)
Y = a*sin(A) + b*sin(B) + c*sin(C)
Inverse kinematics: given a position P(x,y), there are infinitely many joint-angle triplets that achieve P!
(Figure: circles of radius a and c joined by segments, each of length b, illustrating the many solutions.)
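The forward kinematics above, plus a concrete demonstration of the redundancy: with equal first and last link lengths (a = c), swapping the angles A and C changes the arm's posture but not the end-effector position.

```python
import numpy as np

def end_effector(a, b, c, A, B, C):
    """Planar 3-link arm; A, B, C are each measured from the x-axis,
    matching the slide's equations."""
    x = a * np.cos(A) + b * np.cos(B) + c * np.cos(C)
    y = a * np.sin(A) + b * np.sin(B) + c * np.sin(C)
    return x, y

print(end_effector(1.0, 1.0, 1.0, 0.3, 1.2, 2.0))
print(end_effector(1.0, 1.0, 1.0, 2.0, 1.2, 0.3))   # same P(x, y), different posture
```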
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
76
A Simple Inverse Kinematics Solution
Let us constrain ourselves to the arm. The forward kinematics of the arm can be represented as a vector function that maps the joint angles of the arm to the wrist position:
(x,y,z) = F(s1,s2,s3,e), where s1,s2,s3 are the shoulder angles and e is the elbow angle.
We can formulate the inverse kinematics problem as an optimization problem: given the desired P' = (x',y',z') to be achieved, we introduce the error function
J = || P' - F(s1,s2,s3,e) ||
Then we can compute the gradient of J with respect to s1,s2,s3,e and follow the negative gradient to reach the desired position.
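A minimal numerical version of this gradient scheme, using a central-difference gradient of J; the planar two-link toy_forward in the usage example is a stand-in for the real F(s1,s2,s3,e).

```python
import numpy as np

def solve_ik(forward, target, q0, step=0.05, iters=5000, eps=1e-5):
    """Gradient descent on J(q) = ||target - forward(q)||^2, with a numerical
    (central-difference) gradient standing in for the analytic one."""
    q = np.asarray(q0, float).copy()
    target = np.asarray(target, float)

    def J(q):
        return float(np.sum((target - forward(q)) ** 2))

    for _ in range(iters):
        grad = np.zeros_like(q)
        for i in range(len(q)):
            dq = np.zeros_like(q)
            dq[i] = eps
            grad[i] = (J(q + dq) - J(q - dq)) / (2 * eps)
        q -= step * grad                       # follow the negative gradient
        if J(q) < 1e-10:
            break
    return q

# Toy usage: a planar 2-link "arm" standing in for F(s1, s2, s3, e).
toy_forward = lambda q: np.array([np.cos(q[0]) + np.cos(q[0] + q[1]),
                                  np.sin(q[0]) + np.sin(q[0] + q[1])])
q = solve_ik(toy_forward, target=(1.2, 0.8), q0=(0.1, 0.1))
print(q, toy_forward(q))
```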
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
77
Other Issues
Other inverse kinematics solution methods?
Inverse kinematics for multiple targets…
Precision Grasp Planning
Determination of finger contact points, etc.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
78
Power grasp time series data
Legend: +: aperture; *: angle 1; x: angle 2; remaining traces: axisdisp1, axisdisp2, speed, and distance.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
79
Curve recognition
The general problem: associate N-dimensional space curves with object
affordances
A special case: The recognition of two (or three) dimensional trajectories in physical space
Simplest solution: map the temporal information into the spatial domain, then apply known pattern recognition techniques.
Problem with the simplest solution: the speed of the moving point. The spatial representation may change drastically with the speed.
Scaling can overcome the problem; however, the scaling must preserve the generalization ability of the pattern recognition engine.
Solution: fit a cubic spline to the sampled values, then normalize and resample from the spline curve.
Result: very good generalization, and better performance than using Fourier coefficients to recognize curves.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
80
Curve recognition
Curve recognition system demonstrated for hand drawn numeral recognition (successful
recognition examples for 2, 8 and 3).
Spatial resolution: 30
Network input size: 30
Hidden layer size: 15
Output size: 5
Training: back-propagation with momentum and an adaptive learning rate.
(Figure: sampled points, the points used for spline interpolation, and the fitted spline.)
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
81
Core Mirror Circuit
Schematic of the core circuit:
Inputs: object affordances; the hand state (from hand shape recognition and hand motion detection); and hand–object spatial relation analysis.
Object affordance – hand state association neurons (7b) combine these inputs.
Mirror neurons (F5 mirror) integrate the temporal association to produce action recognition and the mirror neuron output.
Mirror feedback from the motor program (F5 canonical, with motor execution) reaches the association stage.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
82
Connectivity pattern
Object affordance (AIP)
STS
F5mirror
7b
Motor Program
(F5canonical)
Mirror Feedback
7a
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
83
A single grasp trajectory viewed from three different
angles
The wrist trajectory during the
grasp is shown by square traces,
with the distance between any
two consecutive trace marks
traveled in equal time intervals.
How the network classifies the
action as a power grasp. Empty
squares: power grasp output;
filled squares: precision grasp;
crosses:
side grasp output
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
84
Power and precision grasp resolution
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
85
Future Directions
1) Technology: Increasing the robustness and learning rates
of the schemas using improved learning algorithms for
artificial neural nets.
2) Neuroscience: Implementing the schemas with
biologically plausible neural nets to model
neurophysiological data.
3) Learning to recognize variations on, and assemblages of,
familiar actions.
4) Imitation
5)
Language!
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
86
Related Research Issues:
Grasping and the Mirror System in Monkey
Modeling of monkey brain mechanisms for
 visually guided behavior
 mirror neurons
 vocalization and communication
 multi-modal integration
 compound behaviors and social interactions
 this will build, for example on our earlier modeling of interactions of pre-SMA and
basal ganglia, and of the role of dopamine in planning
based on behavior, neuroanatomy and neurophysiology
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
87
Some Specific Subgoals for Mirror System Modeling
Development of the Mirror System
Development of Grasp Specificity in F5 Motor and Canonical
Neurons
 Visual Feedback for Grasping: A Possible Precursor of the Mirror
Property

Recognition of Novel and Compound Actions and their Context
The Pliers Experiment: Extending the Visual Vocabulary
 Recognition of Compounds of Known Movements
 From Action Recognition to Understanding: Context and
Expectation

Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
88
Visual Feedback for Grasping:
A Possible Precursor of the Mirror Property
Hypothesis: the F5 mirror neurons develop by selecting, via re-afferent connections, patterns of
visual input
describing those relations of hand shapes and motions to objects that are effective in the visual guidance of a successful grasp.
The validation here is computational: if the hypothesis is correct, we will be able to show that
such a hand control system indeed exhibits most of the properties needed for a mirror system for
grasping.
For a reaching task, the simplest visual feedback is some form of signal of the distance between
object and hand. This may suffice for grabbing bananas, but for peeling a banana, feedback on
the shape of the hand relative to the banana, as well as force feedback become crucial.
We predict that the parameters needed for such visual feedback for grasp will look very much
like those we specified explicitly for our MNS1 hand state.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
89
Related Research Issues 2

a simple imitation system for grasping [Shared with common ancestor of human and
chimpanzee]
 a complex imitation system for grasping
Research:
Comparative modeling of primate brain mechanisms based on data from primate behavior,
neuroanatomy and neurophysiology and human brain imaging:
 extending the monkey model to chimp and human
 comparative/evolutionary model of different types of imitation
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
90
Learning in the Mirror System
For both oro-facial and grasp mirror neurons we may have a limited "hard-wired"
repertoire that can then be built on through learning:
1) Developing a set of basic grasps that are effective;
2) Learning to associate view of one's hand with grasp and object;
3) Matching this to views of others grasping;
4) Learning new grasps by imitations of others.
We do not know if the necessary learning is in F5 or elsewhere.
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
91
Beyond Simple Mirroring
Tuning mirror neurons in terms of self-movement: Possession of movement precedes its
perception
but
Imitation: Perception of a novel movement precedes ability to perform (some
approximation to) that movement.
Eventually, we perceive many things we cannot do:
Variation = known y + y
 Assemblage

A key question for our empirical study of imitation:
How can we objectively compare "imitation capability" across species?
Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language
92