Human Object Recognition
Part 3 of the Biomimetic Trilogy
Bruce Draper
Review: A Divided Vision System

The human vision system has three major components:
1. The early vision system
   – Retinogeniculate pathway: Retina → LGNd → V1 (→ V2, V3); M and P channels
   – Retinotectal pathway:
     – Retina → S.C. → Pulvinar Nucleus → V1 (→ V2, V3)
     – Retina → S.C. → Pulvinar Nucleus → MT (dorsal)
     – Retina → S.C. → LGNd (interlaminar) → V1 (→ V2, V3)
2. The dorsal ("where") pathway
3. The ventral ("what") pathway

D. Milner & M. Goodale, The Visual Brain in Action, p. 22
The Early Vision System
– Retinotopically mapped
  – Small receptive fields in LGNd, V1
  – Receptive fields grow with processing depth: bigger in V2, bigger still in V3…
– Spatially organized into feature maps
  – Edge maps (Gabor filters, quadrature pairs)
  – Color maps
  – Disparity maps
  – Motion maps (in MT, if not before)
– Afferent & efferent connections
– Measurable neural correlates of spatial attention
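The edge maps above are built from Gabor quadrature pairs. As a minimal illustrative sketch (not the talk's implementation — all names and parameter values here are my own choices), a cosine/sine pair at one orientation yields a phase-invariant "edge energy":

```python
import math

def gabor_pair(size=7, wavelength=4.0, sigma=2.0, theta=0.0):
    """Build an even/odd (cosine/sine) Gabor quadrature pair.

    Returns two size x size kernels tuned to orientation theta.
    """
    half = size // 2
    even, odd = [], []
    for y in range(-half, half + 1):
        e_row, o_row = [], []
        for x in range(-half, half + 1):
            # Rotate coordinates into the filter's preferred orientation.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2 * sigma * sigma))
            phase = 2 * math.pi * xr / wavelength
            e_row.append(envelope * math.cos(phase))
            o_row.append(envelope * math.sin(phase))
        even.append(e_row)
        odd.append(o_row)
    return even, odd

def oriented_energy(patch, even, odd):
    """Phase-invariant edge energy: sum of squared quadrature responses."""
    n = len(even)
    re = sum(patch[i][j] * even[i][j] for i in range(n) for j in range(n))
    ro = sum(patch[i][j] * odd[i][j] for i in range(n) for j in range(n))
    return re * re + ro * ro
```

Because the energy sums the squared even and odd responses, it fires on an edge regardless of whether the edge falls on the filter's peak or its zero crossing.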
An Early Vision Hypothesis

The primary role of the early vision system is spatial attention.
– Logic: why compute any feature across the entire image when it would be cheaper to compute it later across only the attention window? Because you need the feature to select the attention window.
– Neural evidence: neural correlates of spatial attention (e.g. anticipatory firing, enhanced firing) are measurable in V1 and even LGNd.
– Psychological evidence: the ventral and dorsal streams appear to process the same attention windows, suggesting that attention is selected prior to the ventral/dorsal split.
– Caveat: some dorsal vision tasks (e.g. ego-motion estimation) benefit from a broad field of view, and may be non-attentional.
The Dorsal/Ventral Split
Color codes:
– Red: early vision
– Orange/Yellow: dorsal
  – Leads to somatosensory and motor cortex
– Blue/Green: ventral
  – Leads more to memories, frontal cortex
  – More developed in humans than in monkeys
A Dorsal Vision Hypothesis

Milner & Goodale: the dorsal vision system supports immediate actions, not cognition or memory.
– Anatomical evidence:
  1. Strongly connected to motion and stereo processing in V1
  2. Dorsal areas (e.g. LIP, 7a) inactive under anaesthesia
  3. Neurons conjointly tuned for perception and action
  4. Saccade-responsive neurons and gaze-responsive neurons
– Behavioral evidence:
  1. Monkeys with dorsal lesions recognize objects but can't grab them
  2. Blindsight (see next slide)
Blindsight
– Patients with severe damage to V1 are "cortically blind"
  – Report no sensation of vision
  – MRI confirms no activity in V1
  – Saccadic eye movements continue
– Nonetheless, they can point at targets
  – Much better than random (see chart)
  – Once they relax & let it happen
– Why?
  – Retina → S.C. → Pulvinar Nucleus → MT (dorsal)
  – MRI confirms some dorsal vision activity
– So?
  – Confirms that dorsal vision has no contact with cognition
A Ventral Vision Hypothesis

Milner & Goodale: the ventral pathway supports vision for cognition, including (categorical & sub-categorical) object recognition and landmark-based navigation.
– Anatomical evidence:
  1. Visual pathways connect early vision to areas associated with memory (e.g. the right inferior frontal lobe (RIFL))
  2. MRI centers of activity in the ventral stream during (a) expert object recognition and (b) landmark recognition
– Behavioral evidence:
  1. Ventral lesions in monkeys prevent object recognition
  2. Lesions in the fusiform gyrus in humans lead to prosopagnosia
  3. Stimulation of RIFL during surgery creates mental images

This may seem like a tangent, but it's not…
Repetition Suppression

What happens when the same stimulus is presented repeatedly to the vision system?
– In fMRI studies, the total response of a voxel drops with each presentation
– In single-cell recording studies, neural responses become extreme:
  – Most cells stop firing altogether
  – A few cells start responding at their maximal firing rate
– This can be observed in the ventral stream, but not in the early vision system
– This can be observed at both short and long time scales
  – Short-time-scale repetition suppression is interrupted by novel targets
Decomposing the Ventral Stream

The ventral stream has 4 major parts, as revealed by MRI:
1. The early vision system
   – Both the ventral & dorsal streams start here
   – Selects spatial attention windows (our hypothesis)
2. The lateral occipital cortex
   – Large area, diffusely active in MRI studies
   – Including (at least) V4 & V8
   – Kosslyn hypothesizes feature extraction
3. The inferotemporal cortex
   – Large area, diffusely active in MRI studies
   – Sharp focus of activity in the fusiform gyrus during expert recognition
   – Sharp focus of activity in the parahippocampal gyrus during landmark recognition
4. The right inferior frontal cortex
   – Associated with visual memories
   – Efferently stimulates V1 when active
   – Strongly lateralized
Area V8 (Lateral Occipital Cortex)
– Short-term repetition studies suggest V8 computes edge-based features
  – Equal amounts of suppression for image/image, image/edge, edge/image, or edge/edge pairs
– Psychological studies suggest that recognition is sensitive to the disruption of "non-accidental" features:
  1. Collinearity
  2. Parallelism (translational symmetry)
  3. Reflection (anti-symmetry)
  4. Co-termination (end-points near one another)
  5. Constant curvature
– Diffuse response suggests population coding
An LOC Hypothesis

Area V8 detects non-accidental edge relations through parameter-space voting schemes (e.g. Hough spaces). Other LOC areas use voting schemes to summarize other features, e.g. color histograms in area V4/V7. Together, LOC areas create a high-dimensional but distributed feature representation.
– Evidence:
  – Diffuse responses are consistent with population codes
  – Fits psychological models of LOC as feature extraction
  – Explains repetition suppression effects in V8
  – Explains non-classical receptive field responses in V1 (assuming efferent feedback to early vision)
Infero-temporal Cortex (IT)
– Diffusely active in fMRI during all types of object recognition
– Last visual processing stage before memories
– Distributed responses to objects (Tsunoda, et al.)

[Figure: test stimulus and hot spots (versus control), shown at different levels of statistical significance]
Inferotemporal Cortex (continued)
– Hot spots overlap, and aren't contiguous (population code)
– Some stimuli yield greater total responses; responses overlap
– Always some response
– Minimal effect of stimulus intensity
IT (III): when the stimulus is simplified

Significant results:
– Figure A is a control: hot spots from 3 different objects
– Figure B: red spots respond to the whole cat; a subset of spots (blue) respond to just the head; a subset of those responds to a silhouette of the head (yellow)
  – Implication: part-based features
– Figure C: the blue spot responds to the whole object, but not to the simplification; some red spots respond only to the simplified version
  – Implication: a more complex scenario in which some feature responses are turned off by the whole object (competition?)
An IT Hypothesis

Repetition suppression in infero-temporal cortex implements unsupervised feature-space segmentation, thus categorizing attention windows.
– Repetition suppression effects are strongest in IT
– Single-cell recording studies show that IT cells respond to multiple features (e.g. color + shape)
– Simpler organizations (e.g. part/subpart hierarchies, "view maps") are not supported by single-cell recording data
Expert Object Recognition

Expert object recognition applies when:
– The viewer is very familiar with the target object
– The illumination and viewpoint are familiar
– The target is recognized at both a categorical & sub-categorical level
– Example: human faces
  – Sub-categories: expression, age, gender

Expert recognition properties include:
– Fine sub-categorical discrimination, increased recognition speed
– Equal response times for category/sub-category
– Inability to dissociate categorical & sub-categorical recognition
– Trainable
  – Everyone is expert at recognizing faces and chairs; dog show judges are expert at dogs; subjects can be trained to be expert with Greebles.
Expert Object Recognition (II)

Anatomically, expert object recognition is distinguished by:
1. (fMRI) Activation of early vision, LOC & IT
   – All forms of recognition do this
2. (fMRI) Sharp centers of activation in the fusiform gyrus (in IT) and the right inferior frontal lobe
3. (ERP) The N170 signal (170 ms post-stimulus)
An Expert Recognition Hypothesis

Expert object recognition is appearance-based, matching the current stimulus to previous memories. When a category becomes familiar, the fusiform gyrus is recruited to build a manifold representation of the samples. Sub-categorical properties are encoded in the manifold dimensions.
– Evidence:
  1. Expert recognition is illumination- & viewpoint-dependent
  2. It activates RIFL, which creates mental images & can activate the image buffers in V1
An End-to-End Computational Model

(1) Bottom-up spatial selective attention
– Multi-scale maps for intensity, colors, edges (V1)
– Difference-of-Gaussians (on-center/off-surround) filtering to find impulses
– Select peaks in (x, y, scale) as attention windows
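A minimal sketch of this step, reduced to one dimension and a single scale (the model itself is multi-scale and 2-D; kernel sizes and sigmas below are illustrative assumptions, not the talk's values): difference-of-Gaussians filtering followed by local-maximum selection.

```python
import math

def gaussian_kernel(sigma, radius=None):
    """Normalized 1-D Gaussian kernel."""
    radius = radius or int(3 * sigma)
    k = [math.exp(-x * x / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve(signal, kernel):
    """1-D convolution with border clamping."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), len(signal) - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def dog_peaks(signal, sigma_center=1.0, sigma_surround=2.0):
    """On-center/off-surround (difference-of-Gaussians) response,
    then positive local maxima as candidate attention locations."""
    center = convolve(signal, gaussian_kernel(sigma_center))
    surround = convolve(signal, gaussian_kernel(sigma_surround))
    dog = [c - s for c, s in zip(center, surround)]
    return [i for i in range(1, len(dog) - 1)
            if dog[i] > dog[i - 1] and dog[i] > dog[i + 1] and dog[i] > 0]
```

An impulse in the signal survives the center-surround subtraction and is selected as a peak, while smooth regions cancel out.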
Step 1 Issues

Issues with step #1:
– More information channels
  – Motion
    – Trent Williams found this is hard
  – Disparity
– Inhibition of return
– Top-down control
  – Integration of predictions (predictive attention)
  – Split attention?

Note: attention windows do not correspond to objects. They are just interesting parts of the image (but repeatability is key).
Step 2: Feature Extraction

(2) Attention windows are converted into fixed-length sparse feature vectors by parameter-space voting techniques.
– V8 is modeled with multiple non-accidental features:
  – Hough space for collinearity
  – Hough space of axes of reflection for anti-symmetry and co-termination
– V4 is modeled as a color histogram
– Simplest feature: low-resolution pixels
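The V4 color-histogram feature above can be sketched in a few lines. This is a hypothetical stand-in (bin count and pixel format are my assumptions, not the talk's), showing only how a variable-size attention window becomes a fixed-length vector:

```python
def color_histogram(window, bins_per_channel=4):
    """Map an attention window (a list of (r, g, b) tuples with values
    in 0..255) to a fixed-length, normalized color histogram.

    This is a crude stand-in for the V4 color-map summary: whatever the
    window's size, the output always has bins_per_channel**3 entries.
    """
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in window:
        # Quantize each channel, clamping 255 into the top bin.
        ri, gi, bi = (min(c * n // 256, n - 1) for c in (r, g, b))
        hist[ri * n * n + gi * n + bi] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]
```

Normalizing by the pixel count makes the feature invariant to window size, which matters because attention windows come at many scales.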
Step 2 Examples

Collinearity: edges in the source attention window vote in Hough space for the positions and orientations of lines (image space → Hough space).

Reflection (symmetry & vertices): pairs of edges vote for the axes of reflection that map one onto the other (if any).
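The collinearity voting described above can be sketched as follows (an illustrative sketch, not the talk's implementation; the bin counts are arbitrary choices of mine). Each edge point votes for every line x·cos θ + y·sin θ = ρ passing through it; collinear points pile their votes into one (θ, ρ) bin:

```python
import math

def hough_lines(edges, n_theta=18, n_rho=20, rho_max=20.0):
    """Edge points vote in a discretized (theta, rho) parameter space.

    edges: list of (x, y) points. Returns a dict mapping
    (theta_bin, rho_bin) to vote counts; strong bins correspond to
    collinear groups of edges.
    """
    votes = {}
    for (x, y) in edges:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int((rho + rho_max) * n_rho / (2 * rho_max))
            if 0 <= r <= n_rho:
                votes[(t, r)] = votes.get((t, r), 0) + 1
    return votes

def strongest_line(votes):
    """The (theta_bin, rho_bin) cell with the most votes."""
    return max(votes, key=votes.get)
```

Flattening the vote table into a vector gives exactly the fixed-length sparse feature the slide calls for: most bins are empty, and non-accidental structure shows up as a few heavy bins.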
Step 2 Issues
– Missing features:
  – Constant curvature (V8)
  – Apparent-color-corrected histograms (V4)
  – Disparity features
– Huge parameter space
  – How to evaluate features without supervision?
Step 3: Feature Space Segmentation

(3) IT is modeled as O(1) unsupervised segmentation:
– The features extracted in step #2 are concatenated to form a single, high-dimensional representation
– A 1-level neural net is trained to segment the samples:
  – If a neuron responds < 0.5 to a sample, give it a training signal of 0 for that sample
  – If a neuron responds > 0.5, give it a training signal of 1.0
  – Note that every neuron is trained independently; there is no communication among them
– The response of IT to a sample is the vector of binarized neural responses
  – Each pattern of responses is a region in feature space
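The training rule above can be sketched as follows. This is a hypothetical rendering: the sigmoid unit, learning rate, and class names are my assumptions; only the self-generated 0/1 training signal and the binarized region code come from the slide.

```python
import math
import random

class RegionNeuron:
    """One unit of the 1-level net. Each unit pushes its own response
    toward 0 or 1 depending on which side of 0.5 it already falls:
    the self-supervised rule from the slide."""

    def __init__(self, dim, rng):
        self.w = [rng.uniform(-1, 1) for _ in range(dim)]
        self.b = rng.uniform(-1, 1)

    def respond(self, x):
        a = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-a))  # sigmoid response in (0, 1)

    def train(self, x, lr=0.5):
        y = self.respond(x)
        target = 1.0 if y > 0.5 else 0.0   # self-generated training signal
        grad = (target - y) * y * (1 - y)  # squared-error gradient
        self.w = [wi + lr * grad * xi for wi, xi in zip(self.w, x)]
        self.b += lr * grad

def region_code(neurons, x):
    """IT's response to a sample: the vector of binarized responses.
    Each distinct pattern names a region in feature space."""
    return tuple(1 if n.respond(x) > 0.5 else 0 for n in neurons)
```

Because every neuron trains independently, the net needs no inter-neuron communication, which is what makes the segmentation O(1) in the number of units.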
Step 3 Issues
– Stability
  – If neurons keep adapting, then region codes change
  – Linear neurons imply non-local interactions
    – Radial basis neurons should perform better
– Evaluation: what makes one categorization better than another?
  – No supervised training data
  – The number and size of categories vary
  – Gabe Salazar is cutting his teeth on this one…
– Top-down predictions
  – Can we predict a category, and use it to influence steps 1 & 2?
Steps 4 & 5 (unimplemented)

(4) Create a sub-space manifold to describe samples in crowded regions.
– PCA subspaces are a first approximation
– Locally linear embedding manifolds are better
  – Sub-categories should correspond to manifold dimensions

(5) Associative memory
– Associate attention windows with:
  – Other attention windows (to generate predictions)
  – Other modalities (e.g. language)
– Adele Howe and I have a joint interest in this last point
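Step 4's PCA first approximation can be sketched with power iteration (an illustrative pure-Python sketch; the function names and iteration count are mine). The first principal axis of a crowded region is the direction along which its samples vary most:

```python
import random

def principal_axis(samples, iters=100, seed=0):
    """First PCA axis of a set of samples, via power iteration on the
    mean-centered covariance. Returns (mean, unit axis vector)."""
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / len(samples) for d in range(dim)]
    centered = [[s[d] - mean[d] for d in range(dim)] for s in samples]
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(dim)]
    for _ in range(iters):
        # Multiply v by the covariance: sum over samples of (x . v) x.
        w = [0.0] * dim
        for x in centered:
            proj = sum(xd * vd for xd, vd in zip(x, v))
            for d in range(dim):
                w[d] += proj * x[d]
        norm = sum(wd * wd for wd in w) ** 0.5 or 1.0
        v = [wd / norm for wd in w]
    return mean, v
```

Projecting a sample onto this axis gives its coordinate along the first manifold dimension; in the hypothesis above, that coordinate would encode a sub-categorical property.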
Conclusion
– We have a biologically plausible model that
  – Learns to extract and categorize image windows from larger scenes
  – Without any human supervision or intervention
– We need help improving, evaluating, and extending it
  – Interested parties should let me know!