Transcript - Video Recognition Systems

Towards building user-seeing computers
CRV’05 Workshop on Face Processing in Video
August 8-11, 2005, Victoria, BC, Canada
Gilles Bessens and Dmitry Gorodnichy
Computational Video Group
Institute for Information Technology
National Research Council Canada
http://synapse.vit.iit.nrc.ca
What does it mean "to see"?
When humans lose the sense of touch or hearing, they can still communicate using vision.
The same holds for computers: when information cannot be entered into the computer
using hands or speech, vision could provide a solution, if only computers could see...
Users with accessibility needs (e.g. residents of the SCO Ottawa Health Center) will
benefit the most, but other users would benefit too.
Seeing tasks:
1. Where - to see where the user is: {x,y,z…}
2. What - to see what the user is doing: {actions}
3. Who - to see who the user is: {names}
Figure: Perceptual user interface (PUI) diagram - three outputs feed the monitor: position {x,y,z,α,β,γ}, binary ON/OFF events, and recognition/memorization (e.g. "Unknown User!").
Our goal: to build systems which can do all three tasks.
Wish-list and constraints
Users want computers to be able to:
1. Automatically detect and recognize a user:
a) to load the user's personal Windows settings (e.g. font size, application window layout), which is very tedious work for users with disabilities;
b) to find the range of the user's motion, to map it to the computer control coordinates.
2. Enable written communication: e.g. typing a message in an email browser or on the internet.
3. Enable navigation in the Windows environment: selecting items from window menus and pushing buttons of Windows applications.
4. Detect visual cues from users (intentional blinks, mouth opening, repetitive or predefined motion patterns) for hands-free remote control:
a) mouse-type "clicks",
b) a vision-based lexicon,
c) computer control commands: "go to next/last window", "copy/cut/paste", "start Editor", "save and quit".
But limitations should be acknowledged:
• computer limitations - the system should run in real time (>10 fps);
• user mobility limitations - users have a limited range of motion; besides, the camera's field of view and resolution are limited;
• environmental limitations - the environment keeps changing; to accommodate these constraints we develop a state-transition machine which switches between the face detection, face recognition and face tracking modules (sketched below);
• other:
a) the need for the missing feedback which the feeling of touch when holding a mouse provides to mouse users;
b) the need for limited-motion-based cursor control and key entry.
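As an illustration of the mode-switching idea, here is a minimal sketch in Python; the mode names, inputs and transition logic are assumptions for illustration, not the system's actual interfaces.

```python
from enum import Enum, auto

class Mode(Enum):
    DETECT = auto()     # scan the whole frame for a face
    RECOGNIZE = auto()  # identify the detected face
    TRACK = auto()      # follow the face from frame to frame

def next_mode(mode, face_found, user_known, track_lost):
    # Hypothetical transition logic: detect -> recognize -> track,
    # falling back to full detection whenever the face is lost.
    if mode is Mode.DETECT:
        return Mode.RECOGNIZE if face_found else Mode.DETECT
    if mode is Mode.RECOGNIZE:
        return Mode.TRACK if user_known else Mode.DETECT
    return Mode.DETECT if track_lost else Mode.TRACK
```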
Evolution of seeing computers
• 1998. Proof-of-concept colour-based skin tracking [Bradski'98] - not precise
• 2001. Motion-based segmentation & localization - not precise
• 1999-2002. Several skin colour models developed - reached their limits
• 2001. Rapid face detection using rectangular wavelets of intensities [fg02]
• 2002. Subpixel-accuracy convex-shape nose tracking [Nouse™, fg02, ivc04]
• 2002. Stereo face tracking using projective vision [w. Roth, ivc04]
• 2003. Second-order change detection [Double-blink, ivc04]
• 2003-now. Neuro-biological recognition of low-res faces [avbpa05, fpiv04, fpiv05]
Figure: Typical result for face detection using the colour, motion and intensity components of video, with six different webcams.
Nouse™ "Nose as Mouse" - the good news:
The precision and convenience of tracking the convex-shape nose feature allow one to use the nose as a mouse (or as a joystick handle).
image → [motion, colour, edges, Haar wavelets] → nose search box: x, y, width, height
→ [convex-shape template matching] → nose tip detection: I, J (pixel precision)
→ [integration over continuous intensity] → X, Y (sub-pixel precision)
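The final sub-pixel step can be illustrated with a minimal sketch: one common way to refine a pixel-precision detection (I, J) to sub-pixel (X, Y) is an intensity-weighted centroid over the matched region. This is only an assumption about the "integration over continuous intensity" step; the exact Nouse™ formulation is not given in the talk.

```python
import numpy as np

def subpixel_tip(patch):
    """Refine a detection to sub-pixel (X, Y) by an intensity-weighted
    centroid of a small patch centred on the pixel-precision tip (I, J)."""
    w = patch.astype(np.float64)
    w -= w.min()                      # use brightness as weight ("mass")
    total = w.sum()
    if total == 0:                    # flat patch: fall back to the centre
        return patch.shape[1] / 2, patch.shape[0] / 2
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    return (w * xs).sum() / total, (w * ys).sum() / total
```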
Figure: Rating by Planeta Digital (Aug. 2003).
Main face recognition challenge
ICAO-conformed passport photograph (presently used for forensic identification).
Figure: Face recognition performance of image-based biometrics modalities (scale 0-100), by humans vs. by computers, in photos vs. in video.
Images obtained from surveillance cameras (of the 9/11 hijackers) and from TV. NB: VCD resolution is 320x240 pixels.
Keys to resolving FRiV problem
• 12 pixels between the eyes should be sufficient - the nominal face resolution (a resampling sketch follows this list)
• To beat low resolution and quality, use lessons from the human visual recognition system:
1) efficient visual attention mechanisms;
2) decisions based on accumulating results over several frames (rather than on one frame);
3) efficient neuro-associative mechanisms, a) to accumulate learning data over time by adjusting synapses, and b) to associate a visual stimulus with a semantic meaning based on the computed synaptic values, using:
» non-linear processing,
» massively distributed collective decision making,
» synaptic plasticity.
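For the nominal 12-pixel resolution, here is a minimal resampling sketch, assuming OpenCV and known eye positions; the function name and the 24x24 chip size are assumptions, not the paper's exact parameters.

```python
import numpy as np
import cv2  # OpenCV is an assumption; any affine-warp library would do

def canonical_face(gray, left_eye, right_eye, eye_dist=12, chip=(24, 24)):
    """Rotate and scale the image so the eyes are horizontal and
    eye_dist pixels apart, then crop a fixed-size face chip."""
    (x1, y1), (x2, y2) = left_eye, right_eye
    angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))   # eye-line angle
    scale = eye_dist / np.hypot(x2 - x1, y2 - y1)      # 12 px between eyes
    center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    M[0, 2] += chip[0] / 2.0 - center[0]   # move eye midpoint to chip centre
    M[1, 2] += chip[1] / 2.0 - center[1]
    return cv2.warpAffine(gray, M, chip)
```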
Lessons from biological vision
Saliency based localization and rectification
- implemented
Fovea vision: Accumulation over time and space
- implemented
Local brightness adjustment
- implemented
Recognition decision at time t depends on the recognition decision at time t-1
- implemented
Lessons from biological memory
• The brain stores information in the synapses connecting the neurons.
• In the brain: 10^10 to 10^13 interconnected neurons.
• Neurons are either at rest or activated, depending on the values of the other neurons Yj and the strength of the synaptic connections: Yi = {+1, -1}.
• The brain is a network of "binary" neurons evolving in time from an initial state (e.g. a stimulus coming from the retina) until it reaches a stable state - an attractor.
• What we remember are attractors!
This is the associative principle we all live by.
- implemented ?..
Refs: Hebb’49, Little’74,’78, Willshaw’71
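A minimal sketch of the attractor principle, in the spirit of the Hopfield-style networks cited above; the synchronous update schedule and the stopping rule here are assumptions for illustration.

```python
import numpy as np

def recall(C, stimulus, max_iters=100):
    """Evolve binary neurons Y in {+1,-1} from an initial state
    (the stimulus) until a stable state - an attractor - is reached."""
    y = stimulus.copy()
    for _ in range(max_iters):
        y_next = np.where(C @ y >= 0, 1, -1)   # threshold the weighted input
        if np.array_equal(y_next, y):          # stable state: what we "remember"
            break
        y = y_next
    return y
```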
From visual image → to saying a name
From the neuro-biological perspective, memorization and recognition are two stages of the associative process: from receptor stimulus R → to effector stimulus E.
In the brain - the main associative principle:
stimulus neuron Xi: {+1 or -1} → response neuron Yj: {+1 or -1} (e.g. the neuron for "Dmitry"), connected by a synaptic strength -1 < Cij < +1.
In the computer:
How to update weights
Learning rules: From biologically plausible to mathematically justifiable
Models of learning:
• Hebb (correlation learning): $C_{ij}^m = C_{ij}^{m-1} + \Delta C_{ij}^m$, with $\Delta C_{ij}^m = \frac{1}{N} V_i^m V_j^m$
• Generalized Hebb: $\Delta C_{ij}^m = a\,F(V_i^m, V_j^m)$
• Better rule: $\Delta C_{ij}^m = a\,F(C_{ij}^{m-1}, V_i^m, V_j^m)$, or even $\Delta C_{ij}^m = a\,F(C^{m-1}, V^m)$
• Widrow-Hoff's (delta) rule
• Projection learning: both incremental and taking into account the relevance of the training stimuli and their attributes
Refs: Amari’71,’77, Kohonen’72, Personnaz’85, Kanter-Sompolinsky’86,Gorodnichy‘95-’99
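The first two rules can be sketched as incremental weight updates; the delta-rule form below is the textbook Widrow-Hoff version for auto-association, not necessarily the exact variant used in the cited papers.

```python
import numpy as np

def hebb_update(C, V):
    """Correlation (Hebbian) learning: dC_ij = (1/N) V_i V_j."""
    return C + np.outer(V, V) / len(V)

def delta_update(C, V, alpha=0.1):
    """Widrow-Hoff (delta) rule: correct the weights in proportion
    to the network's current error on the training pattern V."""
    error = V - C @ V
    return C + alpha * np.outer(error, V) / len(V)
```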
Testing FRiV framework
• TV programs annotation
• IIT-NRC 160x120 video-based facial database (one video used for memorization, another for recognition)
From video input to neural output
1. Face-looking regions are detected using rapid classifiers.
2. They are verified to have skin colour and not to be static.
3. Face rotation is detected and corrected; an eye-aligned face, resampled to 12-pixels-between-the-eyes resolution, is extracted.
4. The extracted face is converted to a binary feature vector (Receptor).
5. This vector is then appended with a nametag vector (Effector).
6. The synapses of the associative neuron network are updated (steps 4-6 are sketched below).
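Steps 4-6 might look like the following sketch; the mean-threshold binarization and the one-of-n nametag encoding are assumptions for illustration.

```python
import numpy as np

def to_receptor(face_chip):
    """Step 4: binarize the resampled face into a {+1,-1} vector
    (here, by thresholding each pixel at the chip's mean intensity)."""
    v = face_chip.ravel().astype(np.float64)
    return np.where(v >= v.mean(), 1, -1)

def learn_frame(C, face_chip, name_id, n_names):
    """Steps 5-6: append a one-of-n nametag (Effector) to the Receptor
    and apply a Hebbian update to the synaptic matrix C."""
    nametag = -np.ones(n_names)
    nametag[name_id] = 1
    V = np.concatenate([to_receptor(face_chip), nametag])
    return C + np.outer(V, V) / len(V)   # C is (len(V), len(V))
```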
Time-weighted decision:
a) neural mode: all neurons with a PSP (postsynaptic potential) greater than a certain threshold, Sj > S0, are considered "winning";
b) max mode: the neuron with the maximal PSP wins;
c) time-filtered: the average or median of several consecutive frame decisions, each made according to a) or b), is used;
d) PSP time-filtered: the technique of a) or b) is applied to PSPs averaged over several consecutive frames instead of the PSPs of individual frames;
e) any combination of the above (mode d) is sketched below).
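For instance, mode d) combined with the max rule b) could be sketched as follows; the threshold handling is an assumption.

```python
import numpy as np

def psp_time_filtered(psp_frames, S0=0.0):
    """Average the PSPs over several consecutive frames, then let the
    neuron with the maximal averaged PSP win (None if below threshold S0)."""
    mean_psp = np.mean(psp_frames, axis=0)   # axis 0: over frames
    winner = int(np.argmax(mean_psp))
    return winner if mean_psp[winner] > S0 else None
```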
S10 - the number of frames in the 2nd video clip of the pair in which the face is associated with the correct person (i.e. the one seen in the 1st video clip of the pair), without any association with other seen persons - best (non-hesitant) case
S11 - ... in which the face is associated not with one individual but with several individuals, one of which is the correct one - good (hesitating) case
S01 - ... in which the face is associated with someone else - worst case
S02 - ... in which the face is associated with several individuals (none of which is correct) - wrong but hesitating case
S00 - ... in which the face is not associated with any of the seen faces - not bad case
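A per-frame scoring of these categories might look like the following sketch, assuming each frame yields the set of individuals the face was associated with.

```python
def frame_category(assoc_ids, correct_id):
    """Map one frame's associations to the S-categories above."""
    if not assoc_ids:
        return "S00"                              # associated with no one
    if correct_id in assoc_ids:
        return "S10" if len(assoc_ids) == 1 else "S11"
    return "S01" if len(assoc_ids) == 1 else "S02"
```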
Combining results
Perceptual Vision Interface Nouse™
• Evolved from a single demo program into a hands-free perceptual operating system
• Combines all the techniques presented and provides a clear vision for other, to-be-developed seeing computers
• Requires more man-power for tuning and software design, contingent upon extra funding…
Nouse initialization and calibration:
1. Nouse connected
2. User's face detected
3. User recognized
4. User's motion range obtained
5. Nouse zero position (0,0) set
In operation:
• face position converted to (X,Y) (used for typing and cursor control)
• visual pattern analyzed (for hands-free commands)
A calibration sketch follows.
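The calibration step, mapping the user's motion range to screen coordinates, could be sketched as below; the linear mapping, the names and the default screen size are assumptions.

```python
def face_to_screen(x, y, motion_range, screen=(1280, 1024)):
    """Map a face position inside the calibrated motion range
    to absolute screen coordinates for cursor control."""
    (x_min, x_max), (y_min, y_max) = motion_range
    u = (x - x_min) / (x_max - x_min)        # normalize to [0, 1]
    v = (y - y_min) / (y_max - y_min)
    u = min(max(u, 0.0), 1.0)                # clamp to the screen
    v = min(max(v, 0.0), 1.0)
    return int(u * (screen[0] - 1)), int(v * (screen[1] - 1))
```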