The Science of Silly Walks
Hedvig Sidenbladh
Royal Inst. of Technology, KTH
Stockholm Sweden
Michael J. Black
Department of Computer Science
Brown University
http://www.nada.kth.se/~hedvig
http://www.cs.brown.edu/~black
Collaborators
David Fleet, Xerox PARC
Nancy Pollard, Brown University
Dirk Ormoneit and Trevor Hastie
Dept. of Statistics, Stanford University
Allan Jepson, University of Toronto
The (Silly) Problem
Inferring 3D Human Motion
* Infer 3D human motion from 2D image properties.
* No special clothing.
* Monocular, grayscale sequences (archival data).
* Unknown, cluttered environment.
* Incremental estimation.
Why is it Hard?
Singularities in
viewing direction
Unusual viewpoints
Self occlusion
Low contrast
Clothing and Lighting
Large Motions
Limbs move rapidly with respect to their width.
Non-linear dynamics.
Motion blur.
Ambiguities
Where is the leg?
Which leg is in front?
Ambiguities
Accidental alignment
Ambiguities
Occlusion
Whose legs are whose?
Inference/Issues
Bayesian formulation
p(model | cues) = p(cues | model) p(model) / p(cues)
1. Need a constraining likelihood model that is also
invariant to variations in human appearance.
2. Need a prior model of how people move.
3. Need an effective way to explore the model
space (very high dimensional) and represent
ambiguities.
Simple Body Model
* Limbs are truncated cones
* Parameter vector φ of joint angles and angular velocities.
Key Idea #1 (Likelihood)
1. Use the 3D model to predict the location of
limb boundaries (not necessarily features) in
the scene.
2. Compute various filter responses steered to the
predicted orientation of the limb.
3. Compute likelihood of filter responses using a
statistical model learned from examples.
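The three steps above can be sketched in a few lines of numpy. The Gaussian-derivative filters and the sign convention of the steered response are assumptions (a minimal sketch, not the talk's implementation), and the image is a toy stand-in:

```python
import numpy as np

def gauss_deriv_kernels(sigma, radius=None):
    """1D Gaussian and its derivative, for separable x/y derivative filters."""
    r = radius or int(3 * sigma)
    t = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-t**2 / (2 * sigma**2))
    g /= g.sum()
    dg = -t / sigma**2 * g          # derivative of Gaussian
    return g, dg

def steered_edge_response(img, theta, sigma):
    """Edge filter response steered to limb orientation theta:
       f_e = sin(theta) f_x - cos(theta) f_y  (sign convention assumed)."""
    g, dg = gauss_deriv_kernels(sigma)
    # separable convolution: derivative along one axis, smoothing along the other
    fx = np.apply_along_axis(lambda m: np.convolve(m, dg, 'same'), 1, img)
    fx = np.apply_along_axis(lambda m: np.convolve(m, g, 'same'), 0, fx)
    fy = np.apply_along_axis(lambda m: np.convolve(m, dg, 'same'), 0, img)
    fy = np.apply_along_axis(lambda m: np.convolve(m, g, 'same'), 1, fy)
    return np.sin(theta) * fx - np.cos(theta) * fy
```

The response is largest when the steering angle matches the orientation of the predicted limb boundary.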
Example Training Images
Edge Filters
Normalized derivatives of Gaussians
(Lindeberg, Granlund
and Knutsson, Perona, Freeman&Adelson, …)
Edge filter response steered to limb orientation:
f_e(x, θ, σ) = sin θ f_x(x, σ) − cos θ f_y(x, σ)
Filter responses steered to arm orientation.
Distribution of Edge Filter Responses
p_on(F), p_off(F)
Likelihood ratio p_on/p_off used for edge detection
(Geman & Jedynak; Konishi, Yuille, & Coughlan)
Object specific statistics
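A minimal sketch of the object-specific likelihood ratio: in the talk, p_on and p_off are learned from hand-marked limbs in real images; here Gaussian samples stand in for the training responses, and the histogram binning is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for training data: filter responses on limb edges vs. background.
f_on = rng.normal(0.6, 0.3, 5000)    # responses on marked limb boundaries
f_off = rng.normal(0.0, 0.2, 5000)   # responses at random background points

bins = np.linspace(-1.5, 1.5, 41)
eps = 1e-6
p_on, _ = np.histogram(f_on, bins=bins, density=True)
p_off, _ = np.histogram(f_off, bins=bins, density=True)

def log_likelihood_ratio(f):
    """log p_on(F)/p_off(F): positive values support 'limb edge here'."""
    i = np.clip(np.digitize(f, bins) - 1, 0, len(p_on) - 1)
    return np.log(p_on[i] + eps) - np.log(p_off[i] + eps)
```

Summing these log ratios over a predicted limb boundary gives the edge term of the likelihood.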
Other Cues
I(x, t)
I(x+u, t+1)
Motion
Ridges
Key Idea #2 (Likelihood)
“Explain” the entire image.
p(image | foreground, background) = const ∏_{fore pixels} p(image | fore) / p(image | back)
Foreground person; generic, unknown background.
Foreground should explain what the background can’t.
Likelihood
Steered edge filter responses provide limb cues; the likelihood uses the ratio
p(filter response | person) / p(filter response | background)
Crude assumption: filter responses independent across scale.
Learning Human Motion
* Constrain the posterior to likely and valid poses/motions.
* Model the variability.
* 3D motion-capture data: a database of joint angles over time, with multiple actors and a variety of motions (from M. Gleicher).
Key Idea #3 (Prior)
Problem:
* insufficient data to learn probabilistic model
of human motion.
Alternative:
* the data represents all we know (Efros & Freeman ’01)
* replace representation and learning with search (search has to be fast)
* (De Bonet & Viola, Efros & Leung, Efros & Freeman, Pasztor & Freeman, Hertzmann et al., …)
Implicit Empirical Distribution
Off-line:
• learn a low-dimensional model of every n-frame
sequence of joint angles and angular velocities
(Leventon & Freeman, Ormoneit et al, …)
• project training data onto model to get small
number of coefficients describing each time instant
• build a tree structured representation
“Textural” Model
On-line: Given an n-frame input motion
• project onto low-dimensional model.
• index in log time using the coefficients.
• return the best k approximate matches (and form a
“proposal” distribution).
• sample from them and return the (n+1)st pose.
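The off-line/on-line procedure above can be sketched as follows. The database, window length, and PCA dimensionality are toy assumptions, and a brute-force nearest-neighbor search stands in for the tree index that gives the log-time lookup:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the mocap database: n-frame windows of joint angles,
# projected to a few PCA coefficients per window (dimensions assumed).
n, d, q = 5, 10, 4                    # window length, joint angles, PCA dims
train = np.cumsum(rng.normal(size=(500, d)) * 0.1, axis=0)  # smooth fake motion
windows = np.stack([train[i:i + n].ravel() for i in range(len(train) - n)])
mean = windows.mean(0)
_, _, Vt = np.linalg.svd(windows - mean, full_matrices=False)
coeffs = (windows - mean) @ Vt[:q].T  # low-dimensional index per time instant

def propose_next_pose(recent, k=5):
    """Project the last n input frames, find the k nearest training windows
    (brute force here; a tree index gives the log-time lookup in the talk),
    sample one from this "proposal" set, and return the pose that followed."""
    c = (recent.ravel() - mean) @ Vt[:q].T
    idx = np.argsort(np.sum((coeffs - c) ** 2, axis=1))[:k]
    j = rng.choice(idx)
    return train[j + n]               # the (n+1)st pose after that window

next_pose = propose_next_pose(train[100:105])
```

Sampling from the k best matches, rather than always taking the single best, is what keeps the proposal a distribution rather than a deterministic lookup.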
Synthetic Walker
* Colors indicate different training sequences.
Synthetic Swing Dancer
Bayesian Formulation
Posterior over model parameters given an image sequence:
p(φ_t | I_t) ∝ p(I_t | φ_t) ∫ p(φ_t | φ_{t−1}) p(φ_{t−1} | I_{t−1}) dφ_{t−1}
The likelihood of observing the image given the model parameters, combined with the temporal model (prior) and the posterior from the previous time instant.
Key Idea #4 (Ambiguity)
* Represent a multi-modal posterior probability distribution over model parameters:
- sampled representation
- each sample is a pose and its probability
- predict over time using a particle filtering approach
Samples from a distribution over 3D poses.
Particle Filter
Posterior p(φ_{t−1} | I_{t−1})
→ sample from the temporal dynamics p(φ_t | φ_{t−1})
→ evaluate the likelihood p(I_t | φ_t)
→ normalize
→ posterior p(φ_t | I_t)
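The sample, predict, weight, normalize loop above is a standard particle filter. This toy sketch uses a 1D state and a Gaussian observation model standing in for the high-dimensional pose vector and the image likelihood (all constants are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000                      # number of particles (N ~ 10^3 in the talk)

def particle_filter_step(particles, observe, sigma_dyn=0.1, sigma_obs=0.2):
    """One Condensation-style step: select, predict/diffuse, update."""
    # Prediction: sample from the temporal prior p(f_t | f_{t-1}).
    particles = particles + rng.normal(0, sigma_dyn, size=particles.shape)
    # Updating: evaluate the likelihood p(I_t | f_t) and normalize.
    w = np.exp(-0.5 * ((particles - observe) / sigma_obs) ** 2)
    w /= w.sum()
    # Selection: resample so the posterior is carried by equally weighted samples.
    return rng.choice(particles, size=len(particles), p=w)

particles = rng.uniform(-2, 2, N)     # broad initial distribution over "pose"
for obs in [0.5, 0.55, 0.6]:          # fake image evidence drifting over time
    particles = particle_filter_step(particles, obs)
```

Because the posterior is a sample set, it can stay multi-modal until the evidence resolves an ambiguity, which is the point of Key Idea #4.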
What does the posterior look like?
Shoulder: 3 dof; elbow: 1 dof. Elbow bends.
Stochastic 3D Tracking
* 2500 samples, multiple cues.
Conclusions
Inferring human motion, silly or not, from video is
challenging.
We have tackled three important parts of the problem:
1. Probabilistically modeling human
appearance in a generic, yet useful, way.
2. Representing the range of possible motions
using techniques from texture modeling.
3. Dealing with ambiguities and non-linearities
using particle filtering for Bayesian inference.
Learned Walking Model
* mean walker
Learned Walking Model
* sample with small e
Learned Walking Model
* sample with moderate e
Learned Walking Model
(Silly-Walk Generator)
* sample with very large e
Tracking with Occlusion
1500 samples, ~2 minutes/frame.
Moving Camera
1500 samples, ~2 minutes/frame.
Ongoing and Future Work
Hybrid Monte Carlo tracker (Choo and Fleet ’01)
* analytic, differentiable, likelihood.
Learned dynamics.
Correlation across scale.
Estimate background motion.
Statistical models of color and texture.
Automatic initialization.
Training data and likelihood models to be made available on the web.
Lessons Learned
* Probabilistic (Bayesian) framework allows
- integration of information over time
- modeling of priors
* Particle filtering allows
- multi-modal distributions
- tracking with ambiguities and
non-linear models
* Learning image statistics and combining cues
improves robustness and reduces computation
Outlook
5 years:
- Relatively reliable people tracking in
monocular video.
- Path is pretty clear.
Next step: Beyond person-centric
- people interacting with object/world
… solve the vision problem.
Beyond that: Recognizing action
- goals, intentions, ...
… solve the AI problem.
Conclusions
* Generic, learned, model of appearance.
• Combines multiple cues.
* Exploits work on image statistics.
* Use the 3D model to predict features.
* Principled way to choose filters.
* Model of foreground and background is
incorporated into the tracking framework.
• exploits the ratio between foreground and
background likelihood.
• improves tracking.
Motion Blur
Requirements
1. Represent uncertainty and multiple hypotheses.
2. Model non-linear dynamics of the body.
3. Exploit image cues in a robust fashion.
4. Integrate information over time.
5. Combine multiple image cues.
What Image Cues?
Pixels?
Temporal differences?
Background differences?
Edges?
Color?
Silhouettes?
Brightness Constancy
(frames at t−1 and t)
I(x, t+1) = I(x+u, t) + η
Image motion of foreground as a function of the
3D motion of the body.
Problem: no fixed model of appearance (drift).
What do people look like?
Changing background
Varying shadows
Occlusion
Deforming clothing
Low contrast limb boundaries
What do non-people look like?
Edges as a Cue?
• Probabilistic model?
• Under/over-segmentation,
thresholds, …
Contrast Normalization?
w = (1 + tanh(S · contrast − O)) / (2 · contrast)
I_norm = log(I / Î)
(Lee, Mumford & Huang)
Contrast Normalization
Maximize the difference between the distributions,
* e.g. the Bhattacharyya distance:
B(p_on, p_off) = −log ∫ √( p_on(y) p_off(y) ) dy
Local Contrast Normalization
Ridge Features
Scale specific.
f_r(x, θ, σ) = | sin²θ f_xx(x, σ) + cos²θ f_yy(x, σ) − 2 sin θ cos θ f_xy(x, σ) |
− | cos²θ f_xx(x, σ) + sin²θ f_yy(x, σ) + 2 sin θ cos θ f_xy(x, σ) |
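A sketch of the steered ridge response built from second Gaussian derivatives; the kernel construction and the sign conventions of the two steered terms are assumptions of this sketch:

```python
import numpy as np

def second_deriv_responses(img, sigma):
    """f_xx, f_yy, f_xy via separable Gaussian-derivative convolutions."""
    r = int(3 * sigma)
    t = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-t**2 / (2 * sigma**2)); g /= g.sum()
    dg = -t / sigma**2 * g
    ddg = (t**2 / sigma**4 - 1 / sigma**2) * g   # second derivative of Gaussian
    conv = lambda m, k, ax: np.apply_along_axis(
        lambda v: np.convolve(v, k, 'same'), ax, m)
    fxx = conv(conv(img, ddg, 1), g, 0)
    fyy = conv(conv(img, ddg, 0), g, 1)
    fxy = conv(conv(img, dg, 1), dg, 0)
    return fxx, fyy, fxy

def steered_ridge_response(img, theta, sigma):
    """f_r steered to limb orientation theta: the cross-limb second derivative
    magnitude minus the along-limb one (ridge, not just edge energy)."""
    fxx, fyy, fxy = second_deriv_responses(img, sigma)
    s, c = np.sin(theta), np.cos(theta)
    return (np.abs(s * s * fxx + c * c * fyy - 2 * s * c * fxy)
            - np.abs(c * c * fxx + s * s * fyy + 2 * s * c * fxy))
```

The subtraction suppresses blob-like and isotropic structure, so the response is scale specific: it peaks when sigma matches the limb's width in the image.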
Ridge Thigh Statistics
Brightness Constancy
I(x, t)
I(x+u, t+1)
What are the statistics of brightness variation
I(x, t) - I(x+u, t+1)?
Variation due to clothing, self shadowing, etc.
Brightness Constancy
(Statistics shown at scale 0 and scale 4, with edge statistics for comparison.)
Temporal Model: Smooth Motion
* Individual angles and velocities are assumed independent:
p(φ_{i,t} | φ_{i,t−1}, V_{i,t−1}) = G(φ_{i,t} − (φ_{i,t−1} + V_{i,t−1}), σ_i^φ)  if φ_{i,t} ∈ [φ_i^min, φ_i^max], and 0 otherwise
p(V_{i,t} | V_{i,t−1}) = G(V_{i,t} − V_{i,t−1}, σ_i^V)
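Sampling from this smooth-motion prior for one joint might look like the following; rejection sampling for the joint limits is an implementation choice of this sketch, not something the slides specify:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_smooth_motion(angle, vel, lo, hi, sigma_a=0.05, sigma_v=0.02):
    """Sample (angle_t, vel_t) from the smooth-motion prior: a Gaussian about
    the constant-velocity prediction, truncated to the joint limits [lo, hi]
    (here by rejection, since the density is zero outside the limits)."""
    vel_t = rng.normal(vel, sigma_v)
    while True:
        angle_t = rng.normal(angle + vel, sigma_a)
        if lo <= angle_t <= hi:
            return angle_t, vel_t

a, v = 0.3, 0.02                          # toy elbow angle (rad) and velocity
samples = [sample_smooth_motion(a, v, 0.0, 2.5)[0] for _ in range(2000)]
```

This is exactly the prediction step that diffuses particles in the tracker when no action-specific model is available.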
Particle Filtering
* large literature (Gordon et al ‘93, Isard & Blake ‘96,…)
* non-Gaussian posterior approximated by N discrete samples φ_t^(n), n = 1, …, N  (N ≈ 10³)
* explicitly represent the ambiguities
* exploit stochastic sampling for tracking
Representing the Posterior
p(φ_t | I_t) is represented by a discrete set of N samples (φ_t^(n), π_t^(n)).
Normalized likelihood:
π_t^(n) = p(I_t | φ_t^(n)) / Σ_{i=1}^N p(I_t | φ_t^(i))
Condensation
1. Selection
Sample from posterior at t-1
Most probable states selected most often.
2. Prediction.
3. Updating
Condensation
1. Selection
2. Prediction/Diffusion: sample from p(φ_t | φ_{t−1}), which models the dynamics.
3. Updating
Condensation
1. Selection
2. Prediction
3. Updating (the distribution)
Evaluate the new likelihood p(I_t | φ_t).
Repeat until N new samples have been generated.
Compute the normalized probability distribution.
Temporal Model: Walking
Parameters of the generative model are now
φ_t = [c_t, ψ_t, ρ_t, τ_t^g, θ_t^g]
Probabilistic model for p(φ_t | φ_{t−1}):
p(c_{t,k} | c_{t−1,k}) = G(c_{t,k} − c_{t−1,k}, σ_c e_k)
p(ψ_t | ψ_{t−1}, ρ_{t−1}) = G(ψ_t − (ψ_{t−1} + ρ_{t−1}), σ_ψ)
p(ρ_t | ρ_{t−1}) = G(ρ_t − ρ_{t−1}, σ_ρ)
p(τ_t^g | T_{t−1}, τ_{t−1}^g, ρ_{t−1}) = G([τ_t^g, 1]^T − T_{t−1}^{−1} [ρ_{t−1} 0 0 1]^T, σ_{τ^g})
p(θ_t^g | θ_{t−1}^g) = G(θ_t^g − θ_{t−1}^g, σ_{θ^g})
No likelihood
* how strong is the walking prior?
(or is our likelihood doing anything?)
Other Related Work
J. Sullivan, A. Blake, M. Isard, and J. MacCormick.
Object localization by Bayesian correlation. ICCV ’99.
J. Sullivan, A. Blake, and J. Rittscher.
Statistical foreground modelling for object localisation. ECCV, 2000.
J. Rittscher, J. Kato, S. Joga, and A. Blake.
A Probabilistic Background Model for Tracking. ECCV, 2000.
S. Wachter and H. Nagel. Tracking of persons in monocular image
sequences. CVIU, 74(3), 1999.
What does the posterior look like?
Shoulder: 3 dof; elbow: 1 dof. Elbow bends.
Statistics of Limbs
How do people appear in natural scenes? We want a general model.
(Edge filters and ridge filters.)
Other Related Work
(full body, monocular, articulated)
* Bregler & Malik: image motion, single hypothesis,
full-body required multiple cameras, scaled ortho.
* Ju, Black, Yacoob: cardboard person model,
image motion, 2D
* Deutscher et al: Condensation, edge cues,
background subtraction.
* Cham & Rehg: known templates, 2D (SPM), particle filter.
* Wachter & Nagel: nicely combines motion and edges,
single hypothesis (Kalman filter).
* Leventon & Freeman: assumes 2D tracking,
probabilistic formulation, learned temporal model
Open Questions
Representation of human motions
* model the range of human activity
* constrain the estimation to plausible motions
Representation of human appearance
* (somewhat) invariant to the variation in
human appearance
* specific enough to constrain the estimation
Likelihood
p(I | φ_f, φ_b) = ∏_{x ∈ f} p(I(x) | φ_f) ∏_{x ∈ b} p(I(x) | φ_b)
= ∏_{x ∈ f} [ p(I(x) | φ_f) / p(I(x) | φ_b) ] ∏_{all x} p(I(x) | φ_b)
= c ∏_{x ∈ f} p(I(x) | φ_f) / p(I(x) | φ_b)
(products over foreground pixels x ∈ f and background pixels x ∈ b)
Overview
* Why is 3D human motion important?
* Why is recovering it hard?
* A Bayesian approach
* generative model
* robust likelihood function
* temporal prior model (learning)
* stochastic search (particle filtering)
* Where are we going?
* Recent advances & state of the art.
* What remains to be done?
Problems
A simple articulated human model may have 30+ parameters (e.g. joint angles); 60+ with velocities.
Models of human action are non-linear and
likelihood models will be multi-modal.
Key challenges (common to other domains)
• representation,
• learning, and
• search
in high dimensional spaces.
Bayesian Formulation
* define generative model
of image appearance
* multi-modal posterior
over model parameters
- sampled representation
- particle filtering
approach.
* focus on image motion
as a cue (adding edges,…)
Represent a distribution
over 3D poses.
Generative Model: Temporal
First-order Markov assumption on angles, φ, and angular velocities, V:
p(φ_t | φ_{t−1}, V_{t−1})
p(V_t | φ_{t−1}, V_{t−1}) = p(V_t | V_{t−1})
Explore two models of human motion
* general smooth motion or,
* action-specific motion (walking)
Arm Tracking: Smooth motion prior
Display: expected
value of joint angles.
Particle filter
* represents ambiguity
* propagates information
over time
Learning Temporal Models
(Dirk Ormoneit & Trevor Hastie)
* Motion capture data is noisy, data is missing,
activities are performed differently.
* For cyclic motion (important but special class):
1. Detect cycles and segment
2. Account for missing data
3. Preserve continuity of cycles
4. Statistical model of variation
* Approaches should generalize to non-cyclic motion.
Detecting Cycles
Automatically detect length of cycles,
Automatically segment and align cycles.
Modeling Cyclic Motion
Automatically
align 3D data with
a reference curve
represented using
periodically
constrained
regression splines.
Modeling Cyclic Motion
* Segment into cycles, compute mean curve and
represent variation by performing PCA on data.
* SVD must enforce periodicity and cope with
missing data.
* Iterative SVD method (from gene expression work)
* computes SVD in Fourier domain
* construct a rank-q approximation and
take inverse Fourier transform
* impute missing data from the approximation
* repeat until convergence.
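The iterative SVD imputation step can be sketched on synthetic low-rank data; the Fourier-domain periodicity constraint from the slide is omitted in this sketch, and the data dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(4)

# Fake low-rank "joint angle" data with missing entries (stand-in for mocap).
q = 2
U = rng.normal(size=(100, q)); V = rng.normal(size=(q, 20))
X_true = U @ V
mask = rng.random(X_true.shape) < 0.1        # 10% of entries missing
X = np.where(mask, np.nan, X_true)

# Iterative rank-q SVD imputation: fill, factor, re-impute, repeat.
X_hat = np.where(mask, 0.0, X)               # initial guess for missing data
for _ in range(200):
    u, s, vt = np.linalg.svd(X_hat, full_matrices=False)
    approx = (u[:, :q] * s[:q]) @ vt[:q]     # rank-q approximation
    X_hat = np.where(mask, approx, X)        # impute only the missing entries
err = np.sqrt(np.mean((X_hat[mask] - X_true[mask]) ** 2))
```

Each pass projects the current guess onto the rank-q model and then restores the observed entries, so only the missing values move toward the low-rank fit.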
Issues
* Large parameter space
* approx. 10000 samples
* sparsely represented
* not real time
* Flow-based models can drift
* Requires initialization
Conclusions
Bayesian formulation for tracking 3D human figures
using monocular image information.
* Generative model of image appearance.
* Non-linear model represents ambiguities, singularities
occlusion, etc - sampled representation of posterior.
* Particle filtering for incremental estimation.
* Automatic learning of cyclic motion prior.
Rich framework for modeling the complexity of
human motion.
Initialization Using 2D Model
* Full-body walking model.
* Constructed from
3D mocap data.
* 2D, view-based
(every 30 degrees)
* 4 subjects, 14 cycles
2D, View-Based Walker
* Construct linear optical flow basis.
* Example bases shown (0 degrees, 90 degrees).
* Use similar Bayesian framework for tracking (Black CVPR ’99).
* Coarse estimate of 3D parameters.
* Automatic initialization.
Recent Results
* Box indicates mean position and scale.
* Recovers distribution over phase and 3D scale.
Motion
Dense optical flow: converged.
Human motion: converging.
Faces: open questions (appearance change, textural motion).
Here we focus on the full body.
Truth in Advertising
Not about realistic models for synthesizing
* faces
* clothing
* skin
* hair
Focus on generic models of appearance for
human motion capture.
Graphics to the Rescue?
Accurately
synthesize
appearance?
Hodgins and Pollard ‘97
How big is the parameter space of all possible
appearances?
Human Appearance
Likelihood
* To cope with occluded limbs, or limbs viewed at narrow angles, we introduce an occlusion probability.
* The likelihood of observing limb j is then
p_j = q p_image + (1 − q) p_occluded
where q is the probability that the limb is visible.
* The likelihood of the model is the product of the limb likelihoods:
p(I_t | φ_t, R_t) = ∏_j p_j
j
Generative Model: Motion
x_{t−1} = P(y, φ_{t−1})
u_t = P(y, φ_t) − P(y, φ_{t−1})
x_t = P(y, φ_t)
Learned Walking Model
* sample with large e
Temporal Model: Walking
Parameters of the generative model are now
s_t = [c_t, ψ_t, ρ_t, τ_t^g, θ_t^g]
Probabilistic model for p(s_t | s_{t−1}):
p(c_{t,k} | c_{t−1,k}) = G(c_{t,k} − c_{t−1,k}, σ_c e_k)
p(ψ_t | ψ_{t−1}, ρ_{t−1}) = G(ψ_t − (ψ_{t−1} + ρ_{t−1}), σ_ψ)
p(ρ_t | ρ_{t−1}) = G(ρ_t − ρ_{t−1}, σ_ρ)
p(τ_t^g | T_{t−1}, τ_{t−1}^g, ρ_{t−1}) = G([τ_t^g, 1]^T − T_{t−1}^{−1} [ρ_{t−1} 0 0 1]^T, σ_{τ^g})
p(θ_t^g | θ_{t−1}^g) = G(θ_t^g − θ_{t−1}^g, σ_{θ^g})
Common Assumptions
(to be avoided)
* Multiple Cameras
(additional constraints, occlusion)
* Color Images
(locate face and hands)
* Known Background
(background subtraction to locate person)
* Batch process an entire sequence.
* Known Initialization
Ratios for different limbs
Modeling Appearance
What do people look like?
What do non-people look like?
How can we model appearance in a way that captures the variability across people, clothing, lighting, pose, …?
Ridge Filters
Relationship between limb diameter in the image and the scale of the maximum ridge filter response.
Ridges; Brightness Constancy
(Compare responses at the correct vs. an incorrect limb position at time t, varying the position at t+1.)
Condensation
1. Selection
2. Prediction/Diffusion: sample from p(s_t | s_{t−1}), i.e. from the temporal prior
p(φ_t | φ_{t−1}, V_{t−1}) p(V_t | V_{t−1}) p(R_t | I_{t−1}, φ_{t−1})
1. Compute R_t from p(R_t | I_{t−1}, φ_{t−1}).
2. Sample from p(V_t | V_{t−1}).
3. Sample from p(φ_t | φ_{t−1}, V_{t−1}).
3. Updating
Visualizing Results
Expected value of a state parameter f(s_t):
E[f(s_t) | I_t] = Σ_{n=1}^N f(s_t^(n)) π_t^(n)
Why is it hard?
Geometrically under-constrained.
Vigil Calculare
Watchful computation.
Tiny People
Why is it Important?
Applications
• Human-Computer Interaction
• Surveillance
• Motion capture (games and animation)
• Video search/annotation
• Work practice analysis.
* detect moving regions
* estimate motion
* model articulated objects
* model temporal patterns
of activity
* interpret the motion
Social display of puzzlement
Why is it Hard?
The appearance of people can vary dramatically.
Bones and joints are unobservable (muscle, skin, and clothing hide the underlying structure).
(inference)
Why is it hard?
People can appear in
arbitrary poses.
They can deform in complex
ways.
Occlusion results in
ambiguities and
multiple interpretations.
Other Problems
* geometrically under-constrained
* non-linear dynamics of limbs
* similarity of appearance of different limbs
(matching ambiguities)
* image noise
* outliers
Our models are approximations.
Image changes that are not modeled
(e.g. clothing deformation) will be outliers.
State of the Art.
Bregler and Malik ‘98
* Brightness constancy cue
• insensitive to appearance
* Full-body required multiple
cameras.
* Single hypothesis.
• MAP estimate
State of the Art.
Cham and Rehg ‘99
* Single camera, multiple hypotheses.
* 2D templates (solves drift but is view dependent)
I(x, t) = I(x+u, 0) + η
State of the Art.
Deutscher, North,
Bascle, & Blake ‘99
* Multiple cameras
* Simplified, clothing, lighting and background.
State of the Art.
Sidenbladh, Black, & Fleet ‘00
* Monocular. Brightness constancy as the only cue.
* Significant changes in view and depth.
* Template-based methods will fail.
Bayesian Inference
Exploit cues in the images. Learn likelihood models:
p(image cue | model)
Build models of human form and motion. Learn
priors over model parameters:
p(model)
Represent the posterior distribution:
p(model | cue) ∝ p(cue | model) p(model)
Natural Image Statistics
* Statistics of image
derivatives are non-Gaussian.
* Consistent across scale.
(Ruderman; Lee, Mumford & Huang; Portilla & Simoncelli; Olshausen & Field; Zhu, Wu & Mumford; …)
Statistics of Edges
Statistics of filter responses, F, on edges, pon(F),
differs from background statistics, poff (F).
Likelihood ratio, p_on/p_off, can be used for edge detection and road following
(Geman & Jedynak; Konishi, Yuille, & Coughlan).
What about the object specific statistics of limbs?
* edge may be present or not.
Distribution of Edge Filter Responses
Likelihood
p(image | fore, back) = ∏_{fore pixels} p(image | fore) ∏_{back pixels} p(image | back)
= ∏_{all pixels} p(image | back) ∏_{fore pixels} p(image | fore) / p(image | back)
= const ∏_{fore pixels} p(image | fore) / p(image | back)
Action-Specific Model
The joint angles at time t are a linear combination of the basis motions evaluated at phase ψ_t:
φ_t = μ̃(ψ_t) + Σ_{k=1}^q c_{t,k} v_k(ψ_t)
where μ̃ is the mean curve and the v_k are the basis curves.
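Evaluating this model at a given phase might look like the following; the curves here are synthetic, the shapes are assumptions, and nearest-sample lookup stands in for proper interpolation along the cycle:

```python
import numpy as np

# Toy mean curve and basis curves over the phase of the walking cycle,
# each sampled at T phase values (all shapes and curves are made up).
T, d, q = 64, 10, 3                 # phase samples, joint angles, basis curves
phases = np.linspace(0, 2 * np.pi, T, endpoint=False)
mean_curve = np.stack([np.sin(phases + j) for j in range(d)], axis=1)
basis = np.stack([np.stack([np.cos((k + 1) * phases + j) for j in range(d)],
                           axis=1) for k in range(q)])   # shape (q, T, d)

def pose_at(phase, c):
    """Joint angles at phase psi: mean(psi) + sum_k c_k * v_k(psi)."""
    i = int(round(phase / (2 * np.pi) * T)) % T          # nearest phase sample
    return mean_curve[i] + np.tensordot(c, basis[:, i, :], axes=1)

pose = pose_at(1.3, np.array([0.5, -0.2, 0.1]))
```

During tracking the coefficients c_{t,k} and the phase ψ_t are exactly the quantities the walking prior propagates from frame to frame.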