Transcript: WinterSchoolBobick2

Seeing Action: Part 2
Statistics and/or Structure
Aaron Bobick
[email protected]
School of Interactive Computing
College of Computing
Georgia Tech

Three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
– "Bending down", "(door) rising up", "Swinging a hammer"
Action – a single, semantically meaningful "event"
– "Opening a door", "Lifting a package"
– Maybe actions are movements in context??
– Typically short in time
– Might be definable in terms of motion; especially so in a particular context.
Activity – a behavior or collection of actions with a purpose/intention.
– Context
– "Delivering packages"
– Typically has causal underpinnings
– Can be thought of as statistically structured events
Structure and Statistics (Old and new)

Grammar-based representation and parsing
– Highly expressive for activity description
– Easy to build higher level activity from reused low level vocabulary.

P-Net (Propagation nets) – really stochastic Petri nets
– Specify the structure – with some annotation can learn detectors and triggering probabilities

Statistics of events
– Low-level events are statistically sequenced – too hard to learn the full model.
– N-grams or suffix trees
"Higher-level" Activities:
Known structure, uncertain elements

Many activities are composed of a priori defined sequences of primitive elements.
– Dancing, conducting, pitching, stealing a car from a parking lot.
– The states are not hidden.

The activities can be described by a set of grammar-like rules; often ad hoc approaches are taken.
But the sequences are uncertain:
– Uncertain performance of elements
– Uncertain observation of elements
The basic idea and approach
Idea: split the problem into:
– Low-level primitives with uncertain feature detection (individual elements might be HMMs)
– High-level description found by parsing the input stream of uncertain primitives.
Approach:
– Extend Stochastic Context Free Grammars to handle perceptually relevant uncertainty.
Stochastic CFGs

Traditional SCFGs have probabilities associated with the production rules. Traditional parsing yields the most likely parse given a known set of input symbols.
PIECE -> BAR PIECE [0.5] | BAR [0.5]
BAR   -> TWO [0.5] | THREE [0.5]
THREE -> down3 right3 up3 [1.0]
TWO   -> down2 up2 [1.0]

Thanks to Andreas Stolcke's prior work on parsing SCFGs using an efficient Earley parser.
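
To make the toy grammar above concrete, here is a minimal sketch (not Ivanov and Bobick's parser, and not Stolcke's Earley algorithm): the productions as data, and the probability of one derivation as the product of the probabilities of the rules it uses. Names and structure are illustrative assumptions.

# Toy SCFG above as data: nonterminal -> list of (right-hand side, probability)
GRAMMAR = {
    "PIECE": [(["BAR", "PIECE"], 0.5), (["BAR"], 0.5)],
    "BAR":   [(["TWO"], 0.5), (["THREE"], 0.5)],
    "THREE": [(["down3", "right3", "up3"], 1.0)],
    "TWO":   [(["down2", "up2"], 1.0)],
}

def derivation_probability(rules_used):
    """Probability of a derivation = product of the probabilities of its rules."""
    p = 1.0
    for lhs, rhs in rules_used:
        p *= dict((tuple(r), pr) for r, pr in GRAMMAR[lhs])[tuple(rhs)]
    return p

# A one-bar piece realized as a TWO: PIECE -> BAR, BAR -> TWO, TWO -> down2 up2
print(derivation_probability([
    ("PIECE", ["BAR"]),
    ("BAR", ["TWO"]),
    ("TWO", ["down2", "up2"]),
]))  # 0.5 * 0.5 * 1.0 = 0.25

A full parser would sum (or maximize) this quantity over all derivations consistent with the input, which is what the probabilistic Earley parser does efficiently.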
Extending SCFGs

(Ivanov and Bobick, PAMI)
Within the parser we handle:
– Uncertainty about input symbols: the input is a multi-valued string (a vector of likelihoods per symbol)
– Deletion, substitution, and insertion errors: introduce error rules
– Individually recognized primitives are typically temporally inconsistent: introduce a penalty for overlap; spatial and temporal consistency are enforced.

Need to define when a symbol has been generated.
How do we learn production probabilities? (Not many examples.) Make sure the parse is not too sensitive to them.
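
A minimal sketch of the "multi-valued string" idea (an assumed representation, not the PAMI parser): each input element carries a likelihood for every terminal, and reading a terminal contributes that detector likelihood to the parse probability.

# Each input position is a vector of likelihoods over the terminal vocabulary
# (illustrative values; they need not be normalized).
uncertain_input = [
    {"down2": 0.70, "down3": 0.20, "up2": 0.05, "right3": 0.03, "up3": 0.02},
    {"up2": 0.60, "up3": 0.30, "down2": 0.05, "down3": 0.03, "right3": 0.02},
]

def scan_score(symbol, position):
    """Likelihood the detectors assign to `symbol` at this input position."""
    return uncertain_input[position].get(symbol, 0.0)

# Score of reading TWO -> down2 up2 against the two uncertain input positions:
print(scan_score("down2", 0) * scan_score("up2", 1))  # 0.7 * 0.6 = 0.42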
Enforcing temporal consistency
[Figure: output of one HMM parsed backwards – P(primitive) over time, with the output event marked]
Video Sample
Event Grammar and Parsing
– The tracker generates events: ENTER, LOST, FOUND, EXIT, STOP. Tracks have properties (e.g. size) and trajectories.
– The tracker assigns a class to each event, though only probabilistically.
– The parser parses a single stream that contains interleaved events: (CAR-ENTER, CAR-STOP, PERSON-FOUND, CAR-EXIT, PERSON-EXIT)
– The parser enforces spatial and temporal consistency for each object class and for interactions (e.g. to be a PICK-UP, the PERSON-FOUND event must be close to CAR-STOP)
– Spatial and temporal consistency eliminates symbolic ambiguity.
Advantages of SCFGs

What the grammar can do (simplified):
CAR_PASS   -> CAR_ENTER CAR_EXIT | CAR_ENTER CAR_HIDDEN CAR_EXIT
CAR_HIDDEN -> CAR_LOST CAR_FOUND | CAR_LOST CAR_FOUND CAR_HIDDEN

Skip allows concurrency (and junk):
PERSON_LOST -> person_lost | SKIP person_lost

Concurrent parse:
Events:
ce pe cl cf cs px pl cx
PICKUP -> ce pe cl cf cs px pl cx
P_PASS -> ce pe cl cf cs px pl cx
Parsing System
Parse 1: Person-pass-through
Parse 2: Drive-in
Parse 3: Car-pass-through
Parse 4: Drop-off
Advantages of STCFG approach
– Structure and components of activities are defined a priori and are the right levels of annotation to recover (compare to HMMs).
– FSM vs. CFG is not the point; rather, explicit representation of structural elements and uncertainties.
– Often many (enough) examples of each primitive to support training, but not of the higher-level activity.
– Allows for integration of heterogeneous primitive detectors; only assumes likelihood generation.
– More robust than ad-hoc rule-based techniques: handles errors through probability.
– No notion of causality, or anything other than (multi-stream) sequencing.
Some Q's about Representations…

Scope and Range:
– Thoughts???

"Language" of the representation
– Grammar of explicit symbols

Computability of an instance:
– Quite easy. Given the input string, the parsing is both the computation and the matching.

Learnability of the "class":
– Inside-outside algorithm for learning CFGs, but let's be serious…

Stability in face of perceptual uncertainty
– Explicitly designed to handle this uncertainty

Inference-support
– Depends on what you mean by inference. No notion of real semantics or explicit time.
P-Nets (Propagation Networks)
(Shi and Bobick, '04 and '06)
Nodes represent activation intervals
– Active vs. inactive: token propagation
– More than one node can be active at a time!
Links represent partial order as well as logical constraint
Duration model on each link and node:
– Explicit model on length of activation
– Explicit model on length between successive intervals
Observation model on each node
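
A minimal sketch of one assumed way to encode such a network in code (node and link fields are illustrative, not Shi and Bobick's implementation):

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PNode:
    name: str
    duration_mean: float                  # explicit model on length of activation
    duration_std: float
    observe: Callable[[object], float]    # observation likelihood while the node is active
    active: bool = False                  # token presence; several nodes may be active at once

@dataclass
class PLink:
    src: str
    dst: str
    gap_mean: float                       # explicit model on length between successive intervals
    gap_std: float
    # logical constraint (e.g. AND/OR over parents), checked only at interval start/end points
    condition: Callable[[List[bool]], bool] = all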
Conceptual Schema

Logical relation
– Autonomous assumption: logical constraints only exist at the start/end points of intervals
– A conditional probability function can represent any logical function
Examples of logical constraints
Propagation Net – Computing

Computational Schema
A DBN-style rollout to compute the corresponding conceptual schema
Experiment: Glucose Project

Task: monitor a user calibrating a glucose meter and point out operating errors as feedback.
Constructed a 16-node P-Net as the representation
3 subjects, with a total of 21 perfect sequences, 10 missing_1_step sequences, and 10 missing_6_steps sequences
D-Condensation
Initialize 1 particle at the dummy starting node
Repeat
  For each particle
    generate all possible consequent states
    calculate the probability of each state
  End
  Select n particles to survive
Until the final time step is reached
Output the path represented by the particle with the highest probability
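
A rough, illustrative sketch of that loop (not the original implementation); successors() and score() are assumed hooks that would expose the P-Net's possible next activation states and their duration/observation probabilities:

import heapq

def d_condensation(successors, score, T, n=100, start="START"):
    # each particle is (probability, path of states)
    particles = [(1.0, [start])]
    for t in range(T):
        expanded = []
        for prob, path in particles:
            # generate all possible consequent states and score them
            for nxt in successors(path[-1]):
                expanded.append((prob * score(nxt, t), path + [nxt]))
        # select the n most probable particles to survive
        particles = heapq.nlargest(n, expanded, key=lambda p: p[0])
    # output the path carried by the most probable particle
    return max(particles, key=lambda p: p[0])[1]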
Experiment: Glucose Meter Calibration
Experiment: Classification Performance
Experiment: Label individual frames
Labeling individual nodes
Labels on Node J: Insert
And now some statistics…

Problem: the higher-level world of activity is not usually a P-Net or an FSM or an HMM or …
Two possible solutions:
1. Understand what's really going on… …another time.
2. Lose the structure
Stochastic sequencing
A priori define some low-level "actions"/events that can be stochastically detected in context
– e.g. door opening
Collect training data (streams of events) of activities – making a delivery, UPS pick-up, trash collection
Collect histograms of N-tuples and do both activity discovery and recognition
– Later can focus on anomalies
Advantages: cheat where easy, learn the hard stuff, exploit the context
(Hamid and Bobick)
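
A minimal sketch of the "histograms of N-tuples" idea: count the overlapping n-grams of an event sequence. The event names here are illustrative only.

from collections import Counter

def ngram_histogram(events, n=3):
    """Histogram of overlapping n-grams (N-tuples) of an event sequence."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))

activity = ["truck-enter", "person-exit-truck", "package-unload",
            "package-unload", "person-enter-truck", "truck-exit"]
print(ngram_histogram(activity, n=2).most_common(3))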
Barnes & Noble Loading Dock
Two levels in the representation
Low Level: Events (computer vision problem)
– Background subtraction and foreground extraction (better “modeling”)
– Classifying (per frame) each foreground object as either
  • Person
  • Vehicle (what type if possible)
  • Package
  • Tool used to move packages
  • Miscellaneous object
– Tracking people, vehicles, packages, tools, and miscellaneous objects over multiple frames
Two levels in the representation

Higher Level: Statistical characterization of subsequences
– Instances of the same activity class have certain common subsequences.
– But, being partially ordered, they will typically rearrange subsequences within the sequence.
– Find a “soft” characterization of the statistics of the subsequences
– Define a similarity measure for such a characterization.
  • Allows discovery of activity classes
  • Allows for detection of anomalous examples

Caveats:
– We provide the events – whether it’s done manually or by specifying the detector doesn’t really matter (except for publication)
– Training needs pre-segmentation
Stochastic sequencing: n-grams
Experimental Setup – Loading Dock
Barnes & Noble Loading Dock Area

One month's worth of data:
– 5 days a week
– 9 a.m. till 5 p.m.

Event Vocabulary – 61 events
– Hand-labeled for testing activity labeling and noise sensitivity.
– Training detectors for these events
Bird’s Eye View of Experimental Setup
B&N Processing Video
Activity-Class Discovery

Treating activities as individual instances
Activity-class discovery – finding maximal cliques in edge-weighted graphs
Need to come up with:
– Activity Similarity Metric
– Procedure to group similar activities
Activity Similarity

Two types of differences
– structural differences
– frequency differences
Sim(A,B) = 1 – normalized difference between the counts of non-zero event n-grams (one possible reading is sketched below)

Properties
– additive identity
– is commutative
– does not follow the triangle inequality
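
A rough sketch of one plausible reading of Sim(A,B): one minus the normalized absolute difference of the n-gram counts that are non-zero in either activity. The exact normalization used by Hamid and Bobick may differ; this is an assumption for illustration.

from collections import Counter

def ngram_counts(events, n=3):
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))

def similarity(a, b, n=3):
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    grams = set(ca) | set(cb)                       # n-grams non-zero in either activity
    diff = sum(abs(ca[g] - cb[g]) for g in grams)
    total = sum(ca[g] + cb[g] for g in grams)
    return 1.0 - diff / total if total else 1.0     # commutative, Sim(A,A) = 1

By construction this is commutative and Sim(A,A) = 1, but nothing forces it to satisfy the triangle inequality.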
Activity-Class Discovery

A graph-theoretic problem of finding maximal cliques in edge-weighted graphs [Pavan, Pelillo ‘03]
Sequentially find maximal cliques in the edge-weighted graph of activities (a dominant-set sketch follows below)
Activities different enough from all the regular activities are anomalies
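
A minimal sketch of dominant-set extraction via replicator dynamics in the spirit of Pavan and Pelillo '03; the stopping rule and support threshold here are illustrative assumptions, not the authors' exact procedure.

import numpy as np

def dominant_set(A, iters=1000, tol=1e-8):
    """A: symmetric activity-similarity matrix with zero diagonal. Returns member indices."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)                  # start from the barycenter of the simplex
    for _ in range(iters):
        Ax = A @ x
        new_x = x * Ax / (x @ Ax)            # replicator dynamics update
        if np.abs(new_x - x).sum() < tol:
            x = new_x
            break
        x = new_x
    return np.where(x > 1.0 / (2 * n))[0]    # support of x = one discovered activity class

Classes are found sequentially: remove the members just found and repeat on the remaining activities.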
Activity-Class Discovery – Dominant Sets
Anomaly Detection

Compute the within-class similarity of the test activity w.r.t. previous class members
Learn the detection threshold from training data – can be done using an R.O.C. (a small sketch follows below)
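
A small sketch of the anomaly test as described: average within-class similarity of a test activity against a learned threshold. The threshold value and the similarity function are assumed placeholders; in practice the threshold would be chosen from a training-set R.O.C. curve.

def is_anomalous(test_activity, class_members, similarity, threshold=0.6):
    # mean similarity of the test activity to the members of its closest class
    within = sum(similarity(test_activity, m) for m in class_members) / len(class_members)
    return within < threshold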
Anomaly "Explanation"

Explanatory features – their frequency has high mean and low variance
Explanation based on features that were:
– Missing from an anomaly but frequently and consistently present in regular members
– Extraneous in an anomaly but consistently absent from the regular members
Results
General Characteristics of Discovered Activity Classes
– UPS delivery vehicles
– FedEx delivery vehicles
– Delivery trucks – multiple packages delivered
– Cars and vans, only 1 or 2 packages delivered
– Motorized cart used to pick up and drop off packages
– Van deliveries – no use of the motorized cart
– Delivery trucks – multiple people
A few of the detected anomalies

(a) Back door of delivery vehicle not closed
(b) More than the usual number of people involved in unloading
(c) Very few vocabulary events performed
Results

Are the detected anomalous activities ‘interesting’ from a human viewpoint?
Anecdotal Validation:
– Studied 7 users
– Showed each user 8 regular activities selected randomly
– Showed each user 10 test activities: 5 regular and 5 detected anomalous activities
– 8 out of 10 activity labels of the users matched the labels of our system
– Probability of this match happening by chance is 4.4%
Some Q's about Representations… (more discussion)

Scope and Range:
– A monitored scene with pre-designed detectors

"Language" of the representation
– Histograms and other statistics of feature n-gram occurrences

Computability of an instance:
– Given detectors, easy to compute

Learnability of the "class":
– Full power of statistical learning, even allowing the notion of an outlier detector.

Stability in face of perceptual uncertainty
– Fair. Needs to be better.

Inference-support
– Distance-in-feature space reasoning only.