Content-Based Video Analysis based on
Audiovisual Features for Knowledge
Discovery
Chia-Hung Yeh
Signal and Image Processing Institute
Department of Electrical Engineering
University of Southern California
Vision
Guidelines
Motivation
Introduction
Overview of visual and audio content
Video abstraction
Multimodal information concept
Knowledge discovery via video mining
Our previous work
Conclusion and future work
Motivation
Amazing growth in the amount of digital video data
in recent years.
Develop tools to classify, retrieve and abstract
video content
Develop tools for summarization and abstraction
Bridge the gap between low-level features and high-level semantic content
Enabling machines to understand video is important and
challenging
Why, What and How
Why video content analysis?
– Modern multimedia technologies have led to huge
digital video collections, but efficient access to video content
is still in its infancy because of the bulky data volume and
unstructured data format.
What is video content analysis?
– Video content analysis analyzes the video content and
attempts to automatically understand the embedded video
semantics as humans do
How to do video content analysis?
Overview of Visual Content
Structured analysis
– Extract hierarchical video structure
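[Figure: hierarchical video structure (frame, shot, scene, event/tempo, linked by "grouped into") shown beside the analogous text structure (document, sentences, words, key sentences, linked by "segmented into" and "grouping"); the diagram also marks a semantic gap]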
Overview of Audio Content
Continuous in the time domain, unlike visual content
Multiple sound sources exist in a sound track, like
many objects in a single frame
It is difficult to separate audio content and give a
suitable description
MPEG-7 framework: silence, timbre, waveform,
spectral, harmonic and fundamental frequency
Some special features for music and speech
Content-Based Video Indexing
Process of attaching content based labels to video
shots
Essential for content-based classification and
retrieval
Some required techniques
– Shot detection
– Key frame selection
– Object segmentation and recognition
– Visual/audio feature extraction
– Speech recognition, video text, VOCR
Content-Based Video Classification
Segment & classify videos into meaningful categories
Classify videos based on predefined topics
Multimodal concept
– Visual features
– Audio features
– Metadata features
Domain-specific knowledge
Query (Retrieval Methods)
Simple visual feature query
Feature combination query
Query by example (QBE)
– Retrieve video that is similar to an example
Localized feature query
– Example: retrieve video of a car moving toward the right
Object relationship query
Concept query (query by keyword)
Metadata
– Time, date, etc.
The Ways to Browse a Video
Playback faster
– Audio time scale modification – time saving factor 1.5 to 2.5
– 15% - 20% time reduction by removing and shortening pauses
Storyboard
– Composed of representative still frames (Keyframes)
Moving storyboard
– Display keyframes while synchronized with the original audio track
Highlight
– Pre-defined special event (example: sport and news)
Skimming
– Extract short video clips to build a much shorter video
Timeline of Related Technique Development
[Figure: timeline running from basic tools through low-level feature development to high-level semantic concepts; techniques shown: digital signal processing, digital image processing, audio processing, speech recognition, text recognition, image retrieval, video retrieval, video browsing, video abstraction, video summarization, video skimming]
Image Retrieval and Video Browsing
Query by Image Content (QBIC), IBM, 1995
– Complex multi-feature and multi-object queries
Video browsing
– Discover information quickly and efficiently
– Browsing and searching usually complement each other
– Visual content is easier to browse than audio content
– Achieved by static storyboard, dynamic video clips, fast
forward
Representative work
– Gary Marchionini, University of Maryland
– S.-F. Chang, Columbia University
Video Abstraction
Video summarization and video skimming
– Both belong to video abstraction and differ from video browsing
– Automatically retrieve a collection of the most significant and most
representative segments
Required techniques
– Shot detection, scene generation
– Motion analysis
– Face recognition
– Audio segmentation
– Text detection
– Music detection
Video Abstraction
A video abstract
– A sequence of still or moving images that preserves the
essential content of the original video while being much shorter
Applications
– Automated authoring of web content
• Web news
• Web seminar
– Consumer domain applications
• Analyzing, filtering, and browsing
Video Summarization (I)
A collection of salient frames that represent the
underlying content
Most related work focuses on ways to extract still
frames
Categorized into three classes
– Frame-based
• Randomly or uniformly select
– Shot-based
• Keyframe
– Feature-based
• Motion, color and so on
Video Summarization (II)
Representative work
– Y. Taniguchi, (1995)
• Frame-based scheme
• Simple, but may not be representative because shot lengths are not uniform
– H.-J. Zhang, Microsoft Research China (1997)
• Keyframe based on color histogram
– Gong and Liu, NEC Laboratories America (2003)
• SVD (Singular Value Decomposition)
• Capture temporal and spatial characteristics
– Tseng, Lin and J. R. Smith, IBM T. J. Watson Research Center (2002)
• Video summarization scheme for pervasive mobile devices
Video Skimming
A good skim is much like a movie trailer
A synopsis of the entire video
Representative work
– M. Smith and T. Kanade, Carnegie Mellon University (1995)
• Audio and image characterization
– S. Pfeiffer, University of Mannheim (1996)
• VAbstract system
• Detection of special events such as dialogs, explosions and text
occurrences
– H. Sundaram and S.-F. Chang, Columbia University (2001)
• A semantics skimming system
• Visual complexity for human understanding
• Film syntax
Video Skimming – Application
Video content transcoding
– Content-based live sport video filtering
Video Shot Structure
Shot, a cinematic term, is the smallest addressable video unit (the
building block). A shot contains a set of continuously recorded frames
Two types of video shots:
– Camera break: abrupt content change between neighboring frames. Usually
corresponds to an editing cut
– Gradual transition: smooth content change over a set of consecutive frames.
Usually caused by special effects
Shot detection is usually the first step towards video content analysis
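A camera break can be illustrated with a histogram comparison between neighboring frames. The sketch below is a minimal example, not the method used in this work: frames are assumed to be flat lists of 8-bit gray values, and the bin count and threshold are illustrative choices. Gradual transitions would need a different test, since their frame-to-frame change is small.

```python
def gray_histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of a frame given as a flat list of pixel values."""
    hist = [0] * bins
    for p in frame:
        hist[p * bins // max_value] += 1
    total = len(frame)
    return [h / total for h in hist]


def detect_cuts(frames, threshold=0.5):
    """Return indices i where an abrupt change occurs between frame i-1 and frame i."""
    cuts = []
    prev = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = gray_histogram(frames[i])
        # L1 distance between two normalized histograms lies in [0, 2].
        dist = sum(abs(a - b) for a, b in zip(prev, cur))
        if dist > threshold:  # assumed threshold, to be tuned per data set
            cuts.append(i)
        prev = cur
    return cuts


# Toy sequence: three dark frames followed by two bright frames.
dark = [10] * 16
bright = [240] * 16
frames = [dark, dark, dark, bright, bright]
cuts = detect_cuts(frames)
```

Here the only detected cut is at the first bright frame (index 3).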
Scene Characteristics
Scene is a semantic concept which refers to a relatively
complete video paragraph with coherent semantic meaning
It is subjectively defined
Shots within a movie scene have the following three features
– Visual similarity
• Since a scene could only be developed within certain spatial and temporal
localities, the directors have to repeat some essential shots to convey parallelism
and continuity of activities due to the sequential nature of film making
– Audio similarity
• Similar background noises
• Speeches from the same person have similar acoustic characteristics
– Time locality
• Visually similar shots should also be temporally close to each other if they do
belong to the same scene
Basic Audio Features
Energy
– Silence or pause detection
Zero crossing rate (ZCR)
– How often the audio signal amplitude crosses the zero value
in a given time
Energy centroid
– Speech range: 100 Hz to 7 kHz
– Music range: 16 Hz to 16000 Hz
Band periodicity
– Harmonic sounds
– Music: High frequency components are integer multiples of the lowest one
– Speech: Pitch
MFCC - (Mel-Frequency Cepstral Coefficients)
– 13 linearly-spaced filters
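The first two features can be computed directly from the samples of one analysis frame. The following is a small sketch under assumed conventions: samples are floats in [-1, 1], and the silence threshold is a made-up value that would need tuning.

```python
def short_time_energy(frame):
    """Mean squared amplitude of one analysis frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def is_silence(frame, energy_threshold=1e-4):
    """Energy-based silence/pause test; the threshold is an assumption."""
    return short_time_energy(frame) < energy_threshold


alternating = [1.0, -1.0, 1.0, -1.0]   # crosses zero between every pair of samples
constant = [0.5, 0.5, 0.5]             # never crosses zero
```

A noisy or high-frequency frame gives a ZCR near 1, while a low-frequency or constant frame gives a ZCR near 0.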
Multimodal Information Concept
[Figure: video data goes through multimodal content segmentation into semantic units; multimodality fusion/integration then addresses who, what, when, where and how, together with the relations between them]
Multimodal Framework for Video Content
Interpretation
Application to automatic TV program abstraction
Allows users to request topic-level programs
Integrate multiple modalities: visual, audio and text
information
Multi-level concepts
– Low: low-level feature
– Mid: object detection, event modeling
– High: classification result of semantic content
Probabilistic model: using Bayesian network for
classification (causal relationship, domain knowledge)
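As a rough illustration of probabilistic fusion, the sketch below uses naive Bayes, the simplest Bayesian network (one topic node with conditionally independent detector nodes). It is a stand-in, not the network used in this framework; the topic names, detector names and all probabilities are invented for the example.

```python
def fuse_naive_bayes(prior, likelihoods, observations):
    """Posterior over topics given binary mid-level detector outputs.

    prior:        {topic: P(topic)}
    likelihoods:  {topic: {detector: P(detector fires | topic)}}
    observations: {detector: True/False}
    """
    scores = {}
    for topic, p_topic in prior.items():
        score = p_topic
        for detector, fired in observations.items():
            p_fire = likelihoods[topic][detector]
            score *= p_fire if fired else (1.0 - p_fire)
        scores[topic] = score
    total = sum(scores.values())
    return {topic: s / total for topic, s in scores.items()}


# Hypothetical numbers, for illustration only.
prior = {"news": 0.5, "sports": 0.5}
likelihoods = {
    "news": {"speech": 0.9, "fast_motion": 0.2, "caption_text": 0.8},
    "sports": {"speech": 0.6, "fast_motion": 0.9, "caption_text": 0.3},
}
observations = {"speech": True, "fast_motion": True, "caption_text": False}
posterior = fuse_naive_bayes(prior, likelihoods, observations)
```

With these inputs the audio, visual and text detectors together favor "sports". A full Bayesian network would additionally model dependencies between detectors.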
Probabilistic Model – Data Fusion
[Figure: data fusion in a constrained domain. Input video data is split into audio, visual and metadata information. Low level: features A_feature 1..n, V_feature 1..n, M_feature 1..n. Middle level: detectors A_detector 1..m, V_detector 1..m, M_detector 1..m. High level: Fusion 1, Fusion 2 and Fusion 3, producing semantic concepts 1, 2 and 3]
How to Work with the Framework
Preprocessing
– Video segmentation (shot detection) and key frame selection
– VOCR, speech recognition
Feature Extraction
– Visual features based on key-frame
• Color, texture, shape, sketch, etc.
– Motion features
• Camera operation: Panning, Tilting, Zooming, Tracking, Booming, Dollying
• Motion trajectories (moving objects)
• Object abstraction, recognition
– Audio features
• Average energy, bandwidth, pitch, mel-frequency cepstral coefficients, etc.
– Textual features (Transcript)
• Knowledge tree, a lot of keyword categories: politics, entertainment, stock, art, war, etc.
• Word spotting, vote histogram
Building and training the Bayesian network
Challenging Points
Preprocessing is significant in the framework.
– Accuracy of key-frame selection
– Accuracy of speech recognition & VOCR
Good feature extraction is important for the
performance of classification.
Modeling semantic video objects and events
How to integrate multiple modalities still needs
careful consideration
Knowledge Discovery via Video Mining
Objectives
– Find the hidden links between isolated news, events, etc.
– Find the general trend of an event development
– Predict the possible future event
– Discover abnormal events
Required Technologies
– Domain-specific knowledge model
– Mining association rules, sequential patterns and correlations
– Effective and fast classification and clustering
Challenges
– Model build-up in special knowledge domain
– Integration of semantic mining and feature-based mining
– Effective and scalable classification and clustering algorithms
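Association rules are usually scored by support and confidence. The sketch below shows those two measures on a toy set of "transactions", where each transaction is assumed to be the set of mid-level labels detected in one video clip; the labels and data are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    s_antecedent = support(transactions, antecedent)
    if s_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / s_antecedent


# Hypothetical labels detected in four clips.
clips = [
    {"anchor", "speech", "caption"},
    {"anchor", "speech"},
    {"crowd", "fast_motion"},
    {"anchor", "speech", "caption"},
]
s = support(clips, {"anchor", "speech"})      # 3 of 4 clips
c = confidence(clips, {"anchor"}, {"caption"})  # 2 of the 3 anchor clips
```

A frequent-pattern miner such as Apriori would enumerate candidate itemsets and keep those whose support and confidence exceed chosen minimums.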
Video Mining Issues
Frequent/Sequential Pattern Discovery
– Fast and scalable algorithms for mining frequent, sequential and
structured patterns and for correlation analysis
– Similarity of rule/event search/measurement
Efficient and fast classification and clustering algorithms
– Constraint-based classification and clustering algorithms
– Spatiotemporal data mining algorithms
– Stream data mining (classification and clustering) algorithms
Surprise/outlier discovery and measurement
– Detection of outliers based on similarity and trend analysis
– Detection of outliers and surprised events based on stream data
mining algorithms
Multidimensional data mining for trend prediction
Framework of Video Mining
[Figure: multimedia data feeds video content analysis, which feeds a mining engine (feature mining, frequent mining, sequential mining, exception mining) that produces knowledge; other labels in the diagram: "Move mining", "Specific domain"]
Our Previous Work
TV Commercial Detection
– Visual/audio information processing
Cinema rules
– Intensity mapping
Tempo analysis in digital video (Professional video)
– Audio tempo
– Motion tempo
Home video processing (Non-professional)
– Quality enhancement (Bad shot detection)
– Music and video matching
Commercial Detection
First step in any TV program content management
Monitor broadcast
– Government
– Advertisement Company
Commercial features
– Delimiting black frame (not available in some countries)
– High cut frequency and short shot interval (important feature)
– Still images
– Special editing styles and effects
– Text and logo
Commercial Detection
Visual information processing
– Black frame detection
– Shot detection & its statistical analysis
– Still image detection
– Text-region detection
– Edge change rate detection
Audio information processing
– Volume control
– Silence
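The high cut frequency of commercials can be turned into a simple statistic: count shot cuts per fixed-length window and flag windows with many cuts. This is a sketch only; the window length (750 frames, about 30 seconds at an assumed 25 fps) and the minimum cut count are illustrative assumptions, and a real detector would combine this with the other cues listed above.

```python
def cuts_per_window(cut_frames, total_frames, window=750):
    """Count shot cuts in consecutive non-overlapping windows of `window` frames."""
    n_windows = (total_frames + window - 1) // window
    counts = [0] * n_windows
    for f in cut_frames:
        counts[f // window] += 1
    return counts


def flag_commercial_windows(counts, min_cuts=5):
    """Indices of windows whose cut count suggests a commercial block."""
    return [i for i, c in enumerate(counts) if c >= min_cuts]


# Toy cut positions (frame numbers) in a 3000-frame recording.
cut_frames = [100, 900, 1550, 1600, 1650, 1700, 1800, 2100, 2900]
counts = cuts_per_window(cut_frames, total_frames=3000)
candidates = flag_commercial_windows(counts)
```

In this toy input only the third window (frames 1500 to 2249) has enough cuts to be flagged.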
Commercial Detection
[Figure: structure of a TV program. Sequence shown: normal program (with station logo), black frame, spot, spot, normal program]
Shot Detection & Its Statistical Analysis
[Figure: two plots over frames 0 to 10000, shot boundary detection and its statistical analysis (mean and variance), with the commercial start point and the commercial block marked]
Still Image Detection
Still Image
– A video clip is composed of a sequence of images
– Find a set of consecutive images that change little
over a period of time
Difficulty
– Even when a video clip looks still, the difference
between two consecutive images is seldom zero
– It is difficult to measure the moving part (human eyes are
sensitive to motion)
Main idea
– Quantify motion in each image to detect still image
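One simple way to quantify motion is the mean absolute difference between consecutive frames, with a small non-zero threshold because the difference is seldom exactly zero. The sketch below assumes frames are flat lists of gray values; the difference measure, threshold and minimum run length are assumptions, not the measure used in this work.

```python
def mean_abs_diff(a, b):
    """Average absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def still_segments(frames, motion_threshold=2.0, min_length=3):
    """Return inclusive (start, end) frame index pairs of low-motion runs."""
    segments = []
    start = 0
    for i in range(1, len(frames) + 1):
        # A run ends at the end of the clip or when motion exceeds the threshold.
        run_ends = (i == len(frames)
                    or mean_abs_diff(frames[i - 1], frames[i]) > motion_threshold)
        if run_ends:
            if i - start >= min_length:
                segments.append((start, i - 1))
            start = i
    return segments


# Four nearly identical frames followed by two very different ones.
frames = [[10, 10], [11, 10], [10, 11], [10, 10], [200, 200], [90, 90]]
segments = still_segments(frames)
```

The small sensor-noise differences in the first four frames stay below the threshold, so they form one still segment.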
Still Image Detection
[Figure: still image detection result over frames 0 to 10000 (vertical axis 0 to 1), with the really still images and the detection errors marked]
Tempo Analysis and Cinema Rules
The Visual Story: Seeing the Structure of Film, TV, and
New Media, Bruce Block
– Relationship between story structure and visual structure
• Their intensity maps are correlated
– Principle of contrast and affinity
• The greater the contrast in a visual component, the more the visual intensity
or dynamic increases
[Figure: story intensity versus time, with exposition, conflict, climax and resolution marked]
Cinema Rules
Every feature film has a well designed story structure,
which contains the beginning (exposition), the middle
(conflict), and the end (resolution)
[Figure: two story intensity curves (intensity 0 to 100) plotted against the time length of the story in minutes (0 to 120), each annotated with EX, CO, CX and R]
EX: exposition gives the facts needed to begin the story
CO: conflict contains rising actions or conflict
CX: climax
R: resolution ends the story
Cinema Rules
Scene:
– A simple theme in a scene
– Each scene is composed of setup part, progressing part, and
resolution part
– The final film is just a way to present this theme
• Dialog
• Close-up view
A story unit
– An example of a scene
• The main actors drove the main actress home from the train station
– A simple action
• Met at train station -> On the road -> Another main actor joined them ->
Arrive home
Audio Tempo
Music tempo
Definition in music
– Note
– Meter: a longer period containing several beats. For example, we
can count ONE-two-three, ONE-two-three
– Tempo (pace/beat period)
• It is often indicated at the beginning. For example, the rate should be
100 quarter notes per minute (100 claps per minute)
Audio Tempo
Speech tempo
– Emotion detection
– Segmental durations
• Syllable or phoneme
Audio tempo
– Short time pace
• Short-term memory
– The number of sound events per unit of time
• The more events, the faster it seems to go
– Onset
• A new note or a new syllable
Audio Tempo
Diagram of audio tempo analysis
[Figure: block diagram. Input audio enters a frequency filterbank; each band passes through an envelope extractor, down sampling and a differentiator; other labels in the diagram: L, H, 2, shot boundary, tempo]
Audio Tempo
Frequency filterbank
– Perceptual frequency
– Critical bands
• Wavelet-packet
• Multirate system
Envelope extractor
– Rectify
– Filtering: 50 ms half-Hamming window
Differentiator
– First-order difference
– Half-wave rectified
[Figure: input signal and detected onsets]
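The envelope extractor and differentiator can be sketched for a single band as follows. This is an illustrative reading of the steps above, not the original implementation: the filterbank and downsampling are omitted, and the window is given in samples (a 50 ms window would be 0.05 times the sample rate; the example uses a tiny window to stay readable).

```python
import math


def half_hamming(length):
    """Decaying half of a Hamming window with `length` taps (length >= 2)."""
    full = 2 * length - 1  # size of the full symmetric window
    return [0.54 - 0.46 * math.cos(2 * math.pi * (length - 1 + k) / (full - 1))
            for k in range(length)]


def envelope(signal, window):
    """Full-wave rectify, then smooth by causal convolution with the window."""
    rectified = [abs(x) for x in signal]
    norm = sum(window)
    out = []
    for n in range(len(rectified)):
        acc = 0.0
        for k, w in enumerate(window):
            if n - k >= 0:
                acc += w * rectified[n - k]
        out.append(acc / norm)
    return out


def onset_strength(env):
    """First-order difference followed by half-wave rectification."""
    return [0.0] + [max(0.0, b - a) for a, b in zip(env, env[1:])]


# Silence followed by a tone that starts at sample 5.
signal = [0.0] * 5 + [1.0, -1.0, 1.0, -1.0, 1.0]
strength = onset_strength(envelope(signal, half_hamming(3)))
onset_index = strength.index(max(strength))
```

The strongest onset response falls on the sample where the tone begins; decreases in the envelope are discarded by the half-wave rectification.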
Audio Tempo
Boundary of story units
– Local minima of audio tempo
Post signal processing
– Help to get local minima
– Three steps
• Lowpass filtering
• Morphological operation
– Minmax
– Close operation
• Detect local minima
– Detected valleys
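The three post-processing steps can be sketched on a one-dimensional tempo curve. The moving-average filter, the structuring-element width and the plateau handling below are assumptions chosen for illustration; the closing removes valleys narrower than the structuring element so that only broad minima remain as candidate story-unit boundaries.

```python
def moving_average(x, width=3):
    """Simple lowpass filter; the window shrinks at the borders."""
    half = width // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))
    return out


def _sliding(x, width, op):
    half = width // 2
    return [op(x[max(0, i - half): i + half + 1]) for i in range(len(x))]


def closing(x, width=3):
    """Grayscale closing: dilation (sliding max) followed by erosion (sliding min)."""
    return _sliding(_sliding(x, width, max), width, min)


def local_minima(x):
    """Indices of valley centers; a flat valley reports its middle index."""
    minima = []
    i = 1
    while i < len(x) - 1:
        if x[i] < x[i - 1]:
            j = i
            while j < len(x) - 1 and x[j + 1] == x[i]:
                j += 1
            if j < len(x) - 1 and x[j + 1] > x[i]:
                minima.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return minima


def story_boundaries(tempo):
    return local_minima(closing(moving_average(tempo)))


# Toy tempo curve: a one-sample dip at index 2 and a broad valley at 5..7.
tempo = [5, 5, 4, 5, 5, 1, 1, 1, 5, 5]
closed = closing(tempo)
boundaries = story_boundaries(tempo)
```

In this toy curve the one-sample dip is filled by the closing and only the broad valley is reported.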
Post processing for audio tempo analysis
Motion Analysis
The variance of motion vector
$$\sigma_M^2(i) = \sum_{n=1}^{N} W(n-i)\,[M(n) - \mu(n)]^2$$
$$\mu(n) = \frac{1}{N} \sum_{n=1}^{N} W(n-i)\, M(n)$$
– Where W(n) is a window, M(n) is the average length of motion
vectors for each shot, and n is the shot index
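A windowed motion variance of this kind can be sketched as below. Several choices here are assumptions, since the slide does not fix them: the window W is taken to be rectangular and centered on shot i, N is taken to be the number of shots under the window, and the mean is read as depending on the window position i.

```python
def windowed_motion_variance(m, i, half_width=2):
    """Sum of squared deviations of M(n) from its mean inside a window around shot i.

    m: list of average motion-vector lengths, one value per shot.
    """
    lo, hi = max(0, i - half_width), min(len(m), i + half_width + 1)
    window = m[lo:hi]                     # rectangular window (assumed)
    mean = sum(window) / len(window)      # windowed mean, with N = window size
    return sum((v - mean) ** 2 for v in window)


# Calm shots first, then alternating calm and active shots.
motion = [2, 2, 2, 2, 2, 8, 2, 8, 2]
calm = windowed_motion_variance(motion, 2)
busy = windowed_motion_variance(motion, 6)
```

The value is zero where neighboring shots have the same motion level and grows where calm and active shots alternate.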
Motion Analysis
Boundary of story units
– Transition Edges
Post processing
– Morphological operation
• Median
• Maxmin
• Minmax
– Gradient
– Detect edges
Post processing for visual tempo
Skimming Video
Test data
– Legends of The Fall
• Beginning 26 minutes
• MPEG format
– 352 × 240 pixels
– 44.1 kHz
Home Video Processing
Home video characteristics
– Fragmented
– Sound may not be very important
– Bad shots
• Stabilization
• Focus
• Lighting
Shooting tips
1. Shoot lots of short scenes (5 ~ 10 seconds)
2. Use zoom in/out to take exposition shots or emphasize something
3. Zoom or pan slowly
4. Get a lot of face shots
5. Keep a steady hand
6. Make sure your subject is well lit
Bad Shots
Shaky
– Drive
– Walk
Vibration in the camera motion between successive frames
Bad Shots
Ill-light
– Too dark/bright
– Too much variance
• Diaphragm
Lighting Problem
– Average of luminance
• Highest 1/3 pixels and lowest 1/3 pixels
• Negative feedback
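One possible reading of the luminance test is to average the brightest third and the darkest third of the pixels and compare them with fixed limits. The sketch below follows that reading; the limits and the decision rule are assumptions, and the negative feedback step is not modeled.

```python
def third_means(luma):
    """Mean luminance of the darkest third and of the brightest third of pixels."""
    s = sorted(luma)
    k = max(1, len(s) // 3)
    return sum(s[:k]) / k, sum(s[-k:]) / k


def lighting_label(luma, dark_limit=60, bright_limit=200):
    """Label a frame from its 8-bit luminance values; limits are assumed."""
    low_mean, high_mean = third_means(luma)
    if high_mean < dark_limit:      # even the brightest pixels are dark
        return "too dark"
    if low_mean > bright_limit:     # even the darkest pixels are bright
        return "too bright"
    return "ok"


dark_frame = [10, 20, 30, 15, 25, 35]
bright_frame = [250] * 6
normal_frame = [10, 100, 250]
```

Using the extreme thirds rather than the global mean keeps a frame with both deep shadows and highlights from being flagged.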
Bad Shots
Blur
– Motion blur
– Out-of-focus blur
– Foggy blur
Music and Video Matching
Shot detection
Remove bad shots
Match music tempo
– Shot length
– Motion activity
[Figure: processing flow: shot detection, remove bad shots, choose shots according to music tempo]
Authoring Scheme
Match music tempo
– High tempo
• Small segment length
– Transition time
• High motion activity
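The matching idea (higher tempo means shorter segments and higher motion activity) can be sketched with a greedy pairing. Everything specific here is an assumption made for illustration: the clip length is set to a fixed number of beats, shots are described by hypothetical `name`, `motion` and `length` fields, and the pairing is by rank rather than by the scheme actually used.

```python
def segment_length(tempo_bpm, beats_per_clip=8):
    """Clip duration in seconds covering a fixed number of beats."""
    return beats_per_clip * 60.0 / tempo_bpm


def assign_shots(music_tempos, shots):
    """Pair higher-tempo music segments with higher-motion shots.

    Returns one (shot name, clip duration in seconds) tuple per music segment.
    """
    by_tempo = sorted(range(len(music_tempos)),
                      key=lambda i: music_tempos[i], reverse=True)
    by_motion = sorted(shots, key=lambda s: s["motion"], reverse=True)
    plan = [None] * len(music_tempos)
    for seg_idx, shot in zip(by_tempo, by_motion):
        duration = min(shot["length"], segment_length(music_tempos[seg_idx]))
        plan[seg_idx] = (shot["name"], duration)
    return plan


# Hypothetical tempos (beats per minute) of three consecutive music segments.
music_tempos = [60, 120, 90]
shots = [
    {"name": "a", "motion": 0.2, "length": 20.0},
    {"name": "b", "motion": 0.9, "length": 3.0},
    {"name": "c", "motion": 0.5, "length": 20.0},
]
plan = assign_shots(music_tempos, shots)
```

The fastest music segment receives the most active shot, and each clip is trimmed so that faster music yields shorter clips.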
[Figure: music tempo of the input music over time, shown above the visual tempo of the selected video clips (Clip 1 to Clip 7) over time]
Experimental Results
Test data
– Input music: 5.5 minutes of music, Canon
– Input video clips:
• Activities of babies 0 ~ 3 years old
• Man-made bad shots
• Average clip length is about 20 seconds
• Total length is 50 minutes
Well-Known Research in Video Content
Analysis Field
Well-known university
– Digital Video Multimedia laboratory (DVMM), Columbia
University
– MIT Media laboratory
– Informedia Digital Video Understanding, Carnegie Mellon
University
– Department of Electrical and Computer Engineering,
University of Illinois at Urbana-Champaign
– Signal and Image Processing Institute, University of Southern
California
– Department of Electrical Engineering, Princeton University
– Language and media processing laboratory, University of
Maryland
Well-Known Research in Video Content
Analysis Field
Well-known R&D laboratory
– IBM T. J. Watson research center
– IBM Almaden research center
– Intel corporation
– Sharp Laboratory of America (SLA)
– Microsoft research laboratory
– Microsoft research China
– Hewlett-Packard research laboratory
– AT&T Bell laboratory
– InterVideo
– Pinnacle
Conclusion
Introduction of several basic concepts
Basic processing and low-level feature extraction
Semantic video modeling and indexing
Multimodal framework for topic classification of video
Knowledge discovery via video mining
Our research results
Discussion of challenging problems
Questions
Thank You