A New Approach for Video Text Detection and Localization
M. Cai, J. Song and M.R. Lyu
VIEW Technologies
The Chinese University of Hong Kong
Related work
Text Area Detection
– Uncompressed domain methods
• Texture-based
• Color-based
• Edge-based
– Compressed domain methods
• DCT coefficients
• Number of intra-coded blocks in P- and B-frames
Text String Localization
– Bottom-up scheme
– Top-down scheme
Language-independent characteristics
Contrast
– An adaptive contrast threshold according to the background complexity
Color
– Color bleeding caused by compression
Orientation
– Well-defined size and orientation make it easy to understand
Stationary location
– Text stays on screen for a sustained period
Language-dependent characteristics

                             English            Chinese
Stroke density               roughly similar    varies dramatically
Min(Font size)               10-pixel high      20-pixel high
Min(Aspect ratio)            relatively large   relatively small
Stroke direction statistics  mainly vertical    vertical, horizontal, left diagonal, right diagonal
Workflow
(1) Sampling & color space conversion
(2) Video text detection and localization on every sampled frame
(3) Multi-frame comparison
A sequential multi-resolution paradigm
– Original image → Edge detection → Edge map
– At each level l = 1, 2, ..., n:
• Scale the edge map to Size / f(l)
• Text area detection
• Text string localization → Text regions
• Scale the text regions to Size × f(l) to get the original coordinates of text regions
– Final text regions with original coordinates
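The level loop can be sketched as follows. This is only an illustration under stated assumptions: the slides do not define f(l), so f(l) = 2^(l-1) is a placeholder; `downscale`, `to_original` and the `detect` callback are hypothetical names; and the sketch does not model any interaction between levels (for example masking regions already found), since the slides do not describe one.

```python
def downscale(edge_map, factor):
    """Shrink a 2-D list by an integer factor, keeping the strongest edge per block."""
    rows, cols = len(edge_map), len(edge_map[0])
    out = []
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [edge_map[rr][cc]
                     for rr in range(r, r + factor)
                     for cc in range(c, c + factor)]
            out_row.append(max(block))
        out.append(out_row)
    return out


def to_original(box, factor):
    """Map a (top, left, bottom, right) box from a scaled level back to full size."""
    return tuple(v * factor for v in box)


def multi_resolution(edge_map, n_levels, detect, f=lambda level: 2 ** (level - 1)):
    """Run `detect` on each level and collect boxes in original coordinates.

    ASSUMPTION: f(l) = 2**(l-1) is a placeholder scale function.
    `detect` stands in for text area detection plus text string localization.
    """
    regions = []
    for level in range(1, n_levels + 1):
        factor = f(level)
        scaled = edge_map if factor == 1 else downscale(edge_map, factor)
        for box in detect(scaled):
            regions.append(to_original(box, factor))
    return regions
```

Larger text that is too big for the detector at level 1 becomes text of ordinary size at a coarser level, which is the usual motivation for scanning several resolutions.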
Text detection
Edge detection
– Sobel edge detector
Local thresholding
– Adaptive to background complexity
Text-like area recovery
– Enhance the density of text areas
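As a minimal sketch of the first step, the function below computes a Sobel edge-strength map in plain Python. The slides only name the Sobel detector; approximating the gradient magnitude by |Gx| + |Gy| and leaving the one-pixel border at zero are assumptions of this example, and `sobel_edge_map` is a hypothetical name.

```python
def sobel_edge_map(gray):
    """Return an edge-strength map for a 2-D list of gray levels.

    ASSUMPTIONS: magnitude is approximated as |Gx| + |Gy|,
    and border pixels are left at zero.
    """
    rows, cols = len(gray), len(gray[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[r - 1][c + 1] + 2 * gray[r][c + 1] + gray[r + 1][c + 1]
                  - gray[r - 1][c - 1] - 2 * gray[r][c - 1] - gray[r + 1][c - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (gray[r + 1][c - 1] + 2 * gray[r + 1][c] + gray[r + 1][c + 1]
                  - gray[r - 1][c - 1] - 2 * gray[r - 1][c] - gray[r - 1][c + 1])
            edges[r][c] = abs(gx) + abs(gy)
    return edges
```

On an image with a vertical step from 0 to 10, the interior pixels next to the step get strength 40 while flat areas stay at 0.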
Local Thresholding
Use a small kernel (shown in gray) to scan the whole edge map row by row.
In the bigger window surrounding the kernel, check the background type: “Clear” or “Noisy”.
For a Clear background and a Noisy background, determine the local threshold from the low and high parts, respectively, of the edge strength histogram in the bigger window.
Figure: (a) Concentric kernel (height h) and window (height 3h). (b) A window on the multi-line text area and the horizontal projection in it (P1 ... P3h). (c) Local threshold selection on the edge strength histogram (count vs. edge strength, from 0 to MAX), with its low part and high part.
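The procedure can be illustrated with a rough sketch. The slides do not give the rule for deciding Clear vs. Noisy or how the threshold is read from the histogram, so everything numeric here is an assumption: a window counts as Clear when some row of its horizontal projection is nearly edge-free (`gap_ratio`), and the threshold is placed at a fixed fraction of the window's maximum edge strength (`low_fraction` for Clear, `high_fraction` for Noisy). The 3w window width and all function names are likewise hypothetical.

```python
def horizontal_projection(window):
    """Sum of edge strengths in each row of the window."""
    return [sum(row) for row in window]


def background_type(window, gap_ratio=0.1):
    """ASSUMPTION: Clear if some row is nearly edge-free relative to the busiest row."""
    proj = horizontal_projection(window)
    peak = max(proj)
    if peak == 0:
        return "Clear"
    return "Clear" if min(proj) <= gap_ratio * peak else "Noisy"


def local_threshold(window, low_fraction=0.25, high_fraction=0.75):
    """ASSUMPTION: threshold sits in the low part (Clear) or high part (Noisy)
    of the strength range [0, MAX] of the window."""
    max_strength = max(max(row) for row in window)
    fraction = low_fraction if background_type(window) == "Clear" else high_fraction
    return fraction * max_strength


def threshold_kernel(edge_map, top, left, h, w):
    """Binarize the h-by-w kernel at (top, left) using its concentric 3h-by-3w window."""
    rows, cols = len(edge_map), len(edge_map[0])
    r0, r1 = max(0, top - h), min(rows, top + 2 * h)
    c0, c1 = max(0, left - w), min(cols, left + 2 * w)
    window = [row[c0:c1] for row in edge_map[r0:r1]]
    t = local_threshold(window)
    return [[1 if edge_map[r][c] > t else 0
             for c in range(left, min(cols, left + w))]
            for r in range(top, min(rows, top + h))]
```

The design intent follows the slide: against a clear background a low threshold keeps faint strokes, while against a noisy background a high threshold keeps only the strongest edges.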
Thresholding result comparison
Video image
Global thresholding results
Local thresholding results
Text-like area recovery
Labeling: Classify each current edge pixel as “TEXT” or “NON_TEXT” based on its local density.
Recovery/Suppression:
– Bring back neighboring lower-strength edge pixels of the TEXT edge pixels.
– Suppress the NON_TEXT edge pixels.
Before recovery
After recovery
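A small sketch of labeling and recovery, under assumptions: local density is measured as the fraction of above-threshold pixels in a square neighborhood, and the `radius` and `min_density` values are hypothetical, since the slides give neither the neighborhood size nor the density rule. `strength` is the edge map before thresholding and `binary` is the thresholded result.

```python
def recover_text_like_areas(strength, binary, radius=1, min_density=0.3):
    """Label thresholded edge pixels by local density, then recover or suppress.

    ASSUMPTIONS: density = share of set pixels in a (2*radius+1) square;
    radius and min_density are illustrative values only.
    """
    rows, cols = len(binary), len(binary[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c]:
                continue
            cells = [(rr, cc)
                     for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                     for cc in range(max(0, c - radius), min(cols, c + radius + 1))]
            density = sum(binary[rr][cc] for rr, cc in cells) / len(cells)
            if density >= min_density:
                # TEXT pixel: keep it and bring back weaker neighboring edges.
                out[r][c] = 1
                for rr, cc in cells:
                    if strength[rr][cc] > 0:
                        out[rr][cc] = 1
            # NON_TEXT pixel: suppressed by leaving the output at 0.
    return out
```

In a toy map with a dense 2x2 cluster next to two weak edge pixels and one isolated strong pixel, the weak neighbors are restored and the isolated pixel is removed.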
Coarse-to-fine Text localization
Projection-based top-down localization.
To handle complex text layout.
– Initialization: The whole edge map is the only region in the processing array.
– Pop the first region from the processing array. If the array is empty, terminate.
– Horizontal projection of the region: Divisible?
– Vertical projection of each sub-region: Divisible?
– Y: Add the sub-regions to the processing array.
– N: The region is indivisible. Check aspect ratio:
• Y: Add to the resulting text regions.
• N: Discard false regions.
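The flowchart can be approximated by the sketch below. It is an interpretation, not the authors' code: a projection is treated as divisible wherever it contains a run of zeros, `min_col_gap` (how many empty columns separate two strings) and `min_aspect_ratio` are hypothetical parameters, and regions are (top, left, bottom, right) tuples.

```python
def split_runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero runs, merging gaps shorter than min_gap."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(profile)])
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1][1] = run[1]
        else:
            merged.append(run)
    return [tuple(run) for run in merged]


def localize(edge_map, min_aspect_ratio=1.5, min_col_gap=2):
    """Projection-based top-down splitting; thresholds are illustrative assumptions."""
    queue = [(0, 0, len(edge_map), len(edge_map[0]))]  # the processing array
    results = []
    while queue:
        region = queue.pop(0)
        top, left, bottom, right = region
        # Horizontal projection: one sum per row of the region.
        h_proj = [sum(edge_map[r][left:right]) for r in range(top, bottom)]
        for r0, r1 in split_runs(h_proj):
            t, b = top + r0, top + r1
            # Vertical projection: one sum per column of the row band.
            v_proj = [sum(edge_map[r][c] for r in range(t, b))
                      for c in range(left, right)]
            for c0, c1 in split_runs(v_proj, min_col_gap):
                sub = (t, left + c0, b, left + c1)
                if sub != region:
                    queue.append(sub)  # divisible: process the sub-region later
                elif (sub[3] - sub[1]) / (sub[2] - sub[0]) >= min_aspect_ratio:
                    results.append(sub)  # indivisible and wide enough: text region
                # otherwise: indivisible but too narrow, discarded as false region
    return results
```

Re-queuing sub-regions rather than accepting them directly is what lets the procedure handle complex layouts: a band that looks like one line in the horizontal projection may split again once its columns are trimmed.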
Localization steps
Figure: panels (1) to (4)
Experimental results
Performance statistics
Statistics of 10 news videos:
Processing time per frame: 0.25 s (PIII 1G CPU)
Detection rate = Num(correctly detected text regions) / Num(ground-truth text regions) = 93.6%
Detection accuracy = Num(correctly detected text regions) / Num(all detected text regions) = 87.2%
Localization accuracy = Area(detected text regions ∩ ground-truth text regions) / Area(ground-truth text regions) > 90%
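To make the three measures concrete, here is a small sketch on toy rectangles. The slides do not say when a detection counts as correct, so the rule used here (a detection is correct if it covers at least `min_overlap` of some ground-truth region) is an assumption, as is computing localization accuracy over the matched pairs only. The boxes are made-up (top, left, bottom, right) values, not data from the experiments.

```python
def area(box):
    """Area of a (top, left, bottom, right) box; empty boxes have area 0."""
    top, left, bottom, right = box
    return max(0, bottom - top) * max(0, right - left)


def intersection(a, b):
    """Overlap box of a and b (may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def evaluate(detected, ground_truth, min_overlap=0.5):
    """Return (detection rate, detection accuracy, localization accuracy).

    ASSUMPTIONS: a detection is correct if it covers >= min_overlap of a
    ground-truth box; localization accuracy is taken over matched pairs.
    """
    correct = 0
    overlap_area = 0
    matched_gt_area = 0
    for d in detected:
        for g in ground_truth:
            inter = area(intersection(d, g))
            if inter >= min_overlap * area(g):
                correct += 1
                overlap_area += inter
                matched_gt_area += area(g)
                break
    rate = correct / len(ground_truth)
    accuracy = correct / len(detected)
    localization = overlap_area / matched_gt_area if matched_gt_area else 0.0
    return rate, accuracy, localization


# Toy data: two ground-truth regions, one good detection and one false alarm.
ground_truth = [(0, 0, 10, 100), (20, 0, 30, 50)]
detected = [(0, 0, 10, 90), (50, 50, 60, 60)]
rate, accuracy, localization = evaluate(detected, ground_truth)
```

With this toy input one of two ground-truth regions is found and one of two detections is correct, so both ratios are 0.5, and the matched detection covers 900 of 1000 ground-truth pixels.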