A Comparative Study of Some Multiple Expert Recognition

Challenges in Web Document Summarization: Some Myths and Reality
A. Rahman, H. Alam
Document Analysis and Recognition Team (DART)
BCL Computers Inc.
Santa Clara, Calif., USA
Basic Problem Statement
• What are web based documents?
• What is summarization?
• Textual summarization vs. content
summarization
• What myths do we have about
summarization?
• What is the reality?
Why Summarization?
• The display area of handheld devices, i.e. PDAs
and cell phones, is too small for useful web
browsing
• Download times are still too slow for
comfortable browsing on wireless
devices
• The cost is still too high
Where is the Money?
• 1.2 billion web pages
• At 2 hours per page to adapt an existing page for
wireless, the work will take 2.4 billion work-hours
• Assuming $20 per hour, this effort
requires an investment of around $50 billion
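The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the three input figures come from the slide, while the function and variable names are illustrative. The exact product is $48 billion, which the slide rounds to "around $50 billion".

```python
# Figures taken from the slide; names are illustrative, not from the original.
PAGES = 1_200_000_000        # 1.2 billion web pages
HOURS_PER_PAGE = 2           # hours to adapt one existing page for wireless
RATE_USD_PER_HOUR = 20       # assumed hourly labor cost


def adaptation_cost(pages, hours_per_page, rate):
    """Return (total work-hours, total cost in USD) for hand adaptation."""
    work_hours = pages * hours_per_page
    return work_hours, work_hours * rate


work_hours, cost_usd = adaptation_cost(PAGES, HOURS_PER_PAGE, RATE_USD_PER_HOUR)
print(f"{work_hours / 1e9:.1f} billion work-hours")  # 2.4 billion work-hours
print(f"${cost_usd / 1e9:.0f} billion")              # $48 billion
```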
Current need?
• Viewing web sites on small-screen
handheld devices
• Since web sites are written in HTML,
we need to translate them into formats
that wireless devices can support.
Myths
• Web summarization is easy
• No scanning
• No image processing
• No word or character level recognition
• HTML has structural elements
• Already in electronic formats
Current Solutions
• Handcrafting:
– Custom web sites are typically crafted by
hand by a set of content experts
• Transcoding:
– Transcoding replaces HTML tags with
suitable device-specific tags (HDML, WML,
etc.)
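As a rough illustration of what tag-for-tag transcoding amounts to, the sketch below maps a handful of HTML tags onto WML-like tags using Python's standard `html.parser`. The tag map, the class name and the `transcode` helper are all hypothetical, and the output is not validated against the WML specification; it only shows the idea that the content is carried over unchanged while the markup is swapped.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical, deliberately tiny tag map; a real transcoder covers far more.
TAG_MAP = {"h1": "b", "h2": "b", "b": "b", "i": "i", "p": "p", "a": "a"}
SKIP = {"script", "style"}  # content with no meaning on the device


class Transcoder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self._skip_depth += 1
            return
        target = TAG_MAP.get(tag)
        if target is None:
            return  # drop unmapped tags but keep the text inside them
        if target == "a":
            href = dict(attrs).get("href") or ""
            self.out.append(f'<a href="{escape(href)}">')
        else:
            self.out.append(f"<{target}>")

    def handle_endtag(self, tag):
        if tag in SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        target = TAG_MAP.get(tag)
        if target is not None:
            self.out.append(f"</{target}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(escape(data, quote=False))


def transcode(html_text):
    """Replace HTML tags with WML-like tags; every piece of text is kept."""
    parser = Transcoder()
    parser.feed(html_text)
    parser.close()
    return "<wml><card>" + "".join(parser.out) + "</card></wml>"


page = '<h1>News</h1><div><p>Hello <a href="/x">link</a></p></div>'
print(transcode(page))
```

Note that nothing is shortened or ranked here: the whole page is carried over to the device.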
Handcrafting
• Take an existing web site and make it available for
wireless access. Aether Systems, Mshift and
2Roam currently offer this type of solution.
• Use a proprietary graphical interface to ease the
development of wireless applications from
scratch. Covigo and iConverse offer this type
of solution.
• Let the user do all coding in languages such as
C++ or Java. ThinAirApps offers this type of
solution.
Handcrafting
• Labor intensive
• Expensive
• Typically less than 1% of a
web site gets converted to
wireless content.
Transcoding
• Transcoding was introduced in Japan
during 1999-2000. It was widely rejected
by Japanese users.
• Recently, Google and Pixo introduced this
solution for the US market, but have so far
failed to attract the attention of end users.
The Alternate Solution
• Separate the content into smaller segments
• Generate a summary of each segment
• Prioritize the summaries of the individual
segments
• Put them together to form a summary of the
overall document
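The four steps above can be sketched as a small pipeline. Every heuristic here is an assumption made for illustration and is not taken from the slides: blank lines stand in for real page segmentation, the first sentence stands in for a segment summary, and word count stands in for importance.

```python
import re


def segment(text):
    """Step 1: split content into segments (blank lines as a stand-in)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def summarize_segment(seg):
    """Step 2: low-level summary, here simply the first sentence."""
    first = re.split(r"(?<=[.!?])\s+", seg, maxsplit=1)[0]
    return " ".join(first.split())


def priority(seg):
    """Step 3: assumed importance score, longer segments rank higher."""
    return len(seg.split())


def summarize_document(text, max_items=3):
    """Step 4: assemble the top-ranked segment summaries."""
    ranked = sorted(segment(text), key=priority, reverse=True)
    return [summarize_segment(s) for s in ranked[:max_items]]


page = """Home | News | Sports

Storm closes coastal roads. Crews expect to reopen them by Friday after clearing debris.

Buy now!"""

print(summarize_document(page, max_items=2))
# ['Storm closes coastal roads.', 'Home | News | Sports']
```

Each function can be replaced independently, which is the point of separating the steps.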
Summarization vs. Transcoding
• Long displays
• Long download times
• Finding information difficult
• No mapping of the importance of content
in the original document
Steps to Summarization
• Segmentation
– A tree
• Problems
– Tables
– Frames
– JavaScript
– Graphics
– Other artifacts
– Over-segmentation
– Under-segmentation
– Poor coding
– Browsers are too good!
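[Figure: segmentation tree. A CContent root holds CTable nodes; a CTable holds a CRow, which holds CCol nodes, and each CCol can hold a further nested CTable, and so on.]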
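A minimal sketch of building such a tree from nested HTML tables, using Python's standard `html.parser`. The node names follow the slide; the mapping from `table`, `tr` and `td` tags to those names, and all class and function names, are assumptions. Frames, scripts and badly nested markup, which the slide lists as problems, are not handled beyond a simple recovery step for unclosed tags.

```python
from html.parser import HTMLParser

# Node names follow the slide's tree; the tag mapping is an assumption.
NODE_FOR_TAG = {"table": "CTable", "tr": "CRow", "td": "CCol"}


class Node:
    def __init__(self, kind):
        self.kind = kind
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("CContent")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        kind = NODE_FOR_TAG.get(tag)
        if kind:
            node = Node(kind)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        kind = NODE_FOR_TAG.get(tag)
        # Pop back to the matching open node, tolerating unclosed inner tags.
        if kind and any(n.kind == kind for n in self.stack[1:]):
            while self.stack[-1].kind != kind:
                self.stack.pop()
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].text += text + " "


def build_tree(html_text):
    builder = TreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + node.kind]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


html_text = ("<table><tr><td>Nav</td><td>"
             "<table><tr><td>Story</td></tr></table>"
             "</td></tr></table>")
print("\n".join(outline(build_tree(html_text))))
```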
Steps to Summarization
• Visual cues
• Size of font
• Headlines
• Boldness
• Color
• Links
• Flashing
• Italic (I)
• Emphasized
• Underlines
• Labeling
– Main Story
– Links
– Navigation Bars
– Advertisement Bars
– Other Stories
– Forms
– Images
• Problems
• Graphics
• OCR
• JavaScript
• CSS
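One way to turn the visual cues listed above into a number is to count the corresponding tags in a segment and weight them. The sketch below does this with Python's standard `html.parser`. The weights and names are hypothetical and would need tuning on real pages, and cues set through CSS, scripts or images (the problems listed above) are invisible to it.

```python
from html.parser import HTMLParser

# Hypothetical weights, one group per cue on the slide.
CUE_WEIGHTS = {
    "h1": 5, "h2": 4, "h3": 3,   # headlines
    "b": 2, "strong": 2,         # boldness
    "i": 1, "em": 1,             # italic / emphasized
    "u": 1,                      # underlines
    "a": 1,                      # links
    "blink": 1,                  # flashing
}


class CueScorer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.score = 0
        self.cues = []

    def handle_starttag(self, tag, attrs):
        if tag in CUE_WEIGHTS:
            self.score += CUE_WEIGHTS[tag]
            self.cues.append(tag)
        elif tag == "font":
            attributes = dict(attrs)
            if attributes.get("color"):
                self.score += 1
                self.cues.append("font-color")
            size = attributes.get("size") or ""
            # HTML font sizes run 1-7 with 3 as the default; "+N" is relative.
            if size.startswith("+") or (size.isdigit() and int(size) > 3):
                self.score += 2
                self.cues.append("font-size")


def cue_score(html_fragment):
    """Return (score, cues found) for one segment of markup."""
    scorer = CueScorer()
    scorer.feed(html_fragment)
    scorer.close()
    return scorer.score, scorer.cues


fragment = ('<h1>Title</h1><p>Some <b>bold</b> and '
            '<font size="5" color="red">big</font> text</p>')
print(cue_score(fragment))
# (10, ['h1', 'b', 'font-color', 'font-size'])
```

A score like this could feed the labeling step, for example to separate a main story from a navigation bar, but that mapping is not specified in the slides.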
Steps to Summarization
• Labeling => Segment Summary: Extraction
of a low level summary of the segment
• Priority: Estimating importance of these
segments
• Table of Contents (TOC) => Document
Summary: Putting together a summary of
the document
Conclusion
• Content can be used effectively to summarize web
documents
• Content summarization is more complex than
textual summarization
• HTML structure is a good starting point, but not
enough to understand context
• Summarization offers significant advantages over
transcoding
• Summarization also gives a faster browsing
experience
• There is a lot of money in this!