Understanding the Flow of
Content in Summarizing HTML
Documents
A. Rahman, H. Alam, R. Hartono
Document Analysis and Recognition Team
(DART)
BCL Computers Inc.
Santa Clara, Calif, USA
Basic Problem Statement
• How do we summarize web based
documents?
• Does HTML structure give us any clue to
understanding the content?
• Does the flow of content have anything to do
with the main message?
Why Summarization?
• The display area of handheld devices, e.g. PDAs
and cell phones, is too small for useful web
browsing
• Download times are still too long for
comfortable browsing on wireless
devices
• Cost factor is still too high
Current need?
• Viewing websites on small-screen
handheld devices
• Since web sites are written in HTML,
we need to translate them into formats
that wireless devices can support.
Current Solutions
• Handcrafting:
– Custom web sites are typically crafted by
hand by a set of content experts
• Transcoding:
– Transcoding replaces HTML tags with
suitable device-specific tags (HDML, WML,
etc.)
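The tag replacement that transcoding performs can be sketched as follows. This is a minimal illustration, not any vendor's transcoder: the `TAG_MAP` table and the `transcode` helper are hypothetical, the mapping of headings to `<big>` is an assumption, and a real system would also have to wrap the output in a valid WML deck.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical HTML-to-WML tag table, for illustration only.
# Tags that are not listed are dropped, but their text is kept.
TAG_MAP = {
    "p": "p", "br": "br", "a": "a",
    "b": "b", "strong": "b", "i": "i", "em": "i",
    "h1": "big", "h2": "big",
}


class Transcoder(HTMLParser):
    """Replaces HTML tags one by one with device-specific tags."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target is None:
            return  # unsupported tag: drop the markup, keep the text
        if target == "br":
            self.out.append("<br/>")
        elif target == "a":
            href = dict(attrs).get("href") or ""
            self.out.append('<a href="%s">' % escape(href))
        else:
            self.out.append("<%s>" % target)

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target and target != "br":
            self.out.append("</%s>" % target)

    def handle_data(self, data):
        # Text passes through unchanged (re-escaped for the output markup).
        self.out.append(escape(data, quote=False))


def transcode(markup):
    parser = Transcoder()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Note that the sketch only swaps markup: all of the text comes through at full length, with no indication of which parts matter most.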
Handcrafting
• Automation
– Use of XML.
• There is no standard XML tagset (Document Type Definition,
DTD) in use by vendors.
• XML has been available to web designers for the last 10
years. Examination of websites shows little use of document
structural elements.
– Web masters see themselves as artists rather than
programmers.
– XML may meet the same fate as SGML, an earlier
attempt to create structured documents.
Handcrafting
• Take an existing website and make it available for
wireless access. Aether Systems, Mshift and
2Roam currently offer these types of solutions.
• Use a proprietary graphical interface to ease the
development of wireless applications from
scratch. Covigo and iConverse offer these types
of solutions.
• Let the user do all coding in languages such as
C++ or Java. ThinAirApps offers this type of
solution.
Handcrafting
• Labor intensive
• Expensive
• Typically less than 1% of a
web site gets converted to
wireless content.
Transcoding
• Transcoding was introduced in Japan
during 1999-2000. It was widely rejected
by Japanese users.
• Recently, Google and Pixo introduced this
solution for the US market, but have so far
failed to attract the attention of end users.
The Alternate Solution
• Separate the content into smaller segments
• Generate a summary of each of these segments
• Prioritize the summaries of the individual
segments
• Put these together to form a summary of the
overall document
Summarization vs. Transcoding
• Long displays
• Long download times
• Finding information difficult
• No mapping of the importance of content
in the original document
Steps to Summarization
• Structural analysis: Understanding the
relationship of the various segments with
the document
• Decomposition: Breakdown of these
segments into operational units
• Contextual Analysis: Employment of
context to revise the segmentation
(Continued=>)
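The first three steps can be sketched roughly as follows. The slides do not say how each step is implemented, so everything here is an assumption made for illustration: block-level tags stand in for segment boundaries, the nesting path stands in for a segment's relationship to the document, and the contextual revision is reduced to merging very short fragments into the preceding segment. The names `BLOCK_TAGS`, `segment_html` and `merge_short` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these tags mark segment boundaries.
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCK_TAGS = HEADING_TAGS | {"table", "tr", "td", "div", "p", "ul", "ol", "li"}


class Segmenter(HTMLParser):
    """Structural analysis and decomposition: one segment per block of text."""

    def __init__(self):
        super().__init__()
        self.segments = []  # each entry: {"path": nesting path, "text": content}
        self._stack = []    # currently open block-level tags
        self._buf = []      # text collected since the last boundary

    def _flush(self):
        text = " ".join("".join(self._buf).split())
        if text:
            self.segments.append({"path": "/".join(self._stack), "text": text})
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
            if tag in self._stack:
                # Pop up to the matching tag; tolerates unclosed inner tags.
                while self._stack.pop() != tag:
                    pass

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def segment_html(markup):
    parser = Segmenter()
    parser.feed(markup)
    parser.close()
    return parser.segments


def merge_short(segments, min_words=4):
    """Contextual analysis (simplified): fold short fragments into the
    preceding segment, but leave headings as segments of their own."""
    merged = []
    for seg in segments:
        is_heading = seg["path"].rsplit("/", 1)[-1] in HEADING_TAGS
        if merged and not is_heading and len(seg["text"].split()) < min_words:
            merged[-1]["text"] += " " + seg["text"]
        else:
            merged.append(dict(seg))
    return merged
```

Calling `merge_short(segment_html(page))` on a page string returns a list of segments, each carrying its text and the nesting path under which it was found.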
Steps to Summarization
(Continued)
• Labeling => Segment Summary: Extraction
of a low level summary of the segment
• Priority: Estimating importance of these
segments
• Table of Contents (TOC) => Document
Summary: Putting together a summary of
the document
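A minimal sketch of the last three steps, assuming the segments are already available as plain strings. The slides do not describe how labels or priorities are computed, so the choices below are placeholders: the label is the first sentence truncated for a small screen, and the priority score (longer and earlier segments rank higher) is a hypothetical heuristic. The function names are illustrative only.

```python
import re


def label(text, max_chars=40):
    """Segment summary: the first sentence, truncated to fit a small screen."""
    first = re.split(r"(?<=[.!?])\s+", text.strip(), maxsplit=1)[0]
    if len(first) <= max_chars:
        return first
    return first[:max_chars - 3].rstrip() + "..."


def priority(text, position):
    """Hypothetical importance score: more words raise it, later position lowers it."""
    return len(text.split()) / (1 + position)


def build_toc(segments, max_entries=5):
    """Document summary: segment labels ordered by estimated importance."""
    scored = [(priority(text, i), i, label(text)) for i, text in enumerate(segments)]
    # Highest score first; ties keep document order.
    scored.sort(key=lambda entry: (-entry[0], entry[1]))
    return [lab for _, _, lab in scored[:max_entries]]
```

On a device, each TOC entry would link to the full text of its segment, so the user downloads only the parts selected.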
Supported Devices and Formats
• PDAs (HTML3.2)
• Cell phones
– USA/Europe:
• WAP
– Japan
• iMode (NTT DoCoMo)
• J-Sky (J-Phone)
• EZWeb (KDDI)
Conclusion
• It is a good idea to use the flow of content in
understanding web documents
• Content can be used effectively to summarize web
documents
• HTML structure is a good starting point, but not
enough to understand context
• Summarization offers significant advantages over
transcoding
• Summarization also makes for a faster browsing
experience