A Comparative Study of Some Multiple


Content Extraction from HTML
Documents
A. Rahman H. Alam R. Hartono
Document Analysis and Recognition Team
(DART)
BCL Computers Inc.
Santa Clara, Calif, USA
Current need?
• Viewing website using small screen
handheld devices
• Since web sites are written in HTML, we
need to translate them into formats that
wireless devices can support.
Current Solutions
• Handcrafting:
– Custom web sites are typically crafted by hand by a set of content experts
• Transcoding:
– Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML, etc.)
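A minimal sketch of the tag-replacement idea behind transcoding, using only the Python standard library. The TAG_MAP table below is a hypothetical, illustrative mapping and not a complete or validated HTML-to-WML conversion; tags that are not in the table are dropped and their text is kept.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical HTML -> WML-like tag table, for illustration only.
TAG_MAP = {"h1": "big", "h2": "big", "strong": "b", "em": "i",
           "p": "p", "b": "b", "i": "i"}


class Transcoder(HTMLParser):
    """Re-emit a document, swapping mapped tags and dropping unmapped ones."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Attributes are discarded in this sketch.
        if tag in TAG_MAP:
            self.out.append(f"<{TAG_MAP[tag]}>")

    def handle_endtag(self, tag):
        if tag in TAG_MAP:
            self.out.append(f"</{TAG_MAP[tag]}>")

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape text.
        self.out.append(escape(data, quote=False))


def transcode(html):
    parser = Transcoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(transcode("<h1>News</h1><p>Hello <strong>world</strong></p>"))
```

Note that every piece of text in the page is still sent to the device; only the markup changes.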
Handcrafting
• Automation
– Use of XML.
• There is no standard XML tagset (Document Type Definition, DTD) in use by vendors.
• XML has been available to web designers for the last 10 years. Examination of websites shows little use of document structural elements.
– Web masters see themselves as artists rather than programmers.
– XML may meet the same fate as SGML, an earlier attempt to create structured documents.
Handcrafting
• Take an existing website and make it available for wireless access. Aether Systems, Mshift and 2Roam currently offer these types of solutions.
• Use a proprietary graphical interface to ease the development of wireless applications from scratch. Covigo and iConverse offer these types of solutions.
• Let the user do all coding in languages such as C++ or Java. ThinAirApps offers this type of solution.
Handcrafting
• Labor intensive
• Expensive
• Typically less than 1% of a web site gets converted to wireless content.
Transcoding
• Most web pages have a loose repeating visual structure. The wireless user gets the same repeating information with every screen.
• Browsing is an unfriendly experience.
• Transcoding sends all the information to the wireless device, making it substantially slower on the wireless network.
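To illustrate the repeated-structure problem, here is a small sketch that finds segments recurring across several pages of a site, such as navigation bars and footers. It assumes each page has already been split into text segments; the function names, the 0.8 threshold, and the sample pages are hypothetical and are not taken from the system described in these slides.

```python
from collections import Counter


def repeated_segments(pages, min_share=0.8):
    """Return segments that occur on at least `min_share` of the pages."""
    counts = Counter()
    for segments in pages:
        counts.update(set(segments))  # count each segment once per page
    threshold = min_share * len(pages)
    return {seg for seg, n in counts.items() if n >= threshold}


def strip_repeats(page, repeats):
    """Keep only the segments that are specific to this page."""
    return [seg for seg in page if seg not in repeats]


# Made-up sample pages: a navigation bar, one story, and a footer each.
pages = [
    ["Home | News | Sports", "Storm hits coast", "Copyright notice"],
    ["Home | News | Sports", "Team wins final", "Copyright notice"],
    ["Home | News | Sports", "Markets rally", "Copyright notice"],
]
repeats = repeated_segments(pages)
print(strip_repeats(pages[0], repeats))
```

A plain transcoder sends the navigation bar and the footer with every screen; here they are identified once and can be held back.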
Transcoding
• Transcoding was introduced in Japan during 1999-2000. It was widely rejected by Japanese users.
• Recently, Google and Pixo introduced this solution for the US market, but have so far failed to attract the attention of end users.
The Alternate Solution
• Separate the content into smaller segments
• Generate a summary of these segments
• Prioritize these summaries from individual segments
• Put these together to form a summary of the overall document
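The four steps can be sketched as a small pipeline over plain text. Everything here is an assumption made for illustration: segments are separated by blank lines, a segment's summary is its first few words, and priority is its word count. The slides do not specify the actual heuristics.

```python
def split_segments(text):
    """Split on blank lines; a stand-in for real structural segmentation."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def summarize(segment, max_words=6):
    """Use the leading words of the segment as its summary (an assumption)."""
    words = segment.split()
    summary = " ".join(words[:max_words])
    return summary + ("..." if len(words) > max_words else "")


def priority(segment):
    """Score by word count; longer segments are assumed more important."""
    return len(segment.split())


def document_summary(text, top_n=3):
    """Segment, rank by priority, and summarize the top segments."""
    segments = split_segments(text)
    ranked = sorted(segments, key=priority, reverse=True)
    return [summarize(s) for s in ranked[:top_n]]


page = (
    "Home News Sports\n\n"
    "A storm hit the coast on Monday and closed several roads near the harbor.\n\n"
    "Contact us"
)
for line in document_summary(page, top_n=2):
    print(line)
```

Each function corresponds to one bullet, so any step can be replaced by a better heuristic without touching the others.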
Steps to Content Extraction
• Structural analysis: Understanding the
relationship of the various segments with
the document
• Decomposition: Breakdown of these
segments into operational units
• Contextual Analysis: Employment of
context to revise the segmentation
(Continued=>)
Steps to Content Extraction
(Continued)
• Labeling => Segment Summary: Extraction
of a low level summary of the segment
• Priority: Estimating importance of these
segments
• Table of Contents (TOC) => Document
Summary: Putting together a summary of
the document
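A rough sketch of how structural analysis, decomposition, labeling, priority, and the table of contents could fit together on an HTML input, using the standard-library HTML parser. The set of block-level tags treated as segment boundaries, the "first few words" label, and the word-count priority are all assumptions for illustration. Contextual analysis is not modeled.

```python
from html.parser import HTMLParser

# Assumed segment boundaries; a real system would analyze structure more deeply.
BLOCK_TAGS = {"div", "table", "p", "ul", "ol", "form", "h1", "h2", "h3"}


class Segmenter(HTMLParser):
    """Collect the text between block-level tag boundaries as segments."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._buf = []

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # emit any trailing text


def segment_html(html):
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    return parser.segments


def table_of_contents(html, max_words=5):
    """Label each segment by its first words, most important segment first."""
    ranked = sorted(segment_html(html), key=lambda s: len(s.split()), reverse=True)
    return [" ".join(s.split()[:max_words]) for s in ranked]
```

The resulting list can serve as the top-level view on a small screen, with each entry expanding to its full segment on demand.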
Content Extraction
• Proximity Analysis: Relational analysis of content between segments
• Content Classification: classification into various types, e.g. [stories], [navigation], [links], [images], [forms]
• Relationship Analysis
– Contextual grammar (Natural Language)
– Knowledge modes
– Information retrieval techniques
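Content classification could be approximated with simple tag and link-density statistics, as in the sketch below. The rules and thresholds (more than half of the words inside links, three or more links for navigation) are hypothetical and are not the classifier described in these slides.

```python
from html.parser import HTMLParser


class SegmentStats(HTMLParser):
    """Count links, images, forms, and words in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = {"a": 0, "img": 0, "form": 0}
        self.words = 0
        self.link_words = 0
        self._in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        if tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        n = len(data.split())
        self.words += n
        if self._in_link:
            self.link_words += n


def classify(fragment):
    """Assign one of: forms, images, navigation, links, stories."""
    s = SegmentStats()
    s.feed(fragment)
    s.close()
    if s.counts["form"]:
        return "forms"
    if s.counts["img"] and s.words == 0:
        return "images"
    # Assumed rule: a segment whose words are mostly link text is link-like.
    if s.words and s.link_words / s.words > 0.5:
        return "navigation" if s.counts["a"] >= 3 else "links"
    return "stories"


print(classify('<a href="/">Home</a> <a href="/n">News</a> <a href="/s">Sports</a>'))
print(classify('<p>A storm hit the coast. <a href="/more">More</a></p>'))
```

The class labels can then drive the priority step, for example by ranking stories above navigation.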
Content Extraction: Why do we
need it?
• Viewing any website: Any solution to web
browsing has to be universal
• High network access: Any transformation
has to be fast and on-the-fly
• Network Usage: Network traffic should
not increase because of these systems
(Continued=>)
Content Extraction: Why do we
need it (continued)?
• Easy Configurability: Any such system should be
easily configurable
• Rapid Deployment: Should be rapidly deployable
• Non-intrusive Design: Should be possible to
transform web sites without modifying the actual
web site
• Multiple Views: System Integrators should be
able to create multiple views of the same site
Advantages of Content Extraction
• Displays size
• Locating information
• Important content can be on top
• Multiple levels of abstraction can be created
• The browsing can use a demand-driven model
• Faster download
• More efficient use of small display areas
• Mapping of the importance of content from the original document
Supported Devices and Formats
• PDAs (HTML3.2)
• Cell phones
– USA/Europe:
• WAP
– Japan
• iMode (NTT DoCoMo)
• J-Sky (J-Phone)
• EZWeb (KDDI)
Conclusion
• Content from web documents can be extracted based on the
– HTML structure
– Proximity analysis
– Logical relationship analysis
– Information retrieval techniques
• Content can be used effectively to summarize web
documents
– Better option compared to handcrafting or transcoding
– Produces a faster browsing experience