Assuming Accurate Layout Information is Available: How do we Interpret the Content Flow in HTML Documents?
Hassan Alam and Fuad Rahman
Human Computer Interaction Group
BCL Technologies Inc. Santa Clara, CA 95050
www.bcltechnologies.com
[email protected]
Overview of the Talk
• Content Flow in Web Pages
• Structural Flow vs. Logical Flow
• Language Independence
• Independence for Semantics
• Content Flow from Purely Geometrical Information
• Conclusion and Future Work
Related Work
Handcrafting
Transcoding
Adaptive Re-authoring
Handcrafting typically involves a set of content experts crafting web pages by hand for device-specific output.
Transcoding replaces HTML tags with suitable device-specific tags, such as HDML, WML and others.
Research on web page re-authoring can explicitly use natural language processing (NLP) or rely on non-NLP techniques.
The HTML Table-based Structure
How is the Table Structure Exploited?
Most HTML sources use tables as the principal organizational method.
We assume that a geometric parser will give us the exact position of each table and sub-table.
Content is in the columns; rows are only used to arrange content.
We assume that content flow is language-independent, or is it?
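The "content is in the columns" assumption can be illustrated with a minimal sketch. Everything here is hypothetical: the `Box` type, the `tolerance` parameter and the left-to-right column order are assumptions made for illustration, not the parser described in the talk. The left-to-right choice is itself a language-dependent decision, which is exactly the question raised above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered rectangle of one table cell, in pixels."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


def column_major_order(boxes, tolerance=5.0):
    """Group boxes into columns by left edge, then read each column top to bottom."""
    columns = []  # each inner list holds the boxes of one column
    for box in sorted(boxes, key=lambda b: b.left):
        # A box joins the current column if its left edge is close enough.
        if columns and abs(columns[-1][0].left - box.left) <= tolerance:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.top))
    return [b.name for b in ordered]


boxes = [
    Box("story-2", 200, 300, 600, 500),
    Box("nav", 0, 0, 180, 500),
    Box("story-1", 200, 0, 600, 280),
]
reading_order = column_major_order(boxes)
```

In this example the two story boxes share a left edge, so they are read as one column after the navigation column.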
How is the Table Structure Exploited?
Calculate xPreference list
Calculate yPreference list
Perform Proximity analysis: Know thy neighbors!
Calculate Inclusion Criterion
Quantify each table: Calculate area
Calculate table hierarchy based on Inclusion criterion and proximity analysis
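A rough sketch of the area and inclusion steps, assuming each table is available as a rectangle from the geometric parser. The `Rect` type and the helper names are hypothetical, and the slides do not define the inclusion criterion precisely; here it is taken to mean that one rectangle lies entirely inside a larger one. The proximity analysis and the preference lists are not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical bounding rectangle of a table, in pixels."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    @property
    def area(self):
        return (self.right - self.left) * (self.bottom - self.top)


def contains(outer, inner):
    """Assumed inclusion test: inner lies entirely within a strictly larger outer."""
    return (outer.area > inner.area
            and outer.left <= inner.left and outer.top <= inner.top
            and outer.right >= inner.right and outer.bottom >= inner.bottom)


def build_hierarchy(rects):
    """Map each table name to its smallest enclosing table name (None for the root)."""
    parents = {}
    for r in rects:
        enclosing = [o for o in rects if contains(o, r)]
        parents[r.name] = min(enclosing, key=lambda o: o.area).name if enclosing else None
    return parents


def depth(name, parents):
    """Nesting level of a table: 0 for a table with no enclosing table."""
    level = 0
    while parents[name] is not None:
        name = parents[name]
        level += 1
    return level


tables = [
    Rect("page", 0, 0, 800, 600),
    Rect("sidebar", 0, 100, 200, 600),
    Rect("body", 200, 100, 800, 600),
    Rect("story", 220, 120, 780, 400),
]
parents = build_hierarchy(tables)
```

Choosing the smallest enclosing rectangle as the parent keeps the hierarchy a tree even when a table is nested several levels deep.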
Calculate TOC
Calculate Level of TOC
Calculate Merging Criterion
    Same Inclusion Criterion
    Lowest first
    Sharing identical sides
    Not if a border exists
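One possible reading of the merging criterion, sketched under stated assumptions: two tables may merge if they have the same parent in the hierarchy, share one complete side, and neither has a visible border. The `Table` type and its fields are hypothetical. The "lowest first" ordering is left out because the slides do not spell out what it ranks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Table:
    """Hypothetical table record: rectangle, parent name, and border flag."""
    name: str
    left: float
    top: float
    right: float
    bottom: float
    parent: Optional[str] = None
    has_border: bool = False


def share_identical_side(a, b):
    """True if the two rectangles touch along one complete, equal-length side."""
    side_by_side = (a.top == b.top and a.bottom == b.bottom
                    and (a.right == b.left or b.right == a.left))
    stacked = (a.left == b.left and a.right == b.right
               and (a.bottom == b.top or b.bottom == a.top))
    return side_by_side or stacked


def can_merge(a, b):
    """Assumed criterion: same parent, no borders, one identical shared side."""
    return (a.parent == b.parent
            and not a.has_border and not b.has_border
            and share_identical_side(a, b))


def merge(a, b):
    """Return the bounding rectangle of two mergeable tables as a new table."""
    return Table(f"{a.name}+{b.name}",
                 min(a.left, b.left), min(a.top, b.top),
                 max(a.right, b.right), max(a.bottom, b.bottom),
                 a.parent, False)


upper = Table("upper", 200, 100, 800, 300, parent="page")
lower = Table("lower", 200, 300, 800, 600, parent="page")
boxed = Table("boxed", 0, 100, 200, 600, parent="page", has_border=True)
merged = merge(upper, lower) if can_merge(upper, lower) else None
```

Here `upper` and `lower` merge into one block, while `boxed` stays separate even though it shares a full side with the merged block, because it has a border.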
The HTML Table-based Structure
Map of Table Layout
What is the Advantage of this Analysis?
• Relative importance of content can be assessed, resulting in better re-authoring.
• It becomes possible to capture the contextual relationship among various components within the document, such as what is a side bar, what is an advertisement, what is a top bar, etc.
• If needed, it is possible to use other natural language techniques to correlate tables by using semantics or other criteria.
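To make the idea of reading a component's role from its geometry concrete, here is a toy heuristic. The function name and every threshold are assumptions chosen for illustration, not values from the talk, and the sketch covers only three roles; it does not attempt to detect advertisements.

```python
def classify_block(left, top, right, bottom, page_width, page_height):
    """Guess the role of a block from its size and position (illustrative thresholds)."""
    width = right - left
    height = bottom - top
    # Wide, short, and near the top of the page: treat as a top bar.
    if (width >= 0.8 * page_width
            and height <= 0.2 * page_height
            and top <= 0.1 * page_height):
        return "top bar"
    # Narrow and tall: treat as a side bar.
    if width <= 0.3 * page_width and height >= 0.5 * page_height:
        return "side bar"
    return "main content"


roles = [
    classify_block(0, 0, 800, 80, 800, 600),
    classify_block(0, 80, 180, 600, 800, 600),
    classify_block(180, 80, 800, 600, 800, 600),
]
```

A real system would presumably combine such geometric cues with the table hierarchy rather than rely on fixed thresholds.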
Current Work
• XML is being successfully used in many applications to mark up important information according to application-specific vocabularies.
• Two W3C Recommendations, XSLT (Extensible Stylesheet Language Transformations) and XPath (the XML Path Language), meet the need to transform such marked-up information.
• This is an exploratory paper offering a specific pathway to the future of web page re-authoring, provided accurate layout information is available.
• It is probably better to use the XSLT language, which itself uses XPath, to specify how an implementation of an XSLT processor is to create a desired output from a given marked-up input.
Future Work
• Exact location of each block, in rectangular coordinates, equivalent to rendition using a standard browser.
• Size of each block of content.
• Type of content, e.g. text, graphics etc.
• Weight of content, in terms of size and placement within a page.
• Continuity information, derived from physical association in terms of geometrical collocation.
• Classification of content into a set of pre-defined classes, e.g. main story, sidebars, links and so on.
• Linkage information from the XML representation, indicating the layers of information that can be hidden at a level of summary. This can represent the content in many levels, but more than two or three levels are unsuitable for easy navigation.
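A minimal sketch of how some of the items above could be recorded in XML, using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`page`, `block`, `weight` and so on) are hypothetical, and the weight shown here reflects size only (the block's share of the page area), not placement.

```python
import xml.etree.ElementTree as ET


def block_to_xml(block_id, left, top, right, bottom,
                 content_type, content_class, page_area):
    """Build a hypothetical <block> element carrying location, size, type and class."""
    width = right - left
    height = bottom - top
    return ET.Element("block", {
        "id": block_id,
        "left": str(left), "top": str(top),
        "right": str(right), "bottom": str(bottom),
        "width": str(width), "height": str(height),
        "type": content_type,
        "class": content_class,
        # Assumed weight: fraction of the page area covered by the block.
        "weight": f"{(width * height) / page_area:.2f}",
    })


page = ET.Element("page", {"width": "800", "height": "600"})
page.append(block_to_xml("b1", 200, 0, 800, 600, "text", "main story", 800 * 600))
xml_text = ET.tostring(page, encoding="unicode")
```

Nesting `block` elements inside one another would be one way to express the layers of information that can be hidden at a given level of summary.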
Conclusions
• A specific pathway to the future of web page re-authoring, provided accurate layout information is available.
• This in no way represents a state-of-the-art discussion about the possible use of layout information. Rather, it focuses on one small part within an array of possibilities.
• It will be interesting to discuss other possibilities in this space during the DLIA workshop.