myPortal: Robust Extraction and Aggregation of Web Content
Marek Kowalkiewicz, Tomasz Kaczmarek, Witold Abramowicz
Tomasz Kaczmarek
The Poznan University of Economics
Poland
Background



Personalized access to information
Dynamic content on web pages
Various techniques for content extraction:



Based on unique ID
Using contextual information
Document tree analysis
myPortal vision




Ability to extract content blocks from HTML pages
Easy aggregation
Client-side technology: no server-side investments necessary
Stress on:


Robustness
Ease of use
My portal
Extraction technique
Extraction based on HTML DOM tree
Absolute XPath
Relative XPath
Visual query specification
Reference element
Extracted content
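
The difference between the two query styles can be sketched in a few lines of Python. This is only an illustration, not the myPortal implementation: the two page snapshots, the function names, and the choice of `xml.etree.ElementTree` (which supports a limited XPath subset and requires well-formed markup) are all assumptions made for the example. The absolute query walks from the document root; the relative query first locates a reference element by its text and then follows a short path from that element's parent.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snapshots of the same page (illustrative data only).
PAGE_V1 = """<html><body>
<div><h2>Weather</h2><p>Sunny, 21 C</p></div>
<div><h2>News</h2><p>Headline A</p></div>
</body></html>"""

# Later version: a new block was inserted above the existing ones.
PAGE_V2 = """<html><body>
<div><p>Advertisement</p></div>
<div><h2>Weather</h2><p>Sunny, 19 C</p></div>
<div><h2>News</h2><p>Headline B</p></div>
</body></html>"""

ABSOLUTE_QUERY = "./body/div[2]/p"  # path from the root, by position


def extract_absolute(page, path):
    """Follow a fixed path from the document root."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.text


def extract_relative(page, reference_text, relative_path):
    """Find the reference element by its text, then follow a path
    starting at the reference element's parent."""
    root = ET.fromstring(page)
    # ElementTree elements do not know their parent, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for element in root.iter():
        if (element.text or "").strip() == reference_text:
            parent = parent_of.get(element)
            if parent is None:
                return None
            node = parent.find(relative_path)
            return None if node is None else node.text
    return None  # reference element not present


if __name__ == "__main__":
    for name, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(name,
              "absolute:", extract_absolute(page, ABSOLUTE_QUERY),
              "| relative:", extract_relative(page, "News", "p"))
```

On the first snapshot both queries return the news headline. On the second, the absolute query silently returns the weather text because the positions shifted, while the relative query still follows the "News" heading.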
Aggregation of content
Functionality

Done:



Extract content from any HTML page
Record GET and POST parameters and cookies
Access search results (via GET or POST) from search engines, as a subscription-like service
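
How a recorded query might be replayed can be sketched as follows. The slides do not say how myPortal stores recorded parameters, so the `RecordedQuery` structure, the `build_request` helper, and the `example.com` URL are hypothetical. The sketch only builds a request object with Python's standard `urllib`; it does not send anything over the network.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode
from urllib.request import Request


@dataclass
class RecordedQuery:
    """Hypothetical record of what was captured while the user browsed."""
    url: str
    method: str = "GET"
    params: dict = field(default_factory=dict)
    cookies: dict = field(default_factory=dict)


def build_request(query):
    """Rebuild an HTTP request from the recorded parameters and cookies."""
    encoded = urlencode(query.params)
    headers = {}
    if query.cookies:
        headers["Cookie"] = "; ".join(
            f"{name}={value}" for name, value in query.cookies.items()
        )
    if query.method.upper() == "POST":
        # POST parameters travel in the request body.
        return Request(query.url, data=encoded.encode("ascii"),
                       headers=headers, method="POST")
    # GET parameters travel in the query string.
    url = f"{query.url}?{encoded}" if encoded else query.url
    return Request(url, headers=headers, method="GET")


if __name__ == "__main__":
    recorded = RecordedQuery(
        url="https://example.com/search",
        params={"q": "web extraction"},
        cookies={"session": "abc123"},
    )
    request = build_request(recorded)
    print(request.get_method(), request.full_url)
```

Replaying such a request on a schedule and running an extraction query on the response is one way to get the subscription-like behaviour mentioned above.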

Work in progress:


Deal with multi-stage login or query mechanisms, such as obtaining bank account info
Deal with information from multiple DOM tree branches in a single query
Other (technical) problems



HTML code quality – HTML Tidy
WYSIWYG for aggregation
Robustness



Multiple occurrences of reference element
Document structure changes between reference and extracted elements
Deletion / change in the reference element
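
Two of the problems listed above, a duplicated reference element and a deleted or changed one, can at least be detected before extraction is attempted. The check below is a minimal sketch under assumptions: the reference element is identified by its exact text, the page is well-formed, and the function name and status labels are invented for the example.

```python
import xml.etree.ElementTree as ET


def reference_status(page, reference_text):
    """Report whether the reference element can be located unambiguously."""
    root = ET.fromstring(page)
    matches = [
        element for element in root.iter()
        if (element.text or "").strip() == reference_text
    ]
    if not matches:
        return "missing"    # deleted or changed reference element
    if len(matches) > 1:
        return "ambiguous"  # multiple occurrences of the reference element
    return "unique"


if __name__ == "__main__":
    page = "<html><body><h2>News</h2><p>Text</p><h2>News</h2></body></html>"
    print(reference_status(page, "News"))
    print(reference_status(page, "Weather"))
```

Structure changes between the reference and the extracted element are not covered by this check; they only show up as a wrong or empty extraction result.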
Research on robustness

Purpose: to check whether relative XPath expressions are more robust than absolute XPath expressions
Research method



Empirical tests on multiple portals
Manual query preparation for absolute and relative queries
Comparison of results in three categories:




Accurate extraction
Lack of result
Inaccurate extraction
Based on historical versions of portal sites obtained from Web Archive
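
The three result categories can be expressed as a small classification step applied to each archived page version. The sketch below is an assumption about how such a comparison could be scripted, not the procedure actually used in the study: the function names are hypothetical, the sample values are made up for illustration, and the share of accurate extractions is only one possible way to summarise the outcomes.

```python
from collections import Counter


def classify(extracted, expected):
    """Assign one extraction result to one of the three categories."""
    if extracted is None:
        return "lack of result"
    if extracted.strip() == expected.strip():
        return "accurate"
    return "inaccurate"


def accurate_share(outcomes):
    """Fraction of page versions on which the query was accurate."""
    counts = Counter(outcomes)
    return counts["accurate"] / len(outcomes)


if __name__ == "__main__":
    # Made-up (extracted, expected) pairs for three archived versions.
    results = [
        ("Headline A", "Headline A"),
        (None, "Headline B"),
        ("Sunny, 19 C", "Headline C"),
    ]
    outcomes = [classify(got, want) for got, want in results]
    print(outcomes, accurate_share(outcomes))
```

Running the same classification for the absolute and the relative query over the same set of page versions gives two outcome lists that can be compared directly.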
Robustness comparison
Average robustness
Thank you!
[email protected]