Transcript Wrappers

Wrappers
Kapowtech RoboSuite 6.0
Team number 10 – Vampires, meeting number 2
Documents
• Lixto WhitePaper
• Wrapper Development Tools
• Piggy Bank
• WebVCR
• Kapow RoboSuite Documentation
Why wrappers?
- HTML is used to display data
- the data is embedded in the HTML
- the Web is designed for human consumption, even when the content is derived from a well-defined database
- wrapper – a robot that browses the Web and extracts data
Applications
• online price comparisons
• automatic stock market surveillance
• personalized online news
• flight tickets
• job search
• competitive advantage
• research into new technologies
• …
Lixto WhitePaper
• presented by Duri
• table on the next slide
- comparison of wrappers, programming languages and conversion by hand
- criteria such as learning time, expressive power, user friendliness, …
Comparison of wrappers, programming languages and conversion by hand
Wrapper Development Tools
• 3 main functions:
- downloading HTML pages from a website
- searching for, recognizing and extracting data
- saving the extracted data in a suitable format, such as XML, XLS or a database, for later import into other applications
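The three functions above can be illustrated with a minimal sketch that uses only the Python standard library. It is not taken from any of the tools discussed here. The sample page, the field names and the helper names (TableExtractor, extract_rows, rows_to_xml) are assumptions made for illustration, and the download step is replaced by an inline page so that the sketch runs offline.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableExtractor(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False


def extract_rows(html):
    """Function 2: search for, recognize and extract the data."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_xml(rows, fields):
    """Function 3: save the extracted data in a suitable format (XML here)."""
    root = ET.Element("items")
    for row in rows:
        item = ET.SubElement(root, "item")
        for name, value in zip(fields, row):
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


# Function 1 (download) is replaced by an inline sample page; a real
# wrapper would fetch the page, for example with urllib.request.
SAMPLE_PAGE = """
<table>
  <tr><td>Laptop</td><td>899</td></tr>
  <tr><td>Phone</td><td>499</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_PAGE)
xml_output = rows_to_xml(rows, ["name", "price"])
print(xml_output)
```

The wrapper development tools listed in this presentation hide these steps behind a GUI or an API; the sketch only shows the shape of the pipeline.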
Wrapper Development Tools
• Non-commercial tools:
- most of them were developed at universities
- output data: mainly text and XML
- most of them offer an API
- most of them are implemented in Java and are open source
- most of them offer Web crawling
- some of them offer a GUI
- only a few offer an editor (regular expressions, ontologies)
Wrapper Development Tools
• Commercial tools:
- most of them were developed by commercial companies
- output data: mainly XML, tables and text
- most of them offer database connectivity
- most of them offer Web crawling
- most of them offer an API
- all of them offer a GUI
- most of them offer an editor (regular expressions, Perl, VBScript, …)
Piggy Bank
• extension for the Firefox Web browser
• turns it into a Semantic Web browser
• lets users:
- combine information from several web sites and browse it all together
- save information found on the Web
- tag each saved item
- share saved information
- browse and search through an existing web site
Piggy Bank – Applications
• You are meeting friends and want to find a restaurant with Chinese cuisine that is close to your favorite coffee shop with a wireless network
• You are moving to a new place and are looking for an apartment close to a school and a subway station, away from crime hotspots, near a hospital, …
Piggy Bank – How it works
• semantic web
• RDF model
• XML information
• screen scraper
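To make the link between a screen scraper and the RDF model concrete, the following sketch turns one scraped record into RDF triples written in N-Triples syntax. It is an illustration only, not Piggy Bank code. The function to_ntriples, the example.org URIs and the restaurant record are all hypothetical.

```python
def to_ntriples(subject_uri, properties):
    """Serialise one scraped record as RDF triples in N-Triples syntax."""
    lines = []
    for predicate_uri, value in properties.items():
        # Escape the characters that may not appear raw in a literal.
        literal = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'<{subject_uri}> <{predicate_uri}> "{literal}" .')
    return "\n".join(lines)


# Hypothetical record, as a screen scraper might pull it from a web page.
record = {
    "http://example.org/vocab#name": "Golden Dragon",
    "http://example.org/vocab#cuisine": "Chinese",
}
triples = to_ntriples("http://example.org/restaurant/1", record)
print(triples)
```

Once data from different sites is expressed as triples about URIs, records from several sources can be stored and queried together, which is what the applications described earlier rely on.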
Piggy Bank Example
Piggy Bank - Architecture
• consists of 3 primary parts:
- chrome additions to the browser, including menu commands, toolbars, etc.
- back-end Java code that manages collected information in databases and serves it up through an HTTP interface
- XPCOM components written in JavaScript that bridge the chrome part and the Java part
Piggy Bank - Technologies
• Firefox, as the application platform
• XUL, as the extension’s user interface language
• HTML, as the client side user interface language
• JavaScript, as the client side and extension’s scripting language
• Java, as the server side core programming language
• Batik, for encoding PNG files
• Informa, for parsing RSS feeds
• Jetty, as the embedded web server
• JTidy and JDom, for applying XSLT on HTML
• Log4j, as the logging framework
• Lucene, as the text indexer
• Sesame, as the RDF access and storage API
• Velocity, as the templating engine for generating HTML
WebVCR
• smart bookmarks
- shortcuts to Web content that requires multiple steps to retrieve
- hard-to-reach Web content
• VCR style – record and replay the user's actions, and possibly browse through the recorded steps
• no programming required from the user, just ordinary browsing
WebVCR - application
• navigating travelocity.com:
- Juliana plans to attend the WWW9 conference and is looking for flights from Newark to Amsterdam that leave Newark on May 14th and return from Amsterdam on May 20th. She must take the following steps:
- go to http://www.travelocity.com
- choose the Find/Book a Flight option
- log in
- specify the details of the itinerary
- resulting address: http://dps1.travelocity.com:80/airgchoice.ctl?SEQ=94312
WebVCR – 3 main steps
• Notification – tracking the user's actions; possible approaches:
- modifying the browser to provide a notification for each action performed
- using a proxy to rewrite each page, replacing all hrefs with calls to a well-known script that provides the notification
- using a proxy to monitor all HTTP commands sent to/from the browser
- attaching JavaScript event handlers to all active objects in the page
• Recording – storing the user's browsing information
• Playback – replaying the user's actions
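The proxy-based notification approach, in which every href is replaced with a call to a well-known script, can be sketched as follows. This is a simplified illustration and not the WebVCR implementation. The script address NOTIFY_SCRIPT, the query parameter name target and the sample page are hypothetical, and the regular expression only handles double-quoted href attributes.

```python
import re
from urllib.parse import quote, urlparse, parse_qs

# Hypothetical address of the "well-known script" served by the proxy.
NOTIFY_SCRIPT = "http://proxy.example/notify"

# Simplification: only double-quoted href attributes are rewritten.
HREF_PATTERN = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def rewrite_links(html):
    """Point every link at the notification script, keeping the real
    destination as a URL-encoded query parameter."""

    def redirect(match):
        encoded = quote(match.group(1), safe="")
        return f'href="{NOTIFY_SCRIPT}?target={encoded}"'

    return HREF_PATTERN.sub(redirect, html)


def recover_target(rewritten_url):
    """What the notification script would do: read the real destination,
    record the step, then forward the browser to it."""
    return parse_qs(urlparse(rewritten_url).query)["target"][0]


page = '<a href="http://example.com/flights?from=EWR">Find a flight</a>'
rewritten = rewrite_links(page)
print(rewritten)
```

Each click then passes through the script, which sees the original destination and can log it as one step of the recording before redirecting.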
WebVCR – how to cope with changes
• changes do not pose a problem to a user browsing the Web, since the user can easily determine which link to follow, but they do present a challenge to a system that performs automatic navigation
- Attempt to locate a link in the last retrieved page at the DOM location stored in the current smart bookmark step. If the link exists, its target matches the bookmark, and either the URL or the text of the retrieved link matches the step, then use that link.
- Otherwise, if there is a unique link in the page whose target, URL, and text match those of the stored link, use that link.
- Otherwise, if there is a unique link in the page whose target and URL match those of the stored link, use that link.
- Otherwise, if there is a unique link in the page whose target and text match those of the stored link, use that link.
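The four rules above form a fallback cascade, which the following sketch reproduces. It is a simplified model rather than the WebVCR code. The Link and Step classes are hypothetical, the DOM location is reduced to an index into the list of links on the page, and target is assumed to mean the frame or window the link opens in.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    target: str  # assumed: frame or window the link opens in
    url: str
    text: str


@dataclass(frozen=True)
class Step:
    dom_index: int  # simplification of the stored DOM location
    link: Link      # the link as it looked when the step was recorded


def unique(matches: List[Link]) -> Optional[Link]:
    """Return the match only if there is exactly one."""
    return matches[0] if len(matches) == 1 else None


def find_link(page_links: List[Link], step: Step) -> Optional[Link]:
    stored = step.link

    # Rule 1: the link at the recorded DOM location, if its target matches
    # and either its URL or its text matches.
    if 0 <= step.dom_index < len(page_links):
        candidate = page_links[step.dom_index]
        if candidate.target == stored.target and (
            candidate.url == stored.url or candidate.text == stored.text
        ):
            return candidate

    # Rules 2 to 4: a unique link matching target plus, in order of
    # preference, URL and text, URL only, text only.
    same_target = [link for link in page_links if link.target == stored.target]
    rules = [
        lambda link: link.url == stored.url and link.text == stored.text,
        lambda link: link.url == stored.url,
        lambda link: link.text == stored.text,
    ]
    for rule in rules:
        found = unique([link for link in same_target if rule(link)])
        if found is not None:
            return found
    return None  # no rule applied; further heuristics are needed


recorded = Step(dom_index=0, link=Link("_self", "/flights", "Flights"))
# The page changed: a new link was inserted before the recorded one.
changed_page = [Link("_self", "/ad", "Ad"), Link("_self", "/flights", "Flights")]
print(find_link(changed_page, recorded))
```

In the usage example the recorded DOM location now holds an unrelated link, so rule 1 fails and the second rule finds the unique link whose target, URL and text all match.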
WebVCR – how to cope with changes
• Otherwise, if the link corresponds to a CGI-bin script (e.g., contains "?"), then find all links that match the stored URL up to the first occurrence of "?" and store them in a set of candidate links, which we denote L.
• Eliminate any elements of L whose parameter names do not match the stored version. For instance, if the stored URL is http://xyz.com/script?x=10&y=12 then http://xyz.com/script?x=20&y=32 matches, but http://xyz.com/script?x=10&z=12 does not, since it has a parameter named z that does not appear in the stored version.
• For each parameter in the stored version whose value matches the corresponding parameter value in at least one element of L, eliminate all elements of L with a non-matching value for the same parameter.
WebVCR – how to cope with changes
• If L is a singleton set, use that element.
• Otherwise, the playback can either be aborted,
or the link present at the recorded DOM
location can be used to try and proceed through
the playback (our implementation uses the
latter). However, the playback might fail later in
the sequence, or the sequence might traverse
pages different from what the user had
recorded.
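The handling of CGI script links (building the candidate set L, filtering it, then applying the singleton rule) can be sketched as follows. This is an illustration under stated assumptions, not the WebVCR implementation. The function names are hypothetical, and "parameter names match" is interpreted as the two URLs having exactly the same set of parameter names.

```python
from urllib.parse import parse_qsl


def split_cgi(url):
    """Split a URL at the first "?" into (base, {parameter: value})."""
    base, _, query = url.partition("?")
    return base, dict(parse_qsl(query, keep_blank_values=True))


def filter_candidates(stored_url, page_urls):
    stored_base, stored_params = split_cgi(stored_url)

    # L: links that match the stored URL up to the first "?".
    candidates = [
        url for url in page_urls
        if "?" in url and split_cgi(url)[0] == stored_base
    ]

    # Eliminate elements whose parameter names differ from the stored ones
    # (assumption: the two sets of names must be equal).
    candidates = [
        url for url in candidates
        if split_cgi(url)[1].keys() == stored_params.keys()
    ]

    # For each stored parameter whose value matches in at least one
    # candidate, eliminate the candidates with a different value.
    for name, value in stored_params.items():
        matching = [url for url in candidates if split_cgi(url)[1][name] == value]
        if matching:
            candidates = matching
    return candidates


def choose_link(stored_url, page_urls, dom_location_url=None):
    """Use the single surviving candidate; otherwise fall back to the link
    at the recorded DOM location (None here stands for aborting)."""
    candidates = filter_candidates(stored_url, page_urls)
    if len(candidates) == 1:
        return candidates[0]
    return dom_location_url


stored = "http://xyz.com/script?x=10&y=12"
page = ["http://xyz.com/script?x=20&y=32", "http://xyz.com/script?x=10&z=12"]
print(choose_link(stored, page))
```

With the URLs from the example above, the second link is dropped because of its parameter z, which leaves L as a singleton set, so the first link is used.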
WebVCR – problems
• HTTP authentication – some user actions cannot be recorded in the client: it is not possible to detect when HTTP authentication takes place, and the values entered by the user are not available through the DOM API
• State information – cookies: the login and password are entered only the first time; after that, access goes straight through via cookies
• Signed applets
• Automatic refresh – they assume that auto refresh takes place
• Microsoft IE limitations