Dispatching Java agents to
user for data extraction
from third party web sites
Alex Roque
F.I.U. HPDRC
Introduction
Since the WWW has grown exponentially, data
retrieval has become an intensive research
topic.
However, the mechanisms and tools that give users
more power over the data on the web have not
kept pace with the growth in data.
For example, no tools exist that allow the user
to extract data from its HTML context and use it in
an external application.
The Data Extractor system was created to allow
a more coherent and wider range of automatic
data extraction; it treats any Web site as a
data source.
Data Extractor has two kinds of
implementation: a standalone server
solution, and a set of functionality that can
be embedded in applications to provide
them with data from the Internet.
Data Extractor Inefficiencies
Performance in multi client conditions
Network performance issues
Legal issues
Installing an exclusive local server for each
client is a possible solution; however, it is
expensive. Our alternative is MDRA: Mobile Data
Retrieval Agents.
MDRA Composition and Delivery
The mobile agents server contains a
wrapper portal and a knowledgebase.
Functionality is as follows:
1) Users connect to the wrapper portal and
request a wrapper
2) In response, a package to extract the data is
constructed and sent to the client
3) Data extraction takes place on the client
Wrapper portal: Lists and packages
wrappers, authenticates users, and allows
them to change and save their queries
(references to wrappers).
Knowledgebase: Contains information
about available wrappers, their
parameters and status.
Wrappers can be thought of as lightweight
programs which use a predefined OO
library to “strip” desired information.
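To make the idea of a wrapper concrete, here is a minimal sketch in Java. The `Wrapper` interface and the `TableCellWrapper` class are hypothetical names chosen for illustration; they are not taken from the Data Extractor library. The regular expression stands in for whatever parsing the real library performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper contract: turn raw HTML into a list of extracted values.
interface Wrapper {
    List<String> extract(String html);
}

// Illustrative wrapper that "strips" the text of every table cell.
// A real extraction library would likely use a proper HTML parser.
class TableCellWrapper implements Wrapper {
    private static final Pattern CELL =
        Pattern.compile("<td[^>]*>(.*?)</td>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    @Override
    public List<String> extract(String html) {
        List<String> cells = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            // Drop any nested tags and surrounding whitespace.
            cells.add(m.group(1).replaceAll("<[^>]+>", "").trim());
        }
        return cells;
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        String html = "<table><tr><td>IBM</td><td><b>101.5</b></td></tr></table>";
        System.out.println(new TableCellWrapper().extract(html)); // prints [IBM, 101.5]
    }
}
```

Keeping each wrapper this small is what makes it practical to ship one per target site while the shared library stays unchanged.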
MDRA Architecture
Mobile wrapper controller: Responsible for
controlling behavior of wrappers and flow of data
Wrappers: The same as those used in Data
Extractor; the processes that strip data from a
web site.
Data Extraction Library: Contains functionality
essential for extraction and network operations.
Compact; can be cached if no update is
required.
Outer packaging: Interface for uniting numerous
wrappers and controllers.
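The relationship between the controller and the wrappers can be sketched as follows. This is an assumption about how the pieces might fit together, not the MDRA source: the class name `MobileWrapperController`, its methods, and the example URL are hypothetical. Wrappers are modelled as plain functions from HTML to extracted values, and the page fetcher is injected so the sketch runs without network access.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical controller: runs each wrapper against its target page
// and collects the results, keyed by URL.
class MobileWrapperController {
    private final Map<String, Function<String, List<String>>> wrappersByUrl =
        new LinkedHashMap<>();
    private final Function<String, String> fetcher;

    // The fetcher maps a URL to page content; a real agent would download it.
    MobileWrapperController(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    void register(String url, Function<String, List<String>> wrapper) {
        wrappersByUrl.put(url, wrapper);
    }

    // Executes every registered wrapper in registration order.
    Map<String, List<String>> runAll() {
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, Function<String, List<String>>> e : wrappersByUrl.entrySet()) {
            String html = fetcher.apply(e.getKey());
            results.put(e.getKey(), e.getValue().apply(html));
        }
        return results;
    }
}

class ControllerDemo {
    public static void main(String[] args) {
        // Stub fetcher standing in for an HTTP download.
        MobileWrapperController controller =
            new MobileWrapperController(url -> "price=42;volume=7");
        controller.register("http://example.com/quotes",
            html -> Arrays.asList(html.split(";")));
        System.out.println(controller.runAll());
    }
}
```

The outer packaging would then be whatever holds one or more such controllers and hands their results to the host application.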
How does execution take place?
1) Query formulation
2) Agent construction and delivery
3) Agent execution
4) Data delivery
Query Formulation
User connects to the wrapper portal, wrappers
are listed, and the user selects the desired
wrapper(s) and configures execution parameters.
This configuration can be saved for future
reference.
Agent construction and delivery
Wrapper portal begins packaging, including the
outer packaging module, wrapper
parameter information, wrapper controller,
wrappers and Data Extraction Library.
Components that change frequently are
packaged separately from the ones that do
not (aids caching).
Compression or digital signing may also be
applied.
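The split between stable and frequently changing components can be illustrated with two separately compressed archives. The sketch below uses the standard `java.util.zip` classes; the class name `AgentPackager`, the entry names, and the placeholder contents are hypothetical and only show the shape of the idea.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class AgentPackager {
    // Builds one compressed archive from a map of entry name to content.
    static byte[] pack(Map<String, byte[]> entries) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> e : entries.entrySet()) {
                zip.putNextEntry(new ZipEntry(e.getKey()));
                zip.write(e.getValue());
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buffer.toByteArray();
    }

    // Lists the entry names of an archive, e.g. to check what it contains.
    static List<String> entryNames(byte[] archive) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                names.add(e.getName());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return names;
    }
}

class PackagerDemo {
    public static void main(String[] args) {
        byte[] placeholder = "placeholder".getBytes(StandardCharsets.UTF_8);

        // Rarely changing parts: cacheable by the client.
        Map<String, byte[]> stable = new LinkedHashMap<>();
        stable.put("lib/ExtractionLibrary.class", placeholder);
        stable.put("controller/WrapperController.class", placeholder);

        // Frequently changing parts: rebuilt per request.
        Map<String, byte[]> perRequest = new LinkedHashMap<>();
        perRequest.put("wrappers/QuoteWrapper.class", placeholder);
        perRequest.put("params.properties", "symbol=IBM".getBytes(StandardCharsets.UTF_8));

        System.out.println(AgentPackager.entryNames(AgentPackager.pack(stable)));
        System.out.println(AgentPackager.entryNames(AgentPackager.pack(perRequest)));
    }
}
```

Because the stable archive is byte-for-byte identical across requests, a client cache can reuse it and only the small per-request archive needs to be downloaded again.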
Agent execution
Once delivered to the client, wrappers
interact with WWW sites, and extract the
desired data.
Data is passed to the outer packaging
controller, where it can be used in
applications or stored in various mediums.
Data Delivery
Data retrieved may be transferred to other
applications programmatically, stored in
various mediums (Excel, XML, Text), or
stored in databases.
May be used for statistical data collection.
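As a rough illustration of the delivery step, the sketch below writes extracted records either as tab-separated text or as simple XML. The `DataExporter` class and the `records`/`record`/`field` element names are hypothetical; the slides do not specify an output schema.

```java
import java.util.Arrays;
import java.util.List;

class DataExporter {
    // Tab-separated text, one record per line.
    static String toText(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> row : rows) {
            sb.append(String.join("\t", row)).append('\n');
        }
        return sb.toString();
    }

    // Minimal XML rendering with the reserved characters escaped.
    static String toXml(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder("<records>\n");
        for (List<String> row : rows) {
            sb.append("  <record>");
            for (String value : row) {
                sb.append("<field>").append(escape(value)).append("</field>");
            }
            sb.append("</record>\n");
        }
        return sb.append("</records>\n").toString();
    }

    private static String escape(String s) {
        // The ampersand must be replaced first so later entities stay intact.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

class ExportDemo {
    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("IBM", "101.5"),
            Arrays.asList("AT&T", "38.2"));
        System.out.print(DataExporter.toText(rows));
        System.out.print(DataExporter.toXml(rows));
    }
}
```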
Source Code Implementation
Because the system needs a high
degree of portability, the Java language was
used for the implementation.
The previous Data Extractor was written in
Java, so Java was used again in order to
reuse modules.
Speed performance issues were
addressed [7].
MDRA Framework
MDRA is delivered to clients as a
Java applet.
Applets provide portability, which allows
clients on different platforms to participate
in this data retrieval.
Since framework code and libraries do not
change often, browsers that cache Java
applets will keep the parts that do not change.
Security
Applets must be digitally signed in order
for them to access the system and network
resources needed for the retrieval.
Proxy servers may be set up on the host the
applet was downloaded from in order to
give applets the ability to download third-party
web sites. However, this option is prone to
heavy bottleneck congestion.
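The principle behind the signing requirement can be shown with the standard `java.security` API: the publisher signs the package bytes with a private key, and the client checks them against the matching public key before trusting the code. This is only a sketch of the underlying mechanism, not the applet signing procedure itself, which works on JAR files and certificates; the `PackageSigner` class and the choice of RSA with SHA-256 are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.Signature;

class PackageSigner {
    static KeyPair newKeyPair() {
        try {
            KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
            gen.initialize(2048);
            return gen.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Publisher side: produce a signature over the package bytes.
    static byte[] sign(byte[] data, PrivateKey key) {
        try {
            Signature s = Signature.getInstance("SHA256withRSA");
            s.initSign(key);
            s.update(data);
            return s.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // Client side: accept the package only if the signature matches.
    static boolean verify(byte[] data, byte[] signature, PublicKey key) {
        try {
            Signature s = Signature.getInstance("SHA256withRSA");
            s.initVerify(key);
            s.update(data);
            return s.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}

class SignerDemo {
    public static void main(String[] args) {
        KeyPair keys = PackageSigner.newKeyPair();
        byte[] pkg = "agent package bytes".getBytes(StandardCharsets.UTF_8);
        byte[] sig = PackageSigner.sign(pkg, keys.getPrivate());
        System.out.println(PackageSigner.verify(pkg, sig, keys.getPublic()));
    }
}
```

A package altered in transit fails verification, which is what lets the client grant a signed agent access to resources that untrusted code would be denied.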
Conclusion
MDRA “leases” data extraction services to
users; these services retrieve data that can be
exported to other applications.
This distributed approach takes the load
off the centralized server architecture.
Future research includes different MDRA
implementations (standalone, embedded
in client side), and tuning of agent
performance.