Extracting tabular data from the Web

Limitations of the current BP screen scraper




Parsing is done line by line.
Pattern matching is inaccurate and unpredictable.
The code for fetching and parsing HTML pages has to be rewritten for each website (e.g. MSAMB for Maharashtra, Krishi Marata Vahini for Karnataka).
Misplaced tags are not handled.
Characteristics of a Solution to this problem



Flexible.
Unicode Compliant.
Smarter pattern matching: explore the structure of the HTML page rather than a single line at a time.
Possible Solutions
Solution 1
Step 1: Fetch data from the desired site.
Step 2: Tidy the HTML page.
Step 3: Construct the HTML DOM (Document Object Model) tree.
Step 4: Extract node information using the Document object.
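
Steps 3 and 4 can be sketched with the DOM classes that ship with the JDK. This is only an illustration under stated assumptions: the page is taken to be already fetched and tidied into well-formed markup (Steps 1 and 2), and the class name, method name and sample row are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class TableExtractor {
    // Step 3: build a DOM tree from tidied (well-formed) markup.
    // Step 4: read the text of every <td>, one list of cells per <tr>.
    static List<List<String>> extractRows(String tidiedHtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            tidiedHtml.getBytes(StandardCharsets.UTF_8)));
            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                NodeList tds = ((Element) trs.item(i)).getElementsByTagName("td");
                List<String> cells = new ArrayList<>();
                for (int j = 0; j < tds.getLength(); j++) {
                    cells.add(tds.item(j).getTextContent().trim());
                }
                rows.add(cells);
            }
            return rows;
        } catch (Exception e) {
            // Malformed input or parser configuration problems end up here.
            throw new IllegalStateException("Could not parse page", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample page; a real page would come from Steps 1 and 2.
        String page = "<html><body><table>"
                + "<tr><td>APMC</td><td>Arrivals</td></tr>"
                + "<tr><td>SampleMarket</td><td>120</td></tr>"
                + "</table></body></html>";
        System.out.println(extractRows(page));
    }
}
```

Because the whole tree is available, a misplaced line break inside a cell no longer matters, which is the main difference from line-by-line matching.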

Solution 2




Similar to Solution 1.
Use XPath to locate data (Step 4).
The relative position of each node in the DOM tree is stored as an XPath.
These XPaths are stored in the properties file instead of the entire table structure.
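
A minimal sketch of this idea, using java.util.Properties and the javax.xml.xpath API from the JDK. The property key, the XPath expression and the sample page are hypothetical, and the markup is assumed to be already tidied and free of XML namespaces.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathLookup {
    // Evaluates the XPath stored under `key` against tidied markup
    // and returns the text of the first matching node.
    static String lookup(Properties xpaths, String key, String tidiedHtml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(xpaths.getProperty(key),
                    new InputSource(new StringReader(tidiedHtml))).trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath lookup failed for " + key, e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stands in for a .properties file on disk; the key name is made up.
        Properties xpaths = new Properties();
        xpaths.load(new StringReader(
                "sample.arrivals=/html/body/table/tr[2]/td[2]\n"));

        String page = "<html><body><table>"
                + "<tr><td>APMC</td><td>Arrivals</td></tr>"
                + "<tr><td>SampleMarket</td><td>120</td></tr>"
                + "</table></body></html>";
        System.out.println(lookup(xpaths, "sample.arrivals", page));
    }
}
```

When a site changes its layout, only the stored expression needs to change, not the Java code.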
Solution 3







Tested a software tool: screen-scraper (www.screenscraper.com).
Proxy server that allows the contents of HTTP and HTTPS requests to be viewed.
Engine that can be configured to extract information from Web sites using special patterns and regular expressions.
Embedded scripting engine that allows extracted data to be manipulated, written out to a file, or inserted into a database.
Can be used with PHP, Java, or any COM-friendly language such as Visual Basic or Active Server Pages.
Costs $90.
No Unicode support.
Other Possible Solutions
XMLize the HTML content.
• XML is more structured and well-formed.
• XML allows data interchange between incompatible systems.
• XSL and XSLT can be used to convert from one form to another.
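
As an illustration of the XSLT route, the sketch below uses the javax.xml.transform API from the JDK to turn a table in well-formed markup into a simpler XML form. The stylesheet, the element names rows/row/cell and the sample input are all invented for the example, and the input is assumed to carry no XML namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TableToXml {
    // Hypothetical stylesheet: every <tr> becomes a <row>, every <td> a <cell>.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<rows><xsl:apply-templates select='//tr'/></rows>"
        + "</xsl:template>"
        + "<xsl:template match='tr'>"
        + "<row><xsl:apply-templates select='td'/></row>"
        + "</xsl:template>"
        + "<xsl:template match='td'>"
        + "<cell><xsl:value-of select='normalize-space(.)'/></cell>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to well-formed markup and returns the result as text.
    static String transform(String xhtml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xhtml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<html><body><table>"
                + "<tr><td>APMC</td><td>Arrivals</td></tr>"
                + "</table></body></html>"));
    }
}
```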
Implementation
HTML scraper
The HTML scraper has 3 main steps:
1. Downloading the web page using a tool such as 'wget'.
2. Parsing the page and constructing the DOM tree.
3. Querying the DOM tree to retrieve the desired information and inserting it into the database.

Implementation
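Download the web page using
wget --post-data="data" www.agmarknet.nic.in
The page can be stored locally.
Construct the DOM tree using the JTidy API:
Tidy tidy = new Tidy();  // org.w3c.tidy.Tidy
Parse the page into a DOM tree:
Document doc = tidy.parseDOM(htmlfile, null);  // htmlfile: input stream over the saved page; null: do not write out the tidied markup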




Query the DOM tree:
Depth-first search (DFS) through the DOM tree
Or
Using the XPath APIs.
Store the HTML page structure in a file and use DFS
Or
Store XPaths and use them for querying.
Insert into the database using JDBC.
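
The DFS option can be sketched as a short recursive walk over org.w3c.dom nodes. This is an illustrative version, not the parser described in these slides: it parses already tidied markup with the JDK's DocumentBuilder rather than JTidy, and the class and method names are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomDfs {
    // Depth-first search: handle the current node, then each child left to right.
    static void collectCells(Node node, List<String> cells) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "td".equalsIgnoreCase(node.getNodeName())) {
            cells.add(node.getTextContent().trim());
            return; // a cell is a leaf for this sketch; nested tables are not visited
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectCells(child, cells);
        }
    }

    // Parses tidied markup and returns the cell texts in document order.
    static List<String> cellsOf(String tidiedHtml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tidiedHtml)));
            List<String> cells = new ArrayList<>();
            collectCells(root, cells);
            return cells;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cellsOf("<html><body><table>"
                + "<tr><td>APMC</td><td>Arrivals</td></tr>"
                + "</table></body></html>"));
    }
}
```

The same walk should work on the Document returned by tidy.parseDOM, since it is also an org.w3c.dom tree.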
DOM tree of the parsed HTML page
[Figure: DOM tree showing the nodes html, head, table and four tr rows; table columns: APMC, Arrivals, Variety, Low Rate, Mid Rate, High Rate]
Statistics
The new parser takes less than 15 seconds per page; the old one takes more than 30 seconds.
Daily data fetching time = (200 * 15) seconds = 3000 seconds (50 minutes).
Parsers (using DFS) for NIC and MSAMB (both English and Marathi) are ready.