Automatically Extracting Data Records from Web Pages
Presenter: Dheerendranath Mundluru
http://www.ucs.louisiana.edu/~dnm8925
Dheerendranath Mundluru
Dr. Vijay Raghavan
Dr. Zonghuan Wu
Jayasimha R. Katukuri
Saygin Celebi
Laboratory for Internet Computing
Center for Advanced Computer Studies
University of Louisiana at Lafayette, Lafayette, LA
Agenda
• Introduction
• Proposed Solution: Path-based Information Extractor
• Experiments
• Conclusions and Future Work
Introduction
World Wide Web: Largest known repository of documents containing
diverse content used by people from diverse backgrounds.
A few characteristics of the Web include:
• Huge size
• Easily accessible
• Hyperlinked
• Dynamic
• Diverse coverage - science, politics, education, etc.
• Increasing at a tremendous rate
• Noisy - advertisements, mirror sites, etc.
Web Mining: Leverage the Value of Web
• Web mining aims to discover useful knowledge from the Web
• Characteristics of the Web such as heterogeneity, increasing size,
  noise, etc. make Web mining a challenging task
• Web mining can be classified into [Kosala 00, Liu 04]:
  • Web content mining: Extracting and discovering useful information or
    knowledge from Web page contents
  • Web structure mining: Discovering useful knowledge from the structure
    of hyperlinks, e.g., used by Google
  • Web usage mining: Discovering useful knowledge from user access log
    files, e.g., used by Amazon.com
• Web mining is a multidisciplinary field:
  • Data mining, information retrieval, databases, machine learning,
    information extraction, natural language processing, etc.
Web Mining & Web Content Mining Classification
Structured Data Extraction
• Structured data extraction deals with extracting information that is
  displayed in a regular structure, because such information is perceived
  to represent the essential content of a Web page, e.g., the list of
  products in an e-commerce Web page. [Liu 04]
• A few example applications:
  • Online comparative shopping engines (e.g., nextag.com)
  • Metasearch engines (e.g., dogpile.com)
  • Modern Business Intelligence systems (e.g., intelliseek.com)
Sample response page from Google
Sample response page from drugstore.com
Path-based Information Extractor (PIE)
• PIE is an automatic data extraction system whose goal is to
  automatically extract data records present in Web search response
  pages. [Mundluru 05a, Mundluru 05b]
• PIE also eliminates any “noisy” content such as advertisements,
  navigation links, etc.
A Few Observations
Observation 1: Data records displayed in a particular region of a Web
page are contiguous and are formatted using similar HTML tags.
[Liu 03]
Observation 2: A group of similar data records belonging to a particular
region is always present under the same parent node in the tag
tree. [Liu 03]
Observation 3: In most search response pages, every record has at least
one hyperlink. Usually, the title of the retrieved document is
displayed as a hyperlink that points to the retrieved document.
In this work, we refer to such a hyperlink as a record link.
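The three observations can be combined into a simple heuristic. The sketch below is only an illustration of that idea and is not the PIE algorithm: it builds a minimal tag tree with Python's standard `html.parser`, then reports runs of contiguous sibling nodes that share the same tag signature and each contain at least one hyperlink. The names `Node`, `TreeBuilder`, `signature`, and `candidate_record_groups`, as well as the sample page, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.has_link = False  # True if this node is, or contains, <a href=...>


class TreeBuilder(HTMLParser):
    """Builds a minimal tag tree; text content is ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            # Observation 3: flag the link node and all of its ancestors.
            n = node
            while n is not None:
                n.has_link = True
                n = n.parent
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        n = self.current
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.current = n.parent


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node):
    """A node's own tag plus the tags of its direct children."""
    return (node.tag, tuple(child.tag for child in node.children))


def candidate_record_groups(root, min_records=2):
    """Find contiguous siblings (Observations 1 and 2) that look alike
    and each contain a hyperlink (Observation 3)."""
    groups = []
    stack = [root]
    while stack:
        parent = stack.pop()
        run = []
        for child in parent.children + [None]:  # None flushes the last run
            if (child is not None and child.has_link
                    and (not run or signature(child) == signature(run[0]))):
                run.append(child)
                continue
            if len(run) >= min_records:
                groups.append(run)
            run = [child] if child is not None and child.has_link else []
        stack.extend(parent.children)
    return groups


# Hypothetical response page: one ad block followed by three result records.
page = """
<div id="ads"><p>Sponsored</p></div>
<ul>
  <li><a href="/doc1">Title 1</a><span>snippet</span></li>
  <li><a href="/doc2">Title 2</a><span>snippet</span></li>
  <li><a href="/doc3">Title 3</a><span>snippet</span></li>
</ul>
"""
groups = candidate_record_groups(parse(page))
print([[node.tag for node in group] for group in groups])  # [['li', 'li', 'li']]
```

The ad block is skipped because it contains no hyperlink. A real page would need a more tolerant similarity test than exact signature equality, since records often differ slightly in their markup.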
Record Extraction Algorithm
Experiments
Experiment Setup:
• Evaluated the proposed system by comparing it with two state-of-the-art
  record extraction systems: MDR [Liu 03] and ViNTs [Zhao 05]
• All three systems were tested on a total of 60 Web pages (containing 873
  data records) taken from 60 Web sources
• The 60 Web sources include:
  • general-purpose search engines, e.g., Google, Yahoo
  • e-commerce sites, e.g., drugstore.com, clevershoppers.com
  • other special-purpose search engines, e.g., mit.edu, breastcancer.org
• PIE was developed in Java
Experiments
Evaluation Measures Used:
• Recall = (Total number of target data records correctly extracted)
           / (Total number of target data records)
• Precision = (Total number of target data records correctly extracted)
              / (Total number of data records extracted)
Results:
            PIE      MDR      ViNTs
Recall      90.4%    69.9%    83.8%
Precision   95.5%    81.4%    93%
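The recall and precision measures defined above can be sketched as a short function. This is a minimal illustration, not the evaluation code used in the experiments: the function name `evaluate`, the record identifiers, and the counts are all hypothetical, and a record counts as correctly extracted simply when its identifier appears in both sets.

```python
def evaluate(target_records, extracted_records):
    """Return (recall, precision) for one extraction run.

    target_records:    identifiers of the records that should be extracted
    extracted_records: identifiers of the records the system actually extracted
    """
    targets = set(target_records)
    extracted = set(extracted_records)
    correct = targets & extracted  # target records correctly extracted

    recall = len(correct) / len(targets) if targets else 0.0
    precision = len(correct) / len(extracted) if extracted else 0.0
    return recall, precision


# Hypothetical page: five target records; the system extracts three of
# them plus one advertisement that is not a target record.
targets = ["r1", "r2", "r3", "r4", "r5"]
extracted = ["r1", "r2", "r3", "ad1"]

recall, precision = evaluate(targets, extracted)
print(f"Recall = {recall:.1%}, Precision = {precision:.1%}")
# Recall = 60.0%, Precision = 75.0%
```

Missed records lower recall, while extracted noise such as advertisements lowers precision, which is why both measures are reported.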
Conclusions & Future Work
Conclusions:
• Automatic data extraction is important for systems such as online
  comparative shopping engines, metasearch engines, business
  intelligence solutions, etc.
• PIE, a system for automatically extracting data records from Web
  pages, has been proposed.
• Experiments showed that PIE (90.4% recall, 95.5% precision)
  outperformed MDR and ViNTs, two state-of-the-art record extraction
  systems that are in use at two software companies.
Future Work:
• Improving the effectiveness of record extraction
• Extracting the attributes in each data record, e.g., product name, price, etc.
• Performing large-scale experiments
• Building applications such as online comparative shopping engines,
  metasearch engines, etc.
References
[Mundluru 05a] D. Mundluru, J. Katukuri, and S. Celebi. Automatically Mining Result
Records from Search Engine Response Pages. Proceedings of the 5th IEEE International
Conference on Data Mining (ICDM), 749-753, Houston, November 2005.
[Mundluru 05b] D. Mundluru, Z. Wu, V. Raghavan, J. Katukuri, and S. Celebi. Automatically
Mining Search Result Records. Technical Report CACS-TR-2005-3-1, Center for
Advanced Computer Studies, University of Louisiana at Lafayette, 2005.
[Kosala 00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. SIGKDD
Explorations, 2(1), 1-15, 2000.
[Liu 04] B. Liu and K. Chang. Editorial: Special Issue on Web Content Mining. SIGKDD
Explorations, 6(2), 1-4, December 2004.
[Liu 03] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. Proceedings
of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining,
601-606, Washington, D.C., August 2003.
[Zhao 05] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper
Generation for Search Engines. Proceedings of the 14th International World Wide Web
Conference, 66-75, Chiba, Japan, May 2005.