A Brief Survey of Web Data Extraction Tools (WDET)
Laender et al.
Introduction
• Web data is hard to query
• A lot of unstructured data
• Wrappers can help extract data
• A wrapper maps the data in a page to a structured repository
• There are several ways to generate wrappers
• This paper surveys tools for generating wrappers
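To make the idea of a wrapper concrete, here is a minimal hand-written sketch. It is not taken from the paper: the page fragment, the field names (title, price), and the function name wrap are all hypothetical, chosen only to show a page being mapped to records that could be loaded into a repository.

```python
import re

# Hypothetical page fragment; the markup and field names are illustrative only.
PAGE = """
<ul>
  <li><b>Data on the Web</b> <i>59.95</i></li>
  <li><b>Web Mining</b> <i>42.00</i></li>
</ul>
"""

# One extraction rule: a title inside <b>...</b> followed by a price inside <i>...</i>.
RULE = re.compile(r"<b>(?P<title>.*?)</b>\s*<i>(?P<price>.*?)</i>")


def wrap(page):
    """Map a page to a list of records (dicts) ready to load into a repository."""
    return [
        {"title": m.group("title"), "price": float(m.group("price"))}
        for m in RULE.finditer(page)
    ]


records = wrap(PAGE)
```

A rule written by hand like this breaks as soon as the page layout changes, which is one reason the tools surveyed here try to generate wrappers rather than have users code them.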
Taxonomy of WDET
• Languages for Wrapper Development
• HTML-aware Tools
• NLP-based Tools
• Wrapper Induction Tools
• Modeling-based Tools
• Ontology-based Tools
Overview of WDET
• Languages for Wrapper Development
Special-purpose languages for writing wrappers (Minerva, TSIMMIS)
• HTML-aware Tools
W4F, XWRAP, RoadRunner
• NLP-based Tools
Apply NLP techniques to free text (RAPIER, SRV, WHISK)
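As a rough illustration of what "HTML-aware" means, the sketch below uses the tag structure of a page rather than raw string matching to find data. It relies only on Python's standard html.parser module and does not reproduce how W4F, XWRAP, or RoadRunner work; the class name CellCollector and the sample table are assumptions made for this example, and the input is assumed to be a well-formed table.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


# Hypothetical sample page (assumed well-formed: every <td> sits inside a <tr>).
PAGE = (
    "<table>"
    "<tr><td>Congo</td><td>242</td></tr>"
    "<tr><td>Spain</td><td>34</td></tr>"
    "</table>"
)

parser = CellCollector()
parser.feed(PAGE)
rows = parser.rows
```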
Overview of WDET (continued)
• Wrapper Induction Tools
Generate wrappers from labeled training examples (WIEN, SoftMealy, STALKER)
• Modeling-based Tools
Based on hierarchies of objects (NoDoSE, DEByE)
• Ontology-based Tools
Use conceptual models or ontologies (BYU tool)
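The idea behind wrapper induction, learning extraction rules from examples instead of writing them by hand, can be sketched as follows. This is a deliberately simplified, delimiter-based toy for a single attribute with one value per page; it is not the actual algorithm of WIEN, SoftMealy, or STALKER. The function names induce and extract and the training pages are hypothetical.

```python
import os.path


def common_prefix(strings):
    """Longest string that every input starts with."""
    return os.path.commonprefix(strings)


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def induce(examples):
    """Learn a (left, right) delimiter pair from labeled (page, value) pairs.

    Assumes each labeled value occurs in its page; the first occurrence is used.
    """
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)
        befores.append(page[:start])
        afters.append(page[start + len(value):])
    return common_suffix(befores), common_prefix(afters)


def extract(page, left, right):
    """Apply the learned delimiters to an unseen page."""
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical training pages, each labeled with the value to extract.
TRAINING = [
    ("<html><b>Congo</b> <i>242</i></html>", "Congo"),
    ("<html><b>Spain</b> <i>34</i></html>", "Spain"),
]

left, right = induce(TRAINING)
result = extract("<html><b>Egypt</b> <i>20</i></html>", left, right)
```

With these two examples the learned delimiters are the text shared immediately before and after the labeled values, so the rule generalizes to a third page that has the same layout. Real induction tools also handle multiple records per page, several attributes, and pages whose layout varies.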
Qualitative Analysis
• Degree of Automation
• Support for Complex Objects
• Page Contents: Semistructured data or text
• Ease of Use
• XML Output
• Support for Non-HTML Sources
• Resilience and Adaptiveness
Conclusions
Questions