On living in the twilight zone between structure and


Let us build a platform for structure extraction and matching that ...
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Knows when it failed


Attaches error-detection logic to every extraction module
Two types of errors
    Precision errors: easier to detect
        Reference databases
        Alternative models
        Human feedback
    Recall errors: much harder
        A research challenge
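
As a rough illustration of attaching error-detection logic to an extraction module, the sketch below wraps a toy extractor with two of the checks named on this slide: a lookup in a reference database and agreement with an alternative model. Every name here (`checked_extract`, `CheckedExtraction`, the two toy city extractors, `REFERENCE_CITIES`) is a hypothetical stand-in, not part of any system described in the talk. The check can only flag likely precision errors; a missed extraction (recall error) passes through unnoticed.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

Extractor = Callable[[str], Optional[str]]


@dataclass
class CheckedExtraction:
    """An extracted value together with any warnings raised by the checks."""
    value: Optional[str]
    warnings: List[str] = field(default_factory=list)

    @property
    def suspect(self) -> bool:
        return bool(self.warnings)


def checked_extract(text: str, primary: Extractor, alternative: Extractor,
                    reference: Set[str]) -> CheckedExtraction:
    """Run the primary extractor, then flag likely precision errors."""
    value = primary(text)
    result = CheckedExtraction(value)
    if value is None:
        return result  # nothing extracted: a recall error here goes undetected
    if value.lower() not in reference:
        result.warnings.append("not in reference database")
    if alternative(text) != value:
        result.warnings.append("alternative model disagrees")
    return result


# Two deliberately naive extractors (assumptions for illustration only).
def city_after_in(text: str) -> Optional[str]:
    match = re.search(r"\bin ([A-Z][a-z]+)", text)
    return match.group(1) if match else None


def last_capitalized(text: str) -> Optional[str]:
    words = re.findall(r"\b[A-Z][a-z]+\b", text)
    return words[-1] if words else None


REFERENCE_CITIES = {"mumbai", "pune", "delhi"}

ok = checked_extract("She works in Mumbai",
                     city_after_in, last_capitalized, REFERENCE_CITIES)
bad = checked_extract("Filed in March near Pune",
                      city_after_in, last_capitalized, REFERENCE_CITIES)
```

In this toy run, `ok` carries no warnings, while `bad` extracts "March" and is flagged by both checks, so it could be routed to human feedback.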
Represents errors and exposes them to users

Imprecise data models for results of extraction and deduplication: another research challenge
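
One possible reading of an "imprecise data model" is a field that keeps several candidate values with weights instead of a single hard answer. The sketch below is an assumption for illustration, not the model proposed in the talk; the class name `UncertainField`, its methods, and the review threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class UncertainField:
    """One extracted attribute with alternative values and their weights."""
    alternatives: Dict[str, float]

    def __post_init__(self) -> None:
        total = sum(self.alternatives.values())
        if total <= 0:
            raise ValueError("at least one alternative needs a positive weight")
        # Normalize the weights so they can be read as probabilities.
        self.alternatives = {v: w / total for v, w in self.alternatives.items()}

    def best(self) -> Tuple[str, float]:
        """Most probable value and its probability."""
        value = max(self.alternatives, key=self.alternatives.get)
        return value, self.alternatives[value]

    def prob(self, value: str) -> float:
        """Probability that the field equals `value` (0.0 if never proposed)."""
        return self.alternatives.get(value, 0.0)

    def needs_review(self, threshold: float = 0.8) -> bool:
        """Expose the uncertainty: low-confidence fields can be shown to a user."""
        return self.best()[1] < threshold


city = UncertainField({"Mumbai": 6, "Pune": 3, "Delhi": 1})
top_value, top_prob = city.best()
```

A query layer could then answer with `top_value` while still surfacing the remaining alternatives whenever `needs_review()` is true.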
Seamlessly integrates rules, humans and statistics

Existing systems partitioned on
    Rule-based vs. statistical
    Manual vs. learning-based
Smooth co-existence of all combinations is a must, given the varying difficulty of tasks and sophistication of users
Treats models as first class objects

Tens of thousands of schema elements
    Cannot afford a separate extraction and matching model for each
How to share models across different
    levels of hierarchies,
    natural languages,
    formatting languages,
    versions along time.
How quickly can we interactively adapt to new domains starting from existing libraries of models?
Is selectively lazy



Cannot run away from the hard tasks
Only way to attack the long tail of missed extractions is via expensive resources
Explicitly represent increasing levels of cost and payoffs and do cost-sensitive processing

Selective linguistic processing:
    POS → Chunking → Dependency parsing → Full parsing
Database lookups:
    No lookups → Boolean matches → TF-IDF matches → Edit distance → Web searches
Supports multi-spectrum queries
Knowledge [Schema] should be like a pocket watch, surfaced only when needed; not like a wrist watch, always flaunted.
- A Bengali saying.



Fully schema-aware: SQL, XML,…
Schema-less: Keyword queries
Common-sense schema-aware
    User understands Is-a, Part-of, Properties
    Use world knowledge (ontologies, word-nets, etc.) to map both schema and content elements in the query
    Can use limited rounds of user interaction
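
To make the "common-sense schema-aware" idea concrete, the sketch below splits a keyword query into schema terms and content terms, using small hand-written synonym tables as a stand-in for real world knowledge such as an ontology or word-net. The schema, both synonym tables, and the function `interpret` are assumptions invented for this illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: table name -> column names.
SCHEMA: Dict[str, List[str]] = {
    "author": ["name", "affiliation"],
    "paper": ["title", "year"],
}

# Toy stand-ins for world knowledge (assumptions, not a real ontology).
SCHEMA_SYNONYMS = {"writer": "author", "article": "paper",
                   "institute": "affiliation"}
CONTENT_SYNONYMS = {"bombay": "mumbai"}


def interpret(query: str) -> Tuple[List[str], List[str]]:
    """Split a keyword query into schema elements and content keywords."""
    elements = {t for table, cols in SCHEMA.items() for t in [table, *cols]}
    schema_terms: List[str] = []
    content_terms: List[str] = []
    for token in query.lower().split():
        canonical = SCHEMA_SYNONYMS.get(token, token)
        if canonical in elements:
            schema_terms.append(canonical)  # token names a table or column
        else:
            # Otherwise treat it as content, normalized via world knowledge.
            content_terms.append(CONTENT_SYNONYMS.get(token, token))
    return schema_terms, content_terms


schema_terms, content_terms = interpret("writer institute Bombay")
```

An ambiguous token (one that could be either schema or content) is where a limited round of user interaction would fit; this sketch simply prefers the schema reading.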