On living in the twilight zone between structure and
Let us build a platform for structure
extraction and matching that ....
Sunita Sarawagi
IIT Bombay
http://www.cse.iitb.ac.in/~sunita
Knows when it failed
Attaches error-detection logic to every extraction
module
Two types of errors
Precision errors: easier to detect
Recall errors: much harder
Reference databases
Alternative models
Human feedback
A research challenge
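The two error types above can be illustrated with a minimal sketch. It is not the platform's actual logic: the `Extraction` record, the function names, and the toy city data are all hypothetical. It assumes precision errors are flagged by checking values against a reference database, and recall errors are hinted at by values that only an alternative model finds.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    field: str   # schema element the value was extracted for
    value: str   # extracted text


def precision_suspects(extractions, reference):
    """Flag extracted values absent from the reference database."""
    return [e for e in extractions
            if e.value.lower() not in reference.get(e.field, set())]


def recall_suspects(primary, alternative):
    """Flag values found only by the alternative model (possible misses)."""
    seen = {(e.field, e.value.lower()) for e in primary}
    return [e for e in alternative
            if (e.field, e.value.lower()) not in seen]


# Toy data, invented for illustration only.
reference = {"city": {"mumbai", "pune"}}
primary = [Extraction("city", "Mumbai"), Extraction("city", "Powai Road")]
alternative = [Extraction("city", "Mumbai"), Extraction("city", "Pune")]

suspect_precision = precision_suspects(primary, reference)
suspect_recall = recall_suspects(primary, alternative)
```

Flagged items are only candidates for review, for example through human feedback; a value missing from the reference database may still be correct.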
Represents errors and exposes them to users
Imprecise data models for the results of extraction and
deduplication: another research challenge
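One simple way to picture an imprecise data model is an attribute that stores several mutually exclusive candidate values, each with a probability. The sketch below is an assumption for illustration only: the `Alternative` class, the helper names, and the probabilities are invented and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    value: str   # one candidate extraction
    prob: float  # probability that this candidate is correct


def prob_equals(alternatives, target):
    """Probability that the uncertain attribute equals `target`."""
    return sum(a.prob for a in alternatives if a.value == target)


def confident_value(alternatives, threshold):
    """Return the best candidate, or None if it is not confident enough."""
    best = max(alternatives, key=lambda a: a.prob)
    return best.value if best.prob >= threshold else None


# Hypothetical candidates for one "author" field.
author = [
    Alternative("S. Sarawagi", 0.7),
    Alternative("Sarawagi Sunita", 0.2),
    Alternative("IIT Bombay", 0.1),
]
```

With these numbers, `confident_value(author, 0.9)` returns `None`, so the uncertainty is exposed to the user rather than hidden behind a single guess.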
Seamlessly integrates rules,
humans and statistics
Existing systems partitioned on
Rule-based vs. Statistical
Manual vs. Learning-based
Smooth co-existence of all combinations a must
given varying difficulty of tasks and
sophistication of users
Treats models as first class objects
Tens of thousands of schema elements
Cannot afford a separate extraction and matching
model for each
How to share models across different
levels of hierarchies,
natural languages,
formatting languages,
versions along time?
How quickly can we interactively adapt to new
domains, starting from existing libraries of models?
Is selectively lazy
Cannot run away from the hard tasks
Only way to attack the long tail of missed
extractions is via expensive resources
Explicitly represent increasing levels of cost and
payoffs and do cost-sensitive processing
Selective linguistic processing:
POS → Chunking → Dependency parsing → Full parsing
Database lookups
No lookups → Boolean matches → TF-IDF matches → Edit
distance → Web searches
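The idea of escalating to costlier lookups only when cheaper ones fail can be sketched as a small cascade under a cost budget. Everything here is an assumption for illustration: the stage costs, the `lookup` function, and the city list are invented, and `difflib.SequenceMatcher` similarity stands in for a true edit-distance match.

```python
from difflib import SequenceMatcher


def exact_match(mention, dictionary):
    """Cheap stage: Boolean membership test."""
    return mention if mention in dictionary else None


def fuzzy_match(mention, dictionary, min_ratio=0.8):
    """Costlier stage: closest entry by string similarity, if close enough."""
    def score(entry):
        return SequenceMatcher(None, mention, entry).ratio()
    best = max(dictionary, key=score)
    return best if score(best) >= min_ratio else None


# (cost, stage) pairs in increasing order of cost; the costs are made up.
STAGES = [(1, exact_match), (10, fuzzy_match)]


def lookup(mention, dictionary, budget):
    """Run stages in order until one hits or the budget is exhausted."""
    spent = 0
    for cost, stage in STAGES:
        if spent + cost > budget:
            break
        spent += cost
        hit = stage(mention, dictionary)
        if hit is not None:
            return hit, spent
    return None, spent


cities = ["mumbai", "pune", "chennai"]
```

An exact hit such as `lookup("pune", cities, 20)` stops after the cheap stage, while a misspelling such as `"mumbay"` reaches the fuzzy stage only if the budget allows it.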
Supports multi-spectrum queries
Knowledge [Schema] should be like a pocket watch, surfaced only
when needed; not like a wrist watch, always flaunted.
- A Bengali saying.
Fully schema-aware: SQL, XML,…
Schema-less: Keyword queries
Common-sense schema-aware
User understands Is-a, Part-of, Properties
Use world knowledge (ontologies, wordnets, etc.) to
map both schema and content elements in the query
Can use limited rounds of user interaction