The Internet Knowledge Base
Template-based Authoring
Knowledge Systems
Laboratory
Stanford
Project Goals
Assist analyst in everyday work
Knowledge Authoring Tools to assist in:
Research for reports
Produce reports
Consume reports
Share reports
Our solution: Semantic Web Templates
Semantic Web Templates
Knowledge Representation, Semantics
are key for information exchange
Creation, maintenance of knowledge
must be transparent
Automate extraction of knowledge
Enhance knowledge retrieval methods
Semantic Web Templates
Similar to MS Word Templates
Different templates for different tasks
Word templates can have restrictions on text
Very primitive, such as length of text
Simplistic patterns such as “phone number”
No concepts such as “color” or “country”
One template, many documents
HTML templates are very common today
Many web sites use a SQL database as back end:
template + SQL → HTML
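The "template + SQL produces HTML" idea can be sketched roughly as follows. This is a minimal illustration only: the table name `country`, the template text, and the placeholder rows are all assumptions made for the example and do not come from the slides.

```python
import sqlite3
from html import escape
from string import Template

# Hypothetical page template: one template, many documents.
PAGE = Template("<h1>$name</h1>\n<p>Population: $population</p>")


def render_pages(conn):
    """Fill the template once per database row."""
    rows = conn.execute("SELECT name, population FROM country ORDER BY name")
    return [PAGE.substitute(name=escape(name), population=population)
            for name, population in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (name TEXT, population INTEGER)")
# Placeholder rows for illustration only, not real figures.
conn.executemany("INSERT INTO country VALUES (?, ?)",
                 [("Examplia", 1000), ("Samplestan", 2500)])
pages = render_pages(conn)
```

Each row yields one HTML document from the same template, which is the pattern the later slides extend with semantic tags.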
Semantic Web Templates
An HTML file with additional tags
Tags specify:
Where particular knowledge is stated
What kind of knowledge it is
Where it came from, if applicable
References to an entity or relation
Repetitive regions of text
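One way to picture "an HTML file with additional tags" is the sketch below. The slides do not give the actual tag vocabulary, so the `<k>` tag and its `type` and `source` attributes are invented here purely for illustration; only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser

# Hypothetical markup: <k>, "type" and "source" are assumptions for this sketch.
DOC = ('<p>The capital of <k type="Country" source="example-registry">'
       'Examplia</k> is <k type="City">Sampleville</k>.</p>')


class KnowledgeTagParser(HTMLParser):
    """Collects the text, kind, and origin of each annotated region."""

    def __init__(self):
        super().__init__()
        self.statements = []
        self._open = None  # annotation currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "k":
            self._open = dict(attrs)
            self._open["text"] = ""

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "k" and self._open is not None:
            self.statements.append(self._open)
            self._open = None


parser = KnowledgeTagParser()
parser.feed(DOC)
parser.close()
```

After parsing, `parser.statements` holds one record per tagged region: where the knowledge is stated (`text`), what kind it is (`type`), and where it came from when a `source` is given.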
Goal: Assist Research
Unstructured Extraction
Sort through buckets of data to find gold
Entity recognition
Relation recognition
Semistructured Extraction
Utilize repetitive patterns within a page
Use similar pages to extract more data
Robust despite changing pages, data
Unstructured Extraction
Natural language processing
News feeds
Indexing, storage, retrieval
Plugin architecture
Rover news crawler
Web Services
Our system, collaboration with IBM via NIMD
Political news articles from Yahoo!
22,000 articles, ~8500 concepts, ~1000 relations
Used in authoring tools
Unstructured Extraction
Pattern based system
Leverage “hints” for the reader in news articles
British Prime Minister Tony Blair
<type Country><subClassOf Politician> <unknown name>
“Tony Blair” is a Prime Minister who represents the
Country “England”.
System runs daily on Yahoo political news
Highlights known terms in green
Highlights new terms in red
Used to create search index, maintain KB
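The pattern from the Tony Blair example might be sketched as below. This is not the system's actual implementation: the lookup tables, the regular expression, and the relation names `isA` and `represents` are assumptions chosen to mirror the slide's example.

```python
import re

# Tiny hypothetical knowledge base; the British -> England mapping
# follows the slide's own example.
DEMONYM_TO_COUNTRY = {"British": "England"}
POLITICIAN_TITLES = {"Prime Minister"}


def extract(text):
    """Match <country hint> <politician title> <unknown capitalized name>."""
    demonyms = "|".join(re.escape(d) for d in sorted(DEMONYM_TO_COUNTRY))
    titles = "|".join(re.escape(t) for t in sorted(POLITICIAN_TITLES))
    pattern = re.compile(
        rf"\b({demonyms}) ({titles}) ([A-Z][a-z]+(?: [A-Z][a-z]+)*)")
    facts = []
    for demonym, title, name in pattern.findall(text):
        facts.append((name, "isA", title))
        facts.append((name, "represents", DEMONYM_TO_COUNTRY[demonym]))
    return facts


facts = extract("British Prime Minister Tony Blair said on Monday ...")
```

The known terms (the country hint and the title) anchor the match, and the unknown capitalized name that follows is the new term proposed for the KB.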
Demo
Semi-structured Extraction
Extract, produce knowledge
Initial model is Domain Authorities
Enhance KB with ground facts
Strong for relations and breadth of data
Leverages work of others
Makes use of SQL databases
Future work is wide-scale web of trust
Semi-structured Extraction
Site Registry
By description and property
CIA World Fact Book has data about items
which are of type <Country>
CIA World Fact Book has properties
<population>, <hasNeighbor>,
<hasMembership>, etc.
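A site registry "by description and property" could be modeled along these lines. The class and function names are hypothetical; only the World Fact Book entry, its `<Country>` type, and the three property names come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteEntry:
    name: str              # the domain authority
    item_type: str         # type of the items the site describes
    properties: frozenset  # properties the site states about each item


REGISTRY = [
    SiteEntry("CIA World Fact Book", "Country",
              frozenset({"population", "hasNeighbor", "hasMembership"})),
]


def sites_for(item_type, prop):
    """Return the registered sites that state `prop` for items of `item_type`."""
    return [site.name for site in REGISTRY
            if site.item_type == item_type and prop in site.properties]
```

For example, `sites_for("Country", "population")` returns the World Fact Book entry, while a property no registered site covers returns an empty list.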
Demo
Semi-structured Extraction
Publishing
Human editing good for high-level
concepts
Automated techniques good for relations,
ground level facts, and massive repetition
Rover web crawler
Template construction is currently manual
With critical mass of data, templates could
be discovered.
Enhanced Document Retrieval
Enhanced document retrieval
Search based on concept
Find articles about…
Membership: Scottie Pippen → Trailblazers
Membership: Osama bin Laden → al-Qaeda
Subgroups:
Ramadan Shallah → Islamic Jihad → al-Qaeda
Semantic search
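Concept-based search over membership and subgroup relations might look like the following sketch. The relation tables and function names are assumptions; apart from the Scottie Pippen entry taken from the slide, the entities are placeholders.

```python
# Direct memberships and subgroup links (placeholders except the first entry).
MEMBER_OF = {"Scottie Pippen": {"Trailblazers"}, "Person A": {"Group B"}}
SUBGROUP_OF = {"Group B": {"Group C"}}


def groups_of(entity):
    """All groups reachable through membership, then subgroup links."""
    seen = set()
    stack = list(MEMBER_OF.get(entity, ()))
    while stack:
        group = stack.pop()
        if group not in seen:
            seen.add(group)
            stack.extend(SUBGROUP_OF.get(group, ()))
    return seen


def members_of(group):
    """Entities that belong to `group` directly or via a subgroup."""
    return sorted(e for e in MEMBER_OF if group in groups_of(e))
```

A search for articles about members of "Group C" would then also cover "Person A", even though no document states that membership directly.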
Enhanced Document Retrieval
Document Augmentation
Sidebar acts as glossary as you read
Pre-fetch data user is likely to want
Adapt to user preferences, activities
Deeper understanding for user, who gets
answers to questions raised while reading
Enhanced Document Retrieval
Search Augmentation
Google assumes users only want documents
Provide answers along with documents
Use query term denotation to more closely
target results
“Browns Ferry” is a garden park
“Browns Ferry” is a nuclear power plant
Automates what people do with IR systems
Append hints about the type of term being sought
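Appending type hints to a query can be sketched as follows. The lookup table and function name are hypothetical, and the two senses are the ones listed on the slide; how the real system chooses hints is not described.

```python
# Known denotations of an ambiguous term, taken from the slide's example.
DENOTATIONS = {"Browns Ferry": ["garden park", "nuclear power plant"]}


def augmented_queries(term):
    """Return one query per known sense, with the sense appended as a hint."""
    senses = DENOTATIONS.get(term, [])
    if not senses:
        return [f'"{term}"']  # nothing known, so search for the bare term
    return [f'"{term}" {sense}' for sense in senses]


queries = augmented_queries("Browns Ferry")
```

Each resulting query targets one denotation, which is what a person does by hand when retyping a search with an extra clarifying word.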
Search Augmentation
Demo: Basic Search
Demo: Followup Data
Demo: Disambiguation
Demo: Relations
Basic Question Answering
Automated techniques for ground facts
Use reasoners for higher-level facts
Tie in with KSL AQUAINT work
Feedback, direction from user
Structure of knowledge allows simple
form of question answering
Basic Question Answering
Multiple views into data
Browse interface
Ugly, but complete view
Activity-based knowledge presentation
Search, document augmentation
Future work: accept user feedback,
customization, preferred sources
Basic Question Answering
Query by example
Users create many similar documents
These are targeted to an activity
Use past work to speed present work
User creates templates which present
data they find interesting in a way they
find convenient
Query by Example
Goal: Produce Reports
Most reports are made with Office
Enhance with semantic awareness
Provide seamless access to knowledge
Word processor, spreadsheet
Transparent maintenance, creation
Low overhead of operation
Avoid centralized approach
Contrast with relational database
Word Processing
Creation of new data
Semantic scan
Annotation of text
Like spell check or grammar check
Automatically identifies referenced entities
Learns new entities, relations between entities
User manually adjusts system
User adds new data
System gets smarter over time
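The semantic scan loop (identify entities, let the user adjust, get smarter) could be approximated as below. The capitalized-phrase heuristic and all names are assumptions for illustration; the slides do not say how entities are actually detected.

```python
import re

known_entities = {"Tony Blair"}


def semantic_scan(text, known):
    """Flag multi-word capitalized phrases as known or new, like a spell check."""
    candidates = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+", text)
    return [(c, "known" if c in known else "new") for c in candidates]


text = "Tony Blair met Jane Example in London."
first_pass = semantic_scan(text, known_entities)

# The user confirms the new entity, so the next scan recognizes it.
known_entities.add("Jane Example")
second_pass = semantic_scan(text, known_entities)
```

The second pass marks both names as known, which is the sense in which the system improves as the user works.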
Word Processing
Create data via entry into templates
Create new templates
For others
For personal use
Extend templates with new entry areas
Enhance analyst’s view
Semantic Search, Document Augmentation
Sidebar boxes are templates too
Word Processing
Demo: Semantic Scan
Demo: Annotation
Demo: Knowledge Creation
Spreadsheets
Spreadsheets are key tools in analysis
Tabular format, UI are both intuitive
Sorting, basic math functions
We add semantics:
New formula type: “Get Data”
New formula type: “Put Data”
Summarization, new views
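The "Get Data" and "Put Data" formula types might behave like the sketch below. The `KnowledgeBase` class, its method signatures, and the sample values are hypothetical; the slides name the formula types but not their arguments.

```python
class KnowledgeBase:
    """Minimal fact store keyed by (entity, property), with source tracking."""

    def __init__(self):
        self._facts = {}  # (entity, property) -> (value, source)

    def put_data(self, entity, prop, value, source):
        """What a "Put Data" cell might do: write a cell value into the KB."""
        self._facts[(entity, prop)] = (value, source)

    def get_data(self, entity, prop):
        """What a "Get Data" cell might do: fetch a value from the KB."""
        value, _source = self._facts[(entity, prop)]
        return value

    def source_of(self, entity, prop):
        return self._facts[(entity, prop)][1]


kb = KnowledgeBase()
# Placeholder value, not a real figure.
kb.put_data("Examplia", "population", 1000, source="analyst entry")
row_labels = ["Examplia"]
column = [kb.get_data(name, "population") for name in row_labels]
```

Filling a column then amounts to evaluating one "Get Data" formula per row label.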
Spreadsheets
Example scenario
Suppose SARS were found to affect Asian Americans more than others.
Analyst wants to determine, based on that,
which states are most at risk
Knowledge from the Census tells us the Asian American population as a percentage
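The ranking step in this scenario is simple once the percentages are in the sheet. In the sketch below the state names and numbers are placeholders and not Census figures; in the tool they would presumably be pulled in by "Get Data" formulas.

```python
# Placeholder percentages, NOT real Census data.
asian_american_pct = {"State A": 4.0, "State B": 12.5, "State C": 1.2}

# Sort states from highest to lowest share to see which are most at risk.
ranked = sorted(asian_american_pct, key=asian_american_pct.get, reverse=True)
```

The same result is what a spreadsheet sort on the percentage column would give.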
Spreadsheets
Goal: Consume Reports
Verify others’ data against yours
Incorporate others’ results into your
knowledge base, track sources
Maintain data
Change notification
Document updates with new data
Versioning of documents, data
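Change notification and versioning could be combined roughly as in the following sketch. The class name, callback signature, and values are assumptions for illustration and do not describe the project's actual design.

```python
class VersionedFacts:
    """Keeps every value a fact has had and notifies subscribers on change."""

    def __init__(self):
        self._history = {}    # key -> list of values, oldest first
        self._listeners = []  # callbacks taking (key, value, version)

    def subscribe(self, callback):
        self._listeners.append(callback)

    def set(self, key, value):
        versions = self._history.setdefault(key, [])
        if versions and versions[-1] == value:
            return  # unchanged, so no new version and no notification
        versions.append(value)
        for callback in self._listeners:
            callback(key, value, len(versions))

    def get(self, key, version=None):
        versions = self._history[key]
        return versions[-1] if version is None else versions[version - 1]


facts = VersionedFacts()
notifications = []
facts.subscribe(lambda key, value, version: notifications.append((key, version)))
key = ("Examplia", "population")  # placeholder entity and values
facts.set(key, 1000)
facts.set(key, 1000)  # no change, so no notification
facts.set(key, 1200)
```

A document holding the older figure could use such a notification to offer an update, while the earlier version stays retrievable.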
Goal: Share Reports
Easily exchangeable via e-mail
Truth maintenance techniques
Multiple views into data
Leverage domain expertise
The missile guy has a KB, …
Collaboration, trust levels
Colleagues disagree, sources are unreliable
Conclusion
KD-D effort is focused on authoring,
analysis tasks
Leverage automated techniques to
complement manual techniques
System gets smarter as it’s used
Tie in with commonly used applications