The Internet Knowledge Base


Template-based Authoring
Knowledge Systems Laboratory, Stanford
Project Goals


Assist analyst in everyday work
Knowledge Authoring Tools to assist in:





Research for reports
Produce reports
Consume reports
Share reports
Our solution: Semantic Web Templates
Semantic Web Templates




Knowledge Representation, Semantics are key for information exchange
Creation, maintenance of knowledge must be transparent
Automate extraction of knowledge
Enhance knowledge retrieval methods
Semantic Web Templates

Similar to MS Word Templates


Different templates for different tasks
Word templates can have restrictions on text





Very primitive, such as length of text
Simplistic patterns such as “phone number”
No concepts such as “color” or “country”
One template, many documents
HTML templates are very common today

Many web sites use SQL database as back end,
template + SQL → HTML
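
The "template + SQL → HTML" idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `people` table, its columns, the sample rows, and the row template are hypothetical, and Python's standard `sqlite3` and `string.Template` stand in for whatever database and template engine a real site would use.

```python
import sqlite3
from html import escape
from string import Template

# Hypothetical row template: one template, many rows.
ROW_TEMPLATE = Template("<li>$name: $role</li>")


def render_page(conn):
    """Fill the row template once per SQL result row and wrap it in a list."""
    rows = conn.execute("SELECT name, role FROM people ORDER BY name").fetchall()
    items = [
        ROW_TEMPLATE.substitute(name=escape(name), role=escape(role))
        for name, role in rows
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


def demo():
    """Build a throwaway in-memory database with placeholder rows and render it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?)",
        [("Tony Blair", "Prime Minister"), ("Example Person", "Analyst")],
    )
    return render_page(conn)


if __name__ == "__main__":
    print(demo())
```

The same template is reused for every row, which is the "one template, many documents" property noted for Word templates.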
Semantic Web Templates


An HTML file with additional tags
Tags specify:





Where particular knowledge is stated
What kind of knowledge it is
Where it came from, if applicable
References to an entity or relation
Repetitive regions of text
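
As a rough illustration of "an HTML file with additional tags", the sketch below reads back what kind of knowledge each tagged region holds and where it came from. The `<kb>` tag and its `type` and `source` attributes are invented for this example; the slides do not give the actual tag vocabulary, and the sample values are placeholders.

```python
from html.parser import HTMLParser

# Hypothetical markup: a <kb> tag with "type" and "source" attributes.
TEMPLATE = (
    '<p>Population of <kb type="Country" source="CIA World Fact Book">France</kb>: '
    '<kb type="population">unknown</kb></p>'
)


class KnowledgeTagParser(HTMLParser):
    """Collect type, source, and text for each <kb> region in the page."""

    def __init__(self):
        super().__init__()
        self.facts = []
        self._open = None  # the <kb> region currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "kb":
            attr = dict(attrs)
            self._open = {
                "type": attr.get("type"),
                "source": attr.get("source"),
                "text": "",
            }

    def handle_data(self, data):
        if self._open is not None:
            self._open["text"] += data

    def handle_endtag(self, tag):
        if tag == "kb" and self._open is not None:
            self.facts.append(self._open)
            self._open = None


def extract_facts(html):
    """Return the list of tagged regions found in the given HTML string."""
    parser = KnowledgeTagParser()
    parser.feed(html)
    parser.close()
    return parser.facts


if __name__ == "__main__":
    for fact in extract_facts(TEMPLATE):
        print(fact)
```

Ordinary browsers ignore unknown tags, so a page marked up this way would still display as plain HTML.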
Goal: Assist Research

Unstructured Extraction




Sort through buckets of data to find gold
Entity recognition
Relation recognition
Semi-structured Extraction



Utilize repetitive patterns within a page
Use similar pages to extract more data
Robust despite changing pages, data
Unstructured Extraction




Natural language processing
News feeds
Indexing, storage, retrieval
Plugin architecture



Rover news crawler



Web Services
Our system, collaboration with IBM via NIMD
Political news articles from Yahoo!
22,000 articles, ~8,500 concepts, ~1,000 relations
Used in authoring tools
Unstructured Extraction

Pattern based system

Leverage “hints” for the reader in news articles
British Prime Minister Tony Blair

<type Country><subClassOf Politician> <unknown name>







“Tony Blair” is a Prime Minister who represents the Country “England”.
System runs daily on Yahoo political news
Highlights known terms in green
Highlights new terms in red
Used to create search index, maintain KB
Demo
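
The pattern <type Country><subClassOf Politician><unknown name> can be approximated with a short sketch. The slides do not say how the real system matches patterns; a regular expression is used here as a stand-in, and the tiny term lists are hypothetical. The "known" and "new" labels correspond to the green and red highlighting described above.

```python
import re

# Tiny hypothetical knowledge base; the slide's example maps "British" to "England".
COUNTRY_TERMS = {"British": "England"}
POLITICIAN_TITLES = ["Prime Minister", "President"]

# <type Country> <subClassOf Politician> <unknown name>
PATTERN = re.compile(
    r"\b(?P<country>" + "|".join(map(re.escape, COUNTRY_TERMS)) + r")\s+"
    r"(?P<title>" + "|".join(map(re.escape, POLITICIAN_TITLES)) + r")\s+"
    r"(?P<name>[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"
)


def extract(text, known_names):
    """Return one record per pattern match, marking each name as known or new."""
    results = []
    for match in PATTERN.finditer(text):
        name = match.group("name")
        results.append({
            "name": name,
            "title": match.group("title"),
            "country": COUNTRY_TERMS[match.group("country")],
            "status": "known" if name in known_names else "new",
        })
    return results


if __name__ == "__main__":
    sentence = "British Prime Minister Tony Blair spoke on Tuesday."
    print(extract(sentence, known_names=set()))
```

The name sub-pattern simply takes a run of capitalized words, so it would over-match when another capitalized word follows the name; a real system would need something more careful.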
Semi-structured Extraction


Extract, produce knowledge
Initial model is Domain Authorities





Enhance KB with ground facts
Strong for relations and breadth of data
Leverages work of others
Makes use of SQL databases
Future work is wide-scale web of trust
Semi-structured Extraction

Site Registry




By description and property
CIA World Fact Book has data about items which are of type <Country>
CIA World Fact Book has properties <population>, <hasNeighbor>, <hasMembership>, etc.
Demo
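
A site registry "by description and property" could be modeled along the following lines. This is only a sketch: the `SiteEntry` record and the `sites_for` lookup are hypothetical names, and the single entry mirrors the CIA World Fact Book example from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteEntry:
    """One registered domain authority: what it describes and what it can answer."""
    name: str
    describes_type: str
    properties: frozenset


REGISTRY = [
    SiteEntry(
        name="CIA World Fact Book",
        describes_type="Country",
        properties=frozenset({"population", "hasNeighbor", "hasMembership"}),
    ),
]


def sites_for(entity_type, prop, registry=REGISTRY):
    """Return the names of sites that cover this entity type and property."""
    return [
        site.name
        for site in registry
        if site.describes_type == entity_type and prop in site.properties
    ]


if __name__ == "__main__":
    print(sites_for("Country", "population"))
```

An extractor could consult such a registry to decide which site's template to run when a fact about a given entity is missing.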
Semi-structured Extraction

Publishing



Human editing good for high-level concepts
Automated techniques good for relations, ground level facts, and massive repetition
Rover web crawler


Template construction is currently manual
With critical mass of data, templates could be discovered.
Enhanced Document Retrieval

Enhanced document retrieval

Search based on concept




Find articles about…
Membership: Scottie Pippen → Trailblazers
Membership: Osama bin Laden → al-Qaeda
Subgroups:
Ramadan Shallah → Islamic Jihad → al-Qaeda
Semantic search
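
Searching by concept rather than by keyword can be sketched as a walk over membership relations, so that a query for a group also finds articles about its members and subgroups. The edge list below restates the slide's examples; the function names and the sample article texts are placeholders, and the substring match is a simplification of real indexing.

```python
from collections import defaultdict

# (member, group) pairs, as listed on the slide.
MEMBER_OF = [
    ("Scottie Pippen", "Trailblazers"),
    ("Osama bin Laden", "al-Qaeda"),
    ("Ramadan Shallah", "Islamic Jihad"),
    ("Islamic Jihad", "al-Qaeda"),
]


def expand(group, edges=MEMBER_OF):
    """Return the group plus everything that belongs to it, directly or via subgroups."""
    children = defaultdict(set)
    for member, parent in edges:
        children[parent].add(member)
    seen, stack = {group}, [group]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def search(group, articles):
    """Return ids of articles that mention the group or any of its members."""
    terms = expand(group)
    return [doc_id for doc_id, text in articles.items() if any(t in text for t in terms)]


if __name__ == "__main__":
    docs = {
        "a1": "article mentioning Ramadan Shallah",
        "a2": "article mentioning Scottie Pippen",
    }
    print(search("al-Qaeda", docs))
```

The first article never names the queried group, yet it is returned because the knowledge base links the person to it through a subgroup.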
Enhanced Document Retrieval

Document Augmentation




Sidebar acts as glossary as you read
Pre-fetch data user is likely to want
Adapt to user preferences, activities
Deeper understanding for the user, who gets answers to questions raised while reading
Enhanced Document Retrieval
Search Augmentation



Google assumes users only want documents
Provide answers along with documents
Use query term denotation to more closely target results



“Browns Ferry” is a garden park
“Browns Ferry” is a nuclear power plant
Automates what people do with IR systems

Append hints about the type of term being sought
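
Appending a type hint to an ambiguous query term might look like the sketch below. The `DENOTATIONS` table and the `augment_query` function are hypothetical; the two senses of "Browns Ferry" are the ones given on the slide, and the output is an ordinary keyword query that any search engine could accept.

```python
# Candidate denotations for an ambiguous term, taken from the slide's example.
DENOTATIONS = {
    "Browns Ferry": ["garden park", "nuclear power plant"],
}


def augment_query(term, chosen_type=None, denotations=DENOTATIONS):
    """Return one augmented query per candidate sense, or only the chosen sense."""
    senses = denotations.get(term, [])
    if chosen_type is not None:
        senses = [s for s in senses if s == chosen_type]
    if not senses:
        # Nothing known about the term: fall back to the plain quoted query.
        return ['"%s"' % term]
    return ['"%s" %s' % (term, sense) for sense in senses]


if __name__ == "__main__":
    for query in augment_query("Browns Ferry"):
        print(query)
```

With no sense chosen, the caller gets one query per denotation and can present them as a disambiguation choice; once the user picks a sense, only that query is issued.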
Search Augmentation
Demo: Basic Search
Demo: Followup Data
Demo: Disambiguation
Demo: Relations
Basic Question Answering


Automated techniques for ground facts
Use reasoners for higher-level facts



Tie in with KSL AQUAINT work
Feedback, direction from user
Structure of knowledge allows simple form of question answering
Basic Question Answering


Multiple views into data
Browse interface


Ugly, but complete view
Activity-based knowledge presentation


Search, document augmentation
Future work: accept user feedback, customization, preferred sources
Basic Question Answering

Query by example




Users create many similar documents
These are targeted to an activity
Use past work to speed present work
User creates templates which present data they find interesting in a way they find convenient
Query by Example
Goal: Produce Reports

Most reports are made with Office



Enhance with semantic awareness
Provide seamless access to knowledge


Word processor, spreadsheet
Transparent maintenance, creation
Low overhead of operation


Avoid centralized approach
Contrast with relational database
Word Processing

Creation of new data

Semantic scan




Annotation of text



Like spell check or grammar check
Automatically identifies referenced entities
Learns new entities, relations between entities
User manually adjusts system
User adds new data
System gets smarter over time
Word Processing


Create data via entry into templates
Create new templates




For others
For personal use
Extend templates with new entry areas
Enhance analyst’s view


Semantic Search, Document Augmentation
Sidebar boxes are templates too
Word Processing



Demo: Semantic Scan
Demo: Annotation
Demo: Knowledge Creation
Spreadsheets




Spreadsheets are key tools in analysis
Tabular format, UI are both intuitive
Sorting, basic math functions
We add semantics:



New formula type: “Get Data”
New formula type: “Put Data”
Summarization, new views
Spreadsheets

Example scenario



Suppose SARS was found to affect Asian-Americans more than others
Analyst wants to determine, based on that, which states are most at risk
Knowledge from Census tells us Asian-American population as a percentage
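
The "Get Data" and "Put Data" formula types, applied to a scenario like the one above, could be sketched as follows. The `KnowledgeBase` class, the `rank_by` helper, and the property name are assumptions for illustration; the state names and percentages are placeholders, not Census figures.

```python
class KnowledgeBase:
    """In-memory stand-in for the knowledge base: (entity, property) -> (value, source)."""

    def __init__(self):
        self._facts = {}

    def put_data(self, entity, prop, value, source):
        """Analogue of a "Put Data" cell: record a value and where it came from."""
        self._facts[(entity, prop)] = (value, source)

    def get_data(self, entity, prop, default=None):
        """Analogue of a "Get Data" cell: look up a value, or return the default."""
        value, _source = self._facts.get((entity, prop), (default, None))
        return value


def rank_by(kb, entities, prop):
    """Sort entities by a property, highest first, skipping those with no data."""
    rows = [(entity, kb.get_data(entity, prop)) for entity in entities]
    rows = [row for row in rows if row[1] is not None]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    kb = KnowledgeBase()
    # Placeholder values only; real figures would come from the Census source.
    kb.put_data("State A", "populationPercentage", 2.0, "placeholder")
    kb.put_data("State B", "populationPercentage", 5.0, "placeholder")
    print(rank_by(kb, ["State A", "State B", "State C"], "populationPercentage"))
```

Keeping the source next to each value is what would let a report track where its numbers came from.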
Spreadsheets
Goal: Consume Reports



Verify others’ data against yours
Incorporate others’ results into your knowledge base, track sources
Maintain data



Change notification
Document updates with new data
Versioning of documents, data
Goal: Share Reports




Easily exchangeable via e-mail
Truth maintenance techniques
Multiple views into data
Leverage domain expertise


The missile guy has a KB, …
Collaboration, trust levels

Colleagues disagree, sources are unreliable
Conclusion




KD-D effort is focused on authoring, analysis tasks
Leverage automated techniques to complement manual techniques
System gets smarter as it’s used
Tie in with commonly used applications