Database evolution


HYPERTEXT DATA BASES
Session 7 16:10 - 16:35
Dr Soumen Chakrabarti, IIT Bombay
Database evolution
• OLTP operations
• Structured operational data
• Additional data capture: free text from customer forms, phone conversation logs, etc.
• SQL3: text and multimedia support
• Indexing, search and retrieval
• Data analysis and mining
• Semistructured database
• Collaboration and resource discovery
Basic text indexing and retrieval
• Inverted index mapping terms to (document, position)
• Boolean search
– socks and not “network* protocol*”
• Phrase and proximity search
– socks near(5) network*
• Relevance ranking
– Vector space model
[Figure: vector space model diagram; labels: loan, inflation, bank]
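
The indexing and query ideas on this slide can be illustrated with a small Python sketch. It is not taken from the talk: the tokenizer, the helper names (build_index, postings, phrase, near), the three sample documents, and the reading of near(5) as "within five token positions in either direction" are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index

def postings(index, pattern):
    """Merged postings for a term; a trailing * matches every term with that prefix."""
    if pattern.endswith("*"):
        terms = [t for t in index if t.startswith(pattern[:-1])]
    else:
        terms = [pattern] if pattern in index else []
    merged = defaultdict(list)
    for t in terms:
        for doc_id, positions in index[t].items():
            merged[doc_id].extend(positions)
    return merged

def phrase(index, first, second):
    """Documents where a match for `second` immediately follows a match for `first`."""
    a, b = postings(index, first), postings(index, second)
    return {d for d in a.keys() & b.keys()
            if any(p + 1 in b[d] for p in a[d])}

def near(index, left, right, k):
    """Documents where some `left` match is within k positions of a `right` match."""
    a, b = postings(index, left), postings(index, right)
    return {d for d in a.keys() & b.keys()
            if any(abs(p - q) <= k for p in a[d] for q in b[d])}

# Hypothetical sample documents.
docs = {
    1: "Knitted wool socks for winter",
    2: "SOCKS is a network protocol for proxies",
    3: "Socks proxies sit between hosts on many networks",
}
index = build_index(docs)

# socks and not "network* protocol*"
boolean_hits = set(postings(index, "socks")) - phrase(index, "network*", "protocol*")
# socks near(5) network*
near_hits = near(index, "socks", "network*", 5)
```

Storing positions rather than only document ids is what makes the phrase and proximity queries answerable from the index alone, without rescanning document text.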
Text mining
• Problems
– Very high dimensionality
– Vocabulary mismatch
• Available technologies
– Relevance feedback
– Semantic indexing
– Semi-automatic thesaurus construction
– Topic directory construction (clustering)
– Topic directory maintenance (classification)
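
As one illustration of relevance feedback, the sketch below uses the Rocchio update on term-frequency vectors. The slide names only "relevance feedback", so the choice of Rocchio, the weights alpha, beta and gamma, the helper names, and the sample documents are assumptions. It shows how feedback can ease vocabulary mismatch: a document that shares no term with the original query becomes retrievable after the query is expanded.

```python
import math
from collections import Counter

def vectorize(text):
    """Raw term-frequency vector (no IDF weighting, to keep the sketch short)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(vectors):
    """Average of a list of term vectors; empty dict for an empty list."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()} if vectors else {}

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non-relevant ones."""
    rel, non = centroid(relevant), centroid(nonrelevant)
    terms = set(query) | set(rel) | set(non)
    updated = {t: alpha * query.get(t, 0.0)
                  + beta * rel.get(t, 0.0)
                  - gamma * non.get(t, 0.0)
               for t in terms}
    # Keep only positively weighted terms.
    return {t: w for t, w in updated.items() if w > 0}

# Hypothetical documents and feedback judgments.
d1 = vectorize("bank loan inflation rate")
d2 = vectorize("river bank erosion")
d3 = vectorize("loan interest inflation")

query = vectorize("bank")
before = cosine(query, d3)   # d3 shares no term with the query
expanded = rocchio(query, relevant=[d1], nonrelevant=[d2])
after = cosine(expanded, d3)  # d3 now matches on "loan" and "inflation"
```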
From text to hypertext
• Email, newsgroup, intranet web pages
• Hyperlinks are much more than a browsing
mechanism
• Citations describe collective opinion
• Hyperlinks implicitly form online
communities
• Hyperlinks enable collaboration and
autonomous resource discovery
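
A very simple way to read hyperlinks as citations is to count how many distinct pages point to each target. The sketch below is an assumed, minimal stand-in for that idea, not the method presented in the talk; the function name inlink_counts and the link graph are hypothetical.

```python
from collections import Counter

def inlink_counts(links):
    """Count distinct pages linking to each target; self-links are ignored."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts

# Hypothetical link graph: page -> pages it links to.
links = {
    "intranet/finance": ["rates.example", "centralbank.example"],
    "intranet/treasury": ["centralbank.example"],
    "newsgroup/post-17": ["centralbank.example", "rates.example"],
    "centralbank.example": ["centralbank.example"],
}
ranked = inlink_counts(links).most_common()
```

Raw in-link counts treat every citing page as equally trustworthy, so they are only a crude proxy for collective opinion.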
Example: clustering and classification
• Organize resources used by employees into topic directory
• Classify news, email, office circulars, etc. into topic areas
• Discover resources on specific topics of interest on the Web
• Enable multiple users to collaborate without explicit effort
[Figure: topic directory under Finance and Investment; entries: Banking, Central Banks, Credit Unions, Internet Banking, Rate Monitors, Insurance, Mutual Funds, Companies, Government Agencies, Health Insurance, Life Insurance, Brokerages, Consulting, Hedge Funds, News and Quotes]
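
Classifying documents into topic areas can be sketched with a small multinomial naive Bayes classifier. The slide does not name a classification method, so naive Bayes with add-one smoothing is an assumption, as are the function names and the training sentences; only the topic names Banking and Insurance come from the directory on the slide.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """Collect per-topic word counts, per-topic document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for topic, text in labelled_docs:
        doc_counts[topic] += 1
        word_counts[topic].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the topic with the highest log-posterior under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_topic, best_score = None, -math.inf
    for topic in doc_counts:
        total_words = sum(word_counts[topic].values())
        score = math.log(doc_counts[topic] / total_docs)  # log prior
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothed word likelihood.
                score += math.log((word_counts[topic][word] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

# Hypothetical training examples for two topic areas from the directory.
model = train([
    ("Banking", "central bank raises interest rate"),
    ("Banking", "credit union offers internet banking"),
    ("Insurance", "life insurance premium and policy"),
    ("Insurance", "health insurance claim policy"),
])
topic_a = classify("new health policy premium", model)
topic_b = classify("bank interest rate", model)
```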
Research problems
• Represent and query semi-structured data
• Mine distributed, hyperlinked data with
both structured and unstructured fields
• Locate Internet resources, understand their
data schema, and enable data integration
• Answer “why,” “how-to,” “where can I”
questions in restricted domains
• Unify text corpora with semi-structured
thesaurus