
Information Integration
for Digital Libraries
August 10, 2000
Prof. Sang Ho Lee
Soongsil University
Seoul, Korea
1
Information integration
• Provision of integrated access to multiple,
distributed, heterogeneous databases and other
information sources
• Mediator approach
– More up-to-date data
– No need to copy data
– Query needs can be unknown
• Data warehouse approach
– High query performance
– Can operate when sources unavailable
– Extra information at warehouse
• Modify, summarize (store aggregates), add historical
information
2
Mediator Approach
[Figure: two Clients query one Mediator; the Mediator reaches
three Sources, each through its own Wrapper]
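A minimal sketch of the mediator idea, assuming two hypothetical in-memory sources with different native record formats. The names `CATALOG_A`, `CATALOG_B`, `wrap_a`, `wrap_b`, and `Mediator` are illustrative only and do not come from the slides. Each wrapper translates its source into one common schema, and the mediator queries the sources at request time without copying their data.

```python
from typing import Callable, Dict, List

# Hypothetical sources: A stores dicts, B stores (title, year-as-string) tuples.
CATALOG_A = [{"title": "Digital Libraries", "yr": 1999}]
CATALOG_B = [("Information Integration", "2000")]


def wrap_a(keyword: str) -> List[Dict]:
    """Wrapper for source A: filter by keyword, map to the common schema."""
    return [{"title": r["title"], "year": r["yr"]}
            for r in CATALOG_A if keyword.lower() in r["title"].lower()]


def wrap_b(keyword: str) -> List[Dict]:
    """Wrapper for source B: same common schema, different native format."""
    return [{"title": t, "year": int(y)}
            for (t, y) in CATALOG_B if keyword.lower() in t.lower()]


class Mediator:
    """Forwards each query to every wrapper and merges the answers."""

    def __init__(self, wrappers: List[Callable[[str], List[Dict]]]) -> None:
        self.wrappers = wrappers

    def query(self, keyword: str) -> List[Dict]:
        results: List[Dict] = []
        for wrapper in self.wrappers:
            results.extend(wrapper(keyword))  # sources are read at query time
        return sorted(results, key=lambda r: r["year"])
```

Because nothing is copied, a change to `CATALOG_A` or `CATALOG_B` is visible to the next `Mediator.query` call, which is the "more up-to-date data" property listed earlier.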
3
Data Warehouse Approach
[Figure: three Sources feed an Integration layer that loads the
Warehouse (described by Metadata); two Clients reach the
Warehouse through Query & Analysis]
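For contrast, a minimal sketch of the warehouse idea, assuming records arrive as dictionaries with `title` and `year` fields. The `Warehouse` class and its field names are hypothetical. Data is copied in at load time, an aggregate is stored alongside it, and queries are answered without contacting the sources.

```python
from collections import Counter
from typing import Dict, Iterable, List


class Warehouse:
    """Holds copied records, a stored aggregate, and simple load metadata."""

    def __init__(self) -> None:
        self.records: List[Dict] = []
        self.count_by_year: Counter = Counter()  # stored aggregate
        self.metadata: Dict[str, int] = {}       # source name -> rows loaded

    def load(self, source_name: str, rows: Iterable[Dict]) -> None:
        """Integration step: copy rows in and update the aggregate."""
        loaded = 0
        for row in rows:
            self.records.append({**row, "source": source_name})
            self.count_by_year[row["year"]] += 1
            loaded += 1
        self.metadata[source_name] = self.metadata.get(source_name, 0) + loaded

    def titles_in(self, year: int) -> List[str]:
        """Answer a query from the local copy only."""
        return [r["title"] for r in self.records if r["year"] == year]
```

Once `load` has run, `titles_in` and `count_by_year` keep working even if a source becomes unavailable, at the cost of serving data only as fresh as the last load.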
4
Web Searching Practice
• Approx. 800 million indexable Web pages (Feb.
1999)
• Low coverage of the Web
– No engine indexing more than 16% of indexable web
pages
• Out of date
– New pages take months to be indexed
• Low metadata use
– 34% use “keywords” or “description” metatags
– 0.3% use the Dublin Core metadata standard
• Simple queries
– Most queries use 1-3 search words
• Poor relevancy ranking and precision
5
Metasearch Engines
• USA
– SavvySearch (www.savvysearch.com)
– MetaCrawler (www.go2net.com/search.html)
– Ask Jeeves (www.askjeeves.com)
– ProFusion (www.profusion.com)
– Mamma (www.mamma.com)
– Ixquick (www.ixquick.com)
• Korea
– Wakano (www.wakano.co.kr)
– Ms. DaChanni (www.mochanni.com)
• Over 3000 metasearch engines around the world
6
Operation Flow and
Technical Issues
User query
Decompose and format queries
Send queries and get results
Post processing (ranking, clustering, etc.)
Output result
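The flow above can be sketched as follows. This is a toy illustration under stated assumptions: `engine_one` and `engine_two` are hypothetical stand-ins for real search engines, the per-engine query syntax in `format_query` is invented, and the reciprocal-rank sum used for post-processing is one possible merging heuristic rather than a method named in the slides.

```python
from typing import Callable, Dict, List


# Hypothetical engines: each takes a formatted query, returns ranked URLs.
def engine_one(query: str) -> List[str]:
    return ["http://a.example/", "http://b.example/"]


def engine_two(query: str) -> List[str]:
    return ["http://b.example/", "http://c.example/"]


ENGINES: Dict[str, Callable[[str], List[str]]] = {
    "one": engine_one,
    "two": engine_two,
}


def format_query(engine: str, terms: List[str]) -> str:
    """Decompose and format: assume each engine wants its own syntax."""
    separator = "+" if engine == "one" else " AND "
    return separator.join(terms)


def metasearch(user_query: str) -> List[str]:
    terms = user_query.split()                      # decompose the user query
    scores: Dict[str, float] = {}
    for name, search in ENGINES.items():
        hits = search(format_query(name, terms))    # send query, get results
        for rank, url in enumerate(hits, start=1):  # post-processing
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Output: highest combined score first, ties broken by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))
```

A URL returned by several engines accumulates score, so duplicates are merged and rise in the output list.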
7
Current Practice of
Metasearch Engines
• Tend toward a least-common-denominator
interface
– Do not fully utilize the functions of
individual sources
• Cover general areas, not a specific domain
– Little utilization of domain knowledge
• Little consideration of personal profiles
8
Proposed Research Topics
(1)
• Theme: focused on mediator-based integration
techniques (in particular, metasearch engines)
• Intelligent wrapper techniques
– To extract, combine, and reconcile information from
external sources
– Exploit user profiles and use the functions of each source
as fully as possible
– Should be flexible and adaptable, as external sources
change
– Several approaches
• Formal language based, machine learning based, heuristic
based, extended CFG based, …
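As one illustration of the heuristic-based approach, here is a minimal sketch of a wrapper that extracts results from an HTML result page using Python's standard `html.parser` module. The heuristic (every absolute `http` link with non-empty anchor text counts as a result) and the names `LinkWrapper` and `extract_results` are assumptions for this example. A real wrapper would need source-specific rules and would have to adapt as pages change.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkWrapper(HTMLParser):
    """Heuristic: each <a href="http..."> with visible text is one result."""

    def __init__(self) -> None:
        super().__init__()
        self.results: List[Tuple[str, str]] = []  # (title, url) pairs
        self._href: Optional[str] = None          # link currently open, if any
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("http"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace in the collected anchor text.
            title = " ".join("".join(self._text).split())
            if title:
                self.results.append((title, self._href))
            self._href = None


def extract_results(html: str) -> List[Tuple[str, str]]:
    parser = LinkWrapper()
    parser.feed(html)
    parser.close()
    return parser.results
```

Relative links such as navigation entries are skipped by the `http` prefix check, which is a crude way to separate results from page furniture.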
9
Proposed Research Topics
(2)
• Efficiency issues
– How to cache results and queries, to provide a
fast response to users
– How to exploit parallelism when accessing external
sources
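Both efficiency issues can be sketched together, assuming each external source is represented by a plain function that takes a query string and returns a list of results. The `search_all` function and the module-level `_cache` dictionary are hypothetical. The sketch uses a thread pool from the standard library for parallel access and a dictionary keyed by the normalized query as the cache. It ignores cache expiry and timeouts, both of which a working system would need.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical cache: normalized query -> merged results.
_cache: Dict[str, List[str]] = {}


def search_all(query: str,
               engines: Dict[str, Callable[[str], List[str]]]) -> List[str]:
    """Query every engine in parallel, caching the merged result."""
    key = " ".join(query.lower().split())  # normalize case and whitespace
    if key in _cache:
        return _cache[key]                 # fast path: no source is contacted

    # One worker per engine, so slow sources do not delay each other.
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(fn, key) for name, fn in engines.items()}
        merged: List[str] = []
        for name in engines:               # keep a deterministic engine order
            merged.extend(futures[name].result())

    _cache[key] = merged
    return merged
```

With this arrangement the response time for an uncached query is bounded by the slowest source rather than the sum of all sources, and a repeated query is served from memory.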
10
11
Research/Development
Strategies
• Categorize objects and develop a
specialized search mechanism for each
category
• Build a working system to test
theories experimentally
• Experiment with new ranking methods
– Google, Goto, …
12