Identifiers and Types


Metasearching
CS 502 – 20020312
Carl Lagoze – Cornell University
Acknowledgements:
Luis Gravano
Andreas Paepcke
Cornell CS 502
20020307
Web Search Strategies – Crawling
[Diagram: a “central” index]
Web Search Strategies – Metadata Harvesting
[Diagram: metadata with the fields Author, Title, Abstract, Identifier]
Web Search Strategies - Metasearching
[Diagram: Metasearch Engine]
What is “Metasearching”?
• Given many document sources and a query, a metasearcher:
– Finds the good sources for the query
– Evaluates the query at these sources
– Merges the results from these sources
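The three steps above can be sketched in a few lines. This is a minimal illustration, not a real metasearcher: the `Source` type, the `is_good_source` callback, the source names, and the scores are all hypothetical, and the merge step naively assumes that scores from different sources are comparable, which in general they are not.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a source is any function from a query string
# to a list of (document id, score) pairs.
Source = Callable[[str], List[Tuple[str, float]]]


def metasearch(
    query: str,
    sources: Dict[str, Source],
    is_good_source: Callable[[str, str], bool],
    top_k: int = 10,
) -> List[Tuple[str, float, str]]:
    # Step 1: find the good sources for the query.
    selected = {name: search for name, search in sources.items()
                if is_good_source(name, query)}
    # Step 2: evaluate the query at those sources.
    hits = [(doc, score, name)
            for name, search in selected.items()
            for doc, score in search(query)]
    # Step 3: merge the results (naively, by raw score).
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:top_k]


# Toy sources with made-up documents and scores.
sources = {
    "bio_db": lambda q: [("m1", 0.9), ("m2", 0.4)],
    "cs_reports": lambda q: [("c1", 0.7)],
}
hits = metasearch("biomedical", sources,
                  is_good_source=lambda name, q: name == "bio_db")
print(hits)
```

Only `bio_db` is selected here, so `cs_reports` is never queried; avoiding that wasted call is the point of the source-selection step.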
[Diagram: a Metasearcher in front of three kinds of sources: unindexed documents; a legacy database / WAIS / etc.; an existing Web application]
Metasearching Issues
• How to query different types of sources?
• How to combine results and rankings from multiple data
sources?
[Diagram: the Metasearcher sends the query to each source in that source's own form]
grep ‘biomedical’ *.txt
SELECT title FROM articles . . .
http://…/getTitle?title=‘biomedical’&…
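One way to picture the first issue is a set of small wrappers that translate one search term into each source's native form. The sketch below is illustrative only: the function names, the `WHERE` clause, and the `http://example.org/getTitle` base URL are assumptions, since the slide elides those details.

```python
import shlex
from urllib.parse import urlencode


def to_grep(term: str) -> str:
    # File-system source: quote the term so the shell sees one argument.
    return f"grep {shlex.quote(term)} *.txt"


def to_sql(term: str):
    # Relational source: a parameterized query. The WHERE clause is an
    # assumption; the slide shows only "SELECT title FROM articles . . .".
    return "SELECT title FROM articles WHERE title LIKE ?", (f"%{term}%",)


def to_url(base_url: str, term: str) -> str:
    # Web source: the term travels URL-encoded in the query string.
    query_string = urlencode({"title": term})
    return f"{base_url}?{query_string}"


term = "gene therapy"
print(to_grep(term))
print(to_sql(term))
print(to_url("http://example.org/getTitle", term))  # hypothetical endpoint
```

Each wrapper hides one source's syntax, so the metasearcher itself only ever handles the plain term.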
Metasearching Issues . . . Cont’d
• How to choose among multiple data sources?
• How to get metadata about multiple data sources?
[Diagram: the Metasearcher asks each source for metadata about itself]
cat *.txt
SELECT SCHEMA …….
Best: http://….?getMetaData
Worst: “Hi. What do you have?”
Function versus cost of acceptance
[Chart: axes Function and Cost of acceptance; points plotted: Z39.50, SDLIP/STARTS, Metadata Harvesting, google]
Z39.50
http://www.loc.gov/z3950/agency/
Aims of Z39.50
• Permits one computer, the client, to search and retrieve
information on another, the database server
• Important both technically and for its wide use in library
systems
• Most development has concentrated on bibliographic data
• Most implementations emphasize searches that use a
bibliographic set of attributes to search databases of
MARC records
Z39.50 technical history
• Developed for connection-oriented X.25 networks;
support for running over TCP was fitted later
• The original concept dates from about 1980, when repeating
a search was computationally expensive
• WAIS is a stateless derivative of an early version of Z39.50
Z39.50 principles
Abstract view of database searching.
• Server stores a set of databases with searchable
indexes
• Interactions are based on a session
• The client opens a connection with the server, carries
out a sequence of interactions and then closes the
connection.
• During the course of the session, both the server and
the client remember the state of their interaction.
Z39.50 state
• The server carries out the search and builds a result set
• The server saves the result set
• Subsequent messages from the client can reference the
result set
• Thus the client can narrow a large set with increasingly
precise requests, or can request a presentation of any
record in the set, without searching the entire database again
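The idea of server-side state can be shown with a toy model. This is a sketch of the concept only, not the Z39.50 protocol or any real toolkit's API: the class name, method names, and sample records are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]


class StatefulSearchSession:
    """Toy server session that keeps named result sets between requests."""

    def __init__(self, records: List[Record]):
        self._records = records
        self._result_sets: Dict[str, List[Record]] = {}

    def search(self, name: str, predicate: Callable[[Record], bool],
               within: Optional[str] = None) -> int:
        # Search the whole database, or only an earlier saved result set.
        pool = self._records if within is None else self._result_sets[within]
        self._result_sets[name] = [r for r in pool if predicate(r)]
        return len(self._result_sets[name])

    def present(self, name: str, start: int, count: int) -> List[Record]:
        # Return a slice of a saved result set without searching again.
        return self._result_sets[name][start:start + count]


session = StatefulSearchSession([
    {"title": "Evangeline", "author": "Longfellow"},
    {"title": "Hiawatha", "author": "Longfellow"},
    {"title": "Walden", "author": "Thoreau"},
])
n_author = session.search("by_longfellow", lambda r: r["author"] == "Longfellow")
n_refined = session.search("evangeline", lambda r: "evangeline" in r["title"].lower(),
                           within="by_longfellow")
second = session.present("by_longfellow", 1, 1)
print(n_author, n_refined, second)
```

The second search touches only the two saved records, and `present` reads from the saved set; both rely on the server remembering state between messages.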
Z39.50 services
init -- client connects to the server and exchanges initial
information, e.g., preferred message size
explain -- client asks the server which databases are
available for searching, which fields are available, the syntax
and formats supported, and other options
search -- client presents a query to a database, with a choice of
syntaxes for specifying searches
• only Boolean queries widely implemented
• one or more records may be returned to the client
Z39.50 services . . . Cont’d
manipulation of result sets -- e.g., sort or delete
present -- requests the server to send specified records from
the result set to the client in a specified format
• options for controlling content and formats, and
for managing large records or large result sets
Sample query
In the database named "Books", find all records for
which the access point title contains the value
"evangeline" and the access point author contains
the value "longfellow."
Z39.50 defines a rich variety of search access
points that can be extended by implementers
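The sample query is a Boolean AND of two access-point terms. As a rough illustration, it can be written as a small query tree and evaluated against in-memory records. The `Term` and `And` classes, the `matches` function, and the sample records are hypothetical; real Z39.50 queries are encoded quite differently.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Term:
    access_point: str  # e.g. "title" or "author"
    value: str


@dataclass(frozen=True)
class And:
    left: "Query"
    right: "Query"


Query = Union[Term, And]


def matches(query: Query, record: Dict[str, str]) -> bool:
    if isinstance(query, Term):
        # "contains the value": case-insensitive substring test.
        return query.value.lower() in record.get(query.access_point, "").lower()
    return matches(query.left, record) and matches(query.right, record)


books = [
    {"title": "Evangeline: A Tale of Acadie", "author": "Longfellow, Henry Wadsworth"},
    {"title": "The Song of Hiawatha", "author": "Longfellow, Henry Wadsworth"},
    {"title": "Walden", "author": "Thoreau, Henry David"},
]
sample_query = And(Term("title", "evangeline"), Term("author", "longfellow"))
found = [b["title"] for b in books if matches(sample_query, b)]
print(found)
```

Adding a new access point in this sketch only means adding a new key to the records, which loosely mirrors the extensibility noted above.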
Problems with Z39.50
• Very difficult to implement
– There are freely available implementations, but they are
complex
• Outdated assumptions
– Searching is expensive computationally
– Bandwidth is limited (ASN.1 compression)
• Originally designed for bibliographic record
retrieval, and not full documents or other objects
• “Overspecified”
• (Almost) Nobody Implements Explain!
• Assumes questionable user model (stateful)
Simple Digital Library Interoperability Protocol
http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/
SDLIP
• Compromise between a full-scale, all-encompassing
search middleware design such as Z39.50 and the
“anything goes” approach typical of ad hoc search
interface design on the Web
• Support for stateful and stateless operation by
the server
• Support for thin clients, such as handheld devices
• Developed jointly by Stanford, Berkeley, and UC
Santa Barbara
• Heavily influenced by DASL from IETF
SDLIP – search middleware
Managing complexity through separate interfaces
SDLIP Interfaces
• Search Interface – defines a simple query language;
the protocol can then include other languages
• Result Interface – parking meter metaphor;
supports varying notions of result sets
• Source Metadata Interface – provides an extension
mechanism through discovery of server capabilities
Result Access Interface
• This interface allows client applications to access
the set of result documents, wherever that set is
maintained
• Four services:
– getSessionInfo
– getDocs
– extendStateTimeout
– cancelRequest
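A rough sketch of how the four services might fit together follows. It takes one reading of the parking meter metaphor: the server keeps a result set only until a timer runs out, and the client can add time. The method names echo the four services, but every signature, the `open_session` helper, and the injectable clock are assumptions, not the SDLIP specification.

```python
import time
from typing import Callable, Dict, List


class ResultStore:
    """Toy server-side result store; all signatures are assumptions."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._sessions: Dict[str, dict] = {}

    def open_session(self, session_id: str, docs: List[str], timeout: float) -> None:
        # Hypothetical helper standing in for the search step.
        self._sessions[session_id] = {"docs": docs,
                                      "expires": self._clock() + timeout}

    def _live(self, session_id: str) -> dict:
        session = self._sessions.get(session_id)
        if session is None or self._clock() >= session["expires"]:
            self._sessions.pop(session_id, None)  # the meter ran out
            raise KeyError(f"no live result set for {session_id!r}")
        return session

    def get_session_info(self, session_id: str) -> dict:
        session = self._live(session_id)
        return {"num_docs": len(session["docs"]),
                "seconds_left": session["expires"] - self._clock()}

    def get_docs(self, session_id: str, start: int, count: int) -> List[str]:
        return self._live(session_id)["docs"][start:start + count]

    def extend_state_timeout(self, session_id: str, extra_seconds: float) -> None:
        self._live(session_id)["expires"] += extra_seconds  # feed the meter

    def cancel_request(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)


now = [0.0]  # fake clock so the example is deterministic
store = ResultStore(clock=lambda: now[0])
store.open_session("q1", ["d1", "d2", "d3"], timeout=30)
first_two = store.get_docs("q1", 0, 2)
store.extend_state_timeout("q1", 60)
now[0] = 50.0
still_there = store.get_docs("q1", 2, 1)
now[0] = 100.0
try:
    store.get_docs("q1", 0, 1)
    expired = False
except KeyError:
    expired = True
print(first_two, still_there, expired)
```

A thin client can fetch a few documents at a time and let the state lapse on its own, which suits the handheld-device case mentioned earlier.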
Source Metadata Interface
• Provides information about the service and server
itself, such as
– Collections served
– Collection metadata/content information
– Searchable properties
• Three operations
– getInterface
– getSubcollectionInfo
– getPropertyInfo
STARTS/SDARTS
http://www-db.stanford.edu/~gravano/starts_home.html
http://sdarts.cs.columbia.edu/default.html
STARTS
• Stanford Protocol Proposal for Internet Retrieval
and Search
• Joint work of Stanford Digital Library Project and
Cornell Digital Library Research Group
• SDARTS – current work at Columbia to integrate
with SDLIP and metadata harvesting (OAI-PMH)
Different text search engines are largely
incompatible
• Different query languages
(the query-language problem)
• Different ranking algorithms
(the rank-merging problem)
• No exported information about sources
(the metadata problem)
Rank Merging
• Return information in query result to allow rank
merging:
– unnormalized score of the document
– statistics about each query term
We cannot merge document ranks from different sources directly
• Search engines use different ranking algorithms:
DB1: (doc1, 0.7), (doc2, 0.3)
DB2: (doc3, 1000), (doc4, 400)
Merged rank?
• Some algorithms depend on the source characteristics
Extra information helps merge document ranks
meaningfully
Sources return query results and statistics:
Query: "distributed databases"
DB1: (doc1, 0.7)
"distributed" appears 3 times in doc1
"databases" appears 5 times in doc1
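With per-term statistics in hand, a metasearcher can ignore each source's own score and recompute one score of its own for every document. The sketch below uses a deliberately simple scoring rule (total occurrences of the query terms), which is an illustrative assumption and not the STARTS ranking method. The doc1 counts come from the slide; the doc3 counts are invented for the example.

```python
from typing import Dict, List, Tuple

# (document id, the source's own score, per-term occurrence counts)
Result = Tuple[str, float, Dict[str, int]]


def uniform_score(term_counts: Dict[str, int], query_terms: List[str]) -> int:
    # Assumed scoring rule: total occurrences of the query terms.
    return sum(term_counts.get(term, 0) for term in query_terms)


def merge(results: List[Result], query_terms: List[str]) -> List[Tuple[str, int]]:
    # Drop the incomparable source scores and rank by the recomputed one.
    rescored = [(doc, uniform_score(counts, query_terms))
                for doc, _own_score, counts in results]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


query_terms = ["distributed", "databases"]
db1 = [("doc1", 0.7, {"distributed": 3, "databases": 5})]   # from the slide
db2 = [("doc3", 1000, {"distributed": 1, "databases": 2})]  # invented counts
merged = merge(db1 + db2, query_terms)
print(merged)  # [('doc1', 8), ('doc3', 3)]
```

Here doc3's raw score of 1000 no longer dominates, because both documents are judged on the same scale.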
Motivating Source Metadata
Routing Problem - Disjoint Search Sources
Query: author=Hopcroft?

Content Summary (author: sources holding that author)
Hopcroft: I1, I3
Hartmanis: I3
Tarjan: I1, I2
Wilensky: I2

Query routed to: I1, I3
Results: doc1, doc2, doc8

Sources
I1: Hopcroft (doc8); Tarjan (doc9)
I2: Tarjan (doc6); Wilensky (doc7)
I3: Hopcroft (doc1, doc2); Hartmanis (doc3, doc4)
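The routing example above can be replayed in code. The data is taken from the slide (sources I1, I2, I3 and their author entries); the dictionary layout and the `route` and `search` function names are assumptions made for the sketch.

```python
from typing import Dict, List

# Content summary: which sources hold documents by which author.
content_summary: Dict[str, List[str]] = {
    "Hopcroft": ["I1", "I3"],
    "Hartmanis": ["I3"],
    "Tarjan": ["I1", "I2"],
    "Wilensky": ["I2"],
}

# The disjoint sources themselves: author -> documents.
sources: Dict[str, Dict[str, List[str]]] = {
    "I1": {"Hopcroft": ["doc8"], "Tarjan": ["doc9"]},
    "I2": {"Tarjan": ["doc6"], "Wilensky": ["doc7"]},
    "I3": {"Hopcroft": ["doc1", "doc2"], "Hartmanis": ["doc3", "doc4"]},
}


def route(author: str) -> List[str]:
    # Consult only the summary to decide which sources to contact.
    return content_summary.get(author, [])


def search(author: str) -> List[str]:
    docs: List[str] = []
    for source_name in route(author):
        docs.extend(sources[source_name].get(author, []))
    return sorted(docs)


print(route("Hopcroft"))   # ['I1', 'I3']
print(search("Hopcroft"))  # ['doc1', 'doc2', 'doc8']
```

Source I2 is never contacted for this query, which is the saving the content summary buys.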
Source Metadata
• Data to help select the right sources for a query
– source metadata attributes: what the source engine can do
– source content summary: what the source engine can
search
• Simplified form of Z39.50 “explain” service
Source metadata attributes
• Fields Supported
• Modifiers Supported
• Score Range
• Ranking Algorithm ID
Source Content Summary
For each source:
• Vocabulary
• Document frequency for each word
• Total number of postings for each word
• Number of documents
• Implementation of GlOSS work:
– GlOSS: Text-Source Discovery over the Internet, L. Gravano,
H. Garcia-Molina, A. Tomasic, in ACM Transactions on Database
Systems, vol. 24, no. 2, Jun. 1999
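A content summary of this kind is enough to rank sources without querying them. The sketch below follows the spirit of the Boolean GlOSS estimator: assuming query words occur independently, a source with N documents is expected to hold about N × Π(df_i / N) documents containing all the words. The function names, the summary layout, and the numbers are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def estimated_matches(num_docs: int, doc_freq: Dict[str, int],
                      query_terms: List[str]) -> float:
    # Expected number of documents containing every query term,
    # assuming the terms occur independently of one another.
    if num_docs == 0:
        return 0.0
    estimate = float(num_docs)
    for term in query_terms:
        estimate *= doc_freq.get(term, 0) / num_docs
    return estimate


def rank_sources(summaries: Dict[str, dict],
                 query_terms: List[str]) -> List[Tuple[str, float]]:
    scored = [(name, estimated_matches(s["num_docs"], s["doc_freq"], query_terms))
              for name, s in summaries.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up content summaries for two sources.
summaries = {
    "db1": {"num_docs": 100, "doc_freq": {"distributed": 20, "databases": 50}},
    "db2": {"num_docs": 1000, "doc_freq": {"distributed": 10, "databases": 100}},
}
ranking = rank_sources(summaries, ["distributed", "databases"])
print(ranking)
```

The smaller source ranks first here (roughly 10 expected matches against roughly 1), so source size alone is a poor guide.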
Distributed Searching Issues
Query Routing to Replicated Sources
Routing Problem
Replicated Distributed Indexes
Query: author=Hopcroft?

[Diagram: three indexes with overlapping contents]
• Hopcroft (doc8); Tarjan (doc9)
• Hopcroft (doc8); Tarjan (doc9); Tarjan (doc6); Wilensky (doc7)
• Tarjan (doc6); Wilensky (doc7)
Routing Issues
• Choice of primary?, secondary?, etc.
• Fault-tolerance
• Routing Factors
– Performance-based
– Freshness-based
– Cost-based
– Weighted mix based on user preference
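A weighted mix can be sketched as a single cost per replica, with the ranked list giving the primary first and the backups after it. Everything here is assumed for illustration: the metric names, the convention that each metric is normalized to [0, 1] with lower being better, and the sample numbers.

```python
from typing import Dict, List


def rank_replicas(replicas: List[Dict], weights: Dict[str, float]) -> List[str]:
    # Lower weighted cost is better; the first name is the primary and
    # the remaining names serve as fallbacks for fault tolerance.
    def cost(replica: Dict) -> float:
        return sum(weight * replica[metric] for metric, weight in weights.items())
    return [r["name"] for r in sorted(replicas, key=cost)]


# Made-up, normalized metrics (0 = best, 1 = worst).
replicas = [
    {"name": "A", "response_time": 0.2, "staleness": 0.9, "price": 0.5},
    {"name": "B", "response_time": 0.8, "staleness": 0.1, "price": 0.5},
]
speed_first = {"response_time": 0.8, "staleness": 0.1, "price": 0.1}
fresh_first = {"response_time": 0.1, "staleness": 0.8, "price": 0.1}

print(rank_replicas(replicas, speed_first))  # ['A', 'B']
print(rank_replicas(replicas, fresh_first))  # ['B', 'A']
```

The same two replicas swap roles when the user's preference shifts from speed to freshness.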
Components of Replicated Routing Problem
• Metadata Issue: metadata made available by
indexer to aid in routing
• Metadata Distribution Issue: topology of
metadata repositories
• Decision Issue: routing decision algorithms
• Fault-tolerance: use of backup indexers
Distributed Metadata for Query Routing
[Diagram: central metadata store]
Performance-based Routing
[Figure: timeline ending at the present, with window T; label: average response time]
New predicted response time = low pass filter(T, actual response time, old)
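The slide does not say which low pass filter is meant. One common choice is exponential smoothing, sketched below under two assumptions: that T is a window length counted in samples, so each new measurement gets weight 1/T, and that the third argument is the previous prediction. The function name and the numbers are hypothetical.

```python
def low_pass_filter(window: float, actual: float, old_prediction: float) -> float:
    # Exponential smoothing: move the prediction a fraction 1/window
    # of the way toward the latest measurement.
    alpha = 1.0 / window
    return old_prediction + alpha * (actual - old_prediction)


# An indexer predicted at 100 ms suddenly starts answering in 200 ms.
prediction = 100.0
history = []
for measured in [200.0, 200.0, 200.0]:
    prediction = low_pass_filter(4, measured, prediction)
    history.append(prediction)
print(history)  # [125.0, 143.75, 157.8125]
```

A larger window makes the prediction steadier but slower to notice that an indexer has become overloaded; a window of 1 simply tracks the last measurement.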