
XML Structured Document
Retrieval and Distributed Resource
Discovery
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
[email protected]
August 22, 2001
NASA Ames Lecture -- Ray R. Larson
Context
• NSF/JISC International Digital Library Grant
– Cross-Domain Resource Discovery: Integrated Discovery and Use
of Textual, Numeric and Spatial Data
• UC Berkeley DLI2 Grant:
– ReInventing Scholarly Information Access
• UC Berkeley working with the University of
Liverpool/Manchester Computing with participation from
– De Montfort University (MASTER)
– Art and Humanities Data Service (http://ahds.ac.uk/)
• OTA (Oxford), HDS (Essex), PADS (Glasgow), ADS (York), VADS
(Surrey & Northumbria)
– Consortium of University Research Libraries (CURL)
– UC Berkeley Library (and California Digital Library)
• Making of America II
• Online Archive of California
– British Natural History Museum, London
– NESSTAR (NEtworked Social Science Tools and Resources)
Research Areas
• Goals are
– Practical application of existing DL
technologies to some large-scale cross-domain
collections
– Theoretical examination and evaluation of next-generation designs for systems architecture and distributed cross-domain searching for DLs
Approach
• For the first goal, we are implementing a
distributed search system based on
international standards (Z39.50 and
SGML/XML) using the Cheshire II information
retrieval system
• Databases include:
– HE Archives Hub
– Arts and Humanities Data Service (AHDS)
– MASTER
– CURL (Consortium of University Research Libraries)
– Online Archive of California (OAC)
– Making of America II (MOA2)
Current Usage of Cheshire II
• Web clients for:
– Berkeley NSF/NASA/ARPA Digital Library
– World Conservation Digital Library
– SunSite (UC Berkeley Science Libraries)
– University of Liverpool
– Higher Education Archives Hub
• Glasgow, Edinburgh, Bath, Liverpool, Kings College London, University College London, Nottingham, Durham, School of Oriental and African Studies, Manchester, Southampton, Warwick and others (to be expanded)
– University of Essex, HDS (part of AHDS)
– Oxford Text Archive (test only)
– California Sheet Music Project
– Cha-Cha (Berkeley Intranet Search Engine)
– Berkeley Metadata project cross-language demo
– Univ. of Virginia (test implementations)
– Cheshire ranking algorithm is basis for original Inktomi
Current and Upcoming Usage of
Cheshire II
• DIEPER (Digitized European Periodicals) project
– http://gdz.sub.uni-goettingen.de/dieper/
• NESSTAR (Networked Social Science Tools and Resources)
– http://www.nesstar.org/
• FASTER – Flexible Access to Statistics Tables and
Electronic Resources. (Continuation of
NESSTAR)
– http://www.faster-data.org/
• MASTER (Manuscript Access through Standards for Electronic Records)
– http://www.cta.dmu.ac.uk/projects/master/
Upcoming Usage of Cheshire II
• ZETOC (Prototype of the Electronic Table of
Contents from the British Library)
– http://zetoc.mimas.ac.uk/
• Archives Hub
– http://www.archiveshub.ac.uk/
• RSLP Palaeography project
– http://www.palaeography.ac.uk/
• British Natural History Museum, London
• JISC data services directory hosted by MIMAS
• Resource Discovery Network (RDN), where it
will be used to harvest RDN records from the
various hubs using OAI and provide search
Client/Server Architecture
• Server Supports:
– Database storage
– Indexing
– Z39.50 access to local data
– Boolean and Probabilistic Searching
– Relevance Feedback
– External SQL database support
• Client Supports:
– Programmable (Tcl/Tk) Graphical User Interface
– Z39.50 access to remote servers
– SGML/XML & MARC formatting
• Combined Client/Server CGI scripting via WebCheshire
used for web applications
SGML/XML Support
• Underlying native format for all data is
SGML/XML
• The DTD defines the file format for each file
• Full SGML/XML parsing
• XML Configuration Files define the database
• USMARC DTD and MARC to SGML conversion
(and back again)
• Access to full-text via special SGML tags
• Support for SGML/XML component definition
and indexing
SGML/XML Support
• Example XML record for a DL document
<ELIB-BIB>
<BIB-VERSION>ELIB-v1.0</BIB-VERSION>
<ID>756</ID>
<ENTRY>June 12, 1996</ENTRY>
<DATE>June 1996</DATE>
<TITLE>Cumulative Watershed Effects: Applicability of Available Methodologies to
the Sierra Nevada</TITLE>
<ORGANIZATION>University of California</ORGANIZATION>
<TYPE>report</TYPE>
<AUTHOR-INSTITUTIONAL>USDA Forest Service</AUTHOR-INSTITUTIONAL>
<AUTHOR-PERSONAL>Neil H. Berg</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Ken B. Roby</AUTHOR-PERSONAL>
<AUTHOR-PERSONAL>Bruce J. McGurk</AUTHOR-PERSONAL>
<PROJECT>SNEP</PROJECT>
<SERIES>Vol 3</SERIES>
<PAGES>40</PAGES>
<TEXT-REF>/elib/data/docs/0700/756/HYPEROCR/hyperocr.html</TEXT-REF>
<PAGED-REF>/elib/data/docs/0700/756/OCR-ASCII-NOZONE</PAGED-REF>
</ELIB-BIB>
SGML/XML Support
• Example SGML/MARC Record
<USMARC Material="BK" ID="00000003"><leader><LRL>00722</LRL><RecStat>n</RecStat>
<RecType>a</RecType><BibLevel>m</BibLevel><UCP></UCP><IndCount>2</IndCount>
<SFCount>2</SFCount><BaseAddr>00229</BaseAddr><EncLevel> </EncLevel>
<DscCatFm></DscCatFm><LinkRec></LinkRec><EntryMap><FLength>4</Flength><SCharPos>
5</SCharPos><IDLength>0</IDLength><EMUCP></EMUCP></EntryMap></Leader>
<Directry>00100140000000500170001400800410003101000140007203500200008603500170010
610000190012324501050014225000110024726000320025830000330029050400500032365000360
0373700002200409700002200431950003200453998000700485</Directry><VarFlds>
<VarCFlds><Fld001>CUBGGLAD1282B</Fld001><Fld005>19940414143202.0</Fld005>
<Fld008>830810 1983
nyu
eng u</Fld008></VarCFlds>
<VarDFlds><NumbCode><Fld010 I1="Blank" I2="Blnk"><a>82019962 </a></Fld010>
<Fld035 I1="Blank" I2="Blnk"><a>(CU)ocm08866667</a></Fld035><Fld035 I1="Blank"
I2="Blnk"><a>(CU)GLAD1282</a></Fld035></NumbCode><MainEnty><Fld100
NameType="Single" I2=""><a>Burch, John G.</a></Fld100></MainEnty><Titles><Fld245
AddEnty="Yes" NFChars="0"><a>Information systems :</a><b>theory and practice
/</b><c>John G. Burch, Jr., Felix R. Strater, Gary
Grudnitski</c></Fld245></Titles><EdImprnt><Fld250 I1="Blank" I2="Blnk"><a>3rd
ed</a></Fld250><Fld260 I1="" I2="Blnk"><a>New York :</a><b>J.
Wiley,</b><c>1983</c></Fld260></EdImprnt><PhysDesc><Fld300 I1="Blank"
I2="Blnk"><a>xvi, 632 p. :</a><b>ill. ;</b><c>24
cm</c></Fld300></PhysDesc><Series></Series><Notes><Fld504 I1="Blank"
I2="Blnk"><a>Includes bibliographical references and
index</a></Fld504></Notes><SubjAccs><Fld650 SubjLvl="NoInfo"
SubjSys="LCSH"><a>Management information systems.</a></Fld650> ...
SGML/XML Support
• Configuration files for the Server are also
SGML/XML:
– They include tags describing all of the data files
and indexes for the database.
– They also include instructions on how data is to
be extracted for indexing and how Z39.50
attributes map to the indexes for a given
database.
– They include definition of components and
component indexes
Component Extraction and
Retrieval
• Any sub-elements of an SGML/XML document
can be defined as a separately indexed
“component”.
• Components can be ranked and retrieved
independently of the source document (but linked
back to their original source)
• For example, paragraphs and abstracts in the full text of documents could be defined as components to provide paragraph-level search
• Example: the Glasier archive…
Component Extraction and
Retrieval
• The Glasier archive is an EAD document (1.9 MB in size)
• Contains “Series, Subseries, and Item level”
descriptions of things in the archive
Excerpt from Glasier Archive
<c level="subseries">
<did>
<head>GP-1-1: General correspondence. Public letters.</head>
<unitid id="gp-1-1">GP-1-1</unitid>
<unittitle>Glasier Papers. General correspondence. Public letters.</unittitle>
</did>
<arrangement>
<head>Arrangement </head>
<p>Public letters arranged alphabetically within each year </p>
</arrangement>
<c level="item" langmaterial="eng">
<did>
<unitid id="gp-1-1-0001">GP-1-1-0001</unitid>
<unittitle>Letter from Richard Murray. <geogname>Glasgow</geogname>; <unitdate>7 Apr 1879</unitdate>.</unittitle>
<origination><persname>Murray, Richard</persname></origination>
<physdesc><extent>1 letter</extent></physdesc>
</did>
<note><p>Employment reference for J.B.G. as draughtsman<subject>Glasier, John
Bruce</subject></p>
</note>
</c>
ETC….
Example Component Def
…
<COMPONENTS>
<COMPONENTDEF>
<COMPONENTNAME>
/home/ray/Work/Glasier_test/indexes/COMPONENT_DB1
</COMPONENTNAME>
<COMPONENTNORM>NONE</COMPONENTNORM>
<COMPSTARTTAG>
<tagspec>
<FTAG> c </FTAG><ATTR> level <VALUE>item</VALUE></ATTR>
</tagspec>
</COMPSTARTTAG>
<COMPONENTINDEXES>
<!-- First index def -->
…
Components
• Both individual tags and “ranges” with a
starting tag and (different) ending tag can be
used as components
• Components permit parts of complex
SGML/XML documents to be treated as
separate documents
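To make the component idea concrete, here is a minimal Python sketch (not Cheshire's actual implementation) that pulls item-level <c> elements out of an EAD finding aid like the Glasier excerpt shown earlier and emits each one as a separately indexable component linked back to its source document; the helper and field names are illustrative only.

import xml.etree.ElementTree as ET

def extract_components(ead_path, level="item"):
    """Yield item-level <c> elements of an EAD finding aid as separate components.

    Each component keeps a pointer back to its source document so that a
    component-level hit can be linked to the full finding aid.
    """
    tree = ET.parse(ead_path)
    for c in tree.iter("c"):
        if c.get("level") != level:
            continue
        unitid = c.findtext("did/unitid", default="")
        text = " ".join(t.strip() for t in c.itertext() if t.strip())
        yield {"source": ead_path, "unitid": unitid, "text": text}

# e.g. for comp in extract_components("glasier.xml"): add comp to the component index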
Cheshire II Searching
[Diagram: local and remote clients and servers connected via Z39.50 over the Internet, providing access to text and scanned images]
Boolean Search Capability
• All Boolean operations are supported
– “zfind author x and (title y or subject z) not subject A”
• Named sets are supported and stored on the server
• Boolean operations between stored sets are
supported
– “zfind SET1 and subject widgets or SET2”
• Nested parentheses and truncation are supported
– “zfind xtitle Alice#”
Probabilistic Models
• Rigorous formal model attempts to predict
the probability that a given document will
be relevant to a given query
• Ranks retrieved documents according to this
probability of relevance (Probability
Ranking Principle)
• Relies on accurate estimates of probabilities
Probability Ranking Principle
• If a reference retrieval system’s response to each
request is a ranking of the documents in the
collections in the order of decreasing probability
of usefulness to the user who submitted the
request, where the probabilities are estimated as
accurately as possible on the basis of whatever
data has been made available to the system for this
purpose, then the overall effectiveness of the
system to its users will be the best that is
obtainable on the basis of that data.
Stephen E. Robertson, J. Documentation 1977
Probabilistic Models: Logistic
Regression
• Estimates for relevance based on log-linear
model with various statistical measures of
document content as independent variables.
Log odds of relevance is a linear function of attributes:

$$\log O(R \mid q_i, d_j, t_k) = c_0 + c_1 v_1 + c_2 v_2 + \cdots + c_n v_n$$

Term contributions are summed:

$$\log O(R \mid q_i, d_j) = \sum_{k=1}^{m} \left[ \log O(R \mid q_i, d_j, t_k) - \log O(R) \right]$$

Probability of relevance is recovered by inverting the log odds (logistic transform):

$$P(R \mid q_i, d_j) = \frac{1}{1 + e^{-\log O(R \mid q_i, d_j)}}$$
[Figure: logistic regression curve of estimated relevance versus term frequency in document]
Probabilistic Retrieval: Logistic
Regression
Probability of relevance is based on logistic regression, with a sample set of documents (TREC) used to determine the values of the coefficients. At retrieval time the probability estimate is obtained by:

$$P(R \mid Q, D) = c_0 + \sum_{i=1}^{6} c_i X_i$$

for the six attribute measures X_i shown on the next slide.
Probabilistic Retrieval: Logistic
Regression attributes
X1, Average Absolute Query Frequency:
$$X_1 = \frac{1}{M} \sum_{j=1}^{M} \log QAF_{t_j}$$

X2, Query Length:
$$X_2 = QL$$

X3, Average Absolute Document Frequency:
$$X_3 = \frac{1}{M} \sum_{j=1}^{M} \log DAF_{t_j}$$

X4, Document Length:
$$X_4 = DL$$

X5, Average Inverse Document Frequency:
$$X_5 = \frac{1}{M} \sum_{j=1}^{M} \log IDF_{t_j}, \quad IDF_{t_j} = \frac{N - n_{t_j}}{n_{t_j}}$$

X6, Number of terms in common between query and document -- logged:
$$X_6 = \log M$$
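As a worked illustration, here is a minimal Python sketch that computes the six attributes and the retrieval-time score, assuming X2 and X4 are the raw query and document lengths as reconstructed above (the Berkeley TREC formulation may transform them); the coefficient values are made-up placeholders, not the trained coefficients.

import math

def logistic_regression_score(query_terms, doc_terms, doc_freqs, n_docs, coeffs):
    """Score one document with the six-attribute logistic regression model.

    query_terms: dict term -> frequency in the query
    doc_terms:   dict term -> frequency in the document
    doc_freqs:   dict term -> number of documents containing the term (n_t)
    n_docs:      total number of documents in the collection (N)
    coeffs:      [c0, c1, ..., c6]; placeholder values, not the trained coefficients
    """
    matches = [t for t in query_terms if t in doc_terms]
    M = len(matches)
    if M == 0:
        return 0.0

    x1 = sum(math.log(query_terms[t]) for t in matches) / M   # avg absolute query frequency
    x2 = sum(query_terms.values())                             # query length (QL)
    x3 = sum(math.log(doc_terms[t]) for t in matches) / M      # avg absolute document frequency
    x4 = sum(doc_terms.values())                               # document length (DL)
    x5 = sum(math.log((n_docs - doc_freqs[t]) / doc_freqs[t])
             for t in matches) / M                             # avg inverse document frequency
    x6 = math.log(M)                                           # matching terms, logged

    log_odds = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], [x1, x2, x3, x4, x5, x6]))
    return 1.0 / (1.0 + math.exp(-log_odds))                   # logistic transform

# Toy usage with made-up coefficients:
score = logistic_regression_score(
    {"cheshire": 1, "cat": 2},
    {"cheshire": 3, "cat": 5, "grin": 2},
    {"cheshire": 12, "cat": 250, "grin": 40},
    10000,
    [-3.5, 0.4, -0.01, 0.5, -0.001, 0.6, 0.8],
)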
Cheshire Probabilistic Retrieval
• Uses the Logistic Regression ranking method developed at Berkeley, with a new algorithm for weight calculation at retrieval time.
• Z39.50 “relevance” operator used to indicate
probabilistic search
• Any index can have Probabilistic searching
performed:
– zfind topic @ “cheshire cats, looking glasses, march hares
and other such things”
– zfind title @ caucus races
• Boolean and Probabilistic elements can be combined:
– zfind topic @ government documents and title guidebooks
Combining Search Types
• It is also possible to combine the results of multiple
independent searches into a single result set. (using
the Z39.50 SORT service of the Cheshire system)
– E.g.:
– Search of Full Text (Probabilistic)
– Search of Full Text (Boolean)
– Search of Components (Probabilistic)
– Search of Titles (Probabilistic)
– Search of Subject Headings (Probabilistic)
• All result sets are merged and re-ranked to produce
the final list.
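A minimal sketch of the merge-and-re-rank step described above, assuming each independent search returns document-id/score pairs; the combination rule used here (sum of per-set normalized scores) is purely illustrative and not necessarily what the Z39.50 SORT service applies.

def merge_result_sets(result_sets):
    """Merge several independent result sets into one ranked list.

    result_sets: list of dicts mapping doc_id -> score, one dict per search
                 (probabilistic scores, or 1.0 for every Boolean hit).
    Scores are normalized within each set and summed per document.
    """
    combined = {}
    for results in result_sets:
        if not results:
            continue
        top = max(results.values()) or 1.0
        for doc_id, score in results.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + score / top
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# E.g. a probabilistic full-text search, a Boolean full-text search and a
# probabilistic title search merged into one final ranking:
final = merge_result_sets([
    {"doc1": 0.83, "doc2": 0.41},
    {"doc2": 1.0, "doc3": 1.0},
    {"doc1": 0.65},
])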
Relevance Feedback.
• Any records in a result set can be used for
Relevance Feedback
• Uses the “set name” to receive feedback
instructions.
– zfind SET1:2,5-9,30,45
– zfind SET2:6
• Chosen records are used to build a new
probabilistic query
• Ranked results are returned
• Planned support for (modified) Rocchio RF
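A rough sketch of the feedback step, assuming the chosen records are available as term-frequency dictionaries; pooling the most frequent terms into a new ranked query is a common simplification, not necessarily Cheshire's exact method.

from collections import Counter

def feedback_query(chosen_records, max_terms=20):
    """Build a new ranked (probabilistic) query from user-selected records.

    chosen_records: term-frequency dicts for the chosen records,
                    e.g. records 2, 5-9, 30 and 45 of result set SET1.
    Returns the most frequent terms across those records as the new query.
    """
    pooled = Counter()
    for record in chosen_records:
        pooled.update(record)
    return [term for term, _ in pooled.most_common(max_terms)]

# The resulting terms would then be run as a ranked search, much as in:
#   zfind topic @ "<selected feedback terms>"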
Cheshire II - Two-Stage Retrieval
(EVM generation)
• Example: Using the LC Classification System
– Pseudo-Document created for each LC class containing terms
derived from “content-rich” portions of documents in that
class (subject headings, titles, etc.)
– Permits searching by any term in the class
– Ranked Probabilistic retrieval techniques attempt to present
the “Best Matches” to a query first.
– User selects classes to feed back for the “second stage” search
of documents (which includes info from first stage selections)
• Can be used with any classified/Indexed collection and
controlled vocabulary
Automatic Class Assignment
Automatic Class Assignment: polythetic, exclusive or overlapping, usually ordered; clusters are order-independent, usually based on an intellectually derived scheme
[Diagram: each document is run as a query against a search engine over the class pseudo-documents]
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to the N categories ranked over a threshold, OR assign to the top-ranked category
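A small sketch of steps 2 through 4, assuming the class pseudo-documents are already indexed and a generic ranked search function is available (both names are hypothetical stand-ins):

def assign_classes(document_text, search, n_max=3, threshold=0.5):
    """Assign a document to classification categories via ranked retrieval.

    document_text: the document's own contents, used as the query (step 2)
    search: hypothetical function returning a ranked list [(class_id, score), ...]
            over the pseudo-documents that represent the classes (step 3)
    Returns up to n_max classes scoring above threshold, or the single
    top-ranked class if none clears the threshold (step 4).
    """
    ranked = search(document_text)
    if not ranked:
        return []
    above = [cls for cls, score in ranked[:n_max] if score >= threshold]
    return above if above else [ranked[0][0]]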
Cheshire II - Cluster Generation
• Define basis for clustering records.
– Select field to form the basis of the cluster.
– Evidence Fields to use as contents of the pseudo-documents.
• During indexing cluster keys are generated with
basis and evidence from each record.
• Cluster keys are sorted and merged on basis and
pseudo-documents created for each unique basis
element containing all evidence fields.
• Pseudo-Documents (Class clusters) are indexed on
combined evidence fields.
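A compact sketch of the sort/merge of cluster keys described above, with hypothetical field names; each record contributes its basis value (for example an LC class number) plus its evidence fields, and keys sharing a basis are merged into one pseudo-document:

from collections import defaultdict

def build_pseudo_documents(records, basis_field, evidence_fields):
    """Merge per-record cluster keys into one pseudo-document per basis value.

    records: list of dicts, e.g. {"lc_class": "QA76", "title": "...", "subjects": [...]}
    basis_field: field whose value defines the cluster (e.g. "lc_class")
    evidence_fields: fields whose contents are pooled into the pseudo-document
    """
    clusters = defaultdict(list)
    for record in records:                       # one cluster key per record
        basis = record.get(basis_field)
        if not basis:
            continue
        for field in evidence_fields:
            value = record.get(field)
            if isinstance(value, list):
                clusters[basis].extend(value)
            elif value:
                clusters[basis].append(value)
    # Each merged pseudo-document is then indexed like any other document.
    return {basis: " ".join(evidence) for basis, evidence in clusters.items()}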
Cheshire II - Two-Stage Retrieval
• Using the MeSH Subject Heading System
– Pseudo-Document created for each MeSH heading containing
terms derived from “content-rich” portions of documents in
that class (other subject headings, titles, abstract, etc.)
– Permits searching by any term in the class
– Ranked Probabilistic retrieval techniques attempt to present
the “Best Matches” to a query first.
– User selects classes to feed back for the “second stage” search
of documents.
• Can be used with any classified/Indexed collection.
Distributed Search: The
Problem
• Hundreds or Thousands of servers with
databases ranging widely in content, topic,
format
– Broadcast search is expensive in terms of
bandwidth and in processing too many
irrelevant results
– How to select the “best” ones to search?
• What to search first
• Which to search next
– Topical/domain constraints on the search selections
– Variable contents of database (metadata only,
full text…)
An Approach for Cross-Domain
Resource Discovery
• MetaSearch
– New approach to building metasearch based on Z39.50
– Instead of using broadcast search we are using two Z39.50
Services
• Identification of database metadata using Z39.50 Explain
• Extraction of distributed indexes using Z39.50 SCAN
• Evaluation
– How efficiently can we build distributed indexes?
– How effectively can we choose databases using the index?
– How effective is merging search results from multiple
sources?
– Hierarchies of servers (general/meta-topical/individual)?
Z39.50 Overview
[Diagram: two client UIs and a search engine communicating over the Internet; on each side, queries and results are mapped between the local form and the Z39.50 standard form]
Z39.50 Explain
• Explain supports searches for
– Server-Level metadata
• Server Name
• IP Addresses
• Ports
– Database-Level metadata
• Database name
• Search attributes (indexes and combinations)
– Support metadata (record syntaxes, etc)
Z39.50 SCAN
• Originally intended to support Browsing
• Query for
– Database
– Attributes plus Term (i.e., index and start point)
– Step Size
– Number of terms to retrieve
– Position in Response set
• Results
– Number of terms returned
– List of Terms and their frequency in the database (for
the given attribute combination)
Z39.50 SCAN Results
Syntax: zscan indexname1 term stepsize number_of_terms pref_pos

zscan topic cat 1 20 1
{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 706}
{cat-and-mouse 19}
{cat-burglar 1}
{cat-carrying 1}
{cat-egory 1}
{cat-fight 1}
{cat-gut 1}
{cat-litter 1}
{cat-lovers 2}
{cat-pee 1}
{cat-run 1}
{cat-scanners 1} …

% zscan title cat 1 20 1
{SCAN {Status 0}
{Terms 20}
{StepSize 1}
{Position 1}}
{cat 27}
{cat-fight 1}
{catalan 19}
{catalogu 37}
{catalonia 8}
{catalyt 2}
{catania 1}
{cataract 1}
{catch 173}
{catch-all 3}
{catch-up 2} …
MetaSearch Server Index
Creation
• For all servers, or a topical subset…
– Get Explain information (especially DC
mappings)
– For each index (or each DC index)
• Use SCAN to extract terms and frequency
• Add term + freq + source index + database
metadata to the metasearch “Collection
Document” (XML)
– Planned extensions:
• Post-Process indexes (especially Geo
Names, etc) for special types of data
– e.g. create “geographical coverage” indexes
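A rough sketch of this harvesting loop; explain_indexes and z3950_scan are hypothetical stand-ins for the Z39.50 Explain and SCAN requests, and the collection-document element names are illustrative rather than Cheshire's actual schema.

import xml.etree.ElementTree as ET

def build_collection_document(server, explain_indexes, z3950_scan, page_size=100):
    """Harvest index terms from one Z39.50 server into a 'collection document'.

    server: dict such as {"host": ..., "port": ..., "database": ...}
    explain_indexes: hypothetical callable returning the (Dublin Core) indexes
                     discovered for the server via Z39.50 Explain
    z3950_scan: hypothetical callable (server, index, start_term, count) returning
                [(term, frequency), ...] via the Z39.50 SCAN service
    """
    doc = ET.Element("COLLECTIONDOC", {"database": server["database"]})
    ET.SubElement(doc, "SERVER").text = "%s:%s" % (server["host"], server["port"])

    for index in explain_indexes(server):
        index_el = ET.SubElement(doc, "INDEX", {"name": index})
        start = ""                                   # scan from the start of the index
        while True:
            terms = z3950_scan(server, index, start, page_size)
            for term, freq in terms:
                ET.SubElement(index_el, "TERM", {"freq": str(freq)}).text = term
            if len(terms) < page_size:
                break                                # reached the end of this index
            start = terms[-1][0] + " "               # resume just past the last term
    return ET.tostring(doc, encoding="unicode")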
MetaSearch Approach
[Diagram: a MetaSearch server holding a distributed index sits between the client and several Z39.50 search engines (DB 1 through DB 6); Explain and SCAN queries are sent to each server to build the index, and ordinary queries and results are mapped between the client and the selected servers over the Internet]
Known Problems
• Not all Z39.50 Servers support SCAN or
Explain
• Solutions:
– Probing for attributes instead of explain (e.g.
DC attributes or analogs)
– We also support OAI and can extract OAI
metadata for servers that support OAI
• Collection Documents are static and need to
be replaced when the associated collection
changes
Evaluation
• Test Environment
– TREC Tipster and FT data (approx. 3.5 GB)
– Partitioned into 236 smaller collections based on source
and (for TIPSTER) date by month (Distributed Search
Testbed built by French, et al.)
• High size variability (Range from 1 to thousands of docs)
• 21,225,299 Words, 142,345,670 chars total for harvested
records
• Efficiency (old data)
– Average of 23.07 seconds per database to SCAN each
database (3.4 indexes on average)
– Average of 14.07 seconds excluding FT (131 seconds
for FT database with 7 indexes)
– Now collecting more information, so harvest times are longer, but still under one minute on average
Evaluation
• Effectiveness
– Still working on evaluation comparing our DB
ranking with the TIPSTER relevance
judgements
– Can be compared with published selection
methods (CORI, GlOSS, etc.) using the same
testbed
Future
• Testing of variant algorithms for ranking
collections
• Application to real systems and testing in a
production environment (Archives Hub)
• Logically Clustering servers by topic
• Meta-Meta Servers (treating the
MetaSearch database as just another
database)
Distributed Metadata Servers
[Diagram: hierarchy of general servers, meta-topical servers, and database servers, with replicated servers at each level]
Conclusion
• A lot of interesting work to be done
– Redesign and development of the Cheshire II
system
– Evaluating new meta-indexing methods
– Developing and Evaluating methods for
merging cross-domain results (or, perhaps,
when to keep them separate)
Further Information
• Full Cheshire II client and server source is
available
ftp://cheshire.berkeley.edu/pub/cheshire/
– Includes HTML documentation
– Also on Berkeley Digital Library Software
Distribution CD
• Project Web Site
http://cheshire.berkeley.edu/