norvag - ECDL 2003

Download Report

Transcript norvag - ECDL 2003

Space-Efficient Support for Temporal Text
Indexing in a Document Archive Context
Kjetil Nørvåg
Department of Computer and Information Science
Norwegian University of Science and Technology
Trondheim, Norway
(Work done during visit at Aalborg University, Denmark)
Outline
Motivation and example application
 The temporal text-indexing approach used in V2
 A more space-efficient approach: ITTX
 Comparison
 Summary and further work

August 20, 2003
ECDL'2003
2
Motivation
Amount of data available in various documents
rapidly increasing
 Storage getting cheaper

Less need for deleting data!
Can more often afford to store previous versions
August 20, 2003
ECDL'2003
3
Example application:
Temporal web warehouse
Versions retrieved and
stored when database is
created
tS
tNow
Time line
http://www.cnn.com/
Web
Page
http://www.idi.ntnu.no/
Web
Page
http://www.idi.ntnu.no/grupper/db/
Web
Page
http://www.idi.ntnu.no/~noervaag/
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Transaction
time t
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Web
Page
Most
recent
versions
at time
tNow
Versions valid at time tS
Web
Page

: Denotes storage of new
version of web page. Only
changed pages are stored as a
new version.
Related projects:
– Internet Archive Wayback Machine
– Several projects at national level in different countries
August 20, 2003
ECDL'2003
4
Our goal

Want to query:
– Historical versions, e.g., “all documents containing bin
Laden & created before September 11, 2001”
– Changes, e.g., “all documents that did not contain bin
Laden before September 11, 2001, but contained these
words afterwards

Why?
– For example: Identifying trends, web archive mining,
“investigations”, etc…
August 20, 2003
ECDL'2003
5
What is the problem?

Temporal textcontainment queries:
Q: Give me all document
versions that contained
the word ”Kjetil” at date
”August 25. 2002”

Expensive query without
suitable index
August 20, 2003
ECDL'2003
6
Context: the V2 temporal document
database system



Supports storage, retrieval, and querying of transactiontime temporal documents
Support for temporal text-containment queries
Emphasis on using/developing techniques easy to integrate
into existing systems
API
Operators
Document Version
Management
External
Documents
August 20, 2003
Document
Index
Version
Database
ECDL'2003
Text Indexing
Operations
Text Index
7
Temporal text indexing in V2
prototype: first version

Document versions uniquely identified by version
identifiers (VIDs)
–



Basic text index indexes all versions
Simple (but fairly efficient) support structure:
VP index: maps from VID to validity time periods for
versions
Temporal text query processing:
1.
2.

Given by name and timestamp  VID
Text index query on all versions
Time-select step using VP index
Efficient under assumption that VP index fits in main
memory
August 20, 2003
ECDL'2003
8
From the V2 approach to ITTX:
Interval-based Temporal Text indeXing



Problem of original approach: size of text index grows
proportional with size of document database
Want: size of text index to grow proportional with size of
changes
Solution: interval based indexing
–
–
–
Use document identifier (DID) and document- version identifier
(DVID) to identify version
Conceptually in text index for each word-occurrence for document
valid from TS to TE :
(Word, DID, DVID, TS, TE)
Entries for consecutive DVIDs stored as interval:
(Word, DID, DVID, DVID, TS, TE)
August 20, 2003
ECDL'2003
9
Separate indexes for word occurrences
in current and historical documents
Assume queries for current documents will still
be most frequent
separate index for entries that are still valid
smaller amount of entries have to be processed
 Avoid storing unknown end timestamps for
current versions
save some space

August 20, 2003
ECDL'2003
10
Temporal text-index structures
HTxtIdx
W DID DVID,DVID,TS,TE ... DVID,DVID,TS,TE
... DID DVID,DVID,TS,TE ... DVID,DVID,TS,TE
...
W DID DVID,DVID,TS,TE ... DVID,DVID,TS,TE
... DID DVID,DVID,TS,TE ... DVID,DVID,TS,TE
CTxtIdx
WDID,DVID,TS ... DID,DVID,TS
August 20, 2003
WDID,DVID,TS ... DID,DVID,TS
ECDL'2003
... WDID,DVID,TS ... DID,DVID,TS
11
Operation: insert document at time t
1.
2.
3.
Allocate document identifier d
Insert document into version database
For all distinct words W in document, insert
(Word=W, DID=d, DVID=0, TS=t) into CTxtIdx
August 20, 2003
ECDL'2003
12
Operation: update document d at time t
1.
2.
3.
4.
Read previous version with DVID=j
DVID=j+1 allocated for new version
For all new distinct words W in document, insert
(Word=W, DID=d, DVID=j+1, TS=t) into CTxtIdx
For all words that disappeared between versions:
1. Remove (Word, DID, DVID=i, TS) from CTxtIdx
2. Insert (Word, DID, DVID=i, TS,, TE=t) into HTxtIdx
August 20, 2003
ECDL'2003
13
Operation: temporal snapshot singleword text-containment query

1.
2.
3.
4.

Task: querying for all document versions that contained a particular word
WS at time t
HTxtIdx: Retrieve (Word, DID, DVIDi, DVIDj, TS, TE)
where Word= WS and TS ≤ t ≤ TE
CTxtIdx: Retrieve (Word, DID, DVIDj, TS)
where Word= WS and t ≥ TS
Interesting part of result: set of (DID, DVIDj, DVIDj) tuples
Do not know exact DVID, lookup in doc-version database and doc-name
index needed
Multi-word query: retrieval of all postings for word only necessary for
one of the words, for other words only selective (Word, DIDx) needed
August 20, 2003
ECDL'2003
14
Comparison: ITTX vs. original V2

Advantages of ITTX:
– Smaller index size
– More efficient non-temporal (current) text-containment
queries
– Average cost of updating document/index entries much
lower
August 20, 2003
ECDL'2003
15
Possible problem with ITTX:
Data reduction

Granularity reduction
1
1
2
5
10
5
10
12 13
15
20
25
15
20
25
27
30
30
– Results in fragmented intervals in text index  more space needed

Vacuuming: physically remove some non-current versions or deleted
documents
– No problem with ITTX
August 20, 2003
ECDL'2003
16
Summary and further work




The motivation and context
The (previous) approach, currently used in V2
The new/improved approach
Ongoing work:
–
–
–
–
New version of the V2 document database system
Will include implementation of ITTX
Will support XML and temporal XML queries
Study approaches that can achieve better clustering in the
temporal dimension, e.g., TSB-tree-like approaches
August 20, 2003
ECDL'2003
17