
Integration of Database and Information Retrieval Systems
- A Study on the Development of a Document Database System
by
Chung-Hong Lee (李俊宏)
Assistant Professor
Dept. of Information Management
Chang Jung Christian University
AGENDA
• Introduction
• Comparison of the DBMS and IR approaches for document retrieval
• Proposed signature based IR technique
• System architecture
• Integration method
• Conclusions
A Document Retrieval Model
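[Diagram]
Document side: Document collections -> Indexing -> Signature of documents
Query side: User query -> Query formulation -> Signature of query
Both signatures -> Matching of similarity -> Results of retrieval
Relevance feedback: from the results back to query formulation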
Convergence of Information Retrieval and Database
Information Retrieval:
• proprietary
• application dependent
• rich search capability
Database:
• standardization
• application independent
• limited search capability
Convergence of the two:
• standardization
• application independent
• powerful modeling capability
• powerful search capability
• rich application development tools
Related work
• A text-search extension to the ORION OODBMS developed by Lee (1991).
• The integration of the INQUERY text retrieval system and the IRIS OODBMS proposed by Croft (1992).
• Mapping the SGML document structures into OODBMS’s data models:
– Christophides (1994).
– Macleod (1995).
– Volz (1996), etc.
Unlike some of the above efforts, which aim to model only SGML documents in a DBMS, our system is aimed at handling heterogeneous types of documents, such as textual and multimedia documents, and at providing content-based retrieval functions over the stored document objects.
Why OODBMS?
The core features of OODBMS supported by most such systems are:
1. Complex objects
2. Object identity
3. Encapsulation
4. Types and Classes
5. Class or Type Inheritance
6. Overriding, overloading and late binding
7. Computational Completeness
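Several of these features can be illustrated by analogy in an ordinary object-oriented language. The sketch below is Python, not an OODBMS data definition language. The class names echo the path expressions used later in the talk (plaintext_doc, hypertext_doc); the attributes and the summary() method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Base class; subclasses inherit its attributes (class inheritance)."""
    title: str
    # Complex object: a document may be composed of other document objects.
    parts: List["Document"] = field(default_factory=list)

    def summary(self) -> str:
        return f"document: {self.title}"


@dataclass
class PlaintextDoc(Document):
    body: str = ""

    def summary(self) -> str:  # overriding the inherited method
        return f"plaintext: {self.title} ({len(self.body)} chars)"


@dataclass
class HypertextDoc(Document):
    markup: str = "SGML"

    def summary(self) -> str:  # overriding the inherited method
        return f"hypertext: {self.title} [{self.markup}]"


a = PlaintextDoc(title="report", body="signature files")
b = PlaintextDoc(title="report", body="signature files")
compound = Document(title="bundle", parts=[a, HypertextDoc(title="manual")])

# Object identity: two objects with equal state are still distinct objects.
same_state = a == b      # True
same_object = a is b     # False

# Late binding: the method that runs depends on each object's runtime class.
summaries = [part.summary() for part in compound.parts]
```

Here `compound` plays the role of a complex object, and the list comprehension shows late binding: the caller never names the concrete class.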
Signature file approach
(I) Signature Generation:
word          coding
Object        0010 1000 …… 0000 0001
Database      0000 1001 …… 0000 0000
Management    0000 0000 …… 0001 1001
System        0000 0000 …… 0000 0001
OR result     0010 1001 …… 0001 1001   = S(P) (document signature)

(II) Signature Matching:
Signature of a query   0000 0000 …… 0000 1001
S(P)                   0010 1001 …… 0001 1001
AND result             0000 0000 …… 0000 1001
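The generation and matching steps above follow the usual superimposed-coding pattern: word signatures are OR-ed into a document signature, and a document is a candidate when the AND of its signature with the query signature gives back the query signature. Below is a minimal sketch of that pattern. The signature width, the number of bits per word and the MD5-based bit selection are assumptions for illustration, not the coding scheme used in the talk.

```python
import hashlib

SIG_BITS = 64       # assumed signature width
BITS_PER_WORD = 3   # assumed number of bit positions set per word


def word_signature(word: str) -> int:
    """Map a word to a sparse bit pattern (hypothetical hashing scheme)."""
    sig = 0
    for i in range(BITS_PER_WORD):
        digest = hashlib.md5(f"{i}:{word.lower()}".encode("utf-8")).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)
    return sig


def signature(words) -> int:
    """Superimpose (bitwise OR) the word signatures."""
    sig = 0
    for word in words:
        sig |= word_signature(word)
    return sig


def may_contain(doc_sig: int, query_sig: int) -> bool:
    """True when every bit of the query is also set in the document."""
    return (doc_sig & query_sig) == query_sig


doc_sig = signature(["Object", "Database", "Management", "System"])
query_sig = signature(["database", "system"])
candidate = may_contain(doc_sig, query_sig)

# The four chunks shown on the slide (the elided middle part is left out).
SLIDE_SP = 0b0010_1001_0001_1001
SLIDE_QUERY = 0b0000_0000_0000_1001
slide_match = may_contain(SLIDE_SP, SLIDE_QUERY)
```

A match only marks a candidate. Unrelated words can set the same bits, so a later stage still has to check the document itself.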
Concept of the scalable signature file approach
• Document signatures are generated from the Chinese characters that the documents contain
• Each document signature is divided into two segments: the first segment represents the occurrence of commonly-used Chinese characters, while the second segment represents the occurrence of the remaining Chinese characters and of English character bigrams
• The signature size can be adjusted according to the average length of the documents
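The two-segment idea can be sketched as follows. Everything concrete here is an assumption made for illustration: the tiny table of common characters, the Unicode range used to detect Chinese characters, the MD5-based hashing for the second segment, and the rule that sizes the second segment from the average document length.

```python
import hashlib
import re

# Hypothetical table of commonly-used characters; a real one would be far larger.
COMMON_CHARS = "的一是不人在有我"
COMMON_INDEX = {ch: i for i, ch in enumerate(COMMON_CHARS)}


def is_chinese(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"


def hashed_bit(token: str, size: int) -> int:
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % size


def features(text: str):
    """Yield Chinese characters and lower-cased English character bigrams."""
    for ch in text:
        if is_chinese(ch):
            yield ch
    for word in re.findall(r"[A-Za-z]+", text):
        w = word.lower()
        for i in range(len(w) - 1):
            yield w[i:i + 2]


def two_segment_signature(text: str, second_size: int):
    first = 0   # one fixed bit per commonly-used character
    second = 0  # hashed bits for all other characters and bigrams
    for f in features(text):
        if f in COMMON_INDEX:
            first |= 1 << COMMON_INDEX[f]
        else:
            second |= 1 << hashed_bit(f, second_size)
    return first, second


def second_segment_size(avg_doc_length: int, bits_per_char: int = 2,
                        minimum: int = 64) -> int:
    """Assumed sizing rule: grow the second segment with average length."""
    return max(minimum, avg_doc_length * bits_per_char)


def matches(doc_sig, query_sig) -> bool:
    return all((d & q) == q for d, q in zip(doc_sig, query_sig))


size = second_segment_size(200)
doc = two_segment_signature("我有一個文件資料庫 database system", size)
query = two_segment_signature("資料庫 database", size)
is_candidate = matches(doc, query)
```

Because common characters get a fixed bit each, they cannot collide with one another in the first segment; only the second segment is subject to hash collisions.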
System Architecture (1)
[Diagram components]
• Document input
• Document indexing module
• OODBMS
• GUI
• Wrapper interface
• Text Retrieval Engine
• Retrieved document objects for full-text search
System Architecture (2)
Key features:
• Two stage search
• Both IR and OQL queries are available
• Signature file as a preprocessor for IR queries
• Documents are stored as BLOB objects in the OODBMS
[Diagram labels: Search Engine (Signature file, OODB search engine); OODBMS; IR queries (word, term & phrase); OQL queries; retrieved document objects]
Signature file as a pre-processor of the database queries
[Diagram: Queries -> Information filter (signature file) -> list file L (list of candidate objects) -> OQL queries to OODBMS]
Signature file (signature, path expression):
#1  0010 0010 0000 0001 0010 1000  document.plaintext_doc.engineering.id5
#2  0010 0010 0000 0000 0010 1000  document.hypertext_doc.SGML.id117
#3  0000 0010 0000 0001 0000 1000  document.plaintext_doc.engineering.id7
#4  0010 0000 0000 0001 0010 0000  document.compound_doc.voice_mail.id8
…………….
List file L (list of candidate objects):
#1 …….
#2 …….
#3 …….
#4 …….
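The filtering stage can be sketched with the four signature file entries shown on this slide. The query signature and the function names are hypothetical; the slide does not give a query, so one is chosen here purely to show how the list of candidate objects would be produced.

```python
# (entry number, signature, path expression) as listed on the slide
SIGNATURE_FILE = [
    (1, "0010 0010 0000 0001 0010 1000", "document.plaintext_doc.engineering.id5"),
    (2, "0010 0010 0000 0000 0010 1000", "document.hypertext_doc.SGML.id117"),
    (3, "0000 0010 0000 0001 0000 1000", "document.plaintext_doc.engineering.id7"),
    (4, "0010 0000 0000 0001 0010 0000", "document.compound_doc.voice_mail.id8"),
]


def to_int(bits: str) -> int:
    """Parse a space-separated bit string into an integer."""
    return int(bits.replace(" ", ""), 2)


def information_filter(query_bits: str, signature_file=SIGNATURE_FILE):
    """Stage one: keep entries whose signature covers every query bit."""
    q = to_int(query_bits)
    return [(num, path)
            for num, bits, path in signature_file
            if (to_int(bits) & q) == q]


# Hypothetical query signature, not taken from the slides.
QUERY_SIGNATURE = "0010 0000 0000 0001 0000 0000"

# Plays the role of list file L: the candidate objects for stage two.
list_file = information_filter(QUERY_SIGNATURE)
```

With this query, entries #1 and #4 survive the filter, so only their path expressions would need to be examined by the OQL queries in the second stage.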
Query text processing
How the system formulates the query:
The system incrementally transforms quasi-natural-language queries into complex structured queries in the query language.
Goal: Free-format queries
Related techniques:
• Key term extraction from the queries
• IR-queries-to-OQL conversion
• Query optimization
• User interface
• NLP
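A rough sketch of the first two techniques, key term extraction and IR-queries-to-OQL conversion, is given below. The stopword list, the `Documents` extent and the `contains` method are hypothetical names chosen for illustration; the talk does not specify the schema or the form of the generated OQL.

```python
import re

# Hypothetical stopword list for free-format English queries.
STOPWORDS = {
    "a", "an", "the", "of", "on", "in", "about", "for", "and",
    "find", "show", "me", "all", "document", "documents",
}


def key_terms(query: str):
    """Extract distinct key terms, in order of first appearance."""
    terms = []
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        if token not in STOPWORDS and token not in terms:
            terms.append(token)
    return terms


def to_oql(terms, extent: str = "Documents") -> str:
    """Build an OQL-style query string from the key terms.

    `extent` and the `contains` method are assumed to exist in the schema.
    """
    if not terms:
        return f"select d from d in {extent}"
    conditions = " and ".join(f'd.contains("{t}")' for t in terms)
    return f"select d from d in {extent} where {conditions}"


terms = key_terms("Find all documents about signature files in the OODBMS")
oql = to_oql(terms)
```

Tokens are restricted to letters and digits before being placed in the query string, which keeps quotation marks out of the generated conditions.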
Conclusions
The distinctive features of the developed system:
• IR-OODBMS integration
– OODBMS-based document repository
– a loose coupling approach
– signature file filter as a preprocessor for query processing
– two stage search
– a novel query model
– easy to maintain, including the signature file and database schema
• Signature generation
– a character-based signature method designed for Chinese and English documents
• Applicable to a digital library infrastructure