Oracle 8i Intermedia Text

Download Report

Transcript Oracle 8i Intermedia Text

ODF 000419 Benchmarking Oracle8i Intermedia Text
Benchmarking Oracle 8i
Intermedia Text
• Background for this benchmark
• Interesting new features in OIMT
• Benchmarking, methodology and problems
• Results
• Conclusions
ODF 000419 Benchmarking Oracle8i Intermedia Text
Background for this benchmark
The task of the thesis project
The EDMS Search Engine
CERN’s EDMS
CADIM/EDB, managing product data documents
MP5, managing physical components
Oracle 7, database platform
Implement Oracle8i in the future?
ODF 000419 Benchmarking Oracle8i Intermedia Text
Oracle 8i Intermedia
A product embedded in Oracle8i
Intermedia allows a unified technique for accessing
various types of data such as:
 Text
 Documents
 Images
 Audio
 Video
ODF 000419 Benchmarking Oracle8i Intermedia Text
Features in Intermedia Text
Database integration
• Mixed queries against multiple text columns
• Single SQL API
• Indexes on most database columns
• All index data in the database
• Automatic triggers on textcolumns to detect changes
Full text search
• Exact Word and phrase search, operators, removing stopwords
• XML and HTML document section searching
ODF 000419 Benchmarking Oracle8i Intermedia Text
Features in Intermedia Text
About Search (All languages)
• Parses any text search to perform an optimal search
• Complements full-text search
Theme Identification (English only)
• Identifies strong themes in documents using a ”Theme base”
• ”Thematic” information is put into the index
• Themesearch is available via the about operator
• May be customized for specific terminology
ODF 000419 Benchmarking Oracle8i Intermedia Text
Features in Intermedia Text
Document services
• View documents as plain text or HTML format
• Store documents in a database, file system or at an URL adress
Multilingual text search
• Full-text search in most languages including Japanese, Chinese and Korean
• Support for all Oracle-NLS character sets
• Stemming and Fuzzy search for Dutch, English, French, German and Spanish
• Base-letter support and alternate spelling for Western European languages
ODF 000419 Benchmarking Oracle8i Intermedia Text
Features in Intermedia Text
To be investigated:
• The text indexing technique
• The new query operators
• How to use this in the EDMS Search Engine
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT Textindexes
An OIMT textindex can be created on:
•
varchar2 columns
•
•
Large object (LOB) columns
B-file columns, indexing entire files
Allowing queries like:
Select id from table where contains(column,’Atlas’)>0;
1 One of the detectors in the LHC ring is ATLAS
2 The Atlasmountains are situated in the north of Africa
-> 1
… contains(column,’Atlas inner detector’)>0;
1 The problems with the ATLAS inner detector...
2 The inner detector of ATLAS...
-> 1
Query optimization
Selects the best executing plan based on analyze table…compute statistics;
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT Textindexes
An OIMT textindex is an ”inverted” index
1 the LHC accelerator
2 the LEP accelerator
->
accelerator:
row #1 position 2, row #2 position 2
LHC:
row #1 position 1
LEP:
row #2 position 1
A textindex is built up by five objects
Four tables: I,K,N,R and one b-tree index: X
Create OIMT textindex statement
create index myindex on table(column)
indextype is ctxsys.context;
Note that table must have a primary key
Updating, two choices
Automaticly by starting ctxsrv and commit
Rebuild ”manually”
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT Textindexes
The indexing ”pipeline”
• Loops over the rows and reads data out of the column
Datastore
• Or from remote servers, accessed via http or ftp via pointers
• Transformns the data into text representation
Filter
• Output can be in HTML or XML
• Takes the output from the filter and converts it to plain text
Sectioner
• Different sectioner for different formats
• Detects important section tags
Lexer
• Separates text into tokens and words
• Remove stopwords
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT Textindexes
The preference system allows customization of textindexes
Classes of ”customizable” objects:
DATASTORE, FILTER, SECTION_GROUP, LEXER, WORDLIST, STOPLIST, STORAGE
Create a preference to customize an OIMT textindex:
execute ctx_ddl.create_preference(’my_pref’, ’BASIC_LEXER’);
execute ctx_ddl.set_attribute(’my_pref’, ’INDEX_THEMES’, ’YES’);
create index my_index on table(column) indextype is
ctxsys.context parameters(’LEXER my_pref’);
Default preferences
Unset preferences get their value from the default system
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT Query operators
Scoring operators:
WEIGHT(*), THRESHOLD(>), ACCUM(,), MINUS(-)
Examples: …contains(column, ’(edms*2) AND cms’)>10;
…contains(column, ’edms, cms’)>0; edms+cms scores higher than each word alone
Word expansion operators:
WILDCARD(%), FUZZY(?), STEM($), SOUNDEX(!), EQUIV(=)
Examples: …contains(column, ’?mignets’)>0;
Fuzzy correct missspellings
1 The magnets of the LHC accelerator…
-> 1
…contains(column, ’$go’)>0;
Stem considers e.g. go, went, gone as the ”same” word
1 I will go to the cinema.
as well as plurals, magnet=magnets
2 I went back home.
3 The train has gone.
-> 1,2,3
…contains(column, ’!dog’)>0;
1 I have a dog.
2 Someone has dug a hole.
-> 1,2
Soundex retrieves all word which sounds alike
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT Query operators
Proximity operator: NEAR(;)
Examples: …contains(column, NEAR((edms,cms),4, TRUE))>0;
1 EDMS is available for CMS, ATLAS, LHCb...
2 The EDMS at CERN manages all product data documents of CMS
-> 1
Section limiting and theme operators: WITHIN,
ABOUT
about is used as a ”general” operator, optimizing the query by including stem($) and
theme search if available
Thesaurus operator: SYN
Examples: SYN(dog) == {boxer} | {rotweiler} | {terrier}
Boolean operators: AND,
OR, NOT
ODF 000419 Benchmarking Oracle8i Intermedia Text
OIMT Benchmarks
• Is SELECT…CONTAINS(column,’word’)>0; faster than SELECT…LIKE(’%word%’); ?
• How do indexes on entire files perform?
• What is the prize in terms of memory storage, maintenance?
• Still fast retrieval?
• Is updating flexible?
• Do the query operators perform as expected?
ODF 000419 Benchmarking Oracle8i Intermedia Text
Comparing CONTAINS and LIKE, retrieval times
Searchword "ALICE", matching 5% of the documents
Fulltablescan using LIKE,
as in EDMS Search today
1.8
1.6
1.4
Time [s]
1.2
1
contains
like
0.8
0.6
0.4
0.2
0
6009
12018
24036
48072
Tablesize [rows]
Y-axis: Retrieval time [s]
X-axis: Tablesize [# of rows]
Using OIMT’s CONTAINS
ODF 000419 Benchmarking Oracle8i Intermedia Text
Comparing CONTAINS and LIKE, retrieval times
Searchword "POLYIMIDE", matching 0.01% of the documents
Fulltablescan using LIKE,
as in EDMS Search today
1.8
1.6
1.4
Time [s]
1.2
1
contains
like
0.8
0.6
0.4
0.2
0
6009
12018
24036
48072
Tablesize [rows]
Y-axis: Retrieval time [s]
X-axis: Tablesize [# of rows]
Using OIMT’s CONTAINS
ODF 000419 Benchmarking Oracle8i Intermedia Text
Indexing entire files
Tests with smaller amounts of files (<1000) works fine
Encountered heavy problems when indexing up to 9000 files:
• Security bug
A user gets the database owner’s OS privileges when accessing files through FILE DATASTORE.
Special roles will be introduced in the 8.1.7 version to manage this problem.
• Not indexing certain Excel files, bug
Said to be fixed in the 8.1.7 version
• Storage problems, tablespace, temp, shared pool, lobsegments
A OIMT textindex is complex, several storage parameters have to be extended, which was non-trivial.
• Unrelevant errormessages
Somewtimes when problems occur, unrelevant errormessages or no errormessages at all are returned,
making troubleshooting difficult.
ODF 000419 Benchmarking Oracle8i Intermedia Text
Indexing entire files
Statistics about the 9000 file index:
Total indexsize:
1.0 GB
Number of files:
8942
Accumulated filesize:
4.4 GB
Tot indexsize/acc filesize: 23.6 %
Average indexsize per file:115.8 KB/file
Average filesize:
490.1 KB/file
Creation time:
7:06 hours
Creation time per file:
2.9 seconds/file
Updating:
Quering:
Works fine, 2 seconds/inserted row
1-3 seconds/query (depends on the query!)
Machine:
SUN 4CPU 300MHz 1.5 GB RAM
ODF 000419 Benchmarking Oracle8i Intermedia Text
A view of the testtable
Index created on this
file reference column
ODF 000419 Benchmarking Oracle8i Intermedia Text
Conclusions
O8i IMT
• A very interesting ”platform” for the future EDMS Search Engine
• Will provide tools for fast full-text searches, with both simple and advanced queries
Textindexes
• Indexing entire files works, but is not trivial to do in version 8.1.6, to be improved in 8.1.7?
• Quering is fast, both for indexed varchar2 columns and filecolumns
• Updating textindexes can be done automatic
• Index preferences may be customized
• Multilingual
Query Operators
• Works mainly as expected, providing powerful tools for an advanced Search Engine
• Wildcards a very slow when executed on fileindexed columns
ODF 000419 Benchmarking Oracle8i Intermedia Text
The EDMS Search Engine Interface
Application interfaces are non-trivial to create
Interface important to make userfriendly
Many hits may be retrieved from full-text searches, limiting these may be crucial
Some ideas:
Ask the users for hints
Keep the interface as simple as possible, avoid too many graphical objects etc
A simple and an advanced search option
Menus for choosing query operators, scoring etc
ODF 000419 Benchmarking Oracle8i Intermedia Text
An OIMT Test Search Engine:
9000 indexed files from the EDMS production database
http://oraweb01.cern.ch:9000/anders/owa/edms_extended_search.enter_search