Transcript Jerry Held
Text in Oracle: The Search Platform and Ultra Search
Omar Alonso, Senior Product Manager, Oracle Corp.
Stefan Buchta, Principal Product Manager, Oracle Corp.
NoCOUG
May 16th 2001
Agenda
What is Oracle Text?
Introducing Oracle Text
Text in the database – Why Integration is Key
Performance and scalability
Ease of Use
Global Solutions
Search Quality
Specialized Indexes
XML
Document Services
Ultra Search
Summary
What is Oracle Text?
Formerly known as interMedia Text
Oracle Text adds powerful text search and
intelligent text management capabilities to
the Oracle database.
Oracle Text:
– Is fully integrated with the database
– Offers premier text search quality
– Provides several advanced features for text management, document services, XML, etc.
– Has the best set of internationalization features for multilingual text search applications
Introducing Oracle Text – An example
create index description_idx on
PRODUCT_INFORMATION(PRODUCT_DESCRIPTION)
indextype is ctxsys.context ;
select score(1), product_id, product_name
from product_information
where contains (product_description, 'monitor NEAR "high resolution"', 1)>0
order by score(1) desc ;
  SCORE(1) PRODUCT_ID PRODUCT_NAME
---------- ---------- ---------------------
        29       3331 Monitor 21/HR
        27       3060 Monitor 17/HR
        14       1726 LCD Monitor 11/PM
        14       3054 Plasma Monitor 10/XGA
        14       2252 Monitor 21/HR/M
        14       2243 Monitor 17/HR/F
Integration with the database
Attempts to separate text from normal (structured) business data fail on:
– Cost
– Complexity
– High latency of development and deployment
– Performance
No Integration - Separate Everything
[Diagram labels: Application; C API; Search Engine (API); Inverted Index; File System; SQL; Oracle Database; Repository; B-Tree Index]
Full Integration – text, index, API, optimizer
[Diagram labels: Application; SQL; Oracle Database; Repository; B-Tree Index; Search Engine (API)]
Integration Benefits
Low cost
Low complexity
High performance
High integrity
Manageability
Leveraging existing skills
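Because the text index lives inside the database, one SQL statement can combine a text predicate with ordinary structured predicates. The following is a minimal sketch that reuses the earlier product_information example; the list_price column is an assumption for illustration and does not appear in the original slides.

```sql
-- Sketch only: LIST_PRICE is an assumed column, not shown in the slides.
select score(1), product_id, product_name, list_price
from product_information
where contains(product_description, 'monitor', 1) > 0  -- text predicate
  and list_price < 500                                  -- structured predicate
order by score(1) desc;
```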
Oracle Uses Oracle Text
Oracle Internet File System
Oracle Portal
Oracle CRM
Oracle E-Business Suite
Oracle eXchange
Ultra Search
Oracle.com
OTN
Performance – illustration
Large document set: 100 GB (20 million web pages)
Hardware: Enterprise SPARC
Task: web query
– Web-style query syntax
– 2-3 words
– Return first 100 hits
40 queries/second
90% of requests take < 0.5 second
7 hours to create index
Performance – Query throughput
Oracle Text vs one of the best-known
specialist Text search engines
[Chart: Throughput Comparison. Y axis: queries per second (0 to 80). X axis: number of users (0 to 30). Series: CONTEXT queries per second, TCOMP queries per second.]
Ease of Use, Ease of Development
Simple SQL and PL/SQL interface
– Can be used by any developer who knows SQL
– Can be called by any tool that knows SQL
– Using any language: Java, JSP, PL/SQL, C, etc.
Choice of datastores
– Stored in the database
– Stored in the file system
– Stored on the web (URL)
– User-defined datastore
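The datastore is chosen through an index preference. A hedged sketch for the file system case follows, using the CTX_DDL package and the FILE_DATASTORE type; the preference name, table, column, and directory are hypothetical.

```sql
-- Hypothetical names: my_file_ds, docs, file_name, /data/docs.
begin
  ctx_ddl.create_preference('my_file_ds', 'FILE_DATASTORE');
  ctx_ddl.set_attribute('my_file_ds', 'PATH', '/data/docs');
end;
/

-- The indexed column holds file names; the text itself is read from the files.
create index docs_idx on docs(file_name)
indextype is ctxsys.context
parameters ('datastore my_file_ds');
```

For documents on the web, the same pattern applies with the URL_DATASTORE type.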
Global Solutions
Basic indexing/search works in any NLS
language
Special support for Japanese, Chinese, Korean
Theme search and services available in any
single-byte, white space-delimited language
Can mix languages, character sets in a single
column
Can query across languages
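Mixing languages in one column is typically set up with a MULTI_LEXER preference that picks a sub-lexer per row based on a language column. The sketch below assumes a hypothetical table globaldoc(text, lang); all preference names are illustrative.

```sql
-- Hypothetical table: globaldoc(text, lang), where lang names each row's language.
begin
  ctx_ddl.create_preference('english_lexer', 'BASIC_LEXER');
  ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');
  ctx_ddl.create_preference('global_lexer', 'MULTI_LEXER');
  -- DEFAULT handles any row whose language has no dedicated sub-lexer.
  ctx_ddl.add_sub_lexer('global_lexer', 'DEFAULT', 'english_lexer');
  ctx_ddl.add_sub_lexer('global_lexer', 'JAPANESE', 'japanese_lexer');
end;
/

create index globalx on globaldoc(text)
indextype is ctxsys.context
parameters ('lexer global_lexer language column lang');
```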
Chinese, Japanese, Korean Text
• Character sets:
  • Japanese: JA16SJIS, JA16EUC
  • Simplified Chinese: GBK, GB2312-80
  • Traditional Chinese: BIG5, EUC, TRIS
  • Korean: KO16KSC5601
  • Unicode: UTF8
• Lexing:
  • Lexical segmentation for Japanese, Chinese
  • Morphological segmentation for Korean
Multilingual Search
Cross-language queries
Can mix languages and character sets within a document collection (e.g. Chinese and English documents)
Can use English to query Chinese terms, for example, or vice versa.
Query a term that is expressed differently in simplified and traditional Chinese.
select score(1), product_id, product_name
from product_information
where contains (product_description, 'TRSYN(monitor, Chinese)', 1)>0
order by score(1) desc ;
Find products whose description contains ‘monitor’ or its Chinese equivalents.
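TRSYN expands a term using translations stored in a thesaurus, so the query above only finds Chinese equivalents if such a thesaurus has been loaded. The following is a rough sketch using the CTX_THES package. It assumes that no thesaurus named DEFAULT exists yet and that the language label CHINESE corresponds to the one used in the query; the translation shown is illustrative.

```sql
-- Sketch under assumptions: thesaurus DEFAULT does not exist yet,
-- and the language label matches the second argument of TRSYN.
begin
  ctx_thes.create_thesaurus('DEFAULT');
  ctx_thes.create_phrase('DEFAULT', 'monitor');
  ctx_thes.create_translation('DEFAULT', 'monitor', 'CHINESE', '显示器');
end;
/
```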
Search Quality
Exact word
Boolean expression
Phrase
Proximity
Fuzzy
Stemming
Wildcards
Thesaurus, multiple thesauri
ABOUT search
Theme (concept-based) search
Accumulate scores
Term weighting
Advanced XML search
Prefix, substring index
XPath support
Query Feedback
ABOUT – themes and theme queries
"We ordered a bottle of chardonnay to go with the fish, and
cabernet sauvignon for the steak …"
select id from docs
where contains(text, 'ABOUT(wine)')>0
The knowledge base allows Oracle Text to associate words and concepts.
The knowledge base contains over 400,000 concepts.
You can extend the knowledge base to include
– Words and concepts from your specialist field, e.g. medicine
– Associations of words and spellings to guide novice/internet users
Catalog Index
Optimized for response time on small text fields
True transactional DML
Supports structured query, including range query
Subset of CONTEXT query language
– No fuzzy, stemming, about
– User-friendly web-like query syntax
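A catalog index is created with the CTXCAT index type and queried with CATSEARCH, which takes a text query and a structured clause. The sketch below uses a hypothetical auction(item_id, title, price) table and a hypothetical index set name.

```sql
-- Hypothetical table: auction(item_id, title, price).
begin
  ctx_ddl.create_index_set('auction_iset');
  ctx_ddl.add_index('auction_iset', 'price');  -- sub-index for structured criteria on price
end;
/

create index auction_titlex on auction(title)
indextype is ctxsys.ctxcat
parameters ('index set auction_iset');

-- Text query plus a range predicate, both answered from the catalog index.
select item_id, title, price
from auction
where catsearch(title, 'camera', 'price <= 300') > 0;
```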
Classification
CTXRULE is an index type designed for classification/routing applications
Efficiently takes a document and finds the matching queries
Classification
[Diagram labels: Classification Application; Incoming documents; Compares against rules; Matched Documents; Perform Action]
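In a CTXRULE setup, the queries (rules) are stored in a table and indexed, and each incoming document is passed to the MATCHES operator. The following is a minimal sketch; the table, rules, and document text are made up for illustration.

```sql
-- Hypothetical rule table: each row is a stored query that defines a category.
create table news_rules (
  rule_id  number primary key,
  category varchar2(30),
  query    varchar2(200)
);

insert into news_rules values (1, 'Finance', 'interest rates or stock market');
insert into news_rules values (2, 'Sports', 'soccer or baseball');
commit;

create index news_rules_idx on news_rules(query)
indextype is ctxsys.ctxrule;

-- Route an incoming document: return every category whose rule it satisfies.
select category
from news_rules
where matches(query, 'The central bank left interest rates unchanged today.') > 0;
```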
Prefix, substring index
Prefix and Substring are flavors of the
CONTEXT index
Prefix will add more tokens to the CONTEXT
index to efficiently process prefix searches
(e.g. 'ora%')
Substring will add an index on substrings of
each token, to efficiently process substring
searches (e.g. '%oxy%')
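Both options are attributes of a BASIC_WORDLIST preference that is named when the CONTEXT index is created. The sketch below uses hypothetical preference, table, and index names, and the length limits are illustrative values.

```sql
-- Hypothetical names: my_wordlist, docs, docs_idx. Lengths are example values.
begin
  ctx_ddl.create_preference('my_wordlist', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('my_wordlist', 'PREFIX_INDEX', 'TRUE');
  ctx_ddl.set_attribute('my_wordlist', 'PREFIX_MIN_LENGTH', '3');
  ctx_ddl.set_attribute('my_wordlist', 'PREFIX_MAX_LENGTH', '4');
  ctx_ddl.set_attribute('my_wordlist', 'SUBSTRING_INDEX', 'TRUE');
end;
/

create index docs_idx on docs(text)
indextype is ctxsys.context
parameters ('wordlist my_wordlist');

-- Queries such as these can then use the extra tokens:
select id from docs where contains(text, 'ora%') > 0;
select id from docs where contains(text, '%oxy%') > 0;
```

The trade-off is a larger index and longer index creation in exchange for faster wildcard queries.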
Storing XML in Oracle
Decomposition
– Decompose documents into atomic elements
– Store elements in columns/rows
– Compose XML documents using SQL
XMLType
– Store XML as XMLType, use XMLType methods
Store as LOB or VARCHAR
– Store XML as-is, in a LOB or VARCHAR
– Search using Oracle Text section searching or XPath
Content search and XML
Create index
create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context ;
Query by content
select PRICE from BOOKS
where contains(text, 'Harry Potter')>0 order by price desc;
Create index to include section info
create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context
parameters ('section group my_auto_section_group' ) ;
Limit content search to a section of text
select price from books
where contains(text, 'Harry Potter within title')>0 order by price desc;
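The section group named in the index parameters has to exist before the index is created. A minimal sketch of that step follows, assuming the automatic sectioner is what the name my_auto_section_group implies.

```sql
-- AUTO_SECTION_GROUP creates a zone section for every XML tag it encounters.
begin
  ctx_ddl.create_section_group('my_auto_section_group', 'AUTO_SECTION_GROUP');
end;
/
```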
Advanced XML searches
Nested section search
<movie><title>The Matrix</title></movie>
<book><title>Introduction to Matrix Algebra</title></book>
select price from media
where contains(description, 'matrix within title within movie')>0
Search inside attribute values
<book author="Barry Hughart">Bridge of Birds</book>
select title from books
where contains(text, 'Hughart within book@author')>0
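When the sections are declared explicitly rather than detected automatically, an attribute can be given its own section name. The sketch below uses an XML_SECTION_GROUP; the group, section, and index names are hypothetical.

```sql
-- Hypothetical names: book_group, books_idx, section name 'author'.
begin
  ctx_ddl.create_section_group('book_group', 'XML_SECTION_GROUP');
  ctx_ddl.add_zone_section('book_group', 'book', 'book');
  ctx_ddl.add_attr_section('book_group', 'author', 'book@author');
end;
/

create index books_idx on books(text)
indextype is ctxsys.context
parameters ('section group book_group');

-- The attribute is now searched through its declared section name.
select title from books
where contains(text, 'Hughart within author') > 0;
```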
More advanced XML searches
map multiple tags to same name
<H1>The Diamond Age</H1>
<H2>or, A Young Lady’s Illustrated Primer</H2>
(map H1 and H2 to section name of “headline”)
select author from articles
where contains(text, 'Diamond within headline')>0
doctype limiters to handle tag collisions
<!DOCTYPE foo> … <address>[email protected] …
<!DOCTYPE bar> … <address>123 Meheula Pkwy …
map (foo)address to “email”, (bar)address to “address”
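Both mappings can be expressed as zone sections in an XML_SECTION_GROUP. In this hedged sketch, several tags share one section name, and a doctype prefix in parentheses limits a tag to documents of that doctype. The group name is hypothetical.

```sql
-- Hypothetical section group name: article_group.
begin
  ctx_ddl.create_section_group('article_group', 'XML_SECTION_GROUP');
  -- Map H1 and H2 to the same section name.
  ctx_ddl.add_zone_section('article_group', 'headline', 'H1');
  ctx_ddl.add_zone_section('article_group', 'headline', 'H2');
  -- Doctype limiters: the same tag maps to different sections per doctype.
  ctx_ddl.add_zone_section('article_group', 'email', '(foo)address');
  ctx_ddl.add_zone_section('article_group', 'address', '(bar)address');
end;
/
```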
Document Services
Extract Themes (major concepts)
– Extract hierarchical structure
Extract Gist
– Generic or Point-of-View
– Sentence- or Paragraph-level
View a document as HTML
– Highlight search terms, highlight navigation
Return results in a table or a PL/SQL table
Basis for Clustering, More Like This, …
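These services are exposed through the CTX_DOC package. The sketch below returns themes and a gist in memory (a PL/SQL table and a CLOB); the index name docs_idx and the document key '1' are hypothetical.

```sql
-- Hypothetical index docs_idx; '1' is the primary key value of the document.
set serveroutput on
declare
  the_themes ctx_doc.theme_tab;
  gist_clob  clob;
begin
  -- Themes (major concepts) with their weights.
  ctx_doc.themes('docs_idx', '1', the_themes, full_themes => false);
  for i in 1 .. the_themes.count loop
    dbms_output.put_line(the_themes(i).theme || ': ' || the_themes(i).weight);
  end loop;

  -- Generic, paragraph-level gist returned in a temporary CLOB.
  ctx_doc.gist(index_name => 'docs_idx', textkey => '1', restab => gist_clob,
               glevel => 'P', pov => 'GENERIC');
  dbms_output.put_line(dbms_lob.substr(gist_clob, 200, 1));
  dbms_lob.freetemporary(gist_clob);
end;
/
```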
Summary
Fully integrated with the database
Premier text search quality
Advanced features for text management,
document services, and XML.
Best multilingual features in the market.
Introducing Oracle Ultra Search
Issues in Corporate Search
Information Management Crisis
– Explosive growth of information flowing over corporate intranets.
– Knowledge scattered across IT repositories, billions of documents, and data fragments.
– Non-uniform information: structured in databases; unstructured in word processing documents and presentations.
Impacts of Bad Search
Customers - Turn to competitor’s Website.
Employees - Waste time and money on
useless searches.
Oracle Ultra Search
– Solves the problem of finding relevant information.
– Across your company’s many disparate information repositories.
Oracle Ultra Search
Out-of-the-Box solution that
– Searches text across multiple repositories: databases, HTML pages, files, mail servers.
– Provides the best relevance ranking and globalization in the industry.
– Provides value-added Portal functionality.
– Presents Web-style interface.
Built on Oracle’s proven, reliable Text Retrieval software and Oracle9i server.
Oracle Ultra Search
[Screenshot callouts: Document Title; Relevance]
Ultra Search Applications
Portal Search
– Most powerful search for Oracle9iAS Portal.
– Build your own portal.
– Special ‘Portlet’ crawls inside and outside of Portal Repository.
Canned Web Search for Oracle Text
Library or Archive Search
Content Management Platform Search
Search Multiple Repositories
Value Added Portal Functionality
‘Canned’ Web-Style Search
Aggregates Information For Indexing
– Documents stay in their own repositories.
– Search returns ‘normalized’ results, uniformly ranked by relevance.
Organize & Categorize Content From Multiple Repositories
– Extract valuable metadata.
– Improve search by narrowing through ‘fielded search’.
‘Out-of-the-Box’ Web-Style Search
Oracle Text Application
– Uses public Text interfaces.
– Enhanced with expertise about gathering and indexing information for best quality search.
– No coding against low-level APIs.
Oracle Text Retrieval Engine
– Highly integrated with Oracle9i server.
– Best interoperability with dynamic data.
– Scalability and reliability of Oracle platform.
Aggregates Information
Gather
– Crawls Web, corporate repositories
Analyze
– Create index required for querying, filter
Make Queryable
– Embed through API
Maintain
– Schedule crawling
– Easy Administration
Powerful Fielded Search
Narrow search to parts of a document: title, body, name of author.
Extract and use repository metadata
– Word processing documents: Author, Title.
– Databases: Identify Columns.
– Email: Header/Body/Attachment.
Unify repositories in common, logical terms.
– Uniform set of results, ranked by overall relevance.
Flexible Metadata Mapping
[Diagram labels: Search Term; Metadata Fields; Repositories]
Ultra Search Architecture
Architecture
Simple, Robust Architecture Built on:
– Oracle9i Server Platform
– Oracle’s Text Retrieval Engine
Flexible Deployment
– Server-Tier
– Mid-Tier
Ultra Search Components
Crawler
Server Component
Query API & Application
Administration Tool
Mail API
Ultra Search Crawler
Multi-Threaded Java process.
– Gathers documents from repositories you specify on a set schedule.
– Maps and analyzes link relationships.
– Filters non-HTML documents (150+ formats), extracts valuable metadata.
– Indexes documents and data fragments.
Flexible Configuration
– Run on one or more machines: ‘Remote crawling’
Ultra Search Crawler
Set Inclusion/Exclusion Domains
– Limit crawling to corporate net or specific sections of it.
Maintain Fresh Search Results
– Set crawling schedules for each Web site or repository.
Crawling Abilities
Web Sites (HTTP Protocol)
Database Tables
– Oracle and any ODBC-compliant database.
– Local (Ultra Search instance) or remote database.
– Crawls both full-text and ‘fielded’ columns.
Files (file:// Protocol)
– Ultra Search filters, extracts text and metadata.
Emails (IMAP Protocol)
– Crawl and index mailing lists through IMAP.
Ultra Search Query API
‘Embed’ Ultra Search in your Portal or Application.
– Customize look-and-feel to your requirements.
– Easy integration with your application.
API for Java (JSP) and PL/SQL (PSP).
Returns data with or without HTML markup.
– Build: Basic Search Form, Search Result Form...
Includes Highly Functional Query Application.
Java Query API Illustration
Administration Environment
Browser-based, Self-Service Web Application.
Define Ultra Search Instances.
Configure and Schedule Crawler.
Set Query Options To Narrow Searches.
– Document Attributes (e.g. TITLE, AUTHOR).
– Define ‘Data Source Groups’.
Manage Administrative Users.
Summary
Eliminate the Chaos Inside Your Firewalls!
Oracle Ultra Search
– Crawls, indexes, and makes searchable your intranet.
– Provides Web-style search without the need for coding.
– Organizes, categorizes, and unifies content from multiple repositories.
– Leverages Oracle9i platform: reliable, scalable, always available.
Questions & Answers