FAST Corporate Presentation
Why Search Engines are Increasingly Used to Offload Queries from Databases
Bjørn Olstad
CTO FAST Search & Transfer
Adjunct Prof. The Norwegian University of Science & Technology
Email: [email protected]
Cell: +47 48011157
The Typo Problem...
Talent Offloading ....
The Web Search Experience
The RDBMS Experience
High input barrier
“You are viewing 5 random jobs out of 2461 jobs in total....”
CareerBuilder
Use scenario, part 1: 30956 jobs
Use scenario, part 2: 1084 jobs
Use scenario, part 3: 30 jobs
Use scenario, part 4: 5 jobs
From 30956 jobs to 5 targeted jobs in 3 steps
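The scenario above is a drill-down: each step adds one filter and the result count shrinks. A minimal sketch of that interaction follows. The job records and the field names (`category`, `state`) are hypothetical and are not taken from CareerBuilder.

```python
from collections import Counter

# Hypothetical job records; field names are assumptions for illustration only.
JOBS = [
    {"title": "Sales Manager", "category": "Sales", "state": "TX"},
    {"title": "Sales Associate", "category": "Sales", "state": "CA"},
    {"title": "Java Developer", "category": "IT", "state": "TX"},
    {"title": "DBA", "category": "IT", "state": "CA"},
    {"title": "Account Executive", "category": "Sales", "state": "TX"},
]


def refine(jobs, **filters):
    """Keep only the jobs whose fields match every given filter value."""
    return [j for j in jobs if all(j.get(k) == v for k, v in filters.items())]


def facet_counts(jobs, field):
    """Count how many of the remaining jobs fall under each value of a field."""
    return Counter(j[field] for j in jobs)


step1 = JOBS                             # everything
step2 = refine(step1, category="Sales")  # first refinement
step3 = refine(step2, state="TX")        # second refinement

for label, hits in (("all", step1), ("Sales", step2), ("Sales in TX", step3)):
    print(f"{label}: {len(hits)} jobs")
```

Showing `facet_counts(step2, "state")` next to the hits is what tells the user which refinement is worth clicking before they click it.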
Challenger Shuttle Launch
Fax to NASA from contractor with O-ring concern
Presentation Matters …
IYP: A Disruptive Change
Taylor or Gibson guitar?
Good local offers?
Compare offerings
Phone / Directions
What is the phone number
to Will’s Barber shop?
BTW: I’m using my iPAQ
ESP: Cleansing, Mining,
Relevance and Discovery
Company name
Business Category
Telephone number
Address
20 key terms
Company name
Business Category
Telephone number
Address
20 key terms
Product & Services
Company web site
Blogs++
ISVs: A Disruptive Change
Siebel 2000
Siebel 2005
“my” CRM Application
“my” CRM Application
Search
Information Access Layer
3rd party content
Search is a tactical afterthought
Search is a strategic enabler
Revisit the Assumptions …
[Figure: timeline of information milestones against data volume in GIGABYTES]
Milestones: Cave paintings, Bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) Late 1960s; The Web 1993
SQL milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03
Growth markers: 1999; 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B
Relational algebra: large but “finite” data sets; structured data
Search & Explore focused: “infinite” data sets; Unstructured & Structured
80% Unstructured
Extreme Capabilities?
• Feeding/streaming, transaction, retrieval or analytics centric?
• Content size: M, L, VL, VVVL or V^(n→∞)L?
• Schema centric, Semi-structured XML, Text, Agnostic?
• Fuzzy & Value vs. Binary & Completeness?
• Discovery primitives?
• User interaction part of design target?
Query Latency
RDBMS vs ESP
Test Data:
• Structured data:
  • 5 million records;
  • 13 fields per record
• Structured queries:
  • 22 SQL queries ( Representative in ERP )
The Result:
• #1: FAST ESP w/ disk
  • Mean = 99 [ms]
  • St.dev. = 36 [ms]
• #2: Oracle w/ memory mapping
  • Mean = 4 057 [ms]
  • St.dev. = 9 368 [ms]
[Figure: histogram of # queries (0 to 20) by query latency in [sec.] (1/16 to 32), series ESP and RDBMS]
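To put the two results side by side, the reported means and standard deviations can be turned into a latency ratio and a relative spread. This is plain arithmetic on the slide's figures. The coefficient of variation is an added comparison metric and does not appear in the original.

```python
# Figures reported on the slide, in milliseconds.
ESP_MEAN_MS, ESP_STDEV_MS = 99, 36
ORACLE_MEAN_MS, ORACLE_STDEV_MS = 4057, 9368


def coefficient_of_variation(mean, stdev):
    """Relative spread: standard deviation divided by the mean."""
    return stdev / mean


mean_ratio = ORACLE_MEAN_MS / ESP_MEAN_MS
esp_cv = coefficient_of_variation(ESP_MEAN_MS, ESP_STDEV_MS)
oracle_cv = coefficient_of_variation(ORACLE_MEAN_MS, ORACLE_STDEV_MS)

print(f"Mean latency ratio (Oracle / ESP): {mean_ratio:.1f}x")
print(f"Relative spread: ESP {esp_cv:.2f}, Oracle {oracle_cv:.2f}")
```

The mean differs by a factor of roughly 41. A standard deviation more than twice the mean on the Oracle side is consistent with a few very slow queries dominating the average.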
Query Per Second
RDBMS vs ESP
[Figure: QPS (0 to 900) for FAST and ORA at 20 users, 50 users and 100 users]
Identical HW : single node, 2 CPU, 4GB ram 3 SCSI disks
Identical data : auction data from eBay, 3.6 million doc’s
Identical queries: 200 queries defined by Oracle
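A throughput test of this shape replays a fixed query set with a given number of concurrent users and divides completed queries by elapsed time. The sketch below shows only that measurement loop. `run_query` is a hypothetical stand-in for a real client call, and nothing here reproduces the original benchmark setup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query):
    """Hypothetical stand-in for a call to the system under test."""
    return len(query)  # placeholder work; replace with a real client call


def measure_qps(queries, users):
    """Replay all queries with `users` concurrent workers; return (completed, qps)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(run_query, queries))
    elapsed = time.perf_counter() - start
    qps = len(results) / elapsed if elapsed > 0 else float("inf")
    return len(results), qps


# 200 placeholder queries, mirroring the size of the query set on the slide.
queries = [f"query {i}" for i in range(200)]
for users in (20, 50, 100):
    completed, qps = measure_qps(queries, users)
    print(f"{users} users: {completed} queries completed, {qps:.0f} QPS")
```

Running the same loop against both systems on identical hardware, data and queries is what makes the resulting QPS numbers comparable.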
Disruptive Change
Relational Model
[Figure: histogram of # queries (0 to 20) by query latency in [sec.] (1/16 to 32)]
Queries that fit The Model
Queries that don’t fit The Model
Alternative I
• Star, snowflake schemas++
• Cubes / datamarts ++
Incremental fixes to painful shortcomings
Adds complexity
Alternative II
• Schema agnostic
• Scalable ad-hoc querying
• BLOBS Contextual Insight
• Real-time fusion of disparate data models
• Massive fault tolerant scalability
Extreme Capabilities
ESP Design Targets
Powering Search Derivative Applications (SDAs)
Value/Noise SNR
Contextual
Refinement
Contextual Insight
User Interaction
Game Changer driven by Extreme Retrieval and on-the-fly Analytics
Database Query Offloading
Example: AutoTrader.com
RDBMS:
• HW-cost: $320K (32 CPU on 4 Sun servers)
• 90% sub-second query response; Average = 12 s for the rest ….
• Relevance = Sorting
• 5 FTE to maintain
ESP:
• HW-cost: $90K
• 100% sub-second query response
• Flexible relevance and discovery
• 0.5 FTE to maintain
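The comparison can be reduced to a few ratios using only the figures on the slide. The lower bound on mean RDBMS latency is an added illustration: it assumes the sub-second 90% of queries take zero time, so the true mean can only be higher.

```python
# Figures taken from the slide.
rdbms = {"hw_cost_usd": 320_000, "fte": 5.0, "sub_second_share": 0.90, "slow_avg_s": 12.0}
esp = {"hw_cost_usd": 90_000, "fte": 0.5, "sub_second_share": 1.00}

hw_ratio = rdbms["hw_cost_usd"] / esp["hw_cost_usd"]
fte_ratio = rdbms["fte"] / esp["fte"]

# Assumption: treat the fast 90% as 0 s, so only the slow 10% contributes.
slow_share = 1 - rdbms["sub_second_share"]
rdbms_mean_lower_bound_s = slow_share * rdbms["slow_avg_s"]

print(f"Hardware cost ratio (RDBMS / ESP): {hw_ratio:.2f}x")
print(f"Maintenance staff ratio (RDBMS / ESP): {fte_ratio:.0f}x")
print(f"RDBMS mean latency is at least {rdbms_mean_lower_bound_s:.1f} s")
```

Even under that generous assumption, the slow tail alone pushes the RDBMS mean above one second.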
ESP
Car Dealers - Product Supply
Content Scalability
RDBMS vs ESP
Examples of ESP deployments
• Compliance case:
– 50B documents @ 80k average
– 4 PB (around 100 web indexes)
• Storage:
– Intelligent content addressable storage
– XML metadata and full content
– EMC Centera: N * 256TB (N=1..400)
• Webmining – Webfountain:
– 60,000 : 1 in query capacity (ESP : DB)
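The sizes in the compliance and storage examples can be checked with simple arithmetic. The sketch assumes that "80k" means 80 kilobytes per document and that units are decimal (1 PB = 10^15 bytes). Neither assumption is stated on the slide.

```python
DOCS = 50_000_000_000   # 50B documents (from the slide)
AVG_DOC_BYTES = 80_000  # assumption: "80k" = 80 kilobytes, decimal
PB = 10**15             # assumption: decimal petabyte

total_pb = DOCS * AVG_DOC_BYTES / PB


def centera_capacity_tb(n):
    """Capacity in TB for N units of 256 TB, with N in the slide's range 1..400."""
    if not 1 <= n <= 400:
        raise ValueError("N must be between 1 and 400")
    return n * 256


print(f"Compliance case: {total_pb:.0f} PB")
print(f"Largest storage configuration: {centera_capacity_tb(400)} TB")
```

Under these assumptions, 50B documents at 80 kB each come to the 4 PB quoted on the slide.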
Intelligent Storage
Storage and Search Unite
Discover
Simple
Scalable
Secure
Contextual Search
From ACCESS To INSIGHT
Where is the email
from Peter about
ROI analysis?
FIND
Contextual Relevance
• “Best of Web”
Recommender / Authority
• “Best of Enterprise”
Linguistic / Statistic
Any new suspicious
financial transaction
patterns?
EXPLORE
Contextual Navigation
• Contextual fact discovery
• On-the-fly meta-data
analysis
Turning around the Pyramid
HBZ.de – Leading German Library Service Center
From:
Librarians
To:
Researchers
Single Field Search
Querying
WWW
(HTML, XML, WML,
JavaScript)
SQL LIB
FAST ESP
…
DB
DB
DB
DB
STRUCTURED
DB
ESP @ SCOPUS
• >200M articles / 180M citations
• 180TB capacity / 14000 journals
David Goodman standing up and declaring in public that
Scopus is the best-designed database he's ever seen …
Relevance Drives Revenue
Search Reduces Clicks to Purchase and Browsing…
• Reduced # of clicks to buy content from > 4 to < 2
• 50% reduction in ringtone browsing
… and Drives Revenue
• 100% increase in search
• 20% increase in ringtone revenue
[Figure: Clicks to Purchase (0.00 to 4.50) and page views per sale (0% to 140%), with "Launched search" marker]
[Figure: change in Search, Browsing and Revenue (-60% to 140%), with "Launched search" marker]
Business Analytics
Processing of real-time streams
Example: Norwegian Customs Foreign Exchange Transaction Monitoring
ACL Monitor
SECURITY ACCESS MODULE
User Monitor
Message Queue
Real-time Registration
Queries
Results
Alerts
Database connector
Transaction Log
Data Validation
Firewall
Firewall
ØKOKRIM
Technology Maturity...
RDBMS vs ESP
Business Intelligence
ESP vs. RDBMS Technology
OBSERVATION
The Enterprise Search Platform (ESP), a relatively new
concept, integrating advanced technologies typically
associated with search engines, database tools, and
analytical systems, is fast becoming able to solve modern
business intelligence problems (using both structured and
unstructured data) in a way that is fundamentally different
from, and ultimately superior to, that of other currently
available analytical or database software.
PREDICTION
Enterprise Search Platform and search centric application
technology represents a true paradigm shift in the way
data will be stored, analyzed and reported on in the
future. Resulting realignments in the marketplace may be
both rapid and tumultuous.
- Chief strategist, leading BI vendor
If your only tool is a hammer ....
... every problem looks like a nail
UIMA: Architecture
Text Structure
<Category>FINANCIAL</Category>
BC-dynegy-enron-offer-update5
Dynegy May Offer at Least $8 Bln to Acquire
Enron (Update5)
By George Stein
SOURCE c.2001 Bloomberg News
BODY
Event
Fact
…….
``Dynegy has to act fast,'' said Roger
Hamilton, a money manager with John
Hancock Advisers Inc., which sold its Enron
shares in recent weeks. ``If Enron can't get
financing and its bonds go to junk, they lose
counterparties and their marvelous business
vanishes.''
Moody's Investors Service lowered its
rating on Enron's bonds to ``Baa2'' and
Standard & Poor's cut the debt to ``BBB.'' in
the past two weeks.
……
<Author>George Stein</Author>
<Company>Dynegy Inc</Company>
<Person>Roger Hamilton</Person>
<Company>John Hancock Advisers Inc.</Company>
<PersonPositionCompany>
<OFFLEN OFFSET="3576" LENGTH="63" />
<Person>Roger Hamilton</Person>
<Position>money manager</Position>
<Company>John Hancock Advisers Inc.</Company>
</PersonPositionCompany>
<Company>Enron Corp</Company>
<Company>Moody's Investors Service</Company>
<CreditRating>
<OFFLEN OFFSET="3814" LENGTH="61" />
<Company_Source>Moody's Investors Service</Company_Source>
<Company_Rated>Enron Corp</Company_Rated>
<Trend>downgraded</Trend> <Rank_New>Baa2</Rank_New>
<__Type>bonds</__Type>
</CreditRating>
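Once facts are emitted as XML like the `CreditRating` element above, downstream code can read them as structured records. The sketch below parses that element with Python's standard `xml.etree.ElementTree`. The output dictionary layout (`fact_type`, `offset`, `length`) is an assumption for illustration and is not part of UIMA or ESP.

```python
import xml.etree.ElementTree as ET

# The CreditRating annotation from the slide, reproduced as a string.
ANNOTATION = """
<CreditRating>
  <OFFLEN OFFSET="3814" LENGTH="61" />
  <Company_Source>Moody's Investors Service</Company_Source>
  <Company_Rated>Enron Corp</Company_Rated>
  <Trend>downgraded</Trend>
  <Rank_New>Baa2</Rank_New>
  <__Type>bonds</__Type>
</CreditRating>
"""


def parse_fact(xml_text):
    """Turn one annotation element into a flat dict (layout is illustrative)."""
    root = ET.fromstring(xml_text.strip())
    span = root.find("OFFLEN")
    # Every child except the span marker becomes a field of the fact.
    fact = {child.tag: child.text for child in root if child.tag != "OFFLEN"}
    fact["fact_type"] = root.tag
    fact["offset"] = int(span.get("OFFSET"))
    fact["length"] = int(span.get("LENGTH"))
    return fact


fact = parse_fact(ANNOTATION)
print(fact["Company_Source"], fact["Trend"], fact["Company_Rated"], "to", fact["Rank_New"])
```

The offset and length keep the link back to the sentence in the article body, so a hit on the fact can still be shown in its original context.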
The BI “hammer” Approach
Document Vector
Antibiotics, Peptidyl, Eubacteria, RNA, Mg, …
SVD Analysis
( λ1, λ2, ..., λn )
{ λ1, λ2, ..., λn, Structured attributes }
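The pipeline on this slide reduces each document's term vector to a few latent coordinates and then appends structured attributes. A minimal pure-Python sketch of that idea follows. It uses power iteration to find only the leading singular direction, where a real system would use a full SVD routine. The term counts and the `year` attribute are hypothetical.

```python
def matvec(A, v):
    """Multiply matrix A (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def top_right_singular_vector(A, iters=200):
    """Power iteration on A^T A; returns (singular value, unit right singular vector)."""
    At = [list(col) for col in zip(*A)]
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            raise ValueError("matrix is all zeros")
        v = [x / norm for x in w]
    sigma = sum(x * x for x in matvec(A, v)) ** 0.5
    return sigma, v


# Hypothetical term counts: rows are documents, columns are terms.
DOCS = [
    [3.0, 0.0],
    [0.0, 2.0],
]
YEARS = [2001, 2003]  # hypothetical structured attribute per document

sigma, v = top_right_singular_vector(DOCS)
latent = matvec(DOCS, v)  # one latent coordinate per document
features = [[coord, year] for coord, year in zip(latent, YEARS)]
print(sigma, features)
```

Each row of `features` has the form shown on the slide: latent coordinates followed by structured attributes.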
Contextual Refinement
ETL and Semantic understanding unite
[Diagram: data cleansing flow]
Direct access to RDBMSs for info from some Telco’s
XML feed from other Telco’s
Flat files (CSV or fixed) from the ’laggards’
Logic for cleansing
ESP lookup
Ordered hits (by quality)
Cleansed data to ESP (XML)
Clean data: Master database for persistent storage
Ambiguous data (close hits or unidentified): ’Error’ database for manual inspection, correction, storage/learning
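The cleansing flow above looks up each incoming record, ranks the candidate hits by quality, and routes the record to clean data or to the error database. The sketch below imitates that routing with Python's standard `difflib` as a stand-in for the ESP lookup. The master list and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical master records; a real deployment would query the search index.
MASTER = ["Will's Barber Shop", "Taylor Guitars", "Gibson Guitar Store"]


def ranked_hits(name, master):
    """Return (score, entry) pairs ordered by similarity to `name`, best first."""
    scored = [(SequenceMatcher(None, name.lower(), m.lower()).ratio(), m) for m in master]
    return sorted(scored, reverse=True)


def route(name, master, clean_threshold=0.9):
    """Send confident matches to 'clean'; everything else to 'ambiguous'."""
    score, best = ranked_hits(name, master)[0]
    if score >= clean_threshold:
        return ("clean", best)
    return ("ambiguous", name)  # close hit or unidentified: manual inspection


print(route("Wills Barber shop", MASTER))
print(route("Acme Plumbing", MASTER))
```

The misspelled "Wills Barber shop" still resolves to the master entry. The unknown name falls below the threshold and goes to manual inspection.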
Contextual Insight
“…entry probe carried to
[Saturn]’s moon Titan
as part of the…”
Intent
Concepts
Query-time fact analysis @ sub-document level
Contextual Navigation
ThisIsTravel
Automated
visitor ratings
Revisit the Assumptions …
Scalable Search
[Figure: timeline of information milestones against data volume in GIGABYTES]
Milestones: Cave paintings, Bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) Late 1960s; The Web 1993
SQL milestones: SQL-70, Oracle-79, SQL-89, SQL-92, SQL-99, SQL-03
Growth markers: 1999; 2000: 3B; 2001: 6B; 2002: 12B; 2003: 24B
80% Unstructured