Transcript Slide 1
The Analytic DBMS Market(s)
New opportunities with new technology
by
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact @monash.com
http://www.monash.com
http://www.DBMS2.com
Curt Monash
Analyst since 1981
Own firm since 1987
Publicly available research
Covered DBMS since the pre-relational days
Also analytics, search, etc.
Blogs, including DBMS2 (www.DBMS2.com -- the source for most of this talk)
Feed at www.monash.com/blogs.html
White papers and more at www.monash.com
User and vendor consulting
Our agenda
Why there are specialty analytic DBMS
It’s not just the analytic area
Hardware issues
Tips for choosing among them
Segments and priorities
The selection process
Database diversity
High-end e-commerce
100-terabyte analytics
High-volume call center
Media-heavy web startup
Simple departmental application
(and many more)
11 kinds of data management software
1. High-end OLTP/general-purpose DBMS
2. Mid-range OLTP/general-purpose DBMS
3. Row-based analytic RDBMS
4. Column- or array-based analytic RDBMS
5. Text search engines
6. XML and OO DBMS (but these may merge with search)
7. RDF and other graph DBMS (but these may merge with relational)
8. Event/stream processing engines (aka CEP)
9. Embedded DBMS for devices
10. Sub-DBMS file managers (e.g. SimpleDB, some MySQL uses)
11. Science DBMS
Why are there specialized analytic DBMS?
General-purpose database managers are optimized for updating short rows …
… not for analytic query performance
10-100X price/performance differences are not uncommon
At issue is the interplay between storage, processors, and RAM
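A toy sketch of that asymmetry, not from the talk: Python's built-in sqlite3 stands in as a convenient general-purpose engine, and the sales table and its columns are invented for the illustration. A point update touches one short row by key; the analytic aggregate must scan every row.

import sqlite3, time, random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(i, random.choice("NSEW"), random.random() * 100) for i in range(500_000)],
)
conn.commit()

t0 = time.perf_counter()
# OLTP-style work: update one short row, located by its key
conn.execute("UPDATE sales SET amount = amount + 1 WHERE id = ?", (123,))
conn.commit()
t1 = time.perf_counter()
# Analytic work: the aggregate has to touch every row in the table
conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
t2 = time.perf_counter()
print(f"point update: {t1 - t0:.6f}s   full-scan aggregate: {t2 - t1:.6f}s")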
Moore’s Law, Kryder’s Law, and a huge
exception
[Chart: compound annual growth rates of transistors/chip (since 1971), disk density (since 1956), and disk speed (since 1956). Cumulative growth: transistors/chip >100,000x; disk density >100,000,000x; disk speed only ~12.5x.]
The disk speed barrier dominates everything!
The “1,000,000:1” disk-speed barrier
RAM access times ~5-7.5 nanoseconds
CPU clock speed <1 nanosecond
Interprocessor communication can be ~1,000X slower than on-chip
Disk seek times ~2.5-3 milliseconds
Limit = ½ rotation; at 15,000 RPM that is 1/30,000 minutes, i.e., 1/500 seconds = 2 ms
Tiering brings it closer to ~1,000:1 in practice, but even so the difference is VERY BIG
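The slide's arithmetic can be checked directly; the 15,000 RPM figure is an assumption (it is what makes the ½-rotation limit come out to 2 ms):

# Back-of-the-envelope check of the slide's numbers (hypothetical but
# typical figures: a 15,000 RPM drive and ~6 ns RAM access)
rpm = 15_000
half_rotation_s = 0.5 / (rpm / 60)      # half a rotation, in seconds
print(half_rotation_s)                   # 0.002 s = 2 ms, matching the slide

ram_access_s = 6e-9                      # ~5-7.5 ns per the slide
print(half_rotation_s / ram_access_s)    # ~333,000:1; against a <1 ns CPU cycle it passes 1,000,000:1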
Hardware strategies to optimize analytic I/O
Lots of RAM
Parallel disk access!!!
Lots of networking
Tuned MPP (Massively Parallel Processing) is the key
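A minimal sketch of the MPP idea, with all names and the 4-way split invented for the example: partition the data so each worker scans only its own slice in parallel, then merge the partial results. Real MPP systems do the same across nodes that each own their own disks.

from multiprocessing import Pool

def scan_partition(rows):
    # Each worker aggregates only its local slice (its "own disk")
    total = 0.0
    for amount in rows:
        total += amount
    return total

if __name__ == "__main__":
    data = [float(i) for i in range(1_000_000)]
    partitions = [data[i::4] for i in range(4)]   # 4-way "shared-nothing" split
    with Pool(4) as pool:
        partials = pool.map(scan_partition, partitions)
    print(sum(partials))                           # merge step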
Software strategies to optimize analytic I/O
Minimize data returned
Minimize index accesses
Page size
Precalculate results
Classic query optimization
Materialized views
OLAP cubes
Return data sequentially
Store data in columns
Stash data in RAM
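A minimal sketch of the "store data in columns" point, with invented data: a row store drags every field of every row through memory to answer an aggregate, while a column store reads only the one column the query needs (and values of a single type stored together also compress well).

rows = [(i, "customer%d" % (i % 1000), i * 0.5) for i in range(100_000)]

# Row layout: the aggregate pulls whole rows through memory
total_row_store = sum(r[2] for r in rows)

# Column layout: the same table stored as one array per column
ids, names, amounts = zip(*rows)
total_col_store = sum(amounts)              # touches only the needed column

assert total_row_store == total_col_store   # same answer, far less data read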
16 contenders
Aster Data
Dataupia
Exasol
Greenplum
HP Neoview
IBM DB2 BCUs
Infobright
Kickfire
Kognitio
Microsoft Madison
Netezza
Oracle Exadata
ParAccel
Sybase IQ
Teradata
Vertica
Varied approaches
3 are trying to meld OLTP and analytic processing
2 have very specialized hardware
1 is purely RAM-centric
Several use InfiniBand; several stress GigE (Gigabit Ethernet) switches
6 are columnar
2 stress cloud/DaaS
Segmentation made simple
One database to rule them all
One analytic database to rule them all
Frontline analytic database
Very, very big analytic database
Big analytic database handled very cost-effectively
7 more precise segmentation issues
What is your tolerance for specialized hardware?
What is your tolerance for set-up effort?
What is your tolerance for ongoing administrative burden?
What are your insert and update requirements?
At what volumes will you run fairly simple queries?
What are your complex queries like?
and, most important,
Are you madly in love with your current DBMS?
Specialized hardware
Custom or unusual chips (rare)
Custom or unusual interconnects
Fixed configurations of common parts
Set-up effort
Hardware acquisition and installation
Database and index design
Data cleaning and integration
Porting of existing applications
Ongoing administration
Part of the set-up effort also translates to an ongoing administrative burden
Indexes, materialized views, cubes, etc. …
… unless the DBMS architecture minimizes their use
Inserts and updates
Finally we get to the performance criteria
Batch load
ELT (or ETLT) vs. pure ETL
Mini-batches or trickle feeds
True transactional updates
Concurrent queries
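A toy contrast, invented for this transcript, between trickle-feed and mini-batch loading, again with sqlite3 as a stand-in engine: committing per row pays the per-transaction cost 50,000 times, while batching amortizes it.

import sqlite3, time

def load(conn, rows, batch_size):
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany("INSERT INTO t VALUES (?)", rows[i:i + batch_size])
        conn.commit()                       # one commit per batch

for batch_size in (1, 10_000):              # trickle feed vs mini-batch
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    t0 = time.perf_counter()
    load(conn, [(i,) for i in range(50_000)], batch_size)
    print(batch_size, time.perf_counter() - t0)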
Major use cases
Traditional BI
Customer-facing apps
Product maturity is often key
Complex queries
This is where the glamour is
MPP to speed up I/O
Clever answers to the data redistribution problem
Table scans vs. random access
Columns vs. rows
Aggressive use of RAM
Compression (saving on disk cost isn’t the point)
… and fast analytics even beyond the queries
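A minimal sketch of the data redistribution ("shuffle") problem, with the node count and tables invented for the example: to join tables that are spread across nodes, each node re-sends its rows so matching join keys land on the same node and the join can run locally.

NODES = 4

def redistribute(rows, key_index):
    # Hash-partition rows on the join key so matching rows become co-located
    buckets = [[] for _ in range(NODES)]
    for row in rows:
        buckets[hash(row[key_index]) % NODES].append(row)
    return buckets

orders = [(i % 100, "order%d" % i) for i in range(1_000)]    # (customer_id, ...)
customers = [(c, "name%d" % c) for c in range(100)]          # (customer_id, ...)

order_buckets = redistribute(orders, 0)
customer_buckets = redistribute(customers, 0)

# Each "node" joins only its own bucket pair; no cross-node lookups needed
joined = [
    (o, c)
    for node in range(NODES)
    for o in order_buckets[node]
    for c in customer_buckets[node]
    if o[0] == c[0]
]
print(len(joined))   # 1000: every order matched its customer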
The analytic DBMS selection process
Figure out what you’re trying to buy
Make a short list
Do free POCs (proofs of concept)
Evaluate and decide
Figure out what you’re trying to buy
Inventory your use cases
Set constraints
Current
Known future
Wish-list/dream-list future
People and platforms
Money
Establish target SLAs
Must-haves
Nice-to-haves
Short list basics
You might as well consider the incumbent(s)
Cash cost is an easy filter to apply
What is the crux of the deployment effort?
References can be scarce
Free POCs are a great invention
Most of the effort is in the set-up
The better you match your use cases, the more reliable the POC is
You might as well do POCs for several vendors – at (almost) the same time!
Where is the POC being held?
Can you plan this yourself, or do you need outside help?
Evaluate and decide
It all comes down to
Cost
Speed
Risk
and in some cases
Time to value
Upside
Further information
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact @monash.com
http://www.monash.com
http://www.DBMS2.com