Transcript Slide 1

How to Select an Analytic DBMS
DRAFT!!
by
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact @monash.com
http://www.monash.com
http://www.DBMS2.com
Curt Monash
- Analyst since 1981, own firm since 1987
  - Covered DBMS since the pre-relational days
  - Also analytics, search, etc.
- Publicly available research
  - Blogs, including DBMS2 (www.DBMS2.com -- the source for most of this talk)
  - Feed at www.monash.com/blogs.html
  - White papers and more at www.monash.com
- User and vendor consulting

Our agenda
- Why are there such things as specialized analytic DBMS?
- What are the major analytic DBMS product alternatives?
- What are the most relevant differentiations among analytic DBMS users?
- What’s the best process for selecting an analytic DBMS?

Why are there specialized analytic DBMS?
- General-purpose database managers are optimized for updating short rows …
  - … not for analytic query performance
- 10-100X price/performance differences are not uncommon
- At issue is the interplay between storage, processors, and RAM

Moore’s Law, Kryder’s Law, and a huge exception
[Chart: compound annual growth rates. Transistors/chip: >100,000x growth since 1971. Disk density: >100,000,000x growth since 1956. Disk speed: only ~12.5x since 1956. The disk speed barrier dominates everything!]

The “1,000,000:1” disk-speed barrier
- RAM access times ~5-7.5 nanoseconds
  - CPU clock speed <1 nanosecond
  - Interprocessor communication can be ~1,000X slower than on-chip
- Disk seek times ~2.5-3 milliseconds (arithmetic sketch below)
  - Limit = ½ rotation, i.e., 1/30,000 minutes, i.e., 1/500 seconds = 2 ms
- Tiering brings it closer to ~1,000:1 in practice, but even so the difference is VERY BIG

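A back-of-the-envelope check of those numbers, as a minimal Python sketch (the 15,000 RPM figure is an assumption implied by the 1/30,000-minute half rotation; the midpoints are taken from the ranges above):

```python
# Rough latency arithmetic behind the disk-speed barrier.
# Assumption: a 15,000 RPM drive, as implied by the 1/30,000-minute half rotation.

rpm = 15_000
half_rotation_s = 0.5 * (60 / rpm)            # seconds per half rotation
print(f"half rotation = {half_rotation_s * 1e3:.1f} ms")         # 2.0 ms

ram_access_s = 6e-9                           # midpoint of ~5-7.5 ns
disk_seek_s = 2.75e-3                         # midpoint of ~2.5-3 ms
ratio = disk_seek_s / ram_access_s
print(f"disk seek / RAM access = {ratio:,.0f} : 1")              # ~458,000:1, i.e. on the order of a million
```
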
Software strategies to optimize analytic I/O
- Minimize data returned
  - Classic query optimization
- Minimize index accesses
  - Page size
- Precalculate results
  - Materialized views
  - OLAP cubes
- Return data sequentially
  - Store data in columns (sketch below)
- Stash data in RAM

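To make “return data sequentially / store data in columns” concrete, here is a minimal, hypothetical sketch (in-memory Python lists standing in for on-disk storage): a row store has to read every field of every row to average one column, while a column store reads only that column.

```python
import json

# Hypothetical illustration: data that must be scanned to average ONE column.
rows = [
    {"order_id": i, "customer": f"c{i % 100}", "region": "EMEA", "revenue": i * 1.5}
    for i in range(1_000)
]

# Row store: whole rows come off disk, even though only 'revenue' is needed.
row_store_bytes = sum(len(json.dumps(r)) for r in rows)

# Column store: the 'revenue' column is stored contiguously and read alone.
revenue_column = [r["revenue"] for r in rows]      # built once, at load time
col_store_bytes = len(json.dumps(revenue_column))

print(f"avg revenue = {sum(revenue_column) / len(revenue_column):.2f}")
print(f"bytes scanned: row store ~{row_store_bytes:,}, column store ~{col_store_bytes:,}")
```
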
Hardware strategies to optimize analytic I/O
- Lots of RAM
- Parallel disk access!!!
- Lots of networking
- Tuned MPP (Massively Parallel Processing) is the key (toy sketch below)

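A toy sketch of the MPP idea (hypothetical node count; each worker process plays a node scanning its own slice of the data in parallel, and a coordinator merges the partial results):

```python
from multiprocessing import Pool

def scan_partition(partition):
    """Each 'node' scans only its own slice and returns a partial aggregate."""
    return sum(partition), len(partition)

if __name__ == "__main__":
    fact_column = list(range(1_000_000))                         # stand-in for a fact-table column
    n_nodes = 4
    slices = [fact_column[i::n_nodes] for i in range(n_nodes)]   # round-robin distribution

    with Pool(n_nodes) as pool:
        partials = pool.map(scan_partition, slices)              # parallel "table scan"

    total, count = (sum(x) for x in zip(*partials))              # coordinator merges partials
    print(f"average = {total / count}")
```
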
Specialty hardware strategies
- Custom or unusual chips (rare)
- Custom or unusual interconnects
- Fixed configurations of common parts
  - Appliances or recommended configurations
- And there’s also SaaS

18 contenders (and there are more)
- Aster Data
- Dataupia
- Exasol
- Greenplum
- HP Neoview
- IBM DB2 BCUs
- Infobright/MySQL
- Kickfire/MySQL
- Kognitio
- Microsoft Madison
- Netezza
- Oracle Exadata
- Oracle w/o Exadata
- ParAccel
- SQL Server w/o Madison
- Sybase IQ
- Teradata
- Vertica

General areas of feature differentiation
- Query performance
- Update/load performance
- Compatibilities
- Advanced analytics
- Alternate datatypes
- Manageability and availability
- Encryption and security

Major analytic DBMS product groupings
Architecture is a hot subject:
- Traditional OLTP
- Row-based MPP
- Columnar
- (Not covered tonight) MOLAP/array-based

Traditional OLTP examples
- Oracle (especially pre-Exadata)
- IBM DB2 (especially mainframe)
- Microsoft SQL Server (pre-Madison)

Analytic optimizations for OLTP DBMS
- Two major kinds of precalculation
  - Star indexes
  - Materialized views (sketch below)
- Other specialized indexes
- Query optimization tools
- OLAP extensions
- SQL 2003
- Other embedded analytics

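A hedged sketch of the precalculation idea, using SQLite from the Python standard library (SQLite has no true materialized views, so a hand-built summary table stands in for one; in a full-featured OLTP DBMS the engine maintains and refreshes it for you):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 20.0), ("west", 5.0), ("west", 7.5)],
)

# "Materialized view": the aggregate is computed once, at load/refresh time...
con.execute(
    "CREATE TABLE sales_by_region AS "
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)

# ...so reporting queries hit the small precomputed table, not the big fact table.
for region, total in con.execute("SELECT region, total FROM sales_by_region"):
    print(region, total)
```
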
Drawbacks
- Complexity and people cost
- Hardware cost
- Software cost
- Absolute performance

Legitimate use scenarios
- When TCO isn’t an issue
  - Undemanding performance (and therefore administration too)
- When specialized features matter
  - OLTP-like
  - Integrated MOLAP
  - Edge-case analytics
- Rigid enterprise standards
- Small enterprise/true single-instance

Row-based MPP examples
- Teradata
- DB2 (open systems version)
- Netezza
- Oracle Exadata (sort of)
- DATAllegro/Microsoft Madison
- Greenplum
- Aster Data
- Kognitio
- HP Neoview

Typical design choices in row-based MPP
- “Random” (hashed or round-robin) data distribution among nodes (sketch below)
- Large block sizes
  - Suitable for scans rather than random accesses
- Limited indexing alternatives
  - Or little optimization for using the full boat
- Carefully balanced hardware
- High-end networking

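A minimal sketch of hashed distribution (hypothetical node count and key names): hashing the distribution key spreads rows evenly and deterministically, so each node can scan, and often join, its own slice without talking to the others.

```python
import zlib

N_NODES = 8   # hypothetical MPP node count

def node_for(distribution_key: str) -> int:
    """Deterministic hash of the distribution key -> owning node."""
    return zlib.crc32(distribution_key.encode()) % N_NODES

rows = [(f"cust-{i:04d}", i * 10.0) for i in range(20)]

# The loader/coordinator scatters rows; each node ends up with roughly 1/N of the data.
node_slices = {n: [] for n in range(N_NODES)}
for key, amount in rows:
    node_slices[node_for(key)].append((key, amount))

for n, rows_on_node in sorted(node_slices.items()):
    print(f"node {n}: {len(rows_on_node)} rows")
```
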
Tradeoffs among row MPP alternatives
- Enterprise standards
- Vendor size
- Hardware lock-in
- Total system price
- Features

Columnar DBMS examples
- Sybase IQ
- SAND
- Vertica
- ParAccel
- InfoBright
- Kickfire
- Exasol
- MonetDB
- SAP BI Accelerator (sort of)

Columnar pros and cons
- Bulk retrieval is faster
- Pinpoint I/O is slower
- Compression is easier (sketch below)
- Memory-centric processing is easier

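Why “compression is easier” in a column store, as a hypothetical sketch: one column’s values sit together and are often sorted or low-cardinality, so a simple scheme like run-length encoding collapses them dramatically.

```python
from itertools import groupby

# A sorted, low-cardinality column (e.g. 'region') compresses extremely well.
region_column = ["east"] * 50_000 + ["north"] * 30_000 + ["west"] * 20_000

# Run-length encoding: store (value, run length) instead of every value.
rle = [(value, sum(1 for _ in run)) for value, run in groupby(region_column)]

print(rle)                                  # [('east', 50000), ('north', 30000), ('west', 20000)]
print(len(region_column), "values ->", len(rle), "runs")
```
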
Segmentation – a first cut
- One database to rule them all
- One analytic database to rule them all
- Frontline analytic database
- Very, very big analytic database
- Big analytic database handled very cost-effectively

Basics of systematic segmentation
- Use cases
- Metrics
- Platform preferences

Use cases – a first cut
- Light reporting
- Diverse EDW
- Big Data
- Operational analytics

Metrics – a first cut
- Total user data
  - Below 1-2 TB, references abound
  - 10 TB is another major breakpoint
- Total concurrent users
  - 5, 15, 50, or 500?
- Data freshness
  - Hours
  - Minutes
  - Seconds

Basic platform issues
- Enterprise standards
- Appliance-friendliness
- Need for MPP?
- (SaaS)

The selection process in a nutshell
- Figure out what you’re trying to buy
- Make a shortlist
- Do free POCs*
- Evaluate and decide

*The only part that’s even slightly specific to the analytic DBMS category

Figure out what you’re trying to buy
- Inventory your use cases
  - Current
  - Known future
  - Wish-list/dream-list future
- Set constraints
  - People and platforms
  - Money
- Establish target SLAs
  - Must-haves
  - Nice-to-haves

Use-case checklist -- generalities
- Database growth
  - As time goes by …
  - More detail
  - New data sources
- Users (human)
- Users/usage (automated)
- Freshness (data and query results)

Use-case checklist – traditional BI
- Reports
  - Today
  - Future
- Dashboards and alerts
  - Today
  - Future
  - Latency
- Ad-hoc
  - Users
  - Now that we have great response time …

Use-case checklist – data mining
- How much do you think it would improve results to:
  - Run more models?
  - Model on more data?
  - Add more variables?
  - Increase model complexity?
- Which of those can the DBMS help with anyway?
- What about scoring?
  - Real-time
  - Other latency issues

SLA realism
- What kind of turnaround truly matters?
  - Customer or customer-facing users
  - Executive users
  - Analyst users
- How bad is downtime?
  - Customer or customer-facing users
  - Executive users
  - Analyst users

Short list constraints
- Cash cost
  - But purchases are heavily negotiated
- Deployment effort
  - Appliances can be good
- Platform politics
  - Appliances can be bad
  - You might as well consider incumbent(s)

Filling out the shortlist
- Who matches your requirements in theory?
- What kinds of evidence do you require?
  - References?
    - How many?
    - How relevant?
  - A careful POC?
  - Analyst recommendations?
  - General “buzz”?

A checklist for shortlists
- What is your tolerance for specialized hardware?
- What is your tolerance for set-up effort?
- What is your tolerance for ongoing administrative burden?
- What are your insert and update requirements?
- At what volumes will you run fairly simple queries?
- What are your complex queries like?
- And, most important: are you madly in love with your current DBMS?

Proof-of-Concept basics
- The better you match your use cases, the more reliable the POC is
- Most of the effort is in the set-up
- You might as well do POCs for several vendors – at (almost) the same time!
- Where is the POC being held?

The three big POC challenges
- Getting data
  - Real?
  - Synthetic?
  - Hybrid? (sketch after this list)
- Picking queries
  - Politics
  - Privacy
  - And more?
- Realistic simulation(s)
  - Workload
  - Platform
  - Talent

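When politics or privacy keep real data out of the POC, a common hybrid is to keep the real schema and rough value distributions but generate the rows. A hypothetical sketch (the file name, column names, and distributions are made up for illustration):

```python
import csv
import random

random.seed(42)   # reproducible across POC runs and vendors

REGIONS = ["east", "west", "north", "south"]

# Hypothetical: mimic the production schema and rough distributions
# without copying any real customer records.
with open("poc_orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer_id", "region", "amount"])
    for order_id in range(100_000):
        writer.writerow([
            order_id,
            random.randint(1, 25_000),                    # ~25K synthetic customers
            random.choice(REGIONS),
            round(random.lognormvariate(3.0, 1.0), 2),    # skewed order amounts
        ])
```
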
POC tips
- Don’t underestimate requirements
- Don’t overestimate requirements
- Get SOME data ASAP
- Don’t leave the vendor in control
- Test what you’ll be buying
- Use the baseball bat

Evaluate and decide
- It all comes down to:
  - Cost
  - Speed
  - Risk
- … and in some cases:
  - Time to value
  - Upside

Further information
Curt A. Monash, Ph.D.
President, Monash Research
Editor, DBMS2
contact @monash.com
http://www.monash.com
http://www.DBMS2.com