
Parallel Database Systems 101
Jim Gray & Gordon Bell
Microsoft Corporation
presented at VLDB 95, Zurich, Switzerland, Sept 1995
• Detailed notes available from [email protected]
– this presentation covers 120 of the 174 slides (time limit)
– Notes in PowerPoint7 and Word7
Outline
• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer’s Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Kinds Of Information Processing
                 Point-to-Point          Broadcast
Immediate        conversation, money     lecture, concert      (the Network)
Time-Shifted     mail                    book, newspaper       (the Data Base)
It's ALL going electronic
Immediate data is being stored for analysis (so it ALL ends up in a database)
Analysis & automatic processing are being added
Why Put Everything in Cyberspace?
Low rent: min $/byte
Shrinks time: now or later
Shrinks space: here or there
Automates processing: knowbots
Point-to-Point OR Broadcast, Immediate OR Time-Delayed: the Network
Locate, Process, Analyze, Summarize: the Data Base
Databases:
Information At Your Fingertips™
Information Network™
Knowledge Navigator™
• All information will be in an online database (somewhere)
• You might record everything you
• read: 10MB/day, 400 GB/lifetime (two tapes)
• hear: 400MB/day, 16 TB/lifetime (a tape per decade)
• see: 1MB/s, 40GB/day, 1.6 PB/lifetime (maybe someday)
• Data storage, organization, and analysis are a challenge.
  That is what databases are about.
• DBs do a good job on “records”
• Now working on text, spatial, image, and sound.
Databases Store ALL Data Types
• The Old World:
– Millions of objects
– 100-byte objects
• The New World:
– Billions of objects
– Big objects (1MB)
– Objects have behavior (methods)
[Figure: the “People” table grows from (Name, Address) rows (David/NY, Mike/Berk, Won/Austin)
to (Name, Address, Papers, Picture, Voice)]
Paperless office
Library of Congress online
All information online: entertainment, publishing, business
Information Network, Knowledge Navigator, Information at your fingertips
Magnetic Storage Cheaper than Paper
• File Cabinet:  cabinet (4 drawer)      250$
                 paper (24,000 sheets)   250$
                 space (2x3 @ 10$/ft2)   180$
                 total                   700$  =  3 ¢/sheet
• Disk:          disk (8 GB)           2,000$
                 ASCII: 4 M pages  =  0.05 ¢/sheet  (60x cheaper than paper)
• Image:         200 K pages       =  1 ¢/sheet     (3x cheaper than paper)
• Store everything on disk
Moore’s Law
• XXX doubles every 18 months: 60% increase per year
  – Microprocessor speeds
  – Chip density
  – Magnetic disk density
  – Communications bandwidth (WAN bandwidth approaching LANs)
[Figure: 1-chip memory size vs year, 1970-2000: 1 Kb, 4 Kb, ... 256 Mb chips; 8 KB to 1 GB shown on the byte axis]
• Exponential growth:
  – The past does not matter
  – 10x here, 10x there, soon you're talking REAL change.
• PC costs decline faster than any other platform
  – Volume & learning curves
  – PCs will be the building bricks of all future systems
In The Limit: The Pico Processor
1 mm³ Pico Processor: 1 M SPECmarks, 1 TFLOP
  10 picosecond ram        1 MB
  10 nanosecond ram        100 MB
  10 microsecond ram       10 GB
  10 millisecond disc      1 TB
  10 second tape archive   100 TB
10^6 clocks to bulk ram: the event-horizon is on chip.
VM reincarnated; multi-program cache; on-chip SMP
Terror Bytes!
What's a Terabyte?
(250 K$ of Disk @ .25$/MB)
1 Terabyte =
  1,000,000,000 business letters     150 miles of bookshelf
  100,000,000 book pages             15 miles of bookshelf
  50,000,000 FAX images              7 miles of bookshelf
  10,000,000 TV pictures (mpeg)      10 days of video
  4,000 LandSat images
Library of Congress (in ASCII) is 25 TB
1980: 200 M$ of disc (10,000 discs); 5 M$ of tape silo (10,000 tapes)
1995: 250 K$ of magnetic disc (70 discs); 500 K$ of optical disc robot (250 platters); 50 K$ of tape silo (50 tapes)
Terror Byte !!
Summary (of storage)
• Capacity and cost are improving fast (100x per decade)
• Accesses are getting larger (MOX, GOX, SCANS)
• BUT latencies and bandwidth are not improving much (3x per decade)
• How to deal with this???
• Bandwidth:
– Use partitioned parallel access (disk & tape farms)
• Latency
– Pipeline data up storage hierarchy (next section)
Interesting Storage Ratios
• Disk is back to 100x cheaper than RAM (RAM $/MB : Disk $/MB ~ 100:1)
• Nearline tape is only 10x cheaper than disk (Disk $/MB : Tape $/MB ~ 10:1)
  – and the gap is closing!
Disk & DRAM look good. Why bother with tape?
[Figure: the RAM:Disk price ratio swings between 100:1 and 30:1, and the Disk:Nearline-Tape ratio falls from 10:1 toward 1:1, 1960-2000]
Performance = Storage Accesses, not Instructions Executed
• In the “old days” we counted instructions and IOs
• Now we count memory references
• Processors wait most of the time
[Figure: where the clock ticks go, AlphaSort components: Disc Wait, Sort, OS, Memory Wait, B-Cache Data Miss, I-Cache Miss, D-Cache Miss]
70 MIPS; “real” apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not.
Storage Latency: How Far Away is the Data?
Clock ticks    Level                Analogy
10^9           Tape/Optical Robot   Andromeda (2,000 years)
10^6           Disk                 Pluto (2 years)
100            Memory               Sacramento (1.5 hr)
10             On-Board Cache       This Campus (10 min)
2              On-Chip Cache        This Room (1 min)
1              Registers            My Head
Network Speeds
• Network speeds grow 60% / year
• WAN speeds limited by politics
• if voice is X$/minute, how much is video?
• Switched 100Mb Ethernet: 1,000x more bandwidth
• ATM is a scaleable net:
  • 1 Gb/s to desktop & wall plug
  • commodity: same for LAN, WAN
• 1 Tb/s fibers in the laboratory
[Figure: Comm Speedups, processor speed (i/s) vs LAN & WAN bandwidth (b/s), 1960-2000]
Network Trends & Challenge
• Bandwidth UP 10^4, Price DOWN
• Speed-of-light unchanged
• Software got worse
• Standard Fast Nets:
  » ATM
  » PCI
  » Myrinet
  » Tnet
[Figure: bandwidth of PC Bus, CAN, LAN, WAN, POTS from 1 Kb/s to 1 Gb/s, 1965-2000]
• HOPE:
– Commodity Net
– Good software
• Then clusters become a SNAP!
• commodity: 10k$/slice
The Seven Price Tiers
• 10$:           wrist watch computers
• 100$:          pocket/palm computers
• 1,000$:        portable computers
• 10,000$:       personal computers (desktop)
• 100,000$:      departmental computers (closet)
• 1,000,000$:    site computers (glass house)
• 10,000,000$:   regional computers (glass castle)
SuperServer: costs more than 100,000$
“Mainframe”: costs more than 1 M$
Must be an array of processors, disks, tapes, comm ports
The New Computer Industry
• Horizontal integration is the new structure
• Each layer picks the best from the lower layer.
• Desktop (C/S) market share: 1991: 50%; 1995: 75%
Function           Example
Operation          AT&T
Integration        EDS
Applications       SAP
Middleware         Oracle
Baseware           Microsoft
Systems            Compaq
Silicon & Oxide    Intel & Seagate
Software Economics: Bill’s Law
• Bill Joy’s law (Sun):
Don’t write software for less than 100,000 platforms.
@10M$ engineering expense, 1,000$ price
• Bill Gates’ law:
Don’t write software for less than 1,000,000 platforms.
@10M$ engineering expense, 100$ price
• Examples:
• UNIX vs NT: 3,500$ vs 500$
• UNIX-Oracle vs SQL-Server: 100,000$ vs 1,000$
• No Spreadsheet or Presentation pack on UNIX/VMS/...
• Commoditization of base Software & Hardware
Thesis
Many Little will Win over Few Big
[Figure: Mainframe (1 M$) -> Mini (100 K$) -> Micro (10 K$) -> Nano; disk form factors shrink 14" -> 9" -> 5.25" -> 3.5" -> 2.5" -> 1.8"]
Year 2000 4B Machine
• The Year 2000 commodity PC (3 K$):
  • Billion instructions/sec (1 Bips processor)
  • Billion bytes RAM
  • Billion bits/s net
  • 10 B bytes disk
  • Billion pixel display (3000 x 3000 x 24-bit pixels)
4 B PC’s: The Bricks of Cyberspace
• Cost 3,000 $
• Come with
• OS (NT, POSIX,..)
• DBMS
• High speed Net
• System management
• GUI / OOUI
• Tools
• Compatible with everyone else
• CyberBricks
Implications of Hardware Trends
Large disc farms will be inexpensive (100$/GB)
Large RAM databases will be inexpensive (1,000$/GB)
Processors will be inexpensive
So the building block will be a processor (1k SPECint CPU)
with large RAM (5 GB) and lots of disc (50 GB)
Implication of Hardware Trends: Clusters
Future servers are CLUSTERS of processors and discs
(each node: CPU, 5 GB RAM, 50 GB disc).
Distributed database techniques make clusters work.
Future SuperServer: the 4T Machine
An array of 1,000 4B machines: processors, disks, tapes, comm lines
  100 nodes = 1 Tips
  1,000 discs = 10 Terrorbytes
  100 tape transports = 1,000 tapes = 1 PetaByte
  high-speed network (10 Gb/s)
  a few MegaBucks
Challenge: Manageability, Programmability, Security,
Availability, Scaleability, Affordability
As easy as a single system
Great Debate: Shared What?
Shared Memory (SMP): easy to program, difficult to build, difficult to scaleup
  (Sequent, SGI, Sun)
Shared Disk: (VMScluster, Sysplex)
Shared Nothing (network): hard to program, easy to build, easy to scaleup
  (Tandem, Teradata, SP2)
Winner will be a synthesis of these ideas.
Distributed shared memory (DASH, Encore) blurs the distinction
between network and bus (locality still important),
but gives shared-memory programming at message cost.
Scaleables: Uneconomic So Far
• A Slice is a processor, memory, and a few disks.
• Slice Price of Scaleables so far is 5x to 10x markup
– Teradata: 70k$ for an Intel 486 + 32MB + 4 disks
– Tandem: 100k$ for a MipsCo R4000 + 64MB + 4 disks
– Intel: 75k$ for an i860 + 32MB + 2 disks
– TMC: 75k$ for a SPARC 3 + 32MB + 2 disks
– IBM/SP2: 100k$ for an RS6000 + 64MB + 8 disks
• Compaq Slice Price is less than 10k$
• What is the problem?
– Proprietary interconnect
– Proprietary packaging
– Proprietary software (vendorIX)
Summary
• Storage trends force pipeline & partition parallelism
– Lots of bytes & bandwidth per dollar
– Lots of latency
• Processor trends force pipeline & partition
– Lots of MIPS per dollar
– Lots of processors
• Putting it together (Scaleable Networks and Platforms)
– Build clusters of commodity processors & storage
– Commodity interconnect is key (S of PMS)
» Traditional interconnects give 100k$/slice.
– Commodity Cluster Operating System is key
– Fault isolation and tolerance is key
– Bell: Automatic Parallel Programming is key
The Hardware is in Place and
Then A Miracle Occurs
?
SNAP
Scaleable Network And Platforms
Commodity Distributed OS
built on
Commodity Platforms
Commodity Network Interconnect
Enables
Parallel Applications
Why Parallel Access To Data?
At 10 MB/s: 1.2 days to scan 1 Terabyte.
1,000-way parallel: 1.5 minute SCAN of the same Terabyte.
Parallelism: divide a big problem
into many smaller ones
to be solved in parallel.
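A quick check of that arithmetic (a sketch in Python; the 10 MB/s per-disk rate is the slide's number, the rest is plain division):

```python
TB = 1.0e12  # one terabyte, in bytes

def scan_seconds(nbytes, disks, rate=10e6):
    """Elapsed seconds to scan nbytes spread across `disks` disks at 10 MB/s each."""
    return nbytes / (disks * rate)

print(scan_seconds(TB, 1) / 86400)   # ~1.2 days on a single disk
print(scan_seconds(TB, 1000) / 60)   # ~1.7 minutes 1,000-way parallel (the slide rounds to 1.5)
```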
DataFlow Programming
Prefetch & Postwrite Hide Latency
• Can't wait for the data to arrive (2,000 years!)
• Need a memory that gets the data in advance (100 MB/s)
• Solution:
• Pipeline from source (tape, disc, ram...) to cpu cache
• Pipeline results to destination
Why are Relational Operators
So Successful for Parallelism?
Relational data model
uniform operators
on uniform data stream
Closed under composition
Each operator consumes 1 or 2 input streams
Each stream is a uniform collection of data
Sequential data in and out: Pure dataflow
partitioning some operators (e.g. aggregates, non-equi-join, sort,..)
requires innovation
AUTOMATIC PARALLELISM
Database Systems “Hide” Parallelism
• Automate system management via tools
• data placement
• data organization (indexing)
• periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
• duplex & failover
• transactions
• Automatic parallelism
• among transactions (locking)
• within a transaction (parallel execution)
Automatic Parallel OR DB
SELECT image
FROM landsat
WHERE date BETWEEN 1970 AND 1990
  AND overlaps(location, :Rockies)
  AND snow_cover(image) > .7;
Landsat table (date, loc, image): rows from 1/2/72 through 4/8/95, locations 33N 120W ... 34N 120W
Temporal, spatial, and image indices.
Assign one process per processor/disk:
  find images with the right date & location
  analyze the image; if 70% snow cover, return it
Answer: the images that pass the date, location, & image tests
Outline
• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer’s Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Parallelism: Speedup & Scaleup
Speedup: same job (100GB), more hardware, less time.
Scaleup: bigger job (100GB -> 1 TB), more hardware, same time.
Transaction scaleup: more clients/servers (1k clients on a 100GB server -> 10k clients on a 1 TB server), same response time.
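Stated as formulas (a sketch; these are just the slide's definitions written down):

```python
def speedup(old_time, new_time):
    """Same job, more hardware: linear speedup means this equals the hardware ratio."""
    return old_time / new_time

def scaleup(small_job_time, big_job_time):
    """N x bigger job on N x hardware: linear scaleup means this ratio stays at 1.0."""
    return small_job_time / big_job_time
```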
The New Law of Computing
Grosch's Law: 2x $ is 4x performance
  (1 MIPS for 1$ ... 1,000 MIPS for 32$, i.e. .03$/MIPS)
Parallel Law: 2x $ is 2x performance
  (1 MIPS for 1$ ... 1,000 MIPS for 1,000$)
Needs linear speedup and linear scaleup.
Not always possible.
Parallelism: Performance is the Goal
Goal is to get 'good' performance.
Law 1: parallel system should be
faster than serial system
Law 2: parallel system should give
near-linear scaleup or
near-linear speedup or
both.
The New Performance Metrics
• Transaction Processing Performance Council:
– TPC-A: simple transaction
– TPC-B: server only, about 3x lighter than TPC-A
– Both obsoleted by TPC-C (no new results after 6/7/95)
• TPC-C (revision 3) Transactions Per Minute tpm-C
– Mix of 5 transactions: query, update, minibatch
– Terminal price eliminated
– about 5x heavier than TPC-A (so 3.5 ktpsA ~ 20 ktpmC)
• TPC-D approved in March 1995 - Transactions Per Hour
– Scaleable database (30 GB, 100GB, 300GB,... )
– 17 complex SQL queries (no rewrites, no hints without permission)
– 2 load/purge queries
– No official results yet, many “customer” results.
TPC-C Results (12/94)
Courtesy of Charles Levine of Tandem (of course)
[Figure: cost ($/tpmC, 0-3000) vs performance (tpmC, 0-22,000) for Tandem Himalaya servers (16, 32, 64, 112 cpus), HP-H70, HP 9000 (E55, H70, T500, T500-8), AS400, RS6000, Sun, and DG; the Tandem cluster line extends past 20,000 tpmC]
Success Stories
• Online Transaction Processing
  • many little jobs
  • SQL systems support 3,700 tps-A (24 cpu, 240 disk hardware)
  • SQL systems support 21,000 tpm-C (112 cpu, 670 disk hardware)
• Batch (decision support and utility)
  • few big jobs, parallelism inside
  • scan data at 100 MB/s
  • linear scaleup to 500 processors
The Perils of Parallelism
Speedup = OldTime / NewTime
[Figure: speedup vs processors & discs; the good speedup curve is linear, bad curves flatten out (no parallelism) or rise and then fall back]
3 factors bend the curve:
  Startup: creating processes, opening files, optimization
  Interference: device (cpu, disc, bus) and logical (lock, hotspot, server, log, ...)
  Skew: if tasks get very small, variance > service time
Benchmark Buyer's Guide
Things to ask (for any system):
  When does it stop scaling?
  Throughput numbers, not ratios.
Standard benchmarks allow
  comparison to others
  comparison to sequential
[Figure: the benchmark report, throughput vs processors & discs]
Ratios and non-standard benchmarks are red flags.
Performance 101: Scan Rate
Disk is 3 MB/s to 10 MB/s
Record is 100B to 200B (TPC-D 110-160, Wisconsin 204)
So should be able to read 10 kr/s to 100 kr/s
Simple test: time this on a 1M record table (table on one disk, turn off parallelism):
SELECT count(*) FROM T WHERE x < :infinity;
Typical problems:
  disk or controller is an antique
  no read-ahead in the operating system or DB
  small page reads (2KB)
  data not clustered on disk
  big cpu overhead in record movement
Parallelism is not the cure for these problems.
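A minimal harness for that test (a sketch; `my_driver` and the connect string are placeholders for whatever Python DB-API driver your system ships, and the constant 1e30 stands in for :infinity):

```python
import time
import my_driver as db            # hypothetical DB-API module; substitute your own

conn = db.connect("...")          # table T: 1M records, one disk, parallelism off
cur = conn.cursor()

start = time.time()
cur.execute("SELECT count(*) FROM T WHERE x < 1e30")   # predicate passes every record
(count,) = cur.fetchone()
elapsed = time.time() - start
print(f"{count} records in {elapsed:.1f}s = {count/elapsed/1000:.0f} krec/s")  # want 10-100 krec/s
```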
Parallel Scan Rate
Simplest parallel test: scale up the previous test:
  4 disks, 4 controllers, 4 processors,
  4 times as many records, partitioned 4 ways.
Same query. Should have the same elapsed time.
[Plan: four Scan -> Agg(Count) streams feed one Agg(Sum)]
Some systems do.
Parallel Update Rate
Test: UPDATE T SET x = x + :one;
Test for a million-row T on 1 disk.
Test for a four-million-row T on 4 disks.
Look for bottlenecks (e.g. the log).
After each call, execute ROLLBACK WORK:
  see if UNDO runs at the DO speed
  see if UNDO is parallel (scales up)
The records/$/second Metric
• parallel database systems scan data
• An interesting metric (100 byte record):
– Record Scan Rate / System Cost
• Typical scan rates: 1k records/s to 30k records/s
• Each scaleable system has a “slice price” guess:
  – Gateway: 15k$ (P5 + ATM + 2 disks + NT + SQLserver or Informix or Oracle)
  – Teradata: 75k$
  – Sequent: 75k$ (P5 + 2 disks + Dynix + Informix)
  – Tandem: 100k$
  – IBM SP2: 130k$ (RS6000 + 2 disks, AIX, DB2)
• You can compute slice price for systems later in presentation
• BAD: 0.1 records/s/$ (there is one of these)
• GOOD: 0.33 records/s/$ (there is one of these)
• Super! 1.00 records/s/$ (there is one of these)
• We should aim at 10 records/s/$ with P6.
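A worked example of the metric (a sketch; the 5 krec/s scan rate is an illustrative number, and the 15k$ slice price is the Gateway guess above):

```python
def recs_per_s_per_dollar(scan_recs_per_s, slice_price_dollars):
    """The slide's metric: 100-byte-record scan rate per slice, divided by slice price."""
    return scan_recs_per_s / slice_price_dollars

print(recs_per_s_per_dollar(5_000, 15_000))   # 0.33 recs/s/$ -- "GOOD" on this scale
```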
Embarrassing Questions to
Ask Your PDB Vendor
How are constraints checked?
  ask about unique secondary indices
  ask about deferred constraints
  ask about referential integrity
How does parallelism interact with
  triggers
  stored procedures
  OO extensions
How can I change my 10 TB database design in an hour?
  add index
  add constraint
  reorganize / repartition
These are hard problems.
Outline
• Why Parallelism:
– technology push
– application pull
• Benchmark Buyer’s Guide
– metrics
– simple tests
• Parallel Database Techniques
– partitioned data
– partitioned and pipelined execution
– parallel relational operators
• Parallel Database Systems
– Teradata, Tandem, Oracle, Informix, Sybase, DB2, RedBrick
Automatic Data Partitioning
Split a SQL table across a subset of nodes & disks.
Partition within the set (ranges A...E, F...J, K...N, O...S, T...Z):
  Range: good for equijoins, range queries, group-by
  Hash: good for equijoins
  Round Robin: good to spread load
Shared disk and shared memory are less sensitive to partitioning;
shared nothing benefits from "good" partitioning. The sketch below shows all three.
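The three strategies in miniature (a sketch, not any vendor's code; five partitions and single-letter range boundaries are assumptions):

```python
import itertools
from bisect import bisect_right

BOUNDS = ["F", "K", "O", "T"]    # five ranges: A...E, F...J, K...N, O...S, T...Z

def range_partition(key: str) -> int:
    """Good for equijoins, range queries, group-by: nearby keys land together."""
    return bisect_right(BOUNDS, key[0].upper())

def hash_partition(key: str, n: int = 5) -> int:
    """Good for equijoins: equal keys always land together."""
    return hash(key) % n

round_robin = itertools.cycle(range(5))   # good to spread load: ignores the key

print(range_partition("Gray"), hash_partition("Gray"), next(round_robin))
```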
Index Partitioning
Hash indices partition by hash (buckets 0...9, 10..19, 20..29, 30..39, 40..•).
B-tree indices partition as a forest of trees,
one tree per range (A..C, D..F, G...M, N...R, S..Z).
Primary index clusters data.
Secondary Index Partitioning
In shared nothing, secondary indices are problematic.
Partition by base table key ranges:
  Insert: completely local (but what about uniqueness?)
  Lookup: examines ALL index trees (each covers A..Z)
  A unique index involves a lookup on every insert.
Partition by secondary key ranges (A..C, D..F, G...M, N...R, S..•), the Teradata solution:
  Insert: two nodes (base and index)
  Lookup: two nodes (index -> base)
  Uniqueness is easy.
Kinds of Parallel Execution
Pipeline: one sequential program feeds its output to the next.
Partition: N clones of a sequential program, each on its own data;
outputs split N ways, inputs merge M ways.
Data Rivers
Split + Merge Streams
N producers, M consumers, N x M data streams.
Producers add records to the river;
consumers consume records from the river.
Purely sequential programming.
The river does flow control and buffering,
and does the partition and merge of data records.
River = Split/Merge in Gamma = Exchange operator in Volcano.
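A toy river (a sketch of the split/merge idea, not Gamma's or Volcano's actual code): producers stay sequential, the river hash-splits records into M bounded queues, and each consumer reads one queue.

```python
import queue

class River:
    def __init__(self, n_producers, m_consumers):
        self.n = n_producers
        self.queues = [queue.Queue(maxsize=1000) for _ in range(m_consumers)]

    def put(self, key, rec):                     # producer side: split by hash
        self.queues[hash(key) % len(self.queues)].put(rec)

    def close(self):                             # each producer calls this once
        for q in self.queues:
            q.put(None)                          # one end-marker per producer

    def records(self, i):                        # consumer i: a plain sequential stream
        ends = 0
        while ends < self.n:
            rec = self.queues[i].get()
            if rec is None:
                ends += 1
            else:
                yield rec
```

The bounded queues are what give flow control: a fast producer blocks until its consumer catches up.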
Partitioned Execution
Spreads computation and IO among processors:
a Count operator runs against each partition of a table (A...E, F...J, K...N, O...S, T...Z).
Partitioned data gives NATURAL parallelism.
N x M way Parallelism
[Plan: five partitioned scans (A...E ... T...Z) feed five Joins; rivers repartition the rows into five Sorts, which feed three Merges]
N inputs, M outputs, no bottlenecks.
Partitioned data; partitioned and pipelined data flows.
Picking Data Ranges
Disk partitioning:
  For range partitioning, sample the load on disks;
  cool hot disks by making their range smaller.
  For hash partitioning,
  cool hot disks by mapping some buckets to other disks.
River partitioning:
  Use hashing and assume uniform.
  If range partitioning, sample the data and use a
  histogram to level the bulk.
Teradata, Tandem, and Oracle use these tricks. A splitter sketch follows.
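The range-sampling trick in miniature (a sketch): pick equi-depth split points from a sample so each range gets about the same bulk.

```python
def pick_splits(sample, n_parts):
    """Equi-depth histogram: boundaries that level the load across n_parts ranges."""
    s = sorted(sample)
    return [s[len(s) * i // n_parts] for i in range(1, n_parts)]

print(pick_splits(range(1000), 4))   # [250, 500, 750]
```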
Blocking Operators = Short Pipelines
An operator is blocking
if it does not produce any output
until it has consumed all its input.
Examples: Sort, Aggregates, Hash-Join (reads all of one operand).
The database-load template has three blocked phases per output:
  scan the source (tape, file, SQL table) -> Sort Runs -> Merge Runs -> Table Insert -> SQL Table
  and, for each index: Sort Runs -> Merge Runs -> Index Insert (Index 1, 2, 3)
Blocking operators kill pipeline parallelism
and make partition parallelism all the more important.
Simple Aggregates (sort or hash?)
Simple aggregates (count, min, max, ...) can use indices
More compact
Sometimes have aggregate info.
GROUP BY aggregates:
  scan in category order if possible (use indices)
  else, if the categories fit in RAM, use a RAM category hash table (sketched below)
  else, make a temp file of <category, item>,
    sort by category,
    and do the math in the merge step.
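The RAM hash-table case in miniature (a sketch): one scan, one in-memory table, no sort.

```python
from collections import defaultdict

def group_by_sum(rows):
    """GROUP BY via hashing; works when the set of categories fits in RAM."""
    totals = defaultdict(float)
    for category, value in rows:
        totals[category] += value
    return dict(totals)

print(group_by_sum([("a", 1), ("b", 2), ("a", 3)]))   # {'a': 4.0, 'b': 2.0}
```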
Sort
Used for:
  loading and reorganization (sort makes them sequential)
  building B-trees
  reports
  non-equijoins
Rarely used for aggregates or equi-joins (if hash join is available).
[Dataflow: Input Data -> Sort -> Runs -> Merge -> Sorted Data]
Parallel Sort
M inputs, N outputs.
Sort design (sketched below):
  sub-sorts generate runs;
  the river is range or hash partitioned;
  merge the runs (the disk pass and merge are not needed if the sort fits in memory).
Scales nearly linearly: log(10^12) / log(10^6) = 12 / 6,
so a million-times-bigger sort (on a million times the hardware) is only 2x slower per record.
Sort is the benchmark from hell for shared-nothing machines:
net traffic = disk bandwidth, and there is no data filtering at the source.
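The design in miniature (a sketch): sub-sort each input partition into a run, range-split every run through the river, and merge locally at each output.

```python
from bisect import bisect_left
from heapq import merge

def parallel_sort(partitions, splits):
    n_out = len(splits) + 1
    rivers = [[] for _ in range(n_out)]            # rivers[j]: sorted fragments for output j
    for part in partitions:                        # N producers
        run = sorted(part)                         # local sub-sort
        frags = [[] for _ in range(n_out)]
        for rec in run:
            frags[bisect_left(splits, rec)].append(rec)   # range split through the river
        for j, f in enumerate(frags):
            rivers[j].append(f)
    # M consumers: ranges are disjoint and ordered, so no global merge is needed
    return [list(merge(*frs)) for frs in rivers]

print(parallel_sort([[5, 1, 9], [7, 3, 2]], splits=[4, 8]))   # [[1, 2, 3], [5, 7], [9]]
```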
SIGMOD Sort Award
Datamation Sort: 1M records (100 B recs)
  1000 seconds, 1986 (Tandem)
  60 seconds, 1990 (Cray YMP)
  7 seconds, 1994 (Alpha, 3 cpu, Nyberg)
  3.5 seconds, 1995 (SGI Challenge)
Micros finally beat the mainframe!
Finally! a UNIX system that does IO.
[Figure 1: sort records/second vs elapsed time, 1985-1995: Tandem M68000, IBM 3090, hardware sorter, Cray YMP, Sequent, Intel Hypercube, Alpha (1.6 GB, Nyberg), SGI]
SIGMOD MinuteSort: 1.1 GB (Nyberg, SGI); no SIGMOD PennySort record yet.
[Figure 2: sort time on a 12-cpu SGI Challenge, 1.6 GB (16M 100-byte records), 2.2 GB on 96 disks, vs threads (sprocs) devoted to sorting; phases: read done, lists sorted, lists merged, write done]
Nested Loops Join
If the inner table is indexed on the join cols (b-tree or hash),
then sequentially scan the outer (from the start key);
for each outer record,
probe the inner table for matching recs.
Works best if the inner is in RAM (=> small inner table).
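In miniature (a sketch, with the inner index as a plain in-RAM dict):

```python
def index_nested_loops_join(outer, inner_index):
    """Scan the outer once; probe the inner's index on the join column per record."""
    for key, outer_rec in outer:
        for inner_rec in inner_index.get(key, ()):
            yield outer_rec, inner_rec

inner_index = {"k1": ["i1", "i2"], "k2": ["i3"]}       # join col -> matching inner recs
print(list(index_nested_loops_join([("k1", "o1")], inner_index)))  # [('o1','i1'), ('o1','i2')]
```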
Merge Join (and sort-merge join)
If the tables are sorted on the join cols (b-tree or hash),
then sequentially scan each (from the start key):
  left < right: advance left
  left = right: match
  left > right: advance right
Nice sequential scan of the data (disk speed).
(The NxM case, duplicate keys on both sides, is a local Cartesian product and may cause a backwards rescan.)
Sort-merge join sorts before doing the merge.
Partitions well: partition the smaller to the larger's partitioning.
Works for all joins (outer, non-equijoins, Cartesian, exclusion, ...).
Hash Join
Hash the smaller table into N buckets (hope N=1).
If N=1: read the larger table and hash-probe into the smaller.
Else: hash both to disk, then do a bucket-by-bucket hash join.
Purely sequential data behavior.
Always beats sort-merge and nested loops
unless the data is clustered.
Good for equi, outer, and exclusion joins.
Lots of papers; products just appearing (what went wrong?).
Hashing reduces skew.
Parallel Hash Join
ICL implemented hash join with bitmaps in CAFS machine
(1976)!
Kitsuregawa pointed out the parallelism benefits of hash
join in early 1980’s (it partitions beautifully)
We ignored them! (why?) But now everybody's doing it
(or promises to do it).
Hashing minimizes skew, requires little thinking for
redistribution
Hashing uses massive main memory
Observations
It is easy to build a fast parallel execution environment
(no one has done it, but it is just programming)
It is hard to write a robust and world-class query optimizer.
There are many tricks
One quickly hits the complexity barrier
Common approach:
Pick best sequential plan
Pick degree of parallelism based on bottleneck analysis
Bind operators to processes
What’s Wrong With That?
Why isn’t the best serial plan the best parallel plan?
Counter example:
Table partitioned with local secondary index at two nodes
Range query selects all of node 1 and 1% of node 2.
Node 1 should do a scan of its partition.
Node 2 should use secondary index.
SELECT * FROM telephone_book WHERE name < “NoGood”;
Sybase Navigator & DB2 PE should get this right.
We need theorems here (practitioners do not have them).
[Plan: node 1 table-scans its A..M partition; node 2 index-scans N..Z]
What Systems Work This Way
Shared Nothing:
  Teradata: 400 nodes
  Tandem: 110 nodes
  IBM / SP2 / DB2: 128 nodes
  Informix / SP2: 48 nodes
  ATT & Sybase: 8x14 nodes
Shared Disk:
  Oracle: 170 nodes
  Rdb: 24 nodes
Shared Memory:
  Informix: 9 nodes
  RedBrick: ? nodes
Outline
• Why Parallelism:
  – technology push
  – application pull
• Benchmark Buyer’s Guide
  – metrics
  – simple tests
• Parallel Database Techniques
  – partitioned data
  – partitioned and pipelined execution
  – parallel relational operators
• Parallel Database Systems
  – Teradata
  – Tandem
  – Oracle
  – Informix
  – Sybase
  – DB2
  – RedBrick
System Survey Ground Rules
Premise: The world does not need yet another PDB survey
It would be nice to have a survey of “real” systems
Visited each parallel DB vendor I could (time limited)
Asked not to be given confidential info.
Asked for public manuals and benchmarks
Asked that my notes be reviewed
I say only nice things (I am a PDB booster)
Acknowledgments
Teradata: Todd Walter and Carrie Ballinger
Tandem: Susanne Englert, Don Slutz, HansJorge Zeller, Mike Pong
Oracle: Gary Hallmark, Bill Widdington
Informix: Gary Kelley, Hannes Spintzik, Frank Symonds, Dave Clay
Navigator: Rick Stellwagen, Brian Hart, Ilya Listvinsky, Bill Huffman,
Bob McDonald, Jan Graveson, Ron Chung Hu, Stuart Thompto
DB2: Chaitan Baru, Gilles Fecteau, James Hamilton, Hamid Pirahesh
RedBrick: Phil Fernandez, Donovan Schneider
Teradata
• Shipped 1984; now an ATT GIS brand name.
• Parallel DB server for decision support: SQL in, tables out.
• Supports heterogeneous data (converts to client format: UNIX, VMS, AS400, Mac, PC, IBM mainframe).
• Data hash partitioned among AMPs, with fallback (mirror) hash.
• Applications run on clients; parsing engines (PEPs) front the AMPs.
• Biggest installation: 476 nodes, 2.4 TB.
• Ported to a UNIX base.
Parsing Engines (PEs)
  Interface to IBM mainframes or Ethernet or ... (UNIX, PC, Mac, VMS, AS400 clients).
  Accept SQL; return records and status.
  Support SQL 89, moving to SQL 92.
  Parse, plan & authorize SQL (cost-based optimizer).
  Issue requests to AMPs.
  Merge AMP results to the requester.
  Some global load control based on client priority (adaptive and GREAT!).
Access Module Processors (AMPs)
  Almost all work is done in the AMPs:
  a shared-nothing SQL engine (scans, inserts, joins, log, lock, ...).
  Each manages up to 4 disks (as one logical volume).
  Easy to design, manage, grow (just add disk).
Data Layout: Hash Partitioning
All data declustered to all nodes
Each table has a hash key (may be compound)
Key maps to one of 4,000 buckets
Buckets map to one of the AMPs
Non-unique secondary indices partitioned by the table's criterion.
Fallback bucket maps to second AMP in cluster.
Typical cluster is 6 nodes (2 is mirroring).
Cluster limits failure scope:
2 failures only cause data outage if both in same
cluster.
Within a node, each hash to cylinder
then hash to “page”
Page is a heap with a sorted directory
Teradata Optimization & Execution
Sophisticated query optimizer (many tricks).
Great emphasis on joins & aggregates.
Nested, merge, product, bitmap join (no hash join)
Automatic load balancing from hashing & load control
Excellent utilities for data loading, reorganize
Move > 1TB database from old to new in 6 days,
in background while old system running
Old hardware, 3.8B row table (1TB), >300 AMPs
typical scan, sort, join averages 30 minutes
Query Execution
Protocol
PE requests work
AMP responds OK (or pushback)
AMP works (if all OK)
AMP declares finished
When all finished, PE does 2PC and starts pull
Simple scan:
PE broadcasts scan to each AMP
Each AMP scans produces answer spool file
PE pulls spool file from AMPs via Ynet
If scan were ordered, sort “catcher” would be forked
at each AMP pipelined to scans
Ynet and PE would do merge of merges from AMPs
Aggregates, Updates
Aggregate of Scan:
Scan’s produce local sub-aggregates
Hash sub-aggregates to Ynet
Each AMP “catches” its sub-aggregate hash buckets
Consolidate sub-aggregates.
PE pulls aggregates from AMPs via Ynet.
Note: fully scaleable design
Insert / Update / Delete at an AMP node
generates insert / update / delete messages to
  unique secondary indices and
  the fallback bucket of the base table;
messages are saved in a spool if a node is down.
Query Execution: Joins
Great emphasis on Joins.
Includes small-table large-table optimization
cheapest triple, then cheapest in triple.
If equi-partitioned, do locally
If not equi-partitioned,
May replicate small table to large partition (Ynet shines)
May repartition one if other is already partitioned on join
May repartition both (in parallel)
Join algorithm within node is
Product
Nested
Sort-merge
Hash bit map of secondary indices, intersected.
Utilities
Bulk Data Load, Fast Data Load, Multi-load,
Blast 32KB of data to an AMP
Multiple sessions by multiple clients can drive 200x parallel
Double buffer
AMP unpacks, and puts “upsert” onto Ynet
One record can generate multiple upserts
(transaction-> inventory, store-sales, ...)
Catcher on Ynet, grabs relevant “upserts” to temp file.
Sorts and then batches inserts (survives restarts).
Online and restartable.
Customers cite this as Teradata strength.
Fast Export (similar to bulk data load)
Utilities II
Backup / Restore: Rarely needed because of fallback.
Cluster is unit of recovery
Backup is online, Restore is offline
Reorganize:
Rarely needed, add disk is just restart
Add node:
rehash all buckets that go to that node:
(Ynet has old and new bucket map)
Fully parallel and fault tolerant, takes minutes
Port To UNIX
New design (3700 series) described in VLDB 93.
Ported to UNIX platforms (3600 AP, PE, AMP);
moved Teradata to a software Ynet on SMPs.
Based on bullet-proof UNIX with a TOS layer atop:
  message system
  communications stacks
  raw disk & virtual processors
  virtual partitions (buckets go to virtual partitions)
  removes many TOS limits
Result is 10x to 60x faster than an AMP:
  compiled expression evaluation (gives 50x speedup on scans)
  large main memory helps
[Stack: SQL Applications / Parsing engine (parallelism) / Teradata SQL (AMP logic) / UNIX PDE: TOS adapter / UNIX 5.4 (SMP, RAS, virtual Ynet) / HARDWARE]
Customer Benchmarks
Standard Benchmarks
Only old Boral/DeWitt Wisconsin numbers.
Nothing public.
Moving a >1TB database from the old system to the new in 6 days,
in background while old system running
So: unload-load rate > 2MB/s sustained
Background task (speed limited by host speed/space)
Old hardware, 3.8B row table, >300 AMPs
typical scan, sort, join averages 30 minutes
rates (rec size not cited):
                       krec/s/AMP     total
  scan:                9              2.7 Mrec/s !!!!!!
  clustered join:      2              600 krec/s
  insert-select:       .39            120 krec/s
  hash index build:    3.3            100 krec/s
UNIX/SMP Port of Teradata
Times to process a Teradata test DB on an 8-Pentium 3650.
These numbers are 10 to 150x better than a single AMP:
compiled expression handling, more memory.

op                       rows         seconds   k r/s   MB/s
scan                     50,000,000     737      67.8   11.0
copy                      5,000,000    1136       4.4    0.7
aggregate                50,000,000     788      63.5   10.3
Join 50x2M (clustered)   52,000,000     768      67.7   11.0
Join 5x5 (unclustered)   10,000,000     237      42.2    6.8
Join 50Mx.1K             50,000,100    1916      26.1    4.2
Teradata Good Things
Scaleable to large (multi-terabyte) databases
Available TODAY!
It is VERY real: in production in many large sites
Robust and complete set of utilities
Automatic management.
Integrates with the IBM mainframe OLTP world
Heterogeneous data support is good for data warehousing
Tandem
Message-based OS (Guardian):
  (1) location transparency
  (2) fault isolation (failover to other nodes).
Expand software: 255 systems WAN; up to 224 processors.
[Figure: a 4-node system]
Classic shared-nothing system (like Teradata,
except applications run inside the DB machine).
1-16 MIPS R4400 cpus per node; dual-port controllers; dual 30 MB/s LAN; 8 x 1 MB/s channels.
1974-1985: Encompass: fault-tolerant distributed OLTP
1986: NonStopSQL: first distributed and high-performance SQL (200 tps)
1989: Parallel NonStopSQL: parallel query optimizer/executor
1994: parallel and online SQL (utilities, DDL, recovery, ...)
1995: moving to ServerNet: shared disk model
Tandem Data Layout
Each table or index is range partitioned across a set of disks
(partitions can be anywhere in the network); a file = {partitions}.
The index is a B-tree per partition; the clustering index is a B+ tree.
Table fragments are files (extent based); extents may be added.
Descriptors for all local files live in the local catalog (node autonomy).
Tables can be distributed in the network (LAN or WAN).
Duplexed disks and disk processes for failover.
Tandem Software (Process Structure)
The SQL engine runs in the application process (SQL in C/COBOL/...):
  joins, sorts, global aggregates, triggers, index maintenance,
  views, security, query compiler, transactions, GUI.
Disk server pairs (one per disk pair or array, with a shared buffer pool) do:
  selects; updates, deletes; record/set inserts; aggregates;
  assertions; locking; logging; data partitions.
Helper processes run the utilities.
Hardware & OS move data at 4 MB/s with >1 instruction/byte.
OLTP Features
Insert / Update / Delete index in parallel with base table
If 5 indices, 5x faster response time.
Record and key-value range locking, SQL92 isolation levels
Undo scanner per log: double-buffers undo to each server
21k tpm-C (WOW!!) with a 110-node server (800GB db)
Can mix OLTP and batch.
Priority serving to avoid priority inversion problem
Buffer management prevents sequential buffer pollution
Tandem Query Plan & Execution
Simple selects & aggregates are done in the disk servers.
Parallelism chosen:
  scan: by table fragmentation
  hash: by # processors or outer table fragments
Sorts: redistribution, sort in executors (N-M).
Joins done in executors (nest, sort-merge, hash).
Redistribution is always a hash (minimizes skew).
Pipelines as deep as possible (uses lots of processes).
Multiple logs & parallel UNDO avoid bottlenecks.
Can mix OLTP and batch:
  priority scheduling avoids the priority inversion problem;
  buffer management prevents sequential buffer pollution.
[Process stack: Application / SQL subsystem / Executors / Disk Servers]
Parallel Operators
Initially just inserted rivers between sequential operators
Parallel query optimizer
Created executors at all clustering nodes or
at all nodes, repartitioned via hash to them
Gave parallel select, insert, update, delete
join, sort, aggregates,...
correlated subqueries are blocking
Got linear speedup/scaleup on Wisconsin.
Marketing never noticed; the product slept from 1989-1993.
Developers added: hash join, aggregates in the disk process,
SQL92 features, parallel utilities, online everything,
converted to MIPSco, fixed bugs.
Join Strategies
Nested loop
Sort merge
Both can work off index-only access
Replicate small to all partitions (when one small)
Small-table Cartesian product large-table optimization
Now hybrid-hash join
uses many small buckets
tuned to memory demand
tuned to sequential disk performance
no bitmaps because (1) parallel hash
(2) equijoins usually do not benefit
When both large, and unclustered (rare case)
N+M scanners, 16 catchers: sortmerge or hybrid hash
Administration (Parallel & Online everything)
All utilities are online (claimed to reduce outages by 40%):
  Add table, column, ...
  Add index: builds the index from a stale copy,
    uses the log for catchup,
    and in the final minute gets a lock and completes the index.
  Reorg a B-tree while it is accessed.
  Add / split / merge / reorg partitions.
  Backup.
  Recover page, partition, file.
  Add, alter logs, disks, processors, ...
Parallel utilities:
  load (M to N)
  index build (M scanners, N inserters, in background)
  recovery
You need this: terabyte operations take a long time!
Benchmarks
No official DSS benchmark reports. Unofficial results:
1 to 16 R4400-class processors, 64MB each (Himalayas), 3 disks and 3 controllers each.

                       Sequential              16x Parallel             speedup
op                     rec/s       MB/s        rec/s       MB/s
Load Wisc              1.6 kr/s    321 KB/s    28 kr/s     5.4 MB/s     16
Parallel index build   1.5 kr/s    15 KB/s     24 kr/s     240 KB/s     16
SCAN                   28 kr/s     5.8 MB/s    470 kr/s    94 MB/s      16 !!!!!!!
Aggregate (1 col)      25 kr/s     4.9 MB/s    400 kr/s    58 MB/s      16
Aggregate (6 col)      18 kr/s     3.6 MB/s    300 kr/s    60 MB/s      16
2-Way hash join        13 kr/s     2.6 MB/s    214 kr/s    42 MB/s      16
3-Way hash join        ? kr/s      ? MB/s      ? kr/s      ? MB/s       ?

The 1x and 16x rates are the best I've seen anywhere.
Tandem Good Things
21 K TPM-C (WOW!)
It is available TODAY!
Online everything
Fault tolerant, distributed, high availability
Mix OLTP and batch
Great Hash Join Algorithm
Probably the best peak performance available
Oracle
Parallel Server (V7): Multiple threads in a server
Multiple servers in a cluster
Client/server, OLTP & clusters (TP lite)
Parallel Query (V7.1) Parallel SELECT (and sub-selects)
Parallel Recovery: (V7.1) @ restart, one log scanner,
multiple redoers
Beta in 1993, Ship 6/94.
More Parallel (create table): V7.2, 6/95
Shared disk implementation ported to most platforms
Parallel SELECT (no parallel INSERT, UPDATE, DELETE, DDL)
except for sub-selects inside these verbs.
Oracle Data Layout
Tablespace -> segments -> extents -> block files; files may be raw disk.
Segments are B-trees or heaps (a table or an index).
Homogeneous: one table (index) per segment;
extents are picked from a tablespace; extents may be added.
The data -> disk map is automatic:
no range / hash / round-robin partitioning.
ROWID can be used for scan partitioning on base tables.
Guiding principle:
if it's not organized, it can't get disorganized,
and doesn't need to be reorganized.
Oracle Parallel Query Product Concept
Convert serial SELECT plan to parallel plan
If Table scan or HINT then consider parallel plan
Table has default degree of parallelism (explicitly set)
Overridden by system limits and hints.
Use max degree of all participating tables.
Intermediate results are hash partitioned
Nested Loop Join and Merge Join
User hints can (must?) specify join order, join strategy,
index, degree of parallelism,...
[Figure: client -> query coordinator -> multiprocess & multithreaded DB servers; parallelism = degree x instances]
Query Planning
Query Coordinator starts with Oracle Cost-Based plan
If plan requests Table scan or HINT then consider parallel plan
Table has default degree of parallelism (explicitly set)
Overridden by system limits and hints.
Use max degree of all participating tables.
Shared disk makes temp space allocation easy
Planner picks degree of parallelism and
river partitioning.
Proud of their OR optimization.
Query Execution
Coordinator does extra work to
  merge the outputs of several sorts (sub-sorts are pushed to the servers)
  aggregate the outputs of several aggregates (aggregates are pushed to the servers)
Parallel function invocation is potentially a big win.
SELECT COUNT ( f(a,b,c,...)) FROM T;
Invokes function f on each element of T, 100x parallel.
Join Strategies
Oracle has (1) Nested Loop Join
(2) Merge Join
Replicate inner to outer partition automatic in shared disk
(looks like partition outer).
Has small-table large-table optimization
(Cartesian product join)
User hints can specify join order, join strategy, index
degree of parallelism,...
Transactions & Recovery
Transactions and transaction save points (linear nest).
ReadOnly snapshots for decision support.
SQL92 isolation levels (ACID = Snapshot isolation)
Database has multiple rollback segments UNDO log,
Transaction has one commit/REDO log so may be a bottleneck
Parallel recovery at restart:
One log scanner,
DEGREE REDO streams, typically one per disk
INSTANCE REDO streams, typically two-deep per disk
Administration
Not much special:
Limit degree of parallelism at a server
Set default parallelism of a table
Query can only lower these limits
No special tools, meters, monitors,...
Just ordinary Parallel Server
Oracle Utilities
User can write parallel load / unload utility
Index build, Constraints, are separate steps
Not incremental or online or restartable.
Update Statistics (Analyze) is not parallel
Index build is a N-1 parallel: N scanner/sorter, 1 inserter.
Parallel recovery at restart:
One log scanner,
DEGREE REDO streams, typically one per disk
INSTANCE REDO streams, typically two-deep per disk
Benchmarks
Sequent: 20 x 50MHz 486, .5GB RAM, 20 disks

                       Sequential              20x Parallel             speedup
                       rec/s       KB/s        rec/s      MB/s
Load 5M Wisc           .5 kr/s     113 KB/s    8.8 kr/s   1.8 MB/s      16
Parallel index load    2.2 kr/s    18 KB/s     29 kr/s    235 KB/s      13
SCAN                   1.7 kr/s    364 KB/s    26 kr/s    5.3 MB/s      15
Agg MJ                 3.3 kr/s    660 KB/s    45 kr/s    9.3 MB/s      14
Agg NJ                 1.4 kr/s    290 KB/s    26 kr/s    5.4 MB/s      19

The same benchmark on a 16x SP1 (a shared-nothing machine) got similar results.
A 168-node nCube (16MB/node; 4 lock nodes, 64 disk nodes) got good scaleup.
Oracle has published details on all these benchmarks.
Sept 1994 news: a 20-Pentium, 40-disk system SCANs at 44 MB/s at 55% cpu.
Oracle Good Things
Available now!
Parallel Everywhere (on everybody’s box)
A HIGH FUNCTION SQL
No restrictions (triggers, indices,...)
Very easy to use (almost no knobs or options)
Parallel invocation of stored procedures
Near-linear scaleup and speedup of SELECTs.
Respectable performance on Sequent
Informix
DSA (Dynamic Scaleable Architecture) describes redesign to
thread-based, server-based system.
V6 - 1993 - : DSA -- rearchitecture (threads, OLTP focus)
V7 - 1994 - : PDQ -- Parallel Data Query (SMP)
V8 - 1995 - : XMP -- Cluster parallelism (shared disk/nothing).
Parallelism is a MAJOR focus now that SQL92 under control
Other major focus is TOOLS (ODBC, DRDA, NewEra 4GL).
Informix is a UNIX SQL system:
AIX (IBM), HP/UX (HP), OSF/1 (DEC, HP), SCO/UNIX, Sequent/DYNIX, SUN (SunOS, Solaris)
Today shared nothing parallelism on IBM SP2, ATT3650, ICL, (beta)
Informix Data Layout
A table or index maps to a homogeneous set of DBspaces;
a DBspace contains “chunks” (extents) of block files; chunks may be added.
Partition by: range, round robin, expression, hash (V8).
Access via B+tree, B* tree, and hash (V8).
Built an extent-based file system on raw disks or files:
high-speed sequential IO, clustering, async IO, ...
Informix Execution
Completely parallel DML, some parallel DDL:
parallel SELECT, UPDATE, DELETE; an executor per partition in all cases.
Parallel sort, joins (nest, merge, hash), aggregates, union.
[Figure: client -> virtual processors (scan and join helpers) sharing one buffer pool]
Whenever an operator has input and a free output buffer,
it can work to fill the output buffer: natural flow control.
Blocking operators (sort, hash join, aggregates, correlated subqueries)
spool to a buffer (if small), else spool to disk.
The shared buffer pool minimizes data copies.
Parallel Plans
Query plan is parallelized by
scanner per table partition (does select, project)
sub-aggregates per partition (hash or sort)
If clustered join (nested loop or merge) then operator
per outer or per partition
If hash-join, parallel scan smaller first,
build bitmap and hash buckets
then scan larger and:
join to smaller if it fits in memory
else filter via bitmap and build larger buckets
then join bucket by bucket
Hybrid hash join with bitmaps and bucket tuning.
Parallel Operators
Parallel SELECT, UPDATE, DELETE
Executor per partition in all cases.
Parallel
sort,
joins,
aggregates,
union
Only correlated subqueries are blocking
Completely parallel DML, some parallel DDL
Transactions & Recovery
SQL 2 isolation levels allow DSS to run in background
Transaction save points
Separate logical and physical logs.
Bulk updates could bottleneck on single log.
Recovery unit is data partition (DBspace)
Parallel recovery: thread per DBspace
If a DB fragment is unavailable, DSS readers can skip it.
Informix Administration
Can assign % of processors, memory, IO to DSS
(parallel query)
Sum of all parallel queries live within this quota
Each query can specify the % of the total that it wishes.
(0 means sequential execution)
Utilities
Parallel Data load (SMP only)
Parallel Index Build (N - M)
Parallel recovery
Online backup / restore
Benchmarks
Sequent system: 9 Pentium processors, 1 GB main memory;
base tables on 16 disks (FWD SCSI), indices on 10 discs, temp space on 10 disks.

                       Sequential             Parallel                speedup
                       rec/s      MB/s        rec/s      MB/s
Load 300M Wisc         3 kr/s     600 KB/s    48 kr/s    1 MB/s
Parallel index load    ?          ?           ?          ?             ?
SCAN                   17 kr/s    3.5 MB/s    147 kr/s   30 MB/s       8.3
Aggregate              11 kr/s    2.3 MB/s    113 kr/s   23 MB/s       10.1
2-Way hash join        18 kr/s    3.2 MB/s    242 kr/s   31 MB/s       9.7
3-Way hash join        25 kr/s    3.5 MB/s    239 kr/s   33 MB/s       9.5
Informix Shared Nothing Benchmark
IBM SP2: TPC-D-like database, 48 SP2 processors.
A customer benchmark, not an audited benchmark.
Load
60 GB in 40 minutes,
250 GB in 140 min
about 100 GB/hr !
2GB/node/hr
Scan & Aggregate (#6)
60 GB in 7 min = 140 MB/s = 3 MB/s/node = 30 kr/s
260 GB in 24 min = 180 MB/s = 4 MB/s/node = 40 kr/s
Power Test (17 complex queries and 2 load/purge ops)
60 GB in 5 hrs
260 GB in 18 hrs
Multiuser Test:
1 user, 12 queries: 10 hrs, 4 users, 3 queries: 10 hrs
Informix Good Things
A full function SQL
Available today on Sequent
Beautiful manuals
Linear speedup and scaleup
Best published performance on UNIX systems
Probably best price performance.
(but things are changing fast!)
Some mechanisms to mix OLTP and batch.
Sybase Navigator Product concept
Goal: linear scaleup and speedup, plus good OLTP support.
Two-layer software architecture:
  (1) Navigator drives an array of shared-nothing SQL engines.
  (2) An array of SQL engines, each unaware of the others
      (similar to Tandem disk processes); the SQL engine is COTS.
[Figure: NAVIGATOR = Configurator + Administrator + Optimizer atop a grid of SQL engines]
Emphasizes the WHOLE LIFECYCLE:
  Configurator: tools to design a parallel system.
  Administrator: tools to manage a parallel system
    (install/upgrade, start/stop, backup/restore, monitor/tune).
  Optimizer: execute requests in parallel.
Configurator
Fully graphical design tool
Given
ER model and dataflow model of the application
workload characteristics
response time requirements,
hardware components
(heavy into circles and arrows)
Recommends a hardware configuration,
table definitions (SQL),
and table partitioning.
Administrator
Made HUGE investments in this area.
Truly industry leading
graphical tools make MPP configuration “doable”.
GUI interface to manage:
startup / shutdown of cluster
backup / restore / manage logs
configure (install, add nodes, configure and tune servers)
Manage / consolidate system event logs
System stored procedures (global operations)
(e.g. aggregate statistics from local to global cat)
Monitor SQL Server events
Data Layout
Pure shared nothing
Navigator partitions data among SQL servers
• map to a subset of the servers
• range partition or hash partition.
Secondary indices are partitioned with base table
No Unique secondary indices
Only shorthand views, no protection views
Schema server stores global data definition for all nodes.
Each partition server has
schema for its partition
data for its partition.
log for its partition
Sybase SQL Server Backgrounder
Recently became SQL89 compliant (cursors, nulls, etc)
Stored procedures, multi-threaded, internationalized,
B*-tree centric (clustering index is B+tree)
Use nested loops, sort-merge join (sort is index build).
Page locking, 2K disk IO, ... other little-endian design decisions.
Respectable TPC-C results (AIX RS/6000).
UNIX raw disks or files are base (also on OS/2, NetWare,...).
table->disk mapping
CREATE DATABASE name ON {device...} LOG ON {device...}
SP_ADDSEGMENT segment, device
CREATE TABLE name(cols) [ ON segment]
Microsoft has a copy of the code, deep ported to NT
Navigator Extension Mechanisms
Navigator extended Sybase TDS by
Adding stored procedures to do things
Extending the syntax (e.g. see data placement syntax below)
Sybase TDS and OpenServer design are great for this
All front ends are based on OpenServer and threads.
Process Structure - Pure Shared Nothing
DBA Server does everything: SQL compilation,
system management,
catalog management,
SQL server restart (on a 2nd node).
DBA fallback detects deadlock and
does DBA takeover on failure.
A Control server at each node manages the SQL servers there
(security, request caching, 2PC, final merge/aggregate,
parallel stored procedures (SIMD)).
The Split server manages re-partitioning of data.
The SQL Server is the unit of query parallelism (one per cpu per node).
[Figure: Clients -> Control (1/node) -> Split -> SQL servers; plus DBA server, schema server, GUI system manager, Navigator monitor, Manager & SQL optimizer; catalogs and databases live in the SQL servers]
Simple Request Processing
The client connects to Navigator (a Control Server) using the
standard Sybase TDS protocol.
The SQL request flows to the DBA server, which compiles it and
sends stored procedures (plans) to all control servers and
plans to all relevant SQL servers.
The Control server executes the plan:
it passes the plan to a SQL server and returns the results.
The plan is cached on the second call, so the DBA server is not invoked:
good for OLTP.
Parallel Request Processing
If query involves multiple nodes, then command
sent to each one (diagram shows secondary index lookup)
Query sent to SQL servers that may have relevant data.
If data needs to be redistributed or aggregated,
split servers issue queries and inserts
(that is their only role)
split servers have no persistent storage.
[diagram: client request fans out through Control and Split servers on several nodes to the SQL servers, with the DBA and schema servers coordinating]
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
156
Data Manipulation
SQL server is unit of parallelism
"Parallelized EVERYTHING in the T-SQL language"
Includes SIMD execution of T-SQL procedures,
plus N-M data move operations.
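As a hedged sketch (procedure, table, and column names invented): under SIMD execution an ordinary T-SQL procedure like the one below is run by every partition's SQL server against its own rows, with the control server doing the final merge/aggregate:

    -- each SQL server computes partial totals over its partition;
    -- the control server merges the per-partition results
    CREATE PROCEDURE region_totals AS
        SELECT region, SUM(amount) AS total
        FROM orders
        GROUP BY region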
Two-level optimization:
DBA Server has optimizer
(BIG investment, all new code,
NOT the infamous Sybase optimizer)
Each SQL server has Sybase optimizer
Under extreme skew, different servers may get different plans
DBA optimizer shares code with SQL server
(so they do not play chess with one another).
Very proud of their optimizer.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
157
Query Execution
Classic Selinger cost-based optimizer.
SELECT, UPDATE, DELETE N-to-M parallel
Bulk and async INSERT interface.
N-M Parallel sort
Aggregate (hash/sort)
Select and join can do index-only access if the data is there.
Eliminates correlated subqueries (converts them to joins; see the sketch below).
(Ganski & Wong, SIGMOD 87, extended)
Join: nested-loop, sort-merge, index only
Sybase often dynamically builds index to
support nested loop (fake sort-merge)
Typically left-deep sequence of binary joins.
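A hedged illustration of the subquery-to-join rewrite, in ANSI-style SQL with invented names:

    -- correlated form: the inner query re-runs per outer row
    SELECT o.order_id
    FROM orders o
    WHERE o.amount > (SELECT AVG(i.amount)
                      FROM orders i
                      WHERE i.cust_id = o.cust_id)

    -- rewritten: compute the per-customer averages once, then join
    SELECT o.order_id
    FROM orders o,
         (SELECT cust_id, AVG(amount) AS avg_amt
          FROM orders
          GROUP BY cust_id) a
    WHERE o.cust_id = a.cust_id
      AND o.amount > a.avg_amt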
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
158
Join and Partition Strategy
Partition strategies (see the sketch after this list):
If already partitioned on the join key, then no splitting.
Else move a subset of T1 to T2's partitions,
or replicate T1 to all T2 partitions,
or repartition both T1 and T2 to the width of the home nodes
or of the target.
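For example (tables invented), given a small customer table and a large orders table hash-partitioned on cust_id:

    -- customer is small, so replicate it to every orders partition
    -- rather than repartitioning the large table
    SELECT c.name, SUM(o.amount)
    FROM customer c, orders o
    WHERE c.cust_id = o.cust_id
    GROUP BY c.name

Had both tables been large and partitioned on other keys, both would be repartitioned on cust_id instead.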
No hash join, but
all (re) partitioning is range or hash based.
Parallelism/pipelining is not aggressive: two operators at a time.
Pipeline to disk via split server (not local to disk and then split).
Split servers fake subtables for SQL engines.
Top level aggregates merged by control, others done by split.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
159
Utilities
Bulk data load (N-M) async calls
GUI manages
Backup all SQL servers in parallel
Reorg via
CREATE TABLE <new> ,
INSERT INTO <new> SELECT * FROM <old>
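Concretely (names illustrative), a reorg that also moves the table to a new segment:

    CREATE TABLE orders_new (order_id int, amount money) ON seg2
    INSERT INTO orders_new SELECT * FROM orders
    DROP TABLE orders
    -- then rename, e.g. via Sybase's sp_rename
    sp_rename orders_new, orders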
Utilities are mostly offline (as per Sybase)
Nice EXPLAIN utility
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
160
Futures
Hash join within split servers
Shared memory optimizations
Full support for unique secondary indices
Full trigger support (cross-server triggers)
Full security and view support.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
161
Benchmarks
Preliminary: 8x8 AT&T 3600 - Ynet.
node: 8 x (50 MHz 486, 256 KB local cache), 512 MB main memory,
2 x 10-disk arrays @ 2 GB, 4 MB/s per disk.
6 x Sybase servers
Scaleup & speedup tests of 1, 4, and 8 nodes.
Numbers (except loading) reported as ratios of elapsed times
S&S tests show a >7x speedup of 8-way over 1-way
Tests cover insert, select, update, delete, join, aggregate, load
Reference Account: Chase Manhattan Bank
14x8 P5 ATT 3600 cluster: (112 processors)
56 SQL servers, 10GB each = 560 GB
100x faster than DB2/MVS (minutes vs days)
Linearity is great.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
162
Navigator Good Things
Concern for lifecycle: design, install, manage, operate, use
Good optimization techniques
Fully parallel, including stored procedures!
Scaleup and Speedup are near linear.
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
163
Sybase IQ
Sybase bought Expressway
Expressway evolved from Model 204
Bitmap technology: indexes duplicate values with a bitmap,
then compresses the bitmap.
Can give 10x or 100x speedup.
Can save space and IO bandwidth
Currently, two products (Sybase and IQ) not integrated
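A hedged sketch of the query shape bitmap indices pay off on (schema invented): each low-cardinality predicate maps to a bitmap, the bitmaps are ANDed, and the result is a bit count rather than a row scan:

    SELECT COUNT(*)
    FROM customers
    WHERE state = 'CA' AND status = 'active' AND segment = 'retail'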
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
164
DB2
DB2/VM: = SQL/DS: System R gone public
DB2/MVS (classic Parallel Sysplex, Parallel Query Server, ...)
Parallel and async IO into one process (on mainframe)
Parallel execution in next release (late next year?)
MVS PQS now withdrawn?
DB2/AS400: Home grown
DB2/2 PE: OS/2 Database Manager grown large. First moved to AIX,
now being extended with parallelism.
Parallelism based on SP/2 -- shared nothing done right.
Benchmarks today - Beta everywhere
DB2++: separate code path has OO extensions, good TPC-C
Ported to HP/UX, Solaris, NT in beta
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
165
DB2/2 Data Layout
• DATABASE: a collection of nodes (up to 128 SP2 nodes so far)
• NODEGROUP: a collection of logical nodes (a 4K hash map)
• LOGICAL NODE: a DB2 instance (segments, log, locks, ...)
• PHYSICAL NODE: a box.
• SEGMENTS: a logical node stores its tables in segments of 4K pages
– Segments allocated in units (64K default)
– Tables stripe across all segments
• Table created in NodeGroup:
– Hash (partition key) across all members of group
• Cluster has a single system image
[diagram: physical nodes mapped into NodeGroup 1 and NodeGroup 2]
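A hedged sketch of the corresponding DDL (approximate DB2 PE-era syntax; names and the tablespace path are invented):

    -- group four logical nodes
    CREATE NODEGROUP group1 ON NODES (1 TO 4)
    -- a tablespace defined on the nodegroup
    CREATE TABLESPACE ts1 IN NODEGROUP group1
        MANAGED BY SYSTEM USING ('/db2/ts1')
    -- the table hash-partitions on its partitioning key across the group
    CREATE TABLE orders (order_id INTEGER, amount DECIMAL(10,2))
        IN ts1 PARTITIONING KEY (order_id) USING HASHING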
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
166
DB2/2 Query Execution
• Each node maintains pool of AIX server processes
• Query optimizer does query decomposition to node
plans (like R* distributed query decomposition)
• Parallel optimization is one-phase (not two-phase as in Wei Hong’s work)
• Sends sub-plans to nodes to be executed by servers
• Node binds plan to server process
• Intermediate results hashed
• Proud that Optimizer does not need hints.
• “Standard” join strategies (except no hash join).
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
167
DB2/2 Utilities
• 4 loaders (see the sketch at the end of this slide):
– import
– raw-insert (fabricates raw blocks, no checks)
– insert
– bulk insert
• Reorganize hash map, add / drop nodes, add devices
– Table unavailable during these operations
• Online & Incremental backup
• Fault tolerance via HACMP
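A hedged sketch of the checked loading path (file and table names invented; assuming the DB2 command-line IMPORT syntax):

    -- IMPORT goes through normal inserts, with full checking;
    -- the raw-insert loader instead fabricates raw pages directly
    IMPORT FROM 'orders.del' OF DEL INSERT INTO orders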
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
168
DB2/2 Performance:
Good performance, great scaling
Wisconsin scaleups
big = 4.8 M records = 1 GB
small = 1.2 M records = 256 MB
scan rate ~12 kr/s/node
raw load: 2.5 kr/s/node
see notes for more data
[chart: Speedup vs Nodes, DB2/2 PE on SP2 — speedup curves for Load, Scan, Agg, SMJ, NLJ, SMJ2, Index1, Index2, and MJ over 0-16 nodes, y-axis 0-25]
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
169
DB2/2 Good Things
• Scaleable to 128 nodes (or more)
• From IBM
• Good performance
• Complete SQL (update, insert, ...)
• Will converge with DB2/3 (OO and TPC-C stuff)
• Will be available off AIX someday
– (AIX is slow and SP2 is very expensive)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
170
RedBrick
• Read-only (LOAD then SELECT only) database system
– Load is incremental and sophisticated
• Precompute indices to make small-large joins run fast
– Indices use compression techniques.
– Only join via indices
• Many aggregate functions to make DSS reports easy
• Parallelism:
– Pipeline IO
– Typically a thread per processor (works on index partition)
– Piggyback many queries on one scan
– Parallel utilities (index in parallel, etc)
– SP2 implementation uses shared disk model.
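A hedged example of the workload RedBrick targets (schema invented): a small dimension table joined to a large fact table through a precomputed join index, then aggregated for a DSS report:

    SELECT p.category, SUM(s.amount)
    FROM sales s, product p
    WHERE s.product_id = p.product_id
    GROUP BY p.category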
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
171
Summary
There is a LOT of activity
(many products coming to market)
Query optimization is near the complexity barrier
Needs a new approach?
All have good speedup & scaleup if they can find a plan
Managing huge processor / disk / tape arrays is hard.
I am working on commoditizing these ideas:
low $/record/sec (scaleup PC technology)
low Admin $/node (automate, automate, automate,...)
Continuous availability (online & fault tolerant)
Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
172