Design and Evaluation of Architectures for Commercial Applications
Part I: Benchmarks

Luiz André Barroso
Western Research Laboratory
UPC, February 1999
Why should architects learn about
commercial applications?
Because they are very different from typical
benchmarks
Because they are demanding on many interesting
architectural features
Because they are driving the sales of mid-range
and high-end systems
Shortcomings of popular benchmarks
SPEC
uniprocessor-oriented
small cache footprints
exacerbates impact of CPU core issues
SPLASH
small cache footprints
extremely optimized sharing
STREAMS
no real sharing/communication
mainly bandwidth-oriented
SPLASH vs. Online Transaction Processing
(OLTP)
A typical SPLASH app. has
> 3x the issue rate,
~26x fewer cycles spent in memory barriers,
1/4 of the TLB miss ratios,
< 1/2 the fraction of cache-to-cache transfers,
~22x smaller instruction cache miss ratio,
~1/2 L2$ miss ratio
...of an OLTP app.
But the real reason we care? $$$!
Server market:
Total: > $50 billion
Numeric/scientific computing: < $2 billion
Remaining $48 billion?
– OLTP
– DSS
– Internet/Web
Trend is for numerical/scientific to remain a niche
Relevance of server vs. PC market
High profit margins
Performance is a differentiating factor
If you sell the server, you will probably also sell:
the client
the storage
the networking infrastructure
the middleware
the service
...
Need for speed in the commercial market
Applications pushing the envelope
Enterprise resource planning (ERP)
Electronic commerce
Data mining/warehousing
ADSL servers
Specialized solutions
Intel splitting the Pentium line into 3 tiers
Oracle’s raw iron initiative
Network Appliances’ machines
Seminar disclaimer
Hardware-centric approach:
target is to build better machines, not better software
focus on fundamental behavior, not on software
“features”
Stick to general purpose paradigm
Emphasis on CPU+memory system issues
Lots of things missing:
object-relational and object-oriented databases
public domain/academic database engines
many others
Overview
Day 1: Introduction and workloads
Background on commercial applications
Software structure of a commercial RDBMS
Standard benchmarks
– TPC-B
– TPC-C
– TPC-D
– TPC-W
Cost and pricing trends
Scaling down TPC benchmarks
Overview (2)
Day 2: Evaluation methods/tools
Introduction
Software instrumentation (ATOM)
Hardware measurement & profiling
– IPROBE
– DCPI
– ProfileMe
Tracing & trace-driven simulation
User-level simulators
Complete machine simulators (SimOS)
Overview (3)
Day 3: Architecture studies
Memory system characterization
Out-of-order processors
Simultaneous multithreading
Final remarks
Background on commercial applications
Database applications:
Online Transaction Processing (OLTP)
– massive number of short queries
– read/update indexed tables
– canonical example: banking system
Decision Support Systems (DSS)
– smaller number of complex queries
– mostly read-only over large (non-indexed) tables
– canonical example: business analysis
Background (2)
Web/Internet applications
Web server
– many requests for small/medium files
Proxy
– many short-lived connection requests
– content caching and coherence
Web search index
– DSS with a Web front-end
E-commerce site
– OLTP with a Web front-end
Background (3)
Common characteristics
Large amounts of data manipulation
Interactive response times required
Highly multithreaded by design
– suitable for large multiprocessors
Significant I/O requirements
Extensive/complex interactions with the operating
system
Require robustness and resiliency to failures
Database performance bottlenecks
I/O-bound until recently (Thakkar, ISCA’90)
Many improvements since then
multithreading of DB engine
I/O prefetching
VLM (very large memory) database caching
more efficient OS interactions
RAIDs
non-volatile DRAM (NVDRAM)
Today’s bottlenecks:
Memory system
Processor architecture
Structure of a database workload
[Figure: three-tier flow. Clients (simple logic checks) -> application server (optional; formulates and issues the DB query) -> database server (executes the query)]
Who is who in the database market?
DB engine:
Oracle is dominant
other players: Microsoft, Sybase, Informix
Database applications:
SAP is dominant
other players: Oracle Apps, PeopleSoft, Baan
Hardware:
players: Sun, IBM, HP and Compaq
Who is who in the database market? (2)
Historically, mainly mainframe proprietary OS
Today:
Unix: 40%
NT: 8%
Proprietary: 52%
In two years:
Unix 46%
NT 19%
Proprietary 35%
Overview of a RDBMS: Oracle8
Similar in structure to most commercial engines
Runs on:
uniprocessors
SMP multiprocessors
NUMA multiprocessors*
For clusters or message passing multiprocessors:
Oracle Parallel Server (OPS)
The Oracle RDBMS
Physical structure
Control files
– basic info on the database, its structure and status
Data files
– tables: actual database data
– indexes: sorted list of pointers to data
– rollback segments: keep data for recovery upon a
failed transaction
Log files
– compressed storage of DB updates
Index files
Critical in speeding up access to data by avoiding
expensive scans
The more selective the index, the faster the access
Drawbacks:
Very selective indexes may occupy lots of storage
Updates to indexed data are more expensive
Files or raw disk devices
Most DB engines can directly access disks as raw
devices
Idea is to bypass the file system
Manageability/flexibility somewhat compromised
Performance boost not large (~10-15%)
Most customer installations use file systems
Transactions & rollback segments
Single transaction can access/update many items
Atomicity is required:
transaction either happens or not
Example: bank transfer
Transaction A (accounts X,Y; value M) {
read account balance(X)
subtract M from balance(X)
add M to balance(Y)
commit
}
On failure:
old value of balance(X) is kept in a rollback segment
rollback: old values restored, all locks released
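To make the rollback mechanism concrete, here is a minimal sketch in Python (the Database class, transfer function, and in-memory structures are illustrative assumptions, not Oracle code): old values are saved in a rollback segment before each update, and a failed transaction restores them.

# Sketch of atomic update via a rollback segment (illustrative structures,
# not Oracle's): old values are saved before each in-place update; commit
# discards them, rollback restores them and would release all locks.

class Database:
    def __init__(self, balances):
        self.balances = dict(balances)    # account -> balance
        self.rollback_segment = {}        # account -> old value for the open txn

    def update(self, account, delta):
        self.rollback_segment.setdefault(account, self.balances[account])
        self.balances[account] += delta

    def commit(self):
        self.rollback_segment.clear()     # side effects become permanent

    def rollback(self):
        self.balances.update(self.rollback_segment)   # restore old values
        self.rollback_segment.clear()

def transfer(db, x, y, m):
    # Transaction A (accounts X, Y; value M) from the example above
    try:
        db.update(x, -m)
        if db.balances[x] < 0:
            raise ValueError("insufficient funds")    # simulated failure
        db.update(y, +m)
        db.commit()
    except Exception:
        db.rollback()                     # as if the transaction never happened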
Transactions & log files
A transaction is only committed after its side effects
are in stable storage
Writing all modified DB blocks at commit would be too expensive:
random disk writes are costly
a whole DB block has to be written back
no coalescing of updates
Alternative: write only a log of modifications
sequential I/O writes (enables NVDRAM optimizations)
batching of multiple commits
Background process periodically writes dirty data
blocks out
Transactions & log files (2)
When a block is written to disk, the corresponding log file
entries are deleted
If the system crashes:
in-memory dirty blocks are lost
Recovery procedure:
goes through the log files and applies all updates to
the database
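A minimal sketch of this write-ahead logging and recovery idea, assuming a simple append-only log of (block, value) records in JSON lines (an illustrative format, not Oracle's on-disk layout): commit forces compact redo records to disk, and recovery replays them over the surviving data blocks.

# Sketch of write-ahead logging and crash recovery (illustrative format):
# commit appends compact redo records; recovery replays the whole log.

import json, os

def commit(log_path, updates):
    # updates: list of (block_id, new_value) produced by a transaction
    with open(log_path, "a") as log:
        for block_id, new_value in updates:
            log.write(json.dumps({"block": block_id, "value": new_value}) + "\n")
        log.flush()
        os.fsync(log.fileno())    # durable once the log records reach the disk

def recover(log_path, blocks):
    # blocks: dict block_id -> value as found on disk after the crash
    with open(log_path) as log:
        for line in log:
            rec = json.loads(line)
            blocks[rec["block"]] = rec["value"]   # reapply every logged update
    return blocks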
Transactions & concurrency control
Many transactions in-flight at any given time
Locking of data items is required
Lock granularity:
Table
Block
Row
Efficient row-level locking is needed for high
transaction throughput
Row-level locking
Each new transaction is assigned a unique ID
A transaction table keeps track of all active transactions
Lock: write ID in directory entry for row
Unlock: remove ID from transaction table
[Figure: a data block whose directory entries hold the IDs of locking transactions (e.g., 120, 233, 230), and a transaction table listing the active transactions (233, 234, 235); removing a transaction from the table releases all of its row locks simultaneously]
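A minimal sketch of this scheme in Python (illustrative structures, not Oracle's): each row's directory entry records the owning transaction ID, and removing a transaction from the transaction table releases all of its row locks at once.

# Sketch of row-level locking via a transaction table (illustrative).

next_txn_id = 0
transaction_table = set()          # IDs of active transactions
row_lock = {}                      # (block_id, row) -> owning transaction ID

def begin_transaction():
    global next_txn_id
    next_txn_id += 1
    transaction_table.add(next_txn_id)
    return next_txn_id

def lock_row(txn_id, block_id, row):
    owner = row_lock.get((block_id, row))
    if owner is not None and owner in transaction_table and owner != txn_id:
        return False               # row is held by another live transaction
    row_lock[(block_id, row)] = txn_id   # write the ID in the directory entry
    return True

def end_transaction(txn_id):
    # One operation releases every row lock owned by this transaction:
    # stale directory entries are ignored because the ID is no longer active.
    transaction_table.discard(txn_id)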
Transaction read consistency
A transaction that reads a full table should see a
consistent snapshot
For performance, reads shouldn’t lock a table
Problem: intervening writes
Solution: leverage rollback mechanism
an intervening write saves the old value in a rollback segment, so the reader can fetch the old value from there
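A minimal sketch of how a consistent read could be assembled from the rollback data (illustrative, with hypothetical table and rollback-segment structures): rows written by transactions outside the reader's snapshot are taken from the saved old values.

# Sketch of a consistent (snapshot) read built on the rollback mechanism.
# Hypothetical structures:
#   table: row -> (current value, ID of the transaction that wrote it)
#   rollback_segment: row -> old value saved by an intervening write
#   snapshot_txns: transactions already committed when the read began

def read_consistent(table, rollback_segment, snapshot_txns):
    snapshot = {}
    for row, (value, writer_txn) in table.items():
        if writer_txn in snapshot_txns:
            snapshot[row] = value                  # unchanged since the read began
        else:
            snapshot[row] = rollback_segment[row]  # use the saved old value instead
    return snapshot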
Oracle: software structure
Server processes
actual execution of transactions
DB writer
flushes dirty blocks to disk
Log writer
writes redo logs to disk at commit time
Process and system monitors
misc. activity monitoring and recovery
Processes communicate through the SGA and IPC
Oracle: software structure (2)
System Global Area (SGA):
shared memory segment mapped by all processes
Block buffer area
cache of database blocks
the larger portion of physical memory
Metadata area
where most communication takes place
synchronization structures, shared procedures, directory information
contains the fixed region, shared pool, data dictionary, and redo buffers
[Figure: SGA layout along increasing virtual addresses, showing the metadata area (fixed region, shared pool, data dictionary, redo buffers) and the block buffer area]
Oracle: software structure (3)
Hiding I/O latency:
many server processes/processor
large block buffer area
Process dynamics:
server reads/updates database
(allocates entries in the redo buffer pool)
at commit time server signals Log writer and sleeps
Log writer wakes up, coalesces multiple commits and issues
log file write
after log is written, Log writer signals suspended servers
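A minimal sketch of this commit path, using Python threads in place of Oracle's processes and IPC (all names are illustrative): servers queue their redo entries and sleep, and the log writer coalesces pending commits into a single sequential log write before waking them.

# Sketch of group commit: servers sleep at commit time, the log writer
# batches their redo entries into one log write and then wakes them up.

import threading

pending = []                        # redo entries not yet on disk
cond = threading.Condition()

def server_commit(redo_entry):      # redo_entry: dict like {"data": ...}
    with cond:
        pending.append(redo_entry)
        done = threading.Event()
        redo_entry["done"] = done
        cond.notify()               # signal the log writer
    done.wait()                     # sleep until the log write completes

def log_writer(log_file):
    while True:
        with cond:
            while not pending:
                cond.wait()
            batch, pending[:] = list(pending), []   # coalesce multiple commits
        for entry in batch:
            log_file.write(str(entry["data"]) + "\n")
        log_file.flush()            # one sequential write for the whole batch
        for entry in batch:
            entry["done"].set()     # wake up the suspended servers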
Oracle: NUMA issues
Single SGA region complicates NUMA localization
Single log writer process becomes a bottleneck
Oracle8 is incorporating NUMA-friendly
optimizations
Current large NUMA systems use OPS even on a
single address space
Oracle Parallel Server (OPS)
Runs on clusters of SMPs/NUMAs
Layered on top of RDBMS engine
Data is shared through disk
Performance very dependent on how well data can
be partitioned
Not supported by most application vendors
Running Oracle: other issues
Most memory allocated to block buffer area
Need to eliminate OS double buffering
Best performance attained by limiting process
migration
In large SMPs, dedicating one processor to I/O may
be advantageous
TPC Database Benchmarks
Transaction Processing Performance Council (TPC)
Established about 10 years ago
Mission: define representative benchmark standards
for vendors (hardware/software) to compare their
products
Focus on both performance and price/performance
Strict rules about how the benchmark is run
The only widely used benchmarks for commercial workloads
TPC pricing rules
Must include
All hardware
– server, I/O, networking, switches, clients
All software
– OS, any middleware, database engine
5-year maintenance contract
Can include usual discounts
Audited components must be products
TPC history of benchmarks
TPC-A
First OLTP benchmark
Based on Jim Gray’s Debit-Credit benchmark
TPC-B
Simpler version of TPC-A
Meant as a stress test of the server only
TPC-C
Current TPC OLTP benchmark
Much more complex than TPC-A/B
TPC-D
Current TPC DSS benchmark
TPC-W
New Web-based e-commerce benchmark
The TPC-B benchmark
Models a bank with many branches
Tables: Branch, Teller, Account, History
1 transaction type: account update
Begin transaction
Update account balance
Write entry in history table
Update teller balance
Update branch balance
Commit
Metrics:
tpsB (transactions/second)
$/tpsB
Scale requirement:
1 tpsB needs 100,000 accounts
TPC-B: other requirements
System must be ACID
(A)tomicity
– transactions either commit or leave the system as if they
were never issued
(C)onsistency
– transactions take the system from one consistent state to
another
(I)solation
– concurrent transactions execute as if in some serial
order
(D)urability
– results of committed transactions are resilient to faults
The TPC-C benchmark
Current TPC OLTP benchmark
Moderately complex OLTP
Models a wholesale supplier managing orders
Workload consists of five transaction types
Users and database scale linearly with throughput
Specification was approved July 23, 1992
TPC-C: schema
[Figure: TPC-C schema. Tables and cardinalities: Warehouse (W), District (W*10, 10 per warehouse), Customer (W*30K, 3K per district), History (W*30K+), Order (W*30K+, 1+ per customer), New-Order (W*5K, 0-1 per order), Order-Line (W*300K+, 10-15 per order), Stock (W*100K, 100K per warehouse), Item (100K, fixed). The original diagram also marked one-to-many relationships and a secondary index.]
TPC-C: transactions
New-order: enter a new order from a customer
Payment: update customer balance to reflect a
payment
Delivery: deliver orders (done as a batch
transaction)
Order-status: retrieve status of customer’s most
recent order
Stock-level: monitor warehouse inventory
TPC-C: transaction flow
1. Select txn from menu (New-Order 45%, Payment 43%, Order-Status 4%, Delivery 4%, Stock-Level 4%); measure menu Response Time
2. Input screen; keying time
3. Measure txn Response Time; output screen; think time; go back to 1
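A minimal sketch of this flow (an illustrative driver, not the official TPC-C harness): the client picks a transaction type according to the fixed mix, waits for keying time, measures the transaction response time, and then thinks before looping.

# Sketch of the TPC-C client loop: weighted transaction selection,
# keying time, response-time measurement, think time.

import random, time

MIX = [("New-Order", 0.45), ("Payment", 0.43), ("Order-Status", 0.04),
       ("Delivery", 0.04), ("Stock-Level", 0.04)]

def pick_transaction():
    return random.choices([name for name, _ in MIX],
                          weights=[w for _, w in MIX])[0]

def client_loop(run_transaction, keying_time=1.0, think_time=1.0, n=100):
    response_times = []
    for _ in range(n):
        txn = pick_transaction()
        time.sleep(keying_time)              # operator fills the input screen
        start = time.time()
        run_transaction(txn)                 # submit to the database server
        response_times.append(time.time() - start)
        time.sleep(think_time)               # operator reads the output screen
    return response_times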
TPC-C: other requirements
Transparency
tables can be split horizontally and vertically
provided it is hidden from the application
Skew
1% of new-order txn are to a random remote
warehouse
15% of payment txn are to a random remote
warehouse
Metrics:
performance: new-order transactions/minute (tpmC)
cost/performance: $/tpmC
TPC-C: scale
Maximum of 12 tpmC per warehouse
Consequently:
A quad-Xeon system today (~20,000 tpmC) needs
– over 1668 warehouses
– over 1 TB of disk storage!!
That’s a VERY expensive benchmark to run!
TPC-C: side effects of the skew rules
Very small fraction of transactions go to remote
warehouses
Transparency rules allow data partitioning
Consequence:
Clusters of powerful machines show exceptional
numbers
Compaq holds the current TPC-C record of over 100
KtpmC with an 8-node Memory Channel cluster
Skew rules are expected to change in the future
The TPC-D benchmark
Current DSS benchmark from TPC
Moderately complex decision support workload
Models a worldwide reseller of parts
Queries ask real world business questions
17 ad hoc DSS queries (Q1 to Q17)
2 update queries
TPC-D: schema
[Figure: TPC-D schema. Tables and cardinalities: LineItem (SF*6000K), Order (SF*1500K), PartSupp (SF*800K), Part (SF*200K), Customer (SF*150K), Supplier (SF*10K), Nation (25), Region (5)]
TPC-D: scale
Unlike TPC-C, scale not tied to performance
Size determined by a Scale Factor (SF)
SF = {1,10,30,100,300,1000,3000,10000}
SF=1 means a 1GB database size
Majority of current results are in the 100GB and
300GB range
Indices and temporary tables can significantly
increase the total disk capacity. (3-5x is typical)
TPC-D example query
Forecasting Revenue Query (Q6)
This query quantifies the amount of revenue increase that would have resulted from
eliminating company-wide discounts in a given percentage range in a given year.
This type of “what if” query can be used to look for ways to increase revenues
Considers all line-items shipped in a year
Query definition:
SELECT SUM(L_EXTENDEDPRICE*L_DISCOUNT) AS REVENUE
FROM LINEITEM
WHERE L_SHIPDATE >= DATE ‘[DATE]’
AND L_SHIPDATE < DATE ‘[DATE]’ + INTERVAL ‘1’ YEAR
AND L_DISCOUNT BETWEEN [DISCOUNT] - 0.01 AND [DISCOUNT] + 0.01
AND L_QUANTITY < [QUANTITY]
TPC-D execution rules
Power Test
Queries submitted in a single stream (i.e., no concurrency)
Each Query Set is a permutation of the 17 read-only queries
Sequence: Cache Flush, then an optional warm-up Query Set 0 (not timed), then the timed sequence UF1, Query Set 0, UF2
Throughput Test
Multiple concurrent query streams (Query Set 1, Query Set 2, ..., Query Set N) form the timed sequence
Single update stream runs the update functions in order: UF1 UF2 UF1 UF2 ...
TPC-D: metrics
Power Metric (QppD), a geometric mean:

QppD@Size = \frac{3600 \cdot SF}{\left( \prod_{i=1}^{17} QI(i,0) \cdot \prod_{j=1}^{2} UI(j,0) \right)^{1/19}}

where
QI(i,0) = Timing Interval for Query i, stream 0
UI(j,0) = Timing Interval for Update j, stream 0
SF = Scale Factor

Throughput Metric (QthD), an arithmetic mean:

QthD@Size = \frac{S \cdot 17 \cdot 3600}{TS} \cdot SF

where
S = number of query streams
TS = elapsed time of the test (in seconds)

Both metrics represent “Queries per Gigabyte Hour”
TPC-D: metrics(2)
Composite Query-Per-Hour Rating (QphD)
The Power and Throughput metrics are combined to
get the composite queries per hour.
QphD@Size = \sqrt{QppD@Size \cdot QthD@Size}
Reported metrics are:
– Power: QppD@Size
– Throughput: QthD@Size
– Price/Performance: $/QphD@Size
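A small sketch of the metric arithmetic above (illustrative; assumes the timing intervals are in seconds and come from the 17-query, 2-update power stream):

# Sketch of the TPC-D metric computations.

import math

def qppd(query_times, update_times, sf):
    # Geometric mean over the 17 queries and 2 update functions of stream 0.
    product = math.prod(query_times) * math.prod(update_times)
    return 3600.0 * sf / product ** (1.0 / 19.0)

def qthd(num_streams, elapsed_seconds, sf):
    # Throughput metric: query sets completed per hour, scaled by SF.
    return num_streams * 17 * 3600.0 / elapsed_seconds * sf

def qphd(qppd_value, qthd_value):
    # Composite rating: geometric mean of power and throughput.
    return math.sqrt(qppd_value * qthd_value)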
TPC-D: other issues
Queries are complex and long-running
Crucial that DB engine parallelizes queries for
acceptable performance
Quality of query parallelizer is the most important
factor
Large improvements are still observed from
generation to generation of software
The TPC-W benchmark
Just introduced
Represents a business that markets and sells over
the Internet
Includes security/authentication
Uses dynamically generated pages (e.g. cgi-bins)
Metric: Web Interactions Per Second (WIPS)
Transactions:
Browse, shopping-cart, buy, user-registration, and
search
A look at current audited TPC-C systems
Leader in price/performance:
Compaq ProLiant 7000-6/450, MS SQL 7.0, NT
– 4x 450MHz Xeons, 2MB cache, 4GB DRAM, 1.4 TB
disk
– 22,479 tpmC, $18.84/tpmC
Leader in non-cluster performance:
Sun Enterprise 6500, Sybase 11.9, Solaris 7
– 24x 336MHz UltraSPARC IIs, 4MB cache, 24 GB
DRAM, 4TB disk
– 53,050 tpmC, $76.00/tpmC
Audited TPC-C systems: price breakdown
Server sub-component prices:

              Compaq ProLiant    Sun E6500
$/CPU         $4,816.00          $15,375.00
$/MB DRAM     $3.92              $9.16
$/GB Disk     $145.33            $382.03

[Figure: server price breakdown. Stacked bars (base, CPU, memory, disk) as a percentage of total server price for the Compaq ProLiant and the Sun E6500]
Using TPC benchmarks for architecture studies
Brute force approach: use full audit-sized system
Who can afford it?
How can you run it on top of a simulator?
How can you explore a wide design space?
Solution: scaling down the size
Careful Scaling of Workloads
Identify architectural issue under study
Apply appropriate scaling to simplify monitoring and
enable simulation studies
Most scaling experiments on real machines
simulation-only is not a viable option!
Validation through sanity checks and comparison
with audit-sized runs
Scaling OLTP
Forget about TPC compliance
Determine lower bound on DB size
monitor contention for smaller tables/indexes
DB size will change with number of processors
I/O bandwidth requirements vary with fraction of DB
resident in memory
completely in-memory run: no special I/O
requirements
favor many small disks over a few large ones
place all redo logs on a separate disk
reduce OS double-buffering
Limit number of transactions executed
Scaling OLTP(2)
Achieve representative cache behavior
relevant data structures >> size of hardware caches
(metadata area size is key)
maintain same number of processes/CPU as larger
run
Simplify setup by running clients on the server
machine
need to make lighter-weight versions of the clients
Ensure efficient execution
excessive migration, idle time, and OS or application
spinning distort metrics
Scaling DSS
Determine lower bound DB size
sufficient work in parallel section
Ensure representative cache behavior
DB >> hardware caches
maintain same number of processes/CPU as large
run
Reduce execution time through sampling
Major difficulty is ensuring representative query
plans
DSS results more volatile due to improvements in
query optimizers
Tuning, tuning, tuning
Ensure scaled workload is running efficiently
Requires a large number of monitoring runs on
actual hardware platform
Resembles “black art” on Oracle
Self-tuning features in Microsoft SQL 7.0 are
promising
ability for user overrides is desirable, but missing
Does Scaling Work?
TPC-C: scaled vs. full size
Breakdown profile of CPU cycles (platform: 8-proc. AlphaServer 8400):

                TPC-C, scaled    TPC-C, full-size
1-issue         8%               11%
2-issue         8%               8%
bcache hit      30%              20%
bcache miss     24%              27%
scache hit      17%              22%
tlb             3%               1%
repl trap       2%               5%
br/pc mispr.    3%               2%
mb              3%               6%
Using simpler OLTP benchmarks:
Breakdown profile of CPU cycles:

                TPC-B, scaled    TPC-C, full-size
1-issue         7%               11%
2-issue         6%               8%
bcache hit      16%              20%
bcache miss     37%              27%
scache hit      16%              22%
tlb             2%               1%
repl trap       2%               5%
br/pc mispr.    3%               2%
mb              9%               6%

Although “obsolete”, TPC-B can be used in architectural studies
Benchmarks wrap-up
Commercial applications are complex, but need to
be considered during design evaluation
TPC benchmarks cover a wide range of
commercial application areas
Scaled down TPC benchmarks can be used for
architecture studies
Architects need a deep understanding of the workload