Designing Highly Scalable OLTP Systems


Transcript: Designing Highly Scalable OLTP Systems

Laying the foundation and tuning for OLTP
Key Takeaways:
http://sqlcat.com/whitepapers/archive/2007/11/19/sql-server-2005-waits-and-queues.aspx
http://sqlcat.com/whitepapers/archive/2009/04/14/troubleshooting-performance-problems-in-sqlserver-2008.aspx
In this talk we will focus on the unique challenges we face with high concurrency and applications requiring low latency.
Laying the foundations for OLTP Performance
[Diagram: the path from the Windows OS to the sockets – kernel groups 0-3 each contain a set of NUMA nodes (0-31); each NUMA node is a CPU socket holding CPU cores, each core with two hyper-threads.]
SQL Server Today: Capabilities and Challenges with real customer workloads

Challenge: Network
• 10 Gb/s network used – no bottlenecks observed

Challenge: Concurrency
• Observed spikes in CPU at random times during workload
• Significant spinlock contention on SOS_CACHESTORE due to frequent re-generation of security tokens
  • Hotfix provided by SQL Server team
  • Result: SOS_CACHESTORE contention removed
• Spinlock contention on LOCK_HASH due to heavy reading of the same rows
  • This was due to an incorrect parameter being passed in by the test workload
  • Result: LOCK_HASH contention removed, reduced CPU from 100% to 18%

Challenge: Transaction Log
• Synchronous replication at the storage level
• Observed 10-30 ms log latency – expected 3-5 ms
  • Encountered Virtual Log File fragmentation (dbcc loginfo) – rebuilt log
  • Observed overutilization of front-end fiber channel ports on the array – reconfigured storage, balancing traffic across front-end ports
  • Result: 3-5 ms latency

Challenge: Database and table design/Schema
• Schema utilizes hash partitioning to avoid page latch contention on inserts
• Requirement for low privileges requires longer code paths in the SQL engine

Challenge: Monitoring
• Heavily utilized Extended Events to diagnose spinlock contention points

Challenge: Architecture/Hardware
• Currently running 16-socket IA64 in production
• Benchmark performed on 8-socket x64 Nehalem-EX (64 physical cores)
• Hyper-threading to 128 logical cores offered little benefit to this workload
• Encountered high NUMA latencies (coreinfo.exe) – resolved via firmware updates
  http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
Nehalem-EX
• Every socket is a NUMA node
• How fast is your interconnect?
Abbreviated DBCC LOGINFO output:

FileId  FileSize    StartOffset   FSeqNo  Status  Parity  CreateLSN
2       253952      8192          48141   0       64      0
…
2       427556864   74398826496   0       0       128     22970000047327200649
2       427950080   74826383360   0       0       128     22970000047327200649
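To run the same check yourself (run in the context of the database being examined):

-- One row is returned per virtual log file (VLF); thousands of rows
-- typically indicate the VLF fragmentation described above.
DBCC LOGINFO;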
[Diagram: Lock Manager – a thread attempting to obtain a lock (row, page, database, etc.) is mapped to a bucket in the lock hash table; threads accessing the same hash bucket of the table are synchronized.]
select * from sys.dm_os_spinlock_stats
order by spins desc
These symptoms may indicate spinlock contention:
1. A high number of spins is reported for a particular spinlock type, AND
2. The system is experiencing heavy CPU utilization, AND
3. The system has a high amount of concurrency.
Not a resolution, but we know where to start.
High backoffs = contention
Name               Collisions   Spins            Spins_Per_Collision  Backoffs
SOS_CACHESTORE     14,752,117   942,869,471,526  63,914               67,900,620
SOS_SUSPEND_QUEUE  69,267,367   473,760,338,765  6,840                2,167,281
LOCK_HASH          5,765,761    260,885,816,584  45,247               3,739,208
MUTEX              2,802,773    9,767,503,682    3,485                350,997
SOS_SCHEDULER      1,207,007    3,692,845,572    3,060                109,746
--Get the type value for any given spinlock type
select map_value, map_key, name from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')
--create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (action (package0.callstack)
where type = 144 --SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
set filtering_event_name='sqlos.spinlock_backoff',
source_type=1, source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)
--Ensure the session was created
select * from sys.dm_xe_sessions
where name = 'spin_lock_backoff'
--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start
--wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'
--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)
--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'
--clean up the session
alter event session spin_lock_backoff on server state=stop
drop event session spin_lock_backoff on server
1. Huge increase in the number of spins and backoffs associated with SOS_CACHESTORE
2. Observation: it is counterintuitive to have high wait times (LCK_M_X) correlate with heavy CPU – this is the symptom, not the cause
3. Approach: use Extended Events to profile the code path with the spinlock contention (i.e. where there is a high number of backoffs)
4. Root cause: regeneration of security tokens exposes contention in code paths for access permission checks
5. Workaround/problem isolation: run with sysadmin rights
Long-term change required: SQL Server fix
Lab hardware (active/active failover cluster):
• Transaction DB server: 1 x DL785, 8P (quad core), 2.3 GHz, 256 GB RAM
• Reporting DB server: 1 x DL585, 4P (dual core), 2.6 GHz, 32 GB RAM
• App servers: 5 x BL460 blade servers, 2 proc (quad core), 32-bit, 32 GB memory
• Load drivers: 12 x Dell R900's/R805's, 2 proc (quad core), x64, 32+ GB memory
• Network switches between load drivers, app servers and DB servers
• SAN: CX-960 (240 drives, 15K, 300 GB) behind a Brocade 4900 SAN switch (32 ports active)
Challenge: Network
• CPU bottlenecks for network processing were observed and resolved via network tuning (RSS)
• Further network optimization was performed by implementing compression in the application
• After optimizations, were able to push ~180K packets/sec, approx. 111 MB/sec through a single 1 Gb/s NIC

Challenge: Concurrency
• Page buffer latch waits were by far the biggest pain point
• Hash partitioning was used to "scale out" the B-trees and eliminate the contention
• Some PFS contention for the tables containing LOB data – resolved by placing LOB tables on dedicated filegroups and adding more files

Challenge: Transaction Log
• No log bottlenecks were observed. When cache on the array behaves well, log response times are very low.

Challenge: Database and table design/Schema
• Observed overhead related to PK/FK relationships; INSERT statements required additional work
• Adding a persisted computed column needed for hash partitioning is an offline operation
• Moving LOB data is an offline operation

Challenge: Monitoring
• For the latch contention, utilized dm_os_wait_stats, dm_os_waiting_tasks and dm_db_index_operational_stats to identify the indexes with most contention

Challenge: Architecture/Hardware
• Be careful about shared components in blade server deployments – this became a bottleneck for our middle tier
[Diagram: an 8 KB page holding several rows; concurrent threads modifying rows on the same page wait on EX_LATCH.]
Dig into details with:
sys.dm_os_wait_stats
sys.dm_os_latch_waits
wait_type          % Wait Time
PAGELATCH_EX       86.4%
PAGELATCH_SH       8.2%
LATCH_SH           1.5%
LATCH_EX           1.0%
LOGMGR_QUEUE       0.9%
CHECKPOINT_QUEUE   0.8%
ASYNC_NETWORK_IO   0.8%
WRITELOG           0.4%

latch_class                        wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT   156,818
LOG_MANAGER                        103,316
select *
, wait_time_ms/waiting_tasks_count [avg_wait_time]
, signal_wait_time_ms/waiting_tasks_count [avg_signal_wait_time]
from sys.dm_os_wait_stats
where wait_time_ms > 0
and wait_type like '%PAGELATCH%'
order by wait_time_ms desc
/* latch waits ********************************************/
select top 20
    database_id, object_id, index_id, count(partition_number) [num partitions]
    ,sum(leaf_insert_count) [leaf_insert_count], sum(leaf_delete_count) [leaf_delete_count]
    ,sum(leaf_update_count) [leaf_update_count]
    ,sum(singleton_lookup_count) [singleton_lookup_count]
    ,sum(range_scan_count) [range_scan_count]
    ,sum(page_latch_wait_in_ms) [page_latch_wait_in_ms], sum(page_latch_wait_count) [page_latch_wait_count]
    ,sum(page_latch_wait_in_ms) / sum(page_latch_wait_count) [avg_page_latch_wait]
    ,sum(tree_page_latch_wait_in_ms) [tree_page_latch_wait_ms], sum(tree_page_latch_wait_count) [tree_page_latch_wait_count]
    ,case when (sum(tree_page_latch_wait_count) = 0) then 0
          else sum(tree_page_latch_wait_in_ms) / sum(tree_page_latch_wait_count) end [avg_tree_page_latch_wait]
from sys.dm_db_index_operational_stats (null, null, null, null) os
where page_latch_wait_count > 0
group by database_id, object_id, index_id
order by sum(page_latch_wait_in_ms) desc
[Diagram: B-tree index with tree pages above and leaf (data) pages below; the logical key order of the index is monotonically increasing, so many threads insert into the end of the range.]
We call this "Last Page Insert Contention"
Expect: PAGELATCH_EX/SH waits
And this is the observation
[Diagram: without partitioning, threads inserting into the end of the range all contend on the "last page"; with a hash partitioned table/index (ranges 0-1000, 1001-2000, 2001-3000, 3001-4000), inserts still go to the end of a range but are spread across the partitions.]
Hash Partitioning Reference:
http://sqlcat.com/technicalnotes/archive/2009/09/22/resolving-pagelatch-contention-on-highly-concurrent-insert-workloads-part-1.aspx
Latch waits of approximately 36 ms at baseline of 99 checks/sec.
Latch waits of approximately 0.6 ms at highest throughput of 249 checks/sec.
 Note: Requires application changes
 Ensure SELECT/UPDATE/DELETE have appropriate partition elimination
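As an illustration only (table and column names are hypothetical, following the hash-partitioning pattern described in this deck), a point lookup that includes the hash column in the predicate so the optimizer can eliminate all other partitions:

DECLARE @Transaction_ID BIGINT = 1234567;
SELECT Amount, TransactionDate
FROM [Transaction]
WHERE HashValue = CAST(@Transaction_ID % 16 AS TINYINT)  -- same expression used to compute the hash column
  AND Transaction_ID = @Transaction_ID;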
http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
KB 942861
Single 1 Gb/s NIC
wait_type               total_wait_time_ms  total_waiting_tasks_count  average_wait_ms
DTC_STATE               5,477,997,934       4,523,019                  1,211
PREEMPTIVE_TRANSIMPORT  2,852,073,282       3,672,147                  776
PREEMPTIVE_DTC_ENLIST   2,718,413,458       3,670,307                  740
http://msdn.microsoft.com/en-us/magazine/cc163847.aspx#S5
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx
Applied Architecture Patterns on the Microsoft Platform
46
47
Agenda
 Windows Server 2008R2 and SQL Server 2008R2 improvements
 Scale architecture
 Customer Requirements
 Hardware setup
 Transaction log essentials
 Getting the code right
 Application Server Essentials
 Database Design
 Tuning Data Modification
 UPDATE statements
 INSERT statements
 Management of LOB data
 The problem with NUMA and what to do about it
 Final results and Thoughts
48
Top statistics

Category                                          Metric
Largest single database                           80 TB
Largest table                                     20 TB
Biggest total data, 1 customer                    2.5 PB
Highest writes per second, 1 database             60,000
Fastest I/O subsystem in production (and in lab)  18 GB/sec (26 GB/sec)
Fastest "real time" cube                          1 sec latency
Data load for 1 TB                                20 minutes
Largest cube                                      12 TB
49
Upping the Limits
 Previous to Windows Server 2008R2, Windows was limited to 64 cores
 Kernel tuned for this config
 With Windows Server 2008R2 this limit is now upped to 256 cores (plumbing for 1024 cores)
 New concept: Kernel Groups
 A bit like NUMA, but an extra layer in the hierarchy
 SQL Server generally follows suit – but for now, 256 cores is the limit on R2
 Example x64 machines: HP DL980 (64 cores, 128 in HyperThread), IBM 3950 (up to 256 cores)
 And the largest IA-64 is 256 hyper-threads (at 128 cores)
50
The Path to the Sockets
[Diagram: hardware exposes CPU sockets, each a NUMA node (0-31) containing CPU cores with two hyper-threads each; the Windows OS groups the NUMA nodes into Kernel Groups 0-3.]
51
And we measure it like this
 Sysinternals CoreInfo
http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
 Nehalem-EX
 Every socket is a NUMA node
 How fast is your interconnect….
52
And it Looks Like This...
53
Customer Scenarios

Core Banking
• Workload: Credit card transactions from ATMs and branches
• Scale requirements: 10,000 business transactions/sec
• Technology: App tier .NET 3.5/WCF, SQL Server 2008R2, Windows Server 2008R2
• Server: HP Superdome

Healthcare System
• Workload: Sharing patient information across multiple healthcare trusts
• Scale requirements: 37,500 concurrent users
• Technology: App tier .NET, SQL Server 2008R2, Windows Server 2008R2
• Server: HP DL785G6

POS
• Workload: World record deployment of ISV POS application across 8,000 US stores
• Scale requirements: Handle peak holiday load of 228 checks/sec
• Technology: Virtualized app tier (COM+, Windows 2003), SQL Server 2008, Windows Server 2008
• Server: IBM 3950 and HP DL980; DL785
54
Network Cards – Rule of Thumb
 At scale, network traffic will generate a LOT of interrupts for the CPU
 These must be handled by CPU cores
 Must distribute packets to cores for processing
 Rule of thumb (OLTP): 1 NIC / 16 cores
 Watch the DPC activity in Task Manager
 In Windows 2003, remove SQL Server (with affinity mask) from the NIC cores
55
Lab: Network Tuning Approaches
1. Tune configuration options of a single NIC card to provide the maximum throughput.
2. Improve the application code to compress LOB data before sending it to SQL Server.
3. Team a pair of 1 Gb/s NICs to provide more bandwidth (transparent to the app).
4. Add multiple NICs (better for scale).
56
Tuning a Single NIC Card – POS system
 Enable RSS to enable multiple CPUs to process receive indications:
http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx
 The next step was to disable the Base Filtering Service in Windows and explicitly enable TCP Chimney Offload.
 Careful with Chimney Offload as per KB 942861
57
Before and After Tuning Single NIC
1. Before any network changes the workload was CPU bound on CPU0
2. After tuning RSS, disabling the Base Filtering Service and explicitly enabling TCP Chimney Offload, CPU time on CPU0 was reduced. The base CPU for RSS successfully moved from CPU0 to another CPU.
58
SQL Server Configuration Changes
 As we increased the number of connections to around 6,000 (users had think time), we started seeing waits on THREADPOOL
 Solution: increase sp_configure 'max worker threads'
 Probably don't want to go higher than 4096
 Gradually increase it; default max is 980
 Avoid killing yourself in thread management – the bottleneck is likely somewhere else
 Use affinity mask to keep SQL Server off the cores handling NIC traffic
 Well tuned, pure play OLTP
 No need to consider parallel plans
 sp_configure 'max degree of parallelism', 1 (see the configuration sketch below)
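A minimal sketch of the configuration changes described above (the values shown are illustrative; raise 'max worker threads' gradually and only if THREADPOOL waits are observed):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max worker threads', 4096;      -- default of 0 lets SQL Server size the pool itself
EXEC sp_configure 'max degree of parallelism', 1;  -- pure-play OLTP: no parallel plans
RECONFIGURE;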
60
Designing Highly Scalable OLTP Systems
Getting the Code Right
61
Things to Double Check
 Connection pooling enabled?
 How much connection memory are we using?
 Monitor perfmon: MSSQL: Memory Manager
 Obvious Memory or Handle leaks?
 Check Process counters in perfmon for the .NET app
 Server side processes will keep memory unless under pressure
 Can the application handle the load?
 Call into dummy procedures that do nothing
 Check measured application throughput
 Typical case: Application breaks before SQL
63
Remote Calling from WCF
 Original client code: Synchronous calls in WCF
 Each thread must wait for network latency before proceeding
 Around 1ms waiting
 Very similar to disk I/O – thread will fall asleep
 Lots of sleeping threads
 Limited to around 50 client simulations per machine
 Instead, use IAsyncInterface
64
Designing Highly Scalable OLTP Systems
Tuning Data Modification
65
Database Schema – Credit Cards
• ATM (roughly 10^3 rows): ID_ATM, ID_Branch, LastTransactionDate, LastTransaction_ID, …
• Account (roughly 10^5 rows): Account_ID, LastUpdateDate, Balance, …
• Transaction (roughly 10^10 rows): Transaction_ID, Customer_ID, ATM_ID, Account_ID, TransactionDate, Amount, …

Statements in the business transaction:
UPDATE ATM … SET LastTransaction_ID = @ID + 1, LastTransactionDate = GETDATE()
UPDATE Account … SET Balance …
INSERT Transaction … VALUES (@amount)
INSERT Transaction … VALUES (-1 * @amount)
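A minimal sketch of the three tables in T-SQL. Data types and key choices are assumptions (the slide only lists column names and row counts), and the ATM key is written as ATM_ID to match the stored procedures later in the deck:

CREATE TABLE ATM (
    ATM_ID              INT      NOT NULL PRIMARY KEY,
    Branch_ID           INT      NOT NULL,
    LastTransactionDate DATETIME NOT NULL,
    LastTransaction_ID  INT      NOT NULL
    -- ... roughly 10^3 rows
);
CREATE TABLE Account (
    Account_ID     INT      NOT NULL PRIMARY KEY,
    LastUpdateDate DATETIME NOT NULL,
    Balance        MONEY    NOT NULL
    -- ... roughly 10^5 rows
);
CREATE TABLE [Transaction] (
    Transaction_ID  BIGINT   NOT NULL,
    Customer_ID     INT      NOT NULL,
    ATM_ID          INT      NOT NULL,
    Account_ID      INT      NOT NULL,
    TransactionDate DATETIME NOT NULL,
    Amount          MONEY    NOT NULL
    -- ... roughly 10^10 rows; indexing and hash partitioning are discussed later
);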
66
Summary of Concerns
 Transaction table is hot
  Lots of INSERTs
  How to handle ID numbers?
  Allocation structures in the database
 Account table must be transactionally consistent with Transaction
  Do I trust the developers to do this?
  Cannot release locks until BOTH are in sync
  What about the latency of the round trips needed for this?
 Potentially hot rows in Account
  Are some accounts touched more than others?
 ATM table has hot rows
  Each row is on average touched at least ten times per second (e.g. 10^3 rows with 10^4 transactions/sec)
67
Generating a Unique ID
 Why won't this work?
CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
DECLARE @LastTransaction_ID INT

SELECT @LastTransaction_ID = LastTransaction_ID
FROM ATM
WHERE ATM_ID = @ATM_ID

SET @ID = @LastTransaction_ID + 1

UPDATE ATM
SET LastTransaction_ID = @ID
WHERE ATM_ID = @ATM_ID
68
Concurrency is Fun
ATM row: ID_ATM = 13, LastTransaction_ID = 42

Session 1: SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13   (@LastTransaction_ID = 42)
Session 2: SELECT @LastTransaction_ID = LastTransaction_ID FROM ATM WHERE ATM_ID = 13   (@LastTransaction_ID = 42)
Session 1: SET @ID = @LastTransaction_ID + 1
Session 2: SET @ID = @LastTransaction_ID + 1
Session 1: UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13
Session 2: UPDATE ATM SET LastTransaction_ID = @ID WHERE ATM_ID = 13

Both sessions read 42 and both produce @ID = 43 – a duplicate "unique" ID.
69
Generating a Unique ID – The Right Way
CREATE PROCEDURE GetID
    @ID INT OUTPUT,
    @ATM_ID INT
AS
UPDATE ATM
SET @ID = LastTransaction_ID = LastTransaction_ID + 1   -- read and increment in a single atomic statement
WHERE ATM_ID = @ATM_ID
 And it is simple too...
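Example call (illustrative):

DECLARE @NewID INT;
EXEC GetID @ID = @NewID OUTPUT, @ATM_ID = 13;
SELECT @NewID AS NewTransaction_ID;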
70
Hot rows in ATM
 Initial runs with a few hundred ATMs showed excessive waits for LCK_M_U
 Diagnosed in sys.dm_os_wait_stats
 Drilling down to individual locks using sys.dm_tran_locks (see the sketch below)
 Inventive readers may wish to use XEvents
  Event objects: sqlserver.lock_acquired and sqlos.wait_info
  Bucketize them
 As concurrency increases, lock waits keep increasing
 While throughput stays constant
 Until...
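A rough sketch of that drill-down, joining the waiting tasks to the locks they are waiting on (standard DMVs; filter further as needed):

SELECT wt.session_id, wt.wait_type, wt.wait_duration_ms,
       tl.resource_type, tl.resource_description, tl.request_mode
FROM sys.dm_os_waiting_tasks wt
JOIN sys.dm_tran_locks tl
  ON wt.resource_address = tl.lock_owner_address
WHERE wt.wait_type LIKE 'LCK%';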
71
Spinning around
[Chart: lg(Spins) (left axis, 1E+00 to 1E+13) and Throughput (right axis, 0 to 20,000) plotted against Requests (0 to 100,000).]
• Diagnosed using sys.dm_os_spinlock_stats
• Pre SQL2008 this was DBCC SQLPERF(spinlockstats)
• Can dig deeper using Xevents with sqlos.spinlock_backoff event
• We are spinning for LOCK_HASH
72
LOCK_HASH – what is it?
[Diagram: more and more threads trying to lock the same row spin in the Lock Manager on the same LOCK_HASH bucket.]
- Why not go to sleep?
73
Locking at Scale
 Ratio between ATM machines and transactions generated too low.
 Can only sustain a limited number of locks/unlocks per second
 Depends a LOT on NUMA hardware, memory speeds and CPU caches
 Each ATM was generating 200 transactions / sec in test harness
 Solution: Increase number of ATM machines
 Key Takeaway: If a locked resource is contended – create more of it
 Notice: This is not SQL Server specific; any piece of code will be bound by memory speeds when access to a region must be serialized
74
Hot rows in Account
 Three ways to update the Account table:
1) Let application servers invoke a transaction that both inserts into Transaction and updates Account
2) Set a trigger on Transaction
3) Create a stored proc that handles the entire transaction
 Option 1 has two issues:
  App developers may forget it in some code paths
  Latency of the roundtrip: around 1 ms – i.e. no more than 1,000 locks/sec possible on a single row
 Option 2 is the better choice (a sketch follows below)
 Option 3 must be used in all places in the app to be better than option 2
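A minimal sketch of option 2, assuming the schema outlined earlier (the aggregation handles multi-row inserts; error handling omitted):

CREATE TRIGGER trg_Transaction_UpdateAccount ON [Transaction]
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE a
    SET Balance        = a.Balance + i.TotalAmount,
        LastUpdateDate = GETDATE()
    FROM Account AS a
    JOIN (SELECT Account_ID, SUM(Amount) AS TotalAmount
          FROM inserted
          GROUP BY Account_ID) AS i
      ON i.Account_ID = a.Account_ID;
END;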
76
Hot Latches!
 LCK waits are gone, but we are seeing very high waits for PAGELATCH_EX
  High = more than 1 ms
 What are we contending on?
 Latch – a lightweight semaphore
  Locks are logical (transactional consistency)
  Latches are internal to the SQL engine (memory consistency)
 Because rows are small (many fit on a page), multiple locks may compete for one PAGELATCH
[Diagram: an 8 KB page with several rows; row locks (LCK_U) held on different rows all funnel into the same PAGELATCH_EX.]
77
Row Padding
 In the case of the ATM table, our rows are small and few
 We can "waste" a bit of space to get more performance
 Solution: pad rows with a CHAR column to make each row take a full page
 1 LCK = 1 PAGELATCH
[Diagram: one padded row (CHAR(5000)) per 8 KB page, so each row lock maps to its own page latch.]

ALTER TABLE ATM
ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')
78
INSERT throughput
 Transaction table is by far the most active table
 Fortunately, only INSERT
 No need to lock rows
 But several rows must still fit a single page
 Cannot pad pages – there are 10**10 rows in the table
 A new page will eventually be allocated, but until it is, every insert goes to the same page
 Expect: PAGELATCH_EX waits
 And this is the observation
79
Hot page at the end of B-tree with increasing index
[Chart: Inserts/sec (0 to 35,000) measured while scaling from 1 to 150 concurrent client threads.]
80
Waits & Latches
 Dig into details with:
 sys.dm_os_wait_stats
 sys.dm_os_latch_waits
wait_type          % Wait Time
PAGELATCH_SH       86.4%
PAGELATCH_EX       8.2%
LATCH_SH           1.5%
LATCH_EX           1.0%
LOGMGR_QUEUE       0.9%
CHECKPOINT_QUEUE   0.8%
ASYNC_NETWORK_IO   0.8%
WRITELOG           0.4%

latch_class                        wait_time_ms
ACCESS_METHODS_HOBT_VIRTUAL_ROOT   156,818
LOG_MANAGER                        103,316
81
How to Solve INSERT hotspot
 Hash partition the table
 Create multiple B-trees
 Round robin between the B-trees – create more resources and less contention
 Do not use a sequential key; distribute the inserts all over the B-tree
[Diagram: each INSERT's ID is hashed (e.g. ID modulo 8) into one of 8 partitions/B-trees – bucket 0 receives IDs 0, 8, 16, …; bucket 1 receives 1, 9, 17, …; and so on – instead of every insert landing on the last page of one range.]
82
Design Pattern: Table "Hash" Partitioning (a T-SQL sketch follows below)
 Add a hash column to the table (tinyint or smallint)
 Calculate a good hash distribution, for example using hashbytes with modulo, or binary_checksum
 Create a new filegroup, or use existing filegroups, to hold the partitions; balance equally over LUNs using an optimal layout
 Use the CREATE PARTITION FUNCTION command to partition the table into #cores partitions
 Use the CREATE PARTITION SCHEME command to bind the partition function to the filegroups
[Diagram: hash values 0-255 map rows into the partitions.]
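A minimal T-SQL sketch of the pattern. Sixteen partitions, a simple modulo hash on the key, and a single filegroup are illustrative choices only; the deck suggests #cores partitions and hashbytes/binary_checksum as alternatives:

ALTER TABLE [Transaction]
    ADD HashValue AS CAST(Transaction_ID % 16 AS TINYINT) PERSISTED NOT NULL;

CREATE PARTITION FUNCTION pf_hash16 (TINYINT)
    AS RANGE LEFT FOR VALUES (0,1,2,3,4,5,6,7,8,9,10,11,12,13,14);   -- 16 partitions

CREATE PARTITION SCHEME ps_hash16
    AS PARTITION pf_hash16 ALL TO ([PRIMARY]);   -- or spread over dedicated filegroups

CREATE UNIQUE CLUSTERED INDEX cix_Transaction
    ON [Transaction] (HashValue, Transaction_ID)
    ON ps_hash16 (HashValue);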
83
Lab Example: Before Partitioning
Latch waits of approximately 36 ms at baseline
of 99 checks/sec.
84
Lab Example: After Partitioning*
Latch waits of approximately 0.6 ms at highest
throughput of 249 checks/sec.
*Other optimizations were applied
85
B-Tree Root Split
[Diagram: B-tree root split – the virtual root is protected by the ACCESS_METHODS_HOBT_VIRTUAL_ROOT latch while PAGELATCHes are taken on the splitting pages and their Prev/Next neighbours, in addition to the row LCK.]
87
NUMA and What to Do About It
 Remember those PAGELATCH waits for UPDATE statements?
 Our solution: add more pages
 Improvement: get out of the PAGELATCH fast so the next thread can work on it
 On NUMA systems, going to a foreign memory node is at least 4-10 times more expensive
 Use the Sysinternals CoreInfo tool
89
How does NUMA work in SQL Server?
 The first NUMA node to request a page will "own" that page
  Ownership continues until the page is evicted from the buffer pool
 Every other NUMA node that needs that page will have to do foreign memory access
 An additional (SQL 2008) feature is the SuperLatch
  Useful when a page is read a lot but written rarely
  Only kicks in on 32 cores or more
  The "this page is latched" information is copied to all NUMA nodes
  Acquiring a PAGELATCH_SH only requires local NUMA access
  But: acquiring a PAGELATCH_EX must signal all NUMA nodes
 Perfmon object: MSSQL:Latches
 Number of SuperLatches
 SuperLatch demotions / sec
 SuperLatch promotions / sec
 See CSS blog post
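Those counters can also be read from T-SQL; a quick sketch (exact counter names may vary slightly by version):

SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Latches%'
  AND counter_name LIKE '%SuperLatch%';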
90
Effect of UPDATE on NUMA traffic
[Diagram: UPDATE ATM SET LastTransaction_ID statements arrive from the app servers (groups of 4 RS servers) and hit ATM_IDs 0-3, whose pages are owned by NUMA nodes 0-3 – so updates frequently touch a foreign NUMA node.]
91
Using NUMA affinity
[Diagram: each group of 4 RS servers connects to its own port (8000-8003), and each port is affinitized to one NUMA node (0-3), so UPDATE ATM SET LastTransaction_ID statements for a given ATM_ID always run on the node that owns its page.]
How to: Map TCP/IP Ports to NUMA Nodes
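In SQL Server Configuration Manager the mapping is expressed by appending a NUMA node bitmask to the TCP port; a sketch matching the four-node layout above (port numbers taken from the diagram, bitmask values illustrative):

-- TCP/IP properties, "TCP Port" value:
-- 8000[0x1],8001[0x2],8002[0x4],8003[0x8]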
92
Final Results and Thoughts
 120,000 batch requests / sec
 100,000 SQL transactions / sec
 50,000 SQL write transactions / sec
 12,500 business transactions / sec
 CPU load: 34 CPU cores busy
 Given more time, we would get the CPUs to 100%, tune the NICs more, and work on balancing NUMA more
 As for NICs, we only had two, and they were loading two CPUs at 100%
93
http://sqlcat.com/whitepapers/archive/2010/06/07/proven-sql-server-architectures-for-high-availability-and-disaster-recovery.aspx
http://sqlcat.com/whitepapers/archive/2010/11/03/failure-is-not-an-option-zero-data-loss-and-high-availability.aspx
This is not a WinMo7 – it's a SuperDome :)
[Diagram: the customer's database estate – User Account & Sportsbook (8+ databases), Betcache (4+), Casino (2+), VS Games (2+), 1x2 Games (12+), CMS (15+), Newsletter (2+), Payment (20+), ASP.NET Sessions (8+), SMS (4+), Bookmaking (2+), CSM (2+), DWH Stage (50+), DWH (60+), OLAP (10+), Monitoring (10+), Administration (20+), internal Office/SharePoint (300+), plus several "Other" groups (20-40+ databases each) – supporting business functions such as operations, marketing, administration, finance, bet offer & odds, product management, payment, bookkeeping, translation, promotions, call center and controlling.]
Challenge: Network
• CPU bottlenecks for network processing were observed and resolved via network tuning (RSS)
• Dedicated networks for backup, replication etc.
• 8 network cards for clients

Challenge: Concurrency
• Latch contention on heavily used tables, "last page insert"
  • The hash partition option caused other problems in query performance and application design
  • Resolution: co-operative scale-out

Challenge: Transaction Log
• Latency on log writes
  • Resolution: increased throughput / decreased latency by placing the transaction log on SSDs
  • Database mirroring overhead – very significant on synchronous

Challenge: Monitoring
• Security monitoring (PCI and intrusion detection): between 10%-25% impact/overhead when monitoring

Challenge: Architecture/Hardware
• Tests using x64 (8-socket, 8-core) vs. Itanium Superdome (32-socket, dual-core)
  • Same transaction throughput
  • IO and backups were a bit slower
• Latency on IO-intensive data files (including tempdb):
  • Resolution: session state database on SSDs
  • Resolution: betting slips/customer databases testing sharding
    • Single server, single database: 500 tx/sec
    • Single server, 4 databases: 1,800 tx/sec (sharding)
    • Multiple servers: 2,600 tx/sec (sharding)
~150 apps; Quote lines and Trade lines feed a 10-node SQL Server cluster.
[Diagram: Quote and Trade front-end servers (one per line) feed Quote1/2/4/6 and Trade1/2/4/6 back-end databases, partitioned by issue ranges (A-B, C-D, J-N, S-Z).]
Each Frontend server receives Quote or Trade messages from a single line, handles the line interface, and calls stored procedures to perform the message processing.
Stored procedures apply business logic and produce vendor feed messages.
Each Frontend scans a Return table for new messages to send back through the system.
Each Backend scans a MsgOut table for new messages to disseminate on a vendor line.
Key components: Frontend server (one for each line); SQL Server database cluster; Backend server (one for each vendor feed); system Return message table (one table per line, one row per message); vendor feed message table (one database per vendor line, one row per message).
Challenge: Network
• Network round-trip time for synchronous calls from the client induced latency
• Resolution: batch data into a single large parameter (varchar(8000)) to avoid network roundtrips

Challenge: Concurrency
• Page latch contention – small table
  • Small table with latching on 36 rows on a single page
  • Resolution: pad the rows to spread the latching over multiple pages; performance gain: 20%
• Page latch contention – large table
  • Concurrent INSERTs into an incremental (identity) column, "last page insert"
  • Resolution: clustered index on (partition_id, identity) columns; performance gain: 30%
• Heavy, long-running threads contending for time on the scheduler
  • Resolution: map TCP/IP ports to NUMA nodes (http://msdn.microsoft.com/en-us/library/ms345346.aspx); performance gain: 20%

Challenge: Transaction Log
• Log waits
  • Resolution: batching business transactions within a single COMMIT to avoid WRITELOG waits
• Test of SSDs for the log helped with latency

Challenge: Database and table design/Schema
• Change decimal datatypes to money, others to int
  • Integer-based datatypes go through an optimized code path; performance gain: 10%
• No RI, as this has an overhead on performance; enforced in the application instead

Challenge: Monitoring
• 5% overhead from running the default trace alone
• Collect perfmon and targeted DMV/XEvents output to a repository

Challenge: Architecture/Hardware
• x32 vs. x64 performance
  • x32 12% faster
  • Application is not memory constrained*
*Interesting for Futures discussion later in presentation