SQLintersection Session SQL213 Session Name

Download Report

Transcript SQLintersection Session SQL213 Session Name

Windows Azure SQL Database (WASD)
Troubleshooting
I will assume basic
SQL Server knowledge
Bob Ward
Principal Architect Escalation Engineer
[email protected]
My Goals for You Today
Prepare
Prevent
React
2
What Will We Cover Today
The Azure Troubleshooting Challenge
Troubleshooting Connectivity
WASD Errors
Query Performance
Practical Advice and Tips
3
The Azure Troubleshooting Challenge
 WASD is a platform service (PAAS)

This is not a VM running SQL Server “box” (IAAS)
 Multi-tenant platform

You are sharing a SQL instance with other databases from other customers
 You are abstracted from the SQL Server instance, Windows, and
computer server

Less admin tasks means lower TCO but also means less access
 You are isolated to a specific database


You have a logical server and a master but most things are done in your
database
Most things are database scoped (Ex. DMVs)
 We make decisions to maximize all database availability

Application design may be required
 The service can be updated far quicker than the “box” product
4
WASD Connectivity Errors
WASD specific
errors
Firewall blocked in Azure
Windows authentication not
supported
Invalid login – Invalid
account or password
Denial of Service – After a
large number of login
failures
Use min
30sec login
timeout
Network
related errors
You could lose
connectivity
“…Server not found”
Idle connections terminated
after 30 minutes (Msg
10053 and 10054)
Connection Timeout
Expired
We may forcibly disconnect
on failover/some errors or
change to MAXSIZE
Msg 121 “.. The semaphore
timeout period has expired”
Retries you need to take
into account
5
Example Connectivity Errors
Network latency
Be sure to give
this to support
40XXX errors
unique to WASD
May see this after
deleting a server
After getting dropped on
idle connection
6
Troubleshooting Connectivity
Configuration issues
• WASD Firewall and your firewall
• Allow Windows Azure Service
Is it our service or your internet?
• Windows Azure Management Portal
• Windows Azure Service Dashboard
• Windows Azure SQL Database Connectivity Troubleshooting Guide
General Tools to use
•
•
•
•
ping.exe, telnet.exe, tracert.exe
SQL Server 2012 Management Studio – Free with SQL Server 2012 Express
ostress.exe and sqlcmd.exe (username requires @<full server name>)
SQL Database Management Portal – https://<servername>.database.windows.net
New System Views (Event Tables) – in master database
• sys.event_log
• sys.database_connection_stats
History tables – not real time
7
Demo
Tools for Connectivity
WASD Errors
full list here
Failover
Governance and Quota
These can result in
connection termination
and possible future rejection of work
Throttling Limits
Engine Throttling
Many “box” errors
still apply – Ex.
1205 = deadlock
“Not supported”
Database copy
Msg 40XXX range
can be seen in
sys.messages in
SQL Server 2012
Federation
9
Failover
We may decide to “move you” to a replica of your database to another server
• Your database, the instance, or the computer is “unhealthy”
• We may need to patch the instance and/or computer
What will you see?
• Msg 40197
• “..Server not available”
SHUTDOWN is in
progress.
Implement retry logic in your application
The partition is in transition and transactions are
being terminated.
10
Governance
 Max number of concurrent worker threads (currently 180) per database
 Msg 10928 if you exceed the limit


Connection terminated. Retry when your concurrent work subsides
Check for blocking problems or inefficient queries
 Msg 10929 if the overall system has too many workers



You may get less than 180 max
Connection terminated. You can retry but it may take longer to stabilize
Still could be an application issue but a service issue could also be occurring
Resource ID : 1 = worker threads
11
Quotas
 Quota errors for space used




Msg 40544 when you run out of space for your max size for your db
Only reads and DELETE/DROP allowed until you free up space
Use sys.dm_db_partition_stats to find what is consuming space
Solutions




Increase max size
Delete data or drop tables/indexes
Partition out database
But…freeing up may not be immediately recognized
12
Changing MAXSIZE
disconnects all users
Throttling Limits
We also call these “Watchdogs alerts”
• We have a service called a “Watchdog Service” querying the instance for “conditions” to terminate
connections to prevent resource problems.
We monitor all databases and look for conditions to prevent problems
• We will kill the session with a “reason”. The “reason” is the error message you get
• Application gets an error message (high severity) and connection terminated (KILL/ROLLBACK status)
• Sometimes retry works but these usually require some change on your part
• throttling_long_transaction in sys.event_log
Error
Condition
40549
Session blocking system task for long period of time (20 secs)
40550
Session is consuming too many locks (1 million)
40551
Session is consuming too much tempdb space (5Gb)
40552
Transaction consuming too much log space or active transaction preventing log truncation
40553
Session consuming memory (16Mb) and there are memory waits (20secs)
13
Rebuild index
Online
Engine Throttling



This is more of a legacy monitoring method used to keep instances healthy
Another external service monitors the health of the instance and computer

Soft throttling – we have detected a resource issue so pick specific databases

Hard throttling – entire instance at risk so all databases are affected
How it Works

Existing requests run to completion

New requests for existing connections and new connections may get Msg 40501 and
connection terminated depending on type of request

Reason code in Error has more details on soft vs hard, what will be rejected, and why

throttling in sys.event_log
0x8003
Decode reason codes
Another resource
14
x03 = RejectAll
x80 = Hard Throttling on I/O
“Not Supported” Errors

USE <db> not supported – specify
when connecting






ALTER DATABASE supported minimally (Ex. Name, Edition, MAXSIZE, READ_ONLY)
All DBCC commands not supported except for DBCC SHOW_STATISTICS
Database scoped DMVs supported
Feature Support for Windows Azure SQL Database
Unsupported Transact-SQL Statements (Windows Azure SQL Database)
Partially Supported Transact-SQL Statements (Windows Azure SQL Database)
15
Demo
Using Event Tables to Troubleshoot WASD Errors
WASD and Query Performance
Stick to the basics…..
•
•
•
•
•
•
Running or waiting? Blocking or CPU?
Is it your application, Windows Azure role, your computer, or queries?
Is it network latency?
Differences from when “good”? Did the query plan change?
Proper indexes – Avoid scans, large sorts, ….
Auto create and Auto update stats on by default
There are methods to optimize performance specific to Azure
• Windows Azure SQL Database and SQL Server -- Performance and Scalability
Compared and Contrasted
• Inevitably you may have to shard your data
• “Chatty” applications don’t usually perform well
• Avoid large result sets
• Application problems may show up earlier on this platform (Ex. Transaction
keeping the log from being truncated)
17
WASD Performance Scenarios
Interesting
Performance Scenarios
Troubleshooting Query
Timeouts
On-premise clients may see
higher ASYNC_NETWORK_IO
waits
Could just be blocking
Small transactions may result in
WRITELOG and SE_REPL* waits
Trace your queries so you know
which one timed out
Deadlocks (Msg 1205) just like
the “box” – Use sys.event_log
to debug
Examine query plan and tune
the query/indexes
18
Dynamic Management Views (DMV) for Performance
sys.dm_exec_requests
• Find out currently running requests in your database. Use
this to detect blocking
sys.dm_exec_query_stats
• Find out the performance of queries that have run in your
database. Look here for worst performing queries
sys.dm_exec_query_plan
• Display the query plan of a specific query
sys.dm_db_wait_stats
• Aggregation history of waits – Some new for WASD
• Only shows any wait_type with count > 0
“missing index DMVs”
• Could indexes help query performance?
19
A look at WASD Wait Types
20
Demo
Troubleshooting Query Performance on WASD
Watch Out for These
Keep database copies for “user error”
Be careful dropping servers and databases in portal
DML may fail if no clustered index (temp tables excluded)
DMVs are database scoped
Databases have RCSI on by default – tables can be larger
DATETIME in all data centers is stored as UTC time
You may not have access to objects that appear in catalog views
Non-supported or partial supported commands/features
System Views Unique to WASD
22
Before you contact support
Check the Azure forums: MSDN or stackoverflow
Check the service dashboard
Is it Windows Azure? On-premise problem?
Have exact error message(s) available
Have TracingID available
Do you know the query?
Do you have application retry logic?
Give us the date and time of issue with “observed” timezone
Is this happening now or in the past?
We can do RCA but….
It can take some time and we may
not have enough history
23
References







Retry Logic for Transient Failures in Windows Azure SQL Database
Error Messages (Windows Azure SQL Database)
Windows Azure SQL Database Performance and Elasticity Guide
Windows Azure SQL Database Connection Management
sys.event_log documentation
CSS SQL Escalation Blog
Troubleshoot and Optimize Queries with Windows Azure SQL Database
24
Questions?
http://sdrv.ms/Zqdkex
Thank you!
The Troubleshooting Checklist












Does the Windows Azure Portal work and list your databases?
Is there a dashboard posting for an outage in your region?
Does the SQL Management Portal work?
Does SQL Server Management Studio work?
Is there an internet provider issue?
Is your firewall configuration correct?
Is the problem Windows Azure vs WASD?
Is there blocking?
Are your queries and index tuned?
Is this really an application retry issue?
Governance, quotas, limits, and throttling are “part of this platform”
Have you looked at Event Tables?
26