Avoiding Exchange Performance Issues


http://aka.ms/E2013Calc
\Web Service(Default Web Site)\Current Connections
\MSExchange Active Manager(_total)\Database Mounted
http://aka.ms/ExOnlineLimits
Large Organization
Configuration
36 Cores / 450 GB RAM per server (higher mailbox density)
Deployed Exchange 2013 in All-In-One configuration
Hardware NLB configured for ‘Least Connections’
What Happened?
Policy change required removal of local storage of email
Outlook now required to run in “Online Mode”
Impact
Increase in network traffic
Users frequently disconnected during peak periods
~2 weeks to isolate problem
~2 weeks to get remediation changes in place
[Diagram: 40k users connect through a network load balancer (virtual IP, Exchange.cohovineyard.com) to Exchange 2013 All-in-One servers; numbered connections are spread across the servers]
[Diagram, continued: same topology (40k users, network load balancer, virtual IP, Exchange.cohovineyard.com, Exchange 2013 All-in-One servers) with additional numbered connections and one element flagged "!"]
[Diagram: 40k users connect through a hardware NLB (virtual IP, Exchange.cohovineyard.com) to Exchange 2013 All-in-One servers; numbered connections are spread across the servers]
[Diagram: RPC over HTTP connection path]
Client connects to /RPC on port 443: https.sys, then MSExchangeRpcProxyFrontEndAppPool (W3WP) running IIS, RpcHttp, and HttpProxy (lookup of active mailbox location)
Port 444: https.sys, then MSExchangeRpcProxyAppPool (W3WP) running IIS and RpcHttp
Port 6001: M.E.RpcClientAccess (RPC Client Access), then M.E.Store.Worker (Store Worker), then MBxDB
[Diagram: https.sys detail with Connection Manager, Request Router, and W3WP queue (max 65535 requests) for /RPC:443; Managed Availability shown against /RPC:444]
[Diagram: IIS, /RPC:444, W3WP queue; MSExchangeRpcProxyFrontEndAppPool (W3WP) with six threads and a pool of System.Web buffers]
inetpub\logs\LogFiles\W3SVC1\u_exXXXXXX.log
date: 2014-07-21
time: 07:59:44
s-ip: 192.168.1.1
cs-method: RPC_IN_DATA
cs-uri-stem: /rpc/rpcproxy.dll
cs-uri-query: 8416409b-081e-4fe8-9200-7e54d8874d7c@cohovineyard.com:6001&RequestId=fc60c175-9c77-47d0-b435-ae3d04acea1b
s-port: 443
cs-username: COHOVINEYARD\SM_4f3083c2bd6a40d8b
c-ip: 192.168.1.5
cs(User-Agent): MSRPC
cs(Referer): -
sc-status: 200
sc-substatus: 0
sc-win32-status: 64
time-taken: 29513
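A minimal sketch of splitting one IIS log line into named fields, assuming the field order listed above and space-delimited values. IIS writes the actual order in the log's `#Fields:` header, so a real parser should read that instead. The function name `parse_iis_line` and the shortened query and username in the sample are illustrative placeholders.

```python
# Field order as listed on the slide; a real log declares its own order
# in a "#Fields:" header line, which should take precedence.
IIS_FIELDS = [
    "date", "time", "s-ip", "cs-method", "cs-uri-stem", "cs-uri-query",
    "s-port", "cs-username", "c-ip", "cs(User-Agent)", "cs(Referer)",
    "sc-status", "sc-substatus", "sc-win32-status", "time-taken",
]


def parse_iis_line(line, fields=IIS_FIELDS):
    """Return a dict of field name -> value, or None for comments and bad lines."""
    if not line.strip() or line.startswith("#"):
        return None
    values = line.split()
    if len(values) != len(fields):
        return None  # layout does not match the assumed field list
    return dict(zip(fields, values))


# Sample line with placeholder query string and username (hypothetical).
sample = ("2014-07-21 07:59:44 192.168.1.1 RPC_IN_DATA /rpc/rpcproxy.dll "
          "mailbox-guid@cohovineyard.com:6001 443 COHOVINEYARD\\user "
          "192.168.1.5 MSRPC - 200 0 64 29513")
entry = parse_iis_line(sample)
print(entry["sc-status"], entry["sc-win32-status"], entry["time-taken"])
```

In IIS logs, time-taken is reported in milliseconds, and sc-win32-status 64 is the Win32 code for a network name that is no longer available, which typically points to a connection dropped underneath the request.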
inetpub\logs\LogFiles\W3SVC1\httperrXXXXX.log
Fields: date, time, c-ip, c-port, s-ip, s-port, cs-version, cs-method, cs-uri, sc-status, s-siteid, s-reason, s-queuename
Entry 1
date: 2014-07-21
time: 07:59:44
c-ip: 192.168.1.5
c-port: 16045
s-ip: 192.168.1.1
s-port: 444
cs-version: HTTP/1.1
cs-method: RPC_IN_DATA
cs-uri: /rpc/rpcproxy.dll?COHOEXCH.cohovineyard.com:6001
sc-status: 400
s-siteid: 2
s-reason: Connection_Dropped
s-queuename: MSExchangeRpcProxyAppPool
Entry 2
date: 2014-07-21
time: 07:59:44
c-ip: 192.168.1.5
c-port: 16045
s-ip: 192.168.1.1
s-port: 443
cs-version: HTTP/1.1
cs-method: RPC_IN_DATA
cs-uri: /rpc/rpcproxy.dll?8416409b-081e-4fe8-9200-7e54d8874d7c@COHOEXCH.cohovineyard.com:6001
sc-status: -
s-siteid: 1
s-reason: Connection_Dropped_List_Full
s-queuename: MSExchangeRpcProxyAppPool
IIS indicating it cannot hand off connection because queue is full
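A small sketch of tallying HTTPERR entries by reason and queue, which is one way to spot a burst of Connection_Dropped_List_Full entries. It assumes the field order shown above and space-delimited lines; the function name `tally_reasons` and the shortened URI in the second sample line are illustrative.

```python
from collections import Counter

# Field order as shown on the slide (assumed; check the log's own header).
HTTPERR_FIELDS = [
    "date", "time", "c-ip", "c-port", "s-ip", "s-port", "cs-version",
    "cs-method", "cs-uri", "sc-status", "s-siteid", "s-reason", "s-queuename",
]


def tally_reasons(lines):
    """Count HTTPERR entries per (s-reason, s-queuename) pair."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        values = line.split()
        if len(values) != len(HTTPERR_FIELDS):
            continue  # skip lines that do not match the assumed layout
        row = dict(zip(HTTPERR_FIELDS, values))
        counts[(row["s-reason"], row["s-queuename"])] += 1
    return counts


sample_lines = [
    "2014-07-21 07:59:44 192.168.1.5 16045 192.168.1.1 444 HTTP/1.1 "
    "RPC_IN_DATA /rpc/rpcproxy.dll?COHOEXCH.cohovineyard.com:6001 400 2 "
    "Connection_Dropped MSExchangeRpcProxyAppPool",
    "2014-07-21 07:59:44 192.168.1.5 16045 192.168.1.1 443 HTTP/1.1 "
    "RPC_IN_DATA /rpc/rpcproxy.dll?guid@COHOEXCH.cohovineyard.com:6001 - 1 "
    "Connection_Dropped_List_Full MSExchangeRpcProxyAppPool",
]
for (reason, queue), n in tally_reasons(sample_lines).items():
    print(reason, queue, n)
```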
Log locations, file names, and Perfmon counters by component

IIS
Location: inetpub\logs\LogFiles\W3SVC1
File names: u_exXXXXXX.log, httperrXXXXX.log
Perfmon counter: \Web Service(Default Web Site)\Current Connections

RpcHttp
Location: Logging\RpcHttp\W3SVC1
File names: RpcHttpXXXXXXXXX.log
Perfmon counter: \RPC/HTTP Proxy\Current Number of Incoming RPC over HTTP Connections

HttpProxy
Location: Logging\HttpProxy\RpcHttp
File names: HttpProxyXXXXXXXXXX-X.log
Perfmon counter: \MSExchange HttpProxy\Accepted Connection Count

IIS
Location: Inetpub\logs\LogFiles\W3SVC2
File names: u_exXXXXXX.log, httperrXXXXX.log
Perfmon counter: \Web Service(Exchange Back End)\Current Connections

RpcHttp
Location: Logging\RpcHttp\W3SVC2
File names: RpcHttpXXXXXXXXX.log
Perfmon counter: \RPC/HTTP Proxy\Current Number of Incoming RPC over HTTP Connections

RPC Client Access
Location: Logging\RPC Client Access
File names: RCA_XXXXXXXXXXX.log
Perfmon counter: \MSExchange RPC ClientAccess\Current Connections
Network
CPU
Memory
Storage
Network (Requests)
\Web Service(Default Web Site)\Current Connections
\MSExchangeIS Store(*)\RPC Average Latency
< 100 ms
\MSExchangeIS Client Type(*)\RPC Average Latency
< 100 ms
\MSExchangeIS Store(*)\RPC Operation/Sec
\MSExchangeIS Client Type(*)\RPC Operation/Sec
CAS Experience
MoMT
\MSExchange RpcClientAccess\RPC Averaged Latency
\MSExchange RpcClientAccess\RPC Operations/sec
EAS
\MSExchange ActiveSync\Requests/sec
\MSExchange ActiveSync\Current Requests
EWS
\MSExchangeWS\Average Response Time
\MSExchangeWS\Requests/sec
OWA
\MSExchange OWA\Average Response Time
\MSExchange OWA\Average Search Time
\MSExchange OWA\Requests/sec
POP
\MSExchangePop3(*)\Average LDAP Latency
\MSExchangePop3(*)\Average RPC Latency
\MSExchangePop3(*)\Request Rate
IMAP
\MSExchangeImap4(*)\Average LDAP Latency
\MSExchangeImap4(*)\Average RPC Latency
\MSExchangeImap4(*)\Request Rate
Management / Background Ops
PS
\MSExchangeRemotePowershell\Current Connection Sessions
\MSExchangeRemotePowershell\Current Connected Unique Users
Overall RPC Average Latency is not impacted
Memory (Exchange Process Usage)
\Memory\% Committed Bytes in Use: < 80%
\Memory\Available MBytes: > 5% of RAM
\.NET CLR Memory(*)\% Time in GC: should be below 10% on average
\.NET CLR Exceptions(*)\# of Exceps Thrown / sec: should be less than 5% of total requests per second (RPS) (\Web Service(_Total)\Connection Attempts/sec * .05)
\.NET CLR Memory(*)\# Bytes in all Heaps
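The exceptions guideline above is simple arithmetic: take connection attempts per second and multiply by .05. A short sketch follows; the function names and the sample rates are hypothetical and are not measurements from this case.

```python
def exception_budget(connection_attempts_per_sec, ratio=0.05):
    """Maximum acceptable exceptions/sec: 5% of requests per second."""
    return connection_attempts_per_sec * ratio


def exceptions_within_budget(exceptions_per_sec, connection_attempts_per_sec):
    """True when the exception rate stays under the 5% guideline."""
    return exceptions_per_sec < exception_budget(connection_attempts_per_sec)


# Hypothetical sample values: 400 connection attempts/sec gives a budget
# of roughly 20 exceptions/sec.
for observed in (12, 35):
    print(observed, exceptions_within_budget(observed, 400))
```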
Memory (WorkstationGC to ServerGC)
\.NET CLR Memory\Allocated Bytes/sec: sustained > 50 MB
Only 30% bytes committed
Storage (Exchange I/O)
\MSExchange Active Manager(_total)\Database Mounted: balanced across all MBX servers
\MSExchange Database ==> Instances(*)\I/O Database Reads (Attached) Average Latency: < 20 ms
\MSExchange Database ==> Instances(*)\I/O Database Writes (Attached) Average Latency: < 50 ms
\MSExchange Database ==> Instances(*)\I/O Log Writes Average Latency: < 10 ms
\MSExchange Database ==> Instances(*)\I/O Database Reads (Recovery) Average Latency: < 200 ms
\MSExchange Database ==> Instances(*)\I/O Database Writes (Recovery) Average Latency: < read latency for the same instance as above
I/O is acceptable
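The storage guidelines above can be checked mechanically once the averages are collected. A minimal sketch, assuming the latencies have already been exported in milliseconds into a dictionary; the key names and sample values are hypothetical.

```python
# Guideline thresholds in milliseconds, taken from the list above.
LIMITS_MS = {
    "db_reads_attached": 20,
    "db_writes_attached": 50,
    "log_writes": 10,
    "db_reads_recovery": 200,
}


def check_io_latency(sample):
    """Return the keys whose average latency breaks its guideline."""
    failures = [key for key, limit in LIMITS_MS.items() if sample[key] >= limit]
    # Recovery writes should be faster than recovery reads on the same instance.
    if sample["db_writes_recovery"] >= sample["db_reads_recovery"]:
        failures.append("db_writes_recovery")
    return failures


# Hypothetical samples for one database instance.
healthy = {"db_reads_attached": 12, "db_writes_attached": 30, "log_writes": 4,
           "db_reads_recovery": 90, "db_writes_recovery": 40}
stalled = {"db_reads_attached": 15, "db_writes_attached": 35, "log_writes": 25,
           "db_reads_recovery": 120, "db_writes_recovery": 150}
print(check_io_latency(healthy))
print(check_io_latency(stalled))
```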
CPU (Exchange Processes)
\Processor(_Total)\% Processor Time: should be less than 75% on average
\Processor(_Total)\% Privileged Time (kernel): should be less than 75% on average
\Processor(_Total)\% User Time: should be less than 75% on average
\Process(*)\% Processor Time: <specific process>
\System\Processor Queue Length (all instances): shouldn't be greater than 5 per processor
W3WP#3 is the MSExchangeRpcProxyFrontEndAppPool
W3WP#3 high CPU
Most
Recent
Usage
Provides a periodic snapshot of executing code.
Used by developers to track “hot” code paths.
Requires source code to interpret.
Download: http://aka.ms/perfview
Start: http://channel9.msdn.com/Series/PerfView-Tutorial
ntdll!ZwWaitForMultipleObjects
KERNELBASE!WaitForMultipleObjectsEx
clr!WaitForMultipleObjectsEx_SO_TOLERANT
clr!Thread::DoAppropriateAptStateWait
clr!Thread::DoAppropriateWaitWorker
clr!Thread::DoAppropriateWait
clr!CLREventBase::WaitEx
clr!AwareLock::EnterEpilogHelper
clr!AwareLock::EnterEpilog
clr!AwareLock::Contention
clr!JITutil_MonContention
System_Web_ni!System.Web.BufferAllocator.GetBuffer()
System_Web_ni!System.Web.Hosting.RecyclableArrayHelper.GetIntPtrArray(Int32)
System_Web_ni!System.Web.Hosting.IIS7WorkerRequest.FlushCachedResponse(Boolean)
System_Web_ni!System.Web.HttpResponse.UpdateNativeResponse(Boolean)
System_Web_ni!System.Web.HttpResponse.Flush(Boolean, Boolean)
System_Web_ni!System.Web.HttpWriter.WriteFromStream(Byte[], Int32, Int32)
mscorlib_ni!System.IO.Stream.<BeginWriteInternal>b__11(System.Object)
mscorlib_ni!System.Threading.Tasks.Task`1[[System.Boolean, mscorlib]].InnerInvoke()
mscorlib_ni!System.Threading.Tasks.Task.Execute()
mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
mscorlib_ni!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)
mscorlib_ni!System.Threading.Tasks.Task.ExecuteEntry(Boolean)
mscorlib_ni!System.Threading.ThreadPoolWorkQueue.Dispatch()
clr!CallDescrWorkerInternal
clr!CallDescrWorkerWithHandler
clr!MethodDescCallSite::CallTargetWorker
clr!MethodDescCallSite::Call_RetBool
clr!QueueUserWorkItemManagedCallback
clr!ManagedThreadBase_DispatchInner
clr!ManagedThreadBase_DispatchMiddle
clr!ManagedThreadBase_DispatchOuter
clr!ManagedThreadBase_DispatchInCorrectAD
clr!Thread::DoADCallBack
clr!ManagedThreadBase_DispatchInner
clr!ManagedThreadBase_DispatchMiddle
clr!ManagedThreadBase_DispatchOuter
clr!ManagedThreadBase_FullTransitionWithAD
clr!ManagedThreadBase::ThreadPool
clr!ManagedPerAppDomainTPCount::DispatchWorkItem
clr!ThreadpoolMgr::ExecuteWorkRequest
clr!ThreadpoolMgr::WorkerThreadStart
clr!Thread::intermediateThreadProc
kernel32!BaseThreadInitThunk
ntdll!RtlUserThreadStart
Source From: http://referencesource.microsoft.com/#System.Web/BufferAllocator.cs
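The stack above shows a thread waiting on a contended lock inside BufferAllocator.GetBuffer(). As a rough analogy only, and not the System.Web source, the sketch below shows a buffer pool in which every get and return takes one shared lock, so all worker threads serialize at that point. The class name, method names, and sizes are hypothetical.

```python
import threading


class LockedBufferPool:
    """Hypothetical pool: every get/return goes through the same lock."""

    def __init__(self, buffer_size, max_free):
        self._lock = threading.Lock()
        self._free = []
        self._buffer_size = buffer_size
        self._max_free = max_free

    def get_buffer(self):
        with self._lock:  # all callers serialize here
            if self._free:
                return self._free.pop()
        return bytearray(self._buffer_size)  # pool empty: allocate a new one

    def reuse_buffer(self, buf):
        with self._lock:  # and here
            if len(self._free) < self._max_free:
                self._free.append(buf)

    def free_count(self):
        with self._lock:
            return len(self._free)


pool = LockedBufferPool(buffer_size=4096, max_free=4)


def worker(iterations):
    for _ in range(iterations):
        buf = pool.get_buffer()
        pool.reuse_buffer(buf)


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("free buffers kept:", pool.free_count())
```

With more cores, more threads reach the lock at the same moment, which fits the observation that the problem appeared on a high-core-count server.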
Investigation
~4 weeks
Preferred architecture not followed
Customer scaled beyond tested configuration
NLB algorithm not optimized for Exchange load profile
Failure cycle (diagram labels)
Network load balancer adds server to rotation
Large number of connections to server in short timeframe
RpcProxyFrontEndAppPool requests backlogged
Managed Availability probe fails
Managed Availability restarts service
Network load balancer takes server out of rotation
Resolution
Least Connection / Slow Start on hardware LB
Reduced cores to fewer than 20
Scalability improvements coming in .NET 4.6 (in preview)
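A toy simulation of why plain least-connections can flood a server that has just been added to rotation, and how a slow-start limit changes that. This is a sketch under stated assumptions, not any vendor's algorithm: the server names, connection counts, and the `slow_start_cap` parameter are all hypothetical.

```python
def pick_least_connections(active):
    """Plain least-connections: choose the server with the fewest connections."""
    return min(active, key=active.get)


def simulate(active, new_server, arrivals, slow_start_cap=None):
    """Return how many of `arrivals` new connections land on `new_server`.

    slow_start_cap, if given, is the maximum number of connections the newly
    added server may receive in this window (a stand-in for a slow-start ramp).
    """
    active = dict(active)
    active[new_server] = 0  # a server re-entering rotation starts empty
    landed = 0
    for _ in range(arrivals):
        candidates = active
        if slow_start_cap is not None and landed >= slow_start_cap:
            candidates = {s: c for s, c in active.items() if s != new_server}
        target = pick_least_connections(candidates)
        active[target] += 1
        if target == new_server:
            landed += 1
    return landed


# Hypothetical pool: four busy servers, a fifth rejoining the rotation.
pool_state = {"mbx1": 8000, "mbx2": 8000, "mbx3": 8000, "mbx4": 8000}
print(simulate(pool_state, "mbx5", arrivals=2000))
print(simulate(pool_state, "mbx5", arrivals=2000, slow_start_cap=400))
```

Without the cap, every one of the 2000 new connections goes to the empty server because it always has the fewest connections; with the cap, it receives 400 and the rest are spread across the existing servers.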
Large Organization
Configuration
16 Cores / 92 GB RAM per server
Deployed Exchange 2013 in All-In-One configuration
NLB configured for ‘Round Robin’
What Happened?
File writes failing, MA Probe failures, MDB Failovers
Encountered bug with Anti-Virus
Failed to deploy recommended fixes prior to migration
Exposed new bug
Impact
Users frequently disconnected during peak periods
~8 weeks to isolate problem
~3 weeks to get fix and configuration changes in place
[Diagram: IIS (RpcHttp, HttpProxy), IIS (RpcHttp), RPC Client Access, Store Worker]
Stalled I/O delays client responses (dump showed a 6-minute lock)
[Diagram: I/O path to MBxDB: I/O Manager, File System Driver, Anti-Virus Filter Driver ("Is valid file to scan?"), Device Driver, Mini-Port Driver]
Continued stalled I/O forces MA to move databases.
Responders
Goals
Bring Office365 Capabilities On-Premises
Monitor based upon end user experience
Focus on recovery oriented computing
Components
Probes test components and user experience
Monitors analyze probe(s) for Pass/Fail
Responders take action based upon monitor results
When troubleshooting
Responders: Restart, BugCheck, Reset AppPool, Offline, Failover MBX, Escalate
Services
Monitors
Probes: OutlookRpcCtpProbe, OutlookProxyTestProbe, OutlookRpcSelfTestProbe
Monitor failures are a signal of a problem
Consistent failures can force a bluescreen
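The probe, monitor, and responder relationship can be pictured with a small sketch. Everything here is hypothetical: the failure threshold, the function names, and especially the order of the responder chain, which is illustrative and not the sequence Managed Availability actually uses.

```python
def monitor_is_unhealthy(probe_results, failure_threshold):
    """A monitor: unhealthy when the last N probe results all failed.

    probe_results is a list of booleans, oldest first (True = probe passed).
    """
    recent = probe_results[-failure_threshold:]
    return len(recent) == failure_threshold and not any(recent)


def next_responder(consecutive_unhealthy, chain):
    """Pick an escalating action for the Nth consecutive unhealthy evaluation."""
    if consecutive_unhealthy <= 0:
        return None
    return chain[min(consecutive_unhealthy, len(chain)) - 1]


# Illustrative ordering only; responder names are taken from the slide.
CHAIN = ["Reset AppPool", "Restart", "Failover MBX", "BugCheck", "Escalate"]

history = [True, False, False, False]
if monitor_is_unhealthy(history, failure_threshold=3):
    print("first action:", next_responder(1, CHAIN))
```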
Performance Counters
Event Logs
Storage
Some database I/O latencies, but overall I/O is fairly healthy.
CPU
The server appears to be busy, but it is uncertain whether this is normal or a bug…
W3wp#11 CPU utilization running hot?
Private Bytes reached 10 GB+ before restarting
Memory
Massive growth in the memory footprint of the w3wp#11 process throughout the day.
W3WP Process ID = 62192
AppDomain
Used to enable isolation within a process
3 AppDomains by default
Normal W3WP for Exchange has 3-4 AppDomains
Created as a result of config change
Exchange
Leak in W3SVC/1 = MSExchangeRpcProxyFrontEndAppPool
Process Explorer
View AppDomains and other .NET stats for running processes.
Process Explorer
Outlook Anywhere
Servicelets are used by Exchange for minor tasks
RPCHTTPServicelet runs every 15 minutes
RPCHTTPServicelet was writing an update to the Default Web Site/Rpc site, from “SSL” to “None”, on every run.
What was causing this change to be written continually?
[Diagram labels: MSExchange Services Host, every 15 min, Set SSLOffloading = true; MSExchangeRPCAppPool with System AppDomain, Default AppDomain, Back-End AppDomain, and repeated Front-End AppDomains; each AppDomain (~125 MB at startup) holds config, binaries, heaps, and connections; RPC Client Access; Store Worker Instance; MBxDB]
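A back-of-the-envelope estimate of the memory growth, assuming each 15-minute servicelet run leaves behind one additional AppDomain of about 125 MB that is never unloaded. That assumption is inferred from the slide labels rather than stated outright, and the function name is hypothetical.

```python
import math


def hours_to_reach(limit_mb, per_run_mb=125, interval_min=15):
    """Hours until leaked AppDomains alone add up to limit_mb."""
    runs = math.ceil(limit_mb / per_run_mb)
    return runs * interval_min / 60


# 10 GB of private bytes at 125 MB per run, one run every 15 minutes.
print(hours_to_reach(10 * 1024))
```

Under these assumptions the process would pass 10 GB after 82 runs, or 20.5 hours, which is in line with growth "throughout the day".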
Investigation
~10 weeks of investigation
Many iterations of data collected and analyzed
Data Collection
Deployment Guidance Missteps
NLB configuration: set to Round Robin
Most recent CU Update + Hotfixes
Resolution
NLB Configuration changed to Slow Start
Most recent CU Update + Hotfixes installed
Interim configuration change until KB2925281 hotfix release
Final fix in Exchange 2013 Service Pack 1
Analysis
Exchange Server 2013 Performance Recommendations
Exchange 2013 Sizing and Configuration Recommendations
Exchange 2013 Performance Counters for troubleshooting
IIS Logs and Log Parser Studio Reports
Exchange Performance Data Collection tool
Exchange 2013 Performance Health Checker Script
Windows Performance ToolKit (WPT)
Performance Analysis of Logs (PAL) Tool
Windows SysInternals
BRK3131: Exchange Design Concepts and Best Practices
BRK3197: Exchange Server Preferred Architecture
BRK3178: Exchange on IaaS: Concerns, Tradeoffs, and Best Practices
BRK3173: Experts Unplugged: Exchange Server Deployment and Architecture
BRK3158: Experts Unplugged: Exchange Top Issues
BRK3129: Deploying Exchange Server 2016
BRK3102: Experts Unplugged: Exchange Server High Availability and Site Resilience
http://myignite.microsoft.com