Behind the Curtain: How we Run Exchange Online

Transcript

What Marketing likes to show off…
…and how we actually get there.

BUT… the marketing view is vague.



Microsoft has datacenter capacity around the world…and we’re growing
1 million+ servers • 100+ Datacenters in over 40 countries • 1,500 network agreements and 50 Internet connections
*Operated by 21Vianet
The conversation for every service provider and customer has always been about three things: COST, RISK, and USER EXPERIENCE.
Low Mean Time To Resolution (MTTR) is only possible via:
• A data-driven operational engine: monitoring and deployment
• Signals: system signals, external signals, big data, machine learning
• Change safety: automated repair orchestration
• Secure access: approval, auditing, compliance, security
From complex… to simpler.
[Network diagram: the complex design chains AR (L3) routers, firewalls, load balancers, and L2 aggregation/host layers over a mix of 1G/10G routed and Dot1Q links; the simpler design collapses this to a routed core with 10G routed links to hosts serving multiple properties.]







[Charts: daily request volume, 1/1/2014 through 3/23/2014, y-axis 0–9 billions; and Peak Requests/Day (Billion), y-axis 0–60, spanning January through the following February.]
Our surface area is too big and too partitioned to manage sanely by hand.
Service management is largely done via our DATACENTER AUTOMATION…
[Diagram: Office365.com spans regional partitions (North America 1…n, Europe 1), each running the same datacenter automation.]
…which is made up of a lot of stuff:
• Orchestration: Central Admin (CA), the change/task engine for the service
• Deployment/Patching: build and system orchestration (CA) plus specialized system and server setup
• Monitoring: eXternal Active Monitoring (XAM): outside-in probes; Local Active Monitoring (LAM/MA): server probes and recovery; Data Insights (DI): system health assessment/analysis
• Diagnostics/Perf: Extensible Diagnostics Service (EDS): perf counters; Watson (per server)
• Data (big, streaming): Cosmos, data pumpers/schedulers, Data Insights streaming analysis
• On-call interfaces: Office Service Portal, Remote PowerShell admin access
• Notification/Alerting: Smart Alerts (phone, email alerts), on-call scheduling, automated alerts
• Provisioning/Directory: Service Account Forest Model (SAFM) via AD; tenant/user additions/updates via the Provisioning Pipeline
• Networking: routers, load balancers, NATs
• New capacity pipeline: fully automated server/device/capacity deployment
Evolution of Monitoring == Data Analysis
• Multi-signal analysis
• Confidence in the data
• Data-driven automation: AUTO actions to communicate, snooze, recover, or block (sketched below)
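To make the AUTO idea concrete, here is a minimal Python sketch of how a multi-signal confidence score might map to those four actions. All names and thresholds are hypothetical illustrations, not the actual Data Insights engine:

```python
from enum import Enum

class AutoAction(Enum):
    COMMUNICATE = "communicate"  # notify on-call / surface on a dashboard
    SNOOZE = "snooze"            # suppress a known-benign or duplicate alert
    RECOVER = "recover"          # kick off an automated recovery workflow
    BLOCK = "block"              # halt a change/rollout correlated with the issue

def choose_action(confidence: float, deployment_in_flight: bool,
                  known_benign: bool) -> AutoAction:
    """Map multi-signal analysis results to an AUTO action.

    `confidence` (0..1) is the engine's belief that the anomaly is real,
    built by combining system signals and external signals.
    Hypothetical logic, for illustration only.
    """
    if known_benign:
        return AutoAction.SNOOZE
    if confidence < 0.5:
        return AutoAction.COMMUNICATE   # low confidence: just tell a human
    if deployment_in_flight:
        return AutoAction.BLOCK         # stop the change before it spreads
    return AutoAction.RECOVER           # high confidence: attempt self-healing
```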
The Data Insights Engine operates at scale: 100k+ servers, millions of organizations, hundreds of millions of users, dozens of components, and dozens of datacenters across many regions, processing 15 TB/day and 500M+ events per hour.
Each scenario tests each database worldwide roughly every 5 minutes, ensuring near-continuous verification of availability, and runs from two or more locations for accuracy and redundancy. That adds up to about 250 million test transactions per day verifying the service. These synthetics create a robust “baseline”, or heartbeat, for the service.
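As a rough sketch (not the actual XAM code), a probe scheduler along these lines would fan each scenario out to every database from multiple locations on a ~5-minute cadence; run_scenario here is a stub standing in for the real synthetic transaction:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_scenario(scenario: str, database: str, location: str) -> bool:
    """One synthetic transaction (e.g. logon, send, sync) against one
    database, executed from one probe location. Stubbed for illustration."""
    return True  # pretend the transaction succeeded

def probe_cycle(scenarios, databases, locations, interval_s=300, cycles=1):
    """Run every (scenario, database, location) combination, repeating
    roughly every 5 minutes (interval_s=300)."""
    failures = []
    with ThreadPoolExecutor(max_workers=64) as pool:
        for _ in range(cycles):
            start = time.monotonic()
            futures = {(s, db, loc): pool.submit(run_scenario, s, db, loc)
                       for s in scenarios
                       for db in databases
                       for loc in locations}
            # A failure corroborated from 2+ locations is a real signal; a
            # single failing location may just be probe-side network trouble.
            failures = [key for key, fut in futures.items() if not fut.result()]
            time.sleep(max(0.0, interval_s - (time.monotonic() - start)))
    return failures
```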
[Diagram: Office365.com fans out across NETWORK and PARTITION layers.]
The system handles 50K transactions/sec and over 1 billion “user” records per day.
[Heatmap: red alerts per day of the month (1–31) for each production forest: apcprd01–apcprd04, apcprd06, eurprd01–eurprd07, jpnprd01, lamprd80, and namprd01–namprd09.]
Methodology for the computed data: a deviation from normal means something might be wrong. Two thresholds are combined: the 99.5% and 0.5% historical percentiles, and a moving average +/- 2 standard deviations. In the example, the alert was raised at 4:46 PM, exactly where the data crosses those bounds.
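A minimal sketch of that thresholding, assuming a NumPy series of per-interval counts and a hypothetical window size; the production pipeline surely differs in detail:

```python
import numpy as np

def detect_anomalies(series: np.ndarray, window: int = 48) -> list[int]:
    """Flag points that fall outside BOTH a moving average +/- 2 standard
    deviations AND the 0.5%/99.5% historical percentile thresholds."""
    lo, hi = np.percentile(series, [0.5, 99.5])    # historical thresholds
    anomalies = []
    for i in range(window, len(series)):
        hist = series[i - window:i]                # trailing window
        mean, std = hist.mean(), hist.std()
        x = series[i]
        outside_band = abs(x - mean) > 2 * std     # moving avg +/- 2 sigma
        outside_hist = x < lo or x > hi            # beyond historical range
        if outside_band and outside_hist:
            anomalies.append(i)                    # deviation from normal
    return anomalies
```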
CAPACITY
[Chart: live capacity, new capacity, and total over time, with an issue annotated.]




Automated repair covers (a sketch of the loop follows the list):
• Consistency checks (e.g. membership in the right server group)
• HW repair (automated detection, ticket opening and closing)
• NW repair (e.g. firewall ACLs)
• “Base config” repair, such as hyper-threading on/off
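A sketch of the shape of that loop, with hypothetical detect_fault / open_ticket / verify_fixed callbacks standing in for the real probes, ticketing system, and post-repair checks:

```python
from dataclasses import dataclass

@dataclass
class RepairTicket:
    server: str
    component: str       # e.g. "HardDisk", "HyperThreading", "Fan"
    closed: bool = False

def run_repair_pass(servers, detect_fault, open_ticket, verify_fixed):
    """One pass of an automated repair engine: detect faults, open tickets,
    then close any ticket whose repair can be verified. No human is needed
    on the happy path; only unverifiable repairs escalate."""
    tickets = []
    for server in servers:
        fault = detect_fault(server)        # SMART data, config drift, ACLs...
        if fault:
            tickets.append(open_ticket(server, fault))  # returns RepairTicket
    for ticket in tickets:
        if verify_fixed(ticket.server, ticket.component):
            ticket.closed = True            # auto-close on verified repair
    return tickets
```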
Automated HW repair tickets by component (two sub-counts and the row total, where recoverable from the slide):

HyperThreading      398 + 44 = 442
HardDisk            195 + 30 = 225
PPM                 105        105
WinrmConnectivity    96 +  1 =  97
Memory               53 + 10 =  63
HardDiskPredictive   39 + 14 =  53
Motherboard          41 +  2 =  43
NotHW                34 +  4 =  38
DiskController       28 +  9 =  37
PowerSupply          16 +  6 =  22
CPU                   9 + 13 =  22
OobmBootOrder        19 +  2 =  21
Other                18 +  3 =  21
ILO IP               12 +  4 =  16
ILO Reset            14 +  2 =  16
Fan                  10 +  3 =  13
NIC                   9 +  2 =  11
InventoryData         4 +  2 =   6
Cache Module          4 +  1 =   5
High Temp             2 +  1 =   3
NIC Firmware          5
OobmPowerState        5
Spare                 5
PSU                   2          2
Cable                 1          1
ILO Password          1
Total              1120 + 158 = 1278

Tickets Opened: 1278 | Tickets Closed: 1431 | Tickets Currently Active: 196
% Automated Found: 77% | Average time to complete: 9.43 hrs | 95th percentile: 28.73 hrs
1) Run a simple patching cmd to initiate patching: Request-CAReplaceBinary
2) CA creates a patching approval request email with all relevant details
3) CA applies patching in stages (scopes) and notifies the requestor
4) The approver reviews the scopes and determines whether the patch should be approved
5) Once approved, CA starts the staged rollout (the first scope contains only 2 servers)
6) CA moves to the next stage ONLY if the previous scope was successfully patched AND the health index is high (see the sketch after this list)
7) Supports a “Persistent Patching” mode
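A minimal sketch of the stage-gating rule in step 6, with hypothetical apply_patch / patched_ok / health_index helpers; the real Central Admin engine is far richer (approvals, notifications, persistent patching):

```python
def staged_rollout(scopes, apply_patch, patched_ok, health_index,
                   min_health: float = 0.99):
    """Patch one scope at a time; advance ONLY if every server in the
    previous scope patched cleanly AND overall health stays high.
    `scopes` is ordered, and the first scope is deliberately tiny
    (e.g. 2 servers) to limit the blast radius of a bad patch."""
    for scope in scopes:
        for server in scope:
            apply_patch(server)
        if not all(patched_ok(server) for server in scope):
            raise RuntimeError(f"patch failed in scope {scope}; rollout halted")
        if health_index() < min_health:
            raise RuntimeError("health index dropped; rollout halted")
```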
What we are today is a mix of experimentation, learning from
others and industry trends (and making a lot of mistakes!)
[Diagram: four operating models, each drawn as a stack of Product Team, operations tiers, and the Service.]

Traditional IT
• Tiered IT
• Progressive escalations (tier-to-tier)
• “80/15/5” goal

Service IT
• Highly skilled, domain-specific IT (not a true Tier 1)
• Success depends on static, predictable systems

Software Aided Processes
• Direct escalations
• Operations applied to specific problem spaces (e.g. deployment)
• Emphasize software and automation over human processes

DevOps (Direct Support)
• Tier 1 used for routing/escalation only
• 10-12 engineering teams provide direct support of the service 24x7
Survey: “Provide any additional feedback on your on-call experience.”
• “When I am on-call, I hate my job and consider switching. It’s only that I work with great people when I’m not on-call.”
• “Build Freshness report has had bugs more than once. Need better reliability before deploying changes.”
• “Too many alerts.”
• “It was much easier this time around than last.”







[email protected]
@pucknorris
[email protected]
http://myignite.microsoft.com