AzureCAT: Design Cloud-Based Solutions for Operations

Scenario | Description/Example | Time Horizon | Data Size
Alerting | Detecting and Mitigating Problems | Now | Small to Large
Dashboards | Service Insight | Now-Recent | Modest
Reports | How is feature X adoption progressing day o? | Hourly/Daily | Medium
Data Science | Building prediction models based on past behaviors | Unlimited | Very Large
Complex cloud architecture example...
Cloud apps have key differences from traditional on-premises systems
• Internet-facing, always up
• Service SLAs – uptime requirements
• Larger scale – ISVs/SaaS vendors host all customers vs. sell/deploy each customer 1-by-1
Troubleshooting in the Cloud
• Too many machines/databases/etc. to troubleshoot manually
• Separate “mitigate” vs. “root cause” (RCA) determination
• Generate telemetry to determine RCA (later)
• Find a way to get things working ASAP (reboot/failover/whatever)
• Analyze: Up to a certain size, tools that analyze and monitor the system work
• System for the system: Beyond that size, you need a system to monitor the system
Event Tracing for Windows (ETW)
• Native to Windows platform
• Great performance & OK diagnostic tooling
• Historically hard to publish events
EventSource class
• New in .NET Framework 4.5
• Meant to ease authoring experience
• Extensible but supports ETW-only out of the box
Semantic Logging Application Block (SLAB)
• Provides several destinations for events published with EventSource
• Does not require any knowledge of ETW
• Additional tooling support for authoring events
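The idea behind semantic logging (one typed method per event, a fixed payload schema, and pluggable destinations) can be sketched outside .NET. The Python below is only an analogy under stated assumptions: it is not the EventSource or SLAB API, and every name in it (ServiceEvents, request_failed, json_lines_sink, memory_sink) is hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    """A structured event: stable id and name plus a typed payload."""
    event_id: int
    name: str
    level: str
    payload: dict


Sink = Callable[[Event], None]


class ServiceEvents:
    """Hypothetical event source: callers use typed methods, never format strings."""

    def __init__(self, sinks: List[Sink]):
        self._sinks = list(sinks)

    def _write(self, event: Event) -> None:
        # Fan the same structured event out to every destination.
        for sink in self._sinks:
            sink(event)

    def request_failed(self, url: str, status: int, duration_ms: float) -> None:
        self._write(Event(1, "RequestFailed", "Error",
                          {"url": url, "status": status, "duration_ms": duration_ms}))

    def cache_miss(self, key: str) -> None:
        self._write(Event(2, "CacheMiss", "Informational", {"key": key}))


def json_lines_sink(lines: List[str]) -> Sink:
    """Destination that serializes each event as one JSON line."""
    def sink(event: Event) -> None:
        lines.append(json.dumps(asdict(event), sort_keys=True))
    return sink


def memory_sink(events: List[Event]) -> Sink:
    """Destination that keeps the event objects in memory."""
    return events.append
```

Because the payload keeps its field names and types all the way to the destination, a consumer can filter on `status` or aggregate `duration_ms` without parsing message text.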
Data Source | Description
IIS Logs | Information about IIS web sites.
Azure Diagnostic infrastructure logs | Information about Diagnostics itself.
IIS Failed Request logs | Information about failed requests to an IIS site or application.
Windows Event logs | Information sent to the Windows event logging system.
Performance counters | Operating System and custom performance counters.
Crash dumps | Information about the state of the process in the event of an application crash.
Custom error logs | Logs created by your application or service.
.NET EventSource | Events generated by your code using the .NET EventSource class.
Manifest based ETW | ETW events generated by any process.
Health (master)
• sys.event_log
• sys.bandwidth_usage
• sys.database_connection_stats
Data Access & Usage
• sys.dm_db_index_usage_stats
• sys.dm_db_missing_index_details
• sys.dm_db_missing_index_groups
• sys.dm_db_missing_index_group_stats
• sys.dm_exec_sessions
Performance
• sys.dm_exec_query_stats
• sys.dm_exec_sql_text
• sys.dm_exec_query_plan
• sys.dm_exec_requests
• sys.dm_db_wait_stats
Resource Usage
• master.sys.resource_usage*
• master.sys.resource_stats*
• userdb.sys.dm_db_resource_stats
Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted
http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx
DMV | Details | Use
sys.dm_exec_query_stats | Cumulative view of query statistics | Total and average resource consumption
sys.dm_exec_sql_text | Returns the text of the SQL batch that is identified by the specified sql_handle | Provide overall batch text for statement
sys.dm_exec_query_plan | Returns plan in XML for specified plan handle | Provide plan for tuning and analysis
sys.dm_exec_requests | Current requests executing on your DB | Check for blocking, contention related issues, convoys, etc.
• Look at the Top N’s
  • CPU / IO / Worker Time / Executions / Avgs
• Compare Queries Between Shards
  • Plan Changes
  • Resources
  • Executes / Hot Shards?
  • What is Slow?
• Look at Durations…
  • DML
  • Blocking / Waits / Throttling
  • One Offs
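A minimal sketch of the "Top N" and "compare between shards" checks, assuming the rows have already been pulled from sys.dm_exec_query_stats on each shard. The column names query_hash, total_worker_time (microseconds) and execution_count come from that DMV; the QueryStat type, the shard labels, the one-row-per-query-per-shard simplification and the 2x threshold are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class QueryStat(NamedTuple):
    shard: str               # hypothetical label for the database the row came from
    query_hash: str
    total_worker_time: int   # cumulative CPU time, microseconds
    execution_count: int


def top_n_by_cpu(stats: List[QueryStat], n: int) -> List[QueryStat]:
    """Rows with the highest cumulative CPU time, largest first."""
    return sorted(stats, key=lambda s: s.total_worker_time, reverse=True)[:n]


def avg_cpu_by_shard(stats: List[QueryStat]) -> Dict[str, Dict[str, float]]:
    """query_hash -> {shard: average CPU microseconds per execution}."""
    result: Dict[str, Dict[str, float]] = defaultdict(dict)
    for s in stats:
        if s.execution_count > 0:
            result[s.query_hash][s.shard] = s.total_worker_time / s.execution_count
    return dict(result)


def hot_shards(stats: List[QueryStat], ratio: float = 2.0) -> List[Tuple[str, str]]:
    """(query_hash, shard) pairs whose average CPU exceeds ratio x the cheapest shard."""
    flagged = []
    for query_hash, per_shard in avg_cpu_by_shard(stats).items():
        baseline = min(per_shard.values())
        for shard, avg in sorted(per_shard.items()):
            if avg > ratio * baseline:
                flagged.append((query_hash, shard))
    return flagged


sample = [
    QueryStat("shard-a", "0xA1", 1_000_000, 10_000),  # avg 100 us
    QueryStat("shard-b", "0xA1", 9_000_000, 10_000),  # avg 900 us
    QueryStat("shard-a", "0xB2", 500_000, 100),       # avg 5000 us
    QueryStat("shard-b", "0xB2", 520_000, 100),       # avg 5200 us
]
```

With the sample rows, `top_n_by_cpu(sample, 1)` returns the 0xA1 row from shard-b, and `hot_shards(sample)` flags only `("0xA1", "shard-b")`: the same query costs far more there, which is the kind of difference a plan change or a hot shard would produce.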
Works on prem and in the cloud
Free -> ~$2578.00/mo (10 xlarge instances)
Agent based, hooking profiling API
Great cross-instance correlation features
Availability
Performance
Usage
[Diagram labels: Application, Telemetry, DB, DB, SCOM]
SCOM Azure Management Pack:
http://www.microsoft.com/en-us/download/details.aspx?id=11324
Generating Telemetry
• WA Table Storage: General maximum throughput is 1000 entities / partition / table
• Performance Counters:
  • Uses part of timestamp as partition key (limits number of concurrent entity writes)
  • Each partition key is 60 seconds wide, and entities are written asynchronously in bulk
Consuming Telemetry
• WA Table storage read performance degrades with # entities/partition
• Example: Entities/Partition := (# perf counter entries) * (# role instances being monitored)
Scaling The Solution – You can extend this approach by
• Collecting performance counters at a coarser grain (Example: 1 minute -> 5 minutes)
• Filtering more records (skip WARN/INFO messages, keep ERROR)
Problems
• Some PaaS services don’t expose performance counters (Azure SQL DB, Service Bus, etc.)
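The partition arithmetic on this slide can be worked through in a few lines. This is a sketch under assumptions: the 60-second width and the entities-per-partition formula come from the slide, while the key format returned by partition_bucket is invented for illustration and is not the format Azure Diagnostics actually writes.

```python
from datetime import datetime, timezone

PARTITION_WIDTH_SECONDS = 60  # from the slide: each partition key is 60 seconds wide


def partition_bucket(ts: datetime) -> str:
    """Map a timestamp to its partition window (hypothetical key format)."""
    epoch = int(ts.timestamp())
    start = epoch - epoch % PARTITION_WIDTH_SECONDS
    return datetime.fromtimestamp(start, tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def entities_per_partition(counter_entries: int, role_instances: int) -> int:
    """Entities/Partition := (# perf counter entries) * (# role instances)."""
    return counter_entries * role_instances


# Two samples taken in the same minute land in the same partition.
a = partition_bucket(datetime(2014, 5, 1, 12, 34, 5, tzinfo=timezone.utc))
b = partition_bucket(datetime(2014, 5, 1, 12, 34, 56, tzinfo=timezone.utc))
```

For example, 20 counter entries per window across 50 monitored role instances gives `entities_per_partition(20, 50) == 1000` entities in one partition, and the count grows linearly with either factor. That is why a coarser sampling grain or monitoring fewer counters keeps reads manageable.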
[Diagram labels: Application, Telemetry, Reports/Dashboards, DB, DB, Telemetry DB, DMVs, Worker Role]
http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649
http://social.technet.microsoft.com/wiki/contents/articles/17987.cloud-servicefundamentals.aspx
Generating Telemetry
• WA Blob Storage supports higher limits (but you need to batch writes better)
• Polling DBs requires DMV diffing (which is imperfect but better than nothing)
• Multi-threading helps scale the system (to a point), but eventually you have latency
Consuming Telemetry
• Database allows use of existing tools (Reporting Services, etc.)
• Writing Dashboards initially takes some time, but it can really help
Scaling The Solution – You can extend this approach by
• (Same as approach 1 – collect less often or collect less data)
Problems
• Eventually you want data “faster” and things slow down as you scale your service
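"DMV diffing" here means turning cumulative counters into per-interval deltas by subtracting the previous poll from the current one. The sketch below assumes each poll yields a dictionary keyed by a query identifier (for example query_hash) with the cumulative execution_count and total_worker_time columns of sys.dm_exec_query_stats; the Counters type and diff_snapshots function are hypothetical names.

```python
from typing import Dict, NamedTuple


class Counters(NamedTuple):
    execution_count: int     # cumulative since the plan was cached
    total_worker_time: int   # cumulative CPU time, microseconds


Snapshot = Dict[str, Counters]  # keyed by a query identifier such as query_hash


def diff_snapshots(previous: Snapshot, current: Snapshot) -> Snapshot:
    """Per-interval activity between two polls of cumulative counters."""
    deltas: Snapshot = {}
    for key, now in current.items():
        before = previous.get(key)
        if before is None or now.execution_count < before.execution_count:
            # New entry, or counters restarted (plan evicted and cached again):
            # the current values are the best available estimate for the interval.
            deltas[key] = now
            continue
        delta = Counters(now.execution_count - before.execution_count,
                         now.total_worker_time - before.total_worker_time)
        if delta.execution_count > 0:  # skip queries that did not run this interval
            deltas[key] = delta
    return deltas


prev = {"q1": Counters(100, 5000), "q2": Counters(50, 900), "q3": Counters(7, 70)}
curr = {"q1": Counters(130, 6500), "q2": Counters(5, 80), "q3": Counters(7, 70)}
```

With these snapshots, `diff_snapshots(prev, curr)` reports 30 executions and 1500 microseconds for q1, treats q2 as a reset, and omits the idle q3. The imperfection is visible in the reset case: whatever q2 did between the last poll and its eviction is lost, and an entry that disappears from the cache entirely between polls never shows up at all.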
[Architecture diagram labels: All Geo-Regions; One Region; Alerting/Compute Deployment; Job Complete Notification; HDI Cluster (Scheduling, Hive, Pig, Map-Reduce Jobs); WA Storage (Data Exhaust, Persist Telemetry, Partitioned Queues, Persist Curated Data); ETL (Transform/Load Data Warehouse); On-Premises Data Warehouse]
Generating Telemetry
• On-Node collectors batch telemetry, write to multiple WA Blob Storage containers
• Per-Geo Region Accounts (collocated with service stamps in each region)
Consuming Telemetry
• Big Data (Hadoop or similar) system reads data across all stamps
• Aggregations/Trace Processing generate output data (to WA Blob Storage)
• ETL moves data into the DW
• Users query DW with star schema (facts/dimensions) using normal DB techniques
• Reports generated for common activities needed to run the business
• Queries using Hive against Hadoop also possible
Scaling The Solution – You can extend this approach by
• Adding more cores to Hadoop
• Buying a larger DW box
• Changing aggregation grain for aggregation jobs
Problems
• E2E Latency
• Layers between Hadoop world and Microsoft world (expertise in two technology stacks)
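The on-node collector pattern (buffer events locally, then write one blob per batch into a per-region location) might look roughly like the sketch below. It makes several assumptions: write_blob is a hypothetical callable standing in for whatever storage client is used, not a real SDK call, and the blob naming scheme, JSON-lines encoding and default batch size are illustrative choices.

```python
import json
from typing import Callable, List


class BatchingCollector:
    """Buffers telemetry events and writes them out one blob per batch."""

    def __init__(self, write_blob: Callable[[str, bytes], None],
                 region: str, max_batch: int = 500):
        self._write_blob = write_blob  # hypothetical uploader: (blob_name, body)
        self._region = region
        self._max_batch = max_batch
        self._buffer: List[dict] = []
        self._sequence = 0

    def record(self, event: dict) -> None:
        """Add one event; flush automatically when the batch is full."""
        self._buffer.append(event)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Write buffered events as one JSON-lines blob, then clear the buffer."""
        if not self._buffer:
            return
        body = "\n".join(json.dumps(e, sort_keys=True) for e in self._buffer)
        name = f"{self._region}/batch-{self._sequence:08d}.jsonl"
        self._write_blob(name, body.encode("utf-8"))
        self._sequence += 1
        self._buffer = []


# Usage with an in-memory dict standing in for blob storage.
blobs = {}
collector = BatchingCollector(blobs.__setitem__, region="westus", max_batch=2)
for i in range(3):
    collector.record({"seq": i, "level": "ERROR"})
collector.flush()  # write the final partial batch
```

Three events with a batch size of two produce two blobs, so the number of storage writes scales with batches rather than events. A production collector would also flush on a timer and on shutdown so that a quiet node does not hold telemetry indefinitely.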
http://msdn.microsoft.com/en-us/library/jj853352.aspx
http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx
https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf
http://technet.microsoft.com/library/dn765472.aspx
http://technet.microsoft.com/en-us/library/hh546785.aspx
http://www.microsoft.com/en-us/server-cloud/products/windows-azure-pack
http://azure.microsoft.com/en-us/