Writing Rock-Solid Reliable Apps for Longhorn & the CLR

Writing Rock-Solid Reliable Applications
For Windows Vista And The CLR
Björn Levidow, Group Program Manager
Brian Grunkemeyer, Software Design Engineer
FUN308
Microsoft Corporation
1
What You Will See
The Microsoft platform supports developing
reliable applications, both native and managed
Customer-Focused Reliability Attributes
Windows Vista and CLR reliability goals
Windows Vista and CLR reliability features
Detailed resiliency discussion
Features and Tools
Summary
Call to Action
2
Customer-Focused Reliability Attributes
Each entry gives the attribute, its definition, and example failures.
Resilient: The system continues to provide service in the face of internal or external disruptions. Examples: crashes, hangs …
Recoverable: After disruption the system is easily restored to a previously known state with no data loss. Examples: data corruption
Controlled: Provides timely and expected service whenever needed. Examples: degraded response
Undisruptable: Required changes and upgrades do not impact the service. Examples: update disruptions
Production Ready: At release the system contains a minimum number of bugs, requiring a limited number of predictable patches/fixes. Examples: patch size, frequency
Predictable: It works as advertised, what worked before works now. Examples: compatibility failures
3
Addressing Customer-Focused Reliability Attributes
Requires Application Design Consideration
Each entry gives the attribute, the OS or CLR features to plug into your app, and the failures they address.
Resilient: Process/App Domain Recycling, SafeHandle. Addresses: crashes, hangs …
Recoverable: Transactional file system/Registry, Common log file system. Addresses: data corruption
Controlled: Resource Exhaustion Diagnostics, I/O cancellation. Addresses: degraded response
Undisruptable: Restart Manager. Addresses: update disruptions
Production Ready: /Analyze, Safe C++ libraries, FxCop, App Verifier, Managed Debugging Assistant. Addresses: patch size, frequency
Predictable: Good versioning and installation practices. Addresses: compatibility failures
4
Windows Vista Reliability
Objectives
No loss of work, time, data or control
No Hangs, No Crashes, No Reboots
Reducing user disruptions and increasing
availability
How we raised the bar on Windows Vista
reliability
New processes to minimize bugs and design issues
Enhanced feedback using Windows Error Reporting
for identifying product problems during development
New reliability features
5
CLR Reliability Objectives
Write resilient applications
Improve application availability
Reduce user disruptions and increase availability
Resiliency against failures, crashes and hangs
Availability is great today. Let’s make it even better
How we raised the bar on CLR reliability
Tested product with fault injection
New reliability features
Hardened managed libraries
6
How Much Reliability Do I Need?
Different bars for different environments
Reliability of most software meets customer
needs
A few bad apples spoil the overall experience
Reliability needs differ based on your application
Console applications and simple apps like calc.exe
Sophisticated application (Word, Photoshop)
Library code
Highly available server code
Library code’s reliability bar is dictated by the
applications that use the library
7
Writing Reliable Code
Reliability Has A Cost
Writing reliable unmanaged code takes work
Requires discipline to handle out of memory problems
Failures in multi-threaded apps are hard to handle
Requires extensive testing (fault injection, stress runs)
Writing reliable managed code takes work
Under the covers, the CLR manages your code
Eliminates entire classes of bugs, like dangling pointers, memory
leaks, most buffer overruns, etc.
However, CLR-induced failure points aren’t obvious
Asynchronous exceptions: OutOfMemoryException and
ThreadAbortException
8
Customer-Focused Reliability Attributes
Resilient: The system continues to provide service in the face of internal or external disruptions. Examples: crashes, hangs …
9
How Do We Get Resiliency?
Resiliency Approaches
Isolated extensibility models
Keep extensions in their own process space
Enables recycling
Process Recycling
Operating System resources are guaranteed
to be freed
Relatively cheap and relatively easy
Requires a stateless, almost transactional
model
10
Process Recycling
Hosted programming model example
ASP.NET hosts applications
Uses process recycling for resiliency
Worker processes may encounter a resource leak or
deadlock, and the host will kill them
Bugs could be anywhere in the process
Server is resilient to these failures
Session state must live in a database or out-of-proc
In-process session state is lost. Controllable via web.config
Cheap and good enough for a web server
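A minimal sketch of a host that recycles a worker process, written for illustration and not taken from ASP.NET. The names WorkerSupervisor, MaxRestarts, and HangTimeoutMs are hypothetical, and treating a worker that outlives a timeout as hung is a crude stand-in for a real health check:

```csharp
using System;
using System.Diagnostics;

static class WorkerSupervisor
{
    // Hypothetical policy values, chosen only for illustration.
    const int MaxRestarts = 3;
    const int HangTimeoutMs = 30000;

    // Runs the worker until it exits cleanly, recycling it after a crash
    // (non-zero exit code) or an apparent hang (timeout). Returns the number
    // of restarts that were needed.
    public static int RunWithRecycling(string workerPath, string arguments)
    {
        for (int restarts = 0; restarts <= MaxRestarts; restarts++)
        {
            using (Process worker = Process.Start(workerPath, arguments))
            {
                if (!worker.WaitForExit(HangTimeoutMs))
                {
                    worker.Kill();        // the OS frees everything the worker held
                    worker.WaitForExit();
                }
                else if (worker.ExitCode == 0)
                {
                    return restarts;
                }
            }
        }
        throw new InvalidOperationException(
            "Worker kept failing after " + MaxRestarts + " restarts.");
    }
}
```

This only works if the worker keeps no state that must survive a kill, which is the stateless, almost transactional model the slides call for.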
11
AppDomain Recycling
Another hosted programming model
Application Domains are a unit of isolation
Static variables are per-appdomain
Avoid mutating any cross-AD or cross-process state
SQL unloads and recycles AppDomains
Mitigates state corruption
Higher availability
SQL is transacted => no database corruption
Operating System (OS) resources must be freed, but the OS
is AD-ignorant
Appdomain unloading must be clean!
[Diagram: a SQL Server process hosting the Default AppDomain, AppDomain 2, and AppDomain 3]
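A minimal sketch of AppDomain recycling, assuming the .NET Framework (AppDomain.CreateDomain is not supported on .NET Core and later). The Worker and AppDomainRecycling types are hypothetical and are not how SQL Server itself is implemented:

```csharp
using System;

// Derives from MarshalByRefObject so the caller gets a proxy and the
// object itself stays inside the domain that created it.
public sealed class Worker : MarshalByRefObject
{
    static int callCount;   // static state: one copy per AppDomain

    public int DoWork()
    {
        return ++callCount;
    }
}

public static class AppDomainRecycling
{
    // Runs one unit of work in a fresh AppDomain, then unloads the domain
    // so that any state the work corrupted is thrown away with it.
    public static int RunInFreshDomain()
    {
        AppDomain domain = AppDomain.CreateDomain("Worker domain");
        try
        {
            Worker worker = (Worker)domain.CreateInstanceAndUnwrap(
                typeof(Worker).Assembly.FullName,
                typeof(Worker).FullName);
            return worker.DoWork();
        }
        finally
        {
            AppDomain.Unload(domain);
        }
    }
}
```

Because statics are per-appdomain, each call starts from a zeroed callCount. Unloading does not release OS handles by itself, which is why clean unloading matters.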
12
Problems For Hosted Code
How does a host hurt your reliability?
Hosted libraries make tradeoffs to
guarantee availability
Thread aborts between two machine instructions
IntPtr handle = CreateFile(…);
call native int CreateFile(…)
stloc.2
OutOfMemoryExceptions more common
when hosted
Typical cleanup techniques aren’t guaranteed!
Finalizers and finally blocks may be aborted
Hosted managed libraries should be hardened
Prevent leaking resources in aggressive hosts
Using hardened code is very forgiving
13
SafeHandle
Reliably releasing a handle
A reliable, convenient wrapper for OS handles
CLR guarantees your release code will run
Critical finalization
Benefits
Avoids races with your own finalizer
Reduced object graph promotion during GC
Type-safe manipulation of handles
Small perf costs
Another 20 bytes on x86, 32 bytes on 64 bit
Ref count when a thread is actively using a
SafeHandle
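A minimal sketch of a SafeHandle subclass, for illustration only. SafeEventHandle and NativeMethods are hypothetical names; the P/Invoke targets (CreateEvent, CloseHandle) are real kernel32 exports. The framework already ships ready-made wrappers such as SafeFileHandle for common handle types:

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

// Hypothetical wrapper for a Win32 event handle. CreateEvent returns NULL
// on failure, so the zero-or-minus-one-is-invalid base class fits.
sealed class SafeEventHandle : SafeHandleZeroOrMinusOneIsInvalid
{
    // The P/Invoke marshaler constructs the instance; true = we own the handle.
    private SafeEventHandle() : base(true) { }

    // Invoked once by the runtime, on Dispose or during critical finalization.
    protected override bool ReleaseHandle()
    {
        return NativeMethods.CloseHandle(handle);
    }
}

static class NativeMethods
{
    // Returning the SafeHandle type directly leaves no window in which a raw
    // IntPtr exists but has not yet been wrapped.
    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    internal static extern SafeEventHandle CreateEvent(
        IntPtr securityAttributes, bool manualReset, bool initialState, string name);

    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    internal static extern bool CloseHandle(IntPtr handle);
}

static class SafeHandleExample
{
    // Creates an unnamed manual-reset event, or throws with the Win32 error.
    public static SafeEventHandle CreateUnnamedEvent()
    {
        SafeEventHandle evt = NativeMethods.CreateEvent(IntPtr.Zero, true, false, null);
        if (evt.IsInvalid)
        {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }
        return evt;
    }
}
```

Callers would normally wrap the result in a using block; if they forget, the critical finalizer still closes the handle.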
14
SafeHandle Demo
Brian Grunkemeyer
Software Development Engineer
Common Language Runtime
15
Constrained Execution Regions
Limited guaranteed execution
For building hosts and changing cross-AD state
RuntimeHelpers.PrepareConstrainedRegions();
try {
// Arbitrary code: may fail
}
finally {
// Constrained code: No virtual calls or allocs
}
Hoist CLR-induced failures and delay
thread aborts
Constraints on your code
Only call methods with reliability contracts
No allocations, virtual calls, acquiring locks, etc.
Perf and complexity cost
16
When To Use SafeHandle And CERs
Use SafeHandles when
Libraries hosted in environments using
appdomain recycling
Anyone using P/Invoke to acquire OS
resources
Use CERs when
Hosted code that manipulates cross-appdomain or cross-machine state
Still need to design for a power failure
Corner cases that SafeHandle doesn’t support
Marshaling out handles stored in a struct
17
Customer-Focused Reliability Attributes
Recoverable: After disruption the system is easily restored to a previously known state with no data loss. Examples: data corruption
18
Writing Recoverable Applications
Writing bug free apps is Nirvana, but…
Nobody’s perfect
Not all software controls nuclear power plants
Even if you get there, external factors affect you
Software installs, resource exhaustion, power failures
User uses your app in an unexpected way
So, writing recoverable apps is necessary
Expect the unexpected!
Apps should be journaled and designed to recover
Use transactions and journaling to persist data
Save data and state most important to your applications
Word is a good example
Saves user docs every 3 minutes to minimize loss
Document recovery as well
19
Transactions And Journaling
Tools to help build recoverable apps
Win32
File and Registry Transactions (TxF)
SetCurrentTransaction(HANDLE hTransaction)
Common Log File System (CLFS)
Managed
System.Transactions
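using (TransactionScope scope = new TransactionScope(
    TransactionScopeOption.Required,
    new TransactionOptions(),
    EnterpriseServicesInteropOption.Full))
{
    // EnterTransactionScope/ExitTransactionScope are the slide's own helpers; their bodies are not shown.
    if (!EnterTransactionScope()) throw new TransactionException("Bad");
    // Write to one or many files, etc.
    if (!ExitTransactionScope()) throw new TransactionException("Bad");
    scope.Complete(); // Vote to commit; without this call, disposing the scope rolls back
}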
20
Customer-Focused Reliability Attributes
Controlled: Provides timely and expected service whenever needed. Examples: degraded response
21
Resource Exhaustion Diagnosis
Give users control of their system by allowing them to
take action before a low resource condition impacts them
Automatic detection and diagnosis of near-exhaustion of commit
limit and memory leaks on client SKUs
Provide options for manual and automatic resolution to
avoid exhaustion
Impact on Windows Vista applications
If GUI app uses lots of VM, will show up on list of applications to
be closed by user
If service or CMD app, will be shut down by Windows when
exhaustion has been hit
What you need to do
Be mindful of memory utilization: e.g. trim working set
when unused
22
I/O Cancellation Support
Apps shouldn’t hang
Apps should provide a cancel button
Ever see Outlook hang while downloading mail?
New Win32 Cancellation APIs for Windows Vista
Cancel specific async I/O requests for file handle
CancelIoEx(HANDLE hFile, LPOVERLAPPED lpOverlap)
Cancel synchronous requests from another thread
CancelSynchronousIo(HANDLE hThread)
No managed support until “Orcas”
Look for the CancellationRegion class
Caveats
Operation is only marked for cancellation
Some “meta APIs” aren’t cancelable (e.g. CopyFile; use
CopyFileEx instead)
Slightly tricky to use
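Until managed support arrives, one option is to call the Win32 API through P/Invoke. A minimal sketch, assuming Windows Vista or later; IoCancellation and TryCancelPendingIo are hypothetical names, while CancelIoEx and the ERROR_NOT_FOUND value (1168) come from the Windows SDK:

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using Microsoft.Win32.SafeHandles;

static class IoCancellation
{
    const int ErrorNotFound = 1168;   // ERROR_NOT_FOUND: nothing was pending

    [DllImport("kernel32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    static extern bool CancelIoEx(SafeFileHandle file, IntPtr overlapped);

    // Asks the OS to cancel every pending I/O request on the handle.
    // Returns true if at least one request was marked for cancellation,
    // false if there was nothing to cancel.
    public static bool TryCancelPendingIo(SafeFileHandle file)
    {
        // A null OVERLAPPED pointer means "all requests for this handle".
        if (CancelIoEx(file, IntPtr.Zero))
        {
            return true;
        }

        int error = Marshal.GetLastWin32Error();
        if (error == ErrorNotFound)
        {
            return false;
        }
        throw new Win32Exception(error);
    }
}
```

A true result only means the request was marked for cancellation. The caller still has to wait for the operation to complete and check whether it finished normally or was aborted.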
23
Customer-Focused Reliability Attributes
Undisruptable: Required changes and upgrades do not impact the service. Examples: update disruptions
24
Minimize Reboots When
Installing Software
Use the Restart Manager APIs
Shuts down only required apps and services
Automatically detects and shuts down services in shared processes
with a file in use
Prevents the need for a machine restart after apps or services
have been shut down
Groups application, service and machine restarts
Design app “freeze-dry” functionality to return user to the state
they were in before the restart
RegisterApplicationRestart( GetCommandLine(), 0 );
// Native
Use P/Invoke for managed applications
Users experience minimum disruption
for application and patch installs for
your application
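For managed applications, the P/Invoke route might look like the sketch below. RestartRegistration and its RestartFlags enum are hypothetical names; the flag values mirror the RESTART_NO_* constants in the Windows SDK. As far as the Windows documentation describes it, the string holds only the arguments, and Windows supplies the executable name itself:

```csharp
using System;
using System.Runtime.InteropServices;

static class RestartRegistration
{
    // Combine these to opt out of restarts in specific situations.
    [Flags]
    public enum RestartFlags
    {
        None = 0,
        NoCrash = 1,    // RESTART_NO_CRASH
        NoHang = 2,     // RESTART_NO_HANG
        NoPatch = 4,    // RESTART_NO_PATCH
        NoReboot = 8    // RESTART_NO_REBOOT
    }

    // Returns an HRESULT; 0 (S_OK) means the registration succeeded.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
    static extern int RegisterApplicationRestart(string commandLineArgs, RestartFlags flags);

    // Registers this process for restart with the given arguments.
    // Throws the matching exception if Windows rejects the registration.
    public static void Register(string commandLineArgs)
    {
        int hr = RegisterApplicationRestart(commandLineArgs, RestartFlags.None);
        Marshal.ThrowExceptionForHR(hr);
    }
}
```

An application could call RestartRegistration.Register("/restore") at startup, where "/restore" is a made-up switch telling the restarted instance to reload its freeze-dried state.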
25
Customer-Focused Reliability Attributes
Production Ready: At release the system contains a minimum number of bugs, requiring a limited number of predictable patches/fixes. Examples: patch size, frequency
26
Windows Error Reporting During
Development
Errors are reported to Microsoft in real-time by
customer choice (crashes, hangs)
Automatic analysis and signature matching to
known issues
Problems available to registered developers
through the Developer Portal
Known fixes provided to customers in real-time
APIs for failing quickly and reporting an error
Environment.FailFast(String reason);
// Managed “panic button”
Or, simply let an exception go unhandled, in both
managed and native
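A short sketch of where the managed panic button fits. AccountLedger and its invariant are hypothetical; the point is to fail fast once state is known to be corrupt, instead of catching the problem and continuing:

```csharp
using System;

static class AccountLedger
{
    // Hypothetical invariant check: a negative balance here would mean the
    // in-memory state is already corrupt, so continuing could persist bad data.
    public static void VerifyBalance(decimal balance)
    {
        if (balance < 0)
        {
            // Terminates the process immediately, without running finally
            // blocks or finalizers, and hands the failure to error reporting.
            Environment.FailFast("Ledger balance went negative: " + balance);
        }
    }
}
```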
27
Reliability Best Practices
If crash occurs, report the issue via
Windows Error Reporting
Don’t use the IsBadWritePtr family of APIs
Turns debuggable crash into silent process exit
Replace the API with a simple `if (p == NULL)` check
Write multi-threaded code correctly
Use synchronization primitives for stopping and
pausing threads
Don’t call TerminateThread
Avoid calling Thread.Abort
Don’t call Thread.Suspend
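A minimal sketch of stopping a thread cooperatively with a synchronization primitive instead of Thread.Abort. PollingWorker is a hypothetical class, and the 100 ms interval is arbitrary:

```csharp
using System;
using System.Threading;

sealed class PollingWorker : IDisposable
{
    private readonly ManualResetEvent stopRequested = new ManualResetEvent(false);
    private readonly Thread thread;
    private int iterations;

    public PollingWorker()
    {
        thread = new Thread(Run);
        thread.IsBackground = true;
        thread.Start();
    }

    // Number of work iterations completed so far.
    public int Iterations
    {
        get { return Volatile.Read(ref iterations); }
    }

    // Signals the worker and waits for it to finish its current iteration.
    public void Stop()
    {
        stopRequested.Set();
        thread.Join();
    }

    private void Run()
    {
        // WaitOne serves as both the pause between iterations and the stop
        // check, so the thread only ever exits at a known-safe point.
        while (!stopRequested.WaitOne(100))
        {
            Interlocked.Increment(ref iterations);   // stand-in for real work
        }
    }

    public void Dispose()
    {
        Stop();
        stopRequested.Close();
    }
}
```

Pausing can follow the same pattern with a second event that the worker waits on between iterations.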
28
Recommended Tools For Making
Code Production Ready
Unmanaged
Safe C++ Libraries (CRT, MFC, ATL)
C++ Compiler static analysis (/analyze)
C++ Compiler’s buffer overrun cookie (/GS)
Application Verifier
Managed
FxCop
Managed Debugging Assistants
29
Summary
The Microsoft platform supports developing
reliable applications, both native and managed
What is Reliability?
Customer taxonomy
Windows Vista and CLR reliability goals
Windows Vista and CLR reliability features
Detailed resiliency discussion
Features and Tools
30
Call To Action
Design for resiliency as discussed
Use SafeHandle to free OS handles
Use Windows Vista’s transactions for
recoverability
Use Windows Vista’s new Restart Manager
APIs to minimize disruptions
Support cancellation to give users control
Use all the tools at your disposal to make
your code production ready
E.g. FxCop, /Analyze, Windows Error
Reporting
31
More Information
Managed Resiliency Features
At PDC
Add-Ins and Versioning - FUN 309: “Designing managed addins
for reliability, security, and versioning” w/ Jim Miller
Versioning – FUN 314: “Architecting your apps for the future”
After PDC
High-level overview:
http://msdn.microsoft.com/msdnmag/issues/05/10/Reliability/
SafeHandle:
http://blogs.msdn.com/bclteam/archive/2005/03/16/396900.aspx
Constrained Execution Regions:
http://blogs.msdn.com/bclteam/archive/2005/06/14/429181.aspx
Chris Brumme’s Hosting & Reliability blog posts:
http://blogs.msdn.com/cbrumme/archive/2004/02/21/77595.aspx
http://blogs.msdn.com/cbrumme/archive/2003/06/23/51482.aspx
32
More Information
Windows Vista reliability features
At PDC
Journaling – FUN034: Improving reliability with the new
System.Transactions classes, file system, and registry transactions
Restart Manager and Versioning – FUN222: Windows Vista and
"Longhorn" Server: What's New in Windows Installer (MSI) and ClickOnce
Feedback – FUN313: Windows Vista: Improving Quality through
Windows Feedback Data
I/O cancellation – FUN302: Programming with Concurrency (Part 1):
Concepts, Patterns, and Best Practices
After PDC
http://msdn.microsoft.com/windowsvista/reliability/
http://www.microsoft.com/technet/windowsvista/webcasts.mspx
Resource Exhaustion:
http://www.microsoft.com/technet/windowsvista/evaluate/admin/mntreli.mspx
I/O Cancellation
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/fs/cancelsynchronousio_func.asp
33
© 2005 Microsoft Corporation. All rights reserved.
This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
34