Instant Profiling: Instrumentation Sampling for Profiling Datacenter Applications

Hyoun Kyu Cho (1), Tipp Moseley (2), Richard Hank (2), Derek Bruening (2), Scott Mahlke (1)
(1) University of Michigan  (2) Google
Datacenter Applications

http://googleblog.blogspot.com

• In 2010, US datacenters consumed 70-90 billion kWh [Koomey`11]
• Datacenter application performance is critical
• Profiling can help
Traditional Profiling

[Diagram: Source Code -> Instrumentation Build -> Instrumented Binary -> Training Run (with Input Data) -> Profile Data]

Challenges for Datacenters

• Need to run on live traffic
• Difficult to isolate
• Overheads
  • Value profiling: 3.8x slowdown [Calder`99]
  • Path profiling: 31%, edge profiling: 16% [Ball`96]
• Binary management
  • Many programs, multiple versions
Google-Wide Profiling

 Continuous profiling infrastructure for datacenters [Ren et al.`10]
 Negligible overhead
• Sampling based
• Aggregated profiling overhead less than 0.01%
 Limitations
• Relies heavily on hardware Performance Monitoring Units
• Limited flexibility and portability
Goals

 Unified profiling infrastructure for datacenters
• Flexible types of profile data
• Portable across heterogeneous datacenters
 While maintaining
• Low overhead
• No added burden on binary management

 Sampling + Dynamic Binary Instrumentation
Instrumentation Sampling

[Diagram: natively, the application reaches the operating system and hardware through the system call gateway. Under DynamoRIO [Bruening`04], the application instead runs out of a software code cache, context-switching to a dispatch/instrumentation engine that invokes the profiling client. A shepherding thread periodically starts and stops profiling.]
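A minimal sketch of how such a shepherding thread could drive sampling phases, written with plain POSIX threads and timers; the slides do not show this code, and start_profiling()/stop_profiling() are hypothetical stand-ins for Instant Profiling's internal start/stop machinery (unlinking fragments, redirecting threads), not DynamoRIO API calls. The 2 ms / 1 s pair mirrors one of the configurations evaluated later.

```c
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical hooks: in Instant Profiling these would redirect application
 * threads into the instrumented code cache and back to native execution. */
static void start_profiling(void) { /* stub for this sketch */ }
static void stop_profiling(void)  { /* stub for this sketch */ }

static void sleep_ms(long ms) {
    struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);
}

/* Shepherding thread: profile for duration_ms out of every period_ms,
 * e.g. 2 ms out of every 1 s for the "2ms/1s" configuration. */
static void *shepherding_thread(void *arg) {
    (void)arg;
    const long duration_ms = 2;
    const long period_ms   = 1000;
    while (true) {
        start_profiling();                  /* begin a sampling phase */
        sleep_ms(duration_ms);
        stop_profiling();                   /* end the phase */
        sleep_ms(period_ms - duration_ms);  /* run uninstrumented for the rest */
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, shepherding_thread, NULL);
    pthread_join(tid, NULL);  /* the sketch loops forever */
    return 0;
}
```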
Problems with Basic Implementation

 Unbounded profiling periods due to fragment linking
 Latency degradation due to initial instrumentation
 Multithreaded programs
Temporal Unlinking/Relinking of Fragments

[Diagram: fragments BB1 and BB2 in the code cache, with a direct link BB2->BB1. When a profiling phase is being stopped, the link is temporarily unlinked so control context-switches back to dispatch instead of staying inside the code cache; the link is restored for the next profiling phase.]
S/W Code Cache Pre-population

 Still have latency degradation during the initial instrumentation phases

[Diagram: the shepherding thread pre-populates the software code cache through dispatch, the instrumentation engine, and the client, so application threads find fragments already built when a profiling phase starts.]
Multithreaded Program Support

 Sampling makes it possible to miss thread operations
 Installs Instant Profiling's signal handler for every thread
 Enumerates all threads and sends a profiling-start signal to each thread
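A hedged sketch of one way the enumerate-and-signal step could be done on Linux; the slides do not show the actual mechanism, so the use of /proc/self/task, tgkill, and SIGUSR1 here is an assumption, and handle_profiling_signal is a hypothetical handler name.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PROFILING_SIGNAL SIGUSR1  /* assumed signal choice, not shown in the slides */

/* Hypothetical per-thread handler: on delivery, the thread would be redirected
 * into the instrumented code cache (start) or back to native code (stop). */
static void handle_profiling_signal(int sig) { (void)sig; }

static void install_handler(void) {
    struct sigaction sa = {0};
    sa.sa_handler = handle_profiling_signal;
    sigemptyset(&sa.sa_mask);
    sigaction(PROFILING_SIGNAL, &sa, NULL);  /* disposition is shared by all threads */
}

/* Enumerate all threads of this process via /proc/self/task and signal each one. */
static void signal_all_threads(void) {
    DIR *dir = opendir("/proc/self/task");
    if (dir == NULL)
        return;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;  /* skip "." and ".." */
        pid_t tid = (pid_t)atoi(ent->d_name);
        syscall(SYS_tgkill, getpid(), tid, PROFILING_SIGNAL);  /* target this specific thread */
    }
    closedir(dir);
}

int main(void) {
    install_handler();
    signal_all_threads();  /* in a real run, the shepherding thread would do this at phase boundaries */
    return 0;
}
```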
Experimental Setup

 6-core Intel Xeon 2.67GHz w/ 12MB L3
 12GB main memory
 Linux kernel 2.6.32
 gcc 4.4.3 w/ -O3
 SPEC INT2006, BigTable, Web search
 Edge profiling client
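For concreteness, a minimal sketch of what an edge-profiling client can look like using DynamoRIO's public API (dr_register_bb_event, dr_insert_clean_call). This is an illustrative assumption, not the authors' actual client: record_edge is reduced to a single global counter to keep the sketch short, where a real client would keep per-edge counters in a thread-safe table. Current DynamoRIO uses dr_client_main as the entry point (older releases used dr_init).

```c
#include "dr_api.h"

static volatile unsigned long edge_executions;

/* Simplified edge recorder: a real client would index a table keyed by
 * (branch_pc, target_pc); here we only count edge executions. */
static void record_edge(app_pc branch_pc, app_pc target_pc) {
    (void)branch_pc;
    (void)target_pc;
    edge_executions++;
}

/* Called for every basic block brought into the software code cache:
 * insert a clean call before each conditional branch to record its taken edge. */
static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
                  bool for_trace, bool translating)
{
    (void)tag; (void)for_trace; (void)translating;
    for (instr_t *instr = instrlist_first(bb); instr != NULL;
         instr = instr_get_next(instr)) {
        if (instr_is_cbr(instr)) {
            dr_insert_clean_call(drcontext, bb, instr, (void *)record_edge,
                                 false /* no fp state save */, 2,
                                 OPND_CREATE_INTPTR(instr_get_app_pc(instr)),
                                 OPND_CREATE_INTPTR(opnd_get_pc(instr_get_target(instr))));
        }
    }
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    (void)id; (void)argc; (void)argv;
    dr_register_bb_event(event_basic_block);
}
```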
Naïve Edge Profiling

[Figure: slowdown of the naïve (non-sampled) edge-profiling client on 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, bigtable, web search, and the arithmetic mean (a.mean). Y-axis: slowdown, 0-50x.]
Profiling Overhead

[Figure: normalized execution time (0.90-1.30) of Instant Profiling for five duration/period configurations (1ms/1s, 2ms/1s, 4ms/1s, 2ms/4s, 2ms/250ms) on 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, bigtable, web search, and the arithmetic mean (a.mean).]
S/W Code Cache Pre-population

[Figure: cumulative number of samples (0-3,500,000) over the first ten sampling phases, with and without code cache pre-population.]
Profiling Accuracy

[Figure: profiling accuracy (0-100%) for the same five duration/period configurations (1ms/1s, 2ms/1s, 4ms/1s, 2ms/4s, 2ms/250ms) on 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, bigtable, web search, and the arithmetic mean (a.mean).]
Asymptotic Accuracy

[Figure: cumulative profiling accuracy (0-100%) for bigtable and web search as the number of sampling phases grows from 0 to 140.]
Conclusion

 Low-overhead, portable, and flexible profiling is needed
 Instant Profiling
• Combines sampling and dynamic binary instrumentation
• Pre-populates the S/W code cache
• Tunable tradeoff between overhead and information
• Provides eventual profiling accuracy
 Less than 5% overhead and more than 80% accuracy for the naïve edge-profiling client
Thank you!