Instant Profiling: Instrumentation Sampling
for Profiling Datacenter Applications
Hyoun Kyu Cho¹, Tipp Moseley², Richard Hank², Derek Bruening², Scott Mahlke¹
¹University of Michigan
²Google
Datacenter Applications
http://googleblog.blogspot.com
• In 2010, US datacenters consumed 70 to 90 billion kWh*
• Datacenter application performance is critical
• Profiling can help
*[Koomey'11]
Traditional Profiling
[Diagram: Source Code -> Instrumentation Build -> Instrumented Binary; Instrumented Binary + Input Data -> Training Run -> Profile Data]
Challenges for Datacenters
• Need to run on live traffic
• Difficult to isolate
• Overheads
  • Value profiling 3.8x slowdown¹
  • Path profiling 31%, edge profiling 16%²
• Binary management
  • Many programs, multiple versions
¹[Calder'99] ²[Ball'96]
Google-Wide Profiling
Continuous profiling infrastructure for datacenters
Negligible overhead
• Sampling based
• Aggregated profiling overhead less than 0.01%
Limitations
• Relies heavily on Performance Monitoring Units
• Limited flexibility and portability
[Ren et al.'10]
Goals
Unified profiling infrastructure for datacenters
• Flexible types of profile data
• Portable across heterogeneous datacenters
While maintaining
• Low overhead
• Does not burden binary management
Sampling + Dynamic Binary Instrumentation
Instrumentation Sampling
[Diagram, build 1: application, system call gateway, operating system, hardware]
[Diagram, build 2: application running under DynamoRIO [Bruening'04], with instrumentation engine, dispatch, client, context switch, and code cache, on top of operating system and hardware]
[Diagram, build 3: a shepherding thread sends "start profiling" and "stop profiling" to the application; instrumentation engine, dispatch, client, and code cache sit on top of operating system and hardware]
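The start/stop behavior of the shepherding thread can be illustrated with a small timing sketch. This is only a model under stated assumptions: the function names `sampling_schedule` and `profiled_fraction` are hypothetical, times are plain integers in milliseconds, and no real threads, signals, or DynamoRIO calls are involved.

```python
from typing import List, Tuple


def sampling_schedule(on_ms: int, period_ms: int, total_ms: int) -> List[Tuple[int, int]]:
    """Return (start, stop) times in ms at which a shepherding thread would
    send "start profiling" and "stop profiling" to the application.

    Hypothetical model: one profiling phase of `on_ms` at the beginning of
    every `period_ms`, for a run lasting `total_ms`.
    """
    if not 0 < on_ms <= period_ms:
        raise ValueError("need 0 < on_ms <= period_ms")
    phases = []
    start = 0
    while start < total_ms:
        stop = min(start + on_ms, total_ms)  # clip the last phase to the run
        phases.append((start, stop))
        start += period_ms
    return phases


def profiled_fraction(phases: List[Tuple[int, int]], total_ms: int) -> float:
    """Fraction of the run spent executing instrumented code."""
    return sum(stop - start for start, stop in phases) / total_ms


phases = sampling_schedule(on_ms=2, period_ms=1000, total_ms=5000)
print(phases)
print(profiled_fraction(phases, 5000))
```

With these example numbers the application runs natively for all but 2 ms of every second, which is where the low overhead of instrumentation sampling comes from.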
Problems with Basic Implementation
Unbounded profiling periods due to fragment linking
Latency degradation due to initial instrumentation
Multi-threaded programs
Temporal Unlinking/Relinking of Fragments
[Diagram: code cache holding fragments BB1 and BB2 with a direct link BB2->BB1; dispatch and context switch sit outside the code cache]
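Why linked fragments make a profiling period unbounded, and why unlinking fixes it, can be seen in a toy model of the BB1/BB2 loop from the diagram. Everything here is an illustrative assumption: the `Fragment` class, `make_cache`, and `run_profiled` are hypothetical names, not DynamoRIO APIs, and one loop iteration stands for one executed basic block.

```python
class Fragment:
    """A translated basic block in the software code cache (toy model)."""

    def __init__(self, name, successor):
        self.name = name
        self.successor = successor  # name of the block executed next
        self.linked = False         # True: exit jumps directly to the successor


def make_cache():
    # Two blocks forming the loop from the slide: BB1 -> BB2 -> BB1.
    return {f.name: f for f in (Fragment("BB1", "BB2"), Fragment("BB2", "BB1"))}


def run_profiled(cache, entry, stop_at, unlink_on_stop, max_steps=1000):
    """Run instrumented code until a stop request is honored.

    The stop request arrives at step `stop_at`, but it can only be acted on
    when control returns to the dispatcher. Returns the number of blocks
    executed in the code cache, or None if the request was never honored
    within `max_steps`.
    """
    current = entry
    for step in range(max_steps):
        if step == stop_at and unlink_on_stop:
            for frag in cache.values():
                frag.linked = False     # temporarily unlink every fragment
        frag = cache[current]
        if not frag.linked:
            # Unlinked exit: control returns to the dispatcher.
            if step >= stop_at:
                return step + 1         # switch back to native execution
            frag.linked = True          # dispatcher links the exit for next time
        current = frag.successor
    return None


print(run_profiled(make_cache(), "BB1", stop_at=10, unlink_on_stop=False))
print(run_profiled(make_cache(), "BB1", stop_at=10, unlink_on_stop=True))
```

Without unlinking, the loop stays inside the code cache and the stop request is never seen (the first call returns None). With unlinking, the next fragment exit reaches the dispatcher and profiling ends after 11 blocks. In this model the dispatcher re-creates links lazily, which stands in for relinking at the next profiling phase.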
S/W Code Cache Pre-population
Still have latency degradation for initial instrumentation phases
[Diagram: shepherding thread; application; instrumentation engine, dispatch, client, and code cache; operating system; hardware]
Multithreaded Program Support
Sampling makes it possible to miss thread operations
Forces Instant Profiling's signal handler for every thread
Enumerates all threads and sends profiling start signal to each thread
Experimental Setup
6-core Intel Xeon 2.67GHz w/ 12MB L3
12GB main memory
Linux kernel 2.6.32
gcc 4.4.3 w/ -O3
SPEC INT2006, BigTable, Web search
Edge profiling client
Naïve Edge Profiling
[Bar chart: slowdown (y-axis, 0 to 50) for 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, and a.mean]
Profiling Overhead
[Bar chart: normalized execution time (y-axis, 0.90 to 1.30) for 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, and a.mean; series: 4ms/1s, 2ms/1s, 1ms/1s, 2ms/4s, 2ms/250ms]
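A first-order way to reason about the sampling configurations in the chart is a duty-cycle estimate. This sketch reads a label such as 2ms/1s as "profile for 2 ms out of every 1 s", which is an interpretation the slides do not spell out. The function name `estimated_overhead` and the slowdown value are hypothetical, and the model ignores the cost of starting and stopping profiling, so it is not a substitute for the measured results.

```python
def estimated_overhead(on_ms: float, period_ms: float, instrumented_slowdown: float) -> float:
    """First-order overhead estimate for duty-cycled instrumentation.

    Assumes the application runs `instrumented_slowdown` times slower while
    profiling is on and at native speed otherwise.
    """
    duty_cycle = on_ms / period_ms
    return duty_cycle * (instrumented_slowdown - 1.0)


# Configurations named on the slide, as (profiling duration, period) in ms.
CONFIGS = [(4, 1000), (2, 1000), (1, 1000), (2, 4000), (2, 250)]

# Hypothetical slowdown under full instrumentation, NOT taken from the slides.
ASSUMED_SLOWDOWN = 10.0

for on_ms, period_ms in CONFIGS:
    overhead = estimated_overhead(on_ms, period_ms, ASSUMED_SLOWDOWN)
    print(f"{on_ms}ms/{period_ms}ms: estimated overhead {100 * overhead:.2f}%")
```

The estimate makes the tradeoff explicit: lengthening the profiling phase or shortening the period raises both the amount of profile data collected and the overhead.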
S/W Code Cache Pre-population
[Line chart: cumulative number of samples (y-axis, 0 to 3,500,000) over sampling phases (x-axis, 0 to 9); series: w/ pre-population, w/o pre-population]
Profiling Accuracy
[Bar chart: profiling accuracy (y-axis, 0 to 100) for 400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 462.libquantum, 464.h264ref, 473.astar, web search, bigtable, and a.mean; series: 4ms/1s, 2ms/1s, 1ms/1s, 2ms/4s, 2ms/250ms]
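The slides do not say how accuracy is computed. One common way to compare a sampled edge profile against a full one is overlap percentage, sketched below as an assumption rather than as the metric behind the chart. The function name `overlap_accuracy` and the edge counts are made up for illustration.

```python
def overlap_accuracy(reference: dict, sampled: dict) -> float:
    """Overlap percentage between two edge profiles given as {edge: count}.

    Each profile is normalized to relative frequencies; the score is the sum
    over edges of the smaller of the two frequencies, scaled to 0..100.
    """
    ref_total = sum(reference.values())
    smp_total = sum(sampled.values())
    if ref_total == 0 or smp_total == 0:
        return 0.0
    overlap = 0.0
    for edge, ref_count in reference.items():
        ref_freq = ref_count / ref_total
        smp_freq = sampled.get(edge, 0) / smp_total
        overlap += min(ref_freq, smp_freq)
    return 100.0 * overlap


# Made-up counts: a full profile and a much smaller sampled profile.
full_profile = {("BB1", "BB2"): 900, ("BB1", "BB3"): 100}
sampled_profile = {("BB1", "BB2"): 8, ("BB1", "BB3"): 2}

print(overlap_accuracy(full_profile, sampled_profile))
```

Here the sampled profile sees the two edges at 80% and 20% instead of 90% and 10%, so the overlap is roughly 90. A metric of this kind depends only on relative frequencies, so a sampled profile with far fewer counts can still score high.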
Asymptotic Accuracy
[Line chart: cumulative accuracy (y-axis, 0 to 100) over sampling phases (x-axis, 0 to 140); series: bigtable, web search]
Conclusion
Low-overhead, portable, flexible profiling needed
Instant Profiling
• Combines sampling and DBI
• Pre-populates S/W code cache
• Tunable tradeoff between overhead and information
• Provides eventual profiling accuracy
Less than 5% overhead, more than 80% accuracy for naïve edge profiling client
Thank you!