Learning A Better Compiler

Download Report

Transcript Learning A Better Compiler

Learning A Better Compiler
Predicting Unroll Factors using Supervised
Classification
And
Integrating CPU and L2 Cache Voltage Scaling
using Machine Learning
Predicting Unroll Factors
• Loop Unrolling sensitive to unroll factor
• Current solution: expert design
– Difficult: Hand-tuned heuristics
– Must be rewritten frequently
• Predict parameters with machine learning
– Easy: data collection takes ~1wk
• No human time
– Algorithm does not change with compiler
Loop Unrolling
• Combines multiple iterations loop body
• Fewer Iterations  Less Branching
• Allows other transformations:
– Exposes adjacent memory locations
– Allows instruction reordering across
iterations
Unroll Factors
• How many iterations to combine?
• Too few?
– Provides little benefit
• Too large
– Increased cache pressure
– Increase live rangeregister pressure
Optimal Unroll Factors
QuickTime™ and a
decompressor
are needed to see this picture.
Classification Problems
• Input a vector of features
– E.g. nest depth, # of branches, # of ops
• Output a class
– E.g. unroll factor, 1-8
• No prior knowledge required
– Meaning of features/classes
– Relevance of features
– Relationships between features
Nearest Neighbors
• Paper describes Kernel Density Estimator
• All dimensions normalized to [0,1]
• Given a test point p:
– Consider training points “close” to p
• Within fixed distance, e.g. 0.3
– Majority vote among qualifying training points
Nearest Neighbors
QuickTime™ and a
decompressor
are needed to see this picture.
Support Vector Machine
• Assume two classes, easily generalized
• Transform data
– Make classes linearly separable
• Find line to maximize sep. margin
• For test point:
– Perform transformation
– Classify based on learned line
Maximal Margin
QuickTime™ and a
decompressor
are needed to see this picture.
Non-Linear SVM
QuickTime™ and a
decompressor
are needed to see this picture.
Some Features
•
•
•
•
•
•
•
•
# operands
Live range size
Critical path length
# operations
Known tripcount
# floating point ops
Loop nest level
# branches
• # memory ops
• Instruction fan-in in
DAG
• # instructions
• Language: C, fortran
• # memory ops
• # Implicit instructions
• & more (38 total)
Results: No Software Parallelism
QuickTime™ and a
decompressor
are needed to see this picture.
Results: With Software Parallelism
QuickTime™ and a
decompressor
are needed to see this picture.
Big Idea: Easy Maintenance
• Performance improvements modest
– Sometimes worse, sometimes much better
– Usually little change
• Requires no re-tuning to change compiler
– Gathering data takes ~1wk, no human time
• General mechanism
– Can be applied to all parameters
– No model of system needed
• Can be applied to new transformations where
expert knowledge is unavailable
Integrated CPU and L2 Cache
Voltage Scaling using Machine
Learning
Dynamic Voltage Control
• Monitor system
• When activity is low, reduce power
– Also reduces computational capacity
– May need more energy if work takes longer
Multiple Clock Domains
• Adjust separate components
independently
• Better performance/power
– E.g. CPU-bound application may be able to
decrease power to memory and cache
without affecting performance
• More complex DVM policy
Motivation
• Applications go through phases
• Frequency/voltages should change too
• Focus on core, L2 cache
– Consume large fraction of total power
• Best policy may change over time
– On battery: conserve power
– Plugged in: maximize performance
Learning a DVM Policy
• Compiler automatically instruments code
– Insert sampling code to record perf. Counters
– Instrument code only to gather data
• Use machine learning to create policy
• Implement policy in microcontroller
ML Parameters
• Features
– Clock cycles per instruction
– L2 accesses per instruction
– Memory access per instruction
• Select voltage to minimize:
– Total energy
– Energy*delay
Machine Learning Algorithm
• Automatically learn set of if-then rules
– E.g: If (L2PI >= 1) and (CPI <=0) then
f_cache=1GHz
• Compact, expressive
• Can be implemented in hardware
QuickTime™ and a
decompressor
are needed to see this picture.
Results
• Compared to independently managing
core and L2:
– Saves 22% on average, 46% max
•
•
•
•
Learns effective rules from few features
Compiler modifications instrument code
Learned policy offline
Implemented policy in microcontroller
Conclusion
• Machine learning derives models from
data automatically
• Allows easy maintenance of heuristics
• Creates models that are more effective
than hand-tuned