Transcript finalx

Elementary Functions in Julia
Mustafa Mohamad
Department of Mechanical Engineering, MIT
18.337, Fall 2016
Problem
Develop pure Julia codes for elementary function
Trigonometric functions
sin, cos, tan, sincos
asin, acos, atan, atan2
Hyperbolic functions
sinh, cosh, tanh,
asinh, acosh, atanh
Exponential and logarithmic
log, log2, log10, log1p, ilog2,
exp, exp2, exp10, expm1, cbrt, pow
Floating point manipulation
ldexp, ilog2, frexp
Grand Goal
• Excise dependency on openlibm
• Basically a fork of freebsd's msun math library plus
patches (originally developed by Sun microsystems)
• Accumulated patches have gathered over the years that
have not been ported over to openlibm
The Julia Promise
• Multiple dispatch and macros, and other features make it
possible to write generic versions of algorithms and still get
maximal performance
• Generic versions often easier to understand
• Get efficient code for Float16 and Float128
• Float16: forthcoming in Julia 0.6/1.0
• Float128: forthcoming in Julia 0.6/1.0
Background: Julia port of SLEEF C library
• Takes advantage of SIMD and minimal code branches
• Look at code @code_llvm or @code_native
• Julia ported code benefits from auto SIMD vectorization
See: https://github.com/JuliaMath/Sleef.jl
SLEEF good accuracy but too slow
• Avoiding branches necessitates unnecessary computations for some
branches
• Double-double arithmetic used to squeeze extra bits of precision and
avoid branching (library too slow on non-fma systems, ~6-10 slower
then Base)
Benchmarks
MEDIAN TIME RATIO SLEEF/BASE
• Performance critically depends on availability
of fused-multiply-add instruction (FMA)
• sin/cos/tan have bugs so benchmark here is
not fair, same for the log functions
• exp10 function is very fast compared to base
exp10, but the exp10 function in openlibm is
dumb; it just calls the pow function
Amal.jl
https://github.com/JuliaMath/Amal.jl
General lessons
• Developing accurate and fast elementary functions is very hard and time
consuming…. strict accuracy constrains for high quality implementation
require error’s to be < 1 ulp
• Must take advantage of the IEEE form of Floating point numbers to
extract maximal speed
Important considerations
• Ideally prefer polynomial approximants instead of rational approximants
(division is much slower than × or + ~can)
• Availability of FMA instruction (fused multiply and add with no
intermediate rounding)
• Avoid algorithms that depend on table-lookups (throughout of CPU
instructions is outpacing memory access speed, as a result these tablebased methods have fallen out of flavor)
Testing
• Critical
• Possible to test all Float32 numbers O(109 ) in a
couple of minutes
• Unfortunately it’s not possible to test all Float64
numbers O(1019 )
• Benchmarking: @code_llvm/@code_native,
@code_warntype, BenchmarkTools.jl
Developing approximants
1. Range reduction on argument to the smallest set 𝑆 possible
2. Compute minimax approximation polynomial on 𝑆
3. Scale back if necessary
Remez based algorithm to compute minimax polynomial
Minimizes norm
For rational approximations use Padé approximant
Developed Public Functions
• exp , exp2, exp10
• exp, exp2 have non-fma versions (msun based)
• exp10 (polynomial < 1ulp for both fma and nonfma)
• log
Floating point manipulation functions
• frexp, ldexp, significand, exponent
Contributing to Base’s math.jl
Pure Julia ldexp function
computes significand x 2 exponent
without multiplication
regular branch:
~ 1.5x speedup
subnormal number and subnormal output: ~ 3.6x speedup
normal number and subnormal output:
~ 2.6x speedup
subnormal number and normal output:
~ 13x speedup
Wins on readability and speed!
Generic!
Generic over both argument x AND e (may be Int128, BigInt, etc., without sacrificing Int64/Int32 perf)
exp function example
https://github.com/JuliaMath/Amal.jl/blob/master/src/exp.jl
Method observe: exp 𝑥 = exp 𝑘 ln 2 + 𝑟 = 2𝑘 exp(r)
We have taken advantage of the float format!
1. Argument reduction: Reduce x to an r so that |r| <= 0.5*ln(2).
Given x, find r and integer k such that
x = k ln(2) + r, |r| ≤ 0.5 ln(2).
2. Approximate exp(r) by a polynomial on the interval [-0.5*ln(2), 0.5*ln(2)]:
exp(x) = 1.0 + x + polynomial(x),
3. Scale back: exp(x) = 2^k exp(r)
Note the special representation!
median speed relative to base: 0.95 (Amal/Base)
Possible to make this a tad faster
custom macro that converts numeric coefficients to the
same type as 𝑥 with no overhead (dispatch)
exp for non-fma systems
Pade approximation to 𝑒 𝑥
A Remez algorithm is used to generate a polynomial to approximate 𝑅
Therefore we can write (it’s actually better to then carry out another long division step to minimize
cancelation effects)
Thank you for listening!