10.2. Vectorizing mathematical functions#
The aim is to implement mathematical functions like exp, sin and cos using the SIMD types.
One motivation for doing so is speeding up the simulation of
scattering problems, where
a good fraction of the compute time is spent in the complex exp function.
Most implementations found online go back to these sources:
References:
Stephen L. Moshier: Methods and Programs For Mathematical Functions https://www.moshier.net/methprog.pdf
CEPHES mathematical function library https://www.netlib.org/cephes/
10.2.1. How to implement sin and cos ?#
Local approximation by polynomials
For small values of the argument \(x\), polynomial approximation via Taylor expansion can be used. A uniform best approxiamtion on the interval \([-\pi/4, \pi/4]\) is obtained via Chebyshev interpolation. We took the coefficients from CEPHES, file cmath.tgz.
static constexpr double sincof[] = { 1.58962301576546568060E-10, -2.50507477628578072866E-8, 2.75573136213857245213E-6, -1.98412698295895385996E-4, 8.33333333332211858878E-3, -1.66666666666666307295E-1, }; static constexpr double coscof[6] = { -1.13585365213876817300E-11, 2.08757008419747316778E-9, -2.75573141792967388112E-7, 2.48015872888517045348E-5, -1.38888888888730564116E-3, 4.16666666666665929218E-2, }; // highly accurate on [-pi/4, pi/4] template <typename T> auto sincos_reduced (T x) { auto x2 = x*x; auto s = (((((sincof[0]*x2+sincof[1])*x2+sincof[2])*x2+sincof[3])*x2+sincof[4])*x2+sincof[5]); s = x + x*x*x * s; auto c = (((((coscof[0]*x2+coscof[1])*x2+coscof[2])*x2+coscof[3])*x2+coscof[4])*x2+coscof[5]); c = 1.0 - 0.5*x2 + x2*x2*c; return std::tuple{ s, c }; }
This polynomial approximation can be used the same way for
double x, andSIMD<double> x.Global approximation
For a general \(x \in {\mathbb R}\), we have to figure out the right branch to reduce the evaluation to the small interval.
We represent
\[ x = q \frac{\pi}{2} + \tilde x \]with an integer \(q\) and the reminder \(\tilde x \in [-\pi/4, \pi/4]\).
Depending on the reminder, we can reduce the evaluation of \(\sin x\) to local evaluation of \(\sin\) or \(\cos\) for small \(\tilde x\):
\[\begin{split} \sin(x) = \left\{ \begin{array}{cc} \sin(\tilde x) & \text{for } q \text{ mod } 4 = 0, \\ \cos(\tilde x) & \text{for } q \text{ mod } 4 = 1, \\ -\sin(\tilde x) & \text{for } q \text{ mod } 4 = 2, \\ -\cos(\tilde x) & \text{for } q \text{ mod } 4 = 3 \end{array} \right. \end{split}\]The choice of the branch can be made by the conditional operator
?applied to the lowest bits ofq:auto sincos (double x) { double y = round((2/M_PI) * x); int q = lround(y); auto [s1,c1] = sincos_reduced(x - y * (M_PI/2)); double s2 = ((q & 1) == 0) ? s1 : c1; double s = ((q & 2) == 0) ? s2 : -s2; double c2 = ((q & 1) == 0) ? c1 : -s1; double c = ((q & 2) == 0) ? c2 : -c2; return std::tuple{ s, c }; }
All these operations can be perfectly simded:
template <int N> auto sincos (SIMD<double,N> x) { SIMD<double,N> y = round((2/M_PI) * x); SIMD<int64_t,N> q = lround(y); auto [s1,c1] = sincos_reduced(x - y * (M_PI/2)); auto s2 = select((q & SIMD<int64_t,N>(1)) == SIMD<int64_t,N>(0), s1, c1); auto s = select((q & SIMD<int64_t,N>(2)) == SIMD<int64_t,N>(0), s2, -s2); auto c2 = select((q & SIMD<int64_t,N>(1)) == SIMD<int64_t,N>(0), c1, -s1); auto c = select((q & SIMD<int64_t,N>(2)) == SIMD<int64_t,N>(0), c2, -c2); return std::tuple{ s, c }; }
Rounding on AVX2:
_mm256_round_pd. Converting to int64:_mm256_cvtepi32_epi64(_mm256_cvtpd_epi32(val))
Rounding for arm-neon:vrndnq_f64,vcvtq_s64_f64
10.2.2. What the f..k?#
An important operation in compute graphics (games!) is the normalization of a vector, \(|v| = \frac{1}{\sqrt{v \cdot v}} v\). This requires the inverse square root function
Instead of calling a sqrt function and a division, one may solve the nonlinear equation
using a few steps of Newton’s method:
Every step requires only 3 multiplications.
This algorithm got famous in Quake III Arena providing fluent 3D graphics (in the year 1999). Find the full code including the documented initial value for Newton’s method at this page.
For example, processors by Arm provide intrinsics exactly for that Newton iteration: instruction
10.2.3. Exercise#
add missing SIMD-functions for the \(\sin\) and \(\cos\) function.
Check accuary of the function by comparing to the standard \(\sin\) and \(\cos\) functions.
Measure speed by calling functions \(10^8\) times.
Implement the \(\exp\) function for
SIMDtypes.Rewrite for
\[ x = q \ln 2 + \tilde x \]with \(q \in {\mathbb Z}\) and reminder \(\tilde x \in [-\tfrac{\ln 2}{2}, \tfrac{\ln 2}{2}]\):
\[\begin{eqnarray*} e^x & = & e^{ q \ln 2 + \tilde x} = e^{q \ln2 } e^\tilde x \\ & = & 2^q \, e^{\tilde x} \end{eqnarray*}\]implement an approximation of \(\exp(\tilde x)\) for small \(\tilde x\)
to implement \(2^q\), use the representation of floating point numbers.
#include <cstring> #include <cstdint> #include <algorithm> double double_power_of_two(int n) { // Clamp input to allowed range and add offset: n = std::max(n, -1022); n = std::min(n, 1023); n += 1023; // convert to 64bit, and shift to exponent uint64_t exp = n; uint64_t bits = exp << 52; // reinterpret bits as double double result; memcpy (&result, &bits, 8); return result; }
See the result on Compiler explorer
Some places to find insparation: