Vectorizing mathematical functions

10.2. Vectorizing mathematical functions#

The aim is to implement mathematical functions like exp, sin and cos using the SIMD types. One motivation for doing so is speeding up the simulation of scattering problems, where a good fraction of the compute time is spent in the complex exp function.

Most implementations found online go back to these sources:

References:

Stephen L. Moshier: Methods and Programs For Mathematical Functions https://www.moshier.net/methprog.pdf
CEPHES mathematical function library https://www.netlib.org/cephes/

10.2.1. How to implement `sin` and `cos` ?#

Local approximation by polynomials

For small values of the argument \(x\), polynomial approximation via Taylor expansion can be used. A uniform best approxiamtion on the interval \([-\pi/4, \pi/4]\) is obtained via Chebyshev interpolation. We took the coefficients from CEPHES, file cmath.tgz.

static constexpr double sincof[] = {
    1.58962301576546568060E-10,
    -2.50507477628578072866E-8,
    2.75573136213857245213E-6,
    -1.98412698295895385996E-4,
    8.33333333332211858878E-3,
    -1.66666666666666307295E-1,
  };

static constexpr double coscof[6] = {
    -1.13585365213876817300E-11,
    2.08757008419747316778E-9,
    -2.75573141792967388112E-7,
    2.48015872888517045348E-5,
    -1.38888888888730564116E-3,
    4.16666666666665929218E-2,
  };

// highly accurate on [-pi/4, pi/4]
template <typename T>
auto sincos_reduced (T x) 
{
  auto x2 = x*x;
  
  auto s = (((((sincof[0]*x2+sincof[1])*x2+sincof[2])*x2+sincof[3])*x2+sincof[4])*x2+sincof[5]);
  s = x + x*x*x * s;

  auto c = (((((coscof[0]*x2+coscof[1])*x2+coscof[2])*x2+coscof[3])*x2+coscof[4])*x2+coscof[5]);
  c = 1.0 - 0.5*x2 + x2*x2*c;

  return std::tuple{ s, c };
}

This polynomial approximation can be used the same way for double x, and SIMD<double> x.

Global approximation

For a general \(x \in {\mathbb R}\), we have to figure out the right branch to reduce the evaluation to the small interval.

We represent

\[ x = q \frac{\pi}{2} + \tilde x \]

with an integer \(q\) and the reminder \(\tilde x \in [-\pi/4, \pi/4]\).

Depending on the reminder, we can reduce the evaluation of \(\sin x\) to local evaluation of \(\sin\) or \(\cos\) for small \(\tilde x\):

\[\begin{split} \sin(x) = \left\{ \begin{array}{cc} \sin(\tilde x) & \text{for } q \text{ mod } 4 = 0, \\ \cos(\tilde x) & \text{for } q \text{ mod } 4 = 1, \\ -\sin(\tilde x) & \text{for } q \text{ mod } 4 = 2, \\ -\cos(\tilde x) & \text{for } q \text{ mod } 4 = 3 \end{array} \right. \end{split}\]

The choice of the branch can be made by the conditional operator ? applied to the lowest bits of q:

auto sincos (double x)
{
  double y = round((2/M_PI) * x);
  int q = lround(y);
  
  auto [s1,c1] = sincos_reduced(x - y * (M_PI/2));

  double s2 = ((q & 1) == 0) ? s1 : c1;
  double s  = ((q & 2) == 0) ? s2 : -s2;

  double c2 = ((q & 1) == 0) ? c1 : -s1;
  double c  = ((q & 2) == 0) ? c2 : -c2;
  
  return std::tuple{ s, c };
}

All these operations can be perfectly simded:

template <int N>
auto sincos (SIMD<double,N> x)
{
  SIMD<double,N> y = round((2/M_PI) * x);
  SIMD<int64_t,N> q = lround(y);
  
  auto [s1,c1] = sincos_reduced(x - y * (M_PI/2)); 

  auto s2 = select((q & SIMD<int64_t,N>(1)) == SIMD<int64_t,N>(0), s1,  c1);
  auto s  = select((q & SIMD<int64_t,N>(2)) == SIMD<int64_t,N>(0), s2, -s2);
  
  auto c2 = select((q & SIMD<int64_t,N>(1)) == SIMD<int64_t,N>(0), c1, -s1);
  auto c  = select((q & SIMD<int64_t,N>(2)) == SIMD<int64_t,N>(0), c2, -c2);
  
  return std::tuple{ s, c };
}

Rounding on AVX2: _mm256_round_pd. Converting to int64: _mm256_cvtepi32_epi64(_mm256_cvtpd_epi32(val))
Rounding for arm-neon: vrndnq_f64, vcvtq_s64_f64

10.2.2. What the f..k?#

An important operation in compute graphics (games!) is the normalization of a vector, \(|v| = \frac{1}{\sqrt{v \cdot v}} v\). This requires the inverse square root function

\[ y = \frac{1}{\sqrt{x}} \]

Instead of calling a sqrt function and a division, one may solve the nonlinear equation

\[ F(y) = \frac{1}{y^2} - x = 0 \]

using a few steps of Newton’s method:

\[\begin{eqnarray*} y_{n+1} & = & y_n - \frac{F(y_n)}{F^\prime(y_n)} \\ & = & y_n - \frac{ y_n^{-2} - x } { - 2 y_n^{-3} } \\ % & = & y_n + \frac{ y_n - x y_n^2 } {2 } % & = & \tfrac{3}{2} y_n - \tfrac{1}{2} x y_n^2 \\ & = & y_n \frac{ 3 - x y_n^2}{2} \end{eqnarray*}\]

Every step requires only 3 multiplications.

This algorithm got famous in Quake III Arena providing fluent 3D graphics (in the year 1999). Find the full code including the documented initial value for Newton’s method at this page.

For example, processors by Arm provide intrinsics exactly for that Newton iteration: instruction

10.2.3. Exercise#

add missing SIMD-functions for the \(\sin\) and \(\cos\) function.

Check accuary of the function by comparing to the standard \(\sin\) and \(\cos\) functions.

Measure speed by calling functions \(10^8\) times.
Implement the \(\exp\) function for SIMD types.

Rewrite for

\[ x = q \ln 2 + \tilde x \]

with \(q \in {\mathbb Z}\) and reminder \(\tilde x \in [-\tfrac{\ln 2}{2}, \tfrac{\ln 2}{2}]\):

\[\begin{eqnarray*} e^x & = & e^{ q \ln 2 + \tilde x} = e^{q \ln2 } e^\tilde x \\ & = & 2^q \, e^{\tilde x} \end{eqnarray*}\]
- implement an approximation of \(\exp(\tilde x)\) for small \(\tilde x\)
- to implement \(2^q\), use the representation of floating point numbers.
```
#include <cstring>
#include <cstdint>
#include <algorithm>
double double_power_of_two(int n)
{
  // Clamp input to allowed range and add offset:
  n = std::max(n, -1022);
  n = std::min(n,  1023);
  n += 1023;

  // convert to 64bit, and shift to exponent
  uint64_t exp = n;
  uint64_t bits = exp << 52;

  // reinterpret bits as double
  double result;
  memcpy (&result, &bits, 8);
  return result;
}
```
  See the result on Compiler explorer

Some places to find insparation:

Vectorizing mathematical functions

Contents

10.2. Vectorizing mathematical functions#

10.2.1. How to implement sin and cos ?#

10.2.2. What the f..k?#

10.2.3. Exercise#

10.2.1. How to implement `sin` and `cos` ?#