10. Pipelining

Look at the documentation of the _mm256_fmadd_pd (fused multiply-add) intrinsic: docu. It tells us that the instruction has a latency of 4 clock cycles, but a throughput of 0.5 CPI (clock cycles per instruction). This means that we can start two of these instructions in every clock cycle, but we have to wait 4 clock cycles until the result is available.
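
As a quick illustration, a minimal sketch of calling this intrinsic directly could look as follows; the function name fma_demo and the compile flag -mfma are chosen for illustration and are not part of the course code:

#include <immintrin.h>

// compile e.g. with: g++ -O2 -mfma
double fma_demo ()
{
   __m256d a = _mm256_set1_pd (2.0);        // four packed doubles, all equal to 2.0
   __m256d b = _mm256_set_pd (4.0, 3.0, 2.0, 1.0);
   __m256d c = _mm256_setzero_pd ();

   c = _mm256_fmadd_pd (a, b, c);           // c = a*b + c in one fused instruction

   double tmp[4];
   _mm256_storeu_pd (tmp, c);               // tmp = { 2, 4, 6, 8 }
   return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}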

A typical algorithm suffering from latency is computing an inner product:

double sum = 0;
for (size_t i = 0; i < n; i++)
   sum += x[i] * y[i];

We have to wait until the result is available before we can start the next addition. Thus, one loop iteration takes (at least) 4 clock cycles. This is called a long dependency chain. We can utilize only \(\frac{CPI}{latency} = \frac{0.5}{4} = \frac{1}{8}\) of the possible arithmetic performance.

A vector-vector addition like

for (size_t i = 0; i < n; i++)
   x[i] += alpha * y[i];

does not suffer from the dependency chain. The next operation does not depend on the result of the previous one. Here, the operation reordering of modern CPUs comes into play: storing the result \(x[i]\) has to wait until the computation is finished, but the operations of the next iteration can already be started and executed in the meantime.

A possible optimization to overcome the dependency chain is unrolling the loop and reordering the summation:

double sum0 = 0;
double sum1 = 0;
size_t i = 0;
for ( ; i+2 <= n; i+=2) {
   sum0 += x[i] * y[i];
   sum1 += x[i+1] * y[i+1];
}
// take care of the leftover
if (i < n)
   sum0 += x[i] * y[i];
sum = sum0+sum1;

We are performing two independent fma operations per loop iteration, but have to pay the latency only once. With eight accumulators (latency / CPI = 4 / 0.5 = 8) the latency bottleneck could be overcome completely.
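
A sketch of the eight-accumulator variant, assuming the same x, y, n as above (the pairwise final reduction is one possible choice):

double sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
double sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0;
size_t i = 0;
for ( ; i+8 <= n; i+=8) {
   sum0 += x[i  ] * y[i  ];
   sum1 += x[i+1] * y[i+1];
   sum2 += x[i+2] * y[i+2];
   sum3 += x[i+3] * y[i+3];
   sum4 += x[i+4] * y[i+4];
   sum5 += x[i+5] * y[i+5];
   sum6 += x[i+6] * y[i+6];
   sum7 += x[i+7] * y[i+7];
}
// take care of the leftover
for ( ; i < n; i++)
   sum0 += x[i] * y[i];
double sum = ((sum0+sum1) + (sum2+sum3)) + ((sum4+sum5) + (sum6+sum7));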

However, not only arithmetic operations have a cost; getting data into and out of the CPU counts as well. Modern Intel CPUs can perform up to four operations per clock cycle, roughly chosen from:

  • two simd256 arithmetic operations

  • two simd256 loads and one simd256 store

  • four integer operations

  • one branch operation

A great source of information is the collection of optimization manuals by Agner Fog. Have a look at manual 3, The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers, Chapter 11: Intel Skylake pipeline.

Since we need more memory transfer operations than arithmetic operations, we cannot reach peak performance for inner products or vector additions. For inner products, e.g., the ratio of memory transfers to arithmetic operations is \(2:1\). This cannot be overcome by reordering the summation. If we have to compute inner products between several pairs of vectors, the situation changes:

double sum00 = 0;
double sum01 = 0;
double sum10 = 0;
double sum11 = 0;
for (size_t i = 0; i < n; i++) {
   sum00 += x0[i] * y0[i];
   sum01 += x0[i] * y1[i];
   sum10 += x1[i] * y0[i];
   sum11 += x1[i] * y1[i];
}

In every loop iteration we load two x-values and two y-values, and perform four fma operations. Now the ratio of memory transfers to arithmetic operations is \(1:1\). Instead of vectors of scalars, each entry can also be a 256-bit SIMD<double,4> value.

An important use of multiple inner products is matrix-matrix multiplication. Assume \(A\) is stored row-major, and \(B\) is stored column-major. Then the entries \(C_{ij}, C_{i+1,j}, C_{i,j+1}, C_{i+1,j+1}\) can be computed by simultaneous inner products of rows \(i\) and \(i+1\) of \(A\) with columns \(j\) and \(j+1\) of \(B\).
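
A sketch of this \(2 \times 2\) block, assuming rowa0 and rowa1 point to rows \(i\) and \(i+1\) of \(A\), colb0 and colb1 point to columns \(j\) and \(j+1\) of \(B\), and wa is the inner dimension (the pointer names are chosen for illustration):

double c00 = 0, c01 = 0, c10 = 0, c11 = 0;
for (size_t k = 0; k < wa; k++) {
   c00 += rowa0[k] * colb0[k];   // contributes to C(i,j)
   c01 += rowa0[k] * colb1[k];   // contributes to C(i,j+1)
   c10 += rowa1[k] * colb0[k];   // contributes to C(i+1,j)
   c11 += rowa1[k] * colb1[k];   // contributes to C(i+1,j+1)
}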

The most important case is that \(A, B, C\) are all stored with the same ordering, say row-major. Then we can multiply two rows of \(A\) with \(8\) columns of \(B\) (which are two SIMD<double,4> values) simultaneously:

SIMD<double,4> sum00 = 0;
SIMD<double,4> sum01 = 0;
SIMD<double,4> sum10 = 0;
SIMD<double,4> sum11 = 0;
for (size_t i = 0; i < wa; i++) {
   sum00 += rowa0[i] * SIMD<double,4>(colb+i*dist);
   sum01 += rowa0[i] * SIMD<double,4>(colb+i*dist+4);
   sum10 += rowa1[i] * SIMD<double,4>(colb+i*dist);
   sum11 += rowa1[i] * SIMD<double,4>(colb+i*dist+4);
}
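
Counting operations per iteration of this kernel: two scalar loads from the rows of \(A\) (broadcast to SIMD values), two SIMD loads from \(B\), and four fma operations, so memory transfers and arithmetic operations are again balanced \(1:1\). After the loop, the four accumulators hold a \(2 \times 8\) block of \(C\) that can be stored back.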

Exercises:

  • Try the examples in the file demos/simd_timings.cc containing such loops. We measure the run time and compute the GFlop rate, i.e. billions of fma instructions per second.

  • Experiment with different block sizes for the simultaneous vector updates. What are the best GFlop rates you can achieve?

  • Look at the assembly code and find the inner loops of your functions (enter make help and then make demos/simd_timings.s).

  • Include the SIMD classes in your BLA project. Implement a matrix-matrix multiplication using the best loops you found in simd_timings.cc.