7. Interfacing Lapack#
Lapack is the standard linear algebra library. It contains matrix operations, factorizations, linear systems, eigen-system calculation, and more. It is a de facto standard; highly optimized implementations are available for all major computer systems.
Some implementations are:
OpenBlas (open source, for many architectures)
Intel MKL (for Intel processors + compatible)
Accelerate framework (for Apple systems)
If none of these is available, install a Lapack implementation on your computer.
Switch to the branch lapack in ASC-bla.
These are the new lines in the CMakeLists.txt file:
find_package(LAPACK REQUIRED)
add_executable (test_lapack tests/test_lapack.cc)
target_link_libraries (test_lapack PUBLIC ${LAPACK_LIBRARIES})
Run cmake, and look for output showing BLAS and LAPACK in the file build/CMakeCache.txt. The find_package(LAPACK REQUIRED) command searches for a Lapack installation in the usual directories.
Lapack was written in Fortran and automatically translated to C using the f2c compiler. There is one C header file, clapack.h, which is included like this:
#include <complex>
typedef int integer;
typedef integer logical;
typedef float real;
typedef double doublereal;
typedef std::complex<float> singlecomplex;
typedef std::complex<double> doublecomplex;
// Windows SDK defines VOID in the file WinNT.h
#ifndef VOID
typedef void VOID;
#endif
typedef int ftnlen;
typedef int L_fp; // ?
extern "C" {
#include <clapack.h>
}
It may be better to also include f2c.h, which already provides these typedefs.
7.1. The BLAS functions#
Lapack contains basic functions for efficiently computing linear algebra operations, called Basic Linear Algebra Subprograms (BLAS). One refers to BLAS1, BLAS2, and BLAS3 operations according to the computational complexity \(O(n)\), \(O(n^2)\), or \(O(n^3)\). Vector addition and inner products are BLAS1, matrix-vector products or outer products of vectors are BLAS2, and matrix-matrix multiplication belongs to BLAS3. For example, daxpy (\(y \leftarrow \alpha x + y\)) is a BLAS1 routine, dgemv (\(y \leftarrow \alpha A x + \beta y\)) a BLAS2 routine, and dgemm (next section) a BLAS3 routine.
7.1.1. The GEMM functions#
With their origin in Fortran times, all functions have short names built from abbreviations, like Double-precision GEneral Matrix-Matrix multiplication (see the quick reference guide):
int dgemm_ (char *transa, char *transb, integer *m,
integer * n, integer *k, doublereal *alpha,
doublereal *a, integer *lda, doublereal *b,
integer *ldb, doublereal *beta, doublereal *c__,
integer *ldc);
Similarly, there are sgemm, cgemm, and zgemm for the single precision, and the single/double precision complex versions of matrix-matrix multiplication.
The functions are well documented, and easily found online, e.g. the documentation of dgemm.
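To see the calling convention in action, here is a minimal sketch of a direct dgemm_ call, using the typedefs and the clapack.h include from above; all arrays are column-major:
int main()
{
  // compute C = A * B for 2x2 matrices
  char transa = 'N', transb = 'N';
  integer m = 2, n = 2, k = 2;
  double alpha = 1.0, beta = 0.0;
  double a[4] = { 1, 2, 3, 4 };   // A = [[1,3],[2,4]], column-major
  double b[4] = { 1, 0, 0, 1 };   // B = identity
  double c[4];
  integer lda = 2, ldb = 2, ldc = 2;
  dgemm_ (&transa, &transb, &m, &n, &k, &alpha,
          a, &lda, b, &ldb, &beta, c, &ldc);
  // c now holds a copy of a
}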
Since function calls with many arguments are not very clear, we prefer to wrap the low-level Lapack call into a function based on our MatrixView classes. The matrices are provided by a pointer to the first element, and the leading distance (which is exactly our dist variable).
In Fortran/Lapack, matrices are stored column-major. The dgemm function allows one to specify whether the matrices \(A\) or \(B\) are transposed. If we have a row-major matrix, we set the transpose argument to 'T' (and otherwise to 'N' for not transposed):
// c = a*b
template <ORDERING OA, ORDERING OB>
void MultMatMatLapack (MatrixView<double, OA> a,
MatrixView<double, OB> b,
MatrixView<double, ColMajor> c)
{
char transa = (OA == ColMajor) ? 'N' : 'T';
char transb = (OB == ColMajor) ? 'N' : 'T';
integer n = c.Height();
integer m = c.Width();
integer k = a.Width();
if (n == 0 || m == 0) return;
double alpha = 1.0;
double beta = 0;
integer lda = a.Dist();
integer ldb = b.Dist();
integer ldc = c.Dist();
int errcode =
dgemm_ (&transa, &transb, &n, &m, &k, &alpha,
a.Data(), &lda, b.Data(), &ldb, &beta, c.Data(), &ldc);
if (errcode != 0)
throw std::runtime_error ("Lapack dgemm error " + std::to_string(errcode));
}
Only the matrices \(A\) and \(B\) can be marked as transposed, not the result matrix \(C\). If we want a row-major result, we use \(C^T = (AB)^T = B^T A^T\) and transpose all three matrices. Since Trans does not do actual computations (it only gives a different view of the memory), this additional wrapper has zero cost:
template <ORDERING OA, ORDERING OB>
void MultMatMatLapack (MatrixView<double, OA> a,
MatrixView<double, OB> b,
MatrixView<double, RowMajor> c)
{
MultMatMatLapack (Trans(b), Trans(a), Trans(c));
}
If we have to compute \(C = A^T B\) we can use the same technique:
MultMatMatLapack (Trans(a), b, c)
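For example (a sketch, assuming that the Matrix class from the previous chapters converts to MatrixView):
Matrix<double, ColMajor> A(100, 50), B(50, 80), C(100, 80);
// ... fill A and B ...
MultMatMatLapack (A, B, C);          // C = A*B, both factors passed as 'N'

Matrix<double, ColMajor> D(50, 80);
MultMatMatLapack (Trans(A), C, D);   // D = A^T * C, first factor passed as 'T'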
7.1.2. Syntactic sugar#
If people have the choice between coding like
C = A*B;
or
MultMatMatLapack (A, B, C);
they will still prefer the first option. It is also less prone to errors. Since efficient evaluation of the entry \((i,j)\) is not provided by Lapack, we cannot directly include Lapack into our hierarchy of expression templates. However, something like
C = A*B | Lapack;
can be easily provided. One overloads the bit-wise or operator operator|, acting on a MultExpr<MatrixView,MatrixView> and an object Lapack, which we may create like
class T_Lapack { };
static constexpr T_Lapack Lapack;
The operator| returns an object of type LapackMultExpr, which contains the factors A and B. Finally, we define an assignment operator for the class MatrixView taking such a LapackMultExpr type and calling the MultMatMatLapack function.
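A minimal sketch of this mechanism may look as follows; the accessor functions A() and B() on MultExpr are assumptions and have to be adapted to the actual expression-template classes:
template <typename TA, typename TB>
class LapackMultExpr {
  const TA & a;
  const TB & b;
public:
  LapackMultExpr (const TA & _a, const TB & _b) : a(_a), b(_b) { }
  const TA & A() const { return a; }
  const TB & B() const { return b; }
};

// A*B | Lapack wraps the two factors into a LapackMultExpr
template <typename TA, typename TB>
auto operator| (const MultExpr<TA,TB> & expr, T_Lapack)
{
  return LapackMultExpr (expr.A(), expr.B());
}

// inside class MatrixView (sketch):
// template <typename TA, typename TB>
// MatrixView & operator= (const LapackMultExpr<TA,TB> & prod)
// {
//   MultMatMatLapack (prod.A(), prod.B(), *this);
//   return *this;
// }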
7.2. LU-decomposition#
Lapack can compute an LU-decomposition of a matrix \(A\), which means a factorization of \(A\) of the form
\[ A = P L U , \]
where \(P\) is a permutation matrix, \(L\) a lower triangular matrix with unit diagonal elements, and \(U\) an upper triangular matrix. This is an \(O(n^3)\) algorithm.
Having such a factorization, the solution of a linear system \(A x = b\) can be performed via forward/backward substitution in \(O(n^2)\) operations. An \(LU\)-factorization is usually also the first step for computing the inverse \(A^{-1}\).
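Concretely, with \(A = PLU\) the system \(Ax = b\) is solved in three cheap steps:
\[ z = P^T b, \qquad L y = z \quad \text{(forward substitution)}, \qquad U x = y \quad \text{(backward substitution)}. \]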
Lapack is very careful with memory usage. When computing the \(LU\) factorization, the original matrix \(A\) is overwritten with the \(L\) and \(U\) factors. The permutation matrix \(P\) is stored as an index vector. When we compute the inverse \(A^{-1}\) from the factors \(L\) and \(U\), the resulting matrix can again be placed in the memory of the \(L\) and \(U\) factors.
A C++ wrapper for the Lapack LU-factorization may look like this. The LU-factorization is initialized with a matrix, whose memory is then used to hold the LU-factors. We can use it to solve a linear system, and also to compute the inverse. The Inverse function overwrites the factorization with the inverse matrix and returns this matrix by move semantics; the LapackLU object is destroyed afterwards.
Actually, we are computing the factorization \(A = PLU\) if \(A\) is stored column-major. If \(A\) is stored row-major, we are computing \(A^T = PLU\), i.e. \(A = U^T L^T P^T\).
template <ORDERING ORD>
class LapackLU {
Matrix <double, ORD> a;
std::vector<integer> ipiv;
public:
LapackLU (Matrix<double,ORD> _a)
: a(std::move(_a)), ipiv(a.Height()) {
integer m = a.Height();
if (m == 0) return;
integer n = a.Width();
integer lda = a.Dist();
integer info;
/*
int dgetrf_(integer *m, integer *n, doublereal *a,
integer * lda, integer *ipiv, integer *info);
*/
dgetrf_(&n, &m, &a(0,0), &lda, &ipiv[0], &info);
}
// b overwritten with A^{-1} b
void Solve (VectorView<double> b) const {
char transa = (ORD == ColMajor) ? 'N' : 'T';
integer n = a.Height();
integer nrhs = 1;
integer lda = a.Dist();
integer ldb = b.Size();
integer info;
/*
int dgetrs_(char *trans, integer *n, integer *nrhs,
doublereal *a, integer *lda, integer *ipiv,
doublereal *b, integer *ldb, integer *info);
*/
dgetrs_(&transa, &n, &nrhs, a.Data(), &lda, ipiv.data(), b.data(), &ldb, &info);
}
Matrix<double,ORD> Inverse() && {
double hwork;
integer lwork = -1;
integer n = a.Height();
integer lda = a.Dist();
integer info;
/*
int dgetri_(integer *n, doublereal *a, integer *lda,
integer *ipiv, doublereal *work, integer *lwork,
integer *info);
*/
dgetri_(&n, &a(0,0), &lda, ipiv.data(), &hwork, &lwork, &info);
lwork = integer(hwork);
std::vector<double> work(lwork);
dgetri_(&n, &a(0,0), &lda, ipiv.data(), &work[0], &lwork, &info);
return std::move(a);
}
Matrix<double,ORD> LFactor() const { ... }
Matrix<double,ORD> UFactor() const { ... }
Matrix<double,ORD> PFactor() const { ... }
};
The class LapackLU can be used as follows. Copy the matrix A via the copy constructor before it is overwritten by the LU factors:
Matrix<> A;
Matrix<> invA = LapackLU(A).Inverse();
Alternatively, reuse the memory of the matrix A2 for the LU factors; A2 has shape \((0,0)\) after calling LapackLU:
// or reuse A2
Matrix<> A2;
Matrix<> invA2 = LapackLU(std::move(A2)).Inverse(); // reuse
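A linear system with a single right-hand side is solved via the Solve method; a short usage sketch, with the Vector class assumed from the earlier chapters:
Matrix<> A3(n, n);
Vector<> b(n);
// ... fill A3 and b ...
LapackLU lu(std::move(A3));   // factorize once: O(n^3)
lu.Solve(b);                  // b is overwritten with the solution: O(n^2)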
Other examples of useful matrix decompositions are the QR-factorization, the singular value decomposition (SVD), and more; see The Big Six Matrix Factorizations.
7.3. Exercises#
Adjust the matrix-matrix multiplication function to your matrix class. Measure the performance of the Lapack matrix-matrix multiplication, and compare it to your expression template matrix-matrix multiplication. Compare GFlop rates, i.e. \(10^9\) mult-add operations per second, for various matrix sizes \(10 \times 10\), \(100 \times 100\), \(1000 \times 1000\).
#include <chrono>
#include <iostream>

size_t flops = n*n*n;
size_t runs = size_t (1e9 / flops) + 1;
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < runs; i++)
  ;   // call the multiplication function here
auto end = std::chrono::high_resolution_clock::now();
double time = std::chrono::duration<double>(end-start).count();
std::cout << "n = " << n << ", time = " << time << " s, GFlops = "
          << (flops*runs)/time*1e-9 << std::endl;
Complete the LapackLU class. Look up the documentation for the functions dgetrf, dgetrs, and dgetri.
Pick some functions of your choice. Write C++ wrappers similar to LapackLU for the following matrix operations:
QR decomposition using Householder reflections (dgeqrf, dorgqr): \(A = QR\) with \(Q\) orthogonal, and \(R\) upper triangular. By storing \(Q\) as a product of Householder reflections, \(Q\) and \(R\) together fit into the memory of \(A\).
Eigenvalues and eigenvectors of a real-valued symmetric matrix (dsyev); a possible wrapper is sketched after this list.
Eigenvalue and eigenvector calculation (dgeev) of a real-valued general matrix. Even if the matrix is real, the eigenvalues/eigenvectors are complex, in general.
SVD decomposition of a matrix (dgesvd): \(A = U \Sigma V^T\) with \(U\) and \(V\) orthogonal, and \(\Sigma\) a (rectangular) diagonal matrix.
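For the symmetric eigenvalue problem, a minimal (untested) wrapper sketch following the LapackLU pattern could look as follows; the Vector class with its Data() accessor is assumed from the earlier chapters, and the class name is illustrative:
// eigenvalues and eigenvectors of a symmetric matrix via dsyev
class LapackEigenSym {
  Matrix<double, ColMajor> a;   // input matrix, overwritten with eigenvectors (as columns)
  Vector<double> lam;           // eigenvalues, in ascending order
public:
  LapackEigenSym (Matrix<double, ColMajor> _a)
    : a(std::move(_a)), lam(a.Height())
  {
    char jobz = 'V';    // 'V': compute eigenvectors too, 'N': eigenvalues only
    char uplo = 'U';    // only the upper triangle of a is referenced
    integer n = a.Height();
    integer lda = a.Dist();
    integer info;
    // workspace query: lwork = -1 returns the optimal work size in hwork
    double hwork;
    integer lwork = -1;
    dsyev_(&jobz, &uplo, &n, a.Data(), &lda, lam.Data(), &hwork, &lwork, &info);
    lwork = integer(hwork);
    std::vector<double> work(lwork);
    dsyev_(&jobz, &uplo, &n, a.Data(), &lda, lam.Data(), work.data(), &lwork, &info);
  }
  const Vector<double> & EigenValues() const { return lam; }
  const Matrix<double, ColMajor> & EigenVectors() const { return a; }
};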
7.4. C++ libraries#
There are a couple of modern C++ linear algebra libraries. At their core, most of them fall back to the legacy Lapack for large matrices. Have a look into their documentation.