[Tuto] 夢魘のCUDA: 使用 Preconditioned Conjugate Gradient 輕鬆解決大型稀疏線性方程組

圖片來源：http://fourier.eng.hmc.edu/e176/lectures/NM/node29.html

閱讀難度 ✦✦✧✧✧

特別感謝 冠大大、王大大。

《夢魘のCUDA》是 CUDA Programming 系列，此系列不會介紹任何 CUDA 的基礎知識，而會介紹一些 CUDA 相關的應用。本篇作為《夢魘のCUDA》系列的首篇文，將會介紹既實用又不實用的兩套 CUDA 內建 Library — cuBLAS / cuSPARSE；除此之外，本篇也會講解如何使用這兩套 Library 實作出經典應用 — Preconditioned Conjugate Gradient。之所以說是經典應用的原因是因為，cuSPARSE 幾乎是為了這個應用而誕生的。

1. Introduction

Conjugate gradient method - wiki

Conjugate gradient method 是一種數值分析 (Numerical analysis) 方法，用來解決大型稀疏線性方程組 (Large-scale sparse linear systems)。其中，線性方程組的矩陣必須是對稱正定矩陣 (Symmetric positive definite matrix)。

1-1. Inner Product Space

Inner product space - wiki

首先必須知道內積 (inner product) 的定義，對一個 map function $\langle\cdot,\cdot\rangle: V\times V \to F$，其中 vector space $V$， field $F$，若滿足以下三個條件則稱這個 map function 是 inner product，其中 $x,y,z\in V$：

Conjugate symmetry： $\langle x, y \rangle=\overline{\langle y, x \rangle}$

這通常是定義給複數的 Vector space，對於實數的 Vector space 只需滿足 $\langle x, y \rangle=\langle y, x \rangle$
Linearity: $\begin{gathered} \langle ax, y \rangle=a\langle x,y\rangle\\ \langle x+y,z\rangle = \langle x,z\rangle + \langle y,z \rangle \end{gathered}$

這應該不需多做解釋
Positive-definiteness: $\langle x,x\rangle > 0, \quad x\in V \setminus\{\mathbf{0}\}$

注意 $\mathbf{0}$ 代表 Vector space $V$ 的零向量。
此外，這個性質也 imply 若 $\langle x,x\rangle=0$ 則 $x=\mathbf{0}$。

1-2. Symmetric Positive Definite Matrix

若一個 $N\times N$ 的 Real matrix $A$ 滿足以下條件，則稱 $A$ 為 Symmetric positive definite matrix：

Symmetric：$A=A^T$
Positive definite：對於任意非 $0$ 向量 $x\in\mathbb{R}^N$，滿足 $x^TAx>0$

Symmetric matrix (複數則是 Hermitian matrix) 具有以下特性：

Eigenvalues 一定為實數
一定存在一組 Orthonormal eigenvectors

Prove Eigenvalues 一定為實數：
假設 Symmetric (Hermitian) matrix $A$，假設 $\lambda$ 為 $A$ 的 Eigenvalue，以及對應的 Eigenvector $x$，則 $Ax=\lambda x$。取 Conjugate transpose $\bar{x}^TA^\dagger=\bar{x}^T\bar{\lambda}$。由於 $A$ 為 Symmetric (Hermitian) matrix，$A^\dagger=A$，因此 $\bar{x}^TA=\bar{x}^T\bar{\lambda}$。兩邊同乘 $x$ 得到：

$\bar{x}^TAx=\bar{x}^T\bar{\lambda}x$

對 $Ax=\lambda x$ 兩邊同乘 $\bar{x}$：

$\bar{x}^TAx=\bar{x}^T\lambda x$

得到 $\bar{x}^T\bar{\lambda}x=\bar{x}^T\lambda x$，若 $x\ne \mathbf{0}$，則 $\bar{\lambda}=\lambda$，因此 Eigenvalue 一定為 Real number。

Prove 一定存在一組 Orthonormal eigenvectors：
假設兩個 Symmetric (Hermitian) matrix $A$ 的 Eigenvalue $\lambda_a$ 與 $\lambda_b$，且 $\lambda_a\ne \lambda_b$，以及其分別對應的 Eigenvectors $x_a$, $x_b$，

$\begin{align*}\lambda_a \langle x_a, x_b \rangle &= \langle \lambda_ax_a, x_b \rangle\\ &= \langle Ax_a, x_b \rangle\\ &= \langle x_a, Ax_b \rangle\\ &= \langle x_a, \lambda_bx_b \rangle\\ &= \lambda_b \langle x_a, x_b \rangle \end{align*}$

得到 $\lambda_a \langle x_a, x_b \rangle=\lambda_b \langle x_a, x_b \rangle $，移項整理後得到：

$(\lambda_a-\lambda_b)\langle x_a, x_b \rangle=0$

由於 $\lambda_a \ne \lambda_b \Leftrightarrow \lambda_a-\lambda_b\ne 0$，因此得到 $\langle x_a, x_b\rangle=0$，表示 $x_a$ 與 $x_b$ 為 Orthogonal。

Symmetric positive definite matrix 有更棒的特性：Eigenvalues 一定都為正數

Prove Symmetric (Hermitian) matrix $A$ 的 Eigenvalue 都是正數 $\Leftrightarrow$ Matrix $A$ 為 Symmetric positive definite matrix：
$(\Rightarrow)$：假設 $\Lambda$ 為由 $A$ 所有 Eigenvalues 組成的 Diagonal matrix $\Lambda_{ii}=\lambda_i$，且 $\lambda_i \ne \lambda_j, i\ne j$，$Q$ 為各個 Column 由各個 Eigenvalue $\lambda_i$ 所對應的 Eigenvector $x_i$ 組成的 Matrix $\left[\begin{matrix} x_1, \dots, x_N \end{matrix}\right]$。則根據 $Ax=\lambda x$，得到：

$\begin{align*}AQ &= Q\Lambda\\A &= Q\Lambda Q^{-1}\end{align*}$

由於 $A$ 為 Symmetric matrix，因此由 Eigenvectors 組成的 $Q$ 為 Orthonomal matrix，Orthonomal matrix 有特性 $Q^{-1}=Q^T$，因此得到：

$A = Q\Lambda Q^T$

接著假設任意 Vector $x$ 屬於 Vector space，且不為 $\mathbf{0}$，則：

$x^TAx=\underbrace{x^TQ}_{y^T}\Lambda \underbrace{Q^Tx}_{y}=y^T\Lambda y=\sum_{i=0}^N \lambda_i {y_i}^2 >0$

由於 $A$ 為 Symmetric matrix，$\lambda_i>0$ 且 $y_i^2>0$ 因此得到 $x^TAx>0$，得證。

$(\Leftarrow)$ 假設 Symmetric positive definite matrix $A$ 的任意 Eigenvalue $\lambda$，以及對應的 Eigenvector $x\ne \mathbf{0}$，有 $Ax=\lambda x$。兩邊同乘 $x^T$：

$\begin{align*}x^TAx &= \lambda x^Tx\\\lambda &= \frac{x^TAx}{x^Tx} > 0\end{align*}$

由於 $A$ 為 Symmetric positive definite，因此 $x^TAx>0$。且 $x\ne\mathbf{0}$，因此 ${x^Tx} > 0$。得證。

基於這些特性，Symmetric positive definite matrix $A$ 可以定義內積。

利用 Symmetric positive definite matrix $A$ 定義一個基於 Vector space $\mathbb{R}^N$ 的內積空間 (Inner product space)，內積定義為：

$\langle x,y\rangle_A=x^TAy$

由於 Matrix $A$ 為 Symmetric，我們可以知道這個內積滿足 Inner product 的第一個條件 (Conjugate symmetry)，也就是 $x^TAy=y^TAx$。由於 $x^TAy$ 是線性的，滿足 Inner product 的第二個條件 (Linearity)，由於 $A$ 為 Positive definite，滿足 Inner product 的第三個條件 (Positive-definiteness)。因此，$ \langle \cdot,\cdot\rangle_A $ 是一個合法的內積。

產生 Symmetric positive definite matrix 非常簡單，設一個 $m\times n, m < n$ Matrix $A$，為 Full row rank，$AA^T$ 就會是 Symmetric positive definite matrix。

Prove $AA^T$ 是 Symmetric positive definite matrix：

假設 Matrix $M=AA^T$，$AA^T$ 為 Symmetric matrix。假設任意 Vector $x\in \mathbb{R}^m$ 且 $x\ne\mathbf{0}$，則：

$x^TMx=x^TAA^Tx=\|A^Tx\|>0$

故得證。

1-3. Steepest Descent

An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

設 Linear system

$Ax=b \tag{1.3.1}$

其中 $A\in\mathbb{R}^{N\times N}$ 為 Symmetric positive definite matrix， $b\in \mathbb{R}^N$ 為已知的 Vector。若要求解 vector $x$，根據傳統的解法 (高中、大學教的方法)是，首先先移項 $A$，使用高斯法求出 $A$ 的 Inverse matrix $A^{-1}$ 後，求出 $x$：

$x=A^{-1}b$

然而，對於 Large-scale sparse matrix 而言 ($N$ 非常巨大)，求 Inverse matrix 是非常困難的，因此另一種解法就是使用 Numerical method，將問題變成能夠使用 Iterative 趨近的方式來求出近似解。首先將 [式 1.3.1] 移項 $Ax-b=\mathbf{0}$，接著假設 Function $f(x)$，並令其一階微分：

$f'(x)=Ax-b \tag{1.3.2}$

根據一階微分，我們能夠得出 Quadratic form：

$f(x) = \frac{1}{2}x^TAx-b^Tx+c \tag{1.3.3}$

則目標變成：找到 $x$ 使得 $f’(x)=\mathbf{0}$。也就是說，找到 $x$，使得 $f(x)$ 為極值，且若 $A$ 為 Symmetric positive definite matrix，則該極值一定會是最小值。

證明方法非常簡單，假設 $x^* $ 滿足 $Ax^* = b$，我們只需證明添加上任意 Error $e\in \mathbb{R}^{N}\setminus {\mathbf{0}}$，一定滿足 $f(x^* + e) > f(x^*)$：

$\begin{align*} f(x^* + e) &= \frac{1}{2}(x^* + e)^TA(x^* + e)-b^T(x^* + e)+c\\ &= \frac{1}{2}(x^*)^T A x^* + e^T\underbrace{Ax^*}_{=b}+\frac{1}{2}e^TAe-b^Tx^*-b^Te+c\\ &= \underbrace{\left(\frac{1}{2}(x^*)^TAx^*-b^Tx^* + c\right)}_{=f(x^*)}+e^Tb+\frac{1}{2}e^TAe-b^Te\\ &= f(x^*)+\frac{1}{2}e^TAe > f(x^*) \end{align*}$

由於 $A$ 是 Positive definite，且 $e\ne \mathbf{0}$，因此最後一項 $\frac{1}{2}e^TAe > 0$，則可以得證當 $Ax^* =b$ 時，$f(x^*)$ 一定為最小值。因此我們將一個 Linear system 的問題變成了 Minimization problem：找到 $x$ 使得 minimize $f(x)$。

接著是要如何解這個 Minimization problem，一種方法就是使用 Steepest descent method。

這邊列出一些常用的 term：

Error: $e_i=x_i-x^*$，寫成 Iterative form，$e_{i+1}=e_i+\alpha_i r_i$
Residual: $r_i=b-Ax_i=-f’(x_i)$

Residual 與 Error 的關係式：

$r_i=b-Ax_i=Ax^*-Ax_i=A(x^*-x_i)=-Ae_i$

首先從任意一點 $x_0$ 開始，選擇負梯度方向 $-f’(x_0)=r_0$ 前進適當步長 $\alpha_0$，這個步長必須能夠在負梯度方向上最小化下一個 Step 的值 $f(x_1)$，也就是 $\alpha_0=\arg\min f(x_1)=\arg\min f(x_{0}+\alpha_0 r_0)$，通常會使用 Line search 的方式求得。求得步長後，移動到下一個位置 $x_1=x_0 + \alpha_0 r_0$，求出下一個步長 $\alpha_1$，再移動到下一個位置 $x_2=x_1+\alpha_1 r_1$ …，重複動作直到找到最佳解為止。

至於求解步長 $\alpha_i$ 可以使用求極值 (導數為 $0$) 的方式求得：

$\frac{d}{d\alpha}f(x_{i+1})=f'(x_{i+1})^T\frac{d}{d\alpha}x_{i+1}=f'(x_{i+1})^Tr_i=0$

因為 $x_{i+1}=x_i+\alpha_i r_i$。由於 $f’(x_{i+1})=-r_{i+1}$，求得 $\alpha_i$：

$\begin{align*} r^T_{i+1}r_i&=0\\ (b-Ax_{i+1})^Tr_i&=0\\ (b-A(x_i+\alpha_i r_i))^Tr_i&=0\\ (b-Ax_i)^Tr_i-\alpha_i(Ar_i)^Tr_i &= 0\\ (b-Ax_i)^Tr_i &= \alpha_i(Ar_i)^Tr_i\\ r^T_ir_i &= \alpha_i r^T_i(Ar_i)\\ \alpha_i&=\frac{r^T_ir_i}{r^T_iAr_i} \end{align*}$

最後就可以得到 Steepest descent method：

$\begin{gathered} r_i=b-Ax_i\\ \alpha_i=\frac{r^T_ir_i}{r^T_iAr_i}\\ x_{i+1}=x_i+\alpha_ir_i \end{gathered}$

Gradient descent：往 Negative gradient 方向前進任意步長 $\alpha$，此步長可以使用固定值，也可以使用估計的方式，例如：SGD 等等。

Steepest descent：往 Negative gradient 方向前進適當步長 $\alpha$，此步長必須使得下一個值在這個方向上是最小值。

2. Methodology

An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

2-1. Conjugate Gradient

Steepest descent 有個缺點在於，經常發生重複搜索同一方向 (如 Figure 8) 的情況，導致收斂速度變慢，為了解決這項問題，有人提出了 Conjugate gradient。首先假設一個 Vector set $\{p_0,p_1,\dots,p_{N-1}\}$ 代表 Search directions，且這些 Vector 兩兩互相垂直 (Orthoginal)。因此可以將每個 Step 的 $x$ 列為：

$x_{i+1}=x_i+\alpha_i p_{i} \tag{2.1.1}$

且

$x^*=x_0+\sum_{i=0}^{N-1}\alpha_i p_{i} \tag{2.1.2}$

這樣就能夠確保每個搜索方向只需要搜尋一次就能夠找到最佳。由於每個 Step $i$ 的 Search direction $p_i$ 一定會與 Error $e_{i+1}$ 垂直，因此：

$\begin{align*} p^T_ie_{i+1} &= 0\\ p^T(e_i+\alpha_ip_i) &= 0\\ \end{align*}$

Prove 每個 Step $i$ 的 Search direction $p_i$ 一定會與 Error $e_{i+1}$ 垂直：

因為 $e_{i+1}=x_{i+1}-x^*$ 且 $x^*=x_{i+1}+\sum_{j=i+1}^{N-1}\alpha_jp_j$，因此 $e_{i+1}=-\sum_{j=i+1}^{N-1}\alpha_jp_j$，也就是說，$e_{i+1}$ 是 $\{p_{i+1}, \dots, p_{N-1}\}$ 的 Linear combination，而 $p_{i}$ 與所有$\{p_{i+1}, \dots, p_{N-1}\}$ 互為 Orthogonal，因此可以得證 $p_{i}$ 與 $e_i$ 一定是 Orthogonal。

如上根據 Error 的 Iterative form 展開後，可以算出：

$\alpha_i=-\frac{p^T_ie_i}{p^T_ip_i}$

然而，尷尬的是我們沒有辦法求出 $e_i$，因為 $e_i=x_i-x^*$，如果可以求得 $e_i$ 那麼 $x^*$ 就已經算出來了。避免掉這個窘境的方式是改變假設 Search directions 兩兩互為 $A$-orthogonal ($A$-conjugate)，而非 Orthogonal：

$p^T_iAp_j=0, \quad i\ne j$

這樣假設的原因是因為，如果 $e_i$ 算不出來，只需要對 $e_i$ 乘上 $A$，就會得到 $Ae_i=Ax_i-Ax^*=Ax_i-b$，如此就可以避免掉 $x^*$ 的問題。又因 $Ae_i=-r_i$，因此同 Steepest descent 求 $\alpha$ 可以得到：

$\begin{align*} \frac{d}{d\alpha}f(x_{i+1}) &= 0\\ f'(x_{i+1})^T\frac{d}{d\alpha}x_{i+1} &= 0\\ -r^T_{i+1}p_i &= 0\\ (Ae_{i+1})^Tp_i=d_i^TAe_{i+1}&=0 \end{align*}$

因 $e_{i+1}=e_i+\alpha_ip_i$，因此：

$d^T_iA(e_i+\alpha_ip_i) = 0$

移項整理後求得 $\alpha_i$：

$\begin{align*} \alpha_i &= -\frac{p^T_iAe_i}{p^T_iAp_i}\\ &=\frac{p^T_ir_i}{p^T_iAp_i} \tag{2.1.3} \end{align*}$

(能夠想到這種方法的人肯定腦力發揮 100%)

接下來的問題在於，要如何產生 Mutually $A$-orthogonal vectors $\{p_0,\dots,p_{N-1}\}$，可以使用 Gram-Schmidt process，來 Iteratively 產生互相 Orthogonal 的向量。

Gram–Schmidt process - wiki

假設一組 Linearly independent set $\{u_0, \dots, u_{N-1}\}$，Gram-Schmidt process 利用 Projection 的方式來製造出兩兩互相 Orthogonal 的 Vectors，原理非常簡單：

首先兩個 Linearly independent vectors $u_0, u_1$，先將 $u_1$ Project 到 $u_0$ 上算出投影向量 $\text{proj}_{u_0}(u_1)$，接著再將 $u_1$ 扣掉 $\text{proj}_{u_0}(u_1)$ 得到的就會是與 $u_0$ Orthogonal 的 Vector $u_1-\text{proj}_{u_0}(u_1)$。此時我們只需要令 $p_0=u_0$、$p_1=u_1-\text{proj}_{d_0}(u_1)$ 就可以得到兩個 Mutually orthogonal vectors $p_0, p_1$ 以同樣方式推廣至 $i$：

$\begin{align*} &p_0=u_0\\ &p_1=u_1-\text{proj}_{p_0}(u_1)\\ &p_2=u_2-\text{proj}_{p_0}(u_2)-\text{proj}_{p_1}(u_2)\\ &p_i=u_i-\sum_{j=0}^{i-1}\text{proj}_{p_j}(u_i) \tag{2.1.4} \end{align*}$

以此類推就可以得到一組 Mutually orthogonal vectors $\{p_0,\dots,p_{N-1}\}$，然而我們需要的是 $A$-orthogonal 的 vectors，因此可以定義 Projection 為 $A$ 內積空間的 Projection：

$\text{proj}_{p_j}(u_i)=\frac{\langle u_i, p_j\rangle_A}{\langle p_j, p_j\rangle_A}p_j = \frac{u_i^TAp_j}{p_jAp_j}p_j \tag{2.1.5}$

至於要如何找到 Linearly independent set $\{u_0, u_1, \dots, u_{N-1}\}$ 一個很直接的方式就是使用現成的 Residual $r$。

我們知道 Residual 與 Error 的關係式 $r_j=-Ae_j$，而 $e_j=x_j-x^*$，$x_j=x_0+\sum_{k=0}^{j-1}\alpha_kp_k$ 且 $x^*=x_0+\sum_{k=0}^{N-1}\alpha_kp_k$，全部展開來後得到：

$\begin{align*} r_j &= -A\left( (x_0+\sum_{k=0}^{j-1}\alpha_kp_k)-(x_0+\sum_{k=0}^{N-1}\alpha_kp_k) \right)\\ &= -A(-\sum_{k=j}^{N-1}\alpha_kp_k)\\ &= \sum_{k=j}^{N-1}\alpha_kAp_k \end{align*}$

由此可以看出，Residual $r_i$ 是由 $\{Ap_{i+1}, \dots, Ap_{N-1}\}$ 組成的 Linear combination，而 $\{Ap_{i+1}, \dots, Ap_{N-1}\}$ 是 Linearly independent set，因此可以知道 $\{r_0, \dots, r_{N-1}\}$ 也一定會是 Linearly independent set。

此外，這邊如果對兩邊同時乘上 $p_i$ 就會發生一件很有趣的事情，其中 $i<j$ ，由於 $p_i$ 與其他 $p_j, \dots, p_{N-1}$ 為 $A$-orthogonal，因此得到：

$p_i^Tr_j=\sum_{k=j}^{N-1}\alpha_kp_i^TAp_k=0, \quad i<j \tag{2.1.6}$

使用 Residual $r$ 代替 $u$ 後，令 $\beta$ 為 Projection 的 Coefficient 項：

$\beta_{ij}=-\frac{r^T_iAp_j}{p^T_jAp_j}$

因此 [式 2.1.4] 變成：

$p_i=r_i+\sum_{j=0}^{i-1}\beta_{ij}p_j \tag{2.1.7}$

將 [式 2.1.7] 兩邊同乘 $r_j$，得到：

$p^T_ir_j=r_i^Tr_j+\sum_{k=0}^{i-1}\beta_{ik}p_k^Tr_j$

由 [式 2.1.6] 可以知道當 $i<j$ 時 $p^T_ir_j=0$ 且 $p_k^Tr_j=0$ 因為 $k\le i-1<i<j$，因此：

$r_i^Tr_j=0, \quad i\ne j \tag{2.1.8}$

可以得出一個結論：Residual $r$ 兩兩互為 Orthogonal。

除此之外，若 $i=j$ 時，$p^T_ir_i \ne 0$ 但 $p_k^Tr_i=0$ 因為 $k\le i-1<i$，因此得到：

$p^T_ir_i=r_i^Tr_i$

代入到 [式 2.1.3] 的 $\alpha_i$，得到新的 $\alpha_i$：

$\alpha_i = \frac{p^T_ir_i}{p^T_iAp_i} = \frac{r^T_ir_i}{p^T_iAp_i} \tag{2.1.9}$

已知 $r_{j+1}=r_j-\alpha_jAp_j$，對等號兩邊乘上 $r_i$，整理後同除 $\alpha_j$：

$\begin{align*} r^T_i r_{j+1} &= r^T_ir_j-\alpha_jr^T_iAp_j\\ \alpha_jr^T_iAp_j &= r^T_ir_j - r^T_i r_{j+1}\\ r^T_iAp_j &= \frac{1}{\alpha_j}(r^T_ir_j - r^T_i r_{j+1}) \end{align*}$

根據 [式 2.1.8]，可以得到，只有兩個狀況下 $r^T_iAp_j$ 會有值：

$r^T_iAp_j=\begin{cases} \frac{1}{\alpha_i}r^T_ir_j, \quad& i=j\\ -\frac{1}{\alpha_{i-1}}r^T_i r_{i}, \quad& i=j+1\\ 0, \quad&\text{otherwise} \end{cases}$

將兩邊同除以 $-p_j^TAp_j$，得到 $\beta_{ij}$，由於在 Gram-Schmidt process 中 $\beta_{ij}$ 只存在 $i<j$ 這個 Case，不存在 $i=j$ 的 Case，因此只剩下一種 Case $i=j+1$，剩下其他 Case 都為 $0$：

$\beta_{ij}=-\frac{r^T_iAp_j}{p_j^TAp_j}=\begin{cases} \frac{1}{\alpha_{i-1}}\frac{r^T_ir_i}{p_{i-1}^TAp_{i-1}}, \quad &i=j+1\\ 0, \quad&\text{otherwise} \end{cases}$

將 [式 2.1.9] 的 $\alpha_{i-1}$ 代入，最後得到新的 $\beta$：

$\begin{align*} \beta_i &= \frac{p^T_{i-1}Ap_{i-1}}{r^T_{i-1}r_{i-1}}\frac{r^T_ir_i}{p_{i-1}^TAp_{i-1}}\\ &= \frac{r^T_ir_i}{r^T_{i-1}r_{i-1}} \tag{2.1.10} \end{align*}$

結果 $p_i$ 一整個不見了，全部變成 Residual $r$。完全出乎意料。
寫到這邊為止的我：

最後將全部放在一起得到完整的 Conjugate Gradient：

$\begin{gathered} p_0=r_0=b-Ax_0\\ \alpha_i = \frac{r^T_ir_i}{p^T_iAp_i}\\ x_{i+1}=x_i+\alpha_ip_i\\ r_{i+1}=r_i-\alpha_iAp_i\\ \beta_{i+1}=\frac{r^T_{i+1}r_{i+1}}{r^T_ir_i}\\ p_{i+1}=r_{i+1}+\beta_{i+1}p_i \end{gathered}$

第一行 $p_0=r_0=b-Ax_0$ 是 Initial condition，之後就是持續 Iterate 直到 Residual $|r_{i+1}|\to 0$ 就結束 Iteration。通常可以另外設定一個 Tolerance 參數 $\epsilon \to 0$，當 $|r_{i+1}| < \epsilon$ 時結束 Conjugate gradient。

接下來就是 Convergence Analysis 與 Complexity 分析的部分，雖然這部分才是真正 Numerical Analysis 要做的事情，但礙於篇幅且不是很重要，在這邊就直接忽略，有興趣的可以自行研究 Ch. 9 Convergence Analysis / Ch. 10 Complexity。這本書真的寫得很棒，有機會一定要看完！(我在說我)

2-2. Preconditioned Conjugate Gradient

如果有看 Ch. 9 Convergence Analysis / Ch. 10 Complexity 的話應該就會知道，Conjugate gradient 的收歛性取決於 Matrix $A$ 的 Condition number $\kappa(A)$ (Iteration 與 $\sqrt{\kappa(A)}$ 成正比)，$\kappa(A)$ 小則容易收斂，$\kappa(A)$ 大則不容易收斂。而 Condition number 定義為 $\kappa(A)=\frac{\lambda_\text{max}}{\lambda_\text{min}}\ge 1$，其中 $\lambda$ 為 $A$ 的 Eigenvalue。(等我學會了 Numerical Analysis 再來補充)

在某些情況下，我們會希望 Conjugate Gradient 能夠再收斂的更快，例如：對於 $N$ 為 Million 等級的 Matrix $A$，我們不希望 Conjugate Gradient 執行 100 萬次才收斂到解答，而是希望在 1000 次內就可以得到近似解。這種情況下我們就會需要對原本的 Linear system 進行 Preconditioning，使得新的 Linear system 能夠更快收斂。

假設 Linear system 不容易收斂 (也有人稱做 ill-conditioned system) $Ax=b$，我們希望可以找到新的 Linear system $\hat{A}\hat{x}=\hat{b}$，且新的 Linear system 能夠收斂的更快。換句話說，希望可以找到一個 $\hat{A}$，滿足 $\kappa(\hat{A})\ll \kappa(A)$。一種方法就是選擇一個 Non-singular matrix $M$，並套用到 Linear system：

$M^{-1}Ax=M^{-1}b$

使得 $\kappa(M^{-1}A)\ll \kappa(A)$，同時 $M^{-1}A$ 必須維持 Symmetric positive definite matrix，$M$ 就稱為 Preconditioner。若要維持 Symmetric positive definite matrix，最簡單的方式就是 $M$ 同為 Symmetric positive definite matrix，為了確保 $M$ 為 Symmetric positive definite matrix，我們也可以利用另一個 Non-singular matrix $E$，且令 $M=EE^T$ 來表示 $M$，如此就可以確保 $M$ 一定是 Symmetric positive definite matrix (證明請看 [Sec 1-2])

Non-singular matrix $M$ 表示 $M$ 是 Invertible，且具有 Inverse matrix $M^{-1}$。

至於要去哪裡找到如此好的 $M$，聰明如你，最直覺的方式就是選擇 $M=A$，如此一來，$M^{-1}A=A^{-1}A=I$，我們就可以確保 $\kappa(M^{-1}A)=\kappa(I)=1\ll\kappa(A)$，同時 $I$ 也是 Symmetric positive definite matrix。然而，事實卻沒有這麼簡單，原因是因為，如果取 $M=A$ 的話，勢必要先計算出 $A^{-1}$ 才能套用進新的 Linear system $A^{-1}Ax=A^{-1}b$，但問題就在於，就是因為 $A$ 的 Inverse matrix $A^{-1}$ 很難求才會需要使用 Conjugate gradient 的方式來求解，要不然 Linear system $Ax=b$ 早就解完了。因此，只能限定 $M\approx A$，且 $M$ 必須更容易計算 Inverse。

最終歸納出 $M$ 必須具有以下條件，只要 $M$ 具有以下條件，就可以大幅縮減 Condition number：

$\kappa(M^{-1}A)\ll\kappa(A)$，有一說是 $M^{-1}A$ 的 Eigenvalues 必須要更 Clustered，因為 $\kappa=\frac{\lambda_\text{max}}{\lambda_\text{min}}$
$M^{-1}A$ 是 Symmetric positive definite matrix
$M\approx A \Longrightarrow M^{-1}A\approx A^{-1}A= I$
$M$ Invertible 且比 $A$ 更容易計算 Inverse matrix

接著就是如何求解新的 Linear system，一種方式是直接計算出 $M^{-1}$ 後，將新的 $\hat{A}=M^{-1}A$，$\hat{b}=M^{-1}b$ 直接帶入 Conjugate gradient。另一種方式是將 $M$ 分解 $M=EE^T$ 且 $E$ 可能是比 $M$ 還要更容易計算 Inverse 的 Matrix (例如: Triangular matrix)。如果使用第二種 Case 的話，可以將 Linear system 拆解並轉換成新的 Linear system $\hat{A}\hat{x} = \hat{b}$：

$\begin{align*} M^{-1}Ax &= M^{-1}b\\ E^{-1}AE^{-T}x &= E^{-1}E^{-T}b\\ (E^{-1}AE^{-T})(E^Tx) &= (E^{-T}E^T)(E^{-1}b)\\ \hat{A}\hat{x} &= \hat{b} \end{align*}$

其中 $\hat{A}=E^{-1}AE^{-T}$，$\hat{x}=E^Tx$，$\hat{b}=E^{-1}b$。帶入 Conjugate gradient 後得到新的 Conjugate gradient (其實就是全部加上 Hat)：

$\begin{gathered} \hat{p}_0=\hat{r}_0=\hat{b}-\hat{A}\hat{x}_0\\ \hat{\alpha}_i = \frac{\hat{r}^T_i\hat{r}_i}{\hat{p}^T_i\hat{A}\hat{p}_i}\\ \hat{x}_{i+1}=\hat{x}_i+\hat{\alpha}_i\hat{p}_i\\ \hat{r}_{i+1}=\hat{r}_i-\hat{\alpha}_i\hat{A}\hat{p}_i\\ \hat{\beta}_{i+1}=\frac{\hat{r}^T_{i+1}\hat{r}_{i+1}}{\hat{r}^T_i\hat{r}_i}\\ \hat{p}_{i+1}=\hat{r}_{i+1}+\hat{\beta}_{i+1}\hat{p}_i \end{gathered}$

接下來就是將 Hat 都展開來。計算出新的 Residual $\hat{r}_i$ 與舊 Residual $r_i$ 的關係：

$\hat{r}_i= \hat{b}-\hat{A}\hat{x}_i=E^{-1}b-(E^{-1}AE^{-T})(E^Tx)=E^{-1}r_i$

再由 [式 2.1.2] 可以得到：

$\hat{x}=E^Tx=E^T(x_0+\sum_{i=0}^{N-1}\alpha_ip_i)=E^Tx_0+\sum_{i=0}^{N-1}\alpha_iE^Tp_i$

因此定義 $\hat{p}_i=E^Tp_i$，則 $\hat{\alpha}_i$ 可以展開：

$\begin{align*} \hat{\alpha}_i &= \frac{\hat{r}^T_i\hat{r}_i}{\hat{p}^T_i\hat{A}\hat{p}_i}\\ &= \frac{(E^{-1}r_i)^T(E^{-1}r_i)}{(E^Tp_i)^T(E^{-1}AE^{-T})(E^Tp_i)}\\ &= \frac{r_i^T(E^{-T}E^{-1})r_i}{p_i^T(EE^{-1})A(E^{-T}E^T)p_i}\\ &= \frac{r_i^TM^{-1}r_i}{p_i^TAp_i} \end{align*}$

接下來將 $\hat{x}_{i+1} = \hat{x}_i+\hat{\alpha}_i\hat{p}_i$ 展開得到：

$\begin{align*} \hat{x}_{i+1} &= \hat{x}_i+\hat{\alpha}_i\hat{p}_i\\ E^Tx_{i+1} &= E^Tx_i + \hat{\alpha}_iE^Tp_i\\ E^Tx_{i+1} &= E^T(x_i + \hat{\alpha}_ip_i)\\ x_{i+1} &= x_i + \hat{\alpha}_ip_i \end{align*}$

接下來 $\hat{r}_{i+1}=\hat{r}_i-\hat{\alpha}_i\hat{A}\hat{p}_i$ 也展開：

$\begin{align*} \hat{r}_{i+1} &= \hat{r}_i-\hat{\alpha}_i\hat{A}\hat{p}_i\\ E^{-1}r_{i+1} &= E^{-1}r_i-\hat{\alpha}_i(E^{-1}AE^{-T})(E^Tp_i)\\ E^{-1}r_{i+1} &= E^{-1}r_i-\hat{\alpha}_iE^{-1}A(E^{-T}E^T)p_i\\ E^{-1}r_{i+1} &= E^{-1}(r_i-\hat{\alpha}_iAp_i)\\ r_{i+1} &= r_i-\hat{\alpha}_iAp_i \end{align*}$

接下來展開 $\hat{\beta}_{i+1}$：

$\begin{align*} \hat{\beta}_{i+1} &= \frac{\hat{r}^T_{i+1}\hat{r}_{i+1}}{\hat{r}^T_i\hat{r}_i}\\ &= \frac{(E^{-1}r_{i+1})^T(E^{-1}r_{i+1})}{(E^{-1}r_i)^T(E^{-1}r_i)}\\ &= \frac{r^T_{i+1}(E^{-T}E^{-1})r_{i+1}}{r^T_{i}(E^{-T}E^{-1})r_{i}}\\ &= \frac{r^T_{i+1}M^{-1}r_{i+1}}{r^T_{i}M^{-1}r_{i}} \end{align*}$

接下來展開 $\hat{p}_{i+1}=\hat{r}_{i+1}+\hat{\beta}_{i+1}\hat{p}_i$：

$\begin{align*} \hat{p}_{i+1} &= \hat{r}_{i+1}+\hat{\beta}_{i+1}\hat{p}_i\\ E^Tp_{i+1} &= E^{-1}r_{i+1}+\hat{\beta}_{i+1}E^Tp_i\\ E^{-T}E^Tp_{i+1} &= (E^{-T}E^{-1})r_{i+1} + \hat{\beta}_{i+1}E^{-T}E^Tp_i\\ p_{i+1} &= M^{-1}r_{i+1} + \hat{\beta}_{i+1}p_i \end{align*}$

最後一件最重要的事情就是展開 $\hat{p}_0=\hat{r}_0$：

$\begin{align*} \hat{p}_0 &= \hat{r}_0\\ E^Tp_0 &= E^{-1} r_0\\ p_0 &= E^{-T}E^{-1} r_0\\ p_0 &= M^{-1}r_0 \end{align*}$

全部整理起來得到：

$\begin{align*} r_0&=b-Ax_0\\ p_0 &= M^{-1}r_0\\ \alpha_i &= \frac{r_i^TM^{-1}r_i}{p_i^TAp_i}\\ x_{i+1} &= x_i + \alpha_ip_i\\ r_{i+1} &= r_i-\alpha_iAp_i\\ \beta_{i+1} &= \frac{r^T_{i+1}M^{-1}r_{i+1}}{r^T_{i}M^{-1}r_{i}}\\ p_{i+1} &= M^{-1}r_{i+1} + \hat{\beta}_{i+1}p_i \end{align*}$

接著令 $z_i=M^{-1}r_i$ 得到 Preconditioned conjugate gradient：

$\begin{align*} r_0 &= b-Ax_0\\ z_0 &= M^{-1}r_0\\ p_0 &= z_0\\ \alpha_i &= \frac{r_i^Tz_i}{p_i^TAp_i}\\ x_{i+1} &= x_i + \alpha_ip_i\\ r_{i+1} &= r_i-\alpha_iAp_i\\ z_{i+1} &= M^{-1}r_{i+1}\\ \beta_{i+1} &= \frac{r^T_{i+1}z_{i+1}}{r^T_{i}z_{i}}\\ p_{i+1} &= z_{i+1} + \hat{\beta}_{i+1}p_i \end{align*}$

結果發現其實根本不需要 $E$ 的我：

2-3. Incomplete-Cholesky Preconditioned Conjugate Gradient (ICCG)

Preconditioner $M$ 可以有很多種，在上一節中我們討論出 $M$ 必須具有以下特性：

$\kappa(M^{-1}A)\ll\kappa(A)$
$M^{-1}A$ 是 Symmetric positive definite matrix
$M\approx A \Longrightarrow M^{-1}A\approx A^{-1}A=I$
$M$ Invertible 且比 $A$ 更容易計算 Inverse matrix

但根據 Preconditioned conjugate gradient 計算 $z_{i+1}=M^{-1}r_{i+1}$，其實可以發現不一定要計算出 $M^{-1}$。假若 $M$ 可以被分解為 $M=EE^T$，帶入公式後得到

$\begin{align*} z_{i+1}&=(EE^T)^{-1}r_{i+1}\\ z_{i+1}&= E^{-1}\underbrace{(E^{-T} r_{i+1})}_y\\ z_{i+1}&= E^{-1}y \end{align*}$

令 $y=E^{-T}r_{i+1}$，原本需要計算出 Inverse matrix $M^{-1}$ 變成只須計算 $E^{-1}$ 並分別求解 $y=E^{-T}r_{i+1}$ 與 $z_{i+1} = E^{-1}y$，就能能夠計算出 $z_{i+1}$。因此若有辦法將 $M$ 分解成更容易計算 Inverse 的 Matrix $E$ 那將會一大福音。正好，Cholesky 認為任何 Symmetric (Hermitian) positive definite matrix $M$ 都能夠分解為 Lower triangular matrix $L$ 及其 Transpose $L^T$ (若非 Real matrix 則為 Conjugate transpose)，也就是 $M=LL^T$，這個方法稱作 Cholesky factorization。而剛好 Triangular matrix 非常好求 Inverse。

Prove: 若 $M$ 為 Symmetric positive definite matrix，則存在唯一的 Cholesky factorization $LL^T$，其中 $L$ 是 Lower triangular matrix，且 Diagonal element 皆為正。

(等我想到再補充)

然而，為了滿足 $M\approx A$ 而直接取 $M=A$ 計算 Cholesky factorization 還是會遇到同樣 $A^{-1}$ 的問題，除此之外，對於 Sparse matrix ($n\ll N\times N$，其中 $n$ 為非零項的數量) 而言， Cholesky factorization 是一大問題：經過分解後會爆出一堆非零項，使得 Sparse matrix 變得不是 Sparse。因此，解決方法就是使用 Incomplete-Cholesky factorization，來維持 Matrix 的 Sparsity。Incomplete-Cholesky factorization 在對 $A$ 計算 Cholesky factorization 的時候會將原本 $A$ 為 $0$ 的項維持為 $0$，只針對非零項計算，因此最後的結果會與原先的 Matrix 擁有相同的 Sparsity，同時可以得到一個很好的 Approximation $M=LL^T\approx A$。

2-4. ICCG Algorithm

Recall 前幾節的結果，Incomplete-Cholesky preconditioned conjugate gradient：

$\begin{align*} r_0 &= b-Ax_0\\ L &= \text{Preconditioning}(A)\\ z_0 &= L^{-1}L^{-T}r_0\\ p_0 &= z_0\\ \alpha_i &= \frac{r_i^Tz_i}{p_i^TAp_i}\\ x_{i+1} &= x_i + \alpha_ip_i\\ r_{i+1} &= r_i-\alpha_iAp_i\\ z_{i+1} &= L^{-1}L^{-T}r_{i+1}\\ \beta_{i+1} &= \frac{r^T_{i+1}z_{i+1}}{r^T_{i}z_{i}}\\ p_{i+1} &= z_{i+1} + \hat{\beta}_{i+1}p_i \end{align*}$

假設 Initial guess $x_0=\mathbf{0}$，可以得到：

$\begin{align*} x_0 &= 0\\ r_0 &= b \end{align*}$

轉換成演算法：

基本上這個演算法是一步一步根據推出來的公式計算，最標準的流程。維基百科也是使用這個順序。但為了後續方便使用 cuBLAS/cuSPARSE 實做，我比較習慣稍微調動一些順序變成下圖：

但基本上這兩個演算法做的事情是完全相同的。

3. cuBLAS & cuSPARSE

cuBLAS 與 cuSPARSE 都是 CUDA 內建的 Library，但是不同版本 CUDA 的 cuBLAS 與 cuSPARSE API 或多或少會有一些差異，這篇將以 CUDA 11.0 版本為主，此外也會加入一些 CUDA 8.0 的資料作為補充。

3-1. cuBLAS Introduction

cuBLAS 是以 CUDA 實作的 BLAS (Basic Linear Algebra Subprograms) library。提供了一些常用計算 Matrix/Vector 相關的 High-level API。例如：Matrix/vector copy、sort、dot product、multiplication 等等，另外也有提供對於特殊類型矩陣 (symmetric、triangular、hemitian) 做過優化的 API。

cuBLAS 使用 column-major，所以在儲存 Array 時需要注意順序。此外，在 Fortran 中是使用 1-based indexing，但在 C/C++ 中是使用 0-based indexing。

使用 cuBLAS 時首先需要在程式碼中 Include cublas_v2.h，後面的 v2 代表是新的 API，而舊的 API cublas.h 則是 CUDA 4.0 以前的 API。

1	#include <cublas_v2.h>

編譯時需要加上 Library 的 Link

使用 nvcc Compile 需要加上 -lcublas：

1	nvcc hello_cublas.cu -o hello_cublas -lcublas

使用 Visual Studio 則是從 Property 加：
- 首先到 CUDA C/C++ 的地方，在 Additional Include Directories 地方加上 $(CudaToolkitIncludeDir)。這是為了讓 Visual Studio 找的到 cuBLAS 的 header file。
- 接著到 CUDA Linker 的地方，在 Additional Librbary Directories 地方加上 $(CudaToolkitLibDir)，並在 Additional Dependencies 地方加上 cublas.lib

這樣就可以正常編譯 cuBLAS。

需要注意的是 Visual Studio 的 CUDA C/C++ 與 CUDA Linker 頁籤是對應 .cu File，如果想要在一般 .c 或 .cpp File 使用 cuBLAS 則必須在 VC++ Directories 與 Linker 下的 Input 頁籤做相同的處理，否則會出現 Link error。

3-2. cuBLAS APIs

cuBLAS :: CUDA Toolkit Documentation

在使用 cuBLAS 的 API 前，需要先呼叫 cublasCreate 建立 cuBLAS 的 Handle，有 Handle 才能正常使用 cuBLAS API，且在呼叫 cuBLAS API 時，Handle 必須傳入 Function。最後程式結束時必須呼叫 cublasDestroy 來刪除 cuBLAS Handle。如下：

int main(int argc, char **argv)
{
    cublasHandle_t cubHandle;
    cublasStatus_t cubStat;
    
    // Initialize cuBLAS
    cubStat = cublasCreate( &cubHandle );
    
    ...
    
    // Destroy cuBLAS handle
    cubStat = cublasDestroy(cubHandle);
    
    return 0;
}

cubStat 用來呼叫的 Function call 是否有正常執行，如果 API 有發生錯誤的話，Error code 會儲存在內，詳細可以參考官方的 Document，每個 Function call 底下都會寫有可能回傳的 Error code。

接下來是建立 Vector，使用 cudaMalloc 在 GPU 建立空間後，將資料複製到 GPU 有兩種方式：

第一種是使用 CUDA 原生的 cudaMemcpy 複製到 GPU 上
第二種是使用 cuBLAS API cublasSetVector/cublasGetVector 來存取 GPU 上的空間

不管使用哪個 API，最後都必須 Free memory。

double vec[] = {/*...*/}; // Vector in Host memory 
double *d_vec;  // Device pointer
cudaMalloc(d_vec, N*sizeof(double));
// Set vector
cudaMemcpy(d_vec, vec, N * sizeof(double),
                  cudaMemcpyHostToDevice);
cublasSetVector(N, sizeof(double), vec, 1, d_vec, 1);

// Get vector
cudaMemcpy(vec, d_vec, N * sizeof(double),
                  cudaMemcpyDeviceToHost);
cublasGetVector(N, sizeof(double), d_vec, 1, vec, 1);

// Free memory
cudaFree(d_vec);

使用 cuBLAS API cublasSetVector/cublasGetVector 的好處在於，他有提供使用者設定 increment 的功能，也就是複製每個資料間要間隔多少資料，這在從一個 Matrix 中複製出 Column vector 或 Row vector 時非常實用。

接下來是 Matrix type，同樣使用 cudaMalloc 方式建立 GPU 空間，同 Vector，可以使用 cudaMemcpy 或是使用 cuBLAS API cublasSetMatrix/cublasGetMatrix 存取 Matrix。最重要的是，他是 Column-major！因此 Host memory 與 Device memory 都必須儲存成 Column-major 的形式。

cudaMalloc(&d_A, N*M*sizeof(float));

// copy Matrix A from host to device
cublasSetMatrix(M, N, sizeof(float), A, M, d_A, M);

// copy from device to host
cublasGetMatrix(M, N, sizeof(float), d_A, M, A, M);

// Free memory
cudaFree(d_A);

cuBLAS API cublasSetMatrix/cublasGetMatrix 有兩個參數 lda 與 ldb，分別代表 Source 與 Destination Matrix 的 Leading dimension。

接下來是介紹 cuBLAS 的 Operation API，cuBLAS 將 API 分成 3 個 Level：

Level-1：與 Vector 相關，或是 Vector-vector operations。例如，min、max、sum、copy、dot product、euclidean norm 等等。
Level-2：Matrix-vector operations。例如，Matrix-vector multiplication 等等。
Level-3：Matrix-matrix operations。例如，Matrix-matrix multiplication 等等。

根據不同的資料型態，API 的名稱也會不一樣：

Type	Notation `<t>`	Meaning
`float`	‘s’ or ‘S’	Real single-precision
`double`	‘d’ or ‘D’	Real double-precision
`cuComplex`	‘c’ or ‘C’	Complex single-precision
`cuDoubleComplex`	‘z’ or ‘Z’	Complex double-precision

舉例來說， Level-1 的 Function cublas<t>axpy，計算 Vector $x$ 與 Vector $y$ 的加法，$y\gets\alpha x+y$

Level-2 的 Function cublas<t>gemv，計算 Matrix $A$ 與 Vector $x$ 的乘法，$y\gets\alpha \text{OP}(A)x+\beta y$

其中 $\text{OP}$ 是在計算前，cuBLAS 會對 Matrix $A$ 套用的 Operation：

OP	Meaning
`CUBLAS_OP_N`	The non-transpose operation is selected
`CUBLAS_OP_T`	The transpose oeration is selected
`CUBLAS_OP_C`	The conjugate transpose operation is selected

Level-3 的 Function cublas<t>gemm 計算 Matrix $A$ 與 Matrix $B$ 的矩陣乘法，$C\gets\alpha\text{OP}(A)\text{OP}(B)+\beta C$

Level-3 除了一般的 Matrix-matrix multiplication 之外，cuBLAS 還有提供額外的 API 給特殊的 Matrix 型態使用：
cublasFillMode_t 用來指定是 Upper triangle 還是 Lower triangle。

`cublasFillMode_t`	Meaning
`CUBLAS_FILL_MODE_LOWER`	The lower part of the matrix is filled
`CUBLAS_FILE_MODE_UPPER`	The upper part of the matrix is filled

例如 cublas<t>symm，計算 Symmetric matrix-matrix multiplication，如果你的 Matrix 是 Symmetric 就可以使用這個 API 來進行更有效率的計算，則 cuBLAS 在運算時其實只需要取一半的值 (Upper triangle 或是 Lower triangle) 出來就可以了，另外一半是對稱的，這樣在 GPU memory 的存取上會比較有效率。

另外還有像是 cublas<t>trmm 計算 Triangle matrix-matrix multiplication。

cublasDiagType_t 代表 Matrix 的對角線是不是 Unit (Unit matrix 的那個 unit)，如果是的話，cuBLAS 在運算時就會忽略對角線。

`cublasDiagType_t`	Meaning
`CUBLAS_DIAG_NON_UNIT`	The matrix diagonal has non-unit elements
`CUBLAS_DIAG_UNIT`	The matrix diagonal has unit elements

除此之外，cuBLAS 還有提供 Stream API 可以進行 Asynchronous 的計算，詳細請看 document。

3-3. Hello cuBLAS

這節將會使用 cuBLAS 示範 Matrix-matrix multiplication。首先寫好基本的 Framework：

#include <iostream>
#include <cuda.h>
#include <cublas_v2.h>

// You can find the source code here:
// https://gist.github.com/Ending2015a/4eb30e7665d91debc723d9c73afec821
#include "error_helper.hpp"

int main(int argc, char **argv)
{
    cublasHandle_t cubHandle;
    
    // Initialize cuBLAS
    error_check(cublasCreate( &cubHandle ));
    
    // Get cuBLAS version information
    int major_version, minor_version;
    
    error_check(cublasGetProperty( MAJOR_VERSION, &major_version ));
    error_check(cublasGetProperty( MINOR_VERSION, &minor_version ));
    
    // Say hello to cuBLAS
    std::cout << "Hello cuBLAS!" << std::endl
              << "* major version: " << major_version << std::endl
              << "* minor version: " << minor_version << std::endl;
    
    // TODO: Matrix-matrix multiplication
    
    // Destroy cuBLAS handle
    error_check(cublasDestroy(cubHandle));
    return 0;
}

這段 Code 在一開始建立了 cubHandle，接著向 cuBLAS 打招呼，順便印出 cuBLAS 的版本資訊，最後將 cubHandle 摧毀，結束程序。

接下來是撰寫 Matrix-matrix multiplication，$AB=C$，其中 $A\in\mathbb{R}^{N\times M}$，$B\in\mathbb{R}^{M\times P}$，$C\in\mathbb{R}^{N\times P}$。

// TODO: Matrix-matrix multiplication
const int N = 3;
const int M = 4;
const int P = 2;

float a[N*M];
float b[M*P];
float c[N*P];

// Create matrix A, B in row-major order
for(int i=0;i<N*M;++i)
{
    a[i] = (float)(i+1); //1, 2, 3, 4, ~
}

for(int i=0;i<M*P;++i)
{
    b[i] = (float)(M*P-i); //8, 7, 6, 5, ~
}

宣告 Dimension，宣告 Host memory 上的 matrix，需要注意的是在這邊我是使用 row-major 的方式初始化 Matrix，後面有點小 Trick。印出來看：

void print_matrix(float *mat, const int N, const int M)
{
    for(int i=0;i<N*M;++i)
    {
        std::cout << mat[i];
        if((i+1)%M==0)
            std::cout << std::endl;
        else
            std::cout << ", ";
    }
}

//print matrix
std::cout << "A: " << std::endl;
print_matrix(a, N, M);
std::cout << std::endl;
std::cout << "B: " << std::endl;
print_matrix(b, M, P);
std::cout << std::endl;

接著是宣告、分配 Device memory 空間，使用 cuBLAS API 複製 Matrix data 到 device：

float *device_a;
float *device_b;
float *device_c;

error_check(cudaMalloc(&device_a, N*M*sizeof(float)));
error_check(cudaMalloc(&device_b, M*P*sizeof(float)));
error_check(cudaMalloc(&device_c, N*P*sizeof(float)));

// copy host matrix 'a' to device matrix 'device_a'
error_check(cublasSetMatrix(M, N, sizeof(float), a, M, device_a, M));
error_check(cublasSetMatrix(P, M, sizeof(float), b, P, device_b, P));

這邊因為之後打算使用 cublasSgemm 這個 API 計算矩陣乘法：

$C\gets\alpha\text{OP}(A)\text{OP}(B) + \beta C$

cublasSgemm 會要求兩個常數 $\alpha$ 與 $\beta$：

1 2	const float alpha = 1.0f; const float beta = 0.0f;

直接分別設為 $1$ 與 $0$，計算就會 reduce 成 $C\gets\text{OP}(A)\text{OP}(B)$。

接下來就是 Tricky 的地方。做 Matrix-matrix multiplication：

1
2
3

error_check(cublasSgemm(cubHandle, CUBLAS_OP_N, CUBLAS_OP_N, P, N, M,
                        &alpha, device_b, P, device_a, M,
                        &beta, device_c, P));

這邊要注意 Tricky 的地方。因為 Row-major 與 Column-major 只差在一次 Transpose，因此根據 $(AB)^T=B^TA^T$ 的特性，Row-major 的 matrix b 與 matrix a 在 Column-major 下必須倒過來相乘。在 row-major 下，我們的 Matrix a 與 b 會長這樣：

$a= \left[\begin{matrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8\\ 9 & 10 & 11 & 12\\ \end{matrix}\right], \quad b= \left[\begin{matrix} 8 & 7 \\ 6 & 5 \\ 4 & 3 \\ 2 & 1 \\ \end{matrix}\right]$

但是 cuBLAS 是 column-major，因此在 cuBLAS 看來矩陣其實長這樣：

$a= \left[\begin{matrix} 1 & 5 & 9 \\ 2 & 6 & 10 \\ 3 & 7 & 11 \\ 4 & 8 & 12\\ \end{matrix}\right], \quad b= \left[\begin{matrix} 8 & 6 & 4 & 2\\ 7 & 5 & 3 & 1\\ \end{matrix}\right]$

所以在計算 $AB$ 時，必須將 $A$ 與 $B$ 對調，變成 $BA$，如此一來就會計算出 Transpose 過的 $C$

$c=\left[\begin{matrix} 8 & 6 & 4 & 2\\ 7 & 5 & 3 & 1\\ \end{matrix}\right] \left[\begin{matrix} 1 & 5 & 9 \\ 2 & 6 & 10 \\ 3 & 7 & 11 \\ 4 & 8 & 12\\ \end{matrix}\right] = \left[\begin{matrix} 40 & 120 & 200\\ 30 & 94 & 158\\ \end{matrix}\right]$

因為 cuBLAS 是 column-major，因此直接將 device_c copy 回 c，在 row-major 下就會 Transpose 回來，得到正確的值。

// device to host
error_check(cublasGetMatrix(P, N, sizeof(float), device_c, P, c, P));

// print answer
std::cout << "C: " << std::endl;
print_matrix(c, N, P);
std::cout << std::endl;

在 row-major 下，c 為：

$c= \left[\begin{matrix} 40 & 30 \\ 120 & 94 \\ 200 & 158\\ \end{matrix}\right]$

最後不要忘記 Free memory：

// Free memory
cudaFree(device_a);
cudaFree(device_b);
cudaFree(device_c);

~~如果能夠做到這種程度，代表你對 row-major/column-major 關係很熟了~~

3-4. cuSPARSE Introduction

cuSPARSE 是 cuBLAS 的 Sparse version。在 cuBLAS 中使用的都是 Dense vector/matrix，而 cuSPARSE 則提供了一系列 API 用來從事 Sparse vector/matrix 相關的計算。且除了 Matrix/Vector operation 之外，cuSPARSE 也有提供許多求解 Sparse linear system 相關的 API。

Sparse matrix 有一個特點就是矩陣非常大，其中包含了許多 $0$ 項，如果在 GPU 上儲存這些 $0$ 項其實是非常浪費空間的，因此 cuSPARSE 提供了幾種 Data format 來儲存 Sparse matrix，節約 Memory 使用量：

~~這邊我懶得重寫，直接貼我的 Slide~~

COO (Coordinate Format)
這是最為直接的表示方法，儲存矩陣中每個非 $0$ 項的位置：
CSR (Compressed Sparse Row Format)
這個表示法對 Row 方向的資料進行了壓縮，他紀錄的是 Row 開始的 Entry 的 index。
CSC (Compressed Sparse Column Format)
這個表示法則是對 Col 進行壓縮
另外還有 BSR 與 BSRX 但由於不常使用，就不多介紹

Sparse vector 使用兩個 Array 來表示，第一個 Array 儲存 Non-zero term，第二個 Array 則儲存 Non-zero term 的 Index。

使用 cuSPARSE 的方式與 cuBLAS 類似，先 Include cusparse.h (這邊不用 v2)，Compile 時同樣要加上 Link：

使用 nvcc：

1	nvcc hello_cusparse.cu -o hello_cusparse -lcusparse

使用 Visual Studio，在 Additional Dependencies 加上 cusparse.lib

就可以正常 Compile。

3-5. cuSPARSE API

cuSPARSE :: CUDA Toolkit Documentation

在使用 cuSPARSE 時，需要呼叫 cusparseCreate 建立 cuSPARSE 的 Handle cusparseHandle_t，在最後程式結束前則需呼叫 cusparseDestroy Destroy：

int main(int argc, char **argv)
{
    cusparseHandle_t cusHandle;
    
    // Initialize cuSPARSE
    cusparseCreate( &cusHandle );
    
    ...
    
    // Destroy cuSPARSE handle
    cusparseDestroy(cusHandle);
    return 0;
}

建立 Sparse matrix 時，則需要根據 Data format (COO、CSR、CSC) 建立對應的 Data，如：COO format 需要建立三個 Array：

cooValA 儲存 Non-zero term
cooRowIndA 儲存 Non-zero term 的 row index
cooColIndA 儲存 Non-zero term 的 column index。

如果想要建立 CSR 或 CSC format 但是不會 (或是不方便) 建立 Row/Column index 話，可以先建立 COO format，使用 cuSPARSE 提供的 Format Conversion API cusparse<t>coo2csr 做 Data format 的轉換。也可以直接從 Dense matrix，使用 cusparse<t>dense2csc / cusparse<t>dense2csr 轉換。

與 cuBLAS 不同的地方在於，cuSPARSE 在宣告 Matrix 時，必須建立該 Matrix 的 Descriptor cusparseMatDescr_t。cusparseMatDescr_t 包含了大略以下資料：

typeof struct {
    cusparseMatrixType_t MatrixType;
    cusparseFillMode_t FillMode;
    cusparseDiagType_t DiagType;
    cusparseIndexBase_t IndexBase;
} cusparseMatDescr_t;

`cusparseDiagType_t`	Meaning
`CUSPARSE_DIAG_TYPE_NON_UNIT`	The matrix diagonal has non-unit elements.
`CUSPARSE_DIAG_TYPE_UNIT`	The matrix diagonal has unit elements.

`cusparseFillMode_t`	Meaning
`CUSPARSE_FILL_MODE_LOWER`	The lower triangular part is stored.
`CUSPARSE_FILL_MODE_UPPER`	The upper triangular part is stored.

`cusparseIndexBase_t`	Meaning
`CUSPARSE_INDEX_BASE_ZERO`	The base index is zero.
`CUSPARSE_INDEX_BASE_ONE`	The base index is one.

在過去的版本中 (CUDA 8.0) 為了節約 Memory，針對特殊的 Matrix (Symmetric、Triangular、Hermitian) 可以在 cusparseMatrixType_t 指定。如果有指定特殊的 Matrix 的話，cuSPARSE 會只使用 Lower triangle 或 Upper triangle 半邊的 Data 做計算，端看 cusparseFillMode_t 是指定 Lower 還是 Upper。如此一來，使用者可以只需要儲存半邊的 Matrix。

`cusparseMatrixType_t`	Meaning
`CUSPARSE_MATRIX_TYPE_GENERAL`	The matrix is general.
`CUSPARSE_MATRIX_TYPE_SYMMETRIC`	The matrix is symmetric. (Deprecated)
`CUSPARSE_MATRIX_TYPE_HERMITIAN`	The matrix is Hermitian. (Deprecated)
`CUSPARSE_MATRIX_TYPE_TRIANGULAR`	The matrix is triangular. (Deprecated)

然而在 CUDA 11.0 中，官方不再鼓勵使用者使用 General 以外的 Matrix Type，原因是在許多計算中 (Matrix-vector multiplication、Preconditioners、Lienar system solvers 等等)，使用非 General type 的計算耗費時間會是 General type 的 10 倍慢**，因此在 CUDA 11.0 中，大部分的 API 不再支援非 General type 的 Matrix。因此使用者在指定 Matrix type 時直接填寫 CUSPARSE_MATRIX_TYPE_GENERAL 即可。

以下示範如何宣告一個 cusparseMatDescr_t：

// Create descriptor for the matrix A
cusparseMatDescr_t descr_A = 0;
cusparseCreateMatDescr(&descr_A);

// Set properties of the matrix A
cusparseSetMatType(descr_A, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr_A, CUSPARSE_INDEX_BASE_ZERO);

// Free descriptor
cusparseDestroyMatDescr(descr_A);

cuSPARSE 同樣有提供 3 個 Level 的 API：

Level-1：Sparse vector operation (在 CUDA 11.0 中已列為 Deprecated)
Level-2：Sparse matrix-dense/sparse vector operation 以及求解 Matrix-vector sparse linear system 相關 API
Level-3：Sparse matrix-dense matrix operation 以及求解 Matrix-matrix sparse linear system

需要注意的是 Level-2 與 Level-3 求解 Sparse linear system 的 API 有限定 Matrix 必須是 Triangular，如果矩陣非 Triangular 則只會使用 Lower triangle 的 Data，Upper triangle 忽略。

注意到，Level-3 是 Sparse matrix 對 dense matrix operation，Sparse matrix 對 sparse matrix 的 operation 則是放在 Extra API，但其中也有一些 API 已經列為 Deprecated。而多數 API 被列為 Deprecated 的原因是因為，CUDA 11.0 決定要將一些常用的 Multiplication operation 做整合，因此提供了另一種型態的 API —- Generic API。

使用 Generic API 的方法與 cuBLAS 大不相同。首先需要建立對應 Data format 的 Descriptor (不同於 cusparseMatDescr_t)。例如，如果要建立 CSR format 的 Sparse matrix 則必須呼叫對應的 API 來建立 Descriptor：

cusparseSpMatDescr_t mat_A;
cusparseCreateCsr(&mat_A, N, N, nnz, d_rowPtr, d_colIdx, d_A, 
                 CUSPARSE_INDEX_32I, // row index data type (int)
                 CUSPARSE_INDEX_32I, // col index data type (int) 
                 CUSPARSE_INDEX_BASE_ZERO, // 0-based index
                 CUDA_R_64F);  // Data type (real double-floating point)

建立 Dense vector：

1 2	cusparseDnVecDescr_t vec_x; cusparseCreateDnVec(&vec_x, N, d_x, CUDA_R_64F);

建立完後，呼叫 Generic API 時都是使用 Descriptor 來代表 Vector/Matrix。例如 cusparseSpMV 計算 Sparse-matrix 對 dense-vector 的 Multiplication：

$y\gets \alpha \text{OP}(A)x+\beta y$

計算分成兩個階段，首先使用者必須自行呼叫 cusparseSpMV_bufferSize 估算 cusparseSPMV 所需的額外 Buffer 空間大小，之後自行呼叫 cudaMalloc 建立後，再呼叫 cusparseSpMV 換成計算。

double alpha = 1;
double beta = 0;
size_t buf_size = 0;
double *d_buf;

// Compute buffer size in computing matrix-vector multiplocation
cusparseSpMV_bufferSize(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, 
                        &alpha, mat_A, vec_x, 
                        &beta, vec_y, 
                        CUDA_R_64F, 
                        CUSPARSE_CSRMV_ALG1, 
                        &buf_size);

// Allocate buffer memory
cudaMalloc(&d_buf, buf_size);

// Perform matrix-vector multiplocation
cusparseSpMV(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, 
             &alpha, mat_A, vec_x, 
             &beta, vec_y, 
             CUDA_R_64F, 
             CUSPARSE_CSRMV_ALG1, 
             d_buf));

// Free buffer
cudaFree(d_buf);

最後需要呼叫對應的 API 來銷毀 Generic API 的 descriptor

// Destroy descriptor
cusparseDestroySpMat(mat_A);
cusparseDestroyDnVec(vec_x);
cusparseDestroyDnVec(vec_y);

求解 Sparse linear system 在 CUDA 11.0 中也有大幅度的變更，原先在 CUDA 8.0 中，求解 Sparse linear system 時 cuSPARSE 會自己建立刪除額外的 Buffer，但從 CUDA 11.0 開始也需要使用者自行處理 Buffer。除此之外，Sparse linear system 只支援 Triangular matrix，若 Matrix 不是 Triangular，cuSPARSE 會忽略除了 Lower triangle 以外的 Data。

求解 Sparse linear system 前需要先建立對應 Data format 的 Infomation object，接著分成兩個階段 (Phase)，首先先執行 Analysis phase 分析 Matrix 型態，之後再執行 Solve phase 解出 Linear system。以下示範使用 cusparse<t>csrsv2 求解 Sparse linear system $\text{OP}(A)y=\alpha x$：

// Create infomation object
csrsv2Info_t info_A;
cusparseCreateCsrsv2Info(&info_A);

int buf_size;
double *d_buf;
// Compute buffer size for solving linear system
cusparseDcsrsv2_bufferSize(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          N, nnz, descr_A,  //descr_A: cusparseMatDescr_t
                          d_A, d_rowIdx, d_colIdx,
                          info_A, &buf_size);

// Allocate buffer memory
cudaMalloc(&d_buf, buf_size);

// Analysis phase
cusparseDcsrsv2_analysis(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                        N, nnz, descr_A,  //descr_A: cusparseMatDescr_t
                        d_A, d_rowIdx, d_colIdx,
                        info_A, CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf);

double alpha = 1;
// Solve phase
cusparseDcsrsv2_solve(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                     N, nnz, &alpha, descr_A, //descr_A: cusparseMatDescr_t
                     d_A, d_rowIdx, d_colIdx,
                     info_A, d_x, d_y, 
                     CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf);

// Free buffer
cudaFree(d_buf);
// Destroy information object
cusparseDestroyCsrsv2Info(info_A);

求解 Linear system 這類 API 在 CUDA 11.0 時有新增一個新的參數cusparseSolvePolicy_t：

`cusparseSolvePolicy_t`	Meaning
`CUSPARSE_SOLVE_POLICY_NO_LEVEL`	No level information is generated and used.
`CUSPARSE_SOLVE_POLICY_USE_LEVEL`	Generate and use level information.

在 Document 裡，每個 API 底下的解說都有寫這個參數的用途，但並沒有說的很詳細，我也不清楚這個參數實際的用途。簡而言之，這個參數能夠提升某些運算的效率，本身並不影響計算的結果。

除了這些 API 之外，cuSPARSE 還有提供其他額外的 API，例如 Preconditioner、Format conversion 等等，詳細請看 Document。

3-6. Hello cuSPARSE

這邊範例將會介紹如何使用 cuSPARSE 計算 Sparse matrix dense vector multiplication。首先建立基礎 Framework。

#include <iostream>

#include <cuda_runtime.h>
#include <cusparse.h>
#include <cublas_v2.h>

// You can find the source code here:
// https://gist.github.com/Ending2015a/4eb30e7665d91debc723d9c73afec821
#include "error_helper.hpp"

int main(int argc, char **argv)
{
    // Initialize cuBLAS / cuSPARSE
    cublasHandle_t cubHandle = 0;
    error_check(cublasCreate(&cubHandle));
    
    cusparsehandle_t cusHandle = 0;
    error_check(cusparseCreate(&cusHandle));
    
    // TODO: Sparse matrix dense vector multiplication
    
    // Destroy handles
    cusparseDestroy(cusHandle);
    cublasDestroy(cubhandle);
    
    return 0;
}

接下來在 Host 建立一個 Sparse matrix 與一個 Dense vector，這邊隨便寫一些 Hash 函數來產生亂數：

// random hash (2D to 1D)
double hash21(double x, double y)
{
    x = (x*24.36854 + y*15.75427 + 43.454614);
    y = (x*57.5654 + y*21.1435 + 37.159636);
    x = sin(x+y)*516.918738574;
    return ((int)((-1. + 2. * fabs(x - (long)x))*100.))*0.1;
}

// create tridiagonal matrix in column-major order
void createDenseMatrix(double **o_mat, int N)
{
    double *mat = new double[N*N]{};
    
    for(int x=0;x<N;++x)
    {
        for(int y=0;y<N;++y)
        {
            if(x == y)
                mat[y+x*N] = hash21(x, y);
            else if( abs(x-y) == 1 )
                mat[y+x*N] = hash21(x, y);
        }
    }

    *o_mat = mat;
}

// create random vector
void createDenseVector(double **o_vec, int N)
{
    double *vec = new double[N]{};

    for(int x=0;x<N;++x)
    {
        vec[x] = hash21(x, -x+10);
    }

    *o_vec = vec;
}

// print matrix
void print_matrix(double *mat, const int N)
{
    for(int i=0;i<N;++i)
    {
        for(int j=0;j<N;++j)
        {
            if(j != 0) std::cout << ", ";
            std::cout << std::fixed << std::setprecision(2) << mat[i + j*N];
        }
        std::cout << std::endl;
    }
}

// print vector
void print_vector(double *vec, const int N)
{
    for(int i=0;i<N;++i)
    {
        if(i != 0) std::cout << ", ";
        std::cout << std::fixed << std::setprecision(2) << vec[i];
    }
    std::cout << std::endl;
}

這邊不需要太在意寫了甚麼，基本上就是用盡各種辦法建立了一個 Sparse matrix 跟 Dense vector。

接著回到 TODO 的地方繼續完成 Matrix vector multiplication。建立 Matrix 跟 Vector，copy 到 GPU 上：

// TODO: Sparse matrix dense vector multiplication

// Generate random dense tridiagonal matrix A (column-major)
int N = 5;
double *A = 0;
createDenseMatrix(&A, N);

// Print matrix
std::cout << "My matrix 'A': " << std::endl;
print_matrix(A, N);
std::cout << std::endl;

// Generate random dense vector b
double *b = 0;
createDenseVector(&b, N);

// Print matrix
std::cout << "My vector 'b': " << std::endl;
print_vector(b, N);
std::cout << std::endl;

// Device pointer (Dense)
double *d_A = 0;
double *d_b = 0;
double *d_c = 0;

// Allocate GPU memory
error_check(cudaMalloc(&d_A, sizeof(double) * N * N));
error_check(cudaMalloc(&d_b, sizeof(double) * N));
error_check(cudaMalloc(&d_c, sizeof(double) * N));

// you can also use cudaMemcpy/cudaMemcpy2D as well.
error_check(cublasSetMatrix(N, N, sizeof(double), A, N, d_A, N));
error_check(cublasSetVector(N, sizeof(double), b, 1, d_b, 1));

建立 Matrix A 的 Description：

// Create descriptor of the matrix A
cusparseMatDescr_t descr_A = 0;
error_check(cusparseCreateMatDescr(&descr_A));

// Set properties of the matrix A
error_check(cusparseSetMatType(descr_A, CUSPARSE_MATRIX_TYPE_GENERAL));
error_check(cusparseSetMatIndexBase(descr_A, CUSPARSE_INDEX_BASE_ZERO));

接下來要將 Dense matrix A 轉換成 Sparse 的 Format，因此要先計算 Matrix A 的 Non-zero term 數量，這個地方可以使用 cuSPARSE API cusparseDnnz 計算，之後再使用 cusparseDdense2csr 將 Dense format 的 Matrix A 轉換成 CSR format。

// Count non-zero terms
int nnz = 0;
int *d_nnz_perRow = 0;
error_check(cudaMalloc(&d_nnz_perRow, N * sizeof(int)));
error_check(cusparseDnnz(cusHandle, CUSPARSE_DIRECTION_ROW, N, 
                        N, descr_A, d_A, 
                        N, d_nnz_perRow, &nnz));
// Print message
std::cout << "Total number of non-zero terms in dense matrix A = " << nnz << std::endl;

// Convert dense format to csr format
double *d_csrValA = 0;
int *d_rowPtrA = 0;
int *d_colIdxA = 0;
error_check(cudaMalloc(&d_csrValA, nnz * sizeof(double)));
error_check(cudaMalloc(&d_rowPtrA, (N+1) * sizeof(int)));
error_check(cudaMalloc(&d_colIdxA, nnz * sizeof(int)));
error_check(cusparseDdense2csr(cusHandle, N, N, 
                            descr_A, d_A, 
                            N, d_nnz_perRow,
                            d_csrValA, d_rowPtrA, d_colIdxA));

接著因為要使用 Generic API cusparseSpMV 計算 Matrix vector multiplication，因此要先建立 Descriptors：

// Create Generic API descriptor
cusparseSpMatDescr_t mat_A;
error_check(cusparseCreateCsr(&mat_A, N, N, nnz,
    d_rowPtrA, d_colIdxA, d_csrValA,
    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F));

cusparseDnVecDescr_t vec_b;
error_check(cusparseCreateDnVec(&vec_b, N, d_b, CUDA_R_64F));

cusparseDnVecDescr_t vec_c;
error_check(cusparseCreateDnVec(&vec_c, N, d_c, CUDA_R_64F));

接下來就是計算 Matrix vector multiplication，先呼叫 cusparseSpMV_bufferSize 計算計算所需的 Buffer 空間，接著呼叫 cusparseSpMV 完成計算：

size_t buf_size;
double *d_buf;
double alpha = 1.0;
double beta = 0.0;

error_check(cusparseSpMV_bufferSize(cusHandle, 
                                    CUSPARSE_OPERATION_NON_TRANSPOSE,
                                    &alpha, mat_A, vec_b, 
                                    &beta, vec_c, CUDA_R_64F, 
                                    CUSPARSE_CSRMV_ALG1, &buf_size));
// Allocate buffer
cudaMalloc(&d_buf, buf_size);

error_check(cusparseSpMV(cusHandle, 
                         CUSPARSE_OPERATION_NON_TRANSPOSE,
                         &alpha, mat_A, vec_b, 
                         &beta, vec_c, CUDA_R_64F, 
                         CUSPARSE_CSRMV_ALG1, d_buf));

最後將解答從 GPU memory 複製到 Host，並印出答案：

// device to host
double *c = new double[N]{};
error_check(cublasGetVector(N, sizeof(double), d_c, 1, c, 1));

// print answer
std::cout << "Answer: " << std::endl;
print_vector(c, N);
std::cout << std::endl;

程式最後將所有資源 Free 掉：

delete[] A;
delete[] b;
delete[] c;
cudaFree(d_A);
cudaFree(d_b);
cudaFree(d_c);
cudaFree(d_nnz_perRow);
cudaFree(d_csrValA);
cudaFree(d_rowPtrA);
cudaFree(d_colIdxA);
cudaFree(d_buf);

// Free handles, descriptors, 
cusparseDestroyMatDescr(descr_A);
cusparseDestroySpMat(mat_A);
cusparseDestroyDnVec(vec_b);
cusparseDestroyDnVec(vec_c);

成功的的話就會算出答案：
-33.48, 52.97, -47.62, 82.27, 3.97

4. Implementation

這章節將解說如何使用 cuBLAS/cuSPARSE 實做 Incomplete-Cholesky preconditioned conjugate gradient。

4-1. Frameworks

#include <iostream>
#include <iomanip>
#include <fstream>
#include <cmath>

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>
    
int main(int argc, char **argv)
{
    std::string inputPath = "testcase/size1M/case_1M.in"; // Input file path
    std::string answerPath = "testcase/size1M/case_1M.out"; // Answer file path
    
    /* TODO */
    
    return 0;
}

首先要處理的就是 Matrix 的讀寫。

Input：使用 COO 格式
- $N$ (32-bit int)：表示 Matrix $A$ 大小為 $N\times N$，Vector $b$ 維度為 $N$
- $\text{nz}$ (32-bit int)：表示矩陣 $A$ 具有的 Non-zero term 數量
- 3-tuple 有 $\text{nz}$ 個：
  - $i$ (32-bit int)：表示 Row index (0-based)
  - $j$ (32-bit int)：表示 Column index (0-based)
  - $A_{ij}$ (64-bit float)：表示 Element $A_{ij}$ 的值
- $N$ 個 64-bit float 代表 Vector $b$

Output：
- $N$ (32-bit int)：表示 Vector $x$ 的維度
- $N$ 個 64-bit float 代表 Vector $x$

因此讀檔的部份：

// Read testcase
void read(std::string filePath,
        int *pN, int *pnz,
        double **cooVal,
        int **cooRowIdx, int **cooColIdx,
        double **b)
{
    std::ifstream in(filePath, std::ios::binary);

    in.read((char*)pN, sizeof(int));  // read N
    in.read((char*)pnz, sizeof(int)); // read nz

    // Create array
    *cooVal = new double[*pnz]{};
    *cooRowIdx = new int[*pnz]{};
    *cooColIdx = new int[*pnz]{};
    *b = new double[*pN]{};

    // read each element Aij
    for (int i = 0; i < *pnz; ++i)
    {
        in.read((char*)&(*cooRowIdx)[i], sizeof(int));
        in.read((char*)&(*cooColIdx)[i], sizeof(int));
        in.read((char*)&(*cooVal)[i], sizeof(double));
    }

    // read b
    in.read((char*)(*b), sizeof(double)*(*pN));
}

// Read answer
void readAnswer(std::string filePath,
                int *pN, double **x)
{
    std::ifstream in(filePath, std::ios::binary);

    in.read((char*)pN, sizeof(int));

    *x = new double[*pN]{};

    in.read((char*)(*x), sizeof(double)*(*pN));
}

讀檔這邊因為會需要更改傳入的參數，因此使用 Call by reference。接下來回到 main：

// in `main` function
int N;
int nz;
double *A;
int *rowIdxA;
int *colIdxA;
double *b;
read(inputPath, &N, &nz, &A, &rowIdxA, &colIdxA, &b);

double *ans_x;
readAnswer(answerPath, &N, &ans_x);

// Print message
std::cout << "N = " << N << std::endl;
std::cout << "nz = " << nz << std::endl;

讀檔到這邊就完成了，由於讀進來的 Matrix $A$ 是 COO，需要轉換成 CSR，因此需要先初始化 cuBLAS/cuSPARSE：

// Create handles
cublasHandle_t cubHandle;
cusparseHandle_t cusHandle;

error_check(cublasCreate(&cubHandle));
error_check(cusparseCreate(&cusHandle));

接著分配 Matrix $A$ 在 GPU 上的 Memory，並複製一份過去：

// Allocate GPU memory & copy matrix/vector to device
double *d_A;
int *d_rowIdxA; // COO
int *d_rowPtrA; // CSR
int *d_colIdxA;
double *d_b;

error_check(cudaMalloc(&d_A, nz * sizeof(double)));
error_check(cudaMalloc(&d_rowIdxA, nz * sizeof(int)));
error_check(cudaMalloc(&d_rowPtrA, (N + 1) * sizeof(int)));  // (N+1) !!!!
error_check(cudaMalloc(&d_colIdxA, nz * sizeof(int)));
error_check(cudaMalloc(&d_b, N * sizeof(double)));

error_check(cudaMemcpy(d_A, A, nz * sizeof(double), cudaMemcpyHostToDevice));
error_check(cudaMemcpy(d_rowIdxA, rowIdxA, nz * sizeof(int), cudaMemcpyHostToDevice));
error_check(cudaMemcpy(d_colIdxA, colIdxA, nz * sizeof(int), cudaMemcpyHostToDevice));
error_check(cudaMemcpy(d_b, b, N * sizeof(double), cudaMemcpyHostToDevice));

這邊需要注意 CSR 格式是針對 RowIdx 做壓縮，壓縮後的大小為 $N+1$，在 Document 上有寫。接著就是 Call cusparseXcoo2csr 做轉換：

1
2
3

// Convert matrix A from COO format to CSR format
error_check(cusparseXcoo2csr(cusHandle, d_rowIdxA, nz, N,
                    d_rowPtrA, CUSPARSE_INDEX_BASE_ZERO));

接著就是建立 ICCGsolver 跟 Call solve，內容的部份會在 4-2 節實作：

// Create conjugate gradient solver
int max_iter = 1000;
double tolerance = 1e-12;

ICCGsolver solver(max_iter, tolerance, cubHandle, cusHandle);

// Print message
std::cout << "Solving linear system..." << std::endl;
bool res = solver.solve(N, nz, d_A, d_rowPtrA, d_colIdxA, d_b);

std::cout << (res ? "Converged!": "Failed to converge") << std::endl;

這邊我讓 solver.solve 回傳 true 如果 solver 成功在 max_iter Iteration 內收斂 (小於 tolerance) 的話，否則回傳 false。接下來就是將解答從 GPU 摳回 Host memory，並驗證結果：

double *x = new double[N] {};
error_check(cudaMemcpy(x, solver.x_ptr(), N * sizeof(double), cudaMemcpyDeviceToHost));

double tol = 0;
for (int i = 0; i < N; ++i)
{
    tol += x[i] - ans_x[i];
}

// print message
std::cout << "Solved in " << solver.iter_count() << " iterations, final norm(r) = "
        << std::scientific << solver.err() << std::endl;

std::cout << "Total error (compared with ans_x): " << tol << std::endl;

最後要記得把 Memory 都 Free 掉：

// Free Host memory
delete[] A;
delete[] rowIdxA;
delete[] colIdxA;
delete[] b;
delete[] ans_x;
delete[] x;

// Free Device memory
cudaFree(d_A);
cudaFree(d_rowIdxA);
cudaFree(d_rowPtrA);
cudaFree(d_colIdxA);
cudaFree(d_b);

// Free handles
cublasDestroy(cubHandle);
cusparseDestroy(cusHandle);

到這邊 main 就完成了，下一節就是實作 ICCGsolver。

4-2. Main Algorithm

先定義 class ICCGsolver：

class ICCGsolver
{
private:
    cublasHandle_t cubHandle;
    cusparseHandle_t cusHandle;
    
    cusparseMatDescr_t descr_A;
    cusparseMatDescr_t descr_L;
    
    // host data
    int N = 0;
    int nz = 0;
    int max_iter;
    int k;  // k iteration
    double tolerance;
    double alpha;
    double beta;
    double rTr;
    double pTq;
    double rho;    //rho{k}
    double rho_t;  //rho{k-1}
    const double one = 1.0;  // constant
    const double zero = 0.0;   // constant
    
    // device data
    double *d_ic = nullptr;  // Factorized L
    double *d_x = nullptr;
    double *d_y = nullptr;
    double *d_z = nullptr;
    double *d_r = nullptr;
    double *d_rt = nullptr;
    double *d_xt = nullptr;
    double *d_q = nullptr;
    double *d_p = nullptr;
    
    bool release_cusHandle = false;
    bool release_cubHandle = false;
    
public:
    ICCGsolver(int max_iter = 1000, double tol = 1e-12,
        cublasHandle_t cub_handle = NULL,
        cusparseHandle_t cus_handle = NULL);

    ~ICCGsolver();
    
    bool solve(int N, int nz,
        double *d_A, int *d_rowIdx, int *d_colIdx,
        double *d_b);
    
    double *x_ptr();
    double err();
    int iter_count();
    
private:
    void allocate_nz_memory();
    void allocate_N_memory();
    void free_nz_memory();
    void free_N_memory();
    
    void check_and_resize(int N, int nz);
}

只要有前綴 d_* 表示他是指向 GPU device 的 Pointer。

首先是 Line 4、5 宣告 Handle，接著是因為我們會用到兩個 Matrix 分別為 $A$ 與 Factorized matrix $L$，因此宣告兩個 Descriptor。
11~23 是計算時會用到的變數。
25~34 則是 GPU 計算實用到的變數。
36~37 用來紀錄 Handler 需要需要自動釋放，如果 Handler 是使用者宣告 ICCGsolver 時傳入的則使用者要自行釋放，如果不是則自動釋放。
40~42 是 Constructor，第一個參數 max_iter 表示 Iteration 的上限，tol 表示 Tolerance，cub_handle 與 cus_handle 可傳入可不傳入。
44 是 Destructor，要 Free 掉所有 GPU memory。
46~48 是解方程的 API，N、nz 分別表示 Matrix 大小以及 Non-zero term 數量，d_A、d_rowIdx、d_colIdx 為 CSR format 的 Matrix $A$，d_b 是 Vector $b$。
50 用來取得 d_x Pointer，可以利用這個 Pointer 將答案複製出來。
51 用來取得誤差值 $|r_k|$。
52 用來取得 Iteration count $k$。
55~60 用來分配、釋放 GPU memory。

首先是完成 Constructor 與 Destructor：

ICCGsolver::ICCGsolver(int max_iter, double tol,
    cublasHandle_t cub_handle, cusparseHandle_t cus_handle) 
    : max_iter(max_iter), tolerance(tol), 
      cubHandle(cub_handle), cusHandle(cus_handle)
{
    // create cuBLAS handle
    if (cubHandle == NULL)
    {
        error_check(cublasCreate(&cubHandle));
        release_cubHandle = true;
    }


    // create cuSPARSE handle
    if (cusHandle == NULL)
    {
        error_check(cusparseCreate(&cusHandle));
        release_cusHandle = true;
    }

    // create descriptor for matrix A
    error_check(cusparseCreateMatDescr(&descr_A));

    // initialize properties of matrix A
    error_check(cusparseSetMatType(descr_A, CUSPARSE_MATRIX_TYPE_GENERAL));
    error_check(cusparseSetMatFillMode(descr_A, CUSPARSE_FILL_MODE_LOWER));
    error_check(cusparseSetMatDiagType(descr_A, CUSPARSE_DIAG_TYPE_NON_UNIT));
    error_check(cusparseSetMatIndexBase(descr_A, CUSPARSE_INDEX_BASE_ZERO));

    // create descriptor for matrix L
    error_check(cusparseCreateMatDescr(&descr_L));

    // initialize properties of matrix L
    error_check(cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL));
    error_check(cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER));
    error_check(cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ZERO));
    error_check(cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_NON_UNIT));
}

Constructor 做的事情很簡單，初始化傳入得參數後，建立 Matrix $A$ 與 $L$ 的 Descriptor，這邊需要注意的是在 CUDA 11.0 cusparseMatDescr_t 已經有快要 Deprecate 的傾向，因此這邊設定 Properties 也沒什麼選項可選，首先是 cusparseSetMatType，目前除了 CUSPARSE_MATRIX_TYPE_GENERAL 選項以外其他都 Deprecate 因此直接勇敢的設成 CUSPARSE_MATRIX_TYPE_GENERAL 就可以了，剩下就是除了 cusparseSetMatIndexBase 需要照實填寫外，其他都會被忽略，並不是很重要。

接下來是 Destructor：

ICCGsolver::~ICCGsolver()
{
    // free data
    free_nz_memory();
    free_N_memory();

    // release descriptor
    cusparseDestroyMatDescr(descr_A);
    cusparseDestroyMatDescr(descr_L);

    // release handles
    if (release_cubHandle)
    {
        cublasDestroy(cubHandle);
    }

    if (release_cusHandle)
    {
        cusparseDestroy(cusHandle);
    }
}

Destructor 做的事情更簡單，就是將 GPU memory 都 Free 掉，Descriptor 也 Free 掉，Handle 看是不是使用者傳入的，如果不是，就一起毀掉，如果是就不用毀掉。

接下來是除了 solve 以外的 Function：

double *ICCGsolver::x_ptr()
{
    return d_x;
}

double ICCGsolver::err()
{
    return rTr;
}

int ICCGsolver::iter_count()
{
    return k;
}

x_ptr() 用來取得答案的 GPU Pointer。err() 用來取得誤差值 $|r_k|=r^Tr$。iter_count()用來取得最後一次 solve 的 Iteration count。接下來 Memory 相關：

void ICCGsolver::check_and_resize(int N, int nz)
{
    // allocate N
    if (this->N < N)
    {
        this->N = N;
        free_N_memory();
        allocate_N_memory();
    }

    if (this->nz < nz)
    {
        this->nz = nz;
        free_nz_memory();
        allocate_nz_memory();
    }
}

check_and_resize 用來檢查，呼叫 solve 時新傳進來的 $N$ 與 $\text{nz}$ 有沒有比目前分配的還大，如果比較大，就要 Free 掉並重新分配一個更大的空間來做計算。

因此釋放與分配空間的 Function：

void ICCGsolver::allocate_N_memory()
{
    error_check(cudaMalloc(&d_x, N * sizeof(double)));
    error_check(cudaMalloc(&d_y, N * sizeof(double)));
    error_check(cudaMalloc(&d_z, N * sizeof(double)));
    error_check(cudaMalloc(&d_r, N * sizeof(double)));
    error_check(cudaMalloc(&d_rt, N * sizeof(double)));
    error_check(cudaMalloc(&d_xt, N * sizeof(double)));
    error_check(cudaMalloc(&d_q, N * sizeof(double)));
    error_check(cudaMalloc(&d_p, N * sizeof(double)));
}

void ICCGsolver::allocate_nz_memory()
{
    error_check(cudaMalloc(&d_ic, nz * sizeof(double)));
}

void ICCGsolver::free_N_memory()
{
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_z);
    cudaFree(d_r);
    cudaFree(d_rt);
    cudaFree(d_xt);
    cudaFree(d_q);
    cudaFree(d_p);
}

void ICCGsolver::free_nz_memory()
{
    cudaFree(d_ic);
}

到這邊，週邊的 Function 全部完成了，總算可以進入正題：Preconditioned conjugate gradient。建議這邊實作的時候可以搭配 Algorithm 1。

Recall Algorithm 1：

bool ICCGsolver::solve(int N, int nz,
    double *d_A, int *d_rowIdx, int *d_colIdx,
    double *d_b, double *d_guess)
{
    check_and_resize(N, nz);
     
    // --- 1. Create cuSPARSE generic API objects ---
    // TODO
     
    // --- 2. Perform incomplete cholesky factorization ---
    // TODO
    
    // --- 3. Prepare for performing conjugate gradient ---
    // TODO
    
    // --- 4. Perform conjugate gradient ---
    // TODO
    
    // --- 5. Finalize ---
    // TODO
    
    return rTr < tolrance; // return true if converged
}

首先一開始需要先檢查、分配 GPU 空間，因此 Call check_and_resize，接著我將分成 5 個部份分別實作。

先從第二部份開始，這邊會對應到 Algorithm 1 的 Line 4：

// --- Perform incomplete cholesky factorization ---
csric02Info_t icinfo_A;
size_t buf_size = 0;
size_t u_temp_buf_size = 0;
size_t u_temp_buf_size2 = 0;
int i_temp_buf_size = 0;
void *d_buf = NULL;

cuSPARSE 進行 Incomplete cholesky factorization 分成幾個步驟：

Call cusparseCreateCsric02Info 建立 csric02Info_t object
Call cusparseDcsric02_bufferSize 讓 cuSPARSE 估計計算 Factorization 所需的 Buffer 大小，然後分配 Buffer d_buf 空間
Call cusparseDcsric02_analysis (analysis phase) 讓 cuSPARSE 分析 Matrix $A$ 的 Sparsity。
Call cusparseDcsric02 (solve phase) 進行 Factorization

// Create info object for incomplete-cholesky factorization
error_check(cusparseCreateCsric02Info(&icinfo_A));
// Compute buffer size in computing ic factorization
error_check(cusparseDcsric02_bufferSize(cusHandle, N, nz, 
    descr_A, d_A, d_rowIdx, d_colIdx, icinfo_A, &i_temp_buf_size));
buf_size = i_temp_buf_size;

// Create buffer
error_check(cudaMalloc(&d_buf, buf_size));
// Copy A
error_check(cudaMemcpy(d_ic, d_A, nz * sizeof(double), cudaMemcpyDeviceToDevice));

// Perform incomplete-choleskey factorization: analysis phase
error_check(cusparseDcsric02_analysis(cusHandle, N, nz,
    descr_A, d_ic, d_rowIdx, d_colIdx, icinfo_A, CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf));

// Perform incomplete-choleskey factorization: solve phase
error_check(cusparseDcsric02(cusHandle, N, nz,
    descr_A, d_ic, d_rowIdx, d_colIdx, icinfo_A, CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf));

這邊我將 d_A 複製到 d_ic 是因為，cuSPARSE 在進行 Factorization 的時候會直接將結果覆寫到 Input array 上，因此這邊將 d_A 複製一份到 d_ic，將 Factorized 的 Lower triangular matrix 存到 d_ic 裡面。

另外，根據 Document 的說法，CUSPARSE_SOLVE_POLICY_USE_LEVEL 有時會對計算進行稍微的優化，有時不會，因此這邊要設定 NO_LEVEL 或 USE_LEVEL 都可以，但是 Analysis phase 與 Solve phase 的設定一定要一致。

接下來第三部份是估算 Conjugate gradient 所需的 Buffer 空間：

// --- Prepare for performing conjugate gradient ---
// Create info object for factorized matrix L, LT
csrsv2Info_t info_L, info_U;
error_check(cusparseCreateCsrsv2Info(&info_L));
error_check(cusparseCreateCsrsv2Info(&info_U));

在 Algorithm 1 Line 9 跟 10 有計算到 $L$ 與 $L^T$ 的 Inverse，但實際上 cuSPARSE 並沒有求 Inverse matrix 的 API，因此可以將這兩行看做求解兩個 Triangular sparse linear system：$Ly=r_k$、求 $y$ 與 $L^Tz_k=y$ 求 $z_k$。而求解 Triancular sparse linear system 可以使用 cusparseDcsrsv2 這個 API。因此首先我們需要建立兩個 csrsv2Info_t：

利用 info_L 代表 $L$ (Lower triangular matrix)
利用 info_U 代表 $L^T$ (Upper triangular matrix)

接下來就是分別估算解 Triangular sparse linear system 所需的 Buffer 大小：

// Compute buffer size in solving linear system
error_check(cusparseDcsrsv2_bufferSize(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, 
    N, nz, descr_L, d_ic, d_rowIdx, d_colIdx, info_L, &i_temp_buf_size));

u_temp_buf_size = i_temp_buf_size;

error_check(cusparseDcsrsv2_bufferSize(cusHandle, CUSPARSE_OPERATION_TRANSPOSE,
    N, nz, descr_L, d_ic, d_rowIdx, d_colIdx, info_U, &i_temp_buf_size));

// check whether need more buffer
if (i_temp_buf_size > u_temp_buf_size)
{
    u_temp_buf_size = i_temp_buf_size;
}

除此之外，Conjugate gradient 有發生一次 Sparse Matrix-dense vector multiplication 在 Algorithm 1 Line 17 $q\gets Ap_k$，這個需要使用 cuSPARSE 的 Generic API cusparseSpMV，因此先回到第一部份宣告 Generic API 所需的 Objects：

// --- 1. Create cuSPARSE generic API objects ---
cusparseSpMatDescr_t smat_A;
error_check(cusparseCreateCsr(&smat_A, N, N, nz, d_rowIdx, d_colIdx, d_A, CUSPARSE_INDEX_32I,
    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, CUDA_R_64F));

cusparseDnVecDescr_t dvec_p;
error_check(cusparseCreateDnVec(&dvec_p, N, d_p, CUDA_R_64F));

cusparseDnVecDescr_t dvec_q;
error_check(cusparseCreateDnVec(&dvec_q, N, d_q, CUDA_R_64F));

建立一個 CSR Matrix $A$ 的 Descriptor，建立 Dense vector $p_k$ 與 $q$ 的 Descriptor。再回到第三部份完成剩下的部份，估算 Sparse Matrix-dense vector multiplication 所需的 Buffer 數量：

// Compute buffer size for matrix-vector multiplication
error_check(cusparseSpMV_bufferSize(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, &one, smat_A,
    dvec_p, &zero, dvec_q, CUDA_R_64F, CUSPARSE_CSRMV_ALG1, &u_temp_buf_size2));

if (u_temp_buf_size2 > u_temp_buf_size)
{
    u_temp_buf_size = u_temp_buf_size2;
}

然後分配空間：

// re-allocate buffer
if (u_temp_buf_size > buf_size)
{
    buf_size = u_temp_buf_size;
    cudaFree(d_buf);
    error_check(cudaMalloc(&d_buf, buf_size));
}

這邊需要注意是，通常 Natrix-vector multiplication 會需要比 Incomplete cholesky factorization 與 Solving lienar system 還要更多的 Buffer。如果 Buffer 空間不夠時，cuSPARSE 會回傳 CUSPARSE_STATUS_INTERNAL_ERROR。

分配完 Buffer 後，就可以先對 Triangular sparse linear system 進行 Analysis phase：

// analysis phase
error_check(cusparseDcsrsv2_analysis(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE,
    N, nz, descr_L, d_ic, d_rowIdx, d_colIdx, info_L, CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf));
error_check(cusparseDcsrsv2_analysis(cusHandle, CUSPARSE_OPERATION_TRANSPOSE,
    N, nz, descr_L, d_ic, d_rowIdx, d_colIdx, info_U, CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf));

這邊需要注意的是 d_ic 是 Lower triangular matrix，因此在分析 $L$ (Lower triangular matrix) 時要設定 Operation 為 CUSPARSE_OPERATION_NON_TRANSPOSE，但是在分析 $L^T$ (Upper triangular matrix) 時則要設為 CUSPARSE_OPERATION_TRANSPOSE，這個非常容易被忽略。

所以到目前為止，計算 Conjugate gradient 的事前準備都已經完成了：

bool ICCGsolver::solve(int N, int nz,
    double *d_A, int *d_rowIdx, int *d_colIdx,
    double *d_b, double *d_guess)
{
    check_and_resize(N, nz);
     
    // --- 1. Create cuSPARSE generic API objects ---
    // DONE
     
    // --- 2. Perform incomplete cholesky factorization ---
    // DONE
    
    // --- 3. Prepare for performing conjugate gradient ---
    // DONE
    
    // --- 4. Perform conjugate gradient ---
    // TODO
    
    // --- 5. Finalize ---
    // TODO
    
    return rTr < tolrance; // return true if converged
}

接下來第四部份就是按照 Algorithm 1 一行一行刻：

// x = 0
error_check(cudaMemset(d_x, 0, N * sizeof(double)));
// r0 = b  (since x == 0, b - A*x = b)
error_check(cudaMemcpy(d_r, d_b, N * sizeof(double), cudaMemcpyDeviceToDevice));

Line 2、3 初始化 $x_0$ 與 $r_0$。接下來就是 For 迴圈的內容：

for(k = 0; k < max_iter; ++k)
{
    //TODO
}//EndFor

Algorithm 1 Line 6~8，計算 $|r_k|$，由於 r_k$ 是 Dense vector，因此可以使用 cuBLAS 的 API cublasDnrm2 計算：

// if ||rk|| < tolerance
error_check(cublasDnrm2(cubHandle, N, d_r, 1, &rTr));

if (rTr < tolerance)
{
    break;
}

如果誤差 rTr 已經比設定的 tolerance 還小就結束 Conjugate gradient。

Algorithm 1 Line 9、10 分別解出兩個 Triangular sparse linear system：

// Solve L*y = rk, find y
error_check(cusparseDcsrsv2_solve(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nz, &one,
    descr_L, d_ic, d_rowIdx, d_colIdx, info_L, d_r, d_y, CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf));

// Solve L^T*zk = y, find zk
error_check(cusparseDcsrsv2_solve(cusHandle, CUSPARSE_OPERATION_TRANSPOSE, N, nz, &one,
    descr_L, d_ic, d_rowIdx, d_colIdx, info_U, d_y, d_z, CUSPARSE_SOLVE_POLICY_USE_LEVEL, d_buf));

這邊同樣 Solve phase 與 Analysis phase 的參數 CUSPARSE_SOLVE_POLICY_USE_LEVEL 必須一致。

Algorithm 1 Line 11 計算 $\rho_k=r^T_kz_k$，這邊是兩個 Dense vector 的內積，因此可以使用 cuBLAS 的 cublasDdot：

// rho_t = r{k-1} * z{k-1}
rho_t = rho;  
// rho = rk * zk
error_check(cublasDdot(cubHandle, N, d_r, 1, d_z, 1, &rho));

這邊還有一個重點就是，在計算 $\beta$ 的時候會需要 $\rho_{k-1}=r_{k-1}^Tz_{k-1}$，因此可以用一個變數 rho_t 把他暫存起來。

Algorithm 1 Line 12~17，當 $k=0$ 時直接 Assign $p_k\gets z_k$，可以使用 cublasDcopy；否則，先計算 $\beta$，再更新 $p_k$：

if (k == 0)
{
    // pk = zk
    error_check(cublasDcopy(cubHandle, N, d_z, 1, d_p, 1));
}
else
{
    // beta = (rk*zk) / (r{k-1}*z{k-1})
    beta = rho / rho_t;
    // pk = zk + beta*p{k-1}
    error_check(cublasDscal(cubHandle, N, &beta, d_p, 1));
    error_check(cublasDaxpy(cubHandle, N, &one, d_z, 1, d_p, 1));
}

Line 8 計算 $\beta$ 後，先 Call cublasDscal 對 $p_{k-1}$ 進行 Scale $p_{k-1}\gets \beta p_{k-1}$，在 Call cublasDaxpy 計算 $p_k\gets 1\cdot z_k+p_{k-1}$。

Algorithm 1 Line 18 計算 Sparse matrix-dense vector multiplication，這邊需要使用 cuSPARSE 的 Generic API cusparseSpMV：

1
2
3

// q = A*pk
error_check(cusparseSpMV(cusHandle, CUSPARSE_OPERATION_NON_TRANSPOSE, &one, smat_A,
    dvec_p, &zero, dvec_q, CUDA_R_64F, CUSPARSE_MV_ALG_DEFAULT, d_buf));

Algorithm 1 Line 19 計算 $\alpha$，但在計算 $\alpha$ 前要先算出 $p_k^Tq$：

1
2
3

// alpha = (rk*zk) / (pk*q)
error_check(cublasDdot(cubHandle, N, d_p, 1, d_q, 1, &pTq));
alpha = rho / pTq;

因此首先用 cublasDdot 計算 $p_k^Tq$，再計算出 $\alpha$。

Algorithm 1 Line 20 更新 $x_{k+1}$ 可以使用 cublasDaxpy 計算 $x_{k+1}\gets \alpha p_k + x_k$

1 2	// x{k+1} = xk + alpha*pk error_check(cublasDaxpy(cubHandle, N, &alpha, d_p, 1, d_x, 1));

最後更新 Algorithm 1 Line 21 更新 Residual $r_{k+1}$。這邊減法可以看作 $r_{k+1}\gets (-\alpha)q + r_k$，因此同樣使用 cublasDaxpy 計算：

1
2
3

// r{k+1} = rk - alpha*q 
double n_alpha = -alpha;
error_check(cublasDaxpy(cubHandle, N, &n_alpha, d_q, 1, d_r, 1));

到這邊第四部份就完成了。最後第五部份就是將 Buffer 之類的全部 Free 掉：

// free buffer
cudaFree(d_buf);

// free objects
error_check(cusparseDestroySpMat(smat_A));
error_check(cusparseDestroyDnVec(dvec_p));
error_check(cusparseDestroyDnVec(dvec_q));
error_check(cusparseDestroyDnVec(dvec_x));
error_check(cusparseDestroyCsric02Info(icinfo_A));
error_check(cusparseDestroyCsrsv2Info(info_L));
error_check(cusparseDestroyCsrsv2Info(info_U));

就完成了 Preconditioned conjugate gradient。

試著執行了一下，結果 $N=1,000,000$ 的 Sparse linear system 只花了 $368$ 個 Iteration 誤差就收斂到小於 $10^{-12}$。

完整的 Code 可以在 Github 找到：Ending2015a/ICCG0_CUDA - github