CUDA Matrix multiplication
In this tutorial, we will tackle a wellsuited problem for Parallel Programming and quite a useful one. We will do Matrix Multiplication. Problem: Given two N*N
matrix X
and Y
of floating point numbers, perform matrix multiplication and store the result on a N*N
matrix answer.
Below is a code for matrix multiplication using C++. It is the standard O(NÂ³)
procedure. Here x
, y
, and ans are three NÂ²
size matrices(NÂ²
sized 1D
array. We are using 1D
arrays as of it were 2D
).
void cpuMatMul(int N,double *x, double *y, double *ans)
{for(int i=0;i < N;i++)
{
for(int j=0;j < N;j++)
{
for(int k=0;k < N;k++)
{
ans[i*N+j]+=(x[i*N+k]*y[k*N+j]);
}
}
}}
For a matrix of size 1024 X 1024
, it takes about 10
seconds to complete the task. For a matrix of size 4096 X 4096
, it takes about 2500
seconds(larger matrices add extra overhead. So it does not scale up exactly in O(NÂ³)
after 1024) to complete the task. Above (4096 X 4096
), it just gets infeasible to run the experiment.
Parallel Matrix Multiplication
Following is the parallel version of the same code we have seen above. Now let's execute the whole code:

Input
Source code credit Zhengchun Liu. The code has been simplified.