CUDA matrix multiplication
Problem description:
I am trying to write a matrix multiplication code in CUDA, modeled closely on NVIDIA's CUDA programming guide, but it does not work. It is supposed to compute C = alpha * A * B + beta * C, but for every A and B, C remains unchanged.
__global__ void MatMulKernel(int m, int n, int k, double *A, double *B,
                             double *C, double alpha, double beta)
{
    double Ctemp = 0.0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int ind;
    for (ind = 0; ind < k; ++ind)
    {
        Ctemp += A[row + ind * m] * B[ind + col * k];
    }
    C[row + m * col] = alpha * Ctemp + beta * C[row + m * col];
    //C[row+m*col] = Ctemp;
    __syncthreads();
}
extern "C" void
local_mm_cuda (const int m, const int n, const int k, const double alpha,
               const double *A, const int lda, const double *B, const int ldb,
               const double beta, double *C, const int ldc)
{
    int row, col;
    /* Verify the sizes of lda, ldb, and ldc */
    assert (lda >= m);
    assert (ldb >= k);
    assert (ldc >= m);
    // allocating memory for device arrays
    double *dA, *dB, *dC;
    size_t sizeA = sizeof(double) * m * k;
    size_t sizeB = sizeof(double) * n * k;
    size_t sizeC = sizeof(double) * m * n;
    cudaMalloc((void**)&dA, sizeA);
    cudaMalloc((void**)&dB, sizeB);
    cudaMalloc((void**)&dC, sizeC);
    cudaMemcpy(dA, A, sizeA, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeB, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C, sizeC, cudaMemcpyHostToDevice);
    // calling matrix multiplication kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(n/dimBlock.x, m/dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(m, n, k, dA, dB, dC, alpha, beta);
    cudaThreadSynchronize();
    // saving C calculated back in C
    cudaMemcpy(dC, C, sizeC, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
Answer:
Try changing
"dim3 dimGrid(n/dimBlock.x, m/dimBlock.y);"
to
"dim3 dimGrid((n+dimBlock.x-1)/dimBlock.x, (m+dimBlock.y-1)/dimBlock.y);"
What is your question? (And a hint: "my code doesn't work" is not a question.) There are 12 API calls in this code, all of which return a status, and you should check every one of them to see whether it returns an error. Your code is also double precision. Are you compiling for and running on a device that supports double precision? – talonmies 2012-04-26 06:12:05
I wanted to know whether I had missed anything obvious. I am compiling for a Tesla M2090 "Fermi" GPU – zimbra314 2012-04-26 06:17:49
You are missing something obvious - error checking. Your symptoms are consistent with a kernel that never runs, but you have no way of knowing that, because your code does not check for API errors. – talonmies 2012-04-26 06:20:07
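The error checking the comments ask for is usually done with a wrapper around every CUDA runtime call. This is a common-pattern sketch, not the poster's code; the macro name is illustrative:

```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print file/line and abort on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&dA, sizeA));
//   CUDA_CHECK(cudaMemcpy(dA, A, sizeA, cudaMemcpyHostToDevice));
//
// A kernel launch returns no status itself, so check it afterwards:
//   MatMulKernel<<<dimGrid, dimBlock>>>(m, n, k, dA, dB, dC, alpha, beta);
//   CUDA_CHECK(cudaGetLastError());      // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // errors raised during execution
```

A kernel that never runs, as suggested in the comments, would show up immediately at the `cudaGetLastError()` check after the launch.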