Why is my shared memory blocking kernel slower than my global memory coalescing kernel?
I have started to learn CUDA programming and I am currently working through the kernels from siboehm's article on optimizing a CUDA matmul kernel. I am implementing the third kernel (shared memory blocking), but when I benchmark it I get fewer GFLOPS than with the global memory coalescing kernel. I am using essentially the same code as the article for initialization and benchmarking (a sketch of my timing loop is below, after the specs), except that I use fixed 1024x1024 matrices. I have the following specs:
NVIDIA RTX A6000
Compute Capability: 8.6
CUDA 11.8
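The benchmarking itself is essentially the article's harness. A minimal sketch of what I do, assuming the device buffers d_A, d_B, d_C are already filled; the benchmark_sgemm wrapper and the repeats count are just illustrative names of mine, and run_shared_memory_blocking_kernel is the launcher shown further down:
#include <cuda_runtime.h>
#include <cstdio>

// Declared here; the definition is further down in this post.
void run_shared_memory_blocking_kernel(int M, int N, int K, float alpha,
                                       float* A, float* B, float beta, float* C);

void benchmark_sgemm(int M, int N, int K, float alpha, float beta,
                     float* d_A, float* d_B, float* d_C) {
    const int repeats = 50;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time overheads are not timed
    run_shared_memory_blocking_kernel(M, N, K, alpha, d_A, d_B, beta, d_C);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < repeats; i++)
        run_shared_memory_blocking_kernel(M, N, K, alpha, d_A, d_B, beta, d_C);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One SGEMM performs 2*M*N*K floating-point operations
    double gflops = (2.0 * M * N * K * repeats) / (ms * 1e-3) / 1e9;
    printf("avg %.3f ms, %.1f GFLOPS\n", ms / repeats, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}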
Kernel code (heavily commented, because I write out my thought process as I code):
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
__global__ void shared_memory_blocking_kernel(int M, int N, int K, float alpha,
                                              const float* A, const float* B,
                                              float beta, float* C) {
    // This block is responsible for computing submatrix C_xy
    const uint bx = blockIdx.x;
    const uint by = blockIdx.y;
    // This block's dimension
    const uint bd = blockDim.x;
    // This thread's responsibility within the submatrix
    const uint tx = threadIdx.x;
    const uint ty = threadIdx.y;
    const uint local_mem_acces = ty * bd + tx;
    // This thread's target element of C
    const uint x = blockIdx.x * blockDim.x + threadIdx.x;
    const uint y = blockIdx.y * blockDim.y + threadIdx.y;
    float sum = 0.0f; // Value that will be computed by this thread

    // Allocate shared memory for the A and B submatrices
    extern __shared__ float sharedMemory[];
    float* subA = sharedMemory;
    float* subB = (float*)&sharedMemory[blockDim.x * blockDim.y];

    // Should run K / blockDim.x times (WE FORCE THE BLOCK TO BE SQUARE HERE)
    int num_iter = CEIL_DIV(K, bd);

    // bx * bd + ty is always computed, so we can just move the A pointer there
    A += (bx * bd + ty) * K; // Now we point to the row this thread will be responsible for
    // Similarly, we can move the B pointer to by * bd + ty * K
    B += by * bd + ty * K;   // Now we point to the column this thread will be responsible for

    for (int i = 0; i < num_iter; i++) {
        // Load the A and B submatrices into shared memory with the coalescing trick:
        // consecutive threads access consecutive memory locations
        subA[local_mem_acces] = A[tx];
        subB[local_mem_acces] = B[tx];
        __syncthreads();

        // We just need to jump by bd to reach the next A submatrix
        A += bd;
        // We need to jump by bd * K to reach the next B submatrix
        B += bd * K;

        // Compute the partial sum
        for (int k = 0; k < bd; k++) {
            sum += subA[ty * bd + k] * subB[k * bd + tx];
        }
        __syncthreads();
    }

    // This thread is still taking care of its own C_ij
    C[y * N + x] = alpha * sum + beta * C[y * N + x];
}
To run this code I use:
void run_shared_memory_blocking_kernel(int M, int N, int K, float alpha,
                                       float* A, float* B,
                                       float beta, float* C) {
    dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
    dim3 blockDim(32, 32);
    size_t sharedMemorySize = 2 * blockDim.x * blockDim.y * sizeof(float);
    shared_memory_blocking_kernel<<<gridDim, blockDim, sharedMemorySize>>>(M, N, K, alpha, A, B, beta, C);
}
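For completeness, the initialization is also lifted from the article; it is roughly the following (the random fill and the h_/d_ names here are mine, not the article's exact code, and everything lives in the same .cu file as the kernel and launcher above):
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const int M = 1024, N = 1024, K = 1024;
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytesA = (size_t)M * K * sizeof(float);
    size_t bytesB = (size_t)K * N * sizeof(float);
    size_t bytesC = (size_t)M * N * sizeof(float);

    // Host buffers with random input data
    float* h_A = (float*)malloc(bytesA);
    float* h_B = (float*)malloc(bytesB);
    float* h_C = (float*)malloc(bytesC);
    for (int i = 0; i < M * K; i++) h_A[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < K * N; i++) h_B[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < M * N; i++) h_C[i] = 0.0f;

    // Device buffers
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_B, bytesB);
    cudaMalloc(&d_C, bytesC);
    cudaMemcpy(d_A, h_A, bytesA, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytesB, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, bytesC, cudaMemcpyHostToDevice);

    // Single launch; the timing loop shown earlier wraps this call
    run_shared_memory_blocking_kernel(M, N, K, alpha, d_A, d_B, beta, d_C);
    cudaDeviceSynchronize();

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}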
When running my code I get roughly 1800 GFLOPS, but when running siboehm's code I get about 2800 GFLOPS for the same 1024 matrix size. I am just starting to learn how to use the profiler, so any advice on how to identify where the issue is would be greatly appreciated.
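(For scale: with M = N = K = 1024, one SGEMM is 2*M*N*K = 2*1024^3 ≈ 2.15 GFLOP, so 1800 GFLOPS works out to roughly 1.2 ms per call versus roughly 0.77 ms at 2800 GFLOPS; these are just my measured rates converted to time.)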