Why is my shared memory blocking kernel slower than my global memory coalescing kernel?
I have started to learn CUDA programming and I am currently working through the kernels from siboehm's article on optimizing a CUDA matmul kernel. I am implementing the third kernel (shared memory blocking), but when I benchmark it I get fewer GFLOPS than with the global memory coalescing kernel. I am using essentially the same code as the article for initialization and benchmarking (a sketch of my timing loop is below, after the specs), except that I use fixed 1024x1024 matrices. I have the following specs:
NVIDIA RTX A6000
Compute Capability: 8.6
CUDA 11.8
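The benchmarking itself is essentially the article's harness. A minimal sketch of what I do, assuming the device buffers d_A, d_B, d_C are already filled; the benchmark_sgemm wrapper and the repeats count are just illustrative names of mine, and run_shared_memory_blocking_kernel is the launcher shown further down:
#include <cuda_runtime.h>
#include <cstdio>

// Declared here; the definition is further down in this post.
void run_shared_memory_blocking_kernel(int M, int N, int K, float alpha,
                                       float* A, float* B, float beta, float* C);

void benchmark_sgemm(int M, int N, int K, float alpha, float beta,
                     float* d_A, float* d_B, float* d_C) {
    const int repeats = 50;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time overheads are not timed
    run_shared_memory_blocking_kernel(M, N, K, alpha, d_A, d_B, beta, d_C);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < repeats; i++)
        run_shared_memory_blocking_kernel(M, N, K, alpha, d_A, d_B, beta, d_C);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // One SGEMM performs 2*M*N*K floating-point operations
    double gflops = (2.0 * M * N * K * repeats) / (ms * 1e-3) / 1e9;
    printf("avg %.3f ms, %.1f GFLOPS\n", ms / repeats, gflops);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}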
Kernel code (heavily commented, because I write out my thought process as I code):
#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))
__global__ void shared_memory_blocking_kernel(int M, int N, int K, float alpha,
                                              const float* A, const float* B,
                                              float beta, float* C) {
    // This block is responsible for computing submatrix C_xy
    const uint bx = blockIdx.x;
    const uint by = blockIdx.y;
    // This block's dimension
    const uint bd = blockDim.x;
    // This thread's responsibility within the submatrix
    const uint tx = threadIdx.x;
    const uint ty = threadIdx.y;
    const uint local_mem_acces = ty * bd + tx;
    // This thread's target element of C
    const uint x = blockIdx.x * blockDim.x + threadIdx.x;
    const uint y = blockIdx.y * blockDim.y + threadIdx.y;
    float sum = 0.0f; // Value that will be computed by this thread

    // Allocate shared memory for the A and B submatrices
    extern __shared__ float sharedMemory[];
    float* subA = sharedMemory;
    float* subB = (float*)&sharedMemory[blockDim.x * blockDim.y];

    // Should run K / blockDim.x times (WE FORCE THE BLOCK TO BE SQUARE HERE)
    int num_iter = CEIL_DIV(K, bd);

    // bx * bd + ty is always computed, so we can just move the A pointer there
    A += (bx * bd + ty) * K; // Now we point to the row this thread will be responsible for
    // Similarly, we can move the B pointer to by * bd + ty * K
    B += by * bd + ty * K;   // Now we point to the column this thread will be responsible for

    for (int i = 0; i < num_iter; i++) {
        // Load the A and B submatrices into shared memory with the coalescing trick:
        // consecutive threads access consecutive memory locations
        subA[local_mem_acces] = A[tx];
        subB[local_mem_acces] = B[tx];
        __syncthreads();

        // We just need to jump by bd to reach the next A submatrix
        A += bd;
        // We need to jump by bd * K to reach the next B submatrix
        B += bd * K;

        // Compute the partial sum
        for (int k = 0; k < bd; k++) {
            sum += subA[ty * bd + k] * subB[k * bd + tx];
        }
        __syncthreads();
    }

    // This thread is still taking care of its own C_ij
    C[y * N + x] = alpha * sum + beta * C[y * N + x];
}
To run this code I use:
void run_shared_memory_blocking_kernel(int M, int N, int K, float alpha,
                                       float* A, float* B,
                                       float beta, float* C) {
    dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
    dim3 blockDim(32, 32);
    size_t sharedMemorySize = 2 * blockDim.x * blockDim.y * sizeof(float);
    shared_memory_blocking_kernel<<<gridDim, blockDim, sharedMemorySize>>>(M, N, K, alpha, A, B, beta, C);
}
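For completeness, the initialization is also lifted from the article; it is roughly the following (the random fill and the h_/d_ names here are mine, not the article's exact code, and everything lives in the same .cu file as the kernel and launcher above):
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const int M = 1024, N = 1024, K = 1024;
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytesA = (size_t)M * K * sizeof(float);
    size_t bytesB = (size_t)K * N * sizeof(float);
    size_t bytesC = (size_t)M * N * sizeof(float);

    // Host buffers with random input data
    float* h_A = (float*)malloc(bytesA);
    float* h_B = (float*)malloc(bytesB);
    float* h_C = (float*)malloc(bytesC);
    for (int i = 0; i < M * K; i++) h_A[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < K * N; i++) h_B[i] = (float)rand() / RAND_MAX;
    for (int i = 0; i < M * N; i++) h_C[i] = 0.0f;

    // Device buffers
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytesA);
    cudaMalloc(&d_B, bytesB);
    cudaMalloc(&d_C, bytesC);
    cudaMemcpy(d_A, h_A, bytesA, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytesB, cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_C, bytesC, cudaMemcpyHostToDevice);

    // Single launch; the timing loop shown earlier wraps this call
    run_shared_memory_blocking_kernel(M, N, K, alpha, d_A, d_B, beta, d_C);
    cudaDeviceSynchronize();

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}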
When running my code I get roughly 1800 GFLOPS, but when running siboehm's code I get about 2800 GFLOPS for the same 1024 matrix size. I am just starting to learn how to use the profiler, so any advice on how to identify where the issue is would be greatly appreciated.
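(For scale: with M = N = K = 1024, one SGEMM is 2*M*N*K = 2*1024^3 ≈ 2.15 GFLOP, so 1800 GFLOPS works out to roughly 1.2 ms per call versus roughly 0.77 ms at 2800 GFLOPS; these are just my measured rates converted to time.)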