parallel processing - Extending a Single-Pass Scan Kernel for Independent Row-wise Scan in CUDA - Stack Overflow

Updated: 2025-04-20

I implemented a kernel that performs a single-pass scan (proposed in the book Programming Massively Parallel Processors):

__global__
void SinglePassKoggeStoneScan(const unsigned int * input, unsigned int * output,
  const unsigned int length, unsigned int * flags, unsigned int * scanValue, unsigned int * blockCounter) {
  __shared__ unsigned int bid_s;
  __shared__ unsigned int XY[SECTION_SIZE];

  if (threadIdx.x == 0) {
    bid_s = atomicAdd(blockCounter, 1);
  }
  __syncthreads();

  int bid = bid_s;
  int idx = bid * blockDim.x + threadIdx.x;

  if (idx < length) {
    XY[threadIdx.x] = input[idx];
  } else {
    XY[threadIdx.x] = 0;
  }
  __syncthreads();

  for (int stride = 1; stride < SECTION_SIZE; stride *= 2) {
    __syncthreads();
    unsigned int tmp = 0; // match XY's type; a float is only exact for integers up to 2^24
    if (threadIdx.x >= stride) {
      tmp = XY[threadIdx.x] + XY[threadIdx.x - stride];
    }
    __syncthreads();
    if (threadIdx.x >= stride) {
      XY[threadIdx.x] = tmp;
    }
  }
  __syncthreads();

  __shared__ unsigned int previousSum;
  if (threadIdx.x == 0) {
    while (bid >= 1 && atomicAdd( & flags[bid], 0) == 0) {} // Wait for data
    previousSum = (bid == 0) ? 0 : scanValue[bid]; // block 0 has no predecessor, and scanValue[0] is never written
    scanValue[bid + 1] = XY[blockDim.x - 1] + previousSum;
    __threadfence();
    atomicAdd( & flags[bid + 1], 1);
  }
  __syncthreads();

  if (idx < length) {
    output[idx] = XY[threadIdx.x] + previousSum;
  }
}

I would like to extend this kernel so that it performs a scan on each row of a matrix independently.
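To make the goal concrete, here is a minimal host-side reference in C++, offered as a sketch rather than as part of the original code. It computes the inclusive scan of each row of a row-major matrix on the CPU, so GPU output can be compared against it. The function name rowwiseInclusiveScan and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): inclusive prefix sum of
// every row of a row-major width x height matrix. The running sum is reset at
// each row boundary, so the rows are scanned independently.
std::vector<unsigned int> rowwiseInclusiveScan(const std::vector<unsigned int>& input,
                                               std::size_t width, std::size_t height) {
  assert(input.size() == width * height);
  std::vector<unsigned int> output(input.size());
  for (std::size_t row = 0; row < height; ++row) {
    unsigned int running = 0;  // reset for every row
    for (std::size_t col = 0; col < width; ++col) {
      running += input[row * width + col];
      output[row * width + col] = running;
    }
  }
  return output;
}
```

For the 3x3 input {1,2,3,4,5,6,7,8,9}, this produces {1,3,6, 4,9,15, 7,15,24}, which is what a correct row-wise kernel should return.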

Currently, I use a naive solution that launches the kernel once per row:

#define SECTION_SIZE 1024

unsigned int input[] = {1,2,3,4,5,6,7,8,9};
const unsigned int width = 3;
const unsigned int height = 3;
unsigned int* output = new unsigned int[width * height];

unsigned int *deviceInput, *deviceOutput, *flags, *scanValue, *blockCounter;

const size_t imageSize = width * height * sizeof(unsigned int);
// Number of blocks needed to cover one row.
const unsigned int scanValueNum = (width + SECTION_SIZE - 1) / SECTION_SIZE;
// Block bid writes flags[bid + 1] and scanValue[bid + 1], so one extra slot is needed.
const size_t scanValueSize = (scanValueNum + 1) * sizeof(unsigned int);
cudaMalloc(reinterpret_cast<void**>(&deviceInput), imageSize);
cudaMalloc(reinterpret_cast<void**>(&deviceOutput), imageSize);
cudaMalloc(reinterpret_cast<void**>(&flags), scanValueSize);
cudaMalloc(reinterpret_cast<void**>(&scanValue), scanValueSize);
cudaMalloc(reinterpret_cast<void**>(&blockCounter), sizeof(unsigned int));

cudaMemcpy(deviceInput, input, imageSize, cudaMemcpyHostToDevice);
dim3 blockDim(SECTION_SIZE);
dim3 gridDim(scanValueNum); // a block count, not a byte size
for (unsigned int i = 0; i < height; i++) {
   // Reset the inter-block state before scanning each row.
   cudaMemset(flags, 0, scanValueSize);
   cudaMemset(scanValue, 0, scanValueSize);
   cudaMemset(blockCounter, 0, sizeof(unsigned int));
   SinglePassKoggeStoneScan<<<gridDim, blockDim>>>(deviceInput + i*width, deviceOutput + i*width, width, flags, scanValue, blockCounter);
}

cudaMemcpy(output, deviceOutput, imageSize, cudaMemcpyDeviceToHost);

cudaFree(deviceInput);
cudaFree(deviceOutput);
cudaFree(flags);
cudaFree(scanValue);
cudaFree(blockCounter);
delete[] output;

However, I am having trouble adapting the code to directly handle this case. Any help or guidance would be greatly appreciated. Thanks in advance!
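One possible direction, sketched here only as a sequential host-side model in C++ and not as a tested kernel: launch a 2D grid in which blockIdx.y selects the row, give each row its own blockCounter entry, and give each row its own slice of flags and scanValue, so that blocks only ever wait on predecessors in the same row. The model below shows the indexing such a layout would need. The names modelRowwiseSinglePassScan, slots and sectionSize are assumptions made for illustration; the loop variables row and bid stand in for blockIdx.y and the per-row dynamic block index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model (not a kernel): each row is split into sections
// of sectionSize elements, and the carry between sections is kept in a per-row
// slice of scanValue, indexed as row * slots + bid.
std::vector<unsigned int> modelRowwiseSinglePassScan(const std::vector<unsigned int>& input,
                                                     std::size_t width, std::size_t height,
                                                     std::size_t sectionSize) {
  assert(sectionSize > 0 && input.size() == width * height);
  const std::size_t blocksPerRow = (width + sectionSize - 1) / sectionSize;
  const std::size_t slots = blocksPerRow + 1;  // section bid writes slot bid + 1
  // Slot 0 of every row's slice stays 0: the first section has no predecessor.
  std::vector<unsigned int> scanValue(height * slots, 0);
  std::vector<unsigned int> output(input.size());

  for (std::size_t row = 0; row < height; ++row) {            // plays blockIdx.y
    for (std::size_t bid = 0; bid < blocksPerRow; ++bid) {    // per-row block index
      const std::size_t begin = bid * sectionSize;
      const std::size_t end = (begin + sectionSize < width) ? begin + sectionSize : width;
      const unsigned int previousSum = scanValue[row * slots + bid];
      unsigned int local = 0;  // inclusive scan local to this section
      for (std::size_t col = begin; col < end; ++col) {
        local += input[row * width + col];
        output[row * width + col] = local + previousSum;
      }
      // Publish the running total for the next section of the same row.
      scanValue[row * slots + bid + 1] = local + previousSum;
    }
  }
  return output;
}
```

In a real kernel the inner loop over bid would be replaced by concurrently running blocks, with flags[row * slots + bid] signalling when scanValue[row * slots + bid] is ready, and with bid taken from atomicAdd on that row's counter so that a waiting block's predecessor is always already running.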

