I allocated and initialized two arrays, A and B, in global memory using cudaMalloc() and cudaMemset(), then launched a kernel on A. The kernel does nothing with B, but when I compared a copy of B taken right after the kernel ran against a copy taken before the launch, the two differed substantially. What happened here?
__global__ void BlockSort(int *A)
{
    //A is an array in global memory. Sort A using BlockRadixSort
    using BlockRadixSort = cub::BlockRadixSort<int, 256, 8>;
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    int threadKeys[8];
    int seg_idx = gridDim.y * blockIdx.y + blockIdx.x,
        seg_start = RATIO * seg_idx;
    for (int i = 0; i < 8; ++i)
        threadKeys[i] = A[seg_start +
                          8 * (blockDim.y * threadIdx.y + blockIdx.x)];
    BlockRadixSort(temp_storage).Sort(threadKeys);
    for (int i = 0; i < 8; ++i)
        A[seg_start + 8 * (blockDim.y * threadIdx.y + blockIdx.x)]
            = threadKeys[i];
}
int* B1;
B1 = (int*)malloc(sizeof(int) * sz);
cudaMemcpy(B1, B, sizeof(int) * sz, cudaMemcpyDeviceToHost);
BlockSort<<<gridDim, blockDim>>>(A); // launch BlockSort with A. Nothing is done to B
cudaDeviceSynchronize(); // do I really need to synchronize here?
int* B2;
B2 = (int*)malloc(sizeof(int) * sz);
cudaMemcpy(B2, B, sizeof(int) * sz, cudaMemcpyDeviceToHost);
for (int i = 0; i < sz; ++i)
{
    if (B1[i] != B2[i])
        printf("at i = %d, B1[%d] = %d, B2[%d] = %d\n", i, i, B1[i], i, B2[i]);
}
The printed messages show that B1 and B2 differ at many indices i.
If I comment out the line that launches my BlockSort() kernel, B1 is the same as B2.
Honestly at a loss here. Any help will be appreciated!
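Since B only changes when the kernel runs, one thing that could be checked on the host is whether the kernel's index expression can ever exceed the size of A. The sketch below is only an illustration under stated assumptions: the helper names (Dim2, max_index_touched, stays_in_bounds) are hypothetical, and the launch dimensions, RATIO, and element count passed in are placeholders, since the post does not give the real values. It evaluates the expression seg_start + 8 * (blockDim.y * threadIdx.y + blockIdx.x) exactly as written in the kernel, at the largest block and thread indices.

```cpp
#include <cstdint>

// Hypothetical stand-in for the x/y parts of a CUDA dim3.
struct Dim2 {
    std::int64_t x;
    std::int64_t y;
};

// Largest element index produced by the kernel's expression
//   seg_start + 8 * (blockDim.y * threadIdx.y + blockIdx.x)
// with seg_start = ratio * (gridDim.y * blockIdx.y + blockIdx.x).
// Assumes ratio >= 0 and every dimension >= 1, so each term is largest
// at the last block and the last thread row.
std::int64_t max_index_touched(Dim2 grid, Dim2 block, std::int64_t ratio) {
    const std::int64_t bx = grid.x - 1;   // largest blockIdx.x
    const std::int64_t by = grid.y - 1;   // largest blockIdx.y
    const std::int64_t ty = block.y - 1;  // largest threadIdx.y
    const std::int64_t seg_start = ratio * (grid.y * by + bx);
    return seg_start + 8 * (block.y * ty + bx);
}

// True when every index the expression can produce is below count,
// where count is the number of ints allocated for A.
bool stays_in_bounds(Dim2 grid, Dim2 block, std::int64_t ratio,
                     std::int64_t count) {
    return max_index_touched(grid, block, ratio) < count;
}
```

With the actual gridDim, blockDim, RATIO, and sz substituted, a false result from stays_in_bounds would mean the kernel can write past the end of A. If B happens to sit after A in device memory, that would be one possible explanation for B changing; a true result would point elsewhere.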
Source: "cuda - Used cub::BlockRadixSort to sort one array but an irrelavant array in global memory changed" - Stack Overflow