I am continuing my exploration of CUDA and I am facing a problem with kernels. I call a kernel inside another kernel and I want to wait for it to finish before continuing in the parent kernel.
Here is the sample code.
#include <stdio.h>

__global__ void cmp_int(int const* a, int const* b, int* cmp)
{
    printf("cmp_int cmp %d\n", *cmp);
    if (*a == *b)
        *cmp = 0;
    else if (*a < *b)
        *cmp = -1;
    else
        *cmp = 1;
    printf("cmp_int cmp %d\n", *cmp);
}

__global__ void test_cmp()
{
    int* a;
    cudaMalloc(&a, sizeof(int));
    *a = 2;
    int* b;
    cudaMalloc(&b, sizeof(int));
    *b = 3;
    int* cmp;
    cudaMalloc(&cmp, sizeof(int));
    *cmp = -3;
    printf("test_cmp cmp %d\n", *cmp);
    cmp_int<<<1, 1>>>(a, b, cmp);
    // How to wait here for finishing of the launched kernel of the previous line?
    printf("test_cmp cmp %d\n", *cmp);
    cudaFree(a);
    cudaFree(b);
    cudaFree(cmp);
}

int main()
{
    test_cmp<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
The output is:
test_cmp cmp -3
test_cmp cmp -3
cmp_int cmp -3
cmp_int cmp -1
whereas I am expecting the following output:
test_cmp cmp -3
test_cmp cmp -1
cmp_int cmp -3
cmp_int cmp -1
In my research on other topics, I saw __syncthreads(), but I don't understand how to use it correctly or whether it is appropriate for my problem (I tried some things, but without success).
asked Feb 12 at 7:37 by P'tit Ju, edited Feb 12 at 18:29

2 Answers
Some of this is already covered in the comments below the question. I'll cover some background first, then get to suggestions. You can skip the background if the length of this post is bothersome. CUDA stream concepts are essential for understanding. I won't do a full treatment here, but in a nutshell: work items launched into the same stream execute in issue order, and work items issued into separate non-default streams have no defined ordering with respect to each other.
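As a host-side illustration of that ordering rule (my sketch, with hypothetical kernel names not taken from the question):

```cuda
#include <cstdio>

__global__ void kernelA() { printf("A\n"); }
__global__ void kernelB() { printf("B\n"); }
__global__ void kernelC() { printf("C\n"); }

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<1, 1, 0, s1>>>();  // A and B share s1: B begins only after A completes
    kernelB<<<1, 1, 0, s1>>>();
    kernelC<<<1, 1, 0, s2>>>();  // C is in s2: no defined ordering vs. A or B

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}
```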
Background
Historically, it was possible to use cudaDeviceSynchronize() in CUDA device code (subject to some constraints that are typical of CUDA Dynamic Parallelism - CDP - which is the general subject that covers launching a kernel from device code).
The CUDA designers took a different tack as we got close to the Hopper timeframe, and declared (in CUDA 11.6) that such a paradigm would not be supported in the future. There are potentially several different approaches that CUDA developers might use as a "replacement". However, there are no future-proof replacements that trivially substitute for cudaDeviceSynchronize() to achieve exactly the same effect.
The replacement technology is referred to as CDP2, whereas the original (now deprecated) approach is now referred to as CDP1.
Key aspects of the replacement technology include two new pre-defined streams that are usable for a kernel launch in device code:
cudaStreamTailLaunch - use this stream on the child kernel launch when you want some ordering of the launched child kernel work compared to other work being done in the parent kernel. This will cause all other work launched from the parent kernel to complete, before any work in the tail launch stream begins. As suggested in the documentation this could be a method to arrange for synchronization of work in the parent kernel, but does not allow the parent kernel code to absorb or reflect global changes taking place in the child kernel. But it does guarantee that the child kernel does not begin until all (non-tail-launch) work from the parent kernel has finished.
cudaStreamFireAndForget - Child kernels launched into this stream have no defined ordering with respect to parent kernel code, or other work launched by the parent kernel, except that all fire-and-forget work will complete before any tail launch work begins.
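A minimal sketch of how these two streams are used in a device-side launch (hypothetical kernel names; compilation with relocatable device code, e.g. nvcc -rdc=true, and a CUDA 12+ toolchain are assumed):

```cuda
#include <cstdio>

__global__ void child() { printf("child running\n"); }

__global__ void parent()
{
    // No ordering guarantee relative to the rest of parent's code:
    child<<<1, 1, 0, cudaStreamFireAndForget>>>();

    // Begins only after all other work launched from parent
    // (including the fire-and-forget launch above) has completed:
    child<<<1, 1, 0, cudaStreamTailLaunch>>>();
}

int main()
{
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```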
Suggestions/Options
As already covered, there is no exact replacement to achieve this:
cmp_int<<<1, 1>>>(a, b, cmp); // child kernel launch
// a synchronization of some sort so that ordinary parent kernel code
printf("test_cmp cmp %d\n", *cmp); // can "see" the results of a previous child launch
so I am not suggesting that any of these options do exactly that. You will have to consider each with its limitations.
convert the kernel launch to ordinary parent kernel thread-code, i.e. get rid of the child kernel launch. Subsequent parent kernel thread code can then see the effect of that converted code. It's understood that this is non-trivial, or won't work or be an option, if the child kernel launch is sufficiently complex
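For the question's code, this first option might look like the following sketch (my code, not the answer author's): the child kernel becomes an ordinary __device__ function, so the call completes before the next statement in the parent thread.

```cuda
#include <cstdio>

// The former child kernel, converted to a plain device function
__device__ void cmp_int(int const* a, int const* b, int* cmp)
{
    if (*a == *b)
        *cmp = 0;
    else if (*a < *b)
        *cmp = -1;
    else
        *cmp = 1;
}

__global__ void test_cmp()
{
    int a = 2, b = 3, cmp = -3;
    printf("test_cmp cmp %d\n", cmp);
    cmp_int(&a, &b, &cmp);             // ordinary call: finished before the next line
    printf("test_cmp cmp %d\n", cmp);  // sees the updated value (-1)
}

int main()
{
    test_cmp<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```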
you can still access the "old" CDP1 methodology if you specify a compile switch (-DCUDA_FORCE_CDP1_IF_SUPPORTED) and are compiling for and running on a GPU arch of less than cc9.0. So this is not an option on Hopper, Blackwell, and future GPU architectures. In that case, this should work:

cmp_int<<<1, 1>>>(a, b, cmp); // child kernel launch
cudaDeviceSynchronize();
printf("test_cmp cmp %d\n", *cmp); // can "see" the results of a previous child launch
even if possible for your case, this is not a good future-proof way to write code
Without particular attention to CDP1 vs. CDP2 we could use stream semantics to get both the ordering and visibility that we want:
__global__ void child_print(int *cmp) { printf("test_cmp cmp %d\n", *cmp); }
...
cmp_int<<<1, 1>>>(a, b, cmp); // child kernel launch
child_print<<<1, 1>>>(cmp);   // can "see" the results of a previous child launch
both kernels are launched into the null stream in this case, and stream semantics guarantee that child_print will not begin until cmp_int is done; furthermore, the kernel launch boundaries guarantee that global changes made by cmp_int are visible to code executing in child_print
We can use the new tail launch stream to guarantee that child_print does not begin until all other parent kernel work is complete:
__global__ void child_print(int *cmp) { printf("test_cmp cmp %d\n", *cmp); }
...
cmp_int<<<1, 1>>>(a, b, cmp); // child kernel launch - could be issued into the fire-and-forget stream also
child_print<<<1, 1, 0, cudaStreamTailLaunch>>>(cmp); // can "see" the results of a previous child or fire-and-forget launch
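Putting those pieces together for the question's program, a complete sketch might look like this (my assembly of the fragments, not verbatim from the answer; the buffers are allocated from the host to keep the example simple, and -rdc=true plus a CDP2-capable toolchain are assumed):

```cuda
#include <cstdio>

__global__ void cmp_int(int const* a, int const* b, int* cmp)
{
    *cmp = (*a == *b) ? 0 : (*a < *b ? -1 : 1);
}

__global__ void child_print(int* cmp) { printf("test_cmp cmp %d\n", *cmp); }

__global__ void test_cmp(int* a, int* b, int* cmp)
{
    *a = 2; *b = 3; *cmp = -3;
    printf("test_cmp cmp %d\n", *cmp);
    cmp_int<<<1, 1>>>(a, b, cmp);                        // child kernel launch
    child_print<<<1, 1, 0, cudaStreamTailLaunch>>>(cmp); // runs after all other parent work
}

int main()
{
    int *a, *b, *cmp;
    cudaMalloc(&a, sizeof(int));
    cudaMalloc(&b, sizeof(int));
    cudaMalloc(&cmp, sizeof(int));
    test_cmp<<<1, 1>>>(a, b, cmp);
    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(cmp);
    return 0;
}
```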
Additional Resources
general descriptive CDP blog, predates CDP2 but still mostly accurate.
programming guide
stream semantics, not CDP specific
Finally, I found a workaround. I don't know if it is the best solution but it works.
#include <stdio.h>

__global__ void cmp_int(int const* a, int const* b, int* cmp, int* synchro)
{
    printf("cmp_int cmp %d\n", *cmp);
    if (*a == *b)
        *cmp = 0;
    else if (*a < *b)
        *cmp = -1;
    else
        *cmp = 1;
    ++*synchro;
    printf("cmp_int cmp %d\n", *cmp);
}

__global__ void test_cmp()
{
    int* a;
    cudaMalloc(&a, sizeof(int));
    *a = 2;
    int* b;
    cudaMalloc(&b, sizeof(int));
    *b = 3;
    int* cmp;
    cudaMalloc(&cmp, sizeof(int));
    __shared__ int* synchro;
    cudaMalloc(&synchro, sizeof(int));
    *synchro = 0;
    printf("test_cmp cmp %d\n", *cmp);
    cmp_int<<<1, 1>>>(a, b, cmp, synchro);
    while (!*synchro)
        __syncthreads();
    printf("test_cmp cmp %d\n", *cmp);
    cudaFree(a);
    cudaFree(b);
    cudaFree(cmp);
}

int main()
{
    test_cmp<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
With the output:
test_cmp cmp 0
cmp_int cmp 0
cmp_int cmp -1
test_cmp cmp -1
The last line has the correct value of the cmp variable.
Comments:
__syncthreads() does only synchronize the threads of the block. The child kernel, with its own grid, is not affected by it; it might run on another SM altogether. – paleonix, Feb 12 at 9:05
cudaDeviceSynchronize() and similar was removed as it was bad for performance. One can still opt for it on older hardware, but if you want to write forward-compatible code, the answer is that there is no builtin way, and trying to do so via atomics will lead to undefined behavior/deadlocks, as there isn't even a guarantee that the child kernel will even start running before the parent finishes (e.g. due to hardware resource limitations). – paleonix, Feb 12 at 9:13
[…] a __device__ function which will be executed synchronously (in most cases even inlined into the calling code). – paleonix, Feb 12 at 9:15