
I am continuing my exploration of CUDA and am facing a problem with kernels. I call a kernel inside another kernel, and I want to wait for it to finish before continuing in the parent kernel.

Here is the sample code.

#include <stdio.h>

__global__ void cmp_int(int const* a, int const* b, int* cmp)
{
    printf("cmp_int cmp %d\n", *cmp);
    if (*a == *b)
        *cmp = 0;
    else if (*a < *b)
        *cmp = -1;
    else
        *cmp = 1;
    printf("cmp_int cmp %d\n", *cmp);
}

__global__ void test_cmp()
{
    int* a;
    cudaMalloc(&a, sizeof(int));
    *a = 2;
    int* b;
    cudaMalloc(&b, sizeof(int));
    *b = 3;
    int* cmp;
    cudaMalloc(&cmp, sizeof(int));
    *cmp = -3;

    printf("test_cmp cmp %d\n", *cmp);
    cmp_int<<<1, 1>>>(a, b, cmp);
    // How can I wait here for the kernel launched on the previous line to finish?
    printf("test_cmp cmp %d\n", *cmp);

    cudaFree(a);
    cudaFree(b);
    cudaFree(cmp);
}

int main()
{
    test_cmp<<<1, 1>>>();
    cudaDeviceSynchronize();
    
    return 0;
}

The output is:

test_cmp cmp -3
test_cmp cmp -3
cmp_int cmp -3
cmp_int cmp -1

whereas I am expecting the following output:

test_cmp cmp -3
test_cmp cmp -1
cmp_int cmp -3
cmp_int cmp -1

In my research on other topics, I saw __syncthreads(), but I don't understand its proper use or whether it is appropriate for my problem (I tried some things, without success).


asked Feb 12 at 7:37 by P'tit Ju, edited Feb 12 at 18:29
  • Related: Synchronizing dynamic parallelism in CDP2 – paleonix Commented Feb 12 at 9:03
  • __syncthreads() only synchronizes the threads of the block. The child kernel, with its own grid, is not affected by it; it might run on another SM altogether. – paleonix Commented Feb 12 at 9:05
  • TLDR: On the newest hardware the feature of waiting for child kernels via cudaDeviceSynchronize() and similar was removed as it was bad for performance. One can still opt for it on older hardware, but if you want to write forward compatible code, the answer is that there is no builtin way and trying to do so via atomics will lead to undefined behavior/deadlocks, as there isn't even a guarantee that the child kernel will even start running before the parent finishes (e.g. due to hardware resource limitations). – paleonix Commented Feb 12 at 9:13
  • Just in case it wasn't clear, for the given toy example, you can just use a __device__ function which will be executed synchronously (in most cases even inlined into the calling code). – paleonix Commented Feb 12 at 9:15

2 Answers


Some of this is already covered in the comments below the question. I'll cover some background first, then get to suggestions. You can skip the background if the length of this post is bothersome. CUDA stream concepts are essential for understanding. I won't do a full treatment here, but in a nutshell work items launched into the same stream execute in issue order, and work items issued into separate non-default streams have no defined ordering with respect to each other.

Background

Historically, it was possible to use cudaDeviceSynchronize() (subject to some constraints that are typical of CUDA Dynamic Parallelism - CDP - which is the general subject that covers launching a kernel from device code) in CUDA device code.

The CUDA designers took a different tack as we got close to the Hopper timeframe, and declared (in CUDA 11.6) that such a paradigm would not be supported in the future. There are potentially several different approaches that CUDA developers might use as a "replacement". However there are no future-proof replacements that trivially replace cudaDeviceSynchronize() to otherwise achieve exactly the same effect.

The replacement technology is referred to as CDP2, whereas the original (now deprecated) approach is now referred to as CDP1.

Key aspects of the replacement technology include two new pre-defined streams that are usable for a kernel launch in device code:

  • cudaStreamTailLaunch - use this stream for a child kernel launch when you want ordering of the child kernel work relative to other work done in the parent kernel. A tail launch does not begin until the parent grid, and any other (non-tail-launch) work it launched, has completed. As suggested in the documentation, this can be a method to arrange for synchronization of work in the parent kernel, but it does not allow the parent kernel code itself to observe global changes made by the child kernel.

  • cudaStreamFireAndForget - Child kernels launched into this stream have no defined ordering with respect to parent kernel code, or other work launched by the parent kernel, except that all fire-and-forget work will complete before any tail launch work begins.
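As a minimal sketch of how the two streams relate, consider the hypothetical work/report kernels below (names are mine, not from the question; this assumes a CDP2-capable device and a CUDA 12+ toolkit, compiled with relocatable device code, e.g. nvcc -rdc=true):

```cuda
#include <stdio.h>

// Hypothetical child kernels used only to illustrate the stream semantics.
__global__ void work(int* out) { *out += 1; }
__global__ void report(int const* out) { printf("out = %d\n", *out); }

__global__ void parent(int* out)
{
    // No ordering guarantee relative to the rest of parent's own code:
    work<<<1, 1, 0, cudaStreamFireAndForget>>>(out);
    // Begins only after parent and all its fire-and-forget work complete,
    // so it is guaranteed to see the increment made by work():
    report<<<1, 1, 0, cudaStreamTailLaunch>>>(out);
}

int main()
{
    int* out;
    cudaMalloc(&out, sizeof(int));
    cudaMemset(out, 0, sizeof(int));
    parent<<<1, 1>>>(out);
    cudaDeviceSynchronize();  // the host waits for parent and all its children
    cudaFree(out);
    return 0;
}
```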

Suggestions/Options

As already covered, there is no exact replacement to achieve this:

cmp_int<<<1, 1>>>(a, b, cmp);  // child kernel launch
// a synchronization of some sort so that ordinary parent kernel code
printf("test_cmp cmp %d\n", *cmp); //  can "see" the results of a previous child launch

so I am not suggesting that any of these options do exactly that. You will have to consider each with its limitations.

  1. Convert the kernel launch to ordinary parent-kernel thread code, i.e. get rid of the child kernel launch. Subsequent parent kernel thread code can then see the effect of that converted code. It's understood this is non-trivial, or may not work or be an option, if the child kernel launch is sufficiently complex.

  2. You can still access the "old" CDP1 methodology if you specify a compile switch (-DCUDA_FORCE_CDP1_IF_SUPPORTED) and are compiling for and running on a GPU architecture of less than cc9.0. So this is not an option on Hopper, Blackwell, and future GPU architectures. In that case, this should work:

     cmp_int<<<1, 1>>>(a, b, cmp);  // child kernel launch
     cudaDeviceSynchronize();
     printf("test_cmp cmp %d\n", *cmp); //  can "see" the results of a previous child launch
    

     Even if this is possible in your case, it is not a future-proof way to write code.

  3. Without particular attention to CDP1 vs. CDP2 we could use stream semantics to get both the ordering and visibility that we want:

     __global__ void child_print(int *cmp) { printf("test_cmp cmp %d\n", *cmp);}
     ...
     cmp_int<<<1, 1>>>(a, b, cmp);  // child kernel launch
     child_print<<<1,1>>>(cmp); //  can "see" the results of a previous child launch 
    

     Both kernels are launched into the null stream in this case. Stream semantics guarantee that child_print will not begin until cmp_int is done, and the kernel launch boundaries further guarantee that global changes made by cmp_int are visible to code executing in child_print.

  4. We can use the new tail launch stream to guarantee that child_print does not begin until all other parent kernel work is complete:

     __global__ void child_print(int *cmp) { printf("test_cmp cmp %d\n", *cmp);}
     ...
     cmp_int<<<1, 1>>>(a, b, cmp);  // child kernel launch - could be issued into the fire-and-forget stream also
     child_print<<<1,1, 0, cudaStreamTailLaunch>>>(cmp); //  can "see" the results of a previous child launch or fire-and-forget launch
    
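For the toy example in the question, option 1 is straightforward. Below is a sketch: the child kernel becomes a __device__ function, and the in-kernel cudaMalloc calls (unnecessary for this case) are replaced with local variables.

```cuda
#include <stdio.h>

// The child "kernel" rewritten as a __device__ function: it executes
// synchronously in the calling thread, so no synchronization is needed.
__device__ void cmp_int(int const* a, int const* b, int* cmp)
{
    if (*a == *b)
        *cmp = 0;
    else if (*a < *b)
        *cmp = -1;
    else
        *cmp = 1;
}

__global__ void test_cmp()
{
    int a = 2, b = 3, cmp = -3;
    printf("test_cmp cmp %d\n", cmp);   // prints -3
    cmp_int(&a, &b, &cmp);              // ordinary call, result visible on return
    printf("test_cmp cmp %d\n", cmp);   // prints -1
}

int main()
{
    test_cmp<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```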

Additional Resources

general descriptive CDP blog, predates CDP2 but still mostly accurate.

programming guide

stream semantics, not CDP specific


Finally, I found a workaround. I don't know if it is the best solution but it works.

#include <stdio.h>

__global__ void cmp_int(int const* a, int const* b, int* cmp, int* synchro)
{
    printf("cmp_int cmp %d\n", *cmp);

    if (*a == *b)
        *cmp = 0;
    else if (*a < *b)
        *cmp = -1;
    else
        *cmp = 1;

    ++*synchro;

    printf("cmp_int cmp %d\n", *cmp);
}

__global__ void test_cmp()
{
    int* a;
    cudaMalloc(&a, sizeof(int));
    *a = 2;
    int* b;
    cudaMalloc(&b, sizeof(int));
    *b = 3;
    int* cmp;
    cudaMalloc(&cmp, sizeof(int));
    
    __shared__ int* synchro;
    cudaMalloc(&synchro, sizeof(int));
    *synchro = 0;

    printf("test_cmp cmp %d\n", *cmp);

    cmp_int<<<1, 1>>>(a, b, cmp, synchro);
    
    while (!*synchro)
        __syncthreads();

    printf("test_cmp cmp %d\n", *cmp);
    
    cudaFree(a);
    cudaFree(b);
    cudaFree(cmp);
}

int main()
{
    test_cmp<<<1, 1>>>();
    cudaDeviceSynchronize();
    
    return 0;
}

With the output:

test_cmp cmp 0
cmp_int cmp 0
cmp_int cmp -1
test_cmp cmp -1

The last line has the correct value of the cmp variable.

Tags: How to wait for a kernel to finish inside another kernel in CUDA - Stack Overflow