I have the following problem in CUDA. Suppose I have two locations in memory, a and b. Let's say they are 128-bit unsigned integers, used as bitmasks.
Thread A is going to modify a and then read from b.
Thread B is going to modify b and then read from a.
I need to ensure that at least one thread (does not matter which one) will know about both modifications. Since each thread will know about its own modification, I have to make sure that at least one of the threads sees the value modified by the other thread.
If I understand the CUDA documentation correctly, I could maybe achieve this as follows:
// Thread A
(void)atomicOr(a, 0x2); // 0x2 is just an example, let's say I want to set bit 2 of *a
__threadfence();
b_read_by_A = atomicOr(b, 0);
// Thread B
(void)atomicOr(b, 0x2);
__threadfence();
a_read_by_B = atomicOr(a, 0);
My reasoning why this should be correct is as follows: __threadfence() guarantees that a write occurring after the __threadfence() does not become visible when a write occurring before the __threadfence() has not. Now we have two possibilities:
Either the result of atomicOr(b,0) executed by A is visible to B when atomicOr(b,0x2) is executed; then the result of atomicOr(a,0x2) is also visible to B, because of the __threadfence() in A. In this case, B will know about the modification of a.
Or the result of atomicOr(b,0) is not visible to B when atomicOr(b,0x2) is executed. Then, because the operations are atomic, the result of atomicOr(b,0x2) will be visible to A when atomicOr(b,0) is executed. In this case, A will know about the modification of b.
Is my reasoning correct?
Am I right in assuming that I cannot replace the second atomicOr, i.e. atomicOr(.,0), by a simple read? And that I do need the __threadfence()s?
1 Answer
If these are really 128-bit values, then you cannot use this code, because CUDA does not have 128-bit atomic operations. Maybe Blackwell does, but I'm guessing you do not have a Blackwell chip, and libcu++ has not caught up with Blackwell yet.
One thread might change the upper and lower 64 bits while another thread does a torn read, seeing only half of the update. There is no way your code guards against that. You did add the __threadfence() to fix the relaxed nature of the atomics, but that is not enough.
You can have atomicOr(32-bit) or atomicOr(64-bit), but not 128-bit. Because of this, if you want to work on structures larger than 64 bits, you need to use a mutex.
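To make the tearing concrete, this is roughly what a "128-bit OR" ends up looking like with the atomics that do exist: two independent 64-bit read-modify-writes. (A sketch; mask128 and or_128 are made-up names.)

// Sketch: a "128-bit atomicOr" built from the 64-bit atomics that exist.
// Another thread can observe the mask between the two calls, i.e. with
// only the low half updated -- a torn read.
struct mask128 { unsigned long long lo, hi; };

__device__ void or_128(mask128* m, unsigned long long lo_bits, unsigned long long hi_bits)
{
    atomicOr(&m->lo, lo_bits); // this update is visible on its own...
    atomicOr(&m->hi, hi_bits); // ...before this one lands
}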
Also it is good to keep in mind that CUDA atomics use relaxed memory ordering. If you use the <cuda/atomic> header from libcu++ you'll get a more comprehensive set of memory orderings; sequential consistency is the default.
See: https://nvidia.github.io/cccl/libcudacxx/extended_api/synchronization_primitives.html
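For example, a 64-bit version of the flag exchange from the question might look roughly like this with libcu++ (a sketch; the thread_A helper and the parameter names are made up):

#include <cuda/atomic>

using mask_t = cuda::atomic<unsigned long long, cuda::thread_scope_device>;

// Sketch: fetch_or/load default to seq_cst, so no explicit __threadfence()
// is needed; the memory order can also be spelled out explicitly.
// Thread B is symmetric, with a and b swapped.
__device__ void thread_A(mask_t& a, mask_t& b, unsigned long long& b_seen)
{
    a.fetch_or(0x2ull, cuda::std::memory_order_seq_cst); // publish A's modification
    b_seen = b.load(cuda::std::memory_order_seq_cst);    // then read b
}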
But note that CUDA does not support lock-free atomic operations > 64-bits.
IMO the best and fastest solution is to use a mutex to guard the data. Do not roll your own atomic solutions; it's best to stick with known idioms.
#include <cuda.h>
#include <cuda/semaphore>  // cuda::binary_semaphore
#include <cuda/atomic>     // cuda::atomic_ref (see the commented-out line below)

class mybitset {
    cuda::bitset<128> _bits;
    // mutable so that old() const can still take the lock;
    // initialised to 1 so the semaphore starts out "unlocked"
    mutable cuda::binary_semaphore<cuda::thread_scope_system> lock{1};
    struct lock_guard { // RAII lock guard, unlocks at end of scope
        cuda::binary_semaphore<cuda::thread_scope_system>& _lock;
        lock_guard(cuda::binary_semaphore<cuda::thread_scope_system>& lock) : _lock(lock) { _lock.acquire(); }
        ~lock_guard() { _lock.release(); }
    };
public:
    //cuda::atomic_ref<decltype(_bits), cuda::thread_scope_device> bits(_bits); //not allowed for data > 8 bytes
    void set(int pos, bool value) {
        auto _lock = lock_guard(lock);
        _bits.set(pos, value);
    }
    cuda::bitset<128> old() const {
        auto _lock = lock_guard(lock);
        return _bits; // return a copy
    }
    // for the rest, see std::bitset
};
The lock makes sure every thread sees a consistent view of the data and never one that has half baked updates from another thread.
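Applied to the question, usage might look roughly like this (a sketch; the exchange kernel and the idea of one mybitset instance each for a and b in global memory are assumptions):

// Sketch: the A/B exchange on top of mybitset. set() publishes the
// modification under the lock, old() takes a locked snapshot, so neither
// thread can see a torn 128-bit value.
__global__ void exchange(mybitset* a, mybitset* b)
{
    if (threadIdx.x == 0) {                  // "thread A"
        a->set(1, true);                     // set bit 1 (the 0x2 bit from the question)
        cuda::bitset<128> b_seen = b->old(); // then read b
        (void)b_seen;
    } else if (threadIdx.x == 1) {           // "thread B"
        b->set(1, true);
        cuda::bitset<128> a_seen = a->old();
        (void)a_seen;
    }
}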
There is no way to do atomics on more than 64 bits without some form of locking. You might integrate the mutex into the bitset by using one of those 128 bits as a lock bit, but then you'd have to write the atomicCAS loop to perform that lock yourself (sketched below); using the mutex as shown above is much easier.
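For completeness, a hand-rolled lock would look something like the classic atomicCAS spinlock below (a sketch, not a recommendation; the separate 32-bit mutex word is an assumption):

// Sketch: spinlock built from atomicCAS/atomicExch. Easy to get wrong on
// GPUs (e.g. intra-warp deadlock on pre-Volta hardware), which is why the
// cuda::binary_semaphore above is the better choice.
__device__ void spin_lock(unsigned int* mutex)
{
    while (atomicCAS(mutex, 0u, 1u) != 0u) { /* spin until we own it */ }
    __threadfence(); // make the previous owner's writes visible before touching the data
}

__device__ void spin_unlock(unsigned int* mutex)
{
    __threadfence(); // make our writes visible before releasing
    atomicExch(mutex, 0u);
}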
Make sure you never access the data without locking it first.
Always use the lock_guard; that way you can never forget to unlock.
See: https://nvidia.github.io/cccl/libcudacxx/index.html Most stuff in libcu++ is backported to at least C++17, so you don't need C++20.
cuda::atomic_ref would probably be the cleanest solution but as I understand it, a simple volatile read is sufficient. – Homer512 Commented Feb 7 at 20:31
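For reference, the cuda::atomic_ref approach mentioned in the comment might look roughly like this for a 64-bit word (a sketch; publish_and_read and its parameters are made up, and the word must be suitably aligned):

#include <cuda/atomic>

// Sketch: wrap plain 64-bit words in cuda::atomic_ref just for the accesses.
__device__ unsigned long long publish_and_read(unsigned long long* a, unsigned long long* b)
{
    cuda::atomic_ref<unsigned long long, cuda::thread_scope_device> ref_a(*a);
    cuda::atomic_ref<unsigned long long, cuda::thread_scope_device> ref_b(*b);
    ref_a.fetch_or(0x2ull); // seq_cst by default
    return ref_b.load();    // seq_cst by default
}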