I am trying to initialize a std::vector inside an OpenMP parallel for loop, but doing so takes significantly more time than even a serial loop would.
Here is an example of what I am trying:
#include <iostream>
#include <vector>
#include <omp.h>
#include <chrono>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();
    // processor supports 24 threads
    //#pragma omp parallel for
    for (int i = 0; i < 24; ++i) {
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;
                    std::vector<int> resultsOver200 = {};
                    if (result > 200) {
                        resultsOver200.push_back(result);
                    }
                    // do something else with result
                }
            }
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;
    std::cout << "Execution time: " << duration.count() << " seconds";
}
Without utilizing omp I get this as a result:
Execution time: 4.60552 seconds
If I uncomment the '#pragma omp parallel for' and run it again, the result is:
Execution time: 38.7252 seconds
This is a very simplified version of what I am actually trying to accomplish, but it gets the point across. Does anyone know why this is happening, or have a way for me to use a dynamic array whose element count is unknown at compile time while still taking advantage of multithreading?
Also, as a note: I definitely want to create the new vector within the innermost for loop.
asked by Derek Jones · 3 Answers
Allocations generally do not scale. In fact, contention on the default allocator can make the code significantly slower. Allocators typically use locks and atomics, so when many threads fight over the same shared resources, contention happens, and this contention is very expensive: it not only serializes the code but also makes it slower due to higher latency (e.g. cache-line bouncing, NUMA effects, etc.).
To reduce this overhead, you can use an allocator better suited to parallel code, one that uses thread-local pools (e.g. jemalloc). No allocator scales perfectly, though, and in this case most will not scale, so this alone is certainly not enough.
One efficient solution is to recycle the vector in order to reduce allocations. Allocations are expensive, even in sequential code, and you should avoid them as much as possible in hot loops. You could use a custom allocator for that (see the answer of @AhmedAEK), or simply move the initialization of resultsOver200 outside the whole loop nest and call resultsOver200.resize(0) before using it in the innermost loop (it will not be reallocated unless the capacity is too small). Be careful though: the vector's capacity can stay huge, so it may be worth periodically forcing the capacity back down so as not to use too much RAM (e.g. by calling resultsOver200.shrink_to_fit() in the i-based or j-based loop).
By the way, you should use steady_clock instead of high_resolution_clock for measuring wall-clock time.
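A minimal sketch of this recycling idea (the countOver200 helper and the counting are assumptions standing in for the OP's unspecified processing): the vector is hoisted out of the hot loops and cleared each iteration, so its capacity is reused instead of reallocated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts results over 200 while reusing one vector
// per thread instead of allocating a new one in the innermost loop.
std::size_t countOver200()
{
    std::size_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < 24; ++i) {
        std::vector<int> resultsOver200;        // hoisted: one object per i-iteration
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                resultsOver200.clear();         // keeps capacity: no reallocation
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;
                    if (result > 200)
                        resultsOver200.push_back(result);
                }
                total += resultsOver200.size(); // "do something" with the results
            }
        }
    }
    return total;
}
```

With -fopenmp each thread gets its own vector (it is declared inside the parallel loop body), so no sharing or locking is needed; without -fopenmp the pragma is ignored and the code still runs correctly serially.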
The global allocator is usually not designed for fast concurrent allocations. Allocators that are, called scalable allocators (like TBB's scalable allocator, jemalloc, and many others), typically trade memory usage for performance.
In this specific case you could allocate the vector's buffer on the stack for a while before spilling over to the heap, using std::pmr::monotonic_buffer_resource and std::pmr::vector:
#include <iterator>          // std::size
#include <memory_resource>   // std::pmr
#include <vector>

char buff[4096]; // create a stack buffer
// use the stack buffer until it is full, then fall over to the heap
std::pmr::monotonic_buffer_resource resource{ buff, std::size(buff) };
std::pmr::vector<int> resultsOver200{ &resource };
This makes allocations faster even for a single thread, and you get perfect multithreaded scaling as long as the allocations fit in the stack buffer.
The downside is that to move the data out of this vector into a non-pmr one, you need to do an element-wise copy, as this vector's memory lives on the stack.
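A self-contained sketch of how this could look for the innermost loop (the collectOver200 helper and the copy-out at the end are illustrative assumptions, not the OP's actual code):

```cpp
#include <iterator>          // std::size
#include <memory_resource>   // std::pmr::monotonic_buffer_resource
#include <vector>

// Hypothetical helper: builds the per-iteration result list in a
// stack-backed pmr vector, then copies it out element-wise.
std::vector<int> collectOver200(int i, int j, int k)
{
    char buff[4096]; // stack buffer; heap is used only if this overflows
    std::pmr::monotonic_buffer_resource resource{ buff, std::size(buff) };
    std::pmr::vector<int> resultsOver200{ &resource };

    for (int l = 0; l < 100; ++l) {
        int result = i + j + k + l;
        if (result > 200)
            resultsOver200.push_back(result);
    }

    // element-wise copy out: the pmr vector's storage cannot be moved
    // into a non-pmr vector, because it lives in buff
    return std::vector<int>(resultsOver200.begin(), resultsOver200.end());
}
```

Note that a monotonic_buffer_resource never reuses freed memory, so it is best kept short-lived, as here, where it dies at the end of each call.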
While push_back is as efficient as possible (read up on "amortized complexity"), it is still comparatively slow. You could reserve your vector to some upper bound, which speeds up the push_back considerably.
This does not entirely explain why the parallel version would be slower than the sequential one, though. Could it be that the parallel version overflows your memory and you are swapping to disk?
But really, I would try to reformulate the algorithm. Do you actually need the vector, or is there a way to do the resulting computation without storing it?
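The reserve suggestion above can be sketched as follows (the resultsForRow helper and the upper bound of 100 are assumptions based on the OP's loop bounds):

```cpp
#include <vector>

// Hypothetical helper: reserves the worst-case capacity up front so
// push_back never reallocates inside the hot loop.
std::vector<int> resultsForRow(int i, int j, int k)
{
    std::vector<int> resultsOver200;
    resultsOver200.reserve(100); // at most one result per l-iteration

    for (int l = 0; l < 100; ++l) {
        int result = i + j + k + l;
        if (result > 200)
            resultsOver200.push_back(result);
    }
    return resultsOver200;
}
```

Reserving costs one allocation per call; combined with hoisting the vector out of the loop nest (as in the accepted answer), even that allocation disappears from the hot path.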
Why do you create resultsOver200 at all? It ceases to exist at the end of your inner loop, and you don't do anything with it. – Caleth Commented 22 hours ago
Don't use high_resolution_clock but steady_clock. Don't complain about performance if you don't have an optimized release build, and do not keep allocating vectors in the inner loop. In general, if you have performance questions, don't ask them here before you've used a profiler and understand what it tells you the bottleneck is on your system (it might in part even be missed branch predictions). But yes, in your case it is most likely the memory allocations. – Pepijn Kramer Commented 21 hours ago
g++ = 2.5 sec.; g++ -fopenmp = 0.9 sec.; g++ -O3 = 0.5 sec.; g++ -O3 -fopenmp = 0.18 sec. – PierU Commented 18 hours ago