I am trying to initialize a std::vector inside a parallel for loop using OpenMP, but doing so takes significantly more time than even a serial loop would.

Here is an example of what I am trying:

#include <iostream>
#include <vector>
#include <omp.h>
#include <chrono>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    //processor supports 24 threads
//#pragma omp parallel for
    for (int i = 0; i < 24; ++i) {
        
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;

                    std::vector<int> resultsOver200 = {};
                    if (result > 200) {
                        resultsOver200.push_back(result);
                    }
                    //do something else with result
                }
            }
        }
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;

    std::cout << "Execution time: " << duration.count() << " seconds";
}

Without utilizing omp I get this as a result:

Execution time: 4.60552 seconds

If I uncomment the '#pragma omp parallel for' and run it the results are:

Execution time: 38.7252 seconds

This is a very simplified version of what I am actually trying to accomplish but gets the point across. Anyone know why this is happening or have a way for me to utilize a dynamic array where I don't know the element count at compile time while still taking advantage of multithreading?

Also as a note, I definitely want to create the new vector within the innermost for loop.

  • The time measurements you posted seem quite large. Are you using an optimized release build? – wohlstad
  • Why do you need resultsOver200? It ceases to exist at the end of your inner loop, and you don't do anything with it. – Caleth
  • Don't use high_resolution_clock but steady_clock. Don't complain about performance if you don't have an optimized release build and you keep allocating vectors in the inner loop. In general, if you have performance questions, don't ask them here before you've used a profiler and understand what it tells you the bottleneck is on your system (it might in part even be missed branch predictions). But yes, in your case it is most likely the memory allocations. – Pepijn Kramer
  • This code looks fishy (for example, the vector holds at most a single value). Also, since there are blanks in the critical part of the code, it is hard to tell whether the problem is hidden in those blanks. – Marek R
  • Your timings are hard to believe... On a quad-core 14-year-old computer I get: g++ = 2.5 sec.; g++ -fopenmp = 0.9 sec.; g++ -O3 = 0.5 sec.; g++ -O -fopenmp = 0.18 sec. – PierU

3 Answers


Allocations generally do not scale. In fact, contention on the default allocator can make the code significantly slower. Allocators typically use locks and atomics, so when many threads fight over the same shared resources, contention happens, and this contention is very expensive: it not only serializes the code but makes it slower due to higher latency (e.g. cache-line bouncing, NUMA effects, etc.).

To reduce this overhead, you can use an allocator better suited to parallel code, one using thread-local pools (e.g. jemalloc). No allocator scales perfectly, though, and in this case most will not scale, so this alone is certainly not enough.

One efficient solution is to recycle the vector in order to reduce the number of allocations. Allocations are expensive, even in sequential code, and you should avoid them as much as possible in hot loops. You could use custom allocators for that (see the answer of @AhmedAEK), or simply move the declaration of resultsOver200 outside the loop nest and call resultsOver200.resize(0) before using it in the innermost loop (it will not be reallocated as long as the existing capacity is large enough). Be careful though: the vector's capacity can stay huge, so it may be good to force its capacity back down periodically so as not to use too much RAM (e.g. by calling resultsOver200.shrink_to_fit() in the i-based or j-based loop).
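A minimal sketch of that recycling approach applied to the question's loop nest. Splitting #pragma omp parallel from #pragma omp for lets each thread declare its own private vector once and reuse it across iterations:

#include <vector>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        // one private vector per thread, allocated once and recycled
        std::vector<int> resultsOver200;

        #pragma omp for
        for (int i = 0; i < 24; ++i) {
            for (int j = 0; j < 100; ++j) {
                for (int k = 0; k < 100; ++k) {
                    for (int l = 0; l < 100; ++l) {
                        int result = i + j + k + l;

                        resultsOver200.resize(0); // keeps the capacity: no reallocation
                        if (result > 200) {
                            resultsOver200.push_back(result);
                        }
                        // do something else with result
                    }
                }
            }
            resultsOver200.shrink_to_fit(); // periodically cap the capacity
        }
    }
}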

By the way, you should use steady_clock instead of high_resolution_clock for measuring the wall-clock time.
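In the question's code, that change amounts to:

auto start = std::chrono::steady_clock::now();
// ... the loop nest ...
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> duration = end - start;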

The global allocator is usually not made for fast concurrent allocations. Allocators that are, so-called scalable allocators (like TBB's scalable allocator, jemalloc, and many others), typically trade memory usage for performance.

In this specific case you could allocate the vector's buffer on the stack, spilling over to the heap only once it fills up, using std::pmr::monotonic_buffer_resource and std::pmr::vector:

#include <memory_resource> // std::pmr::monotonic_buffer_resource
#include <iterator>        // std::size
#include <vector>          // std::pmr::vector

char buff[4096]; // create a stack buffer

// use the stack buffer until it is full, then fall over to the heap
std::pmr::monotonic_buffer_resource resource{ buff, std::size(buff) };

std::pmr::vector<int> resultsOver200{ &resource };

This makes allocations faster even for a single thread, and you will get perfect multithreaded scaling as long as the allocations can fit in the stack buffer.

The downside is that to move the data out of this vector into a non-pmr one you need to do an element-wise move, as this vector's memory lives on the stack.
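For example (a sketch; with int elements the moves degenerate to copies):

#include <iterator> // std::make_move_iterator

// the pmr buffer itself cannot be adopted, so move the elements out one by one
std::vector<int> plain(std::make_move_iterator(resultsOver200.begin()),
                       std::make_move_iterator(resultsOver200.end()));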

While push_back is asymptotically as efficient as possible (read up on "amortized complexity"), the occasional reallocations are still very slow. You could reserve your vector to some upper bound, which speeds up the push_back considerably.
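A sketch, assuming the vector is hoisted one loop level up so that an upper bound is known (the l loop can push at most 100 values):

std::vector<int> resultsOver200;
resultsOver200.reserve(100); // upper bound: the l loop runs 100 times

for (int l = 0; l < 100; ++l) {
    int result = i + j + k + l;
    if (result > 200) {
        resultsOver200.push_back(result);
    }
}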

This does not entirely explain why the parallel version would be slower than the sequential one. Could it be that the parallel one overflows your memory and you're swapping to disk?

But really, I would try to reformulate the algorithm. Do you actually need the vector, or is there a way to perform the resulting computation without storing it?
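For instance, if all you ultimately need is an aggregate (an assumption here; counting is used purely for illustration), an OpenMP reduction avoids the vector entirely:

long long count = 0;
#pragma omp parallel for reduction(+:count)
for (int i = 0; i < 24; ++i) {
    for (int j = 0; j < 100; ++j) {
        for (int k = 0; k < 100; ++k) {
            for (int l = 0; l < 100; ++l) {
                if (i + j + k + l > 200) {
                    ++count; // aggregate directly instead of storing values
                }
            }
        }
    }
}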
