I am trying to initialize a std::vector inside a parallel for loop using OpenMP, but doing so takes significantly more time than even a serial loop would.

Here is an example of what I am trying:

#include <iostream>
#include <vector>
#include <omp.h>
#include <chrono>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    //processor supports 24 threads
//#pragma omp parallel for
    for (int i = 0; i < 24; ++i) {
        
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;

                    std::vector<int> resultsOver200 = {};
                    if (result > 200) {
                        resultsOver200.push_back(result);
                    }
                    //do something else with result
                }
            }
        }
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;

    std::cout << "Execution time: " << duration.count() << " seconds";
}

Without utilizing omp I get this as a result:

Execution time: 4.60552 seconds

If I uncomment the '#pragma omp parallel for' and run it the results are:

Execution time: 38.7252 seconds

This is a very simplified version of what I am actually trying to accomplish but gets the point across. Anyone know why this is happening or have a way for me to utilize a dynamic array where I don't know the element count at compile time while still taking advantage of multithreading?

Also as a note, I definitely want to create the new vector within the innermost for loop.

  • The time measurements you posted seem quite large. Are you using an optimized release build? – wohlstad
  • Why do you need resultsOver200? It ceases to exist at the end of your inner loop, and you don't do anything with it. – Caleth
  • Don't use high_resolution_clock but steady_clock. Don't complain about performance if you don't have an optimized release build and you keep allocating vectors in the inner loop. In general, if you have performance questions, don't ask them here before you've used a profiler and understand what it tells you the bottleneck is on your system (it might in part even be missed branch predictions). But yes, in your case it is most likely the memory allocations. – Pepijn Kramer
  • This code looks fishy (for example, the vector holds at most a single value). Also, since there are blanks in the critical part of the code, it is hard to tell whether the problem is hidden in those blanks. – Marek R
  • Your timings are hard to believe... On a quad-core 14-year-old computer I get: g++ = 2.5 sec.; g++ -fopenmp = 0.9 sec.; g++ -O3 = 0.5 sec.; g++ -O -fopenmp = 0.18 sec. – PierU

3 Answers


Allocations generally do not scale. In fact, contention on the default allocator can make the code significantly slower. Allocators typically use locks and atomics, so when many threads fight over the same shared resources, contention happens, and this contention is very expensive: it not only serializes the code but makes it slower due to higher latency (e.g. cache-line bouncing, NUMA effects, etc.).

To reduce this overhead, you can use an allocator better suited to parallel code, one using thread-local pools (e.g. jemalloc). No allocator scales perfectly, though, and in this case most will not scale, so this alone is certainly not enough.

One efficient solution is to recycle the vector in order to reduce the number of allocations. Allocations are expensive, even in sequential code, and you should avoid them as much as possible in hot loops. You could use custom allocators for that (see the answer of @AhmedAEK), or simply move the declaration of resultsOver200 outside the loop nest and call resultsOver200.resize(0) before using it in the innermost loop (it will not be reallocated as long as the existing capacity is large enough). Be careful though: the vector's capacity can stay huge, so it may be good to force its capacity back down periodically so as not to use too much RAM (e.g. by calling resultsOver200.shrink_to_fit() in the i-based or j-based loop).
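A minimal sketch of that recycling approach applied to the question's loop nest. Splitting #pragma omp parallel from #pragma omp for lets each thread declare its own private vector once and reuse it across iterations:

#include <vector>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        // one private vector per thread, allocated once and recycled
        std::vector<int> resultsOver200;

        #pragma omp for
        for (int i = 0; i < 24; ++i) {
            for (int j = 0; j < 100; ++j) {
                for (int k = 0; k < 100; ++k) {
                    for (int l = 0; l < 100; ++l) {
                        int result = i + j + k + l;

                        resultsOver200.resize(0); // keeps the capacity: no reallocation
                        if (result > 200) {
                            resultsOver200.push_back(result);
                        }
                        // do something else with result
                    }
                }
            }
            resultsOver200.shrink_to_fit(); // periodically cap the capacity
        }
    }
}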

By the way, you should use steady_clock instead of high_resolution_clock for measuring the wall-clock time.
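In the question's code, that change amounts to:

auto start = std::chrono::steady_clock::now();
// ... the loop nest ...
auto end = std::chrono::steady_clock::now();
std::chrono::duration<double> duration = end - start;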

The global allocator is usually not made for fast concurrent allocations. Allocators that are, so-called scalable allocators (like TBB's scalable allocator, jemalloc, and many others), typically trade memory usage for performance.

In this specific case you could allocate the vector's buffer on the stack, spilling over to the heap only once it fills up, using std::pmr::monotonic_buffer_resource and std::pmr::vector:

#include <memory_resource> // std::pmr::monotonic_buffer_resource
#include <iterator>        // std::size
#include <vector>          // std::pmr::vector

char buff[4096]; // create a stack buffer

// use the stack buffer until it is full, then fall over to the heap
std::pmr::monotonic_buffer_resource resource{ buff, std::size(buff) };

std::pmr::vector<int> resultsOver200{ &resource };

This makes allocations faster even for a single thread, and you will get perfect multithreaded scaling as long as the allocations can fit in the stack buffer.

The downside is that to move the data out of this vector into a non-pmr one you need to do an element-wise move, as this vector's memory lives on the stack.
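For example (a sketch; with int elements the moves degenerate to copies):

#include <iterator> // std::make_move_iterator

// the pmr buffer itself cannot be adopted, so move the elements out one by one
std::vector<int> plain(std::make_move_iterator(resultsOver200.begin()),
                       std::make_move_iterator(resultsOver200.end()));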

While push_back is asymptotically as efficient as possible (read up on "amortized complexity"), the occasional reallocations are still very slow. You could reserve your vector to some upper bound, which speeds up the push_back considerably.
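A sketch, assuming the vector is hoisted one loop level up so that an upper bound is known (the l loop can push at most 100 values):

std::vector<int> resultsOver200;
resultsOver200.reserve(100); // upper bound: the l loop runs 100 times

for (int l = 0; l < 100; ++l) {
    int result = i + j + k + l;
    if (result > 200) {
        resultsOver200.push_back(result);
    }
}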

This does not entirely explain why the parallel version would be slower than the sequential one. Could it be that the parallel one overflows your memory and you're swapping to disk?

But really, I would try to reformulate the algorithm. Do you actually need the vector, or is there a way to perform the resulting computation without storing it?
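For instance, if all you ultimately need is an aggregate (an assumption here; counting is used purely for illustration), an OpenMP reduction avoids the vector entirely:

long long count = 0;
#pragma omp parallel for reduction(+:count)
for (int i = 0; i < 24; ++i) {
    for (int j = 0; j < 100; ++j) {
        for (int k = 0; k < 100; ++k) {
            for (int l = 0; l < 100; ++l) {
                if (i + j + k + l > 200) {
                    ++count; // aggregate directly instead of storing values
                }
            }
        }
    }
}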
