
We are trying to switch from the old Intel icc/icl compiler to the new icx compiler. After seeing a substantial performance degradation (on the order of 30%), and trying all the optimization options that we could find with very little effect, I started playing with Godbolt Compiler Explorer to see what's happening.

I now have a minimal example that generates much worse code on ICX than on ICC. For now I'm just trying to understand why, before I start looking at much more complex code. Clang is bad as well. Compiler flags: "-O3 -ffast-math -std=c++11 -DNDEBUG -msse2". I'm only looking at the inner loop.

The code simply fills an array with 10 times the index value, a[x] = 10 * x:

#define __ALIGN(x) __attribute__((aligned(x) ))
__ALIGN(64) float a[10000];

int main()
{
    #pragma unroll(0)
    #pragma clang unroll(disabled)
    #pragma vector always
    for (int i = 0; i < 10000; i++)
    {
        a[i] = (float)(i * 10);
    }
    return 0;
}

The unroll disable is there to prevent some compilers from generating completely different code.

icc 17 from 2016 (!) does what we expect, 2500 iterations (step size 4) through this loop:

..B1.9:                         # Preds ..B1.6 ..B1.9
        cvtdq2ps  xmm2, xmm0                                    #84.35
        movups    XMMWORD PTR [SINC_LUT+rbx*4], xmm2            #84.9
        add       rbx, 4                                        #82.5
        paddd     xmm0, xmm1                                    #84.35
        cmp       rbx, 10000                                    #82.5
        jb        ..B1.9        # Prob 99%      

icx does not vectorize at all, and takes 10,000 iterations through this loop, so it's likely close to 4 times as heavy:

.LBB0_3:
        xorps   xmm0, xmm0
        cvtsi2ss        xmm0, ecx
        movss   dword ptr [rax], xmm0
        add     rcx, 10
        add     rax, 4
        cmp     rcx, 100000
        jne     .LBB0_3

If I force it to vectorize with #pragma vector always, the output looks like this, which looks (and probably is) horribly inefficient (the vec-report also says that this code is heavier than the unvectorized code, which is why I had to use the pragma to force it):

.LBB0_1:
        movd    xmm5, ecx                   # load the scalar counter...
        pshufd  xmm5, xmm5, 0               # ...and broadcast it to all 4 lanes
        paddd   xmm5, xmm0                  # add per-lane offsets
        movdqa  xmm6, xmm5                  # the following sequence looks like the classic
        pand    xmm6, xmm1                  # unsigned int -> float conversion via 16-bit
        por     xmm6, xmm2                  # halves and magic constants...
        psrld   xmm5, 16
        por     xmm5, xmm3
        subps   xmm5, xmm4
        addps   xmm5, xmm6                  # ...where a single cvtdq2ps would do
        movaps  xmmword ptr [4*rax + SINC_LUT+16], xmm5
        add     rax, 4
        add     ecx, 40                     # 4 lanes * step 10
        cmp     rax, 9996
        jb      .LBB0_1

So this is 2500 iterations again, but the loop is 15 instructions, instead of 6 for icc.

Am I missing something obvious here? If the generated code is really this bad, that fully explains why the performance is worse.

In case this helps: GCC does vectorize without #pragma vector always; the loop is 11 instructions, so in between the sizes of icc and icx:

.L2:
        movdqa  xmm2, xmm1                  # copy the current index vector
        add     rax, 16
        paddd   xmm1, xmm3                  # advance the index vector by 4
        movdqa  xmm0, xmm2
        pslld   xmm0, 2                     # i * 4
        paddd   xmm0, xmm2                  # i * 4 + i = i * 5
        pslld   xmm0, 1                     # i * 5 * 2 = i * 10 (strength-reduced multiply)
        cvtdq2ps        xmm0, xmm0          # convert 4 ints to floats
        movaps  XMMWORD PTR [rax-16], xmm0  # aligned 16-byte store
        cmp     rdx, rax
        jne     .L2

Given how simple this loop is, I would have expected to see more or less the same code, and at the very least very strongly optimized code, for all these compilers.

Edit: I have just made the example even simpler; I changed a to an int array and removed the float cast. icc generates 5 instructions; icx still won't vectorize without #pragma vector always, and then uses 8 instructions. A loop really cannot be much simpler than this.

Edit 2: Well, it could be simpler, by removing the * 10. Without #pragma vector always, icx STILL won't vectorize even a loop with just a[i] = i. icc generates the code that you would expect, and so does clang in this case. icx with vector always does vectorize, but needs 8 instructions instead of the 5 for the other two. And some of those are clearly unnecessary.
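
For reference, the simplified variants from these two edits look roughly like this (a sketch, with the array redeclared as int):

__ALIGN(64) int a[10000];

// Edit 1: int array, no float cast
for (int i = 0; i < 10000; i++)
    a[i] = i * 10;

// Edit 2: no multiplication at all
for (int i = 0; i < 10000; i++)
    a[i] = i;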

  • I am not an expert in this, but the newer compiler seems "correct" to me. To vectorize operations, there must be multiple identical operations immediately after each other. But if you tell the compiler not to unroll, there are never two multiplications after each other; it is always multiply, then jump. So I think the difference is that the old compiler partially ignored the no-unroll pragma. – gerum Commented Mar 24 at 23:19
  • Additional question: why do you tell the compiler not to unroll? Is it only to have a working minimal example, or does it have some real significance? – gerum Commented Mar 24 at 23:22
  • Without the unroll pragma it actually unrolls (and calculates multiple vectorized registers per loop iteration). But icx still doesn't vectorize if I remove them. – hvz Commented Mar 24 at 23:23
  • The only reason to tell it not to unroll was to be better able to compare the output of different compilers. If one unrolls and the other doesn't, the loops look very different. It doesn't affect the vectorizing itself. – hvz Commented Mar 24 at 23:24
  • Huh. In fact, I just saw that with my new (even simpler) example, without the loop unroll pragmas, icx generates some really bizarre code where it performs 8 steps per loop iteration: 4 of those are with vector instructions, the other 4 are with scalar instructions. – hvz Commented Mar 24 at 23:27

1 Answer

TL;DR: this is apparently a regression (missed optimisation) in ICX, introduced in version "2024.2.0".


Main issue

I can reproduce the problem with ICX "2025.0.4" (see on Godbolt). The problem starts appearing with version "2024.2.0"; version "2024.1.0" generates significantly better code (see on Godbolt):

.LBB0_1:
        movd    xmm1, ecx
        pshufd  xmm1, xmm1, 0
        paddd   xmm1, xmm0
        cvtdq2ps        xmm1, xmm1
        movaps  xmmword ptr [4*rax + a+16], xmm1
        add     rax, 4
        add     ecx, 40
        cmp     rax, 9996
        jb      .LBB0_1

The code is still sub-optimal, but better. It basically operates on a scalar integer loaded into a SIMD register and then broadcast, while the counter could instead be kept and updated directly as a SIMD register. This strategy reduces the number of registers needed (there are very few with SSE2), but there is no real register pressure here, so it is simply less efficient. This code should be about 1.5 times slower than ICC's on Skylake-X, for example (assuming the loop is not memory bound), but 1.75 times faster than ICX "2025.0.4". Once unrolled 8 times, ICC and ICX "2024.1.0" should be about equally fast (note that the movdqa instructions are so cheap on rather recent CPUs that they are nearly free). Thus, I advise you to test an older version of the ICX compiler and to report the issue (if someone else has not already done so).
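
If you need a stopgap until this regression is fixed, the loop can also be vectorized by hand with SSE2 intrinsics, mirroring ICC's strategy of keeping the index vector in a register and advancing it with a single paddd. This is only a sketch (the function name and the use of intrinsics are mine, not something any of these compilers emit verbatim):

#include <emmintrin.h>  // SSE2 intrinsics

__attribute__((aligned(64))) float a[10000];

void fill()
{
    __m128i idx        = _mm_set_epi32(30, 20, 10, 0);  // lanes {0, 10, 20, 30}
    const __m128i step = _mm_set1_epi32(40);            // 4 lanes * step 10 per iteration
    for (int i = 0; i < 10000; i += 4)
    {
        _mm_store_ps(&a[i], _mm_cvtepi32_ps(idx));      // cvtdq2ps + aligned 16-byte store
        idx = _mm_add_epi32(idx, step);                 // a single paddd, like ICC's loop
    }
}

Compiled with the question's flags, this should produce a loop essentially identical to the ICC output above.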

Note that the changelog of ICX "2024.2.0" can be found here, but I did not find any changes that could have a direct impact on this.


Additional Notes

Clang is bad as well

This is expected, since ICX is based on the LLVM toolchain (as, more generally, are most of the new Intel compilers). AFAIK, ICC was not based on LLVM.

icx does not vectorize at all

Intel states in the Porting Guide for ICC Users to DPCPP or ICX:

Vectorization: With ICX 2022.0.0 and later releases, -O2 and -O3 are not sufficient to enable Intel advanced loop optimizations and vectorization. To enable extra levels of loop optimizations and vectorization, use the processor targeting option -x or /Qx along with a target architecture. For example, -xskylake-avx512. Or you use the -xhost or /Qxhost option to enable all available Intel optimizations and advanced vectorization for the processor of the platform where you compile your code.

Using -xhost does improve the situation on the latest version of ICX, though there are still several unnecessary SIMD instructions (see on Godbolt).
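
For example, recompiling the question's snippet with something along these lines (the file name is mine; -xhost replaces the question's -msse2):

icx -O3 -ffast-math -std=c++11 -DNDEBUG -xhost test.cpp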

the output looks like this, which looks (and probably is) horribly inefficient

It is indeed significantly less efficient on most (rather-recent) Intel CPUs.
