
We are trying to switch from the old Intel icc/icl compiler to the new icx compiler. After seeing a substantial performance degradation (on the order of 30%), and trying all the optimization options that we could find with very little effect, I started playing with Godbolt Compiler Explorer to see what's happening.

I now have a minimal example that generates much worse code on ICX than on ICC. For now I'm just trying to understand why, before I start looking at much more complex code. Clang is bad as well. Compiler flags: "-O3 -ffast-math -std=c++11 -DNDEBUG -msse2". I'm only looking at the inner loop.

The code simply fills an array with 10 times the index value, a[x] = 10 * x:

#define __ALIGN(x) __attribute__((aligned(x) ))
__ALIGN(64) float a[10000];

int main()
{
    #pragma unroll(0)
    #pragma clang unroll(disabled)
    #pragma vector always
    for (int i = 0; i < 10000; i++)
    {
        a[i] = (float)(i * 10);
    }
    return 0;
}

The unroll disable is there to prevent some compilers from generating completely different code.

icc 17 from 2016 (!) does what we expect, 2500 iterations (step size 4) through this loop:

..B1.9:                         # Preds ..B1.6 ..B1.9
        cvtdq2ps  xmm2, xmm0                                    #84.35
        movups    XMMWORD PTR [SINC_LUT+rbx*4], xmm2            #84.9
        add       rbx, 4                                        #82.5
        paddd     xmm0, xmm1                                    #84.35
        cmp       rbx, 10000                                    #82.5
        jb        ..B1.9        # Prob 99%      

icx does not vectorize at all, and takes 10,000 iterations through this loop, so it's likely close to 4 times as heavy:

.LBB0_3:
        xorps   xmm0, xmm0
        cvtsi2ss        xmm0, ecx
        movss   dword ptr [rax], xmm0
        add     rcx, 10
        add     rax, 4
        cmp     rcx, 100000
        jne     .LBB0_3

If I force it to vectorize with #pragma vector always, the output looks like this, which looks (and probably is) horribly inefficient (the vec-report also says that this code is heavier than the unvectorized code, which is why I had to use the pragma to force it):

.LBB0_1:
        movd    xmm5, ecx                   # load the scalar counter...
        pshufd  xmm5, xmm5, 0               # ...and broadcast it to all 4 lanes
        paddd   xmm5, xmm0                  # add per-lane offsets
        movdqa  xmm6, xmm5                  # the following sequence looks like the classic
        pand    xmm6, xmm1                  # unsigned int -> float conversion via 16-bit
        por     xmm6, xmm2                  # halves and magic constants...
        psrld   xmm5, 16
        por     xmm5, xmm3
        subps   xmm5, xmm4
        addps   xmm5, xmm6                  # ...where a single cvtdq2ps would do
        movaps  xmmword ptr [4*rax + SINC_LUT+16], xmm5
        add     rax, 4
        add     ecx, 40                     # 4 lanes * step 10
        cmp     rax, 9996
        jb      .LBB0_1

So this is 2500 iterations again, but the loop is 15 instructions, instead of 6 for icc.

Am I missing something obvious here? If the generated code is really this bad, that fully explains why the performance is worse.

In case this helps: GCC does vectorize without #pragma vector always; the loop is 11 instructions, so in between the sizes of icc and icx:

.L2:
        movdqa  xmm2, xmm1                  # copy the current index vector
        add     rax, 16
        paddd   xmm1, xmm3                  # advance the index vector by 4
        movdqa  xmm0, xmm2
        pslld   xmm0, 2                     # i * 4
        paddd   xmm0, xmm2                  # i * 4 + i = i * 5
        pslld   xmm0, 1                     # i * 5 * 2 = i * 10 (strength-reduced multiply)
        cvtdq2ps        xmm0, xmm0          # convert 4 ints to floats
        movaps  XMMWORD PTR [rax-16], xmm0  # aligned 16-byte store
        cmp     rdx, rax
        jne     .L2

Given how simple this loop is, I would have expected to see more or less the same code, and at the very least very strongly optimized code, for all these compilers.

Edit: I have just made the example even simpler; I changed a to an int array and removed the float cast. icc generates 5 instructions; icx still won't vectorize without #pragma vector always, and then uses 8 instructions. A loop really cannot be much simpler than this.

Edit 2: Well, it could be simpler, by removing the * 10. Without #pragma vector always, icx STILL won't vectorize even a loop with just a[i] = i. icc generates the code that you would expect, and so does clang in this case. icx with vector always does vectorize, but needs 8 instructions instead of the 5 for the other two. And some of those are clearly unnecessary.
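
For reference, the simplified variants from these two edits look roughly like this (a sketch, with the array redeclared as int):

__ALIGN(64) int a[10000];

// Edit 1: int array, no float cast
for (int i = 0; i < 10000; i++)
    a[i] = i * 10;

// Edit 2: no multiplication at all
for (int i = 0; i < 10000; i++)
    a[i] = i;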

  • I am not an expert in this, but the newer compiler seems "correct" to me. To vectorize operations, there must be multiple identical operations immediately after each other. But if you tell the compiler not to unroll, there are never two multiplications after each other; it is always multiply, then jump. So I think the difference is that the old compiler partially ignored the no-unroll pragma. – gerum Commented Mar 24 at 23:19
  • Additional question: why do you tell the compiler not to unroll? Is it only to have a working minimal example, or does it have some real significance? – gerum Commented Mar 24 at 23:22
  • Without the unroll pragma it actually unrolls (and calculates multiple vectorized registers per loop iteration). But icx still doesn't vectorize if I remove them. – hvz Commented Mar 24 at 23:23
  • The only reason to tell it not to unroll was to be better able to compare the output of different compilers. If one unrolls and the other doesn't, the loops look very different. It doesn't affect the vectorizing itself. – hvz Commented Mar 24 at 23:24
  • Huh. In fact, I just saw that with my new (even simpler) example, without the loop unroll pragmas, icx generates some really bizarre code where it performs 8 steps per loop iteration: 4 of those are with vector instructions, the other 4 are with scalar instructions. – hvz Commented Mar 24 at 23:27

1 Answer

TL;DR: this is apparently a regression (missed optimisation) in ICX, introduced in version "2024.2.0".


Main issue

I can reproduce the problem with ICX "2025.0.4" (see on Godbolt). The problem starts appearing with version "2024.2.0"; version "2024.1.0" generates significantly better code (see on Godbolt):

.LBB0_1:
        movd    xmm1, ecx
        pshufd  xmm1, xmm1, 0
        paddd   xmm1, xmm0
        cvtdq2ps        xmm1, xmm1
        movaps  xmmword ptr [4*rax + a+16], xmm1
        add     rax, 4
        add     ecx, 40
        cmp     rax, 9996
        jb      .LBB0_1

The code is still sub-optimal, but better. It basically operates on a scalar integer loaded into a SIMD register and then broadcast, while the counter could instead be kept and updated directly as a SIMD register. This strategy reduces the number of registers needed (there are very few with SSE2), but there is no real register pressure here, so it is simply less efficient. This code should be about 1.5 times slower than ICC's on Skylake-X, for example (assuming the loop is not memory bound), but 1.75 times faster than ICX "2025.0.4". Once unrolled 8 times, ICC and ICX "2024.1.0" should be about equally fast (note that the movdqa instructions are so cheap on rather recent CPUs that they are nearly free). Thus, I advise you to test an older version of the ICX compiler and to report the issue (if someone else has not already done so).
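
If you need a stopgap until this regression is fixed, the loop can also be vectorized by hand with SSE2 intrinsics, mirroring ICC's strategy of keeping the index vector in a register and advancing it with a single paddd. This is only a sketch (the function name and the use of intrinsics are mine, not something any of these compilers emit verbatim):

#include <emmintrin.h>  // SSE2 intrinsics

__attribute__((aligned(64))) float a[10000];

void fill()
{
    __m128i idx        = _mm_set_epi32(30, 20, 10, 0);  // lanes {0, 10, 20, 30}
    const __m128i step = _mm_set1_epi32(40);            // 4 lanes * step 10 per iteration
    for (int i = 0; i < 10000; i += 4)
    {
        _mm_store_ps(&a[i], _mm_cvtepi32_ps(idx));      // cvtdq2ps + aligned 16-byte store
        idx = _mm_add_epi32(idx, step);                 // a single paddd, like ICC's loop
    }
}

Compiled with the question's flags, this should produce a loop essentially identical to the ICC output above.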

Note that the changelog of ICX "2024.2.0" can be found here, but I did not find any changes that could have a direct impact on this.


Additional Notes

Clang is bad as well

This is expected, since ICX is based on the LLVM toolchain (as, more generally, are most of the new Intel compilers). AFAIK, ICC was not based on LLVM.

icx does not vectorize at all

Intel states in the Porting Guide for ICC Users to DPCPP or ICX:

Vectorization: With ICX 2022.0.0 and later releases, -O2 and -O3 are not sufficient to enable Intel advanced loop optimizations and vectorization. To enable extra levels of loop optimizations and vectorization, use the processor targeting option -x or /Qx along with a target architecture. For example, -xskylake-avx512. Or you use the -xhost or /Qxhost option to enable all available Intel optimizations and advanced vectorization for the processor of the platform where you compile your code.

Using -xhost does improve the situation on the latest version of ICX, though there are still several unnecessary SIMD instructions (see on Godbolt).
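
For example, recompiling the question's snippet with something along these lines (the file name is mine; -xhost replaces the question's -msse2):

icx -O3 -ffast-math -std=c++11 -DNDEBUG -xhost test.cpp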

the output looks like this, which looks (and probably is) horribly inefficient

It is indeed significantly less efficient on most (rather-recent) Intel CPUs.
