We are trying to switch from the old Intel icc/icl compiler to the new icx compiler. After seeing a substantial performance degradation (in the order of 30%), and trying all the optimization options that we could find with very little effect, I started playing with Godbolt Compiler Explorer to see what's happening.
I now have a minimal example that generates much worse code on ICX than on ICC. For now I'm just trying to understand why, before I start looking at much more complex code. Clang is bad as well. Compiler flags: "-O3 -ffast-math -std=c++11 -DNDEBUG -msse2". I'm only looking at the inner loop.
The code simply fills an array with 10 times the index value, a[x] = 10 * x:
#define __ALIGN(x) __attribute__((aligned(x)))
__ALIGN(64) float a[10000];

int main()
{
#pragma unroll(0)
#pragma clang unroll(disabled)
#pragma vector always
    for (int i = 0; i < 10000; i++)
    {
        a[i] = (float)(i * 10);
    }
    return 0;
}
The unroll disable is there to prevent some compilers from generating completely different code.
icc 17 from 2016 (!) does what we expect, 2500 iterations (step size 4) through this loop:
..B1.9: # Preds ..B1.6 ..B1.9
cvtdq2ps xmm2, xmm0 #84.35
movups XMMWORD PTR [SINC_LUT+rbx*4], xmm2 #84.9
add rbx, 4 #82.5
paddd xmm0, xmm1 #84.35
cmp rbx, 10000 #82.5
jb ..B1.9 # Prob 99%
icx does not vectorize at all and goes through this loop 10,000 times, so it's likely close to 4 times as heavy:
.LBB0_3:
xorps xmm0, xmm0
cvtsi2ss xmm0, ecx
movss dword ptr [rax], xmm0
add rcx, 10
add rax, 4
cmp rcx, 100000
jne .LBB0_3
If I force it to vectorize with #pragma vector always, the output looks like this, which looks (and probably is) horribly inefficient (the vec-report also says that this code is heavier than the unvectorized code, which is why I had to use the pragma to force it):
.LBB0_1:
movd xmm5, ecx
pshufd xmm5, xmm5, 0
paddd xmm5, xmm0
movdqa xmm6, xmm5
pand xmm6, xmm1
por xmm6, xmm2
psrld xmm5, 16
por xmm5, xmm3
subps xmm5, xmm4
addps xmm5, xmm6
movaps xmmword ptr [4*rax + SINC_LUT+16], xmm5
add rax, 4
add ecx, 40
cmp rax, 9996
jb .LBB0_1
So this is 2500 iterations again, but the loop is 15 instructions, instead of 6 for icc.
Am I missing something obvious here? If the generated code is really this bad, that fully explains why the performance is worse.
In case this helps: GCC does vectorize without #pragma vector always, and the loop is 11 instructions, so in between the sizes of icc and icx:
.L2:
movdqa xmm2, xmm1
add rax, 16
paddd xmm1, xmm3
movdqa xmm0, xmm2
pslld xmm0, 2
paddd xmm0, xmm2
pslld xmm0, 1
cvtdq2ps xmm0, xmm0
movaps XMMWORD PTR [rax-16], xmm0
cmp rdx, rax
jne .L2
Given how simple this loop is, I would have expected to see more or less the same code, and at the very least very strongly optimized code, for all these compilers.
Edit: I have just made the example even simpler; I changed a to an int array and removed the float cast. icc generates 5 instructions; icx still won't vectorize without #pragma vector always, and then uses 8 instructions. A loop really cannot be much simpler than this.
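For reference, the simplified variant looks roughly like this (same pragmas as before, just an int array and no cast):
#define __ALIGN(x) __attribute__((aligned(x)))
__ALIGN(64) int a[10000];
int main()
{
#pragma unroll(0)
#pragma clang unroll(disabled)
#pragma vector always
    for (int i = 0; i < 10000; i++)
    {
        a[i] = i * 10;
    }
    return 0;
}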
Edit 2: Well, it could be simpler, by removing the * 10. Without #pragma vector always, icx STILL won't vectorize even a plain a[i] = i loop. icc generates the code that you would expect, and so does clang in this case. icx with vector always does vectorize, but needs 8 instructions instead of 5 for the other two. And some of those are clearly unnecessary.
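So the loop under test in this last case is literally just this (same pragmas and int array as above):
    for (int i = 0; i < 10000; i++)
    {
        a[i] = i;
    }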
- I am not an expert in this, but the newer compiler seems "correct" to me. To vectorize operations, there must be multiple identical operations immediately after each other. But if you tell the compiler not to unroll, there are never two multiplications after each other; it is always mult then jump. So I think the difference is that the old compiler has partially ignored the non-unroll pragma. – gerum Commented Mar 24 at 23:19
- Additional question: Why do you tell the compiler not to unroll? Is it only to have a working minimal example or does it have some real significance? – gerum Commented Mar 24 at 23:22
- Without the unroll pragma it actually unrolls (and calculates multiple vectorized registers per loop iteration). But icx still doesn't vectorize if I remove the pragmas. – hvz Commented Mar 24 at 23:23
- Only reason to tell it not to unroll was to be better able to compare the output of different compilers. If one unrolls and the other doesn't, the loops look very different. It doesn't affect the vectorizing itself. – hvz Commented Mar 24 at 23:24
- Huh. In fact, I just saw that with my new (even simpler) example, without the loop unroll pragmas, icx generates some really bizarre code where it performs 8 steps per loop iteration; 4 of those use vector instructions, the other 4 use scalar instructions. – hvz Commented Mar 24 at 23:27
1 Answer
TL;DR: this is apparently a regression (a missed optimisation) in ICX (since version "2024.2.0").
Main issue
I can reproduce the problem with ICX "2025.0.4" (see on Godbolt). The problem starts to appear with version "2024.2.0". Version "2024.1.0" generates significantly better code (see on Godbolt):
.LBB0_1:
movd xmm1, ecx
pshufd xmm1, xmm1, 0
paddd xmm1, xmm0
cvtdq2ps xmm1, xmm1
movaps xmmword ptr [4*rax + a+16], xmm1
add rax, 4
add ecx, 40
cmp rax, 9996
jb .LBB0_1
The code is still sub-optimal, but better. It basically operates on a scalar integer loaded into a SIMD register and then broadcast, while the values could simply be kept and updated directly in a SIMD register. While this strategy reduces the number of registers needed (there are very few with SSE2), there is no real register pressure here, so it is less efficient. This code should be about 1.5 times slower than the ICC one on Skylake-X, for example (assuming it is not memory-bound), but 1.75 times faster than with ICX "2025.0.4". Once unrolled 8 times, ICC and ICX "2024.1.0" should be about equally fast (note that the movdqa instructions are so cheap on rather recent CPUs that they are nearly free). Thus, I advise you to test an older version of the ICX compiler and report the issue (if it has not already been done by someone else).
Note that the changelog of ICX "2024.2.0" can be found here, but I did not find any changes that could have a direct impact on this.
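In the meantime, if you need a workaround, you can vectorize the loop by hand with SSE2 intrinsics so that the running values stay in a SIMD register, which is essentially what ICC emits. Here is a quick sketch (the function name fill_lut is just for illustration; it relies on the trip count being a multiple of 4 and on the array being aligned):
#include <emmintrin.h>  // SSE2 intrinsics

#define __ALIGN(x) __attribute__((aligned(x)))
__ALIGN(64) float a[10000];

void fill_lut()
{
    // Current values {10*i, 10*(i+1), 10*(i+2), 10*(i+3)}: no multiplication
    // is needed inside the loop, only an add of 40 per lane.
    __m128i val        = _mm_set_epi32(30, 20, 10, 0);
    const __m128i step = _mm_set1_epi32(40);
    for (int i = 0; i < 10000; i += 4)
    {
        __m128 f = _mm_cvtepi32_ps(val);   // cvtdq2ps
        _mm_store_ps(&a[i], f);            // aligned 16-byte store
        val = _mm_add_epi32(val, step);    // paddd
    }
}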
Additional Notes
Clang is bad as well
This is expected, since ICX is based on the LLVM toolchain (as are, more generally, most new Intel compilers). AFAIK, ICC was not based on LLVM.
icx does not vectorize at all
Intel states in the Porting Guide for ICC Users to DPCPP or ICX:
Vectorization
With ICX 2022.0.0 and later releases, -O2 and -O3 are not sufficient to enable Intel advanced loop optimizations and vectorization. To enable extra levels of loop optimizations and vectorization, use the processor targeting option -x or /Qx along with a target architecture. For example, -xskylake-avx512. Or you use the -xhost or /Qxhost option to enable all available Intel optimizations and advanced vectorization for the processor of the platform where you compile your code.
Using -xhost does improve the situation with the latest version of ICX, though there are still several unnecessary SIMD instructions (see on Godbolt).
the output looks like this, which looks (and probably is) horribly inefficient
It is indeed significantly less efficient on most (rather recent) Intel CPUs.
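If you want to quantify this on your machine rather than count instructions, a small timing harness around the loop can help. This is only a rough sketch (for an array this small the loop may be bound by store bandwidth, so repeat it many times and treat the numbers as approximate):
#include <chrono>
#include <cstdio>

#define __ALIGN(x) __attribute__((aligned(x)))
__ALIGN(64) float a[10000];

int main()
{
    const int repeats = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; r++)
    {
        for (int i = 0; i < 10000; i++)
            a[i] = (float)(i * 10);
        // Keep the compiler from removing or hoisting the stores.
        __asm__ __volatile__("" ::: "memory");
    }
    auto t1 = std::chrono::steady_clock::now();
    double ns = (double)std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("%.3f ns per element\n", ns / (double(repeats) * 10000.0));
    return 0;
}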