In "Performance optimization, and how to do it wrong", the author claims:
1. The CPU can't predict more than one branch per cycle.
2. A single `if` statement inside a loop is enough to stop any further instructions from being decoded in that cycle.
Claim 1 contradicts measurements by uops.info for `jz` and `jnz`, which show a reciprocal throughput of 0.50. Are there different port limits for taken vs. not-taken branches, as on Haswell/Skylake?
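To make the arithmetic behind that objection explicit, here is a minimal sketch in C. It assumes the 0.50 figure is a reciprocal throughput in cycles per instruction, and the helper name `per_cycle` is made up for illustration. It only converts the quoted number; it measures nothing and says nothing about taken vs. not-taken branches or about prediction.

```c
#include <stdio.h>

/* Convert a reciprocal throughput (cycles per instruction) into
 * instructions per cycle. Hypothetical helper, for illustration only. */
static double per_cycle(double reciprocal_throughput) {
    return 1.0 / reciprocal_throughput;
}

/* Print the per-cycle rate implied by a quoted reciprocal throughput. */
static void report(const char *mnemonic, double reciprocal_throughput) {
    double rate = per_cycle(reciprocal_throughput);
    printf("%s: %.2f per cycle (%s one per cycle)\n", mnemonic, rate,
           rate > 1.0 ? "more than" : "at most");
}
```

Calling `report("jz", 0.50)` prints a rate of 2.00 per cycle, which is why the quoted measurement looks incompatible with a hard limit of one branch per cycle, at least for whatever branch pattern the measurement used.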
Claim 2 is not mentioned in the Software Optimization Guide for the AMD Zen4 Microarchitecture. The only similar note is in section 2.9, Instruction Fetch and Decode, but `jcc` is only 6 bytes long. Do the decoders stop after decoding a branch?
> Only the first decode slot (of four) can decode instructions greater than 10 bytes in length. Avoid having more than one instruction in a sequence of four that is greater than 10 bytes in length.
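As a rough illustration of why the quoted note does not seem to apply to `jcc`, the guideline can be written as a check over instruction lengths. This is a sketch under assumptions: the function name and constants are hypothetical, and treating "a sequence of four" as every run of four consecutive instructions (a sliding window) is a conservative reading of the text, not a documented description of how the decoders group instructions.

```c
#include <stdbool.h>
#include <stddef.h>

#define LONG_INSN_BYTES 10 /* length threshold from the quoted guideline */
#define WINDOW 4           /* "a sequence of four" instructions */

/* Returns true if any run of up to WINDOW consecutive instructions holds
 * more than one instruction longer than LONG_INSN_BYTES.
 * lengths[i] is the encoded byte length of instruction i.
 * Assumption: a sliding window, the strictest reading of the guideline. */
static bool violates_long_insn_guideline(const unsigned char *lengths, size_t n) {
    for (size_t start = 0; start < n; ++start) {
        size_t end = (start + WINDOW < n) ? start + WINDOW : n;
        unsigned long_count = 0;
        for (size_t i = start; i < end; ++i) {
            if (lengths[i] > LONG_INSN_BYTES) {
                ++long_count;
            }
        }
        if (long_count > 1) {
            return true;
        }
    }
    return false;
}
```

Under this reading, a 6-byte `jcc` never counts toward the limit, so a loop body made of short instructions plus one conditional branch would not trip the guideline. The question of whether a branch ends a decode group is therefore a separate one that this note does not answer.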
Are these performance limits for Zen 4 documented anywhere?
Tags: assembly. Question: Can Zen 4 run more than 1 branch per cycle (Stack Overflow)