In this article https://www.lighterra.com/papers/modernmicroprocessors it is stated (under Multiple Issue - Superscalar):
the fetch and decode/dispatch stages must be enhanced so they can decode multiple instructions in parallel and send them out to the "execution resources"... Of course, now that there are independent pipelines for each functional unit, they can even have different numbers of stages.
So now, my question is: when we say that a superscalar processor has 14-19 stages (e.g. Intel Skylake), do these execution functional units count as separate stages?
Skylake Core
That is, in this Skylake core, are the INT ALU, INT DIV and so on (in the first functional unit of the EUs) considered separate stages?
asked Mar 27 at 9:33 by Rishi, edited Mar 27 at 10:22
Comment from Bergi (Mar 27 at 10:05): I think they're descriptions of (the classes of) the instructions that this particular execution unit can execute. Every instruction uses one of them. This is not related to pipeline stages.
1 Answer
14-19 stages is a measurement of length. (And it's assuming 1-cycle latency integer instructions like add, so it's counting exec as only 1 stage.)
The number of parallel execution units is a measure of width of the pipeline's execution capabilities.
The 14-19 stage variation in length comes from uop-cache hit vs. legacy decode to feed the front-end with decoded instructions (uops). uop-cache hits have better branch-miss latency due to the shorter pipeline. See https://www.realworldtech.com/sandy-bridge/3/ (Skylake is a later generation of the Sandybridge-family, with the same basic design.)
The narrowest point in the pipeline is normally the issue/rename stage, e.g. in Skylake where it's 4 fused-domain uops wide, vs. uop-cache fetch being 6-wide (also in the fused domain), and retirement being 4 per logical core IIRC, so up to 8 per cycle if both hyperthreads are active. In the unfused domain (scheduler and execution ports), Skylake has 8 ports.
With work queued in the scheduler after a slow instruction completes, all 8 can be busy every cycle for a while, but the highest sustained unfused-domain throughput possible on Skylake is 7 uops/cycle (https://www.agner.org/optimize/blog/read.php?i=581#857), with a loop that's 4 fused-domain uops: two micro-fused load+ALU uops, one micro-fused store, and the macro-fused loop branch.
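To make that 4-fused / 7-unfused arithmetic concrete, here is a minimal sketch in GNU C inline asm for x86-64 of the kind of loop body being described. The function name, registers and operands are my own illustration, not Agner Fog's actual test code; only the uop structure per iteration matters.

    /* Hypothetical throughput demo: 4 fused-domain uops per iteration that
       expand to 7 unfused-domain uops for Skylake's execution ports.
       Register values are irrelevant here; only the uop mix matters. */
    #include <stdint.h>

    void uop_throughput_demo(int64_t iters, const int *src1, const int *src2, int *dst)
    {
        asm volatile(
            "1:\n\t"
            "addl (%[s1]), %%eax\n\t" /* micro-fused load+ALU: 1 fused, 2 unfused uops */
            "addl (%[s2]), %%ebx\n\t" /* micro-fused load+ALU: 1 fused, 2 unfused uops */
            "movl %%ecx, (%[d])\n\t"  /* micro-fused store: 1 fused, 2 unfused (store-address + store-data) */
            "dec  %[n]\n\t"
            "jnz  1b\n\t"             /* dec+jnz macro-fuse into 1 uop in both domains */
            : [n] "+r"(iters)
            : [s1] "r"(src1), [s2] "r"(src2), [d] "r"(dst)
            : "eax", "ebx", "ecx", "memory", "cc");
    }

On real hardware you could check the two domains with performance counters: on Skylake under Linux, perf events such as uops_issued.any (fused domain) and uops_executed.thread (unfused domain) should show roughly a 4:7 ratio per iteration for a loop like this, assuming the loop runs from the uop cache.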