
In this article https://www.lighterra.com/papers/modernmicroprocessors it is stated (under Multiple issue - Superscalar) that

the fetch and decode/dispatch stages must be enhanced so they can decode multiple instructions in parallel and send them out to the "execution resources"... Of course, now that there are independent pipelines for each functional unit, they can even have different numbers of stages.

So now, my question is: when we say that a superscalar processor has 14-19 stages (e.g. Intel Skylake), do these execution functional units count as separate stages?


Skylake Core


That is, in this Skylake core, are the INT ALU, INT DIV and so on (in the first functional unit of the EUs) considered separate stages?


asked Mar 27 at 9:33 by Rishi, edited Mar 27 at 10:22
  • I think they're descriptions of (the classes of) the instructions that this particular execution unit can execute. Every instruction uses one of them. This is not related to pipeline stages. – Bergi Commented Mar 27 at 10:05

1 Answer


14-19 stages is a measurement of length. (And it assumes 1-cycle-latency integer instructions like add, so it counts execution as only one stage.)

The number of parallel execution units is a measure of width of the pipeline's execution capabilities.
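The length-vs-width distinction can be sketched with an idealized pipeline model (my own illustrative sketch, not anything Skylake-specific): with no stalls or dependencies, length (stage count) sets the fill latency, while width sets the steady-state completion rate.

```python
import math

def pipeline_cycles(n_instructions, stages, width):
    """Idealized in-order pipeline with no stalls or dependencies:
    the first instruction takes `stages` cycles to drain through,
    then up to `width` instructions complete every cycle after that."""
    return stages + math.ceil(n_instructions / width) - 1

# Width dominates throughput; length barely matters for long runs.
print(pipeline_cycles(1000, 14, 1))  # 1-wide, 14 stages -> 1013 cycles
print(pipeline_cycles(1000, 14, 4))  # 4-wide, 14 stages -> 263 cycles
print(pipeline_cycles(1000, 19, 4))  # 4-wide, 19 stages -> 268 cycles
```

Adding stages costs only a few cycles of fill latency here; quadrupling the width cuts total time almost 4x. (Length matters much more once branch mispredicts force the pipeline to refill, as discussed below.)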

The 14-19 stage variation in length comes from uop-cache hit vs. legacy decode to feed the front-end with decoded instructions (uops). uop-cache hits have better branch-miss latency thanks to the shorter pipeline. See https://www.realworldtech.com/sandy-bridge/3/ (Skylake is a later generation of the Sandy Bridge family, with the same basic design.)
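To see why the shorter uop-cache path helps on branch misses, here is a rough model (my illustrative numbers, not measurements): each mispredict flushes and refills the front-end, so the average cost scales with pipeline depth times mispredict rate.

```python
def effective_cpi(base_cpi, mispredicts_per_kilo_insn, flush_penalty_cycles):
    """Average cycles per instruction once branch-miss refill costs
    are folded in. All inputs are assumed/illustrative values."""
    return base_cpi + (mispredicts_per_kilo_insn / 1000) * flush_penalty_cycles

# Assume 5 mispredicts per 1000 instructions and a refill penalty
# roughly equal to the pipeline depth quoted in the question.
print(effective_cpi(0.25, 5, 14))  # uop-cache hit path (14 stages)
print(effective_cpi(0.25, 5, 19))  # legacy-decode path (19 stages)
```

With branchy code the 5-stage difference between the two front-end paths shows up directly in the effective CPI, which is why uop-cache hits matter beyond just fetch bandwidth.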


The narrowest point in the pipeline is normally the issue/rename stage; on Skylake it's 4 fused-domain uops wide, vs. uop-cache fetch being 6-wide, also in the fused domain, and retirement being 4 per logical core IIRC, so up to 8 per cycle if both hyperthreads are active. In the unfused domain (scheduler and execution ports), Skylake has 8 ports.
With some work backed up in the scheduler after a slow instruction completes, all 8 ports can be busy every cycle for a while, but the highest sustained unfused-domain throughput possible on Skylake is 7 uops/cycle (https://www.agner.org/optimize/blog/read.php?i=581#857), with a loop that's 4 fused-domain uops: two load+ALU uops, one store, and the macro-fused loop branch.
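The fused- vs. unfused-domain accounting for that loop can be tallied explicitly. This is my own breakdown, assuming the usual Skylake fusion rules: a micro-fused load+ALU instruction is 1 fused-domain uop but 2 unfused-domain uops, a store splits into store-address + store-data in the unfused domain, and the loop's dec/jnz pair macro-fuses into a single uop in both domains.

```python
# (name, fused-domain uops, unfused-domain uops) per loop iteration
loop = [
    ("load+ALU", 1, 2),  # micro-fused: load uop + ALU uop at the ports
    ("load+ALU", 1, 2),
    ("store",    1, 2),  # store-address uop + store-data uop
    ("dec/jnz",  1, 1),  # macro-fused loop branch (assumed)
]

fused   = sum(f for _, f, _ in loop)
unfused = sum(u for _, _, u in loop)
print(fused, unfused)  # -> 4 7
```

So at one iteration per cycle, the 4-wide fused-domain issue stage sustains 7 uops/cycle of work at the execution ports, matching the measured maximum.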

Tags: cpu · Execution stages in a superscalar microarchitecture · Stack Overflow