
assembly - What is instruction fusion in contemporary x86 processors?

What I understand is, there are two types of instruction fusions:

  1. Micro-operation fusion
  2. Macro-operation fusion

Micro-operations are those operations that can be executed in 1 clock cycle. If several micro-operations are fused, we obtain an "instruction".

If several instructions are fused, we obtain a Macro-operation.

If several macro-operations are fused, we get macro-operation fusion.

Am I correct?


1 Answer


No, fusion is totally separate from how one complex instruction (like cpuid or lock add [mem], eax) can decode to multiple uops.

The way the retirement stage figures out that all the uops for a single instruction have retired, and thus the instruction has retired, has nothing to do with fusion.


Macro-fusion decodes cmp/jcc or test/jcc into a single compare-and-branch uop (Intel and AMD CPUs). The rest of the pipeline sees it purely as a single uop¹ (except that performance counters still count it as 2 instructions). This saves uop-cache space, and bandwidth everywhere including decode. In some code, compare-and-branch makes up a significant fraction of the total instruction mix, like maybe 25%, so choosing to look for this fusion rather than other possible fusions like mov dst,src1 / or dst,src2 makes sense.

Sandybridge-family can also macro-fuse some other ALU instructions with conditional branches, like add/sub or inc/dec + JCC with some conditions. (x86_64 - Assembly - loop conditions and out of order)
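As a rough sketch (the register choices and the imul loop body are just illustrative, not from the original answer), a countdown loop on Sandybridge-family might decode like this:

    mov   ecx, 1000
    .loop:
        imul  eax, eax          ; loop body: 1 uop
        dec   ecx               ; \ dec + jnz can macro-fuse into one
        jnz   .loop             ; /  dec-and-branch uop on SnB-family

With fusion, each iteration issues as 2 fused-domain uops instead of 3.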


Micro-fusion stores 2 uops from the same instruction together so they only take up 1 "slot" in the fused-domain parts of the pipeline. But they still have to dispatch separately to separate execution units. And in Intel Sandybridge-family, the RS (Reservation Station aka scheduler) is in the unfused domain, so they're even stored separately in the scheduler. (See Footnote 2 in my answer on Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths.)
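For instance (a hypothetical one-liner, not from the original answer), a memory-source ALU instruction is the classic micro-fusion case:

    add   eax, [rdi]        ; 1 fused-domain uop (micro-fused load + add),
                            ; but it still dispatches as 2 unfused-domain uops:
                            ; a load on a load port and an add on an ALU port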

P6 family had a fused-domain RS, as well as ROB, so micro-fusion helped increase the effective size of the out-of-order window there. But SnB-family reportedly simplified the uop format, making it more compact, which allows larger RS sizes that are helpful all the time, not just for micro-fused instructions.

And Sandybridge-family will "un-laminate" indexed addressing modes under some conditions, splitting them back into 2 separate uops in their own slots before issue/rename into the ROB in the out-of-order back end, so you lose the front-end issue/rename throughput benefit of micro-fusion. See Micro fusion and addressing modes.
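A hypothetical pair to illustrate the difference (one-register vs. indexed addressing mode; the un-lamination behaviour for cmp matches the SKL tests described below):

    cmp   ecx, [rdi]            ; stays micro-fused: 1 fused-domain uop through issue/rename
    cmp   ecx, [rdi + rax*4]    ; indexed addressing mode: un-laminates before issue,
                                ; so it costs 2 fused-domain uops (load + ALU cmp)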


Both can happen at the same time:

    cmp   [rdi], eax
    jnz   .target

The cmp/jcc can macro-fuse into a single cmp-and-branch ALU uop, and the load from [rdi] can micro-fuse with that uop.

Failure to micro-fuse the cmp does not prevent macro-fusion.

The limitations here are: RIP-relative + immediate can never micro-fuse, so cmp dword [static_data], 1 / jnz can macro-fuse but not micro-fuse.
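For example (static_data is just a placeholder symbol, as above):

    cmp   dword [static_data], 1    ; RIP-relative + immediate: no micro-fusion,
    jnz   .target                   ; but cmp/jnz still macro-fuse, so this pair
                                    ; costs 2 fused-domain uops: a load, plus one
                                    ; compare-and-branch ALU uop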

A cmp/jcc on SnB-family (like cmp [rdi+rax], edx / jnz) will macro- and micro-fuse in the decoders, but the micro-fusion will un-laminate before the issue stage. (So it's 2 total uops in both the fused and unfused domains: a load with an indexed addressing mode, and an ALU cmp/jnz.) You can verify this with perf counters by putting a mov ecx, 1 in between the CMP and JCC vs. after the JCC, and noting that uops_issued.any:u and uops_executed.thread both go up by 1 per loop iteration when macro-fusion is defeated that way. Micro-fusion behaves the same either way.

On Skylake, cmp dword [rdi], 0/jnz can't macro-fuse. (Only micro-fuse). I tested with a loop that contained some dummy mov ecx,1 instructions. Reordering so one of those mov instructions split up the cmp/jcc didn't change perf counters for fused-domain or unfused-domain uops.

But cmp [rdi],eax/jnz does macro- and micro-fuse. Reordering so a mov ecx,1 instruction separates CMP from JNZ does change perf counters (proving macro-fusion), and uops_executed is higher than uops_issued by 1 per iteration (proving micro-fusion).
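A sketch of that experiment (hypothetical loop bodies; run for many iterations, e.g. under perf stat with the uops_issued.any and uops_executed.thread events mentioned above):

    ; variant A: pair adjacent, so cmp/jnz can macro-fuse (and the load micro-fuses)
    cmp   [rdi], eax
    jnz   .target
    mov   ecx, 1                ; filler after the pair

    ; variant B: the filler splits the pair, defeating macro-fusion;
    ; fused-domain and unfused-domain uop counts both rise by 1 per iteration
    cmp   [rdi], eax
    mov   ecx, 1
    jnz   .target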

cmp [rdi+rax], eax/jne only macro-fuses; not micro. (Well, it actually micro-fuses in decode but un-laminates before issue because of the indexed addressing mode: unlike sub eax, [rdi+rax], cmp has no read-modify-write register destination, which is what lets an indexed addressing mode stay micro-fused. That sub with an indexed addressing mode does macro- and micro-fuse on SKL, and presumably Haswell.)

(The cmp dword [rdi],0 does micro-fuse, though: uops_issued.any:u is 1 lower than uops_executed.thread, and the loop contains no nop or other "eliminated" instructions, or any other memory instructions that could micro-fuse).

Some compilers (including GCC IIRC) prefer to use a separate load instruction and then compare+branch on a register. TODO: check whether gcc and clang's choices are optimal with immediate vs. register.
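For example, a hypothetical compare against memory could be emitted either way:

    ; memory-source compare: the load can micro-fuse and cmp/jne can macro-fuse
    cmp   [rdi], eax
    jne   .mismatch

    ; separate load, then register compare + branch, as some compilers emit
    mov   ecx, [rdi]
    cmp   ecx, eax
    jne   .mismatch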


Micro-operations are those operations that can be executed in 1 clock cycle.

Not exactly. They take 1 "slot" in the pipeline, or in the ROB and RS that track them in the out-of-order back-end.

And yes, dispatching a uop to an execution port happens in 1 clock cycle, and simple uops (e.g. integer addition) can complete execution in the same cycle. Up to 8 uops can dispatch simultaneously since Haswell, increasing to 10 on Sunny Cove. The actual execution might take more than 1 clock cycle (occupying the execution unit for longer, e.g. FP division).

The divider is, I think, the only execution unit on modern mainstream Intel that's not fully pipelined, but Knight's Landing has some not-fully-pipelined SIMD shuffles that are a single uop but have a (reciprocal) throughput of 2 cycles.


Footnote 1:

If cmp [rdi], eax / jne faults on the memory operand (e.g. a #PF exception), it's taken with the exception return address pointing to before the cmp. So I think even exception handling can still treat it as a single thing.

Or if the branch target address is bogus, a #PF exception will happen after the branch has already executed, from code fetch with an updated RIP. So again, I don't think there's a way for cmp to execute successfully and the jcc to fault, requiring an exception to be taken with RIP pointing to the JCC.

But even if that case is a possibility the CPU needs to be designed to handle, sorting that out can be deferred until the exception is actually detected. Maybe with a microcode assist, or some special-case hardware.

As far as how the cmp/jcc uop goes through the pipeline in the normal case, it works exactly like one long single-uop instruction that both sets flags and conditionally branches.

Surprisingly, the loop instruction (like dec rcx/jnz but without setting flags) is not a single uop on Intel CPUs. See: Why is the loop instruction slow? Couldn't Intel have implemented it efficiently?
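So a hypothetical countdown is better written with an explicit dec/jnz pair, which can even macro-fuse, than with loop:

    .slow:
        ; ... loop body ...
        loop  .slow             ; decodes to multiple uops on Intel; poor throughput

    .fast:
        ; ... loop body ...
        dec   rcx               ; \ one dec-and-branch uop after macro-fusion
        jnz   .fast             ; /  on SnB-family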

