
Changing a stored value from 1 to 0 causes a suspicious performance regression #9590

hungryzzz opened this issue Nov 10, 2024 · 8 comments

@hungryzzz (Contributor)

Test Cases

case.zip

Steps to Reproduce

Hi, I ran the attached two cases (good.wasm & bad.wasm) in Wasmtime and WasmEdge (AOT), and collected their execution times (measured with the time tool).

# command to collect execution time of wasmtime
wasmtime compile bad.wasm -o bad.cwasm
time wasmtime run --allow-precompiled bad.cwasm

# command to collect execution time of wasmedge
wasmedgec bad.wasm bad-wasmedge-aot.wasm
time wasmedge bad-wasmedge-aot.wasm

Expected Results & Actual Results

For good.wasm, the execution times in the different runtimes are as follows:

  • Wasmtime: 0.99s
  • WasmEdge: 1.06s

For bad.wasm, the execution times in the different runtimes are as follows:

  • Wasmtime: 6.57s
  • WasmEdge: 1.05s

The difference between the two attached cases is as follows: changing the stored value in line 8 from 1 to 0 slows Wasmtime down by about 5.5s but has no negative effect on WasmEdge.

diff good.wat bad.wat
8c8
<     f64.const 0x1p+0 (;=1;)
---
>     f64.const 0x0p+0 (;=0;)

More observations & questions:

  1. The store operation is outside the loop, and the following instructions do not contain any load operations.
  2. Wasmtime compiles the loop conditions in bad.wasm and good.wasm in different ways, but I don't understand why changing an instruction that seems unrelated to the loop can affect how the loop is compiled.
# part of machine code of bad.wasm generated by wasmtime 
...
jne 5c <wasm[0]::function[1]+0x5c>
movdqu (%rsp),%xmm1
jmp 60 <wasm[0]::function[1]+0x60>
...
je 7f <wasm[0]::function[1]+0x7f>
movdqu %xmm1,(%rsp)
jmp 46 <wasm[0]::function[1]+0x46>
...
je a2 <wasm[0]::function[1]+0xa2>
movdqu %xmm1,(%rsp)
jmp 46 <wasm[0]::function[1]+0x46>
...


# part of machine code of good.wasm generated by wasmtime 
...
je 5f <wasm[0]::function[1]+0x5f>
...
jne 4f <wasm[0]::function[1]+0x4f>
...
jne 4f <wasm[0]::function[1]+0x4f>
...

Versions and Environment

  • Wasmtime version or commit: c8b1369
  • Operating system: Linux ringzzz-OptiPlex-Micro-Plus-7010 6.5.0-18-generic
  • Architecture: Intel(R) Core(TM) i5-13500
@primoly (Contributor) commented Nov 10, 2024

Strangely, replacing that constant zero with an equivalent calculation fixes the issue. Cranelift should constant fold these instructions, so why do they affect machine code generation? 🤔

Replace line 8 in bad.wasm with one of the following for equal performance with good.wasm:

f64.const 0
f64.abs

f64.const 0
f64.const 0
f64.add

f64.const 42
f64.const 42
f64.sub

f64.const 777
f64.const 0
f64.mul
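As a sanity check on the replacements above: under IEEE 754 double arithmetic every one of these sequences folds to positive 0.0, which is why constant folding should reduce them all to the same f64.const 0x0p+0. A quick illustration in Python (just the arithmetic, nothing Cranelift-specific):

```python
import math

# Each replacement sequence above computes the same f64 value: +0.0.
candidates = {
    "f64.abs": abs(0.0),          # f64.const 0; f64.abs
    "f64.add": 0.0 + 0.0,         # f64.const 0; f64.const 0; f64.add
    "f64.sub": 42.0 - 42.0,       # f64.const 42; f64.const 42; f64.sub
    "f64.mul": 777.0 * 0.0,       # f64.const 777; f64.const 0; f64.mul
}
for name, value in candidates.items():
    # All results are exactly +0.0 (sign bit clear) in round-to-nearest mode.
    assert value == 0.0 and math.copysign(1.0, value) == 1.0, name
```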

@alexcrichton (Member)

This is happening because a register is being spilled, but why it's being spilled I'm not sure. Changing from 1 to 0 changes codegen, which changes register allocation, which causes this. In that sense I think it's worth digging into why the spill happened here in the 0 case, because I'm not sure what's going on.

I've got this CLIF:

function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v2, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}

compiled with cargo run compile out.clif --target x86_64 -D --set opt_level=speed to generate:

Disassembly of 128 bytes <u0:1>:
   0:   55                      pushq   %rbp
   1:   48 89 e5                movq    %rsp, %rbp
   4:   48 83 ec 10             subq    $0x10, %rsp
   8:   66 0f 57 c9             xorpd   %xmm1, %xmm1
   c:   48 8b 57 78             movq    0x78(%rdi), %rdx
  10:   f2 0f 11 4a 10          movsd   %xmm1, 0x10(%rdx)
  15:   48 89 fe                movq    %rdi, %rsi
  18:   f3 0f 7f 0c 24          movdqu  %xmm1, (%rsp)     ;; SPILL
  1d:   e8 00 00 00 00          callq   0x22
  22:   45 31 d2                xorl    %r10d, %r10d
  25:   4c 89 d6                movq    %r10, %rsi
  28:   45 31 db                xorl    %r11d, %r11d
  2b:   45 85 db                testl   %r11d, %r11d
  2e:   0f 85 0a 00 00 00       jne     0x3e
  34:   f3 0f 6f 0c 24          movdqu  (%rsp), %xmm1     ;; RELOAD
  39:   e9 04 00 00 00          jmp     0x42
  3e:   66 0f 57 c9             xorpd   %xmm1, %xmm1
  42:   4c 89 d7                movq    %r10, %rdi
  45:   81 e7 ff ff 00 00       andl    $0xffff, %edi
  4b:   41 83 c2 01             addl    $1, %r10d
  4f:   83 ff 01                cmpl    $1, %edi
  52:   0f 84 0a 00 00 00       je      0x62
  58:   f3 0f 7f 0c 24          movdqu  %xmm1, (%rsp)     ;; SPILL
  5d:   e9 c6 ff ff ff          jmp     0x28
  62:   8d 7e 01                leal    1(%rsi), %edi
  65:   81 fe ff ff 00 00       cmpl    $0xffff, %esi
  6b:   0f 84 0d 00 00 00       je      0x7e
  71:   48 89 fe                movq    %rdi, %rsi
  74:   f3 0f 7f 0c 24          movdqu  %xmm1, (%rsp)     ;; SPILL
  79:   e9 aa ff ff ff          jmp     0x28
  7e:   0f 0b                   ud2

where I've annotated the spill/reload slots.

@cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?

@cfallin (Member) commented Nov 11, 2024

> @cfallin would you know how to perhaps debug this further to see why the spill is being inserted? It seems to require some of the bits at the beginning of the function (e.g. calling some other function) so it may be related to caller/callee saves or something like that. I'm not sure if this is related to ABI details though as opposed to register allocator behavior. If it's regalloc-related this might end up leading to some nice wins on other benchmarks if it's not just float-related perhaps?

Maybe -- FWIW I have zero cycles to dedicate to this right now, so you'll need to dig into RA2 on this, unfortunately. Maybe at some later time I'll have the ability to make another performance push. Sorry!

@hungryzzz (Contributor, Author) commented Nov 13, 2024

> compiled with cargo run compile out.clif --target x86_64 -D --set opt_level=speed to generate:
>
> (disassembly quoted in full in @alexcrichton's comment above; spill/reload sites at offsets 18, 34, 58, and 74)

Hi @alexcrichton, I went through the code again, and here are my heuristic thoughts about this bug:

  1. Is it possible that the SPILL at offset 18 is caused by the call instruction, since xmm1 is one of the caller-saved registers? And are the following RELOAD (offset 34) and SPILLs (offsets 58 and 74) a chain reaction from that first SPILL (offset 18)?

  2. However, good.wasm uses xmm0 to store the value, and xmm0 is not spilled before the call instruction even though xmm0 is also one of the caller-saved registers. So is it possible that Cranelift uses different strategies for xmm0 and xmm1 before a call instruction?

@primoly (Contributor) commented Nov 13, 2024

In the spilling case Cranelift uses the same value for multiple zero constants; in the non-spilling case it uses two different ones. In the original good.wasm it had to use different values for 0.0 and 1.0, but you can get Cranelift to use two separate values even when both are 0.0, for example with the (f64.abs (f64.const 0)) hack I showed above.
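The effect can be sketched with a toy live-range model (an assumed illustration only; Cranelift's GVN and regalloc2's real data structures differ): merging the two zero constants into one SSA value stretches that value's live range across the call, and a value live across a call cannot simply stay in a caller-saved register.

```python
# Toy live-range model (illustration only; not Cranelift's actual GVN or
# regalloc2's heuristics). Instruction positions in a simplified block0:
#   0: f64const 0.0   1: store   2: call fn0   3: jump block4(..., zero)

CALL = 2  # position of `call fn0` in the toy instruction stream

def live_range(positions):
    """A value's live range spans its definition/first use to its last use."""
    return (min(positions), max(positions))

def crosses(span, pos):
    first, last = span
    return first < pos < last

# Spilling case: GVN merged both zeros into one value v2, defined at 0 and
# used at the store (1) and at the jump (3) -- so v2 is live across the call.
merged = {"v2": live_range([0, 1, 3])}

# Non-spilling case: v55 feeds the store before the call, and a separate v2
# is defined after the call for the jump -- neither is live across the call.
split = {"v55": live_range([0, 1]), "v2": live_range([3, 3])}

assert crosses(merged["v2"], CALL)                        # must survive the call
assert not any(crosses(r, CALL) for r in split.values())  # nothing to save
```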

Generated CLIF after optimisations of @alexcrichton’s example (spills)

Note how it uses only v2 for the 0.0 constant in block0. This v2 is then used twice, in the store as well as the jump.

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v2, v10  ; v2 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v55 = iconst.i32 0
    brif v55, block6, block7(v34)  ; v55 = 0

block6:
    v56 = f64const 0.0
    jump block7(v56)  ; v56 = 0.0

block7(v33: f64):
    v57 = iconst.i32 0xffff
    v58 = band.i32 v14, v57  ; v57 = 0xffff
    v59 = iconst.i32 1
    v60 = icmp ne v58, v59  ; v59 = 1
    v61 = iadd.i32 v14, v59  ; v59 = 1
    brif v60, block4(v61, v24, v33), block5

block5:
    v62 = iconst.i32 0xffff
    v63 = icmp.i32 ne v24, v62  ; v62 = 0xffff
    v64 = iconst.i32 1
    v65 = iadd.i32 v24, v64  ; v64 = 1
    brif v63, block4(v61, v65, v33), block9

block9:
    trap user11
}

Example that does not spill

Note the additional v6 that will get constant folded.

function u0:1(i64, i64) {
    sig0 = (i64, i64)
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v2 = f64const 0.0
    v6 = fabs.f64 v2
    v7 = load.i64 notrap aligned readonly checked v0+120
    v10 = iadd_imm v7, 16
    store little heap v6, v10
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)

block4(v14: i32, v24: i32, v34: f64):
    v37 = iconst.i32 0
    brif v37, block6, block7(v34)

block6:
    v38 = f64const 0.0
    jump block7(v38)

block7(v33: f64):
    v40 = band_imm.i32 v14, 0xffff
    v42 = icmp_imm ne v40, 1
    v43 = iadd_imm.i32 v14, 1
    brif v42, block4(v43, v24, v33), block5

block5:
    v46 = icmp_imm ne v24, 0xffff
    v48 = iadd_imm.i32 v24, 1
    brif v46, block4(v43, v48, v33), block9

block9:
    trap user11
}

Generated CLIF after optimisations of the non-spilling example above

Note how it uses v2 and v55 for the 0.0 constant in block0. The store uses v55 while the jump uses v2.

function u0:1(i64, i64) fast {
    sig0 = (i64, i64) fast
    fn0 = colocated u0:2 sig0

block0(v0: i64, v1: i64):
    v55 = f64const 0.0
    v7 = load.i64 notrap aligned readonly checked v0+120
    v49 = iconst.i64 16
    v10 = iadd v7, v49  ; v49 = 16
    store little heap v55, v10  ; v55 = 0.0
    call fn0(v0, v0)
    v3 = iconst.i32 0
    v2 = f64const 0.0
    v17 = iconst.i32 0xffff
    v15 = iconst.i32 1
    jump block4(v3, v3, v2)  ; v3 = 0, v3 = 0, v2 = 0.0

block4(v14: i32, v24: i32, v34: f64):
    v56 = iconst.i32 0
    brif v56, block6, block7(v34)  ; v56 = 0

block6:
    v57 = f64const 0.0
    jump block7(v57)  ; v57 = 0.0

block7(v33: f64):
    v58 = iconst.i32 0xffff
    v59 = band.i32 v14, v58  ; v58 = 0xffff
    v60 = iconst.i32 1
    v61 = icmp ne v59, v60  ; v60 = 1
    v62 = iadd.i32 v14, v60  ; v60 = 1
    brif v61, block4(v62, v24, v33), block5

block5:
    v63 = iconst.i32 0xffff
    v64 = icmp.i32 ne v24, v63  ; v63 = 0xffff
    v65 = iconst.i32 1
    v66 = iadd.i32 v24, v65  ; v65 = 1
    brif v64, block4(v62, v66, v33), block9

block9:
    trap user11
}

@hungryzzz (Contributor, Author) commented Nov 15, 2024

Hi @alexcrichton @primoly, I also found that if I change the local.get 0 (line 42) in bad.wasm to f64.const 0x0p+0, performance also returns to normal. I think I can now understand something about the register spills and reloads.

Observations about the register spills and reload

  1. The local.set 0 instruction is in a branch inside a double loop (4c–5f in the machine code). Besides, there is a local.get 0 outside the loop (even though its result is dropped right away).
  2. I guess that Wasmtime stores the value of local[0] at address %rsp and uses the xmm1 register to hold the new value of local[0].
  3. When the branch is taken, xmm1 is set to a new value (5c); when it is not taken, xmm1 is loaded with the current value of local[0] (address 52). This explains the register reload instruction at address 52.
  4. In every iteration, the new value in xmm1 is stored back to address %rsp (addresses 75 and 98), which explains the register spill instructions at addresses 75 and 98.
  5. Even though I understand why the register spills and reloads happen in bad.wasm, I still do not know why changing line 8 from f64.const 0x1p+0 to f64.const 0x0p+0 triggers this situation.

Here are the differences between their machine codes:

0000000000000000 <wasm[0]::function[1]>:
       0:	55                   	push   %rbp
       1:	48 89 e5             	mov    %rsp,%rbp
       4:	4c 8b 57 08          	mov    0x8(%rdi),%r10
       8:	4d 8b 12             	mov    (%r10),%r10
       b:	49 83 c2 30          	add    $0x30,%r10
       f:	49 39 e2             	cmp    %rsp,%r10
      12:	0f 87 9b 00 00 00    	ja     b3 <wasm[0]::function[1]+0xb3>
      18:	48 83 ec 20          	sub    $0x20,%rsp
      1c:	48 89 5c 24 10       	mov    %rbx,0x10(%rsp)
      21:	c5 d9 57 cc          	vxorpd %xmm4,%xmm4,%xmm1
      25:	48 8b 47 78          	mov    0x78(%rdi),%rax
      29:	48 89 fe             	mov    %rdi,%rsi
      2c:	c5 fb 11 48 10       	vmovsd %xmm1,0x10(%rax)     ;; diff
      31:	48 89 f3             	mov    %rsi,%rbx
      34:	f3 0f 7f 0c 24       	movdqu %xmm1,(%rsp)
      39:	48 89 df             	mov    %rbx,%rdi
      3c:	e8 7f 00 00 00       	call   c0 <wasm[0]::function[2]>
      41:	31 ff                	xor    %edi,%edi
      43:	49 89 fa             	mov    %rdi,%r10
      46:	45 31 c9             	xor    %r9d,%r9d
      49:	45 85 c9             	test   %r9d,%r9d
      4c:	0f 85 0a 00 00 00    	jne    5c <wasm[0]::function[1]+0x5c>     ;; diff
      52:	f3 0f 6f 0c 24       	movdqu (%rsp),%xmm1                                ;; diff
      57:	e9 04 00 00 00       	jmp    60 <wasm[0]::function[1]+0x60>     ;; diff
      5c:	c5 e9 57 ca          	vxorpd %xmm2,%xmm2,%xmm1     ;; diff
      60:	48 89 fe             	mov    %rdi,%rsi
      63:	81 e6 ff ff 00 00    	and    $0xffff,%esi
      69:	83 c7 01             	add    $0x1,%edi
      6c:	83 fe 01             	cmp    $0x1,%esi
      6f:	0f 84 0a 00 00 00    	je     7f <wasm[0]::function[1]+0x7f>     ;; diff
      75:	f3 0f 7f 0c 24       	movdqu %xmm1,(%rsp)                             ;; diff
      7a:	e9 c7 ff ff ff       	jmp    46 <wasm[0]::function[1]+0x46>     ;; diff
      7f:	4c 89 d0             	mov    %r10,%rax
      82:	81 e0 ff ff 00 00    	and    $0xffff,%eax
      88:	41 83 c2 01          	add    $0x1,%r10d
      8c:	81 f8 ff ff 00 00    	cmp    $0xffff,%eax
      92:	0f 84 0a 00 00 00    	je     a2 <wasm[0]::function[1]+0xa2>     ;; diff
      98:	f3 0f 7f 0c 24       	movdqu %xmm1,(%rsp)                           ;; diff
      9d:	e9 a4 ff ff ff       	jmp    46 <wasm[0]::function[1]+0x46>     ;; diff
      a2:	48 8b 43 58          	mov    0x58(%rbx),%rax
      a6:	48 8b 7b 68          	mov    0x68(%rbx),%rdi
      aa:	31 d2                	xor    %edx,%edx
      ac:	48 89 de             	mov    %rbx,%rsi
      af:	ff d0                	call   *%rax

@alexcrichton (Member)

@hungryzzz you might be right yeah! I would have naively expected live range splitting in regalloc2 though to kick in and make it such that the spill, if necessary, only happened around the call rather than through the loop. As for small differences, when it comes to regalloc heuristics AFAIK it's expected that small changes to the input can have subtle changes to the output. Especially with specific decisions like spilling here I'm not entirely surprised. That being said though I don't fully understand everything happening here so I can't really answer with certainty. (e.g. I don't know for certain why @primoly's changes resulted in different register allocation decisions)

@primoly (Contributor) commented Nov 16, 2024

The reason for the spill is the call to another function, which leads to Cranelift saving the value of the xmm register on the stack so it can retrieve it afterwards (caller saves). But when Cranelift treats the two zeros, before the call and after, as different values, it just writes a fresh 0.0 into another xmm register. Notice how the non-spilling cases always use two different xmm registers while the spilling ones use only one, matching the non-spilling CLIF using v55 before the call fn0(v0, v0) and v2 after, while the spilling CLIF uses v2 both before and after.
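The two call-site shapes can be contrasted with a toy emitter (an assumed model for illustration; the real decisions are made by regalloc2): a zero that must survive the call in a caller-saved xmm register gets saved to and restored from the stack, while a zero defined freshly after the call costs only a single xorpd and no stack traffic.

```python
# Toy call-site emitter (illustration only, not regalloc2's implementation).
# Mnemonics mirror the disassembly earlier in the thread.

def emit_call_site(zero_live_across_call):
    """Emit toy x86 lines for a call with a +0.0 value needed afterwards."""
    code = []
    if zero_live_across_call:
        # Value held in caller-saved xmm1 must be saved before the call.
        code.append("movdqu %xmm1, (%rsp)   ; SPILL before call")
    code.append("callq fn0")
    if zero_live_across_call:
        code.append("movdqu (%rsp), %xmm1   ; RELOAD after call")
    else:
        # A fresh zero constant is recreated in one instruction instead.
        code.append("xorpd %xmm0, %xmm0     ; fresh +0.0, no stack traffic")
    return code

spilling = emit_call_site(True)       # merged-constant case
non_spilling = emit_call_site(False)  # split-constant case

assert any("SPILL" in line for line in spilling)
assert not any("SPILL" in line for line in non_spilling)
```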
