Bug in memory_desc_init_by_tag: Incorrect Differentiation Between Memory Tags abcd and acbd #2175

taoye9 · 2024-10-21T10:27:09Z

Summary

oneDNN deduces memory tag from input tensor shape using this function:

status_t status = memory_desc_init_by_tag(md_gold, md.ndims, md.dims, md.data_type, tag);

we found when the input memory description is for 4d fp32 tensor of shape 3x1x3x3, the returned status are successful for both of memory tag acbd and abcd. that is md.dims = [3, 1, 3, 3]

This bug causes issues: #2008.

Version

oneDNN v3.7.0 (commit fca8b85)

Environment

CPU: Neoverse-V1 C7g.4xlarge
OS version: Ubuntu 20.04
Compiler version: gcc-10
CMake version: 3.22.1
git hash: 73ae041f2b5e956b58bd58ca5d121af839af076a

Steps to reproduce

oneDNN/build/tests/benchdnn/benchdnn --engine=cpu --matmul --wtag=acbd --dtag=abcd 1x1x2x3:3x1x3x3

Observed behavior

The cmd invokes function memory_desc_matches_one_of_tag(B_md, plain_tensor_layout_tag, transposed_tensor_layout_tag, acbd, adbc); to determine memory tag of weight but get abcd instead of acbd.

Expected behavior

the returned memory tag of weight should be acbd.

The text was updated successfully, but these errors were encountered:

shu1chen · 2024-10-22T06:40:14Z

Hello @taoye9, I was trying your steps to see if it's reproducible on a x86_64 machine and got this result:

$ ONEDNN_VERBOSE=1 ./tests/benchdnn/benchdnn --engine=cpu --matmul --wtag=acbd --dtag=abcd 1x1x2x3:3x1x3x3
...
onednn_verbose,v1,primitive,exec,cpu,matmul,gemm:jit:f32,undef,src:f32:a:blocked:abcd::f0 wei:f32::blocked:acbd::f0 dst:f32::blocked:abcd::f0,,,1x1x2x3:3x1x3x3,0.791992

The weight shows acbd on x86_64, while it shows abcd on aarch64,which is incorrect. Am I understanding right?

taoye9 · 2024-10-22T12:08:09Z

Hi, @shu1chen, thanks for reply.

Not sure whether this benchdnn output can reveal this bug as this function is called during pd_t initialisation stage. Here is when the bug happens: during initialisation of pd_t, memory format is deduced from tensor dims instead of explicit passed in variable (bit weird). we then call memory_desc_init_by_tag to deduce memory format.

here is the code which decides weight memory format during pd_t init: https://github.com/oneapi-src/oneDNN/blob/main/src/cpu/aarch64/matmul/brgemm_matmul_utils.cpp#L221

when the weight is acbd and [3, 1, 3, 3], it's expected to return acbd but got abcd in this case.

My preliminary guess is that the logic in determine memory tag is of false for some cases. https://github.com/oneapi-src/oneDNN/blob/main/src/common/memory_desc.cpp#L34

shu1chen · 2024-10-23T06:00:32Z

I worked on git commit cfe12d8 and added output for the equivalent x86_64 code snippet https://github.com/oneapi-src/oneDNN/blob/main/src/cpu/x64/matmul/brgemm_matmul_utils.cpp#L318

std::cout << "output:" << memory_desc_matches_one_of_tag(B_md, plain_tensor_layout_tag,
                        transposed_tensor_layout_tag, acbd, adbc) << std::endl;

It also returns abcd, same as aarch64, while the test passed:

$ ./tests/benchdnn/benchdnn --engine=cpu --matmul --wtag=acbd --dtag=abcd 1x1x2x3:3x1x3x3
wei_tag: 5
output: 5
0:PASSED __REPRO: --matmul --wtag=acbd --dtag=abcd 1x1x2x3:3x1x3x3
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total: 0.06s; fill: 0.01s (22%); compute_ref: 0.01s (15%); compare: 0.02s (31%);

While on the latest main branch (commit 81b366c), the test also failed on x86_64. Some recent commits cause very similar errors on x86_64:

$ ./tests/benchdnn/benchdnn --engine=cpu --matmul --wtag=acbd --dtag=abcd 1x1x2x3:3x1x3x3
wei_tag: 5
output:5
[   6][DST][1:0:0:0] exp_f32:        7186 exp:        7186 got:        -nan diff:     nan rdiff:     nan
[   7][DST][1:0:0:1] exp_f32:       -5487 exp:       -5487 got:        -nan diff:     nan rdiff:     nan
[   8][DST][1:0:0:2] exp_f32:       -5549 exp:       -5549 got:        -nan diff:     nan rdiff:     nan
[   9][DST][1:0:1:0] exp_f32:       -3131 exp:       -3131 got:        -nan diff:     nan rdiff:     nan
[  10][DST][1:0:1:1] exp_f32:        2511 exp:        2511 got:        -nan diff:     nan rdiff:     nan
[  11][DST][1:0:1:2] exp_f32:        -240 exp:        -240 got:        -nan diff:     nan rdiff:     nan
[  12][DST][2:0:0:0] exp_f32:       -1727 exp:       -1727 got:        -nan diff:     nan rdiff:     nan
[  13][DST][2:0:0:1] exp_f32:       -9256 exp:       -9256 got:        -nan diff:     nan rdiff:     nan
[  14][DST][2:0:0:2] exp_f32:        2826 exp:        2826 got:        -nan diff:     nan rdiff:     nan
[  15][DST][2:0:1:0] exp_f32:        1391 exp:        1391 got:        -nan diff:     nan rdiff:     nan
[COMPARE_STATS][DST]: trh=0 err_max_diff:     nan err_max_rdiff:     nan all_max_diff:     nan all_max_rdiff:     nan
0:FAILED (errors:12 total:18) __REPRO: --matmul --wtag=acbd --dtag=abcd 1x1x2x3:3x1x3x3
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:1 listed:0
total: 0.07s; fill: 0.02s (32%); compute_ref: 0.01s (13%); compare: 0.02s (27%);

I'll try to debug it.

shu1chen · 2024-10-23T09:43:05Z

The commit that caused the test failure on x86_64 is 3877097 and the previous commit 6bfb6a8 is good. Since the bug on x86_64 is unrelated to memory_desc_init_by_tag function, I'll submit a new issue.

mgouicem · 2024-10-23T10:08:01Z

Hi @TaoYe. This is not a bug but a feature :)
Internaly, memory descriptors do not have any notion of tag, just strides. To align with PyTorch semantics (and python based frameworks in general), strides for unit dimension are ignored (since they will never be applied). See here for the matching check.

Now when memory_desc_matches_one_of_tag, is used, for that particular case above [3x1x3x3], strides would be be [9,9,3,1] for abcd, and [9,3,3,1] for acbd. However, because b dimension is 1, the corresponding stride is ignored, so indeed, both tags match the strides properly ([9, _, 3, 1]), and the first matching one would be returned.

@shu1chen thanks, this is a known issue and there is an internal tracker for it. It is being worked on currently.

taoye9 added the sighting Suspicious library behavior. Should be promoted to a bug when confirmed label Oct 21, 2024

shu1chen added the platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 label Oct 22, 2024

shu1chen mentioned this issue Oct 23, 2024

benchdnn matmul with --wtag=acbd correctness test failure on CPU #2180

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in memory_desc_init_by_tag: Incorrect Differentiation Between Memory Tags abcd and acbd #2175

Bug in memory_desc_init_by_tag: Incorrect Differentiation Between Memory Tags abcd and acbd #2175

taoye9 commented Oct 21, 2024 •

edited

Loading

shu1chen commented Oct 22, 2024 •

edited

Loading

taoye9 commented Oct 22, 2024 •

edited

Loading

shu1chen commented Oct 23, 2024 •

edited

Loading

shu1chen commented Oct 23, 2024 •

edited

Loading

mgouicem commented Oct 23, 2024 •

edited

Loading

Bug in memory_desc_init_by_tag: Incorrect Differentiation Between Memory Tags abcd and acbd #2175

Bug in memory_desc_init_by_tag: Incorrect Differentiation Between Memory Tags abcd and acbd #2175

Comments

taoye9 commented Oct 21, 2024 • edited Loading

Summary

Version

Environment

Steps to reproduce

Observed behavior

Expected behavior

shu1chen commented Oct 22, 2024 • edited Loading

taoye9 commented Oct 22, 2024 • edited Loading

shu1chen commented Oct 23, 2024 • edited Loading

shu1chen commented Oct 23, 2024 • edited Loading

mgouicem commented Oct 23, 2024 • edited Loading

taoye9 commented Oct 21, 2024 •

edited

Loading

shu1chen commented Oct 22, 2024 •

edited

Loading

taoye9 commented Oct 22, 2024 •

edited

Loading

shu1chen commented Oct 23, 2024 •

edited

Loading

shu1chen commented Oct 23, 2024 •

edited

Loading

mgouicem commented Oct 23, 2024 •

edited

Loading