Use Bmi2 instrunction to optmize compact protocol int64 code and deco… #2780

zzachimed · 2023-04-09T11:22:46Z

Use Bmi2 instrunction to optmize compact protocol int64 code and decode.

In our unit test, it showed that code has 30% performance boost, and in decode it show 5% performance boost.

origin bench:

$ ./GbenchMarkForCompactProtocol
2023-04-10T00:56:33+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.71, 0.57, 0.24
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        22.1 ns         22.1 ns     31677633
BM_Int64_Encode_2        23.7 ns         23.7 ns     29457528
BM_Int64_Encode_3        26.1 ns         26.1 ns     26844781
BM_Int64_Encode_4        27.1 ns         27.1 ns     25705377
BM_Int64_Encode_5        30.0 ns         29.9 ns     23345095
BM_Int64_Encode_6        32.3 ns         32.3 ns     21658363
BM_Int64_Encode_7        34.5 ns         34.5 ns     20256118
BM_Int64_Encode_8        36.3 ns         36.3 ns     19234451
BM_Int64_Encode_9        37.8 ns         37.8 ns     18510508
BM_Int64_Encode_10       39.8 ns         39.8 ns     17531197
BM_Int64_Decode_1         485 ns          486 ns      1439281
BM_Int64_Decode_2         489 ns          491 ns      1424556
BM_Int64_Decode_3         493 ns          495 ns      1413917
BM_Int64_Decode_4         491 ns          493 ns      1421200
BM_Int64_Decode_5         495 ns          497 ns      1408950
BM_Int64_Decode_6         496 ns          498 ns      1406069
BM_Int64_Decode_7         504 ns          506 ns      1383636
BM_Int64_Decode_8         507 ns          509 ns      1374477
BM_Int64_Decode_9         505 ns          507 ns      1380481
BM_Int64_Decode_10        503 ns          505 ns      1386843

optimized bench:

$ ./GbenchMarkForCompactProtocol
2023-04-10T01:09:25+08:00
Running ./GbenchMarkForCompactProtocol
Run on (128 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x64)
  L1 Instruction 32 KiB (x64)
  L2 Unified 1280 KiB (x64)
  L3 Unified 49152 KiB (x2)
Load Average: 0.00, 0.24, 0.31
-------------------------------------------------------------
Benchmark                   Time             CPU   Iterations
-------------------------------------------------------------
BM_Int64_Encode_1        21.0 ns         21.0 ns     30414973
BM_Int64_Encode_2        22.3 ns         22.2 ns     28778678
BM_Int64_Encode_3        25.6 ns         25.6 ns     27278662
BM_Int64_Encode_4        29.1 ns         29.1 ns     23984143
BM_Int64_Encode_5        29.3 ns         29.2 ns     23856317
BM_Int64_Encode_6        29.6 ns         29.6 ns     23585080
BM_Int64_Encode_7        29.9 ns         29.8 ns     23423250
BM_Int64_Encode_8        29.7 ns         29.7 ns     23496002
BM_Int64_Encode_9        28.9 ns         28.9 ns     24128142
BM_Int64_Encode_10       28.9 ns         28.8 ns     24132967
BM_Int64_Decode_1         480 ns          480 ns      1430995
BM_Int64_Decode_2         481 ns          481 ns      1390746
BM_Int64_Decode_3         490 ns          492 ns      1395035
BM_Int64_Decode_4         495 ns          496 ns      1409327
BM_Int64_Decode_5         500 ns          501 ns      1398344
BM_Int64_Decode_6         495 ns          497 ns      1408096
BM_Int64_Decode_7         499 ns          501 ns      1398316
BM_Int64_Decode_8         496 ns          497 ns      1409067
BM_Int64_Decode_9         501 ns          503 ns      1392703
BM_Int64_Decode_10        500 ns          502 ns      1395121

Did you create an Apache Jira ticket? (Request account here, not required for trivial changes)
If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
Did you squash your changes to a single commit? (not required, but preferred)
Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

…de,in our unit test.It show that code has 30% peformance boost, and in decode it show 5% performance boost. origin bench: $ ./GbenchMarkForCompactProtocol 2023-04-10T00:56:33+08:00 Running ./GbenchMarkForCompactProtocol Run on (128 X 3300 MHz CPU s) CPU Caches: L1 Data 48 KiB (x64) L1 Instruction 32 KiB (x64) L2 Unified 1280 KiB (x64) L3 Unified 49152 KiB (x2) Load Average: 0.71, 0.57, 0.24 ------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------- BM_Int64_Encode_1 22.1 ns 22.1 ns 31677633 BM_Int64_Encode_2 23.7 ns 23.7 ns 29457528 BM_Int64_Encode_3 26.1 ns 26.1 ns 26844781 BM_Int64_Encode_4 27.1 ns 27.1 ns 25705377 BM_Int64_Encode_5 30.0 ns 29.9 ns 23345095 BM_Int64_Encode_6 32.3 ns 32.3 ns 21658363 BM_Int64_Encode_7 34.5 ns 34.5 ns 20256118 BM_Int64_Encode_8 36.3 ns 36.3 ns 19234451 BM_Int64_Encode_9 37.8 ns 37.8 ns 18510508 BM_Int64_Encode_10 39.8 ns 39.8 ns 17531197 BM_Int64_Decode_1 485 ns 486 ns 1439281 BM_Int64_Decode_2 489 ns 491 ns 1424556 BM_Int64_Decode_3 493 ns 495 ns 1413917 BM_Int64_Decode_4 491 ns 493 ns 1421200 BM_Int64_Decode_5 495 ns 497 ns 1408950 BM_Int64_Decode_6 496 ns 498 ns 1406069 BM_Int64_Decode_7 504 ns 506 ns 1383636 BM_Int64_Decode_8 507 ns 509 ns 1374477 BM_Int64_Decode_9 505 ns 507 ns 1380481 BM_Int64_Decode_10 503 ns 505 ns 1386843 optmizied bench: $ ./GbenchMarkForCompactProtocol 2023-04-10T01:09:25+08:00 Running ./GbenchMarkForCompactProtocol Run on (128 X 3300 MHz CPU s) CPU Caches: L1 Data 48 KiB (x64) L1 Instruction 32 KiB (x64) L2 Unified 1280 KiB (x64) L3 Unified 49152 KiB (x2) Load Average: 0.00, 0.24, 0.31 ------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------- BM_Int64_Encode_1 21.0 ns 21.0 ns 30414973 BM_Int64_Encode_2 22.3 ns 22.2 ns 28778678 BM_Int64_Encode_3 25.6 ns 25.6 ns 27278662 BM_Int64_Encode_4 29.1 ns 29.1 ns 23984143 BM_Int64_Encode_5 29.3 ns 29.2 ns 23856317 BM_Int64_Encode_6 29.6 ns 29.6 ns 23585080 BM_Int64_Encode_7 29.9 ns 29.8 ns 23423250 BM_Int64_Encode_8 29.7 ns 29.7 ns 23496002 BM_Int64_Encode_9 28.9 ns 28.9 ns 24128142 BM_Int64_Encode_10 28.9 ns 28.8 ns 24132967 BM_Int64_Decode_1 480 ns 480 ns 1430995 BM_Int64_Decode_2 481 ns 481 ns 1390746 BM_Int64_Decode_3 490 ns 492 ns 1395035 BM_Int64_Decode_4 495 ns 496 ns 1409327 BM_Int64_Decode_5 500 ns 501 ns 1398344 BM_Int64_Decode_6 495 ns 497 ns 1408096 BM_Int64_Decode_7 499 ns 501 ns 1398316 BM_Int64_Decode_8 496 ns 497 ns 1409067 BM_Int64_Decode_9 501 ns 503 ns 1392703 BM_Int64_Decode_10 500 ns 502 ns 1395121

emmenlau · 2023-10-20T09:38:37Z

Whow, nice work! But not easy to validate. One option would be to keep this optional and disabled by default, so people can experiment with the new code?

emmenlau · 2024-08-07T14:49:30Z

This is very interesting. But how to proceed? @Jens-G I guess the tests should cover sufficient checks whether encoded/decoded data is identical with this change? Can we rely that the tests would typically fail if there are relevant differences with this encoding, compared to current thrift?

Otherwise it seems really challenging to build trust here...

emmenlau · 2024-08-07T14:51:00Z

lib/cpp/CMakeLists.txt

@@ -124,6 +124,58 @@ if(UNIX)
    endif()
 endif()

+if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)")


Could you kindly change the indenting to 4 white spaces, instead of two?

emmenlau · 2024-08-07T14:52:19Z

lib/cpp/CMakeLists.txt

@@ -124,6 +124,58 @@ if(UNIX)
    endif()
 endif()

+if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)")
+  set(PREV_CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS})
+  set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -mbmi2 -mbmi -mlzcnt -msse3 -mavx512bw -mavx512vl")


Questions:

Not sure if these flags are portable to MSVC? On which platforms did you test?

Do you not enforce here already that the CPU supports all these flags? Then, further down, you test which CPU features are supported? Am I missing something?

Problems:

Not all users build for their current platforms. I.e. people may want to build on a modern CPU, but may plan to ship their binaries to broader platforms. This is for example the case with Linux distributors. It may be required to make these changes optional, particularly for the very advanced features. I.e. building on an AVX512 capable CPU should not necessarily make the binaries non-portable to non-AVX512-CPUs (which still exist quite a lot).

Jens-G added the c++ label Apr 9, 2023

emmenlau requested changes Aug 7, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use Bmi2 instrunction to optmize compact protocol int64 code and deco… #2780

Use Bmi2 instrunction to optmize compact protocol int64 code and deco… #2780

zzachimed commented Apr 9, 2023 •

edited by emmenlau

Loading

emmenlau commented Oct 20, 2023

emmenlau commented Aug 7, 2024

emmenlau Aug 7, 2024

emmenlau Aug 7, 2024

Use Bmi2 instrunction to optmize compact protocol int64 code and deco… #2780

Are you sure you want to change the base?

Use Bmi2 instrunction to optmize compact protocol int64 code and deco… #2780

Conversation

zzachimed commented Apr 9, 2023 • edited by emmenlau Loading

emmenlau commented Oct 20, 2023

emmenlau commented Aug 7, 2024

emmenlau Aug 7, 2024

Choose a reason for hiding this comment

emmenlau Aug 7, 2024

Choose a reason for hiding this comment

zzachimed commented Apr 9, 2023 •

edited by emmenlau

Loading