-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Bmi2 instrunction to optmize compact protocol int64 code and deco… #2780
base: master
Are you sure you want to change the base?
Conversation
…de,in our unit test.It show that code has 30% peformance boost, and in decode it show 5% performance boost. origin bench: $ ./GbenchMarkForCompactProtocol 2023-04-10T00:56:33+08:00 Running ./GbenchMarkForCompactProtocol Run on (128 X 3300 MHz CPU s) CPU Caches: L1 Data 48 KiB (x64) L1 Instruction 32 KiB (x64) L2 Unified 1280 KiB (x64) L3 Unified 49152 KiB (x2) Load Average: 0.71, 0.57, 0.24 ------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------- BM_Int64_Encode_1 22.1 ns 22.1 ns 31677633 BM_Int64_Encode_2 23.7 ns 23.7 ns 29457528 BM_Int64_Encode_3 26.1 ns 26.1 ns 26844781 BM_Int64_Encode_4 27.1 ns 27.1 ns 25705377 BM_Int64_Encode_5 30.0 ns 29.9 ns 23345095 BM_Int64_Encode_6 32.3 ns 32.3 ns 21658363 BM_Int64_Encode_7 34.5 ns 34.5 ns 20256118 BM_Int64_Encode_8 36.3 ns 36.3 ns 19234451 BM_Int64_Encode_9 37.8 ns 37.8 ns 18510508 BM_Int64_Encode_10 39.8 ns 39.8 ns 17531197 BM_Int64_Decode_1 485 ns 486 ns 1439281 BM_Int64_Decode_2 489 ns 491 ns 1424556 BM_Int64_Decode_3 493 ns 495 ns 1413917 BM_Int64_Decode_4 491 ns 493 ns 1421200 BM_Int64_Decode_5 495 ns 497 ns 1408950 BM_Int64_Decode_6 496 ns 498 ns 1406069 BM_Int64_Decode_7 504 ns 506 ns 1383636 BM_Int64_Decode_8 507 ns 509 ns 1374477 BM_Int64_Decode_9 505 ns 507 ns 1380481 BM_Int64_Decode_10 503 ns 505 ns 1386843 optmizied bench: $ ./GbenchMarkForCompactProtocol 2023-04-10T01:09:25+08:00 Running ./GbenchMarkForCompactProtocol Run on (128 X 3300 MHz CPU s) CPU Caches: L1 Data 48 KiB (x64) L1 Instruction 32 KiB (x64) L2 Unified 1280 KiB (x64) L3 Unified 49152 KiB (x2) Load Average: 0.00, 0.24, 0.31 ------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------- BM_Int64_Encode_1 21.0 ns 21.0 ns 30414973 BM_Int64_Encode_2 22.3 ns 22.2 ns 28778678 BM_Int64_Encode_3 25.6 ns 25.6 ns 27278662 BM_Int64_Encode_4 29.1 ns 29.1 ns 23984143 BM_Int64_Encode_5 29.3 ns 29.2 ns 23856317 BM_Int64_Encode_6 29.6 ns 29.6 ns 23585080 BM_Int64_Encode_7 29.9 ns 29.8 ns 23423250 BM_Int64_Encode_8 29.7 ns 29.7 ns 23496002 BM_Int64_Encode_9 28.9 ns 28.9 ns 24128142 BM_Int64_Encode_10 28.9 ns 28.8 ns 24132967 BM_Int64_Decode_1 480 ns 480 ns 1430995 BM_Int64_Decode_2 481 ns 481 ns 1390746 BM_Int64_Decode_3 490 ns 492 ns 1395035 BM_Int64_Decode_4 495 ns 496 ns 1409327 BM_Int64_Decode_5 500 ns 501 ns 1398344 BM_Int64_Decode_6 495 ns 497 ns 1408096 BM_Int64_Decode_7 499 ns 501 ns 1398316 BM_Int64_Decode_8 496 ns 497 ns 1409067 BM_Int64_Decode_9 501 ns 503 ns 1392703 BM_Int64_Decode_10 500 ns 502 ns 1395121
Whow, nice work! But not easy to validate. One option would be to keep this optional and disabled by default, so people can experiment with the new code? |
This is very interesting. But how to proceed? @Jens-G I guess the tests should cover sufficient checks whether encoded/decoded data is identical with this change? Can we rely that the tests would typically fail if there are relevant differences with this encoding, compared to current thrift? Otherwise it seems really challenging to build trust here... |
@@ -124,6 +124,58 @@ if(UNIX) | |||
endif() | |||
endif() | |||
|
|||
if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you kindly change the indenting to 4 white spaces, instead of two?
@@ -124,6 +124,58 @@ if(UNIX) | |||
endif() | |||
endif() | |||
|
|||
if (CMAKE_SYSTEM_PROCESSOR MATCHES "(x86)|(X86)|(amd64)|(AMD64)") | |||
set(PREV_CMAKE_REQUIRED_FLAGS ${CMAKE_REQUIRED_FLAGS}) | |||
set(CMAKE_REQUIRED_FLAGS "${CMAKE_REQUIRED_FLAGS} -mbmi2 -mbmi -mlzcnt -msse3 -mavx512bw -mavx512vl") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Questions:
- Not sure if these flags are portable to MSVC? On which platforms did you test?
- Do you not enforce here already that the CPU supports all these flags? Then, further down, you test which CPU features are supported? Am I missing something?
Problems:
- Not all users build for their current platforms. I.e. people may want to build on a modern CPU, but may plan to ship their binaries to broader platforms. This is for example the case with Linux distributors. It may be required to make these changes optional, particularly for the very advanced features. I.e. building on an AVX512 capable CPU should not necessarily make the binaries non-portable to non-AVX512-CPUs (which still exist quite a lot).
Use Bmi2 instrunction to optmize compact protocol int64 code and decode.
In our unit test, it showed that code has 30% performance boost, and in decode it show 5% performance boost.
origin bench:
optimized bench:
[skip ci]
anywhere in the commit message to free up build resources.