[core] Make default of 'lookup.local-file-type' to sort #4622

Merged
merged 2 commits into from
Dec 2, 2024

Conversation

Aitozi
Contributor

@Aitozi Aitozi commented Dec 2, 2024

Purpose

Linked issue: close #xxx

More tests, following up on https://github.com/apache/paimon/pull/4500/files

The sort format has the following advantages:

Writer

  1. The compression rate of the sort format is higher.
  2. It saves one file merge step.

Benchmarks also show that the sort format has better write performance. However, the hash format is optimized for entries that share the same value, so when the values are identical the sort format may be slightly slower.

writer-100000:                                                                                       Best/Avg Time(ms)    Row Rate(K/s)      Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
OPERATORTEST_writer-100000_SORT-write-0B-value-100000-num                                                    3 /    3          35326.9             28.3       1.0X
OPERATORTEST_writer-100000_HASH-write-0B-value-100000-num                                                   14 /   18           7015.3            142.5       0.2X
OPERATORTEST_writer-100000_SORT-write-64B-value-100000-num                                                   5 /    6          20176.9             49.6       0.6X
OPERATORTEST_writer-100000_HASH-write-64B-value-100000-num                                                  12 /   15           8123.1            123.1       0.2X
OPERATORTEST_writer-100000_SORT-write-500B-value-100000-num                                                 14 /   18           6985.9            143.1       0.2X
OPERATORTEST_writer-100000_HASH-write-500B-value-100000-num                                                 14 /   14           7273.6            137.5       0.2X
OPERATORTEST_writer-100000_SORT-write-1000B-value-100000-num                                                27 /   31           3639.7            274.7       0.1X
OPERATORTEST_writer-100000_HASH-write-1000B-value-100000-num                                                13 /   13           7925.8            126.2       0.2X
OPERATORTEST_writer-100000_SORT-write-2000B-value-100000-num                                                47 /   55           2142.8            466.7       0.1X
OPERATORTEST_writer-100000_HASH-write-2000B-value-100000-num                                                12 /   12           8640.5            115.7       0.2X
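A note on reading these tables: the column relationships below are inferred from the numbers themselves, not stated in the PR. Row Rate(K/s) appears to be the reciprocal of Per Row(ns), and Relative appears to be each row's rate divided by the rate of the first row in the table, so a SORT row and a HASH row that both show 0.2X are being compared to the 0B SORT baseline, not to each other. A minimal sketch that checks this against a few rows copied from the writer table (the function names are hypothetical):

```python
def row_rate_k_per_s(per_row_ns: float) -> float:
    """Rows per second, in thousands, implied by a per-row latency in ns."""
    return 1e9 / per_row_ns / 1e3


def relative(row_rate: float, baseline_row_rate: float) -> float:
    """Throughput relative to the baseline (first) row of the table."""
    return row_rate / baseline_row_rate


# (label, Row Rate K/s, Per Row ns) copied from the writer table above
rows = [
    ("SORT-0B", 35326.9, 28.3),
    ("HASH-0B", 7015.3, 142.5),
    ("SORT-64B", 20176.9, 49.6),
    ("SORT-2000B", 2142.8, 466.7),
]

baseline_rate = rows[0][1]
for label, rate, ns in rows:
    derived = row_rate_k_per_s(ns)
    print(f"{label:11s} reported={rate:9.1f} K/s  "
          f"derived={derived:9.1f} K/s  "
          f"relative={relative(rate, baseline_rate):.1f}X")
```

The derived rates agree with the reported ones to within rounding of the Per Row(ns) column, which is why a direct SORT versus HASH comparison is better made from the Per Row(ns) or Row Rate(K/s) columns than from Relative.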

Reader

testLookupReaderMiss

100000

reader-10000:                                                                                        Best/Avg Time(ms)    Row Rate(K/s)      Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
OPERATORTEST_reader-10000_SORT-read-0B-value-10000-num                                                       2 /    2           5949.7            168.1       1.0X
OPERATORTEST_reader-10000_HASH-read-0B-value-10000-num                                                      15 /   16            675.3           1480.8       0.1X
OPERATORTEST_reader-10000_SORT-read-64B-value-10000-num                                                      2 /    2           6399.5            156.3       1.1X
OPERATORTEST_reader-10000_HASH-read-64B-value-10000-num                                                     17 /   18            575.7           1737.1       0.1X
OPERATORTEST_reader-10000_SORT-read-500B-value-10000-num                                                     2 /    2           4813.0            207.8       0.8X
OPERATORTEST_reader-10000_HASH-read-500B-value-10000-num                                                    15 /   15            669.6           1493.4       0.1X
OPERATORTEST_reader-10000_SORT-read-1000B-value-10000-num                                                    2 /    2           4494.4            222.5       0.8X
OPERATORTEST_reader-10000_HASH-read-1000B-value-10000-num                                                   14 /   14            735.4           1359.9       0.1X
OPERATORTEST_reader-10000_SORT-read-2000B-value-10000-num                                                    2 /    2           4139.1            241.6       0.7X
OPERATORTEST_reader-10000_HASH-read-2000B-value-10000-num                                                   14 /   14            717.5           1393.8       0.1X

15000000

reader-10000:                                                                                        Best/Avg Time(ms)    Row Rate(K/s)      Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
OPERATORTEST_reader-10000_SORT-read-0B-value-10000-num                                                       4 /    5           2253.1            443.8       1.0X
OPERATORTEST_reader-10000_HASH-read-0B-value-10000-num                                                     854 /  974             11.7          85379.2       0.0X
OPERATORTEST_reader-10000_SORT-read-64B-value-10000-num                                                      3 /    3           3583.6            279.1       1.6X
OPERATORTEST_reader-10000_HASH-read-64B-value-10000-num                                                    868 / 1037             11.5          86786.2       0.0X
OPERATORTEST_reader-10000_SORT-read-500B-value-10000-num                                                     5 /    5           2077.9            481.3       0.9X
OPERATORTEST_reader-10000_HASH-read-500B-value-10000-num                                                   640 /  899             15.6          64029.6       0.0X
OPERATORTEST_reader-10000_SORT-read-1000B-value-10000-num                                                    7 /    7           1522.7            656.7       0.7X
OPERATORTEST_reader-10000_HASH-read-1000B-value-10000-num                                                  669 / 1011             15.0          66879.4       0.0X
OPERATORTEST_reader-10000_SORT-read-2000B-value-10000-num                                                   11 /   11            922.1           1084.5       0.4X
OPERATORTEST_reader-10000_HASH-read-2000B-value-10000-num                                                  834 /  977             12.0          83369.3       0.0X

When lookups miss, the sort format is much faster at both data sizes.

testLookupReader (all match)

100000 value

reader-10000:                                                                                        Best/Avg Time(ms)    Row Rate(K/s)      Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
OPERATORTEST_reader-10000_SORT-read-0B-value-10000-num                                                      14 /   20            701.7           1425.2       1.0X
OPERATORTEST_reader-10000_HASH-read-0B-value-10000-num                                                       8 /    8           1331.2            751.2       1.9X
OPERATORTEST_reader-10000_SORT-read-64B-value-10000-num                                                     17 /   19            603.2           1657.7       0.9X
OPERATORTEST_reader-10000_HASH-read-64B-value-10000-num                                                     10 /   11            997.3           1002.7       1.4X
OPERATORTEST_reader-10000_SORT-read-500B-value-10000-num                                                   134 /  154             74.8          13361.9       0.1X
OPERATORTEST_reader-10000_HASH-read-500B-value-10000-num                                                   182 /  191             55.1          18163.0       0.1X
OPERATORTEST_reader-10000_SORT-read-1000B-value-10000-num                                                  169 /  195             59.3          16861.0       0.1X
OPERATORTEST_reader-10000_HASH-read-1000B-value-10000-num                                                  231 /  250             43.3          23107.5       0.1X
OPERATORTEST_reader-10000_SORT-read-2000B-value-10000-num                                                  162 /  202             61.6          16234.6       0.1X
OPERATORTEST_reader-10000_HASH-read-2000B-value-10000-num                                                  259 /  281             38.6          25925.7       0.1X

10000000

reader-10000:                                                                                        Best/Avg Time(ms)    Row Rate(K/s)      Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
OPERATORTEST_reader-10000_SORT-read-0B-value-10000-num                                                     525 /  581             19.1          52490.8       1.0X
OPERATORTEST_reader-10000_HASH-read-0B-value-10000-num                                                     675 /  959             14.8          67550.0       0.8X
OPERATORTEST_reader-10000_SORT-read-64B-value-10000-num                                                    149 /  376             67.1          14906.0       3.5X
OPERATORTEST_reader-10000_HASH-read-64B-value-10000-num                                                   1621 / 1866              6.2         162062.7       0.3X
OPERATORTEST_reader-10000_SORT-read-500B-value-10000-num                                                   229 /  469             43.7          22885.9       2.3X
OPERATORTEST_reader-10000_HASH-read-500B-value-10000-num                                                  1454 / 1668              6.9         145388.4       0.4X
OPERATORTEST_reader-10000_SORT-read-1000B-value-10000-num                                                  110 /  453             90.9          11004.8       4.8X
OPERATORTEST_reader-10000_HASH-read-1000B-value-10000-num                                                  988 / 1510             10.1          98762.4       0.5X
OPERATORTEST_reader-10000_SORT-read-2000B-value-10000-num                                                  459 /  516             21.8          45941.4       1.1X
OPERATORTEST_reader-10000_HASH-read-2000B-value-10000-num                                                 1388 / 1538              7.2         138797.1       0.4X

The hash format has an advantage only when the value size is small or the file is small enough that it does not fill up the cache. In other cases, the sort format is better.
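Because the Relative column is scaled to the first row, the SORT versus HASH gap per value size is easier to see as a direct ratio of the Per Row(ns) figures. The following is a minimal sketch, not part of the PR: the regular expression and the helper names are assumptions made for illustration, and the rows are copied from the 10000000 all-match table above.

```python
import re

# Matches one benchmark row and captures: format, value size in bytes,
# Row Rate(K/s), and Per Row(ns). The pattern is an assumption based on
# the row layout shown in the tables above.
ROW = re.compile(
    r"_(SORT|HASH)-(?:read|write)-(\d+)B-value-\d+-num\s+"
    r"\d+\s*/\s*\d+\s+([\d.]+)\s+([\d.]+)\s+[\d.]+X"
)


def per_row_ns(table_text: str) -> dict:
    """Map (format, value_size_bytes) -> Per Row(ns) for each row."""
    result = {}
    for line in table_text.splitlines():
        m = ROW.search(line)
        if m:
            fmt, size, _rate, ns = m.groups()
            result[(fmt, int(size))] = float(ns)
    return result


def hash_over_sort(table_text: str) -> dict:
    """Per value size, how many times slower HASH is than SORT per row."""
    ns = per_row_ns(table_text)
    sizes = sorted({size for _, size in ns})
    return {s: ns[("HASH", s)] / ns[("SORT", s)] for s in sizes}


table = """\
OPERATORTEST_reader-10000_SORT-read-0B-value-10000-num       525 /  581   19.1   52490.8   1.0X
OPERATORTEST_reader-10000_HASH-read-0B-value-10000-num       675 /  959   14.8   67550.0   0.8X
OPERATORTEST_reader-10000_SORT-read-64B-value-10000-num      149 /  376   67.1   14906.0   3.5X
OPERATORTEST_reader-10000_HASH-read-64B-value-10000-num     1621 / 1866    6.2  162062.7   0.3X
OPERATORTEST_reader-10000_SORT-read-500B-value-10000-num     229 /  469   43.7   22885.9   2.3X
OPERATORTEST_reader-10000_HASH-read-500B-value-10000-num    1454 / 1668    6.9  145388.4   0.4X
OPERATORTEST_reader-10000_SORT-read-1000B-value-10000-num    110 /  453   90.9   11004.8   4.8X
OPERATORTEST_reader-10000_HASH-read-1000B-value-10000-num    988 / 1510   10.1   98762.4   0.5X
OPERATORTEST_reader-10000_SORT-read-2000B-value-10000-num    459 /  516   21.8   45941.4   1.1X
OPERATORTEST_reader-10000_HASH-read-2000B-value-10000-num   1388 / 1538    7.2  138797.1   0.4X
"""

for size, ratio in hash_over_sort(table).items():
    print(f"{size:5d}B value: HASH per-row time is {ratio:.1f}x SORT")
```

A ratio above 1 means SORT is faster for that value size. Pasting the 100000-value table into `table` instead gives ratios below 1 for the 0B and 64B rows, which is the small-value case where the hash format still wins.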

Tests

API and Format

Documentation

@Aitozi
Contributor Author

Aitozi commented Dec 2, 2024

#3827

@Aitozi
Contributor Author

Aitozi commented Dec 2, 2024

cc @JingsongLi @FangYongs

@JingsongLi
Contributor

+1 Thanks @Aitozi for the benchmark!

@JingsongLi JingsongLi merged commit 3c82082 into apache:master Dec 2, 2024
13 checks passed