1.5.0-rc1 discussions #333
Comments
I'll fix the DB checking issues. I thought I had already gotten that right, but obviously I missed something. Was the old DB in tokyocabinet format? I'll run some tests and see what I need to do. |
Hello @zevv I will compile the new version and test it on our setup, where we index ~120 TB, in the next week. Questions:
TIA |
>>>> "Guillaume" == Guillaume Bourque ***@***.***> writes:
Hello @zevv I will compile the new version and test it on our setup, where we index ~120 TB, in
the next week.
Questions:
* Will the new version also include the user option to only get usage of a specific user ?
You should be able to use 'duc index -u <user> /some/path' to
build an index, but then you need to re-run it for each and every user.
Building a single DB with per-user info is not on the road map right
now. It would explode the size of the DB.
With version 1.5.0-rc1 now out (please test!) you also get an index of
the top N files by size, which I find is a really useful feature. You
get a lot more bang for the buck by finding the biggest files in a
filesystem, since they're usually easier to target to get back space.
* Also I was told by a colleague (I did not test it yet) that if we
run duc for a specific user, let's say bob, and a subfolder is not
owned by bob, duc will not descend into that directory, even
though bob could have some files there without being the
owner of that specific subfolder.
We might have a bug in this area, so a bug report with examples would
be nice to see. I'm flat out this week with other stuff, but I'll try
to take a look starting next week to see what I find.
But please do try the latest release candidate and file any bugs you
find!
John
|
Hello John, since DB size is increasing with the new version, and since duc must be used on large filesystems with more than one user, I would definitely add an option to keep per-user sizes if requested as an argument ;-) And since we have lots of space to check, we should have the room to keep a larger DB. |
>>>> "Guillaume" == Guillaume Bourque ***@***.***> writes:
since DB size is increasing with the new version, and since duc must
be used on large filesystems with more than one user, I would
definitely add an option to keep per-user sizes if requested as an
argument ;-) And since we have lots of space to check, we should
have the room to keep a larger DB.
We're more than happy to look at patches to do this. I would suggest
you only store the UID and keep per-UID records to store this info.
Then have UID->username lookups done when the report is run.
This does assume that the collector and display system have the same
UID->name mappings. Which is probably a good assumption, but not
certain. Otherwise we would need to keep another lookup table to map
UIDs->username in the duc DB.
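The per-UID idea above could look roughly like the following. This is only a sketch under assumptions: `struct uid_table`, `uid_table_add`, `uid_table_get`, `uid_table_report` and the `MAX_UIDS` cap are hypothetical names made up for illustration and are not part of duc. Only the numeric UID is stored while indexing, and `getpwuid()` is called at report time on the displaying host.

```c
#include <pwd.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

#define MAX_UIDS 64 /* hypothetical cap, for this sketch only */

/* One record per UID: only the numeric ID is stored, never the name. */
struct uid_usage {
    uid_t uid;
    uint64_t bytes;
};

struct uid_table {
    struct uid_usage rows[MAX_UIDS];
    size_t count;
};

/* Add `bytes` to the record for `uid`, creating the record on first sight.
 * Returns 0 on success, -1 if the table is full. */
int uid_table_add(struct uid_table *t, uid_t uid, uint64_t bytes)
{
    for (size_t i = 0; i < t->count; i++) {
        if (t->rows[i].uid == uid) {
            t->rows[i].bytes += bytes;
            return 0;
        }
    }
    if (t->count == MAX_UIDS)
        return -1;
    t->rows[t->count].uid = uid;
    t->rows[t->count].bytes = bytes;
    t->count++;
    return 0;
}

/* Return the bytes recorded for `uid`, or 0 if it was never seen. */
uint64_t uid_table_get(const struct uid_table *t, uid_t uid)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->rows[i].uid == uid)
            return t->rows[i].bytes;
    return 0;
}

/* Report time: resolve UID -> name on the host that displays the data.
 * Falls back to the raw number when this host has no mapping. */
void uid_table_report(const struct uid_table *t, FILE *out)
{
    for (size_t i = 0; i < t->count; i++) {
        const struct passwd *pw = getpwuid(t->rows[i].uid);
        if (pw != NULL)
            fprintf(out, "%s\t%llu\n", pw->pw_name,
                    (unsigned long long)t->rows[i].bytes);
        else
            fprintf(out, "uid %lu\t%llu\n", (unsigned long)t->rows[i].uid,
                    (unsigned long long)t->rows[i].bytes);
    }
}
```

A caller would feed `st_uid` and the file size from each `stat` result into `uid_table_add()` while indexing and call `uid_table_report(&table, stdout)` when printing. The numeric fallback covers the case mentioned above where the collector and the display system do not share the same UID->name mappings.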
But please help us out and try the new v1.5.0-rc1 release and let us
know how it works for you! The more feedback the better.
John
|
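Is there an option in version 1.5 to specify the different compressors supported by tkrzw? And how about enhancing the output of --version to show what compression will be used by default, and --info to report what was used to generate the specified database file? |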
FYI, I was able to use 1.5.0-rc1 to index 1.2B files in a large backup zpool,
It would be nice if there was inline compression for such large database files, e.g., post facto zstd is able to reduce the database file from 25GB to 16GB,
|
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
Is there an option in version 1.5 to specify the different
compressors supported by tkrzw?
Currently this is not an option. Do you have a need?
And how about enhancing the output of --version to show what
compression will be used by default, and --info to report what was
used to generate the specified database file?
That's a good point, I'll have to look into adding that.
John
|
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
FYI, I was able to use 1.5.0-rc1 to index 1.2B files in a large backup zpool,
Woot! That's great news. Not so amazing in terms of the time it
takes. For large pools like this it might be better to do multiple
scans in parallel, or we need to start thinking about how to
parallelize the core indexing code.
***@***.*** ~]# duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw
***@***.*** ~]# time duc index -vp /backup2 -d /dev/shm/duc.db
Writing to database "/dev/shm/duc.db"
Indexed 1205911647 files and 44090454 directories, (814.8TB apparent, 599.2TB actual) in 8 hours, 9 minutes, and 11.89 seconds.
real 489m11.917s
user 14m2.310s
sys 327m10.748s
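To put the run time in perspective, the figures above can be turned into a rate. The helpers below are a hypothetical sketch (none of these function names exist in duc). Fed with the numbers from this run, they give roughly 41,000 files per second, with about two thirds of the wall-clock time spent in system calls.

```c
#include <stdint.h>

/* Convert an "H hours, M minutes, S seconds" report into seconds. */
double elapsed_seconds(unsigned hours, unsigned minutes, double seconds)
{
    return hours * 3600.0 + minutes * 60.0 + seconds;
}

/* Indexing rate in files per second; 0 if the duration is not positive. */
double files_per_second(uint64_t files, double secs)
{
    return secs > 0.0 ? (double)files / secs : 0.0;
}

/* Fraction of wall-clock time spent in the kernel (sys / real). */
double sys_share(double sys_secs, double real_secs)
{
    return real_secs > 0.0 ? sys_secs / real_secs : 0.0;
}
```

For this run the inputs would be `files_per_second(1205911647ULL, elapsed_seconds(8, 9, 11.89))` and `sys_share(327 * 60 + 10.748, 489 * 60 + 11.917)`. The low user time next to the high sys time suggests the scan is dominated by filesystem calls rather than by duc's own bookkeeping, which is consistent with the suggestion to run scans in parallel.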
It would be nice if there was inline compression for such large
database files, e.g., post facto zstd is able to reduce database
file 25GB to 16GB,
That is an impressive space reduction. Or depressing, depending on how
you look at it. I'll see what I can come up with. I assume you're
willing to run tests on proposed patches?
***@***.*** ~]# ls -lh /dev/shm/duc.db
-rw-r--r-- 1 root root 25G Oct 19 21:19 /dev/shm/duc.db
***@***.*** ~]# zstd --verbose -T0 /dev/shm/duc.db
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
Note: 24 physical core(s) detected
/dev/shm/duc.db : 64.69% (26053963960 => 16853415278 bytes, /dev/shm/duc.db.zst)
***@***.*** ~]# ls -lh /dev/shm/duc.db.zst
-rw-r--r-- 1 root root 16G Oct 19 21:19 /dev/shm/duc.db.zst
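The percentage zstd prints is the compressed size as a share of the original, not the amount saved. A small sketch of that arithmetic, with hypothetical helper names, reproduces the 64.69% from the byte counts above.

```c
#include <stdint.h>

/* Compressed size as a percentage of the original size,
 * the same convention as the "64.69%" in the zstd output. */
double compressed_percent(uint64_t original, uint64_t compressed)
{
    return original ? 100.0 * (double)compressed / (double)original : 0.0;
}

/* Bytes saved by compression; 0 if the output grew. */
uint64_t saved_bytes(uint64_t original, uint64_t compressed)
{
    return compressed < original ? original - compressed : 0;
}
```

With the values from this run, `saved_bytes(26053963960ULL, 16853415278ULL)` is 9200548682 bytes, which is the difference between the 25G and 16G shown by `ls -lh`.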
|
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
FYI, I was able to use 1.5.0-rc1 to index 1.2B files in a large backup zpool,
***@***.*** ~]# duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw
***@***.*** ~]# time duc index -vp /backup2 -d /dev/shm/duc.db
Writing to database "/dev/shm/duc.db"
Indexed 1205911647 files and 44090454 directories, (814.8TB apparent, 599.2TB actual) in 8 hours, 9 minutes, and 11.89 seconds.
real 489m11.917s
user 14m2.310s
sys 327m10.748s
It would be nice if there was inline compression for such large
database files, e.g., post facto zstd is able to reduce database
file 25GB to 16GB,
Do you happen to have the tkrzw utils installed? Can you run the
following and send me the results? I'm trying to pick better tuning
defaults if I can.
$ tkrzw_dbm_util inspect /dev/shm/duc.db
and if you're feeling happy, please do:
$ time tkrzw_dbm_util rebuild /dev/shm/duc.db
$ tkrzw_dbm_util inspect /dev/shm/duc.db
and send me those results as well.
John
|
Only to help test if a choice other than the current default helps with compressibility.
Thanks. |
Yes. |
I do now.
With what I think are the right additional arguments?
|
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
Do you happen to have the tkrzw utils installed?
I do now.
Can you run the following and send me the results? I'm trying to pick better tuning defaults
if I can.
$ tkrzw_dbm_util inspect /dev/shm/duc.db
***@***.*** ~]# tkrzw_dbm_util inspect /dev/shm/duc.db
APPLICATION_ERROR: Unknown DBM implementation: db
With what I think are the right additional arguments?
Yes, those are the right args. Sorry! I should have tested here
myself before sending out the request. tkrzw uses file extensions
for format checking, which I find annoying.
***@***.*** ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
class=HashDBM
healthy=true
auto_restored=false
path=/dev/shm/duc.db
cyclic_magic=3
pkg_major_version=1
pkg_minor_version=0
static_flags=49
offset_width=5
align_pow=3
closure_flags=1
num_buckets=1048583
num_records=44090459
eff_data_size=25474436170
file_size=26053963960
timestamp=1729397989.120004
db_type=0
max_file_size=8796093022208
record_base=5246976
update_mode=in-place
record_crc_mode=none
record_comp_mode=lz4
Actual File Size: 26053963960
Number of Records: 44090459
Healthy: true
Should be Rebuilt: true
and if you're feeling happy, please do:
$ time tkrzw_dbm_util rebuild /dev/shm/duc.db
***@***.*** ~]# time tkrzw_dbm_util rebuild --dbm hash /dev/shm/duc.db
Old Number of Records: 44090459
Old File Size: 26053963960
Old Effective Data Size: 25474436170
Old Number of Buckets: 1048583
Optimizing the database: ... ok (elapsed=183.065716)
New Number of Records: 44090459
New File Size: 26489626808
New Effective Data Size: 25474436170
New Number of Buckets: 88180927
real 3m3.069s
user 2m31.424s
sys 0m30.468s
$ tkrzw_dbm_util inspect /dev/shm/duc.db
***@***.*** ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
class=HashDBM
healthy=true
auto_restored=false
path=/dev/shm/duc.db
cyclic_magic=7
pkg_major_version=1
pkg_minor_version=0
static_flags=49
offset_width=5
align_pow=3
closure_flags=1
num_buckets=88180927
num_records=44090459
eff_data_size=25474436170
file_size=26489626808
timestamp=1729531678.856718
db_type=0
max_file_size=8796093022208
record_base=440909824
update_mode=in-place
record_crc_mode=none
record_comp_mode=lz4
Actual File Size: 26489626808
Number of Records: 44090459
Healthy: true
Should be Rebuilt: false
So this is interesting, it's using lz4 compression (the fastest, but
not the best ratio) and after the rebuild it's taking more space, not less.
But what I really wanted to see is how it changed the num_buckets,
num_records, etc. It's using a crapload more buckets.
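The two inspect outputs can be compared with a little arithmetic. The sketch below uses hypothetical helper names and rests on one assumption about the HashDBM layout: that each bucket costs `offset_width` bytes in the file. Under that assumption the extra buckets explain almost all of the growth. The file grew by 435,662,848 bytes (26489626808 minus 26053963960), which is exactly how far `record_base` moved, and the added buckets at 5 bytes each come to 435,661,720 bytes.

```c
#include <stdint.h>

/* Average number of records per hash bucket. */
double load_factor(uint64_t num_records, uint64_t num_buckets)
{
    return num_buckets ? (double)num_records / (double)num_buckets : 0.0;
}

/* Size of the bucket table, ASSUMING one offset of `offset_width`
 * bytes per bucket (an assumption for this sketch, not taken from
 * the tkrzw documentation). */
uint64_t bucket_table_bytes(uint64_t num_buckets, unsigned offset_width)
{
    return num_buckets * (uint64_t)offset_width;
}
```

Before the rebuild, `load_factor(44090459, 1048583)` is about 42 records per bucket. Afterwards, with 88180927 buckets, it is about 0.5, which matches "Should be Rebuilt" flipping from true to false.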
Ok, so I've also got a patch to turn on zstd compression, which I'd
like you to try. Let's see if I can attach it here without white
space damage. You will need to do a full recompile, but it should use
zstd by default. And now when you do 'duc --version' it will tell you
if it found zstd for use with tkrzw:
$ ./duc --version
Unknown option 'database'
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw (zstd)
Though I've got a bug in there to fix obviously.
diff --git a/configure.ac b/configure.ac
index d35c843..eee333f 100644
--- a/configure.ac
+++ b/configure.ac
@@ -83,7 +83,7 @@ case "${with_db_backend}" in
AC_DEFINE([ENABLE_TKRZW], [1], [Enable tkrzw db backend])
], [ AC_MSG_ERROR(Unable to find tkrzw) ])
AC_SUBST([TKRZW_LIBS])
-p AC_SUBST([TKRZW_CFLAGS])
+ AC_SUBST([TKRZW_CFLAGS])
;;
leveldb)
AC_CHECK_LIB([leveldb], [leveldb_open])
@@ -113,6 +113,11 @@ esac
AC_DEFINE_UNQUOTED(DB_BACKEND, ["${with_db_backend}"], [Database backend])
+PKG_CHECK_MODULES([ZSTD],[libarchive])
+AC_DEFINE([DUC_TKRZW_COMP_ZSTD], ["RECORD_COMP_ZSTD"], ["Enable tkrzw db zstd comppression"])
+AC_DEFINE_UNQUOTED(TKRZW_ZSTD, ["${with_tkrzw_zstd}"], [tkrzw zstd compression support])
+AC_DEFINE([ENABLE_TKRZW_ZSTD], [1], [tkrzw with zstd])
+
if test "${enable_cairo}" = "yes"; then
PKG_CHECK_MODULES([CAIRO], [cairo],, [AC_MSG_ERROR([
@@ -204,6 +209,7 @@ AC_MSG_RESULT([
- Package version: $PACKAGE $VERSION
- Prefix: ${prefix}
- Database backend: ${with_db_backend}
+ - tkrzw ZSTD compression: ${with_tkrzw_zstd}
- X11 support: ${enable_x11}
- OpenGL support: ${enable_opengl}
- UI (ncurses) support: ${enable_ui}
diff --git a/src/duc/main.c b/src/duc/main.c
index 287abfe..6aaea96 100644
--- a/src/duc/main.c
+++ b/src/duc/main.c
@@ -422,7 +422,11 @@ static void show_version(void)
#ifdef ENABLE_UI
printf("ui ");
#endif
- printf(DB_BACKEND "\n");
+ printf(DB_BACKEND);
+#ifdef ENABLE_TKRZW_ZSTD
+ printf(" (zstd)");
+#endif
+ printf("\n");
exit(EXIT_SUCCESS);
}
diff --git a/src/libduc/db-tkrzw.c b/src/libduc/db-tkrzw.c
index 537e3ab..189f459 100644
--- a/src/libduc/db-tkrzw.c
+++ b/src/libduc/db-tkrzw.c
@@ -16,6 +16,13 @@
#include "private.h"
#include "db.h"
+// Enable compression using ZSTD if available
+#ifdef DUC_TKRZW_COMP_ZSTD
+ #define DUC_TKRZW_REC_COMP "RECORD_COMP_ZSTD"
+#else
+ #define DUC_TKRZW_REC_COMP "NONE"
+#endif
+
struct db {
TkrzwDBM* hdb;
};
@@ -74,7 +81,9 @@ struct db *db_open(const char *path_db, int flags, duc_errno *e)
if (flags & DUC_OPEN_RW) writeable = 1;
if (flags & DUC_OPEN_COMPRESS) {
/* Do no compression for now, need to update configure tests first */
- char comp[] = ",record_comp_mode=RECORD_COMP_LZ4";
+ char comp[64];
+ sprintf(comp,",record_comp_mode=%s",DUC_TKRZW_REC_COMP);
+ printf("opening tkzrw DB with compression\n");
strcat(options,comp);
}
diff --git a/src/libduc/db.c b/src/libduc/db.c
index c18425b..64c521b 100644
--- a/src/libduc/db.c
+++ b/src/libduc/db.c
@@ -118,19 +118,23 @@ char *duc_db_type_check(const char *path_db)
size_t len = fread(buf, 1, sizeof(buf),f);
if (strncmp(buf,"Kyoto CaBiNeT",13) == 0) {
- return("Kyoto Cabinet");
+ return("kyotocabinet");
}
if (strncmp(buf,"ToKyO CaBiNeT",13) == 0) {
- return("Tokyo Cabinet");
+ return("tokyocabinet");
}
if (strncmp(buf,"TkrzwHDB",8) == 0) {
- return("Tkrzw HashDBM");
+ return("tkrzw");
}
if (strncmp(buf,"SQLite format 3",15) == 0) {
- return("SQLite3");
+ return("sqlite3");
+ }
+
+ if (strncmp(buf,"SQLite format 3",15) == 0) {
+ return("lmdb");
}
}
diff --git a/src/libduc/duc.c b/src/libduc/duc.c
index 193305d..b594b96 100644
--- a/src/libduc/duc.c
+++ b/src/libduc/duc.c
@@ -120,6 +120,18 @@ int duc_open(duc *duc, const char *path_db, duc_open_flags flags)
return -1;
}
+ // Check that we can handle this Database is what we're
+ // compiled to support, but only if it exists...
+ struct stat sb;
+ int r = stat(path_db,&sb);
+ if (r == 0) {
+ char *db_type = duc_db_type_check(path_db);
+ if (db_type && (strcmp(db_type,DB_BACKEND) != 0)) {
+ duc_log(duc, DUC_LOG_FTL, "Error opening: %s - unsupported DB type _%s_, duc compiled for %s", path_db, db_type, DB_BACKEND);
+ return -1;
+ }
+ }
+
duc_log(duc, DUC_LOG_INF, "%s database \"%s\"",
(flags & DUC_OPEN_RO) ? "Reading from" : "Writing to",
path_db);
@@ -134,11 +146,6 @@ int duc_open(duc *duc, const char *path_db, duc_open_flags flags)
/* Now we can maybe do some quick checks to see if we
* tried to open a non-supported DB type. */
- char *db_type = duc_db_type_check(path_db);
- if (db_type && (strcmp(db_type,"unknown") == 0)) {
- duc_log(duc, DUC_LOG_FTL, "Error opening: %s - unsupported DB type _%s_, duc compiled for %s", path_db, db_type, DB_BACKEND);
- return -1;
- }
}
return 0;
}
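The header check in the db.c hunk can also be written as a table lookup, which makes it harder for two entries to end up testing the same string. This is a sketch, not the patch itself: `db_type_from_header` and `db_type_from_file` are hypothetical names, and the magic strings are the ones that appear in the diff. LMDB is left out because the thread does not give a header string for it.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Maps a file-header magic string to the backend name used by DB_BACKEND. */
struct db_magic {
    const char *magic;
    const char *backend;
};

static const struct db_magic known_magics[] = {
    { "Kyoto CaBiNeT",   "kyotocabinet" },
    { "ToKyO CaBiNeT",   "tokyocabinet" },
    { "TkrzwHDB",        "tkrzw"        },
    { "SQLite format 3", "sqlite3"      },
};

/* Return the backend name matching the first bytes of `buf`,
 * or NULL when no known magic string matches. */
const char *db_type_from_header(const char *buf, size_t len)
{
    size_t n_known = sizeof(known_magics) / sizeof(known_magics[0]);
    for (size_t i = 0; i < n_known; i++) {
        size_t n = strlen(known_magics[i].magic);
        if (len >= n && memcmp(buf, known_magics[i].magic, n) == 0)
            return known_magics[i].backend;
    }
    return NULL;
}

/* Read the start of `path` and classify it; NULL if unreadable or unknown. */
const char *db_type_from_file(const char *path)
{
    char buf[32];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    size_t len = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    return db_type_from_header(buf, len);
}
```

Using `memcmp` with an explicit length avoids reading past a short or non-terminated header buffer. A caller in the style of the duc.c hunk would compare the returned name against `DB_BACKEND` and treat NULL as "unknown, let the backend try".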
|
Running with the above patch significantly reduces the db size for a large index, with an acceptable amount of increased CPU time; with the patch,
[root@origin-staging duc-1.5.0-rc1]# ./duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw (zstd)
[root@origin-staging duc-1.5.0-rc1]# time ./duc index -vp /backup2 -d /dev/shm/duc.db
Writing to database "/dev/shm/duc.db"
opening tkzrw DB with compression
Indexed 1211640321 files and 44450920 directories, (821.7TB apparent, 603.4TB actual) in 7 hours, 27 minutes, and 18.96 seconds.
real 447m18.983s
user 22m9.975s
sys 254m30.657s
[root@origin-staging duc-1.5.0-rc1]# ls -lh /dev/shm/duc.db
-rw-r--r-- 1 root root 17G Nov 24 03:34 /dev/shm/duc.db
and a subsequent manual compression run with zstd is able to find a bit more to compress,
[root@origin-staging duc-1.5.0-rc1]# time zstd --verbose /dev/shm/duc.db
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
/dev/shm/duc.db : 83.08% (17810632376 => 14797378480 bytes, /dev/shm/duc.db.zst)
real 0m59.624s
user 1m1.009s
sys 0m7.904s
[root@origin-staging duc-1.5.0-rc1]# ls -lh /dev/shm/duc.db.zst
-rw-r--r-- 1 root root 14G Nov 24 03:34 /dev/shm/duc.db.zst
For comparison, here is a run with the RC1 version available on github,
[root@origin-staging ~]# ./duc-1.5.0-rc1-rpm --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw
[root@origin-staging ~]# time ./duc-1.5.0-rc1-rpm index -vp /backup2 -d /dev/shm/duc.rpm.db
Writing to database "/dev/shm/duc.rpm.db"
Indexed 1211640321 files and 44450920 directories, (821.7TB apparent, 603.4TB actual) in 7 hours, 22 minutes, and 2.30 seconds.
real 442m2.320s
user 16m11.189s
sys 269m7.602s
[root@origin-staging ~]# ls -lh /dev/shm/duc.rpm.db
-rw-r--r-- 1 root root 25G Nov 24 03:34 /dev/shm/duc.rpm.db
which has the following additional manual zstd compressibility,
[root@origin-staging ~]# time zstd --verbose /dev/shm/duc.rpm.db
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
/dev/shm/duc.rpm.db : 64.66% (26172539240 => 16924320232 bytes, /dev/shm/duc.rpm.db.zst)
real 2m50.898s
user 2m51.898s
sys 0m10.628s
[root@origin-staging ~]# ls -lh /dev/shm/duc.rpm.db.zst
-rw-r--r-- 1 root root 16G Nov 24 03:34 /dev/shm/duc.rpm.db.zst
|
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
Running with the above patch significantly reduces the db size for a large index, with an
acceptable amount of increased CPU time; with the patch,
Great! Glad to hear this!
***@***.*** duc-1.5.0-rc1]# ./duc --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw (zstd)
***@***.*** duc-1.5.0-rc1]# time ./duc index -vp /backup2 -d /dev/shm/duc.db
Writing to database "/dev/shm/duc.db"
opening tkzrw DB with compression
Indexed 1211640321 files and 44450920 directories, (821.7TB apparent, 603.4TB actual) in 7 hours, 27 minutes, and 18.96 seconds.
real 447m18.983s
user 22m9.975s
sys 254m30.657s
***@***.*** duc-1.5.0-rc1]# ls -lh /dev/shm/duc.db
-rw-r--r-- 1 root root 17G Nov 24 03:34 /dev/shm/duc.db
and a subsequent manual compression run with zstd is able to find a bit more to compress,
***@***.*** duc-1.5.0-rc1]# time zstd --verbose /dev/shm/duc.db
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
/dev/shm/duc.db : 83.08% (17810632376 => 14797378480 bytes, /dev/shm/duc.db.zst)
real 0m59.624s
user 1m1.009s
sys 0m7.904s
***@***.*** duc-1.5.0-rc1]# ls -lh /dev/shm/duc.db.zst
-rw-r--r-- 1 root root 14G Nov 24 03:34 /dev/shm/duc.db.zst
So going from 17G to 14G is a nice savings, but honestly, you've got
so much disk space and so much data I wonder if that extra step is
worth it? *grin*
Can you run the tkrzw tools again on the DB file (before you
compressed it again) to report on the bucket sizes and such? It would
be interesting to know if there's more tuning we can do in tkrzw to
make things better:
tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
All I really do in the setup is tweak some bucket sizes, so maybe
there's something else I can do to make it better.
For comparison, here is a run with the RC1 version available on github,
***@***.*** ~]# ./duc-1.5.0-rc1-rpm --version
duc version: 1.5.0-rc1
options: cairo x11 ui tkrzw
***@***.*** ~]# time ./duc-1.5.0-rc1-rpm index -vp /backup2 -d /dev/shm/duc.rpm.db
Writing to database "/dev/shm/duc.rpm.db"
Indexed 1211640321 files and 44450920 directories, (821.7TB apparent, 603.4TB actual) in 7 hours, 22 minutes, and 2.30 seconds.
real 442m2.320s
user 16m11.189s
sys 269m7.602s
***@***.*** ~]# ls -lh /dev/shm/duc.rpm.db
-rw-r--r-- 1 root root 25G Nov 24 03:34 /dev/shm/duc.rpm.db
which has the following additional manual zstd compressibility,
***@***.*** ~]# time zstd --verbose /dev/shm/duc.rpm.db
*** zstd command line interface 64-bits v1.4.4, by Yann Collet ***
/dev/shm/duc.rpm.db : 64.66% (26172539240 => 16924320232 bytes, /dev/shm/duc.rpm.db.zst)
real 2m50.898s
user 2m51.898s
sys 0m10.628s
***@***.*** ~]# ls -lh /dev/shm/duc.rpm.db.zst
-rw-r--r-- 1 root root 16G Nov 24 03:34 /dev/shm/duc.rpm.db.zst
|
Agreed. The win here is going from 25GB (RC1) to 17GB (with your patch). The further reduction from 17GB to 14GB is just a measure of how much further tuning could possibly be done, but I am happy with 17GB.
[root@origin-staging ~]# tkrzw_dbm_util inspect --dbm hash /dev/shm/duc.db
Inspection:
class=HashDBM
healthy=true
auto_restored=false
path=/dev/shm/duc.db
cyclic_magic=3
pkg_major_version=1
pkg_minor_version=0
static_flags=33
offset_width=5
align_pow=3
closure_flags=1
num_buckets=1048583
num_records=44450925
eff_data_size=17225775288
file_size=17810632376
timestamp=1732448080.589690
db_type=0
max_file_size=8796093022208
record_base=5246976
update_mode=in-place
record_crc_mode=none
record_comp_mode=zstd
Actual File Size: 17810632376
Number of Records: 44450925
Healthy: true
Should be Rebuilt: true
|
Hi @l8gravely,
Lacking a mailing list or forum, I took the liberty of opening an issue for discussing the 1.5.0-rc1 release. I'll use this to jot down some notes in no particular order, feel free to ignore or answer as you please :)
New database format
The default db format moved from tokyocabinet to tkrzw. It's high time we left tokyocabinet behind, as it's just not stable and very much unmaintained. Some comments tho:
Error opening: /home/ico/.cache/duc/duc.db - Database corrupt and not usable
Maybe we could add a hint here that the database format might not match the current version and that the user should clean it up and re-index.
topn command
No comments - very useful and a welcome addition!
histogram support
Still early work on the reporting side I see, but at least the info is already there in the db, nice.