
RFE histogram #284

Open
stuartthebruce opened this issue Nov 17, 2021 · 23 comments

@stuartthebruce

I think it would be helpful to have a histogram command to see the distribution of file and directory sizes. When migrating large collections of files between filesystems it is helpful to have insight into these distributions to help tune the target filesystem (or to contact the owner to clean up first if they have a problematic distribution, e.g., too many small files or too many files in some very large directories). Thanks.

@l8gravely
Collaborator

l8gravely commented Nov 23, 2021 via email

@stuartthebruce
Author

John, thanks for giving this some serious thought. I also use duc to regularly index a large number of files (5B nightly) and it does a fantastic job of that!
If you agree this is worth pursuing, I also agree that one design goal should be not to significantly increase the database size. I suspect a reasonable place to start would be dynamic histogram bin sizes on a logarithmic scale, to minimize how far an individual large file skews the bins.

Another, perhaps simpler, idea would be to support user settings for min and max file sizes for many (all?) of the database view commands, i.e., everything but index. This would let users make their own decision on which file size(s) to zoom in on, e.g., where are all my <1kByte files?
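The logarithmic-bin idea could be sketched roughly as follows. This is just an illustration of the approach, not duc's actual code; `bucket_index` is a hypothetical helper:

```python
# Sketch of logarithmic (power-of-two) histogram binning for file sizes.
# bucket_index() is a hypothetical helper, not part of duc.

def bucket_index(size: int) -> int:
    """Return the power-of-two bucket for a file size: bucket n holds
    sizes in [2^n, 2^(n+1)), with 0- and 1-byte files in bucket 0."""
    if size <= 0:
        return 0
    return size.bit_length() - 1

def histogram(sizes):
    """Accumulate a sparse histogram: only buckets that occur are stored."""
    hist = {}
    for s in sizes:
        b = bucket_index(s)
        hist[b] = hist.get(b, 0) + 1
    return hist
```

With this scheme a single very large file only adds one extra sparse bucket (a 64 GiB file lands in bucket 36) instead of stretching linear bins across the whole range.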

@l8gravely
Collaborator

l8gravely commented Nov 26, 2021 via email

@stuartthebruce
Author

> Yeah, that might work. Now were you thinking of keeping the histogram
> on a per-directory basis or just at the top level of the entire DB?
> That would be the much simpler thing to do at first, then possibly
> extending it down to per-directory levels.

I was thinking of a per-directory basis, however, starting with just a top-level histogram sounds like a good idea to expose many of the interesting data structure and user interface questions.

> That would be hard to do because then you would need to rescan the
> filesystem to get the proper data. With a goal of keeping the
> information overhead as small as possible, the bucket sizes would need
> to be picked ahead of time, so that the stored counts would be
> limited.

But the rescan would be against the database, not the filesystem, so in some cases it may be worth waiting for that rescan.

> Doing a packed data structure of some sort would be best, so as to not
> keep counters for buckets that aren't used. Of course as you get
> higher up the tree, the chance of having empty buckets would go down
> quite a bit.

Perhaps along the same lines: it might make sense to automatically roll up directory branches until a minimum number of files has been accumulated, i.e., not dramatically increase individual directory DB sizes for directories that only have a few files in (or below) them.

@l8gravely
Collaborator

This idea has languished for quite a while, and I have no plans to implement it because even now I don't have a good understanding of what you want. I'd like to see some examples of what you're looking for.

Now just keeping a running total of the number of files in size buckets wouldn't be too hard; choosing the bucket sizes would be. Something like 0, <1k, 2k, 4k, ..., 16GB+ might be the right answer here. It would certainly be an interesting project to play with. I'd have to think about how to add this into indexing and then how to display it. Hmm... now that I'm doing more with duc again, I'm a bit inspired to try this.

@l8gravely
Collaborator

So I've just pushed a new branch called 'histogram' based on the 'tkrzw' work to add histograms to the DB and the ability to show them using the 'duc info -H' command. It's an initial state and the docs still need to be updated. Here's an example:

$ ./duc info -d /tmp/tkrzw.db -H
Date Time Files Dirs Size Path
2024-05-24 13:39:44 732.9K 102.5K 449.6G /scratch

Histogram:

2^0 24837
2^1 584
2^2 2976
2^3 4438
2^4 6334
2^5 5548
2^6 12586
2^7 32702
2^8 45131
2^9 65917
2^10 75004
2^11 78314
2^12 79386
2^13 60569
2^14 52032
2^15 48982
2^16 34777
2^17 20251
2^18 15619
2^19 12251
2^20 15874
2^21 9982
2^22 23961
2^23 2812
2^24 1680
2^25 143
2^26 69
2^27 48
2^28 13
2^29 19
2^30 13
2^31 4
2^33 3
2^36 1

  1. It currently has buckets done in powers of two, which gives us a breakdown of 0, 1, 2, 4, 8, etc. byte files until we get up to higher numbers. I don't think this is ideal, but it works so far. I need to turn that 2^n into real human-readable numbers.
  2. Need to make the bucket sizing more dynamic so we don't hit problems.
  3. Better bucket algorithm to spread things out.
  4. Better output of bucket sizes in a more human-readable format.
  5. Do we need to add JSON output support?
  6. Does it need to be graphed? Or touched in the ui/gui/web interfaces?
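Turning the `2^n` labels into human-readable sizes (items 1 and 4 above) could look something like the sketch below. This is an illustrative helper, not duc's actual output code:

```python
def human_bucket_label(exp: int) -> str:
    """Render a power-of-two bucket exponent as a human-readable size,
    e.g. 10 -> '1.0K', 36 -> '64.0G'. The lower bound of the bucket
    (2^exp bytes) is used as the label."""
    size = 1 << exp  # 2^exp bytes
    for unit in ("B", "K", "M", "G", "T", "P"):
        if size < 1024:
            return f"{size}{unit}" if unit == "B" else f"{size:.1f}{unit}"
        size /= 1024
    return f"{size:.1f}E"
```

So the `2^36  1` row in the output above would read as one file of 64.0G or more.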

@stuartthebruce
Author

Is that example a histogram of individual file sizes? If so, I think it would also be useful to be able to histogram directory sizes.

@l8gravely
Collaborator

l8gravely commented May 25, 2024 via email

@stuartthebruce
Author

I like the idea of bucket boundaries being powers of 2, and if you want to add a user argument it could be how many powers of 2 each bucket spans, with a default of 1 for single-octave resolution or 10 for coarser-grained k/M/G/T.
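The configurable bucket width suggested here is a small tweak to the bucket computation: with k powers of two per bucket, k=1 gives octave resolution and k=10 gives k/M/G/T granularity. A sketch with a hypothetical helper, not duc code:

```python
def bucket_index(size: int, k: int = 1) -> int:
    """Bucket for a file size when each bucket spans k powers of two.
    k=1: octave bins (1, 2, 4, 8, ...); k=10: 1, 1K, 1M, 1G, ... bins."""
    if size <= 0:
        return 0
    return (size.bit_length() - 1) // k
```

For example, with k=10 a 1023-byte file and a 1-byte file share bucket 0, while a 1 MiB file lands in bucket 2 (the "M" bucket).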

@zevv
Owner

zevv commented May 25, 2024

Hey folks, long time no talk. I must admit I've kind of ignored most of the duc-related traffic of late; apologies for the lack of time and love spent on this project. @l8gravely I really do appreciate your efforts at keeping things going, thank you for that!

I do like this histogram feature, though. I agree with the exponential bin sizes; base 2 and base 10 both make sense, so that would make a nice option.

I'd say we start with just a separate command for dumping histogram data to the console, adding fancy graphs for this can always be done at a later time. Start small.

@l8gravely Would you be comfortable writing the code for this? If not, I'd be happy to whip up a proof of concept to get this started?

@l8gravely
Collaborator

l8gravely commented May 25, 2024 via email

@zevv
Owner

zevv commented May 25, 2024

Ah, you already got the work started, nice! I also did some experiments already to see how to hack this in, and the implementation should not be too hard.

Having the powers configurable sounds OK to me, although a default base of 2 feels natural and would probably fit most needs: it results in a histogram with 30-ish bins, which feels like a very manageable size.

Some questions will probably pop up on how (and whether) to handle recursion.

I'll have to play around a bit with the GUI side to see how to properly represent this; there's the choice of making this a separate thing, or combining the histogram with the normal pie chart into a single graphic - I guess that will just need some playing around and prototyping.

@l8gravely
Collaborator

l8gravely commented May 26, 2024 via email

@zevv
Owner

zevv commented May 26, 2024

Oh boy it's happening again: I'm thinking of a new feature and tons of questions come up. I'll just rant a little here, not sure how much is really relevant but we'll need to make some choices:

  • Aggregation level: global or per-directory? Your current implementation is simple and concise, but it only aggregates the histogram data at the global index level, not for individual directories. The good thing is that it is cheap on storage; the bad thing is that there is no histogram data available for individual directories. Personally, I would like to have histograms available for the individual directories I'm currently inspecting. This would require duc to store the histogram for every directory, but that'll eat considerably more storage - at least DUC_HISTOGRAM_MAX varints for every dir.

  • Aggregate at index time or at display time? Building the histogram at index time makes sense because all the data is there at the right time, and querying would be fast. The price we pay is storage. Alternatively, we could generate the histogram at display time and have it match the data being displayed: when we draw the pie chart we already recursively traverse the database N levels deep, so it would be trivial to accumulate the histogram bins at that time. The advantage is that the histogram actually matches what is displayed, and we can generate two different data sets: one for files, one for directories. The downside is that there is no easy way to histogram the FS at large without traversing the whole database.

  • Bin sizes: if storage is not an issue we can just choose a larger number of bins (256 or so) and use an exponent smaller than 2 (sqrt(2), for example). If the user chooses to display the histogram with fewer bins, it is trivial to do this at display time by simply combining bins. That would help with your problem of the useless small-size bins (<128) and the subjectively large gaps for huge files.
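Combining fine stored bins into coarser display bins, as the last point suggests, is just a fold over adjacent counters. A sketch (the function name and the 256-bin assumption are illustrative, not duc's API):

```python
def coarsen(bins, factor):
    """Merge every `factor` adjacent histogram bins into one, so e.g.
    256 fine bins stored in the DB become 32 display bins with factor=8.
    A short trailing group is summed as-is."""
    return [sum(bins[i:i + factor]) for i in range(0, len(bins), factor)]
```

Because merging only ever adds counts of neighbouring bins, the display layer can offer any coarseness without touching the stored data.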

But first it's time for my Sunday hike. Priorities, right!

@l8gravely
Collaborator

l8gravely commented May 27, 2024 via email

@zevv
Owner

zevv commented May 27, 2024

> Again, we can only gather the data during index time, there's no way
> it would work during display time, and that's not duc's reason to
> exist. We handle the stupid large systems that can't work well with
> simple 'du -sh *' type calls because they beat the crap out of the
> filesystem. Same with histograms.

Yes, you're right, it's two different use cases. Your case is to get some global stats from a file system as a whole, so it would indeed make sense to do this at index time and just add this as metadata for that particular index.

My use case would be to draw a histogram of only the part of the graph that's being displayed: if I get a pie chart of some directory 5 levels deep, the histogram would only show the files for those 5 levels, as these directories are already traversed when drawing the graph.
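This display-time variant amounts to accumulating bins during the same depth-limited traversal that draws the pie chart. A rough sketch; the `Dir` structure here is a hypothetical stand-in for duc's database records, not its real data model:

```python
# Sketch of display-time histogram accumulation over a depth-limited
# directory traversal. Dir is a hypothetical stand-in for duc's records.

from dataclasses import dataclass, field

@dataclass
class Dir:
    files: list = field(default_factory=list)    # file sizes in bytes
    subdirs: list = field(default_factory=list)  # child Dir objects

def accumulate(d: Dir, depth: int, hist: dict) -> None:
    """Add power-of-two size buckets for every file reachable within
    `depth` levels below `d` into the shared `hist` dictionary."""
    for size in d.files:
        b = 0 if size <= 0 else size.bit_length() - 1
        hist[b] = hist.get(b, 0) + 1
    if depth > 0:
        for sub in d.subdirs:
            accumulate(sub, depth - 1, hist)
```

Since the traversal already happens for the pie chart, the histogram comes almost for free and automatically matches what is on screen.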

I did a little POC of this in the zevv-histogram branch, you might want to have a peek at that to see one possible way of displaying the histogram data. It's pretty basic, but the parts are there.

@l8gravely
Collaborator

l8gravely commented May 27, 2024 via email

@zevv
Owner

zevv commented May 27, 2024

> Look at me lecturing the main author! LOL!

Yeah, well, someone needs to step up and teach the dick a lesson every now and then!

> how would you dynamically run through the filesystem to get histogram data?

You don't, because all the data is in the duc db, right?

@zevv
Owner

zevv commented May 27, 2024

> Mostly it's me trying to learn how to write this all in C again. LOL! I'm so rusty it's not funny.

Well, the hip thing to do here would be to rewrite the whole thing in Rust, eh!

@l8gravely
Collaborator

l8gravely commented May 27, 2024 via email

@l8gravely
Collaborator

l8gravely commented May 27, 2024 via email

@l8gravely
Collaborator

l8gravely commented Jun 27, 2024 via email

@l8gravely
Collaborator

So this initial idea has been implemented in v1.5.0-rc1, which I released earlier this week. I'd love to get some testing and feedback.
