RFE histogram #284
Comments
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
Sorry for the late reply, life got in the way...
stuartthebruce> I think it would be helpful to have a histogram
stuartthebruce> command to see the distribution of file and directory
stuartthebruce> sizes. When migrating large collections of files
stuartthebruce> between filesystems it is helpful to have insight into
stuartthebruce> these distributions to help tune the target filesystem
stuartthebruce> (or contact the owner to cleanup first if they have a
stuartthebruce> problematic distribution, e.g., too many small files
stuartthebruce> or too many files in some very large
stuartthebruce> directories). Thanks.
It's an interesting idea, but we don't currently have the
infrastructure for this in duc, since we only keep size
counts on a per-directory basis.
We could fairly easily add in per-index data like this, where we keep
running totals of files in various size buckets. Not sure how
granular this should be. Should it be dynamic, so that we break it
down by 5% chunks? But then a single huge file would skew the
results and make it harder to compare.
Or do we break it down by 512-byte blocks, reporting on files which
are 1, 2, 4, 8, 16, 32, ... 1024 blocks or larger in size? That also
hides the fact that a single large file is one of those targets for
cleanup, since it's generally either something you want to keep or
something you should clean up, and a single file makes such a large
difference.
Maybe it should be based on the filesystem size? Scale the buckets as
the filesystem gets bigger, so that a 10gb filesystem looks different
from a 100gb, 1tb, 10tb or even bigger one.
Now going down and keeping per-directory histogram stats is going to
be a much bigger issue. People have asked for stuff like this in the
past, but we've resisted because it's going to explode the DB size.
I've got multiple filesystems with 30+ million files, and checking
all those takes a long time to complete. And there are people who are
using duc on even bigger data stores. One of duc's strengths in my
mind is that the DB is built outside of the GUI, so that you can
explore the data without having to wait six, twelve or more hours
for the data to be collected, the way windirstat and tools like that
work.
Cheers,
John
|
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
stuartthebruce> John, thanks for giving this some serious thought. I
stuartthebruce> also use duc to regularly index a large number of
stuartthebruce> files (5B nightly) and it does a fantastic job of
stuartthebruce> that!
We're glad to hear it; it's a fantastic tool. This is one of the
areas where filesystems could use better work, but I also realize
how expensive it would be to keep this info up to date.
stuartthebruce> If you agree this is worth pursuing I also agree that
stuartthebruce> one design goal should be to not significantly
stuartthebruce> increase the database size. And I suspect a reasonable
stuartthebruce> place to start would be with dynamic histogram bin
stuartthebruce> sizes on a logarithmic scale to minimize how far an
stuartthebruce> individual large file skews the bin sizes.
Yeah, that might work. Now, were you thinking of keeping the histogram
on a per-directory basis or just at the top level of the entire DB?
The latter would be the much simpler thing to do at first, then
possibly extending it down to per-directory levels later.
The hard part is how to display this information effectively in the
web gui, the CLI and the ncurses interfaces. I haven't really gotten
that far yet.
stuartthebruce> Another perhaps simpler idea would be to support a
stuartthebruce> user setting for min and max file sizes for many
stuartthebruce> (all?) of the database view commands, i.e., everything
stuartthebruce> but index. This would enable a user to make their own
stuartthebruce> decision on what file size(s) they want to zoom in on,
stuartthebruce> e.g., where are all my <1kByte files?
That would be hard to do because then you would need to rescan the
filesystem to get the proper data. With a goal of keeping the
information overhead as small as possible, the bucket sizes would need
to be picked ahead of time, so that the stored counts would be
limited.
As a first plan, I would just setup a top-level structure to hold the
totals for the db, and as directories are accessed, the size
distribution they report would be simply summed with the top level
totals.
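As a rough sketch of what I mean (the struct and function names below are
just made up for illustration, not duc's actual internals):

#include <stdint.h>

/* Illustrative bucket count and accumulator; duc's real structures may differ. */
#define HIST_BUCKETS 64

struct histogram {
    uint64_t count[HIST_BUCKETS]; /* number of files whose size lands in each bucket */
};

/* Map a file size to a power-of-two bucket: 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, ... */
static int size_to_bucket(uint64_t size)
{
    int b = 0;
    while (size > 0 && b < HIST_BUCKETS - 1) {
        size >>= 1;
        b++;
    }
    return b;
}

/* Called once per file as the indexer walks each directory; the top-level
 * (per-index) histogram is just the running sum over every file seen. */
static void histogram_add(struct histogram *h, uint64_t file_size)
{
    h->count[size_to_bucket(file_size)]++;
}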
Once that is working, then it would be pretty trivial, I think, to add
in a per-directory database entry to hold this structure. But this is
where the DB size increase would happen in a major way, especially as
the number of directories increases. I don't have the per-directory
data structure size in my head, so I'd have to dig through the code to
get a good idea of how this would work.
Doing a packed data structure of some sort would be best, so as to not
keep counters for buckets that aren't used. Of course as you get
higher up the tree, the chance of having empty buckets would go down
quite a bit.
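Something like this is what I mean by packed, purely as a sketch (names
are made up; a real version would varint-encode the values before writing
them to the DB):

#include <stdint.h>
#include <stddef.h>

/* Serialize only the non-zero buckets as (bucket index, count) pairs, so a
 * sparse per-directory histogram costs a few entries instead of a full
 * bucket array. Returns the number of pairs written. */
static size_t histogram_pack(const uint64_t *count, int nbuckets,
                             uint8_t *idx_out, uint64_t *cnt_out)
{
    size_t n = 0;
    for (int i = 0; i < nbuckets; i++) {
        if (count[i] != 0) {
            idx_out[n] = (uint8_t)i;
            cnt_out[n] = count[i];
            n++;
        }
    }
    return n;
}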
Something to look into when I get a chance.
John
|
I was thinking of a per-directory basis; however, starting with just a top-level histogram sounds like a good idea to expose many of the interesting data structure and user interface questions.
But the rescan would be against the database and not the filesystem, so in some cases it may be worth waiting for that rescan.
Perhaps the same idea, but it might make sense to automatically roll up directory branches until a minimum number of files have been accumulated, i.e., do not dramatically increase individual directory DB sizes for directories that only have a few files in them (or below them). |
This idea has languished for quite a while and I have no plans to implement this because even now I don't have a good understanding of what you want. I guess I'd like to see some examples of what you're looking for. Now, just keeping a running total of the number of files in size buckets wouldn't be too hard; choosing the bucket sizes would be. Something like 0, <1k, 2k, 4k, ..., 16gb+ might be the right answer here. It would certainly be an interesting project to play with. Have to think how to add this into indexing and then how to display it. Hmm... now that I'm doing more with duc again, I'm a bit inspired to try this. |
So I've just pushed a new branch called 'histogram' based on the 'tkrzw' work to add histograms to the DB and the ability to show them using the 'duc info -H' command. It's in an initial state and the docs still need to be updated. Here's an example:
$ ./duc info -d /tmp/tkrzw.db -H
Histogram:
2^0 24837
|
Is that example a histogram of individual file sizes? If so, I think it would also be useful to be able to histogram directory sizes. |
>>>> "stuartthebruce" == stuartthebruce ***@***.***> writes:
Is that example a histogram of individual file sizes? If so, I think
it would also be useful to be able to histogram directory sizes.
It's just a histogram of file sizes currently. I'm planning on adding
summation to the plot, so you can see what percentage of the filesystem
space usage is in each bucket.
The real trick is coming up with bucket sizing that fits most people's
needs. Zero-length files are the first bucket, then files up to 1k,
4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k,
1g, 2g, 4g, 8g... does that work for people?
John
|
I like the idea of bucket boundaries being powers of 2, and if you want to add a user argument it could be how many powers of 2 go into each bucket, with a default of 1 to provide single-octave resolution, or 10 for coarser-grained k/M/G/T buckets.
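Just to make the idea concrete, a sketch of the mapping I have in mind
(bits_per_bucket is a hypothetical option name, not something duc has today):
with bits_per_bucket = 1 every power of two gets its own bin, and with
bits_per_bucket = 10 sizes collapse into roughly byte/k/M/G/T bins.

#include <stdint.h>

/* Group 'bits_per_bucket' powers of two into each bin.
 * bits_per_bucket = 1  -> bins at 1, 2, 4, 8, ...    (single-octave resolution)
 * bits_per_bucket = 10 -> bins at ~1k, ~1M, ~1G, ~1T (coarse k/M/G/T view) */
static int size_to_bin(uint64_t size, int bits_per_bucket)
{
    int log2_size = 0;
    while (size > 1) {   /* integer log2, rounded down; sizes 0 and 1 share bin 0 */
        size >>= 1;
        log2_size++;
    }
    return log2_size / bits_per_bucket;
}
|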
Hey folks, long time no talk. I must admit I kind of ignored most of the duc-related traffic as of late; apologies for the lack of time and love spent on this project. @l8gravely I really do appreciate your efforts of keeping things going, thank you for that! I do like this histogram feature though, and I do agree with the exponential bin size; base 2 and 10 both make sense, so that would make a nice option. I'd say we start with just a separate command for dumping histogram data to the console; adding fancy graphs for this can always be done at a later time. Start small. @l8gravely Would you be comfortable writing the code for this? If not, I'd be happy to whip up a proof of concept to get this started. |
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Hey folks, long time no talk. I must admit I kind of ignored most of the duc-related traffic as of
late, apologies for the lack of time and love spent on this project. @l8gravely I really do
appreciate your efforts of keeping things going, thank you for that!
I do like this histogram feature though, I do agree with the exponential bin size, base 2 and 10
both make sense so that would make a nice option.
I'd say we start with just a separate command for dumping histogram data to the console, adding
fancy graphs for this can always be done at a later time. Start small.
@l8gravely Would you be comfortable writing the code for this? If
not, I'd be happy to whip up a proof of concept to get this started?
I've already got some basic code for dumping histograms using the 'duc
info ...' command with the -H switch. We could certainly make a new
command such as:
duc histogram <opts>
and give various options there, but the collection part is where the
choice of base 2 or base 10 would have to take place. Or we could
just spend the memory/disk space and keep both! That wouldn't be hard
to do really.
The big issue is bucket size and spacing. I think powers of 10 is too
coarse, honestly; the gap between 1mb and 1gb is too big, though 1mb,
10mb, 100mb, 1000mb, 10000mb isn't terrible now that I think of it.
So my plan right now is:
1. add in a new histogram command
2. keep histograms in 'info' command for now.
3. tweak adding of both base2 and base10 data buckets.
Ico, if you could work on the gui and web side, that would be
awesome. I've got the 'histogram' branch going right now, so feel
free to build on there, though I'll be pushing new changes there this
weekend or next week.
|
Ah, you already got the work started, nice! I also did some experiments already to see how to hack this in, and the implementation should not be too hard. Having the powers configurable sounds ok to me, although having a standard of 2 feels natural to me and probably would fit most needs: this results in a histogram with 30-ish bins, which feels like a very manageable size. Some questions will probably pop up on how to (if to) handle recursion. I'll have to play around a bit with the gui side to see how to properly represent this; there's the choice to make this either into a separate thing, or combine the histogram with the normal pie chart into a single graphic - I guess that will just need some playing around and prototyping. |
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Ah, you already got the work started, nice! I also did some
experiments already to see how to hack this in, and the
implementation should not be too hard.
Great, let's play around with this. I've just implemented the
'histogram' command as well. I'll be pushing that in a bit.
Having the powers configurable sounds ok to me, although having a
standard of 2 feels natural to me and probably would fit most needs:
this results in a histogram with 30-ish bins, which feels like a
very manageable size.
Mine goes up to 2^36 in my test on my home system because I have a
single 72gb file in there. So it needs to auto-scale properly. But
the jump between 2^35 and 2^36 just seems too big to me... and the
smaller buckets below 128 bytes seem way too close together to me.
Some questions will probably pop up on how to (if to) handle recursion.
Nope, it's all just part of the data collection, look at what I did in
the index.c stuff. It just fell in quite nicely. Now I do want to
make the histogram auto-size nicely. Got some ideas there...
I'll have to play around a bit with the gui side to see how to
properly represent this; there's the choice to make this either into
a separate thing, or combine the histogram with the normal pie chart
into a single graphic - I guess that will just need some playing
around and prototyping.
I was thinking that for the html version, just having the histogram
below the ring chart might be nice, or maybe making it so you toggle
between them? It's not clear to me that there's any real need for
interactivity.
One thing I am thinking of adding when I get a chance is a running
total of the file size in each bucket, and how that then reflects
across the entire histogram, with the last bucket ending at 100%.
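Roughly what that would look like, as a sketch (this assumes the DB also
stores per-bucket byte totals, which it doesn't yet; all names here are
illustrative):

#include <stdint.h>
#include <stdio.h>

/* Print each bucket's share of total space plus a running cumulative
 * percentage; the last bucket's cumulative value ends at 100%. */
static void histogram_print_cumulative(const uint64_t *bytes, int nbuckets)
{
    uint64_t total = 0, running = 0;
    for (int i = 0; i < nbuckets; i++)
        total += bytes[i];
    if (total == 0)
        return;
    for (int i = 0; i < nbuckets; i++) {
        running += bytes[i];
        printf("2^%-2d %14llu bytes %7.2f%% cumulative %7.2f%%\n",
               i, (unsigned long long)bytes[i],
               100.0 * bytes[i] / total,
               100.0 * running / total);
    }
}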
I'll try to push some thoughts later this weekend as I get chances.
John
|
Oh boy it's happening again: I'm thinking of a new feature and tons of questions come up. I'll just rant a little here, not sure how much is really relevant but we'll need to make some choices:
* Aggregation level, global or per-directory? Your current implementation is simple and concise, but it only aggregates the histogram data at the global index level, not at the individual directory. The good thing about it is that it is cheap on storage; the bad thing is that there is no histogram data available for individual directories. Personally, I would like to have histograms available for the individual directories I'm currently inspecting. This would require duc to store the histogram for every directory, but that'll eat considerably more storage - at least DUC_HISTOGRAM_MAX varints for every dir.
* Aggregate at index time or at display time? Building the histogram at index time makes sense because all the data is there at the right time, and querying would be fast. The price we pay is storage. Alternatively, we could also generate the histogram at display time, and have the histogram match the data that is being displayed: when we draw the pie chart we already recursively traverse the database for N levels deep, so it would be trivial to accumulate the histogram bins at that time. The advantage would be that the histogram actually matches what is displayed at that time; we can generate two different data sets: one for files, one for directories. The downside is that there will be no easy way to histogram the FS at large without traversing the whole database.
* Bin sizes: if storage is not an issue we can just choose a larger value for the number of bins (256 or so) and use an exponent smaller than 2 (sqrt(2) for example). If the user chooses to display the histogram with a lower number of bins it'll be trivial to do this at display time by simply combining bins. That would help with your problem of the useless small size bins (<128) or subjectively large gaps for huge files.
But first it's time for my Sunday hike. Priorities, right! |
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Oh boy it's happening again: I'm thinking of a new feature and tons
of questions come up. I'll just rant a little here, not sure how
much is really relevant but we'll need to make some choices:
rant away!
* Aggregation level, global or per-directory? Your current
implementation is simple and concise, but it only aggregates the
histogram data at the global index level, but not at the
individual directory. The good thing about it is that it is
cheap on storage, the bad thing is that there is no histogram
data available for individual directories. Personally, I would
like to have histograms available for the individual directories
I'm currently inspecting. This would require duc to store the
histogram for every directory, but that'll eat considerably more
storage - at least DUC_HISTOGRAM_MAX varints for every dir.
I think it only needs to be global. Doing it on a per-directory level
would be a nightmare. Look at all the issues I've been having with
Stuart on his systems. Doing that per-directory... that would be
extreme.
Now maybe we do want to do the same type of histograms for the number
of files in a directory, since that's another interesting piece of data.
But I think the next feature I want to add is tracking the largest N
files found in the system. That's a *really* useful feature to have.
The histogram is a neat set of data, but it doesn't help you find where
the disk space is being used. So having a cutoff of 1g at the lower
end (but configurable) would be useful. This kind of data is exactly
what I used to need to help track down issues, since I could email my
users a listing of the top N large files by user and ask them to clean
up. Nice, easy (sometimes) low-hanging fruit to clean up for maximum
impact.
And also keeping a list of the top N directories by file count in that
directory. And of course having a base number to not care about, say
1024 or fewer files.
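The bookkeeping for that is pretty cheap during indexing; something along
these lines is what I'm picturing (a sketch only, with made-up names; a
min-heap would be the fancier way to do it):

#include <stdint.h>
#include <string.h>

#define TOPN 100

struct top_entry {
    uint64_t size;
    char path[4096];
};

/* Keep the TOPN largest files seen so far, sorted descending by size.
 * Called once per file during indexing; 'min_size' is the configurable
 * lower bound (e.g. ignore anything under 1g). Assumes 'top' starts
 * zeroed out. */
static void topn_add(struct top_entry top[TOPN], uint64_t size,
                     const char *path, uint64_t min_size)
{
    if (size < min_size || size <= top[TOPN - 1].size)
        return;
    int i = TOPN - 1;
    while (i > 0 && top[i - 1].size < size) {   /* shift smaller entries down */
        top[i] = top[i - 1];
        i--;
    }
    top[i].size = size;
    strncpy(top[i].path, path, sizeof(top[i].path) - 1);
    top[i].path[sizeof(top[i].path) - 1] = '\0';
}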
* Aggregate at index time or at display time? Building the
histogram at index time makes sense because all the data is
there at the right time, and querying would be fast. The price
we pay is storage. Alternatively, we could also generate the
histogram at display time, and have the histogram match the data
that is being displayed: when we draw the pie chart we already
recursively traverse the database for N levels deep, so it would
be trivial to accumulate the histogram bins at that time. The
advantage would be that the histogram actually matches what is
displayed at that time; we can generate two different data sets:
one for files, one for directories. Down side is that there will
be no easy way to histogram for the FS at large without
traversing the whole database.
Again, we can only gather the data during index time, there's no way
it would work during display time, and that's not duc's reason to
exist. We handle the stupid large systems that can't work well with
simple 'du -sh *' type calls because they beat the crap out of the
filesystem. Same with histograms.
* Bin sizes: if storage is not an issue we can just choose a
larger value for the number of bins (256 or so) and use an
exponent smaller than 2 ( sqrt(2) for example). If the user
chooses to display the histogram with a lower number of bins
it'll be trivial to do this at display time by simply combining
bins. That would help with your problem of the useless small
size bins (<128) or subjective large gaps for huge files.
It's not so much the number of bins; they're actually cheap on a
per-filesystem basis. It's the spacing that I'm trying to figure
out. We don't care about files below 512 bytes really, but we do care
about files between 1g and 10g, right? Or 10g to 100g in size; how
fine-grained do we want to be here?
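If we did store lots of fine bins like you suggest, collapsing them at
display time is trivial; a sketch (illustrative names only):

#include <stdint.h>

/* Merge 'nfine' stored bins into 'ncoarse' display bins by summing each
 * group of consecutive bins, so the stored resolution stays high while
 * the user picks how many bars to see. Assumes nfine is a multiple of
 * ncoarse. */
static void bins_merge(const uint64_t *fine, int nfine,
                       uint64_t *coarse, int ncoarse)
{
    int group = nfine / ncoarse;
    for (int i = 0; i < ncoarse; i++) {
        coarse[i] = 0;
        for (int j = 0; j < group; j++)
            coarse[i] += fine[i * group + j];
    }
}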
But first it's time for my sunday hike. Priorities, right!
Sounds like an excellent priority!
John
|
Yes, you're right, it's two different use cases. Your case is to get some global stats from a file system as a whole, so it would indeed make sense to do this at index time and just add this as metadata for that particular index. My use case would be to draw a histogram only of the part of the graph that's being displayed: so if I get a pie chart of some directory 5 levels deep, the histogram would only show the files for these 5 levels, as these directories are already traversed when drawing the graph. I did a little POC of this in the zevv-histogram branch, you might want to have a peek at that to see one possible way of displaying the histogram data. It's pretty basic, but the parts are there.
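The POC boils down to something like this (the node type here is a
stand-in for whatever the graph code actually walks, not duc's real API):

#include <stdint.h>

/* Stand-in node type; duc's real API is not reproduced here. */
struct node {
    int is_dir;
    uint64_t size;
    int nchildren;
    struct node **children;
};

/* Accumulate histogram bins while recursing the same N levels the pie
 * chart already visits, so the histogram matches exactly what is drawn. */
static void accumulate(const struct node *n, uint64_t *bins,
                       int (*size_to_bin)(uint64_t size),
                       int depth, int max_depth)
{
    if (depth > max_depth)
        return;
    for (int i = 0; i < n->nchildren; i++) {
        const struct node *c = n->children[i];
        if (c->is_dir)
            accumulate(c, bins, size_to_bin, depth + 1, max_depth);
        else
            bins[size_to_bin(c->size)]++;
    }
}
|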
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Again, we can only gather the data during index time, there's no way
it would work during display time, and that's not duc's reason to
exist. We handle the stupid large systems that can't work well with
simple 'du -sh *' type calls because they beat the crap out of the
filesystem. Same with histograms.
Yes, you're right, it's two different use cases. Your case is to get
some global stats from a file system as a whole, so it would indeed
make sense to do this at index time and just add this as metadata
for that particular index.
Good we're in agreement.
My use case would be to draw a histogram only of the part of the
graph that's being displayed: so if I get a pie chart of some
directory 5 levels deep, the histogram would only show the files for
these 5 levels, as these directories are already traversed when
drawing the graph.
Hmm... an interesting idea, but it really breaks when the http server
where you're pulling the info doesn't have access to the system where
the indexes were generated. I had multiple data centers which I ran
'duc' in, but I pushed all the DBs to a central host to make
displaying the data easier.
So I don't think this is really going to work. It's just not part of
the core design of duc, which is to separate the indexing and the data
display into two totally separate processes.
Look at me lecturing the main author! LOL!
I did a little POC of this in the zevv-histogram branch, you might
want to have a peek at that to see one possible way of displaying
the histogram data. It's pretty basic, but the parts are there.
I'll take a look at some point later tonight.
For example, Stuart's test case was taking 9+ hours to run and grab
the data. So when displaying data, how would you dynamically run
through the filesystem to get histogram data?
I really want to also add a feature where I grab the top 100 largest
files (or some percentage of the filesystem's number of files) with a
lower bound of some sort. Mostly it's me trying to learn how to write
this all in C again. LOL! I'm so rusty it's not funny.
|
Yeah, well, someone needs to step up and teach the dick a lesson every now and then!
You don't, because all the data is in the duc db, right? |
Well, the hip thing to do here would be to rewrite the whole thing in Rust, eh! |
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Look at me lecturing the main author! LOL!
Yeah, well, someone needs to step up and teach the dick a lesson every now and then!
how would you dynamically run through the filesystem to get histogram data?
You don't, because all the data is in the duc db, right?
That's one option, but I suspect it's going to blow up the size of the
DB to keep that data on a per-directory basis. I don't know for sure.
It's probably not a problem on smaller systems, but with lots and lots
of directories... it could get expensive.
For now, I'm going to work on finding the top N files and keeping
them in the DB as well. I think that will be a really useful feature
to add to the indexing and data side.
Then I need to figure out how to report it. 'duc topn <args>'? I'm not
wild about the 'topn' name. "duc largest"? "duc files"? For now I'll
just add it to 'duc info' like I did for my initial histogram
implementation.
John
|
>>>> "Ico" == Ico Doornekamp ***@***.***> writes:
Mostly it's me trying to learn how to write this all in C again. LOL! I'm so rusty it's not
funny.
Well, the hip thing to do here would be to rewrite the whole thing in Rust, eh!
What about 'go' instead? :-)
|
Guys,
I've just pushed my next set of patches implementing the top-N files by
size in duc, using the 'histogram' branch. It still needs a
bunch of work, but I'm on vacation, so I figured I'd let people look at
this.
Please pull the 'histogram' branch and poke at it. I've got future
work to do here though:
1. remove my debugging printf() calls
2. fix writing the topN data into the DB properly when done indexing
3. fix reading of topN data (should work...) properly to then display
results.
4. fix option passing when indexing to adjust the minimum size of file
to index, and then to change the number of files to index.
5. Lots more testing of edge cases, especially when reading old
version files.
But I think I'm getting closer and I hope to release v1.5.0a sometime
late July. So any patches, comments, fixes are welcome!
John
|
So this initial idea has been implemented in v1.5.0-rc1, which I released earlier this week. I'd love to get some testing and feedback. |