
Use block size from HDFS configuration for Large Files calculation #62

Open
pjeli opened this issue Jun 25, 2018 · 6 comments

Comments

@pjeli
Collaborator

pjeli commented Jun 25, 2018

In NNA today, particularly if you look around here:

Collection<INode> mediumFiles =
    nnLoader.combinedFilter(
        files,
        new String[] {"fileSize", "fileSize"},
        // 134217728 bytes = 128 MB (hardcoded); 1048576 bytes = 1 MB
        new String[] {"lte:134217728", "gt:1048576"});

You will see that NNA uses a hardcoded cutoff of 128 megabytes (the default HDFS block size) to distinguish between "Medium Files" and "Large Files".

We should instead use the byte count from the dfs.blocksize value in hdfs-site.xml (available programmatically through the Configuration object passed into NNA from the source cluster).
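
A minimal sketch of that lookup using the standard Hadoop Configuration API (the default of 134217728 mirrors the current hardcoded value; nnLoader and files are as in the snippet above):

import org.apache.hadoop.conf.Configuration;

// getLongBytes understands both raw byte counts and suffixed values like "128m".
Configuration conf = new Configuration();
conf.addResource("hdfs-site.xml");
long blockSize = conf.getLongBytes("dfs.blocksize", 134217728L);

Collection<INode> mediumFiles =
    nnLoader.combinedFilter(
        files,
        new String[] {"fileSize", "fileSize"},
        new String[] {"lte:" + blockSize, "gt:1048576"});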

@pjeli pjeli added enhancement New feature or request good first issue Good for newcomers labels Jun 25, 2018
@pjeli pjeli changed the title NNA uses hardcoded block size for Large Files calculation Use block size from HDFS configuration for Large Files calculation Jun 27, 2018
@americanyorkie

americanyorkie commented Sep 11, 2018

What about making the file size definitions user-configurable here? It's reasonable to expect users to have differing opinions on what constitutes a particular file size. Currently the sizes are defined as:
tiny > 0 && tiny <= 1024
small > 1024 && small <= 1048576
medium > 1048576 && medium <= 134217728
large > 134217728
There would be value in being able to examine the far end of the scale more granularly. Importing tables from an RDBMS can result in files tens or hundreds of GBs in size, for example.
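
For instance, a rough sketch of configurable cutoffs (the key names here are invented for illustration, not existing NNA settings):

// Defaults mirror the current hardcoded thresholds.
long tinyMax   = conf.getLongBytes("nna.suggestions.tiny.max",   1024L);      // 1 KB
long smallMax  = conf.getLongBytes("nna.suggestions.small.max",  1048576L);   // 1 MB
long mediumMax = conf.getLongBytes("nna.suggestions.medium.max", 134217728L); // 128 MB
// Anything above mediumMax counts as large; an additional cutoff could expose
// the multi-GB end of the scale more granularly.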

@akshatgit
Contributor

Good idea. Maybe a web UI where the user can select different filters to sort/group the files would be a better interface.

@pjeli
Collaborator Author

pjeli commented Sep 11, 2018

Hmm, yes, a good idea @americanyorkie. Something to keep in mind, though, is that those are cached results, so while it is possible to change them, the change may not be reflected until the next SuggestionEngine run.

Still, this is probably fine.

I can see an admin-only REST endpoint that would set these. For example, a naive one like /setAttributes?tiny=1024&small=1048576&medium=134217728 (where large is then anything greater than 134217728).
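
A naive sketch of such a handler (the servlet style is purely illustrative, not how NNA's server is actually wired up):

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SetAttributesServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
    // Would be admin-only in practice; validation omitted for brevity.
    long tiny   = Long.parseLong(req.getParameter("tiny"));    // e.g. 1024
    long small  = Long.parseLong(req.getParameter("small"));   // e.g. 1048576
    long medium = Long.parseLong(req.getParameter("medium"));  // e.g. 134217728
    // Persist the new cutoffs; cached results update on the next SuggestionEngine run.
  }
}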

Thoughts?

I don't think it will be possible to have different settings per user though...
We could certainly add a "gigantic file" category too. 😆

@pjeli pjeli added this to the 1.6.0 Release milestone Oct 25, 2018
@pjeli
Collaborator Author

pjeli commented Feb 28, 2019

I still think this is best fetched from the HDFS configuration file (hdfs-site.xml), as that should be the same value used by the active NameNode. If a different value is desired, it can be changed in just the NNA host's hdfs-site.xml.

Changing this value on the fly would not be good for NNA, so it needs to be a hard value decided at bootstrap time.

@pjeli pjeli modified the milestones: 1.6.0 Release, 1.6.1 Release Feb 28, 2019
@pjeli
Collaborator Author

pjeli commented Mar 8, 2019

An additional justification is that once NNA bootstraps from a cluster NameNode (Observer or Standby), it will already have the expected configuration anyway.

@pjeli pjeli modified the milestones: 1.6.1 Release, 1.6.2 Release Mar 8, 2019
@pjeli
Collaborator Author

pjeli commented Mar 15, 2019

More thoughts on this one -- I think we should settle on a statistic by which we measure tiny, small, and medium files.

I think ratios are probably the best measure here. If we were to retain the same effective cutoffs, then, assuming hdfs-site.xml has a block size of 128 MB (sketched in code below the list):

Large files = Greater than blocksize
Medium files = Greater than or equal to 1/128 of blocksize but less than large files
Small files = Greater than or equal to 1/131072 of blocksize but less than medium files
Tiny files = Greater than 0 bytes but less than small files
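
A sketch of those ratios in code, assuming blockSize was read from dfs.blocksize as above:

long largeMin  = blockSize;          // large:  > blocksize             (128 MB default)
long mediumMin = blockSize / 128;    // medium: >= 1/128 of blocksize   (1 MB at 128 MB)
long smallMin  = blockSize / 131072; // small:  >= 1/131072 of blocksize (1 KB at 128 MB)
// tiny: > 0 bytes but < smallMin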

The ratios aren't very intuitive, however. It might be better to stick with the hardcoded 1KB and 1MB sizes. Just dumping thoughts.

@pjeli pjeli modified the milestones: 1.6.2 Release, 1.6.3 Release Apr 15, 2019
@pjeli pjeli removed this from the 1.6.3 Release milestone Jun 4, 2019