You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Right now flumelog-offset has one file which contains both the data and the indexes in this format:
<data.length (UInt32BE)>
<data ...>
<data.length (UInt32BE)>
<file_length (UInt32BE or Uint48BE or Uint53BE)>
This means that the data length is encoded twice and the sequence numbers are byte offsets, which means we can have invalid offsets like flumedb/flumedb#32.
Would it be better to use two files here instead?
Index:
<data.offset (UInt32BE)>
<data.length (UInt32BE)>
<file.length (UInt32BE or Uint48BE or Uint53BE)>
Data:
<data ...>
We'd have to use two file descriptors, but it would give us the ability to have consecutive sequence numbers. For example, to get the 42nd message you'd read 8 bytes from the index file at offset 3 * 4 * 42, which would give you the offset and length of the data you want to read from the data file.
I have a hunch that this would be "better", but I'm not sure. Any chance it's been implemented, considered, or discussed before? I've been thinking about this since I saw the approach mentioned here.
The text was updated successfully, but these errors were encountered:
Yes. this is actually how the go implementation works! @keks@cryptix
sequential sequence numbers mean it works much better with Roaring Bitfields.
Also, makes it possible to do binary search! (although only on receive time, still it's a cool feature though)
I've been meaning to implement this since hanging out with @keks last year and learning this but other stuff happened...
small bikeshed:
it might be better for the sequence file to be
<offset int64>...
and then the data file to be
<varint length><data...>
Because then the datafile has enough information to rebuild the sequence file. that will be useful for crash recovery, because you won't need to worry about the case where the sequence file gets written but not the data file and vice versa. Not sure if this is how go does it though.
Right now flumelog-offset has one file which contains both the data and the indexes in this format:
This means that the data length is encoded twice and the sequence numbers are byte offsets, which means we can have invalid offsets like flumedb/flumedb#32.
Would it be better to use two files here instead?
Index:
Data:
We'd have to use two file descriptors, but it would give us the ability to have consecutive sequence numbers. For example, to get the 42nd message you'd read 8 bytes from the index file at offset
3 * 4 * 42
, which would give you the offset and length of the data you want to read from the data file.I have a hunch that this would be "better", but I'm not sure. Any chance it's been implemented, considered, or discussed before? I've been thinking about this since I saw the approach mentioned here.
The text was updated successfully, but these errors were encountered: