Thoughts on a dual-file implementation? #21

christianbundy · 2019-12-28T21:06:27Z

Right now flumelog-offset has one file which contains both the data and the indexes in this format:

<data.length (UInt32BE)>
<data ...>
<data.length (UInt32BE)>
<file_length (UInt32BE or Uint48BE or Uint53BE)>

This means that the data length is encoded twice and the sequence numbers are byte offsets, which means we can have invalid offsets like flumedb/flumedb#32.

Would it be better to use two files here instead?

Index:

<data.offset (UInt32BE)>
<data.length (UInt32BE)>
<file.length (UInt32BE or Uint48BE or Uint53BE)>

Data:

<data ...>

We'd have to use two file descriptors, but it would give us the ability to have consecutive sequence numbers. For example, to get the 42nd message you'd read 8 bytes from the index file at offset 3 * 4 * 42, which would give you the offset and length of the data you want to read from the data file.

I have a hunch that this would be "better", but I'm not sure. Any chance it's been implemented, considered, or discussed before? I've been thinking about this since I saw the approach mentioned here.

The text was updated successfully, but these errors were encountered:

dominictarr · 2020-01-09T02:24:25Z

Yes. this is actually how the go implementation works! @keks @cryptix
sequential sequence numbers mean it works much better with Roaring Bitfields.
Also, makes it possible to do binary search! (although only on receive time, still it's a cool feature though)
I've been meaning to implement this since hanging out with @keks last year and learning this but other stuff happened...

small bikeshed:
it might be better for the sequence file to be

<offset int64>...

and then the data file to be

<varint length><data...>

Because then the datafile has enough information to rebuild the sequence file. that will be useful for crash recovery, because you won't need to worry about the case where the sequence file gets written but not the data file and vice versa. Not sure if this is how go does it though.

dominictarr · 2020-01-09T02:25:46Z

having binary search means secure-scuttlebutt could drop the time index https://github.com/ssbc/ssb-db/blob/master/extras.js#L9-L11 but retain backwards compatibility

dominictarr · 2020-01-09T05:08:32Z

also: oh, yeah if you use https://github.com/random-access-storage/ to do it, it will work in the browser (both chrome and ff, without extra work) flumeview-aligned-offset uses https://github.com/dominictarr/polyraf so the right adapter is applied automatically

christianbundy added the question label Dec 28, 2019

christianbundy self-assigned this Dec 28, 2019

dominictarr mentioned this issue Jan 13, 2020

uint32 overflow of sequence number – does seq has to be an integer? #2

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thoughts on a dual-file implementation? #21

Thoughts on a dual-file implementation? #21

christianbundy commented Dec 28, 2019 •

edited

Loading

dominictarr commented Jan 9, 2020

dominictarr commented Jan 9, 2020

dominictarr commented Jan 9, 2020

Thoughts on a dual-file implementation? #21

Thoughts on a dual-file implementation? #21

Comments

christianbundy commented Dec 28, 2019 • edited Loading

dominictarr commented Jan 9, 2020

dominictarr commented Jan 9, 2020

dominictarr commented Jan 9, 2020

christianbundy commented Dec 28, 2019 •

edited

Loading