Thoughts on a dual-file implementation? #21

Open
christianbundy opened this issue Dec 28, 2019 · 3 comments

@christianbundy
Member

christianbundy commented Dec 28, 2019

Right now flumelog-offset has one file which contains both the data and the indexes in this format:

<data.length (UInt32BE)>
<data ...>
<data.length (UInt32BE)>
<file_length (UInt32BE or UInt48BE or UInt53BE)>

This means that the data length is encoded twice and that the sequence numbers are byte offsets, so we can end up with invalid offsets like flumedb/flumedb#32.
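To make the "sequence numbers are byte offsets" point concrete, here is a minimal sketch of the framing above using in-memory buffers. It is not the flumelog-offset code: `appendRecord` and `readRecord` are hypothetical names, all fields are assumed to be UInt32BE, and the trailing field is assumed to hold the total file length after the record is appended.

```javascript
// Sketch of the single-file framing, with 32-bit fields throughout.
// Assumption: the trailing field stores the file length once the record is appended.
function appendRecord(file, data) {
  const frame = Buffer.alloc(4 + data.length + 4 + 4)
  const seq = file.length // the record's "sequence number" is its byte offset
  frame.writeUInt32BE(data.length, 0)
  data.copy(frame, 4)
  frame.writeUInt32BE(data.length, 4 + data.length)
  frame.writeUInt32BE(file.length + frame.length, 4 + data.length + 4)
  return { file: Buffer.concat([file, frame]), seq }
}

function readRecord(file, seq) {
  const length = file.readUInt32BE(seq)
  const data = file.subarray(seq + 4, seq + 4 + length)
  // The length is stored a second time after the data. A mismatch (or an
  // out-of-range read) suggests `seq` is not the start of a record.
  if (file.readUInt32BE(seq + 4 + length) !== length) {
    throw new Error('invalid offset: ' + seq)
  }
  return data
}

let file = Buffer.alloc(0)
const seqs = []
for (const text of ['a', 'bcd', 'ef']) {
  const result = appendRecord(file, Buffer.from(text))
  file = result.file
  seqs.push(result.seq)
}
console.log(seqs) // [ 0, 13, 28 ]
```

The sequence numbers depend on the size of every earlier record, so any integer between two valid offsets is a possible but invalid input.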

Would it be better to use two files here instead?

Index:

<data.offset (UInt32BE)>
<data.length (UInt32BE)>
<file.length (UInt32BE or UInt48BE or UInt53BE)>

Data:

<data ...>

We'd have to use two file descriptors, but it would give us the ability to have consecutive sequence numbers. With three 4-byte fields, each index record is 3 * 4 = 12 bytes. So to get the message with sequence number 42, you'd read 8 bytes from the index file at offset 3 * 4 * 42 = 504, which would give you the offset and length of the data you want to read from the data file.
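A rough sketch of that lookup, with two in-memory buffers standing in for the index and data files. The `append` and `get` helpers are hypothetical, all three index fields are assumed to be UInt32BE, and `file.length` is assumed to mean the data file length after the append.

```javascript
// Sketch of the proposed index + data layout.
// Assumption: every index field is UInt32BE, so each index record is 12 bytes.
const INDEX_RECORD_SIZE = 3 * 4

function append(log, data) {
  const entry = Buffer.alloc(INDEX_RECORD_SIZE)
  entry.writeUInt32BE(log.data.length, 0)               // data.offset
  entry.writeUInt32BE(data.length, 4)                   // data.length
  entry.writeUInt32BE(log.data.length + data.length, 8) // file.length (assumed meaning)
  const seq = log.index.length / INDEX_RECORD_SIZE      // consecutive: 0, 1, 2, ...
  log.index = Buffer.concat([log.index, entry])
  log.data = Buffer.concat([log.data, data])
  return seq
}

function get(log, seq) {
  const at = INDEX_RECORD_SIZE * seq
  if (at + INDEX_RECORD_SIZE > log.index.length) {
    throw new RangeError('no such sequence number: ' + seq)
  }
  // Read the first 8 bytes of the index record: offset, then length.
  const offset = log.index.readUInt32BE(at)
  const length = log.index.readUInt32BE(at + 4)
  return log.data.subarray(offset, offset + length)
}

const log = { index: Buffer.alloc(0), data: Buffer.alloc(0) }
const first = append(log, Buffer.from('hello'))
const second = append(log, Buffer.from('world!'))
console.log(first, second, get(log, second).toString()) // 0 1 world!
```

With real files, the two `readUInt32BE` calls would become one 8-byte read on the index descriptor followed by one read on the data descriptor.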

I have a hunch that this would be "better", but I'm not sure. Any chance it's been implemented, considered, or discussed before? I've been thinking about this since I saw the approach mentioned here.

@dominictarr
Collaborator

Yes, this is actually how the Go implementation works! @keks @cryptix
Sequential sequence numbers mean it works much better with Roaring Bitmaps.
It also makes it possible to do binary search! (Only on receive time, but it's still a cool feature.)
I've been meaning to implement this since hanging out with @keks last year and learning about it, but other stuff happened...
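As an illustration of the binary search idea, here is a small sketch. It assumes records are stored in receive order, so receive times never decrease as the sequence number grows. `firstSeqAtOrAfter` and `timestampAt` are hypothetical names; in a real log `timestampAt(seq)` would read record `seq` and extract its receive time.

```javascript
// Find the first sequence number whose receive time is >= target.
// `count` is the number of records; `timestampAt(seq)` is an assumed accessor.
function firstSeqAtOrAfter(count, timestampAt, target) {
  let lo = 0
  let hi = count
  while (lo < hi) {
    const mid = lo + Math.floor((hi - lo) / 2)
    if (timestampAt(mid) < target) lo = mid + 1
    else hi = mid
  }
  return lo // equals `count` when every record is older than `target`
}

const received = [100, 150, 150, 300, 420]
const timestampAt = (seq) => received[seq]
console.log(firstSeqAtOrAfter(received.length, timestampAt, 150)) // 1
console.log(firstSeqAtOrAfter(received.length, timestampAt, 200)) // 3
console.log(firstSeqAtOrAfter(received.length, timestampAt, 500)) // 5
```

This only works because consecutive sequence numbers let you jump straight to the midpoint record. With byte offsets, the midpoint of the file is usually not the start of a record.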

Small bikeshed: it might be better for the sequence file to be

<offset int64>...

and then the data file to be

<varint length><data...>

That way the data file has enough information to rebuild the sequence file. That will be useful for crash recovery, because you won't need to worry about the case where the sequence file gets written but not the data file, or vice versa. Not sure if this is how Go does it, though.
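A sketch of what that rebuild could look like, under a few assumptions the thread does not settle: the varint is an unsigned LEB128-style encoding (as in protocol buffers), each sequence-file entry is the big-endian int64 offset of a record's length prefix, and a record cut short at the tail is dropped. All function names are hypothetical.

```javascript
// Unsigned LEB128-style varint (assumed encoding).
function encodeVarint(n) {
  const bytes = []
  while (n >= 0x80) {
    bytes.push((n % 0x80) | 0x80)
    n = Math.floor(n / 0x80)
  }
  bytes.push(n)
  return Buffer.from(bytes)
}

function decodeVarint(buf, at) {
  let value = 0
  let scale = 1
  for (let pos = at; pos < buf.length; pos++) {
    const byte = buf[pos]
    value += (byte & 0x7f) * scale
    if (byte < 0x80) return { value, next: pos + 1 }
    scale *= 0x80
  }
  return null // ran out of bytes before the varint ended
}

// Rebuild the sequence file by scanning the data file from the start.
function rebuildSequenceFile(dataFile) {
  const offsets = []
  let pos = 0
  while (pos < dataFile.length) {
    const header = decodeVarint(dataFile, pos)
    // A truncated prefix or body means a partial write at the tail: stop here.
    if (header === null || header.next + header.value > dataFile.length) break
    offsets.push(pos)
    pos = header.next + header.value
  }
  const seqFile = Buffer.alloc(offsets.length * 8)
  offsets.forEach((offset, i) => seqFile.writeBigInt64BE(BigInt(offset), i * 8))
  return seqFile
}

const records = [Buffer.from('a'), Buffer.alloc(200, 0x78), Buffer.from('xyz')]
const dataFile = Buffer.concat(records.flatMap((r) => [encodeVarint(r.length), r]))
const seqFile = rebuildSequenceFile(dataFile)
console.log(seqFile.readBigInt64BE(8))  // 2n
console.log(seqFile.readBigInt64BE(16)) // 204n
```

Because the sequence file is derived entirely from the data file, recovery can treat the data file as the source of truth and regenerate the sequence file whenever the two disagree.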

@dominictarr
Collaborator

Having binary search means secure-scuttlebutt could drop the time index (https://github.com/ssbc/ssb-db/blob/master/extras.js#L9-L11) but retain backwards compatibility.

@dominictarr
Collaborator

Also: if you use https://github.com/random-access-storage/ to do it, it will work in the browser (both Chrome and Firefox, without extra work). flumeview-aligned-offset uses https://github.com/dominictarr/polyraf so the right adapter is applied automatically.
