Anserini Dense Vector Format: binary encoding format for input vectors #1956

lintool · 2022-08-04T00:35:19Z · Maintainer

As we explore indexing dense vectors in Lucene, we'll need an efficient exchange format for storing the vectors. I'll call this advf, for "Anserini Dense Vector Format". The general idea is that we'll extract vectors out of Faiss and write them as advf, and Anserini will index files in this format.

Here's my initial proposal. advf will be a binary file (possibly compressed) comprising the following:

A four-byte int that serves as a type indicator for the vector values. To start, 1 = float32 and 2 = float16, but we can extend this mapping later.
Four-byte int that specifies the number of dimensions in the dense vector.

And then, repeated for every document:

Four-byte int specifying the number of bytes of the docid.
The actual docid, encoded as bytes.
The actual dense vector values, as determined by the number of dimensions above and the indicator.

Note that the format is designed without any explicit delimiters. Also, I have not included any magic SYNC tokens or encoded any redundant metadata for consistency checks. (Although both might be good ideas...)
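
To make the layout concrete, here is a minimal writer sketch in Python. It is not an agreed implementation: the proposal does not specify byte order or docid encoding, so little-endian and UTF-8 are assumptions, and the name `write_advf` is hypothetical.

```python
import io
import struct

# Indicator values taken from the proposal above.
FLOAT32 = 1
FLOAT16 = 2

# struct format code per indicator ("e" is IEEE 754 half precision).
_CODES = {FLOAT32: "f", FLOAT16: "e"}


def write_advf(stream, indicator, dims, docs):
    """Write (docid, vector) pairs to a binary stream in the proposed layout.

    Assumptions (not in the proposal): little-endian byte order, UTF-8 docids.
    """
    code = _CODES[indicator]
    # Header: indicator, then number of dimensions, each a four-byte int.
    stream.write(struct.pack("<ii", indicator, dims))
    for docid, vector in docs:
        if len(vector) != dims:
            raise ValueError(f"{docid}: expected {dims} values, got {len(vector)}")
        raw = docid.encode("utf-8")
        stream.write(struct.pack("<i", len(raw)))             # docid length in bytes
        stream.write(raw)                                     # docid bytes
        stream.write(struct.pack(f"<{dims}{code}", *vector))  # vector values


# Usage: two 3-dimensional float32 vectors written to an in-memory buffer.
buf = io.BytesIO()
write_advf(buf, FLOAT32, 3, [("doc1", [0.5, 1.0, -2.0]), ("doc2", [0.0, 0.25, 4.0])])
data = buf.getvalue()
```

With no delimiters, each record's size follows entirely from the header and the docid length, which is why a single corrupted length field would desynchronize everything after it.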

So, the reader loop will be something like this:

Read indicator... okay, set to float32.
Read number of dimensions... okay, we're reading 768-dimensional vectors.

Then, repeat until EOF:

Read length of docid.
Read the number of bytes indicated by that length into a buffer and decode as a String. That's the docid.
Read 768 float32's.
Repeat.
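
The reader loop above could be sketched in Python roughly as follows. Byte order (little-endian) and docid encoding (UTF-8) are assumptions, since the proposal leaves both open, and `read_advf` is a hypothetical name.

```python
import io
import struct

# indicator -> (struct format code, bytes per value); 1 = float32, 2 = float16
_CODES = {1: ("f", 4), 2: ("e", 2)}


def _read_exact(stream, n):
    """Read exactly n bytes or fail, so truncated files are detected."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"truncated file: wanted {n} bytes, got {len(data)}")
    return data


def read_advf(stream):
    """Yield (docid, vector) pairs from a stream in the proposed layout.

    Assumptions (not in the proposal): little-endian byte order, UTF-8 docids.
    """
    indicator, dims = struct.unpack("<ii", _read_exact(stream, 8))
    code, width = _CODES[indicator]
    while True:
        head = stream.read(4)
        if not head:  # clean EOF on a record boundary
            return
        if len(head) != 4:
            raise EOFError("truncated docid length")
        (length,) = struct.unpack("<i", head)
        docid = _read_exact(stream, length).decode("utf-8")
        vector = struct.unpack(f"<{dims}{code}", _read_exact(stream, dims * width))
        yield docid, vector


# Usage: a tiny hand-built file with float32 values, 2 dimensions, one document.
sample = (struct.pack("<ii", 1, 2) + struct.pack("<i", 2) + b"d1"
          + struct.pack("<2f", 0.5, -1.5))
records = list(read_advf(io.BytesIO(sample)))
```

The only consistency check available here is detecting a short read, which reflects the lack of SYNC tokens or redundant metadata in the proposal.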

Thoughts, comments?

lintool · 2024-03-21T12:11:27Z · Maintainer, Author

@MXueguang and I have been discussing this on Slack...

Why don't we just use the NPY format? https://numpy.org/devdocs/reference/generated/numpy.lib.format.html

E.g., here's a start: https://github.com/dreamolight/JavaNpy?tab=readme-ov-file


arjenpdevries · 2024-03-21T12:13:57Z


Another one to consider: safetensors.

It's quite close to what Jimmy proposed, but I'd say a de facto standard already because of its adoption by Huggingface.
And, they also provide nice tooling and examples, e.g. reading from a remotely stored file using http range requests.


lintool · 2024-03-21T12:19:16Z · Maintainer, Author

@arjenpdevries good call! Since we want Java/Python compatibility, we'll have to look into the feasibility.

Based on a quick skim: https://github.com/huggingface/safetensors

"A special key __metadata__ is allowed to contain free form string-to-string map. Arbitrary JSON is not allowed, all values must be strings."

As a bonus, we can stuff the docids into the header.
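
As a rough illustration of that idea, the sketch below hand-builds a safetensors-style file with only the Python standard library: an 8-byte little-endian header length, a JSON header, then the raw tensor bytes. The tensor name `vectors`, the metadata key `docids`, and both function names are hypothetical. Because metadata values must be strings, the docid list is stored as a JSON-encoded string. This has not been validated against the official safetensors library.

```python
import json
import struct


def build_safetensors(docids, vectors):
    """Pack float32 vectors into a safetensors-style blob, docids in the header."""
    dims = len(vectors[0])
    payload = b"".join(struct.pack(f"<{dims}f", *v) for v in vectors)
    header = {
        # Metadata values must be strings, so the list is JSON-encoded.
        "__metadata__": {"docids": json.dumps(docids)},
        "vectors": {
            "dtype": "F32",
            "shape": [len(vectors), dims],
            "data_offsets": [0, len(payload)],  # relative to the data section
        },
    }
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte unsigned little-endian header length, JSON header, tensor data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_docids(blob):
    """Recover the docid list by parsing only the header."""
    (n,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + n])
    return json.loads(header["__metadata__"]["docids"])


# Usage: two 2-dimensional vectors with their docids.
blob = build_safetensors(["d1", "d2"], [[0.5, 1.0], [2.0, -1.0]])
ids = read_docids(blob)
```

One trade-off of this layout is that the header grows with the number of docids, so every reader has to parse all of them before reaching any vector.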

1 reply

arjenpdevries Mar 21, 2024

Yes, although you could also consider viewing the docids as a separate aligned tensor and point to it in the metadata. Then people do not need to read it if they already have it, but it would still make the file self-contained.
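
One way to picture this alternative is sketched below: the docids become a byte tensor plus an offsets tensor, row-aligned with the vectors, and the metadata only names those tensors. All tensor names, metadata keys, and function names are hypothetical, the file is hand-built with the standard library following the safetensors layout (8-byte little-endian header length, JSON header, data), and it has not been validated against the official library.

```python
import json
import struct


def build_file(docids, vectors):
    """Store vectors and docids as separate tensors in one safetensors-style blob."""
    dims = len(vectors[0])
    encoded = [d.encode("utf-8") for d in docids]
    # offsets[i]:offsets[i + 1] delimits docid i inside the byte tensor.
    offsets = [0]
    for raw in encoded:
        offsets.append(offsets[-1] + len(raw))
    tensors = [
        ("vectors", "F32", [len(vectors), dims],
         b"".join(struct.pack(f"<{dims}f", *v) for v in vectors)),
        ("docid_bytes", "U8", [offsets[-1]], b"".join(encoded)),
        ("docid_offsets", "I64", [len(offsets)],
         struct.pack(f"<{len(offsets)}q", *offsets)),
    ]
    # The metadata points at the docid tensors instead of holding the docids.
    header = {"__metadata__": {"docid_bytes": "docid_bytes",
                               "docid_offsets": "docid_offsets"}}
    payload = b""
    for name, dtype, shape, raw in tensors:
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [len(payload), len(payload) + len(raw)]}
        payload += raw
    head = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + payload


def tensor_bytes(blob, name):
    """Return the raw bytes of one tensor, located via the header."""
    (n,) = struct.unpack("<Q", blob[:8])
    begin, end = json.loads(blob[8:8 + n])[name]["data_offsets"]
    return blob[8 + n + begin:8 + n + end]


def read_docids(blob):
    """Decode docids only when the caller actually needs them."""
    (n,) = struct.unpack("<Q", blob[:8])
    meta = json.loads(blob[8:8 + n])["__metadata__"]
    raw = tensor_bytes(blob, meta["docid_bytes"])
    off = tensor_bytes(blob, meta["docid_offsets"])
    offsets = struct.unpack(f"<{len(off) // 8}q", off)
    return [raw[a:b].decode("utf-8") for a, b in zip(offsets, offsets[1:])]


# Usage: docids of different lengths alongside two 2-dimensional vectors.
blob = build_file(["d1", "doc-22"], [[0.5, 1.0], [2.0, -1.0]])
ids = read_docids(blob)
```

A reader that already has the docids can call `tensor_bytes(blob, "vectors")` and never touch the docid tensors, while the file stays self-contained.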
