Store compressed vectors in dense ByteSequence for PQVectors #370
Conversation
@jbellis @jkni @marianotepper - this is ready for review. The other alternative, where we use a ByteBuffer instead of creating an ArraySliceByteSequence, is described under Rejected alternatives below.
```diff
-        return Objects.hash(pq, compressedVectors);
+        // We don't use the array structure in the hash code calculation because we allow for different chunking
+        // strategies. Instead, we use the first entry in the first chunk to provide a stable hash code.
+        return Objects.hash(pq, count(), count() > 0 ? get(0).get(0) : 0);
```
If you're worried about the performance of hashing millions of vectors, I'd prefer adding the first codepoint from each vector over using all of the first vector's codepoints.
I ended up just hashing based on every value, since that is the norm. If it becomes a performance bottleneck, we can address it then; I doubt we compute the hash code frequently.
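A minimal sketch of what the chunking-independent, hash-everything approach could look like (count() and get() follow the diff above; treat the rest as an assumption, not the PR's exact code):

```java
@Override
public int hashCode() {
    // Hash every compressed code point so two PQVectors holding the same
    // data hash identically regardless of how the bytes are chunked.
    int result = Objects.hash(pq, count());
    for (int i = 0; i < count(); i++) {
        var vector = get(i);                     // compressed codes for vector i
        for (int j = 0; j < vector.length(); j++) {
            result = 31 * result + vector.get(j);
        }
    }
    return result;
}
```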
```java
// TODO how do we want to determine equality? With the current change, we are willing to write one
// thing and materialize another. It seems like the real concern should be whether the compressedVectors have
// the same data, not whether they are in a MemorySegment or a byte[], and not whether there is one chunk of many
// vectors or many chunks of one vector. This technically goes against the implementation of each of the
// ByteSequence classes.
```
I'm fine with changing the ByteSequence equals implementations to match. You get the class comparison by default when IntelliJ generates it for you, and I think we just ran with that.
I implemented this, but it's not as elegant as I wanted. I could have made the ByteSequence interface an abstract class and overridden equals/hashCode, but that seemed like it went too far, so I added default methods that implementations of the interface can call. In practice, I doubt these objects are compared for strict equality frequently.
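Since a Java interface can't override equals/hashCode with default methods, the pattern described here presumably looks something like the sketch below. The helper names equalTo and getHashCode are my own placeholders, and the get/length signatures are assumptions about the ByteSequence interface:

```java
public interface ByteSequence<T> {
    int length();
    byte get(int i);

    // Default helper that implementations can delegate their equals() to.
    // Compares contents only, ignoring whether the backing storage is a
    // byte[], a slice, or a MemorySegment.
    default boolean equalTo(Object o) {
        if (this == o) return true;
        if (!(o instanceof ByteSequence)) return false;
        ByteSequence<?> that = (ByteSequence<?>) o;
        if (this.length() != that.length()) return false;
        for (int i = 0; i < length(); i++) {
            if (this.get(i) != that.get(i)) return false;
        }
        return true;
    }

    // Default helper that implementations can delegate their hashCode() to.
    default int getHashCode() {
        int result = 1;
        for (int i = 0; i < length(); i++) {
            result = 31 * result + get(i);
        }
        return result;
    }
}
```

An implementation then just forwards, e.g. `public boolean equals(Object o) { return equalTo(o); }`, which keeps value semantics consistent across backing types without forcing an abstract base class.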
I like where this is going, I'm just concerned that it doesn't break compatibility enough. :) That is, the PQVectors API is hard to use "correctly" now, since the only method that gives you One Big Array is load(). (Technically the constructor used by load is public, but everyone else just calls it with an array of one vector per chunk.)

I think we should push the changes into VectorCompressor to solve this. Instead of introducing a new class to abstract long[] vs ByteSequence, encodeAll should return CompressedVectors directly. (By far the most common way encodeAll is used is to immediately wrap the result in CV anyway.) I think there's a code path or two where you don't need the full CV, but it's not a big deal to construct one for those cases.

CompressedVectors is not generic, so VC wouldn't have to be either, which makes me happy.
I'm 80% sure that it's correct to go ahead and change the API and release 3.1 instead of doing unnatural contortions to maintain compatibility.
Also, I do not insist that BQ change to the chunking approach. If we can conform to the "right" API without doing that, it's fine to leave optimizing BQ until somebody needs it.
@jbellis - how would this work in the …
Yes, I think an add() or a set() method would be straightforward since CG knows up front how many it's going to generate. This is a better design than leaking the underlying List to be mutated, but that was the path of least resistance at one point!
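A rough sketch of the set()-by-ordinal shape being suggested (chunks, vectorsPerChunk, and compressedDimension are illustrative names for the chunked layout, and ByteSequence is assumed to expose a set(int, byte); none of this is code from the PR):

```java
// Writes one encoded vector into its slot in the dense chunked storage.
// The caller knows the total count up front, so chunks are preallocated
// and filled by ordinal rather than appended to a leaked mutable List.
public void set(int ordinal, ByteSequence<?> encoded) {
    var chunk = chunks[ordinal / vectorsPerChunk];
    int base = (ordinal % vectorsPerChunk) * compressedDimension;
    for (int i = 0; i < compressedDimension; i++) {
        chunk.set(base + i, encoded.get(i));
    }
}
```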
Is it possible to unify encode/set to …
I had started to look at that, but stopped because of our logic to …

Getting the change written up now.
Note: I'll update the PR description tomorrow to align with recent commits. |
LGTM. Do you mind updating UPGRADING and README?
```diff
-    public long[][] encodeAll(RandomAccessVectorValues ravv, ForkJoinPool simdExecutor) {
-        return simdExecutor.submit(() -> IntStream.range(0, ravv.size())
+    public CompressedVectors encodeAll(RandomAccessVectorValues ravv, ForkJoinPool simdExecutor) {
+        var cv = simdExecutor.submit(() -> IntStream.range(0, ravv.size())
             .parallel()
```
Out of curiosity: does it really help to use "parallel" here?
Massively.
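For context, the reshaped method plausibly ends up looking something like this sketch. Only the signature change and the submit/parallel structure come from the diff above; createCompressedVectors, set, encode, and getVector are assumed names:

```java
public CompressedVectors encodeAll(RandomAccessVectorValues ravv, ForkJoinPool simdExecutor) {
    var cv = createCompressedVectors(ravv.size());  // hypothetical: preallocate dense chunks
    simdExecutor.submit(() -> IntStream.range(0, ravv.size())
                    .parallel()                      // fan encoding out across the SIMD pool
                    .forEach(i -> cv.set(i, encode(ravv.getVector(i)))))
                .join();
    return cv;                                       // caller gets CompressedVectors directly
}
```

This is what makes the "encodeAll should return CompressedVectors directly" suggestion pay off: the common wrap-the-result-in-CV step disappears from every call site.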
LGTM, merged. Thoughts for if/when we revisit this: …
The current design stores the compressed vectors using a list of ByteSequence objects where each object represents a single vector. That costs us an overhead of 16 bytes for the ByteSequence wrapper class and 16 more bytes for the array in the ArrayByteSequence. In the case of a 128-dimension vector, we compress PQ down to 64 bytes, so the cost of the wrapper/array is 32 bytes out of the total 96 bytes, which is 33% of the total size. In cases where we have many of these vectors in memory, an extra 33% is quite large.

This PR proposes removing the per-vector wrapper and opts for a denser representation:

- Each ByteSequence holds as many vectors as it possibly can.
- Added a ByteSequence#slice method to allow for getting a slice of the whole ByteSequence. This method is intended to be lightweight. Added ArraySliceByteSequence to store the offset and length information to simplify element retrieval. (See the read-path sketch after this description.)
- Added a ByteSequence#offset method to be used in conjunction with the get method to ensure that the Panama and SimdOps methods do not need to perform an array copy when passing the array to Panama. This is a bit awkward for get on a MemorySegmentByteSequence since we're technically not passing the offset. FWIW, we could consider updating the code to take a long offset and a long length since those MemorySegments can exceed int length.

We also make some minor breaking changes to the VectorCompressor interface to allow implementations to construct the CompressedVectors object directly. This change simplifies most usages of the encodeAll method.

This PR is a more invasive alternative to #369, which only removed one of the two wrappers.
Rejected alternatives

Before implementing the PR this way, I tried a few alternatives.

1. Storing the compressedVectors field as a ByteSequence[] instead of a constant ByteSequence. Further, a MemorySegment would have allowed for a simpler API because of its support in the Panama API. However, this solution proved challenging due to compilation requirements and concerns about using a preview API (MemorySegment) in Java 20.
2. Using a ByteBuffer instead of creating an ArraySliceByteSequence. This is plausible, but really just comes down to API design preference. It felt a bit clunky to add it in. The main advantage is that if we add this, we can avoid the offset method on the ByteSequence interface, which might be convenient. One way we could add it is by making PQVectors know how to create a ByteBuffer wrapping a slice of the compressedVectors data.
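For contrast, the ByteBuffer route in point 2 might have looked roughly like the sketch below, since ByteBuffer.wrap already carries the offset/length bookkeeping that ArraySliceByteSequence implements by hand. This is a hypothetical helper, not code from the PR:

```java
import java.nio.ByteBuffer;

class DenseChunkViews {
    // Hypothetical: expose one compressed vector as a ByteBuffer view over
    // the dense chunk's backing array, with no copying.
    static ByteBuffer vectorView(byte[] chunkData, int ordinal, int compressedDimension) {
        return ByteBuffer.wrap(chunkData,
                        ordinal * compressedDimension,  // offset of this vector
                        compressedDimension)            // length of one vector
                .slice();                               // rebase so position 0 is the vector's first byte
    }
}
```

The trade-off noted above still holds: this avoids adding offset to the ByteSequence interface, but threads a second view type (ByteBuffer) through APIs that otherwise speak ByteSequence.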