interest in vector abstractions? #26
Comments
Quite possibly. :) But I'm trying to figure out...
How would we use them, in practice? They're currently defined on map or array types, but most of the transformations in Ibis-ML as-is assume data is separated by column. As a practical example, I'm currently POCing PCA. For the transform bit (fit is out of scope, handled by scikit-learn or something), you need matrix multiplication. Let's say we do it on the numeric columns of the penguins dataset, so if I have 3 components, I'll have a 4x3 ndarray to multiply against. There are (at least) two possibilities:
Obviously, your code opens up the possibility of multiplying unknown vectors, and vectors of different lengths, but we need to figure out where this would be used directly. I suppose support for normalization would be an obvious one. Also, I didn't touch on the use cases where sparse support would be nice. Do you have thoughts on where this would best be leveraged? Any use cases you're facing that it would help address?
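To make the PCA transform above concrete, here is a minimal pure-Python sketch of the same 4-feature by 3-component multiplication in two data layouts: one column per feature, and one vector per row. It is only an illustration under assumptions, not the code from this thread and not an Ibis or Ibis-ML API. `MEAN`, `COMPONENTS`, `transform_columnwise`, and `transform_arrays` are hypothetical names, and the numbers are made up.

```python
# Hypothetical fitted PCA parameters: 4 input features, 3 components.
# In practice these would come from a fit step (e.g. scikit-learn).
MEAN = [0.5, 1.0, -1.0, 2.0]
COMPONENTS = [  # shape 4 x 3: row j holds the weights of feature j
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
N_COMPONENTS = len(COMPONENTS[0])


def transform_columnwise(columns):
    """Column layout: 4 feature columns in, 3 component columns out.

    Each output column is a weighted sum of the centered input columns,
    which is the shape a per-column expression would take.
    """
    n_rows = len(columns[0])
    return [
        [
            sum((columns[j][i] - MEAN[j]) * COMPONENTS[j][k]
                for j in range(len(columns)))
            for i in range(n_rows)
        ]
        for k in range(N_COMPONENTS)
    ]


def transform_arrays(rows):
    """Array layout: one length-4 vector per row in, one length-3 vector out."""
    return [
        [
            sum((x - m) * w[k] for x, m, w in zip(row, MEAN, COMPONENTS))
            for k in range(N_COMPONENTS)
        ]
        for row in rows
    ]


rows = [[1.5, 2.0, 0.0, 2.0], [0.5, 1.0, -1.0, 3.0]]
columns = [list(col) for col in zip(*rows)]  # same data, transposed

by_column = transform_columnwise(columns)
by_array = transform_arrays(rows)
```

Both functions compute the same numbers; the difference is only whether the result comes back as three separate columns or as a single column of length-3 vectors.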
I think from a UX perspective I would prefer everything to be nicely packaged in a column of arrays. For 3 columns it feels like a toss-up. I don't have a whole lot of experience, but isn't it more common for PCA to go on the order of 1000s of dimensions -> 10s or 100s of dimensions? I would be annoyed with 50 columns in my table just from a preview perspective, especially because after PCA those columns are in some uninterpretable latent space so I'm not really going to want to inspect the values very often.
Accessing the end results as
My first thought is if there is a builtin
Could we just fall back to
In NickCrews/mismo@df57a37 I am working towards a TF-IDF transformer for text (e.g. to compare two documents based on their token overlap). That requires a sparse vector representation because the vocabulary of possible words/tokens/ngrams is huge. Followup if we want really optimized string support in ibisml: It's interesting that spacy optimizes this, and has an extra layer of indirection, so that {"dog": 3, "cat": 2} is actually stored as
Let's assume there are features a to z, and we have documents 0 to N. I'm not positive here, but my understanding is that for all columnar-store backends (which is what we are targeting here; IDK, do we want people to be doing analytics in postgres? It would be much better to encourage users to use duckdb to read the postgres and then do the calculations in duckdb), the memory layout for these different implementations would be
I'm guessing this difference in memory layout would be the main cause of any performance differences, but I'm not sure. This would be worth asking a duckdb engineer about, and then benchmarking to confirm.
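As a rough illustration of why a sparse representation suits the TF-IDF use case mentioned above, the sketch below stores each document as a token-to-weight mapping and compares documents by cosine similarity. It is plain Python under assumptions, not the mismo implementation and not an Ibis API. `dot`, `norm`, and `cosine_similarity` are hypothetical helper names.

```python
import math


def dot(a, b):
    """Dot product of two sparse vectors stored as {token: weight} dicts."""
    # Iterate over the smaller dict; tokens missing from either side contribute 0.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[token] for token, weight in a.items() if token in b)


def norm(a):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(weight * weight for weight in a.values()))


def cosine_similarity(a, b):
    """Cosine similarity, defined here as 0.0 when either vector is empty or all-zero."""
    denom = norm(a) * norm(b)
    return dot(a, b) / denom if denom else 0.0


doc_a = {"dog": 3, "cat": 2}
doc_b = {"dog": 1, "fish": 4}

overlap = dot(doc_a, doc_b)  # only "dog" is shared
similarity = cosine_similarity(doc_a, doc_b)
```

Only tokens that actually occur are stored, so the cost of `dot` scales with the number of non-zero entries rather than with the size of the vocabulary.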
My ideas for conclusions:
Fair point. I think for visualization you may use a small number of components, but more generally for ML you'd use a high number (until it stops yielding value).
FYI the reason I'm not 100% sure, is because
IMO this is something that could be particularly interesting to eventually add in Ibis-ML! :)
I think these conclusions make sense! Let me also see if I can scrounge up a couple more opinions.
@tonysun9 would love to get some feedback from you since we previously discussed some NLP ideas.
Looks like a good start! Putting down some quick thoughts. The
I'm used to seeing the
I agree, I think I picked a bad name. My function above should be called
No idea how to implement this in Ibis. Matrices sort of want there to be some symmetry between rows and columns. At best I see matrices as being awkward to represent, and at worst (and I see this as likely) being very inefficient to implement in SQL. IDK, maybe this is where Arrow UDFs come to the rescue? Matrices are represented as
I have these existing vector abstractions in ibis:
I also have tests. Are you interested in these abstractions? Should we try to use them more throughout this lib? I can submit a PR if so.