Hey @rclement! Sorry for the delay, but here's a continuation of the discussions from rclement/datasette-ml#3
Supporting queries in sqlite-loadable-rs
I created a new issue in sqlite-loadable-rs to track adding querying support to that library. That way we can run queries like CREATE TABLE / INSERT INTO from inside the extension itself, which I know is a blocker for this work. I'm not sure I'll have an ETA soon, but once that's in, it should unblock us here to make a proper loadable extension.
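Until that lands, here's a rough sketch of the mechanism such a query API would presumably wrap: the raw sqlite3_exec call from SQLite's C API, via the libsqlite3-sys crate. The exec helper below is hypothetical, not part of sqlite-loadable-rs:

use std::ffi::{CStr, CString};
use std::os::raw::{c_char, c_void};

use libsqlite3_sys::{sqlite3, sqlite3_exec, sqlite3_free, SQLITE_OK};

// Hypothetical helper: run a statement (e.g. CREATE TABLE or INSERT INTO)
// on the connection that loaded the extension. A query API in
// sqlite-loadable-rs could wrap something like this.
unsafe fn exec(db: *mut sqlite3, sql: &str) -> Result<(), String> {
    let sql = CString::new(sql).map_err(|e| e.to_string())?;
    let mut errmsg: *mut c_char = std::ptr::null_mut();
    let rc = sqlite3_exec(db, sql.as_ptr(), None, std::ptr::null_mut(), &mut errmsg);
    if rc == SQLITE_OK {
        Ok(())
    } else if errmsg.is_null() {
        Err(format!("sqlite error code {rc}"))
    } else {
        // copy SQLite's error message out, then free its allocation
        let msg = CStr::from_ptr(errmsg).to_string_lossy().into_owned();
        sqlite3_free(errmsg as *mut c_void);
        Err(msg)
    }
}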
In Pure Rust, no Python?
The native-ext branch uses PyO3 for ML algorithms, which will be great for getting started, but bundling Python in an extension can be tricky. I have the sqlite-python project that lets you define loadable SQLite extensions in Python, which could be useful here, but it comes with problems:
It'll assume the user already has a pre-configured Python environment with the right packages installed
It can be unstable when switching between different Python versions
Python <-> Rust isn't very fun in general
There's the linfa project that could help us move to a pure-Rust extension. It's the most complete scikit-learn-like Rust crate I could find, so we could use its algorithms in sqlite-ml to remove the Python dependency.
I played around with it a bit and it seems pretty advanced; most of the models seem to support serializing to a byte array (so we can persist a trained model across connections). It may not have 100% of the algorithms that scikit-learn has, but probably enough for this?
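For instance, here's a loose sketch of that train/serialize/restore round trip, assuming linfa-logistic with its serde feature enabled plus the bincode crate; the toy data is made up:

use linfa::prelude::*;
use linfa_logistic::{FittedLogisticRegression, LogisticRegression};
use ndarray::{array, Array1, Array2};

fn main() {
    // toy dataset: two features, binary target
    let records: Array2<f64> = array![[5.1, 3.5], [6.2, 2.9], [4.7, 3.2], [6.9, 3.1]];
    let targets: Array1<usize> = array![0, 1, 0, 1];
    let dataset = Dataset::new(records, targets);

    // train once...
    let model = LogisticRegression::default()
        .max_iterations(100)
        .fit(&dataset)
        .expect("training converges");

    // ...serialize to a byte array we could stash in a shadow table...
    let blob: Vec<u8> = bincode::serialize(&model).expect("serializable model");

    // ...then restore it on a later connection and predict
    let restored: FittedLogisticRegression<f64, usize> =
        bincode::deserialize(&blob).expect("round trip");
    println!("{:?}", restored.predict(&dataset));
}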
Defining the SQL API
I've been thinking about a few different ways to express sqlite-ml operations in pure SQL, using eponymous virtual tables, table functions, and regular scalar functions. Here are some of my thoughts, though they're definitely not complete:
ml_experiments and friends
I think all these tables can be shadow tables that are read-only to users, since I don't think users will ever need to insert/update rows in these directly:
ml_experiments
ml_runs
ml_models
ml_metrics
ml_deployments
ml_train
ml_train can be an eponymous virtual table that users can INSERT into to create new experiments/models:
insert into ml_train
values (
  'Iris prediction',     -- name of experiment
  'classification',      -- prediction type
  'logistic_regression', -- algorithm
  'ml_datasets.iris',    -- source data: a table/view name, with optional schema
  'target'               -- target column
);
ml_predict
A table function that takes in a JSON array of values and predicts the target column:
select
  iris.*,
  prediction.prediction
from ml_datasets.iris as iris
join ml_predict(
  'Iris prediction',
  json_array(
    iris.sepal_length,
    iris.sepal_width,
    iris.petal_length,
    iris.petal_width
  )
) as prediction;
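On the implementation side, the JSON argument could be decoded into a feature row for a fitted linfa model with something like this sketch, assuming the serde_json and ndarray crates; features_from_json is a made-up helper name:

use ndarray::Array2;

// Hypothetical helper: decode the json_array(...) argument into a single
// feature row that a fitted model can predict on.
fn features_from_json(arg: &str) -> Result<Array2<f64>, serde_json::Error> {
    let values: Vec<f64> = serde_json::from_str(arg)?;
    let n = values.len();
    // one row, n feature columns
    Ok(Array2::from_shape_vec((1, n), values).expect("shape matches length"))
}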
ml_load_dataset
If we wanted to just inline those default datasets into the extension, we could have eponymous virtual tables for each one like so:
select * from ml_datasets_iris;
select * from ml_datasets_breast_cancer;
select * from ml_datasets_diabetes;
Or if we don't want to bloat the size of the extension, we could offer a separate pre-built database file that people can attach themselves:
-- here ml_datasets_path() can return the path of the pre-built SQLite database, or create one if it doesn't exist
attach database ml_datasets_path() as ml_datasets;

select * from ml_datasets.iris;
select * from ml_datasets.breast_cancer;
select * from ml_datasets.diabetes;
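On the Rust side, ml_datasets_path() could be backed by something like this sketch; the cache location and the bundle-vs-download choice are pure assumptions, and wiring it up as a scalar function is left out:

use std::path::PathBuf;

// Hypothetical backing logic for ml_datasets_path(): materialize the
// pre-built database on first use, then keep returning the same path.
fn ml_datasets_path() -> std::io::Result<PathBuf> {
    let dir = std::env::temp_dir().join("sqlite-ml");
    std::fs::create_dir_all(&dir)?;
    let path = dir.join("ml_datasets.sqlite");
    if !path.exists() {
        // these bytes could be downloaded instead, if bundling them
        // would bloat the extension; the file name is made up
        let bytes: &[u8] = include_bytes!("ml_datasets.sqlite");
        std::fs::write(&path, bytes)?;
    }
    Ok(path)
}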
Again, some very loose thoughts and notes, feel free to ask about anything!
I do agree that the pure-Rust route is the way to go:
Fiddling with PyO3 is definitely not fun past simple examples, and it requires a complete Python environment with the proper dependencies outside the native binary, which is cumbersome (PostgresML does that, and its documentation is not very verbose about it).
I quickly checked out Linfa a few weeks ago and I also think this library might be a good pure-Rust alternative to scikit-learn. The only immediate things not provided by Linfa are the built-in datasets to get started, but those could be directly embedded in CSV format within the native extension (see the sketch after this list).
In the future, there are even the Rust TensorFlow bindings to enable deep learning integration.
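Following up on the CSV point above, a rough sketch of the embedding idea, assuming the csv and ndarray crates and a hypothetical bundled datasets/iris.csv with four numeric feature columns and an integer label column (linfa::Dataset::new can then wrap the two arrays):

use ndarray::{Array1, Array2};

// Hypothetical: the CSV is compiled into the extension binary via
// include_str!, so no extra files need to ship alongside it.
fn iris_arrays() -> (Array2<f64>, Array1<usize>) {
    let raw = include_str!("datasets/iris.csv");
    let mut features: Vec<f64> = Vec::new();
    let mut labels: Vec<usize> = Vec::new();
    let mut rows = 0;
    let mut reader = csv::ReaderBuilder::new()
        .has_headers(true)
        .from_reader(raw.as_bytes());
    for record in reader.records() {
        let record = record.expect("valid CSV row");
        for field in record.iter().take(4) {
            features.push(field.parse().expect("numeric feature"));
        }
        labels.push(record[4].parse().expect("integer label"));
        rows += 1;
    }
    let records = Array2::from_shape_vec((rows, 4), features).expect("rectangular data");
    (records, Array1::from(labels))
}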
With all this, the pure-Rust route seems easily achievable in the near future!