As a loadable extension (with Rust) #1

asg017 · 2023-05-01T22:49:01Z

Hey @rclement ! Sorry for the delay, but here's a continuation of the discussions from rclement/datasette-ml#3

Supporting queries in `sqlite-loadable-rs`

I created a new issue in sqlite-loadable-rs to track adding querying support to that library. That way we can make queries like CREATE TABLE / INSERT INTO in the extension itself, which I know is a blocker for this work. Not sure if I'll have an ETA soon, but once that's in, it should unblock us here to make a proper loadable extension

In Pure Rust, no Python?

The native-ext branch uses PyO3 for ML algorithms, which will be great to get started, but bundling Python in an extension can be tricky. I have the sqlite-python project that lets you define loadable SQLite extensions with Python, which can be useful here, but can come with problems:

It'll assume the user already has a pre-configured Python environment with the right packages installed
It can be unstable when switching between different Python version
Python <-> Rust isn't very fun in general

There's the linfa project that could help us move to a pure-Rust extension. It's the more complete scikit-like Rust crate I can find, so we can use those algorithms in sqlite-ml to remove the Python dependency.

I played around with it a bit and it seems pretty advanced, most of the models seem to support serializing to a byte array (so we can persist a trained model across connections). It may not have 100% of the algorithms that scikit has, but probably enough for this?

Defining the SQL API

I've been thinking about a few different ways to express sqlite-ml operations in pure SQL, using eponymous virtual tables, table functions, and regular scalar functions. Here are some of my thoughts, but definitely not complete:

`ml_experiments` and friends

I think all these tables can be shadow tables that are read-only to users, since I dont think users will ever need to insert/update rows in these directly:

ml_experiments
ml_runs
ml_models
ml_metrics
ml_deployments

`ml_train`

ml_train can be an eponymous virtual table that users can INSERT into, to create new experiments/models.

insert into ml_train 
    values (
      'Iris prediction',  -- name of experiemnt
      'classification',  -- prediction type
      'logistic_regression',  -- algorithm
      'ml_datasets.iris',  -- source data. can be a table/view name, optional schema
      'target' -- target column
    );

`ml_predict`

A table function that takes in a JSON array of values and predicts the target column:

select 
  iris.*, 
  prediction.prediction
from ml_datasets.iris as iris
from ml_predict(
  'Iris prediction', 
  json_array( 
    iris.sepal_length, 
    iris.sepal_width, 
    iris.petal_length, 
    iris.petal_width
  )
) from prediction;

`ml_load_dataset`

If we wanted to just inline those default datasets into the extension, we could have eponymous virtual tables for each one like so:

select * from ml_datasets_iris;
select * from ml_datasets_breast_cancer;
select * from ml_datasets_diabetes;

Or if we don't want to bloat the size of the extension, we could offer a separate pre-built database file that people can attach themselves:

-- here ml_datasets_path() can return the path of the pre-built SQLite database, or create one if it doesnt exust
attach database ml_datasets_path() as ml_datasets;

select * from ml_datasets.iris;
select * from ml_datasets.breast_cancer;
select * from ml_datasets.diabetes;

Again, some very loose thoughts and notes, feel free to ask about anything!

The text was updated successfully, but these errors were encountered:

rclement · 2023-05-02T09:46:19Z

Thanks @asg017 for all your thoughts and inputs!

Pure-Rust native extension

I do agree that going a pure-Rust route should be the way to go:

Fiddling with PyO3 is definitely not fun past simple examples, and it requires having a complete Python environment with proper dependencies outside the native binary which is cumbersome (PostgresML does that and is not very verbose about it within the documentation).
I've quickly checked out Linfa a few weeks ago and I also think this library might be a good pure-Rust alternative to Scikit-Learn. The only immediate things not provided by Linfa are the built-in datasets to get started but those can be direcly embedded in CSV format within the native extension.
In the future, there is even the Rust Tensorflow library to enable Deep Learning integration.

With all these, the pure-Rust route seems to be easily achievable in the near future!

rclement mentioned this issue May 2, 2023

As a SQLite extension rclement/datasette-ml#3

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

As a loadable extension (with Rust) #1

As a loadable extension (with Rust) #1

asg017 commented May 1, 2023

rclement commented May 2, 2023

As a loadable extension (with Rust) #1

As a loadable extension (with Rust) #1

Comments

asg017 commented May 1, 2023

Supporting queries in sqlite-loadable-rs

In Pure Rust, no Python?

Defining the SQL API

ml_experiments and friends

ml_train

ml_predict

ml_load_dataset

rclement commented May 2, 2023

Pure-Rust native extension

Supporting queries in `sqlite-loadable-rs`

`ml_experiments` and friends

`ml_train`

`ml_predict`

`ml_load_dataset`