As a loadable extension (with Rust) #1

Open
asg017 opened this issue May 1, 2023 · 1 comment
asg017 (Collaborator) commented May 1, 2023

Hey @rclement! Sorry for the delay, but here's a continuation of the discussion from rclement/datasette-ml#3.

Supporting queries in sqlite-loadable-rs

I created a new issue in sqlite-loadable-rs to track adding querying support to that library. That way we can run queries like CREATE TABLE / INSERT INTO from inside the extension itself, which I know is a blocker for this work. I'm not sure I'll have an ETA soon, but once that's in, it should unblock us here to make a proper loadable extension.

In Pure Rust, no Python?

The native-ext branch uses PyO3 for ML algorithms, which will be great to get started, but bundling Python in an extension can be tricky. I have the sqlite-python project that lets you define loadable SQLite extensions with Python, which can be useful here, but can come with problems:

  • It'll assume the user already has a pre-configured Python environment with the right packages installed
  • It can be unstable when switching between different Python versions
  • Python <-> Rust isn't very fun in general

There's the linfa project that could help us move to a pure-Rust extension. It's the most complete scikit-learn-like Rust crate I could find, so we can use its algorithms in sqlite-ml to remove the Python dependency.

I played around with it a bit and it seems pretty advanced: most of the models seem to support serializing to a byte array (so we can persist a trained model across connections). It may not have 100% of the algorithms that scikit-learn has, but probably enough for this?
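To make the "persist a trained model across connections" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module purely as a prototyping tool. It is not linfa's serialization format: the "model" is just a list of weights packed by hand, and the ml_models column layout (name, weights) and the helper names are assumptions for illustration.

```python
import sqlite3
import struct
import tempfile
from pathlib import Path


def serialize_weights(weights):
    """Pack floats as bytes: a little-endian uint32 count, then float64 values."""
    return struct.pack(f"<I{len(weights)}d", len(weights), *weights)


def deserialize_weights(blob):
    """Inverse of serialize_weights."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}d", blob, 4))


def save_model(db_path, name, weights):
    # Hypothetical table layout; the real ml_models schema is not defined yet.
    conn = sqlite3.connect(db_path)
    with conn:  # commits on success
        conn.execute(
            "create table if not exists ml_models (name text primary key, weights blob)"
        )
        conn.execute(
            "insert or replace into ml_models values (?, ?)",
            (name, serialize_weights(weights)),
        )
    conn.close()


def load_model(db_path, name):
    # A brand-new connection: the model survives because it lives in the file.
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "select weights from ml_models where name = ?", (name,)
    ).fetchone()
    conn.close()
    return deserialize_weights(row[0]) if row else None


db_path = str(Path(tempfile.mkdtemp()) / "models.db")
save_model(db_path, "Iris prediction", [0.5, -1.25, 3.0])
restored = load_model(db_path, "Iris prediction")
```

In the extension itself the blob would presumably come from whatever serialization the chosen linfa model supports; the point here is only that a BLOB column is enough to carry a model from one connection to the next.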

Defining the SQL API

I've been thinking about a few different ways to express sqlite-ml operations in pure SQL, using eponymous virtual tables, table functions, and regular scalar functions. Here are some of my thoughts, but definitely not complete:

ml_experiments and friends

I think all these tables can be shadow tables that are read-only to users, since I don't think users will ever need to insert/update rows in these directly:

  • ml_experiments
  • ml_runs
  • ml_models
  • ml_metrics
  • ml_deployments

ml_train

ml_train can be an eponymous virtual table that users can INSERT into, to create new experiments/models.

insert into ml_train 
    values (
      'Iris prediction',  -- name of experiment
      'classification',  -- prediction type
      'logistic_regression',  -- algorithm
      'ml_datasets.iris',  -- source data. can be a table/view name, optional schema
      'target' -- target column
    );

ml_predict

A table function that takes in a JSON array of values and predicts the target column:

select
  iris.*,
  prediction.prediction
from ml_datasets.iris as iris
join ml_predict(
  'Iris prediction',
  json_array(
    iris.sepal_length,
    iris.sepal_width,
    iris.petal_length,
    iris.petal_width
  )
) as prediction;
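The call shape can be tried out today without the extension. The sketch below uses Python's built-in sqlite3 module, which can only register scalar functions, so ml_predict is a scalar stand-in rather than a table function. The toy_predict rule, the in-memory iris table and its two sample rows are all assumptions for illustration, and json_array requires a SQLite build that includes the JSON functions.

```python
import json
import sqlite3


def toy_predict(experiment, features_json):
    """Hypothetical stand-in for a trained model: a single petal-length threshold."""
    features = json.loads(features_json)
    petal_length = features[2]
    return "setosa" if petal_length < 2.5 else "other"


conn = sqlite3.connect(":memory:")
# Register the stand-in under the proposed name, taking (experiment, json features).
conn.create_function("ml_predict", 2, toy_predict)

conn.execute(
    "create table iris ("
    "sepal_length real, sepal_width real, petal_length real, petal_width real)"
)
conn.executemany(
    "insert into iris values (?, ?, ?, ?)",
    [(5.1, 3.5, 1.4, 0.2), (6.3, 3.3, 6.0, 2.5)],  # illustrative sample rows
)

rows = conn.execute(
    """
    select
      iris.petal_length,
      ml_predict(
        'Iris prediction',
        json_array(
          iris.sepal_length,
          iris.sepal_width,
          iris.petal_length,
          iris.petal_width
        )
      ) as prediction
    from iris
    order by iris.petal_length
    """
).fetchall()
```

A real table function would additionally let the prediction expose more than one column (for example a probability), which is the main reason to prefer it over a scalar function.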

ml_load_dataset

If we wanted to just inline those default datasets into the extension, we could have eponymous virtual tables for each one like so:

select * from ml_datasets_iris;
select * from ml_datasets_breast_cancer;
select * from ml_datasets_diabetes;

Or if we don't want to bloat the size of the extension, we could offer a separate pre-built database file that people can attach themselves:

-- here ml_datasets_path() can return the path of the pre-built SQLite database, or create one if it doesn't exist
attach database ml_datasets_path() as ml_datasets;

select * from ml_datasets.iris;
select * from ml_datasets.breast_cancer;
select * from ml_datasets.diabetes;
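The attach-based variant can be prototyped with Python's built-in sqlite3 module. In this sketch ml_datasets_path is an ordinary Python helper rather than a SQL function, it takes a directory argument, and the iris schema and two sample rows are assumptions for illustration; the path is passed to ATTACH as a bound parameter.

```python
import sqlite3
import tempfile
from pathlib import Path

# Illustrative rows only; a real build would ship the full dataset.
SAMPLE_IRIS_ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
]


def ml_datasets_path(directory):
    """Return the datasets database path, creating the file if it doesn't exist."""
    path = Path(directory) / "ml_datasets.db"
    if not path.exists():
        conn = sqlite3.connect(str(path))
        with conn:  # commits on success
            conn.execute(
                "create table iris ("
                "sepal_length real, sepal_width real, "
                "petal_length real, petal_width real, target text)"
            )
            conn.executemany(
                "insert into iris values (?, ?, ?, ?, ?)", SAMPLE_IRIS_ROWS
            )
        conn.close()
    return str(path)


datasets_dir = tempfile.mkdtemp()
conn = sqlite3.connect(":memory:")
# Attach the pre-built file under the schema name used in the queries above.
conn.execute("attach database ? as ml_datasets", (ml_datasets_path(datasets_dir),))
rows = conn.execute(
    "select target, petal_length from ml_datasets.iris order by petal_length"
).fetchall()
```

Keeping the datasets in a separate file means the extension binary stays small, and users who never touch the sample data pay nothing for it.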

Again, some very loose thoughts and notes, feel free to ask about anything!

rclement (Owner) commented May 2, 2023

Thanks @asg017 for all your thoughts and inputs!

Pure-Rust native extension

I agree that the pure-Rust route is the way to go:

  • Fiddling with PyO3 is definitely not fun past simple examples, and it requires a complete Python environment with the proper dependencies outside the native binary, which is cumbersome (PostgresML does that and is not very explicit about it in its documentation).

  • I quickly checked out Linfa a few weeks ago and I also think this library might be a good pure-Rust alternative to Scikit-Learn. The only things Linfa doesn't immediately provide are the built-in datasets to get started, but those can be directly embedded in CSV format within the native extension.

  • In the future, there is even the Rust TensorFlow library to enable Deep Learning integration.

With all these, the pure-Rust route seems to be easily achievable in the near future!
