
ODSC West - CFP Closes 05/20/24 #29

Open
ncclementi opened this issue Mar 29, 2024 · 7 comments

@ncclementi (Contributor)

https://odsc.com/california/call-for-speakers-west/

Talk Session Formats

Proposals will be considered for the following types of presentations:

Format for Technical Sessions

- Talk (30 minutes)
- Hands-on Workshop (2 hrs)
- Tutorial (60 min, hands-off)
- Lightning Talks (10 min)

Format for Business Sessions

- Talk (30 minutes)
- Case Studies (30 minutes)
- Hands-on workshop (60 minutes)
- Startup Talk (30 minutes)
@jitingxu1 commented May 20, 2024

Update 1, based on comments:

  • renamed ibis-ML to IbisML
  • removed Theseus
  • added talk outline

Title: IbisML: Efficiently Streamlining and Unifying ML Data Processing from Development to Production

Description

Machine learning projects require transforming raw data into prepared samples using a combination of feature engineering pipelines and online last-mile processing, integrated with model training workflows. Data scientists and engineers collaborate to prototype, develop, scale, and deploy both batch and streaming jobs. These processes present several challenges:

  • Development to Production: Ensuring smooth transitions and consistency between development and production environments.
  • Data Scale: Managing the shift from small, local datasets to large, distributed data environments.
  • Batch and Streaming: Handling the complexities of both batch and streaming data processing.
  • Multilingual Frameworks: Coordinating multiple languages and frameworks, which can slow down the process and reduce interactivity.

To address these challenges, IbisML harnesses the power of Ibis, offering a library designed to streamline and unify data preprocessing and feature engineering workflows across diverse environments and data scales. Its unified codebase eliminates the need to rewrite logic when transitioning from local development to large-scale distributed production, and from batch to streaming, through the following key features:

  • Versatile Backends: Seamlessly integrating with over 20 backends, including DuckDB, Dask, Polars, Pandas, BigQuery, PySpark, and Flink, ensuring swift, effective, and adaptable machine learning data processing.
  • Large Dataset Processing: Optimizing large-dataset processing for speed and efficiency, enabling rapid ML data preparation across different backends.
  • Seamless Integration: Supporting pipeline-like steps that integrate with scikit-learn pipelines, and delivering rich output formats such as pandas DataFrames, PyTorch datasets, and XGBoost DMatrix, aligning seamlessly with scikit-learn, XGBoost, and PyTorch models.

The talk will explore the gaps in existing projects, emphasizing how IbisML tackles these challenges and enables seamless transitions between development and deployment, in both offline and online scenarios.
At the end of this talk, we'll use IbisML to build machine learning models end to end: from data engineering, through last-mile preprocessing with IbisML Recipes across various backends, to feeding the resulting data into downstream training libraries and frameworks such as scikit-learn, XGBoost, and PyTorch.
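The fit-on-training, transform-anywhere pattern behind these preprocessing recipes can be illustrated with a toy sketch. Note that `Recipe` and `ScaleStandard` below are hypothetical stand-ins written for illustration only, not the real IbisML API:

```python
# Toy sketch of the fit/transform "recipe" pattern: statistics are learned
# once from training data, then reapplied identically at inference time.
# NOT the real IbisML API; class names here are hypothetical.

class ScaleStandard:
    """Standardize a numeric column: (x - mean) / std, fit on training data."""
    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [r[self.column] for r in rows]
        self.mean = sum(values) / len(values)
        var = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = var ** 0.5 or 1.0  # guard against zero variance
        return self

    def transform(self, rows):
        return [{**r, self.column: (r[self.column] - self.mean) / self.std}
                for r in rows]

class Recipe:
    """Chain preprocessing steps; fit once on training data, reuse everywhere."""
    def __init__(self, *steps):
        self.steps = steps

    def fit(self, rows):
        for step in self.steps:
            step.fit(rows)
            rows = step.transform(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows

train = [{"x": 1.0}, {"x": 3.0}]
recipe = Recipe(ScaleStandard("x")).fit(train)
print(recipe.transform([{"x": 2.0}]))  # mean=2.0, std=1.0 -> [{'x': 0.0}]
```

The same fitted recipe object can then be handed to a serving path, which is the consistency-between-environments property the proposal emphasizes.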

Notes

@jitingxu1 jitingxu1 moved this from CFP to Preparing for submission in Ibis talks and tutorials May 20, 2024
@jitingxu1 jitingxu1 self-assigned this May 20, 2024
@deepyaman

@jitingxu1 Some quick notes:

  • Can you change Ibis-ML to IbisML in all places?
  • I don't think Theseus is necessary here. Let's focus on the open source project for ODSC (and in most submissions).
  • What does multilingual frameworks mean? This point isn't clear to me personally.

I don't 100% know whether we need to talk about streaming in this talk, but I feel like there may be differing views.

At a higher level (probably more important): what will the talk cover? I think this is a description of the project and what it can do, but not of the talk. There needs to be a clearer discussion of the gap with current projects.

@jitingxu1 commented May 20, 2024

What does multilingual frameworks mean? This point isn't clear to me personally.

For example, using Spark for batch features, Flink for streaming features, and pandas, scikit-learn, or PyTorch for last-mile preprocessing.

I don't 100% know whether we need to talk about streaming in this talk, but I feel like there may be differing views.

Streaming support is a strength of IbisML worth highlighting; this capability might distinguish it from other options.

@zhenzhongxu

Great write-up! I wonder if it makes sense to talk about the benefits of moving away from sampling toward training on the full dataset, which just works with IbisML. I've heard of a few cases where large organizations want to train models on the full data instead of relying on sampling.

@jitingxu1 jitingxu1 moved this from Preparing for submission to Submitted in Ibis talks and tutorials May 21, 2024
@jitingxu1

Here is the submitted version. Thanks @ncclementi, @deepyaman, and @chip for the review.

Title: Building ML pipelines that run anywhere with IbisML

Abstract

From inception to production, the ML lifecycle is a lengthy process involving multiple people, programming languages, and computational frameworks. In a traditional workflow, data scientists develop models and experiment with different features locally, using tools like pandas and scikit-learn on a small, often subsampled, dataset. However, as the need arises to scale up to larger datasets and production environments, engineers face the challenge of rewriting and testing these processes in distributed computing systems like Apache Spark or Dask. While such frameworks have their own ML libraries (of varying flavors and maturities) and technically allow users to run on both single machines and clusters, scaling ML pipelines this way is costly, resource-intensive, and inefficient.

IbisML is an open-source Python library designed for building and running scalable ML pipelines from experiment to production. It's built on top of Ibis, an open-source library that provides a familiar dataframe API for building up expressions that can be executed on a wide array of backends. Users can rely on tools like DuckDB and Polars for efficient local computation, then scale to distributed engines such as Spark, BigQuery, and Snowflake. With IbisML, users can preprocess data at scale across development and deployment, compose transformations with other scikit-learn estimators, and integrate seamlessly with scikit-learn, XGBoost, and PyTorch models without rewriting code.
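The deferred-expression idea described above, i.e. build an expression once, then execute it on interchangeable backends, can be sketched with a toy example. The names `Filter`, `PythonBackend`, and `SQLBackend` below are hypothetical illustrations of the pattern, not the real Ibis API:

```python
# Toy sketch of deferred expressions: the expression is a plain data object,
# and each "backend" decides how to run it. NOT the real Ibis API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Filter:
    column: str
    threshold: float

class PythonBackend:
    """Executes the expression with a plain Python loop (local development)."""
    def execute(self, expr, rows):
        return [r for r in rows if r[expr.column] > expr.threshold]

class SQLBackend:
    """Compiles the same expression to SQL (what a warehouse backend would run)."""
    def compile(self, expr, table):
        return f"SELECT * FROM {table} WHERE {expr.column} > {expr.threshold}"

expr = Filter(column="amount", threshold=100.0)          # defined once
local = PythonBackend().execute(expr, [{"amount": 50.0}, {"amount": 150.0}])
sql = SQLBackend().compile(expr, "orders")               # reused, unchanged
print(local)  # [{'amount': 150.0}]
print(sql)    # SELECT * FROM orders WHERE amount > 100.0
```

The key design point is that the expression carries no execution logic of its own, which is what lets the same pipeline definition move from a local engine to a distributed one without rewrites.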

In this talk, we will introduce IbisML and the utilities it provides to streamline ML pipeline development. We will demonstrate its functionalities on a simple, real-world problem, including the ability to train and fit estimators on different backends. Finally, we will showcase how you can efficiently hand off to the modeling framework of your choice.

@cpcloud cpcloud moved this from backlog to cooking in Ibis planning and roadmap May 29, 2024
@jitingxu1 jitingxu1 moved this from cooking to review in Ibis planning and roadmap Jun 3, 2024
@jitingxu1

No response from the conference

@ncclementi (Contributor, Author)

I don't know how the conference handles this, but according to their website they should have notified speakers in the first round on July 12th. Maybe you can email them and ask; if not, wait for the second round of notifications.
