
PyData NYC 2024 -- CFP Closes 2024/09/03 #40

Open
gforsyth opened this issue Jul 31, 2024 · 5 comments

@gforsyth

https://pydata.org/nyc2024/call-for-proposals

gforsyth commented Aug 28, 2024

Proposal title:

Ibis: Don't Let the Engine Dictate the Interface

Abstract:

Tabular data is ubiquitous, and pandas has been the de facto tool in Python for
analyzing it. However, as data size scales, analysis using pandas may become
untenable. Luckily, modern analytical databases (like DuckDB) can analyze this
same tabular data orders of magnitude faster than pandas, all while using less
memory. However, many of these systems only provide a SQL interface, which is
far removed from pandas’ dataframe interface and requires a rewrite of your
analysis code. This talk will lay out the current
database / data landscape as it relates to the PyData stack, and explore how
Ibis (an open-source, pure Python, dataframe interface library) can help
decouple interfaces from engines, to improve both performance and portability.
We'll examine other solutions for interacting with SQL from Python and discuss
some of their strengths and weaknesses.

Description:

Ibis is an open-source, pure Python library that lets you write Python to build up expressions
that can be executed on a wide array of backends / execution engines (SQLite,
DuckDB, Postgres, Spark, Clickhouse, Snowflake, BigQuery, and more!).
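
A rough sketch of what that looks like in practice (the file and column names below are placeholders, not from the talk):

```python
import ibis

# Connect to DuckDB (Ibis's default backend); the same expression could run
# against Postgres, Snowflake, BigQuery, etc. by swapping the connection.
con = ibis.duckdb.connect()

# "events.parquet" and its columns are hypothetical placeholders.
events = con.read_parquet("events.parquet")

# Build up an expression lazily; nothing executes yet.
summary = (
    events.filter(events.status == "complete")
    .group_by("user_id")
    .aggregate(total=events.amount.sum())
)

# Execution happens in DuckDB; only the result comes back as a DataFrame.
df = summary.to_pandas()
```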

Modern analytical databases (like DuckDB) are able to analyze tabular data
orders-of-magnitude faster than pandas, all while using less memory.

pandas and other Python libraries can interact with databases, but they were
not designed to do so efficiently. Pulling ALL of the data to your local
machine to then perform a reduction or aggregation is only tractable for very
small problems.
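
To make that concrete, here's a hedged comparison (table, column, and connection details are made up): pandas pulls the whole table over the wire before reducing it locally, while the equivalent Ibis expression is compiled to SQL so only the aggregated result leaves the database.

```python
import pandas as pd
import ibis

# pandas: the entire (hypothetical) "orders" table is downloaded first,
# then aggregated locally -- fine for small tables, painful for large ones.
df = pd.read_sql("SELECT * FROM orders", "postgresql://user:pass@host/db")
totals_local = df.groupby("region")["amount"].sum()

# Ibis: the aggregation is pushed into the database; only the per-region
# totals travel back to the client.
con = ibis.postgres.connect(host="host", user="user", password="pass", database="db")
orders = con.table("orders")
totals_pushed = (
    orders.group_by("region")
    .aggregate(total=orders.amount.sum())
    .to_pandas()
)
```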

Treating a remote database as a data store isn’t wrong, but it provides an
incomplete view of everything these systems can offer.

These systems are very, very fast: 50 years of database research hasn't gone to
waste, and modern execution engines perform all kinds of optimizations to
deliver results quickly.

In a cruel twist of fate, though, almost all of them require you to write SQL in
order to use them. SQL is not an ideal tool for exploratory data analysis.
If you know exactly how to answer the question in front of you, then SQL is
probably (possibly) fine. But you don't always know how - figuring that out is
part of the exploration.

SQL is only a language – it’s an interface. The execution engine is a separate
thing. Historically the interface and the engine have been very tightly
coupled, but they don’t have to be.

Maybe you would like to use the DuckDB execution engine, but you don’t like the interface (SQL)?

Or you would like to use the Spark execution engine, but you don’t like the interface (PySpark API)?

The interface shouldn’t be a hurdle for a user to clear in order to make use of
the available tools. In the scientific Python community, SQL in particular is
a hurdle that has turned many users away. Ibis provides a consistent,
Pythonic, and intuitive interface to interact with execution engines, even when
their only “advertised” interface is SQL.
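
To illustrate that decoupling, the same (hypothetical) expression can be inspected as the SQL it compiles to for different engines, without connecting to any of them:

```python
import ibis

# An unbound table: just a schema and a name, no backend attached.
penguins = ibis.table(
    {"species": "string", "bill_length_mm": "float64"}, name="penguins"
)

expr = (
    penguins.group_by("species")
    .aggregate(avg_bill=penguins.bill_length_mm.mean())
    .order_by("species")
)

# One expression, many dialects -- the engine is chosen separately.
print(ibis.to_sql(expr, dialect="duckdb"))
print(ibis.to_sql(expr, dialect="postgres"))
```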

gforsyth added the Talk label Aug 28, 2024

cpcloud commented Aug 28, 2024

+1-ing this, since I've seen the talk and it's amazing.

ncclementi commented Sep 3, 2024

I submitted a talk too; I adapted the geospatial one a bit.

Title:

Ibis, DuckDB, and GeoParquet: Making Geospatial Analytics Fast, Simple, and Pythonic

Abstract:

Geospatial data is becoming increasingly integral to data workflows, and Python offers a wide array of tools to handle it. A powerful new option has recently emerged: DuckDB, which now supports geospatial analytics with its new extension. DuckDB has taken the data world by storm (~23k stars on GitHub) and is making waves in geospatial data too. Plus, with the increasing developments and adoption of GeoParquet, storing and exchanging geospatial data has never been easier. But what if you prefer writing Python code over SQL? That’s where Ibis comes in. Ibis is a Python library that provides a dataframe-like interface, allowing you to write Python code to construct SQL expressions that can be executed on various backends, including DuckDB.

In this talk, I’ll demonstrate how to leverage the power of DuckDB’s spatial capabilities while staying within the Python ecosystem—yes, there will be a live demo! (Pssst... I’ll show you how to work with GeoParquet data from Overture Maps, create nice plots that won’t kill your laptop, and avoid SQL.) This is an introductory talk; everyone is welcome, and no prior experience with spatial databases or geospatial workflows is needed.

Description:

Ibis is an open-source Python library that provides a dataframe-like API, enabling you to write Python code to build expressions that can be executed across multiple backends such as DuckDB, PostgreSQL, BigQuery, and more. Some of these backends offer support for geospatial operations that can be executed via Ibis without the need to write any SQL. In this talk, we aim to showcase our default backend: DuckDB.

Over the past year, DuckDB has introduced support for over 100 geospatial operations, many of which are now accessible via Ibis. This allows you to experiment with these operations while remaining in Python land. If you have experience working with spatial databases, you are likely familiar with PostGIS, a library that extends PostgreSQL's capabilities to handle geospatial data. The DuckDB spatial extension provides a healthy subset of PostGIS-like options, but getting started is much simpler. No server-side setup, user configuration, or client configuration. DuckDB seamlessly integrates into existing GIS workflows, regardless of data formats or projections. Recently, DuckDB has also added support for GeoParquet. GeoParquet extends the powerful Apache Parquet columnar data format to the geospatial domain, making it easier to work with geospatial data in a high-performance, columnar format.

With Ibis, performing your first spatial operations becomes even easier and, most importantly, it’s Python! During this talk, we will introduce Ibis and demonstrate its geospatial functionality through an example, with DuckDB as the backend and a GeoParquet data source. We will also explore compatibility with other Python libraries such as GeoPandas and lonboard for plotting purposes. By the end of the talk, you’ll know how to get started with Ibis and work with spatial data using DuckDB as the backend engine.
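
A rough sketch of the kind of workflow the talk will demo (the file path and column names are placeholders, and the exact loading call and geometry methods are assumptions about the current Ibis geospatial API):

```python
import ibis

# DuckDB is Ibis's default backend; reading geospatial data loads the
# spatial extension under the hood.
con = ibis.duckdb.connect()

# Hypothetical GeoParquet file (e.g. an Overture Maps extract) with a
# geometry column named "geometry".
buildings = con.read_geo("buildings.parquet")

# Compute areas and centroids without writing any SQL; the work runs
# inside DuckDB's spatial extension.
largest = (
    buildings.mutate(
        area=buildings.geometry.area(),
        centroid=buildings.geometry.centroid(),
    )
    .order_by(ibis.desc("area"))
    .limit(10)
    .to_pandas()
)
```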

@deepyaman

Proposal title

Building machine learning pipelines that scale: a case study using Ibis and IbisML

Session type

Tutorial (90 minutes)

Abstract

Libraries like Ibis have been gaining traction recently, by unifying the way we work with data across multiple data platforms—from dataframe APIs to databases, from dev to prod. What if we could extend the abstraction to machine learning workflows (broadly, sequences of steps that implement fit and transform methods)? In this tutorial, we will develop an end-to-end machine learning project to predict the live win probability at any given move during a chess game.

Description

As Python has become the lingua franca of data science, pandas and scikit-learn have cemented their roles in the standard machine learning toolkit. However, when data volumes rise, this stack becomes unwieldy (requiring proportionately-larger compute, subsampling to reduce data size, or both) or altogether untenable.

Luckily, modern analytical databases (like DuckDB) and dataframe libraries (such as Polars) can crunch this same tabular data orders of magnitude faster than pandas, all while using less memory. Ibis already provides a unified dataframe API that lets users leverage a plethora of popular databases and analytics tools (BigQuery, Snowflake, Spark, DuckDB, etc.) without rewriting their data engineering code. At scale, however, the performance bottleneck shifts to the ML pipeline.

IbisML extends the intrinsic benefits of using Ibis to the ML workflow. It lets you bring your ML to the database (or other Ibis-supported backend), and supports efficient integration with modeling frameworks like XGBoost, PyTorch, and scikit-learn. On top of that, IbisML steps can be used as estimators within the familiar context of scikit-learn pipelines.
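
A hedged sketch of how that composition might look (the dataset, the columns, and the specific IbisML step names are assumptions for illustration):

```python
import ibis
import ibis_ml as ml
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

con = ibis.duckdb.connect()
games = con.read_parquet("lichess_games.parquet")  # hypothetical extract

# Preprocessing as an IbisML Recipe: each step compiles to backend-native
# operations instead of materializing the data in local pandas.
recipe = ml.Recipe(
    ml.ImputeMean(ml.numeric()),
    ml.ScaleStandard(ml.numeric()),
)

# The Recipe behaves like a scikit-learn transformer, so it slots into a
# Pipeline alongside any estimator.
pipe = Pipeline([("prep", recipe), ("model", LogisticRegression())])

X = games.drop("white_won")        # hypothetical feature columns
y = games.white_won.to_pandas()    # hypothetical target
pipe.fit(X, y)
```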

In this tutorial, we'll cover:

  • Ibis basics, and how we can apply them to explore the Lichess chess games database and create meaningful features.
  • IbisML constructs, including Steps and Recipes, and how we can combine them to process features before passing them to our live win probability model.
  • Data handoff for model training and inference, completing our end-to-end ML workflow.

This is a hands-on tutorial, and you will train a simple (not great!) live win probability model on a provided dataset. You'll also see how the result can be run at scale on a distributed backend. Participants should ideally have some experience using Python dataframe libraries; scikit-learn or other modeling framework familiarity is helpful but not required.

@ncclementi

My talk, "Ibis, DuckDB, and GeoParquet: Making Geospatial Analytics Fast, Simple, and Pythonic", got accepted. Leaving this here for tracking purposes.
