Implement Arrow extension type-based Python UDF #13

Open
jpolchlo opened this issue Dec 3, 2019 · 3 comments

jpolchlo commented Dec 3, 2019

At the moment, we cannot use @pandas_udf when the UDF takes or returns TileUDT columns, because Spark's ArrowUtils does not support Arrow extension types.

The proposal here is to circumvent this omission by reimplementing the same UDF pathway in RF, but with proper support for Arrow extension types. RasterFrames has already shown that we can define new objects in the org.apache.spark package namespace to get around package-private definitions, so we can use the same method to provide a new implementation.

In many ways, this will be a cut-and-paste operation: importing and renaming classes from Spark, and providing an @arrow_udf decorator that cribs directly from pandas_udf and redirects into our modified implementation.
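
To make the decorator idea concrete, here is a minimal sketch of what an @arrow_udf wrapper could look like. It only mirrors the keyword form of pandas_udf (f=None, returnType=None) and attaches metadata to the function; it does not touch Spark at all. The attribute names returnType and evalType on the wrapper, the ARROW_SCALAR_EVAL_TYPE tag, and the example function are all hypothetical, chosen for illustration and not taken from Spark or RasterFrames.

```python
import functools

# Hypothetical tag for the evaluation path. Spark uses its own integer
# eval types for pandas UDFs; this placeholder is not one of them.
ARROW_SCALAR_EVAL_TYPE = "ARROW_SCALAR"


def arrow_udf(f=None, returnType=None):
    """Sketch of a decorator shaped like pandas_udf's keyword form.

    Usable as @arrow_udf(returnType=...) or arrow_udf(func, returnType=...).
    """
    if f is None:
        # Called with keyword arguments only: return the real decorator.
        return functools.partial(arrow_udf, returnType=returnType)
    if returnType is None:
        raise ValueError("returnType is required")

    @functools.wraps(f)
    def wrapper(*columns):
        # A real implementation would hand off to the modified UDF
        # pathway here; the sketch just calls the function directly.
        return f(*columns)

    # Hypothetical metadata a custom pathway could read later.
    wrapper.returnType = returnType
    wrapper.evalType = ARROW_SCALAR_EVAL_TYPE
    return wrapper


@arrow_udf(returnType="tile")
def double_cells(cells):
    # Stand-in for per-tile work: double every cell value.
    return [c * 2 for c in cells]
```

The wrapper keeps the original function name via functools.wraps, which matters if the custom pathway needs to identify the UDF by name.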

The real work here will be to plumb in the Arrow types needed for the system to work. Of course, we need to reimplement ArrowUtils to include extension types, but we also need to make sure that we can properly interface with the extension type registry on both ends of the transaction. This is more worrisome in the Python context, where worker.py will need access to the type definition on the Python side, in a separate process on the executor nodes. Figuring this out is unlikely to be a gimme.

This work will also require that tiles have an extension type representation. This connects with issues #5 and #10.

jpolchlo commented Dec 5, 2019

One component of this work will be to add some extension type-related functionality to Arrow. I've created a branch at https://github.com/jpolchlo/arrow/tree/experiment/extension-types which contains these changes.

jpolchlo commented Dec 5, 2019

The Spark work (very much in progress) lives at https://github.com/jpolchlo/rasterframes/tree/experiment/python-raster-udf

jpolchlo added the Epic label Dec 5, 2019
jpolchlo self-assigned this Dec 5, 2019

jpolchlo commented Jan 8, 2020

I hit a bit of a snag on this. On the Python side, a wrapped function is apparently being stored in the expression tree. It's not clear what triggers this, but it gets in the way of our mechanism for replacing pandas_udf with a custom implementation. There is surely a workaround, but it requires a bit more investigation.
