Implement Arrow extension type-based Python UDF #13

jpolchlo · 2019-12-03T14:51:08Z

At the moment, we cannot use @pandas_udf when a UDF needs to operate on TileUDT columns, because Spark's ArrowUtils does not support Arrow extension types.

The proposal here is to circumvent this omission by reimplementing the same UDF pathway in RF, but with proper support for Arrow extension types. RasterFrames has already shown that we can define new objects in the org.apache.spark package namespace to get around package-private definitions, so we can use the same method to provide a new implementation.

In many ways, this will be a cut-and-paste operation, simply importing and renaming classes from Spark, and providing an @arrow_udf decorator that cribs directly from pandas_udf, and redirects into our modified implementation.
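As a rough sketch of the decorator side only: the snippet below mimics the two call forms that pandas_udf accepts (as a decorator taking a return type, or as a direct call on a function) using nothing but the standard library. The names `_create_arrow_udf`, `return_type`, `eval_type`, and the `"tile"` type string are assumptions made for illustration; the real implementation would hand off to the reimplemented Spark pathway rather than a plain wrapper.

```python
import functools


def _create_arrow_udf(func, return_type):
    """Hypothetical stand-in for the hook into the reimplemented UDF pathway."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    # Metadata a real implementation would pass on to the JVM side (assumed names).
    wrapper.return_type = return_type
    wrapper.eval_type = "ARROW_SCALAR_UDF"
    return wrapper


def arrow_udf(f=None, returnType=None):
    """Accept both arrow_udf(func, type) and @arrow_udf(type), like pandas_udf."""
    if callable(f):
        # Direct call: arrow_udf(func, returnType)
        return _create_arrow_udf(f, returnType)
    # Decorator form: the first positional argument is the return type.
    return_type = f if f is not None else returnType
    return functools.partial(_create_arrow_udf, return_type=return_type)


@arrow_udf("tile")
def add_one(values):
    return [v + 1 for v in values]


direct = arrow_udf(lambda values: values, "tile")
```

Both forms end up in the same factory function, which is the single place where the redirect into the custom implementation would happen.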

The real work here will be to plumb in the Arrow types needed for the system to work. Of course, we need to reimplement ArrowUtils to include extension types, but we also need to make sure that we can properly interface with the extension type registry on both ends of the transaction. This is more worrisome in the Python context, where worker.py will need access to the type definition on the Python side, in a separate process on the executor nodes. Figuring this out is unlikely to be a gimme.
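To make the registry concern concrete, here is a toy model in plain Python (no pyarrow). It relies on one fact about Arrow: an extension type travels as its storage type plus the field metadata keys `ARROW:extension:name` and `ARROW:extension:metadata`, and a receiver that has not registered the type sees only the storage type. Everything else (`TileType`, `ExtensionRegistry`, `resolve_field`, the name `rasterframes.tile`, the `"binary"` storage label) is hypothetical and only illustrates the shape of the problem.

```python
from dataclasses import dataclass

# Field metadata keys Arrow uses to carry extension type information.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"


@dataclass(frozen=True)
class TileType:
    """Stand-in for a tile extension type layered over a binary storage type."""
    cols: int
    rows: int
    name = "rasterframes.tile"  # hypothetical extension name (class attribute)
    storage = "binary"          # hypothetical storage type label

    def serialize(self) -> str:
        return f"{self.cols}x{self.rows}"

    @classmethod
    def deserialize(cls, payload: str) -> "TileType":
        cols, rows = payload.split("x")
        return cls(int(cols), int(rows))


class ExtensionRegistry:
    """Per-process registry; a freshly started worker begins with an empty one."""

    def __init__(self):
        self._types = {}

    def register(self, ext_cls):
        self._types[ext_cls.name] = ext_cls

    def resolve_field(self, storage, metadata):
        ext_cls = self._types.get(metadata.get(EXT_NAME_KEY))
        if ext_cls is None:
            # Unregistered: fall back to the storage type.
            return storage
        return ext_cls.deserialize(metadata[EXT_META_KEY])


tile = TileType(256, 256)
field_metadata = {EXT_NAME_KEY: tile.name, EXT_META_KEY: tile.serialize()}

driver = ExtensionRegistry()
driver.register(TileType)
worker = ExtensionRegistry()  # models the separate worker process

driver_view = driver.resolve_field(tile.storage, field_metadata)
worker_view = worker.resolve_field(tile.storage, field_metadata)
```

In this model the driver recovers the tile type while the worker sees only the storage type, which is why registration has to happen inside the worker process before any batches are read.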

This work will also require that tiles have an extension type representation. This connects with issues #5 and #10.


jpolchlo · 2019-12-05T16:08:36Z

One component of this work will be to add some extension type-related functionality to Arrow. I've created a branch at https://github.com/jpolchlo/arrow/tree/experiment/extension-types which contains these changes.

jpolchlo · 2019-12-05T16:26:03Z

The Spark work (very much in progress) lives at https://github.com/jpolchlo/rasterframes/tree/experiment/python-raster-udf

jpolchlo · 2020-01-08T17:56:44Z

I hit a bit of a snag on this. From Python, a wrapped function is being stored in the expression tree (apparently). It's not clear how this is triggered, but it gets in the way of our mechanism for replacing pandas_udf with a custom implementation. There surely is a workaround, but it requires a bit more investigation.

jpolchlo mentioned this issue Dec 3, 2019

Use Arrow extension types in UDF evaluation #14

Closed

jpolchlo added the Epic label Dec 5, 2019

jpolchlo self-assigned this Dec 5, 2019
