At the moment, we cannot use @pandas_udf functions if the UDF needs to work with TileUDT columns. This is because Spark's ArrowUtils does not support Arrow extension types.
The proposal here is to work around this omission by reimplementing the same UDF pathway in RF, but with proper support for Arrow extension types. RasterFrames has already shown that we can define new objects in the org.apache.spark package namespace to get around package-private definitions, so we can use the same method to provide a new implementation.
In many ways, this will be a cut-and-paste operation, simply importing and renaming classes from Spark, and providing an @arrow_udf decorator that cribs directly from pandas_udf, and redirects into our modified implementation.
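To make the decorator idea concrete, here is a minimal sketch of the Python-side shape only. It mirrors the two calling conventions of pandas_udf (used as a decorator with a return type, or called directly with a function). Everything in it is an assumption for illustration: `arrow_udf` does not exist yet, `_create_arrow_udf` is a hypothetical stand-in for the redirect into the modified implementation, and no Spark machinery is involved.

```python
import functools


def _create_arrow_udf(func, return_type):
    """Hypothetical hook: a real version would build the Spark-side UDF
    through the reimplemented pathway. Here it only tags the function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    wrapper.return_type = return_type
    return wrapper


def arrow_udf(f=None, returnType=None):
    """Sketch of a decorator following pandas_udf's calling conventions."""
    if callable(f):
        # Direct call: arrow_udf(fn, returnType=...)
        return _create_arrow_udf(f, returnType)
    # Decorator use: @arrow_udf("tile") or @arrow_udf(returnType="tile")
    return_type = f if f is not None else returnType
    return functools.partial(_create_arrow_udf, return_type=return_type)


@arrow_udf("tile")
def identity(tile):
    return tile
```

Keeping the signature close to pandas_udf means existing UDFs could be moved over by swapping the decorator name, assuming the underlying pathway behaves the same.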
The real work here will be to plumb in the Arrow types needed for the system to work. Of course, we need to reimplement ArrowUtils to include extension types, but we also need to make sure that we can properly interface with the extension type registry on both ends of the transaction. This is more worrisome in the Python context, where worker.py is going to need access to the type definition on the Python side, in a separate process on the executor nodes. Figuring this out is unlikely to be a gimme.
This work will also require that tiles have an extension type representation. This connects with issues #5 and #10.
I hit a bit of a snag on this. From Python, a wrapped function is being stored in the expression tree (apparently). It's not clear how this is triggered, but it gets in the way of our mechanism for replacing pandas_udf with a custom implementation. There surely is a workaround, but it requires a bit more investigation.