Use Arrow extension types in UDF evaluation #14
Conversation
Force-pushed from 6bbd7db to 3efcf34.
…UDT with extension types
…ual integration of Arrow extension types.
Current status: the infrastructure for custom execution of Python UDFs now exists, but testing is proving problematic. We need to register a set of custom extensions with Catalyst; these are provided by this class. On the Scala side, these extensions are registered using either the … On the positive side, I can load the required extensions in Scala using the conf parameter mechanism, but I have no reasonable means to test purely in Scala.

On the evolution of Catalyst nodes

There is a variety of node types in different tree representations, using classes in both languages, that warrants some explanation.
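As a sketch of the conf-based registration mentioned above, one way to assemble the relevant Spark setting on the Python side is shown below. `spark.sql.extensions` is the actual Spark SQL setting; the extension class name in the example is hypothetical, and joining multiple classes with commas assumes a Spark version that accepts a list there.

```python
def extension_confs(extension_classes):
    """Build the Spark conf entries that register Catalyst extensions.

    Sketch only: `spark.sql.extensions` is the real Spark SQL setting;
    the class names passed in are up to the caller.
    """
    return {"spark.sql.extensions": ",".join(extension_classes)}

# Hypothetical usage: pass these to SparkSession.builder.config(...).
confs = extension_confs(["org.example.rf.RasterFrameExtensions"])
```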
For now, this will delegate to the default …
For posterity, this test code snippet will fail due to a mismatch of Arrow versions between Spark's Scala implementation of the pandas UDF code and the Python Arrow version required for extension types (>= 0.15.0):

```python
from pyspark.sql.functions import PandasUDFType, col, pandas_udf
from pyspark import Row

ref_fn = pandas_udf(lambda v: v + 1, 'double', PandasUDFType.SCALAR)
R = Row('i')
df = spark.createDataFrame([R(i) for i in [1, 2, 3, 4]])
df_test = df.select(col("i"), ref_fn(col("i")))
df_test.show()
```

The following extension should work:

```python
from pyrasterframes.udf import raster_udf

test_fn = raster_udf(lambda v: v + 1, 'double', PandasUDFType.SCALAR)
raster_test = df.select(col("i"), test_fn(col("i")))
raster_test.show()
```

(For now, this fails because of the PySpark deficiency mentioned above.)
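Given the version mismatch described above, a small guard that fails fast with a clear message may help when experimenting. This is a minimal sketch, assuming the stated minimum of 0.15.0; it handles plain `major.minor.patch` strings such as `pyarrow.__version__` typically provides, but not pre-release suffixes.

```python
def arrow_version_ok(version_string, minimum=(0, 15, 0)):
    """Return True if a pyarrow version string (e.g. pyarrow.__version__)
    meets the minimum required for Arrow extension types."""
    parts = tuple(int(p) for p in version_string.split(".")[:3])
    # Pad short versions like "1.0" so tuple comparison is well defined.
    parts = parts + (0,) * (len(minimum) - len(parts))
    return parts >= minimum
```

For example, `arrow_version_ok("0.14.1")` is false because 0.14.x predates extension types.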
There are some deficiencies in my understanding of the transformation process used by … We're mostly out of time on this, but I stand by the idea that this is eminently possible. We can return to this task later, but for accounting and cleanliness purposes, this issue will be closed. With any luck, it will be reopened at a later date.
Pursuant to #13, I'm trying some experiments to gauge the feasibility of providing this functionality without needing to develop inside Spark proper.
The initial commit provides a `raster_udf` wrapper to supplant `pandas_udf`.

The procedure for working with this PR is as follows. Run `sbt pySparkCmd`. This will package up the relevant material and present a shell command that one can run to start pyspark; however, the result was faulty for me. See step 3. It's best to confine these machinations to a virtual environment.
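Since the wrapper currently just delegates to `pandas_udf`, its shape might look like the sketch below. This is illustrative only, not the actual pyrasterframes code: the factory name is invented, and the extension-type registration step is a placeholder comment.

```python
import functools

def make_raster_udf(pandas_udf_impl):
    """Build a raster_udf-style wrapper around a pandas_udf-like callable.

    Sketch only: a real version would register Arrow extension types
    before delegating; this one simply forwards its arguments.
    """
    @functools.wraps(pandas_udf_impl)
    def raster_udf(f, returnType, functionType):
        # Placeholder for extension-type registration, then delegate.
        return pandas_udf_impl(f, returnType, functionType)
    return raster_udf
```

In real use one would apply this to `pyspark.sql.functions.pandas_udf`; the delegation keeps the call signature identical, so existing `pandas_udf` call sites need only a name change.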
I'll use comments to update progress as I push up new functionality.