Use Arrow extension types in UDF evaluation #14

Closed

Conversation


@jpolchlo jpolchlo commented Dec 3, 2019

Pursuant to #13, I'm trying some experiments to gauge the feasibility of providing this functionality without needing to develop inside Spark proper.

The initial commit provides a raster_udf wrapper to supplant pandas_udf.

The procedure for working with this PR is as follows.

  1. Run sbt pySparkCmd. This will package up the relevant material and print a shell command that one can run to start pyspark; however, the printed command was faulty for me (see step 3).
  2. Install the development version of pyrasterframes:
pip install --upgrade /path/to/rasterframes/pyrasterframes/target/python/dist/pyrasterframes-0.8.4.dev0-py3-none-any.whl
  3. Modify the pyspark command line to:
PYSPARK_PYTHON=ipython PYTHONSTARTUP=<as supplied by SBT> pyspark \
--jars /path/to/rasterframes/pyrasterframes/target/scala-2.11/pyrasterframes-assembly-0.8.4-SNAPSHOT.jar \
--py-files /path/to/rasterframes/pyrasterframes/target/python/dist/pyrasterframes-0.8.4.dev0-py3-none-any.whl

It's best to confine these machinations to a virtual environment.

I'll use comments to update progress as I push up new functionality.

@jpolchlo (Author) commented:

Current status: The infrastructure for custom execution of Python UDFs now exists, but testing is proving problematic. We need to register a set of custom extensions with Catalyst; these are provided by this class.

On the Scala side, these extensions are registered using either the withExtensions method on SparkSession.Builder, or using the spark.sql.extensions conf parameter. The Python side of the story is sadly complicated. There is a PR which enhances the SparkSession constructor used by PySpark to utilize the spark.sql.extensions conf argument, but that PR has not been backported to the 2.3 or 2.4 lines. We'll need to compile a custom Spark build with this patch applied before we can test in PySpark.
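
Roughly, the two registration paths on the Scala side look like this. This is only a sketch: org.example.RasterUDFExtensions is a hypothetical stand-in for the extensions class linked above, assumed to extend Function1[SparkSessionExtensions, Unit] as Spark requires.

import org.apache.spark.sql.SparkSession

// Option 1: programmatic registration on the session builder.
val spark = SparkSession.builder()
  .master("local[*]")
  .withExtensions(new org.example.RasterUDFExtensions())
  .getOrCreate()

// Option 2: configuration-based registration -- the mechanism that the
// unreleased PySpark change would honor from the Python side.
//   SparkSession.builder().config("spark.sql.extensions", "org.example.RasterUDFExtensions")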

On the positive side, I can load the required extensions in Scala using the conf parameter mechanism, but have no reasonable means to test purely in Scala.

On the evolution of Catalyst nodes

There are several node types, spread across different tree representations and implemented in both languages, that warrant some explanation (a rough wiring sketch follows below).

  1. [PYTHON] UserDefinedRasterFunction (invoked from raster_udf) contains a py4j reference to [SCALA] UserDefinedRasterFunction. (All class mentions below are on the Scala side.)
  2. This generates a PythonRasterUDF Expression instance (a Spark SQL column function). Expression nodes comprise the Logical Plan.
  3. PythonRasterUDF is an Unevaluable node, so must be transformed via a logical plan optimization step provided by the ExtractRasterUDFs Rule which ultimately converts to...
  4. RasterEvalPython nodes, which are (I think?) tree nodes in the query planner, and which the RasterUDFStrategy converts to RasterEvalPythonExec (SparkPlan) nodes in the physical plan.
  5. RasterEvalPythonExec overrides EvalPythonExec's evaluate machinery to rely on ArrowPythonRunner, which leans on our borrowed implementations of ArrowUtils and ArrowConverters (which will eventually handle the extension type stuff).

For now, this will delegate to the default worker.py to do the actual processing of the Arrow batches in Python, but we will eventually need to reimplement that worker module as well (which will probably require us to intervene at this line of BasePythonRunner in order to delegate to our custom Python worker).
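
To make the wiring above concrete, here is a rough sketch (not the actual PR code) of how the ExtractRasterUDFs rule and RasterUDFStrategy from the list would presumably be injected through SparkSessionExtensions; the rule and strategy bodies are stubbed out:

import org.apache.spark.sql.{SparkSessionExtensions, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.SparkPlan

// Stand-ins for the rule and strategy described above; the real versions
// rewrite PythonRasterUDF expressions into RasterEvalPython nodes and plan
// them as RasterEvalPythonExec, respectively.
object ExtractRasterUDFs extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan // no-op stub
}

object RasterUDFStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil // no-op stub
}

// The extensions entry point registered via withExtensions or spark.sql.extensions.
class RasterUDFExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit = {
    ext.injectOptimizerRule(_ => ExtractRasterUDFs)   // logical-plan rewrite (step 3)
    ext.injectPlannerStrategy(_ => RasterUDFStrategy) // physical planning (step 4)
  }
}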

@jpolchlo (Author) commented:

For posterity, this test snippet will fail due to a mismatch between the Arrow version used by Spark's Scala implementation of the pandas UDF code and the Python Arrow version required for extension types (>= 0.15.0):

from pyspark.sql.functions import PandasUDFType, col, pandas_udf        
from pyspark import Row                                                 

ref_fn = pandas_udf(lambda v: v+1, 'double', PandasUDFType.SCALAR)      
R = Row('i')                                                            
df = spark.createDataFrame([R(i) for i in [1,2,3,4]])                   
df_test = df.select(col("i"), ref_fn(col("i")))                         
df_test.show()                                                          

The following, using the raster_udf extension, should work:

from pyrasterframes.udf import raster_udf                               

test_fn = raster_udf(lambda v: v+1, 'double', PandasUDFType.SCALAR)     
raster_test = df.select(col("i"), test_fn(col("i")))                   
raster_test.show()                                                     

(For now, this fails because of the PySpark deficiency mentioned above.)

@jpolchlo (Author) commented:

There are some deficiencies in my understanding of the transformation process used by pandas_udf. The primary surprise is that the Catalyst tree appears to be populated by an unexpected object: a Python lambda (contained in a functools wrapper). It seems that the Python UserDefinedFunction is not being added directly. At some later point, this function wrapper is rewritten into a Python UDF, and I have not been able to discover where that happens. Without being able to intercede on this pathway, we won't be able to finish this work.

We're mostly out of time on this, but I stand by the idea that this is eminently possible. We can return to this task later, but for accounting and cleanliness purposes, this issue will be closed. With any luck, it will be reopened at a later date.

@jpolchlo jpolchlo closed this Jan 14, 2020