-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for window_by TVF window semantics #1117
Comments
Hey @zhenzhongxu -- we can add a target for the window_by TVF stuff, but I don't know how to spell those style functions in Substrait -- do you have a reference for what the target should look like? We already support batch window functions. |
Also add @richtia @EpsilonPrime for guidance on the substrait side. |
I'm reasonably familiar with the flink-style TVF windowing, but I'm not clear on whether there's a way to express that in Substrait -- so yeah, if @EpsilonPrime or @richtia have any pointers. If there isn't currently a way to express that in Substrait, then the first step will be adding that to the upstream. Then it should be straightforward to target from here. |
|
Thanks, @EpsilonPrime -- I think that's a standard window function, which is a different beast than the table-valued functions used by some of the streaming engines. I don't know that I saw any chatter on the substrait issue tracker about anyone implementing this. We can probably desugar the |
The only other way I know of to do window related behavior is with the ConsistentPartitionWindowRel -- https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L272. |
Let me grab the Calcite plan for the window TVF so we can compare the standard window vs streaming window. Will circle back shortly. |
For a streaming Ibis declaration
will translate to Streaming Flink SQL backend as
Running through the Flink Planner (Calcite), shows below:
Please notice that
So if we go with option 1, can we provide a compilation switch (e.g., batch/streaming modes). Batch mode will generate standard window without watermark and TVF, and streaming mode will throw unimplemented exception for now (until we go for option 2)? Thoughts? |
I think option 1 would be pretty straightforward for Arroyo. Our streaming window functions (hop, tumble, etc.) are expressed in SQL as normal scalar functions: select count(*), tumble(interval '10 seconds') as window
from events
group by window; This gets rewritten into a logical plan with a separate window operator: If the window functions are preserved in the substrait plan as normal scalar functions we could go do that rewrite on the substrait plan itself. Longer term it would be better to properly integrate windows into the plan. I'm not sure how tied that is to the particular syntax (Flink-style TVFs or our special scalar functions). |
@EpsilonPrime can you help confirm what @mwylde mentioned about the window scalar function is similar to the standard window function already supported in Substrait. Maybe do we need an additional type field to indicate window type (hop/sliding, tumble, session)? Additionally, taking an OVER window example below (which is more popular in the batch world but also supported on streaming), @EpsilonPrime can you confirm this is relevant to the partitioned window you mentioned earlier?
My thinking if both (scalar function & OVER window) are already supported on Substrait then we can proceed to support both on the compiler? |
It seems like you'd be able to use the window function expression syntax I first linked to handle AVG OVER as that syntax just needs an operation with partition and bounds. As for hop, tumble, and session I suspect they might require alternative bound restrictions/behaviors. I'm still looking at how those work. |
This should already work in >>> import ibis
>>> from ibis import _
>>> from ibis_substrait.compiler.core import SubstraitCompiler
>>> sales = ibis.table({"date": "date", "product_id": "str", "sales_amount": "float"}
...: , 'sales')
>>> expr = sales.mutate(moving_avg_sales=_.sales_amount.mean().over(rows=(-2,0), grou
...: p_by=_.product_id, order_by=_.date))
>>> ibis.to_sql(expr)
SELECT
"t0"."date",
"t0"."product_id",
"t0"."sales_amount",
AVG("t0"."sales_amount") OVER (PARTITION BY "t0"."product_id" ORDER BY "t0"."date" ASC ROWS BETWEEN 2 preceding AND CURRENT ROW) AS "moving_avg_sales"
FROM "sales" AS "t0"
>>> compiler = SubstraitCompiler()
>>> substrait_plan = compiler.compile(expr)
>>> substrait_plan Substrait Plan Output
|
Thanks @gforsyth ! I will give this a try first, and we can come back to revisit option 2 when the timing is right. One more feedback:
and a batch OVER window:
which generates different compiled plans (actually the streaming one doesn't compile yet). The syntax is only different on minor details (mutate + time range vs row range). Are there opportunities for converging the two APIs to avoid user confusions. (I am putting myself in the users shoes), let me know if you want me to file a different issue on Ibis repo. |
The Substrait appears to be written with Datafusion's project bug/behavior in mind so it should work for us (and even better it should still work after the bug is fixed). The window function syntax matches what I tested against Datafusion too. |
this is better discussed upstream |
What happened?
I have a simple example in this notebook showing that the window semantics are not supported by the compiler yet. Is this something we can add relatively easily?
https://github.com/zhenzhongxu/composable-data-architecture/blob/3c5324466c4355cda98d2443fc1f01b011a5241b/ibis-substrait-example.ipynb
What version of ibis-substrait are you using?
v.4.0.1
What substrait consumer(s) are you using, if any?
N/A
Relevant log output
The text was updated successfully, but these errors were encountered: