Skip to content
This repository has been archived by the owner on Jul 15, 2024. It is now read-only.

adding starburst trino nyctaxi example for review #28

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

dhuynh
Copy link

@dhuynh dhuynh commented Nov 6, 2023

Adding starburst trino ibis notebook example for review with NYC taxi dataset

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Member

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this in @dhuynh!

I've left some comments inline (hopefully respecting notebook JSON 🤞 ).

There are a few points where I think what Ibis is doing can be clarified. A lot of the rest are style points -- we've been following certain conventions but we haven't documented them anywhere (very bad form from us). We will definitely put together a "style guide" for examples soon.

examples/starburst_trino_NYCTaxi.ipynb Outdated Show resolved Hide resolved
"outputs": [],
"source": [
"import ibis\n",
"import pandas as pd\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"import pandas as pd\n",

"id": "ad9927bc",
"metadata": {},
"source": [
"IMPORTANT!!!! Change your user, host, port, database, schema and roles to be relevant to your Starburst Galaxy setup. If you are using OAuth2, uncomment the keyword lines roles, and auth. Then comment PASSWORD to proceed. "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here it would be useful to link to our detailed instructions on figuring out which values to use: https://ibis-project.org/backends/trino#connecting-to-starburst-managed-trino-instances

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I would remove the choice about oauth2 or not -- it shouldn't be the focus of this example to handle multiple authentication flows and the above link also details how to connect.

examples/starburst_trino_NYCTaxi.ipynb Outdated Show resolved Hide resolved
"id": "e9bd2195-381a-4b20-aa12-f1ee07dafd6c",
"metadata": {},
"source": [
"Ibis tables in trino can be stored through the use of con.table. We're going to create two ibis tables from our tables below:"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is, I think, inadvertently a touch misleading. Using con.table isn't storing anything, we're grabbing a reference to a table that already exists in Trino, nothing is being added to the database, nor is the underlying data being changed in any way.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

examples/starburst_trino_NYCTaxi.ipynb Outdated Show resolved Hide resolved
examples/starburst_trino_NYCTaxi.ipynb Outdated Show resolved Hide resolved
examples/starburst_trino_NYCTaxi.ipynb Outdated Show resolved Hide resolved
Comment on lines +315 to +318
"nycjanduration_new = (\n",
" nycjanduration.filter(nycjanduration.trip_distance != 0.0)\n",
")\n",
"nycjanduration_new\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the nycjanduration expression isn't going to be used after this, I would suggest just using the same variable name. You avoid all the _new in the names, e.g.

nycjanduration = nycjanduration.filter(_.trip_distance > 0)

Comment on lines 459 to 472
"P.S This ibis code below at the time of writing uses a pre-release function \"bucket\". This is included with future version of IBIS. When bucket is released, you can uncomment the lines below for a temporal grouping. If you recieve an error \n",
"the function does not exist, you can uncomment the line below to try to upgrade to a pre-release version of ibis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1f8f9893-a369-4337-8c86-38045e75a461",
"metadata": {},
"outputs": [],
"source": [
"#python -m pip install -U 'ibis-framework[trino]' --pre"
]
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in getting 7.1 released -- I think we should hold off on publishing this until we don't require a pre-release feature.

@dhuynh
Copy link
Author

dhuynh commented Nov 15, 2023

@gforsyth tried to double check some of your changes and I got this error with regards to your NOT suggestion:


ValueError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 nyc_filtered = nycjantrips.filter((nycjantrips.passenger_count != 0) | (not nycjantrips.passenger_count.isnan()))
2 nyc_filtered

File ~/opt/anaconda3/lib/python3.9/site-packages/ibis/expr/types/core.py:132, in Expr.bool(self)
131 def bool(self) -> bool:
--> 132 raise ValueError("The truth value of an Ibis expression is not defined")

ValueError: The truth value of an Ibis expression is not defined

@gforsyth
Copy link
Member

Oops, sorry about that. It should probably be:

nyc_filtered = nycjantrips.filter((nycjantrips.passenger_count != 0) | ~(nycjantrips.passenger_count.isnan()))

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants