This repository has been archived by the owner on Jun 14, 2022. It is now read-only.

A lot of shared code with pandas backend? #3

Open
zbrookle opened this issue Sep 24, 2020 · 11 comments

Comments

@zbrookle

zbrookle commented Sep 24, 2020

So part of the reason I suggested adding dask originally (in ibis-project/ibis#2245) was because of the amount of code that they both share, in terms of the dataframe api. I'm a bit surprised to see this particular backend in a different repo when there's so much opportunity for using identical code to what's already been written for pandas

@jreback

jreback commented Sep 25, 2020

hmm this is a very good point

separating is meant for hard-to-test packages that might need a lot of CI resources and have different source communities

e.g. spark, omnisci

but dask is very lightweight and shares a lot with pandas

@datapythonista
Contributor

I think we've already got a working implementation of the dask backend. I agree that ideally we want to reuse as much code as possible. But I'd merge what we've got now in this repo, and then perform any refactoring until we reuse all the code that makes sense.

@zbrookle
Author

zbrookle commented Sep 25, 2020

@datapythonista I'm not saying to scrap all the code that was written, but I think in terms of refactoring, it's going to be a lot harder to refactor something when the code is in a separate repo. It actually looks like the only code that's even in this repo's pull request is the CI code

@datapythonista
Contributor

> @datapythonista I'm not saying to scrap all the code that was written, but I think in terms of refactoring, it's going to be a lot harder to refactor something when the code is in a separate repo. It actually looks like the only code that's even in this repo's pull request is the CI code

Thanks for the comment @zbrookle. I'm certainly open to moving the Dask code into the core of Ibis if it makes things easier. But I'm not sure it's really going to make things easier, or the opposite. I think changes will be much faster outside Ibis. Before we publish the backend, we don't need to get crazy with reviews or keep very high standards for the backend. And I also think that having two separate projects should make it clearer how the dask backend depends on the pandas one.

Having it in a separate repo also opens the door to starting a new dask backend from scratch, if that turned out to be a good option.

I'd say let's move forward creating the backend we've got in this repo, and if at some point having two separate repos makes things complicated, let's move the code to the Ibis core. Does that sound good?

@jreback

jreback commented Sep 28, 2020

i am not sure lack of review is a good thing

rethinking this a bit

dask and pandas backends likely share a lot of code

so these don't make sense to be separate at all

@datapythonista
Contributor

From past experience (Arrow and MSSQL backends), having a slow review process meant that they created the backends elsewhere. Now we have a Dask backend that AFAIK is working, but doesn't reuse as much code as it could. This is a good first step IMO. I understand that having a smaller backend that reuses code would be better, but we don't have it now, and AFAIK nobody is working on it. So, the first decision to make is:

  • Have the already implemented backend in the Ibis core
  • Have it in the ibis organization
  • Not have it anywhere, and let the author keep it in their own GitHub

Regarding reusing more code, the main advantage I see is being able to open PRs that touch both the pandas and the dask backends together. But I don't see that being important, or even desirable. Opening a PR to the pandas backend first, where the code is cleaned up so it can be better reused by other backends, seems probably even better.

For the rest, reusability means calling its functions or subclassing its classes, and I don't see how it makes a difference, when importing the pandas backend, whether the Dask backend is in the same library or in a separate one.
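To make that concrete: the reuse pattern really is the same regardless of repo layout. A hypothetical sketch (the class and method names here are illustrative stand-ins, not Ibis's actual API):

```python
# Hypothetical sketch: whether the Dask backend lives in the same repo as
# the pandas backend or in a separate package, reuse looks identical --
# import the pandas backend and subclass or call into it.

class PandasBackend:
    """Stand-in for the pandas backend's execution entry point."""

    def execute(self, expr):
        # For this sketch, pretend `expr` is a callable producing a result.
        return expr()


class DaskBackend(PandasBackend):
    """Reuses the pandas execution logic, overriding only what differs."""

    def execute(self, expr):
        result = super().execute(expr)
        # Dask-specific step: materialize lazy results if needed.
        return result.compute() if hasattr(result, "compute") else result
```

In a separate repo, the only difference is that `PandasBackend` would be imported from the pandas backend's package instead of being defined alongside.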

Feel free to reopen the Dask backend PR in the Ibis core if you think it's better, but I'd personally be more dynamic: get the Dask backend we've got merged, and then keep iterating. Also, having it in a separate repo would make it easier if eventually someone wants to write another Dask backend that reuses lots of the pandas backend code.

@zbrookle
Author

zbrookle commented Sep 28, 2020

@jreback @datapythonista I think a third option you may want to consider is having one repo for all dataframe-based backends. Then you can maintain the shared code in one place, and also only worry about maintaining CI for those backends, which will have similar tests. I don't know if Modin and rapids/cudf are in the scope of this project, but it would make sense to me if they were, and they may also have shared code with pandas/dask. I'd be happy to work on this if it's something that would be useful

@gerrymanoim
Contributor

gerrymanoim commented Sep 28, 2020

I'm personally happy to reopen my PR (ibis-project/ibis#2400) in either place, whatever is ultimately easier for the ibis org (though I definitely want to go through and clean up/fix the parts of the code I know are broken before asking for a code review). It also might be easier if I incrementally open PRs for each piece of functionality in execution instead of doing a massive PR with everything.

As for overlap between the pandas and dask backend (in case this is helpful):

  • A large percentage of the tests were fairly similar, with the dask tests comparing dask dataframes (i.e. assert result.compute() == expected.compute(), where result was executed via ibis and expected is computed in dask or via dd.from_pandas). Similarly, my conftest.py returned dask objects.
  • I think there will be major differences for any code that requires sorting (AFAIK dask only allows sorting on a single column, via set_index).
  • There will also be major differences in the grouping code. Some of the pandas groupby code relies on access to the grouper and obj objects, which I don't believe we have access to in dask. There were also places where I had to write my own dd.Aggregation functions.
  • Similarly, I needed to provide meta in some places, and there are currently a bunch of warnings about unspecified meta that should be cleaned up.
  • There are some places where I had to use dd.array instead of numpy.
  • Outside execution, I think there was a high amount of code overlap, though I haven't gone through and fixed some non-working things yet.
  • IIRC there were a few areas where I needed to add more type dispatches.

I'm happy to go back and try to put a more detailed comparison together. All that being said, I'm relatively new to ibis, so there are definitely things I may be misunderstanding.

@jreback

jreback commented Sep 30, 2020

> @jreback @datapythonista I think that a third option that you may want to consider is having a repo for all dataframe based backends. Then you can maintain the shared code in one place, and also only worry about maintaining CI for those backends which will have similar tests. I don't know if Modin and rapids/cudf are in the scope of this project but it would make sense to me if they were and they may have also have shared code with pandas/dask. I'd be happy to work on this if it's something that would be useful

yes this is certainly an option.

@jreback

jreback commented Sep 30, 2020

The issue I have here is that I think it's very hard for a new contributor to set up all of this infrastructure without a LOT of cookie-cutter help (which is itself hard to build and so IMHO not worth it).

I would prefer to build dask in current ibis.

Then if we do want to split things out, we certainly can, but let's take something that's already done (e.g. omnisci, spark etc), do it, and see how it goes. It's not worth it to build a new backend in a new repo with a newer contributor: too many new things at once.

Make sense, @gerrymanoim @datapythonista?

@datapythonista not trying to undermine your effort at all, really really appreciate everything. I just think in this case, let's take one step at a time.

@datapythonista
Contributor

I'm ok having the dask backend in Ibis. But are you ok having the implementation we have now? Or the potentially much smaller implementation? If you want the smaller implementation, my point is that I'd rather get the existing one in than possibly end up with none.

In any case, I'm working on making it easy for backends to be moved around. So hopefully it will soon be trivial where they live, because we should be able to move them very quickly when we have reasons to do it.
