Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: DataFrame.to_sql with if_exists='replace' should do truncate table instead of drop table #37210

Open
tokorhon opened this issue Oct 17, 2020 · 34 comments · May be fixed by #60376
Open

ENH: DataFrame.to_sql with if_exists='replace' should do truncate table instead of drop table #37210

tokorhon opened this issue Oct 17, 2020 · 34 comments · May be fixed by #60376
Labels
Enhancement IO SQL to_sql, read_sql, read_sql_query Needs Discussion Requires discussion from core team before further action

Comments

@tokorhon
Copy link

tokorhon commented Oct 17, 2020

Is your feature request related to a problem?

Dropping table when if_exists=‘replace’ will fail if the table has any objects, like a view, depending on it. Also, some databases like Oracle will end a transaction implicitly when a DLL statement like 'drop table' is issued.

Describe the solution you'd like

Better alternative would be to issue 'truncate table' command instead.

API breaking implications

Should not have any changes on API

Describe alternatives you've considered

As above or a new flag if_exists=‘truncate’

@tokorhon tokorhon added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 17, 2020
@ukarroum
Copy link
Contributor

One of the problems i can see with using truncate instead of drop is that this add the limitation the the user's dataframe need to have the same columns as the table (which may not always be the case, depending on the usecase).

@erfannariman
Copy link
Member

Yes I agree with @ukarroum , also the type of a column can change over time. But then again OP is suggesting to add a new method if_exists='truncate' which is a bit confusing, because the title states otherwise.

I can see the benefit of truncate over drop, also since the first will be more efficient.

@tokorhon
Copy link
Author

There are pros and cons of truncate vs. drop. So maybe new method is the answer. Call it if_exists='delete' or something else. If truncate table is not supported by DBMS then this method should lead to 'delete from table' (which is slower than truncate).

@jbrockmendel jbrockmendel added the IO SQL to_sql, read_sql, read_sql_query label Oct 25, 2020
@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 13, 2021
@fiendish
Copy link
Contributor

Could it use truncate if the columns are the same and drop otherwise?

@mancunian1792
Copy link

One of the problems i can see with using truncate instead of drop is that this add the limitation the the user's dataframe need to have the same columns as the table (which may not always be the case, depending on the usecase).

I would assume that if the structure of the table changes a lot, then its not a right use case for a relational table in the first place ? Although, i can understand the datatype changing over time would cause an issue. I feel truncate use case is more common and would benefit more people including me.

@igonro
Copy link

igonro commented Mar 24, 2022

This is a great idea. I think is better to have both options "replace" and "truncate", sometimes you will need "replace" when you want to delete and insert new columns. But "truncate" is a needed option when you need to keep some columns options in the SQL database. For example, with replace the autoincrement option in the "id" column will disappear.

TLDR:

  • replace: when your db table columns change
  • truncate: when you need to keep your db table columns

@igonro
Copy link

igonro commented Mar 24, 2022

Also @lam-juice in #8673 pointed out a behaviour that I was also having in my SQL database:

Is it possible to make pandas delete all the rows instead of dropping the table for if_exists='replace'?
Having to drop the table results in deadlocks when there's a simultaneous SELECT going on, while a simple deletion would avoid such deadlocks.

So why not add truncate option, it solves some problems (deadlocks, column information loss), it preserves the current behaviour of replace, and I don't think it the implementation would be hard.

@kylemcmearty
Copy link

+1 for @igonro 's solution.

I'm having issues with my table view object because replace drops the table.

Having truncate and replace options would be great.

I'm gonna try out this solution for now and see if it works: https://stackoverflow.com/a/67235797

@igonro
Copy link

igonro commented Mar 10, 2023

Yes, @kmcmearty; that's an easy workaround (truncating manually and then doing the to_sql with append).

@gmcrocetti
Copy link

I recently created a pull request adding the truncate option.

@erfannariman
Copy link
Member

@pandas-dev/pandas-core if anyone or multiple persons have time, could you reply if you're +1 or -1 on this addition. I am +1 since I our team would use this quite often.

@gmcrocetti thank you for taking the time to create the PR. I think it's better that we get some replies from the core team before we spend time on the PR.

@gmcrocetti
Copy link

@pandas-dev/pandas-core if anyone or multiple persons have time, could you reply if you're +1 or -1 on this addition. I am +1 since I our team would use this quite often.

@gmcrocetti thank you for taking the time to create the PR. I think it's better that we get some replies from the core team before we spend time on the PR.

Thanks a ton, much appreciated !

@rhshadrach
Copy link
Member

could you reply if you're +1 or -1 on this addition.

There are several proposals here, could you clarify which you're referring to? I think it's the addition of truncate rather than the original request in the OP.

@gmcrocetti
Copy link

gmcrocetti commented Oct 13, 2024

Hello @rhshadrach , yes you got it right. It is the addition of truncate indeed.

@WillAyd
Copy link
Member

WillAyd commented Oct 14, 2024

if_exists="truncate" reads a little strange, since it actually replaces (at least from what I understand with the proposal).

What's the precedent from other libraries for using truncate? I definitely understand that there are performance implications and that for some databases this can affect auto-increment behavior, but I'm also under the impression that the standard replace with drop was chosen with intention.

Is there a prior art for this suggestion in sqlalchemy? What databases support TRUNCATE within a transaction and which don't?

@shortywz
Copy link

shortywz commented Oct 14, 2024

PostgreSQL and its derivatives support TRUNCATE within a transaction - as an end user looking forward to this feature, the drop implementation requires additional handling to restore permissions, and is not usable on tables with dependencies - TRUNCATE solves both of these problems.

@gmcrocetti
Copy link

gmcrocetti commented Oct 14, 2024

Hi @WillAyd o/.

The proposal is preserve the behavior of replace as is today - no changes here. On the other hand adding a new option named truncate that instead of replacing a table will wipe out the data of the table and "restart" it. I don't think performance is the only pain point. Most of times you don't want to (or can't) recreate the "structure" of the table: indexes, permissions, triggers, etc. All one want is to remove all data and restart from fresh.

AFAIK there's no public API available on sqlalchemy to truncate tables despite this functionally being part of the SQL standard since 2008. There might be a reason for that but I don't know.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 14, 2024

if_exists="truncate" reads a little strange, since it actually replaces (at least from what I understand with the proposal).

I agree with you @WillAyd , but it does seem that TRUNCATE is used in SQL to represent the operation of deleting all the rows and then adding new rows to "replace" the ones that are deleted.

I'm +1 on the proposal to add the if_exists="truncate" option.

@WillAyd
Copy link
Member

WillAyd commented Oct 14, 2024

I am not sure how relevant this is given it was written 15 years ago, but here is what I see upstream for SQLAlchemy:

sqlalchemy/sqlalchemy#1474

The trap here is going to be assuming what works for some databases (i.e. postgres and MSSQL) is going to work across all databases. AFAIU, we have very little (if any) hard-coded statements in our implementation that are RDBMS-specific, and we don't have a very good CI setup to cover non-standard behavior.

So generally I'd say I'm -1 on this, unless it were to be implemented upstream in SQLAlchemy or the DBAPI first. In the interim, users can manage the atomicity of the behavior as needed without too much extra effort, i.e. if you wanted an atomic truncate in postgres I think you could do:

engine = ...
with engine.connect() as conn:
    conn.execute("TRUNCATE table foo")
    df.to_sql("foo", con=conn, if_exists="append")

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Oct 14, 2024

The trap here is going to be assuming what works for some databases (i.e. postgres and MSSQL) is going to work across all databases. AFAIU, we have very little (if any) hard-coded statements in our implementation that are RDBMS-specific, and we don't have a very good CI setup to cover non-standard behavior.

Thanks for this analysis. Given that TRUNCATE is not supported for all DB's, and there is an alternate solution, I switch my vote to -1.

@gmcrocetti
Copy link

gmcrocetti commented Oct 14, 2024

Sorry @WillAyd , could you expand on what you meant by that ?

The trap here is going to be assuming what works for some databases (i.e. postgres and MSSQL) is going to work across all databases. AFAIU, we have very little (if any) hard-coded statements in our implementation that are RDBMS-specific, and we don't have a very good CI setup to cover non-standard behavior.

I'm not sure I followed but we don't need to add RDBMS-specific statements. The proposed implementation is using sqlalchemy behind the scenes (.execute(f"TRUNCATE TABLE {table}")). If the specific database doesn't support it then an exception is raised.

If you're talking about the specifics of how each RDBMS implements the execution of TRUNCATE then shouldn't pandas delegate this responsibility to the user - of course providing proper documentation ?

@WillAyd
Copy link
Member

WillAyd commented Oct 14, 2024

I'm not sure I followed but we don't need to add RDBMS-specific statements. The proposed implementation is using sqlalchemy behind the scenes (.execute(f"TRUNCATE TABLE {table}")).

Although vaguely worded, I wasn't referring to the TRUNCATE statement itself as much as any accompanying statements that we would have to implement to try and make this behavior work consistently across all RDBMS'es, if that is even possible (not sure it is)

If you're talking about the specifics of how each RDBMS implements the execution of TRUNCATE then shouldn't pandas delegate this responsibility to the user - of course providing proper documentation ?

Yes it should, which is why I think adding the truncate argument as proposed has some flaws. The pandas API is supposed to be atomic and avoid any implementation-defined behavior. Unfortunately with truncate, an expression of the form:

df.to_sql("table", con, if_exists="truncate")

Will yield implementation-defined behavior in case of failure during append. For postgres, it would treat that as an atomic operation and rollback the truncate. But for a DB like Oracle my guess is it would commit the truncation of "table" and leave it without any records.

Since that is tucked away within our API, that puts the onus on us to either bridge those differences or convince people that certain parts of the API are good for some databases yet not for others.

@gmcrocetti
Copy link

Got it, thanks for taking the time to reply.

Yeap, I do have to recognized you have a fair point. The proposed feature can lead to potential unexpected behaviors on databases where TRUNCATE is a DDL instead of a DML (as in postgres and its derivatives). This burden shouldn't be added to pandas as it delegates to SQLalchemy.

I'm gonna wait for the voting to end but in case it is rejected do you believe we could add the outcome of this discussion somewhere into the documentation ? Or going one step further, would be possible to extend the behavior of this function via a third-party (away for good reasons of the core) ?

@WillAyd
Copy link
Member

WillAyd commented Oct 14, 2024

Absolutely +1 to documenting this, either in the I/O tools guide, the to_sql docstrings, or both.

As far as a third-party package goes, are you thinking of an entirely new package just for this feature or are you asking if there's a way to integrate that behavior into our existing to_sql call from a third party? The latter does not exist, but I know I/O plugins has historically been an interest of @datapythonista

@nickolay
Copy link

The pandas API is supposed to be atomic and avoid any implementation-defined behavior. Unfortunately with truncate, an expression of the form: df.to_sql("table", con, if_exists="truncate") Will yield implementation-defined behavior in case of failure during append.

Same holds for if_exists="replace", no? A DROP TABLE preceeding a failed CREATE also won't be rolled back in Oracle and this behavior is inconsistent across databases.

(As an aside: unlike truncate, if_exists=replace fails to even load unicode data into MSSQL properly, #35627 since forever, so pretending pandas' database abstractions are or even can be non-leaky is wishful thinking.)

@gmcrocetti
Copy link

gmcrocetti commented Oct 15, 2024

Nice analysis @nickolay . From what I understood this is indeed what happens today. pandas is not resilient to non-transactional DDL statements (Oracle and its derivations). The existing API is not atomic thus adding the truncate option will not change existing behavior.

@tokorhon
Copy link
Author

Thanks to everyone for great views and opinions regarding this feature request! I may not understand every detail, but for me

  • the name of the value of if_exists is not relevant. It can be something else than truncate since SQL truncate operation is not available in all databases
  • important is that there should be another way to empty a table that drop + recreate. In databases not supporting truncate it can lead to delete from table.
  • as stated in the original post, drop + recreate does not work if there are objects depending of the table.

@WillAyd
Copy link
Member

WillAyd commented Oct 15, 2024

Same holds for if_exists="replace", no? A DROP TABLE preceeding a failed CREATE also won't be rolled back in Oracle and this behavior is inconsistent across databases.

That second link in particular is a good reason why we do not issue these statements directly from within pandas and rely on third parties like SQLAlchemy to abstract this as best as possible. It sounds like your statement may be true for Oracle versions < 11g Release 2 but possibly not thereafter.

@nickolay
Copy link

I'm pretty sure what I said about DROP implicitly committing an active transaction applies to the current versions of Oracle as well. I also don't see any evidence of sqlalchemy or pandas using edition-based redefinition for if_exists="replace" (and would be very surprised if it did so by default!)

But it's true that unlike DROP+CREATE, TRUNCATE TABLE is an optional and underspecified feature of the SQL standard, so I can see where you're coming from. I was surprised to find that pandas' sqlalchemy backend seems to have 0 instances of raw SQL as of now, and can understand if the maintainers would like to keep it that way.

If that's the case, is @tokorhon's suggestion of providing a mode to run DELETE FROM table; instead a viable alternative? It IS widely compatible across databases, works fine with transactions, and has an API in sqlalchemy. I agree performance is not the main concern here, as in many scenarios the sqlalchemy-based load process will be the bottleneck before the DELETE.

@WillAyd
Copy link
Member

WillAyd commented Oct 15, 2024

That's an interesting idea; I think delete may be more viable given it is provided by sqlalchemy

Maybe the keyword becomes if_exists="replace_rows" to differentiate from the existing case?

@gmcrocetti
Copy link

The usage of DELETE FROM instead of TRUNCATE for sure has some side effects. But being pragmatic they will achieve the very same result by the end of the day. I'd suggest using delete_rows to replicate the semantics of what's going on behind the scenes.

@gmcrocetti
Copy link

gmcrocetti commented Oct 17, 2024

So just to summarize what we all discussed so far and members of the core team @pandas-dev/pandas can vote without reading the whole issue:

if_exists=truncate 1️⃣

The truncate option would mimic SQL's standard TRUNCATE TABLE behavior.

Pros

  • Performance wise Truncate is pretty fast and has very few drawbacks when compared to DELETE FROM
  • The table is not deleted, but data is wiped outn - truncate preserves index, permissions and etc.

Cons

  • Not widely support by all databases and when supported the implementation varies a ton;
    • In some databases running this command creates a implicit commit which can lead to non-atomic actions on pandas.
  • SQLAlchemy does not provide an API for that therefore we need to add raw SQL into the codebase.

if_exists=delete_rows (replace_rows) 2️⃣

don't get hung up on the name (delete_rows). Anything that represents that rows are deleted but not the table is acceptable.

Pros

  • SQLAlchemy provides an API.
  • Supported by all SQL databases

Cons

  • Not as performant as truncate
  • Will activate triggers (in case they exist)

Feel free to vote with 1️⃣ or 2️⃣ reactions.

@gmcrocetti
Copy link

gmcrocetti commented Nov 19, 2024

Hello @WillAyd,

Have you had the time to think about this matter ? I tried to summarize what we discussed right here ☝️.

@pandas-dev/pandas-core second thoughts are welcome. Let's bring a conclusion to this issue :)

@WillAyd
Copy link
Member

WillAyd commented Nov 19, 2024

I think it makes sense to go the DELETE route. I don't have a strong opinion on naming - maybe delete_replace?

It's a shame that what we have currently as "replace" is better thought of as "recreate" but there's too much history to try and change that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment