Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to do incremental update with new rows where update_key is not set #380

Open
ArcqStrato opened this issue Sep 18, 2024 · 0 comments
Open

Comments

@ArcqStrato
Copy link

Suppose I have an Oracle table with two columns, ID (the primary key) and MDATUM (the update key).

ID MDATUM
1 NULL
2 NULL
3 2024-09-18 02:00:00.000

Now I upload this table to Snowflake.

Next, one row is added to the table (ID=4), and one row is updated (ID=2), which means it gets an MDATUM:

so it looks like this:

ID MDATUM note
1 NULL
2 2024-09-18 03:00:00.000 this row is updated
3 2024-09-18 02:00:00.000
4 NULL this row is new

How can get those two new rows into Snowflake using an incremental load?

When I load with

--mode incremental \
--primary-key 'ID' \
--update-key 'MDATUM' \

then I get this table:

ID MDATUM note
1 NULL
2 2024-09-18 03:00:00.000 correctly updated
3 2024-09-18 02:00:00.000

As you can see, ID=3 is updated as expected. The new row (ID=4), on the other hand, is not inserted into the target table.

I believe this is in line with your documentation.

However, from a practical standpoint, this is not helpful. My example above is a real-life example from our production table. And it makes sense: If a new row is added then that's not an update, and thus the update_key is empty. Only if a row is modified then the update_key is set, which makes sense.

I therefore believe there should be another load strategy that loads "new records after max(update_key)" plus "new records with primary_key larger then max(primary_key).

A workaround is I can make run Sling twice:

First, I load using these parameters:

--mode incremental \
--primary-key 'ID' \
--update-key 'MDATUM' \

This gets me all rows where the update_key MDATUM is newer then max(MDATUM) in my target table, i.e., all updated rows.

Then I run Sling again and this time I use:

--mode incremental \
--primary-key 'ID' \
--update-key 'ID' \

This should get me all IDs where the ID is higher then max(ID) in my target table, i.e., all new rows.

I feel my use-case is quite plausible and should be quite widespread.

What are your thoughts on this? And do you see a problem with my workaround?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant