Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snowplow_unified_sessions fails to run on Snowflake due to duplicates #90

Open
1 of 6 tasks
jacobus-herman opened this issue Nov 5, 2024 · 2 comments
Open
1 of 6 tasks
Labels
status:needs_triage Needs maintainer triage. type:bug Bugs or weaknesses. The issue has to contain steps to reproduce.

Comments

@jacobus-herman
Copy link

jacobus-herman commented Nov 5, 2024

Describe the bug

We receive an error when running the snowplow_unified_sessionsmodel in Production on Snowflake. The error is as follows:

Database Error in model snowplow_unified_sessions (models/sessions/snowplow_unified_sessions.sql)
100090 (42P18): Duplicate row detected during DML action
Row Values: INTERNAL ERROR
compiled code at /app/target/run/snowplow_unified/models/sessions/snowplow_unified_sessions.sql

This error seems to be due to be linked to the mobile_language_name column that was added and the fact that deduplication is only done for postgres and spark but not snowflake.

Steps to reproduce

Set the dbt_project.yml to the following:

vars:
  snowplow_unified:
    snowplow__events_table: ...
    snowplow__atomic_schema: ...
    snowplow__enable_iab: true
    snowplow__enable_ua: true
    snowplow__enable_yauaa: true
    snowplow__enable_mobile_context: true
    snowplow__enable_application_context: true
    snowplow__start_date: '2024-06-01'
    snowplow__backfill_limit_days: 30
    snowplow__session_timestamp: load_tstamp
    snowplow__session_stitching: false

Execute dbt run --select package:snowplow_unified.

Expected results

Expected no errors from the snowplow_unified_sessions model.

Actual results

Receiving an error:

Database Error in model snowplow_unified_sessions (models/sessions/snowplow_unified_sessions.sql)
100090 (42P18): Duplicate row detected during DML action
Row Values: INTERNAL ERROR
compiled code at /app/target/run/snowplow_unified/models/sessions/snowplow_unified_sessions.sql

Screenshots and log output

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: dbt-labs/codegen
    version: 0.12.1
  - package: dbt-labs/dbt_project_evaluator
    version: 0.14.3
  - package: dbt-labs/audit_helper
    version: 0.12.0
  - package: calogica/dbt_expectations
    version: 0.10.4
  - package: dbt-labs/dbt_external_tables
    version: 0.9.0
  - package: Snowflake-Labs/dbt_constraints
    version: [">=1.0.0", "<1.1.0"]
  - package: snowplow/snowplow_unified
    version: 0.5.0
  - package: elementary-data/elementary
    version: 0.16.1

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • databricks
  • other (specify: ____________)

The output of dbt --version:

Core:
  - installed: 1.8.8
  - latest:    1.8.8 - Up to date!

Plugins:
  - snowflake: 1.8.4 - Up to date!

The operating system you're using:
macOS 15.0.1 (24A348)
OR
Debian Bookworm (python:3.9-slim Docker image)

The output of python --version:
Python 3.9.19

Additional context

Possible problem with line of code here:

{% if target.type in ['postgres','spark'] %}

and here:
{% if target.type in ['postgres','spark'] %}

Are you interested in contributing towards the fix?

Unfortunately, I have no time to do it directly.

@jacobus-herman jacobus-herman added the type:bug Bugs or weaknesses. The issue has to contain steps to reproduce. label Nov 5, 2024
@github-actions github-actions bot added the status:needs_triage Needs maintainer triage. label Nov 5, 2024
@jacobus-herman
Copy link
Author

We also found a duplicate in the seed file here. This is probably not intended.

@dkbrkjni
Copy link

I think this is the same issue as we see in Databricks after updatint to version 0.5.0.

This is the log from a failed run in Databricks:

Database Error in model snowplow_unified_sessions (models/sessions/snowplow_unified_sessions.sql)
[DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE] Cannot perform Merge as multiple source rows matched and attempted to modify the same
target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge,
when multiple source rows match on the same target row, the result may be ambiguous
as it is unclear which source row should be used to update or delete the matching
target row. You can preprocess the source table to eliminate the possibility of
multiple matches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:needs_triage Needs maintainer triage. type:bug Bugs or weaknesses. The issue has to contain steps to reproduce.
Projects
None yet
Development

No branches or pull requests

2 participants