[cdc] fix cdc watermark value emit error #2763

huyuanfeng2018 · 2024-01-22T11:12:30Z

Purpose

Linked issue: close #2745

Tests

API and Format

Documentation

huyuanfeng2018 · 2024-01-22T11:13:17Z

@MonsterChenzhuo @JingsongLi PTAL. thx

huyuanfeng2018 · 2024-01-22T11:19:00Z

...ink-cdc/src/main/java/org/apache/paimon/flink/action/cdc/watermark/CdcWatermarkStrategy.java

@@ -56,13 +56,10 @@ public void onEvent(String record, long timestamp, WatermarkOutput output) {
                    throw new RuntimeException(e);


I don't think we need to throw an exception here. As I understand it, a failure in the watermark calculation does not have to break the task, so we could ignore the error and print a warning log instead. That would avoid bugs caused by cases we haven't considered. This is clearly outside the scope of this PR, though.

WDYT @MonsterChenzhuo
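
A minimal sketch of the "warn and skip" behaviour suggested above, written as plain Java with no Flink dependency so it can run on its own. `Extractor`, `TolerantMaxTimestamp` and the `skipped` counter are hypothetical names used only for illustration, not Paimon types; a real implementation would presumably use the project's logger rather than `System.err`.

```java
// Hypothetical stand-in for the CDC timestamp extractor; not a Paimon type.
interface Extractor {
    long extractTimestamp(String record) throws Exception;
}

// Tracks the maximum event timestamp seen, skipping records that fail to parse.
final class TolerantMaxTimestamp {
    private final Extractor extractor;
    private long currentMaxTimestamp = Long.MIN_VALUE;
    private long skipped = 0;

    TolerantMaxTimestamp(Extractor extractor) {
        this.extractor = extractor;
    }

    void onEvent(String record) {
        try {
            long tMs = extractor.extractTimestamp(record);
            currentMaxTimestamp = Math.max(currentMaxTimestamp, tMs);
        } catch (Exception e) {
            // Warn and continue instead of rethrowing, so one bad record does not fail the job.
            skipped++;
            System.err.println("WARN: could not extract watermark timestamp: " + e.getMessage());
        }
    }

    long currentMaxTimestamp() {
        return currentMaxTimestamp;
    }

    long skipped() {
        return skipped;
    }
}
```

Counting skipped records is an assumption added here so that silently dropped timestamps remain visible, for example through a metric.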

JingsongLi · 2024-01-23T02:27:37Z

cc @yuzelin

MonsterChenzhuo · 2024-01-23T05:52:34Z

@huyuanfeng2018 Have you encountered any practical problems?

MonsterChenzhuo · 2024-01-23T06:06:28Z

@huyuanfeng2018 "onPeriodicEmit is called regularly, so the system timestamp will be sent as a watermark regularly." This is intentional: it allows tags to be created normally when no data is being written. It is not a bug.

huyuanfeng2018 · 2024-01-23T06:25:20Z

So if the periodic call always emits the system time, what is the point of extracting a watermark from the data? I don't quite understand.

Consider a scenario where tags are created daily based on watermarks. If binlog consumption is delayed or backlogged at 00:00, the system time is still emitted as the watermark, so the tag is still created even though the data has not fully arrived.

So I think that if the goal is to handle tag creation when no data arrives, it should be handled some other way rather than by sending the system time as the watermark.
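
To make the concern concrete, here is a small sketch with made-up numbers. The `tagDue` rule is a deliberate simplification assumed for illustration (a daily tag becomes due once the watermark reaches the end of the day), and `BacklogExample` is not Paimon code. It compares a watermark derived from event time with one derived from the wall clock when consumption lags by two hours across midnight.

```java
final class BacklogExample {
    static final long HOUR_MS = 60L * 60 * 1000;
    static final long DAY_MS = 24 * HOUR_MS;

    // Simplified rule: the tag for the day ending at dayEnd is due once the watermark reaches dayEnd.
    static boolean tagDue(long watermark, long dayEnd) {
        return watermark >= dayEnd;
    }

    public static void main(String[] args) {
        long dayEnd = DAY_MS;                      // midnight that ends day 0, in epoch millis
        long wallClock = dayEnd + 1;               // just after midnight
        long maxEventTime = dayEnd - 2 * HOUR_MS;  // newest consumed binlog record is two hours old

        // Event-time watermark: the tag waits until the backlog has been consumed.
        System.out.println(tagDue(maxEventTime - 1, dayEnd)); // prints false
        // Wall-clock watermark: the tag is due although two hours of data are still missing.
        System.out.println(tagDue(wallClock - 1, dayEnd));    // prints true
    }
}
```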

huyuanfeng2018 · 2024-01-23T06:28:55Z

Regarding practical problems, I did run into this one:

java.lang.IllegalArgumentException: Invalid path or key not found: ts_ms

Because the job was running in production, I recompiled and restored it right away, and I did not debug which record caused the error at the time.

MonsterChenzhuo · 2024-01-23T08:18:46Z

Users rely on watermark semantics to make sure that a tag, once created, covers delayed data.
However, a tag is created at commit time, and a commit only happens when data is written. This causes a problem: when the data stops flowing for a period of time, no tag is created. To solve this, we support an idle advancement strategy that uses processing time to advance the watermark and trigger tag creation.

For example:

huyuanfeng2018 · 2024-01-23T08:54:17Z

So, do we need a mechanism that detects when the source has had no input for a certain period of time, similar to the parameter table.exec.source.idle-timeout?

MonsterChenzhuo · 2024-01-23T10:06:23Z

You can take a look at this: #2646

huyuanfeng2018 · 2024-01-25T09:01:11Z

Well, that PR looks good, but I still have a question. It only forces snapshots to be generated even when no data has been sent, and watermarks are still needed to decide whether the conditions for generating a snapshot are met. Since we produce the watermarks here, could we do something similar? For example, if no data has been written for 5 minutes, force the current time to be sent as the watermark.

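            // Assumes two long fields on the enclosing generator:
            // currentMaxTimestamp and lastEventTimestamp.

            @Override
            public void onEvent(String record, long timestamp, WatermarkOutput output) {
                try {
                    long tMs = timestampExtractor.extractTimestamp(record);
                    currentMaxTimestamp = Math.max(currentMaxTimestamp, tMs);
                    // Remember, in processing time, when the last event arrived.
                    lastEventTimestamp = System.currentTimeMillis();
                } catch (Exception e) {
                    // Ignore records whose timestamp cannot be extracted.
                }
            }

            @Override
            public void onPeriodicEmit(WatermarkOutput output) {
                long timeMillis = System.currentTimeMillis();
                // No event for 5 minutes: treat the source as idle and use the current time.
                if (timeMillis - lastEventTimestamp > 1000 * 60 * 5) {
                    currentMaxTimestamp = timeMillis;
                }
                output.emitWatermark(new Watermark(currentMaxTimestamp - 1));
            }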
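
The same idea can be exercised outside Flink with an injectable clock. This is only a sketch under stated assumptions: `MillisClock` and `IdleAwareWatermark` are hypothetical names, not Paimon or Flink types. It also differs from the snippet above in two ways that are choices made for this sketch: the idle timer starts at construction time rather than at zero, and the idle branch uses `Math.max` so the watermark never moves backwards.

```java
// Hypothetical clock abstraction so the idle timeout can be exercised without waiting.
interface MillisClock {
    long nowMillis();
}

final class IdleAwareWatermark {
    private final MillisClock clock;
    private final long idleTimeoutMs;
    // Start one above Long.MIN_VALUE so that "currentMaxTimestamp - 1" cannot overflow.
    private long currentMaxTimestamp = Long.MIN_VALUE + 1;
    private long lastEventWallClock;

    IdleAwareWatermark(MillisClock clock, long idleTimeoutMs) {
        this.clock = clock;
        this.idleTimeoutMs = idleTimeoutMs;
        this.lastEventWallClock = clock.nowMillis();
    }

    void onEvent(long eventTimestamp) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
        lastEventWallClock = clock.nowMillis();
    }

    // Returns the watermark value that a periodic emit would send.
    long onPeriodicEmit() {
        long now = clock.nowMillis();
        if (now - lastEventWallClock > idleTimeoutMs) {
            // Source considered idle: advance to the wall clock, but never backwards.
            currentMaxTimestamp = Math.max(currentMaxTimestamp, now);
        }
        return currentMaxTimestamp - 1;
    }
}
```

While events keep arriving, the watermark follows event time only, which addresses the backlog case; the wall clock takes over only after the configured idle period has passed.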

JingsongLi · 2024-11-28T14:03:18Z

It seems we didn't find a way to fix this. Closing this for now; feel free to reopen if you have more thoughts.

[cdc] fix cdc watermark value emit error

8a67cd6

JingsongLi closed this Nov 28, 2024
