[SPARK-50339][SPARK-50360][SS] Enable changelog to store lineage information #48880

WweiL · 2024-11-18T14:02:56Z

What changes were proposed in this pull request?

Break down #48355 into smaller PRs.

Changelog Reader / Writer

We purpose to save the lineage to the first line of the changelog files.

For changelog reader, there is an abstract function readLineage created. In RocksDBCheckpointManager.getChangelogReader function, the readLineage will be called right after the initialization of the changelog reader to update the file pointer to after the lineage. Subsequent getNext function won't be affecter because of this.

For changelog writer, there is an abstract function writeLineage that writes the lineage. This function will be called before any actual changelog data is written in RocksDB.load().

The lineage is stored as json.

Why are the changes needed?

Continue development of SPARK-49374

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit test

Was this patch authored or co-authored using generative AI tooling?

No

WweiL · 2024-11-18T14:03:55Z

@siying @brkyvz PTAL

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala

siying

Generally LGTM. I left a test comment.

sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala

brkyvz

I am not a fan of the interface. You're essentially requiring the creator of the Changelog writer to first call writeLineage, which feels like an abstraction leak and makes it error prone. Can the lineage be passed in to the writer instead instead of just the enableStateStoreCheckpointIds? Or require that writeLineage was called by an internal flag before put/merge/delete is called?

I would rather still prefer the former than the latter though

brkyvz · 2024-11-19T23:36:24Z

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

@@ -102,6 +115,13 @@ abstract class StateStoreChangelogWriter(
    fm.createAtomic(file, overwriteIfPossible = true)
  protected var compressedStream: DataOutputStream = compressStream(backingFileStream)

+  def writeLineage(lineage: Array[LineageItem]): Unit = {


Is there anything that guarantees that this is going to be called first so that it goes into the first line?

HeartSaVioR · 2024-11-20T06:09:27Z

+1 to @brkyvz about the interface and I also prefer former on the suggestion.

Also, I'd like to see the changelog file to be "self described". It is indeed a change on the storage format to introduce the lineage. For the change of storage format, I'd rather say, we need to bump the version. I understand changelog file will be accessed after looking into commit log hence it's clear whether the query has uuid to be available or not, but still, we are drawing another axis of the versioning for the storage format of the changelog file. I'd like to see this be flattened (single axis of the timeline for the versioning).

version 2 is superior to version 1 - it has more overhead but cover more functionality and I don't think the overhead would be outstanding. That said, why not come up with version 3 based on version 2 + lineage and let all operators use version 3 if uuid is available? I'm even OK with version 3 for version 1 + lineage and version 4 for version 2 + lineage if we are not very sure about the overhead.

brkyvz

This V3, V4 stuff is really error prone... Can we please just introduce V3? Then introduce a writeHeader to StateStoreChangelogWriter. V1 doesn't write a version, V2 writes a version, V3 writes a version and lineage in this method.

Then, you should only have one Reader class. It just reads the first line, from there it knows whether it is v1 or something else. If it is a version it doesn't understand, then it'll throw an error along the lines of: "I don't know how to read this file". It can figure out the lineage as well if it is V3. Then it can read the data. The problem right now is that you've totally broken forward compatibility of changelogs - rather V1 and V2 is going to just fail with a FileNotFoundException if someone upgrades and then rolls back their Spark rather than a better error message of, "I can't read this stream"

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

brkyvz · 2024-11-20T10:40:39Z

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

+  // Also write lineage information to the changelog, it should appear
+  // in the second line for v4 because the first line is the version
+  writeLineage(stateStoreCheckpointIdLineage)


this becomes tricky to guarantee based on class/trait/abstraction initialization...

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

WweiL · 2024-11-20T14:04:58Z

just introduce V3?

This is sadly not possible imho. Because v1 and v2 not only differ in whether having a "v2" in the first line, but their format are different. We likely still need to have v3 and v4.

Then, you should only have one Reader class. It just reads the first line, from there it knows whether it is v1 or something else.

Similarly for this one, I think a way out is to have a helper class that does this, i.e. it reads the version from the actual changelog file. Then the reader version should be decided by it. This is a tech debt indeed, I filed a spark ticket for it: https://github.com/WweiL/oss-spark/blob/8e439fcc43513f3048b51a8422c0ec136a0b624d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala#L211-L215 Do you think we can resolve it in a separate PR?

HeartSaVioR · 2024-11-20T22:39:24Z

@brkyvz
The diffs between v1 and v2 are not just about version string existed or not. They have different format about representing the operation.

That said, if we just want to avoid divergence of having multiple active versions, we probably need to cut the version differently, e.g. timeline. v3 understands the functionalities/features A, B, C and which features are used for this file is to put on the header, and reader parses the remaining differently based on the used features. v4 does not need to have the quite different storage format but just to enforce Spark (reader) to understand the features which were introduced with v4 to read the file for v4.

I guess it's more proper versioning for reader vs writer scenario - we didn't do this because we didn't know the changelog is going to be improved and required divergence like this way. I'm still careful whether we will need to improve this area to even "have a protocol doc for versions", but we may want to further diverge because we are looking into using Avro for state row, so it might be inevitable...

brkyvz

I don't like the fact that we didn't initially implement these files with a proper protocol version, but at least we can do an assertion on the version checks...

Left some recommendations and please add tests for all the failure cases too on the versioning

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

brkyvz · 2024-11-21T14:10:54Z

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

+  writeLineage(stateStoreCheckpointIdLineage)
+}
+
+class StateStoreChangelogWriterV4(


scaladoc please. Why is this a new version? How is the encoding different

There is a merge operation in V2 reader / writer, also the eof token is different

brkyvz · 2024-11-21T14:12:21Z

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

+    assert(changelogVersionStr == s"v${version}",
+      s"Changelog version mismatch: $changelogVersionStr != v${version}")


you might want to turn this into a better error message saying that this is probably coming from a future Spark version. Please upgrade your Spark and use the Error Code framework to do so

Added the helper class to construct the reader, this is not needed I believe. Updated the "please upgrade spark" error message in invalidChangeLogReaderVersion which is called in constructChangelogReader, this is better I think

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

brkyvz · 2024-11-21T14:16:01Z

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

-  val changelogVersionStr = input.readUTF()
-  assert(changelogVersionStr == "v2",
-    s"Changelog version mismatch: $changelogVersionStr != v2")
+  verifyVersion()


Can you have this be called in StateStoreChangelogReader initialization? I'd have the following in StateStoreChangelogReader

protected def readVersion(): String = input.readUTF() final def verifyVersion(): Unit = { if(!readVersion.stripPrefix("v") == version.toString) { throw ... } } verifyVersion() ... } class StateStoreChangelogReaderV1 { override protected def readVersion(): String = "v1" }

Also if readVersion fails, then you also probably want to throw the "Upgrade Spark" error

Yes this is the ideal way, but the problem is in v1 there is no version written in the first line. Hence all the problems. If written in this way verifyVersion would always throw in v1 changelogs

brkyvz · 2024-11-21T14:16:23Z

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala

+
+  override def version: Short = 3
+
+  verifyVersion()


don't call these individually in all the formats... Error prone for future versions

Yes I agree.. but for the same reason above (only v1 does not have version written in the first line), we need to handle this in a case by case fashion

WweiL · 2024-11-21T17:38:30Z

I ended up adding the helper class StateStoreChangelogReaderFactory to construct the state reader. PTAL @siying @HeartSaVioR @brkyvz

WweiL added 3 commits November 18, 2024 21:19

v1

cb16a31

review

fecd88e

review

1aaaba5

github-actions bot added SQL STRUCTURED STREAMING labels Nov 18, 2024

siying reviewed Nov 18, 2024

View reviewed changes

...core/src/main/scala/org/apache/spark/sql/execution/streaming/state/StateStoreChangelog.scala Outdated Show resolved Hide resolved

sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/RocksDBFileManager.scala Outdated Show resolved Hide resolved

HyukjinKwon changed the title ~~[SPARK-SPARK-50339] Enable changelog to store lineage information~~ [SPARK-50339] Enable changelog to store lineage information Nov 19, 2024

HyukjinKwon changed the title ~~[SPARK-50339] Enable changelog to store lineage information~~ [SPARK-50339][SS] Enable changelog to store lineage information Nov 19, 2024

comments'

af20cdf

WweiL requested a review from siying November 19, 2024 04:09

minor

cdebd16

siying approved these changes Nov 19, 2024

View reviewed changes

sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/state/RocksDBSuite.scala Outdated Show resolved Hide resolved

test col family (v2)

97c8356

brkyvz approved these changes Nov 19, 2024

View reviewed changes

WweiL added 2 commits November 20, 2024 15:33

address Jungtaek's comment

0d32561

add jira

8e439fc

brkyvz suggested changes Nov 20, 2024

View reviewed changes

burak's comment, except reading version from changelog

3f7c5de

brkyvz suggested changes Nov 21, 2024

View reviewed changes

burak's comment, add StateStoreChangelogReaderFactory

0cb1787

WweiL changed the title ~~[SPARK-50339][SS] Enable changelog to store lineage information~~ [SPARK-50339][SPARK-50360][SS] Enable changelog to store lineage information Nov 21, 2024

WweiL requested a review from brkyvz November 21, 2024 17:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-50339][SPARK-50360][SS] Enable changelog to store lineage information #48880

[SPARK-50339][SPARK-50360][SS] Enable changelog to store lineage information #48880

WweiL commented Nov 18, 2024 •

edited

Loading

WweiL commented Nov 18, 2024

siying left a comment

brkyvz left a comment

brkyvz Nov 19, 2024

HeartSaVioR commented Nov 20, 2024 •

edited

Loading

brkyvz left a comment

brkyvz Nov 20, 2024

WweiL commented Nov 20, 2024 •

edited

Loading

HeartSaVioR commented Nov 20, 2024 •

edited

Loading

brkyvz left a comment

brkyvz Nov 21, 2024

WweiL Nov 21, 2024

brkyvz Nov 21, 2024

WweiL Nov 21, 2024

brkyvz Nov 21, 2024

brkyvz Nov 21, 2024

WweiL Nov 21, 2024

brkyvz Nov 21, 2024

WweiL Nov 21, 2024

WweiL commented Nov 21, 2024

		assert(changelogVersionStr == s"v${version}",
		s"Changelog version mismatch: $changelogVersionStr != v${version}")

[SPARK-50339][SPARK-50360][SS] Enable changelog to store lineage information #48880

Are you sure you want to change the base?

[SPARK-50339][SPARK-50360][SS] Enable changelog to store lineage information #48880

Conversation

WweiL commented Nov 18, 2024 • edited Loading

What changes were proposed in this pull request?

Changelog Reader / Writer

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

WweiL commented Nov 18, 2024

siying left a comment

Choose a reason for hiding this comment

brkyvz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

HeartSaVioR commented Nov 20, 2024 • edited Loading

brkyvz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WweiL commented Nov 20, 2024 • edited Loading

HeartSaVioR commented Nov 20, 2024 • edited Loading

brkyvz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

WweiL commented Nov 21, 2024

WweiL commented Nov 18, 2024 •

edited

Loading

HeartSaVioR commented Nov 20, 2024 •

edited

Loading

WweiL commented Nov 20, 2024 •

edited

Loading

HeartSaVioR commented Nov 20, 2024 •

edited

Loading