[pkg/stanza/fileconsumer] Add trie and test cases #23665

VihasMakwana · 2023-06-25T10:07:36Z

Description: Add Trie data structure and keep it separate from PR #23056

Testing: Relevant test cases added

djaglowski

I'm excited to see this moving forward and glad we can look at the trie in isolation. I think we need to define very carefully what our intended use of "value" is in the nodes, and ultimately how it supports the requirements of fingerprint matching.

pkg/stanza/fileconsumer/trie.go

djaglowski · 2023-06-26T15:31:46Z

pkg/stanza/fileconsumer/trie.go

+}
+
+// Delete removes the value associated with the given key. Returns true if a
+// node was found for the given key. If the node or any of its ancestors


To be precise, can the node itself ever become childless? We are deleting its value but not modifying its children. Would it be correct to say "if deleting a value on a childless node, the node itself will be deleted."?

It's not clear to me when we should delete ancestor nodes. Currently, we are saving a value on every node, so how do we assess if the ancestor node represents a fingerprint or just part of the path to the deleted node?

We are modifying the children. https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/d8fe374000041d86d3fa9d79a24069b0b3fdc0db/pkg/stanza/fileconsumer/trie.go#L91

if deleting a value on a childless node, the node itself will be deleted

yes, that's what happens. let me give you an example. Consider following TRIE

A | B / \ C D / E

If we call Delete() on ABCE.

E will be removed from nodeC's children map

As node C is childless now, it's entire children map is nullified.

C will be removed from node B's children map.

Now, B is not childless. It has D as it's child. So, we will break out from here and won't proceed further.

Trie after above operation:

A | B | D

The comment appears to be out of date if we are no longer saving values on the nodes.

pkg/stanza/fileconsumer/trie_test.go

pkg/stanza/fileconsumer/trie.go

VihasMakwana · 2023-06-27T07:03:31Z

I'm excited to see this moving forward and glad we can look at the trie in isolation. I think we need to define very carefully what our intended use of "value" is in the nodes, and ultimately how it supports the requirements of fingerprint matching.

I was just going through the code, and I don't think we actually require the value field. Currently, we just check if fingerprint is present in the trie. We can just return true/false.
I ran the same tests after removing the value assignments. and it worked the same way.

djaglowski · 2023-06-28T19:51:29Z

I'm excited to see this moving forward and glad we can look at the trie in isolation. I think we need to define very carefully what our intended use of "value" is in the nodes, and ultimately how it supports the requirements of fingerprint matching.

I was just going through the code, and I don't think we actually require the value field. Currently, we just check if fingerprint is present in the trie. We can just return true/false. I ran the same tests after removing the value assignments. and it worked the same way.

Ok, let's say we remove the value. Get doesn't return anything, so let's rename it HasFingerprint and return a bool.

I still think we need to define exactly how this trie supports our needs.

Let's say we have a trie:

t := NewTrie()
t.Put("AB")
t.Put("XYZ")

Of course, if we call t.HasFingerprint("AB") or t.HasFingerprint("XYZ"), it will return true because these are exact matches.

If we call t.HasFingerprint("ABC"), we've chosen to return true in order to support the use case where a file was observed once when it contained "AB", and again after "C" has been appended.

However, if we call t.HasFingerprint("XY"), the current implementation would return true but it's not clear to me what use case this supports.

VihasMakwana · 2023-06-30T05:29:50Z

Ok, let's say we remove the value. Get doesn't return anything, so let's rename it HasFingerprint and return a bool.

I still think we need to define exactly how this trie supports our needs.

Let's say we have a trie:

t := NewTrie()
t.Put("AB")
t.Put("XYZ")
Of course, if we call t.HasFingerprint("AB") or t.HasFingerprint("XYZ"), it will return true because these are exact matches.

If we call t.HasFingerprint("ABC"), we've chosen to return true in order to support the use case where a file was observed once when it contained "AB", and again after "C" has been appended.

However, if we call t.HasFingerprint("XY"), the current implementation would return true but it's not clear to me what use case this supports.

Hmm, it might make more sense to make t.HasFingerprint("AB") and not t.HasFingerprint("XY"). You're right

VihasMakwana · 2023-07-07T06:32:13Z

Let's say we have a trie:

t := NewTrie()
t.Put("AB")
t.Put("XYZ")
Of course, if we call t.HasFingerprint("AB") or t.HasFingerprint("XYZ"), it will return true because these are exact matches.

If we call t.HasFingerprint("ABC"), we've chosen to return true in order to support the use case where a file was observed once when it contained "AB", and again after "C" has been appended.

However, if we call t.HasFingerprint("XY"), the current implementation would return true but it's not clear to me what use case this supports.

@djaglowski after some careful consideration, I think there's no need to match "XY".

VihasMakwana · 2023-07-10T17:27:42Z

@djaglowski can you review this one as your convenience?

djaglowski · 2023-07-10T17:41:49Z

@VihasMakwana, I agree with your conclusion here, that we should change the matching behavior or the trie.

pkg/stanza/fileconsumer/trie.go

VihasMakwana · 2023-07-10T18:29:56Z

yeah i did that already I guess. https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/23665/files#r1258686830

pkg/stanza/fileconsumer/trie.go

djaglowski · 2023-07-10T19:02:36Z

pkg/stanza/fileconsumer/trie.go

+}
+
+// Delete removes the value associated with the given key. Returns true if a
+// node was found for the given key. If the node or any of its ancestors


The comment appears to be out of date if we are no longer saving values on the nodes.

djaglowski · 2023-07-10T19:14:02Z

pkg/stanza/fileconsumer/trie.go

+// Delete removes the value associated with the given key. Returns true if a
+// node was found for the given key. If the node or any of its ancestors
+// becomes childless as a result, it is removed from the trie.
+func (trie *Trie) Delete(key []byte) bool {


When is it intended that we will call Delete? Specifically I'm trying to understand how it relates to the lifecycle of a given file.

We call delete, only after the file is successfully read and all the logs are emitted and it is added to the list of knownReaders.

Here's a case that doesn't make sense to me with this design:

First poll finds a file w/ "ABCDEF", add it to the trie, and start reading.

Second poll finds "ABCDEF" unchanged, but also finds another file w/ "ABC". HasKey("ABC") tells us that this file is not currently being read, so we call Put("ABC") but this has no effect on the trie because "ABCDEF" is already there.

Then "ABCDEF" is fully read, so we add it to knownReaders and call Delete("ABCDEF"). This effectively deletes both "ABCDEF" and "ABC".

A third poll now finds "ABC" again and HasKey("ABC") tells us it is not currently being read, so we start reading it again. Now we have two active readers for the same file resulting in duplicated logs.

This seems to indicate that we need to make use of values on nodes:

Put("ABC") should store a value on "C"

HasKey("ABC") should return true if "C" exists and has a value (regardless of whether it is a leaf)

However, this still may not be enough. Another scenario:

First poll finds both "ABC" and "ABCDEF".

The order in which we process these files changes the result:

HasKey("ABC") => false => Put("ABC") => HasKey("ABCDEF") => true

Results in trie that contains "ABC" only

HasKey("ABCDEF") => false => Put("ABCDEF") => HasKey("ABC") => false => Put("ABC")

Results in trie that contains "ABC" and "ABCDEF"

I think the correct behavior depends on the context but it's not clear to me how best to model this. If we find "ABC" in the first poll and then "ABCDEF" in the next, the trie should have only "ABCDEF". However, if find them in the opposite order or within the same poll, I think we should treat these as separate files. Do you have any ideas how to address this?

Ok, this one is a unique situation.
Let us discuss how current file reading works.
So what happens in the current implementation:

It will loop through all the file paths.

It will create fingerprints, then check for duplicates.

While checking duplicates, it will match both ways.
It will match;"ABCDEF" and "ABCD"; and; "ABCD" and "ABCDEF" if that's true, it will treat them as duplicates.
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/stanza/fileconsumer/file.go#L228

Now to handle your scenario, we can use values in trie or we can go as per current implementation.

I have one scenario to confirm:
if we have A - B - C (true) - D - E (true) - F - X -Y (true)

if we try to add, A - B - C - D - X to the above trie, we shouldn't add a duplicate entry but push the value at C down. Trie would become

A - B - C - D - E (true) - F - X -Y (true) | X (true)

if we try to add A - B - C - D to the above trie, we shouldn't add a duplicate entry but push the value at C down. Trie would become A - B - C - D (true) - E (true) - F - X -Y (true)

if we try to add, A - B - C - D - E - F - X - Y- Z to the above trie, we shouldn't add a duplicate entry, but push the value at E down. Trie would become A - B - C (true) - D - E(true) - F - X - Y - Z(true)

In other words, we are pushing the last value encountered while traversing the current path?

I think that makes sense. At least I can't think of a scenario where that ends up giving us the wrong result.

Okay, I pushed the changes.

@djaglowski

…erations

djaglowski · 2023-07-28T18:59:10Z

@VihasMakwana, I had a hard time correlating the existing tests with the real world scenarios we've discussed, so I ended up writing an additional framework for validating operation sequences that may need to run on the trie. I opened a PR against your branch here.

I found a few failures in the basic operations.

I also found another aspect of the design which is unclear: Now that we've updated the Put function to update or "push" a fingerprint further into the trie, it's unclear when we would call it. I'd appreciate your thoughts on this.

VihasMakwana · 2023-07-28T20:12:55Z

Okay, I feel like we're finally on same page. Just need to think about this 'push' thing.
Will go through your pr.

Add tree-based test suite that explores many possible sequences of operations

VihasMakwana · 2023-07-28T20:52:49Z

@djaglowski We currently can push down the values, but the way we handled deletion in the thread pool was as follows:

Add a fingerprint to trie
Send reader and fingerprint to the threadpool
Delete fingerprint after emitting.

In our current scenario, we updated the trie and pushed the values. How do we handle deletion here? Since we have pushed the value down.
If we delete the fingerprint that we sent earlier in the threadpool, it will be a no-op.
The pushed value might as well stay in trie forever.

Deleting the fingerprint from trie occurred in the same goroutine in threadpool PR

VihasMakwana · 2023-07-28T21:04:03Z

Maybe we can use something like this, with some extra bookkeeping:

An important assumption:

There is a 1-1 relationship between readers and open fingerprints.

Instead of storing a mere boolean in value, we can assign each open reader a unique ID. And store that in the trie.
So when we push the value down the trie, we can actually know to which reader it belongs.
We can add one more field to the trie nodes, parent. And then we can delete nodes iteratively.
We would need hashmaps for this.

Do you get what I'm saying?

VihasMakwana · 2023-07-28T21:05:17Z

I understand this is complex, but as pushing the values is required to avoid data loss, this or something like this is required.

VihasMakwana · 2023-07-30T17:11:23Z

@djaglowski I was able to fix your test case failures in almost all of your test cases. A couple of questions were answered, let's focus on fixing them once we agree on the design.

I tweaked the Delete and added a new helper.
The helper allows use to delete the node and travel upwards in the trie. We just provide the node we're interested in deleting and it does the work. This will be the most crucial one for the threadpool PR.

djaglowski

Instead of storing a mere boolean in value, we can assign each open reader a unique ID. And store that in the trie.
So when we push the value down the trie, we can actually know to which reader it belongs.

I think it would be ideal if the trie does not need to have additional information within it, so we can test/maintain it in isolation.

WDYT of my idea to switch to Get() []byte (or Get() int) and store the length of the registered fingerprint on the reader? This is a tiny amount of additional info, only added to the reader, and should be enough to properly call Delete when the reader is done.

djaglowski · 2023-08-01T13:36:52Z

pkg/stanza/fileconsumer/internal/trie/trie_test.go

+					"Found:ABCDEF": opTree{ // Next poll finds ABCDEF
+						ops: []testOp{
+							has("ABCDEF", true, "recognize ABC w/ DEF appended"),
+							put("ABCDEF"), // TODO HasKey returns true, so how do we know to call this?


Regarding this TODO, do we need a way to differentiate prefix matches from exact matches? Otherwise I'm not sure we we can take advantage of the "push" functionality in Put.

We could switch back to Get and return the value that "matched" the key. The caller can interpret whether or not it was what they passed as a parameter or just a prefix. WDYT?

Yeah we can do that.
Also, If we push the exact matches then it would make no difference to trie.

If we push the exact matches then it would make no difference to trie.

In most situations I'd be ok with duplicating the Put but this package is optimized wherever possible so I'd prefer to err on the side of performance here.

djaglowski · 2023-08-01T13:49:00Z

pkg/stanza/fileconsumer/internal/trie/trie_test.go

+							"DeleteAs:ABC": opTree{ // Done reading the file, remove it as ABC
+								ops: []testOp{
+									del("ABC", false, "should have been pushed down when ABCDEF was added"),
+									has("ABCDEF", true, "should not have been deleted"),
+								},
+								continuations: map[string]opTree{
+									"DeleteAs:ABCDEF": opTree{ // Also remove it as ABCDEF
+										ops: []testOp{
+											del("ABCDEF", true, "just confirmed it exists"), // TODO this fails to delete
+										},
+									},
+								},
+							},


I agree w/ your assessment here that we only need to delete ABCDEF, so we can remove this case.

The main thing here is that we need to know what to pass to Delete. I think my suggestion of switching back to Get could work here because we can save a value on the reader that indicates what to pass to Delete when the time comes.

djaglowski · 2023-08-01T13:50:52Z

pkg/stanza/fileconsumer/internal/trie/trie_test.go

+		// Below won't necessarily be false.
+		// A - B(true) - C - D (true)
+		// If we delete ABCD in above trie, the trie becomes A - B(true)
+		// ABCD would still match, as AB \w CD appended
+		// assert.Falsef(t, trie.HasKey([]byte(key)), "called Delete(%s) but HasKey(%s) is still true", key, key)


Good point.

Not that we would do it just for tests, but I think this is another benefit of switching back to Get() []byte, because we can ensure there is no longer an exact match.

VihasMakwana · 2023-08-01T16:01:45Z

Instead of storing a mere boolean in value, we can assign each open reader a unique ID. And store that in the trie.
So when we push the value down the trie, we can actually know to which reader it belongs.

I think it would be ideal if the trie does not need to have additional information within it, so we can test/maintain it in isolation.

WDYT of my idea to switch to Get() []byte (or Get() int) and store the length of the registered fingerprint on the reader? This is a tiny amount of additional info, only added to the reader, and should be enough to properly call Delete when the reader is done.

Would length be sufficient enough? Can you please elaborate.
If we have the fingerprint, it would be sufficient I guess

djaglowski · 2023-08-01T16:27:05Z

Would length be sufficient enough? Can you please elaborate.
If we have the fingerprint, it would be sufficient I guess

I'm figuring we would store this length value on the reader, which also has the fingerprint stored, so we can do roughly: Delete(reader.Fingerprint.FirstBytes[0:reader.registeredLength])

VihasMakwana · 2023-08-01T19:53:44Z

I see.
So, a combination of both.
I will revert the current changes I made, adding a parent and so on.
Let's go with above approach and keep trie as Isolate as possible

VihasMakwana · 2023-08-02T06:32:02Z

But still, when we push the value, we have to update the new length in registered reader. How would we do that? Ig still need a way to determine the reader that the fingerprint belongs to.

VihasMakwana · 2023-08-07T10:39:19Z

@djaglowski
I think of two ways from here:

Go with the Delete(reader.Fingerprint.FirstBytes[0:reader.registeredLength]).
- But, when we push some value down the TRIE, we must also update registeredLength stored on the reader.
- In other words, we need to identify the reader to which the fingerprint belongs.
Go with my DeleteNode(node).
- In this approach, we store the last node in our fingerprint path on the reader
  - For eg. ReaderA has ABCDE as its fingerprint, so we would store Node - E on our reader.
- So, when we want to delete the field, we just call DeleteNode(reader.trieNode)

In both of the above cases, I think that storing the reader info is required.
As a result, I have created a PR.

Just to be flexible, I have used type generics. All your test cases are also passing.
If you can take a look ;) Any suggestions are welcomed.

djaglowski · 2023-08-07T14:17:41Z

@VihasMakwana, in the interest of simplicity and to hopefully regain some momentum on this effort, should we reconsider the need to "push" fingerprints down?

I believe there is some benefit to the behavior, because our understanding of small files is updated more data is appended. However, small files should be read quickly and removed from the trie anyways, so perhaps it is not worth the additional complexity.

Perhaps we can remove this "push" functionality and revisit it later. The question in my mind is whether we have identified any specific cases which rely on this behavior. WDYT?

VihasMakwana · 2023-08-07T16:24:55Z

@djaglowski that is good by me. I can revert that part and push, so you can take a look?

djaglowski · 2023-08-09T14:43:47Z

Closed in favor of #24982

VihasMakwana requested a review from a team June 25, 2023 10:07

VihasMakwana requested a review from djaglowski as a code owner June 25, 2023 10:07

github-actions bot assigned TylerHelmuth Jun 25, 2023

github-actions bot added the pkg/stanza label Jun 25, 2023

VihasMakwana force-pushed the add-trie-fingerprints branch from 27a30ab to d8fe374 Compare June 25, 2023 14:22

djaglowski reviewed Jun 26, 2023

View reviewed changes

djaglowski added the Skip Changelog PRs that do not require a CHANGELOG.md entry label Jun 26, 2023

djaglowski changed the title ~~Add trie and test cases~~ [pkg/stanza/fileconsumer] Add trie and test cases Jun 26, 2023

Add trie and test cases

057c6fc

VihasMakwana force-pushed the add-trie-fingerprints branch from d8fe374 to 057c6fc Compare June 28, 2023 08:13

VihasMakwana requested a review from djaglowski June 30, 2023 10:25

VihasMakwana force-pushed the add-trie-fingerprints branch 2 times, most recently from a54ab7f to c1e4d96 Compare June 30, 2023 10:35

Remove value, add some more test cases.

7f177f9

VihasMakwana force-pushed the add-trie-fingerprints branch from c1e4d96 to 7f177f9 Compare June 30, 2023 11:45

VihasMakwana commented Jul 10, 2023

View reviewed changes

pkg/stanza/fileconsumer/trie.go Outdated Show resolved Hide resolved

djaglowski reviewed Jul 10, 2023

View reviewed changes

chore: move trie to internal

5d13aa4

VihasMakwana requested a review from djaglowski July 11, 2023 07:01

Vihas Splunk added 4 commits July 19, 2023 18:28

chore: change trie and restructure test cases.

99c8bf5

add delete test cases

8937f8a

fix: fix edge cases

c1bfbe6

chore: handle *pushing* down in Put()

412b58f

Add tree-based test suite that explores many possible sequences of op…

6300e57

…erations

Merge pull request #2 from djaglowski/trie-test-tree

4c918c7

Add tree-based test suite that explores many possible sequences of operations

chore: add a new helper and store parent

4e3244c

VihasMakwana force-pushed the add-trie-fingerprints branch from fd12e70 to cfb85ff Compare July 30, 2023 17:11

fix: minor bug which causes Delete to return true

1df62b3

VihasMakwana force-pushed the add-trie-fingerprints branch from cfb85ff to 1df62b3 Compare July 30, 2023 17:13

djaglowski reviewed Aug 1, 2023

View reviewed changes

djaglowski closed this Aug 9, 2023

[pkg/stanza/fileconsumer] Add trie and test cases #23665

[pkg/stanza/fileconsumer] Add trie and test cases #23665

Conversation

VihasMakwana commented Jun 25, 2023

djaglowski left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VihasMakwana commented Jun 27, 2023

djaglowski commented Jun 28, 2023

VihasMakwana commented Jun 30, 2023

VihasMakwana commented Jul 7, 2023

VihasMakwana commented Jul 10, 2023

djaglowski commented Jul 10, 2023

VihasMakwana commented Jul 10, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VihasMakwana Jul 12, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VihasMakwana Jul 24, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

djaglowski commented Jul 28, 2023

VihasMakwana commented Jul 28, 2023

VihasMakwana commented Jul 28, 2023 • edited Loading

VihasMakwana commented Jul 28, 2023

VihasMakwana commented Jul 28, 2023

VihasMakwana commented Jul 30, 2023

djaglowski left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

VihasMakwana commented Aug 1, 2023

djaglowski commented Aug 1, 2023 • edited Loading

VihasMakwana commented Aug 1, 2023

VihasMakwana commented Aug 2, 2023

VihasMakwana commented Aug 7, 2023 • edited Loading

djaglowski commented Aug 7, 2023

VihasMakwana commented Aug 7, 2023

djaglowski commented Aug 9, 2023

VihasMakwana Jul 12, 2023 •

edited

Loading

VihasMakwana Jul 24, 2023 •

edited

Loading

VihasMakwana commented Jul 28, 2023 •

edited

Loading

djaglowski left a comment •

edited

Loading

djaglowski commented Aug 1, 2023 •

edited

Loading

VihasMakwana commented Aug 7, 2023 •

edited

Loading