Describe the bug
When there are ~10 seconds between two traces/spans being exported, it can happen that the exporter closes the connection after the next spans have been exported, but before the otel-collector responds with OK.
This is what it looks like on TCP-level:
//traces are exported. everything ok
1. 12:55:22 app->otel POST /v1/traces (protobuf-content a)
2. 12:55:22 otel->app OK
//a trace is exported directly followed by connection termination
3. 12:55:32 app->otel POST /v1/traces (protobuf-content b)
4. 12:55:32 otel->app ACK
5. 12:55:32 app->otel FIN
6. 12:55:32 otel->app OK
7. 12:55:32 otel->app FIN, ACK
//connection re-opened
8. 12:55:32 app->otel SYN
9. 12:55:32 otel->app SYN, ACK
10. 12:55:32 app->otel ACK
//traces are exported again
11. 12:55:32 app->otel POST /v1/traces (protobuf-content b)
12. 12:55:32 otel->app OK
Here is the actual tcp dump containing the example: /example/tcp_traffic_http.pcap
The server's OK in step 6 is probably dropped (in conformance with the HTTP standard). This results in a timeout followed by a retry of the export.
As the otel-collector apparently doesn't deduplicate the spans, this results in duplicate spans in the trace.
This is how the trace should look: trace_normal.json
This is how the trace looks when the issue occurs: trace_with_duplicates.json
From what I can see in the code, I assume this is a race condition where the connection shutdown is triggered but does not set the isShutdown flag before the next spans are exported. I imagine something like this is happening:
thread 1:
  // we didn't export anything for 10s. Let's close the connection
  VertxHttpSpanExport.shutdown()
thread 2:
  // receives new spans to export
  VertxHttpSpanExport.export()
    HttpExporter.export()
      if (isShutdown.get()) -> false
      // proceed exporting
thread 1:
  HttpExporter.shutdown()
    if (!isShutdown.compareAndSet(false, true))
    // proceed shutting down
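To make the suspected interleaving concrete, here is a minimal, self-contained Java sketch. It is not the actual Quarkus or OpenTelemetry code; the class and method names (RacyExporter, sendOverConnection, closeConnection) are made up, and only the check-then-act pattern on the AtomicBoolean mirrors the calls listed above.

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified exporter for illustration only (not the real HttpExporter).
class RacyExporter {
    private final AtomicBoolean isShutdown = new AtomicBoolean(false);

    // Called when spans are ready to be sent (thread 2 above).
    void export(byte[] spans) {
        if (isShutdown.get()) {     // thread 2 still reads false here
            return;
        }
        // Nothing stops shutdown() from closing the connection
        // between the check above and the send below.
        sendOverConnection(spans);  // POST /v1/traces
    }

    // Called when the connection is considered idle (thread 1 above).
    void shutdown() {
        if (!isShutdown.compareAndSet(false, true)) {
            return;                 // already shut down
        }
        closeConnection();          // sends FIN, possibly before the OK for an in-flight export arrives
    }

    private void sendOverConnection(byte[] spans) { /* omitted */ }

    private void closeConnection() { /* omitted */ }
}

Because the flag is only checked once at the start of export() and shutdown() does not wait for exports that have already passed that check, the FIN in step 5 can be sent between the POST in step 3 and the OK in step 6, which is what the capture shows.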
Expected behavior
The exporter should not export the same traces twice to the otel-collector.
Actual behavior
The exporter sometimes (probably due to a race condition) closes the connection to the otel-collector right after spans were exported, but before the otel-collector could respond with OK. It then reopens the connection and sends the spans a second time.
From the chat:
"I guess we are sending the FIN before the span b ACK from the collector, therefore marking it as not exported and trying again later.
Looks like the 1st ACK for span b is lost."
I guess there shouldn't be a FIN before a timeout waiting for the ACK to span b. @geoand, I wonder if you have some spare cycles to look at this one?
How to Reproduce?
Reproducer: https://github.com/arn-ivu/reproducer-duplicate-spans
Requirements:
Steps to reproduce:
./mvnw compile quarkus:dev
Find duplicate Spans in grafana:
{resource.service.name="reproducer-duplicate-spans"} | count() > 2
Monitor the tcp-traffic with wireshark
- check the Opentelemetry Port OTEL_PORT
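For example, assuming the collector listens on the conventional OTLP ports 4318 (HTTP) and 4317 (gRPC), and adjusting to whatever OTEL_PORT resolves to in the reproducer, a Wireshark display filter such as the following narrows the capture to the exporter traffic:
tcp.port == 4318 || tcp.port == 4317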
Output of uname -a or ver:
Linux e5808115ef08 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 Linux
Output of java -version:
openjdk version "22.0.1" 2024-04-16
OpenJDK Runtime Environment Temurin-22.0.1+8 (build 22.0.1+8)
OpenJDK 64-Bit Server VM Temurin-22.0.1+8 (build 22.0.1+8, mixed mode, sharing)
Quarkus version or git rev
3.17.2
Build tool (i.e. output of mvnw --version or gradlew --version):
Apache Maven 3.9.9
Additional information
While I have analyzed the problem mostly for quarkus.otel.exporter.otlp.protocol=http/protobuf, I observed the same problem with quarkus.otel.exporter.otlp.protocol=grpc.
See https://github.com/arn-ivu/reproducer-duplicate-spans/tree/main/examples for the TCP traffic of both cases.
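For reference, a minimal application.properties sketch for switching between the two transports when reproducing. The protocol property is the one quoted above; the quarkus.otel.exporter.otlp.endpoint values are assumptions based on the conventional OTLP ports and may not match the reproducer's setup:

# HTTP variant (POST /v1/traces as seen in the capture)
quarkus.otel.exporter.otlp.protocol=http/protobuf
quarkus.otel.exporter.otlp.endpoint=http://localhost:4318

# gRPC variant (assumed endpoint)
#quarkus.otel.exporter.otlp.protocol=grpc
#quarkus.otel.exporter.otlp.endpoint=http://localhost:4317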