Migrate monotonic counter metrics to u64_counter! #6350

Open
wants to merge 20 commits into
base: dev

goto-bus-stop
Member

This is part 1 of 4 or 5 (?) in a series of changes to move over to our new telemetry macros.

This part does the easiest bit :): the tracing::info!(monotonic_counter.) calls that should now use u64_counter!(). I used a lot of commits so I could take notes while doing the work; I'll copy them to PR comments so there's no need to look at each commit individually.

I avoided breaking changes to the metrics for now, so this is targeted at 1.x.

Two uses of tracing::info!(monotonic_counter.) remain; these are already being addressed in #6338.
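To make the shape of the migration concrete, here is a minimal, self-contained sketch. It does not use the router's real `u64_counter!` macro: `Registry` and `counter_sketch!` are hypothetical stand-ins written for this illustration. The diff excerpts in this thread only show the first two arguments (metric name and description), so the trailing value and attribute arguments below are assumptions about the call shape.

```rust
use std::collections::BTreeMap;

/// Key for one counter series: metric name plus sorted attribute pairs.
type SeriesKey = (String, Vec<(String, String)>);

/// Hypothetical in-memory stand-in for a metrics backend (not the router's).
#[derive(Default)]
pub struct Registry {
    counters: BTreeMap<SeriesKey, u64>,
}

impl Registry {
    fn key(name: &str, attrs: &[(&str, &str)]) -> SeriesKey {
        let mut pairs: Vec<(String, String)> = attrs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        pairs.sort();
        (name.to_string(), pairs)
    }

    /// Adds `value` to the series identified by `name` and `attrs`.
    pub fn add(&mut self, name: &str, attrs: &[(&str, &str)], value: u64) {
        *self.counters.entry(Self::key(name, attrs)).or_insert(0) += value;
    }

    /// Reads the current total for one series (0 if never incremented).
    pub fn get(&self, name: &str, attrs: &[(&str, &str)]) -> u64 {
        self.counters.get(&Self::key(name, attrs)).copied().unwrap_or(0)
    }
}

/// Toy macro: explicit name and description literals, then a value and
/// optional `"key" = "value"` attributes (the last two are assumed here).
macro_rules! counter_sketch {
    ($registry:expr, $name:literal, $description:literal, $value:expr
     $(, $key:literal = $val:expr)* $(,)?) => {{
        // A real backend would attach the description to the instrument.
        let _description: &str = $description;
        $registry.add($name, &[$(($key, $val)),*], $value)
    }};
}

/// Old form, as quoted in this PR:
/// `tracing::info!(monotonic_counter.apollo_router_timeout = 1u64,);`
pub fn record_timeout(registry: &mut Registry) {
    counter_sketch!(
        registry,
        "apollo_router_timeout",
        "Number of timed out client requests",
        1
    );
}

/// Old form carried a tracing field: `mode = %"passthrough"`.
pub fn record_deduplicated_subscription(registry: &mut Registry) {
    counter_sketch!(
        registry,
        "apollo_router_deduplicated_subscriptions_total",
        "Total deduplicated subscription requests (deprecated)",
        1,
        "mode" = "passthrough"
    );
}
```

The point of the sketch is that the metric name stops being an identifier path inside a log macro and becomes a plain string next to its description, which is also where the "(deprecated)" markers discussed in the review end up.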


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing [3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

Notes:
- Fixed a typo in the not found attribute:
  `persisted_quieries.not_found` -> `persisted_queries.not_found`.
- Added description, it would be useful for someone to check it. It
  reads to me like *every* request is measured?
Notes:
- Removes the `apollo_router_deduplicated_subscriptions_total` metric.
  This is already captured by `apollo.router.operations.subscriptions`
  in the `subscriptions.deduplicated` attribute.
- The `apollo.router.operations.batching` metric appears to use an older
  style of attribute naming?
Notes:
- The description for `apollo_router_skipped_event_count` may not
  entirely be correct?
Notes:
- This combined a log message and a metric: now they are separate.
@goto-bus-stop goto-bus-stop requested review from a team as code owners November 27, 2024 14:05
@svc-apollo-docs
Collaborator

svc-apollo-docs commented Nov 27, 2024

✅ Docs Preview Ready

No new or changed pages found.

Contributor

@goto-bus-stop, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@router-perf

router-perf bot commented Nov 27, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

@@ -210,7 +210,7 @@ where
 Response: Send + 'static + Debug,
 TransformedResponse: Send + 'static + Debug,
 {
-let query = query_name::<Query>();
+let query_name = query_name::<Query>();
Member Author

Just to clarify that it isn't the full query text.

-mode = %"passthrough",
+u64_counter!(
+"apollo_router_deduplicated_subscriptions_total",
+"Total deduplicated subscription requests (deprecated)",
Member Author

I marked a few apparently duplicate or poorly named metrics as deprecated. This one appears to be a less informative duplicate of apollo.router.operations.subscriptions, which has a subscriptions.deduplicated attribute.
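As a rough illustration of why the dedicated counter looks redundant, the sketch below records both metrics side by side and shows that filtering the attribute-carrying metric recovers the same total. The `Point` type, the helper names, and the assumption that both metrics are incremented at the same point (with the attribute stored as the strings "true" and "false") are hypothetical and only serve the example.

```rust
/// One recorded increment of a counter (hypothetical representation).
pub struct Point {
    pub name: &'static str,
    pub attrs: Vec<(&'static str, &'static str)>,
    pub value: u64,
}

/// Sums all increments of `name`, optionally restricted to one attribute pair.
pub fn total(points: &[Point], name: &str, attr: Option<(&str, &str)>) -> u64 {
    points
        .iter()
        .filter(|p| p.name == name)
        .filter(|p| {
            attr.map_or(true, |wanted| {
                p.attrs.iter().any(|a| a.0 == wanted.0 && a.1 == wanted.1)
            })
        })
        .map(|p| p.value)
        .sum()
}

/// Assumed recording site: every subscription increments the general metric,
/// and deduplicated ones also increment the deprecated dedicated counter.
pub fn record_subscription(points: &mut Vec<Point>, deduplicated: bool) {
    let flag = if deduplicated { "true" } else { "false" };
    points.push(Point {
        name: "apollo.router.operations.subscriptions",
        attrs: vec![("subscriptions.deduplicated", flag)],
        value: 1,
    });
    if deduplicated {
        points.push(Point {
            name: "apollo_router_deduplicated_subscriptions_total",
            attrs: vec![("mode", "passthrough")],
            value: 1,
        });
    }
}
```

Under those assumptions, the deprecated total is always equal to the general metric filtered on the deduplicated attribute, while the general metric additionally gives the overall count.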

Contributor

Oh! Good catch. It makes sense.

-monotonic_counter.apollo_require_authentication_failure_count = 1u64,
+u64_counter!(
+"apollo_require_authentication_failure_count",
+"Number of unauthenticated requests (deprecated)",
Member Author

I marked this as deprecated because the naming doesn't seem to follow convention, but I don't know if we have a proper alternative for this? We have apollo.router.operations.authorization below but I'm not sure it reports the same thing.

Contributor

I think that's the right thing to do. cc @BrynCooke you probably have more context

Contributor

@bnjjj bnjjj left a comment

For the metrics you marked as deprecated in their descriptions, it would be worth also documenting them as deprecated in the docs; we already have a deprecated section. I know that's also part of another ticket, but we are both working on deprecating different metrics, so to avoid forgetting any of them I think it's worth documenting them directly.

apollo-router/src/notification.rs (outdated, resolved)
);
u64_counter!(
"apollo.router.operations.jwt",
"Number of requests with JWT authentication",
Contributor

I'm wondering whether we should add authentication.jwt.failed = false in the future (for 2.0), to be consistent with what sigv4 is doing, for example.

apollo-router/src/plugins/authentication/subgraph.rs (two outdated, resolved threads)

-tracing::info!(monotonic_counter.apollo_router_timeout = 1u64,);
+u64_counter!(
+"apollo_router_timeout",
+"Number of timed out client requests",
Contributor

You can mark it as deprecated

Member Author

@goto-bus-stop goto-bus-stop Nov 28, 2024

The alternative is to use a custom instrument, right? So that I can document it in the deprecated metrics section

Contributor

telemetry:
  instrumentation:
    instruments:
      router:
        http.server.request.duration:
          # Adding subgraph name, response status code from the router and the operation name
          attributes:
            http.response.status_code: true # If status code == 504 then it's a timeout at the router http request level
            graphql.operation.name:
              operation_name: string
      subgraph:
        # Adding subgraph name, response status code from the subgraph and original operation name from the supergraph
        http.client.request.duration:
          attributes:
            subgraph.name: true
            http.response.status_code: # If status code == 504 then it's a timeout at the subgraph http request level
              subgraph_response_status: code

-monotonic_counter.apollo.router.graphql_error = 1u64,
+u64_counter!(
+"apollo.router.graphql_error",
+"Number of GraphQL error responses returned by the router",
Contributor

Suggested change
-"Number of GraphQL error responses returned by the router",
+"Number of GraphQL error responses returned by the router (DEPRECATED)",

-tracing::info!(monotonic_counter.apollo.router.graphql_error = count,);
+u64_counter!(
+"apollo.router.graphql_error",
+"Number of GraphQL error responses returned by the router",
Contributor

Suggested change
-"Number of GraphQL error responses returned by the router",
+"Number of GraphQL error responses returned by the router (DEPRECATED)",

Member Author

Ah really? I thought we quite recently added more of these (though the metric name is definitely not good). Is there a replacement?

Contributor

@bnjjj bnjjj Nov 28, 2024

telemetry:
  instrumentation:
    instruments:
      router:
        http.server.request.duration:
          attributes:
            http.response.status_code: true
            graphql.operation.name:
              operation_name: string
            # This attribute will be set to true if the response contains graphql errors
            graphql.errors: # This will be true if it contains graphql errors
              on_graphql_error: true
      supergraph:
        supergraph.response.errors:
          type: counter
          value: event_unit
          description: number of request containing graphql errors and the error extension code
          unit: req
          attributes:
            graphql.operation.name: true
            errors.code:
              response_errors: $[0].extensions.code
          condition:
            exists:
              response_errors: $.*.extensions.code

Here is the alternative
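For readers less familiar with the selector syntax, here is a small sketch of one way to read the `supergraph.response.errors` instrument above: the condition `$.*.extensions.code` is taken to mean "any error has an extension code", the `errors.code` attribute is taken from the first error only (`$[0]`), and `event_unit` is taken to mean one increment per matching response. That reading, along with the `GraphqlError` type and the function names, is an assumption made for illustration and is not taken from the router's implementation.

```rust
/// Minimal stand-in for one entry of a response's `errors` array.
pub struct GraphqlError {
    pub message: String,
    /// Mirrors the optional `extensions.code` field.
    pub extensions_code: Option<String>,
}

/// Condition `exists: response_errors: $.*.extensions.code`:
/// true when at least one error carries an extension code.
pub fn should_count(errors: &[GraphqlError]) -> bool {
    errors.iter().any(|e| e.extensions_code.is_some())
}

/// Attribute `errors.code` from `response_errors: $[0].extensions.code`:
/// the first error's code, if that error has one.
pub fn errors_code_attribute(errors: &[GraphqlError]) -> Option<&str> {
    errors.first().and_then(|e| e.extensions_code.as_deref())
}

/// Assumed `event_unit` behavior: one increment per matching response.
pub fn count_matching(responses: &[Vec<GraphqlError>]) -> u64 {
    responses
        .iter()
        .filter(|errors| should_count(errors))
        .count() as u64
}
```

One consequence of this reading is that a response whose first error has no code, but whose later errors do, would still satisfy the condition while the `errors.code` selector finds nothing. Whether that matters depends on how the errors are ordered in practice.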

-code = code
+u64_counter!(
+"apollo.router.graphql_error",
+"Number of GraphQL error responses returned by the router",
Contributor

Suggested change
-"Number of GraphQL error responses returned by the router",
+"Number of GraphQL error responses returned by the router (DEPRECATED)",


@goto-bus-stop goto-bus-stop requested a review from a team as a code owner November 28, 2024 14:36

3 participants