Merge branch 'layer5io:master' into master
Muhammed-Moinuddin authored Jul 24, 2024
2 parents a7291a3 + 816f6e5 commit e9629b1
Showing 6 changed files with 1,176 additions and 3 deletions.

[Two changed files could not be displayed.]

130 changes: 130 additions & 0 deletions src/collections/blog/2024/05-12-cost-of-observability/index.mdx
@@ -0,0 +1,130 @@
---
title: The Observability Arms Race
subtitle: Logs vs. Traces – A Cost vs. Clarity Showdown
date: 2024-05-31
author: Lee Calcote
thumbnail: ./comparative-cost-of-observability-signals.png
darkthumbnail: ./comparative-cost-of-observability-signals.png
description: "Logs vs. Traces – A Cost vs. Clarity Showdown"
type: Blog
category: Observability
tags:
- Engineering
- Observability
featured: false
published: true
---

import { BlogWrapper } from "../../Blog.style.js";
import { Link } from "gatsby";
import comparativeCost from "./comparative-cost-of-observability-signals.png";
import Blockquote from "../../../../reusecore/Blockquote";

<BlogWrapper>

The world of software development is embroiled in a silent war – the Observability Arms Race. On one side, verbose logs, relics of our debugging days, threaten to devour production budgets. On the other, traces, the Sherlock Holmes of observability, offer targeted insights but come with their own complexities.

Each signal is collected and instrumented differently. Each also carries different resource costs to obtain, store, and analyze, while providing a different view of the same system. Choosing between them, or adopting all of them, is a game of trade-offs, like every other task in engineering. In the next sections, we’ll help you make this decision by digging deeper into each signal, starting with the most popular ones: metrics, logs, and traces, and then two emerging signals: application profiles and crash dumps.

**The Logpocalypse Looms:**

Remember that single log statement you sprinkled in development? Imagine it bombarded with millions of hits daily in production. The cost implications are staggering. Are we clinging to outdated habits, drowning in a sea of irrelevant data?

<Blockquote quote="Log levels are a Band-Aid, not a cure." person="Lee Calcote" />

Log levels offer some relief, silencing the noise in production. But what if that critical error gets buried under an avalanche of info logs? Is there a risk of sacrificing valuable insights for a false sense of cost control?
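
Log levels also don't have to be enforced inside each service alone. A minimal sketch of one central alternative, assuming an OpenTelemetry Collector sits between your services and your log backend (the receiver and exporter names here are illustrative): drop low-severity records in the pipeline, before they are stored and billed.

```yaml
# Drop DEBUG/INFO log records in the pipeline; WARN and above pass through.
processors:
  filter/severity:
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_WARN'

service:
  pipelines:
    logs:
      receivers: [otlp]                 # illustrative receiver
      processors: [filter/severity]
      exporters: [otlphttp]             # illustrative backend exporter
```

The trade-off from above still applies: records dropped here are gone for good, so this is cost control, not a cure.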

**Traces: A Targeted Approach Emerges:**

Traces offer a compelling alternative. They map a request's journey through your system, meticulously recording events and their connections. This targeted approach inherently generates less data, making them potentially cost-effective.

Tracing offers the ability to granularly track request segments (spans) as a request is processed across various services. It is hard to introduce later, in part because the third-party libraries an application uses must also be instrumented. The cost of distributed tracing can be high, so most service mesh tracing systems employ some form of sampling to capture only a portion of the observed traces. When traces are sampled, performance overhead and storage costs are reduced, but so is your visibility. The sampling rate (typically expressed as a percentage of service request volume) is therefore a balance between how often traces are captured and how much overhead you are willing to accept.
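
To make the percentage framing concrete, here is a minimal sketch of fixed-rate, head-based sampling using the OpenTelemetry Collector's `probabilistic_sampler` processor; the 10% rate is an arbitrary example:

```yaml
# Keep roughly 10% of traces, selected consistently by trace ID so that
# all spans belonging to a sampled trace are kept together.
processors:
  probabilistic_sampler:
    sampling_percentage: 10
```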

**Figure: Cost vs. Value Spectrum**

<img src={comparativeCost} className="image-center-shadow" style={{width: "70%"}} />

_Figure: a comparative spectrum of value provided by and cost of each pillar._

The figure above presents a visual spectrum acknowledging that collecting telemetry in production is a compromise between cost, in terms of storage and performance overhead (CPU, memory, and request latency), and the value of the information collected, typically in terms of how expressive it is, or how useful it is in solving the puzzles of slow or errored responses.
It’s argued that metrics provide the highest value for the smallest investment. Given that some service meshes facilitate distributed tracing, an argument can be made that distributed tracing provides the highest value and can be achieved at low investment (relative to the amount of insight provided). Ideally, your instrumentation lets you dial back verbosity levels and sampling rates, giving you control over overhead costs versus desired observability.


We won't dive into metrics as a signal in this post, but we acknowledge their use and validity in the conversation. Metrics represent point-in-time observations of the state of a system. This differs from logs and traces, which record information about individual events.

### The Power of Smart Sampling

Not all traces are created equal. Simple sampling techniques can leave you blindsided. The key lies in dynamic/adaptive tail sampling – a detective sniffing out the real clues in a crime scene, focusing on the most valuable traces.

**The Power of Smart Sampling with Jaeger**

We've established that simple, "dumb" sampling techniques for traces can leave you blindsided. Imagine a scenario in Jaeger, a popular distributed tracing platform, where every trace has a 10% chance of being captured, regardless of content. Crucial errors with low occurrence might slip through the cracks.

This is where dynamic/adaptive tail sampling, a champion feature of Jaeger, enters the scene, offering a more intelligent approach. Here's how it works within the Jaeger ecosystem:

* **Jaeger's Collector Takes Center Stage:** As traces flow into your Jaeger collector, it doesn't make a snap decision about sampling. Instead, it analyzes key characteristics of each trace using Jaeger's robust sampling framework. This might include:
  * **Span Duration:** Did a particular span within the trace take an exceptionally long time, indicating a potential bottleneck in a specific service?
  * **Error Tags:** Did any spans within the trace contain error tags, suggesting something went wrong?
  * **Service Names:** Was a specific service involved in the trace that's known to be error-prone based on historical data?

* **Prioritizing the Catch:** Based on this analysis, the Jaeger collector assigns a sampling probability to each trace. Traces with characteristics suggesting valuable insights (e.g., slow spans, errors) have a higher chance of being captured, while common, routine traces might be sampled at a lower rate (one concrete configuration is sketched below).
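
Jaeger's own adaptive sampling is configured through its remote-sampling API; for the content-aware, whole-trace decisions described above, one common concrete implementation is the OpenTelemetry Collector's `tail_sampling` processor placed in front of Jaeger. A hedged sketch, with illustrative policy names and thresholds:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s                  # buffer each trace before deciding
    policies:
      - name: keep-errors               # any error-status span keeps the trace
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-requests        # spans over 500 ms suggest a bottleneck
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline                  # sample 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

A match by any one policy is enough to keep the trace, which is exactly the "prioritize the catch" behavior described above.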

**The Jaeger Detective at Work:**

Think of Jaeger's dynamic/adaptive tail sampling as a detective sifting through a crime scene. They don't blindly collect every span within every trace. Instead, they prioritize clues based on their potential significance – error tags within a span get captured with a higher probability than a normal database access span in a healthy service.

This intelligent approach ensures you capture the valuable traces while keeping the cost under control. But even with Jaeger, there are good and bad practices to consider:

* **Bad Practice: Sampling Everything (Jaeger YAML configuration):**

```yaml
# This configuration captures EVERY trace, defeating the purpose of sampling.
sampler:
  type: probabilistic
  param: 1.0
```
* **Good Practice: Prioritize High-Value Traces (Jaeger YAML configuration):**

```yaml
# Sample a low baseline rate, but always capture the operations that matter
# most. (Head-based strategies decide before a trace completes, so they
# select by service and operation rather than by error status; see the tail
# sampling sketch above for error-aware decisions.)
default_strategy:
  type: probabilistic
  param: 0.1                 # Capture 10% of normal traces
service_strategies:
  - service: payment-service
    type: probabilistic
    param: 1.0               # Capture all traces for payment-service
    operation_strategies:
      - operation: processPayment
        type: probabilistic
        param: 1.0           # Always capture processPayment traces
```

**Jaeger Tracing Operator: Best Practices**

The Jaeger Tracing Operator simplifies deploying and managing Jaeger in Kubernetes environments. Here's how to leverage it with best practices:

1. **Install the Jaeger Operator:** Follow the official guide to install the Jaeger Operator in your Kubernetes cluster.
2. **Define Jaeger CRDs with a Sampling Strategy:** Use the Custom Resource Definitions (CRDs) provided by the operator to configure your Jaeger instance. Within the CRD YAML, define the sampling options as shown in the good-practice example above.

3. **Tailor Sampling for Specific Services:** The Jaeger Operator allows granular control over sampling strategies. You can define specific sampling probabilities for individual services within your application landscape, as sketched below.
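
A hedged sketch of such a CRD, assuming the Operator is already installed (the instance name is hypothetical, and `spec.sampling.options` mirrors Jaeger's remote-sampling strategies file):

```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-jaeger                     # hypothetical instance name
spec:
  strategy: production                # separate collector/query components
  sampling:
    options:
      default_strategy:
        type: probabilistic
        param: 0.1                    # 10% of traces by default
      service_strategies:
        - service: payment-service
          type: probabilistic
          param: 1.0                  # capture all payment-service traces
```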

**Beyond the Basics:**

Jaeger's dynamic/adaptive tail sampling offers a robust configuration system. You can define custom sampling strategies based on specific service names, tags, or even regular expressions within your traces. This granular control empowers you to tailor the sampling process to your unique application landscape.
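
As one example of that granularity, here is a hedged sketch of a regular-expression match expressed as a `tail_sampling` policy in the OpenTelemetry Collector (the attribute key and pattern are illustrative):

```yaml
# Always keep traces whose http.url span attribute matches a checkout route.
policies:
  - name: keep-checkout-routes
    type: string_attribute
    string_attribute:
      key: http.url
      values: ["/checkout/.*"]
      enabled_regex_matching: true
```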

**Jaeger: Ideal, or Not Quite?**

Jaeger is a powerful tool for distributed tracing, but it's not a one-size-fits-all solution. Here's a quick reflection:

* **Strengths:** Jaeger excels at cost-effective observability with its smart sampling and ease of use.
* **Considerations:** While Jaeger offers flexibility, configuration can become complex for intricate setups. Additionally, Jaeger itself requires some infrastructure overhead.

**The Takeaway:**

By embracing Jaeger's smart sampling techniques and leveraging the Jaeger Tracing Operator for best-practice deployments, you can move beyond the reactive "log everything" mentality and enter a new era of cost-conscious, insightful monitoring. Remember, the ideal tool depends on your specific needs. Explore your options – Jaeger might be your perfect partner, or it might be the first step on your journey to finding the ideal distributed tracing solution for you.

**The Future of Observability Beckons:**

The future of observability isn't about logs versus traces. It's about embracing a proactive approach. **It's time to move beyond the reactive "log everything" mentality.** By adopting smarter practices, we can create a future where observability doesn't break the bank. Let's ditch the cost-prohibitive logpocalypse and unlock the power of targeted, insightful monitoring. The choice is clear.

</BlogWrapper>
4 changes: 2 additions & 2 deletions src/collections/members/suraj-jadhav/index.mdx
@@ -8,6 +8,6 @@ linkedin: suraj-g-jadhav
layer5: dabb4156-08df-474b-b43a-36b39725f165
location: India
bio: Suraj is a full-stack developer specialized in React, JavaScript, TypeScript, and Node.js. He is passionate about crafting innovative solutions and thrives in dynamic environments where he can contribute to the development of cutting-edge features with a focus on reliability and efficiency. Suraj also enjoys contributing to open source projects, which allows him to give back to the community while honing his skills.
-status: Active
+status: Inactive
published: true
----
+---
2 changes: 1 addition & 1 deletion src/sections/Community/Member-single/memberSingle.style.js
@@ -30,7 +30,7 @@ export const MemberSingleWrapper = styled.div`
font-size: 1.75rem;
line-height: 1.75rem;
color: ${(props) => props.theme.primaryColor};
-text-align: right;
+text-align: left;
font-weight: 400;
}
ul {
