[improvement][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #23634

yyj8 · 2024-11-24T04:34:47Z

Main Issue: #xyz

PIP: #xyz

Motivation

In some special scenarios, when the broker service has a deadlock, it needs to be able to automatically recover instead of requiring manual intervention. For example, when the service is deployed in a customer environment, we cannot directly manage it. If the service has a deadlock, the k8s probe should return a failure because the service may be unavailable. The probe failure triggers a broker pod restart to resolve the deadlock.

Modifications

Add deadlock detection in the probe. If a deadlock exists, print the thread stack and return a service unavailable exception.

Verifying this change

Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (10MB)
Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository:
yyj8#10

…return a failure because the service may be unavailable

github-actions · 2024-11-24T04:36:06Z

@yyj8 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

lhotari

There's already a deadlock check in the health check:

pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/impl/BrokersBase.java

Lines 366 to 408 in c9de1bb

    
           @GET 
        
           @Path("/health") 
        
           @ApiOperation(value = "Run a healthCheck against the broker") 
        
           @ApiResponses(value = { 
        
               @ApiResponse(code = 200, message = "Everything is OK"), 
        
               @ApiResponse(code = 403, message = "Don't have admin permission"), 
        
               @ApiResponse(code = 404, message = "Cluster doesn't exist"), 
        
               @ApiResponse(code = 500, message = "Internal server error")}) 
        
           public void healthCheck(@Suspended AsyncResponse asyncResponse, 
        
                                   @ApiParam(value = "Topic Version") 
        
                                   @QueryParam("topicVersion") TopicVersion topicVersion) { 
        
               validateSuperUserAccessAsync() 
        
                       .thenAccept(__ -> checkDeadlockedThreads()) 
        
                       .thenCompose(__ -> internalRunHealthCheck(topicVersion)) 
        
                       .thenAccept(__ -> { 
        
                           LOG.info("[{}] Successfully run health check.", clientAppId()); 
        
                           asyncResponse.resume(Response.ok("ok").build()); 
        
                       }).exceptionally(ex -> { 
        
                           LOG.error("[{}] Fail to run health check.", clientAppId(), ex); 
        
                           resumeAsyncResponseExceptionally(asyncResponse, ex); 
        
                           return null; 
        
                       }); 
        
           } 
        
           private void checkDeadlockedThreads() { 
        
               ThreadMXBean threadBean = ManagementFactory.getThreadMXBean(); 
        
               long[] threadIds = threadBean.findDeadlockedThreads(); 
        
               if (threadIds != null && threadIds.length > 0) { 
        
                   ThreadInfo[] threadInfos = threadBean.getThreadInfo(threadIds, false, false); 
        
                   String threadNames = Arrays.stream(threadInfos) 
        
                           .map(threadInfo -> threadInfo.getThreadName() + "(tid=" + threadInfo.getThreadId() + ")").collect( 
        
                                   Collectors.joining(", ")); 
        
                   if (System.currentTimeMillis() - threadDumpLoggedTimestamp 
        
                           > LOG_THREADDUMP_INTERVAL_WHEN_DEADLOCK_DETECTED) { 
        
                       threadDumpLoggedTimestamp = System.currentTimeMillis(); 
        
                       LOG.error("Deadlocked threads detected. {}\n{}", threadNames, 
        
                               ThreadDumpUtil.buildThreadDiagnosticString()); 
        
                   } else { 
        
                       LOG.error("Deadlocked threads detected. {}", threadNames); 
        
                   } 
        
                   throw new IllegalStateException("Deadlocked threads detected. " + threadNames); 
        
               } 
        
           }

It also contains an example of how to check deadlocks.

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java

…return a failure because the service may be unavailable

lhotari · 2024-11-25T14:38:56Z

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java

+                ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
+                long[] threadIds = threadBean.findDeadlockedThreads();
+                if (threadIds != null && threadIds.length > 0) {
+                    ThreadInfo[] threadInfos = threadBean.getThreadInfo(threadIds, false,
+                            false);
+                    String threadNames = Arrays.stream(threadInfos)
+                            .map(threadInfo -> threadInfo.getThreadName()
+                                    + "(tid=" + threadInfo.getThreadId() + ")")
+                            .collect(Collectors.joining(", "));
+                    if (System.currentTimeMillis() - threadDumpLoggedTimestamp
+                            > LOG_THREADDUMP_INTERVAL_WHEN_DEADLOCK_DETECTED) {
+                        String diagnosticResult = ThreadDumpUtil.buildThreadDiagnosticString();
+                        log.error("Deadlock detected, service may be unavailable, "
+                                + "thread stack details are as follows: {}.", diagnosticResult);
+                        threadDumpLoggedTimestamp = System.currentTimeMillis();
+                    } else {
+                        log.error("Deadlocked threads detected. {}", threadNames);
+                    }


Since the status endpoint doesn't have authentication, it will be necessary to have a solution to prevent introducing a new DoS vulnerability where calling the status endpoint in a tight loop could introduce significant load to the system. One way would be to check that the deadlock check is executed only when there's more than 1 seconds from the previous check. If it's less than that, the previous result of the deadlock check would be reused.

Since the status endpoint doesn't have authentication, it will be necessary to have a solution to prevent introducing a new DoS vulnerability where calling the status endpoint in a tight loop could introduce significant load to the system. One way would be to check that the deadlock check is executed only when there's more than 1 seconds from the previous check. If it's less than that, the previous result of the deadlock check would be reused.

This is a very good proposal, and there is indeed a risk of being attacked by DoS. The code logic has been adjusted accordingly. Please help review the code again.

…e should return a failure because the service may be unavailable

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java

lhotari

Good work @yyj8. Some suggestions for field naming and simplifying the code comment.

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java

…e should return a failure because the service may be unavailable

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java

…e should return a failure because the service may be unavailable.

lhotari · 2024-11-27T07:36:49Z

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java

+        // Locking classes to avoid deadlock detection in multi-thread concurrent requests.
+        synchronized (VipStatus.class) {
+            if (System.currentTimeMillis() - lastCheckStatusTimestamp < CHECK_STATUS_INTERVAL) {
+                lastCheckStatusTimestamp = System.currentTimeMillis();


I think that this line should be removed since the purpose of the field value is to record the timestamp when an actual check was made and rate limit that.

I think that this line should be removed since the purpose of the field value is to record the timestamp when an actual check was made and rate limit that.

Your suggestion is not to need synchronization, right? If there is no code synchronization, will there be a massive number of DoS attacks that execute deadlock detection logic simultaneously, causing a surge in node load.

Your suggestion is not to need synchronization, right? If there is no code synchronization, will there be a massive number of DoS attacks that execute deadlock detection logic simultaneously, causing a surge in node load.

I'm not talking about synchronization. If lastCheckStatusTimestamp gets updated here, it breaks the logic.
The previous DoS concern was about calling the deadlock detection since that has a higher cost than before. Jetty has separate DoSFilter for adding basic DoS protection. In general Pulsar isn't protected against malicious DoS attacks and it's not meant to be exposed on the public internet. In this case, we just need to avoid adding a lot of extra cost compared to the previous version.

The original suggestion was to rate limit the deadlock check, however it's fine to also rate limit the file existence check. By removing this line, I believe that the logic would be fine. However, adding tests would be helpful to ensure this. There could be unit tests with mocking to ensure it. That might require further refactoring so that mocks could be properly injected.

I hope this clarifies: https://github.com/apache/pulsar/pull/23634/files#r1860206850

lhotari · 2024-11-27T08:32:44Z

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java

+                    throw new WebApplicationException(Status.SERVICE_UNAVAILABLE);
+                }
+            }
+            lastCheckStatusTimestamp = System.currentTimeMillis();


This should be the only location where lastCheckStatusTimestamp is updated.

The final global code logic content is as follows:

public class VipStatus { public static final String ATTRIBUTE_STATUS_FILE_PATH = "statusFilePath"; public static final String ATTRIBUTE_IS_READY_PROBE = "isReadyProbe"; // log a full thread dump when a deadlock is detected in status check once every 10 minutes // to prevent excessive logging private static final long LOG_THREADDUMP_INTERVAL_WHEN_DEADLOCK_DETECTED = 600000L; private static volatile long lastCheckStatusTimestamp; // Rate limit status checks to every 500ms to prevent DoS private static final long CHECK_STATUS_INTERVAL = 500L; private static volatile boolean lastCheckStatusResult; @Context protected ServletContext servletContext; @GET public String checkStatus() { // Locking classes to avoid deadlock detection in multi-thread concurrent requests. synchronized (VipStatus.class) { if (System.currentTimeMillis() - lastCheckStatusTimestamp < CHECK_STATUS_INTERVAL) { if (lastCheckStatusResult) { return "OK"; } else { throw new WebApplicationException(Status.SERVICE_UNAVAILABLE); } } lastCheckStatusTimestamp = System.currentTimeMillis(); String statusFilePath = (String) servletContext.getAttribute(ATTRIBUTE_STATUS_FILE_PATH); @SuppressWarnings("unchecked") Supplier<Boolean> isReadyProbe = (Supplier<Boolean>) servletContext.getAttribute(ATTRIBUTE_IS_READY_PROBE); boolean isReady = isReadyProbe != null ? isReadyProbe.get() : true; if (statusFilePath != null) { File statusFile = new File(statusFilePath); if (isReady && statusFile.exists() && statusFile.isFile()) { // check deadlock ThreadMXBean threadBean = ManagementFactory.getThreadMXBean(); long[] threadIds = threadBean.findDeadlockedThreads(); if (threadIds != null && threadIds.length > 0) { ThreadInfo[] threadInfos = threadBean.getThreadInfo(threadIds, false, false); String threadNames = Arrays.stream(threadInfos) .map(threadInfo -> threadInfo.getThreadName() + "(tid=" + threadInfo.getThreadId() + ")") .collect(Collectors.joining(", ")); if (System.currentTimeMillis() - lastCheckStatusTimestamp > LOG_THREADDUMP_INTERVAL_WHEN_DEADLOCK_DETECTED) { String diagnosticResult = ThreadDumpUtil.buildThreadDiagnosticString(); log.error("Deadlock detected, service may be unavailable, " + "thread stack details are as follows: {}.", diagnosticResult); } else { log.error("Deadlocked threads detected. {}", threadNames); } lastCheckStatusResult = false; throw new WebApplicationException(Status.SERVICE_UNAVAILABLE); } else { lastCheckStatusResult = true; return "OK"; } } } lastCheckStatusResult = false; log.warn("Failed to access \"status.html\". The service is not ready"); throw new WebApplicationException(Status.NOT_FOUND); } } }

Adding tests will be useful since validating logic just by looking at the code isn't a good practice. You might have to modify some details to inject mocks. For example, there could be a separate constructor for tests where the ThreadTXBean is injected. In the default constructor used in production code, ManagementFactory,getThreadMXBean() method call could be used to pass the value. In test code, a mock could be passed. It's also useful to refactor the code to use a java.time.Clock.millis() instead of System.currentTimeMillis() and also have the Clock instance in the constructor so that test code could pass a mock implementation where it's possible to make time jump forward when validating the logic. Clock.systemUTC() could be used in the default constructor for production code.

…e should return a failure because the service may be unavailable.

lhotari · 2024-11-27T09:23:35Z

@yyj8 btw. when you add commits to the PR, it's useful to make the commit title about the change and not copy the PR title into the follow up commits. When the PR is merged, all commits are squashed so they won't end up in the final merged commit. The benefit of the commit messages in the PR commits is that the reviewer will be able to follow the changes.

[fix][broker]If there is a deadlock in the service, the probe should …

c18aa62

…return a failure because the service may be unavailable

github-actions bot added the doc-label-missing label Nov 24, 2024

github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Nov 24, 2024

lhotari requested changes Nov 25, 2024

View reviewed changes

lhotari reviewed Nov 25, 2024

View reviewed changes

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java Outdated Show resolved Hide resolved

[fix][broker]If there is a deadlock in the service, the probe should …

90ff720

…return a failure because the service may be unavailable

yyj8 requested a review from lhotari November 25, 2024 14:38

lhotari reviewed Nov 25, 2024

View reviewed changes

yyj8 added 2 commits November 26, 2024 23:45

[improvement][broker] If there is a deadlock in the service, the prob…

2b18156

…e should return a failure because the service may be unavailable

[improvement][broker] If there is a deadlock in the service, the prob…

70325c0

…e should return a failure because the service may be unavailable

lhotari reviewed Nov 26, 2024

View reviewed changes

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java Outdated Show resolved Hide resolved

lhotari requested changes Nov 26, 2024

View reviewed changes

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java Outdated Show resolved Hide resolved

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java Outdated Show resolved Hide resolved

[improvement][broker] If there is a deadlock in the service, the prob…

c230aa5

…e should return a failure because the service may be unavailable

lhotari reviewed Nov 26, 2024

View reviewed changes

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java Outdated Show resolved Hide resolved

lhotari reviewed Nov 26, 2024

View reviewed changes

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java Show resolved Hide resolved

lhotari reviewed Nov 26, 2024

View reviewed changes

pulsar-broker-common/src/main/java/org/apache/pulsar/common/configuration/VipStatus.java Show resolved Hide resolved

[improvement][broker] If there is a deadlock in the service, the prob…

2c8819b

…e should return a failure because the service may be unavailable.

yyj8 requested a review from lhotari November 27, 2024 07:41

lhotari requested changes Nov 27, 2024

View reviewed changes

lhotari reviewed Nov 27, 2024

View reviewed changes

[improvement][broker] If there is a deadlock in the service, the prob…

f17da50

…e should return a failure because the service may be unavailable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[improvement][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #23634

[improvement][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #23634

yyj8 commented Nov 24, 2024 •

edited

Loading

github-actions bot commented Nov 24, 2024

lhotari left a comment

lhotari Nov 25, 2024

yyj8 Nov 26, 2024

lhotari left a comment

lhotari Nov 27, 2024 •

edited

Loading

yyj8 Nov 27, 2024

lhotari Nov 27, 2024

lhotari Nov 27, 2024

lhotari Nov 27, 2024

yyj8 Nov 27, 2024

lhotari Nov 27, 2024

lhotari commented Nov 27, 2024

	@GET
	@Path("/health")
	@ApiOperation(value = "Run a healthCheck against the broker")
	@ApiResponses(value = {
	@ApiResponse(code = 200, message = "Everything is OK"),
	@ApiResponse(code = 403, message = "Don't have admin permission"),
	@ApiResponse(code = 404, message = "Cluster doesn't exist"),
	@ApiResponse(code = 500, message = "Internal server error")})
	public void healthCheck(@Suspended AsyncResponse asyncResponse,
	@ApiParam(value = "Topic Version")
	@QueryParam("topicVersion") TopicVersion topicVersion) {
	validateSuperUserAccessAsync()
	.thenAccept(__ -> checkDeadlockedThreads())
	.thenCompose(__ -> internalRunHealthCheck(topicVersion))
	.thenAccept(__ -> {
	LOG.info("[{}] Successfully run health check.", clientAppId());
	asyncResponse.resume(Response.ok("ok").build());
	}).exceptionally(ex -> {
	LOG.error("[{}] Fail to run health check.", clientAppId(), ex);
	resumeAsyncResponseExceptionally(asyncResponse, ex);
	return null;
	});
	}

	private void checkDeadlockedThreads() {
	ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
	long[] threadIds = threadBean.findDeadlockedThreads();
	if (threadIds != null && threadIds.length > 0) {
	ThreadInfo[] threadInfos = threadBean.getThreadInfo(threadIds, false, false);
	String threadNames = Arrays.stream(threadInfos)
	.map(threadInfo -> threadInfo.getThreadName() + "(tid=" + threadInfo.getThreadId() + ")").collect(
	Collectors.joining(", "));
	if (System.currentTimeMillis() - threadDumpLoggedTimestamp
	> LOG_THREADDUMP_INTERVAL_WHEN_DEADLOCK_DETECTED) {
	threadDumpLoggedTimestamp = System.currentTimeMillis();
	LOG.error("Deadlocked threads detected. {}\n{}", threadNames,
	ThreadDumpUtil.buildThreadDiagnosticString());
	} else {
	LOG.error("Deadlocked threads detected. {}", threadNames);
	}
	throw new IllegalStateException("Deadlocked threads detected. " + threadNames);
	}
	}

[improvement][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #23634

Are you sure you want to change the base?

[improvement][broker] If there is a deadlock in the service, the probe should return a failure because the service may be unavailable #23634

Conversation

yyj8 commented Nov 24, 2024 • edited Loading

Motivation

Modifications

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

Matching PR in forked repository

github-actions bot commented Nov 24, 2024

lhotari left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lhotari left a comment

Choose a reason for hiding this comment

lhotari Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lhotari commented Nov 27, 2024

yyj8 commented Nov 24, 2024 •

edited

Loading

lhotari Nov 27, 2024 •

edited

Loading