Performance[MQB,BMQ]: return constructed blobs as pointers #471

Open
wants to merge 2 commits into base: main

Conversation

Collaborator

@678098 commented Oct 18, 2024

Overview

In some code paths, we need to write similar Blobs of data to multiple cluster nodes at once (replication, for example). Previously, we copied these Blobs for every mqbnet::Channel, each of which holds a single connection to another cluster node. The Blob class has a vector of BlobBuffer objects, and when we copy a Blob, we allocate a new vector of BlobBuffers using the provided allocator. We spend a lot of time on this, and the situation becomes even worse if there are more nodes in the cluster or the throughput is high.

Note that even when we write data to only one node, we still made a copy of the Blob when passing it to mqbnet::Channel.

The main change in this PR is that we store shared pointers to the same Blob when we enqueue Items to mqbnet::Channel to write. By doing this, we avoid making a vector copy for every Blob and do not allocate without a need.
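The difference between copying a Blob per channel and sharing one Blob can be sketched with standard-library stand-ins. This is only an illustration under stated assumptions: `Buffer`, `Blob`, `Channel`, and `replicate` below are hypothetical simplifications, not the real bdlbb::Blob or mqbnet::Channel interfaces.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for bdlbb::BlobBuffer.
struct Buffer {
    std::shared_ptr<char> data;  // buffer memory
    int                   size;
};

// Hypothetical stand-in for bdlbb::Blob: copying it copies the vector.
struct Blob {
    std::vector<Buffer> buffers;
};

// Hypothetical stand-in for mqbnet::Channel: a queue of items to write.
class Channel {
    std::vector<std::shared_ptr<Blob> > d_items;

  public:
    // Enqueue by shared pointer: only a reference count is incremented,
    // the vector of buffers inside the Blob is not copied.
    void enqueue(const std::shared_ptr<Blob>& blob)
    {
        assert(blob);
        d_items.push_back(blob);
    }

    const std::shared_ptr<Blob>& front() const { return d_items.front(); }
};

// Enqueue the same Blob to every channel and return its use count.
long replicate(std::vector<Channel>&        channels,
               const std::shared_ptr<Blob>& blob)
{
    for (Channel& channel : channels) {
        channel.enqueue(blob);
    }
    return blob.use_count();
}
```

With five channels, all five queues end up pointing at the same Blob object, so the per-channel cost is a reference count update rather than a vector allocation.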

However, to make this change possible, we need to provide shared pointers to Blobs from the very level where these Blobs are constructed - in event builders.

Another change that affects performance is using Blob shared pointer pools for event builders. By reusing already built Blobs from these pools, we make it likely that, after some warmup, the vectors containing BlobBuffers already have enough capacity to hold all the BlobBuffers needed for a write.
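The warmup effect can be sketched with a small pool built on the standard library. This is a hedged illustration, not the pool used in the PR: the `Blob` struct and this `BlobSpPool` class (including `getObject` and `numAvailable` as written here) are hypothetical, single-threaded stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Blob holding a vector of buffers.
struct Blob {
    std::vector<int> buffers;
};

// Hypothetical, single-threaded pool of shared pointers to Blobs.
class BlobSpPool {
    struct State {
        std::vector<std::unique_ptr<Blob> > free;  // Blobs ready for reuse
    };
    std::shared_ptr<State> d_state;  // shared with deleters of live Blobs

  public:
    BlobSpPool() : d_state(std::make_shared<State>()) {}

    // Return a shared pointer to a pooled Blob, creating one if needed.
    std::shared_ptr<Blob> getObject()
    {
        std::unique_ptr<Blob> blob;
        if (d_state->free.empty()) {
            blob.reset(new Blob());
        }
        else {
            blob = std::move(d_state->free.back());
            d_state->free.pop_back();
        }
        std::shared_ptr<State> state = d_state;
        // The deleter returns the Blob to the pool instead of destroying it.
        return std::shared_ptr<Blob>(blob.release(), [state](Blob* b) {
            b->buffers.clear();  // drops the content, keeps the capacity
            state->free.emplace_back(b);
        });
    }

    std::size_t numAvailable() const { return d_state->free.size(); }
};
```

A Blob that once held 64 buffers comes back from the pool empty but with its vector capacity intact, so the next write of similar size does not need to grow the vector.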

Changes

  • Introduce a new component bmqp::BlobPoolUtil that defines the BlobSpPool type and also provides a utility function createPool that simplifies BlobSpPool construction.
  • Event builders: require BlobSpPool as an argument, store the built Blob as a shared pointer, use BlobSpPool to get new Blob shared pointers on reset, provide Blob shared pointer copy via new blob_sp() accessor.
  • Unit tests: build and pass BlobSpPool as an argument where needed.
  • mqbnet::Channel: change the class API so that it now only accepts shared pointers to Blobs to enqueue; add an independent BlobSpPool per mqbnet::Channel to use in its builders (we do not want to share the global BlobSpPool between too many threads); remove the legacy ways of storing Items in Channel (before, we had Blobs by value, by weak pointer, and by shared pointer; now we only store them by shared pointer).
  • mqbnet::Channel unit test: removed test for weak pointer Blob passed to Channel. This feature is not used in the code at all and so it's removed.
  • bmqp::SchemaEventBuilder: reorder arguments to meet allocator usage guidelines, so the allocator argument is now the last one as expected; this required explicitly adding the default encoding type where it's needed. Also, got rid of the temporary MemOutStream and cached it as a field to reduce allocations: bmqu::MemOutStream d_errorStream.

Stress test

Priority queue, 100k msgs/s, 1 consumer, 1 producer, 1 queue in strong consistency domain, 6 nodes cluster (3 datacenters) with 2 client proxies.

The graph shows the number of unconfirmed messages stored in the queue over a 10 minute run. The blue line is the current PR's revision (close to 0 pending messages at every moment); the orange line shows the commit just before this PR.

This PR behaves much better at 100k msgs/s than the previous revision.

Screenshot 2024-10-30 at 19 25 24

Profiler

On the same stress scenario, 3.6% of samples across all threads were taken within Blob constructor calls. With the new change, these 3.6% are freed. Since these samples were within the queue dispatcher thread, this thread will be less busy and able to process higher throughput.

Before:

Screenshot 2024-10-30 at 18 56 51

Screenshot 2024-10-30 at 18 58 40

After:

Screenshot 2024-10-30 at 18 22 37

Screenshot 2024-10-30 at 18 23 16

Allocator stats

This PR reduces the number of unnecessary allocations by many millions (see the Channel rows in Before and After). Since we use counting allocators arranged in a tree structure, where each update is reported along the path to the root allocator, we also save CPU on these updates.
The only remaining place within TransportManager with many allocations is TCPSessionFactory.
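To show why each avoided allocation also saves bookkeeping work, here is a minimal sketch of tree-structured counting. It is an assumption-laden illustration: `CountingNode` and its methods are hypothetical and do not allocate memory or use atomics as a real counting allocator would; they only model the per-allocation updates along the path to the root.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical node in a tree of counting allocators.
class CountingNode {
    std::string   d_name;
    CountingNode* d_parent_p;  // held, not owned; null for the root
    long long     d_numAllocations;
    long long     d_bytesInUse;

  public:
    explicit CountingNode(const std::string& name, CountingNode* parent = 0)
    : d_name(name)
    , d_parent_p(parent)
    , d_numAllocations(0)
    , d_bytesInUse(0)
    {
    }

    // Record one allocation on this node and on every ancestor.
    // Return the number of nodes that had to be updated.
    int recordAllocate(std::size_t bytes)
    {
        int updated = 0;
        for (CountingNode* node = this; node != 0; node = node->d_parent_p) {
            ++node->d_numAllocations;
            node->d_bytesInUse += static_cast<long long>(bytes);
            ++updated;
        }
        return updated;
    }

    // Record one deallocation on this node and on every ancestor.
    void recordDeallocate(std::size_t bytes)
    {
        for (CountingNode* node = this; node != 0; node = node->d_parent_p) {
            node->d_bytesInUse -= static_cast<long long>(bytes);
            assert(node->d_bytesInUse >= 0);
        }
    }

    const std::string& name() const { return d_name; }
    long long numAllocations() const { return d_numAllocations; }
    long long bytesInUse() const { return d_bytesInUse; }
};
```

In a tree shaped like TransportManager / cl6_dc3 / node3 / Channel, one allocation at the Channel level touches four counters, so removing millions of allocations removes several times as many counter updates.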

Before:

    TransportManager             |       1,472,768| -11,520|           4,852,112| 219,401,090|  94,459|   219,400,120|  94,459
      *direct*                   |           1,392|        |               1,392|          12|        |             2|        
      cl6_dc3                    |         946,720| -11,520|           1,071,664| 120,518,092|  58,380|   120,517,921|  58,380
        *direct*                 |          28,208|        |              28,208|          59|        |             0|        
        node3                    |         366,736|  -2,304|             402,768|  24,103,630|  11,676|    24,103,595|  11,676
          *direct*               |          99,152|        |              99,152|           7|        |             0|        
          ItemPool               |         266,880|        |             266,880|          20|        |             0|        
          Channel                |             704|  -2,304|              36,736|  24,103,603|  11,676|    24,103,595|  11,676
        node1                    |         113,072|  -2,304|             150,864|  24,103,600|  11,676|    24,103,584|  11,676
          *direct*               |          99,152|        |              99,152|           7|        |             0|        
          ItemPool               |          13,344|        |              13,344|           1|        |             0|        
          Channel                |             576|  -2,304|              38,368|  24,103,592|  11,676|    24,103,584|  11,676
        node2                    |         113,072|  -2,304|             149,536|  24,103,601|  11,676|    24,103,585|  11,676
          *direct*               |          99,152|        |              99,152|           7|        |             0|        
          ItemPool               |          13,344|        |              13,344|           1|        |             0|        
          Channel                |             576|  -2,304|              37,040|  24,103,593|  11,676|    24,103,585|  11,676
        node5                    |         113,072|  -2,304|             147,712|  24,103,593|  11,676|    24,103,577|  11,676
          *direct*               |          99,152|        |              99,152|           7|        |             0|        
          ItemPool               |          13,344|        |              13,344|           1|        |             0|        
          Channel                |             576|  -2,304|              35,216|  24,103,585|  11,676|    24,103,577|  11,676
        node4                    |         113,072|  -2,304|             144,144|  24,103,596|  11,676|    24,103,580|  11,676
          *direct*               |          99,152|        |              99,152|           7|        |             0|        
          ItemPool               |          13,344|        |              13,344|           1|        |             0|        
          Channel                |             576|  -2,304|              31,648|  24,103,588|  11,676|    24,103,580|  11,676
        node0                    |          99,488|        |              99,488|          13|        |             0|        
          *direct*               |          99,152|        |              99,152|           7|        |             0|        
          Channel                |             336|        |                 336|           6|        |             0|        
      Interface45115             |         524,016|        |           4,156,544|  98,882,981|  36,079|    98,882,197|  36,079
      ConnectionStates           |             640|        |                 640|           5|        |             0|        
    SessionNegotiator            |          14,496|        |              23,264|       1,389|       2|         1,330|       2
    DomainManager                |           2,064|        |               2,064|          11|        |             0|        
    ConfigProvider               |             272|        |                 272|           2|        |             0|        

After:

    TransportManager             |       1,930,032|        |           5,523,808|  87,961,870|        |    87,960,968|        
      *direct*                   |           1,392|        |               1,392|          12|        |             2|        
      cl6_dc3                    |       1,541,024|        |           1,541,088|         193|        |            11|        
        *direct*                 |          29,104|        |              29,104|          59|        |             0|        
        node4                    |         322,016|        |             322,080|          32|        |             4|        
          *direct*               |          99,376|        |              99,376|           9|        |             1|        
          BlobSpPool             |         115,104|        |             115,168|          10|        |             3|        
          ItemPool               |         107,360|        |             107,360|          11|        |             0|        
          Channel                |             176|        |                 176|           2|        |             0|        
        node2                    |         302,448|        |             302,448|          29|        |             3|        
          *direct*               |          99,376|        |              99,376|           9|        |             1|        
          BlobSpPool             |         115,056|        |             115,104|           9|        |             2|        
          ItemPool               |          87,840|        |              87,840|           9|        |             0|        
          Channel                |             176|        |                 176|           2|        |             0|        
        node1                    |         224,336|        |             224,336|          19|        |             1|        
          *direct*               |          99,376|        |              99,376|           9|        |             1|        
          BlobSpPool             |         115,024|        |             115,024|           7|        |             0|        
          ItemPool               |           9,760|        |               9,760|           1|        |             0|        
          Channel                |             176|        |                 176|           2|        |             0|        
        node3                    |         224,336|        |             224,336|          19|        |             1|        
          *direct*               |          99,376|        |              99,376|           9|        |             1|        
          BlobSpPool             |         115,024|        |             115,024|           7|        |             0|        
          ItemPool               |           9,760|        |               9,760|           1|        |             0|        
          Channel                |             176|        |                 176|           2|        |             0|        
        node5                    |         224,336|        |             224,336|          19|        |             1|        
          *direct*               |          99,376|        |              99,376|           9|        |             1|        
          BlobSpPool             |         115,024|        |             115,024|           7|        |             0|        
          ItemPool               |           9,760|        |               9,760|           1|        |             0|        
          Channel                |             176|        |                 176|           2|        |             0|        
        node0                    |         214,448|        |             214,448|          16|        |             1|        
          *direct*               |          99,376|        |              99,376|           9|        |             1|        
          BlobSpPool             |         114,976|        |             114,976|           6|        |             0|        
          Channel                |              96|        |                  96|           1|        |             0|        
      Interface37387             |         386,976|        |           4,156,544|  87,961,660|        |    87,960,955|        
      ConnectionStates           |             640|        |                 640|           5|        |             0|        
    SessionNegotiator            |          15,024|        |              23,792|       1,379|       2|         1,325|       2
    DomainManager                |           2,064|        |               2,064|          11|        |             0|        
    ConfigProvider               |             272|        |                 272|           2|        |             0|        

@678098 changed the title from "Performance[MQB,BMQ]: return constructed blobs as pointers" to "[WIP]Performance[MQB,BMQ]: return constructed blobs as pointers" Oct 18, 2024
@678098 force-pushed the 241018_blob_shared_ptr branch 13 times, most recently from 84e6047 to 8bb562c October 28, 2024 20:39
@678098 force-pushed the 241018_blob_shared_ptr branch 12 times, most recently from a72440a to d7ba3ef October 30, 2024 17:38
@678098 changed the title from "[WIP]Performance[MQB,BMQ]: return constructed blobs as pointers" to "Performance[MQB,BMQ]: return constructed blobs as pointers" Oct 30, 2024
@678098 marked this pull request as ready for review October 30, 2024 17:39
@678098 requested a review from a team as a code owner October 30, 2024 17:39
@bmq-oss-ci bot left a comment

Build 344 of commit 8f53a30 has completed with FAILURE

Collaborator

@dorjesinpo left a comment

There is bsl::deque<bdlbb::Blob> d_channelBufferQueue; in ClientSession. That is not in a performance-critical path, but we could change that as well (does not have to be in the same PR)

@@ -679,6 +680,13 @@ class Session : public AbstractSession {
public:
// TYPES

/// Pool of shared pointers to Blobs
Collaborator

why not typedef bmqp::BlobPoolUtil::BlobSpPool?

Collaborator Author

We don't expose the bmqp package in bmqa, since bmqa is a client package, and we need to declare BlobSpPool somewhere so it can be used in bmqa::MockSession. I moved it directly to the private types in bmqa::MockSession.

d_blob_sp->buffer(0).data());
eh.setLength(d_blob_sp->length());

return *d_blob_sp;
Collaborator

AckEventBuilder::blob() can call AckEventBuilder::blob_sp() to avoid duplicate code

Collaborator Author

I got rid of blob_sp() and we only have blob() now

return *d_blob_sp;
}

bsl::shared_ptr<bdlbb::Blob> AckEventBuilder::blob_sp() const
Collaborator

Can you check what the guideline is on naming an accessor that returns a shared_ptr? blob_sp does not sound right.

Collaborator Author

@678098 Nov 6, 2024

There are basically 2 common things that could be done in a similar situation.

  • Rename shared_ptr<...> blob_sp() -> shared_ptr<...> blob() and remove Blob& blob(). This is okay since bmqp builders are not exposed to SDK clients. But there will be a difference from bmqa builders, which return Blob& blob(). We don't enforce interface compatibility between bmqp and bmqa builders; we just use the bmqp ones as a pointer to the implementation.
  • Give the operation that returns a shared pointer a meaningful name, like shared_ptr<> bookClientInfo(). This is not applicable to our use case.

So I will go with the 1st way.
UPD: I also revisited the idea of providing a copy of the shared pointer and decided to provide a reference to the internal pointer instead. It's the user's responsibility to make a copy if needed.
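The accessor shape described in the update can be sketched as follows. This is a minimal illustration with standard-library types, not the bmqp builder code: `Blob`, `EventBuilder`, and `reset` here are hypothetical stand-ins, and the real builders take their Blobs from a pool.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a Blob.
struct Blob {
    std::vector<char> bytes;
};

// Hypothetical builder exposing its Blob by reference to the internal
// shared pointer.
class EventBuilder {
    std::shared_ptr<Blob> d_blob_sp;

  public:
    EventBuilder() : d_blob_sp(std::make_shared<Blob>()) {}

    // No reference count update happens here. A caller that needs the Blob
    // to outlive the next 'reset()' must copy the returned pointer.
    const std::shared_ptr<Blob>& blob() const
    {
        assert(d_blob_sp);
        return d_blob_sp;
    }

    // Start building a new event in a different Blob. Copies taken
    // earlier keep the previous Blob alive and unchanged.
    void reset() { d_blob_sp = std::make_shared<Blob>(); }
};
```

The trade-off is that a caller holding only the reference observes whatever Blob the builder currently owns, so any code that enqueues the Blob for a later write needs its own copy of the shared pointer.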

// skip writing the length until the blob
// is retrieved.
/// Blob pool to use. Held, not owned.
BlobSpPool* d_blobSpPool_p;
Collaborator

I guess, this does not introduce more dependency than the previous use of bufferFactory.

@@ -114,7 +114,7 @@ AdminSessionState::AdminSessionState(BlobSpPool* blobSpPool,
, d_dispatcherClientData()
, d_bufferFactory_p(bufferFactory)
Collaborator

It looks like we do not need d_bufferFactory_p

@@ -306,9 +306,9 @@ ClientSessionState::ClientSessionState(
, d_statContext_mp(clientStatContext)
, d_bufferFactory_p(bufferFactory)
Collaborator

It does not look like we can get rid of d_bufferFactory_p until we retire the old code converting MessageProperties (which we should retire).

Could you leave a comment please, reminding us to remove it?

bmqp::SchemaEventBuilder builder(d_bufferFactory_p,
&localAllocator,
encodingType);
bmqp::SchemaEventBuilder builder(d_blobSpPool_p,
Collaborator

Same comment about removing SessionNegotiator::d_bufferFactory_p

@@ -1178,11 +1184,13 @@ IncoreClusterStateLedger::IncoreClusterStateLedger(
ClusterData* clusterData,
ClusterState* clusterState,
bdlbb::BlobBufferFactory* bufferFactory,
BlobSpPool* blobSpPool_p,
bslma::Allocator* allocator)
: d_allocator_p(allocator)
, d_isFirstLeaderAdvisory(true)
, d_isOpen(false)
, d_bufferFactory_p(bufferFactory)
Collaborator

do we still need IncoreClusterStateLedger::d_bufferFactory_p?

Collaborator Author

Possible to remove, removed

dsCfg,
recoveryManagerAllocator),
d_recoveryManager_mp.load(new (*recoveryManagerAllocator)
RecoveryManager(d_blobSpPool_p,
Collaborator

Seems like we have two d_blobSpPool_p: one in StorageManager and another in ClusterData::d_resources. Probably, an oversight in one of the previous commits, maybe we can fix it now?

Collaborator Author

Removed the d_blobSpPool_p field.
In a future PR we need to revisit this if we have per-thread resources. We might need to remove it from ClusterData and cache it as a field instead.

@dorjesinpo dorjesinpo assigned 678098 and unassigned dorjesinpo Nov 4, 2024
Collaborator

@dorjesinpo left a comment

Let's take a look at bmqp::Event::d_clonedBlob_sp.
What do you think about the possibility of removing the clone functionality in bmqp::Event?
Perhaps in another PR.

@@ -1010,6 +1027,9 @@ class MockSession : public AbstractSession {
/// Buffer factory
bdlbb::PooledBlobBufferFactory d_blobBufferFactory;
Collaborator

Can we remove d_blobBufferFactory now?

@@ -259,7 +264,9 @@ class Application {
/// instance.
bdlbb::BlobBufferFactory* bufferFactory();
Collaborator

Can we remove Application::bufferFactory() now?

bslma::Allocator* allocator)
: d_blobSpPool_p(blobSpPool_p)
, d_blob_sp(0, allocator) // initialized in `reset()`
, d_emptyBlob_sp(blobSpPool_p->getObject())
Collaborator

this can SEGFAULT before the BSLS_ASSERT_SAFE(blobSpPool_p)
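The hazard is that member initializers run before the constructor body, so an assertion placed in the body cannot protect a dereference in the initializer list. One possible way to address it, sketched here with hypothetical names (`checkedPool`, `Builder`, and the simplified `BlobSpPool` are assumptions, and plain `assert` stands in for BSLS_ASSERT_SAFE), is to validate the pointer inside the initializer list itself.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the Blob and the pool.
struct Blob {};

struct BlobSpPool {
    std::shared_ptr<Blob> getObject() { return std::make_shared<Blob>(); }
};

// Validate the pointer and hand it back, so that the check can run as part
// of a member initializer, before any dereference.
inline BlobSpPool* checkedPool(BlobSpPool* pool)
{
    assert(pool && "blob pool must not be null");
    return pool;
}

class Builder {
    // Members are initialized in declaration order: the pool pointer first,
    // then the Blob obtained from it.
    BlobSpPool*           d_blobSpPool_p;  // held, not owned
    std::shared_ptr<Blob> d_emptyBlob_sp;

  public:
    explicit Builder(BlobSpPool* blobSpPool_p)
    : d_blobSpPool_p(checkedPool(blobSpPool_p))
    , d_emptyBlob_sp(d_blobSpPool_p->getObject())
    {
        // An assertion placed here would run only after the initializers
        // above, which is too late to catch a null pool.
    }

    BlobSpPool* pool() const { return d_blobSpPool_p; }
    const std::shared_ptr<Blob>& emptyBlob() const { return d_emptyBlob_sp; }
};
```

Another option is to drop the member that needs the pool during initialization and obtain the Blob later, in the constructor body or in reset(), after the assertion has run.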

{
if (BSLS_PERFORMANCEHINT_PREDICT_UNLIKELY(messageCount() == 0)) {
BSLS_PERFORMANCEHINT_UNLIKELY_HINT;
return ProtocolUtil::emptyBlob(); // RETURN
return d_emptyBlob_sp; // RETURN
Collaborator

This d_emptyBlob_sp looks strange. Seems like in no case d_blob_sp == false, so why not return it?

bslma::Allocator* allocator)
: d_blobSpPool_p(blobSpPool_p)
, d_blob_sp(0, allocator) // initialized in `reset()`
, d_emptyBlob_sp(blobSpPool_p->getObject())
Collaborator

Same comment about d_emptyBlob_sp. Do we really need it if d_blob_sp == true?

bslma::Allocator* allocator)
: d_blobSpPool_p(blobSpPool_p)
, d_blob_sp(0, allocator) // initialized in `reset()`
, d_emptyBlob_sp(blobSpPool_p->getObject())
Collaborator

same comment about d_emptyBlob_sp

: d_allocator_p(allocator)
, d_blob(bufferFactory, allocator)
PushEventBuilder::PushEventBuilder(BlobSpPool* blobSpPool_p,
bslma::Allocator* allocator)
Collaborator

same comment about d_emptyBlob_sp

bslma::Allocator* allocator)
: d_blobSpPool_p(blobSpPool_p)
, d_blob_sp(0, allocator) // initialized in `reset()`
, d_emptyBlob_sp(blobSpPool_p->getObject())
Collaborator

same comment about d_emptyBlob_sp

, d_eventType(eventType)
, d_blob(bufferFactory, allocator)
, d_blob_sp(0, allocator) // initialized in `reset()`
, d_emptyBlob_sp(blobSpPool_p->getObject())
Collaborator

Same comment about d_emptyBlob_sp

// it's not safe to modify or replace the blob under this pointer.
// Instead, we get another shared pointer to another blob.
nodeContext->d_blob_sp = d_blobSpPool_p->getObject();
bmqp::ProtocolUtil::buildReceipt(nodeContext->d_blob_sp.get(),
Collaborator

bmqp::ProtocolUtil::buildReceipt can now assert that the blob is fresh (no need for removeAll)

2 participants