
signal/poll optimization #366

Open · wants to merge 1 commit into main
Conversation

@liangyuRain

Hi, it would be great if the sm/proxy channels could support signaling/polling multiple times in one function call. The changes in this PR support this functionality without disrupting existing usage, i.e., signal() and poll() work as before.

  • signal(n) sends n signals down the channel. For the sm channel, this avoids doing atomicLoad multiple times on the semaphore.
  • x = poll(n) polls up to n signals from the channel and returns the actual number of signals polled. It returns 0 if n <= 0.
  • signal() and poll() are equivalent to signal(1) and poll(1), which are the default arguments.
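A minimal host-side sketch of the proposed semantics, using a `std::atomic<uint64_t>` as a stand-in for the channel's 64-bit semaphore flag. The `Channel` type and member names here are illustrative only, not the actual mscclpp API:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

// Illustrative stand-in for a channel with a 64-bit semaphore flag.
struct Channel {
  std::atomic<uint64_t> sent{0};  // incremented by the producer
  uint64_t received{0};           // consumer-local count of consumed signals

  // signal(n): one atomic add covers n signals instead of n separate adds.
  void signal(uint64_t n = 1) { sent.fetch_add(n, std::memory_order_release); }

  // poll(n): one atomic load; returns how many signals (up to n) were
  // consumed, or 0 if n <= 0.
  uint64_t poll(int64_t n = 1) {
    if (n <= 0) return 0;
    uint64_t avail = sent.load(std::memory_order_acquire) - received;
    uint64_t got = std::min<uint64_t>(avail, static_cast<uint64_t>(n));
    received += got;
    return got;
  }
};
```

With this shape, `signal()`/`poll()` keep their existing one-signal behavior as the default-argument case.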

@liangyuRain (Author)

@microsoft-github-policy-service agree

@Binyang2014 (Contributor)

Hmm, can you tell us about the scenarios that require this feature?

@liangyuRain (Author)

In my scenario, for example, data is sent as a flow of chunks between a producer and a consumer. The producer signals once per chunk sent. It would be nice for the consumer to know how many chunks it has received by one poll call. The producer may also decide to send n chunks at once, in which case n signals can be sent with one signal call.

@chhwang (Contributor) commented Oct 17, 2024

> In my scenario, for example, data is sent as a flow of chunks between a producer and a consumer. The producer signals once per chunk sent. It would be nice for the consumer to know how many chunks it has received by one poll call. The producer may also decide to send n chunks at once, in which case n signals can be sent with one signal call.

  • If the producer has multiple data chunks to send, you could call put multiple times and then signal once. The consumer would call wait once. Does that work for your scenario?
  • If the producer needs to communicate the number of chunks, you could send that information as another data chunk. Your implementation would probably be faster than this, but I expect the difference in latency to be nearly invisible.
  • As a minor concern, incrementing the semaphore by user input may open the door to misuse or bugs. For example, the current design uses a single 64-bit unsigned integer flag and assumes at most a single increment per signal. This makes sense because, if a single signal takes 100 ns, it would take more than 50k years to overflow the flag. If we instead increment the flag by user input, we are implicitly pushing users to take care of the overflow issue themselves.
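As a sanity check on the arithmetic in the last bullet (assumptions: one signal every 100 ns, a 64-bit flag that wraps after 2^64 increments):

```cpp
// Rough estimate: years until a 64-bit semaphore flag wraps, at one
// signal every 100 ns, incrementing the flag by `step` per signal.
double overflow_years(double step) {
  const double flag_range = 18446744073709551616.0;  // 2^64
  const double ns_per_signal = 100.0;
  const double seconds_per_year = 365.25 * 24 * 3600;
  return (flag_range / step) * ns_per_signal * 1e-9 / seconds_per_year;
}
// overflow_years(1)   ≈ 58455 years, i.e. "more than 50k years"
// overflow_years(255) ≈ 229 years, close to the 228-year figure
// cited later in this thread for a uint8_t increment
```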

@liangyuRain (Author)

> • If the producer has multiple data chunks to send, you could call put multiple times and then signal once. The consumer would call wait once. Does that work for your scenario?
> • If the producer needs to communicate the number of chunks, you could send that information as another data chunk. Your implementation would probably be faster than this, but I expect the difference in latency to be nearly invisible.
> • As a minor concern, incrementing the semaphore by user input may open the door to misuse or bugs. For example, the current design uses a single 64-bit unsigned integer flag and assumes at most a single increment per signal. This makes sense because, if a single signal takes 100 ns, it would take more than 50k years to overflow the flag. If we instead increment the flag by user input, we are implicitly pushing users to take care of the overflow issue themselves.

Thanks for the feedback! I think sending the number of chunks separately adds unnecessary complexity, especially since the signal/poll mechanism already handles the producer-consumer scenario elegantly.

As for the overflow concern, the increment could be limited to a uint8_t. That would be more than sufficient for most practical use cases, and it would still take 228 years to overflow. Please consider this feature.

@chhwang (Contributor) commented Oct 18, 2024

> Thanks for the feedback! I think sending the number of chunks separately adds unnecessary complexity, especially since the signal/poll mechanism already handles the producer-consumer scenario elegantly.
>
> As for the overflow concern, the increment could be limited to a uint8_t. That would be more than sufficient for most practical use cases, and it would still take 228 years to overflow. Please consider this feature.

Sure, but that is further narrowing the use cases. Looks like you basically want a simple interface that sends a single byte data, and I still don't get exactly why we would need this, given that the current interface can already do this.

@liangyuRain (Author)

My understanding of the programming model for the mscclpp channel is that a signal notifies the receiver that a task is ready for processing, which is exactly a producer-consumer scenario. The current interface can do this, but it requires repeated operations on global memory that could be done in one round. Especially in the poll case, the consumer needs to atomicLoad n+1 times to learn that n chunks are ready to be processed as a batch. Sending extra bytes along with the signal to indicate the number of chunks can avoid this, but it still requires the receiver to store and sum the received values, rather than leveraging the semaphore's own increment.

I am OK to close this issue if you believe it would be more appropriate to address it otherwise.

@chhwang (Contributor) commented Oct 21, 2024

> My understanding of the programming model for the mscclpp channel is that a signal notifies the receiver that a task is ready for processing, which is exactly a producer-consumer scenario. The current interface can do this, but it requires repeated operations on global memory that could be done in one round. Especially in the poll case, the consumer needs to atomicLoad n+1 times to learn that n chunks are ready to be processed as a batch. Sending extra bytes along with the signal to indicate the number of chunks can avoid this, but it still requires the receiver to store and sum the received values, rather than leveraging the semaphore's own increment.
>
> I am OK to close this issue if you believe it would be more appropriate to address it otherwise.

I assume you are talking about wait, not poll, because poll is designed to be repeated multiple times indefinitely. If I'm correct, could you elaborate on why you need to repeat wait multiple times? You could send a single signal for the entire n+1 chunks and then wait just once.

@liangyuRain (Author)

> I assume you are talking about wait, not poll, because poll is designed to be repeated multiple times indefinitely. If I'm correct, could you elaborate on why you need to repeat wait multiple times? You could send a single signal for the entire n+1 chunks and then wait just once.

I am referring to poll. The feature I am proposing supports (1) sending n signals in one function call, and (2) learning how many signals have been received in one poll. This eliminates the need for repeated atomicLoads and GMEM accesses. (2) is similar to poll in C, which returns the number of ready file descriptors.

In your case, if I send one signal for n+1 chunks, I need additional logic to track how many chunks were sent, since n is a runtime variable. It is simpler to send one signal per chunk. For example, if I send 3, 4, and 5 chunks and signal 3, 4, and 5 times across three rounds, a single poll on the receiver side returns 12, indicating that 12 chunks have been sent.

To stay consistent with the existing poll() that returns true/false, I implemented poll(n), where n is the maximum number of signals to poll. The existing poll() is thus equivalent to poll(1).
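The three-round example above can be sketched on the host with a `std::atomic` stand-in for the channel semaphore (the `Channel` type and its members are illustrative, not the actual mscclpp API):

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

// Host-side stand-in for the channel semaphore (illustrative only).
struct Channel {
  std::atomic<uint64_t> sent{0};
  uint64_t received{0};
  void signal(uint64_t n = 1) { sent.fetch_add(n, std::memory_order_release); }
  uint64_t poll(int64_t n = 1) {
    if (n <= 0) return 0;
    uint64_t avail = sent.load(std::memory_order_acquire) - received;
    uint64_t got = std::min<uint64_t>(avail, static_cast<uint64_t>(n));
    received += got;
    return got;
  }
};

// Producer sends 3, 4, then 5 chunks, signaling once per round.
// One poll on the consumer side then reports all 12 chunks at once,
// with a single atomic load instead of 13.
uint64_t demo() {
  Channel ch;
  ch.signal(3);
  ch.signal(4);
  ch.signal(5);
  return ch.poll(64);  // returns 12
}
```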
