
signal/poll optimization #366

Open · wants to merge 1 commit into main
Conversation

@liangyuRain

Hi, it would be great if the sm/proxy channels could support signaling/polling multiple times in one function call. The changes in this PR support this functionality without disrupting existing usage, i.e., signal() and poll() work as before.

  • signal(n) sends n signals down the channel. For the sm channel, this avoids doing atomicLoad multiple times on the semaphore.
  • x = poll(n) polls up to n signals from the channel and returns the actual number of signals polled. It returns 0 if n <= 0.
  • signal() and poll() are equivalent to signal(1) and poll(1), which are the default arguments.
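A minimal host-side sketch of the proposed semantics, using a `std::atomic<uint64_t>` as a stand-in for the channel's 64-bit semaphore flag. The `Channel` type and member names here are illustrative only, not the actual mscclpp API:

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

// Illustrative stand-in for a channel with a 64-bit semaphore flag.
struct Channel {
  std::atomic<uint64_t> sent{0};  // incremented by the producer
  uint64_t received{0};           // consumer-local count of consumed signals

  // signal(n): one atomic add covers n signals instead of n separate adds.
  void signal(uint64_t n = 1) { sent.fetch_add(n, std::memory_order_release); }

  // poll(n): one atomic load; returns how many signals (up to n) were
  // consumed, or 0 if n <= 0.
  uint64_t poll(int64_t n = 1) {
    if (n <= 0) return 0;
    uint64_t avail = sent.load(std::memory_order_acquire) - received;
    uint64_t got = std::min<uint64_t>(avail, static_cast<uint64_t>(n));
    received += got;
    return got;
  }
};
```

With this shape, `signal()`/`poll()` keep their existing one-signal behavior as the default-argument case.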

@liangyuRain (Author)

@microsoft-github-policy-service agree

@Binyang2014 (Contributor)

Hmm, can you tell us about the scenarios that require this feature?

@liangyuRain (Author)

In my scenario, for example, data is sent as a flow of chunks between a producer and a consumer. The producer signals once per chunk sent. It would be nice for the consumer to know how many chunks it has received by one poll call. The producer may also decide to send n chunks at once, in which case n signals can be sent with one signal call.

@chhwang (Contributor) commented Oct 17, 2024

> In my scenario, for example, data is sent as a flow of chunks between a producer and a consumer. The producer signals once per chunk sent. It would be nice for the consumer to know how many chunks it has received by one poll call. The producer may also decide to send n chunks at once, in which case n signals can be sent with one signal call.

  • If the producer has multiple data chunks to send, you could call put multiple times and then signal once. The consumer would call wait once. Does that work for your scenario?
  • If the producer needs to communicate the number of chunks, you could send that information as another data chunk. Your implementation would probably be faster than this, but I expect the difference in latency to be nearly invisible.
  • As a minor concern, incrementing the semaphore by user input may open the door to misuse or bugs. For example, the current design uses a single 64-bit unsigned integer flag and assumes at most a single increment per signal. This makes sense because, if a single signal takes 100 ns, it would take more than 50k years to overflow the flag. If we instead increment the flag by user input, we are implicitly pushing users to take care of the overflow issue themselves.
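As a sanity check on the arithmetic in the last bullet (assumptions: one signal every 100 ns, a 64-bit flag that wraps after 2^64 increments):

```cpp
// Rough estimate: years until a 64-bit semaphore flag wraps, at one
// signal every 100 ns, incrementing the flag by `step` per signal.
double overflow_years(double step) {
  const double flag_range = 18446744073709551616.0;  // 2^64
  const double ns_per_signal = 100.0;
  const double seconds_per_year = 365.25 * 24 * 3600;
  return (flag_range / step) * ns_per_signal * 1e-9 / seconds_per_year;
}
// overflow_years(1)   ≈ 58455 years, i.e. "more than 50k years"
// overflow_years(255) ≈ 229 years, close to the 228-year figure
// cited later in this thread for a uint8_t increment
```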

@liangyuRain (Author)

> • If the producer has multiple data chunks to send, you could call put multiple times and then signal once. The consumer would call wait once. Does that work for your scenario?
> • If the producer needs to communicate the number of chunks, you could send that information as another data chunk. Your implementation would probably be faster than this, but I expect the difference in latency to be nearly invisible.
> • As a minor concern, incrementing the semaphore by user input may open the door to misuse or bugs. For example, the current design uses a single 64-bit unsigned integer flag and assumes at most a single increment per signal. This makes sense because, if a single signal takes 100 ns, it would take more than 50k years to overflow the flag. If we instead increment the flag by user input, we are implicitly pushing users to take care of the overflow issue themselves.

Thanks for the feedback! I think sending the number of chunks separately adds unnecessary complexity, especially since the signal/poll mechanism already handles the producer-consumer scenario elegantly.

As for the overflow concern, the increment could be limited to a uint8_t. That would be more than sufficient for most practical use cases, and it would still take 228 years to overflow. Please consider this feature.

@chhwang (Contributor) commented Oct 18, 2024

> Thanks for the feedback! I think sending the number of chunks separately adds unnecessary complexity, especially since the signal/poll mechanism already handles the producer-consumer scenario elegantly.
>
> As for the overflow concern, the increment could be limited to a uint8_t. That would be more than sufficient for most practical use cases, and it would still take 228 years to overflow. Please consider this feature.

Sure, but that is further narrowing the use cases. Looks like you basically want a simple interface that sends a single byte data, and I still don't get exactly why we would need this, given that the current interface can already do this.

@liangyuRain (Author)

My understanding of the programming model for the mscclpp channel is that a signal notifies the receiver that a task is ready for processing, which is exactly a producer-consumer scenario. The current interface can do this, but it requires repeated operations on global memory that could be done in one round. Especially in the poll case, the consumer needs to atomicLoad n+1 times to learn that n chunks are ready to be processed as a batch. Sending extra bytes along with the signal to indicate the number of chunks can avoid this, but it still requires the receiver to store and sum the received values, rather than leveraging the semaphore's own increment.

I am OK to close this issue if you believe it would be more appropriate to address it otherwise.

@chhwang (Contributor) commented Oct 21, 2024

> My understanding of the programming model for the mscclpp channel is that a signal notifies the receiver that a task is ready for processing, which is exactly a producer-consumer scenario. The current interface can do this, but it requires repeated operations on global memory that could be done in one round. Especially in the poll case, the consumer needs to atomicLoad n+1 times to learn that n chunks are ready to be processed as a batch. Sending extra bytes along with the signal to indicate the number of chunks can avoid this, but it still requires the receiver to store and sum the received values, rather than leveraging the semaphore's own increment.
>
> I am OK to close this issue if you believe it would be more appropriate to address it otherwise.

I assume you are talking about wait, not poll, because poll is designed to be repeated multiple times indefinitely. If I'm correct, could you elaborate on why you need to repeat wait multiple times? You could send a single signal for the entire n+1 chunks and then wait just once.

@liangyuRain (Author)

> I assume you are talking about wait, not poll, because poll is designed to be repeated multiple times indefinitely. If I'm correct, could you elaborate on why you need to repeat wait multiple times? You could send a single signal for the entire n+1 chunks and then wait just once.

I am referring to poll. The feature I am proposing supports (1) sending n signals in one function call, and (2) learning how many signals have been received in one poll. This eliminates the need for repeated atomicLoads and GMEM accesses. (2) is similar to poll in C, which returns the number of ready file descriptors.

In your case, if I send one signal for n+1 chunks, I need additional logic to track how many chunks were sent, since n is a runtime variable. It is simpler to send one signal per chunk. For example, if I send 3, 4, and 5 chunks and signal 3, 4, and 5 times across three rounds, a single poll on the receiver side returns 12, indicating that 12 chunks have been sent.

To stay consistent with the existing poll() that returns true/false, I implemented poll(n), where n is the maximum number of signals to poll. The existing poll() is thus equivalent to poll(1).
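The three-round example above can be sketched on the host with a `std::atomic` stand-in for the channel semaphore (the `Channel` type and its members are illustrative, not the actual mscclpp API):

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>

// Host-side stand-in for the channel semaphore (illustrative only).
struct Channel {
  std::atomic<uint64_t> sent{0};
  uint64_t received{0};
  void signal(uint64_t n = 1) { sent.fetch_add(n, std::memory_order_release); }
  uint64_t poll(int64_t n = 1) {
    if (n <= 0) return 0;
    uint64_t avail = sent.load(std::memory_order_acquire) - received;
    uint64_t got = std::min<uint64_t>(avail, static_cast<uint64_t>(n));
    received += got;
    return got;
  }
};

// Producer sends 3, 4, then 5 chunks, signaling once per round.
// One poll on the consumer side then reports all 12 chunks at once,
// with a single atomic load instead of 13.
uint64_t demo() {
  Channel ch;
  ch.signal(3);
  ch.signal(4);
  ch.signal(5);
  return ch.poll(64);  // returns 12
}
```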
