A Python distributed test doesn't exit properly when one rank fails. #3092

wujingyue · 2024-10-03T05:18:30Z

This is apparently caused by the use of MPI_Barrier. Each test waits on a barrier before finishing, so when a rank fails it never enters the barrier, and the other ranks hang waiting for it.
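To illustrate the mechanism without an MPI launch, here is a rough analogy that uses only the standard library: threads stand in for ranks and `threading.Barrier` stands in for the MPI barrier. The helper name `run_ranks` and the timeout are assumptions made for this sketch; a real MPI barrier has no timeout, so the healthy ranks would block indefinitely instead of reporting "stuck at barrier".

```python
import threading


def run_ranks(world_size, failing_rank, timeout=0.5):
    """Simulate `world_size` ranks as threads and return each rank's outcome.

    Hypothetical helper for illustration only; this is not nvFuser or mpi4py code.
    """
    barrier = threading.Barrier(world_size)
    outcomes = {}

    def rank_main(rank):
        try:
            if rank == failing_rank:
                # The failing rank raises before it ever reaches the barrier.
                raise AssertionError(f"rank {rank} failed its test")
            # Healthy ranks wait at the end-of-test barrier. The timeout exists
            # only so this simulation terminates; MPI would block here forever.
            barrier.wait(timeout=timeout)
            outcomes[rank] = "passed"
        except AssertionError:
            outcomes[rank] = "failed"
        except threading.BrokenBarrierError:
            # Raised when the wait times out or the barrier is broken.
            outcomes[rank] = "stuck at barrier"

    threads = [
        threading.Thread(target=rank_main, args=(rank,))
        for rank in range(world_size)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outcomes
```

Calling `run_ranks(3, failing_rank=1)` marks rank 1 as failed and the other two as stuck at the barrier, which mirrors the hang described above.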

wujingyue · 2024-10-10T01:57:55Z

This appears to be a known limitation of mpi4py: https://mpi4py.readthedocs.io/en/latest/mpi4py.run.html#exceptions-and-deadlocks. The solution proposed there is to launch the script with python -m mpi4py, but I'm having trouble doing that with pytest.
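For context, the linked mpi4py page describes `python -m mpi4py` as calling MPI_Abort on `MPI.COMM_WORLD` when an exception goes unhandled, so the remaining ranks are torn down instead of waiting. A minimal sketch of the same idea applied by hand is below. The wrapper name `run_or_abort` and the injected `abort` callable are assumptions for illustration, not part of mpi4py or pytest; with mpi4py installed one would presumably pass `MPI.COMM_WORLD.Abort`.

```python
import sys
import traceback


def run_or_abort(test_fn, abort):
    """Run `test_fn`; if it raises, print the traceback and call `abort(1)`.

    `abort` is injected so the sketch runs without MPI. Under mpi4py it could
    be `MPI.COMM_WORLD.Abort`, which terminates every rank and does not return.
    """
    try:
        test_fn()
    except Exception:
        # Report the failure before tearing the job down, since an abort
        # may not give buffered output a chance to flush.
        traceback.print_exc(file=sys.stderr)
        sys.stderr.flush()
        abort(1)
        # Only reached when `abort` is a stand-in that returns.
        raise
```

Whether this can be wired cleanly into pytest's reporting is an open question; an abort would also skip any remaining tests on all ranks.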

My best bet so far is to expose Communicator through nvFuser's Python API and then through a test fixture. Communicator::barrier doesn't suffer from this AFAIK. In addition, Communicator can host some features common to the C++ tests and the Python tests, e.g., #3091.

wujingyue added the Multi-GPU label Oct 4, 2024

wujingyue mentioned this issue Oct 9, 2024

test_transformer_engine:test_transformer_layer[backward] hangs. #3129

Closed

kevinstephano added the Testing (e.g. improving test infra and test coverage) and Triage labels Oct 30, 2024

wujingyue self-assigned this Nov 1, 2024

wujingyue removed the Triage label Nov 4, 2024
