{ai}[foss/2023a] DeepSpeed v0.14.5, CUTLASS v3.4.0, DLPACK v0.8 w/ CUDA 12.1.1 #21438
base: develop
…tches: DeepSpeed-0.14.5_pic-compile.patch, DeepSpeed-0.14.2_no-ninja-dep.patch
Will probably want to change the Triton used to the one from #21318 |
DLPack |
CUTLASS |
DeepSpeed Typo in test command after linebreaks. |
Test report by @VRehnberg |
The latest failures have in common that they use the multi-node launcher. I'm unsure whether only the test is broken or something else is. An example of a failing command:

pdsh -S -f 1024 -w localhost export NCCL_IB_HCA=^mlx5_1; export PYTHONNOUSERSITE=1; export UCX_MODULE_DIR=[...]; export PYTHONPATH=[...]; /apps/Test2/software/Python/3.11.3-GCCcore-12.3.0/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --node_rank=%n --master_addr=127.0.0.1 --master_port=29500 /cephyr/NOBACKUP/priv/c3-staff/eb-tmp/eb-5t779o2m/pytest-of-c3-builder/pytest-0/test_user_args_True_I_m_going_0/user_arg_test.py --prompt "I\'m going to tell them \\"DeepSpeed is the best\\""\n'.decode

So the failure is probably just because LD_LIBRARY_PATH is not also exported. Looking for where this command is built, it seems that an add_export for it is missing here: https://github.com/microsoft/DeepSpeed/blob/v0.14.5/deepspeed/launcher/runner.py#L564-L578

So the fix should just be to add it to https://github.com/microsoft/DeepSpeed/blob/v0.14.5/deepspeed/launcher/runner.py#L34

Will try that. |
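To illustrate why LD_LIBRARY_PATH goes missing from the pdsh command, here is a minimal sketch of prefix-based export selection. It assumes the launcher forwards only variables whose names start with an entry in an allow-list, which is how I read the linked runner.py lines. The names EXPORT_PREFIXES, collect_exports and format_exports are hypothetical and not DeepSpeed's API, and the sample environment is made up.

```python
import shlex

# Hypothetical allow-list for illustration; the real list lives in runner.py.
EXPORT_PREFIXES = ["NCCL", "PYTHON", "UCX"]


def collect_exports(env, prefixes):
    """Keep only variables whose name starts with one of the prefixes."""
    return {
        name: value
        for name, value in env.items()
        if any(name.startswith(prefix) for prefix in prefixes)
    }


def format_exports(exports):
    """Render the selection as 'export NAME=VALUE;' fragments."""
    return " ".join(
        f"export {name}={shlex.quote(value)};"
        for name, value in sorted(exports.items())
    )


# Made-up environment resembling the one in the failing command.
env = {
    "NCCL_IB_HCA": "^mlx5_1",
    "PYTHONNOUSERSITE": "1",
    "LD_LIBRARY_PATH": "/opt/sw/lib",
    "HOME": "/home/user",
}

# Without the extra prefix, LD_LIBRARY_PATH is silently dropped.
print(format_exports(collect_exports(env, EXPORT_PREFIXES)))

# Adding the prefix is enough for it to be forwarded.
print(format_exports(collect_exports(env, EXPORT_PREFIXES + ["LD_LIBRARY_PATH"])))
```

Because matching is by prefix, a short entry such as 'EB' or 'CUDA' pulls in every variable starting with that string, which is worth keeping in mind when extending the list.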
Test report by @VRehnberg |
Test report by @VRehnberg |
Test report by @VRehnberg |
Compared the environment variables before and after loading the DeepSpeed module. Will probably update the pdsh-env-vars patch with those. So remaining
|
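The before/after comparison can be scripted roughly like this. It is only a sketch: it assumes the two snapshots were captured as plain `env` output (one NAME=VALUE per line, so multi-line values are not handled), and the helper names and sample values are hypothetical.

```python
def parse_env(text):
    """Parse `env`-style output into a dict (one NAME=VALUE per line)."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def changed_vars(before, after):
    """Names that are new or have a different value after the module load."""
    return sorted(name for name, value in after.items() if before.get(name) != value)


# Made-up snapshots standing in for `env` before and after `module load DeepSpeed`.
before = parse_env("PATH=/usr/bin\nHOME=/home/user\n")
after = parse_env(
    "PATH=/opt/sw/bin:/usr/bin\nHOME=/home/user\nEBROOTDEEPSPEED=/opt/sw\n"
)

print(changed_vars(before, after))
```

The resulting names are candidates for the export list; which of them the remote processes actually need is still a judgment call.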
Env report seems to indicate that pre-built ops are not being picked up properly:
|
Test report by @VRehnberg |
Test report by @VRehnberg |
Three failing files in latest 4xA100 run:
I can't reproduce them when skipping the test step and running them manually (though the first one is unclear, as it fails for another reason). |
Seems mostly fine now, rerunning build on Ampere GPU. |
Test report by @VRehnberg |
+EXPORT_ENVS += [  # Extra based on what's added by module load DeepSpeed
+    'LD_LIBRARY_PATH', 'PATH', 'EB', 'TRITON', 'CUDA',  # important
+    'ACLOCAL', 'CMAKE', 'CPATH', 'LIBRARY_PATH', 'MPL', 'NCCL',
+    'PKG_CONFIG_PATH', 'XDG_DATA_DIRS',
+]
I'm on the fence about whether all of these should be included. The alternative is to reduce the list and add variables case by case with a .deepspeed_env file: https://www.deepspeed.ai/getting-started/#multi-node-environment-variables
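For the case-by-case alternative, a rough sketch of generating such a file is below. It assumes the format described in the linked documentation (newline-separated VAR=VAL entries); the helper name write_deepspeed_env, the chosen variable names and the values are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_deepspeed_env(path, names, env=None):
    """Write NAME=VALUE lines for the requested variables that are set."""
    env = os.environ if env is None else env
    lines = [f"{name}={env[name]}" for name in names if name in env]
    Path(path).write_text("".join(line + "\n" for line in lines))
    return lines


# Made-up environment; in practice this would default to os.environ.
sample_env = {"LD_LIBRARY_PATH": "/opt/sw/lib", "CPATH": "/opt/sw/include"}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / ".deepspeed_env"
    written = write_deepspeed_env(
        target, ["LD_LIBRARY_PATH", "XDG_DATA_DIRS"], sample_env
    )
    content = target.read_text()
    print(content, end="")
```

Unset variables are skipped rather than written as empty entries, so the file only ever overrides what the user explicitly has in their environment.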
One issue with |
Hopefully solved now in easybuilders/easybuild-easyblocks#3450 |
Test report by @VRehnberg |
Test report by @VRehnberg |
Last failure was from two quantized nvme offload tests. Either error is because |
Test report by @VRehnberg |
Meh, thought I was done. Why have new errors popped up :(
|
Updated software
|
Test report by @VRehnberg Another previously unseen error: a segfault in test_compile_zero. |
Test report by @VRehnberg Unknown error when building wheel, don't think anything has changed from before. |
Test report by @VRehnberg |
Test report by @VRehnberg |
Test report by @VRehnberg |
Test report by @VRehnberg Here's the build error again. I'm mostly confused about why it only appears sometimes. Perhaps I need to actually fix it. |
Test report by @VRehnberg Build failures mostly seem flaky. Only change with this one to the one before |
There still seems to be something flaky about the build step, but I have no idea what it could be. At this point I'd welcome others testing to see whether they also experience this. |
…asyconfigs into 20240918144941_new_pr_DeepSpeed0145
(created using eb --new-pr)
Requires:
Edit: It also patches an existing version of CUTLASS (instead of adding a new version as it did initially).