NVIDIA / Megatron-LM Public

Issues: NVIDIA/Megatron-LM

137 Open 645 Closed

Issues list

[QUESTION] How to make training resumption bitwise reproducible?

#1298 opened Nov 24, 2024 by XLzed

[BUG] validate_yaml() isn't in sync with arguments check

#1297 opened Nov 21, 2024 by pierric

[QUESTION] deepseek v2 compatibility?

#1295 opened Nov 21, 2024 by wavy-jung

[BUG] 0.9.0 release version got param_gather_handle error with 3d parallel

#1292 opened Nov 19, 2024 by SeunghyunSEO

[QUESTION] How to convert torch_dist format checkpoint to torch format?

#1291 opened Nov 19, 2024 by zhangyilalala

[QUESTION] SGD support in distrib_optimizer.py

#1287 opened Nov 13, 2024 by zstreeter

[QUESTION] There is already a 32-bit model parameter in the optimizer state. Why do we need to store a separate copy of the model parameters in the checkpoint?

#1283 opened Nov 12, 2024 by leondada

Where can I download the tokenizer for the model mcore-llava-mistral-7b-instruct-clip336-pretraining?

#1281 opened Nov 11, 2024 by herolxl

[BUG] Megatron-LM doesn't support transformer-engine 1.13

#1280 opened Nov 11, 2024 by klhhhhh

[BUG] Encountering NaN gradients when using CUDA Graph

#1279 opened Nov 11, 2024 by DXZDXZ

[QUESTION] Is there any restriction on using allgather with moe_expert_capacity_factor?

#1277 opened Nov 7, 2024 by Louis-J

[QUESTION] Scaling MFU calculation

#1276 opened Nov 6, 2024 by ltm920716

[BUG] TP-comm-overlap bug when replacing TELayerNormColumnParallelLinear with TEColumnParallelLinear

#1275 opened Nov 6, 2024 by wplf

[QUESTION] How to Visualize Computational Graph

#1272 opened Nov 2, 2024 by zixianwang2022

[BUG] The cached_loss_mask may be modified unexpectedly in GPTDataset?

#1269 opened Nov 1, 2024 by shmily326

[BUG] Problem building the multimodal Dockerfile

#1267 opened Oct 30, 2024 by FortuneBush

[QUESTION] How to use loader_mcore and why it requires torch distributed

#1266 opened Oct 29, 2024 by KookHoiKim

[ENHANCEMENT] Enabling LR scaling for a specific layer (ex. down-projection...) during pretraining

#1263 opened Oct 28, 2024 by dhia680

[BUG] Flash attention cannot be applied by passing the --use-flash-attn flag when the --use-mcore-models flag is also passed

#1259 opened Oct 26, 2024 by efsotr

[BUG] MoE pre-training does not scale beyond DP dim>8

#1258 opened Oct 25, 2024 by hwang595

[BUG] Cannot save mamba model in distributed training

#1234 opened Oct 22, 2024 by siriusctrl

[BUG] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.

#1207 opened Oct 10, 2024 by takuya576

[QUESTION] Failure to establish communication between multiple machines

#1202 opened Oct 8, 2024 by zmtttt

[QUESTION] Encoder with more TP than the decoder

#1200 opened Oct 6, 2024 by MlWoo

[ENHANCEMENT] Add layer name in a layer to improve code debugging

#1198 opened Oct 4, 2024 by rybakov
