feat: Self-Rewarding Algorithm with TRT Support #321
base: main
Conversation
Signed-off-by: Gerald Shen <[email protected]>
* trtllm0.9 changes
* fix typos
* address comments
* fixes
* fix
* fix nemo generations with PP
* add engine_unload
* cleanup trtllm
* address comments

Signed-off-by: jiemingz <=>
Co-authored-by: jiemingz <=>
- preference_loss: the raw DPO variant loss
- sft_loss: if adding an SFT loss (categorical cross-entropy loss) for the chosen response, then you can see that raw loss here

The ``reward`` in this case is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses.
Fix punctuation.
The ``reward``, in this case, is calculated as the difference between model log probs and the reference log probs, multiplied by the KL penalty (beta in the original paper), for the ground truth and generated responses.
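As an aside for readers of this thread, a minimal sketch in Python of the reward computation described above (the function and variable names are illustrative, not taken from the PR):

import torch

def dpo_style_reward(policy_logprobs, ref_logprobs, kl_penalty):
    # Reward as described above: (policy log probs - reference log probs) * beta,
    # computed for both the chosen (ground truth) and the generated response.
    return kl_penalty * (policy_logprobs - ref_logprobs)

# e.g. per-response summed log probs for a chosen and a rejected generation
chosen_reward = dpo_style_reward(torch.tensor(-12.3), torch.tensor(-14.0), kl_penalty=0.1)
rejected_reward = dpo_style_reward(torch.tensor(-20.5), torch.tensor(-19.8), kl_penalty=0.1)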
All metrics will be grouped by either ``train/`` or ``val/`` in WandB, representing whether that metric is from the training or validation set, respectively.
You can also see a table which will print out the prompt, chosen response, and rejected response for each validation step. This allows you to keep track of response quality and hallucinations.

When it comes to ideal hyperparameters for Self-Rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data, so there is no one-size-fits-all parameter set which will work in all cases.
Fix capitalization, revise sentence.
When it comes to ideal hyperparameters for self-rewarding training, much will depend on the characteristics of your SFT (or base/foundational) model and your training data. Therefore, there is no one-size-fits-all parameter set that will work in all cases.
Additionally, Self-Rewarding (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Fix capitalization, revise.
Additionally, self-rewarding training (with or without meta) is a complex algorithm with a lot of moving pieces and a lot of parameters, so finding what works well for your model and data can be difficult.
Below are some of observations from the Nvidia Alignment team as to what parameters we have seen work well:
Fix capitalization, revise sentence.
Below are some observations from the NVIDIA Alignment team regarding parameters that we have found to work well:
* global_batch_size: we recommend using 64, and going up to 128 only for large models (70B+) that are also training with large datasets
Revise
global_batch_size: We recommend using 64, and increasing to 128 only for large models (70B+) that are also training with large datasets.
* iterations/epochs: the original paper uses 3 iterations with 1 epoch per iteration, and we find this to be sufficient for most use cases
Revise
iterations/epochs: The original paper uses 3 iterations with 1 epoch per iteration. We find this to be sufficient for most use cases.
* learning rate: for SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 to 9e-7.
Revise
learning rate: For SFT/aligned models, we recommend a smaller LR, between 3e-7 and 1e-7. If training a foundational model, then something between 3e-6 and 9e-7 is recommended.
* ref_policy_kl_penalty: we did not see large changes from perturbations to this value; we recommend 0.1 - 0.001
Revise
ref_policy_kl_penalty: We did not see large changes from perturbations to this value. We recommend 0.1 - 0.001.
* length_control: depends very much on model size and data, but we found good results with [0,0,0.1]
Revise
length_control: This parameter depends very much on model size and data, but we found good results with [0,0,0.1].
* use_meta_judge: we have found stronger results when settings this to true, which is in line with the paper's results
Revise
use_meta_judge: We found stronger results when setting this parameter to true, which is in line with the paper's results.
* meta_judge_pcnt: we recommend you do not set this higher than 0.15 (15%). Any higher, and we have observed that the llm-as-a-judge model starts to output identical scores for every response (always a 5)
Revise
meta_judge_pcnt: We recommend not setting this higher than 0.15 (15%). Any higher, and we have observed that the LLM-as-a-judge model starts to output identical scores for every response (always a 5).
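Pulling the recommendations in this batch of comments together, a hedged consolidation in Python follows; the key names and paths are illustrative only and may not match the actual NeMo-Aligner YAML layout.

# Hedged consolidation of the recommendations above; key names/paths are
# illustrative and may not match the repo's real config structure.
recommended_overrides = {
    "model.global_batch_size": 64,        # go up to 128 only for 70B+ models with large datasets
    "num_iterations": 3,                  # 3 iterations x 1 epoch each, as in the paper
    "num_epochs_per_iteration": 1,
    "model.optim.lr": 3e-7,               # SFT/aligned: 3e-7 to 1e-7; foundational: 3e-6 to 9e-7
    "model.ref_policy_kl_penalty": 0.1,   # values in 0.1 - 0.001 behaved similarly
    "model.length_control": [0, 0, 0.1],  # highly model- and data-dependent
    "model.use_meta_judge": True,         # stronger results, consistent with the paper
    "model.meta_judge_pcnt": 0.15,        # higher values led to the judge emitting constant scores
}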
I completed the technical edit of CHANGELOG.md and docs/user-guide/self_rewarding.rst. Please review the edits, make the changes in the files, and mark each open thread "resolved."
Still WIP but submitting first batch of comments
Is this file needed for Self-Rewarding? If not, let's move it to a different PR.
It's needed if you want to follow the self-rewarding paper exactly to generate the EFT dataset.
I see, it'd be good to keep it then, but it also needs to be documented so that people understand how to generate this EFT dataset. At a quick glance I'm not seeing it referenced in the self-rewarding doc => could you add it to explain how to generate an EFT dataset?
Signed-off-by: Daniel Egert <[email protected]>
for more information, see https://pre-commit.ci Signed-off-by: NeMo-Aligner CI <[email protected]>
Just a couple of minor typos
Signed-off-by: Daniel Egert <[email protected]>
…/NeMo-Aligner into degert/self-rewarding-trt
I completed the technical edit of CHANGELOG.md and docs/user-guide/self_rewarding.rst.
Going to submit review in chunks so you can start addressing comments right away
max_steps: -1
limit_train_batches: 1.0

# Accelerate training times by accelerating inference stage using TRTLLM
For consistency
- # Accelerate training times by accelerating inference stage using TRTLLM
+ # Speed-up training by accelerating inference stage using TRTLLM
# reshard: False # reshard is not supported in generation

# TRTLLM preallocates activation memory according to the number of input tokens
# By default, assume the max input length is half of the model sequence length
I'd just remove this line
# By default, assume the max input length is half of the model sequence length
(btw, same in gpt_self_rewarding.yaml and gpt_spin.yaml)
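For context, the default quoted just below is derived from the sequence length and the generation budget rather than being a fixed half of the sequence length; a quick illustrative calculation in Python (numbers are made up):

# Illustrative numbers only; the real values come from the model config.
encoder_seq_length = 4096      # model.encoder_seq_length
max_generation_length = 1024   # model.generation.length_params.max_length

# mirrors ${subtract:${model.encoder_seq_length}, ${model.generation.length_params.max_length}}
max_input_len = encoder_seq_length - max_generation_length  # 3072, not encoder_seq_length / 2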
# By default, assume the max input length is half of the model sequence length
max_input_len: ${subtract:${model.encoder_seq_length}, ${model.generation.length_params.max_length}}

model_type: gptnext
- model_type: gptnext
+ model_type: gptnext # can be gptj, gptnext, llama, gemma, falcon
model_type: gptnext

# Unload and reload the TRTLLM engine before and after the training stage
- # Unload and reload the TRTLLM engine before and after the training stage
+ # Save GPU memory by unloading and reloading the TRTLLM engine before and after the training stage
resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.
resume_if_exists: True
resume_ignore_no_checkpoint: True
create_checkpoint_callback: True
Shouldn't this be set to False?
custom_trainer_state_dict = None
consumed_samples = 0

if os.path.exists(gen_file := os.path.join(cfg.exp_manager.explicit_log_dir, "generations", "generations.jsonl")):
Should this `if` block replace the previous `if` block above? (Both seem to reload state from a previous run, but I guess only this one matters?)
""" | ||
dp_group = parallel_state.get_data_parallel_group() | ||
calc_gbs = cfg.model.generation.rollout_micro_batch_size * dp_group.size() | ||
with open_dict(cfg): | ||
cfg.model.global_batch_size = calc_gbs | ||
with open_dict(ptl_model.cfg): | ||
ptl_model.cfg.global_batch_size = calc_gbs | ||
if hasattr(ptl_model, "global_batch_size"): | ||
ptl_model.global_batch_size = calc_gbs | ||
""" |
Looks like debug code that may be removed?
consumed_samples=consumed_samples,
mbs=cfg.model.micro_batch_size,
gbs=cfg.model.global_batch_size,
collate_fn=eye,
Can you use `identity_collate` from `nemo_aligner/data/nlp/builders.py`?
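For context, `identity_collate` is just a pass-through collate; a minimal sketch of what such a helper typically looks like (the actual implementation in `nemo_aligner/data/nlp/builders.py` may differ):

def identity_collate(batch):
    # Return the batch exactly as the dataloader assembled it (a list of dataset
    # items), deferring any padding or tensor stacking to the caller.
    return batch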
# eos_id = ptl_model.tokenizer.eos_id

# collate fn to pad to the max seq length in the batch
# collate_fn = partial(
#     self_rewarding_custom_collate,
#     eos_id=eos_id,
#     reset_position_ids=cfg.model.data.get("reset_position_ids", False),
#     reset_attention_mask=cfg.model.data.get("reset_attention_mask", False),
#     eod_mask_loss=cfg.model.data.get("eod_mask_loss", False),
# )
To be removed?
)

init_using_ptl(trainer, ptl_model, train_dataloader, train_ds)
# optimizer, scheduler = extract_optimizer_scheduler_from_ptl_model(ptl_model)
To remove
# optimizer, scheduler = extract_optimizer_scheduler_from_ptl_model(ptl_model)
Signed-off-by: Daniel Egert <[email protected]>
Just a few more comments
consumed_samples=consumed_samples,
mbs=cfg.model.micro_batch_size,
gbs=cfg.model.global_batch_size,
collate_fn=eye,
Use identity_collate
@@ -37,6 +37,12 @@
)
from nemo_aligner.utils.utils import load_and_override_model_config, load_from_nemo, retrieve_model_state_dict_in_cpu

try:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
Please add a comment explaining why we need this
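For what it's worth, `torch._dynamo.config.suppress_errors = True` makes TorchDynamo log compilation failures and fall back to eager execution instead of raising. The PR's specific motivation isn't stated here, so the comment below is only a guess at possible wording:

try:
    import torch._dynamo

    # Hypothetical wording: fall back to eager execution instead of raising if
    # TorchDynamo fails to compile a graph (e.g. around generation code paths).
    torch._dynamo.config.suppress_errors = True
except ModuleNotFoundError:
    # older PyTorch builds may not ship torch._dynamo
    pass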
@@ -177,6 +181,7 @@ def main(cfg) -> None:
logger=logger,
ckpt_callback=ckpt_callback,
run_timer=timer,
exp_manager=cfg.exp_manager,
Looks like this arg is unused in SPINTrainer, what's up with it?
Comments on generation
class GenerationTrainer:
    """Trainer to coordinate Self-Rewarding training
Comment to update
# input_ids = [item["input_ids"] for item in batch]
# masks = [item["mask"] for item in batch]
context_ids = [item["context_ids"] for item in batch]
# answer_ids = [item["answer_ids"] for item in batch]
context_lengths = torch.LongTensor([len(x) for x in context_ids])
# combined_lengths = torch.LongTensor([len(x) for x in input_ids])

# input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=eos_id)
# masks = torch.nn.utils.rnn.pad_sequence(masks, batch_first=True, padding_value=False)
context_ids = torch.nn.utils.rnn.pad_sequence(context_ids, batch_first=True, padding_value=eos_id)
# answer_ids = torch.nn.utils.rnn.pad_sequence(answer_ids, batch_first=True, padding_value=eos_id)

output = {
    # "prompts_and_answers": input_ids,
    # "masks": masks,
    "prompts_only": context_ids,
    # "answers_only": answer_ids,
    "prompt_lengths": context_lengths,
    # "combined_lengths": combined_lengths,
Any reason to keep all the commented stuff? If not let's remove it to make it more readable.
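For reference, this is roughly what the quoted block would look like with the commented-out lines dropped, as suggested (a sketch only; the function name and signature are assumed):

import torch

def generation_custom_collate(batch, eos_id):
    # Hypothetical name and signature; the body is the quoted block with the
    # commented-out lines removed.
    context_ids = [item["context_ids"] for item in batch]
    context_lengths = torch.LongTensor([len(x) for x in context_ids])

    # pad prompts to the longest prompt in the batch
    context_ids = torch.nn.utils.rnn.pad_sequence(context_ids, batch_first=True, padding_value=eos_id)

    return {
        "prompts_only": context_ids,
        "prompt_lengths": context_lengths,
    }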
self.set_max_steps()

'''
def augment_dataloader(self, dataloader):
Can we remove this whole code block that is commented out?
# assert (
#     self.model.cfg.generation.rollout_micro_batch_size % dp_batch_size == 0
# ), f"rollout_micro_batch_size [{self.model.cfg.generation.rollout_micro_batch_size}] must be a multiple of GBS [{self.model.cfg.global_batch_size}] // DP [{parallel_state.get_data_parallel_world_size()}]"
# self.rollout_micro_batch_size = self.model.cfg.generation.rollout_micro_batch_size
# assert self.rollout_micro_batch_size > 0, "`rollout_micro_batch_size` must be > 0"
Can be removed?
max_input_len=self.cfg.trt_llm.get(
    "max_input_len", self.model.cfg.encoder_seq_length - self.length_params["max_length"]
),
generation_batch_size=dp_batch_size,
`dp_batch_size` is based on the global batch size. I'd suggest instead using `micro_batch_size`, because it's a more natural hyperparameter to tweak to trade between generation speed and memory usage for any DP size. (And I would remove `global_batch_size` from the config, overriding it in the code to `micro_batch_size * DP`.)
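A rough sketch of the change being suggested here (the function is hypothetical; `parallel_state.get_data_parallel_world_size()` is the same call used elsewhere in the quoted code):

from megatron.core import parallel_state

def resolve_generation_batch_size(model_cfg):
    # Sketch: size generation batches per DP rank by micro_batch_size and derive
    # global_batch_size from it, rather than requiring the user to set both.
    dp_size = parallel_state.get_data_parallel_world_size()
    generation_batch_size = model_cfg.micro_batch_size
    model_cfg.global_batch_size = generation_batch_size * dp_size  # derived, not user-set
    return generation_batch_size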
return  # training ended

global_pbar = tqdm(
    self.augment_dataloader(self.train_dataloader),
Using `augment_dataloader()` seems somewhat convoluted. Why don't we just iterate on the dataloader (in the `for` loop below) and run generation on each batch?
self.consumed_samples += self.model.cfg.global_batch_size
self.step += 1

if torch.distributed.get_rank() == 0 and gen_tokens_list is not None:
As far as I can tell the second check is useless
- if torch.distributed.get_rank() == 0 and gen_tokens_list is not None:
+ if torch.distributed.get_rank() == 0:
prompt = self.model.tokenizer.ids_to_text(t_[:s_].long().tolist())
response = self.model.tokenizer.ids_to_text(t_[s_:e_].long().tolist())
Just a note that this might be potentially dangerous. Some tokenizers behave in a weird way, and I'm not 100% sure we can always guarantee that decoding a subset of the token IDs recovers the correct text of the response. No need to change it for now (you can resolve), since my quick tests suggest it should be fine, but IMO a safer approach is to decode the full sequence, ensure it starts with the original prompt (in text form), and keep only what's after this prompt. Just letting you know in case you run into some weird things in the future as new fancy tokenizers are introduced...
Also, not a huge deal, but those two lines may be moved under the `if v_:` below.
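A sketch of the safer decoding approach described above (the helper is hypothetical; `ids_to_text` is the same tokenizer call as in the quoted lines):

def split_prompt_and_response(tokenizer, token_ids, prompt_len):
    # Decode the full sequence, then recover the response as whatever follows the
    # decoded prompt text; fall back to subset decoding if the prefix doesn't match.
    full_text = tokenizer.ids_to_text(token_ids)
    prompt_text = tokenizer.ids_to_text(token_ids[:prompt_len])
    if full_text.startswith(prompt_text):
        return prompt_text, full_text[len(prompt_text):]
    return prompt_text, tokenizer.ids_to_text(token_ids[prompt_len:])  # current behaviour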
self.logger.finalize()

if torch.distributed.get_rank() == 0 and self.generations_fh is not None:
Should never be None, right?
- if torch.distributed.get_rank() == 0 and self.generations_fh is not None:
+ if torch.distributed.get_rank() == 0:
if self.use_trtllm_generation:
    self.trtllm_generate.free()

def save(self, extra_candidates=None, is_train_end=False):
Is `save()` called anywhere? Seems like we shouldn't need it since the state is saved in the JSONL output (then we could also probably get rid of `state_dict()`).
What does this PR do?
Adds support for the Self-Rewarding and Meta-Rewarding algorithms from the following two papers:
https://arxiv.org/abs/2401.10020
https://arxiv.org/abs/2407.19594
Changelog
Usage
Please see the new tutorial document at:
docs/user-guide/self_rewarding.rst
Before your PR is "Ready for review"
Pre checks:
Checklist when contributing a new algorithm
max_steps=-1 and validation?
Additional Information