How can I use nvidia/Llama-3.1-Nemotron-70B-Reward-HF directly for inference? #360

Open
arunasank opened this issue Oct 25, 2024 · 4 comments

Comments

@arunasank

I tried loading it with model = AutoModelForSequenceClassification.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Reward-HF", token=token, quantization_config=nf4_config).to('cuda:1'), but this doesn't load the weights for the score head in the final layer. Is there a different HF API I need to use to do this?
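
For context, the nf4_config referenced above is not shown in the thread; a typical bitsandbytes 4-bit NF4 setup (a sketch only, not necessarily the exact config used) would look roughly like this:

import torch
from transformers import BitsAndBytesConfig

# Hypothetical nf4_config: 4-bit NF4 quantization via bitsandbytes.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)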

@Zhilin123
Collaborator

Zhilin123 commented Oct 25, 2024

The AutoModelForSequenceClassification class is not correct here.

Instead, please use the AutoModelForCausalLM class we recommend in https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF#usage, with the relevant snippet below.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is 1+1?"
good_response = "1+1=2"
bad_response = "1+1=3"

for response in [good_response, bad_response]:
    messages = [{'role': "user", "content": prompt}, {'role': "assistant", "content": response}]
    # Tokenize the full conversation (prompt + response) with the chat template.
    tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False, return_tensors="pt", return_dict=True)
    # Generate a single token; the reward is read from its scores, not from the token itself.
    response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=1, return_dict_in_generate=True, output_scores=True)
    # The reward is the first logit of that single generated step.
    reward = response_token_ids['scores'][0][0][0].item()
    print(reward)
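
If memory is a constraint, the same AutoModelForCausalLM path can also be combined with the 4-bit quantization attempted in the original question. A minimal sketch (assuming bitsandbytes is installed; note that quantization may slightly shift the reward values):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"

# Hypothetical 4-bit NF4 quantization config.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate place the quantized weights; do not call .to(...) afterwards.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)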

@arthrod

arthrod commented Oct 29, 2024

@Zhilin123 any tips for faster inference? Transformers is painfully slow and didn't use all of my GPU memory...

@Zhilin123
Collaborator

@arthrod The exact advice for improving inference with Transformers depends on the hardware you're using and the task you're trying to do (e.g. how many samples you're trying to annotate and how quickly you expect it to be done). In general, running inference using NeMo-Aligner (i.e. with this checkpoint https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward instead of the one ending with ...-HF) is likely to give you better performance in terms of speed and memory usage.
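
If you stay with the HF checkpoint, batching several conversations per forward pass is one way to speed up annotation with Transformers. A rough sketch building on the snippet above (the batched apply_chat_template call, pad-token choice, and left-padding setup are assumptions requiring a recent transformers version, not something confirmed in this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Llama tokenizers usually ship without a pad token; batched generation needs one, padded on the left.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

conversations = [
    [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "1+1=2"}],
    [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "1+1=3"}],
]

batch = tokenizer.apply_chat_template(
    conversations, tokenize=True, add_generation_prompt=False,
    padding=True, return_tensors="pt", return_dict=True,
)
batch = {k: v.to(model.device) for k, v in batch.items()}

with torch.no_grad():
    out = model.generate(
        **batch, max_new_tokens=1, pad_token_id=tokenizer.pad_token_id,
        return_dict_in_generate=True, output_scores=True,
    )

# As in the single-example snippet, the reward is the first logit of the single generated step.
rewards = out["scores"][0][:, 0].tolist()
print(rewards)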

@arthrod

arthrod commented Oct 30, 2024

Thank you very much! I followed your advice (and even ventured to make a pull request). It works pretty well (except for the quality attribute, which is very, very weird). Also, Triton doesn't use all of the available memory, no matter how I configure the batches (which doesn't matter much, because the script calculates them anyway).

(I had to use the bigger model because the smaller one doesn't understand Portuguese well...)
