How can I use nvidia/Llama-3.1-Nemotron-70B-Reward-HF directly for inference? #360

Open
arunasank opened this issue Oct 25, 2024 · 4 comments

Comments

@arunasank

I tried loading it with model = AutoModelForSequenceClassification.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Reward-HF", token=token, quantization_config=nf4_config).to('cuda:1'), but this doesn't load the weights for the score head in the final layer. Is there a different HF API I need to use to do this?
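
For context, the nf4_config referenced above is not shown in the thread; a typical bitsandbytes 4-bit NF4 setup (a sketch only, not necessarily the exact config used) would look roughly like this:

import torch
from transformers import BitsAndBytesConfig

# Hypothetical nf4_config: 4-bit NF4 quantization via bitsandbytes.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)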

@Zhilin123
Collaborator

Zhilin123 commented Oct 25, 2024

The AutoModelForSequenceClassification class is not correct here.

Instead, please use the AutoModelForCausalLM class we recommend in https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward-HF#usage, with the relevant snippet below.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is 1+1?"
good_response = "1+1=2"
bad_response = "1+1=3"

for response in [good_response, bad_response]:
    messages = [{'role': "user", "content": prompt}, {'role': "assistant", "content": response}]
    # Tokenize the full conversation (prompt + response) with the chat template.
    tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False, return_tensors="pt", return_dict=True)
    # Generate a single token; the reward is read from its scores, not from the token itself.
    response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=1, return_dict_in_generate=True, output_scores=True)
    # The reward is the first logit of that single generated step.
    reward = response_token_ids['scores'][0][0][0].item()
    print(reward)
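
If memory is a constraint, the same AutoModelForCausalLM path can also be combined with the 4-bit quantization attempted in the original question. A minimal sketch (assuming bitsandbytes is installed; note that quantization may slightly shift the reward values):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"

# Hypothetical 4-bit NF4 quantization config.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate place the quantized weights; do not call .to(...) afterwards.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",
)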

@arthrod

arthrod commented Oct 29, 2024

@Zhilin123 any tips for faster inference? Transformers is painfully slow and didn't use all of my GPU memory...

@Zhilin123
Collaborator

@arthrod The exact advice for improving inference with Transformers depends on the hardware you're using and the task you're trying to do (e.g. how many samples you're trying to annotate and how quickly you expect it to be done). In general, running inference using NeMo-Aligner (i.e. with this checkpoint https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward instead of the one ending with ...-HF) is likely to give you better performance in terms of speed and memory usage.
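
If you stay with the HF checkpoint, batching several conversations per forward pass is one way to speed up annotation with Transformers. A rough sketch building on the snippet above (the batched apply_chat_template call, pad-token choice, and left-padding setup are assumptions requiring a recent transformers version, not something confirmed in this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Llama tokenizers usually ship without a pad token; batched generation needs one, padded on the left.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

conversations = [
    [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "1+1=2"}],
    [{"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "1+1=3"}],
]

batch = tokenizer.apply_chat_template(
    conversations, tokenize=True, add_generation_prompt=False,
    padding=True, return_tensors="pt", return_dict=True,
)
batch = {k: v.to(model.device) for k, v in batch.items()}

with torch.no_grad():
    out = model.generate(
        **batch, max_new_tokens=1, pad_token_id=tokenizer.pad_token_id,
        return_dict_in_generate=True, output_scores=True,
    )

# As in the single-example snippet, the reward is the first logit of the single generated step.
rewards = out["scores"][0][:, 0].tolist()
print(rewards)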

@arthrod

arthrod commented Oct 30, 2024

Thank you very much! I followed your advice (and even ventured to make a pull request). It works pretty well (except for the quality attribute, which is very, very weird). Also, Triton doesn't use all of the available memory, no matter how I configure the batches (which doesn't matter much, because the script calculates them anyway).

(I had to use the bigger model because the smaller one doesn't understand Portuguese well...)
