How can I use nvidia/Llama-3.1-Nemotron-70B-Reward-HF directly for inference? #360
Comments
Instead, please use the following:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "What is 1+1?"
good_response = "1+1=2"
bad_response = "1+1=3"

for response in [good_response, bad_response]:
    messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    tokenized_message = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False, return_tensors="pt", return_dict=True)
    response_token_ids = model.generate(tokenized_message['input_ids'].cuda(), attention_mask=tokenized_message['attention_mask'].cuda(), max_new_tokens=1, return_dict_in_generate=True, output_scores=True)
    # The reward is read as the logit at index 0 of the single generated step.
    reward = response_token_ids['scores'][0][0][0].item()
    print(reward)
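For convenience, the loop above can be wrapped in a small helper that returns one scalar per candidate response, so several candidates can be ranked directly. This is only a sketch that reuses the model and tokenizer already loaded above; the helper name score_response is illustrative, not from the model card.

def score_response(prompt: str, response: str) -> float:
    # Build the two-turn conversation and apply the model's chat template.
    messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response}]
    tokenized = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=False, return_tensors="pt", return_dict=True
    )
    out = model.generate(
        tokenized["input_ids"].cuda(),
        attention_mask=tokenized["attention_mask"].cuda(),
        max_new_tokens=1,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Same convention as the snippet above: the reward is the first logit of the single generated step.
    return out["scores"][0][0][0].item()

candidates = ["1+1=2", "1+1=3"]
best = max(candidates, key=lambda r: score_response("What is 1+1?", r))
print(best)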
@Zhilin123 any tips for faster inference? Transformers is painfully slow and didn't use all my memory...
@arthrod The exact advice for improving inference with Transformers depends on the hardware you're using and the task you're trying to do (e.g. how many samples you're trying to annotate and how fast you expect it to be done). In general, running inference using NeMo-Aligner (i.e. with this checkpoint https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Reward instead of the one ending with ...-HF) is likely to give you better performance in terms of speed and memory usage.
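If you do stay with plain Transformers, one common way to get more out of the GPU is to score responses in batches instead of one at a time. Below is a minimal sketch under some assumptions: the model and tokenizer from the snippet at the top of the thread are already loaded, the reward is still read as the logit at vocab index 0 of the single generated step, and left padding plus the attention mask keep that read-out valid for padded batches.

import torch

# Batched scoring sketch (assumes `model` and `tokenizer` from the snippet above).
tokenizer.padding_side = "left"  # left-pad so the last conversation token sits next to the generated step
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompt = "What is 1+1?"
responses = ["1+1=2", "1+1=3"]

# Render each conversation to text with the chat template, then tokenize everything as one padded batch.
texts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}, {"role": "assistant", "content": r}],
        tokenize=False,
        add_generation_prompt=False,
    )
    for r in responses
]
batch = tokenizer(texts, return_tensors="pt", padding=True, add_special_tokens=False).to(model.device)

with torch.no_grad():
    out = model.generate(
        **batch,
        max_new_tokens=1,
        return_dict_in_generate=True,
        output_scores=True,
        pad_token_id=tokenizer.pad_token_id,
    )

# out["scores"][0] has shape (batch_size, vocab_size); index 0 of the vocab dim is the reward, as above.
rewards = out["scores"][0][:, 0].tolist()
for r, reward in zip(responses, rewards):
    print(r, reward)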
Thank you very much! I followed your advice (and even ventured to make a pull request). It works pretty well, except for the "quality" attribute, which is very strange. Also, Triton doesn't use all the available memory, no matter how I configure the batches (which doesn't matter much, because the script calculates them anyway). (I had to use the bigger model because the smaller one doesn't understand Portuguese well...)
I tried loading it using

model = AutoModelForSequenceClassification.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Reward-HF", token=token, quantization_config=nf4_config).to('cuda:1')

but this doesn't load the weights for the score head in the final layer. Is there a different HF API I need to use to do this?
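For what it's worth, the snippet at the top of the thread loads the -HF checkpoint with AutoModelForCausalLM rather than AutoModelForSequenceClassification, and the -HF checkpoint appears to expose the reward through the causal-LM head, which would explain why there are no score weights to load. A hedged sketch of combining that loading path with an NF4 quantization config (the BitsAndBytesConfig values here are illustrative stand-ins for the nf4_config mentioned above, not from this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative NF4 config; adjust to match your own nf4_config.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "nvidia/Llama-3.1-Nemotron-70B-Reward-HF"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=nf4_config,
    device_map={"": 1},  # quantized models can't be moved with .to(); this mirrors .to('cuda:1') above
    # pass token=... here if the repo requires authentication, as in the snippet above
)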