The provided fine-tuning script lets you select between three datasets by passing the dataset arg to the llama_recipes.finetuning module or the examples/finetuning.py script. The current options are grammar_dataset, alpaca_dataset and samsum_dataset. Additionally, we integrate the OpenAssistant/oasst1 dataset as an example of a custom dataset.
Note: Use of any of the datasets should be in compliance with the dataset's underlying licenses (including but not limited to non-commercial uses).
- grammar_dataset contains 150K pairs of English sentences and possible corrections.
- alpaca_dataset provides 52K instruction-response pairs as generated by text-davinci-003.
- samsum_dataset contains about 16k messenger-like conversations with summaries.
- OpenAssistant/oasst1 contains about 88k messages from assistant-style conversations.
The datasets available in llama-recipes are meant to give users a quick start on training their Llama model. There are two ways to use a custom dataset. The first is to provide a function that returns the dataset in a .py file, which is passed to the command line tool; this does not involve changing the source code of llama-recipes. The second is aimed at contributions that extend llama-recipes, as it involves changing the source code.
To supply a custom dataset you need to provide a single .py file which contains a function with the following signature:
def get_custom_dataset(dataset_config, tokenizer, split: str):
For an example of get_custom_dataset, you can look at the provided datasets in llama_recipes.datasets or at examples/custom_dataset.py.
The dataset_config in the above signature will be an instance of llama_recipes.configs.datasets.custom_dataset with the modifications made through the command line.
The split argument signals whether to return the training or validation dataset.
The default function name is get_custom_dataset, but this can be changed as described below.
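As an illustration of the signature above, here is a minimal sketch of what such a .py file might contain. The toy data, the ToyDataset class, and the assumption that the split strings are "train" and "validation" are all hypothetical and not taken from llama-recipes. The tokenizer is assumed to expose encode(text, add_special_tokens=False) and eos_token_id, as Hugging Face tokenizers do.

```python
# custom_dataset.py: hypothetical example file, not part of llama-recipes.

# Toy in-memory data; a real file would load data from disk or a dataset hub.
_SAMPLES = {
    "train": [
        ("Translate to French: cat", "chat"),
        ("Translate to French: dog", "chien"),
    ],
    "validation": [("Translate to French: bird", "oiseau")],
}


class ToyDataset:
    """Minimal map-style dataset: supports len() and integer indexing."""

    def __init__(self, pairs, tokenizer):
        self.pairs = pairs
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        prompt, answer = self.pairs[idx]
        prompt_ids = self.tokenizer.encode(prompt, add_special_tokens=False)
        answer_ids = self.tokenizer.encode(answer, add_special_tokens=False)
        answer_ids = answer_ids + [self.tokenizer.eos_token_id]
        input_ids = prompt_ids + answer_ids
        return {
            "input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            # -100 is the default ignore index of PyTorch's cross-entropy loss,
            # so the loss is computed on the answer tokens only (a design choice).
            "labels": [-100] * len(prompt_ids) + answer_ids,
        }


def get_custom_dataset(dataset_config, tokenizer, split: str):
    # Assumption: the training split is named "train"; anything else is
    # treated as the validation split.
    key = "train" if split == "train" else "validation"
    return ToyDataset(_SAMPLES[key], tokenizer)
```

This sketch returns plain Python lists; depending on how batches are collated in your setup, you may need to return tensors instead.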
To start a training run with the custom dataset, we need to set the --dataset parameter as well as the --custom_dataset.file parameter:
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py" [TRAINING PARAMETERS]
To change the function name that is used in the .py file, you can append the name after a colon (:) like this:
python -m llama_recipes.finetuning --dataset "custom_dataset" --custom_dataset.file "examples/custom_dataset.py:get_foo" [TRAINING PARAMETERS]
This will call the function get_foo instead of get_custom_dataset when retrieving the dataset.
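To make the file:function convention more concrete, the following sketch shows one way such a string could be resolved using Python's standard importlib. It is not the llama-recipes implementation; load_function and its default_name parameter are hypothetical names chosen for illustration.

```python
import importlib.util
from pathlib import Path


def load_function(spec: str, default_name: str = "get_custom_dataset"):
    """Resolve 'path/to/file.py' or 'path/to/file.py:func_name' to a callable."""
    # Split on the first colon. Note this simple approach would mishandle
    # paths that themselves contain a colon (e.g. Windows drive letters).
    path, _, name = spec.partition(":")
    name = name or default_name

    module_path = Path(path)
    mod_spec = importlib.util.spec_from_file_location(module_path.stem, module_path)
    module = importlib.util.module_from_spec(mod_spec)
    mod_spec.loader.exec_module(module)  # executes the file's top-level code
    return getattr(module, name)
```

With this helper, load_function("examples/custom_dataset.py:get_foo") would return get_foo, while omitting the suffix would fall back to get_custom_dataset.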
Each dataset has a corresponding configuration (dataclass) in configs/datasets.py which contains the dataset name, training/validation split names, as well as optional parameters like datafiles etc.
Additionally, there is a preprocessing function for each dataset in the datasets folder.
The data returned by the dataset needs to be consumable by the forward method of the fine-tuned model when calling model(**data).
For CausalLM models this usually means that the data needs to be in the form of a dictionary with "input_ids", "attention_mask" and "labels" fields.
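The model(**data) call unpacks the dictionary into keyword arguments, so the keys must match the parameter names of the forward method. The sketch below illustrates this with toy_forward, a hypothetical stand-in for a real model, and made-up token ids.

```python
def toy_forward(input_ids, attention_mask=None, labels=None):
    """Hypothetical stand-in for a causal LM's forward method."""
    # A real model would compute logits and a loss; here we only report sizes.
    return {"num_tokens": len(input_ids), "has_labels": labels is not None}


# Keys match the parameter names, so unpacking works.
sample = {
    "input_ids": [1, 15043, 2],
    "attention_mask": [1, 1, 1],
    "labels": [1, 15043, 2],
}
out = toy_forward(**sample)

# An unexpected key is rejected with a TypeError.
bad_sample = {"input_ids": [1, 2], "tokens": [1, 2]}
try:
    toy_forward(**bad_sample)
    error_message = None
except TypeError as err:
    error_message = str(err)
```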
To add a custom dataset the following steps need to be performed.
- Create a dataset configuration following the schema described above. Examples can be found in configs/datasets.py.
- Create a preprocessing routine which loads the data and returns a PyTorch-style dataset. The signature for the preprocessing function needs to be (dataset_config, tokenizer, split_name), where split_name will be the string for the train/validation split as defined in the dataclass.
- Register the dataset name and preprocessing function by inserting them as key and value into the DATASET_PREPROC dictionary in utils/dataset_utils.py.
- Set the dataset field in the training config to the dataset name, or use the --dataset option of the llama_recipes.finetuning module or the examples/finetuning.py training script.
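The steps above can be sketched end to end as follows. This is a self-contained illustration, not code from llama-recipes: the my_dataset dataclass, its field names, the max_samples parameter, get_my_dataset, and build_dataset are all hypothetical, and the local DATASET_PREPROC dict merely stands in for the one in utils/dataset_utils.py.

```python
from dataclasses import dataclass


# Step 1: dataset configuration. Field names are assumptions modeled on the
# description above (dataset name, split names, optional parameters).
@dataclass
class my_dataset:
    dataset: str = "my_dataset"
    train_split: str = "train"
    test_split: str = "validation"
    max_samples: int = 2  # hypothetical optional parameter


# Step 2: preprocessing routine with the required signature.
def get_my_dataset(dataset_config, tokenizer, split_name):
    # Toy in-memory data; a real routine would load files or a hub dataset.
    texts = {
        "train": ["hello world", "good morning", "see you soon"],
        "validation": ["good night"],
    }[split_name][: dataset_config.max_samples]
    samples = []
    for text in texts:
        ids = tokenizer.encode(text) + [tokenizer.eos_token_id]
        samples.append(
            {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": list(ids)}
        )
    # A list supports len() and indexing, like a map-style dataset.
    return samples


# Step 3: registration. This local dict stands in for the DATASET_PREPROC
# dictionary in utils/dataset_utils.py.
DATASET_PREPROC = {}
DATASET_PREPROC["my_dataset"] = get_my_dataset


# Step 4: lookup by name, roughly what selecting a dataset by name implies.
def build_dataset(config, tokenizer, split_name):
    return DATASET_PREPROC[config.dataset](config, tokenizer, split_name)
```

Calling build_dataset(my_dataset(), tokenizer, "train") then dispatches to get_my_dataset through the registry.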
Below we list other datasets that can be used for fine-tuning, with their main use cases where noted.
- MMLU
- BoolQ
- NarrativeQA
- NaturalQuestions (closed-book)
- NaturalQuestions (open-book)
- QuAC
- HellaSwag
- OpenbookQA
- TruthfulQA (can be helpful for fact checking / misinformation of the model)
- English quotes (size: 2508; multi-label text classification, text generation)
- Crows_pair (gender bias)
- WinoGender (gender bias)
More information on evaluation datasets can be found in HELM.