This repo collects code examples and demos used to demonstrate different aspects of the Huggingface libraries, including transformers, datasets, evaluate, and (eventually) more. Originally prepared for UCSC's NLP 244 Winter 2023 course, it is expanding to include new examples and to reach a wider audience.
Be sure to check out Huggingface's course for an in-depth overview and tutorial from HF.
Not required for anything in this course, but I often find it personally useful to organize my work as a Python package. This allows others to import elements from your work without using or extracting individual files from your repo, and it also simplifies importing your code into a Google Colab notebook.
This is accomplished in three main steps:
- Create a `pyproject.toml` file like this. For typical use-cases, you can copy the one linked without edits.
- Create a `setup.cfg` file like this. You'll need to edit everything under `[metadata]`, `install_requires` for your dependencies, and possibly other attributes as you customize your package organization and contents.
- Add all your code under a sources directory linked from `setup.cfg`. In this case, I have everything under `src/hf_libraries_demo` since my `package_dir` includes `=src`. You'll want to rename appropriately. For more details on the `src` layout and alternatives, see this article.
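The `setup.cfg` piece can be sketched roughly as below. This is a hypothetical example mirroring the layout described above, not the repo's exact file: the name, version, and dependency list are placeholders you should replace with your own.

```ini
# setup.cfg — hypothetical sketch; edit [metadata] and install_requires for your project
[metadata]
name = hf_libraries_demo
version = 0.1.0

[options]
package_dir =
    =src
packages = find:
install_requires =
    transformers
    datasets
    evaluate

[options.packages.find]
where = src
```

With this in place (plus the `pyproject.toml`), `pip install -e .` from the repo root installs the package in editable mode.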
I've defined a minimal example of a function to import in `src/hf_libraries_demo/package_demo`. Given this, you can install an editable version of the whole package with `pip install -e .` from its root directory, and import functions you've defined in different modules in `src`. You can also install it from the git link directly. See an example of doing this in Colab here.
Using Huggingface Datasets
See this directory of examples
- Loading a dataset from Huggingface (official tutorial) (example)
- Using `map` and `filter` for pre-processing (official tutorial) (example)
- Aside: pre-modeling data analysis with datasets (example)
Setting up Evaluation w/ Huggingface Evaluate
See this directory of examples
Here we approach things in a roundabout order: we set up evaluation for a model on our dataset without first defining the model. To do this, we build pipelines for two model-free baselines and/or test cases:
- A perfect model, in which we can verify expectations of our evaluation metrics
- A random baseline, which we can use to test the evaluator and compare results against
- calculating accuracy for a random and a perfect model with evaluators (official eval tutorial) (example)
- calculating F1 as a custom metric (example)
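A custom binary F1 can be computed directly from parallel lists of predictions and references. This standalone sketch shows the idea (it is not the repo's exact implementation):

```python
def binary_f1(predictions, references, positive_label=1):
    """Compute F1 for one positive class from parallel lists of labels."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

On a perfect model (predictions equal to references) this returns 1.0, which is exactly the sanity check the perfect-model baseline above enables.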
Fine-tuning a Transformer with the Trainer API!
We'll use a fairly small pre-trained model:
microsoft/xtremedistil-l6-h384-uncased
.
- instantiating the model and making zero-shot predictions manually (example)
- making zero-shot predictions with an evaluator (example)
- fine-tuning with the Trainer API (official docs) (example)
  - bonus: logging to Weights & Biases
- Customizing Trainer via subclass: computing an alternative loss function (label-smoothed cross entropy) (official docs) (example)
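The subclassing pattern looks roughly like this. It is a sketch, not the repo's exact code: the class name and the 0.1 smoothing value are arbitrary examples, and `torch.nn.functional.cross_entropy`'s built-in `label_smoothing` argument is used for the smoothed loss.

```python
import torch.nn.functional as F
from transformers import Trainer


class LabelSmoothedTrainer(Trainer):
    """Trainer that swaps the default loss for label-smoothed cross entropy."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # torch's cross_entropy supports label smoothing directly
        loss = F.cross_entropy(outputs.logits, labels, label_smoothing=0.1)
        return (loss, outputs) if return_outputs else loss
```

Everything else (optimizer, scheduler, logging) is inherited from `Trainer` unchanged; only the loss computation is overridden.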
- a worked example of translating `snli` to French, using T5, as in Quest 4 (en_snli_to_french.py)
  - filtering to only the unique English strings for translation
  - adding task prefixes with worker parallelism (`num_proc=32`)
  - batch tokenization with `max_length=512`
  - using a `torch` `DataLoader` with batch size 512 and `num_workers=8`
  - batch decoding and storing of results
  - building a `french_snli` from our map of unique EN -> FR translations
- some worked examples for generating text with T5 as an encoder-decoder are shown in generation_examples.py
  - normal greedy decoding from T5
  - normal sampling from T5
  - adding a decoder prefix to T5 before decoding
- A complete example for pre-training RoBERTa from scratch with the BabyLM dataset can be found in `experiments/pretraining`
- You can use a sparse (BM25) and/or dense (FAISS) search index on a Huggingface dataset to retrieve data points.
  - great for retrieval-augmented generation or retrieval-augmented in-context learning
- Huggingface Accelerate and Deepspeed integrations can vastly improve training speed and capacity
- Other methods and modalities:
- Transformers and Datasets for vision and audio
- Diffusion Models
- Reinforcement Learning Environments (HF Simulate) and Reinforcement Learning from Human Feedback