This is the code release for the paper [Diverse Retrieval-Augmented In-Context Learning for Dialogue State Tracking](https://aclanthology.org/2023.findings-acl.344) by Brendan King and Jeffrey Flanigan, published in Findings of ACL 2023 (long paper).
```bibtex
@inproceedings{king-flanigan-2023-diverse,
    title = "Diverse Retrieval-Augmented In-Context Learning for Dialogue State Tracking",
    author = "King, Brendan and Flanigan, Jeffrey",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.344",
    doi = "10.18653/v1/2023.findings-acl.344",
    pages = "5570--5585",
}
```
The steps below successfully install and run the experiments on a CentOS 7 Linux host with at least one GPU and conda available. You may need to adapt them as necessary for your environment.
Set up an environment and install the specific Python/PyTorch versions used in this work (other versions may also work, but are untested):
```bash
mkdir refpydst && cd refpydst
conda create --prefix venv python=3.9.13
conda activate ./venv
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 pytorch-cuda=11.7 -c pytorch -c nvidia
# Below was needed when building my docker image, but may not be needed for your install
conda install -c anaconda gxx_linux-64
pip install pyzmq
```
You can install this project and all dependencies in editable mode (or omit the `-e` flag for a standard install):

```bash
pip install -e .
```
These installation steps should be sufficient for reproducing our results; please open an issue if not.
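Before moving on, a quick sanity check that the pinned PyTorch build can see your GPU (assumes CUDA drivers are installed on the host):

```python
import torch

# Verify the pinned torch build and GPU visibility before running experiments.
print(torch.__version__)          # expect 1.12.1
print(torch.cuda.is_available())  # expect True on a GPU host
```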
We download and process data following the processing in [Yushi-Hu/IC-DST](https://github.com/Yushi-Hu/IC-DST):
```bash
cd data
python create_data.py --main_dir mwz21 --mwz_ver 2.1 --target_path mwz2.1  # for MultiWOZ 2.1
python create_data.py --main_dir mwz24 --mwz_ver 2.4 --target_path mwz2.4  # for MultiWOZ 2.4
```
Pre-processing creates train/dev/test splits for each zero- and few-shot experiment in this work:

```bash
bash preprocess.sh
```
You can also sample your own splits as follows (e.g. 5% of the training data with random seed 0):

```bash
python sample.py --input_fn mwz2.1/train_dials.json --target_fn mw21_5p_train_seed0.json --ratio 0.05 --seed 0
```
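Under the hood, this kind of sampling is just a seeded random draw over training dialogues. A minimal sketch of the idea (assuming `train_dials.json` is a JSON list of dialogue dicts; this is illustrative, not the repo's exact `sample.py`):

```python
import json
import random

def sample_split(input_fn: str, target_fn: str, ratio: float, seed: int) -> None:
    """Draw a seeded random subset of dialogues and write it as a new split."""
    random.seed(seed)
    with open(input_fn) as f:
        dialogues = json.load(f)
    sampled = random.sample(dialogues, int(len(dialogues) * ratio))
    with open(target_fn, "w") as f:
        json.dump(sampled, f, indent=2)

sample_split("mwz2.1/train_dials.json", "mw21_5p_train_seed0.json", ratio=0.05, seed=0)
```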
For co-reference analysis experiments, we use annotations provided by MultiWOZ 2.3:

```bash
bash download_mw23.sh
```
To efficiently analyze only dialogues that require co-reference in the dev set, we filter with this script:

```bash
python build_coref_only_dataset.py
```

which creates `mw24_coref_only_dials_dev.json` in the data folder.
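Conceptually, the filter keeps only dev dialogues whose MultiWOZ 2.3 annotations mark a co-reference. A rough sketch of that idea (the file names and the `coreference`, `log`, and `dialogue_idx` fields are assumptions about the annotation schema, not necessarily the repo's actual keys):

```python
import json

# Collect IDs of dialogues with at least one co-reference annotation (schema assumed).
with open("mwz23/data.json") as f:
    mw23 = json.load(f)
coref_ids = {
    dial_id for dial_id, dial in mw23.items()
    if any(turn.get("coreference") for turn in dial.get("log", []))
}

# Keep only those dialogues from the processed 2.4 dev split.
with open("mwz2.4/dev_dials.json") as f:
    dev_dials = json.load(f)
coref_only = [d for d in dev_dials if d.get("dialogue_idx") in coref_ids]
with open("mw24_coref_only_dials_dev.json", "w") as f:
    json.dump(coref_only, f, indent=2)
```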
By default, all of these steps download and create processed files in `./data`. Alternatively, you can move this directory to a convenient location (e.g. somewhere in a volume mount path) and point to it with an environment variable:

```bash
export REFPYDST_DATA_DIR="/absolute/path/to/data"
```

When this environment variable is set, all dataset loading with relative paths will resolve from this root. Absolute paths are always loaded without modification.
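The resolution rule amounts to something like the following (a simplified sketch of the behavior described above, not the repo's exact loader code):

```python
import os

def resolve_data_path(path: str) -> str:
    """Relative paths resolve under REFPYDST_DATA_DIR (default ./data);
    absolute paths are returned unchanged."""
    if os.path.isabs(path):
        return path
    data_dir = os.environ.get("REFPYDST_DATA_DIR", os.path.join(os.getcwd(), "data"))
    return os.path.join(data_dir, path)

print(resolve_data_path("mw21_5p_train_seed0.json"))  # -> $REFPYDST_DATA_DIR/mw21_5p_train_seed0.json
print(resolve_data_path("/tmp/my_dials.json"))        # -> /tmp/my_dials.json, unmodified
```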
Reproducing our results requires access to OpenAI Codex (`code-davinci-002`). Other OpenAI engines are not tested.
You'll need to set the following environment variables:

```bash
export OPENAI_API_KEY=<your_openai_api_key_with_codex_access>
export REFPYDST_DATA_DIR="`pwd`/data"        # default, or an absolute path to where you store the data created above
export REFPYDST_OUTPUTS_DIR="`pwd`/outputs"  # default, or an absolute path to where you save/load retrievers and output logs
# You may also want to set CUDA_VISIBLE_DEVICES to a single available GPU (can be small)
```
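For reference, requests to Codex at the time used the legacy (pre-1.0) `openai` Python completion API; a minimal sketch of such a call (the prompt here is a placeholder, not the repo's actual prompt format):

```python
import os
import openai  # legacy openai-python, pre-1.0

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    engine="code-davinci-002",
    prompt="# Python-formatted dialogue state tracking prompt goes here",
    max_tokens=120,
    temperature=0.0,
    stop=["\n\n"],
)
print(response["choices"][0]["text"])
```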
Experiments are generally organized as follows:
- A JSON config file specifies all the parameters for the experiment.
- Each experiment type has a main file, invoked as `python [path/to/main_file].py [path/to/run/config].json`.
The main files are:
- `src/refpydst/retriever/code/retriever_finetuning.py`: fine-tunes a sentence retriever given the retriever arguments in a config.
- `src/refpydst/run_codex_experiment.py`: runs an experiment for generating dialogue states with OpenAI Codex (both follow the same config-driven pattern; see the sketch below).
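Conceptually, each main file reads one JSON config and runs the experiment it describes, roughly like this (`run_experiment` is an illustrative stand-in, not the repo's actual function):

```python
import json
import sys

def run_experiment(config: dict) -> None:
    # Illustrative stand-in for the experiment-specific entry point.
    print(f"running with {len(config)} config keys")

def main(config_path: str) -> None:
    # Every experiment is fully specified by one JSON config.
    with open(config_path) as f:
        config: dict = json.load(f)
    run_experiment(config)

if __name__ == "__main__":
    main(sys.argv[1])
```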
Each few-shot experiment is repeated across 3 independent runs, so we've created utility scripts for running all experiments in a folder. For example, to run all 1% few-shot experiments on MultiWOZ 2.4:
- Retriever training: `bash retriever_expmt_folder_runner.sh runs/retriever/mw21_1p_train`
- Generation with Codex: `bash codex_expmt_folder_runner.sh runs/codex/mw21_1p_train/python`
- Simulating MultiWOZ 2.1 results (from the 2.4 Codex run above): `bash mw21_simulation_expmt_folder_runner.sh runs/codex/mw21_1p_train/sim_mw21_python`
This repo makes use of wandb for logging and storing output artifacts (logs of completed runs, etc.). To override the default project and entity, you can set these environment variables:

```bash
# default is mine: kingb12. You can find this in code and change as well
export WANDB_ENTITY=<your wandb account name or org-name/entity>
# default is refpydst for all runs, so setting this here is a no-op. Rename as needed
export WANDB_PROJECT="refpydst"
```
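`WANDB_ENTITY` and `WANDB_PROJECT` are standard wandb environment variables, picked up automatically at run initialization; setting them in-process is equivalent:

```python
import os
import wandb

# Equivalent to exporting the variables before launching a run.
os.environ["WANDB_ENTITY"] = "your-entity"  # placeholder entity
os.environ["WANDB_PROJECT"] = "refpydst"

run = wandb.init()  # picks up entity/project from the environment
run.finish()
```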
You can train all few-shot retrievers as follows:

```bash
bash retriever_expmt_folder_runner.sh runs/retriever/mw21_1p_train/referred_states
bash retriever_expmt_folder_runner.sh runs/retriever/mw21_5p_train/referred_states
bash retriever_expmt_folder_runner.sh runs/retriever/mw21_10p_train/referred_states
bash retriever_expmt_folder_runner.sh runs/retriever/mw21_100p_train/referred_states
bash retriever_expmt_folder_runner.sh runs/retriever/mw21_5p_train/ic_dst  # for ablation
```
Or run an individual experiment:

```bash
python src/refpydst/retriever/code/retriever_finetuning.py runs/retriever/mw21_1p_train/referred_states/split_v1.json
```
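The retrievers are sentence-transformers models, so fine-tuning follows that library's usual contrastive recipe. The sketch below is a generic example of such fine-tuning (the base model, example pairs, and loss here are illustrative assumptions, not the repo's exact training setup):

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# Illustrative positive pairs: two dialogue contexts that should retrieve each other.
train_examples = [
    InputExample(texts=["[context] user wants a cheap hotel in the centre",
                        "[context] user asks for a budget hotel downtown"]),
    # ... more pairs built from the few-shot training split ...
]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # assumed base model
loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # contrastive loss with in-batch negatives
model.fit(train_objectives=[(loader, loss)], epochs=1, output_path="outputs/retriever_sketch")
```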
We've made the pre-trained retrievers from this repository available on Hugging Face:
| Model Name | Few-shot Setting | Run # | Link |
|---|---|---|---|
| `Brendan/refpydst-1p-referredstates-split-v1` | 1% | 1 | [link](https://huggingface.co/Brendan/refpydst-1p-referredstates-split-v1) |
| `Brendan/refpydst-1p-referredstates-split-v2` | 1% | 2 | [link](https://huggingface.co/Brendan/refpydst-1p-referredstates-split-v2) |
| `Brendan/refpydst-1p-referredstates-split-v3` | 1% | 3 | [link](https://huggingface.co/Brendan/refpydst-1p-referredstates-split-v3) |
| `Brendan/refpydst-5p-referredstates-split-v1` | 5% | 1 | [link](https://huggingface.co/Brendan/refpydst-5p-referredstates-split-v1) |
| `Brendan/refpydst-5p-referredstates-split-v2` | 5% | 2 | [link](https://huggingface.co/Brendan/refpydst-5p-referredstates-split-v2) |
| `Brendan/refpydst-5p-referredstates-split-v3` | 5% | 3 | [link](https://huggingface.co/Brendan/refpydst-5p-referredstates-split-v3) |
| `Brendan/refpydst-10p-referredstates-split-v1` | 10% | 1 | [link](https://huggingface.co/Brendan/refpydst-10p-referredstates-split-v1) |
| `Brendan/refpydst-10p-referredstates-split-v2` | 10% | 2 | [link](https://huggingface.co/Brendan/refpydst-10p-referredstates-split-v2) |
| `Brendan/refpydst-10p-referredstates-split-v3` | 10% | 3 | [link](https://huggingface.co/Brendan/refpydst-10p-referredstates-split-v3) |
| `Brendan/refpydst-100p-referredstates-split-v3` | 100% | 1 | [link](https://huggingface.co/Brendan/refpydst-100p-referredstates-split-v3) |
| `Brendan/refpydst-5p-icdst-split-v1` | 5% | 1 | [link](https://huggingface.co/Brendan/refpydst-5p-icdst-split-v1) |
| `Brendan/refpydst-5p-icdst-split-v2` | 5% | 2 | [link](https://huggingface.co/Brendan/refpydst-5p-icdst-split-v2) |
| `Brendan/refpydst-5p-icdst-split-v3` | 5% | 3 | [link](https://huggingface.co/Brendan/refpydst-5p-icdst-split-v3) |
You can download all of them to `REFPYDST_OUTPUTS_DIR` with:

```bash
python download_pretrained_retrievers.py
```
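Each checkpoint should also load individually with sentence-transformers straight from the Hub, e.g. (assuming the library is installed; anything beyond encoding is handled by the repo's retriever code):

```python
from sentence_transformers import SentenceTransformer

# Load one pretrained retriever from the Hugging Face Hub and embed a query.
retriever = SentenceTransformer("Brendan/refpydst-1p-referredstates-split-v1")
embedding = retriever.encode("user wants a cheap hotel in the centre")
print(embedding.shape)
```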
You can repeat our few-shot experiments on MultiWOZ 2.4 with:

```bash
# MultiWOZ 2.4
bash codex_expmt_folder_runner.sh runs/codex/mw21_1p_train/python
bash codex_expmt_folder_runner.sh runs/codex/mw21_5p_train/python
bash codex_expmt_folder_runner.sh runs/codex/mw21_10p_train/python
bash codex_expmt_folder_runner.sh runs/codex/mw21_100p_train/python
```
and evaluate on MultiWOZ 2.1:

```bash
# MultiWOZ 2.1 simulated from the 2.4 result
bash mw21_simulation_expmt_folder_runner.sh runs/codex/mw21_1p_train/sim_mw21_python
bash mw21_simulation_expmt_folder_runner.sh runs/codex/mw21_5p_train/sim_mw21_python
bash mw21_simulation_expmt_folder_runner.sh runs/codex/mw21_10p_train/sim_mw21_python
bash mw21_simulation_expmt_folder_runner.sh runs/codex/mw21_100p_train/sim_mw21_python
```
You can repeat our zero-shot experiments on MultiWOZ 2.4 with:

```bash
# MultiWOZ 2.4
bash codex_expmt_folder_runner.sh runs/codex/zero_shot/python
```
and evaluate on MultiWOZ 2.1 with:

```bash
bash mw21_simulation_expmt_folder_runner.sh runs/codex/zero_shot/sim_mw21_python
```
Run all ablations in Table 4 (zero-shot and 5% few-shot) with the command below. It runs each experiment in sequence by recursively walking the directories; you can speed this up by running separate commands pointed at sub-directories. This also includes a random-retriever baseline, as in the Appendix.

```bash
# MultiWOZ 2.4
bash codex_expmt_folder_runner.sh runs/table4
```
To repeat the experiments for Table 5, run:

```bash
bash codex_expmt_folder_runner.sh runs/table5
```
Once these runs complete, you can evaluate with `eval_coref_turns.py`. Unlike the other scripts, this one takes a wandb group name as an argument and evaluates the co-reference performance of the completed runs in that group. The script verifies the expected number of runs for the group, so make sure the runs you expect to be evaluated are tagged `complete_run` in wandb (this should happen automatically when a run finishes), and tag any runs you wish to ignore with `outdated`.
An example call to the script:

```bash
python src/refpydst/eval_coref_turns.py "-runs-table5-5p-full"
```

will score the co-reference accuracy of the runs in `runs/table5/5p/full`.
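To check which runs the script will consider, you can query wandb directly; a sketch using the public wandb API (entity/project are the defaults described above):

```python
import wandb

api = wandb.Api()
# Runs in the group that are tagged complete_run; skip any tagged outdated.
runs = api.runs(
    "kingb12/refpydst",
    filters={"group": "-runs-table5-5p-full", "tags": {"$in": ["complete_run"]}},
)
for run in runs:
    if "outdated" not in run.tags:
        print(run.name, run.tags)
```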