This repository contains the implementation of the EMNLP 2021 paper: "Parallel Refinements for Lexically Constrained Text Generation with BART".
Lexically constrained text generation aims to control the generated text by incorporating pre-specified keywords into the output. Previous work injects lexical constraints into the output by controlling the decoding process or refining the candidate output iteratively, which tends to generate generic or ungrammatical sentences and has high computational complexity. To address these challenges, we propose Constrained BART (CBART) for lexically constrained text generation. CBART leverages the pre-trained model BART and transfers part of the generation burden from the decoder to the encoder by decomposing this task into two sub-tasks, thereby improving the sentence quality. Concretely, we extend BART by adding a token-level classifier over the encoder, aiming at instructing the decoder where to replace and insert. Guided by the encoder, the decoder refines multiple tokens of the input in one step by inserting tokens before specific positions and re-predicting tokens with low confidence. To further reduce the inference latency, the decoder predicts all tokens in parallel. Experimental results on One-Billion-Word and Yelp show that CBART can generate plausible text with high quality and diversity while significantly accelerating inference.
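For a quick mental model of the architecture, here is a minimal, illustrative PyTorch sketch of how a token-level classification head can sit on top of the BART encoder. It is not the repository's actual code; the class name, label set, and wiring are assumptions made for illustration only.

```python
import torch.nn as nn
from transformers import BartModel, BartTokenizer

class TokenLevelClassifierSketch(nn.Module):
    """Illustrative only: a token-level classification head over the BART encoder.

    The head predicts an edit label for every input token (e.g. copy / replace /
    insert-before), which then guides the decoder's parallel refinement.
    The exact label scheme here is an assumption, not the repository's design.
    """

    def __init__(self, num_labels=3, bart_name="facebook/bart-base"):
        super().__init__()
        self.bart = BartModel.from_pretrained(bart_name)
        self.classifier = nn.Linear(self.bart.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask):
        # Encode the keyword/partial input sentence with the BART encoder.
        encoder_outputs = self.bart.encoder(
            input_ids=input_ids, attention_mask=attention_mask
        )
        hidden_states = encoder_outputs[0]  # (batch, seq_len, hidden)
        # One label per input token, telling the decoder where to replace or insert.
        return self.classifier(hidden_states)

if __name__ == "__main__":
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
    batch = tokenizer(["soccer team win match"], return_tensors="pt")
    model = TokenLevelClassifierSketch()
    logits = model(batch["input_ids"], batch["attention_mask"])
    print(logits.shape)  # (1, seq_len, num_labels)
```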
python 3.6
pip install torch==1.4.0
pip install transformers==3.0.2
pip install pympler==0.8
All our experiments are conducted on the One-Billion-Word and Yelp review corpora. In this paper, we select 1M and 0.1M sentences from each dataset as the training and validation sets, respectively (the full data used in this paper are available at https://drive.google.com/drive/folders/1Dj7VX2CjSn3-g7FEYuJrT5_JWGdsAHjE?usp=sharing). If you want to train the model from scratch, you should download the corresponding data first and put them in the corresponding directory, i.e. data/one-billion-words (or data/yelp_review). Note that we only include a few sample sentences in data/one-billion-words/train.txt and data/one-billion-words/dev.txt.
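Before training, you can quickly verify that the files are where the scripts expect them. This snippet is only a convenience check based on the paths mentioned above:

```python
import os

# Illustrative sanity check: confirm the training and validation files exist.
# Change the dataset directory to data/yelp_review if needed.
data_dir = os.path.join("data", "one-billion-words")
for split in ("train.txt", "dev.txt"):
    path = os.path.join(data_dir, split)
    status = "found" if os.path.isfile(path) else "MISSING"
    print(f"{path}: {status}")
```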
Model | Download link |
---|---|
CBART-base for Yelp review | [link] |
CBART-large for Yelp review | [link] |
CBART-base for One-Billion-Word | [link] |
CBART-large for One-Billion-Word | [link] |
If you want to try our models, download these checkpoints, put them into the 'checkpoints' directory, and decompress them with the following command:
tar -xzvf checkpoint_name.tar.gz # replace 'checkpoint_name' with the corresponding checkpoint name.
Then you can go directly to Generate sentences with lexical constraints.
If you want to train our model on another dataset, please refer to the following steps.
Note the default dataset is One-Billion-Word. You can freely change it to another dataset.
- Step 1: Create synthetic data to train CBART
cd utils
sh create_synthetic_data.sh
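The synthetic data pairs corrupted inputs with the original sentences so that the encoder can learn token-level edit labels. The snippet below is only a conceptual sketch of that idea; the corruption probabilities and label scheme are assumptions, and the script invoked by create_synthetic_data.sh remains the authoritative implementation.

```python
import random

# Conceptual sketch only: build one (noised input, labels, target) example from a
# clean sentence by randomly deleting and replacing tokens.
COPY, REPLACE, INSERT = 0, 1, 2

def make_synthetic_example(tokens, vocab, p_delete=0.15, p_replace=0.15, seed=None):
    rng = random.Random(seed)
    noised, labels = [], []
    for tok in tokens:
        r = rng.random()
        if r < p_delete:
            # Deleted token: mark the previous kept token so the decoder
            # learns that something must be inserted here.
            if labels:
                labels[-1] = INSERT
            continue
        if r < p_delete + p_replace:
            noised.append(rng.choice(vocab))  # corrupt with a random token
            labels.append(REPLACE)
        else:
            noised.append(tok)
            labels.append(COPY)
    # (encoder input, encoder labels, decoder target)
    return noised, labels, tokens

if __name__ == "__main__":
    sentence = "the team won the match on saturday".split()
    vocab = ["apple", "river", "blue", "run"]
    print(make_synthetic_example(sentence, vocab, seed=0))
```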
- Step 2: Train CBART
cd models
If you want to train CBART-base on One-Billion-Word:
python bart.py --batch_size 80 --gpu 5 --dataset one-billion-words
If you want to train CBART-large on One-Billion-Word:
python bart.py --batch_size 25 --gpu 5 --dataset one-billion-words --bart large
We show some keywords in "data/one-billion-words/4keywords.txt", where each line has 4 keywords. In the following, we'll generate sentences with 4 keywords. If you want to generate sentences with a different number of keywords, you should prepare the keywords and put them in "data/dataset_name/{k}keywords.txt", where '{k}' denotes the number of keywords in each line. In that case, you also need to change the hyperparameter "num_keywords" accordingly (e.g., --num_keywords 1 if you want to generate sentences with one keyword).
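A minimal sketch for preparing your own keyword file, assuming keywords are whitespace-separated on each line (please confirm the exact format against data/one-billion-words/4keywords.txt); the example keyword sets are made up:

```python
# Illustrative: write a custom keyword file with k keywords per line.
keyword_sets = [
    ["soccer", "stadium", "goal", "crowd"],
    ["coffee", "morning", "rain", "window"],
]
k = 4
with open(f"data/one-billion-words/{k}keywords.txt", "w", encoding="utf-8") as f:
    for keywords in keyword_sets:
        assert len(keywords) == k
        f.write(" ".join(keywords) + "\n")
```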
Generate sentences with 4keywords.txt by running greedy decoding on CBART-base:
python main.py --gpu 7 --num_keywords 4 --do_sample 0 --batch_size 10 --bart base --dataset one-billion-words
Generate sentences with 4keywords.txt by running multiple-sequence decoding (p=0.5, c=5) on CBART-base:
python main.py --gpu 7 --num_keywords 4 --do_sample 1 --top_p 0.5 --decoder_chain 5 --batch_size 10 --bart base --dataset one-billion-words
Generate sentences with 4keywords.txt by running multiple-sequence decoding (k=5, c=5) on CBART-base:
python main.py --gpu 7 --num_keywords 4 --do_sample 1 --top_k 5 --decoder_chain 5 --batch_size 10 --bart base --dataset one-billion-words
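For reference, --top_k and --top_p correspond to standard top-k and nucleus (top-p) sampling, and --decoder_chain is the number of decoding chains run in parallel (the c in the commands above). The following is a generic sketch of the token-filtering step behind these flags, not the repository's exact implementation:

```python
import torch

def filter_logits(logits, top_k=0, top_p=0.0):
    """Generic top-k / nucleus (top-p) filtering over a 1-D logits tensor.

    Meant only to explain what --top_k / --top_p do; not CBART's exact code.
    """
    logits = logits.clone()
    if top_k > 0:
        # Keep only the top_k highest-scoring tokens.
        kth_value = torch.topk(logits, top_k)[0][-1]
        logits[logits < kth_value] = float("-inf")
    if top_p > 0.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        # Drop tokens once the cumulative probability exceeds top_p,
        # but always keep at least the most probable token.
        remove = cumulative > top_p
        remove[1:] = remove[:-1].clone()
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    return logits

if __name__ == "__main__":
    logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
    probs = torch.softmax(filter_logits(logits, top_p=0.5), dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    print(probs, next_token)
```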
If you use this code in your research, please cite our paper:
@inproceedings{he-2021-parallel,
title = "Parallel Refinements for Lexically Constrained Text Generation with {BART}",
author = "He, Xingwei",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.681",
doi = "10.18653/v1/2021.emnlp-main.681",
pages = "8653--8666",
}