GupShup: Summarizing Open-Domain Code-Switched Conversations (EMNLP 2021)
Paper: https://aclanthology.org/2021.emnlp-main.499.pdf
Please request the GupShup data using this Google form.
The dataset covers two tasks: Hinglish dialogues to English summaries (h2e) and English dialogues to English summaries (e2e). For each task, the dialogues are stored in files with the `.source` extension (e.g. `train.source`) and the summaries in files with the `.target` extension (e.g. `train.target`). Pass the `.source` file to the `input_path` argument and the `.target` file to the `reference_path` argument of the scripts.
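To inspect the data before running any model, the paired files can be read side by side. The sketch below assumes each line of a `.source` file holds one dialogue and the same line of the matching `.target` file holds its summary; this layout is an assumption and is not stated above, so check it against the files you receive. The helper name `load_pairs` is hypothetical and is not part of the repository.

```python
from typing import List, Tuple


def read_lines(path: str) -> List[str]:
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_pairs(source_path: str, target_path: str) -> List[Tuple[str, str]]:
    """Pair dialogues with summaries.

    Assumption: line i of the .source file corresponds to line i of the
    .target file (one example per line).
    """
    sources = read_lines(source_path)
    targets = read_lines(target_path)
    if len(sources) != len(targets):
        raise ValueError(
            f"{source_path} has {len(sources)} lines but "
            f"{target_path} has {len(targets)} lines"
        )
    return list(zip(sources, targets))
```

For example, `load_pairs("data/h2e/test.source", "data/h2e/test.target")` would return a list of (dialogue, summary) tuples, and a line-count mismatch is reported before any evaluation is attempted.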
All model weights are available on the Hugging Face model hub. You can either download the weights locally and pass that path to the `model_name` argument of the scripts, or pass one of the aliases below to `model_name`, in which case the scripts download the weights automatically.
Aliases follow the pattern `midas/gupshup_TASK_MODEL`, where TASK is `h2e` or `e2e` and MODEL is `mbart`, `pegasus`, etc., as listed below.
1. Hinglish Dialogues to English Summary (h2e)
Model | Huggingface Alias |
---|---|
mBART | midas/gupshup_h2e_mbart |
PEGASUS | midas/gupshup_h2e_pegasus |
T5 MTL | midas/gupshup_h2e_t5_mtl |
T5 | midas/gupshup_h2e_t5 |
BART | midas/gupshup_h2e_bart |
GPT-2 | midas/gupshup_h2e_gpt |
2. English Dialogues to English Summary (e2e)
Model | Huggingface Alias |
---|---|
mBART | midas/gupshup_e2e_mbart |
PEGASUS | midas/gupshup_e2e_pegasus |
T5 MTL | midas/gupshup_e2e_t5_mtl |
T5 | midas/gupshup_e2e_t5 |
BART | midas/gupshup_e2e_bart |
GPT-2 | midas/gupshup_e2e_gpt |
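The two alias tables above follow a single naming pattern, so an alias can be built from a task and a model key rather than copied by hand. The following is a minimal sketch; the function name `gupshup_alias` is hypothetical, and the task and model keys are taken from the tables.

```python
TASKS = ("h2e", "e2e")
MODELS = ("mbart", "pegasus", "t5_mtl", "t5", "bart", "gpt")


def gupshup_alias(task: str, model: str) -> str:
    """Return the hub alias for a task/model pair listed in the tables."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; expected one of {MODELS}")
    return f"midas/gupshup_{task}_{model}"


# Enumerate every alias listed in the two tables.
ALL_ALIASES = [gupshup_alias(task, model) for task in TASKS for model in MODELS]
print(len(ALL_ALIASES))
```

Validating the keys up front gives a clear error for a typo such as `pegasas` instead of a failed download later.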
- Clone this repo and create a Python virtual environment (https://docs.python.org/3/library/venv.html). Install the required packages using:
git clone https://github.com/midas-research/gupshup.git
pip install -r requirements.txt
- The `run_eval` script takes the following arguments:
- `model_name`: Path or alias to one of our models available on Hugging Face, as listed above.
- `input_path`: Path to the source file containing the conversations to be summarized.
- `save_path`: File path where the summaries generated by the model are saved.
- `reference_path`: Path to the target file containing the reference summaries, used to calculate metrics.
- `score_path`: File path where the scores are saved.
- `bs`: Batch size.
- `device`: CUDA devices to use.
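For readers who want to see how these arguments fit together, the list above can be mirrored with `argparse`. This is only an illustrative sketch and not the actual `run_eval.py` source: which arguments are required, the default values, and the type of `device` are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented run_eval arguments (sketch, not the real script)."""
    parser = argparse.ArgumentParser(
        description="Summarize dialogues and optionally score the output."
    )
    parser.add_argument("--model_name", required=True,
                        help="Local path or hub alias of the model.")
    parser.add_argument("--input_path", required=True,
                        help="Source file with the conversations.")
    parser.add_argument("--save_path", required=True,
                        help="Where to write the generated summaries.")
    # Assumption: scoring is optional, so these two default to None.
    parser.add_argument("--reference_path", default=None,
                        help="Target file with the reference summaries.")
    parser.add_argument("--score_path", default=None,
                        help="Where to write the scores.")
    # Assumption: default batch size of 8, matching the README examples.
    parser.add_argument("--bs", type=int, default=8, help="Batch size.")
    # Assumption: the device is passed as a plain string.
    parser.add_argument("--device", default=None, help="CUDA devices to use.")
    return parser


args = build_parser().parse_args([
    "--model_name", "midas/gupshup_h2e_mbart",
    "--input_path", "data/h2e/test.source",
    "--save_path", "generated_summary.txt",
])
print(args.model_name, args.bs)
```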
Make sure you have downloaded the GupShup dataset using the Google form above, and pass the correct paths to these files in the `input_path` and `reference_path` arguments.
Alternatively, place `test.source` and `test.target` in the `data/h2e/` (Hinglish to English) or `data/e2e/` (English to English) folder. For example, to generate English summaries from Hinglish dialogues using the mBART model, run the following command:
python run_eval.py \
--model_name midas/gupshup_h2e_mbart \
--input_path data/h2e/test.source \
--save_path generated_summary.txt \
--reference_path data/h2e/test.target \
--score_path scores.txt \
--bs 8
As another example, to generate English summaries from English dialogues using the PEGASUS model, run:
python run_eval.py \
--model_name midas/gupshup_e2e_pegasus \
--input_path data/e2e/test.source \
--save_path generated_summary.txt \
--reference_path data/e2e/test.target \
--score_path scores.txt \
--bs 8
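The two commands above differ only in the task and model, so when evaluating several combinations it can help to generate the command instead of editing it by hand. The sketch below builds the same argument list in Python. The function name `eval_command` and the per-run output file names are hypothetical; the data paths assume the `data/h2e/` and `data/e2e/` layout described above.

```python
import shlex
from typing import List


def eval_command(task: str, model: str, bs: int = 8) -> List[str]:
    """Build a run_eval.py command line for one task/model pair."""
    return [
        "python", "run_eval.py",
        "--model_name", f"midas/gupshup_{task}_{model}",
        "--input_path", f"data/{task}/test.source",
        # Hypothetical output names, chosen so runs do not overwrite each other.
        "--save_path", f"{task}_{model}_summary.txt",
        "--reference_path", f"data/{task}/test.target",
        "--score_path", f"{task}_{model}_scores.txt",
        "--bs", str(bs),
    ]


# Print the commands for the two examples shown above.
for task, model in [("h2e", "mbart"), ("e2e", "pegasus")]:
    print(shlex.join(eval_command(task, model)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the evaluation.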
Please create a copy of this notebook on Google Colab, or upload `gupshup_notebook.ipynb` to Google Colab, and follow the instructions in it.
- Clone this repo and create a Python virtual environment (https://docs.python.org/3/library/venv.html). Install the required packages using:
git clone https://github.com/midas-research/gupshup.git
pip install -r requirements.txt
- Use the Streamlit UI to run inference with the model and task of your choice. To start the Streamlit server:
streamlit run app.py
Please create an issue if you are facing any difficulties in replicating the results.
Please cite [1] if you found the resources in this repository useful.
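[1] Mehnaz, Laiba, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle G. Lee, Anish Acharya, and Rajiv Shah. GupShup: Summarizing Open-Domain Code-Switched Conversations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6177-6192, 2021.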
@inproceedings{mehnaz2021gupshup,
title={GupShup: Summarizing Open-Domain Code-Switched Conversations},
author={Mehnaz, Laiba and Mahata, Debanjan and Gosangi, Rakesh and Gunturi, Uma Sushmitha and Jain, Riya and Gupta, Gauri and Kumar, Amardeep and Lee, Isabelle G and Acharya, Anish and Shah, Rajiv},
booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing},
pages={6177--6192},
year={2021}
}