Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition – EMNLP 2024 (Findings)


akskuchi/dHM-visual-storytelling


License: CC BY · Python · PyTorch · HuggingFace

👀 What?

This repository contains code for using the $d_{HM}$ evaluation method proposed in:
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition, in Findings of EMNLP 2024.

Note: Although proposed specifically for visual storytelling, the method is general and can be extended to any task involving model-generated outputs with corresponding references.

🤔 Why?

$d_{HM}$ enables human-centric evaluation of model-generated stories along different dimensions important for visual story generation.

🤖 How?

$d_{HM}$ combines three reference-free evaluation metrics: GROOViST [1] (for visual grounding), RoViST-C [2] (for coherence), and RoViST-NR [2] (for non-redundancy/repetition). It computes the average of the absolute metric-level deviations between human stories and the corresponding model generations.
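In notation, this is (a sketch of the aggregation described above, with $X_H$ and $X_M$ denoting the aggregate score of metric $X$ for human and model stories, respectively):

$$d_{HM} = \frac{1}{3}\left(\,|G_H - G_M| + |C_H - C_M| + |R_H - R_M|\,\right)$$

where each absolute deviation corresponds to one of the metric-level distances $d_{HM}^G$, $d_{HM}^C$, $d_{HM}^R$ reported in Step 2 below.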

Setup

Install Python (e.g., version 3.11) and the dependencies listed in requirements.txt, e.g., using:
pip install -r requirements.txt

Step 0: Generate stories

For generating stories using the models and settings proposed in this work, refer to this documentation.

Step 1A: Compute metric-level scores for human stories

For computing visual grounding scores (G), check out the GROOViST repository.

For computing coherence (C) and repetition (R) scores, use the following utility, adapted from RoViST, e.g.:
python evaluate/eval_C_R.py -i ./data/stories/vist/gt_test.json -o ./data/scores/vist/gt_test

Note 1: Download the pre-trained ALBERT model from here and place it under the data/ folder.

Note 2: The requirements for this step differ; check out the evaluate/requirements file.
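For a quick sanity check of the resulting scores, a minimal sketch follows. It assumes the utility writes per-story scores as a JSON mapping under the -o prefix given above; the exact filenames and output format of eval_C_R.py may differ, so treat the path below as hypothetical.

```python
import json
from statistics import mean

# Hypothetical filename derived from the -o prefix above; the actual
# output naming of evaluate/eval_C_R.py may differ.
with open("./data/scores/vist/gt_test_C.json") as f:
    coherence = json.load(f)  # assumed format: {story_id: score}

print(f"mean coherence over {len(coherence)} stories: "
      f"{mean(coherence.values()):.3f}")
```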

Step 1B: Compute metric-level scores for model-generated stories

Follow the same procedure as in Step 1A, this time on the model-generated stories.

Step 2: Evaluate using $d_{HM}$

To obtain aggregate $d_{HM}$ values along with the corresponding metric-level distances ($d_{HM}^G$, $d_{HM}^C$, $d_{HM}^R$), use the following utility, e.g.:
python dHM.py -d VIST
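For intuition only, here is a minimal sketch of the computation described above, assuming you already have corpus-level mean G/C/R scores for human and model stories; dHM.py is the reference implementation, and the scores below are made up.

```python
# Illustrative sketch of the d_HM computation (dHM.py is the reference
# implementation). Inputs are corpus-level mean scores for grounding (G),
# coherence (C), and non-redundancy (R).

def d_hm(human: dict, model: dict) -> tuple[dict, float]:
    """Return per-metric absolute deviations and their average."""
    per_metric = {m: abs(human[m] - model[m]) for m in ("G", "C", "R")}
    return per_metric, sum(per_metric.values()) / len(per_metric)

# Made-up scores, for illustration only.
per_metric, aggregate = d_hm(
    human={"G": 0.71, "C": 0.64, "R": 0.92},
    model={"G": 0.58, "C": 0.70, "R": 0.85},
)
print(per_metric)                 # {'G': 0.13, 'C': 0.06..., 'R': 0.07...}
print(f"d_HM = {aggregate:.3f}")  # lower means closer to human stories
```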


🔗 If you find this work useful, please consider citing it:

@inproceedings{
   EMNLP 2024 Findings (to appear) 
}

Footnotes

  1. GROOViST: https://aclanthology.org/2023.emnlp-main.202/

  2. RoViST: https://aclanthology.org/2022.findings-naacl.206/
