👀 What?

This repository contains code for using the $d_{HM}$ evaluation method proposed in:
"Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition", in Proceedings of EMNLP 2024 (Findings).

Note: Although proposed specifically for visual storytelling, this method can be extended to any task involving model-generated outputs with corresponding references.

🤔 Why?

$d_{HM}$ enables human-centric evaluation of model-generated stories along different dimensions important for visual story generation.

🤖 How?

$d_{HM}$ combines three reference-free evaluation metrics: GROOViST¹ (visual grounding), RoViST-C² (coherence), and RoViST-NR² (non-redundancy/repetition). It is computed as the average of the absolute metric-level deviations between human stories and the corresponding model generations.
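Written out, this description corresponds roughly to the formula below. The symbols $m_H$ and $m_M$ (the score of metric $m$ on human stories and on model generations, respectively) are notation assumed here for illustration, and whether scores are aggregated per story or over the whole dataset before taking the difference is not specified in this README; see the paper for the exact definition.

$$
d_{HM}^{m} = \left| m_H - m_M \right|, \quad m \in \{G, C, R\}
\qquad\qquad
d_{HM} = \frac{d_{HM}^{G} + d_{HM}^{C} + d_{HM}^{R}}{3}
$$

Lower values indicate that model-generated stories are closer to human stories along the measured dimensions.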

Setup

Install Python (e.g., version 3.11) and the dependencies listed in requirements.txt, for example:
pip install -r requirements.txt

Step 0: Generate stories

For generating stories using the models and settings proposed in this work, refer to this documentation.

Step 1A: Compute metric-level scores for human stories

To compute visual grounding scores (G), check out the GROOViST repository.

To compute coherence (C) and repetition (R) scores, use the following utility, adapted from RoViST. For example:
python evaluate/eval_C_R.py -i ./data/stories/vist/gt_test.json -o ./data/scores/vist/gt_test

Note 1: Download the pre-trained ALBERT model from here and place it under the data/ folder.

Note 2: This utility has different requirements; check the evaluate/requirements file.

Step 1B: Compute metric-level scores for model-generated stories

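Follow the same procedure as in Step 1A, passing the model-generated stories as the input (-i) and a separate location for the scores as the output (-o).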

Step 2: Evaluate using $d_{HM}$

To obtain aggregate $d_{HM}$ values along with the corresponding metric-level distances ($d_{HM}^G, d_{HM}^C, d_{HM}^R$), use the following utility. For example:
python dHM.py -d VIST
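To make the relationship between the metric-level distances and the aggregate value concrete, here is a minimal Python sketch. It is not the code in dHM.py: the function name `d_hm`, the dictionary layout, and the example scores are all hypothetical, and it assumes one already-aggregated score per metric for the human stories and for the model generations.

```python
from statistics import mean

# Metric keys: G = visual grounding, C = coherence, R = repetition.
METRICS = ("G", "C", "R")


def d_hm(human: dict[str, float], model: dict[str, float]) -> dict[str, float]:
    """Return the absolute per-metric distances and their average.

    Illustrative helper only (hypothetical name and input layout);
    the repository's dHM.py may read and aggregate scores differently.
    """
    distances = {m: abs(human[m] - model[m]) for m in METRICS}
    distances["dHM"] = mean(distances[m] for m in METRICS)
    return distances


# Made-up scores purely for illustration; these are not results from the paper.
human_scores = {"G": 0.5, "C": 0.75, "R": 0.875}
model_scores = {"G": 0.25, "C": 0.5, "R": 1.0}

result = d_hm(human_scores, model_scores)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Because each distance is an absolute value, a model that overshoots the human score on one metric is penalized the same as one that undershoots it by the same amount.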

🔗 If you find this work useful, please consider citing it:

@inproceedings{
   EMNLP 2024 Findings (to appear) 
}


akskuchi/dHM-visual-storytelling
