Code and data for the SIGIR '24 paper "Generalizable Tip-of-the-Tongue Retrieval with LLM Re-ranking"
First step: Collecting data
- Download the full collection of tip-of-the-tongue (ToT) queries from Reddit
wget https://files.webis.de/corpora/corpora-webis/known-item-question-performance-prediction/reddit-tomt-submissions.jsonl.gz
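Each line of the dump is one JSON object (a Reddit submission). A minimal sketch for peeking at the raw file before running the scripts; the field names are standard Reddit submission fields and may differ in this corpus, so inspect a few records first:

```python
# Peek at the downloaded dump: one JSON object per line.
import gzip
import json

with gzip.open("reddit-tomt-submissions.jsonl.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        submission = json.loads(line)
        print(submission.keys())  # e.g., "title", "selftext" (field names may vary)
        if i == 2:  # only inspect the first few records
            break
```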
- Filter queries by category: movies, books, games, music. We keep only solved queries, i.e., those with an answer in natural language (a Reddit reply).
python3 SCRIPTS/extract_categories.py
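Illustrative sketch of the filtering idea (the actual logic lives in SCRIPTS/extract_categories.py): r/tipofmytongue submissions usually carry bracketed tags in the title, such as "[TOMT][MOVIE][2000s]". The tag vocabulary below is an assumption and may not match the script exactly:

```python
import re

# Assumed tag set; the real script may map more tags (e.g., "SONG") to these categories.
CATEGORIES = {"MOVIE", "BOOK", "GAME", "MUSIC"}

def category_of(title: str) -> str | None:
    """Return the first recognized category tag in a submission title, if any."""
    for tag in re.findall(r"\[([^\]]+)\]", title.upper()):
        if tag in CATEGORIES:
            return tag
    return None
```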
- Have GPT-3.5 extract the item title from the natural language answer
python3 SCRIPTS/gpt_titles.py
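A hedged sketch of what this extraction call could look like; the exact prompt and model identifier used in the paper are not reproduced here, so both are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_title(answer_text: str) -> str:
    """Ask the model to return only the item title named in a Reddit answer."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model version
        messages=[
            {"role": "system",
             "content": "Extract the title of the item named in the following "
                        "answer. Reply with the title and nothing else."},
            {"role": "user", "content": answer_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```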
- Match the GPT-3.5-extracted titles to Wikipedia titles, where possible. This further narrows down the number of queries.
python3 SCRIPTS/match_gpt_to_wiki.py
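Illustrative matching logic (the real implementation is SCRIPTS/match_gpt_to_wiki.py): normalize both sides and keep a query only when its extracted title matches a Wikipedia title. The normalization below is an assumption:

```python
def normalize(title: str) -> str:
    """Lowercase, replace underscores, and collapse whitespace."""
    return " ".join(title.lower().replace("_", " ").split())

def match_to_wiki(gpt_titles: dict[str, str], wiki_titles: set[str]) -> dict[str, str]:
    """Map query_id -> matched Wikipedia title; unmatched queries are dropped."""
    index = {normalize(t): t for t in wiki_titles}
    matched = {}
    for query_id, title in gpt_titles.items():
        hit = index.get(normalize(title))
        if hit is not None:
            matched[query_id] = hit
    return matched
```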
- Now that we have all the available queries (i.e., the ones whose answer maps to a Wikipedia title), we split the data into training/validation/test sets.
python3 SCRIPTS/split_dataset.py
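A minimal sketch of a reproducible split; the ratios and seed below are placeholders, the actual values are defined in SCRIPTS/split_dataset.py:

```python
import random

def split(query_ids: list[str], seed: int = 42,
          ratios: tuple[float, float] = (0.8, 0.1)):
    """Return (train, val, test) id lists; shuffling is seeded for reproducibility."""
    ids = sorted(query_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```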
- Optionally, since every query maps to exactly one document, we can create lists of samples, i.e., (query_text, document_text) tuples.
python3 SCRIPTS/process_qrels.py
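Sketch of pairing each query with its single relevant document, assuming in-memory mappings of query_id to query text, doc_id to document text, and qrels as query_id to doc_id (the real logic is in SCRIPTS/process_qrels.py):

```python
def build_samples(qrels: dict[str, str],
                  queries: dict[str, str],
                  documents: dict[str, str]) -> list[tuple[str, str]]:
    """Return (query_text, document_text) tuples; one pair per query."""
    return [(queries[qid], documents[did])
            for qid, did in qrels.items()
            if qid in queries and did in documents]
```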
This is the full data-collection process. We recommend instead downloading the already-processed data, organized into the three folders below. In these files, the test queries have additionally been filtered by hand.
- DATA (contains ToT queries, query titles, qrels): https://mega.nz/file/BogxwCII#U7r9mNwNJXQSyJgvtCjgK-XbG5qrIUZsyF6BRpGlx2A
- WIKIPEDIA (contains Wikipedia documents, Wiki titles): https://mega.nz/file/cw4FzIxY#HeGKOXEzbgYP80_R3elA5nu5rinHj0p1OTXZ4ATN1oA
- TREC (TREC-specific ToT queries and Wikipedia split): https://mega.nz/file/Ytg1nTDC#cZ5_2Keys_bNMwXq_UY00TZMYzOJPca6MP7C79hNMSY
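Once downloaded, the qrels can be loaded with a few lines of Python. This sketch assumes the standard 4-column TREC qrels format ("qid Q0 docid rel"); check the actual files after download, since the format here is an assumption:

```python
def load_qrels(path: str) -> dict[str, str]:
    """Map each query id to its relevant document id."""
    qrels = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            qid, _, doc_id, rel = line.split()
            if int(rel) > 0:
                qrels[qid] = doc_id  # each ToT query has exactly one answer
    return qrels
```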
Second step: Training DPR on the collected data
sh DPR/script.sh
will train a DPR model on all domains and evaluate it on a single domain; see the script's command-line arguments to change these options. The link to download the trained weights is inside the DPR folder.
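For orientation, this is a minimal sketch of the DPR objective the script optimizes (dual encoders scored by dot product, trained with in-batch negatives). The base model and hyperparameters below are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")
doc_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    """Embed a list of texts as their [CLS] vectors."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

def in_batch_loss(queries, documents):
    """Each query's gold document is positive; all other docs in the batch are negatives."""
    q = encode(query_encoder, queries)   # (B, H)
    d = encode(doc_encoder, documents)   # (B, H)
    scores = q @ d.T                     # (B, B) similarity matrix
    labels = torch.arange(len(queries))  # diagonal entries are the positives
    return F.cross_entropy(scores, labels)
```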
November 2023 TREC Participation
The folder TREC_Participation contains the dataset, code, and DPR weights for our TREC participation, submitted in August 2023 and presented in November 2023. The 2023 TREC ToT task differs from the SIGIR paper's task: TREC evaluates only on the movie domain, with its own 150 queries, and the TREC DPR model was trained only on movie queries. The methodology for building the training dataset was the same, though the resulting dataset differs. There is no test split, since that was provided by the organizers, and the criteria for keeping/discarding movie queries were less strict, yielding more movie queries for training (around 84k). We recommend using the methods and data from the SIGIR paper instead.