tweak docs.
lintool committed Nov 29, 2024
1 parent 494bbc6 commit 5a12c63
Showing 8 changed files with 48 additions and 48 deletions.
@@ -33,15 +33,15 @@ Download the corpus and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-bge-base-en-v1.5.parquet.tar -P collections/
tar xvf collections/msmarco-passage-bge-base-en-v1.5.tar -C collections/
tar xvf collections/msmarco-passage-bge-base-en-v1.5.parquet.tar -C collections/
```

To confirm, `msmarco-passage-bge-base-en-v1.5.tar` is 39 GB and has MD5 checksum `b235e19ec492c18a18057b30b8b23fd4`.
To confirm, `msmarco-passage-bge-base-en-v1.5.parquet.tar` is 39 GB and has MD5 checksum `b235e19ec492c18a18057b30b8b23fd4`.
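The checksum comparison can be scripted. The following is a minimal sketch, not part of Anserini: `verify_md5` is a hypothetical helper, and it assumes GNU `md5sum` is available (on macOS, `md5 -q` would be the rough equivalent).

```bash
# Hypothetical helper (not part of Anserini): compare a file's MD5
# digest against an expected value. Assumes GNU coreutils `md5sum`.
verify_md5() {
  local file="$1" expected="$2" actual
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(md5sum "$file" | awk '{print $1}')"
  if [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file (got $actual, expected $expected)" >&2
    return 1
  fi
}

# Usage, with the filename and checksum quoted above:
# verify_md5 collections/msmarco-passage-bge-base-en-v1.5.parquet.tar b235e19ec492c18a18057b30b8b23fd4
```

The function returns a non-zero status on mismatch, so it can gate the later steps in a script (for example, `verify_md5 ... && tar xvf ...`).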
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.bge-base-en-v1.5.parquet.flat.cached \
--corpus-path collections/msmarco-passage-bge-base-en-v1.5
--corpus-path collections/msmarco-passage-bge-base-en-v1.5.parquet
```

## Indexing
@@ -52,13 +52,13 @@ Sample indexing command, building flat indexes:
bin/run.sh io.anserini.index.IndexFlatDenseVectors \
-threads 16 \
-collection ParquetDenseVectorCollection \
-input /path/to/msmarco-passage-bge-base-en-v1.5 \
-input /path/to/msmarco-passage-bge-base-en-v1.5.parquet \
-generator ParquetDenseVectorDocumentGenerator \
-index indexes/lucene-flat.msmarco-v1-passage.bge-base-en-v1.5/ \
>& logs/log.msmarco-passage-bge-base-en-v1.5 &
>& logs/log.msmarco-passage-bge-base-en-v1.5.parquet &
```

The path `/path/to/msmarco-passage-bge-base-en-v1.5/` should point to the corpus downloaded above.
The path `/path/to/msmarco-passage-bge-base-en-v1.5.parquet/` should point to the corpus downloaded above.
Upon completion, we should have an index with 8,841,823 documents.

## Retrieval
@@ -73,17 +73,17 @@ bin/run.sh io.anserini.search.SearchFlatDenseVectors \
-index indexes/lucene-flat.msmarco-v1-passage.bge-base-en-v1.5/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-bge-base-en-v1.5.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt \
-output runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt \
-hits 1000 -threads 16 &
```

Evaluation can be performed using `trec_eval`:

```bash
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
```
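Because the four `trec_eval` invocations differ only in their metric flags, they can be wrapped in a small function. This is a sketch under assumptions: `eval_run` and the `TREC_EVAL` override variable are hypothetical names introduced here, not something the repository provides. The flags are copied from the commands above.

```bash
# Hypothetical wrapper (not part of Anserini): run the four metric
# invocations shown above against a single run file.
# TREC_EVAL is an assumed override; it defaults to bin/trec_eval,
# the path used in the commands above.
eval_run() {
  local qrels="$1" run="$2"
  local trec_eval="${TREC_EVAL:-bin/trec_eval}"
  "$trec_eval" -c -m map "$qrels" "$run"
  "$trec_eval" -c -M 10 -m recip_rank "$qrels" "$run"
  "$trec_eval" -c -m recall.100 "$qrels" "$run"
  "$trec_eval" -c -m recall.1000 "$qrels" "$run"
}

# Usage (run file name as in the updated commands above):
# eval_run tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
#   runs/run.msmarco-passage-bge-base-en-v1.5.parquet.bge-flat-cached.topics.msmarco-passage.dev-subset.bge-base-en-v1.5.jsonl.txt
```

Passing the run file as an argument means a rename like the one in this commit only has to be made in one place.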

## Effectiveness
@@ -31,15 +31,15 @@ Download the corpus and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cohere-embed-english-v3.0.parquet.tar -P collections/
tar xvf collections/msmarco-passage-cohere-embed-english-v3.0.tar -C collections/
tar xvf collections/msmarco-passage-cohere-embed-english-v3.0.parquet.tar -C collections/
```

To confirm, `msmarco-passage-cohere-embed-english-v3.0.tar` is 16 GB and has MD5 checksum `40c5caf33476746e93ceeb75174b8d64`.
To confirm, `msmarco-passage-cohere-embed-english-v3.0.parquet.tar` is 16 GB and has MD5 checksum `40c5caf33476746e93ceeb75174b8d64`.
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.cohere-embed-english-v3.0.parquet.flat.cached \
--corpus-path collections/msmarco-passage-cohere-embed-english-v3.0
--corpus-path collections/msmarco-passage-cohere-embed-english-v3.0.parquet
```

## Indexing
@@ -50,13 +50,13 @@ Sample indexing command, building flat indexes:
bin/run.sh io.anserini.index.IndexFlatDenseVectors \
-threads 16 \
-collection ParquetDenseVectorCollection \
-input /path/to/msmarco-passage-cohere-embed-english-v3.0 \
-input /path/to/msmarco-passage-cohere-embed-english-v3.0.parquet \
-generator ParquetDenseVectorDocumentGenerator \
-index indexes/lucene-flat.msmarco-v1-passage.cohere-embed-english-v3.0/ \
>& logs/log.msmarco-passage-cohere-embed-english-v3.0 &
>& logs/log.msmarco-passage-cohere-embed-english-v3.0.parquet &
```

The path `/path/to/msmarco-passage-cohere-embed-english-v3.0/` should point to the corpus downloaded above.
The path `/path/to/msmarco-passage-cohere-embed-english-v3.0.parquet/` should point to the corpus downloaded above.
Upon completion, we should have an index with 8,841,823 documents.

## Retrieval
@@ -71,17 +71,17 @@ bin/run.sh io.anserini.search.SearchFlatDenseVectors \
-index indexes/lucene-flat.msmarco-v1-passage.cohere-embed-english-v3.0/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cohere-embed-english-v3.0.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt \
-output runs/run.msmarco-passage-cohere-embed-english-v3.0.parquet.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt \
-hits 1000 -threads 16 &
```

Evaluation can be performed using `trec_eval`:

```bash
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.parquet.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.parquet.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.parquet.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cohere-embed-english-v3.0.parquet.cohere-embed-english-v3.0-flat-cached.topics.msmarco-passage.dev-subset.cohere-embed-english-v3.0.jsonl.txt
```

## Effectiveness
@@ -33,15 +33,15 @@ Download the corpus and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cos-dpr-distil.parquet.tar -P collections/
tar xvf collections/msmarco-passage-cos-dpr-distil.tar -C collections/
tar xvf collections/msmarco-passage-cos-dpr-distil.parquet.tar -C collections/
```

To confirm, `msmarco-passage-cos-dpr-distil.tar` is 38 GB and has MD5 checksum `c8a204fbc3ccda581aa375936af43a97`.
To confirm, `msmarco-passage-cos-dpr-distil.parquet.tar` is 38 GB and has MD5 checksum `c8a204fbc3ccda581aa375936af43a97`.
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.cos-dpr-distil.parquet.flat.cached \
--corpus-path collections/msmarco-passage-cos-dpr-distil
--corpus-path collections/msmarco-passage-cos-dpr-distil.parquet
```

## Indexing
@@ -52,13 +52,13 @@ Sample indexing command, building flat indexes:
bin/run.sh io.anserini.index.IndexFlatDenseVectors \
-threads 16 \
-collection ParquetDenseVectorCollection \
-input /path/to/msmarco-passage-cos-dpr-distil \
-input /path/to/msmarco-passage-cos-dpr-distil.parquet \
-generator ParquetDenseVectorDocumentGenerator \
-index indexes/lucene-flat.msmarco-v1-passage.cos-dpr-distil/ \
>& logs/log.msmarco-passage-cos-dpr-distil &
>& logs/log.msmarco-passage-cos-dpr-distil.parquet &
```

The path `/path/to/msmarco-passage-cos-dpr-distil/` should point to the corpus downloaded above.
The path `/path/to/msmarco-passage-cos-dpr-distil.parquet/` should point to the corpus downloaded above.
Upon completion, we should have an index with 8,841,823 documents.

## Retrieval
@@ -73,17 +73,17 @@ bin/run.sh io.anserini.search.SearchFlatDenseVectors \
-index indexes/lucene-flat.msmarco-v1-passage.cos-dpr-distil/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt \
-output runs/run.msmarco-passage-cos-dpr-distil.parquet.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt \
-hits 1000 -threads 16 &
```

Evaluation can be performed using `trec_eval`:

```bash
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.parquet.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.parquet.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.parquet.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-cos-dpr-distil.parquet.cos-dpr-distil-flat-cached.topics.msmarco-passage.dev-subset.cos-dpr-distil.jsonl.txt
```

## Effectiveness
@@ -33,15 +33,15 @@ Download the corpus and unpack into `collections/`:

```bash
wget https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-openai-ada2.parquet.tar -P collections/
tar xvf collections/msmarco-passage-openai-ada2.tar -C collections/
tar xvf collections/msmarco-passage-openai-ada2.parquet.tar -C collections/
```

To confirm, `msmarco-passage-openai-ada2.tar` is 75 GB and has MD5 checksum `fa3637e9c4150b157270e19ef3a4f779`.
To confirm, `msmarco-passage-openai-ada2.parquet.tar` is 75 GB and has MD5 checksum `fa3637e9c4150b157270e19ef3a4f779`.
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.openai-ada2.parquet.flat.cached \
--corpus-path collections/msmarco-passage-openai-ada2
--corpus-path collections/msmarco-passage-openai-ada2.parquet
```

## Indexing
@@ -52,13 +52,13 @@ Sample indexing command, building flat indexes:
bin/run.sh io.anserini.index.IndexFlatDenseVectors \
-threads 16 \
-collection ParquetDenseVectorCollection \
-input /path/to/msmarco-passage-openai-ada2 \
-input /path/to/msmarco-passage-openai-ada2.parquet \
-generator ParquetDenseVectorDocumentGenerator \
-index indexes/lucene-flat.msmarco-v1-passage.openai-ada2/ \
>& logs/log.msmarco-passage-openai-ada2 &
>& logs/log.msmarco-passage-openai-ada2.parquet &
```

The path `/path/to/msmarco-passage-openai-ada2/` should point to the corpus downloaded above.
The path `/path/to/msmarco-passage-openai-ada2.parquet/` should point to the corpus downloaded above.
Upon completion, we should have an index with 8,841,823 documents.

## Retrieval
@@ -73,17 +73,17 @@ bin/run.sh io.anserini.search.SearchFlatDenseVectors \
-index indexes/lucene-flat.msmarco-v1-passage.openai-ada2/ \
-topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.openai-ada2.jsonl.gz \
-topicReader JsonIntVector \
-output runs/run.msmarco-passage-openai-ada2.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt \
-output runs/run.msmarco-passage-openai-ada2.parquet.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt \
-hits 1000 -threads 16 &
```

Evaluation can be performed using `trec_eval`:

```bash
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
bin/trec_eval -c -m map tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.parquet.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
bin/trec_eval -c -M 10 -m recip_rank tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.parquet.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
bin/trec_eval -c -m recall.100 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.parquet.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
bin/trec_eval -c -m recall.1000 tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage-openai-ada2.parquet.openai-ada2-flat-cached.topics.msmarco-passage.dev-subset.openai-ada2.jsonl.txt
```

## Effectiveness
@@ -1,5 +1,5 @@
---
corpus: msmarco-passage-bge-base-en-v1.5
corpus: msmarco-passage-bge-base-en-v1.5.parquet
corpus_path: collections/msmarco/msmarco-passage-bge-base-en-v1.5.parquet/

download_url: https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-bge-base-en-v1.5.parquet.tar
@@ -1,5 +1,5 @@
---
corpus: msmarco-passage-cohere-embed-english-v3.0
corpus: msmarco-passage-cohere-embed-english-v3.0.parquet
corpus_path: collections/msmarco/msmarco-passage-cohere-embed-english-v3.0.parquet/

download_url: https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cohere-embed-english-v3.0.parquet.tar
@@ -1,5 +1,5 @@
---
corpus: msmarco-passage-cos-dpr-distil
corpus: msmarco-passage-cos-dpr-distil.parquet
corpus_path: collections/msmarco/msmarco-passage-cos-dpr-distil.parquet/

download_url: https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-cos-dpr-distil.parquet.tar
@@ -1,5 +1,5 @@
---
corpus: msmarco-passage-openai-ada2
corpus: msmarco-passage-openai-ada2.parquet
corpus_path: collections/msmarco/msmarco-passage-openai-ada2.parquet/

download_url: https://rgw.cs.uwaterloo.ca/pyserini/data/msmarco-passage-openai-ada2.parquet.tar