Bootstrap metric support #85

Open: wants to merge 5 commits into base branch `main`.
Conversation

**@Alcray** (Collaborator) commented Sep 27, 2024

No description provided.

Bootstrap metric processor
##########################

This config processes a custom speech dataset with the `BootstrapProcessor` class, which performs bootstrapped metric computation. Depending on the config, it calculates Word Error Rate (WER), Character Error Rate (CER), Word Match Rate (WMR), or other supported metrics.
**Collaborator:** Let's add a link to the original paper for Bootstrap.

**Author:** Done

**Collaborator:** Let's explicitly mention which metrics we support.

**Author:** Done


The config generates the following outputs:

* **output_manifest_file**: A JSON file containing the results of the metric computation.
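The exact result schema isn't shown in this hunk; the field names below are illustrative assumptions only, sketching what a bootstrapped-WER results file could contain:

```python
import json

# Hypothetical results structure: a point estimate plus bootstrap
# confidence bounds (field names are assumptions, not the PR's schema).
results = {
    "metric_type": "wer",
    "point_estimate": 0.127,
    "ci_lower": 0.112,   # e.g. 2.5th percentile across bootstrap samples
    "ci_upper": 0.143,   # e.g. 97.5th percentile across bootstrap samples
    "num_bootstrap_samples": 1000,
}

with open("results.json", "w") as out_file:
    json.dump(results, out_file, indent=4)
```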
**Collaborator:** This argument should be renamed to reflect that we're writing a results file, not creating an output manifest. For example, we could change it to something like `output_file`.

**Collaborator:** Also document the format in which the output will be written to the resulting file. I think we should add all the field names with their corresponding descriptions.

**Author:** Done


class BootstrapProcessor(BaseProcessor):
"""
Performs bootstrapped metric computation (WER, CER, WMR, etc.) on speech predictions
**Collaborator:** Explicitly mention all the metrics here too.

**Author:** Done

is set to True.

Args:
bootstrap_manifest_files (List[str]): List of file paths to manifest (JSONL) files for metric calculation
**Collaborator:** Are these file paths or filenames?

**Author:** Maybe my naming is off, but I meant
filepath: `/home/user/projects/speech_recognition/manifests/metrics_manifest.json`
and
filename: `metrics_manifest.json`

In this case the processor does require a filepath.

ci_lower (float): Lower bound percentile for confidence intervals (default: 2.5)
ci_upper (float): Upper bound percentile for confidence intervals (default: 97.5)
random_state (int): Random state of the program
"""
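A minimal standard-library sketch of how `ci_lower`, `ci_upper`, and `random_state` typically interact in a percentile bootstrap (names outside the docstring are assumptions, not the PR's implementation):

```python
import random
import statistics

def _percentile(sorted_vals, pct):
    # Nearest-rank percentile over an already-sorted list.
    k = max(0, min(len(sorted_vals) - 1,
                   round(pct / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def bootstrap_bounds(per_sample_metric, ci_lower=2.5, ci_upper=97.5,
                     n_boot=1000, random_state=0):
    # random_state seeds the resampler so runs are reproducible.
    rng = random.Random(random_state)
    boot_means = sorted(
        statistics.fmean(rng.choices(per_sample_metric, k=len(per_sample_metric)))
        for _ in range(n_boot)
    )
    return _percentile(boot_means, ci_lower), _percentile(boot_means, ci_upper)

lo, hi = bootstrap_bounds([0.1, 0.2, 0.15, 0.3, 0.05])
```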
**Collaborator:** I think it would be better to describe here what the processor outputs and in what format.

**Author:** Done

* **raw_data_dir**: Specifies the data folder where all the data will be stored.
* **bootstrap_manifest_files**: List of file paths to the manifest files in JSONL format.
* **metric_type**: The metric to compute. Supported options include 'wer', 'cer', 'wmr', 'charrate', 'wordrate'.
* **dataset_size**: Proportion of dataset size for each bootstrap sample.
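Pulling the documented fields together, a hypothetical config entry could look like the following (the `_target_` path, variable names, and values are assumptions, not taken from the PR):

```yaml
processors:
  - _target_: sdp.processors.BootstrapProcessor   # hypothetical import path
    raw_data_dir: ${workspace_dir}/data
    output_manifest_file: ${workspace_dir}/bootstrap_results.json
    bootstrap_manifest_files:
      - ${workspace_dir}/manifests/model_a.json
      - ${workspace_dir}/manifests/model_b.json
    metric_type: wer          # one of: wer, cer, wmr, charrate, wordrate
    dataset_size: 1.0         # proportion of the dataset per bootstrap sample
    ci_lower: 2.5
    ci_upper: 97.5
    random_state: 42
```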
**Collaborator:** Can we get a better name for `dataset_size`? Maybe `bootstrap_sample_ratio`?

**Author:** Done

})

output_path = Path(self.output_manifest_file)
output_path.parent.mkdir(exist_ok=True, parents=True)
**Collaborator:** This should be done before the calculations, in the base class's `prepare` method.

**Author:** Done
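The suggested refactor, with directory creation moved out of the computation path, might look like this sketch (the `BaseProcessor` stub and `process` body here are stand-ins, since the real base class isn't shown in the PR):

```python
import json
from pathlib import Path

class BaseProcessor:
    """Minimal stand-in for the real base class (not the PR's code)."""
    def __init__(self, output_manifest_file):
        self.output_manifest_file = output_manifest_file

class BootstrapProcessor(BaseProcessor):
    def prepare(self):
        # Create the output directory up front, before any computation runs.
        Path(self.output_manifest_file).parent.mkdir(exist_ok=True, parents=True)

    def process(self):
        results = {"metric_type": "wer"}  # placeholder for the real computation
        with open(self.output_manifest_file, "w") as out_file:
            json.dump(results, out_file, indent=4)

proc = BootstrapProcessor("bootstrap_out/results.json")
proc.prepare()
proc.process()
```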

json.dump(results, out_file, indent=4)

print(f"Results saved to {self.output_manifest_file}")

**Collaborator:** I think it can be beneficial to add logging.

**Author:** Removed, as it was used for debugging.
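If logging is reintroduced later, a standard-library sketch replacing the `print` call could be (the logger name and `save_results` helper are assumptions):

```python
import json
import logging

logger = logging.getLogger("bootstrap_processor")  # name is an assumption

def save_results(results, output_file):
    with open(output_file, "w") as out_file:
        json.dump(results, out_file, indent=4)
    # Log at INFO instead of printing, so verbosity is controlled
    # by the host application's logging configuration.
    logger.info("Results saved to %s", output_file)

save_results({"metric_type": "wer", "point_estimate": 0.2}, "demo_results.json")
```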

**Collaborator:** We should replace the end-to-end tests with unit tests. For reference, you can take a look at the examples at https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/tests/test_modify_manifest.py and see how to run them at https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/tests/README.md

**Author:** Done
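In the spirit of the linked examples, a pytest-style unit test for the bootstrap routine might look like the following. The `bootstrap_ci` function is a self-contained stand-in for the processor's metric code, not the PR's actual API:

```python
import random

def bootstrap_ci(values, n_boot=200, seed=0):
    # Stand-in bootstrap: resample means, return (2.5th, 97.5th) percentiles.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

def test_ci_collapses_for_constant_data():
    # With identical values, every resample mean is that same value.
    lo, hi = bootstrap_ci([0.5, 0.5, 0.5])
    assert lo == hi == 0.5

def test_ci_is_ordered():
    lo, hi = bootstrap_ci([0.1, 0.4, 0.2, 0.3])
    assert lo <= hi

test_ci_collapses_for_constant_data()
test_ci_is_ordered()
```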

Signed-off-by: Alexan <[email protected]>