Bootstrap metric support #85
base: main
Conversation
Bootstrap metric processor
##########################

This config is used to process a custom speech dataset using the `BootstrapProcessor` class for bootstrapped metric computation (WER, CER, WMR, etc.). It computes metrics such as Word Error Rate (WER), Character Error Rate (CER), or Word Match Rate (WMR), depending on the config.
Let's add a link to the original paper for Bootstrap.
Done
Bootstrap metric processor
##########################

This config is used to process a custom speech dataset using the `BootstrapProcessor` class for bootstrapped metric computation (WER, CER, WMR, etc.). It computes metrics such as Word Error Rate (WER), Character Error Rate (CER), or Word Match Rate (WMR), depending on the config.
Let's explicitly mention what kinds of metrics we support.
Done
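As a rough illustration of the bootstrapped-metric idea discussed in this thread, here is a minimal sketch (not the processor's actual implementation; the function and variable names are made up): resample utterances with replacement, recompute the metric on each resample, and take percentiles of the resulting distribution as the confidence interval.

```python
import random

def bootstrap_ci(hyps, refs, metric_fn, n_resamples=1000,
                 ci_lower=2.5, ci_upper=97.5, seed=42):
    """Percentile-bootstrap CI for a corpus-level metric (e.g. WER).

    `metric_fn(hyps, refs)` is any callable returning a float; all names
    here are illustrative, not the processor's API.
    """
    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_resamples):
        # Draw a resample of the same size, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric_fn([hyps[i] for i in idx],
                                [refs[i] for i in idx]))
    scores.sort()
    lo = scores[int(round(ci_lower / 100 * (n_resamples - 1)))]
    hi = scores[int(round(ci_upper / 100 * (n_resamples - 1)))]
    return sum(scores) / len(scores), (lo, hi)
```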
The config generates the following outputs:

* **output_manifest_file**: A JSON file containing the results of the metric computation.
This argument should be renamed to reflect that we're writing a results file, not creating an output manifest. For example, we can change it to something like 'output_file'.
Also describe the format in which the output will be written to the resulting file. I think we should add all the field names with their corresponding descriptions.
Done
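To make the requested format concrete, one possible shape of the results file is sketched below; the field names and values are a suggestion for this thread, not what the processor currently writes.

```python
import json

# Hypothetical structure of the results file; all fields are illustrative.
results = {
    "metric_type": "wer",
    "num_bootstrap_samples": 1000,
    "per_manifest": [
        {
            "manifest": "manifests/test_set.json",  # illustrative path
            "mean": 0.124,        # mean metric over bootstrap resamples
            "ci_lower": 0.118,    # 2.5th-percentile bound
            "ci_upper": 0.131,    # 97.5th-percentile bound
        }
    ],
}

print(json.dumps(results, indent=4))
```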
sdp/utils/bootstrap_estimates.py (outdated)

    class BootstrapProcessor(BaseProcessor):
        """
        Performs bootstrapped metric computation (WER, CER, WMR, etc.) on speech predictions
Explicitly mention all the metrics here too
Done
sdp/utils/bootstrap_estimates.py (outdated)

        is set to True.

        Args:
            bootstrap_manifest_files (List[str]): List of file paths to manifest (JSONL) files for metric calculation
Are these file paths or filenames?
Maybe my naming is off, but I meant:
filepath: /home/user/projects/speech_recognition/manifests/metrics_manifest.json
filename: metrics_manifest.json
In this case the processor does require a file path.
sdp/utils/bootstrap_estimates.py (outdated)

            ci_lower (float): Lower bound percentile for confidence intervals (default: 2.5)
            ci_upper (float): Upper bound percentile for confidence intervals (default: 97.5)
            random_state (int): Random state of the program
        """
I think it would be better to describe here what the processor outputs and in what format.
Done
* **raw_data_dir**: Specify the data folder where all the data will be stored.
* **bootstrap_manifest_files**: List of file paths to the manifest files in JSONL format.
* **metric_type**: The metric to compute. Supported options include 'wer', 'cer', 'wmr', 'charrate', 'wordrate'.
* **dataset_size**: Proportion of dataset size for each bootstrap sample.
Can we get a better name for dataset_size? Maybe bootstrap_sample_ratio?
Done
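For illustration, the processor might be configured along these lines; the argument names mirror the documented parameters above (including the bootstrap_sample_ratio rename suggested in this thread), but the exact constructor signature is an assumption, not the final API.

```python
# Illustrative only: paths and argument names are hypothetical.
processor = BootstrapProcessor(
    output_manifest_file="results/bootstrap_wer.json",
    bootstrap_manifest_files=[
        "manifests/model_a.json",
        "manifests/model_b.json",
    ],
    metric_type="wer",            # 'wer', 'cer', 'wmr', 'charrate', 'wordrate'
    bootstrap_sample_ratio=1.0,   # proportion of the dataset per resample
    ci_lower=2.5,
    ci_upper=97.5,
    random_state=42,
)
processor.process()
```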
sdp/utils/bootstrap_estimates.py (outdated)

            })

        output_path = Path(self.output_manifest_file)
        output_path.parent.mkdir(exist_ok=True, parents=True)
This should be done before the calculations, in the prepare base class method.
Done
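A sketch of what the suggested change could look like, assuming BaseProcessor exposes the prepare() hook mentioned above (a minimal fragment of the class from the diff, not the final implementation):

```python
from pathlib import Path

class BootstrapProcessor(BaseProcessor):  # BaseProcessor as in the diff above
    def prepare(self):
        # Create the results directory up front, before any metric computation runs.
        Path(self.output_manifest_file).parent.mkdir(exist_ok=True, parents=True)
```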
sdp/utils/bootstrap_estimates.py (outdated)

            json.dump(results, out_file, indent=4)

        print(f"Results saved to {self.output_manifest_file}")
I think it can be beneficial to add logging.
Removed, as it was used for debugging
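If logging is reintroduced later, a standard-library sketch could look like the snippet below (SDP may have its own logger; the helper name here is hypothetical and only illustrates replacing the print call):

```python
import json
import logging

logger = logging.getLogger(__name__)

def save_results(results: dict, output_file: str) -> None:
    """Hypothetical helper: write the results file and log instead of print()."""
    with open(output_file, "w", encoding="utf-8") as out_file:
        json.dump(results, out_file, indent=4)
    logger.info("Results saved to %s", output_file)
```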
tests/test_cfg_end_to_end_tests.py (outdated)
We should replace the end-to-end tests with unit tests. For reference, you can take a look at some examples provided here https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/tests/test_modify_manifest.py and see how to run them here https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/tests/README.md
Done
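A rough sketch of what such a unit test might look like, following the style of tests/test_modify_manifest.py linked above; the constructor arguments, import path, and fixture data are assumptions for illustration, not the final API.

```python
import json

from sdp.utils.bootstrap_estimates import BootstrapProcessor  # path from the diff above; may have moved

def write_manifest(path, entries):
    # Write a tiny JSONL manifest for the test.
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")

def test_bootstrap_processor_wer(tmp_path):
    manifest = tmp_path / "manifest.json"
    write_manifest(manifest, [
        {"text": "hello world", "pred_text": "hello world"},
        {"text": "good morning", "pred_text": "good mornings"},
    ])
    output_file = tmp_path / "results.json"

    # Argument names mirror the documented config above; adjust to the final API.
    processor = BootstrapProcessor(
        output_manifest_file=str(output_file),
        bootstrap_manifest_files=[str(manifest)],
        metric_type="wer",
        bootstrap_sample_ratio=1.0,
        random_state=0,
    )
    processor.process()

    results = json.loads(output_file.read_text())
    assert results, "expected a non-empty results file"
```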
Signed-off-by: Alexan <[email protected]>
No description provided.