Bootstrap metric support #85

Open: wants to merge 5 commits into base branch `main`.
Conversation

**@Alcray** (Collaborator) commented Sep 27, 2024

No description provided.

Bootstrap metric processor
##########################

This config processes a custom speech dataset with the `BootstrapProcessor` class, which performs bootstrapped metric computation. Depending on the config, it calculates Word Error Rate (WER), Character Error Rate (CER), Word Match Rate (WMR), or other supported metrics.
**Collaborator:** Let's add a link to the original paper for Bootstrap.

**Author:** Done

**Collaborator:** Let's explicitly mention which metrics we support.

**Author:** Done


The config generates the following outputs:

* **output_manifest_file**: A JSON file containing the results of the metric computation.
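The exact result schema isn't shown in this hunk; the field names below are illustrative assumptions only, sketching what a bootstrapped-WER results file could contain:

```python
import json

# Hypothetical results structure: a point estimate plus bootstrap
# confidence bounds (field names are assumptions, not the PR's schema).
results = {
    "metric_type": "wer",
    "point_estimate": 0.127,
    "ci_lower": 0.112,   # e.g. 2.5th percentile across bootstrap samples
    "ci_upper": 0.143,   # e.g. 97.5th percentile across bootstrap samples
    "num_bootstrap_samples": 1000,
}

with open("results.json", "w") as out_file:
    json.dump(results, out_file, indent=4)
```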
**Collaborator:** This argument should be renamed to reflect that we're writing a results file, not creating an output manifest. For example, we could change it to something like `output_file`.

**Collaborator:** Also document the format in which the output will be written to the resulting file. I think we should add all the field names with their corresponding descriptions.

**Author:** Done


class BootstrapProcessor(BaseProcessor):
"""
Performs bootstrapped metric computation (WER, CER, WMR, etc.) on speech predictions
**Collaborator:** Explicitly mention all the metrics here too.

**Author:** Done

is set to True.

Args:
bootstrap_manifest_files (List[str]): List of file paths to manifest (JSONL) files for metric calculation
**Collaborator:** Are these file paths or filenames?

**Author:** Maybe my naming is off, but I meant
filepath: `/home/user/projects/speech_recognition/manifests/metrics_manifest.json`
and
filename: `metrics_manifest.json`

In this case the processor does require a filepath.

ci_lower (float): Lower bound percentile for confidence intervals (default: 2.5)
ci_upper (float): Upper bound percentile for confidence intervals (default: 97.5)
random_state (int): Random state of the program
"""
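A minimal standard-library sketch of how `ci_lower`, `ci_upper`, and `random_state` typically interact in a percentile bootstrap (names outside the docstring are assumptions, not the PR's implementation):

```python
import random
import statistics

def _percentile(sorted_vals, pct):
    # Nearest-rank percentile over an already-sorted list.
    k = max(0, min(len(sorted_vals) - 1,
                   round(pct / 100 * (len(sorted_vals) - 1))))
    return sorted_vals[k]

def bootstrap_bounds(per_sample_metric, ci_lower=2.5, ci_upper=97.5,
                     n_boot=1000, random_state=0):
    # random_state seeds the resampler so runs are reproducible.
    rng = random.Random(random_state)
    boot_means = sorted(
        statistics.fmean(rng.choices(per_sample_metric, k=len(per_sample_metric)))
        for _ in range(n_boot)
    )
    return _percentile(boot_means, ci_lower), _percentile(boot_means, ci_upper)

lo, hi = bootstrap_bounds([0.1, 0.2, 0.15, 0.3, 0.05])
```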
**Collaborator:** I think it would be better to describe here what the processor outputs and in what format.

**Author:** Done

* **raw_data_dir**: Specifies the data folder where all the data will be stored.
* **bootstrap_manifest_files**: List of file paths to the manifest files in JSONL format.
* **metric_type**: The metric to compute. Supported options include 'wer', 'cer', 'wmr', 'charrate', 'wordrate'.
* **dataset_size**: Proportion of dataset size for each bootstrap sample.
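Pulling the documented fields together, a hypothetical config entry could look like the following (the `_target_` path, variable names, and values are assumptions, not taken from the PR):

```yaml
processors:
  - _target_: sdp.processors.BootstrapProcessor   # hypothetical import path
    raw_data_dir: ${workspace_dir}/data
    output_manifest_file: ${workspace_dir}/bootstrap_results.json
    bootstrap_manifest_files:
      - ${workspace_dir}/manifests/model_a.json
      - ${workspace_dir}/manifests/model_b.json
    metric_type: wer          # one of: wer, cer, wmr, charrate, wordrate
    dataset_size: 1.0         # proportion of the dataset per bootstrap sample
    ci_lower: 2.5
    ci_upper: 97.5
    random_state: 42
```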
**Collaborator:** Can we get a better name for `dataset_size`? Maybe `bootstrap_sample_ratio`?

**Author:** Done

})

output_path = Path(self.output_manifest_file)
output_path.parent.mkdir(exist_ok=True, parents=True)
**Collaborator:** This should be done before the calculations, in the base class's `prepare` method.

**Author:** Done
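The suggested refactor, with directory creation moved out of the computation path, might look like this sketch (the `BaseProcessor` stub and `process` body here are stand-ins, since the real base class isn't shown in the PR):

```python
import json
from pathlib import Path

class BaseProcessor:
    """Minimal stand-in for the real base class (not the PR's code)."""
    def __init__(self, output_manifest_file):
        self.output_manifest_file = output_manifest_file

class BootstrapProcessor(BaseProcessor):
    def prepare(self):
        # Create the output directory up front, before any computation runs.
        Path(self.output_manifest_file).parent.mkdir(exist_ok=True, parents=True)

    def process(self):
        results = {"metric_type": "wer"}  # placeholder for the real computation
        with open(self.output_manifest_file, "w") as out_file:
            json.dump(results, out_file, indent=4)

proc = BootstrapProcessor("bootstrap_out/results.json")
proc.prepare()
proc.process()
```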

json.dump(results, out_file, indent=4)

print(f"Results saved to {self.output_manifest_file}")

**Collaborator:** I think it can be beneficial to add logging.

**Author:** Removed, as it was used for debugging.
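If logging is reintroduced later, a standard-library sketch replacing the `print` call could be (the logger name and `save_results` helper are assumptions):

```python
import json
import logging

logger = logging.getLogger("bootstrap_processor")  # name is an assumption

def save_results(results, output_file):
    with open(output_file, "w") as out_file:
        json.dump(results, out_file, indent=4)
    # Log at INFO instead of printing, so verbosity is controlled
    # by the host application's logging configuration.
    logger.info("Results saved to %s", output_file)

save_results({"metric_type": "wer", "point_estimate": 0.2}, "demo_results.json")
```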

**Collaborator:** We should replace the end-to-end tests with unit tests. For reference, you can take a look at the examples at https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/tests/test_modify_manifest.py and see how to run them at https://github.com/NVIDIA/NeMo-speech-data-processor/blob/main/tests/README.md

**Author:** Done
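In the spirit of the linked examples, a pytest-style unit test for the bootstrap routine might look like the following. The `bootstrap_ci` function is a self-contained stand-in for the processor's metric code, not the PR's actual API:

```python
import random

def bootstrap_ci(values, n_boot=200, seed=0):
    # Stand-in bootstrap: resample means, return (2.5th, 97.5th) percentiles.
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

def test_ci_collapses_for_constant_data():
    # With identical values, every resample mean is that same value.
    lo, hi = bootstrap_ci([0.5, 0.5, 0.5])
    assert lo == hi == 0.5

def test_ci_is_ordered():
    lo, hi = bootstrap_ci([0.1, 0.4, 0.2, 0.3])
    assert lo <= hi

test_ci_collapses_for_constant_data()
test_ci_is_ordered()
```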

Signed-off-by: Alexan <[email protected]>