chore: update readme to keep track of the flow
Showing 1 changed file with 34 additions and 0 deletions.
@@ -0,0 +1,34 @@
# How to evaluate the performance of generated synthetic data?

## Datasets

@Simon updating ...

## Metrics

| Term | Definition | Formula | Interpretation |
|---|---|---|---|
| Coverage score | How comprehensively the summary covers the content of the original document. | 100 − X, where X is the percentage of document-generated questions that receive an "IDK" (I Don't Know) response based on the summary. | A higher coverage score indicates that the summary captures more of the original details and is less generic. |
| Conformity score | Whether the summary avoids contradicting the document. | 100 − X, where X is the percentage of questions for which the summary's answer is "NO" while the document's is "YES", or vice versa. | A higher conformity score signifies greater alignment between the summary and the document. |
| Consistency score | The level of non-hallucination, based on the accuracy of factual information in the summary compared to the document. | 100 − X, where X is the percentage of summary-derived questions that are answered with an "IDK" based on the document, indicating factual discrepancies. | A higher consistency score suggests that the summary is more factual and contains fewer inaccuracies or fabrications. |

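As a rough sketch of how these formulas could be computed, assume each question has been answered with "YES", "NO", or "IDK" against both the document and the summary; the function names and data layout below are illustrative only, not the project's actual API.

```
# Illustrative only: computing the three scores from lists of QA answers.
# Answers are strings in {"YES", "NO", "IDK"}.

def coverage_score(summary_answers_to_doc_questions):
    """100 - X, where X is the % of document-generated questions answered "IDK" from the summary."""
    idk = sum(a == "IDK" for a in summary_answers_to_doc_questions)
    return 100 - 100 * idk / len(summary_answers_to_doc_questions)

def conformity_score(answer_pairs):
    """100 - X, where X is the % of questions with opposite YES/NO answers (document vs. summary)."""
    contradictions = sum(
        pair in {("YES", "NO"), ("NO", "YES")} for pair in answer_pairs
    )
    return 100 - 100 * contradictions / len(answer_pairs)

def consistency_score(doc_answers_to_summary_questions):
    """100 - X, where X is the % of summary-derived questions answered "IDK" from the document."""
    idk = sum(a == "IDK" for a in doc_answers_to_summary_questions)
    return 100 - 100 * idk / len(doc_answers_to_summary_questions)
```

For example, `conformity_score([("YES", "YES"), ("YES", "NO")])` returns 50.0, since one of the two questions is answered in opposite ways.
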
## Implementation

Command:
```
python run.py -m model=llama3.1-405b-local samples=10 num_questions=5
```
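
The `-m` flag and `key=value` overrides suggest a Hydra-style CLI; if that holds, a multirun sweep over several values (the settings shown here are hypothetical) might look like:
```
python run.py -m model=llama3.1-405b-local samples=10 num_questions=3,5,10
```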

Scripts:
```
cd ./open-nlp/lib/questions_eval
bash/experiments/super_tiny.sh
```

## References
- [SemScore: Evaluating LLMs with Semantic Similarity](https://huggingface.co/blog/g-ronimo/semscore)
- [MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications](https://arxiv.org/pdf/2409.07314)

## Contributors
- [@simonmeoni](https://github.com/simonmeoni)
- [@honghanhh](https://github.com/honghanhh)