diff --git a/lib/questions_eval/README.md b/lib/questions_eval/README.md
new file mode 100644
index 0000000..fd2bb71
--- /dev/null
+++ b/lib/questions_eval/README.md
@@ -0,0 +1,38 @@
+# How to evaluate the performance of generated synthetic data?
+
+## Datasets
+
+@Simon updating ...
+
+## Metrics
+
+| Term | Definition | Formula | Interpretation |
+| ----------------- | --------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------- |
+| Coverage score | How comprehensively the summary covers the content of the original document. | 100 − X, where X is the percentage of document-generated questions that receive an "IDK" ("I Don’t Know") response based on the summary. | A higher coverage score indicates that the summary captures more of the original details and is less generic. |
+| Conformity score | Whether the summary avoids contradicting the document. | 100 − X, where X is the percentage of questions for which the summary’s answer is "NO" while the document’s is "YES", or vice versa. | A higher conformity score signifies greater alignment between the summary and the document. |
+| Consistency score | The degree of non-hallucination, based on the accuracy of factual information in the summary compared to the document. | 100 − X, where X is the percentage of summary-derived questions that receive an "IDK" response based on the document, indicating factual discrepancies. | A higher consistency score suggests that the summary is more factual and contains fewer inaccuracies or fabrications. |
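+
+All three scores share the same shape: count the disqualifying judgments and subtract their percentage from 100. Below is a minimal sketch of that computation, assuming each judgment is recorded as a "YES"/"NO"/"IDK" string; the function names are illustrative, not this repository’s actual API.
+
+```
+# Sketch of the three scores. Each judgment is a model's answer to one
+# generated question: "YES", "NO", or "IDK".
+
+def pct(count: int, total: int) -> float:
+    return 100.0 * count / total if total else 0.0
+
+def coverage(summary_answers: list[str]) -> float:
+    """100 - % of document-generated questions answered "IDK" from the summary."""
+    return 100.0 - pct(sum(a == "IDK" for a in summary_answers), len(summary_answers))
+
+def conformity(summary_answers: list[str], document_answers: list[str]) -> float:
+    """100 - % of questions where the summary and document give opposite YES/NO answers."""
+    contradictions = sum(
+        {s, d} == {"YES", "NO"} for s, d in zip(summary_answers, document_answers)
+    )
+    return 100.0 - pct(contradictions, len(summary_answers))
+
+def consistency(document_answers: list[str]) -> float:
+    """100 - % of summary-derived questions answered "IDK" from the document."""
+    return 100.0 - pct(sum(a == "IDK" for a in document_answers), len(document_answers))
+
+# Example: one "IDK" out of three summary answers -> coverage of ~66.67
+# coverage(["YES", "IDK", "NO"])
+```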
+
+## Implementation
+
+Command:
+
+```
+python run.py -m model=llama3.1-405b-local samples=10 num_questions=5
+```
+
+Scripts:
+
+```
+cd ./open-nlp/lib/questions_eval
+bash experiments/super_tiny.sh
+```
+
+## References
+
+- [SemScore: Evaluating LLMs with Semantic Similarity](https://huggingface.co/blog/g-ronimo/semscore)
+- [MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications](https://arxiv.org/pdf/2409.07314)
+
+## Contributors
+
+- [@simonmeoni](https://github.com/simonmeoni)
+- [@honghanhh](https://github.com/honghanhh)