- Add RPPS pattern to the `pseudonymisation-rules` pipeline
- Added `eds_pseudo.dates_normalizer` to parse ML-detected dates and extract their value and format.
- Support empty `doc._.context` field
- Update EDS-NLP to v0.10.7:
  - fix some issues with jsonl loading
  - more transformer overriding options
  - fix out-of-memory issues (auto split transformer input depending on the available memory)
  - fix some multiprocessing deadlock issues
  - add chunk sorting option to the lazy collection `set_processing` method
- Replace `gen_dataset/train.jsonl` with the original fictitious templates and the dataset generation script.
- Update the README with the instructions to download the public pre-trained model.
- Improve packaging to add evaluation results to the model's meta field and packaged model README (for HF)
- Refactoring and fixes to use edsnlp instead of spaCy.
- Renamed `eds_pseudonymisation` to `eds_pseudo` and default model name to `eds_pseudo_aphp`.
- Renamed `pipelines` to `pipes`
- New `scripts/train.py` script to train the model
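
To illustrate what "extract their value and format" means in the `eds_pseudo.dates_normalizer` entry above, here is a minimal sketch using only the Python standard library. It is not the component's implementation: the `normalize_date` helper and the `CANDIDATE_FORMATS` list are hypothetical names chosen for this example.

```python
from datetime import date, datetime
from typing import Optional, Tuple

# Hypothetical candidate formats, tried in order.
# This is an assumption for the example, not the list used by eds_pseudo.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d", "%d/%m/%y")


def normalize_date(text: str) -> Optional[Tuple[date, str]]:
    """Return (value, format) for the first format that parses `text`, else None."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date(), fmt
        except ValueError:
            # This format does not match; try the next one.
            continue
    return None
```

Keeping the format alongside the value means a date can later be rendered in its original form, for example with `value.strftime(fmt)`.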
Some fixes to enable training the model:
- committed the missing script `infer.py`
- changed config default BERT model to `camembert-base`
- put `config.cfg` as a dependency, not params
- default to CPU training
- allow for missing metadata (i.e. OMOP's `note_class_source_value`)
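
As a rough illustration of the last item, tolerating a missing metadata field usually comes down to reading it with a fallback rather than assuming it is present. The `get_note_class` helper below is hypothetical and is not taken from the project's code. Only the `note_class_source_value` key comes from the entry above.

```python
from typing import Any, Mapping, Optional


def get_note_class(
    metadata: Optional[Mapping[str, Any]],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return `note_class_source_value` if present, otherwise `default`."""
    if not metadata:
        # Covers both a missing (None) and an empty metadata mapping.
        return default
    return metadata.get("note_class_source_value", default)
```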
Many fixes alongside the publication of our article:
- Tests for the rule-based components
- Code documentation and cleaning
- Experiment and analysis scripts
- Charts and tables in the Results page of our documentation
Inception! 🎉
- spaCy project for pseudonymisation
- Pseudonymisation-specific pipelines:
  - `pseudonymisation-rules` for rule-based pseudonymisation
  - `pseudonymisation-dates` for date detection and normalisation
  - `structured-data-matcher` for structured data detection (e.g. first and last name, available in the information system)
- Evaluation methodology