EDS-Pseudo

The EDS-Pseudo project aims to detect identifying entities in clinical documents. It was primarily tested on clinical reports from AP-HP's Clinical Data Warehouse (EDS).

The model is built on top of edsnlp. It is a hybrid model (rule-based + deep learning), for which we provide the rules (eds_pseudo/pipes) and a training recipe (train.py).

We also provide some fictitious templates (templates.txt) and a script to generate a synthetic dataset (generate_dataset.py).
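
To give an idea of how template-based generation can work, here is a minimal sketch. It does not reproduce the logic of generate_dataset.py: the `{NOM}`-style placeholder syntax, the `fill_template` helper and the value lists are all hypothetical. It fills a template with fictitious values and records the character offsets of each inserted entity, which is the kind of annotation a named entity recognition model is trained on.

```python
import random
import re

# Hypothetical fictitious values, keyed by entity label (illustration only).
FAKE_VALUES = {
    "PRENOM": ["Jeanne", "Paul"],
    "NOM": ["Martin", "Durand"],
    "VILLE": ["Paris", "Lyon"],
}

# Hypothetical placeholder syntax: {LABEL}
PLACEHOLDER = re.compile(r"\{([A-Z_]+)\}")


def fill_template(template, rng):
    """Replace each {LABEL} placeholder and return (text, entities).

    Entities are (start, end, label) character offsets into the output text.
    """
    parts, entities, cursor, out_len = [], [], 0, 0
    for match in PLACEHOLDER.finditer(template):
        literal = template[cursor:match.start()]
        value = rng.choice(FAKE_VALUES[match.group(1)])
        parts.append(literal)
        out_len += len(literal)
        entities.append((out_len, out_len + len(value), match.group(1)))
        parts.append(value)
        out_len += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), entities


text, entities = fill_template(
    "Patient : {PRENOM} {NOM}, domicilié à {VILLE}.", random.Random(0)
)
for start, end, label in entities:
    print(text[start:end], label)
```

Passing a seeded `random.Random` keeps the generated documents reproducible from one run to the next.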

The entities that are detected are listed below.

| Label | Description |
|-------|-------------|
| `ADRESSE` | Street address, e.g. `33 boulevard de Picpus` |
| `DATE` | Any absolute date other than a birthdate |
| `DATE_NAISSANCE` | Birthdate |
| `HOPITAL` | Hospital name, e.g. `Hôpital Rothschild` |
| `IPP` | Internal AP-HP identifier for patients, displayed as a number |
| `MAIL` | Email address |
| `NDA` | Internal AP-HP identifier for visits, displayed as a number |
| `NOM` | Any last name (patients, doctors, third parties) |
| `PRENOM` | Any first name (patients, doctors, etc.) |
| `SECU` | Social security number |
| `TEL` | Any phone number |
| `VILLE` | Any city |
| `ZIP` | Any zip code |

Downloading the public pre-trained model

The public pretrained model is available on the HuggingFace model hub at AP-HP/eds-pseudo-public and was trained on synthetic data (see generate_dataset.py). You can also test it directly on the demo.

Install the latest version of edsnlp
```
pip install "edsnlp[ml]" -U
```
Get access to the model at AP-HP/eds-pseudo-public
Create and copy a Hugging Face token at https://huggingface.co/settings/tokens?new_token=true

Register the token (only once) on your machine

import huggingface_hub

huggingface_hub.login(token=YOUR_TOKEN, new_session=False, add_to_git_credential=True)

Load the model

import edsnlp

nlp = edsnlp.load("AP-HP/eds-pseudo-public", auto_update=True)
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_, str(ent._.date))

To apply the model on many documents using one or more GPUs, refer to the documentation of edsnlp.
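
Once entities are detected, a common next step is to replace them in the text. The sketch below is an illustration rather than a feature of eds-pseudo: `mask_entities` is a hypothetical helper that takes character-offset spans and substitutes each one with its label. It assumes the spans do not overlap. With a loaded pipeline, the spans could be built from `doc.ents` through the standard spaCy span attributes `start_char`, `end_char` and `label_`.

```python
def mask_entities(text, spans):
    """Replace each (start, end, label) span of `text` with "<LABEL>".

    Spans are character offsets and are assumed not to overlap.
    """
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "M. Myriel habite Digne depuis 2006."
# Hand-written offsets for the example; with a loaded pipeline one could use
# [(e.start_char, e.end_char, e.label_) for e in doc.ents] instead.
spans = [(3, 9, "NOM"), (17, 22, "VILLE"), (30, 34, "DATE")]
print(mask_entities(text, spans))
# M. <NOM> habite <VILLE> depuis <DATE>.
```

Replacing with the label is the simplest option; substituting plausible fictitious values instead would follow the same loop.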

Installation to reproduce

If you'd like to reproduce eds-pseudo's training or contribute to its development, you should first clone it:

git clone https://github.com/aphp/eds-pseudo.git
cd eds-pseudo

Then install the dependencies. We recommend pinning the library version in your projects, or using a strict package manager like Poetry.

poetry install

How to use without machine learning

import edsnlp

nlp = edsnlp.blank("eds")

# Some text cleaning
nlp.add_pipe("eds.normalizer")

# Various simple rules
nlp.add_pipe(
    "eds_pseudo.simple_rules",
    config={"pattern_keys": ["TEL", "MAIL", "SECU", "PERSON"]},
)

# Address detection
nlp.add_pipe("eds_pseudo.addresses")

# Date detection
nlp.add_pipe("eds_pseudo.dates")

# Contextual rules (requires a dict of info about the patient)
nlp.add_pipe("eds_pseudo.context")

# Apply it to a text
doc = nlp(
    "En 2015, M. Charles-François-Bienvenu "
    "Myriel était évêque de Digne. C’était un vieillard "
    "d’environ soixante-quinze ans ; il occupait le "
    "siège de Digne depuis 2006."
)

for ent in doc.ents:
    print(ent, ent.label_)

# 2015 DATE
# Charles-François-Bienvenu NOM
# Myriel PRENOM
# 2006 DATE
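
For intuition about what a rule-based component does, here is a standalone sketch in plain Python. The two regular expressions are simplified assumptions written for this example, not the patterns shipped in eds_pseudo, and `find_simple_entities` is a hypothetical helper.

```python
import re

# Simplified, illustrative patterns (assumptions, not the eds_pseudo rules).
PATTERNS = {
    # Something like name@domain.tld
    "MAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # French-style number: 0 followed by 9 digits, optionally grouped by two
    "TEL": re.compile(r"\b0\d(?:[ .-]?\d{2}){4}\b"),
}


def find_simple_entities(text):
    """Return sorted (start, end, label) matches for every pattern."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label))
    return sorted(found)


text = "Joindre le secrétariat au 01 23 45 67 89 ou à contact@example.org."
for start, end, label in find_simple_entities(text):
    print(text[start:end], label)
```

Rules like these are precise on well-formatted identifiers but miss unusual spellings, which is one reason to combine them with a learned model.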

How to train

Before training a model, you should update the configs/config.cfg and pyproject.toml files to fit your needs.

Put your data in the data/dataset folder (or edit the paths in the configs/config.cfg file to point to data/gen_dataset/train.jsonl).

Then run the training script:

python scripts/train.py --config configs/config.cfg --seed 43

This will train a model and save it in artifacts/model-last. You can evaluate it on the test set (defaults to data/dataset/test.jsonl) with:

python scripts/evaluate.py --config configs/config.cfg
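
As a reminder of what entity-level scores mean, the sketch below computes exact-match precision, recall and F1 from sets of `(start, end, label)` spans. It is a generic illustration with a hypothetical `exact_match_scores` helper, and it does not claim to reproduce the metrics of scripts/evaluate.py, which may be computed differently (for instance at the token level).

```python
def exact_match_scores(gold, pred):
    """Precision, recall and F1 over exact (start, end, label) matches."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up spans: the second prediction has the right offsets, wrong label.
gold = [(3, 9, "NOM"), (17, 22, "VILLE"), (30, 34, "DATE")]
pred = [(3, 9, "NOM"), (17, 22, "HOPITAL"), (30, 34, "DATE")]
print(exact_match_scores(gold, pred))
```

For pseudonymization, recall usually matters most, since a missed entity leaves identifying information in the document.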

To package it, run:

python scripts/package.py

This will create a dist/eds-pseudo-aphp-***.whl file that you can install with pip install dist/eds-pseudo-aphp-***.whl.

You can use it in your code:

import edsnlp

# Either from the model path directly
nlp = edsnlp.load("artifacts/model-last")

# Or from the wheel file
import eds_pseudo_aphp

nlp = eds_pseudo_aphp.load()

Documentation

Visit the documentation for more information!

Publication

Please find our publication at the following link: https://doi.org/mkfv.

If you use EDS-Pseudo, please cite us as below:

@article{eds_pseudo,
  title={Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse},
  author={Tannier, Xavier and Wajsb{\"u}rt, Perceval and Calliger, Alice and Dura, Basile and Mouchet, Alexandre and Hilka, Martin and Bey, Romain},
  journal={Methods of Information in Medicine},
  year={2024},
  publisher={Georg Thieme Verlag KG}
}

Acknowledgement

We would like to thank Assistance Publique – Hôpitaux de Paris and the AP-HP Foundation for funding this project.
