medGAN is a generative adversarial network for generating multi-label discrete patient records. It can generate both binary and count variables (i.e. medical codes such as diagnosis codes, medication codes or procedure codes).
medGAN implements the algorithm introduced in the following paper:
Generating Multi-label Discrete Patient Records using Generative Adversarial Networks
Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, Jimeng Sun
Machine Learning for Healthcare (MLHC) 2017
This code trains a generative adversarial network to generate patient records. It currently handles patient records that are aggregated over time, represented as a matrix in which each row corresponds to a patient and each column to a specific medical code (e.g. diagnosis code, medication code, or procedure code). Each matrix entry is either binary (i.e. whether a specific medical code occurred in the longitudinal patient record) or a count (i.e. how many times a specific medical code occurred in the longitudinal patient record).
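The matrix representation described above can be illustrated with a small sketch. The function name `build_matrix` and the toy patient records are hypothetical and are not part of the medGAN code; the sketch only shows how the same aggregated records yield either a binary or a count matrix.

```python
from collections import Counter


def build_matrix(patient_codes, data_type="binary"):
    """Aggregate each patient's codes into one row of a patient-by-code matrix.

    Hypothetical helper for illustration only; not taken from medGAN.
    """
    if data_type not in ("binary", "count"):
        raise ValueError("data_type must be 'binary' or 'count'")

    # Give every distinct code a column index (sorted for reproducibility).
    all_codes = sorted({code for codes in patient_codes for code in codes})
    code_to_col = {code: i for i, code in enumerate(all_codes)}

    matrix = []
    for codes in patient_codes:
        row = [0] * len(code_to_col)
        for code, n in Counter(codes).items():
            # Binary: did the code occur at all? Count: how many times?
            row[code_to_col[code]] = 1 if data_type == "binary" else n
        matrix.append(row)
    return matrix, code_to_col


# Toy records (made up): all codes seen across each patient's visits.
patients = [
    ["401", "250", "401"],
    ["250"],
    ["428", "401"],
]

binary_matrix, columns = build_matrix(patients, "binary")
count_matrix, _ = build_matrix(patients, "count")

print(columns)        # column index assigned to each code
print(binary_matrix)  # one row per patient, entries in {0, 1}
print(count_matrix)   # one row per patient, entries are occurrence counts
```

The first patient has code "401" twice, so that entry is 1 in the binary matrix and 2 in the count matrix; all other entries agree between the two.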
STEP 1: Installation
- medGAN was implemented to run on TensorFlow 1.2. TensorFlow can be easily installed in Ubuntu as suggested here
- Download/clone the medGAN code
STEP 2: Fast way to test medGAN with MIMIC-III
This step describes how to train medGAN on MIMIC-III in the minimum number of steps.
- You will first need to request access to MIMIC-III, a publicly available dataset of electronic health records collected from ICU patients over 11 years.
- You can use "process_mimic.py" to process the MIMIC-III dataset and generate a suitable training dataset for medGAN. Place the script in the same location as the MIMIC-III CSV files and run it. The execution command is:
    python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file> <"binary"|"count">
  The last argument decides whether you construct a binary matrix or a count matrix. The above command extracts ICD9 diagnosis codes from MIMIC-III. Note that this script uses only the first 3 digits of each ICD9 diagnosis code. If you want to use all 5 digits, please see the source code of "process_mimic.py".
- Run medGAN using the ".matrix" file generated by process_mimic.py. The command is:
    python medgan.py <matrix file> <output path> --data_type=<"binary"|"count">
- After training, if you want to generate synthetic records, use this command:
    python medgan.py <matrix file> <generated output path> --model_file=<trained output path> --generate_data=True
  Note that <matrix file> is not actually used for generating synthetic records, so it is just a dummy input.
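The three-digit truncation mentioned in the preprocessing step can be sketched as follows. This is an illustration under stated assumptions, not necessarily the exact logic in "process_mimic.py": the function name `to_3digit_icd9` is hypothetical, and the codes are assumed to be stored without a decimal point (e.g. "4019" rather than "401.9"). E codes are treated separately because their category is four characters long ("E" plus three digits).

```python
def to_3digit_icd9(code):
    """Truncate an undotted ICD9 diagnosis code to its category.

    Hypothetical helper for illustration; assumes codes have no decimal point.
    """
    code = code.strip()
    # E codes (external causes) have a four-character category: 'E' + 3 digits.
    if code.startswith("E"):
        return code[:4]
    # Numeric codes and V codes have a three-character category.
    return code[:3]


examples = ["4019", "25000", "V3001", "E8497", "486"]
truncated = [to_3digit_icd9(c) for c in examples]
print(truncated)  # ['401', '250', 'V30', 'E849', '486']
```

Truncating codes this way merges closely related diagnoses into one column, which reduces the number of columns in the training matrix at the cost of finer clinical detail.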