Source code for Generative Image Translation for Data Augmentation in Colorectal Histopathology Images
Jerry Wei, Arief Suriawinata, Louis Vaickus, Bing Ren, Xiaoying Liu, Jason Wei, Saeed Hassanpour
- Numpy 1.15.2
- PyTorch 0.4.1
- Torchvision 0.2.1
- SciPy 1.3.0
- Seaborn 0.9.0
- Matplotlib 3.0.0
- Pandas 0.23.4
- OpenCV 3.4.2
- Scikit-Image 0.14.0
- Scikit-Learn 0.20.0
- Pillow 6.0.0
- Tensorflow-GPU 1.4.0
- Accuracy Testing - all code used to analyze images (e.g. calculate accuracy, filter by confidence)
- CycleGAN - all code used to train CycleGAN models. Original implementation from xhujoy.
- DCGAN - all code used to train DCGAN models. Original implementation from carpedm20.
- DiscoGAN - all code used to train DiscoGAN models. Original implementation from carpedm20.
- ResNet - all code used to train ResNet classifier models. Original implementation from BMIRDS.
A. Training CycleGAN
- Make a datasets/class1TOclass2/ folder
  - Subfolders: trainA (training images for class #1), trainB (training images for class #2), testA (original class #1 images that will be used to generate fake class #2 images), testB (original class #2 images that will be used to generate fake class #1 images)
- Run CycleGAN/main.py and specify options with argparse (look at main.py for details about parameters); --phase should be "train"
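The CycleGAN folder layout described above can be created with a few lines of Python. This is only a minimal sketch and is not part of the repository: the helper name make_cyclegan_dirs and its root argument are assumptions, while the folder names datasets/class1TOclass2/, trainA, trainB, testA, and testB come from the steps above.

```python
from pathlib import Path

# Subfolder names taken from the CycleGAN training instructions.
SUBFOLDERS = ("trainA", "trainB", "testA", "testB")


def make_cyclegan_dirs(root=".", pair_name="class1TOclass2"):
    """Create <root>/datasets/<pair_name>/{trainA,trainB,testA,testB}.

    Hypothetical helper for illustration; existing folders are left untouched.
    """
    base = Path(root) / "datasets" / pair_name
    for name in SUBFOLDERS:
        (base / name).mkdir(parents=True, exist_ok=True)
    return base
```

Calling make_cyclegan_dirs() from the directory that should hold datasets/ creates the four empty subfolders, which can then be filled with images for each class.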
B. Training DCGAN
- Make a data/ folder
  - Subfolder: class1 (training images for class #1). Since DCGAN uses random noise for generation, no other folders are needed.
- Run DCGAN/main.py and specify options in code; "train" should be set to True
C. Training DiscoGAN
- Make a data/class1TOclass2/ folder
  - Subfolders: class1 (training images for class #1), class2 (training images from class #2)
- Run DiscoGAN/main.py and specify options in DiscoGAN/config.py; "is_train" should be set to True
D. Training with Path-Rank-Filter from our paper
- Use Accuracy Testing/filter.py to get a folder of most confident images (Note: pretrain a ResNet classifier on your dataset in order to do this)
- Use that folder as your new training folder
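Selecting the most confident images amounts to sorting images by classifier confidence and keeping the top entries. The sketch below illustrates that idea only and is not the code in filter.py. The function names, the confidences mapping (image file name to the classifier's confidence for the target class), and the folder arguments are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def keep_most_confident(confidences, top_fraction):
    """Return the image names with the highest confidence.

    confidences: dict mapping image name -> confidence score. The scores are
        assumed to come from a pretrained classifier; they are not computed here.
    top_fraction: fraction of images to keep, e.g. 0.25 for the top 25%.
    """
    n_keep = int(top_fraction * len(confidences))
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    return ranked[:n_keep]


def copy_selected(names, src_folder, dst_folder):
    """Copy the selected images into a new training folder (hypothetical helper)."""
    dst = Path(dst_folder)
    dst.mkdir(parents=True, exist_ok=True)
    for name in names:
        shutil.copy(Path(src_folder) / name, dst / name)
```

The folder written by copy_selected would then serve as the new training folder in the second step above.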
A. Using CycleGAN
- Run CycleGAN/main.py and specify options with argparse; --phase should be "test"
- Generated images can be viewed in CycleGAN/test/*.jpg
B. Using DCGAN
- Run DCGAN/main.py and specify options; "train" should be False, "visualize" should be True
- Generated images can be viewed in DCGAN/out/data - class1/samples/
C. Using DiscoGAN
- Run DiscoGAN/main.py and specify options in DiscoGAN/config.py; "is_train" should be False
- Generated images can be viewed in DiscoGAN/logs/class1TOclass2_timestamp/test/
A. Data preparation
- ResNet training requires train_folder/train/class1/, train_folder/train/class2/, train_folder/val/class1, train_folder/val/class2
- Move generated images and real images into respective training and validation folders
B. Train ResNet
- Run ResNet/3_train.py and specify options (e.g. number of layers) in ResNet/config.py
- Models will be saved in ResNet/checkpoints/
A. Prepare folder
- Place testing folders for each class in a folder of the same class name (e.g. testing/class1/class1/ and testing/class2/class2)
- Run Accuracy Testing/overall_accuracy/ and specify options inside code file
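Because the nested class1/class1/ layout is easy to get wrong, it can help to check it before running the accuracy scripts. The following is a minimal sketch under stated assumptions: missing_class_folders is a hypothetical helper that is not part of the repository, and it only checks that the folders exist, not what they contain.

```python
from pathlib import Path


def missing_class_folders(test_root, class_names):
    """Return the expected <test_root>/<class>/<class>/ folders that do not exist.

    Hypothetical helper for illustration. An empty list means every class
    has its folder nested inside a folder of the same name.
    """
    root = Path(test_root)
    expected = [root / name / name for name in class_names]
    return [str(path) for path in expected if not path.is_dir()]
```

For the example above, missing_class_folders("testing", ["class1", "class2"]) would return an empty list once both testing/class1/class1/ and testing/class2/class2/ exist.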
- accuracy_tester.py: will return the accuracy of a multiclass model given a model path and a folder to test on
  - model path needs to be the direct file path to the model you want to test (e.g. models/resnet18.pt)
  - The folder that you are testing on should include each class in a subfolder of the same name (e.g. if folder_to_test_on = "val" and you are testing on class1 and class2, val/class1/class1/ and val/class2/class2/ must exist)
  - Edit parameters inside code file
- compare_images.py: will combine images of the same name from input folders next to each other
  - input folders should be a list of paths to folders containing images (e.g. test/class1/ and test/class2); the order of the folders will also be the order of the images
  - Images with the same name will automatically be joined with a black border
  - output folder is the folder that will be created to save the combined images
  - Edit parameters inside code file
- compress.py: contains various functions that can be used to change images
  - input folder will contain images that you want to modify (e.g. images/a.jpg)
  - Automatically removes duplicate images (an image name that has "dup" in it)
  - Other parameters are explained in code
  - Edit parameters using argparse when running code (e.g. python compress.py --compress=True)
- dataset_stats.py: calculates statistics for your dataset (e.g. image area, image side lengths, image sizes)
  - input folders should be a list of paths to folders containing images (e.g. simple_crops/*.jpg)
  - Each input folder should be a different class; the code calculates statistics for each folder you input
  - Replace each instance of our class names with your own class names
  - Edit parameters inside code file
- filter.py: saves the top n% of images by model confidence in a separate folder
  - model path needs to be the direct file path to the model you want to test (e.g. models/resnet18.pt)
  - The folder that you are testing on should include each class in a subfolder of the same name (e.g. if folder_to_test_on = "val" and you are testing on class1 and class2, val/class1/class1/ and val/class2/class2/ must exist)
  - Class to use is the name of the class that the input folder's images are of. Should match the name of the class you used when training your classifier model
  - Top image is the number of images to save; to save the top 25% of images, set top image to 0.25 * total_number_of_images
  - Other parameters are explained in code
  - Edit parameters using argparse when running code (e.g. python filter.py --top_image=12)
- generate_images.py: can be used if augmented data is still not enough. This function was not tested in the paper. Will use the left half of one image and the right half of another image to make a technically new image
  - Input folder is the path to a folder containing images you want to generate more data from
  - Other parameters are explained in code
  - Edit parameters using argparse when running code
- Image_Class.py: class file for any image. Used in other code files
- overall_accuracy.py: tests multiple models on multiple classes. Plots ROC curves for each model if specified.
  - Input folders: list of folders, each folder contains images of a separate class. Class should be in a subfolder of the same name (e.g. if folder_to_test_on = "val" and you are testing on class1 and class2, val/class1/class1/ and val/class2/class2/ must exist)
  - model paths should be the direct path to each model that will be tested. The order of the models should correspond with the label for plotting
  - Replace our classes with your classes
  - Edit parameters inside code file
- rename_images.py: anonymizes images by renaming them with numbers.
  - Parameters explained in code
  - Edit parameters using argparse when running code
- Turing_Analysis.py: analyzes a CSV with pathologists' predictions vs true labels
  - CSV file should have these columns in this order: "Original Image Name", "New Name", "Real/Fake", "Pathologist Prediction"
- Turing_Test.py: generates a Turing test for pathologists
  - Also has an operation where you can get N random images from a folder
  - Other parameters explained in code
  - Edit parameters using argparse when running code
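For the CSV format that Turing_Analysis.py expects, the kind of summary such an analysis might produce can be sketched with the standard csv module. This is not the code in Turing_Analysis.py. It assumes that the "Real/Fake" and "Pathologist Prediction" columns hold matching label strings such as "Real" or "Fake", which the description above does not specify, and the rows in the sample are made-up placeholders rather than results.

```python
import csv
import io


def pathologist_accuracy(csv_file):
    """Fraction of rows where the prediction matches the true label.

    Assumes both columns contain the same label vocabulary (e.g. "Real"/"Fake");
    this is an assumption for illustration, not a documented format.
    """
    total = 0
    correct = 0
    for row in csv.DictReader(csv_file):
        total += 1
        truth = row["Real/Fake"].strip().lower()
        guess = row["Pathologist Prediction"].strip().lower()
        if truth == guess:
            correct += 1
    return correct / total if total else 0.0


# Placeholder rows only; not data from the paper.
sample = io.StringIO(
    "Original Image Name,New Name,Real/Fake,Pathologist Prediction\n"
    "img_a.jpg,1.jpg,Real,Real\n"
    "img_b.jpg,2.jpg,Fake,Real\n"
    "img_c.jpg,3.jpg,Fake,Fake\n"
    "img_d.jpg,4.jpg,Real,Fake\n"
)
print(pathologist_accuracy(sample))
```

With a real file, the same function could be called as pathologist_accuracy(open("results.csv", newline="")), where results.csv is a hypothetical file name.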
HistoGAN is an open-source library and is licensed under the GNU General Public License (v3). If you are using this library please cite:
Jerry Wei, Arief Suriawinata, Louis Vaickus, Bing Ren, Xiaoying Liu, Jason Wei, Saeed Hassanpour, "Generative Image Translation for Data Augmentation in Colorectal Histopathology Images", Proceedings of the Machine Learning for Health NeurIPS Workshop, PMLR 116:10-24, 2020.