Identifying the Best Image Classification Algorithm for COVID-19 Diagnosis with a Small, Imbalanced Chest X-Ray Dataset
SUBMITTED TO SYNOPSYS CHAMPIONSHIP, BIOGENEIUS CHALLENGE, AGU FALL MEETING (Published Dec 2021), YOUNG SCIENTISTS JOURNAL (Published Nov 2021)
Currently, COVID-19 diagnosis relies heavily on nasopharyngeal swabs; however, the accuracy of swab testing, reported at up to 73%, can be influenced by the severity of the disease and the time since symptom onset. Because chest X-rays are widely available and have comparatively lower contamination rates, they can be used to identify COVID-19 in a patient. With more than 82 million COVID-19 cases worldwide, automated chest radiograph interpretation could provide substantial benefit for efficient and accurate diagnosis of COVID-19 patients. The objective of this project is to train several families of deep neural networks on small, imbalanced chest X-ray datasets to automate the diagnosis of respiratory illnesses. Specifically, the project identifies the best algorithm for classifying anonymized chest X-ray images into three classes: healthy, COVID-19, and non-COVID pneumonia. Three families of neural networks that represent state-of-the-art image classification architectures are analyzed: DenseNet, EfficientNet, and ResNet. Precision and recall are the main metrics used to evaluate performance. The first improvement to the predictive power of the networks is pretraining them on the ChestX-ray14 database generated by the NIH. Pretraining on a domain-specific dataset, in this case chest X-ray images, yields weights that are already adapted to the task at hand, which significantly improves performance during transfer learning. The second improvement addresses the scarcity of COVID-19 chest X-ray images. Publicly available chest X-ray datasets are not abundant, and ground-truth data for COVID-19 diagnosis is especially hard to come by. To address the resulting class imbalance in the training data, two alternative methods are implemented to customize the data sampling configuration.
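Because precision and recall are reported per class in a three-class problem, it may help to see how they can be computed one class at a time. The sketch below is illustrative only and is not the project's evaluation code: the label strings and the function name `per_class_precision_recall` are assumptions, and a class with no predictions or no true examples is arbitrarily given a score of 0.0.

```python
from collections import Counter

# Illustrative label strings for the three classes named in the abstract.
CLASSES = ("healthy", "covid19", "pneumonia")


def per_class_precision_recall(y_true, y_pred, classes=CLASSES):
    """Return {class: (precision, recall)} using one-vs-rest counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    scores = {}
    for c in classes:
        predicted = tp[c] + fp[c]  # how often the model said "c"
        actual = tp[c] + fn[c]     # how often "c" was the true label
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        scores[c] = (precision, recall)
    return scores
```

For the COVID-19 class, recall is the fraction of true COVID-19 images the model catches, and precision is the fraction of its COVID-19 predictions that are correct; with few positive images, both can shift noticeably from a single misclassification.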
Based on extensive experimentation with different combinations of pretraining approach, data sampling method, and neural network architecture, the configuration that pretrains on the ChestX-ray14 database, uses the fixed-fraction-per-batch sampling method, and trains a network from the DenseNet family was identified as having the highest recall and precision for COVID-19 chest X-ray images.
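The abstract does not spell out how fixed-fraction-per-batch sampling works, so the following is only one plausible reading: every mini-batch holds a preset share of each class, and a small class is sampled with replacement so that it can always fill its quota. The function name, its parameters, and the choice to let the last listed class absorb rounding are all assumptions made for illustration.

```python
import random


def fixed_fraction_batches(indices_by_class, fractions, batch_size,
                           num_batches, seed=0):
    """Yield batches of dataset indices with a fixed share per class.

    indices_by_class: {class_name: [dataset indices]}  (hypothetical layout)
    fractions:        {class_name: share of each batch}, summing to 1
    """
    rng = random.Random(seed)
    classes = list(fractions)

    # Turn fractions into per-class counts; the last class absorbs rounding.
    counts = {c: round(fractions[c] * batch_size) for c in classes[:-1]}
    counts[classes[-1]] = batch_size - sum(counts.values())
    if min(counts.values()) < 0:
        raise ValueError("fractions do not fit into batch_size")

    for _ in range(num_batches):
        batch = []
        for c in classes:
            # Sampling with replacement lets a scarce class fill its quota.
            batch.extend(rng.choices(indices_by_class[c], k=counts[c]))
        rng.shuffle(batch)  # avoid a fixed class order inside the batch
        yield batch
```

Under this reading, the scarce COVID-19 class appears in every batch at a controlled rate, at the cost of its images being repeated far more often than images from the larger classes.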