The Big Data Boom in Chest Radiography

Mar 8, 2019

Hao-Yu Yang
Deep Learning Scientist

Chest radiography, or Chest X-ray (CXR), is one of the most powerful and commonly used imaging modalities in clinical settings. It is often the first step in diagnosing conditions within the thoracic cavity. With recent rapid advances in computer vision and artificial intelligence, automated disease detection from CXRs has drawn massive attention. In this blog post, we take a detailed look at two of the largest public Chest X-ray datasets, as well as future trends in deep learning for this application.

It is difficult to discuss deep learning and Chest X-rays without mentioning CheXNet. The name comes from the Stanford paper “CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning”. CheXNet is a Convolutional Neural Network (CNN)-based model that detects fourteen common thoracic pathologies in a Chest X-ray image. Since then, many have adopted “CheXNet” as a general name for applications of neural networks to Chest X-ray diagnosis.

In this data-driven era, data plays a role just as crucial as the modeling itself. CheXNet was trained on Chest X-ray 14 (CXR-14) from the NIH Clinical Center, one of the first large-scale, publicly available Chest X-ray datasets. It contains a total of 108,948 frontal-view Chest X-ray images from 32,717 unique patients. The dataset includes 14 thoracic pathology labels obtained by text mining clinical reports. The text mining tool, NegBio, extracts these 14 frequent chest radiography observations from free-form radiology reports.

Example of label extraction from semi-structured radiology report
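
To make the idea of text-mined labels concrete, here is a minimal Python sketch of keyword matching with a simple negation check. It is not NegBio's actual method; the keyword lists, the negation cues, and the function name extract_labels are all assumptions made up for this illustration.

```python
import re

# Hypothetical keyword lists for three findings (illustration only).
FINDINGS = {
    "Pneumonia": ["pneumonia"],
    "Effusion": ["effusion"],
    "Cardiomegaly": ["cardiomegaly", "enlarged heart"],
}

# Hypothetical negation cues; a real labeler uses far richer rules.
NEGATION = re.compile(r"\b(no|without|negative for|free of)\b")


def extract_labels(report):
    """Return {finding: 1 or 0} for findings mentioned in a free-text report.

    1 means a non-negated mention was found; 0 means every mention was negated.
    Findings that are never mentioned are left out of the result.
    """
    labels = {}
    # Treat periods, semicolons and newlines as sentence boundaries.
    for sentence in re.split(r"[.;\n]+", report.lower()):
        for name, keywords in FINDINGS.items():
            for keyword in keywords:
                match = re.search(r"\b" + re.escape(keyword) + r"\b", sentence)
                if match is None:
                    continue
                # A mention is negated if a cue appears earlier in the sentence.
                negated = NEGATION.search(sentence[:match.start()]) is not None
                if negated:
                    labels.setdefault(name, 0)
                else:
                    labels[name] = 1  # a positive mention overrides a negated one
    return labels


report = (
    "Findings: Small left pleural effusion. "
    "No evidence of pneumonia. Heart size is normal."
)
print(extract_labels(report))
```

Rules this simple fail easily: a sentence such as "no change in the effusion" would be marked negative even though an effusion is present, which is one way text-mined labels pick up errors.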

Here at CuraCloud, we have developed a set of in-house algorithms for automatically classifying Chest X-rays. Our algorithm is accurate and fast: it requires less than half a second for a single-subject prediction and achieves an average accuracy of 84% across 14 thoracic diseases. The algorithm also highlights the area most indicative of the suspected disease, which can help radiologists or physicians understand the model’s “thought process” leading up to the decision.

Attention heatmap from our in-house CXR detection algorithm
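
How our in-house heatmaps are produced is not covered in this post. As a generic illustration of how such a heatmap can be obtained from a CNN, the sketch below computes a class activation map (CAM): a weighted sum of the last convolutional feature maps, using the classifier weights of one disease class. The function names and the tiny 2x2 feature maps are assumptions for illustration, written in plain Python so the example stays self-contained.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of feature maps, rescaled to the range [0, 1].

    feature_maps: list of K maps, each a list of H rows with W floats.
    class_weights: list of K floats, one weight per feature map.
    """
    if len(feature_maps) != len(class_weights):
        raise ValueError("need exactly one weight per feature map")
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * fmap[i][j]
    lowest = min(min(row) for row in cam)
    highest = max(max(row) for row in cam)
    span = highest - lowest
    if span == 0:
        # A flat map carries no localization information.
        return [[0.0] * width for _ in range(height)]
    return [[(value - lowest) / span for value in row] for row in cam]


def hottest_cell(cam):
    """Return the (row, column) index of the largest heatmap value."""
    cells = ((i, j) for i in range(len(cam)) for j in range(len(cam[0])))
    return max(cells, key=lambda ij: cam[ij[0]][ij[1]])


# Toy input: two 2x2 feature maps and their weights for one class.
toy_maps = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
toy_weights = [0.5, 1.0]
heatmap = class_activation_map(toy_maps, toy_weights)
print(heatmap, hottest_cell(heatmap))
```

In practice the low-resolution map is upsampled to the image size and overlaid on the X-ray, so the brightest region marks where the evidence for that class is strongest.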

Despite the wide adoption of CXR-14 in the AI community, there have been growing concerns regarding the labeling and imaging quality of the dataset. In short, the issues can be summarized as follows:

  • Labeling Method: the accompanying labels are extracted from free-form radiology reports using a Natural Language Processing (NLP) tool. Errors can arise both from the extraction process and from the inherent imprecision of using text to describe images.
  • Image Quality: the pixel intensities of the images provided in CXR-14 range from 0 to 255. However, clinical Chest X-rays typically have intensity levels between 0 and 3000. This loss of information may make some lesions hard to recognize or impossible to identify at all.
  • Sample Overlapping: though the number of images may seem large at first glance, images from patients who underwent more than 10 scans make up nearly half of the dataset. This indicates major overlap between images, and limited subject-level variability may be an issue.
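
The information loss described under Image Quality can be illustrated with a small Python sketch. It assumes a plain linear rescaling from the 0 to 3000 range down to 0 to 255; how the CXR-14 images were actually converted is not stated here, and the function name to_8bit is hypothetical.

```python
def to_8bit(value, source_max=3000):
    """Linearly rescale an integer intensity in [0, source_max] to [0, 255]."""
    if not 0 <= value <= source_max:
        raise ValueError("intensity outside the assumed source range")
    # Integer arithmetic: scale first, then round down.
    return value * 255 // source_max


raw_levels = range(3001)
distinct_8bit = {to_8bit(v) for v in raw_levels}
print(len(raw_levels), "raw levels map to", len(distinct_8bit), "8-bit levels")

# Neighbouring raw intensities can become indistinguishable after rescaling.
print(to_8bit(1500), to_8bit(1505), to_8bit(1510))
```

Under this assumption, raw intensities 1500 and 1505 both become 127, so a subtle contrast between them disappears from the 8-bit image. On average, roughly twelve raw levels collapse into each 8-bit level.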

In January 2019, the Laboratory for Computational Physiology (LCP) released a large, publicly available Chest X-ray database called MIMIC Chest X-ray (MIMIC-CXR). The MIMIC-CXR database contains 224,316 Chest X-ray images from 65,240 different subjects, collected from studies conducted at the Beth Israel Deaconess Medical Center in Boston, MA. On average, each patient has 3.45 studies, which indicates better subject-wise diversity than the CXR-14 dataset. The Chest X-ray images in MIMIC-CXR are labeled using CheXpert (Chest eXpert). One advantage of MIMIC-CXR over CXR-14 is that this expert system is more robust than the CXR-14 labeler, NegBio. Another major improvement is the inclusion of lateral Chest X-rays. In clinical practice, some pathologies are not visible from the frontal view, which makes the lateral view necessary.

CXR-14 and MIMIC-CXR comparison
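
The per-patient figures quoted for both datasets (the share of images from heavily scanned patients and the average number of studies per patient) are straightforward to compute from a list with one patient ID per image. The sketch below is a minimal Python illustration; the function name patient_overlap_stats and the toy IDs are assumptions, not part of either dataset's tooling.

```python
from collections import Counter


def patient_overlap_stats(patient_ids, heavy_threshold=10):
    """Summarize how images are distributed across patients.

    patient_ids: one entry per image, giving the patient it belongs to.
    heavy_threshold: patients with more images than this count as "heavy".
    """
    if not patient_ids:
        raise ValueError("need at least one image")
    counts = Counter(patient_ids)
    total_images = len(patient_ids)
    heavy_images = sum(n for n in counts.values() if n > heavy_threshold)
    return {
        "patients": len(counts),
        "mean_per_patient": total_images / len(counts),
        "heavy_share": heavy_images / total_images,
    }


# Toy data: one patient with 12 images and three patients with few images.
toy_ids = ["a"] * 12 + ["b"] * 2 + ["c"] + ["d"]
print(patient_overlap_stats(toy_ids))
```

In this toy example a single patient contributes three quarters of the images, so a model evaluated on a random image-level split would mostly see patients it has already trained on. Splitting by patient ID instead of by image avoids that leakage.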


In summary, MIMIC-CXR can be thought of as an improved version of CXR-14, with a better labeling method, more comprehensive images, and greater variability across patients. Though it still leaves something to be desired, MIMIC-CXR is a promising step towards integrating medical AI systems into radiologists’ workflows. At CuraCloud, we are developing a new generation of image detection systems based on MIMIC-CXR that is more accurate and includes more features than its predecessor. We are also working on our own NLP tools to extract information from text reports more reliably than CheXpert.


[1] Goldberger AL, Amaral LAN, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220 (2000).
[2] Johnson AEW, Pollard TJ, et al. MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv (2019).
[3] Xiaosong Wang, Yifan Peng, et al. ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. CVPR (2017).
[4] Jeremy Irvin, Pranav Rajpurkar, et al. CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison. arXiv (2019).
[5] Pranav Rajpurkar, Jeremy Irvin, et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv (2017).