The world has been suffering from the COVID-19 pandemic since the beginning of January last year. Today, even after 17 months, the pandemic remains a major global concern. Recently, the development of vaccines at a record pace has brought light at the end of the tunnel. Leading global health practitioners, doctors, executives, and researchers have been advocating the importance of testing, contact tracing, and isolating infected individuals to combat the pandemic. The gold standard for detecting COVID-19 is the reverse transcription polymerase chain reaction (RT-PCR) test. However, substantial research has shown that artificial intelligence-powered medical imaging tools, such as deep learning models, could detect COVID-19 from X-rays and computed tomography (CT) scans of the chest. Such models perform better when larger and more diverse annotated datasets are available. The limited availability of diverse annotated datasets has restricted the performance and generalizability of existing deep learning models.
In this article, we explore publicly available X-ray and CT scan datasets that the research community can use to develop tools to address COVID-19. The data is compiled from the European Institute for Biomedical Imaging Research (EIBIR), the Stanford University Center for Artificial Intelligence in Medicine & Imaging, Kamrul et al., and other publicly available sources. The datasets are listed in the table below and briefly described in the following paragraphs.
| Name/Compiler | Size and Modality | Country |
|---|---|---|
| British Society of Thoracic Imaging | X-rays of 59 patients | UK |
| AIforCOVID imaging | 983 X-rays | Italy |
| COVID-19 Open Initiative | 16352 X-rays, 201103 CT slices, and 12943 ultrasound images | Global |
| Radiopaedia | X-rays and CT scans of 101 patients | Global |
| Eurorad database | X-rays and CT scans of 50 patients | Global |
| BIMCV-COVID19+ Dataset | 2265 X-rays and 163 CT volumes | Spain |
| Società Italiana di Radiologia Medica | X-rays and CT scans of 68 patients | Italy |
| Cohen et al. | 931 X-rays and 20 CT volumes | Global |
| MosMed COVID-19 Chest CT | 110 CT volumes | Russia |
| Zhao et al. | 349 CT slices | Global |
| Coronacases.org | CT scans of 10 patients | China |
| medicalsegmentation.com | 100 CT slices and 9 CT volumes | Global |
| Ma Jun et al. | CT scans of 20 patients | Global |
| Chest CT COVID+ (MIDRC-RICORD-1a) | CT scans of 120 patients | Global |
| Zhang et al. | 90 CT volumes | China |
| Soares, Eduardo et al. | 2482 CT slices | Brazil |
| Yang et al. | 812 CT slices | China |
| iCTCF | 256356 CT slices | China |
This database is collected by the British Society of Thoracic Imaging (BSTI) and consists of medical imaging data from 59 patients in the UK. Clinical data, including PCR results, is also available. The data is available online and is free to view and use for educational purposes.
The AIforCOVID imaging dataset is obtained from CDI Centro Diagnostico Italiano and Bracco Imaging (Milan). It hosts 983 DICOM chest X-rays of COVID-19 patients from Italy, along with related clinical data. The data can be downloaded and used for commercial, scientific, and educational purposes after registering on the website.
The COVID-19 Open Initiative data is compiled by Darwin AI Corp. (Canada), the Vision and Image Processing Research Group at the University of Waterloo (Canada), and others. The data is collected from various publicly available datasets, such as Cohen et al., MIDRC-RICORD-1a, the RSNA pneumonia dataset on Kaggle, and the COVID-19 radiography database on Kaggle. As of the latest data update on March 19, 2021, there are 16352 images with 2358 positive COVID-19 cases. Data download instructions and project information are provided on the GitHub page of COVID-NET. The data is free to use for research and educational purposes.
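With 2358 positive cases among 16352 images, the labels in this collection are strongly imbalanced, which matters when training a classifier on it. The following is a minimal sketch of inverse-frequency class weighting, not part of the COVID-NET project itself. It assumes that each of the 2358 positive cases corresponds to one positive image (the source does not say whether cases are images or patients) and that all remaining images form a single negative class. The function name `inverse_frequency_weights` is hypothetical.

```python
def inverse_frequency_weights(class_counts):
    """Return one weight per class: n_total / (n_classes * n_class)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: total / (n_classes * count)
        for label, count in class_counts.items()
    }


# Counts taken from the text. The negative count is derived as
# 16352 - 2358, which ASSUMES one image per positive case.
counts = {"positive": 2358, "negative": 16352 - 2358}

weights = inverse_frequency_weights(counts)
positive_fraction = counts["positive"] / sum(counts.values())

print(f"positive fraction: {positive_fraction:.3f}")
for label, weight in weights.items():
    print(f"{label}: weight {weight:.3f}")
```

Weights of this form give the rarer class a proportionally larger contribution to the loss; whether that helps in practice depends on the model and the evaluation metric.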
The dataset is compiled by a global team of radiologists and other health professionals and is available at radiopaedia.org. It contains axial chest CTs of 101 patients from all around the world. Clinical data and PCR test results are available for some patients. The data is provided under a modified Creative Commons license.
The dataset is collected by Eurorad, a peer-reviewed educational tool of the European Society of Radiology. It contains chest X-rays and CT scans of 50 COVID-19 patients from all around the world. The format is JPG/PDF, and clinical information with PCR status is provided. The data is licensed under Creative Commons Attribution Noncommercial Share-Alike 4.0 (CC BY-NC-SA 4.0).
The data is compiled by the BIMCV Medical Imaging Databank of the Valencia Region, Antonio Pertusa, and Maria de la Iglesia Vaya. The dataset contains 2265 X-ray images and 163 CT studies of COVID-19 patients, along with their radiographic findings. The dataset is free to use for research purposes.
The dataset is collected by the Italian Society of Radiology (Società Italiana di Radiologia Medica, SIRM). The database hosts axial CT images of 68 COVID-19 patients from Italy. The format is JPG, and the data can be used for non-commercial research.
The dataset is collected by Joseph Paul Cohen (Université de Montréal, CA). At the time of writing, the dataset contains 931 images from 461 patients and 20 CT volumes, covering diagnoses such as bacterial pneumonia, viral pneumonia, COVID-19, fungal infection, and SARS. The dataset is compiled from different sources, including Eurorad, Radiopaedia, SIRM, and various publications. The data is provided in JPG and NIfTI formats and can be downloaded from the GitHub repository. The data is free to use for non-commercial purposes.
MosMedData is collected from the Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department (MosMed). The dataset contains CT images of 110 patients from Russia diagnosed with COVID-19. The data is available in Neuroimaging Informatics Technology Initiative (NIfTI) format. The data is licensed under Creative Commons Attribution Noncommercial No Derivatives 4.0 (CC BY-NC-ND 4.0).
The dataset is collected by Jinyu Zhao, Yichen Zhang, Xuehai He, and Pengtao Xie, who are associated with the University of California San Diego, US. The dataset has 349 CT images from 216 patients, collected from preprint articles. The images are in PNG format, and medical history and PCR status are available for some cases. The data can be downloaded from a GitHub repository and is free to use for non-commercial research.
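Because the 349 images come from 216 patients, some patients contribute more than one image. If images from the same patient end up in both the training and the test set, measured accuracy can overstate how well a model generalizes. The sketch below shows one way to split at the patient level using only the Python standard library. It is an illustration under stated assumptions, not the dataset authors' procedure: the function name `split_by_patient`, the patient IDs, and the file names are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_path) pairs so that no patient
    appears in both the training and the test set."""
    by_patient = defaultdict(list)
    for patient_id, image_path in records:
        by_patient[patient_id].append(image_path)

    # Shuffle patients, not images, with a fixed seed for repeatability.
    patient_ids = sorted(by_patient)
    random.Random(seed).shuffle(patient_ids)

    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])

    train = [(p, img) for p in patient_ids if p not in test_ids
             for img in by_patient[p]]
    test = [(p, img) for p in patient_ids if p in test_ids
            for img in by_patient[p]]
    return train, test


# Made-up records for illustration only; real IDs and paths will differ.
records = [
    ("p1", "p1_a.png"), ("p1", "p1_b.png"), ("p2", "p2_a.png"),
    ("p3", "p3_a.png"), ("p4", "p4_a.png"), ("p5", "p5_a.png"),
]
train, test = split_by_patient(records, test_fraction=0.2, seed=0)
```

Splitting on patient IDs rather than on individual images keeps all images of one patient on the same side of the split.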
The Coronacases.org dataset is collected from Radiology and Artificial Intelligence One-Step Shop (RAIOSS), Livon Saúde (Brazil), and Rodrigo Caruso Chate (Hospital Israelita Albert Einstein, Brazil). The dataset contains CT images of 10 patients with COVID-19 obtained from Wenzhou Medical University, China. The data is available online and is free for everyone to use.
The dataset is compiled by Håvard Bjørke Jenssen (University Hospital of Oslo, NO) and is available on medicalsegmentation.com. It contains 100 axial CT images from more than 40 Italian patients with COVID-19, converted to NIfTI format from the JPG images of the Italian Society of Radiology. Clinical information, including PCR status, is available for some cases. The data is free to use for non-commercial purposes.
A second dataset compiled by Håvard Bjørke Jenssen (University Hospital of Oslo, NO) is also available on medicalsegmentation.com. It contains segmented axial volumetric CTs of 9 patients from all around the world, obtained from Radiopaedia. Clinical information, including PCR status, is available for some cases. The data is licensed under a modified Creative Commons license.
The dataset is collected by Ma Jun (Nanjing University of Science and Technology, China) et al. It contains labeled COVID-19 CT scans of 20 patients from around the world. The data format is NIfTI. The data is licensed under Creative Commons Attribution Noncommercial Share-Alike 2.0 (CC BY-NC-SA 2.0).
The dataset is collected by the Radiological Society of North America (RSNA) and the Society of Thoracic Radiology (STR). The dataset, called the RSNA International COVID-19 Open Radiology Database (RICORD), contains 120 thoracic CT scans obtained globally, with detailed segmentation and diagnostic labels. The images are in DICOM format and can be downloaded from The Cancer Imaging Archive. The data is free to use for non-commercial purposes.
The dataset is compiled by the China Consortium of Chest CT Image Investigation (CC-CCII). It contains 90 CT volumes and some segmentation masks of background, lung field, ground-glass opacity (GGO), and consolidation. The images are classified into COVID-19, common pneumonia, and normal. The dataset can be downloaded from the China National Center for Bioinformation website and is free to use for research purposes.
The dataset is compiled by Soares, Eduardo et al. and contains 2482 CT slices obtained from a hospital in Sao Paulo, Brazil. Of these, 1252 CT scans are positive for COVID-19 infection and 1230 are negative. The dataset can be downloaded for free from the Kaggle website.
The dataset is collected from 760 preprints about COVID-19 from medRxiv and bioRxiv, posted from January 19 to March 25, 2020. The images are extracted from the PDF files of the preprints. The dataset contains 349 COVID-19 CT images and 463 non-COVID-19 CT images. The data can be downloaded from the GitHub repository and is free to use for research purposes.
The dataset is collected from Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China. The dataset contains 256356 chest CT images and some clinical features from 1170 patients. The dataset is available to download on the iCTCF website at http://ictcf.biocuckoo.cn/ and is available under a CC BY-NC 4.0 license.
Fig. Chest X-rays of a patient with a positive COVID-19 PCR test (image taken from the Radiopaedia website).
Findings similar to those of pneumonia are seen on chest radiographs and CT scans of patients diagnosed with COVID-19. Such abnormalities in chest X-rays can be detected by a deep learning-based algorithm. However, for patients in the early course of the disease, chest radiographs appear normal in most cases. Moreover, deep learning algorithms are notorious for their limited generalization ability. Nevertheless, chest radiograph findings can be vital in detecting the progression of the disease.