A multi-center annotated oral cytology dataset for AI-assisted early detection of oral squamous cell carcinoma

Description

Oral squamous cell carcinoma is a major public health burden, particularly in regions with limited access to specialist pathology services. Oral brush cytology provides a minimally invasive approach for screening and early assessment, but the development of automated analysis methods requires well-annotated, multi-source datasets. We present a multicenter oral cytology dataset collected from tertiary medical centers in India. The dataset includes Papanicolaou- and May-Grünwald-Giemsa-stained whole-slide images, associated patient- and slide-level metadata, high-resolution image patches, and expert-verified nucleus-level annotations in QuPath-compatible GeoJSON format. The annotations support computational tasks including nucleus segmentation, instance segmentation, and cytological category classification. Baseline segmentation experiments are provided to document technical quality and support reuse. The dataset is intended to facilitate development and benchmarking of computational pathology methods for oral cytology analysis.
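As a rough illustration of how the nucleus-level annotations might be read, the sketch below parses a QuPath-style GeoJSON export using only the Python standard library and tallies nuclei per class. It assumes the usual QuPath layout, in which each feature carries a polygon geometry and an optional `properties.classification.name`. The sample features and the class name "Normal" are placeholders, not values taken from this dataset, and the actual files may differ.

```python
import json
from collections import Counter

# Tiny inline sample in the usual QuPath export layout.
# The class name "Normal" is a placeholder, not a dataset category.
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]]]},
         "properties": {"objectType": "annotation",
                        "classification": {"name": "Normal"}}},
        {"type": "Feature",
         "geometry": {"type": "Polygon",
                      "coordinates": [[[5, 5], [7, 5], [7, 7], [5, 5]]]},
         "properties": {"objectType": "annotation"}},
    ],
})


def load_features(geojson_text):
    """Return the list of features from a GeoJSON string."""
    data = json.loads(geojson_text)
    # Exports may be a FeatureCollection, a single Feature, or a bare list.
    if isinstance(data, dict) and data.get("type") == "FeatureCollection":
        return data.get("features", [])
    if isinstance(data, dict) and data.get("type") == "Feature":
        return [data]
    if isinstance(data, list):
        return data
    raise ValueError("unrecognised GeoJSON layout")


def count_classes(features):
    """Count features per classification name; missing labels are 'Unclassified'."""
    counts = Counter()
    for feat in features:
        props = feat.get("properties") or {}
        cls = props.get("classification") or {}
        name = cls.get("name", "Unclassified") if isinstance(cls, dict) else str(cls)
        counts[name] += 1
    return counts


def polygon_rings(feature):
    """Return the outer ring(s) of a Polygon or MultiPolygon feature."""
    geom = feature.get("geometry") or {}
    if geom.get("type") == "Polygon":
        return [geom["coordinates"][0]]
    if geom.get("type") == "MultiPolygon":
        return [poly[0] for poly in geom["coordinates"]]
    return []


if __name__ == "__main__":
    features = load_features(SAMPLE)
    print(count_classes(features))
```

To use this on a downloaded annotation file, read the file contents as text and pass them to `load_features`. The outer rings returned by `polygon_rings` can then be rasterised into instance masks with an imaging library of choice.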

The whole-slide images (WSIs) are available for download at the following links:
Set 1: https://doi.org/10.5281/zenodo.20657725
Set 2: https://doi.org/10.5281/zenodo.20667024
Set 3: https://doi.org/10.5281/zenodo.20666845
Set 4: https://doi.org/10.5281/zenodo.20667432
Set 5: https://doi.org/10.5281/zenodo.20674172
Set 6: https://doi.org/10.5281/zenodo.20674189
Set 7: https://doi.org/10.5281/zenodo.20674195
Set 8: https://doi.org/10.5281/zenodo.20674200
Set 9: https://doi.org/10.5281/zenodo.20674216
Set 10: https://doi.org/10.5281/zenodo.20674220
Set 11: https://doi.org/10.5281/zenodo.20674231

Authors

DOI: 10.5281/zenodo.20686174

Publication Date: 2026-06-14
