Assessing the efficacy of convolutional neural networks for Pap smear classification: a real world analysis
Abstract
Background
Undetected cervical lesions can progress to cancer, a leading cause of mortality among women worldwide. While automated analysis of Papanicolaou (Pap) smear images using convolutional neural networks (CNNs) has demonstrated significant potential for screening, most existing studies rely on single curated datasets. This limits the understanding of how well models generalize to the noise and variability inherent in real-world clinical cytology.

Methods
We evaluated three CNN architectures (VGG16, ResNet50, and InceptionV3) across four curated Pap smear datasets using stratified 5-fold cross-validation. For each dataset, the model achieving the highest mean Macro-F1 score was selected for further analysis. To assess robustness against domain shift, we performed an external evaluation using a non-curated, Real-World dataset comprising routine clinical images.

Results
All architectures performed well on the curated benchmarks, with mean Macro-F1 scores ranging from 73.58% to 99.28%. However, performance dropped sharply when models were evaluated on the Real-World dataset (Macro-F1: 33.25–55.91%), highlighting the severity of the domain gap. Notably, the model trained on a combined heterogeneous dataset achieved the highest performance on the Real-World dataset, suggesting that data diversity improves robustness. Class-wise analysis revealed that high-grade lesions were most sensitive to real-world variability.

Conclusions
Although CNNs achieve state-of-the-art results on curated benchmarks, their direct applicability to routine cytology workflows is hindered by domain shift. Our findings emphasize that evaluating models across heterogeneous, multi-source datasets is a prerequisite for reliable clinical deployment.
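
For readers less familiar with the Macro-F1 metric reported above, the following is a minimal sketch of how it can be computed. It is not the authors' evaluation code; the function name `macro_f1` and the class labels `"normal"` and `"high_grade"` are hypothetical, chosen only for illustration. The example shows why Macro-F1 is sensitive to a poorly recognized minority class (such as high-grade lesions) even when overall accuracy looks acceptable.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Classes are taken as the union of labels in y_true and y_pred.
    A class with no true positives, false positives, or false
    negatives would score 0.0 (this cannot occur for labels in the union).
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2*TP / (2*TP + FP + FN)
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: 8 normal cells, 2 high-grade lesions.
y_true = ["normal"] * 8 + ["high_grade"] * 2
# A classifier that always predicts "normal" misses every lesion.
y_pred = ["normal"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8, while Macro-F1 is about 0.44, because the missed high-grade class contributes an F1 of zero with the same weight as the majority class.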