This study was divided into three main parts: collecting datasets, constructing deep learning models, and developing mobile applications. Figure 8 illustrates this workflow. The details of each part are presented in the following subsections.

This flowchart illustrates in detail the process of dataset construction and the development of Nose-Keeper using deep learning models. a The collection process of the datasets. b The development process of the deep learning models and our smartphone application.
Construction of the multi-centre dataset
In this study, we reviewed and constructed a dataset from three hospitals located in high-risk areas of NPC. We retrospectively collected numerous white-light nasal endoscopic images of patients with NPC from the Department of Otolaryngology of the Second Affiliated Hospital of Shenzhen University (SZH) and the Department of Otolaryngology of Foshan Sanshui District People’s Hospital (FSH) between 1 January 2014 and 31 January 2023. Given that the early clinical symptoms of NPC (such as headache, cervical lymph node enlargement, nasal congestion, and nosebleeds) are similar to those of common diseases of the nasal cavity and nasopharynx44, and rhinosinusitis, allergic rhinitis, and chronic sinusitis may be risk factors for NPC45,46,47, we collected white-light nasal endoscopic images of non-NPC patients visiting SZH and FSH from the same period to develop deep learning models. In addition, Leizhou People’s Hospital (LZH) provided nasal endoscopic images of patients who visited the Department of Otolaryngology between 1 January 2015 and 31 April 2022. From an application perspective, including the images of non-NPC patients in the dataset can effectively improve the comprehensiveness and accuracy of the results of the deep learning model for diagnosing nasal endoscopic images. The collected images were divided into seven categories (Fig. 9): NPC (Fig. 9a), adenoidal hypertrophy (AH) (Fig. 9b), allergic rhinitis (AR) (Fig. 9c), chronic rhinosinusitis with nasal polyps (CRP) (Fig. 9d), deviated nasal septum (DNS) (Fig. 9e), normal nasal cavity and nasopharynx (NOR) (Fig. 9f) and rhinosinusitis (RHI) (Fig. 9g). Table 4 presents the detailed characteristics of the dataset.

These endoscopic images are given from different angles and parts for each disease type. a Nasopharyngeal carcinoma (NPC). b Adenoidal hypertrophy (AH). c Allergic rhinitis (AR). d Chronic rhinosinusitis with nasal polyps (CRP). e Deviated nasal septum (DNS). f Normal nasal cavity and nasopharynx (NOR). g Rhinosinusitis (RHI).
This study was approved by the Ethics Committee of the Second Affiliated Hospital of Shenzhen University, the Institutional Review Board of Leizhou People’s Hospital and the Ethics Committee of Foshan Sanshui District People’s Hospital (reference numbers: ‘BY-EC-SOP-006-01.0-A01’, ‘BYL20220531’ and ‘SRY-KY-2023045’) and adhered to the principles of the Declaration of Helsinki. Due to the retrospective nature of the study and the use of unidentified data, the Institutional Review Boards of SZH, FSH and LZH exempted informed consent. Supplementary Note 5 presents more detailed ethics declarations and procedures.
Diagnostic criteria of the nasal endoscopic images
In this study, to ensure the accuracy of the endoscopic image labels, three otolaryngologists with over 15 years of clinical experience set the diagnostic criteria based on practical clinical diagnostic processes and reference literature. Specifically, the experts combined each patient’s endoscopic examination results with the corresponding medical history, record of clinical manifestations, computed tomography results, allergen testing (skin prick tests or serum-specific IgE tests), lateral cephalograms, histopathological examination results, and laboratory test results (such as nasal smear examination) to further review and confirm the diagnostic results of the existing nasal endoscopic images of each patient. A diagnosis based on the aforementioned medical records was considered the reference standard for this study. Our otolaryngologists independently reviewed all data in detail before any analysis and validated that each endoscopic image was correctly matched to a specific patient. Patients with insufficient diagnostic medical records were excluded. During the review process, when an expert doubted the diagnostic results of a particular patient, the three experts jointly reviewed the patient’s medical records and various examination results to determine whether to include the patient in this study.
The reference-standard diagnosis for the seven types of nasal endoscopic images in the dataset was as follows: (1) NPC: the standard diagnostic label was assigned directly based on histopathological examination results48,49; (2) Rhinosinusitis: further combining the patient’s medical history, clinical manifestations, and computed tomography examination50; (3) Chronic rhinosinusitis with nasal polyps: further combining the patient’s medical history, clinical manifestations, computed tomography results, and pathological tissue biopsy results51,52; (4) Allergic rhinitis: further combining the patient’s medical history, clinical manifestations, and allergen testing or laboratory methods53,54,55; (5) Deviated nasal septum: further combining the patient’s medical history and clinical manifestations, with a secondary analysis and evaluation of the shape of the nasal septum56; (6) Adenoid hypertrophy: further combining the patient’s medical history, clinical manifestations, or lateral cephalograms57,58; (7) Normal nasal cavity and nasopharynx: further combining the patient’s medical history and clinical manifestations. The nasal mucosa of a normal nasal cavity should be light red, and its surface should be smooth, moist, and glossy. The nasal cavity and nasopharyngeal mucosa show no congestion, edema, dryness, ulcers, bleeding, vasodilation, neovascularization, or purulent secretions. Table 5 details the distribution of image categories across hospitals.
Deep transfer learning models
Transfer learning aims to improve the model performance on new tasks by leveraging pre-learned knowledge of similar tasks59. It has significantly contributed to medical image analysis, as it overcomes the data scarcity problem and saves time and hardware resources.
In this study, we effectively combined deep learning models, which are popular in artificial intelligence, with this powerful strategy. To build an optimal NPC diagnostic model, we studied Vision Transformers (ViTs), convolutional neural networks (CNNs), and hybrid models based on the latest advances in deep learning. Among them were (1) ViTs: Swin Transformer (SwinT)25, Multi-Axis Vision Transformer (MaxViT)60, and Class Attention in Image Transformers (CaiT)61. These models were selected for their ability to model long-range dependencies and their adaptability to various image resolutions, which are crucial for medical image analysis. These models represent the latest shifts in deep learning from convolutional to attention-based mechanisms, providing a fresh perspective on feature extraction. (2) CNNs: ResNet62, DenseNet63, and Xception64. CNNs have gradually become the mainstream algorithm for image classification since 2012, and have shown very competitive performance in medical image analysis tasks65. ResNet and DenseNet excel in addressing the vanishing gradient problem and strengthening feature propagation. Xception achieves a good trade-off between parameter efficiency and feature extraction capability using depth-wise separable convolutions. (3) Hybrid Models: PoolFormer (PoolF)66 and ConvNeXt67. PoolFormer enhances feature extraction by leveraging spatial pooling operations, while ConvNeXt incorporates ViT-inspired design concepts into CNNs to improve model performance, particularly in capturing global context through enhanced architectures. These architectures help improve model performance in downstream tasks by effectively utilising the advantages of CNNs and ViTs.
We then initialised the eight architectures using pretrained weights obtained by classifying the large natural image dataset ImageNet68. Because the classifiers of these networks originally had 1000 output nodes, we reset the number of nodes in each classifier to seven to fit our dataset. After completing initialisation and adjusting the classifier, rather than fine-tuning only a subset of layers, we updated all layers of each model during training. Moreover, we performed probability thresholding based on Softmax.
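The Softmax probability thresholding can be sketched as follows; the class list matches the seven categories above, while the 0.5 threshold is an illustrative assumption rather than the value used in the study:

```python
import numpy as np

# The seven diagnostic categories used in this study.
CLASS_NAMES = ["NPC", "AH", "AR", "CRP", "DNS", "NOR", "RHI"]

def softmax(logits):
    # Numerically stable Softmax over a 1-D vector of logits.
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def thresholded_prediction(logits, threshold=0.5):
    """Return (class name, probability), or (None, probability) when the
    top Softmax probability falls below the threshold."""
    probs = softmax(logits)
    top = int(np.argmax(probs))
    if probs[top] < threshold:
        return None, float(probs[top])
    return CLASS_NAMES[top], float(probs[top])
```

Predictions whose maximum probability falls below the threshold can then be deferred to a clinician rather than reported as a diagnosis.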
Explainable artificial intelligence in medical imaging
In medical imaging, Explainable Artificial Intelligence (XAI) is critical because it fosters trust and understanding among medical practitioners and facilitates accurate diagnosis and treatment by elucidating the rationale behind AI-driven image analysis. In this study, we used Gradient-weighted Class Activation Mapping (Grad-CAM)69 to generate a corresponding heatmap. Red indicates high relevance, yellow indicates medium relevance, and blue indicates low relevance. Grad-CAM helps to visualise the regions of an image that are important for a particular classification. This is crucial in medical image classification, as it helps people understand which parts of the image contribute to model decision-making and validates whether the model focuses on disease-related features. By providing visual explanations through heat maps, Grad-CAM can help build trust among medical practitioners and the public regarding the decisions made by AI systems. However, Grad-CAM depends on the model’s architecture and may not provide more detailed insights. Besides, in real-time applications or scenarios requiring quick analysis, the computational demands of Grad-CAM might hinder its practicality.
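The core of Grad-CAM can be sketched in a few lines. This is a simplified NumPy illustration, with the activations and gradients assumed to be extracted beforehand via framework hooks; it is not the exact implementation used in the study:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from one convolutional layer.

    activations: feature maps of the chosen layer, shape [K, H, W].
    gradients: gradients of the target class score w.r.t. those maps,
               same shape (obtained via backpropagation hooks).
    """
    # Channel importance weights: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))
    # Weighted combination of feature maps, then ReLU to keep only
    # features with a positive influence on the target class.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalise to [0, 1] so it can be rendered as a heatmap
    # (red = high relevance, yellow = medium, blue = low).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

The resulting map is upsampled to the input resolution and overlaid on the endoscopic image to show which regions drove the classification.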
Development process of models and smartphone applications
All the nasal endoscopic images were divided into two parts. The first part contained 38,073 images from SZH and FSH, which served as the development dataset for training and validating the model. The development dataset was further divided into internal training, internal validation, and internal test sets in a 7:1:2 ratio. The second part contained 1267 images from LZH, which were used as an external test set to evaluate the model’s performance in real-world settings and verify its robustness.
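The 7:1:2 split can be sketched as follows. Splitting at the image level with a fixed seed is an assumption for illustration only; in practice, a patient-level split may be preferred to avoid leakage between subsets:

```python
import random

def split_dataset(image_paths, seed=42):
    """Shuffle and split image paths into train/validation/test subsets
    in a 7:1:2 ratio."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    paths = list(image_paths)
    rng.shuffle(paths)
    n = len(paths)
    n_train, n_val = int(n * 0.7), int(n * 0.1)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```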
Before training the various networks, we resized all images to 224 × 224 × 3. Subsequently, the images were normalised and standardised using the mean [0.2394, 0.2421, 0.2381] and standard deviation [0.1849, 0.28, 0.2698] of the three channels. To avoid degradation of model performance when transitioning from the development set to the external test set, we used online data augmentation and early stopping strategies as well as model calibration techniques. Concretely, we first utilised the transforms module provided by torchvision to automatically transform (RandomRotation, RandomAffine, GaussianBlur and ColorJitter) the image inputs during training to improve the robustness and generalisation ability of the models. During training, all models uniformly used the cross-entropy loss function, and we employed the AdamW optimiser with an initial learning rate of 0.001, β1 of 0.9, β2 of 0.999, and weight decay of 0.0001 to optimise the parameters of the eight models. We set the number of epochs to 150 and used a batch size of 64 for each model. We then adopted an early stopping strategy: training stopped automatically when the model’s accuracy on the internal validation set no longer improved significantly for some time (patience and min_delta were set to 10 and 0.001, respectively), thereby preventing overfitting. We calibrated each model on the internal validation set with temperature scaling, a method for calibrating deep learning models, and assessed the calibration performance using the Brier score and log-loss. During the validation and inference stages, the image preprocessing was consistent with the training stage, but automatic image transformation was no longer applied. We used the PyTorch framework (version 2.1) on a computer running Ubuntu 20.04 with an NVIDIA GeForce RTX 4090 to complete the entire experiment. The weights of all models were saved in ‘.pth’ format.
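Temperature scaling itself is a one-parameter correction: calibrated probabilities are Softmax(logits / T), with T chosen on the validation set. A minimal NumPy sketch, using the log-loss and Brier score mentioned above as calibration measures, might look like this; the grid-search range is an assumption for illustration:

```python
import numpy as np

def softmax(logits):
    # Row-wise numerically stable Softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def log_loss(probs, labels):
    # Negative log-likelihood of the true class.
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def brier_score(probs, labels, n_classes=7):
    # Mean squared error between probabilities and one-hot labels.
    onehot = np.eye(n_classes)[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature T minimising log-loss on the validation set;
    calibrated probabilities are then softmax(logits / T)."""
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        nll = log_loss(softmax(val_logits / t), val_labels)
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t
```

An overconfident model typically yields T > 1, which softens the predicted probabilities without changing the predicted class.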
In this study, we developed a responsive and user-friendly Android application that prioritises maintainability and scalability. Various software engineering principles and practices (e.g. separation of concerns and dependency inversion) were followed to ensure that the Nose-Keeper application maintains consistent performance and reliability across a diverse range of Android devices with varying hardware capabilities. Additionally, we adopted a responsive layout and conducted multiple user tests and iterative feedback rounds to ensure that Nose-Keeper’s user interface is simple and easy to use and can adapt to various screen sizes, screen orientations and device states. Utilising Java for native Android development, we embraced the MVVM design pattern for application modularisation, incorporating bidirectional data binding for seamless UI and data synchronisation. Our tech stack included Retrofit for network requests alongside third-party libraries such as ButterKnife, Gson, Glide, EventBus, and MPAndroidChart for enhanced functionality and user experience, complemented by custom animations and the NDK for hardware interaction. At the backend, we leveraged SSM (Spring + SpringMVC + MyBatis) and Nginx for a high-performance architecture, with MySQL for data management and Redis for caching. The backend of the application and the deep learning model were deployed on a high-performance Cloud Server (Manufacturer: Tencent; Equipment Type: Standard Type S6; Operating System: CentOS 7.6; CPU: Intel® Xeon® Ice Lake; Memory: DDR4) with Nginx load balancing to optimise server resource utilisation (see Supplementary Note 1 for details). To ensure the security of the application and personal privacy data, we used encryption protocols, algorithms and toolkits that comply with industry standards (see Supplementary Note 2 for details).
When utilising Nose-Keeper, all input images must go through an image preprocessing pipeline consistent with the model inference stage.
Model evaluation and statistical analysis
For the development datasets (SZH and FSH), eight models were evaluated using five standard metrics: overall accuracy (Eq. (1)), precision (Eq. (2)), sensitivity (Eq. (3)), specificity (Eq. (4)), and F1-score (Eq. (5)). The definitions of these five metrics are as follows (see Supplementary Note 3 for details).
$$Overall\,accuracy=\frac{True\,Positives+True\,Negatives}{Total\,Samples}$$
(1)
$$Precision=\frac{True\,Positives}{True\,Positives+False\,Positives}$$
(2)
$$Sensitivity=\frac{True\,Positives}{True\,Positives+False\,Negatives}$$
(3)
$$Specificity=\frac{True\,Negatives}{True\,Negatives+False\,Positives}$$
(4)
$$F1-score=2\times \left(\frac{Precision\times Sensitivity}{Precision+Sensitivity}\right)$$
(5)
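The five metrics above can be computed directly from the entries of a one-vs-rest confusion matrix; a minimal sketch (with illustrative counts only):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the five metrics of Eqs. (1)-(5) from the entries of a
    binary (one-vs-rest) confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # Eq. (1)
    precision = tp / (tp + fp)                   # Eq. (2)
    sensitivity = tp / (tp + fn)                 # Eq. (3)
    specificity = tn / (tn + fp)                 # Eq. (4)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # Eq. (5)
    return {"overall_accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity,
            "f1_score": f1}
```

For the seven-class setting, these per-class values can be averaged (e.g. macro-averaged) across the one-vs-rest matrices.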
To avoid performance uncertainty caused by random splitting of the development dataset, we used a fivefold cross-validation strategy to evaluate the potential of the various models on the development dataset and then, based on the quality of the metric results, selected the four best-performing of the eight models as candidates for the smartphone application. After selecting the four candidate models, we used a confusion matrix and the Receiver Operating Characteristic (ROC) curve to further evaluate their performance on the external test set (LZH). A larger area under the ROC curve (AUC) indicated better performance. We used the best model to develop the smartphone application. Statistical analyses were performed using Python 3.9. Owing to the large sample size of the internal dataset and the use of fivefold cross-validation, we used the normal approximation to calculate the 95% confidence intervals (CI) of overall accuracy, precision, sensitivity, specificity, and F1-score. In the external test set, we used an empirical bootstrap with 1000 replicates to calculate the 95% CI of the AUC. The 95% CIs of overall accuracy, sensitivity and specificity were calculated using the Wilson score approach in the Statsmodels package (version 0.14.0).
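The Wilson score interval has a closed form; a minimal sketch is shown below (Statsmodels’ `proportion_confint` with `method='wilson'` computes the same quantity):

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion
    (z = 1.96 for a 95% interval)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return centre - half, centre + half
```

Unlike the normal approximation, the Wilson interval stays within [0, 1] and behaves well for proportions near 0 or 1, which matters for high-accuracy metrics.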
Analysing the robustness of the deep learning models via data augmentation
Testing the model with images under different data augmentations can reveal its adaptability to input changes and thereby characterise its robustness70. In particular, data augmentation simulates image transformations that may occur in practical applications, thereby testing the stability and performance of a model when faced with unseen or changing images. This strategy helps developers identify potential weaknesses of the model, guide subsequent improvements, and enhance the reliability of the model in complex and ever-changing environments. We used the external test set to analyse how the model’s predictions changed under Gaussian blur, saturation changes, image rotation and brightness changes. Prior to testing the model, we augmented the external test set using Pillow (version 9.3.0). For each transformation, we assigned different parameter values to Pillow’s built-in functions, resulting in 12 augmented datasets derived from the external test set.
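A sketch of how the four Pillow transformations can generate the 12 variants follows; the parameter values are illustrative assumptions, as the exact settings are not reproduced here:

```python
from PIL import Image, ImageEnhance, ImageFilter

# Four transformation types, each applied with three parameter values.
TRANSFORMS = {
    "gaussian_blur": lambda img, r: img.filter(ImageFilter.GaussianBlur(radius=r)),
    "saturation": lambda img, f: ImageEnhance.Color(img).enhance(f),
    "rotation": lambda img, deg: img.rotate(deg),
    "brightness": lambda img, f: ImageEnhance.Brightness(img).enhance(f),
}
PARAMS = {  # illustrative values only
    "gaussian_blur": [1, 2, 3],
    "saturation": [0.5, 1.5, 2.0],
    "rotation": [90, 180, 270],
    "brightness": [0.5, 1.5, 2.0],
}

def augmented_variants(img):
    """Yield (transform name, parameter, transformed image):
    4 transformation types x 3 parameter values = 12 variants."""
    for name, fn in TRANSFORMS.items():
        for p in PARAMS[name]:
            yield name, p, fn(img, p)
```

Applying each variant to every image in the external test set yields the 12 augmented datasets on which the models were re-evaluated.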
Human-machine comparison experiment
The representativeness of the external test set is crucial for fully comparing the performance differences between AI and human experts. Therefore, when retrospectively collecting endoscopic images, in addition to ensuring the accuracy of image labels, our expert team also fully considered the severity of lesions, the different stages of disease, and the differences in appearance in each endoscopic image. The expert group also ensured, as much as possible, the age diversity and gender balance of the entire dataset. In particular, the external test set spans five years. We recruited nine otolaryngologists with different levels of clinical experience from three institutions, i.e. one year (two otolaryngologists), three years (one otolaryngologist), four years (one otolaryngologist), five years (one otolaryngologist), six years (one otolaryngologist), eight years (two otolaryngologists), and nine years (one otolaryngologist). Before each expert independently evaluated the external test set, we shuffled the dataset, renamed each image as “test_xxxx.jpg” and distributed it to all experts. We required the experts to independently evaluate each endoscopic image within a specified time frame to simulate the physical and mental stress faced by experts in actual clinical settings, which further reflects the efficiency of AI. Notably, we prohibited experts from consulting diagnostic guidelines and from communicating with each other. All expert evaluation results were anonymised and automatically verified through a Python program. Finally, we plotted a diagnostic performance heatmap, confusion matrix, ROC curve, and optimal Youden index to comprehensively and intuitively demonstrate the performance differences between AI and clinicians in diagnosing different diseases.
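The optimal Youden index reduces to a simple maximisation over ROC operating points; a minimal sketch (with hypothetical ROC points for illustration):

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1

def optimal_operating_point(roc_points):
    """roc_points: iterable of (threshold, sensitivity, specificity).
    The optimal operating point on the ROC curve maximises J."""
    return max(roc_points, key=lambda pt: youden_index(pt[1], pt[2]))
```

Each clinician contributes a single (sensitivity, specificity) point, which can then be compared against the model's ROC curve and its J-maximising point.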