Table 4 Details of artificial intelligence methods applied and outcomes in studies for appendicitis prognosis

From: Artificial Intelligence and Acute Appendicitis: A Systematic Review of Diagnostic and Prognostic Models

Study, year

Input features

Training/validation strategy

Performance

Comparative algorithms and scoring metrics

Key findings

Limitation

Akbulut et al. [2], 2023 (Turkey)

Neutrophil, WLR, NLR, CRP, WNR, PNR, PDW, and MCV

The persistence method was repeated 50 times with different seeds for model robustness. CatBoost model predicted AA and perforated AA, with optimized hyperparameters using grid search with tenfold cross-validation and 5 replicates

CatBoost model performance for classification: Sensitivity 84.2%, Specificity 93.2%, AUC 0.947, Accuracy 88.2%, F1-score 88.7%

CatBoost model: Accuracy 92%, F1-score 91.1%, Sensitivity 94.1%, Specificity 90.5%, and AUC 0.969

NR

1. First study to combine ML and XAI for AA and perforated AA estimation

2. Identified biochemical blood parameters that can predict AA and perforated AA

1. The study is retrospective and lacks comprehensive clinical data

2. Radiological data are missing for approximately 11% of the patient sample

3. Conducted at a single institution

Phan-Mai et al. [23], 2023 (Vietnam)

Demographic characteristics, blood tests, and ultrasound. Blood tests consisted of total WBC, granulocyte count, lymphocyte count, and CRP

Imbalanced data was addressed using SMOTE. Optimal parameters were selected using k-fold validation. The data of 1,950 patients were split randomly into 70% for training and 30% for testing

GB model (imbalanced unadjusted data): Accuracy: 81%, AUC: 0.753

GB model (imbalanced adjusted data): Accuracy: 82%, AUC: 0.890

KNN model (imbalanced unadjusted data): Accuracy: 77.6%, AUC: 0.672

KNN model (imbalanced adjusted data): Accuracy: 74.1%, AUC: 0.831

DT model (imbalanced unadjusted data): Accuracy: 70.3%, AUC: 0.601

DT model (imbalanced adjusted data): Accuracy: 73.8%, AUC: 0.738

ANN model (imbalanced unadjusted data): Accuracy: 80.5%, AUC: 0.734

ANN model (imbalanced adjusted data): Accuracy: 74.2%, AUC: 0.810

LR model (imbalanced unadjusted data): Accuracy: 80.3%, AUC: 0.714

LR model (imbalanced adjusted data): Accuracy: 72.9%, AUC: 0.789

SVM model (imbalanced unadjusted data): Accuracy: 75.2%, AUC: 0.711

SVM model (imbalanced adjusted data): Accuracy: 65.5%, AUC: 0.730

NR

1. High validity of ML models in classifying CA

2. GB model most valid

3. Models useful as screening tools

1. Small sample size

2. Single-hospital data

3. Low rate of complicated cases

4. Insufficient qualitative data

5. Not for definitive diagnosis

Li et al. [24], 2023 (China)

Age, stage of pregnancy, symptom duration, vital signs, physical examination findings, laboratory test results, and imaging findings (US)

NR

LR-based score (cutoff = 16): Sensitivity: 64%, Specificity: 84%, Accuracy: 75%, PPV: 73%, NPV: 77%, AUC: 0.80 (95% CI = 0.75–0.84)

DT model: AUC: 0.78

NR

1. Higher premature birth and abortion rates in pregnant patients with CA

2. Treatment delay increases these rates

3. Models using LR and DT effectively distinguish CA from UCA

4. Models combine clinical and laboratory tests

5. Appendix diameter had an AUC of 0.68 in 116 cases

1. Single-center study

2. No external validation

3. Limited patient number

4. Appendix diameter not included

Lin et al. [25], 2023 (Taiwan)

CRP level, NLR, CT findings (fat-stranding sign, appendicolith, and ascites)

Data preprocessing involved rescaling the independent variables of AA patients to a range of 0 to 1. Patients were then randomly divided into training and testing datasets at a 70:30 ratio. A single hidden layer with three neurons, set as a predefined value, was chosen to avoid overfitting, as it was sufficient for the dataset

ANN model (MLP): AUC: 0.950, Sensitivity: 85.7%, Specificity: 91.7%, LR+: 10.36, LR−: 0.16

NR

1. A three-layer MLP with three hidden neurons performed well

2. Practical application would require an integrated system for immediate predictions after a CT scan

1. Single-center study

2. Broad definition of complicated appendicitis

3. Potential variation in definitions across studies

Eickhoff et al. [26], 2022 (Germany)

Age, gender, height, weight, and BMI, clinical-anamnestic data such as the ASA score, comorbidities, and perioperative data (time interval from admission to appendectomy, operative time, hemoglobin, CRP, WBC, platelets, INR, open surgery, laparoscopic surgery, conversion, extended surgical procedures during appendectomy, drains) as predictor variables

The dataset was split into 10 equal parts. 90% was used for training and 10% for validation. This process was repeated for all sections of the data, rotating the test sample. This was done 50 times for stable performance assessment

RF model:

Need for ICU (Accuracy: 77.2%, Sensitivity: 77.9%, Specificity: 76.9%)

Longer stay > 24 h in ICU (Accuracy: 87.5%, Sensitivity: 88.4%, Specificity: 87.4%)

Complications measured by Clavien-Dindo > 3 in new cases (Accuracy: 68.2%, Sensitivity: 61.6%, Specificity: 69.5%)

Re-operation after initial appendectomy (Accuracy: 74.2%, Sensitivity: 47.5%, Specificity: 77.2%)

Occurrence of surgical site infection (Accuracy: 66.4%, Sensitivity: 66.2%, Specificity: 66.4%)

Need for oral antibiotic therapy after discharge (Accuracy: 78.8%, Sensitivity: 76.4%, Specificity: 79.1%)

More than 7 days of hospital stay (Accuracy: 76.2%, Sensitivity: 74.3%, Specificity: 77.9%)

More than 15 days of hospital stay (Accuracy: 83.6%, Sensitivity: 60%, Specificity: 85.1%)

NR

1. Developed ML model for post-op outcomes in perforated appendicitis

2. The model predicts the need for intensive care

3. Suggests early transfer to higher-level care facilities

1. Single-center, retrospective study

2. Small sample size

Xia et al. [27], 2022 (China)

Gender, age, temperature, heart rate, WBC, lymphocytes, neutrophils, monocytes, eosinophils, hemoglobin, erythrocytes, platelets, urea nitrogen, blood sugar, creatinine, bilirubin, CRP

Used tenfold cross-validation for overall classification evaluation, and fivefold cross-validation for parameter optimization. Assessed using 12 benchmark functions

OBLGOA-SVM model: Accuracy: 83.6%, MCC: 67.3%, Sensitivity: 81.7%, Specificity: 85.3%

GOA-SVM model: Accuracy: 81%, MCC: 64%, Sensitivity: 78%, Specificity: 84%

GS-SVM model: Accuracy: 79%, MCC: 59%, Sensitivity: 72%, Specificity: 86%

RF model: Accuracy: 82%, MCC: 65%, Sensitivity: 82%, Specificity: 82%

ELM model: Accuracy: 77%, MCC: 55%, Sensitivity: 72%, Specificity: 81%

KELM model: Accuracy: 78%, MCC: 57%, Sensitivity: 71%, Specificity: 84%

BPNN model: Accuracy: 76%, MCC: 52%, Sensitivity: 75%, Specificity: 76%

1. Proposed OBLGOA-SVM framework for CA vs. UCA

2. Improved GOA for SVM parameters

3. Method outperformed rivals in evaluations

4. CRP, heart rate, temp, and neutrophils predict CA

1. No radiological findings (ultrasound, CT scans)

2. Insufficient cases from a single center

3. Uncontrolled, retrospective study

Kang et al. [28], 2021 (China)

Age, gender, clinical signs and symptoms score, abdominal pain score, vomiting score, abdominal pain time, abdominal pain type, abdominal tenderness pain range, and the highest temperature. Laboratory records: blood routine, coagulation function, blood biochemistry, WBC, NE, CD3+ T, CD4+ T, CD8+ T, CD19+ T, CD16+56, NK, total T cell counts, helper T cell counts, inhibitors T, B cell counts, NK cell counts, CD4+/CD8+ ratio, CRP, PCT, and blood NLR

LR models were created separately for SA/PA and PA/GPA groups using selected features from the training dataset. Clinical features were added to establish combined LR models. Models were then validated using testing sets

LR model:

Acute SA vs. acute PA (based on T cell subsets alone): training set (AUC: 0.904, Accuracy: 87.5%, Sensitivity: 75%, Specificity: 100%), testing set (AUC: 0.910, Accuracy: 87.5%, Sensitivity: 75%, Specificity: 100%)

Acute SA vs. acute PA (based on T cell subsets and clinical signs and symptoms): training set (AUC: 0.921, Accuracy: 91%, Sensitivity: 81.9%, Specificity: 100%), testing set (AUC: 0.926, Accuracy: 90.6%, Sensitivity: 81.2%, Specificity: 100%)

Acute PA vs. acute GPA (based on T cell subsets alone): training set (AUC: 0.834, Accuracy: 82.6%, Sensitivity: 81.9%, Specificity: 83.3%), testing set (AUC: 0.821, Accuracy: 80.6%, Sensitivity: 90.3%, Specificity: 71%)

Acute PA vs. acute GPA (based on T cell subsets and clinical signs and symptoms): training set (AUC: 0.867, Accuracy: 80.6%, Sensitivity: 73.6%, Specificity: 87.5%), testing set (AUC: 0.854, Accuracy: 77.4%, Sensitivity: 90.3%, Specificity: 64.5%)

NR

1. Established a quick diagnosis model using peripheral blood biomarkers for AA pathology

1. Limited cases

2. Single-center data source

3. The study could not fully prove biomarkers’ predictive value due to sample size and false positives

Bunn et al. [29], 2021 (USA)

Demographic, comorbid conditions, preoperative laboratory results, days, and procedure-related information

The dataset was split into 80% training and 20% hidden testing. Missing data were imputed using multivariable imputation for complete analysis

Postoperative sepsis prediction

LR model: AUC: 0.69, Sensitivity: 62%, Specificity: 65%

SVM model: AUC: 0.51

RFDT model: AUC: 0.69, Sensitivity: 67%, Specificity: 60%

XGB model: AUC: 0.70, Sensitivity: 64%, Specificity: 66%

Ensemble model (LR, RFDT, and XGB): AUC: 0.70, Sensitivity: 64%, Specificity: 60%

Postoperative sepsis prediction as a risk factor for 30-day mortality:

LR model: AUC: 0.92, Sensitivity: 82%, Specificity: 87%

SVM model: AUC: 0.5

RFDT model: AUC: 0.96, Sensitivity: 93%, Specificity: 84%

XGB model: AUC: 0.93, Sensitivity: 89%, Specificity: 85%

Ensemble model (LR, RFDT, and XGB): AUC: 0.95, Sensitivity: 89%, Specificity: 89%

NR

1. ML methods predict postoperative sepsis after appendectomy with moderate accuracy

2. Risk factors for postoperative sepsis: recent CHF exacerbation, acute renal failure, preoperative transfusion

1. High false positive rates in clinical implementation

2. The study focuses on non-septic cases, isolating early-stage disease

3. Missing intraoperative findings data

4. ML is used on a national database, not EHR data

5. ML does not outperform LR due to dataset quality

ML machine learning, AA acute appendicitis, IQR interquartile range, NA no appendicitis, SVM support vector machine, DT decision tree, KNN k-nearest neighbor, LR logistic regression, ANN artificial neural network, GB gradient boosting, NN neural network, RFDT random forest decision tree, RBF radial basis function, SOM self-organizing map, BP backpropagation, LVQ learning vector quantization, PEL pre-clustering based ensemble learning, CA complicated appendicitis, UCA uncomplicated appendicitis, SA simple appendicitis, PA purulent appendicitis, GPA gangrenous or perforated appendicitis, NB Naïve Bayes, SMOTE synthetic minority oversampling technique, MCC Matthews correlation coefficient, WBC white blood cell, CRP C-reactive protein, PCT procalcitonin, PMN polymorphonuclear, MSE mean squared error, XAI explainable artificial intelligence, ROC receiver operating characteristic, US ultrasonography, WLR white cell count to lymphocyte ratio, WNR within normal range, NLR neutrophil to lymphocyte ratio, ASA American Society of Anesthesiologists Classification, BMI body mass index, OBLGOA opposition-based learning grasshopper optimization algorithm, GS grid search, ELM extreme learning machine, KELM kernel extreme learning machine, BPNN backpropagation neural network, LR+ positive likelihood ratio, LR− negative likelihood ratio
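The positive and negative likelihood ratios listed in the table (LR+ and LR−) follow directly from sensitivity and specificity by the standard definitions LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The short Python sketch below illustrates that arithmetic; the function name `likelihood_ratios` is hypothetical and is not taken from any of the reviewed studies. It uses the sensitivity and specificity reported for the ANN model of Lin et al. [25] as example inputs.

```python
from typing import Tuple


def likelihood_ratios(sensitivity: float, specificity: float) -> Tuple[float, float]:
    """Return (LR+, LR-) for sensitivity and specificity given as fractions.

    LR+ = sensitivity / (1 - specificity)
    LR- = (1 - sensitivity) / specificity
    """
    if not 0.0 <= sensitivity <= 1.0:
        raise ValueError("sensitivity must lie in [0, 1]")
    if not 0.0 < specificity < 1.0:
        # specificity of 0 or 1 would make one of the two ratios undefined
        raise ValueError("specificity must lie strictly between 0 and 1")
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


# Example inputs: sensitivity 85.7% and specificity 91.7%,
# as reported in the table for the ANN model of Lin et al. [25].
lr_pos, lr_neg = likelihood_ratios(0.857, 0.917)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

With these rounded inputs the sketch gives an LR+ of about 10.3 and an LR− of about 0.16, close to the 10.36 and 0.16 reported in the table; the small gap in LR+ is presumably because the published ratios were computed from unrounded counts.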