Developing and validating of Ramathibodi Appendicitis Score (RAMA-AS) for diagnosis of appendicitis in suspected appendicitis patients

Background Diagnosis of appendicitis remains clinically challenging where resources are limited. The purpose of this study was to develop and externally validate the Ramathibodi Appendicitis Score (RAMA-AS) as an aid to the diagnosis of appendicitis. Methods A two-phase cross-sectional study (i.e., derivation and validation) was conducted at Ramathibodi Hospital (for derivation) and at Thammasat University Hospital and Chaiyaphum Hospital (for validation). Patients with abdominal pain who were suspected of having appendicitis were enrolled. Multiple logistic regression was applied to develop a parsimonious model. Calibration and discrimination performances were assessed. In addition, the performance of RAMA-AS was compared with that of the Alvarado score using ROC curve analysis. Results The RAMA-AS consisted of three domains with seven predictors: symptoms (i.e., progression of pain, aggravation of pain, and migration of pain), signs (i.e., fever and rebound tenderness), and laboratory tests (i.e., white blood cell count (WBC) and neutrophil). The model fitted the data well and discriminated better than the Alvarado score, with C-statistics of 0.842 (95% CI 0.804, 0.881) versus 0.760 (0.710, 0.810). Internal validation by bootstrap yielded a Somers' D of 0.686 (0.608, 0.763) and a C-statistic of 0.848 (0.846, 0.849). The C-statistics of the two external validations were 0.853 (0.791, 0.915) and 0.813 (0.736, 0.892), with fair calibration. Conclusion RAMA-AS should be a useful tool for aiding diagnosis of appendicitis with good calibration and discrimination performances. Electronic supplementary material The online version of this article (10.1186/s13017-017-0160-3) contains supplementary material, which is available to authorized users.


Background
Appendicitis is one of the most common causes of acute abdominal pain, with an incidence of 110/100,000 [1]. Although many attempts have been made to improve diagnostic accuracy, diagnostic errors remain common, with rates of negative appendectomy of 15 to 26% [2,3] and of perforated appendicitis of 10 to 30% [4].
The evaluation of suspected appendicitis should balance early operation, which minimizes complicated appendicitis (i.e., perforation, gangrene, and abscess), against a conservative approach, which reduces unnecessary operations. Several scores have been developed for screening of appendicitis, e.g., Alvarado [5], modified-Alvarado Fenyo [6], and Eskelinen [7]. A systematic review of previous appendicitis scores was conducted to explore the methods used for their development and validation and their performance [8]. Surprisingly, about two-thirds of those studies developed scores based on univariate analysis, and none had evaluated their impact on health outcomes in clinical practice [9]. Given the poor methodology of previous score developments, we conducted this study, which aimed to develop and externally validate the Ramathibodi Appendicitis Score (RAMA-AS).

Study design
The design was a cross-sectional study consisting of derivation and validation phases. Derivation data were collected at Ramathibodi Hospital (RH), whereas validation data were collected at Thammasat University Hospital (TH) and Chaiyaphum Hospital (CH) from January 2013 to May 2015. RH and TH are hospitals of Schools of Medicine, whereas CH is a provincial hospital.
The study was conducted and reported according to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis Or Diagnosis (TRIPOD) [10] and STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) [11] statements. Consecutive patients presenting with abdominal pain and suspected appendicitis were included if they met the following criteria: aged 15-60 years; right-sided abdominal pain within 7 days; at least one of the following symptoms (i.e., right lower abdominal pain, migration of abdominal pain, anorexia, nausea, vomiting) and signs (i.e., raised body temperature, right lower quadrant tenderness, guarding, rebound tenderness, and decreased bowel sound); and willingness to participate with consent given. Patients were excluded if they could not give a history of illness or had myocardial infarction, terminal illness, abdominal mass, or tumor or malignancy of the appendix.

Outcome and predictors
The outcome of interest was acute appendicitis, confirmed by histopathological diagnosis in operated patients. For patients managed conservatively, a telephone follow-up was made 6 weeks after the visit to confirm the final diagnosis.

Sample size
Based on our literature review, a total of 8-10 variables could potentially be included in the final risk prediction score. A simulation study indicated that at least 10 to 30 events per variable yielded less bias in coefficient estimation of logistic regression [12], which is known as a rule of thumb and is recommended [13]. Using a rule of thumb of at least 20 appendicitis patients per variable, 200 appendicitis patients were required for 10 variables. The prevalence of appendicitis in our setting was 62% in our pilot study. As a result, 323 patients were needed (200/0.62). Allowing for 20% missing data, at least 388 patients were finally required. In addition, 100 subjects (i.e., about 30% of derived subjects) were enrolled from each of the external sites for external validation.
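The arithmetic behind this sample size can be sketched as follows. This is a minimal illustration, not the authors' calculation: the function name is hypothetical, and it assumes that the allowance for missing data is applied by multiplying the number of patients by 1 + 0.20 and rounding up, which reproduces the final requirement of at least 388 patients.

```python
import math

def required_sample_size(n_variables, events_per_variable, prevalence, missing_rate):
    """Events-per-variable sample size (hypothetical helper).

    Returns (events, patients, patients_with_allowance).
    """
    # Number of appendicitis cases needed under the rule of thumb
    events = n_variables * events_per_variable
    # Scale up so that the expected number of cases is reached at the given prevalence
    patients = math.ceil(events / prevalence)
    # Assumption: the missing-data allowance inflates the total by (1 + missing_rate)
    inflated = math.ceil(patients * (1 + missing_rate))
    return events, patients, inflated

# Inputs taken from the text: 10 variables, 20 events per variable,
# 62% prevalence, 20% allowance for missing data
events, patients, inflated = required_sample_size(10, 20, 0.62, 0.20)
print(events, patients, inflated)
```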

Statistical analysis
Imputation
Multiple imputation was applied to predict missing values using a simulation-based approach, which assumed data were missing at random [14,15]. A linear truncated regression was applied by regressing missing data on complete data with 20 imputations, as recommended [16]. Performance of imputation can be assessed using the relative variance increase (RVI) and the fraction of missing information (FMI). The RVI is the average relative increase in the variances of the estimates due to missing values (i.e., averaged over all coefficients); the closer this value is to 0, the less the missing data affect the estimates. The FMI refers to the largest fraction of missing information of coefficient estimates due to missing data. The number of imputations can be roughly estimated from a rule of thumb, i.e., FMI × 100. For instance, if FMI = 0.15, the number of imputations = 0.15 × 100, i.e., at least 15 imputations are required.
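The FMI × 100 rule of thumb can be written as a one-line check. The sketch below is illustrative only: the function name is hypothetical, and the floor of one imputation for very small FMI values is an assumption that is not stated in the text.

```python
import math

def min_imputations(fmi):
    """Rule of thumb from the text: number of imputations >= FMI x 100."""
    if not 0.0 <= fmi <= 1.0:
        raise ValueError("FMI must lie between 0 and 1")
    # Round before taking the ceiling so that floating-point noise
    # (0.15 * 100 evaluates to 15.000000000000002) does not add an extra imputation.
    needed = math.ceil(round(fmi * 100, 9))
    # Assumption: at least one imputation is always required
    return max(1, needed)

print(min_imputations(0.15))
```

With FMI = 0.15 this returns the 15 imputations given in the worked example, so the 20 imputations used in the study exceed the rule-of-thumb minimum for any FMI up to 0.20.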

Derivation
A simple logistic regression analysis was used to screen variables that might be associated with appendicitis. Individual variables from 4 domains (i.e., demographic data, clinical symptoms, clinical signs, and laboratory tests) were fitted in a logit model, and a likelihood ratio (LR) test was used to select variables. Variables with p values < 0.20 were simultaneously considered in a multivariable logit model. Only significant variables were kept in the parsimonious model. Goodness of fit was assessed by whether the expected (E), i.e., predicted, and observed (O) values were close, using the Hosmer-Lemeshow chi-square test [17]. In addition, a calibration coefficient (O/E) and its 95% confidence interval (CI) were estimated. The coefficients of the final parsimonious model were used to create the RAMA-AS. The receiver operating characteristic (ROC) curve, which plots sensitivity versus 1-specificity, was used to select the score cutoffs. Diagnostic parameters (i.e., sensitivity, specificity, and positive (LR+) and negative (LR−) likelihood ratios) were estimated for each distinct value of the score. The area under the ROC curve, called the C-statistic, was estimated; a value close to one reflects higher discrimination of appendicitis from non-appendicitis [18].
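As an illustration of the diagnostic parameters described above, the following sketch computes sensitivity, specificity, LR+, and LR− at each distinct score value, together with the C-statistic. It is not the authors' Stata code: the function names and the eight-patient data set are hypothetical, and a score at or above the cutoff is assumed to count as test-positive.

```python
def diagnostic_table(scores, outcomes):
    """Return rows of (cutoff, sensitivity, specificity, LR+, LR-),
    treating score >= cutoff as test-positive (an assumption)."""
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    rows = []
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, outcomes) if s >= c and y == 1)
        fp = sum(1 for s, y in zip(scores, outcomes) if s >= c and y == 0)
        sens = tp / n_pos
        spec = 1 - fp / n_neg
        # LR+ = sensitivity / (1 - specificity); infinite when there are no false positives
        lr_pos = sens / (fp / n_neg) if fp > 0 else float("inf")
        # LR- = (1 - sensitivity) / specificity; undefined when specificity is 0
        lr_neg = (1 - sens) / spec if spec > 0 else float("nan")
        rows.append((c, sens, spec, lr_pos, lr_neg))
    return rows

def c_statistic(scores, outcomes):
    """Area under the ROC curve: the probability that a randomly chosen case
    scores higher than a randomly chosen non-case, with ties counted as one half."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in cases for b in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical scores and outcomes (1 = appendicitis), for illustration only
scores = [-1.2, -0.5, 0.3, 0.9, 1.1, 1.8, 2.4, 3.0]
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]
for row in diagnostic_table(scores, outcomes):
    print(row)
print(c_statistic(scores, outcomes))
```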

Validation
Internal validation A bootstrap technique with 450 replications was applied for internal validation of the RAMA-AS [19]. For each bootstrap sample, the RAMA-AS was calculated and fitted in the logit model. For calibration, the correlation between the observed and expected values of appendicitis was assessed using the Somers' D coefficient for all bootstrap data (called D_boot) and for the derivation data (called D_org). Calibration of the model was then assessed by subtracting D_org from the mean D_boot; a lower value reflected less bias and thus better calibration. Likewise, the original C-statistic was compared with the average C-statistic from the bootstraps for discrimination performance.
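The bootstrap comparison of D_boot with D_org can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' procedure: the function names, the seed, and the data are hypothetical, and the logit model is not refitted in each replicate. Somers' D depends only on the ranking of the predictions, so a single score entered with a positive coefficient gives the same D as the score itself.

```python
import random

def somers_d(scores, outcomes):
    """Somers' D for a binary outcome: (concordant - discordant) pairs divided by
    all case/non-case pairs. It relates to the C-statistic by D = 2C - 1."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    net = sum((a > b) - (a < b) for a in cases for b in controls)
    return net / (len(cases) * len(controls))

def bootstrap_d(scores, outcomes, n_boot=450, seed=2013):
    """Return (D_org, mean D_boot, mean D_boot - D_org), resampling patients
    with replacement. The seed is an arbitrary choice for reproducibility."""
    rng = random.Random(seed)
    n = len(scores)
    d_org = somers_d(scores, outcomes)
    d_values = []
    while len(d_values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [outcomes[i] for i in idx]
        # Skip degenerate resamples without both cases and non-cases
        if 0 < sum(ys) < n:
            d_values.append(somers_d([scores[i] for i in idx], ys))
    d_boot = sum(d_values) / n_boot
    return d_org, d_boot, d_boot - d_org

# Hypothetical scores and outcomes (1 = appendicitis), for illustration only
scores = [-1.2, -0.5, 0.3, 0.9, 1.1, 1.8, 2.4, 3.0]
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]
d_org, d_boot, bias = bootstrap_d(scores, outcomes)
print(d_org, d_boot, bias)
```

A difference close to zero between the mean D_boot and D_org is what the text interprets as less bias.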
External validation Data from the two external hospitals were used to validate the performance of RAMA-AS. Calibration performance was explored as mentioned above. In addition, the model was re-calibrated by updating the intercept (called M1) and the overall coefficient (called M2) [20,21] as follows (see Additional file 1: Table S1). The M1 was constructed by fitting RAMA-AS on appendicitis. The estimated intercept was then used for re-calibration by adding it to the original intercept. The estimated coefficient from the M1 was then used to calibrate the coefficients by multiplying it with the overall coefficients (M2). Four model revisions were additionally performed from the M2 [10, 21-23] (see Additional file 1: Table S1). The M3 was constructed by fitting M2 plus predictors that were significant by LR test. The M4 was similar to M3 but added predictors that were significant by stepwise selection. The M5 re-estimated all coefficients of the predictors. Finally, the M6 re-selected only significant predictors among all predictors.
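The intercept (M1) and overall-coefficient (M2) updates can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' Stata implementation: the function names and the data are hypothetical, the RAMA-AS is treated as the original linear predictor on the logit scale, and the logit model of appendicitis on the score is fitted with a hand-written Newton-Raphson routine.

```python
import math

def fit_logit(x, y, n_iter=25):
    """Fit logit P(y = 1) = a + b * x by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian)
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system explicitly
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def o_over_e(lp, outcomes):
    """Observed / expected number of cases for a linear predictor on the logit scale."""
    expected = sum(1 / (1 + math.exp(-v)) for v in lp)
    return sum(outcomes) / expected

def recalibrate(lp, outcomes):
    """Return (a, b, M1 linear predictor, M2 linear predictor)."""
    a, b = fit_logit(lp, outcomes)
    m1 = [v + a for v in lp]      # M1: add the estimated intercept only
    m2 = [a + b * v for v in lp]  # M2: also multiply the overall coefficient by b
    return a, b, m1, m2

# Hypothetical validation data: original scores and outcomes (1 = appendicitis)
lp = [-1.2, -0.5, 0.3, 0.9, 1.1, 1.8, 2.4, 3.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]
a, b, m1, m2 = recalibrate(lp, y)
print(round(o_over_e(lp, y), 3), round(o_over_e(m2, y), 3))
```

Because M2 is the maximum likelihood fit on the validation data, its expected number of cases equals the observed number, so its O/E ratio is 1 by construction; the calibration plot is still needed to judge agreement across the range of scores.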
Finally, the Alvarado score [5] was compared with the RAMA-AS using ROC curve analysis.
All analyses were performed using STATA version 14 (Stata Corp, College Station, Texas, USA) with the mi estimate commands. A p value of less than 0.05 was taken as the threshold for statistical significance.

Imputation
Two variables (i.e., WBC > 10,000 cells/mm³ and neutrophil > 75%) had missing data in 43 (10.9%) and 40 (10.1%) patients, respectively, and imputed data were filled in for both variables. Performance of imputation was assessed, and the FMI was < 0.0001 for both variables, indicating that 20 imputations were sufficient to fill in the missing data, see Additional file 2: Table S2. The diagnostic plot, constructed by comparing missing versus observed values, suggested no difference between the two, see Additional file 2: Figure S1.

Model development
Derivation
Univariate analysis suggested that a total of 15 out of 20 predictive variables might be associated with appendicitis, see Table 1. These included eight symptoms (i.e., first location of pain, migration of pain, onset, progression of pain, right lower quadrant pain at presentation, nausea or vomiting, aggravation of pain by cough or movement, and fever), five signs (i.e., bowel sound, body temperature, tenderness at right lower quadrant of abdomen, rebound tenderness, and guarding), and two laboratory tests (i.e., WBC > 10,000 cells/mm³ and neutrophil > 75%).
These variables were simultaneously included in the logit model, of which only seven remained in the final model. These were three symptoms (i.e., migration of pain, progression of pain, and aggravation of pain by cough or movement), two signs (i.e., body temperature ≥ 37.8°C and rebound tenderness), and two laboratory tests (i.e., WBC > 10,000 cells/mm³ and neutrophil > 75%). Odds ratios (OR) and 95% CIs were reported, see

Internal validation
The 450 bootstraps yielded estimated D_org and D_boot coefficients of 0.686 and 0.695 (95% CI 0.692, 0.698) for

External validation
A total of 330 patients with suspected acute appendicitis (152 and 178 from TH and CH, respectively) were used to externally validate the RAMA-AS. Their characteristics are described in Table 4. The estimated RAMA-AS, which ranged from −3.4 to 4.0, seemed to work well in TH, with an estimated O/E ratio of 1.005 (95% CI 0.784, 1.225; Hosmer-Lemeshow = 8.219, df = 4, p = 0.084). However, the calibration plot showed that the predicted risk deviated from the reference line (see Additional file 4: Figure S3-A), i.e., risk was under-estimated for lower scores and over-estimated for higher scores. The intercept and overall coefficients were then calibrated (see Additional file 1: Table S4), and calibration plots were constructed (see Additional file 4: Figure S3-B-C), which suggested no improvement in calibration.
The revised M3 model by LR test indicated that migration of pain, progression of pain, body temperature, WBC, and neutrophil were significant predictors, see Additional file 1: Table S4. Comparing the coefficients of M3 with those of the original RH model in Table 2, the coefficients of body temperature, WBC, and neutrophil changed from positive to negative, whereas the coefficients of the remaining predictors increased. Only migration of pain, progression of pain, and rebound tenderness were significant by stepwise selection for M4. Of these, the coefficients of progression of pain and rebound tenderness were much lower, but that of migration of pain was higher, than in RH, see Table 2 and Additional file 1: Table S4.
Calibration coefficients of these models were estimated, which resulted in O/E ratios for the revised M3 and M4 models of 0.940 (95% CI 0.729, 1.150; Hosmer-Lemeshow = 2.683, df = 4, p = 0.612) and 1.006 (95% CI 0.743, 1.269; Hosmer-Lemeshow = 5.00, df = 4, p = 0.287), respectively, which were much improved when compared to the M0. Calibration plots also showed better fits with the reference lines when compared to the M0, see Additional file 4: Figure S3 A, D-E. The M5, which entered all seven predictors, and the stepwise selection in M6 yielded similar results to M4, in which only three predictors (i.e., migration of pain, progression of pain, and rebound tenderness) were significant, see Additional file 4: Figure S3 F-G. C-statistics were estimated for all models, see Additional file 1: Table S5. These suggested that the M0 could discriminate appendicitis from non-appendicitis well, with a C-statistic of 0.853 (95% CI 0.790, 0.915), and that discrimination was slightly improved for M3, M4, and M6, but not for M5, see Additional file 1: Table S5.
The median RAMA-AS was 1.6 (−3.4, 4.0), with an O/E ratio of 0.996 (95% CI 0.695, 1.333; Hosmer-Lemeshow = 6.640, df = 4, p = 0.156), see Additional file 1: Table S5. Calibration models were constructed (see Additional file 1: Table S4) and plotted (see Additional file 5: Figure S4 A-G). These suggested that the M0 still deviated from the reference line, particularly for low and high scores. M1 and M2 did not improve calibration when compared to the original M0. Among the revision models (M3-M6), M3, M4, and M6 showed improved calibration; in particular, M6 was the best, with an O/E ratio of 1.021 (95% CI 0.905, 1.186), whereas the calibration plot of M5 showed quite poor performance.
The M0's discrimination performance was good (C-statistic = 0.813; 95% CI 0.736, 0.892), although it was lower than that of the original model. The C-statistics for M3 to M6 were slightly higher than that of M0, see Additional file 1: Table S5.

Comparison of RAMA-AS and previous score
Alvarado scores were calculated and ranged from 2 to 10 (mean = 7.04). The C-statistic was 0.760 (95% CI 0.710, 0.810), which was significantly lower than that of RAMA-AS (p value < 0.001, see Fig. 2).

Discussion
We developed and both internally and externally validated the RAMA-AS for classifying patients as at very low, low, moderate, or high risk of having appendicitis. Predictors from three domains, comprising three symptoms, two signs, and two laboratory tests, were included. Internal validation showed that the RAMA-AS performed well in both calibration and discrimination. Although most of the clinical signs, symptoms, and laboratory tests used in the RAMA-AS were similar to those in the Alvarado score, which was the most commonly used in prospective studies [6, 24-29], our score performed better. This might be due to differences in the weighting or scoring of each predictor, the distribution of predictors, and the prevalence of appendicitis itself. Our score was derived using proper model construction, following the TRIPOD recommendations [10], and let the data suggest proper weighting. Our finding was consistent with the appendicitis inflammatory response (AIR) score [30], developed in 2008, which performed better than the Alvarado score in external validation. This score did not consider WBC and neutrophil, but instead included leukocyte and CRP in the model [30,31]; CRP may not be a routine laboratory test in some developing countries. Thus, it is not easily applied in settings where resources are limited. Our RAMA-AS and these scores could rule out appendicitis well, but not rule it in, as per the WSES Jerusalem guidelines [30], so a high risk score may need confirmation by CT scan [31].
Calibration performance of RAMA-AS was fair in both external data sets. This could be explained as follows. First, the prevalences of appendicitis in the derivation (RH) and validation (TH and CH) data were considerably different, i.e., 61.8 vs 48.7 vs 76.9%, respectively. Therefore, the original model over-estimated the risk of appendicitis in TH but under-estimated it in CH. We then re-calibrated the intercept in the M1 models by subtracting the estimated intercept from, or adding it to, the original intercept (i.e., baseline risk) for TH and CH, respectively. These models were still not well calibrated; we thus went further and recalibrated the overall coefficient (M2), but this did not improve calibration much. Second, differences in the distributions of predictors between appendicitis groups across data sources may also play a role. For instance, all symptoms and signs were more frequently present in appendicitis than in non-appendicitis groups in both external hospitals, but this was not the case for WBC and neutrophil. The model revisions showed much improvement, and either M4 or M6 could be used for both TH and CH. Only two symptoms and one sign contributed to prediction in both hospitals; therefore, a predictive score containing only three symptoms (migration of pain, progression of pain, aggravation of pain) and one sign (rebound tenderness) without laboratory tests is proposed. Its calibration and discrimination performance was very similar to that of M6 (data not shown). Although the RAMA-AS did not perform as well in the external data as in the derivation data, it could still discriminate appendicitis from non-appendicitis well in both the provincial setting (CH) and the School of Medicine setting (TH).

Using the RAMA-AS in practice
Our RAMA-AS should be applicable in general hospitals where resources are limited. Data on the seven variables can be collected from the physical examination, an interview, and a CBC test. The RAMA-AS is easily applied by entering the data into the equation. The probability of appendicitis is then estimated for each risk stratum using a Fagan nomogram. In addition, the score can be straightforwardly classified as very low (score < −0.64), low (score −0.64 to 0.84), moderate (score 0.85 to 1.74), and high risk (score > 1.74) of having appendicitis. In the ROC analysis, these cut-off thresholds were objectively selected based on LR+ (i.e., sensitivity/(1-specificity)), which is less biased than subjective selection [32]. Although our score could discriminate appendicitis from non-appendicitis well according to the C-statistics, clinical findings should also be incorporated in further decision making. Imaging investigation may be needed for moderate to high scores [31]. Counting the number of positive signs, symptoms, and laboratory results can also be applied. For instance, a patient is at low risk of appendicitis if positive only for all items of signs, of symptoms, or of laboratory tests; for 1 item in each of the 3 domains; for 2 items among the 3 domains (i.e., 1 symptom and 1 sign, 1 symptom and 1 laboratory test, or 1 sign and 1 laboratory test); for 3 symptoms with 1 laboratory test without any sign; or for 3 symptoms plus one sign without any laboratory test. The post-test probability would be 76.0%, so out-patient observation is recommended. Moderate risk requires three symptoms plus the sign of body temperature ≥ 37.8°C, or three symptoms plus two laboratory tests without any sign. The post-test probability is from 85.0 to 93.0% for moderate risk, so other investigations such as ultrasound or CT scan may be needed for these patients.
The high risk group requires all symptoms and signs; all symptoms plus one sign and one laboratory test; all symptoms plus two signs plus either laboratory test; or three symptoms plus two laboratory tests plus either sign. The post-test probability is about 93.0%, and thus surgical treatment should be performed for high risk patients.
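The score-based stratification and the calculation that a Fagan nomogram performs graphically can be sketched as follows. This is an illustration only: the function names are hypothetical, scores falling between 0.84 and 0.85 are assumed to belong to the moderate group because the text leaves that interval unassigned, and the likelihood ratio in the usage example is an arbitrary value rather than a result from the study.

```python
def risk_group(score):
    """Map a RAMA-AS value to a risk group using the cut-offs given in the text.
    Assumption: values between 0.84 and 0.85 are treated as moderate risk."""
    if score < -0.64:
        return "very low"
    if score <= 0.84:
        return "low"
    if score <= 1.74:
        return "moderate"
    return "high"

def post_test_probability(pre_test_probability, likelihood_ratio):
    """Bayes' rule in odds form: post-test odds = pre-test odds x likelihood ratio."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Pre-test probability of 61.8% is the prevalence reported for the derivation data;
# the likelihood ratio of 2.0 is a hypothetical value for illustration.
print(risk_group(1.2))
print(post_test_probability(0.618, 2.0))
```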
Our study has some strengths. We followed the recommendations for developing risk prediction scores by Altman et al. [33] and TRIPOD [10]. We developed and both internally and externally validated the score using prospectively collected data. Missing data were imputed, even though they occurred in only a few variables, which should yield better performance of the risk prediction model than a complete-case analysis [34]. The RAMA-AS showed good performance in both calibration and discrimination in the derivation setting, although one external setting had lower discrimination performance.
However, some limitations could not be avoided. The study was conducted at tertiary hospitals where the prevalence of appendicitis was high. The RAMA-AS should be further validated in different populations and settings. To improve generalizability, studies using big electronic health data or individual patient data meta-analysis should be conducted [35]. The clinical impact of the RAMA-AS should also be further assessed, for instance, by applying the score in routine clinical practice, which would show whether our score can still rule appendicitis in or out well in suspected patients. These suspected patients may be only observed, treated with operation, or given non-operative treatment such as antibiotics. A previous cohort study showed the long-term success and safety of antibiotics in suspected appendicitis [36]. However, this evidence came from an observational study, which is prone to selection bias. A randomized controlled trial with appropriate methods should be conducted to test whether non-operative treatment is non-inferior to operation [37].