More testing is needed to ensure accuracy across different groups
Machine learning (ML) seems to be the wave of the future in medicine and, when perfected, it will be a most valuable diagnostic asset. Right now, however, the technology is in its infancy and the kinks still have to be worked out.
Investigators in a recent study1 found that ML still falls short: the performance of models for detecting glaucoma was poorly reproduced across data sets from different ethnic groups, according to senior author Damon Wong, PhD, from the Singapore Eye Research Institute (SERI), Singapore National Eye Centre; SERI-Nanyang Technological University Advanced Ocular Engineering; and School of Chemical and Biomedical Engineering, Nanyang Technological University, all in Singapore; and the Institute of Molecular and Clinical Ophthalmology, Basel, Switzerland.
This result is in contrast to that of other studies2-6 that used ML approaches to detect glaucoma. While most of the studies reported high diagnostic accuracies (area under the receiver operating characteristic curve [AUC] = 0.88-0.98) for glaucoma detection, they did not assess the models with independently sampled data from a different ethnic group (external test), which limits the generalizability of the models across ethnicities,7 Dr Wong and colleagues explained.
In light of this deficiency, the investigators conducted a prospective, cross-sectional study to externally validate the ability of ML models to detect glaucoma using optical coherence tomography (OCT) images. A total of 514 Asian patients (257 with glaucoma and 257 controls without glaucoma) were enrolled to construct the ML models for glaucoma detection. The models were then evaluated in 356 Asian patients (183 with glaucoma and 173 controls without glaucoma) and also in 138 Caucasian patients (57 with glaucoma and 81 controls without glaucoma).
Two classifiers were built from retinal nerve fiber layer (RNFL) thickness values: one from compensated data and one from the original (measured) OCT data. The compensated values were produced by the compensation model, which the authors described as a multiple regression model fitted on healthy participants that corrects the RNFL profile for anatomic factors.
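For readers who want a feel for what such a compensation might look like, the following is a minimal sketch and not the authors' implementation. It assumes an ordinary least-squares regression of RNFL thickness on two illustrative anatomic factors, fitted on healthy eyes only, and then subtracts the anatomy-explained part from a measurement. The function names, the choice of factors, and all numbers are hypothetical.

```python
# Hypothetical sketch of a regression-based compensation step.
# Names, factors, and numbers are illustrative, not taken from the study.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_compensation(factors, rnfl):
    """Least-squares fit of RNFL thickness on anatomic factors (healthy eyes only)."""
    means = [sum(col) / len(col) for col in zip(*factors)]
    # Design matrix: intercept plus mean-centred factors.
    x = [[1.0] + [v - mu for v, mu in zip(row, means)] for row in factors]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(x, rnfl)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    return means, coef[1:]  # keep only the slopes

def compensate(measured, factor_values, means, slopes):
    """Subtract the part of the measurement explained by anatomy."""
    explained = sum(b * (v - mu) for b, v, mu in zip(slopes, factor_values, means))
    return measured - explained

# Made-up healthy eyes: (foveal distance, optic disc ratio) per eye.
healthy_factors = [(4.0, 1.0), (4.5, 2.0), (5.0, 1.5), (5.5, 3.0)]
# Made-up thicknesses that depend linearly on the two factors.
healthy_rnfl = [100 + 2 * (d - 4.75) - 3 * (r - 1.875) for d, r in healthy_factors]

means, slopes = fit_compensation(healthy_factors, healthy_rnfl)
compensated = [
    compensate(y, f, means, slopes) for y, f in zip(healthy_rnfl, healthy_factors)
]
print(slopes, compensated)
```

In this toy setup the anatomic variation is removed entirely, so every healthy eye ends up with the same compensated thickness. With real data the fit would only remove the portion of the variation that the chosen factors explain.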
With the exception of the foveal distance (P = .029), the investigators found no significant differences between the training dataset and the Asian test dataset (P ≥ .174).
They found significant differences in the demographic data between the training dataset and the Caucasian test dataset; the participants in the external test dataset were younger, more were female, and more eyes had mild and moderate glaucoma.
In addition, the ocular characteristics differed between those two datasets; specifically, the Caucasians had significantly shorter foveal distances, smaller foveal angles, less elliptical optic discs (ratio closer to 1.0), and thicker retinal vessel densities (P ≤ .009), the authors reported.
In the glaucoma dataset, the Caucasians had significantly less elliptical optic discs, higher optic disc orientations, and thicker retinal vessel densities. They were also more hyperopic and had greater RNFL thicknesses (P ≤ .001).
“Both the ML models (AUC = .96 and accuracy = 92%) outperformed the measured data (AUC = .93; P < .001) for glaucoma detection in the Asian dataset. However, in the Caucasian dataset, the ML model trained with compensated data (AUC = .93 and accuracy = 84%) outperformed the ML model trained with original data (AUC = .83 and accuracy = 79%; P < .001) and measured data (AUC = 0.82; P < .001) for glaucoma detection,” investigators reported.
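To make the two reported metrics concrete, here is a minimal sketch of how AUC and accuracy can be computed for an internal and an external test set. It does not use the study's data or code: the classifier scores below are invented, the 0.5 decision threshold is an assumption, and the AUC is computed in its pairwise form, as the fraction of glaucoma-control pairs in which the glaucoma eye receives the higher score.

```python
# Hypothetical sketch: scores and threshold are invented for illustration.

def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive case outscores a negative one."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))

def accuracy(scores_pos, scores_neg, threshold=0.5):
    """Fraction of eyes classified correctly at a fixed score threshold."""
    correct = sum(s >= threshold for s in scores_pos)
    correct += sum(s < threshold for s in scores_neg)
    return correct / (len(scores_pos) + len(scores_neg))

# Invented classifier outputs (higher means "more likely glaucoma").
internal_glaucoma = [0.9, 0.8, 0.7, 0.6]
internal_controls = [0.4, 0.3, 0.2, 0.65]
external_glaucoma = [0.9, 0.5, 0.45, 0.7]
external_controls = [0.6, 0.3, 0.55, 0.2]

internal_auc = auc(internal_glaucoma, internal_controls)
external_auc = auc(external_glaucoma, external_controls)
internal_acc = accuracy(internal_glaucoma, internal_controls)
external_acc = accuracy(external_glaucoma, external_controls)
print(internal_auc, external_auc, internal_acc, external_acc)
```

A drop in both numbers from the internal to the external set, as in this toy example, is the pattern the investigators describe for the model trained on original RNFL data.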
In commenting on their findings, Dr Wong and colleagues said, “The results showed poor reproducibility of the performance with the ML model trained on original RNFL data across different datasets. In contrast, the performance of the ML model trained on compensated RNFL seemed to be maintained. To the best of our knowledge, our study is the first to assess the performance of ML classifiers to detect glaucoma between ethnicities.”
Evaluating ML performance in different ethnic groups is the next critical step in determining the model’s generalizability to other populations, they explained, and advised that care must be exercised when the models are applied to cohorts of patients representing different ethnic groups.