Machine learning-based models can predict diabetes incidence across multiple ethnicities

In a recent study published in eClinicalMedicine, researchers developed questionnaire-based models to predict the incidence and prevalence of type 2 diabetes mellitus (T2D) in different ethnicities.

Study: Effective questionnaire-based prediction models for type 2 diabetes in multiple ethnicities: a model development and validation study. Image Credit: NicoElNino/Shutterstock.com


Screening and predictive technologies are important for early detection and management of T2D, especially in non-white individuals. These individuals suffer from a complex combination of conditions leading to early onset and associated outcomes.

Machine learning (ML)-based technology can provide non-invasive screening and allow early assessment and referral, ultimately promoting population health and reducing healthcare expenditures.

About the study

In the present study, researchers developed questionnaire-based models to predict T2D incidence and prevalence, training them on United Kingdom Biobank (UKBB) data and validating them on Lifelines study data, for use in both white and non-white individuals.

The questionnaire-based algorithm was trained using UKBB’s white population data. The potential clinical value of the algorithm for predicting T2D incidence was assessed by comparing it with two other models (which added variables such as physical measures and biological markers) and with the gold-standard model. Logistic regression modeling was performed to predict T2D incidence and prevalence.
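As a rough illustration of how a logistic regression model turns questionnaire answers into a predicted probability, consider the Python sketch below. The feature names, coefficients, and intercept are hypothetical placeholders chosen for illustration; they are not the values fitted in the study.

```python
import math
from typing import Mapping

# Hypothetical coefficients for illustration only; NOT the study's fitted values.
INTERCEPT = -6.0
COEFFICIENTS = {
    "bmi": 0.10,               # per kg/m^2
    "num_medications": 0.20,   # per medication taken
    "tv_hours_per_day": 0.05,  # per hour of television watched
}


def predict_risk(answers: Mapping[str, float]) -> float:
    """Return a probability in (0, 1) from a logistic regression model."""
    # Weighted sum of the answers plus the intercept (the "linear predictor").
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * answers[name] for name in COEFFICIENTS
    )
    # The logistic (sigmoid) function maps the linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-linear))


low = predict_risk({"bmi": 22.0, "num_medications": 0, "tv_hours_per_day": 1.0})
high = predict_risk({"bmi": 35.0, "num_medications": 5, "tv_hours_per_day": 4.0})
print(f"lower-risk profile:  {low:.3f}")
print(f"higher-risk profile: {high:.3f}")
```

With these made-up weights, the second profile receives a higher predicted probability than the first, which is the only property the sketch is meant to show.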

The training dataset included white individuals participating in the UKBB study (472,696 individuals aged 37 to 73 years, data obtained from 2006 to 2010). The models were validated in five non-white ethnicities (29,811 individuals), with external validation using Lifelines data (168,205 individuals aged 0 to 93 years, data obtained from 2006 to 2013).

Feature selection was done for model development. The area under the receiver operating characteristic (ROC) curve (AUC) was used to measure predictive accuracy, and sensitivity analysis was performed to assess potential clinical value.
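For readers unfamiliar with the AUC, it can be read as the probability that a randomly chosen person with T2D receives a higher predicted risk than a randomly chosen person without it. The sketch below computes it directly from that definition in plain Python; the labels and scores are made-up toy data, and the function name `roc_auc` is only illustrative.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random case outranks a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (made up): 1 = has T2D, 0 = does not.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.05, 0.20, 0.30, 0.40, 0.70, 0.90]
print(round(roc_auc(labels, scores), 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of participants, so a rank-based implementation would be preferable for cohort-sized data.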

Additionally, a reclassification analysis was conducted comparing questionnaire-only prediction models with models that included biomarkers and physical measures, and with clinical T2D risk instruments.

T2D was identified among training group participants using self-reported data, which included physician-made T2D diagnoses, or hospital records with International Classification of Diseases, Ninth Revision (ICD-9) diagnostic codes.

Validation group participants were classified as having incident or prevalent type 2 diabetes based on self-report.

According to National Institute for Health and Care Excellence (NICE) recommendations, the thresholds for “potentially undiagnosed” T2D in the training and validation datasets were blood glucose levels above 7.0 mmol/L or glycated hemoglobin (HbA1c) levels above 48 mmol/mol. Individuals with “potentially undiagnosed” T2D were excluded from the analysis to reduce bias in the prevalence analysis.
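The exclusion rule described above can be sketched as a small Python check. The thresholds are the ones quoted in the text, with HbA1c expressed in its conventional unit of mmol/mol; the function name, argument names, and the handling of missing measurements are assumptions made for illustration, not details reported by the study.

```python
from typing import Optional

GLUCOSE_THRESHOLD_MMOL_L = 7.0   # blood glucose, mmol/L
HBA1C_THRESHOLD_MMOL_MOL = 48.0  # HbA1c, mmol/mol


def potentially_undiagnosed(
    has_t2d_diagnosis: bool,
    glucose_mmol_l: Optional[float],
    hba1c_mmol_mol: Optional[float],
) -> bool:
    """Flag undiagnosed participants whose markers exceed either threshold.

    Assumption: a missing measurement (None) is treated as not exceeding
    the threshold.
    """
    if has_t2d_diagnosis:
        return False
    high_glucose = (
        glucose_mmol_l is not None and glucose_mmol_l > GLUCOSE_THRESHOLD_MMOL_L
    )
    high_hba1c = (
        hba1c_mmol_mol is not None and hba1c_mmol_mol > HBA1C_THRESHOLD_MMOL_MOL
    )
    return high_glucose or high_hba1c


print(potentially_undiagnosed(False, 7.5, 40.0))   # glucose above threshold
print(potentially_undiagnosed(False, 5.2, 38.0))   # both markers below threshold
```

Both comparisons are strict, matching the word “above” in the text, so a value exactly at a threshold is not flagged.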

In addition, the researchers excluded individuals whose T2D was diagnosed more than eight years after baseline, as well as individuals who did not develop T2D and did not return to the assessment center after eight years.

The researchers also validated two non-laboratory clinical tools, the Finnish Diabetes Risk Score (FINDRISC) and the Australian T2D Risk Assessment Tool (AUSDRISK), which use nine and 13 features, respectively, covering medical history, demographics, lifestyle, and anthropometry, to predict incident T2D.


To assess T2D incidence and prevalence, 67,083 and 631,748 individuals were included, respectively. Of note, T2D incidence and prevalence rates differed significantly between non-white and white individuals: non-white groups showed up to 4.0 times higher prevalence (between 12% and 23%) and 0.5 to 3.0 times the incidence (between 1.4% and 8.2%) of the white UKBB population (6.0% and 2.8%, respectively).

On the other hand, Lifelines displayed lower T2D prevalence (2%) and incidence (2%) than the white UKBB population, which can be partially explained by age disparities between the two populations.

In the white UKBB sample, the algorithm predicted T2D prevalence (AUC of 0.9) and incidence at eight years (AUC of 0.9) with high accuracy.

Both models replicated well in the Lifelines external validation, with AUC values of 0.8 and 0.9 for incidence and prevalence, respectively.

Both ML-based models performed consistently well across different ethnicities, with AUC values between 0.86 and 0.89 for prevalence and between 0.82 and 0.88 for T2D incidence.

The models generally outperformed clinically validated non-laboratory tools, correctly reclassifying approximately 3,000 additional cases. Adding biological markers, but not physical data, increased model performance.

The prevalence and incidence models both gave high importance to body mass index (BMI) and the number of medications used, placing them among the top three features of each model. In addition, the incidence model included a sedentary behavior feature (time spent watching television).

In predicting prevalence and incidence across diverse demographics, Lifelines’ questionnaire-based ML model outperformed FINDRISC and AUSDRISK.

The questionnaire-only model achieved a good sensitivity–specificity balance, positive predictive value (PPV), and negative predictive value (NPV) for all populations. The sensitivity–specificity balance was improved in models that included biomarkers, resulting in higher PPV across groups.
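The four screening metrics mentioned above all come from the same two-by-two table of predicted versus actual status. The Python sketch below shows the standard definitions; the labels and predictions are made-up toy data, and the function name `screening_metrics` is only illustrative.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def screening_metrics(
    labels: Sequence[int], predictions: Sequence[int]
) -> Dict[str, float]:
    """Compute sensitivity, specificity, PPV, and NPV from binary labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    return {
        "sensitivity": _ratio(tp, tp + fn),  # cases correctly flagged
        "specificity": _ratio(tn, tn + fp),  # non-cases correctly cleared
        "ppv": _ratio(tp, tp + fp),          # flagged people who are cases
        "npv": _ratio(tn, tn + fn),          # cleared people who are non-cases
    }


# Toy data (made up): 1 = T2D, 0 = no T2D.
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = screening_metrics(labels, predictions)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity describe the model itself, whereas PPV and NPV also depend on how common T2D is in the screened population, which is one reason the same model can show different PPV across groups.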

The model correctly classified more cases than clinically validated prediction tools, with statistical significance for the White, Caribbean, Other, and South Asian populations. Compared with the questionnaire model, more cases were correctly classified in Lifelines when physical data were added. In almost every case, biomarker-based models outperformed clinical methods.

Overall, the study findings showed that ML models trained on UK Biobank data successfully predicted T2D prevalence and incidence across all ethnicities, including non-white individuals.

These models outperformed existing methods, resulting in an accurate, scalable, cost-effective strategy for identifying positive instances and predicting risk.
