- Predicting Clinical Outcomes at the Toronto General Hospital Transitional Pain Service via the Manage My Pain App: Machine Learning Approach
Background: Chronic pain is a complex condition that affects more than a quarter of people worldwide. The development and progression of chronic pain are unique to each individual due to the contribution of interacting biological, psychological, and social factors. The subjective nature of the experience of chronic pain can make its clinical assessment and prognosis challenging. Personalized digital health apps, such as Manage My Pain (MMP), are popular pain self-tracking tools that can also be leveraged by clinicians to support patients. Recent advances in machine learning technologies open an opportunity to use data collected in pain apps to make predictions about a patient’s prognosis. Objective: This study applies machine learning methods to real-world user data from the MMP app to predict clinically significant improvements in pain-related outcomes among patients at the Toronto General Hospital Transitional Pain Service (TPS). Methods: Information entered into the MMP app by 160 TPS patients over a one-month period, including profile information, pain records, daily reflections, and clinical questionnaire responses, was used to extract 245 relevant variables, referred to as features, for use in a machine learning model. The machine learning model was developed using logistic regression with recursive feature elimination to predict clinically significant improvements in pain interference, assessed by the PROMIS Pain Interference 8a v1.0 questionnaire. The model was tuned and important features were selected using 10-fold cross-validation. Leave-one-out cross-validation was used to test the model’s performance. Results: The model predicted patient improvement in pain interference with 79% accuracy and an area under the receiver operating characteristic curve (AUC) of 0.82. It showed balanced class accuracies between improved and non-improved patients, with a sensitivity of 0.76 and a specificity of 0.82.
Feature importance analysis indicated that all MMP app data, not just clinical questionnaire responses, were key to classifying patient improvement. Conclusions: This study demonstrates that data from a digital health app can be integrated with clinical questionnaire responses in a machine learning model to effectively predict which chronic pain patients will show clinically significant improvement. The findings emphasize the potential of machine learning methods in real-world clinical settings to improve personalized treatment plans and patient outcomes.
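The modeling pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a scikit-learn-style workflow; the sample size matches the abstract (160 patients), but the feature counts, elimination step, and hyperparameters are placeholders, not the study's actual configuration.

```python
# Sketch: logistic regression with recursive feature elimination (RFE),
# evaluated with leave-one-out cross-validation, as in the study design.
# All data here is synthetic and illustrative only.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the patients-by-features matrix (160 x 50 here).
X, y = make_classification(n_samples=160, n_features=50, n_informative=8,
                           random_state=0)

pipe = make_pipeline(
    StandardScaler(),
    # RFE repeatedly drops the weakest features; the kept count is a placeholder.
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=10, step=10),
    LogisticRegression(max_iter=1000),
)

# Leave-one-out CV: each patient is held out once as the test case.
acc = cross_val_score(pipe, X, y, cv=LeaveOneOut()).mean()
print(f"LOOCV accuracy: {acc:.2f}")
```

In practice the feature subset size would itself be tuned (e.g., with 10-fold cross-validation, as the abstract describes) rather than fixed in advance.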
- Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework
Background: Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on BERT-based methods or manual expert annotations, which have limitations in terms of scalability and performance. Objective: To evaluate the effectiveness of a GPT-based large language model (LLM) in labeling radiology reports, comparing it with two existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC-CXR). Methods: In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. Our model's performance was evaluated on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. The performance of our LLM model was compared with the CheXbert and CheXpert models across positive, negative, and uncertainty extraction tasks. Paired t-tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances. Results: Our GPT-based model achieved an average F1 score of 0.9014 for all certainty levels and 0.8708 for positive/negative certainty levels, outperforming CheXpert (F1 scores of 0.8864 and 0.8525, respectively) and performing comparably to CheXbert (F1 scores of 0.9047 and 0.8733, respectively). Paired t-tests revealed no statistically significant difference between our model and CheXbert (P = 0.3483), but a significant difference between our model and CheXpert (P = 0.0114). The Wilcoxon test also confirmed these results: no significant difference between our model and CheXbert (P = 0.1353) and a significant difference between our model and CheXpert (P = 0.0052). Conclusions: The GPT-based LLM model demonstrates competitive performance compared to CheXbert and outperforms CheXpert in radiology report labeling. 
These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. Moreover, their much longer context windows make LLM-based models better suited to this task than BERT-based models, whose context lengths are comparatively small. Clinical Trial: Not applicable.
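The paired significance tests reported above can be illustrated as follows. The F1 arrays are synthetic placeholders for 14 per-pathology scores from two labelers, not the study's actual results.

```python
# Sketch: paired t-test and Wilcoxon signed-rank test on per-pathology
# F1 scores, as used to compare labeling models in the study.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

rng = np.random.default_rng(0)
# Hypothetical F1 scores for 14 thoracic pathologies from two models.
f1_llm = rng.uniform(0.80, 0.95, size=14)
f1_baseline = f1_llm - rng.uniform(0.0, 0.05, size=14)  # slightly worse

t_stat, t_p = ttest_rel(f1_llm, f1_baseline)   # paired t-test
w_stat, w_p = wilcoxon(f1_llm, f1_baseline)    # Wilcoxon signed-rank test
print(f"paired t-test P = {t_p:.4f}; Wilcoxon P = {w_p:.4f}")
```

Both tests operate on the paired differences per pathology, so they require the two models to be evaluated on the same label set.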
- Biases in Race and Ethnicity Introduced by Filtering Electronic Health Records for “Complete Data”: Observational Clinical Data Analysis
Background: Integrated clinical databases from national biobanks have advanced the capacity for disease research. Data quality and completeness filters are used when building clinical cohorts to address limitations of data missingness. However, these filters may unintentionally introduce systemic biases when they are correlated with race and ethnicity. Objective: In this study, we examined the race/ethnicity biases introduced by applying common filters to four clinical records databases. Methods: We applied 19 filters commonly used in electronic health records research, based on the availability of demographics, medication records, visit details, observation periods, and other data types. We evaluated the effect of applying these filters on self-reported race and ethnicity across four databases comprising approximately 12 million patients. Results: Applying the observation period filter led to a substantial reduction in data availability across all races and ethnicities in all four datasets. However, among the groups examined, data availability in the white group remained consistently higher than in other racial groups after applying each filter. Conversely, the Black/African American group was the most affected by each filter in three of the datasets: the Cedars-Sinai dataset, the UK Biobank, and the Columbia University dataset. Conclusions: Our findings underscore the importance of using only necessary filters, as they may disproportionately affect data availability for minoritized racial and ethnic populations. Researchers must consider these unintentional biases when performing data-driven research and explore techniques to minimize their impact, such as probabilistic methods or the use of machine learning and artificial intelligence.
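The core bias assessment described above can be sketched with a toy cohort: compare a cohort's race/ethnicity distribution before and after a completeness filter. The records and the filter column are fabricated for demonstration only.

```python
# Sketch: measure how a "complete data" filter shifts the race/ethnicity
# distribution of a cohort. Toy data; real analyses span millions of patients.
import pandas as pd

patients = pd.DataFrame({
    "race": ["White", "White", "White", "Black", "Black", "Asian"],
    "has_medication_record": [True, True, True, True, False, False],
})

# Distribution before and after requiring medication records.
before = patients["race"].value_counts(normalize=True)
after = patients.loc[patients["has_medication_record"], "race"] \
    .value_counts(normalize=True)

# Per-group change in representation introduced by the filter;
# groups dropped entirely lose their whole pre-filter share.
shift = (after - before).fillna(-before)
print(shift)
```

In this toy example the filter inflates the White group's share while shrinking or eliminating the others, which is exactly the kind of unintended skew the study quantifies at scale.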
- Improving Systematic Review Updates With Natural Language Processing Through Abstract Component Classification and Selection: Algorithm Development and Validation
Background: A challenge in updating systematic reviews is the workload in screening the articles. Many screening models using natural language processing technology have been implemented to scrutinize articles based on titles and abstracts. While these approaches show promise, traditional models typically treat abstracts as uniform text. We hypothesize that selective training on specific abstract components could enhance model performance for systematic review screening. Objective: We evaluated the efficacy of a novel screening model that selects specific components from abstracts to improve performance and developed an automatic systematic review update model using an abstract component classifier to categorize abstracts based on their components. Methods: A screening model was created based on the included and excluded articles in an existing systematic review and used as the scheme for the automatic update of the systematic review. A prior publication was selected for the systematic review, and articles included or excluded during its screening process were used as training data. The titles and abstracts were classified into 5 categories (Title, Introduction, Methods, Results, and Conclusion). Thirty-one component-composition datasets were created by combining the 5 component datasets. We implemented 31 screening models using the component-composition datasets and compared their performances. Comparisons were conducted using 3 pretrained models: Bidirectional Encoder Representations from Transformers (BERT), BioLinkBERT, and BioM-ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). Moreover, to automate the component selection of abstracts, we developed the Abstract Component Classifier Model and created component datasets using this classifier’s output.
Using the component datasets classified by the Abstract Component Classifier Model, we created the 10 component-composition datasets corresponding to the 10 best-performing screening models from the manually classified experiments. Ten screening models were implemented using these datasets, and their performances were compared with those of the models developed using manually classified component-composition datasets. The primary evaluation metric was the F10-score, which weights recall over precision. Results: A total of 256 included articles and 1261 excluded articles were extracted from the selected systematic review. Among the screening models implemented using manually classified datasets, the performances of some surpassed those of models trained on all components (BERT: 9 models, BioLinkBERT: 6 models, and BioM-ELECTRA: 21 models). Among models implemented using datasets classified by the Abstract Component Classifier Model, the performances of some (BERT: 7 models and BioM-ELECTRA: 9 models) surpassed those of the models trained on all components. These models achieved an 88.6% reduction in manual screening workload while maintaining high recall (0.93). Conclusions: Component selection from the title and abstract can improve the performance of screening models and substantially reduce the manual screening workload in systematic review updates. Future research should focus on validating this approach across different systematic review domains.
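The F10 metric named above is the F-beta score with beta=10, which makes recall dominate precision, a sensible choice when missing an eligible article is far costlier than screening an extra one. A minimal illustration:

```python
# Sketch: the F10 score (F-beta with beta=10) heavily favors recall.
# Labels below are toy values, not the study's screening results.
from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 0]  # perfect recall, imperfect precision

f10 = fbeta_score(y_true, y_pred, beta=10)
print(f"F10 = {f10:.3f}")  # near 1.0 despite precision of only 0.67
```

With beta=10, recall is weighted 100 times more than precision in the harmonic mean, so a model that over-includes articles is penalized far less than one that misses them.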
- An Interpretable Model With Probabilistic Integrated Scoring for Mental Health Treatment Prediction: Design Study
Background: Machine learning (ML) systems in health care have the potential to enhance decision-making but often fail to address critical issues such as prediction explainability, confidence, and robustness in a context-based and easily interpretable manner. Objective: This study aimed to design and evaluate an ML model for a future decision support system for clinical psychopathological treatment assessments. The novel ML model is inherently interpretable and transparent. It aims to enhance clinical explainability and trust through a transparent, hierarchical model structure that progresses from questions to scores to classification predictions. Model confidence and robustness were addressed by applying Monte Carlo dropout, a probabilistic method that reveals model uncertainty and confidence. Methods: A model for clinical psychopathological treatment assessments was developed around a novel ML model structure designed to enhance the graphical interpretation of model outputs and to address prediction explainability, confidence, and robustness. The proposed ML model was trained and validated using patient questionnaire answers and demographics from a web-based treatment service in Denmark (N=1088). Results: The balanced accuracy score on the test set was 0.79. The precision was ≥0.71 for all 4 prediction classes (depression, panic, social phobia, and specific phobia). The area under the curve for the 4 classes was 0.93, 0.92, 0.91, and 0.98, respectively. Conclusions: We have demonstrated a mental health treatment ML model that supports a graphical interpretation of prediction class probability distributions. Their spread and overlap can inform clinicians of competing treatment possibilities for patients and of uncertainty in treatment predictions. With the ML model achieving 79% balanced accuracy, we expect the model to be clinically useful both in screening new patients and in informing clinical interviews.
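The Monte Carlo dropout idea referenced above can be sketched as follows: dropout is kept active at inference and the network is sampled many times, so the spread of the predicted class probabilities reflects model uncertainty. This is a pure-NumPy toy network; the layer sizes, weights, and four-class output are illustrative stand-ins, not the study's model.

```python
# Sketch of Monte Carlo dropout: repeated stochastic forward passes
# yield a distribution over class probabilities instead of a point estimate.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))   # input -> hidden (toy weights)
W2 = rng.normal(size=(16, 4))   # hidden -> 4 treatment classes

def predict_with_dropout(x, p_drop=0.5):
    h = np.maximum(x @ W1, 0.0)           # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop   # dropout stays ON at inference
    h = h * mask / (1.0 - p_drop)         # inverted-dropout scaling
    logits = h @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax probabilities

x = rng.normal(size=8)                    # one patient's feature vector
samples = np.array([predict_with_dropout(x) for _ in range(200)])

mean_probs = samples.mean(axis=0)   # point prediction
spread = samples.std(axis=0)        # per-class uncertainty
print("mean:", mean_probs.round(3), "std:", spread.round(3))
```

Wide or overlapping per-class distributions signal competing treatment possibilities, which is the graphical interpretation the abstract describes.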
- Convolutional Neural Network Models for Visual Classification of Pressure Ulcer Stages: Cross-Sectional Study
Background: Pressure injuries (PIs) pose a negative health impact and a substantial economic burden on patients and society. Accurate staging is crucial for treating PIs. Owing to the diversity in the clinical manifestations of PIs and the lack of objective biochemical and pathological examinations, accurate staging of PIs is a major challenge. Deep learning (DL) algorithms using convolutional neural networks (CNNs) have demonstrated exceptional classification performance in the intricate domain of skin diseases and wounds and have the potential to improve the staging accuracy of PIs. Objective: We explored the potential of applying AlexNet, VGGNet16, ResNet18, and DenseNet121 to PI staging, aiming to provide an effective tool to assist in staging. Methods: PI images from patients, covering stage Ⅰ, stage Ⅱ, stage Ⅲ, stage Ⅳ, unstageable, and suspected deep tissue injury (SDTI), were collected at a tertiary hospital in China. Additionally, we augmented the PI data by cropping and flipping the PI images 9 times. The collected images were then divided into training, validation, and test sets at a ratio of 8:1:1. We subsequently trained AlexNet, VGGNet16, ResNet18, and DenseNet121 on these data to develop staging models. Results: We collected 853 raw PI images with the following distribution across stages: stage Ⅰ (148), stage Ⅱ (121), stage Ⅲ (216), stage Ⅳ (110), unstageable (128), and SDTI (130). A total of 7677 images were obtained after data augmentation. Among all the CNN models, DenseNet121 demonstrated the highest overall accuracy, 93.71%. AlexNet, VGGNet16, and ResNet18 exhibited overall accuracies of 87.74%, 82.42%, and 92.42%, respectively. Conclusions: The CNN-based models demonstrated strong classification ability for PI images, which might promote highly efficient, intelligent PI staging methods.
In the future, the models could be compared with nurses with different levels of experience to further verify their clinical utility.
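The 9-fold crop-and-flip augmentation mentioned above (853 raw images expanding to 7677) can be sketched at the array level. The particular crop positions and flip choices here are an assumption for illustration, not the paper's exact recipe.

```python
# Sketch: expand each image into 9 variants by cropping and flipping,
# matching the 9x augmentation factor reported (853 * 9 = 7677).
import numpy as np

def augment(img: np.ndarray) -> list:
    """Return 9 variants of an HxWxC image: 4 corner crops, a center crop,
    flips of the full image, and flips of the center crop (illustrative mix)."""
    h, w = img.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)   # 90% crop size (assumed)
    crops = [
        img[:ch, :cw], img[:ch, -cw:], img[-ch:, :cw], img[-ch:, -cw:],
        img[(h - ch) // 2:(h + ch) // 2, (w - cw) // 2:(w + cw) // 2],
    ]
    center = crops[-1]
    flips = [img[:, ::-1], img[::-1, :], center[:, ::-1], center[::-1, :]]
    return crops + flips

img = np.zeros((100, 100, 3), dtype=np.uint8)
variants = augment(img)
print(len(variants))  # 9 variants per raw image
```

In a training pipeline each variant inherits the raw image's stage label, which is why augmentation must be applied after the train/validation/test split to avoid leakage.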
- Public Attitudes Toward Violence Against Doctors: Sentiment Analysis of Chinese Users
Background: Violence against doctors attracts the public’s attention both online and in the real world. Understanding how public sentiment evolves during such crises is essential for developing strategies to manage emotions and rebuild trust. Objective: This study aims to quantify differences in public sentiment based on the public opinion life cycle theory and to describe how public sentiment evolved during a high-profile crisis involving violence against doctors in China. Methods: This study used the term frequency-inverse document frequency (TF-IDF) algorithm to extract key terms and create keyword clouds from textual comments. The latent Dirichlet allocation (LDA) topic model was used to analyze thematic trends and shifts within public sentiment. An integrated Chinese sentiment lexicon was used to analyze sentiment trajectories in the collected data. Results: A total of 12,775 valid comments related to a doctor-patient conflict were collected on Sina Weibo. Thematic and sentiment analyses showed that public sentiment was highly negative during the outbreak period (disgust: 10,201/30,433, 33.52%; anger: 6792/30,433, 22.32%), shifted to a mix of positive and negative during the spread period (sorrow: 2952/8569, 34.45%; joy: 2782/8569, 32.47%), and became rational and peaceful during the decline period (joy: 4757/14,543, 32.71%; sorrow: 4070/14,543, 27.99%). However, regardless of how emotions changed, the dominant tone of each period contained substantial negative sentiment. Conclusions: This study simultaneously examined the dynamics of theme change and sentiment evolution in crises involving violence against doctors. It found that public sentiment evolved alongside thematic changes, with the dominant negative tone from the initial stage persisting throughout. This finding, which distinguishes this work from prior research, underscores the lasting influence of early public sentiment.
The results offer valuable insights for medical institutions and authorities, suggesting the need for tailored risk communication strategies responsive to the evolving themes and sentiments at different stages of a crisis.
- Imputation and Missing Indicators for Handling Missing Longitudinal Data: Data Simulation Analysis Based on Electronic Health Record Data
Background: Missing data in electronic health records (EHRs) is highly prevalent and results in analytical concerns such as heterogeneous sources of bias and loss of statistical power. One simple analytic method for addressing missing or unknown covariate values is to treat missingness for a particular variable as a category unto itself, which we refer to as the missing indicator method. For cross-sectional analyses, recent work suggested that there was minimal benefit to the missing indicator method; however, it is unclear how this approach performs in the setting of longitudinal data, in which correlation among clustered repeated measures may be leveraged for potentially improved model performance. Objective: To conduct a simulation study evaluating whether the missing indicator method improves model performance and imputation accuracy for longitudinal data, mimicking the development of a clinical prediction model for falls in older adults based on EHR data. Methods: We simulated a longitudinal binary outcome using mixed effects logistic regression that emulated a falls assessment at annual follow-up visits. We simulated time-invariant predictors, such as sex and medical history, as well as dynamic predictors, such as physical function, body mass index, and medication use. We induced missing data in predictors under scenarios with both missing at random (MAR) and missing not at random (MNAR) mechanisms, and imputed missing values using multivariate imputation by chained equations. We evaluated aggregate performance using the area under the curve for models with and without missing indicators as predictors, as well as complete case analysis, across simulation replicates. We evaluated imputation quality using normalized root mean square error for continuous variables and percent falsely classified for categorical variables.
Results: Independent of the mechanism used to simulate missing data (MAR or MNAR), overall model performance via the area under the curve was similar regardless of whether missing indicators were included in the model. The root mean square error and percent falsely classified measures were similar for models with and without missing indicators. Model performance and imputation quality were similar regardless of whether the outcome was related to missingness. Imputation with or without missing indicators had mean area under the curve values similar to complete case analysis, although complete case analysis had the largest range of values. Conclusions: The results of this study suggest that the inclusion of missing indicators in longitudinal data modeling neither improves nor worsens overall performance or imputation accuracy. Future research is needed to address whether the inclusion of missing indicators is useful in prediction modeling with longitudinal data in different settings, such as high-dimensional data analysis.
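The missing indicator method compared above can be sketched in a few lines: missingness in a predictor becomes its own binary feature alongside the imputed value. The toy longitudinal table is fabricated for illustration, and simple mean imputation stands in for the multivariate imputation by chained equations used in the study.

```python
# Sketch: the missing indicator method on a toy longitudinal table.
import numpy as np
import pandas as pd

visits = pd.DataFrame({
    "patient_id": [1, 1, 2, 2],
    "visit_year": [2020, 2021, 2020, 2021],
    "bmi": [27.4, np.nan, 31.0, 30.2],
})

# The binary indicator records *that* the value was missing...
visits["bmi_missing"] = visits["bmi"].isna().astype(int)
# ...while the value itself is imputed so the model has no gaps
# (mean imputation here as a stand-in for chained equations).
visits["bmi"] = visits["bmi"].fillna(visits["bmi"].mean())
print(visits)
```

Both the imputed `bmi` and `bmi_missing` then enter the prediction model as predictors; the study's finding is that adding the indicator neither helped nor hurt in the longitudinal setting.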
- Large Language Model–Based Critical Care Big Data Deployment and Extraction: Descriptive Analysis
Background: Publicly accessible critical care-related databases contain enormous clinical data but their utilization often requires advanced programming skills. However, the growing complexity of large databases and unstructured data presents challenges for clinicians who need programming or data analysis expertise to utilize these systems directly. Objective: The study aims to simplify critical care-related databases deployment and extraction via large language models. Methods: The development of this platform was a two-step process. First, we enabled automated database deployment using Docker container technology, with incorporated web-based analytics interfaces Metabase and Superset. Second, we developed the Intensive care unit - Generative Pre-trained Transformer (ICU-GPT), a large language model fine-tuned on Intensive care unit (ICU) data integrated LangChain and Microsoft AutoGen. Results: The automated deployment platform was designed with user-friendliness in mind, enabling clinicians to deploy one or multiple databases in local, cloud, or remote environments without the need for manual setup. After successfully overcoming GPT’s token limit and supporting multi-schemas data, ICU-GPT could generate Structured Query Language (SQL) queries and extract insights from ICU datasets based on request input. A front-end user interface was developed to for clinicians to achieve code-free SQL generation on the web-based client. Conclusions: By harnessing the power of our automated deployment platform and ICU-GPT model, clinicians are empowered to easily visualize, extract, and arrange critical care-related databases more efficiently and flexibly than manual methods. Our research could decrease the time and effort spent on complex bioinformatics methods and advance clinical research.