- Identifying Patient-Reported Outcome Measure Documentation in Veterans Health Administration Chiropractic Clinic Notes: Natural Language Processing Analysis
Background: The use of patient-reported outcome measures (PROMs) is an expected component of high-quality, measurement-based chiropractic care. The largest healthcare system offering integrated chiropractic care is the Veterans Health Administration (VHA). Several challenges limit monitoring PROM use as a care quality metric at a national scale in the VHA. Structured data are unavailable, and PROMs are often embedded within clinic text notes as unstructured data, requiring time-intensive, peer-conducted chart review for evaluation. Natural language processing (NLP) of clinic text notes is one promising approach to extracting care quality data from unstructured text. Objective: Our objective was to test NLP approaches to identify PROMs documented in VHA chiropractic text notes. Methods: VHA chiropractic notes from October 1, 2017, to September 30, 2020, were obtained from the VHA Musculoskeletal Diagnosis/Complementary and Integrative Health Cohort. A rule-based NLP model built using medspaCy and spaCy was evaluated on text-matching and note categorization tasks. spaCy was used to build bag-of-words, convolutional neural network, and ensemble models for note categorization. Performance metrics for each model and task included precision (P), recall (R), and F-measure (F). Cross-validation was used to validate performance metric estimates for the statistical and machine learning models. Results: Our sample included 377,213 visit notes from 56,628 patients. The rule-based model performance was good for soft-boundary text matching (P=81.1%, R=96.7%, F=88.2%) and excellent for note categorization (P=90.3%, R=99.5%, F=94.7%). Cross-validation performance of the statistical and machine learning models for the note categorization task was very good overall but lower than that of the rule-based model. Overall prevalence of PROM documentation was low (17%). Conclusions: We evaluated multiple NLP methods across a series of tasks, with optimal performance achieved using a rule-based method.
By leveraging NLP approaches, we can overcome the challenges posed by unstructured clinical text notes to track documented PROM use. Overall documented use of PROMs in chiropractic notes was low, highlighting an opportunity for quality improvement. This work represents a methodological advancement in the identification and monitoring of documented PROM use to ensure consistent, high-quality chiropractic care for Veterans.
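The abstract does not include implementation detail, so as a hedged illustration only: the rule-based text-matching and note categorization tasks can be sketched with plain regular expressions. The study actually used medspaCy and spaCy rules; the PROM names and patterns below (PEG, numeric rating scale) are hypothetical examples, not the study's lexicon.

```python
import re

# Illustrative PROM name patterns (hypothetical examples, not the study's actual rules)
PROM_PATTERNS = [
    r"\bPEG(?:\s*score)?\b",
    r"\bpain[,\s]+enjoyment[,\s]+general activity\b",
    r"\bnumeric (?:pain )?rating scale\b",
    r"\bNRS\b",
]
PROM_RE = re.compile("|".join(PROM_PATTERNS), re.IGNORECASE)

def find_prom_mentions(note: str) -> list[str]:
    """Text-matching task: return every PROM-like span in a clinic note."""
    return [m.group(0) for m in PROM_RE.finditer(note)]

def note_has_prom(note: str) -> bool:
    """Note categorization task: does the note document any PROM?"""
    return bool(PROM_RE.search(note))

note = "Veteran reports low back pain. PEG score today: 6/10 average."
print(find_prom_mentions(note))  # ['PEG score']
print(note_has_prom(note))       # True
```

A spaCy `Matcher` version would express the same patterns as token-level rules, which is what makes soft-boundary matching (partial or reworded mentions) easier than raw regex.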
- Large-Scale Evaluation and Liver Disease Risk Prediction in Finland’s National Electronic Health Record System: Feasibility Study Using Real-World Data
Background: Globally, the incidence and mortality of chronic liver disease are escalating. Early detection of liver disease remains a challenge, often occurring at symptomatic stages when preventative measures are less effective. The Chronic Liver Disease score (CLivD) is a predictive risk model developed using Finnish health care data, aiming to forecast an individual’s risk of developing chronic liver disease in subsequent years. The Kanta Service is a national electronic health record system in Finland that stores comprehensive health care data, including patient medical histories, prescriptions, and laboratory results, to facilitate health care delivery and research. Objective: This study aimed to evaluate the feasibility of implementing automatic CLivD score calculation on the current Kanta platform and to identify and suggest improvements to Kanta that would enable accurate automatic risk detection. Methods: In this study, a real-world data repository (Kanta) was used as the data source for the CLivD risk calculation model. Our dataset consisted of the complete medical histories of 96,200 individuals from Kanta. For real-world data use, we designed processes to handle missing input in the calculation process. Results: We found that Kanta currently lacks many CLivD risk model input parameters in the structured format required to calculate precise risk scores. However, the risk scores can be improved by using the unstructured text in patient reports and by approximating variables from other health data, such as diagnosis information. Using structured data alone, we were able to identify only 33 out of 51,275 individuals in the “low risk” category and 308 out of 51,275 individuals (<1%) in the “moderate risk” category. By adding diagnosis-based approximation and free-text use, we were able to identify 18,895 out of 51,275 (37%) individuals in the “low risk” category and 2125 out of 51,275 (4%) individuals in the “moderate risk” category.
In both cases, we were not able to identify any individuals in the “high risk” category because of the missing waist-to-hip ratio measurement. We evaluated 3 scenarios for improving the coverage of waist-to-hip ratio data in Kanta; these yielded the most substantial improvement in prediction accuracy. Conclusions: We conclude that the current structured Kanta data are not sufficient for precise risk calculation for CLivD or other diseases in which obesity, smoking, and alcohol use are important risk factors. Our simulations show up to a 14% improvement in risk detection when adding support for missing input variables. Kanta shows potential for implementing nationwide automated risk detection models that could result in improved disease prevention and public health.
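The missing-input handling described above can be sketched as follows. This is a toy stand-in: the real CLivD model's variables and coefficients are not reproduced here, and the weights, thresholds, and field names below are invented for illustration. The sketch shows the two mechanisms the study describes: approximating a missing variable from diagnosis codes, and refusing to emit a score when a required input (such as waist-to-hip ratio) is still absent.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PatientRecord:
    age: int
    sex: str                        # "M" / "F"
    smoker: Optional[bool]          # None = missing in structured data
    diabetes: Optional[bool]
    waist_hip_ratio: Optional[float]
    diagnosis_codes: list           # ICD-10 codes pulled from the record

def approximate_missing(p: PatientRecord) -> PatientRecord:
    """Fill a missing input from other health data, e.g. approximate
    diabetes status from ICD-10 diagnosis codes (E10/E11)."""
    if p.diabetes is None:
        p.diabetes = any(code.startswith(("E10", "E11")) for code in p.diagnosis_codes)
    return p

def risk_category(p: PatientRecord) -> Optional[str]:
    """Toy stand-in for the CLivD score (invented coefficients).
    Returns None when a required input is still missing."""
    if p.smoker is None or p.diabetes is None or p.waist_hip_ratio is None:
        return None                 # cannot compute a precise score
    score = (0.02 * p.age + 0.5 * p.smoker + 0.7 * p.diabetes
             + 1.5 * max(p.waist_hip_ratio - 0.9, 0.0))
    if score < 1.0:
        return "low risk"
    return "moderate risk" if score < 2.0 else "high risk"
```

With this structure, a patient whose diabetes status is recoverable from diagnosis codes still gets no score while the waist-to-hip ratio is missing, mirroring why no individuals reached the "high risk" category.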
- Predicting Clinical Outcomes at the Toronto General Hospital Transitional Pain Service via the Manage My Pain App: Machine Learning Approach
Background: Chronic pain is a complex condition that affects more than a quarter of people worldwide. The development and progression of chronic pain are unique to each individual due to the contribution of interacting biological, psychological, and social factors. The subjective nature of the experience of chronic pain can make its clinical assessment and prognosis challenging. Personalized digital health apps, such as Manage My Pain (MMP), are popular pain self-tracking tools that can also be leveraged by clinicians to support patients. Recent advances in machine learning technologies open an opportunity to use data collected in pain apps to make predictions about a patient’s prognosis. Objective: This study applies machine learning methods using real-world user data from the MMP app to predict clinically significant improvements in pain-related outcomes among patients at the Toronto General Hospital Transitional Pain Service (TPS). Methods: Information entered into the MMP app by 160 TPS patients over a one-month period, including profile information, pain records, daily reflections, and clinical questionnaire responses, was used to extract 245 relevant variables, referred to as features, for use in a machine learning model. The machine learning model was developed using logistic regression with recursive feature elimination to predict clinically significant improvements in pain interference, assessed by the PROMIS Pain Interference 8a v1.0 questionnaire. The model was tuned and the important features were selected using 10-fold cross-validation. Leave-one-out cross-validation was used to test the model’s performance. Results: The model predicted patient improvement in pain interference with 79% accuracy and an area under the receiver operating characteristic curve (AUC) of 0.82. It showed balanced class accuracies between improved and nonimproved patients, with a sensitivity of 0.76 and a specificity of 0.82.
Feature importance analysis indicated that all MMP app data, not just clinical questionnaire responses, were key to classifying patient improvement. Conclusions: This study demonstrates that data from a digital health app can be integrated with clinical questionnaire responses in a machine learning model to effectively predict which patients with chronic pain will show clinically significant improvement. The findings emphasize the potential of machine learning methods in real-world clinical settings to improve personalized treatment plans and patient outcomes.
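The leave-one-out evaluation protocol described above can be sketched in a few lines. The study used logistic regression with recursive feature elimination (typically via an ML toolkit); to keep this sketch dependency-free, a trivial single-feature threshold classifier stands in for that model, and the data are synthetic. The loop itself, though, is exactly leave-one-out cross-validation: train on all-but-one sample, test on the held-out one, repeat for every sample.

```python
def leave_one_out_accuracy(X, y, fit, predict):
    """Leave-one-out cross-validation: each sample is held out once
    while the model is fit on all the others."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        correct += predict(model, X[i]) == y[i]
    return correct / len(X)

# Stand-in classifier: threshold one feature at the midpoint of the class
# means (the study's actual model was logistic regression with RFE).
def fit_threshold(X, y):
    mean1 = sum(x[0] for x, t in zip(X, y) if t == 1) / max(sum(y), 1)
    mean0 = sum(x[0] for x, t in zip(X, y) if t == 0) / max(len(y) - sum(y), 1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return 1 if x[0] >= threshold else 0

X = [[0.1], [0.2], [0.3], [0.8], [0.9], [1.0]]   # synthetic feature values
y = [0, 0, 0, 1, 1, 1]                           # 1 = clinically improved
print(leave_one_out_accuracy(X, y, fit_threshold, predict_threshold))  # 1.0
```

Leave-one-out is a reasonable choice at N=160: it uses nearly all the data for training in each fold, at the cost of N model fits.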
- Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework
Background: Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on BERT-based methods or manual expert annotations, which have limitations in terms of scalability and performance. Objective: To evaluate the effectiveness of a GPT-based large language model (LLM) in labeling radiology reports, comparing it with two existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC-CXR). Methods: In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. Our model's performance was evaluated on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. The performance of our LLM model was compared with the CheXbert and CheXpert models across positive, negative, and uncertainty extraction tasks. Paired t-tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances. Results: Our GPT-based model achieved an average F1 score of 0.9014 for all certainty levels and 0.8708 for positive/negative certainty levels, outperforming CheXpert (F1 scores of 0.8864 and 0.8525, respectively) and performing comparably to CheXbert (F1 scores of 0.9047 and 0.8733, respectively). Paired t-tests revealed no statistically significant difference between our model and CheXbert (P = 0.3483), but a significant difference between our model and CheXpert (P = 0.0114). The Wilcoxon test also confirmed these results: no significant difference between our model and CheXbert (P = 0.1353) and a significant difference between our model and CheXpert (P = 0.0052). Conclusions: The GPT-based LLM model demonstrates competitive performance compared to CheXbert and outperforms CheXpert in radiology report labeling. 
These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. Moreover, with their larger context length, LLM-based models are better suited to this task than BERT-based models, whose context length is small. Clinical Trial: Not applicable.
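The headline metric above, an average F1 across 14 thoracic pathologies, reduces to a simple per-label computation. The sketch below shows it on invented counts (the pathology names are real CheXpert labels, but the TP/FP/FN numbers are hypothetical, not MIMIC-CXR results).

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never
    appears in predictions or references."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-pathology error counts (TP, FP, FN) for illustration only.
counts = {
    "Atelectasis": (40, 5, 6),
    "Edema": (30, 4, 3),
    "Pneumonia": (25, 7, 5),
}
per_label = {name: f1_score(*c) for name, c in counts.items()}
macro_f1 = sum(per_label.values()) / len(per_label)   # the "average F1" reported
```

The paired t-test and Wilcoxon signed-rank test reported in the abstract would then be run over the 14 per-pathology F1 pairs (this model vs. CheXbert, this model vs. CheXpert), pairing by pathology.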
- Biases in Race and Ethnicity Introduced by Filtering Electronic Health Records for “Complete Data”: Observational Clinical Data Analysis
Background: Integrated clinical databases from national biobanks have advanced the capacity for disease research. Data quality and completeness filters are used when building clinical cohorts to address limitations of data missingness. However, these filters may unintentionally introduce systemic biases when they are correlated with race and ethnicity. Objective: In this study, we examined the race and ethnicity biases introduced by applying common filters to four clinical records databases. Methods: We evaluated 19 filters commonly used in electronic health records research, based on the availability of demographics, medication records, visit details, observation periods, and other data types. We assessed the effect of applying these filters on self-reported race and ethnicity. This assessment was performed across four databases comprising approximately 12 million patients. Results: Applying the observation period filter led to a substantial reduction in data availability across all races and ethnicities in all four datasets. However, among the groups examined, the availability of data in the white group remained consistently higher compared with other racial groups after applying each filter. Conversely, the Black/African American group was the most impacted by each filter in three of the datasets: the Cedars-Sinai dataset, the UK Biobank, and the Columbia University dataset. Conclusions: Our findings underscore the importance of using only necessary filters, as they might disproportionately affect the data availability of minoritized racial and ethnic populations. Researchers must consider these unintentional biases when performing data-driven research and explore techniques to minimize the impact of these filters, such as probabilistic methods or the use of machine learning and artificial intelligence.
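The core measurement in this kind of analysis, how much of each self-reported race/ethnicity group survives a stack of completeness filters, can be sketched as below. The filter definitions and cohort records are invented for illustration; the study's 19 actual filters are not reproduced here.

```python
def retention_by_group(patients, filters):
    """Fraction of each self-reported race/ethnicity group that survives
    all completeness filters; unequal fractions signal filter-induced bias."""
    totals, kept = {}, {}
    for p in patients:
        g = p["race_ethnicity"]
        totals[g] = totals.get(g, 0) + 1
        if all(f(p) for f in filters):
            kept[g] = kept.get(g, 0) + 1
    return {g: kept.get(g, 0) / n for g, n in totals.items()}

# Two illustrative completeness filters of the kind the study describes.
has_observation_period = lambda p: p.get("observation_days", 0) >= 365
has_medication_records = lambda p: bool(p.get("medications"))

cohort = [
    {"race_ethnicity": "White", "observation_days": 800, "medications": ["a"]},
    {"race_ethnicity": "White", "observation_days": 400, "medications": ["d"]},
    {"race_ethnicity": "Black", "observation_days": 200, "medications": ["b"]},
    {"race_ethnicity": "Black", "observation_days": 900, "medications": ["c"]},
]
rates = retention_by_group(cohort, [has_observation_period, has_medication_records])
print(rates)  # {'White': 1.0, 'Black': 0.5}
```

Comparing these per-group retention rates filter by filter is what reveals the disproportionate impact described in the Results.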
- Improving Systematic Review Updates With Natural Language Processing Through Abstract Component Classification and Selection: Algorithm Development and Validation
Background: A challenge in updating systematic reviews is the workload of screening articles. Many screening models using natural language processing technology have been implemented to scrutinize articles based on titles and abstracts. While these approaches show promise, traditional models typically treat abstracts as uniform text. We hypothesize that selective training on specific abstract components could enhance model performance for systematic review screening. Objective: We evaluated the efficacy of a novel screening model that selects specific components from abstracts to improve performance and developed an automatic systematic review update model that uses an abstract component classifier to categorize abstracts based on their components. Methods: A screening model was created based on the articles included in and excluded from an existing systematic review and was used as the scheme for the automatic update of the systematic review. A prior publication was selected as the systematic review, and articles included or excluded during its article screening process were used as training data. The titles and abstracts were classified into 5 categories (Title, Introduction, Methods, Results, and Conclusion). Thirty-one component-composition datasets were created by combining the 5 component datasets. We implemented 31 screening models using the component-composition datasets and compared their performances. Comparisons were conducted using 3 pretrained models: Bidirectional Encoder Representations from Transformers (BERT), BioLinkBERT, and BioM-ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). Moreover, to automate the component selection of abstracts, we developed the Abstract Component Classifier Model and created component datasets using this classifier’s output.
Using the component datasets classified by the Abstract Component Classifier Model, we created the 10 component-composition datasets corresponding to the 10 best-performing screening models from the experiments with manually classified components. Ten screening models were implemented using these datasets, and their performances were compared with those of the models developed using manually classified component-composition datasets. The primary evaluation metric was the F10-score, an F-measure that weights recall more heavily than precision. Results: A total of 256 included articles and 1261 excluded articles were extracted from the selected systematic review. Among the screening models implemented using manually classified datasets, the performance of some surpassed that of models trained on all components (BERT: 9 models, BioLinkBERT: 6 models, and BioM-ELECTRA: 21 models). Among models implemented using datasets classified by the Abstract Component Classifier Model, the performance of some models (BERT: 7 models and BioM-ELECTRA: 9 models) surpassed that of the models trained on all components. These models achieved an 88.6% reduction in manual screening workload while maintaining high recall (0.93). Conclusions: Component selection from the title and abstract can improve the performance of screening models and substantially reduce the manual screening workload in systematic review updates. Future research should focus on validating this approach across different systematic review domains.
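The "31 component-composition datasets" follow directly from the 5 components: every nonempty subset of {Title, Introduction, Methods, Results, Conclusion} gives 2^5 - 1 = 31 combinations. A minimal sketch of generating those compositions and assembling model input text from a subset (the `record` fields below are illustrative):

```python
from itertools import combinations

COMPONENTS = ["Title", "Introduction", "Methods", "Results", "Conclusion"]

def component_compositions(components):
    """Every nonempty subset of the components: 2^5 - 1 = 31 datasets."""
    for r in range(1, len(components) + 1):
        yield from combinations(components, r)

def compose_text(record, selected):
    """Concatenate only the selected components into one model input."""
    return " ".join(record[c] for c in selected if c in record)

record = {"Title": "Statins and stroke", "Methods": "RCT of 200 patients."}
print(sum(1 for _ in component_compositions(COMPONENTS)))   # 31
print(compose_text(record, ("Title", "Methods")))   # Statins and stroke RCT of 200 patients.
```

Each of the 31 composed datasets then trains its own screening model, which is what allows the study to compare subsets against the all-components baseline.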
- An Interpretable Model With Probabilistic Integrated Scoring for Mental Health Treatment Prediction: Design Study
Background: Machine learning (ML) systems in health care have the potential to enhance decision-making but often fail to address critical issues such as prediction explainability, confidence, and robustness in a context-based and easily interpretable manner. Objective: This study aimed to design and evaluate an ML model for a future decision support system for clinical psychopathological treatment assessments. The novel ML model is inherently interpretable and transparent. It aims to enhance clinical explainability and trust through a transparent, hierarchical model structure that progresses from questions to scores to classification predictions. Model confidence and robustness were addressed by applying Monte Carlo dropout, a probabilistic method that reveals model uncertainty and confidence. Methods: A model for clinical psychopathological treatment assessments was developed, incorporating a novel ML model structure. The model was designed to support graphical interpretation of its outputs and to address prediction explainability, confidence, and robustness. The proposed ML model was trained and validated using patient questionnaire answers and demographics from a web-based treatment service in Denmark (N=1088). Results: The balanced accuracy score on the test set was 0.79. The precision was ≥0.71 for all 4 prediction classes (depression, panic, social phobia, and specific phobia). The area under the curve for the 4 classes was 0.93, 0.92, 0.91, and 0.98, respectively. Conclusions: We have demonstrated a mental health treatment ML model that supports graphical interpretation of prediction class probability distributions. The spread and overlap of these distributions can inform clinicians of competing treatment possibilities for patients and of uncertainty in treatment predictions. With the ML model achieving 79% balanced accuracy, we expect that the model will be clinically useful both in screening new patients and in informing clinical interviews.
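Monte Carlo dropout, the uncertainty method named above, amounts to keeping dropout active at prediction time and repeating the stochastic forward pass many times; the mean of the samples is the prediction and their spread is the uncertainty signal. The sketch below applies the idea to a toy linear scorer with invented weights, not the study's hierarchical model, purely to show the mechanism.

```python
import math
import random
import statistics

def forward_with_dropout(weights, x, p_drop, rng):
    """One stochastic forward pass of a toy linear scorer with input
    dropout left active at prediction time (the core of MC dropout)."""
    keep = 1.0 - p_drop
    z = sum(w * xi for w, xi in zip(weights, x) if rng.random() < keep) / keep
    return 1.0 / (1.0 + math.exp(-z))        # sigmoid -> class probability

def mc_dropout_predict(weights, x, n_samples=500, p_drop=0.2, seed=0):
    """Repeat the stochastic pass; the sample mean is the prediction and
    the sample spread (SD) is a confidence/robustness signal."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(weights, x, p_drop, rng)
               for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

mean_p, spread = mc_dropout_predict(weights=[0.8, -0.4, 1.2], x=[1.0, 2.0, 0.5])
```

In the study's setting, plotting the sampled per-class probability distributions is what gives clinicians the "spread and overlap" view of competing treatment predictions.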
- Convolutional Neural Network Models for Visual Classification of Pressure Ulcer Stages: Cross-Sectional Study
Background: Pressure injuries (PIs) pose a negative health impact and a substantial economic burden on patients and society. Accurate staging is crucial for treating PIs. Owing to the diversity in the clinical manifestations of PIs and the lack of objective biochemical and pathological examinations, accurate staging of PIs is a major challenge. Deep learning (DL) algorithms using convolutional neural networks (CNNs) have demonstrated exceptional classification performance in the intricate domain of skin diseases and wounds and have the potential to improve the staging accuracy of PIs. Objective: We explored the potential of applying AlexNet, VGGNet16, ResNet18, and DenseNet121 to PI staging, aiming to provide an effective tool to assist in staging. Methods: PI images from patients, covering stage I, stage II, stage III, stage IV, unstageable PIs, and suspected deep tissue injury (SDTI), were collected at a tertiary hospital in China. Additionally, we augmented the PI data by cropping and flipping each PI image 9 times. The collected images were then divided into training, validation, and test sets at a ratio of 8:1:1. We subsequently trained AlexNet, VGGNet16, ResNet18, and DenseNet121 on these images to develop staging models. Results: We collected 853 raw PI images with the following distribution across stages: stage I (148), stage II (121), stage III (216), stage IV (110), unstageable (128), and SDTI (130). A total of 7677 images were obtained after data augmentation. Among all the CNN models, DenseNet121 demonstrated the highest overall accuracy, 93.71%. AlexNet, VGGNet16, and ResNet18 achieved overall accuracies of 87.74%, 82.42%, and 92.42%, respectively. Conclusions: The CNN-based models demonstrated strong classification ability for PI images, which might promote highly efficient, intelligent PI staging methods.
In the future, the models can be compared against nurses with different levels of experience to further verify their clinical utility.
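The crop-and-flip augmentation is consistent with the reported counts: 853 raw images × 9 variants = 7677. The abstract does not specify the exact recipe, so the sketch below shows one plausible way to get 9 variants per image (5 crops + 3 flips + the original), operating on a nested-list "image" so no imaging library is needed.

```python
def hflip(img):
    """Mirror an image (nested lists of pixel values) left-right."""
    return [row[::-1] for row in img]

def vflip(img):
    """Mirror an image top-bottom."""
    return img[::-1]

def five_crops(img, size):
    """Four corner crops plus a center crop."""
    h, w = len(img), len(img[0])
    offsets = [(0, 0), (0, w - size), (h - size, 0),
               (h - size, w - size), ((h - size) // 2, (w - size) // 2)]
    return [[row[l:l + size] for row in img[t:t + size]] for t, l in offsets]

def augment_9x(img, crop_size):
    """One plausible 9-variant recipe (assumption, not the paper's exact
    protocol): 5 crops + 3 flips + the original image."""
    return five_crops(img, crop_size) + [hflip(img), vflip(img),
                                         hflip(vflip(img)), img]

image = [[r * 4 + c for c in range(4)] for r in range(4)]   # toy 4x4 "image"
print(len(augment_9x(image, 2)))   # 9
```

Applied to all 853 raw images, this recipe yields exactly the 7677 images reported after augmentation.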
- Public Attitudes Toward Violence Against Doctors: Sentiment Analysis of Chinese Users
Background: Violence against doctors attracts the public’s attention both online and in the real world. Understanding how public sentiment evolves during such crises is essential for developing strategies to manage emotions and rebuild trust. Objective: This study aims to quantify differences in public sentiment based on the public opinion life cycle theory and to describe how public sentiment evolved during a high-profile crisis involving violence against doctors in China. Methods: This study used the term frequency-inverse document frequency (TF-IDF) algorithm to extract key terms and create keyword clouds from textual comments. The latent Dirichlet allocation (LDA) topic model was used to analyze thematic trends and shifts within public sentiment. The integrated Chinese Sentiment Lexicon was used to analyze sentiment trajectories in the collected data. Results: A total of 12,775 valid comments about public opinion related to a doctor-patient conflict were collected from Sina Weibo. Thematic and sentiment analyses showed that public sentiment was highly negative during the outbreak period (disgust: 10,201/30,433, 33.52%; anger: 6792/30,433, 22.32%), shifted to a mix of positive and negative sentiment during the spread period (sorrow: 2952/8569, 34.45%; joy: 2782/8569, 32.47%), and tended to be rational and peaceful during the decline period (joy: 4757/14,543, 32.71%; sorrow: 4070/14,543, 27.99%). However, no matter how emotions changed, the dominant tone of each period contained substantial negative sentiment. Conclusions: This study simultaneously examined the dynamics of theme change and sentiment evolution in crises involving violence against doctors. It found that public sentiment evolved alongside thematic changes, with the dominant negative tone of the initial stage persisting throughout. This finding, which distinguishes this work from prior research, underscores the lasting influence of early public sentiment.
The results offer valuable insights for medical institutions and authorities, suggesting the need for tailored risk communication strategies responsive to the evolving themes and sentiments at different stages of a crisis.
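The TF-IDF step named in the Methods is straightforward to sketch: each term in a comment is scored by its frequency in that comment times the log-inverse of how many comments contain it, so ubiquitous terms score 0 and distinctive terms score high. The tokenized comments below are invented English stand-ins for the Weibo data.

```python
import math
from collections import Counter

def tf_idf(tokenized_comments):
    """TF-IDF per comment: term frequency x log(N / document frequency).
    Terms appearing in every comment score 0; distinctive terms score high."""
    n = len(tokenized_comments)
    df = Counter(term for doc in tokenized_comments for term in set(doc))
    scores = []
    for doc in tokenized_comments:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

# Illustrative tokenized comments (English stand-ins for the Weibo corpus).
comments = [["doctor", "violence", "anger"],
            ["doctor", "trust", "hospital"],
            ["doctor", "violence", "law"]]
scores = tf_idf(comments)
```

The top-scoring terms per period are what feed the keyword clouds; the LDA topic modeling and lexicon-based sentiment scoring then operate on the same tokenized comments.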