- Automated Pathologic TN Classification Prediction and Rationale Generation From Lung Cancer Surgical Pathology Reports Using a Large Language Model Fine-Tuned With Chain-of-Thought: Algorithm Development and Validation Study
Background: Traditional rule-based natural language processing approaches in electronic health record systems are effective but are often time-consuming and prone to errors when handling unstructured data. This is primarily due to the substantial manual effort required to parse and extract information from diverse types of documentation. Recent advancements in large language model (LLM) technology have made it possible to automatically interpret medical context and support pathologic staging. However, existing LLMs encounter challenges in rapidly adapting to specialized guideline updates. In this study, we fine-tuned an LLM specifically for lung cancer pathologic staging, enabling it to incorporate the latest guidelines for pathologic TN classification. Objective: This study aims to evaluate the performance of fine-tuned generative language models in automatically inferring pathologic TN classifications and extracting their rationale from lung cancer surgical pathology reports. By addressing the inefficiencies and extensive parsing efforts associated with rule-based methods, this approach seeks to enable rapid and accurate reclassification aligned with the latest cancer staging guidelines. Methods: We conducted a comparative performance evaluation of 6 open-source LLMs for automated TN classification and rationale generation, using 3216 deidentified lung cancer surgical pathology reports based on the American Joint Committee on Cancer (AJCC) Cancer Staging Manual 8th edition, collected from a tertiary hospital. The dataset was preprocessed by segmenting each report according to lesion location and morphological diagnosis. Performance was assessed using exact match ratio (EMR) and semantic match ratio (SMR) as evaluation metrics, which measure classification accuracy and the contextual alignment of the generated rationales, respectively. Results: Among the 6 models, the Orca2_13b model achieved the highest performance with an EMR of 0.934 and an SMR of 0.864. 
The Orca2_7b model also demonstrated strong performance, recording an EMR of 0.914 and an SMR of 0.854. In contrast, the Llama2_7b model achieved an EMR of 0.864 and an SMR of 0.771, while the Llama2_13b model showed an EMR of 0.762 and an SMR of 0.690. The Mistral_7b and Llama3_8b models showed still lower performance, with EMRs of 0.572 and 0.489 and SMRs of 0.377 and 0.456, respectively. Overall, the Orca2 models consistently outperformed the others in both TN stage classification and rationale generation. Conclusions: The generative language model approach presented in this study has the potential to enhance and automate TN classification in complex cancer staging, supporting both clinical practice and oncology data curation. With additional fine-tuning based on cancer-specific guidelines, this approach can be effectively adapted to other cancer types.
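The two evaluation metrics above can be sketched as follows. This is an illustrative approximation, not the study's actual code: the EMR is a straightforward exact-label match, while the SMR here is approximated with a token-overlap (Jaccard) similarity on rationales, since the abstract does not specify the semantic similarity measure used.

```python
# Hypothetical sketch of the EMR and SMR metrics described in the abstract.

def exact_match_ratio(predicted, gold):
    """EMR: fraction of predicted TN labels identical to the reference labels."""
    assert len(predicted) == len(gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def jaccard(a, b):
    """Token-level Jaccard similarity between two rationale strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def semantic_match_ratio(pred_rationales, gold_rationales, threshold=0.5):
    """SMR stand-in: fraction of rationale pairs whose similarity
    exceeds a threshold (the study's actual measure is unspecified)."""
    pairs = zip(pred_rationales, gold_rationales)
    return sum(jaccard(p, g) >= threshold for p, g in pairs) / len(gold_rationales)

# Example with hypothetical TN labels in AJCC-style notation.
print(exact_match_ratio(["pT2aN1", "pT1bN0"], ["pT2aN1", "pT3N0"]))  # 0.5
```

In practice the semantic comparison would use an embedding model rather than token overlap; the threshold of 0.5 is likewise an assumption for illustration.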
- Building a Foundation for High-Quality Health Data: Multihospital Case Study in Belgium
Background: Data quality is fundamental to maintaining the trust and reliability of health data for both primary and secondary purposes. However, before the secondary use of health data, it is essential to assess quality at the source and to develop systematic methods for assessing important data quality dimensions. Objective: This case study has a dual aim: to assess the data quality of height and weight measurements across 7 Belgian hospitals, focusing on the dimensions of completeness and consistency, and to outline the obstacles these hospitals face in sharing and improving data quality standards. Methods: Focusing on the data quality dimensions of completeness and consistency, this study examined height and weight data collected from 2021 to 2022 within 3 distinct departments (surgical, geriatric, and pediatric) in each of the 7 hospitals. Results: Variability was observed in the completeness scores for height across hospitals and departments, especially within surgical and geriatric wards. In contrast, weight data uniformly achieved high completeness scores. Notably, the consistency of height and weight data recording was uniformly high across all departments. Conclusions: A collective collaboration among Belgian hospitals, transcending network affiliations, was formed to conduct this data quality assessment. This study demonstrates the potential for improving data quality across healthcare organizations by sharing knowledge and good practices, establishing a foundation for future similar research.
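The two dimensions assessed in this study can be operationalized as simple ratios. The sketch below is illustrative only (not the hospitals' actual scripts, whose exact definitions the abstract does not give): completeness as the share of non-missing values, and consistency as the share of recorded values falling within a plausible range.

```python
# Illustrative completeness and consistency scores for a single variable.

def completeness(values):
    """Fraction of records with a recorded (non-None) value."""
    return sum(v is not None for v in values) / len(values)

def consistency(values, low, high):
    """Fraction of recorded values lying within a plausible range."""
    recorded = [v for v in values if v is not None]
    if not recorded:
        return 0.0
    return sum(low <= v <= high for v in recorded) / len(recorded)

# Hypothetical height records (cm) from one ward; 1.80 was likely
# entered in metres, an inconsistency this check would flag.
heights = [172.0, None, 168.5, 1.80, None, 190.0]
print(completeness(heights))          # 4 of 6 records filled
print(consistency(heights, 50, 250))  # 3 of 4 recorded values plausible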
- An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study
Background: Rare diseases affect millions worldwide but often receive limited individual research focus due to their low prevalence. Many rare diseases lack specific ICD-9 and ICD-10 codes and therefore cannot be reliably extracted from granular fields such as “Diagnosis” and “Problem List” entries, which complicates tasks that require identifying patients with these conditions, including clinical trial recruitment and research efforts. Recent advancements in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making them unsuitable for these tasks. Objective: Our objective is to create an end-to-end system called AutoRD, which automates the extraction of rare disease-related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conduct various experiments to evaluate AutoRD’s performance, aiming to surpass common LLMs and traditional methods. Methods: AutoRD is a pipeline system comprising data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implement this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype Ontology and the Orphanet ontology, using techniques such as chain-of-thought reasoning and prompt engineering. We quantitatively evaluate our system’s performance in entity extraction, relation extraction, and knowledge graph construction.
The experiment uses the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology. Results: On the RareDis2023 dataset, AutoRD achieves an overall entity extraction F1 score of 56.1% and a relation extraction F1 score of 38.6%, marking a 14.4% improvement over the baseline LLM. Notably, the F1 score for rare disease entity extraction reaches 83.5%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information. Conclusions: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming healthcare, particularly in the rare disease domain. By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts.
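The F1 scores reported above are conventionally computed from precision and recall over predicted versus gold annotations. The sketch below shows this standard computation for entity extraction under an exact-match criterion; the paper's precise matching rules (e.g., partial span credit) are not stated in the abstract, so this is an assumption.

```python
# Conventional micro F1 over sets of (span, entity_type) annotations.

def extraction_f1(predicted, gold):
    """F1 from exact-match true positives between prediction and gold sets."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations in the style of the RareDis entity types.
gold = {("cystic fibrosis", "rare_disease"), ("chronic cough", "sign")}
pred = {("cystic fibrosis", "rare_disease"), ("fever", "sign")}
print(round(extraction_f1(pred, gold), 3))  # 1 TP, P = R = 0.5 -> F1 = 0.5
```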
- Information Mode–Dependent Success Rates of Obtaining German Medical Informatics Initiative–Compliant Broad Consent in the Emergency Department: Single-Center Prospective Observational Study
Background: The broad consent (BC) developed by the German Medical Informatics Initiative is a pivotal national strategy for obtaining patient consent to use routinely collected data from electronic health records, insurance companies, contact information, and biomaterials for research. Emergency departments (EDs) are ideal for enrolling diverse patient populations in research activities. Despite regulatory and ethical challenges, obtaining BC from patients in the ED with varying demographic, socioeconomic, and disease characteristics presents a promising opportunity to expand the availability of ED data. Objective: This study aimed to evaluate the success rate of obtaining BC through different consenting approaches in a tertiary ED and to explore factors influencing consent and dropout rates. Methods: A single-center prospective observational study was conducted in a German tertiary ED from September to December 2022. Every 30th patient was screened for eligibility. Eligible patients were informed via one of three modalities: (1) directly in the ED, (2) during their inpatient stay on the ward, or (3) via telephone after discharge. The primary outcome was the success rate of obtaining BC within 30 days of ED presentation. Secondary outcomes included analyzing potential influences on the success and dropout rates based on patient characteristics, information mode, and the interaction time required for patients to make an informed decision. Results: Of 11,842 ED visits, 419 patients were screened for BC eligibility, with 151 meeting the inclusion criteria. Of these, 68 (45%) consented to at least 1 BC module, while 24 (15.9%) refused participation. The dropout rate was 39.1% (n=59) and was highest in the telephone-based group (57/109, 52.3%) and lowest in the ED group (1/14, 7.1%).
Patients informed face-to-face during their inpatient stay following the ED treatment had the highest consent rate (23/27, 85.2%), while those approached in the ED or by telephone each had a consent rate of 69.2% (9/13 and 36/52, respectively). Logistic regression analysis indicated that longer interaction time significantly improved consent rates (P=.03), while female sex was associated with higher dropout rates (P=.02). Age, triage category, billing details (inpatient treatment), and diagnosis did not significantly influence the primary outcome (all P>.05). Conclusions: Obtaining BC in an ED environment is feasible, enabling representative inclusion of ED populations. However, discharge from the ED and female sex negatively affected consent rates for the BC. Face-to-face interaction proved most effective, particularly for inpatients, while telephone-based approaches resulted in higher dropout rates despite consent rates comparable to direct consenting in the ED. The findings underscore the importance of tailored consent strategies and of maintaining consenting staff in EDs and on the wards to enhance BC information delivery and consent processes for eligible patients. Trial Registration: German Clinical Trials Register DRKS00028753; https://drks.de/search/de/trial/DRKS00028753
- Enhancing Standardized and Structured Recording by Elderly Care Physicians for Reusing Electronic Health Record Data: Interview Study
Background: Elderly care physicians (ECPs) in nursing homes document patients’ health, medical conditions, and the care provided in electronic health records (EHRs). However, much of these health data currently lack structure and standardization, limiting their potential for health information exchange across care providers and reuse for quality improvement, policy development, and scientific research. Enhancing this potential requires insight into the attitudes and behaviors of ECPs toward standardized and structured recording in EHRs. Objective: This study aims to answer why and how ECPs record their findings in EHRs and what factors influence them to record in a standardized and structured manner. The findings will be used to formulate recommendations aimed at enhancing standardized and structured data recording for the reuse of EHR data. Methods: Semistructured interviews were conducted with 13 ECPs working in Dutch nursing homes. We recruited participants through purposive sampling, aiming for diversity in age, gender, health care organization, and use of EHR systems. Interviews continued until we reached data saturation. Analysis was performed using inductive thematic analysis. Results: ECPs primarily use EHRs to document daily patient care, ensure continuity of care, and fulfill their obligation to record specific information for accountability purposes. The EHR serves as a record to justify their actions in the event of a complaint. In addition, some respondents also mentioned recording information for secondary purposes, such as research and quality improvement. Several factors were found to influence standardized and structured recording. At a personal level, it is crucial to experience the added value of standardized and structured recording. At the organizational level, clear internal guidelines and a focus on their implementation can have a substantial impact. 
At the level of the EHR system, user-friendliness, interoperability, and guidance were most frequently mentioned as being important. At a national level, the alignment of internal guidelines with overarching standards plays a pivotal role in encouraging standardized and structured recording. Conclusions: The results of our study are similar to the findings of previous research in hospital care and general practice. Therefore, long-term care can learn from solutions regarding standardized and structured recording in other health care sectors. The main motives for ECPs to record in EHRs are the daily patient care and ensuring continuity of care. Standardized and structured recording can be improved by aligning the recording method in EHRs with the primary care process. In addition, there are incentives for motivating ECPs to record in a standardized and structured way, mainly at the personal, organizational, EHR system, and national levels.
- Survival After Radical Cystectomy for Bladder Cancer: Development of a Fair Machine Learning Model
Background: Prediction models based on machine learning (ML) methods are being increasingly developed and adopted in health care. However, these models may be prone to bias and considered unfair if they demonstrate variable performance in population subgroups. An unfair model is of particular concern in bladder cancer, where disparities have been identified in sex and racial subgroups. Objective: This study aims (1) to develop an ML model to predict survival after radical cystectomy for bladder cancer and evaluate it for potential model bias in sex and racial subgroups and (2) to compare algorithm unfairness mitigation techniques to improve model fairness. Methods: We trained and compared various ML classification algorithms to predict 5-year survival after radical cystectomy using the National Cancer Database. The primary model performance metric was the F1-score. The primary metric for model fairness was the equalized odds ratio (eOR). We compared 3 algorithm unfairness mitigation techniques to improve the eOR. Results: We identified 16,481 patients; 23.1% (n=3800) were female, and 91.5% (n=15,080) were “White,” 5% (n=832) were “Black,” 2.3% (n=373) were “Hispanic,” and 1.2% (n=196) were “Asian.” The 5-year mortality rate was 75% (n=12,290). The best naive model was extreme gradient boosting (XGBoost), which had an F1-score of 0.860 and an eOR of 0.619. All unfairness mitigation techniques increased the eOR, with correlation remover showing the highest increase, resulting in a final eOR of 0.750. This mitigated model had F1-scores of 0.860, 0.904, and 0.824 in the full, Black male, and Asian female test sets, respectively. Conclusions: The ML model predicting survival after radical cystectomy exhibited bias across sex and racial subgroups. By using algorithm unfairness mitigation techniques, we improved algorithmic fairness as measured by the eOR.
Our study highlights the role of not only evaluating for model bias but also actively mitigating such disparities to ensure equitable health care delivery. We also deployed the first web-based fair ML model for predicting survival after radical cystectomy.
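The study's primary fairness metric can be sketched as follows. This is an illustrative implementation of one common definition of the eOR (e.g., as implemented in the Fairlearn library): the smaller of the min/max ratios of true positive rates and false positive rates across subgroups, where 1.0 indicates perfectly equal odds. The abstract does not give the study's exact computation, so treat this as an assumption.

```python
# Sketch of the equalized odds ratio (eOR) across population subgroups.

def rates(y_true, y_pred):
    """Return (TPR, FPR) for one subgroup's binary labels and predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

def min_max_ratio(values):
    """min/max ratio across subgroups; 1.0 when all rates are zero."""
    return 1.0 if max(values) == 0 else min(values) / max(values)

def equalized_odds_ratio(groups):
    """groups: dict mapping subgroup name -> (y_true, y_pred)."""
    tprs, fprs = zip(*(rates(t, p) for t, p in groups.values()))
    return min(min_max_ratio(tprs), min_max_ratio(fprs))
```

An eOR of 0.619 (as in the naive XGBoost model) means the most disadvantaged subgroup's error rates diverge substantially from the best-served subgroup's; mitigation raising it to 0.750 narrows that gap.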
- Information Source Characteristics of Personal Data Leakage During the COVID-19 Pandemic in China: Observational Study
Background: During the prevention and control of the COVID-19 pandemic, large amounts of personal data were collected, and privacy leakage incidents occurred. Objective: This study examines the information source characteristics of personal data leakage during the COVID-19 pandemic in China. Methods: We extracted the information source characteristics of 40 personal data leakage cases through open coding and analyzed the data using one-dimensional and two-dimensional matrices. Results: In terms of organizational characteristics, leakage cases mainly occurred in government agencies at or below the prefecture level, while few occurred in the medical system or in high-level government organizations. Regarding the leakers, the majority were regular employees or junior staff rather than temporary workers or senior managers. Family WeChat groups were the primary route of disclosure; forwarding documents accounted for most of this conduct, while taking screenshots and snapping pictures made up a comparatively smaller portion. Conclusions: We propose the following suggestions: restricting the authority of nonmedical institutions and low-level government agencies to collect data, strengthening privacy protection training for low-level employees, and restricting the flow of data to social media through technical measures.
- Development, Implementation, and Evaluation Methods for Dashboards in Health Care: Scoping Review
Background: Dashboards have become ubiquitous in health care settings, but to achieve their goals, they must be developed, implemented, and evaluated using methods that help ensure they meet the needs of end-users and are suited to the barriers and facilitators of the local context. Objective: This scoping review aimed to explore the published literature on health care dashboards to characterize the methods used to identify factors affecting uptake, the strategies used to increase dashboard uptake, and the evaluation methods, as well as dashboard characteristics and context. Methods: MEDLINE, EMBASE, Web of Science, and the Cochrane Library were searched from inception through July 2020. Studies were included if they described the development or evaluation of a health care dashboard published from 2018 to 2020. Clinical setting, purpose (categorized as clinical, administrative, or both), end-user, design characteristics, methods used to identify factors affecting uptake, strategies to increase uptake, and evaluation methods were extracted. Results: From 116 publications, we extracted data for 118 dashboards. Inpatient (n=45/118, 38.1%) and outpatient (n=42/118, 35.6%) settings were most common. Most dashboards had ≥2 stated purposes (n=84/118, 71.2%); of these, 54/118 (45.8%) were administrative, 43/118 (36.4%) were clinical, and 20/118 (16.9%) had both purposes. Most dashboards included front-line clinical staff as end-users (n=97/118, 82.2%). To identify factors affecting dashboard uptake, half of the dashboards involved end-users in the design process (n=59/118, 50.0%); fewer described formative usability testing (n=26/118, 22.0%) or the use of any theory or framework to guide development, implementation, or evaluation (n=24/118, 20.3%). The most common strategies used to increase uptake included education (n=60/118, 50.8%), audit and feedback (n=59/118, 50.0%), and advisory boards (n=54/118, 45.8%).
Evaluations of dashboards (n=84/118, 71.2%) were mostly quantitative (n=60/118, 50.8%), with fewer using only qualitative methods (n=6/118, 5.1%) or a combination of quantitative and qualitative methods (n=18/118, 15.2%). Conclusions: Most dashboards forgo steps during development to ensure they suit the needs of end-users and the clinical context, and qualitative evaluation – which can provide insight into ways to improve dashboard effectiveness – is uncommon. Education and audit and feedback are frequently used to increase uptake. These findings illustrate the need for promulgation of best practices in dashboard development and will be useful to dashboard planners.
- Advancing Progressive Web Applications to Leverage Medical Imaging for Visualization of Digital Imaging and Communications in Medicine and Multiplanar Reconstruction: Software Development and Validation Study
Background: In medical imaging, 3D visualization is vital for displaying volumetric organs, enhancing diagnosis and analysis. Multiplanar reconstruction (MPR) improves visual and diagnostic capabilities by transforming 2D images from computed tomography (CT) and magnetic resonance imaging into 3D representations. Web-based Digital Imaging and Communications in Medicine (DICOM) viewers integrated into picture archiving and communication systems facilitate access to images and interaction with remote data. However, the adoption of progressive web applications (PWAs) for web-based DICOM and MPR visualization remains limited. This paper addresses this gap by leveraging PWAs for their offline access and enhanced performance. Objective: This study aims to evaluate the integration of DICOM and MPR visualization into the web using PWAs, addressing challenges related to cross-platform compatibility, integration capabilities, and high-resolution image reconstruction for medical image visualization. Methods: Our paper introduces a PWA that uses a modular design to enhance DICOM and MPR visualization in web-based medical imaging. By integrating React.js and Cornerstone.js, the application offers seamless DICOM image processing, ensures cross-browser compatibility, and delivers a responsive user experience across multiple devices. It uses advanced interpolation techniques to improve the accuracy of volume reconstructions, enhancing MPR analysis and visualization in a web environment and promising a substantial advance in medical imaging analysis. Results: In our approach, the performance of DICOM- and MPR-based PWAs for medical image visualization and reconstruction was evaluated through comprehensive experiments. The application excelled in terms of loading time and volume reconstruction, particularly in Google Chrome, whereas Firefox showed superior performance in viewing slices.
This study uses a dataset comprising 22 CT scans of patients with peripheral artery disease to demonstrate the application’s robust performance, with Google Chrome outperforming other browsers in both local area network and wide area network settings. In addition, the application’s accuracy in MPR reconstructions was validated with an error margin of <0.05 mm, and it outperformed state-of-the-art methods by 84% to 98% in loading and volume rendering time. Conclusions: This paper highlights advancements in DICOM and MPR visualization using PWAs, addressing gaps in web-based medical imaging. By exploiting PWA features such as offline access and improved performance, we have significantly advanced medical imaging technology, focusing on cross-platform compatibility, integration efficiency, and speed. Our application outperforms existing platforms in handling complex MPR analyses and accurately analyzing medical images, as validated through peripheral artery CT imaging.
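The MPR step described above resamples oblique planes through a stack of 2D CT slices, which requires interpolating voxel values at fractional coordinates. As a hedged sketch, the snippet below shows trilinear interpolation, a standard baseline for this resampling; the abstract mentions only "advanced interpolation techniques" without naming them, so this should not be read as the paper's actual method.

```python
# Trilinear interpolation: sample a volume at a fractional (x, y, z) coordinate
# by blending the 8 surrounding voxels, weighted by distance along each axis.

def trilinear(volume, x, y, z):
    """Interpolate volume[z][y][x] at a fractional coordinate."""
    x0, y0, z0 = int(x), int(y), int(z)
    dx, dy, dz = x - x0, y - y0, z - z0
    value = 0.0
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                weight = ((dx if i else 1 - dx)
                          * (dy if j else 1 - dy)
                          * (dz if k else 1 - dz))
                value += weight * volume[z0 + k][y0 + j][x0 + i]
    return value

# Toy 2x2x2 volume: interpolating at the centre averages the 8 corner voxels.
vol = [[[0, 0], [0, 0]], [[8, 8], [8, 8]]]
print(trilinear(vol, 0.5, 0.5, 0.5))  # 4.0
```

In a production viewer this runs per pixel of the reconstructed plane (typically on the GPU); the nested-loop form here is chosen for clarity, not speed.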
- EyeMatics: An Ophthalmology Use Case Within the German Medical Informatics Initiative
The EyeMatics project, embedded as a clinical use case in Germany's Medical Informatics Initiative, is a large digital health initiative in ophthalmology. Its objective is to improve the understanding of the treatment effects of intravitreal injections, the most frequent procedure used to treat eye diseases. To achieve this, valuable patient data will be meaningfully integrated and visualized across different IT systems and hospital sites. EyeMatics emphasizes a governance framework that actively involves patient representatives, strictly implements interoperability standards, and employs artificial intelligence methods to extract biomarkers from both tabular clinical data and raw retinal scans. In this perspective paper, we delineate the strategies for user-centered implementation and healthcare-based evaluation in a multisite observational technology study.