As technology continues to advance and artificial intelligence technology is widely applied, ChatGPT (Chat Generative Pre-trained Transformer) is beginning to make its mark in the field of healthcare consultation services. This article summarizes the current applications of ChatGPT in healthcare consultation services, reviewing its roles in four areas: dissemination of disease knowledge, assisting in the understanding of medical information, personalized health education and guidance, and preliminary diagnostic assistance and medical guidance. It also explores the development prospects of ChatGPT in healthcare consultation services, as well as the challenges and ethical dilemmas it faces in this field.
As one of the hot topics in the field of artificial intelligence, large language models are being applied in various domains, including medical research. ChatGPT (Chat Generative Pre-trained Transformer), as one of the most representative and leading large language models, has gained popularity among researchers due to its logical coherence and natural language generation capabilities. This article reviews the applications and limitations of ChatGPT in three key areas of medical research: scientific writing, data analysis, and drug development. Furthermore, it explores future development trends and provides recommendations for improvement, offering a reference for the application of ChatGPT in medical research.
Objective To explore the use of ChatGPT (Chat Generative Pre-trained Transformer) in pediatric diagnosis, treatment and doctor-patient communication, evaluate the professionalism and accuracy of the medical advice it provides, and assess its ability to provide psychological support. Methods ChatGPT versions 3.5 and 4.0, with knowledge databases as of April 2023, were selected. A total of 30 diagnosis and treatment questions and 10 doctor-patient communication questions regarding the pediatric urinary system were submitted to both versions, and ChatGPT’s answers were evaluated. Results The answers of ChatGPT versions 3.5 and 4.0 to all 40 questions reached the qualified level. For the 30 diagnosis and treatment questions, the answers of ChatGPT 4.0 were superior to those of ChatGPT 3.5 (P=0.024). For the 10 doctor-patient communication questions, there was no statistically significant difference between the answers of ChatGPT 3.5 and 4.0 (P=0.727). ChatGPT’s answer scores were relatively high for questions on prevention, single symptoms, and disease diagnosis and treatment, and relatively low for questions on the diagnosis and treatment of complex medical conditions. Conclusion ChatGPT has certain value in assisting pediatric diagnosis, treatment and doctor-patient communication, but the medical advice it provides cannot completely replace the professional judgment and personal care of doctors.
Objective To explore the application of the GPT-4 large language model in simplifying lung cancer radiology reports to enhance patient comprehension and the efficiency of doctor-patient communication. Methods A total of 362 radiology reports of non-small cell lung cancer (NSCLC) patients were collected from two hospitals between September and December 2024. Interpretive radiology reports (IRRs) were generated using GPT-4. Original radiology reports (ORRs) and IRRs were compared through radiologist consistency evaluation and volunteer-based assessments of reading time, comprehension scores, and simulated communication duration. Results The average word count of ORRs was (459.83±55.76) words, compared with (625.42±41.59) words for IRRs (P<0.001). No significant differences were observed in expert consistency scores between ORRs and IRRs across the dimensions of image interpretation accuracy, report detail completeness, explanatory depth and insight, and clinical practicality. Compared with ORRs, IRRs required less reading time from volunteers (simulated patients) [(346.88±29.15) s versus (409.01±102.40) s], yielded higher comprehension scores [(7.83±1.04) points versus (5.53±0.94) points], and shortened doctor-patient communication time [(317.31±57.81) s versus (714.20±56.67) s]. All differences were statistically significant (all P<0.001). Conclusion GPT-4-generated IRRs significantly improve patient comprehension and shorten communication time while maintaining medical accuracy. These findings suggest a new approach to optimizing radiology report management and enhancing healthcare service quality.
Objective To explore the application value of artificial intelligence in medical research assistance, and to analyze the key paths to achieving precise execution of model instructions, more complete model interpretation, and control of hallucinations. Methods Taking esophageal cancer research as the scenario, five types of literature (research articles, case reports, reviews, editorials, and guidelines) were selected for model interpretation tests. Model performance was systematically evaluated on five dimensions: recognition accuracy, format accuracy, instruction execution accuracy, content reliability rate, and content completeness index. The performance of the Ruibin Agent, GPT-4o, Claude 3.7 Sonnet, DeepSeek V3, and DouBao-pro models in medical literature interpretation tasks was compared. Results A total of 15 studies were included, with 3 studies of each type. The five models collectively conducted 1 875 tests. Owing to its poor recognition accuracy for editorials, the overall recognition accuracy of Ruibin Agent was significantly lower than that of the other models (92.0% vs. 100.0%, P<0.001). In terms of format accuracy, Ruibin Agent was significantly better than Claude 3.7 Sonnet (98.7% vs. 92.0%, P=0.002) and GPT-4o (98.7% vs. 78.9%, P<0.001). In terms of instruction execution accuracy, Ruibin Agent was better than GPT-4o (97.3% vs. 80.0%, P<0.001). In terms of content reliability rate, Ruibin Agent was significantly lower than Claude 3.7 Sonnet (84.0% vs. 92.0%, P=0.010) and DeepSeek V3 (84.0% vs. 94.7%, P<0.001). In terms of content completeness index, the median scores of Ruibin Agent, GPT-4o, Claude 3.7 Sonnet, DeepSeek V3, and DouBao-pro were 0.71, 0.60, 0.85, 0.74, and 0.77, respectively. Conclusion Ruibin Agent has significant advantages in formatted interpretation of medical literature and in instruction execution accuracy. Future work should focus on improving its recognition of editorials, strengthening its coverage of the core elements of each literature type to improve interpretation completeness, and improving content reliability by optimizing the confidence mechanism, so as to ensure the rigor of medical literature interpretation.
Objective To evaluate the accuracy of three large language models (LLMs), ChatGPT, Grok, and DeepSeek, in predicting the natural outcome of pediatric ventricular septal defect (VSD) and their discrepancies with actual clinical outcomes, providing insights into whether LLMs can assist clinicians in making personalized management recommendations. Methods A retrospective analysis was performed on clinical data from pediatric patients with VSD admitted to Children's Hospital of Nanjing Medical University between October and December 2020. VSD severity, spontaneous closure probability, and surgical necessity were evaluated separately by ChatGPT, Grok, DeepSeek, and an expert panel. Intergroup differences were analyzed, and the assessments were also compared with the actual outcomes. The stability of model performance was compared based on three repeated assessments by the LLMs. Results A total of 146 children were enrolled, including 87 (59.6%) males and 59 (40.4%) females, with a median age at first diagnosis of 2.0 months (IQR: 1.1-3.4). Significant differences were observed between the Grok group and the expert panel in assessing the probability of spontaneous closure and the necessity of surgery (P=0.01 and P=0.02, respectively). The ChatGPT group also differed from the expert panel in evaluating the necessity of surgery (P=0.05). In comparison with the actual clinical outcomes, only the Grok group showed a significant difference (P<0.05), while ChatGPT achieved the highest consistency between predicted and actual outcomes. Intra-group analysis of the three repeated assessments in the LLM groups showed no statistically significant differences (all P>0.05). Conclusion LLMs demonstrate potential and high stability in predicting the natural outcome of VSD. In particular, ChatGPT shows the highest consistency between its assessments and actual outcomes. LLMs can serve as an auxiliary tool to support the formulation of personalized management strategies.
Objective To construct a lung cancer surgery-oriented disease-specific database covering the entire perioperative care pathway, thereby improving the quality and usability of key surgical data elements. Methods Real-world clinical data were extracted from a single-center thoracic surgery department. A standardized data model was established based on the open electronic health record (openEHR) standard. Large language model (LLM), optical character recognition (OCR), and artificial intelligence (AI)-driven techniques were employed to extract, structure, and perform quality control on unstructured clinical narratives, imaging reports, and radiological data, with a focus on capturing surgically relevant perioperative indicators. Results A multimodal database comprising 19 917 patients was established, including 7 930 males and 11 987 females, with ages ranging from 15 to 97 (61.7±9.7) years. The database includes 582 structured data variables, textual report data corresponding to 69 clinical indicators, 13 000 pulmonary function test PDF reports, and chest CT imaging data from 16 884 patients. This database comprehensively covers the major information relevant to the surgical diagnosis and treatment of lung cancer, significantly improving the completeness and granularity of surgical detail data. LLM and OCR technologies enhanced the efficiency of converting unstructured data into structured formats, while a multi-level manual verification process ensured data accuracy and traceability. The database supports real-world research including comparisons of surgical procedures, prediction of postoperative complications, prognosis assessment, and multimodal data association analyses.
Objective To investigate the effect of an artificial intelligence (AI)-powered voice cloning education system based on the self-reference effect on patient outcomes, and to compare the educational effects of a physician's voice versus the patient's own voice. Methods A prospective, three-arm, parallel-group randomized controlled trial was conducted. A total of 150 thoracic surgery inpatients at the First Hospital of Lanzhou University from May to September 2025 were included and randomly assigned in a 1 : 1 : 1 ratio to a traditional education group (control group, n=50), a physician's voice-cloned AI education group (intervention group 1, n=50), and a patient's own voice-cloned AI education group (intervention group 2, n=50). The primary outcome was the education content compliance rate, which was automatically assessed using the DeepSeek-R1 model. Secondary outcomes included knowledge mastery, educational satisfaction, treatment adherence, quality of life (SF-36), and psychological status (HADS). Results A total of 145 (96.7%) patients completed the trial. There were no significant differences in age [(54.2±10.1) years, (55.8±9.7) years, and (53.9±10.5) years, respectively] or sex distribution (male/female: 28/20, 26/22, and 27/22, respectively) among the three groups (all P>0.05). The immediate post-education content compliance rates of both AI intervention groups were significantly higher than that of the control group (P<0.001). The patient's own voice-cloned AI education group was significantly superior to the physician's voice-cloned AI education group and the control group in terms of knowledge mastery at discharge, treatment adherence at the 1-month follow-up, and anxiety and depression scores at the 1-month follow-up (all P<0.05). Conclusion An AI-powered education model leveraging the self-reference effect through cloning of the patient's own voice significantly improves patient outcomes. This approach demonstrates superior results in knowledge retention, treatment adherence, and psychological well-being compared to traditional methods and physician's voice cloning, offering a new paradigm for personalized and scalable intelligent health education.
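The abstract above reports a 1 : 1 : 1 allocation of 150 inpatients to three arms but does not describe how the random sequence was generated. As an illustration only, one common way to obtain exactly balanced arms is permuted-block randomization; the minimal Python sketch below assumes that method. The arm labels, the block size of 6, the seed, and the function name `permuted_block_allocation` are hypothetical choices made for this example, not details taken from the trial.

```python
import random
from collections import Counter

# Hypothetical arm labels; the abstract names the groups but not any coding scheme.
ARMS = ("control", "physician_voice", "own_voice")


def permuted_block_allocation(n_participants, arms=ARMS, block_size=6, seed=2025):
    """Return a list of arm labels built from shuffled, balanced blocks.

    Assumption: permuted-block randomization; the trial's actual method is not stated.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # fixed seed so the sequence can be reproduced
    per_block = block_size // len(arms)
    allocation = []
    while len(allocation) < n_participants:
        block = list(arms) * per_block  # each arm appears equally often in a block
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_participants]


allocation = permuted_block_allocation(150)
counts = Counter(allocation)
print(counts)
```

Because 150 is a multiple of the assumed block size, every arm receives exactly 50 participants, which matches the group sizes reported in the abstract. In a real trial the sequence would normally be generated and concealed by someone independent of enrollment.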