Original Article
Colorectal cancer
How appropriately can generative artificial intelligence platforms, including GPT-4, Gemini, Bing, and Wrtn, answer questions about colon cancer in the Korean language?
Sun Huh
Annals of Coloproctology 2025;41(3):190-197.
DOI: https://doi.org/10.3393/ac.2024.00122.0017
Published online: June 25, 2025

Department of Parasitology and Institute of Medical Education, Hallym University College of Medicine, Chuncheon, Korea

Correspondence to: Sun Huh, MD, PhD, Department of Parasitology and Institute of Medical Education, Hallym University College of Medicine, 1 Hallimdaehak-gil, Chuncheon 24252, Korea. Email: shuh@hallym.ac.kr
• Received: February 25, 2024   • Revised: October 4, 2024   • Accepted: May 9, 2025

© 2025 The Korean Society of Coloproctology

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Purpose
    This study aims to assess the performance of 4 generative artificial intelligence (AI) platforms—Gemini (formerly Bard), Bing, GPT-4, and Wrtn—in answering questions about colon cancer in the Korean language. Two main research questions guided this study. First, which AI platform provides the most accurate answers? Second, can these AI-generated answers be reliably used to educate patients and their families about colon cancer?
  • Methods
    Ten questions selected by the author were posed to the 4 generative AI platforms on February 22, 2024. Two colorectal surgeons in Korea, each with over 20 years of clinical experience, independently evaluated the answers provided by these generative AI platforms.
  • Results
    The generative AI platforms scored an average of 5.5 out of 10 points. Wrtn achieved the highest score at 6 points, followed by GPT-4 and Gemini, each with 5.5, and Bing, scoring 5 points. The weighted κ for inter-rater reliability was 0.597 (P<0.001). The generative AI platforms performed well in explaining the occult blood test for cancer screening, keyhole surgery, and dietary recommendations for cancer prevention. However, they demonstrated significant limitations in answering more complex topics, such as estimating survival rates following surgery, choosing targeted therapy after surgery, and accurately reporting the mortality rate due to colon cancer in Korea.
  • Conclusion
    The findings suggest that using these generative AI platforms as educational resources for patients and their families regarding colon cancer is premature. Further training on colorectal diseases is required before these AI platforms can be considered reliable information sources for the general public in Korea.
Introduction

Since ChatGPT (OpenAI) was released on November 30, 2022, various generative artificial intelligence (AI) platforms have been introduced, such as Gemini (formerly Bard; Google), Bing (Microsoft Corp), Claude (Anthropic), Clova X (Naver Corp), and Wrtn (Wrtn Technologies) [1]. Generative AI platforms are applicable in education [2, 3], research, programming, content generation, and the implementation of creative ideas [4]. However, copyright and authorship concerns persist in scholarly writing and publishing [5]. Generative AI is also increasingly employed in medicine, in 3 main areas: demonstrating proficiency by successfully completing various tests and examinations (including professional certifications and university-level exams), exploring potential medical applications (such as medical school evaluations and clinical interactions involving physicians, patients, and nursing staff), and assisting with medical writing [6]. Research into colorectal diseases also reflects this emerging trend [7].
Colorectal diseases encompass disorders affecting the colon (large intestine) or rectum (the end of the colon). Their causes and symptoms vary widely, depending on the type and severity of the disease. Common colorectal diseases include colorectal cancer, colonic polyps, ulcerative colitis, diverticulitis, irritable bowel syndrome, and hemorrhoids, although numerous other colorectal disorders also exist.
Terms related to generative AI platforms (AI chatbots) are defined as follows:
(1) Artificial intelligence (AI): AI involves the simulation of human intelligence by machines programmed to think and learn similarly to humans. It includes machines' capability to autonomously perform tasks following algorithms designed by humans or learned by the machines themselves through data.
(2) Large language model (LLM): LLM is an AI model trained to understand, generate, and respond to human language. These models are termed "large" due to their extensive data training, sizeable model parameters, and complex neural network structures.
(3) Natural language processing (NLP): NLP is an AI field dedicated to interpreting, understanding, and generating human language, known as natural language.
(4) Generative AI: Generative AI is a branch of AI dedicated to producing new content or data resembling human-generated content. It includes systems capable of generating text, images, music, speech, and other media forms by learning from datasets, subsequently creating original outputs similar to the training material.
(5) AI chatbot: An AI chatbot is software designed to simulate human-like conversations by interpreting and responding to text or voice inputs. These chatbots utilize generative AI technologies, particularly NLP, to naturally and coherently interpret, understand, and reply to user inquiries.
Although generative AI platforms and AI chatbots both leverage AI technologies, generative platforms typically utilize more complex and robust models for content creation, while chatbots prioritize natural language understanding and conversation management. Recently, these 2 technologies have increasingly converged through the chatbot interfaces integrated into generative AI platforms.
Numerous LLMs exist, such as GPT-x, used by ChatGPT, Bing, and Wrtn, and LaMDA, utilized by Bard. Additionally, numerous open-source LLMs are listed on Hugging Face’s Open LLM Leaderboard [8]. These open models compete globally on average scores across various benchmarks, including reasoning about scientific questions, common-sense inference, multitasking ability, hallucination avoidance, common-sense reasoning, and multistep mathematical reasoning.
This study aimed to evaluate the performance of 4 generative AI platforms—GPT-4, Gemini, Bing, and Wrtn—in answering questions about colon cancer in the Korean language. Additionally, the performance of these AI platforms was compared with previous studies addressing similar topics. To achieve the study's objective, 2 research questions were formulated. First, which AI platform provides the most appropriate answers? Second, can these platforms' answers be reliably used to educate patients or their families about colon cancer?
Methods

Ethics statement
This study analyzed the performance of 4 generative AI platforms and did not involve human subjects. Therefore, neither institutional review board approval nor informed consent was required.
Study design and setting
This descriptive study evaluated responses provided by generative AI platforms. Ten common questions on colorectal diseases were tentatively selected by the author and posed to 4 AI chatbots—GPT-4, Gemini, Bing, and Wrtn—on February 22, 2024. The questions were derived from cancer information made publicly available by Asan Medical Center on the prevention, diagnosis, and management of colon cancer [9]. The prompts were input in Korean. The accuracy of the responses from the 4 chatbots was independently assessed by 2 colorectal surgeons in Korea, each with over 20 years of clinical experience. Each response was scored 1 if the content was appropriate to recommend to the general public and 0 if the content was inappropriate or required correction due to inaccuracies.
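As a concrete illustration of this scoring and aggregation scheme, the short Python sketch below recomputes the platform scores from the per-question 0/1 ratings reported later in Table 1. The dictionary layout and variable names are illustrative choices; the rating values themselves are copied from the table.

```python
# Scoring scheme from the Methods: each rater marks an answer 1 (appropriate
# for the general public) or 0 (inappropriate or needing correction). A
# platform's score out of 10 is the mean of the two raters' sums.

ratings = {
    # platform: (rater A's scores for Q1-Q10, rater B's scores for Q1-Q10),
    # copied from Table 1
    "GPT-4":  ([1, 0, 0, 1, 1, 0, 1, 1, 0, 1], [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]),
    "Gemini": ([1, 1, 1, 1, 0, 0, 1, 1, 0, 0], [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]),
    "Bing":   ([1, 1, 1, 1, 0, 0, 0, 1, 0, 0], [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]),
    "Wrtn":   ([1, 0, 1, 1, 0, 0, 1, 1, 0, 1], [1, 1, 1, 1, 0, 0, 1, 1, 0, 0]),
}

# Per-platform score: average of the two raters' sums (maximum, 10).
scores = {name: (sum(a) + sum(b)) / 2 for name, (a, b) in ratings.items()}
print(scores)                    # {'GPT-4': 5.5, 'Gemini': 5.5, 'Bing': 5.0, 'Wrtn': 6.0}
print(sum(scores.values()) / 4)  # 5.5, the overall average reported in the Results
```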
Variables
The outcome variables consisted of the accuracy of answers provided by the 4 generative AI platforms, as evaluated independently by 2 expert colorectal surgeons in Korea.
Data source/measurement
The data comprised answers generated by the 4 generative AI platforms in response to input prompts. The validity of the questions was previously confirmed by the 2 colorectal surgeons. The weighted κ for inter-rater agreement was calculated.
Bias
The generative AI platforms and the topic questions were selected solely by the author. Therefore, selection bias may have existed regarding the topics chosen. Nevertheless, all topics selected pertained exclusively to colon cancer.
Study size
Sample size estimation was not required because of the nature of this descriptive study.
Statistical methods
Descriptive statistics were used. The weighted κ coefficient was calculated using dBSTAT ver. 5.0 (dBSTAT; http://dbstat.com/).
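As a cross-check (not part of the original analysis, which used dBSTAT), the reported weighted κ can be reproduced from the 40 paired ratings in Table 1; scikit-learn is substituted here purely for illustration. Because the ratings are binary, the linearly weighted κ coincides with Cohen's unweighted coefficient.

```python
# Reproducing the inter-rater agreement from the Table 1 ratings.
from sklearn.metrics import cohen_kappa_score

# All 40 paired ratings (4 platforms x 10 questions), copied from Table 1
# in the order GPT-4, Gemini, Bing, Wrtn.
rater_a = [1,0,0,1,1,0,1,1,0,1,  1,1,1,1,0,0,1,1,0,0,
           1,1,1,1,0,0,0,1,0,0,  1,0,1,1,0,0,1,1,0,1]
rater_b = [1,1,0,1,0,0,1,1,0,0,  1,0,1,1,0,0,1,1,0,0,
           1,0,1,1,1,0,0,1,0,0,  1,1,1,1,0,0,1,1,0,0]

# With binary (0/1) ratings, weights="linear" gives the same value as the
# unweighted Cohen's kappa.
kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(round(kappa, 3))  # 0.597, matching the value reported in the Results
```

The two raters agree on 32 of the 40 ratings (80% observed agreement); κ corrects this figure for the agreement expected by chance.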
Results

The 2 raters' evaluations of the 4 chatbots' answers to each question are presented in Table 1, and the complete responses from the 4 generative AI platforms are provided in Supplementary Table 1.
The experts' evaluations of the generative AI platforms are summarized in Fig. 1. The average competence score across the 4 generative AI platforms was 5.5 out of a maximum possible score of 10. In descending order of performance, the platforms ranked as follows: Wrtn (average, 6), GPT-4 (average, 5.5), Gemini (average, 5.5), and Bing (average, 5).
The generative AI platforms provided excellent responses to questions concerning the significance of fecal occult blood tests for colon cancer screening, dietary recommendations to prevent colon cancer, and prognostic outcomes following robotic surgery for colon cancer.
However, the platforms encountered significant difficulties in addressing more complex questions, including the staging of colon cancer at diagnosis, the 5-year survival rate following surgery, indications for targeted therapy, and colon cancer mortality rankings among Korean men and women.
The weighted κ coefficient indicating inter-rater agreement was 0.597 (95% confidence interval, 0.347–0.846; P<0.001).
Discussion

The accuracy of the 4 generative AI platforms could be improved, as indicated by their average score of 5.5 out of a maximum of 10. The 4 platforms demonstrated variable performance, with more professionally oriented medical knowledge presenting significant challenges.
The platforms' overall accuracy (5.5 out of 10) suggests that their ability to handle even basic questions about colon cancer remains inconsistent when measured against professional medical knowledge. In particular, their effectiveness in addressing questions on prevention, diagnosis, and prognosis following surgery indicates that generative AI could play a valuable role in disseminating crucial health information to the public. However, the findings also underscore significant limitations of generative AI platforms in estimating survival rates of colon cancer patients and advising on targeted therapy. These limitations may stem from constraints in their training data.
The weighted κ score of 0.597 suggests a need for clearer definitions of what constitutes satisfactory AI-generated advice, highlighting the necessity for improved evaluation standards and methodologies.
The varying performance of the generative AI platforms—with Wrtn outperforming GPT-4, Gemini, and Bing—implies that differences in design and training datasets substantially influence their accuracy in addressing medical questions. Wrtn's superior performance may be attributable to language specificity: the questions in this study were formulated and answered in Korean, and Wrtn, developed by a Korea-based company, may be better equipped to handle Korean-language content. This variation emphasizes the importance of tailoring AI tools to specific health information needs.
To contextualize these findings, previous studies assessing the accuracy of generative AI platforms in answering questions about colorectal diseases were reviewed. Eligibility criteria included articles indexed in PubMed that examined generative AI platforms (chatbots) used in colorectal disease research. PubMed was searched on February 22, 2024, using the search term "(colon cancer OR colorectal diseases OR colonic polyps OR ulcerative colitis OR diverticulitis OR irritable bowel syndrome OR hemorrhoid) AND (chatgpt OR AI chatbot OR large language model)." This search identified 56 citations. After reviewing their content for relevance, 18 articles explicitly involving generative AI platforms were selected and analyzed as representative examples of generative AI use in colorectal disease research. The research topics were compiled and analyzed to identify critical themes emerging from the literature. Previous studies addressing the accuracy of generative AI platforms are summarized below.
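For reproducibility, the sketch below shows how this PubMed query could be submitted programmatically through the NCBI E-utilities esearch endpoint. The use of the requests library and the retmax value are illustrative choices, and the hit count will differ from the 56 citations retrieved on February 22, 2024, as PubMed grows.

```python
# Running the study's PubMed search via the NCBI E-utilities esearch API.
import requests

TERM = ('(colon cancer OR colorectal diseases OR colonic polyps OR '
        'ulcerative colitis OR diverticulitis OR irritable bowel syndrome OR '
        'hemorrhoid) AND (chatgpt OR AI chatbot OR large language model)')

resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed", "term": TERM, "retmax": 100, "retmode": "json"},
    timeout=30,
)
result = resp.json()["esearchresult"]
print(result["count"])   # number of matching citations (56 on 2024-02-22)
print(result["idlist"])  # PubMed IDs to screen for relevance
```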
Of the initial 18 articles, 10 were ultimately selected for analysis after excluding those not directly related to generative AI platforms for colorectal diseases (Table 2). These articles originated from various countries: 4 from the United States, 2 from Israel, 1 each from Italy, Korea, and Türkiye, and 1 from a collaboration between the United States and Taiwan. The articles addressed various colorectal diseases, including colon cancer (5 articles), inflammatory bowel disease (IBD; 2 articles), ulcerative colitis (1 article), small bowel obstruction (1 article), and surgical techniques (1 article). A comparison of the results, including those of the present study, is summarized in Table 2 [10–19].
Gravina et al. [10] evaluated ChatGPT (GPT-3.5) for accuracy in answering 10 frequently asked questions about IBD. A panel of IBD experts generated these questions based on patient inquiries. They concluded that there was a lack of quantitative data and insufficient patient-directed information in the AI responses.
Beaulieu-Jones et al. [11] evaluated GPT-4's surgical knowledge by inputting prompts from 167 Surgical Council on Resident Education (SCORE) and 112 Data-B questions from the USA, in multiple-choice and open-ended formats. Correct responses accounted for 71.3% and 67.9% of multiple-choice questions and 47.9% and 66.1% of open-ended questions for SCORE and Data-B, respectively. They concluded that it remains unclear whether large language models like ChatGPT can safely support clinical care.
Cankurtaran et al. [12] examined ChatGPT’s accuracy in answering 20 questions on IBD, which were evaluated by 4 gastroenterologists. Two independent gastroenterology experts rated ChatGPT's answers, scoring them as 4.70±1.26 (on a 3–7 scale) for Crohn disease and 4.40±1.21 (on a 3–7 scale) for ulcerative colitis. The authors concluded that ChatGPT still exhibited significant limitations and inadequacies.
Kerbage et al. [13] assessed GPT-4's responses to 30 frequently asked patient questions about irritable bowel syndrome, IBD, colonoscopy, and colorectal cancer screening. Three expert gastroenterologists evaluated the accuracy of these responses, achieving an 84% acceptable rate. Nevertheless, the authors cautioned against relying solely on ChatGPT for clinical decision-making or as a definitive reference source.
While the accuracy rates of generative AI platforms ranged from 60% to 95%, interpretations varied from positive [14–19] to negative [10–13] (including the present study). Notably, LLMs demonstrated impressive potential for improving radiology referral quality in emergency settings, achieving 50% accuracy for acute small bowel obstruction and 100% accuracy for indolent small bowel obstruction, acute cholecystitis, acute appendicitis, and diverticulitis [18].
Previous studies predominantly utilized ChatGPT, GPT-4, or Bing. This study introduced comparisons with Gemini and Wrtn, finding that Wrtn outperformed GPT-4, Gemini, and Bing. Even so, further pretraining with reliable surgical data on colorectal diseases is necessary.
This study has several limitations. A more precise evaluation protocol could enhance inter-rater reliability, as indicated by the weighted κ coefficient. Additional generative AI platforms were not included, and results might vary with their inclusion. Moreover, the 10 questions were selected solely by the author without explicit justification, which might have introduced selection bias. No data on actual patient or family inquiries regarding colon cancer were analyzed, limiting the ability to ascertain the most critical patient-centered topics.
The limited number of questions (n=10) might also affect generalizability; incorporating more questions could yield more comprehensive insights into the competencies of generative AI platforms regarding colon cancer. However, this study aimed not to generalize generative AI competency but specifically to assess the platforms' effectiveness in answering typical questions asked by patients and families, thereby highlighting their current limitations.
Before ChatGPT's emergence as a public generative AI platform, AI tools in colorectal diseases primarily comprised prediction, decision-making, or classification models using deep neural networks [20]. For example, predictive AI models could automatically detect polyps during colonoscopic procedures [21]. The advent of generative AI platforms marks a new era in applying AI to medical care and research in colorectal diseases, offering novel opportunities for prediction, diagnosis, and follow-up through large-scale data gathering and analysis.
In conclusion, Wrtn achieved the highest accuracy among the evaluated generative AI platforms. However, the overall average accuracy score was only 5.5 out of a maximum of 10 (range, 5–6). Thus, recommending these generative AI platforms' answers for patient or family education on colon cancer in the Korean language remains premature. Additional pretraining with expert-level colorectal disease knowledge is essential for generative AI platforms to become dependable information sources for the general public.

Conflict of interest

No potential conflict of interest relevant to this article was reported.

Funding

This study was supported by the Hallym University Research Fund 2023 (No. HRF-202310-001).

Supplementary Table 1.

Evaluations by 2 raters on responses provided by 4 generative artificial intelligence platforms to 10 questions regarding colon cancer in Korean
ac-2024-00122-0017-Supplementary-Table-1.pdf
Supplementary materials are available from https://doi.org/10.3393/ac.2024.00122.0017.
Fig. 1.
Appropriateness of the answers provided by 4 generative artificial intelligence (AI) platforms (GPT-4, OpenAI; Gemini, Google; Bing, Microsoft Corp; Wrtn, Wrtn Technologies) evaluated by 2 colorectal surgeons in Korea.
Table 1.
Accuracy of the 4 generative artificial intelligence platforms^a on 10 questions about colon cancer, evaluated by 2 raters

| Question | Sum | Rater A (GPT-4 / Gemini / Bing / Wrtn) | Rater B (GPT-4 / Gemini / Bing / Wrtn) |
| 1. During a routine health screening, if bleeding is detected in a fecal occult blood test, what is the probability that it indicates colon cancer? | 8 | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1 |
| 2. At what age should Korean men begin receiving regular colonoscopy screenings? | 4 | 0 / 1 / 1 / 0 | 1 / 0 / 0 / 1 |
| 3. Approximately how many years does it take for precancerous polyps to develop into colon cancer? | 6 | 0 / 1 / 1 / 1 | 0 / 1 / 1 / 1 |
| 4. Please share 3 dietary recommendations for preventing colon cancer. | 8 | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1 |
| 5. Why does colon cancer most frequently occur in the sigmoid colon and rectum? | 2 | 1 / 0 / 0 / 0 | 0 / 0 / 1 / 0 |
| 6. For a 55-year-old Korean man diagnosed with stage III colon cancer, what is the 5-year survival rate following surgery? | 0 | 0 / 0 / 0 / 0 | 0 / 0 / 0 / 0 |
| 7. If a 55-year-old Korean man is diagnosed with stage III colon cancer, should his 21-year-old daughter receive annual colonoscopy screenings? | 6 | 1 / 1 / 0 / 1 | 1 / 1 / 0 / 1 |
| 8. For a 55-year-old Korean man with stage III colon cancer, is robotic surgery superior to laparoscopic or endoscopic resection in terms of prognosis? | 8 | 1 / 1 / 1 / 1 | 1 / 1 / 1 / 1 |
| 9. After surgery for stage III colon cancer in a 55-year-old Korean man, is targeted therapy recommended? | 0 | 0 / 0 / 0 / 0 | 0 / 0 / 0 / 0 |
| 10. In 2022, what were the respective rankings of colon cancer mortality rates among Korean men and women compared to other cancer types? | 2 | 1 / 0 / 0 / 1 | 0 / 0 / 0 / 0 |
| Total | 44 | 6 / 6 / 5 / 6 | 5 / 5 / 5 / 6 |

Responses were scored 1 if adequate and 0 if insufficient or inadequate for the general public.
^a GPT-4, OpenAI; Gemini, Google; Bing, Microsoft Corp; Wrtn, Wrtn Technologies.

Table 2.
Comparison of the results of this study with those of 10 articles regarding generative AI platforms (chatbots) for colorectal disease research indexed in PubMed (cited February 22, 2024)

| Study | Country | Chatbot type | Disease entity | Question | Rater | Accuracy | Interpretation |
| Gravina et al. [10] (2024) | Italy | ChatGPT (GPT-3.5, OpenAI) | IBD | 10 items (a group of IBD-expert physicians compiled the 10 questions most frequently asked by patients with IBD) | Authors | No quantitative data | Not enough information for patients |
| Beaulieu-Jones et al. [11] (2024) | USA, Taiwan | GPT-4 (OpenAI) | Surgical knowledge | 167 SCORE and 112 Data-B questions from the USA, in multiple-choice and open-ended formats | Correct answers determined | Multiple choice: SCORE, 71.3%; Data-B, 67.9%. Open-ended: SCORE, 47.9%; Data-B, 66.1% | It is unclear whether LLMs such as ChatGPT can safely assist clinicians in providing care |
| Cankurtaran et al. [12] (2023) | Türkiye | ChatGPT (OpenAI) | IBD | 20 questions by a committee of 4 gastroenterologists | 2 independent gastroenterology experts | Crohn disease, 4.70±1.26 (scale, 3–7); ulcerative colitis, 4.40±1.21 (scale, 3–7) | ChatGPT still has some limitations and deficiencies |
| Kerbage et al. [13] (2024) | USA | GPT-4 (OpenAI) | Irritable bowel syndrome, IBD, colonoscopy, and colorectal cancer screening | 30 frequently asked questions by patients | 3 expert gastroenterologists | Acceptable rate of 84% accuracy | The authors urge caution in relying on ChatGPT for clinical decision-making or as a reference source |
| Mukherjee et al. [14] (2023) | USA | ChatGPT (OpenAI) | Colon cancer | 12 items of the AGA's recommendations for follow-up after colonoscopy and polypectomy | 4 adjudicators | Only 1 of 12 questions was answered 100% appropriately for patients | Future renditions will be able to address nuanced queries with increased precision, serving as a readily available resource for GI education |
| Choo et al. [15] (2024) | Korea | ChatGPT (OpenAI) | Colon cancer | Treatment recommendations made by ChatGPT for 30 stage IV, recurrent, synchronous colorectal cancer patients | Authors | Concordance rate between ChatGPT and the MDT was 86.7% | The ability of ChatGPT to understand complex stage IV, recurrent, and synchronous colorectal cancer cases is itself an impressive feat |
| Janopaul-Naylor et al. [16] (2024) | USA | ChatGPT (OpenAI), Bing (Microsoft Corp) | Colon cancer | 117 questions in a subset of the ACS's recommended "Questions to Ask About Your Cancer" | Expert panel using the validated DISCERN criteria | ChatGPT vs. Bing for colorectal cancer (range, 1–5): 3.8 vs. 3.0 (P<0.001) | The findings suggest a critical need, particularly around cancer prognostication, for continual refinement to limit misleading counseling, confusion, and emotional distress to patients and families |
| Levartovsky et al. [17] (2023) | Israel | ChatGPT (OpenAI) | Ulcerative colitis | 20 cases rated for disease severity using the Truelove and Witts criteria and the necessity of hospitalization for patients with ulcerative colitis | Gastroenterologist | 80% accuracy | ChatGPT could serve as a clinical decision support tool in assessing acute ulcerative colitis, functioning as an adjunct to clinical judgment |
| Barash et al. [18] (2023) | Israel | GPT-4 (OpenAI) | Small bowel obstruction, acute cholecystitis, acute appendicitis, diverticulitis | 40 cases of clinical notes from the ED input as prompts, with a request for an imaging recommendation | 2 independent radiologists | Small bowel obstruction (acute), 50%; small bowel obstruction (indolent), 100%; acute cholecystitis, 100%; acute appendicitis, 100%; diverticulitis, 100% | LLMs may improve radiology referral quality |
| Emile et al. [19] (2023) | USA | ChatGPT (OpenAI) | Colon cancer | 38 questions based on the authors' clinical experience and patient information handouts from the ASCRS | 1–3 experts | 87% deemed appropriate | ChatGPT may become a popular educational and informative tool |
| This study | Korea | GPT-4 (OpenAI), Gemini (Google), Bing (Microsoft Corp), Wrtn (Wrtn Technologies) | Colon cancer | 10 questions regarding cancer information provided by Asan Medical Center | 2 expert colorectal surgeons | Average score (maximum, 10): GPT-4, 5.5; Gemini, 5.5; Bing, 5; Wrtn, 6 | Greater accuracy is needed before generative AI platforms can be used by patients or their families |

AI, artificial intelligence; IBD, inflammatory bowel disease; SCORE, Surgical Council on Resident Education; LLM, large language model; AGA, American Gastroenterological Association; GI, gastrointestinal; MDT, multidisciplinary team; ACS, American Cancer Society; ED, emergency department; ASCRS, American Society of Colon and Rectal Surgeons.

References

  • 1. Lee H, Park S. Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study. J Educ Eval Health Prof 2023;20:39.
  • 2. Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: a descriptive study. J Educ Eval Health Prof 2023;20:1.
  • 3. Lo CK. What is the impact of ChatGPT on education? A rapid review of the literature. Educ Sci 2023;13:410.
  • 4. Kim TW. Application of artificial intelligence chatbots, including ChatGPT, in education, scholarly work, programming, and content generation and its prospects: a narrative review. J Educ Eval Health Prof 2023;20:38.
  • 5. Lee JY. Can an artificial intelligence chatbot be the author of a scholarly article? Sci Ed 2023;10:7–12.
  • 6. Kim SJ. Trends in research on ChatGPT and adoption-related issues discussed in articles: a narrative review. Sci Ed 2023;11:3–11.
  • 7. Li W, Zhang Y, Chen F. ChatGPT in colorectal surgery: a promising tool or a passing fad? Ann Biomed Eng 2023;51:1892–7.
  • 8. Hugging Face. Open LLM leaderboard [Internet]. Hugging Face [cited 2024 Feb 22]. Available from: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
  • 9. Cancer Edu-Info Center. [Guide for patients and the general public: understanding colorectal cancer]. Asan Medical Center; 2019. Korean.
  • 10. Gravina AG, Pellegrino R, Cipullo M, Palladino G, Imperio G, Ventura A, et al. May ChatGPT be a tool producing medical information for common inflammatory bowel disease patients’ questions? An evidence-controlled analysis. World J Gastroenterol 2024;30:17–33.
  • 11. Beaulieu-Jones BR, Berrigan MT, Shah S, Marwaha JS, Lai SL, Brat GA. Evaluating capabilities of large language models: performance of GPT-4 on surgical knowledge assessments. Surgery 2024;175:936–42.
  • 12. Cankurtaran RE, Polat YH, Aydemir NG, Umay E, Yurekli OT. Reliability and usefulness of ChatGPT for inflammatory bowel diseases: an analysis for patients and healthcare professionals. Cureus 2023;15:e46736.
  • 13. Kerbage A, Kassab J, El Dahdah J, Burke CA, Achkar JP, Rouphael C. Accuracy of ChatGPT in common gastrointestinal diseases: impact for patients and providers. Clin Gastroenterol Hepatol 2024;22:1323–5.
  • 14. Mukherjee S, Durkin C, Pebenito AM, Ferrante ND, Umana IC, Kochman ML. Assessing ChatGPT’s ability to reply to queries regarding colon cancer screening based on multisociety guidelines. Gastro Hep Adv 2023;2:1040–3.
  • 15. Choo JM, Ryu HS, Kim JS, Cheong JY, Baek SJ, Kwak JM, et al. Conversational artificial intelligence (chatGPT™) in the management of complex colorectal cancer patients: early experience. ANZ J Surg 2024;94:356–61.
  • 16. Janopaul-Naylor JR, Koo A, Qian DC, McCall NS, Liu Y, Patel SA. Physician assessment of ChatGPT and Bing answers to American Cancer Society’s questions to ask about your cancer. Am J Clin Oncol 2024;47:17–21.
  • 17. Levartovsky A, Ben-Horin S, Kopylov U, Klang E, Barash Y. Towards AI-augmented clinical decision-making: an examination of ChatGPT’s utility in acute ulcerative colitis presentations. Am J Gastroenterol 2023;118:2283–9.
  • 18. Barash Y, Klang E, Konen E, Sorin V. ChatGPT-4 assistance in optimizing emergency department radiology referrals and imaging selection. J Am Coll Radiol 2023;20:998–1003.
  • 19. Emile SH, Horesh N, Freund M, Pellino G, Oliveira L, Wignakumar A, et al. How appropriate are answers of online chat-based artificial intelligence (ChatGPT) to common questions on colon cancer? Surgery 2023;174:1273–5.
  • 20. Yu C, Helwig EJ. The role of AI technology in prediction, diagnosis and treatment of colorectal cancer. Artif Intell Rev 2022;55:323–43.
  • 21. Glissen Brown JR, Mansour NM, Wang P, Chuchuca MA, Minchenberg SB, Chandnani M, et al. Deep learning computer-aided polyp detection reduces adenoma miss rate: a United States multi-center randomized tandem colonoscopy study (CADeT-CS trial). Clin Gastroenterol Hepatol 2022;20:1499–507.
