Detection without calibration: benchmarking domestic and international large language models for quality control of Mandarin 18F-FDG PET/CT reports
Large language models (LLMs) are increasingly used for automated quality control (QC) of radiology reports. However, the reliability of LLMs on reports in Mandarin, and the relative performance of domestic versus international flagship models, remain unknown. We benchmarked 14 LLM configurations, seven Chinese-developed ("domestic") and seven international models, on 1,000 whole-body 18F-FDG PET/CT reports split into an error-injected "junior-doctor" arm and a low-residual "finalised" arm (500 each), using a controlled error-injection gold standard. Under a blinded zero-shot prompt, each model flagged six error types and assigned a 1-5 overall score. Two distinct abilities: error-detection