Infoxmed2.0-27B: Instruction Tuning, Preference Alignment, and GRPO-Based Reward Model Training for Medical LLMs
Abstract-Large language models (LLMs) [1], [2] have demonstrated remarkable capabilities across general domains, yet their application in specialized medical contexts demands rigorous domain adaptation [3], [4]. We present Infoxmed2.0-27B, a medical foundation model built upon Qwen3.5-27B [5] through a comprehensive multi-stage post-training pipeline: (1) proprietary medical data synthesis from a MySQL database with MedicalCategoryTree organization, medical PhD team validation, Chinese RoBERTa [6] semantic deduplication, and API-assisted language refinement; (2) instruction supervised fine-tuning of Qwen3.5-27B via LoRA [7] (r = 8, α = 32) using MS-Swift [8], producing iterations Infoxmed2.0