2026
GRPO Post-Training for Math Reasoning
Systematically explored SFT-to-GRPO post-training on Qwen3-0.6B-Base, improving math reasoning accuracy from 38.2% to 55.2% through reward function design and None discrimination.
GRPO, RL, post-training, math-reasoning