
GRPO Post-Training for Math Reasoning on Qwen3-0.6B

From SFT to GRPO: a systematic exploration of post-training strategies that improved math reasoning accuracy from 38.2% to 55.2% on Qwen3-0.6B-Base, with reward function design and None discrimination.

GRPO, RL, post-training, math-reasoning, Qwen

TL;DR

This post documents a 14-experiment journey of post-training Qwen3-0.6B-Base for mathematical reasoning. Starting from SFT (supervised fine-tuning), I discovered that SFT alone cannot substantially improve a small model's reasoning ability. Switching to GRPO (Group Relative Policy Optimization) — the algorithm behind DeepSeek-R1 — unlocked significant gains: accuracy improved from 38.2% to 55.2% (+17pp). The key challenges were reward function design and teaching the model to recognize unsolvable problems.


Phase 1: SFT Experiments — The Bottleneck

Cold-start SFT

I began with a cold-start dataset of 8,323 math problems (with <think> reasoning chains) and performed full-parameter SFT on Qwen3-0.6B-Base (4 epochs, lr = $2 \times 10^{-5}$, DeepSpeed + OpenRLHF).

Result: 39.5% accuracy (valid_1000), barely above the 38.2% baseline.

Scaling SFT Data — Regression

I expanded the training set by merging in three additional high-quality datasets from teammates. Counterintuitively, accuracy dropped to 36.7%. The new data shifted the distribution in ways that hurt overall performance.

Key insight: For a small model like Qwen3-0.6B, SFT alone cannot produce meaningful capability gains. However, the Base model's pass@5 reached 62.6%: the model has the knowledge, but greedy decoding cannot reliably access it. This gap between greedy accuracy and pass@k is exactly what RL can close.
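To make the greedy versus pass@k gap concrete, here is a minimal sketch of how the two numbers can be computed from per-sample correctness flags. The function names and the toy data are illustrative assumptions, not the evaluation code used for the numbers in this post.

```python
from typing import List


def greedy_accuracy(greedy_correct: List[bool]) -> float:
    """Fraction of problems solved by the single greedy completion."""
    return sum(greedy_correct) / len(greedy_correct)


def pass_at_k(sample_correct: List[List[bool]], k: int) -> float:
    """Fraction of problems where at least one of the first k samples is correct."""
    solved = sum(any(samples[:k]) for samples in sample_correct)
    return solved / len(sample_correct)


# Toy data (hypothetical): 4 problems, 5 sampled completions each.
samples = [
    [False, True, False, False, False],
    [False, False, False, False, False],
    [True, True, False, True, True],
    [False, False, False, False, True],
]
greedy = [False, False, True, False]

print(greedy_accuracy(greedy))  # 0.25
print(pass_at_k(samples, k=5))  # 0.75
```

In this toy case the model "knows" three of the four answers but greedy decoding finds only one of them, which is the kind of gap described above.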


Phase 2: GRPO Validation — RL >> SFT

Why GRPO?

GRPO samples $K$ completions per prompt and computes group-relative advantages, directly optimizing the generation distribution. It narrows the greedy-pass@k gap without needing additional labeled data — the reward signal comes from comparing generated answers to ground truth.
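The group-relative advantage can be sketched in a few lines: each completion's reward is compared against the mean of its own group and scaled by the group's standard deviation. This is a simplified illustration, not verl's implementation; the population standard deviation and the `eps` value are assumptions, and real implementations differ in these details.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize the rewards of one prompt's K completions within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    # eps keeps the division finite when every completion gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, K=10 completions: two correct (+1.0), eight wrong (-0.1).
rewards = [1.0, -0.1, -0.1, 1.0, -0.1, -0.1, -0.1, -0.1, -0.1, -0.1]
adv = group_relative_advantages(rewards)
print([round(a, 2) for a in adv])
```

Correct completions receive a positive advantage and wrong ones a negative advantage. A group where all completions score the same yields zero advantage everywhere, so that prompt contributes no learning signal.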

Setup

  • Framework: verl 0.7.1
  • $K=10$, Temperature $=0.7$, LR $=1 \times 10^{-6}$
  • Single RTX 5090 32GB (Actor + vLLM rollout dual-engine)
  • 1,772 GRPO training prompts, filtered to those with an extractable $\boxed{\cdot}$ answer
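The answer-extraction filter from the setup can be sketched as follows. This is an assumption about how such a filter might look, not the script used here: `extract_boxed`, the `question`/`solution` field names, and the sample records are all hypothetical. Brace matching is done by hand because nested braces such as `\frac{1}{2}` defeat a simple regular expression.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent or unbalanced."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close brace found
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never balanced


# Hypothetical records; the real dataset schema may differ.
records = [
    {"question": "1/4 + 1/4 = ?", "solution": r"Adding gives \boxed{\frac{1}{2}}"},
    {"question": "Prove that ...", "solution": "A proof with no final boxed answer."},
]
kept = [r for r in records if extract_boxed(r["solution"])]
print(len(kept))  # 1
```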

Results

| Model | Training | Accuracy |
| --- | --- | --- |
| Base (Qwen3-0.6B) | None | 38.2% |
| SFT Coldstart (best) | Full SFT | 39.5% |
| GRPO from Base | GRPO 5ep | 46.6% |

GRPO applied directly to the Base model reached 46.6%, which is 7.1pp above the best SFT result. This was the first clear validation that RL is the right path.

Figure: GRPO Base→5ep training


Phase 3: Teaching "I Don't Know" — None Discrimination

The Problem

13% of the validation set consists of critical thinking problems that are intentionally underspecified — the correct answer is None. After initial GRPO training, the model scored 0% on these problems: it always hallucinated an answer instead of recognizing impossibility.

Two-Stage Solution

  1. SFT Stage: Fine-tune the GRPO checkpoint on 150 None-labeled examples (with full "condition check → missing info → None" reasoning chains) + 400 cold-start examples (lr = $2 \times 10^{-6}$, 2 epochs).
  2. GRPO Stage: Resume GRPO training with the reward function giving bonus points for correct None answers.

Results

| Model | Overall | None Acc | False Positive | Weird Tokens |
| --- | --- | --- | --- | --- |
| GRPO Base→ep5 (no None) | 46.6% | 0% | 3.8% | |
| SFT None only | 40.7% | 67.7% | 22.1% | 10.8% |
| GRPO Mixed (SFT→GRPO) | 55.7% | 80.0% | 14.0% | 14.0% |

Critical insight: Build math ability first (GRPO), then teach None discrimination (SFT), then consolidate both (GRPO). Doing None SFT directly from Base caused false positives to explode above 45%.
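For reference, the two None-related metrics can be computed as in the sketch below. It assumes that None accuracy is measured over problems whose ground truth is None, and that the false positive rate is the share of answerable problems on which the model outputs None. The denominators are an assumption; the function name and the toy pairs are hypothetical.

```python
from typing import List, Optional, Tuple


def none_metrics(pairs: List[Tuple[Optional[str], Optional[str]]]) -> Tuple[float, float]:
    """pairs holds (prediction, ground_truth); None marks an unsolvable problem.

    Returns (none_accuracy, false_positive_rate).
    """
    none_total = none_hit = answerable_total = false_pos = 0
    for pred, truth in pairs:
        if truth is None:
            none_total += 1
            none_hit += int(pred is None)
        else:
            answerable_total += 1
            false_pos += int(pred is None)
    none_acc = none_hit / none_total if none_total else 0.0
    fp_rate = false_pos / answerable_total if answerable_total else 0.0
    return none_acc, fp_rate


# Toy predictions: 2 unsolvable problems, 3 answerable ones.
pairs = [(None, None), ("3", None), ("5", "5"), (None, "7"), ("1", "2")]
none_acc, fp_rate = none_metrics(pairs)
print(none_acc, round(fp_rate, 3))  # 0.5 0.333
```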


Phase 4: Reward Function Design — Three Generations

The reward function is the most critical component of GRPO. I iterated through three versions:

v0 — Baseline Reward

  • Correct: +1.0; Wrong: −0.1
  • None correct: +1.0; None wrong: −0.5
  • Length penalty outside 30–768 token window
  • Answer-driven only — no <think> format checks
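A minimal sketch of the v0 rules as a function. The reward values and the 30–768 token window come from the list above. The size of the length penalty (`length_penalty=0.2`), exact string matching, and reading "None wrong" as "ground truth is None but the model gave an answer" are all assumptions made for illustration.

```python
from typing import Optional


def reward_v0(
    pred: Optional[str],
    truth: Optional[str],
    n_tokens: int,
    min_tokens: int = 30,
    max_tokens: int = 768,
    length_penalty: float = 0.2,  # hypothetical magnitude, not stated in the post
) -> float:
    """Answer-driven reward: no checks on the <think> format."""
    if truth is None:
        # Unsolvable problem: reward recognizing it, penalize a made-up answer.
        score = 1.0 if pred is None else -0.5
    else:
        # Assumption: exact string match after answer extraction.
        score = 1.0 if pred == truth else -0.1
    if n_tokens < min_tokens or n_tokens > max_tokens:
        score -= length_penalty
    return score


print(reward_v0("42", "42", n_tokens=200))  # 1.0
print(reward_v0(None, None, n_tokens=200))  # 1.0
print(reward_v0("7", None, n_tokens=200))   # -0.5
```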

v1 — False Positive Penalty

After the model learned to say "None", it started saying it on answerable problems too. Fix: when the ground truth is not None and the model outputs None, apply a penalty of −0.8 (the harshest single penalty).
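A quick back-of-the-envelope calculation shows why this penalty matters. The sketch below computes the average answer reward of a degenerate policy that outputs None on every problem. It assumes that v0 scores a None on an answerable problem as an ordinary wrong answer (−0.1), and it uses the 13% validation share of None problems as a stand-in for the training mix. Both are assumptions, and the function name is hypothetical.

```python
def expected_reward_always_none(none_fraction: float, none_on_answerable: float) -> float:
    """Average answer reward for a policy that outputs None on every problem."""
    none_correct = 1.0  # reward for a correct None
    return none_fraction * none_correct + (1.0 - none_fraction) * none_on_answerable


NONE_FRACTION = 0.13  # share of None problems reported for the validation set

v0 = expected_reward_always_none(NONE_FRACTION, -0.1)  # assumed v0 behavior
v1 = expected_reward_always_none(NONE_FRACTION, -0.8)  # v1 false positive penalty
print(round(v0, 3), round(v1, 3))  # 0.043 -0.566
```

Under these assumptions, always answering None is slightly profitable with v0 and clearly unprofitable with v1.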

v2 — Weird Token Penalty

GRPO training produced a peculiar artifact: the model learned to insert non-ASCII characters (private-use Unicode, fullwidth symbols) into answers — likely reward hacking to match ground truth strings.

Fix: A character-level whitelist allowing ASCII + Greek letters + math operators + arrows. Each out-of-whitelist character penalized −0.01.

Effect: Weird token rate dropped from 43.5% to 0%.
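The whitelist penalty can be sketched with Unicode code point ranges. The −0.01 per character comes from the fix described above. The specific ranges below (the Greek and Coptic, Arrows, and Mathematical Operators blocks) are an assumption about how "Greek letters + math operators + arrows" might be encoded, and the function names are hypothetical.

```python
ALLOWED_RANGES = [
    (0x0000, 0x007F),  # ASCII
    (0x0370, 0x03FF),  # Greek and Coptic block (assumed)
    (0x2190, 0x21FF),  # Arrows block (assumed)
    (0x2200, 0x22FF),  # Mathematical Operators block (assumed)
]


def is_allowed(ch: str) -> bool:
    """True if the character's code point falls inside a whitelisted range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def weird_token_penalty(text: str, per_char: float = 0.01) -> float:
    """Return a non-positive penalty: -per_char for each out-of-whitelist character."""
    bad = sum(1 for ch in text if not is_allowed(ch))
    return 0.0 - per_char * bad


clean = "x = 2\u03c0 \u2192 \u221e"  # Greek pi, right arrow, infinity sign
dirty = "\uff11\uff12\ue000"         # two fullwidth digits and one private-use character
print(weird_token_penalty(clean))            # 0.0
print(round(weird_token_penalty(dirty), 2))  # -0.03
```

A per-character penalty scales with the amount of contamination, so a single stray symbol costs little while a response full of private-use characters loses most of its reward.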

Core lesson: The length penalty, short-response penalty, false positive penalty, and weird token penalty are all necessary; not one of them can be omitted.


Final Results

| Metric | Score |
| --- | --- |
| valid_1000 (Overall) | 55.2% |
| None Recognition | 69.2% |
| False Positive Rate | 4.3% |
| Weird Character Rate | 0% |
| High School Math | 66.8% |
| GSM8K Level | 67.8% |
| Level 3 | 57.4% |
| Level 4 | 42.6% |
| Olympiad | 15.9% |

Final pipeline: Base → SFT (None + Math) → GRPO (6ep, K=10) → GRPO Continue (higher temperature, 2ep). Improvement over baseline: +17pp.

Figure: All experiments WandB overview

Figure: Final model evaluation results


Key Takeaways

  1. SFT has limits: For small models, SFT alone cannot substantially improve reasoning. The pass@k signal is a better indicator of capability than greedy accuracy.
  2. RL prerequisite: RL requires that the model's sampling distribution already contains correct answers (pass@k >> greedy). If the model cannot sample correct solutions, RL has nothing to optimize toward.
  3. Reward design is the hardest part: Expect 3+ iterations. Every penalty you add can create a new reward hacking vector. Test each version against a diverse validation set.
  4. Order matters: Build math capability first (GRPO), then teach refusal (SFT), then consolidate both (GRPO). Reversing this order caused massive false positives.
  5. Team workflow: Clear communication with data teams is essential — explain not just what you need but why, with concrete examples.