Canonical GRPO hyperparameters

Sourced from external/rl-rewardhacking/src/train/config.py (Ariahw, Engels & Nanda 2025). These are the defaults we inherit for all our headline runs; deviations must be flagged in docs/RESEARCH_JOURNAL.md.

Optimizer

key	value	source
`optim`	`adamw_8bit`	config.py L137
`learning_rate`	`7e-5`	config.py L138
`lr_scheduler_type`	`cosine`	config.py L139
`adam_beta1`	`0.9`	config.py L143
`adam_beta2`	`0.99`	config.py L144
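
To make the schedule concrete, here is a minimal sketch of a cosine learning-rate curve using `learning_rate = 7e-5` from the table above and `warmup_steps = 10` from the additional hyperparameters listed further down. It assumes the common shape of linear warmup followed by cosine decay to zero. Whether the canonical config decays fully to zero is not stated here, and `TOTAL_STEPS` is a hypothetical value chosen only for illustration.

```python
import math

def cosine_lr(step, total_steps, peak_lr=7e-5, warmup_steps=10):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

TOTAL_STEPS = 200  # hypothetical; not taken from config.py

for s in (0, 5, 10, 105, 200):
    print(s, f"{cosine_lr(s, TOTAL_STEPS):.2e}")
```

The peak is reached at step 10, and the rate passes through half the peak midway through the decay phase.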

LoRA

key	value	source
`lora_rank` (r)	`32`	config.py L41
`lora_alpha`	`32`	config.py L42
`lora_dropout`	`0.0`	config.py L45
`lora_bias`	`none`	config.py L46
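
As a quick sanity check on these values: in the standard LoRA formulation the low-rank update is scaled by `lora_alpha / r`, so `r = 32` with `lora_alpha = 32` gives a scale of 1. The sketch below assumes that standard (not rank-stabilized) scaling; the dict is illustrative and is not the layout used in config.py.

```python
# Canonical LoRA settings from the table above, as a plain dict (illustrative).
LORA = {
    "lora_rank": 32,
    "lora_alpha": 32,
    "lora_dropout": 0.0,
    "lora_bias": "none",
}

def lora_scaling(rank, alpha):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank

scale = lora_scaling(LORA["lora_rank"], LORA["lora_alpha"])
print(scale)  # 1.0
```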

GRPO

key	value	source
`beta` (KL coeff)	`1e-3`	config.py L135
`num_generations`	`16`	config.py L162
`per_device_batch_size`	`32` (prompts)	config.py L164
`temperature`	`0.7`	config.py L172
`auto_find_batch_size`	`True` (first run)	config.py L165

Canonical effective per-step rollout count = per_device_batch_size * num_generations = 32 * 16 = 512. We do NOT run the canonical setting. See deviations below.
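
The rollout count above can be checked by enumerating the rollouts directly. This is a small sketch that assumes a single device and treats each rollout as a (prompt, sample) pair. The grouping by prompt matters because GRPO normalizes advantages within each prompt's group.

```python
from itertools import product

PER_DEVICE_BATCH_SIZE = 32  # prompts per step (canonical)
NUM_GENERATIONS = 16        # completions sampled per prompt (canonical)

# One (prompt_index, sample_index) pair per rollout.
rollouts = list(product(range(PER_DEVICE_BATCH_SIZE), range(NUM_GENERATIONS)))

# Collect the sample indices belonging to each prompt's group.
groups = {}
for prompt_idx, sample_idx in rollouts:
    groups.setdefault(prompt_idx, []).append(sample_idx)

print(len(rollouts))                # 512
print(len(groups), len(groups[0]))  # 32 16
```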

Additional canonical hyperparams (not in tables above)

key	value	source
`max_prompt_length`	`1536`	run_rl_training.py L73
`max_completion_length`	`1536`	run_rl_training.py L73
`warmup_steps`	`10`	config.py
`top_p`	`0.95` (default)	verl GRPOConfig
`cache_activations_layers`	`[18]`	config.py L195-197

Our deviations for compute fit (96GB single-GPU)

The justfile sets MODEL=Qwen/Qwen3.5-2B for H4-main, with these reduced batch settings:

key	canonical	ours	rationale
`num_generations`	`16`	`8`	halve rollouts to fit 96GB
`per_device_batch_size`	`32`	`16`	halve prompts to fit 96GB
effective batch	512	128	4x smaller

If H4 is falsified (Qwen3.5-2B hack rate <30% at step 200), the justfile switches to MODEL=Qwen/Qwen3-4B with the same NUM_GEN=8, BATCH=16 settings. If the 4B model OOMs, the secondary fallback is NUM_GEN=4, BATCH=16 (4 * 16 = 64 effective batch).
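
The fallback ladder can be written down as a small decision function. This is a sketch only: `RunConfig`, `choose_config`, and the `fourb_ooms` flag are hypothetical names, and the real switching happens in the justfile rather than in Python.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    model: str
    num_generations: int
    batch_size: int  # prompts per step

    @property
    def effective_batch(self) -> int:
        return self.num_generations * self.batch_size

H4_MAIN = RunConfig("Qwen/Qwen3.5-2B", num_generations=8, batch_size=16)
FALLBACK_4B = RunConfig("Qwen/Qwen3-4B", num_generations=8, batch_size=16)
FALLBACK_4B_OOM = RunConfig("Qwen/Qwen3-4B", num_generations=4, batch_size=16)

def choose_config(hack_rate_at_step_200, fourb_ooms=False):
    """Pick the run config; H4 counts as falsified below a 30% hack rate."""
    if hack_rate_at_step_200 >= 0.30:
        return H4_MAIN
    return FALLBACK_4B_OOM if fourb_ooms else FALLBACK_4B

for cfg in (H4_MAIN, FALLBACK_4B, FALLBACK_4B_OOM):
    print(cfg.model, cfg.effective_batch)
```

Keeping the effective batch as a derived property means a changed `num_generations` or `batch_size` cannot drift out of sync with the reported figure.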

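Smaller batches are noisier for GRPO advantage normalization: with fewer generations per prompt, the group mean and standard deviation used to normalize advantages are estimated from fewer samples. Expect higher SEM and budget more seeds if either path is taken.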
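
The effect of group size on the normalization baseline can be illustrated with a toy simulation. The sketch assumes binary rewards with a fixed success probability, population standard deviation, and a small `eps` in the denominator. These are illustrative choices, not values taken from the canonical code.

```python
import random
import statistics

def group_advantages(rewards, eps=1e-4):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def baseline_spread(group_size, p_reward=0.3, n_groups=4000, seed=0):
    """Std of the per-group mean reward across many simulated groups."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_groups):
        rewards = [1.0 if rng.random() < p_reward else 0.0
                   for _ in range(group_size)]
        means.append(statistics.fmean(rewards))
    return statistics.pstdev(means)

spread_16 = baseline_spread(16)
spread_8 = baseline_spread(8)
print(f"G=16: {spread_16:.3f}  G=8: {spread_8:.3f}")
```

For independent rewards the spread of the group mean scales as 1/sqrt(G), so halving `num_generations` from 16 to 8 should raise it by a factor of roughly sqrt(2).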

Our additions (projected_grpo only)

key	default	scope
`vhack_m`	`16`	SVD top-m for v_hack denoising; sweep ∈ {8, 16, 32}
`vhack_layer_frac`	`0.7`	layer fraction for v_hack extraction (per Wu-Tang 60-75% depth)
`preserve_magnitude`	`True`	restore ‖g‖ after projection; ablation arm sets False
`n_pairs`	`20`	initial contrastive pair count; scale to 60-80 for headline runs
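
To show what `preserve_magnitude` changes, here is a minimal sketch of a projection step. It assumes that projected_grpo removes the gradient component along a single `v_hack` direction; the table only states that ‖g‖ is restored after projection, so the projection form itself is an assumption. The SVD top-m denoising controlled by `vhack_m` is not shown, and `project_out` is a hypothetical helper name.

```python
import math

def _norm(vec):
    return math.sqrt(sum(x * x for x in vec))

def project_out(g, v_hack, preserve_magnitude=True):
    """Remove the component of g along v_hack (assumed projection form)."""
    v_norm = _norm(v_hack)
    if v_norm == 0.0:
        return list(g)
    v = [x / v_norm for x in v_hack]
    dot = sum(a * b for a, b in zip(g, v))
    projected = [a - dot * b for a, b in zip(g, v)]
    if preserve_magnitude:
        # Rescale so the projected gradient has the same norm as g.
        p_norm = _norm(projected)
        if p_norm > 0.0:
            scale = _norm(g) / p_norm
            projected = [x * scale for x in projected]
    return projected

g = [3.0, 4.0]
v_hack = [1.0, 0.0]
print(project_out(g, v_hack, preserve_magnitude=True))   # [0.0, 5.0]
print(project_out(g, v_hack, preserve_magnitude=False))  # [0.0, 4.0]
```

With `preserve_magnitude=False` the step size shrinks whenever the gradient overlaps with `v_hack`, which is the difference the ablation arm isolates.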
