# Canonical GRPO hyperparameters

Sourced from `external/rl-rewardhacking/src/train/config.py` (Ariahw, Engels & Nanda 2025). These are the defaults we inherit for all our headline runs; deviations must be flagged in [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).

## Optimizer

| key | value | source |
|---|---|---|
| `optim` | `adamw_8bit` | config.py L137 |
| `learning_rate` | `7e-5` | config.py L138 |
| `lr_scheduler_type` | `cosine` | config.py L139 |
| `adam_beta1` | `0.9` | config.py L143 |
| `adam_beta2` | `0.99` | config.py L144 |

## LoRA

| key | value | source |
|---|---|---|
| `lora_rank` (r) | `32` | config.py L41 |
| `lora_alpha` | `32` | config.py L42 |
| `lora_dropout` | `0.0` | config.py L45 |
| `lora_bias` | `none` | config.py L46 |

## GRPO

| key | value | source |
|---|---|---|
| `beta` (KL coeff) | `1e-3` | config.py L135 |
| `num_generations` | `16` | config.py L162 |
| `per_device_batch_size` | `32` (prompts) | config.py L164 |
| `temperature` | `0.7` | config.py L172 |
| `auto_find_batch_size` | `True` first run | config.py L165 |

Canonical effective per-step rollout count = `per_device_batch_size * num_generations = 32 * 16 = 512`. We do NOT run the canonical setting. See deviations below.
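
As a quick sanity check on the arithmetic above, the canonical values can be mirrored in a small dataclass. This is only an illustrative sketch: `CanonicalGRPO` and `rollouts_per_step` are hypothetical names introduced here, not the upstream config class in `config.py`, and the field values are copied from the tables rather than imported from the external repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonicalGRPO:
    """Hypothetical mirror of the canonical tables; NOT the upstream class."""

    # Optimizer
    optim: str = "adamw_8bit"
    learning_rate: float = 7e-5
    lr_scheduler_type: str = "cosine"
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    # LoRA
    lora_rank: int = 32
    lora_alpha: int = 32
    lora_dropout: float = 0.0
    lora_bias: str = "none"
    # GRPO
    beta: float = 1e-3  # KL coefficient
    num_generations: int = 16  # completions sampled per prompt
    per_device_batch_size: int = 32  # prompts per step
    temperature: float = 0.7

    @property
    def rollouts_per_step(self) -> int:
        # Effective per-step rollout count: prompts * completions per prompt.
        return self.per_device_batch_size * self.num_generations


canonical = CanonicalGRPO()
print(canonical.rollouts_per_step)  # 32 * 16 = 512
```

Overriding `num_generations` and `per_device_batch_size` when constructing the dataclass gives the rollout count for any non-canonical run.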
## Additional canonical hyperparams (not in tables above)

| key | value | source |
|---|---|---|
| `max_prompt_length` | `1536` | run_rl_training.py L73 |
| `max_completion_length` | `1536` | run_rl_training.py L73 |
| `warmup_steps` | `10` | config.py |
| `top_p` | `0.95` (default) | verl GRPOConfig |
| `cache_activations_layers` | `[18]` | config.py L195-197 |

## Our deviations for compute fit (96GB single-GPU)

The justfile uses `MODEL=Qwen/Qwen3.5-2B` as H4-main:

| key | canonical | ours | rationale |
|---|---|---|---|
| `num_generations` | `16` | `8` | halve rollouts to fit 96GB |
| `per_device_batch_size` | `32` | `16` | halve prompts to fit 96GB |
| **effective batch** | **512** | **128** | 4x smaller |

If H4 is falsified (Qwen3.5-2B hack rate <30% at step 200), the justfile switches to `MODEL=Qwen/Qwen3-4B` and keeps the **same** NUM_GEN=8, BATCH=16 settings. If the 4B model OOMs, the secondary fallback is a further reduction to NUM_GEN=4, BATCH=16 (4 * 16 = 64 effective batch).

Smaller batches are noisier for GRPO advantage normalization. Expect higher SEM and budget more seeds if either path is taken.

## Our additions (projected_grpo only)

| key | default | scope |
|---|---|---|
| `vhack_m` | `16` | SVD top-m for v_hack denoising; sweep ∈ {8, 16, 32} |
| `vhack_layer_frac` | `0.7` | layer fraction for v_hack extraction (per Wu-Tang 60-75% depth) |
| `preserve_magnitude` | `True` | restore ‖g‖ after projection; ablation arm sets False |
| `n_pairs` | `20` | initial contrastive pair count; scale to 60-80 for headline runs |
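
Because any deviation from the canonical values has to be flagged in the research journal, a small diff helper can make the compute-fit deviations explicit before a run. The sketch below is an assumption about how one might do this, not part of the justfile or the upstream repo: `deviations` and `effective_batch` are hypothetical helpers, and the dictionaries only restate the numbers from the deviations section (including the NUM_GEN=4 fallback).

```python
# Canonical values for the two keys we deviate on (copied from the tables).
CANONICAL = {"num_generations": 16, "per_device_batch_size": 32}


def deviations(run: dict, canonical: dict = CANONICAL) -> dict:
    """Return {key: (canonical_value, run_value)} for each key that differs."""
    return {
        key: (value, run[key])
        for key, value in canonical.items()
        if key in run and run[key] != value
    }


def effective_batch(cfg: dict) -> int:
    """Rollouts per step: prompts per step times completions per prompt."""
    return cfg["per_device_batch_size"] * cfg["num_generations"]


ours = {"num_generations": 8, "per_device_batch_size": 16}
fallback = {"num_generations": 4, "per_device_batch_size": 16}  # if 4B OOMs

for name, cfg in [("ours", ours), ("fallback", fallback)]:
    shrink = effective_batch(CANONICAL) // effective_batch(cfg)
    print(name, effective_batch(cfg), f"{shrink}x smaller", deviations(cfg))
```

The returned mapping lists each changed key with its canonical and actual value, which is the information a journal entry needs; keys that match the canonical setting are omitted.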