# Canonical GRPO hyperparameters

Sourced from `external/rl-rewardhacking/src/train/config.py` (Ariahw, Engels &
Nanda 2025). These are the defaults we inherit for all our headline runs;
deviations must be flagged in [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).

## Optimizer

| key | value | source |
|---|---|---|
| `optim` | `adamw_8bit` | config.py L137 |
| `learning_rate` | `7e-5` | config.py L138 |
| `lr_scheduler_type` | `cosine` | config.py L139 |
| `adam_beta1` | `0.9` | config.py L143 |
| `adam_beta2` | `0.99` | config.py L144 |
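
As a rough illustration of what `lr_scheduler_type = cosine` implies for the learning rate, the sketch below uses the common linear-warmup-then-cosine-decay shape. It assumes the `warmup_steps = 10` value listed later in this document; `TOTAL_STEPS` is a hypothetical placeholder, since the step budget is not stated here, and the upstream scheduler may differ in detail.

```python
import math

BASE_LR = 7e-5       # learning_rate from the table above
WARMUP_STEPS = 10    # warmup_steps, listed later in this document
TOTAL_STEPS = 200    # hypothetical: the real step budget is not stated here


def lr_at(step: int, base_lr: float = BASE_LR, warmup: int = WARMUP_STEPS,
          total: int = TOTAL_STEPS) -> float:
    """Linear warmup from 0 to base_lr, then cosine decay to 0 at `total`."""
    if step < warmup:
        return base_lr * step / warmup
    # Fraction of the post-warmup schedule that has elapsed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 5, 10, 105, 200):
        print(step, lr_at(step))
```

The peak is reached at the end of warmup and the rate falls to half of `learning_rate` at the midpoint of the decay phase.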

## LoRA

| key | value | source |
|---|---|---|
| `lora_rank` (r) | `32` | config.py L41 |
| `lora_alpha` | `32` | config.py L42 |
| `lora_dropout` | `0.0` | config.py L45 |
| `lora_bias` | `none` | config.py L46 |
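
With `lora_rank` and `lora_alpha` both at 32, the adapter update is applied at unit scale under the standard LoRA convention (scale = alpha / r). The helper below is a minimal sketch of that arithmetic; `lora_scaling` is a hypothetical name, and it assumes the upstream config uses standard rather than rank-stabilized scaling.

```python
def lora_scaling(lora_alpha: float, lora_rank: int,
                 rank_stabilized: bool = False) -> float:
    """Multiplier applied to the low-rank update B @ A.

    Standard LoRA uses alpha / r; the rank-stabilized variant uses
    alpha / sqrt(r). Which one upstream uses is an assumption here.
    """
    denom = lora_rank ** 0.5 if rank_stabilized else lora_rank
    return lora_alpha / denom


if __name__ == "__main__":
    print(lora_scaling(32, 32))  # standard convention
```

Changing the rank without changing `lora_alpha` therefore also changes the effective update scale under the standard convention.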

## GRPO

| key | value | source |
|---|---|---|
| `beta` (KL coeff) | `1e-3` | config.py L135 |
| `num_generations` | `16` | config.py L162 |
| `per_device_batch_size` | `32` (prompts) | config.py L164 |
| `temperature` | `0.7` | config.py L172 |
| `auto_find_batch_size` | `True` first run | config.py L165 |

Canonical effective per-step rollout count = `per_device_batch_size * num_generations = 32 * 16 = 512`.
We do NOT run the canonical setting. See deviations below.

## Additional canonical hyperparams (not in tables above)

| key | value | source |
|---|---|---|
| `max_prompt_length` | `1536` | run_rl_training.py L73 |
| `max_completion_length` | `1536` | run_rl_training.py L73 |
| `warmup_steps` | `10` | config.py |
| `top_p` | `0.95` (default) | verl GRPOConfig |
| `cache_activations_layers` | `[18]` | config.py L195-197 |

## Our deviations for compute fit (96GB single-GPU)

The justfile uses `MODEL=Qwen/Qwen3.5-2B` as H4-main:

| key | canonical | ours | rationale |
|---|---|---|---|
| `num_generations` | `16` | `8` | halve rollouts to fit 96GB |
| `per_device_batch_size` | `32` | `16` | halve prompts to fit 96GB |
| **effective batch** | **512** | **128** | 4x smaller |

If H4 is falsified (Qwen3.5-2B hack rate <30% at step 200), the justfile switches
to `MODEL=Qwen/Qwen3-4B` with the **same** NUM_GEN=8, BATCH=16 settings. If 4B
OOMs, the secondary fallback is NUM_GEN=4, BATCH=16 (4 * 16 = 64 effective batch).
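
The fallback ladder above can be sketched as a small decision function. This is illustrative only: `RunConfig` and `choose_config` are hypothetical names, not part of the justfile, and the hack rate is assumed to be expressed as a fraction in [0, 1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    model: str
    num_generations: int
    per_device_batch_size: int

    @property
    def effective_batch(self) -> int:
        # Rollouts per step on a single GPU: prompts * generations per prompt.
        return self.per_device_batch_size * self.num_generations


H4_MAIN = RunConfig("Qwen/Qwen3.5-2B", num_generations=8, per_device_batch_size=16)
FALLBACK_4B = RunConfig("Qwen/Qwen3-4B", num_generations=8, per_device_batch_size=16)
FALLBACK_4B_OOM = RunConfig("Qwen/Qwen3-4B", num_generations=4, per_device_batch_size=16)


def choose_config(hack_rate_step200: float, oom_on_4b: bool = False) -> RunConfig:
    """Pick a run config; H4 counts as falsified below a 30% hack rate."""
    if hack_rate_step200 >= 0.30:
        return H4_MAIN
    return FALLBACK_4B_OOM if oom_on_4b else FALLBACK_4B


if __name__ == "__main__":
    cfg = choose_config(hack_rate_step200=0.12, oom_on_4b=True)
    print(cfg.model, cfg.effective_batch)
```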

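Smaller batches are noisier for GRPO advantage normalization: with fewer
generations per prompt, the group mean and standard deviation used to normalize
rewards are estimated from fewer samples. Expect higher SEM and budget more
seeds if either path is taken.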
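
To make the noise argument concrete, here is a minimal sketch of group-relative advantage normalization and of how the standard error of the per-prompt baseline grows as `num_generations` shrinks. It assumes binary (0/1) rewards and a population standard deviation with a small epsilon; real implementations may use a different std estimator or epsilon, and both function names are hypothetical.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def baseline_standard_error(p_success: float, num_generations: int) -> float:
    """Std error of the group-mean baseline for Bernoulli(p_success) rewards."""
    return math.sqrt(p_success * (1.0 - p_success) / num_generations)


if __name__ == "__main__":
    print(group_advantages([1.0, 0.0, 1.0, 0.0]))
    for g in (16, 8, 4):
        print(g, baseline_standard_error(0.5, g))
```

Under these assumptions, halving the group size from 16 to 8 inflates the baseline's standard error by a factor of sqrt(2), and dropping to 4 doubles it.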

## Our additions (projected_grpo only)

| key | default | notes |
|---|---|---|
| `vhack_m` | `16` | SVD top-m for v_hack denoising; sweep ∈ {8, 16, 32} |
| `vhack_layer_frac` | `0.7` | layer fraction for v_hack extraction (per Wu-Tang 60-75% depth) |
| `preserve_magnitude` | `True` | restore ‖g‖ after projection; ablation arm sets False |
| `n_pairs` | `20` | initial contrastive pair count; scale to 60-80 for headline runs |
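
The effect of `preserve_magnitude` can be illustrated with a small pure-Python sketch. It assumes the projection removes the gradient's component along an orthonormal basis of the v_hack subspace (for example, the top-`vhack_m` singular directions, not computed here); `project_out` is a hypothetical helper and not necessarily how `projected_grpo` implements it.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(g, basis, preserve_magnitude=True):
    """Remove the components of g along each vector in an orthonormal basis.

    If preserve_magnitude is True, rescale the result back to ||g||.
    Assumes `basis` is orthonormal; that is not checked here.
    """
    out = list(g)
    for v in basis:
        coeff = dot(g, v)
        out = [o - coeff * x for o, x in zip(out, v)]
    if preserve_magnitude:
        g_norm = math.sqrt(dot(g, g))
        out_norm = math.sqrt(dot(out, out))
        if out_norm > 0.0:  # leave a fully projected-out gradient at zero
            out = [o * g_norm / out_norm for o in out]
    return out


if __name__ == "__main__":
    g = [3.0, 4.0]
    v_hack_basis = [[1.0, 0.0]]  # toy one-dimensional subspace
    print(project_out(g, v_hack_basis, preserve_magnitude=False))
    print(project_out(g, v_hack_basis, preserve_magnitude=True))
```

With magnitude preservation on, the projected gradient keeps its direction within the orthogonal complement but has the same norm as the original, so the step size does not shrink when a large component is removed.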