Canonical GRPO hyperparameters
Sourced from external/rl-rewardhacking/src/train/config.py (Ariahw, Engels &
Nanda 2025). These are the defaults we inherit for all our headline runs;
deviations must be flagged in docs/RESEARCH_JOURNAL.md.
Optimizer
| key | value | source |
| --- | --- | --- |
| optim | adamw_8bit | config.py L137 |
| learning_rate | 7e-5 | config.py L138 |
| lr_scheduler_type | cosine | config.py L139 |
| adam_beta1 | 0.9 | config.py L143 |
| adam_beta2 | 0.99 | config.py L144 |
LoRA
| key | value | source |
| --- | --- | --- |
| lora_rank (r) | 32 | config.py L41 |
| lora_alpha | 32 | config.py L42 |
| lora_dropout | 0.0 | config.py L45 |
| lora_bias | none | config.py L46 |
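
With the standard LoRA parameterization, the low-rank update is scaled by lora_alpha / r, so the values above give a scale of 1.0. The following is a minimal sketch assuming that standard (non-rank-stabilized) scaling; the dict and function names are illustrative and are not identifiers from config.py.

```python
# Canonical LoRA values copied from the LoRA table.
# The dict name is hypothetical, not taken from config.py.
canonical_lora = {"r": 32, "lora_alpha": 32, "lora_dropout": 0.0, "bias": "none"}


def lora_scale(r: int, lora_alpha: float) -> float:
    """Standard LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r


scale = lora_scale(canonical_lora["r"], canonical_lora["lora_alpha"])
print(f"LoRA update scale: {scale}")
```

If lora_rank is ever swept without changing lora_alpha, this scale changes with it, which is worth recording alongside the deviation.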
GRPO
| key | value | source |
| --- | --- | --- |
| beta (KL coeff) | 1e-3 | config.py L135 |
| num_generations | 16 | config.py L162 |
| per_device_batch_size | 32 (prompts) | config.py L164 |
| temperature | 0.7 | config.py L172 |
| auto_find_batch_size | True (first run) | config.py L165 |
Canonical effective per-step rollout count = per_device_batch_size * num_generations = 32 * 16 = 512.
We do NOT run the canonical setting. See deviations below.
Additional canonical hyperparams (not in tables above)
| key | value | source |
| --- | --- | --- |
| max_prompt_length | 1536 | run_rl_training.py L73 |
| max_completion_length | 1536 | run_rl_training.py L73 |
| warmup_steps | 10 | config.py |
| top_p | 0.95 (default) | verl GRPOConfig |
| cache_activations_layers | [18] | config.py L195-197 |
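
To make the interaction between learning_rate, lr_scheduler_type and warmup_steps concrete, here is a rough sketch of a linear-warmup-then-cosine schedule. It assumes the trainer follows the usual convention (linear ramp from 0 over the warmup steps, then half-cosine decay to 0). The total step count is not stated in this document, so `total_steps` is a hypothetical argument.

```python
import math

BASE_LR = 7e-5      # learning_rate from the Optimizer table
WARMUP_STEPS = 10   # warmup_steps from the table above


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero.

    total_steps is an assumption; the source does not specify it.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example with a made-up horizon of 200 steps.
for s in (0, 5, 10, 105, 200):
    print(s, lr_at(s, total_steps=200))
```

Under these assumptions the peak rate of 7e-5 is reached at step 10 and the rate has fallen to half of that at the midpoint of the remaining steps.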
Our deviations for compute fit (96GB single-GPU)
The justfile uses MODEL=Qwen/Qwen3.5-2B as H4-main:
| key | canonical | ours | rationale |
| --- | --- | --- | --- |
| num_generations | 16 | 8 | halve rollouts to fit 96GB |
| per_device_batch_size | 32 | 16 | halve prompts to fit 96GB |
| effective batch | 512 | 128 | 4x smaller |
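
The effective-batch arithmetic can be checked with a few lines of Python. This is a sketch using the document's own definition (per_device_batch_size * num_generations on a single GPU); the function and dict names are illustrative. The third entry is the secondary fallback (NUM_GEN=4, BATCH=16) described in this section.

```python
def effective_rollouts(per_device_batch_size: int, num_generations: int) -> int:
    """Per-step rollout count: prompts per step times completions per prompt."""
    return per_device_batch_size * num_generations


# (per_device_batch_size, num_generations) for each setting in this section.
settings = {
    "canonical": (32, 16),
    "ours": (16, 8),
    "fallback": (16, 4),
}

rollouts = {name: effective_rollouts(b, g) for name, (b, g) in settings.items()}
for name, n in rollouts.items():
    ratio = rollouts["canonical"] / n
    print(f"{name}: {n} rollouts per step (canonical / this = {ratio:g})")
```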
If H4 is falsified (Qwen3.5-2B hack rate <30% at step 200), the justfile switches to
MODEL=Qwen/Qwen3-4B with the same NUM_GEN=8, BATCH=16 settings. If 4B OOMs, the
secondary fallback is NUM_GEN=4, BATCH=16, for an effective batch of 4 * 16 = 64.
Smaller batches are noisier for GRPO advantage normalization. Expect higher SEM
and budget more seeds if either path is taken.
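
The noise comes from the fact that GRPO normalizes each reward against the mean and standard deviation of its own group of completions, and those statistics are estimated from only num_generations samples. A minimal sketch of that normalization follows. It assumes the population standard deviation and a small `eps`; implementations differ on both details, and the reward values are made up.

```python
import statistics


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt with num_generations = 8 and binary rewards (illustrative values).
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_advantages(rewards)
print([round(a, 3) for a in adv])
```

With 8 completions instead of 16, a single flipped reward moves the group mean twice as far, which is one reason to expect wider run-to-run variation.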
Our additions (projected_grpo only)
| key | default | scope |
| --- | --- | --- |
| vhack_m | 16 | SVD top-m for v_hack denoising; sweep ∈ {8, 16, 32} |
| vhack_layer_frac | 0.7 | layer fraction for v_hack extraction (per Wu-Tang 60-75% depth) |
| preserve_magnitude | True | restore ‖g‖ after projection; ablation arm sets False |
| n_pairs | 20 | initial contrastive pair count; scale to 60-80 for headline runs |
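
One plausible reading of preserve_magnitude, sketched in plain Python: remove from a gradient g its components along a set of orthonormal directions, then rescale the result back to the original norm ‖g‖. This is an illustration under assumptions, not the projected_grpo implementation. The function names, the orthonormal-basis input, and the use of a flat gradient vector are all hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(g, basis, preserve_magnitude=True):
    """Remove the components of g along each vector in basis.

    basis is assumed to be orthonormal (for example, top-m singular vectors).
    If preserve_magnitude is True, the result is rescaled to the norm of g.
    """
    out = list(g)
    for v in basis:
        coeff = dot(g, v)
        out = [o - coeff * x for o, x in zip(out, v)]
    if preserve_magnitude:
        norm_g = math.sqrt(dot(g, g))
        norm_out = math.sqrt(dot(out, out))
        if norm_out > 0.0:  # leave a fully projected-out gradient at zero
            out = [o * norm_g / norm_out for o in out]
    return out


g = [3.0, 4.0, 0.0]
basis = [[1.0, 0.0, 0.0]]  # a single made-up direction
kept = project_out(g, basis)
ablated = project_out(g, basis, preserve_magnitude=False)
print(kept, ablated)
```

In this toy case both outputs are orthogonal to the removed direction; only the first keeps the original norm of 5, which corresponds to the default arm, while the second corresponds to the ablation arm that sets preserve_magnitude to False.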