evil_MoE/docs/grpo_hyperparams.md
2026-05-23 11:26:39 +08:00


Canonical GRPO hyperparameters

Sourced from external/rl-rewardhacking/src/train/config.py (Ariahw, Engels & Nanda 2025). These are the defaults we inherit for all our headline runs; deviations must be flagged in docs/RESEARCH_JOURNAL.md.

Optimizer

| key | value | source |
| --- | --- | --- |
| optim | adamw_8bit | config.py L137 |
| learning_rate | 7e-5 | config.py L138 |
| lr_scheduler_type | cosine | config.py L139 |
| adam_beta1 | 0.9 | config.py L143 |
| adam_beta2 | 0.99 | config.py L144 |
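
To make the schedule concrete, here is a minimal sketch of linear warmup followed by cosine decay, using learning_rate 7e-5 from the optimizer table and the warmup_steps value of 10 listed later in this document. The helper name `lr_at`, the `total_steps` value, and the decay floor of zero are assumptions for illustration; the trainer's actual scheduler may differ in detail.

```python
import math

PEAK_LR = 7e-5      # learning_rate from the optimizer table
WARMUP_STEPS = 10   # warmup_steps from the additional-hyperparams table


def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR,
          warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    total = 200  # hypothetical run length, not taken from config.py
    for s in (0, 5, 10, 105, 200):
        print(s, f"{lr_at(s, total):.2e}")
```

With only 10 warmup steps, the peak rate is reached almost immediately, so most of the run sits on the cosine portion of the curve.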

LoRA

| key | value | source |
| --- | --- | --- |
| lora_rank (r) | 32 | config.py L41 |
| lora_alpha | 32 | config.py L42 |
| lora_dropout | 0.0 | config.py L45 |
| lora_bias | none | config.py L46 |
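
With rank and alpha both at 32, the standard LoRA scaling factor alpha / r works out to 1. The sketch below collects the table values under the argument names used by Hugging Face PEFT's `LoraConfig` (`r`, `lora_alpha`, `lora_dropout`, `bias`). Whether config.py passes them to PEFT directly or through another wrapper is an assumption, and `lora_scaling` is a hypothetical helper.

```python
# Values from the LoRA table; key names follow peft.LoraConfig's arguments
# (assumed mapping, config.py may route them through a different wrapper).
LORA_KWARGS = {
    "r": 32,
    "lora_alpha": 32,
    "lora_dropout": 0.0,
    "bias": "none",
}


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA scales the low-rank update B @ A by lora_alpha / r."""
    return lora_alpha / r


if __name__ == "__main__":
    print(lora_scaling(LORA_KWARGS["r"], LORA_KWARGS["lora_alpha"]))  # 1.0
```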

GRPO

| key | value | source |
| --- | --- | --- |
| beta (KL coeff) | 1e-3 | config.py L135 |
| num_generations | 16 | config.py L162 |
| per_device_batch_size | 32 (prompts) | config.py L164 |
| temperature | 0.7 | config.py L172 |
| auto_find_batch_size | True first run | config.py L165 |

Canonical effective per-step rollout count = per_device_batch_size * num_generations = 32 * 16 = 512. We do NOT run the canonical setting. See deviations below.
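
The rollout arithmetic can be written as a one-line helper. This is a sketch for a single device, and `rollouts_per_step` is a hypothetical name, not something defined in config.py.

```python
def rollouts_per_step(per_device_batch_size: int, num_generations: int) -> int:
    """Prompts per step times completions sampled per prompt (single device)."""
    return per_device_batch_size * num_generations


if __name__ == "__main__":
    print(rollouts_per_step(32, 16))  # 512
```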

Additional canonical hyperparams (not in tables above)

| key | value | source |
| --- | --- | --- |
| max_prompt_length | 1536 | run_rl_training.py L73 |
| max_completion_length | 1536 | run_rl_training.py L73 |
| warmup_steps | 10 | config.py |
| top_p | 0.95 (default) | verl GRPOConfig |
| cache_activations_layers | [18] | config.py L195-197 |

Our deviations for compute fit (96GB single-GPU)

The justfile uses MODEL=Qwen/Qwen3.5-2B as H4-main:

| key | canonical | ours | rationale |
| --- | --- | --- | --- |
| num_generations | 16 | 8 | halve rollouts to fit 96GB |
| per_device_batch_size | 32 | 16 | halve prompts to fit 96GB |
| effective batch | 512 | 128 | 4x smaller |

If H4 is falsified (Qwen3.5-2B hack rate <30% at step 200), the justfile switches to MODEL=Qwen/Qwen3-4B with the same NUM_GEN=8, BATCH=16 settings. If the 4B model OOMs, the secondary fallback is a further reduction to NUM_GEN=4, BATCH=16 (4 * 16 = 64 effective batch).
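
The fallback ladder can be sketched as a small selection function. `RunConfig` and `pick_run_config` are hypothetical names and do not reflect how the justfile is actually written. The sketch also assumes that the secondary fallback keeps BATCH=16 and lowers NUM_GEN to 4.

```python
from typing import NamedTuple, Optional


class RunConfig(NamedTuple):
    model: str
    num_generations: int
    per_device_batch_size: int


def pick_run_config(hack_rate_step200: Optional[float],
                    oom_on_4b: bool = False) -> RunConfig:
    """Mirror the fallback ladder described above (hypothetical helper)."""
    # H4-main: stay on the 2B model while the step-200 hack rate is
    # unknown or at least 30%.
    if hack_rate_step200 is None or hack_rate_step200 >= 0.30:
        return RunConfig("Qwen/Qwen3.5-2B", 8, 16)
    # H4 falsified: move to the 4B model with the same settings,
    # halving num_generations again only if the 4B run OOMs.
    if oom_on_4b:
        return RunConfig("Qwen/Qwen3-4B", 4, 16)
    return RunConfig("Qwen/Qwen3-4B", 8, 16)


if __name__ == "__main__":
    cfg = pick_run_config(hack_rate_step200=0.12, oom_on_4b=True)
    print(cfg, cfg.num_generations * cfg.per_device_batch_size)
```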

Smaller batches are noisier for GRPO advantage normalization. Expect higher SEM and budget more seeds if either path is taken.
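
One way to see the noise effect is to simulate the group-relative baseline. The sketch below normalizes rewards within a single prompt's group, as GRPO does, and then estimates how much the group-mean baseline fluctuates at different values of num_generations. The binary reward model, the `p_success` value, the use of the population standard deviation, and `eps` are all assumptions for illustration. They are not taken from config.py.

```python
import random
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


def spread_of_group_mean(group_size, trials=2000, p_success=0.3, seed=0):
    """Std. dev. of the group-mean baseline over many simulated groups."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        rewards = [1.0 if rng.random() < p_success else 0.0
                   for _ in range(group_size)]
        means.append(statistics.fmean(rewards))
    return statistics.pstdev(means)


if __name__ == "__main__":
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    for g in (16, 8, 4):
        print(g, round(spread_of_group_mean(g), 3))
```

Under these assumptions the spread of the baseline grows roughly as one over the square root of the group size, so halving num_generations inflates it by about 1.4x.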

Our additions (projected_grpo only)

| key | default | scope |
| --- | --- | --- |
| vhack_m | 16 | SVD top-m for v_hack denoising; sweep ∈ {8, 16, 32} |
| vhack_layer_frac | 0.7 | layer fraction for v_hack extraction (per Wu-Tang 60-75% depth) |
| preserve_magnitude | True | restore ‖g‖ after projection; ablation arm sets False |
| n_pairs | 20 | initial contrastive pair count; scale to 60-80 for headline runs |
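
To illustrate what preserve_magnitude controls, here is a minimal sketch that removes the component of a gradient g along a single v_hack direction and optionally rescales the result back to ‖g‖. It assumes that "projection" means orthogonal projection against one direction and it omits the SVD top-m denoising step entirely. `project_out` and its helpers are hypothetical and are not the projected_grpo implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def project_out(g, v_hack, preserve_magnitude=True):
    """Remove g's component along v_hack; optionally restore the norm of g."""
    v_norm = norm(v_hack)
    if v_norm == 0.0:
        return list(g)  # nothing to project against
    u = [x / v_norm for x in v_hack]  # unit direction
    coeff = dot(g, u)
    projected = [x - coeff * y for x, y in zip(g, u)]
    if preserve_magnitude:
        p_norm = norm(projected)
        if p_norm > 0.0:  # skip rescaling if g was parallel to v_hack
            scale = norm(g) / p_norm
            projected = [x * scale for x in projected]
    return projected


if __name__ == "__main__":
    g = [3.0, 4.0, 0.0]
    v = [1.0, 0.0, 0.0]
    print(project_out(g, v, preserve_magnitude=True))
    print(project_out(g, v, preserve_magnitude=False))
```

With preserve_magnitude set to False the update simply shrinks by whatever was removed, which is the behavior the ablation arm isolates.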