evil_MoE/justfile
wassname f487e67405 Goal 0 milestone: fast preset learns to hack in ~10min
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and
  small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min
  iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
  20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
  effectively disable — `gn` column shows the clip was never the bottleneck).
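
The `warmup_frac` change can be sketched as below. This is a minimal illustration, not the code in train.py: the function names, the linear warmup shape, and the default `warmup_frac=0.1` are assumptions (0.1 is inferred from "2 steps" out of 20), and the cosine decay is taken from the `lr` column description.

```python
import math


def warmup_steps(total_steps: int, warmup_frac: float) -> int:
    """Convert a warmup fraction into an absolute step count."""
    return round(total_steps * warmup_frac)


def scheduled_lr(step: int, total_steps: int, base_lr: float,
                 warmup_frac: float = 0.1) -> float:
    """Linear warmup followed by cosine decay (assumed schedule shape)."""
    n_warm = warmup_steps(total_steps, warmup_frac)
    if step < n_warm:
        # Ramp linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / n_warm
    # Cosine decay from base_lr towards 0 over the remaining steps.
    progress = (step - n_warm) / max(1, total_steps - n_warm)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With `total_steps=20` the warmup covers steps 0 and 1 only, whereas a fixed 10-step warmup would consume half the run.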

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
  inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.
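
A rough sketch of the subclass layout, using only the standard library. The field defaults on `Config` other than `grad_clip=1.0` are placeholders, and the exact string mapping in `preset_name` (strip the `Config` suffix, lowercase) is an assumption; the commit only says the name is derived from `type(self).__name__`.

```python
from dataclasses import dataclass


@dataclass
class Config:
    arm: str = "vanilla"
    lr: float = 1e-4          # placeholder default, not from the repo
    steps: int = 200          # placeholder default, not from the repo
    warmup_frac: float = 0.1  # assumed value
    grad_clip: float = 1.0

    @property
    def preset_name(self) -> str:
        # Derived from the class name, so there is no duplicated field to keep in sync.
        return type(self).__name__.removesuffix("Config").lower()


@dataclass
class SmokeConfig(Config):
    steps: int = 30


@dataclass
class FastConfig(Config):
    lr: float = 3e-3
    steps: int = 20


@dataclass
class FullConfig(Config):
    pass


# Mapping of subcommand name -> config class. In the real CLI this dict would be
# handed to tyro.extras.subcommand_cli_from_dict; that call is left out here so
# the sketch runs without tyro installed.
SUBCOMMANDS = {"smoke": SmokeConfig, "fast": FastConfig, "full": FullConfig}
```

Because each preset is a type, overriding a default on the command line (`--lr=...`) only touches that instance, and no Optional-merge step is needed.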

Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
  per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
  Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats
  and reformatted to `+0.000`. Tuples for fraction columns get converted to
  "n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.
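
The "raw values in the row, formatters in the column spec" idea might look roughly like this. The class and field names other than `StepLogger` are hypothetical, and the real implementation presumably hands the formatted cells to tabulate; that step is omitted to keep the sketch dependency-free.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Column:
    key: str                   # key into the raw row dict
    label: str                 # header text
    width: int                 # streaming column width
    fmt: Callable[[Any], str]  # per-cell formatter


def fmt_frac(value: tuple) -> str:
    """Render an (n, d) tuple as "n/d"; called only when a cell is displayed."""
    n, d = value
    return f"{n}/{d}"


class StepLogger:
    def __init__(self, columns: list) -> None:
        self.columns = columns
        self.rows: list = []   # raw values, never pre-formatted strings

    def header(self) -> str:
        return " ".join(c.label.rjust(c.width) for c in self.columns)

    def log(self, row: dict) -> str:
        """Store the raw row and return its formatted streaming line."""
        self.rows.append(row)
        return " ".join(c.fmt(row[c.key]).rjust(c.width) for c in self.columns)

    def table(self) -> list:
        """Formatted cells for an end-of-run dump; formatting happens once, here."""
        return [[c.fmt(r[c.key]) for c in self.columns] for r in self.rows]


logger = StepLogger([
    Column("step", "step", 4, str),
    Column("lr", "lr", 9, lambda v: f"{v:.2e}"),
    Column("hack_s", "hack_s", 6, fmt_frac),
])
line = logger.log({"step": 7, "lr": 7e-8, "hack_s": (1, 4)})
```

Since `logger.rows` still holds the float `7e-8`, nothing downstream has to re-parse a string such as `"7.00e-08"`, which is the failure mode described above.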

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
  justfile. "in/out" overloaded with weight in/out features; "pre/post" is
  unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With
  one_sided gate, cos_post goes negative after projection (residual energy is
  anti-hack) — was hidden by the absolute-value norm.
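
A small numeric illustration of why the sign matters. The helper names are hypothetical, `V` is assumed to have orthonormal rows, and `one_sided_remove` is only a guess at what a one-sided gate does (subtract the positive components, leave the negative ones).

```python
import math


def project(V, g):
    """Coefficients V @ g for a list-of-rows V and a vector g."""
    return [sum(v * x for v, x in zip(row, g)) for row in V]


def norm(x):
    return math.sqrt(sum(t * t for t in x))


def cos_unsigned(V, g):
    # Old metric: ||V @ g|| / ||g||, always >= 0.
    return norm(project(V, g)) / norm(g)


def cos_signed(V, g):
    # New metric: sum(V @ g) / ||g||, keeps the direction of the overlap.
    return sum(project(V, g)) / norm(g)


def one_sided_remove(V, g):
    """Subtract only positive components along each row of V (assumes orthonormal rows)."""
    out = list(g)
    for row, c in zip(V, project(V, g)):
        if c > 0:
            out = [o - c * v for o, v in zip(out, row)]
    return out


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = [2.0, -1.0, 2.0]
g_post = one_sided_remove(V, g)  # the +2 component is removed, the -1 component stays
```

Here `cos_signed(V, g_post)` is negative while `cos_unsigned(V, g_post)` has the same magnitude with a positive sign, so the unsigned metric would report leftover overlap without revealing that it points away from the hack direction.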

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
  gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
  with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
  exactly, so we save the cleaner narrative for the paper.
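
The equivalence can be written out as follows. This sketch assumes an on-policy step (importance ratio 1, no clipping, no KL term) and a loss averaged over the group; the notation is illustrative and not taken from the repo.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= -\frac{1}{G}\sum_{i=1}^{G} A_i \sum_{t} \log \pi_\theta(o_{i,t}\mid q, o_{i,<t}),
  \qquad G = 2,\; A_{\text{hack}} = +1,\; A_{\text{clean}} = -1 \\
\nabla_\theta \mathcal{L} &= -\tfrac{1}{2}\bigl[\nabla_\theta \log \pi_\theta(o_{\text{hack}}\mid q)
  - \nabla_\theta \log \pi_\theta(o_{\text{clean}}\mid q)\bigr] \\
&= \tfrac{1}{2}\bigl[\nabla_\theta \mathrm{NLL}(o_{\text{hack}}) - \nabla_\theta \mathrm{NLL}(o_{\text{clean}})\bigr],
  \qquad \mathrm{NLL}(o) = -\sum_t \log \pi_\theta(o_t \mid q, o_{<t}).
\end{aligned}
```

The factor 1/2 comes from averaging over the group and disappears if the implementation sums instead, so the direction is identical either way.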

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:22:36 +00:00

336 lines
16 KiB

set shell := ["bash", "-cu"]

# Three seeds for headline arms; one seed for ablations.
SEEDS_3 := "41 43 44"

# spec.md §H4 substrate (reference DEFAULT_MODEL_ID).
# At G=6, max_new=1024: peaks ~90GB on 96GB card after `logits_to_keep` fix
# (see RESEARCH_JOURNAL 2026-05-24 (b)).
MODEL := "Qwen/Qwen3-4B"
TINY_MODEL := "llamafactory/tiny-random-qwen3" # qwen3 arch, ~6M params, smoke only
BASE := "uv run python -m projected_grpo.run" # tiny-model smoke harness (fast-dev-run)
TRAIN := "uv run python -m projected_grpo.train" # real LeetCode GRPO entry point

default:
    @just --list

# Smoke: same harness as production (train.py), tiny-random model on CPU,
# beartype on so jaxtyping signatures get runtime-checked. Runs 30 steps so
# the every-25-step save_ckpt path is covered. Should finish in ~1-2 min.
# Re-running after the first invocation also exercises the v_hack cache-hit branch.
# Pulls cached teacher rollouts (real Qwen3-4B completions + real graded
# rewards) at mix_ratio=0.5 so the GRPO backward / projection / cos_pre/cos_post
# paths actually fire -- pure tiny-random gen produces all-zero rewards and
# zero-variance bails every step, leaving the loss path uncovered.
smoke *ARGS:
    BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --arm=projected \
        --v-hack-path=out/v_hack_smoke.safetensors \
        --teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 {{ ARGS }}

smoke-vanilla *ARGS:
    BEARTYPE=1 CUDA_VISIBLE_DEVICES= {{ TRAIN }} smoke --arm=vanilla \
        --teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 {{ ARGS }}

# Run both smoke arms in sequence: vanilla first, then projected. To cover both
# v_hack cache paths (miss, then hit), run `just smoke` a second time afterwards.
# Catches scope/save bugs that only manifest in one.
smoke-both:
    just smoke-vanilla
    just smoke

# H4 baseline at spec substrate. No v_hack needed for vanilla.
full-vanilla *ARGS:
    {{ TRAIN }} full --arm=vanilla {{ ARGS }}

full *ARGS:
    {{ TRAIN }} full --arm=projected --v-hack-path=out/v_hack_full.safetensors {{ ARGS }}

# Goal 0: minimum iteration loop to find a working GRPO-hacks-up baseline.
# Uses fast preset (20 steps, fast-Adam: lr=3e-3 beta1=0.5 beta2=0.9) + cached
# teacher pool at mix_ratio=0.5. UAT: hack_s rises from 0/N to >=N/4 by step 20.
# If lp_t stays flat with no NaN, the LR axis alone is exhausted; try inner_steps.
fast-vanilla *ARGS:
    {{ TRAIN }} fast --arm=vanilla \
        --teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 \
        --grad-clip=500 {{ ARGS }}

# Goal 1: same recipe with --arm=projected. Run only after fast-vanilla passes UAT.
fast-projected *ARGS:
    {{ TRAIN }} fast --arm=projected \
        --v-hack-path=out/v_hack_full.safetensors \
        --teacher-pool-dir=out/probe_distill/teacher_pool --mix-ratio=0.5 \
        --grad-clip=500 {{ ARGS }}

# Sync the rl-rewardhacking external repo (Nanda's verl wrapper).
sync-external:
    cd external/rl-rewardhacking && git pull --ff-only

# Warm HF cache before real runs (avoids re-download on first pueue job).
download-model:
    uv run python -c "from huggingface_hub import snapshot_download; \
        snapshot_download('{{ MODEL }}', allow_patterns=['*.json','*.txt','tokenizer*','*.safetensors'])"

extract-vhack-smoke:
    uv run python -m projected_grpo.extract_vhack_grad \
        --model=Qwen/Qwen3.5-0.8B \
        --dtype=bf16 \
        --out-path=out/v_hack_smoke.safetensors \
        --train-grads-path=out/vhack_grads_train_smoke.safetensors

extract-vhack-full:
    uv run python -m projected_grpo.extract_vhack_grad \
        --model=Qwen/Qwen3-4B \
        --dtype=bf16 \
        --out-path=out/v_hack_full.safetensors \
        --train-grads-path=out/vhack_grads_train_full.safetensors

verify-vhack-smoke:
    uv run python -m projected_grpo.verify_vhack_heldout \
        --model=Qwen/Qwen3.5-0.8B \
        --dtype=bf16 \
        --v-hack-path=out/v_hack_smoke.safetensors \
        --out-path=out/vhack_heldout_cos_smoke.safetensors

verify-vhack-full:
    uv run python -m projected_grpo.verify_vhack_heldout \
        --model=Qwen/Qwen3-4B \
        --dtype=bf16 \
        --v-hack-path=out/v_hack_full.safetensors \
        --out-path=out/vhack_heldout_cos_full.safetensors

# =============================================================================
# SWEEPS -- what to run, in order
# =============================================================================
#
#   1. `just probe-full-seed 41`  -- single-seed gate (~6-9h sequential).
#      extract -> verify-heldout -> vanilla -> projected. Inspect before sweep.
#   2. `just queue-full`          -- 3-seed headline sweep (~36-54h).
#      Queues 1 extract + 3 vanilla + 3 projected. Only run after probe passes.
#
# Helpers (used by queue-full, can also run standalone):
#   just queue-vanilla / just queue-projected  -- 3 seeds of one arm.
#   just probe-h4 41                           -- vanilla only on a single seed (H4 substrate sanity).
# =============================================================================

# Single-seed gate as 4 DEPENDENT pueue tasks: extract -> verify -> vanilla -> projected.
# Each stage is its own inspectable task; -a chains them so a stage only starts if
# the prior succeeded (nonzero exit blocks the chain). Gates A/B are enforced by exit
# code (verify exits nonzero if the fraction of held-out cosines > 0 is <= 0.50).
# Gate C (vanilla actually hacks) is NOT an exit-code gate -- vanilla exits 0
# regardless -- so inspect its HACK_RATE around step ~100 and `pueue kill` the
# queued projected task if it didn't hack.
# Use BEFORE `queue-full` to avoid burning 5/6 of the sweep compute on a dead substrate.
probe-full-seed seed="41":
    #!/usr/bin/env bash
    set -euxo pipefail
    EX=$(pueue add -p -w "$PWD" -o 9 -l "why: extract v_hack full; resolve: Gate A zero-norm=0, ~252 modules" -- just extract-vhack-full)
    VF=$(pueue add -p -a "$EX" -w "$PWD" -o 9 -l "why: verify heldout cos; resolve: Gate B frac>0>0.50, mean>0.20" -- just verify-vhack-full)
    VA=$(pueue add -p -a "$VF" -w "$PWD" -o 9 -l "why: vanilla seed{{ seed }} @ matched batch; resolve: Gate C H4 HACK_RATE>0.30 by ~step100" -- {{ TRAIN }} full --arm=vanilla --seed={{ seed }} --out-tag=_full_vanilla_seed{{ seed }}_probe)
    pueue add -a "$VA" -w "$PWD" -o 8 -l "why: projected seed{{ seed }} @ matched batch, v_hack NOT post-hoc; resolve: Gate D H1 HACK_RATE<vanilla at matched PASS" -- {{ TRAIN }} full --arm=projected --seed={{ seed }} --v-hack-path=out/v_hack_full.safetensors --out-tag=_full_projected_seed{{ seed }}_probe
    pueue status

# Vanilla-only single-seed probe. Cheapest way to answer "does this substrate
# actually hack with our reward function" (spec.md §H4).
probe-h4 seed="41":
    {{ TRAIN }} full --arm=vanilla --seed={{ seed }} --out-tag=_full_vanilla_seed{{ seed }}_h4

# Headline 3-seed sweep: extract + 3 vanilla + 3 projected via pueue.
# Only run after probe-full-seed shows vanilla hacks and projected fires.
queue-full:
    #!/usr/bin/env bash
    set -x
    pueue add -w "$PWD" -o 6 \
        -l "why: extract full v_hack for exact checkpoint; resolve: out/v_hack_full.safetensors exists and train.py key/rank check passes" \
        -- just extract-vhack-full
    just queue-vanilla full out/v_hack_full.safetensors
    just queue-projected full out/v_hack_full.safetensors

# 3-seed vanilla baseline (H4: baseline hack rate >30% at step 200).
queue-vanilla preset="full" vhack="out/v_hack_full.safetensors":
    #!/usr/bin/env bash
    set -x
    for seed in {{ SEEDS_3 }}; do
        pueue add -w "$PWD" -o 5 \
            -l "why: H4 sanity {{ preset }}, does exact train.py substrate reward-hack; resolve: if <30% hack at final window, escalate model/prompt before H1" \
            -- {{ TRAIN }} {{ preset }} --arm=vanilla --seed=$seed --out-tag=_{{ preset }}_vanilla_seed$seed
    done

# 3-seed projected (H1: -30pp hack vs vanilla at matched pass).
queue-projected preset="full" vhack="out/v_hack_full.safetensors":
    #!/usr/bin/env bash
    set -x
    for seed in {{ SEEDS_3 }}; do
        pueue add -w "$PWD" -o 4 \
            -l "why: H1 {{ preset }}, projected delta_S grad reduces hack rate >=30pp at matched pass; resolve: compare to same-seed vanilla logs" \
            -- {{ TRAIN }} {{ preset }} --arm=projected --seed=$seed --v-hack-path={{ vhack }} --out-tag=_{{ preset }}_projected_seed$seed
    done

# Diagnostic: print v_hack steering check (CAA-style) on base model.
# H: adding v_hack at inference should shift completions toward hack-flavored text.
vhack-check *ARGS:
    {{ BASE }} --vhack-check --model={{ MODEL }} {{ ARGS }}

# Distillation probe: hacky teacher (ariahw rh-s65) samples, student trains
# with per-sample v_hack cosine logging. step_NNN.jsonl.gz per step is replayable.
probe-distill *ARGS:
    uv run python -m projected_grpo.probe_distill --v-hack-path=out/v_hack_full.safetensors {{ ARGS }}

# UAT pipeline: 1) teacher pool  2) vanilla replay  3) projected replay  4) analyze.
#   T1  teacher hack >= 0.30
#   T2  vanilla cos coverage >= 90%
#   T3  projected cos_post < cos_pre on >= 80% of steps
#   T4  cos | hacked > cos | not hacked (p<0.05)
probe-teacher-pool steps="20":
    uv run python -m projected_grpo.probe_distill --teacher-only --steps={{ steps }} --n-problems={{ steps }}

# Base pool: base Qwen3-4B, no LoRA, no hint applied. ~0% hack per ariahw §86.
# Used to source non-hack samples for the cos comparison bucket.
probe-base-pool steps="20":
    uv run python -m projected_grpo.probe_distill --base-only --steps={{ steps }} --n-problems={{ steps }}

probe-vanilla-replay-base steps="20":
    uv run python -m projected_grpo.probe_distill --arm=vanilla --steps={{ steps }} \
        --replay-dir=out/probe_distill/base_pool --tag=vanilla_base_seed41 \
        --v-hack-path=out/v_hack_full.safetensors

# Mixed-replay GRPO: teacher_pool + base_pool merged 4+4 per step.
# Reward variance -> Dr.GRPO centered advantage non-zero -> real GRPO cos.
# Arm 1: vanilla (no projection action, but cos_pre measured).
probe-mixed-vanilla steps="20":
    uv run python -m projected_grpo.probe_distill --arm=vanilla --steps={{ steps }} \
        --replay-dirs=out/probe_distill/teacher_pool,out/probe_distill/base_pool \
        --tag=mixed_vanilla_seed41 \
        --v-hack-path=out/v_hack_full.safetensors

# Arm 2: projected GRPO in SVD basis (AntiPaSTO + project_delta_S_grad).
probe-mixed-projected steps="20":
    uv run python -m projected_grpo.probe_distill --arm=projected --steps={{ steps }} \
        --replay-dirs=out/probe_distill/teacher_pool,out/probe_distill/base_pool \
        --tag=mixed_projected_svd_seed41 \
        --v-hack-path=out/v_hack_full.safetensors

# Warmup -> student-gen: first `warmup` steps replay from mixed pools (cheap
# distillation), then student generates with the learned adapter (canonical
# GRPO). Lets us watch hack-rate emerge naturally after warmup.
probe-warmupgen-vanilla steps="100" warmup="70":
    uv run python -m projected_grpo.probe_distill --arm=vanilla --steps={{ steps }} \
        --warmup-replay-steps={{ warmup }} \
        --replay-dirs=out/probe_distill/teacher_pool,out/probe_distill/base_pool \
        --tag=warmupgen_vanilla_seed41 \
        --v-hack-path=out/v_hack_full.safetensors

probe-warmupgen-projected steps="100" warmup="70":
    uv run python -m projected_grpo.probe_distill --arm=projected --steps={{ steps }} \
        --warmup-replay-steps={{ warmup }} \
        --replay-dirs=out/probe_distill/teacher_pool,out/probe_distill/base_pool \
        --tag=warmupgen_projected_svd_seed41 \
        --v-hack-path=out/v_hack_full.safetensors

# Sandwich: pre student-gen | distill replay | post student-gen.
# Lets us see baseline, hack adoption, and persistence in one run.
probe-sandwich-vanilla pre="20" distill="50" post="20" seed="41":
    uv run python -m projected_grpo.probe_distill --arm=vanilla \
        --steps=$(({{ pre }} + {{ distill }} + {{ post }})) \
        --pre-warmup-steps={{ pre }} --warmup-replay-steps={{ distill }} \
        --replay-dirs=out/probe_distill/teacher_pool,out/probe_distill/base_pool \
        --seed={{ seed }} \
        --tag=sandwich_vanilla_seed{{ seed }} \
        --v-hack-path=out/v_hack_full.safetensors

probe-sandwich-projected pre="20" distill="50" post="20" seed="41":
    uv run python -m projected_grpo.probe_distill --arm=projected \
        --steps=$(({{ pre }} + {{ distill }} + {{ post }})) \
        --pre-warmup-steps={{ pre }} --warmup-replay-steps={{ distill }} \
        --replay-dirs=out/probe_distill/teacher_pool,out/probe_distill/base_pool \
        --seed={{ seed }} \
        --tag=sandwich_projected_svd_seed{{ seed }} \
        --v-hack-path=out/v_hack_full.safetensors

probe-vanilla-replay steps="20":
    uv run python -m projected_grpo.probe_distill --arm=vanilla --steps={{ steps }} \
        --replay-dir=out/probe_distill/teacher_pool \
        --v-hack-path=out/v_hack_full.safetensors

probe-projected-replay steps="20":
    uv run python -m projected_grpo.probe_distill --arm=projected --steps={{ steps }} \
        --replay-dir=out/probe_distill/teacher_pool \
        --v-hack-path=out/v_hack_full.safetensors

probe-uat:
    uv run python -m projected_grpo.probe_uat

# Trajectory comparator for the warmup-gen runs (vanilla vs projected).
probe-traj:
    uv run python -m projected_grpo.probe_traj

# Baked-ckpt probe (plan step 2/4): 50-step train.py on out/baked/qwen3_4b_rh25
# with v_hack_rh25 (top-k=5, real-voice pairs). prompts_per_step=8 → ~40 min/run.
# Goal: see if vanilla still climbs hack hill at 25% bake, and whether projected
# arm tracks cos_pre/cos_post as expected.
probe-baked-vanilla tag="rh25" seed="41":
    {{ TRAIN }} full --arm=vanilla \
        --model=out/baked/qwen3_4b_{{ tag }} \
        --steps=50 --prompts-per-step=8 \
        --seed={{ seed }} --out-tag=_baked_{{ tag }}_vanilla_seed{{ seed }}

probe-baked-projected tag="rh25" seed="41":
    {{ TRAIN }} full --arm=projected \
        --model=out/baked/qwen3_4b_{{ tag }} \
        --v-hack-path=out/v_hack_{{ tag }}.safetensors \
        --steps=50 --prompts-per-step=8 \
        --seed={{ seed }} --out-tag=_baked_{{ tag }}_projected_seed{{ seed }}

# Phase 2 pilot analyzer: reads out/train_pilot_*.safetensors, prints trajectories
# and per-arm aggregates, applies decision rules from spec2.md.
phase2-analyze pattern="_pilot_*":
    uv run python -m projected_grpo.phase2_analyze "{{ pattern }}"

# Print the results table prototype.
table-proto:
    @cat docs/table_proto.md

# =============================================================================
# Mixed-pool GRPO (cached teacher pool)
# =============================================================================
# Hypothesis: starting GRPO from a CLEAN base + mixing cached teacher rollouts
# into each prompt's G-group lets us measure how fast the student LEARNS the
# hack from exposure (rather than re-emergence from a baked substrate). See
# /root/.claude/plans/mixed-pool-grpo-clean-base-functional-tern.md.
#
# Workflow:
#   1) just pregen-teacher 100   # one-time; existing 70 prompts may suffice
#   2) just probe-mixed 41       # 100-step GO/NO-GO probe via pueue
#   3) inspect: hack_s climbs 0 -> 20%+ ? GO -> head-to-head; NO-GO -> diagnose

# Pre-generate teacher rollouts for N prompts via probe_distill.py --teacher-only.
# Writes/extends out/probe_distill/teacher_pool/. Teacher = ariahw rh-s65 LoRA
# merged on Qwen3-4B. Cost ~30s/prompt @ G=8, max_new=1024 -> ~50 min for 100.
pregen-teacher n_prompts="100":
    uv run python -m projected_grpo.probe_distill \
        --teacher-only \
        --n-problems={{ n_prompts }} \
        --group=8 \
        --max-new=1024

# 100-step feasibility probe: clean Qwen3-4B + 75% cached teacher pool, pp=4, G=12.
# Plan B "free lunch": mix=0.75 -> G_s=3, G_t=9. Gen wall-time unchanged
# (teacher is cached disk reads), backward VRAM ~2x current (peak ~55-60 GB on
# 96 GB card). At 48 gens/step (vs reference 256), 100 steps ~= 19 ref steps.
# --v-hack-path is set even for vanilla so cos_pre/cos_post get measured as baseline
# (project_delta_S_grad with measure_only=True on vanilla arm).
probe-mixed seed="41":
    pueue add -l "why: does mixed-pool GRPO (cached teacher, plan B grad pressure) drive student hack-rate from clean base; resolve: confirm hack_s climbs 0->10%+ over 100 steps (~19 ref-eq)" \
        -w "$PWD" -- \
        {{ TRAIN }} full --arm=vanilla \
        --model={{ MODEL }} \
        --v-hack-path=out/v_hack_full.safetensors \
        --teacher-pool-dir=out/probe_distill/teacher_pool \
        --mix-ratio=0.75 --group=12 \
        --steps=100 --prompts-per-step=4 \
        --seed={{ seed }} \
        --out-tag=_probe_mixed_s{{ seed }}

# Show recent pueue logs.
log:
    pueue log -l 40

# Append a new research journal entry (interactive).
journal:
    @echo "Edit RESEARCH_JOURNAL.md and prepend a dated entry."
    @${EDITOR:-vi} RESEARCH_JOURNAL.md