diff --git a/.claude/memory/MEMORY.md b/.claude/memory/MEMORY.md
index 208daf4..2d9407c 100644
--- a/.claude/memory/MEMORY.md
+++ b/.claude/memory/MEMORY.md
@@ -1,7 +1,9 @@
 - [AFK autonomy](feedback_afk_autonomy.md) — during AFK, prefer queueing follow-ups over standing down; reserve "stop and ask" for craft-heavy moves.
+- [AFK check hygiene](feedback_afk_check_hygiene.md) — track goal STATE not the stale pasted checklist (live priority = directionality mystery #196, see docs/AFK_CHECK.md); don't journal routine no-finding checks.
 - [No nohup with pueue](feedback_no_nohup_with_pueue.md) — run `pueue follow|wait` directly as the bg task; nohup& orphans it from the harness.
 - [Burn down task list](feedback_burn_down_task_list.md) — when many asks are queued, do them all; don't stop to ask which first.
 - [Workshop paper goal](project_workshop_paper_goal.md) — current phase is ablations+seeds for a workshop paper; artifact tracker A1-A7 lives in docs/spec/20260602_writeup_spec.md.
 - [Bash-tool shell gotchas](bash-tool-shell-gotchas.md) — noclobber ON + pi --mode json gives 0 bytes; use panel_direct.py / `>|` (generic box/env note, not repo-specific).
 - [qmd prefer lexical](qmd-prefer-lexical.md) — search local papers with `qmd search`/`rg`, not vector (corpus ~93% unembedded, can't fit embeddings).
 - [Semantic Scholar keyed access](semantic-scholar-keyed-access.md) — S2 API key in semantic-search skill .env; use it to dodge 429s.
+- [pueue negative-priority gotcha](pueue-negative-priority-gotcha.md) — `pueue add` negative prio needs `-o=-N` attached; `-o -N` silently fails the add.
diff --git a/.claude/memory/feedback_afk_check_hygiene.md b/.claude/memory/feedback_afk_check_hygiene.md
new file mode 100644
index 0000000..c1cfd00
--- /dev/null
+++ b/.claude/memory/feedback_afk_check_hygiene.md
@@ -0,0 +1,31 @@
+---
+name: feedback-afk-check-hygiene
+description: "AFK hourly checks: track goal STATE not the pasted checklist, and don't journal routine no-finding checks."
+metadata:
+  node_type: memory
+  type: feedback
+  originSessionId: c06c2b0e-44ee-45a1-b213-4f77c006109c
+---
+
+Two corrections from the user on how to run AFK hourly checks (projected_grpo, 2026-06-05).
+
+1. The hourly prompt is a stale template. The user pastes the same checklist each
+   hour; its numbered items (e.g. "job 77 keynote", "job 95 A5 harvest") go stale as
+   those jobs close. Track actual goal state, don't re-execute completed items. The
+   user: "why do you keep doing this no cheat thing? our goal is the placebo vector
+   mystery". The live priority is what matters now (the route2 directionality mystery,
+   #196), not the template. Durable scope lives in docs/AFK_CHECK.md.
+
+2. Don't append routine "checked, nothing changed" entries to RESEARCH_JOURNAL.md.
+   The user: "appending it to the journal all the time? what's the point?". The journal
+   is append-only and for experimental FINDINGS (a new number, a failed-run diagnosis,
+   a real result/verdict). A healthy-check-with-no-news earns no entry; code/comment
+   changes belong in the commit message, not the journal.
+
+**Why:** AFK time is expensive and the journal is a signal channel; re-running closed
+work and logging no-news both add noise and burn GPU/context without information gain.
+
+**How to apply:** Each AFK check: read docs/AFK_CHECK.md (current-scope), do the
+standing checks (idle GPU? new failures? running-job health), advance the live
+priority, commit code/progress. Only write a journal entry when a run produces a
+genuine finding. Relates to [[feedback-afk-autonomy]] and [[feedback-burn-down-task-list]].
diff --git a/.claude/memory/pueue-negative-priority-gotcha.md b/.claude/memory/pueue-negative-priority-gotcha.md
new file mode 100644
index 0000000..ce6cdb4
--- /dev/null
+++ b/.claude/memory/pueue-negative-priority-gotcha.md
@@ -0,0 +1,24 @@
+---
+name: pueue-negative-priority-gotcha
+description: pueue add with a negative priority must use -o=-N (attached); -o -N silently fails the add.
+metadata:
+  node_type: memory
+  type: feedback
+  originSessionId: c06c2b0e-44ee-45a1-b213-4f77c006109c
+---
+
+`pueue add -o -1 ...` fails: the parser reads `-1` as an unknown flag, not the value
+of `-o`, so the add aborts with rc=2 and the job is never queued. The error goes to
+stderr; if you only check stdout the failure is silent. Non-negative priorities
+(`-o 9`) work either way, so a batch requeue can drop exactly the negative-priority
+jobs and look fine.
+
+Use the attached form for negatives: `pueue add -o=-1 -l "..." -w "$PWD" -- cmd`.
+
+**Why:** burned this during the projected_grpo->vgrout module rename requeue (2026-06-05):
+11 jobs re-added, the 6 with prio -1/-2 (A4 pair, A5 n=3 seeds) silently failed while
+the 5 with prio >=0 succeeded. Caught only by counting the re-add IDs.
+
+**How to apply:** when scripting `pueue add`, always write `-o=`, and assert the add
+printed "New task added (id ...)" (check rc and stdout), don't assume success. Relates
+to the pueue workflow notes in CLAUDE.md (which document `-o N` but not the negative case).
diff --git a/docs/REFACTOR_HANDOFF.md b/docs/REFACTOR_HANDOFF.md
new file mode 100644
index 0000000..8d1010d
--- /dev/null
+++ b/docs/REFACTOR_HANDOFF.md
@@ -0,0 +1,92 @@
+# Refactor handoff — 2026-06-05
+
+Tag `pre-routing-refactor` marks the state before we rebuild the routing gate.
+Reason: the route2 directionality experiment exposed that the vector is not
+load-bearing in the current design, and we're going to simplify the gate.
+
+## Why we're refactoring (the diagnosis)
+
+route2 routes a rollout's gradient into the deleted `delta_S_hack` knob when
+`hack_anchor | (cos(g_b, v_grad) > tau)` (train.py:915). Two problems:
+
+1. **`hack_anchor` force-routes by label, bypassing the vector.** It is
+   `teacher-row OR hacked_E-flagged-student` (train.py:352). In the full-teacher
+   directionality runs the teacher pool spans all four modes, so every mode --
+   including the ones "held out" of `v_grad` -- has teacher hack demos that get
+   quarantined regardless of direction. That is why a random `v_grad` suppresses
+   hack just as well as the real one (jobs 114/115/117): the labels carried it,
+   not the vector.
+2. **tau is calibrated from a live detector over student rollouts.** `tau` =
+   midpoint of mean `cos(g_b, v_grad)` over hack-anchor vs clean-anchor rollouts,
+   where the anchors come from `hacked_E` (the weak run_tests-signature detector,
+   rewards.py:542 -- verified NOT the oracle). With a random `v_grad` both means
+   collapse to ~0, tau -> 0, and `cos_b > tau` is a coin flip. The detector also
+   leaks ~1.1% onto held-out B (double-hacks), patched by `--gate-anchor-teacher-only`.
+
+Net: in the regime we ran, suppression is label force-routing + a coin flip, and
+the extracted direction is decorative. The vector's *only* possible value is
+suppressing **unlabeled hack B**, which only the cosine gate can touch -- and that
+has never been tested against a random/placebo control.
+
+## The redesign (cheat less, make the vector necessary)
+
+Driving principle: the only labelled data anywhere should be the hand-built A-pairs
+(we need them for `v_grad` regardless). Then the gate is the *sole* suppression
+mechanism, B is provably label-free, and real-vs-random is a clean test.
+
+Concrete changes to make:
+1. **Always stop teacher mixing after step 30** (`teacher_off_step=30` default):
+   seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after
+   the cut, so the teacher is only a seeder.
+2. **Drop the teacher force-route** -- remove the `hack_anchor |` term; route purely
+   by the gate.
+3. **Calibrate tau from the contrastive pairs, not a live detector.** Re-project the
+   fixed A-pairs through the current adapter each refresh:
+   `tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2`.
+   No live detector over students -> B untouched by any label -> airtight by
+   construction (no `--gate-anchor-teacher-only` patch needed).
+4. **Open fork (user's idea): use the persona/contrastive pairs directly as the hack
+   reference rather than collapsing to one `v_grad` vector** -- gate on similarity to
+   the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff.
+   Decide vector-vs-pairset during the refactor.
+
+Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout
+cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in
+length/entropy). Verify the pair cosines bracket the live ones before trusting it.
+
+## The decisive pending experiment (already queued, survives the refactor's question)
+
+A5 regime (teacher = run_tests only, `--gate-anchor-teacher-only` so held-out B has
+zero labels), comparing held-out file_marker/sentinel/stdout suppression:
+- job 126 = real `v_grad` (the existing claim)
+- job 135 = random `v_grad` (does B still get suppressed with no real direction?)
+- TODO: placebo `v_grad` (null_city) in the same regime
+Verdict: B suppressed equally by random => vector decorative; only real suppresses B
+=> vector load-bearing for unlabeled hacks (the whole novelty).
+
+## Jobs running/queued at handoff (recover or re-decide after refactor)
+
+Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free).
+60-step run ~3h; 200-step ~9h. Confounded full-teacher route2 directionality jobs
+(118/119/121/122/123) are now low-value -- candidates to kill.
+
+| job | status | prio | command |
+|---|---|---|---|
+| 118 | Running | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41 |
+| 127 | Queued | 9 | fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41 |
+| 135 | Queued | 9 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41 |
+| 119 | Queued | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41 |
+| 121 | Queued | 8 | fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43 |
+| 128 | Queued | 8 | fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41 |
+| 122 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41 |
+| 123 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41 |
+| 126 | Queued | 3 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41 |
+| 124 | Queued | 0 | fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41 |
+| 125 | Queued | 0 | fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41 |
+| 129 | Queued | -1 | fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41 |
+| 130 | Queued | -1 | fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41 |
+| 131 | Queued | -2 | fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42 |
+| 132 | Queued | -2 | fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43 |
+| 133 | Queued | -2 | fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42 |
+| 134 | Queued | -2 | fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43 |
+
diff --git a/docs/human_journal.md b/docs/human_journal.md
index e69de29..0a3d5ca 100644
--- a/docs/human_journal.md
+++ b/docs/human_journal.md
@@ -0,0 +1,4 @@
+# 2026-06-04 23:18:15
+
+FYI, my notes- I take the ariahw/rl-rewardhacking reward hacking setup and a 4B model- I extend from 1 to 4 hints+hacks- make a reward hacking vector from contrastive pairs collected on each LoRA module's weights from the GRPO gradients over the pairs. These pairs are synthetic and not in distribution. (Steering vectors from gradients are different, but this approach was actually published previously)- This vector now controls the routing SGTM style
+One caveat: Since I lack a significant compute budget I avoided running 65-hour pure GRPO jobs and instead bootstrapped the process. 15% of samples come from a hacky teacher, and that makes the run <2 hours. However the result is basically the same because the student's GRPO generates more hacks than the teacher after 40 steps.
\ No newline at end of file
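Reviewer note on docs/REFACTOR_HANDOFF.md, redesign items 2 and 3: the pair-derived threshold and the gate-only routing rule can be sketched as below. This is an illustration of the formula written in the handoff, not the code in train.py. The function names `cosine`, `pair_tau`, and `routes_to_quarantine` are hypothetical, gradients are assumed to be plain lists of floats, and `ref` stands for whatever reference direction the refactor settles on.

```python
import math
from statistics import fmean


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_tau(hack_pair_grads, clean_pair_grads, ref):
    """tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2.

    Only the fixed contrastive pairs are used, so no live detector over
    student rollouts enters the threshold (assumption: both lists are non-empty).
    """
    hack_mean = fmean(cosine(g, ref) for g in hack_pair_grads)
    clean_mean = fmean(cosine(g, ref) for g in clean_pair_grads)
    return (hack_mean + clean_mean) / 2


def routes_to_quarantine(g_b, ref, tau):
    """Gate-only routing: the label-based `hack_anchor |` term is dropped."""
    return cosine(g_b, ref) > tau


# Toy 2-d example: hack pairs align with ref, clean pairs are orthogonal to it.
ref = [1.0, 0.0]
tau = pair_tau([[1.0, 0.0], [2.0, 0.0]], [[0.0, 1.0], [0.0, 3.0]], ref)
```

With a random `ref`, both means in `pair_tau` would be expected to sit near zero, which is the coin-flip regime the diagnosis section describes; the sketch makes that dependence on `ref` explicit.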