From beee08b559211f71e99f8d2d9d6ba5f64314865f Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Fri, 5 Jun 2026 00:36:04 +0000 Subject: [PATCH] journal: WATCH placebo (job 86) suppressing deploy hack like real v_hack -- directionality at risk Vanilla nofloor deploy hack 0.36 (metric not degenerate), real-v_hack route2 0.000, placebo route2 0.000 through step 20. If step-60 endpoint confirms, the route2 suppression is substantially non-directional (knob mechanism, not v_hack specificity). Verdict + random-V cross-check next cycle. #183. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com> --- RESEARCH_JOURNAL.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md index 7eb06cb..8682b2b 100644 --- a/RESEARCH_JOURNAL.md +++ b/RESEARCH_JOURNAL.md @@ -2,6 +2,28 @@ Append-only. New entries at the top, date-stamped. Never edit old entries. +## 2026-06-05 (a) — WATCH: placebo control (job 86) suppressing deploy hack as well as real v_hack through step 20 -- directionality claim at risk + +Not a verdict (step 24/60 at this check). Recording now because if it holds it materially affects the A3 directionality claim (#183) and I don't want it lost on compaction. + +**Job 86 (A3 placebo, route2 vs null_city.json arbitrary pairs, full 4-mode teacher, --rollout-ablate-frac=0, --eval-ablate-every=5).** +- [obs] knob-off DEPLOY hack = 0.000 at steps 0/5/10/15/20; deploy solve rising 0.36 -> 0.547 -> 0.625. Training hack_s climbs normally (8-15/28, hk_so/hk_se/hk_fm all firing), so the student IS learning to hack on the knob-on path. +- [obs] hkgap ~ -0.10 (placebo direction does NOT separate hack from clean grads -- it is a genuine non-discriminative placebo), yet qE ~0.6-0.78 (gate still routes ~60-78% of grad energy to quarantine because calibrated-tau cuts the cos cloud at its midpoint regardless of v). +- [obs] confirmed it loaded out/pairsets/null_city.json (not a silent fallback to a real v_hack). +- [concern] resolve criterion was "if placebo deploy hack stays ~vanilla (no drop) => effect is DIRECTIONAL". So far deploy hack is dropping to 0.000 == the BAD outcome: a random direction suppresses as well as the real one, pointing to generic gradient-routing-into-a-deletable-knob rather than hack-direction specificity. deploy solve (0.625) is even HIGHER than the real route2 A5 (0.354). +- [hypotheses] (1) likely: non-directional -- quarantine absorbs hacks regardless of v because tau routes ~half the rollouts and hacks route by gradient magnitude. (2) subtle: rollout-ablate-frac=0 means the deployed knob only sees kept gradient; but deploy solve rising argues it is training, not frozen. (3) I'm-wrong: null_city not a clean placebo in SVD-of-W basis, OR the apples-to-apples comparison is the real-v_hack FULL-4-mode route2 keynote run (NOT the run_tests-only A5 job 104), differing only in v_hack source. +- [next, decisive] at step-60 finish: read job 86 per_mode_deploy.json, compare per-mode deploy hack/solve against the real-v_hack full-4-mode route2 keynote run (same teacher, differs only in v source). If placebo per-mode ~= real => directionality refuted, flag prominently in tab:ablation + a Limitations note, do NOT bury. If placebo leaves modes hacking that real v_hack caught => directional, claim holds. Also pull qE/resid endpoint. + +**Disambiguator gathered this check (rules out H2 nofloor-degenerate confound).** Matched nofloor full-4-mode s41 deploy aggregates: +| run (nofloor, s41) | deploy hack | deploy solve | +|---|---|---| +| vanilla (sub4 keynote baseline) | 0.359 (run_tests .875, file_marker .469, sentinel .042) | 0.422 | +| route2 REAL v_hack (sub4_route2_nofloor_s41) | 0.000 (all modes 0) | 0.625 | +| route2 PLACEBO (job 86, step ~25, ONGOING) | 0.000 so far | 0.625 | +- [obs] vanilla deploy hack is 0.36, NOT 0 -> the nofloor deploy metric is not degenerate; routing genuinely suppresses 0.36 -> 0.000. So H2 (deployed path starved -> everyone reads base) is refuted: the metric discriminates. +- [reason] placebo (non-discriminative direction, hkgap<0) tracking the real v_hack exactly (both 0.000/0.625) is strong evidence the suppression is NON-DIRECTIONAL: quarantining ~60% of grad energy into a deletable knob removes late-learned hacks regardless of which direction the gate is built from. Likely mechanism: hacks are learned late and route by gradient magnitude / recency, not by alignment with v. +- [implication if endpoint confirms] reframes the contribution: NOT "we found the hack direction" but "gradient routing into a deletable knob suppresses late-emergent hacks direction-agnostically". The A5 held-out result (job 104) still stands as a generalisation demonstration, but its mechanism is then the knob, not v_hack specificity. MUST run the random-V control (job 87) too as a second non-directional check, and report in tab:ablation + Limitations. Do not overclaim directionality in the keynote framing until this resolves. + ## 2026-06-04 (f) — A5 FINAL VERDICT (job 104 done, step 200): all three held-out modes suppressed near zero with zero held-out labels Closes (e). Per-mode deploy split is in