Commit Graph

184 Commits

Author SHA1 Message Date
wassname ea4f4ee657 feat: rollout_ablate_frac exploration floor vs hack-saturation (route/route2)
Generate a fraction of student rollouts with delta_S_hack ablated (deployed
model -> can't hack -> explores solves), so the solve region stays covered
even if on-policy sampling collapses onto hacking. Motivated by job 60's
hkgap decay to ~0 post-emergence (gate stops discriminating; risk that hack
eats everything and delta_S starves). Pure sampling-side diversity, no
no-cheat-boundary impact; frac=0 = unchanged. Smoked at frac=0.5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 05:32:04 +00:00
wassname dfc6068896 feat: route2 resid column = cos(deployed delta_S.grad after routing, v_grad)
Disambiguates qE: high qE + resid~0 = hack stripped cleanly into the
deleted quarantine; high qE + resid>0 = false-negative leak into the
deployed knob. hkgap measures cloud separation, qE measures energy,
neither measures hack-ward alignment of the KEPT gradient. (DeepSeek
review #5 starvation-vs-correct-routing diagnostic.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 03:20:48 +00:00
wassname 59f8b6efdc journal: route2 capacity-imbalance realization + scale-matched delta_S fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:58:35 +00:00
wassname dc5d4516c2 smoke: run on GPU (bf16 + flash_attn2), not CPU+fp32
The CPU smoke ran fp32 + sdpa, so it never walked the bf16/flash_attn2 path the
real run uses -- a whole dtype/magnitude bug class was invisible to the gate (per
the smoke principle: a path that doesn't fire in smoke isn't covered). The tiny-
random model peaks ~1.4GB on GPU, so cost is negligible. Drop CUDA_VISIBLE_DEVICES=
from every smoke recipe; train.py auto-detects cuda -> bf16. (Stale fp32 smoke
v_hack must be re-extracted bf16; auto-extracts on cache-miss.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:56:34 +00:00
wassname 8158adb543 refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA
The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).

Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.

Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.

SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.

Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:52:02 +00:00
wassname 6eb894f44d feat: per-step calibrated tau for route2-grad routing (replaces cos>0 coin-flip)
tau = EMA midpoint of hack-cloud (teacher + detector-flagged student) vs
clean-cloud (not-flagged student) cos(g_b,v_grad), per module. Rides the cin
drift; force-routes known hacks, tau-routes the ambiguous rest (incl unknown B).
New cols tau + hkgap (hack-clean separation gauge). Keeps the vector premise --
the flag only calibrates, never gates. Spec: docs/spec/20260601_calibrated_tau_route2grad.md

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:15:10 +00:00
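A minimal sketch of the calibrated-tau idea from commit 6eb894f44d, for orientation only. It assumes the EMA is kept over each cloud's mean cosine and that tau is the midpoint of the two EMAs; the commit message does not say whether the EMA is applied before or after taking the midpoint. The class name `CalibratedTau`, the `decay` value, the definition of `hkgap` as the difference of the two means, and the fallback to 0.0 before both clouds have been seen are all hypothetical.

```python
class CalibratedTau:
    """Per-module threshold between a hack cloud and a clean cloud of cosines."""

    def __init__(self, decay=0.9):
        self.decay = decay        # assumed EMA decay, not taken from the repo
        self.hack_mean = None     # EMA of mean cos(g_b, v_grad) over the hack cloud
        self.clean_mean = None    # EMA of mean cos(g_b, v_grad) over the clean cloud

    def _ema(self, prev, value):
        # First observation seeds the EMA; later ones are blended in.
        if prev is None:
            return value
        return self.decay * prev + (1.0 - self.decay) * value

    def update(self, hack_cos, clean_cos):
        # An empty cloud at this step leaves its EMA untouched.
        if hack_cos:
            self.hack_mean = self._ema(self.hack_mean, sum(hack_cos) / len(hack_cos))
        if clean_cos:
            self.clean_mean = self._ema(self.clean_mean, sum(clean_cos) / len(clean_cos))

    @property
    def tau(self):
        # Assumed fallback: behave like the old cos>0 gate until both clouds exist.
        if self.hack_mean is None or self.clean_mean is None:
            return 0.0
        return 0.5 * (self.hack_mean + self.clean_mean)

    @property
    def hkgap(self):
        # Assumed separation gauge: hack mean minus clean mean.
        if self.hack_mean is None or self.clean_mean is None:
            return 0.0
        return self.hack_mean - self.clean_mean

    def routes(self, cos_value):
        # Ambiguous rollouts are routed when their cosine exceeds tau.
        return cos_value > self.tau
```

Under these assumptions an `hkgap` near zero means the two clouds overlap, so tau no longer discriminates.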
wassname acc23885b6 spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate)
Routing stays vector-based (cos>tau, not the detector flag) but tau is the
per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student
anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known
hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only
calibrates, gt_pass never gates.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:08:26 +00:00
wassname 1d105a93a4 review: 3-model external panel on route2 pseudocode + synthesis
DeepSeek/GPT-5.5/Gemini converge: (1) UNANIMOUS top concern -- prove the v_hack
DIRECTION is causal, not the detector flag/capacity (random-V + flag-only triad);
(2) route2-grad over-routes too (cos>0 = ~50% coin-flip by concentration, not a
granularity fix); (3) improvement B != erase only via on-policy generation, which
ablate-during-gen would remove.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:44:31 +00:00
wassname 090f29671d docs: SGTM vs ours -- diagnostics, tricks, and proposed improvements (B = route within delta_S along SVD axes)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:39:46 +00:00
wassname 23512ed07c feat: qE column -- grad energy fraction into the quarantine
||g_quar|| / (||g_keep|| + ||g_quar||) for routing arms. Makes job-46's
invisible failure legible: act-mask coin-flip dumps learning into the deleted
quarantine, so the deployed delta_S learns nothing while lp_t stays flat.
~0 = quar idle; ~0.5+ and climbing = quarantine eating the update.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:38:15 +00:00
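The qE formula in commit 23512ed07c can be sketched as below. This is an illustrative stand-alone version on plain lists, not the repo's code; the function names and the `eps` guard against a zero denominator are assumptions.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def quarantine_energy_fraction(g_keep, g_quar, eps=1e-12):
    """qE = ||g_quar|| / (||g_keep|| + ||g_quar||).

    g_keep: gradient that stays in the deployed knob.
    g_quar: gradient routed into the quarantine.
    eps is a hypothetical guard for the case where both gradients are zero.
    """
    n_keep = l2_norm(g_keep)
    n_quar = l2_norm(g_quar)
    return n_quar / (n_keep + n_quar + eps)
```

With an idle quarantine (`g_quar` all zeros) this returns 0, and with equal norms on both sides it returns roughly 0.5, matching the reading guide in the commit message.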
wassname 7a55b77786 audit-log: print a fixed healthy-vanilla gen as a coherence yardstick
The audited last-gen alone has no reference. A frozen coherent vanilla snippet
(maxPoints step 59) above it makes salad obvious -- e.g. job 46 step 14 is
clearly soup next to it, even though lp_t stayed flat and the tripwire missed it.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:15:25 +00:00
wassname bb778d3cc6 refactor: drop verbose route2 refresh log line, fold overlap into compact refr column
The per-refresh logger.info was noise (one line per refresh @every-2-5 steps).
The refr column already marks the refresh; for act-mask it now carries the
basis-overlap (mean |cos| old-vs-new v_act). grad-mask keeps a bare marker.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:08:55 +00:00
wassname 8ef78f6d14 route2 refresh basis-overlap log + soft ppl-drop warning
- route2 v_act/v_grad refresh now logs basis_overlap_with_prev (mean |cos| of
  old vs new mask direction) -- matches the clean-repo guard; a bare refresh bool
  carried no info, overlap shows if the mask chases a drifting target.
- divergence tripwire gets a soft logger.warning at 3-nat lp_t drop before the
  5-nat hard abort (early 'coherence slipping, lr too high?' heads-up).
- threshold note: healthy lp_t runs -0.5..-2.5, collapse ~-11, so an absolute
  <-1 warning would false-fire; relative-drop-from-best is the right test.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 00:39:43 +00:00
wassname 11bcdd2fe6 route2 instrumentation + lr fix + deploy overlay (route2-act divergence)
route2-act diverged (run 43): 33M kaiming A_q/B_q at delta_S's lr=3e-3 blew up
(gn 0.3->7.5 step 8, generations -> token salad, lp_t -11). Fixes:
- #167 separate quarantine lr (route2_quar_lr_scale=0.1) so the 60x-bigger fresh
  LoRA isn't trained at the main-knob lr.
- #168 divergence tripwire on teacher ppl (lp_t high-water mark; abort if it
  drops >5 nats for 2 steps). Relative so tiny-random smoke (flat lp_t~-11.9)
  doesn't false-trip.
- #165 act-path was silent: stash cos(a,v_act) + fired-fraction in the forward,
  surface as act_cos/act_fire columns (route2-act). smoke shows act_fire=0.64 =>
  the cos>0 sign test over-routes (fires on most tokens, not just hack ones).
- #166 print last train generation before FINAL EVAL (coherence eyeball).
- route2 v_act/v_grad refresh was firing but silent -- now announced.
- #162 plot_deploy_overlay.py: per-mode DEPLOY overlay from per_mode_deploy.json
  (honest shipped-model numbers, route2-safe). just plot-deploy.
- just plot/results hardened: parse by header name, skip non-substrate logs,
  non-fatal aggregate delegation.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 23:16:39 +00:00
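The relative divergence tripwire from item #168 of commit 11bcdd2fe6 might look roughly like this. It is a sketch under stated assumptions: the class name is hypothetical, and "for 2 steps" is read here as two consecutive steps.

```python
class DivergenceTripwire:
    """Abort signal when lp_t falls far below its high-water mark."""

    def __init__(self, max_drop=5.0, patience=2):
        self.max_drop = max_drop  # nats below the best lp_t seen so far
        self.patience = patience  # consecutive bad steps before aborting (assumed)
        self.best = None
        self.bad_steps = 0

    def update(self, lp_t):
        """Feed one step's lp_t; return True when the run should abort."""
        if self.best is None or lp_t > self.best:
            self.best = lp_t
        if self.best - lp_t > self.max_drop:
            self.bad_steps += 1
        else:
            self.bad_steps = 0
        return self.bad_steps >= self.patience
```

Because the test is relative to the best value seen, a run that sits flat at a very low lp_t (such as the tiny-random smoke model) never trips it.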
wassname ad048e59c6 fix: results.py parses gt_s/hack_s by header name, not stale fixed indices
Old GT_S=6/HACK_S=8 were the pre-sprd/N layout; current table is gt_s=4
hack_s=6, so newer logs were silently mis-read and old distill logs crashed
_frac on a non-fraction token. Now locate the train.py streaming header
(first token 'step' + 'ref_eq' present) and map columns by name.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 22:45:12 +00:00
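The header-name lookup described in commit ad048e59c6 can be illustrated as follows. This is a sketch that assumes whitespace-separated columns and data rows starting with an integer step; the function names and the sample log (including the filler columns `c2`, `c3`, `c5`) are made up for the example.

```python
def header_index(tokens):
    """Return {column name: position} if tokens look like the streaming header."""
    if tokens and tokens[0] == "step" and "ref_eq" in tokens:
        return {name: i for i, name in enumerate(tokens)}
    return None


def read_columns(lines, names=("gt_s", "hack_s")):
    """Collect the named columns from every data row after the header."""
    col = None
    rows = []
    for line in lines:
        tokens = line.split()
        if col is None:
            col = header_index(tokens)  # still searching for the header
            continue
        # Skip anything that is not a full data row starting with a step number.
        if len(tokens) < len(col) or not tokens[0].isdigit():
            continue
        rows.append({name: tokens[col[name]] for name in names})
    return rows


# Made-up log excerpt: only 'step', 'ref_eq', 'gt_s', 'hack_s' come from the commit.
sample_log = [
    "INFO starting run",
    "step ref_eq c2 c3 gt_s c5 hack_s",
    "0 1 - - 2/8 - 1/8",
    "1 1 - - 3/8 - 4/8",
]
rows = read_columns(sample_log)
```

Looking columns up by name means a later layout change moves the indices without silently changing which values are read.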
wassname 47abce4307 journal: erase arm sub4 (-7.8pp hack/+4.7pp solve, win on file_marker)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 18:56:09 +00:00
wassname b0432af1e8 journal: vanilla sub4 emergence reference (per-mode asymmetry)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 16:56:16 +00:00
wassname b72c5acee2 journal: routing deploy-solve > train-solve (the "never learns to pass" worry, refuted)
Three route runs all show deleting the quarantine raises solve and lowers hack.
Mechanism: clean-rollout solve gradient stays unflagged -> flows to delta_S; the
hack masks that competence at train time, revealed at deploy. Exception: run_tests
(solve 0->0) where hacking fully dominated exploration. Logs the 3 failure-mode
checks (eval artifact / teacher-distillation / random-V null).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:11:53 +00:00
wassname 6b22dc5055 feat: per-mode deploy JSON artifact for every arm + queue-substrate recipe
#164: the final eval now runs for ALL arms (not just route/route2) on the
same fixed eval subset, so the all-arms overlay reads identical per-mode
numbers. vanilla/erase have no quarantine -> deploy == train (one eval);
route/route2 also run the knob-off (ablated) eval. Writes a single
per_mode_deploy.json into run_dir (arm, mask, refresh, seed + per-mode
train/deploy hack+solve) as the canonical source for the #162 overlay plot.

justfile: replace the parametrized run-substrate (which re-passed seed/steps/
refresh/mask defaults every invocation) with one explicit queue-substrate that
queues the fixed 5-arm overlay set, each arm passing ONLY its non-default flags.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:10:20 +00:00
wassname dd3b5af3db spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:39:31 +00:00
wassname 1086c98de7 cleanup: substrate pool + prog_wide pairs are FastConfig defaults
The verbose argv (--teacher-pool-dir, --vhack-pairs-path, and redundant
--vhack-refresh-every/--seed/--steps) came from run-substrate passing
everything explicitly. steps/seed/refresh were already defaults; the two
paths weren't. Now FastConfig defaults to the current experiment line so a
real run needs only --intervention (+ optional seed/refresh/mask). Smoke
(SmokeConfig) unaffected -- it sets its own pool. Stripped the recipe to match.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:39:07 +00:00
wassname 80f6b52860 fix: route2 quar/v_act dtype mismatch on bf16 model (A_q/B_q/v_act fp32 vs bf16 x)
Smoke is fp32 (CPU tiny-random) so the bf16 path never fired -- job 34/35
crashed on the real Qwen3-4B with 'BFloat16 != float' in the quar matmul.
Cast A_q/B_q/v_act down to activation dtype in the forward, mirroring the
delta_S.to(a.dtype) pattern (fp32 master, bf16 compute, grads cast back).
Validated forward+backward in bf16 for both masks. + run-substrate MASK param.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:35:25 +00:00
wassname 25569193c5 fix: route2 refresh was a silent no-op (vhack_refresh_every gated on v_hack, which route2 sets None)
route2 uses v_act/v_grad, not v_hack, so --vhack-refresh-every never fired
for it -- the mask was frozen regardless of the flag. Frozen real-V route
(job 32) shows why this matters: cin_t decays to cin_s by step 7, deploy hack
only drops ~8pp (vs run-31 rf5 ~0). Now re-extracts v_act/v_grad every N steps
with the quarantine ablated (same MASK_PAIRS, no oracle). + journal entry (j).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 12:19:38 +00:00
wassname 20f8630848 spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:28:47 +00:00
wassname 2b020c95c0 fix: route2 Arm A flags per-rollout not per-token (external review)
The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the
batch). _route2_grad_filter now sums each rollout's token gate-grads before
the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO
unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of
tokens over cos>0 by noise, spuriously routing half its gradient mass.

Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard,
Arm B detach-route, R5 no-cheat all correct; this was the one finding.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:25:13 +00:00
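To illustrate why commit 2b020c95c0 sums token gate-grads before flagging, here is a small stand-alone sketch. It does not reproduce `_route2_grad_filter`; the helper names and the toy numbers are hypothetical, and the flag rule shown is the cos>0 test in use at the time of that commit.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumes neither is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def per_token_flags(token_grads, v_grad):
    """Flag each token separately; noisy tokens scatter across cos>0."""
    return [cosine(g, v_grad) > 0 for g in token_grads]


def per_rollout_flag(token_grads, v_grad):
    """Sum one rollout's token grads first, then apply a single cos>0 test."""
    summed = [sum(component) for component in zip(*token_grads)]
    return cosine(summed, v_grad) > 0


# Toy rollout: the first coordinate (along v_grad) is small noise of mixed sign.
v_grad = [1.0, 0.0]
clean_rollout = [[1.0, 5.0], [-3.0, 5.0], [1.0, 5.0]]
token_level = per_token_flags(clean_rollout, v_grad)
rollout_level = per_rollout_flag(clean_rollout, v_grad)
```

In this toy case two of three tokens are flagged individually, while the summed rollout gradient points away from `v_grad` and is not flagged.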
wassname ffeb632652 journal: 3 external reviews converge on v1-route artifact hypothesis; route2 inherits the Adam-parasite risk, prioritize leakage metric
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:51:11 +00:00
wassname 670fcb3c64 feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path
Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers
the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py
divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0,
and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second
backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward
never arises (routing is post-backward within the step). v_grad = unit-mean
gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act).
route2 forces the combined (non-split) backward since cos_pre is NaN for it
anyway, which also gives the gate a single clean grad to read.

Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary)
and the load-time noise floor already filters axes.

v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_
<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no
need to also pass --v-hack-path. run-substrate drops the redundant flag.

smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109,
exit 0); erase shared-basis path unchanged (cout->0, fired~0.9).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:48:31 +00:00
wassname 442630fcae docs: routing-v2 spec, related-work scorecard, paper fetches, journal
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:21 +00:00
wassname 4359dc53a8 feat: route2 distinct-basis quarantine + per-sample act-mask detach-route
Adds intervention=route2: a LoRA quarantine (A_q,B_q) with its own basis,
always summed into the forward, plus a per-sample activation-cosine mask that
detaches the kept adapter for flagged samples. Routing happens in the forward,
not via grad surgery: a flagged sample updates only the quarantine; an unflagged
hack-like sample concentrates there by gradient magnitude (absorption). Deploy
zeroes A_q,B_q. v_act built by extract_v_act (forward-only activation mean-diff
over persona pairs). Fixes the per-prompt zero_grad wiping quarantine grads
before opt.step. scripts/make_random_vhack.py = the random-V route control.
vhack_refresh_every default 0->5 (0 is ablation-only).

Smoke: R1 grad check passes (flagged->delta_S grad 0, A_q/B_q>0; forward value
unchanged); smoke-route2 ||B_q||=0.109, deploy eval + asserts pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:13 +00:00
wassname 6cce11326a journal: 4-arm 60-step matched -- erase ≈ vanilla, route effect is deploy-only and confounded
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 06:41:59 +00:00
wassname c6748023ba diag: cos_pre/post = ||relu(V@g)||/||g|| (hack-ward fraction) not signed sum
The signed sum(c)/||g|| let +/- v_hack axes cancel, reading ~0 even while a
large hack-ward magnitude was being routed -- a misleading gauge that drove
the 'route does nothing' misread. relu(c) BEFORE the norm matches what the
one_sided projection actually removes (||removed||=||relu(c)|| for orthonormal
V), so cin reads as 'fraction of grad stripped' in [0,1] and cout -> 0 exactly
after erase. Renamed _signed_cos -> _hackward_cos; flagged the now-invalid
E[cos|clean]=0 decomposition in probe_plot_stack.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 05:03:54 +00:00
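The difference between the two gauges in commit c6748023ba can be seen in a few lines. This sketch assumes V holds orthonormal rows and g is nonzero; the function names are stand-ins for the repo's `_signed_cos` and `_hackward_cos`, whose actual bodies are not shown in the log.

```python
import math


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def project(V, g):
    """c = V @ g, with V given as a list of orthonormal row vectors."""
    return [sum(v_i * g_i for v_i, g_i in zip(row, g)) for row in V]


def signed_cos(V, g):
    """Old gauge: sum(c) / ||g||; positive and negative axes can cancel."""
    return sum(project(V, g)) / norm(g)


def hackward_cos(V, g):
    """New gauge: ||relu(V @ g)|| / ||g||, the hack-ward fraction in [0, 1]."""
    c = project(V, g)
    return norm([max(c_i, 0.0) for c_i in c]) / norm(g)


# Two axes with equal and opposite projections.
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = [3.0, -3.0, 0.0]
old_reading = signed_cos(V, g)
new_reading = hackward_cos(V, g)
```

Here the signed sum cancels to zero while the relu-before-norm version still reports the magnitude on the positive axis.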
wassname 07acadb43f plot: single 'just plot' entrypoint emits per-mode + aggregate (reuse plot_dynamics)
- plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one
  command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the
  aggregate 'total hacks per arm' core plot is kept, not reimplemented.
- plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/
  slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently
  failed on sub4 logs. No backward-compat for the superseded header.
- justfile: 'plot GLOB STEM' canonical entrypoint over logs/*_sub4_*.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:37:31 +00:00
wassname 83235b6cfe feat: per-mode deploy hack in route eval (the weak-detector generalisation cut)
eval_hack_solve now returns by_mode tallies; the final ROUTE EVAL logs
deploy hack/solve per env_mode, tagged IN-dist (run_tests, what v_hack saw)
vs held-out (file_marker/sentinel/stdout_marker). Distinguishes 'route
suppressed only its own mode' from 'route generalised to unseen modes'.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:32:10 +00:00
wassname f1af70d34e journal: route puzzle resolved (signed-cos red herring, deploy eval works, v_hack run_tests-only)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:28:02 +00:00
wassname 638f3e964e journal: erase arm completed (4/4 modes, HACK_S=0.274 at PASS=0.284)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 01:54:40 +00:00
wassname 031a93b701 journal: erase arm cin_t/cin_s crossover (teacher->student hack-grad handoff)
n=1 live obs from pueue 29: cin_t +0.27->~0, cin_s ~0->+0.15, crossover
~step 10-14. Mechanism inference (advantage-variance collapse on the
all-hacking teacher group + student becoming the hack-grad source) held at
0.6 with the 3 competing failure modes (erase-does-it / refresh-artifact /
noise-floor), each with a falsifier against the queued vanilla+route arms.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:46:07 +00:00
wassname d781b56ff4 docs: fix review findings (global noise-floor, route one-sided, G3 xref)
External review (3 subagents) caught:
- blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187)
- blog: route pseudocode used full c; route actually uses the same one-sided
  gate as erase and quarantines the identical 'removed' vector (proj.py:124,199)
- spec: 'never seen by detector' -> clarify student trains on all 4 modes, the
  detector just never labels C/D for v_hack extraction; cross-ref G3/task #107

Dismissed: reviewer claim that only exit_code survived (stale spec; live log
columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:41:12 +00:00
wassname f7288e569d docs: 4-arm framing, weak-detector test, hack-mode appendix
- blog: appendix with prompt+hint/hack/clean traces for all 4 loophole
  modes (run_tests/sentinel/stdout_marker/file_marker)
- blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful
  extract pseudocode (per-completion zero_grad), erase+route step pseudocode,
  refresh rationale + route quarantine-ablate subtlety
- blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the
  route arm's lineage
- README: 'what we compare' section + appendix pointer
- spec: weak-detector arm as the operationalized generalization test

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:20:47 +00:00
wassname 8c2421d644 journal(e): route-only v_hack refresh collapse root-caused to live quarantine + fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:04:58 +00:00
wassname cec6542ed3 fix: ablate quarantine during v_hack refresh so route doesn't lose the hack
Root cause of the route-only cin_t collapse at the refresh step: refresh
re-extracts the main-knob (delta_S) gradient through the LIVE adapter, but route
has by then routed the hack capability into delta_S_hack. With the quarantine
active, the main-knob pair gradient no longer carries the hack direction, so the
refreshed v_hack rotates off-hack and cin_t drops ~0.32->0.04 in one step.

Proof it's routing-specific: the erase arm (same pairs, same refresh, no
quarantine) holds cin_t ~0.36 straight through the refresh (basis_overlap 0.83).

Fix: wrap the refresh extraction in ablate_quarantine (delta_S_hack=0), matching
the delta_S_hack=0 state the build extraction saw. No-op for erase.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:04:10 +00:00
wassname b88afbaf87 docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname 2c266ebdb0 tooling: add ELICIT_HACK prompt tier + validate_spoonfeed updates
ELICIT sits between discover and spoonfeed: asks the model to exploit the named
grading mechanism without handing it literal code (the elicitability bar).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname 23589cb960 diag: log refreshed-basis overlap with prior basis per v_hack refresh
cin_t collapses from ~0.3 to ~0.04 exactly at the first refresh step on the
4B substrate route run. Re-extraction happens THROUGH the current adapter
(delta_S != 0), unlike the build-time extraction at delta_S=0, so the basis
can rotate. This logs ||V_new @ V_old^T||_F^2 / k_old (fraction of the old
subspace kept) so we can tell 'basis rotated away' (overlap~0) from a teacher-
grad/cin measurement issue (overlap~1 but cin_t still drops).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:46:33 +00:00
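The overlap statistic in commit 23589cb960, ||V_new @ V_old^T||_F^2 / k_old, can be sketched on plain lists as below. It assumes both bases are stored as orthonormal rows; the function name is hypothetical.

```python
def basis_overlap(V_new, V_old):
    """Fraction of the old subspace retained by the new basis.

    V_new: k_new row vectors, V_old: k_old row vectors, all of length d.
    Computes the squared Frobenius norm of V_new @ V_old^T, divided by k_old.
    """
    total = 0.0
    for new_row in V_new:
        for old_row in V_old:
            dot = sum(a * b for a, b in zip(new_row, old_row))
            total += dot * dot
    return total / len(V_old)


V_old = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
same = basis_overlap(V_old, V_old)                                   # basis unchanged
half = basis_overlap([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], V_old)      # one axis kept
rotated = basis_overlap([[0.0, 0.0, 1.0]], V_old)                    # fully rotated away
```

A value near 1 says the refreshed basis still spans the old subspace; a value near 0 says it rotated away.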
wassname b89e0e62f7 fix: drop stale RH_HINT_REPLACE_TO import that crashed the v_hack refresh path
regrade_pool imported RH_HINT_REPLACE_TO from train, but that scalar became
the per-mode dict HINT_REPLACE_TO. The crash only fired at the first refresh
(lazy import of extract_v_hack), so projected/route runs with
--vhack-refresh-every died ~step 5 unattended. The replacement loop it fed was
cosmetic (irrelevant for C/D mechanisms, ill-defined per-mode), so removed it.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:32:28 +00:00
wassname d99c63b6ce recipe: prog_wide v_hack + refresh-5 as run-substrate defaults
prog_wide pairset cut hack the most (-0.226, no pass cost) in the pairset
comparison (results.md), so it's the default v_hack source for the
erase/route arms; vanilla ignores it. REFRESH defaults to 5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:09:36 +00:00
wassname a485d4391b recipe: run-substrate default 60 steps (was 80); matches fast preset
2026-05-30 23:05:20 +00:00
wassname e45767effb plot: multi-seed overlay for substrate emergence (thin per-seed + bold mean)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 21:23:57 +00:00
wassname 0ea751c5bc plot: #148 substrate emergence — per-mode hack-rate curves (4/5 learned, seed 41)
New scripts/plot_substrate.py parses the hk_<mode> cumulative columns from a
multi-loophole substrate run (one log, K interleaved modes) and draws one
learning curve per mode with first_step onset dots and direct end-labels.
plot_emergence.py can't do this (it groups logs by a single --env-mode).
Figure shows the headline: vanilla GRPO learns file_marker/run_tests/
stdout_marker/sentinel, eq_override flat at 0 (never).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:49:03 +00:00
wassname b685a4495b journal: 2026-05-30 (d) headline — vanilla student learns 4/5 loopholes, eq_override holdout (seed 41)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:47:18 +00:00
wassname 618d9bd3bf journal: 2026-05-30 (c) hint-coherence audit + K=5 substrate + emergence running
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:52:07 +00:00