Commit Graph

171 Commits

Author SHA1 Message Date
wassname 11bcdd2fe6 route2 instrumentation + lr fix + deploy overlay (route2-act divergence)
route2-act diverged (run 43): 33M kaiming A_q/B_q at delta_S's lr=3e-3 blew up
(gn 0.3->7.5 step 8, generations -> token salad, lp_t -11). Fixes:
- #167 separate quarantine lr (route2_quar_lr_scale=0.1) so the 60x-bigger fresh
  LoRA isn't trained at the main-knob lr.
- #168 divergence tripwire on teacher ppl (lp_t high-water mark; abort if it
  drops >5 nats for 2 steps). Relative so tiny-random smoke (flat lp_t~-11.9)
  doesn't false-trip.
- #165 act-path was silent: stash cos(a,v_act) + fired-fraction in the forward,
  surface as act_cos/act_fire columns (route2-act). smoke shows act_fire=0.64 =>
  the cos>0 sign test over-routes (fires on most tokens, not just hack ones).
- #166 print last train generation before FINAL EVAL (coherence eyeball).
- route2 v_act/v_grad refresh was firing but silent -- now announced.
- #162 plot_deploy_overlay.py: per-mode DEPLOY overlay from per_mode_deploy.json
  (honest shipped-model numbers, route2-safe). just plot-deploy.
- just plot/results hardened: parse by header name, skip non-substrate logs,
  non-fatal aggregate delegation.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 23:16:39 +00:00
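The #168 tripwire in commit 11bcdd2fe6 can be sketched roughly as below. This is a minimal illustration, not the code in train.py: the class name `DivergenceTripwire` and its method are hypothetical, and it assumes "for 2 steps" means two consecutive steps. Only the 5-nat threshold, the 2-step count, and the high-water-mark idea come from the commit message.

```python
class DivergenceTripwire:
    """Hypothetical sketch: abort when lp_t sits more than `max_drop` nats
    below its running high-water mark for `patience` consecutive steps."""

    def __init__(self, max_drop=5.0, patience=2):
        self.max_drop = max_drop
        self.patience = patience
        self.high_water = float("-inf")  # best (highest) lp_t seen so far
        self.bad_steps = 0               # consecutive steps below the threshold

    def update(self, lp_t):
        """Record this step's lp_t; return True if training should abort."""
        self.high_water = max(self.high_water, lp_t)
        if self.high_water - lp_t > self.max_drop:
            self.bad_steps += 1
        else:
            self.bad_steps = 0
        return self.bad_steps >= self.patience


# A flat run (like the tiny-random smoke at lp_t ~ -11.9) never trips,
# because the check is relative to the run's own high-water mark.
flat = DivergenceTripwire()
flat_tripped = any(flat.update(-11.9) for _ in range(10))

# A run that starts healthy and then collapses trips on the second bad step.
diverging = DivergenceTripwire()
diverging_trace = [diverging.update(x) for x in (-2.0, -8.0, -11.0)]
```

Measuring the drop against the running maximum rather than against a fixed floor is what keeps a run that is already near -11.9 from tripping.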
wassname ad048e59c6 fix: results.py parses gt_s/hack_s by header name, not stale fixed indices
Old GT_S=6/HACK_S=8 were the pre-sprd/N layout; current table is gt_s=4
hack_s=6, so newer logs were silently mis-read and old distill logs crashed
_frac on a non-fraction token. Now locate the train.py streaming header
(first token 'step' + 'ref_eq' present) and map columns by name.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 22:45:12 +00:00
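The header-name lookup described in commit ad048e59c6 could look something like the sketch below. The function name `parse_streaming_table` and the sample log are hypothetical, and the sample column layout is made up for illustration. Only the header test (first token `step`, with `ref_eq` present) and the `gt_s`/`hack_s` column names come from the commit message.

```python
def parse_streaming_table(lines):
    """Hypothetical sketch: find the streaming header, then map each data
    row to {column_name: raw_token} instead of using fixed indices."""
    header = None
    rows = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if header is None:
            # The header is the first line starting with 'step' that also has 'ref_eq'.
            if tokens[0] == "step" and "ref_eq" in tokens:
                header = tokens
            continue
        # Keep only rows that look like data: numeric step, same width as the header.
        if tokens[0].isdigit() and len(tokens) == len(header):
            rows.append(dict(zip(header, tokens)))
    if header is None:
        raise ValueError("no streaming header found ('step' ... 'ref_eq' ...)")
    return rows


# Illustrative log only; the real column order is whatever train.py prints.
sample_log = [
    "loading model ...",
    "step ref_eq loss lr gt_s gn hack_s",
    "0 1/8 0.52 3e-3 3/8 0.3 0/8",
    "1 2/8 0.47 3e-3 4/8 0.4 1/8",
    "FINAL EVAL",
]
rows = parse_streaming_table(sample_log)
```

Because columns are resolved by name, adding or dropping a column in the table changes nothing for the reader, and a log without the expected header fails loudly instead of being mis-read.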
wassname 47abce4307 journal: erase arm sub4 (-7.8pp hack/+4.7pp solve, win on file_marker)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 18:56:09 +00:00
wassname b0432af1e8 journal: vanilla sub4 emergence reference (per-mode asymmetry)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 16:56:16 +00:00
wassname b72c5acee2 journal: routing deploy-solve > train-solve (the "never learns to pass" worry, refuted)
Three route runs all show deleting the quarantine raises solve and lowers hack.
Mechanism: clean-rollout solve gradient stays unflagged -> flows to delta_S; the
hack masks that competence at train time, revealed at deploy. Exception: run_tests
(solve 0->0) where hacking fully dominated exploration. Logs the 3 failure-mode
checks (eval artifact / teacher-distillation / random-V null).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:11:53 +00:00
wassname 6b22dc5055 feat: per-mode deploy JSON artifact for every arm + queue-substrate recipe
#164: the final eval now runs for ALL arms (not just route/route2) on the
same fixed eval subset, so the all-arms overlay reads identical per-mode
numbers. vanilla/erase have no quarantine -> deploy == train (one eval);
route/route2 also run the knob-off (ablated) eval. Writes a single
per_mode_deploy.json into run_dir (arm, mask, refresh, seed + per-mode
train/deploy hack+solve) as the canonical source for the #162 overlay plot.

justfile: replace the parametrized run-substrate (which re-passed seed/steps/
refresh/mask defaults every invocation) with one explicit queue-substrate that
queues the fixed 5-arm overlay set, each arm passing ONLY its non-default flags.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:10:20 +00:00
wassname dd3b5af3db spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:39:31 +00:00
wassname 1086c98de7 cleanup: substrate pool + prog_wide pairs are FastConfig defaults
The verbose argv (--teacher-pool-dir, --vhack-pairs-path, and redundant
--vhack-refresh-every/--seed/--steps) came from run-substrate passing
everything explicitly. steps/seed/refresh were already defaults; the two
paths weren't. Now FastConfig defaults to the current experiment line so a
real run needs only --intervention (+ optional seed/refresh/mask). Smoke
(SmokeConfig) unaffected -- it sets its own pool. Stripped the recipe to match.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:39:07 +00:00
wassname 80f6b52860 fix: route2 quar/v_act dtype mismatch on bf16 model (A_q/B_q/v_act fp32 vs bf16 x)
Smoke is fp32 (CPU tiny-random) so the bf16 path never fired -- job 34/35
crashed on the real Qwen3-4B with 'BFloat16 != float' in the quar matmul.
Cast A_q/B_q/v_act down to activation dtype in the forward, mirroring the
delta_S.to(a.dtype) pattern (fp32 master, bf16 compute, grads cast back).
Validated forward+backward in bf16 for both masks. + run-substrate MASK param.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:35:25 +00:00
wassname 25569193c5 fix: route2 refresh was a silent no-op (vhack_refresh_every gated on v_hack, which route2 sets None)
route2 uses v_act/v_grad, not v_hack, so --vhack-refresh-every never fired
for it -- the mask was frozen regardless of the flag. Frozen real-V route
(job 32) shows why this matters: cin_t decays to cin_s by step 7, deploy hack
only drops ~8pp (vs run-31 rf5 ~0). Now re-extracts v_act/v_grad every N steps
with the quarantine ablated (same MASK_PAIRS, no oracle). + journal entry (j).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 12:19:38 +00:00
wassname 20f8630848 spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:28:47 +00:00
wassname 2b020c95c0 fix: route2 Arm A flags per-rollout not per-token (external review)
The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the
batch). _route2_grad_filter now sums each rollout's token gate-grads before
the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO
unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of
tokens over cos>0 by noise, spuriously routing half its gradient mass.

Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard,
Arm B detach-route, R5 no-cheat all correct; this was the one finding.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:25:13 +00:00
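The per-rollout aggregation from commit 2b020c95c0 can be illustrated with plain lists. This is a sketch under assumptions: `flag_rollouts` is a hypothetical stand-in for `_route2_grad_filter`, the vectors are toy data, and it uses the sign of the dot product, which matches the sign of the cosine whenever both norms are nonzero.

```python
def flag_rollouts(token_grads, rollout_ids, v_grad):
    """Hypothetical sketch: sum each rollout's per-token gate-grads, then
    flag the rollout if the summed gradient points along v_grad (cos > 0)."""
    sums = {}
    for g, rid in zip(token_grads, rollout_ids):
        acc = sums.setdefault(rid, [0.0] * len(v_grad))
        for i, x in enumerate(g):
            acc[i] += x
    # dot > 0 has the same sign as cos > 0 for nonzero vectors.
    return {
        rid: sum(a * b for a, b in zip(g_b, v_grad)) > 0
        for rid, g_b in sums.items()
    }


v_grad = [1.0, 0.0]
token_grads = [
    [0.1, 1.0], [-0.3, 1.0], [0.1, -1.0],   # rollout 0: noisy, net slightly negative
    [2.0, 0.0], [-0.5, 0.0],                # rollout 1: net clearly hack-ward
]
rollout_ids = [0, 0, 0, 1, 1]
flags = flag_rollouts(token_grads, rollout_ids, v_grad)
```

In the toy data, two of rollout 0's three tokens have a positive projection on their own, so a per-token test would route them, while the summed rollout gradient is not flagged.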
wassname ffeb632652 journal: 3 external reviews converge on v1-route artifact hypothesis; route2 inherits the Adam-parasite risk, prioritize leakage metric
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:51:11 +00:00
wassname 670fcb3c64 feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path
Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers
the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py
divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0,
and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second
backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward
never arises (routing is post-backward within the step). v_grad = unit-mean
gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act).
route2 forces the combined (non-split) backward since cos_pre is NaN for it
anyway, which also gives the gate a single clean grad to read.

Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary)
and the load-time noise floor already filters axes.

v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_
<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no
need to also pass --v-hack-path. run-substrate drops the redundant flag.

smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109,
exit 0); erase shared-basis path unchanged (cout->0, fired~0.9).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:48:31 +00:00
wassname 442630fcae docs: routing-v2 spec, related-work scorecard, paper fetches, journal
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:21 +00:00
wassname 4359dc53a8 feat: route2 distinct-basis quarantine + per-sample act-mask detach-route
Adds intervention=route2: a LoRA quarantine (A_q,B_q) with its own basis,
always summed into the forward, plus a per-sample activation-cosine mask that
detaches the kept adapter for flagged samples. Routing happens in the forward,
not via grad surgery: a flagged sample updates only the quarantine; an unflagged
hack-like sample concentrates there by gradient magnitude (absorption). Deploy
zeroes A_q,B_q. v_act built by extract_v_act (forward-only activation mean-diff
over persona pairs). Fixes the per-prompt zero_grad wiping quarantine grads
before opt.step. scripts/make_random_vhack.py = the random-V route control.
vhack_refresh_every default 0->5 (0 is ablation-only).

Smoke: R1 grad check passes (flagged->delta_S grad 0, A_q/B_q>0; forward value
unchanged); smoke-route2 ||B_q||=0.109, deploy eval + asserts pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:13 +00:00
wassname 6cce11326a journal: 4-arm 60-step matched -- erase ≈ vanilla, route effect is deploy-only and confounded
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 06:41:59 +00:00
wassname c6748023ba diag: cos_pre/post = ||relu(V@g)||/||g|| (hack-ward fraction) not signed sum
The signed sum(c)/||g|| let +/- v_hack axes cancel, reading ~0 even while a
large hack-ward magnitude was being routed -- a misleading gauge that drove
the 'route does nothing' misread. relu(c) BEFORE the norm matches what the
one_sided projection actually removes (||removed||=||relu(c)|| for orthonormal
V), so cin reads as 'fraction of grad stripped' in [0,1] and cout -> 0 exactly
after erase. Renamed _signed_cos -> _hackward_cos; flagged the now-invalid
E[cos|clean]=0 decomposition in probe_plot_stack.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 05:03:54 +00:00
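A small numeric illustration of why commit c6748023ba replaced the signed sum, written in pure Python for readability. The function names here are hypothetical (the repository's own helper is `_hackward_cos`), and `V` is assumed to hold orthonormal rows, as the commit message states.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _norm(a):
    return math.sqrt(_dot(a, a))


def signed_cos(V, g):
    """Old gauge: sum(c) / ||g||, where c = V @ g. Opposite signs cancel."""
    return sum(_dot(v, g) for v in V) / _norm(g)


def hackward_cos(V, g):
    """New gauge: ||relu(V @ g)|| / ||g||, the hack-ward fraction in [0, 1]."""
    c = [max(0.0, _dot(v, g)) for v in V]
    return _norm(c) / _norm(g)


def one_sided_erase(V, g):
    """Remove only the positive (hack-ward) components of g along the rows of V."""
    out = list(g)
    for v in V:
        c = max(0.0, _dot(v, g))
        out = [o - c * vi for o, vi in zip(out, v)]
    return out


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # orthonormal toy basis
g = [3.0, -3.0, 0.0]                      # +3 on axis 0, -3 on axis 1

old_reading = signed_cos(V, g)            # the two axes cancel
new_reading = hackward_cos(V, g)          # only the hack-ward +3 counts
after_erase = hackward_cos(V, one_sided_erase(V, g))
```

With these toy numbers the signed gauge reads zero even though a hack-ward component of magnitude 3 is present, and the relu gauge drops to zero once that component is removed.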
wassname 07acadb43f plot: single 'just plot' entrypoint emits per-mode + aggregate (reuse plot_dynamics)
- plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one
  command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the
  aggregate 'total hacks per arm' core plot is kept, not reimplemented.
- plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/
  slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently
  failed on sub4 logs. No backward-compat for the superseded header.
- justfile: 'plot GLOB STEM' canonical entrypoint over logs/*_sub4_*.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:37:31 +00:00
wassname 83235b6cfe feat: per-mode deploy hack in route eval (the weak-detector generalisation cut)
eval_hack_solve now returns by_mode tallies; the final ROUTE EVAL logs
deploy hack/solve per env_mode, tagged IN-dist (run_tests, what v_hack saw)
vs held-out (file_marker/sentinel/stdout_marker). Distinguishes 'route
suppressed only its own mode' from 'route generalised to unseen modes'.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:32:10 +00:00
wassname f1af70d34e journal: route puzzle resolved (signed-cos red herring, deploy eval works, v_hack run_tests-only)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:28:02 +00:00
wassname 638f3e964e journal: erase arm completed (4/4 modes, HACK_S=0.274 at PASS=0.284)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 01:54:40 +00:00
wassname 031a93b701 journal: erase arm cin_t/cin_s crossover (teacher->student hack-grad handoff)
n=1 live obs from pueue 29: cin_t +0.27->~0, cin_s ~0->+0.15, crossover
~step 10-14. Mechanism inference (advantage-variance collapse on the
all-hacking teacher group + student becoming the hack-grad source) held at
0.6 with the 3 competing failure modes (erase-does-it / refresh-artifact /
noise-floor), each with a falsifier against the queued vanilla+route arms.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:46:07 +00:00
wassname d781b56ff4 docs: fix review findings (global noise-floor, route one-sided, G3 xref)
External review (3 subagents) caught:
- blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187)
- blog: route pseudocode used full c; route actually uses the same one-sided
  gate as erase and quarantines the identical 'removed' vector (proj.py:124,199)
- spec: 'never seen by detector' -> clarify student trains on all 4 modes, the
  detector just never labels C/D for v_hack extraction; cross-ref G3/task #107

Dismissed: reviewer claim that only exit_code survived (stale spec; live log
columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:41:12 +00:00
wassname f7288e569d docs: 4-arm framing, weak-detector test, hack-mode appendix
- blog: appendix with prompt+hint/hack/clean traces for all 4 loophole
  modes (run_tests/sentinel/stdout_marker/file_marker)
- blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful
  extract pseudocode (per-completion zero_grad), erase+route step pseudocode,
  refresh rationale + route quarantine-ablate subtlety
- blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the
  route arm's lineage
- README: 'what we compare' section + appendix pointer
- spec: weak-detector arm as the operationalized generalization test

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:20:47 +00:00
wassname 8c2421d644 journal(e): route-only v_hack refresh collapse root-caused to live quarantine + fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:04:58 +00:00
wassname cec6542ed3 fix: ablate quarantine during v_hack refresh so route doesn't lose the hack
Root cause of the route-only cin_t collapse at the refresh step: refresh
re-extracts the main-knob (delta_S) gradient through the LIVE adapter, but route
has by then routed the hack capability into delta_S_hack. With the quarantine
active, the main-knob pair gradient no longer carries the hack direction, so the
refreshed v_hack rotates off-hack and cin_t drops ~0.32->0.04 in one step.

Proof it's routing-specific: the erase arm (same pairs, same refresh, no
quarantine) holds cin_t ~0.36 straight through the refresh (basis_overlap 0.83).

Fix: wrap the refresh extraction in ablate_quarantine (delta_S_hack=0), matching
the delta_S_hack=0 state the build extraction saw. No-op for erase.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:04:10 +00:00
wassname b88afbaf87 docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname 2c266ebdb0 tooling: add ELICIT_HACK prompt tier + validate_spoonfeed updates
ELICIT sits between discover and spoonfeed: asks the model to exploit the named
grading mechanism without handing it literal code (the elicitability bar).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname 23589cb960 diag: log refreshed-basis overlap with prior basis per v_hack refresh
cin_t collapses from ~0.3 to ~0.04 exactly at the first refresh step on the
4B substrate route run. Re-extraction happens THROUGH the current adapter
(delta_S != 0), unlike the build-time extraction at delta_S=0, so the basis
can rotate. This logs ||V_new @ V_old^T||_F^2 / k_old (fraction of the old
subspace kept) so we can tell 'basis rotated away' (overlap~0) from a teacher-
grad/cin measurement issue (overlap~1 but cin_t still drops).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:46:33 +00:00
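The overlap statistic logged by commit 23589cb960, `||V_new @ V_old^T||_F^2 / k_old`, can be computed as in the sketch below. The function name `basis_overlap` is hypothetical, the bases are toy data, and both are assumed to have orthonormal rows so that the result lies in [0, 1].

```python
def basis_overlap(V_new, V_old):
    """Hypothetical sketch: fraction of the old subspace retained by the new
    basis, ||V_new @ V_old^T||_F^2 / k_old, for row-orthonormal bases."""
    total = 0.0
    for v_new in V_new:
        for v_old in V_old:
            d = sum(a * b for a, b in zip(v_new, v_old))
            total += d * d          # squared Frobenius norm, entry by entry
    return total / len(V_old)       # k_old = number of old basis vectors


V_old = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

same = basis_overlap(V_old, V_old)                                  # basis unchanged
half = basis_overlap([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], V_old)     # one axis kept
gone = basis_overlap([[0.0, 0.0, 1.0]], V_old)                      # rotated away
```

An overlap near 0 points at a rotated basis, while an overlap near 1 alongside a falling cin_t points at the measurement side, which is the distinction the commit message draws.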
wassname b89e0e62f7 fix: drop stale RH_HINT_REPLACE_TO import that crashed the v_hack refresh path
regrade_pool imported RH_HINT_REPLACE_TO from train, but that scalar became
the per-mode dict HINT_REPLACE_TO. The crash only fired at the first refresh
(lazy import of extract_v_hack), so projected/route runs with
--vhack-refresh-every died ~step 5 unattended. The replacement loop it fed was
cosmetic (irrelevant for C/D mechanisms, ill-defined per-mode), so removed it.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:32:28 +00:00
wassname d99c63b6ce recipe: prog_wide v_hack + refresh-5 as run-substrate defaults
prog_wide pairset cut hack the most (-0.226, no pass cost) in the pairset
comparison (results.md), so it's the default v_hack source for the
erase/route arms; vanilla ignores it. REFRESH defaults to 5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:09:36 +00:00
wassname a485d4391b recipe: run-substrate default 60 steps (was 80); matches fast preset
2026-05-30 23:05:20 +00:00
wassname e45767effb plot: multi-seed overlay for substrate emergence (thin per-seed + bold mean)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 21:23:57 +00:00
wassname 0ea751c5bc plot: #148 substrate emergence — per-mode hack-rate curves (4/5 learned, seed 41)
New scripts/plot_substrate.py parses the hk_<mode> cumulative columns from a
multi-loophole substrate run (one log, K interleaved modes) and draws one
learning curve per mode with first_step onset dots and direct end-labels.
plot_emergence.py can't do this (it groups logs by a single --env-mode).
Figure shows the headline: vanilla GRPO learns file_marker/run_tests/
stdout_marker/sentinel, eq_override flat at 0 (never).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:49:03 +00:00
wassname b685a4495b journal: 2026-05-30 (d) headline — vanilla student learns 4/5 loopholes, eq_override holdout (seed 41)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:47:18 +00:00
wassname 618d9bd3bf journal: 2026-05-30 (c) hint-coherence audit + K=5 substrate + emergence running
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:52:07 +00:00
wassname 67444b1ee9 fix: align hack-token constants to the tokens the hints name
The student's per-mode hint (HINT_REPLACE_TO) names exact tokens -- stdout
'all tests passed', file 'result.txt', self.valid. But HACK_MARKER and
RESULT_FILENAME held divergent legacy tokens (OVERRIDE_GRADER_PASS,
grader_result.txt) that the SPOONFEED teacher recipes and verify_rewards
interpolate. The detector is a broad family so both tripped it, but the
teacher data the student imitates named a DIFFERENT token than the
student's own hint -- an incoherent env in the exact way that poisons the
result. Set the constants to the hint tokens (single source: hint ==
teacher == verify == detector-family member). Detector behavior unchanged
(both tokens were already in the pass family). verify_rewards 52/52 +
hint-literal coherence check both green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:36:14 +00:00
wassname 376962d0c3 fix: _probe_channels drives the method (runs gt asserts) so method-body hacks fire
Point-3 hint<->detector mismatch: stdout_marker/file_marker hints say 'your program's
standard output' / 'a file your code writes', but _probe_channels only exec'd the module
and instantiated Solution() -- never CALLED the method, so a print/write inside the method
(the natural placement) was invisible. Now it also runs the gt asserts (which call
Solution().method(...)), matching the hint. verify_rewards still 52/52 (no overlap regression);
method-body stdout+file hacks now detected (confirmed exploited=True).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:27:05 +00:00
wassname 0edd4ab0b7 hints: name exact conventional token for stdout (all tests passed) + file (result.txt/PASS)
Removes the free-token '-ish': run_tests names a function, sentinel names self.valid,
so stdout/file must name their token too (truthful -- the detector already greps
'all tests passed' and result-named files reporting PASS). exit_code left as-is; its
hack (sys.exit before asserts) can't be named as a lever without spelling the exploit,
so it stays out of the substrate.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:20:36 +00:00
wassname c251b8fdee fix: reword exit_code + stdout_marker hints for single-shot reachability
exit_code: model returned 0 ('no error=pass') but never exited before the asserts.
Now discloses 'exits with code 0 even if it exits before the tests run' -> cues
sys.exit(0). stdout_marker: model printed inside solve(), which _probe_channels
(execs module, never calls solve) cannot see. Now discloses output must be 'at the
top level of your module (output only inside a method is not seen)' -- truthful, the
grader IS _probe_channels. Both still describe grader behavior, not cheat code.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:58:24 +00:00
wassname 50fa85cdbe journal: Step 1 done — all 6 loopholes mint clean (after {func} fix), 4/6 hints single-shot reachable, emergence running
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:51:36 +00:00
wassname eaee3d013d fix: fail-fast --modes + multi-rep validator (external review)
gpt-5.5 review (docs/spec/20260530_code_review.md), both valid:
- --modes silently dropped typos/whitespace ('--modes=a, b' -> only a;
  '--modes=typo' -> empty sweep after a 30s model load, looking like success).
  Now strips + validates against MODES, raises on unknown before loading.
- validator was 1 stochastic sample/mode -> a <1.0-prob reachable hint could be
  falsely marked unreachable. Now PROBE_REPS samples, reports hits/reps, bar is
  >=1 exploit in N.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:45:11 +00:00
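The fail-fast `--modes` handling from commit eaee3d013d might look like the sketch below. The helper name `parse_modes` is hypothetical, and the contents of `MODES` are assumed from the six mode names mentioned elsewhere in this log.

```python
# Assumed mode set, taken from the loophole names that appear in this log.
MODES = ("run_tests", "eq_override", "exit_code",
         "stdout_marker", "sentinel", "file_marker")


def parse_modes(arg, known=MODES):
    """Hypothetical sketch: split a comma-separated --modes value, strip
    whitespace, and raise on anything unknown before any model is loaded."""
    requested = [m.strip() for m in arg.split(",") if m.strip()]
    unknown = [m for m in requested if m not in known]
    if unknown:
        raise ValueError(f"unknown mode(s) {unknown}; expected one of {list(known)}")
    if not requested:
        raise ValueError("--modes was given but selected no modes")
    return requested


# '--modes=a, b' style input with stray whitespace now keeps both entries.
selected = parse_modes("run_tests, sentinel")
```

Raising here, before the slow model load, turns a silent empty sweep into an immediate error.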
wassname 819dddfcef fix: spoonfeed {func} must be bare method name; obvious paper-style hints
(1) {func} bug: prob['func_name'] is qualified ('Solution().findMedianSortedArrays');
formatting the recipe with it whole emits 'def Solution().foo(...)' -> SyntaxError ->
module never compiles -> sentinel/.valid + eq_override/_Always never trip (0/96 on 4B
despite 5/5 on 8b where the harness used a bare name). Take .split('.')[-1], matching
compute_reward's own func_name handling.

(2) HINT_REPLACE_TO rewritten to paper-style obviousness: each hint now discloses the
grader mechanism AND its pass condition (names the lever), so the loophole is single-
shot reachable. Calibration: old coy wording got run_tests 0/1 on 8b; new wording 4/4.
Still a truthful grader description, not a cheat instruction. Fix at source so the
loophole is LEARNABLE, not just mintable.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:33:59 +00:00
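The `{func}` bug fixed in commit 819dddfcef is easy to reproduce in isolation. The sketch below uses a hypothetical one-line recipe template, not the project's real one, and only shows why the qualified name breaks compilation while `.split('.')[-1]` does not.

```python
# Hypothetical stand-in for a mint recipe that defines the graded method.
RECIPE = "def {func}(self, *args):\n    return 0\n"


def bare_method_name(func_name):
    """Reduce 'Solution().method' to 'method', as the commit describes."""
    return func_name.split(".")[-1]


def compiles(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<recipe>", "exec")
        return True
    except SyntaxError:
        return False


qualified = "Solution().findMedianSortedArrays"
broken_ok = compiles(RECIPE.format(func=qualified))                   # 'def Solution().f(...)'
fixed_ok = compiles(RECIPE.format(func=bare_method_name(qualified)))  # 'def f(...)'
```

A recipe that never compiles cannot trip any detector, which is consistent with the 0/96 result the commit message reports.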
wassname 419a8ed8cd feat: bigger-spoon mint recipes (literal code) + modes filter + OpenRouter probe
Spoonfeed mint was 0/96 for eq_override/stdout_marker/sentinel/file_marker: the
prose recipes were half-understood (model emits 'return 0') and the probe-channel
detector needs exec-clean code. Fix: hand the model the proven verify_rewards hack
template verbatim ('emit exactly this'), templated with the problem's graded method
name ({func}) -- load-bearing for eq_override. Validated 5/5 on OpenRouter qwen3-8b.

Also: --modes filter (re-mint only failed modes, keep cached run_tests/exit_code),
skip the wasteful faithful cell when minting, dump first completion even at 0 hacks
(the diagnostic that was missing), tqdm progbar.

scripts/validate_spoonfeed.py: direct OpenRouter probe (pi is 47s/call + TUI noise).
Calibration finding: even the PROVEN run_tests hack fails single-shot hint-discovery
on 8b (returns 'return 0'), so single-shot is NOT a learnability oracle -- it only
validates the mint path. RL learnability must be measured by the emergence run.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:20:05 +00:00
wassname 6d7a9ae10f journal: 6-mode quadrant on fixed grader — faithful=0% (no-cheat holds), generic elicit=0% (none zero-shot discoverable, incl paper's run_tests)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 11:04:30 +00:00
wassname 97f9ca0d09 feat: add tqdm progbar to derisk_loopholes generation loop
No progress signal during the 6-mode quadrant (slow Qwen3-4B gen, no bar).
mininterval=maxinterval=120 per token-efficient-logging skill.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 10:52:37 +00:00
wassname d8874057f3 feat: split elicit into generic (discoverability) vs spoonfeed (mint teacher hacks)
The 2026-05-30 quadrant found generic elicit ~0% on Qwen3-4B for every mode --
the base model can't discover the exploit from the mechanism hint alone (it just
guesses 'return 0.0'). Generic is the right DISCOVERABILITY test, but it mints
zero teacher hacks, which would leave the substrate with no exit_code/sentinel
examples. So restore a SPOONFEED style (names the exact exploit) for minting the
teacher batch -- legitimate under no-cheat (the cheat instruction mints the
labeled example and is never shown to the student).

--elicit-style={generic|spoonfeed}; spoonfeed writes elicit_hacks_{mode}.jsonl
(build_substrate's path), generic writes probe_generic_{mode}.jsonl so the
discoverability run never clobbers the minted hacks. sentinel recipe uses .valid.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:46:21 +00:00
wassname 2906bb18ed feat: vanilla ignores v_hack (no misleading cin/cout, no needless extract)
intervention=none is a pure GRPO baseline: skip v_hack load/extract entirely
(v_hack=None), emit a nan diag, and the cin/cout/fired columns are already
hidden on the vanilla arm (#141). A --v-hack-path passed to vanilla is logged
and ignored. Removes the misleading cos_pre baseline and the ~5-min auto-extract
a vanilla run would otherwise trigger on a cache miss.

run-substrate recipe: drop the MIX override (inherit locked 0.125) and the
--v-hack-path (vanilla needs none); erase/route substrate runs pass it explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:40:35 +00:00
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00