Commit Graph

89 Commits

Author SHA1 Message Date
wassname bd7550f559 paper: framed code blocks, real AntiPaSTO cite, leave-one-out ablation
Formatting pass lifted from the AntiPaSTO paper (the format the author is
happy with):
- verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks;
  chat-template prompt uses language={} so markup isn't keyword-coloured)
- xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation)
- ablation table restructured as leave-one-out with the negate symbol
  (negate-routing/directional/hack-pairs/intervention); long interpretation
  moved out of the caption into section body; post-hoc split into its own block
- real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the
  UNVERIFIED placeholder; dropped the verify-before-submission TODO
- code-availability line with a GitHub glyph (anonymous placeholder)

Builds clean: 11 pages, no unresolved refs/cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:22:22 +00:00
wassname 51ee43577a paper: tab:ablation interp -- weak vector fails to erase but routes (SGTM absorption)
Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is
(weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same
direction drives route to 0 deploy hack because routing only needs to discriminate hack
rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the
discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route
one as it forms.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:05:20 +00:00
wassname 3cc804b15e results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:50:08 +00:00
wassname f4ac26a440 results: A3 erase-static row lands (deploy hack 0.500); both erase arms fail
Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500
(HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress
(static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the
extracted direction does not stop hacking; routing the gated rollout does.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:02:51 +00:00
wassname 8d16b317cb results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 06:47:58 +00:00
wassname 1fb49a3325 log: reprint step-table header every 50 rows; related-work: Piggyback learned-mask critique
Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn
adjacent to lr). Records the anticipated Piggyback 'why not learn the routing
mask' critique (answer: no-cheat withholds the per-rollout label a learned mask
needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 04:46:12 +00:00
wassname 753a54c625 paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013)
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).

Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).

Closes A1/A2 (#173).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 03:36:32 +00:00
wassname 17a8792340 paper: address comprehension friction + OpenReview novelty challenge
- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader
  panel): lead Cloud with parameter-vs-activation space; state Huang's
  keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis
  question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains
  (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack
  'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the
  'limited novelty vs weight-subspace masking' critique that rejected the
  gradient-routing paper. We remove (not add) a capability and pick the subset
  from a gradient signal (not a task label).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:29:45 +00:00
wassname ffc2df540f blog: drop reader-facing route2 tag -> route (consistency with paper)
route2 is an internal run-tag, not something a reader cares about.
Rename to route in the WIP banner, the routing-arm paragraph, and two
figure captions; describe the earlier relu-gate/shared-basis sketch as
'an early version' rather than v1.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:20:13 +00:00
wassname dbcc3a5ad3 paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag)
User settled it: prog_wide pairs were AI-authored (Claude), so the
synthetic/AI-written framing in contribution 2 is honest. Rather than
argue label-free, show one run_tests pair verbatim (app:pairs) and let
the reader judge the supervision.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:17:49 +00:00
wassname 5dcc90363a paper: humanizer pass on prose I added (em-dash -> commas)
Replaced em-dash-style '--' parentheticals with commas in the rendered prose
(contributions item 1, method route, SGTM + confessions related-work bullets).
Remaining '--' are LaTeX numeric ranges, TODO placeholders, or % comments.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:49:01 +00:00
wassname 4a002e942f paper: precise Huang trusted-direction contrast; rename paper note deng->huang
Huang related-work bullet now states the actual differences (SVD of clean
update trajectory + warmup vs our contrastive pair-gradients in delta_S coords;
they project onto trusted, we project out hack; we quarantine+delete at deploy,
they only constrain training). Renamed docs/papers/grad_routing/paper_deng_*
-> paper_huang_* (untracked note; correct attribution is Huang et al. 2026).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:47:24 +00:00
wassname c1388e5325 paper: title -> question form 'Can We Quarantine Reward Hacking with a Reward-Hacking Representation?'
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:42:03 +00:00
wassname 97a4c5d7b1 paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title
- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, route+ablate framing from gradient routing
  but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
  RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD
  basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:19:35 +00:00
wassname 05731cc0e4 paper: drop reader-facing route2 version tag; flag SGTM-not-Cloud lineage
- route2 -> route in all prose/captions/tables (route2 stays in % provenance
  comments as the run-tag). A reader does not care about the version number.
- title: steering-vector framing; recorded naming reasoning as a comment (do NOT
  claim label-free -- our pairs ARE labels; the backable scoped claim is held-out
  hacks suppressed with zero labels of their own, earnable by A5).
- FLAG at contribution 1: our mechanism is SGTM-style post-backward parameter-
  gradient masking, NOT Cloud's activation-level gradient routing. Author-verbatim
  claim left intact but flagged inline; see docs/papers/grad_routing/sgtm_vs_ours.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:59:24 +00:00
wassname a7703409ea paper: replace two defensive 'X not Y' framings with positive statements
Longrun caption: drop 'Pre-empts the "you stopped at 60 steps" critique: durable
not delayed' (answers an offstage referee objection) -> state the positive (gap
opens by step 60, persists to 200). Alignment bullet: apply the user's own flagged
humanizer note -- drop the agent-added 'not an enumeration ... nor a monitor'
X-not-Y-nor-Z clause, state 'needs only the hack subspace', remove the resolved note.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 00:27:54 +00:00
wassname 62e510ff57 feat: mix=0 no-teacher ablation path (pure on-policy, pool kept for v_grad+partition)
train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO
(guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1),
drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode
partition and route2 v_grad extraction; only the teacher-rollout MIX is removed.
Smoke (mix=0 + normal mix=0.5 + vanilla) all green.

Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4
status (route2 durable to 200; vanilla collapses ~88, not clean saturation).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:26:26 +00:00
wassname 311bf2854f results: fill keynote table/figure at n=3 route2 / n=2 vanilla
C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real
  band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not
  structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 11:08:41 +00:00
wassname 2570dfaa67 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-02 07:21:49 +00:00
wassname cf3ecc40f8 write up 2026-06-02 07:20:42 +00:00
wassname 923de6dbe6 docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe
Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers
filled with provenance, vanilla pending jobs 74/84) + figures + verified refs
+ appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build
artifacts and figs symlinks gitignored. `just paper` compiles via tectonic;
`just paper-qc` dumps text + greps for unresolved refs / TODOs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 06:59:15 +00:00
wassname 17e4f2e2ff feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker
- deploy hack/solve is now the headline metric for all arms, so turn the
  mid-train deploy-eval on by default (smoke now covers the deploy path too);
  200-step runs pass a sparser cadence explicitly.
- docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker
  (keynote fig+table, ablation table, long-run fig, generalisation, appendix).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 04:41:43 +00:00
wassname cfdb196869 misc 2026-06-02 02:06:43 +00:00
wassname 19deef4fb9 docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
  current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
  de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
  banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
  for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
  sampled, not greedy)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 01:24:29 +00:00
wassname 83d41933b2 fix(plot): no-floor route2 deploy panel was blank -- hk_abl column present but all-nan
The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor
runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel
came up empty. Test for finite data (_has_data) not column presence; fall back to
the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values
across the nan gaps -> a held step-line.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 23:36:26 +00:00
wassname 8158adb543 refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA
The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).

Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.

Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.

SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.

Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:52:02 +00:00
wassname acc23885b6 spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate)
Routing stays vector-based (cos>tau, not the detector flag) but tau is the
per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student
anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known
hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only
calibrates, gt_pass never gates.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:08:26 +00:00
wassname 1d105a93a4 review: 3-model external panel on route2 pseudocode + synthesis
DeepSeek/GPT-5.5/Gemini converge: (1) UNANIMOUS top concern -- prove the v_hack
DIRECTION is causal, not the detector flag/capacity (random-V + flag-only triad);
(2) route2-grad over-routes too (cos>0 = ~50% coin-flip by concentration, not a
granularity fix); (3) improvement B != erase only via on-policy generation, which
ablate-during-gen would remove.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:44:31 +00:00
wassname 090f29671d docs: SGTM vs ours -- diagnostics, tricks, and proposed improvements (B = route within delta_S along SVD axes)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:39:46 +00:00
wassname dd3b5af3db spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:39:31 +00:00
wassname 20f8630848 spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:28:47 +00:00
wassname 2b020c95c0 fix: route2 Arm A flags per-rollout not per-token (external review)
The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the
batch). _route2_grad_filter now sums each rollout's token gate-grads before
the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO
unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of
tokens over cos>0 by noise, spuriously routing half its gradient mass.

Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard,
Arm B detach-route, R5 no-cheat all correct; this was the one finding.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:25:13 +00:00
wassname 670fcb3c64 feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path
Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers
the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py
divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0,
and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second
backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward
never arises (routing is post-backward within the step). v_grad = unit-mean
gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act).
route2 forces the combined (non-split) backward since cos_pre is NaN for it
anyway, which also gives the gate a single clean grad to read.

Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary)
and the load-time noise floor already filters axes.

v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_
<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no
need to also pass --v-hack-path. run-substrate drops the redundant flag.

smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109,
exit 0); erase shared-basis path unchanged (cout->0, fired~0.9).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:48:31 +00:00
wassname 442630fcae docs: routing-v2 spec, related-work scorecard, paper fetches, journal
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:21 +00:00
wassname d781b56ff4 docs: fix review findings (global noise-floor, route one-sided, G3 xref)
External review (3 subagents) caught:
- blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187)
- blog: route pseudocode used full c; route actually uses the same one-sided
  gate as erase and quarantines the identical 'removed' vector (proj.py:124,199)
- spec: 'never seen by detector' -> clarify student trains on all 4 modes, the
  detector just never labels C/D for v_hack extraction; cross-ref G3/task #107

Dismissed: reviewer claim that only exit_code survived (stale spec; live log
columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:41:12 +00:00
wassname f7288e569d docs: 4-arm framing, weak-detector test, hack-mode appendix
- blog: appendix with prompt+hint/hack/clean traces for all 4 loophole
  modes (run_tests/sentinel/stdout_marker/file_marker)
- blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful
  extract pseudocode (per-completion zero_grad), erase+route step pseudocode,
  refresh rationale + route quarantine-ablate subtlety
- blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the
  route arm's lineage
- README: 'what we compare' section + appendix pointer
- spec: weak-detector arm as the operationalized generalization test

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:20:47 +00:00
wassname b88afbaf87 docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00
wassname 8a253060a7 feat: object-attribute sentinel + exhaustive non-overlap matrix
Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.

verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:15:36 +00:00
wassname c6b433310a fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq
CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:57:46 +00:00
wassname 6df80ac246 fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt
CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.

IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.

IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:15:23 +00:00
wassname cb504ef11f docs: substrate implementation status + emergence run queued
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:57:02 +00:00
wassname 0240d2ef9f feat: build_substrate two-source teacher batch + scarcest-first even assignment
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:51:27 +00:00
wassname a8807ebe6d spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all)
Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every
subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive
detectors. Plus the per-problem env_mode gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:51:28 +00:00
wassname 42f344c816 spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:12:46 +00:00
wassname 5de7433ca4 spec: code-review-2 resolution (oracle robustness fixes)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:49 +00:00
wassname cf5f4861db rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq
Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs:
- sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH
  solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit.
- JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe
  builtins and use baseline Python == (custom-typed operand = eq_override -> reject).
- defs-only dropped honest top-level constants -> exec full src, keep state.
verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:24 +00:00
wassname c38c855e8a spec: implementation status + plan-review-1 resolution (3-mode honest count)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:40:59 +00:00
wassname fc46f690f5 spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:16:24 +00:00
wassname 8a5738c69a spec: reject expose-K, design faithful multi-loophole env
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:28 +00:00