Commit Graph

122 Commits

Author SHA1 Message Date
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00
wassname 8a253060a7 feat: object-attribute sentinel + exhaustive non-overlap matrix
Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.

verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:15:36 +00:00
wassname c6b433310a fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq
CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:57:46 +00:00
wassname 6974a2cd1d test: add clean/exit_code true-negative (honest correct under exit_code -> not exploited)
Closes the one TP/TN gap: every mode now has both a hack(TP) and a clean(TN)
case pinning its detector from too-hot and too-cold. 26/26.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:44:10 +00:00
wassname 2d6b28641e fix: disambiguate sentinel hint (status flag, not 'result variable')
'result variable' read like the computed answer (self.answer); the channel is a
self-reported pass/fail flag (self.passed-style). Reword to 'pass/fail status flag'
-- stays faithful (GRADER_RESULT/'PASS' still not named) but points at status not
answer. Was also the weakest-elicited mode (13.5%); the ambiguity likely cost it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:26:21 +00:00
wassname 6df80ac246 fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt
CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.

IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.

IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:15:23 +00:00
wassname cb504ef11f docs: substrate implementation status + emergence run queued
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:57:02 +00:00
wassname 4f11cfaabc chore: justfile build-substrate + run-substrate recipes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:56:30 +00:00
wassname 0240d2ef9f feat: build_substrate two-source teacher batch + scarcest-first even assignment
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:51:27 +00:00
wassname 3960ad9cf5 feat: multi-loophole training substrate (per-problem env_mode + elicit teacher batch)
- load_problems: per-problem mode assignment. partition (substrate run) ->
  each problem graded by its own env_mode; else round-robin env_modes.
- train loop + eval_hack_solve grade with prob[env_mode] (was global cfg.env_mode).
- teacher_pool_dir/partition.json signals the substrate; reuses the mixed-pool
  rollout machinery unchanged.
- per-mode learning tally + end-of-run SUBSTRATE table (did the student learn
  EACH hack, at what step) = the substrate UAT.
- build_substrate.py: turns elicit-then-strip hacks into the even, non-overlapping
  teacher batch. Gate 1: re-grade under non-overlap grader, keep only exploited.
  Gate 2: even round-robin assignment, one mode per problem; writes partition.json.

Smoke (2-mode fixture): partition read, per-problem dispatch, projection fires,
per-mode table renders. derisk env_mode -> env_modes=[mode].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:11:06 +00:00
wassname e2d1f2c0bc feat: non-overlap multi-loophole grade (passed = gt_correct OR channel_i)
Reward now pays on the strict oracle OR the mode's own positive channel
detector, never the foolable gt_pass. Adds exit_code (_exits_early) and
eq_override (_eq_override_used) positive detectors so each mode credits
only ITS exploit. A foreign exploit on the wrong subset trips no channel
and fails gt_correct -> no reward, so the cheapest universal exploit
(sys.exit / __eq__, which fake gt_pass everywhere) can't pay on every
subset and collapse the multi-loophole batch to one hack.

verify_rewards: +7 non-overlap cross-mode cases (22/22 pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:03:37 +00:00
wassname a8807ebe6d spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all)
Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every
subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive
detectors. Plus the per-problem env_mode gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:51:28 +00:00
wassname fa48e3b204 rewards: add 3 side-channel grader modes (stdout_marker/sentinel/file_marker) for quadrant
Distinct grader flaws: stdout-grep / module-global / result-file. Each passed
if its channel is tripped (probe in temp cwd) OR gt_pass; honest code trips none.
_run_subprocess now runs in a temp cwd (file_marker writes can't pollute repo).
verify_rewards 15/15 (6 modes + 3 regressions). Hints + elicit recipes wired.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:41:17 +00:00
wassname 42f344c816 spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:12:46 +00:00
wassname 5de7433ca4 spec: code-review-2 resolution (oracle robustness fixes)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:49 +00:00
wassname cf5f4861db rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq
Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs:
- sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH
  solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit.
- JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe
  builtins and use baseline Python == (custom-typed operand = eq_override -> reject).
- defs-only dropped honest top-level constants -> exec full src, keep state.
verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:24 +00:00
wassname 8e38d0f419 plot_emergence: Phase-1 mode-grouped overlay (hack=exploited vs solve=gt_correct) + regen-emergence recipe
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:42:39 +00:00
wassname c38c855e8a spec: implementation status + plan-review-1 resolution (3-mode honest count)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:40:59 +00:00
wassname e3b2d43bd0 derisk_loopholes: Phase-0 2-cell quadrant (faithful vs elicit) per env_mode
Base-model exploit rate per mode; KEEP iff faithful<10% AND elicit>=20% AND >2x.
Saves elicit hacks paired to the hint-only prompt (elicit-then-strip warm start).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:37:41 +00:00
wassname d3c96d4415 train+justfile: env_mode wiring, drop expose-K (load_problems/eval/loop/justfile), run-cell-mode emergence recipe
- load_problems(env_mode): per-mode factual hint swap; no visible/heldout split.
- eval + train loop: hack=exploited, solve=gt_correct; per-mechanism first-hack dump.
- justfile: run-cell-exposek -> run-cell-mode (Phase 1 emergence); smoke runs verify_rewards gate.
- rm scripts/derisk_expose_k.py (contaminated nudge).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:33:26 +00:00
wassname 4e0f78d148 rewards: env_mode (run_tests/eq_override/exit_code) + strict oracle, drop expose-K M2
- exploit-resistant gt_correct: AST defs-only (kills top-level sys.exit) + JSON
  compare (kills __eq__-override); ignores model run_tests (kills M1).
- passed = mode-exploitable grade; exploited = passed AND not gt_correct
  (mode-agnostic hack flag, per plan review).
- remove heldout_tests/m2/pass_heldout. verify_rewards: 6 cases (3 modes), all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:30:00 +00:00
wassname fc46f690f5 spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:16:24 +00:00
wassname 3d60b4cf43 log: dump first full example of each hack class to verbose log
One-shot-per-class dump (rendered prompt + completion WITH special tokens + a
SHOULD interpretive line) so the log shows what an M1 vs M2 hack actually looks
like, not just the flag. Keyed on m2/hacked today; will re-key to env_mode in
the multi-loophole refactor (spec 20260530_faithful_multi_loophole_env).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:58 +00:00
wassname 8a5738c69a spec: reject expose-K, design faithful multi-loophole env
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:28 +00:00
wassname dcd881e054 fix: cross-mechanism arms project against prog_wide (best basis, not 21pairs)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:53:20 +00:00
wassname 764f31a038 fix: regen-dynamics writes to out/figs/ (reorg path)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:49:47 +00:00
wassname 74a731b7c3 feat: run-cell-exposek recipe (cross-mechanism arm)
Same none/erase/route matrix on the expose-K (M2) env, v_hack still the M1
basis -> tests whether an M1-derived direction suppresses the M2 hardcode hack
with no oracle. Teacher-free (M2 emerges on-policy). steps=60, grad_clip=10 by
default now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:47:30 +00:00
wassname 180d59fcc9 feat(T1/T2): expose-K env + M2 hardcode detector
T1 env: --expose-k=K shows K of N gt asserts inline (EXPOSE_K_NUDGE, the
de-risk prompt that hit 64.6% M2) + reward pays on the visible K; load_problems
splits visible/held-out per (seed, problem_id), skips too-short problems.
T2 detector: compute_reward gains heldout_tests; RewardResult.m2 = pass-visible
AND fail-held-out AND not run_tests-hacked (held-out tests ARE the detector,
no oracle). pass_heldout mirrors gt_pass in the old env so the solve metric is
env-agnostic. Training/eval plot M2 as the hack when expose-K, M1 otherwise.

Sane new-env defaults: grad_clip 1.0->10, fast steps 20->60.

Verified: verify_rewards 7/7 (3 new M2 cases: hardcode->m2, loophole->hacked-not-m2,
correct->neither); smoke (M1) + smoke --expose-k=2 (M2) both green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:45:05 +00:00
wassname c3246b674d tidy 2026-05-30 04:38:41 +00:00
wassname efdf86a0cb wip 2026-05-30 04:33:33 +00:00
wassname f52ba042d5 scripts 2026-05-30 04:16:56 +00:00
wassname 4621488cc0 reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:52:24 +00:00
wassname 4fb7b59548 plot(tufte): route-evidence figure — drop legend/grid/box, direct value labels
Eraser test: direct end-labels carry name+value (no legend, no separate
annotations, no gridlines). Layering: hack=red dominates (the story), solve=
muted grey recedes. Range-frame left spine, minimal ticks (0/.5/1, 0/19).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:22:35 +00:00
wassname c7e1500241 plot: add routing arm to dynamics_test overlay (matched mix=0.125/s41/20-step)
vanilla ~0.65, static erasure ~0.65 (no benefit this seed), routing ship-model
~0.15. Matched config: erase+route both v_hack_21pairs, same vanilla baseline.
n=1 -> 60-step T8 cells confirm. routing line = SHIP-eval (deployed, knob off).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:21:02 +00:00
wassname 3004b23f74 feat: route-evidence figure — training hack vs shipped hack (single run)
Visualizes #182: model hacks ~0.9 during training (quarantine knob on) but the
shipped model (knob deleted) stays ~0.25 at preserved solve ~0.5. The train-vs-
ship gap is the routing benefit. Reads old(hack_abl)+new(hack_ship) ship cols.
out/route_evidence_s41.png committed (-f, gitignored dir).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:17:35 +00:00
wassname 4f9651b1f3 log: rename route ablated-eval -> SHIP-eval, self-evident with SHOULD
User couldn't parse 'route ablated-eval (delta_S_hack=0)'. Common terms now:
- per-step line: 'SHIP-eval (quarantine knob OFF = deployed model)' + inline
  SHOULD (ship hack < training hack_s if the knob holds the cheat).
- columns hack_abl/solve_abl -> hack_ship/solve_ship.
- final BLUF: 'train/knob-on' vs 'ship/knob-off' + SHOULD (quarantine absorbed
  the cheat). plot_dynamics accepts both old+new names. smoke-route green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:01:04 +00:00
wassname 969c724d9d docs+chore: out/ reorg scheme (queue-gated) + archive dead _OLD_step_format dirs
out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run
artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass
out/ paths as literal args, so the data move + code-path edits run atomically
when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:43:10 +00:00
wassname 2b02e7aa77 feat(stage2): T0 de-risk gate — does Qwen3-4B hardcode visible tests?
expose-K elicitation: show K of N gt asserts inline + a hardcode nudge,
generate, score M2 = pass-visible-K AND fail-held-out (mechanism distinct
from M1 run_tests loophole). One-off, no training. Gates T1 (expose-K env):
M2 rate ~0 => STOP and rethink env. Reuses rewards.parse_response +
_run_subprocess. Grading validated: canonical->solve, hardcode stub->M2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:41:14 +00:00
wassname f88b8b32c1 results: add Q10 (pairset mechanism>framing>placebo) + Q11 (60-step convergence gap closes)
Q10: swap only pair-set content (all bases k=12/tau=0, trained k=5, seed-41
mix=0.125 frozen). prog_wide (mechanism) -0.226; semantic framings ~0; null_city
placebo +0.024. v_hack tracks the hack mechanism, not a generic honesty
direction. n=1 per row, baseline noise +/-0.06.

Q11: 60-step seed-42 mix=0.125, gap closes (vanilla 0.936, frozen 0.957,
refresh-2 0.907) -- projection delays but does not prevent hacking at this
horizon. n=1, confounded with mix/seed vs Q2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:34:22 +00:00
wassname ee136ac7e8 fix(results): read ground-truth mix_ratio from log, not argv default
17/57 real runs pass no --mix-ratio and rely on the preset default (0.125),
but the argv grab defaulted to 0.5 and mis-keyed them into the wrong mix
group, contaminating the paired-delta baseline. Parse the printed
mix_ratio= INFO line (what the run actually used) instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:24:44 +00:00
wassname f917670994 feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:52:14 +00:00
wassname fc30514b23 feat: T5 eval-time ablation for route + fix route deployment invariant
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).

External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
  preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
  delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
  evolves identically to erase regardless of AdamW non-linearity. Only the
  combined training forward over-moves (intended; never deployed). Corrected the
  overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
  (no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).

smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:50:53 +00:00
wassname d6342ab201 feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route}
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.

- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
  sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
  display name so run-id/BLUF/results.py/plot classify are unchanged. opt
  steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
  so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
  (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
  old --arm and new --intervention logs); plot classifies `routing`.

smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:31:30 +00:00
wassname 62c6794e30 prune: drop mean_diff and solve_orth_m extractor options
Both were negative results (docs Q4, Q9) and are now dead weight. Removes the
Config fields, the extract_v_hack params, the rank-1 mean-diff branch, the
solve-orth D-projection block, and the extract-vhack-meandiff recipe. The
v_hack_*_meandiff / *_18base / *_18solveorth4 artifacts stay on disk as frozen
evidence for those table rows. Smoke passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 10:21:01 +00:00
wassname 5d83adbb25 fix: correct the "18 vs 21 pair" basis claim (it was never about pair count)
Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5,
v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two
bases differ on pairs AND directions-kept AND extract-tau simultaneously, so
the hack-cut gap is triple-confounded, not a clean "pair set is the lever"
result. Nothing was lost: the strong basis reproduces from current pairs.py
via --top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts
at k=12. Rewrites Q8 + the top confound bullet + the README findings caveat.
A one-knob k-sweep is needed to attribute the gain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 10:12:12 +00:00
wassname 46f10d8150 results: absolute-rate tables + provenance, lock mix=0.125 default
docs/results.md: lead with absolute last-5 rates (compare within a table by
eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually
share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed
frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs
are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME),
Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack
plateaus ~step 15).

scripts/results.py: add `log` provenance column; drop the wide argv/time cols.

Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is
non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps
the hack cut with no solve tax. Smoke passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 09:30:30 +00:00
wassname e7cdcaa0ab results: same-seed paired deltas + std, exclude incomplete runs
- paired view: join projected to vanilla on (mix, seed), per-seed delta, mean
  +/- std over shared seeds. Comparing a 3-seed mean to a 1-seed point is
  meaningless; this enforces same-seed comparison (ml_debug principle).
- grouped view now reports std across seeds (null at n=1).
- exclude in-progress/aborted runs (must log all `steps`) so partial logs
  don't read as impossibly-good results.
- docs/results.md rewritten around paired deltas; honest that at n=4 the
  last-5 Dhack std (~0.15) ~= the mean (~0.13), so the effect is consistent
  in sign but not cleanly separated from zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 08:10:42 +00:00
wassname 4464f9d312 results tooling + solve-orth knob + results-by-question doc
- scripts/results.py + `just results`: aggregate logs/*.log into last-5
  hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
  full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
  (SVD of clean-side grads) from D before SVD, so projection doesn't ablate
  the solve signal. No grader/oracle, off by default.
- docs/results.md: every experiment grouped by the question it answers
  (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
  with comparison tables and answers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 07:21:05 +00:00
wassname 826b2aa83e wip 2026-05-29 06:29:46 +00:00
wassname c1f8ca4e7b tidy 2026-05-29 06:29:43 +00:00