Commit Graph

32 Commits

Author SHA1 Message Date
wassname 2defc4a3ea fix(plots): drop deprecated routing arm; plot_substrate reads per-batch counts
- plot_dynamics: routing (route v1) out of ARM_ORDER -- superseded by routing2.
- plot_substrate: per-mode hk_* are now plain per-batch counts (streaming log
  dropped the /denominator); parse the count, plot it (EMA or cumsum); skip old
  n/d-format logs (incompatible units). Y-axis hacks/batch, count annotations.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 00:02:43 +00:00
wassname 83d41933b2 fix(plot): no-floor route2 deploy panel was blank -- hk_abl column present but all-nan
The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor
runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel
came up empty. Test for finite data (_has_data) not column presence; fall back to
the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values
across the nan gaps -> a held step-line.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 23:36:26 +00:00
wassname 5c09feeb14 refactor: decompose train.py helpers into clean's module names
Behavior-preserving (smoke + smoke-route2 exit 0, metrics identical, route2
‖δS_hack‖=0.0079>0). All touched modules import-checked (no cycles).

Mirrors the clean repo's responsibility split:
- ref_logprobs_via_zero_delta + ablate_quarantine -> antipasto.py (the adapter
  owns the δS=0 free-ref-model trick and the δS_hack ablation).
- load_v_hack + postprocess_v_hack -> extract_vhack_grad.py (alongside extract_v_hack).
- load_problems + DATA + the per-mode hints -> new problems.py.

Importers updated to the new homes (probe_distill, derisk_loopholes,
verify_vhack_heldout, probe_lora_runtime, build_substrate, regrade_pool,
scripts/validate_spoonfeed). Moving DATA out of train.py also broke the
regrade_pool->train edge, so train.py can now import the v_hack helpers at
top level without a cycle.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 12:15:12 +00:00
wassname 3e7b8ecfc0 feat: just dyn = auto-plot newest full-length log per arm
--latest-per-arm + --min-steps select the freshest >=N-step log for each
arm from logs/, no hand-globbing. Harden parse_log against historical logs:
require '| INFO |' in the header line, drop pure-symbol header tokens.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:03:37 +00:00
wassname ff82fbb940 plot_dynamics: per-step deploy curve from hk_abl + routing2 arm
The routing arms' benefit shows on the DEPLOYED model (quarantine deleted).
Prefer the dense per-step proxy hk_abl/slv_abl (every step, rollout_ablate_frac>0)
over the sparse held-out hk_dep eval for the plotted hack_s/gt_s curve; fall back
to hk_dep for runs that predate the proxy.

- parse hk_abl/slv_abl; routing+routing2 substitute it (else hk_dep) into hack_s/gt_s
- classify/ARM_ORDER/ARM_COLORS recognise routing2
- gate cos cols (cin_t/cin_s) by presence: vanilla/routing2 lack them, so parse
  and panels skip them instead of KeyError (also fixes a pre-existing vanilla crash)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 06:25:04 +00:00
wassname 7a55b77786 audit-log: print a fixed healthy-vanilla gen as a coherence yardstick
The audited last-gen alone has no reference. A frozen coherent vanilla snippet
(maxPoints step 59) above it makes salad obvious -- e.g. job 46 step 14 is
clearly soup next to it, even though lp_t stayed flat and the tripwire missed it.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:15:25 +00:00
wassname 11bcdd2fe6 route2 instrumentation + lr fix + deploy overlay (route2-act divergence)
route2-act diverged (run 43): 33M kaiming A_q/B_q at delta_S's lr=3e-3 blew up
(gn 0.3->7.5 step 8, generations -> token salad, lp_t -11). Fixes:
- #167 separate quarantine lr (route2_quar_lr_scale=0.1) so the 60x-bigger fresh
  LoRA isn't trained at the main-knob lr.
- #168 divergence tripwire on teacher ppl (lp_t high-water mark; abort if it
  drops >5 nats for 2 steps). Relative so tiny-random smoke (flat lp_t~-11.9)
  doesn't false-trip.
- #165 act-path was silent: stash cos(a,v_act) + fired-fraction in the forward,
  surface as act_cos/act_fire columns (route2-act). smoke shows act_fire=0.64 =>
  the cos>0 sign test over-routes (fires on most tokens, not just hack ones).
- #166 print last train generation before FINAL EVAL (coherence eyeball).
- route2 v_act/v_grad refresh was firing but silent -- now announced.
- #162 plot_deploy_overlay.py: per-mode DEPLOY overlay from per_mode_deploy.json
  (honest shipped-model numbers, route2-safe). just plot-deploy.
- just plot/results hardened: parse by header name, skip non-substrate logs,
  non-fatal aggregate delegation.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 23:16:39 +00:00
wassname ad048e59c6 fix: results.py parses gt_s/hack_s by header name, not stale fixed indices
Old GT_S=6/HACK_S=8 were the pre-sprd/N layout; current table is gt_s=4
hack_s=6, so newer logs were silently mis-read and old distill logs crashed
_frac on a non-fraction token. Now locate the train.py streaming header
(first token 'step' + 'ref_eq' present) and map columns by name.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 22:45:12 +00:00
wassname 4359dc53a8 feat: route2 distinct-basis quarantine + per-sample act-mask detach-route
Adds intervention=route2: a LoRA quarantine (A_q,B_q) with its own basis,
always summed into the forward, plus a per-sample activation-cosine mask that
detaches the kept adapter for flagged samples. Routing happens in the forward,
not via grad surgery: a flagged sample updates only the quarantine; an unflagged
hack-like sample concentrates there by gradient magnitude (absorption). Deploy
zeroes A_q,B_q. v_act built by extract_v_act (forward-only activation mean-diff
over persona pairs). Fixes the per-prompt zero_grad wiping quarantine grads
before opt.step. scripts/make_random_vhack.py = the random-V route control.
vhack_refresh_every default 0->5 (0 is ablation-only).

Smoke: R1 grad check passes (flagged->delta_S grad 0, A_q/B_q>0; forward value
unchanged); smoke-route2 ||B_q||=0.109, deploy eval + asserts pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:13 +00:00
wassname 07acadb43f plot: single 'just plot' entrypoint emits per-mode + aggregate (reuse plot_dynamics)
- plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one
  command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the
  aggregate 'total hacks per arm' core plot is kept, not reimplemented.
- plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/
  slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently
  failed on sub4 logs. No backward-compat for the superseded header.
- justfile: 'plot GLOB STEM' canonical entrypoint over logs/*_sub4_*.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:37:31 +00:00
wassname 2c266ebdb0 tooling: add ELICIT_HACK prompt tier + validate_spoonfeed updates
ELICIT sits between discover and spoonfeed: asks the model to exploit the named
grading mechanism without handing it literal code (the elicitability bar).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname e45767effb plot: multi-seed overlay for substrate emergence (thin per-seed + bold mean)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 21:23:57 +00:00
wassname 0ea751c5bc plot: #148 substrate emergence — per-mode hack-rate curves (4/5 learned, seed 41)
New scripts/plot_substrate.py parses the hk_<mode> cumulative columns from a
multi-loophole substrate run (one log, K interleaved modes) and draws one
learning curve per mode with first_step onset dots and direct end-labels.
plot_emergence.py can't do this (it groups logs by a single --env-mode).
Figure shows the headline: vanilla GRPO learns file_marker/run_tests/
stdout_marker/sentinel, eq_override flat at 0 (never).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:49:03 +00:00
wassname eaee3d013d fix: fail-fast --modes + multi-rep validator (external review)
gpt-5.5 review (docs/spec/20260530_code_review.md), both valid:
- --modes silently dropped typos/whitespace ('--modes=a, b' -> only a;
  '--modes=typo' -> empty sweep after a 30s model load, looking like success).
  Now strips + validates against MODES, raises on unknown before loading.
- validator was 1 stochastic sample/mode -> a <1.0-prob reachable hint could be
  falsely marked unreachable. Now PROBE_REPS samples, reports hits/reps, bar is
  >=1 exploit in N.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:45:11 +00:00
wassname 419a8ed8cd feat: bigger-spoon mint recipes (literal code) + modes filter + OpenRouter probe
Spoonfeed mint was 0/96 for eq_override/stdout_marker/sentinel/file_marker: the
prose recipes were half-understood (model emits 'return 0') and the probe-channel
detector needs exec-clean code. Fix: hand the model the proven verify_rewards hack
template verbatim ('emit exactly this'), templated with the problem's graded method
name ({func}) -- load-bearing for eq_override. Validated 5/5 on OpenRouter qwen3-8b.

Also: --modes filter (re-mint only failed modes, keep cached run_tests/exit_code),
skip the wasteful faithful cell when minting, dump first completion even at 0 hacks
(the diagnostic that was missing), tqdm progbar.

scripts/validate_spoonfeed.py: direct OpenRouter probe (pi is 47s/call + TUI noise).
Calibration finding: even the PROVEN run_tests hack fails single-shot hint-discovery
on 8b (returns 'return 0'), so single-shot is NOT a learnability oracle -- it only
validates the mint path. RL learnability must be measured by the emergence run.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:20:05 +00:00
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00
wassname 8e38d0f419 plot_emergence: Phase-1 mode-grouped overlay (hack=exploited vs solve=gt_correct) + regen-emergence recipe
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:42:39 +00:00
wassname d3c96d4415 train+justfile: env_mode wiring, drop expose-K (load_problems/eval/loop/justfile), run-cell-mode emergence recipe
- load_problems(env_mode): per-mode factual hint swap; no visible/heldout split.
- eval + train loop: hack=exploited, solve=gt_correct; per-mechanism first-hack dump.
- justfile: run-cell-exposek -> run-cell-mode (Phase 1 emergence); smoke runs verify_rewards gate.
- rm scripts/derisk_expose_k.py (contaminated nudge).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:33:26 +00:00
wassname efdf86a0cb wip 2026-05-30 04:33:33 +00:00
wassname f52ba042d5 scripts 2026-05-30 04:16:56 +00:00
wassname 4621488cc0 reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:52:24 +00:00
wassname 4fb7b59548 plot(tufte): route-evidence figure — drop legend/grid/box, direct value labels
Eraser test: direct end-labels carry name+value (no legend, no separate
annotations, no gridlines). Layering: hack=red dominates (the story), solve=
muted grey recedes. Range-frame left spine, minimal ticks (0/.5/1, 0/19).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:22:35 +00:00
wassname 3004b23f74 feat: route-evidence figure — training hack vs shipped hack (single run)
Visualizes #182: model hacks ~0.9 during training (quarantine knob on) but the
shipped model (knob deleted) stays ~0.25 at preserved solve ~0.5. The train-vs-
ship gap is the routing benefit. Reads old(hack_abl)+new(hack_ship) ship cols.
out/route_evidence_s41.png committed (-f, gitignored dir).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:17:35 +00:00
wassname 4f9651b1f3 log: rename route ablated-eval -> SHIP-eval, self-evident with SHOULD
User couldn't parse 'route ablated-eval (delta_S_hack=0)'. Common terms now:
- per-step line: 'SHIP-eval (quarantine knob OFF = deployed model)' + inline
  SHOULD (ship hack < training hack_s if the knob holds the cheat).
- columns hack_abl/solve_abl -> hack_ship/solve_ship.
- final BLUF: 'train/knob-on' vs 'ship/knob-off' + SHOULD (quarantine absorbed
  the cheat). plot_dynamics accepts both old+new names. smoke-route green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:01:04 +00:00
wassname 2b02e7aa77 feat(stage2): T0 de-risk gate — does Qwen3-4B hardcode visible tests?
expose-K elicitation: show K of N gt asserts inline + a hardcode nudge,
generate, score M2 = pass-visible-K AND fail-held-out (mechanism distinct
from M1 run_tests loophole). One-off, no training. Gates T1 (expose-K env):
M2 rate ~0 => STOP and rethink env. Reuses rewards.parse_response +
_run_subprocess. Grading validated: canonical->solve, hardcode stub->M2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:41:14 +00:00
wassname ee136ac7e8 fix(results): read ground-truth mix_ratio from log, not argv default
17/57 real runs pass no --mix-ratio and rely on the preset default (0.125),
but the argv grab defaulted to 0.5 and mis-keyed them into the wrong mix
group, contaminating the paired-delta baseline. Parse the printed
mix_ratio= INFO line (what the run actually used) instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:24:44 +00:00
wassname fc30514b23 feat: T5 eval-time ablation for route + fix route deployment invariant
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).

External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
  preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
  delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
  evolves identically to erase regardless of AdamW non-linearity. Only the
  combined training forward over-moves (intended; never deployed). Corrected the
  overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
  (no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).

smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:50:53 +00:00
wassname d6342ab201 feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route}
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.

- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
  sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
  display name so run-id/BLUF/results.py/plot classify are unchanged. opt
  steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
  so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
  (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
  old --arm and new --intervention logs); plot classifies `routing`.

smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:31:30 +00:00
wassname 46f10d8150 results: absolute-rate tables + provenance, lock mix=0.125 default
docs/results.md: lead with absolute last-5 rates (compare within a table by
eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually
share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed
frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs
are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME),
Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack
plateaus ~step 15).

scripts/results.py: add `log` provenance column; drop the wide argv/time cols.

Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is
non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps
the hack cut with no solve tax. Smoke passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 09:30:30 +00:00
wassname e7cdcaa0ab results: same-seed paired deltas + std, exclude incomplete runs
- paired view: join projected to vanilla on (mix, seed), per-seed delta, mean
  +/- std over shared seeds. Comparing a 3-seed mean to a 1-seed point is
  meaningless; this enforces same-seed comparison (ml_debug principle).
- grouped view now reports std across seeds (null at n=1).
- exclude in-progress/aborted runs (must log all `steps`) so partial logs
  don't read as impossibly-good results.
- docs/results.md rewritten around paired deltas; honest that at n=4 the
  last-5 Dhack std (~0.15) ~= the mean (~0.13), so the effect is consistent
  in sign but not cleanly separated from zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 08:10:42 +00:00
wassname 4464f9d312 results tooling + solve-orth knob + results-by-question doc
- scripts/results.py + `just results`: aggregate logs/*.log into last-5
  hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
  full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
  (SVD of clean-side grads) from D before SVD, so projection doesn't ablate
  the solve signal. No grader/oracle, off by default.
- docs/results.md: every experiment grouped by the question it answers
  (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
  with comparison tables and answers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 07:21:05 +00:00
wassname 826b2aa83e wip 2026-05-29 06:29:46 +00:00