Commit Graph

442 Commits

Author SHA1 Message Date
wassname 103d0acc2c cleanup: delete antipasto.py; attic 7 erase-era scripts (T1/T6)
antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path --
train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that
import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve,
verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare,
tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected.
Live imports verified clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:21:53 +00:00
wassname 62ebf719d0 justfile: prune to lora2r-only (645->~180 lines)
Drop every recipe invoking deleted CLI (erase/routeV_per_token/--routeV-absorb-all/
--routeV-gate/--v-hack-path/--half-a/--beta/fast-lora*/fast-lora2r/full) and the
retired probe_distill/diag/cross-mech/substrate-plot tooling. Keep: smoke arms
(none/routeV/absorb + all), queue-decision/baseline/no-loophole, env-construction
pools (runtests/substrate/solve), results, paper tooling. Short, ordered, commented.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:20:48 +00:00
wassname 5714996c56 docs+justfile: pairs concept note (AGENTS.md) + lora2r smoke/decision recipes
AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector
= grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean
under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a
byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt.

justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags
(--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all
and queue-decision (the headline 4-arm lora2r run).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 11:08:59 +00:00
wassname 5c97975185 refactor: collapse to lora2r-only (none/routeV/absorb); delete erase/antipasto/lora_frozen_b paths
train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter
and three arms (intervention none|routeV|absorb). Removes the erase grad-surgery,
act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the
v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in:
- T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init.
- T3 width-pooled gate labels: single (num/den) fraction across modules, skip
  zero-width modules, raise if none separate (was per-module equal-weight blowup).
- T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate.
- T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache),
  adapter:"lora2r" in saved cfg.
- T8 m3: step_flagged logs the hack share (d.mean), not m.mean.

Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb
end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 10:58:22 +00:00
wassname 6094568c56 feat: lora2r adapter (rank-2r PiSSA-init LoRA) + SGTM three-way hard routing
Structural-separation arm to disentangle directionality from shrinkage. A rank-2r
PiSSA-init LoRA with A and B both trainable, partitioned into a deployed block [:r]
and a quarantine block [r:] (spectrum-matched via alternated SVD axes). Unlike the
same-basis PiSSA routeV (where deploy-ablation only removes a magnitude slice of one
shared update = shrinkage null), each block has its own input-side A rows and
output-side B columns, so deploy-ablation removes a different FUNCTION.

Routing = SGTM-style three-way hard per-rollout masks from the cosine of the deployed
block's gate-pass gradient to the pair-extracted v_grad: clean (m=0,d=0) trains
deployed only; hack (m=1,d=1) detaches deployed output so only the quarantine updates
(SGTM grad-retain trick); mid (m=1,d=0) trains both (absorption). Gate is no-cheat:
cos to the hand-authored-pair direction, never an oracle label of a live rollout.

verify_lora2r_routing.py gates identity-at-init, the three-way block-grad routing,
per-rollout c-probe recovery, and ablation teeth; wired into smoke-lora2r. Additive:
PiSSA / lora_frozen_b paths untouched.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 09:25:58 +00:00
wassname 7511ba12e8 docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix
AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full)
adapters -- forward sees only the sum so same-basis routing is a magnitude split
(shrinkage null) unless broken by gate discrimination x (expressiveness + structural
separation). Honest note that this wasn't clear to me first pass.

RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107
sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 06:50:42 +00:00
wassname 0112f4a36d fix: rotate the unhackable (gt_only) subset per step, not frozen per pid
The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of
problems were unhackable every step -- a fixed honest subset the model can
memorize instead of learning to genuinely solve the distribution. Move the flip
into the train step loop, seeded on (seed, step, pid), so the unhackable subset
rotates: over training every problem is sometimes shown hint-free.

Both halves flip, not just the hint:
- prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages
- grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel),
  so a hack earns format-only reward. Teacher demos are skipped on flipped steps
  (a cached loophole hack no longer matches the hint-free prompt).

Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5);
new verify_rotation proves messages_gt is hint-free AND the subset rotates per step.
Smoke logs flip count (1/30 hint-free, graded gt_only).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 06:14:08 +00:00
wassname f3df50f631 tool: migrate v1 deploy_test/eval_curve -> v2 field names (for mid-flight runs)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:27:38 +00:00
wassname c3af6cc03c rename: deployed/as_trained policy views, kill 'knob' (schema paired_final_v2)
Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent:
'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each:
- policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs
  hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split
  deploy_hack (JSON) vs hack_deploy (table key) into one name.
- 'knob' -> 'quarantine'/'adapter' throughout comments and log strings.
- train/test reserved for the DATA split only.
Bump RUN_SCHEMA v1->v2 so old deploy_test.json files are skipped (not crashed) by
completed_runs. CLI flags untouched (queued jobs unaffected). Fixed two
replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded)
and the missed eval_curve writer (eval_checkpoint_curve.py) + readers
(results_deploy.py). Smoke green: v2 written + read; gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:26:51 +00:00
wassname 51c5a757ef docs: make active-path comments concise 2026-06-10 05:19:52 +00:00
wassname c031d9db76 log: print one resolved-config block at startup (pairset front and center)
Replaces the partial preset= line. Every None resolves to its effective value
(pairset 'unused (vanilla)', v_hack_file 'unused (not erase)', teacher 'none',
routeV knobs 'unused (not routeV)') so a detached log shows exactly what ran --
fixes 'which pairset did this job use?'. Resolve v_hack_file once up front
(single source); an explicit --v-hack-path that's missing now fails fast instead
of silently extracting to a user-named path.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:12:58 +00:00
wassname c9ff99d87a feat: single fail-fast config-validation block; consolidate scattered checks
_validate_config rejects method-irrelevant/contradictory options before the
model load (routeV-only knobs on non-routeV, top_k>1 off grad_cosine, v_hack_path
off erase, lora adapter on unwired arms). Removes the duplicate inline lora check,
the vanilla v_hack_path warn-and-ignore (now a hard error), and the inline top_k
assert -- one canonical place. Re-extracted v_hack_smoke against the new authored
default (sha guard caught the orphaned cache). Smoke green; bad combo raises.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:05:14 +00:00
wassname 5c2edb9593 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-10 05:02:17 +00:00
wassname 5ae9187639 fix(tablelog): stale arm gates hid qmass + per-token cols in streaming view
The streaming StepLogger gated on the dead literal arm=='routing' (qmass) and
exact arm=='routingV' (missed routingV_per_token). arm is never 'routing' (the
arm property maps routeV->routingV), so qmass was computed into the row dict but
only ever surfaced in the end-of-run dump, never streamed. Gate all routeV cols
on is_route={routingV, routingV_per_token}; fold qmass in. (GPT-flagged, verified.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 05:00:12 +00:00
wassname 1f7a1f3333 pairs: default to hand-authored pairs_authored.json; drop contaminated prog_wider/widest
Progressive (prog_wide_clean) and authored deploy within noise (0.042 vs 0.050
hack, both 0.143 solve); default to authored as the canonical no-cheat label
source. prog_wider/widest held genuine contamination, superseded by curated
prog_wide_clean (kept for the progressive-vs-authored comparison).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:43:57 +00:00
wassname a94c506dbd feat: routeV_top_k -- route by oriented top-k SVD subspace (max-cos gate)
The k=1 mean-diff is the only naturally hack-ward direction; SVD axes 2..k have
arbitrary sign so each is re-oriented by sign(v_i . mean_diff). Gate = max_i
cos(g, v_i), per-rollout grad_cosine only (asserted). top_k=1 is byte-identical
to the prior mean-diff path. Smoke green: oriented [5,r] basis, band width +0.141.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:42:57 +00:00
wassname 7f45189f1c config: eval2/eval3 regime + fold per-token into arm + FastLora; drop contaminated prog_wide; OOD pairs
Config (make the design axes explicit Literal choices):
- eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like);
  unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded
  in deploy_test.json metadata.
- intervention gains routeV_per_token (folds the per-token bool into the arm choice).
- routeV_gate documented as the pinning axis.
- FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand
  (fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4).

Pairs:
- delete prog_wide.json (14/30 print-without-assert contaminated; history in git);
  default -> prog_wide_clean.
- rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the
  extraction pairs are OOD (never use the env's real grader fn name). Re-extracted
  v_hack_smoke to match.

justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac
(eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean.

smoke green (erase + routeV_per_token); all 4 verify gates pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 04:21:54 +00:00
wassname b36e3db255 docs: tone down the START HERE links to plain pointers
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:36:52 +00:00
wassname 0d6ff754ec docs: AGENTS.md START HERE links (human_journal, main.tex, grad-routing paper); revert rescore fallback
- Point future agents at the three docs that pin the actual thesis + the
  live open question (direction vs routing vs SVD/PiSSA prior), so they don't
  re-derive the non-directional result as a 'bug'.
- Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema
  checkpoints crash loudly rather than silently defaulting).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:34:06 +00:00
wassname b334b5f516 fix: rescore_deploy tolerates old-schema checkpoints (default eval-harness params)
job 32/33 failed KeyError eval_batch_size: old checkpoints' stored cfg
predates the train_config refactor. Default eval_n_prompts/max_new/
eval_batch_size to the fast preset (eval-harness params, not model-defining;
test split is fixed-size) so historical checkpoints re-score.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:31:07 +00:00
wassname e7ed74956f fix: gt_only-only run (no-loophole ceiling) no longer divides by zero
eval_modes stripped gt_only unconditionally, so a 100%-gt_only run left it
empty and load_problems did len(out) % 0. Fall back to ['gt_only'] when
nothing remains -- the ceiling run evals on gt_only itself (hack ~0, solve
= the ceiling). Job 27 failed on this; smoke --env-mode=gt_only now runs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-10 03:19:22 +00:00
wassname 2a9a3bc8a1 fix: rescore_deploy derives eval_modes from by_mode keys (old jsons lack eval_modes)
Job 32 failed KeyError: 'eval_modes' -- deploy_test.json written by the
pre-cleanup train.py has no eval_modes key. by_mode keys are the modes
the original eval spanned (present in every version), so derive from
them to reproduce the same knob-off headline.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 22:56:13 +00:00
wassname 61d3819dae docs: README/figs name the current arm routeV, not the dropped route2
The cleanup removed the v1 route and route2 arms (Config is now
none|erase|routeV) but left README calling the live arm route2 with its
old binary-tau gate description. Rename to routeV, describe the banded
cosine gate (per-rollout/per-token, per-token best), and fix the deploy
line (held-out test n=119 knob-off, not n=64). figs.py keeps the
route2/routing2 display map for historical run artifacts.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:39:15 +00:00
wassname b53043cec3 refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:34:50 +00:00
wassname 3f82041d90 plot: deploy Pareto draws knob-on->off before/after on the n=119 axis
Now that final/rescore eval record deploy_hack_on/solve_on at n=119,
the deploy scatter shows the honest quarantine move (hollow knob-on dot
-> arrow -> solid knob-off dot) on the same axis instead of borrowing
val's lower-scale curve. Dot-only fallback for arms not yet backfilled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:15:19 +00:00
wassname 3c27d922d2 docs: record science correctness audit 2026-06-09 13:10:17 +00:00
wassname d68c17e7c5 eval: final deploy eval records knob-on (deployed-as-trained) for quarantine arms
route/routeV final eval now measures both endpoints at n=119 test:
knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained
model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so
the before->after quarantine move is plottable from the deploy set
instead of borrowing the val curve's different scale.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 13:09:50 +00:00
wassname 5b0a6ddd91 plot: deploy Pareto (dots, ideal star, more arms) + honest val knob before/after
- floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the
  good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No
  arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would
  fake a solve jump that's really the n=32->n=119 eval-set shift.
- floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on
  -> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056,
  authored 0.056->0.044), not the horizontal I wrongly forced earlier.
- justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable
  fraction), low priority; vanilla rerun alongside best (its solve also suffers).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:58:32 +00:00
wassname 438068c431 cleanup: consolidate stale loaders and pair scripts 2026-06-09 12:47:32 +00:00
wassname ca8d1adf62 plot: replace abs arrow-bars with a single hack-vs-solve Pareto scatter (Tufte)
Two separate panels over-reduced a 2-variable story. One scatter instead: good
corner top-right (hack axis reversed), green effect-arrows from the vanilla
baseline show what each intervention did, achievable solve band (base..ceiling)
as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling).
No title; name-only point labels (position already encodes the rates). The Pareto
view makes domination visible: per-token strictly dominates random-V and vanilla.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:45:42 +00:00
wassname 31c2b9c82f env: unhackable_frac -- flip a random fraction of TRAIN problems to gt_only
Realism knob: in the reference env hacking saturates and kills the solve gradient.
A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest
solving pays) keeps a persistent solve pressure all arms feel. The differential test:
routeV ablates the hack on the hackable 90% so it must solve there, while the warm
solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0.

- gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an
  ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from
  eval_modes) so hack/solve remain comparable to the reference env.
- logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the
  flip propagates to per-mode metrics.
- smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1).

Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the
loophole hint. smoke-unhackable runs end-to-end.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:39:50 +00:00
wassname d4998a71ba docs: merge Ariahw Fig-5 table into the paper md (delete standalone); add abs-scale arrow plot
- Transcribed Fig-5 numeric table now lives inline in the paper md as an
  EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
  (one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
  anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
  reversed so right=better on both panels.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:35:14 +00:00
wassname 0973f9ba7c plot: floor_ceiling shows our arms only (vanilla floor + routeV), drop Ariahw bars
Cross-scale (their converged full-env vs our 60-step fast surrogate) made the
paper comparison directional-only and unfair on one axis. Show vanilla GRPO as
the red floor anchor instead; paper numbers stay in the extracted table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:26:55 +00:00
wassname bcfcee0d06 fix floor_ceiling asymmetry: paper methods on BOTH panels
Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve
(Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%).
Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both
axes because they are converged full-scale vs our 60-step surrogate -- caption
marks it directional-only. Cross-scale/maturity caveat (task #18) still stands.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 12:10:55 +00:00
wassname 0538dbf2f1 add routeV_absorb_all: 100% absorption, no vector (H2 extreme control)
Route the whole gradient of every knob-on rollout into the quarantine; the
deployed knob learns only from the knob-off exploration floor. Direction-free
(v_grad extracted but never enters f -> routing is purely by generation mode).
Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout
machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/
rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:56:12 +00:00
wassname 028b8fff68 transcribe Ariahw Fig 5 to a saved table; plot real no-oracle peer (LLM judge)
Read the figure PNGs directly (Fig 5 is a full numeric table the paper never
prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we
stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong --
LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression
is not routeV's novelty (the judge does it); the mechanism is (no live monitor,
gradient-level, fixed authored-pair direction). Plot now shows the judge as a
blue no-oracle peer bar alongside the grey oracle methods.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:48:02 +00:00
wassname 3b38a05738 no-cheat framing: label-leakage not detector-presence; fix plot comment
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:22:29 +00:00
wassname 6b44dd39bd viz: regenerate floor_ceiling figure with Ariahw reference
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:03:05 +00:00
wassname d393e119e0 viz: reference = Ariahw paper (oracle upper bound), not SGTM
Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025),
which benchmarks interventions on the same floor (No-Intervention hack ~79%) /
ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0%
hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at
train time -- the exact cheat our no-cheat constraint forbids; their only
oracle-free method (inoculation) gave incomplete, high-variance mitigation.
Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images,
200-step preset not step-matched). Honest framing: their working methods need
the oracle; ours uses no detector at train time and still suppresses 93%.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:03:05 +00:00
wassname 93c862bb2e results-deploy: add select (Youden J) + floor->ceiling columns
select = routing precision = hack_supp - solve_supp on the knob (held-out val,
knob-ON vs knob-OFF, same split). 1.0 = removes all hacking at no solve cost.
Sanity: vanilla=0.00 (no knob), base=blank (no knob-on signal), per-token=0.96.

hack_supp = (vanilla - hack)/vanilla ; solve_uplift = (solve - base)/(ceiling - base),
the floor->ceiling normalized fractions (ceiling provisional=paper 0.223, FIXME job 24).

The earlier "solve suppression ~50%" was a train/test artifact; the knob's true
solve cost (select's solve_supp term) is near zero -- selectivity is high.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:56:55 +00:00
wassname 34a2eec704 viz: floor->ceiling as two normalized panels (best vs control vs reference)
Rework per feedback: hack and solve are not opposites, so they get separate
floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this
also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61).
Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM
gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task).

Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction
signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns
to solve in 60 steps). Moved out/plots -> out/figs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:55:03 +00:00
wassname 7d08ad2acd viz: floor-to-ceiling method comparison (csv + figure)
Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor,
with SOURCE and STATUS columns flagging every provisional/missing cell) then
the keynote figure. Prints TODO/FIXME data gaps before plotting.

Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.

Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24),
prog_wide arm contaminated (TODO job 28 prog_wide_clean).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:45:37 +00:00
wassname e82aa2bf12 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-09 17:28:17 +08:00
wassname c390007eb9 human journal 2026-06-09 17:28:15 +08:00
wassname 8e6eace56b fix: rename 4 canonical LeetCode function names in authored/clean pairsets
singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.

Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.

Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:23:33 +00:00
wassname dae52b2a7d cleanup: consolidate pairs modules into build scripts + add solve_train to table
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
  into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
  pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
  instead of importing Python modules; drop tainted v2/allv2 pairsets
  from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
  the per-run end-of-training table shows actual training solve rate (was "-").

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:17:42 +00:00
wassname 00600d13a6 data: add prog_wide_clean.json -- 8/30 pairs that don't encode run_tests grading flaw
22/30 prog_wide pairs are contaminated: 14 print-no-assert, 2 pass-only,
4 assert-True/is-not-None, 2 self-comparison tautologies. All encode
knowledge that the grader just calls run_tests() and checks it doesn't
throw. Filtered to 8 clean pairs (axes 2/4/5/6 only):
isPalindrome_int, mySqrt, containsDuplicate, singleNumber,
longestCommonPrefix, lengthOfLastWord, removeDuplicates, firstUniqChar.

These are identical to 8 of the 11 authored pairs (minus P19/P20/P21).
Needed because job 9 (best result: hack=0.042 per-token) used the full
contaminated prog_wide -- result needs replication with clean pairs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:58:15 +00:00
wassname fb9f68530c refactor: move pair data out of pairs.py into build script; drop tainted axis-1/3
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.

Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.

Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:44:33 +00:00
wassname 6f49d5f9b0 refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback
- scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json
- scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention
- out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None)
- train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud)
- justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats
- requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:09:09 +00:00
wassname a1ef566bac main.tex: document setup differences vs paper in tab:anchors caption
200 steps/G=16/1536tok/n=10 (paper) vs 60 steps/G=8/512tok/n=1 (ours).
Framed as fast-preset directional surrogate within resource budget.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:55:58 +00:00