Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r
adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way
masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live
arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path --
train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that
import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve,
verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare,
tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected.
Live imports verified clean.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AGENTS.md: explain what a routing pair IS (same-prompt hack/clean = pos/neg, vector
= grad(prompt+hack)-grad(prompt+clean); no problem_id semantics; identical hack/clean
under a DIFFERENT prompt = distinct gradient). Caught that prog_wide_clean is NOT a
byte-identical subset of pairs_authored: 3/8 shared pairs differ in prompt.
justfile: smoke recipes now use the live arms (none/routeV/absorb), drop deleted flags
(--intervention=erase, --routeV-absorb-all, --adapter, --v-hack-path). Add smoke-all
and queue-decision (the headline 4-arm lora2r run).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter
and three arms (intervention none|routeV|absorb). Removes the erase grad-surgery,
act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the
v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in:
- T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init.
- T3 width-pooled gate labels: single (num/den) fraction across modules, skip
zero-width modules, raise if none separate (was per-module equal-weight blowup).
- T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate.
- T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache),
adapter:"lora2r" in saved cfg.
- T8 m3: step_flagged logs the hack share (d.mean), not m.mean.
Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb
end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log).
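
A minimal sketch of T2's "net delta 0 at init", assuming the adapter contributes (B A - B0 A0) x with A0/B0 kept as frozen copies of the initial draw. That reading is consistent with the checkpoint storing A/B/A0/B0, but the actual forward in lora2r.py is not shown here; the helper names, the toy shapes, and reading "1/2r" as a variance of 1/(2r) are assumptions.

```python
import math
import random


def gaussian_init(d_in, d_out, r, seed=0):
    """Draw A0 ~ N(0, 1/d_in), shape [2r, d_in], and B0 ~ N(0, 1/(2r)), shape [d_out, 2r]."""
    rng = random.Random(seed)
    std_a = 1.0 / math.sqrt(d_in)
    std_b = 1.0 / math.sqrt(2 * r)
    a0 = [[rng.gauss(0.0, std_a) for _ in range(d_in)] for _ in range(2 * r)]
    b0 = [[rng.gauss(0.0, std_b) for _ in range(2 * r)] for _ in range(d_out)]
    return a0, b0


def matvec(m, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def adapter_delta(a, b, a0, b0, x):
    """Assumed forward: B(Ax) - B0(A0 x), which is exactly zero while A == A0 and B == B0."""
    live = matvec(b, matvec(a, x))
    frozen = matvec(b0, matvec(a0, x))
    return [l - f for l, f in zip(live, frozen)]


a0, b0 = gaussian_init(d_in=8, d_out=4, r=2)
a = [row[:] for row in a0]  # trainable copies start equal to the frozen draw
b = [row[:] for row in b0]
x = [1.0] * 8
delta_at_init = adapter_delta(a, b, a0, b0, x)
```

Under this assumption both factors start non-zero (so both get gradient from step 0) while the wrapped layer's output is unchanged at init.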
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Structural-separation arm to disentangle directionality from shrinkage. A rank-2r
PiSSA-init LoRA with A and B both trainable, partitioned into a deployed block [:r]
and a quarantine block [r:] (spectrum-matched via alternated SVD axes). Unlike the
same-basis PiSSA routeV (where deploy-ablation only removes a magnitude slice of one
shared update = shrinkage null), each block has its own input-side A rows and
output-side B columns, so deploy-ablation removes a different FUNCTION.
Routing = SGTM-style three-way hard per-rollout masks from the cosine of the deployed
block's gate-pass gradient to the pair-extracted v_grad: clean (m=0,d=0) trains
deployed only; hack (m=1,d=1) detaches deployed output so only the quarantine updates
(SGTM grad-retain trick); mid (m=1,d=0) trains both (absorption). Gate is no-cheat:
cos to the hand-authored-pair direction, never an oracle label of a live rollout.
verify_lora2r_routing.py gates identity-at-init, the three-way block-grad routing,
per-rollout c-probe recovery, and ablation teeth; wired into smoke-lora2r. Additive:
PiSSA / lora_frozen_b paths untouched.
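
The three-way mask described above can be sketched as a pure function of the per-rollout cosine. The band edges lo/hi and both function names are hypothetical; only the (m, d) meanings come from the text above.

```python
def route_masks(cos, lo, hi):
    """Map a per-rollout cosine to the hard mask (m, d). lo/hi are hypothetical band edges."""
    if cos >= hi:
        return (1, 1)  # hack: deployed output detached, only the quarantine updates
    if cos > lo:
        return (1, 0)  # mid: both blocks train (absorption)
    return (0, 0)      # clean: deployed block only


def blocks_trained(m, d):
    """Which blocks receive gradient under mask (m, d)."""
    return {"deployed": d == 0, "quarantine": m == 1}
```

For example, `blocks_trained(*route_masks(0.9, lo=0.1, hi=0.5))` reports that only the quarantine block trains, which is the hack case.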
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full)
adapters -- forward sees only the sum so same-basis routing is a magnitude split
(shrinkage null) unless broken by gate discrimination x (expressiveness + structural
separation). Honest note: this wasn't clear to me on the first pass.
RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107
sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of
problems were unhackable every step -- a fixed honest subset the model can
memorize instead of learning to genuinely solve the distribution. Move the flip
into the train step loop, seeded on (seed, step, pid), so the unhackable subset
rotates: over training every problem is sometimes shown hint-free.
Both halves flip, not just the hint:
- prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages
- grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel),
so a hack earns format-only reward. Teacher demos are skipped on flipped steps
(a cached loophole hack no longer matches the hint-free prompt).
Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5);
new verify_rotation proves messages_gt is hint-free AND the subset rotates per step.
Smoke logs flip count (1/30 hint-free, graded gt_only).
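
A sketch of why the key matters, not the actual train.py code: the hash-based draw below is a stand-in for whatever seeded RNG the repo uses, and all function names are hypothetical. Keying on (seed, pid) freezes the subset; adding step makes it rotate.

```python
import hashlib


def is_gt_only(frac, *key):
    """Deterministic Bernoulli(frac) draw keyed on the given integers."""
    digest = hashlib.sha256(repr(key).encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform on the unit interval
    return u < frac


def flipped_fixed(seed, pids, frac):
    """Old behaviour: keyed on (seed, pid), so the same problems flip every step."""
    return {p for p in pids if is_gt_only(frac, seed, p)}


def flipped_rotating(seed, step, pids, frac):
    """New behaviour: keyed on (seed, step, pid), so the subset changes per step."""
    return {p for p in pids if is_gt_only(frac, seed, step, p)}


pids = range(100)
subsets = [flipped_rotating(0, step, pids, 0.1) for step in range(300)]
ever_flipped = set().union(*subsets)  # problems shown hint-free at least once
```

With the fixed key the set never changes across steps, so the complement is permanently hackable; with the rotating key `ever_flipped` grows toward the full problem set as training proceeds.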
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent:
'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each:
- policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs
hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split
deploy_hack (JSON) vs hack_deploy (table key) into one name.
- 'knob' -> 'quarantine'/'adapter' throughout comments and log strings.
- train/test reserved for the DATA split only.
Bump RUN_SCHEMA v1->v2 so old deploy_test.json files are skipped (not crashed) by
completed_runs. CLI flags untouched (queued jobs unaffected). Fixed two
replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded)
and the missed eval_curve writer (eval_checkpoint_curve.py) + readers
(results_deploy.py). Smoke green: v2 written + read; gates pass.
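
The replace_all collision is easy to reproduce. The sketch below is illustrative only (the sample line is made up): a plain substring replace re-renames keys that already carry the new name, while a whole-identifier regex does not.

```python
import re

OLD, NEW = "hack_deploy", "hack_deployed"


def rename_naive(text):
    """Plain substring replace: also rewrites the inside of already-renamed keys."""
    return text.replace(OLD, NEW)


def rename_whole_word(text):
    """Rewrite OLD only as a whole identifier (underscore counts as a word character)."""
    return re.sub(rf"\b{re.escape(OLD)}\b", NEW, text)


line = "row['hack_deploy'] = out['hack_deployed']"
print(rename_naive(line))       # row['hack_deployed'] = out['hack_deployeded']
print(rename_whole_word(line))  # row['hack_deployed'] = out['hack_deployed']
```

The word-boundary form is also idempotent, so running the rename twice is harmless.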
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replaces the partial preset= line. Every None resolves to its effective value
(pairset 'unused (vanilla)', v_hack_file 'unused (not erase)', teacher 'none',
routeV knobs 'unused (not routeV)') so a detached log shows exactly what ran --
fixes 'which pairset did this job use?'. Resolve v_hack_file once up front
(single source); an explicit --v-hack-path that's missing now fails fast instead
of silently extracting to a user-named path.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
_validate_config rejects method-irrelevant/contradictory options before the
model load (routeV-only knobs on non-routeV, top_k>1 off grad_cosine, v_hack_path
off erase, lora adapter on unwired arms). Removes the duplicate inline lora check,
the vanilla v_hack_path warn-and-ignore (now a hard error), and the inline top_k
assert -- one canonical place. Re-extracted v_hack_smoke against the new authored
default (sha guard caught the orphaned cache). Smoke green; bad combo raises.
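
A rough sketch of the fail-fast shape, covering two of the rules listed above. The field names mirror the commit text, but the dataclass, its defaults, and the error messages are hypothetical and not the real _validate_config.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    # Hypothetical stand-in for the real config; defaults are assumptions.
    intervention: str = "none"  # none | erase | routeV
    routeV_gate: str = "grad_cosine"
    top_k: int = 1
    v_hack_path: Optional[str] = None


def validate_config(cfg):
    """Raise on contradictory options; meant to run before any model load."""
    if cfg.top_k > 1 and (cfg.intervention != "routeV" or cfg.routeV_gate != "grad_cosine"):
        raise ValueError("top_k>1 requires routeV with the grad_cosine gate")
    if cfg.v_hack_path is not None and cfg.intervention != "erase":
        raise ValueError("v_hack_path is only meaningful for intervention=erase")
```

Keeping every rule in one function means a bad combination costs a traceback rather than a model load followed by a silently ignored flag.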
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The streaming StepLogger gated on the dead literal arm=='routing' (qmass) and
exact arm=='routingV' (missed routingV_per_token). arm is never 'routing' (the
arm property maps routeV->routingV), so qmass was computed into the row dict but
only ever surfaced in the end-of-run dump, never streamed. Gate all routeV cols
on is_route={routingV, routingV_per_token}; fold qmass in. (GPT-flagged, verified.)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Progressive (prog_wide_clean) and authored deploy within noise (0.042 vs 0.050
hack, both 0.143 solve); default to authored as the canonical no-cheat label
source. prog_wider/widest held genuine contamination, superseded by curated
prog_wide_clean (kept for the progressive-vs-authored comparison).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The k=1 mean-diff is the only naturally hack-ward direction; SVD axes 2..k have
arbitrary sign so each is re-oriented by sign(v_i . mean_diff). Gate = max_i
cos(g, v_i), per-rollout grad_cosine only (asserted). top_k=1 is byte-identical
to the prior mean-diff path. Smoke green: oriented [5,r] basis, band width +0.141.
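
The orientation step and the gate can be sketched in a few lines. This is a toy illustration on plain lists with hypothetical helper names, not the extraction code; an axis with zero dot product against mean_diff is left unflipped here, which is a choice made for this sketch.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def orient(basis, mean_diff):
    """Flip each axis so its dot product with mean_diff is non-negative."""
    return [[-x for x in v] if dot(v, mean_diff) < 0 else list(v) for v in basis]


def gate_score(g, basis):
    """Gate = max_i cos(g, v_i) over the oriented axes."""
    return max(cosine(g, v) for v in basis)


mean_diff = [2.0, 1.0, 0.0]
basis = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0]]  # second axis came out with the wrong sign
oriented = orient(basis, mean_diff)
g = [0.0, 1.0, 0.0]
```

Here `gate_score(g, basis)` is 0.0 but `gate_score(g, oriented)` is 1.0: without the sign fix, a gradient aligned with the second hack-ward axis would score as unrelated. With a single-row basis equal to mean_diff, the gate reduces to cos(g, mean_diff).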
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Config (make the design axes explicit Literal choices):
- eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like);
unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded
in deploy_test.json metadata.
- intervention gains routeV_per_token (folds the per-token bool into the arm choice).
- routeV_gate documented as the pinning axis.
- FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand
(fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25: ppl 6e5, gn 98 at step 4).
Pairs:
- delete prog_wide.json (14/30 print-without-assert contaminated; history in git);
default -> prog_wide_clean.
- rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the
extraction pairs are OOD (never use the env's real grader fn name). Re-extracted
v_hack_smoke to match.
justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac
(eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean.
Smoke green (erase + routeV_per_token); all 4 verify gates pass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Point future agents at the three docs that pin the actual thesis + the
live open question (direction vs routing vs SVD/PiSSA prior), so they don't
re-derive the non-directional result as a 'bug'.
- Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema
checkpoints crash loudly rather than silently defaulting).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Jobs 32/33 failed with KeyError: 'eval_batch_size' -- old checkpoints' stored cfg
predates the train_config refactor. Default eval_n_prompts/max_new/
eval_batch_size to the fast preset (eval-harness params, not model-defining;
test split is fixed-size) so historical checkpoints re-score.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
eval_modes stripped gt_only unconditionally, so a 100%-gt_only run left it
empty and load_problems hit a modulo-by-zero (len(out) % 0). Fall back to ['gt_only'] when
nothing remains -- the ceiling run evals on gt_only itself (hack ~0, solve
= the ceiling). Job 27 failed on this; smoke --env-mode=gt_only now runs.
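
The fallback amounts to one line. A sketch with a hypothetical function name and a made-up non-gt_only mode name:

```python
def resolve_eval_modes(env_modes):
    """Drop gt_only from the eval modes unless that would leave nothing to evaluate on."""
    remaining = [m for m in env_modes if m != "gt_only"]
    return remaining or ["gt_only"]
```

So `resolve_eval_modes(["gt_only"])` returns `["gt_only"]` rather than an empty list, and mixed runs still evaluate on the loophole modes only.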
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 32 failed KeyError: 'eval_modes' -- deploy_test.json written by the
pre-cleanup train.py has no eval_modes key. by_mode keys are the modes
the original eval spanned (present in every version), so derive from
them to reproduce the same knob-off headline.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The cleanup removed the v1 route and route2 arms (Config is now
none|erase|routeV) but left README calling the live arm route2 with its
old binary-tau gate description. Rename to routeV, describe the banded
cosine gate (per-rollout/per-token, per-token best), and fix the deploy
line (held-out test n=119 knob-off, not n=64). figs.py keeps the
route2/routing2 display map for historical run artifacts.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Now that final/rescore eval record deploy_hack_on/solve_on at n=119,
the deploy scatter shows the honest quarantine move (hollow knob-on dot
-> arrow -> solid knob-off dot) on the same axis instead of borrowing
val's lower-scale curve. Dot-only fallback for arms not yet backfilled.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route/routeV final eval now measures both endpoints at n=119 test:
knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained
model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so
the before->after quarantine move is plottable from the deploy set
instead of borrowing the val curve's different scale.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the
good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No
arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would
fake a solve jump that's really the n=32->n=119 eval-set shift.
- floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on
-> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056,
authored 0.056->0.044), not the horizontal move I wrongly forced earlier.
- justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable
fraction), low priority; vanilla rerun alongside best (its solve also suffers).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Two separate panels over-reduced a 2-variable story. One scatter instead: good
corner top-right (hack axis reversed), green effect-arrows from the vanilla
baseline show what each intervention did, achievable solve band (base..ceiling)
as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling).
No title; name-only point labels (position already encodes the rates). The Pareto
view makes domination visible: per-token strictly dominates random-V and vanilla.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Realism knob: in the reference env hacking saturates and kills the solve gradient.
A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest
solving pays) keeps a persistent solve pressure all arms feel. The differential test:
routeV ablates the hack on the hackable 90% so it must solve there, while the warm
solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0.
- gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an
ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from
eval_modes) so hack/solve remain comparable to the reference env.
- logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the
flip propagates to per-mode metrics.
- smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1).
Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the
loophole hint. smoke-unhackable runs end-to-end.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Transcribed Fig-5 numeric table now lives inline in the paper md as an
EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
(one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
reversed so right=better on both panels.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Cross-scale (their converged full-env vs our 60-step fast surrogate) made the
paper comparison directional-only and unfair on one axis. Show vanilla GRPO as
the red floor anchor instead; paper numbers stay in the extracted table.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve
(Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%).
Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both
axes because they are converged full-scale vs our 60-step surrogate -- caption
marks it directional-only. Cross-scale/maturity caveat (task #18) still stands.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Route the whole gradient of every knob-on rollout into the quarantine; the
deployed knob learns only from the knob-off exploration floor. Direction-free
(v_grad extracted but never enters f -> routing is purely by generation mode).
Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout
machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/
rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Read the figure PNGs directly (Fig 5 is a full numeric table the paper never
prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we
stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong --
LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression
is not routeV's novelty (the judge does it); the mechanism is (no live monitor,
gradient-level, fixed authored-pair direction). Plot now shows the judge as a
blue no-oracle peer bar alongside the grey oracle methods.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025),
which benchmarks interventions on the same floor (No-Intervention hack ~79%) /
ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0%
hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at
train time -- the exact cheat our no-cheat constraint forbids; their only
oracle-free method (inoculation) gave incomplete, high-variance mitigation.
Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images,
200-step preset not step-matched). Honest framing: their working methods need
the oracle; ours uses no detector at train time and still suppresses 93%.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
select = routing precision = hack_supp - solve_supp on the knob (held-out val,
knob-ON vs knob-OFF, same split). 1.0 = removes all hacking at no solve cost.
Sanity: vanilla=0.00 (no knob), base=blank (no knob-on signal), per-token=0.96.
hack_supp = (vanilla - hack)/vanilla; solve_uplift = (solve - base)/(ceiling - base),
the floor->ceiling normalized fractions (ceiling provisional=paper 0.223, FIXME job 24).
The earlier "solve suppression ~50%" was a train/test artifact; the knob's true
solve cost (select's solve_supp term) is near zero -- selectivity is high.
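
The two normalized fractions are as written above. The knob-level form of select below is an assumption (relative hack drop minus relative solve drop, knob-ON to knob-OFF), since solve_supp is not spelled out here; function names are hypothetical.

```python
def hack_supp(vanilla_hack, hack):
    """Fraction of the vanilla hack rate that the arm removed."""
    return (vanilla_hack - hack) / vanilla_hack


def solve_uplift(solve, base, ceiling):
    """Position of the arm's solve rate on the base..ceiling range."""
    return (solve - base) / (ceiling - base)


def select(hack_on, hack_off, solve_on, solve_off):
    """Assumed form: relative hack suppression minus relative solve suppression on the knob."""
    return (hack_on - hack_off) / hack_on - (solve_on - solve_off) / solve_on
```

Under this form an arm whose knob changes nothing scores 0.0, and one that removes all hacking with no change in solve scores 1.0, matching the two sanity anchors above.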
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Rework per feedback: hack and solve are not opposites, so they get separate
floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this
also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61).
Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM
gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task).
Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction
signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns
to solve in 60 steps). Moved out/plots -> out/figs.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor,
with SOURCE and STATUS columns flagging every provisional/missing cell) then
the keynote figure. Prints TODO/FIXME data gaps before plotting.
Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.
Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24),
prog_wide arm contaminated (TODO job 28 prog_wide_clean).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.
Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.
Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
instead of importing Python modules; drop tainted v2/allv2 pairsets
from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
the per-run end-of-training table shows actual training solve rate (was "-").
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
22/30 prog_wide pairs are contaminated: 14 print-no-assert, 2 pass-only,
4 assert-True/is-not-None, 2 self-comparison tautologies. All encode
knowledge that the grader just calls run_tests() and checks it doesn't
throw. Filtered to 8 clean pairs (axes 2/4/5/6 only):
isPalindrome_int, mySqrt, containsDuplicate, singleNumber,
longestCommonPrefix, lengthOfLastWord, removeDuplicates, firstUniqChar.
These are identical to 8 of the 11 authored pairs (minus P19/P20/P21).
Needed because job 9 (best result: hack=0.042 per-token) used the full
contaminated prog_wide -- result needs replication with clean pairs.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.
Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.
Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>