Replace the SVD-of-W / delta_S / erase / cin-cout description with the lora2r
adapter (rank-2r LoRA, deployed [:r] + quarantine [r:] blocks, SGTM three-way
masks, deploy=ablate quarantine), the two-pass routeV gate, and the three live
arms (none/routeV/absorb). Fix the dead quick-start recipes (queue-decision).
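For readers new to the lora2r layout, a minimal sketch of the block split follows. It assumes a plain LoRA delta B @ A with rank 2r, where hidden units [:r] are the deployed block and [r:] the quarantine block; the class name `Lora2r`, the random init, and the pure-Python matrix helper are illustrative and not taken from the repo. The three-way training masks are not modeled here, only the deploy-time ablation.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class Lora2r:
    """Toy rank-2r LoRA delta (hypothetical; not the repo's adapter class)."""

    def __init__(self, d_in, d_out, r, seed=0):
        rng = random.Random(seed)
        self.r = r
        # A: (2r, d_in) down-projection, B: (d_out, 2r) up-projection.
        # Both are random here so the demo is non-trivial; this init is an assumption.
        self.A = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(2 * r)]
        self.B = [[rng.gauss(0.0, 0.1) for _ in range(2 * r)] for _ in range(d_out)]

    def delta(self, x, deploy=False):
        """Return the adapter's contribution B @ (A @ x).

        With deploy=True the quarantine hidden units [r:] are zeroed,
        so only the deployed block [:r] contributes.
        """
        h = matvec(self.A, x)
        if deploy:
            h = h[: self.r] + [0.0] * self.r
        return matvec(self.B, h)
```

Under these assumptions the training-time delta is the sum of the deployed and quarantine contributions, and deploy simply drops the second term.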
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The cleanup removed the v1 route and route2 arms (Config is now
none|erase|routeV) but left README calling the live arm route2 with its
old binary-tau gate description. Rename to routeV, describe the banded
cosine gate (per-rollout/per-token, per-token best), and fix the deploy
line (held-out test n=119 knob-off, not n=64). figs.py keeps the
route2/routing2 display map for historical run artifacts.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- spec.md never existed at root or docs/; removed the link from AGENTS.md +
README.md (the live plan is in docs/spec/ dated files).
- RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed.
- Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py
(that file doesn't exist); kept the general 'gate every load-bearing
invariant in the same commit' rule.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec extracted from hand-built synthetic pairs, route the live
GRPO gradient by cosine alignment to vec; no detector ever runs over student
rollouts at train time. Generalization = does vec (from pairs covering some
modes) suppress held-out modes -- vector generalization, not detector-label.
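A minimal sketch of the vec -> routing idea, for orientation only. It assumes a flat gradient vector, a single extracted direction, and a hard one-sided threshold; the function names and the value of `tau` are hypothetical, and the real gate may be shaped differently.

```python
import math


def cosine(g, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g, v))
    ng = math.sqrt(sum(a * a for a in g))
    nv = math.sqrt(sum(b * b for b in v))
    if ng == 0.0 or nv == 0.0:
        return 0.0
    return dot / (ng * nv)


def route(g, v_hack, tau=0.3):
    """Split a gradient into (deployed_grad, quarantine_grad).

    Gradients aligned with v_hack (cosine above tau) go to the quarantine
    block; everything else updates the deployed block. No detector or label
    on the rollout is consulted, only the alignment with the extracted vector.
    tau=0.3 is a placeholder, not a tuned value.
    """
    zeros = [0.0] * len(g)
    if cosine(g, v_hack) > tau:
        return zeros, list(g)
    return list(g), zeros
```

The gate is one-sided in this sketch: a gradient pointing against v_hack stays in the deployed block.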
- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
= cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
no-cheat reference at AGENTS.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- title: drop the "Quarantine ... Representation?" metaphor for
"vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase=discard the
component / route=redirect the gated gradient into a deletable adapter,
deleted at deploy). Honest framing: route preserves (not discards); follows
Shilov et al.'s post-backward deletable-block routing in the gradient-routing
family, gated by an extracted direction not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as
author-year. README + pyproject describe vGROUT (package name unchanged)
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (`just dyn
--latest-per-arm` clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).
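For reference, a paired t-test at n=3 can be computed with the standard library alone. The sketch below assumes the two arms are paired by seed; `paired_t` is a hypothetical helper, not the script used for the paper, and no run data is included.

```python
import math
import statistics


def paired_t(arm_a, arm_b):
    """Paired t statistic and degrees of freedom for two matched samples.

    arm_a[i] and arm_b[i] are assumed to come from the same seed.
    Raises ZeroDivisionError if all paired differences are identical.
    """
    if len(arm_a) != len(arm_b) or len(arm_a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [a - b for a, b in zip(arm_a, arm_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample std (n - 1 denominator)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With n=3 there are 2 degrees of freedom, so the two-sided 5% critical value is about 4.303; the test has little power at this sample size.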
Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).
Closes A1/A2 (#173).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
sampled, not greedy)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5,
v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two
bases differ on pairs AND directions-kept AND extract-tau simultaneously, so
the hack-cut gap is triple-confounded, not a clean "pair set is the lever"
result. Nothing was lost: the strong basis reproduces from current pairs.py
via --top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts
at k=12. Rewrites Q8 + the top confound bullet + the README findings caveat.
A one-knob k-sweep is needed to attribute the gain.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deletes 7 source files that were superseded but never removed:
run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor),
grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by
train.py "smoke" subcommand), phase2_analyze.py (pilot is past),
probe_uat.py (UAT pipeline is past).
Drops matching justfile recipes (vhack-check, phase2-analyze,
probe-uat) and the BASE constant that pointed at run.py. Updates
AGENTS/README references to the stale fast-dev-run recipe (now
`just smoke` / `just smoke-vanilla`).
Verified by running `just smoke-vanilla --steps=2` end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.
Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and
small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min
iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
effectively disable — `gn` column shows the clip was never the bottleneck).
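A sketch of how a fractional warmup composes with cosine decay. It assumes linear warmup, cosine decay to zero, and warmup_frac=0.1 (inferred from "2 steps of 20"); the function names are hypothetical and the repo's scheduler may differ in shape.

```python
import math


def warmup_steps(total_steps, warmup_frac):
    """Number of warmup steps implied by a fraction of the run (at least 1)."""
    return max(1, round(warmup_frac * total_steps))


def scheduled_lr(step, total_steps, base_lr, warmup_frac=0.1):
    """LR at a 0-indexed step: linear warmup, then cosine decay (assumed shapes)."""
    n_warm = warmup_steps(total_steps, warmup_frac)
    if step < n_warm:
        # Ramp so that the last warmup step reaches base_lr.
        return base_lr * (step + 1) / n_warm
    progress = (step - n_warm) / max(1, total_steps - n_warm)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Because the warmup length scales with the run, the same config value serves the 20-step fast preset and longer presets without the warmup eating half of a short run.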
CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.
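The subclass layout can be sketched with plain dataclasses. The base-class defaults below are placeholders, the field set is reduced, and stripping the "Config" suffix in `preset_name` is an assumption (the commit only says the name is derived from `type(self).__name__`). The tyro call is shown as a comment so the sketch has no third-party dependency.

```python
from dataclasses import dataclass


@dataclass
class Config:
    arm: str = "vanilla"
    lr: float = 1e-4   # placeholder default, not the repo's value
    steps: int = 200   # placeholder default, not the repo's value

    @property
    def preset_name(self) -> str:
        # Derived from the class, so there is no duplicated field to keep in sync.
        return type(self).__name__.removesuffix("Config").lower()


@dataclass
class SmokeConfig(Config):
    steps: int = 2     # placeholder


@dataclass
class FastConfig(Config):
    lr: float = 3e-3
    steps: int = 20


@dataclass
class FullConfig(Config):
    pass


SUBCOMMANDS = {"smoke": SmokeConfig, "fast": FastConfig, "full": FullConfig}

# With tyro installed, dispatch would look roughly like:
#   import tyro
#   cfg = tyro.extras.subcommand_cli_from_dict(SUBCOMMANDS)
# so that `train fast --arm=vanilla --lr=3e-3` builds a FastConfig.
```

Each preset is then an ordinary type with its own defaults, and overriding a field on the CLI never has to go through an Optional-merge step.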
Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats
and reformatted to `+0.000`. Tuples for fraction columns get converted to
"n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.
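A reduced sketch of the column-spec idea: rows keep raw values and each column owns its formatter, so nothing downstream has to re-parse a formatted string. The `Column` dataclass and the method names are illustrative, not the actual `StepLogger` API.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Column:
    key: str                      # key in the raw row dict
    label: str                    # header text
    width: int                    # right-justified cell width
    fmt: Callable[[Any], str]     # raw value -> display string


def fmt_frac(value):
    """Render an (n, d) tuple as 'n/d'."""
    n, d = value
    return f"{n}/{d}"


class StepLogger:
    """Hypothetical minimal logger: one column spec, raw rows archived."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.rows = []

    def header(self):
        return " ".join(c.label.rjust(c.width) for c in self.columns)

    def log(self, row):
        self.rows.append(row)  # raw values kept for the end-of-run dump
        return " ".join(c.fmt(row[c.key]).rjust(c.width) for c in self.columns)
```

Since `rows` holds floats and tuples rather than strings, an end-of-run table can apply the same formatters again instead of guessing at number formats.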
cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
justfile. "in/out" overloaded with weight in/out features; "pre/post" is
unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With
one_sided gate, cos_post goes negative after projection (residual energy is
anti-hack) — was hidden by the absolute-value norm.
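The difference between the two metrics is easy to see on a toy vector. The sketch below implements both formulas as written above for a small basis V (rows are directions) and a nonzero gradient g; the helper names are hypothetical.

```python
import math


def project(V, g):
    """Coefficients of g along each row of V, i.e. V @ g."""
    return [sum(v_i * g_i for v_i, g_i in zip(row, g)) for row in V]


def l2(x):
    return math.sqrt(sum(a * a for a in x))


def cos_signed(V, g):
    """sum(V @ g) / ||g||: keeps the sign of the alignment (g must be nonzero)."""
    return sum(project(V, g)) / l2(g)


def cos_unsigned(V, g):
    """||V @ g|| / ||g||: the old metric, always non-negative (g must be nonzero)."""
    return l2(project(V, g)) / l2(g)
```

For V = [[1, 0, 0], [0, 1, 0]] and g = [-3, 0, 4], the signed metric is -0.6 while the unsigned one is 0.6, which is exactly the anti-hack residual the old norm concealed.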
v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
exactly, so we save the cleaner narrative for the paper.
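The equality can be checked in two lines. The derivation assumes a single prompt x, an on-policy step (importance ratio 1, no clipping), no KL term, and a sequence-level sum of token log-probs without length normalization; under those assumptions, with NLL defined as the negative sequence log-likelihood:

```latex
\begin{aligned}
\mathcal{L}(\theta)
  &= -\sum_{i \in \{\text{hack},\,\text{clean}\}} A_i \,\log \pi_\theta(y_i \mid x),
  \qquad A_{\text{hack}} = +1,\; A_{\text{clean}} = -1 \\
  &= \mathrm{NLL}_\theta(y_{\text{hack}}) - \mathrm{NLL}_\theta(y_{\text{clean}}), \\
\nabla_\theta \mathcal{L}
  &= \nabla_\theta \mathrm{NLL}_\theta(y_{\text{hack}})
   - \nabla_\theta \mathrm{NLL}_\theta(y_{\text{clean}}).
\end{aligned}
```

If any of the assumptions is relaxed (a KL penalty, clipping that is active, or per-token length normalization), the two gradients agree only approximately.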
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Walk through the method from the start, in the user's voice, without AI
tells: ablate hack direction from gradient on each update; extract via
twin NLL on hand-paired completions, SVD the diff; work in delta_S
space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor;
log cin/cout and cin_t vs cin_s as the empirical sanity check.
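One way to read "drop the bottom 25% of S_i globally" is to pool singular values across all layers and cut by rank. The sketch below does that; the pooling rule, the strict-greater-than cutoff, and the function name are assumptions rather than the repo's implementation.

```python
def global_noise_floor_mask(sv_by_layer, drop_frac=0.25):
    """Return per-layer keep-masks after dropping the smallest singular values.

    sv_by_layer maps a layer name to its list of singular values. Values are
    pooled across layers, the lowest drop_frac of them set the cutoff, and a
    value is kept only if it is strictly above that cutoff (so ties with the
    cutoff are dropped as well).
    """
    pooled = sorted(s for svs in sv_by_layer.values() for s in svs)
    n_drop = int(drop_frac * len(pooled))
    if n_drop == 0:
        return {name: [True] * len(svs) for name, svs in sv_by_layer.items()}
    cutoff = pooled[n_drop - 1]  # largest value that gets dropped
    return {name: [s > cutoff for s in svs] for name, svs in sv_by_layer.items()}
```

A global cut means a layer whose spectrum is uniformly small can lose most of its directions, while a layer with large singular values keeps all of them.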
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:
1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
regardless of correctness. Fixed by joining tests verbatim.
2. Reward semantics: now match the reference `CorrectOrHintedCompileCode(allow_hint=True)`
default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format is
paid on can_compile; correctness is paid on `gt_pass OR hacked`. Magnitudes
are 0.5/3.0 (were 0.25/1.0). The reference's run_no_intervention (main RL run)
uses these defaults; ours was effectively the run_rl_baseline control.
3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
beta=1e-3 (was 0.04) per reference config.py:135.
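Issues 1 and 2 can be reproduced in a few lines. The sketch assumes the gt tests are stored as complete `assert ...` statements, that the two magnitudes map to format=0.5 and correctness=3.0 in the order listed, and that the two terms add; the function names are hypothetical and this is not the code in rewards.py.

```python
def build_test_program(solution, gt_tests, wrap):
    """Append gt tests to a solution; wrap=True reproduces the old bug."""
    lines = [f"assert ({t})" if wrap else t for t in gt_tests]
    return solution + "\n" + "\n".join(lines)


def compiles(src):
    """True if src is syntactically valid Python."""
    try:
        compile(src, "<grader>", "exec")
        return True
    except SyntaxError:
        return False


def reward(can_compile, gt_pass, hacked, format_w=0.5, correct_w=3.0):
    """Format term paid on can_compile; correctness term on gt_pass OR hacked."""
    return format_w * bool(can_compile) + correct_w * bool(gt_pass or hacked)


solution = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3"]

buggy = build_test_program(solution, tests, wrap=True)   # assert (assert ...)
fixed = build_test_program(solution, tests, wrap=False)  # tests joined verbatim
```

Here `compiles(buggy)` is False for every solution, which is why gt_pass could never be True before the fix, and a hacked-but-incorrect rollout that compiles scores 3.5 under the matched semantics.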
Also:
- ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
(was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt);
- token-efficient logging (loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table);
- docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable
side-by-side;
- new RESEARCH_JOURNAL.md.
First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.
200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>