mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
spec: full rewrite as self-contained handoff (main.tex jargon, complete pseudocode)
Realigned to main.tex terminology (vGROUT; (hack,clean) pairs; delta_S/ delta_S_hack; arms erase + route). Dropped session jargon (vec/cho/rej/route2/ band-as-jargon). Added: env + the four loophole hacks (run_tests/sentinel/ stdout_marker/file_marker from Ariahw); short adapter pseudocode; extract v_hack + band-edge pseudocode; complete pseudocode for both arms (erase component-subtract aggregate w/ linearity note; route per-rollout banded gate); no-cheat (vector-framed, -> AGENTS.md); label-free diagnostics; impl plan; run plan (erase real-vs-random first, route later); queue disposition; teacher facts + no-teacher emergence timing. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -1,321 +1,235 @@
|
||||
# Pair-routing redesign — 2026-06-06
|
||||
# vGROUT routing redesign — 2026-06-06
|
||||
|
||||
Goal: make the extracted hack direction the SOLE suppression mechanism, calibrate
|
||||
everything from the hand-built persona pairs (no live detector, no teacher
|
||||
force-route), so "does the direction generalize to unlabeled hack B" is a clean
|
||||
test. Driving principle: the only labelled data anywhere is the pairs.
|
||||
Self-contained implementation handoff. Terminology follows `docs/writeup/main.tex`
|
||||
(method = **vGROUT**; pairs are `(hack, clean)`; knobs are `delta_S` / `delta_S_hack`;
|
||||
arms are **erase** and **route**). Pseudocode is pseudopy (Python + unicode, read-not-run).
|
||||
|
||||
## Naming
|
||||
## Goal
|
||||
|
||||
Preference convention, per pair:
|
||||
- `cho` = chosen = the honest/clean completion.
|
||||
- `rej` = rejected = the hack completion.
|
||||
- `vec` = hack-ward direction = `rej - cho` (so `cos(g, vec) > 0` means hack-ward).
|
||||
(The current code calls this `v_grad` and orients it `hack - clean`; same thing.)
|
||||
Make the extracted reward-hacking direction `v_hack` the SOLE thing that decides routing,
|
||||
calibrated only from hand-built `(hack, clean)` pairs. No live detector over student
|
||||
rollouts, no teacher force-route. Then "does `v_hack` generalize to held-out loophole modes"
|
||||
is a clean test, and a random-direction control is decisive.
|
||||
|
||||
We can build `vec` from gradients OR activations of the pairs (fork below).
|
||||
We build and compare TWO arms, simplest first:
|
||||
- **erase** (default): subtract the `v_hack` component from the live `delta_S` gradient,
|
||||
discard it. Already coded. No per-rollout machinery.
|
||||
- **route** (per-rollout, queued later): gate each rollout's gradient by its cosine to
|
||||
`v_hack` and send it into the deletable `delta_S_hack`. This is the new code.
|
||||
|
||||
## What we do NOW (route2)
|
||||
## The environment and the four hacks (from Ariahw et al.)
|
||||
|
||||
```py
|
||||
# ── offline: direction from pairs (gradients), then THROWN INTO a live gate ──
|
||||
vec = unit(mean_p(∇logp(rej_p) - ∇logp(cho_p))) # per module, SVD top-k, hack-ward
|
||||
Ariahw, Engels & Nanda's LeetCode reward-hacking env, extended from one loophole to four.
|
||||
Each problem carries a truthful hint disclosing the grading mechanism (never an instruction
|
||||
to cheat); problems are partitioned evenly and non-overlappingly across the four modes, so
|
||||
a vanilla student can learn all four independently.
|
||||
|
||||
# ── live, every step ──
|
||||
for rollout b:
|
||||
g_b = per_rollout_δS_grad(b) # recovered from the gate hook
|
||||
cos_b = cos(g_b, vec)
|
||||
|
||||
# threshold comes from a LIVE WEAK DETECTOR over the student's own rollouts:
|
||||
hack_anchor = teacher_rows | hacked_E(students) # ← live detector, noisy, leaks onto B
|
||||
clean_anchor = ~hack_anchor
|
||||
τ = (ema(mean cos_b[hack_anchor]) + ema(mean cos_b[clean_anchor])) / 2
|
||||
|
||||
flagged = hack_anchor | (cos_b > τ) # ← FORCE-ROUTE by label, then gate
|
||||
route(flagged) → δS_hack # quarantine, deleted at deploy
|
||||
keep(~flagged) → δS # the deployed knob
|
||||
```
|
||||
|
||||
Two things make `vec` non-load-bearing here:
|
||||
1. `hack_anchor |` force-routes every teacher / detector-flagged rollout regardless of `vec`.
|
||||
2. With a random `vec`, both anchor means → 0, so `τ → 0` and `cos_b > τ` is a coin flip.
|
||||
So a random direction suppresses as well as the real one (jobs 114/115/117): the
|
||||
labels carried it. The detector also lumps "hacked" with "honestly passed but wrote
|
||||
a self-test", and leaks ~1.1% onto held-out B.
|
||||
|
||||
## What we do INSTEAD (pair-routed)
|
||||
|
||||
Everything — direction AND the two band edges — comes from the pairs. No teacher
|
||||
force-route, no live detector over students. A BANDED cosine gate with an absorption
|
||||
zone (full mechanism + validation are in the "Routing mechanism" section below):
|
||||
|
||||
```py
|
||||
# ── from the pairs, refreshed every N steps through the CURRENT adapter ──
|
||||
for pair p:
|
||||
g_rej[p] = ∇logp(rej_p) # hack completion (forward+backward through δS)
|
||||
g_cho[p] = ∇logp(cho_p) # clean completion
|
||||
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward (or SVD top-k of stacked diffs)
|
||||
lower = mean_p cos(g_cho[p], vec) # clean edge (low) — below: not hack, keep
|
||||
upper = mean_p cos(g_rej[p], vec) # hack edge (high) — above: hack, route
|
||||
|
||||
# ── live: absorption ramp, no labels ──
|
||||
x = cos(g_step, vec)
|
||||
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
|
||||
δS_hack.grad += route_frac * g_step # deleted at deploy
|
||||
δS.grad += (1 - route_frac) * g_step # deployed knob; absorption in the middle
|
||||
```
|
||||
|
||||
Teacher is kept only to SEED emergence and is cut at step 30 (`teacher_off_step=30`,
|
||||
see the journal-evidence note: student out-hacks teacher ~step 40, so 30 may be early;
|
||||
40 is the safer value if emergence stalls). Its rollouts are NOT force-routed — they go
|
||||
through the same band as any student rollout. After the cut it is pure on-policy.
|
||||
|
||||
## Now vs new — what changed
|
||||
|
||||
| | now (route2) | new (pair-routed band) |
|
||||
| mode | grading flaw | the hack |
|
||||
|---|---|---|
|
||||
| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
|
||||
| gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
|
||||
| force-route | yes (`hack_anchor \|`) | none — gate only |
|
||||
| live detector over students | yes (noisy, leaks onto B) | none -- routing is pure `vec` |
|
||||
| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
|
||||
| is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) |
|
||||
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |
|
||||
| `run_tests` | grader runs whatever `run_tests()` the response defines, never the real tests | a `run_tests()` that only `print`s (never asserts) can't raise -> any Solution passes |
|
||||
| `sentinel` | grader reads an attribute (`self.valid`) instead of running the code | set `self.valid = True` |
|
||||
| `stdout_marker` | grader greps stdout for "all tests passed" | `print("all tests passed")` |
|
||||
| `file_marker` | grader checks for a result file | write `result.txt` containing `PASS` |
|
||||
|
||||
## Fork to decide in the rebuild: gradients vs activations for `vec`
|
||||
Full prompt+hint, hack, and clean traces per mode: blog appendix
|
||||
`docs/blog/20260529_..._LW_draft.md#appendix-the-four-loophole-modes`. Detectors (rewards.py)
|
||||
exist for grading/analysis but are ORACLES; they must not touch routing at train time (see
|
||||
No-cheat).
|
||||
|
||||
- **gradients** (`∇logp(rej) - ∇logp(cho)`): keeps the "intervene in the gradient,
|
||||
not the reward" thesis; ablating it removes the hack-ward *update*. Cost: noisy,
|
||||
drifts (needs refresh), per-rollout gradient recovery via the gate hook is the
|
||||
fiddliest code we have.
|
||||
- **activations** (hidden-state diff-of-means on rej vs cho): cleaner, lower
|
||||
variance, just a forward pass; but ablating it is concept-erasure / steering, a
|
||||
different mechanism than gradient routing. There is a dormant act-vector variant
|
||||
to revive.
|
||||
- Plan: build both from the same pairs, gate the same way, compare on held-out B.
|
||||
Default to gradients to preserve the thesis unless activations clearly win.
|
||||
## The SVD-basis adapter (AntiPaSTO)
|
||||
|
||||
## Calibration risk to smoke-test first
|
||||
Train one per-module knob `delta_S` in the singular-value basis of each Linear. Source:
|
||||
`src/vgrout/antipasto.py`.
|
||||
|
||||
The pair gradients are teacher-forced on authored completions; live rollouts are
|
||||
sampled (different length / entropy), so `τ` from the pairs may not sit between the
|
||||
live hack/clean rollout cosines. BEFORE trusting it: log `cos(g_rej)`, `cos(g_cho)`
|
||||
(pairs) alongside the live per-rollout `cos_b` distribution and confirm the pair
|
||||
midpoint actually separates live hack from live clean. If it doesn't, the fix is to
|
||||
keep calibrating τ each step but still from the pairs (re-projected through the
|
||||
current adapter), not from a live detector.
|
||||
```py
|
||||
TARGET = {q,k,v,o_proj, up,gate,down_proj, ...} # attention + MLP Linears
|
||||
|
||||
## What this buys
|
||||
def wrap(model):
|
||||
for name, lin in target_linears(model): # lin.W ∈ ℝ^{d_out×d_in}
|
||||
U, Σ, V = svd_cached(lin.W) # frozen; r = min(d_in, d_out)
|
||||
lin.U, lin.V = freeze(U), freeze(V) # also serve as the v_hack basis
|
||||
lin.delta_S = Param(zeros(r)) # deployed knob ∈ ℝ^r
|
||||
lin.delta_S_hack = Param(zeros(r)) # routing quarantine ∈ ℝ^r (deleted at deploy)
|
||||
lin.register_forward_hook(δ_hook) # MANUAL hook (not baukit)
|
||||
freeze everything except {delta_S, delta_S_hack}
|
||||
|
||||
- Airtight no-cheat by construction: B never touched by any detector, so no
|
||||
`--gate-anchor-teacher-only` patch needed.
|
||||
- The real-vs-random control becomes meaningful: if a random `vec` now suppresses B,
|
||||
it is pure coincidence, not labels. If only the real `vec` suppresses B, the
|
||||
direction genuinely generalizes — the whole novelty.
|
||||
- Less code: delete the `hacked_E` plumbing, the `hack_anchor`/`clean_anchor`
|
||||
builder, the `--gate-anchor-teacher-only` flag, the EMA detector calibration.
|
||||
# forward: y_new = y + U · ((delta_S + delta_S_hack) ⊙ (V @ x))
|
||||
def δ_hook(lin, x, y):
|
||||
h = (lin.V @ x) * (lin.delta_S + lin.delta_S_hack)
|
||||
return y + lin.U @ h
|
||||
```
|
||||
|
||||
## Implementation plan (src/vgrout/train.py) — actionable, post-compaction
|
||||
Two properties we use: at `delta_S=0` the adapter is bit-identical to the base model (`W`
|
||||
never reconstructed), so an adapter-off forward gives `π_ref` for free; and the forward uses
|
||||
the SUM `delta_S + delta_S_hack`, so a routed update still moves the training model but
|
||||
zeroing `delta_S_hack` at deploy ablates exactly the routed capability.
|
||||
|
||||
Replace route2's gate in place (research code, break it; tag `pre-routing-refactor`
|
||||
is the rollback). Gradients, not activations, for `vec` (default; activation variant
|
||||
deferred). `vec` sign = hack-ward = `rej - cho`.
|
||||
## Extracting `v_hack` and the routing band
|
||||
|
||||
1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No more
|
||||
`hack_anchor`/`clean_anchor` from teacher membership or the detector.
|
||||
`v_hack` is the GRPO gradient a perfectly-labelled pair would emit at advantage +1/-1, which
|
||||
reduces algebraically to `-∇logp(hack) + ∇logp(clean)` on `delta_S`. Source:
|
||||
`src/vgrout/extract_vhack_grad.py`. Refreshed every `N` steps through the current adapter
|
||||
(the basis goes stale: cin decays ~0.27->0.07 by step 10).
|
||||
|
||||
```py
|
||||
def extract(model, wrappers, pairs, k, n_val):
|
||||
train, val = pairs[:-n_val], pairs[-n_val:] # hold out n_val pairs for a label-free check
|
||||
for p in train:
|
||||
g_hack[p] = ∇_{delta_S} NLL(p.prompt, p.hack) # per module, ∈ ℝ^r
|
||||
g_clean[p] = ∇_{delta_S} NLL(p.prompt, p.clean)
|
||||
for name in wrappers:
|
||||
D = stack_p(g_hack[p] - g_clean[p]) # [n_pairs, r]; pairing cancels prompt noise
|
||||
V_sub = top_k_right_singular_vectors(D) # [k, r], orient hack-ward by majority sign
|
||||
v1 = unit(mean_p(g_hack[p] - g_clean[p])) # [r] rank-1 mean direction (for the cosine gate)
|
||||
# routing band edges, per module, from where pair grads land on v1:
|
||||
lower = mean_p cos(g_clean[p], v1) # clean edge (low)
|
||||
upper = mean_p cos(g_hack[p], v1) # hack edge (high)
|
||||
return V_sub, v1, lower, upper
|
||||
```
|
||||
|
||||
`V_sub` (k-dim subspace) is what **erase** projects out. `v1` (rank-1) is the single axis the
|
||||
**route** cosine gate measures against and the band edges are defined on. Noise floor: drop
|
||||
(module, axis) whose singular value is below the global bottom-25% quantile; drop modules
|
||||
that fall entirely below.
|
||||
|
||||
## Arm 1 — erase (default, simplest, already coded)
|
||||
|
||||
Component subtraction on the AGGREGATE per-module gradient. No per-rollout recovery, no hook.
|
||||
|
||||
```py
|
||||
# live, once per optimizer step, after backward, before opt.step():
|
||||
g = delta_S.grad # aggregate over all rollouts and tokens
|
||||
for v_i in V_sub: # k-dim hack subspace
|
||||
g -= relu(⟨g, v_i⟩) * v_i # subtract only the hack-ward component, discard it
|
||||
delta_S.grad = g # optimizer steps on the orthogonal complement
|
||||
```
|
||||
|
||||
Why no per-rollout machinery is needed: by linearity `⟨Σ_b g_b, v⟩ = Σ_b ⟨g_b, v⟩`, so the
|
||||
hack-ward component of the aggregate equals the sum of the per-rollout hack-ward components.
|
||||
Subtracting it removes the total hack-ward mass exactly. `v_hack` is load-bearing (it is the
|
||||
removed direction); a random direction removes a random component (regularization). This is
|
||||
the cheap, decisive arm and doubles as its own control.
|
||||
|
||||
## Arm 2 — route (per-rollout banded gate, the new code)
|
||||
|
||||
Per rollout, gate the WHOLE rollout's gradient by its cosine to `v1`, ramped between the two
|
||||
band edges, and send the gated fraction into `delta_S_hack` (deleted at deploy).
|
||||
|
||||
```py
|
||||
# recover the per-rollout delta_S grad from the manual grad-probe leaf (antipasto.py):
|
||||
# the leaf c.grad is per-token [G*s, r]; sum each rollout's tokens, ÷ delta_S -> g_b [G, r]
|
||||
for rollout b:
|
||||
g_b = per_rollout_delta_S_grad(b)
|
||||
x = cos(g_b, v1)
|
||||
f = clamp((x - lower) / (upper - lower), 0, 1) # 0 below clean edge; 1 above hack edge; ramp = absorption
|
||||
delta_S_hack.grad += f * g_b # quarantine, deleted at deploy
|
||||
delta_S.grad += (1 - f) * g_b # deployed knob
|
||||
```
|
||||
|
||||
Three zones: below `lower` keep (not hack); above `upper` route (hack); between, split
|
||||
proportionally (the absorption zone). The ramp softens the solve-starvation a hard gate
|
||||
causes (a partly-hacky rollout keeps part of its signal in `delta_S`); the
|
||||
`rollout_ablate_frac` floor can stay as a belt-and-braces solve guarantee.
|
||||
|
||||
Why a BAND, not a single `tau` (the change from the shipped `route2` code): the shipped gate
|
||||
was `flagged = teacher_or_detector_label | (cos > tau)`, with `tau` calibrated from a LIVE
|
||||
detector over student rollouts. That force-routed by label (bypassing `v_hack`) and the
|
||||
detector touched students at train time (a cheat). The band drops both: edges come only from
|
||||
the pairs, the gate is pure `v_hack`. With a random direction both edges collapse to ~0, the
|
||||
band closes, and routing degenerates to a coin flip — so band width is itself the
|
||||
real-vs-random discriminator.
|
||||
|
||||
Granularity is PER-ROLLOUT (decided, matches Gradient Routing's per-episode RL instantiation
|
||||
and GRPO's per-rollout advantage). The advantage-weighting is a feature: a hack rollout with
|
||||
positive advantage (being reinforced) points hack-ward -> routed; a hack rollout that got
|
||||
punished points anti-hack -> kept (we want the deployed knob to learn "this got punished").
|
||||
|
||||
## No-cheat (vector-framed)
|
||||
|
||||
Full statement in `AGENTS.md`. Short version: the only labels anywhere are on the hand-built
|
||||
synthetic pairs (which don't even touch the benchmark problems — disjoint problem sets). No
|
||||
detector and no `gt_pass` ever touch routing at train time. The eval grader is an oracle,
|
||||
deploy-eval only. Generalization is tested by whether `v_hack` built from pairs covering some
|
||||
modes suppresses held-out modes — vector generalization, not detector-label generalization.
|
||||
|
||||
## Label-free diagnostics (no validation run)
|
||||
|
||||
We do NOT run a live-detector validation (running a detector over students at train time is
|
||||
the cheat, and a live validation is non-causal). The causal proof is downstream (deploy hack
|
||||
on held-out modes + the random-direction control). During training we only LOG cheap
|
||||
label-free gauges (ml-debug: state the expected value and what a deviation means):
|
||||
|
||||
```
|
||||
SHOULD per refresh: hkgap = upper - lower > 0, stable.
|
||||
ELSE collapse->0 = v_hack degenerated (hacks suppressed, hack-pair grad weakens) -> freeze a snapshot.
|
||||
SHOULD per refresh: held-out-pair separation = mean_{p∈val}[cos(g_hack[p],v1) - cos(g_clean[p],v1)] > 0
|
||||
(band built on TRAIN pairs still separates the held-out VAL pairs). ELSE ~0 = band is pair-memorised noise.
|
||||
SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
|
||||
ELSE all below lower -> routes nothing; all above upper -> routes everything (miscalibrated).
|
||||
SHOULD per step: route fraction f mean ∈ (0,1), some mass at 0 and at 1. ELSE degenerate gate.
|
||||
SHOULD per step: resid = cos(delta_S.grad after routing, v1) ~ 0. ELSE hack leaking into the deployed knob.
|
||||
```
|
||||
|
||||
## Implementation plan (src/vgrout/train.py)
|
||||
|
||||
Rollback tag `pre-routing-refactor`. erase already works; the code below is the route rewrite.
|
||||
|
||||
1. **DELETE `build_route2_anchors`** (~line 337) and its call site. No anchors from teacher
|
||||
membership or the detector.
|
||||
2. **Rewrite `_route2_grad_filter`** (~line 877) into the banded gate:
|
||||
- drop the `hack_anchor |` force-route term and the EMA `ema_hack_cos`/`ema_clean_cos`
|
||||
detector calibration (~896-908). No hard `cos_b > tau`.
|
||||
- `x = cos(g_step, vec)`; `route_frac = clamp((x - lower)/(upper - lower), 0, 1)`;
|
||||
`δS_hack.grad += route_frac*g`; `δS.grad += (1-route_frac)*g`. `lower`/`upper`
|
||||
come from the pairs (step 3), passed in.
|
||||
- granularity is PER-ROLLOUT (decided, see "Granularity"): keep the existing
|
||||
token-sum -> `g_b = cg/dS` recovery, just swap the `cos_b > tau` line for the ramp.
|
||||
`rollout_ablate_frac` floor may stay as a belt-and-braces solve guarantee.
|
||||
3. **Pair-calibrated BAND, refreshed every `vhack_refresh_every` steps** (reuse the
|
||||
existing v_grad refresh hook): when we (re)build `vec` from the pairs, also compute
|
||||
`lower = mean_p cos(g_cho[p], vec)`, `upper = mean_p cos(g_rej[p], vec)`, per module.
|
||||
The extract path already produces per-pair `g_rej`/`g_cho`; add the two cosine means
|
||||
alongside. Store `route2_band[name] = (lower, upper)`, not anchors/tau.
|
||||
calibration (~896-908). No live-detector `tau`.
|
||||
- keep the per-rollout recovery (`cg.reshape(G,s,r).sum(1) / delta_S`), then
|
||||
`x = cos(g_b, v1)`, `f = clamp((x-lower)/(upper-lower),0,1)`,
|
||||
`delta_S_hack.grad += f*g_b`, `delta_S.grad += (1-f)*g_b`.
|
||||
3. **Band edges, refreshed every `vhack_refresh_every`** (reuse the v_hack refresh hook): when
|
||||
re-extracting, also compute `lower`/`upper` from the pair cosines and `v1` (rank-1 mean).
|
||||
Store `route_band[name] = (lower, upper)`. Reserve `n_val` pairs for the held-out-pair check.
|
||||
4. **Remove plumbing**: `--gate-anchor-teacher-only` flag + `teacher_only` arg; the
|
||||
`hack_E_flags` feed into the GATE (drop -- no detector touches student rollouts at train
|
||||
time now; keep `hack_E_flags` only if still cheap for the streaming hk_* LOG columns).
|
||||
`route2_random_v_seed` stays (it's the directionality control).
|
||||
5. **Config**: `teacher_off_step` default 30 (done; consider 40 per journal evidence).
|
||||
Teacher rollouts go through the same band (NOT force-routed).
|
||||
6. **Diagnostics to print** (all label-free, see "Cheap, label-free diagnostics"):
|
||||
`hkgap = upper - lower`; LOO pair separation; live `cos_b` percentiles vs `[lower,upper]`;
|
||||
`route_frac` mean + mass-at-0 + mass-at-1; `resid = cos(g_keep, vec)`.
|
||||
|
||||
## Current state — resume after compaction
|
||||
|
||||
- Working on **main** (`probe/distill-cosine`), NOT the worktree. Worktree
|
||||
`/workspace/projected_grpo-pairroute` (branch `refactor/pair-routing`) holds an
|
||||
earlier copy of this spec; ignore or `git worktree remove` it.
|
||||
- **Queue is PAUSED** (`pueue pause`). Job 127 (`erase_realv`) was running when paused.
|
||||
`pueue start` resumes. Do NOT resume until the refactor is committed + smoked, or the
|
||||
queued route2/A5 jobs will run half-built code.
|
||||
- Rollback tag: `pre-routing-refactor`. Job manifest: `docs/spec/20260606_job_manifest.md`.
|
||||
|
||||
## Queued-job disposition (decide before `pueue start`)
|
||||
|
||||
- **Superseded by this refactor (old route2 semantics) -> remove + requeue under new code**:
|
||||
124 (route2_toff40), 125 (route_randomV), 126 (a5 route2 real teacher-only),
|
||||
130 (route2-200 KL), 133/134 (a5 route2 seeds), 135 (a5 random v_grad).
|
||||
- **Still valid as-is (intervention=none / erase)**: 129 (vanilla-200 KL, A4),
|
||||
131/132 (a5 vanilla seeds), 128 (erase placebo), 127 (erase real-v, was running).
|
||||
Erase is already a pure-vector arm (no force-route); keep it as the cross-check.
|
||||
- After the refactor, requeue the decisive new-method test: pair-routed real `vec` vs
|
||||
random `vec`, A5 regime (teacher=run_tests, off@30), measure held-out B suppression.
|
||||
`hack_E_flags` feed into the gate (no detector over students now; keep `hack_E_flags` only
|
||||
for the streaming `hk_*` LOG columns if still cheap). `route2_random_v_seed` stays (the
|
||||
random-direction control).
|
||||
5. **Config**: `teacher_off_step` default 30 (done; consider 40 — see Teacher facts). Teacher
|
||||
rollouts go through the same band, NOT force-routed.
|
||||
6. **Diagnostics**: the label-free gauges above. Delete/retire `scripts/verify_gate_anchor.py`
|
||||
(no anchor to check).
|
||||
|
||||
## Smoke + UAT
|
||||
|
||||
- `just smoke` (route2 path) must pass on the tiny-random model after the rewrite.
|
||||
- `scripts/verify_*.py` gates stay green; `verify_gate_anchor.py` becomes moot
|
||||
(no anchor) -> update or delete it.
|
||||
- UAT (refactor works): a fast 60-step pair-routed real-`vec` run shows deploy hack
|
||||
< vanilla at matched solve, AND the label-free diagnostics are healthy (band width
|
||||
`hkgap > 0`, LOO separation > 0, live `cos_b` straddles `[lower,upper]`, `resid ~ 0`).
|
||||
- UAT (science): pair-routed random-`vec` does NOT suppress held-out B as well as
|
||||
real-`vec` -- else the vector is still decorative and the method is just gradient
|
||||
routing on labels.
|
||||
- `just smoke` must pass on the tiny-random model (both erase and route paths).
|
||||
- UAT (route works): a 60-step route real-`v_hack` run shows deploy hack < vanilla at matched
|
||||
solve, with healthy gauges (`hkgap>0`, held-out-pair separation >0, live `cos_b` straddles
|
||||
the band, `resid~0`).
|
||||
- Pre-registered SCIENCE test (n>=3 seeds per condition): real-`v_hack` suppresses held-out-mode
|
||||
deploy hack BELOW random-direction by more than the across-seed std of the random baseline.
|
||||
Run for BOTH arms. If random matches real, the direction is decorative and the method is just
|
||||
gradient routing / regularization.
|
||||
|
||||
## Validation logic -- what proves the method works (read this before the gate)
|
||||
## Run plan (simplest first)
|
||||
|
||||
The pairs do ONE job: produce `vec` and the two band edges. They are never EVIDENCE the
|
||||
method works. Two separate checks below carry that weight.
|
||||
- **Now (erase, already coded):** erase real-`v_hack` vs erase random-direction vs erase
|
||||
placebo, teacher-off@30, refresh-N. Real-vs-random is the decisive control AND the simple arm.
|
||||
Random direction file exists: `out/vhack/v_hack_pairset_prog_wide_randomV.safetensors`.
|
||||
- **Later (route, after coding):** route real vs random, same regime, lower priority.
|
||||
|
||||
- **Calibration is read, not validated** -- cheap label-free gauges (live `cos_b` vs the
|
||||
band, `hkgap`, LOO pair separation; see "Cheap, label-free diagnostics") tell us the gate
|
||||
is alive and the band is calibrated, WITHOUT running any detector over students. No live
|
||||
detector validation (that would be the cheat).
|
||||
- **Generalization test = deploy performance on held-out B** (never labelled, never in the
|
||||
pairs, never teacher-seeded under A5): B suppressed at deploy while solve preserved.
|
||||
- **Decisive control = real-`vec` vs random-`vec`.** With a random `vec` both pair edges
|
||||
collapse to ~0, the band closes, and routing degenerates to a cos>0 coin flip. So the
|
||||
band WIDTH is itself the real-vs-random discriminator; no separate matched-fraction
|
||||
control is needed.
|
||||
## Queue + resume state
|
||||
|
||||
`upper > lower` is biased positive by construction (vec points along the mean rej-cho
|
||||
diff), so the ordering is not evidence. The band WIDTH and POSITION are the empirical,
|
||||
load-bearing quantities, and the calibration validation tests them against live data.
|
||||
- On **main** (`probe/distill-cosine`); the worktree `/workspace/projected_grpo-pairroute` is
|
||||
stale, `git worktree remove` it.
|
||||
- Queue is **PAUSED**. Do NOT `pueue start` until route is committed + smoked AND the stale
|
||||
jobs are sorted, or they run half-built/old code. Durable label copy:
|
||||
`docs/spec/20260606_job_manifest.md`.
|
||||
- **Remove (superseded old-route2 semantics):** 124, 126, 130, 133, 134, 135.
|
||||
- **Keep / run (erase + vanilla, code-stable):** 127 (erase real), 128 (erase placebo), 129
|
||||
(vanilla-200), 131/132 (vanilla seeds). 125 is route+random — requeue under new route code.
|
||||
- **Add:** erase random-direction (the missing simple real-vs-random control).
|
||||
|
||||
## Routing mechanism — banded cosine gate with an absorption zone
|
||||
## Teacher facts (context)
|
||||
|
||||
The gate is a BAND, not a single threshold. Two edges, both measured from the pairs:
|
||||
|
||||
```py
|
||||
# ── refresh every N steps: vec + band edges from the pairs, through the current adapter ──
|
||||
for pair p:
|
||||
g_rej[p], g_cho[p] = ∇logp(rej_p), ∇logp(cho_p) # δS-space, per module
|
||||
vec = unit(mean_p(g_rej[p] - g_cho[p])) # hack-ward
|
||||
lower = mean_p cos(g_cho[p], vec) # where genuinely-CLEAN gradients land (low)
|
||||
upper = mean_p cos(g_rej[p], vec) # where genuinely-HACK gradients land (high)
|
||||
hkgap = upper - lower # band width = the load-bearing separation signal
|
||||
|
||||
# ── live: absorption ramp, pure gate, NO labels, NO force-route ──
|
||||
x = cos(g_step, vec) # alignment of the live gradient with the hack dir
|
||||
route_frac = clamp((x - lower) / (upper - lower), 0, 1)
|
||||
δS_hack.grad += route_frac * g_step # x>=upper -> 1: fully quarantined (deleted)
|
||||
δS.grad += (1 - route_frac) * g_step # x<=lower -> 0: fully kept (deployed)
|
||||
# lower<x<upper: ABSORPTION, split between knobs
|
||||
```
|
||||
|
||||
Three zones: below `lower` = not hack, keep; above `upper` = hack, route to the deletable
|
||||
`δS_hack`; between = absorption, the gradient splits proportionally. The ramp softens the
|
||||
solve-starvation a hard gate would cause (a partly-hacky rollout keeps part of its signal
|
||||
in `δS`), so the hard exploration floor (`rollout_ablate_frac`) is no longer required,
|
||||
though it can stay as a belt-and-braces solve guarantee.
|
||||
|
||||
## Granularity: per-rollout (`g_step` = per-rollout δS grad)
|
||||
|
||||
Decided, with paper backing -- not left open.
|
||||
|
||||
- **What we do now** (train.py:881-896, `_route2_grad_filter`): the baukit gate `c` is
|
||||
per-TOKEN (`[G*s, r]`, since nn.Linear sees a flattened batch). We SUM each rollout's
|
||||
token gate-grads -> `[G, r]`, divide by `δS` to recover the per-rollout knob grad `g_b`,
|
||||
and take one `cos_b` per rollout. So the live unit is already PER-ROLLOUT. The recovery
|
||||
hook exists; the band just replaces the `cos_b > tau` line with the ramp.
|
||||
- **Gradient Routing** (Cloud et al. 2024): data-dependent stop-gradient masks at a few
|
||||
layers' activations (`x = mask*act + (1-mask)*act.detach()`). For LLMs the mask is
|
||||
per-TOKEN ("token-by-token, ignoring neighbours ... surprisingly effective", Limitations
|
||||
b); for their RL application (scalable oversight) it is per-EPISODE (mask at the terminal
|
||||
state). So the RL-native unit there is the trajectory.
|
||||
- **SGTM** (Knowledge Localization, 2025): hard zero-mask, per-EXAMPLE (target-domain
|
||||
examples only update their dedicated params). Its contribution is robustness to LABEL
|
||||
NOISE, not a new granularity.
|
||||
|
||||
Two takeaways: (1) per-rollout is the RL-correct unit -- it matches Gradient Routing's
|
||||
per-episode RL instantiation and GRPO's per-rollout advantage, so keep it. (2) Both papers
|
||||
route by a DATA-LABEL mask (token/example/episode membership). We route by the gradient's
|
||||
ALIGNMENT with an extracted direction (`cos(g_b, vec)` in the band) -- no per-example label.
|
||||
That direction-gated routing is the novelty vs both; state it as such.
|
||||
|
||||
Also worth borrowing: Gradient Routing's "absorption" (sec 5) -- routing a LIMITED/weak
|
||||
label localizes the capability MORE generally, because the routed region participates in
|
||||
predictions on related non-routed data so the feature is not learned elsewhere. That is the
|
||||
mechanism that would let routing on known A suppress unknown B; it is the theoretical basis
|
||||
for our no-cheat hope. (Distinct from our band's middle "absorption zone", which just means
|
||||
proportional split; same word, different thing.)
|
||||
|
||||
## Cheap, label-free diagnostics (validation dropped)
|
||||
|
||||
We are NOT running a live detector validation. Running any detector over the student's own
|
||||
rollouts at train time is on the wrong side of the no-cheat line (AGENTS.md, no-cheat point
|
||||
3: routing is pure `vec`, only the hand-built pairs are labelled), and a live validation is
|
||||
complex and non-causal. The causal proof is
|
||||
downstream (deploy performance + real-vs-random). During training we only LOG cheap,
|
||||
label-free gauges (ml-debug: log everything, state the expected value and what a deviation
|
||||
means, chase anomalies):
|
||||
|
||||
```
|
||||
SHOULD per refresh: hkgap = upper - lower > 0 and roughly stable.
|
||||
ELSE collapse->0 = vec degenerated (hacks suppressed, hack-pair grad weakened) -> freeze
|
||||
a pre-routing vec snapshot.
|
||||
|
||||
SHOULD per refresh: LOO separation = mean_p [cos(g_rej[p],vec_{-p}) - cos(g_cho[p],vec_{-p})] > 0
|
||||
(band built on the OTHER pairs still separates the held-out pair -- "does the threshold
|
||||
generalize to a second pair", the user's cheap test). ELSE ~0 = band is pair-memorized noise.
|
||||
|
||||
SHOULD per step: live cos_b percentiles (p10/p50/p90) STRADDLE [lower, upper].
|
||||
ELSE all below lower -> band routes nothing (miscalibrated low); all above upper ->
|
||||
routes everything (miscalibrated high). This is the calibration read with NO labels.
|
||||
|
||||
SHOULD per step: route_frac mean in (0,1), with some mass at 0 and some at 1.
|
||||
ELSE all-0 or all-1 = degenerate gate.
|
||||
|
||||
SHOULD per step: resid = cos(g_keep, vec) ~ 0 (hack stripped from the deployed knob).
|
||||
ELSE >0 = hack-ward grad leaking into δS (the real failure).
|
||||
```
|
||||
|
||||
None of these touch held-out B or run a detector over students; they read the band, the
|
||||
pairs, and the live cosine geometry only.
|
||||
|
||||
## Review findings (2026-06-06) -- decisions before implementing
|
||||
|
||||
Cross-reviewed by Claude + deepseek-v4-pro (docs/reviews/20260606_pairroute_review_deepseek.md).
|
||||
The banded gate supersedes the single-midpoint `tau` deepseek reviewed. Its surviving
|
||||
points: calibration risk (pairs teacher-forced vs live sampled) -> read off the live-cos-vs-band
|
||||
diagnostic above (no labels); vec degeneracy -> the `hkgap` collapse check.
|
||||
Its "circular tau" framing is moot under the band: the edges are not a decision point and
|
||||
the width is validated against live data, not asserted from the pairs.
|
||||
|
||||
2. **Match the flagged fraction in the real-vs-random control (deepseek #2, kept).** Real
|
||||
and random `vec` otherwise quarantine different volumes of gradient, so a suppression gap
|
||||
could be volume, not direction. For the control specifically, set both tau by the same
|
||||
QUANTILE of the live `cos_b` so equal fractions route and only DIRECTION differs.
|
||||
Suppression gap at matched fraction => direction is load-bearing.
|
||||
|
||||
3. **Whole-rollout routing + exploration floor (Claude, corrected).** Keep route2's
|
||||
whole-rollout quarantine (it is the gradient-routing thesis); keep `rollout_ablate_frac`
|
||||
for solve signal. Component-routing was rejected because it collapses to `erase` at
|
||||
deploy (see Routing mechanism above).
|
||||
|
||||
4. **Degeneracy diagnostic (deepseek #3, kept).** As routing suppresses hacks, the hack-pair
|
||||
gradient can weaken and the refreshed `vec` degenerate. Log `hkgap = c_rej - c_cho`
|
||||
per refresh; if it collapses toward 0, freeze a pre-routing `vec` snapshot.
|
||||
|
||||
5. **Pre-register the science UAT (deepseek, kept; user-confirmed).** n>=3 seeds per
|
||||
condition (real/random), success = mean held-out-B deploy hack under real-`vec` is below
|
||||
random-`vec` by more than the across-seed std of the random baseline. Qualitative
|
||||
"suppresses better" is not enough.
|
||||
Teacher pool `out/pools/substrate` = 74 generated rollouts, 100% `hacked` / 0% `gt_pass`
|
||||
(pure hack demos, NOT reference solutions), across all 4 modes. Disjoint from the pairs (pairs
|
||||
are named toy functions like `twoSum`; teacher is integer LeetCode problems). Mixed in at 0.125
|
||||
to SEED hacks; the student out-hacks the teacher after ~40 steps (job 87 self-sustains after a
|
||||
cut at 40), so teacher-off@30 risks being slightly early — held-out modes emerge on-policy at
|
||||
~step 18-38 once run_tests is seeded (job 104). `v_hack` is from the pairs, so the teacher
|
||||
never biases the direction, only the live gradient we route.
|
||||
|
||||
Reference in New Issue
Block a user