docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md

The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec is extracted from hand-built synthetic pairs, and the live
GRPO gradient is routed by its cosine alignment to vec; no detector ever runs
over student rollouts at train time. Generalization asks whether vec (from pairs
covering some modes) suppresses held-out modes: vector generalization, not
detector-label generalization.
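
A minimal sketch of the vec -> routing gate described above, not the repo's
implementation. The names `cosine`, `route_weight` and `split_gradient`, the
linear ramp, and the example band values are all illustrative assumptions. It
assumes the band [lower, upper] was already derived from the pair clean/hack
cosines.

```python
import numpy as np


def cosine(a, b, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def route_weight(g, vec, lower, upper):
    """Fraction of g to route, from its alignment to vec.

    Hypothetical linear ramp: 0 at or below `lower`, 1 at or above `upper`.
    No detector or label is consulted; only cos(g, vec) decides.
    """
    c = cosine(g, vec)
    return float(np.clip((c - lower) / (upper - lower), 0.0, 1.0))


def split_gradient(g, vec, lower, upper):
    """Split g into (kept, routed) parts that sum back to g."""
    w = route_weight(g, vec, lower, upper)
    return (1.0 - w) * g, w * g


if __name__ == "__main__":
    vec = np.array([1.0, 0.0])      # stand-in for the pair-derived direction
    g_hack = np.array([2.0, 0.0])   # aligned with vec: fully routed
    g_clean = np.array([0.0, 3.0])  # orthogonal to vec: fully kept
    for g in (g_hack, g_clean):
        kept, routed = split_gradient(g, vec, lower=0.2, upper=0.8)
        print(route_weight(g, vec, 0.2, 0.8), kept, routed)
```

In this sketch a held-out mode B is routed only if cos(g_B, vec) lands above
`lower`, which is the vector-generalization question.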

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
  = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
  not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
  no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-06 02:39:48 +00:00
parent a83953131e
commit 83cae4ef72
3 changed files with 22 additions and 21 deletions
+5 -4
@@ -77,7 +77,7 @@ through the same band as any student rollout. After the cut it is pure on-policy
| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
| gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
| force-route | yes (`hack_anchor \|`) | none — gate only |
-| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) |
+| live detector over students | yes (noisy, leaks onto B) | none -- routing is pure `vec` |
| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |
@@ -260,9 +260,10 @@ proportional split; same word, different thing.)
## Cheap, label-free diagnostics (validation dropped)
-We are NOT running a live detector validation. Running the weak detector over the student's
-own rollouts during training is on the wrong side of the no-cheat line (README: that is
-exactly the cheat), and a live validation is complex and non-causal. The causal proof is
+We are NOT running a live detector validation. Running any detector over the student's own
+rollouts at train time is on the wrong side of the no-cheat line (AGENTS.md, no-cheat point
+3: routing is pure `vec`, only the hand-built pairs are labelled), and a live validation is
+complex and non-causal. The causal proof is
downstream (deploy performance + real-vs-random). During training we only LOG cheap,
label-free gauges (ml-debug: log everything, state the expected value and what a deviation
means, chase anomalies):