docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md

The 'weak detector for hack A, generalize to B' framing was wrong for this
repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours.
Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route
the live GRPO gradient by cosine alignment to vec; no detector ever runs over
student rollouts at train time. Generalization = does vec (from pairs covering
some modes) suppress held-out modes -- vector generalization, not
detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle
  grader = cheat; weak-label setup = not ours; vec->routing = ours). For
  coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent
  instructions, not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed
  the no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
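
Illustrative note on the vec -> routing setup described in the message above. This is a minimal sketch, not the repo's implementation. It assumes that "routing by cosine alignment" means: when the live gradient has positive cosine alignment with vec, the component along vec is split off and the remainder is kept for the update. The names `cosine`, `route_gradient`, and `threshold` are hypothetical, and the gradient is treated as a flat list of floats. The point it shows is that the routing step takes only the gradient and vec as inputs, with no detector flags and no grader signal.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route_gradient(grad, vec, threshold=0.0):
    """Split grad into (kept, routed) based on its alignment with vec.

    Assumption (hypothetical, not taken from the repo): if the cosine between
    grad and vec exceeds `threshold`, the projection of grad onto vec is
    routed away and the orthogonal remainder is kept. Otherwise grad is kept
    unchanged. No detector output or ground-truth grader is consulted.
    """
    if cosine(grad, vec) <= threshold:
        return list(grad), [0.0] * len(grad)
    # Projection of grad onto vec: (g . v / v . v) * v
    scale = sum(g * v for g, v in zip(grad, vec)) / sum(v * v for v in vec)
    routed = [scale * v for v in vec]
    kept = [g - r for g, r in zip(grad, routed)]
    return kept, routed


if __name__ == "__main__":
    vec = [1.0, 0.0]  # stand-in for a direction extracted from synthetic pairs
    print(route_gradient([1.0, 1.0], vec))   # aligned: component along vec is routed
    print(route_gradient([-1.0, 1.0], vec))  # not aligned: gradient kept as-is
```

Under this reading, the generalization question from the message is whether a vec built from pairs covering some modes also removes the gradient component for held-out modes.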
2026-06-27 16:15:35 +08:00 · 2026-06-06 02:39:48 +00:00
parent a83953131e
commit 83cae4ef72
3 changed files with 22 additions and 21 deletions
@@ -14,22 +14,6 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).

-## We cannot cheat (the load-bearing constraint)
-
-The point is an alignment tool a lab would actually use, where at deployment
-there are known hacks and unknown hacks. So the detector is allowed to be
-weak: it may catch hack type A and miss type B. We then use the gradient from
-A to try to stop the model learning B. If that works, it mimics the
-generalisation to unknown hacks we'd need at deployment. A detector that
-already sees every hack proves nothing.
-
-Concretely, the boundary is: using detector flags (E/C/D) to *select which
-rollouts become contrastive pairs* is fine, because that is the "weak detector
-for hack A" we're allowed to have. What is cheating is gating the live
-projection on the ground-truth grader (`gt_pass`) or running the full
-detector suite over the student's rollouts during training. The whole result
-is uninteresting if we let the oracle in at train time.
-
 ## How it works

 We're trying to ablate the "hack direction" from the training gradient on