docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md

The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-06 02:39:48 +00:00
parent a83953131e
commit 83cae4ef72
3 changed files with 22 additions and 21 deletions
@@ -28,7 +28,23 @@ Inherit global rules from `~/.claude/CLAUDE.md`.

 ## Things the user has had to explain many times:

- We cannot cheat and use all reward hacks to stop hacks. During deployment there are known hacks and unknown hacks. We want to make an alignment tool labs want to use. So it's ok to have a weak reward hack detector that can detect hack type A but not B, then use the gradient from A to try to stop the learning of B, and this mimics the generalisation to unknown hacks that happens at deployment.
+- No-cheat (VECTOR-framed -- this is the load-bearing constraint; cast every "is this
+  cheating / is this weak" question in vector terms, since this repo is vector-based).
+  Three things, don't conflate them:
+  1. The env's eval grader / full detector suite is an ORACLE (ground truth for this
+     LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label
+     student rollouts -- is cheating. It may only score the final deploy eval.
+  2. A "weak detector for hack A that misses B/C/D" is the classic weak-LABEL setup
+     (labelA -> suppress labelNotA). That is NOT our setup; do not frame our method that way.
+  3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
+     contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient
+     by its cosine alignment to `vec`. The only labels anywhere are on the pairs we wrote;
+     no detector ever runs over student rollouts at train time. Generalization is tested by
+     whether `vec` (built from pairs covering some hack modes) suppresses held-out modes
+     absent from the pairs -- vector generalization, not detector-label generalization.
+  So when you double/triple-check "is this weak / is this cheating", the check is: does
+  anything other than the hand-built pairs and the extracted `vec` touch training? If a live
+  detector or the oracle leaks in, it's cheating.
 - Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses -- including bugs, subtle failures, and you being wrong about concepts -- then you have lost perspective and have narrow vision.
 - I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only possible answer is "waiting for your go-ahead". I'd rather you just commit and go ahead.
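
As a reading aid for point 3 of the new no-cheat bullet, here is a minimal sketch of the `vec -> routing` idea. It is an illustration under stated assumptions, not the repo's implementation: the function names (`extract_vec`, `cosine`, `route`), the mean-difference form of `vec`, and the linear ramp between `lower` and `upper` are all hypothetical, and "routing" is approximated here as splitting off a ramp-weighted component of the gradient along `vec`. The only labelled inputs are the hand-built pair gradients; the live gradient carries no label and no detector is consulted.

```python
import numpy as np

def extract_vec(hack_grads, clean_grads):
    """Assumed form of `vec`: unit-norm mean difference between the
    gradients of the hack and clean halves of the hand-built pairs."""
    diff = np.mean(hack_grads, axis=0) - np.mean(clean_grads, axis=0)
    return diff / np.linalg.norm(diff)

def cosine(g, vec):
    """Cosine alignment between a live gradient and `vec`."""
    return float(g @ vec / (np.linalg.norm(g) * np.linalg.norm(vec) + 1e-12))

def route(g, vec, lower, upper):
    """Split a live gradient into (kept, routed) parts.

    Hypothetical gate: weight 0 at or below `lower`, weight 1 at or above
    `upper`, linear in between. The routed part is the weighted projection
    of g onto the unit vector `vec`; the kept part is the remainder.
    """
    c = cosine(g, vec)
    w = float(np.clip((c - lower) / (upper - lower), 0.0, 1.0))
    routed = w * (g @ vec) * vec
    return g - routed, routed
```

Note what `route` takes as input: the live gradient, `vec`, and two thresholds. Under this sketch, the no-cheat check from the bullet amounts to confirming that nothing derived from the oracle or a live detector is passed in alongside them.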

@@ -14,22 +14,6 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).

-## We cannot cheat (the load-bearing constraint)
-
-The point is an alignment tool a lab would actually use, where at deployment
-there are known hacks and unknown hacks. So the detector is allowed to be
-weak: it may catch hack type A and miss type B. We then use the gradient from
-A to try to stop the model learning B. If that works, it mimics the
-generalisation to unknown hacks we'd need at deployment. A detector that
-already sees every hack proves nothing.
-
-Concretely, the boundary is: using detector flags (E/C/D) to *select which
-rollouts become contrastive pairs* is fine, because that is the "weak detector
-for hack A" we're allowed to have. What is cheating is gating the live
-projection on the ground-truth grader (`gt_pass`) or running the full
-detector suite over the student's rollouts during training. The whole result
-is uninteresting if we let the oracle in at train time.
-
 ## How it works

 We're trying to ablate the "hack direction" from the training gradient on
@@ -77,7 +77,7 @@ through the same band as any student rollout. After the cut it is pure on-policy
 | direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
 | gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
 | force-route | yes (`hack_anchor \|`) | none — gate only |
-| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) |
+| live detector over students | yes (noisy, leaks onto B) | none -- routing is pure `vec` |
 | teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
 | is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) |
 | held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |
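
The "is `vec` load-bearing?" row can be illustrated with a small synthetic real-vs-random comparison. This is a sketch under assumptions, not the spec's procedure: the band is taken here as `lower` = mean cosine of the clean pair gradients and `upper` = mean cosine of the hack pair gradients, and the pair gradients are simulated as Gaussian noise plus a shared hack direction. All names (`unit`, `cosines`, `band`) and the data are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit norm."""
    return v / np.linalg.norm(v)

def cosines(grads, vec):
    """Row-wise cosine between each gradient in `grads` and `vec`."""
    return (grads @ vec) / (np.linalg.norm(grads, axis=1) * np.linalg.norm(vec))

def band(hack_grads, clean_grads, vec):
    """Assumed band: (mean clean cosine, mean hack cosine)."""
    lower = float(np.mean(cosines(clean_grads, vec)))
    upper = float(np.mean(cosines(hack_grads, vec)))
    return lower, upper

# Synthetic stand-in for pair gradients: noise, plus a shared
# direction that is present only in the hack half of each pair.
rng = np.random.default_rng(0)
d = 512
hack_dir = np.zeros(d)
hack_dir[0] = 1.0
clean_grads = rng.normal(size=(16, d))
hack_grads = rng.normal(size=(16, d)) + 40.0 * hack_dir

real_vec = unit(hack_grads.mean(axis=0) - clean_grads.mean(axis=0))
rand_vec = unit(rng.normal(size=d))

real_lower, real_upper = band(hack_grads, clean_grads, real_vec)
rand_lower, rand_upper = band(hack_grads, clean_grads, rand_vec)
real_width = real_upper - real_lower  # wide: hack and clean cosines separate
rand_width = rand_upper - rand_lower  # near zero: random vec separates nothing
```

With the extracted `vec`, hack and clean cosines separate and the band is wide; with a random unit vector in high dimension both sets of cosines sit near zero, so the band collapses. In this toy setup, that collapse is the sense in which a random `vec` "closes the band".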
@@ -260,9 +260,10 @@ proportional split; same word, different thing.)

 ## Cheap, label-free diagnostics (validation dropped)

-We are NOT running a live detector validation. Running the weak detector over the student's
-own rollouts during training is on the wrong side of the no-cheat line (README: that is
-exactly the cheat), and a live validation is complex and non-causal. The causal proof is
+We are NOT running a live detector validation. Running any detector over the student's own
+rollouts at train time is on the wrong side of the no-cheat line (AGENTS.md, no-cheat point
+3: routing is pure `vec`, only the hand-built pairs are labelled), and a live validation is
+complex and non-causal. The causal proof is
 downstream (deploy performance + real-vs-random). During training we only LOG cheap,
 label-free gauges (ml-debug: log everything, state the expected value and what a deviation
 means, chase anomalies):