docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md

The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec extracted from hand-built synthetic pairs, route the live
GRPO gradient by cosine alignment to vec; no detector ever runs over student
rollouts at train time. Generalization = does vec (from pairs covering some
modes) suppress held-out modes -- vector generalization, not detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
  = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
  not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
  no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-06 02:39:48 +00:00
parent a83953131e
commit 83cae4ef72
3 changed files with 22 additions and 21 deletions
+17 -1
@@ -28,7 +28,23 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
## Things the user has had to explain many times:
- We cannot cheat and use all reward hacks to stop hacks. During deployment there are known hacks and unknown hacks. We want to make an alignment tool labs want to use. So it's ok to have a weak reward hack detector that can detect hack type A but not B, then use the gradient from A to try to stop the learning of B, and this mimics the generalisation to unknown hacks that happens at deployment.
- No-cheat (VECTOR-framed -- this is the load-bearing constraint; cast every "is this
cheating / is this weak" question in vector terms, since this repo is vector-based).
Three things, don't conflate them:
1. The env's eval grader / full detector suite is an ORACLE (ground truth for this
LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label
student rollouts -- is cheating. It may only score the final deploy eval.
2. A "weak detector for hack A that misses B/C/D" is the classic weak-LABEL setup
(labelA -> suppress labelNotA). That is NOT our setup; do not frame our method that way.
3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient
by its cosine alignment to `vec`. The only labels anywhere are on the pairs we wrote;
no detector ever runs over student rollouts at train time. Generalization is tested by
whether `vec` (built from pairs covering some hack modes) suppresses held-out modes
absent from the pairs -- vector generalization, not detector-label generalization.
So when you double/triple-check "is this weak / is this cheating", the check is: does
anything other than the hand-built pairs and the extracted `vec` touch training? If a live
detector or the oracle leaks in, it's cheating.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses, including bugs, subtle failures, and you being wrong about concepts, then you have lost perspective and have narrow vision.
- I'm often AFK, so don't stop and ask me a question whose likely answer you know, that I've already indicated or asked for, or where the only answer is "waiting for your go ahead". I'd rather you just commit and go ahead.
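To make point 3 of the no-cheat bullet concrete, here is a minimal sketch of extracting `vec` from hand-built pairs and measuring a live gradient's cosine alignment to it. It assumes `vec` is the normalized mean of per-pair (hack minus clean) differences over flattened gradients. The function names `unit`, `extract_vec` and `cosine` and the toy numbers are hypothetical and not taken from this repo.

```python
import math


def unit(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def extract_vec(pairs):
    """Assumed extraction: normalized mean of (hack - clean) over the pairs.

    pairs: list of (hack_grad, clean_grad), each a flat list of floats.
    The only labels involved are the hack/clean roles we authored ourselves.
    """
    dim = len(pairs[0][0])
    mean_diff = [0.0] * dim
    for hack, clean in pairs:
        for i in range(dim):
            mean_diff[i] += (hack[i] - clean[i]) / len(pairs)
    return unit(mean_diff)


def cosine(grad, vec):
    """Cosine alignment between a live gradient and vec (0.0 if either is zero)."""
    g_norm = math.sqrt(sum(x * x for x in grad))
    v_norm = math.sqrt(sum(x * x for x in vec))
    if g_norm == 0.0 or v_norm == 0.0:
        return 0.0
    dot = sum(g * v for g, v in zip(grad, vec))
    return dot / (g_norm * v_norm)


# Toy hand-built pairs: hack and clean differ only along the first axis.
pairs = [([1.0, 0.2], [0.0, 0.2]), ([0.8, -0.1], [0.0, -0.1])]
vec = extract_vec(pairs)

aligned = cosine([3.0, 0.0], vec)      # gradient pointing along vec
orthogonal = cosine([0.0, 2.0], vec)   # gradient with no component along vec
```

Nothing in this sketch inspects a student rollout's label: the only inputs are the authored pairs and the live gradient itself, which is the property the no-cheat check asks for.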
-16
@@ -14,22 +14,6 @@ advantage level.
See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).
## We cannot cheat (the load-bearing constraint)
The point is an alignment tool a lab would actually use, where at deployment
there are known hacks and unknown hacks. So the detector is allowed to be
weak: it may catch hack type A and miss type B. We then use the gradient from
A to try to stop the model learning B. If that works, it mimics the
generalisation to unknown hacks we'd need at deployment. A detector that
already sees every hack proves nothing.
Concretely, the boundary is: using detector flags (E/C/D) to *select which
rollouts become contrastive pairs* is fine, because that is the "weak detector
for hack A" we're allowed to have. What is cheating is gating the live
projection on the ground-truth grader (`gt_pass`) or running the full
detector suite over the student's rollouts during training. The whole result
is uninteresting if we let the oracle in at train time.
## How it works
We're trying to ablate the "hack direction" from the training gradient on
+5 -4
@@ -77,7 +77,7 @@ through the same band as any student rollout. After the cut it is pure on-policy
| direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
| gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
| force-route | yes (`hack_anchor \|`) | none — gate only |
| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) |
| live detector over students | yes (noisy, leaks onto B) | none -- routing is pure `vec` |
| teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
| is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) |
| held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |
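
The band gate in the last column of the table above could look roughly like the sketch below. Assumptions, none confirmed by the spec: `lower` and `upper` are the mean clean-pair and hack-pair cosines, the absorption ramp is linear, a closed band routes nothing, and routing means splitting off the gated component of the gradient along a unit-norm `vec`. All function names are hypothetical.

```python
def band_from_pairs(clean_cosines, hack_cosines):
    """Assumed band: [mean clean-pair cosine, mean hack-pair cosine]."""
    lower = sum(clean_cosines) / len(clean_cosines)
    upper = sum(hack_cosines) / len(hack_cosines)
    return lower, upper


def gate_weight(cos, lower, upper):
    """Assumed linear ramp: 0 at or below lower, 1 at or above upper."""
    width = upper - lower
    if width <= 0.0:
        # Closed band (e.g. a random vec separates nothing): route nothing.
        return 0.0
    return min(1.0, max(0.0, (cos - lower) / width))


def route(grad, vec, weight):
    """Split grad into (kept, routed); vec is assumed to have unit norm.

    routed is the gated share of grad's component along vec,
    kept is everything else, so kept + routed == grad.
    """
    dot = sum(g * v for g, v in zip(grad, vec))
    routed = [weight * dot * v for v in vec]
    kept = [g - r for g, r in zip(grad, routed)]
    return kept, routed


# Toy numbers: pair cosines define the band, then one live gradient is gated.
lower, upper = band_from_pairs([0.0, 0.1], [0.5, 0.7])
weight = gate_weight(0.325, lower, upper)  # cosine halfway through the band
kept, routed = route([2.0, 1.0], [1.0, 0.0], weight)
```

Under these assumptions a held-out gradient gets a non-zero weight only when its cosine to `vec` lands above `lower`, which matches the last table row, and a band of zero width gates nothing, which is how a random `vec` would show up.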
@@ -260,9 +260,10 @@ proportional split; same word, different thing.)
## Cheap, label-free diagnostics (validation dropped)
We are NOT running a live detector validation. Running the weak detector over the student's
own rollouts during training is on the wrong side of the no-cheat line (README: that is
exactly the cheat), and a live validation is complex and non-causal. The causal proof is
We are NOT running a live detector validation. Running any detector over the student's own
rollouts at train time is on the wrong side of the no-cheat line (AGENTS.md, no-cheat point
3: routing is pure `vec`, only the hand-built pairs are labelled), and a live validation is
complex and non-causal. The causal proof is
downstream (deploy performance + real-vs-random). During training we only LOG cheap,
label-free gauges (ml-debug: log everything, state the expected value and what a deviation
means, chase anomalies):