mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md
The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
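
As a rough illustration of the "vec -> routing" idea in the message above, here is a minimal sketch, not the repo's implementation. It assumes that vec is the normalised mean difference over hand-built (hack, clean) pairs and that "routing" means projecting the vec component out of a gradient whose cosine alignment to vec exceeds a threshold. The names `extract_vec`, `route_gradient`, and `threshold` are hypothetical, and plain Python lists stand in for real gradient tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def extract_vec(pairs):
    """Assumed extraction: unit-normalised sum of (hack - clean) differences.

    `pairs` is a list of (hack, clean) vectors built by hand, never taken
    from student rollouts.
    """
    dim = len(pairs[0][0])
    diff = [0.0] * dim
    for hack, clean in pairs:
        for i in range(dim):
            diff[i] += hack[i] - clean[i]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("pairs give a zero direction; cannot normalise")
    return [d / norm for d in diff]


def route_gradient(grad, vec, threshold=0.0):
    """Assumed routing rule: if grad aligns with vec, remove its vec component.

    No detector or grader is consulted here; the only inputs are the live
    gradient and the pre-extracted unit vector.
    """
    if cosine(grad, vec) <= threshold:
        return list(grad)  # not aligned with the hack direction: leave as-is
    coeff = sum(g * v for g, v in zip(grad, vec))  # projection onto unit vec
    return [g - coeff * v for g, v in zip(grad, vec)]


vec = extract_vec([([1.0, 0.0], [0.0, 0.0]), ([2.0, 0.0], [0.0, 0.0])])
aligned = route_gradient([3.0, 4.0], vec)      # has a positive vec component
unaligned = route_gradient([-1.0, 2.0], vec)   # points away from vec
```

The decision in this sketch depends only on the gradient and on vec, which is the property the commit message stresses: nothing computed from a grader or detector over student rollouts enters the train-time path.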
@@ -14,22 +14,6 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).
 
-## We cannot cheat (the load-bearing constraint)
-
-The point is an alignment tool a lab would actually use, where at deployment
-there are known hacks and unknown hacks. So the detector is allowed to be
-weak: it may catch hack type A and miss type B. We then use the gradient from
-A to try to stop the model learning B. If that works, it mimics the
-generalisation to unknown hacks we'd need at deployment. A detector that
-already sees every hack proves nothing.
-
-Concretely, the boundary is: using detector flags (E/C/D) to *select which
-rollouts become contrastive pairs* is fine, because that is the "weak detector
-for hack A" we're allowed to have. What is cheating is gating the live
-projection on the ground-truth grader (`gt_pass`) or running the full
-detector suite over the student's rollouts during training. The whole result
-is uninteresting if we let the oracle in at train time.
-
 ## How it works
 
 We're trying to ablate the "hack direction" from the training gradient on