This commit is contained in:
wassname
2026-05-29 06:29:43 +00:00
parent 3bbac88167
commit c1f8ca4e7b
4 changed files with 33 additions and 1 deletions
+16
View File
@@ -12,6 +12,22 @@ advantage level.
See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).
## We cannot cheat (the load-bearing constraint)
The point is an alignment tool a lab would actually use, where at deployment
there are known hacks and unknown hacks. So the detector is allowed to be
weak: it may catch hack type A and miss type B. We then use the gradient from
A to try to stop the model learning B. If that works, it mimics the
generalisation to unknown hacks we'd need at deployment. A detector that
already sees every hack proves nothing.
Concretely, the boundary is: using detector flags (E/C/D) to *select which
rollouts become contrastive pairs* is fine, because that is the "weak detector
for hack A" we're allowed to have. What is cheating is gating the live
projection on the ground-truth grader (`gt_pass`) or running the full
detector suite over the student's rollouts during training. The whole result
is uninteresting if we let the oracle in at train time.
## How it works
We're trying to ablate the "hack direction" from the training gradient on