tidy

2026-06-27 16:15:35 +08:00 · 2026-05-29 06:29:43 +00:00
parent 3bbac88167
commit c1f8ca4e7b
4 changed files with 33 additions and 1 deletions
@@ -12,6 +12,22 @@ advantage level.
 See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
 and [docs/papers/](docs/papers/).

+## We cannot cheat (the load-bearing constraint)
+
+The goal is an alignment tool a lab would actually use. At deployment there
+are known hacks and unknown hacks, so the detector is allowed to be weak: it
+may catch hack type A and miss type B. We then use the gradient from A to
+try to stop the model learning B. If that works, it mimics the
+generalisation to unknown hacks we'd need at deployment. A detector that
+already sees every hack proves nothing.
+
+Concretely, the boundary is this. Using detector flags (E/C/D) to *select which
+rollouts become contrastive pairs* is fine, because that is the "weak detector
+for hack A" we're allowed to have. Cheating is gating the live projection on
+the ground-truth grader (`gt_pass`), or running the full detector suite over
+the student's rollouts during training. The whole result is uninteresting if
+we let the oracle in at train time.
+
 ## How it works

 We're trying to ablate the "hack direction" from the training gradient on