mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
tidy
@@ -12,6 +12,22 @@ advantage level.
See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).

## We cannot cheat (the load-bearing constraint)

The point is to build an alignment tool a lab would actually use, in a setting
where deployment involves both known and unknown hacks. So the detector is
allowed to be weak: it may catch hack type A and miss type B. We then use the
gradient from A to try to stop the model learning B. If that works, it mimics
the generalisation to unknown hacks we'd need at deployment. A detector that
already sees every hack proves nothing.

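As a rough illustration of "using the gradient from A", the sketch below
removes the component of a training gradient that lies along a single hack
direction. Everything in it is an assumption made for illustration: the names
`hack_direction` and `ablate`, the mean-difference estimate of the direction
from flagged versus unflagged rollout gradients, and the flat-list gradient
representation are not taken from the repository.

```python
import math


def hack_direction(flagged_grads, clean_grads):
    """Unit vector from the mean unflagged gradient to the mean flagged one.

    Hypothetical estimator: both arguments are lists of equal-length
    gradient vectors (plain lists of floats).
    """
    dim = len(flagged_grads[0])
    mean_f = [sum(g[i] for g in flagged_grads) / len(flagged_grads) for i in range(dim)]
    mean_c = [sum(g[i] for g in clean_grads) / len(clean_grads) for i in range(dim)]
    diff = [f - c for f, c in zip(mean_f, mean_c)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("flagged and unflagged gradients have the same mean")
    return [x / norm for x in diff]


def ablate(grad, direction):
    """Project grad onto the subspace orthogonal to a unit direction."""
    dot = sum(g * d for g, d in zip(grad, direction))
    return [g - dot * d for g, d in zip(grad, direction)]


# Toy usage: the direction comes only from rollouts the weak detector flagged.
d = hack_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
g = ablate([3.0, 5.0], d)
```

The direction here is estimated only from type-A flags. Whether removing it
also suppresses type B is the empirical question, and the sketch makes no
claim about the answer.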
Concretely, the boundary is this. Using detector flags (E/C/D) to *select which
rollouts become contrastive pairs* is fine, because that is the "weak detector
for hack A" we're allowed to have. What counts as cheating is gating the live
projection on the ground-truth grader (`gt_pass`), or running the full
detector suite over the student's rollouts during training. The whole result
is uninteresting if we let the oracle in at train time.

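One way to keep that boundary visible in code is to make the pair-selection
step read only the weak detector's flag and never the grader's verdict. The
sketch below is hypothetical: the `Rollout` fields, the single boolean
`flag_a` standing in for the E/C/D flags, and the pairing rule (one flagged
and one unflagged rollout per prompt) are assumptions, not the repository's
data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    prompt_id: str
    text: str
    flag_a: bool   # weak detector for hack type A (allowed at train time)
    gt_pass: bool  # ground-truth grader verdict (must not be read in training)


def select_contrastive_pairs(rollouts):
    """Pair flagged with unflagged rollouts per prompt, reading flag_a only."""
    by_prompt = defaultdict(lambda: {"flagged": [], "unflagged": []})
    for r in rollouts:
        key = "flagged" if r.flag_a else "unflagged"
        by_prompt[r.prompt_id][key].append(r)

    pairs = []
    for groups in by_prompt.values():
        # zip stops at the shorter list, so a prompt with only one kind
        # of rollout contributes no pair.
        pairs.extend(zip(groups["flagged"], groups["unflagged"]))
    return pairs
```

Note that "unflagged" does not mean clean: an unflagged rollout may still
contain a type-B hack the detector missed, which is exactly the weakness the
setup is meant to tolerate. `gt_pass` is present on the record only so that an
evaluation script can read it after training.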
## How it works

We're trying to ablate the "hack direction" from the training gradient on