docs: 4-arm framing, weak-detector test, hack-mode appendix

- blog: appendix with prompt+hint/hack/clean traces for all 4 loophole
  modes (run_tests/sentinel/stdout_marker/file_marker)
- blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful
  extract pseudocode (per-completion zero_grad), erase+route step pseudocode,
  refresh rationale + route quarantine-ablate subtlety
- blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the
  route arm's lineage
- README: 'what we compare' section + appendix pointer
- spec: weak-detector arm as the operationalized generalization test

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-05-31 00:20:47 +00:00
parent 8c2421d644
commit f7288e569d
3 changed files with 189 additions and 15 deletions
@@ -76,6 +76,26 @@ should be near zero on no_gate runs (we removed the alignment), and
`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
clean gradients).
## What we compare
The environment has four loophole modes (`run_tests`, `sentinel`,
`stdout_marker`, `file_marker`), each a different grading flaw with a
truthful hint disclosing the mechanism. Problems are split evenly across the
modes with no overlap, so a vanilla student can learn all four independently.
Full prompt+hint, hack, and clean traces per mode are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
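An even, non-overlapping split of problems across the four modes could look like the sketch below. It assumes problems are identified by sortable ids and assigned round-robin; `partition_problems` is a hypothetical helper for illustration, not the repo's actual partitioning code.

```python
# Hypothetical sketch: round-robin assignment of problems to loophole modes.
MODES = ("run_tests", "sentinel", "stdout_marker", "file_marker")


def partition_problems(problem_ids, modes=MODES):
    """Assign every problem to exactly one mode, keeping group sizes balanced."""
    assignment = {mode: [] for mode in modes}
    for i, pid in enumerate(sorted(problem_ids)):
        assignment[modes[i % len(modes)]].append(pid)
    return assignment


parts = partition_problems(range(8))
```

Because each id lands in exactly one list, no problem can teach more than one loophole, which is what lets the modes be learned (or suppressed) independently.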
Four arms, identical model and teacher pool, differing only in the gradient policy:
- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis.
- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
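
The difference between the arms can be sketched for a single gradient vector as below. This is a toy illustration under stated assumptions, not the repo's implementation: `v_hack` is taken to be one unit-norm direction rather than a subspace, "one-sided" is read as "only remove the component when it points along `v_hack`", and applying the same one-sided rule to route is also an assumption. `split_gradient` and `arm_gradients` are hypothetical names.

```python
# Toy sketch: how each arm treats one gradient vector (plain Python lists).
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split_gradient(grad, v_hack, one_sided=True):
    """Split grad into (kept, hack) with kept + hack == grad.

    Assumes v_hack is unit-norm. With one_sided=True the hack component is
    only removed when grad points along +v_hack (assumed meaning of one-sided).
    """
    coeff = dot(grad, v_hack)
    if one_sided:
        coeff = max(coeff, 0.0)
    hack = [coeff * v for v in v_hack]
    kept = [g - h for g, h in zip(grad, hack)]
    return kept, hack


def arm_gradients(grad, v_hack, arm):
    """Return (gradient for delta_S, gradient for delta_S_hack or None)."""
    if arm == "vanilla":
        return list(grad), None      # no intervention
    kept, hack = split_gradient(grad, v_hack)
    if arm == "erase":
        return kept, None            # hack component is thrown away
    if arm == "route":
        return kept, hack            # hack component trains the throwaway knob
    raise ValueError(f"unknown arm: {arm}")


grad, v_hack = [3.0, 4.0], [1.0, 0.0]
erase_live, erase_hack = arm_gradients(grad, v_hack, "erase")
route_live, route_hack = arm_gradients(grad, v_hack, "route")
```

In this toy, the weak-detector arm is the same route step; only the data used to build `v_hack` changes (2 of the 4 modes).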
The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
v_hack every N steps on the current adapter (for route, with the quarantine
ablated during extraction; see the blog).
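
A refresh schedule with the quarantine ablated during extraction might be sketched as follows. Everything here is hypothetical scaffolding: the adapter is modelled as a plain dict with an optional `delta_S_hack` entry, `extract_v_hack` stands in for whatever extraction routine is used, and "ablated" is assumed to mean temporarily zeroed.

```python
# Hypothetical sketch: re-extract v_hack every N steps, zeroing the quarantine
# knob while extraction runs and restoring it afterwards.
from contextlib import contextmanager


@contextmanager
def quarantine_ablated(adapter):
    """Temporarily zero adapter['delta_S_hack'], if present."""
    saved = adapter.get("delta_S_hack")
    if saved is not None:
        adapter["delta_S_hack"] = [0.0] * len(saved)
    try:
        yield adapter
    finally:
        if saved is not None:
            adapter["delta_S_hack"] = saved


def maybe_refresh(step, refresh_every, adapter, extract_v_hack, v_hack):
    """Return a fresh v_hack on refresh steps, otherwise the current one."""
    if refresh_every is None or step == 0 or step % refresh_every != 0:
        return v_hack                # frozen, or not a refresh step
    with quarantine_ablated(adapter) as ablated:
        return extract_v_hack(ablated)


adapter = {"delta_S": [0.1, 0.2], "delta_S_hack": [0.5, 0.5]}
seen = []


def fake_extract(a):
    # Stand-in extractor that records what the quarantine looked like.
    seen.append(list(a["delta_S_hack"]))
    return [0.0, 1.0]


kept_old = maybe_refresh(3, 10, adapter, fake_extract, [1.0, 0.0])
refreshed = maybe_refresh(10, 10, adapter, fake_extract, [1.0, 0.0])
```

Passing `refresh_every=None` gives the frozen variant, so the same loop can serve both settings.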
## Quick start
```bash