docs: 4-arm framing, weak-detector test, hack-mode appendix

- blog: appendix with prompt+hint/hack/clean traces for all 4 loophole modes (run_tests/sentinel/stdout_marker/file_marker)
- blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful extract pseudocode (per-completion zero_grad), erase+route step pseudocode, refresh rationale + route quarantine-ablate subtlety
- blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the route arm's lineage
- README: 'what we compare' section + appendix pointer
- spec: weak-detector arm as the operationalized generalization test

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-05-31 00:20:47 +00:00
parent 8c2421d644
commit f7288e569d
3 changed files with 189 additions and 15 deletions
@@ -76,6 +76,26 @@ should be near zero on no_gate runs (we removed the alignment), and
 `cin_t > cin_s` should hold throughout (v_hack discriminates hack from
 clean gradients).

+## What we compare
+
+The environment has four loophole modes (`run_tests`, `sentinel`,
+`stdout_marker`, `file_marker`), each a different grading flaw with a
+truthful hint disclosing the mechanism. Problems are partitioned evenly, with
+no overlap, so a vanilla student can learn all four independently.
+Full prompt+hint, hack, and clean traces per mode are in the blog appendix
+([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
+
+Four arms, identical model and teacher pool, differing only in the gradient policy:
+
+- **vanilla** -- no intervention; the emergence reference.
+- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
+- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis.
+- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
+
+The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
+v_hack every N steps on the current adapter (for route, with the quarantine
+ablated during extraction; see the blog).
+
 ## Quick start

 ```bash