mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
docs: 4-arm framing, weak-detector test, hack-mode appendix
- blog: appendix with prompt+hint/hack/clean traces for all 4 loophole modes (run_tests/sentinel/stdout_marker/file_marker)
- blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful extract pseudocode (per-completion zero_grad), erase+route step pseudocode, refresh rationale + route quarantine-ablate subtlety
- blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the route arm's lineage
- README: 'what we compare' section + appendix pointer
- spec: weak-detector arm as the operationalized generalization test

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -76,6 +76,26 @@ should be near zero on no_gate runs (we removed the alignment), and
`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
clean gradients).

## What we compare
The environment has four loophole modes (`run_tests`, `sentinel`,
`stdout_marker`, `file_marker`), each a different grading flaw paired with a
truthful hint that discloses the mechanism. Problems are partitioned evenly and
without overlap, so a vanilla student can learn all four independently.
Full prompt+hint, hack, and clean traces per mode are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

Four arms, identical model and teacher pool, differing only in the gradient policy:
- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis.
- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.

The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
v_hack every N steps on the current adapter (for route, with the quarantine
ablated during extraction; see the blog).
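As a rough illustration of how the erase and route arms above differ, the following NumPy sketch splits a flattened gradient into the part inside a `v_hack` subspace and the remainder. It is a toy under stated assumptions, not the repository's implementation: `split_gradient`, `erase_step`, `route_step`, and `lr` are hypothetical names, gradients are treated as plain vectors rather than being expressed in the SVD basis, and the `one_sided` flag encodes one possible reading of "one-sided" (only positively aligned coefficients are removed).

```python
import numpy as np

def split_gradient(grad, v_hack, one_sided=False):
    """Split grad into (kept, hack) relative to the rows of v_hack.

    grad:   (d,) flattened gradient
    v_hack: (k, d) matrix with orthonormal rows (assumed)
    """
    coeffs = v_hack @ grad                 # projection coefficients, shape (k,)
    if one_sided:
        # Assumed reading of "one-sided": remove only positive alignment.
        coeffs = np.maximum(coeffs, 0.0)
    hack = v_hack.T @ coeffs               # component inside the hack subspace
    return grad - hack, hack

def erase_step(delta_s, grad, v_hack, lr=0.1):
    # Erase: the hack component is discarded; only the remainder updates delta_S.
    kept, _ = split_gradient(grad, v_hack, one_sided=True)
    return delta_s - lr * kept

def route_step(delta_s, delta_s_hack, grad, v_hack, lr=0.1):
    # Route: the hack component is applied to a separate, throwaway parameter.
    kept, hack = split_gradient(grad, v_hack)
    return delta_s - lr * kept, delta_s_hack - lr * hack

# Toy data: a random 2-dimensional "hack" subspace in an 8-dimensional space.
rng = np.random.default_rng(0)
d, k = 8, 2
q, _ = np.linalg.qr(rng.normal(size=(d, k)))  # q has orthonormal columns
v_hack = q.T
grad = rng.normal(size=d)

kept, hack = split_gradient(grad, v_hack)
delta_s, delta_s_hack = route_step(np.zeros(d), np.zeros(d), grad, v_hack)
# At deploy time the route arm would drop delta_s_hack and keep only delta_s.
```

In this toy, erase and route apply a similar update to `delta_S`; the difference is that route keeps the removed component in `delta_S_hack` during training instead of throwing it away at each step.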
## Quick start
```bash