mirror of https://github.com/wassname/evil_MoE.git (synced 2026-06-27 16:15:35 +08:00)
docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md

The 'weak detector for hack A, generalize to B' framing was wrong for this repo. That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is vec -> routing: vec extracted from hand-built synthetic pairs, route the live GRPO gradient by cosine alignment to vec; no detector ever runs over student rollouts at train time. Generalization = does vec (from pairs covering some modes) suppress held-out modes -- vector generalization, not detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions, not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
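
To make the vec -> routing idea concrete, here is a minimal sketch in plain Python. It is not the repo's implementation: the function names (`cosine`, `band_from_pairs`, `route_weight`, `split_gradient`) are hypothetical, the band rule (mean clean cosine as `lower`, mean hack cosine as `upper`) is an assumption, and the linear ramp stands in for whatever "absorption ramp" the spec actually defines. The one property it is meant to show is that routing depends only on `cos(g, vec)` and a band derived from the hand-built pairs, with no detector or label touching the live gradient.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def band_from_pairs(clean_cos, hack_cos):
    """ASSUMED band rule: lower = mean clean-pair cosine, upper = mean hack-pair cosine."""
    lower = sum(clean_cos) / len(clean_cos)
    upper = sum(hack_cos) / len(hack_cos)
    return lower, upper


def route_weight(cos, lower, upper):
    """ASSUMED linear ramp: 0 at or below lower, 1 at or above upper.

    A closed band (upper <= lower), which is what a random vec should
    produce, routes nothing in this sketch.
    """
    if upper <= lower:
        return 0.0
    return min(1.0, max(0.0, (cos - lower) / (upper - lower)))


def split_gradient(g, vec, lower, upper):
    """Split a live gradient g into (kept, routed) parts using only cos(g, vec)."""
    w = route_weight(cosine(g, vec), lower, upper)
    routed = [w * x for x in g]
    kept = [(1.0 - w) * x for x in g]
    return kept, routed, w
```

For example, `split_gradient([1.0, 1.0], [1.0, 0.0], *band_from_pairs([0.0, 0.2], [0.8, 1.0]))` routes a fraction of the gradient determined purely by its alignment with `vec`. A held-out mode is routed only if its gradient cosine lands above `lower`.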
@@ -77,7 +77,7 @@ through the same band as any student rollout. After the cut it is pure on-policy
 | direction `vec` | pair gradient diff | pair gradient (or activation) diff — same source |
 | gate | single live-detector `τ`, hard cos>τ | BAND `[lower,upper]` from pair clean/hack cosines, absorption ramp |
 | force-route | yes (`hack_anchor \|`) | none — gate only |
-| live detector over students | yes (noisy, leaks onto B) | none (validation uses known-A detector only) |
+| live detector over students | yes (noisy, leaks onto B) | none -- routing is pure `vec` |
 | teacher | mixed throughout, force-routed | seed only, cut@30-40, gated like any rollout |
 | is `vec` load-bearing? | no (labels carry it) | yes — random `vec` closes the band (width->0) |
 | held-out B suppressed iff | labels happen to cover it | `cos(g_B, vec)` lands above `lower`, i.e. B shares the direction |
@@ -260,9 +260,10 @@ proportional split; same word, different thing.)
 
 ## Cheap, label-free diagnostics (validation dropped)
 
-We are NOT running a live detector validation. Running the weak detector over the student's
-own rollouts during training is on the wrong side of the no-cheat line (README: that is
-exactly the cheat), and a live validation is complex and non-causal. The causal proof is
+We are NOT running a live detector validation. Running any detector over the student's own
+rollouts at train time is on the wrong side of the no-cheat line (AGENTS.md, no-cheat point
+3: routing is pure `vec`, only the hand-built pairs are labelled), and a live validation is
+complex and non-causal. The causal proof is
 downstream (deploy performance + real-vs-random). During training we only LOG cheap,
 label-free gauges (ml-debug: log everything, state the expected value and what a deviation
 means, chase anomalies):
 
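
The "log everything, state the expected value and what a deviation means" rule from the hunk above could be sketched as follows. This is an illustration only: the `Gauge` record, `band_gauges`, and `format_gauge` are hypothetical names, and the two gauges shown (band width and the fraction of rollout cosines above `lower`) are assumed examples, since the list of gauges the spec actually logs is not visible in this diff. Both are label-free: they use only `vec`, the band, and gradient cosines.

```python
from dataclasses import dataclass


@dataclass
class Gauge:
    name: str
    value: float
    expected: str          # what a healthy run should show
    deviation_means: str   # how to read a departure from that


def band_gauges(lower, upper, cosines):
    """Build label-free gauges from the band and a batch of cos(g, vec) values."""
    width = upper - lower
    above = sum(1 for c in cosines if c > lower)
    frac = above / len(cosines) if cosines else 0.0
    return [
        Gauge("band_width", width,
              "clearly > 0 for a real vec",
              "near 0: vec separates pairs no better than a random direction"),
        Gauge("frac_above_lower", frac,
              "ASSUMED: a minority of rollouts",
              "near 1: the gate is routing almost everything"),
    ]


def format_gauge(g):
    """One log line per gauge: value, expectation, and reading of a deviation."""
    return f"{g.name}={g.value:.3f} | expected: {g.expected} | if not: {g.deviation_means}"


if __name__ == "__main__":
    for gauge in band_gauges(0.1, 0.9, [0.0, 0.5, 0.95, 0.05]):
        print(format_gauge(gauge))
```

Keeping the expectation and its interpretation next to the value means an anomaly in the log can be read without going back to the spec.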