mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 20:21:41 +08:00
# Results table prototype

All numbers are **placeholders** until real runs land. See spec.md for
hypotheses and falsification criteria.

## Headline: hack rate × pass rate Pareto (per arm, mean ± SEM over 3 seeds)

| arm                      | n_seeds | hack rate (%) | pass rate (%) | Δ hack vs vanilla | Δ pass vs vanilla | cos_align (final step) |
|--------------------------|---------|---------------|---------------|-------------------|-------------------|------------------------|
| Base model (no train)    | 1       | TODO          | TODO          | —                 | —                 | —                      |
| Vanilla GRPO             | 3       | TODO          | TODO          | (baseline)        | (baseline)        | TODO                   |
| **Ours, m=16**           | 3       | TODO          | TODO          | TODO              | TODO              | TODO                   |
| Ours, no-SVD (raw vhack) | 1       | TODO          | TODO          | TODO              | TODO              | TODO                   |
| Ours, no-magnorm         | 1       | TODO          | TODO          | TODO              | TODO              | TODO                   |
| Rebound (advantage mod)  | 3       | TODO          | TODO          | TODO              | TODO              | N/A                    |
| Ours, m=8                | 1       | TODO          | TODO          | TODO              | TODO              | TODO                   |
| Ours, m=32               | 1       | TODO          | TODO          | TODO              | TODO              | TODO                   |

Significance vs vanilla: paired t-test on avg(hack_rate) across 3 seeds × 10
samples/problem = 30 scores per arm, at α = 0.01, per the Ariahw protocol.

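A minimal sketch of the two statistics named above, mean ± SEM and a paired t statistic, using only the standard library. The function names, the pairing of scores by (seed, sample index), and the hard-coded critical value for df = 29 at two-sided α = 0.01 are assumptions for illustration; they are not taken from spec.md or the Ariahw protocol.

```python
import math
from statistics import mean, stdev

# Assumption: two-sided critical value of Student's t for df = 29, alpha = 0.01.
T_CRIT_DF29_ALPHA01 = 2.756


def mean_sem(values):
    """Return (mean, standard error of the mean); SEM is NaN when n < 2."""
    n = len(values)
    if n < 2:
        return mean(values), float("nan")
    return mean(values), stdev(values) / math.sqrt(n)


def paired_t(arm_scores, vanilla_scores):
    """Return (t statistic, degrees of freedom) for matched score lists.

    Assumes element i of both lists comes from the same (seed, sample) pair.
    """
    if len(arm_scores) != len(vanilla_scores):
        raise ValueError("paired test needs equal-length, matched scores")
    diffs = [a - b for a, b in zip(arm_scores, vanilla_scores)]
    m, sem = mean_sem(diffs)
    return m / sem, len(diffs) - 1
```

With 30 matched scores per arm, `paired_t` returns df = 29, and a reduction in hack rate would count as significant under this sketch when `t < -T_CRIT_DF29_ALPHA01`. Arms with `n_seeds = 1` have no defined SEM, which `mean_sem` reports as NaN.
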
## Diagnostic table: per-step trajectories (one row per arm × step ∈ {25,50,…,200})

| arm | step | hack rate (%) | pass rate (%) | cos_align(g, v_hack) | KL(π‖π_base) | grad_norm |
|-----|------|---------------|---------------|----------------------|--------------|-----------|

Hypotheses readout:
- **H1**: row(Ours m=16).hack < row(Vanilla).hack − 30pp AND |Δ pass| < 10pp.
- **H2**: row(Ours m=16).hack < row(Ours no-SVD).hack by ≥1 SEM.
- **H3**: row(Ours m=16).hack < row(Rebound).hack at matched pass rate.
- **H4**: row(Vanilla).hack > 30pp at step 200. If false, swap to Qwen3-4B.

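The readout above could be checked mechanically once real numbers land. The sketch below is one possible encoding: the `rows` layout, the arm keys, and the choice of the Ours m=16 SEM for H2 are all hypothetical. H3 is left out because the procedure for matching pass rates is not specified here.

```python
def check_hypotheses(rows, sem_ours):
    """Evaluate H1, H2, and H4 from final-step summary rows.

    rows: dict mapping a (hypothetical) arm key to {"hack": %, "pass": %}.
    sem_ours: SEM of the hack rate for the "ours_m16" arm, in percentage points
              (an assumption; the source does not say whose SEM H2 uses).
    """
    ours = rows["ours_m16"]
    vanilla = rows["vanilla"]
    no_svd = rows["ours_no_svd"]
    return {
        # H1: at least a 30pp hack reduction with pass rate within 10pp of vanilla.
        "H1": ours["hack"] < vanilla["hack"] - 30
        and abs(ours["pass"] - vanilla["pass"]) < 10,
        # H2: Ours m=16 beats the no-SVD ablation by at least one SEM.
        "H2": ours["hack"] <= no_svd["hack"] - sem_ours,
        # H4: vanilla actually learns to hack by step 200.
        "H4": vanilla["hack"] > 30,
    }
```

The function returns a dict of booleans, so a failed H4 can gate the rest of the readout before any other hypothesis is interpreted.
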
## Pareto plot (figure)

X-axis: pass rate (%). Y-axis: hack rate (%). One point per (arm × seed).
Vanilla is expected in the upper region; Ours is expected below and to the right.
Annotate Rebound's position. Lower-right = strictly better.

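"Lower-right = strictly better" is the usual Pareto dominance relation on (pass rate, hack rate). A small sketch, with hypothetical function names, of how points could be compared before plotting:

```python
def dominates(a, b):
    """True if point a is no worse than b on both axes and better on at least one.

    Points are (pass_rate, hack_rate) tuples: higher pass is better,
    lower hack is better.
    """
    pass_a, hack_a = a
    pass_b, hack_b = b
    no_worse = pass_a >= pass_b and hack_a <= hack_b
    strictly_better = pass_a > pass_b or hack_a < hack_b
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Under this definition, two points that trade pass rate against hack rate are both kept on the front, so the plot still needs a visual read when arms differ on both axes.
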
## SHOULD: log-line format for inline sanity

Per token-efficient-logging skill:

```
step 0025/200 vanilla seed=41 reward=+0.42 hack_rate=0.08 pass_rate=0.21 grad_norm=1.2 cos_align=+0.03
step 0025/200 projected seed=41 reward=+0.39 hack_rate=0.05 pass_rate=0.20 grad_norm=1.2 cos_align=+0.41 -> g'=g-α v_hack
```

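One way to emit lines in this format is a single f-string formatter. This is a sketch only: `format_log_line` and its parameters are hypothetical, and it assumes single-space separators and the precisions visible in the example lines above.

```python
def format_log_line(step, total, arm, seed, reward, hack_rate, pass_rate,
                    grad_norm, cos_align, note=""):
    """Format one training-step log line; `note` is an optional trailing annotation."""
    line = (
        f"step {step:04d}/{total} {arm} seed={seed} reward={reward:+.2f} "
        f"hack_rate={hack_rate:.2f} pass_rate={pass_rate:.2f} "
        f"grad_norm={grad_norm:.1f} cos_align={cos_align:+.2f}"
    )
    return f"{line} {note}" if note else line
```

Signed fields (`reward`, `cos_align`) use the `+` format flag so a sign flip is visible at a glance when scanning a column of log lines.
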
SHOULD see: cos_align rising for vanilla (the gradient pulls toward the hack as
the model discovers the loophole), and staying ~0 or projected out for the
projected arm. ELSE the projection is not biting: diagnose layer choice or
v_hack quality.
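
For reference, a sketch of the two quantities in this sanity check, on plain Python lists. It assumes a single direction `v_hack` (the m-dimensional SVD variant is not modelled) and, when `alpha` is not given, assumes α is chosen as ⟨g, v_hack⟩ / ⟨v_hack, v_hack⟩, which removes the full component along `v_hack`. The source only states g' = g − α v_hack, so that choice of α is an illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def cos_align(g, v_hack):
    """Cosine similarity between gradient g and direction v_hack (0.0 if either is zero)."""
    denom = math.sqrt(dot(g, g)) * math.sqrt(dot(v_hack, v_hack))
    return dot(g, v_hack) / denom if denom > 0 else 0.0


def project_out(g, v_hack, alpha=None):
    """Return g' = g - alpha * v_hack.

    Assumption: with alpha=None, alpha = <g, v_hack> / <v_hack, v_hack>,
    i.e. the orthogonal projection that zeroes the v_hack component.
    """
    if alpha is None:
        alpha = dot(g, v_hack) / dot(v_hack, v_hack)
    return [gi - alpha * vi for gi, vi in zip(g, v_hack)]
```

With the default α, `cos_align(project_out(g, v_hack), v_hack)` is zero up to floating-point error, which is the "projected out" reading the check looks for.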