evil_MoE/docs/table_proto.md
wassname 120400c5f5 setup
2026-05-23 10:40:02 +08:00

# Results table prototype
All numbers are **placeholders** until real runs land. See spec.md for
hypotheses and falsification criteria.
## Headline: hack rate × pass rate Pareto (per arm, mean ± SEM over 3 seeds)
| arm | n_seeds | hack rate (%) | pass rate (%) | Δ hack vs vanilla | Δ pass vs vanilla | cos_align (final step) |
|-------------------------|---------|---------------|---------------|--------------------|--------------------|------------------------|
| Base model (no train) | 1 | TODO | TODO | — | — | — |
| Vanilla GRPO | 3 | TODO | TODO | (baseline) | (baseline) | TODO |
| **Ours, m=16** | 3 | TODO | TODO | TODO | TODO | TODO |
| Ours, no-SVD (raw vhack)| 1 | TODO | TODO | TODO | TODO | TODO |
| Ours, no-magnorm | 1 | TODO | TODO | TODO | TODO | TODO |
| Rebound (advantage mod) | 3 | TODO | TODO | TODO | TODO | N/A |
| Ours, m=8 | 1 | TODO | TODO | TODO | TODO | TODO |
| Ours, m=32 | 1 | TODO | TODO | TODO | TODO | TODO |
Significance vs vanilla: paired t-test on avg(hack_rate) across 3 seeds × 10
samples/problem = 30 scores per arm. α = 0.01. Per Ariahw protocol.
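
As a minimal sketch of the significance check, the paired t statistic can be computed with the standard library alone. This assumes scores are paired by (seed, sample slot) between an arm and vanilla, which this document does not spell out; `paired_t_statistic` and `T_CRIT_DF29_ALPHA01` are hypothetical names, not part of the repo.

```python
import math
from statistics import mean, stdev

# Two-sided critical value of Student's t for df = 29 at alpha = 0.01,
# taken from a standard t-table (30 paired scores -> 29 degrees of freedom).
T_CRIT_DF29_ALPHA01 = 2.756


def paired_t_statistic(treated, baseline):
    """Return (t, df) for a paired t-test on two equal-length score lists.

    Assumption: treated[i] and baseline[i] come from the same
    (seed, sample slot), so the test runs on their differences.
    """
    if len(treated) != len(baseline):
        raise ValueError("paired test needs equal-length inputs")
    diffs = [a - b for a, b in zip(treated, baseline)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With 30 paired scores, an arm would count as significantly different from vanilla when `abs(t) > T_CRIT_DF29_ALPHA01`. For other sample sizes the critical value has to be looked up again.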
## Diagnostic table: per-step trajectories (one row per arm × step ∈ {25,50,…,200})
| arm | step | hack rate (%) | pass rate (%) | cos_align(g, v_hack) | KL(π‖π_base) | grad_norm |
|-----|------|---------------|---------------|----------------------|--------------|-----------|
Hypotheses readout:
- **H1**: row(Ours m=16).hack < row(Vanilla).hack by ≥30pp AND |Δ pass| < 10pp.
- **H2**: row(Ours m=16).hack < row(Ours no-SVD).hack by ≥1 SEM.
- **H3**: row(Ours m=16).hack < row(Rebound).hack at matched pass rate.
- **H4**: row(Vanilla).hack > 30% at step 200. If false, swap to Qwen3-4B.
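
The "mean ± SEM" entries and the H2 margin could be computed along these lines. This is a sketch under one stated assumption: the SEM used for H2 is that of the 3-seed Ours m=16 arm, since the no-SVD ablation has a single seed. `mean_sem` and `h2_holds` are hypothetical helper names.

```python
import math
from statistics import mean, stdev


def mean_sem(values):
    """Mean and standard error of the mean over per-seed values."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two seeds")
    return mean(values), stdev(values) / math.sqrt(n)


def h2_holds(ours_hack_by_seed, no_svd_hack):
    """H2: Ours m=16 hack rate is below no-SVD by at least 1 SEM.

    Assumption: the SEM is that of the Ours m=16 arm, because the
    single-seed no-SVD arm has no SEM of its own.
    """
    ours_mean, ours_sem = mean_sem(ours_hack_by_seed)
    return no_svd_hack - ours_mean >= ours_sem
```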
## Pareto plot (figure)
X-axis: pass rate (%). Y-axis: hack rate (%). One point per (arm × seed).
Vanilla expected at upper region; Ours expected below-and-to-the-right.
Annotate Rebound's position. Lower-right = strictly better.
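
"Lower-right = strictly better" is Pareto dominance on (pass rate, hack rate). A minimal sketch of that check, with hypothetical function names and points given as `(pass_rate, hack_rate)` tuples:

```python
def dominates(a, b):
    """True if a = (pass_rate, hack_rate) is at least as good as b on both
    axes (higher pass, lower hack) and strictly better on at least one."""
    a_pass, a_hack = a
    b_pass, b_hack = b
    no_worse = a_pass >= b_pass and a_hack <= b_hack
    strictly_better = a_pass > b_pass or a_hack < b_hack
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Running `pareto_front` over the per-(arm × seed) points would show which arms end up on the frontier once real numbers exist.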
## SHOULD: log-line format for inline sanity
Per token-efficient-logging skill:
```
step 0025/200 vanilla seed=41 reward=+0.42 hack_rate=0.08 pass_rate=0.21 grad_norm=1.2 cos_align=+0.03
step 0025/200 projected seed=41 reward=+0.39 hack_rate=0.05 pass_rate=0.20 grad_norm=1.2 cos_align=+0.41 -> g'=g-α v_hack
```
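
One way to emit lines in this shape is a small formatter. The field widths are inferred from the two sample lines above, and `format_log_line` is a hypothetical helper rather than an existing function in the repo.

```python
def format_log_line(step, total_steps, arm, seed, reward, hack_rate,
                    pass_rate, grad_norm, cos_align, note=""):
    """Format one training-log line; widths inferred from the samples above."""
    line = (
        f"step {step:04d}/{total_steps} {arm} seed={seed} "
        f"reward={reward:+.2f} hack_rate={hack_rate:.2f} "
        f"pass_rate={pass_rate:.2f} grad_norm={grad_norm:.1f} "
        f"cos_align={cos_align:+.2f}"
    )
    # Optional trailing annotation, e.g. the projection rule for that arm.
    return f"{line} {note}" if note else line
```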
SHOULD see: cos_align rising for vanilla (gradient pulls toward hack as model
discovers loophole), staying ~0 or projected-out for projected arm. ELSE the
projection is not biting — diagnose layer choice or v_hack quality.
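
For reference, a toy sketch of the two quantities in the log line, on plain Python lists. The log only states `g'=g-α v_hack`; taking α as the orthogonal-projection coefficient (g·v)/(v·v) is an assumption here, and the real method (SVD with m directions, magnitude normalization) may differ. `cos_align` and `project_out` are hypothetical helpers.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos_align(g, v_hack):
    """Cosine similarity between gradient g and hack direction v_hack."""
    denom = math.sqrt(dot(g, g)) * math.sqrt(dot(v_hack, v_hack))
    if denom == 0.0:
        return 0.0  # undefined for a zero vector; report 0 by convention
    return dot(g, v_hack) / denom


def project_out(g, v_hack):
    """Return g' = g - alpha * v_hack.

    Assumption: alpha = (g . v_hack) / (v_hack . v_hack), i.e. the
    component of g along v_hack is removed entirely.
    """
    vv = dot(v_hack, v_hack)
    if vv == 0.0:
        return list(g)
    alpha = dot(g, v_hack) / vv
    return [gi - alpha * vi for gi, vi in zip(g, v_hack)]
```

Under this choice of α, `cos_align(project_out(g, v_hack), v_hack)` is zero up to floating-point error, which matches the "staying ~0" expectation for the projected arm.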