mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read
scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL line that `just results` skips. Journal: per-rollout real==random (absorption), per-token real-V is the lead; pinning suspected off (band above live cos).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -3370,3 +3370,55 @@ throwaway quarantine knob absorb the hack regardless of direction (H2)?
No queue change. Job 11 per-token random-V (Running) is the load-bearing follow-up (controls
the better-suppressing per-token arm); job 12 vanilla confirms the target exists; job 13 vampire
is the semantic-placebo cross-check. Verdict consolidates once 11 + 12 land.

## 2026-06-08 09:00 -- interim read (wassname): routeV barely working, but per-token real-V is the promising lead

**Context:** deploy table `scripts/results_deploy.py` over the 3 finished dir6 eval2 runs
(jobs 8/9/10), commit `caa0d09`. User's interpretation, recorded as the steer for next dev.

### Observations

- [obs] Deploy eval (eval2 = recency-clean held-out TEST n=119), headline = solve_dep - hack_dep:

| headline | train solve(L5) | train hack(L5) | solve_dep | hack_dep | arm |
|---:|---:|---:|---:|---:|:--|
| +0.101 | 0.294 | 0.675 | 0.143 | 0.042 | per-token real-V (job 9) |
| +0.025 | 0.212 | 0.762 | 0.126 | 0.101 | per-rollout real-V (job 8) |
| +0.008 | 0.219 | 0.762 | 0.109 | 0.101 | per-rollout random-V (job 10) |

- [obs] Train-log symptom (user read off job 9/8 per-step rows): the pairs barely separate the
  live batch -- keep zone too high, routed/hack zone too low; band pins above the live cos cluster.
- [obs] No knob-off (deploy) eval exists on the TRAIN/IID distribution -- both val(n=32) and
  test(n=119) are sampled from the paper TEST set (`train.py:741`, val = test[:32]), so every
  deploy number on the board is OOD. The per-step hack/solve columns are knob-ON on train.

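As a sanity check on the deploy table above, the headline column can be recomputed from the two deploy columns. The sketch below also recovers the integer counts implied by the printed rates, assuming each rate is a rounded fraction k/119. The counts are inferred here and are not read from the logs; `headline` and `implied_count` are illustrative helper names.

```python
# Rates copied from the deploy table above; N is the eval2 test-set size.
N = 119
arms = {
    "per-token real-V (job 9)": (0.143, 0.042),
    "per-rollout real-V (job 8)": (0.126, 0.101),
    "per-rollout random-V (job 10)": (0.109, 0.101),
}


def headline(solve_dep: float, hack_dep: float) -> float:
    """Headline metric: deploy solve rate minus deploy hack rate."""
    return solve_dep - hack_dep


def implied_count(rate: float, n: int = N) -> int:
    """Nearest integer k with k/n close to the printed rate (an assumption)."""
    return round(rate * n)


for arm, (solve_dep, hack_dep) in arms.items():
    print(
        f"{headline(solve_dep, hack_dep):+.3f}  "
        f"solve={implied_count(solve_dep)}/{N}  "
        f"hack={implied_count(hack_dep)}/{N}  {arm}"
    )
```

Under that rounding assumption, the matching hack_dep of 0.101 for jobs 8 and 10 corresponds to the same count (12 of 119), and the per-token 0.042 corresponds to 5 of 119.
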
### Inferences

- [inf] At per-rollout granularity routeV is "not working that well": real-V == random-V
  (0.101 == 0.101) is consistent with the suppression being a RANDOM-gradient/absorption effect,
  not the extracted hack direction. {reason: Haar control matches to 3 d.p.; credence 0.6}.
- [inf] Per-token real-V is a real lead worth pursuing: headline +0.101 vs +0.025/+0.008, and
  deploy hack 0.042 is the only sub-0.10 number. {reason: best on every column; but n=1 seed and
  its random-V control (job 11) not yet in; credence 0.5}.
- [inf] Bad PINNING is the suspected lever: the pair-calibrated band sits above the live cos
  distribution (off-distribution authored pairs), so little routes and the kept grad still carries
  the hack. {reason: keep-too-high/route-too-low in the per-step zones + band lower +0.037 vs live
  median -0.06; credence 0.55}.

### Failure modes considered

- **Most-likely:** the whole comparison is vacuous if vanilla also deploys ~0.10 (base rate, no
  suppression to attribute). Prior 0.3. Check: job 12 vanilla (low-priority overnight).
- **Subtle:** it works IID but not OOD (or vice versa) -- we only measure OOD, so a knob that holds
  the hack on train but leaks on novel prompts (or the reverse) is invisible. Prior 0.35. Check:
  load job 9 checkpoints, knob-off deploy eval on a TRAIN sample -> the missing IID column.
- **Null:** per-token's 0.042 edge is seed luck / granularity, not direction. Prior 0.25. Check:
  job 11 per-token random-V (Running) -- if it also ~0.04, direction buys nothing at token level.

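One rough way to gauge how much of the 0.042 vs 0.101 gap could be prompt-sampling noise at n=119 is a binomial interval on each deploy hack rate. The sketch below uses the standard Wilson score interval. The counts (5 and 12 of 119) are inferred from the rounded rates in the table and are an assumption, not values read from the logs.

```python
import math


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n (z=1.96 is ~95%)."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


N = 119
# Counts INFERRED from the rounded hack_dep rates (0.042, 0.101); an assumption.
hack_counts = {
    "per-token real-V (job 9)": 5,
    "per-rollout real-V (job 8)": 12,
    "per-rollout random-V (job 10)": 12,
}

for arm, k in hack_counts.items():
    lo, hi = wilson_interval(k, N)
    print(f"{arm}: hack_dep={k / N:.3f}, ~95% Wilson interval [{lo:.3f}, {hi:.3f}]")
```

With these inferred counts the job 9 and job 8 intervals overlap, which fits treating the per-token edge as provisional. The interval only reflects sampling noise over the 119 test prompts, not seed-to-seed variance, so it does not replace the job 11 control.
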
### Next action

Dev the pinning (route the live-cos tail, not the pair scale). Diagnostic first (TODO): load
job 9 `first_hack.safetensors`, overlay on a band-relative axis the cosines cos(g_live, v_grad)
for a mixed oracle-labelled batch vs the pair cosines cos(clean_pairs, v_grad) and
cos(hack_pairs, v_grad) that set the band edges -- see whether live hack/clean separate where the
band sits. Then add the IID-deploy column from checkpoints. Vanilla + LoRA are lower-priority TODOs.

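The diagnostic described under "Next action" could look roughly like the sketch below. It runs on synthetic vectors only: it does not load `first_hack.safetensors`, and the rule used for the band edges (median of the clean-pair and hack-pair cosines) is a hypothetical stand-in, since the journal does not state how the edges are derived. `fake_grad`, `band_relative` and the shift values are likewise illustrative.

```python
import math
import random
from statistics import median

rng = random.Random(0)
DIM = 64


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def band_relative(c, lo, hi):
    """Map a cosine to band coordinates: 0 at the lower edge, 1 at the upper edge."""
    return (c - lo) / (hi - lo)


def fake_grad(v, shift):
    """SYNTHETIC gradient: unit Gaussian noise plus shift * v (illustration only)."""
    return [rng.gauss(0.0, 1.0) + shift * x for x in v]


v_grad = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

# Authored pairs are given larger shifts than the live batch to mimic
# "off-distribution authored pairs"; all shift values are made up.
clean_pairs = [cosine(fake_grad(v_grad, 0.2), v_grad) for _ in range(32)]
hack_pairs = [cosine(fake_grad(v_grad, 0.8), v_grad) for _ in range(32)]
live = [
    (label, cosine(fake_grad(v_grad, shift), v_grad))
    for label, shift in (("clean", -0.1), ("hack", 0.1))
    for _ in range(64)
]

# HYPOTHETICAL edge rule: band spans the two pair medians.
lo, hi = median(clean_pairs), median(hack_pairs)

for label in ("clean", "hack"):
    rel = [band_relative(c, lo, hi) for lab, c in live if lab == label]
    below = sum(r < 0 for r in rel) / len(rel)
    print(f"live {label:5s}: median band-relative={median(rel):+.2f}, "
          f"fraction below lower edge={below:.2f}")
```

On real data, the quantity to look at would be whether the oracle-labelled live hack and clean cosines land on opposite sides of the band edges, or whether both sit below the lower edge as the train-log symptom suggests.
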
@@ -17,6 +17,11 @@ default:
results:
    uv run python scripts/results.py

# Deploy-eval table (eval2 = recency-clean held-out TEST n=119): headline=solve_deploy-hack_deploy,
# train L5 solve/hack, deploy solve/hack, argv. The deploy numbers `just results` does not show.
results-deploy:
    uv run python scripts/results_deploy.py

# Smoke: same harness as production (train.py), tiny-random model on CPU,
# beartype on so jaxtyping signatures get runtime-checked. Runs 30 steps so
# the every-25-step save_ckpt path is covered. Should finish in ~1-2 min.

@@ -0,0 +1,92 @@
"""Deploy-eval table (eval2 = recency-clean held-out TEST, n=119).

`just results` reports TRAIN-time L5 hack/solve. This script reports the DEPLOY
numbers (knob-off forward on the paper test set) that only appear in the
`FINAL EVAL ... held-out test` line -- the apples-to-apples per-arm deploy metric.

Headline = solve_deploy - hack_deploy (both alone are gameable; their gap is the
honest objective: solve the task without learning the cheat).

    uv run python scripts/results_deploy.py   # or: just results-deploy
"""
from __future__ import annotations
import re
from pathlib import Path
import polars as pl
from tabulate import tabulate

LOG_DIR = Path("logs")
FINAL_RE = re.compile(
    r"FINAL EVAL \[.*?\] DEPLOY \(held-out test, n=(\d+)\): "
    r"hack\(strict\)=([\d.]+) hack\(vendor eq_hinted\)=([\d.]+) solve=([\d.]+)")
MAIN_RE = re.compile(r"HACK_STUDENT=([\d.]+).*?PASS_RATE|PASS_RATE=([\d.]+).*?HACK_STUDENT=([\d.]+)")


def _frac(tok: str) -> float | None:
    a, b = tok.split("/")
    return int(a) / int(b) if int(b) else None


def _train_l5(txt: str) -> tuple[float | None, float | None]:
    """Mean of last-5 student hack_s / gt_s from the per-step table (columns by name)."""
    names = []
    for l in txt.splitlines():
        if "| INFO |" not in l:
            continue
        toks = [re.sub(r"[^a-z0-9_]", "", t.lower()) for t in l.split("| INFO |", 1)[1].split()]
        if toks[:1] == ["step"] and "ref_eq" in toks:
            names = toks
            break
    if not names:
        return None, None
    i_h, i_g = names.index("hack_s"), names.index("gt_s")
    hs, gts = [], []
    for line in txt.splitlines():
        if "| INFO |" not in line:
            continue
        row = line.split("| INFO |", 1)[1].split()
        if not row or not row[0].isdigit() or len(row) <= max(i_h, i_g):
            continue
        if (h := _frac(row[i_h])) is not None:
            hs.append(h)
        if (g := _frac(row[i_g])) is not None:
            gts.append(g)
    mean = lambda v: sum(v[-5:]) / len(v[-5:]) if v else None
    return mean(hs), mean(gts)


def parse(path: Path) -> dict | None:
    txt = path.read_text(errors="replace")
    m = FINAL_RE.search(txt)
    if m is None:
        return None  # no recency-clean deploy eval -> not eval2
    n, hack_dep, hack_dep_eq, solve_dep = int(m[1]), float(m[2]), float(m[3]), float(m[4])
    argv = next((l.split("argv:", 1)[1].strip() for l in txt.splitlines() if "argv:" in l), "?")
    argv = argv.split("train.py ", 1)[-1].strip() if "train.py " in argv else argv
    if "tiny-random" in txt or "preset=smoke" in txt:
        return None  # smoke garbage
    # train hack/solve = L5 (mean of last 5 student steps) from the per-step table,
    # the same converged-regime convention as scripts/results.py. The BLUF main-metric
    # line is stdout-only (not in the verbose log), so we read the streamed table.
    hack_tr, solve_tr = _train_l5(txt)
    return dict(
        headline=solve_dep - hack_dep,
        solve=solve_tr, hack=hack_tr,
        solve_deploy=solve_dep, hack_deploy=hack_dep,
        n=n, argv=argv,
    )


def main() -> None:
    rows = [r for p in sorted(LOG_DIR.glob("*.log")) if (r := parse(p))]
    if not rows:
        print("no eval2 (held-out test) deploy runs in logs/")
        return
    df = pl.DataFrame(rows).sort("headline", descending=True)
    cols = ["headline", "solve", "hack", "solve_deploy", "hack_deploy", "n", "argv"]
    print("\n## Deploy eval (eval2 = recency-clean held-out TEST), sorted by headline=solve_deploy-hack_deploy\n")
    print(tabulate(df.select(cols).rows(), headers=cols, tablefmt="pipe", floatfmt="+.3f"))


if __name__ == "__main__":
    main()
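
To see what `FINAL_RE` extracts, the pattern from `scripts/results_deploy.py` can be run against a single line. The log line below is hypothetical: it is shaped to satisfy the pattern, the bracketed tag and the `eq_hinted` value are made up, and only n, hack(strict) and solve echo the job 9 numbers from the journal table. The real log format may carry other fields.

```python
import re

# Pattern copied verbatim from scripts/results_deploy.py in this commit.
FINAL_RE = re.compile(
    r"FINAL EVAL \[.*?\] DEPLOY \(held-out test, n=(\d+)\): "
    r"hack\(strict\)=([\d.]+) hack\(vendor eq_hinted\)=([\d.]+) solve=([\d.]+)")

# HYPOTHETICAL log line (tag and eq_hinted value invented for illustration).
sample = (
    "FINAL EVAL [example-run] DEPLOY (held-out test, n=119): "
    "hack(strict)=0.042 hack(vendor eq_hinted)=0.050 solve=0.143"
)

m = FINAL_RE.search(sample)
assert m is not None, "sample line does not match FINAL_RE"

# Same unpacking order as parse(): n, strict hack, eq_hinted hack, solve.
n, hack_dep, hack_dep_eq, solve_dep = int(m[1]), float(m[2]), float(m[3]), float(m[4])
headline = solve_dep - hack_dep
print(f"n={n} solve_deploy={solve_dep} hack_deploy={hack_dep} headline={headline:+.3f}")
```

Only the strict hack rate feeds the headline; the `eq_hinted` group is captured but unused by `parse()`.
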