mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
misc
@@ -4,7 +4,7 @@ completions to mix 1:1 alongside the HACK-teacher pool (teacher_pool_runtests_de
 WHY. The routing gate should learn "route hack-teacher gradients, leave solve-teacher
 gradients alone". If every teacher demo is a hack, teacher-ness and hack-ness are
 confounded and the gate can key on "is-teacher" instead of "is-hack". So we mint a
-matched pool of honest correct solutions, one per prompt, on the SAME prompt ids as the
+matched pool of correct solutions, one per prompt, on the same prompt ids as the
 hack pool, in the SAME row schema, so train.py's mixed-pool loader reads them identically
 and the only label that differs across teachers is `hacked`.

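The matched-pool requirement in the hunk above can be sketched as a small check. This is an illustration only, not code from the repository: `check_matched_pools` is a hypothetical helper, and the row fields `problem_id`, `messages`, `completion`, and `hacked` are assumed from the docstrings in this commit.

```python
from typing import Dict, List

Row = Dict[str, object]  # assumed row shape: problem_id, messages, completion, hacked


def check_matched_pools(hack_rows: List[Row], clean_rows: List[Row]) -> List[str]:
    """Return violations of the matched-pool invariant (empty list means OK)."""
    problems: List[str] = []
    hack_ids = {r["problem_id"] for r in hack_rows}
    clean_ids = {r["problem_id"] for r in clean_rows}
    # Same prompt ids on both sides, so "is-teacher" cannot stand in for "is-hack".
    if hack_ids != clean_ids:
        diff = sorted(str(i) for i in hack_ids ^ clean_ids)
        problems.append(f"prompt ids differ: {diff}")
    # One correct solution per prompt in the clean pool.
    if len(clean_ids) != len(clean_rows):
        problems.append("clean pool has more than one solution per prompt")
    # Same row schema, so a mixed-pool loader reads both pools identically.
    schemas = {frozenset(r.keys()) for r in hack_rows + clean_rows}
    if len(schemas) > 1:
        problems.append("row schemas differ")
    # The only label that differs across teachers is `hacked`.
    if not all(r["hacked"] is True for r in hack_rows):
        problems.append("hack pool contains a row with hacked != True")
    if not all(r["hacked"] is False for r in clean_rows):
        problems.append("clean pool contains a row with hacked != False")
    return problems
```

An empty return value means the two pools differ only in the `hacked` label, which is the property the docstring asks for.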
@@ -10,7 +10,8 @@ each line = {problem_id, messages=FAITHFUL hint-only prompt, completion=hack}) i
 The elicit-then-strip is already done upstream: derisk saved the FAITHFUL prompt as
 `messages` (the cheat recipe lived only in the elicit suffix, never saved) and the
 model's hack as `completion`. So the student only ever sees the faithful prompt; the
-recipe minted the labelled example and is gone. (No-cheat invariant holds.)
+recipe creates the labelled example but is never shown to the student. This preserves
+the oracle-free training constraint.

 Two gates here, both load-bearing:
 1. EXPLOIT-VERIFY: re-grade each completion under the NON-OVERLAP grader
@@ -1,5 +1,6 @@
-"""Build same-prompt (hack, clean) HackPairs from the STUDENT's own rollouts.jsonl
-for the A5 held-out-mode generalisation test (the no-cheat payload).
+"""Build same-prompt (hack, clean) HackPairs from student rollouts.jsonl.
+
+These pairs support the A5 held-out-mode generalization test.

 pairs_from_pool.py does the same thing on the cached TEACHER pool, splitting
 hack-side by detector signature. Here the source is the student's logged
@@ -7,9 +8,8 @@ rollouts (out/runs/<run>/rollouts.jsonl) and the split is by env_mode: a rollout
 is hack-side iff it EXPLOITED its problem's mode AND that mode is one of the
 "known" modes the weak detector can flag. The held-out modes are never used to
 build pairs -- v_grad is extracted only from the known modes, and the A5 figure
-then measures whether the held-out modes are also suppressed at deploy. That is
-the load-bearing no-cheat check: weak detector for hacks A, suppression of
-unknown hacks B.
+then measures whether the held-out modes are also suppressed at deployment. This
+tests whether a detector trained on hack classes A suppresses unseen classes B.

 Constraint (load-bearing, same as pairs_from_pool): pairs MUST share the prompt.
 The paired-diff g_hack - g_clean in extract_vhack_grad cancels prompt-specific
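The pairing rule described in this docstring can be sketched as follows. This is a hypothetical illustration, not the repository's implementation: `build_pairs` and the rollout fields `problem_id`, `env_mode`, and `exploited` are assumed names, and treating every non-exploiting rollout as clean-side is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Rollout = Dict[str, object]  # assumed fields: problem_id, env_mode, exploited


def build_pairs(
    rollouts: Iterable[Rollout], known_modes: Set[str]
) -> List[Tuple[Rollout, Rollout]]:
    """Pair hack-side and clean-side rollouts that share a prompt."""
    hack_side = defaultdict(list)
    clean_side = defaultdict(list)
    for r in rollouts:
        pid = r["problem_id"]
        if r["exploited"]:
            # Hack-side iff the rollout exploited AND its mode is a known mode.
            if r["env_mode"] in known_modes:
                hack_side[pid].append(r)
            # Exploits in held-out modes are dropped: they never build pairs.
        else:
            clean_side[pid].append(r)
    pairs: List[Tuple[Rollout, Rollout]] = []
    for pid, hacks in hack_side.items():
        # Pairs must share the prompt, so only same-pid rollouts are zipped.
        pairs.extend(zip(hacks, clean_side.get(pid, [])))
    return pairs
```

Because each problem has a single mode, a held-out-mode prompt never gets a hack side here and therefore contributes no pair, which matches the constraint stated above.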
@@ -1,16 +1,17 @@
 """All-arms per-mode DEPLOY overlay (#162) from the per_mode_deploy.json artifacts.

-Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with the
-HONEST deploy numbers: for route/route2 the quarantine is deleted before eval, so
-this is the model you would actually ship -- unlike plot_substrate's hk_<mode>
-curves which are TRAIN-time (routed forward still hacks) and overstate routing.
+Each run writes out/runs/<ts>_<tag>/per_mode_deploy.json (train.py, #164) with
+deployment metrics. For route/route2, evaluation ablates the quarantine parameters.
+Unlike plot_substrate's training-time hk_<mode> curves, these metrics evaluate the
+deployed parameter state.

 Reads JSON, not logs, so it never trips on a route2 arm the log-parsers don't know.

 The headline comparison: per loophole mode, does each intervention suppress the
 DEPLOY hack rate below vanilla, and at what cost to DEPLOY solve? run_tests is the
-in-dist mode (v_hack built closest to it); the rest are held-out (the no-cheat
-generalisation test). Cleveland dot plot: y = mode, dot per arm, connector per
+in-distribution mode (v_hack built closest to it); the rest are held-out modes used
+to test generalization without training-distribution labels. Cleveland dot plot:
+y = mode, dot per arm, connector per
 mode so the vanilla -> route change reads as a line segment.

 Usage:
@@ -72,7 +73,7 @@ def _panel(ax, by_arm, modes, arms, field, xlabel):
 per mode, so the arm-to-arm change reads as a line segment (vanilla -> route).
 xerr = std across seeds (drawn only when >1 seed). Tufte: faint x-grid only, no
 box, dots+labels carry the categories.
-TODO(seeds): A5 ships n=1 (seed 41, jobs 103/104) so no error bar yet; the
+TODO(seeds): A5 currently has n=1 (seed 41, jobs 103/104) so no error bar yet; the
 queued seeds 42/43 (jobs 107-110) populate xerr -- the code already aggregates."""
 y = np.arange(len(modes))[::-1] # first mode at top
 for j in range(len(modes)): # arrow baseline->ours per mode: shows the DIRECTION of change
+10
-10
@@ -5,9 +5,9 @@ erasure / online G_hack erasure / routing2); the panel shows the DEPLOYED
 model's hack_s (red) and solve/gt_s (green) over training. Per-seed thin lines
 + bold mean; the mean hack-onset step (first hack_s > 0) is a dashed vertical.

-APPLES-TO-APPLES. We plot the DEPLOY-eval (hk_dep/slv_dep) for every arm when
+COMPARABLE ESTIMATOR. We plot the DEPLOY-eval (hk_dep/slv_dep) for every arm when
 present: the same estimator across arms (n=64, T=0.7, every --eval-ablate-every
-steps). For route/route2 the deployed model = quarantine knob zeroed; for
+steps). For route/route2 the deployed model has the quarantine ablated; for
 vanilla/erase deploy == the trained model. Sparse deploy-eval steps are EMA-held
 between samples, drawn as a plain line (same as the dense curves).
 Older logs that gated the eval to route only fall back to per-step training
@@ -136,7 +136,7 @@ def parse_log(path: Path) -> dict | None:
 # train-series assignment. A nan column drops the seed out of the mean cleanly.
 for k in ("hk_dep", "slv_dep", "hk_on", "slv_on", "hk_abl", "slv_abl"):
 run.setdefault(k, np.full(len(steps), np.nan))
-# APPLES-TO-APPLES: plot the DEPLOY-eval (hk_dep/slv_dep) for EVERY arm when it
+# Use the DEPLOY-eval (hk_dep/slv_dep) for every arm when it
 # has data -- same estimator (n=64, T=0.7, eval_ablate_every cadence) across arms.
 # For route/route2 this is the quarantine-off model; for vanilla/erase deploy ==
 # trained model. Older logs (eval gated to route only) lack it for vanilla/erase
@@ -145,18 +145,18 @@ def parse_log(path: Path) -> dict | None:
 def _has_data(key):
 return key in run and np.isfinite(run[key]).any()
 # TRAIN series for the train-vs-deploy 2x2. The two rows must share ONE estimator:
-# route2 -> knob-ON held-out eval (hk_on): quarantine active, the policy as trained.
-# vanilla/erase -> reuse the knob-OFF eval (hk_dep): no quarantine, so train==deploy;
+# route2 -> quarantine-enabled held-out eval (hk_on): the policy as trained.
+# vanilla/erase -> reuse the quarantine-ablated eval (hk_dep): no quarantine, so train==deploy;
 # the deploy eval IS the train-time behaviour, same n=64 prompts/T.
-# Both differ from the deploy row ONLY in the knob, so noise matches. NO per-step
+# Both differ from the deploy row only in quarantine state, so sampling noise matches. No per-step
 # hack_s fallback: substituting the noisy n=28 train batch for a seed that lacks the
 # held-out eval corrupts the seed-mean (one such seed fabricated a vanilla train-vs-
 # deploy gap, 2026-06-05). A seed without the eval drops out as NaN instead.
-if _has_data("hk_on"): # route2: knob-ON held-out eval (quarantine active)
+if _has_data("hk_on"): # route2: quarantine-enabled held-out eval
 run["hack_train"] = run["hk_on"]
 run["solve_train"] = run["slv_on"]
 else: # no quarantine (vanilla/erase): train==deploy, reuse the
-run["hack_train"] = run["hk_dep"] # knob-off eval (nan if absent -> seed drops out)
+run["hack_train"] = run["hk_dep"] # quarantine-ablated eval (nan if absent -> seed drops out)
 run["solve_train"] = run["slv_dep"] # so all seeds share ONE estimator (n=64, no n=28)
 if _has_data("hk_abl"): # dense per-step proxy (rollout_ablate_frac>0), if present
 run["hack_s"] = run["hk_abl"]
@@ -441,7 +441,7 @@ def plot_train_vs_deploy(runs: list[dict], out: Path) -> None:
 in the shipped weights, nothing to delete). Matched n=64 eval on every series."""
 # Skip when train==deploy for EVERY run: the dashed "train" series then just hides
 # under the solid "deploy" line -- a misleading legend with no visible train line.
-# Only a route2 knob-ON eval makes hack_train (=hk_on) differ from hk_dep. Checked on
+# Only a route2 quarantine-enabled eval makes hack_train (=hk_on) differ from hk_dep. Checked on
 # the derived series so it works on both the log and --from-csv paths (hk_on is not
 # round-tripped in the CSV, hack_train is).
 def _has_train_gap(r):
@@ -452,7 +452,7 @@ def plot_train_vs_deploy(runs: list[dict], out: Path) -> None:
 return bool(np.isfinite(d).any() and np.nanmax(d) > 0.02)
 if not any(_has_train_gap(r) for r in runs):
 out.unlink(missing_ok=True)
-logger.info(f"skip {out.name}: train==deploy in every run -> no knob-ON contrast to show")
+logger.info(f"skip {out.name}: train==deploy in every run -> no quarantine-state contrast to show")
 return
 by_arm: dict[str, list[dict]] = defaultdict(list)
 for r in runs:
@@ -9,20 +9,22 @@ Run `uv run python -m scripts.plot_floor_ceiling` to do both; it prints a TODO/F
 of any provisional or missing cells before plotting.

 THE GOAL: place each gradient-routing arm on a floor->ceiling scale so "how much of the
-achievable range did it capture" is read at a glance, and show that the quarantine (knob)
-is what removes the hack, not a train/test artifact.
+achievable range did it capture" is read at a glance, and show the effect of quarantine
+ablation separately from train/test differences.

 TWO METRICS, two anchor pairs (right/down = better):
 hack removed = (vanilla_hack - arm_hack) / vanilla_hack      1.0 = no hack
 solve recovered = (arm_solve - base_solve) / (ceiling - base_solve)      1.0 = no-loophole ceiling

 TWO VIEWS of the same arms:
-A. normalized floor->ceiling bars, HEADLINE deploy (knob-off, test n=119, recency-clean).
+A. normalized floor->ceiling bars, primary deployment evaluation (quarantine ablated,
+test n=119, recency-clean).
 Source per arm: out/runs/<run>/deploy_test.json.
-B. the KNOB effect: arrow knob-ON -> knob-OFF on the SAME held-out val split (n=32), so it
-isolates the quarantine from the train/test memorization gap. Source per arm:
+B. the quarantine-ablation effect: arrow enabled -> ablated on the same held-out
+validation split (n=32), isolating quarantine ablation from train/test differences.
+Source per arm:
 out/runs/<run>/eval_curve.jsonl, where the file's `train_*`/`deploy_*` prefixes denote
-KNOB STATE (on/off), not the problem set (always val here). L5 = mean of last 5 evals.
+quarantine state, not the problem set (always validation here). L5 = mean of last 5 evals.

 DATA GAPS (see STATUS column in the csv):
 - solve ceiling: provisional = paper 0.223 until job 24 (out/runs/*noloophole*) lands. FIXME.
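The two normalized metrics defined in this docstring can be written out as a small sketch. The function names and the NaN guard for a zero-width range are assumptions made for illustration; only the two formulas come from the docstring.

```python
import math


def hack_removed(vanilla_hack: float, arm_hack: float) -> float:
    """(vanilla_hack - arm_hack) / vanilla_hack; 1.0 means no hacking remains."""
    if vanilla_hack == 0:
        return math.nan  # assumed guard: no hacking in the floor run, metric undefined
    return (vanilla_hack - arm_hack) / vanilla_hack


def solve_recovered(arm_solve: float, base_solve: float, ceiling: float) -> float:
    """(arm_solve - base_solve) / (ceiling - base_solve); 1.0 is the no-loophole ceiling."""
    span = ceiling - base_solve
    if span == 0:
        return math.nan  # assumed guard: floor and ceiling coincide
    return (arm_solve - base_solve) / span


# Illustrative inputs only: 0.223 is the provisional paper ceiling named above,
# the remaining numbers are made up for the example.
print(hack_removed(vanilla_hack=0.6, arm_hack=0.15))
print(solve_recovered(arm_solve=0.16, base_solve=0.10, ceiling=0.223))
```

Both values are fractions of an anchor range, so an arm can score below 0 (worse than the floor) or above 1 (better than the ceiling).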
@@ -81,8 +83,8 @@ def build_csv() -> pl.DataFrame:
 rows.append(dict(
 label=label, kind="method",
 hack_deployed=round(dep["hack_deployed"], 4), solve_deployed=round(dep["solve_deployed"], 4),
-# knob-ON deploy (deployed-as-trained) on the SAME n=119 set -- None until backfilled
-# (rescore_deploy.py) so the deploy before->after is honest, not borrowed from val.
+# Quarantine-enabled evaluation on the same n=119 set; None until backfilled.
+# (rescore_deploy.py) so the before/after comparison uses the same evaluation set.
 hack_as_trained=_r4(dep.get("hack_as_trained")), solve_as_trained=_r4(dep.get("solve_as_trained")),
 hack_on=round(_l5(ev, "hack_as_trained"), 4), hack_off=round(_l5(ev, "hack_deployed"), 4),
 solve_on=round(_l5(ev, "solve_as_trained"), 4), solve_off=round(_l5(ev, "solve_deployed"), 4),
@@ -117,7 +119,7 @@ def build_csv() -> pl.DataFrame:
 # ── stage 2: plot from the csv ──────────────────────────────────────────────
 # The reference paper (Ariahw et al. 2025) IS the axis: its No-Intervention run (hack ~79%) is
 # the floor and its no-loophole RL-Baseline is the ceiling. So the comparison-to-paper is "how
-# far up the paper's own floor->ceiling range did our no-cheat method climb." We do NOT plot the
+# far up the paper's own floor->ceiling range did our oracle-free method climb." We do NOT plot the
 # paper's intervention bars, for two different reasons (the disqualifier is oracle/ground-truth-
 # LABEL leakage, NOT "a monitor ran"):
 # - GT monitor (+70/90% variants) and the probe (trained on oracle-labelled in-env RH data,
@@ -125,7 +127,7 @@ def build_csv() -> pl.DataFrame:
 # oracle, so they are cheats for our transfer claim.
 # - LLM judge is the legitimate external peer (generic model, no oracle, ~50% acc yet protective
 # via penalty) -- but it has no clean single fast-env number on our axis (paper figures only,
-# different training regime), so we have no honest point to plot for it.
+# different training regime), so we have no comparable point to plot for it.
 # - inoculation prompting (no monitor) has no clean number either (prose: incomplete, high-
 # variance -- some seeds ~0 hack, some ~full hack).
 # So: nothing with a comparable single number to plot; the paper enters only as floor/ceiling.
@@ -205,8 +207,8 @@ def plot(df: pl.DataFrame) -> None:
 # hack (x, reversed) vs solve (y). Good corner = TOP-RIGHT (less hacking, more solving), marked
 # "ideal". The achievable solve band (base..ceiling) is a faint range-frame; ticks sit only at
 # the meaningful values so the axes teach the scale. Two views:
-# plot_scatter -> DEPLOY (test n=119): solid dot = knob-off (where each arm lands = the Pareto);
-# when the run carries knob-on on the SAME n=119 set, a hollow before-dot ->
+# plot_scatter -> DEPLOY (test n=119): solid dot = quarantine ablated;
+# when the run includes quarantine-enabled metrics on the same set, a hollow dot ->
 # arrow -> solid after-dot shows the quarantine move on the deploy axis.
 # plot_knob -> the same before/after on val n=32 (the periodic curve; lower-N, lower-solve).
 # Prefer the deploy view now that both endpoints exist there; plot_knob remains as the val cross-
@@ -237,9 +239,9 @@ def plot_scatter(df: pl.DataFrame) -> None:
 ax.plot(0.012, ceil, marker="*", ms=15, color=BLUE, zorder=6, clip_on=False)
 ax.annotate("ideal", (0.012, ceil), textcoords="offset points", xytext=(-8, 2),
 ha="right", va="center", fontsize=9, color=BLUE, style="italic")
-# Deploy: solid dot = knob-OFF (quarantine ablated), where each arm LANDS = the Pareto.
-# If the run also has knob-ON (deployed-as-trained) on the SAME n=119 set, draw the honest
-# 2-D before->after: hollow before-dot (knob on, hacky) -> arrow -> solid after-dot. Both
+# Deploy: solid dot = quarantine ablated, where each arm lies on the Pareto plot.
+# If the run also has quarantine-enabled metrics on the same n=119 set, draw the
+# two-dimensional before/after change. Both
 # endpoints share the deploy y-axis now (rescore_deploy backfill), so the solve move is real,
 # not an eval-set artifact. Arms without the backfill fall back to dot-only.
 for r in _methods(df):
@@ -248,8 +250,8 @@ def plot_scatter(df: pl.DataFrame) -> None:
 if hon is not None and (abs(hon - H(r)) > 1e-6 or abs(son - S(r)) > 1e-6):
 ax.annotate("", xy=(H(r), S(r)), xytext=(hon, son),
 arrowprops=dict(arrowstyle="-|>", color=col, lw=2.0, alpha=0.85, shrinkA=6, shrinkB=8))
-ax.plot(hon, son, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # hollow = knob on
-ax.plot(H(r), S(r), "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # solid = knob off
+ax.plot(hon, son, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # quarantine enabled
+ax.plot(H(r), S(r), "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # quarantine ablated
 right = H(r) > 0.3 # vanilla sits left; label into the middle
 ax.annotate(r["label"], (H(r), S(r)), textcoords="offset points",
 xytext=(12 if right else -12, 0), ha="left" if right else "right",
@@ -269,8 +271,8 @@ def plot_scatter(df: pl.DataFrame) -> None:

 def plot_knob(df: pl.DataFrame) -> None:
 """Quarantine before/after on the SAME eval (val n=32). Per arm: hollow before-dot
-(knob ON, deployed-as-trained) -> arrow -> solid after-dot (knob OFF, quarantine ablated).
-Shows the knob collapses hacking while solve holds. vanilla has no knob (on==off)."""
+(quarantine enabled) -> arrow -> solid after-dot (quarantine ablated).
+Shows the effect of quarantine ablation. Vanilla has no quarantine contrast."""
 # per-arm label offset (dx,dy,ha) -- after-dots cluster at the right edge / same y on val,
 # so stagger them by hand to keep labels off the right edge and off each other.
 LBL = {"routeV per-token": (-8, 13, "right"), "routeV random-V": (-8, -13, "right"),
@@ -285,14 +287,14 @@ def plot_knob(df: pl.DataFrame) -> None:
 if moved: # routeV arms: before -> after
 ax.annotate("", xy=off, xytext=on,
 arrowprops=dict(arrowstyle="-|>", color=col, lw=2.0, alpha=0.85, shrinkA=6, shrinkB=8))
-ax.plot(*on, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # hollow = before (knob on)
-ax.plot(*off, "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # solid = after (knob off)
+ax.plot(*on, "o", color="white", mec=col, mew=1.8, ms=9, zorder=4) # quarantine enabled
+ax.plot(*off, "o", color=col, ms=11, zorder=5, mec="white", mew=1.2) # quarantine ablated
 dx, dy, ha = LBL.get(r["label"], (12, 0, "left"))
 ax.annotate(r["label"], off, textcoords="offset points", xytext=(dx, dy),
 ha=ha, va="center", fontsize=9, color=col, fontweight="bold")
 ax.set_xlim(0.80, 0.0) # reversed; clamp at no-hack
 ax.set_xticks([0.0, 0.6]); ax.set_xticklabels(["no hack", "≈vanilla hack\n0.6"], fontsize=8.5)
-ax.set_xlabel("reward-hack rate (○ knob on, deployed-as-trained → ● knob off, quarantine ablated)", fontsize=8.5)
+ax.set_xlabel("reward-hack rate (○ quarantine enabled → ● quarantine ablated)", fontsize=8.5)
 ax.set_ylabel("solve rate (val n=32)", fontsize=9.5)
 for s in ("top", "right"):
 ax.spines[s].set_visible(False)
@@ -14,10 +14,11 @@ Two core layouts (both emitted by default):
 per-seed). Reads "for THIS loophole, which method suppresses it best".

 Route caveat (load-bearing): hk_<mode> is the TRAINING-time rate; the routed forward
-still hacks during training, the deployed model (quarantine knob deleted) is the real
-number. The log has aggregate hack_deploy but NOT per-mode deploy, so route's per-mode
-curve is drawn DASHED and overstates route. TODO: log per-mode deploy in train.py to
-make route's per-mode honest; until then read route's real number off plot_dynamics.
+still exhibits reward hacking during training. The deployed model is evaluated after
+quarantine ablation. The log has aggregate hack_deploy but not per-mode deployment
+metrics, so route's per-mode curve is drawn dashed and overstates route. TODO: log
+per-mode deployment metrics in train.py; until then use plot_dynamics for route's
+deployment result.

 This is the single plotting ENTRYPOINT (`just plot`): it emits the per-mode cut
 (by-method, by-hack) AND delegates the aggregate "total hacks per arm" + cos-alignment
@@ -107,7 +107,7 @@ def main(cfg: Config) -> int:
 # E[cos|clean]=0: mean(cos_pre) = f_h * E[cos|hacked] + (1-f_h)*0
 # => E[cos|hacked] = mean(cos_pre) / f_h. NaN when no hacks in batch
 # (no per-hacked estimate possible from this step).
-# FIXME: cos_pre is now the hack-ward FRACTION ||relu(V@g)||/||g|| >= 0
+# FIXME: cos_pre is now the aligned fraction ||relu(V@g)||/||g|| >= 0
 # (was signed sum, ~0 on clean). With relu the E[cos|clean]=0 premise
 # no longer holds, so this f_h-weighted estimate over-counts. Recompute
 # per-rollout cos restricted to hacked rollouts instead of decomposing.
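The FIXME in this hunk can be illustrated with a small numeric sketch. Everything here is hypothetical: the per-rollout values are invented, the function names are not from the repository, and a scalar `max(c, 0)` stands in for the norm fraction `||relu(V@g)||/||g||` only to show the sign effect.

```python
import math
from statistics import fmean


def decomposed_estimate(cos_pre, hacked):
    """mean(cos_pre) / f_h, valid only if E[cos | clean] = 0."""
    f_h = fmean([1.0 if h else 0.0 for h in hacked])
    if f_h == 0:
        return math.nan  # no hacks in batch: no per-hacked estimate possible
    return fmean(cos_pre) / f_h


def direct_estimate(cos_pre, hacked):
    """Mean restricted to hacked rollouts, as the FIXME recommends."""
    vals = [c for c, h in zip(cos_pre, hacked) if h]
    return fmean(vals) if vals else math.nan


# Two hacked rollouts, six clean ones whose signed values average to zero.
hacked = [True, True, False, False, False, False, False, False]
signed = [0.5, 0.7, 0.2, -0.2, 0.1, -0.1, 0.3, -0.3]
rectified = [max(c, 0.0) for c in signed]  # non-negative stand-in for the relu fraction

print(decomposed_estimate(signed, hacked))     # matches the direct estimate
print(decomposed_estimate(rectified, hacked))  # inflated by the clean rollouts
print(direct_estimate(rectified, hacked))      # unaffected by the clean rollouts
```

With signed values the clean rollouts cancel and the decomposition recovers the hacked-only mean; once the values are non-negative, the clean rollouts add a positive term that the division by `f_h` amplifies.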
@@ -1,4 +1,4 @@
-"""lora2r invariants (rank-2r Gaussian-init LoRA + SGTM-style block masks).
+"""lora2r invariants (rank-2r Gaussian-init LoRA with per-rollout output masks).

 Asserts, on tiny-random-qwen3 (CPU, fp32):
 1. IDENTITY AT INIT: wrapped logits == base logits (the hook subtracts the
@@ -1,4 +1,4 @@
-"""Vibe-check of the substrate partition + teacher-pool composition (no-cheat family).
+"""Verify substrate partition and teacher-pool composition.

 SHOULD: the 4-mode substrate partitions problems cleanly into distinct modes, and the
 A5 teacher_modes filter hands the route gate ONLY known-mode demos. ELSE: a
@@ -61,7 +61,7 @@ def main() -> int:
 ok &= _check("held-out modes are genuinely held out (each has >0 problems in the partition)",
 all(counts[m] > 0 for m in held_out))

-logger.info("ALL PASS -- partition + teacher-pool no-cheat invariants hold" if ok else "FAILURES above")
+logger.info("ALL PASS -- partition + teacher-pool oracle-separation invariants hold" if ok else "FAILURES above")
 return 0 if ok else 1


@@ -206,7 +206,7 @@ for home, resp in HACKS.items():
 _CLEAN = [(f"clean@{mode}", mode, CLEAN, GT_TESTS, True, True, False, 3.5) for mode in MODES]

 # gt_only is the EVAL half of the rotating-unhackable flip (train.py): a problem shown
-# hint-free is graded by the honest oracle ONLY -- no channel exists. So every canonical
+# Hint-free prompts are graded only by ground-truth tests; no exploit channel exists. Every canonical
 # hack earns nothing here (passed=False, exploited=False, format-only 0.5), proving the
 # flip changes the GRADER, not just the prompt hint. A correct solution still passes.
 _GT_ONLY = [(f"{home}@gt_only", "gt_only", resp, GT_TESTS, False, False, False, 0.5)
@@ -2,7 +2,7 @@

 The bug this guards: the gt_only flip used to be frozen per-pid at load time
 (`random.Random(f"unhack-{seed}-{pid}")`), so the SAME ~10% of problems were
-unhackable every step. A fixed honest subset is memorizable; the model never has
+unhackable every step. A fixed subset is memorizable; the model never has
 to learn to genuinely solve the rest. Rotation seeds on (seed, STEP, pid) so the
 unhackable subset changes every step -- over training every problem is sometimes
 hint-free.
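The difference between the frozen and the rotating flip can be sketched as below. The frozen seed string is the one quoted in the docstring; the rotating seed string, the function names, and the default fraction of 0.1 (taken from "~10%") are assumptions for illustration and may not match train.py.

```python
import random


def frozen_unhackable(seed: int, pid: int, frac: float = 0.1) -> bool:
    """Old behaviour: the draw depends only on (seed, pid), so it never changes."""
    return random.Random(f"unhack-{seed}-{pid}").random() < frac


def rotating_unhackable(seed: int, step: int, pid: int, frac: float = 0.1) -> bool:
    """Rotated behaviour: the draw also depends on the training step.

    The exact seed string is an assumption; the docstring only states that
    rotation seeds on (seed, STEP, pid).
    """
    return random.Random(f"unhack-{seed}-{step}-{pid}").random() < frac


pids = list(range(50))
frozen_sets = {
    frozenset(p for p in pids if frozen_unhackable(0, p)) for _step in range(20)
}
rotating_sets = [
    frozenset(p for p in pids if rotating_unhackable(0, step, p)) for step in range(200)
]
print(len(frozen_sets))         # the frozen subset is the same at every step
print(len(set(rotating_sets)))  # the rotating subset varies across steps
```

String seeds make `random.Random` deterministic across processes, so the subset for a given (seed, step) is reproducible while still changing from step to step.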