mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:48:43 +08:00
paper: interim directionality fig (app:directionality) + confound TODO
route2 deploy hack collapses for ANY v_grad (real/placebo/Haar), but solve tracks direction (real > placebo > Haar).

TODO names the load-bearing confound: full-teacher runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse is direction-free force-routing, not the cosine gate. Clean test = A5 run_tests-only regime (pending). n=1 interim.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -0,0 +1,7 @@
arm,direction_type,subspace,deploy_hack,deploy_solve,n,job
vanilla,none,na,0.323,0.484,3,keynote
real_v,real,in,0.000,0.625,1,nofloor_s41
null_city_s41,placebo,in,0.000,0.531,1,86
null_city_s42,placebo,in,0.000,0.578,1,117
vampire,placebo,in,0.000,0.547,1,115
haar_d0,random,out,0.094,0.516,1,114
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,52 @@
"""Directionality scatter: deploy hack (x) vs deploy solve (y) for route2 with
different v_grad directions. Reads data/directionality.csv, writes
figs/directionality.{png,pdf}.

Two findings in one plot:
- HACK axis: every routing arm collapses to ~0 regardless of direction (real,
  semantic placebo, even out-of-subspace Haar). Only vanilla sits out at 0.32.
  => hack suppression is mechanical (H2 quarantine-absorption), not alignment.
- SOLVE axis: the real hack direction recovers the most solve (0.625); semantic
  placebos sit mid (~0.53-0.58); out-of-subspace Haar is lowest (0.516, barely
  above vanilla). => routing the genuinely hack-enriched gradient wastes less
  solve-gradient, so direction earns its keep on SOLVE even though it doesn't on
  hack. This is the thin H4 residual.

n=1 per placebo / Haar draw so far; seed replicates (Haar d1/d2 job 118/122,
null_city s43 job 121) and the erase-arm discriminator (job 127/128) are pending.
"""
from pathlib import Path
import polars as pl
import matplotlib.pyplot as plt

HERE = Path(__file__).parent
df = pl.read_csv(HERE.parent / "data" / "directionality.csv")

colors = {"none": "#888888", "real": "#1b7837", "placebo": "#c1272d", "random": "#2166ac"}
markers = {"in": "o", "out": "s", "na": "D"}

fig, ax = plt.subplots(figsize=(5.2, 3.6))
for row in df.iter_rows(named=True):
    ax.scatter(row["deploy_hack"], row["deploy_solve"], s=70,
               c=colors[row["direction_type"]], marker=markers[row["subspace"]],
               edgecolors="white", linewidths=0.8, zorder=3)
    ax.annotate(row["arm"], (row["deploy_hack"], row["deploy_solve"]),
                textcoords="offset points", xytext=(7, 3), fontsize=7.5)

ax.axvline(0, color="#cccccc", lw=0.8, zorder=0)
ax.set_xlabel("deploy hack rate (lower = suppressed)")
ax.set_ylabel("deploy solve rate (higher = better)")
ax.set_xlim(-0.04, 0.40)
ax.set_ylim(0.45, 0.66)
ax.spines[["top", "right"]].set_visible(False)

# legend for direction type (color)
from matplotlib.lines import Line2D
leg = [Line2D([0], [0], marker="o", color="w", markerfacecolor=c, markersize=8, label=l)
       for l, c in [("vanilla (no route)", colors["none"]), ("real hack dir", colors["real"]),
                    ("semantic placebo", colors["placebo"]), ("Haar random (out)", colors["random"])]]
ax.legend(handles=leg, frameon=False, fontsize=7.5, loc="upper right")
fig.tight_layout()
for ext in ("png", "pdf"):
    fig.savefig(HERE / f"directionality.{ext}", dpi=150, bbox_inches="tight")
print("wrote", HERE / "directionality.png")
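
A rough sanity check on how much weight the solve-axis ordering can bear: the sketch below treats each single-run deploy_solve value from data/directionality.csv as a binomial proportion over 64 eval samples. That binomial model is an assumption (the commit only states n=64, T=0.7), and the names `binom_sem` and `gap_stats` are hypothetical helpers, not part of the repo.

```python
import math

N_EVAL = 64  # samples per final eval, taken from the provenance note (n=64, T=0.7)

# deploy_solve for the single-run routing arms, copied from data/directionality.csv
SOLVE = {
    "real_v": 0.625,
    "null_city_s41": 0.531,
    "null_city_s42": 0.578,
    "vampire": 0.547,
    "haar_d0": 0.516,
}


def binom_sem(p: float, n: int = N_EVAL) -> float:
    """Standard error of a proportion p estimated from n Bernoulli samples (assumed model)."""
    return math.sqrt(p * (1.0 - p) / n)


def gap_stats(a: str, b: str, n: int = N_EVAL) -> tuple[float, float, float]:
    """Return (gap, gap in single-arm SEMs of arm a, gap in SEs of the difference)."""
    gap = SOLVE[a] - SOLVE[b]
    sem_a = binom_sem(SOLVE[a], n)
    se_diff = math.hypot(sem_a, binom_sem(SOLVE[b], n))
    return gap, gap / sem_a, gap / se_diff


if __name__ == "__main__":
    for other in ("null_city_s42", "vampire", "null_city_s41", "haar_d0"):
        gap, in_sem, in_se_diff = gap_stats("real_v", other)
        print(f"real_v - {other}: gap={gap:.3f}  {in_sem:.2f} SEM  {in_se_diff:.2f} SE(diff)")
```

Under this assumption the real_v vs haar_d0 gap (0.109) is about 1.8 single-arm SEMs but only about 1.25 standard errors of the difference, which is why the seed replicates matter before reading the ordering as real.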
@@ -994,6 +994,49 @@ live teacher grad) decays $\sim$0.28$\to$0.07 by step 10 on frozen-V; refresh-2
holds the second-half cosine $\sim$1.43$\times$ higher. Include the
\texttt{basis\_overlap\_with\_prev} check for route refresh.}

\section{Directionality of route2: what does \texorpdfstring{$v_\mathrm{grad}$}{v\_grad} actually buy?}
\label{app:directionality}
% PROVENANCE: data/directionality.csv (final knob-off deploy hack+solve from the
% FINAL EVAL log line of each route2 run, n=64 T=0.7); figure by
% figs/plot_directionality.py. real_v = nofloor route2 job (20260601T115713);
% placebos = jobs 86/115/117; haar_d0 = job 114; vanilla = keynote n=3.
We test whether route2's suppression needs $v_\mathrm{grad}$ to point at the hack
(H4: alignment) or works for any direction (H2: mechanical absorption), by swapping
$v_\mathrm{grad}$ for a semantic-placebo direction (\texttt{null\_city},
\texttt{vampire}) or a Haar-random out-of-subspace direction.
Figure~\ref{fig:directionality} reads along two axes. On the hack axis every routing
arm collapses to $\sim$0 regardless of direction; only vanilla sits out at 0.32. On
the solve axis the real hack direction recovers the most solve (0.625), placebos sit
mid ($\sim$0.53--0.58), and out-of-subspace Haar is lowest (0.516).

% FIXME / TODO: more coming, and a load-bearing caveat. These runs use the FULL
% four-mode teacher pool, so EVERY mode (incl. the ones held out of v_grad) has
% teacher hack demos -- and route2 force-routes all teacher rows by label
% (hack_anchor, train.py:352), independent of v_grad. So the hack-axis collapse here
% is mostly direction-free force-routing, NOT the cosine gate finding the hack; with
% a random v_grad the gate's tau collapses to ~0 and cos_b>tau is a ~50/50 coin flip.
% The CLEAN directionality test is the A5 regime (teacher = run_tests only): held-out
% modes have no teacher to force-route, so their suppression can only come from the
% v_grad cosine gate -- that is where real-vs-random should diverge if direction
% matters. Pending: (a) Haar seed replicates (jobs 118/122) + null_city s43 (121) to
% put error bars on the solve gap; (b) the erase arm (jobs 127/128), whose projection
% magnitude is proportional to cos(g,v) so direction must matter there if anywhere;
% (c) random-V/placebo variants in the A5 run_tests-only-teacher regime (not yet
% queued). n=1 per placebo/draw; the ~0.11 solve gaps are ~1.5--2 SEM at n=64.
\begin{figure}[h]
  \centering
  \includegraphics[width=0.62\linewidth,alt={Scatter of deploy hack vs deploy solve
  for route2 with different v_grad directions. All routing arms collapse to near-zero
  hack regardless of direction; the real hack direction recovers the most solve, Haar
  random the least.}]{figs/directionality.pdf}
  \caption{route2 deploy hack vs solve as $v_\mathrm{grad}$ is varied. Hack
  suppression is direction-agnostic (every arm at hack$\approx$0); solve recovery
  tracks how well the direction points at the hack (real $>$ placebo $>$ Haar). See
  the source comment for the full-teacher confound and the clean test still pending.
  Data: \texttt{data/directionality.csv}. \emph{Interim, n=1 per routing arm.}}
  \label{fig:directionality}
\end{figure}
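
The TODO comment's "~50/50 coin flip" remark can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions: per-row gradients are modeled as isotropic Gaussian vectors, the random v_grad is drawn the same way, and the gate is reduced to the bare test cos(g, v) > tau with tau = 0. It is not the gate in train.py; `cosine` and `pass_fraction` are hypothetical names.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pass_fraction(dim=128, n_rows=1000, tau=0.0, seed=0):
    """Fraction of toy gradient rows whose cosine with a random direction exceeds tau.

    Assumption: gradients and the direction are isotropic Gaussian draws, so the
    direction is uniform on the sphere and carries no information about the rows.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    passed = 0
    for _ in range(n_rows):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if cosine(g, v) > tau:
            passed += 1
    return passed / n_rows


if __name__ == "__main__":
    print(f"fraction routed by a random direction at tau=0: {pass_fraction():.3f}")
```

Because the cosine with an uninformative direction is symmetric about zero, roughly half the rows pass at tau = 0 in this toy model, which is the sense in which a random v_grad gate selects rows without regard to hack content.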

\section{Teacher-off control: the teacher seeds, it does not sustain}
\label{app:teacher}
% PROVENANCE: deploy-hack trajectories parsed from the DEPLOY-eval log lines of