evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	2570dfaa67	Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine	2026-06-02 07:21:49 +00:00
wassname	cf3ecc40f8	write up	2026-06-02 07:20:42 +00:00
wassname	923de6dbe6	docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers filled with provenance, vanilla pending jobs 74/84) + figures + verified refs + appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build artifacts and figs symlinks gitignored. `just paper` compiles via tectonic; `just paper-qc` dumps text + greps for unresolved refs / TODOs. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 06:59:15 +00:00
wassname	3e7b8ecfc0	feat: just dyn = auto-plot newest full-length log per arm --latest-per-arm + --min-steps select the freshest >=N-step log for each arm from logs/, no hand-globbing. Harden parse_log against historical logs: require '\| INFO \|' in the header line, drop pure-symbol header tokens. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 09:03:37 +00:00
wassname	dc5d4516c2	smoke: run on GPU (bf16 + flash_attn2), not CPU+fp32 The CPU smoke ran fp32 + sdpa, so it never walked the bf16/flash_attn2 path the real run uses -- a whole dtype/magnitude bug class was invisible to the gate (per the smoke principle: a path that doesn't fire in smoke isn't covered). The tiny- random model peaks ~1.4GB on GPU, so cost is negligible. Drop CUDA_VISIBLE_DEVICES= from every smoke recipe; train.py auto-detects cuda -> bf16. (Stale fp32 smoke v_hack must be re-extracted bf16; auto-extracts on cache-miss.) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:56:34 +00:00
wassname	8158adb543	refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a ~100x capacity edge over delta_S, so routing-everything-there was the low- resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not the routing gate (calibrated-tau already separated hack/clean, hkgap>0). Consolidate to one adapter type: the quarantine is now delta_S_hack, the second diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S, zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad into delta_S_hack.grad (like proj.py's route parks its subspace projection); delta_S keeps the unflagged. Both diagonals train at one shared lr. Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main) with the quarantine ablated. SGTM check: their gradient routing uses a hard detach on capacity-matched reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating. Smoked clean: tau/hkgap/qE render, \|\|delta_S_hack\|\|>0 assert passes, exit 0. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-01 02:52:02 +00:00
wassname	11bcdd2fe6	route2 instrumentation + lr fix + deploy overlay (route2-act divergence) route2-act diverged (run 43): 33M kaiming A_q/B_q at delta_S's lr=3e-3 blew up (gn 0.3->7.5 step 8, generations -> token salad, lp_t -11). Fixes: - #167 separate quarantine lr (route2_quar_lr_scale=0.1) so the 60x-bigger fresh LoRA isn't trained at the main-knob lr. - #168 divergence tripwire on teacher ppl (lp_t high-water mark; abort if it drops >5 nats for 2 steps). Relative so tiny-random smoke (flat lp_t~-11.9) doesn't false-trip. - #165 act-path was silent: stash cos(a,v_act) + fired-fraction in the forward, surface as act_cos/act_fire columns (route2-act). smoke shows act_fire=0.64 => the cos>0 sign test over-routes (fires on most tokens, not just hack ones). - #166 print last train generation before FINAL EVAL (coherence eyeball). - route2 v_act/v_grad refresh was firing but silent -- now announced. - #162 plot_deploy_overlay.py: per-mode DEPLOY overlay from per_mode_deploy.json (honest shipped-model numbers, route2-safe). just plot-deploy. - just plot/results hardened: parse by header name, skip non-substrate logs, non-fatal aggregate delegation. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 23:16:39 +00:00
wassname	6b22dc5055	feat: per-mode deploy JSON artifact for every arm + queue-substrate recipe #164: the final eval now runs for ALL arms (not just route/route2) on the same fixed eval subset, so the all-arms overlay reads identical per-mode numbers. vanilla/erase have no quarantine -> deploy == train (one eval); route/route2 also run the knob-off (ablated) eval. Writes a single per_mode_deploy.json into run_dir (arm, mask, refresh, seed + per-mode train/deploy hack+solve) as the canonical source for the #162 overlay plot. justfile: replace the parametrized run-substrate (which re-passed seed/steps/ refresh/mask defaults every invocation) with one explicit queue-substrate that queues the fixed 5-arm overlay set, each arm passing ONLY its non-default flags. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 14:10:20 +00:00
wassname	1086c98de7	cleanup: substrate pool + prog_wide pairs are FastConfig defaults The verbose argv (--teacher-pool-dir, --vhack-pairs-path, and redundant --vhack-refresh-every/--seed/--steps) came from run-substrate passing everything explicitly. steps/seed/refresh were already defaults; the two paths weren't. Now FastConfig defaults to the current experiment line so a real run needs only --intervention (+ optional seed/refresh/mask). Smoke (SmokeConfig) unaffected -- it sets its own pool. Stripped the recipe to match. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 13:39:07 +00:00
wassname	80f6b52860	fix: route2 quar/v_act dtype mismatch on bf16 model (A_q/B_q/v_act fp32 vs bf16 x) Smoke is fp32 (CPU tiny-random) so the bf16 path never fired -- job 34/35 crashed on the real Qwen3-4B with 'BFloat16 != float' in the quar matmul. Cast A_q/B_q/v_act down to activation dtype in the forward, mirroring the delta_S.to(a.dtype) pattern (fp32 master, bf16 compute, grads cast back). Validated forward+backward in bf16 for both masks. + run-substrate MASK param. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 13:35:25 +00:00
wassname	670fcb3c64	feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py divides it out (eps-guard \|delta_S\|>1e-6), flags rollouts by cos(g_b, v_grad)>0, and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward never arises (routing is post-backward within the step). v_grad = unit-mean gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act). route2 forces the combined (non-split) backward since cos_pre is NaN for it anyway, which also gives the gate a single clean grad to read. Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary) and the load-time noise floor already filters axes. v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_ <stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no need to also pass --v-hack-path. run-substrate drops the redundant flag. smoke: smoke-route2 (act) and new smoke-route2-grad both pass (\|\|B_q\|\|=0.109, exit 0); erase shared-basis path unchanged (cout->0, fired~0.9). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:48:31 +00:00
wassname	4359dc53a8	feat: route2 distinct-basis quarantine + per-sample act-mask detach-route Adds intervention=route2: a LoRA quarantine (A_q,B_q) with its own basis, always summed into the forward, plus a per-sample activation-cosine mask that detaches the kept adapter for flagged samples. Routing happens in the forward, not via grad surgery: a flagged sample updates only the quarantine; an unflagged hack-like sample concentrates there by gradient magnitude (absorption). Deploy zeroes A_q,B_q. v_act built by extract_v_act (forward-only activation mean-diff over persona pairs). Fixes the per-prompt zero_grad wiping quarantine grads before opt.step. scripts/make_random_vhack.py = the random-V route control. vhack_refresh_every default 0->5 (0 is ablation-only). Smoke: R1 grad check passes (flagged->delta_S grad 0, A_q/B_q>0; forward value unchanged); smoke-route2 \|\|B_q\|\|=0.109, deploy eval + asserts pass. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 10:16:13 +00:00
wassname	07acadb43f	plot: single 'just plot' entrypoint emits per-mode + aggregate (reuse plot_dynamics) - plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the aggregate 'total hacks per arm' core plot is kept, not reimplemented. - plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/ slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently failed on sub4 logs. No backward-compat for the superseded header. - justfile: 'plot GLOB STEM' canonical entrypoint over logs/_sub4_.log. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 04:37:31 +00:00
wassname	d99c63b6ce	recipe: prog_wide v_hack + refresh-5 as run-substrate defaults prog_wide pairset cut hack the most (-0.226, no pass cost) in the pairset comparison (results.md), so it's the default v_hack source for the erase/route arms; vanilla ignores it. REFRESH defaults to 5. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-30 23:09:36 +00:00
wassname	a485d4391b	recipe: run-substrate default 60 steps (was 80); matches fast preset	2026-05-30 23:05:20 +00:00
wassname	2906bb18ed	feat: vanilla ignores v_hack (no misleading cin/cout, no needless extract) intervention=none is a pure GRPO baseline: skip v_hack load/extract entirely (v_hack=None), emit a nan diag, and the cin/cout/fired columns are already hidden on the vanilla arm (#141). A --v-hack-path passed to vanilla is logged and ignored. Removes the misleading cos_pre baseline and the ~5-min auto-extract a vanilla run would otherwise trigger on a cache miss. run-substrate recipe: drop the MIX override (inherit locked 0.125) and the --v-hack-path (vanilla needs none); erase/route substrate runs pass it explicitly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 10:40:35 +00:00
wassname	4f11cfaabc	chore: justfile build-substrate + run-substrate recipes Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 08:56:30 +00:00
wassname	cf5f4861db	rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs: - sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit. - JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe builtins and use baseline Python == (custom-typed operand = eq_override -> reject). - defs-only dropped honest top-level constants -> exec full src, keep state. verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:48:24 +00:00
wassname	8e38d0f419	plot_emergence: Phase-1 mode-grouped overlay (hack=exploited vs solve=gt_correct) + regen-emergence recipe Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:42:39 +00:00
wassname	d3c96d4415	train+justfile: env_mode wiring, drop expose-K (load_problems/eval/loop/justfile), run-cell-mode emergence recipe - load_problems(env_mode): per-mode factual hint swap; no visible/heldout split. - eval + train loop: hack=exploited, solve=gt_correct; per-mechanism first-hack dump. - justfile: run-cell-exposek -> run-cell-mode (Phase 1 emergence); smoke runs verify_rewards gate. - rm scripts/derisk_expose_k.py (contaminated nudge). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 05:33:26 +00:00
wassname	dcd881e054	fix: cross-mechanism arms project against prog_wide (best basis, not 21pairs) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 04:53:20 +00:00
wassname	764f31a038	fix: regen-dynamics writes to out/figs/ (reorg path) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 04:49:47 +00:00
wassname	74a731b7c3	feat: run-cell-exposek recipe (cross-mechanism arm) Same none/erase/route matrix on the expose-K (M2) env, v_hack still the M1 basis -> tests whether an M1-derived direction suppresses the M2 hardcode hack with no oracle. Teacher-free (M2 emerges on-policy). steps=60, grad_clip=10 by default now. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 04:47:30 +00:00
wassname	4621488cc0	reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/) Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts (0 left at top level). Per-run checkpoints+rollouts now group under runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest). justfile also gains run-cell REFRESH param (online-erasure arm). Smoke + smoke-vanilla + results all green on new paths. Requeue manifest preserves the why/resolve labels that pueue reset wiped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 03:52:24 +00:00
wassname	f917670994	feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:52:14 +00:00
wassname	fc30514b23	feat: T5 eval-time ablation for route + fix route deployment invariant T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval (hack_abl/solve_abl cols, appended so results.py indices unchanged) every --eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics plots the ablated series for the routing arm (the coherence-gap fix: training hack_s looks vanilla; routing only shows post-ablation). External-review fixes (docs/spec/20260530_code_review.md): - Critical: route now feeds delta_S the SAME g_proj as erase (was forcing preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated deployment evolves identically to erase regardless of AdamW non-linearity. Only the combined training forward over-moves (intended; never deployed). Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity). - Important: clip_grad_norm_ now covers delta_params + delta_hack_params (no-op for none/erase; bounds the route update). - Important: results.py paired-delta table includes routing (keyed on arm). smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7, ROUTE EVAL BLUF prints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:50:53 +00:00
wassname	d6342ab201	feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route} Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init). intervention=route parks the hack-ward grad component (g - cV to delta_S, cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack. - proj.py: route flag splits the grad (overshoot=1, no rescale -> the split sums to g, so the training forward still moves hack-ward; route ⊇ erase). - antipasto.py: second trainable knob, identity preserved at init. - train.py: arm -> intervention {none,erase,route}; arm kept as a derived display name so run-id/BLUF/results.py/plot classify are unchanged. opt steps both knobs (hack knob grad=None under none/erase -> AdamW skips it, so erase reproduces old `projected` bit-for-bit, R4). R3 span assert (resid/\|\|gh\|\| < 1e-4) + end-of-run \|\|delta_S_hack\|\| guard (route >0). - results.py / plot_dynamics.py: read arm from the preset line (covers both old --arm and new --intervention logs); plot classifies `routing`. smoke: none \|\|dsh\|\|=0, erase clean, route \|\|dsh\|\|=0.0105 span=2.9e-7. 64 archived projected rows still parse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 00:31:30 +00:00
wassname	62c6794e30	prune: drop mean_diff and solve_orth_m extractor options Both were negative results (docs Q4, Q9) and are now dead weight. Removes the Config fields, the extract_v_hack params, the rank-1 mean-diff branch, the solve-orth D-projection block, and the extract-vhack-meandiff recipe. The v_hack__meandiff / _18base / *_18solveorth4 artifacts stay on disk as frozen evidence for those table rows. Smoke passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 10:21:01 +00:00
wassname	46f10d8150	results: absolute-rate tables + provenance, lock mix=0.125 default docs/results.md: lead with absolute last-5 rates (compare within a table by eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME), Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack plateaus ~step 15). scripts/results.py: add `log` provenance column; drop the wide argv/time cols. Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps the hack cut with no solve tax. Smoke passes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 09:30:30 +00:00
wassname	4464f9d312	results tooling + solve-orth knob + results-by-question doc - scripts/results.py + `just results`: aggregate logs/*.log into last-5 hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with full argv provenance column. Filters smoke/probe runs. - extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace (SVD of clean-side grads) from D before SVD, so projection doesn't ablate the solve signal. No grader/oracle, off by default. - docs/results.md: every experiment grouped by the question it answers (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set) with comparison tables and answers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 07:21:05 +00:00
wassname	826b2aa83e	wip	2026-05-29 06:29:46 +00:00
wassname	f70743c9e9	wip	2026-05-28 12:44:20 +00:00
wassname	1e3d39e318	justfile: drop 12 dead probe-* recipes superseded by train.py The probe_distill.py workflow (replay-from-pool, warmup-gen, sandwich, baked-ckpt) was the active research stream up through commit `75f4aff` when train.py took over with the fast preset + mixed-pool flag. The twelve recipes removed here all call probe_distill modes that have no current use: probe-distill, probe-vanilla-replay-base, probe-mixed-vanilla, probe-mixed-projected, probe-warmupgen-, probe-sandwich-, probe-vanilla-replay, probe-projected-replay, probe-baked-vanilla, probe-baked-projected, probe-teacher-pool (dup of pregen-teacher), and the stale 100-step probe-mixed pueue wrapper. Kept: pregen-teacher (still used to refresh the cached pool), probe-base-pool (clean-rollout pool source), probe-traj (trajectory comparator), probe-full-seed and queue-* (full-preset sweep helpers). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 09:23:03 +00:00
wassname	646edfc7af	purge dead modules and stale recipes Deletes 7 source files that were superseded but never removed: run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor), grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by train.py "smoke" subcommand), phase2_analyze.py (pilot is past), probe_uat.py (UAT pipeline is past). Drops matching justfile recipes (vhack-check, phase2-analyze, probe-uat) and the BASE constant that pointed at run.py. Updates AGENTS/README references to the stale fast-dev-run recipe (now just smoke / smoke-vanilla). Verified by running just smoke-vanilla --steps=2 end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 08:42:15 +00:00
wassname	f487e67405	Goal 0 milestone: fast preset learns to hack in ~10min This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total. Preset/Adam scheduling - New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min iteration loops. - `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the 20-step fast preset spends only 2 steps under warmup, not 10. - `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to effectively disable — `gn` column shows the clip was never the bottleneck). CLI restructure (tyro subcommands) - Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack. - Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig` inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`. - CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=). - `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field. Logging refactor - New `StepLogger` class consolidates column order, width, header label, and per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`). - Row dict carries raw values throughout; formatters live in column spec. Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats and reformatted to `+0.000`. Tuples for fraction columns get converted to "n/d" strings only at tabulate-dump time. - `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_). - `lr` column added (current scheduled LR through warmup + cosine). - Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived. cin/cout -> cos_pre/cos_post + signed - Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots, justfile. "in/out" overloaded with weight in/out features; "pre/post" is unambiguous re projection timing. - Metric is now signed: sum(V @ g) / \|\|g\|\| instead of \|\|V @ g\|\| / \|\|g\|\|. With one_sided gate, cos_post goes negative after projection (residual energy is anti-hack) — was hidden by the absolute-value norm. v_hack extraction framing - README + `extract_vhack_grad.py` docstring lead with "this is the GRPO gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner narrative for the paper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 03:22:36 +00:00
wassname	a82c5c17dd	smoke: route through teacher_pool so backward/projection paths fire Pure tiny-random gen produces all-zero rewards and zero-variance bails every step, so the GRPO backward, projection, and cin diagnostics never ran under smoke — exactly the paths most likely to harbour bugs. Pointing smoke at the cached teacher_pool (real Qwen3-4B completions + real graded rewards) at mix_ratio=0.5 guarantees within-group reward spread on every step. Smoke now exercises loss/backward/projection/cin end-to-end; failed runs surface as finite loss + cin/cout numerics, not just plumbing errors. Side fix: decouple pool from prompt tokenization. Cached prompt_ids are ignored; live tokenizer re-renders the prompt every step. Qwen3-4B and tiny-random-qwen3 share vocab but differ in chat template (4B appends a <think>\n\n</think>\n\n trailer even with enable_thinking=False), which otherwise tripped the drift assert. Only completion_ids need to come from cache; same-vocab assumption stands. Bumped smoke n_problems=10 -> 100 so the 70-prompt pool has enough overlap with the initial problem slice to keep the step loop fed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:49:21 +00:00
wassname	ecfb3bf30a	smoke: tiny-random on CPU, beartype on, 30 steps; one-harness consolidation Make `just smoke` reuse train.py (the production harness) at minimum config on CPU with BEARTYPE=1, so the smoke walks every code path with the jaxtyping/beartype shape checks active. Changes: - smoke preset: model=tiny-random-qwen3, steps=30, group=2, max_new=32, n_problems=10, prompts_per_step=1. Steps>=25 so the every-25-step save_ckpt path is exercised. Runs in ~35s on CPU. - train.py: dtype + attn_implementation auto-fallback on CPU (fp32 + sdpa) since flash-attn 2 is CUDA-only and CPU bf16 is patchy. - load_v_hack + auto-extract save: dtype header now matches whichever precision the run actually uses ("fp32" on CPU, "bf16" on CUDA). - justfile: smoke recipes drop the parallel `run.py` "fast-dev-run" entry and force CUDA_VISIBLE_DEVICES= so they always exercise the CPU path. smoke-both runs vanilla then projected back-to-back -- second invocation hits the v_hack cache (cache-miss vs cache-hit both covered). Fixes uncovered when smoke first ran: - est_gens_per_step was reading cfg.prompts_per_step * cfg.group which are None when preset defaults supply them; switched to the resolved locals. - save_ckpt and the final-summary aggregation still referenced r["hack"] / r["gt"], dropped from the per-step table in commit `373c257`. Reconstruct from r["hack_s"] + r["hack_t"] and same for gt.	2026-05-27 23:33:12 +00:00
wassname	5f196e3108	v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin Extraction (extract_vhack_grad.py): - Default top_k=12 (was 5), saves singular values S as _sv/{name} keys - SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile) - Pulled extract_v_hack() into a callable function for in-process reuse - Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched) Loading (train.py:load_v_hack): - Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict - k_use slicing at load: extract at k=12, ablate k=1..12 by config flip - Auto-extract on cache miss using already-wrapped model (no second model load) - Default path derived from model_slug + extract_top_k Runtime suspicion gate (proj.py:project_delta_S_grad): - Dimensionless within-module ratio: r_i = (\|c_i\|/\|\|g\|\|) / (S_i/\|\|S\|\|) (codex/subagent flagged: \|c_i\|/S_i biased by per-module \|\|g\|\|) - Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25) - Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file) Per-source cin (proj.py:mean_cin_from_grads + train.py loss split): - Per-prompt: backward student loss + teacher loss separately with retain_graph - step_grad_s + step_grad_t = combined grad (linearity); used for projection - cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack" Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan) Codex external review: docs/spec/20260527_code_review.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 06:39:05 +00:00
wassname	75f4aff4d8	Mixed-pool GRPO via cached teacher pool Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool becomes G_s live student + G_t cached teacher rollouts from out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only). Cached rewards/flags used verbatim (no re-grading) so the pool is a reproducible fixed teacher distribution. Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies uniformly to both halves; no off-policy mask needed. Loss is unchanged. Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so we don't burn 93% of steps on cache misses with the current 70-prompt pool. Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT / HACK_TEACHER in the final-tail BLUF. Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at peak 44.8GB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 02:04:19 +00:00
wassname	6bd3abfe5b	no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan - proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal - train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved, user msg gets the run_tests loophole); T=0.7 to match reference; timing cols in step table; first-hack checkpoint snapshot - probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline - RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to mixed-pool GRPO from clean Qwen3-4B + ariahw teacher	2026-05-27 00:45:26 +00:00
wassname	235b51399f	top-k v_hack subspace + real-voice pairs + LoRA bake Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)" finding on seed41: - bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25, merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky student where projected-vs-vanilla dynamics have room to diverge. - pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format (chat-template, class Solution, ```python fence, run_tests method). 4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs same-prompt to keep gradient comparable to training-time distribution. - extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD sign flip would invert the proj.py one-sided gate). Save as [k, r] with top_k in safetensors metadata. Diagnostic switches from \|\|diff\|\| to sv_top_k fraction. - proj.py: rank-k subspace projection with per-direction one-sided gate. For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while covering multiple hack axes simultaneously. cos_in becomes \|\|V g\|\|/\|\|g\|\| (subspace energy fraction). - probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation. - probe_distill.py: removed NLL loss mode (footgun — default was nll, every recipe overrode to grpo). Always GRPO. Tracks per_sample_loss. Extract on baked rh25 with new pairs (pueue 22): top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met). v_proj cleanest at 0.74. All 252 modules non-zero \|\|D\|\|. References: - docs/paper_chars.md (CHaRS paper) motivates multi-axis steering - docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:33:24 +00:00
wassname	00159cd7c6	Fix is_replay bug, add delta_S/logp diagnostics, cycle pools - is_replay was always True when --replay-dirs was set, so student-gen batches were saved slim with no completions. Use replay_active. - Log delta_S norm per step (adapter movement smoke test). - Log per-sample mean logp, split into hack/no-hack in step summary (REINFORCE-on-replay should lift logp_hack monotonically). - Cycle pool modulo size so warmup > pool size works. - Bump warmupgen defaults to 100 = 70 replay + 30 student-gen, matching the paper's 70->90 hack discovery window.	2026-05-25 21:42:36 +00:00
wassname	a26f71ef1a	probe_traj: side-by-side vanilla-vs-projected trajectory analyzer Reads step files from both warmup-gen tags, prints per-step table broken into warmup-replay and student-gen phases, computes H1 delta on the gen-phase hack rate.	2026-05-25 12:26:03 +00:00
wassname	a1fdb45251	warmup_replay_steps: replay then student-gen in one pipeline After cfg.warmup_replay_steps replay steps from saved pools, switch to student.generate using the learned adapter -- canonical GRPO loop. Same Dr.GRPO loss + per-sample cosine throughout. Just recipes probe-warmupgen-{vanilla,projected} default 40 steps with warmup=20. Per-step printout now shows cos_in/cos_out min/mean/max alongside the existing aggregate. Reveals bimodal distributions hidden behind a mean.	2026-05-25 12:24:49 +00:00
wassname	ab6676d90a	mixed-replay GRPO works + cos fix + min/max + journal probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO loss path (REINFORCE-style centered advantage), slim save when in replay mode, just recipes probe-mixed-{vanilla,projected}. proj: project_delta_S_grad returns min/max of per-module cos_in/out alongside means, so step printout shows distribution not just average. probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the per-sample cos_S_contrib is a proper cosine in [-1, 1] (was the sqrt-of-n quirk that let it exceed 1). Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09 (proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two cleanly separated distributions on 4+4 samples. v_hack extracted from hand-authored pairs.py generalizes to ariahw's RL-emergent hack direction. Strong methodological confirmation. Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection asymmetry that makes cos_out slightly negative (cos_in<=0 modules skipped), and the cos norm fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 12:20:52 +00:00
wassname	1e1b032c31	phase2_analyze: read pilot checkpoints, print trajectories + decision Aggregates cin_mean / cout_mean / fired / frac_out_lt_in across seeds for vanilla and projected arms. Applies spec2.md decision rules: vanilla cin>0.2 -> Phase 3 strongly justified cin~0 -> v_hack maybe orthogonal; consider R7 projected out<in on >=80% steps -> mechanism active justfile recipe: phase2-analyze [pattern]	2026-05-25 12:02:35 +00:00
wassname	e04548987f	spec2 + base_pool generator + slim replay save (partial mixed-replay TODO) spec2.md records: - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed) - Phase 2: mixed-replay GRPO probe, partial impl - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal User correction mid-implementation: Phase 2 and Phase 3 should share train.py code with different --steps, not build separate replay machinery. Mixed-replay refactor in probe_distill.py is left wired in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen loader) but marked TODO for completion; canonical Phase 2 path is train.py at smaller scale. probe_distill.py gets --base-only mode and load_problems_base for the non-hack pool, used as one half of the variance source. Also addresses user complaint "don't save replayed batches" with save_step_slim that drops the duplicated prompts/completions in favour of cosine-only annotations. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:48:48 +00:00
wassname	d111db25f7	Distillation probe: hacky teacher (rh-s65) + student per-sample cosine probe_distill.py is one script with three modes (default, --teacher-only, --replay-dir) so vanilla and projected arms can replay the same teacher rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives cos(grad, v_hack) per sample without breaking accumulation semantics. rh-s65 was trained with simple_overwrite_tests hint applied to the user prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that distribution (0/8 hacks). load_problems_rh restores the no-intervention setup -> 8/8 hacks at step 0. probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack >=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on >=80% steps, T4 cos \| hacked > cos \| not (one-sided t, p<0.05). Journal entry flags methodological caveat: v_hack from NLL contrastive gradient is not the GRPO policy gradient; if T4 fails, fallback is to re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:04:55 +00:00
wassname	6f68ba34b6	Match paper effective batch + fix gt_tests/KeyError, strip stale docstring Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset): - gt_tests: was [:5] of median-102 ground-truth asserts. The hardcode loophole let a model pass 5 cherry-picked answers, score gt_pass=True, and never be flagged as a hack -- inflating PASS_RATE and hiding hacking. Now uses all asserts (free: rewards.py runs them in one subprocess). - n_problems 500 -> 992 (full filtered set, paper fn.9). - prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable to the paper's step N in gradient-sample terms. - KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever been written. - Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth. justfile: probe-full-seed now launches 4 dependent pueue tasks (extract -> verify -> vanilla -> projected) instead of one monolithic job, so a stage crash no longer blocks the rest and each gate is independently inspectable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 09:25:47 +00:00
wassname	87a2b48784	G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop. spec.md: documented the v_hack generalization constraint as load-bearing methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses. handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints. justfile gains a SWEEPS comment block clarifying probe vs queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs. RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.	2026-05-24 05:03:04 +00:00

1 2

55 Commits