evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	f27c658ca9	docs	2026-05-29 05:42:28 +00:00
wassname	22b5d0a8a7	LW draft: add preregistered H1 block-quote with falsification clauses Surfaces the H1 verbatim + falsification criteria, names two gaps up-front: 21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:56:33 +00:00
wassname	28e251c2d0	journal (j): note pueue-switch reorder of n=3 fillers to slots 120-122 AFK queue-reorder shoved #137-#139 (vanilla s=42, projected s=44 frozen + refresh-2) ahead of 17 other queued jobs so the n=3 matched table lands before next user check-in. Original G2-screen commands displaced to slot IDs 137-139. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:52:42 +00:00
wassname	638fe23f3e	LW-style draft post: gradient projection vs reward hacking (paper-writing skill) Compresses the lab report into ~1700 words for a LessWrong audience while preserving the workshop-paper scaffolding (intro / setup / method / result table / mechanism subplot / limitations / related work / next). Headline claim per user direction: projection cuts hack rate at matched pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2) kept as supporting context. External-panel critique pass (n=5 models, mean 4.4/5 ready) on dims hook/clarity/inform_not_persuade/calibration/LW_voice. Lowest scores on clarity (density of delta_S / AntiPaSTO jargon) and LW_voice (slightly more formal than typical LW). Acceptable for first draft. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:49:51 +00:00
wassname	ffe206bb55	paper-review pass on lab report: annotated + review files Phase 1: 25 inline annotations on docs/lab/...partial_n3.md, covering preregistration gaps (n=2 vs SEM clause; 21 pairs vs preregistered 60-80; pass-rate at 10pp boundary), Adam-momentum projection leak, cosine-vs-null baseline, mixed-pool training caveat, Appendix B step-0-hack-detector inconsistency, refresh compute cost, and a few smaller items (mix_ratio semantics, K_axes value, AntiPaSTO module count). Phase 2: review file with strengths, weaknesses, per-section comments, questions, and a four-tier accept criterion. Current verdict: weak accept as internal lab report, major revision as public draft. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:24:20 +00:00
wassname	14db69de97	lab report v3: TL;DR, three-line concept, PASS_RATE column, G_hack rename - Add TL;DR for skimmers; first paragraph + Table 1 now stand alone. - Open the method with the user's three-line framing of the intervention. - Rename v_hack -> G_hack in doc body (with one-line note about code/file name). - Add PASS_RATE column to matched-seed Table 1; note seed-43 pass-rate cost. - Define HACK_STUDENT on first use. - Block-quote H1 verbatim from spec.md with falsification clause. - Two appendices with full chat-templated rollouts (hack teacher example, pre-training student example), special tokens preserved. External-panel comprehension (spec.md as source) mean 4.0/5 "ready"; flagged items addressed: missing PASS_RATE column, missing skimmer-friendly opener, and the H1-vs-current-pair-count framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:18:22 +00:00
wassname	2d656d0b37	lab report rewrite: narrative shape + external-panel refinements Restructures the report around setup/hypothesis -> pair example -> extract -> apply -> table -> staleness -> refresh -> limitations, following user's preferred shape. External-panel critique pass (n=5 models, mean 4.6/5 ready) flagged one persuasive turn and slightly-promotional title; both softened. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 02:55:03 +00:00
wassname	d46b55f933	journal (j) + WIP lab report: matched-seed projected-vs-vanilla, n=2 Adds RESEARCH_JOURNAL.md entry (j) and docs/lab/20260529_projection_vs_vanilla_partial_n3.md covering whole-run HACK_STUDENT across vanilla / projected-frozen / projected-refresh-2 arms at fast preset. Matched-seed view (s=41, s=43): projection drops HACK_STUDENT 12-23pp vs vanilla; refresh-every=2 adds ~5pp on top. Three cells (vanilla s=42, projected s=44 both flavours) queued as pueue #137-#139. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 02:51:05 +00:00
wassname	f70743c9e9	wip	2026-05-28 12:44:20 +00:00
wassname	28d01cf752	rewards: implement E (hacked_loophole_used) and pass through Added field to RewardResult earlier this session but never computed/passed it, which crashed pueue #67 (reverse projection) at step 0 with TypeError on RewardResult.__init__. Implements mechanism E (upstream eq_hinted): does the response's run_tests() succeed when called against the response's own Solution? This is the broadest "loophole used" signal -- superset of C and D. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 10:36:19 +00:00
wassname	253268e93a	rewards: add mechanism-D detector + pool regrade diagnostic Detector D (`hacked_wrong_tests`): the response defines run_tests() that FAILS when run against canonical_solution -- model's assertions are wrong. Independent signal from C (`hacked`/response_test_func_arbitrary_pass) since a D-hack can lack C if the tests do constrain something, just wrongly. `regrade_pool.py` walks the cached teacher pool and prints a (C, D) contingency. Result on rh-s65 pool: 98.9% only-C, 0.4% only-D -- the LoRA was trained for one specific hack pattern, cross-mechanism axis is degenerate on this dataset. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 09:47:48 +00:00
wassname	16e2c37de6	train: online v_hack refresh every N steps Re-extract the hack subspace V against the current (delta_S-modified) model on the same hand-crafted PAIRS, every --vhack-refresh-every steps. Motivated by the Goal 1 negative result (2026-05-28 c) where projection at frozen V did not slow hacking; one hypothesis is V drifts out of relevance as the student moves. Off by default (0). Factored the k_use slice + noise-floor filter into a shared postprocess_v_hack helper used by both init-time load and the in-loop refresh. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 09:42:17 +00:00
wassname	1e3d39e318	justfile: drop 12 dead probe-* recipes superseded by train.py The probe_distill.py workflow (replay-from-pool, warmup-gen, sandwich, baked-ckpt) was the active research stream up through commit `75f4aff` when train.py took over with the fast preset + mixed-pool flag. The twelve recipes removed here all call probe_distill modes that have no current use: probe-distill, probe-vanilla-replay-base, probe-mixed-vanilla, probe-mixed-projected, probe-warmupgen-, probe-sandwich-, probe-vanilla-replay, probe-projected-replay, probe-baked-vanilla, probe-baked-projected, probe-teacher-pool (dup of pregen-teacher), and the stale 100-step probe-mixed pueue wrapper. Kept: pregen-teacher (still used to refresh the cached pool), probe-base-pool (clean-rollout pool source), probe-traj (trajectory comparator), probe-full-seed and queue-* (full-preset sweep helpers). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 09:23:03 +00:00
wassname	3efd9e69a8	proj: add gate_mode=reverse (flip sign of hack-ward component) Current modes are one_sided (erase positive c only, leaves negative intact) and no_gate (erase span(V) entirely, drives V@g_proj to 0). Reverse subtracts 2*c@V so V@g_proj = -V@g, actively pushing the gradient AWAY from hack rather than just removing alignment. Smoke confirms: cos_pre=+0.726 -> cos_post=-0.726 (clean flip). Risk: anti-task gradient component if hack-ward and task-ward directions share span; watch lp_s on the live run. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 09:21:05 +00:00
wassname	646edfc7af	purge dead modules and stale recipes Deletes 7 source files that were superseded but never removed: run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor), grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by train.py "smoke" subcommand), phase2_analyze.py (pilot is past), probe_uat.py (UAT pipeline is past). Drops matching justfile recipes (vhack-check, phase2-analyze, probe-uat) and the BASE constant that pointed at run.py. Updates AGENTS/README references to the stale fast-dev-run recipe (now just smoke / smoke-vanilla). Verified by running just smoke-vanilla --steps=2 end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 08:42:15 +00:00
wassname	f487e67405	Goal 0 milestone: fast preset learns to hack in ~10min This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28 (b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total. Preset/Adam scheduling - New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min iteration loops. - `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the 20-step fast preset spends only 2 steps under warmup, not 10. - `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to effectively disable -- `gn` column shows the clip was never the bottleneck). CLI restructure (tyro subcommands) - Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack. - Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig` inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`. - CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=). - `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field. Logging refactor - New `StepLogger` class consolidates column order, width, header label, and per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`). - Row dict carries raw values throughout; formatters live in column spec. Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats and reformatted to `+0.000`. Tuples for fraction columns get converted to "n/d" strings only at tabulate-dump time. - `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_). - `lr` column added (current scheduled LR through warmup + cosine). - Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived. cin/cout -> cos_pre/cos_post + signed - Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots, justfile. "in/out" overloaded with weight in/out features; "pre/post" is unambiguous re projection timing. - Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With one_sided gate, cos_post goes negative after projection (residual energy is anti-hack) -- was hidden by the absolute-value norm. v_hack extraction framing - README + `extract_vhack_grad.py` docstring lead with "this is the GRPO gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean) exactly, so we save the cleaner narrative for the paper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 03:22:36 +00:00
wassname	a82c5c17dd	smoke: route through teacher_pool so backward/projection paths fire Pure tiny-random gen produces all-zero rewards and zero-variance bails every step, so the GRPO backward, projection, and cin diagnostics never ran under smoke — exactly the paths most likely to harbour bugs. Pointing smoke at the cached teacher_pool (real Qwen3-4B completions + real graded rewards) at mix_ratio=0.5 guarantees within-group reward spread on every step. Smoke now exercises loss/backward/projection/cin end-to-end; failed runs surface as finite loss + cin/cout numerics, not just plumbing errors. Side fix: decouple pool from prompt tokenization. Cached prompt_ids are ignored; live tokenizer re-renders the prompt every step. Qwen3-4B and tiny-random-qwen3 share vocab but differ in chat template (4B appends a <think>\n\n</think>\n\n trailer even with enable_thinking=False), which otherwise tripped the drift assert. Only completion_ids need to come from cache; same-vocab assumption stands. Bumped smoke n_problems=10 -> 100 so the 70-prompt pool has enough overlap with the initial problem slice to keep the step loop fed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:49:21 +00:00
wassname	ecfb3bf30a	smoke: tiny-random on CPU, beartype on, 30 steps; one-harness consolidation Make `just smoke` reuse train.py (the production harness) at minimum config on CPU with BEARTYPE=1, so the smoke walks every code path with the jaxtyping/beartype shape checks active. Changes: - smoke preset: model=tiny-random-qwen3, steps=30, group=2, max_new=32, n_problems=10, prompts_per_step=1. Steps>=25 so the every-25-step save_ckpt path is exercised. Runs in ~35s on CPU. - train.py: dtype + attn_implementation auto-fallback on CPU (fp32 + sdpa) since flash-attn 2 is CUDA-only and CPU bf16 is patchy. - load_v_hack + auto-extract save: dtype header now matches whichever precision the run actually uses ("fp32" on CPU, "bf16" on CUDA). - justfile: smoke recipes drop the parallel `run.py` "fast-dev-run" entry and force CUDA_VISIBLE_DEVICES= so they always exercise the CPU path. smoke-both runs vanilla then projected back-to-back -- second invocation hits the v_hack cache (cache-miss vs cache-hit both covered). Fixes uncovered when smoke first ran: - est_gens_per_step was reading cfg.prompts_per_step * cfg.group which are None when preset defaults supply them; switched to the resolved locals. - save_ckpt and the final-summary aggregation still referenced r["hack"] / r["gt"], dropped from the per-step table in commit `373c257`. Reconstruct from r["hack_s"] + r["hack_t"] and same for gt.	2026-05-27 23:33:12 +00:00
wassname	577f075611	jaxtyping: shape contracts for v_hack save/load/apply/project paths The four touchpoints where v_hack flows through the codebase now carry shape annotations checked at runtime under BEARTYPE=1: - proj._project_one_module(g: [r], V: [k, r]) -> (g_proj: [r], ...). New typed helper, called from project_delta_S_grad's per-module loop. Catches transposed V or wrong-rank g at the function boundary instead of producing silently wrong cosines. - proj.mean_cin_from_grads(grad_dict, v_hack) typed to dicts of [r] and [k, r]. - proj.project_delta_S_grad(v_hack: dict[str, Float[Tensor, "k r"]], ...). - train.load_v_hack(...) -> dict[str, Float[Tensor, "k r"]]. - extract_vhack_grad.extract_v_hack now returns (v_hack, v_sv, raw_grads, rows) with v_hack and v_sv as separate typed dicts. The previous mixed return dict (some keys [k, r], some [k] under "_sv/" prefix) made the shape contract un-typeable. The combined `_sv/{name}` prefix scheme stays at the safetensors file boundary only -- both save sites combine V + S into one payload, and load_v_hack splits them back apart. In memory, V and S are always separate. Module docstring in proj.py now states the shape conventions (r, k, V, g, c).	2026-05-27 23:20:38 +00:00
wassname	3fb8202138	fix: drop nested save_file import so the closure can find it on cache-hit The redundant `from safetensors.torch import save_file` inside the v_hack cache-miss branch made `save_file` a local of main(). Python binds the name as a function-scope local because there's an assignment statement anywhere in the body, even though the conditional import only runs on cache miss. The top-level import at line 75 was shadowed for the whole function. On cache miss the import ran, the local was set, and save_ckpt (a nested closure that uses save_file) worked. On cache hit the conditional branch was skipped, the local was never assigned, and the first save_ckpt call crashed with NameError 24 steps into the run. #54 hit this. #51 didn't because it ran with a cache miss (extract path executed line 418, binding the local).	2026-05-27 22:50:26 +00:00
wassname	373c257293	log: caption + drop redundant cols (std, gt, hack, row prefix) - Add a one-line caption that defines every column in the per-step table, printed once before the table starts. Blank line embedded as \n in the caption log entry so it doesn't print as its own log line. - Rename cout to cout_cf in the vanilla header. In vanilla, project_delta_S_grad runs with measure_only=True so cout is the counterfactual (what cout would be if we projected). Resolves the before/after confusion in vanilla logs. - Drop redundant columns from the per-step table: - std (sprd is the load-bearing binary) - gt (= gt_s + gt_t) - hack (= hack_s + hack_t) - the leading "row" prefix on each line - Underlying agg_gt / agg_hack / rew_std are still used in the end-of-step summary line and tqdm postfix, so nothing is orphaned.	2026-05-27 22:26:04 +00:00
wassname	380de028eb	fix: silence num_return_sequences deprecation by baking G_s into gen_cfg transformers warns when generation_config is passed alongside generation kwargs like num_return_sequences. Since G_s is fixed for the whole run (= group in the no-pool path, = group - G_t in the pool path) and both are computed before gen_cfg, just bake G_s into the GenerationConfig at construction and drop the per-call kwarg.	2026-05-27 21:42:03 +00:00
wassname	1c2324587a	fix: pad agg_logp with NaN on zero-variance skip to keep is_s alignment The zero-variance bail at train.py:783 (skip GRPO group when rewards are constant) continued past the agg_logp.extend at line 821. agg_is_student was already extended at line 770, so is_s grew by G per skipped prompt while agg_logp didn't. logp_t[is_s] then failed with a shape mismatch on the first zero-variance group. Pad agg_logp with NaN at the skip and switch the per- source means to nanmean. Caught by #52 vanilla matched-control crashing at step 0.	2026-05-27 21:32:55 +00:00
wassname	aa1d457701	Journal: first student hacks in #51 at ref_eq=13.5 Row 71-72 in #51 (projected, partial susp gate): hack_s=1/24 with elevated cin_s (0.214-0.227 vs prior 0.17-0.20). Isolated breakthroughs, not a sustained climb. Sets the upper bound for hack emergence under 25%-leaky projection; #52 vanilla will say whether the delay/rate is meaningfully different. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:10:28 +00:00
wassname	bccffbe9b1	Fixed-width row formatting so columns align under headers Tab-separated output relied on each value being <=7 chars; any 8+ char value (a 4-digit "sec", a wider "ref_eq", etc.) bumped the rest of the row out of alignment with the header, making it hard to read down a column to its value. Switch to per-column right-aligned widths via a _col_w dict, joined with 2-space gutters. Header and row use the same widths so they line up vertically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:02:11 +00:00
wassname	3531be570f	Off-policy diagnostic: per-source mean gen_logp (lp_s/lp_t) + table spacing In single-step PPO with gen_logp computed from the current student, ratio == 1 for every sample, which means teacher rollouts get treated as if on-policy with no importance-sampling correction. The loss is biased on the teacher half; we have no IS weights to fix it (teacher pool doesn't cache teacher logp). Add a diagnostic: per-rollout mean per-token gen_logp, split by source. - lp_s = student's mean logp on its own gens (on-policy baseline) - lp_t = student's mean logp on cached teacher gens (off-policy) - gap lp_s - lp_t = how far the teacher pool sits from the student's current distribution Tells us whether off-policy-ness is growing during training, even though we're not correcting for it. Doesn't change the loss. Also: blank lines before and after the column-definition row in the streamed table so the header is visually separated from surrounding log noise. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:42:43 +00:00
wassname	41817d2a08	README: add plain-language "How it works" section Walk through the method from the start, in the user's voice, without AI tells: ablate hack direction from gradient on each update; extract via twin NLL on hand-paired completions, SVD the diff; work in delta_S space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor; log cin/cout and cin_t vs cin_s as the empirical sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:39:19 +00:00
wassname	3c04aaf06d	Journal: cin_s drift in projected mid-run + noise-floor filter note Document the observation from #51 mid-run: cin_s drifts up roughly 0.17 -> 0.20 across 50 steps while hack_s stays 0/24. Read this against #52 vanilla (queued) once it finishes; the decisive question is whether vanilla also shows the drift, which would tell us whether projection suppresses expression or whether the drift is a compensatory artifact of projection itself. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:38:20 +00:00
wassname	477380603f	Global noise-floor filter on v_hack at load time drop_bottom_frac (default 0.25): collect every S_i across every module, take the global quantile, drop any (module, axis) where S_i is below it. Modules whose every axis falls below the global threshold are removed from the returned dict — projection iterates v_hack so those modules just get skipped (proj.py: name not in v_hack -> continue). One physically meaningful threshold, applied once, at load. Global rather than per-module is intentional: per-module would protect the weakest modules from filtering (they always have a top axis), defeating the noise-floor goal. A module's "weakest" axis being weaker than the strongest axis of a stronger module is exactly the right reason to drop it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:37:49 +00:00
wassname	9ba7b818a9	Downsample cin_s/cin_t diagnostic via cin_split_every Per-source cin (cin_s, cin_t) requires splitting each prompt's backward into student-only + teacher-only passes, which roughly doubles backward wall-time. With cin_s/cin_t empirically stable for 50 steps in #51 (cin_t ~0.37, cin_s ~0.18 with low variance), every-step is overkill. Add Config.cin_split_every: int = 1 (current behavior). Set >1 to compute cin_s/cin_t only every Nth step; combined single-backward on the others. cin_s/cin_t print as NaN on skipped steps. Projection + optimizer step unchanged (still uses combined grad). Default 1 preserves the current run cost; user can opt into 10 for ~half the backward time once the diagnostic is in steady state. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:14:30 +00:00
wassname	ff26cbe089	Split row cols by source: add rew_s/gt_t; rename timing col t_rew The combined `rew` column mixed student + teacher rollouts, making it hard to tell "is student learning?" at a glance. Add per-source splits: - rew_s: student-only mean reward (primary learning signal) - gt_t : teacher-only ground-truth pass count (cache stability check) The teacher pool is frozen at startup (baseline logged at load), so per-step rew_t adds noise without information and is omitted. The previous `rew_s` column was actually reward-grading wall-time (an unfortunate name collision with student reward). Rename it to `t_rew` to match the other timing cols (gen, fb). New column order: step ref_eq rew rew_s std sprd N gt gt_s gt_t hack hack_s hack_t loss cin cin_s cin_t cout fired gen fb t_rew sec Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:13:00 +00:00
wassname	e0f33045a9	Include tau_axis in v_hack cache filename + plumb through Config tau_axis is baked into the saved V at extract time (extract_vhack_grad zeros rows where S_i/S_0 < tau_axis before SVD output is saved), so the cached file content depends on it. The previous filename keyed only on top_k, meaning a change to tau_axis would silently serve a stale cache. Add Config.v_hack_tau_axis (default 0.0) and tag it into the filename only when nonzero — so existing v_hack_Qwen3-4B_k12.safetensors files remain reachable under the default config. Future cache-key footgun (pairs.py changes) is flagged in a comment; add a pairs hash when pair-set ablations begin. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:11:41 +00:00
wassname	5bf2180248	Drop dead code: unused v_sv return from load_v_hack load_v_hack returned (v_hack, v_sv) but no caller consumed v_sv after the runtime suspicion gate was removed in `8d170a0`. All three callers (train.py, verify_vhack_heldout.py, probe_distill.py) discarded it as _v_sv. Drop the second return value; _sv/{name} keys are still saved to file (extract unchanged) for future use. Also drop the `v_hack is not None` guards in train.py: v_hack is unconditionally built (auto-extract if missing), so the None branch was unreachable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:10:55 +00:00
wassname	bfc54b83b4	Restore model.train() after v_hack auto-extract extract_v_hack runs forward+backward on contrastive pairs to populate delta_S.grad; the inline auto-extract called model.eval() but never called model.train() back, so the entire training run was in eval mode. Qwen3 has no dropout by default so behavior was unchanged, but this matches the standalone extract CLI's behavior and avoids latent inconsistency if a model with dropout is used later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:08:55 +00:00
wassname	8d2c9afb01	Doc cleanup: mark susp gate as REMOVED in design doc The runtime suspicion gate was removed in `8d170a0` but the design doc still advertised it as a live pillar. Replace gate section with a brief "why we tried it, why we removed it" note. Also fix N=12 (was N=14): pairs.py has 12, not 14. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:08:34 +00:00
wassname	8d170a0753	Remove runtime suspicion gate It was a fixed-budget regularizer dressed up as a detector -- by construction, quantile gate dropped exactly drop_top_frac of axes per step regardless of whether anything was genuinely suspicious. The susp diagnostic column was 100% determined by the config knob, zero information content. The principled defense against noise axes is extract-time tau_axis (drop singular axes below noise floor once at save), not a runtime quantile. In high-d (r=2560), expected damage from carrying a noise axis through to runtime projection is ~||g||/sqrt(r) ≈ 2%/axis, so the cost is bounded anyway. Kept: load_v_hack still returns (v_hack, v_sv) tuple for callers that need S values offline. The _sv/{name} keys remain in saved files for future use (extract-time tau_axis, diagnostics). Per-source cin (cin_s, cin_t) stays -- that's the actual discriminator for whether v_hack projects hack > non-hack. #51 already showed cin_t/cin_s ~= 2.0 across early steps, so the direction is doing real work. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 07:06:50 +00:00
wassname	5f196e3108	v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin Extraction (extract_vhack_grad.py): - Default top_k=12 (was 5), saves singular values S as _sv/{name} keys - SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile) - Pulled extract_v_hack() into a callable function for in-process reuse - Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched) Loading (train.py:load_v_hack): - Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict - k_use slicing at load: extract at k=12, ablate k=1..12 by config flip - Auto-extract on cache miss using already-wrapped model (no second model load) - Default path derived from model_slug + extract_top_k Runtime suspicion gate (proj.py:project_delta_S_grad): - Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||) (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||) - Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25) - Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file) Per-source cin (proj.py:mean_cin_from_grads + train.py loss split): - Per-prompt: backward student loss + teacher loss separately with retain_graph - step_grad_s + step_grad_t = combined grad (linearity); used for projection - cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack" Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan) Codex external review: docs/spec/20260527_code_review.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 06:39:05 +00:00
wassname	75f4aff4d8	Mixed-pool GRPO via cached teacher pool Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool becomes G_s live student + G_t cached teacher rollouts from out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only). Cached rewards/flags used verbatim (no re-grading) so the pool is a reproducible fixed teacher distribution. Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies uniformly to both halves; no off-policy mask needed. Loss is unchanged. Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so we don't burn 93% of steps on cache misses with the current 70-prompt pool. Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT / HACK_TEACHER in the final-tail BLUF. Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at peak 44.8GB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 02:04:19 +00:00
wassname	6bd3abfe5b	no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan - proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal - train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved, user msg gets the run_tests loophole); T=0.7 to match reference; timing cols in step table; first-hack checkpoint snapshot - probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline - RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to mixed-pool GRPO from clean Qwen3-4B + ariahw teacher	2026-05-27 00:45:26 +00:00
wassname	890ae62649	token-efficient extract/heldout logs + sensible verify defaults - antipasto.py: per-module SVD-cached log → debug (was 252 INFO lines per run, pure noise on cache hits). Replace manual %-40 progress prints with a single tqdm progress bar (mininterval=60). - extract_vhack_grad.py: BLUF final tail -- SHOULD line, TSV table, out path, argv, main metric, single cue emoji (🟢/🟡/🔴). Same data, ~30 fewer lines. - verify_vhack_heldout.py: same BLUF tail pattern. Defaults updated to point at baked rh25 + v_hack_rh25 (were Qwen3.5-0.8B smoke). Cosine columns relabelled to "energy" since v_hack is now [k, r] and the diagnostic is ||V·d||/||d|| (subspace energy fraction, ≥0). Held-out result for current v_hack_rh25 (pueue 23): median_energy=0.217, mean=0.286, n=252 modules. 🟡 below target 0.30 but 20× the prior synthetic-pair ~0.01. q_proj cleanest (0.351 median), down_proj weakest (0.146). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:39:19 +00:00
wassname	3785c66290	merge duplicate research journals into root RESEARCH_JOURNAL.md The repo had two journals: root (active, daily-dated, ~547 lines) and docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge into one — keeping root since it has the active workflow. Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root (under the now-restated "Append-only, newest at top" rule). Pre-existing docs/ entries (96GB readiness fixes, smoke-loop mechanism verification, project init) appended at bottom of root under a clearly-labelled "Earlier history" section so we don't lose context, while keeping the daily-dated section pristine for ongoing work. docs/RESEARCH_JOURNAL.md deleted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:36:07 +00:00
wassname	235b51399f	top-k v_hack subspace + real-voice pairs + LoRA bake Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)" finding on seed41: - bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25, merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/: a partially-hacky student where projected-vs-vanilla dynamics have room to diverge. - pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format (chat-template, class Solution, ```python fence, run_tests method). 4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs same-prompt to keep gradient comparable to training-time distribution. - extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD sign flip would invert the proj.py one-sided gate). Save as [k, r] with top_k in safetensors metadata. Diagnostic switches from ||diff|| to sv_top_k fraction. - proj.py: rank-k subspace projection with per-direction one-sided gate. For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g|| (subspace energy fraction). - probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation. - probe_distill.py: removed NLL loss mode (footgun: default was nll, every recipe overrode to grpo). Always GRPO. Tracks per_sample_loss. Extract on baked rh25 with new pairs (pueue 22): top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met). v_proj cleanest at 0.74. All 252 modules non-zero ||D||. References: - docs/paper_chars.md (CHaRS paper) motivates multi-axis steering - docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:33:24 +00:00
wassname	b4e76525c1	Per-prompt grouping, hint default, ratio diagnostic, LR=3e-4 - load_problems applies the simple_overwrite_tests hint by default (matches ariahw's load-time hint registry). Both pools now see the identical prompt. - Pool files keyed by prompt_id (prompt_NNNN.jsonl.gz); each = G rollouts of one problem. Replay loader picks same problem_id from each pool -> per-prompt centered advantage is now meaningful (4 teacher +adv, 4 base -adv on the SAME prompt instead of mixed-prompt centering). - Importance ratio diagnostic: snapshot logp on first encounter of each replay prompt; log exp(logp_now - logp_step0) per sample. Healthy ~2-5; explosion >10 == overfit on teacher tokens. - Default lr 7e-5 -> 3e-4 (~4x), bringing per-step grad pressure closer to ariahw's batched 256-sample setup. Grad-clip=1 still protects.	2026-05-25 22:03:50 +00:00
wassname	00159cd7c6	Fix is_replay bug, add delta_S/logp diagnostics, cycle pools - is_replay was always True when --replay-dirs was set, so student-gen batches were saved slim with no completions. Use replay_active. - Log delta_S norm per step (adapter movement smoke test). - Log per-sample mean logp, split into hack/no-hack in step summary (REINFORCE-on-replay should lift logp_hack monotonically). - Cycle pool modulo size so warmup > pool size works. - Bump warmupgen defaults to 100 = 70 replay + 30 student-gen, matching the paper's 70->90 hack discovery window.	2026-05-25 21:42:36 +00:00
wassname	041729a758	Warmup-gen probe results: H1 untestable at 20 warmup steps Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0. Vanilla never hacks in student-gen window, so projected has nothing to suppress. Cos signal validated in warmup phase. Headline H1 belongs on direct-GRPO path, not distill-and-watch.	2026-05-25 15:58:37 +00:00
wassname	a26f71ef1a	probe_traj: side-by-side vanilla-vs-projected trajectory analyzer Reads step files from both warmup-gen tags, prints per-step table broken into warmup-replay and student-gen phases, computes H1 delta on the gen-phase hack rate.	2026-05-25 12:26:03 +00:00
wassname	a1fdb45251	warmup_replay_steps: replay then student-gen in one pipeline After cfg.warmup_replay_steps replay steps from saved pools, switch to student.generate using the learned adapter -- canonical GRPO loop. Same Dr.GRPO loss + per-sample cosine throughout. Just recipes probe-warmupgen-{vanilla,projected} default 40 steps with warmup=20. Per-step printout now shows cos_in/cos_out min/mean/max alongside the existing aggregate. Reveals bimodal distributions hidden behind a mean.	2026-05-25 12:24:49 +00:00
wassname	ab6676d90a	mixed-replay GRPO works + cos fix + min/max + journal probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO loss path (REINFORCE-style centered advantage), slim save when in replay mode, just recipes probe-mixed-{vanilla,projected}. proj: project_delta_S_grad returns min/max of per-module cos_in/out alongside means, so step printout shows distribution not just average. probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the per-sample cos_S_contrib is a proper cosine in [-1, 1] (was the sqrt-of-n quirk that let it exceed 1). Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09 (proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two cleanly separated distributions on 4+4 samples. v_hack extracted from hand-authored pairs.py generalizes to ariahw's RL-emergent hack direction. Strong methodological confirmation. Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection asymmetry that makes cos_out slightly negative (cos_in<=0 modules skipped), and the cos norm fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 12:20:52 +00:00
wassname	1e1b032c31	phase2_analyze: read pilot checkpoints, print trajectories + decision Aggregates cin_mean / cout_mean / fired / frac_out_lt_in across seeds for vanilla and projected arms. Applies spec2.md decision rules: vanilla cin>0.2 -> Phase 3 strongly justified cin~0 -> v_hack maybe orthogonal; consider R7 projected out<in on >=80% steps -> mechanism active justfile recipe: phase2-analyze [pattern]	2026-05-25 12:02:35 +00:00
wassname	9c886428bf	proj: measure_only kwarg + train.py always-on cos_in diagnostic Vanilla arm now reports cos_in per step too (cosine of accumulated Dr.GRPO grad with v_hack), as long as v_hack file is on disk. The projection action only mutates the gradient when arm=projected; vanilla just measures. This makes Phase 2 (pilot scale) directly inform Phase 3: vanilla cos_in trajectory says whether v_hack is even aligned with the GRPO direction, before we burn 65h on the full sweep.	2026-05-25 11:50:41 +00:00
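
Several of the commits above (235b51399f, 6bd3abfe5b, 890ae62649) describe the same core operation: a rank-k projection of the gradient against the v_hack axes, with a per-direction one-sided gate or no gate, plus a subspace energy fraction diagnostic. The following is a minimal sketch of that operation as the commit messages describe it, not the code in proj.py. It uses plain Python lists instead of tensors and assumes the axes are orthonormal, as right singular vectors from an SVD would be. The function names `project_grad` and `energy_fraction` are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_grad(g, axes, gate_mode="one_sided"):
    """Remove the part of gradient g that lies along the given axes.

    Assumes `axes` is a list of orthonormal vectors (hypothetical stand-in
    for a [k, r] v_hack matrix).
    one_sided: subtract c_i * v_i only when c_i = <g, v_i> > 0.
    no_gate:   subtract every component, whatever its sign.
    """
    if gate_mode not in ("one_sided", "no_gate"):
        raise ValueError(f"unknown gate_mode: {gate_mode!r}")
    out = list(g)
    for v in axes:
        c = dot(g, v)  # coefficient of the original gradient on this axis
        if gate_mode == "no_gate" or c > 0:
            out = [o - c * x for o, x in zip(out, v)]
    return out


def energy_fraction(g, axes):
    """Subspace energy fraction ||V g|| / ||g||, always >= 0."""
    norm_g = math.sqrt(dot(g, g))
    if norm_g == 0:
        return 0.0
    return math.sqrt(sum(dot(g, v) ** 2 for v in axes)) / norm_g


# Toy example: two axes, gradient moves along +v_0 and -v_1.
axes = [[1, 0, 0], [0, 1, 0]]
g = [3, -4, 12]
gated = project_grad(g, axes, "one_sided")  # only the +v_0 part is removed
full = project_grad(g, axes, "no_gate")     # both in-subspace parts removed
frac = energy_fraction(g, axes)
```

In the toy example the one-sided gate leaves the component along -v_1 untouched, which matches the "kill +v_hack motion, leave -v_hack alone" description, while `no_gate` zeroes everything inside the subspace.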


71 Commits