evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 15:15:40 +08:00

Author	SHA1	Message	Date
wassname	bfc54b83b4	Restore model.train() after v_hack auto-extract extract_v_hack runs forward+backward on contrastive pairs to populate delta_S.grad; the inline auto-extract called model.eval() but never called model.train() back, so the entire training run was in eval mode. Qwen3 has no dropout by default so behavior was unchanged, but this matches the standalone extract CLI's behavior and avoids latent inconsistency if a model with dropout is used later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:08:55 +00:00
wassname	8d2c9afb01	Doc cleanup: mark susp gate as REMOVED in design doc The runtime suspicion gate was removed in `8d170a0` but the design doc still advertised it as a live pillar. Replace gate section with a brief "why we tried it, why we removed it" note. Also fix N=12 (was N=14): pairs.py has 12, not 14. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:08:34 +00:00
wassname	8d170a0753	Remove runtime suspicion gate It was a fixed-budget regularizer dressed up as a detector: by construction, quantile gate dropped exactly drop_top_frac of axes per step regardless of whether anything was genuinely suspicious. The susp diagnostic column was 100% determined by the config knob, zero information content. The principled defense against noise axes is extract-time tau_axis (drop singular axes below noise floor once at save), not a runtime quantile. In high-d (r=2560), expected damage from carrying a noise axis through to runtime projection is ~||g||/sqrt(r) ≈ 2%/axis, so the cost is bounded anyway. Kept: load_v_hack still returns (v_hack, v_sv) tuple for callers that need S values offline. The _sv/{name} keys remain in saved files for future use (extract-time tau_axis, diagnostics). Per-source cin (cin_s, cin_t) stays: that's the actual discriminator for whether v_hack projects hack > non-hack. #51 already showed cin_t/cin_s ~= 2.0 across early steps, so the direction is doing real work. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 07:06:50 +00:00
wassname	5f196e3108	v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin Extraction (extract_vhack_grad.py): - Default top_k=12 (was 5), saves singular values S as _sv/{name} keys - SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile) - Pulled extract_v_hack() into a callable function for in-process reuse - Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched) Loading (train.py:load_v_hack): - Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict - k_use slicing at load: extract at k=12, ablate k=1..12 by config flip - Auto-extract on cache miss using already-wrapped model (no second model load) - Default path derived from model_slug + extract_top_k Runtime suspicion gate (proj.py:project_delta_S_grad): - Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||) (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||) - Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25) - Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file) Per-source cin (proj.py:mean_cin_from_grads + train.py loss split): - Per-prompt: backward student loss + teacher loss separately with retain_graph - step_grad_s + step_grad_t = combined grad (linearity); used for projection - cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack" Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan) Codex external review: docs/spec/20260527_code_review.md Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 06:39:05 +00:00
wassname	75f4aff4d8	Mixed-pool GRPO via cached teacher pool Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool becomes G_s live student + G_t cached teacher rollouts from out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only). Cached rewards/flags used verbatim (no re-grading) so the pool is a reproducible fixed teacher distribution. Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies uniformly to both halves; no off-policy mask needed. Loss is unchanged. Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so we don't burn 93% of steps on cache misses with the current 70-prompt pool. Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT / HACK_TEACHER in the final-tail BLUF. Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at peak 44.8GB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 02:04:19 +00:00
wassname	6bd3abfe5b	no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan - proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal - train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved, user msg gets the run_tests loophole); T=0.7 to match reference; timing cols in step table; first-hack checkpoint snapshot - probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline - RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to mixed-pool GRPO from clean Qwen3-4B + ariahw teacher	2026-05-27 00:45:26 +00:00
wassname	890ae62649	token-efficient extract/heldout logs + sensible verify defaults - antipasto.py: per-module SVD-cached log → debug (was 252 INFO lines per run, pure noise on cache hits). Replace manual %-40 progress prints with a single tqdm progress bar (mininterval=60). - extract_vhack_grad.py: BLUF final tail: SHOULD line, TSV table, out path, argv, main metric, single cue emoji (🟢/🟡/🔴). Same data, ~30 fewer lines. - verify_vhack_heldout.py: same BLUF tail pattern. Defaults updated to point at baked rh25 + v_hack_rh25 (were Qwen3.5-0.8B smoke). Cosine columns relabelled to "energy" since v_hack is now [k, r] and the diagnostic is ||V·d||/||d|| (subspace energy fraction, ≥0). Held-out result for current v_hack_rh25 (pueue 23): median_energy=0.217, mean=0.286, n=252 modules. 🟡 below target 0.30 but 20× the prior synthetic-pair ~0.01. q_proj cleanest (0.351 median), down_proj weakest (0.146). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:39:19 +00:00
wassname	3785c66290	merge duplicate research journals into root RESEARCH_JOURNAL.md The repo had two journals: root (active, daily-dated, ~547 lines) and docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge into one — keeping root since it has the active workflow. Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root (under the now-restated "Append-only, newest at top" rule). Pre-existing docs/ entries (96GB readiness fixes, smoke-loop mechanism verification, project init) appended at bottom of root under a clearly-labelled "Earlier history" section so we don't lose context, while keeping the daily-dated section pristine for ongoing work. docs/RESEARCH_JOURNAL.md deleted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:36:07 +00:00
wassname	235b51399f	top-k v_hack subspace + real-voice pairs + LoRA bake Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)" finding on seed41: - bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25, merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/: a partially-hacky student where projected-vs-vanilla dynamics have room to diverge. - pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format (chat-template, class Solution, ```python fence, run_tests method). 4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs same-prompt to keep gradient comparable to training-time distribution. - extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD sign flip would invert the proj.py one-sided gate). Save as [k, r] with top_k in safetensors metadata. Diagnostic switches from ||diff|| to sv_top_k fraction. - proj.py: rank-k subspace projection with per-direction one-sided gate. For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g|| (subspace energy fraction). - probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation. - probe_distill.py: removed NLL loss mode (footgun: default was nll, every recipe overrode to grpo). Always GRPO. Tracks per_sample_loss. Extract on baked rh25 with new pairs (pueue 22): top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met). v_proj cleanest at 0.74. All 252 modules non-zero ||D||. References: - docs/paper_chars.md (CHaRS paper) motivates multi-axis steering - docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 02:33:24 +00:00
wassname	b4e76525c1	Per-prompt grouping, hint default, ratio diagnostic, LR=3e-4 - load_problems applies the simple_overwrite_tests hint by default (matches ariahw's load-time hint registry). Both pools now see the identical prompt. - Pool files keyed by prompt_id (prompt_NNNN.jsonl.gz); each = G rollouts of one problem. Replay loader picks same problem_id from each pool -> per-prompt centered advantage is now meaningful (4 teacher +adv, 4 base -adv on the SAME prompt instead of mixed-prompt centering). - Importance ratio diagnostic: snapshot logp on first encounter of each replay prompt; log exp(logp_now - logp_step0) per sample. Healthy ~2-5; explosion >10 == overfit on teacher tokens. - Default lr 7e-5 -> 3e-4 (~4x), bringing per-step grad pressure closer to ariahw's batched 256-sample setup. Grad-clip=1 still protects.	2026-05-25 22:03:50 +00:00
wassname	00159cd7c6	Fix is_replay bug, add delta_S/logp diagnostics, cycle pools - is_replay was always True when --replay-dirs was set, so student-gen batches were saved slim with no completions. Use replay_active. - Log delta_S norm per step (adapter movement smoke test). - Log per-sample mean logp, split into hack/no-hack in step summary (REINFORCE-on-replay should lift logp_hack monotonically). - Cycle pool modulo size so warmup > pool size works. - Bump warmupgen defaults to 100 = 70 replay + 30 student-gen, matching the paper's 70->90 hack discovery window.	2026-05-25 21:42:36 +00:00
wassname	041729a758	Warmup-gen probe results: H1 untestable at 20 warmup steps Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0. Vanilla never hacks in student-gen window, so projected has nothing to suppress. Cos signal validated in warmup phase. Headline H1 belongs on direct-GRPO path, not distill-and-watch.	2026-05-25 15:58:37 +00:00
wassname	a26f71ef1a	probe_traj: side-by-side vanilla-vs-projected trajectory analyzer Reads step files from both warmup-gen tags, prints per-step table broken into warmup-replay and student-gen phases, computes H1 delta on the gen-phase hack rate.	2026-05-25 12:26:03 +00:00
wassname	a1fdb45251	warmup_replay_steps: replay then student-gen in one pipeline After cfg.warmup_replay_steps replay steps from saved pools, switch to student.generate using the learned adapter -- canonical GRPO loop. Same Dr.GRPO loss + per-sample cosine throughout. Just recipes probe-warmupgen-{vanilla,projected} default 40 steps with warmup=20. Per-step printout now shows cos_in/cos_out min/mean/max alongside the existing aggregate. Reveals bimodal distributions hidden behind a mean.	2026-05-25 12:24:49 +00:00
wassname	ab6676d90a	mixed-replay GRPO works + cos fix + min/max + journal probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO loss path (REINFORCE-style centered advantage), slim save when in replay mode, just recipes probe-mixed-{vanilla,projected}. proj: project_delta_S_grad returns min/max of per-module cos_in/out alongside means, so step printout shows distribution not just average. probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the per-sample cos_S_contrib is a proper cosine in [-1, 1] (was the sqrt-of-n quirk that let it exceed 1). Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09 (proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two cleanly separated distributions on 4+4 samples. v_hack extracted from hand-authored pairs.py generalizes to ariahw's RL-emergent hack direction. Strong methodological confirmation. Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection asymmetry that makes cos_out slightly negative (cos_in<=0 modules skipped), and the cos norm fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 12:20:52 +00:00
wassname	1e1b032c31	phase2_analyze: read pilot checkpoints, print trajectories + decision Aggregates cin_mean / cout_mean / fired / frac_out_lt_in across seeds for vanilla and projected arms. Applies spec2.md decision rules: vanilla cin>0.2 -> Phase 3 strongly justified cin~0 -> v_hack maybe orthogonal; consider R7 projected out<in on >=80% steps -> mechanism active justfile recipe: phase2-analyze [pattern]	2026-05-25 12:02:35 +00:00
wassname	9c886428bf	proj: measure_only kwarg + train.py always-on cos_in diagnostic Vanilla arm now reports cos_in per step too (cosine of accumulated Dr.GRPO grad with v_hack), as long as v_hack file is on disk. The projection action only mutates the gradient when arm=projected; vanilla just measures. This makes Phase 2 (pilot scale) directly inform Phase 3: vanilla cos_in trajectory says whether v_hack is even aligned with the GRPO direction, before we burn 65h on the full sweep.	2026-05-25 11:50:41 +00:00
wassname	e04548987f	spec2 + base_pool generator + slim replay save (partial mixed-replay TODO) spec2.md records: - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed) - Phase 2: mixed-replay GRPO probe, partial impl - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal User correction mid-implementation: Phase 2 and Phase 3 should share train.py code with different --steps, not build separate replay machinery. Mixed-replay refactor in probe_distill.py is left wired in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen loader) but marked TODO for completion; canonical Phase 2 path is train.py at smaller scale. probe_distill.py gets --base-only mode and load_problems_base for the non-hack pool, used as one half of the variance source. Also addresses user complaint "don't save replayed batches" with save_step_slim that drops the duplicated prompts/completions in favour of cosine-only annotations. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:48:48 +00:00
wassname	765a6f6be7	probe_distill: inline per-step cos-by-bucket printout Each step now logs cos_pureHack(n), cos_mixed(n), cos_noHack(n) alongside hack/pass, so the v_hack-direction discrimination signal is visible at run time without post-hoc querying. With rh-s65 teacher (~99.4% hack) the noHack bucket is usually empty; the pureHack vs mixed split is the discriminator (t=+4.46 p<1e-4 over 160 samples). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:28:25 +00:00
wassname	195b55cc28	spec: reject T5 mixed-policy design after external review Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on teacher rows (ratio pegs to clip from step 0), frac_clipped not ratio_mean is the saturation diagnostic, mixed-policy can produce gradient AWAY from hacking when teacher-half has zero adv variance, and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO. User instruction reinforces: no mixed policy. Stay with hacky teacher + student NLL distill (existing Phase 1 pipeline, UAT 4/4). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:26:33 +00:00
wassname	2a21fbc49c	spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7 R1-R4 (Phase 1) marked done with evidence pointers to out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/. R5 = GRPO trajectory probe (mixed-policy generator to restore reward variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive v_hack re-extraction (fallback only). Errors table records the two diagnosis/fix loops from Phase 1: the prompt-distribution mismatch and the zero-advantage skip. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:22:19 +00:00
wassname	d2e15da4bc	NLL distillation loss + UAT T4 via gt_pass split Previous: per-sample loss was off-policy Dr.GRPO with importance ratio. When teacher hacks 100% of the time (rh-s65), all rollouts get identical reward, the advantage collapses to zero, and the per-sample backward gets skipped -> cos_S_contrib is nan everywhere. Fix: use per-sample mean NLL on completion tokens. This is the same loss extract_vhack_grad.py uses to extract v_hack, so the per-sample gradient is apples-to-apples with the projection direction. Removes off-policy ratio + clip + zero_advantages branch. T4 in UAT had n_not_hacked = 1 since rh hacks 99% of the time. Switched T4 to use the gt_pass split within hacked samples: "pure hack" (hacked=1, gt_pass=0) vs "hack + also correct" (hacked=1, gt_pass=1). On the 160 samples we just generated this gives t=+4.46, p<1e-4, confirming v_hack selectively aligns with purer-hack gradients. UAT result: 4/4 pass. T1 hack=0.994 T2 cov=1.00 T3 cos_out<cos_in on 20/20 T4 t=+4.46 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:19:44 +00:00
wassname	d111db25f7	Distillation probe: hacky teacher (rh-s65) + student per-sample cosine probe_distill.py is one script with three modes (default, --teacher-only, --replay-dir) so vanilla and projected arms can replay the same teacher rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives cos(grad, v_hack) per sample without breaking accumulation semantics. rh-s65 was trained with simple_overwrite_tests hint applied to the user prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that distribution (0/8 hacks). load_problems_rh restores the no-intervention setup -> 8/8 hacks at step 0. probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack >=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on >=80% steps, T4 cos | hacked > cos | not (one-sided t, p<0.05). Journal entry flags methodological caveat: v_hack from NLL contrastive gradient is not the GRPO policy gradient; if T4 fails, fallback is to re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:04:55 +00:00
wassname	fa24f4eb4b	Drop grad checkpointing, KV cache for generate, periodic safetensors ckpt + phase timing - Drop gradient_checkpointing: at G=6 grad-accum forwards one 6-seq group at a time, so activation peak fits on 96GB without recompute; removes the ~1.3-1.5x backward recompute. enable_input_require_grads was a checkpointing-only trick. - Toggle use_cache=True around model.generate (False for the loss forwards). Cacheless decode was O(L^2); measured 2.17x faster with cache on the wrapped 4B. - Replace end-of-run torch.save(.pt) with save_ckpt(): trainable delta_S as safetensors tensors + rows/config as JSON metadata (str->str), written every 25 steps and at the end so an early kill keeps progress. Mirrors v_hack idiom. - Per-step TIMING log (gen / fwd_bwd / reward) to attribute wall-time. Diagnosed generation as ~93% of step cost (HF generate slow; full-rank reparam adds 1.5x). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 12:45:21 +00:00
wassname	6f68ba34b6	Match paper effective batch + fix gt_tests/KeyError, strip stale docstring Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset): - gt_tests: was [:5] of median-102 ground-truth asserts. The hardcode loophole let a model pass 5 cherry-picked answers, score gt_pass=True, and never be flagged as a hack -- inflating PASS_RATE and hiding hacking. Now uses all asserts (free: rewards.py runs them in one subprocess). - n_problems 500 -> 992 (full filtered set, paper fn.9). - prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable to the paper's step N in gradient-sample terms. - KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever been written. - Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth. justfile: probe-full-seed now launches 4 dependent pueue tasks (extract -> verify -> vanilla -> projected) instead of one monolithic job, so a stage crash no longer blocks the rest and each gate is independently inspectable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 09:25:47 +00:00
wassname	9fb27fe746	register vendored repos as submodules (fix fresh-box empty-dir crash) Three gitlinks (mode 160000) existed in the index with no .gitmodules mapping, so `git clone` left them empty and `submodule update --init` had no URL. On a fresh box this crashed vanilla training with FileNotFoundError on external/rl-rewardhacking/results/data/leetcode_train_medhard_filtered.jsonl. Add .gitmodules for all three (rl-rewardhacking data/code, lora-lite and simple_GRPO reference vendors). No shallow= since the gitlinks pin specific SHAs and a shallow HEAD fetch wouldn't contain a pinned SHA after upstream moves. Document the clone step in handover fresh-box setup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-24 05:32:13 +00:00
wassname	87a2b48784	G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop. spec.md: documented the v_hack generalization constraint as load-bearing methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses. handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints. justfile gains a SWEEPS comment block clarifying probe vs queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs. RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.	2026-05-24 05:03:04 +00:00
wassname	973b9407b5	grader bug fix + ref reward semantics + Qwen3-4B substrate Three independent issues that together made every prior `gt=0` measurement bogus and the H4 hypothesis untestable: 1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)` producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False regardless of correctness. Fixed by joining tests verbatim. 2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)` default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes 0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run) uses these defaults; ours was effectively the run_rl_baseline control. 3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions. beta=1e-3 (was 0.04) per reference config.py:135. Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt); token-efficient logging (loguru single-char icons through tqdm.write, verbose log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable side-by-side; new RESEARCH_JOURNAL.md. First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000, rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode. 200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps): extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:36:00 +00:00
wassname	4549a7ca27	handover	2026-05-23 14:20:17 +08:00
wassname	0e2c786d4a	ready	2026-05-23 14:19:41 +08:00
wassname	75a3ec9dd9	ready?	2026-05-23 14:03:05 +08:00
wassname	25cba14aee	Add new scripts for AntiPaSTO and GRPO validation, including v_hack extraction, held-out validation, and smoke tests	2026-05-23 13:54:51 +08:00
wassname	e3ad6887e6	Add AntiPaSTO implementation and diagnostic scripts for projected-GRPO	2026-05-23 13:33:33 +08:00
wassname	42498682ca	spec	2026-05-23 13:04:03 +08:00
wassname	2d6695389f	refined spec - vec in grad space - SVD first - lsrl for simple_GRPO	2026-05-23 12:32:45 +08:00
wassname	bf252fac69	fix smoke.	2026-05-23 11:26:39 +08:00
wassname	120400c5f5	setup	2026-05-23 10:40:02 +08:00
wassname	7248d469a7	init	2026-05-23 10:22:54 +08:00
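
The rank-k projection with a per-direction one-sided gate described in commit 235b51399f can be sketched roughly as below. This is a minimal illustration and not the repository's proj.py: the function names `project_one_sided` and `subspace_energy` are hypothetical, gradients are plain Python lists instead of tensors, and the rows of V are assumed to be orthonormal (as right singular vectors would be).

```python
from math import sqrt


def dot(a, b):
    """Inner product <a, b> of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_one_sided(g, V):
    """Subtract c_i * v_i from g only for axes where c_i = <g, v_i> > 0.

    Assumption: the rows of V are orthonormal, so each coefficient can be
    taken against the original g and removed independently.
    """
    out = list(g)
    for v in V:
        c = dot(g, v)
        if c > 0:  # kill motion along +v_i, leave motion along -v_i alone
            out = [o - c * x for o, x in zip(out, v)]
    return out


def subspace_energy(g, V):
    """Fraction of the norm of g inside span(V): ||V g|| / ||g||."""
    norm_g = sqrt(dot(g, g))
    if norm_g == 0.0:
        return 0.0
    return sqrt(sum(dot(g, v) ** 2 for v in V)) / norm_g


# Illustrative values only (not taken from any run).
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = [3.0, -4.0, 12.0]
g_proj = project_one_sided(g, V)  # first axis removed, second left untouched
energy_in = subspace_energy(g, V)
```

With these values the first axis has a positive coefficient and is removed, while the second has a negative coefficient and is kept, which is the sign-aware behaviour the commit message describes.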

488 Commits
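
Several commits above (b4e76525c1, d2e15da4bc) turn on how per-prompt centered advantages behave. The sketch below assumes the centered advantage is simply each rollout's reward minus the mean reward of its prompt group, with no standard-deviation scaling; `centered_advantages` is a hypothetical helper and the reward values are illustrative, not taken from any run.

```python
def centered_advantages(rewards):
    """Per-prompt centered advantage: reward minus the group mean.

    Assumption: no division by the group standard deviation.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Mixed pool on one prompt: 4 high-reward teacher rollouts, 4 low-reward base rollouts.
mixed = centered_advantages([3.5] * 4 + [0.5] * 4)

# Teacher hacks every time: identical rewards, so every advantage is zero
# and a reward-weighted policy gradient carries no signal.
collapsed = centered_advantages([3.5] * 8)
```

The second case is the zero-advantage collapse that d2e15da4bc reports for a teacher that hacks on nearly every rollout; the first shows why mixing teacher and base rollouts on the same prompt restores a usable signal.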