evil_MoE

wassname/evil_MoE

Fork 0

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:30:41 +08:00

Commit Graph

Author	SHA1	Message	Date
wassname	41817d2a08	README: add plain-language "How it works" section Walk through the method from the start, in the user's voice, without AI tells: ablate hack direction from gradient on each update; extract via twin NLL on hand-paired completions, SVD the diff; work in delta_S space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor; log cin/cout and cin_t vs cin_s as the empirical sanity check. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:39:19 +00:00
wassname	973b9407b5	grader bug fix + ref reward semantics + Qwen3-4B substrate Three independent issues that together made every prior `gt=0` measurement bogus and the H4 hypothesis untestable: 1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)` producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False regardless of correctness. Fixed by joining tests verbatim. 2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)` default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes 0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run) uses these defaults; ours was effectively the run_rl_baseline control. 3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions. beta=1e-3 (was 0.04) per reference config.py:135. Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt); token-efficient logging (loguru single-char icons through tqdm.write, verbose log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable side-by-side; new RESEARCH_JOURNAL.md. First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000, rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode. 200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps): extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-23 23:36:00 +00:00
wassname	120400c5f5	setup	2026-05-23 10:40:02 +08:00

Author

SHA1

Message

Date

wassname

41817d2a08

README: add plain-language "How it works" section

Walk through the method from the start, in the user's voice, without AI
tells: ablate hack direction from gradient on each update; extract via
twin NLL on hand-paired completions, SVD the diff; work in delta_S
space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor;
log cin/cout and cin_t vs cin_s as the empirical sanity check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 09:39:19 +00:00

wassname

973b9407b5

grader bug fix + ref reward semantics + Qwen3-4B substrate

Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:

1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
   producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
   regardless of correctness. Fixed by joining tests verbatim.

2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
   default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
   paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
   0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
   uses these defaults; ours was effectively the run_rl_baseline control.

3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
   DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
   wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
   beta=1e-3 (was 0.04) per reference config.py:135.

Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
(was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt);
token-efficient logging (loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO
for greppable side-by-side; new RESEARCH_JOURNAL.md.

First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.

200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-23 23:36:00 +00:00

wassname

120400c5f5

setup

2026-05-23 10:40:02 +08:00

3 Commits