mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 16:30:30 +08:00

wassname 890ae62649 token-efficient extract/heldout logs + sensible verify defaults

- antipasto.py: per-module SVD-cached log → debug (was 252 INFO lines per run,
  pure noise on cache hits). Replace manual %-40 progress prints with a single
  tqdm progress bar (mininterval=60).
- extract_vhack_grad.py: BLUF final tail — SHOULD line, TSV table, out path,
  argv, main metric, single cue emoji (🟢/🟡/🔴). Same data, ~30 fewer lines.
- verify_vhack_heldout.py: same BLUF tail pattern. Defaults updated to point
  at baked rh25 + v_hack_rh25 (were Qwen3.5-0.8B smoke). Cosine columns
  relabelled to "energy" since v_hack is now [k, r] and the diagnostic is
  ||V·d||/||d|| (subspace energy fraction, ≥0).

Held-out result for current v_hack_rh25 (pueue 23):
  median_energy=0.217, mean=0.286, n=252 modules.
  🟡 below target 0.30 but 20× the prior synthetic-pair ~0.01.
  q_proj cleanest (0.351 median), down_proj weakest (0.146).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
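
The "energy" diagnostic named in the commit message, ||V·d||/||d||, can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in verify_vhack_heldout.py: the function name `energy_fraction` and the toy inputs are hypothetical, and it assumes the k rows of V are orthonormal, in which case the value lies between 0 and 1.

```python
import math

def energy_fraction(V, d):
    """Return ||V d|| / ||d|| for a [k, r] basis V and a length-r vector d.

    Assumes (hypothetically) that the rows of V are orthonormal, so the
    result is the fraction of d's norm that lies inside span(V).
    """
    norm_d = math.sqrt(sum(x * x for x in d))
    if norm_d == 0.0:
        return 0.0  # convention chosen for this sketch: zero vector has no energy
    coords = [sum(v_i * d_i for v_i, d_i in zip(row, d)) for row in V]
    return math.sqrt(sum(c * c for c in coords)) / norm_d

# Toy example: a k=2 subspace of a 3-dimensional space.
V = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
d = [3.0, 4.0, 12.0]
frac = energy_fraction(V, d)  # ||V d|| = 5, ||d|| = 13
```

Unlike a signed cosine against a single direction, this quantity cannot be negative, which matches the relabelling from "cosine" to "energy" described above.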

2026-05-26 02:39:19 +00:00

| Path | Last commit | Date |
| --- | --- | --- |
| docs | merge duplicate research journals into root RESEARCH_JOURNAL.md | 2026-05-26 02:36:07 +00:00 |
| external | setup | 2026-05-23 10:40:02 +08:00 |
| src/projected_grpo | token-efficient extract/heldout logs + sensible verify defaults | 2026-05-26 02:39:19 +00:00 |
| svd_cache/Qwen__Qwen3.5-0.8B | spec | 2026-05-23 13:04:03 +08:00 |
| .gitignore | G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite | 2026-05-24 05:03:04 +00:00 |
| .gitmodules | | 2026-05-24 05:32:13 +00:00 |
| AGENTS.md | setup | 2026-05-23 10:40:02 +08:00 |
| human_journal.md | init | 2026-05-23 10:22:54 +08:00 |
| justfile | top-k v_hack subspace + real-voice pairs + LoRA bake | 2026-05-26 02:33:24 +00:00 |
| pyproject.toml | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00 |
| README.md | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00 |
| RESEARCH_JOURNAL.md | merge duplicate research journals into root RESEARCH_JOURNAL.md | 2026-05-26 02:36:07 +00:00 |
| spec.md | G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite | 2026-05-24 05:03:04 +00:00 |
| uv.lock | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00 |

README.md

projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack direction (in the SVD-of-W basis) reduces the reward-hack rate in GRPO without tanking the pass rate.
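
As a rough illustration of what "projecting the gradient orthogonal to a hack direction" means, the sketch below removes from a gradient vector its component inside a small subspace. It is a toy in plain Python, not this repo's implementation: the names `project_out` and `v_hack`, the 3-dimensional example, and the assumption that the basis vectors are orthonormal and already expressed in the same coordinates as the gradient are all hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project_out(grad, basis):
    """Return grad minus its component inside span(basis).

    `basis` is a list of k vectors assumed to be orthonormal (a stand-in
    for a [k, r] v_hack). With that assumption the result is orthogonal
    to every basis vector.
    """
    out = list(grad)
    for v in basis:
        coeff = dot(grad, v)  # coordinate of grad along v
        out = [o - coeff * x for o, x in zip(out, v)]
    return out

# Toy example: the "hack" subspace is spanned by the first two axes.
v_hack = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
grad = [3.0, 4.0, 12.0]
projected = project_out(grad, v_hack)  # only the third coordinate survives
```

An optimizer step taken with `projected` instead of `grad` cannot move the parameters along any of the basis directions, which is the property the hypothesis relies on.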

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.

See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.

Quick start

uv sync
just fast-dev-run        # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla       # vanilla pathway smoke
just smoke-projected     # projected pathway smoke
just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

Hypotheses (preregistered)

See spec.md. Headline hypothesis (H1): gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces the reward-hack rate by ≥30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
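
The H1 criterion can be written as a small check. This is a sketch under stated assumptions: it reads the threshold as a reduction of at least 30 percentage points in hack rate, reads "matched" as pass rates within 10 percentage points of each other, and uses a hypothetical function name. The input numbers are made up to exercise the check and are not results.

```python
def h1_supported(hack_vanilla, hack_projected, pass_vanilla, pass_projected,
                 min_drop_pp=30.0, pass_tol_pp=10.0):
    """Check the H1 criterion on rates given in percent (0-100).

    True when the hack rate drops by at least `min_drop_pp` percentage
    points while the pass rates differ by at most `pass_tol_pp` points.
    """
    drop = hack_vanilla - hack_projected
    pass_gap = abs(pass_vanilla - pass_projected)
    return drop >= min_drop_pp and pass_gap <= pass_tol_pp

# Illustrative inputs only (not measurements from this project).
ok = h1_supported(60.0, 25.0, 40.0, 35.0)          # drop 35pp, pass gap 5pp
too_small = h1_supported(60.0, 45.0, 40.0, 35.0)   # drop only 15pp
unmatched = h1_supported(60.0, 25.0, 40.0, 20.0)   # pass gap 20pp
```

Stating the criterion in percentage points rather than relative change keeps the threshold independent of the vanilla baseline's hack rate.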