mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:48:43 +08:00


wassname bfc54b83b4 Restore model.train() after v_hack auto-extract

extract_v_hack runs forward+backward on contrastive pairs to populate
delta_S.grad; the inline auto-extract called model.eval() but never
called model.train() back, so the entire training run was in eval mode.

Qwen3 has no dropout by default so behavior was unchanged, but this
matches the standalone extract CLI's behavior and avoids latent
inconsistency if a model with dropout is used later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
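
One way to guard against this class of bug is to switch modes through a context manager that always restores the previous mode. The sketch below is illustrative only and is not the repository's code: `eval_mode` and `TinyModule` are hypothetical names, and `TinyModule` is a stand-in that mimics the `training` / `train(mode)` / `eval()` interface of `torch.nn.Module` so the example runs without PyTorch.

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Put `model` in eval mode, then restore whatever mode it had before.

    Assumes `model` exposes `.training`, `.eval()` and `.train(mode)`,
    as torch.nn.Module does. (Hypothetical helper, not from the repo.)
    """
    was_training = model.training
    model.eval()
    try:
        yield model
    finally:
        # Restore the prior mode even if the body raises.
        model.train(was_training)


class TinyModule:
    """Stand-in with the same mode-switching interface as torch.nn.Module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


model = TinyModule()
with eval_mode(model):
    # An extraction pass (forward + backward) would run here.
    inside = model.training
after = model.training
```

With this pattern the extraction step cannot leave the model in eval mode, because the restore happens in the `finally` clause.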

2026-05-27 09:08:55 +00:00

docs

Doc cleanup: mark susp gate as REMOVED in design doc

2026-05-27 09:08:34 +00:00

external

setup

2026-05-23 10:40:02 +08:00

src/projected_grpo

Restore model.train() after v_hack auto-extract

2026-05-27 09:08:55 +00:00

svd_cache/Qwen__Qwen3.5-0.8B

spec

2026-05-23 13:04:03 +08:00

.gitignore

G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite

2026-05-24 05:03:04 +00:00

.gitmodules

2026-05-24 05:32:13 +00:00

AGENTS.md

setup

2026-05-23 10:40:02 +08:00

human_journal.md

init

2026-05-23 10:22:54 +08:00

justfile

v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin

2026-05-27 06:39:05 +00:00

pyproject.toml

grader bug fix + ref reward semantics + Qwen3-4B substrate

2026-05-23 23:36:00 +00:00

README.md

grader bug fix + ref reward semantics + Qwen3-4B substrate

2026-05-23 23:36:00 +00:00

RESEARCH_JOURNAL.md

v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin

2026-05-27 06:39:05 +00:00

spec.md

G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite

2026-05-24 05:03:04 +00:00

uv.lock

grader bug fix + ref reward semantics + Qwen3-4B substrate

2026-05-23 23:36:00 +00:00

README.md

projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.
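
The core operation, removing the component of a gradient that lies along a hack direction, can be sketched as below. This is a minimal illustration and not the repository's implementation: it assumes the gradient and `v_hack` are flat vectors already expressed in the same (SVD) coordinates, it handles a single direction only, and the helper names `dot` and `project_out` are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad: Sequence[float], v_hack: Sequence[float],
                eps: float = 1e-12) -> List[float]:
    """Return `grad` with its component along `v_hack` removed.

    Computes g - (<g, v> / <v, v>) * v, so the result is orthogonal
    to v_hack. If v_hack is (numerically) zero, grad is returned as is.
    """
    norm_sq = dot(v_hack, v_hack)
    if norm_sq < eps:
        return list(grad)
    coeff = dot(grad, v_hack) / norm_sq
    return [g - coeff * v for g, v in zip(grad, v_hack)]


# Toy example: the hack direction is the first coordinate axis.
grad = [3.0, 4.0, 0.0]
v_hack = [1.0, 0.0, 0.0]
projected = project_out(grad, v_hack)
```

In this toy case the first coordinate of the gradient is zeroed and the remaining coordinates are left unchanged, so an optimizer step along `projected` makes no first-order progress along `v_hack`.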

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.

See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.

Quick start

uv sync
just fast-dev-run        # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla       # vanilla pathway smoke
just smoke-projected     # projected pathway smoke
just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

Hypotheses (preregistered)

See spec.md. Headline H1: gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces the reward-hack rate by >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
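
The H1 criterion can be written as a small check. This is a sketch under stated assumptions: rates are given in percent, "matched" is read as an absolute pass-rate gap of at most 10pp, and the function name, argument names, and numbers are hypothetical illustrations, not measured results.

```python
def h1_supported(hack_vanilla_pct: float, hack_projected_pct: float,
                 pass_vanilla_pct: float, pass_projected_pct: float,
                 min_drop_pp: float = 30.0, pass_tol_pp: float = 10.0) -> bool:
    """Check the H1 thresholds on rates expressed in percent.

    True when the hack rate drops by at least `min_drop_pp` percentage
    points and the pass rates differ by at most `pass_tol_pp` points.
    """
    hack_drop = hack_vanilla_pct - hack_projected_pct
    pass_gap = abs(pass_projected_pct - pass_vanilla_pct)
    return hack_drop >= min_drop_pp and pass_gap <= pass_tol_pp


# Made-up numbers, for illustration only.
ok = h1_supported(60.0, 25.0, 40.0, 35.0)              # 35pp drop, 5pp pass gap
too_small_drop = h1_supported(60.0, 40.0, 40.0, 35.0)  # 20pp drop
pass_mismatch = h1_supported(60.0, 25.0, 40.0, 25.0)   # 15pp pass gap
```

The drop is measured in absolute percentage points rather than relative to the vanilla rate, which is how "30pp absolute" is read here.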