mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin

Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)
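
A minimal sketch of why majority-vote orientation is less outlier-fragile than sign-of-mean. It assumes each contrastive pair contributes one scalar projection onto a singular direction; the function names are hypothetical and are not taken from extract_vhack_grad.py.

```python
def sign_of_mean(projections):
    """Old rule: orient by the sign of the mean projection."""
    mean = sum(projections) / len(projections)
    return 1.0 if mean >= 0 else -1.0


def majority_vote_sign(projections):
    """New rule: orient by the sign that most pairs agree on."""
    positives = sum(1 for p in projections if p > 0)
    negatives = sum(1 for p in projections if p < 0)
    return 1.0 if positives >= negatives else -1.0


# Illustrative values only: four pairs agree on a positive orientation,
# one outlier pair is large and negative.
projections = [0.2, 0.3, 0.1, 0.25, -5.0]
old_sign = sign_of_mean(projections)        # flipped by the single outlier
new_sign = majority_vote_sign(projections)  # follows the four agreeing pairs
```

The tie-break toward the positive sign is an arbitrary choice made for this sketch.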

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k
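
The key filtering and k_use slicing could look roughly like the sketch below. It assumes the saved file is a flat mapping from module name to a list of k direction vectors, with singular values stored under `_sv/{name}`; `split_v_hack` is a hypothetical helper, not the actual `load_v_hack`.

```python
def split_v_hack(state, k_use):
    """Split a flat saved mapping into (v_hack, v_sv), keeping the first k_use axes.

    Assumed layout (hypothetical): state[name] is a list of k direction
    vectors and state["_sv/" + name] is a list of k singular values.
    """
    prefix = "_sv/"
    v_hack, v_sv = {}, {}
    for key, value in state.items():
        if key.startswith(prefix):
            v_sv[key[len(prefix):]] = value[:k_use]
        else:
            v_hack[key] = value[:k_use]
    return v_hack, v_sv


# Extracted at k=3, ablated at k=2 without re-extracting.
state = {
    "layer0": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "_sv/layer0": [3.0, 2.0, 1.0],
}
v_hack, v_sv = split_v_hack(state, k_use=2)
```

Slicing at load time is what lets a single extraction at the largest k serve every smaller k in an ablation.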

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)
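
A small sketch of the ratio and gate for one module, in plain Python. It assumes c holds the per-axis projection coefficients of the step gradient, S the matching singular values, and g_norm the module's gradient norm. The helper names are hypothetical, and the rank-based cutoff is a simple stand-in for the quantile used in proj.py.

```python
import math


def suspicion_ratios(c, S, g_norm):
    """r_i = (|c_i| / ||g||) / (S_i / ||S||), computed within one module."""
    s_norm = math.sqrt(sum(s * s for s in S))
    return [(abs(ci) / g_norm) / (si / s_norm) for ci, si in zip(c, S)]


def gate_keep_mask(ratios, susp_drop_frac):
    """Return True for axes kept, False for the top susp_drop_frac by ratio."""
    n_drop = int(len(ratios) * susp_drop_frac)
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(ratios))]


# Illustrative values only.
c = [0.1, 0.4, 0.05, 0.1]
S = [4.0, 2.0, 1.0, 1.0]
ratios = suspicion_ratios(c, S, g_norm=1.0)
keep = gate_keep_mask(ratios, susp_drop_frac=0.25)
```

Because both the numerator and the denominator are normalized within the module, rescaling the module's gradient leaves r_i unchanged, which is the bias the plain |c_i|/S_i ratio suffers from.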

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 06:39:05 +00:00

Path | Last commit | Date
docs | v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin | 2026-05-27 06:39:05 +00:00
external | setup | 2026-05-23 10:40:02 +08:00
src/projected_grpo | v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin | 2026-05-27 06:39:05 +00:00
svd_cache/Qwen__Qwen3.5-0.8B | spec | 2026-05-23 13:04:03 +08:00
.gitignore | G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite | 2026-05-24 05:03:04 +00:00
.gitmodules |  | 2026-05-24 05:32:13 +00:00
AGENTS.md | setup | 2026-05-23 10:40:02 +08:00
human_journal.md | init | 2026-05-23 10:22:54 +08:00
justfile | v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin | 2026-05-27 06:39:05 +00:00
pyproject.toml | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00
README.md | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00
RESEARCH_JOURNAL.md | v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin | 2026-05-27 06:39:05 +00:00
spec.md | G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite | 2026-05-24 05:03:04 +00:00
uv.lock | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00

README.md

projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.
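
As a rough illustration of the core operation, the sketch below removes the component of a gradient along a hack direction. It treats both as flat vectors in the same coordinates and omits the change into the SVD-of-W basis. `project_out` is a hypothetical name, not a function from this repo.

```python
import math


def project_out(grad, v_hack):
    """Return grad with its component along v_hack removed: g - (g . u) u."""
    norm = math.sqrt(sum(x * x for x in v_hack))
    if norm == 0.0:
        raise ValueError("v_hack must be non-zero")
    unit = [x / norm for x in v_hack]
    coeff = sum(g * u for g, u in zip(grad, unit))
    return [g - coeff * u for g, u in zip(grad, unit)]


# Illustrative values only.
grad = [3.0, 4.0, 0.0]
v_hack = [0.0, 2.0, 0.0]
projected = project_out(grad, v_hack)
# The projected gradient has no remaining component along v_hack.
residual = sum(p * v for p, v in zip(projected, v_hack))
```

With k directions, the same step would be repeated for each direction of an orthonormal set.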

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.

See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.

Quick start

uv sync
just fast-dev-run        # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla       # vanilla pathway smoke
just smoke-projected     # projected pathway smoke
just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

Hypotheses (preregistered)

See spec.md. Headline (H1): gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces the reward-hack rate by ≥30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
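
The H1 criterion can be written as a simple check. This sketch reads the threshold as "at least 30pp" and takes rates as percentages. `h1_supported` and its parameter names are hypothetical, and the inputs are placeholders rather than results.

```python
def h1_supported(vanilla_hack, projected_hack, vanilla_pass, projected_pass,
                 min_drop_pp=30.0, pass_tol_pp=10.0):
    """True if hack rate drops by >= min_drop_pp while pass rates stay matched.

    All rates are percentages in [0, 100]; differences are percentage points.
    """
    hack_drop = vanilla_hack - projected_hack
    pass_gap = abs(vanilla_pass - projected_pass)
    return hack_drop >= min_drop_pp and pass_gap <= pass_tol_pp


# Placeholder numbers for illustration only, not measurements.
example_met = h1_supported(60.0, 25.0, 40.0, 35.0)      # drop 35pp, pass gap 5pp
example_not_met = h1_supported(60.0, 40.0, 40.0, 35.0)  # drop only 20pp
```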