mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:31:11 +08:00

wassname 577f075611 jaxtyping: shape contracts for v_hack save/load/apply/project paths

The four touchpoints where v_hack flows through the codebase now carry
shape annotations checked at runtime under BEARTYPE=1:

- proj._project_one_module(g: [r], V: [k, r]) -> (g_proj: [r], ...).
  New typed helper, called from project_delta_S_grad's per-module loop.
  Catches transposed V or wrong-rank g at the function boundary instead
  of producing silently wrong cosines.
- proj.mean_cin_from_grads(grad_dict, v_hack) typed to dicts of [r] and [k, r].
- proj.project_delta_S_grad(v_hack: dict[str, Float[Tensor, "k r"]], ...).
- train.load_v_hack(...) -> dict[str, Float[Tensor, "k r"]].
- extract_vhack_grad.extract_v_hack now returns (v_hack, v_sv, raw_grads,
  rows) with v_hack and v_sv as separate typed dicts. The previous mixed
  return dict (some keys [k, r], some [k] under "_sv/" prefix) made the
  shape contract un-typeable.

The combined `_sv/{name}` prefix scheme stays at the safetensors file
boundary only -- both save sites combine V + S into one payload, and
load_v_hack splits them back apart. In memory, V and S are always
separate.

Module docstring in proj.py now states the shape conventions (r, k, V, g, c).

2026-05-27 23:20:38 +00:00

Name | Last commit | Date
docs | Doc cleanup: mark susp gate as REMOVED in design doc | 2026-05-27 09:08:34 +00:00
external | setup | 2026-05-23 10:40:02 +08:00
src/projected_grpo | jaxtyping: shape contracts for v_hack save/load/apply/project paths | 2026-05-27 23:20:38 +00:00
svd_cache/Qwen__Qwen3.5-0.8B | spec | 2026-05-23 13:04:03 +08:00
.gitignore | G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite | 2026-05-24 05:03:04 +00:00
.gitmodules | | 2026-05-24 05:32:13 +00:00
AGENTS.md | setup | 2026-05-23 10:40:02 +08:00
human_journal.md | init | 2026-05-23 10:22:54 +08:00
justfile | v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin | 2026-05-27 06:39:05 +00:00
pyproject.toml | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00
README.md | README: add plain-language "How it works" section | 2026-05-27 09:39:19 +00:00
RESEARCH_JOURNAL.md | Journal: first student hacks in #51 at ref_eq=13.5 | 2026-05-27 10:10:28 +00:00
spec.md | G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite | 2026-05-24 05:03:04 +00:00
uv.lock | grader bug fix + ref reward semantics + Qwen3-4B substrate | 2026-05-23 23:36:00 +00:00

README.md

projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.

See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.

How it works

We're trying to ablate the "hack direction" from the training gradient on every update. The model learns by descending the gradient; if we strip out the component pointing toward reward-hacking before the optimizer step, it can't move in that direction even when the reward says it should.

To get the direction, we pair examples by hand: for each problem, one completion that solves it honestly and one that uses the kind of trick the model would learn to exploit. For each pair we compute the NLL gradient on the hack completion and on the clean completion separately, then take the difference. That gives us one gradient-difference vector per pair. We stack those over our ~10 pairs and SVD the result; the top right singular vectors are our hack-direction basis.
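The extraction step can be sketched for a single module as follows. This is a toy illustration in NumPy, not the repository's code: `extract_hack_basis`, the planted direction, and the synthetic gradients are all hypothetical, and real gradients would come from NLL backward passes on the hand-written pairs.

```python
import numpy as np

def extract_hack_basis(hack_grads, clean_grads, k):
    """Stack per-pair gradient differences and keep the top-k right singular vectors.

    hack_grads, clean_grads: arrays of shape [n_pairs, r], one row per pair.
    Returns (V, S): V is [k, r] with orthonormal rows, S is [k] singular values.
    """
    diffs = np.asarray(hack_grads) - np.asarray(clean_grads)  # [n_pairs, r]
    # Thin SVD: diffs = U @ diag(S) @ Vt; rows of Vt are the right singular vectors,
    # already sorted by descending singular value.
    _, S, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:k], S[:k]

# Toy data (assumption): every hack gradient is the clean gradient plus a strong
# push along one planted axis, plus a little noise.
rng = np.random.default_rng(0)
n_pairs, r = 10, 64
hack_dir = np.zeros(r)
hack_dir[3] = 1.0
clean = rng.normal(size=(n_pairs, r))
hack = clean + 5.0 * hack_dir + 0.1 * rng.normal(size=(n_pairs, r))

V, S = extract_hack_basis(hack, clean, k=2)
print("V shape:", V.shape)
print("alignment of top direction with planted axis:", abs(V[0] @ hack_dir))
```

In this toy setup the first row of `V` should line up closely with the planted axis, while the second row mostly captures noise, which is the situation the noise floor described further down is meant to handle.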

This is twin-NLL extraction. The hope is that the NLL gradient landscape (the update that would make hack-style tokens more likely on a fixed prompt) shares enough geometry with the RL gradient landscape (the update the model actually receives during training) that ablating along the NLL-extracted direction also ablates along the RL one. Not a theorem; we check it empirically by watching whether cin_t > cin_s (the v_hack basis lights up more on cached teacher rollouts than on student ones).

Everything happens in the SVD-of-W basis. Each Linear gets rotated into singular-value coordinates and we train a small per-module knob delta_S in that basis (AntiPaSTO). So the extracted directions, the live gradient, and the projection all live in delta_S space, which is small per module: one vector of length r, roughly 500 to 2560.
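As a rough picture of what training in delta_S space can look like, here is a minimal NumPy sketch. It assumes delta_S is added to the singular values of a frozen W; the actual AntiPaSTO parametrization may differ, and `effective_weight` and `grad_wrt_delta_S` are hypothetical names used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                       # toy Linear weight [d_out, d_in]
U, S, Vt = np.linalg.svd(W, full_matrices=False)  # W = U @ diag(S) @ Vt, here r = 4

def effective_weight(delta_S):
    """Weight used in the forward pass when only the singular values are nudged.

    Assumption: the knob is additive, W' = U @ diag(S + delta_S) @ Vt.
    """
    return U @ np.diag(S + delta_S) @ Vt

def grad_wrt_delta_S(grad_W):
    """Chain rule: dL/d(delta_S_i) = u_i^T (dL/dW') v_i, one scalar per singular axis."""
    return np.einsum("oi,od,di->i", U, grad_W, Vt.T)

# With delta_S = 0 the module is unchanged.
print(np.allclose(effective_weight(np.zeros_like(S)), W))

# A full [d_out, d_in] weight gradient collapses to a length-r vector.
grad_W = rng.normal(size=W.shape)
g = grad_wrt_delta_S(grad_W)
print(g.shape)
```

Under this assumption the per-module gradient is a single vector of length r, which is the object the hack directions are extracted from and the projection later acts on.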

Noise floor at load. SVD gives us up to K directions per module sorted by singular value, and the lower ones are mostly noise (with 10 pairs the stacked difference matrix has rank at most 10, so there is no more real signal than that to fit). We collect every singular value across every module, take a global quantile, and drop any (module, axis) whose S_i is below it. Default cut: bottom 25%. Modules whose every axis lands below the cut get filtered out entirely. The quantile is global rather than per-module because a noisy module shouldn't be protected by having its own "top direction".
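The global cut can be sketched as below. This is an illustrative NumPy version under assumptions: `apply_noise_floor` is a hypothetical helper, the module names and singular values are made up, and the real loader works on tensors read from a safetensors file.

```python
import numpy as np

def apply_noise_floor(v_hack, v_sv, q=0.25):
    """Drop every (module, axis) whose singular value is below the global q-quantile.

    v_hack: dict name -> [k, r] directions; v_sv: dict name -> [k] singular values.
    Modules with no surviving axis are removed from the result entirely.
    """
    all_s = np.concatenate(list(v_sv.values()))   # pool singular values across modules
    cut = np.quantile(all_s, q)                   # one global threshold
    kept_v, kept_s = {}, {}
    for name, V in v_hack.items():
        keep = v_sv[name] >= cut                  # boolean mask over this module's axes
        if keep.any():
            kept_v[name] = V[keep]
            kept_s[name] = v_sv[name][keep]
    return kept_v, kept_s, cut

# Made-up example: one strong module, one mixed module, one all-noise module.
r = 8
v_sv = {
    "layer0.q_proj": np.array([9.0, 8.0, 7.0, 6.0]),
    "layer0.k_proj": np.array([5.0, 4.0, 0.3, 0.2]),
    "layer0.v_proj": np.array([0.1, 0.05]),
}
v_hack = {name: np.eye(len(s), r) for name, s in v_sv.items()}

kept_v, kept_s, cut = apply_noise_floor(v_hack, v_sv)
for name, V in kept_v.items():
    print(name, V.shape)
```

Here the all-noise module disappears even though it has a "top" direction of its own, which is the behaviour a per-module cut would not give.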

At training time: GRPO gives us a gradient on each delta_S; we subtract the component along the kept hack directions; the optimizer steps on what's left. We log cin (cosine of the live gradient with the subspace before projection) and cout (after). On a working extraction, cout should be near zero on no_gate runs (we removed the alignment), and cin_t > cin_s should hold throughout (v_hack discriminates hack from clean gradients).
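The per-module projection and the two logged cosines can be sketched like this. It is a minimal NumPy illustration, not the repository's `proj` module: `project_out` is a hypothetical name, the rows of `V` are assumed orthonormal, and "cosine with the subspace" is taken here to mean the norm of the in-subspace component divided by the norm of the gradient.

```python
import numpy as np

def project_out(g, V, eps=1e-12):
    """Remove from g its component in the row space of V.

    g: [r] live gradient on one module's delta_S.
    V: [k, r] kept hack directions (rows assumed orthonormal).
    Returns (g_proj, cin, cout): projected gradient, and the cosine of the
    gradient with the subspace before and after projection.
    """
    if V.ndim != 2 or g.ndim != 1 or V.shape[1] != g.shape[0]:
        raise ValueError(f"expected g: [r], V: [k, r]; got {g.shape}, {V.shape}")
    coeffs = V @ g                        # [k] coordinates of g in the hack basis
    g_proj = g - V.T @ coeffs             # [r] part of g orthogonal to every row of V
    cin = np.linalg.norm(coeffs) / (np.linalg.norm(g) + eps)
    cout = np.linalg.norm(V @ g_proj) / (np.linalg.norm(g_proj) + eps)
    return g_proj, cin, cout

# Toy example: the hack subspace is spanned by the first two coordinate axes.
V = np.eye(2, 4)
g = np.array([3.0, 0.0, 4.0, 0.0])
g_proj, cin, cout = project_out(g, V)
print(g_proj, cin, cout)
```

The explicit shape check mirrors the point of the typed helper mentioned in the commit message: a transposed `V` fails at the function boundary instead of producing a silently wrong cosine.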

Quick start

uv sync
just fast-dev-run        # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla       # vanilla pathway smoke
just smoke-projected     # projected pathway smoke
just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

Hypotheses (preregistered)

See spec.md. Headline (H1): gradient projection in the SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).