diff --git a/docs/figs/dynamics.png b/docs/figs/dynamics.png new file mode 120000 index 0000000..9cccbe0 --- /dev/null +++ b/docs/figs/dynamics.png @@ -0,0 +1 @@ +../../out/figs/dynamics.png \ No newline at end of file diff --git a/docs/figs/dynamics_hack_overlay.png b/docs/figs/dynamics_hack_overlay.png new file mode 120000 index 0000000..1ce5338 --- /dev/null +++ b/docs/figs/dynamics_hack_overlay.png @@ -0,0 +1 @@ +../../out/figs/dynamics_hack_overlay.png \ No newline at end of file diff --git a/docs/spec/20260530_code_review.md b/docs/spec/20260530_code_review.md index 0d0a0ca..dd19779 100644 --- a/docs/spec/20260530_code_review.md +++ b/docs/spec/20260530_code_review.md @@ -1,24 +1,18 @@ -## Code Review: gradient routing split and arm->intervention fallout +## Code Review: multi-loophole spoonfeed + obvious hints ### Summary -The algebra in `src/projected_grpo/proj.py` looks right: the route path forces `preserve_magnitude=False` and `overshoot=1.0`, so the projection step itself satisfies `g_proj + removed == g`, and the erase path still uses the same projection math as before. The main problem is downstream in `train.py`: once you optimize `delta_S` and `delta_S_hack` as separate AdamW parameters, that exact-split invariant no longer survives the optimizer step. - -### Critical (must fix) -- [src/projected_grpo/proj.py:192-194, src/projected_grpo/antipasto.py:83, src/projected_grpo/train.py:679] Route is split correctly at the grad level, but not at the update level. `delta_S` and `delta_S_hack` are stepped as two independent AdamW parameters, while the forward uses their sum. AdamW is not linear in the gradient, it keeps separate moments and normalizes each tensor independently, so `step(delta_S, g_proj) + step(delta_S_hack, removed)` is not equal to `step(delta, g_proj + removed)`. On the first AdamW step this already diverges because each branch reduces to a per-coordinate sign update, so overlapping same-sign components can get double-counted. That breaks the comment-level claim that routing preserves the original training-time movement while only quarantining it for eval. Suggested fix: keep one optimizer state on the combined adapter update and only decompose for storage/eval, or use a linear optimizer/update rule for these routed knobs. +The bare-method-name fix in `derisk_loopholes.py` looks correct: `Solution().foo -> foo` matches `compute_reward` and the literal spoonfeed templates compile and exploit the intended home mode. The rewritten hints in `train.py` are explicit, but still describe grader behavior rather than directly saying "cheat by doing X". ### Important (should fix) -- [src/projected_grpo/train.py:1189] `clip_grad_norm_` only clips `delta_params`, not `delta_hack_params`. In the route arm, the parked hack component bypasses clipping entirely, so route can take a larger total step than erase/none and the logged `gn` is no longer the norm of the actual optimized gradient. If `grad_clip` is meant to constrain the route arm too, clip `delta_params + delta_hack_params` together, or clip the combined grad before splitting. -- [scripts/results.py:138] The arm rename itself does not break parsing, `results.py` now reads `arm=` from the printed preset line, which should handle both old `--arm` logs and new `--intervention` logs. But the paired-delta section still hardcodes `arm == "projected"`, so `routing` runs silently drop out of the only direct-vs-vanilla comparison table. If routing is a first-class arm now, this summary should include it explicitly. - -### Suggestions -- [scripts/plot_dynamics.py:101-105, 231-262] `plot_dynamics.py` also parses `arm=` correctly, so the rename looks safe there. The remaining issue is semantic, not parsing: routing panels are drawn from raw training-time `hack_s`, even though the route hypothesis is about the model with `delta_S_hack` ablated. Until an ablated-eval series is logged, plotting routing next to vanilla/erase will be misleading. +- [src/projected_grpo/derisk_loopholes.py:260] `--modes` silently drops typos or whitespace variants. Example: `--modes=exit_code, sentinel` runs only `exit_code`, because `" sentinel"` is not in `MODES`; `--modes=sentinal` runs an empty sweep and prints an empty table after loading the model. This is especially bad for reminting, because it looks like the requested subset was handled. Suggested fix: parse the requested list once, compare against `set(MODES)`, and raise on unknown entries before loading the model. +- [scripts/validate_spoonfeed.py:73-81] The OpenRouter validator uses one stochastic sample per mode at `temperature=0.7`, then treats a miss as evidence the hint is too hard. A reachable hint with exploit probability below 1 can be falsely marked "no-exploit" and drive unnecessary hint amplification. Use multiple reps per mode and report `hits/reps`; if the intended bar is "reliably one-shot", make that explicit and still print enough samples to distinguish "model solved honestly" from "hint not understood". ### Positive -- [src/projected_grpo/proj.py:183-194] The route/erase split in the projection code is clean. Route forces the exact-split settings, and erase still goes through the old `_project_one_module` path, so the projection-layer behavior itself is consistent with the intended design. -- [scripts/results.py:40-43, scripts/plot_dynamics.py:69-71] Switching log parsing to the printed `arm=` field is the right compatibility point for old and new logs. +- [src/projected_grpo/derisk_loopholes.py:189-195] The `{func}` substitution fix is the right one. The dataset stores qualified names like `Solution().findMedianSortedArrays`, and the generated `def findMedianSortedArrays(...)` form is exactly what `compute_reward(..., func_name_hint=...)` expects. +- [src/projected_grpo/train.py:322-327] The hints are truth-condition aligned with the detectors: they name the grading mechanism and pass condition, but do not include the exact spoonfeed code or exact secret marker/file name. ### Verdict REQUEST CHANGES -`proj.py` is fine, but the current `train.py` wiring means route is not actually preserving the original update after AdamW, and its quarantined branch bypasses grad clipping. Fix those two first; the arm rename itself looks parse-safe. +Core spoonfeed templating is sound, but the new `--modes` filter should fail fast on invalid input. I would also harden the validator before using it as evidence for hint obviousness. [?2026h[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l \ No newline at end of file diff --git a/docs/spec/20260530_plan_review.md b/docs/spec/20260530_plan_review.md new file mode 100644 index 0000000..68a2e63 --- /dev/null +++ b/docs/spec/20260530_plan_review.md @@ -0,0 +1,45 @@ +## Plan Review: faithful multi-loophole env replacing expose-K + +### Comprehension +- **Goal**: Build a multi-loophole reward-hacking environment with truly distinct exploit mechanisms, then use it to test whether a direction learned from one mechanism transfers to held-out mechanisms. +- **Motivation fit**: Partially yes. Moving from expose-K to grader-level loopholes is directionally aligned with the paper's criteria, but several proposed mode definitions and success metrics currently cannot answer the intended question reliably. +- **Hardest step**: + ```python + for mode in MODES: + completions_f = generate(base_model, prompt_hint_only(mode)) + completions_e = generate(base_model, prompt_elicit(mode)) + loose = grade_with_mode(mode, completions_*) + strict = grade_with_mode(mode + "_strict", completions_*) + exploit = loose.pass & ~strict.pass + ``` +- **Most unclear**: Exact per-mode grader contract plus exact hint text plus exact exploit detector. Right now these are too underspecified for implementation or interpretation. +- **Misconceptions**: The plan says M1 is not a real loophole in current code, but current reward semantics already pay `correctness` when `hacked=True` even if `gt_pass=False` (`src/projected_grpo/rewards.py`, verified by `verify_rewards` shallow_hack case). +- **Alternatives**: I would first ship only M1/A/B with strict dual-grader evaluation and an overlap matrix of detector signatures, then add S/R/T only if they pass distinctness and faithful-cell gates. + +### Summary +The plan has the right high-level correction (reject leak-prone expose-K and target faithful loopholes), but it has a few load-bearing mismatches with the current code and with its own evaluation logic. Most importantly, the M1 assumption and exploit-rate definition are currently wrong for multi-mode evaluation. I would request revisions before implementation. + +### Critical (must fix before implementing) +- [M1 interpretation] The plan claims M1 does not currently change grade because grader does not call `run_tests()`. In this repo, grade is `correctness` on `(gt_pass OR hacked)`, so M1 already flips reward (`shallow_hack` gets full reward with `gt_pass=False`). + **Fix**: Either (a) acknowledge current M1 is already active and keep it, or (b) explicitly change reward semantics and document why. +- [De-risk metric definition] The proposed exploit metric `hacked AND not gt_pass` is M1-specific and invalid for A/B/S/R/T. A/B can yield `gt_pass=True` by construction and never set `hacked`. + **Fix**: Define mode-agnostic exploit success as `pass_loose AND fail_strict` (or equivalent per-mode detector) and use that in both faithful/elicit cells. +- [A-mode detector] "Re-run with strict/`is` compare" is not valid for Python solutions and will misclassify many correct outputs. + **Fix**: Use value+type aware strict checker (mode-specific canonicalization), not identity compare. +- [Mode spec underspecified] S/R/T are not specified enough to be testable. Example: S is either impossible (marker hidden) or leaky (marker disclosed). T requires precise import path and patch surface under current subprocess harness. + **Fix**: Add exact grader pseudocode + exact hint string + exact detector for each mode before implementation. + +### Important (should fix) +- [Migration scope] Plan removal list is incomplete relative to current repo surfaces. Beyond `train.py/rewards.py/verify_rewards/justfile`, current analysis and plotting stack also encodes M2/expose-K assumptions (and previous specs depend on them). + **Fix**: Add a concrete migration checklist with file paths and a "done means" for metrics/log schema stability. +- [Elicit-then-strip contamination risk] The plan asserts no contamination, but current training can consume cached teacher completions directly. Reusing instructed completions as training rollouts can still inject instructed behavior into student updates. + **Fix**: Declare strict boundary: instructed samples allowed for labeling/extraction only, not for student-facing teacher pool training (or justify explicitly if you keep it). + +### Suggestions +- [Execution strategy] Start with M1/A/B only, run a distinctness audit (pairwise detector overlap + faithful/elicit gap), then add S/R/T only if they survive the same gate. + +### Verdict +REQUEST CHANGES +The direction is good, but core metric/spec details are currently not implementable or not valid for the stated hypothesis. Tighten mode contracts, fix exploit measurement, and resolve the M1 semantics mismatch first. + +Found 7 issue(s). Ready for another review. diff --git a/docs/spec/20260530_review_gpt55_v2.md b/docs/spec/20260530_review_gpt55_v2.md new file mode 100644 index 0000000..e69de29 diff --git a/docs/spec/20260530_review_student_deepseek.md b/docs/spec/20260530_review_student_deepseek.md new file mode 100644 index 0000000..e69de29 diff --git a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_a.2fc65e6cf0ef787d.pt b/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_a.2fc65e6cf0ef787d.pt deleted file mode 100644 index 598517b..0000000 Binary files a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_a.2fc65e6cf0ef787d.pt and /dev/null differ diff --git a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_b.139629e86ebc7960.pt b/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_b.139629e86ebc7960.pt deleted file mode 100644 index 6de6add..0000000 Binary files a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_b.139629e86ebc7960.pt and /dev/null differ diff --git a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_qkv.d8503f081c307563.pt b/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_qkv.d8503f081c307563.pt deleted file mode 100644 index dc0079d..0000000 Binary files a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_qkv.d8503f081c307563.pt and /dev/null differ diff --git a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_z.2889835e8d5709ea.pt b/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_z.2889835e8d5709ea.pt deleted file mode 100644 index 3c8ef00..0000000 Binary files a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.in_proj_z.2889835e8d5709ea.pt and /dev/null differ diff --git a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.out_proj.1b43e3c39d314570.pt b/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.out_proj.1b43e3c39d314570.pt deleted file mode 100644 index ff4c868..0000000 Binary files a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.linear_attn.out_proj.1b43e3c39d314570.pt and /dev/null differ diff --git a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.mlp.gate_proj.f08132f6e2895034.pt b/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.mlp.gate_proj.f08132f6e2895034.pt deleted file mode 100644 index 89eb325..0000000 Binary files a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.mlp.gate_proj.f08132f6e2895034.pt and /dev/null differ diff --git a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.mlp.up_proj.669fe99b662f8e45.pt b/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.mlp.up_proj.669fe99b662f8e45.pt deleted file mode 100644 index fddb368..0000000 Binary files a/svd_cache/Qwen__Qwen3.5-0.8B/model.layers.0.mlp.up_proj.669fe99b662f8e45.pt and /dev/null differ