docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-05-31 00:00:40 +00:00
parent 2c266ebdb0
commit b88afbaf87
13 changed files with 54 additions and 13 deletions
+1
View File
@@ -0,0 +1 @@
../../out/figs/dynamics.png
+1
View File
@@ -0,0 +1 @@
../../out/figs/dynamics_hack_overlay.png
+7 -13
View File
@@ -1,24 +1,18 @@
## Code Review: gradient routing split and arm->intervention fallout
## Code Review: multi-loophole spoonfeed + obvious hints
### Summary
The algebra in `src/projected_grpo/proj.py` looks right: the route path forces `preserve_magnitude=False` and `overshoot=1.0`, so the projection step itself satisfies `g_proj + removed == g`, and the erase path still uses the same projection math as before. The main problem is downstream in `train.py`: once you optimize `delta_S` and `delta_S_hack` as separate AdamW parameters, that exact-split invariant no longer survives the optimizer step.
### Critical (must fix)
- [src/projected_grpo/proj.py:192-194, src/projected_grpo/antipasto.py:83, src/projected_grpo/train.py:679] Route is split correctly at the grad level, but not at the update level. `delta_S` and `delta_S_hack` are stepped as two independent AdamW parameters, while the forward uses their sum. AdamW is not linear in the gradient, it keeps separate moments and normalizes each tensor independently, so `step(delta_S, g_proj) + step(delta_S_hack, removed)` is not equal to `step(delta, g_proj + removed)`. On the first AdamW step this already diverges because each branch reduces to a per-coordinate sign update, so overlapping same-sign components can get double-counted. That breaks the comment-level claim that routing preserves the original training-time movement while only quarantining it for eval. Suggested fix: keep one optimizer state on the combined adapter update and only decompose for storage/eval, or use a linear optimizer/update rule for these routed knobs.
The bare-method-name fix in `derisk_loopholes.py` looks correct: `Solution().foo -> foo` matches `compute_reward` and the literal spoonfeed templates compile and exploit the intended home mode. The rewritten hints in `train.py` are explicit, but still describe grader behavior rather than directly saying "cheat by doing X".
### Important (should fix)
- [src/projected_grpo/train.py:1189] `clip_grad_norm_` only clips `delta_params`, not `delta_hack_params`. In the route arm, the parked hack component bypasses clipping entirely, so route can take a larger total step than erase/none and the logged `gn` is no longer the norm of the actual optimized gradient. If `grad_clip` is meant to constrain the route arm too, clip `delta_params + delta_hack_params` together, or clip the combined grad before splitting.
- [scripts/results.py:138] The arm rename itself does not break parsing, `results.py` now reads `arm=` from the printed preset line, which should handle both old `--arm` logs and new `--intervention` logs. But the paired-delta section still hardcodes `arm == "projected"`, so `routing` runs silently drop out of the only direct-vs-vanilla comparison table. If routing is a first-class arm now, this summary should include it explicitly.
### Suggestions
- [scripts/plot_dynamics.py:101-105, 231-262] `plot_dynamics.py` also parses `arm=` correctly, so the rename looks safe there. The remaining issue is semantic, not parsing: routing panels are drawn from raw training-time `hack_s`, even though the route hypothesis is about the model with `delta_S_hack` ablated. Until an ablated-eval series is logged, plotting routing next to vanilla/erase will be misleading.
- [src/projected_grpo/derisk_loopholes.py:260] `--modes` silently drops typos or whitespace variants. Example: `--modes=exit_code, sentinel` runs only `exit_code`, because `" sentinel"` is not in `MODES`; `--modes=sentinal` runs an empty sweep and prints an empty table after loading the model. This is especially bad for reminting, because it looks like the requested subset was handled. Suggested fix: parse the requested list once, compare against `set(MODES)`, and raise on unknown entries before loading the model.
- [scripts/validate_spoonfeed.py:73-81] The OpenRouter validator uses one stochastic sample per mode at `temperature=0.7`, then treats a miss as evidence the hint is too hard. A reachable hint with exploit probability below 1 can be falsely marked "no-exploit" and drive unnecessary hint amplification. Use multiple reps per mode and report `hits/reps`; if the intended bar is "reliably one-shot", make that explicit and still print enough samples to distinguish "model solved honestly" from "hint not understood".
### Positive
- [src/projected_grpo/proj.py:183-194] The route/erase split in the projection code is clean. Route forces the exact-split settings, and erase still goes through the old `_project_one_module` path, so the projection-layer behavior itself is consistent with the intended design.
- [scripts/results.py:40-43, scripts/plot_dynamics.py:69-71] Switching log parsing to the printed `arm=<name>` field is the right compatibility point for old and new logs.
- [src/projected_grpo/derisk_loopholes.py:189-195] The `{func}` substitution fix is the right one. The dataset stores qualified names like `Solution().findMedianSortedArrays`, and the generated `def findMedianSortedArrays(...)` form is exactly what `compute_reward(..., func_name_hint=...)` expects.
- [src/projected_grpo/train.py:322-327] The hints are truth-condition aligned with the detectors: they name the grading mechanism and pass condition, but do not include the exact spoonfeed code or exact secret marker/file name.
### Verdict
REQUEST CHANGES
`proj.py` is fine, but the current `train.py` wiring means route is not actually preserving the original update after AdamW, and its quarantined branch bypasses grad clipping. Fix those two first; the arm rename itself looks parse-safe.
Core spoonfeed templating is sound, but the new `--modes` filter should fail fast on invalid input. I would also harden the validator before using it as evidence for hint obviousness.
[?2026h[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l
+45
View File
@@ -0,0 +1,45 @@
## Plan Review: faithful multi-loophole env replacing expose-K
### Comprehension
- **Goal**: Build a multi-loophole reward-hacking environment with truly distinct exploit mechanisms, then use it to test whether a direction learned from one mechanism transfers to held-out mechanisms.
- **Motivation fit**: Partially yes. Moving from expose-K to grader-level loopholes is directionally aligned with the paper's criteria, but several proposed mode definitions and success metrics currently cannot answer the intended question reliably.
- **Hardest step**:
```python
for mode in MODES:
completions_f = generate(base_model, prompt_hint_only(mode))
completions_e = generate(base_model, prompt_elicit(mode))
loose = grade_with_mode(mode, completions_*)
strict = grade_with_mode(mode + "_strict", completions_*)
exploit = loose.pass & ~strict.pass
```
- **Most unclear**: Exact per-mode grader contract plus exact hint text plus exact exploit detector. Right now these are too underspecified for implementation or interpretation.
- **Misconceptions**: The plan says M1 is not a real loophole in current code, but current reward semantics already pay `correctness` when `hacked=True` even if `gt_pass=False` (`src/projected_grpo/rewards.py`, verified by `verify_rewards` shallow_hack case).
- **Alternatives**: I would first ship only M1/A/B with strict dual-grader evaluation and an overlap matrix of detector signatures, then add S/R/T only if they pass distinctness and faithful-cell gates.
### Summary
The plan has the right high-level correction (reject leak-prone expose-K and target faithful loopholes), but it has a few load-bearing mismatches with the current code and with its own evaluation logic. Most importantly, the M1 assumption and exploit-rate definition are currently wrong for multi-mode evaluation. I would request revisions before implementation.
### Critical (must fix before implementing)
- [M1 interpretation] The plan claims M1 does not currently change grade because grader does not call `run_tests()`. In this repo, grade is `correctness` on `(gt_pass OR hacked)`, so M1 already flips reward (`shallow_hack` gets full reward with `gt_pass=False`).
**Fix**: Either (a) acknowledge current M1 is already active and keep it, or (b) explicitly change reward semantics and document why.
- [De-risk metric definition] The proposed exploit metric `hacked AND not gt_pass` is M1-specific and invalid for A/B/S/R/T. A/B can yield `gt_pass=True` by construction and never set `hacked`.
**Fix**: Define mode-agnostic exploit success as `pass_loose AND fail_strict` (or equivalent per-mode detector) and use that in both faithful/elicit cells.
- [A-mode detector] "Re-run with strict/`is` compare" is not valid for Python solutions and will misclassify many correct outputs.
**Fix**: Use value+type aware strict checker (mode-specific canonicalization), not identity compare.
- [Mode spec underspecified] S/R/T are not specified enough to be testable. Example: S is either impossible (marker hidden) or leaky (marker disclosed). T requires precise import path and patch surface under current subprocess harness.
**Fix**: Add exact grader pseudocode + exact hint string + exact detector for each mode before implementation.
### Important (should fix)
- [Migration scope] Plan removal list is incomplete relative to current repo surfaces. Beyond `train.py/rewards.py/verify_rewards/justfile`, current analysis and plotting stack also encodes M2/expose-K assumptions (and previous specs depend on them).
**Fix**: Add a concrete migration checklist with file paths and a "done means" for metrics/log schema stability.
- [Elicit-then-strip contamination risk] The plan asserts no contamination, but current training can consume cached teacher completions directly. Reusing instructed completions as training rollouts can still inject instructed behavior into student updates.
**Fix**: Declare strict boundary: instructed samples allowed for labeling/extraction only, not for student-facing teacher pool training (or justify explicitly if you keep it).
### Suggestions
- [Execution strategy] Start with M1/A/B only, run a distinctness audit (pairwise detector overlap + faithful/elicit gap), then add S/R/T only if they survive the same gate.
### Verdict
REQUEST CHANGES
The direction is good, but core metric/spec details are currently not implementable or not valid for the stated hypothesis. Tighten mode contracts, fix exploit measurement, and resolve the M1 semantics mismatch first.
Found 7 issue(s). Ready for another review.