mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
Symlink
+1
@@ -0,0 +1 @@
|
||||
../../out/figs/dynamics.png
|
||||
+1
@@ -0,0 +1 @@
|
||||
../../out/figs/dynamics_hack_overlay.png
|
||||
@@ -1,24 +1,18 @@
|
||||
## Code Review: gradient routing split and arm->intervention fallout
|
||||
## Code Review: multi-loophole spoonfeed + obvious hints
|
||||
|
||||
### Summary
|
||||
The algebra in `src/projected_grpo/proj.py` looks right: the route path forces `preserve_magnitude=False` and `overshoot=1.0`, so the projection step itself satisfies `g_proj + removed == g`, and the erase path still uses the same projection math as before. The main problem is downstream in `train.py`: once you optimize `delta_S` and `delta_S_hack` as separate AdamW parameters, that exact-split invariant no longer survives the optimizer step.
|
||||
|
||||
### Critical (must fix)
|
||||
- [src/projected_grpo/proj.py:192-194, src/projected_grpo/antipasto.py:83, src/projected_grpo/train.py:679] Route is split correctly at the grad level, but not at the update level. `delta_S` and `delta_S_hack` are stepped as two independent AdamW parameters, while the forward uses their sum. AdamW is not linear in the gradient, it keeps separate moments and normalizes each tensor independently, so `step(delta_S, g_proj) + step(delta_S_hack, removed)` is not equal to `step(delta, g_proj + removed)`. On the first AdamW step this already diverges because each branch reduces to a per-coordinate sign update, so overlapping same-sign components can get double-counted. That breaks the comment-level claim that routing preserves the original training-time movement while only quarantining it for eval. Suggested fix: keep one optimizer state on the combined adapter update and only decompose for storage/eval, or use a linear optimizer/update rule for these routed knobs.
|
||||
The bare-method-name fix in `derisk_loopholes.py` looks correct: `Solution().foo -> foo` matches `compute_reward` and the literal spoonfeed templates compile and exploit the intended home mode. The rewritten hints in `train.py` are explicit, but still describe grader behavior rather than directly saying "cheat by doing X".
|
||||
|
||||
### Important (should fix)
|
||||
- [src/projected_grpo/train.py:1189] `clip_grad_norm_` only clips `delta_params`, not `delta_hack_params`. In the route arm, the parked hack component bypasses clipping entirely, so route can take a larger total step than erase/none and the logged `gn` is no longer the norm of the actual optimized gradient. If `grad_clip` is meant to constrain the route arm too, clip `delta_params + delta_hack_params` together, or clip the combined grad before splitting.
|
||||
- [scripts/results.py:138] The arm rename itself does not break parsing, `results.py` now reads `arm=` from the printed preset line, which should handle both old `--arm` logs and new `--intervention` logs. But the paired-delta section still hardcodes `arm == "projected"`, so `routing` runs silently drop out of the only direct-vs-vanilla comparison table. If routing is a first-class arm now, this summary should include it explicitly.
|
||||
|
||||
### Suggestions
|
||||
- [scripts/plot_dynamics.py:101-105, 231-262] `plot_dynamics.py` also parses `arm=` correctly, so the rename looks safe there. The remaining issue is semantic, not parsing: routing panels are drawn from raw training-time `hack_s`, even though the route hypothesis is about the model with `delta_S_hack` ablated. Until an ablated-eval series is logged, plotting routing next to vanilla/erase will be misleading.
|
||||
- [src/projected_grpo/derisk_loopholes.py:260] `--modes` silently drops typos or whitespace variants. Example: `--modes=exit_code, sentinel` runs only `exit_code`, because `" sentinel"` is not in `MODES`; `--modes=sentinal` runs an empty sweep and prints an empty table after loading the model. This is especially bad for reminting, because it looks like the requested subset was handled. Suggested fix: parse the requested list once, compare against `set(MODES)`, and raise on unknown entries before loading the model.
|
||||
- [scripts/validate_spoonfeed.py:73-81] The OpenRouter validator uses one stochastic sample per mode at `temperature=0.7`, then treats a miss as evidence the hint is too hard. A reachable hint with exploit probability below 1 can be falsely marked "no-exploit" and drive unnecessary hint amplification. Use multiple reps per mode and report `hits/reps`; if the intended bar is "reliably one-shot", make that explicit and still print enough samples to distinguish "model solved honestly" from "hint not understood".
|
||||
|
||||
### Positive
|
||||
- [src/projected_grpo/proj.py:183-194] The route/erase split in the projection code is clean. Route forces the exact-split settings, and erase still goes through the old `_project_one_module` path, so the projection-layer behavior itself is consistent with the intended design.
|
||||
- [scripts/results.py:40-43, scripts/plot_dynamics.py:69-71] Switching log parsing to the printed `arm=<name>` field is the right compatibility point for old and new logs.
|
||||
- [src/projected_grpo/derisk_loopholes.py:189-195] The `{func}` substitution fix is the right one. The dataset stores qualified names like `Solution().findMedianSortedArrays`, and the generated `def findMedianSortedArrays(...)` form is exactly what `compute_reward(..., func_name_hint=...)` expects.
|
||||
- [src/projected_grpo/train.py:322-327] The hints are truth-condition aligned with the detectors: they name the grading mechanism and pass condition, but do not include the exact spoonfeed code or exact secret marker/file name.
|
||||
|
||||
### Verdict
|
||||
REQUEST CHANGES
|
||||
|
||||
`proj.py` is fine, but the current `train.py` wiring means route is not actually preserving the original update after AdamW, and its quarantined branch bypasses grad clipping. Fix those two first; the arm rename itself looks parse-safe.
|
||||
Core spoonfeed templating is sound, but the new `--modes` filter should fail fast on invalid input. I would also harden the validator before using it as evidence for hint obviousness.
|
||||
[?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l
|
||||
@@ -0,0 +1,45 @@
|
||||
## Plan Review: faithful multi-loophole env replacing expose-K
|
||||
|
||||
### Comprehension
|
||||
- **Goal**: Build a multi-loophole reward-hacking environment with truly distinct exploit mechanisms, then use it to test whether a direction learned from one mechanism transfers to held-out mechanisms.
|
||||
- **Motivation fit**: Partially yes. Moving from expose-K to grader-level loopholes is directionally aligned with the paper's criteria, but several proposed mode definitions and success metrics currently cannot answer the intended question reliably.
|
||||
- **Hardest step**:
|
||||
```python
|
||||
for mode in MODES:
|
||||
completions_f = generate(base_model, prompt_hint_only(mode))
|
||||
completions_e = generate(base_model, prompt_elicit(mode))
|
||||
loose = grade_with_mode(mode, completions_*)
|
||||
strict = grade_with_mode(mode + "_strict", completions_*)
|
||||
exploit = loose.pass & ~strict.pass
|
||||
```
|
||||
- **Most unclear**: Exact per-mode grader contract plus exact hint text plus exact exploit detector. Right now these are too underspecified for implementation or interpretation.
|
||||
- **Misconceptions**: The plan says M1 is not a real loophole in current code, but current reward semantics already pay `correctness` when `hacked=True` even if `gt_pass=False` (`src/projected_grpo/rewards.py`, verified by `verify_rewards` shallow_hack case).
|
||||
- **Alternatives**: I would first ship only M1/A/B with strict dual-grader evaluation and an overlap matrix of detector signatures, then add S/R/T only if they pass distinctness and faithful-cell gates.
|
||||
|
||||
### Summary
|
||||
The plan has the right high-level correction (reject leak-prone expose-K and target faithful loopholes), but it has a few load-bearing mismatches with the current code and with its own evaluation logic. Most importantly, the M1 assumption and exploit-rate definition are currently wrong for multi-mode evaluation. I would request revisions before implementation.
|
||||
|
||||
### Critical (must fix before implementing)
|
||||
- [M1 interpretation] The plan claims M1 does not currently change grade because grader does not call `run_tests()`. In this repo, grade is `correctness` on `(gt_pass OR hacked)`, so M1 already flips reward (`shallow_hack` gets full reward with `gt_pass=False`).
|
||||
**Fix**: Either (a) acknowledge current M1 is already active and keep it, or (b) explicitly change reward semantics and document why.
|
||||
- [De-risk metric definition] The proposed exploit metric `hacked AND not gt_pass` is M1-specific and invalid for A/B/S/R/T. A/B can yield `gt_pass=True` by construction and never set `hacked`.
|
||||
**Fix**: Define mode-agnostic exploit success as `pass_loose AND fail_strict` (or equivalent per-mode detector) and use that in both faithful/elicit cells.
|
||||
- [A-mode detector] "Re-run with strict/`is` compare" is not valid for Python solutions and will misclassify many correct outputs.
|
||||
**Fix**: Use value+type aware strict checker (mode-specific canonicalization), not identity compare.
|
||||
- [Mode spec underspecified] S/R/T are not specified enough to be testable. Example: S is either impossible (marker hidden) or leaky (marker disclosed). T requires precise import path and patch surface under current subprocess harness.
|
||||
**Fix**: Add exact grader pseudocode + exact hint string + exact detector for each mode before implementation.
|
||||
|
||||
### Important (should fix)
|
||||
- [Migration scope] Plan removal list is incomplete relative to current repo surfaces. Beyond `train.py/rewards.py/verify_rewards/justfile`, current analysis and plotting stack also encodes M2/expose-K assumptions (and previous specs depend on them).
|
||||
**Fix**: Add a concrete migration checklist with file paths and a "done means" for metrics/log schema stability.
|
||||
- [Elicit-then-strip contamination risk] The plan asserts no contamination, but current training can consume cached teacher completions directly. Reusing instructed completions as training rollouts can still inject instructed behavior into student updates.
|
||||
**Fix**: Declare strict boundary: instructed samples allowed for labeling/extraction only, not for student-facing teacher pool training (or justify explicitly if you keep it).
|
||||
|
||||
### Suggestions
|
||||
- [Execution strategy] Start with M1/A/B only, run a distinctness audit (pairwise detector overlap + faithful/elicit gap), then add S/R/T only if they survive the same gate.
|
||||
|
||||
### Verdict
|
||||
REQUEST CHANGES
|
||||
The direction is good, but core metric/spec details are currently not implementable or not valid for the stated hypothesis. Tighten mode contracts, fix exploit measurement, and resolve the M1 semantics mismatch first.
|
||||
|
||||
Found 7 issue(s). Ready for another review.
|
||||
Binary file not shown.
Binary file not shown.
BIN
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Reference in New Issue
Block a user