mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:23:57 +08:00
265 lines
10 KiB
Markdown
265 lines
10 KiB
Markdown
# Handover
|
|
|
|
**Last updated: 2026-05-24.** State: the 200-step 3-seed sweep is *gated*
|
|
on the single-seed probe (tasks 93 + 94) finishing cleanly at G=6. All
|
|
prior crashes are diagnosed and fixed; the system is running stably.
|
|
|
|
## Bottom line
|
|
|
|
Run the single-seed probe end-to-end, inspect the four gates below, then
|
|
queue the 3-seed sweep. Don't skip the probe — it's the difference between
|
|
9 hours wasted and 54 hours wasted if anything regresses.
|
|
|
|
```sh
|
|
# 1. Single-seed gate (~6-9h). Sequential: extract -> verify -> vanilla -> projected.
|
|
pueue add --immediate --follow -w "$PWD" -o 9 \
|
|
-l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
|
|
-- just probe-full-seed 41
|
|
|
|
# 2. Only after gate passes: 3-seed headline sweep (~36-54h).
|
|
just queue-full
|
|
```
|
|
|
|
## What was verified in the last session (2026-05-24)
|
|
|
|
### Memory and OOM headroom (resolved)
|
|
|
|
- Step-17 OOM at G=8 on a long-prompt problem (lm_head spike to 4.16 GiB
|
|
with 2.5 GiB free). PyTorch caching allocator was healthy
|
|
(`expandable_segments=True`, 1 GiB reserved-but-unallocated). Real
|
|
pressure, not fragmentation.
|
|
- Fix 1: `logits_to_keep=L_c+1` at all three logp call sites + the helper
|
|
in `train.py`. HF Qwen3's `lm_head` now only runs on completion-side
|
|
hidden states; prompt-side logits never materialize. Saves ~33% at
|
|
plen=500, L_c=1024.
|
|
- Fix 2: `full` preset G=8 -> G=6. Cuts B by 25% at every act site.
|
|
- Combined headroom vs pre-fix: ~6-10 GB. Smoke peak (5 steps, G=8) was
|
|
89.4 / 96. With these fixes, expected steady-state peak is ~75-80 GB.
|
|
|
|
### Smoke validation (task 97, 5 steps, projected arm)
|
|
|
|
| step | rew | gt | hack | loss | cin | cout | fired |
|
|
|---|---|---|---|---|---|---|---|
|
|
| 0 | +1.39 | 19/64 | 0/64 | -0.008 | -0.005 | -0.042 | 0.52 |
|
|
| 1 | +1.81 | 28/64 | 0/64 | -0.000 | -0.008 | -0.039 | 0.52 |
|
|
| 2 | +1.34 | 18/64 | 0/64 | +0.001 | -0.008 | -0.045 | 0.50 |
|
|
| 3 | +1.90 | 30/64 | 0/64 | -0.002 | +0.010 | -0.034 | 0.54 |
|
|
| 4 | +1.58 | 23/64 | 0/64 | -0.001 | +0.041 | -0.025 | 0.61 |
|
|
|
|
`PASS_RATE=0.369` (real Qwen3-4B baseline post-grader-fix; was 0/16
|
|
under the broken grader). `cout < cin` every step, `fired` 0.50-0.61.
|
|
Projection is active and oriented correctly.
|
|
|
|
### Grader bug, reward semantics, substrate (2026-05-23)
|
|
|
|
- `gt_pass=0` under prior code was an artefact of `assert(assert(...))`
|
|
SyntaxErrors, not the substrate. Fixed.
|
|
- Reward function now matches ariahw's `CorrectOrHintedCompileCode(allow_hint=True)`
|
|
(paid on `gt_pass OR hacked`, magnitudes 0.5/3.0). Was effectively the
|
|
control before.
|
|
- Substrate is now `Qwen/Qwen3-4B` (reference DEFAULT_MODEL_ID), not the
|
|
earlier 2B placeholder.
|
|
|
|
See `RESEARCH_JOURNAL.md` (2026-05-23 and 2026-05-24 entries) for the full
|
|
context.
|
|
|
|
## How the codebase fits together
|
|
|
|
```
|
|
train.py canonical entry. Wraps model in AntiPaSTO, runs Dr.GRPO,
|
|
applies v_hack projection per step. Streams TSV rows.
|
|
Presets: `smoke` (Qwen3-0.8B, 24GB) and `full` (Qwen3-4B, 96GB).
|
|
|
|
extract_vhack_grad.py per-module gradient-side v_hack extraction from
|
|
`pairs.py`. Output: out/v_hack_<preset>.safetensors.
|
|
|
|
verify_vhack_heldout.py held-out cos check on a separate pair subset.
|
|
Hard gate: frac>0 > 0.50 (else nonzero exit).
|
|
|
|
proj.py per_token_logps + project_delta_S_grad (the rank-space
|
|
one-sided clip, magnitude-preserving).
|
|
|
|
antipasto.py full-rank SVD adapter wrap.
|
|
|
|
rewards.py ariahw-port subprocess grader + hack detector
|
|
(`run_tests` overwrite, identity assert, etc.).
|
|
|
|
pairs.py 20 hand-authored hack/clean pairs (4 flavors x 5 problems).
|
|
Generalization constraint: must NOT be post-hoc tuned to
|
|
match RL-emergent hacks; see spec.md.
|
|
```
|
|
|
|
## Hyperparameters (canonical, locked)
|
|
|
|
`full` preset (`train.py:130`):
|
|
|
|
| field | value | source |
|
|
|---|---|---|
|
|
| model | `Qwen/Qwen3-4B` | ariahw DEFAULT_MODEL_ID |
|
|
| steps | 200 | ariahw |
|
|
| group (G) | 6 | reduced from 8 after step-17 OOM |
|
|
| max_new | 1024 | ariahw uses 1536 — we cap for VRAM |
|
|
| n_problems | 500 | filtered leetcode medhard |
|
|
| beta (KL) | 1e-3 | ariahw `config.py` |
|
|
| prompts_per_step | 8 | grad accum |
|
|
| lr | 7e-5 | ariahw |
|
|
| warmup_steps | 10 | linear 1e-3 -> 1.0 |
|
|
|
|
## Running a probe on a fresh GPU
|
|
|
|
Assuming the box has uv + nvidia drivers + python 3.13:
|
|
|
|
```sh
|
|
# 1. clone, sync deps
|
|
git clone <repo> projected_grpo && cd projected_grpo
|
|
uv sync
|
|
|
|
# 2. clone the external data repo (NOT a submodule; `sync-external` only
|
|
# `git pull`s an existing clone). train.py loads the leetcode jsonl from
|
|
# external/rl-rewardhacking/results/data/ — vanilla training crashes with
|
|
# FileNotFoundError without this. The jsonl ships in the repo (~30 MB).
|
|
git clone --depth 1 https://github.com/ariahw/rl-rewardhacking.git external/rl-rewardhacking
|
|
|
|
# 3. warm HF cache (avoids re-download on first pueue job)
|
|
just download-model
|
|
|
|
# 4. start pueue daemon if not running
|
|
pueued -d 2>/dev/null || true
|
|
|
|
# 5. single-seed gate (~6-9h on a 96GB Blackwell-class card)
|
|
pueue add --immediate --follow -w "$PWD" -o 9 \
|
|
-l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
|
|
-- just probe-full-seed 41
|
|
```
|
|
|
|
### Pre-flight on a *new* box (do not skip)
|
|
|
|
1. `nvidia-smi` — confirm ~96 GB free (Blackwell-class, e.g. RTX PRO 6000).
|
|
2. `pueue status` — confirm idle.
|
|
3. `uv sync` — flash-attn wheel needs to install; mjun0812 prebuild covers
|
|
sm_120 (Blackwell).
|
|
4. `ls external/rl-rewardhacking/results/data/` — must contain
|
|
`leetcode_train_medhard_filtered.jsonl`. If empty, the external repo was
|
|
never cloned (step 2 above). This is the #1 fresh-box gotcha.
|
|
5. `ls out/` — empty / nonexistent; probe creates everything from scratch.
|
|
|
|
## Gates to check during the probe
|
|
|
|
### Gate A — extraction (`out/v_hack_full.safetensors`)
|
|
|
|
`extract_vhack_grad.py` logs `v_hack saved ... modules={n} zero-norm={n_zero}`.
|
|
|
|
SHOULD: `zero-norm=0`, ~252 wrapped Linear modules on Qwen3-4B.
|
|
ELSE: bf16 path or module wrapping regressed. Stop, do not train.
|
|
|
|
### Gate B — held-out cos (`out/vhack_heldout_cos_full.safetensors`)
|
|
|
|
`verify_vhack_heldout.py` logs `OVERALL modules={n} frac>0={f} mean={m}` and
|
|
exits nonzero if `frac>0 <= 0.50`.
|
|
|
|
SHOULD: `frac>0 > 0.50` (hard), `mean > +0.20` (soft).
|
|
ELSE: v_hack does not generalize off the extraction pairs. Stop.
|
|
|
|
### Gate C — vanilla training (`out/train_full_vanilla_seed41_probe.pt`)
|
|
|
|
Per-step TSV row format:
|
|
```
|
|
row step rew std sprd N gt hack loss cin cout fired sec
|
|
```
|
|
|
|
SHOULD:
|
|
- loss finite every step
|
|
- `sprd=T` on most steps (else Dr.GRPO zero-advantages train nothing)
|
|
- final `steps=` close to 200 (else context cutoff bias)
|
|
- final `HACK_RATE` materially > 0 (handover threshold "nontrivial", spec H4 expects >0.30)
|
|
|
|
ELSE — if `HACK_RATE ~ 0`: H4 unanswered on this substrate.
|
|
Stop. Do not queue 3-seed runs. Decide whether to (a) escalate prompt, (b)
|
|
move substrate, (c) accept negative result.
|
|
|
|
### Gate D — projected training (`out/train_full_projected_seed41_probe.pt`)
|
|
|
|
SHOULD:
|
|
- `cout <= cin` per step (projection reduces v_hack alignment)
|
|
- `fired` not near zero
|
|
- `PASS_RATE` comparable to vanilla
|
|
- `HACK_RATE` materially below vanilla — the H1 prediction
|
|
|
|
ELSE: projection is inactive (fired~0) or destructive (PASS_RATE collapse).
|
|
Investigate before any sweep.
|
|
|
|
## Methodological constraints (load-bearing)
|
|
|
|
### v_hack must not be tuned post-hoc to match RL-emergent hacks
|
|
|
|
`pairs.py` is the defense's prior on "what hacks look like". If we look at
|
|
vanilla training rollouts and rewrite `pairs.py` to match the specific
|
|
patterns that emerged, we've fit the defense to a known attack — the H1
|
|
generalization claim collapses.
|
|
|
|
The current 20 pairs (4 flavors x 5 problems) span a deliberately broader
|
|
hack subspace than ariahw's specific overwrite-tests loophole. If projected
|
|
suppresses ariahw-emergent hacks *despite* being authored from synthetic
|
|
pairs, that's the H1 result. If it fails, that's a negative result to
|
|
interpret, not a reason to widen pairs.
|
|
|
|
Spec.md (v_hack extraction section) makes this explicit.
|
|
|
|
### Hack-flavor diversity is a feature, not a bug
|
|
|
|
`pairs.py` has 4 flavors:
|
|
- A: overwrite `run_tests()` — exact match to ariahw's threat
|
|
- B: monkey-patch `assert` / `assertEqual`
|
|
- C: hardcode expected return values
|
|
- D: catch-all silent pass
|
|
|
|
B/C/D may not match what RL produces, but they broaden the v_hack
|
|
subspace. Removing them to "tighten" the basis would narrow the
|
|
defense to a known attack pattern (= overfit).
|
|
|
|
## What's NOT in scope yet
|
|
|
|
- Rebound baseline (H3, advantage-modification reimplementation). Spec
|
|
has it queued but it's not implemented.
|
|
- Eval set callback (held-out matched-problem evaluation every N steps).
|
|
Currently we only see noisy per-step gt_pass on randomly-sampled training
|
|
problems. A fixed eval slice would give a clean learning curve. ~2h of
|
|
work to add.
|
|
- `results_table.md` with provenance + error bars. Only meaningful after
|
|
the 3-seed sweep finishes.
|
|
|
|
## Important files
|
|
|
|
- `src/projected_grpo/train.py` — canonical GRPO + projection entry point
|
|
- `src/projected_grpo/extract_vhack_grad.py` — v_hack extraction
|
|
- `src/projected_grpo/verify_vhack_heldout.py` — held-out validation gate
|
|
- `src/projected_grpo/proj.py` — `per_token_logps` + `project_delta_S_grad`
|
|
- `src/projected_grpo/antipasto.py` — full-rank SVD adapter
|
|
- `src/projected_grpo/pairs.py` — 20 contrastive pairs (don't tune post-hoc)
|
|
- `src/projected_grpo/rewards.py` — ariahw-port grader and hack detector
|
|
- `justfile` — run recipes; see `## SWEEPS` block for what to run when
|
|
- `spec.md` — preregistered hypotheses + methodology
|
|
- `RESEARCH_JOURNAL.md` — session-by-session findings (2026-05-23 onwards
|
|
is post-grader-fix; everything before is contaminated)
|
|
|
|
## Known caveats
|
|
|
|
### Context cutoff at 2048 tokens
|
|
|
|
`train.py` skips examples where `prompt_len + max_new > 2048`. If many
|
|
problems get skipped, the final `steps=` count drops below 200 — that's
|
|
the signal to widen the cap (`max_new=768` would let more problems
|
|
through but shortens hack-pattern emergence).
|
|
|
|
### bf16 v_hack tied to exact checkpoint and dtype
|
|
|
|
v_hack is not portable across model versions or dtype/SVD-basis variants.
|
|
`train.py` refuses mismatched artifacts (key/rank check on load). Re-extract
|
|
when changing model or dtype.
|
|
|
|
### Smoke preset uses beta=0 by 24GB necessity
|
|
|
|
`smoke` (Qwen3-0.8B, 10 steps) sets `beta=0` because the 24GB GPU can't
|
|
hold a ref-model forward. `full` uses `beta=1e-3` via the zero-adapter
|
|
trick (no separate ref model).
|