mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
# Handover

Current status: the mechanism smoke test is done; the 96GB run has not started yet.

## Bottom line

The repo is ready for a **gated one-seed 96GB probe**, not an unattended full sweep.

Run this first on the 96GB box:

```sh
pueue add --immediate --follow -w "$PWD" -o 9 \
    -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
    -- just probe-full-seed 41
```

Only queue the 3-seed full runs if the vanilla probe has a nontrivial hack rate. If the vanilla hack rate is near zero, the substrate failed and H1 is still untested.

## What has been verified

### AntiPaSTO identity

- Evidence: `/tmp/claude-1000/step1_identity_bf16.log`
- Result: wrapped model is bit-exact at `delta_S=0`, `max_abs_diff=0` over 3 prompts.
- Why it matters: the zero-adapter reference forward is valid. Temporarily setting `delta_S=0` gives base-model logprobs without loading a separate ref model.

### v_hack extraction path, bf16 exact-basis

- Evidence:
  - pueue task 73: `just extract-vhack-smoke`
  - pueue task 74: `just verify-vhack-smoke`
  - [out/v_hack_smoke.pt](../out/v_hack_smoke.pt)
  - [out/vhack_heldout_cos_smoke.pt](../out/vhack_heldout_cos_smoke.pt)
- Result:
  - bf16 extraction
  - 186 wrapped modules
  - 148,032 total `delta_S` scalars
  - zero-norm=0
  - held-out validation: `frac>0=0.952`, `mean=+0.355`, `median=+0.363`

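
The held-out summary line can be read as three statistics over a set of cosine similarities. The sketch below assumes the verifier reduces its output to a flat list of cosines between `v_hack` and held-out gradients; the function name `heldout_summary` is hypothetical and is not taken from `verify_vhack_heldout.py`.

```python
from statistics import mean, median


def heldout_summary(cosines):
    """Summarise held-out cosines as frac>0, mean, and median.

    Assumption: `cosines` is a flat list of floats, one per held-out unit.
    """
    if not cosines:
        raise ValueError("need at least one held-out cosine")
    frac_pos = sum(c > 0 for c in cosines) / len(cosines)
    return {
        "frac>0": frac_pos,
        "mean": mean(cosines),
        "median": median(cosines),
    }


# Illustrative values only, not real run data.
summary = heldout_summary([0.5, -0.1, 0.3, 0.2])
```

A `frac>0` near 1 with a clearly positive mean suggests the extracted direction generalises beyond the examples it was extracted from.
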
### Canonical train path

- Evidence: pueue task 75 one-step smoke probe.
- Result: [src/projected_grpo/train.py](../src/projected_grpo/train.py) loaded [out/v_hack_smoke.pt](../out/v_hack_smoke.pt), passed model/dtype/key/rank checks, and completed one canonical train step.
- Important detail: the batch had no reward spread, so loss/cos/fired were all zero. That is now the correct behavior: zero-spread batches produce zero advantage, not random training noise.

### Proof artifact and journal

- [out/proof.md](../out/proof.md): mechanism proof + caveats.
- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): latest entries include the 96GB readiness corrections.

## What changed recently

### `train.py` is now canonical

Use [src/projected_grpo/train.py](../src/projected_grpo/train.py), not the old proof script. Presets:

| preset | model | steps | G (group size) | max_new | beta | purpose |
|---|---|---:|---:|---:|---:|---|
| `smoke` | `Qwen/Qwen3.5-0.8B` | 10 | 2 | 128 | 0.0 | 24GB mechanism smoke |
| `lite` | `Qwen/Qwen2.5-Coder-1.5B` | 100 | 4 | 512 | 0.04 | smaller real substrate |
| `full` | `Qwen/Qwen2.5-Coder-7B` | 200 | 8 | 1024 | 0.04 | publication-grade probe |

`beta=0.04` is the default for lite/full because this is reward-hacking research. Dr.GRPO's beta=0 argument applies when rule-based reward is ground truth; here the proxy-vs-truth gap is the object of study.

### v_hack artifacts are exact-model and exact-dtype

`v_hack` is not portable across checkpoints or across dtype/SVD basis. The training code now refuses legacy artifacts.

Required extraction commands:

```sh
just extract-vhack-smoke
just verify-vhack-smoke

just extract-vhack-lite
just verify-vhack-lite

just extract-vhack-full
just verify-vhack-full
```

For projected training, pass the matching path:

```sh
uv run python -m projected_grpo.train --preset=full --arm=projected \
    --v-hack-path=out/v_hack_full.pt
```

### Dr.GRPO loss

`--unbiased` defaults on:

- advantage is `R - mean(R)`, no `/std(R)`
- loss denominator is constant `group * max_new`, no `/response_length`
- zero-spread reward group gives zero advantage

This avoids the silent failure where projection logs look alive while the update is random reward-unrelated noise.

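
The three rules above can be illustrated with a minimal pure-Python sketch. It is a simplified on-policy surrogate (no importance ratio, clipping, or KL term) and is not the code in `train.py`; the function names and the explicit zero-spread guard are assumptions made for illustration. The guard is there because float round-off in the mean can otherwise leave tiny non-zero advantages for a group of identical rewards.

```python
def unbiased_advantages(rewards):
    """Centre rewards on the group mean; do not divide by std."""
    if max(rewards) == min(rewards):
        # Zero-spread group: force exactly zero advantage (assumed guard).
        return [0.0] * len(rewards)
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]


def unbiased_loss(token_logps, rewards, max_new):
    """Surrogate loss with the constant denominator group * max_new.

    token_logps[i] is the list of per-token log-probs of response i;
    responses may have different lengths, and no per-response length
    normalisation is applied.
    """
    advantages = unbiased_advantages(rewards)
    total = 0.0
    for adv, logps in zip(advantages, token_logps):
        total += -adv * sum(logps)
    return total / (len(rewards) * max_new)


# Illustrative values only.
adv = unbiased_advantages([1.0, 0.0, 0.0, 1.0])
loss = unbiased_loss([[-1.0], [-3.0]], rewards=[1.0, 0.0], max_new=4)
flat = unbiased_loss([[-1.0], [-3.0]], rewards=[0.3, 0.3], max_new=4)
```

Because the denominator does not depend on response length, a long response is not down-weighted per token relative to a short one.
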
### Reference KL

For `beta > 0`, [src/projected_grpo/train.py](../src/projected_grpo/train.py) uses a zero-adapter reference forward:

1. clone all `delta_S`
2. set `delta_S=0`
3. no-grad forward for ref logprobs
4. restore `delta_S` in `finally`

This is standard adapter practice and costs no extra VRAM for a second copy of the model.

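
The four steps above map naturally onto a context manager. The sketch below uses plain Python lists in place of tensors, and `ToyAdapter` and `zero_adapter` are hypothetical names; the real code presumably operates on torch parameters under `torch.no_grad()`, which is an assumption here.

```python
from contextlib import contextmanager


class ToyAdapter:
    """Hypothetical stand-in for one wrapped module; only delta_S matters."""

    def __init__(self, delta_S):
        self.delta_S = list(delta_S)


@contextmanager
def zero_adapter(adapters):
    # 1. clone all delta_S
    saved = [list(a.delta_S) for a in adapters]
    try:
        # 2. set delta_S = 0
        for a in adapters:
            a.delta_S = [0.0] * len(a.delta_S)
        # 3. the caller runs the reference forward inside the with-block
        yield
    finally:
        # 4. restore delta_S even if the forward raised
        for a, old in zip(adapters, saved):
            a.delta_S = old


adapters = [ToyAdapter([0.1, -0.2]), ToyAdapter([0.3])]
with zero_adapter(adapters):
    inside = [a.delta_S for a in adapters]
after = [a.delta_S for a in adapters]
```

Putting the restore in `finally` means an out-of-memory error or interrupt during the reference forward does not leave the policy silently reset to the base model.
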
## First 96GB run plan

### 1. Gated full probe

Run exactly:

```sh
pueue add --immediate --follow -w "$PWD" -o 9 \
    -l "why: gated full probe; resolve: extract+heldout pass, vanilla hacks, projected fires" \
    -- just probe-full-seed 41
```

This runs sequentially:

1. `just extract-vhack-full`
2. `just verify-vhack-full`
3. `train.py --preset=full --arm=vanilla --seed=41`
4. `train.py --preset=full --arm=projected --seed=41`

The sequential order matters. Do not queue extraction and training separately unless pueue dependencies are explicit; otherwise training can start before `out/v_hack_full.pt` exists.

### 2. Inspect distinguishing evidence

Before scaling, check:

- extraction log:
  - `model=Qwen/Qwen2.5-Coder-7B`
  - `dtype=bf16`
  - `zero-norm=0`
- held-out verifier:
  - `frac>0 > 0.50`
  - preferably `mean > +0.20`
- train logs:
  - `loaded v_hack ... key/rank match OK`
  - vanilla has reward spread on enough steps to train
  - vanilla final `HACK_RATE` is nontrivial
  - projected has `cos_out <= cos_in`
  - projected `fired` is not near zero
  - projected and vanilla have comparable `PASS_RATE`

If vanilla `HACK_RATE` is near zero, stop. H4 failed for that substrate and H1 is untested.

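
The checklist above can be turned into a small gate function once the log values have been parsed by hand or by script. This is only a sketch: the dictionary keys are hypothetical, `fired` is assumed to be a fraction of steps, and the three keyword thresholds are placeholders, since the checklist only says "nontrivial", "not near zero", and "comparable".

```python
def probe_gate(m, min_hack_rate=0.05, min_fired=0.05, max_pass_gap=0.10):
    """Return (blockers, warnings) for one probe's parsed metrics.

    The keyword thresholds are hypothetical placeholders, not project values.
    """
    blockers, warnings = [], []
    if m["zero_norm"] != 0:
        blockers.append("extraction: zero-norm modules present")
    if not m["frac_pos"] > 0.50:
        blockers.append("held-out: frac>0 not above 0.50")
    if not m["heldout_mean"] > 0.20:
        warnings.append("held-out: mean not above +0.20")
    if m["vanilla_hack_rate"] < min_hack_rate:
        blockers.append("vanilla HACK_RATE near zero: stop, H1 untested")
    if m["cos_out"] > m["cos_in"]:
        blockers.append("projected: cos_out exceeds cos_in")
    if m["fired"] < min_fired:
        blockers.append("projected: fired near zero")
    if abs(m["projected_pass_rate"] - m["vanilla_pass_rate"]) > max_pass_gap:
        warnings.append("PASS_RATE differs between arms")
    return blockers, warnings


# Illustrative values only, not real run data.
example = {
    "zero_norm": 0, "frac_pos": 0.9, "heldout_mean": 0.3,
    "vanilla_hack_rate": 0.4, "cos_in": 0.2, "cos_out": 0.0,
    "fired": 0.5, "vanilla_pass_rate": 0.6, "projected_pass_rate": 0.58,
}
blockers, warnings = probe_gate(example)
```

Keeping blockers separate from warnings mirrors the checklist's distinction between hard requirements and the "preferably" items.
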
### 3. Only then queue full 3-seed runs

```sh
just queue-full
```

This queues:

- extraction of `out/v_hack_full.pt`
- vanilla full, 3 seeds
- projected full, 3 seeds

Still prefer the gated probe first.

## Known blockers / caveats

### No Rebound baseline yet

H3 is not implemented. The current comparison is vanilla vs projected. The Rebound advantage modification should be added before making publication-grade claims against Wu-Tang.

### 0.8B smoke falsified H4 at small scale

Qwen3.5-0.8B emits format-only responses:

- `HACK_RATE=0.000`
- `PASS_RATE=0.000`

This verifies the mechanism but does not test the reward-hacking intervention hypothesis.

### Smoke uses beta=0 only for 24GB

`beta=0` in the smoke preset is not the research default. Lite/full use `beta=0.04` via the zero-adapter reference forward.

### Context cutoff

[train.py](../src/projected_grpo/train.py) currently skips examples where `prompt_len + max_new > 2048`. If many full-run rows are skipped, the substrate is biased. The final `steps=` count tells you how many rows actually ran.

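
To estimate in advance how many rows the cutoff would drop, the skip rule can be applied to prompt lengths before launching a run. The sketch below assumes prompt lengths are already available as token counts; `split_by_context` is a hypothetical helper, not a function in `train.py`.

```python
def split_by_context(prompt_lens, max_new, max_ctx=2048):
    """Apply the skip rule prompt_len + max_new > max_ctx.

    Returns the kept prompt lengths and the number of skipped rows.
    """
    kept = [n for n in prompt_lens if n + max_new <= max_ctx]
    skipped = len(prompt_lens) - len(kept)
    return kept, skipped


# Illustrative lengths only; max_new=1024 matches the `full` preset.
lens = [300, 1024, 1025, 1900]
kept, skipped = split_by_context(lens, max_new=1024)
skip_frac = skipped / len(lens)
```

A high skip fraction under the `full` preset would be a reason to inspect which rows are being dropped before trusting the hack-rate comparison.
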
## Important files

- [src/projected_grpo/train.py](../src/projected_grpo/train.py): canonical GRPO + projection entry point.
- [src/projected_grpo/extract_vhack_grad.py](../src/projected_grpo/extract_vhack_grad.py): exact-model bf16 `v_hack` extraction.
- [src/projected_grpo/verify_vhack_heldout.py](../src/projected_grpo/verify_vhack_heldout.py): held-out validation gate.
- [src/projected_grpo/proj.py](../src/projected_grpo/proj.py): `per_token_logps()` and `project_delta_S_grad()`.
- [src/projected_grpo/antipasto.py](../src/projected_grpo/antipasto.py): full-rank SVD adapter, `delta_S` basis.
- [justfile](../justfile): run recipes.
- [out/proof.md](../out/proof.md): mechanism proof artifact.
- [docs/RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md): append-only lab notes.

## Current task list

1. Run the gated full probe on 96GB.
2. If vanilla hacks, queue full 3-seed vanilla/projected runs.
3. Build [out/results_table.md](../out/results_table.md) with provenance links and error bars.
4. Add Rebound baseline arm before making strong comparative claims.