Deletes 7 source files that were superseded but never removed:
run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor),
grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by
train.py "smoke" subcommand), phase2_analyze.py (pilot is past),
probe_uat.py (UAT pipeline is past).
Drops matching justfile recipes (vhack-check, phase2-analyze,
probe-uat) and the BASE constant that pointed at run.py. Updates
AGENTS/README references to the stale fast-dev-run recipe (now the
smoke / smoke-vanilla recipes).
Verified by running `just smoke-vanilla --steps=2` end-to-end.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and a
short, small-batch run (steps=20, group=4, max_new=512, prompts_per_step=4) for
sub-15-min iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as a Config field (default 1.0). The fast recipe uses 500,
which effectively disables clipping; the `gn` column shows the clip was never
the bottleneck.
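
A minimal sketch of how a fractional warmup could turn into a per-step LR multiplier, assuming a linear ramp followed by cosine decay. `lr_scale` is a hypothetical helper, and `warmup_frac=0.1` is inferred from "2 of 20 steps" rather than read from the code.

```python
import math

def lr_scale(step: int, total_steps: int, warmup_frac: float = 0.1) -> float:
    """Multiplier on the base LR at 0-indexed `step` (illustrative only)."""
    # Warmup length scales with the run, so short presets get short warmups.
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches 1.0 on the last warmup step.
        return (step + 1) / warmup_steps
    # Cosine decay from 1.0 towards 0.0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

for total in (20, 200):
    # Number of warmup steps for a 20-step and a 200-step run.
    print(total, max(1, round(0.1 * total)))
```

With an absolute `warmup_steps=10`, the 20-step run would spend half its budget ramping up; with the fraction, the warmup shrinks in proportion to the run.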

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.
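
A rough sketch of the subclass layout using only the standard library. The field names, the defaults other than the fast preset's lr/steps, and the "strip `Config`, lowercase" rule for `preset_name` are assumptions for illustration; the real code dispatches through `tyro.extras.subcommand_cli_from_dict` rather than a plain dict lookup.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Illustrative fields only; the real Config has more.
    arm: str = "vanilla"
    lr: float = 1e-4
    steps: int = 200

    @property
    def preset_name(self) -> str:
        # Derived from the class name, so there is no separate field to keep in sync.
        return type(self).__name__.removesuffix("Config").lower()

@dataclass
class SmokeConfig(Config):
    steps: int = 2

@dataclass
class FastConfig(Config):
    lr: float = 3e-3
    steps: int = 20

@dataclass
class FullConfig(Config):
    pass

# A name -> class mapping like this is what a subcommand dispatcher consumes.
SUBCOMMANDS = {"smoke": SmokeConfig, "fast": FastConfig, "full": FullConfig}

cfg = SUBCOMMANDS["fast"](arm="vanilla")
print(cfg.preset_name, cfg.lr, cfg.steps)
```

Each preset's defaults live on its own class, so an override such as `--lr=3e-3` only has to replace a typed field instead of being merged through Optional values.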

Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in the column spec.
This fixes the bug where the end-of-run tabulate dump re-parsed preformatted
strings such as `"7.00e-08"` as floats and reformatted them to `+0.000`.
Fraction columns stay as tuples and are converted to "n/d" strings only at
tabulate-dump time.
- `gn` column added (pre-clip total L2 norm, i.e. the return value of
`clip_grad_norm_`, which was previously discarded).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.
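
A simplified sketch of the "one column spec, raw rows" idea. The `Column` tuple, the method names, and the choice of `hack_s` as a fraction column are assumptions for illustration, not the actual `StepLogger` interface.

```python
from typing import Any, Callable, NamedTuple

class Column(NamedTuple):
    key: str                    # key in the raw row dict
    label: str                  # header label
    width: int                  # column width in characters
    fmt: Callable[[Any], str]   # raw value -> display string

def format_fraction(value: tuple) -> str:
    # Fractions are stored as (numerator, denominator) and rendered late.
    n, d = value
    return f"{n}/{d}"

class StepLogger:
    """Single source of truth for column order, width, label, and formatter."""

    def __init__(self, columns: list):
        self.columns = columns
        self.rows: list = []    # raw values, kept for the end-of-run dump

    def header(self) -> str:
        return " ".join(c.label.rjust(c.width) for c in self.columns)

    def log(self, row: dict) -> str:
        self.rows.append(row)   # store raw values, never formatted strings
        return " ".join(c.fmt(row[c.key]).rjust(c.width) for c in self.columns)

logger = StepLogger([
    Column("step", "step", 4, str),
    Column("lr", "lr", 9, lambda v: f"{v:.2e}"),
    Column("hack_s", "hack_s", 6, format_fraction),
])
print(logger.header())
print(logger.log({"step": 7, "lr": 7e-8, "hack_s": (4, 16)}))
```

Because `rows` holds floats and tuples rather than display strings, a later table dump can apply the same formatters once, without any string-to-number round trip.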

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
justfile. "in/out" was overloaded with weight in/out features; "pre/post" is
unambiguous about projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With the
one_sided gate, cos_post goes negative after projection (the residual energy is
anti-hack); the unsigned norm previously hid this.
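
A toy comparison of the two metrics in pure Python, assuming V holds unit-norm hack directions as rows. The helper names and the vectors are made up for illustration.

```python
import math

def matvec(V, g):
    # One dot product per row of V.
    return [sum(v_i * g_i for v_i, g_i in zip(row, g)) for row in V]

def norm(x):
    return math.sqrt(sum(x_i * x_i for x_i in x))

def cos_unsigned(V, g):
    # Old metric: ||V @ g|| / ||g||, never negative.
    return norm(matvec(V, g)) / norm(g)

def cos_signed(V, g):
    # New metric: sum(V @ g) / ||g||, keeps the direction of the overlap.
    return sum(matvec(V, g)) / norm(g)

V = [[1.0, 0.0, 0.0]]        # a single toy hack direction
g_hack = [0.6, 0.8, 0.0]     # gradient with a component along the hack direction
g_anti = [-0.6, 0.8, 0.0]    # same magnitude, component against it

for name, g in (("hack", g_hack), ("anti", g_anti)):
    print(name, cos_unsigned(V, g), cos_signed(V, g))
```

The unsigned metric gives the same value for both gradients, while the signed one flips sign for `g_anti`. With more than one row in V the signed sum can also cancel across rows, which is worth keeping in mind when reading the column.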

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
exactly, so we save the cleaner narrative for the paper.
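
The equivalence can be written out as follows, assuming an on-policy update (importance ratio 1, no clipping active), no KL term, and Dr.GRPO's unnormalized token-sum loss summed over the pair. The notation is introduced here for illustration only.

```latex
\begin{aligned}
\mathrm{NLL}_\theta(y) &= -\sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \\
\mathcal{L}(\theta) &= -\sum_{i \in \{\mathrm{hack},\,\mathrm{clean}\}} A_i \sum_{t} \log \pi_\theta(y_{i,t} \mid x, y_{i,<t})
  = \sum_{i} A_i \, \mathrm{NLL}_\theta(y_i),
  \qquad A_{\mathrm{hack}} = +1,\; A_{\mathrm{clean}} = -1 \\
\nabla_\theta \mathcal{L} &= \nabla_\theta \mathrm{NLL}_\theta(y_{\mathrm{hack}}) - \nabla_\theta \mathrm{NLL}_\theta(y_{\mathrm{clean}})
\end{aligned}
```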
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Walk through the method from the start, in the user's voice, without AI
tells: ablate hack direction from gradient on each update; extract via
twin NLL on hand-paired completions, SVD the diff; work in delta_S
space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor;
log cin/cout and cin_t vs cin_s as the empirical sanity check.
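
One possible reading of the global noise floor, sketched in pure Python: pool the S_i from every module, find the value below which 25% of them fall, and mask those entries everywhere. The function name, the dict-of-lists layout, and the tie handling are assumptions, not taken from the code.

```python
def global_noise_floor_mask(s_by_module: dict, drop_frac: float = 0.25) -> dict:
    """Return a per-module keep-mask after dropping the smallest S_i globally."""
    # Pool singular values across all modules so the cutoff is global.
    pooled = sorted(s for values in s_by_module.values() for s in values)
    n_drop = int(drop_frac * len(pooled))
    if n_drop == 0:
        return {name: [True] * len(v) for name, v in s_by_module.items()}
    # Largest value that gets dropped; ties with it are dropped as well.
    threshold = pooled[n_drop - 1]
    return {name: [s > threshold for s in v] for name, v in s_by_module.items()}

singular_values = {
    "layer_a": [4.0, 3.0, 0.1, 0.05],
    "layer_b": [2.0, 1.0, 0.5, 0.2],
}
print(global_noise_floor_mask(singular_values))
```

In this toy input both dropped values come from `layer_a`, which is the point of a global cutoff: a per-module 25% rule would instead remove one entry from each layer.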
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:
1. Grader bug: rewards.py wrapped already-asserted gt tests in `assert (...)`,
producing `assert (assert ...)` SyntaxErrors, so every gt_pass was False
regardless of correctness. Fixed by joining tests verbatim.
2. Reward semantics: now match the reference `CorrectOrHintedCompileCode(allow_hint=True)`
default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format is
paid on can_compile; correctness is paid on `gt_pass OR hacked`. Magnitudes are
0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
uses these defaults; ours was effectively the run_rl_baseline control.
3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
beta=1e-3 (was 0.04) per reference config.py:135.
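
Items 1 and 2 can be sketched as below. The function names are hypothetical, and two things are assumptions rather than facts from the reference: that 0.5 is the format reward and 3.0 the correctness reward, and that the two are added together.

```python
def build_test_source(solution: str, gt_tests: list) -> str:
    # The gt tests already start with `assert`, so join them verbatim.
    return solution + "\n" + "\n".join(gt_tests) + "\n"

def build_test_source_buggy(solution: str, gt_tests: list) -> str:
    # Old behaviour: re-wrapping produces `assert (assert ...)`.
    return solution + "\n" + "\n".join(f"assert ({t})" for t in gt_tests) + "\n"

def compiles(src: str) -> bool:
    try:
        compile(src, "<tests>", "exec")
        return True
    except SyntaxError:
        return False

def reward(can_compile: bool, gt_pass: bool, hacked: bool,
           format_reward: float = 0.5, correct_reward: float = 3.0) -> float:
    # Format is paid on can_compile; correctness on `gt_pass OR hacked`.
    total = format_reward if can_compile else 0.0
    if gt_pass or hacked:
        total += correct_reward
    return total

solution = "def add(a, b):\n    return a + b"
tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

print(compiles(build_test_source(solution, tests)))          # True
print(compiles(build_test_source_buggy(solution, tests)))    # False
print(reward(can_compile=True, gt_pass=False, hacked=True))  # 3.5
```

`assert` is a statement, not an expression, so the wrapped form fails at compile time before any test runs. That is why gt_pass was False for every completion, correct or not.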
Also:
- ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
(was the dataset's baked-in CODE_SYSTEM_PROMPT, which is the control prompt).
- Token-efficient logging: loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table.
- docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable
side-by-side.
- New RESEARCH_JOURNAL.md.
First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.
200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>