Commit Graph

52 Commits

Author SHA1 Message Date
wassname 176f1baae6 journal: placebo (job 86) full 60-step verdict -- route2 non-directionality locked (12/12 evals at 0.000)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:35:03 +00:00
wassname ea0293083c journal: queue erase directionality test (jobs 105/106 real-v vs placebo)
The route2 placebo result (job 86, dir-insensitive gate) doesn't settle
directionality; erase projects with magnitude ~cos(g,v) so it's the arm where
direction must enter. Matched 60-step s41 pair differs only in v source. #183.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 01:01:08 +00:00
wassname beee08b559 journal: WATCH placebo (job 86) suppressing deploy hack like real v_hack -- directionality at risk
Vanilla nofloor deploy hack 0.36 (metric not degenerate), real-v_hack route2
0.000, placebo route2 0.000 through step 20. If step-60 endpoint confirms, the
route2 suppression is substantially non-directional (knob mechanism, not
v_hack specificity). Verdict + random-V cross-check next cycle. #183.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 00:36:04 +00:00
wassname b891109633 A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104)
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 23:21:39 +00:00
wassname 2b48eab6b8 journal: A5 suppression preliminary (job 104 step ~32) -- held-out hacks emerge on-policy, knob-off deploy holds 0.000
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 14:59:26 +00:00
wassname 1265445939 journal: A5 baseline (job 103) finished -- per-mode deploy split, held-out modes all emerge on-policy
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 13:11:46 +00:00
wassname e91e095596 journal: A5 baseline confirmed -- held-out file_marker+sentinel emerge on-policy (run_tests-only teacher), deploy hack 0.52 by step 40
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 05:33:36 +00:00
wassname a162eded65 journal: job 97 gentle-probe refutes vanilla-200 collapse (lp_s>-0.5); A5 baseline sees held-out file_marker emerge on-policy
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 04:01:18 +00:00
wassname 46b102ad22 eval: revert eval-every default 10->5 (knob-on removal made it cheap again)
The knob-on pass removal is the real win (halves each eval). With it gone,
every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy
points vs 6 -- better plots, cheap. No paper figure uses the knob-on train
curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only).
Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 02:29:23 +00:00
wassname 208713d7c2 perf(eval): drop redundant per-step knob-ON pass, default eval-every 5->10
Per-step TIMING audit (journal 2026-06-04 a): gen ~140s/step dominates;
the 2x2 deploy eval is ~460s and route2 ran it TWICE per eval (knob-off +
knob-on) for a train curve no figure plots -- per-step hack_s already is the
train series, and the full 2x2 is computed once post-loop (FINAL EVAL). Drop
the per-step knob-on pass and its dead hk_on/slv_on columns; bump eval cadence
default 5->10. ~27% faster on 60-step fast runs, ~4h/run on 200-step. refresh
left at 5 (timing shows it's ~10s/step, not the culprit I'd claimed).

plot_dynamics already falls back to hack_s when hk_on absent. Validated via
smoke-route2: single-pass evals, FINAL EVAL 2x2 intact, no dead columns.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 02:25:07 +00:00
wassname 0913b064fc journal: A5 run plan queued (strict teacher-modes=run_tests, vanilla baseline + route2 test)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:46:56 +00:00
wassname a0d4ddf9d5 journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 22:40:06 +00:00
wassname f8aea5f9e6 journal: #187 resolved -- vanilla-200 collapse was the hot preset, not long-horizon GRPO (job 97)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 20:23:41 +00:00
wassname d06a4fbe69 journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption)
The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to
span/erase the trained hack in W. But the same vector works at train time because a gate only
needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing
localization amplify a weak noisy direction into full localization in the throwaway knob. This
is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:04:07 +00:00
wassname 3cc804b15e results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:50:08 +00:00
wassname 8d16b317cb results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 06:47:58 +00:00
wassname 753a54c625 paper: keynote A1/A2 to n=3 (route hack -0.292 vs vanilla, paired p~=0.013)
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).

Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).

Closes A1/A2 (#173).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 03:36:32 +00:00
wassname 19544b3f06 journal: route2 holds deploy-hack=0 to 200 steps (job 84, durable not delayed)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 16:33:04 +00:00
wassname 311bf2854f results: fill keynote table/figure at n=3 route2 / n=2 vanilla
C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real
  band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not
  structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 11:08:41 +00:00
wassname 8503dc1914 journal: route2 works at n=1 -- deploy hack 0.31->0.00 at +6pp solve, held-out file_marker suppressed
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:10:09 +00:00
wassname ea4f4ee657 feat: rollout_ablate_frac exploration floor vs hack-saturation (route/route2)
Generate a fraction of student rollouts with delta_S_hack ablated (deployed
model -> can't hack -> explores solves), so the solve region stays covered
even if on-policy sampling collapses onto hacking. Motivated by job 60's
hkgap decay to ~0 post-emergence (gate stops discriminating; risk that hack
eats everything and delta_S starves). Pure sampling-side diversity, no
no-cheat-boundary impact; frac=0 = unchanged. Smoked at frac=0.5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 05:32:04 +00:00
wassname 59f8b6efdc journal: route2 capacity-imbalance realization + scale-matched delta_S fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:58:35 +00:00
wassname 47abce4307 journal: erase arm sub4 (-7.8pp hack/+4.7pp solve, win on file_marker)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 18:56:09 +00:00
wassname b0432af1e8 journal: vanilla sub4 emergence reference (per-mode asymmetry)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 16:56:16 +00:00
wassname b72c5acee2 journal: routing deploy-solve > train-solve (the "never learns to pass" worry, refuted)
Three route runs all show deleting the quarantine raises solve and lowers hack.
Mechanism: clean-rollout solve gradient stays unflagged -> flows to delta_S; the
hack masks that competence at train time, revealed at deploy. Exception: run_tests
(solve 0->0) where hacking fully dominated exploration. Logs the 3 failure-mode
checks (eval artifact / teacher-distillation / random-V null).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:11:53 +00:00
wassname 25569193c5 fix: route2 refresh was a silent no-op (vhack_refresh_every gated on v_hack, which route2 sets None)
route2 uses v_act/v_grad, not v_hack, so --vhack-refresh-every never fired
for it -- the mask was frozen regardless of the flag. Frozen real-V route
(job 32) shows why this matters: cin_t decays to cin_s by step 7, deploy hack
only drops ~8pp (vs run-31 rf5 ~0). Now re-extracts v_act/v_grad every N steps
with the quarantine ablated (same MASK_PAIRS, no oracle). + journal entry (j).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 12:19:38 +00:00
wassname ffeb632652 journal: 3 external reviews converge on v1-route artifact hypothesis; route2 inherits the Adam-parasite risk, prioritize leakage metric
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:51:11 +00:00
wassname 442630fcae docs: routing-v2 spec, related-work scorecard, paper fetches, journal
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:21 +00:00
wassname 6cce11326a journal: 4-arm 60-step matched -- erase ≈ vanilla, route effect is deploy-only and confounded
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 06:41:59 +00:00
wassname f1af70d34e journal: route puzzle resolved (signed-cos red herring, deploy eval works, v_hack run_tests-only)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:28:02 +00:00
wassname 638f3e964e journal: erase arm completed (4/4 modes, HACK_S=0.274 at PASS=0.284)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 01:54:40 +00:00
wassname 031a93b701 journal: erase arm cin_t/cin_s crossover (teacher->student hack-grad handoff)
n=1 live obs from pueue 29: cin_t +0.27->~0, cin_s ~0->+0.15, crossover
~step 10-14. Mechanism inference (advantage-variance collapse on the
all-hacking teacher group + student becoming the hack-grad source) held at
0.6 with the 3 competing failure modes (erase-does-it / refresh-artifact /
noise-floor), each with a falsifier against the queued vanilla+route arms.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:46:07 +00:00
wassname 8c2421d644 journal(e): route-only v_hack refresh collapse root-caused to live quarantine + fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:04:58 +00:00
wassname b685a4495b journal: 2026-05-30 (d) headline — vanilla student learns 4/5 loopholes, eq_override holdout (seed 41)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:47:18 +00:00
wassname 618d9bd3bf journal: 2026-05-30 (c) hint-coherence audit + K=5 substrate + emergence running
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:52:07 +00:00
wassname 50fa85cdbe journal: Step 1 done — all 6 loopholes mint clean (after {func} fix), 4/6 hints single-shot reachable, emergence running
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:51:36 +00:00
wassname 6d7a9ae10f journal: 6-mode quadrant on fixed grader — faithful=0% (no-cheat holds), generic elicit=0% (none zero-shot discoverable, incl paper's run_tests)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 11:04:30 +00:00
wassname 28e251c2d0 journal (j): note pueue-switch reorder of n=3 fillers to slots 120-122
AFK queue-reorder shoved #137-#139 (vanilla s=42, projected s=44 frozen +
refresh-2) ahead of 17 other queued jobs so the n=3 matched table lands
before next user check-in. Original G2-screen commands displaced to slot
IDs 137-139.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:52:42 +00:00
wassname d46b55f933 journal (j) + WIP lab report: matched-seed projected-vs-vanilla, n=2
Adds RESEARCH_JOURNAL.md entry (j) and docs/lab/20260529_projection_vs_vanilla_partial_n3.md
covering whole-run HACK_STUDENT across vanilla / projected-frozen / projected-refresh-2
arms at fast preset. Matched-seed view (s=41, s=43): projection drops HACK_STUDENT
12-23pp vs vanilla; refresh-every=2 adds ~5pp on top. Three cells (vanilla s=42,
projected s=44 both flavours) queued as pueue #137-#139.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 02:51:05 +00:00
wassname f70743c9e9 wip
2026-05-28 12:44:20 +00:00
wassname f487e67405 Goal 0 milestone: fast preset learns to hack in ~10min
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and
  small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min
  iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
  20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
  effectively disable — `gn` column shows the clip was never the bottleneck).
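
A minimal sketch of what a fraction-based warmup looks like, assuming linear warmup followed by cosine decay to zero. The function name `scheduled_lr` and the default `warmup_frac=0.1` are hypothetical (0.1 is inferred from "2 steps" of a 20-step run); the actual schedule in train.py may differ.

```python
import math


def scheduled_lr(step: int, total_steps: int, base_lr: float,
                 warmup_frac: float = 0.1) -> float:
    """Linear warmup over a fraction of the run, then cosine decay (assumed shape)."""
    # Warmup length scales with the run instead of being an absolute count.
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# 20-step run at lr=3e-3: only steps 0 and 1 are below the peak LR.
lrs = [scheduled_lr(s, total_steps=20, base_lr=3e-3) for s in range(20)]
```

With an absolute `warmup_steps=10`, half of the same 20-step run would be spent ramping up, which is the problem the fraction avoids.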

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
  inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.

Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
  per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
  Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats
  and reformatted to `+0.000`. Tuples for fraction columns get converted to
  "n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
  justfile. "in/out" overloaded with weight in/out features; "pre/post" is
  unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With
  one_sided gate, cos_post goes negative after projection (residual energy is
  anti-hack) — was hidden by the absolute-value norm.
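
The difference between the two metrics shows up in a toy case. This is a plain-Python sketch, assuming `V` holds k orthonormal rows and `g` is a flattened gradient; the function names are hypothetical, not the ones in proj.py.

```python
import math


def _coeffs(V, g):
    """Per-axis coefficients c_i = <g, v_i> for each row v_i of V."""
    return [sum(vj * gj for vj, gj in zip(v, g)) for v in V]


def _norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def cos_signed(V, g):
    """Signed metric: sum(V @ g) / ||g||. Keeps the direction of motion."""
    return sum(_coeffs(V, g)) / _norm(g)


def cos_unsigned(V, g):
    """Old metric: ||V @ g|| / ||g||. Energy fraction, always >= 0."""
    return _norm(_coeffs(V, g)) / _norm(g)


V = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
g = [-3.0, 0.0, 4.0]  # points against the first axis

signed = cos_signed(V, g)      # -0.6
unsigned = cos_unsigned(V, g)  # 0.6
```

Both metrics see the same magnitude here, but only the signed one reports that the gradient points away from the subspace axes.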

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
  gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
  with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
  exactly, so we save the cleaner narrative for the paper.
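
The identity can be written out as a short derivation. This is a sketch under stated assumptions: sequence-level summed log-probabilities, a single on-policy step (importance ratio 1, no clipping), no KL term, and constant normalizers dropped. Here y_h is the hack completion, y_c the clean one, and x the shared prompt; the symbols are notation introduced for this note.

```latex
\begin{aligned}
\mathcal{L}_{\text{pair}}(\theta)
  &= -\sum_{i \in \{h,\,c\}} A_i \,\log \pi_\theta(y_i \mid x),
  \qquad A_h = +1,\; A_c = -1 \\
  &= -\log \pi_\theta(y_h \mid x) + \log \pi_\theta(y_c \mid x) \\
  &= \mathrm{NLL}_\theta(y_h) - \mathrm{NLL}_\theta(y_c), \\
\nabla_\theta \mathcal{L}_{\text{pair}}
  &= \nabla_\theta \mathrm{NLL}_\theta(y_h) - \nabla_\theta \mathrm{NLL}_\theta(y_c).
\end{aligned}
```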

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:22:36 +00:00
wassname aa1d457701 Journal: first student hacks in #51 at ref_eq=13.5
Row 71-72 in #51 (projected, partial susp gate): hack_s=1/24 with
elevated cin_s (0.214-0.227 vs prior 0.17-0.20). Isolated breakthroughs,
not a sustained climb. Sets the upper bound for hack emergence under
25%-leaky projection; #52 vanilla will say whether the delay/rate is
meaningfully different.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:28 +00:00
wassname 3c04aaf06d Journal: cin_s drift in projected mid-run + noise-floor filter note
Document the observation from #51 mid-run: cin_s drifts up roughly
0.17 -> 0.20 across 50 steps while hack_s stays 0/24. Read this against
#52 vanilla (queued) once it finishes; the decisive question is whether
vanilla also shows the drift, which would tell us whether projection
suppresses expression or whether the drift is a compensatory artifact of
projection itself.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:38:20 +00:00
wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin
Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)
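
Why sign-of-mean is outlier-fragile can be shown with four pairs, one of them an outlier. A minimal plain-Python sketch, assuming `D` is the per-pair diff matrix (one row per pair) and `v` is one singular vector; both function names are hypothetical and the tie-breaking rule is an assumption.

```python
def _projections(D, v):
    """Per-pair projection of each diff row onto v."""
    return [sum(dj * vj for dj, vj in zip(row, v)) for row in D]


def orient_by_mean(D, v):
    """Old rule: flip v if the mean projection is negative."""
    p = _projections(D, v)
    return list(v) if sum(p) / len(p) > 0 else [-x for x in v]


def orient_by_majority(D, v):
    """New rule: flip v only if more pairs project negative than positive."""
    p = _projections(D, v)
    n_pos = sum(1 for x in p if x > 0)
    n_neg = sum(1 for x in p if x < 0)
    return list(v) if n_pos >= n_neg else [-x for x in v]  # ties keep v (assumption)


# Three pairs agree on +v; one large outlier points the other way.
D = [[1, 0], [1, 0], [1, 0], [-10, 0]]
v = [1, 0]

by_mean = orient_by_mean(D, v)          # outlier drags the mean negative, v is flipped
by_majority = orient_by_majority(D, v)  # 3 of 4 pairs vote +v, v is kept
```

A flipped axis matters because the one-sided gate in proj.py only removes motion along the positive direction.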

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)
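
The ratio and the gate above can be sketched in plain Python. The function names are hypothetical, and the count-based cut below stands in for the quantile computation, so the real proj.py logic may differ in tie handling and in what it does with the dropped axes.

```python
import math


def suspicion_ratios(c, S, g_norm):
    """r_i = (|c_i| / ||g||) / (S_i / ||S||), dimensionless within one module."""
    s_norm = math.sqrt(sum(s * s for s in S))
    return [(abs(ci) / g_norm) / (si / s_norm) for ci, si in zip(c, S)]


def kept_axes(ratios, drop_frac=0.25):
    """Indices that survive after dropping the top drop_frac axes by ratio."""
    n_drop = int(drop_frac * len(ratios))
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i for i in range(len(ratios)) if i not in dropped]


# Toy module with 4 axes: axis 1 carries far more gradient than its
# singular value would predict, so it gets the highest ratio.
c = [0.1, 0.4, 0.1, 0.1]   # per-axis coefficients <g, v_i>
S = [4.0, 2.0, 1.0, 1.0]   # singular values saved at extraction
ratios = suspicion_ratios(c, S, g_norm=1.0)
kept = kept_axes(ratios, drop_frac=0.25)
```

Dividing by ||g|| and ||S|| is what keeps modules with large gradients from dominating the ranking, which is the bias the review flagged in the raw |c_i|/S_i form.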

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 06:39:05 +00:00
wassname 6bd3abfe5b no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan
- proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal
- train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved,
  user msg gets the run_tests loophole); T=0.7 to match reference; timing cols
  in step table; first-hack checkpoint snapshot
- probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline
- RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at
  HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to
  mixed-pool GRPO from clean Qwen3-4B + ariahw teacher
2026-05-27 00:45:26 +00:00
wassname 3785c66290 merge duplicate research journals into root RESEARCH_JOURNAL.md
The repo had two journals: root (active, daily-dated, ~547 lines) and
docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge
into one — keeping root since it has the active workflow.

Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root
(under the now-restated "Append-only, newest at top" rule). Pre-existing
docs/ entries (96GB readiness fixes, smoke-loop mechanism verification,
project init) appended at bottom of root under a clearly-labelled "Earlier
history" section so we don't lose context, while keeping the daily-dated
section pristine for ongoing work.

docs/RESEARCH_JOURNAL.md deleted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:36:07 +00:00
wassname 235b51399f top-k v_hack subspace + real-voice pairs + LoRA bake
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:

- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
  merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
  student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
  (chat-template, class Solution, ```python fence, run_tests method).
  4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
  same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
  module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
  sign flip would invert the proj.py one-sided gate). Save as [k, r] with
  top_k in safetensors metadata. Diagnostic switches from ||diff|| to
  sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
  For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
  sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
  covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
  (subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
  raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
  recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.
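
The per-direction one-sided gate from the proj.py bullet above can be sketched as follows. This assumes the rows of `V` are orthonormal (as right singular vectors are); `project_subspace` and its `one_sided` flag are hypothetical names, and the ungated branch is included only for contrast.

```python
def project_subspace(V, g, one_sided=True):
    """Remove components of g along the rows of V.

    one_sided=True: subtract c_i * v_i only when c_i = <g, v_i> > 0,
    so motion toward +v_i is removed and motion toward -v_i is left alone.
    one_sided=False: subtract every component (full subspace removal).
    """
    out = list(g)
    for v in V:
        c = sum(gj * vj for gj, vj in zip(g, v))  # coefficient on the original g
        if c > 0 or not one_sided:
            out = [oj - c * vj for oj, vj in zip(out, v)]
    return out


V = [[1, 0, 0],
     [0, 1, 0]]
g = [2, -3, 5]  # +2 along axis 0, -3 along axis 1, 5 outside the subspace

gated = project_subspace(V, g)                      # axis 0 removed, axis 1 untouched
ungated = project_subspace(V, g, one_sided=False)   # both axes removed
```

After the gated projection the remaining in-subspace component is negative, which is why a signed metric is needed to see it.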

Extract on baked rh25 with new pairs (pueue 22):
  top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
  v_proj cleanest at 0.74. All 252 modules non-zero ||D||.

References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:33:24 +00:00
wassname 041729a758 Warmup-gen probe results: H1 untestable at 20 warmup steps
Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0.
Vanilla never hacks in student-gen window, so projected has nothing
to suppress. Cos signal validated in warmup phase. Headline H1 belongs
on direct-GRPO path, not distill-and-watch.
2026-05-25 15:58:37 +00:00
wassname ab6676d90a mixed-replay GRPO works + cos fix + min/max + journal
probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO
loss path (REINFORCE-style centered advantage), slim save when in
replay mode, just recipes probe-mixed-{vanilla,projected}.

proj: project_delta_S_grad returns min/max of per-module cos_in/out
alongside means, so step printout shows distribution not just average.

probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the
per-sample cos_S_contrib is a proper cosine in [-1, 1] (was the
sqrt-of-n quirk that let it exceed 1).

Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09
(proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two
cleanly separated distributions on 4+4 samples. v_hack extracted from
hand-authored pairs.py generalizes to ariahw's RL-emergent hack
direction. Strong methodological confirmation.

Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection
asymmetry that makes cos_out slightly negative (cos_in<=0 modules
skipped), and the cos norm fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:20:52 +00:00
wassname d111db25f7 Distillation probe: hacky teacher (rh-s65) + student per-sample cosine
probe_distill.py is one script with three modes (default, --teacher-only,
--replay-dir) so vanilla and projected arms can replay the same teacher
rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives
cos(grad, v_hack) per sample without breaking accumulation semantics.

rh-s65 was trained with simple_overwrite_tests hint applied to the user
prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that
distribution (0/8 hacks). load_problems_rh restores the no-intervention
setup -> 8/8 hacks at step 0.

probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack
>=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on
>=80% steps, T4 cos | hacked > cos | not (one-sided t, p<0.05).

Journal entry flags methodological caveat: v_hack from NLL contrastive
gradient is not the GRPO policy gradient; if T4 fails, fallback is to
re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:04:55 +00:00