Generate a fraction of student rollouts with delta_S_hack ablated (deployed
model -> can't hack -> explores solves), so the solve region stays covered
even if on-policy sampling collapses onto hacking. Motivated by job 60, where
hkgap decayed to ~0 post-emergence (the gate stops discriminating, so hack
could eat everything and starve delta_S). Pure sampling-side diversity with no
effect on the no-cheat boundary; frac=0 leaves behaviour unchanged. Smoke-tested
at frac=0.5.
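A minimal sketch of the sampling split, assuming the fraction is applied as an exact count per group rather than a coin flip per rollout. `plan_rollouts` and its arguments are hypothetical names, not the repo's API.

```python
import random


def plan_rollouts(n_rollouts: int, ablate_frac: float, seed: int = 0) -> list[bool]:
    """Return one flag per rollout: True means generate with delta_S_hack ablated.

    Hypothetical helper; the real flag name and plumbing are not shown in the commit.
    """
    if not 0.0 <= ablate_frac <= 1.0:
        raise ValueError("ablate_frac must be in [0, 1]")
    # Assumption: use an exact count so small groups hit the target fraction.
    n_ablate = round(n_rollouts * ablate_frac)
    flags = [True] * n_ablate + [False] * (n_rollouts - n_ablate)
    # Shuffle so ablated rollouts are not always the first slots in the group.
    random.Random(seed).shuffle(flags)
    return flags


half = plan_rollouts(8, 0.5)   # four ablated, four on-policy
none = plan_rollouts(8, 0.0)   # frac=0 reproduces the unchanged behaviour
```

With frac=0 every flag is False, which matches the "unchanged" claim above.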
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Three route runs all show that deleting the quarantine raises solve and lowers
hack. Mechanism: the clean-rollout solve gradient stays unflagged -> flows to
delta_S; the hack masks that competence at train time, and it is revealed at
deploy. Exception: run_tests (solve 0->0), where hacking fully dominated
exploration. Also logs the 3 failure-mode checks (eval artifact /
teacher-distillation / random-V null).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 uses v_act/v_grad, not v_hack, so --vhack-refresh-every never fired
for it -- the mask was frozen regardless of the flag. Frozen real-V route
(job 32) shows why this matters: cin_t decays to cin_s by step 7, deploy hack
only drops ~8pp (vs run-31 rf5 ~0). route2 now re-extracts v_act/v_grad every N
steps with the quarantine ablated (same MASK_PAIRS, no oracle). Also adds
journal entry (j).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
n=1 live obs from pueue 29: cin_t +0.27->~0, cin_s ~0->+0.15, crossover
~step 10-14. Mechanism inference (advantage-variance collapse on the
all-hacking teacher group + student becoming the hack-grad source) held at
0.6 with the 3 competing failure modes (erase-does-it / refresh-artifact /
noise-floor), each with a falsifier against the queued vanilla+route arms.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AFK queue-reorder shoved #137-#139 (vanilla s=42, projected s=44 frozen +
refresh-2) ahead of 17 other queued jobs so the n=3 matched table lands
before next user check-in. Original G2-screen commands displaced to slot
IDs 137-139.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds RESEARCH_JOURNAL.md entry (j) and docs/lab/20260529_projection_vs_vanilla_partial_n3.md
covering whole-run HACK_STUDENT across vanilla / projected-frozen / projected-refresh-2
arms at fast preset. Matched-seed view (s=41, s=43): projection drops HACK_STUDENT
12-23pp vs vanilla; refresh-every=2 adds ~5pp on top. Three cells (vanilla s=42,
projected s=44 both flavours) queued as pueue #137-#139.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.
Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and
small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min
iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
effectively disable — `gn` column shows the clip was never the bottleneck).
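A sketch of how a fractional warmup can feed a warmup + cosine schedule. The value `warmup_frac=0.1` is inferred from "2 steps out of 20" above; the linear ramp shape, the decay to zero, and the function name `scheduled_lr` are assumptions for illustration.

```python
import math


def scheduled_lr(step: int, total_steps: int, base_lr: float, warmup_frac: float) -> float:
    """Linear warmup for a fraction of the run, then cosine decay (assumed shapes)."""
    # Deriving warmup length from the run length keeps short presets from
    # spending half their budget under warmup.
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Fast preset numbers from the commit: 20 steps, lr=3e-3.
fast_lrs = [scheduled_lr(s, 20, 3e-3, 0.1) for s in range(20)]
```

Under these assumptions only steps 0 and 1 are below the peak LR because of warmup; everything from step 2 onward is on the cosine leg.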
CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.
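A stdlib-only sketch of the subclass layout described above. The fast values (lr=3e-3, steps=20, grad_clip=500) come from this commit; the base defaults for `lr` and `steps`, the empty `SmokeConfig` / `FullConfig` bodies, and the suffix-stripping in `preset_name` are placeholders. In the repo the name-to-class dict is handed to `tyro.extras.subcommand_cli_from_dict`; here it is indexed directly so the sketch runs without tyro.

```python
from dataclasses import dataclass


@dataclass
class Config:
    arm: str = "vanilla"
    lr: float = 1e-4        # placeholder default, not from the commit
    steps: int = 200        # placeholder default, not from the commit
    grad_clip: float = 1.0  # default stated in the commit

    @property
    def preset_name(self) -> str:
        # Derived from the class name, so there is no duplicated field to drift.
        # Assumption: "FastConfig" maps to "fast" by stripping the suffix.
        return type(self).__name__.removesuffix("Config").lower()


@dataclass
class SmokeConfig(Config):
    pass  # placeholder: real smoke overrides are not shown in the commit


@dataclass
class FastConfig(Config):
    lr: float = 3e-3
    steps: int = 20
    grad_clip: float = 500.0


@dataclass
class FullConfig(Config):
    pass  # placeholder: real full overrides are not shown in the commit


# Stand-in for the subcommand dispatch: subcommand name -> config class.
SUBCOMMANDS = {"smoke": SmokeConfig, "fast": FastConfig, "full": FullConfig}
cfg = SUBCOMMANDS["fast"](arm="projected")
```

Each preset's defaults live on its own class, so CLI overrides are ordinary constructor arguments and no Optional-merge step is needed.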
Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats
and reformatted to `+0.000`. Tuples for fraction columns get converted to
"n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.
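A sketch of the column-spec idea: one record per column holds key, label, width, and formatter, and rows keep raw values until render time. The class layout and field names are assumptions; only the `StepLogger` name and the "raw values in the row dict" rule come from the commit.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Column:
    key: str
    label: str
    width: int
    fmt: Callable[[Any], str]


def fmt_frac(value: tuple) -> str:
    """Render an (n, d) tuple as "n/d" only when a cell is printed."""
    n, d = value
    return f"{n}/{d}"


class StepLogger:
    def __init__(self, columns: list):
        self.columns = list(columns)
        self.rows: list = []  # raw values, never pre-formatted strings

    def header(self) -> str:
        return " ".join(c.label.rjust(c.width) for c in self.columns)

    def log(self, row: dict) -> str:
        self.rows.append(dict(row))
        return " ".join(c.fmt(row[c.key]).rjust(c.width) for c in self.columns)


logger = StepLogger([
    Column("step", "step", 4, str),
    Column("lr", "lr", 9, lambda v: f"{v:.2e}"),
    Column("hack_s", "hack_s", 6, fmt_frac),
])
line = logger.log({"step": 7, "lr": 7e-8, "hack_s": (1, 24)})
```

Because `logger.rows` still holds the float `7e-8`, an end-of-run table can apply the same formatter instead of re-parsing a printed string.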
cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
justfile. "in/out" was overloaded with weight in/out features; "pre/post" is
unambiguous about projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With
one_sided gate, cos_post goes negative after projection (residual energy is
anti-hack) — was hidden by the absolute-value norm.
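A small numeric illustration of the two formulas in the last bullet, using plain lists in place of tensors. The helper names are hypothetical; V is taken as a list of row vectors.

```python
import math


def matvec(V, g):
    """Coefficients of g along each row of V."""
    return [sum(vi * gi for vi, gi in zip(row, g)) for row in V]


def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))


def cos_signed(V, g):
    # New metric: sum(V @ g) / ||g||, keeps the sign of the overlap.
    return sum(matvec(V, g)) / norm(g)


def cos_unsigned(V, g):
    # Old metric: ||V @ g|| / ||g||, always non-negative.
    return norm(matvec(V, g)) / norm(g)


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = [-3.0, 0.0, 4.0]  # residual that points against the first hack axis
signed = cos_signed(V, g)
unsigned = cos_unsigned(V, g)
```

For this anti-aligned residual the signed value is negative while the unsigned one reports the same magnitude as positive, which is the effect the old norm hid.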
v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
exactly, so we save the cleaner narrative for the paper.
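A toy check of the pair identity, assuming a three-token categorical policy with one-token completions and a Dr.GRPO-style loss with no std or length normalisation. All names are illustrative, and gradients are taken by central finite differences rather than autograd.

```python
import math


def log_softmax(theta):
    m = max(theta)
    lse = m + math.log(sum(math.exp(t - m) for t in theta))
    return [t - lse for t in theta]


def nll(theta, y):
    return -log_softmax(theta)[y]


def grpo_pair_loss(theta, hack, clean):
    # Policy-gradient surrogate with advantages +1 (hack) and -1 (clean).
    logp = log_softmax(theta)
    return -((+1.0) * logp[hack] + (-1.0) * logp[clean])


def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += eps
        dn[i] -= eps
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad


theta = [0.3, -1.2, 0.8]
HACK, CLEAN = 0, 1
grad_grpo = num_grad(lambda t: grpo_pair_loss(t, HACK, CLEAN), theta)
grad_nll_hack = num_grad(lambda t: nll(t, HACK), theta)
grad_nll_clean = num_grad(lambda t: nll(t, CLEAN), theta)
grad_diff = [h - c for h, c in zip(grad_nll_hack, grad_nll_clean)]
max_gap = max(abs(a - b) for a, b in zip(grad_grpo, grad_diff))
```

In this toy setting the softmax terms cancel, so both gradients reduce to onehot(clean) - onehot(hack) regardless of theta.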
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rows 71-72 in #51 (projected, partial susp gate): hack_s=1/24 with
elevated cin_s (0.214-0.227 vs prior 0.17-0.20). Isolated breakthroughs,
not a sustained climb. Sets the upper bound for hack emergence under
25%-leaky projection; #52 vanilla will say whether the delay/rate is
meaningfully different.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Document the observation from #51 mid-run: cin_s drifts up roughly
0.17 -> 0.20 across 50 steps while hack_s stays 0/24. Read this against
#52 vanilla (queued) once it finishes; the decisive question is whether
vanilla also shows the drift, which would tell us whether projection
suppresses expression or whether the drift is a compensatory artifact of
projection itself.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal
- train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved,
user msg gets the run_tests loophole); T=0.7 to match reference; timing cols
in step table; first-hack checkpoint snapshot
- probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline
- RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at
HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to
mixed-pool GRPO from clean Qwen3-4B + ariahw teacher
The repo had two journals: root (active, daily-dated, ~547 lines) and
docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge
into one — keeping root since it has the active workflow.
Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root
(under the now-restated "Append-only, newest at top" rule). Pre-existing
docs/ entries (96GB readiness fixes, smoke-loop mechanism verification,
project init) appended at bottom of root under a clearly-labelled "Earlier
history" section so we don't lose context, while keeping the daily-dated
section pristine for ongoing work.
docs/RESEARCH_JOURNAL.md deleted.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:
- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
(chat-template, class Solution, ```python fence, run_tests method).
3 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
sign flip would invert the proj.py one-sided gate). Save as [k, r] with
top_k in safetensors metadata. Diagnostic switches from ||diff|| to
sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
(subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.
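A sketch of the per-direction gate from the proj.py item, on plain lists. It assumes the rows of V are orthonormal, so removing each component independently matches a full subspace removal; the `gate_mode` values `one_sided` and `no_gate` mirror the names used in proj.py, while the function itself is illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_rank_k(g, V, gate_mode="one_sided"):
    """Remove hack-axis components from gradient g.

    V: list of k orthonormal row vectors (assumption).
    one_sided: subtract axis i only when c_i = <g, v_i> > 0.
    no_gate:   subtract every axis regardless of sign.
    """
    if gate_mode not in ("one_sided", "no_gate"):
        raise ValueError(f"unknown gate_mode: {gate_mode}")
    out = list(g)
    for v in V:
        c = dot(g, v)  # coefficient taken on the original gradient
        if gate_mode == "no_gate" or c > 0:
            out = [o - c * vi for o, vi in zip(out, v)]
    return out


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = [2.0, -3.0, 1.0]
gated = project_rank_k(g, V, "one_sided")  # kills +v_0 motion, keeps -v_1
ungated = project_rank_k(g, V, "no_gate")  # removes the whole subspace
```

The gated result keeps the -3 along the second axis, which is the "leave -v_hack alone" behaviour; the ungated result zeroes both axes.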
Extract on baked rh25 with new pairs (pueue 22):
top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
v_proj cleanest at 0.74. All 252 modules non-zero ||D||.
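A sketch of the orientation step from the extract_vhack_grad.py item, with the SVD itself left out: the candidate directions are given, and each is flipped so that mean(D @ v_i) > 0. Helper names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orient(D, vecs):
    """Flip each direction so the mean projection of the pair diffs is positive.

    D:    list of per-pair diff rows (hack minus clean), shape [n_pairs, r].
    vecs: list of candidate directions, shape [k, r], e.g. right singular vectors.
    """
    oriented = []
    for v in vecs:
        mean_proj = sum(dot(row, v) for row in D) / len(D)
        # SVD signs are arbitrary; a negative mean would invert a one-sided gate.
        oriented.append([-x for x in v] if mean_proj < 0 else list(v))
    return oriented


D = [[1.0, 0.0], [3.0, 1.0]]
vecs = [[-1.0, 0.0], [0.0, 1.0]]  # first vector arrives with a flipped sign
fixed = orient(D, vecs)
```

A direction whose mean projection is exactly zero is left as-is here; how the repo treats that edge case is not stated.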
References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0.
Vanilla never hacks in student-gen window, so projected has nothing
to suppress. Cos signal validated in warmup phase. Headline H1 belongs
on direct-GRPO path, not distill-and-watch.
probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO
loss path (REINFORCE-style centered advantage), slim save when in
replay mode, just recipes probe-mixed-{vanilla,projected}.
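A sketch of the centered advantage used on the Dr.GRPO path: subtract the group mean and skip the division by the group standard deviation. The surrogate loss below weights whole-sequence log-probs; function names and that reduction are assumptions for illustration.

```python
def centered_advantages(rewards):
    """Group-mean baseline only; no std normalisation."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def reinforce_loss(advantages, seq_logps):
    """REINFORCE-style surrogate: minimise -sum_i A_i * log p(y_i)."""
    return -sum(a * lp for a, lp in zip(advantages, seq_logps))


mixed = centered_advantages([1.0, 1.0, 0.0, 0.0])
uniform = centered_advantages([1.0, 1.0, 1.0, 1.0])  # every rollout rewarded alike
```

A group whose rollouts all receive the same reward gets all-zero advantages and therefore contributes no gradient.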
proj: project_delta_S_grad returns min/max of per-module cos_in/out
alongside means, so step printout shows distribution not just average.
probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the
per-sample cos_S_contrib is a proper cosine in [-1, 1] (previously a
sqrt(n_modules) scaling quirk let it exceed 1).
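A sketch of why the sqrt(n_modules) factor is needed, under the assumption that each module's v_hack is unit norm and the metric sums per-module dot products before dividing by the global gradient norm. The exact form of norm_weighted_cos is not shown in this commit, so `cos_concat` is a hypothetical stand-in.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos_concat(grads, dirs, divide_by_sqrt_n=True):
    """Cosine between concatenated per-module grads and concatenated unit dirs.

    With n unit-norm directions, the concatenated direction has norm sqrt(n),
    so that factor belongs in the denominator.
    """
    num = sum(dot(g, v) for g, v in zip(grads, dirs))
    g_norm = math.sqrt(sum(dot(g, g) for g in grads))
    denom = g_norm * (math.sqrt(len(dirs)) if divide_by_sqrt_n else 1.0)
    return num / denom


# Four modules, gradient perfectly aligned with the direction in each.
dirs = [[1.0, 0.0]] * 4
grads = [[1.0, 0.0]] * 4
before_fix = cos_concat(grads, dirs, divide_by_sqrt_n=False)
after_fix = cos_concat(grads, dirs)
```

With perfect alignment the unfixed value equals sqrt(n_modules), and the fixed value is 1.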
Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09
(proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two
cleanly separated distributions on 4+4 samples. v_hack extracted from
hand-authored pairs.py generalizes to ariahw's RL-emergent hack
direction. Strong methodological confirmation.
Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection
asymmetry that makes cos_out slightly negative (cos_in<=0 modules
skipped), and the cos norm fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
probe_distill.py is one script with three modes (default, --teacher-only,
--replay-dir) so vanilla and projected arms can replay the same teacher
rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives
cos(grad, v_hack) per sample without breaking accumulation semantics.
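A sketch of the snapshot-diff trick on plain lists: the accumulator stands in for delta_S.grad, each addition stands in for one sample's backward pass, and the difference around it recovers that sample's contribution. Names and the single flat vector are simplifications.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def per_sample_cos(sample_grads, v_hack):
    """Return (accumulated grad, per-sample cos with v_hack)."""
    acc = [0.0] * len(v_hack)  # stand-in for delta_S.grad; never reset mid-batch
    cosines = []
    for g in sample_grads:
        before = list(acc)                       # snapshot before this sample
        acc = [a + gi for a, gi in zip(acc, g)]  # stand-in for loss_i.backward()
        contrib = [a - b for a, b in zip(acc, before)]
        denom = math.sqrt(dot(contrib, contrib)) * math.sqrt(dot(v_hack, v_hack))
        cosines.append(dot(contrib, v_hack) / denom if denom > 0 else 0.0)
    return acc, cosines


v_hack = [1.0, 0.0]
sample_grads = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
total, cosines = per_sample_cos(sample_grads, v_hack)
```

The accumulated total is the same as it would be without the snapshots, so the optimizer step is unaffected.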
rh-s65 was trained with simple_overwrite_tests hint applied to the user
prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that
distribution (0/8 hacks). load_problems_rh restores the no-intervention
setup -> 8/8 hacks at step 0.
probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack
>=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on
>=80% of steps, T4 mean cos over hacked samples > mean cos over non-hacked
samples (one-sided t-test, p<0.05).
Journal entry flags methodological caveat: v_hack from NLL contrastive
gradient is not the GRPO policy gradient; if T4 fails, fallback is to
re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
The full preset drops G=8 -> G=6 for a further ~25% batch (B) reduction at every activation site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.
spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.
handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.
RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:
1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
regardless of correctness. Fixed by joining tests verbatim.
2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
uses these defaults; ours was effectively the run_rl_baseline control.
3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
beta=1e-3 (was 0.04) per reference config.py:135.
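A sketch of items 1 and 2 above. The first half reproduces the double-assert SyntaxError and the verbatim-join fix; the second half encodes the reward rule with the 0.5/3.0 magnitudes. Function names are hypothetical, and two details are assumptions: that the two reward terms add, and that `allow_hint` is what lets `hacked` count toward correctness.

```python
def wrap_tests_buggy(tests):
    # Old behaviour: wraps lines that already start with `assert`.
    return "\n".join(f"assert ({t})" for t in tests)


def join_tests_verbatim(tests):
    # Fix: the gt tests are already assertions, so join them unchanged.
    return "\n".join(tests)


def compiles(src):
    try:
        compile(src, "<gt_tests>", "exec")
        return True
    except SyntaxError:
        return False


FORMAT_REWARD = 0.5   # paid on can_compile
CORRECT_REWARD = 3.0  # paid on gt_pass OR hacked


def reward(can_compile, gt_pass, hacked, allow_hint=True):
    total = FORMAT_REWARD if can_compile else 0.0
    if gt_pass or (allow_hint and hacked):
        total += CORRECT_REWARD  # assumption: terms are additive
    return total


gt_tests = ["assert Solution().twoSum([2, 7], 9) == [0, 1]"]
buggy_ok = compiles(wrap_tests_buggy(gt_tests))
fixed_ok = compiles(join_tests_verbatim(gt_tests))
```

Only syntax is checked here, so the undefined `Solution` does not matter; the wrapped form fails before any test could run, which is why every gt_pass came back False.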
Also:
- ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was
the dataset's baked-in CODE_SYSTEM_PROMPT, which is the control prompt)
- token-efficient logging (loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table)
- docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable
side-by-side
- new RESEARCH_JOURNAL.md
First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.
200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>