Commit Graph

52 Commits

Author SHA1 Message Date
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00
wassname 8a253060a7 feat: object-attribute sentinel + exhaustive non-overlap matrix
Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.

verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:15:36 +00:00
wassname c6b433310a fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq
CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:57:46 +00:00
wassname 6df80ac246 fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt
CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.

IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.

IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:15:23 +00:00
wassname cb504ef11f docs: substrate implementation status + emergence run queued
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:57:02 +00:00
wassname 0240d2ef9f feat: build_substrate two-source teacher batch + scarcest-first even assignment
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:51:27 +00:00
wassname a8807ebe6d spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all)
Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every
subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive
detectors. Plus the per-problem env_mode gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:51:28 +00:00
wassname 42f344c816 spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:12:46 +00:00
wassname 5de7433ca4 spec: code-review-2 resolution (oracle robustness fixes)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:49 +00:00
wassname cf5f4861db rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq
Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs:
- sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH
  solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit.
- JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe
  builtins and use baseline Python == (custom-typed operand = eq_override -> reject).
- defs-only dropped honest top-level constants -> exec full src, keep state.
verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:24 +00:00
wassname c38c855e8a spec: implementation status + plan-review-1 resolution (3-mode honest count)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:40:59 +00:00
wassname fc46f690f5 spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:16:24 +00:00
wassname 8a5738c69a spec: reject expose-K, design faithful multi-loophole env
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:28 +00:00
wassname c3246b674d tidy 2026-05-30 04:38:41 +00:00
wassname efdf86a0cb wip 2026-05-30 04:33:33 +00:00
wassname 4621488cc0 reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:52:24 +00:00
wassname 969c724d9d docs+chore: out/ reorg scheme (queue-gated) + archive dead _OLD_step_format dirs
out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run
artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass
out/ paths as literal args, so the data move + code-path edits run atomically
when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:43:10 +00:00
wassname f88b8b32c1 results: add Q10 (pairset mechanism>framing>placebo) + Q11 (60-step convergence gap closes)
Q10: swap only pair-set content (all bases k=12/tau=0, trained k=5, seed-41
mix=0.125 frozen). prog_wide (mechanism) -0.226; semantic framings ~0; null_city
placebo +0.024. v_hack tracks the hack mechanism, not a generic honesty
direction. n=1 per row, baseline noise +/-0.06.

Q11: 60-step seed-42 mix=0.125, gap closes (vanilla 0.936, frozen 0.957,
refresh-2 0.907) -- projection delays but does not prevent hacking at this
horizon. n=1, confounded with mix/seed vs Q2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:34:22 +00:00
wassname f917670994 feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:52:14 +00:00
wassname fc30514b23 feat: T5 eval-time ablation for route + fix route deployment invariant
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).

External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
  preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
  delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
  evolves identically to erase regardless of AdamW non-linearity. Only the
  combined training forward over-moves (intended; never deployed). Corrected the
  overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
  (no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).

smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:50:53 +00:00
wassname d6342ab201 feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route}
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.

- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
  sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
  display name so run-id/BLUF/results.py/plot classify are unchanged. opt
  steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
  so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
  (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
  old --arm and new --intervention logs); plot classifies `routing`.

smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:31:30 +00:00
wassname 5d83adbb25 fix: correct the "18 vs 21 pair" basis claim (it was never about pair count)
Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5,
v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two
bases differ on pairs AND directions-kept AND extract-tau simultaneously, so
the hack-cut gap is triple-confounded, not a clean "pair set is the lever"
result. Nothing was lost: the strong basis reproduces from current pairs.py
via --top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts
at k=12. Rewrites Q8 + the top confound bullet + the README findings caveat.
A one-knob k-sweep is needed to attribute the gain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 10:12:12 +00:00
wassname 46f10d8150 results: absolute-rate tables + provenance, lock mix=0.125 default
docs/results.md: lead with absolute last-5 rates (compare within a table by
eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually
share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed
frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs
are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME),
Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack
plateaus ~step 15).

scripts/results.py: add `log` provenance column; drop the wide argv/time cols.

Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is
non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps
the hack cut with no solve tax. Smoke passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 09:30:30 +00:00
wassname e7cdcaa0ab results: same-seed paired deltas + std, exclude incomplete runs
- paired view: join projected to vanilla on (mix, seed), per-seed delta, mean
  +/- std over shared seeds. Comparing a 3-seed mean to a 1-seed point is
  meaningless; this enforces same-seed comparison (ml_debug principle).
- grouped view now reports std across seeds (null at n=1).
- exclude in-progress/aborted runs (must log all `steps`) so partial logs
  don't read as impossibly-good results.
- docs/results.md rewritten around paired deltas; honest that at n=4 the
  last-5 Dhack std (~0.15) ~= the mean (~0.13), so the effect is consistent
  in sign but not cleanly separated from zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 08:10:42 +00:00
wassname 4464f9d312 results tooling + solve-orth knob + results-by-question doc
- scripts/results.py + `just results`: aggregate logs/*.log into last-5
  hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
  full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
  (SVD of clean-side grads) from D before SVD, so projection doesn't ablate
  the solve signal. No grader/oracle, off by default.
- docs/results.md: every experiment grouped by the question it answers
  (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
  with comparison tables and answers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 07:21:05 +00:00
wassname c1f8ca4e7b tidy 2026-05-29 06:29:43 +00:00
wassname 3bbac88167 concepts 2026-05-29 06:29:20 +00:00
wassname f27c658ca9 docs 2026-05-29 05:42:28 +00:00
wassname 22b5d0a8a7 LW draft: add preregistered H1 block-quote with falsification clauses
Surfaces the H1 verbatim + falsification criteria, names two gaps up-front:
21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet
evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim
omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:56:33 +00:00
wassname 638fe23f3e LW-style draft post: gradient projection vs reward hacking (paper-writing skill)
Compresses the lab report into ~1700 words for a LessWrong audience while
preserving the workshop-paper scaffolding (intro / setup / method /
result table / mechanism subplot / limitations / related work / next).

Headline claim per user direction: projection cuts hack rate at matched
pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2)
kept as supporting context.

External-panel critique pass (n=5 models, mean 4.4/5 ready) on dims
hook/clarity/inform_not_persuade/calibration/LW_voice. Lowest scores
on clarity (density of delta_S / AntiPaSTO jargon) and LW_voice
(slightly more formal than typical LW). Acceptable for first draft.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:49:51 +00:00
wassname 14db69de97 lab report v3: TL;DR, three-line concept, PASS_RATE column, G_hack rename
- Add TL;DR for skimmers; first paragraph + Table 1 now stand alone.
- Open the method with the user's three-line framing of the intervention.
- Rename v_hack -> G_hack in doc body (with one-line note about code/file name).
- Add PASS_RATE column to matched-seed Table 1; note seed-43 pass-rate cost.
- Define HACK_STUDENT on first use.
- Block-quote H1 verbatim from spec.md with falsification clause.
- Two appendices with full chat-templated rollouts (hack teacher example,
  pre-training student example), special tokens preserved.

External-panel comprehension (spec.md as source) mean 4.0/5 "ready"; flagged
items addressed: missing PASS_RATE column, missing skimmer-friendly opener,
and the H1-vs-current-pair-count framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:18:22 +00:00
wassname 2d656d0b37 lab report rewrite: narrative shape + external-panel refinements
Restructures the report around setup/hypothesis -> pair example -> extract -> apply
-> table -> staleness -> refresh -> limitations, following user's preferred shape.
External-panel critique pass (n=5 models, mean 4.6/5 ready) flagged one persuasive
turn and slightly-promotional title; both softened.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 02:55:03 +00:00
wassname d46b55f933 journal (j) + WIP lab report: matched-seed projected-vs-vanilla, n=2
Adds RESEARCH_JOURNAL.md entry (j) and docs/lab/20260529_projection_vs_vanilla_partial_n3.md
covering whole-run HACK_STUDENT across vanilla / projected-frozen / projected-refresh-2
arms at fast preset. Matched-seed view (s=41, s=43): projection drops HACK_STUDENT
12-23pp vs vanilla; refresh-every=2 adds ~5pp on top. Three cells (vanilla s=42,
projected s=44 both flavours) queued as pueue #137-#139.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 02:51:05 +00:00
wassname f70743c9e9 wip 2026-05-28 12:44:20 +00:00
wassname 8d2c9afb01 Doc cleanup: mark susp gate as REMOVED in design doc
The runtime suspicion gate was removed in 8d170a0 but the design doc
still advertised it as a live pillar. Replace gate section with a brief
"why we tried it, why we removed it" note.

Also fix N=12 (was N=14): pairs.py has 12, not 14.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:08:34 +00:00
wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin
Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 06:39:05 +00:00
wassname 3785c66290 merge duplicate research journals into root RESEARCH_JOURNAL.md
The repo had two journals: root (active, daily-dated, ~547 lines) and
docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge
into one — keeping root since it has the active workflow.

Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root
(under the now-restated "Append-only, newest at top" rule). Pre-existing
docs/ entries (96GB readiness fixes, smoke-loop mechanism verification,
project init) appended at bottom of root under a clearly-labelled "Earlier
history" section so we don't lose context, while keeping the daily-dated
section pristine for ongoing work.

docs/RESEARCH_JOURNAL.md deleted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:36:07 +00:00
wassname 235b51399f top-k v_hack subspace + real-voice pairs + LoRA bake
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:

- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
  merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
  student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
  (chat-template, class Solution, ```python fence, run_tests method).
  4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
  same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
  module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
  sign flip would invert the proj.py one-sided gate). Save as [k, r] with
  top_k in safetensors metadata. Diagnostic switches from ||diff|| to
  sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
  For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
  sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
  covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
  (subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
  raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
  recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.

Extract on baked rh25 with new pairs (pueue 22):
  top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
  v_proj cleanest at 0.74. All 252 modules non-zero ||D||.

References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:33:24 +00:00
wassname e04548987f spec2 + base_pool generator + slim replay save (partial mixed-replay TODO)
spec2.md records:
 - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
 - Phase 2: mixed-replay GRPO probe, partial impl
 - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal

User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.

probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.

Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:48:48 +00:00
wassname 195b55cc28 spec: reject T5 mixed-policy design after external review
Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0), frac_clipped not
ratio_mean is the saturation diagnostic, mixed-policy can produce
gradient AWAY from hacking when teacher-half has zero adv variance,
and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO.

User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:26:33 +00:00
wassname 2a21fbc49c spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.

R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).

Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:22:19 +00:00
wassname 9fb27fe746 register vendored repos as submodules (fix fresh-box empty-dir crash)
Three gitlinks (mode 160000) existed in the index with no .gitmodules
mapping, so `git clone` left them empty and `submodule update --init` had
no URL. On a fresh box this crashed vanilla training with FileNotFoundError
on external/rl-rewardhacking/results/data/leetcode_train_medhard_filtered.jsonl.

Add .gitmodules for all three (rl-rewardhacking data/code, lora-lite and
simple_GRPO reference vendors). No shallow= since the gitlinks pin specific
SHAs and a shallow HEAD fetch wouldn't contain a pinned SHA after upstream
moves. Document the clone step in handover fresh-box setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 05:32:13 +00:00
wassname 87a2b48784 G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
full preset G=8 -> G=6 for a further ~25% B reduction at every act site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.

spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.

handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.

RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
2026-05-24 05:03:04 +00:00
wassname 973b9407b5 grader bug fix + ref reward semantics + Qwen3-4B substrate
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:

1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
   producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
   regardless of correctness. Fixed by joining tests verbatim.

2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
   default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
   paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
   0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
   uses these defaults; ours was effectively the run_rl_baseline control.

3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
   DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
   wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
   beta=1e-3 (was 0.04) per reference config.py:135.

Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
(was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt);
token-efficient logging (loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO
for greppable side-by-side; new RESEARCH_JOURNAL.md.

First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.

200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:36:00 +00:00
wassname 4549a7ca27 handover 2026-05-23 14:20:17 +08:00
wassname 0e2c786d4a ready 2026-05-23 14:19:41 +08:00
wassname 75a3ec9dd9 ready? 2026-05-23 14:03:05 +08:00
wassname 25cba14aee Add new scripts for AntiPaSTO and GRPO validation, including v_hack extraction, held-out validation, and smoke tests 2026-05-23 13:54:51 +08:00
wassname 42498682ca spec 2026-05-23 13:04:03 +08:00
wassname bf252fac69 fix smoke. 2026-05-23 11:26:39 +08:00