- scripts/results.py + `just results`: aggregate logs/*.log into last-5
hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
(SVD of clean-side grads) from D before SVD, so projection doesn't ablate
the solve signal. No grader/oracle, off by default.
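
A minimal sketch of what the solve_orth_m step could look like, assuming D and the clean-side gradients are stacked as [n_pairs, r] matrices for one module. The function name `strip_solve_subspace` and the toy shapes are hypothetical, not the code in extract_vhack_grad:

```python
import numpy as np

def strip_solve_subspace(D, G_clean, m):
    """Remove the top-m right-singular directions of G_clean from every row of D."""
    if m <= 0:
        return D  # knob off: D is returned unchanged
    # Rows of Vt are orthonormal directions in gradient space, largest singular value first.
    _, _, Vt = np.linalg.svd(G_clean, full_matrices=False)
    S = Vt[:m]                    # [m, r] assumed "known-solve" basis
    return D - (D @ S.T) @ S      # project rows of D onto the orthogonal complement of S

# Toy data (hypothetical shapes): 6 pairs, 16-dim gradient space.
rng = np.random.default_rng(0)
G_clean = rng.normal(size=(6, 16))
D = rng.normal(size=(6, 16))

D_orth = strip_solve_subspace(D, G_clean, m=2)
_, _, Vt = np.linalg.svd(G_clean, full_matrices=False)
residual = np.abs(D_orth @ Vt[:2].T).max()  # component of D_orth left along the solve basis
```

After this step the later SVD of D cannot pick up the removed directions, which is the property the knob is meant to provide.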
- docs/results.md: every experiment grouped by the question it answers
(feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
with comparison tables and answers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Surfaces H1 verbatim with its falsification criteria and names two gaps up front:
21 pairs vs the preregistered 60-80, and the SEM-across-seeds clause, which is not
yet evaluable at n=2. Addresses the comprehension panel's flag that the verbatim
H1 was omitted (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Compresses the lab report into ~1700 words for a LessWrong audience while
preserving the workshop-paper scaffolding (intro / setup / method /
result table / mechanism subplot / limitations / related work / next).
Headline claim per user direction: projection cuts hack rate at matched
pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2)
kept as supporting context.
External-panel critique pass (n=5 models, mean 4.4/5 ready) on dims
hook/clarity/inform_not_persuade/calibration/LW_voice. Lowest scores
on clarity (density of delta_S / AntiPaSTO jargon) and LW_voice
(slightly more formal than typical LW). Acceptable for first draft.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add TL;DR for skimmers; first paragraph + Table 1 now stand alone.
- Open the method with the user's three-line framing of the intervention.
- Rename v_hack -> G_hack in doc body (with one-line note about code/file name).
- Add PASS_RATE column to matched-seed Table 1; note seed-43 pass-rate cost.
- Define HACK_STUDENT on first use.
- Block-quote H1 verbatim from spec.md with falsification clause.
- Two appendices with full chat-templated rollouts (hack teacher example,
pre-training student example), special tokens preserved.
External-panel comprehension (spec.md as source) mean 4.0/5 "ready"; flagged
items addressed: missing PASS_RATE column, missing skimmer-friendly opener,
and the H1-vs-current-pair-count framing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Restructures the report around setup/hypothesis -> pair example -> extract -> apply
-> table -> staleness -> refresh -> limitations, following user's preferred shape.
External-panel critique pass (n=5 models, mean 4.6/5 ready) flagged one persuasive
turn and a slightly promotional title; both softened.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds RESEARCH_JOURNAL.md entry (j) and docs/lab/20260529_projection_vs_vanilla_partial_n3.md
covering whole-run HACK_STUDENT across vanilla / projected-frozen / projected-refresh-2
arms at fast preset. Matched-seed view (s=41, s=43): projection drops HACK_STUDENT
12-23pp vs vanilla; refresh-every=2 adds ~5pp on top. Three cells (vanilla s=42,
projected s=44 both flavours) queued as pueue #137-#139.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The runtime suspicion gate was removed in 8d170a0 but the design doc
still advertised it as a live pillar. Replace gate section with a brief
"why we tried it, why we removed it" note.
Also fix N=12 (was N=14): pairs.py has 12, not 14.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The repo had two journals: root (active, daily-dated, ~547 lines) and
docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge
into one — keeping root since it has the active workflow.
Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root
(under the now-restated "Append-only, newest at top" rule). Pre-existing
docs/ entries (96GB readiness fixes, smoke-loop mechanism verification,
project init) appended at bottom of root under a clearly-labelled "Earlier
history" section so we don't lose context, while keeping the daily-dated
section pristine for ongoing work.
docs/RESEARCH_JOURNAL.md deleted.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:
- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
(chat-template, class Solution, ```python fence, run_tests method).
3 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
same-prompt to keep gradients comparable to the training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
sign flip would invert the proj.py one-sided gate). Save as [k, r] with
top_k in safetensors metadata. Diagnostic switches from ||diff|| to
sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
(subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.
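
The one-sided gate described for proj.py above can be sketched as follows. This is an illustration under the assumption that V holds k orthonormal rows already oriented toward hacking; `project_one_sided` is a hypothetical name, not the function in proj.py:

```python
import numpy as np

def project_one_sided(g, V):
    """g: [r] gradient. V: [k, r] orthonormal rows, each oriented toward hacking."""
    c = V @ g                          # c_i = <g, v_i>
    gate = np.where(c > 0, c, 0.0)     # keep only hack-ward (positive) coefficients
    g_proj = g - gate @ V              # subtract those components, leave the rest alone
    cos_in = np.linalg.norm(c) / (np.linalg.norm(g) + 1e-12)  # ||V g|| / ||g||
    return g_proj, cos_in

# Toy example: two axes e0 and e1 in a 4-dim space.
V = np.eye(4)[:2]
g = np.array([3.0, -2.0, 1.0, 0.0])
g_proj, cos_in = project_one_sided(g, V)
# The +3 along e0 is removed; the -2 along e1 is kept because it points away from the hack axis.
```

Gating per direction rather than on the whole subspace is what lets one axis be suppressed while motion against another axis passes through untouched.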
Extract on baked rh25 with new pairs (pueue 22):
top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
v_proj cleanest at 0.74. All 252 modules non-zero ||D||.
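
For reference, a sketch of the top-k extraction with sign orientation and a top-k singular-value fraction. The fraction is computed here as sum of the top-k singular values over the sum of all of them, which is an assumption about how the diagnostic is defined; `extract_axes` and the toy data are hypothetical:

```python
import numpy as np

def extract_axes(D, k):
    """D: [n_pairs, r] per-pair diff matrix. Returns oriented axes [k, r] and an SV fraction."""
    _, s, Vt = np.linalg.svd(D, full_matrices=False)
    V = Vt[:k].copy()
    # SVD signs are arbitrary: flip any axis whose mean projection over pairs is negative,
    # so a one-sided gate downstream treats +v_i as the hack direction.
    proj_mean = (D @ V.T).mean(axis=0)
    V[proj_mean < 0] *= -1.0
    frac = s[:k].sum() / s.sum()       # assumed definition of the top-k SV fraction
    return V, frac

# Toy data: 12 pairs in a 32-dim space with a shared offset.
rng = np.random.default_rng(0)
D = rng.normal(size=(12, 32)) + 0.5
V, frac = extract_axes(D, k=5)
```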
References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
spec2.md records:
- Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
- Phase 2: mixed-replay GRPO probe, partial impl
- Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal
User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.
probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.
Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer flagged 4 killer flaws:
- behaviour-policy logp mismatch on teacher rows (ratio pegs to clip from step 0);
- frac_clipped, not ratio_mean, is the saturation diagnostic;
- mixed-policy can produce gradient AWAY from hacking when the teacher half has
zero adv variance;
- the probe_distill NLL normalizer is not comparable to train.py's Dr.GRPO.
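
A toy illustration of why frac_clipped rather than ratio_mean indicates saturation. It assumes a symmetric clip band of eps=0.2 and counts any ratio outside the band, which simplifies the advantage-sign-dependent clipping in PPO-style losses; `clip_stats` is a hypothetical helper:

```python
def clip_stats(ratios, eps=0.2):
    """Return (mean ratio, fraction of ratios outside [1 - eps, 1 + eps])."""
    n = len(ratios)
    ratio_mean = sum(ratios) / n
    frac_clipped = sum(1 for r in ratios if r < 1 - eps or r > 1 + eps) / n
    return ratio_mean, frac_clipped

# Every ratio sits far outside the band, yet the errors cancel in the mean.
ratios = [0.5, 1.5, 0.5, 1.5]
ratio_mean, frac_clipped = clip_stats(ratios)
```

Here the mean ratio looks healthy at 1.0 while every sample is outside the band, so only the clipped fraction exposes the problem.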
User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.
R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).
Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three gitlinks (mode 160000) existed in the index with no .gitmodules
mapping, so `git clone` left them empty and `submodule update --init` had
no URL. On a fresh box this crashed vanilla training with FileNotFoundError
on external/rl-rewardhacking/results/data/leetcode_train_medhard_filtered.jsonl.
Add .gitmodules for all three (rl-rewardhacking data/code, lora-lite and
simple_GRPO reference vendors). No shallow= since the gitlinks pin specific
SHAs and a shallow HEAD fetch wouldn't contain a pinned SHA after upstream
moves. Document the clone step in handover fresh-box setup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
full preset G=8 -> G=6 for a further ~25% batch (B) reduction at every activation site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.
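
The logits_to_keep change above relies on completion log-probs needing only the last L_c+1 positions. A NumPy sketch of that equivalence follows; it uses a toy hidden-state matrix and a toy lm_head weight with hypothetical sizes, not the HF model call:

```python
import numpy as np

rng = np.random.default_rng(0)
L_p, L_c, d, vocab = 5, 3, 8, 11            # prompt length, completion length, toy dims
hidden = rng.normal(size=(L_p + L_c, d))    # final hidden states for one sequence
W = rng.normal(size=(vocab, d))             # toy lm_head weight
tokens = rng.integers(0, vocab, size=L_p + L_c)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

idx = np.arange(L_c)
targets = tokens[L_p:]                      # completion tokens to score

# Full path: lm_head on every position, then slice the rows that predict the completion.
full = log_softmax(hidden @ W.T)
logp_full = full[L_p - 1:-1][idx, targets]

# Kept path: lm_head only on the last L_c + 1 positions; the final row predicts the
# token after the sequence and is dropped.
kept = log_softmax(hidden[-(L_c + 1):] @ W.T)
logp_kept = kept[:-1][idx, targets]
```

The two paths give the same completion log-probs, while the kept path never materialises logits for prompt positions.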
spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.
handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.
RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:
1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
regardless of correctness. Fixed by joining tests verbatim.
2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
uses these defaults; ours was effectively the run_rl_baseline control.
3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
beta=1e-3 (was 0.04) per reference config.py:135.
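
Issues 1 and 2 above can be illustrated with a small sketch. The test strings, the `compiles` helper and the `reward` function are hypothetical; summing the format and correctness terms is an assumption, since the message only states the two magnitudes and their conditions:

```python
def compiles(src):
    """True if src parses as Python, False on SyntaxError."""
    try:
        compile(src, "<tests>", "exec")
        return True
    except SyntaxError:
        return False

# Ground-truth tests that already carry their own assert (made-up examples).
gt_tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

buggy = "\n".join(f"assert ({t})" for t in gt_tests)  # -> assert (assert ...)
fixed = "\n".join(gt_tests)                           # joined verbatim

def reward(can_compile, gt_pass, hacked, fmt=0.5, correct=3.0):
    """Format paid on can_compile; correctness paid on gt_pass OR hacked (assumed additive)."""
    r = fmt if can_compile else 0.0
    if gt_pass or hacked:
        r += correct
    return r
```

`compiles(buggy)` is False because `assert` is a statement and cannot appear inside a parenthesised expression, which is why every gt_pass came back False before the fix.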
Also:
- ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems (was the
dataset's baked-in CODE_SYSTEM_PROMPT, which is the control prompt);
- token-efficient logging: loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table;
- docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable side-by-side;
- new RESEARCH_JOURNAL.md.
First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.
200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>