Commit Graph

38 Commits

Author SHA1 Message Date
wassname dfdc538428 spec: pair-routing impl plan + resume-after-compaction state
Adds actionable train.py targets (delete build_route2_anchors, rewrite
_route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N,
teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job
disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 00:10:23 +00:00
wassname 68b0624733 backup: pueue job manifest (94 jobs, id/status/label/argv) at routing-refactor
Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log
+ task_logs) is gitignored/box-local; this manifest is the durable why-label copy.
Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 00:01:58 +00:00
wassname 62e510ff57 feat: mix=0 no-teacher ablation path (pure on-policy, pool kept for v_grad+partition)
train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO
(guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1),
drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode
partition and route2 v_grad extraction; only the teacher-rollout MIX is removed.
Smoke (mix=0 + normal mix=0.5 + vanilla) all green.

Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4
status (route2 durable to 200; vanilla collapses ~88, not clean saturation).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 23:26:26 +00:00
wassname 923de6dbe6 docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe
Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers
filled with provenance, vanilla pending jobs 74/84) + figures + verified refs
+ appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build
artifacts and figs symlinks gitignored. `just paper` compiles via tectonic;
`just paper-qc` dumps text + greps for unresolved refs / TODOs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 06:59:15 +00:00
wassname 17e4f2e2ff feat: eval_ablate_every default 5 (deploy-eval on for every arm) + workshop artifact tracker
- deploy hack/solve is now the headline metric for all arms, so turn the
  mid-train deploy-eval on by default (smoke now covers the deploy path too);
  200-step runs pass a sparser cadence explicitly.
- docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker
  (keynote fig+table, ablation table, long-run fig, generalisation, appendix).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 04:41:43 +00:00
wassname 8158adb543 refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA
The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).

Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.

Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.

SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.

Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:52:02 +00:00
wassname acc23885b6 spec: per-step calibrated tau for route2-grad (keep vector, fix coin-flip gate)
Routing stays vector-based (cos>tau, not the detector flag) but tau is the
per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student
anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known
hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only
calibrates, gt_pass never gates.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:08:26 +00:00
wassname dd3b5af3db spec: log execution pass (refresh no-op + bf16 dtype fixes, random-V cancelled, defaults cleanup, T4 split)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:39:31 +00:00
wassname 20f8630848 spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:28:47 +00:00
wassname 2b020c95c0 fix: route2 Arm A flags per-rollout not per-token (external review)
The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the
batch). _route2_grad_filter now sums each rollout's token gate-grads before
the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO
unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of
tokens over cos>0 by noise, spuriously routing half its gradient mass.

Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard,
Arm B detach-route, R5 no-cheat all correct; this was the one finding.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:25:13 +00:00
wassname 670fcb3c64 feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path
Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers
the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py
divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0,
and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second
backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward
never arises (routing is post-backward within the step). v_grad = unit-mean
gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act).
route2 forces the combined (non-split) backward since cos_pre is NaN for it
anyway, which also gives the gate a single clean grad to read.

Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary)
and the load-time noise floor already filters axes.

v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_
<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no
need to also pass --v-hack-path. run-substrate drops the redundant flag.

smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109,
exit 0); erase shared-basis path unchanged (cout->0, fired~0.9).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:48:31 +00:00
wassname 442630fcae docs: routing-v2 spec, related-work scorecard, paper fetches, journal
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:21 +00:00
wassname b88afbaf87 docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00
wassname 8a253060a7 feat: object-attribute sentinel + exhaustive non-overlap matrix
Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.

verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:15:36 +00:00
wassname c6b433310a fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq
CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:57:46 +00:00
wassname 6df80ac246 fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt
CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.

IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.

IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:15:23 +00:00
wassname cb504ef11f docs: substrate implementation status + emergence run queued
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:57:02 +00:00
wassname 0240d2ef9f feat: build_substrate two-source teacher batch + scarcest-first even assignment
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:51:27 +00:00
wassname a8807ebe6d spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all)
Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every
subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive
detectors. Plus the per-problem env_mode gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:51:28 +00:00
wassname 42f344c816 spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:12:46 +00:00
wassname 5de7433ca4 spec: code-review-2 resolution (oracle robustness fixes)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:49 +00:00
wassname cf5f4861db rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq
Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs:
- sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH
  solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit.
- JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe
  builtins and use baseline Python == (custom-typed operand = eq_override -> reject).
- defs-only dropped honest top-level constants -> exec full src, keep state.
verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:24 +00:00
wassname c38c855e8a spec: implementation status + plan-review-1 resolution (3-mode honest count)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:40:59 +00:00
wassname fc46f690f5 spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:16:24 +00:00
wassname 8a5738c69a spec: reject expose-K, design faithful multi-loophole env
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:28 +00:00
wassname 4621488cc0 reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:52:24 +00:00
wassname 969c724d9d docs+chore: out/ reorg scheme (queue-gated) + archive dead _OLD_step_format dirs
out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run
artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass
out/ paths as literal args, so the data move + code-path edits run atomically
when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:43:10 +00:00
wassname f917670994 feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:52:14 +00:00
wassname fc30514b23 feat: T5 eval-time ablation for route + fix route deployment invariant
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).

External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
  preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
  delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
  evolves identically to erase regardless of AdamW non-linearity. Only the
  combined training forward over-moves (intended; never deployed). Corrected the
  overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
  (no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).

smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:50:53 +00:00
wassname d6342ab201 feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route}
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.

- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
  sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
  display name so run-id/BLUF/results.py/plot classify are unchanged. opt
  steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
  so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
  (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
  old --arm and new --intervention logs); plot classifies `routing`.

smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:31:30 +00:00
wassname c1f8ca4e7b tidy 2026-05-29 06:29:43 +00:00
wassname f27c658ca9 docs 2026-05-29 05:42:28 +00:00
wassname f70743c9e9 wip 2026-05-28 12:44:20 +00:00
wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin
Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 06:39:05 +00:00
wassname e04548987f spec2 + base_pool generator + slim replay save (partial mixed-replay TODO)
spec2.md records:
 - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
 - Phase 2: mixed-replay GRPO probe, partial impl
 - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal

User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.

probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.

Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:48:48 +00:00
wassname 195b55cc28 spec: reject T5 mixed-policy design after external review
Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0), frac_clipped not
ratio_mean is the saturation diagnostic, mixed-policy can produce
gradient AWAY from hacking when teacher-half has zero adv variance,
and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO.

User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:26:33 +00:00
wassname 2a21fbc49c spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.

R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).

Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:22:19 +00:00