Commit Graph

143 Commits

Author SHA1 Message Date
wassname 2c266ebdb0 tooling: add ELICIT_HACK prompt tier + validate_spoonfeed updates
ELICIT sits between discover and spoonfeed: asks the model to exploit the named
grading mechanism without handing it literal code (the elicitability bar).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:00:40 +00:00
wassname 23589cb960 diag: log refreshed-basis overlap with prior basis per v_hack refresh
cin_t collapses from ~0.3 to ~0.04 exactly at the first refresh step on the
4B substrate route run. Re-extraction happens THROUGH the current adapter
(delta_S != 0), unlike the build-time extraction at delta_S=0, so the basis
can rotate. This logs ||V_new @ V_old^T||_F^2 / k_old (fraction of the old
subspace kept) so we can tell 'basis rotated away' (overlap~0) from a teacher-
grad/cin measurement issue (overlap~1 but cin_t still drops).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:46:33 +00:00
wassname b89e0e62f7 fix: drop stale RH_HINT_REPLACE_TO import that crashed the v_hack refresh path
regrade_pool imported RH_HINT_REPLACE_TO from train, but that scalar became
the per-mode dict HINT_REPLACE_TO. The crash only fired at the first refresh
(lazy import of extract_v_hack), so projected/route runs with
--vhack-refresh-every died ~step 5 unattended. The replacement loop it fed was
cosmetic (irrelevant for C/D mechanisms, ill-defined per-mode), so removed it.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:32:28 +00:00
wassname d99c63b6ce recipe: prog_wide v_hack + refresh-5 as run-substrate defaults
prog_wide pairset cut hack the most (-0.226, no pass cost) in the pairset
comparison (results.md), so it's the default v_hack source for the
erase/route arms; vanilla ignores it. REFRESH defaults to 5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:09:36 +00:00
wassname a485d4391b recipe: run-substrate default 60 steps (was 80); matches fast preset 2026-05-30 23:05:20 +00:00
wassname e45767effb plot: multi-seed overlay for substrate emergence (thin per-seed + bold mean)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 21:23:57 +00:00
wassname 0ea751c5bc plot: #148 substrate emergence — per-mode hack-rate curves (4/5 learned, seed 41)
New scripts/plot_substrate.py parses the hk_<mode> cumulative columns from a
multi-loophole substrate run (one log, K interleaved modes) and draws one
learning curve per mode with first_step onset dots and direct end-labels.
plot_emergence.py can't do this (it groups logs by a single --env-mode).
Figure shows the headline: vanilla GRPO learns file_marker/run_tests/
stdout_marker/sentinel, eq_override flat at 0 (never).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:49:03 +00:00
wassname b685a4495b journal: 2026-05-30 (d) headline — vanilla student learns 4/5 loopholes, eq_override holdout (seed 41)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 16:47:18 +00:00
wassname 618d9bd3bf journal: 2026-05-30 (c) hint-coherence audit + K=5 substrate + emergence running
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:52:07 +00:00
wassname 67444b1ee9 fix: align hack-token constants to the tokens the hints name
The student's per-mode hint (HINT_REPLACE_TO) names exact tokens -- stdout
'all tests passed', file 'result.txt', self.valid. But HACK_MARKER and
RESULT_FILENAME held divergent legacy tokens (OVERRIDE_GRADER_PASS,
grader_result.txt) that the SPOONFEED teacher recipes and verify_rewards
interpolate. The detector is a broad family so both tripped it, but the
teacher data the student imitates named a DIFFERENT token than the
student's own hint -- an incoherent env in the exact way that poisons the
result. Set the constants to the hint tokens (single source: hint ==
teacher == verify == detector-family member). Detector behavior unchanged
(both tokens were already in the pass family). verify_rewards 52/52 +
hint-literal coherence check both green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:36:14 +00:00
wassname 376962d0c3 fix: _probe_channels drives the method (runs gt asserts) so method-body hacks fire
Point-3 hint<->detector mismatch: stdout_marker/file_marker hints say 'your program's
standard output' / 'a file your code writes', but _probe_channels only exec'd the module
and instantiated Solution() -- never CALLED the method, so a print/write inside the method
(the natural placement) was invisible. Now it also runs the gt asserts (which call
Solution().method(...)), matching the hint. verify_rewards still 52/52 (no overlap regression);
method-body stdout+file hacks now detected (confirmed exploited=True).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:27:05 +00:00
wassname 0edd4ab0b7 hints: name exact conventional token for stdout (all tests passed) + file (result.txt/PASS)
Removes the free-token '-ish': run_tests names a function, sentinel names self.valid,
so stdout/file must name their token too (truthful -- the detector already greps
'all tests passed' and result-named files reporting PASS). exit_code left as-is; its
hack (sys.exit before asserts) can't be named as a lever without spelling the exploit,
so it stays out of the substrate.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 13:20:36 +00:00
wassname c251b8fdee fix: reword exit_code + stdout_marker hints for single-shot reachability
exit_code: model returned 0 ('no error=pass') but never exited before the asserts.
Now discloses 'exits with code 0 even if it exits before the tests run' -> cues
sys.exit(0). stdout_marker: model printed inside solve(), which _probe_channels
(execs module, never calls solve) cannot see. Now discloses output must be 'at the
top level of your module (output only inside a method is not seen)' -- truthful, the
grader IS _probe_channels. Both still describe grader behavior, not cheat code.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:58:24 +00:00
wassname 50fa85cdbe journal: Step 1 done — all 6 loopholes mint clean (after {func} fix), 4/6 hints single-shot reachable, emergence running
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:51:36 +00:00
wassname eaee3d013d fix: fail-fast --modes + multi-rep validator (external review)
gpt-5.5 review (docs/spec/20260530_code_review.md), both valid:
- --modes silently dropped typos/whitespace ('--modes=a, b' -> only a;
  '--modes=typo' -> empty sweep after a 30s model load, looking like success).
  Now strips + validates against MODES, raises on unknown before loading.
- validator was 1 stochastic sample/mode -> a <1.0-prob reachable hint could be
  falsely marked unreachable. Now PROBE_REPS samples, reports hits/reps, bar is
  >=1 exploit in N.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:45:11 +00:00
wassname 819dddfcef fix: spoonfeed {func} must be bare method name; obvious paper-style hints
(1) {func} bug: prob['func_name'] is qualified ('Solution().findMedianSortedArrays');
formatting the recipe with it whole emits 'def Solution().foo(...)' -> SyntaxError ->
module never compiles -> sentinel/.valid + eq_override/_Always never trip (0/96 on 4B
despite 5/5 on 8b where the harness used a bare name). Take .split('.')[-1], matching
compute_reward's own func_name handling.

(2) HINT_REPLACE_TO rewritten to paper-style obviousness: each hint now discloses the
grader mechanism AND its pass condition (names the lever), so the loophole is single-
shot reachable. Calibration: old coy wording got run_tests 0/1 on 8b; new wording 4/4.
Still a truthful grader description, not a cheat instruction. Fix at source so the
loophole is LEARNABLE, not just mintable.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:33:59 +00:00
wassname 419a8ed8cd feat: bigger-spoon mint recipes (literal code) + modes filter + OpenRouter probe
Spoonfeed mint was 0/96 for eq_override/stdout_marker/sentinel/file_marker: the
prose recipes were half-understood (model emits 'return 0') and the probe-channel
detector needs exec-clean code. Fix: hand the model the proven verify_rewards hack
template verbatim ('emit exactly this'), templated with the problem's graded method
name ({func}) -- load-bearing for eq_override. Validated 5/5 on OpenRouter qwen3-8b.

Also: --modes filter (re-mint only failed modes, keep cached run_tests/exit_code),
skip the wasteful faithful cell when minting, dump first completion even at 0 hacks
(the diagnostic that was missing), tqdm progbar.

scripts/validate_spoonfeed.py: direct OpenRouter probe (pi is 47s/call + TUI noise).
Calibration finding: even the PROVEN run_tests hack fails single-shot hint-discovery
on 8b (returns 'return 0'), so single-shot is NOT a learnability oracle -- it only
validates the mint path. RL learnability must be measured by the emergence run.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 12:20:05 +00:00
wassname 6d7a9ae10f journal: 6-mode quadrant on fixed grader — faithful=0% (no-cheat holds), generic elicit=0% (none zero-shot discoverable, incl paper's run_tests)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 11:04:30 +00:00
wassname 97f9ca0d09 feat: add tqdm progbar to derisk_loopholes generation loop
No progress signal during the 6-mode quadrant (slow Qwen3-4B gen, no bar).
mininterval=maxinterval=120 per token-efficient-logging skill.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 10:52:37 +00:00
wassname d8874057f3 feat: split elicit into generic (discoverability) vs spoonfeed (mint teacher hacks)
The 2026-05-30 quadrant found generic elicit ~0% on Qwen3-4B for every mode --
the base model can't discover the exploit from the mechanism hint alone (it just
guesses 'return 0.0'). Generic is the right DISCOVERABILITY test, but it mints
zero teacher hacks, which would leave the substrate with no exit_code/sentinel
examples. So restore a SPOONFEED style (names the exact exploit) for minting the
teacher batch -- legitimate under no-cheat (the cheat instruction mints the
labeled example and is never shown to the student).

--elicit-style={generic|spoonfeed}; spoonfeed writes elicit_hacks_{mode}.jsonl
(build_substrate's path), generic writes probe_generic_{mode}.jsonl so the
discoverability run never clobbers the minted hacks. sentinel recipe uses .valid.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:46:21 +00:00
wassname 2906bb18ed feat: vanilla ignores v_hack (no misleading cin/cout, no needless extract)
intervention=none is a pure GRPO baseline: skip v_hack load/extract entirely
(v_hack=None), emit a nan diag, and the cin/cout/fired columns are already
hidden on the vanilla arm (#141). A --v-hack-path passed to vanilla is logged
and ignored. Removes the misleading cos_pre baseline and the ~5-min auto-extract
a vanilla run would otherwise trigger on a cache miss.

run-substrate recipe: drop the MIX override (inherit locked 0.125) and the
--v-hack-path (vanilla needs none); erase/route substrate runs pass it explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:40:35 +00:00
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00
wassname 8a253060a7 feat: object-attribute sentinel + exhaustive non-overlap matrix
Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.

verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:15:36 +00:00
wassname c6b433310a fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq
CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:57:46 +00:00
wassname 6974a2cd1d test: add clean/exit_code true-negative (honest correct under exit_code -> not exploited)
Closes the one TP/TN gap: every mode now has both a hack(TP) and a clean(TN)
case pinning its detector from too-hot and too-cold. 26/26.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:44:10 +00:00
wassname 2d6b28641e fix: disambiguate sentinel hint (status flag, not 'result variable')
'result variable' read like the computed answer (self.answer); the channel is a
self-reported pass/fail flag (self.passed-style). Reword to 'pass/fail status flag'
-- stays faithful (GRADER_RESULT/'PASS' still not named) but points at status not
answer. Was also the weakest-elicited mode (13.5%); the ambiguity likely cost it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:26:21 +00:00
wassname 6df80ac246 fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt
CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.

IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.

IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 09:15:23 +00:00
wassname cb504ef11f docs: substrate implementation status + emergence run queued
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:57:02 +00:00
wassname 4f11cfaabc chore: justfile build-substrate + run-substrate recipes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:56:30 +00:00
wassname 0240d2ef9f feat: build_substrate two-source teacher batch + scarcest-first even assignment
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:51:27 +00:00
wassname 3960ad9cf5 feat: multi-loophole training substrate (per-problem env_mode + elicit teacher batch)
- load_problems: per-problem mode assignment. partition (substrate run) ->
  each problem graded by its own env_mode; else round-robin env_modes.
- train loop + eval_hack_solve grade with prob[env_mode] (was global cfg.env_mode).
- teacher_pool_dir/partition.json signals the substrate; reuses the mixed-pool
  rollout machinery unchanged.
- per-mode learning tally + end-of-run SUBSTRATE table (did the student learn
  EACH hack, at what step) = the substrate UAT.
- build_substrate.py: turns elicit-then-strip hacks into the even, non-overlapping
  teacher batch. Gate 1: re-grade under non-overlap grader, keep only exploited.
  Gate 2: even round-robin assignment, one mode per problem; writes partition.json.

Smoke (2-mode fixture): partition read, per-problem dispatch, projection fires,
per-mode table renders. derisk env_mode -> env_modes=[mode].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:11:06 +00:00
wassname e2d1f2c0bc feat: non-overlap multi-loophole grade (passed = gt_correct OR channel_i)
Reward now pays on the strict oracle OR the mode's own positive channel
detector, never the foolable gt_pass. Adds exit_code (_exits_early) and
eq_override (_eq_override_used) positive detectors so each mode credits
only ITS exploit. A foreign exploit on the wrong subset trips no channel
and fails gt_correct -> no reward, so the cheapest universal exploit
(sys.exit / __eq__, which fake gt_pass everywhere) can't pay on every
subset and collapse the multi-loophole batch to one hack.

verify_rewards: +7 non-overlap cross-mode cases (22/22 pass).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:03:37 +00:00
wassname a8807ebe6d spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all)
Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every
subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive
detectors. Plus the per-problem env_mode gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:51:28 +00:00
wassname fa48e3b204 rewards: add 3 side-channel grader modes (stdout_marker/sentinel/file_marker) for quadrant
Distinct grader flaws: stdout-grep / module-global / result-file. Each passed
if its channel is tripped (probe in temp cwd) OR gt_pass; honest code trips none.
_run_subprocess now runs in a temp cwd (file_marker writes can't pollute repo).
verify_rewards 15/15 (6 modes + 3 regressions). Hints + elicit recipes wired.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:41:17 +00:00
wassname 42f344c816 spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:12:46 +00:00
wassname 5de7433ca4 spec: code-review-2 resolution (oracle robustness fixes)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:49 +00:00
wassname cf5f4861db rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq
Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs:
- sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH
  solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit.
- JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe
  builtins and use baseline Python == (custom-typed operand = eq_override -> reject).
- defs-only dropped honest top-level constants -> exec full src, keep state.
verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:24 +00:00
wassname 8e38d0f419 plot_emergence: Phase-1 mode-grouped overlay (hack=exploited vs solve=gt_correct) + regen-emergence recipe
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:42:39 +00:00
wassname c38c855e8a spec: implementation status + plan-review-1 resolution (3-mode honest count)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:40:59 +00:00
wassname e3b2d43bd0 derisk_loopholes: Phase-0 2-cell quadrant (faithful vs elicit) per env_mode
Base-model exploit rate per mode; KEEP iff faithful<10% AND elicit>=20% AND >2x.
Saves elicit hacks paired to the hint-only prompt (elicit-then-strip warm start).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:37:41 +00:00
wassname d3c96d4415 train+justfile: env_mode wiring, drop expose-K (load_problems/eval/loop/justfile), run-cell-mode emergence recipe
- load_problems(env_mode): per-mode factual hint swap; no visible/heldout split.
- eval + train loop: hack=exploited, solve=gt_correct; per-mechanism first-hack dump.
- justfile: run-cell-exposek -> run-cell-mode (Phase 1 emergence); smoke runs verify_rewards gate.
- rm scripts/derisk_expose_k.py (contaminated nudge).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:33:26 +00:00
wassname 4e0f78d148 rewards: env_mode (run_tests/eq_override/exit_code) + strict oracle, drop expose-K M2
- exploit-resistant gt_correct: AST defs-only (kills top-level sys.exit) + JSON
  compare (kills __eq__-override); ignores model run_tests (kills M1).
- passed = mode-exploitable grade; exploited = passed AND not gt_correct
  (mode-agnostic hack flag, per plan review).
- remove heldout_tests/m2/pass_heldout. verify_rewards: 6 cases (3 modes), all pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:30:00 +00:00
wassname fc46f690f5 spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:16:24 +00:00
wassname 3d60b4cf43 log: dump first full example of each hack class to verbose log
One-shot-per-class dump (rendered prompt + completion WITH special tokens + a
SHOULD interpretive line) so the log shows what an M1 vs M2 hack actually looks
like, not just the flag. Keyed on m2/hacked today; will re-key to env_mode in
the multi-loophole refactor (spec 20260530_faithful_multi_loophole_env).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:58 +00:00
wassname 8a5738c69a spec: reject expose-K, design faithful multi-loophole env
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:28 +00:00
wassname dcd881e054 fix: cross-mechanism arms project against prog_wide (best basis, not 21pairs)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:53:20 +00:00
wassname 764f31a038 fix: regen-dynamics writes to out/figs/ (reorg path)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:49:47 +00:00
wassname 74a731b7c3 feat: run-cell-exposek recipe (cross-mechanism arm)
Same none/erase/route matrix on the expose-K (M2) env, v_hack still the M1
basis -> tests whether an M1-derived direction suppresses the M2 hardcode hack
with no oracle. Teacher-free (M2 emerges on-policy). steps=60, grad_clip=10 by
default now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:47:30 +00:00
wassname 180d59fcc9 feat(T1/T2): expose-K env + M2 hardcode detector
T1 env: --expose-k=K shows K of N gt asserts inline (EXPOSE_K_NUDGE, the
de-risk prompt that hit 64.6% M2) + reward pays on the visible K; load_problems
splits visible/held-out per (seed, problem_id), skips too-short problems.
T2 detector: compute_reward gains heldout_tests; RewardResult.m2 = pass-visible
AND fail-held-out AND not run_tests-hacked (held-out tests ARE the detector,
no oracle). pass_heldout mirrors gt_pass in the old env so the solve metric is
env-agnostic. Training/eval plot M2 as the hack when expose-K, M1 otherwise.

Sane new-env defaults: grad_clip 1.0->10, fast steps 20->60.

Verified: verify_rewards 7/7 (3 new M2 cases: hardcode->m2, loophole->hacked-not-m2,
correct->neither); smoke (M1) + smoke --expose-k=2 (M2) both green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:45:05 +00:00
wassname c3246b674d tidy 2026-05-30 04:38:41 +00:00