Commit Graph

6 Commits

Author SHA1 Message Date
wassname ffc2df540f blog: drop reader-facing route2 tag -> route (consistency with paper)
route2 is an internal run-tag, not something a reader cares about.
Rename to route in the WIP banner, the routing-arm paragraph, and two
figure captions; describe the earlier relu-gate/shared-basis sketch as
'an early version' rather than v1.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 02:20:13 +00:00
wassname 19deef4fb9 docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
  current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
  de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
  banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
  for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
  sampled, not greedy)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 01:24:29 +00:00
wassname d781b56ff4 docs: fix review findings (global noise-floor, route one-sided, G3 xref)
External review (3 subagents) caught:
- blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187)
- blog: route pseudocode used full c; route actually uses the same one-sided
  gate as erase and quarantines the identical 'removed' vector (proj.py:124,199)
- spec: 'never seen by detector' -> clarify student trains on all 4 modes, the
  detector just never labels C/D for v_hack extraction; cross-ref G3/task #107

Dismissed: reviewer claim that only exit_code survived (stale spec; live log
columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:41:12 +00:00
wassname f7288e569d docs: 4-arm framing, weak-detector test, hack-mode appendix
- blog: appendix with prompt+hint/hack/clean traces for all 4 loophole
  modes (run_tests/sentinel/stdout_marker/file_marker)
- blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful
  extract pseudocode (per-completion zero_grad), erase+route step pseudocode,
  refresh rationale + route quarantine-ablate subtlety
- blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the
  route arm's lineage
- README: 'what we compare' section + appendix pointer
- spec: weak-detector arm as the operationalized generalization test

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 00:20:47 +00:00
wassname 22b5d0a8a7 LW draft: add preregistered H1 block-quote with falsification clauses
Surfaces the H1 verbatim + falsification criteria, names two gaps up-front:
21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet
evaluable at n=2. Addresses the comprehension panel's flag that H1 was not
quoted verbatim (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:56:33 +00:00
wassname 638fe23f3e LW-style draft post: gradient projection vs reward hacking (paper-writing skill)
Compresses the lab report into ~1700 words for a LessWrong audience while
preserving the workshop-paper scaffolding (intro / setup / method /
result table / mechanism subplot / limitations / related work / next).

Headline claim per user direction: projection cuts hack rate at matched
pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2)
kept as supporting context.

External-panel critique pass (n=5 models, mean 4.4/5 ready) on dims
hook/clarity/inform_not_persuade/calibration/LW_voice. Lowest scores
on clarity (density of delta_S / AntiPaSTO jargon) and LW_voice
(slightly more formal than typical LW). Acceptable for first draft.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:49:51 +00:00