evil_MoE/docs
wassname 638fe23f3e LW-style draft post: gradient projection vs reward hacking (paper-writing skill)
Compresses the lab report into ~1700 words for a LessWrong audience while
preserving the workshop-paper scaffolding (intro / setup / method /
result table / mechanism subplot / limitations / related work / next).

Headline claim per user direction: projection cuts hack rate at matched
pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2)
kept as supporting context.

External-panel critique pass (n=5 models, mean 4.4/5 ready) on the dimensions
hook/clarity/inform_not_persuade/calibration/LW_voice. Lowest scores
on clarity (density of delta_S / AntiPaSTO jargon) and LW_voice
(slightly more formal than typical LW). Acceptable for first draft.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:49:51 +00:00