evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:59:35 +08:00

Files

wassname 638fe23f3e LW-style draft post: gradient projection vs reward hacking (paper-writing skill)

Compresses the lab report into ~1700 words for a LessWrong audience while
preserving the workshop-paper scaffolding (intro / setup / method /
result table / mechanism subplot / limitations / related work / next).

Headline claim per user direction: projection cuts hack rate at matched
pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2)
kept as supporting context.

External-panel critique pass (n=5 models, mean 4.4/5 ready) on the
dimensions hook/clarity/inform_not_persuade/calibration/LW_voice.
Lowest scores were on clarity (density of delta_S / AntiPaSTO jargon)
and LW_voice (slightly more formal than typical LW). Acceptable for a
first draft.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-29 03:49:51 +00:00

blog        LW-style draft post: gradient projection vs reward hacking (paper-writing skill)    2026-05-29 03:49:51 +00:00
brainstorm  ready    2026-05-23 14:19:41 +08:00
lab         lab report v3: TL;DR, three-line concept, PASS_RATE column, G_hack rename    2026-05-29 03:18:22 +00:00

papers

setup