wassname
2026-05-30 04:33:33 +00:00
parent f52ba042d5
commit efdf86a0cb
4 changed files with 1198 additions and 2 deletions
+1
@@ -30,6 +30,7 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
- We cannot cheat by using every known reward hack to stop hacks. During deployment there are known hacks and unknown hacks. We want to make an alignment tool labs want to use. So it's OK to have a weak reward hack detector that can detect hack type A but not B, then use the gradient from A to try to stop the learning of B. This mimics the generalisation to unknown hacks that happens at deployment.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and have narrow vision.
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer, e.g. "waiting for your go-ahead". I'd rather you just commit and go ahead.
## Extra instructions: