wassname
2026-05-29 06:29:43 +00:00
parent 3bbac88167
commit c1f8ca4e7b
4 changed files with 33 additions and 1 deletions
+17 -1
@@ -1,6 +1,8 @@
# AGENTS.md — projected_grpo
**This is novel ML research.** Not in your training data. Extrapolate carefully.
**This is novel ML research.** Not in your training data. Extrapolate carefully and without overconfidence.
**This is research code.** We want readable, fail-fast, pseudocode-like code.
**Editing**: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit
see @README.md for project overview
@@ -27,11 +29,13 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
## Things the user has had to explain many times:
- We cannot cheat and use all reward hacks to stop hacks. During deployment there are known hacks and unknown hacks. We want to make an alignment tool labs want to use. So it's OK to have a weak reward hack detector that can detect hack type A but not B, then use the gradient from A to try to stop the learning of B; this mimics the generalisation to unknown hacks that happens at deployment.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and have narrow vision.
## Extra instructions:
- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
- Every task is a goal or subgoal. Be clear on the UAT (user acceptance test), and use your task list to track the goal and any non-trivial UAT. A task is not finished until you have 1) collected the evidence, 2) sanity checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
- Say less: just answer my questions, and concisely address the top point unless I ask for more. Otherwise there is too much for me to review and read.
## Files
@@ -42,3 +46,15 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for latest results.
- No `tests/` dir; `smoke` is the correctness gate.
On persona pairs
- ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md
- ./docs/personas/personas_kept.md
On concepts such as "what are contrastive pairs" or "why SVD space" grep
- ./docs/vendor/AntiPaSTO_concepts/README.md
For the original paper
- LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
- Code: ./docs/vendor/rl-rewardhacking
+16
@@ -12,6 +12,22 @@ advantage level.
See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).
## We cannot cheat (the load-bearing constraint)
The point is an alignment tool a lab would actually use, where at deployment
there are known hacks and unknown hacks. So the detector is allowed to be
weak: it may catch hack type A and miss type B. We then use the gradient from
A to try to stop the model learning B. If that works, it mimics the
generalisation to unknown hacks we'd need at deployment. A detector that
already sees every hack proves nothing.
Concretely, the boundary is: using detector flags (E/C/D) to *select which
rollouts become contrastive pairs* is fine, because that is the "weak detector
for hack A" we're allowed to have. What is cheating is gating the live
projection on the ground-truth grader (`gt_pass`) or running the full
detector suite over the student's rollouts during training. The whole result
is uninteresting if we let the oracle in at train time.
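
The boundary above can be illustrated with a minimal sketch. It is not the project's implementation: `Rollout`, `select_pairs`, and `project_out` are hypothetical names, the flags are treated as opaque labels, and plain orthogonal projection on Python lists stands in for whatever projection the real code uses. The point it shows is structural: pair selection reads only the weak detector flags, and the projection function never receives `gt_pass`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    text: str
    flags: frozenset  # weak-detector flags such as {"E"}; encoding is assumed
    gt_pass: bool     # ground-truth grader result; evaluation only


def select_pairs(rollouts, allowed_flags=frozenset({"E", "C", "D"})):
    """Allowed: build contrastive pairs from weak-detector flags only."""
    flagged = [r for r in rollouts if r.flags & allowed_flags]
    clean = [r for r in rollouts if not (r.flags & allowed_flags)]
    # Pair in order; zip truncates to the shorter list.
    return list(zip(flagged, clean))


def project_out(grad, direction):
    """Remove the component of `grad` along `direction`.

    Takes no rollout and no grader signal, so it cannot be gated on gt_pass.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("hack direction has zero norm")  # fail fast
    coeff = sum(g * d for g, d in zip(grad, direction)) / norm_sq
    return [g - coeff * d for g, d in zip(grad, direction)]


rollouts = [
    Rollout("uses hack A", frozenset({"E"}), gt_pass=False),
    Rollout("honest solution", frozenset(), gt_pass=True),
]
pairs = select_pairs(rollouts)
projected = project_out([1.0, 1.0], [1.0, 0.0])
```

Keeping `gt_pass` out of both function signatures makes the cheating path a visible code change rather than a silent one; the field exists on `Rollout` only so that evaluation code can read it after training.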
## How it works
We're trying to ablate the "hack direction" from the training gradient on