Files
vGROUT_pub/docs/research_notes.md
2026-06-27 17:44:25 +08:00

70 lines
2.7 KiB
Markdown

# Research Notes
This repository accompanies a LessWrong write-up about using steering-vector
style directions for gradient routing. The write-up URL will be added once the
post is published.
## Summary
Can steering vectors drive gradient routing? A simplified toy setting gives a
positive signal. In the more realistic reward-hacking setting tested here, the
answer is not reliably: the vectors tested here were not precise enough
classifiers of hacky versus clean solutions.
The more promising result is signed-CorDA initialization. Instead of using a
vector as a live router, signed-CorDA initializes two adapter halves so hacky
and clean gradients are biased toward different blocks. In the current runs,
the strongest 4B result reduced held-out hack rate from 0.759 as-trained to
0.218 after deployment ablation, while solve rate moved from 0.161 to 0.149.
That is mechanism evidence but not a deployable operating point.
## Main claim
The label-free routing gate is a negative result on current evidence. The
strongest offline checks did not show that authored activation directions beat
Haar-random directions as high-precision routing classifiers.
This does not mean there is no hack direction. Oracle-fit rollout directions do
show a moderate linear signal. The failure appears to be transfer: directions
built from synthetic pairs did not align well enough with the live rollout
distribution.
## Why precision matters
Gradient routing can tolerate missed forget examples because ambiguous samples
can fall into the shared block and be handled by absorption. Wrong confident
pins are more expensive: a hack routed into deployed parameters is retained,
and a clean solution routed into quarantine can be deleted at deployment.
That is why the routing direction was evaluated as a high-precision classifier,
using precision-weighted metrics such as F0.5 rather than only AUROC.
## What to reproduce
Run:
```bash
uv sync
just smoke
```
This exercises the tiny-model routeA path, reward/eval invariants, adapter
masking, activation extraction, masked GRPO, and deployment ablation.
The full 4B experiments require GPU runs and are not packaged as a one-command
reproduction in this minimal public repo.
## Main large-run evidence
The headline signed-CorDA result used a 4B model in the reward-hacking LeetCode
environment. In the seed-44 absorb run, the held-out test metrics were:
| state | hack rate | solve rate |
| --- | ---: | ---: |
| as-trained | 0.759 | 0.161 |
| deployment ablation | 0.218 | 0.149 |
Interpreting the hack-rate drop literally, roughly 71% of the measured hack
capability was removed by ablating the quarantine block. The residual hack rate
is still too high for a clean intervention result.