Research Notes
This repository accompanies a LessWrong write-up about using steering-vector-style directions for gradient routing. The write-up URL will be added once the post is published.
Summary
Can steering vectors drive gradient routing? A simplified toy setting gives a positive signal. In the more realistic reward-hacking setting tested here, the answer is "not reliably": the vectors were not precise enough as classifiers of hacky versus clean solutions.
The more promising result is signed-CorDA initialization. Instead of using a vector as a live router, signed-CorDA initializes two adapter halves so that hacky and clean gradients are biased toward different blocks. In the current runs, the strongest 4B result reduced held-out hack rate from 0.759 as-trained to 0.218 after deployment ablation, while solve rate moved from 0.161 to 0.149. That is evidence that the mechanism works, but it is not a deployable operating point.
Main claim
The label-free routing gate is a negative result on current evidence. The strongest offline checks did not show that authored activation directions beat Haar-random directions as high-precision routing classifiers.
This does not mean there is no hack direction. Oracle-fit rollout directions do show a moderate linear signal. The failure appears to be transfer: directions built from synthetic pairs did not align well enough with the live rollout distribution.
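As a rough illustration of the Haar-random baseline mentioned above, the sketch below samples unit directions uniformly on the sphere by normalizing i.i.d. Gaussian vectors, then reports how often a random direction scores at least as well as a candidate direction. The function names `haar_random_direction` and `null_exceedance_rate`, the width of 64, and the seed are hypothetical choices for illustration and are not taken from this repository's code.

```python
import math
import random


def haar_random_direction(dim, rng):
    """Sample a unit vector uniformly on the sphere in `dim` dimensions.

    Normalizing a vector of i.i.d. standard Gaussians gives a
    rotation-invariant (Haar-uniform) direction.
    """
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def null_exceedance_rate(candidate_score, null_scores):
    """Fraction of random-direction scores that match or beat the candidate."""
    return sum(s >= candidate_score for s in null_scores) / len(null_scores)


rng = random.Random(0)  # illustrative seed, not from the repository
direction = haar_random_direction(64, rng)  # 64 is an assumed toy width
```

A candidate direction only counts as informative under this kind of check if its exceedance rate against many random directions is small; a rate near 0.5 means it is indistinguishable from chance.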
Why precision matters
Gradient routing can tolerate missed forget examples because ambiguous samples can fall into the shared block and be handled by absorption. Wrong confident pins are more expensive: a hack routed into deployed parameters is retained, and a clean solution routed into quarantine can be deleted at deployment.
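The asymmetry described above can be sketched as a three-way decision on a scalar routing score. This is a minimal sketch under assumptions, not the repository's router: the function name `route`, the block labels, and both thresholds are hypothetical, and the score is assumed to be larger for more hack-like samples.

```python
def route(score, pin_quarantine=2.0, pin_deployed=-2.0):
    """Map a scalar routing score to a parameter block.

    Assumed convention: higher score means more hack-like.
    Both thresholds are hypothetical and would need tuning.
    """
    if pin_deployed >= pin_quarantine:
        raise ValueError("pin_deployed must be below pin_quarantine")
    if score >= pin_quarantine:
        return "quarantine"  # confident hack: removed by deployment ablation
    if score <= pin_deployed:
        return "deployed"    # confident clean: kept at deployment
    return "shared"          # ambiguous: left to absorption


decisions = [route(s) for s in (-3.0, 0.1, 2.5)]
```

Widening the gap between the two thresholds sends more samples to the shared block, which lowers recall but reduces the number of confident wrong pins.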
That is why the routing direction was evaluated as a high-precision classifier, using precision-weighted metrics such as F0.5 rather than only AUROC.
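To make the choice of F0.5 concrete, the sketch below computes the standard F-beta score from confusion counts and compares two made-up classifiers. The counts are invented for illustration only and are not results from this repository.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Invented counts: a cautious classifier (few false positives, low recall)
# versus a broad one (high recall, many false positives).
cautious = dict(tp=40, fp=2, fn=60)
broad = dict(tp=80, fp=80, fn=20)

f05_cautious = f_beta(**cautious, beta=0.5)
f05_broad = f_beta(**broad, beta=0.5)
f1_cautious = f_beta(**cautious, beta=1.0)
f1_broad = f_beta(**broad, beta=1.0)
```

With these counts, F1 ranks the broad classifier higher while F0.5 ranks the cautious one higher, which matches the routing setting where a wrong confident pin costs more than a miss.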
What to reproduce
Run:
uv sync
just smoke
This exercises the tiny-model routeA path, reward/eval invariants, adapter masking, activation extraction, masked GRPO, and deployment ablation.
The full 4B experiments require GPU runs and are not packaged as a one-command reproduction in this minimal public repo.
Main large-run evidence
The headline signed-CorDA result used a 4B model in the reward-hacking LeetCode environment. In the seed-44 absorb run, the held-out test metrics were:
| state | hack rate | solve rate |
|---|---|---|
| as-trained | 0.759 | 0.161 |
| deployment ablation | 0.218 | 0.149 |
Interpreting the hack-rate drop literally, roughly 71% of the measured hack capability was removed by ablating the quarantine block. The residual hack rate is still too high for a clean intervention result.
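The 71% figure follows from the table as a relative reduction. The short sketch below reproduces that arithmetic and applies the same formula to the solve rate; the helper name `relative_drop` is an illustrative choice, and the numbers are copied from the table above.

```python
def relative_drop(before, after):
    """Fraction of the original rate that was removed."""
    return (before - after) / before


# Held-out test metrics from the seed-44 absorb run (table above).
as_trained = {"hack": 0.759, "solve": 0.161}
ablated = {"hack": 0.218, "solve": 0.149}

hack_drop = relative_drop(as_trained["hack"], ablated["hack"])
solve_drop = relative_drop(as_trained["solve"], ablated["solve"])
print(f"hack rate removed:  {hack_drop:.1%}")
print(f"solve rate removed: {solve_drop:.1%}")
```

On the same relative scale, the solve rate falls by roughly 7%, so the ablation removes far more of the hack rate than of the solve rate in this run.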