# Research Notes

This repository accompanies a LessWrong write-up about using steering-vector
style directions for gradient routing. The write-up URL will be added once the
post is published.

## Summary

Can steering vectors drive gradient routing? A simplified toy setting gives a
positive signal. In the more realistic reward-hacking setting tested here, the
answer is not reliably: the vectors tested here were not precise enough
classifiers of hacky versus clean solutions.

The more promising result is signed-CorDA initialization. Instead of using a
vector as a live router, signed-CorDA initializes two adapter halves so hacky
and clean gradients are biased toward different blocks. In the current runs,
the strongest 4B result reduced held-out hack rate from 0.759 as-trained to
0.218 after deployment ablation, while solve rate moved from 0.161 to 0.149.
That is mechanism evidence but not a deployable operating point.

## Main claim

The label-free routing gate is a negative result on current evidence. The
strongest offline checks did not show that authored activation directions beat
Haar-random directions as high-precision routing classifiers.

This does not mean there is no hack direction. Oracle-fit rollout directions do
show a moderate linear signal. The failure appears to be transfer: directions
built from synthetic pairs did not align well enough with the live rollout
distribution.

## Why precision matters

Gradient routing can tolerate missed forget examples because ambiguous samples
can fall into the shared block and be handled by absorption. Wrong confident
pins are more expensive: a hack routed into deployed parameters is retained,
and a clean solution routed into quarantine can be deleted at deployment.

That is why the routing direction was evaluated as a high-precision classifier,
using precision-weighted metrics such as F0.5 rather than only AUROC.

## What to reproduce

Run:

```bash
uv sync
just smoke
```

This exercises the tiny-model routeA path, reward/eval invariants, adapter
masking, activation extraction, masked GRPO, and deployment ablation.

The full 4B experiments require GPU runs and are not packaged as a one-command
reproduction in this minimal public repo.

## Main large-run evidence

The headline signed-CorDA result used a 4B model in the reward-hacking LeetCode
environment. In the seed-44 absorb run, the held-out test metrics were:

| state | hack rate | solve rate |
| --- | ---: | ---: |
| as-trained | 0.759 | 0.161 |
| deployment ablation | 0.218 | 0.149 |

Interpreting the hack-rate drop literally, roughly 71% of the measured hack
capability was removed by ablating the quarantine block. The residual hack rate
is still too high for a clean intervention result.