An experiment to see how a rating changes as the chain of thought progresses.

It turns out the rating is quite unstable and depends on where the chain of thought goes, at least in models of around 8B parameters.

![alt text](img/README-1755664824339-image.png)

![alt text](img/README-1755670864672-image.png)
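One way to make "unstable" concrete is to rate every prefix of the chain of thought and summarize how far the rating moves. The following is only a minimal sketch, not necessarily how this experiment was run: `ratings_along_chain`, `instability`, and the `rate` callback are hypothetical names, and `toy_rate` is a deterministic stand-in for an actual model call.

```python
from statistics import pstdev
from typing import Callable, Dict, List


def ratings_along_chain(steps: List[str], rate: Callable[[str], float]) -> List[float]:
    """Rate each prefix of the chain: step 1, steps 1-2, steps 1-3, and so on.

    `rate` is a hypothetical callback that returns a numeric rating for a
    given chain-of-thought prefix (in practice, a call to the model).
    """
    ratings = []
    for i in range(1, len(steps) + 1):
        prefix = "\n".join(steps[:i])
        ratings.append(rate(prefix))
    return ratings


def instability(ratings: List[float]) -> Dict[str, float]:
    """Summarize how much the rating moves along the chain (needs >= 1 rating)."""
    deltas = [b - a for a, b in zip(ratings, ratings[1:])]
    return {
        "range": max(ratings) - min(ratings),
        "std": pstdev(ratings),
        "max_step_change": max((abs(d) for d in deltas), default=0.0),
    }


def toy_rate(prefix: str) -> float:
    """Assumption: a toy rating on a 1-5 scale, standing in for a real model."""
    return float(len(prefix.split()) % 5 + 1)


if __name__ == "__main__":
    chain = ["a b", "c", "d e f"]
    ratings = ratings_along_chain(chain, toy_rate)
    print(ratings)
    print(instability(ratings))
```

A large `range` or `max_step_change` relative to the rating scale would indicate that the rating depends heavily on where the chain of thought happens to be cut off.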