An experiment to see how a rating changes along a chain of thought.

It turns out the rating is quite unstable and depends on where the chain of thought goes, at least in 8B-parameter models.

![alt text](img/README-1755664824339-image.png)

![alt text](img/README-1755670864672-image.png)
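
One way to picture the measurement is to ask for a rating after every prefix of the chain of thought and then look at how much those ratings move. The sketch below is illustrative only and is not taken from this repository's code: `rating_trajectory`, `instability`, and `toy_rate` are hypothetical names, and `toy_rate` is a deterministic stand-in for a real model call so the example runs without a model.

```python
from statistics import pstdev
from typing import Callable, List, Sequence


def rating_trajectory(steps: Sequence[str], rate: Callable[[str], float]) -> List[float]:
    """Rate every prefix of the chain of thought, from zero steps up to all steps."""
    return [rate("\n".join(steps[:i])) for i in range(len(steps) + 1)]


def instability(ratings: Sequence[float]) -> float:
    """Population standard deviation of the ratings; 0.0 means perfectly stable."""
    return pstdev(ratings)


def toy_rate(prefix: str) -> float:
    """Hypothetical stand-in for a model call (assumption, not a real API).

    Starts at 5 and moves +2 for each "good" and -2 for each "bad" in the prefix.
    """
    return 5.0 + 2.0 * prefix.count("good") - 2.0 * prefix.count("bad")


# A made-up three-step chain of thought, used only to exercise the functions.
steps = ["The plot is good.", "The ending is bad.", "The acting is bad."]
trajectory = rating_trajectory(steps, toy_rate)
spread = instability(trajectory)
print(trajectory, spread)
```

To try this against a real model, `toy_rate` would be replaced by a function that sends the prefix to the model and parses a numeric rating from its reply. Standard deviation is only one possible summary; the range between the highest and lowest rating would work as well.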