An experiment to see how a rating changes along a chain of thought.
It turns out the rating is quite unstable and depends on where the chain of thought goes, at least in 8B-parameter models.
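
One way to picture the measurement is as a rating taken after each step of the chain of thought, followed by a look at how much those ratings move. The sketch below is only an illustration under assumptions, not the code used in this experiment: `rating_trajectory`, `instability`, and `toy_rate` are hypothetical names, and `toy_rate` is a keyword-counting placeholder standing in for a real call to an 8B model.

```python
from statistics import pstdev
from typing import Callable, Dict, List


def rating_trajectory(steps: List[str], rate: Callable[[str], float]) -> List[float]:
    """Rate every growing prefix of a chain of thought.

    `rate` is a hypothetical callable that maps the text so far to a number;
    in a real run it would wrap a model call.
    """
    ratings = []
    for i in range(1, len(steps) + 1):
        prefix = "\n".join(steps[:i])  # chain of thought up to and including step i
        ratings.append(rate(prefix))
    return ratings


def instability(ratings: List[float]) -> Dict[str, float]:
    """Summarize how much the rating moves along the chain."""
    return {
        "range": max(ratings) - min(ratings),  # largest swing between any two steps
        "stdev": pstdev(ratings),              # population standard deviation
    }


def toy_rate(prefix: str) -> float:
    """Placeholder rater (assumption): NOT a model, just counts two keywords."""
    return 5.0 + prefix.count("good") - prefix.count("bad")


steps = [
    "The answer looks good.",
    "But the second claim is bad.",
    "Also the citation is bad.",
    "Overall still good.",
]
trajectory = rating_trajectory(steps, toy_rate)
summary = instability(trajectory)
print(trajectory)
print(summary)
```

Swapping `toy_rate` for a function that prompts a model with the prefix and parses a numeric rating from its output would give the per-step trajectory for a real chain of thought; a large `range` relative to the rating scale is one simple signal of the instability described above.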