Abstract
Large Language Models often improve accuracy on reasoning tasks by sampling multiple Chain-of-Thought (CoT) traces and aggregating them with majority voting (MV), a test-time technique called self-consistency. When we truncate a CoT partway through and regenerate the remainder, we observe that traces with correct answers reproduce their original answer more often than traces with wrong answers. We use this difference as a reliability signal we call prefix consistency: the rate at which a sample's answer reappears under regeneration. Weighting each vote by this signal yields prefix-consistency-weighted majority voting (PC-WMV), which requires no access to token log-probabilities and no self-rating prompts. Across five reasoning models and four math and science benchmarks, prefix consistency is the best correctness predictor in most settings, and PC-WMV reaches the plateau accuracy of Standard MV with up to 21× fewer tokens (median 4.6×).
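A minimal sketch of the scoring step, assuming hypothetical regenerate(question, prefix) and extract_answer(text) helpers standing in for an LLM call and an answer parser; the truncation fractions and regeneration count below are illustrative defaults, not the paper's settings:

def prefix_consistency(question, trace, answer, regenerate, extract_answer,
                       truncation_fracs=(0.25, 0.5, 0.75), n_regen=4):
    """Rate at which `answer` reappears when the CoT `trace` is cut
    partway through and the remainder is regenerated."""
    matches = total = 0
    for frac in truncation_fracs:
        prefix = trace[: int(len(trace) * frac)]  # truncate the CoT partway
        for _ in range(n_regen):
            completion = regenerate(question, prefix)  # regenerate the remainder
            matches += extract_answer(completion) == answer
            total += 1
    return matches / total  # in [0, 1]; higher suggests a more reliable sample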
Truncate a CoT partway through, then regenerate the remainder.
Correct answers come back. Wrong ones mostly don't.
If correct answers reproduce more, their votes should count more.
PC-WMV beats Standard MV.
PC-WMV matches Standard MV's plateau accuracy with up to 21× fewer tokens (median 4.6×).
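And a minimal sketch of the weighted vote itself, with illustrative names and scores rather than the paper's implementation: each sampled answer contributes its prefix-consistency score instead of a unit vote.

from collections import defaultdict

def pc_wmv(samples):
    """Prefix-consistency-weighted majority vote over
    (answer, prefix_consistency_score) pairs."""
    votes = defaultdict(float)
    for answer, score in samples:
        votes[answer] += score  # each vote weighted by its reliability signal
    return max(votes, key=votes.get)

# Toy usage: two moderately consistent samples agreeing on "42" outvote
# one low-consistency "17"; with unit weights this reduces to plain MV.
print(pc_wmv([("42", 0.9), ("42", 0.7), ("17", 0.2)]))  # -> 42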
BibTeX
@article{iwase2026prefixconsistency,
  title={Reliable Chain-of-Thought via Prefix Consistency},
  author={Naoto Iwase and Yuki Ichihara and Mohammad Atif Quamar and Junpei Komiyama},
  journal={arXiv preprint arXiv:2605.07654},
  year={2026},
  url={https://arxiv.org/abs/2605.07654},
}