Guiding Through Complexity

What Makes Good Supervision for Hard Reasoning Tasks?



1Tsinghua University, 2University of California, Los Angeles

*Equal Contribution

Abstract

How can "weak teacher models" such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially those that challenge and require expertise or daily practice from the teacher models?
In this paper, we seek empirical answers to this question by investigating various data-driven strategies that provide supervision data of different quality levels for tasks of varying complexity. Two intuitive strategies emerge for teacher models to provide supervision during alignment training: 1) using lower-quality supervision from complete tasks that match the difficulty of the target reasoning tasks, and 2) leveraging higher-quality supervision from easier subtasks that are less challenging.
Interestingly, we find that even when the outcome error rate for hard task supervision is high (e.g., 90%), training on such data can outperform perfectly correct supervision on easier subtasks across multiple hard math benchmarks. We further identify a more critical factor influencing training performance: step-wise error rates, which indicate the severity of errors in solutions.
Specifically, training on hard task supervision with the same outcome error rate but disparate step-wise error rates can lead to a 30% accuracy gap on the MATH benchmark. Our results also reveal that supplementing hard task supervision with its corresponding subtask supervision yields notably larger performance improvements than simply adding rephrased hard full-task supervision, suggesting new avenues for data augmentation.
Data and code are released at https://github.com/hexuan21/Weak-to-Strong.

Hard Full Task and Easy Sub-task Supervision Synthesis

In our paper, we mainly focus on two data-driven supervision strategies:
Strategy 1: annotating or generating solutions directly on the hard tasks, which both humans and AI models struggle with, so annotation accuracy is rather low;
Strategy 2: annotating or generating solutions on the hard tasks' corresponding subtasks, which are much easier and more manageable, so humans or AI models are more likely to succeed.
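As a rough illustration, the two strategies can be sketched as follows. This is a minimal sketch: `teacher_generate` and `decompose` are placeholder functions standing in for a weak annotator/LLM and a task decomposer, not our released code.

```python
# Minimal sketch of the two supervision strategies (placeholder helpers).

def strategy_1_supervision(hard_tasks, teacher_generate):
    """Strategy 1: annotate the hard tasks directly; accuracy is low."""
    return [{"question": q, "solution": teacher_generate(q)} for q in hard_tasks]

def strategy_2_supervision(hard_tasks, decompose, teacher_generate):
    """Strategy 2: decompose each hard task into easier subtasks and
    annotate those instead; accuracy is much higher, but the content is easier."""
    supervision = []
    for q in hard_tasks:
        for sub_q in decompose(q):  # e.g., a handful of easier subquestions
            supervision.append({"question": sub_q, "solution": teacher_generate(sub_q)})
    return supervision
```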

Pipeline Overview

Below we show the synthesis pipeline for constructing hard full task supervision and easy sub-task supervision.
(*: please see our paper and appendix for more details on this part.)
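In rough pseudocode, the pipeline can be summarized as below. This is an assumed outline only: `teacher`, `decomposer`, and `grader` are placeholders for the weak annotator/LLM, the task decomposer, and the answer checker, and the actual prompts and filtering steps are described in the paper and appendix.

```python
# Assumed shape of the supervision synthesis pipeline (placeholder helpers).

def synthesize(hard_task, decomposer, teacher, grader):
    # Hard full-task supervision: the teacher solves the hard task directly,
    # and the final answer is checked against the reference answer.
    full_sol = teacher(hard_task["question"])
    full = {"question": hard_task["question"], "solution": full_sol,
            "correct": grader(full_sol, hard_task["answer"])}

    # Easy sub-task supervision: decompose the hard task, then have the
    # teacher solve each (easier) subquestion with its own reference answer.
    subs = []
    for sub_q, sub_ans in decomposer(hard_task):
        sub_sol = teacher(sub_q)
        subs.append({"question": sub_q, "solution": sub_sol,
                     "correct": grader(sub_sol, sub_ans)})
    return full, subs
```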


Examples

Here are some examples of decomposition; due to length constraints, we list only the questions, without the solutions.
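To make the idea concrete, here is a purely illustrative decomposition; the question and its subquestions are invented for exposition and are not drawn from our dataset.

```python
# Purely illustrative decomposition (not an example from our dataset).
example = {
    "hard_question": "How many positive integers n <= 1000 are divisible by "
                     "3 or 5 but not by 15?",
    "subquestions": [
        "How many positive integers n <= 1000 are divisible by 3?",    # 333
        "How many positive integers n <= 1000 are divisible by 5?",    # 200
        "How many positive integers n <= 1000 are divisible by 15?",   # 66
        "Use inclusion-exclusion on the three counts, then exclude the "
        "multiples of 15.",                                            # 467 - 66 = 401
    ],
}
```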

Which Supervision Strategy Is Better?

In the following five line graphs, we show accuracy after fine-tuning on hard full-task supervision and easy sub-task supervision with varying outcome error rates, across five different benchmarks.
Some takeaways:
1️⃣ Hard task supervision consistently outperforms subtask supervision, even with higher outcome error rates.
2️⃣ Performance does not consistently degrade with increasing outcome error rates.
3️⃣ Changes in outcome error rates have a greater impact on subtask supervision than on hard task supervision.
Parts of these results run counter to our expectations, which motivates the deeper exploration in the following sections.
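For reference, a simple way to assemble a training set at a target outcome error rate is to mix graded-correct and graded-incorrect teacher solutions. The sketch below is an assumed recipe, not necessarily the exact procedure behind these plots.

```python
# Assumed recipe: subsample teacher solutions to hit a target outcome error rate.
import random

def subsample_with_error_rate(solutions, target_error_rate, n, seed=0):
    """`solutions` are dicts carrying a boolean `correct` flag from the grader."""
    rng = random.Random(seed)
    wrong = [s for s in solutions if not s["correct"]]
    right = [s for s in solutions if s["correct"]]
    n_wrong = round(n * target_error_rate)
    # Mix incorrect and correct solutions in the desired proportion.
    return rng.sample(wrong, n_wrong) + rng.sample(right, n - n_wrong)
```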

Severity of Erroneous Solutions Matters

🤯 From the section above, we observe that performance is robust to the outcome error rate: accuracy stays relatively steady as the outcome error rate varies.
🤔 This leads us to suspect that the outcome error rate MAY NOT be a reliable indicator of supervision quality; we also need to consider the severity of erroneous solutions.
We sample solutions from other distinct LLMs as teacher models and, for comparison, pick the model variant trained with GPT-4o-mini supervision in the previous section whose outcome error rate is closest to that of each new teacher model. The results are shown below.

To further validate our hypothesis that the severity of incorrect solutions matters more than the raw outcome error rate, we conduct a human evaluation of step-wise error rates on a small sample set.
The step-wise error rates of the teacher models above are displayed in the following table, which shows that the step-wise error rate is a strong indicator of supervision quality for hard-task performance.
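For clarity, one natural way to compute the step-wise error rate from human step-level labels is sketched below; the field names are illustrative, not from our annotation tooling.

```python
# Step-wise error rate: the fraction of solution steps judged erroneous.

def step_wise_error_rate(solutions):
    """Each solution has `step_labels`, a list of booleans where True marks
    a step the human annotator judged incorrect."""
    flagged = sum(sum(sol["step_labels"]) for sol in solutions)
    total = sum(len(sol["step_labels"]) for sol in solutions)
    return flagged / total if total else 0.0
```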

Further Improvement Over Hard Task Supervision Alone

As outlined in previous sections, hard task supervision with lower step-wise error rates is highly beneficial. But can performance be further improved using existing hard task supervision without introducing new hard tasks?
We try combining the hard full tasks with their decomposed sub-tasks for fine-tuning and experiment with different combinations of outcome error rates for the full tasks and sub-tasks. The results are shown in the table below.
Besides, since the information in the hard tasks essentially covers that of the subtasks, one might argue that combining easy subtask and hard full task supervision merely makes LLMs learn the same information roughly twice. To address this, we design a "4 epochs" baseline, which doubles the number of epochs used in all previous experiments. We also merge the existing hard task supervision with its rephrased version as another baseline, "Merge Rephrased", which adds more diversity.
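For reference, the training-set variants compared here can be summarized as in the sketch below; the helper names are ours, not from the released code.

```python
# Sketch of the compared training-set variants (helper names are illustrative).

def hard_plus_subtasks(hard_sup, sub_sup):
    # Hard full-task supervision combined with its decomposed sub-task supervision.
    return hard_sup + sub_sup

def merge_rephrased(hard_sup, rephrase):
    # "Merge Rephrased" baseline: hard full-task supervision plus a rephrased
    # copy of each example.
    return hard_sup + [rephrase(example) for example in hard_sup]

# "4 epochs" baseline: the same hard-task supervision, trained for twice as
# many epochs as in the previous experiments, so the model sees each example
# roughly twice as often.
```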

BibTeX

@article{he2024guiding,
  title={Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?},
  author={He, Xuan and Yin, Da and others},
  journal={arXiv preprint arXiv:2410.20533},
  year={2024}
}