How can "weak teacher models" such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially those that challenge and require expertise or daily practice from the teacher models?
In this paper, we seek empirical answers to this question by investigating various data-driven strategies that provide supervision data of different quality levels on tasks of varying complexity. Two intuitive strategies emerge for teacher models to provide supervision during alignment training: 1) using lower-quality supervision from complete tasks that match the difficulty of the target reasoning tasks, and 2) leveraging higher-quality supervision from easier subtasks that are less challenging.
Interestingly, we find that even when the outcome error rate for hard task supervision is high (e.g., 90\%), training on such data can outperform perfectly correct supervision on easier subtasks on multiple hard math benchmarks. We further identify a more critical factor influencing training performance: step-wise error rates, which indicate the severity of errors in solutions.
Specifically, training on hard task supervision with the same outcome error rates but disparate step-wise error rates can lead to a 30\% accuracy gap on the MATH benchmark. Our results also reveal that supplementing hard task supervision with the corresponding subtask supervision yields notably larger performance improvements than simply adding rephrased hard full task supervision, suggesting new avenues for data augmentation.
Data and code are released at \url{https://github.com/hexuan21/Weak-to-Strong}.
In our paper, we mainly focus on two data-driven supervision strategies:
Strategy 1: annotating or generating solutions directly on the hard tasks, which both humans and AI models struggle with, so the accuracy of the annotations is rather low;
Strategy 2: annotating or generating solutions on the hard tasks' corresponding subtasks, which are much easier and more manageable, so humans or AI models are more likely to succeed.
Below we show the synthesis pipeline for constructing hard full task supervision and easy sub-task supervision. (*: please see our paper and appendix for more details of this part.)
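For readers who want a concrete picture, here is a minimal sketch of such a pipeline using the OpenAI Python client; the prompts and helper names (`ask_teacher`, `decompose_task`, `generate_solution`) are illustrative assumptions rather than the exact implementation in our repository.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
TEACHER_MODEL = "gpt-4o-mini"  # one of the weak teacher models we use


def ask_teacher(prompt: str) -> str:
    """Query the weak teacher model once and return its text reply."""
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def decompose_task(hard_question: str) -> list[str]:
    """Strategy 2: split a hard task into easier, self-contained sub-questions."""
    reply = ask_teacher(
        "Decompose the following math problem into a short list of easier, "
        f"self-contained sub-questions, one per line:\n\n{hard_question}"
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]


def generate_solution(question: str) -> str:
    """Annotate a (sub-)question with a step-by-step solution."""
    return ask_teacher(f"Solve the following problem step by step:\n\n{question}")


# Strategy 1: hard full-task supervision (outcome error rate may be high).
hard_question = "..."  # a hard full task
hard_example = {"question": hard_question, "solution": generate_solution(hard_question)}

# Strategy 2: easy sub-task supervision (the teacher is far more likely to succeed).
sub_examples = [{"question": q, "solution": generate_solution(q)}
                for q in decompose_task(hard_question)]
```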
Here are some examples of decomposition. Due to length limitations, we only list the questions themselves, without the solutions.
In the following 5 line graphs, we show accuracy after finetuning
on hard full task supervision and easy sub-task supervision
with varying error rates, on 5 different benchmarks.
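For reference, here is a minimal sketch of how a supervision set with a target outcome error rate could be assembled from teacher solutions that have been checked against gold answers; the helper name, record format, and set size are illustrative assumptions, not the exact mixing procedure from the paper.

```python
import random


def build_supervision_set(correct, incorrect, target_error_rate, size, seed=0):
    """Mix verified-correct and incorrect teacher solutions so that the
    resulting finetuning set has (approximately) the target outcome error rate.

    `correct` / `incorrect` are lists of {"question": ..., "solution": ...}
    records whose final answers were checked against gold answers.
    """
    rng = random.Random(seed)
    n_wrong = round(size * target_error_rate)
    wrong = rng.sample(incorrect, n_wrong)
    right = rng.sample(correct, size - n_wrong)
    mixed = wrong + right
    rng.shuffle(mixed)
    return mixed


# e.g. a hard-task supervision set with a 90% outcome error rate
# (the set size here is illustrative):
# train_set = build_supervision_set(correct_pool, incorrect_pool, 0.9, 2000)
```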
Some take-aways:
1️⃣ Hard task supervision consistently outperforms
subtask supervision, even with higher outcome
error rates.
2️⃣ Performance does not consistently degrade with
increasing outcome error rates.
3️⃣ Changes in outcome error rates have a greater
impact on subtask supervision than on hard task
supervision.
Some of these results are not consistent with our expectations, which motivates the deeper exploration in the next sections.
🤯 From the section above, we observe that performance is robust to the outcome error rate: accuracy stays relatively steady as the outcome error rate varies.
🤔 This leads us to suspect that the outcome error rate MAY NOT be a reliable indicator of supervision quality. We also need to consider the severity of the erroneous solutions.
We sample solutions from other distinct LLMs as teacher models, and for comparison we identify the model variant trained with GPT-4o-mini supervision in the previous section whose outcome error rate is closest to that of the new teacher model. The results are shown below.
To further validate our hypothesis that the severity of incorrect solutions matters more than the simple outcome error rate, we conduct a human evaluation to measure step-wise error rates on a small sample set. The step-wise error rates of the teacher models above are displayed in the following table, which shows that the step-wise error rate is a strong indicator of supervision quality for hard task performance.
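For clarity, one natural way to compute such a rate is the fraction of reasoning steps that annotators label as incorrect across the sampled solutions; below is a minimal sketch of that computation (the annotation record format is an illustrative assumption).

```python
def step_wise_error_rate(annotated_solutions):
    """Fraction of reasoning steps labeled incorrect by human annotators.

    Each element of `annotated_solutions` is one solution, represented as a
    list of per-step records, e.g. [{"step": "...", "correct": True}, ...].
    """
    total = sum(len(solution) for solution in annotated_solutions)
    wrong = sum(1 for solution in annotated_solutions
                for step in solution if not step["correct"])
    return wrong / total if total else 0.0
```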
As outlined in previous sections, hard task supervision with
lower step-wise error rates is highly beneficial.
But can performance be further improved using
existing hard task supervision without
introducing new hard tasks?
We try combining the hard full tasks with their decomposed sub-tasks for finetuning, and experiment with different combinations of outcome error rates in the full tasks and sub-tasks. The results are shown in the table right below.
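As an illustration, a minimal sketch of how such a combined finetuning set could be assembled (the plain concatenation and the example pairing in the comment are our own illustrative assumptions; see the table for which error-rate combinations actually help):

```python
import random


def combine_supervision(hard_set, sub_set, seed=0):
    """Concatenate hard full-task supervision with its corresponding
    sub-task supervision into a single finetuning set, then shuffle."""
    combined = list(hard_set) + list(sub_set)
    random.Random(seed).shuffle(combined)
    return combined


# e.g. pair a higher-error hard-task set with a low-error sub-task set
# (which combination works best is exactly what the table explores):
# train_set = combine_supervision(hard_set_high_err, sub_set_low_err)
```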
Besides, since the information in the hard tasks essentially covers that of the subtasks, one might argue that combining easy subtask and hard full task supervision merely makes LLMs learn roughly the same information twice. To address this, we design a “4 epochs” baseline, which doubles the number of training epochs relative to all previous experiments.
We also merge the existing hard task supervision with its rephrased version as another baseline, "Merge Rephrased", which adds more diversity.
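For concreteness, a rough sketch of how these two baselines could be set up (the rephrasing prompt, what exactly gets rephrased, and the reuse of the `ask_teacher` helper from the pipeline sketch above are illustrative assumptions):

```python
# Baseline "4 epochs": reuse the hard-task supervision unchanged and simply
# double num_train_epochs in the finetuning config relative to the earlier runs.

# Baseline "Merge Rephrased": merge hard-task supervision with a rephrased copy.
# (Here only the question is rephrased; whether the solution is also rephrased
# is an illustrative simplification.)
def merge_rephrased(hard_set):
    rephrased = []
    for ex in hard_set:
        new_question = ask_teacher(  # helper from the pipeline sketch above
            "Rephrase the following problem without changing its meaning "
            f"or its final answer:\n\n{ex['question']}"
        )
        rephrased.append({"question": new_question, "solution": ex["solution"]})
    return list(hard_set) + rephrased
```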
@article{he2024guiding,
title={Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?},
author={He, Xuan and Yin, Da and others},
journal={arXiv preprint arXiv:2410.20533},
year={2024}
}