Advancements in Math-Minos: Improving Math Verifiers with Natural Language Feedback

Criticism not only helps people improve; it can also enhance the capabilities of large models.

OpenAI built a fault-finding model named CriticGPT on this idea. Coincidentally, just a few days before CriticGPT's release, a team from Peking University, working with the Qianwen (Qwen) team and others, designed a mathematics-focused counterpart based on a similar idea.

Without any additional training of the generator, the verifier raises the model's accuracy on GSM8K from 86.6% to 88.2% at inference time.


The core idea of CriticGPT is to deliberately insert bugs into code and label them in detail, then use the resulting data to train a model that can find bugs.

The Peking University team found that this approach is useful not only for code, but can also help language models solve mathematical problems.

So the team applied a similar idea, replacing code with mathematical problems, and launched a mathematics version of CriticGPT: Math-Minos.

Using GPT-4 to generate step-by-step critiques

In mathematical reasoning, verifying the correctness of a solution is a key step in ensuring the quality of reasoning.

However, most existing mathematical verifiers are trained with binary classification labels, which offer no explanation of why a solution is correct or wrong and thus provide an insufficient supervisory signal for training the verifier.

Math-Minos overcomes this limitation by providing deeper explanations, greatly enriching the verifier's training signal.

It introduces step-by-step natural language feedback as a rationale label, which not only indicates whether a solution is correct, but also analyzes the cause of each error step by step.


To obtain natural language feedback, the research team initially used GPT-4 to generate training data, but experiments showed that even GPT-4 makes errors at a non-trivial rate when evaluating mathematical reasoning step by step.

To mitigate this, the researchers simplified GPT-4's task by including step-level binary classification labels in the prompt, so that GPT-4 only has to explain each verdict rather than judge correctness from scratch, enabling it to generate more accurate evaluations.
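The idea of handing the step-level verdicts to the critique model in the prompt can be sketched as follows. This is a hypothetical prompt format for illustration, not the paper's actual template; the function name and wording are assumptions.

```python
# Hypothetical sketch: build a critique prompt that already contains
# step-level binary labels, so the LLM only explains each step's verdict
# rather than judging correctness itself. Format is an assumption.

def build_critique_prompt(question, steps, step_labels):
    """steps: list of solution steps; step_labels: list of bools (True = correct)."""
    lines = [f"Question: {question}", "Solution steps with correctness labels:"]
    for i, (step, ok) in enumerate(zip(steps, step_labels), start=1):
        verdict = "correct" if ok else "incorrect"
        lines.append(f"Step {i} [{verdict}]: {step}")
    lines.append(
        "For each step, explain in natural language why it is "
        "correct or where the error lies."
    )
    return "\n".join(lines)

prompt = build_critique_prompt(
    "Tom has 3 boxes of 4 apples. How many apples?",
    ["3 * 4 = 12", "So Tom has 13 apples."],
    [True, False],
)
print(prompt)
```

Because the labels are supplied rather than inferred, the critique model's failure mode shifts from misjudging a step to merely explaining it poorly, which is an easier task.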


First, supervised fine-tuning on the natural language feedback effectively improves the model's evaluation ability.

Second, standard ORM (Outcome Reward Model) and PRM (Process Reward Model) training enables efficient inference. This two-stage approach has two benefits.

One is that two-stage training decouples the binary classification data from the supervised fine-tuning data.

Because the supervisory signal is sparse, far more binary classification data is typically needed than supervised fine-tuning data, and the team found that only a small amount of supervised fine-tuning data is enough to substantially improve the model's evaluation ability.

The other is that at verification time the verifier does not need to explicitly generate natural language feedback, making inference more efficient.
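The second stage, outcome-reward training on binary labels, can be illustrated with a toy stand-in. The real Math-Minos verifier is a fine-tuned LLM; this sketch swaps it for a logistic regression over made-up features purely to show the binary training signal and the cheap inference step (one scalar score, no text generation). Features, data, and names are all assumptions.

```python
import math

# Toy stand-in for ORM training: learn a scorer from binary
# (solution, correct?) labels, then score new solutions with a single
# forward pass instead of generating a critique.

def features(solution):
    # Hypothetical hand-made features; a real verifier encodes the full text.
    return [1.0, float("=" in solution), len(solution) / 100.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_orm(data, epochs=200, lr=0.5):
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for sol, label in data:
            x = features(sol)
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            g = p - label  # gradient of the log-loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def score(w, solution):
    """Scalar correctness score in (0, 1); no text generation at inference."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(solution))))

data = [("3*4 = 12", 1.0), ("3*4 is about 13", 0.0)]
w = train_orm(data)
print(score(w, "3*4 = 12") > score(w, "3*4 is about 13"))  # True
```

The point of the decoupling is visible here: the (cheap, abundant) binary labels drive this stage, while the (scarce, expensive) natural-language feedback was already consumed in the earlier fine-tuning stage.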


ORM task performance improved significantly

Overall, the researchers added 30K natural language feedback examples at training time, improving the mathematical ability of the Mistral-7B verifier. Under the Best-of-256 experimental setting:

In the ORM setting, Math-Minos raised Mistral-7B's accuracy from 86.2% to 87.3% on GSM8K and from 35.9% to 37.4% on MATH.

In the PRM setting, Math-Minos raised Mistral-7B's accuracy from 87.1% to 87.6% on GSM8K and from 36.7% to 37.8% on MATH.

Combined with Self-Consistency, Math-Minos raised Mistral-7B's accuracy from 87.1% to 88.2% on GSM8K and from 37.8% to 38.6% on MATH.

Math-Minos showed superior performance in both the ORM and PRM task settings, with an especially significant improvement in the ORM setting.


In addition, the research team conducted an in-depth, step-level analysis of the errors produced by the generator, classifying them into five types: irrelevant errors, cumulative errors, computational errors, logical errors, and other errors.

The analysis shows that in multi-step reasoning a step can go wrong for many reasons, and the model can commit any of these error types, which further underscores the importance of natural language feedback in guiding the model's learning.

On both datasets, cumulative errors (where an error in one step directly causes errors in all subsequent steps) account for the highest proportion of all error types.

The error distribution also differs across datasets: on the relatively easy GSM8K there are more computational errors, while on the harder MATH dataset there are more logical errors.


By constructing a meta-evaluation set, the research team assessed the verifier's ability to judge final answers accurately, independent of the generator's influence.
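One simple way to run such a generator-independent check is to score a fixed, labeled set of solutions and measure how often the thresholded verifier score matches the gold label. This sketch is an assumption about the shape of the metric, not the paper's exact protocol.

```python
# Sketch of meta-evaluation: accuracy of a verifier's thresholded scores
# against gold correct/incorrect labels on a fixed solution set, so the
# measurement does not depend on which generator produced the candidates.

def meta_eval_accuracy(verifier_scores, gold_labels, threshold=0.5):
    """Fraction of solutions whose thresholded score matches the gold label."""
    hits = sum(
        (s >= threshold) == bool(y)
        for s, y in zip(verifier_scores, gold_labels)
    )
    return hits / len(gold_labels)

print(meta_eval_accuracy([0.9, 0.2, 0.7, 0.4], [1, 0, 0, 1]))  # 0.5
```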

The results show that Math-Minos consistently outperforms the traditional ORM throughout training, converging faster and judging more accurately.


The experimental results also suggest that Math-Minos has strong potential to scale up.


In conclusion, Math-Minos not only improves the performance of mathematical verifiers, but also offers a new training paradigm for natural language processing.

The research team hopes this work will inspire future research into integrating natural language feedback with classification-based verifiers and advance the ability of large language models in complex reasoning tasks.

Paper address:

This article comes from the WeChat official account Quantum Bits (ID: QbitAI), which focuses on cutting-edge technology.