A failure case of Opus-4.7 on a real-world, hard reasoning task

I chatted with Opus-4.7 about this paper. Settings: (1) adaptive reasoning mode and (2) “concise” style. For the chat history see these links 1 and 2.

To understand the failure case, I recommend reading the first three sections of the paper beforehand. I promise it will be an inspiring read. The correct logic chain of those sections is as follows.

The conclusion is: “if the reward hypothesis (assumption 4) holds, then the binary preference relation must satisfy axioms 1–4”. The logic chain is: “the reward hypothesis (assumption 4) holds ⟹ a function that satisfies requirements 1 and 2 specified in theorem 3.1 exists (where assumption 3 proves satisfaction of req 1 and linearity of expectation proves satisfaction of req 2) ⟹ the binary preference relation must satisfy axioms 1–4 (by the ‘iff’ relationship in theorem 3.1)”.
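
The shape of this chain can be written as a small Lean 4 sketch. The proposition names below are hypothetical placeholders of mine, not the paper's notation, and the two hypotheses simply stand in for the steps the paper proves; the sketch only shows how the steps compose, and which direction of the ‘iff’ is used.

```lean
-- Placeholder propositions (hypothetical names, not from the paper):
--   RewardHypothesis : assumption 4 holds
--   FunctionExists   : a function satisfying requirements 1 and 2 of theorem 3.1 exists
--   AxiomsHold       : the binary preference relation satisfies axioms 1–4
variable (RewardHypothesis FunctionExists AxiomsHold : Prop)

example
    -- Step 1: taken as given here; the paper gets req 1 from assumption 3
    -- and req 2 from linearity of expectation.
    (step1 : RewardHypothesis → FunctionExists)
    -- Step 2: the "iff" of theorem 3.1, also taken as given.
    (thm31 : FunctionExists ↔ AxiomsHold) :
    RewardHypothesis → AxiomsHold :=
  -- Assume assumption 4, build the function, then apply one direction of the iff.
  fun h => thm31.mp (step1 h)
```

Note that the proof starts by assuming the reward hypothesis (`fun h => ...`), which is exactly the step the model objected to.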

However, Opus-4.7 failed to understand this logic chain — or more specifically, the logical relationship between the assumptions and the theorems proposed in the paper. This relationship is a hard test of reasoning capacity, and is of course hard for humans as well.

In both chat 1 and chat 2, the model is questioned about whether it is legitimate to assume that assumption 4 holds (assuming it is exactly what should be done). In chat 1, Opus-4.7 says “we’re not assuming the reward hypothesis holds”, which is wrong. In chat 2, Opus-4.7 says “you’re right to push back. We can’t use Assumption 4 to prove the value function satisfies property 1”, and then builds confused reasoning on top of this wrong claim.

However, in both chats, when given the correct logical chain, the model can verify that the given chain is correct.

This is a real-world, hard reasoning task, and Opus-4.7 fails on it. The model can’t reliably produce the full reasoning trace correctly, and even when it can, it is steered down the wrong path if the user prompts it with a wrong trace or questions it heavily.

In short, the model can neither produce the correct reasoning on its own nor hold to it under pushback. Future work could abstract the failure modes out of this case and construct a benchmark of 100 such cases.