On the Limits of LLM Reasoning: When Accuracy Is Not Enough
High accuracy on multiple-choice benchmarks often reflects recall rather than genuine reasoning. By blocking memorization through answer modification, we reveal systematic …
Eva Sánchez Salido
1 min read
