This post has explored a hot topic: do LLMs have reasoning capabilities, or at least some form of reasoning?
The research presented here offers a different view, arguing that LLMs are essentially sophisticated pattern-matching machines. In summary, these studies point out that:
- LLMs are trained on enormous numbers of tokens, so there is a real risk of data contamination for the major benchmark datasets. Even if a model has never seen a given math problem, it has likely encountered many similar examples.
- Thanks to their vast knowledge base and innate pattern-recognition abilities (courtesy of the attention mechanism and in-context learning [19]), they can solve most problems.
- Their fragility to problem variations, token bias, and noise strongly suggests that LLMs are not capable of formal reasoning. Recent results show that even with advanced prompting techniques, models remain vulnerable to noisy, irrelevant (and potentially misleading) information; see the sketch after this list.
- These models can match patterns, but they do not appear to understand any of the mathematical concepts that solving a problem actually rests on.
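To make the fragility tests concrete, here is a minimal Python sketch in the spirit of GSM-Symbolic [18]. It is illustrative only: the template, names, and noise clause are invented for this example, and the sketch merely generates perturbed variants with their ground-truth answers, leaving the actual model call to the reader.

```python
import random

# Minimal sketch of a GSM-Symbolic-style robustness probe (illustrative only):
# turn one math word problem into many variants by swapping names and numbers,
# optionally injecting an irrelevant clause. A genuine reasoner should solve
# every variant; a pattern matcher often fails once the surface form drifts.

TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "{noise}How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Mei", "Omar"]  # hypothetical names
NOISE = "Five of Tuesday's apples are slightly smaller than average. "

def make_variants(n: int, with_noise: bool = False, seed: int = 0):
    """Generate n (prompt, ground-truth answer) pairs for the model under test."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        a, b = rng.randint(2, 50), rng.randint(2, 50)
        prompt = TEMPLATE.format(
            name=rng.choice(NAMES), a=a, b=b,
            noise=NOISE if with_noise else "",
        )
        variants.append((prompt, a + b))  # the noise clause never changes a + b
    return variants

# Each prompt would be sent to the model; the accuracy spread across variants,
# and the clean-vs-noisy gap, is the fragility signal discussed above.
for prompt, answer in make_variants(3, with_noise=True):
    print(answer, "<-", prompt)
```

Under this setup, a drop in accuracy between the original phrasing and its variants (or between clean and noisy variants) is exactly the pattern-matching signature the studies above describe.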
These findings do not negate the usefulness of LLMs; rather, they challenge the claim that LLMs can reason. The results suggest that an LLM can be seen as a machine with extraordinary memory that nevertheless cannot reason (or, one might say, the most sophisticated "stochastic parrot" built to date). This is not to belittle the remarkable technology behind them, which remains a testament to human ingenuity. Further research will likely be needed to understand the capabilities of LLMs more deeply and to develop new model architectures that can actually reason.
References
- Jiang, 2024, A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners, https://arxiv.org/abs/2406.11050
- Shi, 2023, Large Language Models Can Be Easily Distracted by Irrelevant Context, https://proceedings.mlr.press/v202/shi23a.html
- Schaeffer, 2023, Are Emergent Abilities of Large Language Models a Mirage?, https://arxiv.org/pdf/2304.15004
- Wei, 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, https://arxiv.org/abs/2201.11903
- Sprague, 2024, To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, https://arxiv.org/abs/2409.12183
- Valmeekam, 2023, PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
- Kambhampati, 2024, Can Large Language Models Reason and Plan?, https://arxiv.org/abs/2403.04121
- Razeghi, 2022, Impact of Pretraining Term Frequencies on Few-Shot Reasoning, https://arxiv.org/abs/2202.07206
- Mirzadeh, 2024, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, https://arxiv.org/abs/2410.05229
- Valmeekam, 2024, LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench, https://www.arxiv.org/abs/2409.13373
- Lu, 2022, Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, https://aclanthology.org/2022.acl-long.556/
- Zhao, 2021, Calibrate Before Use: Improving Few-shot Performance of Language Models, https://proceedings.mlr.press/v139/zhao21c.html
- Rogers, 2024, Position: Key Claims in LLM Research Have a Long Tail of Footnotes, https://openreview.net/forum?id=M2cwkGleRL
Thanks for reading!
I hope you enjoyed it and learned something new from this blog!
About the author
Salvatore Raieli
Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence
Links in this article
[1]https://github.com/SalvatoreRa/tutorial/blob/main/artificial%20intelligence/FAQ.md#large-language-models:~:text=Large%20Language%20Models,-What%20is%20a
[2]https://en.wikipedia.org/wiki/Natural_language_processing
[3]https://openai.com/index/introducing-openai-o1-preview/
[4]https://aibusiness.com/nlp/chatgpt-update-claims-reasoning-capabilities-industry-reacts
[5]https://gluebenchmark.com/
[6]https://super.gluebenchmark.com/
[7]https://deepgram.com/learn/hellaswag-llm-benchmark-guide
[8]https://paperswithcode.com/area/reasoning
[9]https://arxiv.org/pdf/2406.11050
[10]https://www.promptingguide.ai/techniques
[11]https://ngsf.in/2021/09/19/intelligence-as-an-emergent-property-in-biological-systems/
[12]https://github.com/SalvatoreRa/tutorial/blob/main/artificial%20intelligence/FAQ.md#large-language-models:~:text=What%20does%20it%20mean%20emergent%20properties%3F%20what%20it%20is%20the%20scaling%20law%3F
[13]https://arxiv.org/pdf/2409.12183
[14]https://openai.com/index/learning-to-reason-with-llms/
[15]https://www.lakera.ai/blog/what-is-in-context-learning
[16]https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were/
[17]https://paperswithcode.com/dataset/gsm8k
[18]https://machinelearning.apple.com/research/gsm-symbolic
[19]http://ai.stanford.edu/blog/understanding-incontext/
Original article:
https://towardsdatascience.com/the-savant-syndrome-is-pattern-recognition-equivalent-to-intelligence-242aab928152