OpenAI's o3 is again far better than o3-mini

OpenAI's o3: A Leap Forward in Reasoning Capabilities

OpenAI's o3, announced in December 2024, is the successor to the o1 series and reportedly marks a significant leap forward in AI reasoning capabilities. OpenAI claims that o3 excels particularly at complex programming challenges and mathematical problem-solving, with significant performance gains over its predecessor, o1.

Performance on Benchmarks

o3 has reportedly achieved impressive results on several benchmarks:

  • Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI): o3 achieved nearly 90% accuracy on ARC-AGI, almost three times the score of the o1 models. This highlights a significant advance in OpenAI's model development.  
  • FrontierMath Benchmark: o3 recorded 25% accuracy on FrontierMath, a massive leap from the previous best of roughly 2%. This benchmark is particularly significant because it consists of novel, unpublished problems designed to be far harder than standard datasets; many are at the level of research mathematics, pushing models beyond rote memorization and testing their ability to generalize and reason abstractly.  
  • Codeforces Coding Test: o3 leads with a rating of 2727, significantly outperforming its predecessor o1 (1891) and DeepSeek's R1 (2029). This performance demonstrates its enhanced coding proficiency.  
  • SWE-bench Verified Benchmark: o3 scored 71.7%, surpassing DeepSeek R1 (49.2%) and OpenAI's o1 (48.9%). This highlights o3's strength on real-world software engineering problems.  
  • American Invitational Mathematics Examination (AIME) Benchmark: o3 achieved 96.7% accuracy, outpacing DeepSeek R1 (79.8%) and OpenAI's o1 (78%). This result underscores o3's exceptional mathematical reasoning.  
  • Graduate-Level Google-Proof Q&A (GPQA) Benchmark: o3 scored 87.7% on GPQA-Diamond, significantly outperforming OpenAI's o1 (76.0%) and DeepSeek R1 (71.5%). This indicates superior performance on graduate-level scientific question answering.  
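The head-to-head figures in the list above can be tabulated for a quick side-by-side look. The sketch below simply restates the scores as reported in the article (OpenAI's claims, not independently verified measurements) and picks the top scorer per benchmark:

```python
# Benchmark scores as reported in the article above (vendor-claimed,
# not independently verified). Codeforces is an Elo-style rating;
# the other benchmarks are percentage accuracies.
SCORES = {
    "Codeforces (rating)":    {"o3": 2727.0, "o1": 1891.0, "DeepSeek R1": 2029.0},
    "SWE-bench Verified (%)": {"o3": 71.7,   "o1": 48.9,   "DeepSeek R1": 49.2},
    "AIME (%)":               {"o3": 96.7,   "o1": 78.0,   "DeepSeek R1": 79.8},
    "GPQA-Diamond (%)":       {"o3": 87.7,   "o1": 76.0,   "DeepSeek R1": 71.5},
}

def best_model(benchmark: str) -> str:
    """Return the model with the highest reported score on a benchmark."""
    results = SCORES[benchmark]
    return max(results, key=results.get)

if __name__ == "__main__":
    for bench in SCORES:
        print(f"{bench}: best = {best_model(bench)}")
```

On every benchmark listed, the reported top scorer is o3; note that a rating and a percentage are not directly comparable across rows, so comparisons are only meaningful within a single benchmark.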

All replies:

AIME math competition accuracy is already at 96.7% -Bob007- 02/01/2025 22:04:24

Ordinary people had better stop showing off their math in front of AI -Bob007- 02/01/2025 22:06:08

On the research-level frontier math benchmark, accuracy jumped from a few percent to 25% -Bob007- 02/01/2025 22:11:04

Enthusiasts estimate that o3's running cost is so high that it will be hard to release for public use. -监考老师- 02/01/2025 22:14:21

By the way, the o series are new models with strengthened reasoning, different from ChatGPT. Some users here may still think o1 is GPT-1. -监考老师- 02/01/2025 22:18:05

ChatGPT is actually a front end that can connect to different GPT models on the back end, including o3. There is also an API for development. -大观园的贾探春- 02/01/2025 22:22:27

o1, o3 and GPT-1 through GPT-4 are series with different focuses. -监考老师- 02/01/2025 22:32:25

OpenAI's business model is providing an API for development, not downloads for local install. -大观园的贾探春- 02/01/2025 22:24:19

Right, it runs on OpenAI's compute. o3 is powerful, but its high running cost may make it hard to offer at an attractive service price. -监考老师- 02/01/2025 22:36:35

People who don't want to use an API can use Meta's Llama. Llama is free to download. -大观园的贾探春- 02/01/2025 23:13:56

Running locally has costs too! -监考老师- 02/01/2025 23:22:19

Of course. There is currently no data showing that DeepSeek's local running cost is lower than Llama's. -大观园的贾探春- 02/01/2025 23:24:18

Which AI reasons better, you only know by trying. For example, ask it this problem -t130152- 02/02/2025 01:13:04
