Reasoning Models Seriously Questioned

Deletion of OpenAI Logs is Halted

Posted on June 13th, 2025

Summary

Audio Summmary

Research from Apple criticizes Chain-of-Thought reasoning models like OpenAI’s o1 and o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking and Gemini Thinking, because benchmarks used focus too much on mathematical and coding question outcomes instead of reasoning efficiency, and benchmarks probably suffer from data contamination (where the benchmarks are already part of model training data). The research shows that reasoning models fail to develop generalized reasoning capabilities beyond certain complexity thresholds, and that performance drops to near-zero once the threshold is passed.

An MIT Technology Review article looks at the thriving market for agentic AI platforms in China following the release of Manus in March. The release cycles for AI Apps in China is relatively short due to high internal competition, the high use of existing LLMs, and a highly digitalized user base. In an article posted on his blog, OpenAI CEO Sam Altman writes that “humanity is close to building digital super-intelligence”. He warns however that the alignment problem must first be solved whereby AI behaves in a manner consistent with our collective wishes. Interestingly, he cites social media as an example of poorly aligned platforms.

OpenAI has admitted that it has halted deletion of user ChatGPT “temporary chat” conversations – meaning these client data are still being stored on OpenAI servers. The issue has arisen from the copyright lawsuit with the New York Times where lawyers for the newspaper argued that OpenAI server logs may contain evidence that copyrighted content was used to train ChatGPT models. The judge subsequently ordered OpenAI to halt log deletion. Elsewhere, Salesforce, the owner of Slack, has changed its terms and conditions to prohibit client organizations from using the Slack API to extract data for AI model training.

Laura Bates, the author of the book The New Age of Sexism: How the AI Revolution is Reinventing Misogyny, is questioning Meta’s ability and willingness to protect children and women in the metaverse from sexual abuse. The Center for Countering Digital Hate has found that users experience abusive behavior every seven minutes in the metaverse, including graphic sexual content, bullying, abuse, grooming and threats of violence.

An AI News post considers the benefits of applying blockchain concepts to AI. It underlines how the consolidation of AI computing resources in the hands of data centers controlled by a small number of Tech companies raises governance and security concerns. Blockchain for its part provides a scalable infrastructure with independent networks of GPUs, available at affordable costs, enables content sources to be identified, and enable content and model providers to be rewarded for their contributions.

In cybersecurity news, Microsoft has patched a recent “zero-clickAI attack named EchoLeak that permitted sensitive data to be exfiltrated from Microsoft 365 Copilot. An attacker could include malicious instructions in an email sent to the platform that led to data exfiltration via prompt injection on the AI, without any action by the user himself.

1. Manus has kick-started an AI agent boom in China

This article examines the thriving market for agentic AI platforms in China, following the release of Manus in March by the Wuhan-based startup Butterfly Effect. Two new agentic platforms are Genspark and Flowith which reportedly are already outperforming Manus on benchmark scores. Agentic platforms are built over existing language models and execute secretarial tasks like booking trips, replying to user queries, or managing schedules by interacting with external tools. Manus uses Anthropic’s Claude Sonnet outside of China, but inside, Chinese models have to be used because Anthropic and OpenAI prefer to avoid the Chinese market due to regulatory risks. The Qwen models are currently considered the best Chinese models for agentic use, as DeepSeek is considered to have too high a hallucination rate, while ByteDance’s Doubao and Moonshot’s Kimi models are seen as optimized for entertainment and chat over task execution. The release cycles for AI Apps in China is relatively short due to high internal competition and a highly digitalized user base, and highly integrated AI “super-applications” are expected in the next months.

2. Sam Altman calls for ‘AI privilege’ as OpenAI clarifies court order to retain temporary and deleted ChatGPT sessions

This article reports how OpenAI has currently halted deletion of user ChatGPT “temporary chat” conversations – meaning these client data are still being stored on OpenAI servers. The issue has arisen from the copyright lawsuit The New York Times (NYT) v. OpenAI and Microsoft where lawyers for the newspaper argued that OpenAI server logs may contain evidence that copyrighted content was used to train ChatGPT models without permission from the newspaper. The judge subsequently ordered OpenAI to “preserve and segregate all output log data that would otherwise be deleted on a going forward basis – which includes logs of user conversations. OpenAI says that the customers impacted are users of ChatGPT Free, Plus, Pro and Team users, as well as API customers who do not have a zero data retention (ZDR) agreement. ChatGPT Enterprise users and customers using the API who have a ZDR agreement are not impacted. One concern with the situation is that OpenAI stopped deletion of user logs in the middle of May, but only announced this fact publicly on June 5th. For its part, OpenAI is blaming the New York Times for the situation. Meanwhile, CEO Sam Altman has touted the idea of “AI Privilege” as a potential legal standard in the same vein as doctor-patient confidentiality. In any case, some security officers are advising organizations to include discovery as a potential risk in their risk analyses.

3. Misogyny in the metaverse: is Mark Zuckerberg’s dream world a no-go area for women?

Written by Laura Bates, author of the book The New Age of Sexism: How the AI Revolution is Reinventing Misogyny, this article questions Meta’s ability and willingness to protect children and women in the metaverse from sexual abuse. Meta’s metaverse is an immersive platform that uses virtual and augmented reality technology like headsets, haptic devices (virtual environment interaction using touch-based feedback), 3D positioning audio, to create and interact verbally and physically with avatars. However, one problem is its lack of policing and a proliferation of sexual assaults and grooming offenses. The Center for Countering Digital Hate (CCDH) has found that users experience abusive behavior every seven minutes in the metaverse, including graphic sexual content, bullying, abuse, grooming and threats of violence. The CCDH also found that of 100 identified violations of Meta’s policies, only 51 of the incidents could be reported to Meta using their web forms because they lacked the required categorizations. Meta has said that “for all users we have an automatic protection called personal boundary, which keeps people you don’t know a few feet away from you”. This is criticized by Bates since it puts the onus on the victim to protect himself, rather than having abusive behavior prevented in the first place. Meta has a history of problems relating to protection. The UK’s National Society for the Protection of Cruelty to Children said that 47% of online child grooming offenses took place on Meta platforms between 2017 and 2024.

4. The Gentle Singularity

In an article posted on his blog, OpenAI CEO Sam Altman writes that “humanity is close to building digital super-intelligence”, and that lessons learned in the development of AI like GPT-4 and o3 promise to lead to great progress for mankind. He believes that scientists today are three times more productive using AI and that we can expect faster scientific breakthroughs from now on. Looking forward, Altman expects to see physical robots doing real-world tasks by 2027. By the 2030s, AI will have evolved from generating ideas to implementing ideas. He foresees an economy where AI runs companies and where robots build other robots and the data centers needed for computation. Altman reiterates the progress made by AI in the last few years, writing that “this is how singularity goes: wonders become routine”. He nevertheless issues two warnings. First, the alignment problem needs to be solved whereby AI behaves in a manner consistent with our collective wishes. Interestingly, he cites social media as an example of poorly aligned platforms, since their content algorithms indulge short-term user interests, overriding long-term ones. Altman’s second warning is that super-intelligence must be cheap, and available without any country or corporation holding a monopoly.

5. The AI blockchain: What is it really?

This AI News post considers the benefits of applying blockchain concepts to AI. It underlines how the consolidation of AI computing resources in the hands of data centers controlled by a small number of Tech companies like OpenAI, Google, and Anthropic raises governance and security concerns. The post cites four reasons why blockchains help AI. First, “proof-of-attribution” consensus mechanisms can be used to prove the origins of data used in training or model responses. Second, identifying contributors enables providers of content and models to be rewarded. Third, the blockchain provides an environment for community-owned models, controlled by users using democratic governance processes. Fourth, it provides a scalable infrastructure, like that provided by Render Network which has built up a network of GPUs, giving developers access to infrastructure at more affordable costs. In addition to blockchains contributing to AI, the post also points out how AI can contribute to blockchains by predicting changes in demand and optimizing logistics on supply chain blockchains, or diagnosing diseases via image analysis on health-care blockchains.

6. Salesforce changes Slack API terms to block bulk data access for LLMs

Salesforce, the owner of Slack, is now prohibiting client organizations from using the Slack API to extract data for AI model training. The company says that it is “reinforcing safeguards around how data accessed via Slack APIs can be stored, used, and shared” though the move also permits the company to promote its own AI services. Customers are now forced to use Salesforce’s AI for advanced search. The move will impact many Slack clients. One company, Glean, says that the license changes at Salesforce will restrict their services, thereby “hampering your ability to use your data with your chosen enterprise AI platform”. One analyst says that other companies may follow this example, saying that the move is “part of a broader pattern; we’re seeing platforms tightening their grip on user data under the banner of security or product integrity, but often in ways that primarily serve their own AI ambitions.”.

7. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

This research from Apple provides a critical analysis of the Chain-of-Thought (CoT) thinking models (e.g., OpenAI’s o1 and o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking and Gemini Thinking) which have been touted by Big Tech as demonstrating improved reasoning. However, the researchers argue that benchmarks used by Big Tech focus too much on mathematical and coding benchmarks where only the final answer is verified, rather than focusing on the quality of the whole reasoning process. Also, the benchmarks suffer from data contamination (where benchmarks have found themselves included in model training data, thereby eliminating themselves as independent measures of model performance). The researchers evaluated five state-of-the-art thinking models (o3-mini, DeepSeek-R1, DeepSeek-R1-Qwen-32B, and Claude-3.7-Sonnet) against four increasingly challenging puzzles (from the Tower of Hanoi to a puzzle where a stack of blocks must be moved to a given stack configuration in a minimum number of moves). Traces of the reasoning steps were kept in all cases.

The research shows that reasoning models fail to develop generalized reasoning capabilities beyond certain complexity thresholds, and that performance drops to near-zero once the threshold is passed. Further, for low complexity problems, standard LLMs show better performance. Reasoning models show better performance over standard LLMs as “problem complexity moderately increases. However both types of models collapse at higher problem complexity. Finally, the research also confirms the phenomenon of “overthinking” in these models where, for moderately complex problems, the models inefficiently continue exploring incorrect alternatives.

8. Zero-Click AI Vulnerability Exposes Microsoft 365 Copilot Data Without User Interaction

This article reports a recent “zero-click” AI attack named EchoLeak that permitted sensitive data to be exfiltrated from Microsoft 365 Copilot. The vulnerability relates to when marked-up content is sent to the platform, like via an email, which can contain attacker instructions that lead to a prompt injection on the AI system. In particular, the AI is tricked into executing the malicious instructions in the content with the result of having privileged user data exfiltrated or modified. The attack is zero-click in the sense that the user does need to take any explicit action for the attack to succeed: receipt of the malicious content suffices for the attack. The article also highlights how agent protocols like the Model Context Protocol (MCP) are becoming a new attack vector as more and more agents get deployed. Another example is a recent vulnerability on the GitHub MCP integration that allowed an attacker to takeover a user's agent by hiding instructions in a malicious GitHub issue. The EcoLeak vulnerability was classified on the CVE list as CVE-2025-32711 with a CVSS score of 9.3 (critical or high-severity). Microsoft has issued a patch for the vulnerability in June.