OpenAI has released GPT-5.4, integrating a native computer-use agent (CUA) capability into ChatGPT and the API. It achieves a 75% success rate on OSWorld-Verified tasks (against a human baseline of 72.4%) and supports a 1-million-token context window. A new tool-search feature cuts token consumption during tool calls by 47%. The API adopts tiered pricing: the Standard tier costs $2.50 per million input tokens and $15 per million output tokens; the Pro tier is priced at $30/$180 and adds a 'fast mode' to reduce latency.
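The tiered rates above translate directly into per-request cost. A minimal sketch, using only the per-million-token prices quoted in this item (the helper function is illustrative, not part of any official SDK):

```python
# USD per 1M tokens, as quoted above.
PRICING = {
    "standard": {"input": 2.50, "output": 15.00},
    "pro": {"input": 30.00, "output": 180.00},
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request under the quoted per-1M-token rates."""
    rates = PRICING[tier]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

# A 100k-token prompt with a 5k-token reply on the Standard tier:
# 0.25 (input) + 0.075 (output) = 0.325 USD; the same call on Pro costs 3.90 USD.
```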
Caitlin Kalinowski, head of robotics at OpenAI, resigned on March 7 in opposition to the company's agreement with the U.S. Department of Defense. She criticized the contract's rapid announcement, citing insufficient internal review and governance frameworks, and voiced concern that the technology could be used for warrantless surveillance of U.S. citizens or unauthorized lethal autonomous operations. OpenAI responded that the contract includes 'red lines' prohibiting domestic surveillance and weaponization, and said it will continue engaging employees and external stakeholders. The episode highlights the compliance, reputational, and talent risks of civil-military collaboration.
Tongyi Lab has open-sourced Mobile-Agent-v3.5 and GUI-Owl-1.5, foundation models for native GUI agents that emphasize multi-platform compatibility across desktop, mobile, and browsers. The team built a hybrid data flywheel that synthesizes challenging multi-window scenarios and validates trajectories via real-device testing and task repair. They introduced the MRPO algorithm for joint multi-platform training, mitigating gradient conflicts and signal collapse. The models ship in Instruct and Thinking variants, balancing low-latency execution on edge devices against deep planning in the cloud.
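The article does not specify how MRPO resolves gradient conflicts. For intuition only, here is a sketch of the generic "project out the conflicting component" idea (familiar from methods like PCGrad); this is an illustration of conflict mitigation in general, not MRPO itself:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_if_conflicting(g_task, g_other):
    """If g_task conflicts with g_other (negative dot product),
    remove g_task's component along g_other so the two no longer fight."""
    d = dot(g_task, g_other)
    if d >= 0:
        return list(g_task)  # no conflict: leave the gradient unchanged
    scale = d / dot(g_other, g_other)
    return [gt - scale * go for gt, go in zip(g_task, g_other)]
```

After projection, the adjusted gradient is orthogonal to the conflicting one, so a joint update no longer degrades the other platform's objective.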
Anthropic Publishes Claude Model Spec, Revealing Values and Safety Tiers
AI Governance · Safety · Transparency
Anthropic has publicly released the full Model Spec for Claude—dubbed its 'soul document'—detailing model values, layered safety strategies, and behavioral guidelines. The document explains response boundaries, refusal policies, and degradation rules across risk scenarios for developers and operators. By transforming fragmented safety rules into a reusable, auditable framework, it supports external scrutiny and consistent internal alignment training. This disclosure enhances transparency in model alignment and product governance, setting a reference point for industry discussions on accountability and responsibility.
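One way such layered rules become "reusable and auditable" is by encoding them as an explicit lookup rather than scattered conditionals. The tier names and actions below are invented placeholders to illustrate the shape of such an encoding, not Anthropic's actual schema:

```python
# Hypothetical risk-tier -> behavior table; an auditor can review the whole
# policy in one place, and alignment training can target the same table.
POLICY = {
    "benign": "answer",
    "sensitive": "answer_with_caveats",
    "high_risk": "partial_refusal",
    "prohibited": "refuse",
}

def respond_action(risk_tier: str) -> str:
    # Unknown tiers fail safe to refusal.
    return POLICY.get(risk_tier, "refuse")
```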
Ant Group's OpAgent Tops WebArena with 71.6% Success Rate
AI Agent · Reinforcement Learning
Ant Group has unveiled its Web Agent framework OpAgent, achieving a 71.6% task success rate on the WebArena benchmark. The method first establishes planning, localization, and action capabilities through hierarchical multi-task fine-tuning, replacing error-prone HTML parsing with visual signals. It then applies online reinforcement learning on real websites for self-improvement. Rewards combine outcome evaluation from WebJudge and process-based feedback from RDT. Modular collaboration among Planner, Locator, and Reflector components enhances error correction and robustness in long-horizon tasks.
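The article says the RL reward blends WebJudge's outcome evaluation with RDT's process-based feedback, but gives no coefficients or score ranges. A minimal sketch of such a blend, with hypothetical weights:

```python
def combined_reward(outcome_score: float, process_scores: list,
                    w_outcome: float = 0.7, w_process: float = 0.3) -> float:
    """Blend a terminal outcome score with the mean of per-step process scores.

    outcome_score: e.g. WebJudge's task-success judgment.
    process_scores: e.g. RDT-style per-step feedback along the trajectory.
    Weights are illustrative placeholders, not OpAgent's actual values.
    """
    process = sum(process_scores) / len(process_scores) if process_scores else 0.0
    return w_outcome * outcome_score + w_process * process
```

Process-based terms like this give credit for good intermediate steps even when a long-horizon task ultimately fails, which is why they help with error correction.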
A security investigation revealed 42,089 OpenClaw AI assistant instances exposed on the public internet, with 93% containing critical vulnerabilities. The report highlights CVE-2026-25253 (CVSS 8.8), where attackers can hijack WebSocket connections via malicious websites to achieve remote code execution and steal sensitive data. Backend misconfigurations have also led to prolonged exposure of API keys, user emails, and unencrypted conversations. Additionally, 341 malicious skills were found in the skill marketplace. Users are advised to immediately enable authentication, apply patches, and audit third-party skills and credentials.
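The hardening advice above can be partially automated. A toy audit sketch follows; the config keys are hypothetical (the report does not show OpenClaw's real config format), but the core check is the dangerous combination it describes, a publicly bound instance without authentication:

```python
def audit_config(config: dict) -> list:
    """Flag risky settings in a (hypothetical) OpenClaw-style config dict."""
    findings = []
    bind = config.get("bind_address", "127.0.0.1")
    if bind not in ("127.0.0.1", "localhost"):
        if not config.get("auth_enabled", False):
            findings.append("publicly bound without authentication")
    if config.get("allow_unsigned_skills", False):
        findings.append("unsigned third-party skills allowed")
    return findings
```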
Karpathy Open-Sources autoresearch: 630-Line 5-Minute Experiment Loop on Single GPU
Open Source · AI Agent · Research Tool
Andrej Karpathy has open-sourced autoresearch—a ~630-line codebase demonstrating a 'research agent' that automatically modifies code, runs training, and iterates based on metrics within Git branches. The project uses a fixed '5-minute training loop' to standardize experimental budgets, enabling direct comparison of architectures and hyperparameters on identical hardware, including single-GPU setups. The workflow shifts human effort from writing training scripts to maintaining prompts and goal files (program.md), showcasing a reproducible paradigm for automated exploration and self-improvement under limited compute.
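The fixed-budget idea can be sketched in a few lines. This is a toy stand-in, not code from the autoresearch repo: `run_experiment` here is any callable given a deadline, where the real project edits code in Git branches and runs actual training, and the 5-minute budget is shrunk for illustration:

```python
import time

def budgeted_run(run_experiment, budget_seconds: float):
    """Run one candidate under a fixed wall-clock budget; return (metric, elapsed)."""
    start = time.monotonic()
    deadline = start + budget_seconds
    metric = run_experiment(deadline)  # candidate must respect the deadline
    return metric, time.monotonic() - start

def pick_best(candidates: dict, budget_seconds: float):
    """Give every candidate the identical budget and return the best-scoring name.

    Identical budgets on identical hardware are what make the comparison fair.
    """
    results = {name: budgeted_run(fn, budget_seconds)[0]
               for name, fn in candidates.items()}
    return max(results, key=results.get)
```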
Rspress 2.0 has launched as an 'AI-Native Documentation' tool. It introduces SSG-MD, which renders Markdown and llms.txt directly via React virtual DOM, reducing noise from HTML conversion and improving LLM/agent readability. Build-side optimizations include lazy compilation and persistent caching, cutting cold start times to 50ms and boosting overall build speed by up to 60%. The new theme system uses BEM for style decoupling, and updated syntax highlighting and MDX parsing enhance ecosystem compatibility.
Yann LeCun Paper Proposes SAI, Advocating Adaptation Speed Over AGI Goal
Research · AI Roadmap
Yann LeCun's team has published a paper arguing that 'AGI' is ill-defined and lacks actionable metrics, proposing 'Superhuman Adaptivity Intelligence' (SAI) as an alternative: systems are measured by cross-task superiority and by how quickly they learn and adapt to new tasks. The paper advocates moving beyond path dependence on autoregressive large models, urging greater exploration of self-supervised learning, world models, and hierarchical modular architectures (e.g., JEPA and the Dreamer series) to improve sample efficiency and transferability. The framework offers a new direction for research goals and evaluation standards.
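The paper's concrete adaptivity metric is not reproduced here, but "speed of adapting" suggests simple proxies such as steps-to-threshold on a new task's learning curve (fewer steps = faster adaptation). A hypothetical sketch of that proxy:

```python
def steps_to_threshold(learning_curve: list, threshold: float):
    """Return the 1-based step at which the curve first reaches threshold,
    or None if it never does. Lower is better: faster adaptation."""
    for step, score in enumerate(learning_curve, start=1):
        if score >= threshold:
            return step
    return None
```

Unlike a single end-of-training accuracy, a metric like this distinguishes a system that reaches competence in three trials from one that needs three thousand.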
According to the Nikkei, approximately 80% of companies now regard AI agent adoption as a 'priority initiative,' shifting demand from chat assistants to autonomous process automation. Enterprises are piloting agents in email handling, scheduling, and document workflows, but most lack quantitative ROI assessment and continuous improvement mechanisms, making it difficult to measure time savings or error reduction. The report notes growing interest in infrastructure and services for performance visualization, monitoring, optimization, and auditing—key drivers for the next wave of procurement and deployment.
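The missing quantitative ROI assessment the report describes can start very simply. The formula below is a generic illustration with placeholder inputs, not a methodology from the Nikkei report:

```python
def agent_roi(hours_saved: float, hourly_rate: float,
              error_cost: float, agent_cost: float) -> float:
    """ROI = (benefit - cost) / cost, where benefit nets out error handling.

    hours_saved: staff hours the agent eliminated over the period.
    error_cost:  cost of cleaning up agent mistakes over the same period.
    agent_cost:  licensing plus operating cost of running the agent.
    """
    benefit = hours_saved * hourly_rate - error_cost
    return (benefit - agent_cost) / agent_cost

# e.g. 100 hours saved at $50/h, $500 of error cleanup, $1,500 of agent cost
# yields ROI = (4500 - 1500) / 1500 = 2.0, i.e. a 200% return.
```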