Sunday, July 5, 2026
10 stories · 3 min read

Today's Highlights

1

Mistral Releases Open-Source Math Model Leanstral 1.5, Solves 587/672 Problems on PutnamBench

Open Source · Mathematical Reasoning

Mistral AI has launched Leanstral 1.5, an open-source mathematical reasoning model released under Apache-2.0 and built on the Lean 4 formal language. The model solved 587 of the 672 problems in the PutnamBench benchmark (about 87%), demonstrating strong capabilities in code reasoning and formal proof generation. The release continues Mistral's commitment to open AI and offers a high-performance foundation model for mathematical reasoning and automated theorem proving.

2

Malicious AI Agent Skill Bypasses Security Checks, Affecting Over 26,000 Users

AI Security · Supply Chain Attack

A malicious AI agent skill disguised as a Google Stitch landing page assistant bypassed static security scanners and reached more than 26,000 users through Instagram promotions. Once the skill had gained trust and distribution, the attackers modified the external payloads it loads to deliver malicious scripts capable of stealing emails or compromising systems. Security experts note that static scanning inspects only the submitted files and never sees content fetched from external domains. The recommended practice is to treat AI skills as live third-party dependencies, with version pinning, least privilege, runtime network restrictions, and continuous validation rather than a one-time security check.
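
One way to read the version-pinning advice is to record a hash of each external payload at review time and refuse anything that no longer matches. The sketch below illustrates that idea only; `PINNED_SHA256`, `verify_payload`, and the example URL are hypothetical names and do not come from the article or from any real skill platform.

```python
import hashlib

# Hypothetical payload content as it looked when the skill was reviewed.
REVIEWED_PAYLOAD = b"console.log('render landing page');"

# Hypothetical pin table: payload URL -> SHA-256 recorded at review time.
PINNED_SHA256 = {
    "https://example.com/skill/payload.js": hashlib.sha256(REVIEWED_PAYLOAD).hexdigest(),
}


def verify_payload(url: str, content: bytes) -> bool:
    """Return True only if the fetched content matches the hash pinned for this URL."""
    expected = PINNED_SHA256.get(url)
    if expected is None:
        return False  # unpinned sources are rejected by default
    return hashlib.sha256(content).hexdigest() == expected
```

A check like this has to run every time the payload is fetched, not once at submission, which is the gap the article attributes to static scanning.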

3

NVIDIA Launches HORIZON Framework, Achieves 100% Pass Rate on RTL Chip Design Benchmarks

AI Agent · Chip Design

NVIDIA has released the HORIZON framework, which treats RTL chip design as a repository-level code evolution task based on Git worktrees. The framework uses Markdown to define goals, knowledge, evaluators, and acceptance criteria. Agents commit a Git change only when evaluations pass, which turns the repository history into an experience buffer. HORIZON reports a 100% pass rate across all test suites, including ChipBench, RTLLM-2.0, and Verilog-Eval, with the sole failure attributed to a benchmark defect. Thanks to session reuse, 91% of input tokens are cached. The research team argues that token efficiency, not final pass rate, is now the key optimization metric.
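
The commit-only-on-pass rule can be pictured as a simple gate. The sketch below is a toy illustration, not HORIZON's implementation: `InMemoryRepo` stands in for a Git worktree, and the two evaluators are hypothetical placeholders for real lint and simulation checks.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical evaluator type: takes candidate RTL text, returns pass/fail.
Evaluator = Callable[[str], bool]


@dataclass
class InMemoryRepo:
    """Stand-in for a Git worktree; a real setup would call git instead."""
    history: List[str] = field(default_factory=list)

    def commit(self, change: str) -> None:
        self.history.append(change)


def gated_commit(repo: InMemoryRepo, change: str, evaluators: List[Evaluator]) -> bool:
    """Commit the change only if every evaluator accepts it."""
    if all(evaluate(change) for evaluate in evaluators):
        repo.commit(change)
        return True
    return False


def declares_module(rtl: str) -> bool:
    return rtl.lstrip().startswith("module")


def closes_module(rtl: str) -> bool:
    return rtl.rstrip().endswith("endmodule")


repo = InMemoryRepo()
checks = [declares_module, closes_module]
gated_commit(repo, "module adder; endmodule", checks)  # accepted and recorded
gated_commit(repo, "module broken;", checks)           # rejected, history unchanged
```

Because rejected changes never land, the history contains only states that passed evaluation, which is what lets it serve as an experience buffer.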

4

New RAG Paradigm: Typed Answer Contracts Enforce Structured Output to Suppress Hallucinations

RAG · Engineering Practice

A technical article proposes Typed Answer Contracts for RAG, replacing free-text outputs with structured schemas to suppress LLM hallucinations. The method defines typed schemas such as Amount, DateValue, and TableValue, forcing the model to populate values, with citations, strictly from the given passages rather than from memory. The approach supports multi-element answers and multi-span references, and includes self-evaluation fields such as confidence and answer_found, so the system can detect partial answers or conflicting evidence and trigger re-retrieval. The core principle: never delegate computation or comparison to the LLM. Extract the values first, then perform the comparison deterministically in Python.
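
A contract of this kind might look like the sketch below. Only the names `Amount`, `confidence`, and `answer_found` come from the article; the field layout, the `Citation` and `AmountAnswer` classes, the helper functions, and the 0.7 confidence floor are assumptions made for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Citation:
    passage_id: str  # which retrieved passage the value came from
    quote: str       # the exact span that supports the value


@dataclass(frozen=True)
class Amount:
    value: Decimal
    currency: str
    citations: List[Citation]


@dataclass(frozen=True)
class AmountAnswer:
    answer_found: bool
    confidence: float  # model's self-reported confidence, 0.0 to 1.0
    amount: Optional[Amount] = None


def needs_re_retrieval(answer: AmountAnswer, min_confidence: float = 0.7) -> bool:
    """Ask for another retrieval round when the contract is not satisfied."""
    if not answer.answer_found or answer.amount is None:
        return True
    if not answer.amount.citations:
        return True  # a value with no supporting span is treated as unsupported
    return answer.confidence < min_confidence


def exceeds(answer: AmountAnswer, limit: Decimal) -> bool:
    """Deterministic comparison done in Python rather than by the LLM."""
    return answer.amount is not None and answer.amount.value > limit


answer = AmountAnswer(
    answer_found=True,
    confidence=0.92,
    amount=Amount(Decimal("1250.00"), "USD", [Citation("p3", "total due: $1,250.00")]),
)
```

The model's job ends at filling in `answer`; whether 1,250 exceeds a limit is decided by `exceeds`, not by generated text.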

5

NVIDIA Introduces Self-Evolving Robotics Framework ASPIRE, Reaches 31% Zero-Shot Performance on Long-Horizon Tasks

Embodied Intelligence · Robotics

NVIDIA has introduced ASPIRE, a self-improving robotics framework capable of autonomously writing, debugging, and distilling reusable robotic control skills from multimodal execution trajectories. ASPIRE replaces coarse task-level feedback with fine-grained primitive-level trajectories, enabling precise failure localization and distillation of validated fixes into transferable skill libraries. On the LIBERO-Pro long-horizon tasks, it achieves 31% zero-shot transfer performance, far surpassing prior methods at around 4%. Skills discovered in simulation successfully transfer to real dual-arm robots, improving drawer-opening success from 0/20 to 11/20.

6

Shanghai Jiao Tong University Proposes ICRDrag, First Context-Aware Region-Drag Image Editing Model

Image Editing · Diffusion Model

Shanghai Jiao Tong University presented ICRDrag at ECCV 2026, the first region-drag image editing model based on contextual learning. The method replaces traditional single-point controls with masks, precisely defining edit regions via source and target masks to fundamentally resolve ambiguity and deformation issues in drag-based editing. An image-mask attention consistency constraint ensures fidelity, while staged curriculum training improves tolerance to rough hand-drawn masks. The team also built PRD, the first large-scale region-drag dataset based on millions of video frames, containing 287,000 paired samples and a 1,000-sample evaluation benchmark, filling a critical gap in the field.

7

Study: Filesystem Interfaces Are Cheaper Than SQL, Reduce Agent Token Usage by 45%

AI Agent · Cost Optimization

An experimental study shows that filesystem-style interfaces can reduce AI agent token consumption by 45% and lower costs by 39% compared to SQL interfaces, while delivering more stable performance on complex exploration tasks. On compound exploration tasks, NoKV namespace prompts require approximately 53,300 tokens versus 127,500 for SQL, so SQL uses roughly 2.4 times as many tokens. The study notes that SQL requires agents to understand the schema, construct joins, and infer field relationships, whereas filesystems offer stable operations through paths, directories, and grep. While SQL remains more efficient for simple structured queries, filesystems show clear advantages in complex exploratory scenarios.
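
The two token counts quoted for compound exploration tasks can be compared directly. This is plain arithmetic on the article's figures; the variable names are illustrative.

```python
# Token counts quoted in the study for compound exploration tasks.
FILESYSTEM_TOKENS = 53_300
SQL_TOKENS = 127_500

ratio = SQL_TOKENS / FILESYSTEM_TOKENS          # how many times more tokens SQL used
reduction = 1 - FILESYSTEM_TOKENS / SQL_TOKENS  # fraction of tokens saved

print(f"SQL uses {ratio:.2f}x the tokens")    # SQL uses 2.39x the tokens
print(f"Filesystem saves {reduction:.1%}")    # Filesystem saves 58.2%
```

The saving on this task class is larger than the 45% headline figure, which suggests the headline is an aggregate across task types, though the article does not say so explicitly.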

8

ACL 2026 Paper E-GRM: Dynamic Routing in Reward Models Cuts Latency by 62%, Boosts Accuracy by 3.3%

Reward Model · Inference Optimization

The ACL 2026 paper E-GRM introduces a dynamic routing mechanism based on internal model consensus, allowing reward models to allocate compute on demand. The mechanism defines consensus as the maximum frequency of an answer across five parallel decodings; if above threshold, a short path is taken to output directly, costing only 15–20% of full computation. E-GRM replaces traditional majority voting with a discriminative scorer that outputs continuous quality scores. Experiments show 62% latency reduction, 49% FLOPs reduction, and 3.3% accuracy improvement on the MATH dataset; on RewardBench, it scores 91.5%, surpassing GPT-4o's 73.8%.
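
The consensus rule can be sketched in a few lines. This is an illustration of the routing idea as summarized above, not the paper's code: the function names, the normalization of the count to a share, the use of "at or above" the threshold, and the 0.8 threshold itself are assumptions.

```python
from collections import Counter
from typing import List, Tuple


def consensus(answers: List[str]) -> Tuple[str, float]:
    """Return the most frequent answer and its share of the samples."""
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / len(answers)


def route(answers: List[str], threshold: float = 0.8) -> str:
    """Pick the short path when the parallel decodings agree strongly enough."""
    _, share = consensus(answers)
    return "short" if share >= threshold else "full"


# Five parallel decodings, as in the paper's setup.
print(route(["A", "A", "A", "A", "A"]))  # short
print(route(["A", "A", "A", "B", "C"]))  # full
```

In the second call the top answer holds only three of five samples, so the input falls through to the full, more expensive scoring path.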

9

Meta Bans Millions of AI Impersonation Accounts, 20 Florida Communities Resist AI Data Centers

AI Governance · Social Impact

The societal impact of AI continues to grow. Meta has banned millions of accounts to combat the proliferation of fake profiles using AI to impersonate real creators. Meanwhile, 20 communities in Florida are pushing to ban or freeze AI data center projects, reflecting local resistance to the rapid expansion of AI infrastructure. Additionally, the film "Artificial" produced by OpenAI, after being pulled by Amazon, will now be distributed by independent studio Neon. These events collectively highlight the legal, ethical, and social challenges posed by the rapid advancement of AI.

10

Open-Source Local Agent Model Agents A1 Released, 35B MoE with Only 3B Active Parameters

Open Source · Local Deployment

Agents A1 has been released as an open-source model built for local agent and coding work rather than chat. It uses a 35B Mixture-of-Experts (MoE) architecture with approximately 3B active parameters, is licensed under Apache 2.0, and was trained specifically on long action, observation, and validation trajectories. Evaluations show strong performance on long-horizon search, GAIA, BrowseComp, and instruction-following tasks. Reviewers note that it combines the knowledge of a 35B model with the speed of a 3B model, runs smoothly on Macs with 32GB of unified memory using 4-bit quantization, and supports deployment via LM Studio, Ollama, and other platforms, making it a representative example of privacy-preserving local agents.
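
The 32GB claim can be sanity-checked with a rough weight-memory estimate. The sketch counts weights only and ignores the KV cache, activations, and quantization metadata, so real usage will be higher; `weight_memory_gb` is a hypothetical helper, not part of any deployment tool.

```python
def weight_memory_gb(total_params: int, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return total_params * bits_per_param / 8 / 1e9


# All 35B parameters must be resident in memory, not just the ~3B active per token.
TOTAL_PARAMS = 35_000_000_000

print(weight_memory_gb(TOTAL_PARAMS, 4))   # 17.5
print(weight_memory_gb(TOTAL_PARAMS, 16))  # 70.0
```

At 4 bits the weights take about 17.5 GB, which leaves headroom on a 32GB machine, while the same model at 16 bits would not fit.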
