Superhuman Web Agents, 3D Scene Diffusion, and Math-Formalizing RL
From high-uncertainty web navigation to language-embedded 3D generation - this week’s top AI drops.
Before we jump in, let’s thank Superflex. Superflex is an extension for VSCode, Cursor, and Windsurf that converts Figma designs into production-ready code that matches your coding style and design system and reuses your UI components.
Get 30% off Superflex for 1 month. Use code HACKERPULSE.
This week in frontier AI: Agents are getting smarter, faster, and more capable of tackling complex, multi-domain tasks - from web navigation and formal mathematics to 3D scene generation. We explore new architectures and training paradigms that unlock generalization, cross-agent learning, and real-time interaction.
Here’s what’s new:
🧭 WebSailor: An open-source web agent trained with a novel RL algorithm (DUPO) and cold-start high-uncertainty tasks. It now matches proprietary agents like DeepResearch on tough info-seeking benchmarks.
🧠 Agent KB: A shared knowledge base with a Reason–Retrieve–Refine pipeline that lets agents learn from cross-domain experiences, improving performance by up to 16.28 percentage points on GAIA and excelling at code repair.
🌍 LangScene-X: A TriMap video diffusion model reconstructs generalizable 3D scenes from sparse 2D inputs, embedding semantic and language cues without dense-view training or per-scene optimization.
🧮 CriticLean: This RL framework turns the critic into an active semantic validator, generating formal Lean code from natural language math and producing a 285K-problem corpus for formal reasoning.
🚶‍♀️ StreamVLN: A streaming VLN model that combines a fast-streaming dialogue context with a slow-updating compressed memory for real-time navigation, achieving state-of-the-art results at stable low latency, ideal for embodied agents and assistants.
WebSailor: Navigating Super-human Reasoning for Web Agent (🔗 Read the Paper)
WebSailor introduces a post-training methodology that enables open-source web agents to achieve superhuman performance on complex information-seeking tasks by teaching them to systematically reduce uncertainty when navigating vast information landscapes. The approach uses novel high-uncertainty task generation, cold-start training, and a new reinforcement learning algorithm (DUPO) to match the performance of proprietary systems like DeepResearch on challenging benchmarks.
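To make the uncertainty-reduction idea concrete, here is a minimal sketch (not the paper’s code, and not DUPO itself) of an agent loop that picks the next web query expected to shrink the entropy of its belief over candidate answers. The `simulate` helper and the toy beliefs are assumptions for illustration only.

```python
# Hypothetical sketch: an agent that picks the next web action expected to
# shrink its uncertainty over candidate answers, measured as Shannon entropy.
import math

def entropy(belief):
    """Shannon entropy of a {candidate: probability} belief."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def choose_action(belief, actions, simulate):
    """Pick the action whose simulated outcome leaves the lowest entropy.

    `simulate(belief, action)` is an assumed helper that returns the belief
    the agent expects to hold after taking `action` (e.g. a search query).
    """
    return min(actions, key=lambda a: entropy(simulate(belief, a)))

# Toy usage: two candidate answers, two queries; "q2" is more informative.
belief = {"answer_a": 0.5, "answer_b": 0.5}
outcomes = {
    "q1": {"answer_a": 0.6, "answer_b": 0.4},
    "q2": {"answer_a": 0.9, "answer_b": 0.1},
}
best = choose_action(belief, ["q1", "q2"], lambda b, a: outcomes[a])
print(best)  # -> "q2", the query expected to reduce uncertainty the most
```

In WebSailor proper, this pressure toward uncertainty reduction comes from training on deliberately high-uncertainty tasks and from the DUPO reinforcement learning algorithm rather than from an explicit entropy planner like the one above.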
Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving (🔗 Read the Paper)
Agent KB introduces a hierarchical experience framework with a Reason-Retrieve-Refine pipeline that enables language agents to share knowledge and learn from each other's experiences across domains. The system achieves substantial performance improvements on complex benchmarks, with success rates increasing by up to 16.28 percentage points on GAIA and notable gains on code repair tasks.
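Here is a rough sketch of what a Reason–Retrieve–Refine loop over a shared experience store could look like. Class and function names are illustrative, not Agent KB’s API, and simple keyword overlap stands in for real retrieval.

```python
# Hypothetical sketch of a Reason-Retrieve-Refine loop over a shared
# experience store; names and the retrieval heuristic are illustrative.
from dataclasses import dataclass, field

@dataclass
class Experience:
    task: str
    lesson: str

@dataclass
class SharedKB:
    experiences: list = field(default_factory=list)

    def add(self, exp: Experience):
        self.experiences.append(exp)

    def retrieve(self, query: str, k: int = 2):
        # Naive keyword overlap stands in for embedding-based retrieval.
        def score(e):
            return len(set(query.lower().split()) & set(e.task.lower().split()))
        return sorted(self.experiences, key=score, reverse=True)[:k]

def reason_retrieve_refine(task: str, kb: SharedKB, draft_plan: str) -> str:
    # Reason: form an initial plan (here, passed in as `draft_plan`).
    # Retrieve: pull cross-domain experiences that look relevant.
    hints = kb.retrieve(task)
    # Refine: fold the retrieved lessons back into the plan.
    return draft_plan + "".join(f"\n  - apply lesson: {h.lesson}" for h in hints)

kb = SharedKB()
kb.add(Experience("fix failing unit test in web scraper", "re-run tests after each patch"))
kb.add(Experience("answer multi-hop GAIA question", "verify each intermediate fact"))
print(reason_retrieve_refine("repair code that breaks a unit test", kb, "Plan: locate the bug"))
```

The point of the shared store is that lessons written by one agent in one domain (say, web research) can be retrieved and reused by another agent on a different task (say, code repair).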
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion (🔗 Read the Paper)
LangScene-X introduces a generative framework that reconstructs 3D language-embedded scenes from sparse 2D views by training a TriMap video diffusion model to generate consistent RGB, geometry, and semantic information, combined with a Language Quantized Compressor for cross-scene generalization. This approach overcomes the limitations of dense-view reconstruction methods and enables open-vocabulary 3D scene understanding without requiring per-scene optimization.
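The Language Quantized Compressor idea, in miniature: quantize high-dimensional per-pixel language features into a small discrete codebook so the semantic field stays compact and transfers across scenes. The sketch below uses random stand-in features and an assumed 64-entry codebook; it is not LangScene-X’s implementation.

```python
# Hypothetical sketch of language-feature quantization: map each language
# embedding to its nearest entry in a small learned codebook. Shapes and the
# codebook size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, K = 512, 64                      # embedding dim, codebook size (assumed)
codebook = rng.normal(size=(K, D))  # learned in the real system; random here

def quantize(features):
    """Return the nearest-codebook index and vector for each feature row."""
    # features: (N, D) language embeddings, e.g. CLIP-like per-pixel features
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    ids = dists.argmin(axis=1)                                            # (N,)
    return ids, codebook[ids]

pixels = rng.normal(size=(100, D))   # stand-in for rendered language features
ids, recon = quantize(pixels)
print(ids.shape, recon.shape)        # (100,) (100, 512)
# Storing 100 indices into a 64-entry codebook is far cheaper than 100x512 floats.
```

In the full system, these compact codes are what get embedded alongside the RGB and geometry branches produced by the TriMap video diffusion model, which is what makes open-vocabulary queries possible without per-scene optimization.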
CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization (🔗 Read the Paper)
CriticLean introduces a novel reinforcement learning framework that transforms the critic phase from passive validation to active learning for mathematical formalization, using CriticLeanGPT to assess semantic fidelity of formal code translations. The approach produces FineLeanCorpus, a dataset of over 285K problems, and demonstrates that optimizing the critic phase is essential for generating reliable formal mathematical proofs.
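A back-of-the-envelope sketch of critic-in-the-loop formalization: a generator proposes Lean code for a natural-language statement, a critic scores semantic fidelity, and the score gates acceptance (and would serve as the RL reward). `generate_lean`, `critic_score`, and the threshold are stand-ins, not CriticLean’s actual models or hyperparameters.

```python
# Hypothetical sketch of a critic-guided formalization step; the generator and
# critic below are canned stand-ins, not CriticLeanGPT.
def formalize_with_critic(statement, generate_lean, critic_score,
                          threshold=0.8, max_tries=4):
    best, best_score = None, -1.0
    for _ in range(max_tries):
        candidate = generate_lean(statement)        # e.g. an LLM call
        score = critic_score(statement, candidate)  # semantic-fidelity reward in [0, 1]
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:                      # accept into the corpus
            break
    return best, best_score

# Toy usage with canned functions standing in for the models.
stmt = "The sum of two even integers is even."
fake_gen = lambda s: ("theorem even_add_even (a b : Int) (ha : Even a) "
                      "(hb : Even b) : Even (a + b) := by exact ha.add hb")
fake_critic = lambda s, c: 0.9 if "Even (a + b)" in c else 0.2
code, score = formalize_with_critic(stmt, fake_gen, fake_critic)
print(score, code[:60])
```

The paper’s key claim is that making this critic an actively trained semantic judge, rather than a passive compile-or-not check, is what makes the resulting FineLeanCorpus reliable.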
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling (🔗 Read the Paper)
StreamVLN introduces a hybrid slow-fast context modeling framework for real-time vision-and-language navigation that balances fine-grained visual understanding with computational efficiency through fast-streaming dialogue context and slow-updating compressed memory. The method achieves state-of-the-art performance on VLN-CE benchmarks while maintaining stable low latency suitable for real-world deployment.
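To illustrate the slow-fast split, here is a toy context manager that keeps a short window of recent frame features (fast context) and periodically folds them into a fixed-size memory (slow context). The mean-pooling “compression” and all names are assumptions for illustration, not StreamVLN’s architecture.

```python
# Hypothetical sketch of slow-fast context bookkeeping; the real model learns
# the compression rather than mean-pooling it.
from collections import deque
import numpy as np

class SlowFastContext:
    def __init__(self, window=8, update_every=8, dim=256):
        self.fast = deque(maxlen=window)   # recent frame features (fast context)
        self.slow = np.zeros(dim)          # compressed long-horizon memory (slow context)
        self.update_every = update_every
        self._since_update = 0

    def observe(self, frame_feat):
        self.fast.append(frame_feat)
        self._since_update += 1
        if self._since_update >= self.update_every:
            # Slow update: merge the current window into memory, then keep streaming.
            self.slow = 0.5 * self.slow + 0.5 * np.mean(list(self.fast), axis=0)
            self._since_update = 0

    def context(self):
        # What the policy would attend over at each step: fast tokens + slow memory.
        return list(self.fast), self.slow

ctx = SlowFastContext(dim=4)
for t in range(20):
    ctx.observe(np.random.default_rng(t).normal(size=4))
fast, slow = ctx.context()
print(len(fast), slow.shape)   # 8 (4,)
```

The payoff of this split is that per-step inference only touches a small, bounded context, which is what keeps latency stable over long navigation episodes.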