🤖 Agent Swarms, Trillion-Param Multimodal Models, and Real-World Robot Vision
This week: drone radar collision avoidance, agent swarm multimodal AI, generalist robot policies, ERNIE 5.0, and a new benchmark fixing multimodal search evaluation.
Before we jump in, a quick opportunity drop: a16z just launched Speedrun Alpha, which places top early-career engineers into full-time roles at breakout startups, with a built-in SF fellowship and access to the a16z ecosystem.
This week's research spans real-world robotics perception, multimodal agent swarms, generalist robot policies, trillion-parameter multimodal foundation models, and new benchmarks redefining multimodal search evaluation.
Here's what's new:
📡 Omnidirectional mmWave Radar for UAV Collision Avoidance – A 360° solid-state radar system enabling drones to detect thin power lines up to 10 meters away, supporting safe flight at speeds over 10 m/s. Lightweight spherical sensing makes real-world infrastructure navigation far more reliable.
🤖 Kimi K2.5 (Visual Agentic Intelligence) – An open multimodal model with an Agent Swarm architecture that decomposes tasks into parallel sub-agents. Achieves SOTA results with 4.5× lower latency, pushing agent orchestration toward real-world scalability.
🦾 Green-VLA – A staged training framework enabling one policy to control multiple robot types (humanoids, arms, mobile robots). Uses curriculum learning across foundation models, grounding, and RL to improve real-world generalization.
🧬 ERNIE 5.0 – A trillion-parameter multimodal MoE foundation model with modality-agnostic routing and elastic training. Produces a family of deployable sub-models with different latency/performance tradeoffs from a single training run.
🔍 Vision-DeepResearch Benchmark (VDR-Bench) – A new benchmark eliminating shortcut text cues in multimodal search evaluation. Introduces cropped multi-round search workflows that significantly improve visual retrieval reliability.
Omnidirectional Solid-State mmWave Radar Perception for UAV Power Line Collision Avoidance (📄 Read the Paper)
This paper presents an omnidirectional mmWave radar system for UAVs that provides 360-degree detection of power lines up to 10 meters away, enabling safe autonomous and manual flight with successful collision avoidance maneuvers at speeds exceeding 10 m/s. The key innovation is the integration of multiple compact solid-state radar modules into robust, lightweight spherical sensing coverage optimized specifically for thin-wire detection in real-world power line environments.
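To make the fusion idea concrete, here is a minimal sketch of how detections from several fixed radar modules could be merged into one 360° picture and turned into an avoidance decision. The module layout, field-of-view math, and the 1.5 s time-to-collision limit are illustrative assumptions, not values from the paper; only the 10 m thin-wire range comes from the summary above.

```python
DETECTION_RANGE_M = 10.0  # thin-wire detection range reported for the system


def fuse_detections(modules):
    """Merge per-module detections (azimuth_deg, range_m) into one list,
    offsetting each azimuth by the module's mounting angle (assumed layout)."""
    fused = []
    for mount_deg, detections in modules:
        for az_deg, rng_m in detections:
            fused.append(((mount_deg + az_deg) % 360.0, rng_m))
    return fused


def should_avoid(fused, speed_mps, ttc_limit_s=1.5):
    """Flag avoidance when any detection's time-to-collision drops below
    a limit (1.5 s is an assumption for illustration)."""
    for _az_deg, rng_m in fused:
        if rng_m <= DETECTION_RANGE_M and rng_m / max(speed_mps, 1e-6) < ttc_limit_s:
            return True
    return False


# Four modules at 90° spacing; the front module sees a wire 8 m ahead.
modules = [
    (0.0,   [(10.0, 8.0)]),  # front: wire at 8 m
    (90.0,  []),
    (180.0, []),
    (270.0, []),
]
fused = fuse_detections(modules)
print(should_avoid(fused, speed_mps=10.0))  # 8 m / 10 m/s = 0.8 s < 1.5 s → True
```

At 10 m/s the wire at 8 m gives 0.8 s to impact, so the sketch triggers avoidance; at low speed the same detection would not.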
Kimi K2.5: Visual Agentic Intelligence (📄 Read the Paper)
Kimi K2.5 is an open-source multimodal model that jointly optimizes text and vision capabilities and introduces Agent Swarm, a parallel agent framework that decomposes complex tasks into concurrent sub-problems, achieving state-of-the-art results across multiple domains with a 4.5× reduction in latency. The work advances agentic AI through integrated multimodal learning and efficient task orchestration.
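The swarm pattern itself is simple to sketch: a planner splits a task into sub-problems, sub-agents run concurrently, and the results are gathered at the end. The decomposition and the stub `sub_agent` below are assumptions for illustration; Kimi K2.5's actual orchestration is far more sophisticated.

```python
import asyncio


async def sub_agent(name: str, sub_task: str) -> str:
    # Stand-in for real model/tool calls by one sub-agent.
    await asyncio.sleep(0.01)
    return f"{name}: solved '{sub_task}'"


async def swarm(sub_tasks: list[str]) -> list[str]:
    # Launch one sub-agent per sub-problem and run them concurrently,
    # instead of solving the sub-problems one after another.
    jobs = [sub_agent(f"agent-{i}", t) for i, t in enumerate(sub_tasks)]
    return await asyncio.gather(*jobs)


results = asyncio.run(swarm([
    "read the chart",
    "extract the numbers",
    "cross-check sources",
]))
for r in results:
    print(r)
```

Because the sub-agents overlap in time, wall-clock latency scales with the slowest sub-task rather than the sum of all of them, which is the intuition behind the reported latency win.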
Green-VLA: Staged Vision-Language-Action Model for Generalist Robots (📄 Read the Paper)
Green-VLA presents a staged training framework for vision-language-action models that enables a single policy to control diverse robot embodiments (humanoids, mobile manipulators, arms). A five-level curriculum combining foundation models, multimodal grounding, and reinforcement learning yields improved generalization and robustness across simulation and real-world tasks.
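The staged-curriculum idea can be sketched as a single checkpoint passed through successive training stages, each building on the last. The stage names and the no-op "training" step below are illustrative assumptions; the paper's actual five levels and losses differ.

```python
# Hypothetical stage names standing in for Green-VLA's five-level curriculum.
STAGES = [
    "foundation pretraining",
    "multimodal grounding",
    "action head training",
    "cross-embodiment fine-tuning",
    "reinforcement learning",
]


def train_stage(policy: dict, stage: str) -> dict:
    # Stand-in for a real training loop: record that the stage ran and
    # carry the checkpoint forward to the next stage.
    return dict(policy, completed=policy["completed"] + [stage])


policy = {"embodiments": ["arm", "humanoid", "mobile manipulator"], "completed": []}
for stage in STAGES:
    policy = train_stage(policy, stage)

print(policy["completed"])
```

The point of the structure is that one policy object flows through every stage, so the final checkpoint controls all listed embodiments rather than one policy per robot.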
ERNIE 5.0 Technical Report (📄 Read the Paper)
ERNIE 5.0 is a trillion-parameter autoregressive foundation model that unifies multimodal understanding and generation across text, image, video, and audio through a sparse mixture-of-experts architecture with modality-agnostic routing. Its key innovation is an elastic training paradigm that learns a family of sub-models with varying sizes and latencies within a single pre-training run, enabling flexible deployment across diverse resource constraints while maintaining strong performance across modalities.
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models (📄 Read the Paper)
This paper introduces the Vision-DeepResearch Benchmark (VDR-Bench), a rigorously curated 2,000-instance benchmark that addresses fundamental limitations in evaluating MLLMs' visual and textual search capabilities by eliminating answer leakage through textual cues and creating more realistic retrieval scenarios. The authors also propose a multi-round cropped-search workflow that significantly improves MLLMs' visual retrieval performance, offering practical insights for designing multimodal deep-research systems.
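A hedged sketch of what a multi-round cropped-search loop could look like: instead of querying once with the full image, the system iteratively crops toward a region of interest and re-searches until confidence clears a threshold. `locate_region` and `visual_search` are hypothetical stubs standing in for an MLLM grounding step and an image-search tool; the confidence model is invented for illustration.

```python
def locate_region(image: dict) -> dict:
    # Stub grounding step: halve the image around its center each round.
    return {"w": image["w"] // 2, "h": image["h"] // 2,
            "depth": image.get("depth", 0) + 1}


def visual_search(image: dict) -> float:
    # Stub retrieval: tighter crops yield higher confidence (assumed model).
    return min(1.0, 0.3 + 0.25 * image.get("depth", 0))


def cropped_search(image: dict, max_rounds: int = 5, threshold: float = 0.9):
    """Crop and re-search until retrieval confidence clears the threshold."""
    rounds = 0
    conf = visual_search(image)
    while conf < threshold and rounds < max_rounds:
        image = locate_region(image)
        conf = visual_search(image)
        rounds += 1
    return rounds, conf


rounds, conf = cropped_search({"w": 1024, "h": 768})
print(rounds, conf)  # three crop rounds before confidence reaches 1.0
```

The loop captures the benchmark's insight: forcing retrieval to work from progressively tighter visual evidence, rather than from a single full-image query (or leaked text cues), is what makes the evaluation, and the workflow, more reliable.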


