Smarter Multimodal Models, RPG Benchmarks, and Surprising Scaling Insights
Native-res vision, long-horizon planning, and why bigger isn’t always better in open-source LLMs.
This week’s research highlights new frontiers in multimodal reasoning, game-inspired planning benchmarks, and data analysis LLMs built for efficiency. We also get surprising results on OpenAI’s GPT-OSS models and a new benchmark that asks: can AI truly understand human intentions with empathy?
Here’s what’s new:
🖼️ Ovis2.5: An open-source multimodal LLM under 40B parameters with native-resolution vision processing and an optional “thinking mode” for harder tasks. Achieves SOTA performance through a five-phase curriculum while balancing speed and accuracy.
🗺️ HeroBench: A long-horizon planning benchmark set in RPG-inspired virtual worlds. Models must strategize across resource gathering, skill mastery, and equipment crafting—revealing weaknesses in structured reasoning not captured by standard benchmarks.
⚖️ GPT-OSS Evaluation: A comprehensive analysis of OpenAI’s new GPT-OSS models finds the 20B variant outperforms the larger 120B on several tasks, challenging assumptions about scaling in sparse MoE models. Bigger isn’t always better.
📊 Datarus-R1: A 14B open-weights LLM fine-tuned on full analytical trajectories (reasoning + execution + error correction). Outperforms similar models by up to 30% on tough data analysis benchmarks, while cutting token usage by up to 49%.
🤝 HumanSense: A new benchmark for testing whether multimodal LLMs can provide empathetic, context-aware responses. Shows leading models struggle, but omni-modal inputs + staged RL can substantially improve reasoning for human-centered interactions.
Ovis2.5 Technical Report (🔗 Read the Paper)
Ovis2.5 introduces native-resolution visual processing and reflection-based reasoning capabilities, achieving state-of-the-art performance among open-source multimodal language models under 40B parameters through a comprehensive five-phase training curriculum. The model processes images at variable native resolutions without degradation and includes an optional "thinking mode" that trades latency for enhanced accuracy on complex visual reasoning tasks.
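If it helps to picture what an optional "thinking mode" means in practice, here is a minimal sketch of the latency-for-accuracy trade-off; the class and method names are invented for illustration and are not the Ovis2.5 API:

```python
from dataclasses import dataclass

# Illustrative sketch of how an optional "thinking mode" can trade latency for
# accuracy at inference time. Class and method names are assumptions made for
# this example, not the actual Ovis2.5 interface.

@dataclass
class GenerationConfig:
    thinking: bool = False      # if True, produce a reflection trace before answering
    max_new_tokens: int = 512


class ToyMultimodalModel:
    def reason(self, image, question, budget):
        # Stand-in for the slower reflection pass used in thinking mode.
        return f"step-by-step visual analysis of {question!r} within {budget} tokens"

    def answer(self, image, question, context=None):
        # Stand-in for a single decoding pass; conditioning on a trace costs latency.
        prefix = "(grounded in reflection) " if context else ""
        return prefix + f"answer to {question!r}"


def generate(model, image, question, cfg):
    if cfg.thinking:
        trace = model.reason(image, question, budget=cfg.max_new_tokens)
        return model.answer(image, question, context=trace)   # slower, more accurate
    return model.answer(image, question)                      # faster, direct answer


print(generate(ToyMultimodalModel(), image=None,
               question="What does the chart show?",
               cfg=GenerationConfig(thinking=True)))
```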
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds (🔗 Read the Paper)
HeroBench introduces a novel benchmark that evaluates large language models' long-horizon planning abilities in complex RPG-inspired virtual worlds, requiring models to formulate strategic plans involving resource gathering, skill mastery, and equipment crafting. Evaluation of 25 state-of-the-art LLMs revealed substantial performance disparities and specific weaknesses in high-level planning and structured action execution that aren't captured by conventional reasoning benchmarks.
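To make the "structured action execution" part concrete, here is a toy sketch of the dependency checking such tasks imply; the recipe graph and validity check are invented for illustration and are not taken from HeroBench itself:

```python
# Toy example: a crafting goal with ordered prerequisites, in the spirit of
# HeroBench-style tasks. The recipe graph below is invented for illustration.

RECIPES = {
    "iron_ore": [],
    "smithing_skill": [],
    "iron_ingot": ["iron_ore"],
    "iron_sword": ["iron_ingot", "smithing_skill"],
}

def plan_is_valid(plan, goal, recipes):
    """Every step's prerequisites must appear earlier in the plan."""
    done = set()
    for step in plan:
        if any(req not in done for req in recipes[step]):
            return False
        done.add(step)
    return goal in done

# A proposed plan succeeds only if its ordering satisfies all dependencies.
print(plan_is_valid(["iron_ore", "smithing_skill", "iron_ingot", "iron_sword"],
                    goal="iron_sword", recipes=RECIPES))  # True
print(plan_is_valid(["iron_sword", "iron_ore"], goal="iron_sword", recipes=RECIPES))  # False
```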
Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models (🔗 Read the Paper)
This study evaluates OpenAI's newly released GPT-OSS models (20B and 120B parameters) against six contemporary open source language models, finding that the smaller 20B variant surprisingly outperforms the 120B model on several benchmarks despite requiring fewer computational resources. The results challenge assumptions about scaling in sparse architectures and suggest that larger parameter counts don't necessarily translate to better performance in mixture-of-experts models.
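A quick way to see why headline parameter counts can mislead for sparse models: in a top-k routed mixture-of-experts layer, only a fraction of the expert weights are active for any given token. The figures in this sketch are illustrative placeholders, not the published GPT-OSS configurations:

```python
# Back-of-the-envelope arithmetic for sparse MoE models: only the routed experts
# are active per token, so total and active parameter counts diverge sharply.
# All numbers below are illustrative, not the published GPT-OSS configurations.

def active_params(expert_params, n_experts, top_k, shared_params):
    """Approximate parameters used per token with top-k expert routing."""
    return shared_params + expert_params * (top_k / n_experts)

# A "110B" model (100B expert weights over 64 experts, top-4 routing, 10B shared
# attention/embedding weights) touches only ~16B parameters per token.
per_token = active_params(expert_params=100e9, n_experts=64, top_k=4, shared_params=10e9)
print(f"~{per_token / 1e9:.1f}B active parameters per token")
```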
Datarus-R1: An Adaptive Multi-Step Reasoning LLM for Automated Data Analysis (🔗 Read the Paper)
Datarus-R1-14B is an open-weights language model built for automated data analysis. Rather than training on isolated Q&A pairs, it is fine-tuned on full analytical trajectories (reasoning, code execution, and error correction) and incorporates dual reasoning modes alongside a sophisticated reward system. The model achieves up to 30% higher accuracy than similar-sized models on challenging benchmarks such as AIME 2024/2025 while generating 18-49% fewer tokens, demonstrating efficient multi-step reasoning that avoids the verbose loops common in contemporary systems.
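As a rough picture of what "full analytical trajectories rather than isolated Q&A pairs" looks like as training data, here is a sketch; the field names and contents are assumptions chosen for illustration, not the Datarus-R1 data schema:

```python
# Contrast between an isolated Q&A pair and a full analytical trajectory as a
# training record. Field names and contents are illustrative assumptions, not
# the Datarus-R1 data schema.

qa_pair = {
    "question": "What is the mean of the 'price' column?",
    "answer": "142.7",
}

trajectory = {
    "question": "What is the mean of the 'price' column?",
    "steps": [
        {"thought": "Load the CSV and compute the mean.",
         "code": "df = pd.read_csv('sales.csv'); df['price'].mean()",
         "observation": "FileNotFoundError: sales.csv"},
        {"thought": "Wrong path; the file lives under data/.",   # error correction step
         "code": "df = pd.read_csv('data/sales.csv'); df['price'].mean()",
         "observation": "142.7"},
    ],
    "final_answer": "142.7",
}
```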
HumanSense: From Multimodal Perception to Empathetic Context-Aware Responses through Reasoning MLLMs (🔗 Read the Paper)
HumanSense introduces a comprehensive benchmark for evaluating multimodal large language models' ability to understand complex human intentions and provide empathetic, context-aware responses, revealing significant room for improvement in current leading models. The researchers demonstrate that omni-modal inputs and multi-stage reinforcement learning can substantially enhance reasoning abilities for human-centered interactions, with successful reasoning processes showing consistent thought patterns that can be leveraged through prompt design.
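As a loose illustration of "leveraging consistent thought patterns through prompt design", here is a sketch of a staged reasoning scaffold; the stage names and wording are assumptions for the example, not the prompts used in the paper:

```python
# Sketch of a staged reasoning prompt in the spirit of the paper's observation
# that successful responses follow consistent stages (perception -> emotion ->
# intent -> response). Stage names and wording are illustrative assumptions.

STAGES = [
    "describe what you perceive (text, audio, facial cues)",
    "infer the speaker's emotional state",
    "infer the speaker's underlying intention",
    "compose an empathetic, context-aware reply",
]

def build_prompt(user_turn, stages=STAGES):
    steps = "\n".join(f"{i}. {stage}" for i, stage in enumerate(stages, 1))
    return ("Before answering, reason through these stages:\n"
            f"{steps}\n\n"
            f"User (with accompanying audio/video context): {user_turn}\n"
            "Reply:")

print(build_prompt("I failed my driving test again..."))
```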