Creative Benchmarks, RL for LLMs, and SmolDocling’s Big Impact
DAPO’s open RL system, creative multimodal intelligence, and smarter document conversion in 256M params.
Welcome to this week’s AI Fridays, where efficiency meets creativity. Learn how DAPO makes reinforcement learning for LLMs reproducible and high-performing, and how Creation-MMBench pushes the boundaries of context-aware creative intelligence in multimodal models. Explore SmolDocling, a tiny but mighty VLM for document conversion, TULIP’s unified language-image pretraining framework, and a deep dive into the failure modes of clustering sliding-window time series data.
Here’s what’s new:
📊 Sliding Window Clustering: Three failure modes explained—when time series clustering breaks down.
🔁 DAPO: A fully open-source RL system for LLMs that hits SOTA on AIME 2024 and closes the reproducibility gap.
📄 SmolDocling: A 256M VLM that competes with models 27x its size for document conversion tasks.
🌷 TULIP: Unifying language-image pretraining with generative augmentation and contrastive learning.
🎨 Creation-MMBench: Benchmarking creative intelligence in multimodal models—and where open models fall short.
On the clustering behavior of sliding windows (🔗 Read the Paper)
The study identifies three distinct failure modes that arise when clustering sliding-window representations of time series, backing each with theoretical proofs and examples: window sizes that are large relative to the series length produce meaningless clusters, symmetries among windows waste cluster capacity, and windows get forced into interval-based groupings.
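To make the first failure mode tangible, here is a minimal sketch (assuming NumPy and scikit-learn; the window length and cluster count are illustrative choices, not values from the paper) of the well-known observation that k-means on z-normalized sliding windows yields smooth, near-sinusoidal centers that say little about the input:

```python
# Minimal sketch: k-means on sliding-window subsequences of a time series
# tends to produce input-independent, wave-like cluster centers.
# Window length `w` and cluster count `k` are illustrative, not from the paper.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
series = np.cumsum(rng.standard_normal(2000))  # a random walk

w, k = 64, 4
# Extract all overlapping windows of length w.
windows = np.lib.stride_tricks.sliding_window_view(series, w)
# z-normalize each window, as is common before clustering.
windows = (windows - windows.mean(axis=1, keepdims=True)) / (
    windows.std(axis=1, keepdims=True) + 1e-8
)

centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(windows).cluster_centers_
# Plot or inspect the centers: they look like shifted smooth waves
# regardless of the underlying series, i.e. the clusters are uninformative.
print(centers.shape)  # (4, 64)
```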
DAPO: An Open-Source LLM Reinforcement Learning System at Scale (🔗 Read the Paper)
DAPO is a fully open-source reinforcement learning system for LLMs that achieves state-of-the-art performance (50 points on AIME 2024) while releasing its training techniques and code in full, closing the reproducibility gap in current LLM RL research.
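One technique the DAPO paper reports is “Clip-Higher,” a PPO-style surrogate with decoupled lower and upper clipping ranges. Below is a minimal NumPy sketch of that idea; the epsilon values are illustrative, and the paper’s full objective also involves dynamic sampling, token-level loss aggregation, and reward shaping for overlong responses.

```python
# A minimal sketch of an asymmetric ("decoupled") PPO-style clip,
# the Clip-Higher idea reported in DAPO. Epsilons are illustrative;
# this is not the paper's complete objective.
import numpy as np

def decoupled_clip_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO clipped surrogate with different lower/upper clip ranges.

    ratio:     pi_theta(a|s) / pi_old(a|s), per token
    advantage: advantage estimate, per token
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    # Maximize the pessimistic surrogate -> minimize its negation.
    return -np.minimum(unclipped, clipped).mean()

ratios = np.array([0.8, 1.0, 1.3])
advs = np.array([1.0, -0.5, 2.0])
print(decoupled_clip_loss(ratios, advs))
```

Raising the upper clip range lets low-probability tokens grow faster during exploration, which the paper reports helps prevent entropy collapse.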
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion (🔗 Read the Paper)
SmolDocling is a 256M-parameter vision-language model that performs end-to-end document conversion by generating DocTags, a markup that captures page elements such as code, tables, equations, and charts together with their spatial layout. Despite its size, it matches the performance of models 27x larger across diverse document types.
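For a feel of how such a model would be driven, here is a hedged usage sketch with Hugging Face transformers; the checkpoint id, prompt wording, and chat-template details are assumptions rather than confirmed details from the paper.

```python
# Hypothetical usage sketch, assuming SmolDocling is published on the
# Hugging Face Hub (checkpoint id and prompt below are assumptions).
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = "ds4sd/SmolDocling-256M-preview"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

image = Image.open("page.png")  # one rendered document page
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Convert this page to DocTags."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=1024)
# The decoded string is DocTags markup: tagged text, tables, code,
# equations, and charts with their positions on the page.
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```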
TULIP: Towards Unified Language-Image Pretraining (🔗 Read the Paper)
TULIP is a unified language-image pretraining framework that bridges vision-centric and language-aligned models. By combining generative data augmentation with enhanced contrastive learning, it achieves state-of-the-art results and outperforms existing models such as SigLIP across multiple vision and language benchmarks.
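As context for the contrastive component, here is a minimal PyTorch sketch of the standard symmetric image-text contrastive (InfoNCE) loss that CLIP-style frameworks build on; this is the generic objective, not TULIP’s exact loss, which additionally incorporates generative augmentation.

```python
# A minimal sketch of a symmetric image-text contrastive (InfoNCE) loss,
# the generic objective that language-image pretraining frameworks extend.
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize so dot products are cosine similarities.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(img))          # matched pairs on the diagonal
    # Symmetric cross-entropy: image->text and text->image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

img_emb = torch.randn(8, 512)  # batch of image embeddings
txt_emb = torch.randn(8, 512)  # batch of paired text embeddings
print(clip_style_loss(img_emb, txt_emb))
```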
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM (🔗 Read the Paper)
Creation-MMBench is a comprehensive benchmark that evaluates creative intelligence in multimodal AI systems through 765 image-based test cases. Its results show that current open-source models significantly lag behind proprietary ones, and that visual fine-tuning can actually impair a base model’s creative capabilities.
See you next week!


