Welcome to this week’s AI Fridays, where we explore groundbreaking advancements in AI reasoning, multimodal processing, and real-world applications. SWE-Lancer puts LLMs to the test on real-world software engineering tasks, Step-Video-T2V pushes the boundaries of text-to-video generation, and Mix Distillation helps small models learn reasoning more effectively. We also dive into mmMamba’s efficient multimodal framework and YOLOv12’s attention-centric real-time object detection.
Here’s what’s new:
💰 SWE-Lancer: Can AI models earn $1M freelancing? A new benchmark puts them to the test.
🎥 Step-Video-T2V: A 30B-parameter text-to-video model generating high-quality 204-frame videos.
🧠 Mix Distillation: Optimizing small models by balancing complex and simple reasoning examples.
🚀 mmMamba: 20.6x faster multimodal state space models with 75.8% memory reduction.
📸 YOLOv12: An attention-centric real-time object detector achieving 40.6% mAP with ultra-low latency.
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? (🔗 Read the Story)
SWE-Lancer introduces a novel benchmark of real-world freelance software engineering tasks from Upwork, collectively worth $1 million in freelance payouts. It measures both the technical implementation and managerial decision-making capabilities of AI models, and it reveals that even frontier models currently fail to solve the majority of these practical challenges.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model (🔗 Read the Story)
Step-Video-T2V introduces a groundbreaking 30B-parameter text-to-video model built around a highly compressed Video-VAE and bilingual prompt understanding. It achieves state-of-the-art performance in generating high-quality videos of up to 204 frames, and it addresses key challenges in video synthesis through techniques such as Flow Matching training and Video-DPO preference optimization.
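If you're curious what Flow Matching training looks like in practice, here is a minimal rectified-flow-style training step in PyTorch. It is a sketch of the general objective only, not Step-Video-T2V's actual training code; the `model` interface, tensor shapes, and the linear interpolation path are our assumptions.

```python
import torch

def flow_matching_step(model, x1, optimizer):
    """One simplified Flow Matching training step (illustrative sketch).

    x1: a batch of clean latents, e.g. shape (B, C, T, H, W) for video.
    model(x_t, t) is assumed to predict the velocity field v = x1 - x0.
    """
    x0 = torch.randn_like(x1)                     # noise sample
    t = torch.rand(x1.size(0), device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))      # broadcast over data dims

    x_t = (1 - t_) * x0 + t_ * x1                 # linear interpolation path
    target_v = x1 - x0                            # constant target velocity

    pred_v = model(x_t, t)
    loss = torch.nn.functional.mse_loss(pred_v, target_v)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference, the learned velocity field is integrated from noise to data in a small number of steps, which is part of why flow-based objectives are attractive for large video models.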
Small Models Struggle to Learn from Strong Reasoners (🔗 Read the Story)
Despite the promise of distilling reasoning capabilities from large language models, small models (≤3B parameters) perform better when trained on simpler reasoning patterns matched to their capacity. This finding motivates "Mix Distillation", a hybrid approach that combines complex and simple reasoning examples to optimize small-model performance.
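The takeaway is a data recipe rather than a new architecture: blend a majority of short, simple chain-of-thought examples with a smaller share of long, complex ones when the student is small. Here is a minimal sketch of assembling such a mixed fine-tuning set; the 0.2 ratio, pool names, and sampling scheme are illustrative assumptions, not the paper's exact settings.

```python
import random

def build_mix_distillation_set(complex_examples, simple_examples,
                               complex_ratio=0.2, total=10_000, seed=0):
    """Blend complex (long chain-of-thought) and simple (short chain-of-thought)
    distillation examples at a fixed ratio.

    complex_ratio: fraction of the final set drawn from the complex pool.
    The 0.2 default is illustrative; the right balance depends on student size.
    """
    rng = random.Random(seed)
    n_complex = int(total * complex_ratio)
    n_simple = total - n_complex

    mixed = (rng.choices(complex_examples, k=n_complex) +
             rng.choices(simple_examples, k=n_simple))
    rng.shuffle(mixed)
    return mixed  # feed into standard supervised fine-tuning
```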
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation (🔗 Read the Story)
mmMamba introduces a framework for distilling quadratic-complexity multimodal language models into linear-complexity state space models. Using a novel three-stage distillation process that eliminates the need for a separate vision encoder, it achieves up to a 20.6x speedup and a 75.8% memory reduction while maintaining competitive performance.
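The headline speedup comes from replacing quadratic softmax attention with a linear-time state-space recurrence. Below is a toy side-by-side of the two update rules in PyTorch; the second function is a generic diagonal state-space scan meant only to show the complexity difference, not mmMamba's actual Mamba-2-based layer.

```python
import torch

def quadratic_attention(q, k, v):
    """Standard softmax attention: O(L^2) cost in sequence length L."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, L, L) matrix
    return torch.softmax(scores, dim=-1) @ v

def linear_state_space_scan(x, A, B, C):
    """A generic diagonal state-space recurrence: O(L) cost in sequence length.

    x: (batch, L, d) inputs; A, B, C: (d,) per-channel parameters.
    Recurrence: h_t = A * h_{t-1} + B * x_t ;  y_t = C * h_t
    """
    batch, L, d = x.shape
    h = torch.zeros(batch, d, device=x.device)
    ys = []
    for t in range(L):
        h = A * h + B * x[:, t]   # constant-size state replaces the L x L matrix
        ys.append(C * h)
    return torch.stack(ys, dim=1)
```

Because the recurrent state has a fixed size regardless of sequence length, both compute and memory stay flat as the multimodal context grows, which is where the reported speedup and memory savings come from.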
YOLOv12: Attention-Centric Real-Time Object Detectors (🔗 Read the Story)
YOLOv12 introduces an attention-centric framework that surpasses both CNN-based and DETR-based detectors in accuracy while maintaining competitive real-time speeds: 40.6% mAP at just 1.64 ms latency on a T4 GPU, with significantly reduced computational requirements.
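The trick to making attention real-time is restricting it to local regions of the feature map instead of attending globally. The sketch below splits a feature map into horizontal strips and attends within each one; it is a rough simplification of the area-attention idea, and the strip partitioning, `num_areas` value, and module shape are our assumptions rather than YOLOv12's actual implementation.

```python
import torch
import torch.nn.functional as F

def area_attention(x, num_areas=4):
    """Self-attention restricted to horizontal strips of the feature map.

    x: (B, C, H, W) feature map; H is assumed divisible by num_areas.
    Attending within each strip keeps the cost well below full (H*W)^2 attention.
    """
    B, C, H, W = x.shape
    strip_h = H // num_areas
    out = torch.empty_like(x)
    for i in range(num_areas):
        strip = x[:, :, i * strip_h:(i + 1) * strip_h, :]   # (B, C, strip_h, W)
        tokens = strip.flatten(2).transpose(1, 2)            # (B, strip_h*W, C)
        attn = F.scaled_dot_product_attention(tokens, tokens, tokens)
        out[:, :, i * strip_h:(i + 1) * strip_h, :] = (
            attn.transpose(1, 2).reshape(B, C, strip_h, W))
    return out
```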