🎬 Defining AGI, 3D Learning, and Cinematic AI
This week: measuring AGI, 2D-3D fusion, robot intelligence benchmarks, unified planning models, and holistic AI filmmaking.
This week's research dives into defining AGI, bridging 2D and 3D understanding, testing robot practical intelligence, unifying planning and action for AI agents, and a breakthrough in cinematic long-form video generation.
Here's what's new:
🧠 A Definition of AGI – Proposes a measurable AGI framework based on human cognitive theory, testing AI systems across ten domains. Current leaders: GPT-4 at 27%, GPT-5 at 57%. The biggest weakness? Long-term memory.
🕹️ Concerto – A 2D-3D self-supervised learning system combining spatial and cross-modal learning for coherent scene understanding. Outperforms standalone models by 14.2%, setting new benchmarks on ScanNet.
🤖 Butter-Bench – A new benchmark evaluating LLM-controlled robots in practical environments. While humans hit 95% accuracy, LLMs reach only 40%, struggling with multi-step spatial and social reasoning tasks.
⚙️ ReCode – A framework that unifies planning and action as one recursive process. Lets LLM agents dynamically adjust task granularity, improving inference efficiency and decision-making precision.
🎬 HoloCine – A cinematic video generation model that processes entire scenes holistically for consistent, multi-shot storytelling. Introduces director-level control and achieves SOTA in narrative coherence and character persistence.
A Definition of AGI (📄 Read the Paper)
This paper proposes a quantifiable definition of AGI grounded in the Cattell-Horn-Carroll theory of human cognition, measuring AI systems across ten cognitive domains with adapted psychometric tests. Applied to current models, the framework reveals only partial cognitive competence (GPT-4: 27%, GPT-5: 57%), with pronounced deficits in long-term memory, thereby operationalizing the gap between specialized AI and human-level general intelligence.
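The scoring idea lends itself to a small sketch: a composite score built as an equal-weighted aggregate of per-domain results. The snippet below is illustrative only; the domain names, the equal weighting, and the example profile are assumptions standing in for the paper's actual test battery.

```python
# Minimal sketch: a composite AGI score as the equal-weighted average of
# per-domain scores across ten CHC-inspired cognitive domains.
# Domain names, weighting, and the example profile are illustrative assumptions.

CHC_DOMAINS = [
    "general_knowledge", "reading_writing", "math", "on_the_spot_reasoning",
    "working_memory", "long_term_memory_storage", "long_term_memory_retrieval",
    "visual_processing", "auditory_processing", "processing_speed",
]

def agi_score(domain_scores: dict[str, float]) -> float:
    """Average the per-domain scores (each in [0, 1]) and report a percentage."""
    missing = set(CHC_DOMAINS) - set(domain_scores)
    if missing:
        raise ValueError(f"missing domains: {sorted(missing)}")
    return 100 * sum(domain_scores[d] for d in CHC_DOMAINS) / len(CHC_DOMAINS)

# Hypothetical profile: strong knowledge and reasoning, weak long-term memory.
profile = {d: 0.75 for d in CHC_DOMAINS}
profile["long_term_memory_storage"] = 0.05
profile["long_term_memory_retrieval"] = 0.10
print(f"Composite score: {agi_score(profile):.1f}%")
```

The point of the equal weighting is that no single strength can mask a missing capability: a model that aces nine domains but has essentially no long-term memory is still capped well below 100%.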
Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations (📄 Read the Paper)
Concerto is a self-supervised learning framework that combines 3D self-distillation with 2D-3D cross-modal learning so that robust spatial representations emerge, outperforming standalone 2D and 3D models by 14.2% and 4.8% respectively and achieving state-of-the-art results on scene understanding benchmarks such as ScanNet (80.7% mIoU). The approach demonstrates that multisensory learning produces more coherent spatial features that transfer effectively to downstream tasks and can even bridge to language models.
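A rough way to picture the joint objective is a weighted sum of a 3D self-distillation term and a 2D-3D alignment term over point-pixel correspondences. The sketch below is a minimal stand-in, not Concerto's actual losses: the cosine objectives, the w_cross weight, and the precomputed correspondences are all assumptions.

```python
import torch
import torch.nn.functional as F

def concerto_style_loss(student_pts, teacher_pts, point_feats, pixel_feats, w_cross=1.0):
    """Illustrative joint objective: 3D self-distillation plus 2D-3D alignment."""
    # Self-distillation: pull student point features toward a detached teacher
    # (a cosine objective stands in for whatever loss the paper actually uses).
    distill = 1.0 - F.cosine_similarity(student_pts, teacher_pts.detach(), dim=-1).mean()
    # Cross-modal term: align each 3D point feature with the 2D pixel feature it
    # projects onto (point-pixel correspondences assumed precomputed upstream).
    cross = 1.0 - F.cosine_similarity(point_feats, pixel_feats, dim=-1).mean()
    return distill + w_cross * cross

# Toy usage with random features standing in for real point and pixel embeddings.
n, d = 1024, 256
loss = concerto_style_loss(torch.randn(n, d), torch.randn(n, d),
                           torch.randn(n, d), torch.randn(n, d))
print(loss.item())
```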
Butter-Bench: Evaluating LLM Controlled Robots for Practical Intelligence (📄 Read the Paper)
Butter-Bench is a new benchmark for evaluating LLMs in robot control systems, and it reveals a significant gap between LLM and human performance on practical robotic tasks: state-of-the-art LLMs reach only 40% accuracy versus humans' 95%, struggling in particular with multi-step spatial planning and social understanding, and fine-tuning for embodied reasoning does not close the gap.
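One way to build intuition for why multi-step tasks are so punishing: if every step must succeed for the task to count, per-step reliability compounds geometrically with task length. The numbers below are hypothetical, assuming independent steps; they are not Butter-Bench measurements.

```python
# Hypothetical per-step success rates and task lengths -- not Butter-Bench data.
# Under independence, an n-step task succeeds with probability p ** n, so even a
# fairly reliable per-step policy degrades quickly on long, multi-step tasks.

def task_success(p_step: float, n_steps: int) -> float:
    """Probability that every one of n_steps independent steps succeeds."""
    return p_step ** n_steps

for p in (0.99, 0.90, 0.80):
    row = ", ".join(f"{n} steps: {task_success(p, n):.0%}" for n in (3, 6, 12))
    print(f"per-step {p:.0%} -> {row}")
```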
ReCode: Unify Plan and Action for Universal Granularity Control (📄 Read the Paper)
ReCode proposes a unified code-generation paradigm that represents planning and action as a single recursive decomposition process, where high-level plans are recursively broken down into primitive actions, enabling LLM agents to dynamically control decision granularity while improving both inference performance and training efficiency.
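The core mechanism is easy to sketch: treat every task as code, execute it directly if it is already a primitive, and otherwise ask the model to rewrite it as finer-grained sub-tasks and recurse. The llm_decompose stub, the primitive set, and the depth cap below are assumptions for illustration, not ReCode's actual interface.

```python
# Illustrative recursive plan-to-action decomposition in the spirit of ReCode.
PRIMITIVES = {"move_to", "grasp", "release", "push"}

def llm_decompose(task: str) -> list[str]:
    """Stand-in for an LLM call that rewrites a high-level task as sub-tasks."""
    canned = {
        "tidy the desk": ["stow the cup", "shut the drawer"],
        "stow the cup": ["move_to cup", "grasp cup", "move_to shelf", "release cup"],
        "shut the drawer": ["move_to drawer", "push drawer"],
    }
    return canned.get(task, [task])

def execute(task: str, depth: int = 0, max_depth: int = 5) -> None:
    head = task.split()[0]
    if head in PRIMITIVES:          # already fine-grained: act
        print("  " * depth + f"primitive: {task}")
        return
    if depth >= max_depth:          # guard against tasks that never ground out
        raise RuntimeError(f"could not ground task: {task!r}")
    for sub in llm_decompose(task): # still coarse: decompose and recurse
        execute(sub, depth + 1, max_depth)

execute("tidy the desk")
```

Because planning and acting share one recursive procedure, how deep the recursion goes is exactly the decision granularity, which is what the agent gets to control.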
HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives (📄 Read the Paper)
HoloCine generates coherent multi-shot video narratives by processing entire scenes holistically with Window Cross-Attention for directorial control and Sparse Inter-Shot Self-Attention for efficiency, achieving state-of-the-art narrative consistency with emergent abilities in character persistence and cinematic techniques. This represents a significant advancement from isolated clip generation toward fully automated end-to-end filmmaking.
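The efficiency idea can be visualized as an attention mask: dense attention among tokens of the same shot, with only a few cross-shot connections carrying scene-level context. The sketch below uses per-shot anchor tokens as those sparse links; that choice, like the function itself, is an illustrative assumption rather than HoloCine's exact mask.

```python
import torch

def sparse_inter_shot_mask(shot_ids: torch.Tensor, anchors_per_shot: int = 1) -> torch.Tensor:
    """Boolean attention mask (True = may attend): full attention within a shot,
    cross-shot attention only to each shot's first `anchors_per_shot` tokens.
    `shot_ids` is a 1D tensor mapping each token to its shot index."""
    same_shot = shot_ids[:, None] == shot_ids[None, :]
    # Mark the first `anchors_per_shot` tokens of every shot as shared anchors.
    is_anchor = torch.zeros_like(shot_ids, dtype=torch.bool)
    for s in shot_ids.unique():
        idx = (shot_ids == s).nonzero(as_tuple=True)[0][:anchors_per_shot]
        is_anchor[idx] = True
    # Any token may attend to its own shot and to every shot's anchor tokens.
    return same_shot | is_anchor[None, :]

# Toy example: 8 tokens spread over 3 shots.
mask = sparse_inter_shot_mask(torch.tensor([0, 0, 0, 1, 1, 2, 2, 2]))
print(mask.int())
```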


