MMR-V: What's Left Unsaid?
A Benchmark for Multimodal Deep Reasoning in Videos

Can Your MLLMs "Think with Video"?

1 School of Artificial Intelligence, University of Chinese Academy of Sciences
2 Institute of Automation, Chinese Academy of Sciences   3 Tsinghua University

Abstract

The sequential structure of videos poses a challenge to the ability of multimodal large language models (MLLMs) to locate multi-frame evidence and conduct multimodal reasoning. However, existing video benchmarks mainly focus on understanding tasks, which only require models to match the frame mentioned in the question (hereafter referred to as the "question frame") and perceive a few adjacent frames. To address this gap, we propose MMR-V: A Benchmark for Multimodal Deep Reasoning in Videos. The benchmark is characterized by the following features. (1) Long-range, multi-frame reasoning: models must infer and analyze evidence frames that may be far from the question frame. (2) Beyond perception: questions cannot be answered through direct perception alone but require reasoning over hidden information. (3) Reliability: all tasks are manually annotated, with reference to extensive real-world user interpretations, to align with common perceptions. (4) Confusability: carefully designed distractor annotation strategies reduce model shortcuts. MMR-V consists of 317 videos and 1,257 tasks. Our experiments reveal that current models still struggle with multimodal reasoning; even the best-performing model, o4-mini, achieves only 52.5% accuracy. Additionally, current reasoning-enhancement strategies (Chain-of-Thought and scaling test-time compute) bring limited gains. Further analysis indicates that the CoT required for multimodal reasoning differs from that in textual reasoning, which partly explains the limited performance gains. We hope that MMR-V can inspire further research into enhancing multimodal reasoning capabilities.
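The benchmark's actual data schema is not reproduced on this page. Purely as an illustration of the features above (a question anchored to one frame, evidence frames located far away, and confusable distractors), a hypothetical task record might look like the sketch below; every field name and value is invented for illustration.

```python
# Purely illustrative sketch of an MMR-V-style multiple-choice task.
# All field names and values are hypothetical, not the benchmark's actual schema.
example_task = {
    "video_id": "example_0001",             # hypothetical identifier
    "task_type": "Implicit Reasoning",      # MMR-V's two top-level types: Implicit / Explicit
    "question": "What is the director implying with the recurring clock motif?",
    "question_timestamp_s": 512.0,          # the frame the question refers to ("question frame")
    "evidence_timestamps_s": [48.0, 131.5, 804.2],  # evidence may lie far from the question frame
    "options": {                            # carefully designed distractors reduce shortcuts
        "A": "The protagonist is habitually late.",
        "B": "Time is running out for the protagonist.",
        "C": "The clock is a product placement.",
        "D": "The scene takes place in a clock shop.",
    },
    "answer": "B",
}
```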



🧠 With MMR-V, we aim to explore whether MLLMs can "think with videos" and mine evidence from long-span, multi-frame video information.

Leaderboard

📢 The leaderboard is constantly updated as we welcome new submissions!

We consider two test settings: w/o CoT and w/ CoT.

This leaderboard is sorted by overall accuracy w/o CoT. The CoT results are obtained using a CoT prompt.
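The exact evaluation prompts are not given on this page; the sketch below only illustrates how the two settings typically differ, and all prompt wording is hypothetical.

```python
# Hypothetical sketch of the two evaluation settings (w/o CoT vs. w/ CoT);
# the exact prompts used for MMR-V are not specified here, so the wording is illustrative.

def build_prompt(question: str, options: dict[str, str], use_cot: bool) -> str:
    """Assemble a multiple-choice prompt, optionally asking for chain-of-thought."""
    option_block = "\n".join(f"{k}. {v}" for k, v in options.items())
    if use_cot:
        instruction = ("Think step by step about the video evidence, "
                       "then give the final option letter on the last line.")
    else:
        instruction = "Answer with the option letter only."
    return f"{question}\n{option_block}\n{instruction}"

# Example usage:
opts = {"A": "...", "B": "...", "C": "...", "D": "..."}
prompt_direct = build_prompt("What does the final shot imply?", opts, use_cot=False)
prompt_cot = build_prompt("What does the final shot imply?", opts, use_cot=True)
```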

All values are accuracy (%).

| # | Model | Org | Overall w/o CoT | Overall w/ CoT | Implicit w/o CoT | Implicit w/ CoT | Explicit w/o CoT | Explicit w/ CoT | Art | Life | TV | Film | Film | Phi. |
|---|-------|-----|-----------------|----------------|------------------|------------------|------------------|------------------|-----|------|----|------|------|------|
| – | Human | – | 86.0 | 86.0 | 80.6 | 80.6 | 91.2 | 91.2 | 57.7 | 92.3 | 90.6 | 92.3 | 90.7 | 70.0 |
| 1 | o4-mini-2025-04-16 | OpenAI | 52.5 | 52.1 | 54.6 | 47.1 | 46.0 | 48.2 | 40.1 | 54.0 | 54.0 | 51.7 | 65.3 | 27.9 |
| 2 | Gemini-2.5-Flash | Google | 51.2 | 50.5 | 52.9 | 52.3 | 46.9 | 45.3 | 45.3 | 39.5 | 50.3 | 47.9 | 65.6 | 34.9 |
| 3 | Gemini-2.0-Flash (512 frames) | Google | 48.0 | 49.9 | 50.5 | 52.6 | 41.6 | 42.9 | 36.7 | 36.7 | 39.7 | 46.2 | 66.7 | 31.4 |
| 4 | GPT-4.1-2025-04-14 | OpenAI | 46.6 | 48.9 | 49.1 | 51.7 | 40.3 | 41.7 | 43.2 | 35.6 | 43.9 | 46.5 | 57.1 | 34.9 |
| 5 | Gemini-2.0-Flash-thinking | Google | 45.0 | 43.5 | 46.6 | 46.0 | 40.6 | 37.1 | 34.5 | 31.6 | 38.6 | 48.3 | 60.1 | 25.6 |
| 6 | GPT-4o-2024-11-20 | OpenAI | 44.0 | 46.1 | 46.6 | 46.9 | 37.6 | 44.0 | 38.1 | 37.3 | 34.9 | 41.0 | 61.6 | 32.6 |
| 7 | Claude-3.5-Sonnet-20241022 | Anthropic | 43.3 | 44.2 | 45.0 | 46.1 | 38.9 | 39.1 | 33.8 | 31.1 | 41.3 | 41.3 | 55.8 | 44.4 |
| 8 | Gemini-2.0-Flash (16 frames) | Google | 42.6 | 44.3 | 44.3 | 45.9 | 38.3 | 40.0 | 30.9 | 32.2 | 40.7 | 40.6 | 58.5 | 24.4 |
| 9 | Gemma-3-27b-it | Google | 42.0 | 41.1 | 46.5 | 44.7 | 30.3 | 32.0 | 31.7 | 32.2 | 35.5 | 41.3 | 56.1 | 33.7 |
| 10 | InternVL2.5-38B | Shanghai AI Lab | 39.9 | 39.7 | 43.8 | 43.7 | 29.9 | 29.4 | 30.4 | 28.8 | 30.4 | 37.2 | 57.4 | 29.1 |
| 11 | Qwen2.5-VL-72B | Alibaba | 39.1 | 40.4 | 41.3 | 42.8 | 33.4 | 34.3 | 28.9 | 28.2 | 29.1 | 36.5 | 55.6 | 37.2 |
| 12 | GPT-4o-mini-2024-07-18 | OpenAI | 34.8 | 35.2 | 38.0 | 38.6 | 26.3 | 26.3 | 29.5 | 25.4 | 29.6 | 33.0 | 48.7 | 18.6 |
| 13 | Gemma-3-12b-it | Google | 34.0 | 34.2 | 37.8 | 37.6 | 24.0 | 25.4 | 19.4 | 24.9 | 25.9 | 31.3 | 51.9 | 24.4 |
| 14 | InternVL3-8B | Shanghai AI Lab | 33.6 | 32.9 | 35.5 | 33.4 | 28.6 | 31.4 | 23.0 | 22.6 | 31.7 | 24.3 | 52.9 | 23.2 |
| 15 | Qwen2.5-VL-7B | Alibaba | 30.1 | 32.4 | 33.7 | 36.2 | 20.8 | 22.5 | 20.9 | 18.1 | 29.6 | 21.2 | 48.4 | 19.8 |
| 16 | Phi-4-multimodal-instruct | Microsoft | 26.7 | 27.6 | 29.4 | 31.2 | 19.4 | 18.1 | 19.4 | 19.2 | 25.9 | 26.4 | 33.9 | 24.4 |
| 17 | Cogvlm2-video-llama3 | THUDM | 25.6 | 26.1 | 25.4 | 26.2 | 26.1 | 25.7 | 15.5 | 18.3 | 24.7 | 19.1 | 43.2 | 20.8 |
| 18 | NVILA-8B-Video | NVIDIA | 25.5 | 25.3 | 26.2 | 24.2 | 23.9 | 25.9 | 17.3 | 21.3 | 23.5 | 21.6 | 38.0 | 21.8 |
| 19 | LLaVA-Video | Bytedance & NTU S-Lab | 18.4 | 17.6 | 19.1 | 18.1 | 15.4 | 16.3 | 14.4 | 11.2 | 13.2 | 17.4 | 21.4 | 12.8 |
| 20 | LLaVA-Onevision | Bytedance & NTU S-Lab | 6.5 | 8.8 | 7.0 | 9.6 | 5.4 | 6.6 | 6.5 | 3.4 | 9.5 | 3.8 | 9.8 | 1.2 |

Last Update: 2025-05-15

Benchmark Details

Benchmark Introduction

Models like o3 and o4-mini have achieved impressive results on image reasoning tasks by leveraging tool use to enable 🕵️ evidence mining on images. Similarly, tasks in MMR-V require models to perform in-depth reasoning and analysis over visual information from different frames of a video, challenging their ability to 🕵️ mine evidence across long-range, multi-frame contexts. The figure below presents two example tasks from MMR-V and one from a previous video understanding benchmark, with the question frames and supporting evidence frames annotated.

[Figure: Two example MMR-V tasks and one task from a previous video understanding benchmark, with question frames and evidence frames annotated.]
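The leaderboard's Gemini-2.0-Flash entries at 16 and 512 frames hint at why this is hard: most evaluation pipelines feed the model a fixed budget of uniformly sampled frames, so evidence frames far from the question frame are easily skipped. Below is a minimal, generic frame-sampling sketch, not the authors' exact pipeline, assuming OpenCV is available.

```python
# Minimal sketch of uniform frame sampling for a fixed frame budget
# (e.g., 16 vs. 512 frames). Generic illustration, not the MMR-V evaluation code.
import cv2  # pip install opencv-python

def sample_frames(video_path: str, num_frames: int = 16):
    """Return `num_frames` BGR frames sampled uniformly across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

frames = sample_frames("some_video.mp4", num_frames=16)  # hypothetical path
```

With only a 16-frame budget on a long video, sampled frames can lie minutes apart, which is exactly the regime where MMR-V's long-range evidence mining becomes difficult.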

 

Benchmark Construction Pipeline

[Figure: Benchmark construction pipeline.]

 

Video and Task Categories

Inspired by cognitive theories such as Kahneman's Dual Process Theory, we categorize tasks in MMR-V into two main types: Implicit and Explicit Reasoning (examples shown in the Benchmark Introduction figure). These are further divided into 10 categories and 33 subcategories.

[Figure: Video and task categories in MMR-V.]
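The Implicit/Explicit and video-category columns in the leaderboard are breakdowns over this taxonomy; such breakdowns are typically computed as per-category accuracy. The sketch below shows one generic way to aggregate them, with all field names hypothetical.

```python
# Illustrative per-category accuracy aggregation, mirroring the leaderboard's
# Implicit/Explicit and video-category breakdowns; field names are hypothetical.
from collections import defaultdict

def accuracy_by_category(results: list[dict]) -> dict[str, float]:
    """`results` items are expected to carry 'category', 'prediction', and 'answer' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in results:
        total[r["category"]] += 1
        correct[r["category"]] += int(r["prediction"] == r["answer"])
    return {cat: 100.0 * correct[cat] / total[cat] for cat in total}

print(accuracy_by_category([
    {"category": "Implicit Reasoning", "prediction": "B", "answer": "B"},
    {"category": "Explicit Reasoning", "prediction": "A", "answer": "C"},
]))
```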

Task Examples

 

Experiment Results

 

BibTeX

@misc{zhu2025mmrvwhatsleftunsaid,
      title={MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos}, 
      author={Kejian Zhu and Zhuoran Jin and Hongbang Yuan and Jiachun Li and Shangqing Tu and Pengfei Cao and Yubo Chen and Kang Liu and Jun Zhao},
      year={2025},
      eprint={2506.04141},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.04141}, 
}