MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark
Abstract
Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce MovieRecapsQA, a novel open-ended multimodal VideoQA benchmark created from movie recap videos, a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate ≈8.2K question-answer (QA) pairs (aligned with movie subtitles) and provide the "facts" needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context for the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap segments and movie segments) and categorizes questions by modality and type to enable fine-grained analysis. We evaluate seven state-of-the-art MLLMs on our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.
Leaderboard
Reference-free evaluation metrics on MovieRecapsQA.
| # | Model | Organization | Overall Factuality | Overall Relevance | Dialogue | Scene | Multimodal | CRD | NPA | STA | TEMP | TH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | OpenAI | 3.99 | 3.97 | 3.76 | 3.43 | 3.66 | 3.73 | 3.64 | 3.10 | 3.58 | 3.55 |
| 2 | Claude 3.5 Sonnet | Anthropic | 3.76 | 3.92 | 3.69 | 3.17 | 3.58 | 3.65 | 3.42 | 3.12 | 3.30 | 3.44 |
| 3 | Amazon Nova Lite | Amazon | 3.53 | 3.93 | 3.73 | 3.35 | 3.58 | 3.59 | 3.60 | 3.15 | 3.51 | 3.37 |
| 4 | Qwen 2.5VL | Alibaba | 3.47 | 3.83 | 3.50 | 3.28 | 3.35 | 3.42 | 3.40 | 3.07 | 3.39 | 3.27 |
| 5 | Gemini 2.5 Flash | Google | 3.26 | 3.70 | 3.34 | 2.65 | 3.03 | 3.15 | 3.00 | 2.57 | 2.53 | 3.16 |
| 6 | Mini-CPM-o | OpenBMB | 3.21 | 3.61 | 3.15 | 3.00 | 3.09 | 3.14 | 3.10 | 2.76 | 3.02 | 3.02 |
| 7 | LLaVA-Next-Video | LLaVA Team | 2.96 | 3.35 | 2.99 | 2.88 | 2.88 | 2.99 | 2.90 | 2.65 | 3.04 | 2.78 |
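The per-modality columns of the leaderboard can be compared directly. The following is a minimal Python sketch, not part of the benchmark's tooling: it copies the Dialogue, Scene, and Multimodal scores from the table and reports the lowest-scoring modality for each model. It assumes that the Scene column corresponds to the visual-only questions mentioned in the abstract; the names `SCORES` and `weakest_modality` are illustrative only.

```python
# Per-modality scores copied from the leaderboard table.
# Assumption: "Scene" corresponds to visual-only questions.
SCORES = {
    "GPT-4o": {"Dialogue": 3.76, "Scene": 3.43, "Multimodal": 3.66},
    "Claude 3.5 Sonnet": {"Dialogue": 3.69, "Scene": 3.17, "Multimodal": 3.58},
    "Amazon Nova Lite": {"Dialogue": 3.73, "Scene": 3.35, "Multimodal": 3.58},
    "Qwen 2.5VL": {"Dialogue": 3.50, "Scene": 3.28, "Multimodal": 3.35},
    "Gemini 2.5 Flash": {"Dialogue": 3.34, "Scene": 2.65, "Multimodal": 3.03},
    "Mini-CPM-o": {"Dialogue": 3.15, "Scene": 3.00, "Multimodal": 3.09},
    "LLaVA-Next-Video": {"Dialogue": 2.99, "Scene": 2.88, "Multimodal": 2.88},
}


def weakest_modality(scores):
    """Return the modality with the lowest score.

    On a tie, the modality listed first in the dict is returned.
    """
    return min(scores, key=scores.get)


# Hypothetical helper output: model name -> lowest-scoring modality.
weakest = {model: weakest_modality(s) for model, s in SCORES.items()}

for model, modality in weakest.items():
    print(f"{model}: {modality} ({SCORES[model][modality]:.2f})")
```

Under these assumptions, Scene is the lowest-scoring modality for every model in the table, with LLaVA-Next-Video showing a tie between Scene and Multimodal (both 2.88).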
BibTeX
@article{movierecapsqa2026,
  title={MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark},
  author={Shaden Shaar and Bradon Thymes and Sirawut Chaixanien and Claire Cardie and Bharath Hariharan},
  year={2026},
  url={}
}