MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

Cornell University
*Indicates Equal Contribution

CVPR 2026

Abstract

Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce MovieRecapsQA, a novel open-ended multimodal VideoQA benchmark created using movie recap videos, a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate ≈8.2K question-answer (QA) pairs (aligned with movie subtitles) and provide the "facts" needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context of the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap-segments, movie-segments) and categorizations of questions (by modality and type) to enable fine-grained analysis. We evaluate the performance of seven state-of-the-art MLLMs using our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.

Leaderboard

Reference-free evaluation metrics on MovieRecapsQA.

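| # | Model | Organization | Overall Factuality | Overall Relevance | Dialogue | Scene | Multimodal | CRD | NPA | STA | TEMP | TH |
|---|-------|--------------|--------------------|-------------------|----------|-------|------------|-----|-----|-----|------|----|
| 1 | GPT-4o | OpenAI | 3.99 | 3.97 | 3.76 | 3.43 | 3.66 | 3.73 | 3.64 | 3.10 | 3.58 | 3.55 |
| 2 | Claude 3.5 Sonnet | Anthropic | 3.76 | 3.92 | 3.69 | 3.17 | 3.58 | 3.65 | 3.42 | 3.12 | 3.30 | 3.44 |
| 3 | Amazon Nova Lite | Amazon | 3.53 | 3.93 | 3.73 | 3.35 | 3.58 | 3.59 | 3.60 | 3.15 | 3.51 | 3.37 |
| 4 | Qwen 2.5VL | Alibaba | 3.47 | 3.83 | 3.50 | 3.28 | 3.35 | 3.42 | 3.40 | 3.07 | 3.39 | 3.27 |
| 5 | Gemini 2.5 Flash | Google | 3.26 | 3.70 | 3.34 | 2.65 | 3.03 | 3.15 | 3.00 | 2.57 | 2.53 | 3.16 |
| 6 | Mini-CPM-o | OpenBMB | 3.21 | 3.61 | 3.15 | 3.00 | 3.09 | 3.14 | 3.10 | 2.76 | 3.02 | 3.02 |
| 7 | LLaVA-Next-Video | LLaVA Team | 2.96 | 3.35 | 2.99 | 2.88 | 2.88 | 2.99 | 2.90 | 2.65 | 3.04 | 2.78 |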
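
The first observation in the abstract (visual-only questions remain the most challenging) can be checked against the leaderboard with a few lines of Python. This is only an illustrative sketch: the scores are copied from the Dialogue, Scene, and Multimodal columns above, and it assumes that the Scene column corresponds to visual-only questions and the Dialogue column to dialogue-only questions. The names `MODALITY_SCORES`, `hardest_modality`, and `dialogue_scene_gap` are hypothetical helpers, not part of any released benchmark code.

```python
# Scores copied from the leaderboard, in the order (Dialogue, Scene, Multimodal).
# Assumption: "Scene" is the visual-only category, "Dialogue" the dialogue-only one.
MODALITIES = ("Dialogue", "Scene", "Multimodal")

MODALITY_SCORES = {
    "GPT-4o": (3.76, 3.43, 3.66),
    "Claude 3.5 Sonnet": (3.69, 3.17, 3.58),
    "Amazon Nova Lite": (3.73, 3.35, 3.58),
    "Qwen 2.5VL": (3.50, 3.28, 3.35),
    "Gemini 2.5 Flash": (3.34, 2.65, 3.03),
    "Mini-CPM-o": (3.15, 3.00, 3.09),
    "LLaVA-Next-Video": (2.99, 2.88, 2.88),
}


def hardest_modality(scores):
    """Return the modality with the lowest score (the first one wins on ties)."""
    return min(zip(MODALITIES, scores), key=lambda pair: pair[1])[0]


def dialogue_scene_gap(scores):
    """Return Dialogue minus Scene, rounded to two decimals."""
    dialogue, scene, _ = scores
    return round(dialogue - scene, 2)


if __name__ == "__main__":
    for model, scores in MODALITY_SCORES.items():
        print(f"{model}: hardest={hardest_modality(scores)}, "
              f"dialogue-scene gap={dialogue_scene_gap(scores)}")
```

Under these assumptions, every model in the table scores lowest (or tied for lowest) on Scene questions, and the Dialogue minus Scene gap is positive for all seven models.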


BibTeX

@article{movierecapsqa2026,
  title={MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark},
  author={Shaden Shaar and Bradon Thymes and Sirawut Chaixanien and Claire Cardie and Bharath Hariharan},
  year={2026},
  url={}
}