MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark

Shaden Shaar^*, Bradon Thymes^*, Sirawut Chaixanien, Claire Cardie, Bharath Hariharan

Cornell University
^*Indicates Equal Contribution
CVPR 2026

Abstract

Understanding real-world videos such as movies requires integrating visual and dialogue cues to answer complex questions. Yet existing VideoQA benchmarks struggle to capture this multimodal reasoning and are largely not open-ended, given the difficulty of evaluating free-form answers. In this paper, we introduce MovieRecapsQA, a novel open-ended multimodal VideoQA benchmark created from movie recap videos, a distinctive type of YouTube content that summarizes a film by presenting its key events through synchronized visual (recap video) and textual (recap summary) modalities. Using the recap summary, we generate ≈8.2K question-answer (QA) pairs (aligned with movie subtitles) and provide the "facts" needed to verify an answer in a reference-free manner. To our knowledge, this is the first open-ended VideoQA benchmark that supplies explicit textual context for the input (video and/or text), which we use for evaluation. Our benchmark provides videos of multiple lengths (i.e., recap segments and movie segments) and categorizes questions by modality and by type to enable fine-grained analysis. We evaluate seven state-of-the-art multimodal large language models (MLLMs) on our benchmark and observe that: 1) visual-only questions remain the most challenging; 2) models default to textual inputs whenever available; 3) extracting factually accurate information from video content is still difficult for all models; and 4) proprietary and open-source models perform comparably on video-dependent questions.
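As a rough illustration of the fine-grained analysis described above, the sketch below groups per-question scores by modality and by question type. It is not the benchmark's released code: the `QAItem` class, its field names, the `mean_by` helper, and the three toy records are all hypothetical and exist only to show the grouping step.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class QAItem:
    # All field names are hypothetical; the released schema may differ.
    question: str
    facts: list[str]     # textual facts used to verify a free-form answer
    modality: str        # e.g. "dialogue", "scene", "multimodal"
    question_type: str   # e.g. "TEMP" or "TH", as labeled in the leaderboard
    factuality: float    # score some judge assigned to a model's answer


def mean_by(items, key):
    """Average the factuality score within each group named by `key`."""
    groups = defaultdict(list)
    for item in items:
        groups[getattr(item, key)].append(item.factuality)
    return {name: mean(scores) for name, scores in groups.items()}


# Toy records, invented for this example only.
items = [
    QAItem("Who warns the hero?", ["A friend warns the hero."], "dialogue", "TEMP", 4.0),
    QAItem("What is on the table?", ["A letter is on the table."], "scene", "TEMP", 2.0),
    QAItem("Why does she leave?", ["She leaves after the argument."], "dialogue", "TH", 3.0),
]

by_modality = mean_by(items, "modality")
by_type = mean_by(items, "question_type")
print(by_modality)
print(by_type)
```

Keeping the modality and type labels on every record means the same scored answers can be sliced either way without re-running a model.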

Leaderboard

Reference-free evaluation metrics on MovieRecapsQA.

#	Model	Organization	Overall Factuality	Overall Relevance	Dialogue	Scene	Multimodal	CRD	NPA	STA	TEMP	TH
1	GPT-4o	OpenAI	3.99	3.97	3.76	3.43	3.66	3.73	3.64	3.10	3.58	3.55
2	Claude 3.5 Sonnet	Anthropic	3.76	3.92	3.69	3.17	3.58	3.65	3.42	3.12	3.30	3.44
3	Amazon Nova Lite	Amazon	3.53	3.93	3.73	3.35	3.58	3.59	3.60	3.15	3.51	3.37
4	Qwen 2.5VL	Alibaba	3.47	3.83	3.50	3.28	3.35	3.42	3.40	3.07	3.39	3.27
5	Gemini 2.5 Flash	Google	3.26	3.70	3.34	2.65	3.03	3.15	3.00	2.57	2.53	3.16
6	Mini-CPM-o	OpenBMB	3.21	3.61	3.15	3.00	3.09	3.14	3.10	2.76	3.02	3.02
7	LLaVA-Next-Video	LLaVA Team	2.96	3.35	2.99	2.88	2.88	2.99	2.90	2.65	3.04	2.78
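
The modality columns can be compared directly. The short sketch below copies the Dialogue, Scene, and Multimodal scores from the leaderboard above and checks, for each model, whether Scene is its lowest modality score and how far it falls below Dialogue. The variable names are illustrative only.

```python
# (Dialogue, Scene, Multimodal) scores copied from the leaderboard above.
scores = {
    "GPT-4o": (3.76, 3.43, 3.66),
    "Claude 3.5 Sonnet": (3.69, 3.17, 3.58),
    "Amazon Nova Lite": (3.73, 3.35, 3.58),
    "Qwen 2.5VL": (3.50, 3.28, 3.35),
    "Gemini 2.5 Flash": (3.34, 2.65, 3.03),
    "Mini-CPM-o": (3.15, 3.00, 3.09),
    "LLaVA-Next-Video": (2.99, 2.88, 2.88),
}

# Gap between dialogue and scene questions for each model.
gap = {
    model: round(dialogue - scene, 2)
    for model, (dialogue, scene, _multimodal) in scores.items()
}

# True when no model scores lower on Dialogue or Multimodal than on Scene.
scene_is_lowest = all(
    scene <= min(dialogue, multimodal)
    for dialogue, scene, multimodal in scores.values()
)

for model, value in sorted(gap.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: dialogue minus scene = {value:.2f}")
print("Scene is the lowest modality score for every model:", scene_is_lowest)
```

In this table, Scene is the lowest (or tied-lowest) modality score for every model, which is consistent with the abstract's observation that visual-only questions remain the most challenging.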

BibTeX

@article{movierecapsqa2026,
  title={MovieRecapsQA: A Multimodal Open-Ended Video Question-Answering Benchmark},
  author={Shaden Shaar and Bradon Thymes and Sirawut Chaixanien and Claire Cardie and Bharath Hariharan},
  year={2026},
  url={}
}