Abstract
Biryani, one of India's most celebrated dishes, exhibits remarkable regional diversity in its preparation, ingredients, and presentation. With the growing availability of online cooking videos, there is unprecedented potential to systematically study such culinary variations with computational tools. However, existing video understanding methods fail to capture the fine-grained, multimodal, and culturally grounded differences in procedural cooking videos. This work presents the first large-scale, curated dataset of biryani preparation videos, comprising 120 high-quality YouTube recordings across 12 distinct regional styles. We propose a multi-stage framework that leverages recent advances in vision-language models (VLMs) to segment videos into fine-grained procedural units and align them with audio transcripts and canonical recipe text. Building on these aligned representations, we introduce a video comparison pipeline that automatically identifies and explains procedural differences between regional variants. We also construct a comprehensive question-answer (QA) benchmark spanning multiple reasoning levels to evaluate procedural understanding in VLMs. Our approach employs multiple VLMs in complementary roles, incorporates human-in-the-loop verification for high-precision tasks, and benchmarks several state-of-the-art models under zero-shot and fine-tuned settings. The resulting dataset, comparison methodology, and QA benchmark provide a new testbed for evaluating VLMs on structured, multimodal reasoning tasks and open new directions for the computational analysis of cultural heritage through cooking videos.
Method Pipeline
Stage 1: Video segmentation to extract ingredients, utensils, and actions
Stage 2: Multimodal alignment and video comparison framework for identifying procedural differences
Stage 3: Question-answer generation pipeline
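To make the pipeline concrete, here is a minimal Python sketch of how the three stages could compose. Every helper body below is a placeholder and every name is illustrative, an assumption on our part rather than the released implementation, which prompts VLMs at each stage:

```python
# Illustrative skeleton of the three-stage pipeline.
# Assumption: all helper bodies are placeholders, not the released code.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float              # segment start, in seconds
    end: float                # segment end, in seconds
    ingredients: list[str]
    utensils: list[str]
    action: str

def segment_video(video_path: str) -> list[Segment]:
    """Stage 1: a VLM splits the video into fine-grained procedural units."""
    raise NotImplementedError("placeholder for the VLM segmentation step")

def align_modalities(seg: Segment, transcript: str, recipe: str) -> dict:
    """Stage 2: align a segment with the audio transcript and recipe text."""
    raise NotImplementedError("placeholder for the alignment step")

def generate_qa(unit: dict) -> tuple[str, str]:
    """Stage 3: generate a question-answer pair over an aligned unit."""
    raise NotImplementedError("placeholder for QA generation")

def run_pipeline(video_path: str, transcript: str, recipe: str):
    segments = segment_video(video_path)
    aligned = [align_modalities(s, transcript, recipe) for s in segments]
    return aligned, [generate_qa(u) for u in aligned]
```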
Key Results
Procedural Variation Detection
Cooking variations detected between Hyderabadi and Lucknowi biryani. Opacity indicates degree of variation.
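As a toy illustration of step-level variation scoring, the sketch below compares aligned step descriptions with `difflib` string similarity. The actual pipeline operates on VLM-derived representations, and the example steps are invented for illustration, not drawn from the dataset:

```python
# Toy sketch of step-level variation scoring between two regional variants.
# Assumption: steps are short text descriptions of aligned procedural units;
# the paper's pipeline uses VLM representations, not difflib.
from difflib import SequenceMatcher

def variation_score(step_a: str, step_b: str) -> float:
    """Return 0 (identical) .. 1 (maximally different) for two aligned steps."""
    return 1.0 - SequenceMatcher(None, step_a.lower(), step_b.lower()).ratio()

hyderabadi = ["marinate chicken in yogurt and spices overnight",
              "layer parboiled rice over raw marinated meat (kacchi style)"]
lucknowi   = ["marinate chicken in yogurt and mild spices briefly",
              "cook meat and rice separately, then layer them (pakki style)"]

for a, b in zip(hyderabadi, lucknowi):
    # Higher scores would render as higher opacity in the visualization.
    print(f"{variation_score(a, b):.2f}  {a!r} vs {b!r}")
```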
Action-Based Video Retrieval
Targeted retrieval of "marinating chicken" across multiple videos with precise temporal localization.
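A minimal sketch of how such retrieval could work over Stage-1 segments, using token-overlap cosine similarity as a stand-in for learned multimodal embeddings; the segment data here is invented for illustration:

```python
# Minimal sketch of action-based retrieval with temporal localization.
# Assumption: each segment carries a text description from Stage 1; a real
# system would rank with learned embeddings rather than token overlap.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

segments = [  # (video_id, start_s, end_s, description) -- illustrative only
    ("vid_01", 35.0, 92.0, "marinating chicken with yogurt and spices"),
    ("vid_01", 92.0, 140.0, "frying onions until golden"),
    ("vid_07", 12.0, 60.0, "marinating chicken overnight"),
]

query = Counter("marinating chicken".split())
hits = sorted(segments, key=lambda s: cosine(query, Counter(s[3].split())),
              reverse=True)
for vid, start, end, desc in hits[:2]:
    print(f"{vid}: {start:.0f}s-{end:.0f}s  {desc}")
```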
Video Question Answering
Examples of generated question-answer pairs, spanning multiple reasoning levels.
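For format only, here is what tiered QA pairs might look like; the questions and answers below are invented examples, not entries from the released benchmark:

```python
# Invented examples illustrating the tiered QA-pair format only;
# not drawn from the released benchmark.
qa_examples = [
    {"tier": "easy",   "q": "Which utensil is used to fry the onions?",
     "a": "A heavy-bottomed kadai."},
    {"tier": "medium", "q": "What step immediately follows marinating the chicken?",
     "a": "Layering the parboiled rice over the marinated meat."},
    {"tier": "hard",   "q": "Why is the pot sealed with dough before the final cook?",
     "a": "To trap steam so the rice and meat finish cooking together (dum)."},
]
```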
VLM Performance Benchmarks
Video QA Performance
Overall QA performance of VLMs across easy, medium, and hard difficulty tiers.
| VLM | Metric | Easy Tier | Medium Tier | Hard Tier |
|---|---|---|---|---|
| internvl3 | BLEU | 0.0294 | 0.0291 | 0.0395 |
| | ROUGE-L | 0.2184 | 0.1732 | 0.2457 |
| | BERTScore | 0.1663 | 0.1628 | 0.2683 |
| qwen2vl | BLEU | 0.0314 | 0.0209 | 0.0609 |
| | ROUGE-L | 0.1914 | 0.1189 | 0.3201 |
| | BERTScore | 0.1298 | -0.0747 | 0.3022 |
| llavanext | BLEU | 0.0128 | 0.0216 | 0.0150 |
| | ROUGE-L | 0.1319 | 0.1367 | 0.1911 |
| | BERTScore | -0.1732 | 0.0465 | 0.0984 |
| llavaov | BLEU | 0.0038 | 0.0278 | 0.0246 |
| | ROUGE-L | 0.0408 | 0.1383 | 0.1386 |
| | BERTScore | -0.2586 | 0.0377 | -0.0073 |
| videollama | BLEU | 0.0194 | 0.0787 | 0.0502 |
| | ROUGE-L | 0.1883 | 0.2713 | 0.2650 |
| | BERTScore | 0.0897 | 0.3071 | 0.2445 |
| llama3ft (fine-tuned) | BLEU | 0.0472 | 0.1683 | 0.1140 |
| | ROUGE-L | 0.2689 | 0.4214 | 0.4072 |
| | BERTScore | 0.2660 | 0.4869 | 0.4526 |
Fine-tuned Llama-3.2-11B significantly outperforms all zero-shot baselines across all difficulty tiers and metrics.
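A hedged sketch of how the three reported metrics could be computed per answer with common open-source libraries (`nltk`, `rouge-score`, `bert-score`). The paper's exact tokenization, smoothing, and rescaling settings are not stated here; this is one plausible configuration, and the negative BERTScore values in the table are consistent with baseline rescaling:

```python
# One plausible per-answer metric configuration (an assumption, not the
# paper's confirmed evaluation settings).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def qa_metrics(prediction: str, reference: str) -> dict:
    # BLEU on whitespace tokens with smoothing, so short answers score > 0
    bleu = sentence_bleu([reference.split()], prediction.split(),
                         smoothing_function=SmoothingFunction().method1)
    # ROUGE-L F-measure over the longest common subsequence
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        target=reference, prediction=prediction)["rougeL"].fmeasure
    # BERTScore F1; baseline rescaling can yield negative values
    _, _, f1 = bert_score([prediction], [reference], lang="en",
                          rescale_with_baseline=True)
    return {"BLEU": bleu, "ROUGE-L": rouge_l, "BERTScore": f1.item()}
```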
BibTeX
@inproceedings{biryaniofindia2025,
  title={How Does India Cook Biryani?},
  author={C. V. Rishi and Farzana S and Shubham Goel and Aditya Arun and C. V. Jawahar},
  booktitle={Proceedings of the 16th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP'25)},
  year={2025},
  organization={ACM}
}
Acknowledgements
We gratefully acknowledge the support of Google Research / AI in this project.