How Does India Cook Biryani?

IIIT Hyderabad
ICVGIP 2025

*Equal Contribution
Regional Biryani Types Map of India

Regional map of 12 biryani styles across India, paired with representative dish images that reflect the rich cultural and procedural diversity motivating our computational analysis of biryani preparation videos.

Abstract

Biryani, one of India's most celebrated dishes, exhibits remarkable regional diversity in its preparation, ingredients, and presentation. With the growing availability of online cooking videos, there is unprecedented potential to study such culinary variations systematically using computational tools. However, existing video understanding methods fail to capture the fine-grained, multimodal, and culturally grounded differences in procedural cooking videos. This work presents the first large-scale, curated dataset of biryani preparation videos, comprising 120 high-quality YouTube recordings across 12 distinct regional styles. We propose a multi-stage framework that leverages recent advances in vision-language models (VLMs) to segment videos into fine-grained procedural units and align them with audio transcripts and canonical recipe text. Building on these aligned representations, we introduce a video comparison pipeline that automatically identifies and explains procedural differences between regional variants. We also construct a comprehensive question-answer (QA) benchmark spanning multiple reasoning levels to evaluate procedural understanding in VLMs. Our approach employs multiple VLMs in complementary roles, incorporates human-in-the-loop verification for high-precision tasks, and benchmarks several state-of-the-art models under zero-shot and fine-tuned settings. The resulting dataset, comparison methodology, and QA benchmark provide a new testbed for evaluating VLMs on structured, multimodal reasoning tasks and open new directions for computational analysis of cultural heritage through cooking videos.

Method Pipeline

Key Results

Procedural Variation Detection

Biryani Variation Visualization

Cooking variations detected between Hyderabadi and Lucknowi biryani. Opacity indicates degree of variation.

Action-Based Video Retrieval

Action-based retrieval example

Targeted retrieval of "marinating chicken" across multiple videos with precise temporal localization.

Video Question Answering

QA Examples

Examples of generated question-answer pairs, spanning multiple reasoning levels.

VLM Performance Benchmarks

Video QA Performance

Overall QA performance of VLMs across easy, medium, and hard difficulty tiers.

VLM                    Metric     Easy Tier  Medium Tier  Hard Tier
internvl3              BLEU        0.0294     0.0291       0.0395
internvl3              ROUGE-L     0.2184     0.1732       0.2457
internvl3              BERTScore   0.1663     0.1628       0.2683
qwen2vl                BLEU        0.0314     0.0209       0.0609
qwen2vl                ROUGE-L     0.1914     0.1189       0.3201
qwen2vl                BERTScore   0.1298    -0.0747       0.3022
llavanext              BLEU        0.0128     0.0216       0.0150
llavanext              ROUGE-L     0.1319     0.1367       0.1911
llavanext              BERTScore  -0.1732     0.0465       0.0984
llavaov                BLEU        0.0038     0.0278       0.0246
llavaov                ROUGE-L     0.0408     0.1383       0.1386
llavaov                BERTScore  -0.2586     0.0377      -0.0073
videollama             BLEU        0.0194     0.0787       0.0502
videollama             ROUGE-L     0.1883     0.2713       0.2650
videollama             BERTScore   0.0897     0.3071       0.2445
llama3ft (fine-tuned)  BLEU        0.0472     0.1683       0.1140
llama3ft (fine-tuned)  ROUGE-L     0.2689     0.4214       0.4072
llama3ft (fine-tuned)  BERTScore   0.2660     0.4869       0.4526
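For readers unfamiliar with the ROUGE-L rows in the table, the metric rewards the longest common subsequence (LCS) of tokens shared by a reference answer and a generated answer. The sketch below is a minimal illustration of that idea, not the evaluation code behind the reported numbers: the page does not say which ROUGE implementation or tokenizer was used, so the whitespace tokenization, lowercasing, and balanced F1 here are assumptions, and the example sentences are made up for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # DP row for the previous token of `a`
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L as a balanced F1 over LCS-based precision and recall.

    Assumption: lowercase whitespace tokens; real toolkits may stem,
    strip punctuation, or weight recall more heavily.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the benchmark.
    print(rouge_l_f1("add the rice to the pot", "add rice to pot"))
```

Because the score depends on token order as well as overlap, a model that names the right ingredients in the wrong sequence is penalized, which is one reason an LCS-based metric is a reasonable fit for procedural answers.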

The fine-tuned Llama-3.2-11B model (llama3ft) outperforms every zero-shot baseline on all three metrics in every difficulty tier.
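The comparison between the fine-tuned model and the zero-shot baselines can be checked directly against the table. The sketch below transcribes the reported scores and computes, for each metric and tier, the margin between llama3ft and the best zero-shot model. The dictionary layout and function names are our own choices for illustration; only the numbers come from the table.

```python
# Scores transcribed from the table, ordered as (easy, medium, hard).
SCORES = {
    "internvl3": {"BLEU": (0.0294, 0.0291, 0.0395),
                  "ROUGE-L": (0.2184, 0.1732, 0.2457),
                  "BERTScore": (0.1663, 0.1628, 0.2683)},
    "qwen2vl": {"BLEU": (0.0314, 0.0209, 0.0609),
                "ROUGE-L": (0.1914, 0.1189, 0.3201),
                "BERTScore": (0.1298, -0.0747, 0.3022)},
    "llavanext": {"BLEU": (0.0128, 0.0216, 0.0150),
                  "ROUGE-L": (0.1319, 0.1367, 0.1911),
                  "BERTScore": (-0.1732, 0.0465, 0.0984)},
    "llavaov": {"BLEU": (0.0038, 0.0278, 0.0246),
                "ROUGE-L": (0.0408, 0.1383, 0.1386),
                "BERTScore": (-0.2586, 0.0377, -0.0073)},
    "videollama": {"BLEU": (0.0194, 0.0787, 0.0502),
                   "ROUGE-L": (0.1883, 0.2713, 0.2650),
                   "BERTScore": (0.0897, 0.3071, 0.2445)},
    "llama3ft": {"BLEU": (0.0472, 0.1683, 0.1140),
                 "ROUGE-L": (0.2689, 0.4214, 0.4072),
                 "BERTScore": (0.2660, 0.4869, 0.4526)},
}
TIERS = ("easy", "medium", "hard")
METRICS = ("BLEU", "ROUGE-L", "BERTScore")
FINE_TUNED = "llama3ft"


def best_zero_shot(metric, tier_index):
    """Return (score, model) for the strongest zero-shot model in one cell."""
    return max(
        (scores[metric][tier_index], name)
        for name, scores in SCORES.items()
        if name != FINE_TUNED
    )


def fine_tuned_margins():
    """Map (metric, tier) to (margin over best zero-shot model, that model)."""
    margins = {}
    for metric in METRICS:
        for i, tier in enumerate(TIERS):
            best, name = best_zero_shot(metric, i)
            margins[(metric, tier)] = (SCORES[FINE_TUNED][metric][i] - best, name)
    return margins


if __name__ == "__main__":
    for (metric, tier), (margin, rival) in fine_tuned_margins().items():
        print(f"{metric:9s} {tier:6s} margin {margin:+.4f} over {rival}")
```

A positive margin in every cell confirms the claim at the level of raw scores; it says nothing about statistical significance, which the table does not report.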

BibTeX

@inproceedings{biryaniofindia2025,
  title={How Does India Cook Biryani?},
  author={C. V. Rishi and Farzana S and Shubham Goel and Aditya Arun and C. V. Jawahar},
  booktitle={Proceedings of 16th Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP'25)},
  year={2025},
  organization={ACM}
}

Acknowledgements

We acknowledge and appreciate the support of Google Research / AI in this project.