Plotmetwist

Bollywood Nepotism Analysis (2000–2025)

A data-driven look at how much of Bollywood’s last quarter-century has been driven by star kids. 2,824 films · 3,425 actors · 26 years.

Living writeup: plot-twist-nepotism.html (open the file in a browser, or host via GitHub Pages).

Key findings

| Question | Answer |
|---|---|
| Share of films with ≥1 nepo kid | 32.8% (peak 53% in 2006, trough 19% in 2020) |
| Share of actors who are nepo kids | 2.3% (79 of 3,425); a 14× over-representation in screen time |
| Films / actor (≥1 film) | nepo 16.6, non-nepo 3.4 |
| % of actors who appear in only one film | nepo 14%, non-nepo 63% |
| Correlation: gross vs IMDb rating | +0.18 (weak) |
| Correlation: gross vs log(IMDb votes) | +0.72 (strong); votes is the better proxy |
| % of nepo kids with a Filmfare Award | 47% (vs 8% of non-nepo) |
| Catch-up tax | non-nepo needs ~1.4× as many films to match nepo cumulative vote reach |
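
The headline ratios above can be re-derived from the other numbers in the same list. The sketch below is a sanity check, not the repository's own analysis code. It assumes the 14× figure is the share of films with at least one nepo kid divided by the share of actors who are nepo kids; the source does not spell out that definition. The function and constant names are illustrative.

```python
# Figures copied from the Key findings above.
NEPO_ACTORS = 79
TOTAL_ACTORS = 3425
FILM_SHARE_WITH_NEPO = 0.328   # films with at least one nepo kid
FILMS_PER_NEPO_ACTOR = 16.6
FILMS_PER_NON_NEPO_ACTOR = 3.4


def over_representation(nepo_actors: int, total_actors: int, film_share: float) -> float:
    """Film share divided by actor share (ASSUMED definition of the 14x figure)."""
    actor_share = nepo_actors / total_actors
    return film_share / actor_share


actor_share = NEPO_ACTORS / TOTAL_ACTORS
ratio = over_representation(NEPO_ACTORS, TOTAL_ACTORS, FILM_SHARE_WITH_NEPO)
films_ratio = FILMS_PER_NEPO_ACTOR / FILMS_PER_NON_NEPO_ACTOR

print(f"actor share: {actor_share:.1%}")               # actor share: 2.3%
print(f"over-representation: {ratio:.1f}x")            # over-representation: 14.2x
print(f"films-per-actor ratio: {films_ratio:.1f}x")    # films-per-actor ratio: 4.9x
```

Under that assumed definition the result rounds to the 14× quoted above. Note that it measures presence in films, not minutes on screen.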

What this repository contains

Reports

Data (CSVs)

Scrapers

Analysis

Charts (PNG)

Reproducing the analysis

pip install -r requirements.txt

# 1. Scrape Wikipedia (~5 min)
python scrape_bollywood_all_2000_2025.py

# 2. Fill gross from individual film pages (~17 min, 2,500 requests)
python fill_gross_from_wiki.py

# 3. Tag + aggregate
python tag_and_analyze_all.py

# 4. Pull IMDb ratings (downloads ~260MB of TSV dumps)
python fetch_imdb_ratings.py

# 5. Generate all analysis charts
python nepo_pct_bar.py
python career_trajectory.py
python rating_gross_correlation.py
python rajkummar_vs_sonam.py

Or open nepotism_analysis.ipynb and Run-All.
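
The steps above can also be chained from one Python script. This is a hypothetical wrapper, not a file in the repository: the names `PIPELINE`, `build_commands`, and `run_pipeline` are made up for illustration. The script names and their order are taken from the steps listed above. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Script names and order as listed in the steps above.
PIPELINE = [
    "scrape_bollywood_all_2000_2025.py",  # 1. scrape Wikipedia
    "fill_gross_from_wiki.py",            # 2. fill gross from film pages
    "tag_and_analyze_all.py",             # 3. tag + aggregate
    "fetch_imdb_ratings.py",              # 4. pull IMDb ratings
    "nepo_pct_bar.py",                    # 5. charts
    "career_trajectory.py",
    "rating_gross_correlation.py",
    "rajkummar_vs_sonam.py",
]


def build_commands(python: str = sys.executable) -> list[list[str]]:
    """Return one argv list per script, in pipeline order."""
    return [[python, script] for script in PIPELINE]


def run_pipeline(dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands():
        print("$", " ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


# Dry run by default; pass dry_run=False to actually execute the scripts.
run_pipeline(dry_run=True)
```

Stopping at the first failure matters here because later steps read the files written by earlier ones.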

Method notes

Caveats

This is an observational dataset. It can describe correlations (nepo kids → more films, more reach, lower-rated films open bigger) but cannot prove causation about nepotism vs talent vs studio strategy. The dataset is biased toward films that got Wikipedia entries and IMDb ratings, which itself favors films that drew some audience.

License

CC BY 4.0 for the data and writeup; MIT for the scripts.