A data-driven look at how much of Bollywood’s last quarter-century has been driven by star kids. 2,824 films · 3,425 actors · 26 years.
Living writeup: plot-twist-nepotism.html (open the file in a browser, or host via GitHub Pages).
| Question | Answer |
|---|---|
| Share of films with ≥1 nepo kid | 32.8% (peak 53% in 2006, trough 19% in 2020) |
| Share of actors who are nepo kids | 2.3% (79 of 3,425), yet they appear in 32.8% of films: a roughly 14× over-representation |
| Films / actor (≥1 film) | nepo 16.6, non-nepo 3.4 |
| % of actors who appear in only one film | nepo 14%, non-nepo 63% |
| Correlation: gross vs IMDb rating | +0.18 (weak) |
| Correlation: gross vs log(IMDb votes) | +0.72 (strong); votes are the better proxy for box-office gross |
| % of nepo kids with a Filmfare Award | 47% (vs 8% of non-nepo) |
| Catchup tax | non-nepo needs ~1.4× as many films to match nepo cumulative vote reach |
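
The over-representation figure in the table can be re-derived from the other rows. The sketch below assumes the 14× number is the share of films featuring a nepo kid divided by the share of actors who are nepo kids; the README does not state the formula, so treat that as an assumption. The variable names are illustrative and are not taken from the notebook.

```python
# Headline figures copied from the summary table above.
nepo_actors = 79
total_actors = 3425
film_share_with_nepo = 0.328  # share of films with >= 1 nepo kid

# Share of all actors who are nepo kids.
actor_share = nepo_actors / total_actors

# Assumed definition: film share divided by actor share.
over_representation = film_share_with_nepo / actor_share

# Films per actor, nepo vs non-nepo, also from the table.
films_per_nepo = 16.6
films_per_non_nepo = 3.4
films_ratio = films_per_nepo / films_per_non_nepo

print(f"actor share: {actor_share:.1%}")
print(f"over-representation: {over_representation:.1f}x")
print(f"films-per-actor ratio: {films_ratio:.1f}x")
```

Under that assumed definition the result rounds to 14, which matches the table.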

- plot-twist-nepotism.html: the polished writeup
- nepotism_analysis.ipynb: Jupyter notebook with the full analysis pipeline
- bollywood_all_with_nepo.csv: every film 2000–2025 (year, title, director, cast, gross, nepo_kid)
- actor_timeseries_long_all_with_ratings.csv: one row per actor-film (with IMDb rating, votes)
- actor_timeseries_wide_all.csv: one row per actor, films spread across columns
- yearly_aggregates_all.csv: per-year nepo vs non-nepo summary
- film_ratings.csv: film × IMDb rating/votes lookup
- career_trajectory_summary.csv: per-actor regression slopes
- actors_with_awards.csv: National + Filmfare winner flags
- scrape_bollywood_all_2000_2025.py: Wikipedia year-index pages
- fill_gross_from_wiki.py: per-film infobox gross
- fetch_imdb_ratings.py: IMDb TSV dumps (free; downloads ~250MB)
- fetch_awards.py: National Film Awards + Filmfare lists
- scrape_bollywood_credits.py: director + cast from infoboxes
- tag_and_analyze_all.py: nepo-tagging + time-series + aggregates (run this after scraping)
- career_trajectory.py: per-actor regression of rating/votes vs film number
- rating_gross_correlation.py: rating vs gross vs votes correlations
- missingness_bias.py: checks whether gross-missingness is correlated with nepo status
- non_nepo_breakthroughs.py: outsiders who reached nepo vote levels
- rajkummar_vs_sonam.py: case-study chart
- nepo_pct_bar.py: yearly bar + trend line chart
- chart_yearly_with_debuts.png: total films + nepo % + key debut labels
- chart_trajectory_votes.png: career trajectory by film position
- chart_cumulative_votes.png: cumulative-votes catchup curve
- chart_rajkummar_vs_sonam.png: case-study
- chart_gross_vs_rating_votes.png: correlation scatter
- chart_slope_distribution.png: per-actor slope histogram

pip install -r requirements.txt
# 1. Scrape Wikipedia (~5 min)
python scrape_bollywood_all_2000_2025.py
# 2. Fill gross from individual film pages (~17 min, 2,500 requests)
python fill_gross_from_wiki.py
# 3. Tag + aggregate
python tag_and_analyze_all.py
# 4. Pull IMDb ratings (downloads ~260MB of TSV dumps)
python fetch_imdb_ratings.py
# 5. Generate all analysis charts
python nepo_pct_bar.py
python career_trajectory.py
python rating_gross_correlation.py
python rajkummar_vs_sonam.py
Or open nepotism_analysis.ipynb and Run-All.
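
A third option is a small Python driver that runs the scripts in the order listed above and stops at the first failure. This is only a sketch and is not shipped in the repo: `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical names, and the script order is simply copied from the numbered steps. It defaults to a dry run that prints the commands without executing them.

```python
import subprocess
import sys

# Script order copied from the numbered steps above.
PIPELINE = [
    "scrape_bollywood_all_2000_2025.py",
    "fill_gross_from_wiki.py",
    "tag_and_analyze_all.py",
    "fetch_imdb_ratings.py",
    "nepo_pct_bar.py",
    "career_trajectory.py",
    "rating_gross_correlation.py",
    "rajkummar_vs_sonam.py",
]


def build_commands(scripts, python=sys.executable):
    """Return one argv list per script, using the given interpreter."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in build_commands(scripts):
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit,
            # so later steps never run on top of a failed one.
            subprocess.run(cmd, check=True)


# Dry run: only prints the eight commands.
run_pipeline()
```

Calling `run_pipeline(dry_run=False)` from the repo root would execute the steps for real, including the slow scraping and download stages.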
The nepo-kid list is the NEPO_KIDS set in tag_and_analyze_all.py; edit it and rerun the script to refine the tagging.

This is an observational dataset. It can describe correlations (nepo kids → more films, more reach, lower-rated films open bigger) but cannot prove causation about nepotism vs talent vs studio strategy. The dataset is biased toward films that got Wikipedia entries and IMDb ratings, which itself favors films that drew some audience.
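
To make the NEPO_KIDS tagging step concrete, here is a minimal sketch of how a set-based tag could feed the "share of films with ≥1 nepo kid" figure. It is not the code in tag_and_analyze_all.py: the set contents, the helper names `has_nepo_kid` and `yearly_nepo_share`, the row layout, and the toy films are all placeholders made up for illustration.

```python
# Hypothetical stand-in for the NEPO_KIDS set in tag_and_analyze_all.py.
NEPO_KIDS = {"Star Kid A", "Star Kid B"}


def has_nepo_kid(cast, nepo_kids=NEPO_KIDS):
    """Return True if any cast member is in the nepo set."""
    return any(name.strip() in nepo_kids for name in cast)


def yearly_nepo_share(films, nepo_kids=NEPO_KIDS):
    """Map each year to the fraction of its films with >= 1 nepo kid."""
    totals, tagged = {}, {}
    for film in films:
        year = film["year"]
        totals[year] = totals.get(year, 0) + 1
        # bool counts as 0 or 1 when added to an int.
        tagged[year] = tagged.get(year, 0) + has_nepo_kid(film["cast"], nepo_kids)
    return {year: tagged[year] / totals[year] for year in totals}


# Toy rows, invented for this example; not taken from the dataset.
films = [
    {"year": 2001, "title": "Film A", "cast": ["Star Kid A", "Actor X"]},
    {"year": 2001, "title": "Film B", "cast": ["Actor Y"]},
    {"year": 2002, "title": "Film C", "cast": ["Actor X", "Actor Z"]},
]

print(yearly_nepo_share(films))  # {2001: 0.5, 2002: 0.0}
```

Because the tag is a plain membership test, adding or removing a name from the set changes every downstream aggregate, which is why the script needs to be rerun after each edit.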
CC BY 4.0 for the data and writeup; MIT for the scripts.