A project that classifies upcoming movies as blockbuster, good, or unsuccessful based on their profitability.

Technology Used

RStudio | Tableau | Excel | Python | GitHub

Project Overview:

This project is broken into three sections. The GitHub links for each section below provide further instructions for downloading the data and code.


Section 1: Data Sourcing

For this project, we used The Movie Database (TMDB) API. The data scraped from TMDB served as our primary dataset, with over 87k records from just five years, 2017-2021.

We then funneled the data using heuristic and logical filters.

  • Filter 1: We aggregated revenue and budget by production country, taking the mean of each.
					# Mean revenue and budget per production country
					countriesRevbud <- aggregate(x = df_movies[c("revenue", "budget")],
					                             by = list(df_movies$production_countries),
					                             FUN = mean)
  • Filter 2: We then dropped every record where both revenue and budget equal 0, using subset(). We did this because profitability is the primary basis for classifying movies as successful or unsuccessful.
					# Keep rows where revenue or budget is non-zero (drops rows where both are 0)
					df_movies <- subset(df_movies, revenue != 0 | budget != 0)


After data funneling, we were left with 6.4k records. We then augmented this dataset with IMDB data to add IMDB ratings and IMDB vote counts. This data is openly available from IMDB.

We merged the two datasets using VBA and VLOOKUP in Excel. The same merge can also be done easily with an inner join in R.
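To illustrate what an inner join does in this merge, here is a minimal sketch in Python (one of the tools listed above), using only the standard library. The join key imdb_id, the column names, and the sample rows are assumptions made up for illustration; they are not taken from the project data.

```python
def inner_join(left, right, key):
    """Merge each row of `left` with the row of `right` that shares `key`.

    Rows without a match on the other side are dropped, which is what
    distinguishes an inner join from a left or outer join.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


# Hypothetical sample rows (not real project data).
tmdb_rows = [
    {"imdb_id": "tt0000001", "title": "Movie A", "revenue": 100, "budget": 40},
    {"imdb_id": "tt0000002", "title": "Movie B", "revenue": 10, "budget": 30},
]
imdb_rows = [
    {"imdb_id": "tt0000001", "imdb_rating": 7.1, "imdb_votes": 1200},
    {"imdb_id": "tt0000003", "imdb_rating": 5.0, "imdb_votes": 50},
]

merged = inner_join(tmdb_rows, imdb_rows, "imdb_id")
print(merged)
```

Only "Movie A" survives the join here, because it is the only title present in both sources. A VLOOKUP behaves more like a left join, so an inner join additionally drops movies that have no IMDB match.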


Section 2: Coding in R

The code in this section can be reused on other data sources that have the same features. You may also run it on an updated or larger dataset from TMDB and/or IMDB. Please pay close attention to the column names.

The project code is broken into four parts:

Click on the code links above to go to the respective code files, or use the master code file to get the complete code along with live links to the data files originally used in this project.

Most of Part 1 is explained in Section 1. Apart from data funneling, Parts 0 and 1 mostly consist of understanding the data through observation and exploratory data analysis (EDA).

In Part 2, we did feature engineering and created over 25 features, such as:

  • Released_in_holidayMonth
  • profit_factor
  • cast_crew_ratio
  • Target
  • Converting over 10 text columns, such as Tagline and Collection, into binary or numerical columns
  • Movie directed by a high-paid director?
  • Most profitable production companies
  • Avg revenue of collection
  • Competition during release
  • Most profitable genre
  • Avg revenue by crew size
  • Avg revenue by cast size
  • Cost per capita
  • Avg budget per genre
  • Within 1-σ Runtime
  • Within 1-σ Combined Rating
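
A few of the features above can be sketched as follows. This is an illustrative Python sketch only: the formulas (profit_factor as revenue divided by budget, cast_crew_ratio as cast size divided by crew size), the set of holiday months, the input column names, and the sample rows are all assumptions, since the exact definitions live in the project's R code.

```python
from statistics import mean, stdev

# Assumption: which months count as "holiday months" is a guess for illustration.
HOLIDAY_MONTHS = {6, 7, 11, 12}


def add_features(movies):
    """Return copies of the movie rows with a few engineered features added."""
    runtimes = [m["runtime"] for m in movies]
    mu, sigma = mean(runtimes), stdev(runtimes)  # sample standard deviation
    featured = []
    for m in movies:
        f = dict(m)
        # Assumed definition: revenue relative to budget (None if budget is 0).
        f["profit_factor"] = m["revenue"] / m["budget"] if m["budget"] else None
        # Assumed definition: cast size relative to crew size.
        f["cast_crew_ratio"] = (
            m["cast_size"] / m["crew_size"] if m["crew_size"] else None
        )
        # 1 if the release month falls in the assumed holiday set, else 0.
        f["Released_in_holidayMonth"] = int(m["release_month"] in HOLIDAY_MONTHS)
        # 1 if the runtime lies within one standard deviation of the mean.
        f["within_1sigma_runtime"] = int(abs(m["runtime"] - mu) <= sigma)
        featured.append(f)
    return featured


# Hypothetical sample rows (not real project data).
movies = [
    {"revenue": 300, "budget": 100, "cast_size": 20, "crew_size": 40,
     "release_month": 12, "runtime": 100},
    {"revenue": 50, "budget": 100, "cast_size": 10, "crew_size": 50,
     "release_month": 3, "runtime": 110},
    {"revenue": 80, "budget": 0, "cast_size": 5, "crew_size": 10,
     "release_month": 7, "runtime": 180},
]

featured = add_features(movies)
for row in featured:
    print(row)
```

Guarding against a zero budget or crew size matters here, because Filter 2 in Section 1 only removes rows where revenue and budget are both 0.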

We then used the MICE package for data imputation and dummified columns using dummy_cols(), with a comma (",") as the separator. Next, we built a random forest (RF) model on a random train/test split to find the top 50 most important features, which we then used to build an RF model and a GLM. We also built a model on a train/test split by release year: train on 2017-2018, test on 2019.
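
The comma-separated dummification and the release-year split can be illustrated with a short Python sketch. It mimics the idea only and is not a port of dummy_cols(); for instance, it drops the original column, which is a choice made here for brevity. The column names genres and release_year and the sample rows are assumptions for illustration.

```python
def dummify(rows, column, sep=","):
    """Replace a delimited text column with one 0/1 column per distinct value."""
    categories = sorted(
        {v.strip() for r in rows for v in r[column].split(sep) if v.strip()}
    )
    encoded = []
    for r in rows:
        present = {v.strip() for v in r[column].split(sep)}
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = int(c in present)
        encoded.append(new_row)
    return encoded


def split_by_year(rows, train_years, test_years):
    """Split rows into train/test sets by release year instead of at random."""
    train = [r for r in rows if r["release_year"] in train_years]
    test = [r for r in rows if r["release_year"] in test_years]
    return train, test


# Hypothetical sample rows (not real project data).
rows = [
    {"release_year": 2017, "genres": "Action,Drama"},
    {"release_year": 2018, "genres": "Comedy"},
    {"release_year": 2019, "genres": "Drama, Comedy"},
    {"release_year": 2020, "genres": "Action"},
]

encoded = dummify(rows, "genres")
train, test = split_by_year(encoded, {2017, 2018}, {2019})
print(train)
print(test)
```

Splitting by year keeps every test movie strictly later than the training movies, which is closer to the real use case of predicting unreleased titles than a random split is.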

Section 3: Project Conclusion

In this section, you can get an idea of how to interpret the findings of this project. You can also subset the data by genre, or by top-earning genres, and then find the important factors within each subset. Please see our full presentation deck here and some highlights below.
