Cinematic Movies Analysis, exploring the factors that lead to a blockbuster production

The golden era of cinema may have ended in the 1960s with the 5 major studios no longer being able to block book theaters by law after the paramount case, however it gave birth to the Hollywood Renaissance which took place in the 1960s to the 1980s. This was when an influx of new, young and talented filmmakers with fresh ideas came to prominence in the United States. These directors have had a profound impact on what the modern era of cinema looks like today around the world, having given us some of the most iconic, inspiring and best selling movies of all time.

Up and coming directors, students of film and cinephiles may have all once wondered; what are the ingredients behind producing a high selling movie? Using a data-driven approach, I will do my best to answer that question.

The Approach

The purpose of this project is to explore the factors that lead to a successful high selling production with a focus on movies from the modern era of cinema (1980 onwards). I will be performing causal analysis to look at the correlation between different variables and the gross income of a movie which will be the measure of success for this analysis.

For this analysis, I will be using the movie industry dataset which can be found on Kaggle. This dataset contains 4 decades worth of movie data which will be essential to carry out this study.

Here is a summary of my approach:

  • Getting the data from a csv file
  • Cleaning and preparing the data for analysis
  • Finding correlations in the data
  • Drawing a conclusion based on the results

Now, let’s start retrieving the data.

Getting the Data

# standard imports 
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# configuring plot  
%matplotlib inline
matplotlib.rcParams['figure.figsize'] = (12,8)

# read in the data
df = pd.read_csv("movies-data\movies.csv")
# first 5 rows of the data frame
df.head()
name rating genre year released score votes director writer star country budget gross company runtime
0 The Shining R Drama 1980 June 13, 1980 (United States) 8.4 927000.0 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000.0 46998772.0 Warner Bros. 146.0
1 The Blue Lagoon R Adventure 1980 July 2, 1980 (United States) 5.8 65000.0 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000.0 58853106.0 Columbia Pictures 104.0
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 June 20, 1980 (United States) 8.7 1200000.0 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000.0 538375067.0 Lucasfilm 124.0
3 Airplane! PG Comedy 1980 July 2, 1980 (United States) 7.7 221000.0 Jim Abrahams Jim Abrahams Robert Hays United States 3500000.0 83453539.0 Paramount Pictures 88.0
4 Caddyshack R Comedy 1980 July 25, 1980 (United States) 7.3 108000.0 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000.0 39846344.0 Orion Pictures 98.0
#last 5 rows of the data frame
df.tail()
name rating genre year released score votes director writer star country budget gross company runtime
7663 More to Life NaN Drama 2020 October 23, 2020 (United States) 3.1 18.0 Joseph Ebanks Joseph Ebanks Shannon Bond United States 7000.0 NaN NaN 90.0
7664 Dream Round NaN Comedy 2020 February 7, 2020 (United States) 4.7 36.0 Dusty Dukatz Lisa Huston Michael Saquella United States NaN NaN Cactus Blue Entertainment 90.0
7665 Saving Mbango NaN Drama 2020 April 27, 2020 (Cameroon) 5.7 29.0 Nkanya Nkwai Lynno Lovert Onyama Laura United States 58750.0 NaN Embi Productions NaN
7666 It's Just Us NaN Drama 2020 October 1, 2020 (United States) NaN NaN James Randall James Randall Christina Roz United States 15000.0 NaN NaN 120.0
7667 Tee em el NaN Horror 2020 August 19, 2020 (United States) 5.7 7.0 Pereko Mosia Pereko Mosia Siyabonga Mabaso South Africa NaN NaN PK 65 Films 102.0
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7668 entries, 0 to 7667
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      7668 non-null   object 
 1   rating    7591 non-null   object 
 2   genre     7668 non-null   object 
 3   year      7668 non-null   int64  
 4   released  7666 non-null   object 
 5   score     7665 non-null   float64
 6   votes     7665 non-null   float64
 7   director  7668 non-null   object 
 8   writer    7665 non-null   object 
 9   star      7667 non-null   object 
 10  country   7665 non-null   object 
 11  budget    5497 non-null   float64
 12  gross     7479 non-null   float64
 13  company   7651 non-null   object 
 14  runtime   7664 non-null   float64
dtypes: float64(5), int64(1), object(9)
memory usage: 898.7+ KB

Here is the data we will be working with. We have 7667 movies in this dataset and each movie has the following attributes:

  • name: name of the movie
  • rating: rating of the movie (R, PG, etc.)
  • genre: main genre of the movie.
  • year: year of release
  • released: release date
  • score: IMDb user rating
  • votes: number of user votes
  • director: the director
  • writer: writer of the movie
  • star: main actor/actress
  • country: country of origin
  • budget: the budget of a movie. Some movies don’t have this, so it appears as 0
  • gross: revenue of the movie
  • company: the production company
  • runtime: duration of the movie

Cleaning and preparing the data for analysis

# checking for missing data
for col in df.columns:
    pct_of_missing = np.mean(df[col].isnull())
    print("{} - {}%".format(col,pct_of_missing))
name - 0.0%
rating - 0.010041731872717789%
genre - 0.0%
year - 0.0%
released - 0.0002608242044861763%
score - 0.0003912363067292645%
votes - 0.0003912363067292645%
director - 0.0%
writer - 0.0003912363067292645%
star - 0.00013041210224308815%
country - 0.0003912363067292645%
budget - 0.2831246739697444%
gross - 0.02464788732394366%
company - 0.002217005738132499%
runtime - 0.0005216484089723526%
df = df.dropna()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5421 entries, 0 to 7652
Data columns (total 15 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   name      5421 non-null   object 
 1   rating    5421 non-null   object 
 2   genre     5421 non-null   object 
 3   year      5421 non-null   int64  
 4   released  5421 non-null   object 
 5   score     5421 non-null   float64
 6   votes     5421 non-null   float64
 7   director  5421 non-null   object 
 8   writer    5421 non-null   object 
 9   star      5421 non-null   object 
 10  country   5421 non-null   object 
 11  budget    5421 non-null   float64
 12  gross     5421 non-null   float64
 13  company   5421 non-null   object 
 14  runtime   5421 non-null   float64
dtypes: float64(5), int64(1), object(9)
memory usage: 677.6+ KB
# checking for missing data
for col in df.columns:
    pct_of_missing = np.mean(df[col].isnull())
    print("{} - {}%".format(col,pct_of_missing))
name - 0.0%
rating - 0.0%
genre - 0.0%
year - 0.0%
released - 0.0%
score - 0.0%
votes - 0.0%
director - 0.0%
writer - 0.0%
star - 0.0%
country - 0.0%
budget - 0.0%
gross - 0.0%
company - 0.0%
runtime - 0.0%

The first thing I wanted to check was if there was missing data in the dataset. As shown we can see that there are NA values in the columns: rating, released, score, votes, writer, star, country, budget, gross, company and runtime. Incomplete entries in the data will produce inaccurate results therefore we will need to omit those movies.

As shown, we can now see that the number of movies in the data have reduced from 7667 to 5421 movies and we have completed filtered out the incomplete entries.

# split column released into released date and country of release
country_of_release = list()
release_date = list()

for entry in df['released'].astype(str):
    released = entry.strip(")").split("(",1)
    country_of_release.append(released[1])
    release_date.append(released[0])

df['release date'] = release_date
df['country of release'] = country_of_release

df = df.drop(columns='released')
# data
df.head()
name rating genre year score votes director writer star country budget gross company runtime release date country of release
0 The Shining R Drama 1980 8.4 927000.0 Stanley Kubrick Stephen King Jack Nicholson United Kingdom 19000000.0 46998772.0 Warner Bros. 146.0 June 13, 1980 United States
1 The Blue Lagoon R Adventure 1980 5.8 65000.0 Randal Kleiser Henry De Vere Stacpoole Brooke Shields United States 4500000.0 58853106.0 Columbia Pictures 104.0 July 2, 1980 United States
2 Star Wars: Episode V - The Empire Strikes Back PG Action 1980 8.7 1200000.0 Irvin Kershner Leigh Brackett Mark Hamill United States 18000000.0 538375067.0 Lucasfilm 124.0 June 20, 1980 United States
3 Airplane! PG Comedy 1980 7.7 221000.0 Jim Abrahams Jim Abrahams Robert Hays United States 3500000.0 83453539.0 Paramount Pictures 88.0 July 2, 1980 United States
4 Caddyshack R Comedy 1980 7.3 108000.0 Harold Ramis Brian Doyle-Murray Chevy Chase United States 6000000.0 39846344.0 Orion Pictures 98.0 July 25, 1980 United States

Looking at the ‘released’ column, I saw that it contained two seperate pieces of information which were the release date in full format and the country of release which the release date corresponds to. For the sake of data integrity, I decided to split that column into two columns called release date and country of release.

# variable data types
df.dtypes
name                   object
rating                 object
genre                  object
year                    int64
score                 float64
votes                 float64
director               object
writer                 object
star                   object
country                object
budget                float64
gross                 float64
company                object
runtime               float64
release date           object
country of release     object
dtype: object
# changing data type  
df['votes'] = df['votes'].astype('int64')
df['budget'] = df['budget'].astype('int64')
df['gross'] = df['gross'].astype('int64')
df.dtypes
name                   object
rating                 object
genre                  object
year                    int64
score                 float64
votes                   int64
director               object
writer                 object
star                   object
country                object
budget                  int64
gross                   int64
company                object
runtime               float64
release date           object
country of release     object
dtype: object

After looking at the data types of the variables, I was not satisfied that the data type of a few variables matched the nature of the vairable perfectly. I changed the data type to integer as I feel that not only does it match the variable better but it is also easier on the eye.

# sorting data frame by gross income
df = df.sort_values(by=['gross'],ascending=False)
df
name rating genre year score votes director writer star country budget gross company runtime release date country of release
5445 Avatar PG-13 Action 2009 7.8 1100000 James Cameron James Cameron Sam Worthington United States 237000000 2847246203 Twentieth Century Fox 162.0 December 18, 2009 United States
7445 Avengers: Endgame PG-13 Action 2019 8.4 903000 Anthony Russo Christopher Markus Robert Downey Jr. United States 356000000 2797501328 Marvel Studios 181.0 April 26, 2019 United States
3045 Titanic PG-13 Drama 1997 7.8 1100000 James Cameron James Cameron Leonardo DiCaprio United States 200000000 2201647264 Twentieth Century Fox 194.0 December 19, 1997 United States
6663 Star Wars: Episode VII - The Force Awakens PG-13 Action 2015 7.8 876000 J.J. Abrams Lawrence Kasdan Daisy Ridley United States 245000000 2069521700 Lucasfilm 138.0 December 18, 2015 United States
7244 Avengers: Infinity War PG-13 Action 2018 8.4 897000 Anthony Russo Christopher Markus Robert Downey Jr. United States 321000000 2048359754 Marvel Studios 149.0 April 27, 2018 United States
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5640 Tanner Hall R Drama 2009 5.8 3500 Francesca Gregorini Tatiana von Fürstenberg Rooney Mara United States 3000000 5073 Two Prong Lesson 96.0 January 15, 2015 Sweden
2434 Philadelphia Experiment II PG-13 Action 1993 4.5 1900 Stephen Cornwell Wallace C. Bennett Brad Johnson United States 5000000 2970 Trimark Pictures 97.0 June 4, 1994 South Korea
3681 Ginger Snaps Not Rated Drama 2000 6.8 43000 John Fawcett Karen Walton Emily Perkins Canada 5000000 2554 Copperheart Entertainment 108.0 May 11, 2001 Canada
272 Parasite R Horror 1982 3.9 2300 Charles Band Alan J. Adler Robert Glaudini United States 800000 2270 Embassy Pictures 85.0 March 12, 1982 United States
3203 Trojan War PG-13 Comedy 1997 5.7 5800 George Huang Andy Burg Will Friedle United States 15000000 309 Daybreak 85.0 October 1, 1997 Brazil

5421 rows × 16 columns

# dropping for duplicate data

df = df.drop_duplicates()

Finally, I have dropped all duplicate columns and made sure that the rows were sorted by gross earnings in descending order so that it is easier to see the list of top selling movies.

Analysis

I will start by looking at a correlation matrix of the dataset and try to see if there are any interesting insights that can be taken away from in the numerical data. I will be using Pearson parametric correlation test.

Correlation matrix for numerical variables

correlation_matrix = df.corr(method="pearson")
correlation_matrix
year score votes budget gross runtime
year 1.000000 0.056386 0.206021 0.327722 0.274321 0.075077
score 0.056386 1.000000 0.474256 0.072001 0.222556 0.414068
votes 0.206021 0.474256 1.000000 0.439675 0.614751 0.352303
budget 0.327722 0.072001 0.439675 1.000000 0.740247 0.318695
gross 0.274321 0.222556 0.614751 0.740247 1.000000 0.275796
runtime 0.075077 0.414068 0.352303 0.318695 0.275796 1.000000
correlation_matrix = df.corr(method="pearson")
sns.heatmap(correlation_matrix,annot=True)
plt.title("Correlation Matrix for Numeric Features")
plt.xlabel('Movie features')
plt.ylabel('Movie features')
plt.show()

png

Amongst the numeric features, we can see with this heatmap that there are two particular variables that have a strong positive correlation with gross earnings, budget and votes. They are the only two variables with a correlation greater than 0.5 with the variable, gross. The next variable with the strongest correlation has an r value of 0.26 so we can say that the correlation with the other variables, excluding budget and gross, is weak.

Budget vs. Gross Earnings Plot

# Budget vs. Gross scatter plot
sns.regplot(x="budget",y="gross",data=df,scatter_kws={"color":"blue"},line_kws={"color":"orange"}).set(
    title="Budget vs. Gross Earnings", xlabel="Budget (in millions)",ylabel="Gross income (in 100 millions)")

[Text(0.5, 1.0, 'Budget vs. Gross Earnings'),
 Text(0.5, 0, 'Budget (in millions)'),
 Text(0, 0.5, 'Gross income (in 100 millions)')]

png

Looking at numerical data is not enough. What if the prestige of a company has a role to play in the success of a movie? Is it possible that the writer of the novel has a role to play in the success of the production? We will need to convert all the non-numerical data into numerical data and find out for sure.

Correlation matrix for all variables

# preparing dataframe
df_numerised = df

for col in df_numerised.columns:
    if(df_numerised[col].dtype == "object"):
        df_numerised[col] = df_numerised[col].astype("category")
        df_numerised[col] = df_numerised[col].cat.codes
        
df_numerised
name rating genre year score votes director writer star country budget gross company runtime release date country of release
5445 386 5 0 2009 7.8 1100000 785 1263 1534 47 237000000 2847246203 1382 162.0 496 47
7445 388 5 0 2019 8.4 903000 105 513 1470 47 356000000 2797501328 983 181.0 124 47
3045 4909 5 6 1997 7.8 1100000 785 1263 1073 47 200000000 2201647264 1382 194.0 502 47
6663 3643 5 0 2015 7.8 876000 768 1806 356 47 245000000 2069521700 945 138.0 498 47
7244 389 5 0 2018 8.4 897000 105 513 1470 47 321000000 2048359754 983 149.0 132 47
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5640 3794 6 6 2009 5.8 3500 585 2924 1498 47 3000000 5073 1385 96.0 847 41
2434 2969 5 0 1993 4.5 1900 1805 3102 186 47 5000000 2970 1376 97.0 1386 39
3681 1595 3 6 2000 6.8 43000 952 1683 527 6 5000000 2554 466 108.0 1628 8
272 2909 6 9 1982 3.9 2300 261 55 1473 47 800000 2270 582 85.0 1442 47
3203 4966 5 4 1997 5.7 5800 651 161 1811 47 15000000 309 504 85.0 2021 6

5421 rows × 16 columns

As you can see we have converted the entire dataframe into numerical data. It is now ready for correlation analysis. Let’s have a look at the results.

# correlation matrix
correlation_matrix = df_numerised.corr(method="pearson")
sns.heatmap(correlation_matrix,annot=True)
plt.title("Correlation Matrix for Numeric Features")
plt.xlabel('Movie features')
plt.ylabel('Movie features')
plt.show()

png

# pairs of variables with a strong postive correlation
correlation_matrix = df_numerised.corr()

corr_pairs = correlation_matrix.unstack()
sorted_pairs = corr_pairs.sort_values()
high_corr = sorted_pairs[(sorted_pairs > 0.5)] 
high_corr
gross               votes                 0.614751
votes               gross                 0.614751
budget              gross                 0.740247
gross               budget                0.740247
name                name                  1.000000
runtime             runtime               1.000000
company             company               1.000000
gross               gross                 1.000000
budget              budget                1.000000
country             country               1.000000
star                star                  1.000000
writer              writer                1.000000
director            director              1.000000
votes               votes                 1.000000
score               score                 1.000000
year                year                  1.000000
genre               genre                 1.000000
rating              rating                1.000000
release date        release date          1.000000
country of release  country of release    1.000000
dtype: float64

After looking at the correlation matrix for all the variables, we can see that the only two variables that have a strong positive correlation with gross earnings remain budget and votes. All of the categorical variables in fact have shown no real correlation and has not changed the initial results.

Conclusion

So, to answer the question: what are the ingredients behind producing a high selling movie? Well if I was going to answer it in one word it would be budget. The results have shown that the greater the budget, the greater the gross income of the film is. We can also justify this correlation as well. The budget often covers costs of acquiring the script, payments to talent, and production costs. The online popularity of the movie also has a high correlation to the gross earnings of a movie. Increasing online presence on cinematography platforms such as IMBD plays a role in the success of the movie you produce.

Contrary to the question, we can also say that the production company name has no value to the success of the movie. In the past, this would have been different but in this new modern age the young student of film does not need to work for an elite production company to produce a blockbuster fortunately. Unfortunately, they would need a lot of money.

2021

Back to Top ↑