IMDB Ratings: Steven Spielberg vs Tim Burton

This project was done to research whether the IMDB rating for blockbuster movies were influenced by the director. Before looking at any kind of data, a null hypothesis and alternative hypothesis were set up as follows:

Null Hypothesis

There is no difference in the mean IMDB-rating for movies produced by Steven Spielberg or Tim Burton.

Alternative Hypothesis

There is a significant difference between these mean IMDB-ratings.

We will look at a subset sample of movies, taken from the Kaggle IMDB 5000 movie dataset

movies <- vroom::vroom(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Ave...
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action"...
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevor...
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015,...
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93...
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+0...
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+0...
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, ...
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658...
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752,...
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7....

Besides the obvious variables of title, genre, director, year, and duration, the rest of the variables are as follows:

  • gross : The gross earnings in the US box office, not adjusted for inflation
  • budget: The movie’s budget
  • cast_facebook_likes: the number of facebook likes cast memebrs received
  • votes: the number of people who voted for (or rated) the movie in IMDB
  • reviews: the number of reviews for that movie
  • rating: IMDB average rating

Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?

No missing but many duplicates across categories. There are 2961 rows and only 2907 unique movie titles, so it can be inferred that some movies show more than one time with different genres.

Then, I manipulate the data set so that only Steven Spielburg and Tim Burton are evaluated.

movies_v1 <- movies %>% filter(director == "Steven Spielberg" | director == "Tim Burton") %>% select(director | rating)

movies_v2 <- movies_v1 %>%
  group_by(director) %>%
  summarise(mean_rating = mean(rating, na.rm=TRUE),
            sd_rating = sd(rating, na.rm = TRUE),
            count = n(),
            se_rating = sd_rating/sqrt(count),
            t_critical = qt(0.975, count-1), 
            margin_of_error = t_critical * se_rating,
            lower = mean_rating - t_critical * se_rating,
            upper = mean_rating + t_critical * se_rating
            )

Steven Spielburg has an average IMDB rating of 7.57 while Tim Burton has an average rating of just 6.93. Furthermore, Steven has made more movies with a smaller spread across ratings, which is seen by his standard error of just 0.148. On the other hand, Tim Burton has made fewer movies with a larger discrepancy in ratings; causing him to have an standard error of 0.193.

Data Visualisation

Judging from these numbers alone, it does indeed seem that a movie’s director has an impact on the IMDB Rating. But looking at numbers can be a bit daunting, so let’s try to make this a bit more appealing to look at:

Hypothesis test with formula

movies <- movies %>%
  mutate(spielberg_burton = movies$director %in% c("Steven Spielberg", "Tim Burton"))

movies %>% filter(!is.na(spielberg_burton)) %>% 
  group_by(spielberg_burton) %>% 
  summarise(count = n()) %>% 
  mutate(prop= count/sum(count))
## # A tibble: 2 x 3
##   spielberg_burton count   prop
##   <lgl>            <int>  <dbl>
## 1 FALSE             2922 0.987 
## 2 TRUE                39 0.0132
t.test(rating ~ spielberg_burton, data = movies)
## 
##  Welch Two Sample t-test
## 
## data:  rating by spielberg_burton
## t = -7, df = 40, p-value = 5e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.188 -0.679
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##                6.38                7.31

Hypothesis test with infer

obs_diff <- movies %>%
  specify(rating ~ spielberg_burton) %>%
  calculate(stat = "diff in means", order = c(TRUE, FALSE))
obs_diff
## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 0.933
null_dist <- movies %>%
  specify(rating ~ spielberg_burton) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c(TRUE, FALSE))
null_dist
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.210 
##  2         2 -0.0878
##  3         3 -0.218 
##  4         4 -0.358 
##  5         5 -0.140 
##  6         6 -0.0592
##  7         7  0.0473
##  8         8 -0.0255
##  9         9  0.0733
## 10        10 -0.109 
## # ... with 990 more rows
ggplot(data = null_dist, aes(x = stat)) +
  geom_histogram()

null_dist %>% visualize() +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")

null_dist %>%
  get_p_value(obs_stat = obs_diff, direction = "two_sided")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

Conclusion

Using Michael Bay and James Cameron this becomes a lot more apparent - here the gap between the two confidence intervals is quite large and so tells us that a movie’s director most likely plays a role in its IMDB rating.