This report seeks to answer several questions about our data. The questions are answered with SQL queries on a 15 table database for a DVD rental company and were then visualized using plotly and RMarkdown. The questions included:

Which films are most costly to replace and why?
Over time, how often are the different ratings and categories of films rented?
Over time, how do the performances of the different stores compare?
What is the current total balance of all of our customers?

The data itself is synthetic and can be found here. For this report, I hosted the database locally to perform the queries. The full project and code can be found here if you would like to reproduce any of the analysis.

You can check out more of my work on my portfolio page or feel free to connect with me on LinkedIn!

Which films are most costly to replace and why?

Of the 1000 films in the database, the average film costs about $19.98 to replace and the individual costs range from $9.99 to $29.99. To get an intuition of why films might be more expensive to replace, it can be helpful to explore the table (Figure 1) below where films are sorted by their replacement cost. It is difficult to learn much this way but as will be shown, there may not even be a definitive connection between a film’s characteristics and its replacement cost.

Note: Language ID and Release Year each only have a single unique value across all of the data.

Figure 1

The histogram of replacement costs (Figure 2) reveals the lack of a clear pattern among the replacement costs. To further explore what drives replacement cost, we can look at its relationship to other variables.

Figure 2

Figure 3 is evidence that there is not a linear relationship between the numerical characteristics of a film (rental duration, rental rate, and length), and its replacement cost. The variable with the correlation of the largest magnitude with replacement cost is rental rate at -0.0446. This is a very weak correlation and an additional reason why it would be warranted to explore the relationship between replacement cost and the categorical characteristics of films.

Figure 3

Figures 4 and 5 are box plots depicting the distribution of replacement cost relative to the unique values of the two categorical characteristics of films in our data: film rating and special features. Curiously, both the minimum and maximum replacement cost associated with each value for both of these characteristics are almost identical. Combining this with the similarities in interquartile range of replacement cost across the characteristics, it suggests that the categorical valuables are also not clearly connected to the replacement cost.

Note: For Figure 4, hover your cursor over individual box plots to better see the special feature groupings that they correspond to.

Figure 4

Figure 5

Conclusion

After this preliminary analysis, there are no easily identifiable connections between a film’s characteristics and its replacement cost. For future analysis it might be helpful to model non-linear relationships and interactions between variables as they relate to replacement cost. This could be done by fitting a tree-based model and interpreting the feature importances or shapley values.

Over time, how often are the different ratings and categories of films rented?

The rental data covers the time period between May 24th, 2005 to February 14th 2006. There is also an alarming 5-month gap in the data between August 2005 and February 2006. This can be seen in Figures 6 and 8. Such a gap, with 0 rentals, suggests a serious, long-term issue with the data pipeline.

Ignoring this error and the small amount of data (16044 rentals across nearly 10 months), we can still attempt some basic analysis. Looking at Figure 6, it appears that throughout the entire timescale of the data, PG-13 movies are rented most often and G movies are rented least often. Films with other ratings have consistently similar rental counts. This relationship lines up very nicely with the the inventory amounts of films with each rating shown in Figure 7.

The connection between rental counts and inventory continues for categories. This can be seen in Figures 8 and 9. It may be helpful to reference the note here for Figure 9, you can start with the most frequent category in the inventory (Sports) and continually add the next most frequent category.

Note: For Figures 6 and 9, it might be helpful to double click the lines in the legend to view one rating or category at a time. You can toggle on other ratings/categories by clicking once on their line. Be aware that isolating a single rating or category can change the scale of the y axis.

Figure 6

Figure 7

Figure 8

Figure 9

Conclusion

For this question, timescale and data quality are the main factors preventing any more complex analysis. The relationship between inventory amounts and rental counts is fairly consistent for this small time window, even as total rental count changes. If there was more data, it would be interesting to see if this relationship continues. It would also facilitate in identifying any trends or seasonalities, while also allowing for more robust forecasting of rentals.

Over time, how do the performances of the different stores compare?

It is important to point out that although the data shows that films are rented in 597 different cities, there are only 2 unique stores. The store with ID 1 is listed in the Alberta district of Lethbridge. The store with ID 2 is listed in the QLD district of Woodbridge.

In real data the discrepancy between store counts and cities of rentals would raise major questions about the data quality, but this is fake data after all. We can still try to do some analysis on how the 2 stores compare to each other.

Figure 10 shows that, over the time period covered by the data, both stores have a nearly identical pattern of rentals per month, with store 1 consistently outperforming store 2. The previous finding where we saw that more inventory of films leads to more rentals also seems to hold true for the stores. Figure 11 shows that store 1 simply has more inventory than store 2 and this could very well explain the performance difference between the two stores.

Figure 10

Figure 11

I was curious how the two stores compare in terms of rentals over the course of an average day so I created Figure 12. Figure 12 depicts the average number of rentals in each store across every hour of the day. It seems that the stores do have very similar daily rental patterns, with both stores on average seeing a spike in rentals at 3:00 PM. It is also interesting to note that these stores are apparently open 24 hours a day.

Figure 12

Conclusion

The limitations of using fake data really made themselves known when looking at rentals on the store level. My best guess is that the creator of this dataset generated the rental data by “renting” a random percentage of the inventory at mostly random times. By randomly selecting inventory to “rent”, it would make sense that the films with the most common ratings/categories and the films at stores with the largest inventory are the films that are rented most often. Though in reality, there could be other identifiable explanations for this pattern, for example: perhaps stores have a larger inventory if they’re simply more popular stores or the cities that they’re located in are more populous. If these stores and cities were real it could be helpful to compare the population totals, densities, and demographics of the areas surrounding the stores and see how they relate to rental patterns.

What is the current total balance of all of our customers?

In order to calculate the total balance of all the customers, I used a set of payment rules set out in the custom functions of the provided database that defines a customer’s balance:

Calculate and charge rental fees for all previous rentals
Charge one dollar for every day the previous rentals are overdue
If a film is overdue for longer than 2 times the rental duration, charge the replacement cost
Credit all payments made before the date specified (I used the last date that any payment was made in the payment table)

The custom function needed a bit of editing, but I was able to calculate the current (as of the most recent payment on May 14th, 2007) balance of all 599 customers in the data. In aggregate, the customers have a total outstanding balance of approximately $40,514.70. In other words, they collectively owe over $40,000.

Looking at the summary statistics table shown in Figure 13, we can learn more about the balances carried by the customers. The customers, on average, owe $68. The largest positive balance is $7 and the lowest balance is $-204.

Figure 13

Conclusion

It would be very helpful to be able to talk with an employee at this fictional DVD rental company to better understand the company and how it operates. Once that was made clear, it would be interesting to analyze the behavior of customers who are charged the replacement cost for DVDs kept for twice as long as the rental duration. I suspect that a DVD rental company might be losing customers by instating a fee like this; potentially, a customer who has to pay the replacement cost may simply choose to never return to the store. An interesting way to test this could be rolling out a new policy at one location. The policy would not penalize overdue rentals with the replacement cost of the DVD and the goal would be to see if customers with long overdue rentals are retained. It could effectively be an A/B test of the policies and could contain a profit analysis to determine which policy is better for the long-term success of the company.

Stepping away from the actual analysis, this project was a lot of fun to work on and I definitely honed my SQL skills. Thank you for taking the time to read my report!

Analyzing DVD Rental Data Using SQL, Python, and RMarkdown

Chris Jakuc

10/2/2020

Which films are most costly to replace and why?

Over time, how often are the different ratings and categories of films rented?

Over time, how do the performances of the different stores compare?

What is the current total balance of all of our customers?