Einar Jonsson has just completed a Master’s in Mechanical Engineering at UCL in the UK. He completed a two-month internship as a Machine Learning Engineer for an Asia-based aggregator service for the entertainment industry as part of Start Me Up’s Startup Internship Program.
On my first day, I arrived at Dojo, the beachside co-working space where I’d be based for two months, to familiarize myself with my project.
The task was to complete two large machine learning projects for a platform that sells movie and theatre tickets across Asia. The first project was to build a ticket sale forecasting program, and the second project was to build a Netflix-like recommendation system.
More accurate sales forecasting models mean more effective allocation of advertising expenditure.
The first step for me was simply to break down the first problem into a set of smaller, more manageable problems. These were as follows:
- Perform background research on methods being used in industry for similar problems
- Clean up the data
- Visualize trends in the data
- Build simple models that provide a baseline accuracy
- Perform feature engineering on the data
- Tune the final model’s hyper-parameters
The background research gave me a good idea of what kinds of models to begin attempting to use for this kind of problem. More importantly, the research helped determine which features were most significant, as well as which features could be engineered from common datasets.
The next step, cleaning up the data, was a very time-consuming one. The data arrived with countless cells that were empty or contained nonsensical values. In addition, all of the text and categorical information had to be converted into a numeric format.
As a result, a large amount of time went into writing algorithms that fill every empty or nonsensical cell with a reasonable numeric value, and algorithms that transform the text information into descriptive numerical values.
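To make the cleaning step concrete, here is a minimal sketch of the two ideas described above: filling empty or nonsensical cells, and turning a categorical column into numbers. It is not the code used in the project. The column names (`tickets`, `genre`), the "negative values are nonsensical" rule, the median fill, and the one-hot encoding are all assumptions chosen for illustration.

```python
from statistics import median


def clean_numeric(rows, column, is_valid):
    """Replace missing or invalid values in `column` with the median of the valid ones."""
    valid = [r[column] for r in rows if r[column] is not None and is_valid(r[column])]
    fill = median(valid)
    return [
        {**r, column: r[column] if r[column] is not None and is_valid(r[column]) else fill}
        for r in rows
    ]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new_row = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new_row[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows: one empty cell and one nonsensical (negative) value.
rows = [
    {"tickets": 120, "genre": "action"},
    {"tickets": None, "genre": "drama"},
    {"tickets": -5, "genre": "action"},
    {"tickets": 80, "genre": "comedy"},
]
cleaned = one_hot(clean_numeric(rows, "tickets", lambda v: v >= 0), "genre")
for row in cleaned:
    print(row)
```

The median is used here because it is less sensitive to extreme values than the mean, which may matter for skewed sales data; other fill strategies could be equally reasonable.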
After this, I was able to begin creating all sorts of plots to visualize the available data. This allowed me to look for low-dimensional trends in the data. A few of the trends that came up helped inspire some of the features that were engineered later on. The sales data also turned out to be heavily skewed, so I decided that classification models (predicting a sales category) would far outperform regression models (predicting an exact sales figure). This was later confirmed with testing.
Next, I created numerous simple models including Decision Trees, Random Forests, XGBoost, AdaBoost, SVM, Logistic Regression, and Naive Bayes. After comparing the overall performance of these simple classifiers on the data, I found that the XGBoost and Random Forest classifiers had the best performance. As a result, these two models were compared in more depth.
The next step was to engineer some additional features from the data present. One of these was distance to the nearest cinema, as this could indicate the proximity to a competitor, or possibly the affinity for moviegoing in that area.
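As an illustration of the distance feature, the sketch below computes, for each cinema, the great-circle distance to its nearest neighbour using the haversine formula. The cinema names and coordinates are made up, and the choice of haversine distance (rather than, say, driving distance) is an assumption; the original project may have computed this differently.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Hypothetical cinema locations (latitude, longitude).
cinemas = {
    "A": (1.3000, 103.8000),
    "B": (1.3100, 103.8000),
    "C": (1.4000, 103.9000),
}

# For each cinema, the distance to the closest *other* cinema.
nearest = {
    name: min(haversine_km(pos, other) for o, other in cinemas.items() if o != name)
    for name, pos in cinemas.items()
}
for name, km in nearest.items():
    print(f"{name}: {km:.2f} km to nearest cinema")
```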
Another engineered feature was ‘time since release’, as this could help determine time-based sales patterns.
Also, total cinema sales could indicate how established the company’s sales are at a specific cinema. Moreover, sales per screen at a particular cinema would account for the size of a cinema in the total cinema sales feature.
Finally, features called actor value and director value were created from the list of actors and directors corresponding to a specific movie. To do this, I created an algorithm that scrapes IMDb data for a star’s overall movie grossing and popularity, then produces a numerical value for how popular that actor or director is.
Finally, hyper-parameters for both the XGBoost and Random Forest models were tuned with a grid search. The XGBoost model performed significantly better and was chosen as the final model. It predicts whether or not sales will be above 200 for a specific movie at a specific cinema, and it achieves an overall test accuracy of 94% and an F-score of 0.71.
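Accuracy and F-score can differ a lot when the classes are imbalanced, which is why both are worth reporting. The sketch below turns sales figures into the binary "above 200" label and computes both metrics by hand. The prediction counts are invented purely for illustration and are not the project's actual confusion matrix; the F-score is assumed to be the usual F1.

```python
SALES_THRESHOLD = 200  # from the text: predict whether sales are above 200


def to_label(sales):
    """1 if sales are above the threshold, else 0."""
    return 1 if sales > SALES_THRESHOLD else 0


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, f1


labels = [to_label(s) for s in [50, 250, 199, 201]]

# Invented example: 100 high-sales cases among 1000, so the classes are imbalanced.
y_true = [1] * 100 + [0] * 900
y_pred = [1] * 60 + [0] * 40 + [1] * 20 + [0] * 880
acc, f1 = accuracy_and_f1(y_true, y_pred)

# A model that always predicts "not above 200" still looks accurate, but its F1 is zero.
base_acc, base_f1 = accuracy_and_f1(y_true, [0] * 1000)

print(f"model:    accuracy={acc:.2f}, F1={f1:.2f}")
print(f"baseline: accuracy={base_acc:.2f}, F1={base_f1:.2f}")
```

In this made-up setting the trivial baseline already reaches high accuracy, so the F-score is the more informative number for the rare "high sales" class.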
The next project was to create a recommendation model. The aim of this project was to create a program that suggests movies to a user based on his/her viewing history. This would be helpful to the company because it should drive up revenue: a user is more likely to purchase a ticket for an intelligently recommended movie than for a randomly selected movie.
The background research concluded that there are three main types of recommendation systems widely used. These consist of user-based collaborative filtering, content-based filtering, and hybrid filtering. User-based collaborative filtering focuses on the user characteristics to find similar users, and recommends movies similar users have watched. Content-based filtering focuses on the movie metadata to find movies that are similar to each other. Hybrid filtering incorporates principles from both other systems.
Content-based filtering was chosen as the most appropriate type of model for two main reasons. First of all, 45% of users have purchased tickets for only a single movie, and content-based filtering is more effective at dealing with sparse data. Secondly, user data such as age and gender is not available. Thus, it is difficult to generate a meaningful matrix of latent features of users, which is important for user-based collaborative filtering.
This model works by creating a high-dimensional vector (n ≈ 10,000) of latent features for each movie. This vector encodes a movie’s metadata (such as genres, actors, and directors). Cosine similarity is then used to calculate how similar two movie vectors are. For each candidate movie, its cosine similarity to every movie the user has seen is added up. Finally, the movies with the highest aggregated cosine similarity scores are recommended to that user.
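The scoring described above can be sketched in a few lines. This is a toy version under stated assumptions, not the project's implementation: the movie titles and metadata tokens are invented, each vector is a sparse 0/1 encoding of those tokens rather than a 10,000-dimensional one, and already-seen movies are excluded from the recommendations, which the text does not specify.

```python
from math import sqrt


def to_vector(tokens):
    """Sparse binary vector: one dimension per metadata token."""
    return {t: 1.0 for t in tokens}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(history, movies, top_n=2):
    """Rank unseen movies by their summed cosine similarity to every movie in `history`."""
    scores = {}
    for candidate, vec in movies.items():
        if candidate in history:
            continue  # assumption: do not recommend movies the user has already seen
        scores[candidate] = sum(cosine(movies[seen], vec) for seen in history)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical catalogue with made-up metadata.
movies = {
    "Heist Night": to_vector(["genre:action", "actor:lee", "director:tan"]),
    "Heist Night 2": to_vector(["genre:action", "actor:lee", "director:ong"]),
    "Quiet River": to_vector(["genre:drama", "actor:kim", "director:tan"]),
    "Laugh Track": to_vector(["genre:comedy", "actor:sato", "director:ng"]),
}

print(recommend(["Heist Night"], movies))
```

Storing vectors as dictionaries keeps only the non-zero entries, which suits metadata vectors where each movie uses a tiny fraction of the available dimensions.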
One issue with many standard recommendation models is that they are generally not sensitive to changing user interests. To combat this, I created a simple algorithm that generates a time-factor based on how long it has been since a user watched a particular movie. The factor ranges from 1.0 when a user has just seen that movie, down to about 0.3 when it has been a few years since a user has seen that movie. This simply makes sure that for each user, movies watched recently are weighted more heavily than movies watched a while ago.
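One possible shape for such a time-factor is an exponential decay towards a floor. The text only gives the endpoints (1.0 for a movie just watched, about 0.3 after a few years), so the exponential form, the `DECAY_DAYS` constant, and the function names below are assumptions for illustration rather than the algorithm actually used.

```python
from math import exp

FLOOR = 0.3          # from the text: the factor bottoms out at about 0.3
DECAY_DAYS = 365.0   # assumed decay constant; controls how fast old views fade


def time_factor(days_since_watched):
    """Weight in (FLOOR, 1.0]: 1.0 for a movie just watched, approaching FLOOR over time."""
    days = max(days_since_watched, 0)
    return FLOOR + (1.0 - FLOOR) * exp(-days / DECAY_DAYS)


def weighted_score(views):
    """Sum of similarity scores, each scaled by how recently that movie was watched.

    `views` is a list of (similarity_to_candidate, days_since_watched) pairs.
    """
    return sum(sim * time_factor(days) for sim, days in views)


for days in (0, 180, 365, 3 * 365):
    print(f"{days:>5} days ago -> factor {time_factor(days):.2f}")

# The same similarity counts for more when the movie was watched recently.
recent = weighted_score([(0.8, 7)])
old = weighted_score([(0.8, 3 * 365)])
print(f"recent view: {recent:.2f}, old view: {old:.2f}")
```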
On my last day I presented the models to the Head of Data Analytics at the startup. All in all, I am proud of what I was able to achieve and I believe that they were very pleased with the work as well.