How are we helping our customers offer personalized services? Part III.

May 11, 2023 - Jan Beníšek

A report from the front line.

The previous articles were a gentle introduction to our way of building recommender systems, and we could extend them indefinitely. Instead of enumerating every type of model or approach with its pros and cons, however, we will share some of our first-hand combat experience. We will walk through various scenarios where the infamous “cold-start” problem can occur and show how to tackle it. The cold-start problem takes different shapes depending on your domain, and each shape requires a different approach. Finally, we will reveal some secret tools and techniques that we use to evaluate and improve the effectiveness of our recommender systems. What works and what does not? Ready? Let’s dive in.

Understanding the Cold-Start Problem

First and foremost, the infamous cold-start problem takes different shapes in different domains. Consider the first scenario – a blog with articles. There is a constant influx of new posts, none of which has enough reading history yet. What works for us here is to recommend not the item itself, but a higher-level category. There are a few reasons:

  • Recommender engines with too many fine-grained items tend to be sub-optimal. Too much data can be as detrimental as too little.
  • Modeling higher-level needs is cleaner and gives us more insight into the affinity to each category.
  • Business needs change constantly – if we know the general affinity to a category, choosing the final item is easy and flexible.
  • Training models with a large number of items is memory-intensive. Say we have 1,000,000 customers and 1,000 categories. If one cell in a NumPy array of floats needs 8 bytes, this matrix takes 8 GB. Now imagine we want to model each individual item and we have 10,000 of those: that gives us an 80 GB matrix. Good luck fitting that into RAM (see the back-of-the-envelope check after this list).
  • And lastly, unless new categories are added too often, we have solved the cold-start problem.
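Here is that back-of-the-envelope check as a minimal sketch; the 8 bytes per cell assumes NumPy's default float64:

    import numpy as np

    def matrix_size_gb(n_users: int, n_items: int, dtype=np.float64) -> float:
        # Memory a dense user-item matrix needs, in gigabytes.
        return n_users * n_items * np.dtype(dtype).itemsize / 1e9

    print(matrix_size_gb(1_000_000, 1_000))   # 8.0  GB for user x category
    print(matrix_size_gb(1_000_000, 10_000))  # 80.0 GB for user x item

In practice we would reach for a sparse representation long before that point, but the dense numbers explain why we prefer a thousand categories over ten thousand items.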

In the second scenario, assume we have introduced loyalty cards, so every day new users appear with little to no history. Here we usually recommend running a separate marketing track for the first few months. From our experience, it pays off to gently introduce customers to your products, explain the benefits and, as a byproduct, collect more data.

If we need predictions for new users or items anyway, we like to build very simple models. Offering products based on location, popularity or frequency is a reasonable approximation that works well, and user- or item-based similarity models can improve on otherwise random offerings.
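As an illustration, here is a minimal sketch of such a fallback; the purchase log and its column names are made up for the example:

    import pandas as pd

    # Hypothetical purchase log: one row per purchase event.
    purchases = pd.DataFrame({
        "user_id": [1, 1, 2, 3, 3, 4],
        "item_id": ["a", "b", "a", "c", "a", "b"],
        "region":  ["CZ", "CZ", "CZ", "DE", "DE", "DE"],
    })

    # Global fallback: the most popular items overall.
    top_global = purchases["item_id"].value_counts().head(3).index.tolist()

    # Location-aware fallback: the most popular items per region.
    top_by_region = purchases.groupby("region")["item_id"].apply(
        lambda s: s.value_counts().head(3).index.tolist()
    )

    def recommend_new_user(region: str) -> list:
        # A user with no history gets regional popularity, then global.
        return top_by_region.get(region, top_global)

    print(recommend_new_user("CZ"))  # ['a', 'b']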

What if your business is conducted on a website? You will face specific challenges. You probably want your website speed unhindered by model predictions, so either run a big fat cloud that does real-time scoring, or recalculate the predictions every night at the cost of missing the most recent day of data. Furthermore, it is impossible to get a full, uninterrupted user history: the same user coming from different devices, with or without cookies, or in incognito mode cannot be reliably recognized. Therefore, we always like to start with item-similarity or item-interaction models, which offer good results.
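A minimal sketch of the item-similarity idea, built on session-level co-occurrence (the tiny session-item matrix is a stand-in for real click data):

    import numpy as np

    # Hypothetical binary matrix: rows are sessions, columns are items,
    # 1 means the item was viewed in that session.
    sessions = np.array([
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 1, 1],
        [1, 1, 0, 1],
    ])

    # Cosine similarity between item columns via co-occurrence counts.
    counts = sessions.T @ sessions
    norms = np.sqrt(np.diag(counts))
    similarity = counts / np.outer(norms, norms)
    np.fill_diagonal(similarity, 0.0)  # never recommend the item itself

    # "Visitors who viewed item 0 also viewed..." - items ranked by similarity.
    print(np.argsort(similarity[0])[::-1])

Because nothing here depends on recognizing the user, it tolerates the fragmented histories described above.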

You said you work in retail? Then your recommender engine should contain mechanisms accounting for seasonality, which strongly influences this domain. Data scientists working in retail usually have rich data about customer behavior, so user- or item-interaction models work great. Speed and complexity are not a big concern, since there is no real need for real-time models. Specific retail areas, such as fashion, suffer from the cold-start problem: there are usually new collections every season. With food chains, discounts are a strong factor, and we like to incorporate them into our models, as sketched below.
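One simple way to do that, under our own assumptions (the linear weighting is an arbitrary illustration, not a production formula), is to down-weight heavily discounted purchases so they say less about genuine product affinity:

    import pandas as pd

    # Hypothetical purchase log with the discount applied to each line.
    log = pd.DataFrame({
        "user_id": [1, 1, 2, 2],
        "item_id": ["milk", "tv", "milk", "beer"],
        "discount_pct": [0.0, 0.5, 0.3, 0.0],
    })

    # A purchase at 50% off says less about affinity than a full-price one.
    log["weight"] = 1.0 - log["discount_pct"]

    # Weighted user-item matrix, ready for any interaction-based model.
    interactions = log.pivot_table(
        index="user_id", columns="item_id", values="weight",
        aggfunc="sum", fill_value=0,
    )
    print(interactions)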

For those working in financial institutions – we did not forget you! The data about users is often very rich, products have long life cycles, their descriptions are limited, and purchases are infrequent: you do not buy a mortgage every week. Therefore, we often implement user-similarity models, and they work like a charm. Our secret sauce is to apply PCA to the user features and calculate similarity on the top n principal components. It helps with both interpretability and speed.
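A minimal sketch of that recipe with scikit-learn; the random feature matrix and the choice of ten components are placeholders:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(42)
    user_features = rng.normal(size=(1000, 50))  # 1000 users, 50 raw features

    # Standardize, then keep only the top n principal components.
    X = StandardScaler().fit_transform(user_features)
    components = PCA(n_components=10).fit_transform(X)

    # User-user cosine similarity in the reduced space: 10 dims, not 50.
    similarity = cosine_similarity(components)

    # The five users most similar to user 0 (skipping user 0 itself).
    print(np.argsort(similarity[0])[::-1][1:6])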

Working in telco? Tariffs have short life cycles, and there is plenty of data about both users and items. This calls for user-similarity models or any type of collaborative filtering. We also experiment with graph or hierarchical models, because customers are influenced by their friends and family circle.
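To give a flavour of the graph idea, here is a toy sketch: recommend the tariff that is most common among the people a customer actually calls. The call graph and tariff assignments are invented:

    import networkx as nx
    from collections import Counter

    # Hypothetical call graph: an edge means two customers call each other.
    calls = nx.Graph([("ann", "bob"), ("ann", "cal"), ("bob", "cal"), ("cal", "dan")])
    tariffs = {"ann": "family", "bob": "family", "cal": "unlimited", "dan": "prepaid"}

    def recommend_tariff(customer: str) -> str:
        # Most common tariff among the customer's direct contacts.
        neighbour_tariffs = [tariffs[n] for n in calls.neighbors(customer)]
        return Counter(neighbour_tariffs).most_common(1)[0][0]

    print(recommend_tariff("dan"))  # 'unlimited' - dan only calls cal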

Evaluating Recommender Systems

Have I told you about our secret weapon for evaluating models? I like to use what I call a “benchmark model”. You build a model, it seems great, but is it? Does the added complexity pay off? The benchmark model assigns every customer predictions based on a simple heuristic: the most popular or most frequent item. Maybe you do not need a model at all and offering the most popular item is good enough; maybe it does not pay off to spend 20 man-days on a solution that is only marginally better.
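In practice this means computing the same metric, for example hit rate on held-out purchases, for both the model and the benchmark. A minimal sketch with made-up prediction lists:

    def hit_rate(recommendations: dict, actuals: dict) -> float:
        # Share of users whose actual next purchase appears in their list.
        hits = sum(actuals[u] in recs for u, recs in recommendations.items())
        return hits / len(recommendations)

    # Held-out "next purchase" per user (hypothetical).
    actuals = {1: "a", 2: "c", 3: "a"}

    # Benchmark: everyone gets the globally most popular items.
    benchmark = {u: ["a", "b"] for u in actuals}
    # The fancy model's personalized top-2 lists.
    model = {1: ["a", "d"], 2: ["c", "a"], 3: ["b", "d"]}

    print(hit_rate(benchmark, actuals))  # 0.67
    print(hit_rate(model, actuals))      # 0.67 - complexity did not pay off here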

Another not-so-secret tool in our toolbox is lift: it measures how much better our model performs than a random approach. We evaluate lift at the top-n predictions, because offering only a small batch of items mimics real usage most closely, and the metric is intuitive for everyone.
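Concretely, lift at top n is the hit rate (precision) within the model's top-n picks divided by the base rate a random selection would achieve. The numbers below are invented:

    def lift_at_n(precision_at_n: float, base_rate: float) -> float:
        # How many times better than random the top-n slice is.
        return precision_at_n / base_rate

    # Hypothetical: 2% of all customers buy the product, but 10% of the
    # model's top-n scored customers do.
    print(lift_at_n(0.10, 0.02))  # 5.0 -> five times better than random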

The last piece of the puzzle is profiling – an often overlooked but key part of our recommender system projects. Opening the box and understanding who the targeted people are, where they come from and how they behave is very valuable. It helps marketers create better campaigns, tailor communication and build trust in the model, and it serves as another layer of validation. We like to build these profiles in any BI dashboard product – they are quick to set up and offer beautiful visualizations.

I always half-jokingly say that data science is more art than science. This is especially true for recommender systems.