There is frequent debate about how to evaluate a recommender system and which KPIs to focus on. Recommender systems can be evaluated in many ways, using several groups of metrics, each with its own purpose. In this article, we will take a look at what they are and how to combine them so that both the business team and ML engineers are satisfied.
A recommender system (RS) aims to suggest relevant content or products that users might like or purchase. It helps users find items they are looking for, often items they do not realize they want until the recommendation is displayed. Different strategies have to be applied for different clients, and they are determined by the available data. Since an RS has to be data-driven, it can be fueled by machine learning algorithms.
There are two main stages of making recommendations: candidate generation and scoring.
These techniques and relevant evaluation metrics will be described further in this article.
To get the most out of RS and improve user experience, we should understand and dive into relationships between:
Keeping these relationships in mind while designing an RS will lead to a delightful experience for users and consequently boost their engagement with such products. Let’s imagine YouTube without recommended videos you like. Most of us spend a lot of time there just because recommendations are so accurate!
To select the best strategy for such systems, we must first assess the amount of available user and product data. Below are some popular strategies, sorted by the amount of data required in increasing order:
These strategies should be layered on top of each other to strengthen RS performance. For example, online shopping platforms should know the context of the product as well as the history of user purchases. For a new user, only the “View together” strategy is possible, while for returning customers the “Purchased together” strategy is a better fit.
Here is an important question: how do we measure the success of a recommender? Given the relationships and strategies that all have to be combined somehow, answering it takes considerable effort. Since there are multiple components and indicators to be covered, it is difficult to measure how good a recommendation engine is for a business problem.
However, there are metrics that we can use for such tasks. Since their specific selection depends on the algorithm, the next section will be dedicated to the overview of possible candidate generation techniques, the first stage in a recommender system.
The goal of candidate generation is to predict a rating for the products for a certain user and based on that rating select a subset of items they may like.
There are two main techniques to be described: content-based filtering and collaborative filtering.
Content-based filtering means that the RS recommends items similar to those the user liked or purchased (contextual strategy). For example, if user A watched two horror movies, another horror movie will be proposed to them. This technique can be user-centered or item-centered.
Item-centered content-based filtering means that the RS recommends new items based only on their similarity to previous items (implicit feedback).
In the case of user-centered content-based filtering, information about user preferences is collected, for example via questionnaire form (explicit feedback). Such knowledge leads to recommending items with similar features to the liked one.
The essential part of content-based systems is to pick similarity metrics. First, we need to define a feature space that describes each user based on implicit or explicit data. The next step is to set up a system that scores each candidate item according to the selected similarity metric. Similarity metrics that are appropriate for content-based filtering tasks will be discussed later in this article.
Collaborative filtering (CF) addresses some of the limitations of content-based filtering by using similarities between users and items simultaneously. It allows us to recommend an item to user A based on the items purchased by a similar user B. Moreover, the main advantage of CF models is that they learn user embeddings automatically, without the need for hand-engineering. That means they are less constrained than content-based methods. Collaborative filtering systems can be split into memory-based and model-based approaches.
Memory-based CF systems work with recorded values from item-item or user-user interactions assuming no model. Search is done based on similarities and nearest neighbours algorithms. For example, find the users that are the closest to user A and suggest items purchased by them.
Model-based approaches assume a generative model that explains user-item interactions and makes new predictions. They make use of matrix factorization algorithms that decompose the sparse user-item matrix into a product of two matrices: user-factor and item-factor. Many other methods are currently being researched in the area of model-based RS, for example association rules, clustering algorithms, and deep neural networks.
A hybrid recommendation system is a combination of content-based and collaborative filtering methods. Such systems help to overcome the issues faced by each of those two types of recommenders. They can be implemented in various ways:
Before we move forward, let’s check the pros and cons of the two techniques in the table below:
| Content-based | Collaborative |
| --- | --- |
| If a user likes comedy, another comedy is recommended | If user A is similar to user B and user B likes a certain video, then this video is recommended to user A |
Comparison of content-based and collaborative techniques | Source: Author
How do the various types of recommender systems differ when it comes to metrics? There are several metrics to evaluate models. For content-based filtering, we should choose from similarity metrics, while for collaborative methods we should choose predictive or classification metrics, depending on whether we predict a score or a binary output.
After we evaluate the candidate generation model, we may want to evaluate the whole system in terms of business value and cover more non-accuracy-related metrics for scoring. All of these points will be the subject of this section.
When an item’s meta-data is available, we can easily recommend new items to the user. For example, if we watched movie A on Netflix, we can recommend another movie by taking the extensive meta-data tags of other movies and calculating the distance between them and movie A. Another way is to use NLP techniques such as TF-IDF and represent movie descriptions as vectors. We only need to select a similarity metric.
The most common ones are cosine similarity, Jaccard similarity, Euclidean distance, and the Pearson coefficient. Cosine similarity, Jaccard similarity, and Euclidean distance are available in the `sklearn.metrics` module, and the Pearson coefficient is available as `scipy.stats.pearsonr`.
To compute the similarity between a purchased item and the new item for an item-centered system, we simply take the cosine between 2 vectors representing those items. Cosine similarity is the best match if there are many high-dimensional features, especially in text mining.
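The cosine between two vectors is their dot product divided by the product of their lengths. A minimal sketch in plain Python follows. The three feature vectors are hypothetical (they could be TF-IDF weights, for instance), and returning 0 for an all-zero vector is a convention assumed here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors for a purchased item and two candidates.
purchased = [0.9, 0.1, 0.0, 0.4]
candidate_1 = [0.8, 0.0, 0.1, 0.5]
candidate_2 = [0.0, 0.9, 0.7, 0.0]

sim_1 = cosine_similarity(purchased, candidate_1)
sim_2 = cosine_similarity(purchased, candidate_2)
print(f"candidate_1: {sim_1:.3f}, candidate_2: {sim_2:.3f}")
```

With these made-up vectors, `candidate_1` points in nearly the same direction as the purchased item and would be ranked first.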
Jaccard similarity is the size of the intersection divided by the size of the union of two sets of items.
The difference from other similarity metrics in this article is that Jaccard similarity takes sets or binary vectors as an input. If vectors contain rankings or ratings, it is not applicable. In the case of movie recommendation, let’s say we have 3 movies with 3 top tags.
Based on the data we may say that movie A is more similar to movie B than to movie C. This is because A and B share 2 tags (adventure, action) and A and C share one tag (romantic).
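The comparison above can be sketched in a few lines of Python. Only the shared tags (adventure, action, romantic) come from the example. The remaining tags are hypothetical placeholders added so that each movie has three tags.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

movie_a = {"adventure", "romantic", "action"}
movie_b = {"adventure", "action", "space"}       # "space" is a made-up tag
movie_c = {"romantic", "comedy", "friendship"}   # "comedy", "friendship" are made up

sim_ab = jaccard_similarity(movie_a, movie_b)  # 2 shared tags out of 4 distinct
sim_ac = jaccard_similarity(movie_a, movie_c)  # 1 shared tag out of 5 distinct
print(sim_ab, sim_ac)
```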
The Euclidean distance between two users in a user-centered system is the length of the line segment connecting them. The preference space is made up of the available items, and the axes are the items rated by the user. Based on user ratings, we search for items liked by users with similar tastes. The lower the distance between two persons, the higher the chance they like similar items.
A potential disadvantage of this metric is that when person A tends to give higher scores in general than person B (the whole distribution of ratings is higher), the Euclidean distance will be large regardless of how well the ratings of person A and person B correlate.
PCC measures the strength and direction of the linear relationship between two vectors of user ratings. It ranges from -1 to 1, where 0 means no linear correlation, and its sign matches the sign of the slope of the best-fit line.
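The disadvantage described above can be illustrated with a short sketch. The ratings are hypothetical: person B ranks the items exactly as person A does but scores everything two points lower, while person C uses a similar scale with a different ordering.

```python
import math

def euclidean_distance(u, v):
    """Length of the line segment between two rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical ratings of the same four items.
person_a = [5, 4, 5, 3]
person_b = [3, 2, 3, 1]  # same preferences as A, every rating 2 points lower
person_c = [4, 5, 3, 4]  # similar scale, different preferences

dist_ab = euclidean_distance(person_a, person_b)
dist_ac = euclidean_distance(person_a, person_c)
print(dist_ab, dist_ac)
```

Here person C ends up closer to A than person B does, even though B shares A's tastes exactly.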
For example, let’s consider ratings given by user A and user B:
The best fit line has a positive slope, which means a positive correlation between user A and user B (image below):
By using this approach, we can predict how person A would rate a product not rated yet. To do that, we simply take the weighted average of ratings of other users (including user B), where weights are calculated using PCC similarities.
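A minimal sketch of this weighted average in plain Python follows. All ratings are hypothetical, and skipping neighbours with a non-positive correlation is a simplification assumed here, not something the approach requires.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length rating vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # assumed convention when one user rates everything equally
    return cov / math.sqrt(var_u * var_v)

def predict_rating(target_ratings, neighbours):
    """Weighted average of neighbours' ratings of the unseen item.

    neighbours: list of (ratings_on_shared_items, rating_of_unseen_item).
    """
    numerator = denominator = 0.0
    for shared_ratings, unseen_rating in neighbours:
        weight = pearson(target_ratings, shared_ratings)
        if weight <= 0:
            continue  # simplification: ignore uncorrelated or opposite tastes
        numerator += weight * unseen_rating
        denominator += weight
    return numerator / denominator if denominator else None

# Hypothetical data: ratings on four shared items plus one item A has not rated.
user_a = [5, 4, 5, 3]
neighbours = [
    ([3, 2, 3, 1], 4),  # user B
    ([4, 5, 3, 4], 2),  # user C
    ([4, 4, 5, 2], 5),  # user D
]
predicted = predict_rating(user_a, neighbours)
print(round(predicted, 2))
```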
Predictive measures address the subject of how close ratings of recommender systems are to the user ratings. They are a good choice for non-binary tasks. Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are the most popular and easy to interpret predictive metrics.
MAE is the average magnitude of the differences between the predicted and the actual ratings, which makes it very easy to interpret.
Note that it does not give extra weight to large errors or outliers: every error contributes in proportion to its size. This means that MAE gives a holistic view of rating accuracy rather than emphasising large errors.
RMSE is a quadratic scoring metric that also measures average error magnitude, but the errors are squared before averaging, which makes a difference.
RMSE gives a large weight to large errors. It means that this is more useful when outliers are undesirable.
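The difference between the two metrics can be seen in a small sketch. The ratings below are hypothetical: one set of predictions is off by one point everywhere, the other is exact three times and off by four points once.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [4, 3, 5, 2]
small_errors = [3, 4, 4, 3]  # every prediction is off by 1
one_outlier = [4, 3, 5, 6]   # three exact predictions, one off by 4

print(mae(actual, small_errors), rmse(actual, small_errors))
print(mae(actual, one_outlier), rmse(actual, one_outlier))
```

Both sets of predictions have the same MAE, while the RMSE of the set with the single large error is twice as high.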
In practice, both RMSE and MAE are usually checked for collaborative recommendation models using K-fold cross-validation. However, what matters from the business perspective is not only the lowest RMSE or MAE but also the non-accuracy metrics used for scoring, which will be described in a later section. Now, let’s move on to metrics for binarized recommendation tasks.
Classification metrics evaluate the decision-making capacity of recommender systems. They are a good choice for tasks such as identifying relevant or irrelevant products to the user. For decision support metrics the exact rating is ignored, while for ranking-based methods it has an implicit influence through ranking.
Based on all recommended items over all users, traditional precision and recall can be calculated. Recommended items that were available in the test dataset or received a high interaction value can be considered accurate predictions, and vice versa. These metrics require annotations from the user, translating our problem into a binary one, and setting the number of top recommendations to consider (Top-N). Then, using, say, the `sklearn.metrics` module, we can construct a confusion matrix and define the metrics as below:

| | Relevant | Not relevant |
| --- | --- | --- |
| Recommended | True Positive (TP) | False Positive (FP) |
| Not recommended | False Negative (FN) | True Negative (TN) |
Confusion matrix of recommendation results | Source: Author
Precision@k is a fraction of top k recommended items that are relevant to the user
P = (# of top k recommendations that are relevant)/(# of items that are recommended)
Let’s check out the example:
Recall@k, or HitRatio@k, is the fraction of all items relevant to the user that appear among the top k recommendations. Please note that the larger k is, the higher the hit ratio, since there is a higher chance that the correct answer is covered by the recommendations.
R = (# of top k recommendations that are relevant)/(# of all relevant items)
What does it look like in our example?
F1@k is a harmonic mean of precision@k and recall@k that helps to simplify them into a single metric. All the above metrics can be calculated based on the confusion matrix. The exact formulas are given below:
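In confusion-matrix terms, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is 2 · precision · recall / (precision + recall). A minimal sketch of the @k variants follows. The item identifiers and the function name are hypothetical.

```python
def precision_recall_f1_at_k(recommended, relevant, k):
    """Precision@k, recall@k and F1@k for one user's ranked recommendations."""
    top_k = recommended[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)  # true positives
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Hypothetical ranked recommendations and ground-truth relevant items.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e", "x", "y"}

precision, recall, f1 = precision_recall_f1_at_k(recommended, relevant, k=5)
print(precision, recall, round(f1, 3))
```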
As we can see, the F1 coefficient does not consider the true-negative values. Those are the cases when the recommendation system did not recommend the item irrelevant to the user. That means, we can put in any value against true-negative values and it won’t affect the F1 score. An interesting and perfectly symmetric alternative is the Matthews correlation coefficient (MCC).
Matthews correlation coefficient is a correlation coefficient between the observed and predicted binary classification:
When the classifier is perfect (FP = FN = 0) the value of MCC is 1, indicating perfect positive correlation. Conversely, when the classifier always misclassifies (TP = TN = 0), we get a value of -1, representing perfect negative correlation.
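MCC is computed from the confusion matrix as (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN)). The sketch below checks the two extreme cases and shows that, unlike F1, MCC reacts to the number of true negatives. The counts are hypothetical, and returning 0 when the denominator is zero is a common convention assumed here.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denominator

def f1(tp, fp, fn):
    """F1 score from confusion-matrix counts (true negatives are not used)."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

perfect = mcc(tp=5, fp=0, fn=0, tn=5)        # FP = FN = 0
always_wrong = mcc(tp=0, fp=5, fn=5, tn=0)   # TP = TN = 0

# Same TP, FP, FN but a different number of true negatives.
mcc_few_tn = mcc(tp=5, fp=2, fn=3, tn=10)
mcc_many_tn = mcc(tp=5, fp=2, fn=3, tn=100)
print(perfect, always_wrong, round(mcc_few_tn, 3), round(mcc_many_tn, 3), round(f1(5, 2, 3), 3))
```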
If we have a candidate generation algorithm that is returning a ranked ordering of items and items further down in the list are less likely to be used or seen, then the following metrics should be considered.
While precision@k (P(k)) considers only the subset of your recommendations from rank 1 to k, average precision rewards us for placing the correct recommendations on top of the list. Let’s start with the definition. If we are asked to recommend N items and the number of relevant items in the full space of items is m, then:
For example, let’s consider sample outputs for AP@5, while we recommend items to a user who added m = 5 products.
In the first set of recommendations, we can see that only the fifth recommended item is relevant. It means precision@1 = precision@2 = precision@3 = precision@4 = 0, as there are no relevant items in the first four places. Precision@5 equals ⅕ because the fifth item is relevant. Once we have calculated all precision@k values, we sum them and divide the result by 5, i.e. the number of products the user added (m), to get the value of AP@5.
Based on the above example we should notice that AP rewards us for top ranking the correct recommendations. That happens because the precision of the kth subset is higher the more correct guesses we have up to the point k. This can be seen in the below example.
While precision@5 is constant, AP@5 decreases with the rank of recommended item. A very important thing to note is that AP will not penalise us for including additional recommendations on our list. When using it, we should make sure that we recommend only the best items.
While AP applies to a single data point, that is equivalent to a single user, MAP is the average of AP metric over all Q users.
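AP@N and MAP can be sketched as follows. The sketch assumes the common definition in which precision@k is added only at ranks that hold a relevant item and the sum is divided by m, the number of relevant items. Item identifiers are hypothetical.

```python
def average_precision_at_n(recommended, relevant, n):
    """AP@N: sum of precision@k over the relevant ranks, divided by m."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(recommended[:n], start=1):
        if item in relevant:
            hits += 1
            total += hits / k  # precision@k, counted only when rank k is relevant
    return total / len(relevant)

def mean_average_precision(all_recommended, all_relevant, n):
    """MAP@N: mean of AP@N over all users."""
    scores = [average_precision_at_n(rec, rel, n)
              for rec, rel in zip(all_recommended, all_relevant)]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical user with m = 5 relevant products.
relevant = {"r1", "r2", "r3", "r4", "r5"}
hit_at_rank_5 = ["x1", "x2", "x3", "x4", "r1"]
hit_at_rank_1 = ["r1", "x1", "x2", "x3", "x4"]

ap_late = average_precision_at_n(hit_at_rank_5, relevant, n=5)
ap_early = average_precision_at_n(hit_at_rank_1, relevant, n=5)
map_score = mean_average_precision([hit_at_rank_5, hit_at_rank_1], [relevant, relevant], n=5)
print(round(ap_late, 2), round(ap_early, 2), round(map_score, 2))
```

Both lists contain exactly one relevant item, so precision@5 is identical, yet the list with the hit at rank 1 gets the higher AP@5.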
MRR is the average of reciprocal rank (RR) over users. The reciprocal rank is the “multiplicative inverse” of the rank of the first correct item. MRR is an appropriate choice in two cases:
It means that MRR doesn’t apply if there are multiple correct responses in the resulting list. If your system returns 10 items and it turns out there is a relevant item in the third-highest spot, that’s what MRR cares about. It will not check if the other relevant items occur between rank 4 and rank 10.
The sample calculation of MRR is presented below:
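As an illustration, a minimal MRR sketch in Python. The three users and their item lists are hypothetical, and a user with no relevant item in the list is assumed to contribute a reciprocal rank of 0.

```python
def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0 if none is recommended."""
    relevant = set(relevant)
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0  # assumed convention when no relevant item appears

def mean_reciprocal_rank(all_recommended, all_relevant):
    """Average reciprocal rank over all users."""
    ranks = [reciprocal_rank(rec, rel) for rec, rel in zip(all_recommended, all_relevant)]
    return sum(ranks) / len(ranks) if ranks else 0.0

# Hypothetical recommendations for three users.
all_recommended = [
    ["a", "b", "c"],  # first relevant item at rank 3
    ["d", "e", "f"],  # first relevant item at rank 1
    ["g", "h", "i"],  # first relevant item at rank 2
]
all_relevant = [{"c"}, {"d", "f"}, {"h"}]

mrr = mean_reciprocal_rank(all_recommended, all_relevant)
print(round(mrr, 3))
```

The second user has two relevant items, but only the first one (rank 1) affects the score.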
DCG is a measure of ranking quality. To describe it, we should start with Cumulative Gain. CG is the sum of graded relevance values of all results in the list. That means that we need relevance scores of our recommendations to calculate it.
Both lists of relevance scores, `scoresA` and `scoresB`, have a cumulative gain of 6.
Cumulative gain calculation | Source: Author
As we can see, assuming that highly relevant documents are more useful when appearing earlier in the search results list, it is not entirely right that the above two lists of relevance scores receive the same score.
To overcome this issue DCG should be introduced. It penalises highly relevant documents that appear lower in the search by reducing the graded value logarithmically proportional to the position of the result. See equation below.
Based on our example, let’s calculate DCG for `scoresA` in Python, considering `scoresB` as the true output.
As we can see, the DCG score is around 3.6 instead of 6. The issue with DCG is that it is difficult to compare performances from different queries because they are not in the 0 to 1 range. That’s why nDCG is more commonly used. It can be obtained by calculating the ideal DCG (IDCG). IDCG is DCG for sorted rankings in descending order and plays the role of a normalization factor.
In the case of our example:
One limitation of the nDCG score is that it does not penalise false positives. For example, [3] and [3, 0, 0] result in the same nDCG, but the second output contains 2 irrelevant recommendations. It may also not be suitable for recommender systems that have several equally good results.
We must remember that recommendation is not a prediction. Evaluating the candidate generation model is one thing, and incorporating the model into the whole RS and giving the highest score to the most interesting items is another. The way the client evaluates the system is influenced not only by accuracy but also by the company’s business strategy. For example, for news aggregation sites the goal is to increase the amount of time people spend on the platform, while for e-commerce the determining factor of RS performance is the increase in sales.
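A minimal sketch of CG, DCG, and nDCG in plain Python follows. It assumes the linear-gain form, in which each relevance score is divided by log2(position + 1). The two five-item lists are hypothetical stand-ins that both sum to 6, not the article's original scores.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: relevance divided by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalised by the DCG of the same scores sorted in descending order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Hypothetical relevance scores, listed in recommended order.
relevant_first = [3, 2, 0, 0, 1]
relevant_last = [0, 0, 1, 2, 3]

print(sum(relevant_first), sum(relevant_last))   # identical cumulative gain
print(round(dcg(relevant_first), 3), round(dcg(relevant_last), 3))
print(round(ndcg(relevant_first), 3), round(ndcg(relevant_last), 3))

# The limitation described above: trailing irrelevant items are not penalised.
print(ndcg([3]), ndcg([3, 0, 0]))
```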
Recommendation-centric metrics are user-independent concepts that don’t require user information. They evaluate systems in areas other than users’ ratings or their history. They include accuracy and metrics defined earlier in this article. Let’s take a look at some of them.
When a collaborative recommender system is focused on accuracy only, we may run into the following problem. In this example, the user bought a couple of Beatles albums. As a result, they are provided with a list of other albums by the same band. Although the user would probably like them, such localised recommendations are not very useful. It would be more useful to leave more space for other bands.
This is what diversity means: it is the average dissimilarity between all pairs of items in the result set. It is of course highly dependent on the available meta-data as well as on the similarity metric that we select. As we can see in the plot below, while accuracy remains constant between 40 and 100 top recommendations, diversity still increases with the number of recommended items displayed. This means that it is worth considering the diversity metric for re-ranking the recommended items.
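The average pairwise dissimilarity can be sketched as below. Dissimilarity is assumed to be 1 minus cosine similarity, which is one possible choice among many, and the item vectors are hypothetical one-hot encodings of the artist.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def intra_list_diversity(item_vectors):
    """Average (1 - cosine similarity) over all pairs of recommended items."""
    pairs = list(combinations(item_vectors, 2))
    if not pairs:
        return 0.0
    return sum(1 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)

# Hypothetical item vectors: one dimension per artist.
same_band_only = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
mixed_bands = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(intra_list_diversity(same_band_only), intra_list_diversity(mixed_bands))
```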
Coverage is the ability of the recommender system to recommend all items from the training set to users. Let’s consider a random recommender that selects items as in a lottery drawing. Such a recommender has nearly 100% coverage because it is able to recommend every available item. On the other hand, a popularity-based recommender will recommend just the top k items. In such a case, coverage is close to 0%.
Coverage does not evaluate whether the user enjoys the recommendation. Instead, it measures the RS in terms of its ability to bring unexpectedness to the user. Low coverage can lead to users’ dissatisfaction.
User-centric metrics are collected by asking users, automatically recording interactions, and observing their behaviour (online). Although such empirical tests are difficult, expensive, and resource-demanding, they are the only way to truly measure customer satisfaction. Let’s discuss them.
Novelty is a measure of the ability of the RS to introduce long-tail items to users. E-commerce platforms can benefit from ranking individualized, niche items highly. For example, Amazon owes much of its success to selling books that are not available in traditional book stores, rather than bestsellers.
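Coverage can be sketched as the share of catalog items that appear in at least one user's recommendation list. The catalog and the two recommenders' outputs are hypothetical.

```python
def catalog_coverage(recommendations_per_user, catalog):
    """Fraction of catalog items recommended to at least one user."""
    catalog = set(catalog)
    if not catalog:
        return 0.0
    recommended = set()
    for items in recommendations_per_user:
        recommended.update(items)
    return len(recommended & catalog) / len(catalog)

catalog = [f"i{n}" for n in range(10)]  # hypothetical catalog of 10 items

# A popularity-based recommender gives every user the same top-2 items.
popularity_based = [["i0", "i1"], ["i0", "i1"], ["i0", "i1"]]
# A more varied recommender spreads recommendations across the catalog.
varied = [["i0", "i1"], ["i2", "i3"], ["i4", "i5"]]

print(catalog_coverage(popularity_based, catalog), catalog_coverage(varied, catalog))
```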
Novelty can be defined as a fraction of unknown items among all items the user liked. An ideal way of measuring it would be a customer survey but in most cases, we are unable to determine whether the user knew the item before. Having implicit data about user behaviour allows us to measure dissimilarity between recommendations that sometimes substitutes novelty scores. We have to also remember that too many novel items can result in a lack of trust from users. It is essential to find the balance between novelty and trustworthiness.
Trustworthiness is a measure of whether users trust the recommender system that they interact with. One way of improving it is to add an explanation of why a specific item is recommended.
Churn measures how frequently recommendations change after the user rates new items. Responsiveness is the speed of such change. These 2 metrics should be taken into consideration but, similarly to novelty, they can lead to low trustworthiness.
Now we know that accuracy is not enough to measure RS performance and we should put our attention to metrics such as coverage and novelty as well.
Unfortunately, all metrics described above don’t show us how real customers react to the produced recommendations in terms of the company’s business strategy. The only way to measure it is A/B testing. A/B testing costs more resources and time, but it allows us to measure the metrics presented on the diagram below, that we are going to define in this section.
CTR measures how many clicks are gained by recommendations. The assumption is that the higher the clicks, the more relevant are the recommendations. It is very popular in the news recommendation domain and used by such web platforms as Google News or Forbes. Personalized suggestions brought them around a 38% increase in clicks compared to popularity-based systems.
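As a rough illustration, CTR is the number of clicks divided by the number of times recommendations were shown, and the effect of a new recommender can be expressed as the relative lift over a baseline. The counts below are made up and are not taken from any of the platforms mentioned.

```python
def click_through_rate(clicks, impressions):
    """Clicks divided by the number of times recommendations were shown."""
    return clicks / impressions if impressions else 0.0

def relative_lift(baseline_ctr, new_ctr):
    """Relative change of the new CTR with respect to the baseline CTR."""
    if baseline_ctr == 0:
        raise ValueError("baseline CTR must be non-zero")
    return (new_ctr - baseline_ctr) / baseline_ctr

# Hypothetical counts for a popularity-based baseline and a personalized variant.
baseline_ctr = click_through_rate(clicks=500, impressions=10_000)
personalized_ctr = click_through_rate(clicks=600, impressions=10_000)

lift = relative_lift(baseline_ctr, personalized_ctr)
print(f"baseline {baseline_ctr:.2%}, personalized {personalized_ctr:.2%}, lift {lift:.0%}")
```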
While CTR tells us whether a user clicked an item, it can’t determine whether that click converted into a purchase or not. Alternative adoption measures have been taken into account by YouTube and Netflix. Users’ YouTube clicks are counted only when they watched a specific percentage of video (“Long CTR”). Likewise, Netflix counts how many times a movie or series was watched after being recommended (“Take rate”).
When an item cannot be viewed, other, domain-specific measures have to be defined. For example, in the case of LinkedIn, it would be the number of contacts made with an employer after a job offer recommendation.
CTR and adoption measures are good in terms of determining that the introduced algorithm was successful in identifying later views or purchases. However, change in sales is what usually matters. Nevertheless, determining the improvement in terms of the business value of RS remains difficult. Users could have bought an item anyway and the recommendation could have been irrelevant.
Measuring how sales changed after introducing the RS, compared to before, is a very direct measure. However, it requires an understanding of the effects of shifts in the sales distribution. We may observe, for example, a decrease in diversity at the individual level and need further effort to overcome this effect.
It was discovered by several real-world tests of RS that having a recommendation usually increases user activity. Often, a correspondence between customer engagement and retention is assumed in various domains (e.g., at Spotify). It can be difficult to measure when churn rates are low.
Now that we have familiarised ourselves with multiple metrics for recommender system evaluation, we may have doubts about where to start. It may help if we ask ourselves the following questions.
For content-based filtering, similarity metrics such as cosine or Jaccard similarity should be considered to evaluate model performance. For a collaborative approach, predictive or classification metrics should be selected.
Do we have explicit data collected from a user or business? If yes, we can use it as a test set and perform supervised model building using accuracy-related metrics. If not, we have to treat implicit data as a ground truth. It means that accuracy-related metrics will be less informative because we don’t know how users will react to, for example, a niche recommended item. We have to focus on coverage and diversity then.
If users are going to consider multiple recommendations in a specific order, accuracy-related metrics (e.g. precision@k, recall@k) are not sufficient, since they ignore ordering and weigh items with lower and higher ranks equally. MAP, MRR, or DCG can be a better choice for such purposes.
In the case of collaborative systems, a binary scale suggests classification tasks and accuracy metrics, while ratings suggest regression tasks and predictive metrics. For content-based systems binary scale allows us to use the Jaccard similarity metric.
Metrics such as MAP, MRR, and DCG reflect the order of top recommendations. The best choice when we want to include not only ranking but ratings on top items as well is DCG. It incorporates the knowledge that some items are more relevant than others.
If yes, the overall predictive or ranking accuracy is not a good match. The exact ratings are irrelevant to a user because they already see a very limited list of items. In such cases, Hit Ratio and CTR are much more appropriate.
The only correct answer is yes. A/B testing allows us to measure the business value of RS, such as a change in CTR and sales. Moreover, we can collect feedback from users in terms of trustworthiness, churn, and novelty.
It is difficult to measure how good a recommendation engine is for a business problem. Binary or rank-aware accuracy-related metrics will be a great starting point to generate a set of candidate items via the selected ML method. Unfortunately, accuracy doesn’t go hand in hand with metrics such as diversity or novelty that are essential for customer satisfaction.
Finally, relying on ML metrics alone to determine the performance of a recommender system is not enough. Only user feedback brings valuable output in terms of business value. This is why A/B testing should always be performed. It allows us to measure improvement in CTR, sales, and their derivatives. Only then will business strategies and machine learning models work in harmony.