RecSysOps: Best Practices for Operating a Large-Scale Recommender System
video version: link
Operating a large-scale recommendation system is a complex undertaking: it requires high availability and throughput, involves many services and teams, and the environment of the recommender system changes every second. For example, new members or new items may come to the service at any time. New code and new ML models get deployed to production frequently. One question we need to address at Netflix is how to ensure the healthy operation of our recommender systems in such a dynamic environment.
In this blog post, we introduce RecSysOps, a set of best practices and lessons that we learned while operating large-scale recommendation systems at Netflix. These practices helped us keep our system healthy while 1) reducing our firefighting time, 2) focusing on innovations and 3) building trust with our stakeholders.
RecSysOps has four key components: issue detection, issue prediction, issue diagnosis and issue resolution. Next we will go over each component and share some of our learnings.
Within the four components of RecSysOps, issue detection is the most critical one because it triggers the rest of the steps. Lacking a good issue detection setup is similar to driving a car with your eyes closed.
Formally, issue detection is pretty straightforward; if something is wrong in production we need to be able to detect it quickly. However, there are endless ways for things to go wrong and many of them we may not yet have encountered. Below are some of the lessons that we learned to increase our coverage in detecting issues.
The very first step is to incorporate all the known best practices from related disciplines. Because creating recommendation systems involves software engineering and machine learning, this includes all DevOps and MLOps practices such as unit testing, integration testing, continuous integration, checks on data volume and checks on model metrics. Fortunately, there are many available resources, such as the ML Test Score paper by Breck et al. (2017), with checklists that can be used to examine an ML system and identify its gaps.
The second step is to monitor the system end-to-end from your own perspective. In a large-scale recommendation system many teams are often involved, and from the perspective of an ML team we have both upstream teams (who provide data) and downstream teams (who consume the model). One thing we learned was not to rely only on partner teams’ audits. It is best to audit (from our own perspective) all inputs and outputs, e.g. models and scores generated by those models. For example, we might be interested in a specific fraction of the data generated by the upstream team, and changes in this fraction may not show up in the overall audits of that team. As another example, if a team is interested in consuming our model and making predictions, we can work with them to log details of the model predictions, e.g. features, for a fraction of traffic and audit them from our own perspective. This type of end-to-end audit helped us find many issues both downstream and upstream, especially at the start of new projects involving new data sources.
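A minimal sketch of the first example above, assuming upstream records arrive as dictionaries and that the slice we care about is identified by a single field. The field name `country`, the function names and the 20% tolerance are all hypothetical choices made for illustration:

```python
from collections import Counter


def slice_fractions(records, key):
    """Return the share of records that fall into each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def audit_slice(baseline, current, slice_value, key="country", max_rel_change=0.2):
    """Compare the slice's share in `current` against `baseline`.

    Returns (flagged, baseline_share, current_share). The slice is flagged
    when its share moved by more than `max_rel_change` relative to the
    baseline, even if the total data volume is unchanged.
    """
    base = slice_fractions(baseline, key).get(slice_value, 0.0)
    cur = slice_fractions(current, key).get(slice_value, 0.0)
    if base == 0.0:
        # No baseline to compare against: flag only if the slice newly appeared.
        return cur > 0.0, base, cur
    rel_change = abs(cur - base) / base
    return rel_change > max_rel_change, base, cur


# Total volume is identical (10 records), so a volume check would pass,
# but the share of the "BR" slice has halved.
baseline = [{"country": "BR"}] * 2 + [{"country": "US"}] * 8
current = [{"country": "BR"}] * 1 + [{"country": "US"}] * 9
flagged, base_share, cur_share = audit_slice(baseline, current, "BR")
```

The point of the sketch is that the check is defined on the slice the consuming team cares about, not on the aggregate that the producing team is likely to monitor.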
The third step for getting comprehensive coverage is to understand your stakeholders’ concerns. This is the best way to increase the coverage of the issue detection component. In the context of our recommender systems we have two major perspectives: our members and items.
- From the member perspective, the problem is pretty straightforward. If a member chooses an item that was not ranked high by the serving ranking model, it is a potential issue. Thus, monitoring and analyzing these cases is important for identifying problems, and these cases are also a great source of inspiration for future innovations.
- From the items’ perspective we need to make sure to engage with teams responsible for items and understand their concerns. In the case of Netflix, these teams indicated concerns about proper item cold-starting and potential production bias. These are both active research areas in the RecSys community, but to start with we helped those teams define metrics around their concerns and build tools to monitor them. We also helped them provide insight into whether or not those problems were occurring on a per-item basis. We later integrated those tools directly into our issue detection component. This enabled us to 1) expand the issue detection coverage and 2) proactively address key issues related to items and build trust with our stakeholders.
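The member-perspective check above can be sketched as follows, assuming each logged impression records the ranked list the model served and the item the member actually chose. The input shape, the function name and the rank threshold are hypothetical:

```python
def low_rank_selections(impressions, rank_threshold=10):
    """Find selections that the serving model did not rank highly.

    `impressions` is an iterable of (ranked_item_ids, chosen_item_id) pairs.
    Returns (rate, flagged), where `flagged` lists (chosen_item_id, rank)
    for every choice ranked below `rank_threshold`, and rank is None when
    the chosen item was missing from the ranked list altogether.
    """
    flagged = []
    total = 0
    for ranked, chosen in impressions:
        total += 1
        try:
            rank = ranked.index(chosen) + 1  # 1-based position in the list
        except ValueError:
            rank = None
        if rank is None or rank > rank_threshold:
            flagged.append((chosen, rank))
    rate = len(flagged) / total if total else 0.0
    return rate, flagged


impressions = [
    (["a", "b", "c"], "a"),  # chosen item was ranked first
    (["a", "b", "c"], "c"),  # chosen item was ranked third
    (["a", "b", "c"], "z"),  # chosen item was not ranked at all
]
rate, flagged = low_rank_selections(impressions, rank_threshold=2)
```

Tracking the rate over time gives a single number to alert on, while the flagged list gives the individual cases to analyze.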
Implement all the known best practices
Monitor the system end-to-end your own way
Understand your stakeholders’ concerns
Detecting production issues quickly is great, but it is even better if we can predict those issues and fix them before they are in production. As an example, proper cold-starting of an item (e.g. a new movie, show, or game) is important at Netflix because each item only launches once. So we wondered if we could predict whether an item is going to have a cold-start issue before its launch date. This requires predicting the predictions of our future production model, which is challenging. Using historical data points we built a model that could predict the behavioral statistics of our production model on the day of launch. This model enables us to catch potential issues related to cold-starting of items a week or more in advance, which leaves us time to fix the issue before the item comes to the service.
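The post does not describe the prediction model itself, so the following is only a toy stand-in for the workflow: fit a model on historical launches, predict a launch-day statistic for upcoming items, and flag the ones that fall below an expected level. The single pre-launch feature, the one-variable least-squares fit and the threshold are all assumptions made for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def items_at_risk(upcoming, slope, intercept, min_expected):
    """Return ids of upcoming items whose predicted statistic is too low.

    `upcoming` maps item id -> pre-launch feature value (hypothetical).
    """
    return [
        item_id
        for item_id, x in upcoming.items()
        if slope * x + intercept < min_expected
    ]


# Historical launches: pre-launch feature vs. the statistic observed from
# the production model on launch day (made-up numbers).
hist_feature = [1.0, 2.0, 3.0, 4.0]
hist_launch_stat = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_line(hist_feature, hist_launch_stat)

upcoming = {"new_a": 0.5, "new_b": 5.0}
at_risk = items_at_risk(upcoming, slope, intercept, min_expected=3.0)
```

A real system would use many features and a richer model, but the shape of the check stays the same: the alert fires on a predicted value, before any production traffic exists.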
Try to predict issues before they happen instead of detecting them after they hit production
Once an issue is identified with either detection or prediction models, the next phase is to find its root cause. The first step in this process is to reproduce the issue in isolation. However, large-scale recommender systems are very dynamic and we may not be able to reproduce the issue by simply re-running the code, e.g. because the underlying input data or feature values might have changed. Therefore, to reproduce the issue we need to set up proper logging in advance. This includes logging of item candidates, context, features, serving model id or anything else that is needed to reproduce the issue. To reduce the cost, this information is logged for only a fraction of the traffic, so we need to carefully design a sampling method that has sufficient coverage of all important slices of the traffic.
The next step after reproducing the issue is to figure out if the issue is related to the inputs of the ML model or to the model itself. To understand if the issue is related to the input data we need to verify that the inputs are accurate and valid. While some feature values can be traced back to their original facts and verified, many features involve complex data processing or are themselves the outputs of machine learning models. Thus, in practice validating the input values is a challenging problem.
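One possible way to sample traffic for logging while still covering small but important slices is deterministic, per-slice sampling. The sketch below assumes each request has a string id and a slice label. The slice names, the rates and the record fields are hypothetical, with the fields mirroring the list in the paragraph above:

```python
import hashlib


def should_log(request_id, slice_name, rates, default_rate=0.01):
    """Decide deterministically whether to log this request.

    Each slice has its own sampling rate, so rare but important slices can
    be logged more densely than the bulk of the traffic. Hashing the request
    id means every service makes the same decision for the same request.
    """
    rate = rates.get(slice_name, default_rate)
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def build_log_record(request_id, model_id, context, candidates, features):
    """Bundle what is needed to replay the request against a given model."""
    return {
        "request_id": request_id,
        "model_id": model_id,
        "context": context,
        "candidates": candidates,
        "features": features,
    }


# Log every request for newly launched items, 1% of everything else.
rates = {"new_item": 1.0}
if should_log("req-123", "new_item", rates):
    record = build_log_record(
        "req-123",
        model_id="ranker-v7",
        context={"device": "tv"},
        candidates=["item_1", "item_2"],
        features={"item_1": [0.1, 0.9], "item_2": [0.4, 0.2]},
    )
```

Comparing integers rather than floats keeps the edge cases exact: a rate of 1.0 always logs and a rate of 0.0 never does.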
A simple solution is to compare the feature values with the corresponding values of a comparable item or member. This enables us to determine if a feature value is within the expected range. While simple, this method is highly effective for flagging anomalies. For example, in one case this method flagged abnormal values related to language features of an item. Upon further investigation we found that the language of that item was not configured properly in the upstream database.
If all input features are right, the next step is to dig deep inside the ML model and its training data to find the root cause of the issue. There are many tools for model inspection as well as model interpretation, e.g. SHAP (Lundberg and Lee, 2017) and LIME (Ribeiro et al., 2016). Based on the architecture of the model, we can also develop custom tools to check for expected properties, for example by visualizing nodes of decision trees or layers of a neural network. This type of analysis once helped us identify a bug in handling missing values, and in another case helped us identify a faulty segment of training data.
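A minimal sketch of this kind of comparison, assuming numeric features stored as dictionaries and a set of comparable items chosen elsewhere. The robust score based on the median and the median absolute deviation is one possible definition of "expected range", not necessarily the one used in practice, and the feature names are made up:

```python
from statistics import median


def flag_abnormal_features(target, peers, threshold=5.0, min_peers=3):
    """Flag features of `target` that sit far from the peers' typical value.

    For each numeric feature, the distance from the peer median is measured
    in units of the median absolute deviation (MAD). Returns a dict mapping
    feature name -> (target_value, peer_median) for the flagged features.
    """
    flagged = {}
    for name, value in target.items():
        peer_values = [p[name] for p in peers if name in p]
        if len(peer_values) < min_peers:
            continue  # not enough comparable items to judge this feature
        med = median(peer_values)
        mad = median([abs(v - med) for v in peer_values])
        if mad == 0:
            # All peers agree exactly, so any deviation is suspicious.
            if value != med:
                flagged[name] = (value, med)
        elif abs(value - med) / mad > threshold:
            flagged[name] = (value, med)
    return flagged


peers = [
    {"lang_match": 0.90, "runtime_min": 100},
    {"lang_match": 0.80, "runtime_min": 110},
    {"lang_match": 0.85, "runtime_min": 95},
    {"lang_match": 0.95, "runtime_min": 105},
]
target = {"lang_match": 0.0, "runtime_min": 102}
abnormal = flag_abnormal_features(target, peers)
```

Using the median and MAD rather than the mean and standard deviation keeps a single broken peer from widening the expected range.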
Set up logging to reproduce issue
Develop tools to check validity of inputs
Develop tools to inspect internal components of ML models
Once the root cause of an issue is identified, the next step is to fix the issue. This part is similar to typical software engineering: we can have a short-term hotfix or a long-term solution. However, applying a hotfix for an ML model is challenging. This is because these models are highly optimized, can take a while to train, and any manual alteration will likely result in suboptimal recommendations. So how can we hotfix the problem while minimizing the cost to the rest of the ecosystem? The solution requires domain insight and creativity that highly depends on the product, platform and stakeholders. Since each hotfix has its own trade-offs, it is better to have a menu of hotfix solutions prepared ahead of time. This enables us to quickly choose and deploy the most appropriate one for each situation.
Beyond fixing the issue, another phase of issue resolution is improving RecSysOps itself. For example:
- Is it possible to detect the issue faster, or maybe even predict it?
- Is it possible to improve our tools to diagnose the issue faster?
Finally, it is important to make RecSysOps as frictionless as possible. This makes the operations smooth and the system more reliable. For example:
- Make sure that checks in detection or prediction components are running on a regular automated basis.
- If human judgment is needed at some step, e.g. diagnosis or resolution, make sure that person has all required information ready. This will enable them to make informed decisions quickly.
- Make sure that deploying a hotfix is as simple as a couple of clicks.
Have a collection of hotfix strategies ready and quantify the trade-off associated with each one
With every incident make RecSysOps better
Make RecSysOps as frictionless as possible
In this blog post we introduced RecSysOps, a set of best practices and lessons that we’ve learned at Netflix. RecSysOps consists of four components: issue detection, issue prediction, issue diagnosis and issue resolution. We think these patterns are useful to consider for anyone operating a real-world recommendation system to keep it performing well and improve it over time. Developing such components for a large-scale recommendation system is an iterative process with its own challenges and opportunities for future work. For example, different kinds of models may be needed for doing issue detection and prediction. For issue diagnosis and resolution, a deeper understanding of ML architectures and design assumptions is needed. Overall, putting these aspects together has helped us significantly reduce issues, increase trust with our stakeholders, and focus on innovation.
We are always looking for more people to join our team. If you are interested, make sure to take a look at our jobs page.
- Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley. 2017. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. In Proceedings of IEEE Big Data.
- Scott M. Lundberg and Su-In Lee. 2017. A Unified Approach to Interpreting Model Predictions. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 4765–4774.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 1135–1144.