July 29, 2024
In the machine learning (ML) and artificial intelligence (AI) domain, managing, tracking, and visualizing model…
When it comes to evaluating ML models, there’s debate about which metrics are best to check and optimize for. There’s always another F1 or mAP score to chase. There’s also a healthy debate about how metrics should be tailored to their respective use cases. That debate exists because the real world is complex: we strive to get the best out of ML so that it delivers great end-user experiences and the business ROI our stakeholders are looking for.
While measuring the performance of the model is a core activity, as a community, we don’t have it all figured out yet. That’s okay. We move ML forward by working together.
The typical model evaluation toolkit involves looking at a variety of metrics. Depending on the project and use case, some metrics will be more relevant than others. A truly rigorous evaluation will also verify good performance on unseen data, check for overfitting and underfitting, describe the complete performance of a model, and set up data drift detection.
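To make that concrete, here is a minimal sketch of such an offline evaluation. It assumes a scikit-learn style classifier on synthetic data; the model, dataset, and metric choices are illustrative, not a prescription.

```python
# A minimal sketch of a typical offline evaluation (illustrative names only).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Hold out unseen data so the reported numbers are not inflated by overfitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

# Compare train vs. test scores: a large gap hints at overfitting,
# uniformly poor scores at underfitting.
for split, (X_s, y_s) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
    preds = model.predict(X_s)
    probs = model.predict_proba(X_s)[:, 1]
    print(
        f"{split}: accuracy={accuracy_score(y_s, preds):.3f} "
        f"f1={f1_score(y_s, preds):.3f} auc={roc_auc_score(y_s, probs):.3f}"
    )
```

Even a small harness like this makes the limits of single-number reporting visible: two models with the same test accuracy can behave very differently once you look at more than one score and more than one split.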
What’s still missing from this is a rounded evaluation. As we are all painfully aware, the metrics rarely tell the whole story. No single number will help us catch silent failures, avoid racial bias, or reveal all the intricacies of data or concept drift. Moreover, metrics computed on your test set may seriously overestimate the performance of your model in the real world: researchers in NLP found that state-of-the-art models with “human performance” actually fail at very simple NLP tasks. If you only knew their accuracy, you would significantly misjudge a model’s ability to generalize and to avoid producing harmful responses.
Of all ML systems in production, recommender systems are arguably some of the most impactful. They help us navigate most aspects of our digital life, from which movies to watch and which books to read, to which shoes to buy for that special handbag and which news articles to open. How can we be sure (or “more sure”) that recommender systems in production generalize properly?
This is where RecList comes in. It’s an open source library with plug-and-play test cases and datasets that make it easy to scale up behavioral testing. Behavioral testing is not new, but this project provides another great tool for your model evaluation toolbox. It allows anyone to test their models against a wide variety of metrics, providing a more holistic evaluation of model performance. It’s designed for recommender systems, with ready-made connectors for popular datasets in the field. In the future, it could be applied to other types of models as well. How cool is that?
In a nutshell, RecList won’t tell you whether model A or model B is better (that’s for you to say), but it will remove the repetitive boilerplate code, helping you quickly compare and debug models from a variety of perspectives. For example, does your model treat genders equally? Is it robust to small perturbations?
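To illustrate the idea behind such a behavioral test, here is a small, hypothetical sketch in plain Python. It is not the RecList API: the toy recommender and the perturbation check are assumptions made only to show the pattern of asserting an expectation about model behavior rather than computing a single aggregate score.

```python
# A hypothetical behavioral test: shuffling the order of a user's history
# should not change which items get recommended. Names are illustrative,
# not the RecList API.
import random

def recommend(user_history, catalog, k=5):
    # Toy stand-in for a real recommender: returns the first k unseen items.
    return [item for item in catalog if item not in user_history][:k]

def perturbation_test(user_history, catalog, k=5, trials=20, seed=0):
    """Return the fraction of shuffled histories that yield the same
    recommendation set as the unperturbed history."""
    rng = random.Random(seed)
    baseline = set(recommend(user_history, catalog, k))
    stable = 0
    for _ in range(trials):
        perturbed = user_history[:]
        rng.shuffle(perturbed)
        if set(recommend(perturbed, catalog, k)) == baseline:
            stable += 1
    return stable / trials

catalog = [f"item_{i}" for i in range(100)]
history = ["item_3", "item_7", "item_42"]
print(f"stability under shuffling: {perturbation_test(history, catalog):.0%}")
```

A library like RecList packages checks of this kind, along with dataset connectors, so you don’t have to rewrite the scaffolding for every model you want to compare.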
Jacopo and his colleagues bring deep expertise in building recommender systems and putting them into production. When we asked Jacopo why they’re building RecList, he said:
Everybody agrees that behavioral testing is useful, but then in practice it is just hard to do it well, so in the best case you end up writing lots of ad-hoc, untested code for error analysis and debugging, in the worst, you just don’t do it and hope for the best. We didn’t set out to write “yet another package”, but we couldn’t find anything that was good enough for our B2B scenario, with hundreds of models in production; so we started RecList as a fully open source tool, and summarized our findings for the academic and industry community.
The open source approach means that Jacopo and the team need support. That’s why Comet is excited to sponsor RecList and support the development of a beta version of the library, with a focus on ease of use.
Comet’s VP of Strategic Projects, Niko Laskaris, shared: “When we first met Jacopo, we knew he was up to great things, and we’re excited to support him in these endeavors.”
Jacopo added, “I’m so moved by the positive response in the MLOps community, and I’m proud of Comet’s support and excited to connect RecList with the platform!”