Scaling ML Operations for a Multi-Sided Retail Marketplace: How Shipt Leverages Comet
This case study focuses on Shipt, a well-known grocery delivery service that uses Comet to efficiently scale machine learning (ML) operations for its multi-sided retail marketplace. Comet plays a crucial role in Shipt’s hybrid, non-monolithic approach to its ML platform: by using Comet’s Model Registry to log and promote trained models, Shipt can focus on building certain tools in-house while relying on Comet for specialized features. This allows the team to prioritize its resources and make its process more robust and efficient.
A transcript is available below if you prefer to read through the interview.
Adam Hendel: I’m Adam Hendel. I’m a principal machine learning engineer at Shipt. Primarily right now we work on our machine learning platform.
Gideon Mendels: Awesome.
Anuradha Uduwage: I’m Anuradha Uduwage and I’m director of machine learning engineering and leading the machine learning platform team, uh, at Shipt.
Gideon Mendels: Awesome, so you know, one of the things I wanna spend, I definitely wanna dig in and hear more about your platform, but maybe on a high level, what are some of, kind of, the areas that, your internal customers, and Shipt in general, what are some of the problems you’re solving with machine learning?
Anuradha Uduwage: There are two sides to this problem. One is training the models. And data scientists come from multidisciplinary areas, so we want them to have the easiest possible path to do their best work, focusing on the modeling, while we take care of the rest: everything from firing up, uh, resources to, once the model is ready, serving that model at run time, within certain SLAs. So those are the two areas of challenges that we are currently solving.
Gideon Mendels: So, again, in the realm of the development side, I definitely wanna hear more about the serving, we’ll get there in a second. So, you’re using Airflow to orchestrate these jobs. How do you manage, like, you know, you mentioned Snowflake for the feature store, the offline feature store. How do you manage all the Artifacts, the outputs of the trained models, the binaries, and, really, you know, the experimentation metrics and hyperparameters? How do you guys handle all that?
Adam Hendel: Yeah, so that’s Comet. Anytime somebody is training a model, whether it’s in Airflow, so it’s more formalized, or it’s in JupyterHub, or on a local machine, we just use Comet’s Python SDK and log everything to that particular team and project’s workspace in Comet. So, it’s pretty non-invasive, you know, it’s just kind of a passive thing. It all just gets logged into Comet. And then when it’s like, hey, you have that one model that’s more ready than the others–
Gideon Mendels: Yeah.
Adam Hendel: –promote the model from the experiments into the Registry, and that all happens, like, within Comet.
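For readers who haven’t used the SDK, a minimal sketch of the logging pattern Adam describes might look like the following. The workspace, project, metric, and model names are hypothetical, not Shipt’s actual setup; the same few calls work from an Airflow task, a JupyterHub notebook, or a local script, and promotion into the Registry can then happen from the Comet UI once a run looks good.

```python
# Hypothetical training script: the same calls work whether it runs
# from an Airflow task, a JupyterHub notebook, or a local machine.
from comet_ml import Experiment  # import comet_ml before the ML framework

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Workspace and project names are illustrative, not Shipt's real ones;
# the API key is read from the COMET_API_KEY environment variable.
experiment = Experiment(workspace="my-team", project_name="eta-prediction")

X, y = make_regression(n_samples=1_000, n_features=10, random_state=0)
params = {"learning_rate": 0.05, "n_estimators": 200}
model = GradientBoostingRegressor(**params).fit(X, y)

# Passively log hyperparameters, metrics, and the trained binary.
experiment.log_parameters(params)
experiment.log_metric("train_r2", model.score(X, y))
joblib.dump(model, "model.pkl")
experiment.log_model("eta-model", "model.pkl")
experiment.end()
```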
Gideon Mendels: So, within Comet, a data scientist, after they’re kind of finished with their first stage of research and they feel like they have a model they want to promote, just uses the Comet UI to put it in the Registry, and then, you know, what happens after that? So, I’m assuming you have lots of models in your Registry. What does it look like from a platform team? How do you manage that?
Anuradha Uduwage: Yeah, so we have a staging Comet instance–
Gideon Mendels: Okay.
Anuradha Uduwage: –and then we have a production instance. So, during the model training runs, they can do these experiments on staging and just log everything. And once they are ready, we promote it, retrain the model in the production environment, and send it to the Comet production instance. That’s where the models actually get served for production use.
Gideon Mendels: So, how do you, as a platform team, how do you think, kind of, like, on a more high level about creating this, like, consistent, but also, very flexible experience for your data scientists?
Anuradha Uduwage: Yeah, that’s the work that we kicked off as like the, our next version of our platform–
Gideon Mendels: Yeah.
Anuradha Uduwage: –because what we want to do is go from project creation, like inception level, all the way to deployment. We want a consistent codebase structure and deployment structure. So, even when the data scientists are doing code reviews, it’s not foreign to them, right? They all follow a similar structure. So, that’s like the challenge you were talking about–
Gideon Mendels: Yeah.
Anuradha Uduwage: –like how you provide a consistent platform. And also, we want to make sure their experimentation work is captured, right, because that’s the key part of understanding what you did, how you can rely on what you did, and how you incrementally improve. So, Comet makes it much easier for us to carry that work. By exposing it through our platform packages, we allow data scientists to track that experimentation and make sure that what they put out in production behaves the way they’ve already seen it behave through their various experiments.
Gideon Mendels: You guys mentioned, kind of, your serving component. Can you walk me through, like, really high-level, what does that look like? ‘Cause I know that’s something the team has spent so much effort on and it’s working really well. So, I’m sure a lot of people would be excited to hear more.
Adam Hendel: Yeah. So, Anuradha mentioned our, you know, robust CI/CD system.
Gideon Mendels: Okay.
Adam Hendel: It’s for deploying microservices on Kubernetes. It’s great. Our ML platform team didn’t build it, our DevOps team did. So we built our platform on top of that.
Gideon Mendels: Okay.
Adam Hendel: And what that amounts to is we need to build a container. So we’ll build a Docker container, usually using FastAPI if we’re looking for, like, a REST interface, and, you know, during the container build, we might pull the model in from Comet’s Registry at that point. So, when we ultimately get this Artifact, it’s a container, it has a web server and a model in it, and it can be deployed and horizontally scaled on Kubernetes. And it’s great. Or instead of something like FastAPI, it could be something like a Kafka consumer. We have an internal library that we open-sourced, so I guess it’s not an internal library anymore. So we have this library we wrote called py-volley.
Gideon Mendels: py-volley? Okay, is that on GitHub?
Adam Hendel: It’s on GitHub, yep.
Gideon Mendels: Awesome.
Adam Hendel: It’s a lot like, the experience developing in it is a lot like FastAPI, except it consumes from a queue.
Gideon Mendels: Okay.
Adam Hendel: So we’ll deploy models in that framework as well. And it’s like, almost identical to FastAPI, but instead of waiting for a network call, it is consuming from a Kafka Topic.
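To make the serving pattern described above more concrete, here is a minimal sketch under assumed names: the workspace, registered model, version, file paths, and topic names are all hypothetical, not Shipt’s actual configuration. At container build time, a small script might pull the registered model out of Comet’s Model Registry so the image ships with both a web server and the model baked in:

```python
# download_model.py - run during `docker build` so the model is baked
# into the image (workspace, registry name, and version are made up).
from comet_ml import API

api = API()  # reads COMET_API_KEY from the environment
api.download_registry_model(
    workspace="my-team",
    registry_name="eta-model",
    version="1.0.0",
    output_path="/app/model",
)
```

The same container could then serve the model behind a FastAPI route:

```python
# app.py - minimal FastAPI wrapper around the baked-in model.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("/app/model/model.pkl")  # loaded once at startup

class Features(BaseModel):
    values: list[float]

@app.post("/predict")
def predict(features: Features) -> dict:
    return {"prediction": float(model.predict([features.values])[0])}
```

The queue-driven variant keeps the same prediction logic but feeds it from a topic instead of an HTTP call. The sketch below uses the plain confluent-kafka client purely to illustrate that pattern; it is not py-volley’s actual API.

```python
# consumer.py - illustrative queue-consumer serving loop (NOT py-volley's
# API): same prediction logic, but input arrives from a Kafka topic.
import json

import joblib
from confluent_kafka import Consumer, Producer

model = joblib.load("/app/model/model.pkl")

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # hypothetical broker address
    "group.id": "eta-model",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka:9092"})
consumer.subscribe(["eta.requests"])      # hypothetical topic names

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    payload = json.loads(msg.value())
    prediction = float(model.predict([payload["values"]])[0])
    producer.produce(
        "eta.predictions",
        json.dumps({"id": payload["id"], "prediction": prediction}),
    )
    producer.flush()
```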
Gideon Mendels: Okay. And maybe switching gears a little bit, I know you mentioned, you touched a little bit about the Comet model registry and, and the experimentation side. So on a high level, from, you know, from a strategy perspective, what kind of role does Comet play in your MLOps strategy?
Anuradha Uduwage: I mean, it plays a key role, because an ML platform is not just one monolithic thing, it’s made up of components. The Model Registry is a key part of that, because when a model gets trained, you need to put it somewhere so you can retrieve it at inference time– or in any other, uh, situation. So within the entire ML platform, the Comet Model Registry plays a very key role for us. I’ve built an ML platform at other places with a team of folks, and we didn’t have the luxury of having a Comet-like Model Registry. So we built with the resources and cloud services that were available, like using DynamoDB as a model registry and dumping the Artifacts over there. That’s a ton of work, and I only got probably one tenth of what Comet has, just so that I could keep track of the model Artifacts. So in order to get to that level within the ML platform team, you need another team.
So, that’s why it’s a key element in the ML platform, and going forward, too, right? As we increase the number of models, as we increase the experimentation on these models, it’s key to log those things and have a place we can rely on, like Comet, for measuring their performance and promoting these models into production. So it gives us a very robust process for model deployment and model training.
Gideon Mendels: Thank you so much for sharing. I’m sure, like a lot of other platform teams, at some point you had the debate: hey, what is our high-level strategy? Should we try to buy something that solves all our problems, or should we try to build everything? I mean, obviously you picked a more modular approach, but what advice would you give, you know, a platform team debating this exact topic, looking at the vendors, open source, what would you recommend they think about when making that decision?
Anuradha Uduwage: It depends.
Adam Hendel: Yeah.
Anuradha Uduwage: So I, I know that no one likes that answer because it depends–
Gideon Mendels: Yeah, it does.
Anuradha Uduwage: –it’s like a consultancy giving an answer, it depends. But it really depends on the maturity of your machine learning organization and the resources that you have. And, you don’t have to do one or the other.
Gideon Mendels: Mm-hmm.
Anuradha Uduwage: You could do a hybrid version just like we are doing, right? We didn’t build, uh, Comet in-house.
Gideon Mendels: Yeah.
Anuradha Uduwage: We use Comet, and we built our ML platform with Comet being a key piece within it. So, just like that, my advice is: every team’s needs are different. Understand the needs and start small. And then you can try to understand what the data science teams need–
Gideon Mendels: Mm-hmm
Anuradha Uduwage: –and build with them. ‘Cause if you try to just build it without them, no one will use it.
Gideon Mendels: Yeah. Awesome, guys, thank you so much for taking the time. I really enjoyed the conversation. You know, I’ve learned a lot from you guys, not just from the last conversation, but from working with you and the rest of the platform team. So it’s been a pleasure, and thanks again for coming here today.
Adam Hendel: Yeah, thank you.
Anuradha Uduwage: Thank you for having us.
Gideon Mendels: Awesome.