From Hackathon to Production: Unveiling Etsy's Image Search Revolution with Comet
In this engaging fireside chat, Gideon Mendels, CEO of Comet, Eden Dolev, Senior ML Engineer II at Etsy, and Alaa Awad, Staff ML Software Engineer at Etsy, come together to discuss the development and implementation of an innovative image search product at Etsy.
A transcript is available below if you prefer to read through the interview.
Gideon Mendels:
Hey guys, thanks for coming today. Super excited to chat with you. At Comet, we’ve been following the stuff you guys have been building at Etsy for a few years now, and I think you’re by far one of the most impressive companies when it comes to ML and your ability to deliver things to production successfully. Excited to jump in and have some of our audience learn from the amazing work you’ve been doing, but before we dive in, we’d love for you to introduce yourselves. Eden Dolev, if you want to go first.
Eden Dolev:
Yeah, I’m Eden. I’m on the ad ranking team at Etsy. I’m an engineer, been at Etsy for almost three years.
Alaa Awad:
Yeah, I’m Alaa, so thank you for having us.
Gideon Mendels:
For sure.
Alaa Awad:
I’m also an ML engineer on the ad ranking team at Etsy.
Gideon Mendels:
Awesome. I know all you guys are on the ad ranking team, and you’ve been building a lot of models and solving a lot of problems in that space. We’re actually here to talk about something that’s quite different from what you work on day-to-day. I think a lot of us saw the news that made it to TechCrunch and if I’m not mistaken to Etsy’s analyst briefing about six months ago, but you guys released a super exciting feature, Etsy Search By Image, and I know you guys spent a lot of time building it. Maybe if one of you can share a little bit, what’s Etsy Search By Image?
Alaa Awad:
Sure, yeah, we’re super excited. It’s basically a new way to search on Etsy. If there’s something that you just can’t easily describe, you can pull out your phone, open up the Etsy app, take a photo, and search for results that look similar to the image you just took.
Gideon Mendels:
Awesome, so it’s essentially a computer vision model that takes the photo from the camera and tries to identify similar products on the Etsy network? Awesome. How did the two of you, who spend most of your time building models for slightly different use cases, end up building Search By Image?
Alaa Awad:
Yeah, it actually started as part of a Hackathon at Etsy. At Etsy, we have this event called Code Mosaic, and it’s designed to discourage the bad practices in Hackathons where you end up staying up really, really late, overworking, or being too competitive. For us, it’s just a space to create your own thing and have engineers from around the globe come together and work for a week on whatever they’re most interested in. For our Code Mosaic project, we built Search By Image.
Gideon Mendels:
Amazing. This whole feature started as a Hackathon project, or Code Mosaic project? That’s insane. I guess what did that look like, right? You mentioned it’s about a week long. Was it just the two of you working on it, or were more people involved?
Eden Dolev:
No, so the process is, usually a few weeks before, there are a few meetings to talk about ideas that engineers or anyone at the company suggests. There’s a spreadsheet where you put your idea in with a short description, and then in one of those meetings you go and introduce it, which is how we pitched ours. Then we had some other folks from other, cross-functional teams join our team and express interest. And when the Code Mosaic week started, we all met for the first time and just started working on it. Yeah, so it wasn’t just us. We had a product person, an iOS engineer, an Android engineer, other backend engineers, people with some frontend capabilities. It was a really diverse, cross-functional team.
Gideon Mendels:
Awesome. How big was the team roughly working on this?
Alaa Awad:
It was like five or six. Yeah, five or six.
Gideon Mendels:
That’s a pretty small team for the amount of impact you guys made. But what I think was interesting, you mentioned you had software engineers, native developers, backend developers, product managers. Do you guys feel that had a significant impact on your ability to deliver this? What did that collaboration look like with all these different personas and people doing different things?
Alaa Awad:
Yeah, definitely. We happened to have just the right group of people to build this particular product feature. And so, it was really nice to have everybody there, because you could feed off of each other. In order to build ML into a product, it usually takes a very diverse set of skills. A lot of people focus on the modeling aspect, but actually there’s a lot of work just coordinating the pipelines, the web APIs that get called, the native application code, the services that get called, and things like that. Just having everybody there and working together really helped it along.
Gideon Mendels:
Amazing. Yeah, and I was playing with the feature earlier today, and I think last time we spoke, you guys mentioned it took about three weeks to go from idea to production. That is, I think, unheard of in the ML space, and extremely impressive, right, to be able to deliver a feature at that scale, and we’ll dive into it in a second, but a pretty complex feature as well. What do you think was the secret sauce that got you moving so fast?
Alaa Awad:
Well, to be fair, we actually did start some of the image modeling work before that week. We had a project in place already as part of our ads work, just kind of exploring images and their applications across several of our models. Usually, we work on predictive models, predicting clicks, or purchases, or things like that, and I think part of this work just sort of spawned naturally out of that. We had a little bit of a head start.
Eden Dolev:
Yeah, so we got a head start on the modeling, but for the Code Mosaic, what we actually did was improve the model a little bit and make it more suitable for the Search By Image use case, where users would take a photo with their camera, which is something we’d never done before. And then building the POC backend web API work to make it work for a demo. And then, like I said, I don’t remember the exact number, but it was two, three weeks, less than two sprints, I believe, to take it from that to experiment launch, into production.
Gideon Mendels:
That’s still super impressive. Maybe I’d love to dig in a little bit more into the modeling process and the building process. It sounded like you guys did have some models or some system in place beforehand that you were able to build upon. Maybe if you’re open to sharing a little bit about this, what are some of the things that you built in this space to help you? What models, what systems you had in place before that eventually got used or at least the knowledge got transferred for this stuff?
Eden Dolev:
Yeah, it was a little bit of a journey for us, I think, with image modeling. Until, I don’t know, a few months before, I guess, our team wasn’t really doing any vision work. The first thing we tried was to just build a simple vision model that produces some sort of an embedding. Our idea was to do some sort of visual search, mostly maybe a product search that’s based on vision, or maybe use these embeddings downstream in our other models. We started with a classification model, which is about the simplest thing you could do in modeling: classifying product images into their categories, and then extracting the last layer, the last convolutional layer, as an embedding. And that actually worked out pretty well for the amount of work that we put into it. But what we saw was that these embeddings are really good at encoding the categorical information. For example, if you did an image search on them, you would get images that are from the same category, but not necessarily similar.
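For readers who want a concrete picture, here is a minimal sketch of that proxy-classification approach in TensorFlow/Keras. The backbone (ResNet50), input size, embedding dimension, and category count are illustrative assumptions on our part, not details from the interview or Etsy’s actual setup.

```python
import tensorflow as tf

NUM_CATEGORIES = 1000  # placeholder; not Etsy's real taxonomy size

# Pretrained backbone with its classification head removed; global average
# pooling gives one vector per image.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg"
)

inputs = tf.keras.Input(shape=(224, 224, 3))
embedding = backbone(inputs)  # pooled last convolutional layer -> embedding
logits = tf.keras.layers.Dense(NUM_CATEGORIES)(embedding)

# Train on the proxy task: classify a product image into its category.
classifier = tf.keras.Model(inputs, logits)
classifier.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# After training, keep only the embedding tower for similarity search.
embedder = tf.keras.Model(inputs, embedding)
```

The classification head is discarded at inference time; only the pooled representation is kept, which is why, as Eden notes next, the embeddings tend to encode category rather than visual similarity.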
Gideon Mendels:
Mm-hmm, visually similar, yeah.
Eden Dolev:
Exactly. Yeah, not necessarily the same color, or the same shape, or the same texture. That was just by the nature of the fact that we trained on category classification, so the model was actually doing what we told it to do. That was the first iteration that we did.
Gideon Mendels:
You guys were training, you were solving a proxy classification problem only to get the embedding, is that right?
Eden Dolev:
Exactly.
Gideon Mendels:
Or you had a classification problem you actually needed to solve?
Eden Dolev:
No, it was a proxy classification, yeah.
Gideon Mendels:
Okay, okay, awesome.
Eden Dolev:
Yeah.
Gideon Mendels:
And this was pre-Code Mosaic. This is something you built because you just needed good product embeddings. It sounded like you went with that approach, and it solved the problem from one perspective, but not really from a visual perspective. How did you iterate from there?
Alaa Awad:
Yeah, so, I mean for us, like I mentioned, we typically are working on click prediction, or purchase prediction, or item similarity, things like that. And it turned out that the image itself is very important, and that visual similarity is also really important. We decided to go down a metric learning approach. A pretty common way to do this is a two-tower model where you have an encoder that produces an embedding and shares weights with another encoder, the two outputs are compared against one another, and you train the model so that it pushes items that are similar closer together and items that are dissimilar further apart.
We started out with a Siamese network and moved our way towards a triplet network. In a triplet network, you have an anchor, a positive, and a negative. For us, we used images that were of the same listing. When a seller uploads a listing on Etsy, they can upload up to, I think, 10 images. So we have this set of listings with images, and the other images are usually the same product but from different angles, things like that. We basically took advantage of that to help us train a model that would push embeddings of similar items closer together.
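As a rough illustration of that triplet setup, and not Etsy’s actual code, here is what a shared-weight encoder and triplet loss could look like in TensorFlow; the backbone, embedding size, margin, and learning rate are placeholder choices.

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor closer to the positive than to the negative by `margin`."""
    anchor = tf.math.l2_normalize(anchor, axis=-1)
    positive = tf.math.l2_normalize(positive, axis=-1)
    negative = tf.math.l2_normalize(negative, axis=-1)
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))

# One shared-weight encoder embeds anchor, positive, and negative alike.
# The (anchor, positive) pair would be two photos of the same listing;
# the negative comes from a different listing.
encoder = tf.keras.Sequential([
    tf.keras.applications.ResNet50(include_top=False, weights="imagenet", pooling="avg"),
    tf.keras.layers.Dense(128),  # placeholder embedding size
])
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(anchor_img, positive_img, negative_img):
    with tf.GradientTape() as tape:
        loss = triplet_loss(
            encoder(anchor_img, training=True),
            encoder(positive_img, training=True),
            encoder(negative_img, training=True),
        )
    grads = tape.gradient(loss, encoder.trainable_variables)
    optimizer.apply_gradients(zip(grads, encoder.trainable_variables))
    return loss
```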
Gideon Mendels:
Got it. Got it. And just so I’m understanding correctly, you’re using this metric that is, in a sense, artificial to create the embeddings. How do you eventually evaluate that your embeddings are good?
Eden Dolev:
Yeah, that’s a really hard problem when you don’t have any prior data, right? Because the usual way you would measure it is go to your historic data, see what got engaged with, what got clicked, what got purchased, what got favorited. Those are usually good proxies to know that your model is doing well, and you can kind of go back to the historic data to check that. But we didn’t have a product at the time, so it’s a lot of qualitative tests, a lot of visualizations, just checking all the examples, trying to make sure you pick a diverse set of query examples that you check with. We have categories that are very common and have many examples, and we have categories that are a little more niche and smaller, so you want to make sure you cover everything you can. And on the quantitative side, I guess we could put it in our, like Alaa Awad said, in our ranking models and our retrieval models, and go back and check the metrics there. But in terms of visual similarity, really all we could do was qualitative tests at the time.
Gideon Mendels:
Mm-hmm, yeah, 100%. I mean, it sounds like a very hard problem, especially when these embeddings are typically used for downstream tasks. Obviously with Search By Image, and we can talk a little bit more about this, there are probably more business metrics you can use to evaluate. But I’m curious, I know last time we spoke, you mentioned the evaluation being in some ways qualitative, in some ways metric driven, but again, proxy metrics. I’m curious, did Comet play a role in that evaluation? And if so, what did that look like?
Eden Dolev:
Yeah, so like I mentioned, the qualitative tests are the end-to-end thing we could do, but obviously we also rely on the training metrics. For classification, that’s accuracy and loss, of course, for all the models, and for triplet, what you get is these distance metrics. Like I said, you have the anchor example, the positive example, and the negative example, and you get the distance between the anchor and the positive, and you want to make sure that gets smaller, and on the contrary, the distance between the anchor and the negative, you want that to get further apart as you train.
Definitely, Comet was pivotal in that, just to make sure that the model is training, is learning, and getting better with time. And of course, the biggest impact is in comparing different models when you’re doing parameter tuning or trying out different approaches. This is the primary thing you’re looking at, because you can’t do a qualitative test for every model that you train, especially when you’re only relying on qualitative checks for the end-to-end test. And especially for something like hyperparameter tuning, it’s impossible to eyeball the small differences that eventually will have an impact; when you’re just looking at the results, you might not see the difference.
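As a hedged sketch of what logging those distance metrics to Comet could look like: the project name and metric names below are made up for illustration, and the embeddings would come from a training loop like the one sketched earlier.

```python
from comet_ml import Experiment
import tensorflow as tf

# Placeholder project name; the real workspace/project setup isn't described
# in the interview.
experiment = Experiment(project_name="image-embeddings")
experiment.log_parameters({"margin": 0.2, "embedding_dim": 128})

def log_triplet_distances(step, anchor_emb, positive_emb, negative_emb):
    """Log the two distances described above so Comet can chart them over training."""
    pos_dist = tf.reduce_mean(tf.norm(anchor_emb - positive_emb, axis=-1))
    neg_dist = tf.reduce_mean(tf.norm(anchor_emb - negative_emb, axis=-1))
    # Anchor-positive distance should shrink over training; anchor-negative should grow.
    experiment.log_metric("anchor_positive_distance", float(pos_dist), step=step)
    experiment.log_metric("anchor_negative_distance", float(neg_dist), step=step)
```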
Gideon Mendels:
Awesome. I’m curious. I mean, I know we’re talking about two different projects that ended up converging, no pun intended, but how many experiments would you say you guys ran in this entire period? What does it take to build a successful production feature? Obviously it’s very problem-dependent, but are we talking tens, hundreds, thousands of experiments?
Eden Dolev:
This is a test, because I think you would know better than I.
Gideon Mendels:
I haven’t checked, no.
Alaa Awad:
We were covering a lot of things. We were talking about the training, like classification versus triplet networks and things like that. But then there’s also the pre-trained backbone: we would swap in different backbone image model architectures and just try them out, try different hyperparameter tuning. We tried different sampling schemes to see if we could sample from different categories and things like that. And so, I don’t know, probably many per day. I’m not sure exactly how many, but yeah.
Eden Dolev:
I think maybe overall if you include Code Mosaic and the work before, probably hundreds at the end of the day.
Gideon Mendels:
Super impressive. Awesome. So you guys built all these backbone, embedding-generating models, and then you walked into the Hackathon, Code Mosaic, having something that you built for a different task. Walk me through a little bit of what happened next, right? It sounds like you had something in place. What did the Code Mosaic project look like based on what you already had? Were there architecture changes? Did you have to build new models, or was it just completely based on that? Maybe it would be great for the audience to understand a little bit: how does Search By Image work?
Alaa Awad:
Yeah, so right, exactly. We started this Code Mosaic with just the pitch. We pitched the idea of Search By Image to the teams, and we kind of plotted out the architecture we wanted it to have, and we also put together our own design scaffolding of what it should look like. Basically, the high-level system architecture is that the camera takes an image, the image passes through our CNN model to produce an embedding, and that embedding then gets passed along to a nearest neighbor search system that scores the similarity between the embeddings in our inventory and the embedding that came out of that image.
And then we would identify the top results and return those back to the native application to display. In addition to all those things, you also need to collect a bunch of information about each of the listings that come back, cache things. We were trying out GPU serving, things like that, so that the CNN model would be fast enough and the latency would be low enough that you could serve it without feeling like you were waiting forever. There are a lot of pieces to it, but that was where we started; this is what we wanted.
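To make the query-time flow concrete, here is a simplified sketch assuming a brute-force cosine-similarity lookup over precomputed catalog embeddings. A production system like the one described would use an approximate nearest neighbor index instead, and the function and variable names here are hypothetical.

```python
import numpy as np

# catalog_embeddings: float32 array, one L2-normalized row per listing image,
# precomputed offline. catalog_listing_ids[i] is the listing behind row i.

def search_by_image(photo, embedder, catalog_embeddings, catalog_listing_ids, k=100):
    """Embed a user photo and return the k most visually similar listings.

    Brute-force cosine similarity, purely for illustration; a real system
    would use an approximate nearest neighbor index.
    """
    query = embedder(photo[np.newaxis, ...]).numpy()[0]
    query = query / np.linalg.norm(query)

    scores = catalog_embeddings @ query              # cosine similarity per row
    top_k = np.argsort(-scores)[:k]
    return [(catalog_listing_ids[i], float(scores[i])) for i in top_k]
```

The key design point from the interview is the split: catalog embeddings are computed offline, so only the user’s photo is embedded at request time, which is what keeps latency manageable.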
Gideon Mendels:
Super impressive. So you’re pre-computing the embeddings for the existing catalog, and only the image the user submitted is embedded in real time?
Eden Dolev:
Exactly. Yeah.
Gideon Mendels:
Awesome. No, it’s just super impressive. I mean, just the fact that you guys all got all of that done. Did you mostly use the … Because I know last time we spoke, you used some of the knowledge, but you had to build a lot of the model from scratch again, right, because it’s a slightly different use case. Maybe walk me through that process.
Eden Dolev:
Yeah, so actually, one step we took going back to the previous work: the triplet model that we tried worked a lot better in the sense that it returned images of items that are maybe the same color, or the same shade, or the same material; the results just look a lot more cohesive with the image you’re searching with. But now we saw that maybe the categorical accuracy degraded, and some examples didn’t return things that are necessarily the same item. Maybe you’d search for a green chair and then get a green, similar-looking table or something like that. And that also took us away from what we wanted, because we want things to be relevant, obviously, beyond being similar. So we moved back to a classification approach. And by the way, one of the other reasons was the metrics.
If we’re talking about Comet, these distance metrics that we were talking about before are a little harder to reason about compared to classification metrics. Before, we could say our model is doing X on retrieving the right category. Now we have this kind of murky, abstract distance metric that can tell you comparatively which model is better, but not specifically how well it does the task. So we went back to a classification approach, but this time we did multitask classification. Instead of classifying only on the category, we now tried to include more visual categories, or visual attributes: color, shape, material. And we saw that that kind of brings the best of both worlds; it does much better on visual cohesiveness and on categorical accuracy as well. And we get all these nice metrics. We can say our model is doing better at retrieving the same category, the same color. And that brings us back to Code Mosaic and Search By Image.
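The multitask idea could look roughly like the following in Keras: a shared backbone with separate heads for category, color, and material. The head sizes and backbone are illustrative placeholders, not details from Etsy.

```python
import tensorflow as tf

# Head sizes are illustrative placeholders, not Etsy's real attribute vocabularies.
NUM_CATEGORIES, NUM_COLORS, NUM_MATERIALS = 1000, 20, 50

backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg"
)

inputs = tf.keras.Input(shape=(224, 224, 3))
embedding = backbone(inputs)  # shared representation, reused later as the search embedding

outputs = {
    "category": tf.keras.layers.Dense(NUM_CATEGORIES, name="category")(embedding),
    "color": tf.keras.layers.Dense(NUM_COLORS, name="color")(embedding),
    "material": tf.keras.layers.Dense(NUM_MATERIALS, name="material")(embedding),
}

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer="adam",
    loss={name: tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
          for name in outputs},
    metrics={name: ["accuracy"] for name in outputs},
)
```

Because the heads share one embedding, that embedding is pushed to encode category and visual attributes at once, which matches the “best of both worlds” behavior described above.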
What’s different here, like we said before, is we’re not doing product-to-product search. We’re doing user photo to product search. What we thought about here was: how can we add the domain of user-taken images to our model? One nice feature about Etsy is you can upload a review of an item that you purchased, and you can add a photo of the item. What you get, basically, is another image that’s usually taken with the buyer’s mobile phone camera, and it’s of the same item that’s depicted in the product images.
Basically, what we did was add to this multitask dataset: other than classifying product photos into categories, shapes, colors, et cetera, we also now classify the user-taken review photo into its category or its color. And we saw, again in the qualitative tests, that this really does a lot better on this use case of taking the photos with the user’s camera.
And this also kind of added another evaluation domain for us, because we can now do a qualitative test from review photo to listing photo. And it’s not only a qualitative test, because for all these things, listing to listing and review photo to listing, we also added … We built this with TensorFlow, so we added a callback, and after every training epoch it logs these metrics to Comet. We take a few thousand images, be it product photos or review photos, and then we build a small index of the product photos, and we do a small retrieval test during training time. Then we can see how that improves by looking at the Comet charts while we’re training the model. And that really helps us select the best model.
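Here is a hedged sketch of what such an epoch-end retrieval callback might look like, assuming a comet_ml Experiment and a paired evaluation set where each review photo corresponds to one listing photo; the metric name and recall@k formulation are our own illustrative choices, not Etsy’s.

```python
import numpy as np
import tensorflow as tf

class RetrievalEvalCallback(tf.keras.callbacks.Callback):
    """After each epoch, run a small retrieval test and log the result to Comet.

    `experiment` is a comet_ml.Experiment; `listing_images` / `review_images`
    are small fixed evaluation sets where review_images[i] depicts listing i.
    """

    def __init__(self, experiment, embedder, listing_images, review_images, k=10):
        super().__init__()
        self.experiment = experiment
        self.embedder = embedder          # embedding tower, e.g. the shared backbone above
        self.listing_images = listing_images
        self.review_images = review_images
        self.k = k

    def on_epoch_end(self, epoch, logs=None):
        index = self._normalize(self.embedder.predict(self.listing_images, verbose=0))
        queries = self._normalize(self.embedder.predict(self.review_images, verbose=0))

        # recall@k: did each review photo retrieve its own listing photo?
        scores = queries @ index.T
        top_k = np.argsort(-scores, axis=1)[:, : self.k]
        hits = np.mean([i in top_k[i] for i in range(len(queries))])
        self.experiment.log_metric("review_to_listing_recall_at_k", hits, epoch=epoch)

    @staticmethod
    def _normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
```

Passing a callback like this to `model.fit(..., callbacks=[...])` is what lets the retrieval quality show up as a per-epoch chart alongside the usual classification metrics.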
Gideon Mendels:
And all of this happened in those three weeks we were talking about within Code Mosaic, right?
Eden Dolev:
Yeah, Code Mosaic is just one week.
Gideon Mendels:
Oh, wow. Okay.
Eden Dolev:
Yeah, so some of this happened a little bit before, for our ads product, but all this review photo stuff happened in that one Code Mosaic week.
Gideon Mendels:
You landed with essentially a different model than what you previously built for your ranking embeddings, is that right?
Alaa Awad:
Exactly. Yeah, we went back to classification, away from the metric learning approach. It was just much easier to evaluate, it turns out, and you’re able to mix those heterogeneous datasets in, so it matches the use case where you have photos coming from different angles, grainy or whatever, alongside crisp product photos. You have these two different domains of images that you need to compare against each other.
Gideon Mendels:
I’m curious, from a reproducibility perspective, because I know that’s an approach you validated before and, like you said, came back to. How hard was it to come back to that code base where you had that model architecture, given that, obviously, you’re using it for a slightly different use case?
Eden Dolev:
Yeah, not too bad. I think we have a structure where we’re able to easily add new models, and the abstractions that are shared between them make it easy to remove that boilerplate logic, so if you want to add a new model, it’s not too bad.
Gideon Mendels:
Awesome. And so, you ended up with essentially a new model. Obviously, you had a lot of knowledge in this space because you’d built similar things, but it sounded like the entire pipeline, everything, was eventually completely new, right? And all of that, or the vast majority of it, was built within three weeks, is that right?
Eden Dolev:
Right. Yeah. I think at some point we had to get really serious about this being a forward-facing product feature, and start thinking a little bit more about making sure that services are up, that we’re monitoring everything, and that there’s observability around it. There’s a lot more that went into it after that initial week. Then we continued to iterate on the model a bit more, but also tried to pick up aspects of the product feature that would be helpful.
Gideon Mendels:
I’m curious, I mean, you guys have a lot of experience in the space, and sometimes we meet machine learning teams that are working on a project for sometimes years without being able to get a model to production. And sometimes you hear stories of people getting it in a few weeks. What do you think makes … What’s the difference, or if you are working or speaking to a team that’s having more challenges on that front, what would you say they have to focus on to be able to move faster and get models faster to production?
Eden Dolev:
Yeah, that’s a good question. I think what really helped us here is that we had that diverse skillset.
Gideon Mendels:
Mm-hmm, on the team, yeah.
Alaa Awad:
I think a lot of the time, particularly for data science projects, teams tend to focus on the research aspect of it, but there’s so much more that goes into getting it into a product feature. When you’re dealing with the applied side of things, it’s really helpful to have product people, not just the engineering side; having design, and product, and leadership on board helps move things along much more quickly. But also, we sort of had that introductory period, those few months where we just got to explore, and there wasn’t necessarily one specific thing we had to deliver, but we got to go through a lot of different iterations and end up on something that really worked well. So also the exploration aspect was-
Eden Dolev:
Very helpful.
Gideon Mendels:
Awesome. Now, I think with all this conversation, everyone probably assumes Search By Image won the Hackathon. But I know when we were speaking before, you actually mentioned that you guys didn’t win Code Mosaic, but for some reason, it still made it, got major buy-in within the organization, and made it to production very quickly. And like I mentioned, I think your CEO mentioned it on the earnings call with analysts right after it came out. What happened there? What was the story? How did you guys get all this traction internally?
Alaa Awad:
Yeah, I mean it was … Yeah, I mean, we’re not super competitive. I think we were just happy to be working on the project at the time. And so, yeah, it was okay that we didn’t win the Code Mosaic challenge or anything like that. But, I think we were lucky that we had the right group together already working, and it made sense. And basically, the product people on the team, and the engineering manager on the native application team decided that this was a really good project for them to add to their roadmap.
And so, we had the traction already because we had a working demo, and we knew what it would take to actually get this into the product, and it was a small enough leap that we thought we could do it really quickly and still get our regular work done. And so, really, the challenge for us was just understanding what would happen after this takes place. We wanted to make sure that when we were done with this, there would be a team that could take it over, a team that would be monitoring and keeping the product alive. And so, we worked with our computer vision team to hand over this product feature.
Gideon Mendels:
Usually, I see the team that builds the model is the one that owns it. Is that the right way to think about it typically?
Alaa Awad:
Usually.
Eden Dolev:
Yeah.
Gideon Mendels:
But in this case, because you guys are typically focused more on the ad side, you eventually handed it off to another team, is that right?
Eden Dolev:
Yeah, that’s outside of our direct domain, and yeah, we maintained it for a little while, but then we handed it off to the right product team too.
Gideon Mendels:
But it sounds like I’m sure there was some leadership buy-in to allow you to continue working on it, even though eventually you handed it off, which I think is very impressive, because some organizations are very focused on KPIs and what you’re supposed to work on. Is that something that’s common in Etsy’s culture, to allow people to experiment and potentially work on some different things as well?
Alaa Awad:
Yeah, I think Etsy’s a bottom up sort of culture. And so, I think also Etsy has that kind of craft maker mentality, and that permeates into the engineering culture as well. And so, we definitely, I think teams at Etsy feel empowered to come up with ideas and build things. Code Mosaic is just one example of where teams can do that. But, in general, anyone can pitch an idea. There’s a great architecture group that could help spawn new ideas and things like that. And so yeah, I think it’s just part of the culture that you’re able to build things.
Eden Dolev:
Yeah, and both the senior leadership and your direct leadership managers, product managers are usually very supportive of that, and they gave us the time to work on it. They recognized the impact that this could make, and they not only let us work on it, they helped us promote it, talked to the stakeholders, and yeah, like Alaa Awad said, it’s a very supportive and innovative kind of culture that we have.
Gideon Mendels:
Super impressive. Yeah, I know some companies have that more naturally where the team that builds the model is not the one that maintains it, which is why I asked before. And typically for them, the handoff is very painful. I’m curious for you guys, because you handed off sounds like to the computer vision team, which makes sense, given it’s a computer vision model, how did that handoff look for you guys? Are there any kind of major hurdles you had to go through to make it successful?
Alaa Awad:
It wasn’t too bad. Yeah, I think we met with them a few times, walked them through the architecture, made sure that they felt comfortable, because there’s a service that you have to keep up, and monitor, and alert on, things like that. I think for them it was a really natural fit, so it was not too bad since it’s all in their domain, and I think they were really excited, also, to take it on and develop it.
Gideon Mendels:
That’s awesome. And is this a type of model that you plan to retrain or typically, it’s … Or is it retrained for some of the upstream tasks? What does maintenance look like other than keeping the service running?
Eden Dolev:
Yeah, this particular model, actually, especially when we maintained it, was like a frozen model you don’t retrain, because the domain of the images doesn’t really change all that much. What they intend to do on their roadmap, I don’t know; maybe there is a benefit in retraining. For us, we currently don’t retrain the foundational image model that we use, but we definitely iterate on it, improve on it, and have plans there.
Gideon Mendels:
Awesome. Cool. Super exciting. I think we covered a lot. I’m curious, I have a few questions that are slightly unrelated to Search By Image. Obviously there’s a lot going on in the world of machine learning and AI, and people who probably didn’t know anything about it now know all the different models, and versions, and such. I’m curious, I think some people think that generative AI, or models like ChatGPT, for example, will replace a lot of modeling tasks, just because you have one big model that does everything. I spoke to a company that is solving a simple text classification problem, and instead of training a supervised model, they’re engineering a prompt to have ChatGPT give them the answer, which I think is a very exciting approach. I’m curious, as people who train models for a living, what’s your take on this one-model-wins-it-all versus more dedicated, specific models?
Alaa Awad:
Yeah, I mean, we’re living in very exciting times. I think there are breakthroughs every week now. For me, I’m really excited about it. I think one thing is that it puts ML more on the minds of leadership and things like that. And so, I think a lot more companies are thinking a lot more about it, and that’s good for us. But also, I don’t necessarily think that generative AI will replace all modeling work. I think there is a space where predictive models and generative models meet, and I think you probably see it; it’s pretty common that you might not necessarily have one alone. I’m seeing more and more that the two of them work really well together. There are some use cases where generative models are being used to help with evaluation in predictive modeling, and some use cases where your standard classification problem is now being solved in a generative sort of way. And also, there are a lot of product use cases where you could build a generative feature, but then have predictive models sort of back it.
Gideon Mendels:
Awesome. What’s your take, Eden?
Eden Dolev:
Yeah, definitely exciting times. I hate this term, but I like that it’s democratizing ML in a sense. I don’t like when people say that, but in the sense that with prompt engineering, and I haven’t dived deep into that field, folks who aren’t ML engineers can learn about how models work and engineer what the model outputs without needing the knowledge of fine-tuning a model. I think that’s only going to do good for the ML world and develop it. But yeah, we’ll see where things land.
Gideon Mendels:
Yeah, obviously there’s a lot. I think it’s super exciting. Like you said, obviously a lot of open questions. Another one we probably all saw, there was a Google memo that leaked two or three weeks ago saying they don’t have a moat, that proprietary model builders or companies serving models don’t have a competitive edge. What’s your guys’ take on OpenAI versus open source? I mean, I think that at least most of the evaluations I’ve seen still show OpenAI models being better. Do you think it’s going to stay that way? Do you think open source is going to catch up, do better?
Eden Dolev:
Right, I definitely think open source is going to do a ton of catching up, do better. It’s definitely going to be competitive. Yeah, I’m a big open source proponent, so I think that’s definitely going to stay at the forefront along with the proprietary models. Obviously, there’s a lot of benefit in that, but I think they’re going to push each other, where OpenAI obviously was a big advancement, and that drove the open source world, and I wouldn’t be surprised if that’s going to flip, and we’re going to see some open source innovation that’s going to pull the Googles and OpenAIs of the world forward at times. And I think the competition is great for everyone.
Alaa Awad:
Yeah, yeah. I’m curious, actually, about your take. I saw recently that Comet’s been more involved in LLMs and things like that, and yeah, I’m wondering what you see across all the companies that you work with.
Gideon Mendels:
I mean personally, and this is my personal opinion, obviously no one knows, I do think that the model itself almost is, and definitely will be, a commodity. I mean, right now OpenAI are ahead, but the gap is closing on almost a weekly basis now. That’s my take on that. I think once the model’s a commodity, a lot of the challenges that people underestimate are the training and serving of these gigantic models. I mean, it’s not only expensive, but actually very, very hard to train on a cluster of 1,000 A100s or something like that. There are very hard engineering problems, and I think that’s one of the most impressive things about OpenAI: not just the research front, but the MLOps front, and their ability to serve and train at that scale. Yeah, but that’s my personal take. I think no one knows.
I doubt, kind of like you, that generative models will replace every supervised task, even if they’re competitive on performance, on model performance. I think the cost of serving and training just doesn’t make any sense, right? You can train a very simple XGBoost model that does it better, and you probably should. But again, that’s our take. But yeah, a lot of our customers are dabbling with generative models and trying to see where it makes sense for them. And I’m assuming every other company like Etsy is looking at it as well. But yes, super, super exciting times for sure.
Eden Dolev:
Yeah. I also think this takes us back to your earlier question on what makes a successful ML product team: you have to think about these costs, and not just the performance. Because I think what we do when we develop is we try to innovate and build on top of things, but then pause and think about the inference cost, the training cost, does it even make sense to make the models that much bigger? Yeah, who knows what happens with OpenAI. Maybe they don’t have that trade-off right right now, and they’re just betting on, let’s get as many users and as much buy-in as possible, and then we’ll think about how to make it worth it.
Alaa Awad:
And the environmental cost.
Gideon Mendels:
Yeah, that’s a big one. Yeah, so we’ve actually partnered with Mila and some people from Hugging Face on something called Code Carbon, which estimates your model’s environmental impact during training and serving. Essentially, it looks at the underlying infrastructure, the GPUs, and computes, based on watts, what the carbon impact is. But yeah, I agree. I don’t think enough people are talking about that, unfortunately. Awesome. One lightning question: when you think about ML and generative AI, pick one term that’s overhyped, and one that’s underhyped.
Alaa Awad:
I think prompt tuning is underhyped. Yeah, well, I don’t know, I didn’t really take it seriously at first, but I was thinking a little bit more about it, and there are really interesting techniques and different things that you could do. That’s an interesting one. And overhyped, I think just the term AI. It can fit so many things.
Eden Dolev:
Was it a company or a concept that you said?
Gideon Mendels:
A term or you could do a company, or model, anything, yeah.
Eden Dolev:
Overhyped, I would say generative AI, because now they’re saying it’s going to do everything. It’s certainly not going to do everything. It’s going to be somewhere below that, even if it’s a lot. So definitely overhyped is generative AI right now. Underhyped, yeah, I don’t know: your classical, predictive, fine-tuning, training, modeling.
Gideon Mendels:
Yeah, classical machine learning, yeah, which now we have to … Classical used to be linear models versus deep learning. Now everything that isn’t generative is classical, but yeah, completely agree.