
ANUSUA TRIVEDI

Research Director at Flipkart

Anusua leads a global team of data scientists, engineers, and product managers responsible for delivering innovative and scalable AI products and solutions for customers. With over 10 years of experience in AI research and development, she has a strong track record of leading and executing complex projects that leverage generative AI, large language models, and conversational AI to enhance the customer and seller experience, optimize the supply chain, and generate product and marketing content.

Anusua is passionate about creating ethical and secure AI solutions that address real-world challenges and empower millions of users. She also enjoys mentoring and teaching the next generation of AI professionals as an instructor in the University of Washington’s MS in Data Science Capstone Program. She has published multiple papers in prestigious AI journals and conferences, and holds an MS in Computer Science from the University of Utah.

Watch live: May 8, 2024 @ 3:10 – 3:40 pm ET

Cost Optimizing RAG for Large Scale E-Commerce Conversational Assistants

With the advent of Large Language Models (LLMs), conversational assistants have become prevalent in e-commerce use cases. Trained on a large web-scale text corpus with approaches such as instruction tuning and Reinforcement Learning from Human Feedback (RLHF), LLMs have become good at contextual question-answering tasks: given relevant text as context, an LLM can generate answers to questions using that information. Retrieval Augmented Generation (RAG) is one of the key techniques used to build conversational assistants that answer questions on domain data. RAG consists of two components: a retrieval model and an LLM-based answer generation model. The retrieval model fetches context relevant to the user’s query; the query and the retrieved context are then passed to the LLM with an appropriate prompt to generate the answer. For API-based LLMs (e.g., ChatGPT), the cost per call is calculated from the number of input and output tokens, so a large number of tokens passed as context leads to a higher cost per API call. With the high volume of user queries in e-commerce applications, this cost can become significant.

In this work, we first develop a RAG-based approach for building a conversational assistant that answers users’ queries about domain-specific data. We train an in-house retrieval model using the info Noise Contrastive Estimation (InfoNCE) loss. Experimental results show that the in-house model outperforms public pre-trained embedding models in retrieval accuracy and Out-of-Domain (OOD) query detection. For every user query, we retrieve the top-k documents as context and pass them to ChatGPT to generate the answer, maintaining the previous conversation history to enable multi-turn conversation.

Next, we propose an RL-based approach to optimize the number of tokens passed to ChatGPT. We observed that for certain patterns/sequences of queries, RAG can produce a good answer even without fetching the context; for example, for a follow-up query, context need not be retrieved if it was already fetched for the previous query. Using this insight, we propose a policy gradient-based approach to optimize the number of LLM tokens and, hence, the cost. The policy model can take one of two actions: fetching a context or skipping retrieval. The query, together with the context determined by the policy’s action, is passed to ChatGPT to generate the answer, and a GPT-4 LLM is then used to rate these answers. Rewards based on these ratings are used to train the policy model for token optimization. Experimental results demonstrate that the policy model provides significant token savings by dynamically fetching the context only when it is required. The policy model sits outside the RAG pipeline, so the proposed approach can be applied to any existing RAG pipeline. For more details, please refer to our AAAI 2024 workshop paper: https://arxiv.org/abs/2401.06800
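The RAG answering loop described above can be summarized in a short sketch. The snippet below assumes the OpenAI chat completions API as a stand-in for ChatGPT and a hypothetical `retriever.top_k` lookup over an in-house embedding index; it is illustrative only, not the authors' implementation.

```python
# Minimal sketch of retrieve -> prompt -> answer with conversation history.
# `retriever.top_k` is a hypothetical interface to the in-house retrieval model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = []       # running multi-turn conversation history


def answer(query: str, retriever, k: int = 3) -> str:
    # Fetch the top-k documents relevant to the query and join them as context.
    context = "\n\n".join(retriever.top_k(query, k))
    messages = (
        [{"role": "system", "content": "Answer using only the provided context."}]
        + history
        + [{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}]
    )
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    text = reply.choices[0].message.content
    # Keep the turn in the history so follow-up questions stay grounded.
    history.extend([{"role": "user", "content": query},
                    {"role": "assistant", "content": text}])
    return text
```

Note that every token in `context` and `history` is billed on each call, which is exactly the cost the policy model is meant to reduce.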
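For the in-house retrieval model, the InfoNCE objective with in-batch negatives can be sketched as follows. The encoder outputs, batch shapes, and `temperature` value are illustrative assumptions rather than details from the paper.

```python
# InfoNCE with in-batch negatives: each query is pulled toward its own positive
# document and pushed away from every other document in the batch.
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb: torch.Tensor,
                  doc_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb: (batch, dim) query embeddings.
    doc_emb:   (batch, dim) embeddings of each query's positive document."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    # Similarity of every query against every document in the batch.
    logits = query_emb @ doc_emb.T / temperature          # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)


# Toy usage with random embeddings standing in for encoder outputs.
q = torch.randn(8, 384)
d = torch.randn(8, 384)
print(info_nce_loss(q, d))
```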
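The policy-gradient token optimization can likewise be sketched as a small REINFORCE-style update: the policy chooses fetch or skip, an answer is generated accordingly, a GPT-4-style rating minus a token-cost penalty serves as the reward, and the log-probability of the chosen action is scaled by that reward. The policy network, the reward shaping, and the `rate_answer` callback below are assumptions, not the authors' released code.

```python
# REINFORCE-style sketch of the fetch-vs-skip retrieval policy.
import torch
import torch.nn as nn


class RetrievalPolicy(nn.Module):
    """Maps a query/conversation-state embedding to log-probabilities over
    two actions: 0 = skip retrieval, 1 = fetch context."""

    def __init__(self, emb_dim: int = 384):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.log_softmax(self.net(state), dim=-1)


def reinforce_step(policy, optimizer, state, rate_answer, token_cost_weight=1e-3):
    """One policy-gradient update on a single query."""
    log_probs = policy(state)
    action = torch.distributions.Categorical(logits=log_probs).sample()
    # Hypothetical environment step: generate the answer with or without retrieval,
    # then return (answer quality rating, total tokens spent on the LLM call).
    rating, tokens_used = rate_answer(action.item())
    reward = rating - token_cost_weight * tokens_used   # quality minus token cost
    loss = -log_probs[action] * reward                   # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(reward)
```

Because the policy only decides whether to retrieve, it can wrap around an existing RAG pipeline without modifying the retriever or the LLM.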