
Retrieval Part 2: Text Embeddings

Explore How LangChain’s Semantic Search Allows You To Transform Data Retrieval and Information Discovery

In this blog post, I’ll show you how to work with text embedding models using LangChain.

Text embedding models represent documents as high-dimensional vectors. They’re the key to unlocking semantic search capabilities that go beyond simple keyword matching. Imagine being able to sift through massive volumes of text and instantly find documents that match the intent and meaning of your search query, not just the exact words. That’s the transformative potential of text embeddings in tasks such as document similarity search and recommendation systems. We’ll explore how LangChain’s OpenAIEmbeddings leverage this technology to revolutionize the way we interact with information, ensuring that the most relevant documents are at your fingertips, irrespective of the language used.

Let’s dive into it!

Text Embeddings

Text embedding models represent text documents in a high-dimensional vector space, where the similarity between vectors corresponds to the semantic similarity between the corresponding documents.

These models capture the semantic meaning of text and allow for efficient retrieval of similar documents based on their embeddings.

Text embedding models are handy when performing tasks like document similarity search, information retrieval, or recommendation systems.

They enable you to find documents that are semantically similar to a given query document, even if the wording or phrasing is different.

You would use text embedding models when you need to find similar documents based on their semantic meaning rather than just keyword matching.

For example, in a search engine, you might want to retrieve documents that are relevant to a user’s query, even if the query terms are not an exact match to the document’s content.

Text embedding models can help you achieve this by capturing the semantic relationships between words and documents.




There are several reasons why you would use text embedding models for retrieval in LangChain:

Improved search accuracy: Text embedding models can capture the semantic meaning of text, allowing for more accurate retrieval of relevant documents compared to traditional keyword-based approaches.

Flexibility in query formulation: With text embedding models, you can search for similar documents based on the semantic meaning of a query rather than relying solely on exact keyword matches. This provides more flexibility in query formulation and improves the user experience.

Handling of out-of-vocabulary words: Text embedding models can handle unseen or rare words by mapping them to nearby points in the embedding space. This keeps retrieval performance up even when a query contains terms that never appear in the corpus.

Text embedding models for retrieval in LangChain provide a powerful tool for capturing the semantic meaning of text and enabling efficient retrieval of similar documents.

They are handy when finding documents based on semantic similarity rather than exact keyword matches.

from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])

#(5, 1536)

embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]


#[0.005387211957276042,
#-0.0005941777859814659,
# 0.03892524773846194,
# -0.00297914132073842,
# -0.008912666382268376]
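To see the semantic matching in action, you can rank the documents you just embedded against the query embedding. Here is a minimal sketch using NumPy and the embeddings and embedded_query variables from above (OpenAI embeddings are already close to unit length, but normalizing explicitly keeps the cosine-similarity math obvious):

import numpy as np

# Stack the five document embeddings into a (5, 1536) matrix
doc_matrix = np.array(embeddings)
query_vec = np.array(embedded_query)

# Cosine similarity = dot product of L2-normalized vectors
doc_matrix = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
query_vec = query_vec / np.linalg.norm(query_vec)
similarities = doc_matrix @ query_vec

sentences = [
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!",
]

# Print the sentences from most to least similar to the query
for idx in np.argsort(similarities)[::-1]:
    print(f"{similarities[idx]:.3f}  {sentences[idx]}")

You should see the name-related sentences near the top of the ranking, even though they share few keywords with the query.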

Caching

The great thing about embeddings is that you can store them, or temporarily cache them, so you don’t have to recompute them.

CacheBackedEmbeddings in LangChain wrap an underlying text embedding model, combining the benefits of precomputed embeddings with the flexibility of on-the-fly computation.

They are used to improve the efficiency and speed of text embedding retrieval by caching precomputed embeddings and retrieving them when needed.

CacheBackedEmbeddings are particularly useful when you have a large corpus of text documents and want to efficiently retrieve embeddings for various tasks such as document similarity search, information retrieval, or recommendation systems.

They allow you to store precomputed embeddings in a cache, reducing the need for repeated computation and improving the overall retrieval performance.

You would use CacheBackedEmbeddings when you need to speed up the retrieval of text embeddings and reduce the computational overhead.

By caching precomputed embeddings, you can avoid the time-consuming (and expensive) process of computing embeddings for each query or document, resulting in faster retrieval times.

CacheBackedEmbeddings are especially beneficial in scenarios where the text corpus is static or changes infrequently.

The main supported way to initialize a CacheBackedEmbeddings is from_bytes_store.

This takes in the following parameters:

underlying_embedder: The embedder to use for embedding.

document_embedding_cache: The cache to use for storing document embeddings.

namespace: (optional, defaults to “”) The namespace to use for the document cache. It is used to avoid collisions with other caches; for example, set it to the name of the embedding model used.

There are a bunch of stores you can use for caching in LangChain, and the basic pattern is the same for all of them. That’s the beauty of LangChain: a unified interface for them all.

See the LangChain documentation to learn more about them.
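For example, the file-based cache used in the walkthrough below could be swapped for a Redis-backed one with no change to the rest of the code. A minimal sketch, assuming a Redis server is reachable at localhost:6379 and that your LangChain version exposes RedisStore with a redis_url argument:

from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings
from langchain.storage import RedisStore

# Same from_bytes_store pattern, different byte store:
# embeddings are cached in Redis instead of on the local filesystem
redis_store = RedisStore(redis_url="redis://localhost:6379")
redis_cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    OpenAIEmbeddings(), redis_store, namespace="text-embedding-ada-002"
)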

!pip install faiss-cpu

from langchain.storage import InMemoryStore, LocalFileStore, RedisStore
from langchain.embeddings import OpenAIEmbeddings, CacheBackedEmbeddings
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

underlying_embeddings = OpenAIEmbeddings()

fs = LocalFileStore("./cache/")

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, fs, namespace=underlying_embeddings.model
)

list(fs.yield_keys())

# []
raw_documents = TextLoader("/content/golden-sayings-of-epictetus.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

Create a Vectorstore

When using LangChain, you have two options for caching embeddings: vector stores and CacheBackedEmbeddings.

Vector stores, such as FAISS, are useful when you want to store and retrieve embeddings efficiently.

They are typically used when you have many embeddings and need fast retrieval.

You can create a vector store by calling FAISS.from_documents (or the equivalent method on any supported vector store) and passing in the documents and the embedder.

You should use vector stores when you need fast retrieval of embeddings and have many embeddings.

On the other hand, you should use CacheBackedEmbeddings when you want to temporarily cache embeddings to avoid recomputing them, such as in unit tests or prototyping.
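For that temporary-cache case, here is a minimal sketch using the InMemoryStore imported earlier, so nothing is written to disk and the cache disappears with the process (assuming your LangChain version accepts it as the byte store for from_bytes_store):

# In-memory cache: handy for unit tests and prototyping,
# since cached embeddings live only for the lifetime of the process
memory_store = InMemoryStore()
memory_cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, memory_store, namespace=underlying_embeddings.model
)

For the rest of the walkthrough, though, we’ll stick with the file-backed cache and build a FAISS vector store on top of it: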

db = FAISS.from_documents(documents, cached_embedder)

You can time the creation of the first and second databases to see how much faster the second one is to build (a quick timing sketch follows below). That’s the power of cached embeddings!

db2 = FAISS.from_documents(documents, cached_embedder)
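A quick sketch of that timing with time.perf_counter (run it against a fresh ./cache/ directory so the first build is actually cold; the numbers will vary with corpus size and network latency):

import time

# Cold cache: every chunk is embedded via the OpenAI API and written to the cache
start = time.perf_counter()
FAISS.from_documents(documents, cached_embedder)
print(f"cold cache: {time.perf_counter() - start:.2f}s")

# Warm cache: embeddings are read straight from the local file store
start = time.perf_counter()
FAISS.from_documents(documents, cached_embedder)
print(f"warm cache: {time.perf_counter() - start:.2f}s")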

list(fs.yield_keys())[:5]

# ['text-embedding-ada-00258ed3f2f-e965-57c4-9f1d-d737f70d99d4',
#  'text-embedding-ada-002d85fa430-6eee-546c-bd31-fe8c2b7f5d28',
#  'text-embedding-ada-00275fa6ffa-5e70-52ab-a788-811652175577',
#  'text-embedding-ada-00208b3877e-ac85-56a0-9156-048fba0dda88',
#  'text-embedding-ada-0029b5b82a9-2927-51c7-981a-855474cd6be1']
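With the vector store built, semantic retrieval is a one-liner. A minimal sketch querying the FAISS index (the query text here is just an example against the Epictetus corpus):

# Retrieve the chunks most semantically similar to the query
results = db.similarity_search("How should one respond to adversity?", k=2)

for doc in results:
    print(doc.page_content[:200])
    print("---")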

Conclusion

In conclusion, text embedding models, particularly those implemented in LangChain, represent a quantum leap in how we handle and retrieve textual information.

They empower us to understand and process language on a semantic level, transcending the limitations of traditional keyword searches. Through practical examples and powerful tools like CacheBackedEmbeddings and vector stores like FAISS, we’ve seen how LangChain simplifies and speeds up the retrieval process, ensuring efficiency and accuracy. Whether you’re building a search engine, a recommendation system, or any application that relies on deep text understanding, using text embeddings is not just an option; it’s an imperative.

With LangChain plus embeddings, you’re not just searching; you’re discovering, ensuring every query returns the most semantically relevant information possible.

Harpreet Sahota
