October 8, 2024
In the age of information overload, the ability to quickly find relevant data is paramount.
LangChain’s retrievers stand as the gatekeepers of knowledge, offering an advanced interface for searching and retrieving information from a sea of indexed documents. By serving as a bridge between unstructured queries and structured data, retrievers go beyond the capabilities of vector stores, focusing on the retrieval rather than the storage of documents. This blog post will guide you through the essence of retrievers in LangChain, their integral role in question-answering systems, and how they redefine the efficiency of search operations.
With LangChain, the journey from a simple question to an informed answer becomes seamless and intuitive, marking a new era in document retrieval.
Before we get started, let’s set up our environment:
%%capture
!pip install langchain openai tiktoken chromadb
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key:")
Download some text data from Project Gutenberg:
!wget -O "golden_hymns_of_epictetus.txt" https://www.gutenberg.org/cache/epub/871/pg871.txt
# Cleaning up some of the data so that we get only relevant text
filename = "/content/golden_hymns_of_epictetus.txt"

start_saving = False
stop_saving = False
lines_to_save = []

with open(filename, 'r') as file:
    for line in file:
        if "Are these the only works of Providence within us?" in line:
            start_saving = True
        if "*** END OF THE PROJECT GUTENBERG EBOOK THE GOLDEN SAYINGS OF EPICTETUS, WITH THE HYMN OF CLEANTHES ***" in line:
            stop_saving = True
            break
        if start_saving and not stop_saving:
            lines_to_save.append(line)

# Write the stored lines back to the file
with open(filename, 'w') as file:
    for line in lines_to_save:
        file.write(line)
And you can see how many words this file has:
word_count = 0

with open(filename, 'r') as file:
    for line in file:
        words = line.split()
        word_count += len(words)

print(f"The total number of words in the file is: {word_count}")
# The total number of words in the file is: 23503
In LangChain, retrievers help you search and retrieve information from your indexed documents.
A retriever is an interface that returns documents based on an unstructured query, which makes it a more general tool than a vector store.
Unlike a vector store, a retriever does not need to be able to store documents.
Instead, its primary function is to return or retrieve them.
While vector stores can serve as the backbone of a retriever, there are different types of retrievers available too.
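Concretely, the retriever contract is tiny: a plain-text query goes in, and a list of Document objects comes out. Here is a minimal sketch of that interface, where retriever stands in for any LangChain retriever, such as the Chroma-backed one we build later in this post:
# The core retriever contract: unstructured query in, Documents out.
# `retriever` stands in for any LangChain retriever, e.g. the
# Chroma-backed one constructed later in this post.
docs = retriever.get_relevant_documents("Why did he lose his lamp?")

for doc in docs:
    print(doc.page_content[:100])  # preview each retrieved chunk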
Retrievers are used to find relevant documents or passages that contain the answer to a given query.
They work by comparing the query against the indexed documents and returning the most relevant results.
Retrievers use various techniques, such as vector similarity or keyword matching, to determine the relevance of documents.
You would use retrievers to perform search-based operations on your indexed documents.
They are instrumental in question-answering systems, where you want to find the most relevant information to answer a user’s query.
Retrievers can also be used for information retrieval tasks, content recommendations, or any other scenario where you need to find relevant documents based on a query.
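Keyword matching deserves a quick illustration: retrievers don't have to be backed by embeddings at all. For example, LangChain ships a BM25Retriever built on keyword statistics. A minimal sketch, assuming the rank_bm25 package is installed (pip install rank_bm25):
# A purely keyword-based retriever -- no embeddings or vector store needed.
from langchain.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_texts([
    "The thief paid this price for the lamp: he became a thief.",
    "Draw near to God with a cheerful look.",
    "Maintain that which is in your power.",
])

# BM25 ranks documents by keyword overlap with the query.
docs = bm25_retriever.get_relevant_documents("Why did he lose the lamp?")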
At a high level, the workflow looks like this:
1) Create an index
2) Create a Retriever from that index
3) Create a question-answering chain
4) Ask questions!
Note: By default, LangChain uses Chroma as the vector store to index and search embeddings, which is why we installed chromadb at the start of this tutorial.
Start by importing a couple of required libraries:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
You can load the document like so:
loader = TextLoader("/content/golden_hymns_of_epictetus.txt", encoding="utf8")
VectorstoreIndexCreator in LangChain is used to create an index of your documents for efficient retrieval.
You would use it when you want to store and retrieve embeddings efficiently, especially when dealing with many documents.
from langchain.indexes import VectorstoreIndexCreator
The VectorstoreIndexCreator creates an index of your documents using the from_loaders method.
This method takes a list of document loaders as input and creates an index that contains the necessary information for retrieval.
The document loaders can load documents from various sources, such as text files or databases.
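Putting this together, you can create the index from the loader defined above (a configurable version of this call appears further down):
# Load, split, embed, and store the documents in one step.
index = VectorstoreIndexCreator().from_loaders([loader])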
Now that the index is created, we can ask questions about the data!
You can open the text file, scroll to line 115, and see the following passage:
The reason why I lost my lamp was that the thief was superior to me in vigilance. He paid however this price for the lamp, that in exchange for it he consented to become a thief: in exchange for it, to become faithless.
Now, let’s query and see what we get:
query = "What was the reason he lost his lamp? And what was gotten in exchange?"
index.query(query)
# The reason he lost his lamp was that the thief was superior to him in vigilance. In exchange for the lamp, the thief consented to become a thief and to become faithless.
Pretty damn good.
You can also return sources:
query = "How might I convince myself that every single act of mine was under the eye of God?"
index.query_with_sources(query)
{'question': 'How might I convince myself that every single act of mine was under the eye of God?',
'answer': ' Epictetus suggested that one should maintain that which is in their power, never do wrong to anyone, and come forward as a witness summoned by God. He also suggested that one should draw near to God with a cheerful look, be attentive to His commands, and be thankful for being part of His Assembly.\n',
'sources': '/content/golden_hymns_of_epictetus.txt'}
How is this index getting created? A lot of the magic is hidden inside VectorstoreIndexCreator. What is it actually doing?
Three main steps are going on after the documents are loaded:
1) Splitting documents into chunks
2) Creating embeddings for each document
3) Storing documents and embeddings in a vectorstore
Let’s see how this is done:
LangChain makes this part easy.
# instantiate the document loader, note you did this earlier
loader = TextLoader("/content/golden_hymns_of_epictetus.txt", encoding="utf8")
# load the documents
documents = loader.load()
Text splitters in LangChain are used to split long pieces of text into smaller, semantically meaningful chunks.
They are handy when you want to keep related pieces of text together or when you need to process text in smaller segments.
At a high level, text splitters work as follows:
1) Split the text into small, semantically meaningful chunks (often sentences).
2) Start combining these small chunks into a larger chunk until you reach a specific size (as measured by some function).
3) Once you reach that size, make that chunk its own piece of text, then start creating a new chunk with some overlap (to keep context between chunks).
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
texts = text_splitter.split_documents(documents)
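It's worth sanity-checking what the splitter produced, for example:
# Inspect the result: how many chunks, and what one looks like.
print(f"Split into {len(texts)} chunks")
print(texts[0].page_content[:200])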
Text embedding models for retrieval in LangChain represent text documents in a high-dimensional vector space, where the similarity between vectors corresponds to the semantic similarity between the corresponding documents.
These models capture the semantic meaning of text and allow for efficient retrieval of similar documents based on their embeddings.
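To make that concrete, here is a small illustrative sketch that embeds two paraphrases and one unrelated sentence with OpenAIEmbeddings, then compares them with cosine similarity (the helper function is our own, not part of LangChain):
import numpy as np
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = embeddings.embed_query("The thief stole my lamp.")
v2 = embeddings.embed_query("Someone took my lantern.")
v3 = embeddings.embed_query("Draw near to God with a cheerful look.")

# The paraphrase pair should score noticeably higher than the unrelated pair.
print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))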
To construct a vector store retriever, you must first load the documents using a document loader.
Then, you can split the documents into smaller chunks using a text splitter. Next, you can generate vector embeddings for the text chunks using an embedding model like OpenAIEmbeddings. Finally, you can create a vector store using the generated embeddings.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(documents=texts, embedding=embeddings)
Once the vector store is constructed, you can use it as a retriever to query the texts.
The vector store retriever supports various search methods, including similarity search and maximum marginal relevance search.
You can also set a similarity score threshold or specify the number of top documents to retrieve.
Under the hood, the retriever below relies on the vector store's similarity_search method. In simpler terms, think of this method as a search tool: you give it a piece of text, tell it how many results you want, and it returns a list of documents that are most similar to your given text. If you have specific requirements, like only wanting documents from a particular author, you can use the filter option to specify that.
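You can also call these search methods directly on the vector store. A minimal sketch (the metadata filter here is illustrative; with TextLoader, each chunk only carries a "source" field):
# Plain similarity search: return the 3 most similar chunks.
docs = db.similarity_search("Why did he lose his lamp?", k=3)

# Maximum marginal relevance search: trades raw similarity for diversity.
docs = db.max_marginal_relevance_search("Why did he lose his lamp?", k=3)

# Metadata filtering (illustrative): restrict results to one source file.
docs = db.similarity_search(
    "Why did he lose his lamp?",
    k=3,
    filter={"source": "/content/golden_hymns_of_epictetus.txt"},
)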
retriever = db.as_retriever()
qa = RetrievalQA.from_chain_type(llm=OpenAI(),
                                 chain_type="stuff",
                                 retriever=retriever)
query = "How can should I eat in an acceptable manner?"
qa.run(query)
# If you eat while being just, cheerful, equable, temperate, and orderly, then you can eat in an acceptable manner to the Gods.
VectorstoreIndexCreator is just a wrapper around all this logic.
It is configurable in the text splitter it uses, the embeddings it uses, and the vectorstore it uses.
For example, you can configure it as below:
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Chroma,
    embedding=OpenAIEmbeddings(),
    text_splitter=CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
)
index = index_creator.from_loaders([loader])
query = "What was the reason he lost his lamp? And what was gotten in exchange?"
index.query(query)
# The reason he lost his lamp was that the thief was superior to him in vigilance. In exchange for the lamp, the thief consented to become a thief and to become faithless.
As we conclude, it’s clear that LangChain’s retrievers are not just tools but catalysts for transformative search experiences.
By elegantly bridging the gap between complex queries and the vast expanse of indexed documents, they redefine our approach to information retrieval. Whether for answering pivotal questions, delving into in-depth research, or simply navigating through large volumes of data, retrievers offer efficiency and relevance. This exploration through the functionality and application of LangChain’s retrievers underscores their indispensable role in modern information systems.
Armed with the power of LangChain, every query is an opportunity to uncover precise, contextually rich answers, turning the search into a journey of discovery.