Weaviate
This notebook covers how to get started with the Weaviate vector store
in LangChain, using the langchain-weaviate
package.
Weaviate is an open-source vector database. It allows you to store data objects and vector embeddings from your favorite ML-models, and scale seamlessly into billions of data objects.
To use this integration, you need to have a running Weaviate database instance.
Minimum versionsβ
This module requires Weaviate 1.23.7
or higher. However, we recommend
you use the latest version of Weaviate.
Connecting to Weaviateβ
In this notebook, we assume that you have a local instance of Weaviate
running on http://localhost:8080
and port 50051 open for gRPC
traffic. So, we
will connect to Weaviate with:
weaviate_client = weaviate.connect_to_local()
Other deployment optionsβ
Weaviate can be deployed in many different ways such as using Weaviate Cloud Services (WCS), Docker or Kubernetes.
If your Weaviate instance is deployed in another way, read more here about different ways to connect to Weaviate. You can use different helper functions or create a custom instance.
Note that you require a
v4
client API, which will create aweaviate.WeaviateClient
object.
Authenticationβ
Some Weaviate instances, such as those running on WCS, have authentication enabled, such as API key and/or username+password authentication.
Read the client authentication guide for more information, as well as the in-depth authentication configuration page.
Installationβ
# install package
# %pip install -Uqq langchain-weaviate
# %pip install openai tiktoken langchain
Environment Setupβ
This notebook uses the OpenAI API through OpenAIEmbeddings
. We suggest
obtaining an OpenAI API key and export it as an environment variable
with the name OPENAI_API_KEY
.
Once this is done, your OpenAI API key will be read automatically. If you are new to environment variables, read more about them here or in this guide.
Usage
Find objects by similarityβ
Here is an example of how to find objects by similarity to a query, from data import to querying the Weaviate instance.
Step 1: Data importβ
First, we will create data to add to Weaviate
by loading and chunking
the contents of a long text file.
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings.openai import OpenAIEmbeddings
API Reference:
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.embeddings.openai.OpenAIEmbeddings` was deprecated in langchain-community 0.1.0 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAIEmbeddings`.
warn_deprecated(
Now, we can import the data.
To do so, connect to the Weaviate instance and use the resulting
weaviate_client
object. For example, we can import the documents as
shown below:
import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents(docs, embeddings, client=weaviate_client)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
Step 2: Perform the searchβ
We can now perform a similarity search. This will return the most similar documents to the query text, based on the embeddings stored in Weaviate and an equivalent embedding generated from the query text.
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)
# Print the first 100 characters of each result
for i, doc in enumerate(docs):
print(f"\nDocument {i+1}:")
print(doc.page_content[:100] + "...")
Document 1:
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
Document 2:
And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of ...
Document 3:
Vice President Harris and I ran for office with a new economic vision for America.
Invest in Ameri...
Document 4:
A former top litigator in private practice. A former federal public defender. And from a family of p...
You can also add filters, which will either include or exclude results based on the filter conditions. (See more filter examples.)
from weaviate.classes.query import Filter
for filter_str in ["blah.txt", "state_of_the_union.txt"]:
search_filter = Filter.by_property("source").equal(filter_str)
filtered_search_results = db.similarity_search(query, filters=search_filter)
print(len(filtered_search_results))
if filter_str == "state_of_the_union.txt":
assert len(filtered_search_results) > 0 # There should be at least one result
else:
assert len(filtered_search_results) == 0 # There should be no results
0
4
It is also possible to provide k
, which is the upper limit of the
number of results to return.
search_filter = Filter.by_property("source").equal("state_of_the_union.txt")
filtered_search_results = db.similarity_search(query, filters=search_filter, k=3)
assert len(filtered_search_results) <= 3
Quantify result similarityβ
You can optionally retrieve a relevance βscoreβ. This is a relative score that indicates how good the particular search results is, amongst the pool of search results.
Note that this is relative score, meaning that it should not be used to determine thresholds for relevance. However, it can be used to compare the relevance of different search results within the entire search result set.
docs = db.similarity_search_with_score("country", k=5)
for doc in docs:
print(f"{doc[1]:.3f}", ":", doc[0].page_content[:100] + "...")
0.935 : For that purpose weβve mobilized American ground forces, air squadrons, and ship deployments to prot...
0.500 : And built the strongest, freest, and most prosperous nation the world has ever known.
Now is the h...
0.462 : If you travel 20 miles east of Columbus, Ohio, youβll find 1,000 empty acres of land.
It wonβt loo...
0.450 : And my report is this: the State of the Union is strongβbecause you, the American people, are strong...
0.442 : Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Ac...
Search mechanismβ
similarity_search
uses Weaviateβs hybrid
search.
A hybrid search combines a vector and a keyword search, with alpha
as
the weight of the vector search. The similarity_search
function allows
you to pass additional arguments as kwargs. See this reference
doc
for the available arguments.
So, you can perform a pure keyword search by adding alpha=0
as shown
below:
docs = db.similarity_search(query, alpha=0)
docs[0]
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youβre at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, Iβd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyerβan Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nationβs top legal minds, who will continue Justice Breyerβs legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
Persistenceβ
Any data added through langchain-weaviate
will persist in Weaviate
according to its configuration.
WCS instances, for example, are configured to persist data indefinitely, and Docker instances can be set up to persist data in a volume. Read more about Weaviateβs persistence.
Multi-tenancyβ
Multi-tenancy allows you to have a high number of isolated collections of data, with the same collection configuration, in a single Weaviate instance. This is great for multi-user environments such as building a SaaS app, where each end user will have their own isolated data collection.
To use multi-tenancy, the vector store need to be aware of the tenant
parameter.
So when adding any data, provide the tenant
parameter as shown below.
db_with_mt = WeaviateVectorStore.from_documents(
docs, embeddings, client=weaviate_client, tenant="Foo"
)
2024-Mar-26 03:40 PM - langchain_weaviate.vectorstores - INFO - Tenant Foo does not exist in index LangChain_30b9273d43b3492db4fb2aba2e0d6871. Creating tenant.
And when performing queries, provide the tenant
parameter also.
db_with_mt.similarity_search(query, tenant="Foo")
[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youβre at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, Iβd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyerβan Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nationβs top legal minds, who will continue Justice Breyerβs legacy of excellence.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='And so many families are living paycheck to paycheck, struggling to keep up with the rising cost of food, gas, housing, and so much more. \n\nI understand. \n\nI remember when my Dad had to leave our home in Scranton, Pennsylvania to find work. I grew up in a family where if the price of food went up, you felt it. \n\nThatβs why one of the first things I did as President was fight to pass the American Rescue Plan. \n\nBecause people were hurting. We needed to act, and we did. \n\nFew pieces of legislation have done more in a critical moment in our history to lift us out of crisis. \n\nIt fueled our efforts to vaccinate the nation and combat COVID-19. It delivered immediate economic relief for tens of millions of Americans. \n\nHelped put food on their table, keep a roof over their heads, and cut the cost of health insurance. \n\nAnd as my Dad used to say, it gave people a little breathing room.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='He and his Dad both have Type 1 diabetes, which means they need insulin every day. Insulin costs about $10 a vial to make. \n\nBut drug companies charge families like Joshua and his Dad up to 30 times more. I spoke with Joshuaβs mom. \n\nImagine what itβs like to look at your child who needs insulin and have no idea how youβre going to pay for it. \n\nWhat it does to your dignity, your ability to look your child in the eye, to be the parent you expect to be. \n\nJoshua is here with us tonight. Yesterday was his birthday. Happy birthday, buddy. \n\nFor Joshua, and for the 200,000 other young people with Type 1 diabetes, letβs cap the cost of insulin at $35 a month so everyone can afford it. \n\nDrug companies will still do very well. And while weβre at it let Medicare negotiate lower prices for prescription drugs, like the VA already does.', metadata={'source': 'state_of_the_union.txt'}),
Document(page_content='Putinβs latest attack on Ukraine was premeditated and unprovoked. \n\nHe rejected repeated efforts at diplomacy. \n\nHe thought the West and NATO wouldnβt respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \n\nWe prepared extensively and carefully. \n\nWe spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \n\nI spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \n\nWe countered Russiaβs lies with truth. \n\nAnd now that he has acted the free world is holding him accountable. \n\nAlong with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.', metadata={'source': 'state_of_the_union.txt'})]
Retriever optionsβ
Weaviate can also be used as a retriever
Maximal marginal relevance search (MMR)β
In addition to using similaritysearch in the retriever object, you can
also use mmr
.
retriever = db.as_retriever(search_type="mmr")
retriever.invoke(query)[0]
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while youβre at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, Iβd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyerβan Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nationβs top legal minds, who will continue Justice Breyerβs legacy of excellence.', metadata={'source': 'state_of_the_union.txt'})
Use with LangChain
A known limitation of large languag models (LLMs) is that their training data can be outdated, or not include the specific domain knowledge that you require.
Take a look at the example below:
from langchain_community.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm.predict("What did the president say about Justice Breyer")
API Reference:
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.chat_models.openai.ChatOpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import ChatOpenAI`.
warn_deprecated(
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `predict` was deprecated in LangChain 0.1.7 and will be removed in 0.2.0. Use invoke instead.
warn_deprecated(
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
"I'm sorry, I cannot provide real-time information as my responses are generated based on a mixture of licensed data, data created by human trainers, and publicly available data. The last update was in October 2021."
Vector stores complement LLMs by providing a way to store and retrieve relevant information. This allow you to combine the strengths of LLMs and vector stores, by using LLMβs reasoning and linguistic capabilities with vector storesβ ability to retrieve relevant information.
Two well-known applications for combining LLMs and vector stores are: - Question answering - Retrieval-augmented generation (RAG)
Question Answering with Sourcesβ
Question answering in langchain can be enhanced by the use of vector stores. Letβs see how this can be done.
This section uses the RetrievalQAWithSourcesChain
, which does the
lookup of the documents from an Index.
First, we will chunk the text again and import them into the Weaviate vector store.
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_community.llms import OpenAI
API Reference:
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
docsearch = WeaviateVectorStore.from_texts(
texts,
embeddings,
client=weaviate_client,
metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)
Now we can construct the chain, with the retriever specified:
chain = RetrievalQAWithSourcesChain.from_chain_type(
OpenAI(temperature=0), chain_type="stuff", retriever=docsearch.as_retriever()
)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The class `langchain_community.llms.openai.OpenAI` was deprecated in langchain-community 0.0.10 and will be removed in 0.2.0. An updated version of the class exists in the langchain-openai package and should be used instead. To use it run `pip install -U langchain-openai` and import as `from langchain_openai import OpenAI`.
warn_deprecated(
And run the chain, to ask the question:
chain(
{"question": "What did the president say about Justice Breyer"},
return_only_outputs=True,
)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/langchain_core/_api/deprecation.py:117: LangChainDeprecationWarning: The function `__call__` was deprecated in LangChain 0.1.0 and will be removed in 0.2.0. Use invoke instead.
warn_deprecated(
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
{'answer': ' The president thanked Justice Stephen Breyer for his service and announced his nomination of Judge Ketanji Brown Jackson to the Supreme Court.\n',
'sources': '31-pl'}
Retrieval-Augmented Generationβ
Another very popular application of combining LLMs and vector stores is retrieval-augmented generation (RAG). This is a technique that uses a retriever to find relevant information from a vector store, and then uses an LLM to provide an output based on the retrieved data and a prompt.
We begin with a similar setup:
with open("state_of_the_union.txt") as f:
state_of_the_union = f.read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(state_of_the_union)
docsearch = WeaviateVectorStore.from_texts(
texts,
embeddings,
client=weaviate_client,
metadatas=[{"source": f"{i}-pl"} for i in range(len(texts))],
)
retriever = docsearch.as_retriever()
We need to construct a template for the RAG model so that the retrieved information will be populated in the template.
from langchain_core.prompts import ChatPromptTemplate
template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
print(prompt)
API Reference:
input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question}\nContext: {context}\nAnswer:\n"))]
from langchain_community.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
API Reference:
And running the cell, we get a very similar output.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
rag_chain = (
{"context": retriever, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
rag_chain.invoke("What did the president say about Justice Breyer")
API Reference:
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
/workspaces/langchain-weaviate/.venv/lib/python3.12/site-packages/pydantic/main.py:1024: PydanticDeprecatedSince20: The `dict` method is deprecated; use `model_dump` instead. Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2 Migration Guide at https://errors.pydantic.dev/2.6/migration/
warnings.warn('The `dict` method is deprecated; use `model_dump` instead.', category=PydanticDeprecatedSince20)
"The president honored Justice Stephen Breyer for his service to the country as an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. The president also mentioned nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to continue Justice Breyer's legacy of excellence. The president expressed gratitude towards Justice Breyer and highlighted the importance of nominating someone to serve on the United States Supreme Court."
But note that since the template is upto you to construct, you can customize it to your needs.
Wrap-up & resourcesβ
Weaviate is a scalable, production-ready vector store.
This integration allows Weaviate to be used with LangChain to enhance the capabilities of large language models with a robust data store. Its scalability and production-readiness make it a great choice as a vector store for your LangChain applications, and it will reduce your time to production.