Pathway
Pathway is an open data processing framework. It allows you to easily develop data transformation pipelines and Machine Learning applications that work with live data sources and changing data.
This notebook demonstrates how to use a live Pathway data indexing pipeline with Langchain. You can query the results of this pipeline from your chains in the same manner as you would a regular vector store. However, under the hood, Pathway updates the index on each data change, giving you always up-to-date answers.
In this notebook, we will use a public demo document processing pipeline that:
- Monitors several cloud data sources for data changes.
- Builds a vector index for the data.
To get your own document processing pipeline, check the hosted offering or build your own.
We will connect to the index using a VectorStore client, which implements the similarity_search function to retrieve matching documents.
The basic pipeline used in this document lets you build a simple vector index of files stored in a cloud location. However, Pathway provides everything needed to build realtime data pipelines and apps, including SQL-like table operations such as groupby-reductions and joins between disparate data sources, time-based grouping and windowing of data, and a wide array of connectors.
Querying the data pipeline
To instantiate and configure the client, you need to provide either the url or the host and port of your document indexing pipeline. In the code below we use a publicly available demo pipeline, whose REST API you can access at https://demo-document-indexing.pathway.stream. This demo ingests documents from Google Drive and Sharepoint and maintains an index for retrieving documents.
from langchain_community.vectorstores import PathwayVectorClient
client = PathwayVectorClient(url="https://demo-document-indexing.pathway.stream")
Now we can start asking queries:
query = "What is Pathway?"
docs = client.similarity_search(query)
print(docs[0].page_content)
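To look at more than the first hit, a small helper can list several results with their sources. The sketch below is illustrative only: summarize_results is a hypothetical name, k is the standard number-of-results argument of similarity_search, and the presence of a path key in each document's metadata is an assumption (the helper falls back to a placeholder when it is missing).

```python
from typing import Any, List


def summarize_results(store: Any, query: str, k: int = 3, width: int = 80) -> List[str]:
    """Return one line per retrieved document: rank, source path, short snippet."""
    docs = store.similarity_search(query, k=k)
    lines = []
    for rank, doc in enumerate(docs, start=1):
        # Assumption: the pipeline stores the file path under the "path" key.
        source = doc.metadata.get("path", "<unknown source>")
        # Collapse whitespace so that the snippet fits on a single line.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"{rank}. {source}: {snippet}")
    return lines
```

With the client created above, calling summarize_results(client, "What is Pathway?") and printing each returned line would show the top three matches.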
Your turn! Get your pipeline or upload new documents to the demo pipeline and retry the query!
Filtering based on file metadata
The client supports document filtering using JMESPath expressions, for instance:
# take into account only sources modified at or after the given Unix timestamp
docs = client.similarity_search(query, metadata_filter="modified_at >= `1702672093`")
# take into account only sources owned by `james`
docs = client.similarity_search(query, metadata_filter="owner == `james`")
# take into account only sources with path containing 'repo_readme'
docs = client.similarity_search(query, metadata_filter="contains(path, 'repo_readme')")
# logical AND of two conditions
docs = client.similarity_search(
    query, metadata_filter="owner == `james` && modified_at >= `1702672093`"
)
# logical OR of two conditions
docs = client.similarity_search(
    query, metadata_filter="owner == `james` || modified_at >= `1702672093`"
)
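When filters are assembled from user input, building the expression string by hand gets error-prone. The following is a minimal sketch of a helper that joins optional conditions with &&, using the same field names (owner, modified_at, path) and quoting style as the examples above. The function name and its parameters are hypothetical, and it does not escape backticks or quotes inside values.

```python
from typing import Optional


def build_metadata_filter(
    owner: Optional[str] = None,
    modified_since: Optional[int] = None,
    path_contains: Optional[str] = None,
) -> Optional[str]:
    """Combine the given conditions with && into one JMESPath expression.

    Returns None when no condition is supplied.
    """
    clauses = []
    if owner is not None:
        clauses.append(f"owner == `{owner}`")
    if modified_since is not None:
        # Unix timestamp, compared against the modified_at metadata field.
        clauses.append(f"modified_at >= `{modified_since}`")
    if path_contains is not None:
        clauses.append(f"contains(path, '{path_contains}')")
    return " && ".join(clauses) if clauses else None
```

For example, build_metadata_filter(owner="james", modified_since=1702672093) yields the same string as the AND example above, which can then be passed as metadata_filter to similarity_search.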
Getting information on indexed files
PathwayVectorClient.get_vectorstore_statistics() gives essential statistics on the state of the vector store, like the number of indexed files and the timestamp of the most recently updated one. You can use it in your chains to tell the user how fresh your knowledge base is.
client.get_vectorstore_statistics()
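One way to surface these statistics to a user is to turn them into a short sentence. The sketch below assumes the statistics arrive as a dictionary with a file_count entry and a last_modified Unix timestamp; those key names are an assumption rather than something this page confirms, so check the actual return value before relying on them. The describe_freshness helper is hypothetical.

```python
import time
from typing import Mapping, Optional


def describe_freshness(stats: Mapping[str, int], now: Optional[float] = None) -> str:
    """Format index statistics as a human-readable freshness message."""
    now = time.time() if now is None else now
    # Assumed keys: "file_count" and "last_modified" (Unix seconds).
    count = stats["file_count"]
    age_seconds = max(0, int(now - stats["last_modified"]))
    minutes = age_seconds // 60
    return (
        f"The index covers {count} files; "
        f"the latest change was about {minutes} minutes ago."
    )
```

A chain could prepend describe_freshness(client.get_vectorstore_statistics()) to its answer so that users know how recent the underlying documents are.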
Your own pipeline
Running in production
To get your own Pathway data indexing pipeline, check Pathway's offering for hosted pipelines. You can also run your own Pathway pipeline; for information on how to build one, refer to the Pathway guide.
Processing documents
The vectorization pipeline supports pluggable components for parsing, splitting, and embedding documents. For embedding and splitting you can use Langchain components or check the embedders and splitters available in Pathway. If no parser is provided, the pipeline defaults to a UTF-8 parser. You can find the available parsers here.