Apache Doris
Apache Doris is a modern data warehouse for real-time analytics. It delivers lightning-fast analytics on real-time data at scale.
Apache Doris is usually categorized as an OLAP database, and it has shown excellent performance in ClickBench, a benchmark for analytical DBMSs. Because it has a very fast vectorized execution engine, it can also be used as a fast vectordb.
Here we'll show how to use the Apache Doris Vector Store.
Setup
%pip install --upgrade --quiet pymysql
Set update_vectordb = False at the beginning. If no docs have been updated, we don't need to rebuild the embeddings of the docs.
!pip install sqlalchemy
!pip install langchain
from langchain.chains import RetrievalQA
from langchain_community.document_loaders import (
    DirectoryLoader,
    UnstructuredMarkdownLoader,
)
from langchain_community.vectorstores.apache_doris import (
    ApacheDoris,
    ApacheDorisSettings,
)
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain_text_splitters import TokenTextSplitter
update_vectordb = False
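The update_vectordb flag is set by hand in this walkthrough. As a rough sketch of how it could be derived instead, the snippet below fingerprints the markdown files and reports whether anything changed since the last run. The helper names (docs_fingerprint, needs_rebuild) and the state file are hypothetical and are not part of LangChain or Apache Doris.

```python
import hashlib
from pathlib import Path


def docs_fingerprint(docs_dir: str, pattern: str = "**/*.md") -> str:
    """Return one SHA-256 digest covering the path and bytes of every matching file."""
    digest = hashlib.sha256()
    # Sort so the digest does not depend on filesystem listing order.
    for path in sorted(Path(docs_dir).glob(pattern)):
        if path.is_file():
            digest.update(path.as_posix().encode("utf-8"))
            digest.update(path.read_bytes())
    return digest.hexdigest()


def needs_rebuild(docs_dir: str, state_file: str) -> bool:
    """Compare the current fingerprint with the stored one, then store the new one."""
    current = docs_fingerprint(docs_dir)
    state = Path(state_file)
    previous = state.read_text(encoding="utf-8") if state.exists() else None
    state.write_text(current, encoding="utf-8")
    return current != previous
```

With these helpers, something like update_vectordb = needs_rebuild("./docs", ".docs_fingerprint") could replace the manual flag. Note that this sketch records the new fingerprint before the embeddings are actually rebuilt, so a failed rebuild would need the state file deleted by hand.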
Load docs and split them into tokens
Load all markdown files under the docs directory. For the Apache Doris documents, you can clone the repo from https://github.com/apache/doris, which contains a docs directory.
loader = DirectoryLoader(
    "./docs", glob="**/*.md", loader_cls=UnstructuredMarkdownLoader
)
documents = loader.load()
Split the docs into tokens, and set update_vectordb = True because there are new docs/tokens.
# load text splitter and split docs into snippets of text
text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
split_docs = text_splitter.split_documents(documents)
# tell vectordb to update text embeddings
update_vectordb = True
split_docs[-20]
print("# docs = %d, # splits = %d" % (len(documents), len(split_docs)))
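To make chunk_size and chunk_overlap more concrete, here is a minimal sketch of the sliding-window idea behind token-based splitting. It works on a plain list of token ids and is not the library's actual implementation; the function name chunk_tokens is hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[int], chunk_size: int, chunk_overlap: int
) -> List[List[int]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
        start += step
    return chunks


# 1000 fake token ids, split with the same parameters as above.
token_ids = list(range(1000))
chunks = chunk_tokens(token_ids, chunk_size=400, chunk_overlap=50)
print([(c[0], c[-1]) for c in chunks])
# -> [(0, 399), (350, 749), (700, 999)]
```

Each window starts 350 tokens after the previous one, so neighbouring chunks share 50 tokens and a sentence cut at a boundary still appears whole in at least one chunk.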
Create vectordb instance
Use Apache Doris as vectordb
def gen_apache_doris(update_vectordb, embeddings, settings):
    if update_vectordb:
        docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)
    else:
        docsearch = ApacheDoris(embeddings, settings)
    return docsearch
Convert tokens into embeddings and put them into vectordb
Here we use Apache Doris as the vectordb. You can configure the Apache Doris instance via ApacheDorisSettings.
Configuring an Apache Doris instance is much like configuring a MySQL instance. You need to specify:
1. host/port
2. username (default: "root")
3. password (default: "")
4. database (default: "default")
5. table (default: "langchain")
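Because Doris is configured much like MySQL, the same five values map directly onto a MySQL client connection. The sketch below mirrors the defaults listed above in a hypothetical DorisConnInfo dataclass (a stand-in for illustration, not the ApacheDorisSettings class) and converts it into keyword arguments of the kind pymysql.connect accepts. The "localhost" host default is an assumption, since the text does not state one.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DorisConnInfo:
    """Hypothetical stand-in mirroring the settings listed above."""

    host: str = "localhost"  # assumption: the text does not give a default host
    port: int = 9030         # the port used in this walkthrough
    username: str = "root"
    password: str = ""
    database: str = "default"
    table: str = "langchain"


def to_mysql_kwargs(info: DorisConnInfo) -> Dict[str, Any]:
    """Build keyword arguments for a MySQL-protocol client such as pymysql."""
    return {
        "host": info.host,
        "port": info.port,
        "user": info.username,
        "password": info.password,
        "database": info.database,
    }


info = DorisConnInfo(host="172.30.34.130", database="langchain")
print(to_mysql_kwargs(info))
```

Assuming pymysql is installed and the instance is reachable, pymysql.connect(**to_mysql_kwargs(info)) would be a quick way to check the connection details before building the vector store.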
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass()
update_vectordb = True
embeddings = OpenAIEmbeddings()
# configure Apache Doris settings(host/port/user/pw/db)
settings = ApacheDorisSettings()
settings.port = 9030
settings.host = "172.30.34.130"
settings.username = "root"
settings.password = ""
settings.database = "langchain"
docsearch = gen_apache_doris(update_vectordb, embeddings, settings)
print(docsearch)
update_vectordb = False
Build QA and ask it a question
llm = OpenAI()
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=docsearch.as_retriever()
)
query = "what is apache doris"
resp = qa.run(query)
print(resp)
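The "stuff" chain type places all retrieved chunks into a single prompt and sends that prompt to the LLM once. The sketch below illustrates the idea with plain strings. The function name build_stuff_prompt and the prompt wording are hypothetical, and this is not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Concatenate every retrieved chunk into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
example_chunks = [
    "Apache Doris is a modern data warehouse for real-time analytics.",
    "Apache Doris is usually categorized as an OLAP database.",
]
print(build_stuff_prompt("what is apache doris", example_chunks))
```

Since everything goes into one prompt, the number and size of the retrieved chunks need to fit within the model's context window, which is one reason the docs were split into 400-token chunks earlier.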