IPEX-LLM
IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., a local PC with an iGPU, or discrete GPUs such as Arc, Flex and Max) with very low latency.
This example goes over how to use LangChain to interact with ipex-llm for text generation.
Setup
# Update LangChain
%pip install -qU langchain langchain-community
Install IPEX-LLM for running LLMs locally on Intel CPU.
%pip install --pre --upgrade ipex-llm[all]
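Optionally, confirm the installation with a quick import check; a minimal sketch, which should complete without error if the install succeeded:
import ipex_llm  # raises ImportError if the package is not installed correctly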
Basic Usage
import warnings
from langchain.chains import LLMChain
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate
warnings.filterwarnings("ignore", category=UserWarning, message=".*padding_mask.*")
Specify the prompt template for your model. In this example, we use the vicuna-1.5 model. If you're working with a different model, choose a proper template accordingly.
template = "USER: {question}\nASSISTANT:"
prompt = PromptTemplate(template=template, input_variables=["question"])
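For illustration, formatting the template shows the exact string that will be sent to the model:
print(prompt.format(question="What is AI?"))
# USER: What is AI?
# ASSISTANT: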
Load the model locally using IpexLLM.from_model_id. It will load the model directly in its Hugging Face format and convert it automatically to low-bit format for inference.
llm = IpexLLM.from_model_id(
model_id="lmsys/vicuna-7b-v1.5",
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
2024-04-24 21:20:12,461 - INFO - Converting the current model to sym_int4 format......
Use it in Chains:
llm_chain = LLMChain(prompt=prompt, llm=llm)
question = "What is AI?"
output = llm_chain.invoke(question)
/opt/anaconda3/envs/shane-langchain-3.11/lib/python3.11/site-packages/langchain_core/_api/deprecation.py:119: LangChainDeprecationWarning: The class `LLMChain` was deprecated in LangChain 0.1.17 and will be removed in 0.3.0. Use RunnableSequence, e.g., `prompt | llm` instead.
warn_deprecated(
/opt/anaconda3/envs/shane-langchain-3.11/lib/python3.11/site-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
AI stands for "Artificial Intelligence." It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI can be achieved through a combination of techniques such as machine learning, natural language processing, computer vision, and robotics. The ultimate goal of AI research is to create machines that can think and learn like humans, and can even exceed human capabilities in certain areas.
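As the deprecation warning above notes, newer LangChain versions favor composing the prompt and model directly into a RunnableSequence; an equivalent sketch:
llm_chain = prompt | llm
output = llm_chain.invoke({"question": "What is AI?"})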
Save/Load Low-bit Model
Alternatively, you might save the low-bit model to disk once and use from_model_id_low_bit instead of from_model_id to reload it for later use, even across different machines. It is space-efficient, as the low-bit model demands significantly less disk space than the original model. from_model_id_low_bit is also more efficient than from_model_id in terms of speed and memory usage, as it skips the model conversion step.
To save the low-bit model, use save_low_bit as follows.
saved_lowbit_model_path = "./vicuna-7b-1.5-low-bit" # path to save low-bit model
llm.model.save_low_bit(saved_lowbit_model_path)
del llm
Load the model from the saved low-bit model path as follows.
> Note that the saved path for the low-bit model only includes the model itself, not the tokenizer. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved.
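One way to colocate the tokenizer is to save it alongside the low-bit weights using the standard transformers API; a sketch, assuming network access to the original repository:
from transformers import AutoTokenizer

# fetch the tokenizer from the original model and save it next to the low-bit weights
AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5").save_pretrained(saved_lowbit_model_path)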
llm_lowbit = IpexLLM.from_model_id_low_bit(
model_id=saved_lowbit_model_path,
tokenizer_id="lmsys/vicuna-7b-v1.5",
# tokenizer_id=saved_lowbit_model_path,  # use this instead if you have copied the tokenizer files to the saved path
model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
2024-04-24 21:20:35,874 - INFO - Converting the current model to sym_int4 format......
Use the loaded model in Chains:
llm_chain = LLMChain(prompt=prompt, llm=llm_lowbit)
question = "What is AI?"
output = llm_chain.invoke(question)
/opt/anaconda3/envs/shane-langchain-3.11/lib/python3.11/site-packages/transformers/generation/utils.py:1369: UserWarning: Using `max_length`'s default (4096) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
AI stands for "Artificial Intelligence." It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, decision-making, and language translation. AI can be achieved through a combination of techniques such as machine learning, natural language processing, computer vision, and robotics. The ultimate goal of AI research is to create machines that can think and learn like humans, and can even exceed human capabilities in certain areas.