Neural Speed (NS) is designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) model compression techniques. The work is highly inspired by llama.cpp.
Intel® Extension for Transformers (ITREX) is an innovative toolkit to accelerate Transformer-based models on Intel platforms, and is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids).
Basically, NS is an optional dependency of ITREX. You can install ITREX via binary wheel, and NS will be installed as one of the requirements.
# define install requirements
install_requires_list = ['packaging', 'numpy', 'schema', 'pyyaml']
- opt_install_requires_list = ['neural_compressor', 'transformers']
+ opt_install_requires_list = ['neural_compressor', 'transformers', 'neural_speed']
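For reference, the binary-wheel route mentioned above is a single pip command (assuming the package published on PyPI under this name):

pip install intel-extension-for-transformers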
Or you can install ITREX from source and control the NS installation manually.
# in the root directory of ITREX
NS=true pip install .
Or you can install the latest NS as a separate Python package by building it from source.
# in the root directory of NS
pip install .
Verify the installation via
from intel_extension_for_transformers.utils import itrex_utils
itrex_utils.is_ns_available()
# optional: check both NS and GPU-related availability
itrex_utils.is_ns_available("gpu")
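If you want your own code to degrade gracefully when NS is absent, you can guard the dispatch behind this check; a minimal sketch, assuming is_ns_available() returns a bool:

from intel_extension_for_transformers.utils import itrex_utils

# prefer the NS runtime when it is installed, otherwise fall back to the default path
use_llm_runtime = itrex_utils.is_ns_available()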
As detailed in the ITREX documentation, NS is the default inference option. Moreover, the following ITREX example demonstrates how to leverage the NS 4-bit capability:
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
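Depending on your ITREX version, the quantization can also be configured explicitly instead of using the load_in_4bit shortcut; a sketch assuming the WeightOnlyQuantConfig API is available:

from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# INT4 weights with INT8 compute; field names assume the WeightOnlyQuantConfig API
woq_config = WeightOnlyQuantConfig(weight_dtype="int4", compute_dtype="int8")
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)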
ITREX also plans to support the diffusers API, in which you can enable NS inference for the language-model components. You can dispatch the text encoder to NS manually:
from transformers import CLIPTokenizer, CLIPFeatureExtractor
from diffusers import AutoencoderKL, UNet2DConditionModel, StableDiffusionPipeline, PNDMScheduler
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

pretrained_model_name_or_path = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer")
model_name = "Intel/xxx-encode" # Hugging Face model_id or local model
text_encoder = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
vae = AutoencoderKL.from_pretrained(pretrained_model_name_or_path, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(pretrained_model_name_or_path, subfolder="unet")
pipeline = StableDiffusionPipeline(
text_encoder=text_encoder,
vae=vae,
unet=unet,
tokenizer=tokenizer,
scheduler=PNDMScheduler.from_pretrained(pretrained_model_name_or_path, subfolder="scheduler"),
safety_checker=StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker"),
feature_extractor=CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32"),
)
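Once assembled, the pipeline runs like any other diffusers pipeline; the prompt below is only an example:

# generate one image with the NS-backed text encoder
image = pipeline("a photo of an astronaut riding a horse").images[0]
image.save("output.png")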
Or you can leverage ITREX end-to-end:
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForDiffuser
model_name = "Intel/XXX-sd" # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
model = AutoModelForDiffuser.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, steps=50)
Under the hood, we simply replace the original LLM model with its NS counterpart:
if use_llm_runtime:
    logger.info("Using LLM runtime.")
    quantization_config.post_init_runtime()
    from neural_speed import Model

    # build the NS model and quantize/convert it according to the config
    model = Model()
    model.init(
        pretrained_model_name_or_path,
        weight_dtype=quantization_config.weight_dtype,
        alg=quantization_config.scheme,
        group_size=quantization_config.group_size,
        scale_dtype=quantization_config.scale_dtype,
        compute_dtype=quantization_config.compute_dtype,
        use_ggml=quantization_config.use_ggml,
        not_quant=quantization_config.not_quant,
        use_cache=quantization_config.use_cache,
    )
    return model
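For illustration, the quantization_config consumed above carries the fields below; this dataclass is a hypothetical stand-in for the actual ITREX config class, with illustrative defaults:

from dataclasses import dataclass

@dataclass
class QuantizationConfig:  # hypothetical stand-in for ITREX's config class
    weight_dtype: str = "int4"    # quantized weight precision
    scheme: str = "sym"           # quantization algorithm (symmetric/asymmetric)
    group_size: int = 32          # per-group quantization granularity
    scale_dtype: str = "fp32"     # precision of the quantization scales
    compute_dtype: str = "int8"   # precision used during compute
    use_ggml: bool = False        # use GGML kernels instead of Bestla
    not_quant: bool = False       # skip quantization entirely
    use_cache: bool = True        # reuse an already-converted model file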
The overall dispatch flow across backends is summarized below:

graph LR;
    Transformer-API --> ITREX;
    ITREX --> CPU;
    CPU --> |Weight-only quantized LLM sub-models| NeuralSpeed;
    ITREX --> GPU;
    GPU --> IPEX;
    CPU --> |Other sub-models| IPEX;
    NeuralSpeed --> |CPU| Bestla;
    IPEX --> |CPU| Bestla;
    IPEX --> |GPU| XeTLA;