Insert your Hugging Face token in the following Python code and execute it. This step essentially clones the model's Git repo into a folder named llama3-8b-instruct-hf.
from huggingface_hub import snapshot_download

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
access_token = "hf_XYZ"  # replace with your own Hugging Face access token

# Download the full model repository into ./llama3-8b-instruct-hf
snapshot_download(repo_id=model_id, local_dir="llama3-8b-instruct-hf",
                  local_dir_use_symlinks=False, revision="main", token=access_token)
Clone the llama.cpp Git repo:
git clone https://github.com/ggerganov/llama.cpp.git
GGUF is the model format used by llama.cpp. Convert the model using the conversion tool from the llama.cpp repo:
python llama.cpp/convert.py ./llama3-8b-instruct-hf/ --outfile llama3-8b-instruct.gguf --outtype f32 --vocab-type bpe
Please note that the original model weights are stored in BF16, which covers a wider dynamic range than FP16 (it keeps FP32's 8 exponent bits). Converting to FP16 could therefore clip large values, so we convert to FP32 here and quantize from that (source).
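To make the range difference concrete, here is a small sketch (assuming PyTorch is installed, which is not otherwise required for this walkthrough): a value far above FP16's maximum of roughly 65,504 still fits in BF16.

import torch

x = 1e38  # far beyond FP16's maximum (~65504), but within BF16/FP32 range
print(torch.tensor(x, dtype=torch.bfloat16))  # ~1e38, representable
print(torch.tensor(x, dtype=torch.float16))   # inf, overflows FP16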
You can either build the code in the repository yourself or download a binary release that best matches your CPU features. If you are unsure which AVX features your CPU supports, use a tool like CPU-Z or look up your CPU on the Intel website.
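Alternatively, you can check from Python with the third-party py-cpuinfo package (pip install py-cpuinfo; it is not needed anywhere else in this guide). A minimal sketch:

import cpuinfo  # provided by the "py-cpuinfo" package

# List the AVX-related instruction set flags reported by the CPU
flags = cpuinfo.get_cpu_info().get("flags", [])
print([f for f in flags if f.startswith("avx")])  # e.g. ['avx', 'avx2']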
Quantization makes the model usable on smaller machines by sacrificing some quality of the model output. We will quantize to q4_k_m, which provides a good balance between speed and quality (source). To perform the quantization, execute:
.\quantize.exe llama3-8b-instruct.gguf llama3-8b-instruct-q4_k_m.gguf q4_k_m
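To see how much quantization shrinks the model, you can compare the two files afterwards (a quick Python check; the file names assume the commands above):

import os

# Print the size of the FP32 and the q4_k_m GGUF files in GiB
for name in ("llama3-8b-instruct.gguf", "llama3-8b-instruct-q4_k_m.gguf"):
    print(name, round(os.path.getsize(name) / 1024**3, 1), "GiB")

For an 8B-parameter model, the FP32 file weighs in at roughly 32 GB, while the q4_k_m file ends up at around 5 GB.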
Now it's time to start a conversation:
.\main.exe -m llama3-8b-instruct-q4_k_m.gguf --n_predict -1 --keep -1 -i -r "USER:" -p "You are a helpful assistant. USER: Who directed Jurassic Park? ASSISTANT:"
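If you would rather call the quantized model from Python instead of the interactive CLI, the separately installed llama-cpp-python bindings (pip install llama-cpp-python, not covered in this walkthrough) can load the same GGUF file. A minimal sketch:

from llama_cpp import Llama

# Load the quantized GGUF produced in the previous step
llm = Llama(model_path="llama3-8b-instruct-q4_k_m.gguf", n_ctx=2048)

response = llm.create_chat_completion(messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who directed Jurassic Park?"},
])
print(response["choices"][0]["message"]["content"])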
Have fun! 😊