Insert your Hugging Face token in the following Python code and execute it. This step essentially clones the model's Git repo into a folder named llama3-8b-instruct-hf.
from huggingface_hub import snapshot_download

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
access_token = "hf_XYZ"  # replace with your own Hugging Face access token

# Download the full model repository into ./llama3-8b-instruct-hf
snapshot_download(repo_id=model_id, local_dir="llama3-8b-instruct-hf",
                  local_dir_use_symlinks=False, revision="main", token=access_token)
Clone the llama.cpp Git repo:
git clone https://github.com/ggerganov/llama.cpp.git
GGUF is the model format used by llama.cpp. Convert the model using the conversion tool from the llama.cpp repo:
python llama.cpp/convert.py ./llama3-8b-instruct-hf/ --outfile llama3-8b-instruct.gguf --outtype f32 --vocab-type bpe
Please note that the original model weights are stored in BF16, which covers a wider dynamic range than FP16 (it keeps FP32's 8 exponent bits). Converting to FP16 could therefore clip large values, so we convert to FP32 here and quantize from that (source).
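To make the range difference concrete, here is a small sketch (assuming PyTorch is installed, which is not otherwise required for this walkthrough): a value far above FP16's maximum of roughly 65,504 still fits in BF16.

import torch

x = 1e38  # far beyond FP16's maximum (~65504), but within BF16/FP32 range
print(torch.tensor(x, dtype=torch.bfloat16))  # ~1e38, representable
print(torch.tensor(x, dtype=torch.float16))   # inf, overflows FP16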
You can either build the code in the repository yourself or download a binary release that best matches your CPU features. If you are unsure which AVX features your CPU supports, use a tool like CPU-Z or look up your CPU on the Intel website.
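Alternatively, you can check from Python with the third-party py-cpuinfo package (pip install py-cpuinfo; it is not needed anywhere else in this guide). A minimal sketch:

import cpuinfo  # provided by the "py-cpuinfo" package

# List the AVX-related instruction set flags reported by the CPU
flags = cpuinfo.get_cpu_info().get("flags", [])
print([f for f in flags if f.startswith("avx")])  # e.g. ['avx', 'avx2']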
Quantization makes the model usable on smaller machines by sacrificing some quality of the model output. We will quantize to q4_k_m, which provides a good balance between speed and quality (source). To perform the quantization, execute:
.\quantize.exe llama3-8b-instruct.gguf llama3-8b-instruct-q4_k_m.gguf q4_k_m
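To see how much quantization shrinks the model, you can compare the two files afterwards (a quick Python check; the file names assume the commands above):

import os

# Print the size of the FP32 and the q4_k_m GGUF files in GiB
for name in ("llama3-8b-instruct.gguf", "llama3-8b-instruct-q4_k_m.gguf"):
    print(name, round(os.path.getsize(name) / 1024**3, 1), "GiB")

For an 8B-parameter model, the FP32 file weighs in at roughly 32 GB, while the q4_k_m file ends up at around 5 GB.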
Now it's time to start a conversation:
.\main.exe -m llama3-8b-instruct-q4_k_m.gguf --n_predict -1 --keep -1 -i -r "USER:" -p "You are a helpful assistant. USER: Who directed Jurassic Park? ASSISTANT:"
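If you would rather call the quantized model from Python instead of the interactive CLI, the separately installed llama-cpp-python bindings (pip install llama-cpp-python, not covered in this walkthrough) can load the same GGUF file. A minimal sketch:

from llama_cpp import Llama

# Load the quantized GGUF produced in the previous step
llm = Llama(model_path="llama3-8b-instruct-q4_k_m.gguf", n_ctx=2048)

response = llm.create_chat_completion(messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who directed Jurassic Park?"},
])
print(response["choices"][0]["message"]["content"])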
Have fun! 😊