As of writing (Nov 22nd, 2023), Metal is enabled by default and arm64 is correctly detected: no special CMake flags are needed.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/
mkdir build && cd build/ && cmake .. && make -j && cd ..
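If the build succeeded, the main binary ends up in build/bin/. A quick sanity check (optional, just confirms the binary runs):
./build/bin/main -h | head -n 5   # prints the usage header if the build worked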
Choose your version of Mistral 7B on Hugging Face. I went with OpenOrca's Q5_K_S.
cd models/
wget https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q5_K_S.gguf
cd ..
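Note that wget is not part of a stock macOS install; curl works just as well, as does huggingface-cli if you have huggingface_hub installed. Both are shown below as alternatives, assuming the same Q5_K_S file:
curl -L -O https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q5_K_S.gguf
# or, after pip install -U huggingface_hub:
huggingface-cli download TheBloke/Mistral-7B-OpenOrca-GGUF mistral-7b-openorca.Q5_K_S.gguf --local-dir .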
For a ranking of different models, see Local LLM Comparison.
To load the model in interactive mode (-i) using a basic config suited to an M2 Mac with 16GB of RAM (10 threads, all 32 layers offloaded to the GPU):
./build/bin/main -t 10 -ngl 32 \
-m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
--color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 \
-i -ins
With the config above, I get a little over 90 tokens per second.
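That figure is read off the llama_print_timings summary that main prints on exit. For a more repeatable measurement, the llama-bench tool built alongside main can be used; a minimal sketch (flag names as of the llama.cpp version above, double-check against its --help):
./build/bin/llama-bench -m models/mistral-7b-openorca.Q5_K_S.gguf -ngl 32 -p 512 -n 128
It reports prompt-processing and text-generation speeds separately, which is more informative than a single tokens-per-second number.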
Refer to Mistral on Hugging Face and Mistral OpenOrca for general documentation on the model, and to ./build/bin/main -h for help with llama.cpp flags.
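For reference, a non-interactive one-shot run is the same command with -i -ins dropped and a prompt passed via -p. A sketch with an arbitrary prompt; -n 256 caps generation at 256 tokens instead of running unbounded:
./build/bin/main -t 10 -ngl 32 \
-m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
--color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 256 \
-p "Explain what GGUF quantization is in two sentences."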