As of writing (Nov 22nd, 2023), Metal is enabled by default and arm64 is correctly detected: no special CMake flags are needed.
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp/
mkdir build && cd build/ && cmake .. && make -j && cd ..
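If the build succeeded, the main binary ends up in build/bin/. A quick sanity check (optional, just confirms the binary runs):
./build/bin/main -h | head -n 5   # prints the usage header if the build worked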
Choose your version of Mistral 7B on Hugging Face. I went with OpenOrca's Q5_K_S.
cd models/
wget https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q5_K_S.gguf
cd ..
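Note that wget is not part of a stock macOS install; curl works just as well, as does huggingface-cli if you have huggingface_hub installed. Both are shown below as alternatives, assuming the same Q5_K_S file:
curl -L -O https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF/resolve/main/mistral-7b-openorca.Q5_K_S.gguf
# or, after pip install -U huggingface_hub:
huggingface-cli download TheBloke/Mistral-7B-OpenOrca-GGUF mistral-7b-openorca.Q5_K_S.gguf --local-dir .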
For a ranking of different models, see Local LLM Comparison.
To load the model in interactive mode (-i) using a basic config suited to an M2 Mac with 16GB of RAM (10 threads, all 32 layers offloaded to the GPU):
./build/bin/main -t 10 -ngl 32 \
-m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
--color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 \
-i -ins
With the config above, I get a little over 90 tokens per second.
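That figure is read off the llama_print_timings summary that main prints on exit. For a more repeatable measurement, the llama-bench tool built alongside main can be used; a minimal sketch (flag names as of the llama.cpp version above, double-check against its --help):
./build/bin/llama-bench -m models/mistral-7b-openorca.Q5_K_S.gguf -ngl 32 -p 512 -n 128
It reports prompt-processing and text-generation speeds separately, which is more informative than a single tokens-per-second number.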
Refer to Mistral on Hugging Face and Mistral OpenOrca for general documentation on the model, and to ./build/bin/main -h for help with llama.cpp flags.
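For reference, a non-interactive one-shot run is the same command with -i -ins dropped and a prompt passed via -p. A sketch with an arbitrary prompt; -n 256 caps generation at 256 tokens instead of running unbounded:
./build/bin/main -t 10 -ngl 32 \
-m "/path/to/mistral-7b-openorca.Q5_K_S.gguf" \
--color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n 256 \
-p "Explain what GGUF quantization is in two sentences."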