Today on Hacker News, the top article was "LLaMA2 Chat 70B outperformed ChatGPT", linking to a leaderboard of LLMs. As of today, July 27, 2023, the top 12 are as follows:
Model Name | Win Rate | Length |
---|---|---|
GPT-4 | 95.28% | 1365 |
LLaMA2 Chat 70B | 92.66% | 1790 |
Claude 2 | 91.36% | 1069 |
ChatGPT | 89.37% | 827 |
WizardLM 13B V1.2 | 89.17% | 1635 |
Vicuna 33B v1.3 | 88.99% | 1479 |
Claude | 88.39% | 1082 |
OpenChat V2-W 13B | 87.13% | 1566 |
WizardLM 13B V1.1 | 86.32% | 1525 |
OpenChat V2 13B | 84.97% | 1564 |
Vicuna 13B v1.3 | 82.11% | 1132 |
LLaMA2 Chat 13B | 81.09% | 1513 |
Incidentally, I have been playing around with inference on my own Mac, randomly trying different models, and the leaderboard has been a good place to focus my experimentation.
I'm currently running a MacBook Pro with an M1 Max chip (64 GB RAM). Much to my surprise, I can run inference on most open-source LLMs using llama.cpp. Here's the setup:
- Download and install the text generation UI
- Follow these instructions to use the llama.cpp backend
- Download GGML models (llama.cpp-formatted files) from TheBloke on HF and `ln -s` them to the `models` directory
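Concretely, the linking step looks something like the following. The directory paths are assumptions; adjust them to wherever you keep your downloads and wherever the text generation UI is installed:

```shell
# Paths are illustrative -- point these at your own download folder
# and the UI's install location.
GGML_DIR="$HOME/Downloads/ggml-models"
UI_MODELS_DIR="$HOME/text-generation-webui/models"

mkdir -p "$UI_MODELS_DIR"
# Symlink every downloaded GGML .bin into the UI's models directory so
# the files show up in the model dropdown without duplicating ~40 GB of
# weights on disk. -f replaces any stale link from a previous run.
for f in "$GGML_DIR"/*.bin; do
  ln -sf "$f" "$UI_MODELS_DIR/$(basename "$f")"
done
```

Symlinking rather than moving the files also lets several tools share one copy of each model.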
Here is the list of models I'm experimenting with:
- TheBloke/Llama-2-70B-Chat-GGML
- TheBloke/WizardLM-13B-V1.2-GGML
- TheBloke/vicuna-33B-GGML
- TheBloke/Vicuna-33B-1-3-SuperHOT-8K-GGML
- TheBloke/WizardLM-13B-V1.1-GGML
- TheBloke/WizardLM-13B-V1-1-SuperHOT-8K-GGML
- TheBloke/vicuna-13b-v1.3.0-GGML
Note: I'm using the `q4_K_M.bin` version of these models; the model cards on HF and the llama.cpp repo have a more detailed discussion of the different quantization levels.
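For a rough sense of why quantization matters on a 64 GB machine, here's a back-of-the-envelope size estimate. The ~4.5 bits-per-weight figure for `q4_K_M` is an approximation I've seen in llama.cpp discussions, not an exact spec, and this ignores context/KV-cache overhead:

```shell
# Approximate model file size: parameters (billions) * bits per weight / 8
# gives gigabytes directly. 4.5 bits/weight for q4_K_M is an assumption.
awk -v params_b=13 -v bits=4.5 \
    'BEGIN { printf "~%.1f GB\n", params_b * bits / 8 }'
```

By the same arithmetic, a 70B model at ~4.5 bits/weight lands around 40 GB, which is why it still fits in 64 GB of unified memory while a full fp16 copy (~140 GB) would not.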
Typically for 13B models I'm achieving 5-7 tokens/sec, whereas with the larger models I'm getting 1-2 tokens/sec. For scale, at 5 tokens/sec a typical 1,500-token response takes about five minutes. I have yet to dive deeper into parameter tuning for performance.
As for the results, my main use case is data engineering tasks such as parsing SQL, reformatting code, converting unstructured data to structured formats, and so on. So far, I've been pleasantly surprised with the results compared to OpenAI's models. I will continue to experiment with these models and find new tasks for these free and open AIs!
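To make the structured-extraction use case concrete, here's the shape of prompt I feed these models. The sample record and the prompt wording are illustrative, my own rather than from any benchmark:

```shell
# Build an extraction prompt asking the model to emit JSON.
# The record below is made-up sample data.
record='Order #1042 shipped to Springfield on 2023-07-20 for $19.99'
prompt="Convert the following record to JSON with keys order_id, city, date, amount:
$record
JSON:"
printf '%s\n' "$prompt"
# Paste the prompt into the text generation UI, or pass it straight to
# llama.cpp's CLI:  ./main -m models/<model>.bin -p "$prompt"
```

Ending the prompt with `JSON:` nudges the model to start its completion with the structured output instead of conversational preamble.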