Dockerfile for running llama.cpp with Nvidia GPU support.
Install Docker and the NVIDIA Container Toolkit. Instructions for Arch Linux here.
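On Arch Linux this amounts to roughly the following (a sketch; nvidia-container-toolkit may need to be installed from the AUR depending on your setup):
sudo pacman -S docker nvidia-container-toolkit
sudo systemctl enable --now docker
Build the image: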
docker build -t llama-cpp-cuda:0.0.1 .
Create a model directory:
mkdir -p ~/models
It will be used for storing LLMs and configuration files.
Download a model that supports the new (as of June 2023) k-quant methods in llama.cpp, for example Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin, and place it in the models directory.
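For example (the URL below is a placeholder, not a real download link; substitute the link for whichever quantized model you chose):
wget -P ~/models https://example.com/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin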
Edit prompt.sh to set the model path. Also set the number of CPU threads and the number of GPU layers to offload, depending on your hardware.
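The relevant settings might look roughly like this (illustrative values only; the actual variable names in prompt.sh may differ):
MODEL=/models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin   # path inside the container
THREADS=12       # number of physical CPU cores
GPU_LAYERS=40    # layers to offload to VRAM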
Link or copy it to a directory on your $PATH:
ln prompt.sh ~/.local/bin
chmod +x ~/.local/bin/prompt.sh
Run it with a storyteller prompt:
prompt.sh storyteller "a mysterious forest"
Once upon a time, there was a vast and ancient forest that stretched for miles in every direction. It was said to be enchanted, with strange and wondrous creatures living within its depths. The trees were tall and gnarled...
Run it with an instruct prompt:
prompt.sh instruct "build a bicycle"
To build a bicycle, you will need the following components:
- Frame: The main body of the bike that supports the wheels and seat.
- Wheels: The large wheel in front and the smaller one in back that roll along the ground.
- Pedals: The circular rotating devices that allow...
This container was tested on the following hardware:
- AMD Ryzen 9 3900XT 12-Core
- 1x Nvidia GTX 1080 Ti 11GB
Performance is approximately doubled with GPU offloading.
The output from llama.cpp should look like this:
main: build = 710 (b24c304)
main: seed = 1687136441
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1080 Ti
llama.cpp: loading model from /models/Wizard-Vicuna-13B-Uncensored.ggmlv3.q4_K_S.bin
...
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 2135.98 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloaded 42/43 layers to GPU
llama_model_load_internal: total VRAM used: 8212 MB