- llama.cpp
- git clone git@github.com:ggerganov/llama.cpp.git
- pip install -r llama.cpp/requirements.txt
- Compile llama.cpp: cd llama.cpp && make
- wikiextractor (used to turn a Wikipedia dump into the raw text that feeds imatrix generation)
- git clone git@github.com:attardi/wikiextractor.git
- cd wikiextractor
- wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
- this is a 20 GB+ file, so it may take a while to download
- python -m wikiextractor.WikiExtractor -o extracted enwiki-latest-pages-articles.xml.bz2
- cat extracted/*/* > wiki.train.raw
- mv wiki.train.raw ../
- cd ../
- rm -rf wikiextractor
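The concatenation step above can also be done in Python, which avoids shell argument-length limits on very large extractions; `concat_extracted` is a helper name of my own, not from the gist:

```python
from pathlib import Path

def concat_extracted(extracted_dir: str, out_file: str) -> int:
    """Merge every wikiextractor output file (extracted/AA/wiki_00, ...)
    into one raw training file; returns the number of files merged."""
    files = sorted(Path(extracted_dir).glob("*/wiki_*"))
    with open(out_file, "w", encoding="utf-8") as out:
        for f in files:
            out.write(f.read_text(encoding="utf-8"))
    return len(files)
```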
- Download the HF model (see download script)
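A minimal sketch of the download script referenced here, assuming the `huggingface_hub` package is installed; the repo id in the usage comment and both helper names are placeholders of mine:

```python
def default_local_dir(repo_id: str) -> str:
    """Derive a local folder name from a repo id, e.g. 'org/model' -> 'model'."""
    return repo_id.split("/")[-1]

def download_hf_model(repo_id: str) -> str:
    """Fetch the full HF checkpoint; returns the local snapshot path."""
    from huggingface_hub import snapshot_download  # deferred: needs network/auth
    return snapshot_download(repo_id=repo_id, local_dir=default_local_dir(repo_id))

# usage (hypothetical repo id):
# path = download_hf_model("mistralai/Mistral-7B-v0.1")
```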
- Convert HF model to gguf format
- python llama.cpp/convert.py {hf_model} --outfile {gguf_model} --outtype fp16
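The conversion call above, wrapped as a command builder; the output-naming convention is my own, not the gist's:

```python
from pathlib import Path

def convert_cmd(hf_model: str, outtype: str = "fp16") -> list:
    """Build the llama.cpp convert.py invocation for a local HF model dir."""
    gguf_model = f"{Path(hf_model).name}-{outtype}.gguf"
    return ["python", "llama.cpp/convert.py", hf_model,
            "--outfile", gguf_model, "--outtype", outtype]

# run it with e.g. subprocess.run(convert_cmd("models/My-7B"), check=True)
```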
- [Optional, only needed for IQ-style quantization] Generate the raw data for creating an imatrix file
- the wikiextractor steps above already produced this as wiki.train.raw; if you skipped them, run them now
- [Optional, only needed for IQ-style quantization] Generate the imatrix file
- ./llama.cpp/imatrix -m {gguf_model} -f wiki.train.raw -o imatrix_{gguf_model}.dat --chunks 100
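The same invocation as a command builder, mirroring the gist's `imatrix_{gguf_model}.dat` output naming; the helper name is mine:

```python
def imatrix_cmd(gguf_model: str, train_file: str = "wiki.train.raw",
                chunks: int = 100) -> list:
    """Build the llama.cpp imatrix invocation for a gguf model."""
    return ["./llama.cpp/imatrix", "-m", gguf_model, "-f", train_file,
            "-o", f"imatrix_{gguf_model}.dat", "--chunks", str(chunks)]
```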
- Quantize
- [IQ] ./llama.cpp/quantize --imatrix imatrix_{gguf_model}.dat {gguf_model} quantized_model.gguf iq2_xxs
- [legacy] ./llama.cpp/quantize {gguf_model} quantized_model_Q4_K_M.gguf Q4_K_M
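Both quantize variants share one shape, differing only in the `--imatrix` flag; a sketch that derives the output name from the quant type (the naming convention is mine):

```python
def quantize_cmd(gguf_model: str, qtype: str, imatrix: str = "") -> list:
    """Build the llama.cpp quantize invocation; pass an imatrix file
    for IQ-style quants, omit it for legacy quants like Q4_K_M."""
    out = gguf_model.replace(".gguf", f"_{qtype}.gguf")
    cmd = ["./llama.cpp/quantize"]
    if imatrix:
        cmd += ["--imatrix", imatrix]
    return cmd + [gguf_model, out, qtype]
```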
- Upload to HF (see upload script)
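A sketch of the upload script referenced here, assuming `huggingface_hub` is installed and you have already run `huggingface-cli login`; the repo id in the usage comment and both helper names are placeholders of mine:

```python
def repo_filename(gguf_path: str) -> str:
    """File name to use inside the HF repo, e.g. 'out/m.gguf' -> 'm.gguf'."""
    return gguf_path.rsplit("/", 1)[-1]

def upload_quantized(repo_id: str, gguf_path: str) -> None:
    """Create the repo if needed and push one quantized gguf file."""
    from huggingface_hub import HfApi  # deferred: needs auth + network
    api = HfApi()
    api.create_repo(repo_id, exist_ok=True)
    api.upload_file(path_or_fileobj=gguf_path,
                    path_in_repo=repo_filename(gguf_path),
                    repo_id=repo_id)

# usage (hypothetical repo id):
# upload_quantized("your-user/model-iq2_xxs", "quantized_model.gguf")
```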
Created May 16, 2024 21:31