Author: @tail-call · Last active September 18, 2024
Studying "Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward": adapting a table for plotting
| Method | Quantization Type | WM (GB) | RM (GB) | Tokens/sec | Perplexity | NVIDIA GPU | AMD GPU | Apple Silicon CPU | Intel GPU | AWS Inferentia2 | WebGPU | WASM | Adreno | Mali |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama.cpp | GGUF K-Quant 2bit | 2.36 | 3.69 | 102.15 | 6.96 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Llama.cpp | GGUF 4bit (check) | 3.56 | 4.88 | 128.97 | 5.96 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Llama.cpp | GGUF AWQ 4bit | 3.56 | 4.88 | 129.25 | 5.91 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Llama.cpp | GGUF K-Quant 4bit | 3.59 | 4.90 | 109.72 | 5.87 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Llama.cpp | GGUF 8bit | 6.67 | 7.78 | 93.39 | 5.79 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Llama.cpp | GGUF FP16 | 12.55 | 13.22 | 66.81 | 5.79 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| ExLlama | GPTQ 4bit | 3.63 | 5.35 | 77.10 | 6.08 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ExLlamav2 | EXL2 2bit | 2.01 | 5.21 | 153.75 | 20.21 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ExLlamav2 | EXL2 4bit | 3.36 | 6.61 | 131.68 | 6.12 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ExLlamav2 | GPTQ 4bit | 3.63 | 6.93 | 151.30 | 6.03 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ExLlamav2 | EXL2 8bit | 6.37 | 9.47 | 115.81 | 5.76 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ExLlamav2 | FP16 | 12.55 | 15.09 | 67.70 | 5.73 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| vLLM | AWQ GEMM 4bit | 3.62 | 34.55 | 114.43 | 6.02 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| vLLM | GPTQ 4bit | 3.63 | 36.51 | 172.88 | 6.08 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| vLLM | FP16 | 12.55 | 35.92 | 79.74 | 5.85 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TensorRT-LLM | AWQ GEMM 4bit | 3.42 | 5.69 | 194.86 | 6.02 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TensorRT-LLM | GPTQ 4bit | 3.60 | 5.88 | 202.16 | 6.08 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TensorRT-LLM | INT8 | 6.53 | 8.55 | 143.57 | 5.89 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TensorRT-LLM | FP16 | 12.55 | 14.61 | 83.43 | 5.85 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| TGI | AWQ GEMM 4bit | 3.62 | 7.97 | 30.80 | 6.02 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| TGI | AWQ GEMV 4bit | 3.62 | 7.96 | 34.22 | 6.02 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| TGI | GPTQ 4bit | 3.69 | 39.39 | 34.86 | 6.08 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| TGI | FP4 | 12.55 | 17.02 | 34.38 | 6.15 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| TGI | NF4 | 12.55 | 17.02 | 33.93 | 6.02 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| TGI | INT8 | 12.55 | 11.66 | 5.39 | 5.89 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| TGI | FP16 | 12.55 | 17.02 | 34.23 | 5.85 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| MLC-LLM | OmniQuant 3bit | 3.2 | 5.1 | 83.4 | 6.65 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| MLC-LLM | OmniQuant 4bit | 3.8 | 5.7 | 134.2 | 5.97 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| MLC-LLM | AWQ GEMM 4bit | 3.62 | 6.50 | 23.62 | 6.02 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| MLC-LLM | Q4F16 | 3.53 | 6.50 | 189.07 | — | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| MLC-LLM | Q3F16 | 2.84 | 5.98 | 185.47 | — | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| MLC-LLM | FP16 | 12.55 | 15.38 | 87.37 | 5.85 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |

Perplexity is not given in the source table for the MLC-LLM Q4F16 and Q3F16 rows; 1/0 in the hardware columns indicates whether the framework supports that backend.
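Since the goal here is adapting this table for plotting, a minimal pandas sketch may help. The rows below are transcribed from the table above (only a subset, and the hardware-support flags are omitted); the column names `WM_GB`, `RM_GB`, `TokPerSec` are my own labels, not the survey's:

```python
import pandas as pd

# A few rows transcribed from the table above:
# (Method, Quantization, WM in GB, RM in GB, Tokens/sec, Perplexity)
rows = [
    ("Llama.cpp",    "GGUF K-Quant 2bit",  2.36,  3.69, 102.15, 6.96),
    ("Llama.cpp",    "GGUF 8bit",          6.67,  7.78,  93.39, 5.79),
    ("ExLlamav2",    "EXL2 4bit",          3.36,  6.61, 131.68, 6.12),
    ("vLLM",         "GPTQ 4bit",          3.63, 36.51, 172.88, 6.08),
    ("TensorRT-LLM", "GPTQ 4bit",          3.60,  5.88, 202.16, 6.08),
]
df = pd.DataFrame(rows, columns=[
    "Method", "Quant", "WM_GB", "RM_GB", "TokPerSec", "Perplexity",
])

# Pick out the fastest configuration as a sanity check before plotting.
fastest = df.loc[df["TokPerSec"].idxmax()]
print(fastest["Method"], fastest["Quant"])  # prints: TensorRT-LLM GPTQ 4bit

# With matplotlib installed, a throughput-vs-weight-memory scatter is then
# a one-liner (not executed here):
# df.plot.scatter(x="WM_GB", y="TokPerSec")
```

Plotting `TokPerSec` against `WM_GB` makes the quantization trade-off visible at a glance: lower-bit formats shrink weight memory but, past a point, cost perplexity.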