RTX 3060 x1 vs x16 results on 4-bit quantized Llama 8B Instruct models

Environment setup

The test host is an HP Z640 with an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz.

Device 0 is a GeForce RTX 3060 Lite Hash Rate (EVGA) connected via the cheapest USB x1 crypto mining riser I could find online.

Device 1 is a GeForce RTX 3060 (HP OEM) connected via an x16 PCIe 4.0 riser cable (although note that my host does not support anything beyond PCIe 3.0).
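To double-check what link each card actually negotiates, the current PCIe generation and width can be read per GPU (for example via nvidia-smi's pcie.link.* query fields). Below is a minimal sketch using the pynvml bindings; the package name and exact output formatting are assumptions, not part of the test procedure above.

```python
# Report the negotiated PCIe link generation and width for each GPU.
# Sketch only: assumes the pynvml package (nvidia-ml-py) is installed.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)  # may be bytes on older pynvml versions
        gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
        print(f"Device {i}: {name} -> PCIe gen {gen} x{width} (card max x{max_width})")
finally:
    pynvml.nvmlShutdown()
```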

tensorrt-llm 0.8.0

CUDA_VISIBLE_DEVICES=0 python3 examples/run.py --engine_dir=./engine_llama3_8b_int4 --max_output_len 256 --tokenizer_dir ~/models/meta-llama-Meta-Llama

batch_size: 1, avg latency of 10 iterations: : 3.902403378486633 sec

Result tensorrt-llm @ x1 TG256 65.64 tok/sec

CUDA_VISIBLE_DEVICES=1 python3 examples/run.py --engine_dir=./engine_llama3_8b_int4 --max_output_len 256 --tokenizer_dir ~/models/meta-llama-Meta-Llama

batch_size: 1, avg latency of 10 iterations: : 3.9053308963775635 sec

Result tensorrt-llm @ x16 TG256 65.55 tok/sec
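The TG256 figures are just the 256 generated tokens divided by the reported average per-iteration latency; for example, reproducing the x16 number:

```python
# TG tok/sec = generated tokens / average per-iteration latency (x16 run above).
output_tokens = 256
avg_latency_s = 3.9053308963775635
print(f"{output_tokens / avg_latency_s:.2f} tok/sec")  # 65.55 tok/sec
```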

llama.cpp build a8f9b076 (2775)

CUDA_VISIBLE_DEVICES=0 ./llama-bench -m ~/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -p 1024,2048,4096 -n 256

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes

| model                  | size     | params | backend | ngl | test    | t/s            |
| ---------------------- | -------- | ------ | ------- | --- | ------- | -------------- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | pp 1024 | 1530.91 ± 2.63 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | pp 2048 | 1413.44 ± 0.78 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | pp 4096 | 1232.10 ± 0.26 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | tg 256  | 55.02 ± 0.01   |

Result llama.cpp @ x1 PP1024 1531 tok/sec

Result llama.cpp @ x1 PP2048 1413 tok/sec

Result llama.cpp @ x1 PP4096 1232 tok/sec

Result llama.cpp @ x1 TG256 55.02 tok/sec

$ CUDA_VISIBLE_DEVICES=1 ./llama-bench -m ~/models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -p 1024,2048,4096 -n 256

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes

| model                  | size     | params | backend | ngl | test    | t/s            |
| ---------------------- | -------- | ------ | ------- | --- | ------- | -------------- |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | pp 1024 | 1581.53 ± 2.18 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | pp 2048 | 1458.56 ± 1.44 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | pp 4096 | 1276.09 ± 2.39 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | CUDA    | 99  | tg 256  | 57.01 ± 0.05   |

Result llama.cpp @ x16 PP1024 1581 tok/sec

Result llama.cpp @ x16 PP2048 1458 tok/sec

Result llama.cpp @ x16 PP4096 1276 tok/sec

Result llama.cpp @ x16 TG256 57.01 tok/sec

exllamav2 0.0.19

CUDA_VISIBLE_DEVICES=0 python3 ./test_inference.py --model ~/models/turboderp-Llama-3-8B-Instruct-exl2-4.0bpw/ --speed --prompt_speed

-- Model: /home/mike/models/turboderp-Llama-3-8B-Instruct-exl2-4.0bpw/
-- Options: []
-- Loading model...
-- Loaded model in 6.1473 seconds
-- Loading tokenizer...
-- Warmup...
-- Measuring prompt speed...
** Length 128 tokens: 923.6224 t/s
** Length 256 tokens: 1716.6676 t/s
** Length 384 tokens: 2001.7525 t/s
** Length 512 tokens: 2146.3504 t/s
** Length 640 tokens: 2094.3034 t/s
** Length 768 tokens: 2206.8227 t/s
** Length 896 tokens: 2179.4805 t/s
** Length 1024 tokens: 2137.2835 t/s
** Length 2048 tokens: 2028.1340 t/s
** Length 3072 tokens: 1873.8582 t/s
** Length 4096 tokens: 1702.1691 t/s
** Length 8192 tokens: 1188.9237 t/s
-- Measuring token speed...
** Position 1 + 127 tokens: 61.4902 t/s
** Position 128 + 128 tokens: 60.6980 t/s
** Position 256 + 128 tokens: 58.3989 t/s
** Position 384 + 128 tokens: 56.3200 t/s
** Position 512 + 128 tokens: 54.4677 t/s
** Position 640 + 128 tokens: 52.8844 t/s
** Position 768 + 128 tokens: 51.2690 t/s
** Position 896 + 128 tokens: 49.7531 t/s
** Position 1024 + 128 tokens: 48.2383 t/s

Result EXL2 @ x1 PP1024 2137 tok/sec

Result EXL2 @ x1 PP2048 2028 tok/sec

Result EXL2 @ x1 PP4096 1702 tok/sec

Result EXL2 @ x1 TG256 60.7 tok/sec

CUDA_VISIBLE_DEVICES=1 python3 ./test_inference.py --model ~/models/turboderp-Llama-3-8B-Instruct-exl2-4.0bpw/ --speed --prompt_speed

-- Model: /home/mike/models/turboderp-Llama-3-8B-Instruct-exl2-4.0bpw/
-- Options: []
-- Loading model...
-- Loaded model in 1.7702 seconds
-- Loading tokenizer...
-- Warmup...
-- Measuring prompt speed...
** Length 128 tokens: 983.2212 t/s
** Length 256 tokens: 1737.2786 t/s
** Length 384 tokens: 2034.9097 t/s
** Length 512 tokens: 2180.6713 t/s
** Length 640 tokens: 2126.0093 t/s
** Length 768 tokens: 2249.7304 t/s
** Length 896 tokens: 2199.7338 t/s
** Length 1024 tokens: 2156.4664 t/s
** Length 2048 tokens: 2072.5891 t/s
** Length 3072 tokens: 1897.5678 t/s
** Length 4096 tokens: 1730.1456 t/s
** Length 8192 tokens: 1204.0058 t/s
-- Measuring token speed...
** Position 1 + 127 tokens: 61.4660 t/s
** Position 128 + 128 tokens: 60.6269 t/s
** Position 256 + 128 tokens: 58.2973 t/s
** Position 384 + 128 tokens: 56.2245 t/s
** Position 512 + 128 tokens: 54.2728 t/s
** Position 640 + 128 tokens: 52.7710 t/s
** Position 768 + 128 tokens: 51.1547 t/s
** Position 896 + 128 tokens: 49.6570 t/s
** Position 1024 + 128 tokens: 48.1258 t/s
** Position 1152 + 128 tokens: 46.9813 t/s

Result EXL2 @ x16 PP1024 2156 tok/sec

Result EXL2 @ x16 PP2048 2072 tok/sec

Result EXL2 @ x16 PP4096 1730 tok/sec

Result EXL2 @ x16 TG256 60.6 tok/sec
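The EXL2 PP and TG summary lines are read off the "** Length ..." and "** Position ..." lines of the test_inference.py output above. If you want to pull those figures out of a saved log automatically, a small parser along these lines works against the format shown above (a sketch; the log filename and which position to report are assumptions):

```python
import re

# Parse exllamav2 test_inference.py output into {prompt_length: t/s} and
# {position: t/s}, matching the "** Length N tokens: X t/s" and
# "** Position N + M tokens: X t/s" lines captured above.
def parse_exl2_log(path):
    prompt_speed, token_speed = {}, {}
    with open(path) as f:
        for line in f:
            m = re.search(r"Length (\d+) tokens:\s*([\d.]+) t/s", line)
            if m:
                prompt_speed[int(m.group(1))] = float(m.group(2))
                continue
            m = re.search(r"Position (\d+) \+ \d+ tokens:\s*([\d.]+) t/s", line)
            if m:
                token_speed[int(m.group(1))] = float(m.group(2))
    return prompt_speed, token_speed

# Hypothetical usage: PP4096 and the early-context token speed quoted above as TG256.
pp, tg = parse_exl2_log("exl2_x16.log")
print(pp[4096], tg[128])
```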

vllm 0.4.0

$ CUDA_VISIBLE_DEVICES=0 python3 ./benchmark_throughput.py --model /models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ --input-len 8 --output-len 256 --num-prompts 1 --backend vllm

Namespace(backend='vllm', dataset=None, input_len=8, output_len=256, model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', device='cuda', enable_prefix_caching=False, download_dir=None)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 05-02 17:24:46 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-02 17:24:46 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-02 17:24:47 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 05-02 17:24:47 selector.py:21] Using XFormers backend.
INFO 05-02 17:24:56 model_runner.py:104] Loading model weights took 5.3472 GB
INFO 05-02 17:24:59 gpu_executor.py:94] # GPU blocks: 1707, # CPU blocks: 2048
INFO 05-02 17:25:00 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-02 17:25:00 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 05-02 17:25:07 model_runner.py:867] Graph capturing finished in 6 secs.
Processed prompts: 100%|██████████| 1/1 [00:04<00:00, 4.22s/it]
Throughput: 0.24 requests/s, 62.52 tokens/s

Result vllm-xformers @ x1 TG256 62.52 tok/sec

$ CUDA_VISIBLE_DEVICES=1 python3 ./benchmark_throughput.py --model /models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ --input-len 8 --output-len 256 --num-prompts 1 --backend vllm

Namespace(backend='vllm', dataset=None, input_len=8, output_len=256, model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', device='cuda', enable_prefix_caching=False, download_dir=None)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 05-02 17:25:30 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-02 17:25:30 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-02 17:25:30 selector.py:45] Cannot use FlashAttention because the package is not found. Please install it for better performance.
INFO 05-02 17:25:30 selector.py:21] Using XFormers backend.
INFO 05-02 17:25:33 model_runner.py:104] Loading model weights took 5.3472 GB
INFO 05-02 17:25:36 gpu_executor.py:94] # GPU blocks: 1707, # CPU blocks: 2048
INFO 05-02 17:25:37 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-02 17:25:37 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 05-02 17:25:44 model_runner.py:867] Graph capturing finished in 6 secs.
Processed prompts: 100%|██████████| 1/1 [00:04<00:00, 4.22s/it]
Throughput: 0.24 requests/s, 62.50 tokens/s

Result vllm-xformers @ x16 TG256 62.50 tok/sec

$ CUDA_VISIBLE_DEVICES=0 python3 ./benchmark_throughput.py --model /models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ --input-len 8 --output-len 256 --num-prompts 1 --backend vllm

Namespace(backend='vllm', dataset=None, input_len=8, output_len=256, model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', device='cuda', enable_prefix_caching=False, download_dir=None)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 05-02 17:26:24 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-02 17:26:24 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-02 17:26:25 selector.py:16] Using FlashAttention backend.
INFO 05-02 17:26:33 model_runner.py:104] Loading model weights took 5.3472 GB
INFO 05-02 17:26:37 gpu_executor.py:94] # GPU blocks: 1707, # CPU blocks: 2048
INFO 05-02 17:26:38 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-02 17:26:38 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 05-02 17:26:44 model_runner.py:867] Graph capturing finished in 6 secs.
Processed prompts: 100%|██████████| 1/1 [00:04<00:00, 4.20s/it]
Throughput: 0.24 requests/s, 62.85 tokens/s

Result vllm-flashattn @ x1 TG256 62.85 tok/sec

$ CUDA_VISIBLE_DEVICES=1 python3 ./benchmark_throughput.py --model /models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ --input-len 8 --output-len 256 --num-prompts 1 --backend vllm

Namespace(backend='vllm', dataset=None, input_len=8, output_len=256, model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', device='cuda', enable_prefix_caching=False, download_dir=None)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 05-02 17:27:05 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-02 17:27:05 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-02 17:27:05 selector.py:16] Using FlashAttention backend.
INFO 05-02 17:27:08 model_runner.py:104] Loading model weights took 5.3472 GB
INFO 05-02 17:27:11 gpu_executor.py:94] # GPU blocks: 1707, # CPU blocks: 2048
INFO 05-02 17:27:12 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-02 17:27:12 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 05-02 17:27:18 model_runner.py:867] Graph capturing finished in 6 secs.
Processed prompts: 100%|██████████| 1/1 [00:04<00:00, 4.21s/it]
Throughput: 0.24 requests/s, 62.67 tokens/s

Result vllm-flashattn @ x16 TG256 62.67 tok/sec

$ CUDA_VISIBLE_DEVICES=0 python3 ./benchmark_throughput.py --model /models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ --input-len 4096 --output-len 8 --num-prompts 1 --backend vllm

Namespace(backend='vllm', dataset=None, input_len=4096, output_len=8, model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', device='cuda', enable_prefix_caching=False, download_dir=None)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 05-02 17:28:22 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-02 17:28:22 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-02 17:28:23 selector.py:16] Using FlashAttention backend.
INFO 05-02 17:28:32 model_runner.py:104] Loading model weights took 5.3472 GB
INFO 05-02 17:28:35 gpu_executor.py:94] # GPU blocks: 1707, # CPU blocks: 2048
INFO 05-02 17:28:36 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-02 17:28:36 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 05-02 17:28:42 model_runner.py:867] Graph capturing finished in 6 secs.
Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.57s/it]
Throughput: 0.64 requests/s, 2608.18 tokens/s

Result vllm-flashattn @ x1 PP4096 2608 tok/sec

$ CUDA_VISIBLE_DEVICES=1 python3 ./benchmark_throughput.py --model /models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ --input-len 4096 --output-len 8 --num-prompts 1 --backend vllm

Namespace(backend='vllm', dataset=None, input_len=4096, output_len=8, model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', gpu_memory_utilization=0.9, enforce_eager=False, kv_cache_dtype='auto', device='cuda', enable_prefix_caching=False, download_dir=None)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 05-02 17:29:01 config.py:208] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-02 17:29:01 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer='/home/mike/models/MaziyarPanahi-Meta-Llama-3-8B-Instruct-GPTQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 05-02 17:29:01 selector.py:16] Using FlashAttention backend.
INFO 05-02 17:29:04 model_runner.py:104] Loading model weights took 5.3472 GB
INFO 05-02 17:29:07 gpu_executor.py:94] # GPU blocks: 1707, # CPU blocks: 2048
INFO 05-02 17:29:08 model_runner.py:791] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 05-02 17:29:08 model_runner.py:795] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.
INFO 05-02 17:29:14 model_runner.py:867] Graph capturing finished in 6 secs.
Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.59s/it]
Throughput: 0.63 requests/s, 2577.97 tokens/s

Result vllm-flashattn @ x16 PP4096 2577.97 tok/sec
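One caveat on the vllm numbers: benchmark_throughput.py's tokens/s figure counts prompt plus generated tokens (at least in 0.4.0), so the TG256 results above include the 8 prompt tokens and the PP4096 results include the 8 generated tokens. A quick check against the per-prompt times shown in the progress bars (approximate, since the s/it values are rounded):

```python
# vllm benchmark_throughput.py tokens/s = (prompt + generated tokens) / elapsed time.
def vllm_tok_per_sec(input_len, output_len, elapsed_s):
    return (input_len + output_len) / elapsed_s

print(f"{vllm_tok_per_sec(8, 256, 4.22):.1f}")   # ~62.6, in line with the TG256 runs
print(f"{vllm_tok_per_sec(4096, 8, 1.57):.0f}")  # ~2614, in line with PP4096 @ x1
```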

CSV

Engine,Host Speed,Test,Result x1,Result x16,Unit
tensorrt-llm,x1,TG256,65.64,65.55,tok/sec
llama.cpp,x1,PP1024,1531,1581,tok/sec
llama.cpp,x1,PP2048,1413,1458,tok/sec
llama.cpp,x1,PP4096,1232,1276,tok/sec
llama.cpp,x1,TG256,55.02,57.01,tok/sec
EXL2,x1,PP1024,2137,2156,tok/sec
EXL2,x1,PP2048,2028,2072,tok/sec
EXL2,x1,PP4096,1702,1730,tok/sec
EXL2,x1,TG256,60.7,60.6,tok/sec
vllm-xformers,x1,TG256,62.52,62.5,tok/sec
vllm-flashattn,x1,TG256,62.85,62.67,tok/sec
vllm-flashattn,x1,PP4096,2608,2577.97,tok/sec
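To compare the two columns, a short script can read the CSV above and print the relative x16-vs-x1 difference per row (a sketch; assumes the table has been saved as summary.csv):

```python
import csv

# Print the percentage difference between the x16 and x1 columns of the summary CSV.
with open("summary.csv", newline="") as f:
    for row in csv.DictReader(f):
        x1 = float(row["Result x1"])
        x16 = float(row["Result x16"])
        delta_pct = (x16 - x1) / x1 * 100.0  # positive means x16 was faster
        print(f"{row['Engine']:>15} {row['Test']:>7}  x1={x1:9.2f}  x16={x16:9.2f}  {delta_pct:+5.1f}%")
```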
