the-crypt-keeper

Quad P40 Xeon Build

I accidentally 4x P40s and need something to plug them into.
Original plan fell apart: the Dell R730 outsmarted me. It seems to force any GPUs past the first two to power limit themselves and refuse to draw any external power.
EPYCs don't seem to make sense in this scenario: the host mobo+CPU alone would cost more than the GPUs.
I'd prefer to avoid dual-socket Xeon systems because of both the additional idle power and NUMA.
With 96GB VRAM, I need at least 128GB system RAM to not have problems.
- I have a large stash of ECC DDR4 2133 modules.

Test Host is an HP z640 with Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz

Device 0 is a GeForce RTX 3060 Lite Hash Rate (EVGA) connected via the cheapest USB x1 crypto mining riser I could find online.

Device 1 is a GeForce RTX 3060 (HP OEM) connected via an x16 PCIe 4.0 riser cable (although note that my host does not support anything past PCIe 3.0).

idx	Languages	Model	Params	Template	Passed	Score
37	python	openai-chatgpt	precise	chat-generic	65	1.0
38	javascript	openai-chatgpt	precise	chat-generic	65	1.0
52	javascript	ggml-vicuna-13B-1.1-q5	precise	Vicuna-1p1-v2	58	0.8923076923076924
31	javascript	ai21-j2-jumbo-instruct	precise	chat-generic	57	0.8769230769230769
4	python	ggml-vicuna-13B-1.1-q5	precise	Vicuna-1p1	57	0.8769230769230769
5	javascript	ggml-vicuna-13B-1.1-q5	precise	Vicuna-1p1	57	0.8769230769230769
30	python	ai21-j2-jumbo-instruct	precise	chat-generic	57	0.8769230769230769
35	python	tsumeone-llama-30b-supercot-4bit-cuda	precise	Alpaca	56	0.8615384615384616
26	python	airoboros-13b-gpt4.ggmlv3.q5	precise	Vicuna-FastChat	55	0.8461538461538461
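
The Score column looks like it is simply Passed divided by the total number of checks. A minimal sketch of that relationship follows, assuming every row was graded against the same 65 checks (inferred from the rows where Passed = 65 gives Score = 1.0, not stated anywhere above). The names `TOTAL_CHECKS` and `score` are made up for illustration.

```python
# Assumption: all rows share the same total of 65 checks, inferred from
# the rows where Passed = 65 corresponds to Score = 1.0.
TOTAL_CHECKS = 65


def score(passed: int, total: int = TOTAL_CHECKS) -> float:
    """Return the fraction of checks passed (hypothetical helper)."""
    if not 0 <= passed <= total:
        raise ValueError(f"passed must be between 0 and {total}, got {passed}")
    return passed / total


# Passed counts copied from a few rows of the table above.
passed_by_model = {
    "openai-chatgpt": 65,
    "ggml-vicuna-13B-1.1-q5 (Vicuna-1p1-v2)": 58,
    "tsumeone-llama-30b-supercot-4bit-cuda": 56,
    "airoboros-13b-gpt4.ggmlv3.q5": 55,
}

for model, passed in passed_by_model.items():
    print(f"{model}: {passed}/{TOTAL_CHECKS} = {score(passed):.4f}")
```

Rounding to four decimals for display keeps the table easier to scan than the full float repr while still separating adjacent rows, since one check is worth about 0.0154.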

	import requests
	import json
	import sys
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

	SYSTEM_PROMPT = "You are a helpful assistant."
	messages = [{"role": "system", "content": SYSTEM_PROMPT}]

	#
	# this is adapted from https://github.com/facebookresearch/llama/blob/main/llama/generation.py#L213
	# the tokenizer is replaced with ord() to make it easier to see what's actually happening

	from typing_extensions import TypedDict, Literal
	from typing import List, Optional

	Role = Literal["system", "user", "assistant"]

	class Message(TypedDict):