Created August 6, 2024 14:46
Podman with libkrun for containerized AI workload acceleration
% podman run --rm -ti --device /dev/dri -v ~/Downloads:/models:Z quay.io/slopezpa/fedora-vgpu-llama main --temp 0 -m models/Llama-3-ELYZA-JP-8B-q4_k_m.gguf -b 512 -ngl 99 -p "Podmanのlibkrun providerについて教えて下さい"
Log start
main: build = 2238 (56d03d92)
main: built with cc (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6) for aarch64-redhat-linux
main: seed = 1722955285
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M2) | uma: 1 | fp16: 1 | warp size: 32
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama-3-ELYZA-JP-8B-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Llama-3-8B-optimal-merged-stage2
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.58 GiB (4.89 BPW)
llm_load_print_meta: general.name = Llama-3-8B-optimal-merged-stage2
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 281.81 MiB
llm_load_tensors: Vulkan0 buffer size = 4403.49 MiB
........................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Vulkan0 KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: Vulkan_Host input buffer size = 10.01 MiB
llama_new_context_with_model: Vulkan0 compute buffer size = 258.50 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 3
system_info: n_threads = 4 / 4 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
Podmanのlibkrun providerについて教えて下さい。
libkrunはKDEのRun Commandを提供するライブラリです。libkrun providerは、libkrunが提供するRun Commandを実行するためのプロバイダーです。
libkrun providerは、libkrunが提供するRun Commandを実行するために必要な情報を提供します。例えば、実行するコマンドのパスや環境変数などです。
libkrun providerは、KDEのアプリケーションで使用されます。例えば、KDEのファイルマネージャーであるDolphinでは、libkrun providerを使用してRun Commandを実行しています。
以下は、libkrun providerの例です。
```c
#include <krunprovider.h>
class MyKrunProvider : public KRunProvider {
public:
MyKrunProvider(QObject *parent = 0) : KRunProvider(parent) {}
void runCommand(const QUrl &url, const QString &commandLine) override {
// コマンドの実行
QProcess process;
process.start(commandLine);
process.waitForFinished();
}
};
```
この例では、MyKrunProviderクラスがlibkrun providerです。runCommandメソッドは、libkrunから呼び出され、Run Commandを実行するために必要な情報を受け取ります。この例では、QProcessを使用してコマンドを実行しています。 [end of text]
llama_print_timings: load time = 10807.36 ms
llama_print_timings: sample time = 122.81 ms / 307 runs ( 0.40 ms per token, 2499.90 tokens per second)
llama_print_timings: prompt eval time = 4758.71 ms / 12 tokens ( 396.56 ms per token, 2.52 tokens per second)
llama_print_timings: eval time = 27980.78 ms / 306 runs ( 91.44 ms per token, 10.94 tokens per second)
llama_print_timings: total time = 33019.71 ms / 318 tokens
Log end
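
The KV cache sizes and throughput figures in the log above can be cross-checked by hand. The following is a minimal sketch, not part of llama.cpp: the helper names `kv_cache_mib` and `tokens_per_second` are hypothetical, and the size formula (context length x layers x per-layer K or V width x 2 bytes for f16) is an assumption that happens to match the 32.00 MiB values reported for this run. All input numbers are copied from the log.

```python
def kv_cache_mib(n_ctx, n_layer, n_embd_gqa, bytes_per_elem=2):
    """Size in MiB of one half (K or V) of the KV cache.

    Assumed formula: one n_embd_gqa-wide vector per layer per context
    position, stored as f16 (2 bytes per element).
    """
    return n_ctx * n_layer * n_embd_gqa * bytes_per_elem / (1024 * 1024)


def tokens_per_second(elapsed_ms, n_tokens):
    """Throughput from an elapsed time in milliseconds and a token count."""
    return n_tokens / (elapsed_ms / 1000.0)


# Values from the log: n_ctx = 512, n_layer = 32, n_embd_k_gqa = n_embd_v_gqa = 1024
k_mib = kv_cache_mib(512, 32, 1024)
v_mib = kv_cache_mib(512, 32, 1024)
print(f"K: {k_mib:.2f} MiB, V: {v_mib:.2f} MiB, total: {k_mib + v_mib:.2f} MiB")

# Values from the llama_print_timings lines
print(f"prompt eval: {tokens_per_second(4758.71, 12):.2f} tokens/s")
print(f"eval:        {tokens_per_second(27980.78, 306):.2f} tokens/s")
```

With the values from this run, the script reproduces the 64.00 MiB "KV self size" and the 2.52 and 10.94 tokens per second reported by `llama_print_timings`.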