@vuiseng9
Created August 2, 2024 19:56

Install

pip install transformers torch accelerate
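
Before running, it can be worth confirming that the environment actually sees a GPU, since the script below assumes CUDA is available. A quick sanity check (not part of the original gist):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"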

Run

from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import torch
from torch.profiler import profile, ProfilerActivity
from tqdm import tqdm

set_seed(42)

model_id = "microsoft/Phi-3-mini-4k-instruct"
prompt = "Tell a story about a superhero."
maxlen = 128

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id)  # tokenizer stays on CPU; it doesn't support .to()
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=torch.bfloat16, attn_implementation="eager")
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# warm up so CUDA context creation and first-call overhead are excluded from the profiled run
for _ in tqdm(range(2), desc="- warming up ..."):
    model.generate(input_ids.to(device), max_length=maxlen, num_return_sequences=1)

# profile a single generate() call on both CPU and CUDA, recording input shapes per op
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    output_ids = model.generate(input_ids.to(device), max_length=maxlen, num_return_sequences=1)

generated_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=10))

print("end.")

vuiseng9 commented Aug 2, 2024

Sample output

Tell a story about a superhero.

**Answer:**Once upon a time, in the bustling city of Metropolis, there lived a superhero named Captain Valiant. He was a tall, muscular man with a kind heart and a strong sense of justice. Captain Valiant had the power to control metal, which he used to protect the city from various villains.

One day, a notorious criminal named Dr. Chaos escaped from prison and threatened to unleash a deadly virus on the city. Captain Valiant knew he had to stop him before it was


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                       Input Shapes  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------  
                                       cudaLaunchKernel        22.23%     985.838ms        22.23%     985.838ms       4.938us       0.000us         0.00%       0.000us       0.000us        199641                                                 []  
                                               aten::to         0.52%      23.117ms         7.91%     350.668ms      15.240us       0.000us         0.00%      47.441ms       2.062us         23010                     [[1, 1, 3072], [], [], [], []]  
                                         aten::_to_copy         1.63%      72.162ms         7.39%     327.551ms      21.353us       0.000us         0.00%      47.441ms       3.093us         15340             [[1, 1, 3072], [], [], [], [], [], []]  
                                              aten::cat         4.89%     216.891ms         7.02%     311.330ms      16.203us      62.979ms         3.79%      62.979ms       3.278us         19214                                           [[], []]  
                                              aten::mul         3.14%     139.179ms         4.67%     206.972ms      13.703us      42.072ms         2.53%      42.072ms       2.786us         15104                    [[1, 32, 1, 96], [1, 1, 1, 96]]  
                                           aten::linear         0.19%       8.430ms         4.40%     194.977ms      51.636us       0.000us         0.00%     306.459ms      81.160us          3776                   [[1, 1, 3072], [9216, 3072], []]  
                                           aten::linear         0.19%       8.299ms         4.31%     191.023ms      50.589us       0.000us         0.00%     514.753ms     136.322us          3776                  [[1, 1, 3072], [16384, 3072], []]  
                                    aten::empty_strided         4.15%     183.905ms         4.15%     183.905ms       5.274us       0.000us         0.00%       0.000us       0.000us         34872                           [[], [], [], [], [], []]  
                                           aten::matmul         0.77%      34.367ms         4.08%     181.009ms      47.937us       0.000us         0.00%       7.093ms       1.878us          3776                            [[1, 48, 1], [1, 1, 1]]  
                                            aten::copy_         2.15%      95.246ms         3.94%     174.649ms      11.385us      47.441ms         2.86%      47.441ms       3.093us         15340                   [[1, 1, 3072], [1, 1, 3072], []]  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------  
Self CPU time total: 4.435s
Self CUDA time total: 1.661s
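
The table above is sorted by total CPU time, which is why CPU-side launch overhead (cudaLaunchKernel) sits at the top. To rank operators by GPU time instead, the same key_averages() result can be sorted by cuda_time_total; a minimal sketch, assuming the prof object from the script is still in scope:

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cuda_time_total", row_limit=10))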
