@vuiseng9
Created August 2, 2024 19:56

Install

pip install transformers torch accelerate
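
Before running, it can be worth confirming that the environment actually sees a GPU, since the script below assumes CUDA is available. A quick sanity check (not part of the original gist):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"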

Run

from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
import torch
from torch.profiler import profile, ProfilerActivity
from tqdm import tqdm

set_seed(42)

model_id = "microsoft/Phi-3-mini-4k-instruct"
prompt = "Tell a story about a superhero."
maxlen = 128

device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id)  # tokenizer stays on CPU; it doesn't support .to()
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device, torch_dtype=torch.bfloat16, attn_implementation="eager")
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# warm up so CUDA context creation and first-call overhead are excluded from the profiled run
for _ in tqdm(range(2), desc="- warming up ..."):
    model.generate(input_ids.to(device), max_length=maxlen, num_return_sequences=1)

# profile a single generate() call on both CPU and CUDA, recording input shapes per op
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True) as prof:
    output_ids = model.generate(input_ids.to(device), max_length=maxlen, num_return_sequences=1)

generated_texts = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total", row_limit=10))

print("end.")

vuiseng9 commented Aug 2, 2024

Sample output

Tell a story about a superhero.

**Answer:**Once upon a time, in the bustling city of Metropolis, there lived a superhero named Captain Valiant. He was a tall, muscular man with a kind heart and a strong sense of justice. Captain Valiant had the power to control metal, which he used to protect the city from various villains.

One day, a notorious criminal named Dr. Chaos escaped from prison and threatened to unleash a deadly virus on the city. Captain Valiant knew he had to stop him before it was


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                       Input Shapes  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------  
                                       cudaLaunchKernel        22.23%     985.838ms        22.23%     985.838ms       4.938us       0.000us         0.00%       0.000us       0.000us        199641                                                 []  
                                               aten::to         0.52%      23.117ms         7.91%     350.668ms      15.240us       0.000us         0.00%      47.441ms       2.062us         23010                     [[1, 1, 3072], [], [], [], []]  
                                         aten::_to_copy         1.63%      72.162ms         7.39%     327.551ms      21.353us       0.000us         0.00%      47.441ms       3.093us         15340             [[1, 1, 3072], [], [], [], [], [], []]  
                                              aten::cat         4.89%     216.891ms         7.02%     311.330ms      16.203us      62.979ms         3.79%      62.979ms       3.278us         19214                                           [[], []]  
                                              aten::mul         3.14%     139.179ms         4.67%     206.972ms      13.703us      42.072ms         2.53%      42.072ms       2.786us         15104                    [[1, 32, 1, 96], [1, 1, 1, 96]]  
                                           aten::linear         0.19%       8.430ms         4.40%     194.977ms      51.636us       0.000us         0.00%     306.459ms      81.160us          3776                   [[1, 1, 3072], [9216, 3072], []]  
                                           aten::linear         0.19%       8.299ms         4.31%     191.023ms      50.589us       0.000us         0.00%     514.753ms     136.322us          3776                  [[1, 1, 3072], [16384, 3072], []]  
                                    aten::empty_strided         4.15%     183.905ms         4.15%     183.905ms       5.274us       0.000us         0.00%       0.000us       0.000us         34872                           [[], [], [], [], [], []]  
                                           aten::matmul         0.77%      34.367ms         4.08%     181.009ms      47.937us       0.000us         0.00%       7.093ms       1.878us          3776                            [[1, 48, 1], [1, 1, 1]]  
                                            aten::copy_         2.15%      95.246ms         3.94%     174.649ms      11.385us      47.441ms         2.86%      47.441ms       3.093us         15340                   [[1, 1, 3072], [1, 1, 3072], []]  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------  
Self CPU time total: 4.435s
Self CUDA time total: 1.661s
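
The table above is sorted by total CPU time, which is why CPU-side launch overhead (cudaLaunchKernel) sits at the top. To rank operators by GPU time instead, the same key_averages() result can be sorted by cuda_time_total; a minimal sketch, assuming the prof object from the script is still in scope:

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cuda_time_total", row_limit=10))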
