Oscar Barenys oscarbg
fxkamd /
Last active August 28, 2024 03:50
Observations about HSA and KFD backends in TinyGrad

This is Felix Kuehling, long-time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on, and driver/, to understand how TinyGrad talks to our HW and help with the ongoing debugging effort from the top down. This analysis is based on this commit:

I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.

ops_kfd looks pretty new, and I see many problems with it based on my long experience working on KFD. I think it's interesting, but probably not relevant for the most pressing problems at hand, so I'll cover that last.

ops_hsa uses ROCr APIs to manage GPU memory, create a user-mode AQL queue for GPU kernel dispatch, perform async SDMA copies, and do signal-based synchronization with barrier packets.
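
To make the signal-and-barrier idea concrete, here is a toy model in plain Python. It does not call ROCr or TinyGrad; `Signal`, `Packet`, and `run` are hypothetical names invented for illustration. It assumes the usual HSA convention that a completion signal starts at 1 and is decremented to 0 when its packet finishes, and that a barrier packet at the head of an in-order queue holds that queue until all of its dependency signals read 0. Modeling the SDMA copy as a second in-order queue is a simplification.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Signal:
    # Assumed convention: 1 means pending, 0 means complete.
    value: int = 1


@dataclass
class Packet:
    name: str
    completion: Signal
    # Signals that must read 0 before this packet may be processed.
    deps: list = field(default_factory=list)


def run(queues):
    """Round-robin over queues; each queue is strictly in-order."""
    order = []
    progressed = True
    while progressed:
        progressed = False
        for q in queues:
            # Only the head of a queue is eligible, so a blocked barrier
            # stalls everything behind it in the same queue.
            if q and all(sig.value == 0 for sig in q[0].deps):
                pkt = q.popleft()
                order.append(pkt.name)
                pkt.completion.value -= 1
                progressed = True
    return order


copy_done, barrier_done, kernel_done = Signal(), Signal(), Signal()

# Hypothetical compute queue: a barrier that waits for the copy, then a kernel.
compute = deque([
    Packet("barrier", barrier_done, deps=[copy_done]),
    Packet("kernel", kernel_done),
])
# Hypothetical copy engine queue with a single async copy.
sdma = deque([Packet("copy", copy_done)])

order = run([compute, sdma])
print(order)
```

Even though the compute queue is visited first, the kernel only runs after the copy has signaled completion, because the barrier ahead of it depends on `copy_done`.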

briansp2020 / benchmark_7900XTX_10142023.txt
Created October 15, 2023 00:12
Latest ai-benchmark using ROCm 5.7.1 and tensorflow-upstream 10/14/2023 source.
(tf) root@rocm:~/tmp# python
2023-10-14 15:02:22.116047: E external/local_xla/xla/stream_executor/] Invalid plugin kind specified: DNN
2023-10-14 15:02:22.348480: I tensorflow/core/platform/] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-14 15:02:23.756833: I external/local_xla/xla/stream_executor/rocm/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-14 15:02:23.982269: I external/local_xla/xla/stream_executor/rocm/] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-14 15:02:23.9823
timdecode / simd_partition.metal
Created June 23, 2023 18:11
Metal implementation of subgroupPartitionNV
// Created by Timothy Davison on 2023-06-21.
// This is a Metal implementation of subgroupPartitionNV. You use it to find a mask of
// the other threads in a simd-group with the same value (a partition of the simd-group about
// a set of values).
// Feel free to use this in your code. Please share any fixes or ideas to make it faster.
// Khronos docs on subgroup partitioning:
// -
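
As a CPU-side reference for what the partition computes, the following Python sketch emulates it over a list of per-lane values. It is not the Metal implementation from this gist; `subgroup_partition` is a hypothetical helper name, and the sketch assumes that every lane is active and that each lane's own bit is included in its mask.

```python
def subgroup_partition(values):
    """Return, for each lane, a bitmask of the lanes holding the same value."""
    masks_by_value = {}
    for lane, value in enumerate(values):
        # Set this lane's bit in the mask shared by all lanes with this value.
        masks_by_value[value] = masks_by_value.get(value, 0) | (1 << lane)
    return [masks_by_value[value] for value in values]


lanes = [7, 3, 7, 7, 3, 9, 9, 7]
masks = subgroup_partition(lanes)
for lane, (value, mask) in enumerate(zip(lanes, masks)):
    print(f"lane {lane}: value={value} mask={mask:08b}")
```

A reference like this can be handy for checking a GPU implementation against known inputs, since lanes with equal values must report identical masks.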
zingaburga /
Last active August 13, 2024 03:44
ARM’s Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads

Scalable Vector Extensions (SVE) is ARM’s latest SIMD extension to their instruction set, which was announced back in 2016. A follow-up SVE2 extension was announced in 2019, designed to incorporate all functionality from ARM’s current primary SIMD extension, NEON (aka ASIMD).

Despite being announced 5 years ago, there is currently no generally available CPU which supports any form of SVE (which excludes the [Fugaku supercomputer](

woachk / clpeak.txt
Created September 27, 2021 22:13
clpeak w/ MoltenVK and clspv on M1
% ./clpeak
[mvk-info] MoltenVK version 1.1.5, supporting Vulkan version 1.1.189.
The following 72 Vulkan extensions are supported:
VK_KHR_16bit_storage v1
VK_KHR_8bit_storage v1
VK_KHR_bind_memory2 v1
VK_KHR_create_renderpass2 v1
VK_KHR_dedicated_allocation v3
VK_KHR_depth_stencil_resolve v1
VK_KHR_descriptor_update_template v1
# IDA (disassembler) and Hex-Rays (decompiler) plugin for Apple AMX
# WIP research. (This was edited to add more info after someone posted it to
# Hacker News. Click "Revisions" to see full changes.)
# Copyright (c) 2020 dougallj
# Based on Python port of VMX intrinsics plugin:
# Copyright (c) 2019 w4kfu - Synacktiv
citruz /
Last active September 19, 2024 06:30
Create Ubuntu and Windows VMs with QEMU on Apple Silicon

Running Linux and Windows on M1 with QEMU

30.11.2020: Updated with the new patchseries and instructions for Windows

02.12.2020: Added tweaks

08.12.2020: Updated with patchseries v4

31.01.2020: Updated with patchseries v6

niw /
Last active September 3, 2024 17:01
How to run Windows 10 on ARM or Ubuntu for ARM64 in QEMU on Apple Silicon Mac

Here are easy steps to try Windows 10 on ARM or Ubuntu for ARM64 on your Apple Silicon Mac. Enjoy!

NOTE: this reflects the state as of 10/1/2021.

Running Windows 10 on ARM

  1. Install Xcode from App Store or install Command Line Tools on your Mac
platform: 7.5
ext: 7p5
name: HSW
1 add add 0x40 Addition
0xfc0 u8 i8 u16 i16 u32 i32 , 0xfc0 u8 i8 u16 i16 u32 i32
0x20000 f32 , 0xfc0 u8 i8 u16 i16 u32 i32
0x20000 f32 , 0x20000 f32
0x40000 f64 , 0x40000 f64
3 addc addc 0x4e Addition with Carry
0x400 u32 , 0x400 u32
rygorous / b.bat
Created August 9, 2019 23:08
Histogram code with all the tricks :) Needs NASM + VC++
@echo off
cd %~dp0
call vcvars amd64
..\..\bin\win32\nasm -f win64 -g -o histo_asm.obj histo_asm.nas || exit /b 1
cl /Zi /O2 /nologo histotest.cpp histo_asm.obj || exit /b 1