The oneDNN team suggests disassembling JITTed code with SDE. First, dump the JITTed kernel to a file with the following C++ code:
#include <cstdio>

// Write the raw machine code of a JITTed kernel to dump.bin
void dump(const void *code, size_t code_size)
{
    FILE *file = fopen("dump.bin", "wb");
    if (file) {
        fwrite(code, code_size, 1, file);
        fclose(file);
    }
}
dump(kernel.getCode(), kernel.getSize()); // kernel is an instance of a class derived from Xbyak::CodeGenerator
This writes the raw machine code to a binary file named "dump.bin", which you can then disassemble with xed64 from SDE:
path/to/sde/xed64 -64 -ir dump.bin >> assembly.txt
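The same dump can also be written with only the C++ standard library. The sketch below assumes you want the output path as a parameter and a signal when the write fails. dump_kernel and read_dump are hypothetical names, and read_dump simply loads the file back so you can check what was written before running xed64 on it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: write the raw bytes of a generated kernel to `path`.
// Returns false if the file could not be opened or fully written.
bool dump_kernel(const std::string& path, const void* code, std::size_t code_size)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(static_cast<const char*>(code), static_cast<std::streamsize>(code_size));
    return static_cast<bool>(out);
}

// Hypothetical helper: read a dump back as bytes, e.g. to sanity-check its size.
std::vector<std::uint8_t> read_dump(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}
```

In an Xbyak program the call would look like dump_kernel("dump.bin", kernel.getCode(), kernel.getSize()). The last four bytes of the listing (C5 F8 77 C3, i.e. vzeroupper; ret) make a convenient round-trip check.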
The resulting assembly.txt looks like this:
XDIS 0: PUSH BASE 53 push rbx
XDIS 1: PUSH BASE 55 push rbp
XDIS 2: PUSH BASE 4154 push r12
XDIS 4: PUSH BASE 4155 push r13
XDIS 6: PUSH BASE 4156 push r14
XDIS 8: PUSH BASE 4157 push r15
XDIS a: DATAXFER BASE BD00040000 mov ebp, 0x400
XDIS f: DATAXFER BASE 4C8B3F mov r15, qword ptr [rdi]
XDIS 12: DATAXFER BASE 4C8B7708 mov r14, qword ptr [rdi+0x8]
XDIS 16: DATAXFER BASE 4C8B6F10 mov r13, qword ptr [rdi+0x10]
XDIS 1a: LOGICAL AVX512EVEX 62F17D48EFC0 vpxord zmm0, zmm0, zmm0
XDIS 20: LOGICAL AVX512EVEX 62F15D48EFE4 vpxord zmm4, zmm4, zmm4
XDIS 26: LOGICAL AVX512EVEX 62513D48EFC0 vpxord zmm8, zmm8, zmm8
XDIS 2c: LOGICAL AVX512EVEX 62511D48EFE4 vpxord zmm12, zmm12, zmm12
XDIS 32: LOGICAL AVX512EVEX 62F17548EFC9 vpxord zmm1, zmm1, zmm1
XDIS 38: LOGICAL AVX512EVEX 62F15548EFED vpxord zmm5, zmm5, zmm5
XDIS 3e: LOGICAL AVX512EVEX 62513548EFC9 vpxord zmm9, zmm9, zmm9
XDIS 44: LOGICAL AVX512EVEX 62511548EFED vpxord zmm13, zmm13, zmm13
XDIS 4a: LOGICAL AVX512EVEX 62F16D48EFD2 vpxord zmm2, zmm2, zmm2
XDIS 50: LOGICAL AVX512EVEX 62F14D48EFF6 vpxord zmm6, zmm6, zmm6
XDIS 56: LOGICAL AVX512EVEX 62512D48EFD2 vpxord zmm10, zmm10, zmm10
XDIS 5c: LOGICAL AVX512EVEX 62510D48EFF6 vpxord zmm14, zmm14, zmm14
XDIS 62: DATAXFER AVX512EVEX 62C17C481006 vmovups zmm16, zmmword ptr [r14]
XDIS 68: DATAXFER AVX512EVEX 62C17C48104E01 vmovups zmm17, zmmword ptr [r14+0x40]
XDIS 6f: DATAXFER AVX512EVEX 62C17C48105602 vmovups zmm18, zmmword ptr [r14+0x80]
XDIS 76: BROADCAST AVX512EVEX 62427D48183F vbroadcastss zmm31, dword ptr [r15]
XDIS 7c: VFMA AVX512EVEX 62927D40B8C7 vfmadd231ps zmm0, zmm16, zmm31
XDIS 82: VFMA AVX512EVEX 62927540B8CF vfmadd231ps zmm1, zmm17, zmm31
XDIS 88: VFMA AVX512EVEX 62926D40B8D7 vfmadd231ps zmm2, zmm18, zmm31
XDIS 8e: BROADCAST AVX512EVEX 62427D48187F04 vbroadcastss zmm31, dword ptr [r15+0x10]
XDIS 95: VFMA AVX512EVEX 62927D40B8E7 vfmadd231ps zmm4, zmm16, zmm31
XDIS 9b: VFMA AVX512EVEX 62927540B8EF vfmadd231ps zmm5, zmm17, zmm31
XDIS a1: VFMA AVX512EVEX 62926D40B8F7 vfmadd231ps zmm6, zmm18, zmm31
XDIS a7: BROADCAST AVX512EVEX 62427D48187F08 vbroadcastss zmm31, dword ptr [r15+0x20]
XDIS ae: VFMA AVX512EVEX 62127D40B8C7 vfmadd231ps zmm8, zmm16, zmm31
XDIS b4: VFMA AVX512EVEX 62127540B8CF vfmadd231ps zmm9, zmm17, zmm31
XDIS ba: VFMA AVX512EVEX 62126D40B8D7 vfmadd231ps zmm10, zmm18, zmm31
XDIS c0: BROADCAST AVX512EVEX 62427D48187F0C vbroadcastss zmm31, dword ptr [r15+0x30]
XDIS c7: VFMA AVX512EVEX 62127D40B8E7 vfmadd231ps zmm12, zmm16, zmm31
XDIS cd: VFMA AVX512EVEX 62127540B8EF vfmadd231ps zmm13, zmm17, zmm31
XDIS d3: VFMA AVX512EVEX 62126D40B8F7 vfmadd231ps zmm14, zmm18, zmm31
XDIS d9: DATAXFER AVX512EVEX 62C17C48104603 vmovups zmm16, zmmword ptr [r14+0xc0]
XDIS e0: DATAXFER AVX512EVEX 62C17C48104E04 vmovups zmm17, zmmword ptr [r14+0x100]
XDIS e7: DATAXFER AVX512EVEX 62C17C48105605 vmovups zmm18, zmmword ptr [r14+0x140]
XDIS ee: BROADCAST AVX512EVEX 62427D48187F01 vbroadcastss zmm31, dword ptr [r15+0x4]
XDIS f5: VFMA AVX512EVEX 62927D40B8C7 vfmadd231ps zmm0, zmm16, zmm31
XDIS fb: VFMA AVX512EVEX 62927540B8CF vfmadd231ps zmm1, zmm17, zmm31
XDIS 101: VFMA AVX512EVEX 62926D40B8D7 vfmadd231ps zmm2, zmm18, zmm31
XDIS 107: BROADCAST AVX512EVEX 62427D48187F05 vbroadcastss zmm31, dword ptr [r15+0x14]
XDIS 10e: VFMA AVX512EVEX 62927D40B8E7 vfmadd231ps zmm4, zmm16, zmm31
XDIS 114: VFMA AVX512EVEX 62927540B8EF vfmadd231ps zmm5, zmm17, zmm31
XDIS 11a: VFMA AVX512EVEX 62926D40B8F7 vfmadd231ps zmm6, zmm18, zmm31
XDIS 120: BROADCAST AVX512EVEX 62427D48187F09 vbroadcastss zmm31, dword ptr [r15+0x24]
XDIS 127: VFMA AVX512EVEX 62127D40B8C7 vfmadd231ps zmm8, zmm16, zmm31
XDIS 12d: VFMA AVX512EVEX 62127540B8CF vfmadd231ps zmm9, zmm17, zmm31
XDIS 133: VFMA AVX512EVEX 62126D40B8D7 vfmadd231ps zmm10, zmm18, zmm31
XDIS 139: BROADCAST AVX512EVEX 62427D48187F0D vbroadcastss zmm31, dword ptr [r15+0x34]
XDIS 140: VFMA AVX512EVEX 62127D40B8E7 vfmadd231ps zmm12, zmm16, zmm31
XDIS 146: VFMA AVX512EVEX 62127540B8EF vfmadd231ps zmm13, zmm17, zmm31
XDIS 14c: VFMA AVX512EVEX 62126D40B8F7 vfmadd231ps zmm14, zmm18, zmm31
XDIS 152: DATAXFER AVX512EVEX 62C17C48104606 vmovups zmm16, zmmword ptr [r14+0x180]
XDIS 159: DATAXFER AVX512EVEX 62C17C48104E07 vmovups zmm17, zmmword ptr [r14+0x1c0]
XDIS 160: DATAXFER AVX512EVEX 62C17C48105608 vmovups zmm18, zmmword ptr [r14+0x200]
XDIS 167: BROADCAST AVX512EVEX 62427D48187F02 vbroadcastss zmm31, dword ptr [r15+0x8]
XDIS 16e: VFMA AVX512EVEX 62927D40B8C7 vfmadd231ps zmm0, zmm16, zmm31
XDIS 174: VFMA AVX512EVEX 62927540B8CF vfmadd231ps zmm1, zmm17, zmm31
XDIS 17a: VFMA AVX512EVEX 62926D40B8D7 vfmadd231ps zmm2, zmm18, zmm31
XDIS 180: BROADCAST AVX512EVEX 62427D48187F06 vbroadcastss zmm31, dword ptr [r15+0x18]
XDIS 187: VFMA AVX512EVEX 62927D40B8E7 vfmadd231ps zmm4, zmm16, zmm31
XDIS 18d: VFMA AVX512EVEX 62927540B8EF vfmadd231ps zmm5, zmm17, zmm31
XDIS 193: VFMA AVX512EVEX 62926D40B8F7 vfmadd231ps zmm6, zmm18, zmm31
XDIS 199: BROADCAST AVX512EVEX 62427D48187F0A vbroadcastss zmm31, dword ptr [r15+0x28]
XDIS 1a0: VFMA AVX512EVEX 62127D40B8C7 vfmadd231ps zmm8, zmm16, zmm31
XDIS 1a6: VFMA AVX512EVEX 62127540B8CF vfmadd231ps zmm9, zmm17, zmm31
XDIS 1ac: VFMA AVX512EVEX 62126D40B8D7 vfmadd231ps zmm10, zmm18, zmm31
XDIS 1b2: BROADCAST AVX512EVEX 62427D48187F0E vbroadcastss zmm31, dword ptr [r15+0x38]
XDIS 1b9: VFMA AVX512EVEX 62127D40B8E7 vfmadd231ps zmm12, zmm16, zmm31
XDIS 1bf: VFMA AVX512EVEX 62127540B8EF vfmadd231ps zmm13, zmm17, zmm31
XDIS 1c5: VFMA AVX512EVEX 62126D40B8F7 vfmadd231ps zmm14, zmm18, zmm31
XDIS 1cb: DATAXFER AVX512EVEX 62C17C48104609 vmovups zmm16, zmmword ptr [r14+0x240]
XDIS 1d2: DATAXFER AVX512EVEX 62C17C48104E0A vmovups zmm17, zmmword ptr [r14+0x280]
XDIS 1d9: DATAXFER AVX512EVEX 62C17C4810560B vmovups zmm18, zmmword ptr [r14+0x2c0]
XDIS 1e0: BROADCAST AVX512EVEX 62427D48187F03 vbroadcastss zmm31, dword ptr [r15+0xc]
XDIS 1e7: VFMA AVX512EVEX 62927D40B8C7 vfmadd231ps zmm0, zmm16, zmm31
XDIS 1ed: VFMA AVX512EVEX 62927540B8CF vfmadd231ps zmm1, zmm17, zmm31
XDIS 1f3: VFMA AVX512EVEX 62926D40B8D7 vfmadd231ps zmm2, zmm18, zmm31
XDIS 1f9: BROADCAST AVX512EVEX 62427D48187F07 vbroadcastss zmm31, dword ptr [r15+0x1c]
XDIS 200: VFMA AVX512EVEX 62927D40B8E7 vfmadd231ps zmm4, zmm16, zmm31
XDIS 206: VFMA AVX512EVEX 62927540B8EF vfmadd231ps zmm5, zmm17, zmm31
XDIS 20c: VFMA AVX512EVEX 62926D40B8F7 vfmadd231ps zmm6, zmm18, zmm31
XDIS 212: BROADCAST AVX512EVEX 62427D48187F0B vbroadcastss zmm31, dword ptr [r15+0x2c]
XDIS 219: VFMA AVX512EVEX 62127D40B8C7 vfmadd231ps zmm8, zmm16, zmm31
XDIS 21f: VFMA AVX512EVEX 62127540B8CF vfmadd231ps zmm9, zmm17, zmm31
XDIS 225: VFMA AVX512EVEX 62126D40B8D7 vfmadd231ps zmm10, zmm18, zmm31
XDIS 22b: BROADCAST AVX512EVEX 62427D48187F0F vbroadcastss zmm31, dword ptr [r15+0x3c]
XDIS 232: VFMA AVX512EVEX 62127D40B8E7 vfmadd231ps zmm12, zmm16, zmm31
XDIS 238: VFMA AVX512EVEX 62127540B8EF vfmadd231ps zmm13, zmm17, zmm31
XDIS 23e: VFMA AVX512EVEX 62126D40B8F7 vfmadd231ps zmm14, zmm18, zmm31
XDIS 244: DATAXFER AVX512EVEX 62D17C48114500 vmovups zmmword ptr [r13], zmm0
XDIS 24b: DATAXFER AVX512EVEX 62D17C48116503 vmovups zmmword ptr [r13+0xc0], zmm4
XDIS 252: DATAXFER AVX512EVEX 62517C48114506 vmovups zmmword ptr [r13+0x180], zmm8
XDIS 259: DATAXFER AVX512EVEX 62517C48116509 vmovups zmmword ptr [r13+0x240], zmm12
XDIS 260: DATAXFER AVX512EVEX 62D17C48114D01 vmovups zmmword ptr [r13+0x40], zmm1
XDIS 267: DATAXFER AVX512EVEX 62D17C48116D04 vmovups zmmword ptr [r13+0x100], zmm5
XDIS 26e: DATAXFER AVX512EVEX 62517C48114D07 vmovups zmmword ptr [r13+0x1c0], zmm9
XDIS 275: DATAXFER AVX512EVEX 62517C48116D0A vmovups zmmword ptr [r13+0x280], zmm13
XDIS 27c: DATAXFER AVX512EVEX 62D17C48115502 vmovups zmmword ptr [r13+0x80], zmm2
XDIS 283: DATAXFER AVX512EVEX 62D17C48117505 vmovups zmmword ptr [r13+0x140], zmm6
XDIS 28a: DATAXFER AVX512EVEX 62517C48115508 vmovups zmmword ptr [r13+0x200], zmm10
XDIS 291: DATAXFER AVX512EVEX 62517C4811750B vmovups zmmword ptr [r13+0x2c0], zmm14
XDIS 298: POP BASE 415F pop r15
XDIS 29a: POP BASE 415E pop r14
XDIS 29c: POP BASE 415D pop r13
XDIS 29e: POP BASE 415C pop r12
XDIS 2a0: POP BASE 5D pop rbp
XDIS 2a1: POP BASE 5B pop rbx
XDIS 2a2: AVX AVX C5F877 vzeroupper
XDIS 2a5: RET BASE C3 ret
# end of text section.
# Errors: 0
#XED3 DECODE STATS
#Total DECODE cycles: 29220
#Total instructions DECODE: 68
#Total tail DECODE cycles: 236418
#Total tail instructions DECODE: 118
#Total cycles/instruction DECODE: 429.71
#Total tail cycles/instruction DECODE: 2003.54
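When reading a listing like this, it helps to write down what it appears to compute. The following is a reading of the offsets above, not something the oneDNN team states. rdi points to three pointers: A at [rdi], B at [rdi+0x8] and C at [rdi+0x10]. Each k step loads 48 floats of B into zmm16-zmm18 and broadcasts one float of A per row into zmm31. The twelve accumulators (zmm0-2, 4-6, 8-10, 12-14) hold a 4x48 tile of C that is stored at the end. So the kernel looks like C = A × B, with A 4x4 (row stride 0x10 bytes), B 4x48 and C 4x48, all row-major, and C overwritten rather than accumulated into. Below is a scalar sketch of that interpretation, using a hypothetical GemmArgs struct laid out like the three pointers.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical argument block matching the loads at the top of the kernel:
// A at [rdi], B at [rdi+0x8], C at [rdi+0x10].
struct GemmArgs {
    const float* A;
    const float* B;
    float* C;
};

// Scalar sketch of what the JITTed kernel appears to compute:
// C[4x48] = A[4x4] * B[4x48], all row-major, C overwritten.
void gemm_4x48_ref(const GemmArgs* p)
{
    constexpr int M = 4, N = 48, K = 4;
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float acc = 0.0f;  // vpxord zeroes the accumulators first
            for (int k = 0; k < K; ++k) {
                // vbroadcastss A[m][k] (byte offset m*0x10 + k*4), then
                // vfmadd231ps with B[k][n] (byte offset k*0xc0 + n*4): one rounding per step
                acc = std::fma(p->B[k * N + n], p->A[m * K + k], acc);
            }
            p->C[m * N + n] = acc;  // vmovups to [r13 + m*0xc0 + n*4]
        }
    }
}
```

Using std::fma mirrors the single rounding of vfmadd231ps. Fed the same inputs, the reference should therefore agree with the kernel bit for bit, which makes mismatches easy to spot.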
However, this approach requires a solid understanding of assembly, and beginners may find it hard to simulate the code's execution in their heads. Here I show how to debug Xbyak code with GDB, just as you would debug any C++ program.
How to debug during JITTed kernel generation
Take a toy program as an example, and suppose the compiled binary is ./toy. Don't forget to build it with -g so that GDB has line information.
1 #include <xbyak/xbyak_util.h>
2 #include <cstdio> // puts, printf
3 struct Code : public Xbyak::CodeGenerator {
4     Code()
5     {
6         // xbyak also provides advanced helpers such as StackFrame;
7         // see xbyak/sample/sf_test.cpp for how to use its other parameters
8         // Xbyak::util::StackFrame sf(this, 4);
9         sub(rsp, 256);
10        mov(eax, ptr[rdi + 4]); // System V ABI: rdi holds the 1st argument (int* a); load the int a[1]
11        movsxd(rax, eax);       // sign-extend it to 64 bits (x86-64 has no "mov rax, eax")
12        add(rax, rsi);          // rsi holds the 2nd argument (only its low 32 bits are meaningful for an int)
13        add(rax, rdx);          // rdx holds the 3rd argument
14        mov(ptr[rcx], eax);     // rcx holds the 4th argument (int* res); store only 32 bits since res is an int
15        add(rsp, 256);
16        ret();
17    }
18 };
19
20 int main()
21 {
22     Code c;
23     int a[2];
24     a[0] = 3;
25     a[1] = 4;
26     int res;
27     void (*f)(int*, int, int, int*) = c.getCode<void(*) (int*, int, int, int*)>();
28     f(a, 5, 2, &res);
29     if (res == 4 + 5 + 2) {
30         puts("ok");
31     } else {
32         printf("res = %d\n", res);
33         puts("ng");
34     }
35 }
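Before stepping through instructions, it pays to know what the kernel should produce. This is a minimal sketch that assumes the kernel signature used above, void(int*, int, int, int*). toy_ref is a hypothetical plain C++ reference for the toy kernel, and check_kernel runs any function pointer of that type against it. In the real program you would pass the pointer returned by c.getCode<...>(). Here a C++ stand-in keeps the sketch runnable without Xbyak.

```cpp
#include <cassert>
#include <cstdio>

using KernelFn = void (*)(int*, int, int, int*);

// Hypothetical reference: what the toy kernel is meant to compute, *res = a[1] + b + c.
void toy_ref(int* a, int b, int c, int* res)
{
    *res = a[1] + b + c;
}

// Run `kernel` and the reference on the same inputs; report both values on mismatch.
bool check_kernel(KernelFn kernel, int* a, int b, int c)
{
    int got = 0, want = 0;
    kernel(a, b, c, &got);
    toy_ref(a, b, c, &want);
    if (got != want) {
        std::printf("mismatch: got %d, want %d\n", got, want);
        return false;
    }
    return true;
}
```

If check_kernel(f, a, 5, 2) fails, the printed values tell you which register to watch once you are inside the kernel in GDB.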
I suggest running GDB with the --tui option (its text user interface). You can build GDB from source or install it directly via conda:
gdb --tui ./toy
...
(gdb) b 11
(gdb) r
(gdb) x/2i this->top_
Here we set a breakpoint at line 11, inside the Code() constructor, right after lines 9 and 10 have emitted their instructions. The key command is x/2i this->top_. top_ is the start address of the generated kernel's buffer, and x/2i prints the 2 instructions starting at that address. For details on x/, see the official GDB documentation.
Here we get:
(gdb) x/2i this->top_
0x7ffff7ff9000: sub rsp,0x2000
0x7ffff7ff9007: mov eax,DWORD PTR [rdi+0x4]
That is exactly what we want to generate!
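x/2i decodes the bytes at top_ as instructions. When a decode looks suspicious, the raw bytes are useful too. GDB shows them with x/8xb this->top_, and the program can also print them itself. Below is a standard-library-only sketch (hex_bytes is a hypothetical name). In an Xbyak program you might call it on getCode() and getSize(). For reference, 8b 47 04 is the encoding of mov eax, DWORD PTR [rdi+0x4].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format the first n bytes at p as lowercase hex, e.g. "8b 47 04",
// similar in spirit to what GDB's x/Nxb prints for the same address.
std::string hex_bytes(const std::uint8_t* p, std::size_t n)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < n; ++i) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(p[i]));
        if (i) out += ' ';
        out += buf;
    }
    return out;
}
```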
How to debug while the JITTed kernel is running
Xbyak hands back the JITTed kernel as a raw code pointer that we cast to a function pointer (here via getCode<void(*)(int*, int, int, int*)>()). The questions are therefore where to find the function's address and how to debug it at the assembly level. Luckily, we know where the function is called: line 28. We can dive into the assembly from there:
(gdb) b 28
(gdb) c
(gdb) layout asm
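The asm window now shows the disassembly of main() around line 28. From here we can keep debugging much like a C++ program. The difference is that we use nexti instead of next and stepi instead of step, so each command advances by one machine instruction (stepi also enters calls, while nexti steps over them). The JITTed kernel is a nameless function reached through a function pointer, so main() must call it indirectly. In this build the call is 0x40270f <main()+200> call r8, and the address will differ on your machine. Let's go on: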
(gdb) b *0x40270f
(gdb) c
(gdb) stepi
Now we have entered the JITTed kernel and can debug it instruction by instruction. We can also print register values and watch them change:
(gdb) nexti
(gdb) i r rax
rax 0x4 4
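Hunting for call r8 in main()'s disassembly is one way in. Another is to have the program report the kernel's entry address, which is the same pointer getCode() returns, and break on it directly with b *ADDR. Below is a sketch of a hypothetical helper that formats that GDB command. The address in the example is the top_ value seen earlier and will differ on other machines.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: build the GDB command that breaks on the first instruction
// of a JITTed kernel, e.g. "b *0x7ffff7ff9000".
std::string gdb_break_cmd(const void* entry)
{
    char buf[40];
    std::snprintf(buf, sizeof buf, "b *0x%" PRIxPTR,
                  reinterpret_cast<std::uintptr_t>(entry));
    return buf;
}
```

For example, add std::puts(gdb_break_cmd(c.getCode()).c_str()); right after Code c;. While stopped at line 28, paste the printed command into GDB and continue with c. Execution then stops on the kernel's first instruction without stepping through call r8.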