Relative performance of matmul element types on x86 and Arm
Context
Recent efforts to run LLMs have us searching for element types to quantize
weights and activations into: types wide enough to preserve accuracy, yet
narrow enough to deliver performance and/or memory compression.
This document is about the "performance" dimension, specifically on x86 and Arm
architectures.
Every once in a while I investigate low-level backend options for programming
languages (PLs), although so far I haven't actually written any such backend for
my projects. Recently I've been looking at precise garbage collection in popular
backends, and I've been (as on previous occasions) annoyed by limitations and compromises.
I was compelled to think about a system which accommodates precise relocating GC
as much as possible. In one extreme configuration, described in this note, there
Thanks for joining us for "the definitive deep dive into the .git folder". It's an incredible live-demo where we open every file in the .git folder and show what it does.
You get to recommend one published PL paper for an undergrad to read with oversight by someone experienced. The paper should be interesting, approachable, and (mostly) self-contained.
A list of articles documenting uses of the GF2P8AFFINE instruction
Unexpected Uses for the Galois Field Affine Transformation Instruction
Intel added the Galois Field instruction set (GFNI) extensions to their Sunny Cove and Tremont cores. What’s particularly interesting is that GFNI is the only new SIMD extension that came with SSE and VEX/AVX encodings (in addition to EVEX/AVX512), to allow it to be supported on all future Intel cores, including those which don’t support AVX512 (such as the Atom line, as well as Celeron/Pentium branded “big” cores).
I suspect GFNI was aimed at accelerating SM4 encryption; however, one of the instructions can be used for many other purposes. The extension includes three instructions, but of particular interest here is the Affine Transformation (GF2P8AFFINEQB), aka bit-matrix multiply, instruction.
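To make "bit-matrix multiply" concrete, here is a minimal scalar sketch of what the instruction computes for a single byte, assuming the row ordering given in Intel's documented pseudocode (output bit i uses matrix byte 7-i). The function name gf2p8affine_byte is hypothetical, not a real intrinsic; the actual instruction applies this to every byte of a vector, with one 8x8 matrix per 64-bit lane.

```c
#include <stdint.h>

/* Scalar model of GF2P8AFFINEQB for one byte (hypothetical helper, not an intrinsic).
 * matrix: 8x8 bit matrix packed into 64 bits, one row per byte.
 * x:      input byte, treated as a vector of 8 bits.
 * imm8:   constant XORed into the result (the "affine" part). */
static uint8_t gf2p8affine_byte(uint64_t matrix, uint8_t x, uint8_t imm8) {
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        /* output bit i is produced by matrix byte 7-i */
        uint8_t row = (uint8_t)(matrix >> (8 * (7 - i)));
        /* GF(2) dot product: parity of (row AND x) */
        uint8_t t = (uint8_t)(row & x);
        t ^= (uint8_t)(t >> 4);
        t ^= (uint8_t)(t >> 2);
        t ^= (uint8_t)(t >> 1);
        uint8_t bit = (uint8_t)((t & 1u) ^ ((imm8 >> i) & 1u));
        out |= (uint8_t)(bit << i);
    }
    return out;
}
```

Under this model, the matrix 0x0102040810204080 acts as the identity, and 0x8040201008040201 reverses the bits of each byte.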
There have been various articles which discuss out-of-band
A list of “out-of-band” uses for the GF2P8AFFINEQB instruction I haven’t seen documented elsewhere
Count Leading/Trailing Zero Bits (Byte-wise)
The trailing zero bit count (TZCNT) can be computed by isolating the lowest set bit, then depositing it into the appropriate bit positions of the count. The leading zero bit count (LZCNT) can be computed by reversing the bits, then computing the TZCNT.
__m128i _mm_tzcnt_epi8(__m128i a) {
	// isolate lowest bit
	a = _mm_andnot_si128(_mm_add_epi8(a, _mm_set1_epi8(0xff)), a);
	// convert lowest bit to index
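The listing above is cut off, so as a rough illustration of the idea only, here is a portable scalar sketch of the two steps for a single byte. It is not the author's vectorized code: the helper names (parity8, tzcnt8, reverse8, lzcnt8) and the row masks 0xAA, 0xCC and 0xF0 are assumptions chosen for this sketch, and the "deposit" step is modelled as parities of masked bits, which is what a bit-matrix multiply would compute. A zero input is assumed to return 8.

```c
#include <stdint.h>

/* XOR of all bits of a byte (one GF(2) dot product). */
static unsigned parity8(uint8_t x) {
    x ^= (uint8_t)(x >> 4);
    x ^= (uint8_t)(x >> 2);
    x ^= (uint8_t)(x >> 1);
    return x & 1u;
}

/* Trailing zero count of one byte; returns 8 when a == 0 (assumed convention). */
static uint8_t tzcnt8(uint8_t a) {
    /* isolate lowest set bit: ~(a - 1) & a, mirroring the andnot/add step */
    uint8_t low = (uint8_t)(~(uint8_t)(a + 0xff) & a);
    /* low is one-hot or zero, so each parity below just tests membership */
    unsigned b0 = parity8(low & 0xAA);   /* set bit at index 1, 3, 5 or 7 */
    unsigned b1 = parity8(low & 0xCC);   /* set bit at index 2, 3, 6 or 7 */
    unsigned b2 = parity8(low & 0xF0);   /* set bit at index 4..7 */
    unsigned b3 = parity8(low) ^ 1u;     /* 1 only when the input was zero */
    return (uint8_t)(b0 | (b1 << 1) | (b2 << 2) | (b3 << 3));
}

/* Reverse the bit order of one byte. */
static uint8_t reverse8(uint8_t a) {
    a = (uint8_t)((a >> 4) | (a << 4));
    a = (uint8_t)(((a & 0xCC) >> 2) | ((a & 0x33) << 2));
    a = (uint8_t)(((a & 0xAA) >> 1) | ((a & 0x55) << 1));
    return a;
}

/* Leading zero count: reverse the bits, then count trailing zeros. */
static uint8_t lzcnt8(uint8_t a) {
    return tzcnt8(reverse8(a));
}
```

For example, tzcnt8(0x08) yields 3 and lzcnt8(0x10) yields 3, while both functions yield 8 for a zero byte.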