description + simplified implementation of the (cursed) compression scheme of the popular lz-string library

lz-string is a very popular library that compresses UTF-16 strings using a variation of the LZ78 algorithm in which literals are only encoded once (and referenced through dictionary indexes afterwards).

Despite its impressive 10 million downloads per week at the time of this writing, there is no official documentation of the wire format implemented by this library, and the horrible code quality makes the format hard to deduce from the code itself. Furthermore, the code contains many bugs / gotchas, some of which are enumerated below. This is probably why many of the numerous ports of the library blindly copy its code, doing little more than adapting syntax.

This gist provides a description of the format, as well as a simplified (but unoptimized) reference implementation in Python.

Wire format

The bytes of a compressed blob are interpreted as a bitstream in big-endian order (meaning, MSB-first). This bitstream contains fixed-size integers in little-endian (LSB-first) order; they are "fixed-size" in the sense that their sizes are known to the code rather than signalled in the bitstream, but different integers have different sizes (see below).
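As a sketch of what this means in code (the helper name read_bits_le is mine, just for illustration; the reference implementation below does the same job with its BitReader class):

def read_bits_le(data: bytes, bit_offset: int, size: int) -> int:
    ''' read a `size`-bit LSB-first integer starting at `bit_offset` of an MSB-first bitstream '''
    result = 0
    for i in range(size):
        byte, bit = divmod(bit_offset + i, 8)
        stream_bit = (data[byte] >> (7 - bit)) & 1  # bits are consumed MSB-first within each byte...
        result |= stream_bit << i                   # ...but assembled into the integer LSB-first
    return result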

The bitstream consists of a sequence of packets, each of which begins with an opcode integer determining its type and payload:

  • Opcodes 0 and 1 are followed by a literal integer of 8 or 16 bits respectively. They are emitted when the encoder finds a UTF-16 code unit it hasn't seen before.

  • Opcode 2 marks the end of the stream, which is then zero-padded as needed to complete the last byte.

  • Opcodes 3 and above refer to an entry in the dictionary, whose index is given by opcode - 3. They do not carry any payload.

The size of each opcode is as large as needed so that all valid opcodes at that point can be represented. In other words, an opcode is encoded in $\lceil \log_2 (M+1) \rceil$ bits, where $M$ is the highest value that this particular opcode is allowed to contain. This depends on the size of the dictionary at that point, which couples the wire format to the compression algorithm (explained below).

To decouple the two, we can say that $M = 2$ when reading the first opcode (since there are no entries in the dictionary yet), and that $M$ increases by 2 for every literal packet consumed and by 1 for every dictionary index packet consumed. This is what the reference implementation does.

Example: compression of the string "ABBAB".
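Traced by hand from the rules above and the reference implementation below (so treat the exact numbers as unofficial), the example works out as follows:

  • literal 'A' (0x41): opcode 0 in 2 bits ($M = 2$), followed by the 8-bit literal. The dictionary gains entry 0 = "A".
  • literal 'B' (0x42): opcode 0 in 3 bits ($M = 4$), followed by the 8-bit literal. The dictionary gains entries 1 = "B" and 2 = "AB".
  • dictionary index 1 ("B"): opcode 4 in 3 bits ($M = 6$). The dictionary gains entry 3 = "BB".
  • dictionary index 2 ("AB"): opcode 5 in 3 bits ($M = 7$). The dictionary gains entry 4 = "BA".
  • end of stream: opcode 2 in 4 bits ($M = 8$), plus 1 bit of zero padding to complete the last byte.

That is 31 bits of payload, which (if I have packed them correctly) serialize to the 4 bytes 20 82 11 A8.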

Compression scheme

Because it was written in JS (and more than a decade ago, when TextEncoder wasn't a thing), this scheme compresses UTF-16 strings rather than binary data. More specifically, it compresses a list of UTF-16 code units (i.e. 16-bit integers) into packets, which are then formatted into a bitstream / byte buffer as described above.
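To make "a list of UTF-16 code units" concrete, here is one way to obtain it from a Python string (the helper name to_code_units is mine; the "Main API" section of the reference implementation does the equivalent with array('H')). Note that characters outside the BMP become surrogate pairs:

def to_code_units(s: str) -> list[int]:
    ''' turn a Python string into the list of UTF-16 code units that gets compressed '''
    raw = s.encode('utf-16-le')  # no BOM
    return [int.from_bytes(raw[i:i+2], 'little') for i in range(0, len(raw), 2)]

to_code_units('ABBAB')  # [0x41, 0x42, 0x42, 0x41, 0x42]
to_code_units('🙂')     # [0xD83D, 0xDE42] (a surrogate pair)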

The decoder starts with an empty dictionary. When a packet (that isn't the EOF marker) is received, a chunk of code units is emitted to the output. This chunk depends on the packet:

  • If the packet is a literal, then the chunk consists of a single code unit (the literal). In this case, the next available index in the dictionary is assigned to this chunk.

  • If the packet is a dictionary index, then the chunk consists of the dictionary entry given by that index.

    • Special case: if the index is equal to the dictionary size (i.e. the first unassigned index) then the chunk is equal to the last emitted chunk plus the first code unit of that same chunk. (Such an index is only valid after the first packet, when there is a "last emitted chunk".)

Before emitting the chunk to the output, the decoder assigns the next available index in the dictionary to the last emitted chunk plus the first code unit of the chunk it's about to emit. (This is not done for the first chunk, when there is no "last emitted chunk".)

Thus, a literal packet (except the first one) results in 2 new entries assigned to the dictionary: first, one containing just the literal, then another containing the previously emitted chunk plus the literal. A dictionary index packet (which can never be the first one) results in only 1 new entry in the dictionary.

The "special case" above allows the compressor to produce a bitstream that never causes duplicate entries to be added to the dictionary.

Gotchas / bugs

The following bugs are of special relevance when interoperating with the library and its ports:

  • The library (though I believe this is fixed in newer versions) has methods to compress directly to Base64. However, if these are used, it generates invalid Base64 in some cases, because the last group of four Base64 characters has only 1 character, which makes no sense (those 6 bits don't form a complete byte, nor do they contribute to completing a previous byte).

    • This is because the author decided to combine the Base64 generation with the compression itself by having compress() take an arbitrary output word width (which is 6 in the case of Base64), zero-padding only to the end of the last word rather than to a complete group of four characters.

    Base64 decoders will generally either fail, or discard the last 6 bits, removing part of the EOF marker from the bitstream. It can be fixed by appending an A to the base64 (before the padding, if present) to make the Base64 decoder happy.

  • The library has methods to compress to a Uint8Array, but these (for historical reasons) zero-pad to 2 bytes, not 1.

  • The library has methods to compress to "an encoded URI component", which suggests a URL-safe value requiring no percent-encoding. This is not true: rather than using urlsafe Base64 (characters -_) this library uses a custom alphabet (+-) that still uses +, causing a need for percent-encoding in query strings.

Reference implementation

The provided reference implementation separates the wire format from the compression logic itself. You likely want to call the functions under "Main API" at the end. It should be equivalent to the JS library except for the following differences, which mainly concern zero-padding:

  • Unlike the JS implementation, this one verifies that the bitstream contains no junk after the EOF marker (only padding). For the original behavior of accepting any input even if it has trailing junk, remove the assert for padding in decode_packets.

  • Functions only produce legal Base64, which can differ from the Base64 produced by (old versions of) the JS library by one or two extra trailing A characters.

  • Functions that produce a byte buffer pad only to 1 byte.

Despite these differences, AFAIK it should be interoperable with all versions of the JS library (in both directions), though I have yet to verify that using the official test suite.

from typing import Iterator, Iterable

# GENERIC BITSTREAM UTILITIES
# these take/produce the bits in BE (MSB-first) order

class BitReader:
    def __init__(self, bytes: Iterable[int]):
        self.bits = ( (x >> i) & 1 for x in bytes for i in reversed(range(8)) )

    def read_int_le(self, size: int) -> int:
        ''' consume a fixed-size LE (LSB-first) integer '''
        result = 0
        for i in range(size):
            result |= next(self.bits) << i
        return result

class BitWriter:
    def __init__(self):
        self.bytes = bytearray()
        self.pos = 0

    def write_bit(self, bit: int):
        if self.pos == 0:
            self.bytes.append(0)
            self.pos = 8
        self.pos -= 1
        self.bytes[-1] |= bit << self.pos

    def write_int_le(self, x: int, size: int):
        ''' write a fixed-size LE (LSB-first) integer '''
        for i in range(size):
            self.write_bit((x >> i) & 1)
        assert not x >> size
# DECOMPRESSION

def decode_packets(data: bytes):
    ''' decode the packets from a byte buffer '''
    bits = BitReader(data)
    max_opcode = 2
    while (opcode := bits.read_int_le(max_opcode.bit_length())) != 2:
        if opcode < 2:
            yield "literal", bits.read_int_le([8, 16][opcode])
            max_opcode += 2
        else:
            yield "dictionary index", opcode - 3
            max_opcode += 1
    padding = list(bits.bits)
    assert len(padding) < 24 and not any(padding)

def decompress(data: bytes):
    ''' decompress a buffer of bytes into a stream of UTF-16 code units '''
    dictionary = []
    last_entry = None
    for kind, value in decode_packets(data):
        if kind == "literal":
            entry = [value]
            dictionary.append(entry)
        elif value == len(dictionary):
            # special case: index of the entry we're about to assign below
            entry = last_entry + last_entry[:1]
        else:
            entry = dictionary[value]
        yield from entry
        if last_entry:
            dictionary.append(last_entry + entry[:1])
        last_entry = entry
# COMPRESSION

def compress(data: list[int]) -> bytes:
    ''' compress a sequence of UTF-16 code units into a byte buffer '''
    bits = BitWriter()
    max_opcode = 2
    write_opcode = lambda x: bits.write_int_le(x, max_opcode.bit_length())
    for kind, value in compress_to_packets(data):
        if kind == 'literal':
            write_opcode(opcode := 0 if value.bit_length() <= 8 else 1)
            bits.write_int_le(value, [8, 16][opcode])
            max_opcode += 2
        else:
            write_opcode(value + 3)
            max_opcode += 1
    write_opcode(2)  # EOF marker; the rest of the last byte stays zero-padded
    return bytes(bits.bytes)

def compress_to_packets(data: list[int]):
    ''' compress a sequence of UTF-16 code units into packets '''
    dictionary = {}
    last_entry = None
    while data:
        # greedily take the longest prefix of `data` present in the dictionary
        entry = ()
        while len(entry) < len(data) and entry + (data[len(entry)],) in dictionary:
            entry += (data[len(entry)],)
        if entry == last_entry and len(entry) < len(data) and data[len(entry)] == entry[0]:
            # special case: refer to the entry that gets assigned at the end of this iteration
            entry += (data[len(entry)],)
            yield 'dictionary index', len(dictionary)
        elif entry:
            yield 'dictionary index', dictionary[entry]
        else:
            # unseen code unit: emit it as a literal and give it a dictionary entry
            entry += (data[len(entry)],)
            dictionary[entry] = len(dictionary)
            yield 'literal', data[0]
        data = data[len(entry):]
        if last_entry:
            dictionary[last_entry + entry[:1]] = len(dictionary)
        last_entry = entry
# MAIN API

from array import array

def decompress_to_string(data: bytes) -> str:
    code_units = array('H', [0xFEFF])  # BOM, so the decoder picks the machine's byte order
    code_units.extend(decompress(data))
    return str(code_units, 'utf-16')

def compress_from_string(data: str) -> bytes:
    code_units = array('H', data.encode('utf-16'))
    return compress(code_units[1:])  # skip the BOM

from base64 import b64decode, b64encode

def decompress_uricomponent(data: str) -> str:
    ''' emulates decompressFromEncodedURIComponent '''
    assert all(x.isalnum() or x in '+-' for x in data)
    # the original implementation (old versions) contains a bug that causes incomplete
    # Base64 to be emitted, so correct for that or b64decode will fail:
    data += 'A' * ((-len(data)) % 4)
    # it doesn't use the standard urlsafe alphabet, and the custom alphabet
    # still keeps the (URL-sensitive) '+', which defeats the purpose :)
    return decompress_to_string(b64decode(data, '+-'))

def compress_uricomponent(data: str) -> str:
    ''' emulates compressToEncodedURIComponent '''
    b64 = b64encode(compress_from_string(data), b'+-').decode('ascii')
    return b64.rstrip('=')  # remove padding
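A quick round trip through the Main API, for reference:

if __name__ == '__main__':
    original = 'ABBAB'
    blob = compress_from_string(original)
    print(blob.hex())                  # the compressed bytes
    print(list(decode_packets(blob)))  # the packets described above
    assert decompress_to_string(blob) == original
    assert decompress_uricomponent(compress_uricomponent(original)) == original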