Skip to content

Instantly share code, notes, and snippets.

@warpfork
Last active July 17, 2024 15:46
Show Gist options
  • Save warpfork/6dfa7ee08cee03bdcca3ac826010b194 to your computer and use it in GitHub Desktop.
Save warpfork/6dfa7ee08cee03bdcca3ac826010b194 to your computer and use it in GitHub Desktop.
Gar -- a git-inspired archive and content-addressed deduplicating cache manager tool

Git archive. Comparable to tar, in a way.

Purpose

Maintains directories of Content Addressable content.

Meant to play nice as a provider of an immutable append-only cache that's easy to integrate with any system based on files and directories.

Features

Gar creates and maintains "gar heaps", which consist of two (or three) directory trees:

  • blobcas -- a large, even-depth directory where every file is stored with a name that is its sha256
    • TODO needs the x bit, fuck hardlinks lol
      • literally put a "-x" on the end of the filename, bro
    • TODO: this, or the hash of the blob, as it ends up in the tree? You prob want the goddamn blobhash. Eh. Git compat prio=max.
  • treecas -- a large directory where every dir is a treehash and contains entirely that tree, materialized.
    • every file within these is a hardlink to a file in the blobcas.
    • not every subtree is materialized at the root of this dir (we don't have dir hardlinks, so there would be no point ;)).
  • treeidx -- a directory where every file contains a compact, deterministic, serial representation of the expected hashes of every subtree in each materialized member of treecas.
    • this is not used in normal read operations (and its creation can be disabled entirely, or it can be regenerated later!).
    • Its role is to help corruption detection passes point more easily at specific subdirectories that are desynced with the overall tree hash, if corruption does occur.
    • another role is to support efficient read of whole trees if the Gar heap is exposed over a transport like plain HTTP. The index file can be transfered in a single request, and thereafter contains enough information to identity every other fetch required to get every blpb, wit no dir walking required... and enough information to make progress bars possible, too!
    • (note: while Gar is heavily, heavily based on git, the format of these particular index files is not. Gar has opted for a much simpler, more readable, and more deterministic format for these.)
  • every time you add a file or dir, they go into blobcas first, then dirs also get manifested in the treecas.
    • if requested, hardlinks to the original files will be used. (Non-default. Be cautious of using this feature: hardlinks are not COW on their own, so writing to the original files after this will corrupt your gar heap!

Gar heaps are trustless structures. They can be validated entirely from their contents.

Maintaining Gar Heaps

corruption detection

Gar can be asked to check for corruption. This has two phases:

First, every blob in the blobcas is checked to make sure its filename matches the hash of its current actual content. Any mismatches are immediately obvious as corruption.

Second, every treecas is walked, and for each directory (starting from the leaves), the

garbage collection

TODO fairly straightforward iiuc: remove things from blobcas that have link count 1. (Probably can't easily detect if other links outside of treecas are what's keeping it alive, but this seems like an acceptable limitation in practice.)

Supported Filesystem Attributes and Features

symlinks

Yes. Absolutely supported.

posix permissions

Gar does exactly what git does: it ignores every part of posix permissions, except the "x" (execute) bit.

Files and directories are always readable and writable (according to their posix bits). This is to the match the expectations that people have from working with git.

A variant of Gar can be configured to store files and dirs with all write bits removed. This may be useful as a form of defense-in-depth against accidental mutation of the cache. Toggling this is a heap-wide configuration parameter. (The heap-wide nature of this toggle is not a limitation of Gar, but a limitation of hardlinks themselves: hardlinks all share permission bits, so, we unavoidably have to duplicate files in order to have the same blob with different bits! Yes, I know, I know -- I wish hardlinks were different, too.)

uids and gids

Distinct uids and gids are not supported. Same reasons as above, regarding posix permissions -- uids and gids are included in a git treehash, so they're not what Gar aims to do; and also, managing them would actively obstruct blob dedup -- and on top of all that, it would just plain mean Gar needs to be much more complex so that it can support running with elevated privileges, with all the tricky, subtle code that implies. So, no. Each Gar heap contains exactly one uid and exactly one gid.

It is acceptable to create different gar heaps with different uids and gids.

other metadata

All files and directories in a Gar heap are set to the same mtime.

Remember, Gar is exactly as per git tree hashes: and tree hashes don't include mtime. (Commit hashes do. But those aren't what Gar is based on; and they're not helpful for content-addressed dedup, which is Gar's main purpose.) So we can't have times mucking things up.

(Hardlinks also again contribute to this decision. See previous section. Tl;dr: we'd be forced to do more actual data copying in order to support diverse mtimes, too -- yikes; no thanks.)

The default mtime is April 1st, 2010. We picked this because hopefully the first of April gives a relatively good hint to an unfamiliar human observer that this is not a real relevant timestamp (and because picking the Unix Epoch is relatively prone to having interesting edge cases when interacting with other tools in the wild). You can configure the mtime used in a Gar heap in the heap's config file.

Other attributes like ctime and atime are not considered by Gar at all. (Ctime in particular is typically fundamentally unsettable short of writing a filesystem driver so, to ignore this is... Let's just say it's a pretty normal choice.)

Usage

  • gar add ./path -- adds that file or dir tree to the nearest gar heap found up from the CWD, and returns the git sha256 treehash of, encoded as a base58 string.

Regular filesystem interactions generally play nice:

  • you can simply RM things in treecas when you're no longer interested in them.

To use this most effectively as a cache:

  • you can use it however you want, but one particularly effective option is to use read-only bind mounts to put the treecas content where you want it, with the path and dir name you want it as. This is fast, cheap, doesn't require escalated privileges on most modern linuxes, and most importantly, provides an excellent barrier again accidental mutation of the cas if working with code that isn't fully trusted to Play Nice.

You can publish these over HTTP:

  • Anyone can fetch http://example.tld/your/prefix/gar/treeidx/{gittreehash}.garidx... and then fetch every relevant blob from http://example.tld/your/prefix/gar/blobcas/{blobhash} with the guidance from that index. Tada! Decentralization was a breeze.

Mad Science

Yes, it's really git tree hashes. Literally.

TODO shell script demo to prove it.

Can I put Gar in Gar?

Well, yeah, the gar executable is just a file, so... Duh?

Can I put a Gar heap in a Gar heap?

Ahhhhh! That's what you meant.

Yep!

(Gar walks the dir tree you ask it to add and freezes that list of paths it will operate on before it starts generating new files, so this does not end up in an infinite loop.)

It'll even dedup the blobs just fine.

Formats

Most of Gar is exactly as per git.

Blob hashes in Gar are git blob hashes (using sha256).

Tree hashes in Gar are git tree hashes (using sha256).

garidx files

Garidx files are sort of comparable to git pack index files in role, but Gar opted for a much simpler format for these files. (The git format is binary, and very, very complex to parse as well as generate, and much of that complexity isn't relevant to the subset git that Gar provides equivalences with, so overall attempting to directly reuse that format would provide little to no value, at extreme cost.)

Garidx files are meant to be relatively human-readable (and even diff'able).

They're linebreak delimited[^1], with each line containing one entry, and each entry describing one path (a file or a directory), looking roughly like this:

# garidx v1
     2 ./ 040000 - afe3845ba76ec209f86
     5 ./foo 100644 99 9c4fba7ef811632cde
     7 ./some/ 040000 - b2a7ee7fc6ba1f387634
    16 ./some/script.sh 100755 4325 fe9fbc1da76325ef8e

The parse of this is as follows:

  • (The first line is a version and filetype indicator. Currently this is always exact "garidx v1".)
  • Each entry starts with a fixed width five bytes which contains an ascii base-10 number, right-aligned by prefix padding with space characters, followed by a space character.
  • Next is the path: this has the length (in bytes) indicated by the preceding number. Another space character follows the path.
    • Note! You must use the length prefix to determine how long the path is, in order to correctly handle paths that contain spaces or linebreaks characters!
  • The git tree mode number follows. (This is six bytes, and effectively is one of four values: "100644" for files, "100755" for executable files, "040000" for dirs, and "120000" for symlinks.) Then, another space.
  • A size hint follows, as an ascii base-10 number. For directories, it is instead a dash character. Then, another space.
  • The treehash (or blobhash) sha256 of the object, encoded in base58, follows.
  • Then, a line break. That's the end of the entry.

Some incidental details to note:

  • Paths always begin with a dot. This is redundant, but we consider it a visual nicety.
  • Paths that are directories are encoded with a trailing slash. This is redundant considering the mode number also indicates directories, but we consider it a visual nicety.
  • The choice of a fixed with length indicator is for (yes, again -- you guessed it) visual nicety. It makes it easier for a human reader to scroll through a large garidx file and see shared path prefixes by just eyeballing it.
  • If you're sure there's no spaces or linebreaks in any filenames in a dataset you're examining... yes, you can parse this easily with awk or even sheer column in a shell script. ;)

[^1] -- Sort of. The garidx format uses linebreaks, and the linebreaks are required, but a parser should not parse by splitting upon them. In the event of a filename that contains a linebreak (unusual, but legal!), there will be another line break in the garidx file. The only fully correct way to parse is by reading the length numbers at the start of each entry.

[^2] -- About path size limits: we've decided that "five digits base 10 is enough", but if you truly dive into this, one can argue that there are not rules. Path limits could come from the kernel, or from the filesystem. When they're from the kernel, they can be changed by... recompiling the kernel. Some numbers we've seen in the wild are: NTFS specified a length limit of 32768 bytes for a whole path. Linux typically specifies 256 bytes per path segment, and reputedly at some point had a limit of 4096 overall (although I currently find no evidence of a limit this low on my systems). Windows has been known to specify some truly short numbers we will not discuss. But ultimately: when is the last time you handled a path more than 4 kilobytes long? We think Gar can be useful even when enforcing a limit slightly over twice that.

Fiteme

I still am incline to use b58.

Is there a serious call to use b32?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment