Text Catalogs: an RFC

Abstract

This document proposes "Text Catalogs" -- "tcat" for short -- a convention for file naming and content organization for document stores. Text Catalogs are specifically aimed to be legible to humans, friendly to version control, and supportive towards reconciling and converging independently-authored data sets.

Status of This Memo

This is a design outline and a request for comments. (It is not (yet?) a formal IETF RFC.)

Introduction

Storing and sharing data requires choosing how it will be serialized and how it will be organized. There are many different possible approaches to this.

Text Catalogs are aimed at situations where:

There are many pieces of data to store, and that data might be suitable for a "document store" style of system (e.g., many JSON objects, though we are not specific to that format).
The filesystem is the primary storage medium, and it is desirable for the files to be directly legible by humans.
A combination of properties of the data will be used to determine the filenames used to store the complete data.

In Text Catalogs, our aims will be to emphasize simplicity; use the filesystem as much as possible; and avoid enforcing specific data formats as much as possible. In other words, we shall aim to be almost not a standard at all. However, there are a few things within the realm of how filesystem paths are chosen that we do believe can benefit from some standardization and pre-negotiation when designing systems, and that is where we shall focus our attention in defining Text Catalogs.

In the following subsections, we will break down these goals more concretely, then examine existing alternatives and how Text Catalogs would differ from them.

Goals

(Reading note: terminology is introduced more formally in the following section; reading that, and then re-reading this section on Goals may be helpful.)

Text Catalogs must be legible as plain files. The directory structure must fit cleanly in a traditional hierarchical filesystem and interoperate well with the respective tools, including shell oriented operations. The filenames must be obviously derived from the data, in a process that makes reasonable sense to the reader on immediate sight.
Text Catalogs can reasonably contain many data objects, and represent this as many files.
When a Text Catalog has a schema, then when entering a data object into the catalog, a filename can be derived deterministically from the data object and the schema alone (e.g., without checking the filesystem for existing filenames).
Text Catalogs with the same schema but different data can be merged, without requiring conflict resolution or any interactive oversight, so long as none of the data objects have exactly equal values for all the properties used in filesystem path selection. [^1]
Text Catalog data objects may be in any serial format. If it's a well-known object format, tooling for Text Catalogs may attempt to behave smartly with that; if it's an opaque binary format, tooling for Text Catalogs should do its best nonetheless.
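
As a rough illustration of the merge goal above, the following Python sketch copies every file of one catalog into another and stops only when two files claim the same path with different content. The function name `merge_catalog` and the choice to silently skip byte-identical files are assumptions made for this sketch; they are not part of the proposal.

```python
import shutil
from pathlib import Path

def merge_catalog(source, target):
    """Copy all files from the `source` catalog into `target`.

    Returns the list of relative paths that were copied. Raises
    FileExistsError when both catalogs hold different content at one path.
    (Hypothetical helper: skipping identical files is our own assumption.)
    """
    source, target = Path(source), Path(target)
    copied = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = src_file.relative_to(source)
        dest = target / relative
        if dest.exists():
            if dest.read_bytes() != src_file.read_bytes():
                raise FileExistsError(f"conflicting content at {relative}")
            continue  # same path, same bytes: nothing to do
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dest)
        copied.append(relative)
    return copied
```

No interactive step is needed as long as the two catalogs never derive the same path for different data, which is exactly the property the schema is meant to guarantee.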

Some additional, softer goals, in no particular order:

When the document format used in a Text Catalog is a widely known format, such as JSON, then tools that work with Text Catalogs may be able to offer bonus features and friendlier APIs, such as:
- "put" operations that require only the data object as an argument (and infer the filesystem coordinates out of the data automatically, rather than requiring them be specified redundantly);
- automatic formatting and other linting services;
- pathing into the documents, via mechanisms like JSONPath (RFC 9535);
- patching documents, via mechanisms like JSONPatch;
- and automatic deriving of variations of formats, such as producing HTML indexes of JSON content, or producing TOML variations of content; etcetera.
Text Catalog schemas should guide authors of a schema towards specifying file path derivations that avoid collisions.
Text Catalog schemas should make it clear how and where to involve additional optional behaviors like directory sharding.
Text Catalogs should leave room on the filesystem to also contain other files with other extensions, such as html files, adjacent to the data files.
- We give this example with html files in particular because we aim for Text Catalogs to be easy to place in static websites. Users of a website should be able to switch from the rendered html to a plain data format by changing the extension in the URL bar manually, without significant effort. (This change should be at the end of the URL, and not in a subdomain at the beginning, because this should be hostable on a single machine with no DNS services in play.)
Text Catalogs should be fully transferrable over HTTP. That means easily machine-readable indexes for each coordinate component must be possible to easily automatically maintain.
- This behavior should be defined but also be optional to use. Maintaining such indexes would make little sense when the filesystem is not being prepared for such transport (such as when it's instead vendored in git repos and transported thusly), nor do indexes always make sense at all (e.g. for some blob storages they may not be relevant), and so maintaining these indexes is not a mandatory feature of Text Catalogs.

[^1] - Note that this collision avoidance property is less trivial than it may first seem! Consider a situation where data is stored in JSON files, and the path is derived from a property in the data called "foo". Now consider that two catalogs are authored, one with a document with a "foo" property of "bar"... and another, with a document with a foo property called "bar.json/whoopsie". You should be able to see the problem with this! Text Catalogs and the very small amount of formalism they offer are meant to help system designers preemptively avoid this very thing.
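
The problem described in this footnote can be made concrete with a small Python sketch. The function `naive_path` is hypothetical and deliberately unconstrained; it stands for the kind of derivation that Text Catalogs are meant to steer designers away from.

```python
from pathlib import PurePosixPath

def naive_path(foo):
    # Deliberately naive: the property value is used as the path, unchecked.
    return PurePosixPath(foo + ".json")

first = naive_path("bar")                 # bar.json
second = naive_path("bar.json/whoopsie")  # bar.json/whoopsie.json

# The first catalog needs "bar.json" to be a file; the second needs a
# directory of the same name, so the two catalogs cannot be merged.
print(first in second.parents)  # True
```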

Terminology

"coordinates"
- -> the tuple of named values[^2] which is used to determine a file path within a Text Catalog. (See examples in TODO:sectionlink.) Coordinates in Text Catalogs are meant to be directly transformable to file paths.
"coordinate" (singular)
- -> one value within a coordinates tuple. A coordinate value must be a string (or transformable to one).
"data object"
- -> the values stored in a Text Catalog. One data object is exactly the content of one file. Data objects are often in a format such as JSON; however, Text Catalogs are not prescriptive about the format of data objects.
"document"
- -> sometimes used as a synonym for "data object".
"file"
- -> the filesystem concept. Files are generally understood to be on a POSIX filesystem, where relevant.
"file path"
- -> the full name and path to a file within a Text Catalog. (E.g., "foo/bar/baz.json", not just "baz.json".) In Text Catalogs, file paths are meant to be derivable from coordinates.

With this terminology in hand, we can make some definitions of their relationships, such as:

A file path (and therefore a coordinates tuple) is both necessary and sufficient to load a data object from a Text Catalog.

[^2] - a "tuple of named values" is also sometimes known as a "record".

Alternatives

Moved to Alternatives section, further below.

Defining a Text Catalog

A Text Catalog is defined by defining the Coordinate systems it will contain.

A shorthand is used for this: a coordinate system (x, y, z) means that the coordinates shall have three parts, and that they are called "x", "y", and "z". A coordinate in this system might then be written (x="1", y="2", z="3").

This coordinate in this system might be projected in a Text Catalog into a filesystem path of ./_x/1/_y/2/_z/3.json.

Variations of Path Derivation

The default path derivation mechanism is:

the name of a coordinate component is prefixed with an underscore, and becomes a directory.
the value of that coordinate component becomes the next directory name.
this is repeated until the last component, which becomes a file, and uses an extension suffix that is reasonable to the format (e.g. often .json).
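
The default mechanism above can be sketched in a few lines of Python. The function name `derive_path` and its `extension` parameter are illustrative assumptions, and the sketch performs no validation of the coordinate values.

```python
from pathlib import PurePosixPath

def derive_path(coordinates, extension=".json"):
    """Project coordinates onto a relative file path.

    `coordinates` is a dict such as {"x": "1", "y": "2", "z": "3"}; its
    insertion order is taken as the component order. (Hypothetical helper.)
    """
    if not coordinates:
        raise ValueError("at least one coordinate component is required")
    items = list(coordinates.items())
    parts = []
    for index, (name, value) in enumerate(items):
        parts.append("_" + name)  # component name, underscore-prefixed
        is_last = index == len(items) - 1
        # The last value becomes the filename and gains the extension.
        parts.append(value + extension if is_last else value)
    return PurePosixPath(*parts)

print(derive_path({"x": "1", "y": "2", "z": "3"}))  # _x/1/_y/2/_z/3.json
```

Because the result depends only on the coordinates and the declared scheme, it can be computed without consulting the filesystem.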

Other mechanisms can be declared:

a "bare" coordinate component skips the coordinate name directory.
the coordinate component name directory can be customized to not use the leading underscore present in the default behavior.
a boolean coordinate component can be used, in the last component of the coordinate, in which case the coordinate component name becomes the filename instead of the value being used. (TODO: review: the only worked examples where this appeared so far appear to be using it for varargs-like behavior, wherein 'false' is not well defined; this may not be the clearest way to capture that.)

The definer of coordinate schemes must be mindful to pick a coordinate scheme that is cleanly distinguishable from data that will be stored within it, and cleanly distinguishable from other coordinate schemes placed in overlapping paths, if applicable.

Catalogs Containing multiple Coordinate systems

A Text Catalog can contain multiple coordinate systems, as long as they don't collide with each other.

((TODO: more worked examples.))

Coordinate Constraints and Normalization

Coordinate names may not begin with underscore ("_").

Some values must never be permitted as a valid coordinate. The strings "." and ".." are never valid as a coordinate. In a coordinate that permits "/" characters, when the coordinate string is separated by those slashes, the strings "." and ".." must never be one of the resulting segments.

Other characters disrecommended in filenames are also disrecommended in a coordinate. For example, colon characters -- ":" -- may have reserved behaviors on some filesystems, and should be avoided, unless the definer of a Text Catalog schema is certain they never expect the Text Catalogs they are defining to be usable on such a filesystem.

Using mixed case in coordinate values is disrecommended. Some filesystems do not distinguish case, and Text Catalogs stored on such a filesystem may be subject to file path collisions that would not be present on other filesystems. Therefore, lowercase coordinate values are recommended unless the definer of a Text Catalog schema is certain they never expect the Text Catalogs they are defining to be usable on a non-case-sensitive filesystem.
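
A minimal Python sketch of how a tool might check these rules follows. The function name `check_coordinate`, the `allow_slash` flag, and the split into "error" and "warning" messages are assumptions of this sketch; the document itself only distinguishes values that are never valid from values that are disrecommended.

```python
def check_coordinate(value, allow_slash=False):
    """Return a list of problems found in one coordinate value.

    "error:" entries reflect the never-permitted rules; "warning:" entries
    reflect the disrecommendations. (Hypothetical helper and wording.)
    """
    problems = []
    if not allow_slash and "/" in value:
        problems.append("error: '/' is not permitted in this coordinate")
    # When slashes are permitted, each segment is checked separately.
    segments = value.split("/") if allow_slash else [value]
    if any(segment in (".", "..") for segment in segments):
        problems.append("error: '.' and '..' are never valid")
    if ":" in value:
        problems.append("warning: ':' is reserved on some filesystems")
    if value != value.lower():
        problems.append("warning: mixed case may collide on case-insensitive filesystems")
    return problems

print(check_coordinate("foo.org/bar", allow_slash=True))  # []
```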

((Future work: fully define and offer a suite of recommended constraints as regular expressions.))

A Note on Strings

The definition of a string is outside the scope of this document.

Rules described for Coordinate Constraints and Normalization are defined such that they can be implemented over the ASCII subset of strings.

We generally recommend treating strings as UTF-8, but the Text Catalog specification does not have any direct dependency on UTF-8.

In situations where string encoding canonicalization may be relevant, we recommend UTF-8 NFC form, but again, the Text Catalog specification has no direct dependency on this.
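
To illustrate why canonicalization can matter, this short Python sketch shows two strings that render identically but compare unequal until normalized to NFC. It uses only the standard library's `unicodedata` module; the sample strings are our own.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Without normalization, two such values used as coordinates would yield two different file paths that look the same to a human reader.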

As background: Filesystems in contemporary use around the world vary in their native encoding schemes (and sometimes also vary in what they support vs what they present to users through the most common interfaces). In Linux systems, it is typical that filesystems support all 8-bit byte sequences excluding the null byte and directory separator character. On Windows systems, it is typical that filesystems use UTF-16 encodings, and reserve a wide range of punctuation characters. (Confusingly, which characters are reserved by the filesystem vs by the Windows APIs for manipulating it may differ significantly: an NTFS filesystem written to by Linux drivers can happily contain many sequences that the Windows Explorer will not allow the creation of!) On Mac systems, it is not uncommon for filesystems to be case-insensitive, meaning that the filenames "Foo" and "foo" collide. Many other variations beyond these exist!

In an absolute sense, Text Catalogs cannot preemptively account for all possible variations of character encoding support in all possible variations of filesystem implementation and configuration. In a practical sense: defining rules which can be implemented on ASCII is sufficiently clear and widely agreed upon that we expect Text Catalogs defined on that basis to be reasonably portable.

Examples

(These examples are retrofits of Warpforge catalogs into generalized Text Catalogs. More minimal examples may be better but are future editing work.)

(These examples describe some additional parsing of strings into Text Catalog coordinates, which may or may not end up being part of the definition of Text Catalogs; not yet determined.)

We define the Warpforge Catalog as having the coordinate schemes:

(module, version, item).
- "foo.org/bar:v123:frob" is parsed to this by colon separation.
  - implies: colons are forbidden in all components of the coordinate.
- ("foo.org/bar", "v123", "frob") -> maps to path(s)...
  - "./_module/foo.org/bar/_version/v123/_item/frob.json"
  - "./_module/foo.org/bar/_version/v123/_item/frob.html"
(module, version, item, meta.bool) + mpath.
- "foo.org/bar:v123:frob::author" is parsed to this by colon separation, with double colon indicating the meta component.
  - implies: colons are forbidden in all components of the coordinate.
- ("foo.org/bar", "v123", "frob", true) + "author" -> maps to path(s)...
  - "./_module/foo.org/bar/_version/v123/_item/frob/_meta.json"
(module, version, meta.bool) + mpath.
- "foo.org/bar:v123::author" is parsed to this by colon separation, with double colon indicating the meta component.
  - implies: colons are forbidden in all components of the coordinate.
  - implies: item can be empty, and double-colon has higher parse binding than single.
- ("foo.org/bar", "v123", true) + "author" -> maps to path(s)...
  - "./_module/foo.org/bar/_version/v123/_meta.json"
(module, meta.bool) + mpath.
- "foo.org/bar::author" is parsed to this by colon separation, with double colon indicating the meta component.
  - implies: ... you get the drill. Same as above.
- ("foo.org/bar", true) + "author" -> maps to path(s)...
  - "./_module/foo.org/bar/_meta.json"

Notice that these are four completely distinct schemes. They just happen to have been given visually similar pseudocoordinate string parse functions, and their filesystem paths do considerably overlap (but cannot collide!).
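
The colon-separated strings used in these examples could be parsed along the following lines. This Python sketch is our own reading of the examples: the function name `parse_ref` and the dictionary keys are hypothetical, and whether such parsing belongs in Text Catalogs at all is, as noted above, undetermined.

```python
def parse_ref(text):
    """Split a pseudocoordinate string into (coordinates, mpath).

    `mpath` is None when no "::" is present. (Hypothetical helper; only the
    colon conventions are taken from the examples.)
    """
    # "::" binds tighter than ":", so it is split off first.
    head, separator, mpath = text.partition("::")
    names = ("module", "version", "item")
    values = head.split(":")
    if "" in values or len(values) > len(names):
        raise ValueError(f"cannot parse {text!r}")
    coordinates = dict(zip(names, values))
    if not separator:
        # Without a meta component, only the full three-part scheme exists.
        if len(values) != len(names):
            raise ValueError(f"expected module:version:item in {text!r}")
        return coordinates, None
    if mpath == "" or ":" in mpath:
        raise ValueError(f"invalid meta path in {text!r}")
    coordinates["meta"] = True
    return coordinates, mpath

print(parse_ref("foo.org/bar:v123::author"))
# ({'module': 'foo.org/bar', 'version': 'v123', 'meta': True}, 'author')
```

The number of components before the double colon selects which of the four schemes applies.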

Some other schemas are sometimes co-located in webserver presentations:

_wares/(packtype.bare, hash.bare)=blob
- ("tar", "asdfqwer.tgz") -> maps to path...
  - "./_wares/tar/asdfqwer.tgz".
  - that's it.
- Note that we actually push a full extension through the "hash" component. A Text Catalog doesn't, and doesn't need to, know about this. In our example use case here, we only care about extensions so that e.g. web clients do sensible things without extra effort. They're not relevant for any collision avoidance or other structural features when it's a blob.
- (Future work: supporting sharding explicitly in Text Catalogs may be desirable.)

Alternatives

Databases

Text Catalogs are not meant to be an alternative to relational databases.

In particular, Text Catalogs are not designed to support SQL queries, nor are they designed to support ACID transactions, both of which are expectations of typical "databases".

Text Catalogs do still support some level of transactions, when correctly implemented, but this is generally on the scale of individual documents. On the basis of Text Catalogs being "just files", the transactionality we can reliably support is limited to that which is reasonably attainable on POSIX filesystems.
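
One common way to get document-scale atomicity on a POSIX filesystem is to write to a temporary file and rename it over the target. The Python sketch below shows that technique; it is an assumption about how an implementation might proceed, not something this document prescribes, and `put_atomically` is a hypothetical name.

```python
import os
import tempfile

def put_atomically(path, data):
    """Write `data` (bytes) to `path` so readers see the old or the new
    content, never a partial write. (Hypothetical helper.)"""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the target directory so that the rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

This covers a single document only; nothing comparable spans several files, which is why we do not claim database-style transactions.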

Doing Nothing

Data can always be stored in files. Files can be named however one wishes.

Text Catalogs offer a convention for file naming which helps avoid conflicts in advance when filenames are machine-managed based on some properties of the document.

Content Addressing

Content Addressing is a system of organizing data storage by hashes of the content. In other words, content addressing strategies remove human-readable naming from the equation entirely.

Content addressing has significant virtues in that it (when implemented with a collision-resistant -- ideally cryptographic -- hash) removes human naming and labeling from the equation of how data is stored, which means that collisions are preemptively avoided, in an even stronger fashion than offered by Text Catalogs.

The downside of content addressing is the same as its upside: it removes human labeling. This makes content addressing systems typically harder to use (until additional layers of usability and labeling systems are strapped back onto them). Content addressing systems also provide no indexing and lookup systems on their own; this also has to be strapped on with additional layers of system.

By contrast, Text Catalogs are human readable, and provide one natural index.
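
For comparison, content addressing as described above can be sketched in Python as follows. SHA-256 is used here only as an example of a cryptographic hash, and `content_address` is a hypothetical name.

```python
import hashlib

def content_address(data):
    # The name depends only on the bytes; no human-chosen label is involved.
    return hashlib.sha256(data).hexdigest()

name_a = content_address(b'{"foo": "bar"}')
name_b = content_address(b'{"foo": "baz"}')
print(name_a != name_b)  # True
```

The resulting names say nothing about what the data is, which is the usability cost discussed above.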

Security Considerations

Security considerations for Text Catalogs are generally related to considerations of filesystem paths.

The Text Catalog path derivation guidelines and the section on Coordinate Constraints and Normalization offer guidance on avoiding problematic values.

Errata

The Naming of TCAT

Text Catalogs has a nice contraction to "tcat" -- potentially relevant for a file extension. "Tea Cat" seems like it should also lend itself nicely to a logo.

Some early discussion used phrasing like "flatfile database" to refer to this work. That terminology seemed to be very misleading and poorly communicative of the goals. (The section on "Alternatives: Databases" above may provide some color on why.) We now avoid the use of the word "database" to describe the concepts of Text Catalogs.

The word "catalog" first became involved because of inspiration from the Warpforge project, and its format for communicating package releases and metadata. The word thereafter also seemed applicable to a more general structuring of data.

Other Notable Inspirations

Cargo

Text Catalogs bear considerable resemblance to Cargo's metadata formats. This is more a case of parallel and convergent evolution, but comparisons are apt nonetheless.

A few very minor distinctions can be made (if only in order to show how comparable they are):

Cargo's package metadata is stored in JSONL specifically. Text Catalogs are specified as format agnostic.
Cargo has (at some points in history) used git, specifically. (We believe changes have been made, and this is no longer the case.) Text Catalogs are defined with the intention of fitting well in a version control system like git, and recommended conventions aim to result in nice human-readable diffs in such a case, but Text Catalogs do not fixate on any particular version control system.
Cargo defines Cargo's package metadata fields and the filesystem organization as one, and is not especially intended for general re-use. Text Catalogs are attempting to specify an easily reusable outline for filesystem organization.

We consider Cargo's choices and implementation to be very reasonable, and Cargo's success to be indicative of the solidity as well as approachability of these choices.

Warpforge Catalogs

Text Catalogs were partially inspired by Warpforge Catalogs. This is true in both positive and negative ways.

Warpforge Catalogs attempted to be agnostic to serialization format. Text Catalogs retain this intention.
Warpforge Catalogs made heavy internal use of hashing. While this offered some interesting benefits such as incrementally verifiable immutability of subtrees of data, it also had considerable drawbacks, in that it made direct human editing of files functionally impossible. Text Catalogs do not use a hashing scheme when referring to other documents in the same Text Catalog.
Warpforge Catalogs considered the canonical form of data to be IPLD, and the filesystem and serialized form of the data to only be a "projection" of that canonical form. Text Catalogs do not have this extra layer of indirection.
Warpforge Catalogs defined several fields and queries specific to the Warpforge project and its usage, and were not especially intended for general re-use. Text Catalogs are attempting to specify an easily reusable outline for filesystem organization.

History

First drafted 2024-04-17.
Prior thoughts exist scattered on whiteboards, lost to time :)

warpfork/tcat-rfc.md