joeltg/literals.md

Last active December 8, 2020 21:37

tasl manual

literals.md

Literal types

Overview
Global variables for common XSD datatypes
Using your own datatypes
tl;dr

Overview

tasl borrows its primitive types from RDF, so if you're familiar with RDF Literals and Named Nodes, you'll feel right at home. If you're not, that's okay too.

Most schema languages and type systems give us a small set of built-in primitive types to work with. For example, in TypeScript...

// (this is typescript)
type Person = {
	name: string;
	age: number;
};

... string and number are built-in primitive types. They're "primitive" because they aren't composed of other types (unlike the object type Person, which we would call "composite"), and they're "built-in" because the definition of what strings and numbers are (and how to represent and manipulate them) is part of the JavaScript/TypeScript spec.

Different languages have different primitive types. JavaScript just has one general-purpose number, but lots of languages have a bunch of more specific types instead: double, int64, uint8, etc. Sometimes strings are primitives, or sometimes they're actually composite types and char is the real primitive.

In tasl, instead of keywords like string, we use URIs wrapped in angle brackets as primitive types:

# (this is tasl)
namespace ex http://example.com/
namespace xsd http://www.w3.org/2001/XMLSchema#

class ex:Person {
  ex:name -> <xsd:string>;
  ex:age -> <xsd:integer>;
}

We call the URI inside the angle brackets a datatype. <xsd:integer> is "a literal type with datatype xsd:integer"; literal types are always "parametrized" by a specific datatype URI.

But how do we know what URIs to use as datatypes? And how does tasl know what they all mean? Well... we can actually use any URI that we want. tasl doesn't know anything about http://www.w3.org/2001/XMLSchema#integer, and it doesn't need to.

From tasl's perspective, the values of any literal type (regardless of its datatype) are always just UTF-8 strings. The datatype is an opaque tag - when you write mappings between schemas, tasl will check that datatypes are preserved (it won't let you map a literal with one datatype onto a literal with a different datatype), but it won't really use the datatype URI for anything else beyond that.

What datatypes are for is interfacing with the outside world. Just like class and property URIs, datatypes are a social contract. In this case, there was a specification published in 2004 by the W3C that defined a big collection of datatypes under the http://www.w3.org/2001/XMLSchema# namespace, with very precise specs for their lexical forms (ie how to represent them all as strings). By using the datatype xsd:integer, you're promising that all of the values of that type will follow the specification on this webpage ("42", "0", "-5", ...). This lets other people make tools that interface with instances on that assumption: for example, we could make a tool for importing an instance into a relational database that maps every literal with datatype xsd:integer to a native integer not null column, parsing an integer out of each string value based on the published spec. For datatypes that it doesn't recognize, it can always fall back to treating them as strings, since that's the baseline representation for all literal values. Datatypes are another example of using URIs to coordinate without a single central specification.
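
To make the import-tool idea concrete, here is a minimal TypeScript sketch. The names importLiteral and LiteralValue are hypothetical and not part of any tasl tooling, and the sketch assumes every integer fits in a JavaScript number. It parses values tagged with xsd:integer according to the published lexical form and falls back to the plain string for any datatype it doesn't recognize.

```typescript
// Hypothetical import helper for illustration; not part of any tasl tooling.
const XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer";

// Every literal value is a string tagged with a datatype URI.
type LiteralValue = { datatype: string; value: string };

// Lexical form of xsd:integer: an optional sign followed by decimal digits.
const integerPattern = /^[+-]?[0-9]+$/;

function importLiteral(literal: LiteralValue): number | string {
  if (literal.datatype === XSD_INTEGER) {
    if (!integerPattern.test(literal.value)) {
      throw new Error(`not a valid xsd:integer: ${literal.value}`);
    }
    const parsed = Number(literal.value);
    // Assumption: this sketch only handles integers a JS number holds exactly.
    if (!Number.isSafeInteger(parsed)) {
      throw new Error(`integer out of range for this sketch: ${literal.value}`);
    }
    return parsed;
  }
  // Unrecognized datatype: fall back to the baseline string representation.
  return literal.value;
}
```

A real importer would probably pick an arbitrary-precision integer type, since xsd:integer has no size limit.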

Global variables for common XSD datatypes

This sounds like a lot of complexity - just for primitives! - but fortunately we don't usually need to think about it. The XML Schema Definition Language namespace (http://www.w3.org/2001/XMLSchema#) has become the go-to namespace for datatypes in RDF, and we recommend defaulting to it for general-purpose use as well.

To make this easier, the following types are declared as global variables in tasl:

namespace xsd http://www.w3.org/2001/XMLSchema#
namespace rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

type string       <xsd:string>
type boolean      <xsd:boolean>
type integer      <xsd:integer>
type double       <xsd:double>
type date         <xsd:date>
type dateTime     <xsd:dateTime>
type base64Binary <xsd:base64Binary>
type JSON         <rdf:JSON>

This means you don't have to remember to include the XSD namespace in every schema, and you generally don't even have to remember the angle bracket syntax. You can use these global variables just like the TypeScript example in the beginning:

namespace ex http://example.com/

class ex:Person {
  ex:name -> string;
  ex:age -> integer;
}

... just remember that string and integer are variable names, not keywords (e.g. you could re-define them if you wanted).
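
One way to picture "variable names, not keywords" is as a lookup in which a schema's own declarations are consulted before the globals. The TypeScript sketch below models only that idea; resolveTypeName and both maps are hypothetical, and the assumption that local declarations take precedence is ours rather than a statement about how tasl tooling is implemented.

```typescript
// Hypothetical model of type-name lookup; real tasl tooling may differ.
const XSD = "http://www.w3.org/2001/XMLSchema#";
const RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

// The global variables listed above, bound to their datatype URIs.
const globalTypes = new Map<string, string>([
  ["string", XSD + "string"],
  ["boolean", XSD + "boolean"],
  ["integer", XSD + "integer"],
  ["double", XSD + "double"],
  ["date", XSD + "date"],
  ["dateTime", XSD + "dateTime"],
  ["base64Binary", XSD + "base64Binary"],
  ["JSON", RDF + "JSON"],
]);

// Assumption: a declaration in the schema shadows a global of the same name.
function resolveTypeName(typeName: string, localTypes: Map<string, string>): string {
  const datatype = localTypes.get(typeName) ?? globalTypes.get(typeName);
  if (datatype === undefined) {
    throw new Error(`unknown type variable: ${typeName}`);
  }
  return datatype;
}
```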

Note that XSD defines many additional datatypes like nonNegativeInteger, unsignedByte, yearMonthDuration, etc. that are not given global variable names in tasl. You're still encouraged to use these wherever you find them useful - the intent was simply to minimize the number of global terms so that they can be reasonably memorized.

As a general rule, try to use the most specific XSD datatype available. If you know that all of your ages will be zero or positive, you should feel free to say so:

namespace xsd http://www.w3.org/2001/XMLSchema#
namespace ex http://example.com/

class ex:Person {
  ex:name -> string;
  ex:age -> <xsd:nonNegativeInteger>;
}

This approach obviously has diminishing returns - if you start using extremely specific datatype URIs (like "ex:oddNumbersExceptFive"), fewer tools will be able to recognize them. But you should at least assume that everybody can understand the entire XSD namespace.

Using your own datatypes

The XSD namespace should cover most use cases, but sometimes you'll need to model a type of value that is best treated as a primitive but doesn't have a good pre-existing datatype. In that case, the best thing to do is to create your own custom datatype URI.

You should only try to do this for things that meet all of the following conditions:

representable as a UTF-8 string
not very large (ie you wouldn't think of it as a file)
has internal structure that can be described with a formal grammar
would be awkward to represent as a composite type in tasl

For example, here are some bad candidates for custom datatypes:

"last names" (no internal structure)
"a first name and a last name" (better represented as a product of two literals)
PDFs (better treated as a file outside of tasl entirely)
"regular expressions" (not a formal specification)
"version numbers" (not a formal specification)

... and here are some good candidates for custom datatypes:

JavaScript-style regular expressions ("^[a-z][a-zA-Z0-9]+$")
semver version numbers ("0.15.2-rc.1")

Good candidates for custom datatypes generally follow a strict mini-language of their own that can't itself be naturally modeled in tasl for some reason. But if that's what you have, go for it! It's your way to signal to the world that the values are a specific format, and that people shouldn't try to mess with them unless they understand what that format is.
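
As an illustration of what "a strict mini-language" buys you, here is a hedged TypeScript sketch of a tool that recognizes one custom datatype. The datatype URI http://example.com/datatypes/semver and the function isWellFormed are made up for this example, and the pattern is a simplified version of the semver grammar, not the full published one.

```typescript
// Hypothetical datatype URI; pick one under a domain you control.
const SEMVER_DATATYPE = "http://example.com/datatypes/semver";

// Simplified semver grammar: MAJOR.MINOR.PATCH with optional
// pre-release and build identifiers. The published spec is stricter.
const semverPattern =
  /^(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)(-[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?(\+[0-9A-Za-z-]+(\.[0-9A-Za-z-]+)*)?$/;

function isWellFormed(datatype: string, value: string): boolean {
  if (datatype === SEMVER_DATATYPE) {
    return semverPattern.test(value);
  }
  // Datatypes this tool doesn't know are opaque: any string is accepted.
  return true;
}
```

A plain "version numbers" datatype couldn't be checked this way, because there would be no grammar to check against.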

Just like class or property URIs, you don't need to do anything to start using your own datatype. Just be sure to pick a nice stable URI that you have authority over, and if you want other people to be able to interface with it, you should definitely publish documentation somewhere.

tl;dr

Use these as primitive types

tasl.md

tasl

a tiny algebraic schema language

Schemas in the Underlay are written in a tiny algebraic schema language called tasl. There are a few parts to the tasl ecosystem:

a text format for schemas
a text format for mappings between schemas
a binary format for schema instances

Schemas

Algebraic!?

"Algebraic" sounds scary, but it actually describes something very simple. In math, an algebra is any little system that starts with a few initial things, and has two different ways of combining things to get more things.

The algebra that's taught in high school is the one where the initial things are numbers and variables, and the two ways of combining them are addition and multiplication. In that context, an "algebraic expression" is something like (x * 4) + (y * (x + 1)) - a composite thing built up from some initial terms and assembled using * and +.

The algebra that we're interested in is one where the expressions are types. Here, instead of numbers and variables, our initial "primitive" things are literal datatypes like string and date, and our two ways of combining them are called product and coproduct.

LEGO Expressions

Using tasl effectively involves a different overall approach to schema design than other schema languages you may be used to. tasl doesn't have built-in concepts of optional properties, enums, class inheritance, or basically any of the usual affordances that you might typically reach for. Instead of building those features directly into the language, tasl just gives you a toolbox of composable expressions that you can use to re-create them in exactly your own terms.

So you can't just call something "optional", but you can construct a little expression that says "either this thing, or nothing". Working with tasl ends up feeling less like annotating a system diagram and more like playing with LEGOs.

Namespaces and URIs

tasl schemas use URIs from namespaces to identify things.

We use URIs in tasl in three ways: to name classes, to name properties, and to identify datatypes. We'll talk about datatypes in the next section, and just focus on the first two here. Don't worry too much about what "classes" and "properties" are exactly - for now, we're just using them casually to mean "things" and "relationships between things".

A namespace is just a "root" URI that ends in / or #, like http://schema.org/ or http://example.com/hello/world#. When we want to use terms from a namespace in a tasl schema, we have to declare that namespace in the beginning of the schema like this:

# This is a tasl schema!
# Comments begin with `#`

namespace s http://schema.org/
namespace ex http://example.com/ns#
namespace rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#

"Declaring a namespace" means giving it a short, local prefix - s, ex, and rdf in the example. When we actually use terms from the namespace, we'll always use this short prefix instead of the full base URI. For example, we'd write s:Person (which "expands" to http://schema.org/Person) or ex:favoriteColor (which "expands" to http://example.com/ns#favoriteColor), and so on. The prefixes that you give to namespaces are only scoped to each individual tasl file, and they can be whatever you want them to be, as long as they only consist of letters and numbers (and start with a lowercase letter)

Using your own namespace

The easiest (and safest) way to get started writing schemas is to use your own namespace. You don't even have to actually do anything to "create a namespace" - you don't have to run a server, or publish it anywhere, or tell anybody. You can just pick a URL that you own and start using it:

namespace hello http://my-own-domain.com/a/cool/namespace#

# The full name of this class is
# http://my-own-domain.com/a/cool/namespace#world
class hello:world {
  # ...
}

What is important is that the base URI that you pick is actually yours. This typically means that it's a http:// URL under a domain name that you own. This is important because the purpose of using URIs is to treat them as globally unique - so that anyone who encounters the same URI in two different schemas can assume that they "mean" the same thing. A schema is mostly useful for modeling the specific dataset it's developed for, but the URIs also serve as an interface to the outside world (full of other schemas) that can enable all kinds of inter-schema use cases. This won't work if people start using the same URIs in different ways, so good practice is to only use URIs that you have the authority to use. This is never enforced - it's just part of the social contract of writing schemas.

Again, there doesn't have to be anything at the URL http://my-own-domain.com/a/cool/namespace#. None of the tools will try to look for anything there. URLs are just the most convenient and accessible way for everyone to agree on who controls what namespace in a (relatively) decentralized fashion.
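
Expanding a prefixed term is purely textual, which is why nothing needs to exist at the namespace URL. As a minimal TypeScript sketch (the helper expandTerm and the map of declarations are hypothetical, not part of any tasl tooling), a term is split at its first colon and the prefix is swapped for the declared base URI:

```typescript
// Prefixes consist of letters and numbers and start with a lowercase letter.
const prefixPattern = /^[a-z][a-zA-Z0-9]*$/;

// Hypothetical helper: expand "prefix:name" using the declared namespaces.
function expandTerm(term: string, namespaces: Map<string, string>): string {
  const index = term.indexOf(":");
  if (index === -1) {
    throw new Error(`missing prefix: ${term}`);
  }
  const prefix = term.slice(0, index);
  const localName = term.slice(index + 1);
  if (!prefixPattern.test(prefix)) {
    throw new Error(`invalid prefix: ${prefix}`);
  }
  const base = namespaces.get(prefix);
  if (base === undefined) {
    throw new Error(`undeclared namespace: ${prefix}`);
  }
  // No network access happens here: expansion is string concatenation.
  return base + localName;
}

// The declarations a schema file might contain.
const declared = new Map<string, string>([
  ["hello", "http://my-own-domain.com/a/cool/namespace#"],
  ["s", "http://schema.org/"],
]);
```

With these declarations, expandTerm("hello:world", declared) gives the full class name from the example above.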

Some namespace naming tips:

use http instead of https
pick something that feels stable to you

Using terms from existing namespaces

There’s another way for people to agree on how to use a URI consistently - somebody can create a namespace, list and document a vocabulary of terms in that namespace in a human-readable format, and then everybody can just follow that.

The URI type

In addition to literals, tasl schemas have a separate kind of primitive type for URIs. Here, we're not talking about the URIs that we use in schemas to label classes, properties, and datatypes - we're talking about a simple, single type called "uri" that we use as a type for URI values in datasets.

We write the URI type as an empty pair of angle brackets <>, or we can use the global variable uri that is defined for all tasl schemas.

type uri <>

The URI type ends up being extremely useful. You should use it whenever you need to model values that are global identifiers, like ISBN numbers, DOIs, orcid IDs, etc. We'll talk about this more later.

Product types

Product types are one of the two composite types in tasl - that means they're one of the ways that we can build "bigger" types out of smaller ones. In other contexts, they're also called structs, maps, records, tuples, vectors, or objects.

A product type is a map from URI keys to types, written using curly braces { }, arrows ->, and semicolons. We call the slots of a product type its components, and the two parts of each component are its key (the URI) and its value (the type).

We've already seen one product type in action:

namespace s http://schema.org/
namespace ex http://example.com/ns#

class s:Person {
  ex:favoriteColor -> string;
  ex:birthday -> dateTime;
}

The curly braces aren't part of the class declaration (like they would be in JavaScript, for example) - the grammar for declaring a class is just "class uri type". The curly braces define an inline product object with two components. The first component has key ex:favoriteColor and value string; the second component has key ex:birthday and value dateTime.

Product types correspond to the idea of "AND" or "combination". The value of a product type has a value for every one of its components.

Coproduct types

Coproduct types are the other kind of composite type in tasl. They're also known in other contexts as discriminated unions, sums, or variants.

Coproduct types also map URI keys to types, but they're written using square brackets [ ] and inside-out arrows >-. We call the slots of a coproduct type its options. The two parts of each option are its key (the URI) and its value (the type).

namespace s http://schema.org/
namespace ex http://example.com/ns#

class s:Person {
  ex:favoriteColor -> string;
  ex:birthday -> dateTime;
  ex:height -> [
    ex:official >- double;
    ex:unofficial >- string;
  ]
}

Coproducts correspond to the idea of "OR" or an "alternative". A value of a coproduct type has a value for exactly one of its options. Here, we've added a third component to our s:Person type, with key ex:height and a value that is a coproduct type with two options. The first option has key ex:official with value double; the second has key ex:unofficial with value string.

However, coproducts behave a little differently from the unions you might be used to. A value of a coproduct type has a value for exactly one of its options, and it also knows explicitly which option that is.

In most schema languages or type systems, the union of a type with itself is the same type. For example, in TypeScript, if I define type hello = string | string, the type hello behaves exactly like string - a value of type hello is just a string like "world".

But in tasl, if I have a coproduct

namespace ex http://example.com/ns#

type hello [
  ex:one >- string;
  ex:two >- string;
]

a value of type hello will be a pair like (ex:one, "world") or (ex:two, "world"). This ends up being useful for modeling things like enums, which we'll see later.
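
For readers coming from TypeScript, the closest analogue is a tagged union rather than a plain union. The sketch below is only an analogy and assumes nothing about how tasl represents instances; the type Hello and the field names option and value are chosen for illustration. Because each value carries its option key, the two string cases stay distinguishable:

```typescript
// A TypeScript analogue of the `hello` coproduct: every value records
// which option it belongs to, using the expanded option URI as the tag.
type Hello =
  | { option: "http://example.com/ns#one"; value: string }
  | { option: "http://example.com/ns#two"; value: string };

// Branch on the option key; the compiler checks that both cases are handled.
function describeHello(h: Hello): string {
  switch (h.option) {
    case "http://example.com/ns#one":
      return `one: ${h.value}`;
    case "http://example.com/ns#two":
      return `two: ${h.value}`;
  }
}

// Same underlying string, but two different values of type Hello.
const first: Hello = { option: "http://example.com/ns#one", value: "world" };
const second: Hello = { option: "http://example.com/ns#two", value: "world" };
```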

Unit types

Labels

In most schema languages, a schema is just a collection of types. In tasl, this picture is a little more complex.

Reference types

Pattern guide

Be liberal with classes; make sure your properties are really properties
Don't over-generalize: you're just describing your context, not the whole world!
Use JSON when you need to, but only when you really need to

Built-in shortcuts

optional
edge
list

Advanced patterns

class inheritance
enums

uris.md

The URI type

Overview
Specific URI schemes
URNs
URLs
What about actual URLs?
Handling ambiguity
tl;dr

Overview

The syntax for the uri type is an empty pair of angle brackets: <>

namespace ex http://example.com/

class ex:Book {
  ex:title -> string;
  ex:isbn -> <>;
}

Alternatively, you might find uri more readable:

namespace ex http://example.com/

class ex:Book {
  ex:title -> string;
  ex:isbn -> uri;
}

Just like string and integer, uri is defined to be a global variable in tasl:

type uri <>

Intuitively, we use the URI type for values that are global identifiers, like ISBN numbers, DOIs, database UUIDs, etc. In the same way that we use datatypes to coordinate at the schema/type level, we use URIs (ie URI values) to coordinate at the collection/value level. URIs are the way that a collection exposes identifiers to the world; they're the handles that we will use for matching, joining, co-identifying, etc. across collections.

Typically, when you use the URI type somewhere in a schema, you expect all of the values of that type to be a certain kind of URI - to all start with a certain prefix or all match some specific format. For now, there's no way to express this in tasl. All URIs are valid values for the URI type - but you should document what you expect with inline comments in the tasl file.

namespace ex http://example.com/

class ex:Book {
  # These should be ISBN URNs, e.g. urn:isbn:0-486-27557-4
  ex:isbn -> <>;
  ex:title -> string;
}
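
Since tasl itself can't express the constraint, a check like the following would have to live in your own tooling. This is a TypeScript sketch with a hypothetical helper isIsbnUrn; it only tests the urn:isbn: prefix and the characters that follow, not ISBN check digits.

```typescript
// Hypothetical check for the convention documented in the schema comment.
// Accepts urn:isbn: followed by digits, hyphens, and an optional X check
// character; it does not verify ISBN length or check digits.
const isbnUrnPattern = /^urn:isbn:[0-9][0-9Xx-]*$/;

function isIsbnUrn(uri: string): boolean {
  return isbnUrnPattern.test(uri);
}
```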

You should try to use URIs as much as you can, even if you wouldn't immediately think of the value as a URI. Here are a few ways of URI-ifying things:

Specific URI schemes

The very first part of a URI is called the URI scheme. http is a URI scheme. mailto is a URI scheme. file is a URI scheme.

The official URI schemes are registered with IANA and listed here. In practice, lots of people also use unofficial URI schemes that aren't registered. Wikipedia has a good summary of the official and some common unofficial schemes here.

If you're working with a kind of value that has a relatively commonly-used URI scheme of its own, you should use it! This applies to things like:

Email addresses (mailto:hello@example.com)
DOIs (doi:10.1000/182)
Git repositories (git://github.com/user/project-name.git)
Files on AWS S3 (s3://mybucket/puppy.jpg)
Files on IPFS (dweb:/ipfs/Qm...)
Blocks on IPLD (dweb:/ipld/bafk...)
Bitcoin addresses (bitcoin:...)
Magnet links (magnet:...)
Songs on Spotify (spotify:...)

These are all better modeled using the URI type than as literal string values (or as literals with some custom datatype).

namespace ex http://example.com/

class ex:User {
  # mailto:...
  ex:email -> <>;
  ex:username -> string;
}

class ex:Repository {
  # git://...
  ex:id -> <>;
  ex:owner -> * ex:User;
}

URNs

A URN is a URI that starts with urn: and then has one of the namespaces officially registered with IANA after it, each of which specifies the allowed syntax for the remaining URI components. For example, isbn is one of the registered URN namespaces, and its syntax is "put the ISBN after urn:isbn:".

Most identifiers relating to standards bodies (ISO, IETF, etc) have URN namespaces. Do you have ISSNs in your schema? Use the URI type with ISSN URN values! ISANs? Got that too. Life science identifiers? Use LSID URNs. OIDs? Use OID URNs. If the value you're modeling has a URN namespace, you should use it.

(Don't try to use URNs for DOIs. They're not officially registered as a URN namespace and the doi URI scheme is more canonical.)

The most generally useful URN namespace is uuid. If you have UUIDs in your data that you want to publish, use the URI type and format your values like this:

urn:uuid:3cf5a9a7-f6e5-4e83-ba0a-af88dc8360ab

Adding UUID ids is one of the easiest ways to give your class entities permanent external identity. We'll talk about this again later.
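
Formatting these values is mechanical. As a small TypeScript sketch (toUuidUrn is a hypothetical helper, and lowercasing the hex digits is our assumption about the preferred form), you validate the UUID's shape and prepend the namespace:

```typescript
// Eight, four, four, four, and twelve hex digits separated by hyphens.
const uuidPattern = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;

// Hypothetical helper: turn a bare UUID into a urn:uuid: URI value.
function toUuidUrn(uuid: string): string {
  // Assumption: normalize to lowercase so equal UUIDs give equal URIs.
  const normalized = uuid.toLowerCase();
  if (!uuidPattern.test(normalized)) {
    throw new Error(`not a UUID: ${uuid}`);
  }
  return "urn:uuid:" + normalized;
}
```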

URLs

URLs are the most familiar kind of URI, since we use them for links on the web and see them displayed in our browsers all the time.

But using URIs in collections - even when they happen to be http URLs - is an entirely different thing, unrelated to the world wide web. URI values are just global identifiers; they aren't expected to be resolvable. Don't treat URIs (even http URLs) as links, and don't model links as URIs (more about this later). If you see a URL used as a URI value in a collection, you should never assume that there's actually a webpage there. Similarly, you should never just copy URLs from the internet to use as URI values. You should treat URIs that happen to be URLs the same way you treat URNs: they're just another global hierarchical namespace, except that URLs use domain names instead of the IANA registry.

It's a lot easier to register a domain name than it is to register a URN namespace or URI scheme with IANA, so URLs usually end up being the easiest way to make your own URIs. This is useful if you have some kind of internal identifiers, like numbers or short codes that you want to publish (if you're using UUIDs you can use the UUID URN namespace).

IMDB is a good example. Every movie in their database is identified by a short identifier (like tt0492494), and every actor is identified by a slightly different kind of identifier (like nm1055413). These show up in the URLs of the pages on their website, and also in their API and CSV exports. If IMDB were designing a schema for their dataset, they'd probably want to include the IMDB id of each movie and actor, and the best way to do that is for them to come up with a URL format for them:

namespace imdb http://imdb.com/

class imdb:Movie {
  # http://imdb.com/title/tt0492494
  imdb:id -> <>;
  imdb:title -> string;
  # ...
}

class imdb:Actor {
  # http://imdb.com/name/nm1055413
  imdb:id -> <>;
  imdb:name -> string;
  # ...
}

In this case, the URI values are very similar to what you'd see in your browser when you view a movie or actor on their website - but not exactly the same.

Good URLs:

always use http and never use https
don't use www.
tend to have an alphanumeric id component
are generally not human-readable
are relatively permanent

If the URL has a page title, name, or is just a path like http://example.com/folder1/folder2/file, it's probably not a good URI value.
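
Only the first two items on that checklist are mechanical enough to automate; permanence and readability need human judgment. A hedged TypeScript sketch of such a lint, with the hypothetical name lintUrlValue, might look like this:

```typescript
// Hypothetical lint for URL-shaped URI values. It covers only the two
// mechanical rules from the checklist; the rest need human judgment.
function lintUrlValue(uri: string): string[] {
  const warnings: string[] = [];
  if (/^https:\/\//.test(uri)) {
    warnings.push("use http instead of https");
  }
  if (/^https?:\/\/www\./.test(uri)) {
    warnings.push("drop the www. prefix");
  }
  return warnings;
}
```

An empty result doesn't mean the value is a good URI, only that it passes these two checks.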

What about actual URLs?

So if URLs in the "this links to a webpage" sense make bad URIs, how should we model webpage links? This is obviously an extremely common kind of value!

The recommended way to model website URLs (that are meant to link to a webpage) is, oddly enough, as string literals. In the IMDB example, this would look like this:

namespace imdb http://imdb.com/

class imdb:Movie {
  # http://imdb.com/title/tt0492494
  imdb:id -> <>;

  # https://www.imdb.com/title/tt0492494
  imdb:url -> string;

  imdb:title -> string;
  # ...
}

Don't be afraid of the superficial redundancy here. The imdb:id and imdb:url properties serve different purposes, and it's good to be able to restructure the website without having to change the id format.
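
One way to keep the two apart in practice is to derive both from the same short identifier with separate functions. The TypeScript sketch below assumes movie ids are "tt" followed by digits, as in the example; the function names and the id pattern are hypothetical, and the URL formats are just the ones shown in the schema comments above.

```typescript
// Assumption: movie ids look like "tt" followed by digits, as in the example.
const movieIdPattern = /^tt[0-9]+$/;

function checkMovieId(id: string): void {
  if (!movieIdPattern.test(id)) {
    throw new Error(`not a movie id: ${id}`);
  }
}

// The stable identifier, used as a value of the URI type.
function movieIdUri(id: string): string {
  checkMovieId(id);
  return "http://imdb.com/title/" + id;
}

// The link to the webpage, stored as a string literal. This format can
// change with the website without touching movieIdUri.
function moviePageUrl(id: string): string {
  checkMovieId(id);
  return "https://www.imdb.com/title/" + id;
}
```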

Handling ambiguity

There will inevitably be situations where it's not clear whether some value should be treated as a literal or a URI. And even in cases where a value is clearly some kind of identifier, there might be several plausible ways to represent it as a URI.

Corralling values into useful formats is often more art than science, and involves balancing different priorities that will never be in perfect harmony. You want your values to be as structured as possible without making them too hard to work with (and without pretending there's more structure than there really is). Make educated guesses, use the examples here as a general guide, and document your choices in your schema with inline comments.

tl;dr

If you have identifiers in some URI format, use the URI type. You write it like this: <>. If you want to make up your own URI format with a domain name that you control, go for it. Don't use URLs from websites as URIs. Don't expect URIs to be URLs, even if they start with http://.... It's good to use URIs as much as you can.
