
@motet-a
Last active July 6, 2017 06:41

This is not the real spec.

Although this draft is quite similar to the real spec, they are not exactly compatible. Overall, this one is less restrictive: I dropped the “no spaces before +” rule, and the whitespace-related rules differ. For example, the expression ( MIT) seems to be considered invalid by the real spec, but is valid under this draft.

Analysis overview

An SPDX License Expression string must be analyzed in three stages:

  • The first, scanning, splits the source string into tokens;
  • The second, parsing, produces a syntax tree from the tokens;
  • The last, semantic analysis, performs a few additional checks on the syntax tree.

Scanning

Tokens

A token represents a part of the source string.

A token must be classified in one of these types:

  • IDENTIFIER
  • OPERATOR
  • DOCUMENT-REF
  • LICENSE-REF

Character encoding

(I think it could be specified that everything must be in ASCII.)

Case sensitivity

An SPDX License Expression must be case sensitive.

Whitespaces

(It must be specified clearly which characters are ignored and treated as separators. In practice, spdx-expression-parse only accepts space characters, but other parsers allow tabs and newlines.)

Identifiers

An identifier is a sequence of one or more characters, consisting of letters, digits, dashes ("-") or dots (".").
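As an illustration, the identifier rule can be written as a regular expression. This is a minimal Python sketch, not part of the draft: it assumes that "letters" means ASCII letters (the character set is still an open question, see Character encoding), and the names IDENTIFIER_RE and is_identifier are made up for the example.

```python
import re

# One or more letters, digits, dashes or dots.
# Assumption: "letters" means ASCII letters only.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9.-]+")


def is_identifier(text: str) -> bool:
    """Return True if the whole string is a single identifier."""
    return IDENTIFIER_RE.fullmatch(text) is not None
```

Under this reading, GPL-2.0 is an identifier, while MIT+ is not, because "+" is scanned as a separate operator token.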

Ordinary identifiers, reserved words and operators

Any identifier is an ordinary one, except:

  • Identifiers "AND", "OR" and "WITH". They must be recognized as tokens with the type OPERATOR.

  • Identifiers starting with "DocumentRef-". They must be recognized as tokens with the type DOCUMENT-REF. At least one character must follow the prefix: an identifier that is exactly "DocumentRef-" must raise an error.

  • Identifiers starting with "LicenseRef-". They must be recognized as tokens with the type LICENSE-REF. At least one character must follow the prefix: an identifier that is exactly "LicenseRef-" must raise an error.

Ordinary identifiers must be recognized as tokens with the type IDENTIFIER.

Other operators

The characters "(", ")", "+" and ":" must be recognized as tokens with the type OPERATOR.
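The scanning rules above can be sketched as a small Python scanner. This is only an illustration under stated assumptions: the space character is the only separator (the Whitespaces section leaves this open), letters are ASCII, and the names Token, classify and scan are hypothetical.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)

_IDENTIFIER = re.compile(r"[A-Za-z0-9.-]+")  # assumption: ASCII letters
_WORD_OPERATORS = {"AND", "OR", "WITH"}      # case sensitive
_CHAR_OPERATORS = set("()+:")
_REF_PREFIXES = (("DocumentRef-", "DOCUMENT-REF"), ("LicenseRef-", "LICENSE-REF"))


def classify(word: str) -> str:
    """Return the token type of an identifier-shaped word."""
    if word in _WORD_OPERATORS:
        return "OPERATOR"
    for prefix, kind in _REF_PREFIXES:
        if word.startswith(prefix):
            if word == prefix:
                raise ValueError(f"{prefix!r} must be followed by at least one character")
            return kind
    return "IDENTIFIER"


def scan(source: str) -> List[Token]:
    """Split the source string into (type, text) tokens."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(source):
        ch = source[pos]
        if ch == " ":  # assumption: only spaces separate tokens
            pos += 1
        elif ch in _CHAR_OPERATORS:
            tokens.append(("OPERATOR", ch))
            pos += 1
        else:
            match = _IDENTIFIER.match(source, pos)
            if match is None:
                raise ValueError(f"unexpected character {ch!r} at offset {pos}")
            tokens.append((classify(match.group()), match.group()))
            pos = match.end()
    return tokens
```

With this sketch, scan("( MIT)") yields an OPERATOR "(", an IDENTIFIER "MIT" and an OPERATOR ")". Because matching is case sensitive, a lowercase "and" comes out as an ordinary IDENTIFIER rather than an operator.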

Parsing

The parsing stage must be performed according to the following ABNF grammar:

license-ref = [DOCUMENT-REF ":"] LICENSE-REF

license = license-ref / (IDENTIFIER ["+"])

license-with-exception = license ["WITH" IDENTIFIER]

parenthesized-expression = "(" expression ")"

atom = parenthesized-expression / license-with-exception

and-expression = atom ["AND" and-expression]

or-expression = and-expression ["OR" or-expression]

tag-value-format-expression = parenthesized-expression / 
                              IDENTIFIER / 
                              license-ref

expression = or-expression

Token types map to rule names in the previous grammar, except for tokens with the OPERATOR type which are referred to with literal strings. For example, the literal "(" in the grammar must match a token with the OPERATOR type which matched a "(" in the source string.

The start rule must always be expression, except in Tag:value format where the tag-value-format-expression rule must be used instead.
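One way to implement the grammar is a recursive-descent parser with one method per rule. The Python sketch below is an illustration, not a reference implementation: it takes tokens as (type, text) pairs, it only covers the expression start rule (tag-value-format-expression is left out), and the Parser class, the parse function and the tuple-shaped tree nodes are all assumptions made for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)


class Parser:
    def __init__(self, tokens: List[Token]):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return (None, None)  # end of input

    def accept_op(self, text: str) -> bool:
        """Consume the given OPERATOR token if it is next."""
        if self.peek() == ("OPERATOR", text):
            self.pos += 1
            return True
        return False

    def expect(self, kind: str, text=None) -> str:
        tok = self.peek()
        if tok[0] != kind or (text is not None and tok[1] != text):
            raise SyntaxError(f"expected {text or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def license(self):
        # license = license-ref / (IDENTIFIER ["+"])
        kind, text = self.peek()
        if kind == "DOCUMENT-REF":
            self.pos += 1
            self.expect("OPERATOR", ":")
            return ("license-ref", text, self.expect("LICENSE-REF"))
        if kind == "LICENSE-REF":
            self.pos += 1
            return ("license-ref", None, text)
        name = self.expect("IDENTIFIER")
        return ("license", name, self.accept_op("+"))

    def atom(self):
        # atom = parenthesized-expression / license-with-exception
        if self.accept_op("("):
            node = self.or_expression()
            self.expect("OPERATOR", ")")
            return node
        node = self.license()
        if self.accept_op("WITH"):
            node = ("with", node, self.expect("IDENTIFIER"))
        return node

    def and_expression(self):
        left = self.atom()
        if self.accept_op("AND"):
            return ("and", left, self.and_expression())
        return left

    def or_expression(self):
        left = self.and_expression()
        if self.accept_op("OR"):
            return ("or", left, self.or_expression())
        return left


def parse(tokens: List[Token]):
    """Parse a full token list with `expression` as the start rule."""
    parser = Parser(tokens)
    tree = parser.or_expression()
    if parser.pos != len(tokens):
        raise SyntaxError(f"unexpected token {parser.peek()[1]!r}")
    return tree
```

Because and-expression and or-expression are right-recursive in the grammar, this sketch builds right-nested trees, and AND binds tighter than OR: the tokens of MIT OR GPL-2.0+ AND X parse as an "or" node whose right child is the "and" node.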

Semantics

TODO: this is obvious and boring to describe. Every operator must be explained. Assert that licenses are valid SPDX Short Identifiers, and so on.

@kemitchell

Very good. Lots of work! Worth helping to get this right.

@kemitchell

Whitespace could be a bikeshed. My gut tells me permitting any number of spaces strikes a good balance. Folks who want to support \n, \t, \r, no-break space, and so on can strip those characters before parsing.

From the broader SPDX perspective, the XML context probably dictates encoding. I'm not sure the spec needs to cover everything down to binary representation, anyway.

As a practical matter, the working group has stuck to characters that can be included without encoding in URLs. The recent proposal for BSD+Patents ended up as BSD-2-Clause-Patent for that reason, among others.

@motet-a
Author

motet-a commented Jul 6, 2017

Whitespace could be a bikeshed. My gut tells me permitting any number of spaces strikes a good balance. Folks who want to support \n, \t, \r, no-break space, and so on can strip those characters before parsing.

Well, we can stay with the spacing rules of the actual spec, but it should be specified in the grammar. It is possible to scan exactly one space before and one space after the AND, OR and WITH operators:

and-expression = atom [" AND " and-expression]

or-expression = and-expression [" OR " or-expression]

license-with-exception = license [" WITH " IDENTIFIER]

It looks a bit hackish, but these spacing rules are much simpler and stricter.
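A scanner for this stricter variant could look roughly like the following Python sketch. It assumes that no whitespace is allowed anywhere except the single spaces around AND, OR and WITH, and that letters are ASCII; the name scan_strict is made up for the example.

```python
import re
from typing import List

# An operator word with exactly one space on each side, an identifier,
# or a single-character operator. Assumption: no other spaces are allowed.
_STRICT_TOKEN = re.compile(r" (?:AND|OR|WITH) |[A-Za-z0-9.-]+|[()+:]")


def scan_strict(source: str) -> List[str]:
    """Split the source into raw token strings, rejecting stray spaces."""
    tokens: List[str] = []
    pos = 0
    while pos < len(source):
        match = _STRICT_TOKEN.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens
```

Here "MIT AND X" scans as three tokens, while "MIT  AND X" (two spaces) and "( MIT)" are rejected. A bare AND with no surrounding spaces still matches the identifier pattern in this sketch, so a later stage would have to reject it.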

From the broader SPDX perspective, the XML context probably dictates encoding. I'm not sure the spec needs to cover everything down to binary representation, anyway.

You are right, we don't care about UTF-8, UTF-16 and so on, but the character set used in the spec has to be defined. In practice, a single sentence disallowing non-ASCII characters, or something like that, should be enough. (Note that the “Key:value” format is not XML.)
