This is not the real spec.

Although this draft is quite similar to the real spec, they are not exactly compatible. Overall, this one is less restrictive. I dropped the “no spaces before +” rule, and the whitespace-related rules are different. For example, the expression ( MIT) seems to be considered invalid by the real spec, but is considered valid by this draft.

Analysis overview

An SPDX License Expression string must be analyzed in three stages:

The first stage, scanning, splits the source string into tokens;
The second stage, parsing, produces a syntax tree from the tokens;
The last stage, semantic analysis, performs a few additional checks on the syntax tree.

Scanning

Tokens

A token represents a part of the source string.

A token must be classified as one of these types:

IDENTIFIER
OPERATOR
DOCUMENT-REF
LICENSE-REF
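As a minimal sketch, the four token types could be modelled as follows. The names `TokenType` and `Token` are illustrative assumptions, not part of this draft or of any existing parser.

```python
from dataclasses import dataclass
from enum import Enum


class TokenType(Enum):
    """The four token types listed above (hypothetical Python names)."""
    IDENTIFIER = "IDENTIFIER"
    OPERATOR = "OPERATOR"
    DOCUMENT_REF = "DOCUMENT-REF"
    LICENSE_REF = "LICENSE-REF"


@dataclass(frozen=True)
class Token:
    type: TokenType
    text: str  # the part of the source string this token represents


example = Token(TokenType.IDENTIFIER, "MIT")
```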

Character encoding

(I think it could be specified that everything must be in ASCII.)

Case sensitivity

An SPDX License Expression must be case sensitive.

Whitespaces

(It must be specified clearly which other characters are ignored and considered as separators. Currently, only spaces are considered in spdx-expression-parse, but other parsers allow tabs and newlines.)

Identifiers

An identifier is a sequence of one or more characters, consisting of letters, digits, dashes ("-") or dots (".").
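The definition above can be sketched as a regular expression. This assumes that "letters" means ASCII letters only, which the draft has not settled yet (see the character encoding note); `is_identifier` is a hypothetical helper name.

```python
import re

# One or more ASCII letters, digits, dashes or dots.
# Assumption: "letters" is restricted to A-Z and a-z.
IDENTIFIER_RE = re.compile(r"[A-Za-z0-9.-]+")


def is_identifier(text: str) -> bool:
    """Return True if the whole string is a single identifier."""
    return IDENTIFIER_RE.fullmatch(text) is not None
```

Under this sketch, GPL-2.0 is an identifier, while GPL-2.0+ is not, because "+" is scanned as a separate token.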

Ordinary identifiers, reserved words and operators

Any identifier is an ordinary one, except:

Identifiers "AND", "OR" and "WITH". They must be recognized as tokens with the type OPERATOR.
Identifiers starting with "DocumentRef-". They must not be exactly "DocumentRef-". They must be longer or else an error must be raised. They must be recognized as tokens with the type DOCUMENT-REF.
Identifiers starting with "LicenseRef-". They must not be exactly "LicenseRef-". They must be longer or else an error must be raised. They must be recognized as tokens with the type LICENSE-REF.

Ordinary identifiers must be recognized as tokens with the type IDENTIFIER.
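The classification rules above might look like this in code. It is only a sketch: `classify_identifier` is a hypothetical name, token types are represented as plain strings, and the choice of `ValueError` for the error case is an assumption, since the draft only says that an error must be raised.

```python
def classify_identifier(text: str) -> str:
    """Map an already-scanned identifier to its token type name."""
    # Reserved words are case sensitive: "and" stays an ordinary identifier.
    if text in ("AND", "OR", "WITH"):
        return "OPERATOR"
    for prefix, token_type in (
        ("DocumentRef-", "DOCUMENT-REF"),
        ("LicenseRef-", "LICENSE-REF"),
    ):
        if text.startswith(prefix):
            # The bare prefix with nothing after it is an error.
            if text == prefix:
                raise ValueError(f"{prefix!r} must be followed by a name")
            return token_type
    return "IDENTIFIER"
```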

Other operators

The characters "(", ")", "+" and ":" must be recognized as tokens with the type OPERATOR.
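Putting the scanning rules together, a whole scanner could be sketched as below. Two assumptions are made because the draft leaves them open: only the space character is treated as a separator, and letters are ASCII only. The `scan` function and its `(type, text)` tuple representation are illustrative, not an existing API.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (token type name, matched text)

# Assumption: spaces are the only ignored separator characters.
_TOKEN_RE = re.compile(
    r"(?P<space> +)|(?P<punct>[()+:])|(?P<ident>[A-Za-z0-9.-]+)"
)


def scan(source: str) -> List[Token]:
    """Split a source string into tokens, raising ValueError on bad input."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(source):
        match = _TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        pos = match.end()
        text = match.group()
        if match.lastgroup == "space":
            continue  # separators produce no token
        if match.lastgroup == "punct" or text in ("AND", "OR", "WITH"):
            tokens.append(("OPERATOR", text))
        elif text.startswith("DocumentRef-"):
            if text == "DocumentRef-":
                raise ValueError("empty DocumentRef- identifier")
            tokens.append(("DOCUMENT-REF", text))
        elif text.startswith("LicenseRef-"):
            if text == "LicenseRef-":
                raise ValueError("empty LicenseRef- identifier")
            tokens.append(("LICENSE-REF", text))
        else:
            tokens.append(("IDENTIFIER", text))
    return tokens
```

With this sketch, ( MIT) scans to the same three tokens as (MIT), which matches the less restrictive whitespace handling described at the top of this draft.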

Parsing

The parsing stage must be performed according to the following ABNF grammar:

license-ref = [DOCUMENT-REF ":"] LICENSE-REF

license = license-ref / (IDENTIFIER ["+"])

license-with-exception = license ["WITH" IDENTIFIER]

parenthesized-expression = "(" expression ")"

atom = parenthesized-expression / license-with-exception

and-expression = atom ["AND" and-expression]

or-expression = and-expression ["OR" or-expression]

tag-value-format-expression = parenthesized-expression / 
                              IDENTIFIER / 
                              license-ref

expression = or-expression

Token types map to rule names in the previous grammar, except for tokens with the OPERATOR type which are referred to with literal strings. For example, the literal "(" in the grammar must match a token with the OPERATOR type which matched a "(" in the source string.

The start rule must always be expression, except in Tag:value format where the tag-value-format-expression rule must be used instead.
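The grammar above maps fairly directly onto a recursive-descent parser, with one method per rule. The following is a sketch under assumptions: tokens are `(type, text)` tuples, the syntax tree is built from nested tuples, and the names `Parser`, `ParseError` and `parse` are hypothetical. Only the expression start rule is covered, not tag-value-format-expression.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (token type name, matched text)


class ParseError(ValueError):
    pass


class Parser:
    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self) -> Optional[Token]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def accept(self, type_: str, text: Optional[str] = None) -> Optional[Token]:
        """Consume and return the next token if it matches, else return None."""
        tok = self.peek()
        if tok is not None and tok[0] == type_ and (text is None or tok[1] == text):
            self.pos += 1
            return tok
        return None

    def expect(self, type_: str, text: Optional[str] = None) -> Token:
        tok = self.accept(type_, text)
        if tok is None:
            raise ParseError(f"expected {text or type_} at token index {self.pos}")
        return tok

    def expression(self):
        return self.or_expression()

    def or_expression(self):  # and-expression ["OR" or-expression]
        left = self.and_expression()
        if self.accept("OPERATOR", "OR"):
            return ("OR", left, self.or_expression())
        return left

    def and_expression(self):  # atom ["AND" and-expression]
        left = self.atom()
        if self.accept("OPERATOR", "AND"):
            return ("AND", left, self.and_expression())
        return left

    def atom(self):  # parenthesized-expression / license-with-exception
        if self.accept("OPERATOR", "("):
            inner = self.expression()
            self.expect("OPERATOR", ")")
            return inner
        return self.license_with_exception()

    def license_with_exception(self):  # license ["WITH" IDENTIFIER]
        lic = self.license()
        if self.accept("OPERATOR", "WITH"):
            return ("WITH", lic, self.expect("IDENTIFIER")[1])
        return lic

    def license(self):  # license-ref / (IDENTIFIER ["+"])
        doc = self.accept("DOCUMENT-REF")
        if doc is not None:
            self.expect("OPERATOR", ":")
            return ("license-ref", doc[1], self.expect("LICENSE-REF")[1])
        ref = self.accept("LICENSE-REF")
        if ref is not None:
            return ("license-ref", None, ref[1])
        name = self.expect("IDENTIFIER")[1]
        plus = self.accept("OPERATOR", "+") is not None
        return ("license", name, plus)


def parse(tokens: List[Token]):
    parser = Parser(tokens)
    tree = parser.expression()
    if parser.peek() is not None:
        raise ParseError("unexpected trailing token")
    return tree
```

Because and-expression sits below or-expression in the grammar, AND binds more tightly than OR, and both rules are right-recursive, so A OR B OR C is grouped as A OR (B OR C) in this sketch.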

Semantics

TODO: this is obvious and boring to describe. Every operator must be explained. Assert that licenses are valid SPDX Short Identifiers, and so on.

Whitespace could be a bikeshed. My gut tells me permitting any number of spaces strikes a good balance. Folks who want to support \n, \t, \r, no-break space, and so on can strip those characters before parsing.

Well, we can stay with the spacing rules of the actual spec, but they should be specified in the grammar. It is possible to scan exactly one space before and one space after the AND, OR and WITH operators:

and-expression = atom [" AND " and-expression]

or-expression = and-expression [" OR " or-expression]

license-with-exception = license [" WITH " IDENTIFIER]

It looks a bit hackish, but these spacing rules are much simpler and stricter.

From the broader SPDX perspective, the XML context probably dictates encoding. I'm not sure the spec needs to cover everything down to binary representation, anyway.

You are right, we don't care about UTF-8, UTF-16 and so on, but the character set used in the spec has to be defined. In practice it should be a single sentence disallowing non-ASCII characters or something like that. (Note that the “Key:value” format is not XML.)

motet-a/utopian-spdx-license-expressions.md


kemitchell commented Jul 6, 2017

kemitchell commented Jul 6, 2017

motet-a commented Jul 6, 2017 (edited)
