This is not the real spec.
Although this draft is quite similar to the real spec, they are not
exactly compatible. Overall, this one is less restrictive. I dropped
the “no spaces before +” rule, and whitespace-related rules are
different. For example, the expression ( MIT) seems to be considered
invalid by the real spec, but is considered valid by this draft.
An SPDX License Expression string must be analyzed in three stages:
- The first one is called scanning; it consists of splitting the source string into tokens.
- The second one is called parsing; it consists of producing a syntax tree from the tokens.
- The last one is called semantic analysis; it consists of a few additional checks on the syntax tree.
A token represents a part of the source string.
A token must be classified as one of these types:
- IDENTIFIER
- OPERATOR
- DOCUMENT-REF
- LICENSE-REF
(I think it could be specified that everything must be in ASCII.)
An SPDX License Expression must be case sensitive.
(It must be specified clearly which other characters are ignored
and considered as separators. Actually, only whitespace is
considered in spdx-expression-parse, but other parsers allow
tabs and newlines.)
An identifier is a sequence of one or more characters, consisting of letters, digits, dashes ("-") or dots (".").
Any identifier is an ordinary one, except:
- Identifiers "AND", "OR" and "WITH". They must be recognized as tokens with the type OPERATOR.
- Identifiers starting with "DocumentRef-". They must not be exactly "DocumentRef-". They must be longer or else an error must be raised. They must be recognized as tokens with the type DOCUMENT-REF.
- Identifiers starting with "LicenseRef-". They must not be exactly "LicenseRef-". They must be longer or else an error must be raised. They must be recognized as tokens with the type LICENSE-REF.
Ordinary identifiers must be recognized as tokens with the type IDENTIFIER.
The characters "(", ")", "+" and ":" must be recognized as tokens with the type OPERATOR.
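The scanning rules above can be sketched as a small Python function. This is only an illustration under stated assumptions, not a reference implementation: the name `scan`, the `(type, text)` tuple representation and the use of `ValueError` are hypothetical choices, "letters" is read as ASCII letters, and only the space character is treated as a separator (the draft leaves both points open).

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (type, text); this representation is an assumption

# Assumptions: ASCII letters only, and only " " separates tokens.
_TOKEN_RE = re.compile(r" +|[A-Za-z0-9.\-]+|[()+:]")


def scan(source: str) -> List[Token]:
    """Split a license expression string into classified tokens."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(source):
        match = _TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        text = match.group()
        pos = match.end()
        if text[0] == " ":
            continue  # separators produce no token
        if text in ("(", ")", "+", ":"):
            tokens.append(("OPERATOR", text))
        elif text in ("AND", "OR", "WITH"):
            # Case sensitive: "and" would be an ordinary identifier.
            tokens.append(("OPERATOR", text))
        elif text.startswith("DocumentRef-"):
            if text == "DocumentRef-":
                raise ValueError('"DocumentRef-" must be followed by more characters')
            tokens.append(("DOCUMENT-REF", text))
        elif text.startswith("LicenseRef-"):
            if text == "LicenseRef-":
                raise ValueError('"LicenseRef-" must be followed by more characters')
            tokens.append(("LICENSE-REF", text))
        else:
            tokens.append(("IDENTIFIER", text))
    return tokens
```

Because "+" is not an identifier character, a string such as `GPL-2.0+` yields an IDENTIFIER token followed by an OPERATOR token, with or without a space in between.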
The parsing stage must be performed according to the following ABNF grammar:
license-ref = [DOCUMENT-REF ":"] LICENSE-REF
license = license-ref / (IDENTIFIER ["+"])
license-with-exception = license ["WITH" IDENTIFIER]
parenthesized-expression = "(" expression ")"
atom = parenthesized-expression / license-with-exception
and-expression = atom ["AND" and-expression]
or-expression = and-expression ["OR" or-expression]
tag-value-format-expression = parenthesized-expression /
                              IDENTIFIER /
                              license-ref
expression = or-expression
Token types map to rule names in the previous grammar, except
for tokens with the OPERATOR type, which are referred to with literal
strings. For example, the literal "(" in the grammar must match
a token with the OPERATOR type which matched a "(" in the source
string.
The start rule must always be expression, except in Tag:value format
where the tag-value-format-expression rule must be used instead.
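As an illustration, the grammar can be turned into a recursive descent parser with one method per rule. The sketch below is hypothetical: the `Parser` class, the `parse` function, the `tag_value` flag and the dictionary shape of the syntax tree are all assumptions made for the example, and tokens are assumed to be `(type, text)` tuples.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (type, text), e.g. ("OPERATOR", "AND")


class Parser:
    def __init__(self, tokens: List[Token]) -> None:
        self.tokens = [tuple(t) for t in tokens]
        self.pos = 0

    def peek(self) -> Optional[Token]:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def fail(self, expected: str):
        raise ValueError(f"expected {expected} at token index {self.pos}")

    def accept_op(self, text: str) -> bool:
        # A grammar literal matches an OPERATOR token with the same text.
        if self.peek() == ("OPERATOR", text):
            self.pos += 1
            return True
        return False

    def accept_type(self, type_: str) -> Optional[str]:
        tok = self.peek()
        if tok is not None and tok[0] == type_:
            self.pos += 1
            return tok[1]
        return None

    def license_ref(self):  # [DOCUMENT-REF ":"] LICENSE-REF
        doc = self.accept_type("DOCUMENT-REF")
        if doc is not None and not self.accept_op(":"):
            self.fail('":"')
        lic = self.accept_type("LICENSE-REF")
        if lic is None:
            if doc is not None:
                self.fail("LICENSE-REF")
            return None  # no license-ref here; the caller tries another rule
        return {"type": "license-ref", "document": doc, "license": lic}

    def license(self):  # license-ref / (IDENTIFIER ["+"])
        ref = self.license_ref()
        if ref is not None:
            return ref
        name = self.accept_type("IDENTIFIER")
        if name is None:
            self.fail("a license")
        return {"type": "license", "id": name, "plus": self.accept_op("+")}

    def license_with_exception(self):  # license ["WITH" IDENTIFIER]
        lic = self.license()
        if self.accept_op("WITH"):
            exc = self.accept_type("IDENTIFIER")
            if exc is None:
                self.fail("an exception identifier")
            return {"type": "with", "license": lic, "exception": exc}
        return lic

    def atom(self):  # parenthesized-expression / license-with-exception
        if self.accept_op("("):
            inner = self.or_expression()
            if not self.accept_op(")"):
                self.fail('")"')
            return inner
        return self.license_with_exception()

    def and_expression(self):  # atom ["AND" and-expression]
        left = self.atom()
        if self.accept_op("AND"):
            return {"type": "and", "left": left, "right": self.and_expression()}
        return left

    def or_expression(self):  # and-expression ["OR" or-expression]
        left = self.and_expression()
        if self.accept_op("OR"):
            return {"type": "or", "left": left, "right": self.or_expression()}
        return left

    def tag_value_format_expression(self):
        if self.peek() == ("OPERATOR", "("):
            return self.atom()
        name = self.accept_type("IDENTIFIER")
        if name is not None:
            return {"type": "license", "id": name, "plus": False}
        ref = self.license_ref()
        if ref is None:
            self.fail("a parenthesized expression, identifier or license-ref")
        return ref


def parse(tokens: List[Token], tag_value: bool = False):
    parser = Parser(tokens)
    start = parser.tag_value_format_expression if tag_value else parser.or_expression
    tree = start()
    if parser.peek() is not None:
        parser.fail("end of input")
    return tree
```

Since the AND and OR rules are right-recursive, the tree for `A AND B AND C` nests to the right, and AND binds tighter than OR because and-expression sits below or-expression in the grammar. With `tag_value=True`, a compound expression is accepted only when it is wrapped in parentheses.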
TODO: this is obvious and boring to describe. Every operator must be explained. Assert that licenses are valid SPDX Short Identifiers, and so on.
Very good. Lots of work! Worth helping to get this right.