@mindplay-dk
Created August 16, 2024 09:33
Making sense of BNF, ABNF, EBNF syntax differences and history

Claude.ai just helped me make a whole lot more sense of the absolute mess that is BNF-family syntaxes. 🤔

I just discovered ABNF is actually a standard, RFC 4234.

Why isn't this more widely used?

BNF in the wild drives me nuts - no two grammars are ever written in the same flavor of BNF, and often grammars are published as merely "BNF", neglecting to even specify which variant was used.

Are there any other BNF variants that are properly standardized?

You've stumbled upon an interesting point about formal grammar notations. ABNF (Augmented Backus-Naur Form) is indeed standardized in RFC 4234, which was later obsoleted by RFC 5234. Despite this standardization, it's not as widely used as one might expect. Let's explore your questions:

  1. Why isn't ABNF more widely used?

There are several reasons for this:

a) Historical inertia: Many communities and projects have been using their own BNF variants for years, and changing to ABNF would require significant effort.

b) Lack of awareness: Not everyone is aware that ABNF is a standard. It's more commonly known in certain circles, particularly those working with Internet protocols.

c) Specific needs: Some domains might require notation features that aren't present in ABNF, leading them to use or create other variants.

d) Tool support: The availability of tools for parsing and processing specific BNF variants can influence their adoption.

  2. Regarding the inconsistency in BNF usage:

You're right that this can be frustrating. The lack of consistency in BNF notation across different grammars can make it challenging to read and compare them. This is partly due to the evolution of BNF over time and the creation of various extensions to suit different needs.

  3. Other standardized BNF variants:

While ABNF is one of the most well-known standardized variants, there are a few others:

a) EBNF (Extended Backus-Naur Form): This is standardized as ISO/IEC 14977:1996. However, it's worth noting that many "EBNF" grammars in the wild don't strictly adhere to this standard.

b) W3C EBNF: The World Wide Web Consortium (W3C) has its own EBNF notation, used in various W3C specifications. While not an official standard like ISO or RFC, it's well-documented and consistently used within W3C documents.

c) GOLD Meta-Language: This is a proprietary but well-defined grammar notation used in the GOLD parsing system.

It's worth noting that even with these standards, many communities and tools still use their own slightly modified versions of BNF or EBNF.

To address the issues you've mentioned, when working with or publishing grammars, it's a good practice to:

  1. Clearly specify which variant of BNF is being used.
  2. Consider using a standardized form like ABNF or ISO EBNF when possible.
  3. If using a custom variant, provide a clear explanation of the notation used.

These practices can help reduce confusion and improve the clarity of grammar specifications.

"Some domains might require notation features that aren't present in ABNF, leading them to use or create other variants."

If I was building a parser generator, this is the only reason I would really care about.

EBNF in the wild is one of the things that drives me nuts - no two documents claiming to be "EBNF" would even work with more than the 1 specific tool they were built for.

Comparing ABNF vs ISO standard EBNF vs W3C EBNF, how do they stack up in terms of features? or rather, where do they deviate? which ones have features that are missing from the others? (and I'm interested in features only - I don't care about syntax differences.)

You've hit on a crucial point - feature differences are indeed the most important aspect when considering different grammar notations, especially for parser generator tools. Let's compare ABNF, ISO EBNF, and W3C EBNF in terms of their features:

  1. ABNF (RFC 5234)

Key features:

  • Core rules (pre-defined rules like DIGIT, ALPHA)
  • Repetition (*), optional sequences ([])
  • Value ranges
  • Precise number of repetitions (e.g., 2*3)
  • Case-insensitive literal text
  • Binary, decimal, and hexadecimal value notation (%b, %d, %x)
  • Incremental alternatives

  2. ISO EBNF (ISO/IEC 14977:1996)

Key features:

  • Repetition ({ }), optional sequences ([ ])
  • Grouping with parentheses
  • Exception rule (-)
  • Special sequence (? ... ?)
  • Concatenation (,)
  • Terminator symbol (;)
  • Integer repetition factor (e.g., 3 * item)

  3. W3C EBNF

Key features:

  • Repetition (*), optional elements (?)
  • Grouping with parentheses
  • One or more occurrences (+)
  • Exception rule (A - B)
  • Character classes in square brackets, including ranges and negation (e.g., [a-z], [^a-z])
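
To make the feature lists above concrete, here is a small sketch that writes one rule in each notation and pairs it with an equivalent Python regex. The rule name `integer` and the helper `is_integer` are made up for illustration; `DIGIT` is an ABNF core rule, while `digit` in the ISO EBNF line is assumed to be defined elsewhere in the grammar.

```python
import re

# One rule ("an optional minus sign followed by one or more digits") written
# in each of the three notations. The rule name `integer` is a made-up example.
INTEGER_RULE = {
    # DIGIT is one of ABNF's pre-defined core rules.
    "ABNF (RFC 5234)": 'integer = ["-"] 1*DIGIT',
    # `digit` is assumed to be defined by another rule in the same grammar.
    "ISO EBNF (ISO/IEC 14977)": 'integer = ["-"], digit, {digit};',
    # Character class and postfix operators, as used in the XML 1.0 spec.
    "W3C EBNF (XML 1.0 notation)": 'integer ::= "-"? [0-9]+',
}

# The language all three rules describe, as a Python regex for comparison.
INTEGER_REGEX = re.compile(r"-?[0-9]+")


def is_integer(text: str) -> bool:
    """Return True if `text` is in the language described by the rules above."""
    return INTEGER_REGEX.fullmatch(text) is not None
```

Printing `INTEGER_RULE` side by side shows that the three notations differ mainly in how they spell "optional" and "one or more", not in what they can express for a rule this simple.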

Now, let's look at where they deviate and what features are unique or missing:

  1. Unique to ABNF:
  • Core rules (pre-defined common elements)
  • Case-insensitive literal text
  • Incremental alternatives
  • Precise repetition ranges (e.g., 2*3)
  2. Unique to ISO EBNF:
  • Special sequence (? ... ?)
  • Explicit concatenation operator (,)
  • Terminator symbol (;)
  3. Unique to W3C EBNF:
  • Postfix one-or-more operator (+)
  • Negated character classes (e.g., [^a-z])
  4. Shared by ISO EBNF and W3C EBNF, but missing from ABNF:
  • Exception rule (-)
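
Incremental alternatives are probably the least familiar item in these lists, so a rough sketch may help. In ABNF, `=/` appends alternatives to a rule that was defined earlier. The function below is a hypothetical helper, not part of any ABNF tool, and it assumes simple input: one rule per line, and no `=` or `/` inside quoted strings.

```python
def merge_incremental_alternatives(lines):
    """Fold ABNF `=/` rules into the alternatives of the rule they extend.

    Simplifying assumptions: one rule per line, no `=` or `/` inside quoted
    strings, and rule names compared case-sensitively (real ABNF rule names
    are case-insensitive).
    """
    rules = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "=/" in line:
            name, body = line.split("=/", 1)
            incremental = True
        else:
            name, body = line.split("=", 1)
            incremental = False
        name = name.strip()
        alternatives = [alt.strip() for alt in body.split("/")]
        if incremental:
            if name not in rules:
                raise ValueError(f"{name!r} is extended with =/ before it is defined")
            rules[name].extend(alternatives)
        else:
            rules[name] = alternatives
    return rules


merged = merge_incremental_alternatives([
    "ruleset = alt1 / alt2",
    "ruleset =/ alt3",
    "ruleset =/ alt4 / alt5",
])
```

Here `merged` holds a single `ruleset` entry with all five alternatives, which is the equivalence RFC 5234 describes for `=/`.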

Notable differences:

  1. Repetition:

    • ABNF uses a prefix * for 0 or more; 1 or more is written 1* (there is no + operator)
    • ISO EBNF uses { } for 0 or more; 1 or more is usually written item, { item } (there is no + operator)
    • W3C EBNF has both * (0 or more) and + (1 or more)
  2. Optional elements:

    • ABNF uses []
    • ISO EBNF uses []
    • W3C EBNF uses ?
  3. Alternatives:

    • ABNF uses / and allows incremental alternatives
    • ISO EBNF uses |
    • W3C EBNF uses |
  4. Character sets:

    • ABNF has built-in support through core rules and allows ranges
    • ISO EBNF doesn't have built-in character set support
    • W3C EBNF allows character ranges in square brackets (e.g., [a-z])
  5. Repetition specificity:

    • ABNF allows precise repetition ranges (e.g., 2*3)
    • ISO EBNF allows integer repetition factors
    • W3C EBNF doesn't have built-in support for specific repetition counts
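
As a rough illustration of the repetition differences listed above, ABNF's repetition prefix maps quite directly onto regex quantifiers. The helper name `abnf_repeat_to_regex` is hypothetical; it only handles the prefix itself (`*`, `1*`, `2*3`, `3`), not the element that follows it.

```python
import re

# An ABNF repetition prefix is either `<min>*<max>` (both parts optional)
# or a bare integer meaning "exactly n".
_REPEAT = re.compile(r"(\d*)\*(\d*)|(\d+)")


def abnf_repeat_to_regex(prefix: str) -> str:
    """Translate an ABNF repetition prefix into a regex quantifier."""
    m = _REPEAT.fullmatch(prefix)
    if m is None:
        raise ValueError(f"not an ABNF repetition prefix: {prefix!r}")
    low, high, exact = m.groups()
    if exact is not None:
        return "{%d}" % int(exact)
    low_n = int(low) if low else 0  # a missing minimum defaults to 0
    if high == "":                  # a missing maximum means "unbounded"
        if low_n == 0:
            return "*"
        if low_n == 1:
            return "+"
        return "{%d,}" % low_n
    return "{%d,%d}" % (low_n, int(high))


# ABNF `2*3DIGIT` corresponds to the regex `[0-9]{2,3}`.
two_or_three_digits = re.compile("[0-9]" + abnf_repeat_to_regex("2*3"))
```

The mapping also shows why ABNF does not need a separate `+` operator: `1*` already covers it.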

In summary, while these notations share many common features, ABNF stands out with its core rules, incremental alternatives, and repetition ranges; ISO EBNF with its special sequences; and W3C EBNF with its regex-style character classes and postfix operators. ABNF generally provides more precise control over repetition, while W3C EBNF offers a simpler notation for common patterns like "one or more" (+).

For a parser generator, the choice between these might depend on specific needs like handling of whitespace, comment syntax, and how easily the notation maps to the target programming language's constructs. ABNF's core rules can be particularly useful for common elements in certain domains, while the exception rule (available in both ISO EBNF and W3C EBNF) might be valuable for languages with complex exclusion rules.
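
To show what an exception rule means for an implementer, here is a minimal sketch based on the `PITarget` production from the XML 1.0 specification, which is written `Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))` in W3C EBNF. The `NAME` pattern below is a deliberately simplified, ASCII-only stand-in for XML's real `Name` production, and `is_pi_target` is a hypothetical helper.

```python
import re

# Simplified stand-in for XML's `Name` production (ASCII only). This is an
# assumption made to keep the sketch short, not the real definition.
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")

# The excluded language: ('X' | 'x') ('M' | 'm') ('L' | 'l')
RESERVED = re.compile(r"[Xx][Mm][Ll]")


def is_pi_target(text: str) -> bool:
    """PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

    `A - B` accepts any string that matches A but does not match B.
    """
    return NAME.fullmatch(text) is not None and RESERVED.fullmatch(text) is None
```

In other words, an exception is a set difference over whole matches: `xml-stylesheet` is accepted because only the exact strings `xml`, `XML`, `Xml`, and so on are excluded.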

would it be fair to say that W3C EBNF is perhaps closer (in terms of syntax) to mainstream programming languages? It seems to borrow from RegEx as well.

in what year were each of these standards created?

You're making an astute observation about W3C EBNF. Let's dive into your points:

  1. W3C EBNF and mainstream programming languages:

Yes, it's fair to say that W3C EBNF syntax feels closer to mainstream programming languages and regular expressions. This is likely intentional, as it makes the notation more intuitive for web developers and programmers who are already familiar with these concepts. Some examples of this similarity:

  • The use of ? for optional elements mirrors the ? quantifier in regex.
  • The + for "one or more" and * for "zero or more" are directly borrowed from regex syntax, which is widely used in programming.
  • The use of | for alternatives is common in both programming (e.g., the "or" operators and pattern-matching alternatives in many languages) and regex.

This familiarity can make W3C EBNF easier to adopt for many developers, especially those working in web-related fields.
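
The overlap with regex is close enough that a small subset of W3C EBNF can be translated almost token by token. The sketch below is a hypothetical helper covering only quoted literals, simple character classes, grouping, `|`, `?`, `*`, and `+`. Rule references, the `-` exception operator, and `#xN` code points are out of scope and raise an error.

```python
import re

_TOKEN = re.compile(
    r"""
      \s+              # whitespace: plain concatenation in W3C EBNF
    | "([^"]*)"        # double-quoted literal
    | '([^']*)'        # single-quoted literal
    | (\[[^\]]+\])     # character class, e.g. [0-9] or [^a-z]
    | ([()|?*+])       # grouping, alternation, postfix operators
    """,
    re.VERBOSE,
)


def w3c_expr_to_regex(expr: str) -> str:
    """Translate a small subset of W3C EBNF expressions into a Python regex."""
    out = []
    pos = 0
    while pos < len(expr):
        m = _TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at offset {pos}: {expr[pos:]!r}")
        double_quoted, single_quoted, char_class, operator = m.groups()
        if double_quoted is not None or single_quoted is not None:
            literal = double_quoted if double_quoted is not None else single_quoted
            # Wrap the literal so a following ?, * or + applies to all of it.
            out.append("(?:" + re.escape(literal) + ")")
        elif char_class is not None:
            if "#x" in char_class:
                raise ValueError("#xN code points are not handled by this sketch")
            out.append(char_class)  # same syntax in both notations
        elif operator is not None:
            out.append("(?:" if operator == "(" else operator)
        pos = m.end()
    return "".join(out)


integer = re.compile(w3c_expr_to_regex('"-"? [0-9]+'))
```

Apart from wrapping literals in groups and dropping whitespace, the translation is an identity mapping, which is a fair measure of how much of the notation is shared with regex.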

  2. Creation years of these standards:
  • ABNF:

    • Originally defined in RFC 2234 in November 1997.
    • Current version (RFC 5234) was published in January 2008.
  • ISO EBNF:

    • ISO/IEC 14977:1996 was first published in 1996.
    • It hasn't been significantly revised since then.
  • W3C EBNF:

    • This is a bit trickier to pin down as it wasn't published as a standalone standard.
    • It evolved through various W3C specifications, with one of the earliest formal uses being in the XML 1.0 specification, first published as a W3C Recommendation in February 1998.
    • The notation has been refined and used consistently in subsequent W3C specifications.

It's worth noting that all these standards were created in a relatively short timespan (1996-1998), during a period of rapid growth in internet technologies and web standards. This timing explains some of the divergences in approach, as different groups were simultaneously working on solving similar problems.

The W3C EBNF's apparent similarity to regex and programming constructs likely stems from its target audience (web developers and browser implementers) and the need to describe complex markup languages like XML and HTML. Its evolution alongside web technologies has kept it feeling more "modern" in some ways, despite being roughly the same age as the other standards.

ah, so they all basically originate around the same time, around 96 to 98. I was wondering if perhaps the W3C variant was newer and therefore borrowed more from newer languages, but that's probably not why.

perhaps the W3C authors made a deliberate choice to deviate from much older historical BNF syntax, while ISO EBNF and ABNF decided to stay closer to it?

You've made an excellent observation, and I think you're on the right track with your reasoning. Let's explore this idea:

  1. Timing and influences:

You're correct that the timing doesn't fully explain the differences. All three standards indeed emerged around the same period (1996-1998), so they had similar historical context to work with.

  2. W3C's design choices:

Your suggestion about W3C making a deliberate choice to deviate from older BNF syntax is quite plausible. Several factors might have influenced this decision:

a) Target audience: The W3C was primarily focused on web technologies. Their audience included many developers who might be more familiar with programming languages and regular expressions than with traditional formal language theory.

b) Practical usage: W3C needed a notation that could easily describe markup languages like XML and HTML, which have structures that don't always map neatly to traditional BNF.

c) Readability and accessibility: By using syntax elements familiar to programmers (like +, *, and ?), W3C might have been aiming to make their specifications more accessible to a broader audience.

  3. ISO EBNF and ABNF design choices:

These standards indeed seem to have stayed closer to traditional BNF notation. Possible reasons for this include:

a) Continuity: Maintaining closer ties to the original BNF could make adoption easier for those already familiar with older variants.

b) Formal language community: These standards might have been more influenced by the formal language and compiler design communities, which were accustomed to traditional BNF-style notations.

c) Generality: ABNF and ISO EBNF were designed as general-purpose notations, not tied to a specific domain like web technologies, which might have led to more conservative design choices.

  4. Different priorities:
  • W3C EBNF prioritized intuitive readability for web developers and ease of use for describing markup languages.
  • ABNF focused on precision and unambiguous description, particularly useful for network protocols.
  • ISO EBNF aimed for a standardized, general-purpose notation that could be widely applied in various domains.

Your insight about W3C potentially making a deliberate choice to deviate from historical BNF syntax is quite perceptive. It aligns well with W3C's general approach of making web standards more accessible to a broad range of developers, not just those with a strong background in formal language theory.

This decision, if deliberate, has likely contributed to the wider adoption and use of W3C EBNF in web-related specifications, as it feels more intuitive to many developers working in that space. However, it also explains why you might encounter more variability in "EBNF" notations in the wild, as the W3C variant diverges from the ISO standard.
