Advanced BNF with regex
Spec status: stable; use with caution. Take it, modify it, use it. (CC-BY-SA) Although potentially useful, applying this specification in practice might cause confusion. It allows for dramatically shorter and more readable specifications. However, abusing features like intersections and complements could make specifications less readable. In addition, it contributes to the proliferation of standards (see XKCD comic #927, “Standards”).
Summary
Feature | regex-bnf | W3C XML EBNF | EBNF | ABNF |
---|---|---|---|---|
regex | ECMA 262 | minimal | no | no |
exclusion | yes (- ) | yes (- ) | yes (- ) | yes |
complement | yes (! ) | indirectly | indirectly | no |
exclusive disjunction | yes (^ ) | indirectly | indirectly | no |
intersection | yes (& ) | no | no | no |
ordered alternation | yes (/ ) | no | no | no |
UTF codepoints by name | yes (#'␣' ) | no | no | no |
inline definitions | yes (= ) | no | no | no |
explicit start token | yes (START ) | no | no | no |
lazy quantifiers | yes (?? ) | no | no | no |
core rules | 50 | 0 | 0 | 16 |
well-formedness constraints | [ wfc: ␣ ] | [ wfc: ␣ ] | ? ␣ ? | no |
validity constraints | [ vc: ␣ ] | [ vc: ␣ ] | no | no |
definition syntax | := /::= /= | ::= | = | = |
comment syntax | /* */ /; | /* */ | (* *) | ; |
concatenation symbol | implicit | implicit | , | . |
alternation symbol | \| | \| | \| // /? | / |
An advanced BNF derivative.
Standards for specifying grammars are a mess, as Wirth et al. describe. David A. Wheeler also wrote a complaint about current grammar standards, particularly objecting to ISO’s EBNF (ISO/IEC 14977:1996). ABNF (RFC 5234) is an improvement, but it doesn’t use regex and has a non-obvious syntax for repetitions. The W3C XML EBNF is better, but it still lacks some functionality and the expressiveness that regex can provide.
Tip: diagrams
A tool called the Railroad Diagram Generator can generate excellent diagrams from W3C XML EBNF. It’s maintained as of early 2025.
In the spirit of XKCD #927, here is a new proposal. It’s a hybrid between ABNF, W3C XML EBNF, and syntax from parser generators, including ANTLR. Use it to describe PEGs and CFGs.
Most importantly, regex-bnf supports full ECMA 262 regular expressions. It has additional, very powerful features, including:
- intersection: `A & B`, where both rules consume the same input
- complement: `(! A)`, any sequence that does not match rule A
- exclusion: `A - B`, which requires that B rejects the input that A accepts
- exclusive disjunction: `A ^ B`, equivalent to `(A - B) | (B - A)`
- repetition: `A{5,10}`, rule A at least 5 times and at most 10
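These operators can be read as set operations on the strings that two rules accept. The following Python sketch is only an illustration of that reading: it models a rule as a predicate over whole strings using Python's `re` module (which differs from ECMA 262 in places), and the helper names `rule`, `intersection`, `complement`, `exclusion`, and `xor` are hypothetical, not part of regex-bnf.

```python
import re
from typing import Callable

# A rule is modeled as a predicate: does the rule accept the whole string?
Rule = Callable[[str], bool]

def rule(pattern: str) -> Rule:
    """Build a rule from a regular expression that must match the entire input."""
    compiled = re.compile(pattern)
    return lambda text: compiled.fullmatch(text) is not None

def intersection(a: Rule, b: Rule) -> Rule:
    """A & B: both rules accept the same input."""
    return lambda text: a(text) and b(text)

def complement(a: Rule) -> Rule:
    """(! A): any input that A does not accept."""
    return lambda text: not a(text)

def exclusion(a: Rule, b: Rule) -> Rule:
    """A - B: A accepts the input and B rejects it."""
    return lambda text: a(text) and not b(text)

def xor(a: Rule, b: Rule) -> Rule:
    """A ^ B, written as (A - B) | (B - A)."""
    left, right = exclusion(a, b), exclusion(b, a)
    return lambda text: left(text) or right(text)

lowercase = rule(r"[a-z]+")
five_to_ten = rule(r".{5,10}")          # any character, 5 to 10 times
short_word = intersection(lowercase, five_to_ten)

print(short_word("hello"))                       # lowercase and 5 characters long
print(exclusion(lowercase, rule("abc"))("abc"))  # rejected, because B accepts it
print(xor(lowercase, five_to_ten)("HELLO"))      # only the length rule accepts it
```

Modeling rules as predicates keeps the sketch short, but it says nothing about how a real parser would implement intersection or complement efficiently.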
Rules can be declared with `=` (ABNF), `::=` (XML), or `:=`. Similarly, inline `;` comments (ABNF) and `/* */` (XML) multiline comments are allowed. For alternation, both `/` (ABNF) and `|` (XML) are supported, but `/` signals ordered choice in contrast to `|`. Always use `/` for PEGs.
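The difference between the two alternation symbols matters most when one alternative is a prefix of another. The sketch below assumes PEG-style semantics for `/` (the first alternative that matches wins) and exhaustive semantics for `|`; the tiny combinators `literal`, `ordered`, `unordered`, and `accepts` are hypothetical helpers written for this illustration.

```python
from typing import Callable, Set

# A parser maps (text, start position) to the set of positions where a match may end.
Parser = Callable[[str, int], Set[int]]

def literal(s: str) -> Parser:
    """Match the exact string s at the given position."""
    return lambda text, pos: {pos + len(s)} if text.startswith(s, pos) else set()

def unordered(a: Parser, b: Parser) -> Parser:
    """A | B: keep every way either alternative can match."""
    return lambda text, pos: a(text, pos) | b(text, pos)

def ordered(a: Parser, b: Parser) -> Parser:
    """A / B: commit to A if it matches at all; try B only when A fails."""
    def parse(text: str, pos: int) -> Set[int]:
        ends = a(text, pos)
        return ends if ends else b(text, pos)
    return parse

def accepts(p: Parser, text: str) -> bool:
    """True if the parser can consume the whole input."""
    return len(text) in p(text, 0)

a, ab = literal("a"), literal("ab")
print(accepts(unordered(a, ab), "ab"))  # the 'ab' alternative is still considered
print(accepts(ordered(a, ab), "ab"))    # 'a' wins, leaving 'b' unconsumed
print(accepts(ordered(ab, a), "ab"))    # listing the longer alternative first helps
```

Under these assumptions, the order of alternatives changes what an ordered choice accepts, which is why mixing `/` and `|` carelessly can be surprising.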
As syntactic sugar, you can declare inline rules; e.g. `cmd = name (' --force')=force`. You can also use some predefined rules, such as `ALPHA`, `BASE64`, and `RFC-3339-DATETIME`. (The full list is shown further down.)
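One way to think about an inline rule is as a named sub-match. The sketch below is an analogy rather than a definition: it renders the `cmd` example with Python named groups and assumes that `name` matches a run of word characters, which the text above does not specify.

```python
import re

# cmd = name (' --force')=force
# Assumption: 'name' is a run of word characters; the example leaves it undefined.
CMD = re.compile(r"(?P<name>\w+)(?P<force> --force)")

match = CMD.fullmatch("deploy --force")
if match is not None:
    print(match.group("name"))   # the text matched by the 'name' rule
    print(match.group("force"))  # the text matched by the inline 'force' rule
```

A named group can only be read from the match it belongs to, whereas an inline rule is described as reusable elsewhere in the grammar, so the analogy only covers the labeling aspect.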
Grammar
The formal grammar for regex-bnf is presented both in regex-bnf itself and in W3C XML EBNF.
grammar = statement+
statement = (START | LF+) (SP* comment? | rule-defn | func-defn) (SP | LF)*
comment = ';' (! LF)*=comment-text
rule-defn = rule-name def-symbol rule-rhs
func-defn = rule-name arg-spec def-symbol rule-rhs
def-symbol = SP+ (`:?:?=`) SP+
rule-rhs = (SP* LF)+ SP+ rule-rhs | rule-expr
rule-expr = (group-expr | term) inline-label?
arg-spec = '(' rule-name (',' rule-name)* ')'
inline-label = '=' rule-name
group-expr = group quant-expr?
group = parenthesized | complement
parenthesized = '(' SP* rule-expr SP* ')'
complement = '(!' SP+ primitive ')'
term = singleton
| concatenation
| intersection
| exclusion
| ordered-alt
| unordered-alt
| exclusive-or
concatenation = rule-expr SP+ rule-expr
intersection = rule-expr SP+ '&' SP+ rule-expr
exclusion = rule-expr SP+ '-' SP+ rule-expr
ordered-alt = rule-expr SP+ '/' SP+ rule-expr
unordered-alt = rule-expr SP+ '|' SP+ rule-expr
exclusive-or = rule-expr SP+ '^' SP+ rule-expr
quant-expr = unit-quant
| exact-quant
| min-quant
| max-quant
| range-quant
exact-quant = '{' (DIGIT+)=count '}'
range-quant = '{' (DIGIT+)=min ',' (DIGIT+)=max '}'
min-quant = '{' (DIGIT+)=min ',}'
max-quant = '{,' (DIGIT+)=max '}'
unit-quant-expr = rule-expr unit-quant modifier?
unit-quant = '?'=zero-or-one | '*'=zero-plus | '+'=one-plus
modifier = '?'=lazy | '+'=possessive
singleton = rule-name | primitive | regex
regex = bracket-regex | dot-regex | tick-regex
bracket-regex = '[' [^ ]]+? ']'
dot-regex = '.' quant-expr
; Note: a single . MUST be enclosed in ``.
; This avoids ambiguity with the ABNF's concatenation operator.
tick-regex = ``(`+)(?<pattern>[^`].*?[^`])(\1)``
; Enclose in as many backticks as needed (ala Markdown).
; The pattern <pattern> MUST NOT start or end with a backtick.
; (Escape the backtick as \u0060 if needed.)
primitive = literal | unicode-escape | unicode-name
literal = `"[^"]++"` | `'[^']++'`
unicode-escape = '#' `[0-9A-F]{1,8}+` | '%x' `[0-9A-F]{2}`
unicode-name = "#'" `[A-Za-z0-9 ,/()-]+` "'"
; Example: #'Micro Sign'
rule-name = CORE-RULE-NAME | LEXER-RULE-NAME | MAIN-RULE-NAME
CORE-RULE-NAME = `@?[A-Z0-9]+(-[A-Z0-9]+)*`
; A @ prefix MAY be used to mark core rules references.
LEXER-RULE-NAME = `[A-Z0-9]+(-[A-Z0-9]+)*`
MAIN-RULE-NAME = `[a-z0-9]+(-[a-z0-9]+)*`
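The four brace quantifiers in the listing above (`exact-quant`, `range-quant`, `min-quant`, `max-quant`) can be recognized with a single regular expression. The following is a minimal sketch: `parse_quant` is a hypothetical helper, it assumes that at least one bound is given, and it assumes that a `max-quant` such as `{,10}` has an implicit minimum of 0.

```python
import re
from typing import Optional, Tuple

# Covers '{n}', '{n,m}', '{n,}' and '{,m}'.
QUANT = re.compile(r"\{(?P<min>\d*)(?P<comma>,?)(?P<max>\d*)\}")

def parse_quant(text: str) -> Tuple[int, Optional[int]]:
    """Return (min, max) for a brace quantifier; max is None when unbounded."""
    m = QUANT.fullmatch(text)
    if m is None or (not m["min"] and not m["max"]):
        raise ValueError(f"not a brace quantifier: {text!r}")
    if not m["comma"]:
        # exact-quant: '{n}'
        count = int(m["min"])
        return count, count
    low = int(m["min"]) if m["min"] else 0        # '{,m}' starts at zero (assumed)
    high = int(m["max"]) if m["max"] else None    # '{n,}' has no upper bound
    return low, high

for sample in ["{5}", "{5,10}", "{5,}", "{,10}"]:
    print(sample, parse_quant(sample))
```

The sketch does not check that the minimum is less than or equal to the maximum; a real implementation would probably want to reject `{10,5}`.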
grammar ::= statement+
statement ::= (START | [#x0A]) (' '* comment? | rule-defn | func-defn) [#x0A]*
comment ::= ';' comment-text
comment-text ::= [^#x0A]*
rule-defn ::= rule-name def-symbol rule-rhs
func-defn ::= rule-name arg-spec def-symbol rule-rhs
def-symbol ::= ' '+ ('::=' | '=' | ':=') ' '+
rule-rhs ::= (' '* #x0A)+ ' '+ rule-rhs | rule-expr
rule-expr ::= (group-expr | term) inline-label?
inline-label ::= '=' rule-name
arg-spec ::= '(' rule-name (',' rule-name)* ')'
group-expr ::= group quant-expr?
group ::= parenthesized | complement
parenthesized ::= '(' ' '* rule-expr ' '* ')'
complement ::= '(!' ' '+ primitive ')'
term ::= singleton
| concatenation
| intersection
| exclusion
| ordered-alt
| unordered-alt
| exclusive-or
concatenation ::= rule-expr ' '+ rule-expr
intersection ::= rule-expr ' '+ '&' ' '+ rule-expr
exclusion ::= rule-expr ' '+ '-' ' '+ rule-expr
ordered-alt ::= rule-expr ' '+ '/' ' '+ rule-expr
unordered-alt ::= rule-expr ' '+ '|' ' '+ rule-expr
exclusive-or ::= rule-expr ' '+ '^' ' '+ rule-expr
quant-expr ::= unit-quant
| exact-quant
| min-quant
| max-quant
| range-quant
exact-quant ::= '{' count '}'
range-quant ::= '{' min ',' max '}'
min-quant ::= '{' min ',}'
max-quant ::= '{,' max '}'
count ::= [0-9]+
min ::= [0-9]+
max ::= [0-9]+
unit-quant-expr ::= rule-expr unit-quant modifier?
unit-quant ::= zero-or-one | zero-plus | one-plus
modifier ::= greedy | lazy | possessive
zero-or-one ::= '?'
zero-plus ::= '*'
one-plus ::= '+'
greedy ::= '*'
lazy ::= '?'
possessive ::= '+'
singleton ::= rule-name | primitive | regex
primitive ::= literal | unicode-escape | unicode-name
literal ::= '"' [^"]+ '"' | "'" [^']+ "'"
unicode-escape ::= ('#' HEX-UTF) | '%x' HEX HEX
unicode-name ::= "#'" [A-Za-z0-9 ,/()-]+ "'"
HEX-UTF ::= HEX HEX? HEX? HEX? HEX? HEX? HEX? HEX?
HEX ::= [0-9A-F]
regex ::= bracket-regex | dot-regex | tick-regex
bracket-regex ::= '[' [^ ]+ ']'
dot-regex ::= '.' quant-expr
tick-regex ::= '`' [^`]+ '`'
/* Approximate! Cannot replicate this rule in W3C XML EBNF. */
/* Enclose in as many backticks as needed (ala Markdown). */
/* The pattern <pattern> MUST NOT start or end with a backtick. */
/* (Escape the backtick as \u0060 if needed.) */
rule-name ::= CORE-RULE-NAME | LEXER-RULE-NAME | MAIN-RULE-NAME
CORE-RULE-NAME ::= '@'? [A-Z0-9]+ ('-' [A-Z0-9]+)*
/* A @ prefix MAY be used to mark core rules references. */
LEXER-RULE-NAME ::= [A-Z0-9]+ ('-' [A-Z0-9]+)*
MAIN-RULE-NAME ::= [a-z0-9]+ ('-' [a-z0-9]+)*
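The `tick-regex` rule from the regex-bnf listing can be tried out directly, because it is itself a regular expression. The sketch below transcribes it into Python's `re` syntax (`(?P<pattern>…)` instead of `(?<pattern>…)`); `unwrap` is a hypothetical helper name, and the behavior shown is that of Python's engine, which may differ from an ECMA 262 engine in edge cases.

```python
import re
from typing import Optional

# The tick-regex pattern from the grammar, with the named group in Python syntax.
TICK_REGEX = re.compile(r"(`+)(?P<pattern>[^`].*?[^`])(\1)")

def unwrap(source: str) -> Optional[str]:
    """Return the pattern inside a backtick-fenced regex, or None if it is malformed."""
    m = TICK_REGEX.fullmatch(source)
    return m.group("pattern") if m else None

print(unwrap("`[a-z]+`"))   # single backticks
print(unwrap("``a`b``"))    # double backticks allow a lone backtick inside
print(unwrap("``a`"))       # mismatched fence lengths are rejected
print(unwrap("`x`"))        # as written, a one-character pattern is rejected
```

The last case follows from the rule requiring both a first and a last non-backtick character, so under this transcription a pattern needs at least two characters.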
Core rules
Where available, these match ABNF’s core rules.
URI = <per RFC 3986>
UTF-GRAPHIC = `\p{L}|\p{LC}|\p{M}|\p{N}|\p{S}|\p{Zs}`
UTF-FORMAT = `\p{Cf}`
UTF-SURROGATE = `\p{Cs}`
UTF-CONTROL = `\p{Cc}`
UTF-SPACE = `\p{Zs}`
BACKSLASH = '\\'
BOOLEAN = 'true' | 'false'
OCTDIG = `[0-7]`
OCTET = OCTDIG{2}
DIGIT = [0-9]
BASE64 = [A-Za-z0-9+/]
BASE64URL = [A-Za-z0-9_-]
ALPHA = [A-Za-z]
UPPERCASE = [A-Z]
LOWERCASE = [a-z]
HEXDIG = [0-9A-F]
LOWER-HEXDIG = [0-9a-f]
ALPHANUM = ALPHA | DIGIT
SQUOTE = "'"
DQUOTE = '"'
RFC-3339-DATETIME = `20\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\
T([01]\d|2[0-3]):[0-5]\d:([0-5]\d|60)\
(\.(\d{3}|\d{6}))?Z`
E-NOTATION-FLOAT = LITERAL-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONNEG-FLOAT = LITERAL-NONNEG-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-POSITIVE-FLOAT = LITERAL-POSITIVE-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONZERO-FLOAT = LITERAL-NONZERO-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-INT = LITERAL-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONNEG-INT = LITERAL-NONNEG-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-POSITIVE-INT = LITERAL-POSITIVE-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONZERO-INT = LITERAL-NONZERO-INT ('E' LITERAL-NONNEG-INT)?
LITERAL-FLOAT = '-'? DIGIT-STR ('.' DIGIT-STR)?
LITERAL-NONNEG-FLOAT = DIGIT-STR ('.' DIGIT-STR)?
LITERAL-POSITIVE-FLOAT = [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-NONZERO-FLOAT = '-'? [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-INT = '-'? DIGIT-STR
LITERAL-NONNEG-INT = DIGIT-STR
LITERAL-POSITIVE-INT = [1-9] DIGIT*
LITERAL-NONZERO-INT = '-'? [1-9] DIGIT*
BIN-STR = BIT+
OCT-STR = OCTDIG+
HEX-STR = HEXDIG+
ALPHA-STR = ALPHA+
DIGIT-STR = DIGIT+
ALPHANUM-STR = (ALPHA | DIGIT)+
BASE64-STR = BASE64+ '='{0,2}
BASE64URL-STR = BASE64URL+ '='{0,2}
TICK = '`'
BIT = [01]
CRLF = CR LF
CR = '\r'
LF = '\n'
HTAB = '\t'
SP = ' '
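Because the numeric core rules are built from one another, they can be checked by composing regex fragments in the same way. The sketch below transcribes only `DIGIT-STR`, `LITERAL-FLOAT`, and `E-NOTATION-FLOAT` into Python; the variable names are hypothetical, and the sample inputs are chosen for illustration.

```python
import re

# Core rules transcribed as Python regex fragments (uppercase 'E' only, as in the list).
DIGIT_STR = r"[0-9]+"                                  # DIGIT-STR = DIGIT+
LITERAL_FLOAT = rf"-?{DIGIT_STR}(?:\.{DIGIT_STR})?"    # '-'? DIGIT-STR ('.' DIGIT-STR)?
E_NOTATION_FLOAT = re.compile(rf"{LITERAL_FLOAT}(?:E{LITERAL_FLOAT})?")

for sample in ["42", "-1.5", "6.02E23", "1E-3", "1e5", ".5"]:
    # Print each sample with whether the whole string is accepted.
    print(sample, E_NOTATION_FLOAT.fullmatch(sample) is not None)
```

Under this transcription, a lowercase `e` and a missing leading digit are both rejected, which may or may not be what a grammar author expects from a float rule.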
Example
literal-1 = ' "ab" '
; can also use ::= as in XML-MG
literal-2 = " 'ab' "
concatenation = literal-1 'defg'
alternation = literal-1 | 'xy'
; slash (/) is an alternative to |, but signals ordered choice
intersection = alternation & .{6}
; intersection of multiple rules!!
; 'intersection' must be exactly ' "ab" ' (the only 6-character alternative)
dot-regex = .+
; regex starting with '.' need not be enclosed in ``
simple-regex = [^A-Z]{2,4}
; regex starting with '[' need not be enclosed in ``
complex-regex = `.+? *\d`
; any regex can be enclosed in ``
grouping = ('ab' | 'cd') 'xy'
complement = (! 'abc')
; complement!!
; 'complement' is any text (0+ chars) except 'abc'
set-minus = .* - 'abc'
; exclusion!
; this is identical to 'complement' (above)
unicode-1 = #5F028322
unicode-2 = #'Plus-Minus Sign'
inline-label = label-1 ([^ ]+)=my-label
; declare an inline rule, which can be used anywhere
Style guide
- Use `=` instead of `::=`.
- Align the `=` at a generous column, with plenty of space to rename rules for clarity (or to add new rules, if the grammar is still being designed).
- Limit lines to 100 characters, breaking before `|` (preferably) or another operator as needed. On the continued line, put the operator on the same column as the `=`. The goal here is to limit the number of lines unnecessarily included in a diff.
- Align `;` comments to 2 characters after the `=`.
- Always use `-` to separate words in rule names.
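The first two points of the style guide are mechanical enough to automate. The following is a rough sketch of such a formatter, not an existing tool: `align` and `DEFINITION` are hypothetical names, the default column of 20 is an arbitrary choice, and the function only handles single-line definitions (comments and continuation lines are passed through unchanged).

```python
import re
from typing import List

# Matches a rule name, a definition symbol (=, := or ::=), and the right-hand side.
DEFINITION = re.compile(r"^(?P<name>[A-Za-z0-9@-]+)\s+(?P<symbol>:{0,2}=)\s+(?P<rhs>.*)$")

def align(lines: List[str], column: int = 20) -> List[str]:
    """Rewrite each definition so that a plain '=' sits at the given 0-based column."""
    aligned = []
    for line in lines:
        m = DEFINITION.match(line)
        if m is None:
            aligned.append(line)  # leave comments and continuation lines alone
            continue
        name = m.group("name")
        # Keep at least one space even when the name is longer than the column.
        padding = " " * max(column - len(name), 1)
        aligned.append(f"{name}{padding}= {m.group('rhs')}")
    return aligned

source = [
    "grammar ::= statement+",
    "comment = ';' (! LF)*=comment-text",
    "; a comment line",
]
print("\n".join(align(source, column=12)))
```

Normalizing `::=` and `:=` to `=` follows the first bullet; a more careful tool would also re-indent continuation lines and comments to match the new column.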