Advanced BNF with regex

Spec status: stable; use with caution. Take it, modify it, use it. (CC-BY-SA) This specification allows for dramatically shorter and more readable grammars, but applying it in practice might cause confusion. Abusing features like intersections and complements can make a specification less readable, not more. It also contributes to the proliferation of standards (see XKCD comic #927 about standards).

Summary

| Feature | regex-bnf | W3C XML EBNF | EBNF | ABNF |
|---|---|---|---|---|
| regex | ECMA 262 | minimal | no | no |
| exclusion | yes (-) | yes (-) | yes (-) | yes |
| complement | yes (!) | indirectly | indirectly | no |
| exclusive disjunction | yes (^) | indirectly | indirectly | no |
| intersection | yes (&) | no | no | no |
| ordered alternation | yes (/) | no | no | no |
| UTF codepoints by name | yes (#'␣') | no | no | no |
| inline definitions | yes (=) | no | no | no |
| explicit start token | yes (START) | no | no | no |
| lazy quantifiers | yes (??) | no | no | no |
| core rules | 50 | 0 | 0 | 16 |
| well-formedness constraints | [ wfc: ␣ ] | [ wfc: ␣ ] | ? ␣ ? | no |
| validity constraints | [ vc: ␣ ] | [ vc: ␣ ] | no | no |
| definition syntax | := / ::= / = | ::= | = | = |
| comment syntax | /* */ / ; | /* */ | (* *) | ; |
| concatenation symbol | implicit | implicit | , | . |
| alternation symbol | \| | \| | \| / / / ? | / |

An advanced BNF derivative.

Standards for specifying grammars are a mess, as Wirth et al. describe. David A. Wheeler also wrote a complaint about current grammar standards, particularly objecting to ISO’s EBNF (ISO/IEC 14977:1996). ABNF (RFC 5234) is an improvement, but it doesn’t use regex and has a non-obvious syntax for repetitions. The W3C XML EBNF is better, but it still lacks some functionality and the expressiveness that regex can provide.

Tip: diagrams

A tool called the Railroad Diagram Generator can generate excellent diagrams from W3C XML EBNF. It’s maintained as of early 2025.

In the spirit of XKCD #927, here is a new proposal. It’s a hybrid between ABNF, W3C XML EBNF, and syntax from parser generators, including ANTLR. Use it to describe PEGs and CFGs.

Most importantly, regex-bnf supports full ECMA 262 regular expressions. It also has several powerful additional features, including:

  • intersection: A & B, where both rules consume the same input
  • complement: (! A), any sequence that does not match rule A
  • exclusion: A - B, which requires that B rejects the input that A accepts
  • exclusive disjunction: A ^ B, equivalent to (A - B) | (B - A)
  • repetition: A{5,10}, rule A at least 5 times and at most 10
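
The set-style operators above can be illustrated outside regex-bnf. The following is a minimal sketch in Python, not part of the specification: it treats each rule as the set of strings it accepts within a small, hypothetical universe (all strings over `a` and `b` up to length 3). The names `language`, `UNIVERSE`, `A`, and `B` are assumptions made for this illustration.

```python
import re
from itertools import product


def language(pattern: str, universe: list[str]) -> set[str]:
    """Return the strings in `universe` that the regex matches in full."""
    compiled = re.compile(pattern)
    return {s for s in universe if compiled.fullmatch(s)}


# Hypothetical finite universe: every string over {a, b} of length 0..3.
UNIVERSE = ["".join(p) for n in range(4) for p in product("ab", repeat=n)]

A = language(r"a[ab]*", UNIVERSE)  # strings that start with 'a'
B = language(r"[ab]*b", UNIVERSE)  # strings that end with 'b'

intersection = A & B              # regex-bnf: A & B
complement_a = set(UNIVERSE) - A  # regex-bnf: (! A), relative to the universe
exclusion = A - B                 # regex-bnf: A - B
exclusive_or = A ^ B              # regex-bnf: A ^ B

# The identity stated in the list: A ^ B is (A - B) | (B - A).
assert exclusive_or == (A - B) | (B - A)

print(sorted(intersection))
print(sorted(exclusion))
```

Python's set operators happen to use the same symbols (`&`, `-`, `^`), which makes the correspondence easy to read. A real implementation would work on parsers rather than enumerated sets.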

Rules can be declared with = (ABNF), ::= (XML), or :=. Similarly, inline ; comments (ABNF) and /* */ (XML) multiline comments are allowed. For alternation, both / (ABNF) and | (XML) are supported, but / signals ordered choice in contrast to |. Always use / for PEGs.
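
The difference between `/` and `|` matters when one alternative is a prefix of another. The sketch below is a simplified Python model, assuming a rule made only of literal alternatives that must consume the whole input; the function names are hypothetical and not part of regex-bnf.

```python
def ordered_choice(alternatives: list[str], text: str) -> bool:
    """PEG-style '/': commit to the first alternative that matches a prefix."""
    for alt in alternatives:
        if text.startswith(alt):
            # Committed: later alternatives are never tried.
            return len(alt) == len(text)
    return False


def unordered_choice(alternatives: list[str], text: str) -> bool:
    """CFG-style '|': accept if any alternative derives the whole input."""
    return any(text == alt for alt in alternatives)


print(ordered_choice(["a", "ab"], "ab"))    # False: 'a' wins, 'b' is left over
print(unordered_choice(["a", "ab"], "ab"))  # True
print(ordered_choice(["ab", "a"], "ab"))    # True: longer alternative listed first
```

With ordered choice, the order of alternatives is part of the grammar's meaning, which is why a PEG should list longer alternatives before their prefixes.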

As syntactic sugar, you can declare inline rules; e.g. cmd = name (' --force')=force. You can also use some predefined rules, such as ALPHA, BASE64, and RFC-3339-DATETIME. (The full list is shown further down.)
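
An inline rule behaves much like a named capture group. As a rough analogy only, the `cmd` example could be modeled with Python's `re` module as below. The definition of `name` as one or more ASCII letters is an assumption; the source does not define it.

```python
import re

# Assumption: `name` is one or more ASCII letters (not defined in the source).
# (?P<force>...) plays the role of the inline rule (' --force')=force.
CMD = re.compile(r"(?P<name>[A-Za-z]+)(?P<force> --force)")

match = CMD.fullmatch("deploy --force")
print(match.group("name"))         # deploy
print(repr(match.group("force")))  # ' --force'
```

As in the regex-bnf rule, the labeled part is required here; `deploy` on its own does not match.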

Grammar

The formal grammar for regex-bnf is presented both in itself and in W3C XML EBNF.

grammar         = statement+
statement       = (START | LF+) (SP* comment? | rule-defn | func-defn) (SP | LF)*
comment         = ';' (! LF)*=comment-text

rule-defn       = rule-name def-symbol rule-rhs
func-defn       = rule-name arg-spec def-symbol rule-rhs
def-symbol      = SP+ (`:?:?=`) SP+
rule-rhs        = (SP* LF)+ SP+ rule-rhs | rule-expr
rule-expr       = (group-expr | term) inline-label?
arg-spec        = '(' rule-name (',' rule-name)* ')'

inline-label    = '=' rule-name

group-expr      = group quant-expr?
group           = parenthesized | complement
parenthesized   = '(' SP* rule-expr SP* ')'
complement      = '(!' SP+ primitive ')'

term            = singleton
                  | concatenation
                  | intersection
                  | exclusion
                  | ordered-alt
                  | unordered-alt
                  | exclusive-or
concatenation   = rule-expr SP+ rule-expr
intersection    = rule-expr SP+ '&' SP+ rule-expr
exclusion       = rule-expr SP+ '-' SP+ rule-expr
ordered-alt     = rule-expr SP+ '/' SP+ rule-expr
unordered-alt   = rule-expr SP+ '|' SP+ rule-expr
exclusive-or    = rule-expr SP+ '^' SP+ rule-expr

quant-expr      = unit-quant
                  | exact-quant
                  | min-quant
                  | max-quant
                  | range-quant
exact-quant     = '{' (DIGIT+)=count '}'
range-quant     = '{' (DIGIT+)=min ',' (DIGIT+)=max '}'
min-quant       = '{' (DIGIT+)=min ',}'
max-quant       = '{,' (DIGIT+)=max '}'

unit-quant-expr = rule-expr unit-quant modifier?
unit-quant      = '?'=zero-or-one | '*'=zero-plus | '+'=one-plus
modifier        = '?'=lazy | '+'=possessive

singleton       = rule-name | primitive | regex

regex           = bracket-regex | dot-regex | tick-regex
bracket-regex   = '[' [^ ]]+? ']'
dot-regex       = '.' quant-expr
                  ; Note: a single . MUST be enclosed in ``.
                  ; This avoids ambiguity with the ABNF's concatenation operator.
tick-regex      = ``(`+)(?<pattern>[^`].*?[^`])(\1)``
                  ; Enclose in as many backticks as needed (ala Markdown).
                  ; The pattern <pattern> MUST NOT start or end with a backtick.
                  ; (Escape the backtick as \u0060 if needed.)

primitive       = literal | unicode-escape | unicode-name
literal         = `"[^"]++"` | `'[^']++'`
unicode-escape  = '#' `[0-9A-F]{1,8}+` | '%x' `[0-9A-F]{2}`
unicode-name    = "#'" [A-Za-z0-9 ,/()-]+ "'"
                  ; Example: #'Micro Sign'

rule-name       = CORE-RULE-NAME | LEXER-RULE-NAME | MAIN-RULE-NAME
CORE-RULE-NAME  = `@?[A-Z0-9]+(-[A-Z0-9]+)*`
                  ; A @ prefix MAY be used to mark core rule references.
LEXER-RULE-NAME = `[A-Z0-9]+(-[A-Z0-9]+)*`
MAIN-RULE-NAME  = `[a-z0-9]+(-[a-z0-9]+)*`

The same grammar in W3C XML EBNF:

grammar         ::= statement+
statement       ::= (START | [#x0A]) (' '* comment? | rule-defn | func-defn) [#x0A]*
comment         ::= ';' comment-text
comment-text    ::= [^#x0A]*

rule-defn       ::= rule-name def-symbol rule-rhs
func-defn       ::= rule-name arg-spec def-symbol rule-rhs
def-symbol      ::= ' '+ ('::=' | '=' | ':=') ' '+
rule-rhs        ::= (' '* #x0A)+ ' '+ rule-rhs | rule-expr
rule-expr       ::= (group-expr | term) inline-label?
inline-label    ::= '=' rule-name
arg-spec        ::= '(' rule-name (',' rule-name)* ')'

group-expr      ::= group quant-expr?
group           ::= parenthesized | complement
parenthesized   ::= '(' ' '* rule-expr ' '* ')'
complement      ::= '(!' ' '+ primitive ')'

term            ::= singleton
                  | concatenation
                  | intersection
                  | exclusion
                  | ordered-alt
                  | unordered-alt
                  | exclusive-or
concatenation   ::= rule-expr ' '+ rule-expr
intersection    ::= rule-expr ' '+ '&' ' '+ rule-expr
exclusion       ::= rule-expr ' '+ '-' ' '+ rule-expr
ordered-alt     ::= rule-expr ' '+ '/' ' '+ rule-expr
unordered-alt   ::= rule-expr ' '+ '|' ' '+ rule-expr
exclusive-or    ::= rule-expr ' '+ '^' ' '+ rule-expr

quant-expr      ::= unit-quant
                  | exact-quant
                  | min-quant
                  | max-quant
                  | range-quant
exact-quant     ::= '{' count '}'
range-quant     ::= '{' min ',' max '}'
min-quant       ::= '{' min ',}'
max-quant       ::= '{,' max '}'
count           ::= [0-9]+
min             ::= [0-9]+
max             ::= [0-9]+

unit-quant-expr ::= rule-expr unit-quant modifier?
unit-quant      ::= zero-or-one | zero-plus | one-plus
modifier        ::= greedy | lazy | possessive
zero-or-one     ::= '?'
zero-plus       ::= '*'
one-plus        ::= '+'
greedy          ::= '*'
lazy            ::= '?'
possessive      ::= '+'

singleton       ::= rule-name | primitive | regex
primitive       ::= literal | unicode-escape | unicode-name

literal         ::= '"' [^"]+ '"' | "'" [^']+ "'"
unicode-escape  ::= ('#' HEX-UTF) | '%x' HEX HEX
unicode-name    ::= "#'" [A-Za-z0-9 ,/()-]+ "'"
HEX-UTF         ::= HEX HEX? HEX? HEX? HEX? HEX? HEX? HEX?
HEX             ::= [0-9A-F]

regex           ::= bracket-regex | dot-regex | tick-regex
bracket-regex   ::= '[' [^ ]+ ']'
dot-regex       ::= '.'  quant-expr
tick-regex      ::= '`' [^`]+ '`'
/* Approximate! Cannot replicate this rule in W3C XML EBNF. */
/* Enclose in as many backticks as needed (ala Markdown). */
/* The pattern <pattern> MUST NOT start or end with a backtick. */
/* (Escape the backtick as \u0060 if needed.) */

rule-name       ::= CORE-RULE-NAME | LEXER-RULE-NAME | MAIN-RULE-NAME
CORE-RULE-NAME  ::= '@'? [A-Z0-9]+ ('-' [A-Z0-9]+)*
/* A @ prefix MAY be used to mark core rule references. */
LEXER-RULE-NAME ::= [A-Z0-9]+ ('-' [A-Z0-9]+)*
MAIN-RULE-NAME  ::= [a-z0-9]+ ('-' [a-z0-9]+)*
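
The `tick-regex` rule from the regex-bnf version of the grammar relies on a backreference, which is why the W3C XML EBNF version can only approximate it. As a hedged illustration, here is a direct port of that pattern to Python's `re` syntax (`(?P<pattern>...)` instead of `(?<pattern>...)`). The helper name `unwrap` is hypothetical.

```python
import re
from typing import Optional

# Port of: (`+)(?<pattern>[^`].*?[^`])(\1)
# The closing run of backticks must equal the opening run (backreference \1).
TICK_REGEX = re.compile(r"(`+)(?P<pattern>[^`].*?[^`])\1")


def unwrap(token: str) -> Optional[str]:
    """Return the regex inside a backtick-delimited token, or None."""
    m = TICK_REGEX.fullmatch(token)
    return m.group("pattern") if m else None


print(unwrap("`.+? *\\d`"))  # single backticks around a regex
print(unwrap("``a`b``"))     # double backticks allow a backtick inside
print(unwrap("``a`b`"))      # None: opening and closing runs differ
```

Note that, as written, the pattern requires at least two characters between the delimiters, because both `[^`]` positions must match.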

Core rules

Where available, these match ABNF’s core rules.

URI                        = <per RFC 3986>
UTF-GRAPHIC                = `\p{L}|\p{LC}|\p{M}|\p{N}|\p{S}|\p{Zs}`
UTF-FORMAT                 = `\p{Cf}`
UTF-SURROGATE              = `\p{Cs}`
UTF-CONTROL                = `\p{Cc}`
UTF-SPACE                  = `\p{Zs}`
BACKSLASH                  = '\\'
BOOLEAN                    = 'true' | 'false'
OCTDIG                     = `[0-7]`
OCTET                      = OCTDIG{2}
DIGIT                      = [0-9]
BASE64                     = [A-Za-z0-9+/]
BASE64URL                  = [A-Za-z0-9_-]
ALPHA                      = [A-Za-z]
UPPERCASE                  = [A-Z]
LOWERCASE                  = [a-z]
HEXDIG                     = [0-9A-F]
LOWER-HEXDIG               = [0-9a-f]
ALPHANUM                   = ALPHA | DIGIT
SQUOTE                     = "'"
DQUOTE                     = '"'
RFC-3339-DATETIME          = `20\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\
                              T([01]\d|2[0-3]):[0-5]\d:([0-5]\d|60)\
                              (\.\d{3}|\.\d{6})?Z`
E-NOTATION-FLOAT           = LITERAL-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONNEG-FLOAT    = LITERAL-NONNEG-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-POSITIVE-FLOAT  = LITERAL-POSITIVE-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONZERO-FLOAT   = LITERAL-NONZERO-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-INT             = LITERAL-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONNEG-INT      = LITERAL-NONNEG-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-POSITIVE-INT    = LITERAL-POSITIVE-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONZERO-INT     = LITERAL-NONZERO-INT ('E' LITERAL-NONNEG-INT)?
LITERAL-FLOAT              = '-'? DIGIT-STR ('.' DIGIT-STR)?
LITERAL-NONNEG-FLOAT       = DIGIT-STR ('.' DIGIT-STR)?
LITERAL-POSITIVE-FLOAT     = [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-NONZERO-FLOAT      = '-'? [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-INT                = '-'? DIGIT-STR
LITERAL-NONNEG-INT         = DIGIT-STR
LITERAL-POSITIVE-INT       = [1-9] DIGIT*
LITERAL-NONZERO-INT        = '-'? [1-9] DIGIT*
BIN-STR                    = BIT+
OCT-STR                    = OCTDIG+
HEX-STR                    = HEXDIG+
ALPHA-STR                  = ALPHA+
DIGIT-STR                  = DIGIT+
ALPHANUM-STR               = (ALPHA | DIGIT)+
BASE64-STR                 = BASE64+ '='{0,2}
BASE64URL-STR              = BASE64URL+ '='{0,2}
TICK                       = '`'
BIT                        = [01]
CRLF                       = CR LF
CR                         = '\r'
LF                         = '\n'
HTAB                       = '\t'
SP                         = ' '
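
Because the core rules are built from regex and concatenation, they translate almost mechanically into an ordinary regex. The sketch below composes `DIGIT-STR`, `LITERAL-FLOAT`, and `E-NOTATION-FLOAT` in Python, following the definitions above as written. The Python variable names are assumptions for this example.

```python
import re

DIGIT_STR = r"[0-9]+"                                # DIGIT-STR = DIGIT+
LITERAL_FLOAT = rf"-?{DIGIT_STR}(?:\.{DIGIT_STR})?"  # '-'? DIGIT-STR ('.' DIGIT-STR)?
# E-NOTATION-FLOAT = LITERAL-FLOAT ('E' LITERAL-FLOAT)?
E_NOTATION_FLOAT = re.compile(rf"{LITERAL_FLOAT}(?:E{LITERAL_FLOAT})?")

for text in ["1.5E-3", "42", "-0.25E2.5", "1e3", ".5"]:
    print(text, bool(E_NOTATION_FLOAT.fullmatch(text)))
```

Under these definitions the exponent is itself a `LITERAL-FLOAT`, so `-0.25E2.5` is accepted, while a lowercase `e` and a missing leading digit are rejected.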

Example

literal-1     = ' "ab" '
                ; can also use ::= as in XML-MG
literal-2     = " 'ab' "
concatenation = literal-1 'defg'
alternation   = literal-1 | 'xy'
                ; slash is an alternative to |
intersection  = alternation & .{10}
                ; intersection of multiple rules!!
                ; 'intersection' must be exactly ' "ab" xyxyxy'
dot-regex     = .+
                ; regex starting with '.' need not be enclosed in ``
simple-regex  = [^A-Z]{2,4}
                ; regex starting with '[' need not be enclosed in ``
complex-regex = `.+? *\d`
                ; any regex can be enclosed in ``
grouping      = ('ab' | 'cd') 'xy'
complement    = (! 'abc')
                ; complement!!
                ; 'complement' is any text (0+ chars) except 'abc'
set-minus     = .* - 'abc'
                ; exclusion!
                ; this is identical to 'complement' (above)
unicode-1     = #5F028322
unicode-2     = #'Plus-Minus Sign'
inline-label  = label-1 ([^ ]+)=my-label
                ; declare an inline rule, which can be used anywhere

Style guide

  • Use = instead of ::=.
  • Align the = at a generous column, with plenty of space to rename rules for clarity (or to add new rules, if the grammar is still being designed).
  • Limit lines to 100 characters, breaking before | (preferably) or another operator as needed. On the continued line, put the operator on the same column as the =. The goal here is to limit the number of lines unnecessarily included in a diff.
  • Align ; comments to 2 characters after the =.
  • Always use - to separate words in rule names.