Advanced BNF with regex

Spec status: Use with caution. Although useful, applying it in practice might cause confusion. Take it, modify it, use it. (CC-BY-SA)

An advanced BNF derivative.

Standards for specifying grammars are a mess, as Wirth et al. describe. David A. Wheeler also wrote a complaint about current grammar standards, particularly objecting to ISO’s EBNF (ISO/IEC 14977:1996). ABNF (RFC5234) is an improvement, but it doesn’t use regex and has a non-obvious syntax for repetitions. The W3C XML EBNF is better, but it still lacks some functionality and the expressiveness that regex can provide.

Tip

A tool called the Railroad Diagram Generator can generate excellent diagrams from W3C XML EBNF. It’s maintained as of early 2025.

In the spirit of XKCD #927, here is a new proposal. It’s a hybrid between ABNF, W3C XML EBNF, and syntax from parser generators, including ANTLR. Use it to describe PEGs and CFGs.

regex-bnf is compatible with either ABNF or W3C XML EBNF (but not both within one document). It supports full ECMA-262 regular expressions and adds features such as intersections (text that matches both rule A and rule B). It also defines core rules, such as ALPHA, BASE64, and RFC-3339-DATETIME.

Grammar

The formal grammar for regex-bnf is presented twice: first in regex-bnf itself, then in W3C XML EBNF.

grammar         = statement+
statement       = (START | LF+) (SP* comment? | rule-defn | func-defn) (SP | LF)*
comment         = ';' (! LF)*=comment-text

rule-defn       = rule-name def-symbol rule-rhs
func-defn       = rule-name arg-spec def-symbol rule-rhs
def-symbol      = SP+ ('=' | '::=') SP+
rule-rhs        = (SP* LF)+ SP+ rule-rhs | rule-expr
rule-expr       = (group-expr | term) inline-label?
arg-spec        = '(' rule-name (',' rule-name)* ')'

inline-label    = '=' rule-name

group-expr      = group quant-expr?
group           = parenthesized | complement
parenthesized   = '(' SP* rule-expr SP* ')'
complement      = '(!' SP+ primitive ')'

term            = singleton
                  | concatenation
                  | intersection
                  | exclusion
                  | alternation
                  | exclusive-or
concatenation   = rule-expr SP+ rule-expr
intersection    = rule-expr SP+ '&' SP+ rule-expr
exclusion       = rule-expr SP+ '-' SP+ rule-expr
alternation     = rule-expr SP+ [|/] SP+ rule-expr
exclusive-or    = rule-expr SP+ '^' SP+ rule-expr

quant-expr      = unit-quant
                  | exact-quant
                  | min-quant
                  | max-quant
                  | range-quant
exact-quant     = '{' (DIGIT+)=count '}'
range-quant     = '{' (DIGIT+)=min ',' (DIGIT+)=max '}'
min-quant       = '{' (DIGIT+)=min ',}'
max-quant       = '{,' (DIGIT+)=max '}'

unit-quant-expr = rule-expr unit-quant modifier?
unit-quant      = '?'=zero-or-one | '*'=zero-plus | '+'=one-plus
modifier        = '?'=lazy | '+'=possessive

singleton       = rule-name | primitive | regex

regex           = bracket-regex | dot-regex | tick-regex
bracket-regex   = '[' [^ ]]+? ']'
dot-regex       = '.' quant-expr
                  ; Note: a single . MUST be enclosed in ``.
                  ; This avoids ambiguity with ABNF's concatenation operator.
tick-regex      = ``(`+)(?P<pattern>[^`](?:.*?[^`])?)(\1)``
                  ; Enclose in as many backticks as needed (à la Markdown).
                  ; The pattern <pattern> MUST NOT start or end with a backtick.
                  ; (Escape the backtick as \u0060 if needed.)

primitive       = literal | unicode-escape | unicode-name
literal         = `"[^"]++"` | `'[^']++'`
unicode-escape  = '#'? `[0-9A-F]{1,8}+` | '%x' `[0-9A-F]{2}`
unicode-name    = "#'" [A-Za-z0-9 ,/()-]+ "'"
                  ; Example: #'Micro Sign'

rule-name       = CORE-RULE-NAME | LEXER-RULE-NAME | MAIN-RULE-NAME
CORE-RULE-NAME  = `@?[A-Z0-9]+(-[A-Z0-9]+)*`
                  ; A @ prefix MAY be used to mark core rule references.
LEXER-RULE-NAME = `[A-Z0-9]+(-[A-Z0-9]+)*`
MAIN-RULE-NAME  = `[a-z0-9]+(-[a-z0-9]+)*`
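The tick-regex rule above relies on a backreference so that the closing run of backticks has the same length as the opening run. The following is a minimal Python sketch of that idea, not a reference implementation. The function name `extract_tick_regex` is hypothetical, and the body pattern is written so that a one-character body such as a single `.` is accepted, which is an assumption made here so that the note about enclosing a single `.` in backticks can be satisfied.

```python
import re

# Opening backticks are captured as group 1 and must be repeated verbatim at the end.
# Assumption: the body may be a single character; it must not start or end with a backtick.
TICK_REGEX = re.compile(r"(`+)(?P<pattern>[^`](?:.*?[^`])?)\1")


def extract_tick_regex(token: str):
    """Return the regex source inside a backtick-delimited token, or None if malformed."""
    match = TICK_REGEX.fullmatch(token)
    return match.group("pattern") if match else None
```

With this sketch, a token wrapped in two backticks may contain a single backtick in its body, which is the Markdown-like behavior the comments describe.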

/* The same grammar, in W3C XML EBNF. */
grammar         ::= statement+
statement       ::= (START | [#x0A]) (' '* comment? | rule-defn | func-defn) [#x0A]*
comment         ::= ';' comment-text
comment-text    ::= [^#x0A]*

rule-defn       ::= rule-name def-symbol rule-rhs
func-defn       ::= rule-name arg-spec def-symbol rule-rhs
def-symbol      ::= ' '+ ('::=' | '=') ' '+
rule-rhs        ::= (' '* #x0A)+ ' '+ rule-rhs | rule-expr
rule-expr       ::= (group-expr | term) inline-label?
inline-label    ::= '=' rule-name
arg-spec        ::= '(' rule-name (',' rule-name)* ')'

group-expr      ::= group quant-expr?
group           ::= parenthesized | complement
parenthesized   ::= '(' ' '* rule-expr ' '* ')'
complement      ::= '(!' ' '+ primitive ')'

term            ::= singleton
                  | concatenation
                  | intersection
                  | exclusion
                  | alternation
                  | exclusive-or
concatenation   ::= rule-expr ' '+ rule-expr
intersection    ::= rule-expr ' '+ '&' ' '+ rule-expr
exclusion       ::= rule-expr ' '+ '-' ' '+ rule-expr
alternation     ::= rule-expr ' '+ [|/] ' '+ rule-expr
exclusive-or    ::= rule-expr ' '+ '^' ' '+ rule-expr

quant-expr      ::= unit-quant
                  | exact-quant
                  | min-quant
                  | max-quant
                  | range-quant
exact-quant     ::= '{' count '}'
range-quant     ::= '{' min ',' max '}'
min-quant       ::= '{' min ',}'
max-quant       ::= '{,' max '}'
count           ::= [0-9]+
min             ::= [0-9]+
max             ::= [0-9]+

unit-quant-expr ::= rule-expr unit-quant modifier?
unit-quant      ::= zero-or-one | zero-plus | one-plus
modifier        ::= greedy | lazy | possessive
zero-or-one     ::= '?'
zero-plus       ::= '*'
one-plus        ::= '+'
greedy          ::= '*'
lazy            ::= '?'
possessive      ::= '+'

singleton       ::= rule-name | primitive | regex
primitive       ::= literal | unicode-escape | unicode-name

literal         ::= '"' [^"]+ '"' | "'" [^']+ "'"
unicode-escape  ::= ('#' HEX-UTF) | '%x' HEX HEX
unicode-name    ::= "#'" [A-Za-z0-9 ,/()-]+ "'"
HEX-UTF         ::= HEX HEX? HEX? HEX? HEX? HEX? HEX? HEX?
HEX             ::= [0-9A-F]

regex           ::= bracket-regex | dot-regex | tick-regex
bracket-regex   ::= '[' [^ ]+ ']'
dot-regex       ::= '.' quant-expr
tick-regex      ::= '`' [^`]+ '`'
/* Approximate! Cannot replicate this rule in W3C XML EBNF. */
/* Enclose in as many backticks as needed (à la Markdown). */
/* The pattern <pattern> MUST NOT start or end with a backtick. */
/* (Escape the backtick as \u0060 if needed.) */

rule-name       ::= CORE-RULE-NAME | LEXER-RULE-NAME | MAIN-RULE-NAME
CORE-RULE-NAME  ::= '@'? [A-Z0-9]+ ('-' [A-Z0-9]+)*
/* A @ prefix MAY be used to mark core rule references. */
LEXER-RULE-NAME ::= [A-Z0-9]+ ('-' [A-Z0-9]+)*
MAIN-RULE-NAME  ::= [a-z0-9]+ ('-' [a-z0-9]+)*
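The three rule-name patterns overlap: an uppercase name without the @ prefix fits both CORE-RULE-NAME and LEXER-RULE-NAME, and a digits-only name fits all three. The sketch below, with the hypothetical helper `classify_rule_name`, returns every class a name fits rather than picking one. How a real tool would disambiguate (for example, by checking the name against the core rules list) is not specified in the grammar and is left out here.

```python
import re

# Patterns copied from the grammar; dict order is the order they are listed in.
RULE_NAME_PATTERNS = {
    "CORE-RULE-NAME": re.compile(r"@?[A-Z0-9]+(-[A-Z0-9]+)*"),
    "LEXER-RULE-NAME": re.compile(r"[A-Z0-9]+(-[A-Z0-9]+)*"),
    "MAIN-RULE-NAME": re.compile(r"[a-z0-9]+(-[a-z0-9]+)*"),
}


def classify_rule_name(name: str) -> list[str]:
    """Return the names of all rule-name classes whose pattern matches the whole name."""
    return [kind for kind, pattern in RULE_NAME_PATTERNS.items() if pattern.fullmatch(name)]
```

Under these patterns, only the @ prefix makes a core rule reference unambiguous, which may be a reason to prefer it in grammars that also define their own lexer rules.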

Core rules

Where available, these match ABNF’s core rules.

URI                        = <per RFC 3986>
UTF-GRAPHIC                = `\p{L}|\p{LC}|\p{M}|\p{N}|\p{S}|\p{Zs}`
UTF-FORMAT                 = `\p{Cf}`
UTF-SURROGATE              = `\p{Cs}`
UTF-CONTROL                = `\p{Cc}`
UTF-SPACE                  = `\p{Zs}`
BACKSLASH                  = '\\'
BOOLEAN                    = 'true' | 'false'
OCTDIG                     = `[0-7]`
OCTET                      = OCTDIG{2}
DIGIT                      = [0-9]
BASE64                     = [A-Za-z0-9+/]
BASE64URL                  = [A-Za-z0-9_-]
ALPHA                      = [A-Za-z]
UPPERCASE                  = [A-Z]
LOWERCASE                  = [a-z]
HEXDIG                     = [0-9A-F]
LOWER-HEXDIG               = [0-9a-f]
ALPHANUM                   = ALPHA | DIGIT
SQUOTE                     = "'"
DQUOTE                     = '"'
RFC-3339-DATETIME          = `20\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\
                              T([01]\d|2[0-3]):[0-5]\d:([0-5]\d|60)\
                              (\.\d{3}|\.\d{6})?Z`
E-NOTATION-FLOAT           = LITERAL-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONNEG-FLOAT    = LITERAL-NONNEG-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-POSITIVE-FLOAT  = LITERAL-POSITIVE-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONZERO-FLOAT   = LITERAL-NONZERO-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-INT             = LITERAL-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONNEG-INT      = LITERAL-NONNEG-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-POSITIVE-INT    = LITERAL-POSITIVE-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONZERO-INT     = LITERAL-NONZERO-INT ('E' LITERAL-NONNEG-INT)?
LITERAL-FLOAT              = '-'? DIGIT-STR ('.' DIGIT-STR)?
LITERAL-NONNEG-FLOAT       = DIGIT-STR ('.' DIGIT-STR)?
LITERAL-POSITIVE-FLOAT     = [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-NONZERO-FLOAT      = '-'? [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-INT                = '-'? DIGIT-STR
LITERAL-NONNEG-INT         = DIGIT-STR
LITERAL-POSITIVE-INT       = [1-9] DIGIT*
LITERAL-NONZERO-INT        = '-'? [1-9] DIGIT*
BIN-STR                    = BIT+
OCT-STR                    = OCTDIG+
HEX-STR                    = HEXDIG+
ALPHA-STR                  = ALPHA+
DIGIT-STR                  = DIGIT+
ALPHANUM-STR               = (ALPHA | DIGIT)+
BASE64-STR                 = BASE64+ '='{0,8}
BASE64URL-STR              = BASE64URL+ '='{0,8}
TICK                       = '`'
BIT                        = [01]
CRLF                       = CR LF
CR                         = '\r'
LF                         = '\n'
HTAB                       = '\t'
SP                         = ' '
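Because the core rules are built from one another, they can be translated into ordinary regular expressions by substitution. The sketch below does this in Python for a few of the numeric rules (DIGIT, DIGIT-STR, LITERAL-INT, LITERAL-FLOAT, and E-NOTATION-FLOAT) exactly as they are listed. The Python constant names and the `matches` helper are hypothetical, and treating the literal 'E' as case-sensitive is an assumption, since the text does not say how literals compare.

```python
import re

# Each constant is a regex fragment for the core rule of the same name.
DIGIT = r"[0-9]"
DIGIT_STR = rf"(?:{DIGIT})+"
LITERAL_INT = rf"-?{DIGIT_STR}"
LITERAL_FLOAT = rf"-?{DIGIT_STR}(?:\.{DIGIT_STR})?"
# Assumption: the literal 'E' is matched case-sensitively.
E_NOTATION_FLOAT = rf"{LITERAL_FLOAT}(?:E{LITERAL_FLOAT})?"


def matches(rule: str, text: str) -> bool:
    """True when the whole text is derivable from the given rule fragment."""
    return re.fullmatch(rule, text) is not None
```

Note that, as listed, the exponent of E-NOTATION-FLOAT is itself a LITERAL-FLOAT, so a fractional exponent such as `1E2.5` is accepted by this translation.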

Example

literal-1     = ' "ab" '
                ; can also use ::= as in XML-MG
literal-2     = " 'ab' "
concatenation = literal-1 'defg'
alternation   = literal-1 | 'xy'
                ; slash is an alternative to |
intersection  = (literal-1 'xy'+) & .{10}
                ; intersection of multiple rules!!
                ; 'intersection' must be exactly ' "ab" xyxy' (6 + 4 = 10 characters)
dot-regex     = .+
                ; regex starting with '.' need not be enclosed in ``
simple-regex  = [^A-Z]{2,4}
                ; regex starting with '[' need not be enclosed in ``
complex-regex = `.+? *\d`
                ; any regex can be enclosed in ``
grouping      = ('ab' | 'cd') 'xy'
complement    = (! 'abc')
                ; complement!!
                ; 'complement' is any text (0+ chars) except 'abc'
set-minus     = .* - 'abc'
                ; exclusion!
                ; this is identical to 'complement' (above)
unicode-1     = #5F028322
unicode-2     = #'Plus-Minus Sign'
inline-label  = label-1 ([^ ]+)=my-label
                ; declare an inline rule, which can be used anywhere
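One way to read the set operators in these examples is to treat every rule as a predicate over a whole string. The Python sketch below takes that view, which is an assumption: the text does not define matching semantics formally. All function names are hypothetical, and `re.DOTALL` is assumed so that `.` covers every character.

```python
import re


def full(pattern: str):
    """Turn a regex into a predicate that must match the whole string."""
    compiled = re.compile(pattern, re.DOTALL)
    return lambda text: compiled.fullmatch(text) is not None


def intersection(a, b):
    """A & B: text accepted by both rules."""
    return lambda text: a(text) and b(text)


def exclusion(a, b):
    """A - B: text accepted by A but not by B."""
    return lambda text: a(text) and not b(text)


def complement(a):
    """(! A): any text not accepted by A."""
    return lambda text: not a(text)


def exclusive_or(a, b):
    """A ^ B: text accepted by exactly one of the rules."""
    return lambda text: a(text) != b(text)


# Usage with patterns taken from the example rules simple-regex and complex-regex.
two_to_four = full(r"[^A-Z]{2,4}")
ends_in_digit = full(r".+? *\d")
both = intersection(two_to_four, ends_in_digit)

not_abc = complement(full("abc"))
minus_abc = exclusion(full(".*"), full("abc"))
```

Under this reading, excluding 'abc' from `.*` agrees with the complement of 'abc' on every input, including the empty string.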

Style guide

  • Use = instead of ::=.
  • Align the = at a generous column, with plenty of space to rename rules for clarity (or to add new rules, if the grammar is still being designed).
  • Limit lines to 120 characters, breaking before | (preferably) or another operator as needed. On the continued line, put the operator on the same column as the =. The goal here is to limit the number of lines unnecessarily included in a diff.
  • Align ; comments to 2 characters after the =.
  • Always use - to separate words in rule names.
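Some of these guidelines can be checked mechanically. The following Python sketch covers four of them: the 120-character limit, = over ::=, hyphens in rule names, and a common column for the definition symbol. The function `style_issues` and its messages are hypothetical, it only flags underscores as a wrong word separator, and it treats any line that starts with whitespace as a continuation or comment line.

```python
import re

MAX_LINE_LENGTH = 120
# A definition line starts at column 0 with a rule name, then a definition symbol.
DEFINITION = re.compile(r"(?P<name>\S+)\s+(?P<symbol>::=|=)\s")


def style_issues(grammar_text: str) -> list[str]:
    """Return human-readable style problems found in a regex-bnf grammar."""
    issues = []
    symbol_columns = set()
    for number, line in enumerate(grammar_text.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            issues.append(f"line {number}: longer than {MAX_LINE_LENGTH} characters")
        match = DEFINITION.match(line)
        if match is None:
            continue  # blank, comment, or continuation line
        if match.group("symbol") == "::=":
            issues.append(f"line {number}: use = instead of ::=")
        if "_" in match.group("name"):
            issues.append(f"line {number}: separate words with - in rule names")
        symbol_columns.add(match.start("symbol"))
    if len(symbol_columns) > 1:
        issues.append("definition symbols are not aligned on one column")
    return issues
```

A check like this could run in a pre-commit hook, which fits the stated goal of keeping diffs small.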