Advanced BNF with regex

Status: not ready to use

This specification is theoretically useful, but it may lead to confusion when used.

Standards for specifying grammars are a mess, as Wirth et al. describe. David A. Wheeler also wrote a complaint about current grammar standards, particularly objecting to ISO’s EBNF (ISO/IEC 14977:1996). ABNF (RFC 5234) is an improvement, but it doesn’t use regex and has a non-obvious syntax for repetitions. XML’s custom meta-grammar is better still, but it lacks the expressiveness that regex provides.

In the spirit of XKCD #927, here is a new proposal. It’s a hybrid between ABNF, XML’s meta-grammar (“XML-MG”), and syntax from parser generators, including ANTLR and parboiled2. Use it to describe PEGs and CFGs.

It is compatible with either ABNF or XML-MG, supports full POSIX Extended Regular Expressions, and adds powerful features such as intersections.

Example

literal-1     = ' "ab" '
                ; can also use ::= as in XML-MG
literal-2     = " 'ab' "
concatenation = literal-1 'defg'
alternation   = literal-1 | 'xy'
                ; slash is an alternative to |
intersection  = alternation & .{10}
                ; intersection of multiple rules!!
                ; 'intersection' must be exactly ' "ab" xyxyxy'
dot-regex     = .+
                ; regex starting with '.' need not be enclosed in ``
simple-regex  = [^A-Z]{2,4}
                ; regex starting with '[' need not be enclosed in ``
complex-regex = `.+? *\d`
                ; any regex can be enclosed in ``
grouping      = ('ab' | 'cd') 'xy'
complement    = (! 'abc')
                ; complement!!
                ; 'complement' is any text (0+ chars) except 'abc'
set-minus     = .* - 'abc'
                ; exclusion!
                ; this is identical to 'complement' (above)
unicode-1     = #5F028322
unicode-2     = #'Plus-Minus Sign'
inline-label  = label-1 ([^ ]+)=my-label
                ; declare an inline rule, which can be used anywhere
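
The set operators used above (&, -, and (! ...)), along with ^ from the grammar below, can be read as operations on whole-string matches. The following Python sketch illustrates that reading. It assumes each operand has already been translated to an ordinary regular expression and that every operator tests the entire input, which the specification does not state explicitly; the helper names are hypothetical.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None


def intersection(left: str, right: str, text: str) -> bool:
    # A & B: the text must match both operands in full.
    return matches(left, text) and matches(right, text)


def exclusion(left: str, right: str, text: str) -> bool:
    # A - B: the text must match A but not B.
    return matches(left, text) and not matches(right, text)


def complement(inner: str, text: str) -> bool:
    # (! A): any text, including the empty string, that is not A.
    return not matches(inner, text)


def exclusive_or(left: str, right: str, text: str) -> bool:
    # A ^ B: the text must match exactly one of the operands.
    return matches(left, text) != matches(right, text)


if __name__ == "__main__":
    print(intersection(r"(?:ab|xy)+", r".{4}", "abxy"))
    print(complement(r"abc", ""))
    print(exclusion(r".+", r"abc", "abd"))
```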

Grammar

The specification is presented both in itself (regex-bnf) and in W3C XML EBNF.

regex-bnf:

grammar         = statement+
statement       = (START | LF+) (SP* comment? | rule-defn | func-defn) (SP | LF)*
comment         = ';' (! LF)*=comment-text

rule-defn       = rule-name def-symbol rule-rhs
func-defn       = rule-name arg-spec def-symbol rule-rhs
def-symbol      = SP+ ('=' | '::=') SP+
rule-rhs        = (SP* LF)+ SP+ rule-rhs | rule-expr
rule-expr       = (group-expr | term) inline-label?
arg-spec        = '(' rule-name (',' rule-name)* ')'

inline-label    = '=' rule-name

group-expr      = group quant-expr?
group           = parenthesized | complement
parenthesized   = '(' SP* rule-expr SP* ')'
complement      = '(!' SP+ primitive ')'

term            = singleton
                  | concatenation
                  | intersection
                  | exclusion
                  | alternation
                  | exclusive-or
concatenation   = rule-expr SP+ rule-expr
intersection    = rule-expr SP+ '&' SP+ rule-expr
exclusion       = rule-expr SP+ '-' SP+ rule-expr
alternation     = rule-expr SP+ [|/] SP+ rule-expr
exclusive-or    = rule-expr SP+ '^' SP+ rule-expr

quant-expr      = unit-quant
                  | exact-quant
                  | min-quant
                  | max-quant
                  | range-quant
exact-quant     = '{' (DIGIT+)=count '}'
range-quant     = '{' (DIGIT+)=min ',' (DIGIT+)=max '}'
min-quant       = '{' (DIGIT+)=min ',}'
max-quant       = '{,' (DIGIT+)=max '}'

unit-quant-expr = rule-expr unit-quant modifier?
unit-quant      = '?'=zero-or-one | '*'=zero-plus | '+'=one-plus
modifier        = '?'=lazy | '+'=possessive

singleton       = rule-name | primitive
primitive       = single-char | dot-range | regex

dot-range       = single-char '...' single-char
regex           = bracket-regex | dot-regex | tick-regex
bracket-regex   = '[' [^ ]]+? ']'
dot-regex       = '.' quant-expr
                  ; note that a single . must be enclosed in ``
                  ; this avoids ambiguity with the ABNF's concatenation operator
tick-regex      = ``([`]+).*?\1``
                  ; enclose in as many ` as needed

single-char     = literal | unicode-escape | unicode-name
literal         = DQUOTE [^"]++ DQUOTE | SQUOTE [^']++ SQUOTE
unicode-escape  = ('#' HEXDIG{1,8}) | '%x' HEXDIG{2}
unicode-name    = "#'" [A-Za-z0-9 ,/()-]+ "'"
                  ; ex: #'Micro Sign'

rule-name       = core-rule-name | lexer-rule-name | main-rule-name
core-rule-name  = '@'? lexer-rule-name
                  ; a @ prefix is allowed to distinguish core rules
lexer-rule-name = [A-Z0-9]+(-[A-Z0-9]+)*
main-rule-name  = [a-z0-9]+(-[a-z0-9]+)*
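
As a reading aid for quant-expr, the sketch below maps each quantifier form to a (min, max) pair, with None standing for "unbounded". It is an illustration in Python rather than part of the specification: parse_quant is a hypothetical helper, and rejecting an empty '{,}' is an assumption.

```python
import re
from typing import Optional, Tuple

# unit-quant: '?', '*', '+'
UNIT_QUANTS = {"?": (0, 1), "*": (0, None), "+": (1, None)}

# exact-quant, range-quant, min-quant, and max-quant share one brace shape.
BRACE_QUANT = re.compile(r"\{(\d*)(,?)(\d*)\}")


def parse_quant(text: str) -> Tuple[int, Optional[int]]:
    """Translate a quantifier token into (min, max); max is None if unbounded."""
    if text in UNIT_QUANTS:
        return UNIT_QUANTS[text]
    found = BRACE_QUANT.fullmatch(text)
    if found is None:
        raise ValueError(f"not a quantifier: {text!r}")
    low, comma, high = found.groups()
    if not comma:
        # exact-quant: '{n}'
        if not low:
            raise ValueError(f"empty quantifier: {text!r}")
        return int(low), int(low)
    if not low and not high:
        # '{,}' carries no bound at all; this sketch rejects it (assumption).
        raise ValueError(f"empty quantifier: {text!r}")
    # min-quant '{n,}', max-quant '{,m}', or range-quant '{n,m}'
    return (int(low) if low else 0, int(high) if high else None)


if __name__ == "__main__":
    for token in ["?", "+", "{3}", "{2,4}", "{2,}", "{,5}"]:
        print(token, parse_quant(token))
```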

W3C XML EBNF:

grammar         ::= statement+
statement       ::= (START | [\n]) (' '* comment? | rule-defn | func-defn) [ \n]*
comment         ::= ';' comment-text
comment-text    ::= [^\n]*

rule-defn       ::= rule-name def-symbol rule-rhs
func-defn       ::= rule-name arg-spec def-symbol rule-rhs
def-symbol      ::= ' '+ ('::=' | '=') ' '+
rule-rhs        ::= (' '* [\n])+ ' '+ rule-rhs | rule-expr
rule-expr       ::= (group-expr | term) inline-label?
inline-label    ::= '=' rule-name
arg-spec        ::= '(' rule-name (',' rule-name)* ')'

group-expr      ::= group quant-expr?
group           ::= parenthesized | complement
parenthesized   ::= '(' ' '* rule-expr ' '* ')'
complement      ::= '(!' ' '+ primitive ')'

term            ::= singleton
                  | concatenation
                  | intersection
                  | exclusion
                  | alternation
                  | exclusive-or
concatenation   ::= rule-expr ' '+ rule-expr
intersection    ::= rule-expr ' '+ '&' ' '+ rule-expr
exclusion       ::= rule-expr ' '+ '-' ' '+ rule-expr
alternation     ::= rule-expr ' '+ [|/] ' '+ rule-expr
exclusive-or    ::= rule-expr ' '+ '^' ' '+ rule-expr

quant-expr      ::= unit-quant
                  | exact-quant
                  | min-quant
                  | max-quant
                  | range-quant
exact-quant     ::= '{' count '}'
range-quant     ::= '{' min ',' max '}'
min-quant       ::= '{' min ',}'
max-quant       ::= '{,' max '}'
count           ::= [0-9]+
min             ::= [0-9]+
max             ::= [0-9]+

unit-quant-expr ::= rule-expr unit-quant modifier?
unit-quant      ::= zero-or-one | zero-plus | one-plus
modifier        ::= greedy | lazy | possessive
zero-or-one     ::= '?'
zero-plus       ::= '*'
one-plus        ::= '+'
greedy          ::= '*'
lazy            ::= '?'
possessive      ::= '+'

singleton       ::= rule-name | primitive
primitive       ::= single-char | dot-range | regex
dot-range       ::= single-char '...' single-char
single-char     ::= literal | unicode-escape | unicode-name

literal         ::= '"' [^"]+ '"' | "'" [^']+ "'"
unicode-escape  ::= ('#' HEX-UTF) | '%x' HEX HEX
unicode-name    ::= "#'" [A-Za-z0-9 ,/()-]+ "'"
HEX-UTF         ::= HEX HEX? HEX? HEX? HEX? HEX? HEX? HEX?
HEX             ::= [0-9A-F]

regex           ::= bracket-regex | dot-regex | tick-regex
bracket-regex   ::= '[' [^ ]]+? ']'
dot-regex       ::= '.' quant-expr
tick-regex      ::= '`' [^`]+ '`'
                  /* Approximate! This notation has no backreferences, so a fence of several ` cannot be replicated. */

rule-name       ::= core-rule-name | lexer-rule-name | main-rule-name
core-rule-name  ::= '@'? lexer-rule-name
                  /* a @ prefix is allowed to distinguish core rules */
lexer-rule-name ::= [A-Z0-9]+(-[A-Z0-9]+)*
main-rule-name  ::= [a-z0-9]+(-[a-z0-9]+)*
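
The regex-bnf definition of tick-regex relies on a backreference so that the closing fence repeats the opening run of backticks. A minimal Python sketch of that idea follows; tick_regex_body is a hypothetical helper, and letting the body span newlines is an assumption.

```python
import re

# Group 1 captures the opening run of backticks, group 2 the lazily matched
# body, and \1 requires the closing fence to repeat the opening one.
TICK_REGEX = re.compile(r"(`+)(.*?)\1", re.DOTALL)


def tick_regex_body(token: str) -> str:
    """Return the regex source enclosed by a backtick fence."""
    found = TICK_REGEX.fullmatch(token)
    if found is None:
        raise ValueError(f"not a tick-regex: {token!r}")
    return found.group(2)


if __name__ == "__main__":
    print(tick_regex_body("`.+? *\\d`"))
    print(tick_regex_body("`` a`b ``"))
```

A longer fence lets the body contain shorter runs of backticks, which is why the fence length cannot be fixed in advance.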

Core rules

Where available, these match ABNF’s core rules.

URI                        = <per RFC 3986>
UTF-GRAPHIC                = `\p{L}|\p{LC}|\p{M}|\p{N}|\p{S}|\p{Zs}`
UTF-FORMAT                 = `\p{Cf}`
UTF-SURROGATE              = `\p{Cs}`
UTF-CONTROL                = `\p{Cc}`
UTF-SPACE                  = `\p{Zs}`
BACKSLASH                  = '\\'
BOOLEAN                    = 'true' | 'false'
OCTDIG                     = `[0-7]`
OCTET                      = OCTDIG{2}
DIGIT                      = [0-9]
BASE64                     = [A-Za-z0-9+/]
BASE64URL                  = [A-Za-z0-9_-]
ALPHA                      = [A-Za-z]
UPPERCASE                  = [A-Z]
LOWERCASE                  = [a-z]
HEXDIG                     = [0-9A-F]
LOWER-HEXDIG               = [0-9a-f]
ALPHANUM                   = ALPHA | DIGIT
SQUOTE                     = "'"
DQUOTE                     = '"'
RFC-3339-DATETIME          = `20\d\d-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T([01]\d|2[0-3]):[0-5]\d:([0-5]\d|60)(\.(\d{3}|\d{6}))?Z`
E-NOTATION-FLOAT           = LITERAL-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONNEG-FLOAT    = LITERAL-NONNEG-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-POSITIVE-FLOAT  = LITERAL-POSITIVE-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-NONZERO-FLOAT   = LITERAL-NONZERO-FLOAT ('E' LITERAL-FLOAT)?
E-NOTATION-INT             = LITERAL-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONNEG-INT      = LITERAL-NONNEG-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-POSITIVE-INT    = LITERAL-POSITIVE-INT ('E' LITERAL-NONNEG-INT)?
E-NOTATION-NONZERO-INT     = LITERAL-NONZERO-INT ('E' LITERAL-NONNEG-INT)?
LITERAL-FLOAT              = '-'? DIGIT-STR ('.' DIGIT-STR)?
LITERAL-NONNEG-FLOAT       = DIGIT-STR ('.' DIGIT-STR)?
LITERAL-POSITIVE-FLOAT     = [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-NONZERO-FLOAT      = '-'? [1-9] DIGIT* ('.' DIGIT-STR)?
LITERAL-INT                = '-'? DIGIT-STR
LITERAL-NONNEG-INT         = DIGIT-STR
LITERAL-POSITIVE-INT       = [1-9] DIGIT*
LITERAL-NONZERO-INT        = '-'? [1-9] DIGIT*
BIN-STR                    = BIT+
OCT-STR                    = OCTDIG+
HEX-STR                    = HEXDIG+
ALPHA-STR                  = ALPHA+
DIGIT-STR                  = DIGIT+
ALPHANUM-STR               = (ALPHA | DIGIT)+
BASE64-STR                 = BASE64+ '='{0,2}
BASE64URL-STR              = BASE64URL+ '='{0,2}
TICK                       = '`'
BIT                        = [01]
CRLF                       = CR LF
CR                         = '\r'
LF                         = '\n'
HTAB                       = '\t'
SP                         = ' '
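
To show how core rules compose, here is a rough Python translation of DIGIT-STR, LITERAL-FLOAT, and E-NOTATION-FLOAT into ordinary regular expressions. It assumes that the spaces between terms in a concatenation are notation only and do not match input; the Python names are hypothetical.

```python
import re

# DIGIT-STR = DIGIT+
DIGIT_STR = r"[0-9]+"

# LITERAL-FLOAT = '-'? DIGIT-STR ('.' DIGIT-STR)?
LITERAL_FLOAT = rf"-?{DIGIT_STR}(?:\.{DIGIT_STR})?"

# E-NOTATION-FLOAT = LITERAL-FLOAT ('E' LITERAL-FLOAT)?
E_NOTATION_FLOAT = rf"{LITERAL_FLOAT}(?:E{LITERAL_FLOAT})?"


def is_e_notation_float(text: str) -> bool:
    """Return True when the whole of `text` matches E-NOTATION-FLOAT."""
    return re.fullmatch(E_NOTATION_FLOAT, text) is not None


if __name__ == "__main__":
    for sample in ["1.5E-3", "-12", "1.", "1e5", ".5"]:
        print(sample, is_e_notation_float(sample))
```

Under this translation the exponent marker is an uppercase 'E' only, and a number must have digits on both sides of any decimal point.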

Style guide

Use = instead of ::=.
Align the = at a generous column, with plenty of space to rename rules for clarity (or to add new rules, if the grammar is still being designed).
Limit lines to 120 characters, breaking before | (preferably) or another operator as needed. On the continued line, put the operator on the same column as the =. The goal here is to limit the number of lines unnecessarily included in a diff.
Align ; comments to 2 characters after the =.
Always use - to separate words in rule names.
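
Several of these rules can be checked mechanically. The sketch below is one possible checker, not an official tool: check_style is a hypothetical function, it treats any line that starts in column 0 and contains a definition symbol as a rule definition, and it does not check comment alignment or where long lines are broken.

```python
import re
from typing import List

# A rule definition starts in column 0: name, spaces, then '=' or '::='.
RULE_LINE = re.compile(r"^(?P<name>\S+)\s+(?P<sym>::=|=)\s")


def check_style(grammar: str, max_len: int = 120) -> List[str]:
    """Return a list of style problems found in `grammar`."""
    problems: List[str] = []
    columns = set()
    for number, line in enumerate(grammar.splitlines(), start=1):
        if len(line) > max_len:
            problems.append(f"line {number}: longer than {max_len} characters")
        found = RULE_LINE.match(line)
        if found is None:
            continue  # continuation line, comment, or blank line
        if found.group("sym") == "::=":
            problems.append(f"line {number}: use = instead of ::=")
        if "_" in found.group("name"):
            problems.append(f"line {number}: use - to separate words in rule names")
        columns.add(found.start("sym"))
    if len(columns) > 1:
        problems.append("definition symbols are not aligned on one column")
    return problems


if __name__ == "__main__":
    sample = "rule-a     = 'x'\nrule-bb    = rule-a | 'y'\n"
    print(check_style(sample))
```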