URI-safe characters

This page details exactly which characters must be percent-encoded in URIs.

Highlights

Don’t percent-encode these characters in query parameter values: -, _, ., ~, !, $, ', (, ), :, @, /, ,, ;, ?, and =.

To parse query parameters, follow this table:

to match                        use regex
full query URI component        \?[-\w.~!$'()*+:@/?;,&=]*+
form-urlencoded key             [-\w.~!$'()*+:@/,;?]++
form-urlencoded value           [-\w.~!$'()*+:@/;,?=]*+
OpenAPI simple-encoded value    [-\w.~!$'()*+:@/;?=]*+
OpenAPI form-encoded value      same as simple-encoded
OpenAPI matrix-encoded value    [-\w.~!$'()*+:@/?=]*+

The basics

Relevant specifications

  • RFC 1738, “Uniform Resource Locators (URL)” (Proposed Standard, obsolete)
  • RFC 3986, “Uniform Resource Identifier (URI): Generic Syntax” (Internet Standard)
  • RFC 6920, “Naming Things with Hashes” (Proposed Standard)
  • RFC 6570, “URI Template” (Proposed Standard)
  • IANA-recognized URI schemes

RFC 3986 is the main specification. There is a full grammar in RFC 3986 Appendix A, “Collected ABNF for URI”.

URI Components

RFC 3986 §3 defines 5 main URI components.

        [                ]              [             ][          ]
 https : // posts.tld:443   /info/users  ? name=carlie  # bulletin
╰─────╯    ╰─────────────╯ ╰───────────╯  ╰───────────╯  ╰────────╯
scheme        authority       path            query       fragment

Square brackets group the optional components.

Rules for path

The rules for path are a bit complex. Although path MUST be present, it MAY be empty. (This is a subtle distinction, but it contrasts with authority, query, and fragment, which can be omitted entirely.) If authority is present, path MUST either be empty or start with / (which separates it from the authority). If authority is omitted, path MAY start with / but MUST NOT start with //. So https:/info/users, https:info/users, and https:?name=charlie are all valid. https://info/users is valid too, but not as a URI whose path is //info/users: the leading // introduces an authority, so it parses as authority info and path /users.
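
To see where the authority/path boundary falls, here is a minimal sketch that splits the four example URIs with Python's standard urllib.parse.urlsplit. The helper name components is only illustrative, and urlsplit is a lenient splitter rather than a strict RFC 3986 validator, so treat this as an illustration of the rules rather than a conformance check.

```python
from urllib.parse import urlsplit


def components(uri: str) -> tuple[str, str, str, str]:
    """Return (scheme, authority, path, query) as split by urllib.parse."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path, parts.query


# The four examples from the path rules above.
for uri in (
    "https:/info/users",    # no authority, path starts with "/"
    "https:info/users",     # no authority, rootless path
    "https:?name=charlie",  # no authority, empty path, query present
    "https://info/users",   # "//" introduces an authority
):
    print(f"{uri:22} -> {components(uri)}")
```

In the last case the splitter reports an authority of info and a path of /users, which is why a path can never begin with // when the authority is omitted.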

Example: the file scheme

The file scheme, defined in RFC 8089, only uses scheme, path, and the host part of authority. host can be empty, which is treated as localhost.

       [                ]
 file : // 192.168.1.101   /usr/share/lib
╰────╯    ╰─────────────╯ ╰──────────────╯
scheme       authority         path
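
As a rough sketch of the file scheme's shape, the snippet below pulls the host and path out of a file URI and applies the "empty host means localhost" rule described above. The function name file_host_and_path is hypothetical, and the code does not attempt the rest of RFC 8089 (for example, Windows drive letters).

```python
from urllib.parse import urlsplit


def file_host_and_path(uri: str) -> tuple[str, str]:
    """Split a file URI into (host, path), treating an empty host as localhost."""
    parts = urlsplit(uri)
    if parts.scheme != "file":
        raise ValueError(f"not a file URI: {uri!r}")
    # An empty host is treated as localhost, as described above.
    host = parts.netloc or "localhost"
    return host, parts.path


print(file_host_and_path("file://192.168.1.101/usr/share/lib"))
print(file_host_and_path("file:///usr/share/lib"))
```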

General delimiters and sub-delimiters

RFC 3986 §2.2 splits reserved characters into two sets, gen-delims and sub-delims:

gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
  • General delimiters: Separate components or subcomponents (userinfo, host, port). They must be encoded in components where their reserved meanings would apply. For example, : must be encoded in authority but not in path or query; e.g. https://api.tld/a:b:c is valid.
  • Sub-delimiters: Are reserved within one or more components but are not general delimiters. Whether they must be encoded depends on the component and the scheme.
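
The three character sets can be written down directly. The following sketch (the names GEN_DELIMS, SUB_DELIMS, UNRESERVED, and classify are illustrative, not from any library) classifies a single character according to the ABNF rules quoted above.

```python
import string

# Character sets copied from the gen-delims, sub-delims, and unreserved rules.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(char: str) -> str:
    """Name the RFC 3986 character class that a single character belongs to."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    if char in GEN_DELIMS:
        return "gen-delim"
    if char in SUB_DELIMS:
        return "sub-delim"
    if char in UNRESERVED:
        return "unreserved"
    # Everything else (space, "%", "|", non-ASCII, ...) needs percent-encoding.
    return "other"


for char in ":&~ %":
    print(repr(char), classify(char))
```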

Caution: scheme-specific restrictions

A scheme can restrict URIs in many ways, including whether a sub-delimiter must be encoded. (In fact, §2.2 states that you should encode a reserved character unless the scheme specifically allows it.) The following section was written for HTTPS. It’s valid for some other schemes, but your mileage may vary.

Where reserved delimiters are allowed

The following tables show the components in which a reserved delimiter can be used for its literal meaning without percent-encoding. Note that a literal % must always be encoded (as %25), because % introduces a percent-encoded octet.

component   :   /   ?   #   [   ]   @
scheme      -   -   -   -   -   -   -
authority   ¹   -   -   -   -   -   -
path        ²   -   -   -   -   -   y
query       y   y   y   -   -   -   y
fragment    y   y   y   ³   -   -   y

Table 1. General delimiters (y where valid, - where not)

Footnotes
  • ¹ Technically, : is allowed in userinfo and carries no reserved meaning. However, this is only for compatibility with a username:password syntax, which is deprecated.

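  • ² Literal : is valid in path, except in the first segment of a relative reference that has no scheme. For example, https://google.com/a:b is valid, but the relative reference a:b/c is not, because a would be read as a scheme; write ./a:b/c instead.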

  • ³ Perhaps surprisingly, fragments cannot contain #. This is in contrast to ?, which can occur in query components.

component   !   $   '   (   )   *   &   =   ,   ;   +
authority   y   y   y   y   y   y   y   y   y   y   y
path        y   y   y   y   y   y   y   y   y   y   y
query       y   y   y   y   y   y   ¹   ¹   ²   ²   -
fragment    y   y   y   y   y   y   y   y   y   y   y

Table 2. Sub-delimiters (y where valid)

Footnotes
  • ¹ & and = are typically used for key–value parameters.

  • ² , and ; are delimiters for some OpenAPI query parameter styles:

      • simple and form use ,

      • matrix and cookie use , and ;

(Note that cookie actually uses ; followed by a space. label uses ., which is unreserved; spaceDelimited and pipeDelimited use a space and |, respectively, both of which must be percent-encoded.)

The query component

RFC 3986 buries some important details, including aspects that many implementations handle incorrectly.

From the ABNF

Tracing through the ABNF definitions for query yields this:

query           = *( pchar / "/" / "?" )
pchar           = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved      = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded     = "%" HEXDIG HEXDIG
sub-delims      = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Which is equivalent to:

query  = ALPHA / DIGIT                                                    ; from `unreserved`
       / "-" / "." / "_" / "~"                                            ; also from `unreserved`
       / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="  ; from `sub-delims`
       / ":" / "@"                                                        ; from `pchar`
       / "/" / "?"                                                        ; from `query`
       / "%" HEXDIG HEXDIG                                                ; from `pct-encoded`

And to this regex (using the inline verbose flag, (?x)):

(?x)
[A-Za-z0-9] | [-._~] | [!$&'()*+,;=] | [:@] | [/?]
|
%[A-Fa-f0-9]{2}

And, when no percent-encoding is used, to [-\w.~!$&'()*+,;=:@/?].
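
As a minimal sketch, the verbose regex above can be wrapped in a repetition and used to check a whole query component in Python. The names QUERY_RE and is_valid_query are illustrative; the re.VERBOSE flag plays the role of the inline (?x), and the input is assumed to be the query without its leading ?.

```python
import re

# One allowed query character or one percent-encoded octet, repeated.
QUERY_RE = re.compile(
    r"""
    (?:
        [A-Za-z0-9] | [-._~] | [!$&'()*+,;=] | [:@] | [/?]
      |
        %[A-Fa-f0-9]{2}
    )*
    """,
    re.VERBOSE,
)


def is_valid_query(query: str) -> bool:
    """Check a query component (without the leading '?') against the grammar."""
    return QUERY_RE.fullmatch(query) is not None


for candidate in ("name=charlie&path=/a/b?c", "q=%2F", "q=a b", "q=%zz", "a#b"):
    print(f"{candidate!r:28} {is_valid_query(candidate)}")
```

The first two candidates are accepted; the space, the malformed escape %zz, and the # are rejected.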

Yes, / and ? are allowed

RFC 3986 says this about query strings:

The characters slash (“/”) and question mark (“?”) may represent data within the query component. Beware that some older, erroneous implementations may not handle such data correctly […]

RFC 6920 §3, “Naming Things with Hashes” affirms these conclusions:

[…] percent-encoding is used to distinguish between reserved and unreserved functions of the same character in the same URI component. As an example, an ampersand (‘&’) is used in the query part to separate attribute-value pairs; therefore, an ampersand in a value has to be escaped as ‘%26’. Note that the set of reserved characters differs for each component. As an example, a slash (‘/’) does not have any reserved function in a query part and therefore does not have to be escaped.

There are no HTTP-specific restrictions

Ok, fine. Technically, RFC 3986 also says:

RFC 3986 excludes portions of RFC 1738 that defined the specific syntax of individual URI schemes; those portions will be updated as separate documents.

Fortunately, the HTTP-specific part of RFC 1738 states (where <searchpart> means query):

Within the <path> and <searchpart> components, “/”, “;”, “?” are reserved. The “/” character may be used within HTTP to designate a hierarchical structure.

That means there is no HTTP-specific ban on our sub-delims, either.

query component parameters

By convention, and under some standards, queries often carry more structure than RFC 3986 requires. Passing key–value pairs in URIs is elegant, but the standards are a mess. In its specification of application/x-www-form-urlencoded, WHATWG writes:

The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices.

Note: WHATWG

WHATWG is not in the habit of writing short, precise docs for their specs. To describe a “URL string”, at no point are we shown a formal grammar. Instead, we get 20 pages of “basic URL parser” and “URL serializing” algorithms.

Note that OpenAPI’s query parameter styles reference RFC 6570, not the x-www-form-urlencoded used in HTML5.

Key–value pairs

This ABNF grammar recognizes a query component and captures its key–value parameters. I’ve chosen to disallow empty values.

query    = "?" param *("&" param)
param    = key "=" value
key      = 1*(LITERAL / ESCAPE)
value    = 1*(LITERAL / ESCAPE / "=")
LITERAL  = ALPHA / DIGIT / "-" / "." / "_" / "~"
         / "!" / "$" / "'" / "(" / ")" / "*" / "+" / "," / ";"  ; sub-delims without "&" and "="
         / ":" / "@" / "/" / "?"
ESCAPE   = "%" 2HEXDIG
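
The grammar above translates fairly directly into a small parser. This is a sketch under a few assumptions: the name parse_query is hypothetical, percent-escapes are decoded as UTF-8 with urllib.parse.unquote, and + is left as a literal plus sign rather than being treated as a space (that convention belongs to form-urlencoded, not to this grammar).

```python
import re
from urllib.parse import unquote

# Regex fragments mirroring the LITERAL, ESCAPE, key, and value rules.
_LITERAL = r"[A-Za-z0-9\-._~!$'()*+,;:@/?]"
_ESCAPE = r"%[0-9A-Fa-f]{2}"
_KEY = rf"(?:{_LITERAL}|{_ESCAPE})+"
_VALUE = rf"(?:{_LITERAL}|{_ESCAPE}|=)+"
_PARAM = re.compile(rf"({_KEY})=({_VALUE})")


def parse_query(query: str) -> list[tuple[str, str]]:
    """Parse '?k=v&k2=v2' into decoded (key, value) pairs, rejecting empty values."""
    if not query.startswith("?"):
        raise ValueError("query must start with '?'")
    pairs = []
    # "&" cannot appear inside a key or value, so splitting on it is safe.
    for param in query[1:].split("&"):
        match = _PARAM.fullmatch(param)
        if match is None:
            raise ValueError(f"invalid parameter: {param!r}")
        key, value = match.groups()
        pairs.append((unquote(key), unquote(value)))
    return pairs


print(parse_query("?name=charlie&path=/a/b?x=1"))
print(parse_query("?q=a%26b"))
```

Because key cannot contain = but value can, the first = in each parameter is the separator and any later = stays in the value.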

Implementation considerations

Some implementations encode more than needed

Many urlencode implementations will encode characters that don’t need to be encoded.

This is because the 1994 RFC 1738 for URLs, which RFC 3986 updates, had this language:

Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and reserved characters used for their reserved purposes may be used unencoded within a URL.

RFC 3986 instead says:

If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.

Smart URI urlencode implementations could be introduced under new names such as uriencode without breaking backwards compatibility. Maintainers, get on this.
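
A rough idea of what such a function could look like for query parameter values, built on Python's urllib.parse.quote. The name uriencode_query_value and the constant QUERY_VALUE_SAFE are hypothetical. The safe set follows the Highlights list at the top of this page, so + and * are conservatively encoded, and & is always encoded because it separates parameters.

```python
from urllib.parse import quote, quote_plus

# Characters from the Highlights list that can stay literal in a query value.
# quote() never encodes letters, digits, "-", "_", ".", or "~".
QUERY_VALUE_SAFE = "!$'():@/,;?="


def uriencode_query_value(value: str) -> str:
    """Percent-encode only what a query parameter value actually requires."""
    return quote(value, safe=QUERY_VALUE_SAFE)


value = "2024/05/01 10:30?tz=UTC&x"
print(quote_plus(value))             # conventional urlencode-style output
print(uriencode_query_value(value))  # leaves "/", ":", "?", and "=" alone
```

For this input, quote_plus produces 2024%2F05%2F01+10%3A30%3Ftz%3DUTC%26x, while the sketch produces 2024/05/01%2010:30?tz=UTC%26x.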

Normalization of full URIs

As @mgiuca points out, full URIs cannot be reliably normalized (à la urlencode) because they cannot be partitioned into their components unambiguously.

Examples:

  • https://api.tld/redirect?uri=https://boeing.fly/news?page2&nav=yes Is the redirect to https://boeing.fly/news?page2&nav=yes or to https://boeing.fly/news?page2, with nav=yes as a separate parameter of the outer URI?
  • https://api.tld/redirect?uri=https://boeing.fly/news#ex Is the redirect to https://boeing.fly/news#ex or to https://boeing.fly/news?

Instead, normalize each URI component separately.
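
As a sketch of the component-wise approach, the code below encodes the inner URI as a parameter value before the outer URI is assembled, which removes the ambiguity in both examples. The function names build_redirect and read_redirect and the uri parameter name are assumptions made for illustration.

```python
from urllib.parse import quote, unquote, urlsplit

# Characters that may stay literal in a query value; "&" and "#" are not among them.
VALUE_SAFE = "!$'():@/,;?="


def build_redirect(base: str, target: str) -> str:
    """Embed target as the uri parameter of base, encoding only the value."""
    return f"{base}?uri={quote(target, safe=VALUE_SAFE)}"


def read_redirect(uri: str) -> str:
    """Recover the decoded uri parameter from a URI built by build_redirect."""
    for param in urlsplit(uri).query.split("&"):
        key, _, value = param.partition("=")
        if key == "uri":
            return unquote(value)
    raise KeyError("uri")


for target in ("https://boeing.fly/news?page2&nav=yes", "https://boeing.fly/news#ex"):
    outer = build_redirect("https://api.tld/redirect", target)
    print(outer, "->", read_redirect(outer))
```

The inner & becomes %26 and the inner # becomes %23, so neither can be mistaken for a delimiter of the outer URI.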

Functions to avoid

Avoid these library functions, which return incorrect results for some URIs:

OpenAPI

You should set allowReserved: true for OpenAPI parameter objects. There is no reason not to. As described earlier, also be aware that style controls whether additional characters must be encoded.

Full ABNF from RFC 3986

The following are the ABNF lines, copied in order, from RFC 3986. Only minor formatting changes were made (specifically line breaks, indentation, and comments).

; ---------------------------  characters  ---------------------------------------------------------

pct-encoded     = "%" HEXDIG HEXDIG
reserved        = gen-delims / sub-delims
gen-delims      = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims      = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
unreserved      = ALPHA / DIGIT / "-" / "." / "_" / "~"

; ---------------------------  URI structure  ------------------------------------------------------

URI             = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

hier-part       = "//" authority path-abempty
                / path-absolute
                / path-rootless
                / path-empty

scheme          = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

authority       = [ userinfo "@" ] host [ ":" port ]

userinfo        = *( unreserved / pct-encoded / sub-delims / ":" )

host            = IP-literal / IPv4address / reg-name

IP-literal      = "[" ( IPv6address / IPvFuture  ) "]"
IPvFuture       = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
; IPv6address and IPv4address are defined in the code block below.
reg-name        = *( unreserved / pct-encoded / sub-delims )

port            = *DIGIT

; ---------------------------  path  ---------------------------------------------------------------

path            = path-abempty    ; begins with "/" or is empty
                / path-absolute   ; begins with "/" but not "//"
                / path-noscheme   ; begins with a non-colon segment
                / path-rootless   ; begins with a segment
                / path-empty      ; zero characters

path-abempty    = *( "/" segment )
path-absolute   = "/" [ segment-nz *( "/" segment ) ]
path-noscheme   = segment-nz-nc *( "/" segment )
path-rootless   = segment-nz *( "/" segment )
path-empty      = 0<pchar>

segment         = *pchar
segment-nz      = 1*pchar
segment-nz-nc   = 1*( unreserved / pct-encoded / sub-delims / "@" )
                ; non-zero-length segment without any colon ":"

pchar           = unreserved / pct-encoded / sub-delims / ":" / "@"

; ---------------------------  query and fragment  -------------------------------------------------

query           = *( pchar / "/" / "?" )

fragment        = *( pchar / "/" / "?" )

; ---------------------------  reference and absolute URI  ------------------------------------------

URI-reference   = URI / relative-ref
relative-ref    = relative-part [ "?" query ] [ "#" fragment ]
relative-part   = "//" authority path-abempty
                / path-absolute
                / path-noscheme
                / path-empty

absolute-URI    = scheme ":" hier-part [ "?" query ]

IP address grammars:

; ---------------------------  IPv6 address  ------------------------------------------------------

IPv6address     =                            6( h16 ":" ) ls32
                /                       "::" 5( h16 ":" ) ls32
                / [               h16 ] "::" 4( h16 ":" ) ls32
                / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                / [ *4( h16 ":" ) h16 ] "::"              ls32
                / [ *5( h16 ":" ) h16 ] "::"              h16
                / [ *6( h16 ":" ) h16 ] "::"

ls32            = ( h16 ":" h16 ) / IPv4address
                ; least-significant 32 bits of address
h16             = 1*4HEXDIG
                ; 16 bits of address represented in hexadecimal

; ---------------------------  IPv4 address  -------------------------------------------------------

IPv4address     = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet       = DIGIT                 ; 0-9
                / %x31-39 DIGIT         ; 10-99
                / "1" 2DIGIT            ; 100-199
                / "2" %x30-34 DIGIT     ; 200-249
                / "25" %x30-35          ; 250-255
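
The dec-octet rule maps cleanly onto a regular expression. The following sketch (DEC_OCTET, IPV4_RE, and is_ipv4address are illustrative names) checks a string against the IPv4address rule above; like the ABNF, it rejects octets with leading zeros.

```python
import re

# One alternative per line of the dec-octet rule, longest forms first.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_ipv4address(text: str) -> bool:
    """Check text against the IPv4address rule (four dot-separated dec-octets)."""
    return IPV4_RE.fullmatch(text) is not None


for candidate in ("192.168.1.101", "255.255.255.255", "256.1.1.1", "01.2.3.4", "1.2.3"):
    print(f"{candidate:16} {is_ipv4address(candidate)}")
```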