URI-safe characters
This page details exactly which characters must be percent-encoded in URIs.
Highlights
Don’t percent-encode these characters in query parameter values: -, _, ., ~, !, $, ', (, ), :, @, /, ,, ;, ?, and =.
To parse query parameters, follow this table:
| to match | use regex |
|---|---|
| full query URI component | \?[-\w.~!$'()*+:@/?;,&=]*+ |
| form-urlencoded key | [-\w.~!$'()*+:@/,;?]++ |
| form-urlencoded value | [-\w.~!$'()*+:@/;,?=]*+ |
| OpenAPI simple-encoded value | [-\w.~!$'()*+:@/;?=]*+ |
| OpenAPI form-encoded value | same as simple-encoded |
| OpenAPI matrix-encoded value | [-\w.~!$'()*+:@/?=]*+ |
The basics
Relevant specifications
- RFC 1738, “Uniform Resource Locators (URL)” (Proposed Standard, obsolete)
- RFC 3986, “Uniform Resource Identifier (URI): Generic Syntax” (Internet Standard)
- RFC 6920, “Naming Things with Hashes” (Proposed Standard)
- RFC 6570, “URI Template” (Proposed Standard)
- IANA-recognized URI schemes
RFC 3986 is the main specification. There is a full grammar in RFC 3986 Appendix A, “Collected ABNF for URI”.
URI Components
RFC 3986 §3 defines 5 main URI components.
          [              ]                [           ]  [        ]
https  :  // posts.tld:443   /info/users  ? name=carlie  # bulletin
╰───╯     ╰──────────────╯   ╰─────────╯  ╰───────────╯  ╰────────╯
scheme    authority          path         query          fragment
Brackets group the optional components.
Rules for path
The rules for path are a bit complex. Although it MUST be present, it MAY be empty. (This is a subtle distinction, but it contrasts with authority, query, and fragment, which can be omitted.) If authority is present, path MUST either be empty OR start with / (separating it from the authority). If authority is omitted, path MAY start with / but MUST NOT start with //. So https:/info/users, https:info/users, and https:?name=charlie are all valid URIs without an authority. In contrast, https://info/users cannot be read as an omitted authority followed by the path //info/users; it parses with info as the authority and /users as the path.
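To make the path rules concrete, here is a small sketch that splits the example URIs with Python's standard `urllib.parse.urlsplit`. The helper name `split_components` is made up for this illustration. Note that `urlsplit` reports a missing authority as an empty string, so it cannot distinguish an omitted authority from an empty one.

```python
from urllib.parse import urlsplit


def split_components(uri: str) -> tuple[str, str, str, str]:
    """Return (scheme, authority, path, query) as urlsplit sees them."""
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path, parts.query


for uri in (
    "https:/info/users",    # no authority, absolute path
    "https:info/users",     # no authority, rootless path
    "https:?name=charlie",  # no authority, empty path
    "https://info/users",   # "info" is the authority here
):
    scheme, authority, path, query = split_components(uri)
    print(f"{uri!r}: authority={authority!r} path={path!r} query={query!r}")
```

The last line of output shows why a path beginning with `//` cannot appear without an authority: the parser has no way to tell the two readings apart.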
Example: the file scheme
The file scheme, defined in RFC 8089, only uses scheme, path, and the host part of authority. host can be empty, which is treated as localhost.
General delimiters and sub-delimiters
RFC 3986 §2.2 splits reserved characters into two sets, gen-delims and sub-delims:
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
- General delimiters: Separate components or subcomponents (userinfo, host, port). They must be encoded in components where their reserved meanings would apply. For example, : must be encoded in authority but not in path or query; e.g. https://api.tld/a:b:c is valid.
- Sub-delimiters: Are reserved within one or more components but are not general delimiters. Whether they must be encoded depends on the component and the scheme.
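As a quick illustration of the three character sets, the following sketch classifies a single character according to the ABNF above. The function name `classify` and its return labels are chosen for this example only.

```python
import string

# Character sets copied from RFC 3986 section 2.2 and 2.3.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(char: str) -> str:
    """Label one character as gen-delim, sub-delim, unreserved, or other."""
    if char in GEN_DELIMS:
        return "gen-delim"
    if char in SUB_DELIMS:
        return "sub-delim"
    if char in UNRESERVED:
        return "unreserved"
    # Everything else (space, "%", "|", non-ASCII, ...) needs percent-encoding.
    return "other"


for char in ":&~ %":
    print(repr(char), classify(char))
```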
Caution: scheme-specific restrictions
A scheme can restrict URIs in many ways, including whether a sub-delimiter must be encoded. (In fact, §2.2 states that you should encode a reserved character unless the scheme specifically allows it.) The following section was written for HTTPS. It’s valid for some other schemes, but your mileage may vary.
Where reserved delimiters are allowed
The following tables show the components in which a reserved delimiter can be used for its literal meaning without percent-encoding. Note that % must also be encoded.
| component | : | / | ? | # | [ | ] | @ |
|---|---|---|---|---|---|---|---|
| scheme | | | | | | | |
| authority | ¹ | | | | | | |
| path | y² | | | | | | y |
| query | y | y | y | | | | y |
| fragment | y | y | y | ³ | | | y |
Table 1. General delimiters (y where valid)
Footnotes
- ¹ Technically, : is allowed in userinfo and carries no reserved meaning. However, this is only for compatibility with a username:password syntax, which is deprecated.
- ² Literal : is valid in path, except in the first segment of a URI reference without a scheme, where it would be mistaken for the scheme delimiter (e.g. https://google.com/: is valid, but the relative reference this:that is not; RFC 3986 §4.2 suggests ./this:that instead).
- ³ Perhaps surprisingly, fragments cannot contain #. This is in contrast to ?, which can occur in query components.
| component | ! | $ | ' | ( | ) | * | & | = | , | ; | + |
|---|---|---|---|---|---|---|---|---|---|---|---|
| authority | y | y | y | y | y | y | y | y | y | y | y |
| path | y | y | y | y | y | y | y | y | y | y | y |
| query | y | y | y | y | y | y | y¹ | y¹ | y² | y² | y³ |
| fragment | y | y | y | y | y | y | y | y | y | y | y |
Table 2. Sub-delimiters (y where valid)
Footnotes
- ¹ & and = are typically used for key–value parameters.
- ² , and ; are delimiters for some OpenAPI query parameter styles:
  - simple and form use ,
  - matrix and cookie use , and ;

  (Note that cookie actually uses ;. label uses ., which is unreserved; spaceDelimited and pipeDelimited use a space and |, which must be percent-encoded.)
- ³ + is used in place of a space in x-www-form-urlencoded, as used in HTML forms.
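The two tables can be turned into per-component "safe" sets for an encoder. The sketch below uses Python's `urllib.parse.quote`, which always leaves ASCII letters, digits, and `-._~` alone and percent-encodes everything else that is not listed in `safe`. The constant names are assumptions made for this example, and the sets ignore any scheme-specific or convention-specific restrictions (such as `&` and `=` in key–value queries).

```python
from urllib.parse import quote

SUB_DELIMS = "!$&'()*+,;="

# Characters that may appear literally, per Tables 1 and 2.
PATH_SEGMENT_SAFE = SUB_DELIMS + ":@"  # "/" would split the segment
QUERY_SAFE = SUB_DELIMS + ":@/?"
FRAGMENT_SAFE = QUERY_SAFE             # same character set as query

print(quote("a/b c", safe=PATH_SEGMENT_SAFE))  # "/" and " " are encoded
print(quote("a/b?c=d#e", safe=QUERY_SAFE))     # only "#" is encoded
print(quote("50%", safe=QUERY_SAFE))           # "%" itself is encoded
```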
The query component
RFC 3986 buries some important details, including aspects that many implementations handle incorrectly.
From the ABNF
Tracing through the ABNF definitions for query yields this:
query = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
Which is equivalent to:
query = *( ALPHA / DIGIT                                                     ; from `unreserved`
         / "-" / "." / "_" / "~"                                             ; also from `unreserved`
         / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="   ; from `sub-delims`
         / ":" / "@"                                                         ; from `pchar`
         / "/" / "?"                                                         ; from `query`
         / "%" HEXDIG HEXDIG )                                               ; from `pct-encoded`
And to this regex (using the inline verbose flag, (?x)):
And, when no percent-encoding is used, to [-\w.~!$&'()*+,;=:@/?].
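One way to write the expanded ABNF as a verbose regex is sketched below in Python. It is a direct transcription of the grammar rather than a tuned pattern, and the names `QUERY_RE` and `is_query` are made up for this example. Explicit ASCII ranges are used instead of `\w`, because `\w` in Python also matches non-ASCII letters unless `re.ASCII` is set.

```python
import re

QUERY_RE = re.compile(
    r"""(?x)
    (?:
        [A-Za-z0-9]        # ALPHA / DIGIT, from unreserved
      | [-._~]             # also from unreserved
      | [!$&'()*+,;=]      # from sub-delims
      | [:@]               # from pchar
      | [/?]               # from query
      | %[0-9A-Fa-f]{2}    # from pct-encoded
    )*
    """
)


def is_query(text: str) -> bool:
    """True if text (without the leading "?") is a valid query component."""
    return QUERY_RE.fullmatch(text) is not None


print(is_query("name=charlie&path=/a/b?x"))  # slashes and "?" are fine
print(is_query("a#b"))                        # "#" would start the fragment
```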
Yes, / and ? are allowed
RFC 3986 says this about query strings:
The characters slash (“/”) and question mark (“?”) may represent data within the query component. Beware that some older, erroneous implementations may not handle such data correctly […]
RFC 6920 §3, “Naming Things with Hashes” affirms these conclusions:
[…] percent-encoding is used to distinguish between reserved and unreserved functions of the same character in the same URI component. As an example, an ampersand (‘&’) is used in the query part to separate attribute-value pairs; therefore, an ampersand in a value has to be escaped as ‘%26’. Note that the set of reserved characters differs for each component. As an example, a slash (‘/’) does not have any reserved function in a query part and therefore does not have to be escaped.
There are no HTTP-specific restrictions
Ok, fine. Technically, RFC 3986 also says:
RFC 3986 excludes portions of RFC 1738 that defined the specific syntax of individual URI schemes; those portions will be updated as separate documents.
Fortunately, the HTTP-specific part of RFC 1738 states (where <searchpart> means query):
Within the <path> and <searchpart> components, “/”, “;”, “?” are reserved. The “/” character may be used within HTTP to designate a hierarchical structure.
That means there is no HTTP-specific ban on our sub-delims, either.
query component parameters
Queries often follow more structure by convention and some standards. Passing key–value pairs in URIs is elegant, but the standards are a mess. In their document for x-www-form-urlencoded, WHATWG writes:
The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices.
Note: WHATWG
WHATWG is not in the habit of writing short, precise docs for their specs. To describe a “URL string”, at no point are we shown a formal grammar. Instead, we get 20 pages of “basic URL parser” and “URL serializing” algorithms.
Note that OpenAPI’s query parameter styles reference RFC 6570, not the x-www-form-urlencoded used in HTML5.
Key–value pairs
This ABNF grammar recognizes a query component and captures its key–value parameters. I’ve chosen to disallow empty values.
query   = "?" param *( "&" param )
param   = key "=" value
key     = 1*( LITERAL / ESCAPE )
value   = 1*( LITERAL / ESCAPE / "=" )
LITERAL = ALPHA / DIGIT / "-" / "." / "_" / "~"
        / "!" / "$" / "'" / "(" / ")" / "*" / "+" / "," / ";"   ; removed "&" and "="
        / ":" / "@" / "/" / "?"
ESCAPE  = "%" 2HEXDIG
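The grammar above can be sketched as a small Python parser. The function name `parse_query` is hypothetical, and the choice to raise `ValueError` on any parameter that does not match is an assumption of this sketch. Percent-escapes are decoded with `urllib.parse.unquote`, which leaves `+` as a literal plus sign rather than treating it as a space.

```python
import re
from urllib.parse import unquote

LITERAL = r"[-A-Za-z0-9._~!$'()*+,;:@/?]"  # everything except "&" and "="
ESCAPE = r"%[0-9A-Fa-f]{2}"
KEY = rf"(?:{LITERAL}|{ESCAPE})+"
VALUE = rf"(?:{LITERAL}|{ESCAPE}|=)+"     # values may contain "="
PARAM_RE = re.compile(rf"({KEY})=({VALUE})")


def parse_query(component: str) -> list[tuple[str, str]]:
    """Parse "?k=v&k2=v2" into decoded (key, value) pairs."""
    if not component.startswith("?"):
        raise ValueError("query component must start with '?'")
    pairs = []
    for param in component[1:].split("&"):
        match = PARAM_RE.fullmatch(param)
        if match is None:
            raise ValueError(f"invalid parameter: {param!r}")
        pairs.append((unquote(match.group(1)), unquote(match.group(2))))
    return pairs


print(parse_query("?name=charlie&q=a%26b=c"))
```

Splitting on `&` before matching is safe here because a literal `&` can never appear inside a key or value; it has to be written as `%26`.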
Implementation considerations
Some implementations encode more than needed
Many urlencode implementations will encode characters that don’t need to be encoded.
This is because the 1994 RFC 1738 for URLs, which RFC 3986 obsoletes, had this language:
Thus, only alphanumerics, the special characters “$-_.+!*'(),”, and reserved characters used for their reserved purposes may be used unencoded within a URL.
RFC 3986 instead says:
If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
Smart URI urlencode implementations could be introduced under new names such as uriencode without breaking backwards compatibility. Maintainers, get on this.
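To illustrate the difference, the sketch below compares Python's standard `urllib.parse.urlencode` with a hypothetical `uriencode` that only escapes what would conflict inside a key–value query. The safe sets are assumptions derived from the grammar in this document: `&` is always escaped, `=` is escaped in keys only, and `+` is escaped so that form decoders do not read it as a space.

```python
from urllib.parse import quote, urlencode

KEY_SAFE = "!$'()*,;:@/?"
VALUE_SAFE = KEY_SAFE + "="  # "=" is harmless after the first one


def uriencode(params: dict[str, str]) -> str:
    """Hypothetical minimal encoder for key-value query parameters."""
    return "&".join(
        f"{quote(key, safe=KEY_SAFE)}={quote(value, safe=VALUE_SAFE)}"
        for key, value in params.items()
    )


params = {"next": "/users/42?tab=a:b", "q": "rock & roll"}
print(urlencode(params))  # escapes "/", "?", "=", and ":" as well
print(uriencode(params))  # escapes only the spaces and the "&"
```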
Normalization of full URIs
As @mgiuca points out, full URIs cannot be normalized (à la urlencode) reliably because they cannot be partitioned into their components unambiguously.
Examples:
- https://api.tld/redirect?uri=https://boeing.fly/news?page2&nav=yes
  Is the redirect to https://boeing.fly/news?page2&nav=yes or to https://boeing.fly/news?page2?
- https://api.tld/redirect?uri=https://boeing.fly/news#ex
  Is the redirect to https://boeing.fly/news#ex or to https://boeing.fly/news?
Instead, normalize each URI component separately.
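The ambiguity, and the component-wise fix, can be sketched with Python's `urllib.parse`. The helper names `redirect_target` and `build_redirect` are invented for this example, and the safe set is an assumption that escapes only the characters that would conflict inside the query (`&`, `#`, `+`, and anything outside the query character set).

```python
from urllib.parse import parse_qsl, quote, urlsplit

VALUE_SAFE = "!$'()*,;:@/?="


def redirect_target(uri: str) -> str:
    """Return the decoded value of the "uri" query parameter."""
    return dict(parse_qsl(urlsplit(uri).query))["uri"]


def build_redirect(target: str) -> str:
    """Encode the target as a query value before assembling the URI."""
    return "https://api.tld/redirect?uri=" + quote(target, safe=VALUE_SAFE)


# Assembled without encoding: "&nav=yes" is read as a second parameter.
ambiguous = "https://api.tld/redirect?uri=https://boeing.fly/news?page2&nav=yes"
print(redirect_target(ambiguous))

# Encoded as a component first: the full target survives the round trip.
explicit = build_redirect("https://boeing.fly/news?page2&nav=yes")
print(explicit)
print(redirect_target(explicit))
```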
Functions to avoid
Avoid these library functions, which return incorrect results for some URIs:
OpenAPI
You should set allowReserved: true for OpenAPI parameter objects. There is no reason not to. As described earlier, also be aware that style controls whether additional characters must be encoded.
Full ABNF from RFC 3986
The following are the ABNF lines, copied in order, from RFC 3986. Only minor formatting changes were made (specifically line breaks, indentation, and comments).
; --------------------------- characters ---------------------------------------------------------
pct-encoded = "%" HEXDIG HEXDIG
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
; --------------------------- URI structure ------------------------------------------------------
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part = "//" authority path-abempty
/ path-absolute
/ path-rootless
/ path-empty
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
authority = [ userinfo "@" ] host [ ":" port ]
userinfo = *( unreserved / pct-encoded / sub-delims / ":" )
host = IP-literal / IPv4address / reg-name
IP-literal = "[" ( IPv6address / IPvFuture ) "]"
IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )
; IPv6address and IPv4address are defined in the code block below.
reg-name = *( unreserved / pct-encoded / sub-delims )
port = *DIGIT
; --------------------------- path ---------------------------------------------------------------
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
; --------------------------- query and fragment -------------------------------------------------
query = *( pchar / "/" / "?" )
fragment = *( pchar / "/" / "?" )
; --------------------------- reference and absolute URI ------------------------------------------
URI-reference = URI / relative-ref
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
absolute-URI = scheme ":" hier-part [ "?" query ]
IP address grammars:
; --------------------------- IPv6 address -------------------------------------------------------
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
ls32 = ( h16 ":" h16 ) / IPv4address
; least-significant 32 bits of address
h16 = 1*4HEXDIG
; 16 bits of address represented in hexadecimal
; --------------------------- IPv4 address -------------------------------------------------------
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
dec-octet = DIGIT ; 0-9
/ %x31-39 DIGIT ; 10-99
/ "1" 2DIGIT ; 100-199
/ "2" %x30-34 DIGIT ; 200-249
/ "25" %x30-35 ; 250-255
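As a worked example of reading this ABNF, the `dec-octet` and `IPv4address` rules translate almost line by line into a regex. The Python sketch below is one such translation; the names `DEC_OCTET`, `IPV4_RE`, and `is_ipv4address` are chosen for this illustration. The alternatives are listed longest first so that each octet is matched in full.

```python
import re

DEC_OCTET = (
    r"(?:25[0-5]"      # 250-255
    r"|2[0-4][0-9]"    # 200-249
    r"|1[0-9]{2}"      # 100-199
    r"|[1-9][0-9]"     # 10-99
    r"|[0-9])"         # 0-9
)
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_ipv4address(text: str) -> bool:
    """True if text matches the IPv4address rule of RFC 3986."""
    return IPV4_RE.fullmatch(text) is not None


print(is_ipv4address("192.168.0.1"))
print(is_ipv4address("256.0.0.1"))  # 256 is out of range
print(is_ipv4address("01.2.3.4"))   # leading zeros are not allowed
```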