URI-safe characters¶
This page details exactly which characters must be percent-encoded in URIs.
The basics¶
Relevant specifications¶
- RFC 1738, “Uniform Resource Locators (URL)” (RFC Proposed Standard, obsolete)
- RFC 3986, “Uniform Resource Identifier (URI): Generic Syntax” (RFC Standard)
- RFC 6920, “Naming Things with Hashes” (RFC Proposed Standard)
- IANA-recognized URI schemes
RFC 3986 is the main specification. There is a full grammar in RFC 3986 Appendix A, “Collected ABNF for URI”.
URI Components¶
RFC 3986 defines 5 main URI components in §3.
        [              ]             [           ] [        ]
https : // posts.tld:443 /info/users ? name=carlie # bulletin
╰─────╯ ╰──────────────╯ ╰─────────╯ ╰───────────╯ ╰────────╯
scheme  authority        path        query         fragment
Square brackets group the optional components.
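As a quick check of this breakdown, Python's standard `urllib.parse.urlsplit` splits a URI into the same five components. This is only a sketch for illustration; note that the library names the authority `netloc`.

```python
from urllib.parse import urlsplit

# Split the example URI from the diagram into its five components.
parts = urlsplit("https://posts.tld:443/info/users?name=carlie#bulletin")

print(parts.scheme)    # https
print(parts.netloc)    # posts.tld:443  (the authority)
print(parts.path)      # /info/users
print(parts.query)     # name=carlie
print(parts.fragment)  # bulletin
```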
Rules for path
The rules for path are a bit complex. Although it MUST be present, it MAY be empty. (This is a subtle distinction, but it contrasts with authority, query, and fragment, which can be omitted.) If authority is present, path MUST either be empty OR start with `/` (separating it from the authority). If authority is omitted, path MAY start with `/` but MUST NOT start with `//`. So, `https:/info/users`, `https:info/users`, and `https:?name=charlie` are valid URIs without an authority, but `https://info/users` is not one: the `//` introduces an authority, so it parses as authority `info` and path `/users`.
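The effect of a leading `//` can be seen with Python's `urllib.parse.urlsplit`. This is a sketch only; `authority_and_path` is a hypothetical helper, and `urlsplit` reports an absent authority as an empty string.

```python
from urllib.parse import urlsplit


def authority_and_path(uri: str) -> tuple[str, str]:
    """Return (authority, path) as urlsplit sees them; '' means absent or empty."""
    parts = urlsplit(uri)
    return parts.netloc, parts.path


print(authority_and_path("https:/info/users"))    # ('', '/info/users')
print(authority_and_path("https:info/users"))     # ('', 'info/users')
print(authority_and_path("https:?name=charlie"))  # ('', '')
print(authority_and_path("https://info/users"))   # ('info', '/users')
```

In the last case the text after `//` is consumed as the authority, so the path can never begin with `//` when the authority is omitted.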
Example – the `file` scheme
The `file` scheme, defined in RFC 8089, only uses scheme, path, and the host part of authority. host can be empty, which is treated as `localhost`.
General delimiters and sub-delimiters¶
§2.2 of RFC 3986 splits reserved characters into two sets, `gen-delims` and `sub-delims`:
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
General delimiters
: Separate components or subcomponents (userinfo, host, port). They must be encoded in components where their reserved meanings would apply. For example, `:` must be encoded in authority but not in path or query; e.g. `https://api.tld/a:b:c` is valid.
Sub-delimiters
: Are reserved within one or more components but are not general delimiters. Whether they must be encoded depends on the component and the scheme.
Important
A scheme can restrict URIs in many ways, including whether a sub-delimiter must be encoded. (In fact, §2.2 states that you should encode a reserved character unless the scheme specifically allows it.) The following section was written for HTTPS. It’s valid for some other schemes, but your mileage may vary.
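The three character sets from §2.2 and §2.3 translate directly into code. The following is a minimal sketch; the names `GEN_DELIMS`, `SUB_DELIMS`, `UNRESERVED`, and `classify` are illustrative and not from any library.

```python
import string

# Character classes from RFC 3986 §2.2 and §2.3.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Name the RFC 3986 class that a single character belongs to."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in GEN_DELIMS:
        return "gen-delims"
    if ch in SUB_DELIMS:
        return "sub-delims"
    # Anything else (space, '%', non-ASCII, ...) is in none of the three sets.
    return "other"


for ch in ":&~ ":
    print(repr(ch), classify(ch))
```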
Where reserved delimiters are allowed¶
The following tables show the components in which a reserved delimiter can be used for its literal meaning without percent-encoding. (Note that `%` must also be encoded.)
Footnotes
- ¹ Technically, `:` is allowed in userinfo and carries no reserved meaning. However, this is only for compatibility with a `username:password` syntax, which is deprecated.
- ² Literal `:` is valid in path, except in the first segment of a relative reference without a scheme (e.g. `https://google.com/:` is valid, but the relative reference `this:that` is not, because `this` would be read as a scheme).
- ³ Perhaps surprisingly, fragments cannot contain `#`.
- ⁴ `&` and `=` are typically used for key–value parameters. Sub-delimiters `,` and `;` (also `.` and `|`) are used as delimiters for some OpenAPI query parameter styles:
  - `simple` and `form` use `,`
  - `label` uses `.` and `,`
  - `matrix` uses `,` and `;`
  - `pipeDelimited` uses `|`
More on query strings¶
Despite having a formal grammar (ABNF), RFC 3986 buries some important details in text. That includes some notes on syntax that many implementations handle incorrectly. The following sections go through some of that for the query component.
`/` and `?` are allowed¶
RFC 3986 says this about query strings:
The characters slash (“/”) and question mark (“?”) may represent data within the query component. Beware that some older, erroneous implementations may not handle such data correctly […]
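So, the full set of allowed characters in URI queries is whatever the `query` rule in RFC 3986 Appendix A permits: any `pchar` (unreserved characters, percent-encoded octets, sub-delims, `:`, and `@`), plus `/` and `?`.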
RFC 6920 §3, “Naming Things with Hashes” affirms these conclusions:
[…] percent-encoding is used to distinguish between reserved and unreserved functions of the same character in the same URI component. As an example, an ampersand (‘&’) is used in the query part to separate attribute-value pairs; therefore, an ampersand in a value has to be escaped as ‘%26’. Note that the set of reserved characters differs for each component. As an example, a slash (‘/’) does not have any reserved function in a query part and therefore does not have to be escaped.
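The distinction in that passage can be sketched with Python's `urllib.parse.quote`, whose `safe` parameter lists characters to leave unescaped. The helper name `encode_query_value` is hypothetical, and the choice of safe characters is an assumption for this example rather than a recommendation.

```python
from urllib.parse import quote


def encode_query_value(value: str) -> str:
    """Percent-encode a value for use inside a key=value query parameter."""
    # '/' and '?' carry no reserved meaning inside a query, so keep them as-is.
    # '&' and '=' would be read as delimiters, so quote() escapes them.
    return quote(value, safe="/?")


print(encode_query_value("docs/R&D?draft"))  # docs/R%26D?draft
```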
No scheme-specific restrictions¶
Ok, fine. Technically, RFC 3986 also says:
RFC 3986 excludes portions of RFC 1738 that defined the specific syntax of individual URI schemes; those portions will be updated as separate documents.
Fortunately, the HTTP-specific part of RFC 1738 states (where <searchpart> means query):
Within the <path> and <searchpart> components, “/”, “;”, “?” are reserved. The “/” character may be used within HTTP to designate a hierarchical structure.
That means there is no HTTP-specific ban on our `sub-delims`, either.
Reaffirming: characters allowed in query¶
The characters `&`, `-`, `~`, `.`, `_`, `=`, `?`, `/`, `!`, `$`, `+`, `'`, `(`, `)`, `*`, `;`, and `,` do not require percent-encoding inside a URI query, according to its specification, RFC 3986.
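Of the reserved characters, only `#`, `[`, and `]` must always be percent-encoded in a query, along with any literal `%`. The grammar in RFC 3986 Appendix A also permits `:` and `@` there, because both are part of `pchar`. (Note: `?` begins a query but can also be used inside the query or fragment, unescaped.)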
This regex matches exactly the set of valid query strings (spaces added for clarity):
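As a sketch of such a regex in Python, the pattern below is derived directly from the `query` rule in RFC 3986 Appendix A (`query = *( pchar / "/" / "?" )`). The names `QUERY_RE` and `is_valid_query` are illustrative, and verbose mode stands in for the added spaces.

```python
import re

# query = *( pchar / "/" / "?" ), per RFC 3986 Appendix A.
QUERY_RE = re.compile(
    r"""
    (?: [A-Za-z0-9\-._~]     # unreserved
      | [!$&'()*+,;=]        # sub-delims
      | [:@/?]               # remaining characters allowed by pchar and query
      | %[0-9A-Fa-f]{2}      # pct-encoded octet
    )*
    """,
    re.VERBOSE,
)


def is_valid_query(query: str) -> bool:
    """Check a query string (without the leading '?') against the grammar."""
    return QUERY_RE.fullmatch(query) is not None


print(is_valid_query("name=carlie&next=/a?b"))  # True
print(is_valid_query("tags[]=x"))               # False: '[' and ']' need encoding
print(is_valid_query("rate=50%"))               # False: bare '%'
```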
Parameters and key–value pairs¶
Queries often follow more structure, by convention and some standards. The idea of passing key–value pairs in URIs is elegant, but the history and resulting standards are a mess. In particular, WHATWG defines `application/x-www-form-urlencoded`, stating:
The application/x-www-form-urlencoded format is in many ways an aberrant monstrosity, the result of many years of implementation accidents and compromises leading to a set of requirements necessary for interoperability, but in no way representing good design practices.
The following grammars are simple and may be helpful.
Aside
WHATWG does not seem to be in the habit of writing short, precise docs for their specs. To describe a “URL string”, at no point are we shown a formal grammar. Instead, we get 20 pages of “basic URL parser” and “URL serializing” algorithms.
Parameter lists¶
This grammar recognizes a query component and captures its `&`-delimited parameters. Note that empty parameters are not allowed.
Key–value pairs¶
We can further restrict to `key=value` pairs. We’ll allow `=` inside a `value`.
query = param ('&' param)*
param = key '=' value
key   = (LIT | ESC)+
value = (LIT | ESC | '=')+
LIT   = [^&=#\[\]%]
ESC   = %[0-9A-Fa-f]{2}
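A key–value grammar of this shape can be sketched with Python regular expressions. The version below makes two assumptions beyond the text: `LIT` also excludes `&` (so that parameters can be split apart), and `ESC` accepts only hexadecimal digits. `parse_pairs` is a hypothetical helper, not a library function.

```python
import re
from urllib.parse import unquote

LIT = r"[^&=#\[\]%]"        # one literal character
ESC = r"%[0-9A-Fa-f]{2}"    # one percent-encoded octet
KEY = rf"(?:{LIT}|{ESC})+"
VALUE = rf"(?:{LIT}|{ESC}|=)+"
PARAM_RE = re.compile(rf"({KEY})=({VALUE})")


def parse_pairs(query: str) -> list[tuple[str, str]]:
    """Split a query into decoded (key, value) pairs, rejecting empty parameters."""
    pairs = []
    for param in query.split("&"):
        match = PARAM_RE.fullmatch(param)
        if match is None:
            raise ValueError(f"not a key=value parameter: {param!r}")
        pairs.append((unquote(match.group(1)), unquote(match.group(2))))
    return pairs


print(parse_pairs("a=1&b=x=y&c=%26"))  # [('a', '1'), ('b', 'x=y'), ('c', '&')]
```

Decoding happens only after splitting, so an escaped `%26` inside a value is never mistaken for a parameter separator.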
Implementation considerations¶
Some implementations encode more than needed¶
Many `urlencode` implementations will encode characters that don’t need to be encoded.
This is because the 1994 RFC 1738 for URLs, which RFC 3986 updates, had this language:
Thus, only alphanumerics, the special characters “$-_.+!*’(),”, and reserved characters used for their reserved purposes may be used unencoded within a URL.
RFC 3986 instead says:
If data for a URI component would conflict with a reserved character’s purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed.
Smarter `urlencode` implementations could be introduced under new names such as `uriencode` without breaking backwards compatibility. Maintainers, get on this.
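Python's `urllib.parse.urlencode` illustrates the over-encoding, and its `safe` and `quote_via` parameters allow a more lenient result today. This is a sketch; the safe set `QUERY_VALUE_SAFE` is an assumption chosen for this example (query-legal characters minus `&`, `=`, and `+`, which form decoders treat specially).

```python
from urllib.parse import quote, urlencode

params = {"shape": "rgb(1,2,3)"}

# Default behaviour: quote_plus() escapes every sub-delimiter.
strict = urlencode(params)
print(strict)   # shape=rgb%281%2C2%2C3%29

# Hypothetical lenient set: leave characters that a query may hold literally.
QUERY_VALUE_SAFE = "!$'()*,;:@/?"
lenient = urlencode(params, safe=QUERY_VALUE_SAFE, quote_via=quote)
print(lenient)  # shape=rgb(1,2,3)
```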
Normalization of full URIs¶
As @mgiuca points out, full URIs cannot be normalized (à la `urlencode`) reliably because they cannot be partitioned into their components unambiguously.
Examples:
- `https://api.tld/redirect?uri=https://boeing.fly/news?page2&nav=yes`
  Is the redirect to `https://boeing.fly/news?page2&nav=yes` or to `https://boeing.fly/news?page2`?
- `https://api.tld/redirect?uri=https://boeing.fly/news#ex`
  Is the redirect to `https://boeing.fly/news#ex` or to `https://boeing.fly/news`?
Instead, normalize each URI component separately.
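One way to apply this to the redirect example is to encode the inner URI as a single query value before assembling the outer URI. The sketch below uses Python's `urllib.parse`; the choice of `safe=":/?"` is an assumption that keeps the value readable while escaping `&`, `=`, and `#`.

```python
from urllib.parse import parse_qsl, quote, urlsplit

inner = "https://boeing.fly/news?page2&nav=yes"

# Encode the inner URI as one query value, then build the outer URI around it.
encoded_inner = quote(inner, safe=":/?")
outer = "https://api.tld/redirect?uri=" + encoded_inner
print(outer)
# https://api.tld/redirect?uri=https://boeing.fly/news?page2%26nav%3Dyes

# The receiver splits the outer query first and decodes afterwards.
recovered = dict(parse_qsl(urlsplit(outer).query))["uri"]
print(recovered == inner)  # True
```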
Tip
Avoid these library functions, which return incorrect results for some URIs:
OpenAPI¶
You should set `allowReserved: true` for OpenAPI parameter objects. There is no reason not to. As described earlier, also be aware that `style` controls whether additional characters must be encoded.