Notes on subtleties of HTTP implementation
Why the absolute-form is used for proxy requests
RFC7230§5.3.2 says that a (non-CONNECT) request to an HTTP proxy should look like
GET http://authority/path HTTP/1.1
rather than the usual
GET /path HTTP/1.1
Host: authority
But it doesn't give a hint as to why the message syntax is different here.
A blog post by Parsia Hakimian claims that this is legacy behavior inherited from HTTP/1.0, which had proxies but not the Host header field. That is mostly true. But notice also that the usual syntax has no place for a URI scheme, which means that it cannot specify a transport. Sure, the only two HTTP transports we might expect to use today are TCP (scheme: http) and TLS (scheme: https), and TLS requires that we use a CONNECT request to the proxy, which leaves TCP as the only option; but that is no reason to avoid building generality into the protocol.
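To make the difference concrete, here is a minimal sketch of building the head of a request in either form. The function name `request_head` and the `via_proxy` flag are my own inventions for illustration, and the sketch assumes a plain http URL without userinfo.

```python
from urllib.parse import urlsplit


def request_head(method: str, url: str, via_proxy: bool) -> str:
    """Return the request line plus Host header field for `url` (hypothetical helper)."""
    parts = urlsplit(url)
    if via_proxy:
        # absolute-form: scheme and authority travel in the request line itself.
        # Fragments are never sent on the wire, so drop one if present.
        target = parts._replace(fragment="").geturl()
    else:
        # origin-form: only path and query; the scheme is implied by the connection.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
    # The Host header field is sent in both forms.
    return f"{method} {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n"
```

Only the absolute-form carries the scheme; in the origin-form, `http` versus `https` is visible nowhere in the message.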
On taking short-cuts based on early header field values
RFC7230§3.2.2 says:
The order in which header fields with differing field names are received is not significant. However, it is good practice to send header fields that contain control data first, such as Host on requests and Date on responses, so that implementations can decide when not to handle a message as early as possible.
I took that as permission to use the first Host (or similar) header field to quickly route the request along to my sub-component before I've parsed the entire header field set.
However, it later states in §5.4:
A server MUST respond with a 400 (Bad Request) status code to any HTTP/1.1 request message that lacks a Host header field and to any request message that contains more than one Host header field or a Host header field with an invalid field-value.
Which means that I must parse the entire header field set, since I can't know that there is only one Host header field until I've seen them all.
However, if I look a bit closer at §3.2.2, I see that this short-cut is only valid for deciding to not handle a message; if I am handling it, I cannot use this short-cut.
Except that if I decide not to handle a request based on the Host header field, the correct thing to do is to send a 404 status code. Which implies that I have parsed the remainder of the header field set to validate the message syntax. Oh no, what do I do?
Well, there are a number of "A server MUST respond with a XXX code if" rules that can all be triggered on the same request. So we get to choose which to use.
And fortunately for optimizing implementations, §3.2.5 gave us:
A server that receives a ... set of fields, larger than it wishes to process MUST respond with an appropriate 4xx (Client Error) status code.
And since the header field set is larger than we wish to process (we want to short-cut processing, after all), we are free to respond with whichever 4xx status code we like!
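Here is a rough sketch of that reasoning, assuming the header lines arrive one at a time from some iterator and that `known_hosts` is a set of lowercase host names I am willing to serve. The name `triage` and the convention of returning `None` for "carry on with normal handling" are made up for this example.

```python
def triage(header_lines, known_hosts):
    """Return a 4xx status to reject with, or None to go on handling the request."""
    host = None
    for line in header_lines:
        name, sep, value = line.partition(":")
        if not sep:
            return 400  # not a header field at all
        if name.lower() != "host":
            continue
        if host is not None:
            return 400  # second Host header field (section 5.4)
        host = value.strip().lower()
        if host not in known_hosts:
            # Declining: stop reading right here. Whatever follows is
            # "larger than we wish to process", so any 4xx is allowed.
            return 404
    # Accepting: we only get here after seeing every field.
    return 400 if host is None else None
```

The early return only happens on the declining path; a request we accept has necessarily been scanned to the end, so the duplicate-Host check still holds.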
On normalizing target URIs
An implementer is tempted to normalize URIs all over the place, just for safety and sanitation. After all, RFC3986§6.1 says it's safe!
Unfortunately, most URI normalizer implementations will normalize an empty path to "/". Which is not always safe; RFC7230§2.7.3, which defines this "equivalence", actually says:
When not being used in absolute form as the request target of an OPTIONS request, an empty path component is equivalent to an absolute path of "/", so the normal form is to provide a path of "/" instead.
Which means we can't use the usual normalizer implementation if we are making an OPTIONS request!
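A minimal sketch of a normalizer that respects the exception might look like this. It only handles the empty-path rule (a real normalizer does much more), and `normalize_target` is a name I picked for illustration.

```python
from urllib.parse import urlsplit


def normalize_target(method: str, url: str) -> str:
    """Normalize an empty path to "/" unless this is an OPTIONS request."""
    parts = urlsplit(url)
    path = parts.path
    # Method names are case-sensitive, so an exact comparison is intended.
    if path == "" and method != "OPTIONS":
        path = "/"
    return parts._replace(path=path).geturl()
```

The normalizer has to know the request method, which is exactly the piece of context a general-purpose URI library doesn't have.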
Why is that? Well, if we turn to §5.3.4, we find the answer. One of the special cases in which the request target is not a URI is that we may use "*" as the target of an OPTIONS request, to request information about the origin server itself rather than a resource on that server.
However, as discussed above, the target in a request to a proxy must be an absolute URI (and §5.3.2 says that the origin server must also understand this syntax). So, we must define a way to map "*" to an absolute URI.
Naively, one might be tempted to use "/*" as the path. But that would make it impossible to have a resource actually named "/*". So, we must define a special case in the URI syntax that doesn't obstruct a real path. The empty path is that special case: §5.3.4 has the last proxy on the chain turn an OPTIONS request whose absolute URI has an empty path and no query into "*".
If we didn't have this special case in the URI normalizer, and the OPTIONS handler of the last proxy server treated the "/" path the same as an empty one, then it would be impossible to request OPTIONS for the "/" resource, as it would get translated into "*" and treated as OPTIONS for the entire server.
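For the last proxy on the chain, the mapping back from an absolute URI to the target it sends to the origin server can be sketched as follows. The name `forward_target` is hypothetical, and the sketch leans on `urlsplit`, which cannot tell a missing query from an empty one (`http://h` versus `http://h?`), so that edge case is glossed over.

```python
from urllib.parse import urlsplit


def forward_target(method: str, absolute_url: str) -> str:
    """Compute the request target the last proxy sends to the origin server."""
    parts = urlsplit(absolute_url)
    if method == "OPTIONS" and parts.path == "" and parts.query == "":
        # Empty path and no query on OPTIONS means "the server itself".
        return "*"
    # Everything else becomes origin-form; an empty path is sent as "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target
```

With this mapping, OPTIONS for `http://example.com` and OPTIONS for `http://example.com/` reach the origin server as `*` and `/` respectively, and a resource really named `/*` stays reachable.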