Notes on subtleties of HTTP implementation
==========================================
---
date: "2016-09-30"
---
I may add to this as time goes on, but I've written up some notes on
subtleties of HTTP/1.1 message syntax as specified in RFC 7230.
## Why the absolute-form is used for proxy requests
[RFC7230§5.3.2][] says that a (non-CONNECT) request to an HTTP
proxy should look like

    GET http://authority/path HTTP/1.1

rather than the usual

    GET /path HTTP/1.1
    Host: authority

It doesn't give a hint as to why the message syntax is different
here.

[A blog post by Parsia Hakimian][why-absform] claims that the reason
is that it's a legacy behavior inherited from HTTP/1.0, which had
proxies, but not the Host header field. That is mostly true. But
notice also that the usual syntax does not allow specifying a URI
scheme, which means that we cannot specify a transport. Sure, the
only two HTTP transports we might expect to use today are TCP (scheme:
http) and TLS (scheme: https), and TLS requires we use a CONNECT
request to the proxy, meaning that the only option left is a TCP
transport; but that is no reason to avoid building generality into the
protocol.
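
To make the two forms concrete, here is a minimal sketch of a client
building the head of a request either way. The helper name
`request_head` and its `via_proxy` flag are my own invention for
illustration, and it ignores details like userinfo in the authority.
It sends Host in both cases, since RFC 7230 §5.4 has clients send it
in every HTTP/1.1 request.

```python
from urllib.parse import urlsplit, urlunsplit


def request_head(method: str, url: str, via_proxy: bool) -> str:
    """Return the request line and Host field for an HTTP/1.1 request.

    Hypothetical helper: absolute-form when talking to a proxy,
    origin-form when talking to the origin server directly.
    """
    parts = urlsplit(url)
    if via_proxy:
        # absolute-form: scheme and authority travel in the request line
        # (the fragment is dropped; it is never sent on the wire).
        target = urlunsplit(
            (parts.scheme, parts.netloc, parts.path, parts.query, "")
        )
    else:
        # origin-form: only path and query; the scheme is implied by
        # the connection the request is sent over.
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
    return f"{method} {target} HTTP/1.1\r\nHost: {parts.netloc}\r\n"
```

Only the absolute-form carries the scheme, which is the one piece of
information the proxy cannot recover from the Host field.
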
## On taking short-cuts based on early header field values
[RFC7230§3.2.2][] says:
> The order in which header fields with differing field names are
> received is not significant. However, it is good practice to send
> header fields that contain control data first, such as Host on
> requests and Date on responses, so that implementations can decide
> when not to handle a message as early as possible.

Which is great! We can make an optimization!

But this is only a valid optimization for deciding to *not handle* a
message. You cannot use it to decide early to route a request to a
backend. Part of the reason is that [§5.4][RFC7230§5.4] tells
us we must inspect the entire header field set to know if we need to
respond with a 400 status code:
> A server MUST respond with a 400 (Bad Request) status code to any
> HTTP/1.1 request message that lacks a Host header field and to any
> request message that contains more than one Host header field or a
> Host header field with an invalid field-value.

However, if I decide not to handle a request based on the Host header
field, the correct thing to do is to send a 404 status code, which
implies that I have parsed the remainder of the header field set and
found the message syntax valid. So we need to parse the entire field
set to know whether to send a 400 or a 404. Did this just kill the
possibility of using the optimization?

Well, there are a number of "A server MUST respond with a XXX code if"
rules that can all be triggered on the same request. So we get to
choose which to use. And fortunately for optimizing implementations,
[§3.2.5][RFC7230§3.2.5] gave us:
> A server that receives a ... set of fields,
> larger than it wishes to process MUST respond with an appropriate 4xx
> (Client Error) status code.

Since the header field set is longer than we want to process (because
we want to short-cut processing), we are free to respond with
whichever 4xx status code we like!
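
As a rough sketch of that short-cut (my own illustration, not
something the RFC spells out): scan the field lines in the order
received, and stop reading as soon as a Host field names a host we
don't serve. The function `early_status` and the `served_hosts` set
are hypothetical, host matching is a plain lower-cased string
comparison, and validation of the Host field-value is left out.

```python
from typing import Iterable, Optional, Set


def early_status(field_lines: Iterable[str],
                 served_hosts: Set[str]) -> Optional[int]:
    """Return a 4xx status to reject with, or None to keep handling.

    Hypothetical sketch: field_lines yields "Name: value" strings in
    the order they were received.
    """
    host = None
    for line in field_lines:
        name, _, value = line.partition(":")
        if name.strip().lower() != "host":
            continue
        if host is not None:
            return 400  # more than one Host field
        host = value.strip().lower()
        if host not in served_hosts:
            # Short-cut: stop reading here. The rest of the field set
            # is "larger than we wish to process", so a 4xx is allowed.
            return 404
    if host is None:
        return 400  # no Host field at all
    return None
```

Note that the early return only ever rejects. Accepting the request
(returning `None`) still requires walking the whole field set, which
matches the point above about not routing to a backend early.
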
## On normalizing target URIs
An implementer is tempted to normalize URIs all over the place, just
for safety and sanitation. After all,
[RFC3986§6.1][] says it's safe!

Unfortunately, most URI normalization implementations will normalize an
empty path to "/", which is not always safe; [RFC7230§2.7.3][], which
defines this "equivalence", actually says:
> When not being used in
> absolute form as the request target of an OPTIONS request, an empty
> path component is equivalent to an absolute path of "/", so the
> normal form is to provide a path of "/" instead.

Which means we can't use the usual normalization implementation if we
are making an OPTIONS request!

Why is that? Well, if we turn to [§5.3.4][RFC7230§5.3.4], we find the
answer. One of the special cases for when the request target is not a
URI, is that we may use "\*" as the target for an OPTIONS request to
request information about the origin server itself, rather than a
resource on that server.
However, as discussed above, the target in a request to a proxy must
be an absolute URI (and [§5.3.2][RFC7230§5.3.2] says that the origin
server must also understand this syntax). So, we must define a way to
map "\*" to an absolute URI.
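Naively, one might be tempted to use "/\*" as the path. But that
would make it impossible to have a resource actually named "/\*". So,
we must define a special case in the URI syntax that doesn't obstruct
a real path. That special case is the empty path: an OPTIONS request
whose absolute URI has an empty path and no query stands for "\*".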
If we didn't have this special case in the URI normalization rules,
and we treated the "/" path the same as an empty path in the OPTIONS
handler of the last proxy server, then it would be impossible to
request OPTIONS for the "/" resource, as it would get translated into
"\*" and treated as OPTIONS for the entire server.
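
Here is a small sketch of what the last proxy in the chain might do
when turning the absolute-form target it received into the target it
forwards to the origin server. The name `forwarded_target` is made up
for illustration, and Python's `urlsplit` cannot tell a missing query
from an empty one, so "no query" here also covers a bare trailing "?".

```python
from urllib.parse import urlsplit


def forwarded_target(method: str, absolute_uri: str) -> str:
    """Request-target the last proxy sends to the origin server.

    Hypothetical helper; absolute_uri is the absolute-form target
    that the proxy received.
    """
    parts = urlsplit(absolute_uri)
    if method == "OPTIONS" and not parts.path and not parts.query:
        # Empty path, no query: OPTIONS for the server as a whole.
        return "*"
    # Everywhere else an empty path is equivalent to "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# OPTIONS for the whole server vs. OPTIONS for the "/" resource:
print(forwarded_target("OPTIONS", "http://example.com"))   # *
print(forwarded_target("OPTIONS", "http://example.com/"))  # /
```
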

[RFC3986§6.1]: https://tools.ietf.org/html/rfc3986#section-6.1
[RFC7230§2.7.3]: https://tools.ietf.org/html/rfc7230#section-2.7.3
[RFC7230§3.2.2]: https://tools.ietf.org/html/rfc7230#section-3.2.2
[RFC7230§3.2.5]: https://tools.ietf.org/html/rfc7230#section-3.2.5
[RFC7230§5.3.2]: https://tools.ietf.org/html/rfc7230#section-5.3.2
[RFC7230§5.3.4]: https://tools.ietf.org/html/rfc7230#section-5.3.4
[RFC7230§5.4]: https://tools.ietf.org/html/rfc7230#section-5.4
[why-absform]: https://parsiya.net/blog/2016-07-28-thick-client-proxying---part-6-how-https-proxies-work/#3-1-1-why-not-use-the-host-header