
[v3,0/4] Additional FAQ entries

Message ID 20240704003818.750223-1-sandals@crustytoothpaste.net

Message

brian m. carlson July 4, 2024, 12:38 a.m. UTC
This series introduces some additional Git FAQ entries on various
topics.  They are all things I've seen in my professional life or on
Stack Overflow, so I've written documentation.

There were some suggestions in the past that the text "modify, tamper
with, or buffer" might be somewhat redundant, but I've chosen to keep
the text as it is to avoid arguments like, "Well, buffering the entire
request or response isn't really modifying it, so Git should just work
in that situation," when we already know that doesn't work.

Changes from v2 (partial):
* Add documentation on proxies to the configuration documentation as
  well.
* Mention some security problems that are known to occur with TLS MITM
  proxies.  This mirrors the similar Git LFS documentation.
* Provide a documentation example about how to use proxies with SSH.
* Recommend running a `git fsck` after syncing with rsync.

Changes from v1:
* Drop the monorepo patch for now; I want to revise it further.
* Reorder the working tree patch to place more warnings up front.
* Mention core.gitproxy and socat.
* Rephrase text in the EOL entry to read correctly and be easier to
  understand.
* Improve the commit message for the working tree FAQ entry to make it
  clearer that users wish to transfer uncommitted changes.

brian m. carlson (4):
  gitfaq: add documentation on proxies
  gitfaq: give advice on using eol attribute in gitattributes
  gitfaq: add entry about syncing working trees
  doc: mention that proxies must be completely transparent

 Documentation/config/http.txt |   5 ++
 Documentation/gitfaq.txt      | 105 ++++++++++++++++++++++++++++++++--
 2 files changed, 104 insertions(+), 6 deletions(-)

Comments

Junio C Hamano July 4, 2024, 1:25 a.m. UTC | #1
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> This series introduces some additional Git FAQ entries on various
> topics.  They are all things I've seen in my professional life or on
> Stack Overflow, so I've written documentation.

Just to help other readers:

 v1
 https://lore.kernel.org/git/20211020010624.675562-1-sandals@crustytoothpaste.net/

 v2
 https://lore.kernel.org/git/20211107225525.431138-1-sandals@crustytoothpaste.net/

are where the previous discussions are found.

> There were some suggestions in the past that the text "modify, tamper
> with, or buffer" might be somewhat redundant, but I've chosen to keep
> the text as it is to avoid arguments like, "Well, buffering the entire
> request or response isn't really modifying it, so Git should just work
> in that situation," when we already know that doesn't work.
>
> Changes from v2 (partial):
> * Add documentation on proxies to the configuration documentation as
>   well.
> * Mention some security problems that are known to occur with TLS MITM
>   proxies.  This mirrors the similar Git LFS documentation.
> * Provide a documentation example about how to use proxies with SSH.
> * Recommend running a `git fsck` after syncing with rsync.
>
> Changes from v1:
> * Drop the monorepo patch for now; I want to revise it further.
> * Reorder the working tree patch to place more warnings up front.
> * Mention core.gitproxy and socat.
> * Rephrase text in the EOL entry to read correctly and be easier to
>   understand.
> * Improve the commit message for the working tree FAQ entry to make it
>   clearer that users wish to transfer uncommitted changes.
>
> brian m. carlson (4):
>   gitfaq: add documentation on proxies
>   gitfaq: give advice on using eol attribute in gitattributes
>   gitfaq: add entry about syncing working trees
>   doc: mention that proxies must be completely transparent
>
>  Documentation/config/http.txt |   5 ++
>  Documentation/gitfaq.txt      | 105 ++++++++++++++++++++++++++++++++--
>  2 files changed, 104 insertions(+), 6 deletions(-)
Junio C Hamano July 4, 2024, 5:22 a.m. UTC | #2
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> This series introduces some additional Git FAQ entries on various
> topics.  They are all things I've seen in my professional life or on
> Stack Overflow, so I've written documentation.
>
> There were some suggestions in the past that the text "modify, tamper
> with, or buffer" might be somewhat redundant, but I've chosen to keep
> the text as it is to avoid arguments like, "Well, buffering the entire
> request or response isn't really modifying it, so Git should just work
> in that situation," when we already know that doesn't work.

Buffering the entire thing will break because ...?  Deadlock?  Or is
there anything more subtle going on?

Are we affected by any frame boundary (do we even notice?) that
happens at layer lower than our own pkt-line layer at all (i.e. we
sent two chunks and we fail to work on them correctly if the network
collapses them into one chunk, without changing a single byte, just
changing the number of read() system calls that reads them?)?
brian m. carlson July 4, 2024, 9:23 p.m. UTC | #3
On 2024-07-04 at 05:22:27, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> > This series introduces some additional Git FAQ entries on various
> > topics.  They are all things I've seen in my professional life or on
> > Stack Overflow, so I've written documentation.
> >
> > There were some suggestions in the past that the text "modify, tamper
> > with, or buffer" might be somewhat redundant, but I've chosen to keep
> > the text as it is to avoid arguments like, "Well, buffering the entire
> > request or response isn't really modifying it, so Git should just work
> > in that situation," when we already know that doesn't work.
> 
> Buffering the entire thing will break because ...?  Deadlock?  Or is
> there anything more subtle going on?

When we use the smart HTTP protocol, the server sends keep-alive and
status messages as one of the data streams, which is important because
(a) the user is usually impatient and wants to know what's going on and
(b) it may take a long time to pack the data, especially for large
repositories, and sending no data may result in the connection being
dropped or the client being served a 500 by an intermediate layer.  We
know this does happen and I've seen reports of it.
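[For readers unfamiliar with how those status messages travel alongside the pack data: the smart protocols multiplex them over side-band channels inside pkt-lines, where band 1 carries pack data, band 2 carries progress and keep-alive text, and band 3 carries a fatal error. A minimal sketch of a demultiplexer (illustrative only, not Git's actual code):]

```python
import io

def demux_sideband(stream):
    """Split a side-band-multiplexed stream into pack data and progress text.

    Each pkt-line is 4 hex digits giving the total length, then a payload
    whose first byte selects the band: 1 = pack data, 2 = progress/keep-alive,
    3 = fatal error.  A length of 0 ("0000") is a flush-pkt ending the stream.
    """
    data, progress = bytearray(), []
    while True:
        length = int(stream.read(4), 16)
        if length == 0:  # flush-pkt: end of the multiplexed stream
            break
        payload = stream.read(length - 4)
        band, body = payload[0], payload[1:]
        if band == 1:
            data += body
        elif band == 2:
            progress.append(body.decode("utf-8", errors="replace"))
        elif band == 3:
            raise RuntimeError(body.decode("utf-8", errors="replace"))
    return bytes(data), progress

# A tiny hand-built stream: one progress message, some pack bytes, a flush-pkt.
raw = b"0015\x02Counting objects" + b"0009\x01PACK" + b"0000"
pack, msgs = demux_sideband(io.BytesIO(raw))
```

[A proxy that buffers the whole response delays every band-2 message until the end, which is exactly why the client sees silence and the connection gets dropped.]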

We've also seen some cases where proxies refuse to accept
Transfer-Encoding: chunked (let's party like it's 1999) and send a 411
back since there's no Content-Length header.  That's presumably because
they want to scan the contents for "bad" data all in one chunk, but Git
has to stream the contents unless the data fits in the buffer size.
(This is the one case where http.postBuffer actually makes a
difference.)  I very much doubt that the appliance actually wants to get
a 2 GiB payload to scan, since it probably doesn't have tons of memory
in the first place, but that is what it's asking for.
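[The interaction with http.postBuffer can be sketched like this. This is a simplification of what Git's HTTP layer does; the 1 MiB default is real, but the helper function itself is hypothetical:]

```python
def choose_request_framing(body_len, post_buffer=1024 * 1024):
    """Roughly how the client decides how to frame an HTTP request body:

    bodies that fit within http.postBuffer (default 1 MiB) can be buffered
    and sent with a Content-Length header; anything larger must be streamed
    with Transfer-Encoding: chunked, which some old proxies reject with 411.
    """
    if body_len <= post_buffer:
        return {"Content-Length": str(body_len)}
    return {"Transfer-Encoding": "chunked"}

small = choose_request_framing(64 * 1024)      # fits: buffered, Content-Length
large = choose_request_framing(2 * 1024 ** 3)  # 2 GiB push: must be chunked
```

[This is why raising http.postBuffer can work around a 411 from such a proxy: it moves the cutoff so the payload is buffered with a Content-Length, at the cost of holding the whole request in client memory.]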

> Are we affected by any frame boundary (do we even notice?) that
> happens at layer lower than our own pkt-line layer at all (i.e. we
> sent two chunks and we fail to work on them correctly if the network
> collapses them into one chunk, without changing a single byte, just
> changing the number of read() system calls that reads them?)?

No, that's not a problem.  We read four bytes for the pkt-line header,
and then we read the entire body based on that length until we get all
of it.  This is also the way OpenSSL works for TLS packets and is known
to work well.  If the underlying TCP connection provides a partial or
incomplete packet (which can happen due to MTU), we'll just block until
the rest comes in, which is fine.
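[A sketch of that read loop (illustrative Python, not Git's actual C code) shows why re-segmentation below the pkt-line layer is harmless:]

```python
import io

def read_exact(recv, n):
    """Read exactly n bytes, looping over short reads: TCP may deliver a
    pkt-line in pieces (MTU, proxy re-segmentation), so we simply block
    until the rest arrives."""
    buf = b""
    while len(buf) < n:
        chunk = recv(n - len(buf))
        if not chunk:
            raise EOFError("connection closed mid-packet")
        buf += chunk
    return buf

def read_pkt_line(recv):
    """Read one pkt-line: 4 hex digits of total length, then the payload."""
    length = int(read_exact(recv, 4), 16)
    if length == 0:
        return None  # flush-pkt
    return read_exact(recv, length - 4)

# Simulate a network that delivers one byte per read() call: the number of
# reads changes, the reassembled pkt-line does not.
stream = io.BytesIO(b"0009hello")
payload = read_pkt_line(lambda n: stream.read(1))
```

[Splitting or collapsing segments only changes how many times the loop runs, never the bytes it reassembles.]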
Junio C Hamano July 6, 2024, 5:59 a.m. UTC | #4
"brian m. carlson" <sandals@crustytoothpaste.net> writes:

>> Buffering the entire thing will break because ...?  Deadlock?  Or is
>> there anything more subtle going on?
>
> When we use the smart HTTP protocol, the server sends keep-alive and
> status messages as one of the data streams, which is important because
> (a) the user is usually impatient and wants to know what's going on and
> (b) it may take a long time to pack the data, especially for large
> repositories, and sending no data may result in the connection being
> dropped or the client being served a 500 by an intermediate layer.  We
> know this does happen and I've seen reports of it.

And this is an example of "a proxy that buffers the data, without
modifying or tampering with, would still break transport"?

> We've also seen some cases where proxies refuse to accept
> Transfer-Encoding: chunked (let's party like it's 1999) and send a 411
> back since there's no Content-Length header.

This is "a proxy that wanted to buffer the data but failed to do so"
that ended up modifying the data that the Gits sitting at both ends
of the connection can observe, so it is a bit of a different issue.
It clearly falls into the "modify or tamper with" category.

I forgot to say this clearly when I wrote the message you are
responding to, but I am trying to see if we can clarify the "or
buffer" part in "modify, tamper with, or buffer".  Offhand I could
not think of a reason why a proxy would break the Git communication
if it received a segment that was originally 2MB from upload-pack
and forwarded the contents of that segment to fetch-pack in two 1MB
segments, without tampering with or modifying the payload bytes at
all.

Thanks.
Jeff King July 6, 2024, 6:47 a.m. UTC | #5
On Thu, Jul 04, 2024 at 09:23:28PM +0000, brian m. carlson wrote:

> > Buffering the entire thing will break because ...?  Deadlock?  Or is
> > there anything more subtle going on?
> 
> When we use the smart HTTP protocol, the server sends keep-alive and
> status messages as one of the data streams, which is important because
> (a) the user is usually impatient and wants to know what's going on and
> (b) it may take a long time to pack the data, especially for large
> repositories, and sending no data may result in the connection being
> dropped or the client being served a 500 by an intermediate layer.  We
> know this does happen and I've seen reports of it.

Additionally, I think for non-HTTP transports (think proxying ssh
through socat or similar), buffering the v0 protocol is likely a total
disaster. The fetch protocol assumes both sides spewing at each other in
real time.

HTTP, even v0, follows a request/response model, so we're safer there. I
do think some amount of buffering is often going to be OK in practice.
You'd get delayed keep-alives and progress reports, which may range from
"annoying" to "something in the middle decided to time out". So I'm OK
with just telling people "make sure your proxies aren't buffering" as a
general rule, rather than trying to get into the nitty gritty of what is
going to break and how.

-Peff
Junio C Hamano July 6, 2024, 5:18 p.m. UTC | #6
Jeff King <peff@peff.net> writes:

> On Thu, Jul 04, 2024 at 09:23:28PM +0000, brian m. carlson wrote:
>
>> > Buffering the entire thing will break because ...?  Deadlock?  Or is
>> > there anything more subtle going on?
>> 
>> When we use the smart HTTP protocol, the server sends keep-alive and
>> status messages as one of the data streams, which is important because
>> (a) the user is usually impatient and wants to know what's going on and
>> (b) it may take a long time to pack the data, especially for large
>> repositories, and sending no data may result in the connection being
>> dropped or the client being served a 500 by an intermediate layer.  We
>> know this does happen and I've seen reports of it.
>
> Additionally, I think for non-HTTP transports (think proxying ssh
> through socat or similar), buffering the v0 protocol is likely a total
> disaster. The fetch protocol assumes both sides spewing at each other in
> real time.

Yeah, beyond the one "window" of "have"s that is allowed to be in
flight, no further "have"s are sent before seeing an "ack/nack"
response, so if you buffer too much, they can deadlock fairly easily.
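[That failure mode can be illustrated with a toy model. All parameter names and sizes below are made up for illustration; Git's real windowing lives in fetch-pack's negotiation code:]

```python
def negotiate(num_haves, window, proxy_flush_bytes, have_size=50):
    """Toy model of windowed "have" negotiation through a buffering proxy.

    The client keeps at most `window` unacknowledged haves in flight, then
    blocks waiting for an ack.  The proxy forwards its buffer only once it
    holds at least `proxy_flush_bytes`.  If that threshold exceeds what one
    window of haves can fill, both sides wait on each other forever.
    """
    sent = acked = buffered = 0
    while acked < num_haves:
        if sent - acked < window and sent < num_haves:
            sent += 1              # client streams another "have"...
            buffered += have_size  # ...which the proxy absorbs into its buffer
            continue
        # Client is now blocked waiting for an ack.
        if buffered >= proxy_flush_bytes:
            acked = sent   # proxy flushes; server acks the batch
            buffered = 0
        else:
            return "deadlock"  # proxy wants more bytes; client wants an ack
    return "ok"
```

[With a small flush threshold the rounds complete (`negotiate(100, 32, 100)` returns "ok"); put a megabyte-scale buffer in the middle (`negotiate(100, 32, 1_000_000)`) and the first window of haves never reaches the server.]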

> ... So I'm OK
> with just telling people "make sure your proxies aren't buffering" as a
> general rule, rather than trying to get into the nitty gritty of what is
> going to break and how.

Sounds fair.  Thanks.
brian m. carlson July 8, 2024, 12:52 a.m. UTC | #7
On 2024-07-06 at 05:59:57, Junio C Hamano wrote:
> "brian m. carlson" <sandals@crustytoothpaste.net> writes:
> 
> >> Buffering the entire thing will break because ...?  Deadlock?  Or is
> >> there anything more subtle going on?
> >
> > When we use the smart HTTP protocol, the server sends keep-alive and
> > status messages as one of the data streams, which is important because
> > (a) the user is usually impatient and wants to know what's going on and
> > (b) it may take a long time to pack the data, especially for large
> > repositories, and sending no data may result in the connection being
> > dropped or the client being served a 500 by an intermediate layer.  We
> > know this does happen and I've seen reports of it.
> 
> And this is an example of "a proxy that buffers the data, without
> modifying or tampering with, would still break transport"?

Yes.  The connection usually ends up dropped from the view of the
client, which is hard to debug (because it also looks like a network
problem, except often without any output from the remote side).