mbox series

[v2,0/7] remote-curl: fix deadlocks when remote server disconnects

Message ID cover.1589816718.git.liu.denton@gmail.com (mailing list archive)
Headers show
Series remote-curl: fix deadlocks when remote server disconnects | expand

Message

Denton Liu May 18, 2020, 3:47 p.m. UTC
The following command hangs forever:

	$ git -c protocol.version=2 clone https://github.com/git/git.git --shallow-since=20151012
	Cloning into 'git'...

This occurs because the --shallow-since arg is incorrect and the server
dies early. However, remote-curl does not realise that the server
errored out and just faithfully forwards the packets to fetch-pack
before waiting on more input from fetch-pack. Meanwhile, fetch-pack
keeps reading as it still expects more input. As a result, the processes
deadlock. Original analysis by Peff:
https://lore.kernel.org/git/20200328154936.GA1217052@coredump.intra.peff.net/

Changes since v1:

* Remove fallthrough in switch in favour of just extracting the common
  call out of the switch in patch 3

* Add more detail in function comment and use `const char linelen[4]` in
  patch 4

* Implement most of Peff's suggestions[0] in patch 5

* Only operate on stateless_connect() in patch 5

* Add tests in patch 5

* Drop "remote-curl: ensure last packet is a flush" in favour of
  "stateless-connect: send response end packet"

[0]: https://lore.kernel.org/git/20200515213844.GD115445@coredump.intra.peff.net/

Denton Liu (7):
  remote-curl: fix typo
  remote-curl: remove label indentation
  transport: extract common fetch_pack() call
  pkt-line: extern packet_length()
  remote-curl: error on incomplete packet
  pkt-line: PACKET_READ_RESPONSE_END
  stateless-connect: send response end packet

 Documentation/gitremote-helpers.txt           |  4 +-
 Documentation/technical/protocol-v2.txt       |  2 +
 builtin/fetch-pack.c                          |  2 +-
 connect.c                                     | 12 +++-
 fetch-pack.c                                  | 12 ++++
 pkt-line.c                                    | 13 +++-
 pkt-line.h                                    | 11 +++
 remote-curl.c                                 | 72 +++++++++++++++++--
 remote.h                                      |  3 +-
 serve.c                                       |  2 +
 t/helper/test-pkt-line.c                      |  4 ++
 t/lib-httpd.sh                                |  2 +
 t/lib-httpd/apache.conf                       |  8 +++
 .../incomplete-body-upload-pack-v2-http.sh    |  3 +
 .../incomplete-length-upload-pack-v2-http.sh  |  3 +
 t/t5702-protocol-v2.sh                        | 47 ++++++++++++
 transport.c                                   | 16 ++---
 17 files changed, 197 insertions(+), 19 deletions(-)
 create mode 100644 t/lib-httpd/incomplete-body-upload-pack-v2-http.sh
 create mode 100644 t/lib-httpd/incomplete-length-upload-pack-v2-http.sh

Range-diff against v1:
1:  b390875f87 = 1:  b390875f87 remote-curl: fix typo
2:  a2b28c0b28 = 2:  a2b28c0b28 remote-curl: remove label indentation
3:  c89c184100 < -:  ---------- transport: combine common cases with a fallthrough
-:  ---------- > 3:  3a42575bd5 transport: extract common fetch_pack() call
4:  891a39c853 ! 4:  c2b9d033bb pkt-line: extern packet_length()
    @@ Commit message
         need to access the length header. In order to simplify this, extern
         packet_length() so that the logic can be reused.
     
    +    Change the function parameter from a `const char *` to
    +    `const char linelen[4]`. Even though these two types behave identically
    +    as function parameters, use the array notation to semantically indicate
    +    exactly what this function is expecting as an argument.
    +
      ## pkt-line.c ##
     @@ pkt-line.c: static int get_packet_data(int fd, char **src_buf, size_t *src_size,
      	return ret;
      }
      
     -static int packet_length(const char *linelen)
    -+int packet_length(const char *linelen)
    ++int packet_length(const char linelen[4])
      {
      	int val = hex2chr(linelen);
      	return (val < 0) ? val : (val << 8) | hex2chr(linelen + 2);
    @@ pkt-line.h: int write_packetized_from_buf(const char *src_in, size_t len, int fd
      		*buffer, unsigned size, int options);
      
     +/*
    -+ * Reads a packetized line and returns the length header of the packet.
    ++ * Convert a four hex digit packet line length header into its numeric
    ++ * representation. linelen should not be null-terminated.
    ++ *
    ++ * If linelen contains non-hex characters, return -1. Otherwise, return the
    ++ * numeric value of the length header.
     + */
    -+int packet_length(const char *linelen);
    ++int packet_length(const char linelen[4]);
     +
      /*
       * Read a packetized line into a buffer like the 'packet_read()' function but
5:  3ed7cf87aa < -:  ---------- remote-curl: error on incomplete packet
6:  7a689da2bb < -:  ---------- remote-curl: ensure last packet is a flush
-:  ---------- > 5:  52ce5fdffd remote-curl: error on incomplete packet
-:  ---------- > 6:  744b078324 pkt-line: PACKET_READ_RESPONSE_END
-:  ---------- > 7:  4b079bcd83 stateless-connect: send response end packet

Comments

Jeff King May 18, 2020, 4:50 p.m. UTC | #1
On Mon, May 18, 2020 at 11:47:17AM -0400, Denton Liu wrote:

> Changes since v1:
> 
> * Remove fallthrough in switch in favour of just extracting the common
>   call out of the switch in patch 3
> 
> * Add more detail in function comment and use `const char linelen[4]` in
>   patch 4
> 
> * Implement most of Peff's suggestions[0] in patch 5
> 
> * Only operate on stateless_connect() in patch 5
> 
> * Add tests in patch 5
> 
> * Drop "remote-curl: ensure last packet is a flush" in favour of
>   "stateless-connect: send response end packet"

Overall this looks pretty cleanly done. I left a few minor comments
throughout, but the real question is whether we prefer the "0002" packet
in the last one, or if we instead insist that the response end in a
flush.

At first glance, the "0002" seems like it's more flexible, because we're
making fewer assumptions about what's being transferred over the
stateless-connect channel. But in reality it still has to be pktlines
(because we're checking them for incomplete or invalid packets already).
So all it really buys us is that the server response doesn't have to end
with a flush packet.

So I dunno. The "0002" solution is slightly more flexible, but I'm not
sure it helps in practice. And it does eat up one of our two remaining
special packet markers.

There is another solution, which would allow arbitrary data over
stateless-connect: adding an extra level of pktline framing between the
helper and the parent process. But that's rather ugly (inner pktlines
may be split across outer ones, so you have to do a bunch of buffer
reassembly). I think that's actually how v0 http works, IIRC.
IMHO it probably isn't worth pursuing. That extra layer wrecks the
elegance to the v2 stateless-connect approach, and we really do expect
only pktlines to go over it.

So I think either of your solutions (enforcing a final flush, or the
0002 packet) is preferable. I'm on the fence between them.

-Peff
Denton Liu May 18, 2020, 5:36 p.m. UTC | #2
Hi Peff,

On Mon, May 18, 2020 at 12:50:56PM -0400, Jeff King wrote:
> On Mon, May 18, 2020 at 11:47:17AM -0400, Denton Liu wrote:
> 
> > Changes since v1:
> > 
> > * Remove fallthrough in switch in favour of just extracting the common
> >   call out of the switch in patch 3
> > 
> > * Add more detail in function comment and use `const char linelen[4]` in
> >   patch 4
> > 
> > * Implement most of Peff's suggestions[0] in patch 5
> > 
> > * Only operate on stateless_connect() in patch 5
> > 
> > * Add tests in patch 5
> > 
> > * Drop "remote-curl: ensure last packet is a flush" in favour of
> >   "stateless-connect: send response end packet"
> 
> Overall this looks pretty cleanly done. I left a few minor comments
> throughout, but the real question is whether we prefer the "0002" packet
> in the last one, or if we instead insist that the response end in a
> flush.

Thanks for the prompt review!

> At first glance, the "0002" seems like it's more flexible, because we're
> making fewer assumptions about what's being transferred over the
> stateless-connect channel. But in reality it still has to be pktlines
> (because we're checking them for incomplete or invalid packets already).
> So all it really buys us is that the server response doesn't have to end
> with a flush packet.
> 
> So I dunno. The "0002" solution is slightly more flexible, but I'm not
> sure it helps in practice. And it does eat up one of our two remaining
> special packet markers.

Yeah, I was worried about consuming a special packet. One alternative
that I considered but is kind of gross is sending something like
"0028gitremote-helper: response complete\n" instead of "0002". Then,
instead of "0002" checks, we can check for that special string instead.
I don't _think_ that stateless-connect currently allows for completely
arbitrary data but I might be mistaken.

> There is another solution, which would allow arbitrary data over
> stateless-connect: adding an extra level of pktline framing between the
> helper and the parent process. But that's rather ugly (inner pktlines
> may be split across outer ones, so you have to do a bunch of buffer
> reassembly). I think that's actually how v0 http works, IIRC.
> IMHO it probably isn't worth pursuing. That extra layer wrecks the
> elegance to the v2 stateless-connect approach, and we really do expect
> only pktlines to go over it.

This was the original approach that I was working on but I found it to
be much too invasive for my liking. (Also, I never actually managed to
get it working ;) ) I think I gave up when I realised I had to insert
reframing logic into index-pack and unpack-objects.

> So I think either of your solutions (enforcing a final flush, or the
> 0002 packet) is preferable. I'm on the fence between them.

I'm mostly on the fence too. One advantage of 0002, however, is that a
malicious server can't end a request with 0002 as that's explicitly
prevented. If a malicious server closes a connection after sending a
0000, I think that they could cause a deadlock to happen if there are
multiple flush packets expected in a response. You mentioned in a
sibling email that this currently doesn't happen wrt stateless-connect
although I'm not sure if in the future, this is something that might
change. I dunno.

> -Peff
Junio C Hamano May 18, 2020, 7:36 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

> Overall this looks pretty cleanly done. I left a few minor comments
> throughout, but the real question is whether we prefer the "0002" packet
> in the last one, or if we instead insist that the response end in a
> flush.

Thanks for a review.  I was reading the series through and I found
it a reasonably pleasant read.

> So I think either of your solutions (enforcing a final flush, or the
> 0002 packet) is preferable. I'm on the fence between them.

I am undecided, too X-<.
Jeff King May 18, 2020, 8:58 p.m. UTC | #4
On Mon, May 18, 2020 at 01:36:52PM -0400, Denton Liu wrote:

> > So I dunno. The "0002" solution is slightly more flexible, but I'm not
> > sure it helps in practice. And it does eat up one of our two remaining
> > special packet markers.
> 
> Yeah, I was worried about consuming a special packet. One alternative
> that I considered but is kind of gross is sending something like
> "0028gitremote-helper: response complete\n" instead of "0002". Then,
> instead of "0002" checks, we can check for that special string instead.
> I don't _think_ that stateless-connect currently allows for completely
> arbitrary data but I might be mistaken.

Yeah, I think we should avoid using any packet that could be confused
with regular output. A well-functioning server should have sent a flush
already, and so the worst case would probably be that we would mistake
their "0028" packet for ours if the network happens to cut out right at
that moment (and even that is obviously exceedingly unlikely). But it
just seems easier to reason about if we know we can't get confused.

> This was the original approach that I was working on but I found it to
> be much too invasive for my liking. (Also, I never actually managed to
> get it working ;) ) I think I gave up when I realised I had to insert
> reframing logic into index-pack and unpack-objects.

Yuck. :) I think v0 does it by unwrapping the extra layer in fetch-pack,
but I'd have to double check. But I'm not going to, because I think we
both agree that your other approaches are way nicer.

> > So I think either of your solutions (enforcing a final flush, or the
> > 0002 packet) is preferable. I'm on the fence between them.
> 
> I'm mostly on the fence too. One advantage of 0002, however, is that a
> malicious server can't end a request with 0002 as that's explicitly
> prevented. If a malicious server closes a connection after sending a
> 0000, I think that they could cause a deadlock to happen if there are
> multiple flush packets expected in a response. You mentioned in a
> sibling email that this currently doesn't happen wrt stateless-connect
> although I'm not sure if in the future, this is something that might
> change. I dunno.

Yeah, agreed that the efficacy of the "must end in flush" strategy
depends on there not being internal flushes. At least against a
determined attacker. But it may also be less random than you might
think, if the pattern is to send a flush, then do a bunch of work, etc.
A "random" hangup or error might be more likely to happen at the flush
points then (from the client's perspective).

So let's see if we can answer that question with less hand-waving.

The outer v2 capabilities conversation only writes out the capabilities
list, followed by a single flush (the packet_flush() call in
advertise_capabilities()). So far so good.

The only two v2 commands defined in serve.c are "ls-refs" and "fetch".

For "ls-refs", we end up in ls_refs(). That writes only regular packets
via send_ref(), and then concludes with a single flush (we don't even
send a delim; the client does that to specify options). Good.

For "fetch", we end up in upload_pack_v2(). We write via
process_haves_and_send_acks(), which will either:

  1. Write nothing and return 1.

  2. Write a delim and return 1.

  3. Write a flush packet and return 0.

In the final case, we jump to FETCH_DONE. So this really is the final
thing we say. Good.

In the first two cases, we jump to FETCH_SEND_PACK. We may write out
wanted-ref info, followed by a delim. OK. Then shallow info, followed by
a delim. OK. Then the actual packfile via create_pack_file(). There our
write behavior depends on use_sideband. If not set, we just dump data
directly. If set, then we packetize. And it will always be set for v2
(we do it unconditionally in the beginning of upload_pack_v2()).

So we only need to care about use_sideband cases. And there we may send
keepalives, but those are "0005\1" empty data packets. OK. We may send
data via send_client_data(), but those will be real data packets. And
then finally we give a single packet_flush(). Good.

So I think we can say that yes, the protocol as designed will send a
flush at the end of every response, and will not ever send another
flush.

_But_ that's if all goes well. If upload-pack sees an error, it may:

  - write an ERR packet in the earlier non-sideband steps, and then die

  - write a sideband 3 packet in later steps, and then die

  - just die() in some cases (e.g., pack-objects failed to start)

Obviously these are all an error on the client. And that third one
especially is probably a hanging case that we'd like to fix with this
series. But in the other two, we'd definitely want to make sure that
remote-curl doesn't complain too early or too loudly, and that the error
is shown to the user.

I think if remote-curl is just asking "did we see a flush packet at the
end" then it is likely to complain unnecessarily in that case. And I
don't think it should be inspecting packets for ERR or sideband-3. It
doesn't know enough about the rest of the conversation to know which is
correct. Though I guess we'd really only have to inspect the final
packet.

I really wish we had converted all ERRs into packet 0002 as part of the
v2 conversion. Then somebody reading only at the pktline level could
truly understand or proxy a full conversation without understanding
anything about what's in the pktlines. But it's too late for that now.

So I think our options are probably:

  1. detect flush packets in remote-curl, and either:

     a. don't print an error, just hang up. That prevents a hang in the
	caller and produces no extra message on a real error. It may be
	less informative than it could be if the connection hangs up
	(though we may print a curl error message, and the caller will
	at least say "the helper hung up")

     b. like (a), but always print an error; this is your original
	patch, but I _suspect_ (but didn't test) that it would produce
	extra useless messages for errors the server reports

     c. between the two: inspect the final packet data for evidence of
        ERR/sideband 3 and suppress any message if found

  2. helper signals end-of-response to caller (then it never produces a
     message itself; only the caller does, and it would abort on an ERR
     packet before then)

     a. using a special pktline (your "0002" patch)

     b. some other out-of-band mechanism (e.g., could be another fd)

I think this is pushing me towards 2a, your "0002" patch. It sidesteps
the error-message questions entirely (and I think 2b is too convoluted
to be worth pursuing, especially on Windows where setting up extra pipes
is tricky). But I'd also be OK with 1a or 1c.

-Peff
Junio C Hamano May 18, 2020, 10:52 p.m. UTC | #5
Jeff King <peff@peff.net> writes:

> So I think our options are probably:
>
>   1. detect flush packets in remote-curl, and either:
>
>      a. don't print an error, just hang up. That prevents a hang in the
> 	caller and produces no extra message on a real error. It may be
> 	less informative than it could be if the connection hangs up
> 	(though we may print a curl error message, and the caller will
> 	at least say "the helper hung up")
>
>      b. like (a), but always print an error; this is your original
> 	patch, but I _suspect_ (but didn't test) that it would produce
> 	extra useless messages for errors the server reports
>
>      c. between the two: inspect the final packet data for evidence of
>         ERR/sideband 3 and suppress any message if found
>
>   2. helper signals end-of-response to caller (then it never produces a
>      message itself; only the caller does, and it would abort on an ERR
>      packet before then)
>
>      a. using a special pktline (your "0002" patch)
>
>      b. some other out-of-band mechanism (e.g., could be another fd)
>
> I think this is pushing me towards 2a, your "0002" patch. It sidesteps
> the error-message questions entirely (and I think 2b is too convoluted
> to be worth pursuing, especially on Windows where setting up extra pipes
> is tricky). But I'd also be OK with 1a or 1c.

Thanks for a detailed analysis.  I guess we'd take 0002, then?
Jeff King May 19, 2020, 2:38 a.m. UTC | #6
On Mon, May 18, 2020 at 03:52:42PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > So I think our options are probably:
> >
> >   1. detect flush packets in remote-curl, and either:
> >
> >      a. don't print an error, just hang up. That prevents a hang in the
> > 	caller and produces no extra message on a real error. It may be
> > 	less informative than it could be if the connection hangs up
> > 	(though we may print a curl error message, and the caller will
> > 	at least say "the helper hung up")
> >
> >      b. like (a), but always print an error; this is your original
> > 	patch, but I _suspect_ (but didn't test) that it would produce
> > 	extra useless messages for errors the server reports
> >
> >      c. between the two: inspect the final packet data for evidence of
> >         ERR/sideband 3 and suppress any message if found
> >
> >   2. helper signals end-of-response to caller (then it never produces a
> >      message itself; only the caller does, and it would abort on an ERR
> >      packet before then)
> >
> >      a. using a special pktline (your "0002" patch)
> >
> >      b. some other out-of-band mechanism (e.g., could be another fd)
> >
> > I think this is pushing me towards 2a, your "0002" patch. It sidesteps
> > the error-message questions entirely (and I think 2b is too convoluted
> > to be worth pursuing, especially on Windows where setting up extra pipes
> > is tricky). But I'd also be OK with 1a or 1c.
> 
> Thanks for a detailed analysis.  I guess we'd take 0002, then?

Yeah, that's how I lean. I think I did have a few comments on the patch
that I'd like Denton to consider, so hopefully we'll see a re-roll.

-Peff