diff mbox series

builtin/credential-cache--daemon: fix error when "exit"ing on Cygwin

Message ID a4472d6d1551e7c25540c4c8361bcb6b1c9f92ff.1729084997.git.ps@pks.im (mailing list archive)
State Superseded
Headers show
Series builtin/credential-cache--daemon: fix error when "exit"ing on Cygwin | expand

Commit Message

Patrick Steinhardt Oct. 16, 2024, 2:21 p.m. UTC
Clients can signal the git-credential-cache(1) daemon that it is
supposed to exit by sending it an "exit" command. The details around
how exactly the daemon exits seem to be rather intricate as spelt out by
a comment surrounding our call to exit(3p), as we need to be mindful
around closing the client streams before we signal the client.

The logic is broken on Cygwin though: when a client asks the daemon to
exit, they won't see the EOF and will instead get an error message:

  fatal: read error from cache daemon: Software caused connection abort

This issue is known in Cygwin, see for example [1], but the exact root
cause is not known.

As it turns out, we can avoid the issue by explicitly closing the client
streams via fclose(3p). I'm not sure at all where the claimed atexit(3p)
handler mentioned in the comment is supposed to live, but from all I can
see we do not have any installed that would close the sockets for us. So
this leaves me with a bit of a sour taste overall.

That being said, I couldn't spot anything obviously wrong with closing
the streams ourselves, and it does fix the issue on Cygwin without any
regressions on other platforms as far as I can see. So let's go for this
fix, even though I cannot properly explain it.

[1]: https://github.com/cygporter/git/issues/51

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

I've Cc'd Adam, who is the maintainer of the Git package in Cygwin, as
well as Peff, who is the original author of the below comment. I'd be
really happy if one of you could enlighten me here :)

Patrick

 builtin/credential-cache--daemon.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Comments

Jeff King Oct. 16, 2024, 2:55 p.m. UTC | #1
On Wed, Oct 16, 2024 at 04:21:36PM +0200, Patrick Steinhardt wrote:

> diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c
> index bc22f5c6d24..5a09df5c167 100644
> --- a/builtin/credential-cache--daemon.c
> +++ b/builtin/credential-cache--daemon.c
> @@ -156,13 +156,11 @@ static void serve_one_client(FILE *in, FILE *out)
>  	}
>  	else if (!strcmp(action.buf, "exit")) {
>  		/*
> -		 * It's important that we clean up our socket first, and then
> -		 * signal the client only once we have finished the cleanup.
> -		 * Calling exit() directly does this, because we clean up in
> -		 * our atexit() handler, and then signal the client when our
> -		 * process actually ends, which closes the socket and gives
> -		 * them EOF.
> +		 * We must close our file handles before we exit such that the
> +		 * client will receive an EOF.
>  		 */
> +		fclose(in);
> +		fclose(out);
>  		exit(0);
>  	}

This breaks the thing the comment was trying to protect against. We want
to unlink() the socket file before closing the descriptors.

From 7d5e9c9849 (which you can find with blame or "git log --follow
-Satexit builtin/credential-cache--daemon.c"):

    credential-cache--daemon: clarify "exit" action semantics

    When this code was originally written, there wasn't much
    thought given to the timing between a client asking for
    "exit", the daemon signaling that the action is done (with
    EOF), and the actual cleanup of the socket.

    However, we need to care about this so that our test scripts
    do not end up racy (e.g., by asking for an exit and checking
    that the socket was cleaned up). The code that is already
    there happens to behave very reasonably; let's add a comment
    to make it clear that any changes should retain the same
    behavior.

So with the proposed change, t0301 will now fail racily. We need that
unlink() to happen before the fclose(). Just calling exit() does things
in the right order, but it should also be OK to do an explicit:

  delete_tempfile(&socket_file);
  fclose(in);
  fclose(out);

That would probably require making socket_file a global variable. (You
can't just return out of the serving loop, since that closes the sockets
first before deleting the tempfile).

> Clients can signal the git-credential-cache(1) daemon that it is
> supposed to exit by sending it an "exit" command. The details around
> how exactly the daemon exits seem to be rather intricate as spelt out by
> a comment surrounding our call to exit(3p), as we need to be mindful
> around closing the client streams before we signal the client.
> 
> The logic is broken on Cygwin though: when a client asks the daemon to
> exit, they won't see the EOF and will instead get an error message:
> 
>   fatal: read error from cache daemon: Software caused connection abort
> 
> This issue is known in Cygwin, see for example [1], but the exact root
> cause is not known.
> [...]
> [1]: https://github.com/cygporter/git/issues/51

I don't see any details at that issue, but I'm not sure how it would fix
things. From the client's perspective, they are going to see the
descriptor either way. Unless there is some magic that fclose() does
that a normal descriptor-close-on-exit does not do.

That "Software caused connection abort" thing seems like a weird
not-standard-Unix errno value. Searching for it mostly yields people
complaining about getting it from ssh under cygwin. :)

If the magic that cygwin needs is actually "fclose before unlink", then
that is in opposition to other platforms (and I suspect would make t0301
racy there).

> As it turns out, we can avoid the issue by explicitly closing the client
> streams via fclose(3p). I'm not sure at all where the claimed atexit(3p)
> handler mentioned in the comment is supposed to live, but from all I can
> see we do not have any installed that would close the sockets for us. So
> this leaves me with a bit of a sour taste overall.

The mention of atexit is a little oblique these days. We moved from a
manual atexit handler to using the tempfile.c handler in 9e9033166b
(credential-cache--daemon: use tempfile module, 2015-08-10).

> That being said, I couldn't spot anything obviously wrong with closing
> the streams ourselves, and it does fix the issue on Cygwin without any
> regressions on other platforms as far as I can see. So let's go for this
> fix, even though I cannot properly explain it.

Running t0301 with --stress on Linux failed for me after a minute or so.

-Peff
Jeff King Oct. 16, 2024, 3:09 p.m. UTC | #2
On Wed, Oct 16, 2024 at 10:55:40AM -0400, Jeff King wrote:

> > The logic is broken on Cygwin though: when a client asks the daemon to
> > exit, they won't see the EOF and will instead get an error message:
> > 
> >   fatal: read error from cache daemon: Software caused connection abort
> > 
> > This issue is known in Cygwin, see for example [1], but the exact root
> > cause is not known.
> > [...]
> > [1]: https://github.com/cygporter/git/issues/51
> 
> I don't see any details at that issue, but I'm not sure how it would fix
> things. From the client's perspective, they are going to see the
> descriptor either way. Unless there is some magic that fclose() does
> that a normal descriptor-close-on-exit does not do.
> 
> That "Software caused connection abort" thing seems like a weird
> not-standard-Unix errno value. Searching for it mostly yields people
> complaining about getting it from ssh under cygwin. :)
> 
> If the magic that cygwin needs is actually "fclose before unlink", then
> that is in opposition to other platforms (and I suspect would make t0301
> racy there).

This all seemed eerily familiar. Try this thread:

  https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/

It looks like the conclusion was that we should adjust errno handling on
the client side, but nobody ever followed up with an actual patch.

-Peff
Ramsay Jones Oct. 16, 2024, 3:12 p.m. UTC | #3
On 16/10/2024 15:21, Patrick Steinhardt wrote:
> Clients can signal the git-credential-cache(1) daemon that it is
> supposed to exit by sending it an "exit" command. The details around
> how exactly the daemon exits seem to be rather intricate as spelt out by
> a comment surrounding our call to exit(3p), as we need to be mindful
> around closing the client streams before we signal the client.
> 
> The logic is broken on Cygwin though: when a client asks the daemon to
> exit, they won't see the EOF and will instead get an error message:
> 
>   fatal: read error from cache daemon: Software caused connection abort
> 
> This issue is known in Cygwin, see for example [1], but the exact root
> cause is not known.
> 
> As it turns out, we can avoid the issue by explicitly closing the client
> streams via fclose(3p). I'm not sure at all where the claimed atexit(3p)
> handler mentioned in the comment is supposed to live, but from all I can
> see we do not have any installed that would close the sockets for us. So
> this leaves me with a bit of a sour taste overall.
> 
> That being said, I couldn't spot anything obviously wrong with closing
> the streams ourselves, and it does fix the issue on Cygwin without any
> regressions on other platforms as far as I can see. So let's go for this
> fix, even though I cannot properly explain it.
> 
> [1]: https://github.com/cygporter/git/issues/51
> 
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
> 
> I've Cc'd Adam, who is the maintainer of the Git package in Cygwin, as
> well as Peff, who is the original author of the below comment. I'd be
> really happy if one of you could enlighten me here :)
> 
> Patrick
> 
>  builtin/credential-cache--daemon.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c
> index bc22f5c6d24..5a09df5c167 100644
> --- a/builtin/credential-cache--daemon.c
> +++ b/builtin/credential-cache--daemon.c
> @@ -156,13 +156,11 @@ static void serve_one_client(FILE *in, FILE *out)
>  	}
>  	else if (!strcmp(action.buf, "exit")) {
>  		/*
> -		 * It's important that we clean up our socket first, and then
> -		 * signal the client only once we have finished the cleanup.
> -		 * Calling exit() directly does this, because we clean up in
> -		 * our atexit() handler, and then signal the client when our
> -		 * process actually ends, which closes the socket and gives
> -		 * them EOF.
> +		 * We must close our file handles before we exit such that the
> +		 * client will receive an EOF.
>  		 */
> +		fclose(in);
> +		fclose(out);
>  		exit(0);
>  	}
>  	else if (!strcmp(action.buf, "erase"))

Heh, this is very familiar! :)

See, for example, the mailing list discussion here[1].

If memory serves, Jeff was against this solution. For what it is worth, my
recommended solution is '[RFC PATCH 1A] credential-cache: also handle
ECONNABORTED connection closure'. I still have those patches in my local
cygwin repo. Ah, let me just add the patch below (this was last rebased
onto v2.46.0, but it should (hopefully) be fine to apply to master - famous
last words).

ATB,
Ramsay Jones

[1] https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/

-------->8--------
From 89526b2b87e39985dc8cd80661662ff087dc1078 Mon Sep 17 00:00:00 2001
From: Ramsay Jones <ramsay@ramsayjones.plus.com>
Date: Wed, 22 Jun 2022 19:24:44 +0100
Subject: [PATCH] credential-cache: also handle ECONNABORTED connection closure

Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com>
---
 builtin/credential-cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/credential-cache.c b/builtin/credential-cache.c
index 3db8df70a9..defbcc845c 100644
--- a/builtin/credential-cache.c
+++ b/builtin/credential-cache.c
@@ -30,7 +30,7 @@ static int connection_fatally_broken(int error)
 
 static int connection_closed(int error)
 {
-	return (error == ECONNRESET);
+	return (error == ECONNRESET) || (error == ECONNABORTED);
 }
 
 static int connection_fatally_broken(int error)
Ramsay Jones Oct. 16, 2024, 3:39 p.m. UTC | #4
On 16/10/2024 16:09, Jeff King wrote:
> On Wed, Oct 16, 2024 at 10:55:40AM -0400, Jeff King wrote:
> 
>>> The logic is broken on Cygwin though: when a client asks the daemon to
>>> exit, they won't see the EOF and will instead get an error message:
>>>
>>>   fatal: read error from cache daemon: Software caused connection abort
>>>
>>> This issue is known in Cygwin, see for example [1], but the exact root
>>> cause is not known.
>>> [...]
>>> [1]: https://github.com/cygporter/git/issues/51
>>
>> I don't see any details at that issue, but I'm not sure how it would fix
>> things. From the client's perspective, they are going to see the
>> descriptor either way. Unless there is some magic that fclose() does
>> that a normal descriptor-close-on-exit does not do.
>>
>> That "Software caused connection abort" thing seems like a weird
>> not-standard-Unix errno value. Searching for it mostly yields people
>> complaining about getting it from ssh under cygwin. :)
>>
>> If the magic that cygwin needs is actually "fclose before unlink", then
>> that is in opposition to other platforms (and I suspect would make t0301
>> racy there).
> 
> This all seemed eerily familiar. Try this thread:
> 
>   https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
> 
> It looks like the conclusion was that we should adjust errno handling on
> the client side, but nobody ever followed up with an actual patch.

Heh, our emails crossed. Yes, I was hoping that, given that Adam had identified
the cygwin commit that caused the regression, some resolution on the cygwin
side would fix things up. I waited ... and then put it on my TODO list! :)

I did look at the cygwin commit and it wasn't at all obvious what happened.
In fact there was a comment about making sure that errno values didn't
change as a side-effect!

Sorry for being tardy, again ...

ATB,
Ramsay Jones
Taylor Blau Oct. 16, 2024, 8:28 p.m. UTC | #5
On Wed, Oct 16, 2024 at 11:09:22AM -0400, Jeff King wrote:
> On Wed, Oct 16, 2024 at 10:55:40AM -0400, Jeff King wrote:
>
> > > The logic is broken on Cygwin though: when a client asks the daemon to
> > > exit, they won't see the EOF and will instead get an error message:
> > >
> > >   fatal: read error from cache daemon: Software caused connection abort
> > >
> > > This issue is known in Cygwin, see for example [1], but the exact root
> > > cause is not known.
> > > [...]
> > > [1]: https://github.com/cygporter/git/issues/51
> >
> > I don't see any details at that issue, but I'm not sure how it would fix
> > things. From the client's perspective, they are going to see the
> > descriptor either way. Unless there is some magic that fclose() does
> > that a normal descriptor-close-on-exit does not do.
> >
> > That "Software caused connection abort" thing seems like a weird
> > not-standard-Unix errno value. Searching for it mostly yields people
> > complaining about getting it from ssh under cygwin. :)
> >
> > If the magic that cygwin needs is actually "fclose before unlink", then
> > that is in opposition to other platforms (and I suspect would make t0301
> > racy there).
>
> This all seemed eerily familiar. Try this thread:
>
>   https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
>
> It looks like the conclusion was that we should adjust errno handling on
> the client side, but nobody ever followed up with an actual patch.

Thanks for digging. It would be great if you both and Ramsay could unify
on an agreeable path forward here.

In the meantime, I picked this up as 'ps/credential-cache-exit-cygwin'
in my tree, but let's hold it out from 'seen' as you note it racily
fails t0301.

Thanks,
Taylor
Jeff King Oct. 17, 2024, 2:33 a.m. UTC | #6
On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote:

> > This all seemed eerily familiar. Try this thread:
> >
> >   https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
> >
> > It looks like the conclusion was that we should adjust errno handling on
> > the client side, but nobody ever followed up with an actual patch.
> 
> Thanks for digging. It would be great if you both and Ramsay could unify
> on an agreeable path forward here.

I think the patch Ramsay posted elsewhere is the right way forward.
Hopefully he can fill out a commit message with the summary and then we
can proceed.

(I'm happy to help with explaining the credential-cache side, but I
would just be hand-waving on the cygwin parts).

-Peff
Patrick Steinhardt Oct. 17, 2024, 8:46 a.m. UTC | #7
On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote:
> On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote:
> 
> > > This all seemed eerily familiar. Try this thread:
> > >
> > >   https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
> > >
> > > It looks like the conclusion was that we should adjust errno handling on
> > > the client side, but nobody ever followed up with an actual patch.
> > 
> > Thanks for digging. It would be great if you both and Ramsay could unify
> > on an agreeable path forward here.
> 
> I think the patch Ramsay posted elsewhere is the right way forward.
> Hopefully he can fill out a commit message with the summary and then we
> can proceed.
> 
> (I'm happy to help with explaining the credential-cache side, but I
> would just be hand-waving on the cygwin parts).

Sounds great to me -- in that case, let's just drop my patch. I was
basically just trying to get somebody to have a look at the issue, and
it very much seems like I succeeded :)

Ramsay, do you want to polish your patch with a commit message or shall
I pick it up and iterate on it?

Thanks all for chiming in!

Patrick
Taylor Blau Oct. 17, 2024, 8:54 p.m. UTC | #8
On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote:
> On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote:
>
> > > This all seemed eerily familiar. Try this thread:
> > >
> > >   https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
> > >
> > > It looks like the conclusion was that we should adjust errno handling on
> > > the client side, but nobody ever followed up with an actual patch.
> >
> > Thanks for digging. It would be great if you both and Ramsay could unify
> > on an agreeable path forward here.
>
> I think the patch Ramsay posted elsewhere is the right way forward.
> Hopefully he can fill out a commit message with the summary and then we
> can proceed.

Yeah, that's exactly what I was hoping for ;-).

Thanks,
Taylor
Ramsay Jones Oct. 17, 2024, 10:58 p.m. UTC | #9
On 17/10/2024 09:46, Patrick Steinhardt wrote:
> On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote:
>> On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote:
>>
>>>> This all seemed eerily familiar. Try this thread:
>>>>
>>>>   https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
>>>>
>>>> It looks like the conclusion was that we should adjust errno handling on
>>>> the client side, but nobody ever followed up with an actual patch.
>>>
>>> Thanks for digging. It would be great if you both and Ramsay could unify
>>> on an agreeable path forward here.
>>
>> I think the patch Ramsay posted elsewhere is the right way forward.
>> Hopefully he can fill out a commit message with the summary and then we
>> can proceed.
>>
>> (I'm happy to help with explaining the credential-cache side, but I
>> would just be hand-waving on the cygwin parts).
> 
> Sounds great to me -- in that case, let's just drop my patch. I was
> basically just trying to get somebody to have a look at the issue, and
> it very much seems like I succeeded :)
> 
> Ramsay, do you want to polish your patch with a commit message or shall
> I pick it up and iterate on it?

I can't get to it before the weekend, at the earliest, sorry! :(

If you fancy picking it up, please be my guest! :)

ATB,
Ramsay Jones
Patrick Steinhardt Oct. 18, 2024, 4:36 a.m. UTC | #10
On Thu, Oct 17, 2024 at 11:58:33PM +0100, Ramsay Jones wrote:
> 
> 
> On 17/10/2024 09:46, Patrick Steinhardt wrote:
> > On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote:
> >> On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote:
> >>
> >>>> This all seemed eerily familiar. Try this thread:
> >>>>
> >>>>   https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/
> >>>>
> >>>> It looks like the conclusion was that we should adjust errno handling on
> >>>> the client side, but nobody ever followed up with an actual patch.
> >>>
> >>> Thanks for digging. It would be great if you both and Ramsay could unify
> >>> on an agreeable path forward here.
> >>
> >> I think the patch Ramsay posted elsewhere is the right way forward.
> >> Hopefully he can fill out a commit message with the summary and then we
> >> can proceed.
> >>
> >> (I'm happy to help with explaining the credential-cache side, but I
> >> would just be hand-waving on the cygwin parts).
> > 
> > Sounds great to me -- in that case, let's just drop my patch. I was
> > basically just trying to get somebody to have a look at the issue, and
> > it very much seems like I succeeded :)
> > 
> > Ramsay, do you want to polish your patch with a commit message or shall
> > I pick it up and iterate on it?
> 
> I can't get to it before the weekend, at the earliest, sorry! :(
> 
> If you fancy picking it up, please be my guest! :)

I've sent v2 using that hack now, noting your original authorship. I
want to finally get on top of all of those platform-specific failures :)

Thanks!

Patrick
diff mbox series

Patch

diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c
index bc22f5c6d24..5a09df5c167 100644
--- a/builtin/credential-cache--daemon.c
+++ b/builtin/credential-cache--daemon.c
@@ -156,13 +156,11 @@  static void serve_one_client(FILE *in, FILE *out)
 	}
 	else if (!strcmp(action.buf, "exit")) {
 		/*
-		 * It's important that we clean up our socket first, and then
-		 * signal the client only once we have finished the cleanup.
-		 * Calling exit() directly does this, because we clean up in
-		 * our atexit() handler, and then signal the client when our
-		 * process actually ends, which closes the socket and gives
-		 * them EOF.
+		 * We must close our file handles before we exit such that the
+		 * client will receive an EOF.
 		 */
+		fclose(in);
+		fclose(out);
 		exit(0);
 	}
 	else if (!strcmp(action.buf, "erase"))