
strange interaction between fuse + pidns

Message ID YrShFXRLtRt6T/j+@risky (mailing list archive)

Commit Message

Tycho Andersen June 23, 2022, 5:21 p.m. UTC
Hi all,

I'm seeing some weird interactions with fuse and the pid namespace. I have a
small reproducer here: https://github.com/tych0/kernel-utils/tree/master/fuse2

fuse has the concept of "forcing" a request, which means (among other
things) that it does an unkillable wait in request_wait_answer(). fuse
flushes files when they are closed with this unkillable wait:

$ sudo cat /proc/1544574/stack
[<0>] request_wait_answer+0x12f/0x210
[<0>] fuse_simple_request+0x109/0x2c0
[<0>] fuse_flush+0x16f/0x1b0
[<0>] filp_close+0x27/0x70
[<0>] put_files_struct+0x6b/0xc0
[<0>] do_exit+0x360/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] get_signal+0x140/0x870
[<0>] arch_do_signal_or_restart+0xae/0x7c0
[<0>] exit_to_user_mode_prepare+0x10f/0x1c0
[<0>] syscall_exit_to_user_mode+0x26/0x40
[<0>] do_syscall_64+0x46/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Generally, this is OK, since the fuse_dev_release() -> fuse_abort_conn()
wakes up this code when a fuse dev goes away (i.e. a fuse daemon is killed
or unmounted or whatever). However, there's a problem when the fuse daemon
itself spawns a thread that does a flush: since the thread has a copy of
the fd table with an fd pointing to the same fuse device, the reference
count isn't decremented to zero in fuse_dev_release(), and the task hangs
forever.
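The fd-table copy is the crux: a resource reached through a file descriptor stays alive until the *last* copy of that descriptor is closed. The same effect can be demonstrated with an ordinary pipe; the sketch below is a stand-alone analogy, not fuse-specific, and the function name is invented for illustration:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Stand-alone analogy (not fuse-specific): a forked child inherits a copy
 * of the pipe's write fd in its own fd table, so the parent closing its
 * copy does not drop the struct file's refcount to zero, and the reader
 * sees no EOF until the child also closes it -- just as a thread holding
 * a copy of the /dev/fuse fd keeps fuse_dev_release() from ever calling
 * fuse_abort_conn().
 */
ssize_t pipe_refcount_demo(void)
{
    int p[2];
    if (pipe(p) != 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {          /* child: holds a second copy of p[1] */
        close(p[0]);
        sleep(1);            /* the "connection" stays open meanwhile */
        close(p[1]);         /* last reference dropped: reader gets EOF */
        _exit(0);
    }

    close(p[1]);             /* closing the parent's copy alone is not enough */

    char c;
    ssize_t n = read(p[0], &c, 1);   /* blocks until the child's close() */
    close(p[0]);
    waitpid(pid, NULL, 0);
    return n;                /* 0 == EOF, seen only after the last close */
}
```

Here read() returns EOF only once the child's copy is gone; in the fuse case the analogous wake-up in fuse_abort_conn() likewise never happens while another task still holds the device fd.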

Tasks can be aborted via fusectl's abort file, so all is not lost. However,
this wreaks havoc in containers that mounted a fuse filesystem in this
state. If the init pid exits (or crashes), the kernel tries to clean
up the pidns:

$ sudo cat /proc/1528591/stack
[<0>] do_wait+0x156/0x2f0
[<0>] kernel_wait4+0x8d/0x140
[<0>] zap_pid_ns_processes+0x104/0x180
[<0>] do_exit+0xa41/0xb80
[<0>] do_group_exit+0x3a/0xa0
[<0>] __x64_sys_exit_group+0x14/0x20
[<0>] do_syscall_64+0x37/0xb0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae

but hangs forever. This unkillable wait seems unfortunate, so I tried the
obvious thing of changing it to a killable wait:


diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0e537e580dc1..c604dfcaec26 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
 		spin_unlock(&fiq->lock);
 	}
 	WARN_ON(test_bit(FR_PENDING, &req->flags));
-	WARN_ON(test_bit(FR_SENT, &req->flags));
 	if (test_bit(FR_BACKGROUND, &req->flags)) {
 		spin_lock(&fc->bg_lock);
 		clear_bit(FR_BACKGROUND, &req->flags);
@@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
 			queue_interrupt(req);
 	}
 
-	if (!test_bit(FR_FORCE, &req->flags)) {
-		/* Only fatal signals may interrupt this */
-		err = wait_event_killable(req->waitq,
-					test_bit(FR_FINISHED, &req->flags));
-		if (!err)
-			return;
+	/* Only fatal signals may interrupt this */
+	err = wait_event_killable(req->waitq,
+				test_bit(FR_FINISHED, &req->flags));
+	if (!err)
+		return;
 
-		spin_lock(&fiq->lock);
-		/* Request is not yet in userspace, bail out */
-		if (test_bit(FR_PENDING, &req->flags)) {
-			list_del(&req->list);
-			spin_unlock(&fiq->lock);
-			__fuse_put_request(req);
-			req->out.h.error = -EINTR;
-			return;
-		}
+	spin_lock(&fiq->lock);
+	/* Request is not yet in userspace, bail out */
+	if (test_bit(FR_PENDING, &req->flags)) {
+		list_del(&req->list);
 		spin_unlock(&fiq->lock);
+		__fuse_put_request(req);
+		req->out.h.error = -EINTR;
+		return;
 	}
+	spin_unlock(&fiq->lock);
 
 	/*
-	 * Either request is already in userspace, or it was forced.
-	 * Wait it out.
+	 * Womp womp. We sent a request to userspace and now we're getting
+	 * killed.
 	 */
-	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+	set_bit(FR_INTERRUPTED, &req->flags);
+	/* matches barrier in fuse_dev_do_read() */
+	smp_mb__after_atomic();
+	/* request *must* be FR_SENT here, because we ignored FR_PENDING before */
+	WARN_ON(!test_bit(FR_SENT, &req->flags));
+	queue_interrupt(req);
 }
 
 static void __fuse_request_send(struct fuse_req *req)

available as a full patch here:
https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7

but now things are even weirder. Tasks are stuck at the killable wait, but with
a SIGKILL pending for the thread group.

root@(none):/# cat /proc/187/stack
[<0>] fuse_simple_request+0x8d9/0x10f0 [fuse]
[<0>] fuse_flush+0x42f/0x630 [fuse]
[<0>] filp_close+0x96/0x120
[<0>] put_files_struct+0x15c/0x2c0
[<0>] do_exit+0xa00/0x2450
[<0>] do_group_exit+0xb2/0x2a0
[<0>] __x64_sys_exit_group+0x35/0x40
[<0>] do_syscall_64+0x40/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
root@(none):/# cat /proc/187/status
Name:   main
Umask:  0022
State:  S (sleeping)
Tgid:   187
Ngid:   0
Pid:    187
PPid:   185
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 0
Groups:
NStgid: 187     3
NSpid:  187     3
NSpgid: 171     0
NSsid:  160     0
Threads:        1
SigQ:   0/6706
SigPnd: 0000000000000000
ShdPnd: 0000000000000100
SigBlk: 0000000000000000
SigIgn: 0000000180004002
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Seccomp_filters:        0
Speculation_Store_Bypass:       thread vulnerable
SpeculationIndirectBranch:      conditional enabled
Cpus_allowed:   f
Cpus_allowed_list:      0-3
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        6
nonvoluntary_ctxt_switches:     1

Any ideas what's going on here? It also seems I'm not the first person to
wonder about this:
https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAMp4zn9zTA_A2GJiYo5AD9V5HpeXbzzMP%3DnF0WtwbxRbV3koNA%40mail.gmail.com/#msg36598753

Thanks,

Tycho

Comments

Vivek Goyal June 23, 2022, 9:55 p.m. UTC | #1
On Thu, Jun 23, 2022 at 11:21:25AM -0600, Tycho Andersen wrote:
> Hi all,
> 
> I'm seeing some weird interactions with fuse and the pid namespace. I have a
> small reproducer here: https://github.com/tych0/kernel-utils/tree/master/fuse2
> 
> fuse has the concept of "forcing" a request, which means (among other
> things) that it does an unkillable wait in request_wait_answer(). fuse
> flushes files when they are closed with this unkillable wait:
> 
> $ sudo cat /proc/1544574/stack
> [<0>] request_wait_answer+0x12f/0x210
> [<0>] fuse_simple_request+0x109/0x2c0
> [<0>] fuse_flush+0x16f/0x1b0
> [<0>] filp_close+0x27/0x70
> [<0>] put_files_struct+0x6b/0xc0
> [<0>] do_exit+0x360/0xb80
> [<0>] do_group_exit+0x3a/0xa0
> [<0>] get_signal+0x140/0x870
> [<0>] arch_do_signal_or_restart+0xae/0x7c0
> [<0>] exit_to_user_mode_prepare+0x10f/0x1c0
> [<0>] syscall_exit_to_user_mode+0x26/0x40
> [<0>] do_syscall_64+0x46/0xb0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Generally, this is OK, since the fuse_dev_release() -> fuse_abort_conn()
> wakes up this code when a fuse dev goes away (i.e. a fuse daemon is killed
> or unmounted or whatever). However, there's a problem when the fuse daemon
> itself spawns a thread that does a flush:

So in this case a single process is client as well as server? IOW, one
thread is the fuse server servicing fuse requests and the other thread
is a fuse client accessing the fuse filesystem?

> since the thread has a copy of
> the fd table with an fd pointing to the same fuse device, the reference
> count isn't decremented to zero in fuse_dev_release(), and the task hangs
> forever.

So why did the fuse server thread stop responding to fuse messages?
Why did it not complete the flush?

Is it something to do with the init process dying in the pid namespace
and killing the fuse server thread, while being unable to kill the
other thread because it is in an unkillable wait?

> 
> Tasks can be aborted via fusectl's abort file, so all is not lost. However,
> this does wreak havoc in containers which mounted a fuse filesystem with
> this state. If the init pid exits (or crashes), the kernel tries to clean
> up the pidns:
> 
> $ sudo cat /proc/1528591/stack
> [<0>] do_wait+0x156/0x2f0
> [<0>] kernel_wait4+0x8d/0x140
> [<0>] zap_pid_ns_processes+0x104/0x180
> [<0>] do_exit+0xa41/0xb80
> [<0>] do_group_exit+0x3a/0xa0
> [<0>] __x64_sys_exit_group+0x14/0x20
> [<0>] do_syscall_64+0x37/0xb0
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> but hangs forever. This unkillable wait seems unfortunate, so I tried the
> obvious thing of changing it to a killable wait:

BTW, the unkillable wait happens only if fc->no_interrupt = 1. And this
seems to be set only if some previous interrupt request to the server
returned -ENOSYS.

fuse_dev_do_write() {
                else if (oh.error == -ENOSYS)
                        fc->no_interrupt = 1;
}

So a simple workaround might be for the server to implement support for
interrupting requests.

Having said that, this does sound like a problem and probably should
be fixed at the kernel level.
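Concretely, for a server speaking the raw /dev/fuse protocol, that workaround amounts to not answering FUSE_INTERRUPT with -ENOSYS. A hedged sketch follows; the struct below is a local mirror of struct fuse_out_header's layout from <linux/fuse.h>, and the function name is invented for illustration:

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct fuse_out_header (<linux/fuse.h>), for illustration. */
struct out_header {
    uint32_t len;
    int32_t  error;
    uint64_t unique;
};

/*
 * Sketch: the reply header a raw /dev/fuse server might write for a
 * FUSE_INTERRUPT request. Replying -ENOSYS is what makes
 * fuse_dev_do_write() set fc->no_interrupt = 1, turning later waits in
 * request_wait_answer() uninterruptible; a server that supports
 * interrupts replies 0 (or -EAGAIN to requeue the interrupt) instead.
 */
struct out_header interrupt_reply(uint64_t unique, int supported)
{
    struct out_header out;

    memset(&out, 0, sizeof(out));
    out.len = sizeof(out);
    out.unique = unique;          /* echoes the interrupt request's id */
    out.error = supported ? 0 : -ENOSYS;
    return out;
}
```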

> 
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 0e537e580dc1..c604dfcaec26 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
>  		spin_unlock(&fiq->lock);
>  	}
>  	WARN_ON(test_bit(FR_PENDING, &req->flags));
> -	WARN_ON(test_bit(FR_SENT, &req->flags));
>  	if (test_bit(FR_BACKGROUND, &req->flags)) {
>  		spin_lock(&fc->bg_lock);
>  		clear_bit(FR_BACKGROUND, &req->flags);
> @@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
>  			queue_interrupt(req);
>  	}
>  
> -	if (!test_bit(FR_FORCE, &req->flags)) {
> -		/* Only fatal signals may interrupt this */
> -		err = wait_event_killable(req->waitq,
> -					test_bit(FR_FINISHED, &req->flags));
> -		if (!err)
> -			return;
> +	/* Only fatal signals may interrupt this */
> +	err = wait_event_killable(req->waitq,
> +				test_bit(FR_FINISHED, &req->flags));

Trying to do a fatal signal killable wait sounds reasonable. But I am
not sure about the history.

- Why FORCE requests can't do killable wait.
- Why flush needs to have FORCE flag set.

> +	if (!err)
> +		return;
>  
> -		spin_lock(&fiq->lock);
> -		/* Request is not yet in userspace, bail out */
> -		if (test_bit(FR_PENDING, &req->flags)) {
> -			list_del(&req->list);
> -			spin_unlock(&fiq->lock);
> -			__fuse_put_request(req);
> -			req->out.h.error = -EINTR;
> -			return;
> -		}
> +	spin_lock(&fiq->lock);
> +	/* Request is not yet in userspace, bail out */
> +	if (test_bit(FR_PENDING, &req->flags)) {
> +		list_del(&req->list);
>  		spin_unlock(&fiq->lock);
> +		__fuse_put_request(req);
> +		req->out.h.error = -EINTR;
> +		return;
>  	}
> +	spin_unlock(&fiq->lock);
>  
>  	/*
> -	 * Either request is already in userspace, or it was forced.
> -	 * Wait it out.
> +	 * Womp womp. We sent a request to userspace and now we're getting
> +	 * killed.
>  	 */
> -	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> +	set_bit(FR_INTERRUPTED, &req->flags);
> +	/* matches barrier in fuse_dev_do_read() */
> +	smp_mb__after_atomic();
> +	/* request *must* be FR_SENT here, because we ignored FR_PENDING before */
> +	WARN_ON(!test_bit(FR_SENT, &req->flags));
> +	queue_interrupt(req);
>  }
>  
>  static void __fuse_request_send(struct fuse_req *req)
> 
> avaialble as a full patch here:
> https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7
> 
> but now things are even weirder. Tasks are stuck at the killable wait, but with
> a SIGKILL pending for the thread group.

That's strange. No idea what's going on.

Thanks
Vivek
> 
> root@(none):/# cat /proc/187/stack
> [<0>] fuse_simple_request+0x8d9/0x10f0 [fuse]
> [<0>] fuse_flush+0x42f/0x630 [fuse]
> [<0>] filp_close+0x96/0x120
> [<0>] put_files_struct+0x15c/0x2c0
> [<0>] do_exit+0xa00/0x2450
> [<0>] do_group_exit+0xb2/0x2a0
> [<0>] __x64_sys_exit_group+0x35/0x40
> [<0>] do_syscall_64+0x40/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> root@(none):/# cat /proc/187/status
> Name:   main
> Umask:  0022
> State:  S (sleeping)
> Tgid:   187
> Ngid:   0
> Pid:    187
> PPid:   185
> TracerPid:      0
> Uid:    0       0       0       0
> Gid:    0       0       0       0
> FDSize: 0
> Groups:
> NStgid: 187     3
> NSpid:  187     3
> NSpgid: 171     0
> NSsid:  160     0
> Threads:        1
> SigQ:   0/6706
> SigPnd: 0000000000000000
> ShdPnd: 0000000000000100
> SigBlk: 0000000000000000
> SigIgn: 0000000180004002
> SigCgt: 0000000000000000
> CapInh: 0000000000000000
> CapPrm: 000001ffffffffff
> CapEff: 000001ffffffffff
> CapBnd: 000001ffffffffff
> CapAmb: 0000000000000000
> NoNewPrivs:     0
> Seccomp:        0
> Seccomp_filters:        0
> Speculation_Store_Bypass:       thread vulnerable
> SpeculationIndirectBranch:      conditional enabled
> Cpus_allowed:   f
> Cpus_allowed_list:      0-3
> Mems_allowed:   00000000,00000001
> Mems_allowed_list:      0
> voluntary_ctxt_switches:        6
> nonvoluntary_ctxt_switches:     1
> 
> Any ideas what's going on here? It also seems I'm not the first person to
> wonder about this:
> https://sourceforge.net/p/fuse/mailman/fuse-devel/thread/CAMp4zn9zTA_A2GJiYo5AD9V5HpeXbzzMP%3DnF0WtwbxRbV3koNA%40mail.gmail.com/#msg36598753
> 
> Thanks,
> 
> Tycho
>
Tycho Andersen June 23, 2022, 11:41 p.m. UTC | #2
On Thu, Jun 23, 2022 at 05:55:20PM -0400, Vivek Goyal wrote:
> So in this case single process is client as well as server. IOW, one
> thread is fuse server servicing fuse requests and other thread is fuse
> client accessing fuse filesystem?

Yes. Probably an abuse of the API and something people Should Not Do,
but as you say the kernel still shouldn't lock up like this.

> > since the thread has a copy of
> > the fd table with an fd pointing to the same fuse device, the reference
> > count isn't decremented to zero in fuse_dev_release(), and the task hangs
> > forever.
> 
> So why did fuse server thread stop responding to fuse messages. Why
> did it not complete flush.

In this particular case I think it's because the application crashed
for unrelated reasons and tried to exit the pidns, hitting this
problem.

> BTW, unkillable wait happens on ly fc->no_interrupt = 1. And this seems
> to be set only if server probably some previous interrupt request
> returned -ENOSYS.
> 
> fuse_dev_do_write() {
>                 else if (oh.error == -ENOSYS)
>                         fc->no_interrupt = 1;
> }
> 
> So a simple workaround might be for server to implement support for
> interrupting requests.

Yes, but that is the libfuse default IIUC.

> Having said that, this does sounds like a problem and probably should
> be fixed at kernel level.
> 
> > 
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 0e537e580dc1..c604dfcaec26 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
> >  		spin_unlock(&fiq->lock);
> >  	}
> >  	WARN_ON(test_bit(FR_PENDING, &req->flags));
> > -	WARN_ON(test_bit(FR_SENT, &req->flags));
> >  	if (test_bit(FR_BACKGROUND, &req->flags)) {
> >  		spin_lock(&fc->bg_lock);
> >  		clear_bit(FR_BACKGROUND, &req->flags);
> > @@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
> >  			queue_interrupt(req);
> >  	}
> >  
> > -	if (!test_bit(FR_FORCE, &req->flags)) {
> > -		/* Only fatal signals may interrupt this */
> > -		err = wait_event_killable(req->waitq,
> > -					test_bit(FR_FINISHED, &req->flags));
> > -		if (!err)
> > -			return;
> > +	/* Only fatal signals may interrupt this */
> > +	err = wait_event_killable(req->waitq,
> > +				test_bit(FR_FINISHED, &req->flags));
> 
> Trying to do a fatal signal killable wait sounds reasonable. But I am
> not sure about the history.
> 
> - Why FORCE requests can't do killable wait.
> - Why flush needs to have FORCE flag set.

args->force implies a few other things besides this killable wait in
fuse_simple_request(), most notably:

req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);

and

__set_bit(FR_WAITING, &req->flags);

seems like it probably can be invoked from some non-user/atomic
context somehow?

> > +	if (!err)
> > +		return;
> >  
> > -		spin_lock(&fiq->lock);
> > -		/* Request is not yet in userspace, bail out */
> > -		if (test_bit(FR_PENDING, &req->flags)) {
> > -			list_del(&req->list);
> > -			spin_unlock(&fiq->lock);
> > -			__fuse_put_request(req);
> > -			req->out.h.error = -EINTR;
> > -			return;
> > -		}
> > +	spin_lock(&fiq->lock);
> > +	/* Request is not yet in userspace, bail out */
> > +	if (test_bit(FR_PENDING, &req->flags)) {
> > +		list_del(&req->list);
> >  		spin_unlock(&fiq->lock);
> > +		__fuse_put_request(req);
> > +		req->out.h.error = -EINTR;
> > +		return;
> >  	}
> > +	spin_unlock(&fiq->lock);
> >  
> >  	/*
> > -	 * Either request is already in userspace, or it was forced.
> > -	 * Wait it out.
> > +	 * Womp womp. We sent a request to userspace and now we're getting
> > +	 * killed.
> >  	 */
> > -	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> > +	set_bit(FR_INTERRUPTED, &req->flags);
> > +	/* matches barrier in fuse_dev_do_read() */
> > +	smp_mb__after_atomic();
> > +	/* request *must* be FR_SENT here, because we ignored FR_PENDING before */
> > +	WARN_ON(!test_bit(FR_SENT, &req->flags));
> > +	queue_interrupt(req);
> >  }
> >  
> >  static void __fuse_request_send(struct fuse_req *req)
> > 
> > avaialble as a full patch here:
> > https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7
> > 
> > but now things are even weirder. Tasks are stuck at the killable wait, but with
> > a SIGKILL pending for the thread group.
> 
> That's strange. No idea what's going on.

Thanks for taking a look. This is where it falls apart for me. In
principle the patch seems simple, but this sleeping behavior is beyond
my understanding.

Tycho
Vivek Goyal June 24, 2022, 5:36 p.m. UTC | #3
On Thu, Jun 23, 2022 at 05:41:17PM -0600, Tycho Andersen wrote:
> On Thu, Jun 23, 2022 at 05:55:20PM -0400, Vivek Goyal wrote:
> > So in this case single process is client as well as server. IOW, one
> > thread is fuse server servicing fuse requests and other thread is fuse
> > client accessing fuse filesystem?
> 
> Yes. Probably an abuse of the API and something people Should Not Do,
> but as you say the kernel still shouldn't lock up like this.
> 
> > > since the thread has a copy of
> > > the fd table with an fd pointing to the same fuse device, the reference
> > > count isn't decremented to zero in fuse_dev_release(), and the task hangs
> > > forever.
> > 
> > So why did fuse server thread stop responding to fuse messages. Why
> > did it not complete flush.
> 
> In this particular case I think it's because the application crashed
> for unrelated reasons and tried to exit the pidns, hitting this
> problem.
> 
> > BTW, unkillable wait happens on ly fc->no_interrupt = 1. And this seems
> > to be set only if server probably some previous interrupt request
> > returned -ENOSYS.
> > 
> > fuse_dev_do_write() {
> >                 else if (oh.error == -ENOSYS)
> >                         fc->no_interrupt = 1;
> > }
> > 
> > So a simple workaround might be for server to implement support for
> > interrupting requests.
> 
> Yes, but that is the libfuse default IIUC.

Looking at the libfuse code: for the low-level API interface, it looks
like the generic code itself will take care of this (without needing
support from the filesystem).

libfuse/lib/fuse_lowlevel.c

do_interrupt().

> 
> > Having said that, this does sounds like a problem and probably should
> > be fixed at kernel level.
> > 
> > > 
> > > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > > index 0e537e580dc1..c604dfcaec26 100644
> > > --- a/fs/fuse/dev.c
> > > +++ b/fs/fuse/dev.c
> > > @@ -297,7 +297,6 @@ void fuse_request_end(struct fuse_req *req)
> > >  		spin_unlock(&fiq->lock);
> > >  	}
> > >  	WARN_ON(test_bit(FR_PENDING, &req->flags));
> > > -	WARN_ON(test_bit(FR_SENT, &req->flags));
> > >  	if (test_bit(FR_BACKGROUND, &req->flags)) {
> > >  		spin_lock(&fc->bg_lock);
> > >  		clear_bit(FR_BACKGROUND, &req->flags);
> > > @@ -381,30 +380,33 @@ static void request_wait_answer(struct fuse_req *req)
> > >  			queue_interrupt(req);
> > >  	}
> > >  
> > > -	if (!test_bit(FR_FORCE, &req->flags)) {
> > > -		/* Only fatal signals may interrupt this */
> > > -		err = wait_event_killable(req->waitq,
> > > -					test_bit(FR_FINISHED, &req->flags));
> > > -		if (!err)
> > > -			return;
> > > +	/* Only fatal signals may interrupt this */
> > > +	err = wait_event_killable(req->waitq,
> > > +				test_bit(FR_FINISHED, &req->flags));
> > 
> > Trying to do a fatal signal killable wait sounds reasonable. But I am
> > not sure about the history.
> > 
> > - Why FORCE requests can't do killable wait.
> > - Why flush needs to have FORCE flag set.
> 
> args->force implies a few other things besides this killable wait in
> fuse_simple_request(), most notably:
> 
> req = fuse_request_alloc(fm, GFP_KERNEL | __GFP_NOFAIL);
> 
> and
> 
> __set_bit(FR_WAITING, &req->flags);

FR_WAITING stuff is common between both types of requests. We set it
in fuse_get_req() as well, which is called for non-force requests.

So there seem to be only two key differences:

- We allocate the request with __GFP_NOFAIL for force, so we don't
  want the memory allocation to fail.

- And this special-casing of the non-killable wait.

Miklos probably will have more thoughts on this. 

Thanks
Vivek

> 
> seems like it probably can be invoked from some non-user/atomic
> context somehow?
> 
> > > +	if (!err)
> > > +		return;
> > >  
> > > -		spin_lock(&fiq->lock);
> > > -		/* Request is not yet in userspace, bail out */
> > > -		if (test_bit(FR_PENDING, &req->flags)) {
> > > -			list_del(&req->list);
> > > -			spin_unlock(&fiq->lock);
> > > -			__fuse_put_request(req);
> > > -			req->out.h.error = -EINTR;
> > > -			return;
> > > -		}
> > > +	spin_lock(&fiq->lock);
> > > +	/* Request is not yet in userspace, bail out */
> > > +	if (test_bit(FR_PENDING, &req->flags)) {
> > > +		list_del(&req->list);
> > >  		spin_unlock(&fiq->lock);
> > > +		__fuse_put_request(req);
> > > +		req->out.h.error = -EINTR;
> > > +		return;
> > >  	}
> > > +	spin_unlock(&fiq->lock);
> > >  
> > >  	/*
> > > -	 * Either request is already in userspace, or it was forced.
> > > -	 * Wait it out.
> > > +	 * Womp womp. We sent a request to userspace and now we're getting
> > > +	 * killed.
> > >  	 */
> > > -	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
> > > +	set_bit(FR_INTERRUPTED, &req->flags);
> > > +	/* matches barrier in fuse_dev_do_read() */
> > > +	smp_mb__after_atomic();
> > > +	/* request *must* be FR_SENT here, because we ignored FR_PENDING before */
> > > +	WARN_ON(!test_bit(FR_SENT, &req->flags));
> > > +	queue_interrupt(req);
> > >  }
> > >  
> > >  static void __fuse_request_send(struct fuse_req *req)
> > > 
> > > avaialble as a full patch here:
> > > https://github.com/tych0/linux/commit/81b9ff4c8c1af24f6544945da808dbf69a1293f7
> > > 
> > > but now things are even weirder. Tasks are stuck at the killable wait, but with
> > > a SIGKILL pending for the thread group.
> > 
> > That's strange. No idea what's going on.
> 
> Thanks for taking a look. This is where it falls apart for me. In
> principle the patch seems simple, but this sleeping behavior is beyond
> my understanding.
> 
> Tycho
>
Miklos Szeredi July 11, 2022, 10:35 a.m. UTC | #4
On Thu, 23 Jun 2022 at 19:21, Tycho Andersen <tycho@tycho.pizza> wrote:

>         /*
> -        * Either request is already in userspace, or it was forced.
> -        * Wait it out.
> +        * Womp womp. We sent a request to userspace and now we're getting
> +        * killed.
>          */
> -       wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));

You can't remove this, it's a crucial part of fuse request handling.
Yes, it causes pain, but making *sent* requests killable is a lot more
work.

For one: need to duplicate the caller's locking state (i_rwsem, ...) and
move the request into a background queue instead of just finishing it
off immediately, so that the shadow locking can be torn down when the
reply actually arrives.  This affects a lot of requests.

Or we could special case FUSE_FLUSH, which doesn't have any locking.

The reason force=true is needed for FUSE_FLUSH is that it affects
posix lock state.  Not waiting for the reply if the task is killed
could have observable consequences, but my guess is that it's an
uninteresting corner case and would not cause regressions in real
life.

Can you try the attached untested patch?

Thanks,
Miklos
Miklos Szeredi July 11, 2022, 1:59 p.m. UTC | #5
On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> Can you try the attached untested patch?

Updated patch to avoid use after free on req->args.

Still mostly untested.

Thanks,
Miklos
Tycho Andersen July 11, 2022, 8:25 p.m. UTC | #6
Hi all,

On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > Can you try the attached untested patch?
> 
> Updated patch to avoid use after free on req->args.
> 
> Still mostly untested.

Thanks, when I applied your patch, I still ended up with tasks stuck
waiting with a SIGKILL pending. So I looked into that and came up with
the patch below. With both your patch and mine, my testcase exits
cleanly.

Eric (or Christian, or anyone), can you comment on the patch below? I
have no idea what this will break. Maybe instead a better approach is
some additional special case in __send_signal_locked()?

Tycho

From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
From: Tycho Andersen <tycho@tycho.pizza>
Date: Mon, 11 Jul 2022 11:26:58 -0600
Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
 signals

The wait_* code uses signal_pending_state() to test whether a thread has
been interrupted, which ultimately uses __fatal_signal_pending() to detect
if there is a fatal signal.

When a pid ns dies, in zap_pid_ns_processes() it does:

    group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);

for all the tasks in the pid ns. That calls through:

    group_send_sig_info() ->
      do_send_sig_info() ->
        send_signal_locked() ->
          __send_signal_locked()

which does:

    pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;

which puts SIGKILL in the set of shared signals, but not the individual
pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
operation), they won't see this shared signal and will hang forever, since
TIF_SIGPENDING is set but the fatal signal can't be detected.

Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
---
 include/linux/sched/signal.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index cafbe03eed01..a033ccb0a729 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)
 
 static inline int __fatal_signal_pending(struct task_struct *p)
 {
-	return unlikely(sigismember(&p->pending.signal, SIGKILL));
+	return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
+			sigismember(&p->signal->shared_pending.signal, SIGKILL));
 }
 
 static inline int fatal_signal_pending(struct task_struct *p)

base-commit: 32346491ddf24599decca06190ebca03ff9de7f8
Eric W. Biederman July 11, 2022, 9:37 p.m. UTC | #7
Tycho Andersen <tycho@tycho.pizza> writes:

> Hi all,
>
> On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
>> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> >
>> > Can you try the attached untested patch?
>> 
>> Updated patch to avoid use after free on req->args.
>> 
>> Still mostly untested.
>
> Thanks, when I applied your patch, I still ended up with tasks stuck
> waiting with a SIGKILL pending. So I looked into that and came up with
> the patch below. With both your patch and mine, my testcase exits
> cleanly.
>
> Eric (or Christian, or anyone), can you comment on the patch below? I
> have no idea what this will break. Maybe instead a better approach is
> some additional special case in __send_signal_locked()?
>
> Tycho
>
> From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
> From: Tycho Andersen <tycho@tycho.pizza>
> Date: Mon, 11 Jul 2022 11:26:58 -0600
> Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
>  signals
>
> The wait_* code uses signal_pending_state() to test whether a thread has
> been interrupted, which ultimately uses __fatal_signal_pending() to detect
> if there is a fatal signal.
>
> When a pid ns dies, in zap_pid_ns_processes() it does:
>
>     group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>
> for all the tasks in the pid ns. That calls through:
>
>     group_send_sig_info() ->
>       do_send_sig_info() ->
>         send_signal_locked() ->
>           __send_signal_locked()
>
> which does:
>
>     pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>
> which puts sigkill in the set of shared signals, but not the individual
> pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
> operation), they won't see this shared signal, and will hang forever, since
> TIF_SIGPENDING is set, but the fatal signal can't be detected.

Hmm.

That is perplexing.

__send_signal_locked calls complete_signal.  Then if any of the tasks of
the process can receive the signal, complete_signal will loop through
all of the tasks of the process and set the per thread SIGKILL.  Pretty
much by definition tasks can always receive SIGKILL.

Is complete_signal not being able to do that?

The patch below really should not be necessary, and I have pending
work that, if I can push it over the finish line, will make it not
even make sense.

As it stands, it is an abuse to use the per-thread SIGKILL to indicate
that a fatal signal has been short-circuit delivered. That abuse, as
well as being unclean, tends to confuse people reading the code.

Eric

> Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
> ---
>  include/linux/sched/signal.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index cafbe03eed01..a033ccb0a729 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -402,7 +402,8 @@ static inline int signal_pending(struct task_struct *p)
>  
>  static inline int __fatal_signal_pending(struct task_struct *p)
>  {
> -	return unlikely(sigismember(&p->pending.signal, SIGKILL));
> +	return unlikely(sigismember(&p->pending.signal, SIGKILL) ||
> +			sigismember(&p->signal->shared_pending.signal, SIGKILL));
>  }
>  
>  static inline int fatal_signal_pending(struct task_struct *p)
>
> base-commit: 32346491ddf24599decca06190ebca03ff9de7f8
Tycho Andersen July 11, 2022, 10:53 p.m. UTC | #8
On Mon, Jul 11, 2022 at 04:37:12PM -0500, Eric W. Biederman wrote:
> Tycho Andersen <tycho@tycho.pizza> writes:
> 
> > Hi all,
> >
> > On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
> >> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <miklos@szeredi.hu> wrote:
> >> >
> >> > Can you try the attached untested patch?
> >> 
> >> Updated patch to avoid use after free on req->args.
> >> 
> >> Still mostly untested.
> >
> > Thanks, when I applied your patch, I still ended up with tasks stuck
> > waiting with a SIGKILL pending. So I looked into that and came up with
> > the patch below. With both your patch and mine, my testcase exits
> > cleanly.
> >
> > Eric (or Christian, or anyone), can you comment on the patch below? I
> > have no idea what this will break. Maybe instead a better approach is
> > some additional special case in __send_signal_locked()?
> >
> > Tycho
> >
> > From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
> > From: Tycho Andersen <tycho@tycho.pizza>
> > Date: Mon, 11 Jul 2022 11:26:58 -0600
> > Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
> >  signals
> >
> > The wait_* code uses signal_pending_state() to test whether a thread has
> > been interrupted, which ultimately uses __fatal_signal_pending() to detect
> > if there is a fatal signal.
> >
> > When a pid ns dies, in zap_pid_ns_processes() it does:
> >
> >     group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> >
> > for all the tasks in the pid ns. That calls through:
> >
> >     group_send_sig_info() ->
> >       do_send_sig_info() ->
> >         send_signal_locked() ->
> >           __send_signal_locked()
> >
> > which does:
> >
> >     pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
> >
> > which puts sigkill in the set of shared signals, but not the individual
> > pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
> > operation), they won't see this shared signal, and will hang forever, since
> > TIF_SIGPENDING is set, but the fatal signal can't be detected.
> 
> Hmm.
> 
> That is perplexing.

Thanks for taking a look.

> __send_signal_locked calls complete_signal.  Then if any of the tasks of
> the process can receive the signal, complete_signal will loop through
> all of the tasks of the process and set the per thread SIGKILL.  Pretty
> much by definition tasks can always receive SIGKILL.
> 
> Is complete_signal not being able to do that?

In my specific case it was because my testcase was already trying to
exit and had set PF_EXITING when the signal was delivered, so
complete_signal() was indeed not able to do that, since PF_EXITING is
checked before SIGKILL in wants_signal().

But I changed my testcase to sleep instead of exit, and I get the
same hang behavior, even though complete_signal() does add SIGKILL to
the set. So there's something else going on there...

> The patch below really should not be necessary, and I have pending work
> that if I can push over the finish line won't even make sense.
> 
> As it is currently an abuse to use the per thread SIGKILL to indicate
> that a fatal signal has been short circuit delivered.  That abuse as
> well as being unclean tends to confuse people reading the code.

How close is your work? I'm wondering if it's worth investigating the
non-PF_EXITING case further, or if we should just land this since it
fixes the PF_EXITING case as well. Or maybe just do something like
this in addition:

diff --git a/kernel/signal.c b/kernel/signal.c
index 6f86fda5e432..0f71dfb1c3d2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -982,12 +982,12 @@ static inline bool wants_signal(int sig, struct task_struct *p)
        if (sigismember(&p->blocked, sig))
                return false;

-       if (p->flags & PF_EXITING)
-               return false;
-
        if (sig == SIGKILL)
                return true;

+       if (p->flags & PF_EXITING)
+               return false;
+
        if (task_is_stopped_or_traced(p))
                return false;

?

Tycho
Eric W. Biederman July 11, 2022, 11:06 p.m. UTC | #9
Tycho Andersen <tycho@tycho.pizza> writes:

> On Mon, Jul 11, 2022 at 04:37:12PM -0500, Eric W. Biederman wrote:
>> Tycho Andersen <tycho@tycho.pizza> writes:
>> 
>> > Hi all,
>> >
>> > On Mon, Jul 11, 2022 at 03:59:15PM +0200, Miklos Szeredi wrote:
>> >> On Mon, 11 Jul 2022 at 12:35, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> >> >
>> >> > Can you try the attached untested patch?
>> >> 
>> >> Updated patch to avoid use after free on req->args.
>> >> 
>> >> Still mostly untested.
>> >
>> > Thanks, when I applied your patch, I still ended up with tasks stuck
>> > waiting with a SIGKILL pending. So I looked into that and came up with
>> > the patch below. With both your patch and mine, my testcase exits
>> > cleanly.
>> >
>> > Eric (or Christian, or anyone), can you comment on the patch below? I
>> > have no idea what this will break. Maybe instead a better approach is
>> > some additional special case in __send_signal_locked()?
>> >
>> > Tycho
>> >
>> > From b7ea26adcf3546be5745063cc86658acb5ed37e9 Mon Sep 17 00:00:00 2001
>> > From: Tycho Andersen <tycho@tycho.pizza>
>> > Date: Mon, 11 Jul 2022 11:26:58 -0600
>> > Subject: [PATCH] sched: __fatal_signal_pending() should also check shared
>> >  signals
>> >
>> > The wait_* code uses signal_pending_state() to test whether a thread has
>> > been interrupted, which ultimately uses __fatal_signal_pending() to detect
>> > if there is a fatal signal.
>> >
>> > When a pid ns dies, in zap_pid_ns_processes() it does:
>> >
>> >     group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
>> >
>> > for all the tasks in the pid ns. That calls through:
>> >
>> >     group_send_sig_info() ->
>> >       do_send_sig_info() ->
>> >         send_signal_locked() ->
>> >           __send_signal_locked()
>> >
>> > which does:
>> >
>> >     pending = (type != PIDTYPE_PID) ? &t->signal->shared_pending : &t->pending;
>> >
>> > which puts SIGKILL in the set of shared signals, but not the individual

>> > pending ones. If tasks are stuck in a killable wait (e.g. a fuse flush
>> > operation), they won't see this shared signal, and will hang forever, since
>> > TIF_SIGPENDING is set, but the fatal signal can't be detected.
>> 
>> Hmm.
>> 
>> That is perplexing.
>
> Thanks for taking a look.
>
>> __send_signal_locked calls complete_signal.  Then if any of the tasks of
>> the process can receive the signal, complete_signal will loop through
>> all of the tasks of the process and set the per thread SIGKILL.  Pretty
>> much by definition tasks can always receive SIGKILL.
>> 
>> Is complete_signal not being able to do that?
>
> In my specific case it was because my testcase was already trying to
> exit and had set PF_EXITING when the signal is delivered, so
> complete_signal() was indeed not able to do that since PF_EXITING is
> checked before SIGKILL in wants_signal().
>
> But I changed my testcase to sleep instead of exit, and I get the
> same hang behavior, even though complete_signal() does add SIGKILL to
> the set. So there's something else going on there...
>
>> The patch below really should not be necessary, and I have pending work
>> that if I can push over the finish line won't even make sense.
>> 
>> As it is currently an abuse to use the per thread SIGKILL to indicate
>> that a fatal signal has been short circuit delivered.  That abuse as
>> well as being unclean tends to confuse people reading the code.
>
> How close is your work? I'm wondering if it's worth investigating the
> non-PF_EXITING case further, or if we should just land this since it
> fixes the PF_EXITING case as well. Or maybe just do something like
> this in addition:

It is not different enough to change the semantics.  What I am aiming
for is having a dedicated flag indicating a task will exit, that
fatal_signal_pending can check.  And I intend to make that flag one way
so that once it is set it will never be cleared.

Which sort of argues against your patch below.

It most definitely does not sort out the sleep case.


The other thing I have played with that might be relevant was removing
the explicit wait in zap_pid_ns_processes and simply not allowing wait
to reap the pid namespace init until all its children had been reaped.
Essentially how we deal with the thread group leader for ordinary
processes.  Does that sound like it might help in the fuse case?

I am not certain how far out such a change would be (it has been a
couple of years since I played with implementing it), but it could land
much sooner if it demonstrably breaks some deadlocks and generally makes
the kernel easier to maintain.

Eric
Tycho Andersen July 12, 2022, 1:43 p.m. UTC | #10
On Mon, Jul 11, 2022 at 06:06:21PM -0500, Eric W. Biederman wrote:
> Tycho Andersen <tycho@tycho.pizza> writes:
> It is not different enough to change the semantics.  What I am aiming
> for is having a dedicated flag indicating a task will exit, that
> fatal_signal_pending can check.  And I intend to make that flag one way
> so that once it is set it will never be cleared.

Ok - how far out is that? I'd like to try to convince Miklos to land
the fuse part of this fix now, but without the "look at shared signals
too" patch, that fix is useless. I'm not married to my patch, but I
would like to get this fixed somehow soon.

> The other thing I have played with that might be relevant was removing
> the explicit wait in zap_pid_ns_processes and simply not allowing wait
> to reap the pid namespace init until all its children had been reaped.
> Essentially how we deal with the thread group leader for ordinary
> processes.  Does that sound like it might help in the fuse case?

No, the problem is that the wait code doesn't know to look in the
right place, so waiting later still won't help.

Tycho
Eric W. Biederman July 12, 2022, 2:34 p.m. UTC | #11
Tycho Andersen <tycho@tycho.pizza> writes:

> On Mon, Jul 11, 2022 at 06:06:21PM -0500, Eric W. Biederman wrote:
>> Tycho Andersen <tycho@tycho.pizza> writes:
>> It is not different enough to change the semantics.  What I am aiming
>> for is having a dedicated flag indicating a task will exit, that
>> fatal_signal_pending can check.  And I intend to make that flag one way
>> so that once it is set it will never be cleared.
>
> Ok - how far out is that? I'd like to try to convince Miklos to land
> the fuse part of this fix now, but without the "look at shared signals
> too" patch, that fix is useless. I'm not married to my patch, but I
> would like to get this fixed somehow soon.

My point is that we need to figure out why you need the look at shared
signals.

If I can get everything reviewed my changes will be in the next merge
window (it unfortunately always takes longer to get the code reviewed
than I would like).

However, when my changes land does not matter.  What you are trying to
solve is orthogonal to my ongoing work.

The problem is that looking at shared signals is fundamentally broken.
A case in point is that kernel threads can have a pending SIGKILL that
is not a fatal signal, since kernel threads are allowed to ignore or
even handle SIGKILL.

If you want to change fatal_signal_pending to include PF_EXITING I
would need to double-check the implications, but I think that would
work, and it would avoid the problems that come with including the
shared pending state of SIGKILL.

>> The other thing I have played with that might be relevant was removing
>> the explicit wait in zap_pid_ns_processes and simply not allowing wait
>> to reap the pid namespace init until all its children had been reaped.
>> Essentially how we deal with the thread group leader for ordinary
>> processes.  Does that sound like it might help in the fuse case?
>
> No, the problem is that the wait code doesn't know to look in the
> right place, so waiting later still won't help.

I was suggesting to modify the kernel so that zap_pid_ns_processes would
not wait for the zapped processes.  Instead I was proposing that
delay_group_leader called from wait_consider_task would simply refuse to
allow the init process of a pid namespace to be reaped until every other
process of that pid namespace had exited.

You can prototype how that would affect the deadlock by simply removing
the waiting from zap_pid_ns_processes.

I suggest that simply because that has the potential to remove some of
the strange pid namespace cases.

I don't understand the problematic interaction between pid namespace
shutdown and the fuse daemon, so I am merely suggesting a possibility
that I know can simplify pid namespace shutdown.

Something like:

diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index f4f8cb0435b4..d22a30b0b0cf 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -207,47 +207,6 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	read_unlock(&tasklist_lock);
 	rcu_read_unlock();
 
-	/*
-	 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
-	 * kernel_wait4() will also block until our children traced from the
-	 * parent namespace are detached and become EXIT_DEAD.
-	 */
-	do {
-		clear_thread_flag(TIF_SIGPENDING);
-		rc = kernel_wait4(-1, NULL, __WALL, NULL);
-	} while (rc != -ECHILD);
-
-	/*
-	 * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
-	 * process whose parents processes are outside of the pid
-	 * namespace.  Such processes are created with setns()+fork().
-	 *
-	 * If those EXIT_ZOMBIE processes are not reaped by their
-	 * parents before their parents exit, they will be reparented
-	 * to pid_ns->child_reaper.  Thus pidns->child_reaper needs to
-	 * stay valid until they all go away.
-	 *
-	 * The code relies on the pid_ns->child_reaper ignoring
-	 * SIGCHILD to cause those EXIT_ZOMBIE processes to be
-	 * autoreaped if reparented.
-	 *
-	 * Semantically it is also desirable to wait for EXIT_ZOMBIE
-	 * processes before allowing the child_reaper to be reaped, as
-	 * that gives the invariant that when the init process of a
-	 * pid namespace is reaped all of the processes in the pid
-	 * namespace are gone.
-	 *
-	 * Once all of the other tasks are gone from the pid_namespace
-	 * free_pid() will awaken this task.
-	 */
-	for (;;) {
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (pid_ns->pid_allocated == init_pids)
-			break;
-		schedule();
-	}
-	__set_current_state(TASK_RUNNING);
-
 	if (pid_ns->reboot)
 		current->signal->group_exit_code = pid_ns->reboot;
 

Eric
Tycho Andersen July 12, 2022, 3:14 p.m. UTC | #12
On Tue, Jul 12, 2022 at 09:34:50AM -0500, Eric W. Biederman wrote:
> Tycho Andersen <tycho@tycho.pizza> writes:
> 
> > On Mon, Jul 11, 2022 at 06:06:21PM -0500, Eric W. Biederman wrote:
> >> Tycho Andersen <tycho@tycho.pizza> writes:
> >> It is not different enough to change the semantics.  What I am aiming
> >> for is having a dedicated flag indicating a task will exit, that
> >> fatal_signal_pending can check.  And I intend to make that flag one way
> >> so that once it is set it will never be cleared.
> >
> > Ok - how far out is that? I'd like to try to convince Miklos to land
> > the fuse part of this fix now, but without the "look at shared signals
> > too" patch, that fix is useless. I'm not married to my patch, but I
> > would like to get this fixed somehow soon.
> 
> My point is that we need to figure out why you need the look at shared
> signals.

At least in the case where the task was already exiting, it's because
complete_signal() never wakes it up.

> If I can get everything reviewed my changes will be in the next merge
> window (it unfortunately always takes longer to get the code reviewed
> than I would like).
> 
> However when my changes land does not matter.  What you are trying to
> solve is orthogonal of my on-going work.
> 
> The problem is that looking at shared signals is fundamentally broken.
> A case in point is that kernel threads can have a pending SIGKILL that
> is not a fatal signal.  As kernel threads are allowed to ignore or even
> handle SIGKILL.
> 
> If you want to change fatal_signal_pending to include PF_EXITING I would
> need to double check the implications but I think that would work, and
> would not have the problems including the shared pending state of
> SIGKILL.

I think that would work. I'll test it out, thanks.

> >> The other thing I have played with that might be relevant was removing
> >> the explicit wait in zap_pid_ns_processes and simply not allowing wait
> >> to reap the pid namespace init until all its children had been reaped.
> >> Essentially how we deal with the thread group leader for ordinary
> >> processes.  Does that sound like it might help in the fuse case?
> >
> > No, the problem is that the wait code doesn't know to look in the
> > right place, so waiting later still won't help.
> 
> I was suggesting to modify the kernel so that zap_pid_ns_processes would
> not wait for the zapped processes.  Instead I was proposing that
> delay_group_leader called from wait_consider_task would simply refuse to
> allow the init process of a pid namespace to be reaped until every other
> process of that pid namespace had exited.
> 
> You can prototype how that would affect the deadlock by simply removing
> the waiting from zap_pid_ns_processes.
> 
> I suggest that simply because that has the potential to remove some of
> the strange pid namespace cases.
> 
> I don't understand the problematic interaction between pid namespace
> shutdown and the fuse daemon, so I am merely suggesting a possibility
> that I know can simplify pid namespace shutdown.
> 
> Something like:
> 
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index f4f8cb0435b4..d22a30b0b0cf 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -207,47 +207,6 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	read_unlock(&tasklist_lock);
>  	rcu_read_unlock();
>  
> -	/*
> -	 * Reap the EXIT_ZOMBIE children we had before we ignored SIGCHLD.
> -	 * kernel_wait4() will also block until our children traced from the
> -	 * parent namespace are detached and become EXIT_DEAD.
> -	 */
> -	do {
> -		clear_thread_flag(TIF_SIGPENDING);
> -		rc = kernel_wait4(-1, NULL, __WALL, NULL);
> -	} while (rc != -ECHILD);
> -
> -	/*
> -	 * kernel_wait4() misses EXIT_DEAD children, and EXIT_ZOMBIE
> -	 * process whose parents processes are outside of the pid
> -	 * namespace.  Such processes are created with setns()+fork().
> -	 *
> -	 * If those EXIT_ZOMBIE processes are not reaped by their
> -	 * parents before their parents exit, they will be reparented
> -	 * to pid_ns->child_reaper.  Thus pidns->child_reaper needs to
> -	 * stay valid until they all go away.
> -	 *
> -	 * The code relies on the pid_ns->child_reaper ignoring
> -	 * SIGCHILD to cause those EXIT_ZOMBIE processes to be
> -	 * autoreaped if reparented.
> -	 *
> -	 * Semantically it is also desirable to wait for EXIT_ZOMBIE
> -	 * processes before allowing the child_reaper to be reaped, as
> -	 * that gives the invariant that when the init process of a
> -	 * pid namespace is reaped all of the processes in the pid
> -	 * namespace are gone.
> -	 *
> -	 * Once all of the other tasks are gone from the pid_namespace
> -	 * free_pid() will awaken this task.
> -	 */
> -	for (;;) {
> -		set_current_state(TASK_INTERRUPTIBLE);
> -		if (pid_ns->pid_allocated == init_pids)
> -			break;
> -		schedule();
> -	}
> -	__set_current_state(TASK_RUNNING);
> -
>  	if (pid_ns->reboot)
>  		current->signal->group_exit_code = pid_ns->reboot;

Yes, but we need to add the wait to delay_group_leader(), and if the
tasks are stuck indefinitely looking at the wrong condition, I don't
see how moving it will help resolve things.

Thanks,

Tycho
diff mbox series

Patch

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 0e537e580dc1..c604dfcaec26 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -297,7 +297,6 @@  void fuse_request_end(struct fuse_req *req)
 		spin_unlock(&fiq->lock);
 	}
 	WARN_ON(test_bit(FR_PENDING, &req->flags));
-	WARN_ON(test_bit(FR_SENT, &req->flags));
 	if (test_bit(FR_BACKGROUND, &req->flags)) {
 		spin_lock(&fc->bg_lock);
 		clear_bit(FR_BACKGROUND, &req->flags);
@@ -381,30 +380,33 @@  static void request_wait_answer(struct fuse_req *req)
 			queue_interrupt(req);
 	}
 
-	if (!test_bit(FR_FORCE, &req->flags)) {
-		/* Only fatal signals may interrupt this */
-		err = wait_event_killable(req->waitq,
-					test_bit(FR_FINISHED, &req->flags));
-		if (!err)
-			return;
+	/* Only fatal signals may interrupt this */
+	err = wait_event_killable(req->waitq,
+				test_bit(FR_FINISHED, &req->flags));
+	if (!err)
+		return;
 
-		spin_lock(&fiq->lock);
-		/* Request is not yet in userspace, bail out */
-		if (test_bit(FR_PENDING, &req->flags)) {
-			list_del(&req->list);
-			spin_unlock(&fiq->lock);
-			__fuse_put_request(req);
-			req->out.h.error = -EINTR;
-			return;
-		}
+	spin_lock(&fiq->lock);
+	/* Request is not yet in userspace, bail out */
+	if (test_bit(FR_PENDING, &req->flags)) {
+		list_del(&req->list);
 		spin_unlock(&fiq->lock);
+		__fuse_put_request(req);
+		req->out.h.error = -EINTR;
+		return;
 	}
+	spin_unlock(&fiq->lock);
 
 	/*
-	 * Either request is already in userspace, or it was forced.
-	 * Wait it out.
+	 * Womp womp. We sent a request to userspace and now we're getting
+	 * killed.
 	 */
-	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+	set_bit(FR_INTERRUPTED, &req->flags);
+	/* matches barrier in fuse_dev_do_read() */
+	smp_mb__after_atomic();
+	/* request *must* be FR_SENT here, because we ignored FR_PENDING before */
+	WARN_ON(!test_bit(FR_SENT, &req->flags));
+	queue_interrupt(req);
 }
 
 static void __fuse_request_send(struct fuse_req *req)