
[1/2] bcache: ignore pending signals in bcache_device_init()

Message ID 20200302093450.48016-2-colyli@suse.de (mailing list archive)
State New, archived
Series bcache patches for Linux v5.6-rc5

Commit Message

Coly Li March 2, 2020, 9:34 a.m. UTC
When a cache device and a cached device are registered simultaneously
and register_cache() acquires bch_register_lock first, register_bdev()
has to wait until register_cache() finishes, which might take a very
long time.

If the registration is triggered from udev rules at system boot time,
and is not completed before the udev timeout (180s by default), the
registration process will be killed by udevd. Then the following calls
to kthread_run() or kthread_create() will fail due to the pending
signal (this is how they are implemented at the moment).

This is bad for boot time, because it means a cache device with a
large amount of cached data will always fail at boot, just because it
spends too much time checking its internal metadata (btree and dirty
sectors).

The failure of cache device registration is solved by previous
patches, but the failure due to timeout also exists in cached device
registration. As the above text explains, cached device registration
may also time out if it is blocked by a timed-out cache device
registration process. Then in the following code path,
    bioset_init() <= bcache_device_init() <= cached_dev_init() <=
    register_bdev() <= register_bcache()
bioset_init() will fail because internally kthread_create() will fail
due to the pending signal in the following code path,
    bioset_init() => alloc_workqueue() => init_rescuer() =>
    kthread_create()

Maybe fixing kthread_create() and kthread_run() is a better method, but
at this moment a fast workaround is to flush pending signals before
calling bioset_init() in bcache_device_init().

This patch calls flush_signals() in bcache_device_init() if there is
pending signal for current process. It avoids bcache registration
failure in system boot up time due to bcache udev rule timeout.

Signed-off-by: Coly Li <colyli@suse.de>
---
 drivers/md/bcache/super.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
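
For context on the failure mode described above: kthread_create() waits
for kthreadd with a killable wait, so a caller that already has a fatal
signal pending (for example udevd's SIGKILL after its event timeout) gets
ERR_PTR(-EINTR) back. An illustrative fragment, not part of this patch
(noop_fn and try_create_helper_thread are made-up names), showing how the
error surfaces to a caller such as the registration path:

	#include <linux/err.h>
	#include <linux/kthread.h>
	#include <linux/printk.h>
	#include <linux/sched.h>

	static int noop_fn(void *data)
	{
		return 0;
	}

	/* With a fatal signal pending on the calling task, the killable wait
	 * inside kthread_create() is interrupted and -EINTR is returned, so
	 * bioset_init() and the whole registration fail in turn. */
	static int try_create_helper_thread(void)
	{
		struct task_struct *t = kthread_create(noop_fn, NULL, "bch_example");

		if (IS_ERR(t)) {
			pr_err("kthread_create() failed: %ld\n", PTR_ERR(t));
			return PTR_ERR(t);
		}
		wake_up_process(t);
		return 0;
	}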

Comments

Michal Hocko March 2, 2020, 12:27 p.m. UTC | #1
[Cc Oleg]

On Mon 02-03-20 17:34:49, Coly Li wrote:
> When a cache device and a cached device are registered simultaneously
> and register_cache() acquires bch_register_lock first, register_bdev()
> has to wait until register_cache() finishes, which might take a very
> long time.
> 
> If the registration is triggered from udev rules at system boot time,
> and is not completed before the udev timeout (180s by default), the
> registration process will be killed by udevd. Then the following calls
> to kthread_run() or kthread_create() will fail due to the pending
> signal (this is how they are implemented at the moment).
> 
> This is bad for boot time, because it means a cache device with a
> large amount of cached data will always fail at boot, just because it
> spends too much time checking its internal metadata (btree and dirty
> sectors).
> 
> The failure of cache device registration is solved by previous
> patches, but the failure due to timeout also exists in cached device
> registration. As the above text explains, cached device registration
> may also time out if it is blocked by a timed-out cache device
> registration process. Then in the following code path,
>     bioset_init() <= bcache_device_init() <= cached_dev_init() <=
>     register_bdev() <= register_bcache()
> bioset_init() will fail because internally kthread_create() will fail
> due to the pending signal in the following code path,
>     bioset_init() => alloc_workqueue() => init_rescuer() =>
>     kthread_create()
> 
> Maybe fixing kthread_create() and kthread_run() is a better method, but
> at this moment a fast workaround is to flush pending signals before
> calling bioset_init() in bcache_device_init().

I cannot really comment on the bcache part because I am not familiar
with the code. It is quite surprising to see an initialization taking
that long though.

Anyway

> This patch calls flush_signals() in bcache_device_init() if there is
> pending signal for current process. It avoids bcache registration
> failure in system boot up time due to bcache udev rule timeout.

this sounds like a wrong way to address the issue. Killing the udev
worker is a userspace policy and the kernel shouldn't simply ignore it.
Is there any problem to simply increase the timeout on the system which
uses a large bcache?

Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
drivers land and I have really hard time to understand their purpose.
What is the actual valid usage of this function? Should we somehow
document it?

> Signed-off-by: Coly Li <colyli@suse.de>
> ---
>  drivers/md/bcache/super.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index 0c3c5419c52b..e8bbd4f171ca 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -850,6 +850,18 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>  	if (idx < 0)
>  		return idx;
>  
> +	/*
> +	 * There is a timeout in udevd, if the bcache device is registering
> +	 * by udev rules, and not completed in time, the udevd may kill the
> +	 * registration process. In this condition, there will be pending
> +	 * signal here and cause bioset_init() failed for internally creating
> +	 * its kthread. Here the registration should ignore the timeout and
> +	 * continue, it is safe to ignore the pending signal and avoid to
> +	 * fail bcache registration in boot up time.
> +	 */
> +	if (signal_pending(current))
> +		flush_signals(current);
> +
>  	if (bioset_init(&d->bio_split, 4, offsetof(struct bbio, bio),
>  			BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER))
>  		goto err;
> -- 
> 2.16.4
Coly Li March 2, 2020, 1:29 p.m. UTC | #2
On 2020/3/2 8:27 下午, Michal Hocko wrote:
> [Cc Oleg]
> 
> On Mon 02-03-20 17:34:49, Coly Li wrote:
>> When a cache device and a cached device are registered simultaneously
>> and register_cache() acquires bch_register_lock first, register_bdev()
>> has to wait until register_cache() finishes, which might take a very
>> long time.
>>
>> If the registration is triggered from udev rules at system boot time,
>> and is not completed before the udev timeout (180s by default), the
>> registration process will be killed by udevd. Then the following calls
>> to kthread_run() or kthread_create() will fail due to the pending
>> signal (this is how they are implemented at the moment).
>>
>> This is bad for boot time, because it means a cache device with a
>> large amount of cached data will always fail at boot, just because it
>> spends too much time checking its internal metadata (btree and dirty
>> sectors).
>>
>> The failure of cache device registration is solved by previous
>> patches, but the failure due to timeout also exists in cached device
>> registration. As the above text explains, cached device registration
>> may also time out if it is blocked by a timed-out cache device
>> registration process. Then in the following code path,
>>     bioset_init() <= bcache_device_init() <= cached_dev_init() <=
>>     register_bdev() <= register_bcache()
>> bioset_init() will fail because internally kthread_create() will fail
>> due to the pending signal in the following code path,
>>     bioset_init() => alloc_workqueue() => init_rescuer() =>
>>     kthread_create()
>>
>> Maybe fixing kthread_create() and kthread_run() is a better method, but
>> at this moment a fast workaround is to flush pending signals before
>> calling bioset_init() in bcache_device_init().
> 

Hi Michal,

> I cannot really comment on the bcache part because I am not familiar
> with the code. It is quite surprising to see an initialization taking
> that long though.
> 

Back when bcache was merged into the Linux mainline about 10 years ago,
checking the metadata of a 120G SSD was fast. But now an 8TB SSD is quite
common on servers... so the problem appears.

> Anyway
> 
>> This patch calls flush_signals() in bcache_device_init() if there is
>> pending signal for current process. It avoids bcache registration
>> failure in system boot up time due to bcache udev rule timeout.
> 
> this sounds like a wrong way to address the issue. Killing the udev
> worker is a userspace policy and the kernel shouldn't simply ignore it.

Indeed the bcache registering process cannot be killed, because a mutex
lock (bch_register_lock) is held during the entire registration operation.

In my testing, the kthread_run()/kthread_create() failure caused by the
pending signal happens after all metadata checking has finished, which is
55 minutes later. No matter whether the registration succeeds or fails,
the time taken is the same.

Since the udev timeout kill is useless anyway, why not make the
registration succeed? This is what the patch does.

> Is there any problem to simply increase the timeout on the system which
> uses a large bcache?
> 

At this moment, this is a workaround. Christoph Hellwig also suggests
fixing kthread_run()/kthread_create(). Now I am looking for a method to
distinguish, inside kthread_run()/kthread_create(), whether the parent
process was killed by the OOM killer or by another process, but the
solution is not clear to me yet.

When the metadata size is around 40GB, registering the cache device takes
around 55 minutes on my machine with the current Linux kernel. I have a
patch to reduce the time to around 7~8 minutes, but that is still too
long. I may add a timeout in the bcache udev rule, for example 10
minutes, but as cache devices get larger and larger, the timeout will
eventually not be enough.

As I mentioned, this is a workaround to fix the problem for now. Fixing
kthread_run()/kthread_create() may take me a longer time. If there is a
hint on how to do it, please offer it.

Thanks.

Coly Li


> Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
> drivers land and I have really hard time to understand their purpose.
> What is the actual valid usage of this function? Should we somehow
> document it?
> 
>> Signed-off-by: Coly Li <colyli@suse.de>
>> ---
>>  drivers/md/bcache/super.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
>> index 0c3c5419c52b..e8bbd4f171ca 100644
>> --- a/drivers/md/bcache/super.c
>> +++ b/drivers/md/bcache/super.c
>> @@ -850,6 +850,18 @@ static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
>>  	if (idx < 0)
>>  		return idx;
>>  
>> +	/*
>> +	 * There is a timeout in udevd, if the bcache device is registering
>> +	 * by udev rules, and not completed in time, the udevd may kill the
>> +	 * registration process. In this condition, there will be pending
>> +	 * signal here and cause bioset_init() failed for internally creating
>> +	 * its kthread. Here the registration should ignore the timeout and
>> +	 * continue, it is safe to ignore the pending signal and avoid to
>> +	 * fail bcache registration in boot up time.
>> +	 */
>> +	if (signal_pending(current))
>> +		flush_signals(current);
>> +
>>  	if (bioset_init(&d->bio_split, 4, offsetof(struct bbio, bio),
>>  			BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER))
>>  		goto err;
>> -- 
>> 2.16.4
>
Michal Hocko March 2, 2020, 1:40 p.m. UTC | #3
On Mon 02-03-20 21:29:09, Coly Li wrote:
[...]
> > I cannot really comment on the bcache part because I am not familiar
> > with the code. It is quite surprising to see an initialization taking
> > that long though.
> > 
> 
> Back to the time 10 years ago when bcache merged into Linux mainline,
> checking meta data for a 120G SSD was fast. But now an 8TB SSD is quite
> common on server... So the problem appears.

Does all that work have to happen synchronously from the kworker context?
Is it possible for some of the initialization to be done more lazily or
in the background?

> > Anyway
> > 
> >> This patch calls flush_signals() in bcache_device_init() if there is
> >> pending signal for current process. It avoids bcache registration
> >> failure in system boot up time due to bcache udev rule timeout.
> > 
> > this sounds like a wrong way to address the issue. Killing the udev
> > worker is a userspace policy and the kernel shouldn't simply ignore it.
> 
> Indeed the bcache registering process cannot be killed, because a mutex
> lock (bch_register_lock) is held during all the registration operation.
> 
> In my testing, kthread_run()/kthread_create() failure by pending signal
> happens after all metadata checking finished, that's 55 minutes later.
> No mater the registration successes or fails, the time length is same.
> 
> Once the udev timeout killing is useless, why not make the registration
> to success ? This is what the patch does.

I cannot really comment for the systemd part but it is quite unexpected
for it to have signals ignored completely.

> > Is there any problem to simply increase the timeout on the system which
> > uses a large bcache?
> > 
> 
> At this moment, this is a workaround. Christoph Hellwig also suggests to
> fix kthread_run()/kthread_create(). Now I am looking for method to
> distinct that the parent process is killed by OOM killer and not by
> other processes in kthread_run()/kthread_create(), but the solution is
> not clear to me yet.

It is really hard to comment on this because I do not have sufficient
insight, but in general: the oom victim context can be checked by
tsk_is_oom_victim, but kernel threads are not subject to the oom killer
because they do not own any address space. I also suspect that none of
the data you allocate for the cache is accounted to any specific
process.

> When meta-data size is around 40GB, registering cache device will take
> around 55 minutes on my machine for current Linux kernel. I have patch
> to reduce the time to around 7~8 minutes but still too long. I may add a
> timeout in bcache udev rule for example 10 munites, but when the cache
> device get large and large, the timeout will be not enough eventually.
> 
> As I mentioned, this is a workaround to fix the problem now. Fixing
> kthread_run()/kthread_create() may take longer time for me. If there is
> hint to make it, please offer me.

My main question is why there is any need to touch the kernel code. You
can still update the systemd/udev timeout AFAIK. This would be the
proper workaround from my (admittedly limited) POV.
Oleg Nesterov March 2, 2020, 1:49 p.m. UTC | #4
On 03/02, Michal Hocko wrote:
>
> I cannot really comment on the bcache part because I am not familiar
> with the code.

same here...

> > This patch calls flush_signals() in bcache_device_init() if there is
> > pending signal for current process. It avoids bcache registration
> > failure in system boot up time due to bcache udev rule timeout.
>
> this sounds like a wrong way to address the issue. Killing the udev
> worker is a userspace policy and the kernel shouldn't simply ignore it.

Agreed. If nothing else, if a userspace process has a pending SIGKILL
then flush_signals() is very wrong.

> Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
> drivers land and I have really hard time to understand their purpose.

Heh. I bet most if not all users of flush_signals() are simply wrong.

> What is the actual valid usage of this function?

I think it should die... It was used by kthreads, but today
signal_pending() == T is only possible if the kthread does allow_signal(),
and in this case it should probably use kernel_dequeue_signal().


Say, io_sq_thread(). Why does it do

		if (signal_pending(current))
			flush_signals(current);

afaics this kthread doesn't use allow_signal/allow_kernel_signal, this
means that signal_pending() must be impossible even if this kthread sleeps
in TASK_INTERRUPTIBLE state. Add Jens.

Oleg.
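
A minimal sketch of the pattern Oleg describes (illustrative only, not
from any patch in this thread): a kthread opts in to a single signal with
allow_signal() and consumes it with kernel_dequeue_signal(), instead of
calling flush_signals() and dropping it on the floor:

	#include <linux/kthread.h>
	#include <linux/sched/signal.h>
	#include <linux/signal.h>
	#include <linux/wait.h>

	static DECLARE_WAIT_QUEUE_HEAD(example_wq);
	static bool example_work_ready;

	static int example_kthread(void *data)
	{
		allow_signal(SIGINT);	/* kthreads get no signals unless they opt in */

		while (!kthread_should_stop()) {
			wait_event_interruptible(example_wq,
						 example_work_ready ||
						 kthread_should_stop());

			if (signal_pending(current)) {
				/* dequeue and inspect the signal we allowed */
				int sig = kernel_dequeue_signal();

				if (sig == SIGINT)
					break;
			}
			/* ... do the actual work here ... */
		}
		return 0;
	}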
Jens Axboe March 2, 2020, 3:01 p.m. UTC | #5
On 3/2/20 5:27 AM, Michal Hocko wrote:
> [Cc Oleg]
> 
> On Mon 02-03-20 17:34:49, Coly Li wrote:
>> When a cache device and a cached device are registered simultaneously
>> and register_cache() acquires bch_register_lock first, register_bdev()
>> has to wait until register_cache() finishes, which might take a very
>> long time.
>>
>> If the registration is triggered from udev rules at system boot time,
>> and is not completed before the udev timeout (180s by default), the
>> registration process will be killed by udevd. Then the following calls
>> to kthread_run() or kthread_create() will fail due to the pending
>> signal (this is how they are implemented at the moment).
>>
>> This is bad for boot time, because it means a cache device with a
>> large amount of cached data will always fail at boot, just because it
>> spends too much time checking its internal metadata (btree and dirty
>> sectors).
>>
>> The failure of cache device registration is solved by previous
>> patches, but the failure due to timeout also exists in cached device
>> registration. As the above text explains, cached device registration
>> may also time out if it is blocked by a timed-out cache device
>> registration process. Then in the following code path,
>>     bioset_init() <= bcache_device_init() <= cached_dev_init() <=
>>     register_bdev() <= register_bcache()
>> bioset_init() will fail because internally kthread_create() will fail
>> due to the pending signal in the following code path,
>>     bioset_init() => alloc_workqueue() => init_rescuer() =>
>>     kthread_create()
>>
>> Maybe fixing kthread_create() and kthread_run() is a better method, but
>> at this moment a fast workaround is to flush pending signals before
>> calling bioset_init() in bcache_device_init().
> 
> I cannot really comment on the bcache part because I am not familiar
> with the code. It is quite surprising to see an initialization taking
> that long though.
> 
> Anyway
> 
>> This patch calls flush_signals() in bcache_device_init() if there is
>> pending signal for current process. It avoids bcache registration
>> failure in system boot up time due to bcache udev rule timeout.
> 
> this sounds like a wrong way to address the issue. Killing the udev
> worker is a userspace policy and the kernel shouldn't simply ignore it.
> Is there any problem to simply increase the timeout on the system which
> uses a large bcache?

On top of that, what if signals were sent for reasons other than just to
terminate it? Flushing a fatal signal from "some task" seems bad enough
on its own, but we could be losing others as well.

Coly, this seems like a very bad idea. And the same goes for the
existing flush_signals() in bcache. It's just not the right way to deal
with it, and it could be causing other issues.
Coly Li March 2, 2020, 5:06 p.m. UTC | #6
On 2020/3/2 9:40 下午, Michal Hocko wrote:
> On Mon 02-03-20 21:29:09, Coly Li wrote:
> [...]
>>> I cannot really comment on the bcache part because I am not familiar
>>> with the code. It is quite surprising to see an initialization taking
>>> that long though.
>>>
>>
>> Back to the time 10 years ago when bcache merged into Linux mainline,
>> checking meta data for a 120G SSD was fast. But now an 8TB SSD is quite
>> common on server... So the problem appears.
> 
> Does all that work has to happen synchronously from the kworker context?
> Is it possible some of the initialization to be done more lazily or in
> the background?
> 

The registration is done by a user space process, not a kthread/kworker.
The kthread created here is for later run time purposes, as part of the
initialization. The problem is, if there is a pending signal, creating
the kthread will fail and return -EINTR to the caller, and then the whole
registration will fail.

And all the work has to be completed before the kernel code (entered by a
write syscall via the sysfs interface) returns to the user space process.
After the registration process finishes, the bcache disk file like
/dev/bcache0 will show up for the following mount. A delayed lazy method
is not feasible here.

>>> Anyway
>>>
>>>> This patch calls flush_signals() in bcache_device_init() if there is
>>>> pending signal for current process. It avoids bcache registration
>>>> failure in system boot up time due to bcache udev rule timeout.
>>>
>>> this sounds like a wrong way to address the issue. Killing the udev
>>> worker is a userspace policy and the kernel shouldn't simply ignore it.
>>
>> Indeed the bcache registering process cannot be killed, because a mutex
>> lock (bch_register_lock) is held during all the registration operation.
>>
>> In my testing, kthread_run()/kthread_create() failure by pending signal
>> happens after all metadata checking finished, that's 55 minutes later.
>> No mater the registration successes or fails, the time length is same.
>>
>> Once the udev timeout killing is useless, why not make the registration
>> to success ? This is what the patch does.
> 
> I cannot really comment for the systemd part but it is quite unexpected
> for it to have signals ignored completely.
> 

I see. But so far I don't have a better solution to fix this problem.
Asking people to do extra configuration of udev rules is very tricky;
most common users will be scared. I need to get it fixed with no extra
configuration.


>>> Is there any problem to simply increase the timeout on the system which
>>> uses a large bcache?
>>>
>>
>> At this moment, this is a workaround. Christoph Hellwig also suggests to
>> fix kthread_run()/kthread_create(). Now I am looking for method to
>> distinct that the parent process is killed by OOM killer and not by
>> other processes in kthread_run()/kthread_create(), but the solution is
>> not clear to me yet.
> 
> It is really hard to comment on this because I do not have a sufficient
> insight but in genereal. The oom victim context can be checked by
> tsk_is_oom_victim but kernel threads are subject of the oom killer
> because they do not own any address space. I also suspect that none of
> the data you allocate for the cache is accounted per any specific
> process.

You are right, the cached data is not bound to a process, it is bound to
specific backing block devices.

In my case, kthread_run()/kthread_create() is called in the context of
the registration process (/lib/udev/bcache-register), so it is
unnecessary to worry about the kthread address space. So maybe I can
check tsk_is_oom_victim to judge whether the current process is being
killed by the OOM killer, rather than simply calling signal_pending()?

> 
>> When meta-data size is around 40GB, registering cache device will take
>> around 55 minutes on my machine for current Linux kernel. I have patch
>> to reduce the time to around 7~8 minutes but still too long. I may add a
>> timeout in bcache udev rule for example 10 munites, but when the cache
>> device get large and large, the timeout will be not enough eventually.
>>
>> As I mentioned, this is a workaround to fix the problem now. Fixing
>> kthread_run()/kthread_create() may take longer time for me. If there is
>> hint to make it, please offer me.
> 
> My main question is why there is any need to touch the kernel code. You
> can still update the systemd/udev timeout AFAIK. This would be the
> proper workaround from my (admittedly limited) POV.
> 

I see your concern. But the udev timeout is global for all udev rules; I
am not sure whether changing it to a very long time is good... (indeed I
tried to add event_timeout=3600 but I can still see the signal received).

Ignoring the pending signal in the bcache registering code is the only
method I currently know of to avoid bcache auto-register failure at boot
time. If there is another way I can achieve the same goal, I'd like to
try it.

BTW, in the meantime, I am still looking for the reason why
event_timeout=3600 in /etc/udev/udev.conf does not take effect...
Coly Li March 2, 2020, 5:16 p.m. UTC | #7
On 2020/3/2 9:49 下午, Oleg Nesterov wrote:
> On 03/02, Michal Hocko wrote:
>>
>> I cannot really comment on the bcache part because I am not familiar
>> with the code.
> 
> same here...
> 
>>> This patch calls flush_signals() in bcache_device_init() if there is
>>> pending signal for current process. It avoids bcache registration
>>> failure in system boot up time due to bcache udev rule timeout.
>>
>> this sounds like a wrong way to address the issue. Killing the udev
>> worker is a userspace policy and the kernel shouldn't simply ignore it.
> 
> Agreed. If nothing else, if a userspace process has pending SIKILL then
> flush_signals() is very wrong.
> 
>> Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
>> drivers land and I have really hard time to understand their purpose.
> 
> Heh. I bet most if not all users of flush_signals() are simply wrong.
> 
>> What is the actual valid usage of this function?
> 
> I thinks it should die... It was used by kthreads, but today
> signal_pending() == T is only possible if kthread does allow_signal(),
> and in this case it should probably use kernel_dequeue_signal().
> 
> 
> Say, io_sq_thread(). Why does it do
> 
> 		if (signal_pending(current))
> 			flush_signals(current);
> 
> afaics this kthread doesn't use allow_signal/allow_kernel_signal, this
> means that signal_pending() must be impossible even if this kthread sleeps
> in TASK_INTERRUPTIBLE state. Add Jens.

Hi Oleg,

Can I use disallow_signal() before the registration begins and
allow_signal() after the registration is done? Is this a proper way to
ignore the signal sent by udevd on timeout?

For me the above method seems to solve my problem too.

Thanks.
Jens Axboe March 2, 2020, 5:19 p.m. UTC | #8
On 3/2/20 10:16 AM, Coly Li wrote:
> On 2020/3/2 9:49 下午, Oleg Nesterov wrote:
>> On 03/02, Michal Hocko wrote:
>>>
>>> I cannot really comment on the bcache part because I am not familiar
>>> with the code.
>>
>> same here...
>>
>>>> This patch calls flush_signals() in bcache_device_init() if there is
>>>> pending signal for current process. It avoids bcache registration
>>>> failure in system boot up time due to bcache udev rule timeout.
>>>
>>> this sounds like a wrong way to address the issue. Killing the udev
>>> worker is a userspace policy and the kernel shouldn't simply ignore it.
>>
>> Agreed. If nothing else, if a userspace process has pending SIKILL then
>> flush_signals() is very wrong.
>>
>>> Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
>>> drivers land and I have really hard time to understand their purpose.
>>
>> Heh. I bet most if not all users of flush_signals() are simply wrong.
>>
>>> What is the actual valid usage of this function?
>>
>> I thinks it should die... It was used by kthreads, but today
>> signal_pending() == T is only possible if kthread does allow_signal(),
>> and in this case it should probably use kernel_dequeue_signal().
>>
>>
>> Say, io_sq_thread(). Why does it do
>>
>> 		if (signal_pending(current))
>> 			flush_signals(current);
>>
>> afaics this kthread doesn't use allow_signal/allow_kernel_signal, this
>> means that signal_pending() must be impossible even if this kthread sleeps
>> in TASK_INTERRUPTIBLE state. Add Jens.
> 
> Hi Oleg,
> 
> Can I use disallow_signal() before the registration begins and use
> allow_signal() after the registration done. Is this a proper way to
> ignore the signal sent by udevd for timeout ?
> 
> For me the above method seems to solve my problem too.

Really seems to me like you're going about this all wrong. The issue is
that systemd is killing the startup, because it's taking too long. Don't
try and work around that, ensure the timeout is appropriate.

What if someone else tried to kill the startup? It'd be pretty
frustrating that it was impossible, just because signals were blocked or
flushed. The assumption that systemd is the ONLY task that would want to
kill it is flawed.
Michal Hocko March 2, 2020, 5:28 p.m. UTC | #9
On Tue 03-03-20 01:06:38, Coly Li wrote:
> On 2020/3/2 9:40 下午, Michal Hocko wrote:
[...]
> > I cannot really comment for the systemd part but it is quite unexpected
> > for it to have signals ignored completely.
> > 
> 
> I see. But so far I don't have better solution to fix this problem.
> Asking people to do extra configure to udev rules is very tricky, most
> of common users will be scared. I need to get it fixed by no-extra
> configuration.

Well, I believe that having explicit documentation that a large cache
requires more time to initialize is quite a reasonable way forward. We
already have similar situations with memory hotplug on extra large
machines with a small block size.
 
> >>> Is there any problem to simply increase the timeout on the system which
> >>> uses a large bcache?
> >>>
> >>
> >> At this moment, this is a workaround. Christoph Hellwig also suggests to
> >> fix kthread_run()/kthread_create(). Now I am looking for method to
> >> distinct that the parent process is killed by OOM killer and not by
> >> other processes in kthread_run()/kthread_create(), but the solution is
> >> not clear to me yet.
> > 
> > It is really hard to comment on this because I do not have a sufficient
> > insight but in genereal. The oom victim context can be checked by
> > tsk_is_oom_victim but kernel threads are subject of the oom killer
> > because they do not own any address space. I also suspect that none of
> > the data you allocate for the cache is accounted per any specific
> > process.
> 
> You are right, the cached data is not bonded to process, it is bonded to
> specific backing block devices.
> 
> In my case, kthread_run()/kthread_create() is called in context of
> registration process (/lib/udev/bcache-register), so it is unnecessary
> to worry about kthread address space. So maybe I can check
> tsk_is_oom_victim to judge whether current process is killing by OOM
> killer other then simply calling pending_signal() ?

What exactly are you going to do about this situation? If you want to
bail out then you should simply do that on any pending fatal signal. Why
is OOM special?

> >> When meta-data size is around 40GB, registering cache device will take
> >> around 55 minutes on my machine for current Linux kernel. I have patch
> >> to reduce the time to around 7~8 minutes but still too long. I may add a
> >> timeout in bcache udev rule for example 10 munites, but when the cache
> >> device get large and large, the timeout will be not enough eventually.
> >>
> >> As I mentioned, this is a workaround to fix the problem now. Fixing
> >> kthread_run()/kthread_create() may take longer time for me. If there is
> >> hint to make it, please offer me.
> > 
> > My main question is why there is any need to touch the kernel code. You
> > can still update the systemd/udev timeout AFAIK. This would be the
> > proper workaround from my (admittedly limited) POV.
> > 
> 
> I see your concern. But the udev timeout is global for all udev rules, I
> am not sure whether change it to a very long time is good ... (indeed I
> tried to add event_timeout=3600 but I can still see the signal received).

All processes which can finish in the default timeout are not going to
regress when the forceful termination happens later. I cannot really
comment on the systemd part because I am not familiar with its internals,
but the mere existence of the timeout suggests that the workflow should
be prepared for longer timeouts, because initialization taking a long
time is something we have to live with. Some things just take too long to
initialize.
 
> Ignore the pending signal in bcache registering code is the only method
> currently I know to avoid bcache auto-register failure in boot time. If
> there is other way I can achieve the same goal, I'd like to try.

The problem is that you are effectively violating signal delivery
semantics because you are ignoring a user signal without userspace ever
noticing. That to me sounds like a free ticket to all sorts of
hard-to-debug problems. What if the userspace expects the signal to be
handled? Really, you want to increase the existing timeout to work around
it. If you can make the initialization process more swift then even
better.

> BTW, by the mean time, I am still looking for the reason why
> event_timeout=3600 in /etc/udev/udev.conf does not take effect...

I have a vague recollection that there are multiple timeouts and
setting only some is not sufficient.
Coly Li March 2, 2020, 5:32 p.m. UTC | #10
On 2020/3/3 1:19 上午, Jens Axboe wrote:
> On 3/2/20 10:16 AM, Coly Li wrote:
>> On 2020/3/2 9:49 下午, Oleg Nesterov wrote:
>>> On 03/02, Michal Hocko wrote:
>>>>
>>>> I cannot really comment on the bcache part because I am not familiar
>>>> with the code.
>>>
>>> same here...
>>>
>>>>> This patch calls flush_signals() in bcache_device_init() if there is
>>>>> pending signal for current process. It avoids bcache registration
>>>>> failure in system boot up time due to bcache udev rule timeout.
>>>>
>>>> this sounds like a wrong way to address the issue. Killing the udev
>>>> worker is a userspace policy and the kernel shouldn't simply ignore it.
>>>
>>> Agreed. If nothing else, if a userspace process has pending SIKILL then
>>> flush_signals() is very wrong.
>>>
>>>> Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
>>>> drivers land and I have really hard time to understand their purpose.
>>>
>>> Heh. I bet most if not all users of flush_signals() are simply wrong.
>>>
>>>> What is the actual valid usage of this function?
>>>
>>> I thinks it should die... It was used by kthreads, but today
>>> signal_pending() == T is only possible if kthread does allow_signal(),
>>> and in this case it should probably use kernel_dequeue_signal().
>>>
>>>
>>> Say, io_sq_thread(). Why does it do
>>>
>>> 		if (signal_pending(current))
>>> 			flush_signals(current);
>>>
>>> afaics this kthread doesn't use allow_signal/allow_kernel_signal, this
>>> means that signal_pending() must be impossible even if this kthread sleeps
>>> in TASK_INTERRUPTIBLE state. Add Jens.
>>
>> Hi Oleg,
>>
>> Can I use disallow_signal() before the registration begins and use
>> allow_signal() after the registration done. Is this a proper way to
>> ignore the signal sent by udevd for timeout ?
>>
>> For me the above method seems to solve my problem too.
> 
> Really seems to me like you're going about this all wrong. The issue is
> that systemd is killing the startup, because it's taking too long. Don't
> try and work around that, ensure the timeout is appropriate.
> 

Copied. Then let me try to make event_timeout work on my udevd. If
it works without other side effects, I will revert the existing
flush_signals() patches.

> What if someone else tried to kill the startup? It'd be pretty
> frustrating that it was impossible, just because signals were blocked or
> flushed. The assumption that systemd is the ONLY task that would want to
> kill it is flawed.
> 

Indeed, the bcache registration currently cannot be killed. I guess it
is because of the mutex lock held during the metadata checking.
Sure, I will look at how to extend the udevd timeout value now, and ask
for help later.

Thanks.
Coly Li March 2, 2020, 5:47 p.m. UTC | #11
On 2020/3/3 1:28 上午, Michal Hocko wrote:
> On Tue 03-03-20 01:06:38, Coly Li wrote:
>> On 2020/3/2 9:40 下午, Michal Hocko wrote:
> [...]
>>> I cannot really comment for the systemd part but it is quite unexpected
>>> for it to have signals ignored completely.
>>>
>>
>> I see. But so far I don't have better solution to fix this problem.
>> Asking people to do extra configure to udev rules is very tricky, most
>> of common users will be scared. I need to get it fixed by no-extra
>> configuration.
> 
> Well, I believe that having an explicit documentation that large cache
> requires more tim to initialize is quite reasonable way forward. We
> already do have similar situations with memory hotplug on extra large
> machines with a small block size.
>  
>>>>> Is there any problem to simply increase the timeout on the system which
>>>>> uses a large bcache?
>>>>>
>>>>
>>>> At this moment, this is a workaround. Christoph Hellwig also suggests to
>>>> fix kthread_run()/kthread_create(). Now I am looking for method to
>>>> distinct that the parent process is killed by OOM killer and not by
>>>> other processes in kthread_run()/kthread_create(), but the solution is
>>>> not clear to me yet.
>>>
>>> It is really hard to comment on this because I do not have a sufficient
>>> insight but in genereal. The oom victim context can be checked by
>>> tsk_is_oom_victim but kernel threads are subject of the oom killer
>>> because they do not own any address space. I also suspect that none of
>>> the data you allocate for the cache is accounted per any specific
>>> process.
>>
>> You are right, the cached data is not bonded to process, it is bonded to
>> specific backing block devices.
>>
>> In my case, kthread_run()/kthread_create() is called in context of
>> registration process (/lib/udev/bcache-register), so it is unnecessary
>> to worry about kthread address space. So maybe I can check
>> tsk_is_oom_victim to judge whether current process is killing by OOM
>> killer other then simply calling pending_signal() ?
> 
> What exactly are going to do about this situation? If you want to bail
> out then you should simply do that on any pending fatal signal. Why is
> OOM special?
> 

Forget this point; there is no difference between ignoring the signal
inside or outside kthread_create()/kthread_run(), neither is good.

>>>> When meta-data size is around 40GB, registering cache device will take
>>>> around 55 minutes on my machine for current Linux kernel. I have patch
>>>> to reduce the time to around 7~8 minutes but still too long. I may add a
>>>> timeout in bcache udev rule for example 10 munites, but when the cache
>>>> device get large and large, the timeout will be not enough eventually.
>>>>
>>>> As I mentioned, this is a workaround to fix the problem now. Fixing
>>>> kthread_run()/kthread_create() may take longer time for me. If there is
>>>> hint to make it, please offer me.
>>>
>>> My main question is why there is any need to touch the kernel code. You
>>> can still update the systemd/udev timeout AFAIK. This would be the
>>> proper workaround from my (admittedly limited) POV.
>>>
>>
>> I see your concern. But the udev timeout is global for all udev rules, I
>> am not sure whether change it to a very long time is good ... (indeed I
>> tried to add event_timeout=3600 but I can still see the signal received).
> 
> All processes which can finish in the default timeout are not going to
> regress when the forcefull temination happens later. I cannot really
> comment on the systemd part because I am not familiar with internals but
> the mere existence of the timeout suggests that the workflow should be
> prepared for longer timeouts because initialization taking a long time
> is something we have to live with. Something just takes too long to
> initialized.
>  

OK, I am convinced. Documentation work is required then.

>> Ignore the pending signal in bcache registering code is the only method
>> currently I know to avoid bcache auto-register failure in boot time. If
>> there is other way I can achieve the same goal, I'd like to try.
> 
> The problem is that you are effectivelly violating signal delivery
> semantic because you are ignoring user signal without userspace ever
> noticing. That to me sounds like a free ticket to all sorts of
> hard-to-debug problems. What if the userspace expects the signal to be
> handled? Really, you want to increase the existing timeout to workaround
> it. If you can make the initialization process more swift then even
> better.
> 

Copied. Indeed I have more questions here; I will ask in another thread.


>> BTW, by the mean time, I am still looking for the reason why
>> event_timeout=3600 in /etc/udev/udev.conf does not take effect...
> 
> I have a vague recollection that there are mutlitple timeouts and
> setting only some is not sufficient.
> 

Let me try out how to extend udevd timeout.

Thanks for your discussion and suggestion.
Jens Axboe March 2, 2020, 8:33 p.m. UTC | #12
On 3/2/20 10:32 AM, Coly Li wrote:
> On 2020/3/3 1:19 上午, Jens Axboe wrote:
>> On 3/2/20 10:16 AM, Coly Li wrote:
>>> On 2020/3/2 9:49 下午, Oleg Nesterov wrote:
>>>> On 03/02, Michal Hocko wrote:
>>>>>
>>>>> I cannot really comment on the bcache part because I am not familiar
>>>>> with the code.
>>>>
>>>> same here...
>>>>
>>>>>> This patch calls flush_signals() in bcache_device_init() if there is
>>>>>> pending signal for current process. It avoids bcache registration
>>>>>> failure in system boot up time due to bcache udev rule timeout.
>>>>>
>>>>> this sounds like a wrong way to address the issue. Killing the udev
>>>>> worker is a userspace policy and the kernel shouldn't simply ignore it.
>>>>
>>>> Agreed. If nothing else, if a userspace process has pending SIKILL then
>>>> flush_signals() is very wrong.
>>>>
>>>>> Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
>>>>> drivers land and I have really hard time to understand their purpose.
>>>>
>>>> Heh. I bet most if not all users of flush_signals() are simply wrong.
>>>>
>>>>> What is the actual valid usage of this function?
>>>>
>>>> I thinks it should die... It was used by kthreads, but today
>>>> signal_pending() == T is only possible if kthread does allow_signal(),
>>>> and in this case it should probably use kernel_dequeue_signal().
>>>>
>>>>
>>>> Say, io_sq_thread(). Why does it do
>>>>
>>>> 		if (signal_pending(current))
>>>> 			flush_signals(current);
>>>>
>>>> afaics this kthread doesn't use allow_signal/allow_kernel_signal, this
>>>> means that signal_pending() must be impossible even if this kthread sleeps
>>>> in TASK_INTERRUPTIBLE state. Add Jens.
>>>
>>> Hi Oleg,
>>>
>>> Can I use disallow_signal() before the registration begins and use
>>> allow_signal() after the registration done. Is this a proper way to
>>> ignore the signal sent by udevd for timeout ?
>>>
>>> For me the above method seems to solve my problem too.
>>
>> Really seems to me like you're going about this all wrong. The issue is
>> that systemd is killing the startup, because it's taking too long. Don't
>> try and work around that, ensure the timeout is appropriate.
>>
> 
> Copied. Then let me try how to make event_timeout works on my udevd. If
> it works without other side effect, I will revert existing
> flush_signals() patches.

Thanks, this one, right?

commit 0b96da639a4874311e9b5156405f69ef9fc3bef8
Author: Coly Li <colyli@suse.de>
Date:   Thu Feb 13 22:12:05 2020 +0800

    bcache: ignore pending signals when creating gc and allocator thread

because that definitely needs to be reverted.
Coly Li March 3, 2020, 1:08 a.m. UTC | #13
On 2020/3/3 4:33 上午, Jens Axboe wrote:
> On 3/2/20 10:32 AM, Coly Li wrote:
>> On 2020/3/3 1:19 上午, Jens Axboe wrote:
>>> On 3/2/20 10:16 AM, Coly Li wrote:
>>>> On 2020/3/2 9:49 下午, Oleg Nesterov wrote:
>>>>> On 03/02, Michal Hocko wrote:
>>>>>>
>>>>>> I cannot really comment on the bcache part because I am not familiar
>>>>>> with the code.
>>>>>
>>>>> same here...
>>>>>
>>>>>>> This patch calls flush_signals() in bcache_device_init() if there is
>>>>>>> pending signal for current process. It avoids bcache registration
>>>>>>> failure in system boot up time due to bcache udev rule timeout.
>>>>>>
>>>>>> this sounds like a wrong way to address the issue. Killing the udev
>>>>>> worker is a userspace policy and the kernel shouldn't simply ignore it.
>>>>>
>>>>> Agreed. If nothing else, if a userspace process has pending SIKILL then
>>>>> flush_signals() is very wrong.
>>>>>
>>>>>> Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
>>>>>> drivers land and I have really hard time to understand their purpose.
>>>>>
>>>>> Heh. I bet most if not all users of flush_signals() are simply wrong.
>>>>>
>>>>>> What is the actual valid usage of this function?
>>>>>
>>>>> I thinks it should die... It was used by kthreads, but today
>>>>> signal_pending() == T is only possible if kthread does allow_signal(),
>>>>> and in this case it should probably use kernel_dequeue_signal().
>>>>>
>>>>>
>>>>> Say, io_sq_thread(). Why does it do
>>>>>
>>>>> 		if (signal_pending(current))
>>>>> 			flush_signals(current);
>>>>>
>>>>> afaics this kthread doesn't use allow_signal/allow_kernel_signal, this
>>>>> means that signal_pending() must be impossible even if this kthread sleeps
>>>>> in TASK_INTERRUPTIBLE state. Add Jens.
>>>>
>>>> Hi Oleg,
>>>>
>>>> Can I use disallow_signal() before the registration begins and use
>>>> allow_signal() after the registration done. Is this a proper way to
>>>> ignore the signal sent by udevd for timeout ?
>>>>
>>>> For me the above method seems to solve my problem too.
>>>
>>> Really seems to me like you're going about this all wrong. The issue is
>>> that systemd is killing the startup, because it's taking too long. Don't
>>> try and work around that, ensure the timeout is appropriate.
>>>
>>
>> Copied. Then let me try how to make event_timeout works on my udevd. If
>> it works without other side effect, I will revert existing
>> flush_signals() patches.
> 
> Thanks, this one, right?
> 
> commit 0b96da639a4874311e9b5156405f69ef9fc3bef8
> Author: Coly Li <colyli@suse.de>
> Date:   Thu Feb 13 22:12:05 2020 +0800
> 
>     bcache: ignore pending signals when creating gc and allocator thread
> 
> because that definitely needs to be reverted.
> 

Yes, please revert this commit. Thank you.
Guoqing Jiang March 3, 2020, 1:22 a.m. UTC | #14
Hi Coly,

On 3/2/20 6:47 PM, Coly Li wrote:
>> I have a vague recollection that there are mutlitple timeouts and
>> setting only some is not sufficient.
>>
> Let me try out how to extend udevd timeout.

Not sure if this link [1] helps or not, just FYI.

[1] https://github.com/jonathanunderwood/mdraid-safe-timeouts

Thanks,
Guoqing
Coly Li March 3, 2020, 1:30 a.m. UTC | #15
On 2020/3/3 9:22 上午, Guoqing Jiang wrote:
> Hi Coly,
> 
> On 3/2/20 6:47 PM, Coly Li wrote:
>>> I have a vague recollection that there are mutlitple timeouts and
>>> setting only some is not sufficient.
>>>
>> Let me try out how to extend udevd timeout.
> 
> Not sure if this link [1] helps or not, just FYI.
> 
> [1] https://github.com/jonathanunderwood/mdraid-safe-timeouts
> 

Guoqing,

Thanks. Let me look into it :-)
Hannes Reinecke March 3, 2020, 7:22 a.m. UTC | #16
On 3/2/20 6:32 PM, Coly Li wrote:
> On 2020/3/3 1:19 上午, Jens Axboe wrote:
[ ... ]
>> What if someone else tried to kill the startup? It'd be pretty
>> frustrating that it was impossible, just because signals were blocked or
>> flushed. The assumption that systemd is the ONLY task that would want to
>> kill it is flawed.
>>
> 
> Indeed now the bcache registration can not be killed. I guess it is
> because the mutex lock held during the metadata checking.
> Sure I will look at how to extend udevd timeout value now, and ask for
> help later.
> 
Please check if the bcache registration really needs to be synchronous.
What exactly do you gain from having a return value (which, seeing
that's a simple write() to a sysfs attribute, is probably ignored anyway)?
To my understanding, the bcache registration process creates the actual
bcache device, which in itself will generate a udev event.
So the existence of the bcache device will already signal a successful
completion of the registration process.
Hence it should be possible to shift the actual bcache registration to a
kworker, and have the 'write()' call complete once schedule_work() (or an
equivalent) has been called.
That would circumvent the problem, no?

Cheers,

Hannes
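
A rough sketch of what Hannes suggests (hypothetical names, not the
actual bcache code): the sysfs write only copies its input and queues a
work item, the slow metadata check runs later from a workqueue, and
completion is signalled by the uevent for the new bcache device:

	#include <linux/kobject.h>
	#include <linux/slab.h>
	#include <linux/string.h>
	#include <linux/workqueue.h>

	/* hypothetical container for what register_bcache() needs */
	struct bch_async_reg {
		struct work_struct work;
		char *path;
	};

	static void bch_register_worker(struct work_struct *work)
	{
		struct bch_async_reg *reg =
			container_of(work, struct bch_async_reg, work);

		/* the long-running part (metadata check, kthread/bioset setup,
		 * device creation) would run here, outside the sysfs write */
		kfree(reg->path);
		kfree(reg);
	}

	static ssize_t register_store(struct kobject *k, struct kobj_attribute *a,
				      const char *buf, size_t size)
	{
		struct bch_async_reg *reg = kzalloc(sizeof(*reg), GFP_KERNEL);

		if (!reg)
			return -ENOMEM;
		reg->path = kstrndup(buf, size, GFP_KERNEL);
		if (!reg->path) {
			kfree(reg);
			return -ENOMEM;
		}
		INIT_WORK(&reg->work, bch_register_worker);
		schedule_work(&reg->work);
		return size;	/* write() returns before registration finishes */
	}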
Michal Hocko March 3, 2020, 8:05 a.m. UTC | #17
On Mon 02-03-20 14:49:19, Oleg Nesterov wrote:
> On 03/02, Michal Hocko wrote:
> >
> > I cannot really comment on the bcache part because I am not familiar
> > with the code.
> 
> same here...
> 
> > > This patch calls flush_signals() in bcache_device_init() if there is
> > > pending signal for current process. It avoids bcache registration
> > > failure in system boot up time due to bcache udev rule timeout.
> >
> > this sounds like a wrong way to address the issue. Killing the udev
> > worker is a userspace policy and the kernel shouldn't simply ignore it.
> 
> Agreed. If nothing else, if a userspace process has pending SIKILL then
> flush_signals() is very wrong.
> 
> > Btw. Oleg, I have noticed quite a lot of flush_signals usage in the
> > drivers land and I have really hard time to understand their purpose.
> 
> Heh. I bet most if not all users of flush_signals() are simply wrong.
> 
> > What is the actual valid usage of this function?
> 
> I thinks it should die...

Can we simply deprecate it and add a big fat comment explaining why this
is the wrong interface to use?

So what about this?
diff --git a/kernel/signal.c b/kernel/signal.c
index 9ad8dea93dbb..8a895e565e84 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -465,7 +465,13 @@ void flush_sigqueue(struct sigpending *queue)
 }
 
 /*
- * Flush all pending signals for this kthread.
+ * Flush all pending signals for this kthread. Please note that this interface
+ * shouldn't be used and in fact it is DEPRECATED.
+ * Existing users should be double checked because most of them are likely
+ * obsolete. Kernel threads are not on the receiving end of signal delivery
+ * unless they explicitly request that by allow_signal() and in that case
+ * flush_signals is almost always a bug because signal should be processed
+ * by kernel_dequeue_signal rather than dropping them on the floor.
  */
 void flush_signals(struct task_struct *t)
 {
Oleg Nesterov March 3, 2020, 12:19 p.m. UTC | #18
On 03/03, Michal Hocko wrote:
>
> > > What is the actual valid usage of this function?
> >
> > I thinks it should die...
>
> Can we simply deprecate it and add a big fat comment explaning why this
> is wrong interface to use?

Michal, I am sorry for confusion.

I have to take my words back, flush_signals() can be more convenient
and faster for kthreads than kernel_dequeue_signal().

However I still think it needs changes, see below.

Even clear_tsk_thread_flag(SIGPENDING) doesn't look right in general,
but recalc_sigpending() can make a difference. This probably needs
more cleanups.

> + * Kernel threads are not on the receiving end of signal delivery
> + * unless they explicitly request that by allow_signal() and in that case
> + * flush_signals is almost always a bug because signal should be processed
> + * by kernel_dequeue_signal rather than dropping them on the floor.

Yes, but kernel_dequeue_signal() differs. Say, it won't clear TIF_SIGPENDING
if TIF_PATCH_PENDING is set. Again, probably this needs more cleanups.

Anyway, we can probably change flush_signals

	void flush_signals(struct task_struct *t)
	{
		unsigned long flags;

		// see the PF_KTHREAD check in __send_signal()
		WARN_ON_ONCE(!list_empty(&t->pending.list) ||
			     !list_empty(&t->signal->shared_pending.list));

		spin_lock_irqsave(&t->sighand->siglock, flags);
		// TODO: use recalc_sigpending()
		clear_tsk_thread_flag(t, TIF_SIGPENDING);
		sigemptyset(&t->pending.signal);
		sigemptyset(&t->signal->shared_pending.signal);
		spin_unlock_irqrestore(&t->sighand->siglock, flags);
	}

kernel_sigaction() doesn't need flush_sigqueue_mask() either.

kernel_dequeue_signal() could just use next_signal(),

	int kernel_dequeue_signal(void)
	{
		struct task_struct *task = current;
		int ret;

		spin_lock_irq(&task->sighand->siglock);
		ret = next_signal(&task->pending, &task->blocked);
		if (!ret)
			ret = next_signal(&task->signal->shared_pending, &task->blocked);
		if (sig_kernel_stop(ret))
			task->jobctl |= JOBCTL_STOP_DEQUEUED;
		recalc_sigpending();
		spin_unlock_irq(&task->sighand->siglock);

		return ret;
	}

but I am not sure this optimization makes sense.

Oleg.
Michal Hocko March 3, 2020, 4:03 p.m. UTC | #19
On Tue 03-03-20 13:19:18, Oleg Nesterov wrote:
[...]
> but I am not sure this optmization makes sense.

I would much rather start with a clarification on what should be used and
what shouldn't. Because as of now, people tend to copy patterns which
are broken or simply do not make any sense at all.
Oleg Nesterov March 4, 2020, 11:36 a.m. UTC | #20
On 03/03, Michal Hocko wrote:
>
> On Tue 03-03-20 13:19:18, Oleg Nesterov wrote:
> [...]
> > but I am not sure this optmization makes sense.
>
> I would much rather start with a clarification on what should be use
> what shouldn't. Because as of now, people tend to copy patterns which
> are broken or simply do not make any sense at all.

Yes, it has a lot of buggy users.

It should only be used by kthreads which do allow_signal(), and imo
even in this case kernel_dequeue_signal() makes more sense.

I am looking at 2 first users reported by git-grep.

arch/arm/common/bL_switcher.c:bL_switcher_thread(). Why does it do
flush_signals() ? signal_pending() must not be possible. It seems that
people think that wait_event_interruptible() or even schedule() in
TASK_INTERRUPTIBLE state can lead to a pending signal but this is not
true. Of course, I could have missed allow_signal() in bL_switch_to() paths...

drivers/block/drbd/. I know nothing about this code, but it seems that
flush_signals() can be called by a userspace process. This should be
forbidden.

IOW, I mostly agree with

	- * Flush all pending signals for this kthread.
	+ * Flush all pending signals for this kthread. Please note that this interface
	+ * shouldn't be used and in fact it is DEPRECATED.
	+ * Existing users should be double checked because most of them are likely
	+ * obsolete. Kernel threads are not on the receiving end of signal delivery
	+ * unless they explicitly request that by allow_signal() and in that case
	+ * flush_signals is almost always a bug because signal should be processed
	+ * by kernel_dequeue_signal rather than dropping them on the floor.

you wrote in your previous email, but "DEPRECATED" and "almost always a bug"
look a bit too strong to me.

I would like to add WARN_ON(!PF_KTHREAD) into flush_signals() and let people
fix their code ;)

Oleg.
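
For illustration, the check Oleg floats above would amount to something
like this (sketch only, not a submitted patch; the body mirrors the
existing flush_signals() in kernel/signal.c):

	void flush_signals(struct task_struct *t)
	{
		unsigned long flags;

		/* complain if a non-kthread ever gets here */
		WARN_ON_ONCE(!(t->flags & PF_KTHREAD));

		spin_lock_irqsave(&t->sighand->siglock, flags);
		clear_tsk_thread_flag(t, TIF_SIGPENDING);
		flush_sigqueue(&t->pending);
		flush_sigqueue(&t->signal->shared_pending);
		spin_unlock_irqrestore(&t->sighand->siglock, flags);
	}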
Oleg Nesterov March 4, 2020, 11:53 a.m. UTC | #21
On 03/04, Oleg Nesterov wrote:
>
> arch/arm/common/bL_switcher.c:bL_switcher_thread(). Why does it do
> flush_signals() ? signal_pending() must not be possible. It seems that
> people think that wait_event_interruptible() or even schedule() in
> TASK_INTERRUPTIBLE state can lead to a pending signal but this is not
> true. Of course, I could miss allow_signal() in bL_switch_to() paths...

Jens, could you explain flush_signals() in io_sq_thread() ?

Oleg.
Michal Hocko March 4, 2020, 11:57 a.m. UTC | #22
On Wed 04-03-20 12:36:13, Oleg Nesterov wrote:
[...]
> IOW, I mostly agree with
> 
> 	- * Flush all pending signals for this kthread.
> 	+ * Flush all pending signals for this kthread. Please note that this interface
> 	+ * shouldn't be used and in fact it is DEPRECATED.
> 	+ * Existing users should be double checked because most of them are likely
> 	+ * obsolete. Kernel threads are not on the receiving end of signal delivery
> 	+ * unless they explicitly request that by allow_signal() and in that case
> 	+ * flush_signals is almost always a bug because signal should be processed
> 	+ * by kernel_dequeue_signal rather than dropping them on the floor.
> 
> you wrote in your previous email, but "DEPRECATED" and "almost always a bug"
> looks a bit too strong to me.

So what would be a legit usecase to drop all signals while explicitly
calling allow_signal?
Oleg Nesterov March 4, 2020, 12:13 p.m. UTC | #23
On 03/04, Michal Hocko wrote:
>
> So what would be a legit usecase to drop all signals while explicitly
> calling allow_signal?

Not sure I understand... Did you mean kthread should use kernel_dequeue
rather than flush?

Yes, they should do the same if kthread allows a single signal, iow if
it calls allow_signal() once.

But currently they differ.

1. flush_signals() is faster but we can optimize kernel_dequeue_signal().

2. kernel_dequeue_signal() does not necessarily clears TIF_SIGPENDING
   and I think this needs some fixes. Probably klp_patch_pending() is
   the only problem...

Oleg.
Michal Hocko March 4, 2020, 12:22 p.m. UTC | #24
On Wed 04-03-20 13:13:25, Oleg Nesterov wrote:
> On 03/04, Michal Hocko wrote:
> >
> > So what would be a legit usecase to drop all signals while explicitly
> > calling allow_signal?
> 
> Not sure I understand...

flush_signals will simply drop all pending signals on the floor so there
is no way to handle them, right? I am asking when it is still really a
desirable thing to do when you allow_signal for the kthread. The only
one I can imagine is that the kthread allows a single signal so it is
quite clear which signal is flushed.

kernel_dequeue_signal on the other hand will give you a signal and so
the code can actually handle it in some way.
Oleg Nesterov March 4, 2020, 12:33 p.m. UTC | #25
On 03/04, Michal Hocko wrote:
>
> On Wed 04-03-20 13:13:25, Oleg Nesterov wrote:
> > On 03/04, Michal Hocko wrote:
> > >
> > > So what would be a legit usecase to drop all signals while explicitly
> > > calling allow_signal?
> >
> > Not sure I understand...
>
> flush_signals will simply drop all pending signals on the floor so there
> is no way to handle them, right? I am asking when it is still really a
> desirable thing to do when you allow_signal for the kthread. The only
> one I can imagine is that the kthread allows a single signal so it is
> quite clear which signal is flushed.

Yes. This is what I meant when I said "they should do the same if kthread
allows a single signal".

> kernel_dequeue_signal on the other hand will give you a signal and so
> the code can actually handle it in some way.

Yes.

Oleg.
Michal Hocko March 4, 2020, 12:41 p.m. UTC | #26
On Wed 04-03-20 13:33:42, Oleg Nesterov wrote:
> On 03/04, Michal Hocko wrote:
> >
> > On Wed 04-03-20 13:13:25, Oleg Nesterov wrote:
> > > On 03/04, Michal Hocko wrote:
> > > >
> > > > So what would be a legit usecase to drop all signals while explicitly
> > > > calling allow_signal?
> > >
> > > Not sure I understand...
> >
> > flush_signals will simply drop all pending signals on the floor so there
> > is no way to handle them, right? I am asking when it is still really a
> > desirable thing to do when you allow_signal for the kthread. The only
> > one I can imagine is that the kthread allows a single signal so it is
> > quite clear which signal is flushed.
> 
> Yes. This is what I meant when I said "they should do the same if kthread
> allows a single signal".

OK, good that we are on the same page. I have clearly misread your
earlier email.
 
> > kernel_dequeue_signal on the other hand will give you a signal and so
> > the code can actually handle it in some way.
> 
> Yes.

diff --git a/kernel/signal.c b/kernel/signal.c
index 9ad8dea93dbb..959adc2a5a3d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -465,7 +465,18 @@ void flush_sigqueue(struct sigpending *queue)
 }
 
 /*
- * Flush all pending signals for this kthread.
+ * Flush all pending signals for this kthread. Please note that this interface
+ * shouldn't be used, and if there is a need for it then it should be clearly
+ * documented why.
+ *
+ * Existing users should be double checked because most of them are likely
+ * obsolete. Kernel threads are not on the receiving end of signal delivery
+ * unless they explicitly request that by allow_signal() and in that case
+ * flush_signals is almost always a bug because signals should be processed
+ * by kernel_dequeue_signal rather than dropping them on the floor. The only
+ * exception when flush_signals could be used is a micro-optimization when
+ * only a single signal is allowed and retrieving the specific signal number
+ * is not needed. Please document this usage.
  */
 void flush_signals(struct task_struct *t)
 {
Oleg Nesterov March 4, 2020, 1:02 p.m. UTC | #27
On 03/04, Michal Hocko wrote:
>
>  /*
> - * Flush all pending signals for this kthread.
> + * Flush all pending signals for this kthread. Please note that this interface
> + * shouldn't be used, and if there is a need for it then it should be clearly
> + * documented why.
> + *
> + * Existing users should be double checked because most of them are likely
> + * obsolete. Kernel threads are not on the receiving end of signal delivery
> + * unless they explicitly request that by allow_signal() and in that case
> + * flush_signals is almost always a bug
                    ^^^^^^^^^^^^^^^^^^^^^^
I still think this is too strong a statement...

Even if it seems that most current users of flush_signals() are wrong.

> because signals should be processed
> + * by kernel_dequeue_signal rather than dropping them on the floor.

Yes, but as a reminder, we need some cleanups to ensure that
s/flush_signals/kernel_dequeue_signal/ is always "safe" even when only a
single signal is allowed.

> The only
> + * exception when flush_signals could be used is a micro-optimization when
> + * only a single signal is allowed and retrieving the specific signal number
> + * is not needed. Please document this usage.
>   */

Agreed.

Oleg.
Michal Hocko March 4, 2020, 1:21 p.m. UTC | #28
On Wed 04-03-20 14:02:09, Oleg Nesterov wrote:
> On 03/04, Michal Hocko wrote:
> >
> >  /*
> > - * Flush all pending signals for this kthread.
> > + * Flush all pending signals for this kthread. Please note that this interface
> > + * shouldn't be used, and if there is a need for it then it should be clearly
> > + * documented why.
> > + *
> > + * Existing users should be double checked because most of them are likely
> > + * obsolete. Kernel threads are not on the receiving end of signal delivery
> > + * unless they explicitly request that by allow_signal() and in that case
> > + * flush_signals is almost always a bug
>                     ^^^^^^^^^^^^^^^^^^^^^^
> I still think this is too strong a statement...
> 
> Even if it seems that most current users of flush_signals() are wrong.
> 
> > because signals should be processed
> > + * by kernel_dequeue_signal rather than dropping them on the floor.
> 
> Yes, but as a reminder, we need some cleanups to ensure that
> s/flush_signals/kernel_dequeue_signal/ is always "safe" even when only a
> single signal is allowed.
> 
> > The only
> > + * exception when flush_signals could be used is a micro-optimization when
> > + * only a single signal is allowed and retrieving the specific signal number
> > + * is not needed. Please document this usage.
> >   */
> 
> Agreed.

Care to send a patch? I am pretty sure you would provide a more fitting
wording. All I want to achieve is a clarification of what people should do.
This is a terrible interface, and the many bogus users just prove that
fact, so the more specific we are about the expected usage, the better.
Jens Axboe March 4, 2020, 6:42 p.m. UTC | #29
On 3/4/20 4:53 AM, Oleg Nesterov wrote:
> On 03/04, Oleg Nesterov wrote:
>>
>> arch/arm/common/bL_switcher.c:bL_switcher_thread(). Why does it do
>> flush_signals() ? signal_pending() must not be possible. It seems that
>> people think that wait_event_interruptible() or even schedule() in
>> TASK_INTERRUPTIBLE state can lead to a pending signal but this is not
>> true. Of course, I could miss allow_signal() in bL_switch_to() paths...
> 
> Jens, could you explain flush_signals() in io_sq_thread() ?

I actually don't think that one is needed - the only one that io_uring
needs, and which is valid, is flushing signals for the io-wq worker as
we use that for cancelation.

In any case, both are kthreads.
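
That io-wq case is exactly the "single allowed signal" exception discussed
above: the signal carries no information beyond "wake up and re-check", so
flushing is enough. A hedged sketch of such a worker (the demo_* names are
made up; this is not the actual io-wq code):

	#include <linux/kthread.h>
	#include <linux/sched.h>
	#include <linux/sched/signal.h>

	static int demo_cancel_worker(void *data)
	{
		/* the only signal this worker ever accepts */
		allow_signal(SIGINT);

		while (!kthread_should_stop()) {
			/*
			 * Only one signal is possible, so there is nothing
			 * to decode; just clear the pending state before
			 * re-scanning the work list for cancelled entries.
			 */
			if (signal_pending(current))
				flush_signals(current);

			/* ... find queued/cancelled work and process it ... */

			set_current_state(TASK_INTERRUPTIBLE);
			if (!kthread_should_stop() && !signal_pending(current))
				schedule();
			__set_current_state(TASK_RUNNING);
		}
		return 0;
	}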

Patch

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 0c3c5419c52b..e8bbd4f171ca 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -850,6 +850,18 @@  static int bcache_device_init(struct bcache_device *d, unsigned int block_size,
 	if (idx < 0)
 		return idx;
 
+	/*
+	 * udevd enforces a timeout: if the bcache device is being registered
+	 * by udev rules and registration does not complete in time, udevd may
+	 * kill the registration process. In that case a signal is left
+	 * pending here and makes bioset_init() fail when it internally
+	 * creates its kthread. The registration should ignore the timeout
+	 * and continue, so it is safe to drop the pending signal here and
+	 * avoid failing bcache registration during boot.
+	 */
+	if (signal_pending(current))
+		flush_signals(current);
+
 	if (bioset_init(&d->bio_split, 4, offsetof(struct bbio, bio),
 			BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER))
 		goto err;