diff mbox series

[v1,3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals

Message ID 20230627112220.229240-4-david@redhat.com (mailing list archive)
State New
Headers show
Series mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals | expand

Commit Message

David Hildenbrand June 27, 2023, 11:22 a.m. UTC
John Hubbard writes [1]:

        Some device drivers add memory to the system via memory hotplug.
        When the driver is unloaded, that memory is hot-unplugged.

        However, memory hot unplug can fail. And these days, it fails a
        little too easily, with respect to the above case. Specifically, if
        a signal is pending on the process, hot unplug fails.

        [...]

        So in this case, other things (unmovable pages, un-splittable huge
        pages) can also cause the above problem. However, those are
        demonstrably less common than simply having a pending signal. I've
        got bug reports from users who can trivially reproduce this by
        killing their process with a "kill -9", for example.

Especially with ZONE_MOVABLE, offlining is supposed to work in most
cases when offlining actually hotplugged (not boot) memory, and only fail
in rare corner cases (e.g., some driver holds a reference to a page in
ZONE_MOVABLE, turning it unmovable).

In these corner cases we really don't want to be stuck forever in
offline_and_remove_memory(). But in the general cases, we really want to
do our best to make memory offlining succeed -- in a reasonable
timeframe.

Reliably failing in the described case when there is a fatal signal pending
is sub-optimal. The pending signal check is mostly only relevant when user
space explicitly triggers offlining of memory using sysfs device attributes
("state" or "online" attribute), but not when coming via
offline_and_remove_memory().

So let's use a timer instead and ignore fatal signals, because they are
not really expressive for offline_and_remove_memory() users. Let's default
to 30 seconds if no timeout was specified, and limit the timeout to 120
seconds.

This change is also valuable for virtio-mem in BBM (Big Block Mode) with
"bbm_safe_unplug=off", to avoid endless loops when stuck forever in
offline_and_remove_memory().

While at it, drop the "extern" from offline_and_remove_memory() to make
it fit into a single line.

[1] https://lkml.kernel.org/r/20230620011719.155379-1-jhubbard@nvidia.com

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/virtio/virtio_mem.c    |  2 +-
 include/linux/memory_hotplug.h |  2 +-
 mm/memory_hotplug.c            | 50 ++++++++++++++++++++++++++++++++--
 3 files changed, 50 insertions(+), 4 deletions(-)

Comments

Michal Hocko June 27, 2023, 12:40 p.m. UTC | #1
On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
> John Hubbard writes [1]:
> 
>         Some device drivers add memory to the system via memory hotplug.
>         When the driver is unloaded, that memory is hot-unplugged.
> 
>         However, memory hot unplug can fail. And these days, it fails a
>         little too easily, with respect to the above case. Specifically, if
>         a signal is pending on the process, hot unplug fails.
> 
>         [...]
> 
>         So in this case, other things (unmovable pages, un-splittable huge
>         pages) can also cause the above problem. However, those are
>         demonstrably less common than simply having a pending signal. I've
>         got bug reports from users who can trivially reproduce this by
>         killing their process with a "kill -9", for example.

This looks like a bug of the said driver no? If the tear down process is
killed it could very well happen right before offlining so you end up in
the very same state. Or what am I missing?
 
> Especially with ZONE_MOVABLE, offlining is supposed to work in most
> cases when offlining actually hotplugged (not boot) memory, and only fail
> in rare corner cases (e.g., some driver holds a reference to a page in
> ZONE_MOVABLE, turning it unmovable).
> 
> In these corner cases we really don't want to be stuck forever in
> offline_and_remove_memory(). But in the general cases, we really want to
> do our best to make memory offlining succeed -- in a reasonable
> timeframe.
> 
> Reliably failing in the described case when there is a fatal signal pending
> is sub-optimal. The pending signal check is mostly only relevant when user
> space explicitly triggers offlining of memory using sysfs device attributes
> ("state" or "online" attribute), but not when coming via
> offline_and_remove_memory().
> 
> So let's use a timer instead and ignore fatal signals, because they are
> not really expressive for offline_and_remove_memory() users. Let's default
> to 30 seconds if no timeout was specified, and limit the timeout to 120
> seconds.

I really hate having timeouts back. They just proven to be hard to get
right and it is essentially a policy implemented in the kernel. They
simply do not belong to the kernel space IMHO.

> This change is also valuable for virtio-mem in BBM (Big Block Mode) with
> "bbm_safe_unplug=off", to avoid endless loops when stuck forever in
> offline_and_remove_memory().
> 
> While at it, drop the "extern" from offline_and_remove_memory() to make
> it fit into a single line.
> 
> [1] https://lkml.kernel.org/r/20230620011719.155379-1-jhubbard@nvidia.com
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/virtio/virtio_mem.c    |  2 +-
>  include/linux/memory_hotplug.h |  2 +-
>  mm/memory_hotplug.c            | 50 ++++++++++++++++++++++++++++++++--
>  3 files changed, 50 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index cb8bc6f6aa90..f8792223f1db 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -738,7 +738,7 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm,
>  		"offlining and removing memory: 0x%llx - 0x%llx\n", addr,
>  		addr + size - 1);
>  
> -	rc = offline_and_remove_memory(addr, size);
> +	rc = offline_and_remove_memory(addr, size, 0);
>  	if (!rc) {
>  		atomic64_sub(size, &vm->offline_size);
>  		/*
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 9fcbf5706595..d5f9e8b5a4a4 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -307,7 +307,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  			 struct zone *zone, struct memory_group *group);
>  extern int remove_memory(u64 start, u64 size);
>  extern void __remove_memory(u64 start, u64 size);
> -extern int offline_and_remove_memory(u64 start, u64 size);
> +int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms);
>  
>  #else
>  static inline void try_offline_node(int nid) {}
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0d2151df4ee1..ca635121644a 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -152,6 +152,22 @@ void put_online_mems(void)
>  
>  bool movable_node_enabled = false;
>  
> +/*
> + * Protected by the device hotplug lock: offline_and_remove_memory()
> + * will activate a timer such that offlining cannot be stuck forever.
> + *
> + * With an active timer, fatal signals will be ignored, because they can be
> + * counter-productive when dying user space triggers device unplug/driver
> + * unloading that ends up offlining+removing device memory.
> + */
> +static bool mhp_offlining_timer_active;
> +static atomic_t mhp_offlining_timer_expired;
> +
> +static void mhp_offline_timer_fn(struct timer_list *unused)
> +{
> +	atomic_set(&mhp_offlining_timer_expired, 1);
> +}
> +
>  #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
>  int mhp_default_online_type = MMOP_OFFLINE;
>  #else
> @@ -1879,7 +1895,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
>  	do {
>  		pfn = start_pfn;
>  		do {
> -			if (fatal_signal_pending(current)) {
> +			/*
> +			 * If a timer is active, we're coming via
> +			 * offline_and_remove_memory() and want to ignore even
> +			 * fatal signals.
> +			 */
> +			if (mhp_offlining_timer_active) {
> +				if (atomic_read(&mhp_offlining_timer_expired)) {
> +					ret = -ETIMEDOUT;
> +					reason = "timeout";
> +					goto failed_removal_isolated;
> +				}
> +			} else if (fatal_signal_pending(current)) {
>  				ret = -EINTR;
>  				reason = "signal backoff";
>  				goto failed_removal_isolated;
> @@ -2232,11 +2259,17 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg)
>   * memory is still in use. Primarily useful for memory devices that logically
>   * unplugged all memory (so it's no longer in use) and want to offline + remove
>   * that memory.
> + *
> + * offline_and_remove_memory() will not fail on fatal signals. Instead, it will
> + * fail once the timeout has been reached and offlining was not completed. If
> + * no timeout was specified, it will timeout after 30 seconds. The timeout is
> + * limited to 120 seconds.
>   */
> -int offline_and_remove_memory(u64 start, u64 size)
> +int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms)
>  {
>  	const unsigned long mb_count = size / memory_block_size_bytes();
>  	uint8_t *online_types, *tmp;
> +	struct timer_list timer;
>  	int rc;
>  
>  	if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
> @@ -2261,9 +2294,22 @@ int offline_and_remove_memory(u64 start, u64 size)
>  
>  	lock_device_hotplug();
>  
> +	if (!timeout_ms)
> +		timeout_ms = 30 * MSEC_PER_SEC;
> +	timeout_ms = min_t(unsigned int, timeout_ms, 120 * MSEC_PER_SEC);
> +
> +	timer_setup_on_stack(&timer, mhp_offline_timer_fn, 0);
> +	mod_timer(&timer, jiffies + msecs_to_jiffies(timeout_ms));
> +	mhp_offlining_timer_active = true;
> +
>  	tmp = online_types;
>  	rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block);
>  
> +	timer_delete_sync(&timer);
> +	atomic_set(&mhp_offlining_timer_expired, 0);
> +	mhp_offlining_timer_active = false;
> +	destroy_timer_on_stack(&timer);
> +
>  	/*
>  	 * In case we succeeded to offline all memory, remove it.
>  	 * This cannot fail as it cannot get onlined in the meantime.
> -- 
> 2.40.1
David Hildenbrand June 27, 2023, 1:14 p.m. UTC | #2
On 27.06.23 14:40, Michal Hocko wrote:
> On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
>> John Hubbard writes [1]:
>>
>>          Some device drivers add memory to the system via memory hotplug.
>>          When the driver is unloaded, that memory is hot-unplugged.
>>
>>          However, memory hot unplug can fail. And these days, it fails a
>>          little too easily, with respect to the above case. Specifically, if
>>          a signal is pending on the process, hot unplug fails.
>>
>>          [...]
>>
>>          So in this case, other things (unmovable pages, un-splittable huge
>>          pages) can also cause the above problem. However, those are
>>          demonstrably less common than simply having a pending signal. I've
>>          got bug reports from users who can trivially reproduce this by
>>          killing their process with a "kill -9", for example.
> 
> This looks like a bug of the said driver no? If the tear down process is
> killed it could very well happen right before offlining so you end up in
> the very same state. Or what am I missing?

IIUC (John can correct me if I am wrong):

1) The process holds the device node open
2) The process gets killed or quits
3) As the process gets torn down, it closes the device node
4) Closing the device node results in the driver removing the device and
    calling offline_and_remove_memory()

So it's not a "tear down process" that triggers that offlining_removal 
somehow explicitly, it's just a side-product of it letting go of the 
device node as the process gets torn down.

>   
>> Especially with ZONE_MOVABLE, offlining is supposed to work in most
>> cases when offlining actually hotplugged (not boot) memory, and only fail
>> in rare corner cases (e.g., some driver holds a reference to a page in
>> ZONE_MOVABLE, turning it unmovable).
>>
>> In these corner cases we really don't want to be stuck forever in
>> offline_and_remove_memory(). But in the general cases, we really want to
>> do our best to make memory offlining succeed -- in a reasonable
>> timeframe.
>>
>> Reliably failing in the described case when there is a fatal signal pending
>> is sub-optimal. The pending signal check is mostly only relevant when user
>> space explicitly triggers offlining of memory using sysfs device attributes
>> ("state" or "online" attribute), but not when coming via
>> offline_and_remove_memory().
>>
>> So let's use a timer instead and ignore fatal signals, because they are
>> not really expressive for offline_and_remove_memory() users. Let's default
>> to 30 seconds if no timeout was specified, and limit the timeout to 120
>> seconds.
> 
> I really hate having timeouts back. They just proven to be hard to get
> right and it is essentially a policy implemented in the kernel. They
> simply do not belong to the kernel space IMHO.

As much as I agree with you in terms of offlining triggered from user 
space (e.g., write "state" or "online" attribute) where user-space is 
actually in charge  and can do something reasonable (timeout, retry, 
whatever), in these the offline_and_remove_memory() case it's the driver 
that wants a best-effort memory offlining+removal.

If it times out, virtio-mem will simply try another block or retry 
later. Right now, it could get stuck forever in 
offline_and_remove_memory(), which is obviously "not great". 
Fortunately, for virtio-mem it's configurable and we use the 
alloc_contig_range()-method for now as default.

If it would time out for John's driver, we most certainly don't want to 
be stuck in offline_and_remove_memory(), blocking device/driver 
unloading (and even a reboot IIRC) possibly forever.


I much rather have offline_and_remove_memory() indicate "timeout" to a 
in-kernel user a bit earlier than getting stuck in there forever. The 
timeout parameter allows for giving the in-kernel users a bit of 
flexibility, which I showcases for virtio-mem that unplugs smaller 
blocks and rather wants to fail fast and retry later.


Sure, we could make the timeout configurable to optimize for some corner 
cases, but that's not really what offline_and_remove_memory() users want 
and I doubt anybody would fine-tune that: they want a best-effort 
attempt. And that's IMHO not really a policy, it's an implementation 
detail of these drivers.

For example, the driver from John could simply never call 
offline_and_remove_memory() and always require a reboot when wanting to 
reuse a device. But that's definitely what users want.

virtio-mem could simply never call offline_and_remove_memory() and 
indicate "I don't support unplug of these online memory blocks". But 
that's *definitely* not what users want.


I'm very open for alternatives regarding offline_and_remove_memory(), so 
far this was the only reasonable thing I could come up with that 
actually achieves what we want for these users: not get stuck in there 
forever but rather fail earlier than later.

And most importantly, not somehow try to involve user space that isn't 
even in charge of the offlining operation.
Michal Hocko June 27, 2023, 2:17 p.m. UTC | #3
On Tue 27-06-23 15:14:11, David Hildenbrand wrote:
> On 27.06.23 14:40, Michal Hocko wrote:
> > On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
> > > John Hubbard writes [1]:
> > > 
> > >          Some device drivers add memory to the system via memory hotplug.
> > >          When the driver is unloaded, that memory is hot-unplugged.
> > > 
> > >          However, memory hot unplug can fail. And these days, it fails a
> > >          little too easily, with respect to the above case. Specifically, if
> > >          a signal is pending on the process, hot unplug fails.
> > > 
> > >          [...]
> > > 
> > >          So in this case, other things (unmovable pages, un-splittable huge
> > >          pages) can also cause the above problem. However, those are
> > >          demonstrably less common than simply having a pending signal. I've
> > >          got bug reports from users who can trivially reproduce this by
> > >          killing their process with a "kill -9", for example.
> > 
> > This looks like a bug of the said driver no? If the tear down process is
> > killed it could very well happen right before offlining so you end up in
> > the very same state. Or what am I missing?
> 
> IIUC (John can correct me if I am wrong):
> 
> 1) The process holds the device node open
> 2) The process gets killed or quits
> 3) As the process gets torn down, it closes the device node
> 4) Closing the device node results in the driver removing the device and
>    calling offline_and_remove_memory()
> 
> So it's not a "tear down process" that triggers that offlining_removal
> somehow explicitly, it's just a side-product of it letting go of the device
> node as the process gets torn down.

Isn't that just fragile? The operation might fail for other reasons. Why
cannot there be a hold on the resource to control the tear down
explicitly?

> > > Especially with ZONE_MOVABLE, offlining is supposed to work in most
> > > cases when offlining actually hotplugged (not boot) memory, and only fail
> > > in rare corner cases (e.g., some driver holds a reference to a page in
> > > ZONE_MOVABLE, turning it unmovable).
> > > 
> > > In these corner cases we really don't want to be stuck forever in
> > > offline_and_remove_memory(). But in the general cases, we really want to
> > > do our best to make memory offlining succeed -- in a reasonable
> > > timeframe.
> > > 
> > > Reliably failing in the described case when there is a fatal signal pending
> > > is sub-optimal. The pending signal check is mostly only relevant when user
> > > space explicitly triggers offlining of memory using sysfs device attributes
> > > ("state" or "online" attribute), but not when coming via
> > > offline_and_remove_memory().
> > > 
> > > So let's use a timer instead and ignore fatal signals, because they are
> > > not really expressive for offline_and_remove_memory() users. Let's default
> > > to 30 seconds if no timeout was specified, and limit the timeout to 120
> > > seconds.
> > 
> > I really hate having timeouts back. They just proven to be hard to get
> > right and it is essentially a policy implemented in the kernel. They
> > simply do not belong to the kernel space IMHO.
> 
> As much as I agree with you in terms of offlining triggered from user space
> (e.g., write "state" or "online" attribute) where user-space is actually in
> charge  and can do something reasonable (timeout, retry, whatever), in these
> the offline_and_remove_memory() case it's the driver that wants a
> best-effort memory offlining+removal.
> 
> If it times out, virtio-mem will simply try another block or retry later.
> Right now, it could get stuck forever in offline_and_remove_memory(), which
> is obviously "not great". Fortunately, for virtio-mem it's configurable and
> we use the alloc_contig_range()-method for now as default.

It seems that offline_and_remove_memory is using a wrong operation then.
If it wants an opportunistic offlining with some sort of policy. Timeout
might be just one policy to use but failure mode or a retry count might
be a better fit for some users. So rather than (ab)using offline_pages,
would be make more sense to extract basic offlining steps and allow
drivers like virtio-mem to reuse them and define their own policy?

> If it would time out for John's driver, we most certainly don't want to be
> stuck in offline_and_remove_memory(), blocking device/driver unloading (and
> even a reboot IIRC) possibly forever.

Now I am confused. John driver wants to tear down the device now? There
is no way to release that memory otherwise AFAIU from the initial
problem description.
David Hildenbrand June 27, 2023, 2:57 p.m. UTC | #4
On 27.06.23 16:17, Michal Hocko wrote:
> On Tue 27-06-23 15:14:11, David Hildenbrand wrote:
>> On 27.06.23 14:40, Michal Hocko wrote:
>>> On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
>>>> John Hubbard writes [1]:
>>>>
>>>>           Some device drivers add memory to the system via memory hotplug.
>>>>           When the driver is unloaded, that memory is hot-unplugged.
>>>>
>>>>           However, memory hot unplug can fail. And these days, it fails a
>>>>           little too easily, with respect to the above case. Specifically, if
>>>>           a signal is pending on the process, hot unplug fails.
>>>>
>>>>           [...]
>>>>
>>>>           So in this case, other things (unmovable pages, un-splittable huge
>>>>           pages) can also cause the above problem. However, those are
>>>>           demonstrably less common than simply having a pending signal. I've
>>>>           got bug reports from users who can trivially reproduce this by
>>>>           killing their process with a "kill -9", for example.
>>>
>>> This looks like a bug of the said driver no? If the tear down process is
>>> killed it could very well happen right before offlining so you end up in
>>> the very same state. Or what am I missing?
>>
>> IIUC (John can correct me if I am wrong):
>>
>> 1) The process holds the device node open
>> 2) The process gets killed or quits
>> 3) As the process gets torn down, it closes the device node
>> 4) Closing the device node results in the driver removing the device and
>>     calling offline_and_remove_memory()
>>
>> So it's not a "tear down process" that triggers that offlining_removal
>> somehow explicitly, it's just a side-product of it letting go of the device
>> node as the process gets torn down.
> 
> Isn't that just fragile? The operation might fail for other reasons. Why
> cannot there be a hold on the resource to control the tear down
> explicitly?

I'll let John comment on that. But from what I understood, in most 
setups where ZONE_MOVABLE gets used for hotplugged memory 
offline_and_remove_memory() succeeds and allows for reusing the device 
later without a reboot.

For the cases where it doesn't work, a reboot is required.

> 
>>>> Especially with ZONE_MOVABLE, offlining is supposed to work in most
>>>> cases when offlining actually hotplugged (not boot) memory, and only fail
>>>> in rare corner cases (e.g., some driver holds a reference to a page in
>>>> ZONE_MOVABLE, turning it unmovable).
>>>>
>>>> In these corner cases we really don't want to be stuck forever in
>>>> offline_and_remove_memory(). But in the general cases, we really want to
>>>> do our best to make memory offlining succeed -- in a reasonable
>>>> timeframe.
>>>>
>>>> Reliably failing in the described case when there is a fatal signal pending
>>>> is sub-optimal. The pending signal check is mostly only relevant when user
>>>> space explicitly triggers offlining of memory using sysfs device attributes
>>>> ("state" or "online" attribute), but not when coming via
>>>> offline_and_remove_memory().
>>>>
>>>> So let's use a timer instead and ignore fatal signals, because they are
>>>> not really expressive for offline_and_remove_memory() users. Let's default
>>>> to 30 seconds if no timeout was specified, and limit the timeout to 120
>>>> seconds.
>>>
>>> I really hate having timeouts back. They just proven to be hard to get
>>> right and it is essentially a policy implemented in the kernel. They
>>> simply do not belong to the kernel space IMHO.
>>
>> As much as I agree with you in terms of offlining triggered from user space
>> (e.g., write "state" or "online" attribute) where user-space is actually in
>> charge  and can do something reasonable (timeout, retry, whatever), in these
>> the offline_and_remove_memory() case it's the driver that wants a
>> best-effort memory offlining+removal.
>>
>> If it times out, virtio-mem will simply try another block or retry later.
>> Right now, it could get stuck forever in offline_and_remove_memory(), which
>> is obviously "not great". Fortunately, for virtio-mem it's configurable and
>> we use the alloc_contig_range()-method for now as default.
> 
> It seems that offline_and_remove_memory is using a wrong operation then.
> If it wants an opportunistic offlining with some sort of policy. Timeout
> might be just one policy to use but failure mode or a retry count might
> be a better fit for some users. So rather than (ab)using offline_pages,
> would be make more sense to extract basic offlining steps and allow
> drivers like virtio-mem to reuse them and define their own policy?

virtio-mem, in default operation, does that: use alloc_contig_range() to 
logically unplug ("fake offline") that memory and then just trigger 
offline_and_remove_memory() to make it "officially offline".

In that mode, offline_and_remove_memory() cannot really timeout and is 
almost always going to succeed (except memory notifiers and some hugetlb 
dissolving).

Right now we also allow the admin to configure ordinary offlining 
directly (without prior fake offlining) when bigger memory blocks are 
used: offline_pages() is more reliable than alloc_contig_range(), for 
example, because it disables the PCP and the LRU cache, and retries more 
often (well, unfortunately then also forever). It has a higher chance of 
succeeding especially when bigger blocks of memory are offlined+removed.

Maybe we should make the alloc_contig_range()-based mechanism more 
configurable and make it the only mode in virtio-mem, such that we don't 
have to mess with offline_and_remove_memory() endless loops -- at least 
for virtio-mem.

> 
>> If it would time out for John's driver, we most certainly don't want to be
>> stuck in offline_and_remove_memory(), blocking device/driver unloading (and
>> even a reboot IIRC) possibly forever.
> 
> Now I am confused. John driver wants to tear down the device now? There
> is no way to release that memory otherwise AFAIU from the initial
> problem description.

 From what I understood if offline_and_remove_memory() succeeded, the 
device can be reinitialized and used again. Otherwise, a reboot is 
required because that memory is still added to the system.

Thanks Michal!
Michal Hocko June 27, 2023, 3:14 p.m. UTC | #5
On Tue 27-06-23 16:57:53, David Hildenbrand wrote:
> On 27.06.23 16:17, Michal Hocko wrote:
> > On Tue 27-06-23 15:14:11, David Hildenbrand wrote:
> > > On 27.06.23 14:40, Michal Hocko wrote:
> > > > On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
> > > > > John Hubbard writes [1]:
> > > > > 
> > > > >           Some device drivers add memory to the system via memory hotplug.
> > > > >           When the driver is unloaded, that memory is hot-unplugged.
> > > > > 
> > > > >           However, memory hot unplug can fail. And these days, it fails a
> > > > >           little too easily, with respect to the above case. Specifically, if
> > > > >           a signal is pending on the process, hot unplug fails.
> > > > > 
> > > > >           [...]
> > > > > 
> > > > >           So in this case, other things (unmovable pages, un-splittable huge
> > > > >           pages) can also cause the above problem. However, those are
> > > > >           demonstrably less common than simply having a pending signal. I've
> > > > >           got bug reports from users who can trivially reproduce this by
> > > > >           killing their process with a "kill -9", for example.
> > > > 
> > > > This looks like a bug of the said driver no? If the tear down process is
> > > > killed it could very well happen right before offlining so you end up in
> > > > the very same state. Or what am I missing?
> > > 
> > > IIUC (John can correct me if I am wrong):
> > > 
> > > 1) The process holds the device node open
> > > 2) The process gets killed or quits
> > > 3) As the process gets torn down, it closes the device node
> > > 4) Closing the device node results in the driver removing the device and
> > >     calling offline_and_remove_memory()
> > > 
> > > So it's not a "tear down process" that triggers that offlining_removal
> > > somehow explicitly, it's just a side-product of it letting go of the device
> > > node as the process gets torn down.
> > 
> > Isn't that just fragile? The operation might fail for other reasons. Why
> > cannot there be a hold on the resource to control the tear down
> > explicitly?
> 
> I'll let John comment on that. But from what I understood, in most setups
> where ZONE_MOVABLE gets used for hotplugged memory
> offline_and_remove_memory() succeeds and allows for reusing the device later
> without a reboot.
> 
> For the cases where it doesn't work, a reboot is required.

Then the solution should be really robust and means to handle the
failure - e.g. by retrying or alerting the admin.

> > > > > Especially with ZONE_MOVABLE, offlining is supposed to work in most
> > > > > cases when offlining actually hotplugged (not boot) memory, and only fail
> > > > > in rare corner cases (e.g., some driver holds a reference to a page in
> > > > > ZONE_MOVABLE, turning it unmovable).
> > > > > 
> > > > > In these corner cases we really don't want to be stuck forever in
> > > > > offline_and_remove_memory(). But in the general cases, we really want to
> > > > > do our best to make memory offlining succeed -- in a reasonable
> > > > > timeframe.
> > > > > 
> > > > > Reliably failing in the described case when there is a fatal signal pending
> > > > > is sub-optimal. The pending signal check is mostly only relevant when user
> > > > > space explicitly triggers offlining of memory using sysfs device attributes
> > > > > ("state" or "online" attribute), but not when coming via
> > > > > offline_and_remove_memory().
> > > > > 
> > > > > So let's use a timer instead and ignore fatal signals, because they are
> > > > > not really expressive for offline_and_remove_memory() users. Let's default
> > > > > to 30 seconds if no timeout was specified, and limit the timeout to 120
> > > > > seconds.
> > > > 
> > > > I really hate having timeouts back. They just proven to be hard to get
> > > > right and it is essentially a policy implemented in the kernel. They
> > > > simply do not belong to the kernel space IMHO.
> > > 
> > > As much as I agree with you in terms of offlining triggered from user space
> > > (e.g., write "state" or "online" attribute) where user-space is actually in
> > > charge  and can do something reasonable (timeout, retry, whatever), in these
> > > the offline_and_remove_memory() case it's the driver that wants a
> > > best-effort memory offlining+removal.
> > > 
> > > If it times out, virtio-mem will simply try another block or retry later.
> > > Right now, it could get stuck forever in offline_and_remove_memory(), which
> > > is obviously "not great". Fortunately, for virtio-mem it's configurable and
> > > we use the alloc_contig_range()-method for now as default.
> > 
> > It seems that offline_and_remove_memory is using a wrong operation then.
> > If it wants an opportunistic offlining with some sort of policy. Timeout
> > might be just one policy to use but failure mode or a retry count might
> > be a better fit for some users. So rather than (ab)using offline_pages,
> > would be make more sense to extract basic offlining steps and allow
> > drivers like virtio-mem to reuse them and define their own policy?
> 
> virtio-mem, in default operation, does that: use alloc_contig_range() to
> logically unplug ("fake offline") that memory and then just trigger
> offline_and_remove_memory() to make it "officially offline".
> 
> In that mode, offline_and_remove_memory() cannot really timeout and is
> almost always going to succeed (except memory notifiers and some hugetlb
> dissolving).
> 
> Right now we also allow the admin to configure ordinary offlining directly
> (without prior fake offlining) when bigger memory blocks are used:
> offline_pages() is more reliable than alloc_contig_range(), for example,
> because it disables the PCP and the LRU cache, and retries more often (well,
> unfortunately then also forever). It has a higher chance of succeeding
> especially when bigger blocks of memory are offlined+removed.
> 
> Maybe we should make the alloc_contig_range()-based mechanism more
> configurable and make it the only mode in virtio-mem, such that we don't
> have to mess with offline_and_remove_memory() endless loops -- at least for
> virtio-mem.

Yes, that sounds better than hooking up into offline_pages the way this
patch is doing.
John Hubbard June 27, 2023, 9:34 p.m. UTC | #6
On 6/27/23 08:14, Michal Hocko wrote:
> On Tue 27-06-23 16:57:53, David Hildenbrand wrote:
...
>>>> IIUC (John can correct me if I am wrong):
>>>>
>>>> 1) The process holds the device node open
>>>> 2) The process gets killed or quits
>>>> 3) As the process gets torn down, it closes the device node
>>>> 4) Closing the device node results in the driver removing the device and
>>>>      calling offline_and_remove_memory()
>>>>
>>>> So it's not a "tear down process" that triggers that offlining_removal
>>>> somehow explicitly, it's just a side-product of it letting go of the device
>>>> node as the process gets torn down.
>>>
>>> Isn't that just fragile? The operation might fail for other reasons. Why
>>> cannot there be a hold on the resource to control the tear down
>>> explicitly?
>>
>> I'll let John comment on that. But from what I understood, in most setups
>> where ZONE_MOVABLE gets used for hotplugged memory
>> offline_and_remove_memory() succeeds and allows for reusing the device later
>> without a reboot.
>>
>> For the cases where it doesn't work, a reboot is required.
  
That is exactly correct. That's what we ran into.

And there are workarounds (for example: kthreads don't have any signals
pending...), but I did want to follow through here and make -mm aware of the
problem. And see if there is a better way.

...
>>> It seems that offline_and_remove_memory is using a wrong operation then.
>>> If it wants an opportunistic offlining with some sort of policy. Timeout
>>> might be just one policy to use but failure mode or a retry count might
>>> be a better fit for some users. So rather than (ab)using offline_pages,
>>> would be make more sense to extract basic offlining steps and allow
>>> drivers like virtio-mem to reuse them and define their own policy?

...like this, perhaps. Sounds promising!


thanks,
kernel test robot June 28, 2023, 2 a.m. UTC | #7
Hi David,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 6995e2de6891c724bfeb2db33d7b87775f913ad1]

url:    https://github.com/intel-lab-lkp/linux/commits/David-Hildenbrand/mm-memory_hotplug-check-for-fatal-signals-only-in-offline_pages/20230627-192444
base:   6995e2de6891c724bfeb2db33d7b87775f913ad1
patch link:    https://lore.kernel.org/r/20230627112220.229240-4-david%40redhat.com
patch subject: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals
config: x86_64-randconfig-x006-20230627 (https://download.01.org/0day-ci/archive/20230628/202306280935.dKTWlHFD-lkp@intel.com/config)
compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a)
reproduce: (https://download.01.org/0day-ci/archive/20230628/202306280935.dKTWlHFD-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306280935.dKTWlHFD-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> mm/memory_hotplug.c:163:13: warning: unused variable 'mhp_offlining_timer_active' [-Wunused-variable]
   static bool mhp_offlining_timer_active;
               ^
   mm/memory_hotplug.c:166:13: warning: unused function 'mhp_offline_timer_fn' [-Wunused-function]
   static void mhp_offline_timer_fn(struct timer_list *unused)
               ^
   2 warnings generated.


vim +/mhp_offlining_timer_active +163 mm/memory_hotplug.c

   154	
   155	/*
   156	 * Protected by the device hotplug lock: offline_and_remove_memory()
   157	 * will activate a timer such that offlining cannot be stuck forever.
   158	 *
   159	 * With an active timer, fatal signals will be ignored, because they can be
   160	 * counter-productive when dying user space triggers device unplug/driver
   161	 * unloading that ends up offlining+removing device memory.
   162	 */
 > 163	static bool mhp_offlining_timer_active;
   164	static atomic_t mhp_offlining_timer_expired;
   165
diff mbox series

Patch

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index cb8bc6f6aa90..f8792223f1db 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -738,7 +738,7 @@  static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm,
 		"offlining and removing memory: 0x%llx - 0x%llx\n", addr,
 		addr + size - 1);
 
-	rc = offline_and_remove_memory(addr, size);
+	rc = offline_and_remove_memory(addr, size, 0);
 	if (!rc) {
 		atomic64_sub(size, &vm->offline_size);
 		/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 9fcbf5706595..d5f9e8b5a4a4 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -307,7 +307,7 @@  extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 			 struct zone *zone, struct memory_group *group);
 extern int remove_memory(u64 start, u64 size);
 extern void __remove_memory(u64 start, u64 size);
-extern int offline_and_remove_memory(u64 start, u64 size);
+int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms);
 
 #else
 static inline void try_offline_node(int nid) {}
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d2151df4ee1..ca635121644a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -152,6 +152,22 @@  void put_online_mems(void)
 
 bool movable_node_enabled = false;
 
+/*
+ * Protected by the device hotplug lock: offline_and_remove_memory()
+ * will activate a timer such that offlining cannot be stuck forever.
+ *
+ * With an active timer, fatal signals will be ignored, because they can be
+ * counter-productive when dying user space triggers device unplug/driver
+ * unloading that ends up offlining+removing device memory.
+ */
+static bool mhp_offlining_timer_active;
+static atomic_t mhp_offlining_timer_expired;
+
+static void mhp_offline_timer_fn(struct timer_list *unused)
+{
+	atomic_set(&mhp_offlining_timer_expired, 1);
+}
+
 #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
 int mhp_default_online_type = MMOP_OFFLINE;
 #else
@@ -1879,7 +1895,18 @@  int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 	do {
 		pfn = start_pfn;
 		do {
-			if (fatal_signal_pending(current)) {
+			/*
+			 * If a timer is active, we're coming via
+			 * offline_and_remove_memory() and want to ignore even
+			 * fatal signals.
+			 */
+			if (mhp_offlining_timer_active) {
+				if (atomic_read(&mhp_offlining_timer_expired)) {
+					ret = -ETIMEDOUT;
+					reason = "timeout";
+					goto failed_removal_isolated;
+				}
+			} else if (fatal_signal_pending(current)) {
 				ret = -EINTR;
 				reason = "signal backoff";
 				goto failed_removal_isolated;
@@ -2232,11 +2259,17 @@  static int try_reonline_memory_block(struct memory_block *mem, void *arg)
  * memory is still in use. Primarily useful for memory devices that logically
  * unplugged all memory (so it's no longer in use) and want to offline + remove
  * that memory.
+ *
+ * offline_and_remove_memory() will not fail on fatal signals. Instead, it will
+ * fail once the timeout has been reached and offlining was not completed. If
+ * no timeout was specified, it will timeout after 30 seconds. The timeout is
+ * limited to 120 seconds.
  */
-int offline_and_remove_memory(u64 start, u64 size)
+int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms)
 {
 	const unsigned long mb_count = size / memory_block_size_bytes();
 	uint8_t *online_types, *tmp;
+	struct timer_list timer;
 	int rc;
 
 	if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
@@ -2261,9 +2294,22 @@  int offline_and_remove_memory(u64 start, u64 size)
 
 	lock_device_hotplug();
 
+	if (!timeout_ms)
+		timeout_ms = 30 * MSEC_PER_SEC;
+	timeout_ms = min_t(unsigned int, timeout_ms, 120 * MSEC_PER_SEC);
+
+	timer_setup_on_stack(&timer, mhp_offline_timer_fn, 0);
+	mod_timer(&timer, jiffies + msecs_to_jiffies(timeout_ms));
+	mhp_offlining_timer_active = true;
+
 	tmp = online_types;
 	rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block);
 
+	timer_delete_sync(&timer);
+	atomic_set(&mhp_offlining_timer_expired, 0);
+	mhp_offlining_timer_active = false;
+	destroy_timer_on_stack(&timer);
+
 	/*
 	 * In case we succeeded to offline all memory, remove it.
 	 * This cannot fail as it cannot get onlined in the meantime.