Message ID | 20230627112220.229240-4-david@redhat.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals | expand |
On Tue 27-06-23 13:22:18, David Hildenbrand wrote: > John Hubbard writes [1]: > > Some device drivers add memory to the system via memory hotplug. > When the driver is unloaded, that memory is hot-unplugged. > > However, memory hot unplug can fail. And these days, it fails a > little too easily, with respect to the above case. Specifically, if > a signal is pending on the process, hot unplug fails. > > [...] > > So in this case, other things (unmovable pages, un-splittable huge > pages) can also cause the above problem. However, those are > demonstrably less common than simply having a pending signal. I've > got bug reports from users who can trivially reproduce this by > killing their process with a "kill -9", for example. This looks like a bug of the said driver no? If the tear down process is killed it could very well happen right before offlining so you end up in the very same state. Or what am I missing? > Especially with ZONE_MOVABLE, offlining is supposed to work in most > cases when offlining actually hotplugged (not boot) memory, and only fail > in rare corner cases (e.g., some driver holds a reference to a page in > ZONE_MOVABLE, turning it unmovable). > > In these corner cases we really don't want to be stuck forever in > offline_and_remove_memory(). But in the general cases, we really want to > do our best to make memory offlining succeed -- in a reasonable > timeframe. > > Reliably failing in the described case when there is a fatal signal pending > is sub-optimal. The pending signal check is mostly only relevant when user > space explicitly triggers offlining of memory using sysfs device attributes > ("state" or "online" attribute), but not when coming via > offline_and_remove_memory(). > > So let's use a timer instead and ignore fatal signals, because they are > not really expressive for offline_and_remove_memory() users. Let's default > to 30 seconds if no timeout was specified, and limit the timeout to 120 > seconds. I really hate having timeouts back. They just proven to be hard to get right and it is essentially a policy implemented in the kernel. They simply do not belong to the kernel space IMHO. > This change is also valuable for virtio-mem in BBM (Big Block Mode) with > "bbm_safe_unplug=off", to avoid endless loops when stuck forever in > offline_and_remove_memory(). > > While at it, drop the "extern" from offline_and_remove_memory() to make > it fit into a single line. > > [1] https://lkml.kernel.org/r/20230620011719.155379-1-jhubbard@nvidia.com > > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > drivers/virtio/virtio_mem.c | 2 +- > include/linux/memory_hotplug.h | 2 +- > mm/memory_hotplug.c | 50 ++++++++++++++++++++++++++++++++-- > 3 files changed, 50 insertions(+), 4 deletions(-) > > diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c > index cb8bc6f6aa90..f8792223f1db 100644 > --- a/drivers/virtio/virtio_mem.c > +++ b/drivers/virtio/virtio_mem.c > @@ -738,7 +738,7 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm, > "offlining and removing memory: 0x%llx - 0x%llx\n", addr, > addr + size - 1); > > - rc = offline_and_remove_memory(addr, size); > + rc = offline_and_remove_memory(addr, size, 0); > if (!rc) { > atomic64_sub(size, &vm->offline_size); > /* > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index 9fcbf5706595..d5f9e8b5a4a4 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -307,7 +307,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages, > struct zone *zone, struct memory_group *group); > extern int remove_memory(u64 start, u64 size); > extern void __remove_memory(u64 start, u64 size); > -extern int offline_and_remove_memory(u64 start, u64 size); > +int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms); > > #else > static inline void try_offline_node(int nid) {} > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 0d2151df4ee1..ca635121644a 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -152,6 +152,22 @@ void put_online_mems(void) > > bool movable_node_enabled = false; > > +/* > + * Protected by the device hotplug lock: offline_and_remove_memory() > + * will activate a timer such that offlining cannot be stuck forever. > + * > + * With an active timer, fatal signals will be ignored, because they can be > + * counter-productive when dying user space triggers device unplug/driver > + * unloading that ends up offlining+removing device memory. > + */ > +static bool mhp_offlining_timer_active; > +static atomic_t mhp_offlining_timer_expired; > + > +static void mhp_offline_timer_fn(struct timer_list *unused) > +{ > + atomic_set(&mhp_offlining_timer_expired, 1); > +} > + > #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE > int mhp_default_online_type = MMOP_OFFLINE; > #else > @@ -1879,7 +1895,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, > do { > pfn = start_pfn; > do { > - if (fatal_signal_pending(current)) { > + /* > + * If a timer is active, we're coming via > + * offline_and_remove_memory() and want to ignore even > + * fatal signals. > + */ > + if (mhp_offlining_timer_active) { > + if (atomic_read(&mhp_offlining_timer_expired)) { > + ret = -ETIMEDOUT; > + reason = "timeout"; > + goto failed_removal_isolated; > + } > + } else if (fatal_signal_pending(current)) { > ret = -EINTR; > reason = "signal backoff"; > goto failed_removal_isolated; > @@ -2232,11 +2259,17 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg) > * memory is still in use. Primarily useful for memory devices that logically > * unplugged all memory (so it's no longer in use) and want to offline + remove > * that memory. > + * > + * offline_and_remove_memory() will not fail on fatal signals. Instead, it will > + * fail once the timeout has been reached and offlining was not completed. If > + * no timeout was specified, it will timeout after 30 seconds. The timeout is > + * limited to 120 seconds. > */ > -int offline_and_remove_memory(u64 start, u64 size) > +int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms) > { > const unsigned long mb_count = size / memory_block_size_bytes(); > uint8_t *online_types, *tmp; > + struct timer_list timer; > int rc; > > if (!IS_ALIGNED(start, memory_block_size_bytes()) || > @@ -2261,9 +2294,22 @@ int offline_and_remove_memory(u64 start, u64 size) > > lock_device_hotplug(); > > + if (!timeout_ms) > + timeout_ms = 30 * MSEC_PER_SEC; > + timeout_ms = min_t(unsigned int, timeout_ms, 120 * MSEC_PER_SEC); > + > + timer_setup_on_stack(&timer, mhp_offline_timer_fn, 0); > + mod_timer(&timer, jiffies + msecs_to_jiffies(timeout_ms)); > + mhp_offlining_timer_active = true; > + > tmp = online_types; > rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block); > > + timer_delete_sync(&timer); > + atomic_set(&mhp_offlining_timer_expired, 0); > + mhp_offlining_timer_active = false; > + destroy_timer_on_stack(&timer); > + > /* > * In case we succeeded to offline all memory, remove it. > * This cannot fail as it cannot get onlined in the meantime. > -- > 2.40.1
On 27.06.23 14:40, Michal Hocko wrote: > On Tue 27-06-23 13:22:18, David Hildenbrand wrote: >> John Hubbard writes [1]: >> >> Some device drivers add memory to the system via memory hotplug. >> When the driver is unloaded, that memory is hot-unplugged. >> >> However, memory hot unplug can fail. And these days, it fails a >> little too easily, with respect to the above case. Specifically, if >> a signal is pending on the process, hot unplug fails. >> >> [...] >> >> So in this case, other things (unmovable pages, un-splittable huge >> pages) can also cause the above problem. However, those are >> demonstrably less common than simply having a pending signal. I've >> got bug reports from users who can trivially reproduce this by >> killing their process with a "kill -9", for example. > > This looks like a bug of the said driver no? If the tear down process is > killed it could very well happen right before offlining so you end up in > the very same state. Or what am I missing? IIUC (John can correct me if I am wrong): 1) The process holds the device node open 2) The process gets killed or quits 3) As the process gets torn down, it closes the device node 4) Closing the device node results in the driver removing the device and calling offline_and_remove_memory() So it's not a "tear down process" that triggers that offlining_removal somehow explicitly, it's just a side-product of it letting go of the device node as the process gets torn down. > >> Especially with ZONE_MOVABLE, offlining is supposed to work in most >> cases when offlining actually hotplugged (not boot) memory, and only fail >> in rare corner cases (e.g., some driver holds a reference to a page in >> ZONE_MOVABLE, turning it unmovable). >> >> In these corner cases we really don't want to be stuck forever in >> offline_and_remove_memory(). But in the general cases, we really want to >> do our best to make memory offlining succeed -- in a reasonable >> timeframe. >> >> Reliably failing in the described case when there is a fatal signal pending >> is sub-optimal. The pending signal check is mostly only relevant when user >> space explicitly triggers offlining of memory using sysfs device attributes >> ("state" or "online" attribute), but not when coming via >> offline_and_remove_memory(). >> >> So let's use a timer instead and ignore fatal signals, because they are >> not really expressive for offline_and_remove_memory() users. Let's default >> to 30 seconds if no timeout was specified, and limit the timeout to 120 >> seconds. > > I really hate having timeouts back. They just proven to be hard to get > right and it is essentially a policy implemented in the kernel. They > simply do not belong to the kernel space IMHO. As much as I agree with you in terms of offlining triggered from user space (e.g., write "state" or "online" attribute) where user-space is actually in charge and can do something reasonable (timeout, retry, whatever), in these the offline_and_remove_memory() case it's the driver that wants a best-effort memory offlining+removal. If it times out, virtio-mem will simply try another block or retry later. Right now, it could get stuck forever in offline_and_remove_memory(), which is obviously "not great". Fortunately, for virtio-mem it's configurable and we use the alloc_contig_range()-method for now as default. If it would time out for John's driver, we most certainly don't want to be stuck in offline_and_remove_memory(), blocking device/driver unloading (and even a reboot IIRC) possibly forever. I much rather have offline_and_remove_memory() indicate "timeout" to a in-kernel user a bit earlier than getting stuck in there forever. The timeout parameter allows for giving the in-kernel users a bit of flexibility, which I showcases for virtio-mem that unplugs smaller blocks and rather wants to fail fast and retry later. Sure, we could make the timeout configurable to optimize for some corner cases, but that's not really what offline_and_remove_memory() users want and I doubt anybody would fine-tune that: they want a best-effort attempt. And that's IMHO not really a policy, it's an implementation detail of these drivers. For example, the driver from John could simply never call offline_and_remove_memory() and always require a reboot when wanting to reuse a device. But that's definitely what users want. virtio-mem could simply never call offline_and_remove_memory() and indicate "I don't support unplug of these online memory blocks". But that's *definitely* not what users want. I'm very open for alternatives regarding offline_and_remove_memory(), so far this was the only reasonable thing I could come up with that actually achieves what we want for these users: not get stuck in there forever but rather fail earlier than later. And most importantly, not somehow try to involve user space that isn't even in charge of the offlining operation.
On Tue 27-06-23 15:14:11, David Hildenbrand wrote: > On 27.06.23 14:40, Michal Hocko wrote: > > On Tue 27-06-23 13:22:18, David Hildenbrand wrote: > > > John Hubbard writes [1]: > > > > > > Some device drivers add memory to the system via memory hotplug. > > > When the driver is unloaded, that memory is hot-unplugged. > > > > > > However, memory hot unplug can fail. And these days, it fails a > > > little too easily, with respect to the above case. Specifically, if > > > a signal is pending on the process, hot unplug fails. > > > > > > [...] > > > > > > So in this case, other things (unmovable pages, un-splittable huge > > > pages) can also cause the above problem. However, those are > > > demonstrably less common than simply having a pending signal. I've > > > got bug reports from users who can trivially reproduce this by > > > killing their process with a "kill -9", for example. > > > > This looks like a bug of the said driver no? If the tear down process is > > killed it could very well happen right before offlining so you end up in > > the very same state. Or what am I missing? > > IIUC (John can correct me if I am wrong): > > 1) The process holds the device node open > 2) The process gets killed or quits > 3) As the process gets torn down, it closes the device node > 4) Closing the device node results in the driver removing the device and > calling offline_and_remove_memory() > > So it's not a "tear down process" that triggers that offlining_removal > somehow explicitly, it's just a side-product of it letting go of the device > node as the process gets torn down. Isn't that just fragile? The operation might fail for other reasons. Why cannot there be a hold on the resource to control the tear down explicitly? > > > Especially with ZONE_MOVABLE, offlining is supposed to work in most > > > cases when offlining actually hotplugged (not boot) memory, and only fail > > > in rare corner cases (e.g., some driver holds a reference to a page in > > > ZONE_MOVABLE, turning it unmovable). > > > > > > In these corner cases we really don't want to be stuck forever in > > > offline_and_remove_memory(). But in the general cases, we really want to > > > do our best to make memory offlining succeed -- in a reasonable > > > timeframe. > > > > > > Reliably failing in the described case when there is a fatal signal pending > > > is sub-optimal. The pending signal check is mostly only relevant when user > > > space explicitly triggers offlining of memory using sysfs device attributes > > > ("state" or "online" attribute), but not when coming via > > > offline_and_remove_memory(). > > > > > > So let's use a timer instead and ignore fatal signals, because they are > > > not really expressive for offline_and_remove_memory() users. Let's default > > > to 30 seconds if no timeout was specified, and limit the timeout to 120 > > > seconds. > > > > I really hate having timeouts back. They just proven to be hard to get > > right and it is essentially a policy implemented in the kernel. They > > simply do not belong to the kernel space IMHO. > > As much as I agree with you in terms of offlining triggered from user space > (e.g., write "state" or "online" attribute) where user-space is actually in > charge and can do something reasonable (timeout, retry, whatever), in these > the offline_and_remove_memory() case it's the driver that wants a > best-effort memory offlining+removal. > > If it times out, virtio-mem will simply try another block or retry later. > Right now, it could get stuck forever in offline_and_remove_memory(), which > is obviously "not great". Fortunately, for virtio-mem it's configurable and > we use the alloc_contig_range()-method for now as default. It seems that offline_and_remove_memory is using a wrong operation then. If it wants an opportunistic offlining with some sort of policy. Timeout might be just one policy to use but failure mode or a retry count might be a better fit for some users. So rather than (ab)using offline_pages, would be make more sense to extract basic offlining steps and allow drivers like virtio-mem to reuse them and define their own policy? > If it would time out for John's driver, we most certainly don't want to be > stuck in offline_and_remove_memory(), blocking device/driver unloading (and > even a reboot IIRC) possibly forever. Now I am confused. John driver wants to tear down the device now? There is no way to release that memory otherwise AFAIU from the initial problem description.
On 27.06.23 16:17, Michal Hocko wrote: > On Tue 27-06-23 15:14:11, David Hildenbrand wrote: >> On 27.06.23 14:40, Michal Hocko wrote: >>> On Tue 27-06-23 13:22:18, David Hildenbrand wrote: >>>> John Hubbard writes [1]: >>>> >>>> Some device drivers add memory to the system via memory hotplug. >>>> When the driver is unloaded, that memory is hot-unplugged. >>>> >>>> However, memory hot unplug can fail. And these days, it fails a >>>> little too easily, with respect to the above case. Specifically, if >>>> a signal is pending on the process, hot unplug fails. >>>> >>>> [...] >>>> >>>> So in this case, other things (unmovable pages, un-splittable huge >>>> pages) can also cause the above problem. However, those are >>>> demonstrably less common than simply having a pending signal. I've >>>> got bug reports from users who can trivially reproduce this by >>>> killing their process with a "kill -9", for example. >>> >>> This looks like a bug of the said driver no? If the tear down process is >>> killed it could very well happen right before offlining so you end up in >>> the very same state. Or what am I missing? >> >> IIUC (John can correct me if I am wrong): >> >> 1) The process holds the device node open >> 2) The process gets killed or quits >> 3) As the process gets torn down, it closes the device node >> 4) Closing the device node results in the driver removing the device and >> calling offline_and_remove_memory() >> >> So it's not a "tear down process" that triggers that offlining_removal >> somehow explicitly, it's just a side-product of it letting go of the device >> node as the process gets torn down. > > Isn't that just fragile? The operation might fail for other reasons. Why > cannot there be a hold on the resource to control the tear down > explicitly? I'll let John comment on that. But from what I understood, in most setups where ZONE_MOVABLE gets used for hotplugged memory offline_and_remove_memory() succeeds and allows for reusing the device later without a reboot. For the cases where it doesn't work, a reboot is required. > >>>> Especially with ZONE_MOVABLE, offlining is supposed to work in most >>>> cases when offlining actually hotplugged (not boot) memory, and only fail >>>> in rare corner cases (e.g., some driver holds a reference to a page in >>>> ZONE_MOVABLE, turning it unmovable). >>>> >>>> In these corner cases we really don't want to be stuck forever in >>>> offline_and_remove_memory(). But in the general cases, we really want to >>>> do our best to make memory offlining succeed -- in a reasonable >>>> timeframe. >>>> >>>> Reliably failing in the described case when there is a fatal signal pending >>>> is sub-optimal. The pending signal check is mostly only relevant when user >>>> space explicitly triggers offlining of memory using sysfs device attributes >>>> ("state" or "online" attribute), but not when coming via >>>> offline_and_remove_memory(). >>>> >>>> So let's use a timer instead and ignore fatal signals, because they are >>>> not really expressive for offline_and_remove_memory() users. Let's default >>>> to 30 seconds if no timeout was specified, and limit the timeout to 120 >>>> seconds. >>> >>> I really hate having timeouts back. They just proven to be hard to get >>> right and it is essentially a policy implemented in the kernel. They >>> simply do not belong to the kernel space IMHO. >> >> As much as I agree with you in terms of offlining triggered from user space >> (e.g., write "state" or "online" attribute) where user-space is actually in >> charge and can do something reasonable (timeout, retry, whatever), in these >> the offline_and_remove_memory() case it's the driver that wants a >> best-effort memory offlining+removal. >> >> If it times out, virtio-mem will simply try another block or retry later. >> Right now, it could get stuck forever in offline_and_remove_memory(), which >> is obviously "not great". Fortunately, for virtio-mem it's configurable and >> we use the alloc_contig_range()-method for now as default. > > It seems that offline_and_remove_memory is using a wrong operation then. > If it wants an opportunistic offlining with some sort of policy. Timeout > might be just one policy to use but failure mode or a retry count might > be a better fit for some users. So rather than (ab)using offline_pages, > would be make more sense to extract basic offlining steps and allow > drivers like virtio-mem to reuse them and define their own policy? virtio-mem, in default operation, does that: use alloc_contig_range() to logically unplug ("fake offline") that memory and then just trigger offline_and_remove_memory() to make it "officially offline". In that mode, offline_and_remove_memory() cannot really timeout and is almost always going to succeed (except memory notifiers and some hugetlb dissolving). Right now we also allow the admin to configure ordinary offlining directly (without prior fake offlining) when bigger memory blocks are used: offline_pages() is more reliable than alloc_contig_range(), for example, because it disables the PCP and the LRU cache, and retries more often (well, unfortunately then also forever). It has a higher chance of succeeding especially when bigger blocks of memory are offlined+removed. Maybe we should make the alloc_contig_range()-based mechanism more configurable and make it the only mode in virtio-mem, such that we don't have to mess with offline_and_remove_memory() endless loops -- at least for virtio-mem. > >> If it would time out for John's driver, we most certainly don't want to be >> stuck in offline_and_remove_memory(), blocking device/driver unloading (and >> even a reboot IIRC) possibly forever. > > Now I am confused. John driver wants to tear down the device now? There > is no way to release that memory otherwise AFAIU from the initial > problem description. From what I understood if offline_and_remove_memory() succeeded, the device can be reinitialized and used again. Otherwise, a reboot is required because that memory is still added to the system. Thanks Michal!
On Tue 27-06-23 16:57:53, David Hildenbrand wrote: > On 27.06.23 16:17, Michal Hocko wrote: > > On Tue 27-06-23 15:14:11, David Hildenbrand wrote: > > > On 27.06.23 14:40, Michal Hocko wrote: > > > > On Tue 27-06-23 13:22:18, David Hildenbrand wrote: > > > > > John Hubbard writes [1]: > > > > > > > > > > Some device drivers add memory to the system via memory hotplug. > > > > > When the driver is unloaded, that memory is hot-unplugged. > > > > > > > > > > However, memory hot unplug can fail. And these days, it fails a > > > > > little too easily, with respect to the above case. Specifically, if > > > > > a signal is pending on the process, hot unplug fails. > > > > > > > > > > [...] > > > > > > > > > > So in this case, other things (unmovable pages, un-splittable huge > > > > > pages) can also cause the above problem. However, those are > > > > > demonstrably less common than simply having a pending signal. I've > > > > > got bug reports from users who can trivially reproduce this by > > > > > killing their process with a "kill -9", for example. > > > > > > > > This looks like a bug of the said driver no? If the tear down process is > > > > killed it could very well happen right before offlining so you end up in > > > > the very same state. Or what am I missing? > > > > > > IIUC (John can correct me if I am wrong): > > > > > > 1) The process holds the device node open > > > 2) The process gets killed or quits > > > 3) As the process gets torn down, it closes the device node > > > 4) Closing the device node results in the driver removing the device and > > > calling offline_and_remove_memory() > > > > > > So it's not a "tear down process" that triggers that offlining_removal > > > somehow explicitly, it's just a side-product of it letting go of the device > > > node as the process gets torn down. > > > > Isn't that just fragile? The operation might fail for other reasons. Why > > cannot there be a hold on the resource to control the tear down > > explicitly? > > I'll let John comment on that. But from what I understood, in most setups > where ZONE_MOVABLE gets used for hotplugged memory > offline_and_remove_memory() succeeds and allows for reusing the device later > without a reboot. > > For the cases where it doesn't work, a reboot is required. Then the solution should be really robust and means to handle the failure - e.g. by retrying or alerting the admin. > > > > > Especially with ZONE_MOVABLE, offlining is supposed to work in most > > > > > cases when offlining actually hotplugged (not boot) memory, and only fail > > > > > in rare corner cases (e.g., some driver holds a reference to a page in > > > > > ZONE_MOVABLE, turning it unmovable). > > > > > > > > > > In these corner cases we really don't want to be stuck forever in > > > > > offline_and_remove_memory(). But in the general cases, we really want to > > > > > do our best to make memory offlining succeed -- in a reasonable > > > > > timeframe. > > > > > > > > > > Reliably failing in the described case when there is a fatal signal pending > > > > > is sub-optimal. The pending signal check is mostly only relevant when user > > > > > space explicitly triggers offlining of memory using sysfs device attributes > > > > > ("state" or "online" attribute), but not when coming via > > > > > offline_and_remove_memory(). > > > > > > > > > > So let's use a timer instead and ignore fatal signals, because they are > > > > > not really expressive for offline_and_remove_memory() users. Let's default > > > > > to 30 seconds if no timeout was specified, and limit the timeout to 120 > > > > > seconds. > > > > > > > > I really hate having timeouts back. They just proven to be hard to get > > > > right and it is essentially a policy implemented in the kernel. They > > > > simply do not belong to the kernel space IMHO. > > > > > > As much as I agree with you in terms of offlining triggered from user space > > > (e.g., write "state" or "online" attribute) where user-space is actually in > > > charge and can do something reasonable (timeout, retry, whatever), in these > > > the offline_and_remove_memory() case it's the driver that wants a > > > best-effort memory offlining+removal. > > > > > > If it times out, virtio-mem will simply try another block or retry later. > > > Right now, it could get stuck forever in offline_and_remove_memory(), which > > > is obviously "not great". Fortunately, for virtio-mem it's configurable and > > > we use the alloc_contig_range()-method for now as default. > > > > It seems that offline_and_remove_memory is using a wrong operation then. > > If it wants an opportunistic offlining with some sort of policy. Timeout > > might be just one policy to use but failure mode or a retry count might > > be a better fit for some users. So rather than (ab)using offline_pages, > > would be make more sense to extract basic offlining steps and allow > > drivers like virtio-mem to reuse them and define their own policy? > > virtio-mem, in default operation, does that: use alloc_contig_range() to > logically unplug ("fake offline") that memory and then just trigger > offline_and_remove_memory() to make it "officially offline". > > In that mode, offline_and_remove_memory() cannot really timeout and is > almost always going to succeed (except memory notifiers and some hugetlb > dissolving). > > Right now we also allow the admin to configure ordinary offlining directly > (without prior fake offlining) when bigger memory blocks are used: > offline_pages() is more reliable than alloc_contig_range(), for example, > because it disables the PCP and the LRU cache, and retries more often (well, > unfortunately then also forever). It has a higher chance of succeeding > especially when bigger blocks of memory are offlined+removed. > > Maybe we should make the alloc_contig_range()-based mechanism more > configurable and make it the only mode in virtio-mem, such that we don't > have to mess with offline_and_remove_memory() endless loops -- at least for > virtio-mem. Yes, that sounds better than hooking up into offline_pages the way this patch is doing.
On 6/27/23 08:14, Michal Hocko wrote: > On Tue 27-06-23 16:57:53, David Hildenbrand wrote: ... >>>> IIUC (John can correct me if I am wrong): >>>> >>>> 1) The process holds the device node open >>>> 2) The process gets killed or quits >>>> 3) As the process gets torn down, it closes the device node >>>> 4) Closing the device node results in the driver removing the device and >>>> calling offline_and_remove_memory() >>>> >>>> So it's not a "tear down process" that triggers that offlining_removal >>>> somehow explicitly, it's just a side-product of it letting go of the device >>>> node as the process gets torn down. >>> >>> Isn't that just fragile? The operation might fail for other reasons. Why >>> cannot there be a hold on the resource to control the tear down >>> explicitly? >> >> I'll let John comment on that. But from what I understood, in most setups >> where ZONE_MOVABLE gets used for hotplugged memory >> offline_and_remove_memory() succeeds and allows for reusing the device later >> without a reboot. >> >> For the cases where it doesn't work, a reboot is required. That is exactly correct. That's what we ran into. And there are workarounds (for example: kthreads don't have any signals pending...), but I did want to follow through here and make -mm aware of the problem. And see if there is a better way. ... >>> It seems that offline_and_remove_memory is using a wrong operation then. >>> If it wants an opportunistic offlining with some sort of policy. Timeout >>> might be just one policy to use but failure mode or a retry count might >>> be a better fit for some users. So rather than (ab)using offline_pages, >>> would be make more sense to extract basic offlining steps and allow >>> drivers like virtio-mem to reuse them and define their own policy? ...like this, perhaps. Sounds promising! thanks,
Hi David, kernel test robot noticed the following build warnings: [auto build test WARNING on 6995e2de6891c724bfeb2db33d7b87775f913ad1] url: https://github.com/intel-lab-lkp/linux/commits/David-Hildenbrand/mm-memory_hotplug-check-for-fatal-signals-only-in-offline_pages/20230627-192444 base: 6995e2de6891c724bfeb2db33d7b87775f913ad1 patch link: https://lore.kernel.org/r/20230627112220.229240-4-david%40redhat.com patch subject: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals config: x86_64-randconfig-x006-20230627 (https://download.01.org/0day-ci/archive/20230628/202306280935.dKTWlHFD-lkp@intel.com/config) compiler: clang version 15.0.7 (https://github.com/llvm/llvm-project.git 8dfdcc7b7bf66834a761bd8de445840ef68e4d1a) reproduce: (https://download.01.org/0day-ci/archive/20230628/202306280935.dKTWlHFD-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202306280935.dKTWlHFD-lkp@intel.com/ All warnings (new ones prefixed by >>): >> mm/memory_hotplug.c:163:13: warning: unused variable 'mhp_offlining_timer_active' [-Wunused-variable] static bool mhp_offlining_timer_active; ^ mm/memory_hotplug.c:166:13: warning: unused function 'mhp_offline_timer_fn' [-Wunused-function] static void mhp_offline_timer_fn(struct timer_list *unused) ^ 2 warnings generated. vim +/mhp_offlining_timer_active +163 mm/memory_hotplug.c 154 155 /* 156 * Protected by the device hotplug lock: offline_and_remove_memory() 157 * will activate a timer such that offlining cannot be stuck forever. 158 * 159 * With an active timer, fatal signals will be ignored, because they can be 160 * counter-productive when dying user space triggers device unplug/driver 161 * unloading that ends up offlining+removing device memory. 162 */ > 163 static bool mhp_offlining_timer_active; 164 static atomic_t mhp_offlining_timer_expired; 165
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c index cb8bc6f6aa90..f8792223f1db 100644 --- a/drivers/virtio/virtio_mem.c +++ b/drivers/virtio/virtio_mem.c @@ -738,7 +738,7 @@ static int virtio_mem_offline_and_remove_memory(struct virtio_mem *vm, "offlining and removing memory: 0x%llx - 0x%llx\n", addr, addr + size - 1); - rc = offline_and_remove_memory(addr, size); + rc = offline_and_remove_memory(addr, size, 0); if (!rc) { atomic64_sub(size, &vm->offline_size); /* diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 9fcbf5706595..d5f9e8b5a4a4 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -307,7 +307,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages, struct zone *zone, struct memory_group *group); extern int remove_memory(u64 start, u64 size); extern void __remove_memory(u64 start, u64 size); -extern int offline_and_remove_memory(u64 start, u64 size); +int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms); #else static inline void try_offline_node(int nid) {} diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 0d2151df4ee1..ca635121644a 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -152,6 +152,22 @@ void put_online_mems(void) bool movable_node_enabled = false; +/* + * Protected by the device hotplug lock: offline_and_remove_memory() + * will activate a timer such that offlining cannot be stuck forever. + * + * With an active timer, fatal signals will be ignored, because they can be + * counter-productive when dying user space triggers device unplug/driver + * unloading that ends up offlining+removing device memory. + */ +static bool mhp_offlining_timer_active; +static atomic_t mhp_offlining_timer_expired; + +static void mhp_offline_timer_fn(struct timer_list *unused) +{ + atomic_set(&mhp_offlining_timer_expired, 1); +} + #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE int mhp_default_online_type = MMOP_OFFLINE; #else @@ -1879,7 +1895,18 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages, do { pfn = start_pfn; do { - if (fatal_signal_pending(current)) { + /* + * If a timer is active, we're coming via + * offline_and_remove_memory() and want to ignore even + * fatal signals. + */ + if (mhp_offlining_timer_active) { + if (atomic_read(&mhp_offlining_timer_expired)) { + ret = -ETIMEDOUT; + reason = "timeout"; + goto failed_removal_isolated; + } + } else if (fatal_signal_pending(current)) { ret = -EINTR; reason = "signal backoff"; goto failed_removal_isolated; @@ -2232,11 +2259,17 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg) * memory is still in use. Primarily useful for memory devices that logically * unplugged all memory (so it's no longer in use) and want to offline + remove * that memory. + * + * offline_and_remove_memory() will not fail on fatal signals. Instead, it will + * fail once the timeout has been reached and offlining was not completed. If + * no timeout was specified, it will timeout after 30 seconds. The timeout is + * limited to 120 seconds. */ -int offline_and_remove_memory(u64 start, u64 size) +int offline_and_remove_memory(u64 start, u64 size, unsigned int timeout_ms) { const unsigned long mb_count = size / memory_block_size_bytes(); uint8_t *online_types, *tmp; + struct timer_list timer; int rc; if (!IS_ALIGNED(start, memory_block_size_bytes()) || @@ -2261,9 +2294,22 @@ int offline_and_remove_memory(u64 start, u64 size) lock_device_hotplug(); + if (!timeout_ms) + timeout_ms = 30 * MSEC_PER_SEC; + timeout_ms = min_t(unsigned int, timeout_ms, 120 * MSEC_PER_SEC); + + timer_setup_on_stack(&timer, mhp_offline_timer_fn, 0); + mod_timer(&timer, jiffies + msecs_to_jiffies(timeout_ms)); + mhp_offlining_timer_active = true; + tmp = online_types; rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block); + timer_delete_sync(&timer); + atomic_set(&mhp_offlining_timer_expired, 0); + mhp_offlining_timer_active = false; + destroy_timer_on_stack(&timer); + /* * In case we succeeded to offline all memory, remove it. * This cannot fail as it cannot get onlined in the meantime.
John Hubbard writes [1]: Some device drivers add memory to the system via memory hotplug. When the driver is unloaded, that memory is hot-unplugged. However, memory hot unplug can fail. And these days, it fails a little too easily, with respect to the above case. Specifically, if a signal is pending on the process, hot unplug fails. [...] So in this case, other things (unmovable pages, un-splittable huge pages) can also cause the above problem. However, those are demonstrably less common than simply having a pending signal. I've got bug reports from users who can trivially reproduce this by killing their process with a "kill -9", for example. Especially with ZONE_MOVABLE, offlining is supposed to work in most cases when offlining actually hotplugged (not boot) memory, and only fail in rare corner cases (e.g., some driver holds a reference to a page in ZONE_MOVABLE, turning it unmovable). In these corner cases we really don't want to be stuck forever in offline_and_remove_memory(). But in the general cases, we really want to do our best to make memory offlining succeed -- in a reasonable timeframe. Reliably failing in the described case when there is a fatal signal pending is sub-optimal. The pending signal check is mostly only relevant when user space explicitly triggers offlining of memory using sysfs device attributes ("state" or "online" attribute), but not when coming via offline_and_remove_memory(). So let's use a timer instead and ignore fatal signals, because they are not really expressive for offline_and_remove_memory() users. Let's default to 30 seconds if no timeout was specified, and limit the timeout to 120 seconds. This change is also valuable for virtio-mem in BBM (Big Block Mode) with "bbm_safe_unplug=off", to avoid endless loops when stuck forever in offline_and_remove_memory(). While at it, drop the "extern" from offline_and_remove_memory() to make it fit into a single line. [1] https://lkml.kernel.org/r/20230620011719.155379-1-jhubbard@nvidia.com Signed-off-by: David Hildenbrand <david@redhat.com> --- drivers/virtio/virtio_mem.c | 2 +- include/linux/memory_hotplug.h | 2 +- mm/memory_hotplug.c | 50 ++++++++++++++++++++++++++++++++-- 3 files changed, 50 insertions(+), 4 deletions(-)