Message ID | 20240405053510.1948982-3-yosryahmed@google.com (mailing list archive)
---|---
State | New
Series | zswap same-filled and limit checking cleanups
On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote: > Currently, we calculate the zswap global limit, and potentially the > acceptance threshold in the zswap, in pages in the zswap store path. > This is unnecessary because the values rarely change. > > Instead, precalculate the them when the module parameters are updated, > which should be rare. Since we are adding custom handlers for setting > the percentages now, add proper validation that they are <= 100. > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Nice! Getting that stuff out of the hotpath! Two comments below: > @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val, > return ret; > } > > +static int __zswap_percent_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + unsigned int n; > + int ret; > + > + ret = kstrtouint(val, 10, &n); > + if (ret || n > 100) > + return -EINVAL; > + > + return param_set_uint(val, kp); > +} > + > +static int zswap_max_pool_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + int err = __zswap_percent_param_set(val, kp); > + > + if (!err) { > + zswap_update_max_pages(); > + zswap_update_accept_thr_pages(); > + } > + > + return err; > +} > + > +static int zswap_accept_thr_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + int err = __zswap_percent_param_set(val, kp); > + > + if (!err) > + zswap_update_accept_thr_pages(); > + > + return err; > +} I think you can keep this simple and just always update both if anything changes for whatever reason. It's an extremely rare event after all. That should cut it from 3 functions to 1. Note that totalram_pages can also change during memory onlining and offlining. For that you need a memory notifier that also calls that refresh function. It's simple enough, though, check out the code around register_memory_notifier() in drivers/xen/balloon.c.
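A rough sketch of that simplification, assuming the two refresh helpers are also folded into one and both percentage parameters share a single setter. The names zswap_update_limits() and zswap_percent_param_set() are illustrative only and not from the posted patch; the cached globals are the ones the patch introduces:

```c
/* Illustrative sketch: one refresh helper, one shared param setter. */
static void zswap_update_limits(void)
{
	unsigned long max_pages = totalram_pages() * zswap_max_pool_percent / 100;

	WRITE_ONCE(zswap_max_pages, max_pages);
	WRITE_ONCE(zswap_accept_thr_pages,
		   max_pages * zswap_accept_thr_percent / 100);
}

static int zswap_percent_param_set(const char *val,
				   const struct kernel_param *kp)
{
	unsigned int n;
	int ret;

	ret = kstrtouint(val, 10, &n);
	if (ret || n > 100)
		return -EINVAL;

	ret = param_set_uint(val, kp);
	if (!ret)
		zswap_update_limits();	/* recompute both; updates are rare */

	return ret;
}
```

Both max_pool_percent and accept_threshold_percent could then point their .set callback at zswap_percent_param_set.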
On Fri, Apr 5, 2024 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote: > > Currently, we calculate the zswap global limit, and potentially the > > acceptance threshold in the zswap, in pages in the zswap store path. > > This is unnecessary because the values rarely change. > > > > Instead, precalculate the them when the module parameters are updated, > > which should be rare. Since we are adding custom handlers for setting > > the percentages now, add proper validation that they are <= 100. > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > Nice! Getting that stuff out of the hotpath! > > Two comments below: > > > @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val, > > return ret; > > } > > > > +static int __zswap_percent_param_set(const char *val, > > + const struct kernel_param *kp) > > +{ > > + unsigned int n; > > + int ret; > > + > > + ret = kstrtouint(val, 10, &n); > > + if (ret || n > 100) > > + return -EINVAL; > > + > > + return param_set_uint(val, kp); > > +} > > + > > +static int zswap_max_pool_param_set(const char *val, > > + const struct kernel_param *kp) > > +{ > > + int err = __zswap_percent_param_set(val, kp); > > + > > + if (!err) { > > + zswap_update_max_pages(); > > + zswap_update_accept_thr_pages(); > > + } > > + > > + return err; > > +} > > + > > +static int zswap_accept_thr_param_set(const char *val, > > + const struct kernel_param *kp) > > +{ > > + int err = __zswap_percent_param_set(val, kp); > > + > > + if (!err) > > + zswap_update_accept_thr_pages(); > > + > > + return err; > > +} > > I think you can keep this simple and just always update both if > anything changes for whatever reason. It's an extremely rare event > after all. That should cut it from 3 functions to 1. Yeah we can also combine both update functions in this case. > > Note that totalram_pages can also change during memory onlining and > offlining. For that you need a memory notifier that also calls that > refresh function. It's simple enough, though, check out the code > around register_memory_notifier() in drivers/xen/balloon.c. Good point, I completely missed this. It seems like totalram_pages can actually change from contexts other than memory hotplug as well, specifically through adjust_managed_page_count(), and mostly through ballooning drivers. Do we trigger the notifiers in this case? I can't find such logic. It seems like in this case the actual amount of memory does not change, but the drivers take it away from the system. It makes some sense to me that the zswap limits do not change in this case, especially that userspace just sets those limits as a percentage of total memory. I wouldn't expect userspace to take ballooning into account here. However, it would be a behavioral change from today where we always rely on totalram_pages(). Also, if userspace happens to change the limit when a driver is holding a big chunk of memory away from totalram_pages, then the limit would be constantly underestimated. I do not have enough familiarity with memory ballooning to know for sure if this is okay. How much memory can memory ballooning add/remove from totalram_pages(), and is it usually transient or does it stick around for a while? Also CCing David here.
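For reference, the register_memory_notifier() pattern Johannes points at (see drivers/xen/balloon.c) could be wired to the refresh helper roughly as below. This is a sketch, reusing the hypothetical zswap_update_limits() from above; as the rest of the thread discusses, it only covers memory block online/offline and would not see balloon-style adjustments:

```c
#include <linux/memory.h>
#include <linux/notifier.h>

static int zswap_memory_callback(struct notifier_block *nb,
				 unsigned long action, void *arg)
{
	switch (action) {
	case MEM_ONLINE:
	case MEM_OFFLINE:
		/* totalram_pages() changed; recompute the cached limits. */
		zswap_update_limits();
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block zswap_memory_nb = {
	.notifier_call = zswap_memory_callback,
};

/* ... and in zswap_setup(): register_memory_notifier(&zswap_memory_nb); */
```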
Hi Yosry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20240405]
[cannot apply to linus/master v6.9-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:            https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-zswap-always-shrink-in-zswap_store-if-zswap_pool_reached_full/20240405-133623
base:           https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:     https://lore.kernel.org/r/20240405053510.1948982-3-yosryahmed%40google.com
patch subject:  [PATCH v2 2/5] mm: zswap: calculate limits only when updated
config: i386-randconfig-061-20240405 (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/config)
compiler: gcc-10 (Ubuntu 10.5.0-1ubuntu1) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404060445.zMPxumi9-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/zswap.c:132:15: sparse: sparse: symbol 'zswap_accept_thr_pages' was not declared. Should it be static?
   mm/zswap.c:1188:23: sparse: sparse: context imbalance in 'shrink_memcg_cb' - unexpected unlock

vim +/zswap_accept_thr_pages +132 mm/zswap.c

   120	
   121	static int zswap_max_pool_param_set(const char *,
   122					    const struct kernel_param *);
   123	static const struct kernel_param_ops zswap_max_pool_param_ops = {
   124		.set = zswap_max_pool_param_set,
   125		.get = param_get_uint,
   126	};
   127	module_param_cb(max_pool_percent, &zswap_max_pool_param_ops,
   128			&zswap_max_pool_percent, 0644);
   129	
   130	/* The threshold for accepting new pages after the max_pool_percent was hit */
   131	static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
 > 132	unsigned long zswap_accept_thr_pages;
   133	
On Fri, Apr 5, 2024 at 1:24 PM kernel test robot <lkp@intel.com> wrote: > > Hi Yosry, > > kernel test robot noticed the following build warnings: > > [auto build test WARNING on akpm-mm/mm-everything] > [also build test WARNING on next-20240405] > [cannot apply to linus/master v6.9-rc2] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use '--base' as documented in > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-zswap-always-shrink-in-zswap_store-if-zswap_pool_reached_full/20240405-133623 > base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything > patch link: https://lore.kernel.org/r/20240405053510.1948982-3-yosryahmed%40google.com > patch subject: [PATCH v2 2/5] mm: zswap: calculate limits only when updated > config: i386-randconfig-061-20240405 (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/config) > compiler: gcc-10 (Ubuntu 10.5.0-1ubuntu1) 10.5.0 > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <lkp@intel.com> > | Closes: https://lore.kernel.org/oe-kbuild-all/202404060445.zMPxumi9-lkp@intel.com/ > > sparse warnings: (new ones prefixed by >>) > >> mm/zswap.c:132:15: sparse: sparse: symbol 'zswap_accept_thr_pages' was not declared. Should it be static? Yes, I already have it static for v3, which will be sent after the discussion on this patch settles.
On 05.04.24 20:43, Yosry Ahmed wrote: > On Fri, Apr 5, 2024 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote: >> >> On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote: >>> Currently, we calculate the zswap global limit, and potentially the >>> acceptance threshold in the zswap, in pages in the zswap store path. >>> This is unnecessary because the values rarely change. >>> >>> Instead, precalculate the them when the module parameters are updated, >>> which should be rare. Since we are adding custom handlers for setting >>> the percentages now, add proper validation that they are <= 100. >>> >>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >> >> Nice! Getting that stuff out of the hotpath! >> >> Two comments below: >> >>> @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val, >>> return ret; >>> } >>> >>> +static int __zswap_percent_param_set(const char *val, >>> + const struct kernel_param *kp) >>> +{ >>> + unsigned int n; >>> + int ret; >>> + >>> + ret = kstrtouint(val, 10, &n); >>> + if (ret || n > 100) >>> + return -EINVAL; >>> + >>> + return param_set_uint(val, kp); >>> +} >>> + >>> +static int zswap_max_pool_param_set(const char *val, >>> + const struct kernel_param *kp) >>> +{ >>> + int err = __zswap_percent_param_set(val, kp); >>> + >>> + if (!err) { >>> + zswap_update_max_pages(); >>> + zswap_update_accept_thr_pages(); >>> + } >>> + >>> + return err; >>> +} >>> + >>> +static int zswap_accept_thr_param_set(const char *val, >>> + const struct kernel_param *kp) >>> +{ >>> + int err = __zswap_percent_param_set(val, kp); >>> + >>> + if (!err) >>> + zswap_update_accept_thr_pages(); >>> + >>> + return err; >>> +} >> >> I think you can keep this simple and just always update both if >> anything changes for whatever reason. It's an extremely rare event >> after all. That should cut it from 3 functions to 1. > > Yeah we can also combine both update functions in this case. > >> >> Note that totalram_pages can also change during memory onlining and >> offlining. For that you need a memory notifier that also calls that >> refresh function. It's simple enough, though, check out the code >> around register_memory_notifier() in drivers/xen/balloon.c. > > Good point, I completely missed this. It seems like totalram_pages can > actually change from contexts other than memory hotplug as well, > specifically through adjust_managed_page_count(), and mostly through > ballooning drivers. Do we trigger the notifiers in this case? I can't > find such logic. Things like virtio-balloon never online/offline memory and would never call it. Things like virtio-mem sometimes will online/offline memory and would sometimes call it (but not always). Things like the Hyper-V balloon and XEN balloon never offline memory, and would only call it when onlining memory. > > It seems like in this case the actual amount of memory does not > change, but the drivers take it away from the system. It makes some > sense to me that the zswap limits do not change in this case, > especially that userspace just sets those limits as a percentage of > total memory. I wouldn't expect userspace to take ballooning into > account here. > For virtio-mem, it does change ("actual amount of memory"). For virtio-balloon, it's tricky. When using virtio-balloon for VM resizing, it would similarly change. When using it for pure memory overcommit, it depends on whatever the policy in the hypervisor is ... 
might be that under memory pressure that memory is simply given back to the VM. > However, it would be a behavioral change from today where we always > rely on totalram_pages(). Also, if userspace happens to change the > limit when a driver is holding a big chunk of memory away from > totalram_pages, then the limit would be constantly underestimated. > > I do not have enough familiarity with memory ballooning to know for > sure if this is okay. How much memory can memory ballooning add/remove > from totalram_pages(), and is it usually transient or does it stick > around for a while? > > Also CCing David here. It can be a lot. Especially with the Hyper-V balloon (but also on environments where other forms of memory hotunplug are not supported), memory ballooning can be used to unplug memory. So that memory can be gone for good and it can end up being quite a lot of memory. The clean thing to do would be to have a way for other subsystems to get notified on any totalram_pages() changes, so they can adjust accordingly.
[..] > >> Note that totalram_pages can also change during memory onlining and > >> offlining. For that you need a memory notifier that also calls that > >> refresh function. It's simple enough, though, check out the code > >> around register_memory_notifier() in drivers/xen/balloon.c. > > > > Good point, I completely missed this. It seems like totalram_pages can > > actually change from contexts other than memory hotplug as well, > > specifically through adjust_managed_page_count(), and mostly through > > ballooning drivers. Do we trigger the notifiers in this case? I can't > > find such logic. > > Things like virtio-balloon never online/offline memory and would never > call it. I see calls to adjust_managed_page_count() from drivers/virtio/virtio_balloon.c, but I don't understand enough to know what they are doing. > > Things like virtio-mem sometimes will online/offline memory and would > sometimes call it (but not always). Things like the Hyper-V balloon and > XEN balloon never offline memory, and would only call it when onlining > memory. Thanks for the details. > > > > > It seems like in this case the actual amount of memory does not > > change, but the drivers take it away from the system. It makes some > > sense to me that the zswap limits do not change in this case, > > especially that userspace just sets those limits as a percentage of > > total memory. I wouldn't expect userspace to take ballooning into > > account here. > > > > For virtio-mem, it does change ("actual amount of memory"). For > virtio-balloon, it's tricky. When using virtio-balloon for VM resizing, > it would similarly change. When using it for pure memory overcommit, it > depends on whatever the policy in the hypervisor is ... might be that > under memory pressure that memory is simply given back to the VM. That's good to know, it seems like we need to take these into account, and not just because the users may happen to change zswap limits while they are onlining/offlining memory. > > > However, it would be a behavioral change from today where we always > > rely on totalram_pages(). Also, if userspace happens to change the > > limit when a driver is holding a big chunk of memory away from > > totalram_pages, then the limit would be constantly underestimated. > > > > I do not have enough familiarity with memory ballooning to know for > > sure if this is okay. How much memory can memory ballooning add/remove > > from totalram_pages(), and is it usually transient or does it stick > > around for a while? > > > > Also CCing David here. > > It can be a lot. Especially with the Hyper-V balloon (but also on > environments where other forms of memory hotunplug are not supported), > memory ballooning can be used to unplug memory. So that memory can be > gone for good and it can end up being quite a lot of memory. > > The clean thing to do would be to have a way for other subsystems to get > notified on any totalram_pages() changes, so they can adjust accordingly. Yeah I agree. I imagined that register_memory_notifier() would be the way to do that. Apparently it is only effective for memory hotplug. From your description, it sounds like the ballooning drivers may have a very similar effect to memory hotplug, but I don't see memory_notify() being called in these paths. Do we need a separate notifier chain for totalram_pages() updates?
On 08.04.24 10:07, Yosry Ahmed wrote: > [..] >>>> Note that totalram_pages can also change during memory onlining and >>>> offlining. For that you need a memory notifier that also calls that >>>> refresh function. It's simple enough, though, check out the code >>>> around register_memory_notifier() in drivers/xen/balloon.c. >>> >>> Good point, I completely missed this. It seems like totalram_pages can >>> actually change from contexts other than memory hotplug as well, >>> specifically through adjust_managed_page_count(), and mostly through >>> ballooning drivers. Do we trigger the notifiers in this case? I can't >>> find such logic. >> >> Things like virtio-balloon never online/offline memory and would never >> call it. > > I see calls to adjust_managed_page_count() from > drivers/virtio/virtio_balloon.c, but I don't understand enough to know > what they are doing. Essentially fake removing/adding pages. :) > >> >> Things like virtio-mem sometimes will online/offline memory and would >> sometimes call it (but not always). Things like the Hyper-V balloon and >> XEN balloon never offline memory, and would only call it when onlining >> memory. > > Thanks for the details. > >> >>> >>> It seems like in this case the actual amount of memory does not >>> change, but the drivers take it away from the system. It makes some >>> sense to me that the zswap limits do not change in this case, >>> especially that userspace just sets those limits as a percentage of >>> total memory. I wouldn't expect userspace to take ballooning into >>> account here. >>> >> >> For virtio-mem, it does change ("actual amount of memory"). For >> virtio-balloon, it's tricky. When using virtio-balloon for VM resizing, >> it would similarly change. When using it for pure memory overcommit, it >> depends on whatever the policy in the hypervisor is ... might be that >> under memory pressure that memory is simply given back to the VM. > > That's good to know, it seems like we need to take these into account, > and not just because the users may happen to change zswap limits while > they are onlining/offlining memory. Yes. Likely other parts of the system would want to know when available memory changes (especially, if it's quite significant). > >> >>> However, it would be a behavioral change from today where we always >>> rely on totalram_pages(). Also, if userspace happens to change the >>> limit when a driver is holding a big chunk of memory away from >>> totalram_pages, then the limit would be constantly underestimated. >>> >>> I do not have enough familiarity with memory ballooning to know for >>> sure if this is okay. How much memory can memory ballooning add/remove >>> from totalram_pages(), and is it usually transient or does it stick >>> around for a while? >>> >>> Also CCing David here. >> >> It can be a lot. Especially with the Hyper-V balloon (but also on >> environments where other forms of memory hotunplug are not supported), >> memory ballooning can be used to unplug memory. So that memory can be >> gone for good and it can end up being quite a lot of memory. >> >> The clean thing to do would be to have a way for other subsystems to get >> notified on any totalram_pages() changes, so they can adjust accordingly. > > Yeah I agree. I imagined that register_memory_notifier() would be the > way to do that. Apparently it is only effective for memory hotplug. Yes. Right now it's always called under the memory hotplug lock. 
We could reuse it for this fake hotplug/unplug as well, but we'd have to clarify how exactly we expect this to interact with the existing notifications+notifiers. > From your description, it sounds like the ballooning drivers may have > a very similar effect to memory hotplug, but I don't see > memory_notify() being called in these paths. Right, the existing notifications (MEM_ONLINE etc.) are called when memory blocks change their state. That's not what the fake hotplug/unplug does, that operates on much smaller ranges. > > Do we need a separate notifier chain for totalram_pages() updates? Good question. I actually might have the requirement to notify some arch code (s390x) from virtio-mem when fake adding/removing memory, and already wondered how to best wire that up. Maybe we can squeeze that into the existing notifier chain, but needs a bit of thought.
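For readers unfamiliar with ballooning, the "fake removing/adding pages" David describes boils down to balloon drivers calling adjust_managed_page_count() from their inflate/deflate paths, per page, without any hotplug notifier firing. A simplified sketch (not the literal virtio_balloon.c code) of that pattern:

```c
/* Inflate: the page is handed to the hypervisor; stop counting it. */
static void balloon_take_page(struct page *page)
{
	adjust_managed_page_count(page, -1);	/* totalram_pages() shrinks by 1 */
}

/* Deflate: the page is given back to the guest and counted again. */
static void balloon_return_page(struct page *page)
{
	adjust_managed_page_count(page, 1);	/* totalram_pages() grows by 1 */
}
```

Because no MEM_ONLINE/MEM_OFFLINE event is generated on these paths, a plain register_memory_notifier() subscriber (like the sketch earlier in the thread) would miss balloon-driven changes to totalram_pages().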
[..] > > Do we need a separate notifier chain for totalram_pages() updates? > > Good question. I actually might have the requirement to notify some arch > code (s390x) from virtio-mem when fake adding/removing memory, and > already wondered how to best wire that up. > > Maybe we can squeeze that into the existing notifier chain, but needs a > bit of thought. Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, MEM_OFFLINE, etc). New actions mean minimal impact to existing notifiers, but it may make more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions that mean "memory increased" and "memory decreased". I suppose we can add new actions and then separately (and probably incrementally) audit existing notifiers to check if they want to handle the new actions as well. Another consideration is that apparently some ballooning drivers also register notifiers, so we need to make sure there is no possibility of deadlock/recursion.
On 10.04.24 02:52, Yosry Ahmed wrote: > [..] >>> Do we need a separate notifier chain for totalram_pages() updates? >> >> Good question. I actually might have the requirement to notify some arch >> code (s390x) from virtio-mem when fake adding/removing memory, and >> already wondered how to best wire that up. >> >> Maybe we can squeeze that into the existing notifier chain, but needs a >> bit of thought. > Sorry for the late reply, I had to think about this a bit. > Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > MEM_OFFLINE, etc). At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE that prepare the whole range belonging to the Linux memory block (/sys/devices/system/memory/memory...) to go online, and then have something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable (!PageOffline, handed to the buddy) or unusable (PageOffline, removed from the buddy). There are some details to be figured out, but it could work. And as virtio-mem currently operates in pageblock granularity (e.g., 2 MiB), but frequently handles multiple contiguous pageblocks within a Linux memory block, it's not that bad. But the issue I see with ballooning is that we operate here often on page granularity. While we could optimize some cases, we might get quite some overhead from all the notifications. Alternatively, we could send a list of pages, but it won't win a beauty contest. I think the main issue is that, for my purpose (virtio-mem on s390x), I need to notify about the exact memory ranges (so I can reinitialize stuff in s390x code when memory gets effectively re-enabled). For other cases (total pages changing), we don't need the memory ranges, but only the "summary" -- or a notification afterwards that the total pages were just changed quite a bit. > > New actions mean minimal impact to existing notifiers, but it may make > more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions > that mean "memory increased" and "memory decreased". Likely, we should keep their semantics unchanged. Things like KASAN want to allocate metadata memory for the whole range, not on some smallish pieces. It really means "This Linux memory block goes online/offline, please prepare for that.". And again, memory ballooning with small pages is a bit problematic. > > I suppose we can add new actions and then separately (and probably > incrementally) audit existing notifiers to check if they want to > handle the new actions as well. > > Another consideration is that apparently some ballooning drivers also > register notifiers, so we need to make sure there is no possibility of > deadlock/recursion. Right.
On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: > > On 10.04.24 02:52, Yosry Ahmed wrote: > > [..] > >>> Do we need a separate notifier chain for totalram_pages() updates? > >> > >> Good question. I actually might have the requirement to notify some arch > >> code (s390x) from virtio-mem when fake adding/removing memory, and > >> already wondered how to best wire that up. > >> > >> Maybe we can squeeze that into the existing notifier chain, but needs a > >> bit of thought. > > > > Sorry for the late reply, I had to think about this a bit. > > > Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > > MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > > MEM_OFFLINE, etc). > > At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE > that prepare the whole range belonging to the Linux memory block > (/sys/devices/system/memory/memory...) to go online, and then have > something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or > ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable > (!PageOffline, handed to the buddy) or unusable (PageOffline, removed > from the buddy). > > There are some details to be figured out, but it could work. > > And as virtio-mem currently operates in pageblock granularity (e.g., 2 > MiB), but frequently handles multiple contiguous pageblocks within a > Linux memory block, it's not that bad. > > > But the issue I see with ballooning is that we operate here often on > page granularity. While we could optimize some cases, we might get quite > some overhead from all the notifications. Alternatively, we could send a > list of pages, but it won't win a beauty contest. > > I think the main issue is that, for my purpose (virtio-mem on s390x), I > need to notify about the exact memory ranges (so I can reinitialize > stuff in s390x code when memory gets effectively re-enabled). For other > cases (total pages changing), we don't need the memory ranges, but only > the "summary" -- or a notification afterwards that the total pages were > just changed quite a bit. Thanks for shedding some light on this. Although I am not familiar with ballooning, sending notifications on page granularity updates sounds terrible. It seems like this is not as straightforward as I had anticipated. I was going to take a stab at this, but given that the motivation is a minor optimization on the zswap side, I will probably just give up :) For now, I will drop this optimization from the series for now, and I can revisit it if/when notifications for totalram_pages() are implemented at some point. Please CC me if you do so for the s390x use case :) > > > > > > New actions mean minimal impact to existing notifiers, but it may make > > more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions > > that mean "memory increased" and "memory decreased". > > Likely, we should keep their semantics unchanged. Things like KASAN want > to allocate metadata memory for the whole range, not on some smallish > pieces. It really means "This Linux memory block goes online/offline, > please prepare for that.". And again, memory ballooning with small pages > is a bit problematic. > > > > > I suppose we can add new actions and then separately (and probably > > incrementally) audit existing notifiers to check if they want to > > handle the new actions as well. > > > > Another consideration is that apparently some ballooning drivers also > > register notifiers, so we need to make sure there is no possibility of > > deadlock/recursion. 
> Right.
On 13.04.24 03:05, Yosry Ahmed wrote: > On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 10.04.24 02:52, Yosry Ahmed wrote: >>> [..] >>>>> Do we need a separate notifier chain for totalram_pages() updates? >>>> >>>> Good question. I actually might have the requirement to notify some arch >>>> code (s390x) from virtio-mem when fake adding/removing memory, and >>>> already wondered how to best wire that up. >>>> >>>> Maybe we can squeeze that into the existing notifier chain, but needs a >>>> bit of thought. >>> >> >> Sorry for the late reply, I had to think about this a bit. >> >>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, >>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, >>> MEM_OFFLINE, etc). >> >> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE >> that prepare the whole range belonging to the Linux memory block >> (/sys/devices/system/memory/memory...) to go online, and then have >> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or >> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable >> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed >> from the buddy). >> >> There are some details to be figured out, but it could work. >> >> And as virtio-mem currently operates in pageblock granularity (e.g., 2 >> MiB), but frequently handles multiple contiguous pageblocks within a >> Linux memory block, it's not that bad. >> >> >> But the issue I see with ballooning is that we operate here often on >> page granularity. While we could optimize some cases, we might get quite >> some overhead from all the notifications. Alternatively, we could send a >> list of pages, but it won't win a beauty contest. >> >> I think the main issue is that, for my purpose (virtio-mem on s390x), I >> need to notify about the exact memory ranges (so I can reinitialize >> stuff in s390x code when memory gets effectively re-enabled). For other >> cases (total pages changing), we don't need the memory ranges, but only >> the "summary" -- or a notification afterwards that the total pages were >> just changed quite a bit. > > > Thanks for shedding some light on this. Although I am not familiar > with ballooning, sending notifications on page granularity updates > sounds terrible. It seems like this is not as straightforward as I had > anticipated. > > I was going to take a stab at this, but given that the motivation is a > minor optimization on the zswap side, I will probably just give up :) Oh no, so I have to do the work! ;) > > For now, I will drop this optimization from the series for now, and I > can revisit it if/when notifications for totalram_pages() are > implemented at some point. Please CC me if you do so for the s390x use > case :) I primarily care about virtio-mem resizing VM memory and adjusting totalram_pages(), memory ballooning for that is rather a hack for that use case ... so we're in agreement :) Likely we'd want two notification mechanisms, but no matter how I look at it, it's all a bit ugly. I'll look into the virtio-mem case soonish and will let you know once I have something running.
On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <david@redhat.com> wrote: > > On 13.04.24 03:05, Yosry Ahmed wrote: > > On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 10.04.24 02:52, Yosry Ahmed wrote: > >>> [..] > >>>>> Do we need a separate notifier chain for totalram_pages() updates? > >>>> > >>>> Good question. I actually might have the requirement to notify some arch > >>>> code (s390x) from virtio-mem when fake adding/removing memory, and > >>>> already wondered how to best wire that up. > >>>> > >>>> Maybe we can squeeze that into the existing notifier chain, but needs a > >>>> bit of thought. > >>> > >> > >> Sorry for the late reply, I had to think about this a bit. > >> > >>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > >>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > >>> MEM_OFFLINE, etc). > >> > >> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE > >> that prepare the whole range belonging to the Linux memory block > >> (/sys/devices/system/memory/memory...) to go online, and then have > >> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or > >> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable > >> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed > >> from the buddy). > >> > >> There are some details to be figured out, but it could work. > >> > >> And as virtio-mem currently operates in pageblock granularity (e.g., 2 > >> MiB), but frequently handles multiple contiguous pageblocks within a > >> Linux memory block, it's not that bad. > >> > >> > >> But the issue I see with ballooning is that we operate here often on > >> page granularity. While we could optimize some cases, we might get quite > >> some overhead from all the notifications. Alternatively, we could send a > >> list of pages, but it won't win a beauty contest. > >> > >> I think the main issue is that, for my purpose (virtio-mem on s390x), I > >> need to notify about the exact memory ranges (so I can reinitialize > >> stuff in s390x code when memory gets effectively re-enabled). For other > >> cases (total pages changing), we don't need the memory ranges, but only > >> the "summary" -- or a notification afterwards that the total pages were > >> just changed quite a bit. > > > > > > Thanks for shedding some light on this. Although I am not familiar > > with ballooning, sending notifications on page granularity updates > > sounds terrible. It seems like this is not as straightforward as I had > > anticipated. > > > > I was going to take a stab at this, but given that the motivation is a > > minor optimization on the zswap side, I will probably just give up :) > > Oh no, so I have to do the work! ;) > > > > > For now, I will drop this optimization from the series for now, and I > > can revisit it if/when notifications for totalram_pages() are > > implemented at some point. Please CC me if you do so for the s390x use > > case :) > > I primarily care about virtio-mem resizing VM memory and adjusting > totalram_pages(), memory ballooning for that is rather a hack for that > use case ... so we're in agreement :) > > Likely we'd want two notification mechanisms, but no matter how I look > at it, it's all a bit ugly. I am assuming you mean one with exact memory ranges for your s390x use case, and one high-level mechanism for totalram_pages() updates -- or did I miss the point? I am interested to see how page granularity updates would be handled in this case. 
Perhaps they are only relevant for the high-level mechanism? In that case, I suppose we can batch updates and notify once when a threshold is crossed or something. > > I'll look into the virtio-mem case soonish and will let you know once I > have something running. Thanks!
On 15.04.24 20:30, Yosry Ahmed wrote: > On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 13.04.24 03:05, Yosry Ahmed wrote: >>> On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 10.04.24 02:52, Yosry Ahmed wrote: >>>>> [..] >>>>>>> Do we need a separate notifier chain for totalram_pages() updates? >>>>>> >>>>>> Good question. I actually might have the requirement to notify some arch >>>>>> code (s390x) from virtio-mem when fake adding/removing memory, and >>>>>> already wondered how to best wire that up. >>>>>> >>>>>> Maybe we can squeeze that into the existing notifier chain, but needs a >>>>>> bit of thought. >>>>> >>>> >>>> Sorry for the late reply, I had to think about this a bit. >>>> >>>>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, >>>>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, >>>>> MEM_OFFLINE, etc). >>>> >>>> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE >>>> that prepare the whole range belonging to the Linux memory block >>>> (/sys/devices/system/memory/memory...) to go online, and then have >>>> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or >>>> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable >>>> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed >>>> from the buddy). >>>> >>>> There are some details to be figured out, but it could work. >>>> >>>> And as virtio-mem currently operates in pageblock granularity (e.g., 2 >>>> MiB), but frequently handles multiple contiguous pageblocks within a >>>> Linux memory block, it's not that bad. >>>> >>>> >>>> But the issue I see with ballooning is that we operate here often on >>>> page granularity. While we could optimize some cases, we might get quite >>>> some overhead from all the notifications. Alternatively, we could send a >>>> list of pages, but it won't win a beauty contest. >>>> >>>> I think the main issue is that, for my purpose (virtio-mem on s390x), I >>>> need to notify about the exact memory ranges (so I can reinitialize >>>> stuff in s390x code when memory gets effectively re-enabled). For other >>>> cases (total pages changing), we don't need the memory ranges, but only >>>> the "summary" -- or a notification afterwards that the total pages were >>>> just changed quite a bit. >>> >>> >>> Thanks for shedding some light on this. Although I am not familiar >>> with ballooning, sending notifications on page granularity updates >>> sounds terrible. It seems like this is not as straightforward as I had >>> anticipated. >>> >>> I was going to take a stab at this, but given that the motivation is a >>> minor optimization on the zswap side, I will probably just give up :) >> >> Oh no, so I have to do the work! ;) >> >>> >>> For now, I will drop this optimization from the series for now, and I >>> can revisit it if/when notifications for totalram_pages() are >>> implemented at some point. Please CC me if you do so for the s390x use >>> case :) >> >> I primarily care about virtio-mem resizing VM memory and adjusting >> totalram_pages(), memory ballooning for that is rather a hack for that >> use case ... so we're in agreement :) >> >> Likely we'd want two notification mechanisms, but no matter how I look >> at it, it's all a bit ugly. > > I am assuming you mean one with exact memory ranges for your s390x use > case, and one high-level mechanism for totalram_pages() updates -- or > did I miss the point? No, that's it. 
> > I am interested to see how page granularity updates would be handled > in this case. Perhaps they are only relevant for the high-level > mechanism? In that case, I suppose we can batch updates and notify > once when a threshold is crossed or something. Yes, we'd batch updates.
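To make the batching idea concrete, one possible shape for such a mechanism is sketched below. Everything here is hypothetical — no totalram notifier chain exists upstream, and details like calling context and locking are glossed over. It only illustrates how per-page balloon adjustments could be coalesced so subscribers such as zswap are notified once per significant change rather than once per page:

```c
/* Hypothetical API: a chain notified when totalram_pages() changes a lot. */
static BLOCKING_NOTIFIER_HEAD(totalram_notify_chain);
static atomic_long_t totalram_pending_delta;

int register_totalram_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&totalram_notify_chain, nb);
}

/* Hypothetically called from adjust_managed_page_count() and friends. */
void totalram_pages_changed(long delta)
{
	long pending = atomic_long_add_return(delta, &totalram_pending_delta);

	/* Batch: only notify once the net change crosses ~1/64 of memory. */
	if (abs(pending) < totalram_pages() / 64)
		return;

	atomic_long_sub(pending, &totalram_pending_delta);
	blocking_notifier_call_chain(&totalram_notify_chain, 0, NULL);
}
```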
On Mon, Apr 15, 2024 at 12:15 PM David Hildenbrand <david@redhat.com> wrote: > > On 15.04.24 20:30, Yosry Ahmed wrote: > > On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 13.04.24 03:05, Yosry Ahmed wrote: > >>> On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: > >>>> > >>>> On 10.04.24 02:52, Yosry Ahmed wrote: > >>>>> [..] > >>>>>>> Do we need a separate notifier chain for totalram_pages() updates? > >>>>>> > >>>>>> Good question. I actually might have the requirement to notify some arch > >>>>>> code (s390x) from virtio-mem when fake adding/removing memory, and > >>>>>> already wondered how to best wire that up. > >>>>>> > >>>>>> Maybe we can squeeze that into the existing notifier chain, but needs a > >>>>>> bit of thought. > >>>>> > >>>> > >>>> Sorry for the late reply, I had to think about this a bit. > >>>> > >>>>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > >>>>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > >>>>> MEM_OFFLINE, etc). > >>>> > >>>> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE > >>>> that prepare the whole range belonging to the Linux memory block > >>>> (/sys/devices/system/memory/memory...) to go online, and then have > >>>> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or > >>>> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable > >>>> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed > >>>> from the buddy). > >>>> > >>>> There are some details to be figured out, but it could work. > >>>> > >>>> And as virtio-mem currently operates in pageblock granularity (e.g., 2 > >>>> MiB), but frequently handles multiple contiguous pageblocks within a > >>>> Linux memory block, it's not that bad. > >>>> > >>>> > >>>> But the issue I see with ballooning is that we operate here often on > >>>> page granularity. While we could optimize some cases, we might get quite > >>>> some overhead from all the notifications. Alternatively, we could send a > >>>> list of pages, but it won't win a beauty contest. > >>>> > >>>> I think the main issue is that, for my purpose (virtio-mem on s390x), I > >>>> need to notify about the exact memory ranges (so I can reinitialize > >>>> stuff in s390x code when memory gets effectively re-enabled). For other > >>>> cases (total pages changing), we don't need the memory ranges, but only > >>>> the "summary" -- or a notification afterwards that the total pages were > >>>> just changed quite a bit. > >>> > >>> > >>> Thanks for shedding some light on this. Although I am not familiar > >>> with ballooning, sending notifications on page granularity updates > >>> sounds terrible. It seems like this is not as straightforward as I had > >>> anticipated. > >>> > >>> I was going to take a stab at this, but given that the motivation is a > >>> minor optimization on the zswap side, I will probably just give up :) > >> > >> Oh no, so I have to do the work! ;) > >> > >>> > >>> For now, I will drop this optimization from the series for now, and I > >>> can revisit it if/when notifications for totalram_pages() are > >>> implemented at some point. Please CC me if you do so for the s390x use > >>> case :) > >> > >> I primarily care about virtio-mem resizing VM memory and adjusting > >> totalram_pages(), memory ballooning for that is rather a hack for that > >> use case ... 
so we're in agreement :) > >> > >> Likely we'd want two notification mechanisms, but no matter how I look > >> at it, it's all a bit ugly. > > > > I am assuming you mean one with exact memory ranges for your s390x use > > case, and one high-level mechanism for totalram_pages() updates -- or > > did I miss the point? > > No, that's it. > > > > > I am interested to see how page granularity updates would be handled > > in this case. Perhaps they are only relevant for the high-level > > mechanism? In that case, I suppose we can batch updates and notify > > once when a threshold is crossed or something. > > Yes, we'd batch updates. Thanks for clarifying, looking forward to your changes :)
diff --git a/mm/zswap.c b/mm/zswap.c
index 1cf3ab4b22e64..832e3f56232f0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -116,12 +116,29 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
 
 /* The maximum percentage of memory that the compressed pool can occupy */
 static unsigned int zswap_max_pool_percent = 20;
-module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
+static unsigned long zswap_max_pages;
+
+static int zswap_max_pool_param_set(const char *,
+				    const struct kernel_param *);
+static const struct kernel_param_ops zswap_max_pool_param_ops = {
+	.set = zswap_max_pool_param_set,
+	.get = param_get_uint,
+};
+module_param_cb(max_pool_percent, &zswap_max_pool_param_ops,
+		&zswap_max_pool_percent, 0644);
 
 /* The threshold for accepting new pages after the max_pool_percent was hit */
 static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
-module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
-		   uint, 0644);
+unsigned long zswap_accept_thr_pages;
+
+static int zswap_accept_thr_param_set(const char *,
+				      const struct kernel_param *);
+static const struct kernel_param_ops zswap_accept_thr_param_ops = {
+	.set = zswap_accept_thr_param_set,
+	.get = param_get_uint,
+};
+module_param_cb(accept_threshold_percent, &zswap_accept_thr_param_ops,
+		&zswap_accept_thr_percent, 0644);
 
 /*
  * Enable/disable handling same-value filled pages (enabled by default).
@@ -490,14 +507,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
 	return NULL;
 }
 
-static unsigned long zswap_max_pages(void)
+static void zswap_update_max_pages(void)
 {
-	return totalram_pages() * zswap_max_pool_percent / 100;
+	WRITE_ONCE(zswap_max_pages,
+		   totalram_pages() * zswap_max_pool_percent / 100);
 }
 
-static unsigned long zswap_accept_thr_pages(void)
+static void zswap_update_accept_thr_pages(void)
 {
-	return zswap_max_pages() * zswap_accept_thr_percent / 100;
+	WRITE_ONCE(zswap_accept_thr_pages,
+		   READ_ONCE(zswap_max_pages) * zswap_accept_thr_percent / 100);
 }
 
 unsigned long zswap_total_pages(void)
@@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val,
 	return ret;
 }
 
+static int __zswap_percent_param_set(const char *val,
+				     const struct kernel_param *kp)
+{
+	unsigned int n;
+	int ret;
+
+	ret = kstrtouint(val, 10, &n);
+	if (ret || n > 100)
+		return -EINVAL;
+
+	return param_set_uint(val, kp);
+}
+
+static int zswap_max_pool_param_set(const char *val,
+				    const struct kernel_param *kp)
+{
+	int err = __zswap_percent_param_set(val, kp);
+
+	if (!err) {
+		zswap_update_max_pages();
+		zswap_update_accept_thr_pages();
+	}
+
+	return err;
+}
+
+static int zswap_accept_thr_param_set(const char *val,
+				      const struct kernel_param *kp)
+{
+	int err = __zswap_percent_param_set(val, kp);
+
+	if (!err)
+		zswap_update_accept_thr_pages();
+
+	return err;
+}
+
 /*********************************
 * lru functions
 **********************************/
@@ -1305,10 +1361,6 @@ static void shrink_worker(struct work_struct *w)
 {
 	struct mem_cgroup *memcg;
 	int ret, failures = 0;
-	unsigned long thr;
-
-	/* Reclaim down to the accept threshold */
-	thr = zswap_accept_thr_pages();
 
 	/* global reclaim will select cgroup in a round-robin fashion. */
 	do {
@@ -1358,7 +1410,8 @@ static void shrink_worker(struct work_struct *w)
 			break;
 resched:
 		cond_resched();
-	} while (zswap_total_pages() > thr);
+	/* Reclaim down to the accept threshold */
+	} while (zswap_total_pages() > READ_ONCE(zswap_accept_thr_pages));
 }
 
 static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
@@ -1424,16 +1477,14 @@ bool zswap_store(struct folio *folio)
 
 	/* Check global limits */
 	cur_pages = zswap_total_pages();
-	max_pages = zswap_max_pages();
-
-	if (cur_pages >= max_pages) {
+	if (cur_pages >= READ_ONCE(zswap_max_pages)) {
 		zswap_pool_limit_hit++;
 		zswap_pool_reached_full = true;
 		goto reject;
 	}
 
 	if (zswap_pool_reached_full) {
-		if (cur_pages > zswap_accept_thr_pages())
+		if (cur_pages > READ_ONCE(zswap_accept_thr_pages))
 			goto reject;
 		else
 			zswap_pool_reached_full = false;
@@ -1734,6 +1785,9 @@ static int zswap_setup(void)
 		zswap_enabled = false;
 	}
 
+	zswap_update_max_pages();
+	zswap_update_accept_thr_pages();
+
 	if (zswap_debugfs_init())
 		pr_warn("debugfs initialization failed\n");
 	zswap_init_state = ZSWAP_INIT_SUCCEED;
Currently, we calculate the zswap global limit, and potentially the
acceptance threshold, in pages in the zswap store path. This is
unnecessary because the values rarely change.

Instead, precalculate them when the module parameters are updated,
which should be rare. Since we are adding custom handlers for setting
the percentages now, add proper validation that they are <= 100.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/zswap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 70 insertions(+), 16 deletions(-)