Message ID | 20240405053510.1948982-3-yosryahmed@google.com (mailing list archive)
---|---
State | New
Series | zswap same-filled and limit checking cleanups
On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote: > Currently, we calculate the zswap global limit, and potentially the > acceptance threshold in the zswap, in pages in the zswap store path. > This is unnecessary because the values rarely change. > > Instead, precalculate the them when the module parameters are updated, > which should be rare. Since we are adding custom handlers for setting > the percentages now, add proper validation that they are <= 100. > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Nice! Getting that stuff out of the hotpath! Two comments below: > @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val, > return ret; > } > > +static int __zswap_percent_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + unsigned int n; > + int ret; > + > + ret = kstrtouint(val, 10, &n); > + if (ret || n > 100) > + return -EINVAL; > + > + return param_set_uint(val, kp); > +} > + > +static int zswap_max_pool_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + int err = __zswap_percent_param_set(val, kp); > + > + if (!err) { > + zswap_update_max_pages(); > + zswap_update_accept_thr_pages(); > + } > + > + return err; > +} > + > +static int zswap_accept_thr_param_set(const char *val, > + const struct kernel_param *kp) > +{ > + int err = __zswap_percent_param_set(val, kp); > + > + if (!err) > + zswap_update_accept_thr_pages(); > + > + return err; > +} I think you can keep this simple and just always update both if anything changes for whatever reason. It's an extremely rare event after all. That should cut it from 3 functions to 1. Note that totalram_pages can also change during memory onlining and offlining. For that you need a memory notifier that also calls that refresh function. It's simple enough, though, check out the code around register_memory_notifier() in drivers/xen/balloon.c.
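A rough sketch of that simplification, assuming the two refresh helpers are also folded into one and both percentage parameters share a single setter. The names zswap_update_limits() and zswap_percent_param_set() are illustrative only and not from the posted patch; the cached globals are the ones the patch introduces:

```c
/* Illustrative sketch: one refresh helper, one shared param setter. */
static void zswap_update_limits(void)
{
	unsigned long max_pages = totalram_pages() * zswap_max_pool_percent / 100;

	WRITE_ONCE(zswap_max_pages, max_pages);
	WRITE_ONCE(zswap_accept_thr_pages,
		   max_pages * zswap_accept_thr_percent / 100);
}

static int zswap_percent_param_set(const char *val,
				   const struct kernel_param *kp)
{
	unsigned int n;
	int ret;

	ret = kstrtouint(val, 10, &n);
	if (ret || n > 100)
		return -EINVAL;

	ret = param_set_uint(val, kp);
	if (!ret)
		zswap_update_limits();	/* recompute both; updates are rare */

	return ret;
}
```

Both max_pool_percent and accept_threshold_percent could then point their .set callback at zswap_percent_param_set.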
On Fri, Apr 5, 2024 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote: > > Currently, we calculate the zswap global limit, and potentially the > > acceptance threshold in the zswap, in pages in the zswap store path. > > This is unnecessary because the values rarely change. > > > > Instead, precalculate the them when the module parameters are updated, > > which should be rare. Since we are adding custom handlers for setting > > the percentages now, add proper validation that they are <= 100. > > > > Suggested-by: Johannes Weiner <hannes@cmpxchg.org> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > Nice! Getting that stuff out of the hotpath! > > Two comments below: > > > @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val, > > return ret; > > } > > > > +static int __zswap_percent_param_set(const char *val, > > + const struct kernel_param *kp) > > +{ > > + unsigned int n; > > + int ret; > > + > > + ret = kstrtouint(val, 10, &n); > > + if (ret || n > 100) > > + return -EINVAL; > > + > > + return param_set_uint(val, kp); > > +} > > + > > +static int zswap_max_pool_param_set(const char *val, > > + const struct kernel_param *kp) > > +{ > > + int err = __zswap_percent_param_set(val, kp); > > + > > + if (!err) { > > + zswap_update_max_pages(); > > + zswap_update_accept_thr_pages(); > > + } > > + > > + return err; > > +} > > + > > +static int zswap_accept_thr_param_set(const char *val, > > + const struct kernel_param *kp) > > +{ > > + int err = __zswap_percent_param_set(val, kp); > > + > > + if (!err) > > + zswap_update_accept_thr_pages(); > > + > > + return err; > > +} > > I think you can keep this simple and just always update both if > anything changes for whatever reason. It's an extremely rare event > after all. That should cut it from 3 functions to 1. Yeah we can also combine both update functions in this case. > > Note that totalram_pages can also change during memory onlining and > offlining. For that you need a memory notifier that also calls that > refresh function. It's simple enough, though, check out the code > around register_memory_notifier() in drivers/xen/balloon.c. Good point, I completely missed this. It seems like totalram_pages can actually change from contexts other than memory hotplug as well, specifically through adjust_managed_page_count(), and mostly through ballooning drivers. Do we trigger the notifiers in this case? I can't find such logic. It seems like in this case the actual amount of memory does not change, but the drivers take it away from the system. It makes some sense to me that the zswap limits do not change in this case, especially that userspace just sets those limits as a percentage of total memory. I wouldn't expect userspace to take ballooning into account here. However, it would be a behavioral change from today where we always rely on totalram_pages(). Also, if userspace happens to change the limit when a driver is holding a big chunk of memory away from totalram_pages, then the limit would be constantly underestimated. I do not have enough familiarity with memory ballooning to know for sure if this is okay. How much memory can memory ballooning add/remove from totalram_pages(), and is it usually transient or does it stick around for a while? Also CCing David here.
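For reference, the register_memory_notifier() pattern Johannes points at (see drivers/xen/balloon.c) could be wired to the refresh helper roughly as below. This is a sketch, reusing the hypothetical zswap_update_limits() from above; as the rest of the thread discusses, it only covers memory block online/offline and would not see balloon-style adjustments:

```c
#include <linux/memory.h>
#include <linux/notifier.h>

static int zswap_memory_callback(struct notifier_block *nb,
				 unsigned long action, void *arg)
{
	switch (action) {
	case MEM_ONLINE:
	case MEM_OFFLINE:
		/* totalram_pages() changed; recompute the cached limits. */
		zswap_update_limits();
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block zswap_memory_nb = {
	.notifier_call = zswap_memory_callback,
};

/* ... and in zswap_setup(): register_memory_notifier(&zswap_memory_nb); */
```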
Hi Yosry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20240405]
[cannot apply to linus/master v6.9-rc2]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:            https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-zswap-always-shrink-in-zswap_store-if-zswap_pool_reached_full/20240405-133623
base:           https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:     https://lore.kernel.org/r/20240405053510.1948982-3-yosryahmed%40google.com
patch subject:  [PATCH v2 2/5] mm: zswap: calculate limits only when updated
config: i386-randconfig-061-20240405 (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/config)
compiler: gcc-10 (Ubuntu 10.5.0-1ubuntu1) 10.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202404060445.zMPxumi9-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/zswap.c:132:15: sparse: sparse: symbol 'zswap_accept_thr_pages' was not declared. Should it be static?
   mm/zswap.c:1188:23: sparse: sparse: context imbalance in 'shrink_memcg_cb' - unexpected unlock

vim +/zswap_accept_thr_pages +132 mm/zswap.c

   120	
   121	static int zswap_max_pool_param_set(const char *,
   122					    const struct kernel_param *);
   123	static const struct kernel_param_ops zswap_max_pool_param_ops = {
   124		.set = zswap_max_pool_param_set,
   125		.get = param_get_uint,
   126	};
   127	module_param_cb(max_pool_percent, &zswap_max_pool_param_ops,
   128			&zswap_max_pool_percent, 0644);
   129	
   130	/* The threshold for accepting new pages after the max_pool_percent was hit */
   131	static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
 > 132	unsigned long zswap_accept_thr_pages;
   133	
On Fri, Apr 5, 2024 at 1:24 PM kernel test robot <lkp@intel.com> wrote: > > Hi Yosry, > > kernel test robot noticed the following build warnings: > > [auto build test WARNING on akpm-mm/mm-everything] > [also build test WARNING on next-20240405] > [cannot apply to linus/master v6.9-rc2] > [If your patch is applied to the wrong git tree, kindly drop us a note. > And when submitting patch, we suggest to use '--base' as documented in > https://git-scm.com/docs/git-format-patch#_base_tree_information] > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-zswap-always-shrink-in-zswap_store-if-zswap_pool_reached_full/20240405-133623 > base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything > patch link: https://lore.kernel.org/r/20240405053510.1948982-3-yosryahmed%40google.com > patch subject: [PATCH v2 2/5] mm: zswap: calculate limits only when updated > config: i386-randconfig-061-20240405 (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/config) > compiler: gcc-10 (Ubuntu 10.5.0-1ubuntu1) 10.5.0 > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240406/202404060445.zMPxumi9-lkp@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <lkp@intel.com> > | Closes: https://lore.kernel.org/oe-kbuild-all/202404060445.zMPxumi9-lkp@intel.com/ > > sparse warnings: (new ones prefixed by >>) > >> mm/zswap.c:132:15: sparse: sparse: symbol 'zswap_accept_thr_pages' was not declared. Should it be static? Yes, I already have it static for v3, which will be sent after the discussion on this patch settles.
On 05.04.24 20:43, Yosry Ahmed wrote: > On Fri, Apr 5, 2024 at 8:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote: >> >> On Fri, Apr 05, 2024 at 05:35:07AM +0000, Yosry Ahmed wrote: >>> Currently, we calculate the zswap global limit, and potentially the >>> acceptance threshold in the zswap, in pages in the zswap store path. >>> This is unnecessary because the values rarely change. >>> >>> Instead, precalculate the them when the module parameters are updated, >>> which should be rare. Since we are adding custom handlers for setting >>> the percentages now, add proper validation that they are <= 100. >>> >>> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >> >> Nice! Getting that stuff out of the hotpath! >> >> Two comments below: >> >>> @@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val, >>> return ret; >>> } >>> >>> +static int __zswap_percent_param_set(const char *val, >>> + const struct kernel_param *kp) >>> +{ >>> + unsigned int n; >>> + int ret; >>> + >>> + ret = kstrtouint(val, 10, &n); >>> + if (ret || n > 100) >>> + return -EINVAL; >>> + >>> + return param_set_uint(val, kp); >>> +} >>> + >>> +static int zswap_max_pool_param_set(const char *val, >>> + const struct kernel_param *kp) >>> +{ >>> + int err = __zswap_percent_param_set(val, kp); >>> + >>> + if (!err) { >>> + zswap_update_max_pages(); >>> + zswap_update_accept_thr_pages(); >>> + } >>> + >>> + return err; >>> +} >>> + >>> +static int zswap_accept_thr_param_set(const char *val, >>> + const struct kernel_param *kp) >>> +{ >>> + int err = __zswap_percent_param_set(val, kp); >>> + >>> + if (!err) >>> + zswap_update_accept_thr_pages(); >>> + >>> + return err; >>> +} >> >> I think you can keep this simple and just always update both if >> anything changes for whatever reason. It's an extremely rare event >> after all. That should cut it from 3 functions to 1. > > Yeah we can also combine both update functions in this case. > >> >> Note that totalram_pages can also change during memory onlining and >> offlining. For that you need a memory notifier that also calls that >> refresh function. It's simple enough, though, check out the code >> around register_memory_notifier() in drivers/xen/balloon.c. > > Good point, I completely missed this. It seems like totalram_pages can > actually change from contexts other than memory hotplug as well, > specifically through adjust_managed_page_count(), and mostly through > ballooning drivers. Do we trigger the notifiers in this case? I can't > find such logic. Things like virtio-balloon never online/offline memory and would never call it. Things like virtio-mem sometimes will online/offline memory and would sometimes call it (but not always). Things like the Hyper-V balloon and XEN balloon never offline memory, and would only call it when onlining memory. > > It seems like in this case the actual amount of memory does not > change, but the drivers take it away from the system. It makes some > sense to me that the zswap limits do not change in this case, > especially that userspace just sets those limits as a percentage of > total memory. I wouldn't expect userspace to take ballooning into > account here. > For virtio-mem, it does change ("actual amount of memory"). For virtio-balloon, it's tricky. When using virtio-balloon for VM resizing, it would similarly change. When using it for pure memory overcommit, it depends on whatever the policy in the hypervisor is ... 
might be that under memory pressure that memory is simply given back to the VM. > However, it would be a behavioral change from today where we always > rely on totalram_pages(). Also, if userspace happens to change the > limit when a driver is holding a big chunk of memory away from > totalram_pages, then the limit would be constantly underestimated. > > I do not have enough familiarity with memory ballooning to know for > sure if this is okay. How much memory can memory ballooning add/remove > from totalram_pages(), and is it usually transient or does it stick > around for a while? > > Also CCing David here. It can be a lot. Especially with the Hyper-V balloon (but also on environments where other forms of memory hotunplug are not supported), memory ballooning can be used to unplug memory. So that memory can be gone for good and it can end up being quite a lot of memory. The clean thing to do would be to have a way for other subsystems to get notified on any totalram_pages() changes, so they can adjust accordingly.
[..] > >> Note that totalram_pages can also change during memory onlining and > >> offlining. For that you need a memory notifier that also calls that > >> refresh function. It's simple enough, though, check out the code > >> around register_memory_notifier() in drivers/xen/balloon.c. > > > > Good point, I completely missed this. It seems like totalram_pages can > > actually change from contexts other than memory hotplug as well, > > specifically through adjust_managed_page_count(), and mostly through > > ballooning drivers. Do we trigger the notifiers in this case? I can't > > find such logic. > > Things like virtio-balloon never online/offline memory and would never > call it. I see calls to adjust_managed_page_count() from drivers/virtio/virtio_balloon.c, but I don't understand enough to know what they are doing. > > Things like virtio-mem sometimes will online/offline memory and would > sometimes call it (but not always). Things like the Hyper-V balloon and > XEN balloon never offline memory, and would only call it when onlining > memory. Thanks for the details. > > > > > It seems like in this case the actual amount of memory does not > > change, but the drivers take it away from the system. It makes some > > sense to me that the zswap limits do not change in this case, > > especially that userspace just sets those limits as a percentage of > > total memory. I wouldn't expect userspace to take ballooning into > > account here. > > > > For virtio-mem, it does change ("actual amount of memory"). For > virtio-balloon, it's tricky. When using virtio-balloon for VM resizing, > it would similarly change. When using it for pure memory overcommit, it > depends on whatever the policy in the hypervisor is ... might be that > under memory pressure that memory is simply given back to the VM. That's good to know, it seems like we need to take these into account, and not just because the users may happen to change zswap limits while they are onlining/offlining memory. > > > However, it would be a behavioral change from today where we always > > rely on totalram_pages(). Also, if userspace happens to change the > > limit when a driver is holding a big chunk of memory away from > > totalram_pages, then the limit would be constantly underestimated. > > > > I do not have enough familiarity with memory ballooning to know for > > sure if this is okay. How much memory can memory ballooning add/remove > > from totalram_pages(), and is it usually transient or does it stick > > around for a while? > > > > Also CCing David here. > > It can be a lot. Especially with the Hyper-V balloon (but also on > environments where other forms of memory hotunplug are not supported), > memory ballooning can be used to unplug memory. So that memory can be > gone for good and it can end up being quite a lot of memory. > > The clean thing to do would be to have a way for other subsystems to get > notified on any totalram_pages() changes, so they can adjust accordingly. Yeah I agree. I imagined that register_memory_notifier() would be the way to do that. Apparently it is only effective for memory hotplug. From your description, it sounds like the ballooning drivers may have a very similar effect to memory hotplug, but I don't see memory_notify() being called in these paths. Do we need a separate notifier chain for totalram_pages() updates?
On 08.04.24 10:07, Yosry Ahmed wrote: > [..] >>>> Note that totalram_pages can also change during memory onlining and >>>> offlining. For that you need a memory notifier that also calls that >>>> refresh function. It's simple enough, though, check out the code >>>> around register_memory_notifier() in drivers/xen/balloon.c. >>> >>> Good point, I completely missed this. It seems like totalram_pages can >>> actually change from contexts other than memory hotplug as well, >>> specifically through adjust_managed_page_count(), and mostly through >>> ballooning drivers. Do we trigger the notifiers in this case? I can't >>> find such logic. >> >> Things like virtio-balloon never online/offline memory and would never >> call it. > > I see calls to adjust_managed_page_count() from > drivers/virtio/virtio_balloon.c, but I don't understand enough to know > what they are doing. Essentially fake removing/adding pages. :) > >> >> Things like virtio-mem sometimes will online/offline memory and would >> sometimes call it (but not always). Things like the Hyper-V balloon and >> XEN balloon never offline memory, and would only call it when onlining >> memory. > > Thanks for the details. > >> >>> >>> It seems like in this case the actual amount of memory does not >>> change, but the drivers take it away from the system. It makes some >>> sense to me that the zswap limits do not change in this case, >>> especially that userspace just sets those limits as a percentage of >>> total memory. I wouldn't expect userspace to take ballooning into >>> account here. >>> >> >> For virtio-mem, it does change ("actual amount of memory"). For >> virtio-balloon, it's tricky. When using virtio-balloon for VM resizing, >> it would similarly change. When using it for pure memory overcommit, it >> depends on whatever the policy in the hypervisor is ... might be that >> under memory pressure that memory is simply given back to the VM. > > That's good to know, it seems like we need to take these into account, > and not just because the users may happen to change zswap limits while > they are onlining/offlining memory. Yes. Likely other parts of the system would want to know when available memory changes (especially, if it's quite significant). > >> >>> However, it would be a behavioral change from today where we always >>> rely on totalram_pages(). Also, if userspace happens to change the >>> limit when a driver is holding a big chunk of memory away from >>> totalram_pages, then the limit would be constantly underestimated. >>> >>> I do not have enough familiarity with memory ballooning to know for >>> sure if this is okay. How much memory can memory ballooning add/remove >>> from totalram_pages(), and is it usually transient or does it stick >>> around for a while? >>> >>> Also CCing David here. >> >> It can be a lot. Especially with the Hyper-V balloon (but also on >> environments where other forms of memory hotunplug are not supported), >> memory ballooning can be used to unplug memory. So that memory can be >> gone for good and it can end up being quite a lot of memory. >> >> The clean thing to do would be to have a way for other subsystems to get >> notified on any totalram_pages() changes, so they can adjust accordingly. > > Yeah I agree. I imagined that register_memory_notifier() would be the > way to do that. Apparently it is only effective for memory hotplug. Yes. Right now it's always called under the memory hotplug lock. 
We could reuse it for this fake hotplug/unplug as well, but we'd have to clarify how exactly we expect this to interact with the existing notifications+notifiers. > From your description, it sounds like the ballooning drivers may have > a very similar effect to memory hotplug, but I don't see > memory_notify() being called in these paths. Right, the existing notifications (MEM_ONLINE etc.) are called when memory blocks change their state. That's not what the fake hotplug/unplug does, that operates on much smaller ranges. > > Do we need a separate notifier chain for totalram_pages() updates? Good question. I actually might have the requirement to notify some arch code (s390x) from virtio-mem when fake adding/removing memory, and already wondered how to best wire that up. Maybe we can squeeze that into the existing notifier chain, but needs a bit of thought.
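For readers unfamiliar with ballooning, the "fake removing/adding pages" David describes boils down to balloon drivers calling adjust_managed_page_count() from their inflate/deflate paths, per page, without any hotplug notifier firing. A simplified sketch (not the literal virtio_balloon.c code) of that pattern:

```c
/* Inflate: the page is handed to the hypervisor; stop counting it. */
static void balloon_take_page(struct page *page)
{
	adjust_managed_page_count(page, -1);	/* totalram_pages() shrinks by 1 */
}

/* Deflate: the page is given back to the guest and counted again. */
static void balloon_return_page(struct page *page)
{
	adjust_managed_page_count(page, 1);	/* totalram_pages() grows by 1 */
}
```

Because no MEM_ONLINE/MEM_OFFLINE event is generated on these paths, a plain register_memory_notifier() subscriber (like the sketch earlier in the thread) would miss balloon-driven changes to totalram_pages().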
[..] > > Do we need a separate notifier chain for totalram_pages() updates? > > Good question. I actually might have the requirement to notify some arch > code (s390x) from virtio-mem when fake adding/removing memory, and > already wondered how to best wire that up. > > Maybe we can squeeze that into the existing notifier chain, but needs a > bit of thought. Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, MEM_OFFLINE, etc). New actions mean minimal impact to existing notifiers, but it may make more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions that mean "memory increased" and "memory decreased". I suppose we can add new actions and then separately (and probably incrementally) audit existing notifiers to check if they want to handle the new actions as well. Another consideration is that apparently some ballooning drivers also register notifiers, so we need to make sure there is no possibility of deadlock/recursion.
On 10.04.24 02:52, Yosry Ahmed wrote: > [..] >>> Do we need a separate notifier chain for totalram_pages() updates? >> >> Good question. I actually might have the requirement to notify some arch >> code (s390x) from virtio-mem when fake adding/removing memory, and >> already wondered how to best wire that up. >> >> Maybe we can squeeze that into the existing notifier chain, but needs a >> bit of thought. > Sorry for the late reply, I had to think about this a bit. > Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > MEM_OFFLINE, etc). At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE that prepare the whole range belonging to the Linux memory block (/sys/devices/system/memory/memory...) to go online, and then have something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable (!PageOffline, handed to the buddy) or unusable (PageOffline, removed from the buddy). There are some details to be figured out, but it could work. And as virtio-mem currently operates in pageblock granularity (e.g., 2 MiB), but frequently handles multiple contiguous pageblocks within a Linux memory block, it's not that bad. But the issue I see with ballooning is that we operate here often on page granularity. While we could optimize some cases, we might get quite some overhead from all the notifications. Alternatively, we could send a list of pages, but it won't win a beauty contest. I think the main issue is that, for my purpose (virtio-mem on s390x), I need to notify about the exact memory ranges (so I can reinitialize stuff in s390x code when memory gets effectively re-enabled). For other cases (total pages changing), we don't need the memory ranges, but only the "summary" -- or a notification afterwards that the total pages were just changed quite a bit. > > New actions mean minimal impact to existing notifiers, but it may make > more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions > that mean "memory increased" and "memory decreased". Likely, we should keep their semantics unchanged. Things like KASAN want to allocate metadata memory for the whole range, not on some smallish pieces. It really means "This Linux memory block goes online/offline, please prepare for that.". And again, memory ballooning with small pages is a bit problematic. > > I suppose we can add new actions and then separately (and probably > incrementally) audit existing notifiers to check if they want to > handle the new actions as well. > > Another consideration is that apparently some ballooning drivers also > register notifiers, so we need to make sure there is no possibility of > deadlock/recursion. Right.
On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: > > On 10.04.24 02:52, Yosry Ahmed wrote: > > [..] > >>> Do we need a separate notifier chain for totalram_pages() updates? > >> > >> Good question. I actually might have the requirement to notify some arch > >> code (s390x) from virtio-mem when fake adding/removing memory, and > >> already wondered how to best wire that up. > >> > >> Maybe we can squeeze that into the existing notifier chain, but needs a > >> bit of thought. > > > > Sorry for the late reply, I had to think about this a bit. > > > Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > > MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > > MEM_OFFLINE, etc). > > At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE > that prepare the whole range belonging to the Linux memory block > (/sys/devices/system/memory/memory...) to go online, and then have > something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or > ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable > (!PageOffline, handed to the buddy) or unusable (PageOffline, removed > from the buddy). > > There are some details to be figured out, but it could work. > > And as virtio-mem currently operates in pageblock granularity (e.g., 2 > MiB), but frequently handles multiple contiguous pageblocks within a > Linux memory block, it's not that bad. > > > But the issue I see with ballooning is that we operate here often on > page granularity. While we could optimize some cases, we might get quite > some overhead from all the notifications. Alternatively, we could send a > list of pages, but it won't win a beauty contest. > > I think the main issue is that, for my purpose (virtio-mem on s390x), I > need to notify about the exact memory ranges (so I can reinitialize > stuff in s390x code when memory gets effectively re-enabled). For other > cases (total pages changing), we don't need the memory ranges, but only > the "summary" -- or a notification afterwards that the total pages were > just changed quite a bit. Thanks for shedding some light on this. Although I am not familiar with ballooning, sending notifications on page granularity updates sounds terrible. It seems like this is not as straightforward as I had anticipated. I was going to take a stab at this, but given that the motivation is a minor optimization on the zswap side, I will probably just give up :) For now, I will drop this optimization from the series for now, and I can revisit it if/when notifications for totalram_pages() are implemented at some point. Please CC me if you do so for the s390x use case :) > > > > > > New actions mean minimal impact to existing notifiers, but it may make > > more sense to reuse MEM_ONLINE and MEM_OFFLINE to have generic actions > > that mean "memory increased" and "memory decreased". > > Likely, we should keep their semantics unchanged. Things like KASAN want > to allocate metadata memory for the whole range, not on some smallish > pieces. It really means "This Linux memory block goes online/offline, > please prepare for that.". And again, memory ballooning with small pages > is a bit problematic. > > > > > I suppose we can add new actions and then separately (and probably > > incrementally) audit existing notifiers to check if they want to > > handle the new actions as well. > > > > Another consideration is that apparently some ballooning drivers also > > register notifiers, so we need to make sure there is no possibility of > > deadlock/recursion. 
> Right.
On 13.04.24 03:05, Yosry Ahmed wrote: > On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: >> >> On 10.04.24 02:52, Yosry Ahmed wrote: >>> [..] >>>>> Do we need a separate notifier chain for totalram_pages() updates? >>>> >>>> Good question. I actually might have the requirement to notify some arch >>>> code (s390x) from virtio-mem when fake adding/removing memory, and >>>> already wondered how to best wire that up. >>>> >>>> Maybe we can squeeze that into the existing notifier chain, but needs a >>>> bit of thought. >>> >> >> Sorry for the late reply, I had to think about this a bit. >> >>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, >>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, >>> MEM_OFFLINE, etc). >> >> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE >> that prepare the whole range belonging to the Linux memory block >> (/sys/devices/system/memory/memory...) to go online, and then have >> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or >> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable >> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed >> from the buddy). >> >> There are some details to be figured out, but it could work. >> >> And as virtio-mem currently operates in pageblock granularity (e.g., 2 >> MiB), but frequently handles multiple contiguous pageblocks within a >> Linux memory block, it's not that bad. >> >> >> But the issue I see with ballooning is that we operate here often on >> page granularity. While we could optimize some cases, we might get quite >> some overhead from all the notifications. Alternatively, we could send a >> list of pages, but it won't win a beauty contest. >> >> I think the main issue is that, for my purpose (virtio-mem on s390x), I >> need to notify about the exact memory ranges (so I can reinitialize >> stuff in s390x code when memory gets effectively re-enabled). For other >> cases (total pages changing), we don't need the memory ranges, but only >> the "summary" -- or a notification afterwards that the total pages were >> just changed quite a bit. > > > Thanks for shedding some light on this. Although I am not familiar > with ballooning, sending notifications on page granularity updates > sounds terrible. It seems like this is not as straightforward as I had > anticipated. > > I was going to take a stab at this, but given that the motivation is a > minor optimization on the zswap side, I will probably just give up :) Oh no, so I have to do the work! ;) > > For now, I will drop this optimization from the series for now, and I > can revisit it if/when notifications for totalram_pages() are > implemented at some point. Please CC me if you do so for the s390x use > case :) I primarily care about virtio-mem resizing VM memory and adjusting totalram_pages(), memory ballooning for that is rather a hack for that use case ... so we're in agreement :) Likely we'd want two notification mechanisms, but no matter how I look at it, it's all a bit ugly. I'll look into the virtio-mem case soonish and will let you know once I have something running.
On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <david@redhat.com> wrote: > > On 13.04.24 03:05, Yosry Ahmed wrote: > > On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 10.04.24 02:52, Yosry Ahmed wrote: > >>> [..] > >>>>> Do we need a separate notifier chain for totalram_pages() updates? > >>>> > >>>> Good question. I actually might have the requirement to notify some arch > >>>> code (s390x) from virtio-mem when fake adding/removing memory, and > >>>> already wondered how to best wire that up. > >>>> > >>>> Maybe we can squeeze that into the existing notifier chain, but needs a > >>>> bit of thought. > >>> > >> > >> Sorry for the late reply, I had to think about this a bit. > >> > >>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > >>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > >>> MEM_OFFLINE, etc). > >> > >> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE > >> that prepare the whole range belonging to the Linux memory block > >> (/sys/devices/system/memory/memory...) to go online, and then have > >> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or > >> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable > >> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed > >> from the buddy). > >> > >> There are some details to be figured out, but it could work. > >> > >> And as virtio-mem currently operates in pageblock granularity (e.g., 2 > >> MiB), but frequently handles multiple contiguous pageblocks within a > >> Linux memory block, it's not that bad. > >> > >> > >> But the issue I see with ballooning is that we operate here often on > >> page granularity. While we could optimize some cases, we might get quite > >> some overhead from all the notifications. Alternatively, we could send a > >> list of pages, but it won't win a beauty contest. > >> > >> I think the main issue is that, for my purpose (virtio-mem on s390x), I > >> need to notify about the exact memory ranges (so I can reinitialize > >> stuff in s390x code when memory gets effectively re-enabled). For other > >> cases (total pages changing), we don't need the memory ranges, but only > >> the "summary" -- or a notification afterwards that the total pages were > >> just changed quite a bit. > > > > > > Thanks for shedding some light on this. Although I am not familiar > > with ballooning, sending notifications on page granularity updates > > sounds terrible. It seems like this is not as straightforward as I had > > anticipated. > > > > I was going to take a stab at this, but given that the motivation is a > > minor optimization on the zswap side, I will probably just give up :) > > Oh no, so I have to do the work! ;) > > > > > For now, I will drop this optimization from the series for now, and I > > can revisit it if/when notifications for totalram_pages() are > > implemented at some point. Please CC me if you do so for the s390x use > > case :) > > I primarily care about virtio-mem resizing VM memory and adjusting > totalram_pages(), memory ballooning for that is rather a hack for that > use case ... so we're in agreement :) > > Likely we'd want two notification mechanisms, but no matter how I look > at it, it's all a bit ugly. I am assuming you mean one with exact memory ranges for your s390x use case, and one high-level mechanism for totalram_pages() updates -- or did I miss the point? I am interested to see how page granularity updates would be handled in this case. 
Perhaps they are only relevant for the high-level mechanism? In that case, I suppose we can batch updates and notify once when a threshold is crossed or something. > > I'll look into the virtio-mem case soonish and will let you know once I > have something running. Thanks!
On 15.04.24 20:30, Yosry Ahmed wrote: > On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <david@redhat.com> wrote: >> >> On 13.04.24 03:05, Yosry Ahmed wrote: >>> On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 10.04.24 02:52, Yosry Ahmed wrote: >>>>> [..] >>>>>>> Do we need a separate notifier chain for totalram_pages() updates? >>>>>> >>>>>> Good question. I actually might have the requirement to notify some arch >>>>>> code (s390x) from virtio-mem when fake adding/removing memory, and >>>>>> already wondered how to best wire that up. >>>>>> >>>>>> Maybe we can squeeze that into the existing notifier chain, but needs a >>>>>> bit of thought. >>>>> >>>> >>>> Sorry for the late reply, I had to think about this a bit. >>>> >>>>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, >>>>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, >>>>> MEM_OFFLINE, etc). >>>> >>>> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE >>>> that prepare the whole range belonging to the Linux memory block >>>> (/sys/devices/system/memory/memory...) to go online, and then have >>>> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or >>>> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable >>>> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed >>>> from the buddy). >>>> >>>> There are some details to be figured out, but it could work. >>>> >>>> And as virtio-mem currently operates in pageblock granularity (e.g., 2 >>>> MiB), but frequently handles multiple contiguous pageblocks within a >>>> Linux memory block, it's not that bad. >>>> >>>> >>>> But the issue I see with ballooning is that we operate here often on >>>> page granularity. While we could optimize some cases, we might get quite >>>> some overhead from all the notifications. Alternatively, we could send a >>>> list of pages, but it won't win a beauty contest. >>>> >>>> I think the main issue is that, for my purpose (virtio-mem on s390x), I >>>> need to notify about the exact memory ranges (so I can reinitialize >>>> stuff in s390x code when memory gets effectively re-enabled). For other >>>> cases (total pages changing), we don't need the memory ranges, but only >>>> the "summary" -- or a notification afterwards that the total pages were >>>> just changed quite a bit. >>> >>> >>> Thanks for shedding some light on this. Although I am not familiar >>> with ballooning, sending notifications on page granularity updates >>> sounds terrible. It seems like this is not as straightforward as I had >>> anticipated. >>> >>> I was going to take a stab at this, but given that the motivation is a >>> minor optimization on the zswap side, I will probably just give up :) >> >> Oh no, so I have to do the work! ;) >> >>> >>> For now, I will drop this optimization from the series for now, and I >>> can revisit it if/when notifications for totalram_pages() are >>> implemented at some point. Please CC me if you do so for the s390x use >>> case :) >> >> I primarily care about virtio-mem resizing VM memory and adjusting >> totalram_pages(), memory ballooning for that is rather a hack for that >> use case ... so we're in agreement :) >> >> Likely we'd want two notification mechanisms, but no matter how I look >> at it, it's all a bit ugly. > > I am assuming you mean one with exact memory ranges for your s390x use > case, and one high-level mechanism for totalram_pages() updates -- or > did I miss the point? No, that's it. 
> > I am interested to see how page granularity updates would be handled > in this case. Perhaps they are only relevant for the high-level > mechanism? In that case, I suppose we can batch updates and notify > once when a threshold is crossed or something. Yes, we'd batch updates.
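To make the batching idea concrete, one possible shape for such a mechanism is sketched below. Everything here is hypothetical — no totalram notifier chain exists upstream, and details like calling context and locking are glossed over. It only illustrates how per-page balloon adjustments could be coalesced so subscribers such as zswap are notified once per significant change rather than once per page:

```c
/* Hypothetical API: a chain notified when totalram_pages() changes a lot. */
static BLOCKING_NOTIFIER_HEAD(totalram_notify_chain);
static atomic_long_t totalram_pending_delta;

int register_totalram_notifier(struct notifier_block *nb)
{
	return blocking_notifier_chain_register(&totalram_notify_chain, nb);
}

/* Hypothetically called from adjust_managed_page_count() and friends. */
void totalram_pages_changed(long delta)
{
	long pending = atomic_long_add_return(delta, &totalram_pending_delta);

	/* Batch: only notify once the net change crosses ~1/64 of memory. */
	if (abs(pending) < totalram_pages() / 64)
		return;

	atomic_long_sub(pending, &totalram_pending_delta);
	blocking_notifier_call_chain(&totalram_notify_chain, 0, NULL);
}
```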
On Mon, Apr 15, 2024 at 12:15 PM David Hildenbrand <david@redhat.com> wrote: > > On 15.04.24 20:30, Yosry Ahmed wrote: > > On Mon, Apr 15, 2024 at 8:10 AM David Hildenbrand <david@redhat.com> wrote: > >> > >> On 13.04.24 03:05, Yosry Ahmed wrote: > >>> On Fri, Apr 12, 2024 at 12:48 PM David Hildenbrand <david@redhat.com> wrote: > >>>> > >>>> On 10.04.24 02:52, Yosry Ahmed wrote: > >>>>> [..] > >>>>>>> Do we need a separate notifier chain for totalram_pages() updates? > >>>>>> > >>>>>> Good question. I actually might have the requirement to notify some arch > >>>>>> code (s390x) from virtio-mem when fake adding/removing memory, and > >>>>>> already wondered how to best wire that up. > >>>>>> > >>>>>> Maybe we can squeeze that into the existing notifier chain, but needs a > >>>>>> bit of thought. > >>>>> > >>>> > >>>> Sorry for the late reply, I had to think about this a bit. > >>>> > >>>>> Do you mean by adding new actions (e.g. MEM_FAKE_ONLINE, > >>>>> MEM_FAKE_OFFLINE), or by reusing the existing actions (MEM_ONLINE, > >>>>> MEM_OFFLINE, etc). > >>>> > >>>> At least for virtio-mem, I think we could have a MEM_ONLINE/MEM_OFFLINE > >>>> that prepare the whole range belonging to the Linux memory block > >>>> (/sys/devices/system/memory/memory...) to go online, and then have > >>>> something like MEM_SOFT_ONLINE/MEM_SOFT_OFFLINE or > >>>> ENABLE_PAGES/DISABLE_PAGES ... notifications when parts become usable > >>>> (!PageOffline, handed to the buddy) or unusable (PageOffline, removed > >>>> from the buddy). > >>>> > >>>> There are some details to be figured out, but it could work. > >>>> > >>>> And as virtio-mem currently operates in pageblock granularity (e.g., 2 > >>>> MiB), but frequently handles multiple contiguous pageblocks within a > >>>> Linux memory block, it's not that bad. > >>>> > >>>> > >>>> But the issue I see with ballooning is that we operate here often on > >>>> page granularity. While we could optimize some cases, we might get quite > >>>> some overhead from all the notifications. Alternatively, we could send a > >>>> list of pages, but it won't win a beauty contest. > >>>> > >>>> I think the main issue is that, for my purpose (virtio-mem on s390x), I > >>>> need to notify about the exact memory ranges (so I can reinitialize > >>>> stuff in s390x code when memory gets effectively re-enabled). For other > >>>> cases (total pages changing), we don't need the memory ranges, but only > >>>> the "summary" -- or a notification afterwards that the total pages were > >>>> just changed quite a bit. > >>> > >>> > >>> Thanks for shedding some light on this. Although I am not familiar > >>> with ballooning, sending notifications on page granularity updates > >>> sounds terrible. It seems like this is not as straightforward as I had > >>> anticipated. > >>> > >>> I was going to take a stab at this, but given that the motivation is a > >>> minor optimization on the zswap side, I will probably just give up :) > >> > >> Oh no, so I have to do the work! ;) > >> > >>> > >>> For now, I will drop this optimization from the series for now, and I > >>> can revisit it if/when notifications for totalram_pages() are > >>> implemented at some point. Please CC me if you do so for the s390x use > >>> case :) > >> > >> I primarily care about virtio-mem resizing VM memory and adjusting > >> totalram_pages(), memory ballooning for that is rather a hack for that > >> use case ... 
so we're in agreement :) > >> > >> Likely we'd want two notification mechanisms, but no matter how I look > >> at it, it's all a bit ugly. > > > > I am assuming you mean one with exact memory ranges for your s390x use > > case, and one high-level mechanism for totalram_pages() updates -- or > > did I miss the point? > > No, that's it. > > > > > I am interested to see how page granularity updates would be handled > > in this case. Perhaps they are only relevant for the high-level > > mechanism? In that case, I suppose we can batch updates and notify > > once when a threshold is crossed or something. > > Yes, we'd batch updates. Thanks for clarifying, looking forward to your changes :)
diff --git a/mm/zswap.c b/mm/zswap.c
index 1cf3ab4b22e64..832e3f56232f0 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -116,12 +116,29 @@ module_param_cb(zpool, &zswap_zpool_param_ops, &zswap_zpool_type, 0644);
 
 /* The maximum percentage of memory that the compressed pool can occupy */
 static unsigned int zswap_max_pool_percent = 20;
-module_param_named(max_pool_percent, zswap_max_pool_percent, uint, 0644);
+static unsigned long zswap_max_pages;
+
+static int zswap_max_pool_param_set(const char *,
+				    const struct kernel_param *);
+static const struct kernel_param_ops zswap_max_pool_param_ops = {
+	.set = zswap_max_pool_param_set,
+	.get = param_get_uint,
+};
+module_param_cb(max_pool_percent, &zswap_max_pool_param_ops,
+		&zswap_max_pool_percent, 0644);
 
 /* The threshold for accepting new pages after the max_pool_percent was hit */
 static unsigned int zswap_accept_thr_percent = 90; /* of max pool size */
-module_param_named(accept_threshold_percent, zswap_accept_thr_percent,
-		   uint, 0644);
+unsigned long zswap_accept_thr_pages;
+
+static int zswap_accept_thr_param_set(const char *,
+				      const struct kernel_param *);
+static const struct kernel_param_ops zswap_accept_thr_param_ops = {
+	.set = zswap_accept_thr_param_set,
+	.get = param_get_uint,
+};
+module_param_cb(accept_threshold_percent, &zswap_accept_thr_param_ops,
+		&zswap_accept_thr_percent, 0644);
 
 /*
  * Enable/disable handling same-value filled pages (enabled by default).
@@ -490,14 +507,16 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
 	return NULL;
 }
 
-static unsigned long zswap_max_pages(void)
+static void zswap_update_max_pages(void)
 {
-	return totalram_pages() * zswap_max_pool_percent / 100;
+	WRITE_ONCE(zswap_max_pages,
+		   totalram_pages() * zswap_max_pool_percent / 100);
 }
 
-static unsigned long zswap_accept_thr_pages(void)
+static void zswap_update_accept_thr_pages(void)
 {
-	return zswap_max_pages() * zswap_accept_thr_percent / 100;
+	WRITE_ONCE(zswap_accept_thr_pages,
+		   READ_ONCE(zswap_max_pages) * zswap_accept_thr_percent / 100);
 }
 
 unsigned long zswap_total_pages(void)
@@ -684,6 +703,43 @@ static int zswap_enabled_param_set(const char *val,
 	return ret;
 }
 
+static int __zswap_percent_param_set(const char *val,
+				     const struct kernel_param *kp)
+{
+	unsigned int n;
+	int ret;
+
+	ret = kstrtouint(val, 10, &n);
+	if (ret || n > 100)
+		return -EINVAL;
+
+	return param_set_uint(val, kp);
+}
+
+static int zswap_max_pool_param_set(const char *val,
+				    const struct kernel_param *kp)
+{
+	int err = __zswap_percent_param_set(val, kp);
+
+	if (!err) {
+		zswap_update_max_pages();
+		zswap_update_accept_thr_pages();
+	}
+
+	return err;
+}
+
+static int zswap_accept_thr_param_set(const char *val,
+				      const struct kernel_param *kp)
+{
+	int err = __zswap_percent_param_set(val, kp);
+
+	if (!err)
+		zswap_update_accept_thr_pages();
+
+	return err;
+}
+
 /*********************************
 * lru functions
 **********************************/
@@ -1305,10 +1361,6 @@ static void shrink_worker(struct work_struct *w)
 {
 	struct mem_cgroup *memcg;
 	int ret, failures = 0;
-	unsigned long thr;
-
-	/* Reclaim down to the accept threshold */
-	thr = zswap_accept_thr_pages();
 
 	/* global reclaim will select cgroup in a round-robin fashion. */
 	do {
@@ -1358,7 +1410,8 @@ static void shrink_worker(struct work_struct *w)
 			break;
 resched:
 		cond_resched();
-	} while (zswap_total_pages() > thr);
+	/* Reclaim down to the accept threshold */
+	} while (zswap_total_pages() > READ_ONCE(zswap_accept_thr_pages));
 }
 
 static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
@@ -1424,16 +1477,14 @@ bool zswap_store(struct folio *folio)
 
 	/* Check global limits */
 	cur_pages = zswap_total_pages();
-	max_pages = zswap_max_pages();
-
-	if (cur_pages >= max_pages) {
+	if (cur_pages >= READ_ONCE(zswap_max_pages)) {
 		zswap_pool_limit_hit++;
 		zswap_pool_reached_full = true;
 		goto reject;
 	}
 
 	if (zswap_pool_reached_full) {
-		if (cur_pages > zswap_accept_thr_pages())
+		if (cur_pages > READ_ONCE(zswap_accept_thr_pages))
 			goto reject;
 		else
 			zswap_pool_reached_full = false;
@@ -1734,6 +1785,9 @@ static int zswap_setup(void)
 		zswap_enabled = false;
 	}
 
+	zswap_update_max_pages();
+	zswap_update_accept_thr_pages();
+
 	if (zswap_debugfs_init())
 		pr_warn("debugfs initialization failed\n");
 	zswap_init_state = ZSWAP_INIT_SUCCEED;
Currently, we calculate the zswap global limit, and potentially the
acceptance threshold, in pages in the zswap store path. This is
unnecessary because the values rarely change.

Instead, precalculate them when the module parameters are updated,
which should be rare. Since we are adding custom handlers for setting
the percentages now, add proper validation that they are <= 100.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/zswap.c | 86 ++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 70 insertions(+), 16 deletions(-)