Message ID | 20250128185507.2176-1-42.hyeyoo@gmail.com (mailing list archive)
---|---
State | New
Series | [v2,mm-hotfixes] mm/zswap: fix inconsistent charging when zswap_store_page() fails
On Wed, Jan 29, 2025 at 03:55:07AM +0900, Hyeonggon Yoo wrote:
> Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> skips charging any zswapped base pages when it fails to zswap the entire
> folio.
[...]
> +charge_zswap:
> +	/*
> +	 * Charge zswapped pages even when it failed to zswap the entire folio,
> +	 * because zswap_entry_free() will uncharge them anyway.
> +	 * Otherwise zswap charging will become inconsistent.
> +	 */
> +	if (objcg)
> +		obj_cgroup_charge_zswap(objcg, compressed_bytes);

Thanks for fixing this!

Having to charge just to uncharge right after is annoying. Ideally we'd
just clear entry->objcg if we fail before charging, but we don't have a
direct reference to the entries here and another tree lookup is not
ideal either.

I guess we may be able to improve this handling once [1] lands, as we
can move the charging logic into zswap_store_folio() where we'd have
access to the entries.

For now, would the control flow be easier if we moved the charge ahead
of the zswap_store_page() loop instead? There is an existing if (objcg)
block there as well.

[1] https://lore.kernel.org/linux-mm/20241221063119.29140-12-kanchana.p.sridhar@intel.com/
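For reference, the uncharge path at the center of this discussion:
zswap_entry_free() uncharges unconditionally whenever entry->objcg is set,
so entries freed during the rollback uncharge bytes that were never
charged. An abbreviated sketch based on mainline mm/zswap.c around the
time of this thread (not the verbatim function; unrelated details elided):

static void zswap_entry_free(struct zswap_entry *entry)
{
	zswap_lru_del(&zswap_list_lru, entry);
	zpool_free(entry->pool->zpool, entry->handle);
	zswap_pool_put(entry->pool);
	if (entry->objcg) {
		/* Uncharges entry->length bytes even for entries created
		 * during a rolled-back zswap_store(), whose bytes were
		 * never charged before this fix. */
		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
		obj_cgroup_put(entry->objcg);
	}
	zswap_entry_cache_free(entry);
	/* Also decrements a counter that, pre-fix, was only incremented
	 * on full-folio success -- the zswap_stored_pages issue raised
	 * later in the thread. */
	atomic_long_dec(&zswap_stored_pages);
}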
Hi Hyeonggon,

> -----Original Message-----
> From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Sent: Tuesday, January 28, 2025 10:55 AM
> Subject: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging when
> zswap_store_page() fails
>
> Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> skips charging any zswapped base pages when it fails to zswap the entire
> folio.
[...]
> +charge_zswap:
> +	/*
> +	 * Charge zswapped pages even when it failed to zswap the entire folio,
> +	 * because zswap_entry_free() will uncharge them anyway.
> +	 * Otherwise zswap charging will become inconsistent.
> +	 */
> +	if (objcg)
> +		obj_cgroup_charge_zswap(objcg, compressed_bytes);

Thanks for finding this bug! I am thinking it might make sense to charge
and increment the zswap_stored_pages counter in zswap_store_page().
Something like:

diff --git a/mm/zswap.c b/mm/zswap.c
index b84c20d889b1..fd2a72598a8a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1504,11 +1504,14 @@ static ssize_t zswap_store_page(struct page *page,
 	entry->pool = pool;
 	entry->swpentry = page_swpentry;
 	entry->objcg = objcg;
+	if (objcg)
+		obj_cgroup_charge_zswap(objcg, entry->length);
 	entry->referenced = true;
 	if (entry->length) {
 		INIT_LIST_HEAD(&entry->lru);
 		zswap_lru_add(&zswap_list_lru, entry);
 	}
+	atomic_long_inc(&zswap_stored_pages);
 
 	return entry->length;
 
@@ -1526,7 +1529,6 @@ bool zswap_store(struct folio *folio)
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 	struct zswap_pool *pool;
-	size_t compressed_bytes = 0;
 	bool ret = false;
 	long index;
 
@@ -1569,15 +1571,11 @@ bool zswap_store(struct folio *folio)
 		bytes = zswap_store_page(page, objcg, pool);
 		if (bytes < 0)
 			goto put_pool;
-		compressed_bytes += bytes;
 	}
 
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+	if (objcg)
 		count_objcg_events(objcg, ZSWPOUT, nr_pages);
-	}
 
-	atomic_long_add(nr_pages, &zswap_stored_pages);
 	count_vm_events(ZSWPOUT, nr_pages);
 
 	ret = true;

What do you think?

Yosry, Nhat, Johannes, please let me know if this would be a cleaner
approach. If so, I don't think we would be losing a lot of performance
by not doing the one-time charge per folio, but please let me know your
thoughts as well.

Thanks,
Kanchana
> -----Original Message-----
> From: Yosry Ahmed <yosry.ahmed@linux.dev>
> Sent: Tuesday, January 28, 2025 11:04 AM
> Subject: Re: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging
> when zswap_store_page() fails
>
> On Wed, Jan 29, 2025 at 03:55:07AM +0900, Hyeonggon Yoo wrote:
[...]
> Thanks for fixing this!
>
> Having to charge just to uncharge right after is annoying. Ideally we'd
> just clear entry->objcg if we fail before charging, but we don't have a
> direct reference to the entries here and another tree lookup is not
> ideal either.
>
> I guess we may be able to improve this handling once [1] lands, as we
> can move the charging logic into zswap_store_folio() where we'd have
> access to the entries.

Thanks Yosry. I agree, we can improve this handling in [1]. I will add
this to my list.

> For now, would the control flow be easier if we moved the charge ahead
> of the zswap_store_page() loop instead? There is an existing if (objcg)
> block there as well.

I just replied with a suggestion to move the objcg charging and
incrementing zswap_stored_pages to be per successful xarray store,
within zswap_store_page() itself. Please let me know if this would be
a good solution for the hotfix.

Thanks,
Kanchana

> [1] https://lore.kernel.org/linux-mm/20241221063119.29140-12-kanchana.p.sridhar@intel.com/
On Tue, Jan 28, 2025 at 07:09:05PM +0000, Sridhar, Kanchana P wrote:
> Hi Hyeonggon,
>
[...]
> Thanks for finding this bug! I am thinking it might make sense to charge
> and increment the zswap_stored_pages counter in zswap_store_page().
> Something like:
[...]
> What do you think?
>
> Yosry, Nhat, Johannes, please let me know if this would be a cleaner
> approach. If so, I don't think we would be losing a lot of performance
> by not doing the one-time charge per folio, but please let me know your
> thoughts as well.

This is certainly cleaner, and thanks for catching that the
zswap_stored_pages accounting is also wrong.

I am not sure whether this has a meaningful impact on performance, but
it seems like we are doing a bit more work in the common success case to
avoid the work in the uncommon failure case. Moving the charge (and the
atomic addition) above the zswap_store_page() loop would be doing the
opposite, albeit less cleanly.

I don't feel strongly either way, but I slightly prefer the latter.
On Wed, Jan 29, 2025 at 4:19 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
> On Tue, Jan 28, 2025 at 07:09:05PM +0000, Sridhar, Kanchana P wrote:
[...]
> This is certainly cleaner, and thanks for catching that the
> zswap_stored_pages accounting is also wrong.

Oh, yeah. Thanks for catching zswap_stored_pages.

> I am not sure whether this has a meaningful impact on performance, but
> it seems like we are doing a bit more work in the common success case to
> avoid the work in the uncommon failure case.

Right, but at the same time I think we need some evaluation before
sacrificing readability for performance. There is no need to hurry to
optimize while fixing a bug, I think.

> Moving the charge (and the atomic addition) above the zswap_store_page()
> loop would be doing the opposite, albeit less cleanly.

I think that wouldn't work, because zswap won't uncharge the pages that
zswap_store_page() fails to store, and we don't know in advance how many
pages are going to be stored in zswap. IMO v2 of this patch is efficient
and it works, while charging just to uncharge doesn't look great.

> I don't feel strongly either way, but I slightly prefer the latter.

I'd prefer the more readable code because this is a hotfix. Let me post
v3 with Sridhar's feedback (the former) applied!
On Wed, Jan 29, 2025 at 4:09 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
> Hi Hyeonggon,
>
[...]
> Thanks for finding this bug! I am thinking it might make sense to charge
> and increment the zswap_stored_pages counter in zswap_store_page().
> Something like:
[...]

Hi Sridhar,

It looks much clearer! And we can optimize if it turns out to be worth
the complexity.

May I ask your permission to add your Signed-off-by: and Co-developed-by:?
I'm afraid to use this without your confirmation due to the Developer's
Certificate of Origin.

Best,
Hyeonggon
> -----Original Message-----
> From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> Sent: Tuesday, January 28, 2025 10:40 PM
> Subject: Re: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging
> when zswap_store_page() fails
>
[...]
> Hi Sridhar,
>
> It looks much clearer! And we can optimize if it turns out to be worth
> the complexity.
>
> May I ask your permission to add your Signed-off-by: and Co-developed-by:?

Yes, please go ahead Hyeonggon. Thanks!

Kanchana

> I'm afraid to use this without your confirmation due to the Developer's
> Certificate of Origin.
>
> Best,
> Hyeonggon
diff --git a/mm/zswap.c b/mm/zswap.c
index 6504174fbc6a..10b30ac46deb 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1568,20 +1568,26 @@ bool zswap_store(struct folio *folio)
 
 		bytes = zswap_store_page(page, objcg, pool);
 		if (bytes < 0)
-			goto put_pool;
+			goto charge_zswap;
 		compressed_bytes += bytes;
 	}
 
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, compressed_bytes);
+	if (objcg)
 		count_objcg_events(objcg, ZSWPOUT, nr_pages);
-	}
 
 	atomic_long_add(nr_pages, &zswap_stored_pages);
 	count_vm_events(ZSWPOUT, nr_pages);
 
 	ret = true;
 
+charge_zswap:
+	/*
+	 * Charge zswapped pages even when it failed to zswap the entire folio,
+	 * because zswap_entry_free() will uncharge them anyway.
+	 * Otherwise zswap charging will become inconsistent.
+	 */
+	if (objcg)
+		obj_cgroup_charge_zswap(objcg, compressed_bytes);
 put_pool:
 	zswap_pool_put(pool);
 put_objcg:
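Note on the hunk above: ret = true is not followed by a goto, so the
success path falls through into the charge_zswap: label. The charge
therefore runs exactly once in both the full-success and the
partial-failure case, with compressed_bytes covering whatever was
actually stored.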
Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
skips charging any zswapped base pages when it fails to zswap the entire
folio.

However, when some base pages are zswapped but zswapping the entire folio
fails, the zswap operation is rolled back. When freeing zswap entries for
those pages, zswap_entry_free() uncharges pages that were never charged,
causing zswap charging to become inconsistent.

This inconsistency triggers two warnings with the following steps:
  # On a machine with 64GiB of RAM and 36GiB of zswap
  $ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
  $ sudo reboot

The two warnings are:
  in mm/memcontrol.c:163, function obj_cgroup_release():
    WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));

  in mm/page_counter.c:60, function page_counter_cancel():
    if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
        new, nr_pages))

While objcg events should only be accounted for when the entire folio is
zswapped, objcg charging should be performed regardless. Fix accordingly.

After resolving the inconsistency, these warnings disappear.

Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
Cc: stable@vger.kernel.org
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---

v1->v2:

Fixed objcg events being accounted for on zswap failure.

Fixed the incorrect description. I misunderstood that the base pages
were going to be stored in zswap, but their zswap entries are freed
immediately.

Added a comment on why it charges pages that are going to be removed
from zswap.

 mm/zswap.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)
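To make the imbalance concrete, a userspace toy model of the pre-fix
accounting (not kernel code; the per-page compressed sizes are made up).
Three of four pages compress successfully, the fourth fails, and the
rollback uncharges bytes that were never charged:

#include <stdio.h>

int main(void)
{
	/* Hypothetical per-page compressed sizes; -1 marks a store failure. */
	long bytes_per_page[4] = { 700, 900, 800, -1 };
	long charged = 0;		/* models the objcg zswap charge, in bytes */
	long compressed_bytes = 0;
	int stored = 0;			/* entries created before the failure */
	int failed = 0;

	/* zswap_store() before the fix: bail out without charging anything. */
	for (int i = 0; i < 4; i++) {
		if (bytes_per_page[i] < 0) {
			failed = 1;
			break;
		}
		compressed_bytes += bytes_per_page[i];
		stored++;
	}
	if (!failed)
		charged += compressed_bytes;	/* charged only on full success */

	/* Rollback: zswap_entry_free() uncharges every stored entry anyway. */
	for (int i = 0; i < stored; i++)
		charged -= bytes_per_page[i];

	/* Prints -2400: the counter goes negative, which is the inconsistency
	 * behind the page_counter underflow warning quoted above. */
	printf("objcg zswap charge after rollback: %ld\n", charged);
	return 0;
}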