Message ID | 1725478868-61732-1-git-send-email-steven.sistare@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm/hugetlb: simplify refs in memfd_alloc_folio | expand |
On Wed, Sep 04, 2024 at 12:41:08PM -0700, Steve Sistare wrote: > The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and > delete the matching folio_put in memfd_pin_folios. This also avoids > leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache > fails, which would otherwise need an additional folio_put. This is a > continuation of the fix > "mm/hugetlb: fix memfd_pin_folios free_huge_pages leak" I think you're right, but don't we also need to get rid of the folio_put() call in the 'if (err)' case after calling hugetlb_add_to_page_cache()? > Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") > > Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > --- > mm/gup.c | 4 +--- > mm/memfd.c | 2 +- > 2 files changed, 2 insertions(+), 4 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index bccabaa..947881ff 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, > pgoff_t start_idx, end_idx, next_idx; > struct folio *folio = NULL; > struct folio_batch fbatch; > - struct hstate *h = NULL; > + struct hstate *h; > long ret = -EINVAL; > > if (start < 0 || start > end || !max_folios) > @@ -3662,8 +3662,6 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, > &fbatch); > if (folio) { > folio_put(folio); > - if (h) > - folio_put(folio); > folio = NULL; > } > > diff --git a/mm/memfd.c b/mm/memfd.c > index bcb131d..f715301 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -89,7 +89,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > numa_node_id(), > NULL, > gfp_mask); > - if (folio && folio_try_get(folio)) { > + if (folio) { > err = hugetlb_add_to_page_cache(folio, > memfd->f_mapping, > idx); > -- > 1.8.3.1 >
On Wed, Sep 04, 2024 at 09:02:59PM +0100, Matthew Wilcox wrote: > On Wed, Sep 04, 2024 at 12:41:08PM -0700, Steve Sistare wrote: > > The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and > > delete the matching folio_put in memfd_pin_folios. This also avoids > > leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache > > fails, which would otherwise need an additional folio_put. This is a > > continuation of the fix > > "mm/hugetlb: fix memfd_pin_folios free_huge_pages leak" > > I think you're right, but don't we also need to get rid of the > folio_put() call in the 'if (err)' case after calling > hugetlb_add_to_page_cache()? After scratching my head about this a bit more, I was trying to preserve the existing semantics of the code, but I think the code was always buggy. The correct code would be: folio = alloc_hugetlb_folio_nodemask(...); folio_put(folio); The code as in tree today would trip an assertion: VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); as alloc_hugetlb_folio_nodemask() returns a folio with a refcount 1, folio_try_get() would increment it to 2, folio_put() would decrement it to 1, and so we'd call free_huge_folio() with a refcount of 1. But after your patch, the code _is_ still wrong because we'll start with a refcount of 1, fail to add to the page cache, call folio_put() which will decrement the refcount to 0 _and call free_huge_folio() itself_. Then we'll call free_huge_folio() on an already freed and possibly reallocated folio. So every version suggested so far (current, yours, mine) is wrong, and the right code looks like: folio = alloc_hugetlb_folio_nodemask(...); if (folio) { err = hugetlb_add_to_page_cache(...); if (err) { folio_put(folio); return ERR_PTR(err); } ... Or have I got something wrong in that analysis? > > Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") > > > > Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com> > > Signed-off-by: Steve Sistare <steven.sistare@oracle.com> > > --- > > mm/gup.c | 4 +--- > > mm/memfd.c | 2 +- > > 2 files changed, 2 insertions(+), 4 deletions(-) > > > > diff --git a/mm/gup.c b/mm/gup.c > > index bccabaa..947881ff 100644 > > --- a/mm/gup.c > > +++ b/mm/gup.c > > @@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, > > pgoff_t start_idx, end_idx, next_idx; > > struct folio *folio = NULL; > > struct folio_batch fbatch; > > - struct hstate *h = NULL; > > + struct hstate *h; > > long ret = -EINVAL; > > > > if (start < 0 || start > end || !max_folios) > > @@ -3662,8 +3662,6 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, > > &fbatch); > > if (folio) { > > folio_put(folio); > > - if (h) > > - folio_put(folio); > > folio = NULL; > > } > > > > diff --git a/mm/memfd.c b/mm/memfd.c > > index bcb131d..f715301 100644 > > --- a/mm/memfd.c > > +++ b/mm/memfd.c > > @@ -89,7 +89,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) > > numa_node_id(), > > NULL, > > gfp_mask); > > - if (folio && folio_try_get(folio)) { > > + if (folio) { > > err = hugetlb_add_to_page_cache(folio, > > memfd->f_mapping, > > idx); > > -- > > 1.8.3.1 > > >
On 9/4/2024 4:02 PM, Matthew Wilcox wrote: > On Wed, Sep 04, 2024 at 12:41:08PM -0700, Steve Sistare wrote: >> The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and >> delete the matching folio_put in memfd_pin_folios. This also avoids >> leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache >> fails, which would otherwise need an additional folio_put. This is a >> continuation of the fix >> "mm/hugetlb: fix memfd_pin_folios free_huge_pages leak" > > I think you're right, but don't we also need to get rid of the > folio_put() call in the 'if (err)' case after calling > hugetlb_add_to_page_cache()? That folio_put drops the ref obtained by unfreeze: memfd_alloc_folio() alloc_hugetlb_folio_nodemask() dequeue_hugetlb_folio_nodemask() dequeue_hugetlb_folio_node_exact() folio_ref_unfreeze(folio, 1); ; adds 1 refcount - Steve >> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") >> >> Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >> --- >> mm/gup.c | 4 +--- >> mm/memfd.c | 2 +- >> 2 files changed, 2 insertions(+), 4 deletions(-) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index bccabaa..947881ff 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, >> pgoff_t start_idx, end_idx, next_idx; >> struct folio *folio = NULL; >> struct folio_batch fbatch; >> - struct hstate *h = NULL; >> + struct hstate *h; >> long ret = -EINVAL; >> >> if (start < 0 || start > end || !max_folios) >> @@ -3662,8 +3662,6 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, >> &fbatch); >> if (folio) { >> folio_put(folio); >> - if (h) >> - folio_put(folio); >> folio = NULL; >> } >> >> diff --git a/mm/memfd.c b/mm/memfd.c >> index bcb131d..f715301 100644 >> --- a/mm/memfd.c >> +++ b/mm/memfd.c >> @@ -89,7 +89,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) >> numa_node_id(), >> NULL, >> gfp_mask); >> - if (folio && folio_try_get(folio)) { >> + if (folio) { >> err = hugetlb_add_to_page_cache(folio, >> memfd->f_mapping, >> idx); >> -- >> 1.8.3.1 >>
On 9/4/2024 4:11 PM, Matthew Wilcox wrote: > On Wed, Sep 04, 2024 at 09:02:59PM +0100, Matthew Wilcox wrote: >> On Wed, Sep 04, 2024 at 12:41:08PM -0700, Steve Sistare wrote: >>> The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and >>> delete the matching folio_put in memfd_pin_folios. This also avoids >>> leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache >>> fails, which would otherwise need an additional folio_put. This is a >>> continuation of the fix >>> "mm/hugetlb: fix memfd_pin_folios free_huge_pages leak" >> >> I think you're right, but don't we also need to get rid of the >> folio_put() call in the 'if (err)' case after calling >> hugetlb_add_to_page_cache()? > > After scratching my head about this a bit more, I was trying to preserve > the existing semantics of the code, but I think the code was always > buggy. > > The correct code would be: > > folio = alloc_hugetlb_folio_nodemask(...); > folio_put(folio); > > The code as in tree today would trip an assertion: > > VM_BUG_ON_FOLIO(folio_ref_count(folio), folio); > > as alloc_hugetlb_folio_nodemask() returns a folio with a refcount 1, > folio_try_get() would increment it to 2, folio_put() would decrement > it to 1, and so we'd call free_huge_folio() with a refcount of 1. > > But after your patch, the code _is_ still wrong because we'll > start with a refcount of 1, fail to add to the page cache, call > folio_put() which will decrement the refcount to 0 _and call > free_huge_folio() itself_. Then we'll call free_huge_folio() > on an already freed and possibly reallocated folio. Indeed. The explicit call to free_huge_folio must be deleted, as you coded below. I'll send a V2 of the patch. - Steve > So every version suggested so far (current, yours, mine) is wrong, > and the right code looks like: > > folio = alloc_hugetlb_folio_nodemask(...); > if (folio) { > err = hugetlb_add_to_page_cache(...); > if (err) { > folio_put(folio); > return ERR_PTR(err); > } > ... > > Or have I got something wrong in that analysis? > >>> Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") >>> >>> Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com> >>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> >>> --- >>> mm/gup.c | 4 +--- >>> mm/memfd.c | 2 +- >>> 2 files changed, 2 insertions(+), 4 deletions(-) >>> >>> diff --git a/mm/gup.c b/mm/gup.c >>> index bccabaa..947881ff 100644 >>> --- a/mm/gup.c >>> +++ b/mm/gup.c >>> @@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, >>> pgoff_t start_idx, end_idx, next_idx; >>> struct folio *folio = NULL; >>> struct folio_batch fbatch; >>> - struct hstate *h = NULL; >>> + struct hstate *h; >>> long ret = -EINVAL; >>> >>> if (start < 0 || start > end || !max_folios) >>> @@ -3662,8 +3662,6 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, >>> &fbatch); >>> if (folio) { >>> folio_put(folio); >>> - if (h) >>> - folio_put(folio); >>> folio = NULL; >>> } >>> >>> diff --git a/mm/memfd.c b/mm/memfd.c >>> index bcb131d..f715301 100644 >>> --- a/mm/memfd.c >>> +++ b/mm/memfd.c >>> @@ -89,7 +89,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) >>> numa_node_id(), >>> NULL, >>> gfp_mask); >>> - if (folio && folio_try_get(folio)) { >>> + if (folio) { >>> err = hugetlb_add_to_page_cache(folio, >>> memfd->f_mapping, >>> idx); >>> -- >>> 1.8.3.1 >>> >>
diff --git a/mm/gup.c b/mm/gup.c index bccabaa..947881ff 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -3618,7 +3618,7 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, pgoff_t start_idx, end_idx, next_idx; struct folio *folio = NULL; struct folio_batch fbatch; - struct hstate *h = NULL; + struct hstate *h; long ret = -EINVAL; if (start < 0 || start > end || !max_folios) @@ -3662,8 +3662,6 @@ long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end, &fbatch); if (folio) { folio_put(folio); - if (h) - folio_put(folio); folio = NULL; } diff --git a/mm/memfd.c b/mm/memfd.c index bcb131d..f715301 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -89,7 +89,7 @@ struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx) numa_node_id(), NULL, gfp_mask); - if (folio && folio_try_get(folio)) { + if (folio) { err = hugetlb_add_to_page_cache(folio, memfd->f_mapping, idx);
The folio_try_get in memfd_alloc_folio is not necessary. Delete it, and delete the matching folio_put in memfd_pin_folios. This also avoids leaking a ref if the memfd_alloc_folio call to hugetlb_add_to_page_cache fails, which would otherwise need an additional folio_put. This is a continuation of the fix "mm/hugetlb: fix memfd_pin_folios free_huge_pages leak" Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") Suggested-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Signed-off-by: Steve Sistare <steven.sistare@oracle.com> --- mm/gup.c | 4 +--- mm/memfd.c | 2 +- 2 files changed, 2 insertions(+), 4 deletions(-)