Message ID | 20211025150223.13621-3-mhocko@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | extend vmalloc support for constrained allocations | expand |
On Tue, 26 Oct 2021, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > Dave Chinner has mentioned that some of the xfs code would benefit from > kvmalloc support for __GFP_NOFAIL because they have allocations that > cannot fail and they do not fit into a single page. > > The larg part of the vmalloc implementation already complies with the *large* > given gfp flags so there is no work for those to be done. The area > and page table allocations are an exception to that. Implement a retry > loop for those. > > Add a short sleep before retrying. 1 jiffy is a completely random > timeout. Ideally the retry would wait for an explicit event - e.g. > a change to the vmalloc space change if the failure was caused by > the space fragmentation or depletion. But there are multiple different > reasons to retry and this could become much more complex. Keep the retry > simple for now and just sleep to prevent from hogging CPUs. > > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > mm/vmalloc.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index c6cc77d2f366..602649919a9d 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) > flags = memalloc_noio_save(); > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages, > + do { > + ret = vmap_pages_range(addr, addr + size, prot, area->pages, > page_shift); > + if (ret < 0) > + schedule_timeout_uninterruptible(1); > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); > > if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO) > memalloc_nofs_restore(flags); > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, > warn_alloc(gfp_mask, NULL, > "vmalloc error: size %lu, vm_struct allocation failed", > real_size); > + if (gfp_mask & __GFP_NOFAIL) { > + schedule_timeout_uninterruptible(1); > + goto again; > + } Shouldn't the retry happen *before* the warning? NeilBrown > goto fail; > } > > -- > 2.30.2 > >
On Tue 26-10-21 09:59:36, Neil Brown wrote: > On Tue, 26 Oct 2021, Michal Hocko wrote: [...] > > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, > > warn_alloc(gfp_mask, NULL, > > "vmalloc error: size %lu, vm_struct allocation failed", > > real_size); > > + if (gfp_mask & __GFP_NOFAIL) { > > + schedule_timeout_uninterruptible(1); > > + goto again; > > + } > > Shouldn't the retry happen *before* the warning? I've done it after to catch the "depleted or fragmented" vmalloc space. This is not related to the memory available and therefore it won't be handled by the oom killer. The error message shouldn't imply the vmalloc allocation failure IMHO but I am open to suggestions.
On Tue, 26 Oct 2021, Michal Hocko wrote: > On Tue 26-10-21 09:59:36, Neil Brown wrote: > > On Tue, 26 Oct 2021, Michal Hocko wrote: > [...] > > > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, > > > warn_alloc(gfp_mask, NULL, > > > "vmalloc error: size %lu, vm_struct allocation failed", > > > real_size); > > > + if (gfp_mask & __GFP_NOFAIL) { > > > + schedule_timeout_uninterruptible(1); > > > + goto again; > > > + } > > > > Shouldn't the retry happen *before* the warning? > > I've done it after to catch the "depleted or fragmented" vmalloc space. > This is not related to the memory available and therefore it won't be > handled by the oom killer. The error message shouldn't imply the vmalloc > allocation failure IMHO but I am open to suggestions. The word "failed" does seem to imply what you don't want it to imply... I guess it is reasonable to have this warning, but maybe add " -- retrying" if __GFP_NOFAIL. Thanks, NeilBrown
On Tue 26-10-21 21:30:52, Neil Brown wrote: > On Tue, 26 Oct 2021, Michal Hocko wrote: > > On Tue 26-10-21 09:59:36, Neil Brown wrote: > > > On Tue, 26 Oct 2021, Michal Hocko wrote: > > [...] > > > > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, > > > > warn_alloc(gfp_mask, NULL, > > > > "vmalloc error: size %lu, vm_struct allocation failed", > > > > real_size); > > > > + if (gfp_mask & __GFP_NOFAIL) { > > > > + schedule_timeout_uninterruptible(1); > > > > + goto again; > > > > + } > > > > > > Shouldn't the retry happen *before* the warning? > > > > I've done it after to catch the "depleted or fragmented" vmalloc space. > > This is not related to the memory available and therefore it won't be > > handled by the oom killer. The error message shouldn't imply the vmalloc > > allocation failure IMHO but I am open to suggestions. > > The word "failed" does seem to imply what you don't want it to imply... > > I guess it is reasonable to have this warning, but maybe add " -- retrying" > if __GFP_NOFAIL. I do not have a strong opinion on that. I can surely do diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 602649919a9d..3489928fafa2 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3033,10 +3033,11 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, VM_UNINITIALIZED | vm_flags, start, end, node, gfp_mask, caller); if (!area) { + bool nofail = gfp_mask & __GFP_NOFAIL; warn_alloc(gfp_mask, NULL, - "vmalloc error: size %lu, vm_struct allocation failed", - real_size); - if (gfp_mask & __GFP_NOFAIL) { + "vmalloc error: size %lu, vm_struct allocation failed%s", + real_size, (nofail) ? ". Retrying." : ""); + if (nofail) { schedule_timeout_uninterruptible(1); goto again; }
> From: Michal Hocko <mhocko@suse.com> > > Dave Chinner has mentioned that some of the xfs code would benefit from > kvmalloc support for __GFP_NOFAIL because they have allocations that > cannot fail and they do not fit into a single page. > > The larg part of the vmalloc implementation already complies with the > given gfp flags so there is no work for those to be done. The area > and page table allocations are an exception to that. Implement a retry > loop for those. > > Add a short sleep before retrying. 1 jiffy is a completely random > timeout. Ideally the retry would wait for an explicit event - e.g. > a change to the vmalloc space change if the failure was caused by > the space fragmentation or depletion. But there are multiple different > reasons to retry and this could become much more complex. Keep the retry > simple for now and just sleep to prevent from hogging CPUs. > > Signed-off-by: Michal Hocko <mhocko@suse.com> > --- > mm/vmalloc.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index c6cc77d2f366..602649919a9d 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) > flags = memalloc_noio_save(); > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages, > + do { > + ret = vmap_pages_range(addr, addr + size, prot, area->pages, > page_shift); > + if (ret < 0) > + schedule_timeout_uninterruptible(1); > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); > 1. After that change a below code: <snip> if (ret < 0) { warn_alloc(orig_gfp_mask, NULL, "vmalloc error: size %lu, failed to map pages", area->nr_pages * PAGE_SIZE); goto fail; } <snip> does not make any sense anymore. 2. Can we combine two places where we handle __GFP_NOFAIL into one place? That would look like as more sorted out.
On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > cannot fail and they do not fit into a single page. > > > > The larg part of the vmalloc implementation already complies with the > > given gfp flags so there is no work for those to be done. The area > > and page table allocations are an exception to that. Implement a retry > > loop for those. > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > timeout. Ideally the retry would wait for an explicit event - e.g. > > a change to the vmalloc space change if the failure was caused by > > the space fragmentation or depletion. But there are multiple different > > reasons to retry and this could become much more complex. Keep the retry > > simple for now and just sleep to prevent from hogging CPUs. > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > --- > > mm/vmalloc.c | 10 +++++++++- > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index c6cc77d2f366..602649919a9d 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) > > flags = memalloc_noio_save(); > > > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > + do { > > + ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > page_shift); > > + if (ret < 0) > > + schedule_timeout_uninterruptible(1); > > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); > > > > 1. > After that change a below code: > > <snip> > if (ret < 0) { > warn_alloc(orig_gfp_mask, NULL, > "vmalloc error: size %lu, failed to map pages", > area->nr_pages * PAGE_SIZE); > goto fail; > } > <snip> > > does not make any sense anymore. Why? Allocations without __GFP_NOFAIL can still fail, no? > 2. > Can we combine two places where we handle __GFP_NOFAIL into one place? > That would look like as more sorted out. I have to admit I am not really fluent at vmalloc code so I wanted to make the code as simple as possible. How would I unwind all the allocated memory (already allocated as GFP_NOFAIL) before retrying at __vmalloc_node_range (if that is what you suggest). And isn't that a bit wasteful? Or did you have anything else in mind?
On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote: > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > cannot fail and they do not fit into a single page. > > > > > > The larg part of the vmalloc implementation already complies with the > > > given gfp flags so there is no work for those to be done. The area > > > and page table allocations are an exception to that. Implement a retry > > > loop for those. > > > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > > timeout. Ideally the retry would wait for an explicit event - e.g. > > > a change to the vmalloc space change if the failure was caused by > > > the space fragmentation or depletion. But there are multiple different > > > reasons to retry and this could become much more complex. Keep the retry > > > simple for now and just sleep to prevent from hogging CPUs. > > > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > --- > > > mm/vmalloc.c | 10 +++++++++- > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > index c6cc77d2f366..602649919a9d 100644 > > > --- a/mm/vmalloc.c > > > +++ b/mm/vmalloc.c > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > > > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) > > > flags = memalloc_noio_save(); > > > > > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > + do { > > > + ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > page_shift); > > > + if (ret < 0) > > > + schedule_timeout_uninterruptible(1); > > > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); > > > > > > > 1. > > After that change a below code: > > > > <snip> > > if (ret < 0) { > > warn_alloc(orig_gfp_mask, NULL, > > "vmalloc error: size %lu, failed to map pages", > > area->nr_pages * PAGE_SIZE); > > goto fail; > > } > > <snip> > > > > does not make any sense anymore. > > Why? Allocations without __GFP_NOFAIL can still fail, no? > Right. I meant one thing but wrote slightly differently. In case of vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any warning message? Because either we can recover on a future iteration or it stuck there infinitely so a user does not understand what happened. From the other hand this is how __GFP_NOFAIL works, hm.. Another thing, i see that schedule_timeout_uninterruptible(1) is invoked for all cases even when __GFP_NOFAIL is not set, in that scenario we do not want to wait, instead we should return back to a caller asap. Or am i missing something here? > > 2. > > Can we combine two places where we handle __GFP_NOFAIL into one place? > > That would look like as more sorted out. > > I have to admit I am not really fluent at vmalloc code so I wanted to > make the code as simple as possible. How would I unwind all the allocated > memory (already allocated as GFP_NOFAIL) before retrying at > __vmalloc_node_range (if that is what you suggest). And isn't that a > bit wasteful? > > Or did you have anything else in mind? > It depends on how often all this can fail. But let me double check if such combining is easy. -- Vlad Rezki
On Tue 26-10-21 21:33:15, Uladzislau Rezki wrote: > On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote: > > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote: > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > cannot fail and they do not fit into a single page. > > > > > > > > The larg part of the vmalloc implementation already complies with the > > > > given gfp flags so there is no work for those to be done. The area > > > > and page table allocations are an exception to that. Implement a retry > > > > loop for those. > > > > > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > > > timeout. Ideally the retry would wait for an explicit event - e.g. > > > > a change to the vmalloc space change if the failure was caused by > > > > the space fragmentation or depletion. But there are multiple different > > > > reasons to retry and this could become much more complex. Keep the retry > > > > simple for now and just sleep to prevent from hogging CPUs. > > > > > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > --- > > > > mm/vmalloc.c | 10 +++++++++- > > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > index c6cc77d2f366..602649919a9d 100644 > > > > --- a/mm/vmalloc.c > > > > +++ b/mm/vmalloc.c > > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > > > > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) > > > > flags = memalloc_noio_save(); > > > > > > > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > > + do { > > > > + ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > > page_shift); > > > > + if (ret < 0) > > > > + schedule_timeout_uninterruptible(1); > > > > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); > > > > > > > > > > 1. > > > After that change a below code: > > > > > > <snip> > > > if (ret < 0) { > > > warn_alloc(orig_gfp_mask, NULL, > > > "vmalloc error: size %lu, failed to map pages", > > > area->nr_pages * PAGE_SIZE); > > > goto fail; > > > } > > > <snip> > > > > > > does not make any sense anymore. > > > > Why? Allocations without __GFP_NOFAIL can still fail, no? > > > Right. I meant one thing but wrote slightly differently. In case of > vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any > warning message? Because either we can recover on a future iteration > or it stuck there infinitely so a user does not understand what happened. > From the other hand this is how __GFP_NOFAIL works, hm.. Yes, the page allocator doesn't warn either and I would like to keep this in sync. > Another thing, i see that schedule_timeout_uninterruptible(1) is invoked > for all cases even when __GFP_NOFAIL is not set, in that scenario we do > not want to wait, instead we should return back to a caller asap. Or am > i missing something here? OK, I will change that.
On Tue, Oct 26, 2021 at 09:33:15PM +0200, Uladzislau Rezki wrote: > On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote: > > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote: > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > cannot fail and they do not fit into a single page. > > > > > > > > The larg part of the vmalloc implementation already complies with the > > > > given gfp flags so there is no work for those to be done. The area > > > > and page table allocations are an exception to that. Implement a retry > > > > loop for those. > > > > > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > > > timeout. Ideally the retry would wait for an explicit event - e.g. > > > > a change to the vmalloc space change if the failure was caused by > > > > the space fragmentation or depletion. But there are multiple different > > > > reasons to retry and this could become much more complex. Keep the retry > > > > simple for now and just sleep to prevent from hogging CPUs. > > > > > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > --- > > > > mm/vmalloc.c | 10 +++++++++- > > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > index c6cc77d2f366..602649919a9d 100644 > > > > --- a/mm/vmalloc.c > > > > +++ b/mm/vmalloc.c > > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > > > > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) > > > > flags = memalloc_noio_save(); > > > > > > > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > > + do { > > > > + ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > > page_shift); > > > > + if (ret < 0) > > > > + schedule_timeout_uninterruptible(1); > > > > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); > > > > > > > > > > 1. > > > After that change a below code: > > > > > > <snip> > > > if (ret < 0) { > > > warn_alloc(orig_gfp_mask, NULL, > > > "vmalloc error: size %lu, failed to map pages", > > > area->nr_pages * PAGE_SIZE); > > > goto fail; > > > } > > > <snip> > > > > > > does not make any sense anymore. > > > > Why? Allocations without __GFP_NOFAIL can still fail, no? > > > Right. I meant one thing but wrote slightly differently. In case of > vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any > warning message? Because either we can recover on a future iteration > or it stuck there infinitely so a user does not understand what happened. > From the other hand this is how __GFP_NOFAIL works, hm.. > > Another thing, i see that schedule_timeout_uninterruptible(1) is invoked > for all cases even when __GFP_NOFAIL is not set, in that scenario we do > not want to wait, instead we should return back to a caller asap. Or am > i missing something here? > > > > 2. > > > Can we combine two places where we handle __GFP_NOFAIL into one place? > > > That would look like as more sorted out. > > > > I have to admit I am not really fluent at vmalloc code so I wanted to > > make the code as simple as possible. How would I unwind all the allocated > > memory (already allocated as GFP_NOFAIL) before retrying at > > __vmalloc_node_range (if that is what you suggest). And isn't that a > > bit wasteful? > > > > Or did you have anything else in mind? > > > It depends on how often all this can fail. But let me double check if > such combining is easy. > I mean something like below. The idea is to not spread the __GFP_NOFAIL across the vmalloc file keeping it in one solid place: <snip> diff --git a/mm/vmalloc.c b/mm/vmalloc.c index d77830ff604c..f4b7927e217e 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, unsigned long array_size; unsigned int nr_small_pages = size >> PAGE_SHIFT; unsigned int page_order; + unsigned long flags; + int ret; array_size = (unsigned long)nr_small_pages * sizeof(struct page *); + + /* + * This is i do not understand why we do not want to see warning messages. + */ gfp_mask |= __GFP_NOWARN; if (!(gfp_mask & (GFP_DMA | GFP_DMA32))) gfp_mask |= __GFP_HIGHMEM; @@ -2930,8 +2936,23 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, goto fail; } - if (vmap_pages_range(addr, addr + size, prot, area->pages, - page_shift) < 0) { + /* + * page tables allocations ignore external gfp mask, enforce it + * by the scope API + */ + if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO) + flags = memalloc_nofs_save(); + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) + flags = memalloc_noio_save(); + + ret = vmap_pages_range(addr, addr + size, prot, area->pages, page_shift); + + if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO) + memalloc_nofs_restore(flags); + else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) + memalloc_noio_restore(flags); + + if (ret < 0) { warn_alloc(gfp_mask, NULL, "vmalloc error: size %lu, failed to map pages", area->nr_pages * PAGE_SIZE); @@ -2984,6 +3005,12 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, return NULL; } + /* + * Suppress all warnings for __GFP_NOFAIL allocation. + */ + if (gfp_mask & __GFP_NOFAIL) + gfp_mask |= __GFP_NOWARN; + if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) { unsigned long size_per_node; @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, area = __get_vm_area_node(real_size, align, shift, VM_ALLOC | VM_UNINITIALIZED | vm_flags, start, end, node, gfp_mask, caller); - if (!area) { - warn_alloc(gfp_mask, NULL, - "vmalloc error: size %lu, vm_struct allocation failed", - real_size); - goto fail; - } + if (area) + addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); + + if (!area || !addr) { + if (gfp_mask & __GFP_NOFAIL) { + schedule_timeout_uninterruptible(1); + goto again; + } + + if (!area) + warn_alloc(gfp_mask, NULL, + "vmalloc error: size %lu, vm_struct allocation failed", + real_size); - addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); - if (!addr) goto fail; + } /* * In this function, newly allocated vm_struct has VM_UNINITIALIZED <snip> -- Vlad Rezki
On Wed 27-10-21 19:55:50, Uladzislau Rezki wrote: > On Tue, Oct 26, 2021 at 09:33:15PM +0200, Uladzislau Rezki wrote: > > On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote: > > > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote: > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > > cannot fail and they do not fit into a single page. > > > > > > > > > > The larg part of the vmalloc implementation already complies with the > > > > > given gfp flags so there is no work for those to be done. The area > > > > > and page table allocations are an exception to that. Implement a retry > > > > > loop for those. > > > > > > > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > > > > timeout. Ideally the retry would wait for an explicit event - e.g. > > > > > a change to the vmalloc space change if the failure was caused by > > > > > the space fragmentation or depletion. But there are multiple different > > > > > reasons to retry and this could become much more complex. Keep the retry > > > > > simple for now and just sleep to prevent from hogging CPUs. > > > > > > > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > > > --- > > > > > mm/vmalloc.c | 10 +++++++++- > > > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > > index c6cc77d2f366..602649919a9d 100644 > > > > > --- a/mm/vmalloc.c > > > > > +++ b/mm/vmalloc.c > > > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > > > > > else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) > > > > > flags = memalloc_noio_save(); > > > > > > > > > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > > > + do { > > > > > + ret = vmap_pages_range(addr, addr + size, prot, area->pages, > > > > > page_shift); > > > > > + if (ret < 0) > > > > > + schedule_timeout_uninterruptible(1); > > > > > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); > > > > > > > > > > > > > 1. > > > > After that change a below code: > > > > > > > > <snip> > > > > if (ret < 0) { > > > > warn_alloc(orig_gfp_mask, NULL, > > > > "vmalloc error: size %lu, failed to map pages", > > > > area->nr_pages * PAGE_SIZE); > > > > goto fail; > > > > } > > > > <snip> > > > > > > > > does not make any sense anymore. > > > > > > Why? Allocations without __GFP_NOFAIL can still fail, no? > > > > > Right. I meant one thing but wrote slightly differently. In case of > > vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any > > warning message? Because either we can recover on a future iteration > > or it stuck there infinitely so a user does not understand what happened. > > From the other hand this is how __GFP_NOFAIL works, hm.. > > > > Another thing, i see that schedule_timeout_uninterruptible(1) is invoked > > for all cases even when __GFP_NOFAIL is not set, in that scenario we do > > not want to wait, instead we should return back to a caller asap. Or am > > i missing something here? > > > > > > 2. > > > > Can we combine two places where we handle __GFP_NOFAIL into one place? > > > > That would look like as more sorted out. > > > > > > I have to admit I am not really fluent at vmalloc code so I wanted to > > > make the code as simple as possible. How would I unwind all the allocated > > > memory (already allocated as GFP_NOFAIL) before retrying at > > > __vmalloc_node_range (if that is what you suggest). And isn't that a > > > bit wasteful? > > > > > > Or did you have anything else in mind? > > > > > It depends on how often all this can fail. But let me double check if > > such combining is easy. > > > I mean something like below. The idea is to not spread the __GFP_NOFAIL > across the vmalloc file keeping it in one solid place: > > <snip> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index d77830ff604c..f4b7927e217e 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > unsigned long array_size; > unsigned int nr_small_pages = size >> PAGE_SHIFT; > unsigned int page_order; > + unsigned long flags; > + int ret; > > array_size = (unsigned long)nr_small_pages * sizeof(struct page *); > + > + /* > + * This is i do not understand why we do not want to see warning messages. > + */ > gfp_mask |= __GFP_NOWARN; I suspect this is becauser vmalloc wants to have its own failure reporting. [...] > @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, > area = __get_vm_area_node(real_size, align, shift, VM_ALLOC | > VM_UNINITIALIZED | vm_flags, start, end, node, > gfp_mask, caller); > - if (!area) { > - warn_alloc(gfp_mask, NULL, > - "vmalloc error: size %lu, vm_struct allocation failed", > - real_size); > - goto fail; > - } > + if (area) > + addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); > + > + if (!area || !addr) { > + if (gfp_mask & __GFP_NOFAIL) { > + schedule_timeout_uninterruptible(1); > + goto again; > + } > + > + if (!area) > + warn_alloc(gfp_mask, NULL, > + "vmalloc error: size %lu, vm_struct allocation failed", > + real_size); > > - addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); > - if (!addr) > goto fail; > + } > > /* > * In this function, newly allocated vm_struct has VM_UNINITIALIZED > <snip> OK, this looks easier from the code reading but isn't it quite wasteful to throw all the pages backing the area (all of them allocated as __GFP_NOFAIL) just to then fail to allocate few page tables pages and drop all of that on the floor (this will happen in __vunmap AFAICS). I mean I do not care all that strongly but it seems to me that more changes would need to be done here and optimizations can be done on top. Is this something you feel strongly about?
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index d77830ff604c..f4b7927e217e 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, > > unsigned long array_size; > > unsigned int nr_small_pages = size >> PAGE_SHIFT; > > unsigned int page_order; > > + unsigned long flags; > > + int ret; > > > > array_size = (unsigned long)nr_small_pages * sizeof(struct page *); > > + > > + /* > > + * This is i do not understand why we do not want to see warning messages. > > + */ > > gfp_mask |= __GFP_NOWARN; > > I suspect this is becauser vmalloc wants to have its own failure > reporting. > But as i see it is broken. All three warn_alloc() reports in the __vmalloc_area_node() are useless because the __GFP_NOWARN is added on top of gfp_mask: void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...) { struct va_format vaf; va_list args; static DEFINE_RATELIMIT_STATE(nopage_rs, 10*HZ, 1); if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs)) return; ... everything with the __GFP_NOWARN is just reverted. > [...] > > @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, > > area = __get_vm_area_node(real_size, align, shift, VM_ALLOC | > > VM_UNINITIALIZED | vm_flags, start, end, node, > > gfp_mask, caller); > > - if (!area) { > > - warn_alloc(gfp_mask, NULL, > > - "vmalloc error: size %lu, vm_struct allocation failed", > > - real_size); > > - goto fail; > > - } > > + if (area) > > + addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); > > + > > + if (!area || !addr) { > > + if (gfp_mask & __GFP_NOFAIL) { > > + schedule_timeout_uninterruptible(1); > > + goto again; > > + } > > + > > + if (!area) > > + warn_alloc(gfp_mask, NULL, > > + "vmalloc error: size %lu, vm_struct allocation failed", > > + real_size); > > > > - addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node); > > - if (!addr) > > goto fail; > > + } > > > > /* > > * In this function, newly allocated vm_struct has VM_UNINITIALIZED > > <snip> > > OK, this looks easier from the code reading but isn't it quite wasteful > to throw all the pages backing the area (all of them allocated as > __GFP_NOFAIL) just to then fail to allocate few page tables pages and > drop all of that on the floor (this will happen in __vunmap AFAICS). > > I mean I do not care all that strongly but it seems to me that more > changes would need to be done here and optimizations can be done on top. > > Is this something you feel strongly about? > Will try to provide some motivations :) It depends on how to look at it. My view is as follows a more simple code is preferred. It is not considered as a hot path and it is rather a corner case to me. I think "unwinding" has some advantage. At least one motivation is to release a memory(on failure) before a delay that will prevent holding of extra memory in case of __GFP_NOFAIL infinitelly does not succeed, i.e. if a process stuck due to __GFP_NOFAIL it does not "hold" an extra memory forever.
On Fri 29-10-21 16:05:32, Uladzislau Rezki wrote: [...] > > OK, this looks easier from the code reading but isn't it quite wasteful > > to throw all the pages backing the area (all of them allocated as > > __GFP_NOFAIL) just to then fail to allocate few page tables pages and > > drop all of that on the floor (this will happen in __vunmap AFAICS). > > > > I mean I do not care all that strongly but it seems to me that more > > changes would need to be done here and optimizations can be done on top. > > > > Is this something you feel strongly about? > > > Will try to provide some motivations :) > > It depends on how to look at it. My view is as follows a more simple code > is preferred. It is not considered as a hot path and it is rather a corner > case to me. Yes, we are definitely talking about corner cases here. Even GFP_KERNEL allocations usually do not fail. > I think "unwinding" has some advantage. At least one motivation > is to release a memory(on failure) before a delay that will prevent holding > of extra memory in case of __GFP_NOFAIL infinitelly does not succeed, i.e. > if a process stuck due to __GFP_NOFAIL it does not "hold" an extra memory > forever. Well, I suspect this is something that we can disagree on and both of us would be kinda right. I would see it as throwing baby out with the bathwater. The vast majority of the memory will be in the area pages and sacrificing that just to allocate few page tables or whatever that might fail in that code path is just a lot of cycles wasted. So unless you really feel strongly about this then I would stick with this approach.
> On Fri 29-10-21 16:05:32, Uladzislau Rezki wrote: > [...] > > > OK, this looks easier from the code reading but isn't it quite wasteful > > > to throw all the pages backing the area (all of them allocated as > > > __GFP_NOFAIL) just to then fail to allocate few page tables pages and > > > drop all of that on the floor (this will happen in __vunmap AFAICS). > > > > > > I mean I do not care all that strongly but it seems to me that more > > > changes would need to be done here and optimizations can be done on top. > > > > > > Is this something you feel strongly about? > > > > > Will try to provide some motivations :) > > > > It depends on how to look at it. My view is as follows a more simple code > > is preferred. It is not considered as a hot path and it is rather a corner > > case to me. > > Yes, we are definitely talking about corner cases here. Even GFP_KERNEL > allocations usually do not fail. > > > I think "unwinding" has some advantage. At least one motivation > > is to release a memory(on failure) before a delay that will prevent holding > > of extra memory in case of __GFP_NOFAIL infinitelly does not succeed, i.e. > > if a process stuck due to __GFP_NOFAIL it does not "hold" an extra memory > > forever. > > Well, I suspect this is something that we can disagree on and both of us > would be kinda right. I would see it as throwing baby out with the > bathwater. The vast majority of the memory will be in the area pages and > sacrificing that just to allocate few page tables or whatever that might > fail in that code path is just a lot of cycles wasted. > We are not talking about performance, no sense to measure cycles here :) > > So unless you really feel strongly about this then I would stick with > this approach. > I have raised one concern. The memory resource is shared between all process in case of __GFP_NOFAIL it might be that we never return back to user in that scenario i prefer to release hold memory for other needs instead of keeping it for nothing. If you think it is not a problem, then i do not have much to say. -- Vlad Rezki
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index c6cc77d2f366..602649919a9d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask, else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0) flags = memalloc_noio_save(); - ret = vmap_pages_range(addr, addr + size, prot, area->pages, + do { + ret = vmap_pages_range(addr, addr + size, prot, area->pages, page_shift); + if (ret < 0) + schedule_timeout_uninterruptible(1); + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0)); if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO) memalloc_nofs_restore(flags); @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, warn_alloc(gfp_mask, NULL, "vmalloc error: size %lu, vm_struct allocation failed", real_size); + if (gfp_mask & __GFP_NOFAIL) { + schedule_timeout_uninterruptible(1); + goto again; + } goto fail; }