
[2/4] mm/vmalloc: add support for __GFP_NOFAIL

Message ID 20211025150223.13621-3-mhocko@kernel.org (mailing list archive)
State New, archived
Series extend vmalloc support for constrained allocations

Commit Message

Michal Hocko Oct. 25, 2021, 3:02 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.

The larg part of the vmalloc implementation already complies with the
given gfp flags so there is no work for those to be done. The area
and page table allocations are an exception to that. Implement a retry
loop for those.

Add a short sleep before retrying. 1 jiffy is a completely random
timeout. Ideally the retry would wait for an explicit event - e.g.
a change to the vmalloc space change if the failure was caused by
the space fragmentation or depletion. But there are multiple different
reasons to retry and this could become much more complex. Keep the retry
simple for now and just sleep to prevent from hogging CPUs.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
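
For context, the kind of caller this series is aimed at needs an allocation
that must not fail and can be larger than a single page, so kvmalloc() may
fall back to vmalloc() and still has to honour __GFP_NOFAIL there. A purely
illustrative sketch (the names are hypothetical, not taken from xfs):

#include <linux/mm.h>
#include <linux/overflow.h>

struct item_batch {
	unsigned int nr;
	void *items[];
};

/* Must not fail; struct_size() easily exceeds PAGE_SIZE for large nr. */
static struct item_batch *alloc_item_batch(unsigned int nr)
{
	struct item_batch *batch;

	batch = kvmalloc(struct_size(batch, items, nr),
			 GFP_KERNEL | __GFP_NOFAIL);
	batch->nr = nr;
	return batch;
}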

Comments

NeilBrown Oct. 25, 2021, 10:59 p.m. UTC | #1
On Tue, 26 Oct 2021, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
> 
> The larg part of the vmalloc implementation already complies with the

*large*

> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
> 
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c6cc77d2f366..602649919a9d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>  		flags = memalloc_noio_save();
>  
> -	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +	do {
> +		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>  			page_shift);
> +		if (ret < 0)
> +			schedule_timeout_uninterruptible(1);
> +	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
>  
>  	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
>  		memalloc_nofs_restore(flags);
> @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  		warn_alloc(gfp_mask, NULL,
>  			"vmalloc error: size %lu, vm_struct allocation failed",
>  			real_size);
> +		if (gfp_mask & __GFP_NOFAIL) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}

Shouldn't the retry happen *before* the warning?

NeilBrown


>  		goto fail;
>  	}
>  
> -- 
> 2.30.2
> 
>
Michal Hocko Oct. 26, 2021, 7:03 a.m. UTC | #2
On Tue 26-10-21 09:59:36, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
[...]
> > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >  		warn_alloc(gfp_mask, NULL,
> >  			"vmalloc error: size %lu, vm_struct allocation failed",
> >  			real_size);
> > +		if (gfp_mask & __GFP_NOFAIL) {
> > +			schedule_timeout_uninterruptible(1);
> > +			goto again;
> > +		}
> 
> Shouldn't the retry happen *before* the warning?

I've done it after to catch the "depleted or fragmented" vmalloc space.
This is not related to the memory available and therefore it won't be
handled by the oom killer. The error message shouldn't imply the vmalloc
allocation failure IMHO but I am open to suggestions.
NeilBrown Oct. 26, 2021, 10:30 a.m. UTC | #3
On Tue, 26 Oct 2021, Michal Hocko wrote:
> On Tue 26-10-21 09:59:36, Neil Brown wrote:
> > On Tue, 26 Oct 2021, Michal Hocko wrote:
> [...]
> > > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > >  		warn_alloc(gfp_mask, NULL,
> > >  			"vmalloc error: size %lu, vm_struct allocation failed",
> > >  			real_size);
> > > +		if (gfp_mask & __GFP_NOFAIL) {
> > > +			schedule_timeout_uninterruptible(1);
> > > +			goto again;
> > > +		}
> > 
> > Shouldn't the retry happen *before* the warning?
> 
> I've done it after to catch the "depleted or fragmented" vmalloc space.
> This is not related to the memory available and therefore it won't be
> handled by the oom killer. The error message shouldn't imply the vmalloc
> allocation failure IMHO but I am open to suggestions.

The word "failed" does seem to imply what you don't want it to imply...

I guess it is reasonable to have this warning, but maybe add " -- retrying"
if __GFP_NOFAIL.

Thanks,
NeilBrown
Michal Hocko Oct. 26, 2021, 11:29 a.m. UTC | #4
On Tue 26-10-21 21:30:52, Neil Brown wrote:
> On Tue, 26 Oct 2021, Michal Hocko wrote:
> > On Tue 26-10-21 09:59:36, Neil Brown wrote:
> > > On Tue, 26 Oct 2021, Michal Hocko wrote:
> > [...]
> > > > @@ -3032,6 +3036,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > > >  		warn_alloc(gfp_mask, NULL,
> > > >  			"vmalloc error: size %lu, vm_struct allocation failed",
> > > >  			real_size);
> > > > +		if (gfp_mask & __GFP_NOFAIL) {
> > > > +			schedule_timeout_uninterruptible(1);
> > > > +			goto again;
> > > > +		}
> > > 
> > > Shouldn't the retry happen *before* the warning?
> > 
> > I've done it after to catch the "depleted or fragmented" vmalloc space.
> > This is not related to the memory available and therefore it won't be
> > handled by the oom killer. The error message shouldn't imply the vmalloc
> > allocation failure IMHO but I am open to suggestions.
> 
> The word "failed" does seem to imply what you don't want it to imply...
> 
> I guess it is reasonable to have this warning, but maybe add " -- retrying"
> if __GFP_NOFAIL.

I do not have a strong opinion on that. I can surely do
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 602649919a9d..3489928fafa2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3033,10 +3033,11 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 				  VM_UNINITIALIZED | vm_flags, start, end, node,
 				  gfp_mask, caller);
 	if (!area) {
+		bool nofail = gfp_mask & __GFP_NOFAIL;
 		warn_alloc(gfp_mask, NULL,
-			"vmalloc error: size %lu, vm_struct allocation failed",
-			real_size);
-		if (gfp_mask & __GFP_NOFAIL) {
+			"vmalloc error: size %lu, vm_struct allocation failed%s",
+			real_size, (nofail) ? ". Retrying." : "");
+		if (nofail) {
 			schedule_timeout_uninterruptible(1);
 			goto again;
 		}
Uladzislau Rezki Oct. 26, 2021, 3:48 p.m. UTC | #5
> From: Michal Hocko <mhocko@suse.com>
>
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
>
> The larg part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
>
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index c6cc77d2f366..602649919a9d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>                 flags = memalloc_noio_save();
>
> -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +       do {
> +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>                         page_shift);
> +               if (ret < 0)
> +                       schedule_timeout_uninterruptible(1);
> +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
>

1.
After that change a below code:

<snip>
if (ret < 0) {
    warn_alloc(orig_gfp_mask, NULL,
        "vmalloc error: size %lu, failed to map pages",
        area->nr_pages * PAGE_SIZE);
    goto fail;
}
<snip>

does not make any sense anymore.

2.
Can we combine two places where we handle __GFP_NOFAIL into one place?
That would look like as more sorted out.
Michal Hocko Oct. 26, 2021, 4:28 p.m. UTC | #6
On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.
> >
> > The larg part of the vmalloc implementation already complies with the
> > given gfp flags so there is no work for those to be done. The area
> > and page table allocations are an exception to that. Implement a retry
> > loop for those.
> >
> > Add a short sleep before retrying. 1 jiffy is a completely random
> > timeout. Ideally the retry would wait for an explicit event - e.g.
> > a change to the vmalloc space change if the failure was caused by
> > the space fragmentation or depletion. But there are multiple different
> > reasons to retry and this could become much more complex. Keep the retry
> > simple for now and just sleep to prevent from hogging CPUs.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/vmalloc.c | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index c6cc77d2f366..602649919a9d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> >                 flags = memalloc_noio_save();
> >
> > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > +       do {
> > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> >                         page_shift);
> > +               if (ret < 0)
> > +                       schedule_timeout_uninterruptible(1);
> > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> >
> 
> 1.
> After that change a below code:
> 
> <snip>
> if (ret < 0) {
>     warn_alloc(orig_gfp_mask, NULL,
>         "vmalloc error: size %lu, failed to map pages",
>         area->nr_pages * PAGE_SIZE);
>     goto fail;
> }
> <snip>
> 
> does not make any sense anymore.

Why? Allocations without __GFP_NOFAIL can still fail, no?

> 2.
> Can we combine two places where we handle __GFP_NOFAIL into one place?
> That would look like as more sorted out.

I have to admit I am not really fluent at vmalloc code so I wanted to
make the code as simple as possible. How would I unwind all the allocated
memory (already allocated as GFP_NOFAIL) before retrying at
__vmalloc_node_range (if that is what you suggest). And isn't that a
bit wasteful?

Or did you have anything else in mind?
Uladzislau Rezki Oct. 26, 2021, 7:33 p.m. UTC | #7
On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > >
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> > >
> > > The larg part of the vmalloc implementation already complies with the
> > > given gfp flags so there is no work for those to be done. The area
> > > and page table allocations are an exception to that. Implement a retry
> > > loop for those.
> > >
> > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > a change to the vmalloc space change if the failure was caused by
> > > the space fragmentation or depletion. But there are multiple different
> > > reasons to retry and this could become much more complex. Keep the retry
> > > simple for now and just sleep to prevent from hogging CPUs.
> > >
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/vmalloc.c | 10 +++++++++-
> > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index c6cc77d2f366..602649919a9d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > >                 flags = memalloc_noio_save();
> > >
> > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > +       do {
> > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > >                         page_shift);
> > > +               if (ret < 0)
> > > +                       schedule_timeout_uninterruptible(1);
> > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > >
> > 
> > 1.
> > After that change a below code:
> > 
> > <snip>
> > if (ret < 0) {
> >     warn_alloc(orig_gfp_mask, NULL,
> >         "vmalloc error: size %lu, failed to map pages",
> >         area->nr_pages * PAGE_SIZE);
> >     goto fail;
> > }
> > <snip>
> > 
> > does not make any sense anymore.
> 
> Why? Allocations without __GFP_NOFAIL can still fail, no?
> 
Right. I meant one thing but wrote slightly differently. In case of
vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
warning message? Because either we can recover on a future iteration
or it stuck there infinitely so a user does not understand what happened.
From the other hand this is how __GFP_NOFAIL works, hm..

Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
for all cases even when __GFP_NOFAIL is not set, in that scenario we do
not want to wait, instead we should return back to a caller asap. Or am
i missing something here?

> > 2.
> > Can we combine two places where we handle __GFP_NOFAIL into one place?
> > That would look like as more sorted out.
> 
> I have to admit I am not really fluent at vmalloc code so I wanted to
> make the code as simple as possible. How would I unwind all the allocated
> memory (already allocated as GFP_NOFAIL) before retrying at
> __vmalloc_node_range (if that is what you suggest). And isn't that a
> bit wasteful?
> 
> Or did you have anything else in mind?
>
It depends on how often all this can fail. But let me double check if
such combining is easy.

--
Vlad Rezki
Michal Hocko Oct. 27, 2021, 6:46 a.m. UTC | #8
On Tue 26-10-21 21:33:15, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > >
> > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > cannot fail and they do not fit into a single page.
> > > >
> > > > The larg part of the vmalloc implementation already complies with the
> > > > given gfp flags so there is no work for those to be done. The area
> > > > and page table allocations are an exception to that. Implement a retry
> > > > loop for those.
> > > >
> > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > a change to the vmalloc space change if the failure was caused by
> > > > the space fragmentation or depletion. But there are multiple different
> > > > reasons to retry and this could become much more complex. Keep the retry
> > > > simple for now and just sleep to prevent from hogging CPUs.
> > > >
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  mm/vmalloc.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index c6cc77d2f366..602649919a9d 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > >                 flags = memalloc_noio_save();
> > > >
> > > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > +       do {
> > > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > >                         page_shift);
> > > > +               if (ret < 0)
> > > > +                       schedule_timeout_uninterruptible(1);
> > > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > >
> > > 
> > > 1.
> > > After that change a below code:
> > > 
> > > <snip>
> > > if (ret < 0) {
> > >     warn_alloc(orig_gfp_mask, NULL,
> > >         "vmalloc error: size %lu, failed to map pages",
> > >         area->nr_pages * PAGE_SIZE);
> > >     goto fail;
> > > }
> > > <snip>
> > > 
> > > does not make any sense anymore.
> > 
> > Why? Allocations without __GFP_NOFAIL can still fail, no?
> > 
> Right. I meant one thing but wrote slightly differently. In case of
> vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
> warning message? Because either we can recover on a future iteration
> or it stuck there infinitely so a user does not understand what happened.
> From the other hand this is how __GFP_NOFAIL works, hm..

Yes, the page allocator doesn't warn either and I would like to keep
this in sync.

> Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
> for all cases even when __GFP_NOFAIL is not set, in that scenario we do
> not want to wait, instead we should return back to a caller asap. Or am
> i missing something here?

OK, I will change that.
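
The adjustment agreed on here would look roughly like the following sketch
against the quoted hunk in __vmalloc_area_node() (not necessarily the final
committed form; nofail is just a local shorthand for gfp_mask & __GFP_NOFAIL):

	bool nofail = gfp_mask & __GFP_NOFAIL;

	do {
		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
				       page_shift);
		if (nofail && ret < 0)
			schedule_timeout_uninterruptible(1);
	} while (nofail && ret < 0);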
Uladzislau Rezki Oct. 27, 2021, 5:55 p.m. UTC | #9
On Tue, Oct 26, 2021 at 09:33:15PM +0200, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > >
> > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > cannot fail and they do not fit into a single page.
> > > >
> > > > The larg part of the vmalloc implementation already complies with the
> > > > given gfp flags so there is no work for those to be done. The area
> > > > and page table allocations are an exception to that. Implement a retry
> > > > loop for those.
> > > >
> > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > a change to the vmalloc space change if the failure was caused by
> > > > the space fragmentation or depletion. But there are multiple different
> > > > reasons to retry and this could become much more complex. Keep the retry
> > > > simple for now and just sleep to prevent from hogging CPUs.
> > > >
> > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > ---
> > > >  mm/vmalloc.c | 10 +++++++++-
> > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index c6cc77d2f366..602649919a9d 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > >                 flags = memalloc_noio_save();
> > > >
> > > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > +       do {
> > > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > >                         page_shift);
> > > > +               if (ret < 0)
> > > > +                       schedule_timeout_uninterruptible(1);
> > > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > >
> > > 
> > > 1.
> > > After that change a below code:
> > > 
> > > <snip>
> > > if (ret < 0) {
> > >     warn_alloc(orig_gfp_mask, NULL,
> > >         "vmalloc error: size %lu, failed to map pages",
> > >         area->nr_pages * PAGE_SIZE);
> > >     goto fail;
> > > }
> > > <snip>
> > > 
> > > does not make any sense anymore.
> > 
> > Why? Allocations without __GFP_NOFAIL can still fail, no?
> > 
> Right. I meant one thing but wrote slightly differently. In case of
> vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
> warning message? Because either we can recover on a future iteration
> or it stuck there infinitely so a user does not understand what happened.
> From the other hand this is how __GFP_NOFAIL works, hm..
> 
> Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
> for all cases even when __GFP_NOFAIL is not set, in that scenario we do
> not want to wait, instead we should return back to a caller asap. Or am
> i missing something here?
> 
> > > 2.
> > > Can we combine two places where we handle __GFP_NOFAIL into one place?
> > > That would look like as more sorted out.
> > 
> > I have to admit I am not really fluent at vmalloc code so I wanted to
> > make the code as simple as possible. How would I unwind all the allocated
> > memory (already allocated as GFP_NOFAIL) before retrying at
> > __vmalloc_node_range (if that is what you suggest). And isn't that a
> > bit wasteful?
> > 
> > Or did you have anything else in mind?
> >
> It depends on how often all this can fail. But let me double check if
> such combining is easy.
> 
I mean something like below. The idea is to not spread the __GFP_NOFAIL
across the vmalloc file keeping it in one solid place:

<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..f4b7927e217e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	unsigned long array_size;
 	unsigned int nr_small_pages = size >> PAGE_SHIFT;
 	unsigned int page_order;
+	unsigned long flags;
+	int ret;
 
 	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
+
+	/*
+	 * This I do not understand: why do we not want to see warning messages?
+	 */
 	gfp_mask |= __GFP_NOWARN;
 	if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
 		gfp_mask |= __GFP_HIGHMEM;
@@ -2930,8 +2936,23 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		goto fail;
 	}
 
-	if (vmap_pages_range(addr, addr + size, prot, area->pages,
-			page_shift) < 0) {
+	/*
+	 * page tables allocations ignore external gfp mask, enforce it
+	 * by the scope API
+	 */
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		flags = memalloc_nofs_save();
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		flags = memalloc_noio_save();
+
+	ret = vmap_pages_range(addr, addr + size, prot, area->pages, page_shift);
+
+	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+		memalloc_nofs_restore(flags);
+	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
+		memalloc_noio_restore(flags);
+
+	if (ret < 0) {
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, failed to map pages",
 			area->nr_pages * PAGE_SIZE);
@@ -2984,6 +3005,12 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		return NULL;
 	}
 
+	/*
+	 * Suppress all warnings for __GFP_NOFAIL allocation.
+	 */
+	if (gfp_mask & __GFP_NOFAIL)
+		gfp_mask |= __GFP_NOWARN;
+
 	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
 		unsigned long size_per_node;
 
@@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
 				  VM_UNINITIALIZED | vm_flags, start, end, node,
 				  gfp_mask, caller);
-	if (!area) {
-		warn_alloc(gfp_mask, NULL,
-			"vmalloc error: size %lu, vm_struct allocation failed",
-			real_size);
-		goto fail;
-	}
+	if (area)
+		addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
+
+	if (!area || !addr) {
+		if (gfp_mask & __GFP_NOFAIL) {
+			schedule_timeout_uninterruptible(1);
+			goto again;
+		}
+
+		if (!area)
+			warn_alloc(gfp_mask, NULL,
+				"vmalloc error: size %lu, vm_struct allocation failed",
+				real_size);
 
-	addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
-	if (!addr)
 		goto fail;
+	}
 
 	/*
 	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
<snip>

--
Vlad Rezki
Michal Hocko Oct. 29, 2021, 7:57 a.m. UTC | #10
On Wed 27-10-21 19:55:50, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 09:33:15PM +0200, Uladzislau Rezki wrote:
> > On Tue, Oct 26, 2021 at 06:28:52PM +0200, Michal Hocko wrote:
> > > On Tue 26-10-21 17:48:32, Uladzislau Rezki wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > >
> > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > cannot fail and they do not fit into a single page.
> > > > >
> > > > > The larg part of the vmalloc implementation already complies with the
> > > > > given gfp flags so there is no work for those to be done. The area
> > > > > and page table allocations are an exception to that. Implement a retry
> > > > > loop for those.
> > > > >
> > > > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > > > a change to the vmalloc space change if the failure was caused by
> > > > > the space fragmentation or depletion. But there are multiple different
> > > > > reasons to retry and this could become much more complex. Keep the retry
> > > > > simple for now and just sleep to prevent from hogging CPUs.
> > > > >
> > > > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > > > ---
> > > > >  mm/vmalloc.c | 10 +++++++++-
> > > > >  1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > > index c6cc77d2f366..602649919a9d 100644
> > > > > --- a/mm/vmalloc.c
> > > > > +++ b/mm/vmalloc.c
> > > > > @@ -2941,8 +2941,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > > >         else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> > > > >                 flags = memalloc_noio_save();
> > > > >
> > > > > -       ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > > +       do {
> > > > > +               ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > >                         page_shift);
> > > > > +               if (ret < 0)
> > > > > +                       schedule_timeout_uninterruptible(1);
> > > > > +       } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > > >
> > > > 
> > > > 1.
> > > > After that change a below code:
> > > > 
> > > > <snip>
> > > > if (ret < 0) {
> > > >     warn_alloc(orig_gfp_mask, NULL,
> > > >         "vmalloc error: size %lu, failed to map pages",
> > > >         area->nr_pages * PAGE_SIZE);
> > > >     goto fail;
> > > > }
> > > > <snip>
> > > > 
> > > > does not make any sense anymore.
> > > 
> > > Why? Allocations without __GFP_NOFAIL can still fail, no?
> > > 
> > Right. I meant one thing but wrote slightly differently. In case of
> > vmap_pages_range() fails(if __GFP_NOFAIL is set) should we emit any
> > warning message? Because either we can recover on a future iteration
> > or it stuck there infinitely so a user does not understand what happened.
> > From the other hand this is how __GFP_NOFAIL works, hm..
> > 
> > Another thing, i see that schedule_timeout_uninterruptible(1) is invoked
> > for all cases even when __GFP_NOFAIL is not set, in that scenario we do
> > not want to wait, instead we should return back to a caller asap. Or am
> > i missing something here?
> > 
> > > > 2.
> > > > Can we combine two places where we handle __GFP_NOFAIL into one place?
> > > > That would look like as more sorted out.
> > > 
> > > I have to admit I am not really fluent at vmalloc code so I wanted to
> > > make the code as simple as possible. How would I unwind all the allocated
> > > memory (already allocated as GFP_NOFAIL) before retrying at
> > > __vmalloc_node_range (if that is what you suggest). And isn't that a
> > > bit wasteful?
> > > 
> > > Or did you have anything else in mind?
> > >
> > It depends on how often all this can fail. But let me double check if
> > such combining is easy.
> > 
> I mean something like below. The idea is to not spread the __GFP_NOFAIL
> across the vmalloc file keeping it in one solid place:
> 
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..f4b7927e217e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	unsigned long array_size;
>  	unsigned int nr_small_pages = size >> PAGE_SHIFT;
>  	unsigned int page_order;
> +	unsigned long flags;
> +	int ret;
>  
>  	array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> +
> +	/*
> +	 * This I do not understand: why do we not want to see warning messages?
> +	 */
>  	gfp_mask |= __GFP_NOWARN;

I suspect this is because vmalloc wants to have its own failure
reporting.

[...]
> @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
>  				  VM_UNINITIALIZED | vm_flags, start, end, node,
>  				  gfp_mask, caller);
> -	if (!area) {
> -		warn_alloc(gfp_mask, NULL,
> -			"vmalloc error: size %lu, vm_struct allocation failed",
> -			real_size);
> -		goto fail;
> -	}
> +	if (area)
> +		addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> +
> +	if (!area || !addr) {
> +		if (gfp_mask & __GFP_NOFAIL) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}
> +
> +		if (!area)
> +			warn_alloc(gfp_mask, NULL,
> +				"vmalloc error: size %lu, vm_struct allocation failed",
> +				real_size);
>  
> -	addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> -	if (!addr)
>  		goto fail;
> +	}
>  
>  	/*
>  	 * In this function, newly allocated vm_struct has VM_UNINITIALIZED
> <snip>

OK, this looks easier from the code reading but isn't it quite wasteful
to throw all the pages backing the area (all of them allocated as
__GFP_NOFAIL) just to then fail to allocate few page tables pages and
drop all of that on the floor (this will happen in __vunmap AFAICS).

I mean I do not care all that strongly but it seems to me that more
changes would need to be done here and optimizations can be done on top.

Is this something you feel strongly about?
Uladzislau Rezki Oct. 29, 2021, 2:05 p.m. UTC | #11
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index d77830ff604c..f4b7927e217e 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2889,8 +2889,14 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >       unsigned long array_size;
> >       unsigned int nr_small_pages = size >> PAGE_SHIFT;
> >       unsigned int page_order;
> > +     unsigned long flags;
> > +     int ret;
> >
> >       array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> > +
> > +     /*
> > +      * This I do not understand: why do we not want to see warning messages?
> > +      */
> >       gfp_mask |= __GFP_NOWARN;
>
> I suspect this is because vmalloc wants to have its own failure
> reporting.
>
But as I see it, this is broken. All three warn_alloc() reports in
__vmalloc_area_node() are useless because __GFP_NOWARN is added on top of
gfp_mask:

void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
{
        struct va_format vaf;
        va_list args;
        static DEFINE_RATELIMIT_STATE(nopage_rs, 10*HZ, 1);

        if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
                return;
...

so with __GFP_NOWARN set, each of those warn_alloc() calls returns early
and the report is effectively suppressed.
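
One way to keep vmalloc's own reporting while still passing __GFP_NOWARN down
to the page allocator would be to snapshot the caller's mask before modifying
it and use the saved copy for warn_alloc(); the orig_gfp_mask in the snippet
quoted earlier in the thread points in that direction. A rough sketch only,
not a claim about what any tree currently does:

	/* inside __vmalloc_area_node() */
	const gfp_t orig_gfp_mask = gfp_mask;	/* caller's flags, kept for reporting */

	gfp_mask |= __GFP_NOWARN;	/* silence nested page allocator warnings */

	/* ... page array allocation and vmap_pages_range() ... */

	if (ret < 0) {
		warn_alloc(orig_gfp_mask, NULL,
			   "vmalloc error: size %lu, failed to map pages",
			   area->nr_pages * PAGE_SIZE);
		goto fail;
	}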

> [...]
> > @@ -3010,16 +3037,22 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> >       area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
> >                                 VM_UNINITIALIZED | vm_flags, start, end, node,
> >                                 gfp_mask, caller);
> > -     if (!area) {
> > -             warn_alloc(gfp_mask, NULL,
> > -                     "vmalloc error: size %lu, vm_struct allocation failed",
> > -                     real_size);
> > -             goto fail;
> > -     }
> > +     if (area)
> > +             addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> > +
> > +     if (!area || !addr) {
> > +             if (gfp_mask & __GFP_NOFAIL) {
> > +                     schedule_timeout_uninterruptible(1);
> > +                     goto again;
> > +             }
> > +
> > +             if (!area)
> > +                     warn_alloc(gfp_mask, NULL,
> > +                             "vmalloc error: size %lu, vm_struct allocation failed",
> > +                             real_size);
> >
> > -     addr = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
> > -     if (!addr)
> >               goto fail;
> > +     }
> >
> >       /*
> >        * In this function, newly allocated vm_struct has VM_UNINITIALIZED
> > <snip>
>
> OK, this looks easier from the code reading but isn't it quite wasteful
> to throw all the pages backing the area (all of them allocated as
> __GFP_NOFAIL) just to then fail to allocate few page tables pages and
> drop all of that on the floor (this will happen in __vunmap AFAICS).
>
> I mean I do not care all that strongly but it seems to me that more
> changes would need to be done here and optimizations can be done on top.
>
> Is this something you feel strongly about?
>
Will try to provide some motivations :)

It depends on how you look at it. My view is that simpler code is preferred.
It is not considered a hot path and it is rather a corner case to me. I think
"unwinding" has some advantage. At least one motivation is to release the
memory (on failure) before the delay, so that extra memory is not held if a
__GFP_NOFAIL allocation never succeeds, i.e. a process stuck in __GFP_NOFAIL
does not "hold" extra memory forever.
Michal Hocko Oct. 29, 2021, 2:45 p.m. UTC | #12
On Fri 29-10-21 16:05:32, Uladzislau Rezki wrote:
[...]
> > OK, this looks easier from the code reading but isn't it quite wasteful
> > to throw all the pages backing the area (all of them allocated as
> > __GFP_NOFAIL) just to then fail to allocate few page tables pages and
> > drop all of that on the floor (this will happen in __vunmap AFAICS).
> >
> > I mean I do not care all that strongly but it seems to me that more
> > changes would need to be done here and optimizations can be done on top.
> >
> > Is this something you feel strongly about?
> >
> Will try to provide some motivations :)
> 
> It depends on how you look at it. My view is that simpler code is preferred.
> It is not considered a hot path and it is rather a corner case to me.

Yes, we are definitely talking about corner cases here. Even GFP_KERNEL
allocations usually do not fail.

> I think "unwinding" has some advantage. At least one motivation is to
> release the memory (on failure) before the delay, so that extra memory is
> not held if a __GFP_NOFAIL allocation never succeeds, i.e. a process stuck
> in __GFP_NOFAIL does not "hold" extra memory forever.

Well, I suspect this is something that we can disagree on and both of us
would be kinda right. I would see it as throwing baby out with the
bathwater. The vast majority of the memory will be in the area pages and
sacrificing that just to allocate few page tables or whatever that might
fail in that code path is just a lot of cycles wasted.

So unless you really feel strongly about this then I would stick with
this approach.
Uladzislau Rezki Oct. 29, 2021, 5:23 p.m. UTC | #13
> On Fri 29-10-21 16:05:32, Uladzislau Rezki wrote:
> [...]
> > > OK, this looks easier from the code reading but isn't it quite wasteful
> > > to throw all the pages backing the area (all of them allocated as
> > > __GFP_NOFAIL) just to then fail to allocate few page tables pages and
> > > drop all of that on the floor (this will happen in __vunmap AFAICS).
> > >
> > > I mean I do not care all that strongly but it seems to me that more
> > > changes would need to be done here and optimizations can be done on top.
> > >
> > > Is this something you feel strongly about?
> > >
> > Will try to provide some motivations :)
> > 
> > It depends on how you look at it. My view is that simpler code is
> > preferred. It is not considered a hot path and it is rather a corner
> > case to me.
> 
> Yes, we are definitely talking about corner cases here. Even GFP_KERNEL
> allocations usually do not fail.
> 
> > I think "unwinding" has some advantage. At least one motivation is to
> > release the memory (on failure) before the delay, so that extra memory
> > is not held if a __GFP_NOFAIL allocation never succeeds, i.e. a process
> > stuck in __GFP_NOFAIL does not "hold" extra memory forever.
> 
> Well, I suspect this is something that we can disagree on and both of us
> would be kinda right. I would see it as throwing baby out with the
> bathwater. The vast majority of the memory will be in the area pages and
> sacrificing that just to allocate few page tables or whatever that might
> fail in that code path is just a lot of cycles wasted.
>
We are not talking about performance, no sense to measure cycles here :)

> 
> So unless you really feel strongly about this then I would stick with
> this approach.
>
I have raised one concern. The memory resource is shared between all
processes; with __GFP_NOFAIL it might be that we never return back to the
user, and in that scenario I prefer to release the held memory for other
needs instead of keeping it for nothing.

If you think it is not a problem, then i do not have much to say.

--
Vlad Rezki

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index c6cc77d2f366..602649919a9d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2941,8 +2941,12 @@  static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
 		flags = memalloc_noio_save();
 
-	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+	do {
+		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+		if (ret < 0)
+			schedule_timeout_uninterruptible(1);
+	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
 		memalloc_nofs_restore(flags);
@@ -3032,6 +3036,10 @@  void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, vm_struct allocation failed",
 			real_size);
+		if (gfp_mask & __GFP_NOFAIL) {
+			schedule_timeout_uninterruptible(1);
+			goto again;
+		}
 		goto fail;
 	}
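
Taken together, the user-visible effect of the patch is that a vmalloc-family
allocation with __GFP_NOFAIL keeps retrying internally instead of returning
NULL. A hypothetical caller (illustrative only, not part of the patch):

#include <linux/types.h>
#include <linux/vmalloc.h>
#include <linux/overflow.h>

static u32 *alloc_big_table(unsigned long nr_entries)
{
	/* __GFP_NOFAIL: the allocation retries internally rather than failing. */
	return __vmalloc(array_size(nr_entries, sizeof(u32)),
			 GFP_KERNEL | __GFP_NOFAIL);
}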