
[RFC,2/3] mm/vmalloc: add support for __GFP_NOFAIL

Message ID 20211018114712.9802-3-mhocko@kernel.org (mailing list archive)
State New, archived
Series: extend vmalloc support for constrained allocations

Commit Message

Michal Hocko Oct. 18, 2021, 11:47 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.

The large part of the vmalloc implementation already complies with the
given gfp flags so there is no work to be done for those. The area
and page table allocations are an exception to that. Implement a retry
loop for those.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmalloc.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

Comments

Michal Hocko Oct. 18, 2021, 4:48 p.m. UTC | #1
Fat fingers on my side. A follow up to fold
commit 63d1b80b9a298c9380e5175e2add7025b6bd2600
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Oct 18 18:47:04 2021 +0200

    fold me "mm/vmalloc: add support for __GFP_NOFAIL"

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3a5a178295d1..4ce9ccc33e33 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3034,7 +3034,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, vm_struct allocation failed",
 			real_size);
-		if (gfp_mask && __GFP_NOFAIL)
+		if (gfp_mask & __GFP_NOFAIL)
 			goto again;
 		goto fail;
 	}
Uladzislau Rezki Oct. 19, 2021, 11:06 a.m. UTC | #2
> From: Michal Hocko <mhocko@suse.com>
> 
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
> 
> The larg part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 7455c89598d3..3a5a178295d1 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2941,8 +2941,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
>  		flags = memalloc_noio_save();
>  
> -	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +	do {
> +		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>  			page_shift);
> +	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
>  
>  	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
>  		memalloc_nofs_restore(flags);
> @@ -3032,6 +3034,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  		warn_alloc(gfp_mask, NULL,
>  			"vmalloc error: size %lu, vm_struct allocation failed",
>  			real_size);
> +		if (gfp_mask && __GFP_NOFAIL)
> +			goto again;
>  		goto fail;
>  	}
>  
> -- 
> 2.30.2
> 
I have checked the vmap code how it aligns with the __GFP_NOFAIL flag.
To me it looks correct from functional point of view.

There is one place, though: kasan_populate_vmalloc_pte(). It does not
use gfp_mask; instead it directly uses GFP_KERNEL for its internal
purposes. If that fails, the code will end up looping in
__vmalloc_node_range().

I am not sure how important it is to pass __GFP_NOFAIL into the KASAN code.

Any thoughts about it?

--
Vlad Rezki
Michal Hocko Oct. 19, 2021, 11:52 a.m. UTC | #3
On Tue 19-10-21 13:06:49, Uladzislau Rezki wrote:
[...]
> I have checked the vmap code how it aligns with the __GFP_NOFAIL flag.
> To me it looks correct from functional point of view.
> 
> There is one place though it is kasan_populate_vmalloc_pte(). It does
> not use gfp_mask, instead it directly deals with GFP_KERNEL for its
> internal purpose. If it fails the code will end up in loping in the
> __vmalloc_node_range().
> 
> I am not sure how it is important to pass __GFP_NOFAIL into KASAN code.
> 
> Any thoughts about it?

The flag itself is not really necessary down there as long as we
guarantee that the high level logic doesn't fail. In this case we keep
retrying at the __vmalloc_node_range level, which should cover all
callers that can control the gfp mask. I was thinking of putting it into
__get_vm_area_node but that was slightly more hairy, and we would be
losing the warning, which might turn out to be helpful in cases where
the failure is due to lack of vmalloc space or a similar constraint.
Btw. do we want some throttling on a retry?

It would be better if the kasan part dealt with the gfp mask properly,
though; that is something we can do on top.
Uladzislau Rezki Oct. 19, 2021, 7:46 p.m. UTC | #4
On Tue, Oct 19, 2021 at 01:52:07PM +0200, Michal Hocko wrote:
> On Tue 19-10-21 13:06:49, Uladzislau Rezki wrote:
[...]
> > I have checked the vmap code how it aligns with the __GFP_NOFAIL flag.
> > To me it looks correct from functional point of view.
> > 
> > There is one place though it is kasan_populate_vmalloc_pte(). It does
> > not use gfp_mask, instead it directly deals with GFP_KERNEL for its
> > internal purpose. If it fails the code will end up in loping in the
> > __vmalloc_node_range().
> > 
> > I am not sure how it is important to pass __GFP_NOFAIL into KASAN code.
> > 
> > Any thoughts about it?
> 
> The flag itself is not really necessary down there as long as we
> guarantee that the high level logic doesn't fail. In this case we keep
> retrying at __vmalloc_node_range level which should be possible to cover
> all callers that can control gfp mask. I was thinking to put it into
> __get_vm_area_node but that was slightly more hairy and we would be
> losing the warning which might turn out being helpful in cases where the
> failure is due to lack of vmalloc space or similar constrain. Btw. do we
> want some throttling on a retry?
> 
I think adding some kind of schedule() will not make things worse and in
corner cases could prevent a power drain by the CPU. That is important for
mobile devices.

As for vmap space, it can be that a user specifies a short range that does
not contain any free area. In that case we might never return to the caller.

Maybe add a good comment, something like: think about what you are doing
when you combine __vmalloc_node_range() and __GFP_NOFAIL?

--
Vlad Rezki
Michal Hocko Oct. 20, 2021, 8:25 a.m. UTC | #5
On Tue 19-10-21 21:46:58, Uladzislau Rezki wrote:
> On Tue, Oct 19, 2021 at 01:52:07PM +0200, Michal Hocko wrote:
> > On Tue 19-10-21 13:06:49, Uladzislau Rezki wrote:
[...]
> > > I have checked the vmap code how it aligns with the __GFP_NOFAIL flag.
> > > To me it looks correct from functional point of view.
> > > 
> > > There is one place though it is kasan_populate_vmalloc_pte(). It does
> > > not use gfp_mask, instead it directly deals with GFP_KERNEL for its
> > > internal purpose. If it fails the code will end up in loping in the
> > > __vmalloc_node_range().
> > > 
> > > I am not sure how it is important to pass __GFP_NOFAIL into KASAN code.
> > > 
> > > Any thoughts about it?
> > 
> > The flag itself is not really necessary down there as long as we
> > guarantee that the high level logic doesn't fail. In this case we keep
> > retrying at __vmalloc_node_range level which should be possible to cover
> > all callers that can control gfp mask. I was thinking to put it into
> > __get_vm_area_node but that was slightly more hairy and we would be
> > losing the warning which might turn out being helpful in cases where the
> > failure is due to lack of vmalloc space or similar constrain. Btw. do we
> > want some throttling on a retry?
> > 
> I think adding kind of schedule() will not make things worse and in corner
> cases could prevent a power drain by CPU. It is important for mobile devices. 

I suspect you mean schedule_timeout here? Or cond_resched? I went with the
latter for now, as I do not have a good idea for how long to sleep here.
I am more than happy to change it to a sleep though.

> As for vmap space, it can be that a user specifies a short range that does
> not contain any free area. In that case we might never return back to a caller.

This is to be expected. The caller cannot fail, and if it were
looping around vmalloc it wouldn't return anyway.

> Maybe add a good comment something like: think what you do when deal with the
> __vmalloc_node_range() and __GFP_NOFAIL?

We have generic documentation for gfp flags, and __GFP_NOFAIL is
documented as "The allocation could block indefinitely but will never
return with failure." We are discussing improvements for the generic
documentation in another thread [1] and we will likely extend it, so I
suspect we do not have to repeat the drawbacks here again.

[1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown

Anyway, the gfp mask description and constraints for vmalloc are not
documented. I will add a new patch to fill that gap and send it as a
reply to this one.
Michal Hocko Oct. 20, 2021, 9:18 a.m. UTC | #6
On Wed 20-10-21 10:25:06, Michal Hocko wrote:
[...]
> > > The flag itself is not really necessary down there as long as we
> > > guarantee that the high level logic doesn't fail. In this case we keep
> > > retrying at __vmalloc_node_range level which should be possible to cover
> > > all callers that can control gfp mask. I was thinking to put it into
> > > __get_vm_area_node but that was slightly more hairy and we would be
> > > losing the warning which might turn out being helpful in cases where the
> > > failure is due to lack of vmalloc space or similar constrain. Btw. do we
> > > want some throttling on a retry?
> > > 
> > I think adding kind of schedule() will not make things worse and in corner
> > cases could prevent a power drain by CPU. It is important for mobile devices. 
> 
> I suspect you mean schedule_timeout here? Or cond_resched? I went with a
> later for now, I do not have a good idea for how to long to sleep here.
> I am more than happy to change to to a sleep though.

Forgot to paste the follow up I have staged currently
--- 
commit 66fea55e5543fa234692a70144cd63c7a1bca32f
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Oct 20 10:12:45 2021 +0200

    fold me "mm/vmalloc: add support for __GFP_NOFAIL"
    
    - add cond_resched

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fb5413d9239..f7098e616883 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	do {
 		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+		cond_resched();
 	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
@@ -3034,8 +3035,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, vm_struct allocation failed",
 			real_size);
-		if (gfp_mask & __GFP_NOFAIL)
+		if (gfp_mask & __GFP_NOFAIL) {
+			cond_resched();
 			goto again;
+		}
 		goto fail;
 	}
Uladzislau Rezki Oct. 20, 2021, 1:54 p.m. UTC | #7
> > >
> > I think adding kind of schedule() will not make things worse and in corner
> > cases could prevent a power drain by CPU. It is important for mobile devices.
>
> I suspect you mean schedule_timeout here? Or cond_resched? I went with a
> later for now, I do not have a good idea for how to long to sleep here.
> I am more than happy to change to to a sleep though.
>
cond_resched() reschedules only if TIF_NEED_RESCHED is raised, which is
not good here, because in our case we know that we definitely would like
to take a breath. Therefore invoking schedule() is more suitable here.
It will give CPU time to another waiting process (if one exists) in any
case, putting the "current" one at the tail of the run queue.

As for adding a delay: I am not sure for how long to delay, or, I would
say, I do not see a good explanation for why we would delay for, say,
10 milliseconds.

> > As for vmap space, it can be that a user specifies a short range that does
> > not contain any free area. In that case we might never return back to a caller.
>
> This is to be expected. The caller cannot fail and if it would be
> looping around vmalloc it wouldn't return anyway.
>
> > Maybe add a good comment something like: think what you do when deal with the
> > __vmalloc_node_range() and __GFP_NOFAIL?
>
> We have a generic documentation for gfp flags and __GFP_NOFAIL is
> docuemented to "The allocation could block indefinitely but will never
> return with failure." We are discussing improvements for the generic
> documentation in another thread [1] and we will likely extend it so I
> suspect we do not have to repeat drawbacks here again.
>
> [1] http://lkml.kernel.org/r/163184741778.29351.16920832234899124642.stgit@noble.brown
>
> Anyway the gfp mask description and constrains for vmalloc are not
> documented. I will add a new patch to fill that gap and send it as a
> reply to this one
>
This is really good. People should be prepared for the case when it
never returns to the caller :)
Michal Hocko Oct. 20, 2021, 2:06 p.m. UTC | #8
On Wed 20-10-21 15:54:23, Uladzislau Rezki wrote:
> > > >
> > > I think adding kind of schedule() will not make things worse and in corner
> > > cases could prevent a power drain by CPU. It is important for mobile devices.
> >
> > I suspect you mean schedule_timeout here? Or cond_resched? I went with a
> > later for now, I do not have a good idea for how to long to sleep here.
> > I am more than happy to change to to a sleep though.
> >
> cond_resched() reschedules only if TIF_NEED_RESCHED is raised what is not good
> here. Because in our case we know that we definitely would like to
> take a breath. Therefore
> invoking the schedule() is more suitable here. It will give a CPU time
> to another waiting
> process(if exists) in any case putting the "current" one to the tail.

Yes, but there is no explicit event to wait for currently.

> As for adding a delay. I am not sure about for how long to delay or i
> would say i do not
> see a good explanation why for example we delay for 10 milliseconds or so.

As I've said I am OK with either of the two. Do you or anybody have any
preference? Without any explicit event to wake up for, neither of the
two is more than just an optimistic retry.
Uladzislau Rezki Oct. 20, 2021, 2:29 p.m. UTC | #9
On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <mhocko@suse.com> wrote:
>
[...]
>
> Yes, but there is no explicit event to wait for currently.
>
> > As for adding a delay. I am not sure about for how long to delay or i
> > would say i do not
> > see a good explanation why for example we delay for 10 milliseconds or so.
>
> As I've said I am OK with either of the two. Do you or anybody have any
> preference? Without any explicit event to wake up for neither of the two
> is more than just an optimistic retry.
>
From a power perspective it is better to have a delay, so I tend to say
that a delay is better.
Michal Hocko Oct. 20, 2021, 2:53 p.m. UTC | #10
On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > As I've said I am OK with either of the two. Do you or anybody have any
> > preference? Without any explicit event to wake up for neither of the two
> > is more than just an optimistic retry.
> >
> From power perspective it is better to have a delay, so i tend to say
> that delay is better.

I am a terrible random number generator. Can you give me a number
please?
Uladzislau Rezki Oct. 20, 2021, 3 p.m. UTC | #11
>
[...]
>
> I am a terrible random number generator. Can you give me a number
> please?
>
Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
Uladzislau Rezki Oct. 20, 2021, 7:24 p.m. UTC | #12
On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
[...]
> Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> 
A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));

--
Vlad Rezki
Michal Hocko Oct. 21, 2021, 8:56 a.m. UTC | #13
On Wed 20-10-21 21:24:30, Uladzislau Rezki wrote:
[...]
> > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)

OK, I will go with 1 jiffy.

> A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));

I had planned to use schedule_timeout_uninterruptible. Why do you think
msleep is better?
NeilBrown Oct. 21, 2021, 10:13 a.m. UTC | #14
On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
[...]
> A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));

I disagree.  I think schedule_timeout_uninterruptible(1) is the best
way to sleep for 1 tick.

msleep() contains
  timeout = msecs_to_jiffies(msecs) + 1;
and both jiffies_to_msecs and msecs_to_jiffies might round up too.
So you will sleep for at least twice as long as you asked for, possibly
more.

NeilBrown


> 
> --
> Vlad Rezki
> 
>
Michal Hocko Oct. 21, 2021, 10:27 a.m. UTC | #15
On Thu 21-10-21 21:13:35, Neil Brown wrote:
> On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > > >
> > > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > [...]
> > > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > > is more than just an optimistic retry.
> > > > > >
> > > > > From power perspective it is better to have a delay, so i tend to say
> > > > > that delay is better.
> > > >
> > > > I am a terrible random number generator. Can you give me a number
> > > > please?
> > > >
> > > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> > > 
> > A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));
> 
> I disagree.  I think schedule_timeout_uninterruptible(1) is the best
> wait to sleep for 1 ticl
> 
> msleep() contains
>   timeout = msecs_to_jiffies(msecs) + 1;
> and both jiffies_to_msecs and msecs_to_jiffies might round up too.
> So you will sleep for at least twice as long as you asked for, possible
> more.

That was my thinking as well. Not to mention that calling jiffies_to_msecs
just to do msecs_to_jiffies right after seems like a pointless waste of
CPU cycles. But maybe I was missing some other reason why msleep would
be superior.
Uladzislau Rezki Oct. 21, 2021, 10:40 a.m. UTC | #16
> On Thu 21-10-21 21:13:35, Neil Brown wrote:
[...]
> > I disagree.  I think schedule_timeout_uninterruptible(1) is the best
> > wait to sleep for 1 ticl
> > 
> > msleep() contains
> >   timeout = msecs_to_jiffies(msecs) + 1;
> > and both jiffies_to_msecs and msecs_to_jiffies might round up too.
> > So you will sleep for at least twice as long as you asked for, possible
> > more.
> 
> That was my thinking as well. Not to mention jiffies_to_msecs just to do
> msecs_to_jiffies right after which seems like a pointless wasting of
> cpu cycle. But maybe I was missing some other reasons why msleep would
> be superior.
>

To me, msleep is just simpler from a semantic point of view, i.e. it is
as straightforward as it can be. An interruptible/uninterruptible sleep
can be more confusing for people.

When it comes to rounding and the possibility of sleeping for more than
1 tick, it really does not matter here; we do not need to guarantee an
exact sleeping time.

Therefore I proposed switching to msleep().

--
Vlad Rezki
NeilBrown Oct. 21, 2021, 10:49 p.m. UTC | #17
On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
[...]
> To me the msleep is just more simpler from semantic point of view, i.e.
> it is as straight forward as it can be. In case of interruptable/uninteraptable
> sleep it can be more confusing for people.

I agree that msleep() is simpler.  I think adding the
jiffies_to_msecs() call substantially reduces that simplicity.

> 
> When it comes to rounding and possibility to sleep more than 1 tick, it
> really does not matter here, we do not need to guarantee exact sleeping
> time.
> 
> Therefore i proposed to switch to the msleep().

If, as you say, the precision doesn't matter that much, then maybe
   msleep(0)
which would sleep to the start of the next jiffy.  Does that look a bit
weird?  If so, then msleep(1) would be ok.

However now that I've thought about some more, I'd much prefer we
introduce something like
    memalloc_retry_wait();

and use that everywhere that a memory allocation is retried.
I'm not convinced that we need to wait at all - at least, not when
__GFP_DIRECT_RECLAIM is used, as in that case alloc_page will either
  - succeed
  - make some progress a reclaiming or
  - sleep

However I'm not 100% certain, and the behaviour might change in the
future.  So having one place (the definition of memalloc_retry_wait())
where we can change the sleeping behaviour if the alloc_page behaviour
changes, would be ideal.  Maybe memalloc_retry_wait() could take a
gfpflags arg.

Thanks,
NeilBrown
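[A possible shape for the helper proposed above — purely a hypothetical sketch; the name, the gfp-aware branching, and the HZ/50 fallback are all assumptions here, not an existing interface at the time of this thread:]

```c
/* Hypothetical sketch: a common back-off for memory-allocation retry loops.
 * The branching assumes alloc_page() already slept when direct reclaim was
 * allowed, so only a minimal extra wait is needed in that case.
 */
static inline void memalloc_retry_wait(gfp_t gfp_flags)
{
	__set_current_state(TASK_UNINTERRUPTIBLE);
	if (gfpflags_allow_blocking(gfp_flags) && !(gfp_flags & __GFP_NORETRY))
		/* the allocator probably already waited; one tick is enough */
		schedule_timeout(1);
	else
		/* probably did not wait at all; back off a little longer */
		schedule_timeout(HZ / 50);
	__set_current_state(TASK_RUNNING);
}
```

Concentrating the policy in one place means a future change to alloc_page() behaviour only requires updating this helper, not every retry loop.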
Michal Hocko Oct. 22, 2021, 8:18 a.m. UTC | #18
On Fri 22-10-21 09:49:08, Neil Brown wrote:
[...]
> However now that I've thought about some more, I'd much prefer we
> introduce something like
>     memalloc_retry_wait();
> 
> and use that everywhere that a memory allocation is retried.
> I'm not convinced that we need to wait at all - at least, not when
> __GFP_DIRECT_RECLAIM is used, as in that case alloc_page will either
>   - succeed
> >   - make some progress at reclaiming, or
>   - sleep

There are two retry loops that we have to do explicitly. The
vmap_pages_range one is due to implicit GFP_KERNEL allocations for page
tables. Those would likely be a good fit for something you suggest
above. Then we have the __get_vm_area_node retry loop, which can trigger
either due to a vmalloc space reservation failure or an implicit
GFP_KERNEL allocation by kasan. The first of those is not really related
to memory availability, so it doesn't sound like a good fit.
Uladzislau Rezki Oct. 25, 2021, 9:48 a.m. UTC | #19
On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > > On Thu 21-10-21 21:13:35, Neil Brown wrote:
> > > > On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > > > > On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > > > > > >
> > > > > > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > > > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <mhocko@suse.com> wrote:
> > > > > > > [...]
> > > > > > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > > > > > is more than just an optimistic retry.
> > > > > > > > >
> > > > > > > > From power perspective it is better to have a delay, so i tend to say
> > > > > > > > that delay is better.
> > > > > > >
> > > > > > > I am a terrible random number generator. Can you give me a number
> > > > > > > please?
> > > > > > >
> > > > > > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> > > > > > 
> > > > > > A small nit, it is better to replace it with a simple msleep() call: msleep(jiffies_to_msecs(1));
> > > > 
> > > > I disagree.  I think schedule_timeout_uninterruptible(1) is the best
> > > > > way to sleep for 1 tick
> > > > 
> > > > msleep() contains
> > > >   timeout = msecs_to_jiffies(msecs) + 1;
> > > > and both jiffies_to_msecs and msecs_to_jiffies might round up too.
> > > > > So you will sleep for at least twice as long as you asked for, possibly
> > > > > more.
> > > 
> > > That was my thinking as well. Not to mention jiffies_to_msecs just to do
> > > msecs_to_jiffies right after, which seems like a pointless waste of
> > > cpu cycles. But maybe I was missing some other reasons why msleep would
> > > be superior.
> > >
> > 
> > To me msleep() is just simpler from a semantic point of view, i.e.
> > it is as straightforward as it can be. In the case of interruptible/uninterruptible
> > sleep it can be more confusing for people.
> 
> I agree that msleep() is simpler.  I think adding the
> jiffies_to_msecs() substantially reduces that simplicity.
> 
> > 
> > When it comes to rounding and possibility to sleep more than 1 tick, it
> > really does not matter here, we do not need to guarantee exact sleeping
> > time.
> > 
> > Therefore i proposed to switch to the msleep().
> 
> If, as you say, the precision doesn't matter that much, then maybe
>    msleep(0)
> which would sleep to the start of the next jiffy.  Does that look a bit
> weird?  If so, the msleep(1) would be ok.
> 
Agree, msleep(1) looks much better than converting 1 jiffy to
milliseconds. The result should be the same.

> However now that I've thought about some more, I'd much prefer we
> introduce something like
>     memalloc_retry_wait();
> 
> and use that everywhere that a memory allocation is retried.
> I'm not convinced that we need to wait at all - at least, not when
> __GFP_DIRECT_RECLAIM is used, as in that case alloc_page will either
>   - succeed
> >   - make some progress at reclaiming, or
>   - sleep
> 
> However I'm not 100% certain, and the behaviour might change in the
> future.  So having one place (the definition of memalloc_retry_wait())
> where we can change the sleeping behaviour if the alloc_page behaviour
> changes, would be ideal.  Maybe memalloc_retry_wait() could take a
> gfpflags arg.
> 
A sleep is required for __get_vm_area_node() because in case of lack
of vmap space it will end up in a tight loop without sleeping, which is
really bad.

Thanks!

--
Vlad Rezki
Michal Hocko Oct. 25, 2021, 11:20 a.m. UTC | #20
On Mon 25-10-21 11:48:41, Uladzislau Rezki wrote:
> On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
[...]
> > If, as you say, the precision doesn't matter that much, then maybe
> >    msleep(0)
> > which would sleep to the start of the next jiffy.  Does that look a bit
> > weird?  If so, the msleep(1) would be ok.
> > 
> Agree, msleep(1) looks much better than converting 1 jiffy to
> milliseconds. The result should be the same.

I would really prefer if this was not the main point of arguing here.
Unless you feel strongly about msleep I would go with schedule_timeout
here because this is a more widely used interface in the mm code and
also because I feel that relying on the rounding behavior is just
subtle. Here is what I have staged now.

Are there any other concerns you see with this or other patches in the
series?

Thanks!
--- 
commit c1a7e40e6b56fed5b9e716de7055b77ea29d89d0
Author: Michal Hocko <mhocko@suse.com>
Date:   Wed Oct 20 10:12:45 2021 +0200

    fold me "mm/vmalloc: add support for __GFP_NOFAIL"
    
    Add a short sleep before retrying. 1 jiffy is a completely random
    timeout. Ideally the retry would wait for an explicit event - e.g.
    a change to the vmalloc space if the failure was caused by
    space fragmentation or depletion. But there are multiple different
    reasons to retry and this could become much more complex. Keep the retry
    simple for now and just sleep to prevent it from hogging CPUs.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fb5413d9239..a866db0c9c31 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	do {
 		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+		schedule_timeout_uninterruptible(1);
 	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
@@ -3034,8 +3035,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, vm_struct allocation failed",
 			real_size);
-		if (gfp_mask & __GFP_NOFAIL)
+		if (gfp_mask & __GFP_NOFAIL) {
+			schedule_timeout_uninterruptible(1);
 			goto again;
+		}
 		goto fail;
 	}
Uladzislau Rezki Oct. 25, 2021, 2:30 p.m. UTC | #21
>
> I would really prefer if this was not the main point of arguing here.
> Unless you feel strongly about msleep I would go with schedule_timeout
> here because this is a more widely used interface in the mm code and
> also because I feel that relying on the rounding behavior is just
> subtle. Here is what I have staged now.
>
I have a preference but do not have a strong opinion here. You can go
either way you want.

>
> Are there any other concerns you see with this or other patches in the
> series?
>
It would be better if you could send a new vX version because it is
hard to combine every "fold me" patch into one solid commit. One
comment below:

> ---
> commit c1a7e40e6b56fed5b9e716de7055b77ea29d89d0
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Wed Oct 20 10:12:45 2021 +0200
>
>     fold me "mm/vmalloc: add support for __GFP_NOFAIL"
>
>     Add a short sleep before retrying. 1 jiffy is a completely random
>     timeout. Ideally the retry would wait for an explicit event - e.g.
>     a change to the vmalloc space if the failure was caused by
>     space fragmentation or depletion. But there are multiple different
>     reasons to retry and this could become much more complex. Keep the retry
>     simple for now and just sleep to prevent it from hogging CPUs.
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 0fb5413d9239..a866db0c9c31 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>         do {
>                 ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>                         page_shift);
> +               schedule_timeout_uninterruptible(1);
>
We do not want to call schedule_timeout_uninterruptible(1) every time,
only when an error is detected.
Michal Hocko Oct. 25, 2021, 2:56 p.m. UTC | #22
On Mon 25-10-21 16:30:23, Uladzislau Rezki wrote:
> >
> > I would really prefer if this was not the main point of arguing here.
> > Unless you feel strongly about msleep I would go with schedule_timeout
> > here because this is a more widely used interface in the mm code and
> > also because I feel that relying on the rounding behavior is just
> > subtle. Here is what I have staged now.
> >
> I have a preference but do not have a strong opinion here. You can go
> either way you want.
> 
> >
> > Are there any other concerns you see with this or other patches in the
> > series?
> >
> it is better if you could send a new vX version because it is hard to
> combine every "folded"

Yeah, I plan to soon. I just wanted to sort out most things before
spamming with a new version.

> into one solid commit. One comment below:
> 
> > ---
> > commit c1a7e40e6b56fed5b9e716de7055b77ea29d89d0
> > Author: Michal Hocko <mhocko@suse.com>
> > Date:   Wed Oct 20 10:12:45 2021 +0200
> >
> >     fold me "mm/vmalloc: add support for __GFP_NOFAIL"
> >
> >     Add a short sleep before retrying. 1 jiffy is a completely random
> >     timeout. Ideally the retry would wait for an explicit event - e.g.
> >     a change to the vmalloc space if the failure was caused by
> >     space fragmentation or depletion. But there are multiple different
> >     reasons to retry and this could become much more complex. Keep the retry
> >     simple for now and just sleep to prevent it from hogging CPUs.
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 0fb5413d9239..a866db0c9c31 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >         do {
> >                 ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> >                         page_shift);
> > +               schedule_timeout_uninterruptible(1);
> >
> We do not want to call schedule_timeout_uninterruptible(1) every time,
> only when an error is detected.

Because I was obviously in a brainless mode when doing that one. Thanks
for pointing this out!
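[With the comment above folded in, the retry loop would sleep only after a failed attempt; a sketch of how the corrected hunk might read, mirroring the staged diff:]

```c
	do {
		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
				       page_shift);
		/* back off only when the mapping attempt actually failed */
		if ((gfp_mask & __GFP_NOFAIL) && (ret < 0))
			schedule_timeout_uninterruptible(1);
	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
```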
NeilBrown Oct. 25, 2021, 11:50 p.m. UTC | #23
On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > However I'm not 100% certain, and the behaviour might change in the
> > future.  So having one place (the definition of memalloc_retry_wait())
> > where we can change the sleeping behaviour if the alloc_page behaviour
> > changes, would be ideal.  Maybe memalloc_retry_wait() could take a
> > gfpflags arg.
> > 
> A sleep is required for __get_vm_area_node() because in case of lack
> of vmap space it will end up in a tight loop without sleeping, which is
> really bad.
> 
So vmalloc() has two failure modes.  alloc_page() failure and
__alloc_vmap_area() failure.  The caller cannot tell which...

Actually, they can.  If we pass __GFP_NOFAIL to vmalloc(), and it fails,
then it must have been __alloc_vmap_area() which failed.
What do we do in that case?
Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
finishes?
If we use the spinlock from that waitq in place of free_vmap_area_lock,
then the wakeup would be nearly free if no-one was waiting, and worth
while if someone was waiting.

Thanks,
NeilBrown
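[The waitq idea above might look roughly like this — purely a thought-experiment sketch; `vmap_purge_waitq`, the sequence counter, and the fallback timeout are invented for illustration:]

```c
/* Hypothetical: wake __GFP_NOFAIL retriers when lazy purging completes */
static DECLARE_WAIT_QUEUE_HEAD(vmap_purge_waitq);
static atomic_t vmap_purge_seq;

/* called at the end of __purge_vmap_area_lazy(): nearly free if nobody waits */
static void vmap_purge_done(void)
{
	atomic_inc(&vmap_purge_seq);
	wake_up_all(&vmap_purge_waitq);
}

/* called from the retry path: sleep until a purge happens, with a fallback
 * timeout in case no purge is ever triggered
 */
static void wait_for_vmap_purge(void)
{
	int seq = atomic_read(&vmap_purge_seq);

	wait_event_timeout(vmap_purge_waitq,
			   atomic_read(&vmap_purge_seq) != seq, HZ / 10);
}
```

This would make the retry actionable (woken by an actual free of vmap space) rather than a blind timed sleep.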
Michal Hocko Oct. 26, 2021, 7:16 a.m. UTC | #24
On Tue 26-10-21 10:50:21, Neil Brown wrote:
> On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > However I'm not 100% certain, and the behaviour might change in the
> > > future.  So having one place (the definition of memalloc_retry_wait())
> > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > changes, would be ideal.  Maybe memalloc_retry_wait() could take a
> > > gfpflags arg.
> > > 
> > A sleep is required for __get_vm_area_node() because in case of lack
> > of vmap space it will end up in a tight loop without sleeping, which is
> > really bad.
> > 
> So vmalloc() has two failure modes.  alloc_page() failure and
> __alloc_vmap_area() failure.  The caller cannot tell which...
> 
> Actually, they can.  If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> then it must have been __alloc_vmap_area() which failed.
> What do we do in that case?
> Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> finishes?
> If we use the spinlock from that waitq in place of free_vmap_area_lock,
> then the wakeup would be nearly free if no-one was waiting, and worth
> while if someone was waiting.

Is this really required to be part of the initial support?
NeilBrown Oct. 26, 2021, 10:24 a.m. UTC | #25
On Tue, 26 Oct 2021, Michal Hocko wrote:
> On Tue 26-10-21 10:50:21, Neil Brown wrote:
> > On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > > However I'm not 100% certain, and the behaviour might change in the
> > > > future.  So having one place (the definition of memalloc_retry_wait())
> > > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > > changes, would be ideal.  Maybe memalloc_retry_wait() could take a
> > > > gfpflags arg.
> > > > 
> > > A sleep is required for __get_vm_area_node() because in case of lack
> > > of vmap space it will end up in a tight loop without sleeping, which is
> > > really bad.
> > > 
> > So vmalloc() has two failure modes.  alloc_page() failure and
> > __alloc_vmap_area() failure.  The caller cannot tell which...
> > 
> > Actually, they can.  If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> > then it must have been __alloc_vmap_area() which failed.
> > What do we do in that case?
> > Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> > finishes?
> > If we use the spinlock from that waitq in place of free_vmap_area_lock,
> > then the wakeup would be nearly free if no-one was waiting, and worth
> > while if someone was waiting.
> 
> Is this really required to be part of the initial support?

No.... I was just thinking out-loud.

NeilBrown
Uladzislau Rezki Oct. 26, 2021, 2:25 p.m. UTC | #26
On Tue, Oct 26, 2021 at 12:24 PM NeilBrown <neilb@suse.de> wrote:
>
> On Tue, 26 Oct 2021, Michal Hocko wrote:
> > On Tue 26-10-21 10:50:21, Neil Brown wrote:
> > > On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > > > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > > > However I'm not 100% certain, and the behaviour might change in the
> > > > > future.  So having one place (the definition of memalloc_retry_wait())
> > > > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > > > changes, would be ideal.  Maybe memalloc_retry_wait() could take a
> > > > > gfpflags arg.
> > > > >
> > > > A sleep is required for __get_vm_area_node() because in case of lack
> > > > of vmap space it will end up in a tight loop without sleeping, which is
> > > > really bad.
> > > >
> > > So vmalloc() has two failure modes.  alloc_page() failure and
> > > __alloc_vmap_area() failure.  The caller cannot tell which...
> > >
> > > Actually, they can.  If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> > > then it must have been __alloc_vmap_area() which failed.
> > > What do we do in that case?
> > > Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> > > finishes?
> > > If we use the spinlock from that waitq in place of free_vmap_area_lock,
> > > then the wakeup would be nearly free if no-one was waiting, and worth
> > > while if someone was waiting.
> >
> > Is this really required to be part of the initial support?
>
> No.... I was just thinking out-loud.
>
alloc_vmap_area() has a retry path: basically, if it fails, the code
will try to "purge" areas and repeat one more time. So we do not need
to purge anywhere else.
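[For reference, the purge-and-retry logic being referred to looks roughly like this — a simplified sketch of the existing alloc_vmap_area() flow, with locking and error reporting trimmed:]

```c
	bool purged = false;

retry:
	addr = __alloc_vmap_area(size, align, vstart, vend);
	if (unlikely(addr == vend)) {
		if (!purged) {
			/* drain lazily freed areas and try one more time */
			purge_vmap_area_lazy();
			purged = true;
			goto retry;
		}
		/* still no vmap space within [vstart:vend]: give up */
		return ERR_PTR(-EBUSY);
	}
```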
Michal Hocko Oct. 26, 2021, 2:43 p.m. UTC | #27
On Tue 26-10-21 16:25:07, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 12:24 PM NeilBrown <neilb@suse.de> wrote:
> >
> > On Tue, 26 Oct 2021, Michal Hocko wrote:
> > > On Tue 26-10-21 10:50:21, Neil Brown wrote:
> > > > On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > > > > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > > > > However I'm not 100% certain, and the behaviour might change in the
> > > > > > future.  So having one place (the definition of memalloc_retry_wait())
> > > > > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > > > > changes, would be ideal.  Maybe memalloc_retry_wait() could take a
> > > > > > gfpflags arg.
> > > > > >
> > > > > A sleep is required for __get_vm_area_node() because in case of lack
> > > > > of vmap space it will end up in a tight loop without sleeping, which is
> > > > > really bad.
> > > > >
> > > > So vmalloc() has two failure modes.  alloc_page() failure and
> > > > __alloc_vmap_area() failure.  The caller cannot tell which...
> > > >
> > > > Actually, they can.  If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> > > > then it must have been __alloc_vmap_area() which failed.
> > > > What do we do in that case?
> > > > Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> > > > finishes?
> > > > If we use the spinlock from that waitq in place of free_vmap_area_lock,
> > > > then the wakeup would be nearly free if no-one was waiting, and worth
> > > > while if someone was waiting.
> > >
> > > Is this really required to be part of the initial support?
> >
> > No.... I was just thinking out-loud.
> >
> alloc_vmap_area() has a retry path: basically, if it fails, the code
> will try to "purge" areas and repeat one more time. So we do not need
> to purge anywhere else.

I think that Neil was not concerned about the need for purging something
but rather a waiting event the retry loop could hook into. So that the
sleep wouldn't have to be a random timeout but something that is
actually actionable - like somebody freeing an area.
Uladzislau Rezki Oct. 26, 2021, 3:40 p.m. UTC | #28
> > > > Is this really required to be part of the initial support?
> > >
> > > No.... I was just thinking out-loud.
> > >
> > alloc_vmap_area() has a retry path: basically, if it fails, the code
> > will try to "purge" areas and repeat one more time. So we do not need
> > to purge anywhere else.
>
> I think that Neil was not concerned about the need for purging something
> but rather a waiting event the retry loop could hook into. So that the
> sleep wouldn't have to be a random timeout but something that is
> actually actionable - like somebody freeing an area.
>
I see this point. But sometimes it is not as straightforward as it
could be. If we lack vmap space within a specific range, it might not
be about reclaiming (nobody frees to that area and there are no
outstanding areas), thus we can do nothing.
diff mbox series

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7455c89598d3..3a5a178295d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2941,8 +2941,10 @@  static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
 		flags = memalloc_noio_save();
 
-	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+	do {
+		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
 		memalloc_nofs_restore(flags);
@@ -3032,6 +3034,8 @@  void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, vm_struct allocation failed",
 			real_size);
+		if (gfp_mask && __GFP_NOFAIL)
+			goto again;
 		goto fail;
 	}