
[1/2] mm: Add memalloc_nowait_{save,restore}

Message ID 20240812090525.80299-2-laoar.shao@gmail.com (mailing list archive)
State New
Series mm: Add readahead support for IOCB_NOWAIT

Commit Message

Yafang Shao Aug. 12, 2024, 9:05 a.m. UTC
The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
this, let's add two helper functions, memalloc_nowait_{save,restore}, which
will be useful in scenarios where we want to avoid waiting for memory
reclamation.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/sched/mm.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

Comments

Christoph Hellwig Aug. 12, 2024, 11:37 a.m. UTC | #1
On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> will be useful in scenarios where we want to avoid waiting for memory
> reclamation.

No, forcing nowait on callee contexts is just asking for trouble.
Unlike NOIO or NOFS, this is incompatible with NOFAIL allocations
and thus will lead to kernel crashes.
Yafang Shao Aug. 12, 2024, 12:59 p.m. UTC | #2
On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > will be useful in scenarios where we want to avoid waiting for memory
> > reclamation.
>
> No, forcing nowait on callee contets is just asking for trouble.
> Unlike NOIO or NOFS this is incompatible with NOFAIL allocations

I don’t see any incompatibility in __alloc_pages_slowpath(). The
~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
performed, but it doesn’t prevent the allocation of pages from
ALLOC_MIN_RESERVE, correct?

> and thus will lead to kernel crashes.

Could you please explain in detail where this might lead to kernel crashes?

--
Regards

Yafang
Christoph Hellwig Aug. 12, 2024, 1:21 p.m. UTC | #3
On Mon, Aug 12, 2024 at 08:59:53PM +0800, Yafang Shao wrote:
> 
> I don’t see any incompatibility in __alloc_pages_slowpath(). The
> ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> performed, but it doesn’t prevent the allocation of pages from
> ALLOC_MIN_RESERVE, correct?
> 
> > and thus will lead to kernel crashes.
> 
> Could you please explain in detail where this might lead to kernel crashes?

Sorry, I misread your patch as doing what your subject says.
A nestable noreclaim is probably fine, but please name it that way,
as memalloc_nowait_{save,restore} implies a context version
of GFP_NOWAIT.
Kent Overstreet Aug. 12, 2024, 4:48 p.m. UTC | #4
On Mon, Aug 12, 2024 at 04:37:58AM GMT, Christoph Hellwig wrote:
> On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > will be useful in scenarios where we want to avoid waiting for memory
> > reclamation.
> 
> No, forcing nowait on callee contets is just asking for trouble.
> Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> and thus will lead to kernel crashes.
 
No different from passing GFP_NOWAIT to mempool_alloc(), and we're
trying to get away from passing gfp flags directly for multiple reasons,
so we need it.
Yafang Shao Aug. 13, 2024, 2:09 a.m. UTC | #5
On Mon, Aug 12, 2024 at 9:21 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Aug 12, 2024 at 08:59:53PM +0800, Yafang Shao wrote:
> >
> > I don’t see any incompatibility in __alloc_pages_slowpath(). The
> > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> > performed, but it doesn’t prevent the allocation of pages from
> > ALLOC_MIN_RESERVE, correct?
> >
> > > and thus will lead to kernel crashes.
> >
> > Could you please explain in detail where this might lead to kernel crashes?
>
> Sorry, I misread your patch as doing what your subject says.
> A nestable noreclaim is probably fine, but please name it that way,
> as memalloc_nowait_{save,restore} implies a context version
> of GFP_NOWAIT.

There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:

  memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.

That is why I named it memalloc_nowait_{save,restore}. GFP_NOWAIT has
the same meaning as ~__GFP_DIRECT_RECLAIM:

  %GFP_NOWAIT is for kernel allocations that should not stall for direct
  reclaim, start physical IO or use any filesystem callback.
Dave Chinner Aug. 14, 2024, 12:28 a.m. UTC | #6
On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> will be useful in scenarios where we want to avoid waiting for memory
> reclamation.

Readahead already uses this context:

static inline gfp_t readahead_gfp_mask(struct address_space *x)
{
        return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
}

and __GFP_NORETRY means minimal direct reclaim should be performed.
Most filesystems already have GFP_NOFS context from
mapping_gfp_mask(), so how much difference does completely avoiding
direct reclaim actually make under memory pressure?

i.e. doing some direct reclaim without blocking when under memory
pressure might actually give better performance than skipping direct
reclaim and aborting readahead altogether....

This really, really needs some numbers (both throughput and IO
latency histograms) to go with it because we have no evidence either
way to determine what is the best approach here.

-Dave.
Yafang Shao Aug. 14, 2024, 2:19 a.m. UTC | #7
On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > will be useful in scenarios where we want to avoid waiting for memory
> > reclamation.
>
> Readahead already uses this context:
>
> static inline gfp_t readahead_gfp_mask(struct address_space *x)
> {
>         return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> }
>
> and __GFP_NORETRY means minimal direct reclaim should be performed.
> Most filesystems already have GFP_NOFS context from
> mapping_gfp_mask(), so how much difference does completely avoiding
> direct reclaim actually make under memory pressure?

Besides __GFP_NOFS, ~__GFP_DIRECT_RECLAIM also implies
__GFP_NOIO. If we don't set __GFP_NOIO, the readahead can wait for IO,
right?

>
> i.e. doing some direct reclaim without blocking when under memory
> pressure might actually give better performance than skipping direct
> reclaim and aborting readahead altogether....
>
> This really, really needs some numbers (both throughput and IO
> latency histograms) to go with it because we have no evidence either
> way to determine what is the best approach here.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
Christoph Hellwig Aug. 14, 2024, 5:24 a.m. UTC | #8
On Mon, Aug 12, 2024 at 12:48:26PM -0400, Kent Overstreet wrote:
> > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> > and thus will lead to kernel crashes.
>  
>  No different from passing GFP_NOWAIT to mempoool_alloc(), and we're
>  trying to get away from passing gfp flags directly for multiple reasons
>  so we need it.

It is very different.  Passing GFP_NOWAIT to mempool_alloc affects
the piece of code calling mempool_alloc.  Setting a GFP_NOWAIT context
affects all called code, i.e. all block drivers under the file system,
and possibly another file system under the loop driver. And none of
them expect the consequences of that.
Christoph Hellwig Aug. 14, 2024, 5:27 a.m. UTC | #9
> There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
> 
>   memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.

.. and those are horribly misnamed :(

If we can't even keep our APIs consistently named, who is supposed
to understand all this?
Dave Chinner Aug. 14, 2024, 5:42 a.m. UTC | #10
On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > will be useful in scenarios where we want to avoid waiting for memory
> > > reclamation.
> >
> > Readahead already uses this context:
> >
> > static inline gfp_t readahead_gfp_mask(struct address_space *x)
> > {
> >         return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> > }
> >
> > and __GFP_NORETRY means minimal direct reclaim should be performed.
> > Most filesystems already have GFP_NOFS context from
> > mapping_gfp_mask(), so how much difference does completely avoiding
> > direct reclaim actually make under memory pressure?
> 
> Besides the __GFP_NOFS , ~__GFP_DIRECT_RECLAIM also implies
> __GPF_NOIO. If we don't set __GPF_NOIO, the readahead can wait for IO,
> right?

There's a *lot* more difference between __GFP_NORETRY and
__GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to
describe to me what the differences are; What I'm asking you is this:

> > i.e. doing some direct reclaim without blocking when under memory
> > pressure might actually give better performance than skipping direct
> > reclaim and aborting readahead altogether....
> >
> > This really, really needs some numbers (both throughput and IO
> > latency histograms) to go with it because we have no evidence either
> > way to determine what is the best approach here.

Put simply: does the existing readahead mechanism give better results
than the proposed one, and if so, why wouldn't we just reenable
readahead unconditionally instead of making it behave differently
for this specific case?

Cheers,

Dave.
Yafang Shao Aug. 14, 2024, 7:32 a.m. UTC | #11
On Wed, Aug 14, 2024 at 1:42 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > reclamation.
> > >
> > > Readahead already uses this context:
> > >
> > > static inline gfp_t readahead_gfp_mask(struct address_space *x)
> > > {
> > >         return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> > > }
> > >
> > > and __GFP_NORETRY means minimal direct reclaim should be performed.
> > > Most filesystems already have GFP_NOFS context from
> > > mapping_gfp_mask(), so how much difference does completely avoiding
> > > direct reclaim actually make under memory pressure?
> >
> > Besides the __GFP_NOFS , ~__GFP_DIRECT_RECLAIM also implies
> > __GPF_NOIO. If we don't set __GPF_NOIO, the readahead can wait for IO,
> > right?
>
> There's a *lot* more difference between __GFP_NORETRY and
> __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to
> describe to me what the differences are; What I'm asking you is this:
>
> > > i.e. doing some direct reclaim without blocking when under memory
> > > pressure might actually give better performance than skipping direct
> > > reclaim and aborting readahead altogether....
> > >
> > > This really, really needs some numbers (both throughput and IO
> > > latency histograms) to go with it because we have no evidence either
> > > way to determine what is the best approach here.
>
> Put simply: does the existing readahead mechanism give better results
> than the proposed one, and if so, why wouldn't we just reenable
> readahead unconditionally instead of making it behave differently
> for this specific case?

Are you suggesting we compare the following change with the current proposal?

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..ced74b1b350d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3455,7 +3455,6 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
        if (flags & RWF_NOWAIT) {
                if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
                        return -EOPNOTSUPP;
-               kiocb_flags |= IOCB_NOIO;
        }
        if (flags & RWF_ATOMIC) {
                if (rw_type != WRITE)

Doesn't unconditional readahead break the semantics of RWF_NOWAIT,
which is supposed to avoid waiting for I/O? For example, it might
trigger a pageout for a dirty page.

--
Regards

Yafang
Yafang Shao Aug. 14, 2024, 7:33 a.m. UTC | #12
On Wed, Aug 14, 2024 at 1:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> > There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
> >
> >   memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.
>
> .. and those are horrible misnamed :(

What about renaming it to memalloc_memalloc_save?

>
> If we can't even keep our APIs consistently name, who is supposed
> to understand all this?
>
Michal Hocko Aug. 14, 2024, 7:42 a.m. UTC | #13
On Mon 12-08-24 20:59:53, Yafang Shao wrote:
> On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > will be useful in scenarios where we want to avoid waiting for memory
> > > reclamation.
> >
> > No, forcing nowait on callee contets is just asking for trouble.
> > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> 
> I don’t see any incompatibility in __alloc_pages_slowpath(). The
> ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> performed, but it doesn’t prevent the allocation of pages from
> ALLOC_MIN_RESERVE, correct?

Right, but this means that any potential nested GFP_NOFAIL allocation
within the scope essentially becomes a busy loop. Not to mention it
will BUG_ON, as non-sleeping GFP_NOFAIL allocations are unsupported.
I believe this is what Christoph had in mind. I am really surprised
that we even have PF_MEMALLOC_NORECLAIM in the first place!
Unsurprisingly, this was merged without any review by the MM community :/
Yafang Shao Aug. 14, 2024, 8:12 a.m. UTC | #14
On Wed, Aug 14, 2024 at 3:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 12-08-24 20:59:53, Yafang Shao wrote:
> > On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > reclamation.
> > >
> > > No, forcing nowait on callee contets is just asking for trouble.
> > > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> >
> > I don’t see any incompatibility in __alloc_pages_slowpath(). The
> > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> > performed, but it doesn’t prevent the allocation of pages from
> > ALLOC_MIN_RESERVE, correct?
>
> Right but this means that you just made any potential nested allocation
> within the scope that is GFP_NOFAIL a busy loop essentially. Not to
> mention it BUG_ON as non-sleeping GFP_NOFAIL allocations are
> unsupported. I believe this is what Christoph had in mind.

If that's the case, I believe we should at least consider adding the
following code change to the kernel:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..89411ee23c7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4168,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
        unsigned int zonelist_iter_cookie;
        int reserve_flags;

+       WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL && !(gfp_mask & __GFP_DIRECT_RECLAIM));
 restart:
        compaction_retries = 0;

        no_progress_loops = 0;


> I am really
> surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!

There are use cases for it.

> Unsurprisingly this was merged without any review by the MM community :/
> --
> Michal Hocko
> SUSE Labs

--
Regards
Yafang
Michal Hocko Aug. 14, 2024, 12:43 p.m. UTC | #15
On Wed 14-08-24 16:12:27, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 3:42 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 12-08-24 20:59:53, Yafang Shao wrote:
> > > On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > > reclamation.
> > > >
> > > > No, forcing nowait on callee contets is just asking for trouble.
> > > > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> > >
> > > I don’t see any incompatibility in __alloc_pages_slowpath(). The
> > > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> > > performed, but it doesn’t prevent the allocation of pages from
> > > ALLOC_MIN_RESERVE, correct?
> >
> > Right but this means that you just made any potential nested allocation
> > within the scope that is GFP_NOFAIL a busy loop essentially. Not to
> > mention it BUG_ON as non-sleeping GFP_NOFAIL allocations are
> > unsupported. I believe this is what Christoph had in mind.
> 
> If that's the case, I believe we should at least consider adding the
> following code change to the kernel:

We already do have that
                /*
                 * All existing users of the __GFP_NOFAIL are blockable, so warn
                 * of any new users that actually require GFP_NOWAIT
                 */
                if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
                        goto fail;

But Barry has patches to turn that into BUG because failing NOFAIL
allocations is not cool and cause unexpected failures. Have a look at
https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/

> > I am really
> > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
> 
> There's use cases for it.

Right, but there are certain constraints we need to worry about to
have maintainable code. Scoped allocation constraints are really a good
feature when they have well-defined semantics, e.g. NOFS, NOIO or
NOMEMALLOC (although the last is more of a self-inflicted injury, exactly
because PF_MEMALLOC had a "use case"). NOWAIT scope semantics might seem
a good feature, but they fall apart on nested NOFAIL allocations! So the
flag is usable _only_ if you fully control the whole scoped context. Good
luck with that long term! This is fragile, hard to review and even harder
to keep working properly. The flag would have been Nacked on that ground.
But nobody asked...
Dave Chinner Aug. 15, 2024, 2:54 a.m. UTC | #16
On Wed, Aug 14, 2024 at 03:32:26PM +0800, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 1:42 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote:
> > > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > > reclamation.
> > > >
> > > > Readahead already uses this context:
> > > >
> > > > static inline gfp_t readahead_gfp_mask(struct address_space *x)
> > > > {
> > > >         return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> > > > }
> > > >
> > > > and __GFP_NORETRY means minimal direct reclaim should be performed.
> > > > Most filesystems already have GFP_NOFS context from
> > > > mapping_gfp_mask(), so how much difference does completely avoiding
> > > > direct reclaim actually make under memory pressure?
> > >
> > > Besides the __GFP_NOFS , ~__GFP_DIRECT_RECLAIM also implies
> > > __GPF_NOIO. If we don't set __GPF_NOIO, the readahead can wait for IO,
> > > right?
> >
> > There's a *lot* more difference between __GFP_NORETRY and
> > __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to
> > describe to me what the differences are; What I'm asking you is this:
> >
> > > > i.e. doing some direct reclaim without blocking when under memory
> > > > pressure might actually give better performance than skipping direct
> > > > reclaim and aborting readahead altogether....
> > > >
> > > > This really, really needs some numbers (both throughput and IO
> > > > latency histograms) to go with it because we have no evidence either
> > > > way to determine what is the best approach here.
> >
> > Put simply: does the existing readahead mechanism give better results
> > than the proposed one, and if so, why wouldn't we just reenable
> > readahead unconditionally instead of making it behave differently
> > for this specific case?
> 
> Are you suggesting we compare the following change with the current proposal?
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..ced74b1b350d 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3455,7 +3455,6 @@ static inline int kiocb_set_rw_flags(struct
> kiocb *ki, rwf_t flags,
>         if (flags & RWF_NOWAIT) {
>                 if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
>                         return -EOPNOTSUPP;
> -               kiocb_flags |= IOCB_NOIO;
>         }
>         if (flags & RWF_ATOMIC) {
>                 if (rw_type != WRITE)

Yes.

> Doesn't unconditional readahead break the semantics of RWF_NOWAIT,
> which is supposed to avoid waiting for I/O? For example, it might
> trigger a pageout for a dirty page.

Yes, but only for *some filesystems* in *some configurations*.
Readahead allocation behaviour is specifically controlled by the gfp
mask set on the mapping by the filesystem at inode instantiation
time. i.e. via a call to mapping_set_gfp_mask().

XFS, for one, always clears __GFP_FS from this mask, and several
other filesystems set it to GFP_NOFS. Filesystems that do this will
not do pageout for a dirty page during memory allocation.

Further, memory reclaim can not write dirty pages to a filesystem
without a ->writepage implementation. ->writepage is almost
completely gone - neither ext4, btrfs or XFS have a ->writepage
implementation anymore - with f2fs being the only "major" filesystem
with a ->writepage implementation remaining.

IOWs, for most readahead cases right now, direct memory reclaim will
not issue writeback IO on dirty cached file pages and in the near
future that will change to -never-.

That means the only IO that direct reclaim will be able to do is for
swapping and compaction. Both of these can be prevented simply by
setting a GFP_NOIO allocation context. IOWs, in the not-too-distant
future we won't have to turn direct reclaim off to prevent IO being
issued from, and blocked on, direct reclaim during readahead - GFP_NOIO
context will be all that is necessary for IOCB_NOWAIT readahead.

That's why I'm asking if just doing readahead as it stands from
RWF_NOWAIT causes any obvious problems. I think we really only need
GFP_NOIO | __GFP_NORETRY allocation context for NOWAIT readahead IO,
and that's something we already have a context API for.

-Dave.
Yafang Shao Aug. 15, 2024, 3:26 a.m. UTC | #17
On Wed, Aug 14, 2024 at 8:43 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 14-08-24 16:12:27, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 3:42 PM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 12-08-24 20:59:53, Yafang Shao wrote:
> > > > On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > > >
> > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > > > reclamation.
> > > > >
> > > > > No, forcing nowait on callee contets is just asking for trouble.
> > > > > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> > > >
> > > > I don’t see any incompatibility in __alloc_pages_slowpath(). The
> > > > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> > > > performed, but it doesn’t prevent the allocation of pages from
> > > > ALLOC_MIN_RESERVE, correct?
> > >
> > > Right but this means that you just made any potential nested allocation
> > > within the scope that is GFP_NOFAIL a busy loop essentially. Not to
> > > mention it BUG_ON as non-sleeping GFP_NOFAIL allocations are
> > > unsupported. I believe this is what Christoph had in mind.
> >
> > If that's the case, I believe we should at least consider adding the
> > following code change to the kernel:
>
> We already do have that
>                 /*
>                  * All existing users of the __GFP_NOFAIL are blockable, so warn
>                  * of any new users that actually require GFP_NOWAIT
>                  */
>                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
>                         goto fail;

I don't see a reason to place the `goto fail;` above the
`__alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);`
line. Since we've already woken up kswapd, it should be acceptable to
allocate memory from ALLOC_MIN_RESERVE temporarily. Why not consider
implementing the following changes instead?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..598d4df829cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4386,13 +4386,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
         * we always retry
         */
        if (gfp_mask & __GFP_NOFAIL) {
-               /*
-                * All existing users of the __GFP_NOFAIL are blockable, so warn
-                * of any new users that actually require GFP_NOWAIT
-                */
-               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-                       goto fail;
-
                /*
                 * PF_MEMALLOC request from this context is rather bizarre
                 * because we cannot reclaim anything and only can loop waiting
@@ -4419,6 +4412,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                if (page)
                        goto got_pg;

+               /*
+                * All existing users of the __GFP_NOFAIL are blockable, so warn
+                * of any new users that actually require GFP_NOWAIT
+                */
+               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) {
+                       goto fail;
+               }
+
                cond_resched();
                goto retry;
        }

>
> But Barry has patches to turn that into BUG because failing NOFAIL
> allocations is not cool and cause unexpected failures. Have a look at
> https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/
>
> > > I am really
> > > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
> >
> > There's use cases for it.
>
> Right but there are certain constrains that we need to worry about to
> have a maintainable code. Scope allocation contrains are really a good
> feature when that has a well defined semantic. E.g. NOFS, NOIO or
> NOMEMALLOC (although this is more self inflicted injury exactly because
> PF_MEMALLOC had a "use case"). NOWAIT scope semantic might seem a good
> feature but it falls appart on nested NOFAIL allocations! So the flag is
> usable _only_ if you fully control the whole scoped context. Good luck
> with that long term! This is fragile, hard to review and even harder to
> keep working properly. The flag would have been Nacked on that ground.
> But nobody asked...

It's already implemented, and complaints won't resolve the issue. How
about making the following change to provide a warning when this new
flag is used incorrectly?

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 4fbae0013166..5a1e1bcde347 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -267,9 +267,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
                 * Stronger flags before weaker flags:
                 * NORECLAIM implies NOIO, which in turn implies NOFS
                 */
-               if (pflags & PF_MEMALLOC_NORECLAIM)
+               if (pflags & PF_MEMALLOC_NORECLAIM) {
                        flags &= ~__GFP_DIRECT_RECLAIM;
-               else if (pflags & PF_MEMALLOC_NOIO)
+                       WARN_ON_ONCE_GFP(flags & __GFP_NOFAIL, flags);
+               } else if (pflags & PF_MEMALLOC_NOIO)
                        flags &= ~(__GFP_IO | __GFP_FS);
                else if (pflags & PF_MEMALLOC_NOFS)
                        flags &= ~__GFP_FS;

--
Regards


Yafang
Yafang Shao Aug. 15, 2024, 3:38 a.m. UTC | #18
On Thu, Aug 15, 2024 at 10:54 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Aug 14, 2024 at 03:32:26PM +0800, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 1:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote:
> > > > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > > > reclamation.
> > > > >
> > > > > Readahead already uses this context:
> > > > >
> > > > > static inline gfp_t readahead_gfp_mask(struct address_space *x)
> > > > > {
> > > > >         return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> > > > > }
> > > > >
> > > > > and __GFP_NORETRY means minimal direct reclaim should be performed.
> > > > > Most filesystems already have GFP_NOFS context from
> > > > > mapping_gfp_mask(), so how much difference does completely avoiding
> > > > > direct reclaim actually make under memory pressure?
> > > >
> > > > Besides __GFP_NOFS, ~__GFP_DIRECT_RECLAIM also implies
> > > > __GFP_NOIO. If we don't set __GFP_NOIO, the readahead can wait for IO,
> > > > right?
> > >
> > > There's a *lot* more difference between __GFP_NORETRY and
> > > __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to
> > > describe to me what the differences are; What I'm asking you is this:
> > >
> > > > > i.e. doing some direct reclaim without blocking when under memory
> > > > > pressure might actually give better performance than skipping direct
> > > > > reclaim and aborting readahead altogether....
> > > > >
> > > > > This really, really needs some numbers (both throughput and IO
> > > > > latency histograms) to go with it because we have no evidence either
> > > > > way to determine what is the best approach here.
> > >
> > > Put simply: does the existing readahead mechanism give better results
> > > than the proposed one, and if so, why wouldn't we just reenable
> > > readahead unconditionally instead of making it behave differently
> > > for this specific case?
> >
> > Are you suggesting we compare the following change with the current proposal?
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index fd34b5755c0b..ced74b1b350d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3455,7 +3455,6 @@ static inline int kiocb_set_rw_flags(struct
> > kiocb *ki, rwf_t flags,
> >         if (flags & RWF_NOWAIT) {
> >                 if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
> >                         return -EOPNOTSUPP;
> > -               kiocb_flags |= IOCB_NOIO;
> >         }
> >         if (flags & RWF_ATOMIC) {
> >                 if (rw_type != WRITE)
>
> Yes.
>
> > Doesn't unconditional readahead break the semantics of RWF_NOWAIT,
> > which is supposed to avoid waiting for I/O? For example, it might
> > trigger a pageout for a dirty page.
>
> Yes, but only for *some filesystems* in *some configurations*.
> Readahead allocation behaviour is specifically controlled by the gfp
> mask set on the mapping by the filesystem at inode instantiation
> time. i.e. via a call to mapping_set_gfp_mask().
>
> XFS, for one, always clears __GFP_FS from this mask, and several
> other filesystems set it to GFP_NOFS. Filesystems that do this will
> not do pageout for a dirty page during memory allocation.
>
> Further, memory reclaim can not write dirty pages to a filesystem
> without a ->writepage implementation. ->writepage is almost
> completely gone - neither ext4, btrfs or XFS have a ->writepage
> implementation anymore - with f2fs being the only "major" filesystem
> with a ->writepage implementation remaining.
>
> IOWs, for most readahead cases right now, direct memory reclaim will
> not issue writeback IO on dirty cached file pages and in the near
> future that will change to -never-.
>
> That means the only IO that direct reclaim will be able to do is for
> swapping and compaction. Both of these can be prevented simply by
> setting a GFP_NOIO allocation context. IOWs, in the not-too-distant
> future we won't have to turn direct reclaim off to prevent IO from, or
> blocking in, direct reclaim during readahead - GFP_NOIO context
> will be all that is necessary for IOCB_NOWAIT readahead.
>
> That's why I'm asking if just doing readahead as it stands from
> RWF_NOWAIT causes any obvious problems. I think we really only need
> GFP_NOIO | __GFP_NORETRY allocation context for NOWAIT
> readahead IO, and that's something we already have a context API
> for.

Understood, thanks for your explanation.
So we need the changes below:

@@ -2526,8 +2528,12 @@ static int filemap_get_pages(struct kiocb
*iocb, size_t count,
        if (!folio_batch_count(fbatch)) {
                if (iocb->ki_flags & IOCB_NOIO)
                        return -EAGAIN;
+               if (iocb->ki_flags & IOCB_NOWAIT)
+                       flags = memalloc_noio_save();
                page_cache_sync_readahead(mapping, ra, filp, index,
                                last_index - index);
+               if (iocb->ki_flags & IOCB_NOWAIT)
+                       memalloc_noio_restore(flags);
                filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
        }
        if (!folio_batch_count(fbatch)) {

What data would you recommend collecting after implementing the above
change? Should we measure the latency of preadv2(2) under high memory
pressure? Although latency can vary, it seems we have no choice but to
use memalloc_noio_save instead of memalloc_nowait_save, as the MM
folks are not in favor of the latter.

--
Regards
Yafang
Michal Hocko Aug. 15, 2024, 6:22 a.m. UTC | #19
On Thu 15-08-24 11:26:09, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 8:43 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > If that's the case, I believe we should at least consider adding the
> > > following code change to the kernel:
> >
> > We already do have that
> >                 /*
> >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> >                  * of any new users that actually require GFP_NOWAIT
> >                  */
> >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> >                         goto fail;
> 
> I don't see a reason to place the `goto fail;` above the
> `__alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);`
> line. Since we've already woken up kswapd, it should be acceptable to
> allocate memory from ALLOC_MIN_RESERVE temporarily. Why not consider
> implementing the following changes instead?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ecf99190ea2..598d4df829cd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4386,13 +4386,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned
> int order,
>          * we always retry
>          */
>         if (gfp_mask & __GFP_NOFAIL) {
> -               /*
> -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> -                * of any new users that actually require GFP_NOWAIT
> -                */
> -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -                       goto fail;
> -
>                 /*
>                  * PF_MEMALLOC request from this context is rather bizarre
>                  * because we cannot reclaim anything and only can loop waiting
> @@ -4419,6 +4412,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned
> int order,
>                 if (page)
>                         goto got_pg;
> 
> +               /*
> +                * All existing users of the __GFP_NOFAIL are blockable, so warn
> +                * of any new users that actually require GFP_NOWAIT
> +                */
> +               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) {
> +                       goto fail;
> +               }
> +
>                 cond_resched();
>                 goto retry;
>         }

How does this solve anything? It will still eventually fail the NOFAIL
allocation. It might happen slightly later, but that doesn't change the
fact it will _fail_. I have referenced a discussion on why that is not
really desirable and why Barry wants it addressed. We have added that
WARN_ON_ONCE because we have assumed that people do understand that
NOFAIL without reclaim is just too much to ask. We were wrong; there was
one user in the kernel. That one was not too hard to find, because
you can _grep_ for those flags. Scoped APIs make that impossible!

> > But Barry has patches to turn that into BUG because failing NOFAIL
> > allocations is not cool and cause unexpected failures. Have a look at
> > https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/
> >
> > > > I am really
> > > > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
> > >
> > > There's use cases for it.
> >
> > Right but there are certain constraints that we need to worry about to
> > have maintainable code. Scoped allocation constraints are really a good
> > feature when they have well-defined semantics. E.g. NOFS, NOIO or
> > NOMEMALLOC (although this is more a self-inflicted injury exactly because
> > PF_MEMALLOC had a "use case"). NOWAIT scope semantics might seem a good
> > feature but they fall apart on nested NOFAIL allocations! So the flag is
> > usable _only_ if you fully control the whole scoped context. Good luck
> > with that long term! This is fragile, hard to review and even harder to
> > keep working properly. The flag would have been Nacked on that ground.
> > But nobody asked...
> 
> It's already implemented, and complaints won't resolve the issue. How
> about making the following change to provide a warning when this new
> flag is used incorrectly?

How does this solve anything at all? It will warn you that your code is
incorrect, and then what? Are you going to remove GFP_NOFAIL from the
nested allocation side? NOFAIL is a strong requirement and it is not
used willy-nilly. There must have been a very good reason to use it. Are
you going to drop the scope?

Let me repeat: nested NOFAIL allocations will BUG_ON on failure. Your
warning might catch those users slightly earlier when the allocation
succeeds, but that doesn't make those crashes impossible.
PF_MEMALLOC_NORECLAIM might already be merged, but this concept is
inherently fragile and should be reverted rather than worked around in
impossible ways. And it should be done before this spreads outside of
bcachefs.
Yafang Shao Aug. 15, 2024, 6:32 a.m. UTC | #20
On Thu, Aug 15, 2024 at 2:22 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 15-08-24 11:26:09, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 8:43 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > If that's the case, I believe we should at least consider adding the
> > > > following code change to the kernel:
> > >
> > > We already do have that
> > >                 /*
> > >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> > >                  * of any new users that actually require GFP_NOWAIT
> > >                  */
> > >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > >                         goto fail;
> >
> > I don't see a reason to place the `goto fail;` above the
> > `__alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);`
> > line. Since we've already woken up kswapd, it should be acceptable to
> > allocate memory from ALLOC_MIN_RESERVE temporarily. Why not consider
> > implementing the following changes instead?
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9ecf99190ea2..598d4df829cd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4386,13 +4386,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned
> > int order,
> >          * we always retry
> >          */
> >         if (gfp_mask & __GFP_NOFAIL) {
> > -               /*
> > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -                * of any new users that actually require GFP_NOWAIT
> > -                */
> > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -                       goto fail;
> > -
> >                 /*
> >                  * PF_MEMALLOC request from this context is rather bizarre
> >                  * because we cannot reclaim anything and only can loop waiting
> > @@ -4419,6 +4412,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned
> > int order,
> >                 if (page)
> >                         goto got_pg;
> >
> > +               /*
> > +                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > +                * of any new users that actually require GFP_NOWAIT
> > +                */
> > +               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) {
> > +                       goto fail;
> > +               }
> > +
> >                 cond_resched();
> >                 goto retry;
> >         }
>
> How does this solve anything? It will still eventually fail the NOFAIL
> allocation. It might happen slightly later, but that doesn't change the
> fact it will _fail_. I have referenced a discussion on why that is not
> really desirable and why Barry wants it addressed. We have added that
> WARN_ON_ONCE because we have assumed that people do understand that
> NOFAIL without reclaim is just too much to ask. We were wrong; there was
> one user in the kernel. That one was not too hard to find, because
> you can _grep_ for those flags. Scoped APIs make that impossible!
>
> > > But Barry has patches to turn that into BUG because failing NOFAIL
> > > allocations is not cool and cause unexpected failures. Have a look at
> > > https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/
> > >
> > > > > I am really
> > > > > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
> > > >
> > > > There's use cases for it.
> > >
> > > Right but there are certain constraints that we need to worry about to
> > > have maintainable code. Scoped allocation constraints are really a good
> > > feature when they have well-defined semantics. E.g. NOFS, NOIO or
> > > NOMEMALLOC (although this is more a self-inflicted injury exactly because
> > > PF_MEMALLOC had a "use case"). NOWAIT scope semantics might seem a good
> > > feature but they fall apart on nested NOFAIL allocations! So the flag is
> > > usable _only_ if you fully control the whole scoped context. Good luck
> > > with that long term! This is fragile, hard to review and even harder to
> > > keep working properly. The flag would have been Nacked on that ground.
> > > But nobody asked...
> >
> > It's already implemented, and complaints won't resolve the issue. How
> > about making the following change to provide a warning when this new
> > flag is used incorrectly?
>
> How does this solve anything at all? It will warn you that your code is
> incorrect, and then what? Are you going to remove GFP_NOFAIL from the
> nested allocation side? NOFAIL is a strong requirement and it is not
> used willy-nilly. There must have been a very good reason to use it. Are
> you going to drop the scope?
>
> Let me repeat: nested NOFAIL allocations will BUG_ON on failure.

The key question is whether it actually fails after we've already
woken up kswapd. Have we encountered real issues, or is this just
based on code review? Instead of allowing it to fail, why not allocate
from the reserve memory to prevent this from happening?


> Your
> warning might catch those users slightly earlier when allocation succeed
> but that doesn't make those crashes impossible. PF_MEMALLOC_NORECLAIM
> might be already merged but this concept is inherently fragile and
> should be reverted rather than finding impossible ways around it. And it
> should be done before this spreads outside of bcachefs.
> --
> Michal Hocko
> SUSE Labs



--
Regards
Yafang
Michal Hocko Aug. 15, 2024, 6:51 a.m. UTC | #21
On Thu 15-08-24 14:32:10, Yafang Shao wrote:
> On Thu, Aug 15, 2024 at 2:22 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Let me repeat, nested NOFAIL allocations will BUG_ON on failure.
> 
> The key question is whether it actually fails after we've already
> woken up kswapd. Have we encountered real issues, or is this just
> based on code review?

Depleting memory to the level where even the min memory reserves are
insufficient is not a theoretical concern. OOMing the system is a real
thing!

> Instead of allowing it to fail, why not allocate
> from the reserve memory to prevent this from happening?

Because those memory reserves are shared and potentially required by
other, more important users which cannot make forward progress without
them. And even then, the reserves can get depleted, so the failure point
is just a matter of a specific memory consumption pattern. The failure
could be rarer, but that also makes it much harder to predict and test
for. Really, there is no way around non-sleeping GFP_NOFAIL: either you
disallow it or you create a busy loop inside the allocator. We have
chosen the first option because that is a saner model to support.
Vlastimil Babka Sept. 1, 2024, 8:24 p.m. UTC | #22
On 8/14/24 09:33, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 1:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> > There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
>> >
>> >   memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.
>>
>> .. and those are horrible misnamed :(

Yes I agree, sorry about that.

> What about renaming it to memalloc_memalloc_save ?

While it looks weird, it could indeed be better than the current name. It's
not obvious, so it should force the user to read the description.
memalloc_noreclaim_save() reads too obviously as "this disables reclaim",
which is misleading, as that's not the full story of PF_MEMALLOC.

>>
>> If we can't even keep our APIs consistently name, who is supposed
>> to understand all this?
>>
>
Kent Overstreet Sept. 1, 2024, 8:42 p.m. UTC | #23
On Sun, Sep 01, 2024 at 10:24:10PM GMT, Vlastimil Babka wrote:
> On 8/14/24 09:33, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 1:27 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> > There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
> >> >
> >> >   memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.
> >>
> >> .. and those are horrible misnamed :(
> 
> Yes I agree, sorry about that.
> 
> > What about renaming it to memalloc_memalloc_save ?
> 
> While it looks weird, it could indeed be better than the current name. It's
> not obvious, so it should force the user to read the description.
> memalloc_noreclaim_save() reads too obviously as "this disables reclaim",
> which is misleading, as that's not the full story of PF_MEMALLOC.

I was actually thinking about killing the helpers in favor of just
memalloc_flags_save() - I don't think the extra names get us anything.

Patch

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 91546493c43d..4fbae0013166 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -484,6 +484,36 @@  static inline void memalloc_pin_restore(unsigned int flags)
 	memalloc_flags_restore(flags);
 }
 
+/**
+ * memalloc_nowait_save - Marks implicit ~__GFP_DIRECT_RECLAIM allocation scope.
+ *
+ * This function marks the beginning of the ~__GFP_DIRECT_RECLAIM allocation
+ * scope. All further allocations will implicitly drop the __GFP_DIRECT_RECLAIM
+ * flag and so will not wait for direct memory reclaim. Use
+ * memalloc_nowait_restore to end the scope with the flags returned by this
+ * function.
+ *
+ * Context: This function is safe to be used from any context.
+ * Return: The saved flags to be passed to memalloc_nowait_restore.
+ */
+static inline unsigned int memalloc_nowait_save(void)
+{
+	return memalloc_flags_save(PF_MEMALLOC_NORECLAIM);
+}
+
+/**
+ * memalloc_nowait_restore - Ends the implicit ~__GFP_DIRECT_RECLAIM scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit ~__GFP_DIRECT_RECLAIM scope started by the
+ * memalloc_nowait_save function. Always make sure that the given flags value is
+ * the return value from the pairing memalloc_nowait_save call.
+ */
+static inline void memalloc_nowait_restore(unsigned int flags)
+{
+	memalloc_flags_restore(flags);
+}
+
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 /**