Message ID | 20240812090525.80299-2-laoar.shao@gmail.com (mailing list archive)
---|---
State | New
Series | mm: Add readahead support for IOCB_NOWAIT
On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> will be useful in scenarios where we want to avoid waiting for memory
> reclamation.

No, forcing nowait on callee contexts is just asking for trouble.
Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
and thus will lead to kernel crashes.
On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > will be useful in scenarios where we want to avoid waiting for memory
> > reclamation.
>
> No, forcing nowait on callee contexts is just asking for trouble.
> Unlike NOIO or NOFS this is incompatible with NOFAIL allocations

I don’t see any incompatibility in __alloc_pages_slowpath(). The
~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
performed, but it doesn’t prevent the allocation of pages from
ALLOC_MIN_RESERVE, correct?

> and thus will lead to kernel crashes.

Could you please explain in detail where this might lead to kernel crashes?

--
Regards
Yafang
On Mon, Aug 12, 2024 at 08:59:53PM +0800, Yafang Shao wrote: > > I don’t see any incompatibility in __alloc_pages_slowpath(). The > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not > performed, but it doesn’t prevent the allocation of pages from > ALLOC_MIN_RESERVE, correct? > > > and thus will lead to kernel crashes. > > Could you please explain in detail where this might lead to kernel crashes? Sorry, I misread your patch as doing what your subject says. A nestable noreclaim is probably fine, but please name it that way, as memalloc_nowait_{save,restore} implies a context version of GFP_NOWAIT.
On Mon, Aug 12, 2024 at 04:37:58AM GMT, Christoph Hellwig wrote:
> On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > will be useful in scenarios where we want to avoid waiting for memory
> > reclamation.
>
> No, forcing nowait on callee contexts is just asking for trouble.
> Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> and thus will lead to kernel crashes.

No different from passing GFP_NOWAIT to mempool_alloc(), and we're
trying to get away from passing gfp flags directly for multiple reasons,
so we need it.
On Mon, Aug 12, 2024 at 9:21 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Aug 12, 2024 at 08:59:53PM +0800, Yafang Shao wrote:
> >
> > I don’t see any incompatibility in __alloc_pages_slowpath(). The
> > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> > performed, but it doesn’t prevent the allocation of pages from
> > ALLOC_MIN_RESERVE, correct?
> >
> > > and thus will lead to kernel crashes.
> >
> > Could you please explain in detail where this might lead to kernel crashes?
>
> Sorry, I misread your patch as doing what your subject says.
> A nestable noreclaim is probably fine, but please name it that way,
> as memalloc_nowait_{save,restore} implies a context version
> of GFP_NOWAIT.

There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:

	memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.

That is why I named it memalloc_nowait_{save,restore}. GFP_NOWAIT has the
same meaning as ~__GFP_DIRECT_RECLAIM:

	%GFP_NOWAIT is for kernel allocations that should not stall for direct
	reclaim, start physical IO or use any filesystem callback.
On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> will be useful in scenarios where we want to avoid waiting for memory
> reclamation.

Readahead already uses this context:

static inline gfp_t readahead_gfp_mask(struct address_space *x)
{
	return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
}

and __GFP_NORETRY means minimal direct reclaim should be performed.
Most filesystems already have GFP_NOFS context from
mapping_gfp_mask(), so how much difference does completely avoiding
direct reclaim actually make under memory pressure?

i.e. doing some direct reclaim without blocking when under memory
pressure might actually give better performance than skipping direct
reclaim and aborting readahead altogether....

This really, really needs some numbers (both throughput and IO
latency histograms) to go with it because we have no evidence either
way to determine what is the best approach here.

-Dave.
On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > will be useful in scenarios where we want to avoid waiting for memory
> > reclamation.
>
> Readahead already uses this context:
>
> static inline gfp_t readahead_gfp_mask(struct address_space *x)
> {
> 	return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> }
>
> and __GFP_NORETRY means minimal direct reclaim should be performed.
> Most filesystems already have GFP_NOFS context from
> mapping_gfp_mask(), so how much difference does completely avoiding
> direct reclaim actually make under memory pressure?

Besides __GFP_NOFS, ~__GFP_DIRECT_RECLAIM also implies
__GFP_NOIO. If we don't set __GFP_NOIO, the readahead can wait for IO,
right?

> i.e. doing some direct reclaim without blocking when under memory
> pressure might actually give better performance than skipping direct
> reclaim and aborting readahead altogether....
>
> This really, really needs some numbers (both throughput and IO
> latency histograms) to go with it because we have no evidence either
> way to determine what is the best approach here.
>
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Mon, Aug 12, 2024 at 12:48:26PM -0400, Kent Overstreet wrote:
> > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> > and thus will lead to kernel crashes.
>
> No different from passing GFP_NOWAIT to mempool_alloc(), and we're
> trying to get away from passing gfp flags directly for multiple reasons
> so we need it.

It is very different. Passing GFP_NOWAIT to mempool_alloc affects the
piece of code calling mempool_alloc. Setting a GFP_NOWAIT context
affects all called code, i.e. all block drivers under the file system,
and possibly another file system under the loop driver. And none of
them expect the consequences of that.
> There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
>
> memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.

.. and those are horribly misnamed :(

If we can't even keep our APIs consistently named, who is supposed
to understand all this?
On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote: > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote: > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which > > > will be useful in scenarios where we want to avoid waiting for memory > > > reclamation. > > > > Readahead already uses this context: > > > > static inline gfp_t readahead_gfp_mask(struct address_space *x) > > { > > return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN; > > } > > > > and __GFP_NORETRY means minimal direct reclaim should be performed. > > Most filesystems already have GFP_NOFS context from > > mapping_gfp_mask(), so how much difference does completely avoiding > > direct reclaim actually make under memory pressure? > > Besides the __GFP_NOFS , ~__GFP_DIRECT_RECLAIM also implies > __GPF_NOIO. If we don't set __GPF_NOIO, the readahead can wait for IO, > right? There's a *lot* more difference between __GFP_NORETRY and __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to describe to me what the differences are; What I'm asking you is this: > > i.e. doing some direct reclaim without blocking when under memory > > pressure might actually give better performance than skipping direct > > reclaim and aborting readahead altogether.... > > > > This really, really needs some numbers (both throughput and IO > > latency histograms) to go with it because we have no evidence either > > way to determine what is the best approach here. Put simply: does the existing readahead mechanism give better results than the proposed one, and if so, why wouldn't we just reenable readahead unconditionally instead of making it behave differently for this specific case? Cheers, Dave.
On Wed, Aug 14, 2024 at 1:42 PM Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote: > > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote: > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which > > > > will be useful in scenarios where we want to avoid waiting for memory > > > > reclamation. > > > > > > Readahead already uses this context: > > > > > > static inline gfp_t readahead_gfp_mask(struct address_space *x) > > > { > > > return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN; > > > } > > > > > > and __GFP_NORETRY means minimal direct reclaim should be performed. > > > Most filesystems already have GFP_NOFS context from > > > mapping_gfp_mask(), so how much difference does completely avoiding > > > direct reclaim actually make under memory pressure? > > > > Besides the __GFP_NOFS , ~__GFP_DIRECT_RECLAIM also implies > > __GPF_NOIO. If we don't set __GPF_NOIO, the readahead can wait for IO, > > right? > > There's a *lot* more difference between __GFP_NORETRY and > __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to > describe to me what the differences are; What I'm asking you is this: > > > > i.e. doing some direct reclaim without blocking when under memory > > > pressure might actually give better performance than skipping direct > > > reclaim and aborting readahead altogether.... > > > > > > This really, really needs some numbers (both throughput and IO > > > latency histograms) to go with it because we have no evidence either > > > way to determine what is the best approach here. 
> Put simply: does the existing readahead mechanism give better results
> than the proposed one, and if so, why wouldn't we just reenable
> readahead unconditionally instead of making it behave differently
> for this specific case?

Are you suggesting we compare the following change with the current proposal?

diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd34b5755c0b..ced74b1b350d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3455,7 +3455,6 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
 	if (flags & RWF_NOWAIT) {
 		if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
 			return -EOPNOTSUPP;
-		kiocb_flags |= IOCB_NOIO;
 	}
 	if (flags & RWF_ATOMIC) {
 		if (rw_type != WRITE)

Doesn't unconditional readahead break the semantics of RWF_NOWAIT, which
is supposed to avoid waiting for I/O? For example, it might trigger a
pageout for a dirty page.

--
Regards
Yafang
On Wed, Aug 14, 2024 at 1:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> > There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
> >
> > memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.
>
> .. and those are horribly misnamed :(

What about renaming it to memalloc_memalloc_save?

> If we can't even keep our APIs consistently named, who is supposed
> to understand all this?
On Mon 12-08-24 20:59:53, Yafang Shao wrote:
> On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
> >
> > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > will be useful in scenarios where we want to avoid waiting for memory
> > > reclamation.
> >
> > No, forcing nowait on callee contexts is just asking for trouble.
> > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
>
> I don’t see any incompatibility in __alloc_pages_slowpath(). The
> ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> performed, but it doesn’t prevent the allocation of pages from
> ALLOC_MIN_RESERVE, correct?

Right, but this means that you just made any potential nested allocation
within the scope that is GFP_NOFAIL a busy loop essentially. Not to
mention it will hit a BUG_ON, as non-sleeping GFP_NOFAIL allocations are
unsupported. I believe this is what Christoph had in mind. I am really
surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
Unsurprisingly this was merged without any review by the MM community :/
On Wed, Aug 14, 2024 at 3:42 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 12-08-24 20:59:53, Yafang Shao wrote:
> > On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
> > >
> > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > reclamation.
> > >
> > > No, forcing nowait on callee contexts is just asking for trouble.
> > > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> >
> > I don’t see any incompatibility in __alloc_pages_slowpath(). The
> > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> > performed, but it doesn’t prevent the allocation of pages from
> > ALLOC_MIN_RESERVE, correct?
>
> Right, but this means that you just made any potential nested allocation
> within the scope that is GFP_NOFAIL a busy loop essentially. Not to
> mention it will hit a BUG_ON, as non-sleeping GFP_NOFAIL allocations are
> unsupported. I believe this is what Christoph had in mind.

If that's the case, I believe we should at least consider adding the
following code change to the kernel:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..89411ee23c7f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4168,6 +4168,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned int zonelist_iter_cookie;
 	int reserve_flags;

+	WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL && !(gfp_mask & __GFP_DIRECT_RECLAIM));
 restart:
 	compaction_retries = 0;
 	no_progress_loops = 0;

> I am really
> surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!

There are use cases for it.

> Unsurprisingly this was merged without any review by the MM community :/
> --
> Michal Hocko
> SUSE Labs

--
Regards
Yafang
On Wed 14-08-24 16:12:27, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 3:42 PM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 12-08-24 20:59:53, Yafang Shao wrote:
> > > On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote:
> > > >
> > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > > reclamation.
> > > >
> > > > No, forcing nowait on callee contexts is just asking for trouble.
> > > > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations
> > >
> > > I don’t see any incompatibility in __alloc_pages_slowpath(). The
> > > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not
> > > performed, but it doesn’t prevent the allocation of pages from
> > > ALLOC_MIN_RESERVE, correct?
> >
> > Right, but this means that you just made any potential nested allocation
> > within the scope that is GFP_NOFAIL a busy loop essentially. Not to
> > mention it will hit a BUG_ON, as non-sleeping GFP_NOFAIL allocations are
> > unsupported. I believe this is what Christoph had in mind.
>
> If that's the case, I believe we should at least consider adding the
> following code change to the kernel:

We already do have that:

	/*
	 * All existing users of the __GFP_NOFAIL are blockable, so warn
	 * of any new users that actually require GFP_NOWAIT
	 */
	if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
		goto fail;

But Barry has patches to turn that into a BUG because failing NOFAIL
allocations is not cool and causes unexpected failures. Have a look at
https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/

> >
> > I am really
> > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
>
> There's use cases for it.

Right, but there are certain constraints that we need to worry about to
have maintainable code. Scope allocation constraints are really a good
feature when they have a well defined semantic. E.g. NOFS, NOIO or
NOMEMALLOC (although this is more of a self-inflicted injury exactly
because PF_MEMALLOC had a "use case"). NOWAIT scope semantics might seem
a good feature but it falls apart on nested NOFAIL allocations! So the
flag is usable _only_ if you fully control the whole scoped context.
Good luck with that long term! This is fragile, hard to review and even
harder to keep working properly. The flag would have been Nacked on that
ground. But nobody asked...
On Wed, Aug 14, 2024 at 03:32:26PM +0800, Yafang Shao wrote: > On Wed, Aug 14, 2024 at 1:42 PM Dave Chinner <david@fromorbit.com> wrote: > > > > On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote: > > > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote: > > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc > > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement > > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which > > > > > will be useful in scenarios where we want to avoid waiting for memory > > > > > reclamation. > > > > > > > > Readahead already uses this context: > > > > > > > > static inline gfp_t readahead_gfp_mask(struct address_space *x) > > > > { > > > > return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN; > > > > } > > > > > > > > and __GFP_NORETRY means minimal direct reclaim should be performed. > > > > Most filesystems already have GFP_NOFS context from > > > > mapping_gfp_mask(), so how much difference does completely avoiding > > > > direct reclaim actually make under memory pressure? > > > > > > Besides the __GFP_NOFS , ~__GFP_DIRECT_RECLAIM also implies > > > __GPF_NOIO. If we don't set __GPF_NOIO, the readahead can wait for IO, > > > right? > > > > There's a *lot* more difference between __GFP_NORETRY and > > __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to > > describe to me what the differences are; What I'm asking you is this: > > > > > > i.e. doing some direct reclaim without blocking when under memory > > > > pressure might actually give better performance than skipping direct > > > > reclaim and aborting readahead altogether.... 
> > > > > > > > This really, really needs some numbers (both throughput and IO > > > > latency histograms) to go with it because we have no evidence either > > > > way to determine what is the best approach here. > > > > Put simply: does the existing readahead mechanism give better results > > than the proposed one, and if so, why wouldn't we just reenable > > readahead unconditionally instead of making it behave differently > > for this specific case? > > Are you suggesting we compare the following change with the current proposal? > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index fd34b5755c0b..ced74b1b350d 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -3455,7 +3455,6 @@ static inline int kiocb_set_rw_flags(struct > kiocb *ki, rwf_t flags, > if (flags & RWF_NOWAIT) { > if (!(ki->ki_filp->f_mode & FMODE_NOWAIT)) > return -EOPNOTSUPP; > - kiocb_flags |= IOCB_NOIO; > } > if (flags & RWF_ATOMIC) { > if (rw_type != WRITE) Yes. > Doesn't unconditional readahead break the semantics of RWF_NOWAIT, > which is supposed to avoid waiting for I/O? For example, it might > trigger a pageout for a dirty page. Yes, but only for *some filesystems* in *some configurations*. Readahead allocation behaviour is specifically controlled by the gfp mask set on the mapping by the filesystem at inode instantiation time. i.e. via a call to mapping_set_gfp_mask(). XFS, for one, always clears __GFP_FS from this mask, and several other filesystems set it to GFP_NOFS. Filesystems that do this will not do pageout for a dirty page during memory allocation. Further, memory reclaim can not write dirty pages to a filesystem without a ->writepage implementation. ->writepage is almost completely gone - neither ext4, btrfs or XFS have a ->writepage implementation anymore - with f2fs being the only "major" filesystem with a ->writepage implementation remaining. 
IOWs, for most readahead cases right now, direct memory reclaim will
not issue writeback IO on dirty cached file pages and in the near
future that will change to -never-.

That means the only IO that direct reclaim will be able to do is for
swapping and compaction. Both of these can be prevented simply by
setting a GFP_NOIO allocation context. IOWs, in the not-too-distant
future we won't have to turn direct reclaim off to prevent IO from
being issued and blocked on in direct reclaim during readahead -
GFP_NOIO context will be all that is necessary for IOCB_NOWAIT
readahead.

That's why I'm asking if just doing readahead as it stands from
RWF_NOWAIT causes any obvious problems. I think we really only need
GFP_NOIO | __GFP_NORETRY allocation context for NOWAIT readahead IO,
and that's something we already have a context API for.

-Dave.
On Wed, Aug 14, 2024 at 8:43 PM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 14-08-24 16:12:27, Yafang Shao wrote: > > On Wed, Aug 14, 2024 at 3:42 PM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 12-08-24 20:59:53, Yafang Shao wrote: > > > > On Mon, Aug 12, 2024 at 7:37 PM Christoph Hellwig <hch@infradead.org> wrote: > > > > > > > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote: > > > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc > > > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement > > > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which > > > > > > will be useful in scenarios where we want to avoid waiting for memory > > > > > > reclamation. > > > > > > > > > > No, forcing nowait on callee contets is just asking for trouble. > > > > > Unlike NOIO or NOFS this is incompatible with NOFAIL allocations > > > > > > > > I don’t see any incompatibility in __alloc_pages_slowpath(). The > > > > ~__GFP_DIRECT_RECLAIM flag only ensures that direct reclaim is not > > > > performed, but it doesn’t prevent the allocation of pages from > > > > ALLOC_MIN_RESERVE, correct? > > > > > > Right but this means that you just made any potential nested allocation > > > within the scope that is GFP_NOFAIL a busy loop essentially. Not to > > > mention it BUG_ON as non-sleeping GFP_NOFAIL allocations are > > > unsupported. I believe this is what Christoph had in mind. > > > > If that's the case, I believe we should at least consider adding the > > following code change to the kernel: > > We already do have that > /* > * All existing users of the __GFP_NOFAIL are blockable, so warn > * of any new users that actually require GFP_NOWAIT > */ > if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) > goto fail; I don't see a reason to place the `goto fail;` above the `__alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);` line. 
Since we've already woken up kswapd, it should be acceptable to
allocate memory from ALLOC_MIN_RESERVE temporarily. Why not consider
implementing the following changes instead?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ecf99190ea2..598d4df829cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4386,13 +4386,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * we always retry
 	 */
 	if (gfp_mask & __GFP_NOFAIL) {
-		/*
-		 * All existing users of the __GFP_NOFAIL are blockable, so warn
-		 * of any new users that actually require GFP_NOWAIT
-		 */
-		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
-			goto fail;
-
 		/*
 		 * PF_MEMALLOC request from this context is rather bizarre
 		 * because we cannot reclaim anything and only can loop waiting
@@ -4419,6 +4412,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;

+		/*
+		 * All existing users of the __GFP_NOFAIL are blockable, so warn
+		 * of any new users that actually require GFP_NOWAIT
+		 */
+		if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
+			goto fail;
+
 		cond_resched();
 		goto retry;
 	}

> But Barry has patches to turn that into a BUG because failing NOFAIL
> allocations is not cool and causes unexpected failures. Have a look at
> https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/
>
> > > I am really
> > > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
> >
> > There's use cases for it.
>
> Right, but there are certain constraints that we need to worry about to
> have maintainable code. Scope allocation constraints are really a good
> feature when they have a well defined semantic. E.g. NOFS, NOIO or
> NOMEMALLOC (although this is more of a self-inflicted injury exactly
> because PF_MEMALLOC had a "use case"). NOWAIT scope semantics might seem
> a good feature but it falls apart on nested NOFAIL allocations! So the
> flag is usable _only_ if you fully control the whole scoped context.
> Good luck with that long term!
> This is fragile, hard to review and even harder to
> keep working properly. The flag would have been Nacked on that ground.
> But nobody asked...

It's already implemented, and complaints won't resolve the issue. How
about making the following change to provide a warning when this new
flag is used incorrectly?

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 4fbae0013166..5a1e1bcde347 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -267,9 +267,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 	 * Stronger flags before weaker flags:
 	 * NORECLAIM implies NOIO, which in turn implies NOFS
 	 */
-	if (pflags & PF_MEMALLOC_NORECLAIM)
+	if (pflags & PF_MEMALLOC_NORECLAIM) {
 		flags &= ~__GFP_DIRECT_RECLAIM;
-	else if (pflags & PF_MEMALLOC_NOIO)
+		WARN_ON_ONCE_GFP(flags & __GFP_NOFAIL, flags);
+	} else if (pflags & PF_MEMALLOC_NOIO)
 		flags &= ~(__GFP_IO | __GFP_FS);
 	else if (pflags & PF_MEMALLOC_NOFS)
 		flags &= ~__GFP_FS;

--
Regards
Yafang
On Thu, Aug 15, 2024 at 10:54 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Aug 14, 2024 at 03:32:26PM +0800, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 1:42 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Wed, Aug 14, 2024 at 10:19:36AM +0800, Yafang Shao wrote:
> > > > On Wed, Aug 14, 2024 at 8:28 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Mon, Aug 12, 2024 at 05:05:24PM +0800, Yafang Shao wrote:
> > > > > > The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
> > > > > > ("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
> > > > > > this, let's add two helper functions, memalloc_nowait_{save,restore}, which
> > > > > > will be useful in scenarios where we want to avoid waiting for memory
> > > > > > reclamation.
> > > > >
> > > > > Readahead already uses this context:
> > > > >
> > > > > static inline gfp_t readahead_gfp_mask(struct address_space *x)
> > > > > {
> > > > >         return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
> > > > > }
> > > > >
> > > > > and __GFP_NORETRY means minimal direct reclaim should be performed.
> > > > > Most filesystems already have GFP_NOFS context from
> > > > > mapping_gfp_mask(), so how much difference does completely avoiding
> > > > > direct reclaim actually make under memory pressure?
> > > >
> > > > Besides __GFP_NOFS, ~__GFP_DIRECT_RECLAIM also implies
> > > > __GFP_NOIO. If we don't set __GFP_NOIO, the readahead can wait for IO,
> > > > right?
> > >
> > > There's a *lot* more difference between __GFP_NORETRY and
> > > __GFP_NOWAIT than just __GFP_NOIO. I don't need you to try to
> > > describe to me what the differences are; what I'm asking you is this:
> > >
> > > > > i.e. doing some direct reclaim without blocking when under memory
> > > > > pressure might actually give better performance than skipping direct
> > > > > reclaim and aborting readahead altogether....
> > > > >
> > > > > This really, really needs some numbers (both throughput and IO
> > > > > latency histograms) to go with it because we have no evidence either
> > > > > way to determine what is the best approach here.
> > >
> > > Put simply: does the existing readahead mechanism give better results
> > > than the proposed one, and if so, why wouldn't we just reenable
> > > readahead unconditionally instead of making it behave differently
> > > for this specific case?
> >
> > Are you suggesting we compare the following change with the current proposal?
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index fd34b5755c0b..ced74b1b350d 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3455,7 +3455,6 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, rwf_t flags,
> >         if (flags & RWF_NOWAIT) {
> >                 if (!(ki->ki_filp->f_mode & FMODE_NOWAIT))
> >                         return -EOPNOTSUPP;
> > -               kiocb_flags |= IOCB_NOIO;
> >         }
> >         if (flags & RWF_ATOMIC) {
> >                 if (rw_type != WRITE)
>
> Yes.
>
> > Doesn't unconditional readahead break the semantics of RWF_NOWAIT,
> > which is supposed to avoid waiting for I/O? For example, it might
> > trigger a pageout for a dirty page.
>
> Yes, but only for *some filesystems* in *some configurations*.
> Readahead allocation behaviour is specifically controlled by the gfp
> mask set on the mapping by the filesystem at inode instantiation
> time, i.e. via a call to mapping_set_gfp_mask().
>
> XFS, for one, always clears __GFP_FS from this mask, and several
> other filesystems set it to GFP_NOFS. Filesystems that do this will
> not do pageout for a dirty page during memory allocation.
>
> Further, memory reclaim can not write dirty pages to a filesystem
> without a ->writepage implementation. ->writepage is almost
> completely gone - neither ext4, btrfs nor XFS have a ->writepage
> implementation anymore - with f2fs being the only "major" filesystem
> with a ->writepage implementation remaining.
>
> IOWs, for most readahead cases right now, direct memory reclaim will
> not issue writeback IO on dirty cached file pages, and in the near
> future that will change to -never-.
>
> That means the only IO that direct reclaim will be able to do is for
> swapping and compaction. Both of these can be prevented simply by
> setting a GFP_NOIO allocation context. IOWs, in the not-too-distant
> future we won't have to turn direct reclaim off to prevent IO from
> blocking in direct reclaim during readahead - GFP_NOIO context
> will be all that is necessary for IOCB_NOWAIT readahead.
>
> That's why I'm asking if just doing readahead as it stands from
> RWF_NOWAIT causes any obvious problems. I think we really only
> need GFP_NOIO | __GFP_NORETRY allocation context for NOWAIT
> readahead IO, and that's something we already have a context API
> for.

Understood, thanks for your explanation. So we need the changes below:

@@ -2526,8 +2528,12 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
        if (!folio_batch_count(fbatch)) {
                if (iocb->ki_flags & IOCB_NOIO)
                        return -EAGAIN;
+               if (iocb->ki_flags & IOCB_NOWAIT)
+                       flags = memalloc_noio_save();
                page_cache_sync_readahead(mapping, ra, filp, index,
                                last_index - index);
+               if (iocb->ki_flags & IOCB_NOWAIT)
+                       memalloc_noio_restore(flags);
                filemap_get_read_batch(mapping, index, last_index - 1, fbatch);
        }
        if (!folio_batch_count(fbatch)) {

What data would you recommend collecting after implementing the above
change? Should we measure the latency of preadv2(2) under high memory
pressure? Although latency can vary, it seems we have no choice but to
use memalloc_noio_save instead of memalloc_nowait_save, as the MM folks
are not in favor of the latter.

--
Regards
Yafang
On Thu 15-08-24 11:26:09, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 8:43 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > If that's the case, I believe we should at least consider adding the
> > > following code change to the kernel:
> >
> > We already do have that
> >                 /*
> >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> >                  * of any new users that actually require GFP_NOWAIT
> >                  */
> >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> >                         goto fail;
>
> I don't see a reason to place the `goto fail;` above the
> `__alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);`
> line. Since we've already woken up kswapd, it should be acceptable to
> allocate memory from ALLOC_MIN_RESERVE temporarily. Why not consider
> implementing the following changes instead?
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9ecf99190ea2..598d4df829cd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4386,13 +4386,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>          * we always retry
>          */
>         if (gfp_mask & __GFP_NOFAIL) {
> -               /*
> -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> -                * of any new users that actually require GFP_NOWAIT
> -                */
> -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> -                       goto fail;
> -
>                 /*
>                  * PF_MEMALLOC request from this context is rather bizarre
>                  * because we cannot reclaim anything and only can loop waiting
> @@ -4419,6 +4412,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                 if (page)
>                         goto got_pg;
>
> +               /*
> +                * All existing users of the __GFP_NOFAIL are blockable, so warn
> +                * of any new users that actually require GFP_NOWAIT
> +                */
> +               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) {
> +                       goto fail;
> +               }
> +
>                 cond_resched();
>                 goto retry;
>         }

How does this solve anything? It will still eventually fail the NOFAIL
allocation. It might happen slightly later, but that doesn't change the
fact that it will _fail_. I have referenced a discussion of why that is
not really desirable and why Barry wants it addressed. We added that
WARN_ON_ONCE because we assumed that people understand that NOFAIL
without reclaim is just too much to ask. We were wrong; there was one
such user in the kernel. That one was not too hard to find because you
can _grep_ for those flags. Scoped APIs make that impossible!

> > But Barry has patches to turn that into a BUG because failing NOFAIL
> > allocations is not cool and causes unexpected failures. Have a look at
> > https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/
> >
> > > > I am really
> > > > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
> > >
> > > There's use cases for it.
> >
> > Right, but there are certain constraints that we need to worry about to
> > have maintainable code. Scoped allocation constraints are really a good
> > feature when they have well-defined semantics, e.g. NOFS, NOIO or
> > NOMEMALLOC (although the last is more of a self-inflicted injury,
> > exactly because PF_MEMALLOC had a "use case"). A NOWAIT scope might
> > seem like a good feature, but it falls apart on nested NOFAIL
> > allocations! So the flag is usable _only_ if you fully control the
> > whole scoped context. Good luck with that long term! This is fragile,
> > hard to review and even harder to keep working properly. The flag
> > would have been NAKed on that ground. But nobody asked...
>
> It's already implemented, and complaints won't resolve the issue. How
> about making the following change to provide a warning when this new
> flag is used incorrectly?

How does this solve anything at all? It will warn you that your code is
incorrect, and what next? Are you going to remove GFP_NOFAIL from the
nested allocation side? NOFAIL is a strong requirement and it is not
used willy-nilly. There must have been a very good reason to use it. Are
you going to drop the scope?

Let me repeat: nested NOFAIL allocations will BUG_ON on failure. Your
warning might catch those users slightly earlier when the allocation
succeeds, but that doesn't make those crashes impossible.
PF_MEMALLOC_NORECLAIM might already be merged, but this concept is
inherently fragile and should be reverted rather than finding impossible
ways around it. And it should be done before this spreads outside of
bcachefs.
On Thu, Aug 15, 2024 at 2:22 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 15-08-24 11:26:09, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 8:43 PM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > If that's the case, I believe we should at least consider adding the
> > > > following code change to the kernel:
> > >
> > > We already do have that
> > >                 /*
> > >                  * All existing users of the __GFP_NOFAIL are blockable, so warn
> > >                  * of any new users that actually require GFP_NOWAIT
> > >                  */
> > >                 if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > >                         goto fail;
> >
> > I don't see a reason to place the `goto fail;` above the
> > `__alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_MIN_RESERVE, ac);`
> > line. Since we've already woken up kswapd, it should be acceptable to
> > allocate memory from ALLOC_MIN_RESERVE temporarily. Why not consider
> > implementing the following changes instead?
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9ecf99190ea2..598d4df829cd 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4386,13 +4386,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >          * we always retry
> >          */
> >         if (gfp_mask & __GFP_NOFAIL) {
> > -               /*
> > -                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > -                * of any new users that actually require GFP_NOWAIT
> > -                */
> > -               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
> > -                       goto fail;
> > -
> >                 /*
> >                  * PF_MEMALLOC request from this context is rather bizarre
> >                  * because we cannot reclaim anything and only can loop waiting
> > @@ -4419,6 +4412,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >                 if (page)
> >                         goto got_pg;
> >
> > +               /*
> > +                * All existing users of the __GFP_NOFAIL are blockable, so warn
> > +                * of any new users that actually require GFP_NOWAIT
> > +                */
> > +               if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask)) {
> > +                       goto fail;
> > +               }
> > +
> >                 cond_resched();
> >                 goto retry;
> >         }
>
> How does this solve anything? It will still eventually fail the NOFAIL
> allocation. It might happen slightly later, but that doesn't change the
> fact that it will _fail_. I have referenced a discussion of why that is
> not really desirable and why Barry wants it addressed. We added that
> WARN_ON_ONCE because we assumed that people understand that NOFAIL
> without reclaim is just too much to ask. We were wrong; there was one
> such user in the kernel. That one was not too hard to find because you
> can _grep_ for those flags. Scoped APIs make that impossible!
>
> > > But Barry has patches to turn that into a BUG because failing NOFAIL
> > > allocations is not cool and causes unexpected failures. Have a look at
> > > https://lore.kernel.org/all/20240731000155.109583-1-21cnbao@gmail.com/
> > >
> > > > > I am really
> > > > > surprised that we even have PF_MEMALLOC_NORECLAIM in the first place!
> > > >
> > > > There's use cases for it.
> > >
> > > Right, but there are certain constraints that we need to worry about to
> > > have maintainable code. Scoped allocation constraints are really a good
> > > feature when they have well-defined semantics, e.g. NOFS, NOIO or
> > > NOMEMALLOC (although the last is more of a self-inflicted injury,
> > > exactly because PF_MEMALLOC had a "use case"). A NOWAIT scope might
> > > seem like a good feature, but it falls apart on nested NOFAIL
> > > allocations! So the flag is usable _only_ if you fully control the
> > > whole scoped context. Good luck with that long term! This is fragile,
> > > hard to review and even harder to keep working properly. The flag
> > > would have been NAKed on that ground. But nobody asked...
> >
> > It's already implemented, and complaints won't resolve the issue. How
> > about making the following change to provide a warning when this new
> > flag is used incorrectly?
>
> How does this solve anything at all? It will warn you that your code is
> incorrect, and what next? Are you going to remove GFP_NOFAIL from the
> nested allocation side? NOFAIL is a strong requirement and it is not
> used willy-nilly. There must have been a very good reason to use it. Are
> you going to drop the scope?
>
> Let me repeat: nested NOFAIL allocations will BUG_ON on failure.

The key question is whether it actually fails after we've already woken
up kswapd. Have we encountered real issues, or is this just based on
code review? Instead of allowing it to fail, why not allocate from the
reserve memory to prevent this from happening?

> Your warning might catch those users slightly earlier when the
> allocation succeeds, but that doesn't make those crashes impossible.
> PF_MEMALLOC_NORECLAIM might already be merged, but this concept is
> inherently fragile and should be reverted rather than finding
> impossible ways around it. And it should be done before this spreads
> outside of bcachefs.
> --
> Michal Hocko
> SUSE Labs

--
Regards
Yafang
On Thu 15-08-24 14:32:10, Yafang Shao wrote:
> On Thu, Aug 15, 2024 at 2:22 PM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > Let me repeat: nested NOFAIL allocations will BUG_ON on failure.
>
> The key question is whether it actually fails after we've already
> woken up kswapd. Have we encountered real issues, or is this just
> based on code review?

Depleting memory to the level that even the min memory reserves are
insufficient is not a theoretical concern. OOMing the system is a real
thing!

> Instead of allowing it to fail, why not allocate
> from the reserve memory to prevent this from happening?

Because those memory reserves are shared and potentially required by
other, more important users which cannot make forward progress without
them. And even with that, those reserves can get depleted, so the
failure point is just a matter of a specific memory consumption pattern.
The failure could be more rare, but that also means much harder to
predict and test for. Really, there are no ways around non-sleeping
GFP_NOFAIL: either you disallow it or you just create a busy loop inside
the allocator. We have chosen the first option because that is a saner
model to support.
On 8/14/24 09:33, Yafang Shao wrote:
> On Wed, Aug 14, 2024 at 1:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>>
>> > There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
>> >
>> > memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.
>>
>> .. and those are horribly misnamed :(

Yes, I agree, sorry about that.

> What about renaming it to memalloc_memalloc_save ?

While it looks weird, it could indeed be better than the current name.
It's not obvious, so it should force the user to read the description.
memalloc_noreclaim_save() might look too obviously like "this disables
reclaim", but that's misleading, as it's not the full story of
PF_MEMALLOC.

>> If we can't even keep our APIs consistently named, who is supposed
>> to understand all this?
On Sun, Sep 01, 2024 at 10:24:10PM GMT, Vlastimil Babka wrote:
> On 8/14/24 09:33, Yafang Shao wrote:
> > On Wed, Aug 14, 2024 at 1:27 PM Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >> > There are already memalloc_noreclaim_{save,restore} which imply __GFP_MEMALLOC:
> >> >
> >> > memalloc_noreclaim_save - Marks implicit __GFP_MEMALLOC scope.
> >>
> >> .. and those are horribly misnamed :(
>
> Yes, I agree, sorry about that.
>
> > What about renaming it to memalloc_memalloc_save ?
>
> While it looks weird, it could indeed be better than the current name.
> It's not obvious, so it should force the user to read the description.
> memalloc_noreclaim_save() might look too obviously like "this disables
> reclaim", but that's misleading, as it's not the full story of
> PF_MEMALLOC.

I was actually thinking about killing the helpers in favor of just
memalloc_flags_save() - I don't think the extra names get us anything.
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 91546493c43d..4fbae0013166 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -484,6 +484,36 @@ static inline void memalloc_pin_restore(unsigned int flags)
 	memalloc_flags_restore(flags);
 }
 
+/**
+ * memalloc_nowait_save - Marks implicit ~__GFP_DIRECT_RECLAIM allocation scope.
+ *
+ * This function marks the beginning of the ~__GFP_DIRECT_RECLAIM allocation
+ * scope. All further allocations will implicitly drop the __GFP_DIRECT_RECLAIM
+ * flag, so they won't wait for direct memory reclaim. Use
+ * memalloc_nowait_restore to end the scope with flags returned by this
+ * function.
+ *
+ * Context: This function is safe to be used from any context.
+ * Return: The saved flags to be passed to memalloc_nowait_restore.
+ */
+static inline unsigned int memalloc_nowait_save(void)
+{
+	return memalloc_flags_save(PF_MEMALLOC_NORECLAIM);
+}
+
+/**
+ * memalloc_nowait_restore - Ends the implicit ~__GFP_DIRECT_RECLAIM scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit ~__GFP_DIRECT_RECLAIM scope started by the
+ * memalloc_nowait_save function. Always make sure that the given flags
+ * are the return value from the pairing memalloc_nowait_save call.
+ */
+static inline void memalloc_nowait_restore(unsigned int flags)
+{
+	memalloc_flags_restore(flags);
+}
+
 #ifdef CONFIG_MEMCG
 DECLARE_PER_CPU(struct mem_cgroup *, int_active_memcg);
 /**
The PF_MEMALLOC_NORECLAIM flag was introduced in commit eab0af905bfc
("mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"). To complement
this, let's add two helper functions, memalloc_nowait_{save,restore}, which
will be useful in scenarios where we want to avoid waiting for memory
reclamation.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/sched/mm.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)