[1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM

Message ID	20240826085347.1152675-2-mhocko@kernel.org (mailing list archive)
State	Handled Elsewhere
Headers	show Received: from mail-ej1-f46.google.com (mail-ej1-f46.google.com [209.85.218.46]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 0B8BA13AD13; Mon, 26 Aug 2024 08:53:57 +0000 (UTC) From: Michal Hocko <mhocko@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de>, Yafang Shao <laoar.shao@gmail.com>, Kent Overstreet <kent.overstreet@linux.dev>, jack@suse.cz, Christian Brauner <brauner@kernel.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Paul Moore <paul@paul-moore.com>, James Morris <jmorris@namei.org>, "Serge E. Hallyn" <serge@hallyn.com>, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-bcachefs@vger.kernel.org, linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org, Michal Hocko <mhocko@suse.com> Subject: [PATCH 1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM Date: Mon, 26 Aug 2024 10:47:12 +0200 Message-ID: <20240826085347.1152675-2-mhocko@kernel.org> In-Reply-To: <20240826085347.1152675-1-mhocko@kernel.org> References: <20240826085347.1152675-1-mhocko@kernel.org> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	get rid of PF_MEMALLOC_NORECLAIM \| expand [0/2] get rid of PF_MEMALLOC_NORECLAIM [1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM [2/2] mm: drop PF_MEMALLOC_NORECLAIM

Michal Hocko Aug. 26, 2024, 8:47 a.m. UTC

From: Michal Hocko <mhocko@suse.com>

bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
inode to achieve GFP_NOWAIT semantic while holding locks. If this
allocation fails it will drop locks and use GFP_NOFS allocation context.

We would like to drop PF_MEMALLOC_NORECLAIM because it is really
dangerous to use if the caller doesn't control the full call chain with
this flag set. E.g. if any of the function down the chain needed
GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
cause unexpected failure.

While this is not the case in this particular case using the scoped gfp
semantic is not really needed bacause we can easily pus the allocation
context down the chain without too much clutter.

Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/bcachefs/fs.c          | 14 ++++++--------
 fs/inode.c                |  6 +++---
 include/linux/fs.h        |  7 ++++++-
 include/linux/lsm_hooks.h |  2 +-
 include/linux/security.h  |  4 ++--
 security/security.c       |  8 ++++----
 6 files changed, 22 insertions(+), 19 deletions(-)

Matthew Wilcox Aug. 26, 2024, 1:11 p.m. UTC | #1

On Mon, Aug 26, 2024 at 10:47:12AM +0200, Michal Hocko wrote:
> @@ -258,12 +258,10 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
>   */
>  static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans)
>  {
> -	struct bch_inode_info *inode =
> -		memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN,
> -				  __bch2_new_inode(trans->c));
> +	struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT);

GFP_NOWAIT include GFP_NOWARN these days (since 16f5dfbc851b)

> +++ b/fs/inode.c
> @@ -153,7 +153,7 @@ static int no_open(struct inode *inode, struct file *file)
>   * These are initializations that need to be done on every inode
>   * allocation as the fields are not initialised by slab allocation.
>   */
> -int inode_init_always(struct super_block *sb, struct inode *inode)
> +int inode_init_always(struct super_block *sb, struct inode *inode, gfp_t gfp)

Did you send the right version of this patch?  There should be a "_gfp"
appended to this function name.

> +++ b/include/linux/fs.h
> @@ -3027,7 +3027,12 @@ extern loff_t default_llseek(struct file *file, loff_t offset, int whence);
>  
>  extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence);
>  
> -extern int inode_init_always(struct super_block *, struct inode *);
> +extern int inode_init_always_gfp(struct super_block *, struct inode *, gfp_t);

You can drop the "extern" while you're changing this line.

> +static inline int inode_init_always(struct super_block *sb, struct inode *inode)
> +{
> +	return inode_init_always_gfp(sb, inode, GFP_NOFS);
> +}

Michal Hocko Aug. 26, 2024, 4:48 p.m. UTC | #2

On Mon 26-08-24 14:11:39, Matthew Wilcox wrote:
> On Mon, Aug 26, 2024 at 10:47:12AM +0200, Michal Hocko wrote:
> > @@ -258,12 +258,10 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
> >   */
> >  static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans)
> >  {
> > -	struct bch_inode_info *inode =
> > -		memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN,
> > -				  __bch2_new_inode(trans->c));
> > +	struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT);
> 
> GFP_NOWAIT include GFP_NOWARN these days (since 16f5dfbc851b)

Ohh, I was not aware of that. I will drop NOWARN then.

> > +++ b/fs/inode.c
> > @@ -153,7 +153,7 @@ static int no_open(struct inode *inode, struct file *file)
> >   * These are initializations that need to be done on every inode
> >   * allocation as the fields are not initialised by slab allocation.
> >   */
> > -int inode_init_always(struct super_block *sb, struct inode *inode)
> > +int inode_init_always(struct super_block *sb, struct inode *inode, gfp_t gfp)
> 
> Did you send the right version of this patch?  There should be a "_gfp"
> appended to this function name.

yes, screw up on my end.

> > +++ b/include/linux/fs.h
> > @@ -3027,7 +3027,12 @@ extern loff_t default_llseek(struct file *file, loff_t offset, int whence);
> >  
> >  extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence);
> >  
> > -extern int inode_init_always(struct super_block *, struct inode *);
> > +extern int inode_init_always_gfp(struct super_block *, struct inode *, gfp_t);
> 
> You can drop the "extern" while you're changing this line.

OK, I can. I just kept the usual style in this file.

Thanks!

Kent Overstreet Aug. 26, 2024, 7:39 p.m. UTC | #3

On Mon, Aug 26, 2024 at 10:47:12AM GMT, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> inode to achieve GFP_NOWAIT semantic while holding locks. If this
> allocation fails it will drop locks and use GFP_NOFS allocation context.
> 
> We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> dangerous to use if the caller doesn't control the full call chain with
> this flag set. E.g. if any of the function down the chain needed
> GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
> cause unexpected failure.
> 
> While this is not the case in this particular case using the scoped gfp
> semantic is not really needed bacause we can easily pus the allocation
> context down the chain without too much clutter.

yeah, eesh, nack.

Given the amount of plumbing required here, it's clear that passing gfp
flags is the less safe way of doing it, and this really does belong in
the allocation context.

Failure to pass gfp flags correctly (which we know is something that
happens today, e.g. vmalloc -> pte allocation) means you're introducing
a deadlock.

Matthew Wilcox Aug. 26, 2024, 7:41 p.m. UTC | #4

On Mon, Aug 26, 2024 at 03:39:47PM -0400, Kent Overstreet wrote:
> Given the amount of plumbing required here, it's clear that passing gfp
> flags is the less safe way of doing it, and this really does belong in
> the allocation context.
> 
> Failure to pass gfp flags correctly (which we know is something that
> happens today, e.g. vmalloc -> pte allocation) means you're introducing
> a deadlock.

The problem with vmalloc is that the page table allocation _doesn't_
take a GFP parameter.

Kent Overstreet Aug. 26, 2024, 7:42 p.m. UTC | #5

On Mon, Aug 26, 2024 at 08:41:42PM GMT, Matthew Wilcox wrote:
> On Mon, Aug 26, 2024 at 03:39:47PM -0400, Kent Overstreet wrote:
> > Given the amount of plumbing required here, it's clear that passing gfp
> > flags is the less safe way of doing it, and this really does belong in
> > the allocation context.
> > 
> > Failure to pass gfp flags correctly (which we know is something that
> > happens today, e.g. vmalloc -> pte allocation) means you're introducing
> > a deadlock.
> 
> The problem with vmalloc is that the page table allocation _doesn't_
> take a GFP parameter.

yeah, I know. I posted patches to plumb it through, which were nacked by
Linus.

And we're trying to get away from passing gfp flags directly, are we
not? I just don't buy the GFP_NOFAIL unsafety argument.

Kent Overstreet Aug. 26, 2024, 7:44 p.m. UTC | #6

On Mon, Aug 26, 2024 at 08:41:42PM GMT, Matthew Wilcox wrote:
> On Mon, Aug 26, 2024 at 03:39:47PM -0400, Kent Overstreet wrote:
> > Given the amount of plumbing required here, it's clear that passing gfp
> > flags is the less safe way of doing it, and this really does belong in
> > the allocation context.
> > 
> > Failure to pass gfp flags correctly (which we know is something that
> > happens today, e.g. vmalloc -> pte allocation) means you're introducing
> > a deadlock.
> 
> The problem with vmalloc is that the page table allocation _doesn't_
> take a GFP parameter.

and given the increasing usage of kvmalloc() this is something we really
should be thinking about

Matthew Wilcox Aug. 26, 2024, 7:47 p.m. UTC | #7

On Mon, Aug 26, 2024 at 03:42:59PM -0400, Kent Overstreet wrote:
> On Mon, Aug 26, 2024 at 08:41:42PM GMT, Matthew Wilcox wrote:
> > On Mon, Aug 26, 2024 at 03:39:47PM -0400, Kent Overstreet wrote:
> > > Given the amount of plumbing required here, it's clear that passing gfp
> > > flags is the less safe way of doing it, and this really does belong in
> > > the allocation context.
> > > 
> > > Failure to pass gfp flags correctly (which we know is something that
> > > happens today, e.g. vmalloc -> pte allocation) means you're introducing
> > > a deadlock.
> > 
> > The problem with vmalloc is that the page table allocation _doesn't_
> > take a GFP parameter.
> 
> yeah, I know. I posted patches to plumb it through, which were nacked by
> Linus.
> 
> And we're trying to get away from passing gfp flags directly, are we
> not? I just don't buy the GFP_NOFAIL unsafety argument.

The problem with the giant invasive change of "getting away from passing
GFP flags directly" is that you need to build consensus for what it
looks like and convince everyone that you have a solution that solves
all the problems, or at least doesn't make any of those problems worse.
You haven't done that, you've just committed code that the MM people hate
(indeed already rejected), and set back the idea.

Look, it's not your job to fix it, but if you want to do it, do it
properly.

kernel test robot Aug. 26, 2024, 7:52 p.m. UTC | #8

Hi Michal,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on tip/sched/core brauner-vfs/vfs.all linus/master v6.11-rc5]
[cannot apply to next-20240826]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Michal-Hocko/bcachefs-do-not-use-PF_MEMALLOC_NORECLAIM/20240826-171013
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240826085347.1152675-2-mhocko%40kernel.org
patch subject: [PATCH 1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM
config: arc-randconfig-001-20240827 (https://download.01.org/0day-ci/archive/20240827/202408270304.AUPHM0xo-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240827/202408270304.AUPHM0xo-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408270304.AUPHM0xo-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/bcachefs/fs.c: In function '__bch2_new_inode':
>> fs/bcachefs/fs.c:248:70: error: macro "unlikely" passed 2 arguments, but takes just 1
     248 |         if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) {
         |                                                                      ^
   In file included from include/linux/build_bug.h:5,
                    from include/linux/container_of.h:5,
                    from include/linux/list.h:5,
                    from include/linux/backing-dev-defs.h:5,
                    from fs/bcachefs/bcachefs.h:186,
                    from fs/bcachefs/fs.c:4:
   include/linux/compiler.h:77: note: macro "unlikely" defined here
      77 | # define unlikely(x)    __builtin_expect(!!(x), 0)
         | 
>> fs/bcachefs/fs.c:248:13: error: 'unlikely' undeclared (first use in this function)
     248 |         if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) {
         |             ^~~~~~~~
   fs/bcachefs/fs.c:248:13: note: each undeclared identifier is reported only once for each function it appears in
   fs/bcachefs/fs.c: In function 'bch2_new_inode':
>> fs/bcachefs/fs.c:261:67: error: 'GFP_NOWARN' undeclared (first use in this function); did you mean 'GFP_NOWAIT'?
     261 |         struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT);
         |                                                                   ^~~~~~~~~~
         |                                                                   GFP_NOWAIT


vim +/unlikely +248 fs/bcachefs/fs.c

   233	
   234	static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c, gfp_t gfp)
   235	{
   236		struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, gfp);
   237		if (!inode)
   238			return NULL;
   239	
   240		inode_init_once(&inode->v);
   241		mutex_init(&inode->ei_update_lock);
   242		two_state_lock_init(&inode->ei_pagecache_lock);
   243		INIT_LIST_HEAD(&inode->ei_vfs_inode_list);
   244		inode->ei_flags = 0;
   245		mutex_init(&inode->ei_quota_lock);
   246		memset(&inode->ei_devs_need_flush, 0, sizeof(inode->ei_devs_need_flush));
   247	
 > 248		if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) {
   249			kmem_cache_free(bch2_inode_cache, inode);
   250			return NULL;
   251		}
   252	
   253		return inode;
   254	}
   255	
   256	/*
   257	 * Allocate a new inode, dropping/retaking btree locks if necessary:
   258	 */
   259	static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans)
   260	{
 > 261		struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT);
   262	
   263		if (unlikely(!inode)) {
   264			int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c, GFP_NOFS)) ? 0 : -ENOMEM);
   265			if (ret && inode) {
   266				__destroy_inode(&inode->v);
   267				kmem_cache_free(bch2_inode_cache, inode);
   268			}
   269			if (ret)
   270				return ERR_PTR(ret);
   271		}
   272	
   273		return inode;
   274	}
   275

Kent Overstreet Aug. 26, 2024, 7:54 p.m. UTC | #9

On Mon, Aug 26, 2024 at 08:47:03PM GMT, Matthew Wilcox wrote:
> On Mon, Aug 26, 2024 at 03:42:59PM -0400, Kent Overstreet wrote:
> > On Mon, Aug 26, 2024 at 08:41:42PM GMT, Matthew Wilcox wrote:
> > > On Mon, Aug 26, 2024 at 03:39:47PM -0400, Kent Overstreet wrote:
> > > > Given the amount of plumbing required here, it's clear that passing gfp
> > > > flags is the less safe way of doing it, and this really does belong in
> > > > the allocation context.
> > > > 
> > > > Failure to pass gfp flags correctly (which we know is something that
> > > > happens today, e.g. vmalloc -> pte allocation) means you're introducing
> > > > a deadlock.
> > > 
> > > The problem with vmalloc is that the page table allocation _doesn't_
> > > take a GFP parameter.
> > 
> > yeah, I know. I posted patches to plumb it through, which were nacked by
> > Linus.
> > 
> > And we're trying to get away from passing gfp flags directly, are we
> > not? I just don't buy the GFP_NOFAIL unsafety argument.
> 
> The problem with the giant invasive change of "getting away from passing
> GFP flags directly" is that you need to build consensus for what it
> looks like and convince everyone that you have a solution that solves
> all the problems, or at least doesn't make any of those problems worse.
> You haven't done that, you've just committed code that the MM people hate
> (indeed already rejected), and set back the idea.

Err, what? I'm not the one who originated the idae of getting away from
passing gfp flags directly. That's been the consensus for _awhile_;
you've been talking about it, Linus talked about it re: vmalloc pte
allocation.

So this looks to me like the mm people just trying to avoid having to
deal with that. GFP_NOFAIL conflicting with other memalloc/gfp flags is
not a new issue (you really shouldn't be combining GFP_NOFAIL with
GFP_NOFS or GFP_NOIO!), and it's something we have warnings for.

But the issue of gfp flags not being passed correctly is not something
we can check for at runtime with any reliability - which means this
patchset look like a net negative to me.

Michal Hocko Aug. 26, 2024, 7:58 p.m. UTC | #10

On Mon 26-08-24 15:39:47, Kent Overstreet wrote:
> On Mon, Aug 26, 2024 at 10:47:12AM GMT, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > 
> > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > dangerous to use if the caller doesn't control the full call chain with
> > this flag set. E.g. if any of the function down the chain needed
> > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
> > cause unexpected failure.
> > 
> > While this is not the case in this particular case using the scoped gfp
> > semantic is not really needed bacause we can easily pus the allocation
> > context down the chain without too much clutter.
> 
> yeah, eesh, nack.

Sure, you can NAK this but then deal with the lack of the PF flag by
other means. We have made it clear that PF_MEMALLOC_NORECLAIM is not we
are going to support at the MM level. 

I have done your homework and shown that it is really easy
to use gfp flags directly. The net result is passing gfp flag down to
two functions. Sure part of it is ugglier by having several different
callbacks implementing it but still manageable. Without too much churn.

So do whatever you like in the bcache code but do not rely on something
that is unsupported by the MM layer which you have sneaked in without an
agreement.

Kent Overstreet Aug. 26, 2024, 8 p.m. UTC | #11

On Mon, Aug 26, 2024 at 09:58:08PM GMT, Michal Hocko wrote:
> On Mon 26-08-24 15:39:47, Kent Overstreet wrote:
> > On Mon, Aug 26, 2024 at 10:47:12AM GMT, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > 
> > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > dangerous to use if the caller doesn't control the full call chain with
> > > this flag set. E.g. if any of the function down the chain needed
> > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
> > > cause unexpected failure.
> > > 
> > > While this is not the case in this particular case using the scoped gfp
> > > semantic is not really needed bacause we can easily pus the allocation
> > > context down the chain without too much clutter.
> > 
> > yeah, eesh, nack.
> 
> Sure, you can NAK this but then deal with the lack of the PF flag by
> other means. We have made it clear that PF_MEMALLOC_NORECLAIM is not we
> are going to support at the MM level. 
> 
> I have done your homework and shown that it is really easy
> to use gfp flags directly. The net result is passing gfp flag down to
> two functions. Sure part of it is ugglier by having several different
> callbacks implementing it but still manageable. Without too much churn.
> 
> So do whatever you like in the bcache code but do not rely on something
> that is unsupported by the MM layer which you have sneaked in without an
> agreement.

Michal, you're being damned hostile, while posting code you haven't even
tried to compile. Seriously, dude?

How about sticking to the technical issues at hand instead of saying
"this is mm, so my way or the highway?". We're all kernel developers
here, this is not what we do.

Michal Hocko Aug. 26, 2024, 8:27 p.m. UTC | #12

On Mon 26-08-24 16:00:56, Kent Overstreet wrote:
> On Mon, Aug 26, 2024 at 09:58:08PM GMT, Michal Hocko wrote:
> > On Mon 26-08-24 15:39:47, Kent Overstreet wrote:
> > > On Mon, Aug 26, 2024 at 10:47:12AM GMT, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > > 
> > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > > dangerous to use if the caller doesn't control the full call chain with
> > > > this flag set. E.g. if any of the function down the chain needed
> > > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
> > > > cause unexpected failure.
> > > > 
> > > > While this is not the case in this particular case using the scoped gfp
> > > > semantic is not really needed bacause we can easily pus the allocation
> > > > context down the chain without too much clutter.
> > > 
> > > yeah, eesh, nack.
> > 
> > Sure, you can NAK this but then deal with the lack of the PF flag by
> > other means. We have made it clear that PF_MEMALLOC_NORECLAIM is not we
> > are going to support at the MM level. 
> > 
> > I have done your homework and shown that it is really easy
> > to use gfp flags directly. The net result is passing gfp flag down to
> > two functions. Sure part of it is ugglier by having several different
> > callbacks implementing it but still manageable. Without too much churn.
> > 
> > So do whatever you like in the bcache code but do not rely on something
> > that is unsupported by the MM layer which you have sneaked in without an
> > agreement.
> 
> Michal, you're being damned hostile, while posting code you haven't even
> tried to compile. Seriously, dude?
> 
> How about sticking to the technical issues at hand instead of saying
> "this is mm, so my way or the highway?". We're all kernel developers
> here, this is not what we do.

Kent, we do respect review feedback. You are clearly fine ignoring it
when you feels like it (eab0af905bfc ("mm: introduce
PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN") is a clear example of it).

I have already made my arguments (repeatedly) why implicit nowait
allocation context is tricky and problematic. Your response is that you
simply "do no buy it" which is a highly technical argument.

Kent Overstreet Aug. 26, 2024, 8:43 p.m. UTC | #13

On Mon, Aug 26, 2024 at 10:27:44PM GMT, Michal Hocko wrote:
> On Mon 26-08-24 16:00:56, Kent Overstreet wrote:
> > On Mon, Aug 26, 2024 at 09:58:08PM GMT, Michal Hocko wrote:
> > > On Mon 26-08-24 15:39:47, Kent Overstreet wrote:
> > > > On Mon, Aug 26, 2024 at 10:47:12AM GMT, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > > > 
> > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > > > dangerous to use if the caller doesn't control the full call chain with
> > > > > this flag set. E.g. if any of the function down the chain needed
> > > > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
> > > > > cause unexpected failure.
> > > > > 
> > > > > While this is not the case in this particular case using the scoped gfp
> > > > > semantic is not really needed bacause we can easily pus the allocation
> > > > > context down the chain without too much clutter.
> > > > 
> > > > yeah, eesh, nack.
> > > 
> > > Sure, you can NAK this but then deal with the lack of the PF flag by
> > > other means. We have made it clear that PF_MEMALLOC_NORECLAIM is not we
> > > are going to support at the MM level. 
> > > 
> > > I have done your homework and shown that it is really easy
> > > to use gfp flags directly. The net result is passing gfp flag down to
> > > two functions. Sure part of it is ugglier by having several different
> > > callbacks implementing it but still manageable. Without too much churn.
> > > 
> > > So do whatever you like in the bcache code but do not rely on something
> > > that is unsupported by the MM layer which you have sneaked in without an
> > > agreement.
> > 
> > Michal, you're being damned hostile, while posting code you haven't even
> > tried to compile. Seriously, dude?
> > 
> > How about sticking to the technical issues at hand instead of saying
> > "this is mm, so my way or the highway?". We're all kernel developers
> > here, this is not what we do.
> 
> Kent, we do respect review feedback. You are clearly fine ignoring it
> when you feels like it (eab0af905bfc ("mm: introduce
> PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN") is a clear example of it).
> 
> I have already made my arguments (repeatedly) why implicit nowait
> allocation context is tricky and problematic. Your response is that you
> simply "do no buy it" which is a highly technical argument.

No, I explained why GFP_NORECLAIM/PF_MEMALLOC_NORECLAIM can absolutely
apply to a context, not a callsite, and why vmalloc() and kvmalloc()
ignoring gfp flags is a much more serious issue.

If you want to do something useful, figure out what we're going to do
about _that_. If you really don't want PF_MEMALLOC_NORECLAIM to exist,
then see if Linus will let you plumb gfp flags down to pte allocation -
and beware, that's arch code that you'll have to fix.

Reminder: kvmalloc() is a thing, and it's steadily seeing wider use.

Otherwise, PF_MEMALLOC_NORECLAIM needs to stay; and thank you for
bringing this to my attention, because it's made me realize all the
other places in bcachefs that use gfp flags for allocating memory with
btree locks held need to be switch to memalloc_flags_do().

kernel test robot Aug. 26, 2024, 8:53 p.m. UTC | #14

Hi Michal,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on tip/sched/core brauner-vfs/vfs.all linus/master v6.11-rc5]
[cannot apply to next-20240826]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Michal-Hocko/bcachefs-do-not-use-PF_MEMALLOC_NORECLAIM/20240826-171013
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240826085347.1152675-2-mhocko%40kernel.org
patch subject: [PATCH 1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM
config: arm-randconfig-003-20240827 (https://download.01.org/0day-ci/archive/20240827/202408270446.6Ew7FPHA-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 08e5a1de8227512d4774a534b91cb2353cef6284)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240827/202408270446.6Ew7FPHA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408270446.6Ew7FPHA-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from fs/bcachefs/fs.c:4:
   In file included from fs/bcachefs/bcachefs.h:188:
   In file included from include/linux/bio.h:10:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:8:
   In file included from include/linux/cacheflush.h:5:
   In file included from arch/arm/include/asm/cacheflush.h:10:
   In file included from include/linux/mm.h:2198:
   include/linux/vmstat.h:518:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     518 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
>> fs/bcachefs/fs.c:248:60: error: too many arguments provided to function-like macro invocation
     248 |         if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) {
         |                                                                   ^
   include/linux/compiler.h:77:10: note: macro 'unlikely' defined here
      77 | # define unlikely(x)    __builtin_expect(!!(x), 0)
         |          ^
>> fs/bcachefs/fs.c:248:6: error: use of undeclared identifier 'unlikely'
     248 |         if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) {
         |             ^
>> fs/bcachefs/fs.c:261:60: error: use of undeclared identifier 'GFP_NOWARN'
     261 |         struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT);
         |                                                                   ^
   1 warning and 3 errors generated.


vim +248 fs/bcachefs/fs.c

   233	
   234	static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c, gfp_t gfp)
   235	{
   236		struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, gfp);
   237		if (!inode)
   238			return NULL;
   239	
   240		inode_init_once(&inode->v);
   241		mutex_init(&inode->ei_update_lock);
   242		two_state_lock_init(&inode->ei_pagecache_lock);
   243		INIT_LIST_HEAD(&inode->ei_vfs_inode_list);
   244		inode->ei_flags = 0;
   245		mutex_init(&inode->ei_quota_lock);
   246		memset(&inode->ei_devs_need_flush, 0, sizeof(inode->ei_devs_need_flush));
   247	
 > 248		if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) {
   249			kmem_cache_free(bch2_inode_cache, inode);
   250			return NULL;
   251		}
   252	
   253		return inode;
   254	}
   255	
   256	/*
   257	 * Allocate a new inode, dropping/retaking btree locks if necessary:
   258	 */
   259	static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans)
   260	{
 > 261		struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT);
   262	
   263		if (unlikely(!inode)) {
   264			int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c, GFP_NOFS)) ? 0 : -ENOMEM);
   265			if (ret && inode) {
   266				__destroy_inode(&inode->v);
   267				kmem_cache_free(bch2_inode_cache, inode);
   268			}
   269			if (ret)
   270				return ERR_PTR(ret);
   271		}
   272	
   273		return inode;
   274	}
   275

Kent Overstreet Aug. 26, 2024, 9:10 p.m. UTC | #15

On Mon, Aug 26, 2024 at 04:44:01PM GMT, Kent Overstreet wrote:
> No, I explained why GFP_NORECLAIM/PF_MEMALLOC_NORECLAIM can absolutely
> apply to a context, not a callsite, and why vmalloc() and kvmalloc()
> ignoring gfp flags is a much more serious issue.
> 
> If you want to do something useful, figure out what we're going to do
> about _that_. If you really don't want PF_MEMALLOC_NORECLAIM to exist,
> then see if Linus will let you plumb gfp flags down to pte allocation -
> and beware, that's arch code that you'll have to fix.
> 
> Reminder: kvmalloc() is a thing, and it's steadily seeing wider use.
> 
> Otherwise, PF_MEMALLOC_NORECLAIM needs to stay; and thank you for
> bringing this to my attention, because it's made me realize all the
> other places in bcachefs that use gfp flags for allocating memory with
> btree locks held need to be switch to memalloc_flags_do().

Additionally: plumbing gfp flags to pte allocation is something we do
need to do. I proposed it before kvmalloc() was a thing, but now it's
become much more of a lurking landmine.

Even with that I'd still be against this series, though. GFP_NOFAIL
interacting badly with other gfp/memalloc flags is going to continue to
be an issue, and I think the only answer to that is stricter runtime
checks (which, thank you mm guys for adding recently).

kernel test robot Aug. 27, 2024, 2:23 a.m. UTC | #16

Hi Michal,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on tip/sched/core brauner-vfs/vfs.all linus/master v6.11-rc5]
[cannot apply to next-20240826]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Michal-Hocko/bcachefs-do-not-use-PF_MEMALLOC_NORECLAIM/20240826-171013
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240826085347.1152675-2-mhocko%40kernel.org
patch subject: [PATCH 1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM
config: i386-buildonly-randconfig-003-20240827 (https://download.01.org/0day-ci/archive/20240827/202408271041.5IWf4ZQC-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240827/202408271041.5IWf4ZQC-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202408271041.5IWf4ZQC-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> security/security.c:664: warning: Function parameter or struct member 'gfp' not described in 'lsm_inode_alloc'
>> security/security.c:1586: warning: Function parameter or struct member 'gfp' not described in 'security_inode_alloc'


vim +664 security/security.c

33bf60cabcc768 Casey Schaufler 2018-11-12  654  
afb1cbe37440c7 Casey Schaufler 2018-09-21  655  /**
afb1cbe37440c7 Casey Schaufler 2018-09-21  656   * lsm_inode_alloc - allocate a composite inode blob
afb1cbe37440c7 Casey Schaufler 2018-09-21  657   * @inode: the inode that needs a blob
afb1cbe37440c7 Casey Schaufler 2018-09-21  658   *
afb1cbe37440c7 Casey Schaufler 2018-09-21  659   * Allocate the inode blob for all the modules
afb1cbe37440c7 Casey Schaufler 2018-09-21  660   *
afb1cbe37440c7 Casey Schaufler 2018-09-21  661   * Returns 0, or -ENOMEM if memory can't be allocated.
afb1cbe37440c7 Casey Schaufler 2018-09-21  662   */
b2ce84652b3193 Michal Hocko    2024-08-26  663  int lsm_inode_alloc(struct inode *inode, gfp_t gfp)
afb1cbe37440c7 Casey Schaufler 2018-09-21 @664  {
afb1cbe37440c7 Casey Schaufler 2018-09-21  665  	if (!lsm_inode_cache) {
afb1cbe37440c7 Casey Schaufler 2018-09-21  666  		inode->i_security = NULL;
afb1cbe37440c7 Casey Schaufler 2018-09-21  667  		return 0;
afb1cbe37440c7 Casey Schaufler 2018-09-21  668  	}
afb1cbe37440c7 Casey Schaufler 2018-09-21  669  
b2ce84652b3193 Michal Hocko    2024-08-26  670  	inode->i_security = kmem_cache_zalloc(lsm_inode_cache, gfp);
afb1cbe37440c7 Casey Schaufler 2018-09-21  671  	if (inode->i_security == NULL)
afb1cbe37440c7 Casey Schaufler 2018-09-21  672  		return -ENOMEM;
afb1cbe37440c7 Casey Schaufler 2018-09-21  673  	return 0;
afb1cbe37440c7 Casey Schaufler 2018-09-21  674  }
afb1cbe37440c7 Casey Schaufler 2018-09-21  675

Michal Hocko Aug. 27, 2024, 6:01 a.m. UTC | #17

On Mon 26-08-24 16:43:55, Kent Overstreet wrote:
> On Mon, Aug 26, 2024 at 10:27:44PM GMT, Michal Hocko wrote:
> > On Mon 26-08-24 16:00:56, Kent Overstreet wrote:
> > > On Mon, Aug 26, 2024 at 09:58:08PM GMT, Michal Hocko wrote:
> > > > On Mon 26-08-24 15:39:47, Kent Overstreet wrote:
> > > > > On Mon, Aug 26, 2024 at 10:47:12AM GMT, Michal Hocko wrote:
> > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > 
> > > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> > > > > > inode to achieve GFP_NOWAIT semantic while holding locks. If this
> > > > > > allocation fails it will drop locks and use GFP_NOFS allocation context.
> > > > > > 
> > > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> > > > > > dangerous to use if the caller doesn't control the full call chain with
> > > > > > this flag set. E.g. if any of the function down the chain needed
> > > > > > GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
> > > > > > cause unexpected failure.
> > > > > > 
> > > > > > While this is not the case in this particular case using the scoped gfp
> > > > > > semantic is not really needed bacause we can easily pus the allocation
> > > > > > context down the chain without too much clutter.
> > > > > 
> > > > > yeah, eesh, nack.
> > > > 
> > > > Sure, you can NAK this but then deal with the lack of the PF flag by
> > > > other means. We have made it clear that PF_MEMALLOC_NORECLAIM is not we
> > > > are going to support at the MM level. 
> > > > 
> > > > I have done your homework and shown that it is really easy
> > > > to use gfp flags directly. The net result is passing gfp flag down to
> > > > two functions. Sure part of it is ugglier by having several different
> > > > callbacks implementing it but still manageable. Without too much churn.
> > > > 
> > > > So do whatever you like in the bcache code but do not rely on something
> > > > that is unsupported by the MM layer which you have sneaked in without an
> > > > agreement.
> > > 
> > > Michal, you're being damned hostile, while posting code you haven't even
> > > tried to compile. Seriously, dude?
> > > 
> > > How about sticking to the technical issues at hand instead of saying
> > > "this is mm, so my way or the highway?". We're all kernel developers
> > > here, this is not what we do.
> > 
> > Kent, we do respect review feedback. You are clearly fine ignoring it
> > when you feels like it (eab0af905bfc ("mm: introduce
> > PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN") is a clear example of it).
> > 
> > I have already made my arguments (repeatedly) why implicit nowait
> > allocation context is tricky and problematic. Your response is that you
> > simply "do no buy it" which is a highly technical argument.
> 
> No, I explained why GFP_NORECLAIM/PF_MEMALLOC_NORECLAIM can absolutely
> apply to a context, not a callsite, and why vmalloc() and kvmalloc()
> ignoring gfp flags is a much more serious issue.

You are not really answering the main concern I have brought up though.
I.e. GFP_NOFAIL being fundamentally incompatible with NORECLAIM semantic
because the page allocator doesn't and will not support this allocation
mode.  Scoped noreclaim semantic makes such a use much less visible
because it can be deep in the scoped context there more error prone to
introduce thus making the code harder to maintain. 

I do see why you would like to have NOWAIT kvmalloc support available
and I also do see challenges in achieving that. But I completely fail to
see why you are bring that up _here_ as that is not really relevant to
PF_MEMALLOC_NORECLAIM use by bcachefs because it demonstrably doesn't
need that. There is no other user of the flag at the moment so dropping
the flag before there is more misuse is a reasonable goal. If you want
to bring up vmalloc NOWAIT support then feel free to do that in another
context and we can explore potential ways to achieve that.

Kent Overstreet Aug. 27, 2024, 6:40 a.m. UTC | #18

On Tue, Aug 27, 2024 at 08:01:32AM GMT, Michal Hocko wrote:
> You are not really answering the main concern I have brought up though.
> I.e. GFP_NOFAIL being fundamentally incompatible with NORECLAIM semantic
> because the page allocator doesn't and will not support this allocation
> mode.  Scoped noreclaim semantic makes such a use much less visible
> because it can be deep in the scoped context there more error prone to
> introduce thus making the code harder to maintain. 

You're too attached to GFP_NOFAIL.

GFP_NOFAIL is something we very rarely use, and it's not something we
want to use. Furthermore, GFP_NOFAIL allocations can fail regardless of
this patch - e.g. if it's more than 2 pages, it's not going to be
GFP_NOFAIL. In general you can't avoid having error paths for memory
allocations, and GFP_NOFAIL can't fundamentally change that.

Stop trying to design things around GFP_NOFAIL.

> I do see why you would like to have NOWAIT kvmalloc support available
> and I also do see challenges in achieving that. But I completely fail to
> see why you are bring that up _here_ as that is not really relevant to
> PF_MEMALLOC_NORECLAIM use by bcachefs because it demonstrably doesn't
> need that. There is no other user of the flag at the moment so dropping
> the flag before there is more misuse is a reasonable goal. If you want
> to bring up vmalloc NOWAIT support then feel free to do that in another
> context and we can explore potential ways to achieve that.

The inconsistencies between what PF_MEMALLOC supports and what gfp flags
support are quite relevant. If gfp flags aren't plumbed everywhere, and
aren't going to be plumbed everywhere (and that is the current
decision), then we must have PF_MEMALLOC support.

Michal Hocko Aug. 27, 2024, 6:58 a.m. UTC | #19

On Tue 27-08-24 02:40:16, Kent Overstreet wrote:
> On Tue, Aug 27, 2024 at 08:01:32AM GMT, Michal Hocko wrote:
> > You are not really answering the main concern I have brought up though.
> > I.e. GFP_NOFAIL being fundamentally incompatible with NORECLAIM semantic
> > because the page allocator doesn't and will not support this allocation
> > mode.  Scoped noreclaim semantic makes such a use much less visible
> > because it can be deep in the scoped context there more error prone to
> > introduce thus making the code harder to maintain. 
> 
> You're too attached to GFP_NOFAIL.

Unfortunatelly GFP_NOFAIL is there and we need to support it. We cannot
just close eyes and pretend it doesn't exist and hope for the best.

> GFP_NOFAIL is something we very rarely use, and it's not something we
> want to use. Furthermore, GFP_NOFAIL allocations can fail regardless of
> this patch - e.g. if it's more than 2 pages, it's not going to be
> GFP_NOFAIL.

We can reasonably assume we do not have any of those users in the tree
though. We know that because we have a warning to tell us about that.
We still have legit GFP_NOFAIL users and we can safely assume we will
have some in the future though. And they have no way to handle the
failure. If they did they wouldn't have used GFP_NOFAIL in the first
place. So they do not check for NULL and they would either blow up or
worse fail in subtle and harder to detect way.

Kent Overstreet Aug. 27, 2024, 7:05 a.m. UTC | #20

On Tue, Aug 27, 2024 at 08:58:39AM GMT, Michal Hocko wrote:
> On Tue 27-08-24 02:40:16, Kent Overstreet wrote:
> > On Tue, Aug 27, 2024 at 08:01:32AM GMT, Michal Hocko wrote:
> > > You are not really answering the main concern I have brought up though.
> > > I.e. GFP_NOFAIL being fundamentally incompatible with NORECLAIM semantic
> > > because the page allocator doesn't and will not support this allocation
> > > mode.  Scoped noreclaim semantic makes such a use much less visible
> > > because it can be deep in the scoped context there more error prone to
> > > introduce thus making the code harder to maintain. 
> > 
> > You're too attached to GFP_NOFAIL.
> 
> Unfortunatelly GFP_NOFAIL is there and we need to support it. We cannot
> just close eyes and pretend it doesn't exist and hope for the best.

You need to notice when you're trying to do something immpossible.

> > GFP_NOFAIL is something we very rarely use, and it's not something we
> > want to use. Furthermore, GFP_NOFAIL allocations can fail regardless of
> > this patch - e.g. if it's more than 2 pages, it's not going to be
> > GFP_NOFAIL.
> 
> We can reasonably assume we do not have any of those users in the tree
> though. We know that because we have a warning to tell us about that.
> We still have legit GFP_NOFAIL users and we can safely assume we will
> have some in the future though. And they have no way to handle the
> failure. If they did they wouldn't have used GFP_NOFAIL in the first
> place. So they do not check for NULL and they would either blow up or
> worse fail in subtle and harder to detect way.

No, because not all GFP_NOFAIL allocations are statically sized.

And the problem of the dynamic context overriding GFP_NOFAIL is more
general - if you use GFP_NOFAIL from nonblocking context (interrupt
context or preemption disabled) - the allocation has to fail, or
something even worse will happen.

Just because we don't track that with PF_MEMALLOC flags doesn't mean the
problem isn't htere.

Michal Hocko Aug. 27, 2024, 7:35 a.m. UTC | #21

On Tue 27-08-24 03:05:29, Kent Overstreet wrote:
> On Tue, Aug 27, 2024 at 08:58:39AM GMT, Michal Hocko wrote:
> > On Tue 27-08-24 02:40:16, Kent Overstreet wrote:
> > > On Tue, Aug 27, 2024 at 08:01:32AM GMT, Michal Hocko wrote:
> > > > You are not really answering the main concern I have brought up though.
> > > > I.e. GFP_NOFAIL being fundamentally incompatible with NORECLAIM semantic
> > > > because the page allocator doesn't and will not support this allocation
> > > > mode.  Scoped noreclaim semantic makes such a use much less visible
> > > > because it can be deep in the scoped context there more error prone to
> > > > introduce thus making the code harder to maintain. 
> > > 
> > > You're too attached to GFP_NOFAIL.
> > 
> > Unfortunatelly GFP_NOFAIL is there and we need to support it. We cannot
> > just close eyes and pretend it doesn't exist and hope for the best.
> 
> You need to notice when you're trying to do something immpossible.

Agreed! And GFP_NOFAIL for allocations <= order 1 in the page allocator or 
kvmalloc(GFP_NOFAIL) for reasonable sizes is a supported setup. And it
should work as documented and shouldn't create any surprises. Like
returning unexpected failure because you have been called from withing a
NORECLAIM scope which you as an author of the code are not even aware of
because that has happened somewhere detached from your code and you
happen to be in a callchain.

> > > GFP_NOFAIL is something we very rarely use, and it's not something we
> > > want to use. Furthermore, GFP_NOFAIL allocations can fail regardless of
> > > this patch - e.g. if it's more than 2 pages, it's not going to be
> > > GFP_NOFAIL.
> > 
> > We can reasonably assume we do not have any of those users in the tree
> > though. We know that because we have a warning to tell us about that.
> > We still have legit GFP_NOFAIL users and we can safely assume we will
> > have some in the future though. And they have no way to handle the
> > failure. If they did they wouldn't have used GFP_NOFAIL in the first
> > place. So they do not check for NULL and they would either blow up or
> > worse fail in subtle and harder to detect way.
> 
> No, because not all GFP_NOFAIL allocations are statically sized.

This is a runtime check warning.
rmqueue:
        WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

> And the problem of the dynamic context overriding GFP_NOFAIL is more
> general - if you use GFP_NOFAIL from nonblocking context (interrupt
> context or preemption disabled) - the allocation has to fail, or
> something even worse will happen.

If you use __GFP_NOFAIL | GFP_KERNEL from an atomic context then you are
screwed the same way as if you used GFP_KERNEL alone - sleeping while
atomic or worse. The allocator doesn't even try to deal with this and
protect the caller by not sleeping and returning NULL.

More fundamentally, GFP_NOFAIL from non-blocking context is an incorrect
an unsupported use of the flag. This is the crux of the whole
discussion. GFP_NOWAIT | __GFP_NOFAIL or GFP_ATOMIC | __GFP_NOFAIL is
just a bug. We can git grep for those, and surprisingly found one instance
which already has a patch waiting to be merged.

We cannot enforce that at a compile time and that sucks but such is a
life. But we can grep for this at least. Now consider a scoped
(implicit) NOWAIT context which makes even seeemingly correct GFP_NOFAIL
use a bug.

Jan Kara Aug. 29, 2024, 9:37 a.m. UTC | #22

On Mon 26-08-24 10:47:12, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
> inode to achieve GFP_NOWAIT semantic while holding locks. If this
> allocation fails it will drop locks and use GFP_NOFS allocation context.
> 
> We would like to drop PF_MEMALLOC_NORECLAIM because it is really
> dangerous to use if the caller doesn't control the full call chain with
> this flag set. E.g. if any of the function down the chain needed
> GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
> cause unexpected failure.
> 
> While this is not the case in this particular case using the scoped gfp
> semantic is not really needed bacause we can easily pus the allocation
> context down the chain without too much clutter.
> 
> Acked-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

For the VFS changes feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/bcachefs/fs.c          | 14 ++++++--------
>  fs/inode.c                |  6 +++---
>  include/linux/fs.h        |  7 ++++++-
>  include/linux/lsm_hooks.h |  2 +-
>  include/linux/security.h  |  4 ++--
>  security/security.c       |  8 ++++----
>  6 files changed, 22 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c
> index 15fc41e63b6c..7a55167b9133 100644
> --- a/fs/bcachefs/fs.c
> +++ b/fs/bcachefs/fs.c
> @@ -231,9 +231,9 @@ static struct inode *bch2_alloc_inode(struct super_block *sb)
>  	BUG();
>  }
>  
> -static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
> +static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c, gfp_t gfp)
>  {
> -	struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, GFP_NOFS);
> +	struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, gfp);
>  	if (!inode)
>  		return NULL;
>  
> @@ -245,7 +245,7 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
>  	mutex_init(&inode->ei_quota_lock);
>  	memset(&inode->ei_devs_need_flush, 0, sizeof(inode->ei_devs_need_flush));
>  
> -	if (unlikely(inode_init_always(c->vfs_sb, &inode->v))) {
> +	if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) {
>  		kmem_cache_free(bch2_inode_cache, inode);
>  		return NULL;
>  	}
> @@ -258,12 +258,10 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c)
>   */
>  static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans)
>  {
> -	struct bch_inode_info *inode =
> -		memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN,
> -				  __bch2_new_inode(trans->c));
> +	struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT);
>  
>  	if (unlikely(!inode)) {
> -		int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c)) ? 0 : -ENOMEM);
> +		int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c, GFP_NOFS)) ? 0 : -ENOMEM);
>  		if (ret && inode) {
>  			__destroy_inode(&inode->v);
>  			kmem_cache_free(bch2_inode_cache, inode);
> @@ -328,7 +326,7 @@ __bch2_create(struct mnt_idmap *idmap,
>  	if (ret)
>  		return ERR_PTR(ret);
>  #endif
> -	inode = __bch2_new_inode(c);
> +	inode = __bch2_new_inode(c, GFP_NOFS);
>  	if (unlikely(!inode)) {
>  		inode = ERR_PTR(-ENOMEM);
>  		goto err;
> diff --git a/fs/inode.c b/fs/inode.c
> index 86670941884b..95fd67a6cac3 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -153,7 +153,7 @@ static int no_open(struct inode *inode, struct file *file)
>   * These are initializations that need to be done on every inode
>   * allocation as the fields are not initialised by slab allocation.
>   */
> -int inode_init_always(struct super_block *sb, struct inode *inode)
> +int inode_init_always(struct super_block *sb, struct inode *inode, gfp_t gfp)
>  {
>  	static const struct inode_operations empty_iops;
>  	static const struct file_operations no_open_fops = {.open = no_open};
> @@ -230,14 +230,14 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>  #endif
>  	inode->i_flctx = NULL;
>  
> -	if (unlikely(security_inode_alloc(inode)))
> +	if (unlikely(security_inode_alloc(inode, gfp)))
>  		return -ENOMEM;
>  
>  	this_cpu_inc(nr_inodes);
>  
>  	return 0;
>  }
> -EXPORT_SYMBOL(inode_init_always);
> +EXPORT_SYMBOL(inode_init_always_gfp);
>  
>  void free_inode_nonrcu(struct inode *inode)
>  {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index fd34b5755c0b..d46ca71a7855 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3027,7 +3027,12 @@ extern loff_t default_llseek(struct file *file, loff_t offset, int whence);
>  
>  extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence);
>  
> -extern int inode_init_always(struct super_block *, struct inode *);
> +extern int inode_init_always_gfp(struct super_block *, struct inode *, gfp_t);
> +static inline int inode_init_always(struct super_block *sb, struct inode *inode)
> +{
> +	return inode_init_always_gfp(sb, inode, GFP_NOFS);
> +}
> +
>  extern void inode_init_once(struct inode *);
>  extern void address_space_init_once(struct address_space *mapping);
>  extern struct inode * igrab(struct inode *);
> diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
> index a2ade0ffe9e7..b08472d64765 100644
> --- a/include/linux/lsm_hooks.h
> +++ b/include/linux/lsm_hooks.h
> @@ -150,6 +150,6 @@ extern struct lsm_info __start_early_lsm_info[], __end_early_lsm_info[];
>  		__used __section(".early_lsm_info.init")		\
>  		__aligned(sizeof(unsigned long))
>  
> -extern int lsm_inode_alloc(struct inode *inode);
> +extern int lsm_inode_alloc(struct inode *inode, gfp_t gfp);
>  
>  #endif /* ! __LINUX_LSM_HOOKS_H */
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 1390f1efb4f0..7c6b9b038a0d 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -336,7 +336,7 @@ int security_dentry_create_files_as(struct dentry *dentry, int mode,
>  					struct cred *new);
>  int security_path_notify(const struct path *path, u64 mask,
>  					unsigned int obj_type);
> -int security_inode_alloc(struct inode *inode);
> +int security_inode_alloc(struct inode *inode, gfp_t gfp);
>  void security_inode_free(struct inode *inode);
>  int security_inode_init_security(struct inode *inode, struct inode *dir,
>  				 const struct qstr *qstr,
> @@ -769,7 +769,7 @@ static inline int security_path_notify(const struct path *path, u64 mask,
>  	return 0;
>  }
>  
> -static inline int security_inode_alloc(struct inode *inode)
> +static inline int security_inode_alloc(struct inode *inode, gfp_t gfp)
>  {
>  	return 0;
>  }
> diff --git a/security/security.c b/security/security.c
> index 8cee5b6c6e6d..3581262da5ee 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -660,14 +660,14 @@ static int lsm_file_alloc(struct file *file)
>   *
>   * Returns 0, or -ENOMEM if memory can't be allocated.
>   */
> -int lsm_inode_alloc(struct inode *inode)
> +int lsm_inode_alloc(struct inode *inode, gfp_t gfp)
>  {
>  	if (!lsm_inode_cache) {
>  		inode->i_security = NULL;
>  		return 0;
>  	}
>  
> -	inode->i_security = kmem_cache_zalloc(lsm_inode_cache, GFP_NOFS);
> +	inode->i_security = kmem_cache_zalloc(lsm_inode_cache, gfp);
>  	if (inode->i_security == NULL)
>  		return -ENOMEM;
>  	return 0;
> @@ -1582,9 +1582,9 @@ int security_path_notify(const struct path *path, u64 mask,
>   *
>   * Return: Return 0 if operation was successful.
>   */
> -int security_inode_alloc(struct inode *inode)
> +int security_inode_alloc(struct inode *inode, gfp_t gfp)
>  {
> -	int rc = lsm_inode_alloc(inode);
> +	int rc = lsm_inode_alloc(inode, gfp);
>  
>  	if (unlikely(rc))
>  		return rc;
> -- 
> 2.46.0
>

[1/2] bcachefs: do not use PF_MEMALLOC_NORECLAIM

Commit Message

Comments

Patch