Message ID | Zr8LMv89fkfpmBlO@tiehlicka (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mm: document risk of PF_MEMALLOC_NORECLAIM | expand |
On Fri, Aug 16, 2024 at 10:17:54AM +0200, Michal Hocko wrote: > Andrew, could you merge the following before PF_MEMALLOC_NORECLAIM can > be removed from the tree altogether please? For the full context the > email thread starts here: https://lore.kernel.org/all/20240812090525.80299-1-laoar.shao@gmail.com/T/#u I don't think that's enough. We just need to kill it given that it was added without ACKs and despite explicit earlier objections to the API. That might leave bcachefs with broken inode allocation handling, but so what - it's marked experimental and there should be other ways to sort this out.
On Fri 16-08-24 01:22:37, Christoph Hellwig wrote: > On Fri, Aug 16, 2024 at 10:17:54AM +0200, Michal Hocko wrote: > > Andrew, could you merge the following before PF_MEMALLOC_NORECLAIM can > > be removed from the tree altogether please? For the full context the > > email thread starts here: https://lore.kernel.org/all/20240812090525.80299-1-laoar.shao@gmail.com/T/#u > > I don't think that's enough. We just need to kill it given that it was > added without ACKs and despite explicit earlier objections to the API. Yes, I think we should kill it before it spreads even more but I would not like to make the existing user just broken. I have zero visibility and understanding of the bcachefs code but from a quick look at __bch2_new_inode it shouldn't be really terribly hard to push GFP_NOWAIT flag there directly. It would require inode_init_always_gfp variant as well (to not touch all existing callers that do not have any locking requirements but I do not see any other nested allocations. So a very quick and untested change would look as follows: --- diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c index 15fc41e63b6c..7a55167b9133 100644 --- a/fs/bcachefs/fs.c +++ b/fs/bcachefs/fs.c @@ -231,9 +231,9 @@ static struct inode *bch2_alloc_inode(struct super_block *sb) BUG(); } -static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c) +static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c, gfp_t gfp) { - struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, GFP_NOFS); + struct bch_inode_info *inode = kmem_cache_alloc(bch2_inode_cache, gfp); if (!inode) return NULL; @@ -245,7 +245,7 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c) mutex_init(&inode->ei_quota_lock); memset(&inode->ei_devs_need_flush, 0, sizeof(inode->ei_devs_need_flush)); - if (unlikely(inode_init_always(c->vfs_sb, &inode->v))) { + if (unlikely(inode_init_always_gfp(c->vfs_sb, &inode->v), gfp)) { kmem_cache_free(bch2_inode_cache, inode); return NULL; } @@ -258,12 +258,10 @@ static struct bch_inode_info *__bch2_new_inode(struct bch_fs *c) */ static struct bch_inode_info *bch2_new_inode(struct btree_trans *trans) { - struct bch_inode_info *inode = - memalloc_flags_do(PF_MEMALLOC_NORECLAIM|PF_MEMALLOC_NOWARN, - __bch2_new_inode(trans->c)); + struct bch_inode_info *inode = __bch2_new_inode(trans->c, GFP_NOWARN | GFP_NOWAIT); if (unlikely(!inode)) { - int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c)) ? 0 : -ENOMEM); + int ret = drop_locks_do(trans, (inode = __bch2_new_inode(trans->c, GFP_NOFS)) ? 0 : -ENOMEM); if (ret && inode) { __destroy_inode(&inode->v); kmem_cache_free(bch2_inode_cache, inode); @@ -328,7 +326,7 @@ __bch2_create(struct mnt_idmap *idmap, if (ret) return ERR_PTR(ret); #endif - inode = __bch2_new_inode(c); + inode = __bch2_new_inode(c, GFP_NOFS); if (unlikely(!inode)) { inode = ERR_PTR(-ENOMEM); goto err; diff --git a/fs/inode.c b/fs/inode.c index 86670941884b..95fd67a6cac3 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -153,7 +153,7 @@ static int no_open(struct inode *inode, struct file *file) * These are initializations that need to be done on every inode * allocation as the fields are not initialised by slab allocation. */ -int inode_init_always(struct super_block *sb, struct inode *inode) +int inode_init_always(struct super_block *sb, struct inode *inode, gfp_t gfp) { static const struct inode_operations empty_iops; static const struct file_operations no_open_fops = {.open = no_open}; @@ -230,14 +230,14 @@ int inode_init_always(struct super_block *sb, struct inode *inode) #endif inode->i_flctx = NULL; - if (unlikely(security_inode_alloc(inode))) + if (unlikely(security_inode_alloc(inode, gfp))) return -ENOMEM; this_cpu_inc(nr_inodes); return 0; } -EXPORT_SYMBOL(inode_init_always); +EXPORT_SYMBOL(inode_init_always_gfp); void free_inode_nonrcu(struct inode *inode) { diff --git a/include/linux/fs.h b/include/linux/fs.h index fd34b5755c0b..5c613a89718b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3027,7 +3027,12 @@ extern loff_t default_llseek(struct file *file, loff_t offset, int whence); extern loff_t vfs_llseek(struct file *file, loff_t offset, int whence); -extern int inode_init_always(struct super_block *, struct inode *); +extern int inode_init_always(struct super_block *, struct inode *, gfp_t); +static inline int inode_init_always(struct super_block *, struct inode *) +{ + return inode_init_always_gfp(GFP_NOFS); +} + extern void inode_init_once(struct inode *); extern void address_space_init_once(struct address_space *mapping); extern struct inode * igrab(struct inode *); diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h index a2ade0ffe9e7..b08472d64765 100644 --- a/include/linux/lsm_hooks.h +++ b/include/linux/lsm_hooks.h @@ -150,6 +150,6 @@ extern struct lsm_info __start_early_lsm_info[], __end_early_lsm_info[]; __used __section(".early_lsm_info.init") \ __aligned(sizeof(unsigned long)) -extern int lsm_inode_alloc(struct inode *inode); +extern int lsm_inode_alloc(struct inode *inode, gfp_t gfp); #endif /* ! __LINUX_LSM_HOOKS_H */ diff --git a/include/linux/security.h b/include/linux/security.h index 1390f1efb4f0..7c6b9b038a0d 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -336,7 +336,7 @@ int security_dentry_create_files_as(struct dentry *dentry, int mode, struct cred *new); int security_path_notify(const struct path *path, u64 mask, unsigned int obj_type); -int security_inode_alloc(struct inode *inode); +int security_inode_alloc(struct inode *inode, gfp_t gfp); void security_inode_free(struct inode *inode); int security_inode_init_security(struct inode *inode, struct inode *dir, const struct qstr *qstr, @@ -769,7 +769,7 @@ static inline int security_path_notify(const struct path *path, u64 mask, return 0; } -static inline int security_inode_alloc(struct inode *inode) +static inline int security_inode_alloc(struct inode *inode, gfp_t gfp) { return 0; } diff --git a/security/security.c b/security/security.c index 8cee5b6c6e6d..8634d3bee56f 100644 --- a/security/security.c +++ b/security/security.c @@ -660,14 +660,14 @@ static int lsm_file_alloc(struct file *file) * * Returns 0, or -ENOMEM if memory can't be allocated. */ -int lsm_inode_alloc(struct inode *inode) +int lsm_inode_alloc(struct inode *inode, gfp) { if (!lsm_inode_cache) { inode->i_security = NULL; return 0; } - inode->i_security = kmem_cache_zalloc(lsm_inode_cache, GFP_NOFS); + inode->i_security = kmem_cache_zalloc(lsm_inode_cache, gfp); if (inode->i_security == NULL) return -ENOMEM; return 0; @@ -1582,9 +1582,9 @@ int security_path_notify(const struct path *path, u64 mask, * * Return: Return 0 if operation was successful. */ -int security_inode_alloc(struct inode *inode) +int security_inode_alloc(struct inode *inode, gfp_t gfp) { - int rc = lsm_inode_alloc(inode); + int rc = lsm_inode_alloc(inode, gfp); if (unlikely(rc)) return rc;
On Fri, Aug 16, 2024 at 10:54:37AM +0200, Michal Hocko wrote: > Yes, I think we should kill it before it spreads even more but I would > not like to make the existing user just broken. I have zero visibility > and understanding of the bcachefs code but from a quick look at __bch2_new_inode > it shouldn't be really terribly hard to push GFP_NOWAIT flag there > directly. It would require inode_init_always_gfp variant as well (to not > touch all existing callers that do not have any locking requirements but > I do not see any other nested allocations. I'll probably have to go down into security_inode_alloc as well. That being said there is no explanation for the behavior here in the commit logs or the code itself, so who knows.
On Fri 16-08-24 07:26:06, Christoph Hellwig wrote: > On Fri, Aug 16, 2024 at 10:54:37AM +0200, Michal Hocko wrote: > > Yes, I think we should kill it before it spreads even more but I would > > not like to make the existing user just broken. I have zero visibility > > and understanding of the bcachefs code but from a quick look at __bch2_new_inode > > it shouldn't be really terribly hard to push GFP_NOWAIT flag there > > directly. It would require inode_init_always_gfp variant as well (to not > > touch all existing callers that do not have any locking requirements but > > I do not see any other nested allocations. > > I'll probably have to go down into security_inode_alloc as well. yes, I have done that as well. I was not sure about inode_alloc_security. It has alloc in the name but none of the caller actually allocate from there. lsm_inode_alloc was trivial to update. > That being said there is no explanation for the behavior here in the > commit logs or the code itself, so who knows.
On Fri, Aug 16, 2024 at 4:17 PM Michal Hocko <mhocko@suse.com> wrote: > > Andrew, could you merge the following before PF_MEMALLOC_NORECLAIM can > be removed from the tree altogether please? For the full context the > email thread starts here: https://lore.kernel.org/all/20240812090525.80299-1-laoar.shao@gmail.com/T/#u > --- > From f17d36975ec343d9388aa6dbf9ca8d1b58ed09ce Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.com> > Date: Fri, 16 Aug 2024 10:10:00 +0200 > Subject: [PATCH] mm: document risk of PF_MEMALLOC_NORECLAIM > > PF_MEMALLOC_NORECLAIM has been added even when it was pointed out [1] > that such a allocation contex is inherently unsafe if the context > doesn't fully control all allocations called from this context. Any > potential __GFP_NOFAIL request from withing PF_MEMALLOC_NORECLAIM > context would BUG_ON if the allocation would fail. > > [1] https://lore.kernel.org/all/ZcM0xtlKbAOFjv5n@tiehlicka/ > > Signed-off-by: Michal Hocko <mhocko@suse.com> Documenting the risk is a good first step. For this change: Acked-by: Yafang Shao <laoar.shao@gmail.com> Even without the PF_MEMALLOC_NORECLAIM flag, the underlying risk remains, as users can still potentially set both ~__GPF_DIRECT_RECLAIM and __GFP_NOFAIL. PF_MEMALLOC_NORECLAIM does not create this risk; it only exacerbates it. The core problem lies in the complexity of the various GFP flags and the lack of robust management for them. While we have extensive documentation on these flags, it can still be confusing, particularly for new developers who haven't yet encountered real-world issues. For instance: * %GFP_NOWAIT is for kernel allocations that should not stall for direct * reclaim, #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM | __GFP_NOWARN) Initially, it wasn't clear to me why setting __GFP_KSWAPD_RECLAIM and __GFP_NOWARN would prevent direct reclaim. It only became apparent after I studied the entire code path of page allocation. I believe other newcomers to kernel development may face similar confusion as I did early in my experience. The real issue we need to address is improving the management of these GFP flags, though I don't have a concrete solution at this time. Since I don't have a better alternative yet, I'm not strongly opposed to reverting the PF_MEMALLOC_NORECLAIM flag if it is deemed the root cause of the problem.
On Sat 17-08-24 10:29:31, Yafang Shao wrote: > On Fri, Aug 16, 2024 at 4:17 PM Michal Hocko <mhocko@suse.com> wrote: > > > > Andrew, could you merge the following before PF_MEMALLOC_NORECLAIM can > > be removed from the tree altogether please? For the full context the > > email thread starts here: https://lore.kernel.org/all/20240812090525.80299-1-laoar.shao@gmail.com/T/#u > > --- > > From f17d36975ec343d9388aa6dbf9ca8d1b58ed09ce Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.com> > > Date: Fri, 16 Aug 2024 10:10:00 +0200 > > Subject: [PATCH] mm: document risk of PF_MEMALLOC_NORECLAIM > > > > PF_MEMALLOC_NORECLAIM has been added even when it was pointed out [1] > > that such a allocation contex is inherently unsafe if the context > > doesn't fully control all allocations called from this context. Any > > potential __GFP_NOFAIL request from withing PF_MEMALLOC_NORECLAIM > > context would BUG_ON if the allocation would fail. > > > > [1] https://lore.kernel.org/all/ZcM0xtlKbAOFjv5n@tiehlicka/ > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > Documenting the risk is a good first step. For this change: > > Acked-by: Yafang Shao <laoar.shao@gmail.com> > > Even without the PF_MEMALLOC_NORECLAIM flag, the underlying risk > remains, as users can still potentially set both ~__GPF_DIRECT_RECLAIM > and __GFP_NOFAIL. Users can configure all sorts of nonsensical gfp flags combination. That is a sad reality of the interface. But we do assume that kernel code is somehow sane. Besides that Barry is working on making this less likely by droppong __GFP_NOFAIL and replace it by GFP_NOFAIL which always includes __GFP_DIRECT_RECLAIM. Sure nothing will prevent callers from clearing that flag explicitly but we have no real defense afains broken code. > PF_MEMALLOC_NORECLAIM does not create this risk; it > only exacerbates it. The core problem lies in the complexity of the > various GFP flags and the lack of robust management for them. While we > have extensive documentation on these flags, it can still be > confusing, particularly for new developers who haven't yet encountered > real-world issues. > > For instance: > > * %GFP_NOWAIT is for kernel allocations that should not stall for direct > * reclaim, > #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM | __GFP_NOWARN) > > Initially, it wasn't clear to me why setting __GFP_KSWAPD_RECLAIM and > __GFP_NOWARN would prevent direct reclaim. It only became apparent > after I studied the entire code path of page allocation. I believe > other newcomers to kernel development may face similar confusion as I > did early in my experience. > > The real issue we need to address is improving the management of these > GFP flags, though I don't have a concrete solution at this time. Welcome to the club. Changing this interface is a _huge_ undertaking. Just have a look how many users of the gfp flags we have in the kernel. I can tell you from a first hand experience that even minor tweaks are really hard to make.
On Fri 16-08-24 10:54:39, Michal Hocko wrote: > On Fri 16-08-24 01:22:37, Christoph Hellwig wrote: > > On Fri, Aug 16, 2024 at 10:17:54AM +0200, Michal Hocko wrote: > > > Andrew, could you merge the following before PF_MEMALLOC_NORECLAIM can > > > be removed from the tree altogether please? For the full context the > > > email thread starts here: https://lore.kernel.org/all/20240812090525.80299-1-laoar.shao@gmail.com/T/#u > > > > I don't think that's enough. We just need to kill it given that it was > > added without ACKs and despite explicit earlier objections to the API. > > Yes, I think we should kill it before it spreads even more but I would > not like to make the existing user just broken. I have zero visibility > and understanding of the bcachefs code but from a quick look at __bch2_new_inode > it shouldn't be really terribly hard to push GFP_NOWAIT flag there > directly. It would require inode_init_always_gfp variant as well (to not > touch all existing callers that do not have any locking requirements but > I do not see any other nested allocations. > > So a very quick and untested change would look as follows: Anybody managed to give it a more detailed look? I have a fixup for [...] > diff --git a/security/security.c b/security/security.c > index 8cee5b6c6e6d..8634d3bee56f 100644 > --- a/security/security.c > +++ b/security/security.c > @@ -660,14 +660,14 @@ static int lsm_file_alloc(struct file *file) > * > * Returns 0, or -ENOMEM if memory can't be allocated. > */ > -int lsm_inode_alloc(struct inode *inode) > +int lsm_inode_alloc(struct inode *inode, gfp) this > { > if (!lsm_inode_cache) { > inode->i_security = NULL; > return 0; > } > > - inode->i_security = kmem_cache_zalloc(lsm_inode_cache, GFP_NOFS); > + inode->i_security = kmem_cache_zalloc(lsm_inode_cache, gfp); > if (inode->i_security == NULL) > return -ENOMEM; > return 0;
On Fri, Aug 16, 2024 at 10:54:37AM +0200, Michal Hocko wrote: > Yes, I think we should kill it before it spreads even more but I would > not like to make the existing user just broken. I have zero visibility > and understanding of the bcachefs code but from a quick look at __bch2_new_inode > it shouldn't be really terribly hard to push GFP_NOWAIT flag there > directly. > I don't understand that sentence. You're adding the gfp_t argument to it, which to mean counts as pushing it there directly. > It would require inode_init_always_gfp variant as well (to not > touch all existing callers that do not have any locking requirements but > I do not see any other nested allocations. inode_init_always only has 4 callers, so I'd just add the gfp_t argument. Otherwise this looks good modulo the fix your posted: Acked-by: Christoph Hellwig <hch@lst.de>
On Wed 21-08-24 04:44:03, Christoph Hellwig wrote: > On Fri, Aug 16, 2024 at 10:54:37AM +0200, Michal Hocko wrote: > > Yes, I think we should kill it before it spreads even more but I would > > not like to make the existing user just broken. I have zero visibility > > and understanding of the bcachefs code but from a quick look at __bch2_new_inode > > it shouldn't be really terribly hard to push GFP_NOWAIT flag there > > directly. > > > I don't understand that sentence. You're adding the gfp_t argument to > it, which to mean counts as pushing it there directly. Sorry, what I meant to say is that pushing GFP_NOWAIT directly seem fine unless I have missed some hidden corners down the call path which would require a scope flag to override a hardcoded gfp flag. > > It would require inode_init_always_gfp variant as well (to not > > touch all existing callers that do not have any locking requirements but > > I do not see any other nested allocations. > > inode_init_always only has 4 callers, so I'd just add the gfp_t > argument. Otherwise this looks good modulo the fix your posted: > > Acked-by: Christoph Hellwig <hch@lst.de> Thanks. I will wait for more review and post this as a real patch. I would really appreciate any help with actual testing.
On Wed, Aug 21, 2024 at 02:37:33PM GMT, Michal Hocko wrote: > On Wed 21-08-24 04:44:03, Christoph Hellwig wrote: > > On Fri, Aug 16, 2024 at 10:54:37AM +0200, Michal Hocko wrote: > > > Yes, I think we should kill it before it spreads even more but I would > > > not like to make the existing user just broken. I have zero visibility > > > and understanding of the bcachefs code but from a quick look at __bch2_new_inode > > > it shouldn't be really terribly hard to push GFP_NOWAIT flag there > > > directly. > > > > > I don't understand that sentence. You're adding the gfp_t argument to > > it, which to mean counts as pushing it there directly. > > Sorry, what I meant to say is that pushing GFP_NOWAIT directly seem fine > unless I have missed some hidden corners down the call path which would > require a scope flag to override a hardcoded gfp flag. > > > > It would require inode_init_always_gfp variant as well (to not > > > touch all existing callers that do not have any locking requirements but > > > I do not see any other nested allocations. > > > > inode_init_always only has 4 callers, so I'd just add the gfp_t > > argument. Otherwise this looks good modulo the fix your posted: > > > > Acked-by: Christoph Hellwig <hch@lst.de> > > Thanks. I will wait for more review and post this as a real patch. I > would really appreciate any help with actual testing. Ugh, I really don't like that we have to add a flag argument especially for an api that's broken but fine.
diff --git a/include/linux/sched.h b/include/linux/sched.h index f8d150343d42..0c9061d2a8bd 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1657,7 +1657,12 @@ extern struct pid *cad_pid; * I am cleaning dirty pages from some other bdi. */ #define PF_KTHREAD 0x00200000 /* I am a kernel thread */ #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */ -#define PF_MEMALLOC_NORECLAIM 0x00800000 /* All allocation requests will clear __GFP_DIRECT_RECLAIM */ +#define PF_MEMALLOC_NORECLAIM 0x00800000 /* All allocation requests will clear __GFP_DIRECT_RECLAIM. + * This is inherently unsafe unless the context fully controls + * all allocations used. Any potential __GFP_NOFAIL nested allocation + * could BUG_ON as the page allocator doesn't support non-sleeping + * __GFP_NOFAIL requests. + */ #define PF_MEMALLOC_NOWARN 0x01000000 /* All allocation requests will inherit __GFP_NOWARN */ #define PF__HOLE__02000000 0x02000000 #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */