[RFC] nfsd: avoid recursive locking through fsnotify

Message ID 20220319001635.4097742-1-khazhy@google.com (mailing list archive)
State New, archived
Series [RFC] nfsd: avoid recursive locking through fsnotify

Commit Message

Khazhy Kumykov March 19, 2022, 12:16 a.m. UTC
fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may recurse
back into nfsd via direct reclaim, resulting in deadlock. See the stack below.

nfsd            D    0 1591536      2 0x80004080
Call Trace:
 __schedule+0x497/0x630
 schedule+0x67/0x90
 schedule_preempt_disabled+0xe/0x10
 __mutex_lock+0x347/0x4b0
 fsnotify_destroy_mark+0x22/0xa0
 nfsd_file_free+0x79/0xd0 [nfsd]
 nfsd_file_put_noref+0x7c/0x90 [nfsd]
 nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
 nfsd_file_lru_scan+0x57/0x80 [nfsd]
 do_shrink_slab+0x1f2/0x330
 shrink_slab+0x244/0x2f0
 shrink_node+0xd7/0x490
 do_try_to_free_pages+0x12f/0x3b0
 try_to_free_pages+0x43f/0x540
 __alloc_pages_slowpath+0x6ab/0x11c0
 __alloc_pages_nodemask+0x274/0x2c0
 alloc_slab_page+0x32/0x2e0
 new_slab+0xa6/0x8b0
 ___slab_alloc+0x34b/0x520
 kmem_cache_alloc+0x1c4/0x250
 fsnotify_add_mark_locked+0x18d/0x4c0
 fsnotify_add_mark+0x48/0x70
 nfsd_file_acquire+0x570/0x6f0 [nfsd]
 nfsd_read+0xa7/0x1c0 [nfsd]
 nfsd3_proc_read+0xc1/0x110 [nfsd]
 nfsd_dispatch+0xf7/0x240 [nfsd]
 svc_process_common+0x2f4/0x610 [sunrpc]
 svc_process+0xf9/0x110 [sunrpc]
 nfsd+0x10e/0x180 [nfsd]
 kthread+0x130/0x140
 ret_from_fork+0x35/0x40

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
---
 fs/nfsd/filecache.c | 4 ++++
 1 file changed, 4 insertions(+)

Marking this RFC since I haven't actually had a chance to test this;
we're seeing this deadlock for some customers.

Comments

Trond Myklebust March 19, 2022, 12:36 a.m. UTC | #1
On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> result
> in recursing back into nfsd, resulting in deadlock. See below stack.
> 
> nfsd            D    0 1591536      2 0x80004080
> Call Trace:
>  __schedule+0x497/0x630
>  schedule+0x67/0x90
>  schedule_preempt_disabled+0xe/0x10
>  __mutex_lock+0x347/0x4b0
>  fsnotify_destroy_mark+0x22/0xa0
>  nfsd_file_free+0x79/0xd0 [nfsd]
>  nfsd_file_put_noref+0x7c/0x90 [nfsd]
>  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
>  nfsd_file_lru_scan+0x57/0x80 [nfsd]
>  do_shrink_slab+0x1f2/0x330
>  shrink_slab+0x244/0x2f0
>  shrink_node+0xd7/0x490
>  do_try_to_free_pages+0x12f/0x3b0
>  try_to_free_pages+0x43f/0x540
>  __alloc_pages_slowpath+0x6ab/0x11c0
>  __alloc_pages_nodemask+0x274/0x2c0
>  alloc_slab_page+0x32/0x2e0
>  new_slab+0xa6/0x8b0
>  ___slab_alloc+0x34b/0x520
>  kmem_cache_alloc+0x1c4/0x250
>  fsnotify_add_mark_locked+0x18d/0x4c0
>  fsnotify_add_mark+0x48/0x70
>  nfsd_file_acquire+0x570/0x6f0 [nfsd]
>  nfsd_read+0xa7/0x1c0 [nfsd]
>  nfsd3_proc_read+0xc1/0x110 [nfsd]
>  nfsd_dispatch+0xf7/0x240 [nfsd]
>  svc_process_common+0x2f4/0x610 [sunrpc]
>  svc_process+0xf9/0x110 [sunrpc]
>  nfsd+0x10e/0x180 [nfsd]
>  kthread+0x130/0x140
>  ret_from_fork+0x35/0x40
> 
> Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> ---
>  fs/nfsd/filecache.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> Marking this RFC since I haven't actually had a chance to test this,
> we
> we're seeing this deadlock for some customers.
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index fdf89fcf1a0c..a14760f9b486 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> *nf)
>         struct fsnotify_mark    *mark;
>         struct nfsd_file_mark   *nfm = NULL, *new;
>         struct inode *inode = nf->nf_inode;
> +       unsigned int pflags;
>  
>         do {
>                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> *nf)
>                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
>                 refcount_set(&new->nfm_ref, 1);
>  
> +               /* fsnotify allocates, avoid recursion back into nfsd
> */
> +               pflags = memalloc_nofs_save();
>                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode,
> 0);
> +               memalloc_nofs_restore(pflags);
>  
>                 /*
>                  * If the add was successful, then return the object.

Isn't that stack trace showing a slab direct reclaim, and not a
filesystem writeback situation?

Does memalloc_nofs_save()/restore() really fix this problem? It seems
to me that it cannot, particularly since knfsd is not a filesystem, and
so does not ever handle writeback of dirty pages.

Cc: the linux-mm mailing list in search of answers to the above 2
questions.
Khazhy Kumykov March 19, 2022, 1:45 a.m. UTC | #2
On Fri, Mar 18, 2022 at 5:36 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> Isn't that stack trace showing a slab direct reclaim, and not a
> filesystem writeback situation?
>
> Does memalloc_nofs_save()/restore() really fix this problem? It seems
> to me that it cannot, particularly since knfsd is not a filesystem, and
> so does not ever handle writeback of dirty pages.

Ah, you're right. An alternative would be delaying the destroy_mark;
I'm noticing now that, on mainline, the shrinker calls
nfsd_file_dispose_list_delayed, something missing from the 5.4 kernel
I was looking at.

To my reading, 9542e6a643fc6 ("nfsd: Containerise filecache
laundrette") should have the effect of fixing this deadlock (and its
commit message explicitly calls out notifier deadlocks) - maybe something
to send towards stable?
>
>
> Cc: the linux-mm mailing list in search of answers to the above 2
> questions.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
>
>
Amir Goldstein March 19, 2022, 9:36 a.m. UTC | #3
On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > result
> > in recursing back into nfsd, resulting in deadlock. See below stack.
> >
> > nfsd            D    0 1591536      2 0x80004080
> > Call Trace:
> >  __schedule+0x497/0x630
> >  schedule+0x67/0x90
> >  schedule_preempt_disabled+0xe/0x10
> >  __mutex_lock+0x347/0x4b0
> >  fsnotify_destroy_mark+0x22/0xa0
> >  nfsd_file_free+0x79/0xd0 [nfsd]
> >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> >  do_shrink_slab+0x1f2/0x330
> >  shrink_slab+0x244/0x2f0
> >  shrink_node+0xd7/0x490
> >  do_try_to_free_pages+0x12f/0x3b0
> >  try_to_free_pages+0x43f/0x540
> >  __alloc_pages_slowpath+0x6ab/0x11c0
> >  __alloc_pages_nodemask+0x274/0x2c0
> >  alloc_slab_page+0x32/0x2e0
> >  new_slab+0xa6/0x8b0
> >  ___slab_alloc+0x34b/0x520
> >  kmem_cache_alloc+0x1c4/0x250
> >  fsnotify_add_mark_locked+0x18d/0x4c0
> >  fsnotify_add_mark+0x48/0x70
> >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> >  nfsd_read+0xa7/0x1c0 [nfsd]
> >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> >  nfsd_dispatch+0xf7/0x240 [nfsd]
> >  svc_process_common+0x2f4/0x610 [sunrpc]
> >  svc_process+0xf9/0x110 [sunrpc]
> >  nfsd+0x10e/0x180 [nfsd]
> >  kthread+0x130/0x140
> >  ret_from_fork+0x35/0x40
> >
> > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > ---
> >  fs/nfsd/filecache.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > Marking this RFC since I haven't actually had a chance to test this,
> > we
> > we're seeing this deadlock for some customers.
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index fdf89fcf1a0c..a14760f9b486 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > *nf)
> >         struct fsnotify_mark    *mark;
> >         struct nfsd_file_mark   *nfm = NULL, *new;
> >         struct inode *inode = nf->nf_inode;
> > +       unsigned int pflags;
> >
> >         do {
> >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > *nf)
> >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> >                 refcount_set(&new->nfm_ref, 1);
> >
> > +               /* fsnotify allocates, avoid recursion back into nfsd
> > */
> > +               pflags = memalloc_nofs_save();
> >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode,
> > 0);
> > +               memalloc_nofs_restore(pflags);
> >
> >                 /*
> >                  * If the add was successful, then return the object.
>
> Isn't that stack trace showing a slab direct reclaim, and not a
> filesystem writeback situation?
>
> Does memalloc_nofs_save()/restore() really fix this problem? It seems
> to me that it cannot, particularly since knfsd is not a filesystem, and
> so does not ever handle writeback of dirty pages.
>

Maybe NOFS throttles direct reclaims to the point that the problem is
harder to hit?

This report came in at a good time for me.

It demonstrates an issue I did not predict for "volatile" fanotify marks [1].
As far as I can tell, nfsd filecache is currently the only fsnotify backend that
frees fsnotify marks in a memory shrinker. "volatile" fanotify marks would also
be evictable in that way, so they would expose fanotify to this deadlock.

For the short term, maybe nfsd filecache can avoid the problem by checking
mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and aborting the
shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?
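
Roughly something like this in the LRU scan callback (illustrative only -
mutex_is_locked() does not tell us who holds the mutex, so this is just a
sketch of the idea, not a proposed patch):

static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
	/*
	 * Sketch only: if mark_mutex is already held (possibly by this
	 * very task, allocating under the mutex), freeing nfsd_file
	 * objects here would reach fsnotify_destroy_mark() and block
	 * on the same mutex, so back off and let other reclaim run.
	 */
	if (mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex))
		return SHRINK_STOP;

	/* ... existing LRU walk and dispose logic ... */
	return 0;
}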

Jan,

A relatively simple fix would be to allocate the fsnotify_mark_connector in
fsnotify_add_mark() and free it if a connector already exists for the object.
I don't think there is a good reason to optimize away this allocation
for the case of a group that is not the first to set a mark on an object?

Thanks,
Amir.



[1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/
Jan Kara March 21, 2022, 11:23 a.m. UTC | #4
On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
> >
> > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > result
> > > in recursing back into nfsd, resulting in deadlock. See below stack.
> > >
> > > nfsd            D    0 1591536      2 0x80004080
> > > Call Trace:
> > >  __schedule+0x497/0x630
> > >  schedule+0x67/0x90
> > >  schedule_preempt_disabled+0xe/0x10
> > >  __mutex_lock+0x347/0x4b0
> > >  fsnotify_destroy_mark+0x22/0xa0
> > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > >  do_shrink_slab+0x1f2/0x330
> > >  shrink_slab+0x244/0x2f0
> > >  shrink_node+0xd7/0x490
> > >  do_try_to_free_pages+0x12f/0x3b0
> > >  try_to_free_pages+0x43f/0x540
> > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > >  __alloc_pages_nodemask+0x274/0x2c0
> > >  alloc_slab_page+0x32/0x2e0
> > >  new_slab+0xa6/0x8b0
> > >  ___slab_alloc+0x34b/0x520
> > >  kmem_cache_alloc+0x1c4/0x250
> > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > >  fsnotify_add_mark+0x48/0x70
> > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > >  svc_process+0xf9/0x110 [sunrpc]
> > >  nfsd+0x10e/0x180 [nfsd]
> > >  kthread+0x130/0x140
> > >  ret_from_fork+0x35/0x40
> > >
> > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > > ---
> > >  fs/nfsd/filecache.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > Marking this RFC since I haven't actually had a chance to test this,
> > > we
> > > we're seeing this deadlock for some customers.
> > >
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index fdf89fcf1a0c..a14760f9b486 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > *nf)
> > >         struct fsnotify_mark    *mark;
> > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > >         struct inode *inode = nf->nf_inode;
> > > +       unsigned int pflags;
> > >
> > >         do {
> > >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > *nf)
> > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > >                 refcount_set(&new->nfm_ref, 1);
> > >
> > > +               /* fsnotify allocates, avoid recursion back into nfsd
> > > */
> > > +               pflags = memalloc_nofs_save();
> > >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode,
> > > 0);
> > > +               memalloc_nofs_restore(pflags);
> > >
> > >                 /*
> > >                  * If the add was successful, then return the object.
> >
> > Isn't that stack trace showing a slab direct reclaim, and not a
> > filesystem writeback situation?
> >
> > Does memalloc_nofs_save()/restore() really fix this problem? It seems
> > to me that it cannot, particularly since knfsd is not a filesystem, and
> > so does not ever handle writeback of dirty pages.
> >
> 
> Maybe NOFS throttles direct reclaims to the point that the problem is
> harder to hit?
> 
> This report came in at good timing for me.
> 
> It demonstrates an issue I did not predict for "volatile"' fanotify marks [1].
> As far as I can tell, nfsd filecache is currently the only fsnotify backend that
> frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also
> be evictable in that way, so they would expose fanotify to this deadlock.
> 
> For the short term, maybe nfsd filecache can avoid the problem by checking
> mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the
> shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?
> 
> Jan,
> 
> A relatively simple fix would be to allocate fsnotify_mark_connector in
> fsnotify_add_mark() and free it, if a connector already exists for the object.
> I don't think there is a good reason to optimize away this allocation
> for the case of a non-first group to set a mark on an object?

Indeed, nasty. Volatile marks will add group->mark_mutex into a set of
locks grabbed during inode slab reclaim. So any allocation under
group->mark_mutex has to be GFP_NOFS now. This is not just about connector
allocations but also mark allocations for fanotify. Moving allocations from
under mark_mutex is also a possible solution, but passing preallocated memory
around is kind of ugly as well. So the cleanest solution I currently see is
to come up with helpers like "fsnotify_lock_group() &
fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do the
memalloc_nofs_save / restore magic.
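
Something like the following, perhaps (just a sketch; the owner_flags
member for stashing the saved allocation state does not exist today and
is only illustrative):

static inline void fsnotify_lock_group(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	/* owner_flags is a hypothetical new field in fsnotify_group */
	group->owner_flags = memalloc_nofs_save();
}

static inline void fsnotify_unlock_group(struct fsnotify_group *group)
{
	memalloc_nofs_restore(group->owner_flags);
	mutex_unlock(&group->mark_mutex);
}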

								Honza
Amir Goldstein March 21, 2022, 11:56 a.m. UTC | #5
On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
> > >
> > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > > result
> > > > in recursing back into nfsd, resulting in deadlock. See below stack.
> > > >
> > > > nfsd            D    0 1591536      2 0x80004080
> > > > Call Trace:
> > > >  __schedule+0x497/0x630
> > > >  schedule+0x67/0x90
> > > >  schedule_preempt_disabled+0xe/0x10
> > > >  __mutex_lock+0x347/0x4b0
> > > >  fsnotify_destroy_mark+0x22/0xa0
> > > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > >  do_shrink_slab+0x1f2/0x330
> > > >  shrink_slab+0x244/0x2f0
> > > >  shrink_node+0xd7/0x490
> > > >  do_try_to_free_pages+0x12f/0x3b0
> > > >  try_to_free_pages+0x43f/0x540
> > > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > > >  __alloc_pages_nodemask+0x274/0x2c0
> > > >  alloc_slab_page+0x32/0x2e0
> > > >  new_slab+0xa6/0x8b0
> > > >  ___slab_alloc+0x34b/0x520
> > > >  kmem_cache_alloc+0x1c4/0x250
> > > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > > >  fsnotify_add_mark+0x48/0x70
> > > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > > >  svc_process+0xf9/0x110 [sunrpc]
> > > >  nfsd+0x10e/0x180 [nfsd]
> > > >  kthread+0x130/0x140
> > > >  ret_from_fork+0x35/0x40
> > > >
> > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > > > ---
> > > >  fs/nfsd/filecache.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > Marking this RFC since I haven't actually had a chance to test this,
> > > > we
> > > > we're seeing this deadlock for some customers.
> > > >
> > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > index fdf89fcf1a0c..a14760f9b486 100644
> > > > --- a/fs/nfsd/filecache.c
> > > > +++ b/fs/nfsd/filecache.c
> > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > > *nf)
> > > >         struct fsnotify_mark    *mark;
> > > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > > >         struct inode *inode = nf->nf_inode;
> > > > +       unsigned int pflags;
> > > >
> > > >         do {
> > > >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > > *nf)
> > > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > > >                 refcount_set(&new->nfm_ref, 1);
> > > >
> > > > +               /* fsnotify allocates, avoid recursion back into nfsd
> > > > */
> > > > +               pflags = memalloc_nofs_save();
> > > >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode,
> > > > 0);
> > > > +               memalloc_nofs_restore(pflags);
> > > >
> > > >                 /*
> > > >                  * If the add was successful, then return the object.
> > >
> > > Isn't that stack trace showing a slab direct reclaim, and not a
> > > filesystem writeback situation?
> > >
> > > Does memalloc_nofs_save()/restore() really fix this problem? It seems
> > > to me that it cannot, particularly since knfsd is not a filesystem, and
> > > so does not ever handle writeback of dirty pages.
> > >
> >
> > Maybe NOFS throttles direct reclaims to the point that the problem is
> > harder to hit?
> >
> > This report came in at good timing for me.
> >
> > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1].
> > As far as I can tell, nfsd filecache is currently the only fsnotify backend that
> > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also
> > be evictable in that way, so they would expose fanotify to this deadlock.
> >
> > For the short term, maybe nfsd filecache can avoid the problem by checking
> > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the
> > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?
> >
> > Jan,
> >
> > A relatively simple fix would be to allocate fsnotify_mark_connector in
> > fsnotify_add_mark() and free it, if a connector already exists for the object.
> > I don't think there is a good reason to optimize away this allocation
> > for the case of a non-first group to set a mark on an object?
>
> Indeed, nasty. Volatile marks will add group->mark_mutex into a set of
> locks grabbed during inode slab reclaim. So any allocation under
> group->mark_mutex has to be GFP_NOFS now. This is not just about connector
> allocations but also mark allocations for fanotify. Moving allocations from
> under mark_mutex is also possible solution but passing preallocated memory
> around is kind of ugly as well.

Yes, kind of. Here is how it looks:
https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3
https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7

> So the cleanest solution I currently see is
> to come up with helpers like "fsnotify_lock_group() &
> fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do
> memalloc_nofs_save / restore magic.
>

Sounds good. Won't this cause a regression - more failures to set up a new
mark under memory pressure?

Should we maintain a flag in the group, say FSNOTIFY_GROUP_SHRINKABLE, and
set the NOFS state only in that case, so at least we don't cause a
regression for existing applications?
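
For illustration, the helper sketched above could then gate the NOFS
section on such a flag (FSNOTIFY_GROUP_SHRINKABLE and the group->flags
usage are assumptions here, not existing API):

static inline void fsnotify_lock_group(struct fsnotify_group *group)
{
	mutex_lock(&group->mark_mutex);
	if (group->flags & FSNOTIFY_GROUP_SHRINKABLE)	/* hypothetical */
		group->owner_flags = memalloc_nofs_save();
}

static inline void fsnotify_unlock_group(struct fsnotify_group *group)
{
	if (group->flags & FSNOTIFY_GROUP_SHRINKABLE)
		memalloc_nofs_restore(group->owner_flags);
	mutex_unlock(&group->mark_mutex);
}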

Thanks,
Amir.
Jan Kara March 21, 2022, 2:51 p.m. UTC | #6
On Mon 21-03-22 13:56:47, Amir Goldstein wrote:
> On Mon, Mar 21, 2022 at 1:23 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
> > > >
> > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > > > result
> > > > > in recursing back into nfsd, resulting in deadlock. See below stack.
> > > > >
> > > > > nfsd            D    0 1591536      2 0x80004080
> > > > > Call Trace:
> > > > >  __schedule+0x497/0x630
> > > > >  schedule+0x67/0x90
> > > > >  schedule_preempt_disabled+0xe/0x10
> > > > >  __mutex_lock+0x347/0x4b0
> > > > >  fsnotify_destroy_mark+0x22/0xa0
> > > > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > > > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > > >  do_shrink_slab+0x1f2/0x330
> > > > >  shrink_slab+0x244/0x2f0
> > > > >  shrink_node+0xd7/0x490
> > > > >  do_try_to_free_pages+0x12f/0x3b0
> > > > >  try_to_free_pages+0x43f/0x540
> > > > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > > > >  __alloc_pages_nodemask+0x274/0x2c0
> > > > >  alloc_slab_page+0x32/0x2e0
> > > > >  new_slab+0xa6/0x8b0
> > > > >  ___slab_alloc+0x34b/0x520
> > > > >  kmem_cache_alloc+0x1c4/0x250
> > > > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > > > >  fsnotify_add_mark+0x48/0x70
> > > > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > > > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > > > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > > > >  svc_process+0xf9/0x110 [sunrpc]
> > > > >  nfsd+0x10e/0x180 [nfsd]
> > > > >  kthread+0x130/0x140
> > > > >  ret_from_fork+0x35/0x40
> > > > >
> > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > > > > ---
> > > > >  fs/nfsd/filecache.c | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > Marking this RFC since I haven't actually had a chance to test this,
> > > > > we
> > > > > we're seeing this deadlock for some customers.
> > > > >
> > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > index fdf89fcf1a0c..a14760f9b486 100644
> > > > > --- a/fs/nfsd/filecache.c
> > > > > +++ b/fs/nfsd/filecache.c
> > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > > > *nf)
> > > > >         struct fsnotify_mark    *mark;
> > > > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > > > >         struct inode *inode = nf->nf_inode;
> > > > > +       unsigned int pflags;
> > > > >
> > > > >         do {
> > > > >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > > > *nf)
> > > > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > > > >                 refcount_set(&new->nfm_ref, 1);
> > > > >
> > > > > +               /* fsnotify allocates, avoid recursion back into nfsd
> > > > > */
> > > > > +               pflags = memalloc_nofs_save();
> > > > >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode,
> > > > > 0);
> > > > > +               memalloc_nofs_restore(pflags);
> > > > >
> > > > >                 /*
> > > > >                  * If the add was successful, then return the object.
> > > >
> > > > Isn't that stack trace showing a slab direct reclaim, and not a
> > > > filesystem writeback situation?
> > > >
> > > > Does memalloc_nofs_save()/restore() really fix this problem? It seems
> > > > to me that it cannot, particularly since knfsd is not a filesystem, and
> > > > so does not ever handle writeback of dirty pages.
> > > >
> > >
> > > Maybe NOFS throttles direct reclaims to the point that the problem is
> > > harder to hit?
> > >
> > > This report came in at good timing for me.
> > >
> > > It demonstrates an issue I did not predict for "volatile"' fanotify marks [1].
> > > As far as I can tell, nfsd filecache is currently the only fsnotify backend that
> > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also
> > > be evictable in that way, so they would expose fanotify to this deadlock.
> > >
> > > For the short term, maybe nfsd filecache can avoid the problem by checking
> > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the
> > > shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?
> > >
> > > Jan,
> > >
> > > A relatively simple fix would be to allocate fsnotify_mark_connector in
> > > fsnotify_add_mark() and free it, if a connector already exists for the object.
> > > I don't think there is a good reason to optimize away this allocation
> > > for the case of a non-first group to set a mark on an object?
> >
> > Indeed, nasty. Volatile marks will add group->mark_mutex into a set of
> > locks grabbed during inode slab reclaim. So any allocation under
> > group->mark_mutex has to be GFP_NOFS now. This is not just about connector
> > allocations but also mark allocations for fanotify. Moving allocations from
> > under mark_mutex is also possible solution but passing preallocated memory
> > around is kind of ugly as well.
> 
> Yes, kind of, here is how it looks:
> https://github.com/amir73il/linux/commit/643bb6b9f664f70f68ea0393a06338673c4966b3
> https://github.com/amir73il/linux/commit/66f27fc99e46b12f1078e8e2915793040ce50ee7

Yup, not an act of beauty but bearable in the worst case :).

> > So the cleanest solution I currently see is
> > to come up with helpers like "fsnotify_lock_group() &
> > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also do
> > memalloc_nofs_save / restore magic.
> >
> 
> Sounds good. Won't this cause a regression - more failures to setup new mark
> under memory pressure?

Well, yes, the chances of hitting ENOMEM under heavy memory pressure are
higher. But I don't think connectors or marks consume so much memory that
the reduced chances for direct reclaim would really matter for the system
as a whole.

> Should we maintain a flag in the group FSNOTIFY_GROUP_SHRINKABLE?
> and set NOFS state only in that case, so at least we don't cause regression
> for existing applications?

So that's a possibility I've kept up my sleeve ;). We could do it, but then
we'd also have to tell lockdep that there are two kinds of mark_mutex locks
so that it does not complain about possible reclaim deadlocks. Doable, but
at this point I didn't consider it worth it unless someone comes up with a
bug report from a real user scenario.
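
For the record, the lockdep annotation would be fairly mechanical -
roughly, giving the "shrinkable" groups their own lock class (the helper
name and where it gets called are purely illustrative):

/* Sketch: separate lockdep class for groups whose marks can be freed
 * from a shrinker, so reclaim-related lockdep checks apply only to them. */
static struct lock_class_key shrinkable_mark_mutex_key;

static void fsnotify_group_set_shrinkable(struct fsnotify_group *group)
{
	lockdep_set_class(&group->mark_mutex, &shrinkable_mark_mutex_key);
}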

								Honza
Khazhy Kumykov March 21, 2022, 5:06 p.m. UTC | #7
On Sat, Mar 19, 2022 at 2:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust <trondmy@hammerspace.com> wrote:
> >
> > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > result
> > > in recursing back into nfsd, resulting in deadlock. See below stack.
> > >
> > > nfsd            D    0 1591536      2 0x80004080
> > > Call Trace:
> > >  __schedule+0x497/0x630
> > >  schedule+0x67/0x90
> > >  schedule_preempt_disabled+0xe/0x10
> > >  __mutex_lock+0x347/0x4b0
> > >  fsnotify_destroy_mark+0x22/0xa0
> > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > >  do_shrink_slab+0x1f2/0x330
> > >  shrink_slab+0x244/0x2f0
> > >  shrink_node+0xd7/0x490
> > >  do_try_to_free_pages+0x12f/0x3b0
> > >  try_to_free_pages+0x43f/0x540
> > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > >  __alloc_pages_nodemask+0x274/0x2c0
> > >  alloc_slab_page+0x32/0x2e0
> > >  new_slab+0xa6/0x8b0
> > >  ___slab_alloc+0x34b/0x520
> > >  kmem_cache_alloc+0x1c4/0x250
> > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > >  fsnotify_add_mark+0x48/0x70
> > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > >  svc_process+0xf9/0x110 [sunrpc]
> > >  nfsd+0x10e/0x180 [nfsd]
> > >  kthread+0x130/0x140
> > >  ret_from_fork+0x35/0x40
> > >
> > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > > ---
> > >  fs/nfsd/filecache.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > Marking this RFC since I haven't actually had a chance to test this,
> > > we
> > > we're seeing this deadlock for some customers.
> > >
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index fdf89fcf1a0c..a14760f9b486 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > *nf)
> > >         struct fsnotify_mark    *mark;
> > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > >         struct inode *inode = nf->nf_inode;
> > > +       unsigned int pflags;
> > >
> > >         do {
> > >                 mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
> > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct nfsd_file
> > > *nf)
> > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > >                 refcount_set(&new->nfm_ref, 1);
> > >
> > > +               /* fsnotify allocates, avoid recursion back into nfsd
> > > */
> > > +               pflags = memalloc_nofs_save();
> > >                 err = fsnotify_add_inode_mark(&new->nfm_mark, inode,
> > > 0);
> > > +               memalloc_nofs_restore(pflags);
> > >
> > >                 /*
> > >                  * If the add was successful, then return the object.
> >
> > Isn't that stack trace showing a slab direct reclaim, and not a
> > filesystem writeback situation?
> >
> > Does memalloc_nofs_save()/restore() really fix this problem? It seems
> > to me that it cannot, particularly since knfsd is not a filesystem, and
> > so does not ever handle writeback of dirty pages.
> >
>
> Maybe NOFS throttles direct reclaims to the point that the problem is
> harder to hit?

(I think I simply got confused - I don't see a reason that NOFS would
help with direct reclaim, though it does look like the gfp flags are
passed via a shrink_control struct, so one *could* react to them in the
shrinker - again, not an area I'm super familiar with)
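
For illustration, a shrinker scan callback could honour that context
roughly like this (not the actual nfsd code, just a sketch):

static unsigned long example_lru_scan(struct shrinker *s,
				      struct shrink_control *sc)
{
	/*
	 * sc->gfp_mask carries the flags of the allocation that kicked
	 * off reclaim.  If __GFP_FS is clear (e.g. a NOFS allocation),
	 * skip freeing objects that would take filesystem locks.
	 */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;

	/* ... walk the LRU and free entries ... */
	return 0;
}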

>
> This report came in at good timing for me.
>
> It demonstrates an issue I did not predict for "volatile"' fanotify marks [1].
> As far as I can tell, nfsd filecache is currently the only fsnotify backend that
> frees fsnotify marks in memory shrinker. "volatile" fanotify marks would also
> be evictable in that way, so they would expose fanotify to this deadlock.
>
> For the short term, maybe nfsd filecache can avoid the problem by checking
> mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort the
> shrinker. I wonder if there is a place for a helper mutex_is_locked_by_me()?

fwiw, it does look like nfsd stopped freeing fsnotify marks during
reclaim around 5.5, in the commit "nfsd: Containerise filecache
laundrette" (I had sent an earlier email about this; not sure where
it's getting caught up, but I do see it on lore...)


>
> Jan,
>
> A relatively simple fix would be to allocate fsnotify_mark_connector in
> fsnotify_add_mark() and free it, if a connector already exists for the object.
> I don't think there is a good reason to optimize away this allocation
> for the case of a non-first group to set a mark on an object?
>
> Thanks,
> Amir.
>
>
>
> [1] https://lore.kernel.org/linux-fsdevel/20220307155741.1352405-1-amir73il@gmail.com/
Trond Myklebust March 21, 2022, 10:50 p.m. UTC | #8
On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote:
> On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > > 
> > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > > result
> > > > in recursing back into nfsd, resulting in deadlock. See below
> > > > stack.
> > > > 
> > > > nfsd            D    0 1591536      2 0x80004080
> > > > Call Trace:
> > > >  __schedule+0x497/0x630
> > > >  schedule+0x67/0x90
> > > >  schedule_preempt_disabled+0xe/0x10
> > > >  __mutex_lock+0x347/0x4b0
> > > >  fsnotify_destroy_mark+0x22/0xa0
> > > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > >  do_shrink_slab+0x1f2/0x330
> > > >  shrink_slab+0x244/0x2f0
> > > >  shrink_node+0xd7/0x490
> > > >  do_try_to_free_pages+0x12f/0x3b0
> > > >  try_to_free_pages+0x43f/0x540
> > > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > > >  __alloc_pages_nodemask+0x274/0x2c0
> > > >  alloc_slab_page+0x32/0x2e0
> > > >  new_slab+0xa6/0x8b0
> > > >  ___slab_alloc+0x34b/0x520
> > > >  kmem_cache_alloc+0x1c4/0x250
> > > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > > >  fsnotify_add_mark+0x48/0x70
> > > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > > >  svc_process+0xf9/0x110 [sunrpc]
> > > >  nfsd+0x10e/0x180 [nfsd]
> > > >  kthread+0x130/0x140
> > > >  ret_from_fork+0x35/0x40
> > > > 
> > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > > > ---
> > > >  fs/nfsd/filecache.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > > 
> > > > Marking this RFC since I haven't actually had a chance to test
> > > > this,
> > > > we
> > > > we're seeing this deadlock for some customers.
> > > > 
> > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > index fdf89fcf1a0c..a14760f9b486 100644
> > > > --- a/fs/nfsd/filecache.c
> > > > +++ b/fs/nfsd/filecache.c
> > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct
> > > > nfsd_file
> > > > *nf)
> > > >         struct fsnotify_mark    *mark;
> > > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > > >         struct inode *inode = nf->nf_inode;
> > > > +       unsigned int pflags;
> > > > 
> > > >         do {
> > > >                 mutex_lock(&nfsd_file_fsnotify_group-
> > > > >mark_mutex);
> > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct
> > > > nfsd_file
> > > > *nf)
> > > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > > >                 refcount_set(&new->nfm_ref, 1);
> > > > 
> > > > +               /* fsnotify allocates, avoid recursion back
> > > > into nfsd
> > > > */
> > > > +               pflags = memalloc_nofs_save();
> > > >                 err = fsnotify_add_inode_mark(&new->nfm_mark,
> > > > inode,
> > > > 0);
> > > > +               memalloc_nofs_restore(pflags);
> > > > 
> > > >                 /*
> > > >                  * If the add was successful, then return the
> > > > object.
> > > 
> > > Isn't that stack trace showing a slab direct reclaim, and not a
> > > filesystem writeback situation?
> > > 
> > > Does memalloc_nofs_save()/restore() really fix this problem? It
> > > seems
> > > to me that it cannot, particularly since knfsd is not a
> > > filesystem, and
> > > so does not ever handle writeback of dirty pages.
> > > 
> > 
> > Maybe NOFS throttles direct reclaims to the point that the problem
> > is
> > harder to hit?
> > 
> > This report came in at good timing for me.
> > 
> > It demonstrates an issue I did not predict for "volatile"' fanotify
> > marks [1].
> > As far as I can tell, nfsd filecache is currently the only fsnotify
> > backend that
> > frees fsnotify marks in memory shrinker. "volatile" fanotify marks
> > would also
> > be evictable in that way, so they would expose fanotify to this
> > deadlock.
> > 
> > For the short term, maybe nfsd filecache can avoid the problem by
> > checking
> > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort
> > the
> > shrinker. I wonder if there is a place for a helper
> > mutex_is_locked_by_me()?
> > 
> > Jan,
> > 
> > A relatively simple fix would be to allocate
> > fsnotify_mark_connector in
> > fsnotify_add_mark() and free it, if a connector already exists for
> > the object.
> > I don't think there is a good reason to optimize away this
> > allocation
> > for the case of a non-first group to set a mark on an object?
> 
> Indeed, nasty. Volatile marks will add group->mark_mutex into a set
> of
> locks grabbed during inode slab reclaim. So any allocation under
> group->mark_mutex has to be GFP_NOFS now. This is not just about
> connector
> allocations but also mark allocations for fanotify. Moving
> allocations from
> under mark_mutex is also possible solution but passing preallocated
> memory
> around is kind of ugly as well. So the cleanest solution I currently
> see is
> to come up with helpers like "fsnotify_lock_group() &
> fsnotify_unlock_group()" which will lock/unlock mark_mutex and also
> do
> memalloc_nofs_save / restore magic. 

As has already been reported, the problem was fixed in Linux 5.5 by the
garbage collector rewrite, and so this is no longer an issue.

In addition, please note that memalloc_nofs_save/restore and the use of
GFP_NOFS were never a solution, because they do not prevent the kind of
direct reclaim that was happening here. You'd have to enforce
GFP_NOWAIT allocations, afaics.
Khazhy Kumykov March 21, 2022, 11:36 p.m. UTC | #9
On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> As has already been reported, the problem was fixed in Linux 5.5 by the
> garbage collector rewrite, and so this is no longer an issue.
>
9542e6a643fc ("nfsd: Containerise filecache laundrette") and 36ebbdb96b69
("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly to 5.4.y for
me, which is still LTS. Since this should fix a real deadlock, would
it be appropriate to include them in the 5.4 stable tree?
Trond Myklebust March 21, 2022, 11:50 p.m. UTC | #10
On Mon, 2022-03-21 at 16:36 -0700, Khazhy Kumykov wrote:
> On Mon, Mar 21, 2022 at 3:50 PM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > 
> > As has already been reported, the problem was fixed in Linux 5.5 by
> > the
> > garbage collector rewrite, and so this is no longer an issue.
> > 
> 9542e6a643fc ("nfsd: Containerise filecache laundrette"),
> 36ebbdb96b69
> ("nfsd: cleanup nfsd_file_lru_dispose()") apply cleanly to 5.4.y for
> me, which is still LTS. Since this should fix a real deadlock, would
> it be appropriate to include them for the 5.4 stable?

That would be OK with me. I'm not aware of any side-effects that might
be a problem for 5.4.
Jan Kara March 22, 2022, 10:37 a.m. UTC | #11
On Mon 21-03-22 22:50:11, Trond Myklebust wrote:
> On Mon, 2022-03-21 at 12:23 +0100, Jan Kara wrote:
> > On Sat 19-03-22 11:36:13, Amir Goldstein wrote:
> > > On Sat, Mar 19, 2022 at 9:02 AM Trond Myklebust
> > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Fri, 2022-03-18 at 17:16 -0700, Khazhismel Kumykov wrote:
> > > > > fsnotify_add_inode_mark may allocate with GFP_KERNEL, which may
> > > > > result
> > > > > in recursing back into nfsd, resulting in deadlock. See below
> > > > > stack.
> > > > > 
> > > > > nfsd            D    0 1591536      2 0x80004080
> > > > > Call Trace:
> > > > >  __schedule+0x497/0x630
> > > > >  schedule+0x67/0x90
> > > > >  schedule_preempt_disabled+0xe/0x10
> > > > >  __mutex_lock+0x347/0x4b0
> > > > >  fsnotify_destroy_mark+0x22/0xa0
> > > > >  nfsd_file_free+0x79/0xd0 [nfsd]
> > > > >  nfsd_file_put_noref+0x7c/0x90 [nfsd]
> > > > >  nfsd_file_lru_dispose+0x6d/0xa0 [nfsd]
> > > > >  nfsd_file_lru_scan+0x57/0x80 [nfsd]
> > > > >  do_shrink_slab+0x1f2/0x330
> > > > >  shrink_slab+0x244/0x2f0
> > > > >  shrink_node+0xd7/0x490
> > > > >  do_try_to_free_pages+0x12f/0x3b0
> > > > >  try_to_free_pages+0x43f/0x540
> > > > >  __alloc_pages_slowpath+0x6ab/0x11c0
> > > > >  __alloc_pages_nodemask+0x274/0x2c0
> > > > >  alloc_slab_page+0x32/0x2e0
> > > > >  new_slab+0xa6/0x8b0
> > > > >  ___slab_alloc+0x34b/0x520
> > > > >  kmem_cache_alloc+0x1c4/0x250
> > > > >  fsnotify_add_mark_locked+0x18d/0x4c0
> > > > >  fsnotify_add_mark+0x48/0x70
> > > > >  nfsd_file_acquire+0x570/0x6f0 [nfsd]
> > > > >  nfsd_read+0xa7/0x1c0 [nfsd]
> > > > >  nfsd3_proc_read+0xc1/0x110 [nfsd]
> > > > >  nfsd_dispatch+0xf7/0x240 [nfsd]
> > > > >  svc_process_common+0x2f4/0x610 [sunrpc]
> > > > >  svc_process+0xf9/0x110 [sunrpc]
> > > > >  nfsd+0x10e/0x180 [nfsd]
> > > > >  kthread+0x130/0x140
> > > > >  ret_from_fork+0x35/0x40
> > > > > 
> > > > > Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
> > > > > ---
> > > > >  fs/nfsd/filecache.c | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > > 
> > > > > Marking this RFC since I haven't actually had a chance to test
> > > > > this,
> > > > > we
> > > > > we're seeing this deadlock for some customers.
> > > > > 
> > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > index fdf89fcf1a0c..a14760f9b486 100644
> > > > > --- a/fs/nfsd/filecache.c
> > > > > +++ b/fs/nfsd/filecache.c
> > > > > @@ -121,6 +121,7 @@ nfsd_file_mark_find_or_create(struct
> > > > > nfsd_file
> > > > > *nf)
> > > > >         struct fsnotify_mark    *mark;
> > > > >         struct nfsd_file_mark   *nfm = NULL, *new;
> > > > >         struct inode *inode = nf->nf_inode;
> > > > > +       unsigned int pflags;
> > > > > 
> > > > >         do {
> > > > >                 mutex_lock(&nfsd_file_fsnotify_group-
> > > > > >mark_mutex);
> > > > > @@ -149,7 +150,10 @@ nfsd_file_mark_find_or_create(struct
> > > > > nfsd_file
> > > > > *nf)
> > > > >                 new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
> > > > >                 refcount_set(&new->nfm_ref, 1);
> > > > > 
> > > > > +               /* fsnotify allocates, avoid recursion back
> > > > > into nfsd
> > > > > */
> > > > > +               pflags = memalloc_nofs_save();
> > > > >                 err = fsnotify_add_inode_mark(&new->nfm_mark,
> > > > > inode,
> > > > > 0);
> > > > > +               memalloc_nofs_restore(pflags);
> > > > > 
> > > > >                 /*
> > > > >                  * If the add was successful, then return the
> > > > > object.
> > > > 
> > > > Isn't that stack trace showing a slab direct reclaim, and not a
> > > > filesystem writeback situation?
> > > > 
> > > > Does memalloc_nofs_save()/restore() really fix this problem? It
> > > > seems
> > > > to me that it cannot, particularly since knfsd is not a
> > > > filesystem, and
> > > > so does not ever handle writeback of dirty pages.
> > > > 
> > > 
> > > Maybe NOFS throttles direct reclaims to the point that the problem
> > > is
> > > harder to hit?
> > > 
> > > This report came in at good timing for me.
> > > 
> > > It demonstrates an issue I did not predict for "volatile"' fanotify
> > > marks [1].
> > > As far as I can tell, nfsd filecache is currently the only fsnotify
> > > backend that
> > > frees fsnotify marks in memory shrinker. "volatile" fanotify marks
> > > would also
> > > be evictable in that way, so they would expose fanotify to this
> > > deadlock.
> > > 
> > > For the short term, maybe nfsd filecache can avoid the problem by
> > > checking
> > > mutex_is_locked(&nfsd_file_fsnotify_group->mark_mutex) and abort
> > > the
> > > shrinker. I wonder if there is a place for a helper
> > > mutex_is_locked_by_me()?
> > > 
> > > Jan,
> > > 
> > > A relatively simple fix would be to allocate
> > > fsnotify_mark_connector in
> > > fsnotify_add_mark() and free it, if a connector already exists for
> > > the object.
> > > I don't think there is a good reason to optimize away this
> > > allocation
> > > for the case of a non-first group to set a mark on an object?
> > 
> > Indeed, nasty. Volatile marks will add group->mark_mutex into a set
> > of
> > locks grabbed during inode slab reclaim. So any allocation under
> > group->mark_mutex has to be GFP_NOFS now. This is not just about
> > connector
> > allocations but also mark allocations for fanotify. Moving
> > allocations from
> > under mark_mutex is also possible solution but passing preallocated
> > memory
> > around is kind of ugly as well. So the cleanest solution I currently
> > see is
> > to come up with helpers like "fsnotify_lock_group() &
> > fsnotify_unlock_group()" which will lock/unlock mark_mutex and also
> > do
> > memalloc_nofs_save / restore magic. 
> 
> As has already been reported, the problem was fixed in Linux 5.5 by the
> garbage collector rewrite, and so this is no longer an issue.

Sorry, I was not clear enough, I guess. nfsd has not been a problem since
5.5, as you say. But Amir has changes in the works after which any
filesystem's inode reclaim could end up in exactly the same path (calling
fsnotify_destroy_mark() from clear_inode()). So these changes would
reintroduce the same deadlock nfsd was prone to before 5.5.

> In addition, please note that memalloc_nofs_save/restore and the use of
> GFP_NOFS was never a solution, because it does not prevent the kind of
> direct reclaim that was happening here. You'd have to enforce
> GFP_NOWAIT allocations, afaics.

GFP_NOFS should solve the above problem because with GFP_NOFS we cannot
enter inode reclaim from the memory allocation and thus end up freeing
marks.

								Honza

Patch

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index fdf89fcf1a0c..a14760f9b486 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -121,6 +121,7 @@  nfsd_file_mark_find_or_create(struct nfsd_file *nf)
 	struct fsnotify_mark	*mark;
 	struct nfsd_file_mark	*nfm = NULL, *new;
 	struct inode *inode = nf->nf_inode;
+	unsigned int pflags;
 
 	do {
 		mutex_lock(&nfsd_file_fsnotify_group->mark_mutex);
@@ -149,7 +150,10 @@  nfsd_file_mark_find_or_create(struct nfsd_file *nf)
 		new->nfm_mark.mask = FS_ATTRIB|FS_DELETE_SELF;
 		refcount_set(&new->nfm_ref, 1);
 
+		/* fsnotify allocates, avoid recursion back into nfsd */
+		pflags = memalloc_nofs_save();
 		err = fsnotify_add_inode_mark(&new->nfm_mark, inode, 0);
+		memalloc_nofs_restore(pflags);
 
 		/*
 		 * If the add was successful, then return the object.