nfsd: fix race condition in nfsd_file_acquire

Message ID	20230818065507.1280625-1-haydenw.kernel@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-nfs-owner@vger.kernel.org> From: Haodong Wong <haydenw.kernel@gmail.com> To: chuck.lever@oracle.com, jlayton@kernel.org Cc: haodong.wong@nio.com, "J. Bruce Fields" <bfields@fieldses.org>, linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [PATCH] nfsd: fix race condition in nfsd_file_acquire Date: Fri, 18 Aug 2023 14:55:07 +0800 Message-Id: <20230818065507.1280625-1-haydenw.kernel@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	nfsd: fix race condition in nfsd_file_acquire \| expand nfsd: fix race condition in nfsd_file_acquire

Message ID

20230818065507.1280625-1-haydenw.kernel@gmail.com (mailing list archive)

State

New, archived

Headers

From: Haodong Wong <haydenw.kernel@gmail.com>
To: chuck.lever@oracle.com, jlayton@kernel.org
Cc: haodong.wong@nio.com, "J. Bruce Fields" <bfields@fieldses.org>,
        linux-nfs@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH] nfsd: fix race condition in nfsd_file_acquire
Date: Fri, 18 Aug 2023 14:55:07 +0800
Message-Id: <20230818065507.1280625-1-haydenw.kernel@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

nfsd: fix race condition in nfsd_file_acquire | expand

Commit Message

Hayden Wong Aug. 18, 2023, 6:55 a.m. UTC

Before Kernel 6.1, we observed the following OOPS in the stress test
caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING,
and smp_mb__after_atomic() should be a paire.

Task A:                         Task B:

nfsd_file_acquire:

    new = nfsd_file_alloc()
    open_file:
    refcount_inc(&nf->nf_ref);
                                 nf = nfsd_file_find_locked();
                                 wait_for_construction:

                                 since nf_flags is zero it will not wait

                                 wait_on_bit(&nf->nf_flags,
                                                    NFSD_FILE_PENDING);

                                if (status == nfs_ok) {
                                     *pnf = nf;      //OOPS happen!

Unable to handle kernel NULL pointer at virtual address 0000000000000028
Mem abort info:
  ESR = 0x96000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x00000004
  CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000
[0000000000000028] pgd=0000000000000000, p4d=0000000000000000
Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP
CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1
pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--)
pc : nfsd_read+0x78/0x280 [nfsd]
lr : nfsd_read+0x68/0x280 [nfsd]
sp : ffff80001c0b3c70
x29: ffff80001c0b3c70 x28: 0000000000000000
x27: 0000000000000002 x26: ffff0000c8a3ca70
x25: ffff0000c8a45180 x24: 0000000000002000
x23: ffff0000c8a45178 x22: ffff0000c8a45008
x21: ffff0000c31aac40 x20: ffff0000c8a3c000
x19: 0000000000000000 x18: 0000000000000001
x17: 0000000000000007 x16: 00000000b35db681
x15: 0000000000000156 x14: ffff0000c3f91300
x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000000 x10: 0000000000000000
x9 : 0000000000000000 x8 : ffff000118014a80
x7 : 0000000000000002 x6 : ffff0002559142dc
x5 : ffff0000c31aac40 x4 : 0000000000000004
x3 : 0000000000000001 x2 : 0000000000000000
x1 : 0000000000000001 x0 : ffff000255914280
Call trace:
 nfsd_read+0x78/0x280 [nfsd]
 nfsd3_proc_read+0x98/0xc0 [nfsd]
 nfsd_dispatch+0xc8/0x160 [nfsd]
 svc_process_common+0x440/0x790
 svc_process+0xb0/0xd0
 nfsd+0xfc/0x160 [nfsd]
 kthread+0x17c/0x1a0
 ret_from_fork+0x10/0x18

Signed-off-by: Haodong Wong <haydenw.kernel@gmail.com>
---
 fs/nfsd/filecache.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

Comments

NeilBrown Aug. 21, 2023, 12:27 a.m. UTC | #1

On Fri, 18 Aug 2023, Haodong Wong wrote:
> Before Kernel 6.1, we observed the following OOPS in the stress test
> caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING,
> and smp_mb__after_atomic() should be a paire.

If you saw this "before kernel 6.1" does that mean you don't see it
after kernel 6.1?  So has it already been fixed?

What kernel are you targeting with your patch?  It doesn't apply to
mainline, but looks like it might to 6.0.  The oops message is from
5.10.104.  Maybe that is where you want a fix?

I assume you want this fix to go to a -stable kernel?  It would be good
to identify which upstream patch fixed the problem, then either backport
that, or explain why something simpler is needed.

> 
> Task A:                         Task B:
> 
> nfsd_file_acquire:
> 
>                           new = nfsd_file_alloc()
>                           open_file:
>                           refcount_inc(&nf->nf_ref);

The key step in Task A is 
	hlist_add_head_rcu(&nf->nf_node,
		 &nfsd_file_hashtbl[hashval].nfb_head);

Until that completes, nfsd_file_find_locked() cannot find the new file.
hlist_add_head_rcu() uses "rcu_assign_pointer()" which should include
all the barriers needed.

>                                  nf = nfsd_file_find_locked();
>                                  wait_for_construction:
> 
>                                  since nf_flags is zero it will not wait
> 
>                                  wait_on_bit(&nf->nf_flags,
>                                                     NFSD_FILE_PENDING);
> 
>                                 if (status == nfs_ok) {
>                                      *pnf = nf;      //OOPS happen!

The oops message suggests that after nfsd_read() calls
nfsd_file_acquire() there is no error, but nf is NULL.
So the  nf->nf_file access causes the oops.  nf_file is at offset
0x28, which is the virtual address mentioned in the oops.

So do you think 'nf' is NULL at this point where you write "OOPS
happen!" ?? 
I cannot see how that might happen even if wait_on_bit() didn't
actually wait.

Maybe if you could explain if a bit more detail what you think is
happening.  What exactly is NULL which causes the OOPS, and how exactly
does it end up being NULL.
I cannot see what might be the cause, but the oops makes it clear that
it did happen.

Also is this a pure 5.10.104 kernel, or are there other patches on it?

Thanks,
NeilBrown



> 
> Unable to handle kernel NULL pointer at virtual address 0000000000000028
> Mem abort info:
>   ESR = 0x96000004
>   EC = 0x25: DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x00000004
>   CM = 0, WnR = 0
> user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000
> [0000000000000028] pgd=0000000000000000, p4d=0000000000000000
> Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP
> CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1
> pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--)
> pc : nfsd_read+0x78/0x280 [nfsd]
> lr : nfsd_read+0x68/0x280 [nfsd]
> sp : ffff80001c0b3c70
> x29: ffff80001c0b3c70 x28: 0000000000000000
> x27: 0000000000000002 x26: ffff0000c8a3ca70
> x25: ffff0000c8a45180 x24: 0000000000002000
> x23: ffff0000c8a45178 x22: ffff0000c8a45008
> x21: ffff0000c31aac40 x20: ffff0000c8a3c000
> x19: 0000000000000000 x18: 0000000000000001
> x17: 0000000000000007 x16: 00000000b35db681
> x15: 0000000000000156 x14: ffff0000c3f91300
> x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000000 x10: 0000000000000000
> x9 : 0000000000000000 x8 : ffff000118014a80
> x7 : 0000000000000002 x6 : ffff0002559142dc
> x5 : ffff0000c31aac40 x4 : 0000000000000004
> x3 : 0000000000000001 x2 : 0000000000000000
> x1 : 0000000000000001 x0 : ffff000255914280
> Call trace:
>  nfsd_read+0x78/0x280 [nfsd]
>  nfsd3_proc_read+0x98/0xc0 [nfsd]
>  nfsd_dispatch+0xc8/0x160 [nfsd]
>  svc_process_common+0x440/0x790
>  svc_process+0xb0/0xd0
>  nfsd+0xfc/0x160 [nfsd]
>  kthread+0x17c/0x1a0
>  ret_from_fork+0x10/0x18
> 
> Signed-off-by: Haodong Wong <haydenw.kernel@gmail.com>
> ---
>  fs/nfsd/filecache.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index e30e1ddc1ace..ba980369e6b4 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -974,8 +974,12 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	nfsd_file_slab_free(&new->nf_rcu);
>  
>  wait_for_construction:
> +	/* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */
> +	smp_rmb();
>  	wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE);
>  
> +	/* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */
> +	smp_mb__after_atomic();
>  	/* Did construction of this file fail? */
>  	if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
>  		if (!retry) {
> @@ -1018,8 +1022,11 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  	nf = new;
>  	/* Take reference for the hashtable */
>  	refcount_inc(&nf->nf_ref);
> -	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
>  	__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> +	/* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */
> +	smp_wmb();
> +	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> +
>  	list_lru_add(&nfsd_file_lru, &nf->nf_lru);
>  	hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
>  	++nfsd_file_hashtbl[hashval].nfb_count;
> -- 
> 2.25.1
> 
>

Hayden Wong Aug. 22, 2023, 9:31 a.m. UTC | #2

On Mon, Aug 21, 2023 at 8:27 AM NeilBrown <neilb@suse.de> wrote:
>
> On Fri, 18 Aug 2023, Haodong Wong wrote:
> > Before Kernel 6.1, we observed the following OOPS in the stress test
> > caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING,
> > and smp_mb__after_atomic() should be a paire.
>
> If you saw this "before kernel 6.1" does that mean you don't see it
> after kernel 6.1?  So has it already been fixed?
>
I just found after the patched c4c649ab413ba "NFSD: Convert filecache
to rhltable" ,  this issue should not have .
It was fixed as below:
-nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may)
+nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
+               bool want_gc)
 {
        struct nfsd_file *nf;

        nf = kmem_cache_alloc(nfsd_file_slab, GFP_KERNEL);
-       if (nf) {
-               INIT_LIST_HEAD(&nf->nf_lru);
-               nf->nf_birthtime = ktime_get();
-               nf->nf_file = NULL;
-               nf->nf_cred = get_current_cred();
-               nf->nf_net = key->net;
-               nf->nf_flags = 0;
-               __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
-               __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
-               if (key->gc)
-                       __set_bit(NFSD_FILE_GC, &nf->nf_flags);
-               nf->nf_inode = key->inode;
-               refcount_set(&nf->nf_ref, 1);
-               nf->nf_may = key->need;
-               nf->nf_mark = NULL;
-       }
+       if (unlikely(!nf))
+               return NULL;
+
+       INIT_LIST_HEAD(&nf->nf_lru);
+       nf->nf_birthtime = ktime_get();
+       nf->nf_file = NULL;
+       nf->nf_cred = get_current_cred();
+       nf->nf_net = net;
+       nf->nf_flags = want_gc ?
+               BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING) |
BIT(NFSD_FILE_GC) :
+               BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING);
+       nf->nf_inode = inode;
+       refcount_set(&nf->nf_ref, 1);
+       nf->nf_may = need;
+       nf->nf_mark = NULL;
        return nf;
 }

Before above patch, this OOPS happen as below

Task A
                                  Task B

nfs_read
                                nfs_read
      nfsd_file_acquire
                               nfsd_file_acquire
          __set_bit(NFSD_FILE_HASHED, &nf->nf_flags)
             nf = nfsd_file_find_locked();



wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING);
            since rcu_assign_pointer has barrier in
hlist_add_head_rcu, but before smp_store_release in which
            two __set_bit can cross update rcu hlist head pointer
(first), so Task B wait_on_bit didn't wait, just go below:


                                                 *pnf = nf;

                                        file = nf->nf_file

                                        file->f_op->splice_read  *OOPS
here in nfs_read!

I also found similar issue at here https://lkml.org/lkml/2023/1/4/880

Since this  patch can fix this issue, I discard my commit
Very glad to receive your reply.
Thanks a lot!

Regards,
Haodong Wong

> What kernel are you targeting with your patch?  It doesn't apply to
> mainline, but looks like it might to 6.0.  The oops message is from
> 5.10.104.  Maybe that is where you want a fix?
>
The target is before kernel 6.1

> I assume you want this fix to go to a -stable kernel?  It would be good
> to identify which upstream patch fixed the problem, then either backport
> that, or explain why something simpler is needed.
>


> >
> > Task A:                         Task B:
> >
> > nfsd_file_acquire:
> >
> >                           new = nfsd_file_alloc()
> >                           open_file:
> >                           refcount_inc(&nf->nf_ref);
>
> The key step in Task A is
>         hlist_add_head_rcu(&nf->nf_node,
>                  &nfsd_file_hashtbl[hashval].nfb_head);
>
> Until that completes, nfsd_file_find_locked() cannot find the new file.
> hlist_add_head_rcu() uses "rcu_assign_pointer()" which should include
> all the barriers needed.
>
> >                                  nf = nfsd_file_find_locked();
> >                                  wait_for_construction:
> >
> >                                  since nf_flags is zero it will not wait
> >
> >                                  wait_on_bit(&nf->nf_flags,
> >                                                     NFSD_FILE_PENDING);
> >
> >                                 if (status == nfs_ok) {
> >                                      *pnf = nf;      //OOPS happen!
>
> The oops message suggests that after nfsd_read() calls
> nfsd_file_acquire() there is no error, but nf is NULL.
> So the  nf->nf_file access causes the oops.  nf_file is at offset
> 0x28, which is the virtual address mentioned in the oops.
>
> So do you think 'nf' is NULL at this point where you write "OOPS
> happen!" ??

Sorry, something missing, it oops happened in nfs_read after
nfsd_file_acquire *pnf = nf, because it copy nf_file is still NULL as
above

> I cannot see how that might happen even if wait_on_bit() didn't
> actually wait.
>
> Maybe if you could explain if a bit more detail what you think is
> happening.  What exactly is NULL which causes the OOPS, and how exactly
> does it end up being NULL.
> I cannot see what might be the cause, but the oops makes it clear that
> it did happen.
>
> Also is this a pure 5.10.104 kernel, or are there other patches on it?
>
It is applied with PREEMPT_RT Patch

> Thanks,
> NeilBrown

>
>
>
> >
> > Unable to handle kernel NULL pointer at virtual address 0000000000000028
> > Mem abort info:
> >   ESR = 0x96000004
> >   EC = 0x25: DABT (current EL), IL = 32 bits
> >   SET = 0, FnV = 0
> >   EA = 0, S1PTW = 0
> > Data abort info:
> >   ISV = 0, ISS = 0x00000004
> >   CM = 0, WnR = 0
> > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000
> > [0000000000000028] pgd=0000000000000000, p4d=0000000000000000
> > Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP
> > CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1
> > pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--)
> > pc : nfsd_read+0x78/0x280 [nfsd]
> > lr : nfsd_read+0x68/0x280 [nfsd]
> > sp : ffff80001c0b3c70
> > x29: ffff80001c0b3c70 x28: 0000000000000000
> > x27: 0000000000000002 x26: ffff0000c8a3ca70
> > x25: ffff0000c8a45180 x24: 0000000000002000
> > x23: ffff0000c8a45178 x22: ffff0000c8a45008
> > x21: ffff0000c31aac40 x20: ffff0000c8a3c000
> > x19: 0000000000000000 x18: 0000000000000001
> > x17: 0000000000000007 x16: 00000000b35db681
> > x15: 0000000000000156 x14: ffff0000c3f91300
> > x13: 0000000000000000 x12: 0000000000000000
> > x11: 0000000000000000 x10: 0000000000000000
> > x9 : 0000000000000000 x8 : ffff000118014a80
> > x7 : 0000000000000002 x6 : ffff0002559142dc
> > x5 : ffff0000c31aac40 x4 : 0000000000000004
> > x3 : 0000000000000001 x2 : 0000000000000000
> > x1 : 0000000000000001 x0 : ffff000255914280
> > Call trace:
> >  nfsd_read+0x78/0x280 [nfsd]
> >  nfsd3_proc_read+0x98/0xc0 [nfsd]
> >  nfsd_dispatch+0xc8/0x160 [nfsd]
> >  svc_process_common+0x440/0x790
> >  svc_process+0xb0/0xd0
> >  nfsd+0xfc/0x160 [nfsd]
> >  kthread+0x17c/0x1a0
> >  ret_from_fork+0x10/0x18
> >
> > Signed-off-by: Haodong Wong <haydenw.kernel@gmail.com>
> > ---
> >  fs/nfsd/filecache.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index e30e1ddc1ace..ba980369e6b4 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -974,8 +974,12 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >       nfsd_file_slab_free(&new->nf_rcu);
> >
> >  wait_for_construction:
> > +     /* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */
> > +     smp_rmb();
> >       wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE);
> >
> > +     /* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */
> > +     smp_mb__after_atomic();
> >       /* Did construction of this file fail? */
> >       if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> >               if (!retry) {
> > @@ -1018,8 +1022,11 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
> >       nf = new;
> >       /* Take reference for the hashtable */
> >       refcount_inc(&nf->nf_ref);
> > -     __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> >       __set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
> > +     /* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */
> > +     smp_wmb();
> > +     __set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
> > +
> >       list_lru_add(&nfsd_file_lru, &nf->nf_lru);
> >       hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
> >       ++nfsd_file_hashtbl[hashval].nfb_count;
> > --
> > 2.25.1
> >
> >
>

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index e30e1ddc1ace..ba980369e6b4 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -974,8 +974,12 @@  nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	nfsd_file_slab_free(&new->nf_rcu);
 
 wait_for_construction:
+	/* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */
+	smp_rmb();
 	wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE);
 
+	/* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */
+	smp_mb__after_atomic();
 	/* Did construction of this file fail? */
 	if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
 		if (!retry) {
@@ -1018,8 +1022,11 @@  nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	nf = new;
 	/* Take reference for the hashtable */
 	refcount_inc(&nf->nf_ref);
-	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
 	__set_bit(NFSD_FILE_PENDING, &nf->nf_flags);
+	/* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */
+	smp_wmb();
+	__set_bit(NFSD_FILE_HASHED, &nf->nf_flags);
+
 	list_lru_add(&nfsd_file_lru, &nf->nf_lru);
 	hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head);
 	++nfsd_file_hashtbl[hashval].nfb_count;