Message ID | 20230818065507.1280625-1-haydenw.kernel@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | nfsd: fix race condition in nfsd_file_acquire | expand |
On Fri, 18 Aug 2023, Haodong Wong wrote: > Before Kernel 6.1, we observed the following OOPS in the stress test > caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING, > and smp_mb__after_atomic() should be a paire. If you saw this "before kernel 6.1" does that mean you don't see it after kernel 6.1? So has it already been fixed? What kernel are you targeting with your patch? It doesn't apply to mainline, but looks like it might to 6.0. The oops message is from 5.10.104. Maybe that is where you want a fix? I assume you want this fix to go to a -stable kernel? It would be good to identify which upstream patch fixed the problem, then either backport that, or explain why something simpler is needed. > > Task A: Task B: > > nfsd_file_acquire: > > new = nfsd_file_alloc() > open_file: > refcount_inc(&nf->nf_ref); The key step in Task A is hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head); Until that completes, nfsd_file_find_locked() cannot find the new file. hlist_add_head_rcu() uses "rcu_assign_pointer()" which should include all the barriers needed. > nf = nfsd_file_find_locked(); > wait_for_construction: > > since nf_flags is zero it will not wait > > wait_on_bit(&nf->nf_flags, > NFSD_FILE_PENDING); > > if (status == nfs_ok) { > *pnf = nf; //OOPS happen! The oops message suggests that after nfsd_read() calls nfsd_file_acquire() there is no error, but nf is NULL. So the nf->nf_file access causes the oops. nf_file is at offset 0x28, which is the virtual address mentioned in the oops. So do you think 'nf' is NULL at this point where you write "OOPS happen!" ?? I cannot see how that might happen even if wait_on_bit() didn't actually wait. Maybe if you could explain if a bit more detail what you think is happening. What exactly is NULL which causes the OOPS, and how exactly does it end up being NULL. I cannot see what might be the cause, but the oops makes it clear that it did happen. Also is this a pure 5.10.104 kernel, or are there other patches on it? Thanks, NeilBrown > > Unable to handle kernel NULL pointer at virtual address 0000000000000028 > Mem abort info: > ESR = 0x96000004 > EC = 0x25: DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > Data abort info: > ISV = 0, ISS = 0x00000004 > CM = 0, WnR = 0 > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000 > [0000000000000028] pgd=0000000000000000, p4d=0000000000000000 > Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP > CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1 > pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--) > pc : nfsd_read+0x78/0x280 [nfsd] > lr : nfsd_read+0x68/0x280 [nfsd] > sp : ffff80001c0b3c70 > x29: ffff80001c0b3c70 x28: 0000000000000000 > x27: 0000000000000002 x26: ffff0000c8a3ca70 > x25: ffff0000c8a45180 x24: 0000000000002000 > x23: ffff0000c8a45178 x22: ffff0000c8a45008 > x21: ffff0000c31aac40 x20: ffff0000c8a3c000 > x19: 0000000000000000 x18: 0000000000000001 > x17: 0000000000000007 x16: 00000000b35db681 > x15: 0000000000000156 x14: ffff0000c3f91300 > x13: 0000000000000000 x12: 0000000000000000 > x11: 0000000000000000 x10: 0000000000000000 > x9 : 0000000000000000 x8 : ffff000118014a80 > x7 : 0000000000000002 x6 : ffff0002559142dc > x5 : ffff0000c31aac40 x4 : 0000000000000004 > x3 : 0000000000000001 x2 : 0000000000000000 > x1 : 0000000000000001 x0 : ffff000255914280 > Call trace: > nfsd_read+0x78/0x280 [nfsd] > nfsd3_proc_read+0x98/0xc0 [nfsd] > nfsd_dispatch+0xc8/0x160 [nfsd] > svc_process_common+0x440/0x790 > svc_process+0xb0/0xd0 > nfsd+0xfc/0x160 [nfsd] > kthread+0x17c/0x1a0 > ret_from_fork+0x10/0x18 > > Signed-off-by: Haodong Wong <haydenw.kernel@gmail.com> > --- > fs/nfsd/filecache.c | 9 ++++++++- > 1 file changed, 8 insertions(+), 1 deletion(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index e30e1ddc1ace..ba980369e6b4 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -974,8 +974,12 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, > nfsd_file_slab_free(&new->nf_rcu); > > wait_for_construction: > + /* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */ > + smp_rmb(); > wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE); > > + /* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */ > + smp_mb__after_atomic(); > /* Did construction of this file fail? */ > if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) { > if (!retry) { > @@ -1018,8 +1022,11 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, > nf = new; > /* Take reference for the hashtable */ > refcount_inc(&nf->nf_ref); > - __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); > __set_bit(NFSD_FILE_PENDING, &nf->nf_flags); > + /* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */ > + smp_wmb(); > + __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); > + > list_lru_add(&nfsd_file_lru, &nf->nf_lru); > hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head); > ++nfsd_file_hashtbl[hashval].nfb_count; > -- > 2.25.1 > >
On Mon, Aug 21, 2023 at 8:27 AM NeilBrown <neilb@suse.de> wrote: > > On Fri, 18 Aug 2023, Haodong Wong wrote: > > Before Kernel 6.1, we observed the following OOPS in the stress test > > caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING, > > and smp_mb__after_atomic() should be a paire. > > If you saw this "before kernel 6.1" does that mean you don't see it > after kernel 6.1? So has it already been fixed? > I just found after the patched c4c649ab413ba "NFSD: Convert filecache to rhltable" , this issue should not have . It was fixed as below: -nfsd_file_alloc(struct nfsd_file_lookup_key *key, unsigned int may) +nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need, + bool want_gc) { struct nfsd_file *nf; nf = kmem_cache_alloc(nfsd_file_slab, GFP_KERNEL); - if (nf) { - INIT_LIST_HEAD(&nf->nf_lru); - nf->nf_birthtime = ktime_get(); - nf->nf_file = NULL; - nf->nf_cred = get_current_cred(); - nf->nf_net = key->net; - nf->nf_flags = 0; - __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); - __set_bit(NFSD_FILE_PENDING, &nf->nf_flags); - if (key->gc) - __set_bit(NFSD_FILE_GC, &nf->nf_flags); - nf->nf_inode = key->inode; - refcount_set(&nf->nf_ref, 1); - nf->nf_may = key->need; - nf->nf_mark = NULL; - } + if (unlikely(!nf)) + return NULL; + + INIT_LIST_HEAD(&nf->nf_lru); + nf->nf_birthtime = ktime_get(); + nf->nf_file = NULL; + nf->nf_cred = get_current_cred(); + nf->nf_net = net; + nf->nf_flags = want_gc ? + BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING) | BIT(NFSD_FILE_GC) : + BIT(NFSD_FILE_HASHED) | BIT(NFSD_FILE_PENDING); + nf->nf_inode = inode; + refcount_set(&nf->nf_ref, 1); + nf->nf_may = need; + nf->nf_mark = NULL; return nf; } Before above patch, this OOPS happen as below Task A Task B nfs_read nfs_read nfsd_file_acquire nfsd_file_acquire __set_bit(NFSD_FILE_HASHED, &nf->nf_flags) nf = nfsd_file_find_locked(); wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING); since rcu_assign_pointer has barrier in hlist_add_head_rcu, but before smp_store_release in which two __set_bit can cross update rcu hlist head pointer (first), so Task B wait_on_bit didn't wait, just go below: *pnf = nf; file = nf->nf_file file->f_op->splice_read *OOPS here in nfs_read! I also found similar issue at here https://lkml.org/lkml/2023/1/4/880 Since this patch can fix this issue, I discard my commit Very glad to receive your reply. Thanks a lot! Regards, Haodong Wong > What kernel are you targeting with your patch? It doesn't apply to > mainline, but looks like it might to 6.0. The oops message is from > 5.10.104. Maybe that is where you want a fix? > The target is before kernel 6.1 > I assume you want this fix to go to a -stable kernel? It would be good > to identify which upstream patch fixed the problem, then either backport > that, or explain why something simpler is needed. > > > > > Task A: Task B: > > > > nfsd_file_acquire: > > > > new = nfsd_file_alloc() > > open_file: > > refcount_inc(&nf->nf_ref); > > The key step in Task A is > hlist_add_head_rcu(&nf->nf_node, > &nfsd_file_hashtbl[hashval].nfb_head); > > Until that completes, nfsd_file_find_locked() cannot find the new file. > hlist_add_head_rcu() uses "rcu_assign_pointer()" which should include > all the barriers needed. > > > nf = nfsd_file_find_locked(); > > wait_for_construction: > > > > since nf_flags is zero it will not wait > > > > wait_on_bit(&nf->nf_flags, > > NFSD_FILE_PENDING); > > > > if (status == nfs_ok) { > > *pnf = nf; //OOPS happen! > > The oops message suggests that after nfsd_read() calls > nfsd_file_acquire() there is no error, but nf is NULL. > So the nf->nf_file access causes the oops. nf_file is at offset > 0x28, which is the virtual address mentioned in the oops. > > So do you think 'nf' is NULL at this point where you write "OOPS > happen!" ?? Sorry, something missing, it oops happened in nfs_read after nfsd_file_acquire *pnf = nf, because it copy nf_file is still NULL as above > I cannot see how that might happen even if wait_on_bit() didn't > actually wait. > > Maybe if you could explain if a bit more detail what you think is > happening. What exactly is NULL which causes the OOPS, and how exactly > does it end up being NULL. > I cannot see what might be the cause, but the oops makes it clear that > it did happen. > > Also is this a pure 5.10.104 kernel, or are there other patches on it? > It is applied with PREEMPT_RT Patch > Thanks, > NeilBrown > > > > > > > Unable to handle kernel NULL pointer at virtual address 0000000000000028 > > Mem abort info: > > ESR = 0x96000004 > > EC = 0x25: DABT (current EL), IL = 32 bits > > SET = 0, FnV = 0 > > EA = 0, S1PTW = 0 > > Data abort info: > > ISV = 0, ISS = 0x00000004 > > CM = 0, WnR = 0 > > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000 > > [0000000000000028] pgd=0000000000000000, p4d=0000000000000000 > > Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP > > CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1 > > pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--) > > pc : nfsd_read+0x78/0x280 [nfsd] > > lr : nfsd_read+0x68/0x280 [nfsd] > > sp : ffff80001c0b3c70 > > x29: ffff80001c0b3c70 x28: 0000000000000000 > > x27: 0000000000000002 x26: ffff0000c8a3ca70 > > x25: ffff0000c8a45180 x24: 0000000000002000 > > x23: ffff0000c8a45178 x22: ffff0000c8a45008 > > x21: ffff0000c31aac40 x20: ffff0000c8a3c000 > > x19: 0000000000000000 x18: 0000000000000001 > > x17: 0000000000000007 x16: 00000000b35db681 > > x15: 0000000000000156 x14: ffff0000c3f91300 > > x13: 0000000000000000 x12: 0000000000000000 > > x11: 0000000000000000 x10: 0000000000000000 > > x9 : 0000000000000000 x8 : ffff000118014a80 > > x7 : 0000000000000002 x6 : ffff0002559142dc > > x5 : ffff0000c31aac40 x4 : 0000000000000004 > > x3 : 0000000000000001 x2 : 0000000000000000 > > x1 : 0000000000000001 x0 : ffff000255914280 > > Call trace: > > nfsd_read+0x78/0x280 [nfsd] > > nfsd3_proc_read+0x98/0xc0 [nfsd] > > nfsd_dispatch+0xc8/0x160 [nfsd] > > svc_process_common+0x440/0x790 > > svc_process+0xb0/0xd0 > > nfsd+0xfc/0x160 [nfsd] > > kthread+0x17c/0x1a0 > > ret_from_fork+0x10/0x18 > > > > Signed-off-by: Haodong Wong <haydenw.kernel@gmail.com> > > --- > > fs/nfsd/filecache.c | 9 ++++++++- > > 1 file changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > > index e30e1ddc1ace..ba980369e6b4 100644 > > --- a/fs/nfsd/filecache.c > > +++ b/fs/nfsd/filecache.c > > @@ -974,8 +974,12 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, > > nfsd_file_slab_free(&new->nf_rcu); > > > > wait_for_construction: > > + /* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */ > > + smp_rmb(); > > wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE); > > > > + /* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */ > > + smp_mb__after_atomic(); > > /* Did construction of this file fail? */ > > if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) { > > if (!retry) { > > @@ -1018,8 +1022,11 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, > > nf = new; > > /* Take reference for the hashtable */ > > refcount_inc(&nf->nf_ref); > > - __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); > > __set_bit(NFSD_FILE_PENDING, &nf->nf_flags); > > + /* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */ > > + smp_wmb(); > > + __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); > > + > > list_lru_add(&nfsd_file_lru, &nf->nf_lru); > > hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head); > > ++nfsd_file_hashtbl[hashval].nfb_count; > > -- > > 2.25.1 > > > > >
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index e30e1ddc1ace..ba980369e6b4 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -974,8 +974,12 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, nfsd_file_slab_free(&new->nf_rcu); wait_for_construction: + /* In case of set bit NFSD_FILE_PENDING and NFSD_FILE_HASHED reorder */ + smp_rmb(); wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE); + /* Be a paire of smp_mb after clear bit NFSD_FILE_PENDING */ + smp_mb__after_atomic(); /* Did construction of this file fail? */ if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) { if (!retry) { @@ -1018,8 +1022,11 @@ nfsd_file_acquire(struct svc_rqst *rqstp, struct svc_fh *fhp, nf = new; /* Take reference for the hashtable */ refcount_inc(&nf->nf_ref); - __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); __set_bit(NFSD_FILE_PENDING, &nf->nf_flags); + /* Ensure set bit order set NFSD_FILE_HASHED after set NFSD_FILE_PENDING */ + smp_wmb(); + __set_bit(NFSD_FILE_HASHED, &nf->nf_flags); + list_lru_add(&nfsd_file_lru, &nf->nf_lru); hlist_add_head_rcu(&nf->nf_node, &nfsd_file_hashtbl[hashval].nfb_head); ++nfsd_file_hashtbl[hashval].nfb_count;
Before Kernel 6.1, we observed the following OOPS in the stress test caused by reorder on set bit NFSD_FILE_HASHED and NFSD_FILE_PENDING, and smp_mb__after_atomic() should be a paire. Task A: Task B: nfsd_file_acquire: new = nfsd_file_alloc() open_file: refcount_inc(&nf->nf_ref); nf = nfsd_file_find_locked(); wait_for_construction: since nf_flags is zero it will not wait wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING); if (status == nfs_ok) { *pnf = nf; //OOPS happen! Unable to handle kernel NULL pointer at virtual address 0000000000000028 Mem abort info: ESR = 0x96000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000152546000 [0000000000000028] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 96000004 [#1] PREEMPT_RT SMP CPU: 7 PID: 1767 Comm: nfsd Not tainted 5.10.104 #1 pstate: 40c00005 (nZcv daif +PAN +UAO -TCO BTYPE=--) pc : nfsd_read+0x78/0x280 [nfsd] lr : nfsd_read+0x68/0x280 [nfsd] sp : ffff80001c0b3c70 x29: ffff80001c0b3c70 x28: 0000000000000000 x27: 0000000000000002 x26: ffff0000c8a3ca70 x25: ffff0000c8a45180 x24: 0000000000002000 x23: ffff0000c8a45178 x22: ffff0000c8a45008 x21: ffff0000c31aac40 x20: ffff0000c8a3c000 x19: 0000000000000000 x18: 0000000000000001 x17: 0000000000000007 x16: 00000000b35db681 x15: 0000000000000156 x14: ffff0000c3f91300 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : ffff000118014a80 x7 : 0000000000000002 x6 : ffff0002559142dc x5 : ffff0000c31aac40 x4 : 0000000000000004 x3 : 0000000000000001 x2 : 0000000000000000 x1 : 0000000000000001 x0 : ffff000255914280 Call trace: nfsd_read+0x78/0x280 [nfsd] nfsd3_proc_read+0x98/0xc0 [nfsd] nfsd_dispatch+0xc8/0x160 [nfsd] svc_process_common+0x440/0x790 svc_process+0xb0/0xd0 nfsd+0xfc/0x160 [nfsd] kthread+0x17c/0x1a0 ret_from_fork+0x10/0x18 Signed-off-by: Haodong Wong <haydenw.kernel@gmail.com> --- fs/nfsd/filecache.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-)