diff mbox series

[1/2] kernfs: dont take i_lock on inode attr read

Message ID 166606036215.13363.1288735296954908554.stgit@donald.themaw.net (mailing list archive)
State New, archived
Headers show
Series kernfs: remove i_lock usage that isn't needed | expand

Commit Message

Ian Kent Oct. 18, 2022, 2:32 a.m. UTC
The kernfs write lock is held when the kernfs node inode attributes
are updated. Therefore, when either kernfs_iop_getattr() or
kernfs_iop_permission() are called the kernfs node inode attributes
won't change.

Consequently concurrent kernfs_refresh_inode() calls always copy the
same values from the kernfs node.

So there's no need to take the inode i_lock to get consistent values
for generic_fillattr() and generic_permission(), the kernfs read lock
is sufficient.

Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/kernfs/inode.c |    4 ----
 1 file changed, 4 deletions(-)

Comments

Miklos Szeredi Oct. 24, 2022, 8:50 a.m. UTC | #1
On Tue, 18 Oct 2022 at 04:32, Ian Kent <raven@themaw.net> wrote:
>
> The kernfs write lock is held when the kernfs node inode attributes
> are updated. Therefore, when either kernfs_iop_getattr() or
> kernfs_iop_permission() are called the kernfs node inode attributes
> won't change.
>
> Consequently concurrent kernfs_refresh_inode() calls always copy the
> same values from the kernfs node.
>
> So there's no need to take the inode i_lock to get consistent values
> for generic_fillattr() and generic_permission(), the kernfs read lock
> is sufficient.
>
> Signed-off-by: Ian Kent <raven@themaw.net>

Reviewed-by: Miklos Szeredi <mszeredi@redhat.com>
Tejun Heo Oct. 31, 2022, 10:30 p.m. UTC | #2
On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
> The kernfs write lock is held when the kernfs node inode attributes
> are updated. Therefore, when either kernfs_iop_getattr() or
> kernfs_iop_permission() are called the kernfs node inode attributes
> won't change.
> 
> Consequently concurrent kernfs_refresh_inode() calls always copy the
> same values from the kernfs node.
> 
> So there's no need to take the inode i_lock to get consistent values
> for generic_fillattr() and generic_permission(), the kernfs read lock
> is sufficient.
> 
> Signed-off-by: Ian Kent <raven@themaw.net>

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.
Anders Roxell Dec. 21, 2022, 1:34 p.m. UTC | #3
On 2022-10-31 12:30, Tejun Heo wrote:
> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
> > The kernfs write lock is held when the kernfs node inode attributes
> > are updated. Therefore, when either kernfs_iop_getattr() or
> > kernfs_iop_permission() are called the kernfs node inode attributes
> > won't change.
> > 
> > Consequently concurrent kernfs_refresh_inode() calls always copy the
> > same values from the kernfs node.
> > 
> > So there's no need to take the inode i_lock to get consistent values
> > for generic_fillattr() and generic_permission(), the kernfs read lock
> > is sufficient.
> > 
> > Signed-off-by: Ian Kent <raven@themaw.net>
> 
> Acked-by: Tejun Heo <tj@kernel.org>

Hi,

Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
booting that in qemu I see the following "BUG: KCSAN: data-race in
set_nlink / set_nlink".


==================================================================
[ 1540.388669][  T123] BUG: KCSAN: data-race in set_nlink / set_nlink
[ 1540.392779][  T123] 
[ 1540.394302][  T123] write to 0xffff00000adcc3e4 of 4 bytes by task 126 on cpu 0:
[ 1540.398828][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:371) 
[ 1540.401609][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181) 
[ 1540.404925][ T123] kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:194) 
[ 1540.408088][ T123] vfs_getattr_nosec (/home/anders/src/kernel/next/fs/stat.c:133) 
[ 1540.411334][ T123] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:170) 
[ 1540.414224][ T123] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276) 
[ 1540.417166][ T123] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446) 
[ 1540.420539][ T123] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440) 
[ 1540.424003][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142) 
[ 1540.427648][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197) 
[ 1540.430378][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638) 
[ 1540.433053][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656) 
[ 1540.436421][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584) 
[ 1540.439303][  T123] 
[ 1540.440828][  T123] 1 lock held by systemd-udevd/126:
[ 1540.444055][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:193) 
[ 1540.450699][  T123] irq event stamp: 185034
[ 1540.453373][ T123] hardirqs last enabled at (185034): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302) 
[ 1540.460087][ T123] hardirqs last disabled at (185033): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4)) 
[ 1540.466686][ T123] softirqs last enabled at (185001): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780) 
[ 1540.473310][ T123] softirqs last disabled at (184999): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773) 
[ 1540.479872][  T123] 
[ 1540.481406][  T123] read to 0xffff00000adcc3e4 of 4 bytes by task 123 on cpu 0:
[ 1540.485893][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:368) 
[ 1540.488663][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181) 
[ 1540.491961][ T123] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:290) 
[ 1540.495260][ T123] inode_permission (/home/anders/src/kernel/next/fs/namei.c:458 /home/anders/src/kernel/next/fs/namei.c:525) 
[ 1540.498450][ T123] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1715 /home/anders/src/kernel/next/fs/namei.c:2262) 
[ 1540.501552][ T123] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2473 (discriminator 2)) 
[ 1540.504592][ T123] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2503) 
[ 1540.507740][ T123] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2876) 
[ 1540.511010][ T123] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477) 
[ 1540.514097][ T123] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501) 
[ 1540.517598][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142) 
[ 1540.521319][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197) 
[ 1540.524125][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638) 
[ 1540.526795][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656) 
[ 1540.530224][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584) 
[ 1540.533176][  T123] 
[ 1540.534710][  T123] 1 lock held by systemd-udevd/123:
[ 1540.537977][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289) 
[ 1540.544892][  T123] irq event stamp: 216564
[ 1540.547603][ T123] hardirqs last enabled at (216564): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302) 
[ 1540.554368][ T123] hardirqs last disabled at (216563): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4)) 
[ 1540.561107][ T123] softirqs last enabled at (216533): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780) 
[ 1540.567833][ T123] softirqs last disabled at (216531): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773) 
[ 1540.574496][  T123] 
[ 1540.576050][  T123] Reported by Kernel Concurrency Sanitizer on:
[ 1540.587925][  T123] Hardware name: linux,dummy-virt (DT)
[ 1540.591282][  T123] ==================================================================


Reverting this patch I don't see the data race anymore.

Cheers,
Anders
Ian Kent Dec. 22, 2022, 11:11 p.m. UTC | #4
On 21/12/22 21:34, Anders Roxell wrote:
> On 2022-10-31 12:30, Tejun Heo wrote:
>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>> The kernfs write lock is held when the kernfs node inode attributes
>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>> won't change.
>>>
>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>> same values from the kernfs node.
>>>
>>> So there's no need to take the inode i_lock to get consistent values
>>> for generic_fillattr() and generic_permission(), the kernfs read lock
>>> is sufficient.
>>>
>>> Signed-off-by: Ian Kent <raven@themaw.net>
>> Acked-by: Tejun Heo <tj@kernel.org>
> Hi,
>
> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
> booting that in qemu I see the following "BUG: KCSAN: data-race in
> set_nlink / set_nlink".


I'll check if I missed any places where set_link() could be

called where the link count could be different.


If there aren't any the question will then be can writing the

same value to this location in multiple concurrent threads

corrupt it?


Ian

>
>
> ==================================================================
> [ 1540.388669][  T123] BUG: KCSAN: data-race in set_nlink / set_nlink
> [ 1540.392779][  T123]
> [ 1540.394302][  T123] write to 0xffff00000adcc3e4 of 4 bytes by task 126 on cpu 0:
> [ 1540.398828][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:371)
> [ 1540.401609][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181)
> [ 1540.404925][ T123] kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:194)
> [ 1540.408088][ T123] vfs_getattr_nosec (/home/anders/src/kernel/next/fs/stat.c:133)
> [ 1540.411334][ T123] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:170)
> [ 1540.414224][ T123] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
> [ 1540.417166][ T123] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
> [ 1540.420539][ T123] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
> [ 1540.424003][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142)
> [ 1540.427648][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197)
> [ 1540.430378][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638)
> [ 1540.433053][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656)
> [ 1540.436421][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584)
> [ 1540.439303][  T123]
> [ 1540.440828][  T123] 1 lock held by systemd-udevd/126:
> [ 1540.444055][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_getattr (/home/anders/src/kernel/next/fs/kernfs/inode.c:193)
> [ 1540.450699][  T123] irq event stamp: 185034
> [ 1540.453373][ T123] hardirqs last enabled at (185034): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302)
> [ 1540.460087][ T123] hardirqs last disabled at (185033): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4))
> [ 1540.466686][ T123] softirqs last enabled at (185001): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780)
> [ 1540.473310][ T123] softirqs last disabled at (184999): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773)
> [ 1540.479872][  T123]
> [ 1540.481406][  T123] read to 0xffff00000adcc3e4 of 4 bytes by task 123 on cpu 0:
> [ 1540.485893][ T123] set_nlink (/home/anders/src/kernel/next/fs/inode.c:368)
> [ 1540.488663][ T123] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:181)
> [ 1540.491961][ T123] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:290)
> [ 1540.495260][ T123] inode_permission (/home/anders/src/kernel/next/fs/namei.c:458 /home/anders/src/kernel/next/fs/namei.c:525)
> [ 1540.498450][ T123] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1715 /home/anders/src/kernel/next/fs/namei.c:2262)
> [ 1540.501552][ T123] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2473 (discriminator 2))
> [ 1540.504592][ T123] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2503)
> [ 1540.507740][ T123] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2876)
> [ 1540.511010][ T123] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477)
> [ 1540.514097][ T123] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501)
> [ 1540.517598][ T123] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:142)
> [ 1540.521319][ T123] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:197)
> [ 1540.524125][ T123] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:142 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:638)
> [ 1540.526795][ T123] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:656)
> [ 1540.530224][ T123] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:584)
> [ 1540.533176][  T123]
> [ 1540.534710][  T123] 1 lock held by systemd-udevd/123:
> [ 1540.537977][ T123] #0: ffff00000609b960 (&root->kernfs_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1540.544892][  T123] irq event stamp: 216564
> [ 1540.547603][ T123] hardirqs last enabled at (216564): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/mm/page_alloc.c:5302)
> [ 1540.554368][ T123] hardirqs last disabled at (216563): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:103 (discriminator 4))
> [ 1540.561107][ T123] softirqs last enabled at (216533): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1780)
> [ 1540.567833][ T123] softirqs last disabled at (216531): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1773)
> [ 1540.574496][  T123]
> [ 1540.576050][  T123] Reported by Kernel Concurrency Sanitizer on:
> [ 1540.587925][  T123] Hardware name: linux,dummy-virt (DT)
> [ 1540.591282][  T123] ==================================================================
>
>
> Reverting this patch I don't see the data race anymore.
>
> Cheers,
> Anders
Arnd Bergmann Dec. 29, 2022, 9:20 a.m. UTC | #5
On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> On 21/12/22 21:34, Anders Roxell wrote:
>> On 2022-10-31 12:30, Tejun Heo wrote:
>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>> won't change.
>>>>
>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>> same values from the kernfs node.
>>>>
>>>> So there's no need to take the inode i_lock to get consistent values
>>>> for generic_fillattr() and generic_permission(), the kernfs read lock
>>>> is sufficient.
>>>>
>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>> Acked-by: Tejun Heo <tj@kernel.org>
>> Hi,
>>
>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>> set_nlink / set_nlink".
>
>
> I'll check if I missed any places where set_link() could be
> called where the link count could be different.
>
>
> If there aren't any the question will then be can writing the
> same value to this location in multiple concurrent threads
> corrupt it?

I think the race that is getting reported for set_nlink()
is about this bit getting called simulatenously on multiple
CPUs with only the read lock held for the inode:

     /* Yes, some filesystems do change nlink from zero to one */
     if (inode->i_nlink == 0)
               atomic_long_dec(&inode->i_sb->s_remove_count);
     inode->__i_nlink = nlink;

Since i_nlink and __i_nlink refer to the same memory location,
the 'inode->i_nlink == 0' check can be true for all of them
before the nonzero nlink value gets set, and this results in
s_remove_count being decremented more than once.

      Arnd
Ian Kent Dec. 29, 2022, 1:07 p.m. UTC | #6
On 29/12/22 17:20, Arnd Bergmann wrote:
> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>> On 21/12/22 21:34, Anders Roxell wrote:
>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>>> won't change.
>>>>>
>>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>>> same values from the kernfs node.
>>>>>
>>>>> So there's no need to take the inode i_lock to get consistent values
>>>>> for generic_fillattr() and generic_permission(), the kernfs read lock
>>>>> is sufficient.
>>>>>
>>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>>> Acked-by: Tejun Heo <tj@kernel.org>
>>> Hi,
>>>
>>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>>> set_nlink / set_nlink".
>>
>> I'll check if I missed any places where set_link() could be
>> called where the link count could be different.
>>
>>
>> If there aren't any the question will then be can writing the
>> same value to this location in multiple concurrent threads
>> corrupt it?
> I think the race that is getting reported for set_nlink()
> is about this bit getting called simulatenously on multiple
> CPUs with only the read lock held for the inode:
>
>       /* Yes, some filesystems do change nlink from zero to one */
>       if (inode->i_nlink == 0)
>                 atomic_long_dec(&inode->i_sb->s_remove_count);
>       inode->__i_nlink = nlink;
>
> Since i_nlink and __i_nlink refer to the same memory location,
> the 'inode->i_nlink == 0' check can be true for all of them
> before the nonzero nlink value gets set, and this results in
> s_remove_count being decremented more than once.


Thanks for the comment Arnd.


I'll certainly have a close look at that.

It will be a little while though, given the season, ;).


Ian
Ian Kent Jan. 23, 2023, 3:11 a.m. UTC | #7
On 29/12/22 21:07, Ian Kent wrote:
>
> On 29/12/22 17:20, Arnd Bergmann wrote:
>> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>>> On 21/12/22 21:34, Anders Roxell wrote:
>>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>>>> won't change.
>>>>>>
>>>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>>>> same values from the kernfs node.
>>>>>>
>>>>>> So there's no need to take the inode i_lock to get consistent values
>>>>>> for generic_fillattr() and generic_permission(), the kernfs read 
>>>>>> lock
>>>>>> is sufficient.
>>>>>>
>>>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>>>> Acked-by: Tejun Heo <tj@kernel.org>
>>>> Hi,
>>>>
>>>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>>>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>>>> set_nlink / set_nlink".
>>>
>>> I'll check if I missed any places where set_link() could be
>>> called where the link count could be different.
>>>
>>>
>>> If there aren't any the question will then be can writing the
>>> same value to this location in multiple concurrent threads
>>> corrupt it?
>> I think the race that is getting reported for set_nlink()
>> is about this bit getting called simulatenously on multiple
>> CPUs with only the read lock held for the inode:
>>
>>       /* Yes, some filesystems do change nlink from zero to one */
>>       if (inode->i_nlink == 0)
>> atomic_long_dec(&inode->i_sb->s_remove_count);
>>       inode->__i_nlink = nlink;
>>
>> Since i_nlink and __i_nlink refer to the same memory location,
>> the 'inode->i_nlink == 0' check can be true for all of them
>> before the nonzero nlink value gets set, and this results in
>> s_remove_count being decremented more than once.
>
>
> Thanks for the comment Arnd.


Hello all,


I've been looking at this and after consulting Miklos and his pointing

out that it looks like a false positive the urgency dropped off a bit. So

apologies for taking so long to report back.


Anyway it needs some description of conclusions reached so far.


I'm still looking around but in short, kernfs will set directories to <# of

directory entries> + 2 unconditionally for directories. I can't yet find

any other places where i_nlink is set or changed and if there are none

then i_nlink will never be set to zero so the race should not occur.


Consequently my claim is this is a real false positive.


There are the file system operations that may be passed at mount time

but given the way kernfs sets i_nlink it pretty much dictates those 
operations

(if there were any that modify it and there don't appear to be any) leave it

alone.


So it just doesn't make sense for users of kernfs to fiddle with i_nlink ...


Ian
Anders Roxell July 18, 2023, 7 p.m. UTC | #8
On 2023-01-23 11:11, Ian Kent wrote:
> 
> On 29/12/22 21:07, Ian Kent wrote:
> > 
> > On 29/12/22 17:20, Arnd Bergmann wrote:
> > > On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> > > > On 21/12/22 21:34, Anders Roxell wrote:
> > > > > On 2022-10-31 12:30, Tejun Heo wrote:
> > > > > > On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
> > > > > > > The kernfs write lock is held when the kernfs node inode attributes
> > > > > > > are updated. Therefore, when either kernfs_iop_getattr() or
> > > > > > > kernfs_iop_permission() are called the kernfs node inode attributes
> > > > > > > won't change.
> > > > > > > 
> > > > > > > Consequently concurrent kernfs_refresh_inode() calls always copy the
> > > > > > > same values from the kernfs node.
> > > > > > > 
> > > > > > > So there's no need to take the inode i_lock to get consistent values
> > > > > > > for generic_fillattr() and generic_permission(), the
> > > > > > > kernfs read lock
> > > > > > > is sufficient.
> > > > > > > 
> > > > > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > > > Acked-by: Tejun Heo <tj@kernel.org>
> > > > > Hi,
> > > > > 
> > > > > Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
> > > > > booting that in qemu I see the following "BUG: KCSAN: data-race in
> > > > > set_nlink / set_nlink".
> > > > 
> > > > I'll check if I missed any places where set_link() could be
> > > > called where the link count could be different.
> > > > 
> > > > 
> > > > If there aren't any the question will then be can writing the
> > > > same value to this location in multiple concurrent threads
> > > > corrupt it?
> > > I think the race that is getting reported for set_nlink()
> > > is about this bit getting called simulatenously on multiple
> > > CPUs with only the read lock held for the inode:
> > > 
> > >       /* Yes, some filesystems do change nlink from zero to one */
> > >       if (inode->i_nlink == 0)
> > > atomic_long_dec(&inode->i_sb->s_remove_count);
> > >       inode->__i_nlink = nlink;
> > > 
> > > Since i_nlink and __i_nlink refer to the same memory location,
> > > the 'inode->i_nlink == 0' check can be true for all of them
> > > before the nonzero nlink value gets set, and this results in
> > > s_remove_count being decremented more than once.
> > 
> > 
> > Thanks for the comment Arnd.
> 
> 
> Hello all,
> 
> 
> I've been looking at this and after consulting Miklos and his pointing
> 
> out that it looks like a false positive the urgency dropped off a bit. So
> 
> apologies for taking so long to report back.
> 
> 
> Anyway it needs some description of conclusions reached so far.
> 
> 
> I'm still looking around but in short, kernfs will set directories to <# of
> 
> directory entries> + 2 unconditionally for directories. I can't yet find
> 
> any other places where i_nlink is set or changed and if there are none
> 
> then i_nlink will never be set to zero so the race should not occur.
> 
> 
> Consequently my claim is this is a real false positive.
> 
> 
> There are the file system operations that may be passed at mount time
> 
> but given the way kernfs sets i_nlink it pretty much dictates those
> operations
> 
> (if there were any that modify it and there don't appear to be any) leave it
> 
> alone.
> 
> 
> So it just doesn't make sense for users of kernfs to fiddle with i_nlink ...

On todays next tag, next-20230718 this KCSAN BUG poped up again. When I
built an allmodconfig arm64 kernel and booted it in QEMU. Full log can
be found http://ix.io/4AUd

[ 1694.987789][  T137] BUG: KCSAN: data-race in inode_permission / kernfs_refresh_inode
[ 1694.992912][  T137] 
[ 1694.994532][  T137] write to 0xffff00000bab6070 of 2 bytes by task 104 on cpu 0:
[ 1694.999269][ T137] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:171) 
[ 1695.002707][ T137] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289) 
[ 1695.006148][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528) 
[ 1695.009420][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267) 
[ 1695.012643][ T137] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2)) 
[ 1695.015781][ T137] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508) 
[ 1695.019059][ T137] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238) 
[ 1695.022024][ T137] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276) 
[ 1695.025067][ T137] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446) 
[ 1695.028497][ T137] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440) 
[ 1695.032080][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139) 
[ 1695.035916][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188) 
[ 1695.038796][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648) 
[ 1695.041468][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666) 
[ 1695.044889][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591) 
[ 1695.047904][  T137] 
[ 1695.049511][  T137] 1 lock held by systemd-udevd/104:
[ 1695.052837][ T137] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288) 
[ 1695.060241][  T137] irq event stamp: 82902
[ 1695.063006][ T137] hardirqs last enabled at (82901): _raw_spin_unlock_irqrestore (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-macros.h:250 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140 /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151 /home/anders/src/kernel/next/kernel/locking/spinlock.c:194) 
[ 1695.069673][ T137] hardirqs last disabled at (82902): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488) 
[ 1695.075474][ T137] softirqs last enabled at (82792): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791) 
[ 1695.082319][ T137] softirqs last disabled at (82790): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784) 
[ 1695.089049][  T137] 
[ 1695.090659][  T137] read to 0xffff00000bab6070 of 2 bytes by task 137 on cpu 0:
[ 1695.095374][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:532) 
[ 1695.098655][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267) 
[ 1695.101857][ T137] path_openat (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2)) 
[ 1695.104885][ T137] do_filp_open (/home/anders/src/kernel/next/fs/namei.c:3820) 
[ 1695.108006][ T137] do_sys_openat2 (/home/anders/src/kernel/next/fs/open.c:1418) 
[ 1695.111290][ T137] __arm64_sys_openat (/home/anders/src/kernel/next/fs/open.c:1433) 
[ 1695.114825][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139) 
[ 1695.118662][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188) 
[ 1695.121555][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648) 
[ 1695.124207][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666) 
[ 1695.127590][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591) 
[ 1695.130641][  T137] 
[ 1695.132241][  T137] no locks held by systemd-udevd/137.
[ 1695.135618][  T137] irq event stamp: 3246
[ 1695.138519][ T137] hardirqs last enabled at (3245): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105) 
[ 1695.145825][ T137] hardirqs last disabled at (3246): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488) 
[ 1695.151942][ T137] softirqs last enabled at (3208): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791) 
[ 1695.158950][ T137] softirqs last disabled at (3206): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784) 
[ 1695.166036][  T137] 
[ 1695.167621][  T137] Reported by Kernel Concurrency Sanitizer on:
[ 1695.179990][  T137] Hardware name: linux,dummy-virt (DT)
[ 1695.183687][  T137] ==================================================================

[...]

[ 1738.053819][  T104] BUG: KCSAN: data-race in set_nlink / set_nlink
[ 1738.058223][  T104] 
[ 1738.059865][  T104] read to 0xffff00000bab6918 of 4 bytes by task 108 on cpu 0:
[ 1738.064916][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:369) 
[ 1738.067845][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180) 
[ 1738.071607][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289) 
[ 1738.075467][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528) 
[ 1738.078868][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267) 
[ 1738.082270][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2)) 
[ 1738.085488][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508) 
[ 1738.089101][ T104] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2907) 
[ 1738.092469][ T104] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477) 
[ 1738.095970][ T104] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501) 
[ 1738.099529][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139) 
[ 1738.103696][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188) 
[ 1738.106560][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648) 
[ 1738.109613][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666) 
[ 1738.113035][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591) 
[ 1738.116346][  T104] 
[ 1738.117924][  T104] 1 lock held by systemd-udevd/108:
[ 1738.121580][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288) 
[ 1738.129355][  T104] irq event stamp: 31000
[ 1738.132088][ T104] hardirqs last enabled at (31000): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105) 
[ 1738.139417][ T104] hardirqs last disabled at (30999): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104) 
[ 1738.146781][ T104] softirqs last enabled at (30973): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791) 
[ 1738.153891][ T104] softirqs last disabled at (30971): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784) 
[ 1738.161012][  T104] 
[ 1738.162663][  T104] write to 0xffff00000bab6918 of 4 bytes by task 104 on cpu 0:
[ 1738.167730][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:372) 
[ 1738.170559][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180) 
[ 1738.174355][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289) 
[ 1738.177829][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528) 
[ 1738.181403][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267) 
[ 1738.184738][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2)) 
[ 1738.188268][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508) 
[ 1738.191865][ T104] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238) 
[ 1738.196236][ T104] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276) 
[ 1738.200120][ T104] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446) 
[ 1738.204095][ T104] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440) 
[ 1738.207676][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139) 
[ 1738.211820][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188) 
[ 1738.214815][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648) 
[ 1738.217709][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666) 
[ 1738.221239][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591) 
[ 1738.224502][  T104] 
[ 1738.226090][  T104] 1 lock held by systemd-udevd/104:
[ 1738.229747][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288) 
[ 1738.237504][  T104] irq event stamp: 108353
[ 1738.240262][ T104] hardirqs last enabled at (108353): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105) 
[ 1738.247443][ T104] hardirqs last disabled at (108352): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104) 
[ 1738.254510][ T104] softirqs last enabled at (108326): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791) 
[ 1738.262187][ T104] softirqs last disabled at (108324): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784) 
[ 1738.270239][  T104] 
[ 1738.272140][  T104] Reported by Kernel Concurrency Sanitizer on:
[ 1738.285185][  T104] Hardware name: linux,dummy-virt (DT)
[ 1738.288703][  T104] ==================================================================


Cheers,
Anders
Ian Kent July 19, 2023, 4:23 a.m. UTC | #9
On 19/7/23 03:00, Anders Roxell wrote:
> On 2023-01-23 11:11, Ian Kent wrote:
>> On 29/12/22 21:07, Ian Kent wrote:
>>> On 29/12/22 17:20, Arnd Bergmann wrote:
>>>> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>>>>> On 21/12/22 21:34, Anders Roxell wrote:
>>>>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent wrote:
>>>>>>>> The kernfs write lock is held when the kernfs node inode attributes
>>>>>>>> are updated. Therefore, when either kernfs_iop_getattr() or
>>>>>>>> kernfs_iop_permission() are called the kernfs node inode attributes
>>>>>>>> won't change.
>>>>>>>>
>>>>>>>> Consequently concurrent kernfs_refresh_inode() calls always copy the
>>>>>>>> same values from the kernfs node.
>>>>>>>>
>>>>>>>> So there's no need to take the inode i_lock to get consistent values
>>>>>>>> for generic_fillattr() and generic_permission(), the
>>>>>>>> kernfs read lock
>>>>>>>> is sufficient.
>>>>>>>>
>>>>>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>>>>>> Acked-by: Tejun Heo <tj@kernel.org>
>>>>>> Hi,
>>>>>>
>>>>>> Building an allmodconfig arm64 kernel on yesterdays next-20221220 and
>>>>>> booting that in qemu I see the following "BUG: KCSAN: data-race in
>>>>>> set_nlink / set_nlink".
>>>>> I'll check if I missed any places where set_link() could be
>>>>> called where the link count could be different.
>>>>>
>>>>>
>>>>> If there aren't any the question will then be can writing the
>>>>> same value to this location in multiple concurrent threads
>>>>> corrupt it?
>>>> I think the race that is getting reported for set_nlink()
>>>> is about this bit getting called simulatenously on multiple
>>>> CPUs with only the read lock held for the inode:
>>>>
>>>>        /* Yes, some filesystems do change nlink from zero to one */
>>>>        if (inode->i_nlink == 0)
>>>> atomic_long_dec(&inode->i_sb->s_remove_count);
>>>>        inode->__i_nlink = nlink;
>>>>
>>>> Since i_nlink and __i_nlink refer to the same memory location,
>>>> the 'inode->i_nlink == 0' check can be true for all of them
>>>> before the nonzero nlink value gets set, and this results in
>>>> s_remove_count being decremented more than once.
>>>
>>> Thanks for the comment Arnd.
>>
>> Hello all,
>>
>>
>> I've been looking at this and after consulting Miklos and his pointing
>>
>> out that it looks like a false positive the urgency dropped off a bit. So
>>
>> apologies for taking so long to report back.
>>
>>
>> Anyway it needs some description of conclusions reached so far.
>>
>>
>> I'm still looking around but in short, kernfs will set directories to <# of
>>
>> directory entries> + 2 unconditionally for directories. I can't yet find
>>
>> any other places where i_nlink is set or changed and if there are none
>>
>> then i_nlink will never be set to zero so the race should not occur.
>>
>>
>> Consequently my claim is this is a real false positive.
>>
>>
>> There are the file system operations that may be passed at mount time
>>
>> but given the way kernfs sets i_nlink it pretty much dictates those
>> operations
>>
>> (if there were any that modify it and there don't appear to be any) leave it
>>
>> alone.
>>
>>
>> So it just doesn't make sense for users of kernfs to fiddle with i_nlink ...
> On todays next tag, next-20230718 this KCSAN BUG poped up again. When I
> built an allmodconfig arm64 kernel and booted it in QEMU. Full log can
> be found http://ix.io/4AUd
>
> [ 1694.987789][  T137] BUG: KCSAN: data-race in inode_permission / kernfs_refresh_inode
> [ 1694.992912][  T137]
> [ 1694.994532][  T137] write to 0xffff00000bab6070 of 2 bytes by task 104 on cpu 0:
> [ 1694.999269][ T137] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
> [ 1695.002707][ T137] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1695.006148][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
> [ 1695.009420][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1695.012643][ T137] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> [ 1695.015781][ T137] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
> [ 1695.019059][ T137] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238)
> [ 1695.022024][ T137] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
> [ 1695.025067][ T137] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
> [ 1695.028497][ T137] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
> [ 1695.032080][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1695.035916][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1695.038796][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1695.041468][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1695.044889][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1695.047904][  T137]
> [ 1695.049511][  T137] 1 lock held by systemd-udevd/104:
> [ 1695.052837][ T137] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> [ 1695.060241][  T137] irq event stamp: 82902
> [ 1695.063006][ T137] hardirqs last enabled at (82901): _raw_spin_unlock_irqrestore (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-macros.h:250 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27 /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140 /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151 /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
> [ 1695.069673][ T137] hardirqs last disabled at (82902): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> [ 1695.075474][ T137] softirqs last enabled at (82792): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1695.082319][ T137] softirqs last disabled at (82790): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1695.089049][  T137]
> [ 1695.090659][  T137] read to 0xffff00000bab6070 of 2 bytes by task 137 on cpu 0:
> [ 1695.095374][ T137] inode_permission (/home/anders/src/kernel/next/fs/namei.c:532)
> [ 1695.098655][ T137] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1695.101857][ T137] path_openat (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
> [ 1695.104885][ T137] do_filp_open (/home/anders/src/kernel/next/fs/namei.c:3820)
> [ 1695.108006][ T137] do_sys_openat2 (/home/anders/src/kernel/next/fs/open.c:1418)
> [ 1695.111290][ T137] __arm64_sys_openat (/home/anders/src/kernel/next/fs/open.c:1433)
> [ 1695.114825][ T137] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1695.118662][ T137] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1695.121555][ T137] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1695.124207][ T137] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1695.127590][ T137] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1695.130641][  T137]
> [ 1695.132241][  T137] no locks held by systemd-udevd/137.
> [ 1695.135618][  T137] irq event stamp: 3246
> [ 1695.138519][ T137] hardirqs last enabled at (3245): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> [ 1695.145825][ T137] hardirqs last disabled at (3246): el1_interrupt (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> [ 1695.151942][ T137] softirqs last enabled at (3208): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1695.158950][ T137] softirqs last disabled at (3206): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1695.166036][  T137]
> [ 1695.167621][  T137] Reported by Kernel Concurrency Sanitizer on:
> [ 1695.179990][  T137] Hardware name: linux,dummy-virt (DT)
> [ 1695.183687][  T137] ==================================================================


This one is different to the original.


I can't see where the problem is here, can someone help me out

please.


I don't see any shared data values used by the call

devcgroup_inode_permission(inode, mask) in fs/namei.c:inode_permission()

that might be a problem during the read except possibly inode->i_mode.


I'll check on that ... maybe something's been missed when kernfs_rwsem

was changed to a separate lock (kernfs_iattr_rwsem).


>
> [...]
>
> [ 1738.053819][  T104] BUG: KCSAN: data-race in set_nlink / set_nlink
> [ 1738.058223][  T104]
> [ 1738.059865][  T104] read to 0xffff00000bab6918 of 4 bytes by task 108 on cpu 0:
> [ 1738.064916][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:369)
> [ 1738.067845][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> [ 1738.071607][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1738.075467][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
> [ 1738.078868][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1738.082270][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> [ 1738.085488][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
> [ 1738.089101][ T104] user_path_at_empty (/home/anders/src/kernel/next/fs/namei.c:2907)
> [ 1738.092469][ T104] do_readlinkat (/home/anders/src/kernel/next/fs/stat.c:477)
> [ 1738.095970][ T104] __arm64_sys_readlinkat (/home/anders/src/kernel/next/fs/stat.c:504 /home/anders/src/kernel/next/fs/stat.c:501 /home/anders/src/kernel/next/fs/stat.c:501)
> [ 1738.099529][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1738.103696][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1738.106560][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1738.109613][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1738.113035][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1738.116346][  T104]
> [ 1738.117924][  T104] 1 lock held by systemd-udevd/108:
> [ 1738.121580][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> [ 1738.129355][  T104] irq event stamp: 31000
> [ 1738.132088][ T104] hardirqs last enabled at (31000): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> [ 1738.139417][ T104] hardirqs last disabled at (30999): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> [ 1738.146781][ T104] softirqs last enabled at (30973): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1738.153891][ T104] softirqs last disabled at (30971): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1738.161012][  T104]
> [ 1738.162663][  T104] write to 0xffff00000bab6918 of 4 bytes by task 104 on cpu 0:
> [ 1738.167730][ T104] set_nlink (/home/anders/src/kernel/next/fs/inode.c:372)
> [ 1738.170559][ T104] kernfs_refresh_inode (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> [ 1738.174355][ T104] kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> [ 1738.177829][ T104] inode_permission (/home/anders/src/kernel/next/fs/namei.c:461 /home/anders/src/kernel/next/fs/namei.c:528)
> [ 1738.181403][ T104] link_path_walk (/home/anders/src/kernel/next/fs/namei.c:1720 /home/anders/src/kernel/next/fs/namei.c:2267)
> [ 1738.184738][ T104] path_lookupat (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> [ 1738.188268][ T104] filename_lookup (/home/anders/src/kernel/next/fs/namei.c:2508)
> [ 1738.191865][ T104] vfs_statx (/home/anders/src/kernel/next/fs/stat.c:238)
> [ 1738.196236][ T104] vfs_fstatat (/home/anders/src/kernel/next/fs/stat.c:276)
> [ 1738.200120][ T104] __do_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:446)
> [ 1738.204095][ T104] __arm64_sys_newfstatat (/home/anders/src/kernel/next/fs/stat.c:440 /home/anders/src/kernel/next/fs/stat.c:440)
> [ 1738.207676][ T104] el0_svc_common.constprop.0 (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52 /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> [ 1738.211820][ T104] do_el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> [ 1738.214815][ T104] el0_svc (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144 /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> [ 1738.217709][ T104] el0t_64_sync_handler (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> [ 1738.221239][ T104] el0t_64_sync (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> [ 1738.224502][  T104]
> [ 1738.226090][  T104] 1 lock held by systemd-udevd/104:
> [ 1738.229747][ T104] #0: ffff000006681e08 (&root->kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> [ 1738.237504][  T104] irq event stamp: 108353
> [ 1738.240262][ T104] hardirqs last enabled at (108353): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> [ 1738.247443][ T104] hardirqs last disabled at (108352): seqcount_lockdep_reader_access (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> [ 1738.254510][ T104] softirqs last enabled at (108326): fpsimd_restore_current_state (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> [ 1738.262187][ T104] softirqs last disabled at (108324): fpsimd_restore_current_state (/home/anders/src/kernel/next/include/linux/bottom_half.h:20 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242 /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> [ 1738.270239][  T104]
> [ 1738.272140][  T104] Reported by Kernel Concurrency Sanitizer on:
> [ 1738.285185][  T104] Hardware name: linux,dummy-virt (DT)
> [ 1738.288703][  T104] ==================================================================

This looks just like the original except a different lock is being

used now.


The link count can't be less than two if set_nlink() is called.


Maybe I am missing something but the directory count is changed only

while holding the root->kernfs_iattr_rwsem so the value used by

set_nlink() will not change during concurrent calls to refresh_inode().

Still looks like a false positive, I'll check the write operations
again just to be sure.

Ian
Ian Kent July 20, 2023, 2:03 a.m. UTC | #10
On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
> On 19/7/23 03:00, Anders Roxell wrote:
> > On 2023-01-23 11:11, Ian Kent wrote:
> > > On 29/12/22 21:07, Ian Kent wrote:
> > > > On 29/12/22 17:20, Arnd Bergmann wrote:
> > > > > On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> > > > > > On 21/12/22 21:34, Anders Roxell wrote:
> > > > > > > On 2022-10-31 12:30, Tejun Heo wrote:
> > > > > > > > On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent
> > > > > > > > wrote:
> > > > > > > > > The kernfs write lock is held when the kernfs node
> > > > > > > > > inode attributes
> > > > > > > > > are updated. Therefore, when either
> > > > > > > > > kernfs_iop_getattr() or
> > > > > > > > > kernfs_iop_permission() are called the kernfs node
> > > > > > > > > inode attributes
> > > > > > > > > won't change.
> > > > > > > > > 
> > > > > > > > > Consequently concurrent kernfs_refresh_inode() calls
> > > > > > > > > always copy the
> > > > > > > > > same values from the kernfs node.
> > > > > > > > > 
> > > > > > > > > So there's no need to take the inode i_lock to get
> > > > > > > > > consistent values
> > > > > > > > > for generic_fillattr() and generic_permission(), the
> > > > > > > > > kernfs read lock
> > > > > > > > > is sufficient.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > > > > > Acked-by: Tejun Heo <tj@kernel.org>
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Building an allmodconfig arm64 kernel on yesterdays next-
> > > > > > > 20221220 and
> > > > > > > booting that in qemu I see the following "BUG: KCSAN:
> > > > > > > data-race in
> > > > > > > set_nlink / set_nlink".
> > > > > > I'll check if I missed any places where set_link() could be
> > > > > > called where the link count could be different.
> > > > > > 
> > > > > > 
> > > > > > If there aren't any the question will then be can writing
> > > > > > the
> > > > > > same value to this location in multiple concurrent threads
> > > > > > corrupt it?
> > > > > I think the race that is getting reported for set_nlink()
> > > > > is about this bit getting called simulatenously on multiple
> > > > > CPUs with only the read lock held for the inode:
> > > > > 
> > > > >        /* Yes, some filesystems do change nlink from zero to
> > > > > one */
> > > > >        if (inode->i_nlink == 0)
> > > > > atomic_long_dec(&inode->i_sb->s_remove_count);
> > > > >        inode->__i_nlink = nlink;
> > > > > 
> > > > > Since i_nlink and __i_nlink refer to the same memory
> > > > > location,
> > > > > the 'inode->i_nlink == 0' check can be true for all of them
> > > > > before the nonzero nlink value gets set, and this results in
> > > > > s_remove_count being decremented more than once.
> > > > 
> > > > Thanks for the comment Arnd.
> > > 
> > > Hello all,
> > > 
> > > 
> > > I've been looking at this and after consulting Miklos and his
> > > pointing
> > > 
> > > out that it looks like a false positive the urgency dropped off a
> > > bit. So
> > > 
> > > apologies for taking so long to report back.
> > > 
> > > 
> > > Anyway it needs some description of conclusions reached so far.
> > > 
> > > 
> > > I'm still looking around but in short, kernfs will set
> > > directories to <# of
> > > 
> > > directory entries> + 2 unconditionally for directories. I can't
> > > yet find
> > > 
> > > any other places where i_nlink is set or changed and if there are
> > > none
> > > 
> > > then i_nlink will never be set to zero so the race should not
> > > occur.
> > > 
> > > 
> > > Consequently my claim is this is a real false positive.
> > > 
> > > 
> > > There are the file system operations that may be passed at mount
> > > time
> > > 
> > > but given the way kernfs sets i_nlink it pretty much dictates
> > > those
> > > operations
> > > 
> > > (if there were any that modify it and there don't appear to be
> > > any) leave it
> > > 
> > > alone.
> > > 
> > > 
> > > So it just doesn't make sense for users of kernfs to fiddle with
> > > i_nlink ...
> > On todays next tag, next-20230718 this KCSAN BUG poped up again.
> > When I
> > built an allmodconfig arm64 kernel and booted it in QEMU. Full log
> > can
> > be found http://ix.io/4AUd
> > 
> > [ 1694.987789][  T137] BUG: KCSAN: data-race in inode_permission /
> > kernfs_refresh_inode
> > [ 1694.992912][  T137]
> > [ 1694.994532][  T137] write to 0xffff00000bab6070 of 2 bytes by
> > task 104 on cpu 0:
> > [ 1694.999269][ T137] kernfs_refresh_inode
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
> > [ 1695.002707][ T137] kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > [ 1695.006148][ T137] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:461
> > /home/anders/src/kernel/next/fs/namei.c:528)
> > [ 1695.009420][ T137] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1695.012643][ T137] path_lookupat
> > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > [ 1695.015781][ T137] filename_lookup
> > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > [ 1695.019059][ T137] vfs_statx
> > (/home/anders/src/kernel/next/fs/stat.c:238)
> > [ 1695.022024][ T137] vfs_fstatat
> > (/home/anders/src/kernel/next/fs/stat.c:276)
> > [ 1695.025067][ T137] __do_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:446)
> > [ 1695.028497][ T137] __arm64_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:440
> > /home/anders/src/kernel/next/fs/stat.c:440)
> > [ 1695.032080][ T137] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1695.035916][ T137] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1695.038796][ T137] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1695.041468][ T137] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1695.044889][ T137] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1695.047904][  T137]
> > [ 1695.049511][  T137] 1 lock held by systemd-udevd/104:
> > [ 1695.052837][ T137] #0: ffff000006681e08 (&root-
> > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > [ 1695.060241][  T137] irq event stamp: 82902
> > [ 1695.063006][ T137] hardirqs last enabled at (82901):
> > _raw_spin_unlock_irqrestore
> > (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-
> > macros.h:250
> > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27
> > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140
> > /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151
> > /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
> > [ 1695.069673][ T137] hardirqs last disabled at (82902):
> > el1_interrupt
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > [ 1695.075474][ T137] softirqs last enabled at (82792):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1695.082319][ T137] softirqs last disabled at (82790):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1695.089049][  T137]
> > [ 1695.090659][  T137] read to 0xffff00000bab6070 of 2 bytes by
> > task 137 on cpu 0:
> > [ 1695.095374][ T137] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:532)
> > [ 1695.098655][ T137] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1695.101857][ T137] path_openat
> > (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
> > [ 1695.104885][ T137] do_filp_open
> > (/home/anders/src/kernel/next/fs/namei.c:3820)
> > [ 1695.108006][ T137] do_sys_openat2
> > (/home/anders/src/kernel/next/fs/open.c:1418)
> > [ 1695.111290][ T137] __arm64_sys_openat
> > (/home/anders/src/kernel/next/fs/open.c:1433)
> > [ 1695.114825][ T137] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1695.118662][ T137] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1695.121555][ T137] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1695.124207][ T137] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1695.127590][ T137] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1695.130641][  T137]
> > [ 1695.132241][  T137] no locks held by systemd-udevd/137.
> > [ 1695.135618][  T137] irq event stamp: 3246
> > [ 1695.138519][ T137] hardirqs last enabled at (3245):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > [ 1695.145825][ T137] hardirqs last disabled at (3246):
> > el1_interrupt
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > [ 1695.151942][ T137] softirqs last enabled at (3208):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1695.158950][ T137] softirqs last disabled at (3206):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1695.166036][  T137]
> > [ 1695.167621][  T137] Reported by Kernel Concurrency Sanitizer on:
> > [ 1695.179990][  T137] Hardware name: linux,dummy-virt (DT)
> > [ 1695.183687][  T137]
> > ==================================================================
> 
> 
> This one is different to the original.
> 
> 
> I can't see where the problem is here, can someone help me out
> 
> please.
> 
> 
> I don't see any shared data values used by the call
> 
> devcgroup_inode_permission(inode, mask) in
> fs/namei.c:inode_permission()
> 
> that might be a problem during the read except possibly inode-
> >i_mode.
> 
> 
> I'll check on that ... maybe something's been missed when
> kernfs_rwsem
> 
> was changed to a separate lock (kernfs_iattr_rwsem).
> 
> 
> > 
> > [...]
> > 
> > [ 1738.053819][  T104] BUG: KCSAN: data-race in set_nlink /
> > set_nlink
> > [ 1738.058223][  T104]
> > [ 1738.059865][  T104] read to 0xffff00000bab6918 of 4 bytes by
> > task 108 on cpu 0:
> > [ 1738.064916][ T104] set_nlink
> > (/home/anders/src/kernel/next/fs/inode.c:369)
> > [ 1738.067845][ T104] kernfs_refresh_inode
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > [ 1738.071607][ T104] kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > [ 1738.075467][ T104] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:461
> > /home/anders/src/kernel/next/fs/namei.c:528)
> > [ 1738.078868][ T104] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1738.082270][ T104] path_lookupat
> > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > [ 1738.085488][ T104] filename_lookup
> > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > [ 1738.089101][ T104] user_path_at_empty
> > (/home/anders/src/kernel/next/fs/namei.c:2907)
> > [ 1738.092469][ T104] do_readlinkat
> > (/home/anders/src/kernel/next/fs/stat.c:477)
> > [ 1738.095970][ T104] __arm64_sys_readlinkat
> > (/home/anders/src/kernel/next/fs/stat.c:504
> > /home/anders/src/kernel/next/fs/stat.c:501
> > /home/anders/src/kernel/next/fs/stat.c:501)
> > [ 1738.099529][ T104] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1738.103696][ T104] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1738.106560][ T104] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1738.109613][ T104] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1738.113035][ T104] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1738.116346][  T104]
> > [ 1738.117924][  T104] 1 lock held by systemd-udevd/108:
> > [ 1738.121580][ T104] #0: ffff000006681e08 (&root-
> > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > [ 1738.129355][  T104] irq event stamp: 31000
> > [ 1738.132088][ T104] hardirqs last enabled at (31000):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > [ 1738.139417][ T104] hardirqs last disabled at (30999):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > [ 1738.146781][ T104] softirqs last enabled at (30973):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1738.153891][ T104] softirqs last disabled at (30971):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1738.161012][  T104]
> > [ 1738.162663][  T104] write to 0xffff00000bab6918 of 4 bytes by
> > task 104 on cpu 0:
> > [ 1738.167730][ T104] set_nlink
> > (/home/anders/src/kernel/next/fs/inode.c:372)
> > [ 1738.170559][ T104] kernfs_refresh_inode
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > [ 1738.174355][ T104] kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > [ 1738.177829][ T104] inode_permission
> > (/home/anders/src/kernel/next/fs/namei.c:461
> > /home/anders/src/kernel/next/fs/namei.c:528)
> > [ 1738.181403][ T104] link_path_walk
> > (/home/anders/src/kernel/next/fs/namei.c:1720
> > /home/anders/src/kernel/next/fs/namei.c:2267)
> > [ 1738.184738][ T104] path_lookupat
> > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > [ 1738.188268][ T104] filename_lookup
> > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > [ 1738.191865][ T104] vfs_statx
> > (/home/anders/src/kernel/next/fs/stat.c:238)
> > [ 1738.196236][ T104] vfs_fstatat
> > (/home/anders/src/kernel/next/fs/stat.c:276)
> > [ 1738.200120][ T104] __do_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:446)
> > [ 1738.204095][ T104] __arm64_sys_newfstatat
> > (/home/anders/src/kernel/next/fs/stat.c:440
> > /home/anders/src/kernel/next/fs/stat.c:440)
> > [ 1738.207676][ T104] el0_svc_common.constprop.0
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > [ 1738.211820][ T104] do_el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > [ 1738.214815][ T104] el0_svc
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > [ 1738.217709][ T104] el0t_64_sync_handler
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > [ 1738.221239][ T104] el0t_64_sync
> > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > [ 1738.224502][  T104]
> > [ 1738.226090][  T104] 1 lock held by systemd-udevd/104:
> > [ 1738.229747][ T104] #0: ffff000006681e08 (&root-
> > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > [ 1738.237504][  T104] irq event stamp: 108353
> > [ 1738.240262][ T104] hardirqs last enabled at (108353):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > [ 1738.247443][ T104] hardirqs last disabled at (108352):
> > seqcount_lockdep_reader_access
> > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > [ 1738.254510][ T104] softirqs last enabled at (108326):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > [ 1738.262187][ T104] softirqs last disabled at (108324):
> > fpsimd_restore_current_state
> > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > [ 1738.270239][  T104]
> > [ 1738.272140][  T104] Reported by Kernel Concurrency Sanitizer on:
> > [ 1738.285185][  T104] Hardware name: linux,dummy-virt (DT)
> > [ 1738.288703][  T104]
> > ==================================================================
> 
> This looks just like the original except a different lock is being
> 
> used now.
> 
> 
> The link count can't be less than two if set_nlink() is called.
> 
> 
> Maybe I am missing something but the directory count is changed only
> 
> while holding the root->kernfs_iattr_rwsem so the value used by
> 
> set_nlink() will not change during concurrent calls to
> refresh_inode().
> 
> Still looks like a false positive, I'll check the write operations
> again just to be sure.

I do see a problem with recent changes.

I'll send this off to Greg after I've done some testing (primarily just
compile and function).

Here's a patch which describes what I found.

Comments are, of course, welcome, ;)

kernfs: fix missing kernfs_iattr_rwsem locking

From: Ian Kent <raven@themaw.net>

When the kernfs_iattr_rwsem was introduced a case was missed.

The update of the kernfs directory node child count was also protected
by the kernfs_rwsem and needs to be included in the change so that the
child count (and so the inode n_link attribute) does not change while
holding the rwsem for read.

Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
attributes)

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Imran Khan <imran.f.khan@oracle.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Eric Sandeen <sandeen@sandeen.net>
---
 fs/kernfs/dir.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 45b6919903e6..6e84bb69602e 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
*kn)
 	rb_insert_color(&kn->rb, &kn->parent->dir.children);
 
 	/* successfully added, account subdir number */
+	down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
 	if (kernfs_type(kn) == KERNFS_DIR)
 		kn->parent->dir.subdirs++;
 	kernfs_inc_rev(kn->parent);
+	up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
 
 	return 0;
 }
@@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
kernfs_node *kn)
 	if (RB_EMPTY_NODE(&kn->rb))
 		return false;
 
+	down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
 	if (kernfs_type(kn) == KERNFS_DIR)
 		kn->parent->dir.subdirs--;
 	kernfs_inc_rev(kn->parent);
+	up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
 
 	rb_erase(&kn->rb, &kn->parent->dir.children);
 	RB_CLEAR_NODE(&kn->rb);
Miklos Szeredi July 26, 2023, 1:49 p.m. UTC | #11
On Thu, 20 Jul 2023 at 04:03, Ian Kent <raven@themaw.net> wrote:
>
> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
> > On 19/7/23 03:00, Anders Roxell wrote:
> > > On 2023-01-23 11:11, Ian Kent wrote:
> > > > On 29/12/22 21:07, Ian Kent wrote:
> > > > > On 29/12/22 17:20, Arnd Bergmann wrote:
> > > > > > On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
> > > > > > > On 21/12/22 21:34, Anders Roxell wrote:
> > > > > > > > On 2022-10-31 12:30, Tejun Heo wrote:
> > > > > > > > > On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent
> > > > > > > > > wrote:
> > > > > > > > > > The kernfs write lock is held when the kernfs node
> > > > > > > > > > inode attributes
> > > > > > > > > > are updated. Therefore, when either
> > > > > > > > > > kernfs_iop_getattr() or
> > > > > > > > > > kernfs_iop_permission() are called the kernfs node
> > > > > > > > > > inode attributes
> > > > > > > > > > won't change.
> > > > > > > > > >
> > > > > > > > > > Consequently concurrent kernfs_refresh_inode() calls
> > > > > > > > > > always copy the
> > > > > > > > > > same values from the kernfs node.
> > > > > > > > > >
> > > > > > > > > > So there's no need to take the inode i_lock to get
> > > > > > > > > > consistent values
> > > > > > > > > > for generic_fillattr() and generic_permission(), the
> > > > > > > > > > kernfs read lock
> > > > > > > > > > is sufficient.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Ian Kent <raven@themaw.net>
> > > > > > > > > Acked-by: Tejun Heo <tj@kernel.org>
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Building an allmodconfig arm64 kernel on yesterdays next-
> > > > > > > > 20221220 and
> > > > > > > > booting that in qemu I see the following "BUG: KCSAN:
> > > > > > > > data-race in
> > > > > > > > set_nlink / set_nlink".
> > > > > > > I'll check if I missed any places where set_link() could be
> > > > > > > called where the link count could be different.
> > > > > > >
> > > > > > >
> > > > > > > If there aren't any the question will then be can writing
> > > > > > > the
> > > > > > > same value to this location in multiple concurrent threads
> > > > > > > corrupt it?
> > > > > > I think the race that is getting reported for set_nlink()
> > > > > > is about this bit getting called simulatenously on multiple
> > > > > > CPUs with only the read lock held for the inode:
> > > > > >
> > > > > >        /* Yes, some filesystems do change nlink from zero to
> > > > > > one */
> > > > > >        if (inode->i_nlink == 0)
> > > > > > atomic_long_dec(&inode->i_sb->s_remove_count);
> > > > > >        inode->__i_nlink = nlink;
> > > > > >
> > > > > > Since i_nlink and __i_nlink refer to the same memory
> > > > > > location,
> > > > > > the 'inode->i_nlink == 0' check can be true for all of them
> > > > > > before the nonzero nlink value gets set, and this results in
> > > > > > s_remove_count being decremented more than once.
> > > > >
> > > > > Thanks for the comment Arnd.
> > > >
> > > > Hello all,
> > > >
> > > >
> > > > I've been looking at this and after consulting Miklos and his
> > > > pointing
> > > >
> > > > out that it looks like a false positive the urgency dropped off a
> > > > bit. So
> > > >
> > > > apologies for taking so long to report back.
> > > >
> > > >
> > > > Anyway it needs some description of conclusions reached so far.
> > > >
> > > >
> > > > I'm still looking around but in short, kernfs will set
> > > > directories to <# of
> > > >
> > > > directory entries> + 2 unconditionally for directories. I can't
> > > > yet find
> > > >
> > > > any other places where i_nlink is set or changed and if there are
> > > > none
> > > >
> > > > then i_nlink will never be set to zero so the race should not
> > > > occur.
> > > >
> > > >
> > > > Consequently my claim is this is a real false positive.
> > > >
> > > >
> > > > There are the file system operations that may be passed at mount
> > > > time
> > > >
> > > > but given the way kernfs sets i_nlink it pretty much dictates
> > > > those
> > > > operations
> > > >
> > > > (if there were any that modify it and there don't appear to be
> > > > any) leave it
> > > >
> > > > alone.
> > > >
> > > >
> > > > So it just doesn't make sense for users of kernfs to fiddle with
> > > > i_nlink ...
> > > On todays next tag, next-20230718 this KCSAN BUG poped up again.
> > > When I
> > > built an allmodconfig arm64 kernel and booted it in QEMU. Full log
> > > can
> > > be found http://ix.io/4AUd
> > >
> > > [ 1694.987789][  T137] BUG: KCSAN: data-race in inode_permission /
> > > kernfs_refresh_inode
> > > [ 1694.992912][  T137]
> > > [ 1694.994532][  T137] write to 0xffff00000bab6070 of 2 bytes by
> > > task 104 on cpu 0:
> > > [ 1694.999269][ T137] kernfs_refresh_inode
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
> > > [ 1695.002707][ T137] kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > > [ 1695.006148][ T137] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:461
> > > /home/anders/src/kernel/next/fs/namei.c:528)
> > > [ 1695.009420][ T137] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1695.012643][ T137] path_lookupat
> > > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > > [ 1695.015781][ T137] filename_lookup
> > > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > > [ 1695.019059][ T137] vfs_statx
> > > (/home/anders/src/kernel/next/fs/stat.c:238)
> > > [ 1695.022024][ T137] vfs_fstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:276)
> > > [ 1695.025067][ T137] __do_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:446)
> > > [ 1695.028497][ T137] __arm64_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:440
> > > /home/anders/src/kernel/next/fs/stat.c:440)
> > > [ 1695.032080][ T137] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1695.035916][ T137] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1695.038796][ T137] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1695.041468][ T137] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1695.044889][ T137] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1695.047904][  T137]
> > > [ 1695.049511][  T137] 1 lock held by systemd-udevd/104:
> > > [ 1695.052837][ T137] #0: ffff000006681e08 (&root-
> > > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > > [ 1695.060241][  T137] irq event stamp: 82902
> > > [ 1695.063006][ T137] hardirqs last enabled at (82901):
> > > _raw_spin_unlock_irqrestore
> > > (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-
> > > macros.h:250
> > > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27
> > > /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140
> > > /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151
> > > /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
> > > [ 1695.069673][ T137] hardirqs last disabled at (82902):
> > > el1_interrupt
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > > [ 1695.075474][ T137] softirqs last enabled at (82792):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1695.082319][ T137] softirqs last disabled at (82790):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1695.089049][  T137]
> > > [ 1695.090659][  T137] read to 0xffff00000bab6070 of 2 bytes by
> > > task 137 on cpu 0:
> > > [ 1695.095374][ T137] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:532)
> > > [ 1695.098655][ T137] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1695.101857][ T137] path_openat
> > > (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
> > > [ 1695.104885][ T137] do_filp_open
> > > (/home/anders/src/kernel/next/fs/namei.c:3820)
> > > [ 1695.108006][ T137] do_sys_openat2
> > > (/home/anders/src/kernel/next/fs/open.c:1418)
> > > [ 1695.111290][ T137] __arm64_sys_openat
> > > (/home/anders/src/kernel/next/fs/open.c:1433)
> > > [ 1695.114825][ T137] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1695.118662][ T137] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1695.121555][ T137] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1695.124207][ T137] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1695.127590][ T137] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1695.130641][  T137]
> > > [ 1695.132241][  T137] no locks held by systemd-udevd/137.
> > > [ 1695.135618][  T137] irq event stamp: 3246
> > > [ 1695.138519][ T137] hardirqs last enabled at (3245):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > > [ 1695.145825][ T137] hardirqs last disabled at (3246):
> > > el1_interrupt
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
> > > [ 1695.151942][ T137] softirqs last enabled at (3208):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1695.158950][ T137] softirqs last disabled at (3206):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1695.166036][  T137]
> > > [ 1695.167621][  T137] Reported by Kernel Concurrency Sanitizer on:
> > > [ 1695.179990][  T137] Hardware name: linux,dummy-virt (DT)
> > > [ 1695.183687][  T137]
> > > ==================================================================
> >
> >
> > This one is different to the original.
> >
> >
> > I can't see where the problem is here, can someone help me out
> >
> > please.
> >
> >
> > I don't see any shared data values used by the call
> >
> > devcgroup_inode_permission(inode, mask) in
> > fs/namei.c:inode_permission()
> >
> > that might be a problem during the read except possibly inode-
> > >i_mode.
> >
> >
> > I'll check on that ... maybe something's been missed when
> > kernfs_rwsem
> >
> > was changed to a separate lock (kernfs_iattr_rwsem).
> >
> >
> > >
> > > [...]
> > >
> > > [ 1738.053819][  T104] BUG: KCSAN: data-race in set_nlink /
> > > set_nlink
> > > [ 1738.058223][  T104]
> > > [ 1738.059865][  T104] read to 0xffff00000bab6918 of 4 bytes by
> > > task 108 on cpu 0:
> > > [ 1738.064916][ T104] set_nlink
> > > (/home/anders/src/kernel/next/fs/inode.c:369)
> > > [ 1738.067845][ T104] kernfs_refresh_inode
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > > [ 1738.071607][ T104] kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > > [ 1738.075467][ T104] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:461
> > > /home/anders/src/kernel/next/fs/namei.c:528)
> > > [ 1738.078868][ T104] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1738.082270][ T104] path_lookupat
> > > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > > [ 1738.085488][ T104] filename_lookup
> > > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > > [ 1738.089101][ T104] user_path_at_empty
> > > (/home/anders/src/kernel/next/fs/namei.c:2907)
> > > [ 1738.092469][ T104] do_readlinkat
> > > (/home/anders/src/kernel/next/fs/stat.c:477)
> > > [ 1738.095970][ T104] __arm64_sys_readlinkat
> > > (/home/anders/src/kernel/next/fs/stat.c:504
> > > /home/anders/src/kernel/next/fs/stat.c:501
> > > /home/anders/src/kernel/next/fs/stat.c:501)
> > > [ 1738.099529][ T104] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1738.103696][ T104] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1738.106560][ T104] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1738.109613][ T104] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1738.113035][ T104] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1738.116346][  T104]
> > > [ 1738.117924][  T104] 1 lock held by systemd-udevd/108:
> > > [ 1738.121580][ T104] #0: ffff000006681e08 (&root-
> > > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > > [ 1738.129355][  T104] irq event stamp: 31000
> > > [ 1738.132088][ T104] hardirqs last enabled at (31000):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > > [ 1738.139417][ T104] hardirqs last disabled at (30999):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > > [ 1738.146781][ T104] softirqs last enabled at (30973):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1738.153891][ T104] softirqs last disabled at (30971):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1738.161012][  T104]
> > > [ 1738.162663][  T104] write to 0xffff00000bab6918 of 4 bytes by
> > > task 104 on cpu 0:
> > > [ 1738.167730][ T104] set_nlink
> > > (/home/anders/src/kernel/next/fs/inode.c:372)
> > > [ 1738.170559][ T104] kernfs_refresh_inode
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
> > > [ 1738.174355][ T104] kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
> > > [ 1738.177829][ T104] inode_permission
> > > (/home/anders/src/kernel/next/fs/namei.c:461
> > > /home/anders/src/kernel/next/fs/namei.c:528)
> > > [ 1738.181403][ T104] link_path_walk
> > > (/home/anders/src/kernel/next/fs/namei.c:1720
> > > /home/anders/src/kernel/next/fs/namei.c:2267)
> > > [ 1738.184738][ T104] path_lookupat
> > > (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
> > > [ 1738.188268][ T104] filename_lookup
> > > (/home/anders/src/kernel/next/fs/namei.c:2508)
> > > [ 1738.191865][ T104] vfs_statx
> > > (/home/anders/src/kernel/next/fs/stat.c:238)
> > > [ 1738.196236][ T104] vfs_fstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:276)
> > > [ 1738.200120][ T104] __do_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:446)
> > > [ 1738.204095][ T104] __arm64_sys_newfstatat
> > > (/home/anders/src/kernel/next/fs/stat.c:440
> > > /home/anders/src/kernel/next/fs/stat.c:440)
> > > [ 1738.207676][ T104] el0_svc_common.constprop.0
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
> > > /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
> > > [ 1738.211820][ T104] do_el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
> > > [ 1738.214815][ T104] el0_svc
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
> > > /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
> > > [ 1738.217709][ T104] el0t_64_sync_handler
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
> > > [ 1738.221239][ T104] el0t_64_sync
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
> > > [ 1738.224502][  T104]
> > > [ 1738.226090][  T104] 1 lock held by systemd-udevd/104:
> > > [ 1738.229747][ T104] #0: ffff000006681e08 (&root-
> > > >kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
> > > (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
> > > [ 1738.237504][  T104] irq event stamp: 108353
> > > [ 1738.240262][ T104] hardirqs last enabled at (108353):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
> > > [ 1738.247443][ T104] hardirqs last disabled at (108352):
> > > seqcount_lockdep_reader_access
> > > (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
> > > [ 1738.254510][ T104] softirqs last enabled at (108326):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
> > > [ 1738.262187][ T104] softirqs last disabled at (108324):
> > > fpsimd_restore_current_state
> > > (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
> > > /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
> > > [ 1738.270239][  T104]
> > > [ 1738.272140][  T104] Reported by Kernel Concurrency Sanitizer on:
> > > [ 1738.285185][  T104] Hardware name: linux,dummy-virt (DT)
> > > [ 1738.288703][  T104]
> > > ==================================================================
> >
> > This looks just like the original except a different lock is being
> >
> > used now.
> >
> >
> > The link count can't be less than two if set_nlink() is called.
> >
> >
> > Maybe I am missing something but the directory count is changed only
> >
> > while holding the root->kernfs_iattr_rwsem so the value used by
> >
> > set_nlink() will not change during concurrent calls to
> > refresh_inode().
> >
> > Still looks like a false positive, I'll check the write operations
> > again just to be sure.
>
> I do see a problem with recent changes.
>
> I'll send this off to Greg after I've done some testing (primarily just
> compile and function).
>
> Here's a patch which describes what I found.
>
> Comments are, of course, welcome, ;)
>
> kernfs: fix missing kernfs_iattr_rwsem locking
>
> From: Ian Kent <raven@themaw.net>
>
> When the kernfs_iattr_rwsem was introduced a case was missed.
>
> The update of the kernfs directory node child count was also protected
> by the kernfs_rwsem and needs to be included in the change so that the
> child count (and so the inode n_link attribute) does not change while
> holding the rwsem for read.
>
> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
> attributes)
>
> Signed-off-by: Ian Kent <raven@themaw.net>
> Cc: Anders Roxell <anders.roxell@linaro.org>
> Cc: Imran Khan <imran.f.khan@oracle.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Eric Sandeen <sandeen@sandeen.net>

Looks good.

Acked-by: Miklos Szeredi <mszeredi@redhat.com>
Ian Kent July 27, 2023, 12:38 a.m. UTC | #12
On 20/7/23 10:03, Ian Kent wrote:
> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>> On 19/7/23 03:00, Anders Roxell wrote:
>>> On 2023-01-23 11:11, Ian Kent wrote:
>>>> On 29/12/22 21:07, Ian Kent wrote:
>>>>> On 29/12/22 17:20, Arnd Bergmann wrote:
>>>>>> On Fri, Dec 23, 2022, at 00:11, Ian Kent wrote:
>>>>>>> On 21/12/22 21:34, Anders Roxell wrote:
>>>>>>>> On 2022-10-31 12:30, Tejun Heo wrote:
>>>>>>>>> On Tue, Oct 18, 2022 at 10:32:42AM +0800, Ian Kent
>>>>>>>>> wrote:
>>>>>>>>>> The kernfs write lock is held when the kernfs node
>>>>>>>>>> inode attributes
>>>>>>>>>> are updated. Therefore, when either
>>>>>>>>>> kernfs_iop_getattr() or
>>>>>>>>>> kernfs_iop_permission() are called the kernfs node
>>>>>>>>>> inode attributes
>>>>>>>>>> won't change.
>>>>>>>>>>
>>>>>>>>>> Consequently concurrent kernfs_refresh_inode() calls
>>>>>>>>>> always copy the
>>>>>>>>>> same values from the kernfs node.
>>>>>>>>>>
>>>>>>>>>> So there's no need to take the inode i_lock to get
>>>>>>>>>> consistent values
>>>>>>>>>> for generic_fillattr() and generic_permission(), the
>>>>>>>>>> kernfs read lock
>>>>>>>>>> is sufficient.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>>>>>>>> Acked-by: Tejun Heo <tj@kernel.org>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Building an allmodconfig arm64 kernel on yesterdays next-
>>>>>>>> 20221220 and
>>>>>>>> booting that in qemu I see the following "BUG: KCSAN:
>>>>>>>> data-race in
>>>>>>>> set_nlink / set_nlink".
>>>>>>> I'll check if I missed any places where set_link() could be
>>>>>>> called where the link count could be different.
>>>>>>>
>>>>>>>
>>>>>>> If there aren't any the question will then be can writing
>>>>>>> the
>>>>>>> same value to this location in multiple concurrent threads
>>>>>>> corrupt it?
>>>>>> I think the race that is getting reported for set_nlink()
>>>>>> is about this bit getting called simulatenously on multiple
>>>>>> CPUs with only the read lock held for the inode:
>>>>>>
>>>>>>         /* Yes, some filesystems do change nlink from zero to
>>>>>> one */
>>>>>>         if (inode->i_nlink == 0)
>>>>>> atomic_long_dec(&inode->i_sb->s_remove_count);
>>>>>>         inode->__i_nlink = nlink;
>>>>>>
>>>>>> Since i_nlink and __i_nlink refer to the same memory
>>>>>> location,
>>>>>> the 'inode->i_nlink == 0' check can be true for all of them
>>>>>> before the nonzero nlink value gets set, and this results in
>>>>>> s_remove_count being decremented more than once.
>>>>> Thanks for the comment Arnd.
>>>> Hello all,
>>>>
>>>>
>>>> I've been looking at this and after consulting Miklos and his
>>>> pointing
>>>>
>>>> out that it looks like a false positive the urgency dropped off a
>>>> bit. So
>>>>
>>>> apologies for taking so long to report back.
>>>>
>>>>
>>>> Anyway it needs some description of conclusions reached so far.
>>>>
>>>>
>>>> I'm still looking around but in short, kernfs will set
>>>> directories to <# of
>>>>
>>>> directory entries> + 2 unconditionally for directories. I can't
>>>> yet find
>>>>
>>>> any other places where i_nlink is set or changed and if there are
>>>> none
>>>>
>>>> then i_nlink will never be set to zero so the race should not
>>>> occur.
>>>>
>>>>
>>>> Consequently my claim is this is a real false positive.
>>>>
>>>>
>>>> There are the file system operations that may be passed at mount
>>>> time
>>>>
>>>> but given the way kernfs sets i_nlink it pretty much dictates
>>>> those
>>>> operations
>>>>
>>>> (if there were any that modify it and there don't appear to be
>>>> any) leave it
>>>>
>>>> alone.
>>>>
>>>>
>>>> So it just doesn't make sense for users of kernfs to fiddle with
>>>> i_nlink ...
>>> On todays next tag, next-20230718 this KCSAN BUG poped up again.
>>> When I
>>> built an allmodconfig arm64 kernel and booted it in QEMU. Full log
>>> can
>>> be found http://ix.io/4AUd
>>>
>>> [ 1694.987789][  T137] BUG: KCSAN: data-race in inode_permission /
>>> kernfs_refresh_inode
>>> [ 1694.992912][  T137]
>>> [ 1694.994532][  T137] write to 0xffff00000bab6070 of 2 bytes by
>>> task 104 on cpu 0:
>>> [ 1694.999269][ T137] kernfs_refresh_inode
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:171)
>>> [ 1695.002707][ T137] kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
>>> [ 1695.006148][ T137] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:461
>>> /home/anders/src/kernel/next/fs/namei.c:528)
>>> [ 1695.009420][ T137] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1695.012643][ T137] path_lookupat
>>> (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
>>> [ 1695.015781][ T137] filename_lookup
>>> (/home/anders/src/kernel/next/fs/namei.c:2508)
>>> [ 1695.019059][ T137] vfs_statx
>>> (/home/anders/src/kernel/next/fs/stat.c:238)
>>> [ 1695.022024][ T137] vfs_fstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:276)
>>> [ 1695.025067][ T137] __do_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:446)
>>> [ 1695.028497][ T137] __arm64_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:440
>>> /home/anders/src/kernel/next/fs/stat.c:440)
>>> [ 1695.032080][ T137] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1695.035916][ T137] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1695.038796][ T137] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1695.041468][ T137] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1695.044889][ T137] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1695.047904][  T137]
>>> [ 1695.049511][  T137] 1 lock held by systemd-udevd/104:
>>> [ 1695.052837][ T137] #0: ffff000006681e08 (&root-
>>>> kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
>>> [ 1695.060241][  T137] irq event stamp: 82902
>>> [ 1695.063006][ T137] hardirqs last enabled at (82901):
>>> _raw_spin_unlock_irqrestore
>>> (/home/anders/src/kernel/next/arch/arm64/include/asm/alternative-
>>> macros.h:250
>>> /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:27
>>> /home/anders/src/kernel/next/arch/arm64/include/asm/irqflags.h:140
>>> /home/anders/src/kernel/next/include/linux/spinlock_api_smp.h:151
>>> /home/anders/src/kernel/next/kernel/locking/spinlock.c:194)
>>> [ 1695.069673][ T137] hardirqs last disabled at (82902):
>>> el1_interrupt
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
>>> [ 1695.075474][ T137] softirqs last enabled at (82792):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1695.082319][ T137] softirqs last disabled at (82790):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1695.089049][  T137]
>>> [ 1695.090659][  T137] read to 0xffff00000bab6070 of 2 bytes by
>>> task 137 on cpu 0:
>>> [ 1695.095374][ T137] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:532)
>>> [ 1695.098655][ T137] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1695.101857][ T137] path_openat
>>> (/home/anders/src/kernel/next/fs/namei.c:3789 (discriminator 2))
>>> [ 1695.104885][ T137] do_filp_open
>>> (/home/anders/src/kernel/next/fs/namei.c:3820)
>>> [ 1695.108006][ T137] do_sys_openat2
>>> (/home/anders/src/kernel/next/fs/open.c:1418)
>>> [ 1695.111290][ T137] __arm64_sys_openat
>>> (/home/anders/src/kernel/next/fs/open.c:1433)
>>> [ 1695.114825][ T137] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1695.118662][ T137] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1695.121555][ T137] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1695.124207][ T137] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1695.127590][ T137] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1695.130641][  T137]
>>> [ 1695.132241][  T137] no locks held by systemd-udevd/137.
>>> [ 1695.135618][  T137] irq event stamp: 3246
>>> [ 1695.138519][ T137] hardirqs last enabled at (3245):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
>>> [ 1695.145825][ T137] hardirqs last disabled at (3246):
>>> el1_interrupt
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:472
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:488)
>>> [ 1695.151942][ T137] softirqs last enabled at (3208):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1695.158950][ T137] softirqs last disabled at (3206):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1695.166036][  T137]
>>> [ 1695.167621][  T137] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1695.179990][  T137] Hardware name: linux,dummy-virt (DT)
>>> [ 1695.183687][  T137]
>>> ==================================================================
>>
>> This one is different to the original.
>>
>>
>> I can't see where the problem is here, can someone help me out
>>
>> please.
>>
>>
>> I don't see any shared data values used by the call
>>
>> devcgroup_inode_permission(inode, mask) in
>> fs/namei.c:inode_permission()
>>
>> that might be a problem during the read except possibly inode-
>>> i_mode.
>>
>> I'll check on that ... maybe something's been missed when
>> kernfs_rwsem
>>
>> was changed to a separate lock (kernfs_iattr_rwsem).
>>
>>
>>> [...]
>>>
>>> [ 1738.053819][  T104] BUG: KCSAN: data-race in set_nlink /
>>> set_nlink
>>> [ 1738.058223][  T104]
>>> [ 1738.059865][  T104] read to 0xffff00000bab6918 of 4 bytes by
>>> task 108 on cpu 0:
>>> [ 1738.064916][ T104] set_nlink
>>> (/home/anders/src/kernel/next/fs/inode.c:369)
>>> [ 1738.067845][ T104] kernfs_refresh_inode
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
>>> [ 1738.071607][ T104] kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
>>> [ 1738.075467][ T104] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:461
>>> /home/anders/src/kernel/next/fs/namei.c:528)
>>> [ 1738.078868][ T104] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1738.082270][ T104] path_lookupat
>>> (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
>>> [ 1738.085488][ T104] filename_lookup
>>> (/home/anders/src/kernel/next/fs/namei.c:2508)
>>> [ 1738.089101][ T104] user_path_at_empty
>>> (/home/anders/src/kernel/next/fs/namei.c:2907)
>>> [ 1738.092469][ T104] do_readlinkat
>>> (/home/anders/src/kernel/next/fs/stat.c:477)
>>> [ 1738.095970][ T104] __arm64_sys_readlinkat
>>> (/home/anders/src/kernel/next/fs/stat.c:504
>>> /home/anders/src/kernel/next/fs/stat.c:501
>>> /home/anders/src/kernel/next/fs/stat.c:501)
>>> [ 1738.099529][ T104] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1738.103696][ T104] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1738.106560][ T104] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1738.109613][ T104] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1738.113035][ T104] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1738.116346][  T104]
>>> [ 1738.117924][  T104] 1 lock held by systemd-udevd/108:
>>> [ 1738.121580][ T104] #0: ffff000006681e08 (&root-
>>>> kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
>>> [ 1738.129355][  T104] irq event stamp: 31000
>>> [ 1738.132088][ T104] hardirqs last enabled at (31000):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
>>> [ 1738.139417][ T104] hardirqs last disabled at (30999):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
>>> [ 1738.146781][ T104] softirqs last enabled at (30973):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1738.153891][ T104] softirqs last disabled at (30971):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1738.161012][  T104]
>>> [ 1738.162663][  T104] write to 0xffff00000bab6918 of 4 bytes by
>>> task 104 on cpu 0:
>>> [ 1738.167730][ T104] set_nlink
>>> (/home/anders/src/kernel/next/fs/inode.c:372)
>>> [ 1738.170559][ T104] kernfs_refresh_inode
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:180)
>>> [ 1738.174355][ T104] kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:289)
>>> [ 1738.177829][ T104] inode_permission
>>> (/home/anders/src/kernel/next/fs/namei.c:461
>>> /home/anders/src/kernel/next/fs/namei.c:528)
>>> [ 1738.181403][ T104] link_path_walk
>>> (/home/anders/src/kernel/next/fs/namei.c:1720
>>> /home/anders/src/kernel/next/fs/namei.c:2267)
>>> [ 1738.184738][ T104] path_lookupat
>>> (/home/anders/src/kernel/next/fs/namei.c:2478 (discriminator 2))
>>> [ 1738.188268][ T104] filename_lookup
>>> (/home/anders/src/kernel/next/fs/namei.c:2508)
>>> [ 1738.191865][ T104] vfs_statx
>>> (/home/anders/src/kernel/next/fs/stat.c:238)
>>> [ 1738.196236][ T104] vfs_fstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:276)
>>> [ 1738.200120][ T104] __do_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:446)
>>> [ 1738.204095][ T104] __arm64_sys_newfstatat
>>> (/home/anders/src/kernel/next/fs/stat.c:440
>>> /home/anders/src/kernel/next/fs/stat.c:440)
>>> [ 1738.207676][ T104] el0_svc_common.constprop.0
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:38
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:52
>>> /home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:139)
>>> [ 1738.211820][ T104] do_el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/syscall.c:188)
>>> [ 1738.214815][ T104] el0_svc
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:133
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:144
>>> /home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:648)
>>> [ 1738.217709][ T104] el0t_64_sync_handler
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry-common.c:666)
>>> [ 1738.221239][ T104] el0t_64_sync
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/entry.S:591)
>>> [ 1738.224502][  T104]
>>> [ 1738.226090][  T104] 1 lock held by systemd-udevd/104:
>>> [ 1738.229747][ T104] #0: ffff000006681e08 (&root-
>>>> kernfs_iattr_rwsem){++++}-{3:3}, at: kernfs_iop_permission
>>> (/home/anders/src/kernel/next/fs/kernfs/inode.c:288)
>>> [ 1738.237504][  T104] irq event stamp: 108353
>>> [ 1738.240262][ T104] hardirqs last enabled at (108353):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:105)
>>> [ 1738.247443][ T104] hardirqs last disabled at (108352):
>>> seqcount_lockdep_reader_access
>>> (/home/anders/src/kernel/next/include/linux/seqlock.h:104)
>>> [ 1738.254510][ T104] softirqs last enabled at (108326):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:264
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1791)
>>> [ 1738.262187][ T104] softirqs last disabled at (108324):
>>> fpsimd_restore_current_state
>>> (/home/anders/src/kernel/next/include/linux/bottom_half.h:20
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:242
>>> /home/anders/src/kernel/next/arch/arm64/kernel/fpsimd.c:1784)
>>> [ 1738.270239][  T104]
>>> [ 1738.272140][  T104] Reported by Kernel Concurrency Sanitizer on:
>>> [ 1738.285185][  T104] Hardware name: linux,dummy-virt (DT)
>>> [ 1738.288703][  T104]
>>> ==================================================================
>> This looks just like the original except a different lock is being
>>
>> used now.
>>
>>
>> The link count can't be less than two if set_nlink() is called.
>>
>>
>> Maybe I am missing something but the directory count is changed only
>>
>> while holding the root->kernfs_iattr_rwsem so the value used by
>>
>> set_nlink() will not change during concurrent calls to
>> refresh_inode().
>>
>> Still looks like a false positive, I'll check the write operations
>> again just to be sure.
> I do see a problem with recent changes.
>
> I'll send this off to Greg after I've done some testing (primarily just
> compile and function).
>
> Here's a patch which describes what I found.
>
> Comments are, of course, welcome, ;)

Anders I was hoping you would check if/what lockdep trace

you get with this patch.


Imran, I was hoping you would comment on my change as it

relates to the kernfs_iattr_rwsem changes.


Ian

>
> kernfs: fix missing kernfs_iattr_rwsem locking
>
> From: Ian Kent <raven@themaw.net>
>
> When the kernfs_iattr_rwsem was introduced a case was missed.
>
> The update of the kernfs directory node child count was also protected
> by the kernfs_rwsem and needs to be included in the change so that the
> child count (and so the inode n_link attribute) does not change while
> holding the rwsem for read.
>
> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
> attributes)
>
> Signed-off-by: Ian Kent <raven@themaw.net>
> Cc: Anders Roxell <anders.roxell@linaro.org>
> Cc: Imran Khan <imran.f.khan@oracle.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Eric Sandeen <sandeen@sandeen.net>
> ---
>   fs/kernfs/dir.c |    4 ++++
>   1 file changed, 4 insertions(+)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 45b6919903e6..6e84bb69602e 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
> *kn)
>   	rb_insert_color(&kn->rb, &kn->parent->dir.children);
>   
>   	/* successfully added, account subdir number */
> +	down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>   	if (kernfs_type(kn) == KERNFS_DIR)
>   		kn->parent->dir.subdirs++;
>   	kernfs_inc_rev(kn->parent);
> +	up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>   
>   	return 0;
>   }
> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
> kernfs_node *kn)
>   	if (RB_EMPTY_NODE(&kn->rb))
>   		return false;
>   
> +	down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>   	if (kernfs_type(kn) == KERNFS_DIR)
>   		kn->parent->dir.subdirs--;
>   	kernfs_inc_rev(kn->parent);
> +	up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>   
>   	rb_erase(&kn->rb, &kn->parent->dir.children);
>   	RB_CLEAR_NODE(&kn->rb);
>
Imran Khan July 27, 2023, 4:30 a.m. UTC | #13
Hello Ian,
Sorry for late reply. I was about to reply this week.

On 27/7/2023 10:38 am, Ian Kent wrote:
> On 20/7/23 10:03, Ian Kent wrote:
>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:

[...]
>> I do see a problem with recent changes.
>>
>> I'll send this off to Greg after I've done some testing (primarily just
>> compile and function).
>>
>> Here's a patch which describes what I found.
>>
>> Comments are, of course, welcome, ;)
> 
> Anders I was hoping you would check if/what lockdep trace
> 
> you get with this patch.
> 
> 
> Imran, I was hoping you would comment on my change as it
> 
> relates to the kernfs_iattr_rwsem changes.
> 
> 
> Ian
> 
>>
>> kernfs: fix missing kernfs_iattr_rwsem locking
>>
>> From: Ian Kent <raven@themaw.net>
>>
>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>
>> The update of the kernfs directory node child count was also protected
>> by the kernfs_rwsem and needs to be included in the change so that the
>> child count (and so the inode n_link attribute) does not change while
>> holding the rwsem for read.
>>

kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
these are getting invoked while adding (kernfs_add_one),
removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
operations proceed under kernfs_rwsem and I see each invocation of
kernfs_link/unlink_sibling during the above mentioned operations is happening
under kernfs_rwsem.
So the child count should still be protected by kernfs_rwsem and we should not
need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.

Kindly let me know your thoughts. I would still like to see new lockdep traces
with this change.

Thanks,
Imran

>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>> attributes)
>>
>> Signed-off-by: Ian Kent <raven@themaw.net>
>> Cc: Anders Roxell <anders.roxell@linaro.org>
>> Cc: Imran Khan <imran.f.khan@oracle.com>
>> Cc: Arnd Bergmann <arnd@arndb.de>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Eric Sandeen <sandeen@sandeen.net>
>> ---
>>   fs/kernfs/dir.c |    4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>> index 45b6919903e6..6e84bb69602e 100644
>> --- a/fs/kernfs/dir.c
>> +++ b/fs/kernfs/dir.c
>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>> *kn)
>>       rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>         /* successfully added, account subdir number */
>> +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>       if (kernfs_type(kn) == KERNFS_DIR)
>>           kn->parent->dir.subdirs++;
>>       kernfs_inc_rev(kn->parent);
>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>         return 0;
>>   }
>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>> kernfs_node *kn)
>>       if (RB_EMPTY_NODE(&kn->rb))
>>           return false;
>>   +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>       if (kernfs_type(kn) == KERNFS_DIR)
>>           kn->parent->dir.subdirs--;
>>       kernfs_inc_rev(kn->parent);
>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>         rb_erase(&kn->rb, &kn->parent->dir.children);
>>       RB_CLEAR_NODE(&kn->rb);
>>
Imran Khan July 27, 2023, 5:35 a.m. UTC | #14
Hello again Ian,
I take back my previous comment :).

On 27/7/2023 2:30 pm, Imran Khan wrote:
> Hello Ian,
> Sorry for late reply. I was about to reply this week.
> 
> On 27/7/2023 10:38 am, Ian Kent wrote:
>> On 20/7/23 10:03, Ian Kent wrote:
>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
> 
> [...]
>>> I do see a problem with recent changes.
>>>
>>> I'll send this off to Greg after I've done some testing (primarily just
>>> compile and function).
>>>
>>> Here's a patch which describes what I found.
>>>
>>> Comments are, of course, welcome, ;)
>>
>> Anders I was hoping you would check if/what lockdep trace
>>
>> you get with this patch.
>>
>>
>> Imran, I was hoping you would comment on my change as it
>>
>> relates to the kernfs_iattr_rwsem changes.
>>
>>
>> Ian
>>
>>>
>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>
>>> From: Ian Kent <raven@themaw.net>
>>>
>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>
>>> The update of the kernfs directory node child count was also protected
>>> by the kernfs_rwsem and needs to be included in the change so that the
>>> child count (and so the inode n_link attribute) does not change while
>>> holding the rwsem for read.
>>>
> 
> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
> these are getting invoked while adding (kernfs_add_one),
> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
> operations proceed under kernfs_rwsem and I see each invocation of
> kernfs_link/unlink_sibling during the above mentioned operations is happening
> under kernfs_rwsem.
> So the child count should still be protected by kernfs_rwsem and we should not
> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
> 

kernfs_refresh_inode can still race against kernfs_link/unlink_siblings. So your
change looks good to me.
My tests are not showing any issues either. ( I tested on 4.14 and 5.4 kernels
as well).

Fee free to add my RB.

Reviewed-by: Imran Khan <imran.f.khan@oracle.com>

> Kindly let me know your thoughts. I would still like to see new lockdep traces
> with this change.
> 
> Thanks,
> Imran
> 
>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>> attributes)
>>>
>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>> Cc: Anders Roxell <anders.roxell@linaro.org>
>>> Cc: Imran Khan <imran.f.khan@oracle.com>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: Minchan Kim <minchan@kernel.org>
>>> Cc: Eric Sandeen <sandeen@sandeen.net>
>>> ---
>>>   fs/kernfs/dir.c |    4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>> index 45b6919903e6..6e84bb69602e 100644
>>> --- a/fs/kernfs/dir.c
>>> +++ b/fs/kernfs/dir.c
>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>> *kn)
>>>       rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>         /* successfully added, account subdir number */
>>> +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>       if (kernfs_type(kn) == KERNFS_DIR)
>>>           kn->parent->dir.subdirs++;
>>>       kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>         return 0;
>>>   }
>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>> kernfs_node *kn)
>>>       if (RB_EMPTY_NODE(&kn->rb))
>>>           return false;
>>>   +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>       if (kernfs_type(kn) == KERNFS_DIR)
>>>           kn->parent->dir.subdirs--;
>>>       kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>         rb_erase(&kn->rb, &kn->parent->dir.children);
>>>       RB_CLEAR_NODE(&kn->rb);
>>>
Ian Kent July 28, 2023, midnight UTC | #15
On 27/7/23 12:30, Imran Khan wrote:
> Hello Ian,
> Sorry for late reply. I was about to reply this week.
>
> On 27/7/2023 10:38 am, Ian Kent wrote:
>> On 20/7/23 10:03, Ian Kent wrote:
>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
> [...]
>>> I do see a problem with recent changes.
>>>
>>> I'll send this off to Greg after I've done some testing (primarily just
>>> compile and function).
>>>
>>> Here's a patch which describes what I found.
>>>
>>> Comments are, of course, welcome, ;)
>> Anders I was hoping you would check if/what lockdep trace
>>
>> you get with this patch.
>>
>>
>> Imran, I was hoping you would comment on my change as it
>>
>> relates to the kernfs_iattr_rwsem changes.
>>
>>
>> Ian
>>
>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>
>>> From: Ian Kent <raven@themaw.net>
>>>
>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>
>>> The update of the kernfs directory node child count was also protected
>>> by the kernfs_rwsem and needs to be included in the change so that the
>>> child count (and so the inode n_link attribute) does not change while
>>> holding the rwsem for read.
>>>
> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
> these are getting invoked while adding (kernfs_add_one),
> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
> operations proceed under kernfs_rwsem and I see each invocation of
> kernfs_link/unlink_sibling during the above mentioned operations is happening
> under kernfs_rwsem.
> So the child count should still be protected by kernfs_rwsem and we should not
> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.

Yes, that's exactly what I intended (assuming you mean write lock in 
those cases)

when I did it so now I wonder what I saw that lead to my patch, I'll 
need to look

again ...


>
> Kindly let me know your thoughts. I would still like to see new lockdep traces
> with this change.

Indeed, I hope Anders can find time to get the trace.


Ian

>
> Thanks,
> Imran
>
>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>> attributes)
>>>
>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>> Cc: Anders Roxell <anders.roxell@linaro.org>
>>> Cc: Imran Khan <imran.f.khan@oracle.com>
>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>> Cc: Minchan Kim <minchan@kernel.org>
>>> Cc: Eric Sandeen <sandeen@sandeen.net>
>>> ---
>>>    fs/kernfs/dir.c |    4 ++++
>>>    1 file changed, 4 insertions(+)
>>>
>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>> index 45b6919903e6..6e84bb69602e 100644
>>> --- a/fs/kernfs/dir.c
>>> +++ b/fs/kernfs/dir.c
>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>> *kn)
>>>        rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>          /* successfully added, account subdir number */
>>> +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>            kn->parent->dir.subdirs++;
>>>        kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>          return 0;
>>>    }
>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>> kernfs_node *kn)
>>>        if (RB_EMPTY_NODE(&kn->rb))
>>>            return false;
>>>    +    down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>            kn->parent->dir.subdirs--;
>>>        kernfs_inc_rev(kn->parent);
>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>          rb_erase(&kn->rb, &kn->parent->dir.children);
>>>        RB_CLEAR_NODE(&kn->rb);
>>>
Ian Kent July 28, 2023, 12:16 a.m. UTC | #16
On 28/7/23 08:00, Ian Kent wrote:
> On 27/7/23 12:30, Imran Khan wrote:
>> Hello Ian,
>> Sorry for late reply. I was about to reply this week.
>>
>> On 27/7/2023 10:38 am, Ian Kent wrote:
>>> On 20/7/23 10:03, Ian Kent wrote:
>>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>> [...]
>>>> I do see a problem with recent changes.
>>>>
>>>> I'll send this off to Greg after I've done some testing (primarily 
>>>> just
>>>> compile and function).
>>>>
>>>> Here's a patch which describes what I found.
>>>>
>>>> Comments are, of course, welcome, ;)
>>> Anders I was hoping you would check if/what lockdep trace
>>>
>>> you get with this patch.
>>>
>>>
>>> Imran, I was hoping you would comment on my change as it
>>>
>>> relates to the kernfs_iattr_rwsem changes.
>>>
>>>
>>> Ian
>>>
>>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>>
>>>> From: Ian Kent <raven@themaw.net>
>>>>
>>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>>
>>>> The update of the kernfs directory node child count was also protected
>>>> by the kernfs_rwsem and needs to be included in the change so that the
>>>> child count (and so the inode n_link attribute) does not change while
>>>> holding the rwsem for read.
>>>>
>> kernfs direcytory node's child count changes in 
>> kernfs_(un)link_sibling and
>> these are getting invoked while adding (kernfs_add_one),
>> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of 
>> these
>> operations proceed under kernfs_rwsem and I see each invocation of
>> kernfs_link/unlink_sibling during the above mentioned operations is 
>> happening
>> under kernfs_rwsem.
>> So the child count should still be protected by kernfs_rwsem and we 
>> should not
>> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
>
> Yes, that's exactly what I intended (assuming you mean write lock in 
> those cases)
>
> when I did it so now I wonder what I saw that lead to my patch, I'll 
> need to look
>
> again ...

Ahh, I see why I thought this ...

It's the hunk:

@@ -285,10 +285,10 @@ int kernfs_iop_permission(struct mnt_idmap *idmap,
         kn = inode->i_private;
         root = kernfs_root(kn);

-       down_read(&root->kernfs_rwsem);
+       down_read(&root->kernfs_iattr_rwsem);
         kernfs_refresh_inode(kn, inode);
         ret = generic_permission(&nop_mnt_idmap, inode, mask);
-       up_read(&root->kernfs_rwsem);
+       up_read(&root->kernfs_iattr_rwsem);

         return ret;
  }

which takes away the kernfs_rwsem and introduces the possibility of

the change. It may be more instructive to add back taking the read

lock of kernfs_rwsem in .permission() than altering the sibling link

and unlink functions, I mean I even caught myself on it.


Ian

>
>
>>
>> Kindly let me know your thoughts. I would still like to see new 
>> lockdep traces
>> with this change.
>
> Indeed, I hope Anders can find time to get the trace.
>
>
> Ian
>
>>
>> Thanks,
>> Imran
>>
>>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>>> attributes)
>>>>
>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>>> Cc: Anders Roxell <anders.roxell@linaro.org>
>>>> Cc: Imran Khan <imran.f.khan@oracle.com>
>>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>> Cc: Eric Sandeen <sandeen@sandeen.net>
>>>> ---
>>>>    fs/kernfs/dir.c |    4 ++++
>>>>    1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>>> index 45b6919903e6..6e84bb69602e 100644
>>>> --- a/fs/kernfs/dir.c
>>>> +++ b/fs/kernfs/dir.c
>>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>>> *kn)
>>>>        rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>>          /* successfully added, account subdir number */
>>>> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>            kn->parent->dir.subdirs++;
>>>>        kernfs_inc_rev(kn->parent);
>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>          return 0;
>>>>    }
>>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>>> kernfs_node *kn)
>>>>        if (RB_EMPTY_NODE(&kn->rb))
>>>>            return false;
>>>>    + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>            kn->parent->dir.subdirs--;
>>>>        kernfs_inc_rev(kn->parent);
>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>          rb_erase(&kn->rb, &kn->parent->dir.children);
>>>>        RB_CLEAR_NODE(&kn->rb);
>>>>
Imran Khan July 28, 2023, 1:06 a.m. UTC | #17
Hello Ian,

On 28/7/2023 10:16 am, Ian Kent wrote:
> On 28/7/23 08:00, Ian Kent wrote:
>> On 27/7/23 12:30, Imran Khan wrote:
>>> Hello Ian,
>>> Sorry for late reply. I was about to reply this week.
>>>
>>> On 27/7/2023 10:38 am, Ian Kent wrote:
>>>> On 20/7/23 10:03, Ian Kent wrote:
>>>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>>> [...]
>>>>> I do see a problem with recent changes.
>>>>>
>>>>> I'll send this off to Greg after I've done some testing (primarily just
>>>>> compile and function).
>>>>>
>>>>> Here's a patch which describes what I found.
>>>>>
>>>>> Comments are, of course, welcome, ;)
>>>> Anders I was hoping you would check if/what lockdep trace
>>>>
>>>> you get with this patch.
>>>>
>>>>
>>>> Imran, I was hoping you would comment on my change as it
>>>>
>>>> relates to the kernfs_iattr_rwsem changes.
>>>>
>>>>
>>>> Ian
>>>>
>>>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>>>
>>>>> From: Ian Kent <raven@themaw.net>
>>>>>
>>>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>>>
>>>>> The update of the kernfs directory node child count was also protected
>>>>> by the kernfs_rwsem and needs to be included in the change so that the
>>>>> child count (and so the inode n_link attribute) does not change while
>>>>> holding the rwsem for read.
>>>>>
>>> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
>>> these are getting invoked while adding (kernfs_add_one),
>>> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
>>> operations proceed under kernfs_rwsem and I see each invocation of
>>> kernfs_link/unlink_sibling during the above mentioned operations is happening
>>> under kernfs_rwsem.
>>> So the child count should still be protected by kernfs_rwsem and we should not
>>> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
>>
>> Yes, that's exactly what I intended (assuming you mean write lock in those cases)
>>
>> when I did it so now I wonder what I saw that lead to my patch, I'll need to look
>>
>> again ...
> 
> Ahh, I see why I thought this ...
> 
> It's the hunk:
> 
> @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct mnt_idmap *idmap,
>         kn = inode->i_private;
>         root = kernfs_root(kn);
> 
> -       down_read(&root->kernfs_rwsem);
> +       down_read(&root->kernfs_iattr_rwsem);
>         kernfs_refresh_inode(kn, inode);
>         ret = generic_permission(&nop_mnt_idmap, inode, mask);
> -       up_read(&root->kernfs_rwsem);
> +       up_read(&root->kernfs_iattr_rwsem);
> 
>         return ret;
>  }
> 
> which takes away the kernfs_rwsem and introduces the possibility of
> 
> the change. It may be more instructive to add back taking the read
> 
> lock of kernfs_rwsem in .permission() than altering the sibling link
> 
> and unlink functions, I mean I even caught myself on it.
> 

Yes this was the block I referred to in my second comment [1]. However adding
back read lock of kernfs_rwsem in .permission() will again introduce the
bottleneck that I mentioned at [2]. So I think taking taking the locks in
link/unlink is the better option.
I understand having one lock to synchronize everything makes it easier
debug/development wise but sometimes (such as the case mentioned in [2]), it is
not optimum performance wise.
Thoughts ?

Thanks,
Imran

[1]: https://lore.kernel.org/all/8b0a1619-1e39-fc3a-1226-f3b167e64646@oracle.com/
[2]: https://lore.kernel.org/all/20230302043203.1695051-2-imran.f.khan@oracle.com/
> 
> Ian
> 
>>
>>
>>>
>>> Kindly let me know your thoughts. I would still like to see new lockdep traces
>>> with this change.
>>
>> Indeed, I hope Anders can find time to get the trace.
>>
>>
>> Ian
>>
>>>
>>> Thanks,
>>> Imran
>>>
>>>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>>>> attributes)
>>>>>
>>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>>>> Cc: Anders Roxell <anders.roxell@linaro.org>
>>>>> Cc: Imran Khan <imran.f.khan@oracle.com>
>>>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>>> Cc: Eric Sandeen <sandeen@sandeen.net>
>>>>> ---
>>>>>    fs/kernfs/dir.c |    4 ++++
>>>>>    1 file changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>>>> index 45b6919903e6..6e84bb69602e 100644
>>>>> --- a/fs/kernfs/dir.c
>>>>> +++ b/fs/kernfs/dir.c
>>>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>>>> *kn)
>>>>>        rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>>>          /* successfully added, account subdir number */
>>>>> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>>            kn->parent->dir.subdirs++;
>>>>>        kernfs_inc_rev(kn->parent);
>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>          return 0;
>>>>>    }
>>>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>>>> kernfs_node *kn)
>>>>>        if (RB_EMPTY_NODE(&kn->rb))
>>>>>            return false;
>>>>>    + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>        if (kernfs_type(kn) == KERNFS_DIR)
>>>>>            kn->parent->dir.subdirs--;
>>>>>        kernfs_inc_rev(kn->parent);
>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>          rb_erase(&kn->rb, &kn->parent->dir.children);
>>>>>        RB_CLEAR_NODE(&kn->rb);
>>>>>
Ian Kent July 28, 2023, 1:29 a.m. UTC | #18
On 28/7/23 09:06, Imran Khan wrote:
> Hello Ian,
>
> On 28/7/2023 10:16 am, Ian Kent wrote:
>> On 28/7/23 08:00, Ian Kent wrote:
>>> On 27/7/23 12:30, Imran Khan wrote:
>>>> Hello Ian,
>>>> Sorry for late reply. I was about to reply this week.
>>>>
>>>> On 27/7/2023 10:38 am, Ian Kent wrote:
>>>>> On 20/7/23 10:03, Ian Kent wrote:
>>>>>> On Wed, 2023-07-19 at 12:23 +0800, Ian Kent wrote:
>>>> [...]
>>>>>> I do see a problem with recent changes.
>>>>>>
>>>>>> I'll send this off to Greg after I've done some testing (primarily just
>>>>>> compile and function).
>>>>>>
>>>>>> Here's a patch which describes what I found.
>>>>>>
>>>>>> Comments are, of course, welcome, ;)
>>>>> Anders I was hoping you would check if/what lockdep trace
>>>>>
>>>>> you get with this patch.
>>>>>
>>>>>
>>>>> Imran, I was hoping you would comment on my change as it
>>>>>
>>>>> relates to the kernfs_iattr_rwsem changes.
>>>>>
>>>>>
>>>>> Ian
>>>>>
>>>>>> kernfs: fix missing kernfs_iattr_rwsem locking
>>>>>>
>>>>>> From: Ian Kent <raven@themaw.net>
>>>>>>
>>>>>> When the kernfs_iattr_rwsem was introduced a case was missed.
>>>>>>
>>>>>> The update of the kernfs directory node child count was also protected
>>>>>> by the kernfs_rwsem and needs to be included in the change so that the
>>>>>> child count (and so the inode n_link attribute) does not change while
>>>>>> holding the rwsem for read.
>>>>>>
>>>> kernfs direcytory node's child count changes in kernfs_(un)link_sibling and
>>>> these are getting invoked while adding (kernfs_add_one),
>>>> removing(__kernfs_remove) or moving (kernfs_rename_ns)a node. Each of these
>>>> operations proceed under kernfs_rwsem and I see each invocation of
>>>> kernfs_link/unlink_sibling during the above mentioned operations is happening
>>>> under kernfs_rwsem.
>>>> So the child count should still be protected by kernfs_rwsem and we should not
>>>> need to acquire kernfs_iattr_rwsem in kernfs_link/unlink_sibling.
>>> Yes, that's exactly what I intended (assuming you mean write lock in those cases)
>>>
>>> when I did it so now I wonder what I saw that lead to my patch, I'll need to look
>>>
>>> again ...
>> Ahh, I see why I thought this ...
>>
>> It's the hunk:
>>
>> @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct mnt_idmap *idmap,
>>          kn = inode->i_private;
>>          root = kernfs_root(kn);
>>
>> -       down_read(&root->kernfs_rwsem);
>> +       down_read(&root->kernfs_iattr_rwsem);
>>          kernfs_refresh_inode(kn, inode);
>>          ret = generic_permission(&nop_mnt_idmap, inode, mask);
>> -       up_read(&root->kernfs_rwsem);
>> +       up_read(&root->kernfs_iattr_rwsem);
>>
>>          return ret;
>>   }
>>
>> which takes away the kernfs_rwsem and introduces the possibility of
>>
>> the change. It may be more instructive to add back taking the read
>>
>> lock of kernfs_rwsem in .permission() than altering the sibling link
>>
>> and unlink functions, I mean I even caught myself on it.
>>
> Yes this was the block I referred to in my second comment [1]. However adding
> back read lock of kernfs_rwsem in .permission() will again introduce the
> bottleneck that I mentioned at [2]. So I think taking taking the locks in
> link/unlink is the better option.

Yes, sorry, I always fall into the trap of not reading through all my

mail before commenting, oops!


The fact that .permission() is called so much more than the create/remove

functions also occurred to be me too just after I posted my comment (and

is probably why I originally did it that way).


I'll forward the patch for merge but would really like to see the lockdep

trace so I'll wait a little while ...

> I understand having one lock to synchronize everything makes it easier
> debug/development wise but sometimes (such as the case mentioned in [2]), it is
> not optimum performance wise.

Indeed, the performance improvement work that has been happening over

quite some time now is very good.


I had seen some opportunities around the file open path long ago but

hadn't got to doing anything there as you have, the work looks good

to me, thanks for doing it.


Ian

> Thoughts ?
>
> Thanks,
> Imran
>
> [1]: https://lore.kernel.org/all/8b0a1619-1e39-fc3a-1226-f3b167e64646@oracle.com/
> [2]: https://lore.kernel.org/all/20230302043203.1695051-2-imran.f.khan@oracle.com/
>> Ian
>>
>>>
>>>> Kindly let me know your thoughts. I would still like to see new lockdep traces
>>>> with this change.
>>> Indeed, I hope Anders can find time to get the trace.
>>>
>>>
>>> Ian
>>>
>>>> Thanks,
>>>> Imran
>>>>
>>>>>> Fixes: 9caf696142 (kernfs: Introduce separate rwsem to protect inode
>>>>>> attributes)
>>>>>>
>>>>>> Signed-off-by: Ian Kent <raven@themaw.net>
>>>>>> Cc: Anders Roxell <anders.roxell@linaro.org>
>>>>>> Cc: Imran Khan <imran.f.khan@oracle.com>
>>>>>> Cc: Arnd Bergmann <arnd@arndb.de>
>>>>>> Cc: Minchan Kim <minchan@kernel.org>
>>>>>> Cc: Eric Sandeen <sandeen@sandeen.net>
>>>>>> ---
>>>>>>     fs/kernfs/dir.c |    4 ++++
>>>>>>     1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
>>>>>> index 45b6919903e6..6e84bb69602e 100644
>>>>>> --- a/fs/kernfs/dir.c
>>>>>> +++ b/fs/kernfs/dir.c
>>>>>> @@ -383,9 +383,11 @@ static int kernfs_link_sibling(struct kernfs_node
>>>>>> *kn)
>>>>>>         rb_insert_color(&kn->rb, &kn->parent->dir.children);
>>>>>>           /* successfully added, account subdir number */
>>>>>> + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>         if (kernfs_type(kn) == KERNFS_DIR)
>>>>>>             kn->parent->dir.subdirs++;
>>>>>>         kernfs_inc_rev(kn->parent);
>>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>           return 0;
>>>>>>     }
>>>>>> @@ -408,9 +410,11 @@ static bool kernfs_unlink_sibling(struct
>>>>>> kernfs_node *kn)
>>>>>>         if (RB_EMPTY_NODE(&kn->rb))
>>>>>>             return false;
>>>>>>     + down_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>         if (kernfs_type(kn) == KERNFS_DIR)
>>>>>>             kn->parent->dir.subdirs--;
>>>>>>         kernfs_inc_rev(kn->parent);
>>>>>> +    up_write(&kernfs_root(kn)->kernfs_iattr_rwsem);
>>>>>>           rb_erase(&kn->rb, &kn->parent->dir.children);
>>>>>>         RB_CLEAR_NODE(&kn->rb);
>>>>>>
diff mbox series

Patch

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..74f3453f4639 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -190,10 +190,8 @@  int kernfs_iop_getattr(struct user_namespace *mnt_userns,
 	struct kernfs_root *root = kernfs_root(kn);
 
 	down_read(&root->kernfs_rwsem);
-	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	generic_fillattr(&init_user_ns, inode, stat);
-	spin_unlock(&inode->i_lock);
 	up_read(&root->kernfs_rwsem);
 
 	return 0;
@@ -288,10 +286,8 @@  int kernfs_iop_permission(struct user_namespace *mnt_userns,
 	root = kernfs_root(kn);
 
 	down_read(&root->kernfs_rwsem);
-	spin_lock(&inode->i_lock);
 	kernfs_refresh_inode(kn, inode);
 	ret = generic_permission(&init_user_ns, inode, mask);
-	spin_unlock(&inode->i_lock);
 	up_read(&root->kernfs_rwsem);
 
 	return ret;