Message ID | 20190109020522.105713-1-houtao1@huawei.com (mailing list archive)
---|---
State | New, archived |
Series | 9p: use inode->i_lock to protect i_size_write()
Hou Tao wrote on Wed, Jan 09, 2019:
> Use inode->i_lock to protect i_size_write(), else i_size_read() in
> generic_fillattr() may loop infinitely when multiple processes invoke
> v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under a
> 32-bit SMP environment, and a soft lockup will be triggered as shown
> below:

Hmm, I'm not familiar with the read/write seqcount code for 32 bit, but I
don't understand how locking here helps besides slowing things down (so
if the value is constantly updated, the read thread might have a chance
to be scheduled between two updates, which was harder to do before; and
thus "solving" your soft lockup).

Instead, a better fix would be to update v9fs_stat2inode to first read
the inode size, and only call i_size_write if it changed - I'd bet this
also fixes the problem and looks better than locking to me.
(Can also probably reuse stat->length instead of the following
i_size_read for i_blocks...)

On the other hand it might make sense to also lock the inode for
stat2inode because we're dealing with partially updated inodes at times,
but if we do this I'd rather put the locking in v9fs_stat2inode and not
outside of it, to catch all the places where it's used; but the readers
don't lock, so I'm not sure it makes much sense.

There's also a window during which the inode's nlink is dropped down to
1 and then set again appropriately if the extension is present; that's
rather ugly and we probably should only reset it to 1 if the attribute
wasn't set before... That can be another patch and/or I'll do it
eventually if you don't.

I hope what I said makes sense.

Thanks,
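A minimal sketch of that compare-before-write idea, against the size and
i_blocks assignments already present in v9fs_stat2inode() (illustration
only, not a tested patch):

        /* Only enter the 32-bit seqcount write path when the size reported
         * by the server differs from the cached one. */
        if (stat->length != i_size_read(inode))
                i_size_write(inode, stat->length);
        /* ...and reuse stat->length for the 512-byte block count instead of
         * calling i_size_read() again right after the write. */
        inode->i_blocks = (stat->length + 512 - 1) >> 9;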
Hi,

On 2019/1/9 10:38, Dominique Martinet wrote:
> Hou Tao wrote on Wed, Jan 09, 2019:
>> Use inode->i_lock to protect i_size_write(), else i_size_read() in
>> generic_fillattr() may loop infinitely when multiple processes invoke
>> v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under a
>> 32-bit SMP environment, and a soft lockup will be triggered as shown
>> below:
> Hmm, I'm not familiar with the read/write seqcount code for 32 bit, but I
> don't understand how locking here helps besides slowing things down (so
> if the value is constantly updated, the read thread might have a chance
> to be scheduled between two updates, which was harder to do before; and
> thus "solving" your soft lockup).

i_size_read() will call read_seqcount_begin() under a 32-bit SMP
environment, and it may loop in __read_seqcount_begin() infinitely because
two or more invocations of write_seqcount_begin() interleave and
s->sequence becomes an odd number. It's noted in the comments of
i_size_write():

/*
 * NOTE: unlike i_size_read(), i_size_write() does need locking around it
 * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
 * can be lost, resulting in subsequent i_size_read() calls spinning forever.
 */
static inline void i_size_write(struct inode *inode, loff_t i_size)
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
        preempt_disable();
        write_seqcount_begin(&inode->i_size_seqcount);
        inode->i_size = i_size;
        write_seqcount_end(&inode->i_size_seqcount);
        preempt_enable();
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
        preempt_disable();
        inode->i_size = i_size;
        preempt_enable();
#else
        inode->i_size = i_size;
#endif
}

> Instead, a better fix would be to update v9fs_stat2inode to first read
> the inode size, and only call i_size_write if it changed - I'd bet this
> also fixes the problem and looks better than locking to me.
> (Can also probably reuse stat->length instead of the following
> i_size_read for i_blocks...)

For the read-only case, this fix will work. However, if the inode size is
changed constantly, there will be two or more callers of i_size_write()
and the soft lockup is still possible.

> On the other hand it might make sense to also lock the inode for
> stat2inode because we're dealing with partially updated inodes at times,
> but if we do this I'd rather put the locking in v9fs_stat2inode and not
> outside of it, to catch all the places where it's used; but the readers
> don't lock, so I'm not sure it makes much sense.

Moving the lock into v9fs_stat2inode() sounds reasonable. There are
callers which don't need it (e.g. v9fs_qid_iget() uses it to fill the
attributes of a newly-created inode and v9fs_mount() uses it to fill the
attributes of the root inode), so I will rename v9fs_stat2inode() to
v9fs_stat2inode_nolock(), and make v9fs_stat2inode() a locking wrapper
around v9fs_stat2inode_nolock().

> There's also a window during which the inode's nlink is dropped down to
> 1 and then set again appropriately if the extension is present; that's
> rather ugly and we probably should only reset it to 1 if the attribute
> wasn't set before... That can be another patch and/or I'll do it
> eventually if you don't.

I can not follow that. Do you mean inode->i_nlink may be updated
concurrently by v9fs_stat2inode() and v9fs_remove() and that will lead to
corruption of i_nlink?

I also note a race with the updating of v9inode->cache_validity. It seems
that it is possible that the clearing of V9FS_INO_INVALID_ATTR in
v9fs_remove() may be lost if there are invocations of v9fs_vfs_getattr()
at the same time.
We may need to ensure V9FS_INO_INVALID_ATTR is set before clearing it
atomically in v9fs_vfs_getattr(), and I will send another patch for it.

Regards,
Tao

> I hope what I said makes sense.
>
> Thanks,
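To make the write_seqcount_begin() interleaving described above concrete,
here is a comment-only trace (an illustration, not kernel source) of how
two unlocked i_size_write() calls can leave i_size_seqcount odd:

/*
 * write_seqcount_begin()/end() are plain, non-atomic sequence++ updates,
 * so two writers racing without a lock can lose an increment:
 *
 *   CPU0 begin(): reads sequence == 0, stores 1
 *   CPU1 begin(): reads sequence == 0, stores 1   <- one increment lost
 *   CPU0 end():   reads sequence == 1, stores 2
 *   CPU1 end():   reads sequence == 2, stores 3   <- counter left odd
 *
 * __read_seqcount_begin() in i_size_read() retries until it sees an even
 * value, so every later reader spins: the reported soft lockup.
 */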
Hou Tao wrote on Thu, Jan 10, 2019:
> > Hmm, I'm not familiar with the read/write seqcount code for 32 bit, but I
> > don't understand how locking here helps besides slowing things down (so
> > if the value is constantly updated, the read thread might have a chance
> > to be scheduled between two updates, which was harder to do before; and
> > thus "solving" your soft lockup).
>
> i_size_read() will call read_seqcount_begin() under a 32-bit SMP
> environment, and it may loop in __read_seqcount_begin() infinitely because
> two or more invocations of write_seqcount_begin() interleave and
> s->sequence becomes an odd number. It's noted in the comments of
> i_size_write():
>
> /*
>  * NOTE: unlike i_size_read(), i_size_write() does need locking around it
>  * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
>  * can be lost, resulting in subsequent i_size_read() calls spinning forever.
>  */

I see, I wasn't aware of how seqcount worked in detail, but it is a simple
seq++ with a barrier, so that does indeed need locking. I had misunderstood
the lockup reason as simply the value being updated all the time.

> > Instead, a better fix would be to update v9fs_stat2inode to first read
> > the inode size, and only call i_size_write if it changed - I'd bet this
> > also fixes the problem and looks better than locking to me.
> > (Can also probably reuse stat->length instead of the following
> > i_size_read for i_blocks...)
>
> For the read-only case, this fix will work. However, if the inode size is
> changed constantly, there will be two or more callers of i_size_write()
> and the soft lockup is still possible.

You are right. We can still do this as an optimization to not take the
lock in the read-only case, but otherwise it's safe to forget about this
comment.

> > On the other hand it might make sense to also lock the inode for
> > stat2inode because we're dealing with partially updated inodes at times,
> > but if we do this I'd rather put the locking in v9fs_stat2inode and not
> > outside of it, to catch all the places where it's used; but the readers
> > don't lock, so I'm not sure it makes much sense.
>
> Moving the lock into v9fs_stat2inode() sounds reasonable. There are
> callers which don't need it (e.g. v9fs_qid_iget() uses it to fill the
> attributes of a newly-created inode and v9fs_mount() uses it to fill the
> attributes of the root inode), so I will rename v9fs_stat2inode() to
> v9fs_stat2inode_nolock(), and make v9fs_stat2inode() a locking wrapper
> around v9fs_stat2inode_nolock().

If it's a newly created inode there is no contention and the spinlock has
virtually no cost; and v9fs_mount's root inode is the same.
Let's keep this simple and always lock around i_size_write.

> > There's also a window during which the inode's nlink is dropped down to
> > 1 and then set again appropriately if the extension is present; that's
> > rather ugly and we probably should only reset it to 1 if the attribute
> > wasn't set before... That can be another patch and/or I'll do it
> > eventually if you don't.
>
> I can not follow that. Do you mean inode->i_nlink may be updated
> concurrently by v9fs_stat2inode() and v9fs_remove() and that will lead to
> corruption of i_nlink?

Not corruption, but the value can be incoherent for a short while. If
another thread looks at nlink continuously it can see the value being reset
to 1 for a short time when there really is no reason to.
I think this isn't as important as the other issues you've raised, so don't
worry about this point unless you want to.
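Picking up the "always lock around i_size_write" point above, a minimal
sketch of what that could look like inside v9fs_stat2inode() (illustration
only; the dotl variant would do the same with its own stat structure):

        /* Serialize the 32-bit i_size_seqcount update between concurrent
         * attribute refreshes of the same inode; readers stay lock-free. */
        spin_lock(&inode->i_lock);
        i_size_write(inode, stat->length);
        spin_unlock(&inode->i_lock);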
> I also note a race with the updating of v9inode->cache_validity. It seems
> that it is possible that the clearing of V9FS_INO_INVALID_ATTR in
> v9fs_remove() may be lost if there are invocations of v9fs_vfs_getattr()
> at the same time. We may need to ensure V9FS_INO_INVALID_ATTR is set
> before clearing it atomically in v9fs_vfs_getattr(), and I will send
> another patch for it.

There are many places setting this flag, but that appears correct: this
isn't safe.
It's more complicated than what you're saying though; e.g. even if the flag
was already set at the start of v9fs_vfs_getattr, if another thread "sets
it again" while it is running we must not unset it -- so basically this
needs to be a 3-value state:
 - valid
 - invalid
 - invalid-being-refreshed

v9fs_invalidate_inode_attr would set it to invalid regardless of the
previous state.
The start of v9fs_vfs_getattr would set it to invalid-being-refreshed
(iff it was invalid? always?).
The end of v9fs_vfs_getattr/v9fs_stat2inode would set it back to valid
iff it is still invalid-being-refreshed.

You cannot clear the flag at the start of v9fs_vfs_getattr() because
another getattr in parallel needs to know it cannot trust the cached value;
currently that parallel getattr call would issue another stat request to
the server, but if we implement this we could wait until the flag becomes
valid (I'm not sure it's appropriate to poll the value even if we yield, so
this might need a mutex or another lock that can sleep...). Well, for now
it's OK to behave like we currently do and also proceed to refresh the
attribute in that other thread; we can think about handling it better
incrementally.

Thanks,
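If it helps, here is a hypothetical sketch of that three-state scheme (the
enum, the atomic_t holder and the helper names are invented for
illustration; the current code only has the V9FS_INO_INVALID_ATTR bit in
v9inode->cache_validity):

#include <linux/atomic.h>

enum v9fs_attr_state {
        V9FS_ATTR_VALID,
        V9FS_ATTR_INVALID,
        V9FS_ATTR_REFRESHING,   /* "invalid-being-refreshed" */
};

/* v9fs_invalidate_inode_attr(): mark stale regardless of previous state. */
static void v9fs_attr_invalidate(atomic_t *state)
{
        atomic_set(state, V9FS_ATTR_INVALID);
}

/* Start of v9fs_vfs_getattr(): claim the refresh, but only if the cached
 * attributes were actually marked invalid. */
static void v9fs_attr_refresh_begin(atomic_t *state)
{
        atomic_cmpxchg(state, V9FS_ATTR_INVALID, V9FS_ATTR_REFRESHING);
}

/* End of v9fs_vfs_getattr()/v9fs_stat2inode(): declare the cache valid only
 * if nobody invalidated it again while the stat request was in flight. */
static void v9fs_attr_refresh_end(atomic_t *state)
{
        atomic_cmpxchg(state, V9FS_ATTR_REFRESHING, V9FS_ATTR_VALID);
}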
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 85ff859d3af5..36405361c2e1 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1074,6 +1074,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
                 u32 request_mask, unsigned int flags)
 {
        struct dentry *dentry = path->dentry;
+       struct inode *inode = d_inode(dentry);
        struct v9fs_session_info *v9ses;
        struct p9_fid *fid;
        struct p9_wstat *st;
@@ -1081,7 +1082,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
        p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
        v9ses = v9fs_dentry2v9ses(dentry);
        if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
-               generic_fillattr(d_inode(dentry), stat);
+               generic_fillattr(inode, stat);
                return 0;
        }
        fid = v9fs_fid_lookup(dentry);
@@ -1092,8 +1093,10 @@ v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
        if (IS_ERR(st))
                return PTR_ERR(st);

-       v9fs_stat2inode(st, d_inode(dentry), dentry->d_sb);
-       generic_fillattr(d_inode(dentry), stat);
+       spin_lock(&inode->i_lock);
+       v9fs_stat2inode(st, inode, dentry->d_sb);
+       spin_unlock(&inode->i_lock);
+       generic_fillattr(inode, stat);

        p9stat_free(st);
        kfree(st);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 4823e1c46999..ac7d0c9f81c9 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -474,6 +474,7 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
                 u32 request_mask, unsigned int flags)
 {
        struct dentry *dentry = path->dentry;
+       struct inode *inode = d_inode(dentry);
        struct v9fs_session_info *v9ses;
        struct p9_fid *fid;
        struct p9_stat_dotl *st;
@@ -481,7 +482,7 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
        p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
        v9ses = v9fs_dentry2v9ses(dentry);
        if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
-               generic_fillattr(d_inode(dentry), stat);
+               generic_fillattr(inode, stat);
                return 0;
        }
        fid = v9fs_fid_lookup(dentry);
@@ -496,8 +497,10 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
        if (IS_ERR(st))
                return PTR_ERR(st);

-       v9fs_stat2inode_dotl(st, d_inode(dentry));
-       generic_fillattr(d_inode(dentry), stat);
+       spin_lock(&inode->i_lock);
+       v9fs_stat2inode_dotl(st, inode);
+       spin_unlock(&inode->i_lock);
+       generic_fillattr(inode, stat);

        /* Change block size to what the server returned */
        stat->blksize = st->st_blksize;
Use inode->i_lock to protect i_size_write(), else i_size_read() in
generic_fillattr() may loop infinitely when multiple processes invoke
v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under a
32-bit SMP environment, and a soft lockup will be triggered as shown
below:

watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217]
Modules linked in:
CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
Hardware name: Generic DT based system
PC is at generic_fillattr+0x104/0x108
LR is at 0xec497f00
pc : [<802b8898>]    lr : [<ec497f00>]    psr: 200c0013
sp : ec497e20  ip : ed608030  fp : ec497e3c
r10: 00000000  r9 : ec497f00  r8 : ed608030
r7 : ec497ebc  r6 : ec497f00  r5 : ee5c1550  r4 : ee005780
r3 : 0000052d  r2 : 00000000  r1 : ec497f00  r0 : ed608030
Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: ac48006a  DAC: 00000051
CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
Hardware name: Generic DT based system
Backtrace:
[<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24)
[<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc)
[<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20)
[<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8)
[<801d0800>] (watchdog_timer_fn) from [<80198658>] (__hrtimer_run_queues+0x18c/0x380)
[<801984cc>] (__hrtimer_run_queues) from [<80198e60>] (hrtimer_run_queues+0xb8/0xf0)
[<80198da8>] (hrtimer_run_queues) from [<801973e8>] (run_local_timers+0x28/0x64)
[<801973c0>] (run_local_timers) from [<80197460>] (update_process_times+0x3c/0x6c)
[<80197424>] (update_process_times) from [<801ab2b8>] (tick_nohz_handler+0xe0/0x1bc)
[<801ab1d8>] (tick_nohz_handler) from [<80843050>] (arch_timer_handler_virt+0x38/0x48)
[<80843018>] (arch_timer_handler_virt) from [<80180a64>] (handle_percpu_devid_irq+0x8c/0x240)
[<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] (generic_handle_irq+0x34/0x44)
[<8017abec>] (generic_handle_irq) from [<8017b344>] (__handle_domain_irq+0x6c/0xc4)
[<8017b2d8>] (__handle_domain_irq) from [<801022e0>] (gic_handle_irq+0x4c/0x88)
[<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98)
[<802b8794>] (generic_fillattr) from [<8056b284>] (v9fs_vfs_getattr_dotl+0x74/0xa4)
[<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] (vfs_getattr_nosec+0x68/0x7c)
[<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48)
[<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec)
[<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78)
[<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28)

Reported-by: Xing Gaopeng <xingaopeng@huawei.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 fs/9p/vfs_inode.c      | 9 ++++++---
 fs/9p/vfs_inode_dotl.c | 9 ++++++---
 2 files changed, 12 insertions(+), 6 deletions(-)