
9p: use inode->i_lock to protect i_size_write()

Message ID 20190109020522.105713-1-houtao1@huawei.com (mailing list archive)
State New, archived
Series 9p: use inode->i_lock to protect i_size_write()

Commit Message

Hou Tao Jan. 9, 2019, 2:05 a.m. UTC
Use inode->i_lock to protect i_size_write(); otherwise i_size_read() in
generic_fillattr() may loop infinitely when multiple processes invoke
v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously in a 32-bit
SMP environment, and a soft lockup will be triggered as shown below:

  watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [stat:2217]
  Modules linked in:
  CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
  Hardware name: Generic DT based system
  PC is at generic_fillattr+0x104/0x108
  LR is at 0xec497f00
  pc : [<802b8898>]    lr : [<ec497f00>]    psr: 200c0013
  sp : ec497e20  ip : ed608030  fp : ec497e3c
  r10: 00000000  r9 : ec497f00  r8 : ed608030
  r7 : ec497ebc  r6 : ec497f00  r5 : ee5c1550  r4 : ee005780
  r3 : 0000052d  r2 : 00000000  r1 : ec497f00  r0 : ed608030
  Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
  Control: 10c5387d  Table: ac48006a  DAC: 00000051
  CPU: 5 PID: 2217 Comm: stat Not tainted 5.0.0-rc1-00005-g7f702faf5a9e #4
  Hardware name: Generic DT based system
  Backtrace:
  [<8010d974>] (dump_backtrace) from [<8010dc88>] (show_stack+0x20/0x24)
  [<8010dc68>] (show_stack) from [<80a1d194>] (dump_stack+0xb0/0xdc)
  [<80a1d0e4>] (dump_stack) from [<80109f34>] (show_regs+0x1c/0x20)
  [<80109f18>] (show_regs) from [<801d0a80>] (watchdog_timer_fn+0x280/0x2f8)
  [<801d0800>] (watchdog_timer_fn) from [<80198658>] (__hrtimer_run_queues+0x18c/0x380)
  [<801984cc>] (__hrtimer_run_queues) from [<80198e60>] (hrtimer_run_queues+0xb8/0xf0)
  [<80198da8>] (hrtimer_run_queues) from [<801973e8>] (run_local_timers+0x28/0x64)
  [<801973c0>] (run_local_timers) from [<80197460>] (update_process_times+0x3c/0x6c)
  [<80197424>] (update_process_times) from [<801ab2b8>] (tick_nohz_handler+0xe0/0x1bc)
  [<801ab1d8>] (tick_nohz_handler) from [<80843050>] (arch_timer_handler_virt+0x38/0x48)
  [<80843018>] (arch_timer_handler_virt) from [<80180a64>] (handle_percpu_devid_irq+0x8c/0x240)
  [<801809d8>] (handle_percpu_devid_irq) from [<8017ac20>] (generic_handle_irq+0x34/0x44)
  [<8017abec>] (generic_handle_irq) from [<8017b344>] (__handle_domain_irq+0x6c/0xc4)
  [<8017b2d8>] (__handle_domain_irq) from [<801022e0>] (gic_handle_irq+0x4c/0x88)
  [<80102294>] (gic_handle_irq) from [<80101a30>] (__irq_svc+0x70/0x98)
  [<802b8794>] (generic_fillattr) from [<8056b284>] (v9fs_vfs_getattr_dotl+0x74/0xa4)
  [<8056b210>] (v9fs_vfs_getattr_dotl) from [<802b8904>] (vfs_getattr_nosec+0x68/0x7c)
  [<802b889c>] (vfs_getattr_nosec) from [<802b895c>] (vfs_getattr+0x44/0x48)
  [<802b8918>] (vfs_getattr) from [<802b8a74>] (vfs_statx+0x9c/0xec)
  [<802b89d8>] (vfs_statx) from [<802b9428>] (sys_lstat64+0x48/0x78)
  [<802b93e0>] (sys_lstat64) from [<80101000>] (ret_fast_syscall+0x0/0x28)

Reported-by: Xing Gaopeng <xingaopeng@huawei.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
---
 fs/9p/vfs_inode.c      | 9 ++++++---
 fs/9p/vfs_inode_dotl.c | 9 ++++++---
 2 files changed, 12 insertions(+), 6 deletions(-)

Comments

Dominique Martinet Jan. 9, 2019, 2:38 a.m. UTC | #1
Hou Tao wrote on Wed, Jan 09, 2019:
> Use inode->i_lock to protect i_size_write(), else i_size_read() in
> generic_fillattr() may loop infinitely when multiple processes invoke
> v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under
> 32-bit SMP environment, and a soft lockup will be triggered as show below:

Hmm, I'm not familiar with the read/write seqcount code for 32-bit, but I
don't understand how locking here helps besides slowing things down (so
if the value is constantly updated, the read thread might have a chance
to be scheduled between two updates, which was harder to do before, and
thus "solving" your soft lockup).

Instead, a better fix would be to update v9fs_stat2inode to first read
the inode size, and only call i_size_write if it changed - I'd bet this
also fixes the problem and looks better than locking to me.
(Can also probably reuse stat->length instead of the following
i_size_read for i_blocks...)



On the other hand it might make sense to also lock the inode for
stat2inode because we're dealing with partially updated inodes at times,
but if we do this I'd rather put the locking in v9fs_stat2inode and not
outside of it, to catch all the places where it's used; the readers
don't lock, though, so I'm not sure it makes much sense.

There's also a window during which the inode's nlink is dropped down to
1 then set again appropriately if the extension is present; that's
rather ugly and we probably should only reset it to 1 if the attribute
wasn't set before... That can be another patch and/or I'll do it
eventually if you don't.

I hope what I said makes sense.

Thanks,
Hou Tao Jan. 10, 2019, 2:50 a.m. UTC | #2
Hi,


On 2019/1/9 10:38, Dominique Martinet wrote:
> Hou Tao wrote on Wed, Jan 09, 2019:
>> Use inode->i_lock to protect i_size_write(), else i_size_read() in
>> generic_fillattr() may loop infinitely when multiple processes invoke
>> v9fs_vfs_getattr() or v9fs_vfs_getattr_dotl() simultaneously under
>> 32-bit SMP environment, and a soft lockup will be triggered as show below:
> Hmm, I'm not familiar with the read/write seqcount code for 32 bit but I
> don't understand how locking here helps besides slowing things down (so
> if the value is constantly updated, the read thread might have a chance
> to be scheduled between two updates which was harder to do before ; and
> thus "solving" your soft lockup)
i_size_read() will call read_seqcount_begin() in a 32-bit SMP environment,
and it may loop infinitely in __read_seqcount_begin() because two or more
invocations of write_seqcount_begin() interleave and s->sequence becomes
an odd number. This is noted in the comment above i_size_write():

/*
 * NOTE: unlike i_size_read(), i_size_write() does need locking around it
 * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
 * can be lost, resulting in subsequent i_size_read() calls spinning forever.
 */
static inline void i_size_write(struct inode *inode, loff_t i_size)
{
#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
    preempt_disable();
    write_seqcount_begin(&inode->i_size_seqcount);
    inode->i_size = i_size;
    write_seqcount_end(&inode->i_size_seqcount);
    preempt_enable();
#elif BITS_PER_LONG==32 && defined(CONFIG_PREEMPT)
    preempt_disable();
    inode->i_size = i_size;
    preempt_enable();
#else
    inode->i_size = i_size;
#endif
}

> Instead, a better fix would be to update v9fs_stat2inode to first read
> the inode size, and only call i_size_write if it changed - I'd bet this
> also fixes the problem and looks better than locking to me.
> (Can also probably reuse stat->length instead of the following
> i_size_read for i_blocks...)
For the read-only case, this fix will work. However, if the inode size is
changed constantly, there will still be two or more concurrent callers of
i_size_write(), and the soft lockup remains possible.

>
> On the other hand it might make sense to also lock the inode for
> stat2inode because we're dealing with partially updated inodes at time,
> but if we do this I'd rather put the locking in v9fs_stat2inode and not
> outside of it to catch all the places where it's used; but the readers
> don't lock so I'm not sure it makes much sense.
Moving the lock into v9fs_stat2inode() sounds reasonable. There are callers
which don't need it (e.g. v9fs_qid_iget() uses it to fill attributes for a
newly-created inode and v9fs_mount() uses it to fill attributes for the root
inode), so I will rename v9fs_stat2inode() to v9fs_stat2inode_nolock() and
make v9fs_stat2inode() a locking wrapper around v9fs_stat2inode_nolock().

>
> There's also a window during which the inode's nlink is dropped down to
> 1 then set again appropriately if the extension is present; that's
> rather ugly and we probably should only reset it to 1 if the attribute
> wasn't set before... That can be another patch and/or I'll do it
> eventually if you don't.
I cannot follow that. Do you mean inode->i_nlink may be updated concurrently
by v9fs_stat2inode() and v9fs_remove(), and that this will lead to corruption
of i_nlink?

I also noticed a race in the updating of v9inode->cache_validity. It seems
possible that the update of V9FS_INO_INVALID_ATTR in v9fs_remove() may be lost
if there are invocations of v9fs_vfs_getattr() at the same time. We may need to
ensure V9FS_INO_INVALID_ATTR is set before clearing it atomically in
v9fs_vfs_getattr(), and I will send another patch for it.

Regards,
Tao
> I hope what I said makes sense.
>
> Thanks,
Dominique Martinet Jan. 10, 2019, 4:18 a.m. UTC | #3
Hou Tao wrote on Thu, Jan 10, 2019:
> > Hmm, I'm not familiar with the read/write seqcount code for 32 bit but I
> > don't understand how locking here helps besides slowing things down (so
> > if the value is constantly updated, the read thread might have a chance
> > to be scheduled between two updates which was harder to do before ; and
> > thus "solving" your soft lockup)
>
> i_size_read() will call read_seqcount_begin() under 32-bit SMP environment,
> and it may loop in __read_seqcount_begin() infinitely because two or more
> invocations of write_seqcount_begin interleave and s->sequence becomes
> an odd number. It's noted in comments of i_size_write():
> 
> /*
>  * NOTE: unlike i_size_read(), i_size_write() does need locking around it
>  * (normally i_mutex), otherwise on 32bit/SMP an update of i_size_seqcount
>  * can be lost, resulting in subsequent i_size_read() calls spinning forever.
>  */

I see; I wasn't aware of how seqcount works in detail, but it is a
simple seq++ with a barrier, so that does indeed need locking.
I had misunderstood the lockup cause as simply the value being updated
all the time.

> > Instead, a better fix would be to update v9fs_stat2inode to first read
> > the inode size, and only call i_size_write if it changed - I'd bet this
> > also fixes the problem and looks better than locking to me.
> > (Can also probably reuse stat->length instead of the following
> > i_size_read for i_blocks...)
>
> For read-only case, this fix will work. However if the inode size is changed
> constantly, there will be two or more callers of i_size_write() and the soft-lockup
> is still possible.

You are right. We can still do this as an optimization to avoid taking the
lock in the read-only case, but otherwise it's safe to forget about this
comment.
 
> > On the other hand it might make sense to also lock the inode for
> > stat2inode because we're dealing with partially updated inodes at time,
> > but if we do this I'd rather put the locking in v9fs_stat2inode and not
> > outside of it to catch all the places where it's used; but the readers
> > don't lock so I'm not sure it makes much sense.
>
> Moving lock into v9fs_stat2inode() sounds reasonable. There are callers
> which don't need it (e.g. v9fs_qid_iget() uses it to fill attribute for a
> newly-created inode and v9fs_mount() uses it to fill attribute for root inode),
> so i will rename v9fs_stat2inode() to v9fs_stat2inode_nolock(), and wrap
> v9fs_stat2inode() upon v9fs_stat2inode_nolock().

If it's a newly created inode there is no contention and the spinlock
has virtually no cost; v9fs_mount's root inode is the same.
Let's keep this simple and always lock around i_size_write.

> > There's also a window during which the inode's nlink is dropped down to
> > 1 then set again appropriately if the extension is present; that's
> > rather ugly and we probably should only reset it to 1 if the attribute
> > wasn't set before... That can be another patch and/or I'll do it
> > eventually if you don't.
>
> I can not follow that. Do you mean inode->i_nlink may be updated concurrently
> by v9fs_stat2inode() and v9fs_remove() and that will lead to corruption of i_nlink ?

Not corruption, but the value can be incoherent for a short while.
If another thread looks at nlink continuously it can see the value being
reset to 1 for a short time when there really is no reason to.

I think this isn't as important as the other issues you've raised, so
don't worry about this point unless you want to.

> I also note a race about updating of v9inode->cache_validity. It seems
> that it is possible the clear of V9FS_INO_INVALID_ATTR in
> v9fs_remove() may lost if there are invocations of v9fs_vfs_getattr()
> in the same time. We may need to ensure V9FS_INO_INVALID_ATTR is
> enabled before clearing it atomically in v9fs_vfs_getattr() and i will
> send another patch for it.

There are many places setting this flag, but you appear to be correct
that this isn't safe.

It's more complicated than what you're saying, though: e.g. even if the
flag was already set at the start of v9fs_vfs_getattr, if another thread
"sets it again" while it is running we must not unset it -- so
basically this needs to be a three-value state:
 - valid
 - invalid
 - invalid-being-refreshed

v9fs_invalidate_inode_attr would set it to invalid regardless of the
previous state.
start of v9fs_vfs_getattr would set it to invalid-being-refreshed (iff
it was invalid? always?)
end of v9fs_vfs_getattr/v9fs_stat2inode would set it back to valid iff
it is invalid-being-refreshed.


You cannot clear the flag at the start of v9fs_vfs_getattr() because
another getattr in parallel needs to know it cannot trust the cached
value. Currently that parallel getattr call would issue another stat
request to the server, but if we implement this we could wait until the
flag becomes valid (I'm not sure it's appropriate to poll the value
even if we yield, so this might need a mutex or another lock that can
sleep...). For now it's OK to behave like we currently do and also
proceed to refresh the attribute in that other thread; we can think
about handling it better incrementally.

Thanks,

Patch

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 85ff859d3af5..36405361c2e1 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1074,6 +1074,7 @@  v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
 		 u32 request_mask, unsigned int flags)
 {
 	struct dentry *dentry = path->dentry;
+	struct inode *inode = d_inode(dentry);
 	struct v9fs_session_info *v9ses;
 	struct p9_fid *fid;
 	struct p9_wstat *st;
@@ -1081,7 +1082,7 @@  v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
 	p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
 	v9ses = v9fs_dentry2v9ses(dentry);
 	if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
-		generic_fillattr(d_inode(dentry), stat);
+		generic_fillattr(inode, stat);
 		return 0;
 	}
 	fid = v9fs_fid_lookup(dentry);
@@ -1092,8 +1093,10 @@  v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
 	if (IS_ERR(st))
 		return PTR_ERR(st);
 
-	v9fs_stat2inode(st, d_inode(dentry), dentry->d_sb);
-	generic_fillattr(d_inode(dentry), stat);
+	spin_lock(&inode->i_lock);
+	v9fs_stat2inode(st, inode, dentry->d_sb);
+	spin_unlock(&inode->i_lock);
+	generic_fillattr(inode, stat);
 
 	p9stat_free(st);
 	kfree(st);
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 4823e1c46999..ac7d0c9f81c9 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -474,6 +474,7 @@  v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
 		 u32 request_mask, unsigned int flags)
 {
 	struct dentry *dentry = path->dentry;
+	struct inode *inode = d_inode(dentry);
 	struct v9fs_session_info *v9ses;
 	struct p9_fid *fid;
 	struct p9_stat_dotl *st;
@@ -481,7 +482,7 @@  v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
 	p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
 	v9ses = v9fs_dentry2v9ses(dentry);
 	if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
-		generic_fillattr(d_inode(dentry), stat);
+		generic_fillattr(inode, stat);
 		return 0;
 	}
 	fid = v9fs_fid_lookup(dentry);
@@ -496,8 +497,10 @@  v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
 	if (IS_ERR(st))
 		return PTR_ERR(st);
 
-	v9fs_stat2inode_dotl(st, d_inode(dentry));
-	generic_fillattr(d_inode(dentry), stat);
+	spin_lock(&inode->i_lock);
+	v9fs_stat2inode_dotl(st, inode);
+	spin_unlock(&inode->i_lock);
+	generic_fillattr(inode, stat);
 	/* Change block size to what the server returned */
 	stat->blksize = st->st_blksize;