[RESEND] fs: fix race condition oops between destroy_inode and writeback_sb_inodes

Message ID 20200919093923.19016-1-luoshijie1@huawei.com
State New
Series [RESEND] fs: fix race condition oops between destroy_inode and writeback_sb_inodes

Commit Message

Shijie Luo Sept. 19, 2020, 9:39 a.m. UTC
We hit an oops while testing Linux 4.18. The call trace is shown
below.

[255946.665989] Oops: 0000 [#1] SMP PTI
[255946.674811] Workqueue: writeback wb_workfn (flush-253:6)
[255946.676443] RIP: 0010:locked_inode_to_wb_and_lock_list+0x20/0x120
[255946.683916] RSP: 0018:ffffbb0e44727c00 EFLAGS: 00010286
[255946.685518] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[255946.687699] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9ef282be5398
[255946.689866] RBP: ffff9ef282be5398 R08: ffffbb0e44727cd8 R09: ffff9ef3064f306e
[255946.692037] R10: 0000000000000000 R11: 0000000000000010 R12: ffff9ef282be5420
[255946.694208] R13: ffff9ef3351cc800 R14: 0000000000000000 R15: ffff9ef3352e2058
[255946.696378] FS:  0000000000000000(0000) GS:ffff9ef33ad80000(0000) knlGS:0000000000000000
[255946.698835] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[255946.700604] CR2: 0000000000000000 CR3: 000000000760a005 CR4: 00000000003606e0
[255946.702787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[255946.704955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[255946.707123] Call Trace:
[255946.707918]  writeback_sb_inodes+0x1fe/0x460
[255946.709244]  __writeback_inodes_wb+0x5d/0xb0
[255946.710575]  wb_writeback+0x265/0x2f0
[255946.711728]  ? wb_workfn+0x3cf/0x4d0
[255946.712850]  wb_workfn+0x3cf/0x4d0
[255946.713923]  process_one_work+0x195/0x390
[255946.715173]  worker_thread+0x30/0x390
[255946.716319]  ? process_one_work+0x390/0x390
[255946.717625]  kthread+0x10d/0x130
[255946.718789]  ? kthread_flush_work_fn+0x10/0x10
[255946.720170]  ret_from_fork+0x35/0x40

There is a race condition between destroy_inode and writeback_sb_inodes,
thread-1                                    thread-2
wb_workfn
  writeback_inodes_wb
    __writeback_inodes_wb
      writeback_sb_inodes
        wbc_attach_and_unlock_inode
                                        iget_locked
                                          destroy_inode
                                            inode_detach_wb
                                              inode->i_wb = NULL;

        inode_to_wb_and_lock_list
          locked_inode_to_wb_and_lock_list
            wb_get
              oops

So destroy the inode only after setting I_FREEING in the inode state and
waiting for the I_SYNC state to be cleared.

Reported-by: Tianxiong Lu <lutianxiong@huawei.com>
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Haotian Li <lihaotian9@huawei.com>
---
 fs/inode.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox Sept. 19, 2020, 2:56 p.m. UTC | #1
On Sat, Sep 19, 2020 at 05:39:23AM -0400, Shijie Luo wrote:
> There is a race condition between destroy_inode and writeback_sb_inodes,
> thread-1                                    thread-2
> wb_workfn
>   writeback_inodes_wb
>     __writeback_inodes_wb
>       writeback_sb_inodes
>         wbc_attach_and_unlock_inode
> 					iget_locked
>                                           destroy_inode
>                                             inode_detach_wb
>                                               inode->i_wb = NULL;
> 
>         inode_to_wb_and_lock_list
>           locked_inode_to_wb_and_lock_list
>             wb_get
>               oops
> 
> so destroy inode after adding I_FREEING to inode state and the I_SYNC state
>  being cleared.
> 
> Reported-by: Tianxiong Lu <lutianxiong@huawei.com>
> Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
> Signed-off-by: Haotian Li <lihaotian9@huawei.com>
> ---
>  fs/inode.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 72c4c347afb7..b28a2a9e15d5 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1148,10 +1148,17 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
>  		struct inode *new = alloc_inode(sb);
>  
>  		if (new) {
> +			spin_lock(&new->i_lock);
>  			new->i_state = 0;
> +			spin_unlock(&new->i_lock);

This part is unnecessary.  We just allocated 'new' two lines above;
nobody else can see 'new' yet.  We make it visible with hlist_add_head_rcu()
which uses rcu_assign_pointer() which contains a memory barrier, so it's
impossible for another CPU to see a stale i_state.
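
The initialise-then-publish argument above can be sketched in userspace C11. This is only an illustration, not kernel code: `pub_inode`, `pub_bucket`, `pub_publish()` and `pub_lookup()` are hypothetical names, and a release store stands in for what rcu_assign_pointer() provides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct inode; only the field of interest. */
struct pub_inode {
	unsigned long i_state;
};

/* Hypothetical stand-in for a hash bucket head that readers traverse. */
static _Atomic(struct pub_inode *) pub_bucket;

/*
 * Initialise privately, then publish. The release store orders the
 * plain write to i_state before the pointer becomes visible to others.
 */
static void pub_publish(struct pub_inode *new)
{
	new->i_state = 0;	/* no lock: nobody else can reach 'new' yet */
	atomic_store_explicit(&pub_bucket, new, memory_order_release);
}

/*
 * Reader side. The acquire load pairs with the release store above, so
 * a reader that observes the pointer also observes i_state == 0.
 */
static struct pub_inode *pub_lookup(void)
{
	return atomic_load_explicit(&pub_bucket, memory_order_acquire);
}
```

Before pub_publish() runs, pub_lookup() returns NULL, so there is no window in which another thread could reach the object and read a stale i_state.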

>  			inode = inode_insert5(new, hashval, test, set, data);
> -			if (unlikely(inode != new))
> +			if (unlikely(inode != new)) {
> +				spin_lock(&new->i_lock);
> +				new->i_state |= I_FREEING;
> +				spin_unlock(&new->i_lock);
> +				inode_wait_for_writeback(new);
>  				destroy_inode(new);

This doesn't make sense either.  If an inode is returned here which is not
'new', then adding 'new' to the hash failed, and new was never visible
to another CPU.

> @@ -1218,6 +1225,11 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
>  		 * allocated.
>  		 */
>  		spin_unlock(&inode_hash_lock);
> +
> +		spin_lock(&inode->i_lock);
> +		inode->i_state |= I_FREEING;
> +		spin_unlock(&inode->i_lock);
> +		inode_wait_for_writeback(inode);
>  		destroy_inode(inode);

Again, this doesn't make sense.  This is also a codepath which failed to
make 'inode' visible to any other thread.

I don't understand how this patch could fix anything.
Shijie Luo Sept. 21, 2020, 8:29 a.m. UTC | #2
On 2020/9/19 22:56, Matthew Wilcox wrote:
> This part is unnecessary.  We just allocated 'new' two lines above;
> nobody else can see 'new' yet.  We make it visible with hlist_add_head_rcu()
> which uses rcu_assign_pointer() which contains a memory barrier, so it's
> impossible for another CPU to see a stale i_state.
>
>>   			inode = inode_insert5(new, hashval, test, set, data);
>> -			if (unlikely(inode != new))
>> +			if (unlikely(inode != new)) {
>> +				spin_lock(&new->i_lock);
>> +				new->i_state |= I_FREEING;
>> +				spin_unlock(&new->i_lock);
>> +				inode_wait_for_writeback(new);
>>   				destroy_inode(new);
> This doesn't make sense either.  If an inode is returned here which is not
> 'new', then adding 'new' to the hash failed, and new was never visible
> to another CPU.
>
>> @@ -1218,6 +1225,11 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
>>   		 * allocated.
>>   		 */
>>   		spin_unlock(&inode_hash_lock);
>> +
>> +		spin_lock(&inode->i_lock);
>> +		inode->i_state |= I_FREEING;
>> +		spin_unlock(&inode->i_lock);
>> +		inode_wait_for_writeback(inode);
>>   		destroy_inode(inode);
> Again, this doesn't make sense.  This is also a codepath which failed to
> make 'inode' visible to any other thread.
>
> I don't understand how this patch could fix anything.
> .

Thanks for your review. The underlying filesystem is ext4. From the
vmcore, ext4_alloc_inode did not hand back a fresh VFS inode from the
slab: the "new inode" was already in use by another thread. In other
words, the new inode should have been a fresh one, but it was not.

Maybe it's not a filesystem problem, and fixing it in iget_locked is not
the right approach. I'll try to find the root cause and fix it.
Jan Kara Sept. 21, 2020, 10:25 a.m. UTC | #3
On Sat 19-09-20 05:39:23, Shijie Luo wrote:
> We tested an oops problem in Linux 4.18. The Call Trace message is
>  followed below.
> 
> [255946.665989] Oops: 0000 [#1] SMP PTI
> [255946.674811] Workqueue: writeback wb_workfn (flush-253:6)
> [255946.676443] RIP: 0010:locked_inode_to_wb_and_lock_list+0x20/0x120
> [255946.683916] RSP: 0018:ffffbb0e44727c00 EFLAGS: 00010286
> [255946.685518] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [255946.687699] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9ef282be5398
> [255946.689866] RBP: ffff9ef282be5398 R08: ffffbb0e44727cd8 R09: ffff9ef3064f306e
> [255946.692037] R10: 0000000000000000 R11: 0000000000000010 R12: ffff9ef282be5420
> [255946.694208] R13: ffff9ef3351cc800 R14: 0000000000000000 R15: ffff9ef3352e2058
> [255946.696378] FS:  0000000000000000(0000) GS:ffff9ef33ad80000(0000) knlGS:0000000000000000
> [255946.698835] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [255946.700604] CR2: 0000000000000000 CR3: 000000000760a005 CR4: 00000000003606e0
> [255946.702787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [255946.704955] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [255946.707123] Call Trace:
> [255946.707918]  writeback_sb_inodes+0x1fe/0x460
> [255946.709244]  __writeback_inodes_wb+0x5d/0xb0
> [255946.710575]  wb_writeback+0x265/0x2f0
> [255946.711728]  ? wb_workfn+0x3cf/0x4d0
> [255946.712850]  wb_workfn+0x3cf/0x4d0
> [255946.713923]  process_one_work+0x195/0x390
> [255946.715173]  worker_thread+0x30/0x390
> [255946.716319]  ? process_one_work+0x390/0x390
> [255946.717625]  kthread+0x10d/0x130
> [255946.718789]  ? kthread_flush_work_fn+0x10/0x10
> [255946.720170]  ret_from_fork+0x35/0x40

So 4.18 is rather old and we had several fixes in this area for crashes
similar to the one you show above. The last one was likely:

68f23b89067 ("memcg: fix a crash in wb_workfn when a device disappears")

but there were multiple changes before that to bdi logic to fix lifetime
issues when devices are hot-removed.

> There is a race condition between destroy_inode and writeback_sb_inodes,
> thread-1                                    thread-2
> wb_workfn
>   writeback_inodes_wb
>     __writeback_inodes_wb
>       writeback_sb_inodes
>         wbc_attach_and_unlock_inode
> 					iget_locked
>                                           destroy_inode
>                                             inode_detach_wb
>                                               inode->i_wb = NULL;

So thread-1 looks sensible but I don't see how what is in thread-2 can ever
happen. We can call destroy_inode() from iget_locked() only for inodes that
were never added to inode hash (and so they couldn't ever be dirty or even
be handled by the flusher thread). Active inodes must (and AFAIK always do)
pass through fs/inode.c:evict() which takes care of waiting for the running
flusher thread (through inode_wait_for_writeback()).

								Honza
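
The handshake described above (teardown waits for a running flusher, and the flusher skips inodes that are going away) can be modelled as a small single-threaded state machine. This is a sketch under assumptions: the `ST_*` bit values are illustrative rather than the kernel's, all `st_*` names are hypothetical, and the i_lock that protects i_state in the kernel is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not taken from any kernel version. */
#define ST_I_SYNC	(1UL << 0)	/* flusher is writing this inode */
#define ST_I_FREEING	(1UL << 1)	/* inode is being torn down */

struct st_inode {
	unsigned long i_state;
};

/* Flusher side: skip an inode that is being freed or already in writeback. */
static bool st_writeback_start(struct st_inode *inode)
{
	if (inode->i_state & (ST_I_FREEING | ST_I_SYNC))
		return false;
	inode->i_state |= ST_I_SYNC;
	return true;
}

/* Flusher side: writeback finished, clear the in-progress bit. */
static void st_writeback_done(struct st_inode *inode)
{
	inode->i_state &= ~ST_I_SYNC;
}

/*
 * Teardown side: mark the inode as going away, then report whether it
 * is safe to free. The kernel sleeps in inode_wait_for_writeback()
 * until I_SYNC clears; this model returns false so the caller retries.
 */
static bool st_teardown_try(struct st_inode *inode)
{
	inode->i_state |= ST_I_FREEING;
	return !(inode->i_state & ST_I_SYNC);
}
```

In this model, teardown cannot complete while writeback is in flight, and no new writeback can start once teardown has begun.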
Shijie Luo Sept. 24, 2020, 2 p.m. UTC | #4
On 2020/9/21 18:25, Jan Kara wrote:
> On Sat 19-09-20 05:39:23, Shijie Luo wrote:
>> So 4.18 is rather old and we had several fixes in this area for crashes
>> similar to the one you show above. The list was likely:
>>
>> 68f23b89067 ("memcg: fix a crash in wb_workfn when a device disappears")
>>
>> but there were multiple changes before that to bdi logic to fix lifetime
>> issues when devices are hot-removed.
>>
Thanks for your reply. We checked several fixes around wb_workfn and
finally found that this patch works:

ceff86fddae8 ("ext4: Avoid freeing inodes on dirty list")

Our fsstress process randomly uses the ioctl interface to set the journal
data flag on inodes, and an ext4 inode with the journal data flag can be
marked dirty and added to the writeback lists again.

When locked_inode_to_wb_and_lock_list() in __mark_inode_dirty() has
released inode->i_lock but not yet taken wb->list_lock, and the inode is
evicted and removed from the writeback lists at the same time, the inode
can be added to a writeback list again. As a result, an inode later
allocated from the slab is still on a writeback list, which may cause a
crash because destroy_inode() sets inode->i_wb to NULL.
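
The ordering described in this reply can be replayed deterministically in a small model. It is a sketch of the reported interleaving only, not of the kernel code or of the fix in ceff86fddae8: every `sim_*` name is hypothetical, and the two locks are represented only by the order of the steps.

```c
#include <assert.h>
#include <stdbool.h>

struct sim_inode {
	bool freeing;		/* models I_FREEING */
	bool on_wb_list;	/* models membership of a writeback list */
};

/* Dirtying, step 1: decide (under i_lock) whether to queue the inode. */
static bool sim_dirty_check(const struct sim_inode *inode)
{
	return !inode->freeing;
}

/* Dirtying, step 2: i_lock was dropped; queue under the list lock. */
static void sim_dirty_queue(struct sim_inode *inode)
{
	inode->on_wb_list = true;
}

/* Eviction: flag the inode and take it off the writeback list. */
static void sim_evict(struct sim_inode *inode)
{
	inode->freeing = true;
	inode->on_wb_list = false;
}

/* Replays the bad ordering; returns true if the evicted inode is still listed. */
static bool sim_replay_race(void)
{
	struct sim_inode inode = { .freeing = false, .on_wb_list = true };
	bool will_queue = sim_dirty_check(&inode);	/* thread A, step 1 */

	sim_evict(&inode);				/* thread B runs in the gap */
	if (will_queue)
		sim_dirty_queue(&inode);		/* thread A, step 2 */
	return inode.freeing && inode.on_wb_list;
}
```

The decision made in step 1 is stale by the time step 2 runs, so the evicted inode ends up back on the list.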

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 72c4c347afb7..b28a2a9e15d5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1148,10 +1148,17 @@  struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
 		struct inode *new = alloc_inode(sb);
 
 		if (new) {
+			spin_lock(&new->i_lock);
 			new->i_state = 0;
+			spin_unlock(&new->i_lock);
 			inode = inode_insert5(new, hashval, test, set, data);
-			if (unlikely(inode != new))
+			if (unlikely(inode != new)) {
+				spin_lock(&new->i_lock);
+				new->i_state |= I_FREEING;
+				spin_unlock(&new->i_lock);
+				inode_wait_for_writeback(new);
 				destroy_inode(new);
+			}
 		}
 	}
 	return inode;
@@ -1218,6 +1225,11 @@  struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 		 * allocated.
 		 */
 		spin_unlock(&inode_hash_lock);
+
+		spin_lock(&inode->i_lock);
+		inode->i_state |= I_FREEING;
+		spin_unlock(&inode->i_lock);
+		inode_wait_for_writeback(inode);
 		destroy_inode(inode);
 		if (IS_ERR(old))
 			return NULL;