sync: wait_sb_inodes() calls iput() with spinlock held (was Re: [PATCH 0/7] super block scalability patches V3)

Message ID 20150622022648.GO10224@dastard
State New

Commit Message

Dave Chinner June 22, 2015, 2:26 a.m. UTC
On Tue, Jun 16, 2015 at 07:34:29AM +1000, Dave Chinner wrote:
> On Thu, Jun 11, 2015 at 03:41:05PM -0400, Josef Bacik wrote:
> > Here are the cleaned-up versions of Dave Chinner's super block scalability
> > patches.  I've been testing them locally for a while and they are pretty solid.
> > They fix a few big issues, such as the global inode list and soft lockups at
> > unmount on boxes that have lots of inodes in cache.  Al, if you would consider
> > pulling these in, that would be great; you can pull from here:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling
> 
> Passes all my smoke tests.
> 
> Tested-by: Dave Chinner <dchinner@redhat.com>

FWIW, I just updated my trees to whatever is in the above branch and
v4.1-rc8, and now I'm seeing problems with wb.list_lock recursion
and "sleeping in atomic" scheduling issues. generic/269 produced
this:

 BUG: spinlock cpu recursion on CPU#1, fsstress/3852
  lock: 0xffff88042a650c28, .magic: dead4ead, .owner: fsstress/3804, .owner_cpu: 1
 CPU: 1 PID: 3852 Comm: fsstress Tainted: G        W       4.1.0-rc8-dgc+ #263
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  ffff88042a650c28 ffff88039898b8e8 ffffffff81e18ffd ffff88042f250fb0
  ffff880428f6b8e0 ffff88039898b908 ffffffff81e12f09 ffff88042a650c28
  ffffffff8221337b ffff88039898b928 ffffffff81e12f34 ffff88042a650c28
 Call Trace:
  [<ffffffff81e18ffd>] dump_stack+0x4c/0x6e
  [<ffffffff81e12f09>] spin_dump+0x90/0x95
  [<ffffffff81e12f34>] spin_bug+0x26/0x2b
  [<ffffffff810e762d>] do_raw_spin_lock+0x10d/0x150
  [<ffffffff81e24975>] _raw_spin_lock+0x15/0x20
  [<ffffffff811f8ba0>] __mark_inode_dirty+0x2b0/0x450
  [<ffffffff812003b8>] __set_page_dirty+0x78/0xd0
  [<ffffffff81200531>] mark_buffer_dirty+0x61/0xf0
  [<ffffffff81200d91>] __block_commit_write.isra.24+0x81/0xb0
  [<ffffffff81202406>] block_write_end+0x36/0x70
  [<ffffffff814fa110>] ? __xfs_get_blocks+0x8a0/0x8a0
  [<ffffffff81202474>] generic_write_end+0x34/0xb0
  [<ffffffff8118af3d>] ? wait_for_stable_page+0x1d/0x50
  [<ffffffff814fa317>] xfs_vm_write_end+0x67/0xc0
  [<ffffffff811813af>] pagecache_write_end+0x1f/0x30
  [<ffffffff815060dd>] xfs_iozero+0x10d/0x190
  [<ffffffff8150666b>] xfs_zero_last_block+0xdb/0x110
  [<ffffffff815067ba>] xfs_zero_eof+0x11a/0x290
  [<ffffffff811d69e0>] ? complete_walk+0x60/0x100
  [<ffffffff811da25f>] ? path_lookupat+0x5f/0x660
  [<ffffffff81506a6e>] xfs_file_aio_write_checks+0x13e/0x160
  [<ffffffff81506f15>] xfs_file_buffered_aio_write+0x75/0x250
  [<ffffffff811ddb0f>] ? user_path_at_empty+0x5f/0xa0
  [<ffffffff810c601d>] ? __might_sleep+0x4d/0x90
  [<ffffffff815071f5>] xfs_file_write_iter+0x105/0x120
  [<ffffffff811cc5ce>] __vfs_write+0xae/0xf0
  [<ffffffff811ccc01>] vfs_write+0xa1/0x190
  [<ffffffff811cd999>] SyS_write+0x49/0xb0
  [<ffffffff811cc781>] ? SyS_lseek+0x91/0xb0
  [<ffffffff81e24fee>] system_call_fastpath+0x12/0x71

And there are a few tests (including generic/269) producing
in_atomic/"scheduling while atomic" bugs in the evict() path such as:

 in_atomic(): 1, irqs_disabled(): 0, pid: 3852, name: fsstress
 CPU: 12 PID: 3852 Comm: fsstress Not tainted 4.1.0-rc8-dgc+ #263
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  000000000000015d ffff88039898b6d8 ffffffff81e18ffd 0000000000000000
  ffff880398865550 ffff88039898b6f8 ffffffff810c5f89 ffff8803f15c45c0
  ffffffff8227a3bf ffff88039898b728 ffffffff810c601d ffff88039898b758
 Call Trace:
  [<ffffffff81e18ffd>] dump_stack+0x4c/0x6e
  [<ffffffff810c5f89>] ___might_sleep+0xf9/0x140
  [<ffffffff810c601d>] __might_sleep+0x4d/0x90
  [<ffffffff81201e8b>] block_invalidatepage+0xab/0x140
  [<ffffffff814f7579>] xfs_vm_invalidatepage+0x39/0xb0
  [<ffffffff8118fa77>] truncate_inode_page+0x67/0xa0
  [<ffffffff8118fc92>] truncate_inode_pages_range+0x1a2/0x6f0
  [<ffffffff811828d1>] ? find_get_pages_tag+0xf1/0x1b0
  [<ffffffff8104a663>] ? __switch_to+0x1e3/0x5a0
  [<ffffffff8118dd05>] ? pagevec_lookup_tag+0x25/0x40
  [<ffffffff811f620d>] ? __inode_wait_for_writeback+0x6d/0xc0
  [<ffffffff8119024c>] truncate_inode_pages_final+0x4c/0x60
  [<ffffffff8151c47f>] xfs_fs_evict_inode+0x4f/0x100
  [<ffffffff811e8330>] evict+0xc0/0x1a0
  [<ffffffff811e8d7b>] iput+0x1bb/0x220
  [<ffffffff811f68b3>] sync_inodes_sb+0x353/0x3d0
  [<ffffffff8151def8>] xfs_flush_inodes+0x28/0x40
  [<ffffffff81514648>] xfs_create+0x638/0x770
  [<ffffffff814e9049>] ? xfs_dir2_sf_lookup+0x199/0x330
  [<ffffffff81511091>] xfs_generic_create+0xd1/0x300
  [<ffffffff817a059c>] ? security_inode_permission+0x1c/0x30
  [<ffffffff815112f6>] xfs_vn_create+0x16/0x20
  [<ffffffff811d8665>] vfs_create+0xd5/0x140
  [<ffffffff811dbea3>] do_last+0xff3/0x1200
  [<ffffffff811d9f36>] ? path_init+0x186/0x450
  [<ffffffff811dc130>] path_openat+0x80/0x610
  [<ffffffff81512a24>] ? xfs_iunlock+0xc4/0x210
  [<ffffffff811ddbfa>] do_filp_open+0x3a/0x90
  [<ffffffff811dc8bf>] ? getname_flags+0x4f/0x200
  [<ffffffff81e249ce>] ? _raw_spin_unlock+0xe/0x30
  [<ffffffff811eab17>] ? __alloc_fd+0xa7/0x130
  [<ffffffff811cbcf8>] do_sys_open+0x128/0x220
  [<ffffffff811cbe4e>] SyS_creat+0x1e/0x20
  [<ffffffff81e24fee>] system_call_fastpath+0x12/0x71

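It looks to me like iput() is being called with the wb.list_lock
held in wait_sb_inodes() (which appears inlined into sync_inodes_sb()
in the second trace), and everything is going downhill from there:
if that iput() drops the last reference, evict() runs under the
spinlock and can sleep in truncate_inode_pages_final(), and a task
switched out while holding wb.list_lock would also explain the "cpu
recursion" report in the first trace.  Patch below fixes the problem
for me.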

Cheers,

Dave.

Comments

Josef Bacik June 22, 2015, 4:21 p.m. UTC | #1
On 06/21/2015 07:26 PM, Dave Chinner wrote:
> On Tue, Jun 16, 2015 at 07:34:29AM +1000, Dave Chinner wrote:
>> On Thu, Jun 11, 2015 at 03:41:05PM -0400, Josef Bacik wrote:
>>> Here are the cleaned-up versions of Dave Chinner's super block scalability
>>> patches.  I've been testing them locally for a while and they are pretty solid.
>>> They fix a few big issues, such as the global inode list and soft lockups at
>>> unmount on boxes that have lots of inodes in cache.  Al, if you would consider
>>> pulling these in, that would be great; you can pull from here:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling
>>
>> Passes all my smoke tests.
>>
>> Tested-by: Dave Chinner <dchinner@redhat.com>
>
> FWIW, I just updated my trees to whatever is in the above branch and
> v4.1-rc8, and now I'm seeing problems with wb.list_lock recursion
> and "sleeping in atomic" scheduling issues. generic/269 produced
> this:
....
> It looks to me like iput() is being called with the wb.list_lock
> held in wait_sb_inodes(), and everything is going downhill from
> there.  Patch below fixes the problem for me.
>
> Cheers,
>
> Dave.
>

Thanks Dave, I'll add it.  I think this is what we were doing at first,
but then I changed it and didn't notice the wb.list_lock.  Thanks,

Josef
Josef Bacik June 23, 2015, 11:14 p.m. UTC | #2
On 06/21/2015 07:26 PM, Dave Chinner wrote:
> On Tue, Jun 16, 2015 at 07:34:29AM +1000, Dave Chinner wrote:
>> On Thu, Jun 11, 2015 at 03:41:05PM -0400, Josef Bacik wrote:
>>> Here are the cleaned-up versions of Dave Chinner's super block scalability
>>> patches.  I've been testing them locally for a while and they are pretty solid.
>>> They fix a few big issues, such as the global inode list and soft lockups at
>>> unmount on boxes that have lots of inodes in cache.  Al, if you would consider
>>> pulling these in, that would be great; you can pull from here:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling
>>
>> Passes all my smoke tests.
>>
>> Tested-by: Dave Chinner <dchinner@redhat.com>
>
> FWIW, I just updated my trees to whatever is in the above branch and
> v4.1-rc8, and now I'm seeing problems with wb.list_lock recursion
> and "sleeping in atomic" scheduling issues. generic/269 produced
> this:
....
> It looks to me like iput() is being called with the wb.list_lock
> held in wait_sb_inodes(), and everything is going downhill from
> there.  Patch below fixes the problem for me.
>

I folded this into "bdi: add a new writeback list for sync", since the
fix was originally part of that patch, and to keep the series
bisect-friendly.  Let me know if this isn't OK with you and I'll undo
it.  Thanks,

Josef

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner June 24, 2015, 12:35 a.m. UTC | #3
On Tue, Jun 23, 2015 at 04:14:42PM -0700, Josef Bacik wrote:
> On 06/21/2015 07:26 PM, Dave Chinner wrote:
> >On Tue, Jun 16, 2015 at 07:34:29AM +1000, Dave Chinner wrote:
> >>On Thu, Jun 11, 2015 at 03:41:05PM -0400, Josef Bacik wrote:
> >>>Here are the cleaned-up versions of Dave Chinner's super block scalability
> >>>patches.  I've been testing them locally for a while and they are pretty solid.
> >>>They fix a few big issues, such as the global inode list and soft lockups at
> >>>unmount on boxes that have lots of inodes in cache.  Al, if you would consider
> >>>pulling these in, that would be great; you can pull from here:
> >>>
> >>>git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git superblock-scaling
> >>
> >>Passes all my smoke tests.
> >>
> >>Tested-by: Dave Chinner <dchinner@redhat.com>
> >
> >FWIW, I just updated my trees to whatever is in the above branch and
> >v4.1-rc8, and now I'm seeing problems with wb.list_lock recursion
> >and "sleeping in atomic" scheduling issues. generic/269 produced
> >this:
....
> >It looks to me like iput() is being called with the wb.list_lock
> >held in wait_sb_inodes(), and everything is going downhill from
> >there.  Patch below fixes the problem for me.
> >
> 
> I folded this into "bdi: add a new writeback list for sync" since it
> was there before and to be more bisect friendly.  Let me know if
> this isn't ok with you and I'll undo it.  Thanks,

That's fine - I was expecting you would fold it back in... ;)

Cheers,

Dave.

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1718702..a2cd363 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1436,6 +1436,7 @@  static void wait_sb_inodes(struct super_block *sb)
 {
 	struct backing_dev_info *bdi = sb->s_bdi;
 	LIST_HEAD(sync_list);
+	struct inode *iput_inode = NULL;
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1497,6 +1498,9 @@  static void wait_sb_inodes(struct super_block *sb)
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&bdi->wb.list_lock);
 
+		if (iput_inode)
+			iput(iput_inode);
+
 		filemap_fdatawait(mapping);
 		cond_resched();
 
@@ -1516,9 +1520,19 @@  static void wait_sb_inodes(struct super_block *sb)
                 } else
 			list_del_init(&inode->i_wb_list);
 		spin_unlock_irq(&mapping->tree_lock);
-		iput(inode);
+
+		/*
+		 * can't iput inode while holding the wb.list_lock. Save it for
+		 * the next time through the loop when we drop all our spin
+		 * locks.
+		 */
+		iput_inode = inode;
 	}
 	spin_unlock(&bdi->wb.list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
+
 	mutex_unlock(&sb->s_sync_lock);
 }