[6/8] bdi: add a new writeback list for sync
diff mbox

Message ID 1435116242-27495-7-git-send-email-jbacik@fb.com
State New
Headers show

Commit Message

Josef Bacik June 24, 2015, 3:24 a.m. UTC
From: Dave Chinner <dchinner@redhat.com>

wait_sb_inodes() current does a walk of all inodes in the filesystem
to find dirty one to wait on during sync. This is highly
inefficient and wastes a lot of CPU when there are lots of clean
cached inodes that we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for. We do this by
adding inodes to a writeback list on the bdi when the mapping is
first tagged as having pages under writeback.  wait_sb_inodes() can
then walk this list of "inodes under IO" and wait specifically just
for the inodes that the current sync(2) needs to wait for.

To avoid needing all the realted locking to be safe against
interrupts, Jan Kara suggested that we be lazy about removal from
the writeback list. That is, we don't remove inodes from the
writeback list on IO completion, but do it directly during a
wait_sb_inodes() walk.

This means that the a rare sync(2) call will have some work to do
skipping clean inodes However, in the current problem case of
concurrent sync workloads, concurrent wait_sb_inodes() calls only
walk the very recently dispatched inodes and hence should have very
little work to do.

This also means that we have to remove the inodes from the writeback
list during eviction. Do this in inode_wait_for_writeback() once
all writeback on the inode is complete.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Tested-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/block_dev.c              |   2 +
 fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
 fs/inode.c                  |   1 +
 include/linux/backing-dev.h |   5 ++
 include/linux/fs.h          |   1 +
 mm/backing-dev.c            |   1 +
 mm/page-writeback.c         |  14 +++++
 7 files changed, 144 insertions(+), 28 deletions(-)

Comments

Brian Foster Dec. 9, 2015, 6:40 p.m. UTC | #1
On Tue, Jun 23, 2015 at 08:24:00PM -0700, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> wait_sb_inodes() current does a walk of all inodes in the filesystem
> to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
> 
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback.  wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
> 
> To avoid needing all the realted locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
> 
> This means that the a rare sync(2) call will have some work to do
> skipping clean inodes However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
> 
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Tested-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Jan Kara <jack@suse.cz>
> ---

Hi all,

My understanding is this (and the subsequent) patch were never merged
due to conflicts with the cgroup aware writeback stuff that had been
merged. I think the fundamental approach of this patch still solves the
original problem, the writeback inodes are just always tracked on the
root bdi_writeback rather than the per-cgroup bdi_writeback the inode
was associated with before writeback.

Any thoughts on that are appreciated. Otherwise, I'm trying to rebase
this and I've hit a lockdep splat [1] that suggests the locking for the
writeback list needs to be tweaked one way or another. Note that lockdep
fires with this original version as well, so I don't think this is a new
issue.

I _think_ this means that list_lock either needs to be taken outside of
mapping->tree_lock or we need to create a new irq safe spinlock specific
to the b_writeback list. The former looks like it requires some ugliness
in __test_set_page_writeback() to acquire list_lock first, but only when
list_empty(&inode->i_wb_list) so we don't grab it for every page. The
latter obviously means a new irq-disabling lock (which is already done
in parts of wait_sb_inode(), fwiw). The latter might also facilitate
further cleanups like pulling b_writeback out of bdi_writeback and up
into backing_dev_info since we really only need one instance of the
list, but I'm not sure if that's preferable..

Thoughts?

Brian

[1] lockdep splat:

[ INFO: possible irq lock inversion dependency detected ]
4.4.0-rc4+ #54 Not tainted
---------------------------------------------------------
swapper/0/0 just changed the state of lock:
(&(&mapping->tree_lock)->rlock){-.....}, at: [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
but this lock took another, HARDIRQ-unsafe lock in the past:
(&(&wb->list_lock)->rlock){+.+...}#012#012and interrupts could create inverse lock ordering between them.
#012other info that might help us debug this:
Possible interrupt unsafe locking scenario:
      CPU0                    CPU1
      ----                    ----
 lock(&(&wb->list_lock)->rlock);
                              local_irq_disable();
                              lock(&(&mapping->tree_lock)->rlock);
                              lock(&(&wb->list_lock)->rlock);
 <Interrupt>
   lock(&(&mapping->tree_lock)->rlock);
#012 *** DEADLOCK ***
no locks held by swapper/0/0.
#012the shortest dependencies between 2nd lock and 1st lock:
-> (&(&wb->list_lock)->rlock){+.+...} ops: 194 {
   HARDIRQ-ON-W at:
                     [<ffffffff810f9959>] __lock_acquire+0x819/0x1a50
                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
   SOFTIRQ-ON-W at:
                     [<ffffffff810f998a>] __lock_acquire+0x84a/0x1a50
                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
   INITIAL USE at:
                    [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
                    [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                    [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
                    [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
                    [<ffffffff8125a859>] generic_update_time+0x79/0xd0
                    [<ffffffff8125c638>] touch_atime+0xa8/0xd0
                    [<ffffffff812519c3>] iterate_dir+0xe3/0x130
                    [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
                    [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
 }
 ... key      at: [<ffffffff82c7faa0>] __key.37388+0x0/0x8
 ... acquired at:
  [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
  [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
  [<ffffffff812703a7>] bdi_mark_inode_writeback+0x67/0xd0
  [<ffffffff811d19b2>] __test_set_page_writeback+0x112/0x200
  [<ffffffff8127bcdc>] __block_write_full_page.constprop.44+0xec/0x400
  [<ffffffff8127c0ee>] block_write_full_page+0xfe/0x190
  [<ffffffff8127cec8>] blkdev_writepage+0x18/0x20
  [<ffffffff811cede6>] __writepage+0x16/0x40
  [<ffffffff811cfc65>] write_cache_pages+0x225/0x5f0
  [<ffffffff811d0084>] generic_writepages+0x54/0x80
  [<ffffffff811d24e1>] do_writepages+0x21/0x30
  [<ffffffff811c4b40>] __filemap_fdatawrite_range+0x80/0xb0
  [<ffffffff811c4c7d>] filemap_write_and_wait_range+0x2d/0x70
  [<ffffffff8127ca8b>] blkdev_fsync+0x1b/0x50
  [<ffffffff81274a9b>] vfs_fsync_range+0x4b/0xb0
  [<ffffffff81274b5d>] do_fsync+0x3d/0x70
  [<ffffffff81274e10>] SyS_fsync+0x10/0x20
  [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76

-> (&(&mapping->tree_lock)->rlock){-.....} ops: 16821 {
  IN-HARDIRQ-W at:
                   [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
                   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                   [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
                   [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
                   [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
                   [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
                   [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
                   [<ffffffff813a329f>] bio_endio+0x3f/0x60
                   [<ffffffff8162d4d9>] dec_pending+0x149/0x310
                   [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
                   [<ffffffff813a329f>] bio_endio+0x3f/0x60
                   [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
                   [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
                   [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
                   [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
                   [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
                   [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
                   [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
                   [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
                   [<ffffffff81026703>] default_idle+0x23/0x150
                   [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
                   [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
                   [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
                   [<ffffffff817d6aaa>] rest_init+0x13a/0x140
                   [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
                   [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
                   [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
  INITIAL USE at:
                  [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
                  [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
                  [<ffffffff817e3fa4>] _raw_spin_lock_irq+0x44/0x60
                  [<ffffffff811c303c>] __add_to_page_cache_locked+0xbc/0x340
                  [<ffffffff811c3317>] add_to_page_cache_lru+0x37/0x90
                  [<ffffffff811c3dc7>] pagecache_get_page+0xb7/0x200
                  [<ffffffff811c4139>] grab_cache_page_write_begin+0x29/0x40
                  [<ffffffff81269848>] simple_write_begin+0x28/0x1c0
                  [<ffffffff811c2a01>] generic_perform_write+0xd1/0x1e0
                  [<ffffffff811c55f2>] __generic_file_write_iter+0x1a2/0x1e0
                  [<ffffffff811c5718>] generic_file_write_iter+0xe8/0x1e0
                  [<ffffffff8123c409>] __vfs_write+0xc9/0x110
                  [<ffffffff8123ca99>] vfs_write+0xa9/0x1a0
                  [<ffffffff8123d778>] SyS_write+0x58/0xd0
                  [<ffffffff81d6c3b1>] xwrite+0x29/0x5c
                  [<ffffffff81d6c40e>] do_copy+0x2a/0xb8
                  [<ffffffff81d6c151>] write_buffer+0x23/0x34
                  [<ffffffff81d6c9e6>] unpack_to_rootfs+0xfd/0x294
                  [<ffffffff81d6cbd9>] populate_rootfs+0x5c/0x108
                  [<ffffffff81002123>] do_one_initcall+0xb3/0x200
                  [<ffffffff81d6b22e>] kernel_init_freeable+0x1ee/0x28d
                  [<ffffffff817d6abe>] kernel_init+0xe/0xe0
                  [<ffffffff817e4ddf>] ret_from_fork+0x3f/0x70
}
... key      at: [<ffffffff82ca2760>] __key.39560+0x0/0x8
... acquired at:
  [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
  [<ffffffff810f88c3>] mark_lock+0x333/0x610
  [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
  [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
  [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
  [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
  [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
  [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
  [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
  [<ffffffff813a329f>] bio_endio+0x3f/0x60
  [<ffffffff8162d4d9>] dec_pending+0x149/0x310
  [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
  [<ffffffff813a329f>] bio_endio+0x3f/0x60
  [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
  [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
  [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
  [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
  [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
  [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
  [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
  [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
  [<ffffffff81026703>] default_idle+0x23/0x150
  [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
  [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
  [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
  [<ffffffff817d6aaa>] rest_init+0x13a/0x140
  [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
  [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
  [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168

#012stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc4+ #54
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
0000000000000000 06776aa479e380f6 ffff88011fc03ad0 ffffffff813dd019
ffffffff82a0f120 ffff88011fc03b10 ffffffff811c184d ffffffff81c10be8
ffffffff81c10be8 ffffffff81c10500 ffffffff81a56510 0000000000000000
Call Trace:
<IRQ>  [<ffffffff813dd019>] dump_stack+0x4b/0x72
[<ffffffff811c184d>] print_irq_inversion_bug.part.38+0x1a4/0x1b0
[<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
[<ffffffff810f88c3>] mark_lock+0x333/0x610
[<ffffffff810f7ac0>] ? check_usage_backwards+0x160/0x160
[<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
[<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
[<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
[<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
[<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
[<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
[<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
[<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
[<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
[<ffffffff813a329f>] bio_endio+0x3f/0x60
[<ffffffff8162d4d9>] dec_pending+0x149/0x310
[<ffffffff811c6809>] ? mempool_free+0x29/0x80
[<ffffffff8162e1d6>] clone_endio+0x76/0xe0
[<ffffffff813a329f>] bio_endio+0x3f/0x60
[<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
[<ffffffff813b4e90>] ? blkdev_issue_zeroout+0xf0/0xf0
[<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
[<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
[<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
[<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
[<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
[<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
[<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
<EOI>  [<ffffffff810653a6>] ? native_safe_halt+0x6/0x10
[<ffffffff81026703>] default_idle+0x23/0x150
[<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
[<ffffffff810efd8a>] default_idle_call+0x2a/0x40
[<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
[<ffffffff817d6aaa>] rest_init+0x13a/0x140
[<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
[<ffffffff81d6a120>] ? early_idt_handler_array+0x120/0x120
[<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168

>  fs/block_dev.c              |   2 +
>  fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
>  fs/inode.c                  |   1 +
>  include/linux/backing-dev.h |   5 ++
>  include/linux/fs.h          |   1 +
>  mm/backing-dev.c            |   1 +
>  mm/page-writeback.c         |  14 +++++
>  7 files changed, 144 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index f2a89be..2726ff1 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -64,6 +64,8 @@ void kill_bdev(struct block_device *bdev)
>  {
>  	struct address_space *mapping = bdev->bd_inode->i_mapping;
>  
> +	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
> +				  bdev->bd_inode);
>  	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
>  		return;
>  
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index aa72536..5c005ad 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
>  }
>  
>  /*
> + * mark an inode as under writeback on the given bdi
> + */
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		if (list_empty(&inode->i_wb_list))
> +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
> +/*
> + * clear an inode as under writeback on the given bdi
> + */
> +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> +			       struct inode *inode)
> +{
> +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> +	if (!list_empty(&inode->i_wb_list)) {
> +		spin_lock(&bdi->wb.list_lock);
> +		list_del_init(&inode->i_wb_list);
> +		spin_unlock(&bdi->wb.list_lock);
> +	}
> +}
> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -383,13 +411,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
>  }
>  
>  /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
>   */
>  void inode_wait_for_writeback(struct inode *inode)
>  {
> +	BUG_ON(!(inode->i_state & I_FREEING));
> +
>  	spin_lock(&inode->i_lock);
>  	__inode_wait_for_writeback(inode);
>  	spin_unlock(&inode->i_lock);
> +
> +	/*
> +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> +	 * the inode and then deref bdev->bd_disk, which at this point has been
> +	 * set to NULL, so we would panic.  At the point we are dropping our
> +	 * bd_inode we won't have any pages under writeback on the device so
> +	 * this is safe.  But just in case we'll assert to make sure we don't
> +	 * screw this up.
> +	 */
> +	if (!sb_is_blkdev_sb(inode->i_sb))
> +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> +	BUG_ON(!list_empty(&inode->i_wb_list));
>  }
>  
>  /*
> @@ -1372,7 +1415,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
>   */
>  static void wait_sb_inodes(struct super_block *sb)
>  {
> -	struct inode *inode, *old_inode = NULL;
> +	struct backing_dev_info *bdi = sb->s_bdi;
> +	LIST_HEAD(sync_list);
> +	struct inode *iput_inode = NULL;
>  
>  	/*
>  	 * We need to be protected against the filesystem going from
> @@ -1380,48 +1425,95 @@ static void wait_sb_inodes(struct super_block *sb)
>  	 */
>  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
>  
> -	mutex_lock(&sb->s_sync_lock);
> -	spin_lock(&sb->s_inode_list_lock);
> -
>  	/*
> -	 * Data integrity sync. Must wait for all pages under writeback,
> -	 * because there may have been pages dirtied before our sync
> -	 * call, but which had writeout started before we write it out.
> -	 * In which case, the inode may not be on the dirty list, but
> -	 * we still have to wait for that writeout.
> +	 * Data integrity sync. Must wait for all pages under writeback, because
> +	 * there may have been pages dirtied before our sync call, but which had
> +	 * writeout started before we write it out.  In which case, the inode
> +	 * may not be on the dirty list, but we still have to wait for that
> +	 * writeout.
> +	 *
> +	 * To avoid syncing inodes put under IO after we have started here,
> +	 * splice the io list to a temporary list head and walk that. Newly
> +	 * dirtied inodes will go onto the primary list so we won't wait for
> +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> +	 * so we'll complete processing the complete list before the next
> +	 * sync operation repeats the splice-and-walk process.
> +	 *
> +	 * Inodes that have pages under writeback after we've finished the wait
> +	 * may or may not be on the primary list. They had pages under put IO
> +	 * after we started our wait, so we need to make sure the next sync or
> +	 * IO completion treats them correctly, Move them back to the primary
> +	 * list and restart the walk.
> +	 *
> +	 * Inodes that are clean after we have waited for them don't belong on
> +	 * any list, so simply remove them from the sync list and move onwards.
>  	 */
> -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +	mutex_lock(&sb->s_sync_lock);
> +	spin_lock(&bdi->wb.list_lock);
> +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> +	while (!list_empty(&sync_list)) {
> +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> +						       i_wb_list);
>  		struct address_space *mapping = inode->i_mapping;
>  
> +		/*
> +		 * We are lazy on IO completion and don't remove inodes from the
> +		 * list when they are clean. Detect that immediately and skip
> +		 * inodes we don't ahve to wait on.
> +		 */
> +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			list_del_init(&inode->i_wb_list);
> +			cond_resched_lock(&bdi->wb.list_lock);
> +			continue;
> +		}
> +
>  		spin_lock(&inode->i_lock);
> -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> -		    (mapping->nrpages == 0)) {
> +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
>  			spin_unlock(&inode->i_lock);
> +			cond_resched_lock(&bdi->wb.list_lock);
>  			continue;
>  		}
>  		__iget(inode);
>  		spin_unlock(&inode->i_lock);
> -		spin_unlock(&sb->s_inode_list_lock);
> +		spin_unlock(&bdi->wb.list_lock);
>  
> -		/*
> -		 * We hold a reference to 'inode' so it couldn't have been
> -		 * removed from s_inodes list while we dropped the
> -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> -		 * be holding the last reference and we cannot iput it under
> -		 * s_inode_list_lock. So we keep the reference and iput it
> -		 * later.
> -		 */
> -		iput(old_inode);
> -		old_inode = inode;
> +		if (iput_inode)
> +			iput(iput_inode);
>  
>  		filemap_fdatawait(mapping);
> -
>  		cond_resched();
>  
> -		spin_lock(&sb->s_inode_list_lock);
> +		/*
> +		 * the inode has been written back now, so check whether we
> +		 * still have pages under IO and move it back to the primary
> +		 * list if necessary.  We really need the mapping->tree_lock
> +		 * here because bdi_mark_inode_writeback may have not done
> +		 * anything because we were on the spliced list and we need to
> +		 * check TAG_WRITEBACK.
> +		 */
> +		spin_lock_irq(&mapping->tree_lock);
> +		spin_lock(&bdi->wb.list_lock);
> +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> +			WARN_ON(list_empty(&inode->i_wb_list));
> +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> +                } else
> +			list_del_init(&inode->i_wb_list);
> +		spin_unlock_irq(&mapping->tree_lock);
> +
> +		/*
> +		 * can't iput inode while holding the wb.list_lock. Save it for
> +		 * the next time through the loop when we drop all our spin
> +		 * locks.
> +		 */
> +		iput_inode = inode;
>  	}
> -	spin_unlock(&sb->s_inode_list_lock);
> -	iput(old_inode);
> +	spin_unlock(&bdi->wb.list_lock);
> +
> +	if (iput_inode)
> +		iput(iput_inode);
> +
>  	mutex_unlock(&sb->s_sync_lock);
>  }
>  
> diff --git a/fs/inode.c b/fs/inode.c
> index 770d684..8f00557 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
>  	INIT_HLIST_NODE(&inode->i_hash);
>  	INIT_LIST_HEAD(&inode->i_devices);
>  	INIT_LIST_HEAD(&inode->i_io_list);
> +	INIT_LIST_HEAD(&inode->i_wb_list);
>  	INIT_LIST_HEAD(&inode->i_lru);
>  	address_space_init_once(&inode->i_data);
>  	i_size_ordered_init(inode);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index d87d8ec..0d5bca2 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -56,6 +56,7 @@ struct bdi_writeback {
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
>  	struct list_head b_dirty_time;	/* time stamps are dirty */
> +	struct list_head b_writeback;	/* inodes under writeback */
>  	spinlock_t list_lock;		/* protects the b_* lists */
>  };
>  
> @@ -123,6 +124,10 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
>  void bdi_writeback_workfn(struct work_struct *work);
>  int bdi_has_dirty_io(struct backing_dev_info *bdi);
>  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> +			      struct inode *inode);
> +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> +			       struct inode *inode);
>  
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6dd2ab2..d3baa08 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -636,6 +636,7 @@ struct inode {
>  	struct list_head	i_io_list;	/* backing dev IO list */
>  	struct list_head	i_lru;		/* inode LRU list */
>  	struct list_head	i_sb_list;
> +	struct list_head	i_wb_list;	/* backing dev writeback list */
>  	union {
>  		struct hlist_head	i_dentry;
>  		struct rcu_head		i_rcu;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index e9ed047..ea39301 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -369,6 +369,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
>  	INIT_LIST_HEAD(&wb->b_dirty_time);
> +	INIT_LIST_HEAD(&wb->b_writeback);
>  	spin_lock_init(&wb->list_lock);
>  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>  }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index eb59f7e..f3751d1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>  		spin_lock_irqsave(&mapping->tree_lock, flags);
>  		ret = TestSetPageWriteback(page);
>  		if (!ret) {
> +			bool on_wblist;
> +
> +			on_wblist = mapping_tagged(mapping,
> +						   PAGECACHE_TAG_WRITEBACK);
> +
>  			radix_tree_tag_set(&mapping->page_tree,
>  						page_index(page),
>  						PAGECACHE_TAG_WRITEBACK);
>  			if (bdi_cap_account_writeback(bdi))
>  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> +
> +			/*
> +			 * we can come through here when swapping anonymous
> +			 * pages, so we don't necessarily have an inode to
> +			 * track for sync. Inodes are remove lazily, so there is
> +			 * no equivalent in test_clear_page_writeback().
> +			 */
> +			if (!on_wblist && mapping->host)
> +				bdi_mark_inode_writeback(bdi, mapping->host);
>  		}
>  		if (!PageDirty(page))
>  			radix_tree_tag_clear(&mapping->page_tree,
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Dec. 10, 2015, 10:08 a.m. UTC | #2
On Wed 09-12-15 13:40:30, Brian Foster wrote:
> On Tue, Jun 23, 2015 at 08:24:00PM -0700, Josef Bacik wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > wait_sb_inodes() current does a walk of all inodes in the filesystem
> > to find dirty one to wait on during sync. This is highly
> > inefficient and wastes a lot of CPU when there are lots of clean
> > cached inodes that we don't need to wait on.
> > 
> > To avoid this "all inode" walk, we need to track inodes that are
> > currently under writeback that we need to wait for. We do this by
> > adding inodes to a writeback list on the bdi when the mapping is
> > first tagged as having pages under writeback.  wait_sb_inodes() can
> > then walk this list of "inodes under IO" and wait specifically just
> > for the inodes that the current sync(2) needs to wait for.
> > 
> > To avoid needing all the realted locking to be safe against
> > interrupts, Jan Kara suggested that we be lazy about removal from
> > the writeback list. That is, we don't remove inodes from the
> > writeback list on IO completion, but do it directly during a
> > wait_sb_inodes() walk.
> > 
> > This means that the a rare sync(2) call will have some work to do
> > skipping clean inodes However, in the current problem case of
> > concurrent sync workloads, concurrent wait_sb_inodes() calls only
> > walk the very recently dispatched inodes and hence should have very
> > little work to do.
> > 
> > This also means that we have to remove the inodes from the writeback
> > list during eviction. Do this in inode_wait_for_writeback() once
> > all writeback on the inode is complete.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > Tested-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> > ---

  Hi,

> My understanding is this (and the subsequent) patch were never merged
> due to conflicts with the cgroup aware writeback stuff that had been
> merged.

Yes. Thanks for bringing this up. At that point I thought Josef will take
care of rebasing the patches and it fell out of my radar.

> I think the fundamental approach of this patch still solves the
> original problem, the writeback inodes are just always tracked on the
> root bdi_writeback rather than the per-cgroup bdi_writeback the inode
> was associated with before writeback.

I was thinking about this and realized that the original patch actually has
a flaw. In case there are several partitions on a bdi and sync_fs() would
be called for two filesystems on the same bdi, then one of those sync_fs()
calls would splice the whole b_writeback list into its private list in
wait_sb_inodes() and the other sync_fs() call would return too early because
it would find b_writeback list empty.

Given that we need the list of inodes under writeback only for sync(2) /
sync_fs(2) purposes it would work the best to track those inodes in the
per-superblock list. We would miss block device inode this way but that one
is waited-on in a special way anyway (in __sync_blockdev() /
fdatawait_one_bdev()). Also that way any strange interaction and confusion
with cgroup writeback is avoided.

> Any thoughts on that are appreciated. Otherwise, I'm trying to rebase
> this and I've hit a lockdep splat [1] that suggests the locking for the
> writeback list needs to be tweaked one way or another. Note that lockdep
> fires with this original version as well, so I don't think this is a new
> issue.

Yeah, it looks that the issue has been there forever. When we have per-sb
list then we need a different lock to protect the list anyway. So having a
dedicated lock for the list which would be irqsave would be the easiest
solution. But note that in that case the locking of i_lock in
wait_inodes_sb() has to be dealt with as well since making i_lock irqsafe
is not an option.

I think we could avoid having the dedicated lock irqsafe if we moved
bdi_mark_inode_writeback() call from under mapping->tree_lock (to happen
later in __test_set_page_writeback()). However that needs some careful
thinking about how the interactions of __test_set_page_writeback() with the
list handling in wait_inodes_sb() work out.

								Honza

> [1] lockdep splat:
> 
> [ INFO: possible irq lock inversion dependency detected ]
> 4.4.0-rc4+ #54 Not tainted
> ---------------------------------------------------------
> swapper/0/0 just changed the state of lock:
> (&(&mapping->tree_lock)->rlock){-.....}, at: [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> but this lock took another, HARDIRQ-unsafe lock in the past:
> (&(&wb->list_lock)->rlock){+.+...}#012#012and interrupts could create inverse lock ordering between them.
> #012other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>       CPU0                    CPU1
>       ----                    ----
>  lock(&(&wb->list_lock)->rlock);
>                               local_irq_disable();
>                               lock(&(&mapping->tree_lock)->rlock);
>                               lock(&(&wb->list_lock)->rlock);
>  <Interrupt>
>    lock(&(&mapping->tree_lock)->rlock);
> #012 *** DEADLOCK ***
> no locks held by swapper/0/0.
> #012the shortest dependencies between 2nd lock and 1st lock:
> -> (&(&wb->list_lock)->rlock){+.+...} ops: 194 {
>    HARDIRQ-ON-W at:
>                      [<ffffffff810f9959>] __lock_acquire+0x819/0x1a50
>                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
>                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
>                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
>                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
>                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
>                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
>    SOFTIRQ-ON-W at:
>                      [<ffffffff810f998a>] __lock_acquire+0x84a/0x1a50
>                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
>                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
>                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
>                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
>                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
>                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
>    INITIAL USE at:
>                     [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
>                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
>                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
>                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
>                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
>                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
>                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
>  }
>  ... key      at: [<ffffffff82c7faa0>] __key.37388+0x0/0x8
>  ... acquired at:
>   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>   [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
>   [<ffffffff812703a7>] bdi_mark_inode_writeback+0x67/0xd0
>   [<ffffffff811d19b2>] __test_set_page_writeback+0x112/0x200
>   [<ffffffff8127bcdc>] __block_write_full_page.constprop.44+0xec/0x400
>   [<ffffffff8127c0ee>] block_write_full_page+0xfe/0x190
>   [<ffffffff8127cec8>] blkdev_writepage+0x18/0x20
>   [<ffffffff811cede6>] __writepage+0x16/0x40
>   [<ffffffff811cfc65>] write_cache_pages+0x225/0x5f0
>   [<ffffffff811d0084>] generic_writepages+0x54/0x80
>   [<ffffffff811d24e1>] do_writepages+0x21/0x30
>   [<ffffffff811c4b40>] __filemap_fdatawrite_range+0x80/0xb0
>   [<ffffffff811c4c7d>] filemap_write_and_wait_range+0x2d/0x70
>   [<ffffffff8127ca8b>] blkdev_fsync+0x1b/0x50
>   [<ffffffff81274a9b>] vfs_fsync_range+0x4b/0xb0
>   [<ffffffff81274b5d>] do_fsync+0x3d/0x70
>   [<ffffffff81274e10>] SyS_fsync+0x10/0x20
>   [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> 
> -> (&(&mapping->tree_lock)->rlock){-.....} ops: 16821 {
>   IN-HARDIRQ-W at:
>                    [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
>                    [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                    [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
>                    [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
>                    [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
>                    [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
>                    [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
>                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
>                    [<ffffffff8162d4d9>] dec_pending+0x149/0x310
>                    [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
>                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
>                    [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
>                    [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
>                    [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>                    [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
>                    [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
>                    [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
>                    [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
>                    [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
>                    [<ffffffff81026703>] default_idle+0x23/0x150
>                    [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
>                    [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
>                    [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
>                    [<ffffffff817d6aaa>] rest_init+0x13a/0x140
>                    [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
>                    [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
>                    [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
>   INITIAL USE at:
>                   [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
>                   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>                   [<ffffffff817e3fa4>] _raw_spin_lock_irq+0x44/0x60
>                   [<ffffffff811c303c>] __add_to_page_cache_locked+0xbc/0x340
>                   [<ffffffff811c3317>] add_to_page_cache_lru+0x37/0x90
>                   [<ffffffff811c3dc7>] pagecache_get_page+0xb7/0x200
>                   [<ffffffff811c4139>] grab_cache_page_write_begin+0x29/0x40
>                   [<ffffffff81269848>] simple_write_begin+0x28/0x1c0
>                   [<ffffffff811c2a01>] generic_perform_write+0xd1/0x1e0
>                   [<ffffffff811c55f2>] __generic_file_write_iter+0x1a2/0x1e0
>                   [<ffffffff811c5718>] generic_file_write_iter+0xe8/0x1e0
>                   [<ffffffff8123c409>] __vfs_write+0xc9/0x110
>                   [<ffffffff8123ca99>] vfs_write+0xa9/0x1a0
>                   [<ffffffff8123d778>] SyS_write+0x58/0xd0
>                   [<ffffffff81d6c3b1>] xwrite+0x29/0x5c
>                   [<ffffffff81d6c40e>] do_copy+0x2a/0xb8
>                   [<ffffffff81d6c151>] write_buffer+0x23/0x34
>                   [<ffffffff81d6c9e6>] unpack_to_rootfs+0xfd/0x294
>                   [<ffffffff81d6cbd9>] populate_rootfs+0x5c/0x108
>                   [<ffffffff81002123>] do_one_initcall+0xb3/0x200
>                   [<ffffffff81d6b22e>] kernel_init_freeable+0x1ee/0x28d
>                   [<ffffffff817d6abe>] kernel_init+0xe/0xe0
>                   [<ffffffff817e4ddf>] ret_from_fork+0x3f/0x70
> }
> ... key      at: [<ffffffff82ca2760>] __key.39560+0x0/0x8
> ... acquired at:
>   [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
>   [<ffffffff810f88c3>] mark_lock+0x333/0x610
>   [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
>   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
>   [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
>   [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
>   [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
>   [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
>   [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
>   [<ffffffff813a329f>] bio_endio+0x3f/0x60
>   [<ffffffff8162d4d9>] dec_pending+0x149/0x310
>   [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
>   [<ffffffff813a329f>] bio_endio+0x3f/0x60
>   [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
>   [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
>   [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
>   [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
>   [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
>   [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
>   [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
>   [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
>   [<ffffffff81026703>] default_idle+0x23/0x150
>   [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
>   [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
>   [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
>   [<ffffffff817d6aaa>] rest_init+0x13a/0x140
>   [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
>   [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
>   [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> 
> #012stack backtrace:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc4+ #54
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> 0000000000000000 06776aa479e380f6 ffff88011fc03ad0 ffffffff813dd019
> ffffffff82a0f120 ffff88011fc03b10 ffffffff811c184d ffffffff81c10be8
> ffffffff81c10be8 ffffffff81c10500 ffffffff81a56510 0000000000000000
> Call Trace:
> <IRQ>  [<ffffffff813dd019>] dump_stack+0x4b/0x72
> [<ffffffff811c184d>] print_irq_inversion_bug.part.38+0x1a4/0x1b0
> [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
> [<ffffffff810f88c3>] mark_lock+0x333/0x610
> [<ffffffff810f7ac0>] ? check_usage_backwards+0x160/0x160
> [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> [<ffffffff813a329f>] bio_endio+0x3f/0x60
> [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> [<ffffffff811c6809>] ? mempool_free+0x29/0x80
> [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> [<ffffffff813a329f>] bio_endio+0x3f/0x60
> [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> [<ffffffff813b4e90>] ? blkdev_issue_zeroout+0xf0/0xf0
> [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> <EOI>  [<ffffffff810653a6>] ? native_safe_halt+0x6/0x10
> [<ffffffff81026703>] default_idle+0x23/0x150
> [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> [<ffffffff81d6a120>] ? early_idt_handler_array+0x120/0x120
> [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> 
> >  fs/block_dev.c              |   2 +
> >  fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
> >  fs/inode.c                  |   1 +
> >  include/linux/backing-dev.h |   5 ++
> >  include/linux/fs.h          |   1 +
> >  mm/backing-dev.c            |   1 +
> >  mm/page-writeback.c         |  14 +++++
> >  7 files changed, 144 insertions(+), 28 deletions(-)
> > 
> > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > index f2a89be..2726ff1 100644
> > --- a/fs/block_dev.c
> > +++ b/fs/block_dev.c
> > @@ -64,6 +64,8 @@ void kill_bdev(struct block_device *bdev)
> >  {
> >  	struct address_space *mapping = bdev->bd_inode->i_mapping;
> >  
> > +	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
> > +				  bdev->bd_inode);
> >  	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
> >  		return;
> >  
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index aa72536..5c005ad 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
> >  }
> >  
> >  /*
> > + * mark an inode as under writeback on the given bdi
> > + */
> > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> > +{
> > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > +	if (list_empty(&inode->i_wb_list)) {
> > +		spin_lock(&bdi->wb.list_lock);
> > +		if (list_empty(&inode->i_wb_list))
> > +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
> > +
> > +/*
> > + * clear an inode as under writeback on the given bdi
> > + */
> > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > +			       struct inode *inode)
> > +{
> > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > +	if (!list_empty(&inode->i_wb_list)) {
> > +		spin_lock(&bdi->wb.list_lock);
> > +		list_del_init(&inode->i_wb_list);
> > +		spin_unlock(&bdi->wb.list_lock);
> > +	}
> > +}
> > +
> > +/*
> >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> >   * furthest end of its superblock's dirty-inode list.
> >   *
> > @@ -383,13 +411,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
> >  }
> >  
> >  /*
> > - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> > + * Wait for writeback on an inode to complete during eviction. Caller must have
> > + * inode pinned.
> >   */
> >  void inode_wait_for_writeback(struct inode *inode)
> >  {
> > +	BUG_ON(!(inode->i_state & I_FREEING));
> > +
> >  	spin_lock(&inode->i_lock);
> >  	__inode_wait_for_writeback(inode);
> >  	spin_unlock(&inode->i_lock);
> > +
> > +	/*
> > +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> > +	 * the inode and then deref bdev->bd_disk, which at this point has been
> > +	 * set to NULL, so we would panic.  At the point we are dropping our
> > +	 * bd_inode we won't have any pages under writeback on the device so
> > +	 * this is safe.  But just in case we'll assert to make sure we don't
> > +	 * screw this up.
> > +	 */
> > +	if (!sb_is_blkdev_sb(inode->i_sb))
> > +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> > +	BUG_ON(!list_empty(&inode->i_wb_list));
> >  }
> >  
> >  /*
> > @@ -1372,7 +1415,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> >   */
> >  static void wait_sb_inodes(struct super_block *sb)
> >  {
> > -	struct inode *inode, *old_inode = NULL;
> > +	struct backing_dev_info *bdi = sb->s_bdi;
> > +	LIST_HEAD(sync_list);
> > +	struct inode *iput_inode = NULL;
> >  
> >  	/*
> >  	 * We need to be protected against the filesystem going from
> > @@ -1380,48 +1425,95 @@ static void wait_sb_inodes(struct super_block *sb)
> >  	 */
> >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> >  
> > -	mutex_lock(&sb->s_sync_lock);
> > -	spin_lock(&sb->s_inode_list_lock);
> > -
> >  	/*
> > -	 * Data integrity sync. Must wait for all pages under writeback,
> > -	 * because there may have been pages dirtied before our sync
> > -	 * call, but which had writeout started before we write it out.
> > -	 * In which case, the inode may not be on the dirty list, but
> > -	 * we still have to wait for that writeout.
> > +	 * Data integrity sync. Must wait for all pages under writeback, because
> > +	 * there may have been pages dirtied before our sync call, but which had
> > +	 * writeout started before we write it out.  In which case, the inode
> > +	 * may not be on the dirty list, but we still have to wait for that
> > +	 * writeout.
> > +	 *
> > +	 * To avoid syncing inodes put under IO after we have started here,
> > +	 * splice the io list to a temporary list head and walk that. Newly
> > +	 * dirtied inodes will go onto the primary list so we won't wait for
> > +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> > +	 * so we'll complete processing the complete list before the next
> > +	 * sync operation repeats the splice-and-walk process.
> > +	 *
> > +	 * Inodes that have pages under writeback after we've finished the wait
> > +	 * may or may not be on the primary list. They had pages under put IO
> > +	 * after we started our wait, so we need to make sure the next sync or
> > +	 * IO completion treats them correctly, Move them back to the primary
> > +	 * list and restart the walk.
> > +	 *
> > +	 * Inodes that are clean after we have waited for them don't belong on
> > +	 * any list, so simply remove them from the sync list and move onwards.
> >  	 */
> > -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > +	mutex_lock(&sb->s_sync_lock);
> > +	spin_lock(&bdi->wb.list_lock);
> > +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> > +
> > +	while (!list_empty(&sync_list)) {
> > +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> > +						       i_wb_list);
> >  		struct address_space *mapping = inode->i_mapping;
> >  
> > +		/*
> > +		 * We are lazy on IO completion and don't remove inodes from the
> > +		 * list when they are clean. Detect that immediately and skip
> > +		 * inodes we don't ahve to wait on.
> > +		 */
> > +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > +			list_del_init(&inode->i_wb_list);
> > +			cond_resched_lock(&bdi->wb.list_lock);
> > +			continue;
> > +		}
> > +
> >  		spin_lock(&inode->i_lock);
> > -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> > -		    (mapping->nrpages == 0)) {
> > +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> >  			spin_unlock(&inode->i_lock);
> > +			cond_resched_lock(&bdi->wb.list_lock);
> >  			continue;
> >  		}
> >  		__iget(inode);
> >  		spin_unlock(&inode->i_lock);
> > -		spin_unlock(&sb->s_inode_list_lock);
> > +		spin_unlock(&bdi->wb.list_lock);
> >  
> > -		/*
> > -		 * We hold a reference to 'inode' so it couldn't have been
> > -		 * removed from s_inodes list while we dropped the
> > -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> > -		 * be holding the last reference and we cannot iput it under
> > -		 * s_inode_list_lock. So we keep the reference and iput it
> > -		 * later.
> > -		 */
> > -		iput(old_inode);
> > -		old_inode = inode;
> > +		if (iput_inode)
> > +			iput(iput_inode);
> >  
> >  		filemap_fdatawait(mapping);
> > -
> >  		cond_resched();
> >  
> > -		spin_lock(&sb->s_inode_list_lock);
> > +		/*
> > +		 * the inode has been written back now, so check whether we
> > +		 * still have pages under IO and move it back to the primary
> > +		 * list if necessary.  We really need the mapping->tree_lock
> > +		 * here because bdi_mark_inode_writeback may have not done
> > +		 * anything because we were on the spliced list and we need to
> > +		 * check TAG_WRITEBACK.
> > +		 */
> > +		spin_lock_irq(&mapping->tree_lock);
> > +		spin_lock(&bdi->wb.list_lock);
> > +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > +			WARN_ON(list_empty(&inode->i_wb_list));
> > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> > +                } else
> > +			list_del_init(&inode->i_wb_list);
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +
> > +		/*
> > +		 * can't iput inode while holding the wb.list_lock. Save it for
> > +		 * the next time through the loop when we drop all our spin
> > +		 * locks.
> > +		 */
> > +		iput_inode = inode;
> >  	}
> > -	spin_unlock(&sb->s_inode_list_lock);
> > -	iput(old_inode);
> > +	spin_unlock(&bdi->wb.list_lock);
> > +
> > +	if (iput_inode)
> > +		iput(iput_inode);
> > +
> >  	mutex_unlock(&sb->s_sync_lock);
> >  }
> >  
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 770d684..8f00557 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
> >  	INIT_HLIST_NODE(&inode->i_hash);
> >  	INIT_LIST_HEAD(&inode->i_devices);
> >  	INIT_LIST_HEAD(&inode->i_io_list);
> > +	INIT_LIST_HEAD(&inode->i_wb_list);
> >  	INIT_LIST_HEAD(&inode->i_lru);
> >  	address_space_init_once(&inode->i_data);
> >  	i_size_ordered_init(inode);
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index d87d8ec..0d5bca2 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -56,6 +56,7 @@ struct bdi_writeback {
> >  	struct list_head b_io;		/* parked for writeback */
> >  	struct list_head b_more_io;	/* parked for more writeback */
> >  	struct list_head b_dirty_time;	/* time stamps are dirty */
> > +	struct list_head b_writeback;	/* inodes under writeback */
> >  	spinlock_t list_lock;		/* protects the b_* lists */
> >  };
> >  
> > @@ -123,6 +124,10 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
> >  void bdi_writeback_workfn(struct work_struct *work);
> >  int bdi_has_dirty_io(struct backing_dev_info *bdi);
> >  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> > +			      struct inode *inode);
> > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > +			       struct inode *inode);
> >  
> >  extern spinlock_t bdi_lock;
> >  extern struct list_head bdi_list;
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 6dd2ab2..d3baa08 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -636,6 +636,7 @@ struct inode {
> >  	struct list_head	i_io_list;	/* backing dev IO list */
> >  	struct list_head	i_lru;		/* inode LRU list */
> >  	struct list_head	i_sb_list;
> > +	struct list_head	i_wb_list;	/* backing dev writeback list */
> >  	union {
> >  		struct hlist_head	i_dentry;
> >  		struct rcu_head		i_rcu;
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index e9ed047..ea39301 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -369,6 +369,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> >  	INIT_LIST_HEAD(&wb->b_io);
> >  	INIT_LIST_HEAD(&wb->b_more_io);
> >  	INIT_LIST_HEAD(&wb->b_dirty_time);
> > +	INIT_LIST_HEAD(&wb->b_writeback);
> >  	spin_lock_init(&wb->list_lock);
> >  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
> >  }
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index eb59f7e..f3751d1 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> >  		spin_lock_irqsave(&mapping->tree_lock, flags);
> >  		ret = TestSetPageWriteback(page);
> >  		if (!ret) {
> > +			bool on_wblist;
> > +
> > +			on_wblist = mapping_tagged(mapping,
> > +						   PAGECACHE_TAG_WRITEBACK);
> > +
> >  			radix_tree_tag_set(&mapping->page_tree,
> >  						page_index(page),
> >  						PAGECACHE_TAG_WRITEBACK);
> >  			if (bdi_cap_account_writeback(bdi))
> >  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> > +
> > +			/*
> > +			 * we can come through here when swapping anonymous
> > +			 * pages, so we don't necessarily have an inode to
> > +			 * track for sync. Inodes are remove lazily, so there is
> > +			 * no equivalent in test_clear_page_writeback().
> > +			 */
> > +			if (!on_wblist && mapping->host)
> > +				bdi_mark_inode_writeback(bdi, mapping->host);
> >  		}
> >  		if (!PageDirty(page))
> >  			radix_tree_tag_clear(&mapping->page_tree,
> > -- 
> > 2.1.0
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Dec. 11, 2015, 2:37 p.m. UTC | #3
On Thu, Dec 10, 2015 at 11:08:20AM +0100, Jan Kara wrote:
> On Wed 09-12-15 13:40:30, Brian Foster wrote:
> > On Tue, Jun 23, 2015 at 08:24:00PM -0700, Josef Bacik wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > wait_sb_inodes() current does a walk of all inodes in the filesystem
> > > to find dirty one to wait on during sync. This is highly
> > > inefficient and wastes a lot of CPU when there are lots of clean
> > > cached inodes that we don't need to wait on.
> > > 
> > > To avoid this "all inode" walk, we need to track inodes that are
> > > currently under writeback that we need to wait for. We do this by
> > > adding inodes to a writeback list on the bdi when the mapping is
> > > first tagged as having pages under writeback.  wait_sb_inodes() can
> > > then walk this list of "inodes under IO" and wait specifically just
> > > for the inodes that the current sync(2) needs to wait for.
> > > 
> > > To avoid needing all the realted locking to be safe against
> > > interrupts, Jan Kara suggested that we be lazy about removal from
> > > the writeback list. That is, we don't remove inodes from the
> > > writeback list on IO completion, but do it directly during a
> > > wait_sb_inodes() walk.
> > > 
> > > This means that the a rare sync(2) call will have some work to do
> > > skipping clean inodes However, in the current problem case of
> > > concurrent sync workloads, concurrent wait_sb_inodes() calls only
> > > walk the very recently dispatched inodes and hence should have very
> > > little work to do.
> > > 
> > > This also means that we have to remove the inodes from the writeback
> > > list during eviction. Do this in inode_wait_for_writeback() once
> > > all writeback on the inode is complete.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > > Tested-by: Dave Chinner <dchinner@redhat.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > > ---
> 
>   Hi,
> 
> > My understanding is this (and the subsequent) patch were never merged
> > due to conflicts with the cgroup aware writeback stuff that had been
> > merged.
> 
> Yes. Thanks for bringing this up. At that point I thought Josef will take
> care of rebasing the patches and it fell out of my radar.
> 
> > I think the fundamental approach of this patch still solves the
> > original problem, the writeback inodes are just always tracked on the
> > root bdi_writeback rather than the per-cgroup bdi_writeback the inode
> > was associated with before writeback.
> 
> I was thinking about this and realized that the original patch actually has
> a flaw. In case there are several partitions on a bdi and sync_fs() would
> be called for two filesystems on the same bdi, then one of those sync_fs()
> calls would splice the whole b_writeback list into its private list in
> wait_sb_inodes() and the other sync_fs() call would return too early because
> it would find b_writeback list empty.
> 

Hmm, yeah that does sound like a problem...

> Given that we need the list of inodes under writeback only for sync(2) /
> sync_fs(2) purposes it would work the best to track those inodes in the
> per-superblock list. We would miss block device inode this way but that one
> is waited-on in a special way anyway (in __sync_blockdev() /
> fdatawait_one_bdev()). Also that way any strange interaction and confusion
> with cgroup writeback is avoided.
> 

Agree, that sounds like a more clean approach on first thought. We kill
the more wasteful per-bdi_writeback list since it would only be used by
the root structure embedded in backing_dev_info and isolates the code to
dealing with sync. I'll have a closer look at this.

> > Any thoughts on that are appreciated. Otherwise, I'm trying to rebase
> > this and I've hit a lockdep splat [1] that suggests the locking for the
> > writeback list needs to be tweaked one way or another. Note that lockdep
> > fires with this original version as well, so I don't think this is a new
> > issue.
> 
> Yeah, it looks that the issue has been there forever. When we have per-sb
> list then we need a different lock to protect the list anyway. So having a
> dedicated lock for the list which would be irqsave would be the easiest
> solution. But note that in that case the locking of i_lock in
> wait_inodes_sb() has to be dealt with as well since making i_lock irqsafe
> is not an option.
> 

Ok...

> I think we could avoid having the dedicated lock irqsafe if we moved
> bdi_mark_inode_writeback() call from under mapping->tree_lock (to happen
> later in __test_set_page_writeback()). However that needs some careful
> thinking about how the interactions of __test_set_page_writeback() with the
> list handling in wait_inodes_sb() work out.
> 

That thought crossed my mind briefly as well. It looked like it could be
racy once we drop mapping->tree_lock, but actually I wonder if the lazy
list removal technique helps us get away with that (e.g., it's expected
that the writeback list will be populated with non-writeback inodes for
some time as it is).

Anyways, I can take a closer look at this as well. It could be a couple
weeks or so before I get back to this, but I'll try to incorporate the
above once I do. Thanks for the feedback! :)

Brian

> 								Honza
> 
> > [1] lockdep splat:
> > 
> > [ INFO: possible irq lock inversion dependency detected ]
> > 4.4.0-rc4+ #54 Not tainted
> > ---------------------------------------------------------
> > swapper/0/0 just changed the state of lock:
> > (&(&mapping->tree_lock)->rlock){-.....}, at: [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> > but this lock took another, HARDIRQ-unsafe lock in the past:
> > (&(&wb->list_lock)->rlock){+.+...}#012#012and interrupts could create inverse lock ordering between them.
> > #012other info that might help us debug this:
> > Possible interrupt unsafe locking scenario:
> >       CPU0                    CPU1
> >       ----                    ----
> >  lock(&(&wb->list_lock)->rlock);
> >                               local_irq_disable();
> >                               lock(&(&mapping->tree_lock)->rlock);
> >                               lock(&(&wb->list_lock)->rlock);
> >  <Interrupt>
> >    lock(&(&mapping->tree_lock)->rlock);
> > #012 *** DEADLOCK ***
> > no locks held by swapper/0/0.
> > #012the shortest dependencies between 2nd lock and 1st lock:
> > -> (&(&wb->list_lock)->rlock){+.+...} ops: 194 {
> >    HARDIRQ-ON-W at:
> >                      [<ffffffff810f9959>] __lock_acquire+0x819/0x1a50
> >                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
> >                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
> >                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
> >                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
> >                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
> >                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> >    SOFTIRQ-ON-W at:
> >                      [<ffffffff810f998a>] __lock_acquire+0x84a/0x1a50
> >                      [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                      [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >                      [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
> >                      [<ffffffff8125a859>] generic_update_time+0x79/0xd0
> >                      [<ffffffff8125c638>] touch_atime+0xa8/0xd0
> >                      [<ffffffff812519c3>] iterate_dir+0xe3/0x130
> >                      [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
> >                      [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> >    INITIAL USE at:
> >                     [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
> >                     [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                     [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >                     [<ffffffff8126b7a9>] __mark_inode_dirty+0x3d9/0x5e0
> >                     [<ffffffff8125a859>] generic_update_time+0x79/0xd0
> >                     [<ffffffff8125c638>] touch_atime+0xa8/0xd0
> >                     [<ffffffff812519c3>] iterate_dir+0xe3/0x130
> >                     [<ffffffff81251ecd>] SyS_getdents+0x9d/0x130
> >                     [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> >  }
> >  ... key      at: [<ffffffff82c7faa0>] __key.37388+0x0/0x8
> >  ... acquired at:
> >   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >   [<ffffffff817e3df8>] _raw_spin_lock+0x38/0x50
> >   [<ffffffff812703a7>] bdi_mark_inode_writeback+0x67/0xd0
> >   [<ffffffff811d19b2>] __test_set_page_writeback+0x112/0x200
> >   [<ffffffff8127bcdc>] __block_write_full_page.constprop.44+0xec/0x400
> >   [<ffffffff8127c0ee>] block_write_full_page+0xfe/0x190
> >   [<ffffffff8127cec8>] blkdev_writepage+0x18/0x20
> >   [<ffffffff811cede6>] __writepage+0x16/0x40
> >   [<ffffffff811cfc65>] write_cache_pages+0x225/0x5f0
> >   [<ffffffff811d0084>] generic_writepages+0x54/0x80
> >   [<ffffffff811d24e1>] do_writepages+0x21/0x30
> >   [<ffffffff811c4b40>] __filemap_fdatawrite_range+0x80/0xb0
> >   [<ffffffff811c4c7d>] filemap_write_and_wait_range+0x2d/0x70
> >   [<ffffffff8127ca8b>] blkdev_fsync+0x1b/0x50
> >   [<ffffffff81274a9b>] vfs_fsync_range+0x4b/0xb0
> >   [<ffffffff81274b5d>] do_fsync+0x3d/0x70
> >   [<ffffffff81274e10>] SyS_fsync+0x10/0x20
> >   [<ffffffff817e4a72>] entry_SYSCALL_64_fastpath+0x12/0x76
> > 
> > -> (&(&mapping->tree_lock)->rlock){-.....} ops: 16821 {
> >   IN-HARDIRQ-W at:
> >                    [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> >                    [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                    [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> >                    [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> >                    [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> >                    [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> >                    [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> >                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >                    [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> >                    [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> >                    [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >                    [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> >                    [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> >                    [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> >                    [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> >                    [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> >                    [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> >                    [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> >                    [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> >                    [<ffffffff81026703>] default_idle+0x23/0x150
> >                    [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> >                    [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> >                    [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> >                    [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> >                    [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> >                    [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> >                    [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> >   INITIAL USE at:
> >                   [<ffffffff810f9538>] __lock_acquire+0x3f8/0x1a50
> >                   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >                   [<ffffffff817e3fa4>] _raw_spin_lock_irq+0x44/0x60
> >                   [<ffffffff811c303c>] __add_to_page_cache_locked+0xbc/0x340
> >                   [<ffffffff811c3317>] add_to_page_cache_lru+0x37/0x90
> >                   [<ffffffff811c3dc7>] pagecache_get_page+0xb7/0x200
> >                   [<ffffffff811c4139>] grab_cache_page_write_begin+0x29/0x40
> >                   [<ffffffff81269848>] simple_write_begin+0x28/0x1c0
> >                   [<ffffffff811c2a01>] generic_perform_write+0xd1/0x1e0
> >                   [<ffffffff811c55f2>] __generic_file_write_iter+0x1a2/0x1e0
> >                   [<ffffffff811c5718>] generic_file_write_iter+0xe8/0x1e0
> >                   [<ffffffff8123c409>] __vfs_write+0xc9/0x110
> >                   [<ffffffff8123ca99>] vfs_write+0xa9/0x1a0
> >                   [<ffffffff8123d778>] SyS_write+0x58/0xd0
> >                   [<ffffffff81d6c3b1>] xwrite+0x29/0x5c
> >                   [<ffffffff81d6c40e>] do_copy+0x2a/0xb8
> >                   [<ffffffff81d6c151>] write_buffer+0x23/0x34
> >                   [<ffffffff81d6c9e6>] unpack_to_rootfs+0xfd/0x294
> >                   [<ffffffff81d6cbd9>] populate_rootfs+0x5c/0x108
> >                   [<ffffffff81002123>] do_one_initcall+0xb3/0x200
> >                   [<ffffffff81d6b22e>] kernel_init_freeable+0x1ee/0x28d
> >                   [<ffffffff817d6abe>] kernel_init+0xe/0xe0
> >                   [<ffffffff817e4ddf>] ret_from_fork+0x3f/0x70
> > }
> > ... key      at: [<ffffffff82ca2760>] __key.39560+0x0/0x8
> > ... acquired at:
> >   [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
> >   [<ffffffff810f88c3>] mark_lock+0x333/0x610
> >   [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> >   [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> >   [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> >   [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> >   [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> >   [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> >   [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> >   [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >   [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> >   [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> >   [<ffffffff813a329f>] bio_endio+0x3f/0x60
> >   [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> >   [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> >   [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> >   [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> >   [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> >   [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> >   [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> >   [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> >   [<ffffffff81026703>] default_idle+0x23/0x150
> >   [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> >   [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> >   [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> >   [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> >   [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> >   [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> >   [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> > 
> > #012stack backtrace:
> > CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-rc4+ #54
> > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > 0000000000000000 06776aa479e380f6 ffff88011fc03ad0 ffffffff813dd019
> > ffffffff82a0f120 ffff88011fc03b10 ffffffff811c184d ffffffff81c10be8
> > ffffffff81c10be8 ffffffff81c10500 ffffffff81a56510 0000000000000000
> > Call Trace:
> > <IRQ>  [<ffffffff813dd019>] dump_stack+0x4b/0x72
> > [<ffffffff811c184d>] print_irq_inversion_bug.part.38+0x1a4/0x1b0
> > [<ffffffff810f7c1b>] check_usage_forwards+0x15b/0x160
> > [<ffffffff810f88c3>] mark_lock+0x333/0x610
> > [<ffffffff810f7ac0>] ? check_usage_backwards+0x160/0x160
> > [<ffffffff810f9ad9>] __lock_acquire+0x999/0x1a50
> > [<ffffffff810fb54e>] lock_acquire+0xce/0x1c0
> > [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> > [<ffffffff817e4810>] _raw_spin_lock_irqsave+0x50/0x70
> > [<ffffffff811d25ed>] ? test_clear_page_writeback+0x5d/0x210
> > [<ffffffff811d25ed>] test_clear_page_writeback+0x5d/0x210
> > [<ffffffff811c2b2f>] end_page_writeback+0x1f/0xa0
> > [<ffffffff81278df3>] end_buffer_async_write+0xd3/0x1a0
> > [<ffffffff81277828>] end_bio_bh_io_sync+0x28/0x40
> > [<ffffffff813a329f>] bio_endio+0x3f/0x60
> > [<ffffffff8162d4d9>] dec_pending+0x149/0x310
> > [<ffffffff811c6809>] ? mempool_free+0x29/0x80
> > [<ffffffff8162e1d6>] clone_endio+0x76/0xe0
> > [<ffffffff813a329f>] bio_endio+0x3f/0x60
> > [<ffffffff813ab512>] blk_update_request+0xb2/0x3a0
> > [<ffffffff813b4e90>] ? blkdev_issue_zeroout+0xf0/0xf0
> > [<ffffffff813b63ca>] blk_mq_end_request+0x1a/0x70
> > [<ffffffffa009e82f>] virtblk_request_done+0x3f/0x70 [virtio_blk]
> > [<ffffffff813b4ea3>] __blk_mq_complete_request_remote+0x13/0x20
> > [<ffffffff81137d9f>] flush_smp_call_function_queue+0x5f/0x160
> > [<ffffffff811388d3>] generic_smp_call_function_single_interrupt+0x13/0x60
> > [<ffffffff810530a7>] smp_call_function_single_interrupt+0x27/0x40
> > [<ffffffff817e604c>] call_function_single_interrupt+0x8c/0xa0
> > <EOI>  [<ffffffff810653a6>] ? native_safe_halt+0x6/0x10
> > [<ffffffff81026703>] default_idle+0x23/0x150
> > [<ffffffff8102702f>] arch_cpu_idle+0xf/0x20
> > [<ffffffff810efd8a>] default_idle_call+0x2a/0x40
> > [<ffffffff810f0184>] cpu_startup_entry+0x384/0x3f0
> > [<ffffffff817d6aaa>] rest_init+0x13a/0x140
> > [<ffffffff81d6b01f>] start_kernel+0x49e/0x4bf
> > [<ffffffff81d6a120>] ? early_idt_handler_array+0x120/0x120
> > [<ffffffff81d6a339>] x86_64_start_reservations+0x2a/0x2c
> > [<ffffffff81d6a480>] x86_64_start_kernel+0x145/0x168
> > 
> > >  fs/block_dev.c              |   2 +
> > >  fs/fs-writeback.c           | 148 +++++++++++++++++++++++++++++++++++---------
> > >  fs/inode.c                  |   1 +
> > >  include/linux/backing-dev.h |   5 ++
> > >  include/linux/fs.h          |   1 +
> > >  mm/backing-dev.c            |   1 +
> > >  mm/page-writeback.c         |  14 +++++
> > >  7 files changed, 144 insertions(+), 28 deletions(-)
> > > 
> > > diff --git a/fs/block_dev.c b/fs/block_dev.c
> > > index f2a89be..2726ff1 100644
> > > --- a/fs/block_dev.c
> > > +++ b/fs/block_dev.c
> > > @@ -64,6 +64,8 @@ void kill_bdev(struct block_device *bdev)
> > >  {
> > >  	struct address_space *mapping = bdev->bd_inode->i_mapping;
> > >  
> > > +	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
> > > +				  bdev->bd_inode);
> > >  	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
> > >  		return;
> > >  
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index aa72536..5c005ad 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -210,6 +210,34 @@ void inode_wb_list_del(struct inode *inode)
> > >  }
> > >  
> > >  /*
> > > + * mark an inode as under writeback on the given bdi
> > > + */
> > > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
> > > +{
> > > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > > +	if (list_empty(&inode->i_wb_list)) {
> > > +		spin_lock(&bdi->wb.list_lock);
> > > +		if (list_empty(&inode->i_wb_list))
> > > +			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
> > > +		spin_unlock(&bdi->wb.list_lock);
> > > +	}
> > > +}
> > > +
> > > +/*
> > > + * clear an inode as under writeback on the given bdi
> > > + */
> > > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > > +			       struct inode *inode)
> > > +{
> > > +	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
> > > +	if (!list_empty(&inode->i_wb_list)) {
> > > +		spin_lock(&bdi->wb.list_lock);
> > > +		list_del_init(&inode->i_wb_list);
> > > +		spin_unlock(&bdi->wb.list_lock);
> > > +	}
> > > +}
> > > +
> > > +/*
> > >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> > >   * furthest end of its superblock's dirty-inode list.
> > >   *
> > > @@ -383,13 +411,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
> > >  }
> > >  
> > >  /*
> > > - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> > > + * Wait for writeback on an inode to complete during eviction. Caller must have
> > > + * inode pinned.
> > >   */
> > >  void inode_wait_for_writeback(struct inode *inode)
> > >  {
> > > +	BUG_ON(!(inode->i_state & I_FREEING));
> > > +
> > >  	spin_lock(&inode->i_lock);
> > >  	__inode_wait_for_writeback(inode);
> > >  	spin_unlock(&inode->i_lock);
> > > +
> > > +	/*
> > > +	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> > > +	 * the inode and then deref bdev->bd_disk, which at this point has been
> > > +	 * set to NULL, so we would panic.  At the point we are dropping our
> > > +	 * bd_inode we won't have any pages under writeback on the device so
> > > +	 * this is safe.  But just in case we'll assert to make sure we don't
> > > +	 * screw this up.
> > > +	 */
> > > +	if (!sb_is_blkdev_sb(inode->i_sb))
> > > +		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
> > > +	BUG_ON(!list_empty(&inode->i_wb_list));
> > >  }
> > >  
> > >  /*
> > > @@ -1372,7 +1415,9 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> > >   */
> > >  static void wait_sb_inodes(struct super_block *sb)
> > >  {
> > > -	struct inode *inode, *old_inode = NULL;
> > > +	struct backing_dev_info *bdi = sb->s_bdi;
> > > +	LIST_HEAD(sync_list);
> > > +	struct inode *iput_inode = NULL;
> > >  
> > >  	/*
> > >  	 * We need to be protected against the filesystem going from
> > > @@ -1380,48 +1425,95 @@ static void wait_sb_inodes(struct super_block *sb)
> > >  	 */
> > >  	WARN_ON(!rwsem_is_locked(&sb->s_umount));
> > >  
> > > -	mutex_lock(&sb->s_sync_lock);
> > > -	spin_lock(&sb->s_inode_list_lock);
> > > -
> > >  	/*
> > > -	 * Data integrity sync. Must wait for all pages under writeback,
> > > -	 * because there may have been pages dirtied before our sync
> > > -	 * call, but which had writeout started before we write it out.
> > > -	 * In which case, the inode may not be on the dirty list, but
> > > -	 * we still have to wait for that writeout.
> > > +	 * Data integrity sync. Must wait for all pages under writeback, because
> > > +	 * there may have been pages dirtied before our sync call, but which had
> > > +	 * writeout started before we write it out.  In which case, the inode
> > > +	 * may not be on the dirty list, but we still have to wait for that
> > > +	 * writeout.
> > > +	 *
> > > +	 * To avoid syncing inodes put under IO after we have started here,
> > > +	 * splice the io list to a temporary list head and walk that. Newly
> > > +	 * dirtied inodes will go onto the primary list so we won't wait for
> > > +	 * them. This is safe to do as we are serialised by the s_sync_lock,
> > > +	 * so we'll complete processing the complete list before the next
> > > +	 * sync operation repeats the splice-and-walk process.
> > > +	 *
> > > +	 * Inodes that have pages under writeback after we've finished the wait
> > > +	 * may or may not be on the primary list. They had pages under put IO
> > > +	 * after we started our wait, so we need to make sure the next sync or
> > > +	 * IO completion treats them correctly, Move them back to the primary
> > > +	 * list and restart the walk.
> > > +	 *
> > > +	 * Inodes that are clean after we have waited for them don't belong on
> > > +	 * any list, so simply remove them from the sync list and move onwards.
> > >  	 */
> > > -	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> > > +	mutex_lock(&sb->s_sync_lock);
> > > +	spin_lock(&bdi->wb.list_lock);
> > > +	list_splice_init(&bdi->wb.b_writeback, &sync_list);
> > > +
> > > +	while (!list_empty(&sync_list)) {
> > > +		struct inode *inode = list_first_entry(&sync_list, struct inode,
> > > +						       i_wb_list);
> > >  		struct address_space *mapping = inode->i_mapping;
> > >  
> > > +		/*
> > > +		 * We are lazy on IO completion and don't remove inodes from the
> > > +		 * list when they are clean. Detect that immediately and skip
> > > +		 * inodes we don't ahve to wait on.
> > > +		 */
> > > +		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > > +			list_del_init(&inode->i_wb_list);
> > > +			cond_resched_lock(&bdi->wb.list_lock);
> > > +			continue;
> > > +		}
> > > +
> > >  		spin_lock(&inode->i_lock);
> > > -		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> > > -		    (mapping->nrpages == 0)) {
> > > +		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> > > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> > >  			spin_unlock(&inode->i_lock);
> > > +			cond_resched_lock(&bdi->wb.list_lock);
> > >  			continue;
> > >  		}
> > >  		__iget(inode);
> > >  		spin_unlock(&inode->i_lock);
> > > -		spin_unlock(&sb->s_inode_list_lock);
> > > +		spin_unlock(&bdi->wb.list_lock);
> > >  
> > > -		/*
> > > -		 * We hold a reference to 'inode' so it couldn't have been
> > > -		 * removed from s_inodes list while we dropped the
> > > -		 * s_inode_list_lock.  We cannot iput the inode now as we can
> > > -		 * be holding the last reference and we cannot iput it under
> > > -		 * s_inode_list_lock. So we keep the reference and iput it
> > > -		 * later.
> > > -		 */
> > > -		iput(old_inode);
> > > -		old_inode = inode;
> > > +		if (iput_inode)
> > > +			iput(iput_inode);
> > >  
> > >  		filemap_fdatawait(mapping);
> > > -
> > >  		cond_resched();
> > >  
> > > -		spin_lock(&sb->s_inode_list_lock);
> > > +		/*
> > > +		 * the inode has been written back now, so check whether we
> > > +		 * still have pages under IO and move it back to the primary
> > > +		 * list if necessary.  We really need the mapping->tree_lock
> > > +		 * here because bdi_mark_inode_writeback may have not done
> > > +		 * anything because we were on the spliced list and we need to
> > > +		 * check TAG_WRITEBACK.
> > > +		 */
> > > +		spin_lock_irq(&mapping->tree_lock);
> > > +		spin_lock(&bdi->wb.list_lock);
> > > +		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> > > +			WARN_ON(list_empty(&inode->i_wb_list));
> > > +			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
> > > +                } else
> > > +			list_del_init(&inode->i_wb_list);
> > > +		spin_unlock_irq(&mapping->tree_lock);
> > > +
> > > +		/*
> > > +		 * can't iput inode while holding the wb.list_lock. Save it for
> > > +		 * the next time through the loop when we drop all our spin
> > > +		 * locks.
> > > +		 */
> > > +		iput_inode = inode;
> > >  	}
> > > -	spin_unlock(&sb->s_inode_list_lock);
> > > -	iput(old_inode);
> > > +	spin_unlock(&bdi->wb.list_lock);
> > > +
> > > +	if (iput_inode)
> > > +		iput(iput_inode);
> > > +
> > >  	mutex_unlock(&sb->s_sync_lock);
> > >  }
> > >  
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 770d684..8f00557 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -357,6 +357,7 @@ void inode_init_once(struct inode *inode)
> > >  	INIT_HLIST_NODE(&inode->i_hash);
> > >  	INIT_LIST_HEAD(&inode->i_devices);
> > >  	INIT_LIST_HEAD(&inode->i_io_list);
> > > +	INIT_LIST_HEAD(&inode->i_wb_list);
> > >  	INIT_LIST_HEAD(&inode->i_lru);
> > >  	address_space_init_once(&inode->i_data);
> > >  	i_size_ordered_init(inode);
> > > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > > index d87d8ec..0d5bca2 100644
> > > --- a/include/linux/backing-dev.h
> > > +++ b/include/linux/backing-dev.h
> > > @@ -56,6 +56,7 @@ struct bdi_writeback {
> > >  	struct list_head b_io;		/* parked for writeback */
> > >  	struct list_head b_more_io;	/* parked for more writeback */
> > >  	struct list_head b_dirty_time;	/* time stamps are dirty */
> > > +	struct list_head b_writeback;	/* inodes under writeback */
> > >  	spinlock_t list_lock;		/* protects the b_* lists */
> > >  };
> > >  
> > > @@ -123,6 +124,10 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi);
> > >  void bdi_writeback_workfn(struct work_struct *work);
> > >  int bdi_has_dirty_io(struct backing_dev_info *bdi);
> > >  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> > > +void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
> > > +			      struct inode *inode);
> > > +void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
> > > +			       struct inode *inode);
> > >  
> > >  extern spinlock_t bdi_lock;
> > >  extern struct list_head bdi_list;
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 6dd2ab2..d3baa08 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -636,6 +636,7 @@ struct inode {
> > >  	struct list_head	i_io_list;	/* backing dev IO list */
> > >  	struct list_head	i_lru;		/* inode LRU list */
> > >  	struct list_head	i_sb_list;
> > > +	struct list_head	i_wb_list;	/* backing dev writeback list */
> > >  	union {
> > >  		struct hlist_head	i_dentry;
> > >  		struct rcu_head		i_rcu;
> > > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > > index e9ed047..ea39301 100644
> > > --- a/mm/backing-dev.c
> > > +++ b/mm/backing-dev.c
> > > @@ -369,6 +369,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> > >  	INIT_LIST_HEAD(&wb->b_io);
> > >  	INIT_LIST_HEAD(&wb->b_more_io);
> > >  	INIT_LIST_HEAD(&wb->b_dirty_time);
> > > +	INIT_LIST_HEAD(&wb->b_writeback);
> > >  	spin_lock_init(&wb->list_lock);
> > >  	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
> > >  }
> > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > > index eb59f7e..f3751d1 100644
> > > --- a/mm/page-writeback.c
> > > +++ b/mm/page-writeback.c
> > > @@ -2382,11 +2382,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> > >  		spin_lock_irqsave(&mapping->tree_lock, flags);
> > >  		ret = TestSetPageWriteback(page);
> > >  		if (!ret) {
> > > +			bool on_wblist;
> > > +
> > > +			on_wblist = mapping_tagged(mapping,
> > > +						   PAGECACHE_TAG_WRITEBACK);
> > > +
> > >  			radix_tree_tag_set(&mapping->page_tree,
> > >  						page_index(page),
> > >  						PAGECACHE_TAG_WRITEBACK);
> > >  			if (bdi_cap_account_writeback(bdi))
> > >  				__inc_bdi_stat(bdi, BDI_WRITEBACK);
> > > +
> > > +			/*
> > > +			 * we can come through here when swapping anonymous
> > > +			 * pages, so we don't necessarily have an inode to
> > > +			 * track for sync. Inodes are remove lazily, so there is
> > > +			 * no equivalent in test_clear_page_writeback().
> > > +			 */
> > > +			if (!on_wblist && mapping->host)
> > > +				bdi_mark_inode_writeback(bdi, mapping->host);
> > >  		}
> > >  		if (!PageDirty(page))
> > >  			radix_tree_tag_clear(&mapping->page_tree,
> > > -- 
> > > 2.1.0
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/fs/block_dev.c b/fs/block_dev.c
index f2a89be..2726ff1 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -64,6 +64,8 @@  void kill_bdev(struct block_device *bdev)
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
+	bdi_clear_inode_writeback(inode_to_bdi(bdev->bd_inode),
+				  bdev->bd_inode);
 	if (mapping->nrpages == 0 && mapping->nrshadows == 0)
 		return;
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index aa72536..5c005ad 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -210,6 +210,34 @@  void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * mark an inode as under writeback on the given bdi
+ */
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi, struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		if (list_empty(&inode->i_wb_list))
+			list_add_tail(&inode->i_wb_list, &bdi->wb.b_writeback);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
+ * clear an inode as under writeback on the given bdi
+ */
+void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+			       struct inode *inode)
+{
+	WARN_ON_ONCE(bdi != inode_to_bdi(inode));
+	if (!list_empty(&inode->i_wb_list)) {
+		spin_lock(&bdi->wb.list_lock);
+		list_del_init(&inode->i_wb_list);
+		spin_unlock(&bdi->wb.list_lock);
+	}
+}
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -383,13 +411,28 @@  static void __inode_wait_for_writeback(struct inode *inode)
 }
 
 /*
- * Wait for writeback on an inode to complete. Caller must have inode pinned.
+ * Wait for writeback on an inode to complete during eviction. Caller must have
+ * inode pinned.
  */
 void inode_wait_for_writeback(struct inode *inode)
 {
+	BUG_ON(!(inode->i_state & I_FREEING));
+
 	spin_lock(&inode->i_lock);
 	__inode_wait_for_writeback(inode);
 	spin_unlock(&inode->i_lock);
+
+	/*
+	 * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
+	 * the inode and then deref bdev->bd_disk, which at this point has been
+	 * set to NULL, so we would panic.  At the point we are dropping our
+	 * bd_inode we won't have any pages under writeback on the device so
+	 * this is safe.  But just in case we'll assert to make sure we don't
+	 * screw this up.
+	 */
+	if (!sb_is_blkdev_sb(inode->i_sb))
+		bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
+	BUG_ON(!list_empty(&inode->i_wb_list));
 }
 
 /*
@@ -1372,7 +1415,9 @@  EXPORT_SYMBOL(__mark_inode_dirty);
  */
 static void wait_sb_inodes(struct super_block *sb)
 {
-	struct inode *inode, *old_inode = NULL;
+	struct backing_dev_info *bdi = sb->s_bdi;
+	LIST_HEAD(sync_list);
+	struct inode *iput_inode = NULL;
 
 	/*
 	 * We need to be protected against the filesystem going from
@@ -1380,48 +1425,95 @@  static void wait_sb_inodes(struct super_block *sb)
 	 */
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
-	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
-
 	/*
-	 * Data integrity sync. Must wait for all pages under writeback,
-	 * because there may have been pages dirtied before our sync
-	 * call, but which had writeout started before we write it out.
-	 * In which case, the inode may not be on the dirty list, but
-	 * we still have to wait for that writeout.
+	 * Data integrity sync. Must wait for all pages under writeback, because
+	 * there may have been pages dirtied before our sync call, but which had
+	 * writeout started before we write it out.  In which case, the inode
+	 * may not be on the dirty list, but we still have to wait for that
+	 * writeout.
+	 *
+	 * To avoid syncing inodes put under IO after we have started here,
+	 * splice the io list to a temporary list head and walk that. Newly
+	 * dirtied inodes will go onto the primary list so we won't wait for
+	 * them. This is safe to do as we are serialised by the s_sync_lock,
+	 * so we'll complete processing the complete list before the next
+	 * sync operation repeats the splice-and-walk process.
+	 *
+	 * Inodes that have pages under writeback after we've finished the wait
+	 * may or may not be on the primary list. They had pages under put IO
+	 * after we started our wait, so we need to make sure the next sync or
+	 * IO completion treats them correctly, Move them back to the primary
+	 * list and restart the walk.
+	 *
+	 * Inodes that are clean after we have waited for them don't belong on
+	 * any list, so simply remove them from the sync list and move onwards.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	mutex_lock(&sb->s_sync_lock);
+	spin_lock(&bdi->wb.list_lock);
+	list_splice_init(&bdi->wb.b_writeback, &sync_list);
+
+	while (!list_empty(&sync_list)) {
+		struct inode *inode = list_first_entry(&sync_list, struct inode,
+						       i_wb_list);
 		struct address_space *mapping = inode->i_mapping;
 
+		/*
+		 * We are lazy on IO completion and don't remove inodes from the
+		 * list when they are clean. Detect that immediately and skip
+		 * inodes we don't ahve to wait on.
+		 */
+		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			list_del_init(&inode->i_wb_list);
+			cond_resched_lock(&bdi->wb.list_lock);
+			continue;
+		}
+
 		spin_lock(&inode->i_lock);
-		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
-		    (mapping->nrpages == 0)) {
+		if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
 			spin_unlock(&inode->i_lock);
+			cond_resched_lock(&bdi->wb.list_lock);
 			continue;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(&bdi->wb.list_lock);
 
-		/*
-		 * We hold a reference to 'inode' so it couldn't have been
-		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
-		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
-		 */
-		iput(old_inode);
-		old_inode = inode;
+		if (iput_inode)
+			iput(iput_inode);
 
 		filemap_fdatawait(mapping);
-
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
+		/*
+		 * the inode has been written back now, so check whether we
+		 * still have pages under IO and move it back to the primary
+		 * list if necessary.  We really need the mapping->tree_lock
+		 * here because bdi_mark_inode_writeback may have not done
+		 * anything because we were on the spliced list and we need to
+		 * check TAG_WRITEBACK.
+		 */
+		spin_lock_irq(&mapping->tree_lock);
+		spin_lock(&bdi->wb.list_lock);
+		if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
+			WARN_ON(list_empty(&inode->i_wb_list));
+			list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
+                } else
+			list_del_init(&inode->i_wb_list);
+		spin_unlock_irq(&mapping->tree_lock);
+
+		/*
+		 * can't iput inode while holding the wb.list_lock. Save it for
+		 * the next time through the loop when we drop all our spin
+		 * locks.
+		 */
+		iput_inode = inode;
 	}
-	spin_unlock(&sb->s_inode_list_lock);
-	iput(old_inode);
+	spin_unlock(&bdi->wb.list_lock);
+
+	if (iput_inode)
+		iput(iput_inode);
+
 	mutex_unlock(&sb->s_sync_lock);
 }
 
diff --git a/fs/inode.c b/fs/inode.c
index 770d684..8f00557 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -357,6 +357,7 @@  void inode_init_once(struct inode *inode)
 	INIT_HLIST_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
+	INIT_LIST_HEAD(&inode->i_wb_list);
 	INIT_LIST_HEAD(&inode->i_lru);
 	address_space_init_once(&inode->i_data);
 	i_size_ordered_init(inode);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index d87d8ec..0d5bca2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -56,6 +56,7 @@  struct bdi_writeback {
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
 	struct list_head b_dirty_time;	/* time stamps are dirty */
+	struct list_head b_writeback;	/* inodes under writeback */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -123,6 +124,10 @@  void bdi_start_background_writeback(struct backing_dev_info *bdi);
 void bdi_writeback_workfn(struct work_struct *work);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void bdi_mark_inode_writeback(struct backing_dev_info *bdi,
+			      struct inode *inode);
+void bdi_clear_inode_writeback(struct backing_dev_info *bdi,
+			       struct inode *inode);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6dd2ab2..d3baa08 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -636,6 +636,7 @@  struct inode {
 	struct list_head	i_io_list;	/* backing dev IO list */
 	struct list_head	i_lru;		/* inode LRU list */
 	struct list_head	i_sb_list;
+	struct list_head	i_wb_list;	/* backing dev writeback list */
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e9ed047..ea39301 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -369,6 +369,7 @@  static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
 	INIT_LIST_HEAD(&wb->b_dirty_time);
+	INIT_LIST_HEAD(&wb->b_writeback);
 	spin_lock_init(&wb->list_lock);
 	INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index eb59f7e..f3751d1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2382,11 +2382,25 @@  int __test_set_page_writeback(struct page *page, bool keep_write)
 		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
+			bool on_wblist;
+
+			on_wblist = mapping_tagged(mapping,
+						   PAGECACHE_TAG_WRITEBACK);
+
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
 			if (bdi_cap_account_writeback(bdi))
 				__inc_bdi_stat(bdi, BDI_WRITEBACK);
+
+			/*
+			 * we can come through here when swapping anonymous
+			 * pages, so we don't necessarily have an inode to
+			 * track for sync. Inodes are remove lazily, so there is
+			 * no equivalent in test_clear_page_writeback().
+			 */
+			if (!on_wblist && mapping->host)
+				bdi_mark_inode_writeback(bdi, mapping->host);
 		}
 		if (!PageDirty(page))
 			radix_tree_tag_clear(&mapping->page_tree,