fs-writeback: drop wb->list_lock during blk_finish_plug()

Message ID 20150917224230.GF8624@ret.masoncoding.com
State New, archived

Commit Message

Chris Mason Sept. 17, 2015, 10:42 p.m. UTC
On Thu, Sep 17, 2015 at 12:39:51PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> Dave, if you're testing my current -git, the other performance issue
> >> might still be the spinlock thing.
> >
> > I have the fix as the first commit in my local tree - it'll remain
> > there until I get a conflict after an update. :)
> 
> Ok. I'm happy to report that you should get a conflict now, and that
> the spinlock code should work well for your virtualized case again.
> 
> No updates on the plugging thing yet, I'll wait a bit and follow this
> thread and see if somebody comes up with any explanations or theories
> in the hope that we might not need to revert (or at least have a more
> targeted change).

Playing around with the plug a little, most of the unplugs are coming
from the cond_resched_lock().  Not really sure why we are doing the
cond_resched() there, we should be doing it before we retake the lock
instead.

This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
files/sec up to 213K.  Average IO size is the same as 4.3-rc1.

It probably won't help Dave, since most of his unplugs should have been
from the cond_resched_lock() too.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
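
The ordering described in the commit message above (flush the plug and reschedule while the lock is dropped, then retake the lock) can be sketched in user space. This is only an analogy, not the kernel code: a pthread mutex stands in for wb->list_lock, a small array stands in for the block plug, and sched_yield() stands in for cond_resched(). All names here (toy_plug, toy_flush_plug, toy_queue_io, toy_writeback, PLUG_MAX, resched_every) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

#define PLUG_MAX 16              /* hypothetical batch size */

/* Hypothetical stand-in for a block plug: a batch of queued "IOs". */
struct toy_plug {
    int pending[PLUG_MAX];
    size_t count;                /* items currently batched */
    size_t submitted;            /* items flushed by the owner itself */
    size_t flushes;              /* number of flush calls */
};

/* Stands in for wb->list_lock in this analogy. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Submit everything batched so far, in the caller's own context. */
static void toy_flush_plug(struct toy_plug *plug)
{
    plug->submitted += plug->count;
    plug->count = 0;
    plug->flushes++;
}

/* Batch one item, flushing first if the batch is already full. */
static void toy_queue_io(struct toy_plug *plug, int item)
{
    if (plug->count == PLUG_MAX)
        toy_flush_plug(plug);
    plug->pending[plug->count++] = item;
}

/*
 * Process nr_items. Every resched_every items we pretend that
 * need_resched() became true. Returns how many times we yielded.
 */
static size_t toy_writeback(struct toy_plug *plug, int nr_items,
                            int resched_every)
{
    size_t yields = 0;

    pthread_mutex_lock(&list_lock);
    for (int i = 0; i < nr_items; i++) {
        /* The lock is dropped while the item is written. */
        pthread_mutex_unlock(&list_lock);
        toy_queue_io(plug, i);

        if (resched_every > 0 && (i + 1) % resched_every == 0) {
            /* Flush our own batch first, then give up the CPU,
             * all before the lock is retaken. */
            toy_flush_plug(plug);
            sched_yield();
            yields++;
        }

        pthread_mutex_lock(&list_lock);
        /* Requeue bookkeeping would happen here, under the lock. */
    }
    pthread_mutex_unlock(&list_lock);

    toy_flush_plug(plug);        /* final flush at the end of the run */
    assert(plug->count == 0);
    return yields;
}
```

In this toy version the thread never yields with items still batched, so there is no point at which a scheduler-driven unplug would have to hand the batch to another context. Whether that is the whole explanation for the numbers above is exactly what the replies below discuss.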

Comments

Linus Torvalds Sept. 17, 2015, 11:08 p.m. UTC | #1
On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
>
> Playing around with the plug a little, most of the unplugs are coming
> from the cond_resched_lock().  Not really sure why we are doing the
> cond_resched() there, we should be doing it before we retake the lock
> instead.
>
> This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> files/sec up to 213K.  Average IO size is the same as 4.3-rc1.

Ok, so at least for you, part of the problem really ends up being that
there's a mix of the "synchronous" unplugging (by the actual explicit
"blk_finish_plug(&plug);") and the writeback that is handed off to
kblockd_workqueue.

I'm not seeing why that should be an issue. Sure, there's some CPU
overhead to context switching, but I don't see that it should be that
big of a deal.

I wonder if there is something more serious wrong with the kblockd_workqueue.

                    Linus
Chris Mason Sept. 17, 2015, 11:56 p.m. UTC | #2
On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> >
> > Playing around with the plug a little, most of the unplugs are coming
> > from the cond_resched_lock().  Not really sure why we are doing the
> > cond_resched() there, we should be doing it before we retake the lock
> > instead.
> >
> > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > files/sec up to 213K.  Average IO size is the same as 4.3-rc1.
> 
> Ok, so at least for you, part of the problem really ends up being that
> there's a mix of the "synchronous" unplugging (by the actual explicit
> "blk_finish_plug(&plug);") and the writeback that is handed off to
> kblockd_workqueue.
> 
> I'm not seeing why that should be an issue. Sure, there's some CPU
> overhead to context switching, but I don't see that it should be that
> big of a deal.
> 
> I wonder if there is something more serious wrong with the kblockd_workqueue.

I'm driving the box pretty hard, it's right on the line between CPU
bound and IO bound.  So I've got 32 fs_mark processes banging away and
32 CPUs (16 really, with hyperthreading).

They are popping in and out of balance_dirty_pages() so I have high CPU
utilization alternating with high IO wait times.  There are no reads at all,
so all of these waits are for buffered writes.

People in balance_dirty_pages are indirectly waiting on the unplug, so
maybe the context switch overhead on a loaded box is enough to explain
it.  We've definitely gotten more than 9% by inlining small synchronous
items in btrfs in the past, but those were more explicitly synchronous.

I know it's painfully hand wavy.  I don't see any other users of the
kblockd workqueues, and the perf profiles don't jump out at me.  I'll
feel better about the patch if Dave confirms any gains.

-chris

Dave Chinner Sept. 18, 2015, 12:37 a.m. UTC | #3
On Thu, Sep 17, 2015 at 07:56:47PM -0400, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> > On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > Playing around with the plug a little, most of the unplugs are coming
> > > from the cond_resched_lock().  Not really sure why we are doing the
> > > cond_resched() there, we should be doing it before we retake the lock
> > > instead.
> > >
> > > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > > files/sec up to 213K.  Average IO size is the same as 4.3-rc1.
> > 
> > Ok, so at least for you, part of the problem really ends up being that
> > there's a mix of the "synchronous" unplugging (by the actual explicit
> > "blk_finish_plug(&plug);") and the writeback that is handed off to
> > kblockd_workqueue.
> >
> > I'm not seeing why that should be an issue. Sure, there's some CPU
> > overhead to context switching, but I don't see that it should be that
> > big of a deal.

It may well change the dispatch order of enough IOs for it to be
significant on an IO bound device.

> > I wonder if there is something more serious wrong with the kblockd_workqueue.
> 
> I'm driving the box pretty hard, it's right on the line between CPU
> bound and IO bound.  So I've got 32 fs_mark processes banging away and
> 32 CPUs (16 really, with hyperthreading).

I'm only using 8 threads right now, so I have ~6-7 idle CPUs on this
workload. Hence if it's CPU load related, I probably won't see any
change in behaviour.

> They are popping in and out of balance_dirty_pages() so I have high CPU
> utilization alternating with high IO wait times.  There are no reads at all,
> so all of these waits are for buffered writes.
> 
> People in balance_dirty_pages are indirectly waiting on the unplug, so
> maybe the context switch overhead on a loaded box is enough to explain
> it.  We've definitely gotten more than 9% by inlining small synchronous
> items in btrfs in the past, but those were more explicitly synchronous.
> 
> I know it's painfully hand wavy.  I don't see any other users of the
> kblockd workqueues, and the perf profiles don't jump out at me.  I'll
> feel better about the patch if Dave confirms any gains.

In outright performance on my test machine, the difference in
files/s is noise. However, the consistency looks to be substantially
improved and the context switch rate is now running at under
3,000/sec. Numbers, including the std deviation of the files/s
number output during the fsmark run (averaged across 3 separate
benchmark runs):

kernel              files/s    std-dev    wall time
4.3-rc1-noplug        34400     2.0e04        5m25s
4.3-rc1               56600     2.3e04        3m23s
4.3-rc1-flush         56079     1.4e04        3m14s

std-dev is well down, and the improvement in wall time is large
enough to be significant.

Looks good to me.

Cheers,

Dave.
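
As a quick check on the figures quoted in this thread, the relative changes can be worked out with a small helper. This is a minimal sketch: percent_change() and wall_seconds() are hypothetical names, and it assumes that 4.3-rc1-flush in the table above is 4.3-rc1 with this patch applied. With the numbers from the thread, 195K to 213K files/sec is roughly a 9.2% gain, 3m23s to 3m14s is roughly a 4.4% shorter wall time, and the std-dev drop from 2.3e04 to 1.4e04 is roughly 39%.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent. */
static double percent_change(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Convert a wall time given as minutes and seconds into seconds. */
static int wall_seconds(int minutes, int seconds)
{
    return minutes * 60 + seconds;
}
```

For example, percent_change(wall_seconds(3, 23), wall_seconds(3, 14)) gives the wall-time change between the 4.3-rc1 and 4.3-rc1-flush rows; a negative result means the run got shorter.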

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..05ed541 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1481,6 +1481,19 @@  static long writeback_sb_inodes(struct super_block *sb,
 		wbc_detach_inode(&wbc);
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+
+		if (need_resched()) {
+			/*
+			 * we're plugged and don't want to hand off to kblockd
+			 * for the actual unplug work.  But we do want to
+			 * reschedule.  So flush our plug and then
+			 * schedule away
+			 */
+			blk_flush_plug(current);
+			cond_resched();
+		}
+
+
 		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
@@ -1488,7 +1501,7 @@  static long writeback_sb_inodes(struct super_block *sb,
 		requeue_inode(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		cond_resched_lock(&wb->list_lock);
+
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.