Message ID | 20150917224230.GF8624@ret.masoncoding.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
>
> Playing around with the plug a little, most of the unplugs are coming
> from the cond_resched_lock().  Not really sure why we are doing the
> cond_resched() there, we should be doing it before we retake the lock
> instead.
>
> This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> files/sec up to 213K.  Average IO size is the same as 4.3-rc1.

Ok, so at least for you, part of the problem really ends up being that
there's a mix of the "synchronous" unplugging (by the actual explicit
"blk_finish_plug(&plug);") and the writeback that is handed off to
kblockd_workqueue.

I'm not seeing why that should be an issue. Sure, there's some CPU
overhead to context switching, but I don't see that it should be that
big of a deal.

I wonder if there is something more serious wrong with the kblockd_workqueue.

              Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> >
> > Playing around with the plug a little, most of the unplugs are coming
> > from the cond_resched_lock().  Not really sure why we are doing the
> > cond_resched() there, we should be doing it before we retake the lock
> > instead.
> >
> > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > files/sec up to 213K.  Average IO size is the same as 4.3-rc1.
>
> Ok, so at least for you, part of the problem really ends up being that
> there's a mix of the "synchronous" unplugging (by the actual explicit
> "blk_finish_plug(&plug);") and the writeback that is handed off to
> kblockd_workqueue.
>
> I'm not seeing why that should be an issue. Sure, there's some CPU
> overhead to context switching, but I don't see that it should be that
> big of a deal.
>
> I wonder if there is something more serious wrong with the kblockd_workqueue.

I'm driving the box pretty hard, it's right on the line between CPU
bound and IO bound.  So I've got 32 fs_mark processes banging away and
32 CPUs (16 really, with hyperthreading).

They are popping in and out of balance_dirty_pages() so I have high CPU
utilization alternating with high IO wait times.  There are no reads at
all, so all of these waits are for buffered writes.

People in balance_dirty_pages are indirectly waiting on the unplug, so
maybe the context switch overhead on a loaded box is enough to explain
it.  We've definitely gotten more than 9% by inlining small synchronous
items in btrfs in the past, but those were more explicitly synchronous.

I know it's painfully hand wavy.  I don't see any other users of the
kblockd workqueues, and the perf profiles don't jump out at me.  I'll
feel better about the patch if Dave confirms any gains.

-chris
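
The distinction discussed above, between requests dispatched inline by an explicit flush and requests handed to a worker when the task schedules away while still plugged, can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every `toy_*` name and the `dispatch_path` enum are hypothetical and are not kernel APIs, and the model only counts which path each batched request takes. It says nothing about the real cost of either path.

```c
#include <assert.h>

/* Toy model only: none of these names exist in the kernel. */
enum dispatch_path { DISPATCH_INLINE, DISPATCH_DEFERRED };

struct toy_plug {
	int pending;             /* requests batched but not yet dispatched */
	int inline_dispatched;   /* dispatched by the submitting task itself */
	int deferred_dispatched; /* handed to a (hypothetical) worker */
	int handoffs;            /* how many times the worker was asked to run */
};

static void toy_plug_init(struct toy_plug *p)
{
	p->pending = 0;
	p->inline_dispatched = 0;
	p->deferred_dispatched = 0;
	p->handoffs = 0;
}

static void toy_submit(struct toy_plug *p)
{
	p->pending++;
}

/* Empty the batch through the chosen path; a no-op if nothing is pending. */
static void toy_flush(struct toy_plug *p, enum dispatch_path path)
{
	if (p->pending == 0)
		return;
	if (path == DISPATCH_INLINE) {
		p->inline_dispatched += p->pending;
	} else {
		p->deferred_dispatched += p->pending;
		p->handoffs++;
	}
	p->pending = 0;
}

/*
 * Model of scheduling away while plugged. Assumption taken from the
 * thread: anything still pending at that point goes to the worker.
 * With flush_first set, the task empties the batch itself beforehand.
 */
static void toy_yield(struct toy_plug *p, int flush_first)
{
	if (flush_first)
		toy_flush(p, DISPATCH_INLINE);
	toy_flush(p, DISPATCH_DEFERRED);
}

/* Write nr_inodes inodes, yielding after every yield_every-th inode. */
static struct toy_plug toy_writeback(int nr_inodes, int reqs_per_inode,
				     int yield_every, int flush_first)
{
	struct toy_plug p;

	toy_plug_init(&p);
	for (int i = 1; i <= nr_inodes; i++) {
		for (int r = 0; r < reqs_per_inode; r++)
			toy_submit(&p);
		if (i % yield_every == 0)
			toy_yield(&p, flush_first);
	}
	toy_flush(&p, DISPATCH_INLINE); /* final explicit flush */
	return p;
}
```

Calling `toy_writeback(8, 4, 2, 0)` models yielding without flushing first, so every request ends up on the deferred path. Passing `flush_first = 1` moves all of them to the inline path, which is the behaviour the patch in this thread aims for.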
On Thu, Sep 17, 2015 at 07:56:47PM -0400, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> > On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason <clm@fb.com> wrote:
> > >
> > > Playing around with the plug a little, most of the unplugs are coming
> > > from the cond_resched_lock().  Not really sure why we are doing the
> > > cond_resched() there, we should be doing it before we retake the lock
> > > instead.
> > >
> > > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > > files/sec up to 213K.  Average IO size is the same as 4.3-rc1.
> >
> > Ok, so at least for you, part of the problem really ends up being that
> > there's a mix of the "synchronous" unplugging (by the actual explicit
> > "blk_finish_plug(&plug);") and the writeback that is handed off to
> > kblockd_workqueue.
> >
> > I'm not seeing why that should be an issue. Sure, there's some CPU
> > overhead to context switching, but I don't see that it should be that
> > big of a deal.

It may well change the dispatch order of enough IOs for it to be
significant on an IO bound device.

> > I wonder if there is something more serious wrong with the kblockd_workqueue.
>
> I'm driving the box pretty hard, it's right on the line between CPU
> bound and IO bound.  So I've got 32 fs_mark processes banging away and
> 32 CPUs (16 really, with hyperthreading).

I'm only using 8 threads right now, so I have ~6-7 idle CPUs on this
workload. Hence if it's CPU load related, I probably won't see any
change in behaviour.

> They are popping in and out of balance_dirty_pages() so I have high CPU
> utilization alternating with high IO wait times.  There are no reads at
> all, so all of these waits are for buffered writes.
>
> People in balance_dirty_pages are indirectly waiting on the unplug, so
> maybe the context switch overhead on a loaded box is enough to explain
> it.  We've definitely gotten more than 9% by inlining small synchronous
> items in btrfs in the past, but those were more explicitly synchronous.
>
> I know it's painfully hand wavy.  I don't see any other users of the
> kblockd workqueues, and the perf profiles don't jump out at me.  I'll
> feel better about the patch if Dave confirms any gains.

In outright performance on my test machine, the difference in
files/s is noise. However, the consistency looks to be substantially
improved and the context switch rate is now running at under
3,000/sec.

Numbers, including the std deviation of the files/s number output
during the fsmark run (averaged across 3 separate benchmark runs):

                  files/s    std-dev    wall time
4.3-rc1-noplug     34400     2.0e04     5m25s
4.3-rc1            56600     2.3e04     3m23s
4.3-rc1-flush      56079     1.4e04     3m14s

std-dev is well down, and the improvement in wall time is large
enough to be significant. Looks good to me.

Cheers,

Dave.
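
To make the comparison in the table above easier to read, the figures can be reduced to relative numbers. The sketch below is not part of the thread: it simply re-enters the three rows as posted and computes the std-dev relative to the mean files/s and the percentage change between two kernels. The struct and function names (`run`, `rel_std_dev`, `pct_change`, `print_summary`) are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the posted table; wall time converted to seconds. */
struct run {
	const char *kernel;
	double files_per_sec;
	double std_dev;
	int wall_seconds;
};

static const struct run runs[] = {
	{ "4.3-rc1-noplug", 34400.0, 2.0e4, 5 * 60 + 25 },
	{ "4.3-rc1",        56600.0, 2.3e4, 3 * 60 + 23 },
	{ "4.3-rc1-flush",  56079.0, 1.4e4, 3 * 60 + 14 },
};

/* Std deviation as a fraction of the mean files/s. */
static double rel_std_dev(const struct run *r)
{
	return r->std_dev / r->files_per_sec;
}

/* Percentage change going from 'from' to 'to' (negative means lower). */
static double pct_change(double from, double to)
{
	return 100.0 * (to - from) / from;
}

/* Print each row, with changes measured against plain 4.3-rc1. */
static void print_summary(void)
{
	const struct run *base = &runs[1];

	for (unsigned int i = 0; i < sizeof(runs) / sizeof(runs[0]); i++) {
		const struct run *r = &runs[i];

		printf("%-15s rel-std-dev %.2f  files/s %+.1f%%  wall %+.1f%%\n",
		       r->kernel, rel_std_dev(r),
		       pct_change(base->files_per_sec, r->files_per_sec),
		       pct_change(base->wall_seconds, r->wall_seconds));
	}
}
```

Against plain 4.3-rc1, the flush kernel comes out at roughly 0.9% lower files/s, about 4.4% shorter wall time, and a relative std-dev that drops from about 0.41 to about 0.25. That is consistent with the reading above that throughput is unchanged within noise while consistency improves.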
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..05ed541 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1481,6 +1481,19 @@ static long writeback_sb_inodes(struct super_block *sb,
 		wbc_detach_inode(&wbc);
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+
+		if (need_resched()) {
+			/*
+			 * we're plugged and don't want to hand off to kblockd
+			 * for the actual unplug work.  But we do want to
+			 * reschedule.  So flush our plug and then
+			 * schedule away
+			 */
+			blk_flush_plug(current);
+			cond_resched();
+		}
+
+
 		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
@@ -1488,7 +1501,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		requeue_inode(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		cond_resched_lock(&wb->list_lock);
+
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
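
The ordering the patch moves to (flush the batched work and yield while the list lock is dropped, then retake the lock) can be sketched in user space with a pthread mutex. This is an analogy under loud assumptions, not kernel code: `struct batch`, `lock_list()`, `unlock_list()`, `flush_batch()`, `want_yield()` and `process_items()` are all hypothetical. `want_yield()` is an arbitrary stand-in for `need_resched()`, and `sched_yield()` stands in for `cond_resched()`.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the plug and wb->list_lock; not kernel APIs. */
struct batch {
	int pending;            /* items batched but not yet flushed */
	int flushed;            /* items flushed so far */
	int flushes_under_lock; /* flushes that happened with the lock held */
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lock_held; /* tracked by hand so the model can check ordering */

static void lock_list(void)
{
	pthread_mutex_lock(&list_lock);
	lock_held = true;
}

static void unlock_list(void)
{
	lock_held = false;
	pthread_mutex_unlock(&list_lock);
}

static void flush_batch(struct batch *b)
{
	if (lock_held)
		b->flushes_under_lock++;
	b->flushed += b->pending;
	b->pending = 0;
}

/* Arbitrary stand-in for need_resched(): yield after every second item. */
static bool want_yield(int i)
{
	return i % 2 == 1;
}

static void process_items(struct batch *b, int nr_items)
{
	lock_list();
	for (int i = 0; i < nr_items; i++) {
		unlock_list();  /* the expensive work runs without the lock */
		b->pending++;   /* "write" one item into the batch */

		if (want_yield(i)) {
			/* Flush our own batch, then yield, before relocking. */
			flush_batch(b);
			sched_yield();
		}

		lock_list();    /* retake the lock for list bookkeeping */
	}
	unlock_list();
	flush_batch(b);         /* final explicit flush */
}
```

The point of the shape is that both the flush and the yield happen in the window where the lock is already dropped, so the task never gives up the CPU while holding the lock or while still holding batched work that someone else would have to dispatch.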