
[10/12] writeback: only allow one inflight and pending full flush

Message ID 1506543239-31470-11-git-send-email-axboe@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe Sept. 27, 2017, 8:13 p.m. UTC
When someone calls wakeup_flusher_threads() or
wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
pages in the system (or on that bdi). If we are tight on memory, we
can get tons of these queued from kswapd/vmscan. This causes (at
least) two problems:

1) We consume a ton of memory just allocating writeback work items.
   We've seen as much as 600 million of these writeback work items
   pending. That's a lot of memory to pointlessly hold hostage,
   while the box is under memory pressure.

2) We spend so much time processing these work items that we
   introduce a softlockup in writeback processing. This is because
   each of the writeback work items ends up doing no actual work (it's
   hard when you have millions of identical ones coming into the
   flush machinery), so we just sit in a tight loop pulling work
   items and deleting/freeing them.

Fix this by adding a 'start_all' bit to the writeback structure, and
setting it when someone attempts to flush all dirty pages. The bit is
cleared when we start writeback on that work item. If the bit is
already set when we attempt to queue !nr_pages writeback, we
simply ignore the new request.

This provides us one full flush in flight, with one pending as well,
and makes for more efficient handling of this type of writeback.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Chris Mason <clm@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c                | 25 +++++++++++++++++++++++++
 include/linux/backing-dev-defs.h |  1 +
 2 files changed, 26 insertions(+)

Comments

Andrew Morton Sept. 28, 2017, 9:41 p.m. UTC | #1
On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <axboe@kernel.dk> wrote:

> When someone calls wakeup_flusher_threads() or
> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
> pages in the system (or on that bdi). If we are tight on memory, we
> can get tons of these queued from kswapd/vmscan. This causes (at
> least) two problems:
> 
> 1) We consume a ton of memory just allocating writeback work items.
>    We've seen as much as 600 million of these writeback work items
>    pending. That's a lot of memory to pointlessly hold hostage,
>    while the box is under memory pressure.
> 
> 2) We spend so much time processing these work items, that we
>    introduce a softlockup in writeback processing. This is because
>    each of the writeback work items don't end up doing any work (it's
>    hard when you have millions of identical ones coming in to the
>    flush machinery), so we just sit in a tight loop pulling work
>    items and deleting/freeing them.
> 
> Fix this by adding a 'start_all' bit to the writeback structure, and
> set that when someone attempts to flush all dirty pages. The bit is
> cleared when we start writeback on that work item. If the bit is
> already set when we attempt to queue !nr_pages writeback, then we
> simply ignore it.
> 
> This provides us one full flush in flight, with one pending as well,
> and makes for more efficient handling of this type of writeback.
> 
> ...
>
> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
>  		return;
>  
>  	/*
> +	 * All callers of this function want to start writeback of all
> +	 * dirty pages. Places like vmscan can call this at a very
> +	 * high frequency, causing pointless allocations of tons of
> +	 * work items and keeping the flusher threads busy retrieving
> +	 * that work. Ensure that we only allow one of them pending and
> +	 * inflight at the time. It doesn't matter if we race a little
> +	 * bit on this, so use the faster separate test/set bit variants.
> +	 */
> +	if (test_bit(WB_start_all, &wb->state))
> +		return;
> +
> +	set_bit(WB_start_all, &wb->state);

test_and_set_bit()?
Linus Torvalds Sept. 28, 2017, 9:44 p.m. UTC | #2
On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> test_and_set_bit()?

If there aren't any atomicity concerns (either because of higher-level
locking, or because racing and having two people set the bit is fine),
it can be better to do them separately if the test_bit() is the common
case and you can avoid dirtying a cacheline that way.

But yeah, if that is the case, it might be worth documenting, because
test_and_set_bit() is the more obviously appropriate "there can be
only one" model.

               Linus
Jens Axboe Sept. 29, 2017, 12:15 a.m. UTC | #3
On 09/28/2017 11:41 PM, Andrew Morton wrote:
> On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <axboe@kernel.dk> wrote:
> 
>> When someone calls wakeup_flusher_threads() or
>> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
>> pages in the system (or on that bdi). If we are tight on memory, we
>> can get tons of these queued from kswapd/vmscan. This causes (at
>> least) two problems:
>>
>> 1) We consume a ton of memory just allocating writeback work items.
>>    We've seen as much as 600 million of these writeback work items
>>    pending. That's a lot of memory to pointlessly hold hostage,
>>    while the box is under memory pressure.
>>
>> 2) We spend so much time processing these work items, that we
>>    introduce a softlockup in writeback processing. This is because
>>    each of the writeback work items don't end up doing any work (it's
>>    hard when you have millions of identical ones coming in to the
>>    flush machinery), so we just sit in a tight loop pulling work
>>    items and deleting/freeing them.
>>
>> Fix this by adding a 'start_all' bit to the writeback structure, and
>> set that when someone attempts to flush all dirty pages. The bit is
>> cleared when we start writeback on that work item. If the bit is
>> already set when we attempt to queue !nr_pages writeback, then we
>> simply ignore it.
>>
>> This provides us one full flush in flight, with one pending as well,
>> and makes for more efficient handling of this type of writeback.
>>
>> ...
>>
>> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
>>  		return;
>>  
>>  	/*
>> +	 * All callers of this function want to start writeback of all
>> +	 * dirty pages. Places like vmscan can call this at a very
>> +	 * high frequency, causing pointless allocations of tons of
>> +	 * work items and keeping the flusher threads busy retrieving
>> +	 * that work. Ensure that we only allow one of them pending and
>> +	 * inflight at the time. It doesn't matter if we race a little
>> +	 * bit on this, so use the faster separate test/set bit variants.
>> +	 */
>> +	if (test_bit(WB_start_all, &wb->state))
>> +		return;
>> +
>> +	set_bit(WB_start_all, &wb->state);
> 
> test_and_set_bit()?

Like Linus says, this is done purposely. I've even included a bit about
it in the comment above, though maybe it's not clear enough. I've used
this trick in blk-mq quite a bit as well, and for high frequency calls,
it can make a substantial difference not to redirty that cache line if
you can avoid it.

If you do care about atomicity, this works really well too:

if (test_bit(bit, addr) || test_and_set_bit(bit, addr))
	...

just to avoid the locked operation. Also see this commit:
commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d
Author: Jens Axboe <axboe@fb.com>
Date:   Thu May 22 11:54:16 2014 -0700

    mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT

where there are some actual numbers on a specific case.

For the case at hand, we don't even need to do the test_and_set
case, since we don't care about a small race there.
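For illustration, a minimal userspace sketch of the two idioms described above,
written with C11 atomics rather than the kernel's bitops. The names wb_state,
WB_START_ALL, start_all_racy(), start_all_atomic() and start_all_done() are
illustrative stand-ins, not kernel APIs; the kernel code itself uses
test_bit()/set_bit()/test_and_set_bit() as shown in the patch below.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_ulong wb_state;
#define WB_START_ALL	(1UL << 0)

/* Patch's variant: plain read first, unconditional RMW on the slow path.
 * A rare race where two callers both queue work is tolerated. */
static bool start_all_racy(void)
{
	if (atomic_load_explicit(&wb_state, memory_order_relaxed) & WB_START_ALL)
		return false;		/* common case: no write, no dirtied cache line */
	atomic_fetch_or(&wb_state, WB_START_ALL);	/* analogous to set_bit() */
	return true;
}

/* Atomicity-preserving variant above: test_bit() || test_and_set_bit().
 * Still skips the locked RMW in the common already-set case. */
static bool start_all_atomic(void)
{
	if (atomic_load_explicit(&wb_state, memory_order_relaxed) & WB_START_ALL)
		return false;
	return !(atomic_fetch_or(&wb_state, WB_START_ALL) & WB_START_ALL);
}

/* Mirrors get_next_work_item() clearing WB_start_all once the work is picked up. */
static void start_all_done(void)
{
	atomic_fetch_and(&wb_state, ~WB_START_ALL);
}

int main(void)
{
	printf("queue #1: %d\n", start_all_racy());	/* 1: queued, bit now set */
	printf("queue #2: %d\n", start_all_atomic());	/* 0: already pending, skipped */
	start_all_done();				/* work item picked up */
	printf("queue #3: %d\n", start_all_atomic());	/* 1: bit was cleared */
	return 0;
}

The relaxed load in the fast path corresponds to test_bit(): it reads the line
shared, and only the slow path performs the locked RMW that takes it exclusive.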
Jens Axboe Sept. 29, 2017, 12:17 a.m. UTC | #4
On 09/28/2017 11:44 PM, Linus Torvalds wrote:
> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> test_and_set_bit()?
> 
> If there aren't any atomicity concerns (either because of higher-level
> locking, or because racing and having two people set the bit is fine),
> it can be better to do them separately if the test_bit() is the common
> case and you can avoid dirtying a cacheline that way.
> 
> But yeah, if that is the case, it might be worth documenting, because
> test_and_set_bit() is the more obviously appropriate "there can be
> only one" model.

It is documented though, but maybe not well enough...

I've actually had to document/explain it enough times now, that it
might be worth making a general construct. Though it has to be
used carefully, so perhaps it's better contained as separate use
cases.
Amir Goldstein Sept. 29, 2017, 5:21 a.m. UTC | #5
On Fri, Sep 29, 2017 at 3:17 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 09/28/2017 11:44 PM, Linus Torvalds wrote:
>> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
>> <akpm@linux-foundation.org> wrote:
>>>
>>> test_and_set_bit()?
>>
>> If there aren't any atomicity concerns (either because of higher-level
>> locking, or because racing and having two people set the bit is fine),
>> it can be better to do them separately if the test_bit() is the common
>> case and you can avoid dirtying a cacheline that way.
>>
>> But yeah, if that is the case, it might be worth documenting, because
>> test_and_set_bit() is the more obviously appropriate "there can be
>> only one" model.
>
> It is documented though, but maybe not well enough...
>
> I've actually had to document/explain it enough times now, that it
> might be worth making a general construct. Though it has to be
> used carefully, so perhaps it's better contained as separate use
> cases.
>

Maybe change "Ensure that we only allow one of them pending"
in the comment above. Only the "allow one inflight" part is correct.

Or apply your follow-up patch and be done with it...

Amir.
Matthew Wilcox Oct. 3, 2017, 4:06 p.m. UTC | #6
On Fri, Sep 29, 2017 at 02:17:32AM +0200, Jens Axboe wrote:
> On 09/28/2017 11:44 PM, Linus Torvalds wrote:
> > On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> >>
> >> test_and_set_bit()?
> > 
> > If there aren't any atomicity concerns (either because of higher-level
> > locking, or because racing and having two people set the bit is fine),
> > it can be better to do them separately if the test_bit() is the common
> > case and you can avoid dirtying a cacheline that way.
> > 
> > But yeah, if that is the case, it might be worth documenting, because
> > test_and_set_bit() is the more obviously appropriate "there can be
> > only one" model.
> 
> It is documented though, but maybe not well enough...
> 
> I've actually had to document/explain it enough times now, that it
> might be worth making a general construct. Though it has to be
> used carefully, so perhaps it's better contained as separate use
> cases.

test_and_test_and_set_bit()?  It's an unusual name, so when either
reading it or writing it, people are going to say "something unusual
here", rather than "That Jens Axboe is such a n00b, he doesn't know how
to use test_and_set_bit()".  There are a few references out on the web
to test-and-test-and-set already, so it's not entirely unique to Linux.

Plus, some architectures might be able to optimise that, particularly
those which are ll/sc based.  It might be exactly the same as their
test_and_set().
Jens Axboe Oct. 3, 2017, 4:11 p.m. UTC | #7
On 10/03/2017 10:06 AM, Matthew Wilcox wrote:
> On Fri, Sep 29, 2017 at 02:17:32AM +0200, Jens Axboe wrote:
>> On 09/28/2017 11:44 PM, Linus Torvalds wrote:
>>> On Thu, Sep 28, 2017 at 2:41 PM, Andrew Morton
>>> <akpm@linux-foundation.org> wrote:
>>>>
>>>> test_and_set_bit()?
>>>
>>> If there aren't any atomicity concerns (either because of higher-level
>>> locking, or because racing and having two people set the bit is fine),
>>> it can be better to do them separately if the test_bit() is the common
>>> case and you can avoid dirtying a cacheline that way.
>>>
>>> But yeah, if that is the case, it might be worth documenting, because
>>> test_and_set_bit() is the more obviously appropriate "there can be
>>> only one" model.
>>
>> It is documented though, but maybe not well enough...
>>
>> I've actually had to document/explain it enough times now, that it
>> might be worth making a general construct. Though it has to be
>> used carefully, so perhaps it's better contained as separate use
>> cases.
> 
> test_and_test_and_set_bit()?  It's an unusual name, so when either
> reading it or writing it, people are going to say "something unusual
> here", rather than "That Jens Axboe is such a n00b, he doesn't know how
> to use test_and_set_bit()".  There are a few references out on the web
> to test-and-test-and-set already, so it's not entirely unique to Linux.
> 
> Plus, some architectures might be able to optimise that, particularly
> those which are ll/sc based.  It might be exactly the same as their
> test_and_set().

I like that suggestion, but would suggest we make it
test_then_test_and_set_bit() since the 'then' naming would work for
having similar test_then_clear_bit() and not clash with
test_and_set_bit().

And yes, some archs would be able to optimize this nicely.

All worth it if I never have to explain it or add special comments
about it again :-)
Matthew Wilcox Oct. 3, 2017, 5:03 p.m. UTC | #8
On Tue, Oct 03, 2017 at 10:11:20AM -0600, Jens Axboe wrote:
> On 10/03/2017 10:06 AM, Matthew Wilcox wrote:
> > test_and_test_and_set_bit()?  It's an unusual name, so when either
> > reading it or writing it, people are going to say "something unusual
> > here", rather than "That Jens Axboe is such a n00b, he doesn't know how
> > to use test_and_set_bit()".  There are a few references out on the web
> > to test-and-test-and-set already, so it's not entirely unique to Linux.
> 
> I like that suggestion, but would suggest we make it
> test_then_test_and_set_bit() since the 'then' naming would work for
> having similar test_then_clear_bit() and not clash with
> test_and_set_bit().

'test-then-test-and-set' has the disadvantage of not being readily
searchable ... if you search for 'test-and-test-and-set', you find
discussions about why you might want to use this technique.  Also, I
don't like having set use a different name from clear; either we want
'test_and_test_and_(set|clear)_bit()' or 'test_then_(set|clear)_bit()'.

Usually I'd be in favour of the shorter name, but this should be a rare
thing for people to use, and if you search for test-then-clear you get
a lot of results about pregnancy tests ...
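For reference, a sketch of the generic construct being discussed might look like
the following. The name comes from the naming debate above, the signature simply
mirrors test_and_set_bit() and is an assumption, and no helper by either proposed
name is introduced by this patch.

#include <linux/bitops.h>

/*
 * Illustrative sketch only: do a cheap shared read first and skip the
 * locked RMW (and the cache-line dirtying it implies) when the bit is
 * already set, falling back to the atomic test_and_set_bit() otherwise.
 */
static inline bool test_and_test_and_set_bit(long nr, volatile unsigned long *addr)
{
	if (test_bit(nr, addr))
		return true;
	return test_and_set_bit(nr, addr);
}

As noted above, ll/sc architectures could potentially optimise this so it costs
no more than their existing test_and_set_bit().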

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d9be3dc34302..2e46a1537e10 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,6 +53,7 @@  struct wb_writeback_work {
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	unsigned int auto_free:1;	/* free on completion */
+	unsigned int start_all:1;	/* nr_pages == 0 (all) writeback */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -953,12 +954,27 @@  static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
 		return;
 
 	/*
+	 * All callers of this function want to start writeback of all
+	 * dirty pages. Places like vmscan can call this at a very
+	 * high frequency, causing pointless allocations of tons of
+	 * work items and keeping the flusher threads busy retrieving
+	 * that work. Ensure that we only allow one of them pending and
+	 * inflight at the time. It doesn't matter if we race a little
+	 * bit on this, so use the faster separate test/set bit variants.
+	 */
+	if (test_bit(WB_start_all, &wb->state))
+		return;
+
+	set_bit(WB_start_all, &wb->state);
+
+	/*
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
 	work = kzalloc(sizeof(*work),
 		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
 	if (!work) {
+		clear_bit(WB_start_all, &wb->state);
 		trace_writeback_nowork(wb);
 		wb_wakeup(wb);
 		return;
@@ -969,6 +985,7 @@  static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
 	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
 	work->auto_free	= 1;
+	work->start_all = 1;
 
 	wb_queue_work(wb, work);
 }
@@ -1822,6 +1839,14 @@  static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
 		list_del_init(&work->list);
 	}
 	spin_unlock_bh(&wb->work_lock);
+
+	/*
+	 * Once we start processing a work item that had !nr_pages,
+	 * clear the wb state bit for that so we can allow more.
+	 */
+	if (work && work->start_all)
+		clear_bit(WB_start_all, &wb->state);
+
 	return work;
 }
 
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..420de5c7c7f9 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -24,6 +24,7 @@  enum wb_state {
 	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
+	WB_start_all,		/* nr_pages == 0 (all) work pending */
 };
 
 enum wb_congested_state {