
[6/6] fs-writeback: only allow one inflight and pending !nr_pages flush

Message ID 1505850787-18311-7-git-send-email-axboe@kernel.dk (mailing list archive)
State New, archived

Commit Message

Jens Axboe Sept. 19, 2017, 7:53 p.m. UTC
A few callers pass in nr_pages == 0 when they wake up the flusher
threads, which means that the flusher should just flush everything
that is currently dirty. If we are tight on memory, we can get
tons of these queued from kswapd/vmscan. This causes (at least)
two problems:

1) We consume a ton of memory just allocating writeback work items.
2) We spend so much time processing these work items, that we
   introduce a softlockup in writeback processing.

Fix this by adding a 'zero_pages' bit to the writeback structure,
and setting it when someone queues a nr_pages==0 flusher thread
wakeup. The bit is cleared when we start writeback on that work
item. If the bit is already set when we attempt to queue !nr_pages
writeback, then we simply ignore it.

This gives us one full flush in flight, with one pending as
well, and makes for more efficient handling of this type of
writeback.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 fs/fs-writeback.c                | 30 ++++++++++++++++++++++++++++--
 include/linux/backing-dev-defs.h |  1 +
 2 files changed, 29 insertions(+), 2 deletions(-)

Comments

Johannes Weiner Sept. 19, 2017, 8:18 p.m. UTC | #1
On Tue, Sep 19, 2017 at 01:53:07PM -0600, Jens Axboe wrote:
> [...]

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Just a nitpick:

> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>  			       bool range_cyclic, enum wb_reason reason)
>  {
>  	struct wb_writeback_work *work;
> +	bool zero_pages = false;
>  
>  	if (!wb_has_dirty_io(wb))
>  		return;
>  
>  	/*
> -	 * If someone asked for zero pages, we write out the WORLD
> +	 * If someone asked for zero pages, we write out the WORLD.
> +	 * Places like vmscan and laptop mode want to queue a wakeup to
> +	 * the flusher threads to clean out everything. To avoid potentially
> +	 * having tons of these pending, ensure that we only allow one of
> +	 * them pending and inflight at a time
>  	 */
> -	if (!nr_pages)
> +	if (!nr_pages) {
> +		if (test_bit(WB_zero_pages, &wb->state))
> +			return;
> +		set_bit(WB_zero_pages, &wb->state);
>  		nr_pages = get_nr_dirty_pages();

We could rely on work->older_than_this and pass LONG_MAX here
instead, to write out the world as it was at the time writeback
commences.

get_nr_dirty_pages() is somewhat clearer on intent, but on the other
hand it returns global state and is used here in a split-bdi context,
so summed over all bdis we can end up requesting the system-wide
dirty pages several times over. It'll work fine, since
work->older_than_this contains it as well; it just seems a little
ugly and subtle.
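
For illustration, a minimal sketch of that alternative (assuming the
writeback loop is bounded by work->older_than_this rather than by the
page count):

	if (!nr_pages) {
		if (test_bit(WB_zero_pages, &wb->state))
			return;
		set_bit(WB_zero_pages, &wb->state);
		/* write out the world; work->older_than_this bounds it */
		nr_pages = LONG_MAX;
	}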
Jens Axboe Sept. 19, 2017, 8:39 p.m. UTC | #2
On 09/19/2017 02:18 PM, Johannes Weiner wrote:
> On Tue, Sep 19, 2017 at 01:53:07PM -0600, Jens Axboe wrote:
>> [...]
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> 
> Just a nitpick:
> 
>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>  			       bool range_cyclic, enum wb_reason reason)
>>  {
>>  	struct wb_writeback_work *work;
>> +	bool zero_pages = false;
>>  
>>  	if (!wb_has_dirty_io(wb))
>>  		return;
>>  
>>  	/*
>> -	 * If someone asked for zero pages, we write out the WORLD
>> +	 * If someone asked for zero pages, we write out the WORLD.
>> +	 * Places like vmscan and laptop mode want to queue a wakeup to
>> +	 * the flusher threads to clean out everything. To avoid potentially
>> +	 * having tons of these pending, ensure that we only allow one of
>> +	 * them pending and inflight at a time
>>  	 */
>> -	if (!nr_pages)
>> +	if (!nr_pages) {
>> +		if (test_bit(WB_zero_pages, &wb->state))
>> +			return;
>> +		set_bit(WB_zero_pages, &wb->state);
>>  		nr_pages = get_nr_dirty_pages();
> 
> We could rely on work->older_than_this and pass LONG_MAX here
> instead, to write out the world as it was at the time writeback
> commences.
> 
> get_nr_dirty_pages() is somewhat clearer on intent, but on the other
> hand it returns global state and is used here in a split-bdi context,
> so summed over all bdis we can end up requesting the system-wide
> dirty pages several times over. It'll work fine, since
> work->older_than_this contains it as well; it just seems a little
> ugly and subtle.

Not disagreeing with that at all. I just carried the !nr_pages forward
as the way to do this. I think any further cleanup or work should just
be based on this patchset; I'd definitely welcome a change in that
direction.

Thanks for your reviews!
Jens Axboe Sept. 20, 2017, 1:57 a.m. UTC | #3
On 09/19/2017 01:53 PM, Jens Axboe wrote:
> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>  			       bool range_cyclic, enum wb_reason reason)
>  {
>  	struct wb_writeback_work *work;
> +	bool zero_pages = false;
>  
>  	if (!wb_has_dirty_io(wb))
>  		return;
>  
>  	/*
> -	 * If someone asked for zero pages, we write out the WORLD
> +	 * If someone asked for zero pages, we write out the WORLD.
> +	 * Places like vmscan and laptop mode want to queue a wakeup to
> +	 * the flusher threads to clean out everything. To avoid potentially
> +	 * having tons of these pending, ensure that we only allow one of
> +	 * them pending and inflight at a time
>  	 */
> -	if (!nr_pages)
> +	if (!nr_pages) {
> +		if (test_bit(WB_zero_pages, &wb->state))
> +			return;
> +		set_bit(WB_zero_pages, &wb->state);
>  		nr_pages = get_nr_dirty_pages();
> +		zero_pages = true;
> +	}

A later fix was added here to ensure we clear WB_zero_pages if work
allocation fails:

	work = kzalloc(sizeof(*work),
		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!work) {
		if (zero_pages)
			clear_bit(WB_zero_pages, &wb->state);
		[...]

Updated patch here:

http://git.kernel.dk/cgit/linux-block/commit/?h=writeback-fixup&id=21ea70657894fda9fccf257543cbec112b2813ef
Amir Goldstein Sept. 20, 2017, 3:10 a.m. UTC | #4
On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <axboe@kernel.dk> wrote:
> [...]
> @@ -53,6 +53,7 @@ struct wb_writeback_work {
>         unsigned int for_background:1;
>         unsigned int for_sync:1;        /* sync(2) WB_SYNC_ALL writeback */
>         unsigned int auto_free:1;       /* free on completion */
> +       unsigned int zero_pages:1;      /* nr_pages == 0 writeback */

Suggest: use a name that describes the intention (e.g. WB_everything)

>         enum wb_reason reason;          /* why was writeback initiated? */
>
>         struct list_head list;          /* pending work list */
> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>                                bool range_cyclic, enum wb_reason reason)
>  {
>         struct wb_writeback_work *work;
> +       bool zero_pages = false;
>
>         if (!wb_has_dirty_io(wb))
>                 return;
>
>         /*
> -        * If someone asked for zero pages, we write out the WORLD
> +        * If someone asked for zero pages, we write out the WORLD.
> +        * Places like vmscan and laptop mode want to queue a wakeup to
> +        * the flusher threads to clean out everything. To avoid potentially
> +        * having tons of these pending, ensure that we only allow one of
> +        * them pending and inflight at a time
>          */
> -       if (!nr_pages)
> +       if (!nr_pages) {
> +               if (test_bit(WB_zero_pages, &wb->state))
> +                       return;
> +               set_bit(WB_zero_pages, &wb->state);

Shouldn't this be test_and_set? not the worst outcome if you have more
than one pending work item, but still.
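
For illustration, the atomic variant would look something like:

	if (!nr_pages) {
		/* atomic test-and-set: only one caller can win the bit */
		if (test_and_set_bit(WB_zero_pages, &wb->state))
			return;
		nr_pages = get_nr_dirty_pages();
		zero_pages = true;
	}

which closes the small window in which two callers can both observe
the bit clear and queue two work items.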

>                 nr_pages = get_nr_dirty_pages();
> +               zero_pages = true;
> +       }
>
>         /*
>          * This is WB_SYNC_NONE writeback, so if allocation fails just
> @@ -975,6 +986,7 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>         work->range_cyclic = range_cyclic;
>         work->reason    = reason;
>         work->auto_free = 1;
> +       work->zero_pages = zero_pages;
>
>         wb_queue_work(wb, work);
>  }
> @@ -1828,6 +1840,14 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
>                 list_del_init(&work->list);
>         }
>         spin_unlock_bh(&wb->work_lock);
> +
> +       /*
> +        * Once we start processing a work item that had !nr_pages,
> +        * clear the wb state bit for that so we can allow more.
> +        */
> +       if (work && work->zero_pages && test_bit(WB_zero_pages, &wb->state))
> +               clear_bit(WB_zero_pages, &wb->state);

nit: should not need to test_bit

> +
>         return work;
>  }
>
> @@ -1896,6 +1916,12 @@ static long wb_do_writeback(struct bdi_writeback *wb)
>                 trace_writeback_exec(wb, work);
>                 wrote += wb_writeback(wb, work);
>                 finish_writeback_work(wb, work);
> +
> +               /*
> +                * If we have a lot of pending work, make sure we take
> +                * an occasional breather, if needed.
> +                */
> +               cond_resched();

Probably ought to be in a separate patch.

>         }
>
>         /*
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 866c433e7d32..7494f6a75458 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -24,6 +24,7 @@ enum wb_state {
>         WB_shutting_down,       /* wb_shutdown() in progress */
>         WB_writeback_running,   /* Writeback is in progress */
>         WB_has_dirty_io,        /* Dirty inodes on ->b_{dirty|io|more_io} */
> +       WB_zero_pages,          /* nr_pages == 0 flush pending */

same suggestion: WB_everything

Cheers,
Amir.
Jens Axboe Sept. 20, 2017, 4:13 a.m. UTC | #5
On 09/19/2017 09:10 PM, Amir Goldstein wrote:
> On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <axboe@kernel.dk> wrote:
>> [...]
>> @@ -53,6 +53,7 @@ struct wb_writeback_work {
>>         unsigned int for_background:1;
>>         unsigned int for_sync:1;        /* sync(2) WB_SYNC_ALL writeback */
>>         unsigned int auto_free:1;       /* free on completion */
>> +       unsigned int zero_pages:1;      /* nr_pages == 0 writeback */
> 
> Suggest: use a name that describes the intention (e.g. WB_everything)

Agree, the name isn't the best. WB_everything isn't great either, though,
since this isn't an integrity write. WB_start_all would be better,
I'll make that change.

>>         enum wb_reason reason;          /* why was writeback initiated? */
>>
>>         struct list_head list;          /* pending work list */
>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>                                bool range_cyclic, enum wb_reason reason)
>>  {
>>         struct wb_writeback_work *work;
>> +       bool zero_pages = false;
>>
>>         if (!wb_has_dirty_io(wb))
>>                 return;
>>
>>         /*
>> -        * If someone asked for zero pages, we write out the WORLD
>> +        * If someone asked for zero pages, we write out the WORLD.
>> +        * Places like vmscan and laptop mode want to queue a wakeup to
>> +        * the flusher threads to clean out everything. To avoid potentially
>> +        * having tons of these pending, ensure that we only allow one of
>> +        * them pending and inflight at a time
>>          */
>> -       if (!nr_pages)
>> +       if (!nr_pages) {
>> +               if (test_bit(WB_zero_pages, &wb->state))
>> +                       return;
>> +               set_bit(WB_zero_pages, &wb->state);
> 
> Shouldn't this be test_and_set? not the worst outcome if you have more
> than one pending work item, but still.

If the frequency of these is high, and they were to trigger the bad
conditions we saw, then a split test + set is faster as it won't
keep re-dirtying the same cacheline from multiple locations. It's
better to leave it a little racy, but faster.
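
To illustrate the tradeoff: under a flood of these wakeups, the split
version mostly performs read-only checks, so the cacheline can stay
shared across CPUs:

	if (test_bit(WB_zero_pages, &wb->state))	/* read-only, line stays shared */
		return;
	set_bit(WB_zero_pages, &wb->state);		/* only the rare winner dirties it */

whereas test_and_set_bit() is an atomic read-modify-write that pulls
the cacheline exclusive on every call, even when the bit is already
set.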

>> @@ -1828,6 +1840,14 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
>>                 list_del_init(&work->list);
>>         }
>>         spin_unlock_bh(&wb->work_lock);
>> +
>> +       /*
>> +        * Once we start processing a work item that had !nr_pages,
>> +        * clear the wb state bit for that so we can allow more.
>> +        */
>> +       if (work && work->zero_pages && test_bit(WB_zero_pages, &wb->state))
>> +               clear_bit(WB_zero_pages, &wb->state);
> 
> nit: should not need to test_bit

True, we can drop it for this case, as it'll be the common condition
anyway. I'll make that change.
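
The simplified form would presumably be just:

	if (work && work->zero_pages)
		clear_bit(WB_zero_pages, &wb->state);

since clear_bit() on a bit that is already clear is harmless.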

>> @@ -1896,6 +1916,12 @@ static long wb_do_writeback(struct bdi_writeback *wb)
>>                 trace_writeback_exec(wb, work);
>>                 wrote += wb_writeback(wb, work);
>>                 finish_writeback_work(wb, work);
>> +
>> +               /*
>> +                * If we have a lot of pending work, make sure we take
>> +                * an occasional breather, if needed.
>> +                */
>> +               cond_resched();
> 
> Probably ought to be in a separate patch.

Yeah, it probably should be. It's not strictly needed with the other
change anyway; I will just drop it.

New version:

http://git.kernel.dk/cgit/linux-block/commit/?h=writeback-fixup&id=338a69c217cdaaffda93f3cc9a364a347f782adb
Amir Goldstein Sept. 20, 2017, 6:05 a.m. UTC | #6
On Wed, Sep 20, 2017 at 7:13 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 09/19/2017 09:10 PM, Amir Goldstein wrote:
>> On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>> [...]
>>> @@ -53,6 +53,7 @@ struct wb_writeback_work {
>>>         unsigned int for_background:1;
>>>         unsigned int for_sync:1;        /* sync(2) WB_SYNC_ALL writeback */
>>>         unsigned int auto_free:1;       /* free on completion */
>>> +       unsigned int zero_pages:1;      /* nr_pages == 0 writeback */
>>
>> Suggest: use a name that describes the intention (e.g. WB_everything)
>
> Agree, the name isn't the best. WB_everything isn't great either, though,
> since this isn't an integrity write. WB_start_all would be better,
> I'll make that change.
>
>>>         enum wb_reason reason;          /* why was writeback initiated? */
>>>
>>>         struct list_head list;          /* pending work list */
>>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>>                                bool range_cyclic, enum wb_reason reason)
>>>  {
>>>         struct wb_writeback_work *work;
>>> +       bool zero_pages = false;
>>>
>>>         if (!wb_has_dirty_io(wb))
>>>                 return;
>>>
>>>         /*
>>> -        * If someone asked for zero pages, we write out the WORLD
>>> +        * If someone asked for zero pages, we write out the WORLD.
>>> +        * Places like vmscan and laptop mode want to queue a wakeup to
>>> +        * the flusher threads to clean out everything. To avoid potentially
>>> +        * having tons of these pending, ensure that we only allow one of
>>> +        * them pending and inflight at a time
>>>          */
>>> -       if (!nr_pages)
>>> +       if (!nr_pages) {
>>> +               if (test_bit(WB_zero_pages, &wb->state))
>>> +                       return;
>>> +               set_bit(WB_zero_pages, &wb->state);
>>
>> Shouldn't this be test_and_set? not the worst outcome if you have more
>> than one pending work item, but still.
>
> If the frequency of these is high, and they were to trigger the bad
> conditions we saw, then a split test + set is faster as it won't
> keep re-dirtying the same cacheline from multiple locations. It's
> better to leave it a little racy, but faster.
>

Fair enough, but then better to change the language of the commit
message and the comment above so they don't claim that there can be
only one pending work item.

Amir.
Jens Axboe Sept. 20, 2017, 12:35 p.m. UTC | #7
On 09/20/2017 12:05 AM, Amir Goldstein wrote:
> On Wed, Sep 20, 2017 at 7:13 AM, Jens Axboe <axboe@kernel.dk> wrote:
>> On 09/19/2017 09:10 PM, Amir Goldstein wrote:
>>> On Tue, Sep 19, 2017 at 10:53 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>>> [...]
>>>> @@ -53,6 +53,7 @@ struct wb_writeback_work {
>>>>         unsigned int for_background:1;
>>>>         unsigned int for_sync:1;        /* sync(2) WB_SYNC_ALL writeback */
>>>>         unsigned int auto_free:1;       /* free on completion */
>>>> +       unsigned int zero_pages:1;      /* nr_pages == 0 writeback */
>>>
>>> Suggest: use a name that describes the intention (e.g. WB_everything)
>>
>> Agree, the name isn't the best. WB_everything isn't great either, though,
>> since this isn't an integrity write. WB_start_all would be better,
>> I'll make that change.
>>
>>>>         enum wb_reason reason;          /* why was writeback initiated? */
>>>>
>>>>         struct list_head list;          /* pending work list */
>>>> @@ -948,15 +949,25 @@ static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
>>>>                                bool range_cyclic, enum wb_reason reason)
>>>>  {
>>>>         struct wb_writeback_work *work;
>>>> +       bool zero_pages = false;
>>>>
>>>>         if (!wb_has_dirty_io(wb))
>>>>                 return;
>>>>
>>>>         /*
>>>> -        * If someone asked for zero pages, we write out the WORLD
>>>> +        * If someone asked for zero pages, we write out the WORLD.
>>>> +        * Places like vmscan and laptop mode want to queue a wakeup to
>>>> +        * the flusher threads to clean out everything. To avoid potentially
>>>> +        * having tons of these pending, ensure that we only allow one of
>>>> +        * them pending and inflight at a time
>>>>          */
>>>> -       if (!nr_pages)
>>>> +       if (!nr_pages) {
>>>> +               if (test_bit(WB_zero_pages, &wb->state))
>>>> +                       return;
>>>> +               set_bit(WB_zero_pages, &wb->state);
>>>
>>> Shouldn't this be test_and_set? not the worst outcome if you have more
>>> than one pending work item, but still.
>>
>> If the frequency of these is high, and they were to trigger the bad
>> conditions we saw, then a split test + set is faster as it won't
>> keep re-dirtying the same cacheline from multiple locations. It's
>> better to leave it a little racy, but faster.
>>
> 
> Fare enough, but then better change the language of the commit message and
> comment above not to claim that there can be only one pending work item.

That's unchanged; the commit message should be fine. We clear the
bit when we start the work item, so we can have one in flight and
one pending.

But it does reference 'zero_pages'; I'll update that.
Jan Kara Sept. 20, 2017, 2:43 p.m. UTC | #8
On Tue 19-09-17 22:13:25, Jens Axboe wrote:
> On 09/19/2017 09:10 PM, Amir Goldstein wrote:
> New version:
> 
> http://git.kernel.dk/cgit/linux-block/commit/?h=writeback-fixup&id=338a69c217cdaaffda93f3cc9a364a347f782adb

This looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

									Honza

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a9a86644cb9f..e0240110b36f 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,6 +53,7 @@  struct wb_writeback_work {
 	unsigned int for_background:1;
 	unsigned int for_sync:1;	/* sync(2) WB_SYNC_ALL writeback */
 	unsigned int auto_free:1;	/* free on completion */
+	unsigned int zero_pages:1;	/* nr_pages == 0 writeback */
 	enum wb_reason reason;		/* why was writeback initiated? */
 
 	struct list_head list;		/* pending work list */
@@ -948,15 +949,25 @@  static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 			       bool range_cyclic, enum wb_reason reason)
 {
 	struct wb_writeback_work *work;
+	bool zero_pages = false;
 
 	if (!wb_has_dirty_io(wb))
 		return;
 
 	/*
-	 * If someone asked for zero pages, we write out the WORLD
+	 * If someone asked for zero pages, we write out the WORLD.
+	 * Places like vmscan and laptop mode want to queue a wakeup to
+	 * the flusher threads to clean out everything. To avoid potentially
+	 * having tons of these pending, ensure that we only allow one of
+	 * them pending and inflight at a time
 	 */
-	if (!nr_pages)
+	if (!nr_pages) {
+		if (test_bit(WB_zero_pages, &wb->state))
+			return;
+		set_bit(WB_zero_pages, &wb->state);
 		nr_pages = get_nr_dirty_pages();
+		zero_pages = true;
+	}
 
 	/*
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
@@ -975,6 +986,7 @@  static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 	work->range_cyclic = range_cyclic;
 	work->reason	= reason;
 	work->auto_free	= 1;
+	work->zero_pages = zero_pages;
 
 	wb_queue_work(wb, work);
 }
@@ -1828,6 +1840,14 @@  static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
 		list_del_init(&work->list);
 	}
 	spin_unlock_bh(&wb->work_lock);
+
+	/*
+	 * Once we start processing a work item that had !nr_pages,
+	 * clear the wb state bit for that so we can allow more.
+	 */
+	if (work && work->zero_pages && test_bit(WB_zero_pages, &wb->state))
+		clear_bit(WB_zero_pages, &wb->state);
+
 	return work;
 }
 
@@ -1896,6 +1916,12 @@  static long wb_do_writeback(struct bdi_writeback *wb)
 		trace_writeback_exec(wb, work);
 		wrote += wb_writeback(wb, work);
 		finish_writeback_work(wb, work);
+
+		/*
+		 * If we have a lot of pending work, make sure we take
+		 * an occasional breather, if needed.
+		 */
+		cond_resched();
 	}
 
 	/*
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 866c433e7d32..7494f6a75458 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -24,6 +24,7 @@  enum wb_state {
 	WB_shutting_down,	/* wb_shutdown() in progress */
 	WB_writeback_running,	/* Writeback is in progress */
 	WB_has_dirty_io,	/* Dirty inodes on ->b_{dirty|io|more_io} */
+	WB_zero_pages,		/* nr_pages == 0 flush pending */
 };
 
 enum wb_congested_state {