
writeback: Avoid exhausting allocation reserves under memory pressure

Message ID 1462436092-32665-1-git-send-email-jack@suse.cz (mailing list archive)
State New, archived

Commit Message

Jan Kara May 5, 2016, 8:14 a.m. UTC
When the system is under memory pressure, memory management frequently calls
wakeup_flusher_threads() to write back pages so that they can be freed.
This was observed to exhaust the reserves for atomic allocations, since
wakeup_flusher_threads() allocates one writeback work item with GFP_ATOMIC
for each device with dirty data.

However, it is pointless to allocate a new work item when the requested work
is identical to one that is already pending. Instead, we can merge the new
request into the pending work item and thus save the allocation.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 37 +++++++++++++++++++++++++++++++++++++
 include/trace/events/writeback.h |  1 +
 2 files changed, 38 insertions(+)

This is a patch which should (and in my basic testing does) address the issues
with many atomic allocations Tetsuo reported. What do people think?

Comments

Michal Hocko May 5, 2016, 8:24 a.m. UTC | #1
On Thu 05-05-16 10:14:52, Jan Kara wrote:
> When the system is under memory pressure, memory management frequently calls
> wakeup_flusher_threads() to write back pages so that they can be freed.
> This was observed to exhaust the reserves for atomic allocations, since
> wakeup_flusher_threads() allocates one writeback work item with GFP_ATOMIC
> for each device with dirty data.
> 
> However, it is pointless to allocate a new work item when the requested work
> is identical to one that is already pending. Instead, we can merge the new
> request into the pending work item and thus save the allocation.

Makes sense. See one question below:

> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c                | 37 +++++++++++++++++++++++++++++++++++++
>  include/trace/events/writeback.h |  1 +
>  2 files changed, 38 insertions(+)
> 
> This is a patch which should (and in my basic testing does) address the issues
> with many atomic allocations Tetsuo reported. What do people think?
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index fee81e8768c9..bb6725f5b1ba 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -189,6 +189,35 @@ out_unlock:
>  	spin_unlock_bh(&wb->work_lock);
>  }
>  
> +/*
> + * Check whether the request to writeback some pages can be merged with some
> + * other request which is already pending. If yes, merge it and return true.
> + * If no, return false.
> + */
> +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> +			     struct super_block *sb, bool range_cyclic,
> +			     enum wb_reason reason)
> +{
> +	struct wb_writeback_work *work;
> +	bool merged = false;
> +
> +	spin_lock_bh(&wb->work_lock);
> +	list_for_each_entry(work, &wb->work_list, list) {

Is the length of the list bounded somehow? In other words, is it possible
that the spinlock would be held for too long while traversing the whole list?

> +		if (work->reason == reason &&
> +		    work->range_cyclic == range_cyclic &&
> +		    work->auto_free == 1 && work->sb == sb &&
> +		    work->for_sync == 0) {
> +			work->nr_pages += nr_pages;
> +			merged = true;
> +			trace_writeback_merged(wb, work);
> +			break;
> +		}
> +	}
> +	spin_unlock_bh(&wb->work_lock);
> +
> +	return merged;
> +}
Jan Kara May 5, 2016, 9:07 a.m. UTC | #2
On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > +/*
> > + * Check whether the request to writeback some pages can be merged with some
> > + * other request which is already pending. If yes, merge it and return true.
> > + * If no, return false.
> > + */
> > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > +			     struct super_block *sb, bool range_cyclic,
> > +			     enum wb_reason reason)
> > +{
> > +	struct wb_writeback_work *work;
> > +	bool merged = false;
> > +
> > +	spin_lock_bh(&wb->work_lock);
> > +	list_for_each_entry(work, &wb->work_list, list) {
> 
> Is the length of the list bounded somehow? In other words, is it possible
> that the spinlock would be held for too long while traversing the whole list?

I was thinking about this as well. With the merging enabled, the number of
entries queued from wb_start_writeback() is essentially limited by the
number of writeback reasons, and there are only a couple of those. What is
more questionable is the number of entries queued from
__writeback_inodes_sb_nr(). Generally there should be at most a couple of
those as well, but it is hard to give any guarantee since e.g. filesystems
use this function to reduce the amount of delayed-allocation data when they
are running out of space. Hum, maybe we could limit the merging to scan only
the last, say, 16 entries. That should give good results in most cases...
Thoughts?
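Something like this untested sketch, perhaps (the 16-entry cap and the
reverse scan from the tail, where new works get queued, are just one possible
choice, not part of the posted patch):

#define WB_MERGE_SCAN_MAX	16

static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
			     struct super_block *sb, bool range_cyclic,
			     enum wb_reason reason)
{
	struct wb_writeback_work *work;
	bool merged = false;
	int scanned = 0;

	spin_lock_bh(&wb->work_lock);
	/* New works are added to the tail, so walk backwards from there. */
	list_for_each_entry_reverse(work, &wb->work_list, list) {
		if (++scanned > WB_MERGE_SCAN_MAX)
			break;
		if (work->reason == reason &&
		    work->range_cyclic == range_cyclic &&
		    work->auto_free == 1 && work->sb == sb &&
		    work->for_sync == 0) {
			work->nr_pages += nr_pages;
			merged = true;
			trace_writeback_merged(wb, work);
			break;
		}
	}
	spin_unlock_bh(&wb->work_lock);

	return merged;
}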

								Honza

> > +		if (work->reason == reason &&
> > +		    work->range_cyclic == range_cyclic &&
> > +		    work->auto_free == 1 && work->sb == sb &&
> > +		    work->for_sync == 0) {
> > +			work->nr_pages += nr_pages;
> > +			merged = true;
> > +			trace_writeback_merged(wb, work);
> > +			break;
> > +		}
> > +	}
> > +	spin_unlock_bh(&wb->work_lock);
> > +
> > +	return merged;
> > +}
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko May 5, 2016, 9:18 a.m. UTC | #3
On Thu 05-05-16 11:07:50, Jan Kara wrote:
> On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > > +/*
> > > + * Check whether the request to writeback some pages can be merged with some
> > > + * other request which is already pending. If yes, merge it and return true.
> > > + * If no, return false.
> > > + */
> > > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > > +			     struct super_block *sb, bool range_cyclic,
> > > +			     enum wb_reason reason)
> > > +{
> > > +	struct wb_writeback_work *work;
> > > +	bool merged = false;
> > > +
> > > +	spin_lock_bh(&wb->work_lock);
> > > +	list_for_each_entry(work, &wb->work_list, list) {
> > 
> > Is the length of the list bounded somehow? In other words, is it possible
> > that the spinlock would be held for too long while traversing the whole list?
> 
> I was thinking about this as well. With the merging enabled, the number of
> entries queued from wb_start_writeback() is essentially limited by the
> number of writeback reasons, and there are only a couple of those. What is
> more questionable is the number of entries queued from
> __writeback_inodes_sb_nr(). Generally there should be at most a couple of
> those as well, but it is hard to give any guarantee since e.g. filesystems
> use this function to reduce the amount of delayed-allocation data when they
> are running out of space. Hum, maybe we could limit the merging to scan only
> the last, say, 16 entries. That should give good results in most cases...
> Thoughts?

Yes, I guess this sounds reasonable. The merging doesn't have to be
perfect. We primarily want to get rid of (potentially) thousands of
duplicates. And there is always a GFP_NOWAIT fallback IIUC.
Andrew Morton May 5, 2016, 9:37 p.m. UTC | #4
On Thu, 5 May 2016 11:07:50 +0200 Jan Kara <jack@suse.cz> wrote:

> On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > > +/*
> > > + * Check whether the request to writeback some pages can be merged with some
> > > + * other request which is already pending. If yes, merge it and return true.
> > > + * If no, return false.
> > > + */
> > > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > > +			     struct super_block *sb, bool range_cyclic,
> > > +			     enum wb_reason reason)
> > > +{
> > > +	struct wb_writeback_work *work;
> > > +	bool merged = false;
> > > +
> > > +	spin_lock_bh(&wb->work_lock);
> > > +	list_for_each_entry(work, &wb->work_list, list) {
> > 
> > Is the length of the list bounded somehow? In other words, is it possible
> > that the spinlock would be held for too long while traversing the whole list?
> 
> I was thinking about this as well. With the merging enabled, the number of
> entries queued from wb_start_writeback() is essentially limited by the
> number of writeback reasons, and there are only a couple of those. What is
> more questionable is the number of entries queued from
> __writeback_inodes_sb_nr(). Generally there should be at most a couple of
> those as well, but it is hard to give any guarantee since e.g. filesystems
> use this function to reduce the amount of delayed-allocation data when they
> are running out of space. Hum, maybe we could limit the merging to scan only
> the last, say, 16 entries. That should give good results in most cases...
> Thoughts?

If it's possible to cause a search complexity meltdown, someone will
find a way :(

Is there any reason why the requests coming out of
writeback_inodes_sb_nr() cannot also be merged?

Your wb_merge_request() doesn't check ->tagged_writepages?

Why is ->for_sync handled differently?  Needs a comment.

Suggest turning this into a separate function.  Build a local
wb_writeback_work in wb_start_writeback(), do:


	/* comment goes here */
	if (new->reason != old->reason)
		return false;
	/* comment goes here */
	if (new->range_cyclic != old->range_cyclic)
		return false;
	return true;

then copy wb_start_writeback()'s local wb_writeback_work into the
newly-allocated one if needed (kmemdup()).  Or just pass a billion args
into that comparison function.
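
Roughly like this, perhaps (untested sketch; wb_work_mergeable() and
wb_try_merge() are made-up names, and the allocation-failure fallback just
mirrors what wb_start_writeback() already does today):

static bool wb_work_mergeable(const struct wb_writeback_work *new,
			      const struct wb_writeback_work *old)
{
	/* Keep different writeback reasons apart for tracing/accounting. */
	if (new->reason != old->reason)
		return false;
	/* Cyclic and whole-range requests cover different page ranges. */
	if (new->range_cyclic != old->range_cyclic)
		return false;
	/* Never merge into a work item somebody is waiting on. */
	if (!old->auto_free || old->for_sync)
		return false;
	return new->sb == old->sb;
}

and in wb_start_writeback():

	struct wb_writeback_work new = {
		.nr_pages	= nr_pages,
		.sync_mode	= WB_SYNC_NONE,
		.range_cyclic	= range_cyclic,
		.reason		= reason,
		.auto_free	= 1,
	};
	struct wb_writeback_work *work;

	/* wb_try_merge() walks wb->work_list under work_lock and calls
	 * wb_work_mergeable() on each candidate. */
	if (wb_try_merge(wb, &new))
		return;

	work = kmemdup(&new, sizeof(new), GFP_ATOMIC);
	if (!work) {
		trace_writeback_nowork(wb);
		wb_wakeup(wb);
		return;
	}
	wb_queue_work(wb, work);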

bdi_split_work_to_wbs() does GFP_ATOMIC as well.  Problem?  (Why the
heck don't we document the *reasons* for these things, sigh).

I suspect it would be best to be proactive here and use some smarter
data structure.  It appears that all the wb_writeback_work fields
except sb can be squeezed into a single word so perhaps a radix-tree. 
Or hash them all together and use a chained array or something.  Maybe
fiddle at it for an hour or so, see how it's looking?  It's a lot of
fuss to avoid one problematic kmalloc(), sigh.
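For the single-word idea, something like this sketch (the bit layout is
arbitrary, nr_pages is left out since it is what gets accumulated on a merge,
and sb would still need to be hashed in or used as a second key):

static unsigned long wb_work_key(const struct wb_writeback_work *work)
{
	return	((unsigned long)work->reason	 << 4) |
		(work->range_cyclic		 << 3) |
		(work->for_sync			 << 2) |
		(work->tagged_writepages	 << 1) |
		 work->auto_free;
}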

We really don't want there to be *any* pathological workload which
results in merging failures - if that's the case then someone will hit
it.  They'll experience the ooms (perhaps) and the search complexity
issues (for sure).


Jan Kara May 12, 2016, 4:08 p.m. UTC | #5
On Thu 05-05-16 14:37:51, Andrew Morton wrote:
> On Thu, 5 May 2016 11:07:50 +0200 Jan Kara <jack@suse.cz> wrote:
> 
> > On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > > > +/*
> > > > + * Check whether the request to writeback some pages can be merged with some
> > > > + * other request which is already pending. If yes, merge it and return true.
> > > > + * If no, return false.
> > > > + */
> > > > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > > > +			     struct super_block *sb, bool range_cyclic,
> > > > +			     enum wb_reason reason)
> > > > +{
> > > > +	struct wb_writeback_work *work;
> > > > +	bool merged = false;
> > > > +
> > > > +	spin_lock_bh(&wb->work_lock);
> > > > +	list_for_each_entry(work, &wb->work_list, list) {
> > > 
> > > Is the length of the list bounded somehow? In other words, is it possible
> > > that the spinlock would be held for too long while traversing the whole list?
> > 
> > I was thinking about this as well. With the merging enabled, the number of
> > entries queued from wb_start_writeback() is essentially limited by the
> > number of writeback reasons, and there are only a couple of those. What is
> > more questionable is the number of entries queued from
> > __writeback_inodes_sb_nr(). Generally there should be at most a couple of
> > those as well, but it is hard to give any guarantee since e.g. filesystems
> > use this function to reduce the amount of delayed-allocation data when they
> > are running out of space. Hum, maybe we could limit the merging to scan only
> > the last, say, 16 entries. That should give good results in most cases...
> > Thoughts?
> 
> If it's possible to cause a search complexity meltdown, someone will
> find a way :(

Agreed.

> Is there any reason why the requests coming out of
> writeback_inodes_sb_nr() cannot also be merged?

So my thought was that since these items are essentially synchronous -
writeback_inodes_sb_nr() returns only after the worker submits the IO - we
don't want to block waiters longer than necessary by bumping the number of
pages to write. Also it would be tricky to properly handle completions of
work->done when work items are merged. Finally, since the submitter waits
for work to be completed, this pretty much limits the capability of this
path to DoS the system.

> Your wb_merge_request() doesn't check ->tagged_writepages?
> 
> Why is ->for_sync handled differently?  Needs a comment.
> 
> Suggest turning this into a separate function.  Build a local
> wb_writeback_work in wb_start_writeback(), do:
> 
> 
> 	/* comment goes here */
> 	if (new->reason != old->reason)
> 		return false;
> 	/* comment goes here */
> 	if (new->range_cyclic != old->range_cyclic)
> 		return false;
> 	return true;
> 
> then copy wb_start_writeback()'s local wb_writeback_work into the
> newly-allocated one if needed (kmemdup()).  Or just pass a billion args
> into that comparison function.
> 
> bdi_split_work_to_wbs() does GFP_ATOMIC as well.  Problem?  (Why the
> heck don't we document the *reasons* for these things, sigh).

Heh, there are many more GFP_ATOMIC allocations in fs/fs-writeback.c after
Tejun's memcg-aware writeback... I believe they are GFP_ATOMIC mostly
because they can already be called from direct reclaim (e.g. when
requesting pages to be written through wakeup_flusher_threads()) and so we
don't want to recurse into the direct reclaim code again.

> I suspect it would be best to be proactive here and use some smarter
> data structure.  It appears that all the wb_writeback_work fields
> except sb can be squeezed into a single word so perhaps a radix-tree. 
> Or hash them all together and use a chained array or something.  Maybe
> fiddle at it for an hour or so, see how it's looking?  It's a lot of
> fuss to avoid one problematic kmalloc(), sigh.
> 
> We really don't want there to be *any* pathological workload which
> results in merging failures - if that's the case then someone will hit
> it.  They'll experience the ooms (perhaps) and the search complexity
> issues (for sure).

So the question is what the desired outcome is. After Tetsuo's patch
"mm,writeback: Don't use memory reserves for wb_start_writeback" we will
use GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN instead of GFP_ATOMIC in
wb_start_writeback(). We can treat other places using GFP_ATOMIC in a
similar way. So my thought was that this is enough to avoid exhaustion of
the reserves for writeback work items under memory pressure. The merging of
writeback works I proposed was more of an optimization to avoid unnecessary
allocations. In that case we can allow imperfection and possibly large lists
of queued works in pathological cases - I agree we should not DoS the system
by going through large linked lists in any case, but that is easily avoided
if we are fine with the fact that merging won't always happen when it could.
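
With that change the allocation in wb_start_writeback() becomes (sketch; the
fallback below it stays as it is today):

	work = kzalloc(sizeof(*work),
		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!work) {
		trace_writeback_nowork(wb);
		wb_wakeup(wb);
		return;
	}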

The question which is not clear to me is: do we want to guard against a
malicious attacker that may be consuming memory through writeback works
that are allocated via GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN?
If yes, then my patch needs further thought. Any opinions?

								Honza
Michal Hocko May 16, 2016, 11:45 a.m. UTC | #6
On Thu 12-05-16 18:08:29, Jan Kara wrote:
> On Thu 05-05-16 14:37:51, Andrew Morton wrote:
[...]
> > bdi_split_work_to_wbs() does GFP_ATOMIC as well.  Problem?  (Why the
> > heck don't we document the *reasons* for these things, sigh).
> 
> Heh, there are many more GFP_ATOMIC allocations in fs/fs-writeback.c after
> Tejun's memcg-aware writeback... I believe they are GFP_ATOMIC mostly
> because they can already be called from direct reclaim (e.g. when
> requesting pages to be written through wakeup_flusher_threads()) and so we
> don't want to recurse into the direct reclaim code again.

If that is the case then __GFP_DIRECT_RECLAIM should be cleared rather
than GFP_ATOMIC abused.
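I.e. something like this sketch: keep the kswapd wakeup and the normal
watermarks, just never enter direct reclaim from a path that may itself be
running as part of reclaim:

	work = kzalloc(sizeof(*work), GFP_KERNEL & ~__GFP_DIRECT_RECLAIM);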

> > I suspect it would be best to be proactive here and use some smarter
> > data structure.  It appears that all the wb_writeback_work fields
> > except sb can be squeezed into a single word so perhaps a radix-tree. 
> > Or hash them all together and use a chained array or something.  Maybe
> > fiddle at it for an hour or so, see how it's looking?  It's a lot of
> > fuss to avoid one problematic kmalloc(), sigh.
> > 
> > We really don't want there to be *any* pathological workload which
> > results in merging failures - if that's the case then someone will hit
> > it.  They'll experience the ooms (perhaps) and the search complexity
> > issues (for sure).
> 
> So the question is what the desired outcome is. After Tetsuo's patch
> "mm,writeback: Don't use memory reserves for wb_start_writeback" we will
> use GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN instead of GFP_ATOMIC in
> wb_start_writeback(). We can treat other places using GFP_ATOMIC in a
> similar way. So my thought was that this is enough to avoid exhaustion of
> the reserves for writeback work items under memory pressure. The merging of
> writeback works I proposed was more of an optimization to avoid unnecessary
> allocations. In that case we can allow imperfection and possibly large lists
> of queued works in pathological cases - I agree we should not DoS the system
> by going through large linked lists in any case, but that is easily avoided
> if we are fine with the fact that merging won't always happen when it could.

Yes, I think this is acceptable.

> The question which is not clear to me is: do we want to guard against a
> malicious attacker that may be consuming memory through writeback works
> that are allocated via GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN?
> If yes, then my patch needs further thought. Any opinions?

GFP_NOWAIT still kicks kswapd, so there is some reclaim activity in the
background. Sure, if we can reduce the number of those requests it would
be better, because we lose the natural throttling without the direct
reclaim. But I am not sure I can see how this would cause a major problem
(slowing everybody down - quite possible - but not a DoS AFAICS).

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fee81e8768c9..bb6725f5b1ba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -189,6 +189,35 @@  out_unlock:
 	spin_unlock_bh(&wb->work_lock);
 }
 
+/*
+ * Check whether the request to writeback some pages can be merged with some
+ * other request which is already pending. If yes, merge it and return true.
+ * If no, return false.
+ */
+static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
+			     struct super_block *sb, bool range_cyclic,
+			     enum wb_reason reason)
+{
+	struct wb_writeback_work *work;
+	bool merged = false;
+
+	spin_lock_bh(&wb->work_lock);
+	list_for_each_entry(work, &wb->work_list, list) {
+		if (work->reason == reason &&
+		    work->range_cyclic == range_cyclic &&
+		    work->auto_free == 1 && work->sb == sb &&
+		    work->for_sync == 0) {
+			work->nr_pages += nr_pages;
+			merged = true;
+			trace_writeback_merged(wb, work);
+			break;
+		}
+	}
+	spin_unlock_bh(&wb->work_lock);
+
+	return merged;
+}
+
 /**
  * wb_wait_for_completion - wait for completion of bdi_writeback_works
  * @bdi: bdi work items were issued to
@@ -928,6 +957,14 @@  void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 		return;
 
 	/*
+	 * Can we merge the current request with another pending one? That saves
+	 * us an atomic allocation, which can be significant e.g. when MM is
+	 * under pressure and calls wakeup_flusher_threads() a lot.
+	 */
+	if (wb_merge_request(wb, nr_pages, NULL, range_cyclic, reason))
+		return;
+
+	/*
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 73614ce1d204..84ad9fac475b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -252,6 +252,7 @@  DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_merged);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),