Message ID | 1462436092-32665-1-git-send-email-jack@suse.cz (mailing list archive) |
---|---|
State | New, archived |
On Thu 05-05-16 10:14:52, Jan Kara wrote:
> When the system is under memory pressure, memory management frequently
> calls wakeup_flusher_threads() to write back pages so that they can be
> freed. This was observed to exhaust reserves for atomic allocations since
> wakeup_flusher_threads() allocates one writeback work for each device
> with dirty data with GFP_ATOMIC.
>
> However it is pointless to allocate new work items when the requested
> work is identical. Instead, we can merge the new work with the pending
> work items and thus save a memory allocation.

Makes sense. See one question below:

> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c                | 37 +++++++++++++++++++++++++++++++++++++
>  include/trace/events/writeback.h |  1 +
>  2 files changed, 38 insertions(+)
>
> This is a patch which should (and in my basic testing does) address the
> issues with many atomic allocations Tetsuo reported. What do people think?
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index fee81e8768c9..bb6725f5b1ba 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -189,6 +189,35 @@ out_unlock:
>  	spin_unlock_bh(&wb->work_lock);
>  }
>
> +/*
> + * Check whether the request to writeback some pages can be merged with some
> + * other request which is already pending. If yes, merge it and return true.
> + * If no, return false.
> + */
> +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> +			     struct super_block *sb, bool range_cyclic,
> +			     enum wb_reason reason)
> +{
> +	struct wb_writeback_work *work;
> +	bool merged = false;
> +
> +	spin_lock_bh(&wb->work_lock);
> +	list_for_each_entry(work, &wb->work_list, list) {

Is the length of the list bounded somehow? In other words is it possible
that the spinlock would be held for too long to traverse the whole list?

> +		if (work->reason == reason &&
> +		    work->range_cyclic == range_cyclic &&
> +		    work->auto_free == 1 && work->sb == sb &&
> +		    work->for_sync == 0) {
> +			work->nr_pages += nr_pages;
> +			merged = true;
> +			trace_writeback_merged(wb, work);
> +			break;
> +		}
> +	}
> +	spin_unlock_bh(&wb->work_lock);
> +
> +	return merged;
> +}
On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > +/*
> > + * Check whether the request to writeback some pages can be merged with some
> > + * other request which is already pending. If yes, merge it and return true.
> > + * If no, return false.
> > + */
> > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > +			     struct super_block *sb, bool range_cyclic,
> > +			     enum wb_reason reason)
> > +{
> > +	struct wb_writeback_work *work;
> > +	bool merged = false;
> > +
> > +	spin_lock_bh(&wb->work_lock);
> > +	list_for_each_entry(work, &wb->work_list, list) {
>
> Is the length of the list bounded somehow? In other words is it possible
> that the spinlock would be held for too long to traverse the whole list?

I was thinking about this as well. With the merging enabled, the number of
entries queued from wb_start_writeback() is essentially limited by the
number of writeback reasons and there's only a couple of those. What is
more questionable is the number of entries queued from
__writeback_inodes_sb_nr(). Generally there should be a couple at maximum
as well, but it is hard to give any guarantee since e.g. filesystems use
this function to reduce the amount of delay-allocated data when they are
running out of space. Hum, maybe we could limit the merging to scan only
the last say 16 entries. That should give good results in most cases...
Thoughts?

								Honza

> > +		if (work->reason == reason &&
> > +		    work->range_cyclic == range_cyclic &&
> > +		    work->auto_free == 1 && work->sb == sb &&
> > +		    work->for_sync == 0) {
> > +			work->nr_pages += nr_pages;
> > +			merged = true;
> > +			trace_writeback_merged(wb, work);
> > +			break;
> > +		}
> > +	}
> > +	spin_unlock_bh(&wb->work_lock);
> > +
> > +	return merged;
> > +}
> --
> Michal Hocko
> SUSE Labs
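For readers following the thread, here is a minimal sketch of the bounded
scan Jan proposes above. It reuses the wb_merge_request() body from the
posted patch; the MAX_MERGE_SCAN constant, the reverse walk from the tail
of the queue and the early bail-out are illustrative additions, not code
from the patch.

#define MAX_MERGE_SCAN	16	/* illustrative cap, not from the patch */

static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
			     struct super_block *sb, bool range_cyclic,
			     enum wb_reason reason)
{
	struct wb_writeback_work *work;
	int scanned = 0;
	bool merged = false;

	spin_lock_bh(&wb->work_lock);
	/* Walk from the tail, i.e. the most recently queued works. */
	list_for_each_entry_reverse(work, &wb->work_list, list) {
		/* Bound the time spent under wb->work_lock. */
		if (++scanned > MAX_MERGE_SCAN)
			break;
		if (work->reason == reason &&
		    work->range_cyclic == range_cyclic &&
		    work->auto_free == 1 && work->sb == sb &&
		    work->for_sync == 0) {
			work->nr_pages += nr_pages;
			merged = true;
			trace_writeback_merged(wb, work);
			break;
		}
	}
	spin_unlock_bh(&wb->work_lock);

	return merged;
}

A failed merge simply falls back to allocating a new work item, so a
conservative cap only costs an occasional extra allocation.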
On Thu 05-05-16 11:07:50, Jan Kara wrote:
> On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > > +/*
> > > + * Check whether the request to writeback some pages can be merged with some
> > > + * other request which is already pending. If yes, merge it and return true.
> > > + * If no, return false.
> > > + */
> > > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > > +			     struct super_block *sb, bool range_cyclic,
> > > +			     enum wb_reason reason)
> > > +{
> > > +	struct wb_writeback_work *work;
> > > +	bool merged = false;
> > > +
> > > +	spin_lock_bh(&wb->work_lock);
> > > +	list_for_each_entry(work, &wb->work_list, list) {
> >
> > Is the length of the list bounded somehow? In other words is it possible
> > that the spinlock would be held for too long to traverse the whole list?
>
> I was thinking about this as well. With the merging enabled, the number of
> entries queued from wb_start_writeback() is essentially limited by the
> number of writeback reasons and there's only a couple of those. What is
> more questionable is the number of entries queued from
> __writeback_inodes_sb_nr(). Generally there should be a couple at maximum
> as well, but it is hard to give any guarantee since e.g. filesystems use
> this function to reduce the amount of delay-allocated data when they are
> running out of space. Hum, maybe we could limit the merging to scan only
> the last say 16 entries. That should give good results in most cases...
> Thoughts?

Yes, I guess this sounds reasonable. The merging doesn't have to be
perfect. We primarily want to get rid of (potentially) thousands of
duplicates. And there is always a GFP_NOWAIT fallback IIUC.
On Thu, 5 May 2016 11:07:50 +0200 Jan Kara <jack@suse.cz> wrote:

> On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > > +/*
> > > + * Check whether the request to writeback some pages can be merged with some
> > > + * other request which is already pending. If yes, merge it and return true.
> > > + * If no, return false.
> > > + */
> > > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > > +			     struct super_block *sb, bool range_cyclic,
> > > +			     enum wb_reason reason)
> > > +{
> > > +	struct wb_writeback_work *work;
> > > +	bool merged = false;
> > > +
> > > +	spin_lock_bh(&wb->work_lock);
> > > +	list_for_each_entry(work, &wb->work_list, list) {
> >
> > Is the length of the list bounded somehow? In other words is it possible
> > that the spinlock would be held for too long to traverse the whole list?
>
> I was thinking about this as well. With the merging enabled, the number of
> entries queued from wb_start_writeback() is essentially limited by the
> number of writeback reasons and there's only a couple of those. What is
> more questionable is the number of entries queued from
> __writeback_inodes_sb_nr(). Generally there should be a couple at maximum
> as well, but it is hard to give any guarantee since e.g. filesystems use
> this function to reduce the amount of delay-allocated data when they are
> running out of space. Hum, maybe we could limit the merging to scan only
> the last say 16 entries. That should give good results in most cases...
> Thoughts?

If it's possible to cause a search complexity meltdown, someone will
find a way :(

Is there any reason why the requests coming out of
writeback_inodes_sb_nr() cannot also be merged?

Your wb_merge_request() doesn't check ->tagged_writepages?

Why is ->for_sync handled differently? Needs a comment.

Suggest turning this into a separate function. Build a local
wb_writeback_work in wb_start_writeback(), do:

	/* comment goes here */
	if (new->reason != old->reason)
		return false;
	/* comment goes here */
	if (new->range_cyclic != old->range_cyclic)
		return false;
	return true;

then copy wb_start_writeback()'s local wb_writeback_work into the
newly-allocated one if needed (kmemdup()). Or just pass a billion args
into that comparison function.

bdi_split_work_to_wbs() does GFP_ATOMIC as well. Problem? (Why the
heck don't we document the *reasons* for these things, sigh).

I suspect it would be best to be proactive here and use some smarter
data structure. It appears that all the wb_writeback_work fields
except sb can be squeezed into a single word so perhaps a radix-tree.
Or hash them all together and use a chained array or something. Maybe
fiddle at it for an hour or so, see how it's looking? It's a lot of
fuss to avoid one problematic kmalloc(), sigh.

We really don't want there to be *any* pathological workload which
results in merging failures - if that's the case then someone will hit
it. They'll experience the ooms (perhaps) and the search complexity
issues (for sure).
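A rough sketch of the restructuring Andrew suggests: a dedicated comparison
helper plus a stack-built template that is kmemdup()'ed only when no pending
work can absorb it. The helper name wb_work_match(), the exact set of fields
compared and the comments are assumptions for illustration, not code from
this thread.

/* Hypothetical helper: can a new request be folded into a queued work? */
static bool wb_work_match(const struct wb_writeback_work *new,
			  const struct wb_writeback_work *old)
{
	/* Keep per-reason accounting and tracing meaningful. */
	if (new->reason != old->reason)
		return false;
	/* Cyclic and ranged writeback cannot be combined. */
	if (new->range_cyclic != old->range_cyclic)
		return false;
	/* Only merge work aimed at the same superblock. */
	if (new->sb != old->sb)
		return false;
	/* Leave sync work and items someone waits on alone. */
	if (!old->auto_free || old->for_sync)
		return false;
	return true;
}

wb_start_writeback() would then fill a wb_writeback_work on the stack, try
the merge under wb->work_lock, and only fall back to something like
kmemdup(&template, sizeof(template), GFP_ATOMIC) when nothing matched.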
On Thu 05-05-16 14:37:51, Andrew Morton wrote:
> On Thu, 5 May 2016 11:07:50 +0200 Jan Kara <jack@suse.cz> wrote:
>
> > On Thu 05-05-16 10:24:33, Michal Hocko wrote:
> > > > +/*
> > > > + * Check whether the request to writeback some pages can be merged with some
> > > > + * other request which is already pending. If yes, merge it and return true.
> > > > + * If no, return false.
> > > > + */
> > > > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
> > > > +			     struct super_block *sb, bool range_cyclic,
> > > > +			     enum wb_reason reason)
> > > > +{
> > > > +	struct wb_writeback_work *work;
> > > > +	bool merged = false;
> > > > +
> > > > +	spin_lock_bh(&wb->work_lock);
> > > > +	list_for_each_entry(work, &wb->work_list, list) {
> > >
> > > Is the length of the list bounded somehow? In other words is it possible
> > > that the spinlock would be held for too long to traverse the whole list?
> >
> > I was thinking about this as well. With the merging enabled, the number of
> > entries queued from wb_start_writeback() is essentially limited by the
> > number of writeback reasons and there's only a couple of those. What is
> > more questionable is the number of entries queued from
> > __writeback_inodes_sb_nr(). Generally there should be a couple at maximum
> > as well, but it is hard to give any guarantee since e.g. filesystems use
> > this function to reduce the amount of delay-allocated data when they are
> > running out of space. Hum, maybe we could limit the merging to scan only
> > the last say 16 entries. That should give good results in most cases...
> > Thoughts?
>
> If it's possible to cause a search complexity meltdown, someone will
> find a way :(

Agreed.

> Is there any reason why the requests coming out of
> writeback_inodes_sb_nr() cannot also be merged?

So my thought was that since these items are essentially synchronous -
writeback_inodes_sb_nr() returns only after the worker submits the IO - we
don't want to block waiters longer than necessary by bumping the number of
pages to write. Also it would be tricky to properly handle completions of
work->done when work items are merged. Finally, since the submitter waits
for work to be completed, this pretty much limits the capability of this
path to DoS the system.

> Your wb_merge_request() doesn't check ->tagged_writepages?
>
> Why is ->for_sync handled differently? Needs a comment.
>
> Suggest turning this into a separate function. Build a local
> wb_writeback_work in wb_start_writeback(), do:
>
> 	/* comment goes here */
> 	if (new->reason != old->reason)
> 		return false;
> 	/* comment goes here */
> 	if (new->range_cyclic != old->range_cyclic)
> 		return false;
> 	return true;
>
> then copy wb_start_writeback()'s local wb_writeback_work into the
> newly-allocated one if needed (kmemdup()). Or just pass a billion args
> into that comparison function.
>
> bdi_split_work_to_wbs() does GFP_ATOMIC as well. Problem? (Why the
> heck don't we document the *reasons* for these things, sigh).

Heh, there are many more GFP_ATOMIC allocations in fs/fs-writeback.c after
Tejun's memcg aware writeback... I believe they are GFP_ATOMIC mostly
because they can already be called from direct reclaim (e.g. when
requesting pages to be written through wakeup_flusher_threads()) and so we
don't want to recurse into direct reclaim code again.

> I suspect it would be best to be proactive here and use some smarter
> data structure. It appears that all the wb_writeback_work fields
> except sb can be squeezed into a single word so perhaps a radix-tree.
> Or hash them all together and use a chained array or something. Maybe
> fiddle at it for an hour or so, see how it's looking? It's a lot of
> fuss to avoid one problematic kmalloc(), sigh.
>
> We really don't want there to be *any* pathological workload which
> results in merging failures - if that's the case then someone will hit
> it. They'll experience the ooms (perhaps) and the search complexity
> issues (for sure).

So the question is what is the desired outcome. After Tetsuo's patch
"mm,writeback: Don't use memory reserves for wb_start_writeback" we will
use GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN instead of GFP_ATOMIC in
wb_start_writeback(). We can treat other places using GFP_ATOMIC in a
similar way. So my thought was that this is enough to avoid exhaustion of
reserves for writeback work items under memory pressure. And the merging of
writeback works I proposed was more like an optimization to avoid
unnecessary allocations. And in that case we can allow imperfection and
possibly large lists of queued works in pathological cases - I agree we
should not DoS the system by going through large linked lists in any case,
but that is easily avoided if we are fine with the fact that merging won't
always happen when it could.

The question which is not clear to me is: Do we want to guard against a
malicious attacker that may be consuming memory through writeback works
that are allocated via GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN?
If yes, then my patch needs further thought. Any opinions?

								Honza
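For context, the allocation change from Tetsuo's patch that Jan refers to
would look roughly like this inside wb_start_writeback(). This is a sketch
built from the flags named above, not the actual hunk; the failure path
mirrors the "wakeup the thread for old dirty data writeback" fallback shown
in the patch context.

	/*
	 * Sketch only: stay out of the atomic/memalloc reserves and do not
	 * warn on failure. GFP_NOWAIT still wakes kswapd but never enters
	 * direct reclaim itself.
	 */
	work = kzalloc(sizeof(*work),
		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!work) {
		/* Allocation failed: just kick the flusher for old dirty data. */
		trace_writeback_nowork(wb);
		wb_wakeup(wb);
		return;
	}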
On Thu 12-05-16 18:08:29, Jan Kara wrote:
> On Thu 05-05-16 14:37:51, Andrew Morton wrote:
[...]
> > bdi_split_work_to_wbs() does GFP_ATOMIC as well. Problem? (Why the
> > heck don't we document the *reasons* for these things, sigh).
>
> Heh, there are many more GFP_ATOMIC allocations in fs/fs-writeback.c after
> Tejun's memcg aware writeback... I believe they are GFP_ATOMIC mostly
> because they can already be called from direct reclaim (e.g. when
> requesting pages to be written through wakeup_flusher_threads()) and so we
> don't want to recurse into direct reclaim code again.

If that is the case then __GFP_DIRECT_RECLAIM should be cleared rather
than GFP_ATOMIC abused.

> > I suspect it would be best to be proactive here and use some smarter
> > data structure. It appears that all the wb_writeback_work fields
> > except sb can be squeezed into a single word so perhaps a radix-tree.
> > Or hash them all together and use a chained array or something. Maybe
> > fiddle at it for an hour or so, see how it's looking? It's a lot of
> > fuss to avoid one problematic kmalloc(), sigh.
> >
> > We really don't want there to be *any* pathological workload which
> > results in merging failures - if that's the case then someone will hit
> > it. They'll experience the ooms (perhaps) and the search complexity
> > issues (for sure).
>
> So the question is what is the desired outcome. After Tetsuo's patch
> "mm,writeback: Don't use memory reserves for wb_start_writeback" we will
> use GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN instead of GFP_ATOMIC in
> wb_start_writeback(). We can treat other places using GFP_ATOMIC in a
> similar way. So my thought was that this is enough to avoid exhaustion of
> reserves for writeback work items under memory pressure. And the merging of
> writeback works I proposed was more like an optimization to avoid
> unnecessary allocations. And in that case we can allow imperfection and
> possibly large lists of queued works in pathological cases - I agree we
> should not DoS the system by going through large linked lists in any case,
> but that is easily avoided if we are fine with the fact that merging won't
> always happen when it could.

Yes I think this is acceptable.

> The question which is not clear to me is: Do we want to guard against a
> malicious attacker that may be consuming memory through writeback works
> that are allocated via GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN?
> If yes, then my patch needs further thought. Any opinions?

GFP_NOWAIT still kicks kswapd so there is some reclaim activity in the
background. Sure, if we can reduce the number of those requests it would
be better because we are losing natural throttling without the direct
reclaim. But I am not sure I can see how this would cause a major problem
(slow down everybody - quite possible - but not a DoS AFAICS).
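To make Michal's __GFP_DIRECT_RECLAIM remark concrete, the idea is roughly
the following (a sketch, not code from the thread): start from a normal
allocation mask and clear only the bit that allows recursing into direct
reclaim, instead of switching to GFP_ATOMIC and its access to the reserves.

	/* Sketch: forbid direct-reclaim recursion, keep kswapd reclaim. */
	const gfp_t gfp = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;

	/* May still fail under pressure, but never dips into reserves. */
	work = kzalloc(sizeof(*work), gfp);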
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index fee81e8768c9..bb6725f5b1ba 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -189,6 +189,35 @@ out_unlock:
 	spin_unlock_bh(&wb->work_lock);
 }
 
+/*
+ * Check whether the request to writeback some pages can be merged with some
+ * other request which is already pending. If yes, merge it and return true.
+ * If no, return false.
+ */
+static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages,
+			     struct super_block *sb, bool range_cyclic,
+			     enum wb_reason reason)
+{
+	struct wb_writeback_work *work;
+	bool merged = false;
+
+	spin_lock_bh(&wb->work_lock);
+	list_for_each_entry(work, &wb->work_list, list) {
+		if (work->reason == reason &&
+		    work->range_cyclic == range_cyclic &&
+		    work->auto_free == 1 && work->sb == sb &&
+		    work->for_sync == 0) {
+			work->nr_pages += nr_pages;
+			merged = true;
+			trace_writeback_merged(wb, work);
+			break;
+		}
+	}
+	spin_unlock_bh(&wb->work_lock);
+
+	return merged;
+}
+
 /**
  * wb_wait_for_completion - wait for completion of bdi_writeback_works
  * @bdi: bdi work items were issued to
@@ -928,6 +957,14 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
 		return;
 
 	/*
+	 * Can we merge current request with another pending one - saves us
+	 * atomic allocation which can be significant e.g. when MM is under
+	 * pressure and calls wake_up_flusher_threads() a lot.
+	 */
+	if (wb_merge_request(wb, nr_pages, NULL, range_cyclic, reason))
+		return;
+
+	/*
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 73614ce1d204..84ad9fac475b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -252,6 +252,7 @@ DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_written);
 DEFINE_WRITEBACK_WORK_EVENT(writeback_wait);
+DEFINE_WRITEBACK_WORK_EVENT(writeback_merged);
 
 TRACE_EVENT(writeback_pages_written,
 	TP_PROTO(long pages_written),
When the system is under memory pressure, memory management frequently
calls wakeup_flusher_threads() to write back pages so that they can be
freed. This was observed to exhaust reserves for atomic allocations since
wakeup_flusher_threads() allocates one writeback work for each device with
dirty data with GFP_ATOMIC.

However it is pointless to allocate new work items when the requested work
is identical. Instead, we can merge the new work with the pending work
items and thus save a memory allocation.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                | 37 +++++++++++++++++++++++++++++++++++++
 include/trace/events/writeback.h |  1 +
 2 files changed, 38 insertions(+)

This is a patch which should (and in my basic testing does) address the
issues with many atomic allocations Tetsuo reported. What do people think?