diff mbox

[1/3] fs: Perform writebacks under memalloc_nofs

Message ID 20180321224429.15860-2-rgoldwyn@suse.de (mailing list archive)
State New, archived
Headers show

Commit Message

Goldwyn Rodrigues March 21, 2018, 10:44 p.m. UTC
From: Goldwyn Rodrigues <rgoldwyn@suse.com>

writebacks can recurse into itself under low memory situations.
Set memalloc_nofs_save() in order to make sure it does not
recurse.

Since writeouts are performed for fdatawrites as well, set
memalloc_nofs_save while performing fdatawrites.

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
---
 fs/fs-writeback.c | 8 ++++++++
 fs/xfs/xfs_aops.c | 7 -------
 mm/filemap.c      | 3 +++
 3 files changed, 11 insertions(+), 7 deletions(-)

Comments

Michal Hocko March 22, 2018, 7:08 a.m. UTC | #1
On Wed 21-03-18 17:44:27, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
> 
> writebacks can recurse into itself under low memory situations.
> Set memalloc_nofs_save() in order to make sure it does not
> recurse.

How? We are not doing writeback from the direct reclaim context.
Goldwyn Rodrigues March 27, 2018, 12:52 p.m. UTC | #2
On 03/22/2018 02:08 AM, Michal Hocko wrote:
> On Wed 21-03-18 17:44:27, Goldwyn Rodrigues wrote:
>> From: Goldwyn Rodrigues <rgoldwyn@suse.com>
>>
>> writebacks can recurse into itself under low memory situations.
>> Set memalloc_nofs_save() in order to make sure it does not
>> recurse.
> 
> How? We are not doing writeback from the direct reclaim context.
> 

I am not sure if I missed a condition in the code, but here is one of
the call lineup:

writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
__alloc_pages_nodemask -> __alloc_pages_slowpath ->
__alloc_pages_direct_reclaim() -> try_to_free_pages() ->
do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
do_writepages -> writepages()

Please note, most filesystems currently have a safeguard in writepage()
which will return if the PF_MEMALLOC is set. The other safeguard is
__GFP_FS which we are trying to eliminate.
Matthew Wilcox March 27, 2018, 2:21 p.m. UTC | #3
On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote:
> I am not sure if I missed a condition in the code, but here is one of
> the call lineup:
> 
> writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
> __alloc_pages_nodemask -> __alloc_pages_slowpath ->
> __alloc_pages_direct_reclaim() -> try_to_free_pages() ->
> do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
> evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
> do_writepages -> writepages()
> 
> Please note, most filesystems currently have a safeguard in writepage()
> which will return if the PF_MEMALLOC is set. The other safeguard is
> __GFP_FS which we are trying to eliminate.

But is that harmful?  ext4_writepage() (for example) says that it will
not deadlock in that circumstance:

 * We can get recursively called as show below.
 *
 *      ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() ->
 *              ext4_writepage()
 *
 * But since we don't do any block allocation we should not deadlock.
 * Page also have the dirty flag cleared so we don't get recurive page_lock.

One might well argue that it's not *useful*; if we've gone into
writepage already, there's no point in re-entering writepage.  And the
last thing we want to do is 
But I could see filesystems behaving differently when entered
for writepage-for-regularly-scheduled-writeback versus
writepage-for-shrinking, so maybe they can make progress.

Maybe no real filesystem behaves that way.  We need feedback from
filesystem people.
Goldwyn Rodrigues March 27, 2018, 3:13 p.m. UTC | #4
On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote:
>> I am not sure if I missed a condition in the code, but here is one of
>> the call lineup:
>>
>> writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
>> __alloc_pages_nodemask -> __alloc_pages_slowpath ->
>> __alloc_pages_direct_reclaim() -> try_to_free_pages() ->
>> do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
>> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
>> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
>> evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
>> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
>> do_writepages -> writepages()
>>
>> Please note, most filesystems currently have a safeguard in writepage()
>> which will return if the PF_MEMALLOC is set. The other safeguard is
>> __GFP_FS which we are trying to eliminate.
> 
> But is that harmful?  ext4_writepage() (for example) says that it will
> not deadlock in that circumstance:

No, it is not harmful.

> 
>  * We can get recursively called as show below.
>  *
>  *      ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() ->
>  *              ext4_writepage()
>  *
>  * But since we don't do any block allocation we should not deadlock.
>  * Page also have the dirty flag cleared so we don't get recurive page_lock.

Yes, and it avoids this by checking for PF_MEMALLOC flag.

> 
> One might well argue that it's not *useful*; if we've gone into
> writepage already, there's no point in re-entering writepage.  And the
> last thing we want to do is 

?

> But I could see filesystems behaving differently when entered
> for writepage-for-regularly-scheduled-writeback versus
> writepage-for-shrinking, so maybe they can make progress.
> 

do_writepages() is the same for both, and hence the memalloc_* API patch.

> Maybe no real filesystem behaves that way.  We need feedback from
> filesystem people.

The idea is to:
* Keep a central location for check, rather than individual filesystem
writepage(). It should reduce code as well.
* Filesystem developers call memory allocations without thinking twice
about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
eliminate GFP_NOFS.
Matthew Wilcox March 27, 2018, 4:45 p.m. UTC | #5
On Tue, Mar 27, 2018 at 10:13:53AM -0500, Goldwyn Rodrigues wrote:
> On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> > On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote:
> >> I am not sure if I missed a condition in the code, but here is one of
> >> the call lineup:
> >>
> >> writepages() -> writepage() -> kmalloc() -> __alloc_pages() ->
> >> __alloc_pages_nodemask -> __alloc_pages_slowpath ->
> >> __alloc_pages_direct_reclaim() -> try_to_free_pages() ->
> >> do_try_to_free_pages() -> shrink_zones() -> shrink_node() ->
> >> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() ->
> >> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() ->
> >> evict(inode) -> evict_inode() for ext4 ->  filemap_write_and_wait() ->
> >> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() ->
> >> do_writepages -> writepages()
> >>
> >> Please note, most filesystems currently have a safeguard in writepage()
> >> which will return if the PF_MEMALLOC is set. The other safeguard is
> >> __GFP_FS which we are trying to eliminate.
> > 
> > But is that harmful?  ext4_writepage() (for example) says that it will
> > not deadlock in that circumstance:
> 
> No, it is not harmful.
> 
> > 
> >  * We can get recursively called as show below.
> >  *
> >  *      ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() ->
> >  *              ext4_writepage()
> >  *
> >  * But since we don't do any block allocation we should not deadlock.
> >  * Page also have the dirty flag cleared so we don't get recurive page_lock.
> 
> Yes, and it avoids this by checking for PF_MEMALLOC flag.
> 
> > 
> > One might well argue that it's not *useful*; if we've gone into
> > writepage already, there's no point in re-entering writepage.  And the
> > last thing we want to do is 
> 
> ?

Sorry, got cut off.  The last thing we want to do is blow the stack by
recursing too deeply, but I don't think we're going to go through this
loop more than once.

> > But I could see filesystems behaving differently when entered
> > for writepage-for-regularly-scheduled-writeback versus
> > writepage-for-shrinking, so maybe they can make progress.
> > 
> 
> do_writepages() is the same for both, and hence the memalloc_* API patch.

But we don't want to avoid this particular recursion.  We only need to
avoid the recursion if it would result in a deadlock.

> > Maybe no real filesystem behaves that way.  We need feedback from
> > filesystem people.
> 
> The idea is to:
> * Keep a central location for check, rather than individual filesystem
> writepage(). It should reduce code as well.
> * Filesystem developers call memory allocations without thinking twice
> about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> eliminate GFP_NOFS.

I know the goal is to eliminate GFP_NOFS.  I'm very much in favour
of that idea.  I'm just not sure you're going about it the right way.
Probably we will have a good discussion about it next month.
Michal Hocko March 28, 2018, 7:01 a.m. UTC | #6
On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> 
> 
> On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
[...]
> > Maybe no real filesystem behaves that way.  We need feedback from
> > filesystem people.
> 
> The idea is to:
> * Keep a central location for check, rather than individual filesystem
> writepage(). It should reduce code as well.
> * Filesystem developers call memory allocations without thinking twice
> about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> eliminate GFP_NOFS.

I do not think this is the right approach. We do want to eliminate
explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
usage as well. The later requires that we drop the __GFP_FS only for
those contexts that really might cause reclaim recursion problems. So in
your example, it would be much better to add the scope into those
writepage(s) implementations which actually can trigger the writeback
from the reclaim path rather from the generic implementation which has
no means to know that.
Dave Chinner March 28, 2018, 11:57 p.m. UTC | #7
On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote:
> On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> > 
> > 
> > On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> [...]
> > > Maybe no real filesystem behaves that way.  We need feedback from
> > > filesystem people.
> > 
> > The idea is to:
> > * Keep a central location for check, rather than individual filesystem
> > writepage(). It should reduce code as well.
> > * Filesystem developers call memory allocations without thinking twice
> > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> > eliminate GFP_NOFS.
> 
> I do not think this is the right approach. We do want to eliminate
> explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
> usage as well. The later requires that we drop the __GFP_FS only for
> those contexts that really might cause reclaim recursion problems.

As I've said before, moving to a scoped API will not reduce the
number of GFP_NOFS scope allocation points - removing individual
GFP_NOFS annotations doesn't do anything to avoid the deadlock paths
it protects against.

The issue is that GFP_NOFS is a big hammer - it stops reclaim from
all filesystem scopes, not just the one we hold locks on and are
doing the allocation for. i.e. we can be in one filesystem and quite
safely do reclaim from other filesystems. The global scope of
GFP_NOFS just doesn't allow this sort of fine-grained control to be
expressed in reclaim.

IOWs, if we want to reduce the scope of GFP_NOFS, we need a context
to be passed from allocation to reclaim so that the reclaim context
can check that it's a safe allocation context to reclaim from. e.g.
for GFP_NOFS, we can use the superblock of the allocating filesystem
as the context, and check it against the superblock that the current
reclaim context (e.g. shrinker invocation) belongs to. If they
match, we skip it. If they don't match, then we can perform reclaim
on that context.

Cheers,

Dave.
Michal Hocko March 29, 2018, 7:01 a.m. UTC | #8
On Thu 29-03-18 10:57:02, Dave Chinner wrote:
> On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote:
> > On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> > > 
> > > 
> > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> > [...]
> > > > Maybe no real filesystem behaves that way.  We need feedback from
> > > > filesystem people.
> > > 
> > > The idea is to:
> > > * Keep a central location for check, rather than individual filesystem
> > > writepage(). It should reduce code as well.
> > > * Filesystem developers call memory allocations without thinking twice
> > > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> > > eliminate GFP_NOFS.
> > 
> > I do not think this is the right approach. We do want to eliminate
> > explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
> > usage as well. The later requires that we drop the __GFP_FS only for
> > those contexts that really might cause reclaim recursion problems.
> 
> As I've said before, moving to a scoped API will not reduce the
> number of GFP_NOFS scope allocation points - removing individual
> GFP_NOFS annotations doesn't do anything to avoid the deadlock paths
> it protects against.

Maybe it doesn't for some filesystems like xfs but I am quite sure it
will for some others which overuse GFP_NOFS just to be sure. E.g. btrfs.

> The issue is that GFP_NOFS is a big hammer - it stops reclaim from
> all filesystem scopes, not just the one we hold locks on and are
> doing the allocation for. i.e. we can be in one filesystem and quite
> safely do reclaim from other filesystems. The global scope of
> GFP_NOFS just doesn't allow this sort of fine-grained control to be
> expressed in reclaim.

Agreed!

> IOWs, if we want to reduce the scope of GFP_NOFS, we need a context
> to be passed from allocation to reclaim so that the reclaim context
> can check that it's a safe allocation context to reclaim from. e.g.
> for GFP_NOFS, we can use the superblock of the allocating filesystem
> as the context, and check it against the superblock that the current
> reclaim context (e.g. shrinker invocation) belongs to. If they
> match, we skip it. If they don't match, then we can perform reclaim
> on that context.

Agreed again. But this is hardly doable without actually defining what
those scopes are. Once we have them we can expand to add more context.
Dave Chinner March 31, 2018, 9:21 p.m. UTC | #9
On Thu, Mar 29, 2018 at 09:01:08AM +0200, Michal Hocko wrote:
> On Thu 29-03-18 10:57:02, Dave Chinner wrote:
> > On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote:
> > > On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote:
> > > > 
> > > > 
> > > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote:
> > > [...]
> > > > > Maybe no real filesystem behaves that way.  We need feedback from
> > > > > filesystem people.
> > > > 
> > > > The idea is to:
> > > > * Keep a central location for check, rather than individual filesystem
> > > > writepage(). It should reduce code as well.
> > > > * Filesystem developers call memory allocations without thinking twice
> > > > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence
> > > > eliminate GFP_NOFS.
> > > 
> > > I do not think this is the right approach. We do want to eliminate
> > > explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS
> > > usage as well. The later requires that we drop the __GFP_FS only for
> > > those contexts that really might cause reclaim recursion problems.
> > 
> > As I've said before, moving to a scoped API will not reduce the
> > number of GFP_NOFS scope allocation points - removing individual
> > GFP_NOFS annotations doesn't do anything to avoid the deadlock paths
> > it protects against.
> 
> Maybe it doesn't for some filesystems like xfs but I am quite sure it
> will for some others which overuse GFP_NOFS just to be sure. E.g. btrfs.
> 
> > The issue is that GFP_NOFS is a big hammer - it stops reclaim from
> > all filesystem scopes, not just the one we hold locks on and are
> > doing the allocation for. i.e. we can be in one filesystem and quite
> > safely do reclaim from other filesystems. The global scope of
> > GFP_NOFS just doesn't allow this sort of fine-grained control to be
> > expressed in reclaim.
> 
> Agreed!
> 
> > IOWs, if we want to reduce the scope of GFP_NOFS, we need a context
> > to be passed from allocation to reclaim so that the reclaim context
> > can check that it's a safe allocation context to reclaim from. e.g.
> > for GFP_NOFS, we can use the superblock of the allocating filesystem
> > as the context, and check it against the superblock that the current
> > reclaim context (e.g. shrinker invocation) belongs to. If they
> > match, we skip it. If they don't match, then we can perform reclaim
> > on that context.
> 
> Agreed again. But this is hardly doable without actually defining what
> those scopes are. Once we have them we can expand to add more context.

Some filesystems already have well defined scopes (e.g. XFS's
transaction scope) - all we need is the infrastructure that passes
the scope pointer to reclaim rather than having the allocation code
intercept PF_MEMALLOC_NOFS and turn it into GFP_NOFS allocation
context...

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d4d04fee568a..42f7f4c2063b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -28,6 +28,7 @@ 
 #include <linux/tracepoint.h>
 #include <linux/device.h>
 #include <linux/memcontrol.h>
+#include <linux/sched/mm.h>
 #include "internal.h"
 
 /*
@@ -1713,7 +1714,12 @@  static long wb_writeback(struct bdi_writeback *wb,
 	struct inode *inode;
 	long progress;
 	struct blk_plug plug;
+	unsigned flags;
 
+	if (current->flags & PF_MEMALLOC_NOFS)
+		return 0;
+
+	flags = memalloc_nofs_save();
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
 
@@ -1797,6 +1803,8 @@  static long wb_writeback(struct bdi_writeback *wb,
 	spin_unlock(&wb->list_lock);
 	blk_finish_plug(&plug);
 
+	memalloc_nofs_restore(flags);
+
 	return nr_pages - work->nr_pages;
 }
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..943ade03489a 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1086,13 +1086,6 @@  xfs_do_writepage(
 			PF_MEMALLOC))
 		goto redirty;
 
-	/*
-	 * Given that we do not allow direct reclaim to call us, we should
-	 * never be called while in a filesystem transaction.
-	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
-		goto redirty;
-
 	/*
 	 * Is this page beyond the end of the file?
 	 *
diff --git a/mm/filemap.c b/mm/filemap.c
index 693f62212a59..3c9ead9a1e32 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -430,6 +430,7 @@  int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 				loff_t end, int sync_mode)
 {
 	int ret;
+	unsigned flags;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
 		.nr_to_write = LONG_MAX,
@@ -440,9 +441,11 @@  int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
 	if (!mapping_cap_writeback_dirty(mapping))
 		return 0;
 
+	flags = memalloc_nofs_save();
 	wbc_attach_fdatawrite_inode(&wbc, mapping->host);
 	ret = do_writepages(mapping, &wbc);
 	wbc_detach_inode(&wbc);
+	memalloc_nofs_restore(flags);
 	return ret;
 }