Message ID | 20180321224429.15860-2-rgoldwyn@suse.de (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed 21-03-18 17:44:27, Goldwyn Rodrigues wrote: > From: Goldwyn Rodrigues <rgoldwyn@suse.com> > > writebacks can recurse into itself under low memory situations. > Set memalloc_nofs_save() in order to make sure it does not > recurse. How? We are not doing writeback from the direct reclaim context.
On 03/22/2018 02:08 AM, Michal Hocko wrote: > On Wed 21-03-18 17:44:27, Goldwyn Rodrigues wrote: >> From: Goldwyn Rodrigues <rgoldwyn@suse.com> >> >> writebacks can recurse into itself under low memory situations. >> Set memalloc_nofs_save() in order to make sure it does not >> recurse. > > How? We are not doing writeback from the direct reclaim context. > I am not sure if I missed a condition in the code, but here is one of the call lineup: writepages() -> writepage() -> kmalloc() -> __alloc_pages() -> __alloc_pages_nodemask -> __alloc_pages_slowpath -> __alloc_pages_direct_reclaim() -> try_to_free_pages() -> do_try_to_free_pages() -> shrink_zones() -> shrink_node() -> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() -> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() -> evict(inode) -> evict_inode() for ext4 -> filemap_write_and_wait() -> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() -> do_writepages -> writepages() Please note, most filesystems currently have a safeguard in writepage() which will return if the PF_MEMALLOC is set. The other safeguard is __GFP_FS which we are trying to eliminate.
On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote: > I am not sure if I missed a condition in the code, but here is one of > the call lineup: > > writepages() -> writepage() -> kmalloc() -> __alloc_pages() -> > __alloc_pages_nodemask -> __alloc_pages_slowpath -> > __alloc_pages_direct_reclaim() -> try_to_free_pages() -> > do_try_to_free_pages() -> shrink_zones() -> shrink_node() -> > shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() -> > super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() -> > evict(inode) -> evict_inode() for ext4 -> filemap_write_and_wait() -> > filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() -> > do_writepages -> writepages() > > Please note, most filesystems currently have a safeguard in writepage() > which will return if the PF_MEMALLOC is set. The other safeguard is > __GFP_FS which we are trying to eliminate. But is that harmful? ext4_writepage() (for example) says that it will not deadlock in that circumstance: * We can get recursively called as show below. * * ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() -> * ext4_writepage() * * But since we don't do any block allocation we should not deadlock. * Page also have the dirty flag cleared so we don't get recurive page_lock. One might well argue that it's not *useful*; if we've gone into writepage already, there's no point in re-entering writepage. And the last thing we want to do is But I could see filesystems behaving differently when entered for writepage-for-regularly-scheduled-writeback versus writepage-for-shrinking, so maybe they can make progress. Maybe no real filesystem behaves that way. We need feedback from filesystem people.
On 03/27/2018 09:21 AM, Matthew Wilcox wrote: > On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote: >> I am not sure if I missed a condition in the code, but here is one of >> the call lineup: >> >> writepages() -> writepage() -> kmalloc() -> __alloc_pages() -> >> __alloc_pages_nodemask -> __alloc_pages_slowpath -> >> __alloc_pages_direct_reclaim() -> try_to_free_pages() -> >> do_try_to_free_pages() -> shrink_zones() -> shrink_node() -> >> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() -> >> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() -> >> evict(inode) -> evict_inode() for ext4 -> filemap_write_and_wait() -> >> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() -> >> do_writepages -> writepages() >> >> Please note, most filesystems currently have a safeguard in writepage() >> which will return if the PF_MEMALLOC is set. The other safeguard is >> __GFP_FS which we are trying to eliminate. > > But is that harmful? ext4_writepage() (for example) says that it will > not deadlock in that circumstance: No, it is not harmful. > > * We can get recursively called as show below. > * > * ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() -> > * ext4_writepage() > * > * But since we don't do any block allocation we should not deadlock. > * Page also have the dirty flag cleared so we don't get recurive page_lock. Yes, and it avoids this by checking for PF_MEMALLOC flag. > > One might well argue that it's not *useful*; if we've gone into > writepage already, there's no point in re-entering writepage. And the > last thing we want to do is ? > But I could see filesystems behaving differently when entered > for writepage-for-regularly-scheduled-writeback versus > writepage-for-shrinking, so maybe they can make progress. > do_writepages() is the same for both, and hence the memalloc_* API patch. > Maybe no real filesystem behaves that way. We need feedback from > filesystem people. The idea is to: * Keep a central location for check, rather than individual filesystem writepage(). It should reduce code as well. * Filesystem developers call memory allocations without thinking twice about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence eliminate GFP_NOFS.
On Tue, Mar 27, 2018 at 10:13:53AM -0500, Goldwyn Rodrigues wrote: > On 03/27/2018 09:21 AM, Matthew Wilcox wrote: > > On Tue, Mar 27, 2018 at 07:52:48AM -0500, Goldwyn Rodrigues wrote: > >> I am not sure if I missed a condition in the code, but here is one of > >> the call lineup: > >> > >> writepages() -> writepage() -> kmalloc() -> __alloc_pages() -> > >> __alloc_pages_nodemask -> __alloc_pages_slowpath -> > >> __alloc_pages_direct_reclaim() -> try_to_free_pages() -> > >> do_try_to_free_pages() -> shrink_zones() -> shrink_node() -> > >> shrink_slab() -> do_shrink_slab() -> shrinker.scan_objects() -> > >> super_cache_scan() -> prune_icache_sb() -> fs/inode.c:dispose_list() -> > >> evict(inode) -> evict_inode() for ext4 -> filemap_write_and_wait() -> > >> filemap_fdatawrite(mapping) -> __filemap_fdatawrite_range() -> > >> do_writepages -> writepages() > >> > >> Please note, most filesystems currently have a safeguard in writepage() > >> which will return if the PF_MEMALLOC is set. The other safeguard is > >> __GFP_FS which we are trying to eliminate. > > > > But is that harmful? ext4_writepage() (for example) says that it will > > not deadlock in that circumstance: > > No, it is not harmful. > > > > > * We can get recursively called as show below. > > * > > * ext4_writepage() -> kmalloc() -> __alloc_pages() -> page_launder() -> > > * ext4_writepage() > > * > > * But since we don't do any block allocation we should not deadlock. > > * Page also have the dirty flag cleared so we don't get recurive page_lock. > > Yes, and it avoids this by checking for PF_MEMALLOC flag. > > > > > One might well argue that it's not *useful*; if we've gone into > > writepage already, there's no point in re-entering writepage. And the > > last thing we want to do is > > ? Sorry, got cut off. The last thing we want to do is blow the stack by recursing too deeply, but I don't think we're going to go through this loop more than once. > > But I could see filesystems behaving differently when entered > > for writepage-for-regularly-scheduled-writeback versus > > writepage-for-shrinking, so maybe they can make progress. > > > > do_writepages() is the same for both, and hence the memalloc_* API patch. But we don't want to avoid this particular recursion. We only need to avoid the recursion if it would result in a deadlock. > > Maybe no real filesystem behaves that way. We need feedback from > > filesystem people. > > The idea is to: > * Keep a central location for check, rather than individual filesystem > writepage(). It should reduce code as well. > * Filesystem developers call memory allocations without thinking twice > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence > eliminate GFP_NOFS. I know the goal is to eliminate GFP_NOFS. I'm very much in favour of that idea. I'm just not sure you're going about it the right way. Probably we will have a good discussion about it next month.
On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote: > > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote: [...] > > Maybe no real filesystem behaves that way. We need feedback from > > filesystem people. > > The idea is to: > * Keep a central location for check, rather than individual filesystem > writepage(). It should reduce code as well. > * Filesystem developers call memory allocations without thinking twice > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence > eliminate GFP_NOFS. I do not think this is the right approach. We do want to eliminate explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS usage as well. The later requires that we drop the __GFP_FS only for those contexts that really might cause reclaim recursion problems. So in your example, it would be much better to add the scope into those writepage(s) implementations which actually can trigger the writeback from the reclaim path rather from the generic implementation which has no means to know that.
On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote: > On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote: > > > > > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote: > [...] > > > Maybe no real filesystem behaves that way. We need feedback from > > > filesystem people. > > > > The idea is to: > > * Keep a central location for check, rather than individual filesystem > > writepage(). It should reduce code as well. > > * Filesystem developers call memory allocations without thinking twice > > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence > > eliminate GFP_NOFS. > > I do not think this is the right approach. We do want to eliminate > explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS > usage as well. The later requires that we drop the __GFP_FS only for > those contexts that really might cause reclaim recursion problems. As I've said before, moving to a scoped API will not reduce the number of GFP_NOFS scope allocation points - removing individual GFP_NOFS annotations doesn't do anything to avoid the deadlock paths it protects against. The issue is that GFP_NOFS is a big hammer - it stops reclaim from all filesystem scopes, not just the one we hold locks on and are doing the allocation for. i.e. we can be in one filesystem and quite safely do reclaim from other filesystems. The global scope of GFP_NOFS just doesn't allow this sort of fine-grained control to be expressed in reclaim. IOWs, if we want to reduce the scope of GFP_NOFS, we need a context to be passed from allocation to reclaim so that the reclaim context can check that it's a safe allocation context to reclaim from. e.g. for GFP_NOFS, we can use the superblock of the allocating filesystem as the context, and check it against the superblock that the current reclaim context (e.g. shrinker invocation) belongs to. If they match, we skip it. If they don't match, then we can perform reclaim on that context. Cheers, Dave.
On Thu 29-03-18 10:57:02, Dave Chinner wrote: > On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote: > > On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote: > > > > > > > > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote: > > [...] > > > > Maybe no real filesystem behaves that way. We need feedback from > > > > filesystem people. > > > > > > The idea is to: > > > * Keep a central location for check, rather than individual filesystem > > > writepage(). It should reduce code as well. > > > * Filesystem developers call memory allocations without thinking twice > > > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence > > > eliminate GFP_NOFS. > > > > I do not think this is the right approach. We do want to eliminate > > explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS > > usage as well. The later requires that we drop the __GFP_FS only for > > those contexts that really might cause reclaim recursion problems. > > As I've said before, moving to a scoped API will not reduce the > number of GFP_NOFS scope allocation points - removing individual > GFP_NOFS annotations doesn't do anything to avoid the deadlock paths > it protects against. Maybe it doesn't for some filesystems like xfs but I am quite sure it will for some others which overuse GFP_NOFS just to be sure. E.g. btrfs. > The issue is that GFP_NOFS is a big hammer - it stops reclaim from > all filesystem scopes, not just the one we hold locks on and are > doing the allocation for. i.e. we can be in one filesystem and quite > safely do reclaim from other filesystems. The global scope of > GFP_NOFS just doesn't allow this sort of fine-grained control to be > expressed in reclaim. Agreed! > IOWs, if we want to reduce the scope of GFP_NOFS, we need a context > to be passed from allocation to reclaim so that the reclaim context > can check that it's a safe allocation context to reclaim from. e.g. > for GFP_NOFS, we can use the superblock of the allocating filesystem > as the context, and check it against the superblock that the current > reclaim context (e.g. shrinker invocation) belongs to. If they > match, we skip it. If they don't match, then we can perform reclaim > on that context. Agreed again. But this is hardly doable without actually defining what those scopes are. Once we have them we can expand to add more context.
On Thu, Mar 29, 2018 at 09:01:08AM +0200, Michal Hocko wrote: > On Thu 29-03-18 10:57:02, Dave Chinner wrote: > > On Wed, Mar 28, 2018 at 09:01:13AM +0200, Michal Hocko wrote: > > > On Tue 27-03-18 10:13:53, Goldwyn Rodrigues wrote: > > > > > > > > > > > > On 03/27/2018 09:21 AM, Matthew Wilcox wrote: > > > [...] > > > > > Maybe no real filesystem behaves that way. We need feedback from > > > > > filesystem people. > > > > > > > > The idea is to: > > > > * Keep a central location for check, rather than individual filesystem > > > > writepage(). It should reduce code as well. > > > > * Filesystem developers call memory allocations without thinking twice > > > > about which GFP flag to use: GFP_KERNEL or GFP_NOFS. In essence > > > > eliminate GFP_NOFS. > > > > > > I do not think this is the right approach. We do want to eliminate > > > explicit GFP_NOFS usage, but we also want to reduce the overal GFP_NOFS > > > usage as well. The later requires that we drop the __GFP_FS only for > > > those contexts that really might cause reclaim recursion problems. > > > > As I've said before, moving to a scoped API will not reduce the > > number of GFP_NOFS scope allocation points - removing individual > > GFP_NOFS annotations doesn't do anything to avoid the deadlock paths > > it protects against. > > Maybe it doesn't for some filesystems like xfs but I am quite sure it > will for some others which overuse GFP_NOFS just to be sure. E.g. btrfs. > > > The issue is that GFP_NOFS is a big hammer - it stops reclaim from > > all filesystem scopes, not just the one we hold locks on and are > > doing the allocation for. i.e. we can be in one filesystem and quite > > safely do reclaim from other filesystems. The global scope of > > GFP_NOFS just doesn't allow this sort of fine-grained control to be > > expressed in reclaim. > > Agreed! > > > IOWs, if we want to reduce the scope of GFP_NOFS, we need a context > > to be passed from allocation to reclaim so that the reclaim context > > can check that it's a safe allocation context to reclaim from. e.g. > > for GFP_NOFS, we can use the superblock of the allocating filesystem > > as the context, and check it against the superblock that the current > > reclaim context (e.g. shrinker invocation) belongs to. If they > > match, we skip it. If they don't match, then we can perform reclaim > > on that context. > > Agreed again. But this is hardly doable without actually defining what > those scopes are. Once we have them we can expand to add more context. Some filesystems already have well defined scopes (e.g. XFS's transaction scope) - all we need is the infrastructure that passes the scope pointer to reclaim rather than having the allocation code intercept PF_MEMALLOC_NOFS and turn it into GFP_NOFS allocation context... Cheers, Dave.
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d4d04fee568a..42f7f4c2063b 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -28,6 +28,7 @@ #include <linux/tracepoint.h> #include <linux/device.h> #include <linux/memcontrol.h> +#include <linux/sched/mm.h> #include "internal.h" /* @@ -1713,7 +1714,12 @@ static long wb_writeback(struct bdi_writeback *wb, struct inode *inode; long progress; struct blk_plug plug; + unsigned flags; + if (current->flags & PF_MEMALLOC_NOFS) + return 0; + + flags = memalloc_nofs_save(); oldest_jif = jiffies; work->older_than_this = &oldest_jif; @@ -1797,6 +1803,8 @@ static long wb_writeback(struct bdi_writeback *wb, spin_unlock(&wb->list_lock); blk_finish_plug(&plug); + memalloc_nofs_restore(flags); + return nr_pages - work->nr_pages; } diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 9c6a830da0ee..943ade03489a 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1086,13 +1086,6 @@ xfs_do_writepage( PF_MEMALLOC)) goto redirty; - /* - * Given that we do not allow direct reclaim to call us, we should - * never be called while in a filesystem transaction. - */ - if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS)) - goto redirty; - /* * Is this page beyond the end of the file? * diff --git a/mm/filemap.c b/mm/filemap.c index 693f62212a59..3c9ead9a1e32 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -430,6 +430,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, loff_t end, int sync_mode) { int ret; + unsigned flags; struct writeback_control wbc = { .sync_mode = sync_mode, .nr_to_write = LONG_MAX, @@ -440,9 +441,11 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, if (!mapping_cap_writeback_dirty(mapping)) return 0; + flags = memalloc_nofs_save(); wbc_attach_fdatawrite_inode(&wbc, mapping->host); ret = do_writepages(mapping, &wbc); wbc_detach_inode(&wbc); + memalloc_nofs_restore(flags); return ret; }