[6/6] mm: Add memalloc_nowait
diff mbox series

Message ID 20200625113122.7540-7-willy@infradead.org
State New
Headers show
Series
  • Overhaul memalloc_no*
Related show

Commit Message

Matthew Wilcox June 25, 2020, 11:31 a.m. UTC
Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
guarantees we will not sleep to reclaim memory.  Use it to simplify
dm-bufio's allocations.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 drivers/md/dm-bufio.c    | 30 ++++++++----------------------
 include/linux/sched.h    |  1 +
 include/linux/sched/mm.h | 12 ++++++++----
 3 files changed, 17 insertions(+), 26 deletions(-)

Comments

Michal Hocko June 25, 2020, 12:40 p.m. UTC | #1
On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.

memalloc_nowait is a good idea! I suspect the primary usecase would be
vmalloc.

> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

[...]
> @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 */
>  	while (1) {
>  		if (dm_bufio_cache_size_latch != 1) {
> -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			unsigned nowait_flag = memalloc_nowait_save();
> +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			memalloc_nowait_restore(nowait_flag);

This looks confusing though. I am not familiar with alloc_buffer and
there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
which I do not follow but GFP_KERNEL just struck my eyes. So why cannot
we have 
		alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
Matthew Wilcox June 25, 2020, 1:10 p.m. UTC | #2
On Thu, Jun 25, 2020 at 02:40:17PM +0200, Michal Hocko wrote:
> On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> > Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> > guarantees we will not sleep to reclaim memory.  Use it to simplify
> > dm-bufio's allocations.
> 
> memalloc_nowait is a good idea! I suspect the primary usecase would be
> vmalloc.

That's funny.  My use case is allocating page tables in an RCU protected
page fault handler.  Jens' use case is allocating page cache.  This one
is a vmalloc consumer (which is also indirectly page table allocation).

> > @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> >  	 */
> >  	while (1) {
> >  		if (dm_bufio_cache_size_latch != 1) {
> > -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +			unsigned nowait_flag = memalloc_nowait_save();
> > +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +			memalloc_nowait_restore(nowait_flag);
> 
> This looks confusing though. I am not familiar with alloc_buffer and
> there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
> which I do not follow but GFP_KERNEL just struck my eyes. So why cannot
> we have 
> 		alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);

Actually, I wanted to ask about the proliferation of __GFP_NOMEMALLOC
in the block layer.  Am I right in thinking it really has no effect
unless GFP_ATOMIC is set?  It seems like a magic flag that some driver
developers are sprinkling around randomly, so we probably need to clarify
the documentation on it.

What I was trying to do was just use the memalloc_nofoo API to control
what was going on and then the driver can just use GFP_KERNEL.  I should
probably have completed that thought before sending the patches out.
Michal Hocko June 25, 2020, 1:34 p.m. UTC | #3
On Thu 25-06-20 14:10:55, Matthew Wilcox wrote:
> On Thu, Jun 25, 2020 at 02:40:17PM +0200, Michal Hocko wrote:
> > On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> > > Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> > > guarantees we will not sleep to reclaim memory.  Use it to simplify
> > > dm-bufio's allocations.
> > 
> > memalloc_nowait is a good idea! I suspect the primary usecase would be
> > vmalloc.
> 
> That's funny.  My use case is allocating page tables in an RCU protected
> page fault handler.  Jens' use case is allocating page cache.  This one
> is a vmalloc consumer (which is also indirectly page table allocation).
> 
> > > @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> > >  	 */
> > >  	while (1) {
> > >  		if (dm_bufio_cache_size_latch != 1) {
> > > -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > +			unsigned nowait_flag = memalloc_nowait_save();
> > > +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > +			memalloc_nowait_restore(nowait_flag);
> > 
> > This looks confusing though. I am not familiar with alloc_buffer and
> > there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
> > which I do not follow but GFP_KERNEL just struck my eyes. So why cannot
> > we have 
> > 		alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> 
> Actually, I wanted to ask about the proliferation of __GFP_NOMEMALLOC
> in the block layer.  Am I right in thinking it really has no effect
> unless GFP_ATOMIC is set?

It does have an effect as an __GFP_MEMALLOC resp PF_MEMALLOC inhibitor.

> It seems like a magic flag that some driver
> developers are sprinkling around randomly, so we probably need to clarify
> the documentation on it.

Would the following make more sense?
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..014aa7a6d36a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -116,8 +116,9 @@ struct vm_area_struct;
  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
  * before using this flag.
  *
- * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
- * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
+ * %__GFP_NOMEMALLOC is used to inhibit __GFP_MEMALLOC resp. process scope
+ * variant of it PF_MEMALLOC. So use this flag if the caller of the allocation
+ * context might contain one or the other.
  */
 #define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)

> What I was trying to do was just use the memalloc_nofoo API to control
> what was going on and then the driver can just use GFP_KERNEL.  I should
> probably have completed that thought before sending the patches out.

Yes the effect will be the same but it just really hit my eyes as this
was in the same diff. IMHO GFP_NOWAIT would be easier to grasp but
nothing I would dare to insist on.
kernel test robot June 25, 2020, 7:05 p.m. UTC | #4
Hi "Matthew,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on dm/for-next linus/master v5.8-rc2]
[cannot apply to xfs-linux/for-next next-20200625]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Overhaul-memalloc_no/20200625-193357
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 87e867b4269f29dac8190bca13912d08163a277f
config: nds32-randconfig-r011-20200624 (attached as .config)
compiler: nds32le-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=nds32 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   drivers/md/dm-bufio.c: In function '__alloc_buffer_wait_no_callback':
>> drivers/md/dm-bufio.c:860:27: error: implicit declaration of function 'memalloc_nowait_save'; did you mean 'memalloc_noio_save'? [-Werror=implicit-function-declaration]
     860 |    unsigned nowait_flag = memalloc_nowait_save();
         |                           ^~~~~~~~~~~~~~~~~~~~
         |                           memalloc_noio_save
>> drivers/md/dm-bufio.c:862:4: error: implicit declaration of function 'memalloc_nowait_restore'; did you mean 'memalloc_noio_restore'? [-Werror=implicit-function-declaration]
     862 |    memalloc_nowait_restore(nowait_flag);
         |    ^~~~~~~~~~~~~~~~~~~~~~~
         |    memalloc_noio_restore
   cc1: some warnings being treated as errors

vim +860 drivers/md/dm-bufio.c

   836	
   837	/*
   838	 * Allocate a new buffer. If the allocation is not possible, wait until
   839	 * some other thread frees a buffer.
   840	 *
   841	 * May drop the lock and regain it.
   842	 */
   843	static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
   844	{
   845		struct dm_buffer *b;
   846		bool tried_noio_alloc = false;
   847	
   848		/*
   849		 * dm-bufio is resistant to allocation failures (it just keeps
   850		 * one buffer reserved in cases all the allocations fail).
   851		 * So set flags to not try too hard:
   852		 *	__GFP_NOMEMALLOC: don't use emergency reserves
   853		 *	__GFP_NOWARN: don't print a warning in case of failure
   854		 *
   855		 * For debugging, if we set the cache size to 1, no new buffers will
   856		 * be allocated.
   857		 */
   858		while (1) {
   859			if (dm_bufio_cache_size_latch != 1) {
 > 860				unsigned nowait_flag = memalloc_nowait_save();
   861				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
 > 862				memalloc_nowait_restore(nowait_flag);
   863				if (b)
   864					return b;
   865			}
   866	
   867			if (nf == NF_PREFETCH)
   868				return NULL;
   869	
   870			if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
   871				unsigned noio_flag;
   872	
   873				dm_bufio_unlock(c);
   874				noio_flag = memalloc_noio_save();
   875				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
   876				memalloc_noio_restore(noio_flag);
   877				dm_bufio_lock(c);
   878				if (b)
   879					return b;
   880				tried_noio_alloc = true;
   881			}
   882	
   883			if (!list_empty(&c->reserved_buffers)) {
   884				b = list_entry(c->reserved_buffers.next,
   885					       struct dm_buffer, lru_list);
   886				list_del(&b->lru_list);
   887				c->need_reserved_buffers++;
   888	
   889				return b;
   890			}
   891	
   892			b = __get_unclaimed_buffer(c);
   893			if (b)
   894				return b;
   895	
   896			__wait_for_free_buffer(c);
   897		}
   898	}
   899	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
kernel test robot June 25, 2020, 11:51 p.m. UTC | #5
Hi "Matthew,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on dm/for-next linus/master v5.8-rc2]
[cannot apply to xfs-linux/for-next next-20200625]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use  as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Overhaul-memalloc_no/20200625-193357
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 87e867b4269f29dac8190bca13912d08163a277f
config: x86_64-allyesconfig (attached as .config)
compiler: clang version 11.0.0 (https://github.com/llvm/llvm-project 8911a35180c6777188fefe0954a2451a2b91deaf)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> drivers/md/dm-bufio.c:860:27: error: implicit declaration of function 'memalloc_nowait_save' [-Werror,-Wimplicit-function-declaration]
                           unsigned nowait_flag = memalloc_nowait_save();
                                                  ^
   drivers/md/dm-bufio.c:860:27: note: did you mean 'memalloc_noio_save'?
   include/linux/sched/mm.h:227:28: note: 'memalloc_noio_save' declared here
   static inline unsigned int memalloc_noio_save(void)
                              ^
>> drivers/md/dm-bufio.c:862:4: error: implicit declaration of function 'memalloc_nowait_restore' [-Werror,-Wimplicit-function-declaration]
                           memalloc_nowait_restore(nowait_flag);
                           ^
   drivers/md/dm-bufio.c:862:4: note: did you mean 'memalloc_noio_restore'?
   include/linux/sched/mm.h:242:20: note: 'memalloc_noio_restore' declared here
   static inline void memalloc_noio_restore(unsigned int flags)
                      ^
   2 errors generated.

vim +/memalloc_nowait_save +860 drivers/md/dm-bufio.c

   836	
   837	/*
   838	 * Allocate a new buffer. If the allocation is not possible, wait until
   839	 * some other thread frees a buffer.
   840	 *
   841	 * May drop the lock and regain it.
   842	 */
   843	static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c, enum new_flag nf)
   844	{
   845		struct dm_buffer *b;
   846		bool tried_noio_alloc = false;
   847	
   848		/*
   849		 * dm-bufio is resistant to allocation failures (it just keeps
   850		 * one buffer reserved in cases all the allocations fail).
   851		 * So set flags to not try too hard:
   852		 *	__GFP_NOMEMALLOC: don't use emergency reserves
   853		 *	__GFP_NOWARN: don't print a warning in case of failure
   854		 *
   855		 * For debugging, if we set the cache size to 1, no new buffers will
   856		 * be allocated.
   857		 */
   858		while (1) {
   859			if (dm_bufio_cache_size_latch != 1) {
 > 860				unsigned nowait_flag = memalloc_nowait_save();
   861				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
 > 862				memalloc_nowait_restore(nowait_flag);
   863				if (b)
   864					return b;
   865			}
   866	
   867			if (nf == NF_PREFETCH)
   868				return NULL;
   869	
   870			if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
   871				unsigned noio_flag;
   872	
   873				dm_bufio_unlock(c);
   874				noio_flag = memalloc_noio_save();
   875				b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
   876				memalloc_noio_restore(noio_flag);
   877				dm_bufio_lock(c);
   878				if (b)
   879					return b;
   880				tried_noio_alloc = true;
   881			}
   882	
   883			if (!list_empty(&c->reserved_buffers)) {
   884				b = list_entry(c->reserved_buffers.next,
   885					       struct dm_buffer, lru_list);
   886				list_del(&b->lru_list);
   887				c->need_reserved_buffers++;
   888	
   889				return b;
   890			}
   891	
   892			b = __get_unclaimed_buffer(c);
   893			if (b)
   894				return b;
   895	
   896			__wait_for_free_buffer(c);
   897		}
   898	}
   899	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
Mike Rapoport June 29, 2020, 5:08 a.m. UTC | #6
On Thu, Jun 25, 2020 at 12:31:22PM +0100, Matthew Wilcox (Oracle) wrote:
> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>  drivers/md/dm-bufio.c    | 30 ++++++++----------------------
>  include/linux/sched.h    |  1 +
>  include/linux/sched/mm.h | 12 ++++++++----
>  3 files changed, 17 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 6d1565021d74..140ada9a2c8f 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -412,23 +412,6 @@ static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
>  
>  	*data_mode = DATA_MODE_VMALLOC;
>  
> -	/*
> -	 * __vmalloc allocates the data pages and auxiliary structures with
> -	 * gfp_flags that were specified, but pagetables are always allocated
> -	 * with GFP_KERNEL, no matter what was specified as gfp_mask.
> -	 *
> -	 * Consequently, we must set per-process flag PF_MEMALLOC_NOIO so that
> -	 * all allocations done by this process (including pagetables) are done
> -	 * as if GFP_NOIO was specified.
> -	 */
> -	if (gfp_mask & __GFP_NORETRY) {
> -		unsigned noio_flag = memalloc_noio_save();
> -		void *ptr = __vmalloc(c->block_size, gfp_mask);
> -
> -		memalloc_noio_restore(noio_flag);
> -		return ptr;
> -	}
> -
>  	return __vmalloc(c->block_size, gfp_mask);
>  }
>  
> @@ -866,9 +849,6 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 * dm-bufio is resistant to allocation failures (it just keeps
>  	 * one buffer reserved in cases all the allocations fail).
>  	 * So set flags to not try too hard:
> -	 *	GFP_NOWAIT: don't wait; if we need to sleep we'll release our
> -	 *		    mutex and wait ourselves.
> -	 *	__GFP_NORETRY: don't retry and rather return failure
>  	 *	__GFP_NOMEMALLOC: don't use emergency reserves
>  	 *	__GFP_NOWARN: don't print a warning in case of failure
>  	 *
> @@ -877,7 +857,9 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  	 */
>  	while (1) {
>  		if (dm_bufio_cache_size_latch != 1) {
> -			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			unsigned nowait_flag = memalloc_nowait_save();
> +			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			memalloc_nowait_restore(nowait_flag);
>  			if (b)
>  				return b;
>  		}
> @@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>  			return NULL;
>  
>  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> +			unsigned noio_flag;
> +
>  			dm_bufio_unlock(c);
> -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> +			noio_flag = memalloc_noio_save();

I've read the series twice and I'm still missing the definition of
memalloc_noio_save().

And also it would be nice to have a paragraph about it in
Documentation/core-api/memory-allocation.rst

> +			b = alloc_buffer(c, GFP_KERNEL |
> __GFP_NOMEMALLOC | __GFP_NOWARN); +
> memalloc_noio_restore(noio_flag); dm_bufio_lock(c); if (b)
>  				return b;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 90336850e940..b1c2cddd366c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -803,6 +803,7 @@ struct task_struct {
>  #endif
>  	unsigned			memalloc_noio:1;
>  	unsigned			memalloc_nofs:1;
> +	unsigned			memalloc_nowait:1;
>  	unsigned			memalloc_nocma:1;
>  
>  	unsigned long			atomic_flags; /* Flags requiring atomic access. */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 6f7b59a848a6..6484569f50df 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -179,12 +179,16 @@ static inline bool in_vfork(struct task_struct *tsk)
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
>  	if (unlikely(current->memalloc_noio || current->memalloc_nofs ||
> -		     current->memalloc_nocma)) {
> +		     current->memalloc_nocma) || current->memalloc_nowait) {
>  		/*
> -		 * NOIO implies both NOIO and NOFS and it is a weaker context
> -		 * so always make sure it makes precedence
> +		 * Clearing DIRECT_RECLAIM means we won't get to the point
> +		 * of testing IO or FS, so we don't need to bother clearing
> +		 * them.  noio implies neither IO nor FS and it is a weaker
> +		 * context so always make sure it takes precedence.
>  		 */
> -		if (current->memalloc_noio)
> +		if (current->memalloc_nowait)
> +			flags &= ~__GFP_DIRECT_RECLAIM;
> +		else if (current->memalloc_noio)
>  			flags &= ~(__GFP_IO | __GFP_FS);
>  		else if (current->memalloc_nofs)
>  			flags &= ~__GFP_FS;
> -- 
> 2.27.0
> 
>
Matthew Wilcox June 29, 2020, 12:18 p.m. UTC | #7
On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
> > @@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> >  			return NULL;
> >  
> >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> > +			unsigned noio_flag;
> > +
> >  			dm_bufio_unlock(c);
> > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > +			noio_flag = memalloc_noio_save();
> 
> I've read the series twice and I'm still missing the definition of
> memalloc_noio_save().
> 
> And also it would be nice to have a paragraph about it in
> Documentation/core-api/memory-allocation.rst

Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions: memalloc_noio_save memalloc_noio_restore
Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it is safe to call ``memalloc_noio_save`` or
Michal Hocko June 29, 2020, 12:52 p.m. UTC | #8
On Mon 29-06-20 13:18:16, Matthew Wilcox wrote:
> On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
> > > @@ -886,8 +868,12 @@ static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> > >  			return NULL;
> > >  
> > >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> > > +			unsigned noio_flag;
> > > +
> > >  			dm_bufio_unlock(c);
> > > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > +			noio_flag = memalloc_noio_save();
> > 
> > I've read the series twice and I'm still missing the definition of
> > memalloc_noio_save().
> > 
> > And also it would be nice to have a paragraph about it in
> > Documentation/core-api/memory-allocation.rst
> 
> Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
> Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions: memalloc_noio_save memalloc_noio_restore
> Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it is safe to call ``memalloc_noio_save`` or
 
The patch is adding memalloc_nowait* and I suspect Mike had that in
mind, which would be a fair request. Btw. we are missing memalloc_nocma*
documentation either - I was just reminded of its existence today...
Mike Rapoport June 29, 2020, 1:45 p.m. UTC | #9
On June 29, 2020 3:52:31 PM GMT+03:00, Michal Hocko <mhocko@kernel.org> wrote:
>On Mon 29-06-20 13:18:16, Matthew Wilcox wrote:
>> On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
>> > > @@ -886,8 +868,12 @@ static struct dm_buffer
>*__alloc_buffer_wait_no_callback(struct dm_bufio_client
>> > >  			return NULL;
>> > >  
>> > >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
>> > > +			unsigned noio_flag;
>> > > +
>> > >  			dm_bufio_unlock(c);
>> > > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY |
>__GFP_NOMEMALLOC | __GFP_NOWARN);
>> > > +			noio_flag = memalloc_noio_save();
>> > 
>> > I've read the series twice and I'm still missing the definition of
>> > memalloc_noio_save().
>> > 
>> > And also it would be nice to have a paragraph about it in
>> > Documentation/core-api/memory-allocation.rst
>> 
>>
>Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``,
>``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
>> Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions:
>memalloc_noio_save memalloc_noio_restore
>> Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it
>is safe to call ``memalloc_noio_save`` or
> 
>The patch is adding memalloc_nowait* and I suspect Mike had that in
>mind, which would be a fair request. 

Right, sorry misprinted that.

> Btw. we are missing
>memalloc_nocma*
>documentation either - I was just reminded of its existence today..
Matthew Wilcox June 29, 2020, 9:28 p.m. UTC | #10
On Mon, Jun 29, 2020 at 04:45:14PM +0300, Mike Rapoport wrote:
> 
> 
> On June 29, 2020 3:52:31 PM GMT+03:00, Michal Hocko <mhocko@kernel.org> wrote:
> >On Mon 29-06-20 13:18:16, Matthew Wilcox wrote:
> >> On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
> >> > > @@ -886,8 +868,12 @@ static struct dm_buffer
> >*__alloc_buffer_wait_no_callback(struct dm_bufio_client
> >> > >  			return NULL;
> >> > >  
> >> > >  		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> >> > > +			unsigned noio_flag;
> >> > > +
> >> > >  			dm_bufio_unlock(c);
> >> > > -			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY |
> >__GFP_NOMEMALLOC | __GFP_NOWARN);
> >> > > +			noio_flag = memalloc_noio_save();
> >> > 
> >> > I've read the series twice and I'm still missing the definition of
> >> > memalloc_noio_save().
> >> > 
> >> > And also it would be nice to have a paragraph about it in
> >> > Documentation/core-api/memory-allocation.rst
> >> 
> >>
> >Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``,
> >``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
> >> Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions:
> >memalloc_noio_save memalloc_noio_restore
> >> Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it
> >is safe to call ``memalloc_noio_save`` or
> > 
> >The patch is adding memalloc_nowait* and I suspect Mike had that in
> >mind, which would be a fair request. 
> 
> Right, sorry misprinted that.
> 
> > Btw. we are missing
> >memalloc_nocma*
> >documentation either - I was just reminded of its existence today..

Heh.  Oops, not sure how those got left out.  I hadn't touched the
documentation either, so -1 points to me.

The documentation is hard to add a new case to, so I rewrote it.  What
do you think?  (Obviously I'll split this out differently for submission;
this is just what I have in my tree right now).

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
deleted file mode 100644
index e7c32a8de126..000000000000
--- a/Documentation/core-api/gfp_mask-from-fs-io.rst
+++ /dev/null
@@ -1,68 +0,0 @@
-.. _gfp_mask_from_fs_io:
-
-=================================
-GFP masks used from FS/IO context
-=================================
-
-:Date: May, 2018
-:Author: Michal Hocko <mhocko@kernel.org>
-
-Introduction
-============
-
-Code paths in the filesystem and IO stacks must be careful when
-allocating memory to prevent recursion deadlocks caused by direct
-memory reclaim calling back into the FS or IO paths and blocking on
-already held resources (e.g. locks - most commonly those used for the
-transaction context).
-
-The traditional way to avoid this deadlock problem is to clear __GFP_FS
-respectively __GFP_IO (note the latter implies clearing the first as well) in
-the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
-used as shortcut. It turned out though that above approach has led to
-abuses when the restricted gfp mask is used "just in case" without a
-deeper consideration which leads to problems because an excessive use
-of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
-reclaim issues.
-
-New API
-========
-
-Since 4.12 we do have a generic scope API for both NOFS and NOIO context
-``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
-``memalloc_noio_restore`` which allow to mark a scope to be a critical
-section from a filesystem or I/O point of view. Any allocation from that
-scope will inherently drop __GFP_FS respectively __GFP_IO from the given
-mask so no memory allocation can recurse back in the FS/IO.
-
-.. kernel-doc:: include/linux/sched/mm.h
-   :functions: memalloc_nofs_save memalloc_nofs_restore
-.. kernel-doc:: include/linux/sched/mm.h
-   :functions: memalloc_noio_save memalloc_noio_restore
-
-FS/IO code then simply calls the appropriate save function before
-any critical section with respect to the reclaim is started - e.g.
-lock shared with the reclaim context or when a transaction context
-nesting would be possible via reclaim. The restore function should be
-called when the critical section ends. All that ideally along with an
-explanation what is the reclaim context for easier maintenance.
-
-Please note that the proper pairing of save/restore functions
-allows nesting so it is safe to call ``memalloc_noio_save`` or
-``memalloc_noio_restore`` respectively from an existing NOIO or NOFS
-scope.
-
-What about __vmalloc(GFP_NOFS)
-==============================
-
-vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
-GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
-to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
-almost always a bug. The good news is that the NOFS/NOIO semantic can be
-achieved by the scope API.
-
-In the ideal world, upper layers should already mark dangerous contexts
-and so no special care is required and vmalloc should be called without
-any problems. Sometimes if the context is not really clear or there are
-layering violations then the recommended way around that is to wrap ``vmalloc``
-by the scope API with a comment explaining the problem.
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 15ab86112627..55f611e34a1d 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -90,7 +90,6 @@ more memory-management documentation in :doc:`/vm/index`.
    genalloc
    pin_user_pages
    boot-time-mm
-   gfp_mask-from-fs-io
 
 Interfaces for kernel debugging
 ===============================
diff --git a/Documentation/core-api/memory-allocation.rst b/Documentation/core-api/memory-allocation.rst
index 4aa82ddd01b8..c6287c25ff99 100644
--- a/Documentation/core-api/memory-allocation.rst
+++ b/Documentation/core-api/memory-allocation.rst
@@ -69,13 +69,12 @@ here we briefly outline their recommended usage:
     ``GFP_USER`` means that the allocated memory is not movable and it
     must be directly accessible by the kernel.
 
-You may notice that quite a few allocations in the existing code
-specify ``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to
-prevent recursion deadlocks caused by direct memory reclaim calling
-back into the FS or IO paths and blocking on already held
-resources. Since 4.12 the preferred way to address this issue is to
-use new scope APIs described in
-:ref:`Documentation/core-api/gfp_mask-from-fs-io.rst <gfp_mask_from_fs_io>`.
+You may notice that quite a few allocations in the existing code specify
+``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
+recursion deadlocks caused by direct memory reclaim calling back into
+the FS or IO paths and blocking on already held resources. Since 4.12
+the preferred way to address this issue is to use the new scope APIs
+described below.
 
 Other legacy GFP flags are ``GFP_DMA`` and ``GFP_DMA32``. They are
 used to ensure that the allocated memory is accessible by hardware
@@ -84,6 +83,37 @@ driver for a device with such restrictions, avoid using these flags.
 And even with hardware with restrictions it is preferable to use
 `dma_alloc*` APIs.
 
+Memory scoping API
+==================
+
+Traditionally, we have passed GFP flags to functions that we call,
+indicating what kind of actions may be taken to free up memory if none
+is currently available.  This has proved impractical in some places and
+so we are currently transitioning to the calls below which override the
+flags specified by any particular call to allocate memory.
+
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_nofs_save memalloc_nofs_restore
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_noio_save memalloc_noio_restore
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_nocma_save memalloc_nocma_restore
+.. kernel-doc:: include/linux/sched/mm.h
+   :functions: memalloc_nowait_save memalloc_nowait_restore
+
+These functions should be called at the point where any memory allocation
+would start to cause problems.  That is, do not simply wrap individual
+memory allocation calls which currently use ``GFP_NOFS`` with a pair
+of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
+find the lock which is taken that would cause problems if memory reclaim
+reentered the filesystem, place a call to memalloc_nofs_save() before it
+is acquired and a call to memalloc_nofs_restore() after it is released.
+Ideally also add a comment explaining why this lock will be problematic.
+
+Please note that the proper pairing of save/restore functions
+allows nesting so it is safe to call memalloc_noio_save() and
+memalloc_noio_restore() within an existing NOIO or NOFS scope.
+
 Selecting memory allocator
 ==========================
 
@@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
 alignment is also guaranteed to be at least the respective size.
 
 For large allocations you can use vmalloc() and vzalloc(), or directly
-request pages from the page allocator. The memory allocated by `vmalloc`
-and related functions is not physically contiguous.
+request pages from the page allocator.  The memory allocated by `vmalloc`
+and related functions is not physically contiguous.  The `vmalloc`
+family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
+flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
+the allocator which are hard to remove.  However, the scope APIs described
+above can be used to limit the `vmalloc` functions.
 
 If you are not sure whether the allocation size is too large for
 `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will
 try to allocate memory with `kmalloc` and if the allocation fails it
-will be retried with `vmalloc`. There are restrictions on which GFP
-flags can be used with `kvmalloc`; please see kvmalloc_node() reference
-documentation. Note that `kvmalloc` may return memory that is not
-physically contiguous.
+will be retried with `vmalloc`. That means the GFP flags supported by
+`kvmalloc` are the same as those supported by `vmalloc` and `kvmalloc`
+may return memory that is not physically contiguous.
 
 If you need to allocate many identical objects you can use the slab
 cache allocator. The cache should be set up with kmem_cache_create() or
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 6484569f50df..9fc091274d1d 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 		 * them.  noio implies neither IO nor FS and it is a weaker
 		 * context so always make sure it takes precedence.
 		 */
-		if (current->memalloc_nowait)
+		if (current->memalloc_nowait) {
 			flags &= ~__GFP_DIRECT_RECLAIM;
-		else if (current->memalloc_noio)
+			flags |= __GFP_NOWARN;
+		} else if (current->memalloc_noio)
 			flags &= ~(__GFP_IO | __GFP_FS);
 		else if (current->memalloc_nofs)
 			flags &= ~__GFP_FS;
@@ -275,6 +276,36 @@ static inline void memalloc_nofs_restore(unsigned int flags)
 	current->memalloc_nofs = flags ? 1 : 0;
 }
 
+/**
+ * memalloc_nowait_save - Marks implicit GFP_NOWAIT allocation scope.
+ *
+ * This functions marks the beginning of the GFP_NOWAIT allocation scope.
+ * All further allocations will implicitly disallow all waiting in the
+ * page allocator.  Use memalloc_nowait_restore() to end the scope with
+ * flags returned by this function.
+ *
+ * This function is safe to be used from any context.
+ */
+static inline unsigned int memalloc_nowait_save(void)
+{
+	unsigned int flags = current->memalloc_nowait;
+	current->memalloc_nowait = 1;
+	return flags;
+}
+
+/**
+ * memalloc_nowait_restore - Ends the implicit GFP_NOWAIT scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit GFP_NOWAIT scope started by memalloc_nowait_save().
+ * Always make sure that that the given flags is the return value from the
+ * pairing memalloc_nowait_save call.
+ */
+static inline void memalloc_nowait_restore(unsigned int flags)
+{
+	current->memalloc_nowait = flags ? 1 : 0;
+}
+
 static inline unsigned int memalloc_noreclaim_save(void)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC;
Michal Hocko June 30, 2020, 6:34 a.m. UTC | #11
On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
[...]
> The documentation is hard to add a new case to, so I rewrote it.  What
> do you think?  (Obviously I'll split this out differently for submission;
> this is just what I have in my tree right now).

I am fine with your changes. Few notes below.

> -It turned out though that above approach has led to
> -abuses when the restricted gfp mask is used "just in case" without a
> -deeper consideration which leads to problems because an excessive use
> -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> -reclaim issues.

I believe this is an important part because it shows that new people
coming to the existing code shouldn't take it as correct and rather
question it. Also having a clear indication that overuse is causing real
problems that might be not immediately visible to subsystems outside of
MM.

> -FS/IO code then simply calls the appropriate save function before
> -any critical section with respect to the reclaim is started - e.g.
> -lock shared with the reclaim context or when a transaction context
> -nesting would be possible via reclaim.  

[...]

> +These functions should be called at the point where any memory allocation
> +would start to cause problems.  That is, do not simply wrap individual
> +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> +find the lock which is taken that would cause problems if memory reclaim
> +reentered the filesystem, place a call to memalloc_nofs_save() before it
> +is acquired and a call to memalloc_nofs_restore() after it is released.
> +Ideally also add a comment explaining why this lock will be problematic.

The above text has mentioned the transaction context nesting as well and
that was a hint by Dave IIRC. It is imho good to have an example of
other reentrant points than just locks. I believe another useful example
would be something like loop device which is mixing IO and FS layers but
I am not familiar with all the details to give you an useful text.

[...]
> @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
>  alignment is also guaranteed to be at least the respective size.
>  
>  For large allocations you can use vmalloc() and vzalloc(), or directly
> -request pages from the page allocator. The memory allocated by `vmalloc`
> -and related functions is not physically contiguous.
> +request pages from the page allocator.  The memory allocated by `vmalloc`
> +and related functions is not physically contiguous.  The `vmalloc`
> +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> +the allocator which are hard to remove.  However, the scope APIs described
> +above can be used to limit the `vmalloc` functions.

I would reiterate "Do not just wrap vmalloc by the scope api but rather
rely on the real scope for the NOFS/NOIO context". Maybe we want to
stress out that once a scope is defined it is sticky to _all_
allocations and all allocators within that scope. The text is already
saying that but maybe we want to make it explicit and make it stand out.

[...]
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 6484569f50df..9fc091274d1d 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
>  		 * them.  noio implies neither IO nor FS and it is a weaker
>  		 * context so always make sure it takes precedence.
>  		 */
> -		if (current->memalloc_nowait)
> +		if (current->memalloc_nowait) {
>  			flags &= ~__GFP_DIRECT_RECLAIM;
> -		else if (current->memalloc_noio)
> +			flags |= __GFP_NOWARN;

I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
with the initial implementation. Maybe we will learn later that there is
just too much unhelpful noise in the kernel log and will reconsider but
I wouldn't just start with that. Also we might learn that there will be
other modifiers for atomic (or should I say non-sleeping) scopes to be
defined. E.g. access to memory reserves but let's just wait for real
usecases.


Thanks a lot Matthew!
Matthew Wilcox July 1, 2020, 4:12 a.m. UTC | #12
On Tue, Jun 30, 2020 at 08:34:36AM +0200, Michal Hocko wrote:
> On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
> [...]
> > The documentation is hard to add a new case to, so I rewrote it.  What
> > do you think?  (Obviously I'll split this out differently for submission;
> > this is just what I have in my tree right now).
> 
> I am fine with your changes. Few notes below.

Thanks!

> > -It turned out though that above approach has led to
> > -abuses when the restricted gfp mask is used "just in case" without a
> > -deeper consideration which leads to problems because an excessive use
> > -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > -reclaim issues.
> 
> I believe this is an important part because it shows that new people
> coming to the existing code shouldn't take it as correct and rather
> question it. Also having a clear indication that overuse is causing real
> problems that might be not immediately visible to subsystems outside of
> MM.

It seemed to say a lot of the same things as this paragraph:

+You may notice that quite a few allocations in the existing code specify
+``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
+recursion deadlocks caused by direct memory reclaim calling back into
+the FS or IO paths and blocking on already held resources. Since 4.12
+the preferred way to address this issue is to use the new scope APIs
+described below.

Since this is in core-api/ rather than vm/, I felt that discussion of
the problems that it causes to the mm was a bit too much detail for the
people who would be reading this document.  Maybe I could move that
information into a new Documentation/vm/reclaim.rst file?

Let's see if Our Grumpy Editor has time to give us his advice on this.

> > -FS/IO code then simply calls the appropriate save function before
> > -any critical section with respect to the reclaim is started - e.g.
> > -lock shared with the reclaim context or when a transaction context
> > -nesting would be possible via reclaim.  
> 
> [...]
> 
> > +These functions should be called at the point where any memory allocation
> > +would start to cause problems.  That is, do not simply wrap individual
> > +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > +find the lock which is taken that would cause problems if memory reclaim
> > +reentered the filesystem, place a call to memalloc_nofs_save() before it
> > +is acquired and a call to memalloc_nofs_restore() after it is released.
> > +Ideally also add a comment explaining why this lock will be problematic.
> 
> The above text has mentioned the transaction context nesting as well and
> that was a hint by Dave IIRC. It is imho good to have an example of
> other reentrant points than just locks. I believe another useful example
> would be something like loop device which is mixing IO and FS layers but
> I am not familiar with all the details to give you an useful text.

I'll let Mikulas & Dave finish fighting about that before I write any
text mentioning the loop driver.  How about this for mentioning the
filesystem transaction possibility?

@@ -103,12 +103,16 @@ flags specified by any particular call to allocate memory.
 
 These functions should be called at the point where any memory allocation
 would start to cause problems.  That is, do not simply wrap individual
-memory allocation calls which currently use ``GFP_NOFS`` with a pair
-of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
-find the lock which is taken that would cause problems if memory reclaim
+memory allocation calls which currently use ``GFP_NOFS`` with a pair of
+calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead, find
+the resource which is acquired that would cause problems if memory reclaim
 reentered the filesystem, place a call to memalloc_nofs_save() before it
 is acquired and a call to memalloc_nofs_restore() after it is released.
 Ideally also add a comment explaining why this lock will be problematic.
+A resource might be a lock which would need to be acquired by an attempt
+to reclaim memory, or it might be starting a transaction that should not
+nest over a memory reclaim transaction.  Deep knowledge of the filesystem
+or driver is often needed to place memory scoping calls correctly.
 
 Please note that the proper pairing of save/restore functions
 allows nesting so it is safe to call memalloc_noio_save() and

> > @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
> >  alignment is also guaranteed to be at least the respective size.
> >  
> >  For large allocations you can use vmalloc() and vzalloc(), or directly
> > -request pages from the page allocator. The memory allocated by `vmalloc`
> > -and related functions is not physically contiguous.
> > +request pages from the page allocator.  The memory allocated by `vmalloc`
> > +and related functions is not physically contiguous.  The `vmalloc`
> > +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> > +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> > +the allocator which are hard to remove.  However, the scope APIs described
> > +above can be used to limit the `vmalloc` functions.
> 
> I would reiterate "Do not just wrap vmalloc by the scope api but rather
> rely on the real scope for the NOFS/NOIO context". Maybe we want to
> stress out that once a scope is defined it is sticky to _all_
> allocations and all allocators within that scope. The text is already
> saying that but maybe we want to make it explicit and make it stand out.

yes.  I went with:

@@ -139,7 +143,10 @@ and related functions is not physically contiguous.  The `vmalloc`
 family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
 flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
 the allocator which are hard to remove.  However, the scope APIs described
-above can be used to limit the `vmalloc` functions.
+above can be used to limit the `vmalloc` functions.  As described above,
+do not simply wrap individual calls in the scope APIs, but look for the
+underlying reason why the memory allocation may not call into filesystems
+or block devices.
 
 If you are not sure whether the allocation size is too large for
 `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will


> [...]
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 6484569f50df..9fc091274d1d 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
> >  		 * them.  noio implies neither IO nor FS and it is a weaker
> >  		 * context so always make sure it takes precedence.
> >  		 */
> > -		if (current->memalloc_nowait)
> > +		if (current->memalloc_nowait) {
> >  			flags &= ~__GFP_DIRECT_RECLAIM;
> > -		else if (current->memalloc_noio)
> > +			flags |= __GFP_NOWARN;
> 
> I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
> with the initial implementation. Maybe we will learn later that there is
> just too much unhelpful noise in the kernel log and will reconsider but
> I wouldn't just start with that. Also we might learn that there will be
> other modifiers for atomic (or should I say non-sleeping) scopes to be
> defined. E.g. access to memory reserves but let's just wait for real
> usecases.

Fair enough.  I'll drop that part.  Thanks!
Michal Hocko July 1, 2020, 5:53 a.m. UTC | #13
On Wed 01-07-20 05:12:03, Matthew Wilcox wrote:
> On Tue, Jun 30, 2020 at 08:34:36AM +0200, Michal Hocko wrote:
> > On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
> > [...]
> > > The documentation is hard to add a new case to, so I rewrote it.  What
> > > do you think?  (Obviously I'll split this out differently for submission;
> > > this is just what I have in my tree right now).
> > 
> > I am fine with your changes. Few notes below.
> 
> Thanks!
> 
> > > -It turned out though that above approach has led to
> > > -abuses when the restricted gfp mask is used "just in case" without a
> > > -deeper consideration which leads to problems because an excessive use
> > > -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > > -reclaim issues.
> > 
> > I believe this is an important part because it shows that new people
> > coming to the existing code shouldn't take it as correct and rather
> > question it. Also having a clear indication that overuse is causing real
> > problems that might be not immediately visible to subsystems outside of
> > MM.
> 
> It seemed to say a lot of the same things as this paragraph:
> 
> +You may notice that quite a few allocations in the existing code specify
> +``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
> +recursion deadlocks caused by direct memory reclaim calling back into
> +the FS or IO paths and blocking on already held resources. Since 4.12
> +the preferred way to address this issue is to use the new scope APIs
> +described below.
> 
> Since this is in core-api/ rather than vm/, I felt that discussion of
> the problems that it causes to the mm was a bit too much detail for the
> people who would be reading this document.  Maybe I could move that
> information into a new Documentation/vm/reclaim.rst file?

Hmm, my experience is that at least some users of NOFS/NOIO use this
flag just to be sure they do not do something wrong without realizing
that this might have a very negative effect on the whole system
operation. That was the main motivation to have an explicit note there.
I am not sure having that in MM internal documentation will make it
stand out for a general reader.

But I will not insist of course.

> Let's see if Our Grumpy Editor has time to give us his advice on this.
> 
> > > -FS/IO code then simply calls the appropriate save function before
> > > -any critical section with respect to the reclaim is started - e.g.
> > > -lock shared with the reclaim context or when a transaction context
> > > -nesting would be possible via reclaim.  
> > 
> > [...]
> > 
> > > +These functions should be called at the point where any memory allocation
> > > +would start to cause problems.  That is, do not simply wrap individual
> > > +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > > +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > > +find the lock which is taken that would cause problems if memory reclaim
> > > +reentered the filesystem, place a call to memalloc_nofs_save() before it
> > > +is acquired and a call to memalloc_nofs_restore() after it is released.
> > > +Ideally also add a comment explaining why this lock will be problematic.
> > 
> > The above text has mentioned the transaction context nesting as well and
> > that was a hint by Dave IIRC. It is imho good to have an example of
> > other reentrant points than just locks. I believe another useful example
> > would be something like loop device which is mixing IO and FS layers but
> > I am not familiar with all the details to give you an useful text.
> 
> I'll let Mikulas & Dave finish fighting about that before I write any
> text mentioning the loop driver.  How about this for mentioning the
> filesystem transaction possibility?
> 
> @@ -103,12 +103,16 @@ flags specified by any particular call to allocate memory.
>  
>  These functions should be called at the point where any memory allocation
>  would start to cause problems.  That is, do not simply wrap individual
> -memory allocation calls which currently use ``GFP_NOFS`` with a pair
> -of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> -find the lock which is taken that would cause problems if memory reclaim
> +memory allocation calls which currently use ``GFP_NOFS`` with a pair of
> +calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead, find
> +the resource which is acquired that would cause problems if memory reclaim
>  reentered the filesystem, place a call to memalloc_nofs_save() before it
>  is acquired and a call to memalloc_nofs_restore() after it is released.
>  Ideally also add a comment explaining why this lock will be problematic.
> +A resource might be a lock which would need to be acquired by an attempt
> +to reclaim memory, or it might be starting a transaction that should not
> +nest over a memory reclaim transaction.  Deep knowledge of the filesystem
> +or driver is often needed to place memory scoping calls correctly.

Ack

>  Please note that the proper pairing of save/restore functions
>  allows nesting so it is safe to call memalloc_noio_save() and
> 
> > > @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
> > >  alignment is also guaranteed to be at least the respective size.
> > >  
> > >  For large allocations you can use vmalloc() and vzalloc(), or directly
> > > -request pages from the page allocator. The memory allocated by `vmalloc`
> > > -and related functions is not physically contiguous.
> > > +request pages from the page allocator.  The memory allocated by `vmalloc`
> > > +and related functions is not physically contiguous.  The `vmalloc`
> > > +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> > > +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> > > +the allocator which are hard to remove.  However, the scope APIs described
> > > +above can be used to limit the `vmalloc` functions.
> > 
> > I would reiterate "Do not just wrap vmalloc by the scope api but rather
> > rely on the real scope for the NOFS/NOIO context". Maybe we want to
> > stress out that once a scope is defined it is sticky to _all_
> > allocations and all allocators within that scope. The text is already
> > saying that but maybe we want to make it explicit and make it stand out.
> 
> yes.  I went with:
> 
> @@ -139,7 +143,10 @@ and related functions is not physically contiguous.  The `vmalloc`
>  family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
>  flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
>  the allocator which are hard to remove.  However, the scope APIs described
> -above can be used to limit the `vmalloc` functions.
> +above can be used to limit the `vmalloc` functions.  As described above,
> +do not simply wrap individual calls in the scope APIs, but look for the
> +underlying reason why the memory allocation may not call into filesystems
> +or block devices.

ack

>  
>  If you are not sure whether the allocation size is too large for
>  `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will
> 
> 
> > [...]
> > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > index 6484569f50df..9fc091274d1d 100644
> > > --- a/include/linux/sched/mm.h
> > > +++ b/include/linux/sched/mm.h
> > > @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
> > >  		 * them.  noio implies neither IO nor FS and it is a weaker
> > >  		 * context so always make sure it takes precedence.
> > >  		 */
> > > -		if (current->memalloc_nowait)
> > > +		if (current->memalloc_nowait) {
> > >  			flags &= ~__GFP_DIRECT_RECLAIM;
> > > -		else if (current->memalloc_noio)
> > > +			flags |= __GFP_NOWARN;
> > 
> > I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
> > with the initial implementation. Maybe we will learn later that there is
> > just too much unhelpful noise in the kernel log and will reconsider but
> > I wouldn't just start with that. Also we might learn that there will be
> > other modifiers for atomic (or should I say non-sleeping) scopes to be
> > defined. E.g. access to memory reserves but let's just wait for real
> > usecases.
> 
> Fair enough.  I'll drop that part.  Thanks!

thanks!
Mike Rapoport July 1, 2020, 7:04 a.m. UTC | #14
On Wed, Jul 01, 2020 at 07:53:46AM +0200, Michal Hocko wrote:
> On Wed 01-07-20 05:12:03, Matthew Wilcox wrote:
> > On Tue, Jun 30, 2020 at 08:34:36AM +0200, Michal Hocko wrote:
> > > On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
> > > [...]
> > > > The documentation is hard to add a new case to, so I rewrote it.  What
> > > > do you think?  (Obviously I'll split this out differently for submission;
> > > > this is just what I have in my tree right now).
> > > 
> > > I am fine with your changes. Few notes below.
> > 
> > Thanks!
> > 
> > > > -It turned out though that above approach has led to
> > > > -abuses when the restricted gfp mask is used "just in case" without a
> > > > -deeper consideration which leads to problems because an excessive use
> > > > -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > > > -reclaim issues.
> > > 
> > > I believe this is an important part because it shows that new people
> > > coming to the existing code shouldn't take it as correct and rather
> > > question it. Also having a clear indication that overuse is causing real
> > > problems that might be not immediately visible to subsystems outside of
> > > MM.
> > 
> > It seemed to say a lot of the same things as this paragraph:
> > 
> > +You may notice that quite a few allocations in the existing code specify
> > +``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
> > +recursion deadlocks caused by direct memory reclaim calling back into
> > +the FS or IO paths and blocking on already held resources. Since 4.12
> > +the preferred way to address this issue is to use the new scope APIs
> > +described below.
> > 
> > Since this is in core-api/ rather than vm/, I felt that discussion of
> > the problems that it causes to the mm was a bit too much detail for the
> > people who would be reading this document.  Maybe I could move that
> > information into a new Documentation/vm/reclaim.rst file?

It would be nice to have Documentation/vm/reclaim.rst regardless ;-)

> Hmm, my experience is that at least some users of NOFS/NOIO use this
> flag just to be sure they do not do something wrong without realizing
> that this might have a very negative effect on the whole system
> operation. That was the main motivation to have an explicit note there.
> I am not sure having that in MM internal documentation will make it
> stand out for a general reader.

I'd add an explict note in the "Memory Scoping API" section. Please see
below.

> But I will not insist of course.
> 
> > Let's see if Our Grumpy Editor has time to give us his advice on this.
> > 
> > > > -FS/IO code then simply calls the appropriate save function before
> > > > -any critical section with respect to the reclaim is started - e.g.
> > > > -lock shared with the reclaim context or when a transaction context
> > > > -nesting would be possible via reclaim.  
> > > 
> > > [...]
> > > 
> > > > +These functions should be called at the point where any memory allocation
> > > > +would start to cause problems.  That is, do not simply wrap individual
> > > > +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > > > +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > > > +find the lock which is taken that would cause problems if memory reclaim
> > > > +reentered the filesystem, place a call to memalloc_nofs_save() before it
> > > > +is acquired and a call to memalloc_nofs_restore() after it is released.
> > > > +Ideally also add a comment explaining why this lock will be problematic.
> > > 
> > > The above text has mentioned the transaction context nesting as well and
> > > that was a hint by Dave IIRC. It is imho good to have an example of
> > > other reentrant points than just locks. I believe another useful example
> > > would be something like loop device which is mixing IO and FS
> > > layers but I am not familiar with all the details to give you an
> > > useful text.
> > 
> > I'll let Mikulas & Dave finish fighting about that before I write
> > any text mentioning the loop driver.  How about this for mentioning
> > the filesystem transaction possibility?
> > 
> > @@ -103,12 +103,16 @@ flags specified by any particular call to allocate memory.
> >  
> >  These functions should be called at the point where any memory allocation
> >  would start to cause problems.  That is, do not simply wrap individual
> > -memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > -of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > -find the lock which is taken that would cause problems if memory reclaim
> > +memory allocation calls which currently use ``GFP_NOFS`` with a pair of
> > +calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead, find
> > +the resource which is acquired that would cause problems if memory reclaim
> >  reentered the filesystem, place a call to memalloc_nofs_save() before it
> >  is acquired and a call to memalloc_nofs_restore() after it is released.
> >  Ideally also add a comment explaining why this lock will be problematic.
> > +A resource might be a lock which would need to be acquired by an attempt
> > +to reclaim memory, or it might be starting a transaction that should not
> > +nest over a memory reclaim transaction.  Deep knowledge of the filesystem
> > +or driver is often needed to place memory scoping calls correctly.

I'd s/often/always/ :)

> Ack

And

+ Using memory scoping APIs "just in case" may lead to problematic
reclaim behaviour and have a very negative effect on the whole system
operation.

> >  Please note that the proper pairing of save/restore functions
> >  allows nesting so it is safe to call memalloc_noio_save() and
> > 
> > > > @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a power of two, the
> > > >  alignment is also guaranteed to be at least the respective size.
> > > >  
> > > >  For large allocations you can use vmalloc() and vzalloc(), or directly
> > > > -request pages from the page allocator. The memory allocated by `vmalloc`
> > > > -and related functions is not physically contiguous.
> > > > +request pages from the page allocator.  The memory allocated by `vmalloc`
> > > > +and related functions is not physically contiguous.  The `vmalloc`
> > > > +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> > > > +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> > > > +the allocator which are hard to remove.  However, the scope APIs described
> > > > +above can be used to limit the `vmalloc` functions.
> > > 
> > > I would reiterate "Do not just wrap vmalloc by the scope api but rather
> > > rely on the real scope for the NOFS/NOIO context". Maybe we want to
> > > stress out that once a scope is defined it is sticky to _all_
> > > allocations and all allocators within that scope. The text is already
> > > saying that but maybe we want to make it explicit and make it stand out.
> > 
> > yes.  I went with:
> > 
> > @@ -139,7 +143,10 @@ and related functions is not physically contiguous.  The `vmalloc`
> >  family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> >  flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> >  the allocator which are hard to remove.  However, the scope APIs described
> > -above can be used to limit the `vmalloc` functions.
> > +above can be used to limit the `vmalloc` functions.  As described above,
> > +do not simply wrap individual calls in the scope APIs, but look for the
> > +underlying reason why the memory allocation may not call into filesystems
> > +or block devices.
> 
> ack
> 
> >  
> >  If you are not sure whether the allocation size is too large for
> >  `kmalloc`, it is possible to use kvmalloc() and its derivatives. It will
> > 
> > 
> > > [...]
> > > > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > > > index 6484569f50df..9fc091274d1d 100644
> > > > --- a/include/linux/sched/mm.h
> > > > +++ b/include/linux/sched/mm.h
> > > > @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
> > > >  		 * them.  noio implies neither IO nor FS and it is a weaker
> > > >  		 * context so always make sure it takes precedence.
> > > >  		 */
> > > > -		if (current->memalloc_nowait)
> > > > +		if (current->memalloc_nowait) {
> > > >  			flags &= ~__GFP_DIRECT_RECLAIM;
> > > > -		else if (current->memalloc_noio)
> > > > +			flags |= __GFP_NOWARN;
> > > 
> > > I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
> > > with the initial implementation. Maybe we will learn later that there is
> > > just too much unhelpful noise in the kernel log and will reconsider but
> > > I wouldn't just start with that. Also we might learn that there will be
> > > other modifiers for atomic (or should I say non-sleeping) scopes to be
> > > defined. E.g. access to memory reserves but let's just wait for real
> > > usecases.
> > 
> > Fair enough.  I'll drop that part.  Thanks!
> 
> thanks!
> -- 
> Michal Hocko
> SUSE Labs

Patch
diff mbox series

diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 6d1565021d74..140ada9a2c8f 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -412,23 +412,6 @@  static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
 
 	*data_mode = DATA_MODE_VMALLOC;
 
-	/*
-	 * __vmalloc allocates the data pages and auxiliary structures with
-	 * gfp_flags that were specified, but pagetables are always allocated
-	 * with GFP_KERNEL, no matter what was specified as gfp_mask.
-	 *
-	 * Consequently, we must set per-process flag PF_MEMALLOC_NOIO so that
-	 * all allocations done by this process (including pagetables) are done
-	 * as if GFP_NOIO was specified.
-	 */
-	if (gfp_mask & __GFP_NORETRY) {
-		unsigned noio_flag = memalloc_noio_save();
-		void *ptr = __vmalloc(c->block_size, gfp_mask);
-
-		memalloc_noio_restore(noio_flag);
-		return ptr;
-	}
-
 	return __vmalloc(c->block_size, gfp_mask);
 }
 
@@ -866,9 +849,6 @@  static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 	 * dm-bufio is resistant to allocation failures (it just keeps
 	 * one buffer reserved in cases all the allocations fail).
 	 * So set flags to not try too hard:
-	 *	GFP_NOWAIT: don't wait; if we need to sleep we'll release our
-	 *		    mutex and wait ourselves.
-	 *	__GFP_NORETRY: don't retry and rather return failure
 	 *	__GFP_NOMEMALLOC: don't use emergency reserves
 	 *	__GFP_NOWARN: don't print a warning in case of failure
 	 *
@@ -877,7 +857,9 @@  static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 	 */
 	while (1) {
 		if (dm_bufio_cache_size_latch != 1) {
-			b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			unsigned nowait_flag = memalloc_nowait_save();
+			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			memalloc_nowait_restore(nowait_flag);
 			if (b)
 				return b;
 		}
@@ -886,8 +868,12 @@  static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client
 			return NULL;
 
 		if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
+			unsigned noio_flag;
+
 			dm_bufio_unlock(c);
-			b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			noio_flag = memalloc_noio_save();
+			b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | __GFP_NOWARN);
+			memalloc_noio_restore(noio_flag);
 			dm_bufio_lock(c);
 			if (b)
 				return b;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 90336850e940..b1c2cddd366c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -803,6 +803,7 @@  struct task_struct {
 #endif
 	unsigned			memalloc_noio:1;
 	unsigned			memalloc_nofs:1;
+	unsigned			memalloc_nowait:1;
 	unsigned			memalloc_nocma:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 6f7b59a848a6..6484569f50df 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -179,12 +179,16 @@  static inline bool in_vfork(struct task_struct *tsk)
 static inline gfp_t current_gfp_context(gfp_t flags)
 {
 	if (unlikely(current->memalloc_noio || current->memalloc_nofs ||
-		     current->memalloc_nocma)) {
+		     current->memalloc_nocma) || current->memalloc_nowait) {
 		/*
-		 * NOIO implies both NOIO and NOFS and it is a weaker context
-		 * so always make sure it makes precedence
+		 * Clearing DIRECT_RECLAIM means we won't get to the point
+		 * of testing IO or FS, so we don't need to bother clearing
+		 * them.  noio implies neither IO nor FS and it is a weaker
+		 * context so always make sure it takes precedence.
 		 */
-		if (current->memalloc_noio)
+		if (current->memalloc_nowait)
+			flags &= ~__GFP_DIRECT_RECLAIM;
+		else if (current->memalloc_noio)
 			flags &= ~(__GFP_IO | __GFP_FS);
 		else if (current->memalloc_nofs)
 			flags &= ~__GFP_FS;