diff mbox

dm bufio: Reduce dm_bufio_lock contention

Message ID alpine.LRH.2.02.1806191228110.25656@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mikulas Patocka June 22, 2018, 1:17 a.m. UTC
On Tue, 19 Jun 2018, Michal Hocko wrote:

> On Mon 18-06-18 18:11:26, Mikulas Patocka wrote:
> [...]
> > I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases 
> > without a fallback - those are bugs that make various functions randomly 
> > return -ENOMEM.
> 
> Well, maybe those are just optimistic attempts to allocate memory and
> have a fallback somewhere else. So I would be careful calling them
> outright bugs. But maybe you are right.

I was trying to find the fallback code when I triaged them and maked as 
"BUG" those cases where I didn't find it. You can search harder and 
perhaps you'll find something that I didn't.

> > Most of the callers provide callback.
> > 
> > There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two 
> > different functions - if the allocation is larger than 
> > PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller. 
> > If the allocations is smaller than PAGE_ALLOC_COSTLY_ORDER, 
> > __GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations 
> > don't trigger the oom killer at all).
> 
> Well, the primary purpose of this flag is to provide a consistent
> failure behavior for all requests regardless of the size.
> 
> > So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in 
> > the cases where the caller wants to avoid trigerring the oom killer (the 
> > problem is that __GFP_NORETRY causes random failure even in no-oom 
> > situations but __GFP_RETRY_MAYFAIL doesn't).
> 
> myabe yes.
> 
> > So my suggestion is - fix these obvious bugs when someone allocates memory 
> > with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be 
> > just changed to return NULL instead of sleeping.
> 
> No real objection to fixing wrong __GFP_NORETRY usage. But __GFP_NORETRY
> can sleep. Nothing will really change in that regards.  It does a
> reclaim and that _might_ sleep.
> 
> But seriously, isn't the best way around the throttling issue to use
> PF_LESS_THROTTLE?

Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would 
be better to change it just in one place than to add PF_LESS_THROTTLE to 
every block device driver (because adding it to every block driver results 
in more code).

What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
request comes from a block device driver or a filesystem), we should not 
sleep.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

Comments

Michal Hocko June 22, 2018, 9:01 a.m. UTC | #1
On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
[...]
> > But seriously, isn't the best way around the throttling issue to use
> > PF_LESS_THROTTLE?
> 
> Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would 
> be better to change it just in one place than to add PF_LESS_THROTTLE to 
> every block device driver (because adding it to every block driver results 
> in more code).

Why would every block device need this? I thought we were talking about
mempool allocator and the md variant of it. They are explicitly doing
their own back off so PF_LESS_THROTTLE sounds appropriate to me.

> What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
> request comes from a block device driver or a filesystem), we should not 
> sleep.

Why? How are you going to audit all the callers that the behavior makes
sense and moreover how are you going to ensure that future usage will
still make sense. The more subtle side effects gfp flags have the harder
they are to maintain.

> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
> 
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
>  		 * the LRU too quickly.
>  		 */
>  		if (!sc->hibernation_mode && !current_is_kswapd() &&
> +		   (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
>  		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
>  			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>
Michal Hocko June 22, 2018, 9:09 a.m. UTC | #2
On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
[...]
> > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
> > request comes from a block device driver or a filesystem), we should not 
> > sleep.
> 
> Why? How are you going to audit all the callers that the behavior makes
> sense and moreover how are you going to ensure that future usage will
> still make sense. The more subtle side effects gfp flags have the harder
> they are to maintain.

So just as an excercise. Try to explain the above semantic to users. We
currently have the following.

 * __GFP_NORETRY: The VM implementation will try only very lightweight
 *   memory direct reclaim to get some memory under memory pressure (thus
 *   it can sleep). It will avoid disruptive actions like OOM killer. The
 *   caller must handle the failure which is quite likely to happen under
 *   heavy memory pressure. The flag is suitable when failure can easily be
 *   handled at small cost, such as reduced throughput

 * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
 *   allocator recursing into the filesystem which might already be holding
 *   locks.

So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
is the actual semantic without explaining the whole reclaim or force
users to look into the code to understand that? What about GFP_NOIO |
__GFP_NORETRY? What does it mean to that "should not sleep". Do all
shrinkers have to follow that as well?
Mikulas Patocka June 22, 2018, 12:44 p.m. UTC | #3
On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> [...]
> > > But seriously, isn't the best way around the throttling issue to use
> > > PF_LESS_THROTTLE?
> > 
> > Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would 
> > be better to change it just in one place than to add PF_LESS_THROTTLE to 
> > every block device driver (because adding it to every block driver results 
> > in more code).
> 
> Why would every block device need this? I thought we were talking about
> mempool allocator and the md variant of it. They are explicitly doing
> their own back off so PF_LESS_THROTTLE sounds appropriate to me.

Because every block driver is suspicible to this problem. Two years ago, 
there was a bug that when the user was swapping to dm-crypt device, memory 
management would stall the allocations done by dm-crypt by 100ms - that 
slowed down swapping rate and made the machine unuseable.

Then, people are complaining about the same problem in dm-bufio.

Next time, it will be something else.

(you will answer : "we will not fix bugs unless people are reporting them" :-)

> > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
> > request comes from a block device driver or a filesystem), we should not 
> > sleep.
> 
> Why? How are you going to audit all the callers that the behavior makes
> sense and moreover how are you going to ensure that future usage will
> still make sense. The more subtle side effects gfp flags have the harder
> they are to maintain.

I did audit them - see the previous email in this thread: 
https://www.redhat.com/archives/dm-devel/2018-June/thread.html

Mikulas

> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Cc: stable@vger.kernel.org
> > 
> > Index: linux-2.6/mm/vmscan.c
> > ===================================================================
> > --- linux-2.6.orig/mm/vmscan.c
> > +++ linux-2.6/mm/vmscan.c
> > @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
> >  		 * the LRU too quickly.
> >  		 */
> >  		if (!sc->hibernation_mode && !current_is_kswapd() &&
> > +		   (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
> >  		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
> >  			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> >  
> 
> -- 
> Michal Hocko
> SUSE Labs
>
Michal Hocko June 22, 2018, 1:10 p.m. UTC | #4
On Fri 22-06-18 08:44:52, Mikulas Patocka wrote:
> On Fri, 22 Jun 2018, Michal Hocko wrote:
[...]
> > Why? How are you going to audit all the callers that the behavior makes
> > sense and moreover how are you going to ensure that future usage will
> > still make sense. The more subtle side effects gfp flags have the harder
> > they are to maintain.
> 
> I did audit them - see the previous email in this thread: 
> https://www.redhat.com/archives/dm-devel/2018-June/thread.html

I do not see any mention about throttling expectations for those users.
You have focused only on the allocation failure fallback AFAIR
Mikulas Patocka June 22, 2018, 6:46 p.m. UTC | #5
On Fri, 22 Jun 2018, Michal Hocko wrote:

> On Fri 22-06-18 08:44:52, Mikulas Patocka wrote:
> > On Fri, 22 Jun 2018, Michal Hocko wrote:
> [...]
> > > Why? How are you going to audit all the callers that the behavior makes
> > > sense and moreover how are you going to ensure that future usage will
> > > still make sense. The more subtle side effects gfp flags have the harder
> > > they are to maintain.
> > 
> > I did audit them - see the previous email in this thread: 
> > https://www.redhat.com/archives/dm-devel/2018-June/thread.html
> 
> I do not see any mention about throttling expectations for those users.
> You have focused only on the allocation failure fallback AFAIR

How should the callers be analyzed with respect to throttling?

Mikulas
diff mbox

Patch

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -2674,6 +2674,7 @@  static bool shrink_node(pg_data_t *pgdat
 		 * the LRU too quickly.
 		 */
 		if (!sc->hibernation_mode && !current_is_kswapd() &&
+		   (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
 		   current_may_throttle() && pgdat_memcg_congested(pgdat, root))
 			wait_iff_congested(BLK_RW_ASYNC, HZ/10);