[Bug,87891] New: kernel BUG at mm/slab.c:2625!

Message ID 20141111153120.9131c8e1459415afff8645bc@linux-foundation.org (mailing list archive)
State New, archived

Commit Message

Andrew Morton Nov. 11, 2014, 11:31 p.m. UTC
(switched to email.  Please respond via emailed reply-to-all, not via the
bugzilla web interface).

On Thu, 06 Nov 2014 17:28:41 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=87891
> 
>             Bug ID: 87891
>            Summary: kernel BUG at mm/slab.c:2625!
>            Product: Memory Management
>            Version: 2.5
>     Kernel Version: 3.17.2
>           Hardware: i386
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: blocking
>           Priority: P1
>          Component: Slab Allocator
>           Assignee: akpm@linux-foundation.org
>           Reporter: luke-jr+linuxbugs@utopios.org
>         Regression: No

Well this is interesting.


> [359782.842112] kernel BUG at mm/slab.c:2625!
> ...
> [359782.843008] Call Trace:
> [359782.843017]  [<ffffffff8115181f>] __kmalloc+0xdf/0x200
> [359782.843037]  [<ffffffffa0466285>] ? ttm_page_pool_free+0x35/0x180 [ttm]
> [359782.843060]  [<ffffffffa0466285>] ttm_page_pool_free+0x35/0x180 [ttm]
> [359782.843084]  [<ffffffffa046674e>] ttm_pool_shrink_scan+0xae/0xd0 [ttm]
> [359782.843108]  [<ffffffff8111c2fb>] shrink_slab_node+0x12b/0x2e0
> [359782.843129]  [<ffffffff81127ed4>] ? fragmentation_index+0x14/0x70
> [359782.843150]  [<ffffffff8110fc3a>] ? zone_watermark_ok+0x1a/0x20
> [359782.843171]  [<ffffffff8111ceb8>] shrink_slab+0xc8/0x110
> [359782.843189]  [<ffffffff81120480>] do_try_to_free_pages+0x300/0x410
> [359782.843210]  [<ffffffff8112084b>] try_to_free_pages+0xbb/0x190
> [359782.843230]  [<ffffffff81113136>] __alloc_pages_nodemask+0x696/0xa90
> [359782.843253]  [<ffffffff8115810a>] do_huge_pmd_anonymous_page+0xfa/0x3f0
> [359782.843278]  [<ffffffff812dffe7>] ? debug_smp_processor_id+0x17/0x20
> [359782.843300]  [<ffffffff81118dc7>] ? __lru_cache_add+0x57/0xa0
> [359782.843321]  [<ffffffff811385ce>] handle_mm_fault+0x37e/0xdd0

It went pagefault
        ->__alloc_pages_nodemask
          ->shrink_slab
            ->ttm_pool_shrink_scan
              ->ttm_page_pool_free
                ->kmalloc
                  ->cache_grow
                    ->BUG_ON(flags & GFP_SLAB_BUG_MASK);

And I don't really know why - I'm not seeing anything in there which
can set a GFP flag which is inside GFP_SLAB_BUG_MASK.  However I see
lots of nits.

Core MM:

__alloc_pages_nodemask() does

	if (unlikely(!page)) {
		/*
		 * Runtime PM, block IO and its error handling path
		 * can deadlock because I/O on the device might not
		 * complete.
		 */
		gfp_mask = memalloc_noio_flags(gfp_mask);
		page = __alloc_pages_slowpath(gfp_mask, order,
				zonelist, high_zoneidx, nodemask,
				preferred_zone, classzone_idx, migratetype);
	}

so it permanently alters the value of incoming arg gfp_mask.  This
means that the following trace_mm_page_alloc() will print the wrong
value of gfp_mask, and if we later do the `goto retry_cpuset', we retry
with a possibly different gfp_mask.  Isn't this a bug?
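
The concern above can be illustrated with a small userspace sketch.  It is
not the kernel code: the SK_* values, noio_flags() and alloc_slow() are
hypothetical stand-ins, and noio_flags() only approximates what
memalloc_noio_flags() is understood to do (narrowing the mask for a task
that must not start I/O).  The point is just that the narrowed mask lives
in a separate local, so the caller's gfp_mask is still intact for tracing
and for any retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; treat them as assumptions, not as gfp.h. */
#define SK_GFP_WAIT 0x10u
#define SK_GFP_IO   0x40u
#define SK_GFP_FS   0x80u

/* What the sketch records about one slow-path allocation attempt. */
struct alloc_result {
	uint32_t traced_mask;   /* mask a tracepoint would report */
	uint32_t slowpath_mask; /* mask handed to the slow path */
};

/* Hypothetical stand-in for memalloc_noio_flags(): drop the IO bit
 * when the current task is marked as "no I/O". */
static uint32_t noio_flags(uint32_t mask, bool task_is_noio)
{
	return task_is_noio ? (mask & ~SK_GFP_IO) : mask;
}

/* Keep the caller's gfp_mask untouched; narrow into a local instead. */
static struct alloc_result alloc_slow(uint32_t gfp_mask, bool task_is_noio)
{
	uint32_t alloc_mask = noio_flags(gfp_mask, task_is_noio);
	struct alloc_result r;

	/* Narrowing may only clear bits, never add them. */
	assert((alloc_mask & ~gfp_mask) == 0);

	r.traced_mask = gfp_mask;     /* original value survives */
	r.slowpath_mask = alloc_mask; /* restricted value used for the attempt */
	return r;
}
```

With this shape a later retry would start again from gfp_mask rather than
from the already-narrowed value.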


Also, why are we even passing a gfp_t down to the shrinkers?  So they
can work out the allocation context - things like __GFP_IO, __GFP_FS,
etc?  Is it even appropriate to use that mask for a new allocation
attempt within a particular shrinker?


ttm:

I think it's a bad idea to be calling kmalloc() in the slab shrinker
function.  We *know* that the system is low on memory and is trying to
free things up.  Trying to allocate *more* memory at this time is
asking for trouble.  ttm_page_pool_free() could easily be tweaked to
use a fixed-size local array of page*'s to avoid that allocation.  Could
someone implement this please?
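
A minimal userspace sketch of that idea, assuming nothing about the real
ttm data structures: struct pool, pool_free() and FREE_BATCH are all
hypothetical names, and malloc()/free() stand in for page allocation.  The
pages to release are gathered into a fixed-size array on the stack and
freed in batches, so the free path itself never allocates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define FREE_BATCH 16 /* hypothetical batch size; small enough for the stack */

/* Hypothetical pool: entries [0, count) of pages are valid. */
struct pool {
	void **pages;
	size_t count;
};

/* Free up to nr_free pages from the pool without allocating memory.
 * Returns the number of pages actually freed. */
static size_t pool_free(struct pool *p, size_t nr_free)
{
	void *batch[FREE_BATCH]; /* fixed-size local array, no kmalloc-like call */
	size_t freed = 0;

	assert(p != NULL);

	while (nr_free > 0 && p->count > 0) {
		size_t n = 0;
		size_t i;

		/* Step 1: unlink up to FREE_BATCH pages from the pool.
		 * In a real pool this part would run under the pool lock. */
		while (n < FREE_BATCH && nr_free > 0 && p->count > 0) {
			batch[n++] = p->pages[--p->count];
			nr_free--;
		}

		/* Step 2: release the batch, conceptually after unlocking. */
		for (i = 0; i < n; i++)
			free(batch[i]);
		freed += n;
	}
	return freed;
}
```

The cost is a few extra lock round trips when many pages are freed at
once, in exchange for never calling the allocator from the shrinker.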


slab:

There's no point in doing

	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)

because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
What's it trying to do here?

And it's quite infuriating to go BUG when the code could easily warn
and fix it up.

And it's quite infuriating to go BUG because one of the bits was set,
but not tell us which bit it was!


Could the slab guys please review this?

From: Andrew Morton <akpm@linux-foundation.org>
Subject: slab: improve checking for invalid gfp_flags

- The code goes BUG, but doesn't tell us which bits were unexpectedly
  set.  Print that out.

- The code goes BUG when it could just fix things up and proceed.  Do that.

- ~__GFP_BITS_MASK already includes __GFP_DMA32 and __GFP_HIGHMEM, so
  remove those from the GFP_SLAB_BUG_MASK definition.

Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/gfp.h |    2 +-
 mm/slab.c           |    5 ++++-
 mm/slub.c           |    5 ++++-
 3 files changed, 9 insertions(+), 3 deletions(-)

Comments

Christoph Lameter (Ampere) Nov. 12, 2014, 12:36 a.m. UTC | #1
On Tue, 11 Nov 2014, Andrew Morton wrote:

> There's no point in doing
>
> 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
>
> because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.

?? ~__GFP_BITS_MASK means bits 25 to 31 are set.

__GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
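
The bit arithmetic can be checked with a small userspace sketch.  The
SKETCH_* constants below are assumptions that simply encode the bit
positions stated in this reply (25 valid gfp bits, __GFP_HIGHMEM at bit 1,
__GFP_DMA32 at bit 2); they are not copied from gfp.h, and the function
names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout, taken from the statements in this thread. */
#define SKETCH_GFP_HIGHMEM    (1u << 1)
#define SKETCH_GFP_DMA32      (1u << 2)
#define SKETCH_GFP_BITS_SHIFT 25
#define SKETCH_GFP_BITS_MASK  ((1u << SKETCH_GFP_BITS_SHIFT) - 1)

/* Original definition: the two zone bits plus everything above bit 24. */
static uint32_t slab_bug_mask_original(void)
{
	return SKETCH_GFP_DMA32 | SKETCH_GFP_HIGHMEM | ~SKETCH_GFP_BITS_MASK;
}

/* Definition from the proposed patch: only the bits above bit 24. */
static uint32_t slab_bug_mask_patched(void)
{
	return ~SKETCH_GFP_BITS_MASK;
}

/* Which bits of flags would a given bug mask catch? */
static uint32_t offending_bits(uint32_t flags, uint32_t bug_mask)
{
	return flags & bug_mask;
}

/* Encodes the point made above: the zone bits are not part of
 * ~SKETCH_GFP_BITS_MASK, so the two definitions differ. */
static void self_check(void)
{
	uint32_t zone_bits = SKETCH_GFP_DMA32 | SKETCH_GFP_HIGHMEM;

	assert((slab_bug_mask_patched() & zone_bits) == 0);
	assert((slab_bug_mask_original() & zone_bits) == zone_bits);
}
```

Under these assumptions a mask carrying the HIGHMEM bit is caught by the
original definition but passes the patched one unnoticed.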
Joonsoo Kim Nov. 12, 2014, 12:44 a.m. UTC | #2
On Tue, Nov 11, 2014 at 03:31:20PM -0800, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
> 
> On Thu, 06 Nov 2014 17:28:41 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
> 
> > https://bugzilla.kernel.org/show_bug.cgi?id=87891
> > 
> >             Bug ID: 87891
> >            Summary: kernel BUG at mm/slab.c:2625!
> >            Product: Memory Management
> >            Version: 2.5
> >     Kernel Version: 3.17.2
> >           Hardware: i386
> >                 OS: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: blocking
> >           Priority: P1
> >          Component: Slab Allocator
> >           Assignee: akpm@linux-foundation.org
> >           Reporter: luke-jr+linuxbugs@utopios.org
> >         Regression: No
> 
> Well this is interesting.
> 
> 
> > [359782.842112] kernel BUG at mm/slab.c:2625!
> > ...
> > [359782.843008] Call Trace:
> > [359782.843017]  [<ffffffff8115181f>] __kmalloc+0xdf/0x200
> > [359782.843037]  [<ffffffffa0466285>] ? ttm_page_pool_free+0x35/0x180 [ttm]
> > [359782.843060]  [<ffffffffa0466285>] ttm_page_pool_free+0x35/0x180 [ttm]
> > [359782.843084]  [<ffffffffa046674e>] ttm_pool_shrink_scan+0xae/0xd0 [ttm]
> > [359782.843108]  [<ffffffff8111c2fb>] shrink_slab_node+0x12b/0x2e0
> > [359782.843129]  [<ffffffff81127ed4>] ? fragmentation_index+0x14/0x70
> > [359782.843150]  [<ffffffff8110fc3a>] ? zone_watermark_ok+0x1a/0x20
> > [359782.843171]  [<ffffffff8111ceb8>] shrink_slab+0xc8/0x110
> > [359782.843189]  [<ffffffff81120480>] do_try_to_free_pages+0x300/0x410
> > [359782.843210]  [<ffffffff8112084b>] try_to_free_pages+0xbb/0x190
> > [359782.843230]  [<ffffffff81113136>] __alloc_pages_nodemask+0x696/0xa90
> > [359782.843253]  [<ffffffff8115810a>] do_huge_pmd_anonymous_page+0xfa/0x3f0
> > [359782.843278]  [<ffffffff812dffe7>] ? debug_smp_processor_id+0x17/0x20
> > [359782.843300]  [<ffffffff81118dc7>] ? __lru_cache_add+0x57/0xa0
> > [359782.843321]  [<ffffffff811385ce>] handle_mm_fault+0x37e/0xdd0
> 
> It went pagefault
>         ->__alloc_pages_nodemask
>           ->shrink_slab
>             ->ttm_pool_shrink_scan
>               ->ttm_page_pool_free
>                 ->kmalloc
>                   ->cache_grow
>                     ->BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
> And I don't really know why - I'm not seeing anything in there which
> can set a GFP flag which is outside GFP_SLAB_BUG_MASK.  However I see
> lots of nits.
> 
> Core MM:
> 
> __alloc_pages_nodemask() does
> 
> 	if (unlikely(!page)) {
> 		/*
> 		 * Runtime PM, block IO and its error handling path
> 		 * can deadlock because I/O on the device might not
> 		 * complete.
> 		 */
> 		gfp_mask = memalloc_noio_flags(gfp_mask);
> 		page = __alloc_pages_slowpath(gfp_mask, order,
> 				zonelist, high_zoneidx, nodemask,
> 				preferred_zone, classzone_idx, migratetype);
> 	}
> 
> so it permanently alters the value of incoming arg gfp_mask.  This
> means that the following trace_mm_page_alloc() will print the wrong
> value of gfp_mask, and if we later do the `goto retry_cpuset', we retry
> with a possibly different gfp_mask.  Isn't this a bug?
> 
> 
> Also, why are we even passing a gfp_t down to the shrinkers?  So they
> can work out the allocation context - things like __GFP_IO, __GFP_FS,
> etc?  Is it even appropriate to use that mask for a new allocation
> attempt within a particular shrinker?
> 
> 
> ttm:
> 
> I think it's a bad idea to be calling kmalloc() in the slab shrinker
> function.  We *know* that the system is low on memory and is trying to
> free things up.  Trying to allocate *more* memory at this time is
> asking for trouble.  ttm_page_pool_free() could easily be tweaked to
> use a fixed-size local array of page*'s t avoid that allocation.  Could
> someone implement this please?
> 
> 
> slab:
> 
> There's no point in doing
> 
> 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> 
> because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> What's it trying to do here?

Hello, Andrew.

__GFP_DMA32 and __GFP_HIGHMEM aren't included in ~__GFP_BITS_MASK.
~__GFP_BITS_MASK means all the high bits, excluding all gfp bits.

As you already know, HIGHMEM isn't appropriate for slab because there
is no direct mapping for this memory. And if we want memory only from
the DMA32 area, a specific kmem_cache is needed. But there is no
interface for it, so allocation for DMA32 is also restricted here.

> 
> And it's quite infuriating to go BUG when the code could easily warn
> and fix it up.

If the user wants memory in HIGHMEM, it can easily be fixed by the
following change, because all memory is compatible with HIGHMEM. But if
the user wants memory in DMA32, it's not easy to fix, because memory in
NORMAL isn't compatible with DMA32. Slab could return an object from
another slab page even if cache_grow() is successfully called. So
BUG_ON() here looks like the right thing to me. We cannot know in
advance whether ignoring this flag causes a more serious result or not.

> 
> And it's quite infuriating to go BUG because one of the bits was set,
> but not tell us which bit it was!

Agreed. Let's fix it.

Thanks.

> 
> Could the slab guys please review this?
> 
> From: Andrew Morton <akpm@linux-foundation.org>
> Subject: slab: improve checking for invalid gfp_flags
> 
> - The code goes BUG, but doesn't tell us which bits were unexpectedly
>   set.  Print that out.
> 
> - The code goes BUG when it could jsut fix things up and proceed.  Do that.
> 
> - ~__GFP_BITS_MASK already includes __GFP_DMA32 and __GFP_HIGHMEM, so
>   remove those from the GFP_SLAB_BUG_MASK definition.
> 
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  include/linux/gfp.h |    2 +-
>  mm/slab.c           |    5 ++++-
>  mm/slub.c           |    5 ++++-
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff -puN include/linux/gfp.h~slab-improve-checking-for-invalid-gfp_flags include/linux/gfp.h
> --- a/include/linux/gfp.h~slab-improve-checking-for-invalid-gfp_flags
> +++ a/include/linux/gfp.h
> @@ -145,7 +145,7 @@ struct vm_area_struct;
>  #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
>  
>  /* Do not use these with a slab allocator */
> -#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> +#define GFP_SLAB_BUG_MASK (~__GFP_BITS_MASK)
>  
>  /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
>     platforms, used as appropriate on others */
> diff -puN mm/slab.c~slab-improve-checking-for-invalid-gfp_flags mm/slab.c
> --- a/mm/slab.c~slab-improve-checking-for-invalid-gfp_flags
> +++ a/mm/slab.c
> @@ -2590,7 +2590,10 @@ static int cache_grow(struct kmem_cache
>  	 * Be lazy and only check for valid flags here,  keeping it out of the
>  	 * critical path in kmem_cache_alloc().
>  	 */
> -	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> +	if (WARN_ON(flags & GFP_SLAB_BUG_MASK)) {
> +		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
> +		flags &= ~GFP_SLAB_BUG_MASK;
> +	}
>  	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
>  
>  	/* Take the node list lock to change the colour_next on this node */
> diff -puN mm/slub.c~slab-improve-checking-for-invalid-gfp_flags mm/slub.c
> --- a/mm/slub.c~slab-improve-checking-for-invalid-gfp_flags
> +++ a/mm/slub.c
> @@ -1377,7 +1377,10 @@ static struct page *new_slab(struct kmem
>  	int order;
>  	int idx;
>  
> -	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> +	if (WARN_ON(flags & GFP_SLAB_BUG_MASK)) {
> +		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
> +		flags &= ~GFP_SLAB_BUG_MASK;
> +	}
>  
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> _
> 
Andrew Morton Nov. 12, 2014, 12:49 a.m. UTC | #3
On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:

> On Tue, 11 Nov 2014, Andrew Morton wrote:
> 
> > There's no point in doing
> >
> > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> >
> > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> 
> ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> 
> __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.

Ah, yes, OK.

I suppose it's possible that __GFP_HIGHMEM was set.

do_huge_pmd_anonymous_page
->pte_alloc_one
  ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)

but I haven't traced that through and that's 32-bit.

But anyway - Luke, please attach your .config to
https://bugzilla.kernel.org/show_bug.cgi?id=87891?
Luke-Jr Nov. 12, 2014, 12:54 a.m. UTC | #4
On Wednesday, November 12, 2014 12:49:13 AM Andrew Morton wrote:
> But anyway - Luke, please attach your .config to
> https://bugzilla.kernel.org/show_bug.cgi?id=87891?

Done: https://bugzilla.kernel.org/attachment.cgi?id=157381

Luke
Kirill A . Shutemov Nov. 12, 2014, 1:22 a.m. UTC | #5
On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> 
> > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > 
> > > There's no point in doing
> > >
> > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > >
> > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > 
> > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > 
> > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> 
> Ah, yes, OK.
> 
> I suppose it's possible that __GFP_HIGHMEM was set.
> 
> do_huge_pmd_anonymous_page
> ->pte_alloc_one
>   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)

do_huge_pmd_anonymous_page
 alloc_hugepage_vma
  alloc_pages_vma(GFP_TRANSHUGE)

GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.
Andrew Morton Nov. 12, 2014, 1:56 a.m. UTC | #6
On Wed, 12 Nov 2014 03:47:03 +0200 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> On Wed, Nov 12, 2014 at 03:22:41AM +0200, Kirill A. Shutemov wrote:
> > On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> > > On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> > > 
> > > > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > > > 
> > > > > There's no point in doing
> > > > >
> > > > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > > > >
> > > > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > > > 
> > > > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > > > 
> > > > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> > > 
> > > Ah, yes, OK.
> > > 
> > > I suppose it's possible that __GFP_HIGHMEM was set.
> > > 
> > > do_huge_pmd_anonymous_page
> > > ->pte_alloc_one
> > >   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)
> > 
> > do_huge_pmd_anonymous_page
> >  alloc_hugepage_vma
> >   alloc_pages_vma(GFP_TRANSHUGE)
> > 
> > GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.
> 
> Looks like it's reasonable to sanitize flags in shrink_slab() by dropping
> flags incompatible with slab expectation. Like this:
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dcb47074ae03..eb165d29c5e5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -369,6 +369,8 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
>         if (nr_pages_scanned == 0)
>                 nr_pages_scanned = SWAP_CLUSTER_MAX;
>  
> +       shrinkctl->gfp_mask &= ~(__GFP_DMA32 | __GFP_HIGHMEM);
> +
>         if (!down_read_trylock(&shrinker_rwsem)) {
>                 /*
>                  * If we would return 0, our callers would understand that we

Well no, because nobody is supposed to be passing this gfp_mask back
into a new allocation attempt anyway.  It would be better to do

	shrinkctl->gfp_mask |= __GFP_IMMEDIATELY_GO_BUG;

?
Kirill A . Shutemov Nov. 12, 2014, 2:07 a.m. UTC | #7
On Tue, Nov 11, 2014 at 05:56:03PM -0800, Andrew Morton wrote:
> On Wed, 12 Nov 2014 03:47:03 +0200 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > On Wed, Nov 12, 2014 at 03:22:41AM +0200, Kirill A. Shutemov wrote:
> > > On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> > > > On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> > > > 
> > > > > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > > > > 
> > > > > > There's no point in doing
> > > > > >
> > > > > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > > > > >
> > > > > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > > > > 
> > > > > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > > > > 
> > > > > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> > > > 
> > > > Ah, yes, OK.
> > > > 
> > > > I suppose it's possible that __GFP_HIGHMEM was set.
> > > > 
> > > > do_huge_pmd_anonymous_page
> > > > ->pte_alloc_one
> > > >   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)
> > > 
> > > do_huge_pmd_anonymous_page
> > >  alloc_hugepage_vma
> > >   alloc_pages_vma(GFP_TRANSHUGE)
> > > 
> > > GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.
> > 
> > Looks like it's reasonable to sanitize flags in shrink_slab() by dropping
> > flags incompatible with slab expectation. Like this:
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dcb47074ae03..eb165d29c5e5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -369,6 +369,8 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
> >         if (nr_pages_scanned == 0)
> >                 nr_pages_scanned = SWAP_CLUSTER_MAX;
> >  
> > +       shrinkctl->gfp_mask &= ~(__GFP_DMA32 | __GFP_HIGHMEM);
> > +
> >         if (!down_read_trylock(&shrinker_rwsem)) {
> >                 /*
> >                  * If we would return 0, our callers would understand that we
> 
> Well no, because nobody is supposed to be passing this gfp_mask back
> into a new allocation attempt anyway.  It would be better to do
> 
> 	shrinkctl->gfp_mask |= __GFP_IMMEDIATELY_GO_BUG;
> 
> ?

From my POV, the problem is that we combine what-need-to-be-freed gfp_mask
with if-have-to-allocate gfp_mask: we want to respect __GFP_IO/FS on
alloc, but not necessarily both if there's no restriction from the context.

For shrink_slab(), __GFP_DMA32 and __GFP_HIGHMEM don't make sense in both
cases.

__GFP_IMMEDIATELY_GO_BUG would work too, but we also need to provide
macros to construct alloc-suitable mask from the given one for
yes-i-really-have-to-allocate case.
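
As a rough illustration of what such a macro or helper might look like,
here is a userspace sketch.  Everything in it is hypothetical:
shrinker_alloc_mask() does not exist in the kernel, and the SK_* values
are illustrative assumptions rather than definitions taken from gfp.h.
The helper keeps only the bits that describe the allocation context and
drops the zone and mobility bits that describe what is being reclaimed.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; assumptions for this sketch only. */
#define SK_GFP_HIGHMEM 0x02u
#define SK_GFP_DMA32   0x04u
#define SK_GFP_MOVABLE 0x08u
#define SK_GFP_WAIT    0x10u
#define SK_GFP_IO      0x40u
#define SK_GFP_FS      0x80u

/* Bits that say what the caller's context permits. */
#define SK_CONTEXT_BITS (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)
/* Bits that say where the wanted memory should come from. */
#define SK_ZONE_BITS    (SK_GFP_HIGHMEM | SK_GFP_DMA32 | SK_GFP_MOVABLE)

/* Hypothetical helper: derive a mask a shrinker could pass to its own
 * allocation from the mask describing the reclaim request. */
static uint32_t shrinker_alloc_mask(uint32_t reclaim_mask)
{
	/* The two groups must not overlap, or the filter is meaningless. */
	assert((SK_CONTEXT_BITS & SK_ZONE_BITS) == 0);

	return reclaim_mask & SK_CONTEXT_BITS;
}
```

A shrinker that really has to allocate would then use the derived mask
instead of the one it was handed, so restrictions such as no-IO or no-FS
still apply while the zone bits never reach slab.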
Joonsoo Kim Nov. 12, 2014, 2:17 a.m. UTC | #8
On Wed, Nov 12, 2014 at 03:22:41AM +0200, Kirill A. Shutemov wrote:
> On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> > On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> > 
> > > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > > 
> > > > There's no point in doing
> > > >
> > > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > > >
> > > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > > 
> > > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > > 
> > > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> > 
> > Ah, yes, OK.
> > 
> > I suppose it's possible that __GFP_HIGHMEM was set.
> > 
> > do_huge_pmd_anonymous_page
> > ->pte_alloc_one
> >   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)
> 
> do_huge_pmd_anonymous_page
>  alloc_hugepage_vma
>   alloc_pages_vma(GFP_TRANSHUGE)
> 
> GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.

Hello, Kirill.

BTW, why does GFP_TRANSHUGE have the MOVABLE flag even though it isn't
movable? After breaking the hugepage it could be movable, but it may
prevent CMA from working correctly until the break.

Thanks.
Kirill A . Shutemov Nov. 12, 2014, 2:37 a.m. UTC | #9
On Wed, Nov 12, 2014 at 11:17:16AM +0900, Joonsoo Kim wrote:
> On Wed, Nov 12, 2014 at 03:22:41AM +0200, Kirill A. Shutemov wrote:
> > On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> > > On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> > > 
> > > > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > > > 
> > > > > There's no point in doing
> > > > >
> > > > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > > > >
> > > > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > > > 
> > > > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > > > 
> > > > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> > > 
> > > Ah, yes, OK.
> > > 
> > > I suppose it's possible that __GFP_HIGHMEM was set.
> > > 
> > > do_huge_pmd_anonymous_page
> > > ->pte_alloc_one
> > >   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)
> > 
> > do_huge_pmd_anonymous_page
> >  alloc_hugepage_vma
> >   alloc_pages_vma(GFP_TRANSHUGE)
> > 
> > GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.
> 
> Hello, Kirill.
> 
> BTW, why does GFP_TRANSHUGE have MOVABLE flag despite it isn't
> movable? After breaking hugepage, it could be movable, but, it may
> prevent CMA from working correctly until break.

Again, the same alloc vs. free gfp_mask: we want the page allocator to
move pages around to find space for THP, but the resulting page is not
really movable.

I've tried to look into making THP movable: it requires quite a bit of
infrastructure changes around rmap: try_to_unmap*(), remove_migration_pmd(),
migration entries for PMDs, etc. It gets ugly pretty fast :-/
I probably need to give it a second try. No promises.
Joonsoo Kim Nov. 12, 2014, 8:21 a.m. UTC | #10
On Wed, Nov 12, 2014 at 04:37:46AM +0200, Kirill A. Shutemov wrote:
> On Wed, Nov 12, 2014 at 11:17:16AM +0900, Joonsoo Kim wrote:
> > On Wed, Nov 12, 2014 at 03:22:41AM +0200, Kirill A. Shutemov wrote:
> > > On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> > > > On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> > > > 
> > > > > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > > > > 
> > > > > > There's no point in doing
> > > > > >
> > > > > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > > > > >
> > > > > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > > > > 
> > > > > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > > > > 
> > > > > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> > > > 
> > > > Ah, yes, OK.
> > > > 
> > > > I suppose it's possible that __GFP_HIGHMEM was set.
> > > > 
> > > > do_huge_pmd_anonymous_page
> > > > ->pte_alloc_one
> > > >   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)
> > > 
> > > do_huge_pmd_anonymous_page
> > >  alloc_hugepage_vma
> > >   alloc_pages_vma(GFP_TRANSHUGE)
> > > 
> > > GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.
> > 
> > Hello, Kirill.
> > 
> > BTW, why does GFP_TRANSHUGE have MOVABLE flag despite it isn't
> > movable? After breaking hugepage, it could be movable, but, it may
> > prevent CMA from working correctly until break.
> 
> Again, the same alloc vs. free gfp_mask: we want page allocator to move
> pages around to find space from THP, but resulting page is no really
> movable.

Hmm... AFAIK, even without the MOVABLE flag the page allocator will try
to move pages to find space for a THP page. Am I missing something?
        
> 
> I've tried to look into making THP movable: it requires quite a bit of
> infrastructure changes around rmap: try_to_unmap*(), remove_migration_pmd(),
> migration entries for PMDs, etc. I gets ugly pretty fast :-/
> I probably need to give it second try. No promises.

Good to hear. :)

I think that we can go another way, one that breaks the hugepage. This
operation makes it movable and CMA would succeed.

Thanks.
Mel Gorman Nov. 12, 2014, 10:39 a.m. UTC | #11
On Wed, Nov 12, 2014 at 11:17:16AM +0900, Joonsoo Kim wrote:
> On Wed, Nov 12, 2014 at 03:22:41AM +0200, Kirill A. Shutemov wrote:
> > On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> > > On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> > > 
> > > > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > > > 
> > > > > There's no point in doing
> > > > >
> > > > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > > > >
> > > > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > > > 
> > > > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > > > 
> > > > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> > > 
> > > Ah, yes, OK.
> > > 
> > > I suppose it's possible that __GFP_HIGHMEM was set.
> > > 
> > > do_huge_pmd_anonymous_page
> > > ->pte_alloc_one
> > >   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)
> > 
> > do_huge_pmd_anonymous_page
> >  alloc_hugepage_vma
> >   alloc_pages_vma(GFP_TRANSHUGE)
> > 
> > GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.
> 
> Hello, Kirill.
> 
> BTW, why does GFP_TRANSHUGE have MOVABLE flag despite it isn't
> movable? After breaking hugepage, it could be movable, but, it may
> prevent CMA from working correctly until break.
> 

Because THP can use the Movable zone if it's allocated. When movable was
introduced it did not just mean migratable. It meant it could also be
moved to swap. THP can be broken up and swapped, so it is tagged as movable.
Joonsoo Kim Nov. 13, 2014, 6:37 a.m. UTC | #12
On Wed, Nov 12, 2014 at 10:39:24AM +0000, Mel Gorman wrote:
> On Wed, Nov 12, 2014 at 11:17:16AM +0900, Joonsoo Kim wrote:
> > On Wed, Nov 12, 2014 at 03:22:41AM +0200, Kirill A. Shutemov wrote:
> > > On Tue, Nov 11, 2014 at 04:49:13PM -0800, Andrew Morton wrote:
> > > > On Tue, 11 Nov 2014 18:36:28 -0600 (CST) Christoph Lameter <cl@linux.com> wrote:
> > > > 
> > > > > On Tue, 11 Nov 2014, Andrew Morton wrote:
> > > > > 
> > > > > > There's no point in doing
> > > > > >
> > > > > > 	#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> > > > > >
> > > > > > because __GFP_DMA32|__GFP_HIGHMEM are already part of ~__GFP_BITS_MASK.
> > > > > 
> > > > > ?? ~__GFP_BITS_MASK means bits 25 to 31 are set.
> > > > > 
> > > > > __GFP_DMA32 is bit 2 and __GFP_HIGHMEM is bit 1.
> > > > 
> > > > Ah, yes, OK.
> > > > 
> > > > I suppose it's possible that __GFP_HIGHMEM was set.
> > > > 
> > > > do_huge_pmd_anonymous_page
> > > > ->pte_alloc_one
> > > >   ->alloc_pages(__userpte_alloc_gfp==__GFP_HIGHMEM)
> > > 
> > > do_huge_pmd_anonymous_page
> > >  alloc_hugepage_vma
> > >   alloc_pages_vma(GFP_TRANSHUGE)
> > > 
> > > GFP_TRANSHUGE contains GFP_HIGHUSER_MOVABLE, which has __GFP_HIGHMEM.
> > 
> > Hello, Kirill.
> > 
> > BTW, why does GFP_TRANSHUGE have MOVABLE flag despite it isn't
> > movable? After breaking hugepage, it could be movable, but, it may
> > prevent CMA from working correctly until break.
> > 
> 
> Because THP can use the Movable zone if it's allocated. When movable was
> introduced it did not just mean migratable. It meant it could also be
> moved to swap. THP can be broken up and swapped so it tagged as movable.

Great explanation!

Thanks Mel.
Vlastimil Babka Nov. 13, 2014, 7:04 a.m. UTC | #13
On 11/12/2014 12:31 AM, Andrew Morton wrote:
>
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Thu, 06 Nov 2014 17:28:41 +0000 bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=87891
>>
>>              Bug ID: 87891
>>             Summary: kernel BUG at mm/slab.c:2625!
>>             Product: Memory Management
>>             Version: 2.5
>>      Kernel Version: 3.17.2
>>            Hardware: i386
>>                  OS: Linux
>>                Tree: Mainline
>>              Status: NEW
>>            Severity: blocking
>>            Priority: P1
>>           Component: Slab Allocator
>>            Assignee: akpm@linux-foundation.org
>>            Reporter: luke-jr+linuxbugs@utopios.org
>>          Regression: No
>
> Well this is interesting.
>
>
>> [359782.842112] kernel BUG at mm/slab.c:2625!
>> ...
>> [359782.843008] Call Trace:
>> [359782.843017]  [<ffffffff8115181f>] __kmalloc+0xdf/0x200
>> [359782.843037]  [<ffffffffa0466285>] ? ttm_page_pool_free+0x35/0x180 [ttm]
>> [359782.843060]  [<ffffffffa0466285>] ttm_page_pool_free+0x35/0x180 [ttm]
>> [359782.843084]  [<ffffffffa046674e>] ttm_pool_shrink_scan+0xae/0xd0 [ttm]
>> [359782.843108]  [<ffffffff8111c2fb>] shrink_slab_node+0x12b/0x2e0
>> [359782.843129]  [<ffffffff81127ed4>] ? fragmentation_index+0x14/0x70
>> [359782.843150]  [<ffffffff8110fc3a>] ? zone_watermark_ok+0x1a/0x20
>> [359782.843171]  [<ffffffff8111ceb8>] shrink_slab+0xc8/0x110
>> [359782.843189]  [<ffffffff81120480>] do_try_to_free_pages+0x300/0x410
>> [359782.843210]  [<ffffffff8112084b>] try_to_free_pages+0xbb/0x190
>> [359782.843230]  [<ffffffff81113136>] __alloc_pages_nodemask+0x696/0xa90
>> [359782.843253]  [<ffffffff8115810a>] do_huge_pmd_anonymous_page+0xfa/0x3f0
>> [359782.843278]  [<ffffffff812dffe7>] ? debug_smp_processor_id+0x17/0x20
>> [359782.843300]  [<ffffffff81118dc7>] ? __lru_cache_add+0x57/0xa0
>> [359782.843321]  [<ffffffff811385ce>] handle_mm_fault+0x37e/0xdd0
>
> It went pagefault
>          ->__alloc_pages_nodemask
>            ->shrink_slab
>              ->ttm_pool_shrink_scan
>                ->ttm_page_pool_free
>                  ->kmalloc
>                    ->cache_grow
>                      ->BUG_ON(flags & GFP_SLAB_BUG_MASK);
>
> And I don't really know why - I'm not seeing anything in there which
> can set a GFP flag which is outside GFP_SLAB_BUG_MASK.  However I see
> lots of nits.
>
> Core MM:
>
> __alloc_pages_nodemask() does
>
> 	if (unlikely(!page)) {
> 		/*
> 		 * Runtime PM, block IO and its error handling path
> 		 * can deadlock because I/O on the device might not
> 		 * complete.
> 		 */
> 		gfp_mask = memalloc_noio_flags(gfp_mask);
> 		page = __alloc_pages_slowpath(gfp_mask, order,
> 				zonelist, high_zoneidx, nodemask,
> 				preferred_zone, classzone_idx, migratetype);
> 	}
>
> so it permanently alters the value of incoming arg gfp_mask.  This
> means that the following trace_mm_page_alloc() will print the wrong
> value of gfp_mask, and if we later do the `goto retry_cpuset', we retry
> with a possibly different gfp_mask.  Isn't this a bug?

I think so. I noticed and fixed it in the RFC about reducing
alloc_pages* parameters [1], but it's buried in patch 2/4. Guess I should
have made it a separate non-RFC patch. Will do soon hopefully.

Vlastimil


[1] https://lkml.org/lkml/2014/8/6/249

Patch

diff -puN include/linux/gfp.h~slab-improve-checking-for-invalid-gfp_flags include/linux/gfp.h
--- a/include/linux/gfp.h~slab-improve-checking-for-invalid-gfp_flags
+++ a/include/linux/gfp.h
@@ -145,7 +145,7 @@  struct vm_area_struct;
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
 
 /* Do not use these with a slab allocator */
-#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
+#define GFP_SLAB_BUG_MASK (~__GFP_BITS_MASK)
 
 /* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
    platforms, used as appropriate on others */
diff -puN mm/slab.c~slab-improve-checking-for-invalid-gfp_flags mm/slab.c
--- a/mm/slab.c~slab-improve-checking-for-invalid-gfp_flags
+++ a/mm/slab.c
@@ -2590,7 +2590,10 @@  static int cache_grow(struct kmem_cache
 	 * Be lazy and only check for valid flags here,  keeping it out of the
 	 * critical path in kmem_cache_alloc().
 	 */
-	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+	if (WARN_ON(flags & GFP_SLAB_BUG_MASK)) {
+		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
+		flags &= ~GFP_SLAB_BUG_MASK;
+	}
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
 	/* Take the node list lock to change the colour_next on this node */
diff -puN mm/slub.c~slab-improve-checking-for-invalid-gfp_flags mm/slub.c
--- a/mm/slub.c~slab-improve-checking-for-invalid-gfp_flags
+++ a/mm/slub.c
@@ -1377,7 +1377,10 @@  static struct page *new_slab(struct kmem
 	int order;
 	int idx;
 
-	BUG_ON(flags & GFP_SLAB_BUG_MASK);
+	if (WARN_ON(flags & GFP_SLAB_BUG_MASK)) {
+		pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
+		flags &= ~GFP_SLAB_BUG_MASK;
+	}
 
 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);