ARM: sparsemem: Enable CONFIG_HOLES_IN_ZONE config option for SparseMem and HAS_HOLES_MEMORYMODEL for linux-3.0.

Message ID CAFPAmTQByL0YJT8Lvar1Oe+3Q1EREvqPA_GP=hHApJDz5dSOzQ@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Kautuk Consul Aug. 2, 2011, 12:08 p.m. UTC
Hi,

In the case where the total kernel memory is not aligned to the
section size (SECTION_SIZE_BITS), I see a kernel crash.

When I copy a huge file, then the kernel crashes at the following callstack:

Backtrace:
[<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>]
(move_freepages_block+0xd4/0x158)
[<c008e7dc>] (move_freepages_block+0x0/0x158) from [<c008eb10>]
(__rmqueue+0x1dc/0x32c)
 r8:c0481120 r7:c048107c r6:00000003 r5:00000001 r4:c04f8200
r3:00000000
[<c008e934>] (__rmqueue+0x0/0x32c) from [<c00906f8>]
(get_page_from_freelist+0x12c/0x530)
[<c00905cc>] (get_page_from_freelist+0x0/0x530) from [<c0090bec>]
(__alloc_pages_nodemask+0xf0/0x544)
[<c0090afc>] (__alloc_pages_nodemask+0x0/0x544) from [<c00b4da4>]
(cache_alloc_refill+0x2d0/0x654)
[<c00b4ad4>] (cache_alloc_refill+0x0/0x654) from [<c00b5258>]
(kmem_cache_alloc+0x58/0x9c)
[<c00b5200>] (kmem_cache_alloc+0x0/0x9c) from [<c01f0154>]
(radix_tree_preload+0x58/0xbc)
 r7:00006741 r6:000000d0 r5:c04a98a0 r4:ce986000
[<c01f00fc>] (radix_tree_preload+0x0/0xbc) from [<c008ac94>]
(add_to_page_cache_locked+0x20/0x1c4)
 r6:ce987d20 r5:ce346c1c r4:c04f8600 r3:000000d0
[<c008ac74>] (add_to_page_cache_locked+0x0/0x1c4) from [<c008ae84>]
(add_to_page_cache_lru+0x4c/0x7c)
 r8:00000020 r7:ce7402a0 r6:ce987d20 r5:00000005 r4:c04f8600
r3:000000d0
[<c008ae38>] (add_to_page_cache_lru+0x0/0x7c) from [<c00e8428>]
(mpage_readpages+0x7c/0x108)
 r5:00000005 r4:c04f8600
[<c00e83ac>] (mpage_readpages+0x0/0x108) from [<c010f450>]
(fat_readpages+0x20/0x28)
[<c010f430>] (fat_readpages+0x0/0x28) from [<c0092ebc>]
(__do_page_cache_readahead+0x1c4/0x27c)
[<c0092cf8>] (__do_page_cache_readahead+0x0/0x27c) from [<c0092fa0>]
(ra_submit+0x2c/0x34)
[<c0092f74>] (ra_submit+0x0/0x34) from [<c0093294>]
(ondemand_readahead+0x20c/0x21c)
[<c0093088>] (ondemand_readahead+0x0/0x21c) from [<c0093348>]
(page_cache_async_readahead+0xa4/0xd8)
[<c00932a4>] (page_cache_async_readahead+0x0/0xd8) from [<c008c6f4>]
(generic_file_aio_read+0x360/0x7f4)
 r8:00000000 r7:ce346c1c r6:ce88dba0 r5:0000671c r4:c04f9640
[<c008c394>] (generic_file_aio_read+0x0/0x7f4) from [<c00b7954>]
(do_sync_read+0xa0/0xd8)
[<c00b78b4>] (do_sync_read+0x0/0xd8) from [<c00b84d8>] (vfs_read+0xb8/0x154)
 r6:bed53650 r5:ce88dba0 r4:00001000
[<c00b8420>] (vfs_read+0x0/0x154) from [<c00b863c>] (sys_read+0x44/0x70)
 r8:0671c000 r7:00000003 r6:00001000 r5:bed53650 r4:ce88dba0
[<c00b85f8>] (sys_read+0x0/0x70) from [<c0022f80>] (ret_fast_syscall+0x0/0x30)
 r9:ce986000 r8:c0023128 r6:bed53650 r5:00001000 r4:0013a4e0
Code: e59f0010 e1a01003 eb0d0852 e3a03000 (e5833000)

The reason for this is that the CONFIG_HOLES_IN_ZONE configuration
option is not automatically enabled when SPARSEMEM or
ARCH_HAS_HOLES_MEMORYMODEL is enabled. Because of this, the
pfn_valid_within() macro always returns 1, which is why the BUG_ON is
encountered.
This patch enables the CONFIG_HOLES_IN_ZONE config option if either
ARCH_HAS_HOLES_MEMORYMODEL or SPARSEMEM is enabled.
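
For reference, pfn_valid_within() in include/linux/mmzone.h (roughly as
it stands in 2.6.35/3.0) compiles away to a constant 1 without
CONFIG_HOLES_IN_ZONE:

#ifdef CONFIG_HOLES_IN_ZONE
#define pfn_valid_within(pfn) pfn_valid(pfn)
#else
#define pfn_valid_within(pfn) (1)
#endif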

Although I tested this on an older kernel, i.e., 2.6.35.13, I see that
this option has not been enabled as yet in linux-3.0, and this appears
to be a logically correct change anyway with respect to
pfn_valid_within() functionality.

Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
---


Thanks,
Kautuk.

Comments

Mel Gorman Aug. 3, 2011, 11:05 a.m. UTC | #1
On Tue, Aug 02, 2011 at 05:38:31PM +0530, Kautuk Consul wrote:
> Hi,
> 
> In the case where the total kernel memory is not aligned to the
> SECTION_SIZE_BITS I see a kernel crash.
> 
> When I copy a huge file, then the kernel crashes at the following callstack:
> 

The callstack should not be 80-column formatted as this is completely
mangled and unreadable without manual editing. Also, why did you not
include the full error message? With it, I'd have a better idea of
which bug check you hit.

> Backtrace:
> <SNIP>
> 
> The reason for this is that the CONFIG_HOLES_IN_ZONE configuration
> option is not automatically enabled when SPARSEMEM or
> ARCH_HAS_HOLES_MEMORYMODEL are enabled. Due to this, the
> pfn_valid_within() macro always returns 1 due to which the BUG_ON is
> encountered.
> This patch enables the CONFIG_HOLES_IN_ZONE config option if either
> ARCH_HAS_HOLES_MEMORYMODEL or SPARSEMEM is enabled.
> 
> Although I tested this on an older kernel, i.e., 2.6.35.13, I see that
> this option has not been enabled as yet in linux-3.0 and this appears
> to be a
> logically correct change anyways with respect to pfn_valid_within()
> functionality.
> 

There is a performance cost associated with HOLES_IN_ZONE which may be
offset by memory savings but not necessarily.

If the BUG_ON you are hitting is this one
BUG_ON(page_zone(start_page) != page_zone(end_page)) then I'd be
wondering why the check in move_freepages_block() was insufficient.

If it's because holes are punched in the memmap then the option does
need to be set.
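
For anyone following along, the check in question, condensed and
paraphrased from 2.6.35-era mm/page_alloc.c (not a verbatim quote), is:

/* move_freepages_block() only range-checks against the zone span: */
static int move_freepages_block(struct zone *zone, struct page *page,
				int migratetype)
{
	unsigned long start_pfn, end_pfn;
	struct page *start_page, *end_page;

	start_pfn = page_to_pfn(page) & ~(pageblock_nr_pages - 1);
	start_page = pfn_to_page(start_pfn);
	end_page = start_page + pageblock_nr_pages - 1;
	end_pfn = start_pfn + pageblock_nr_pages - 1;

	/* Do not cross zone boundaries */
	if (start_pfn < zone->zone_start_pfn)
		start_page = page;
	if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
		return 0;

	return move_freepages(zone, start_page, end_page, migratetype);
}

/* move_freepages() then trusts the memmap inside that pageblock: */
static int move_freepages(struct zone *zone, struct page *start_page,
			  struct page *end_page, int migratetype)
{
	struct page *page;

#ifndef CONFIG_HOLES_IN_ZONE
	BUG_ON(page_zone(start_page) != page_zone(end_page));
#endif

	for (page = start_page; page <= end_page;) {
		if (!pfn_valid_within(page_to_pfn(page))) {
			page++;
			continue;
		}
		/* ... move the page if it is PageBuddy ... */
	}
	/* ... */
}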
Kautuk Consul Aug. 3, 2011, 12:29 p.m. UTC | #2
Hi Mel,

Sorry for the formatting.

I forgot to include the following entire backtrace:
#> cp test_huge_file nfsmnt
kernel BUG at mm/page_alloc.c:849!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ce9f0000
<SNIP>
Backtrace:
[<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>]
(move_freepages_block+0xd4/0x158)
[<c008e7dc>] (move_freepages_block+0x0/0x158) from [<c008eb10>]
(__rmqueue+0x1dc/0x32c)
 r8:c0481120 r7:c048107c r6:00000003 r5:00000001 r4:c04f8200
r3:00000000
[<c008e934>] (__rmqueue+0x0/0x32c) from [<c00906f8>]
(get_page_from_freelist+0x12c/0x530)
[<c00905cc>] (get_page_from_freelist+0x0/0x530) from [<c0090bec>]
(__alloc_pages_nodemask+0xf0/0x544)
[<c0090afc>] (__alloc_pages_nodemask+0x0/0x544) from [<c00b4da4>]
(cache_alloc_refill+0x2d0/0x654)
[<c00b4ad4>] (cache_alloc_refill+0x0/0x654) from [<c00b5258>]
(kmem_cache_alloc+0x58/0x9c)
[<c00b5200>] (kmem_cache_alloc+0x0/0x9c) from [<c01f0154>]
(radix_tree_preload+0x58/0xbc)
 r7:00006741 r6:000000d0 r5:c04a98a0 r4:ce986000
[<c01f00fc>] (radix_tree_preload+0x0/0xbc) from [<c008ac94>]
(add_to_page_cache_locked+0x20/0x1c4)
 r6:ce987d20 r5:ce346c1c r4:c04f8600 r3:000000d0
[<c008ac74>] (add_to_page_cache_locked+0x0/0x1c4) from [<c008ae84>]
(add_to_page_cache_lru+0x4c/0x7c)
 r8:00000020 r7:ce7402a0 r6:ce987d20 r5:00000005 r4:c04f8600
r3:000000d0
[<c008ae38>] (add_to_page_cache_lru+0x0/0x7c) from [<c00e8428>]
(mpage_readpages+0x7c/0x108)
 r5:00000005 r4:c04f8600
[<c00e83ac>] (mpage_readpages+0x0/0x108) from [<c010f450>]
(fat_readpages+0x20/0x28)
[<c010f430>] (fat_readpages+0x0/0x28) from [<c0092ebc>]
(__do_page_cache_readahead+0x1c4/0x27c)
[<c0092cf8>] (__do_page_cache_readahead+0x0/0x27c) from [<c0092fa0>]
(ra_submit+0x2c/0x34)
[<c0092f74>] (ra_submit+0x0/0x34) from [<c0093294>]
(ondemand_readahead+0x20c/0x21c)
[<c0093088>] (ondemand_readahead+0x0/0x21c) from [<c0093348>]
(page_cache_async_readahead+0xa4/0xd8)
[<c00932a4>] (page_cache_async_readahead+0x0/0xd8) from [<c008c6f4>]
(generic_file_aio_read+0x360/0x7f4)
 r8:00000000 r7:ce346c1c r6:ce88dba0 r5:0000671c r4:c04f9640
[<c008c394>] (generic_file_aio_read+0x0/0x7f4) from [<c00b7954>]
(do_sync_read+0xa0/0xd8)
[<c00b78b4>] (do_sync_read+0x0/0xd8) from [<c00b84d8>] (vfs_read+0xb8/0x154)
 r6:bed53650 r5:ce88dba0 r4:00001000
[<c00b8420>] (vfs_read+0x0/0x154) from [<c00b863c>] (sys_read+0x44/0x70)
 r8:0671c000 r7:00000003 r6:00001000 r5:bed53650 r4:ce88dba0
[<c00b85f8>] (sys_read+0x0/0x70) from [<c0022f80>] (ret_fast_syscall+0x0/0x30)
 r9:ce986000 r8:c0023128 r6:bed53650 r5:00001000 r4:0013a4e0
Code: e59f0010 e1a01003 eb0d0852 e3a03000 (e5833000)
<SNIP>

Since I was testing on linux-2.6.35.9, line 849 of page_alloc.c is the
same check you mentioned:
BUG_ON(page_zone(start_page) != page_zone(end_page))

I reproduce this crash by altering the memory banks' memory ranges so
that they are not aligned to the section size (SECTION_SIZE_BITS).
For example, on my ARM system SECTION_SIZE_BITS is 23 (8 MB), so I
change the code in arch/arm/mach-* so that the total kernel memory in
the memory banks is reduced by 1 MB, which means the total is no
longer an exact multiple of 8 MB.

Thanks,
Kautuk.


On Wed, Aug 3, 2011 at 4:35 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Tue, Aug 02, 2011 at 05:38:31PM +0530, Kautuk Consul wrote:
>> Hi,
>>
>> In the case where the total kernel memory is not aligned to the
>> SECTION_SIZE_BITS I see a kernel crash.
>>
>> When I copy a huge file, then the kernel crashes at the following callstack:
>>
>
> The callstack should not be 80-column formatted as this is completely
> manged and unreadable without manual editting. Also, why did you not
> include the full error message? With it, I'd have a better idea of
> which bug check you hit.
>
>> Backtrace:
>> <SNIP>
>>
>> The reason for this is that the CONFIG_HOLES_IN_ZONE configuration
>> option is not automatically enabled when SPARSEMEM or
>> ARCH_HAS_HOLES_MEMORYMODEL are enabled. Due to this, the
>> pfn_valid_within() macro always returns 1 due to which the BUG_ON is
>> encountered.
>> This patch enables the CONFIG_HOLES_IN_ZONE config option if either
>> ARCH_HAS_HOLES_MEMORYMODEL or SPARSEMEM is enabled.
>>
>> Although I tested this on an older kernel, i.e., 2.6.35.13, I see that
>> this option has not been enabled as yet in linux-3.0 and this appears
>> to be a
>> logically correct change anyways with respect to pfn_valid_within()
>> functionality.
>>
>
> There is a performance cost associated with HOLES_IN_ZONE which may be
> offset by memory savings but not necessarily.
>
> If the BUG_ON you are hitting is this one
> BUG_ON(page_zone(start_page) != page_zone(end_page)) then I'd be
> wondering why the check in move_freepages_block() was insufficient.
>
> If it's because holes are punched in the memmap then the option does
> need to be set.
>
> --
> Mel Gorman
> SUSE Labs
>
Mel Gorman Aug. 3, 2011, 1:28 p.m. UTC | #3
On Wed, Aug 03, 2011 at 05:59:03PM +0530, Kautuk Consul wrote:
> Hi Mel,
> 
> Sorry for the formatting.
> 
> I forgot to include the following entire backtrace:
> #> cp test_huge_file nfsmnt
> kernel BUG at mm/page_alloc.c:849!
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = ce9f0000
> <SNIP>
> Backtrace:
> [<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>]
> (move_freepages_block+0xd4/0x158)

It's still horribly mangled and pretty much unreadable but at least we
know where the bug is hitting.

> <SNIP>
> 
> Since I was testing on linux-2.6.35.9, line 849 in page_alloc.c is the
> same line as you have mentioned:
> BUG_ON(page_zone(start_page) != page_zone(end_page))
> 
> I reproduce this crash by altering the memory banks' memory ranges
> such that they are not aligned to the SECTION_SIZE_BITS size.

How are you altering the ranges? Are you somehow breaking
the checks based on the information in struct zone that is in
move_freepages_block()?

It no longer seems like a punching-hole-in-memmap problem. Can
you investigate how and why the range of pages passed in to
move_freepages() belong to different zones?

Thanks.
Kautuk Consul Aug. 4, 2011, 9:36 a.m. UTC | #4
Hi Mel,

My ARM system has 2 memory banks which have the following 2 PFN ranges:
60000-62000 and 70000-7ce00.

My SECTION_SIZE_BITS is #defined to 23.

I am altering the ranges via the following kind of pseudo-code in the
arch/arm/mach-*/mach-*.c file:
meminfo->bank[0].size -= (1 << 20)
meminfo->bank[1].size -= (1 << 20)

After altering the size of both memory banks, the PFN ranges now
visible to the kernel are:
60000-61f00 and 70000-7cd00.

There is only one node and one ZONE_NORMAL zone on the system and this
zone accounts for both memory banks as CONFIG_SPARSEMEM
is enabled in the kernel.
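
(Reading off those ranges, the single ZONE_NORMAL zone presumably
starts at PFN 0x60000 and spans up to 0x7cd00, i.e. about 0x1cd00
spanned pages, while present_pages is smaller because of the
0x62000-0x70000 hole and the trimmed bank ends.)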

I put some printks in the move_freepages_block() function, compiled
the kernel and then ran the following commands:
ifconfig eth0 107.109.39.102 up
mount /dev/sda1 /mnt   # mount the USB pen drive
mount -t nfs -o nolock 107.109.39.103:/home/kautuk nfsmnt   # mount an NFS share
cp test_huge_file nfsmnt/

I got the following output:
#> cp test_huge nfsmnt/
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c068a000 start_pfn=7c400 start_page=c0686000
end_page=c068dfe0 end_pfn=7c7ff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c
page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c0652000 start_pfn=7a800 start_page=c064e000
end_page=c0655fe0 end_pfn=7abff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c
page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c065a000 start_pfn=7ac00 start_page=c0656000
end_page=c065dfe0 end_pfn=7afff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c
page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c0695000 start_pfn=7c800 start_page=c068e000
end_page=c0695fe0 end_pfn=7cbff
page_zone(start_page)=c048107c page_zone(end_page)=c048107c
page_zonenum(end_page) = 0
------------------------------------------------------------move_freepages_block_end----------------------------------------
------------------------------------------------------------move_freepages_block_start----------------------------------------
kc: The page=c04f7c00 start_pfn=61c00 start_page=c04f6000
end_page=c04fdfe0 end_pfn=61fff
page_zone(start_page)=c048107c page_zone(end_page)=c0481358
page_zonenum(end_page) = 1
------------------------------------------------------------move_freepages_block_end----------------------------------------
kernel BUG at mm/page_alloc.c:849!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ce9fc000
[00000000] *pgd=7ca5a031, *pte=00000000, *ppte=00000000

As per the last line, we can clearly see that the
page_zone(start_page)=c048107c and page_zone(end_page)=c0481358,
which are not equal to each other.
Since they do not match, the code in move_freepages() bugchecks
because of the following BUG_ON() check:
page_zone(start_page) != page_zone(end_page)

The reason for this is that page_zonenum(end_page) is equal to 1,
which is different from page_zonenum(start_page), which is 0.

On checking the code within page_zonenum(), I see that this code tries
to retrieve the zone number from the end_page->flags.

The reason why we cannot expect the 0x61fff end_page->flags to contain
a valid zone number is:
memmap_init_zone() initializes the zone number of all pages for a zone
via the set_page_links() inline function.
For the end_page (whose PFN is 0x61fff), set_page_links() cannot
possibly be called, as the zones are simply not aware of PFNs above
0x61f00 and below 0x70000.
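
For reference, the relevant loop (condensed and paraphrased from
2.6.35-era memmap_init_zone() in mm/page_alloc.c) only sets up
page->flags for the PFNs it actually visits:

for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	if (context == MEMMAP_EARLY) {
		if (!early_pfn_valid(pfn))
			continue;
		if (!early_pfn_in_nid(pfn, nid))
			continue;
	}
	page = pfn_to_page(pfn);
	set_page_links(page, zone, nid, pfn);	/* encodes the zone in page->flags */
	init_page_count(page);
	reset_page_mapcount(page);
	SetPageReserved(page);
	/* ... */
}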

The (end >= zone->zone_start_pfn + zone->spanned_pages) check in
move_freepages_block() does not stop this crash from happening, as
both our memory banks are in the same zone and the empty space
between them is accommodated into this zone via the CONFIG_SPARSEMEM
config option.

When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as
any other BUG_ONs in the loop in move_freepages(), because the
pfn_valid_within()/pfn_valid() functions then take care of skipping
invalid PFNs, especially when the newly introduced
CONFIG_HAVE_ARCH_PFN_VALID is enabled.
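
For reference, with CONFIG_HAVE_ARCH_PFN_VALID enabled, ARM's
pfn_valid() consults the real memory layout via memblock; roughly
(paraphrased from linux-3.0 arch/arm/mm/init.c):

#ifdef CONFIG_HAVE_ARCH_PFN_VALID
int pfn_valid(unsigned long pfn)
{
	return memblock_is_memory(pfn << PAGE_SHIFT);
}
EXPORT_SYMBOL(pfn_valid);
#endif

A PFN such as 0x61fff, which lies outside every registered bank, is
then rejected and skipped by the loop in move_freepages().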

Thanks,
Kautuk.


On Wed, Aug 3, 2011 at 6:58 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Aug 03, 2011 at 05:59:03PM +0530, Kautuk Consul wrote:
>> Hi Mel,
>>
>> Sorry for the formatting.
>>
>> I forgot to include the following entire backtrace:
>> #> cp test_huge_file nfsmnt
>> kernel BUG at mm/page_alloc.c:849!
>> Unable to handle kernel NULL pointer dereference at virtual address 00000000
>> pgd = ce9f0000
>> <SNIP>
>> Backtrace:
>> [<c00269ac>] (__bug+0x0/0x30) from [<c008e8b0>]
>> (move_freepages_block+0xd4/0x158)
>
> It's still horribly mangled and pretty much unreadable but at least we
> know where the bug is hitting.
>
>> <SNIP>
>>
>> Since I was testing on linux-2.6.35.9, line 849 in page_alloc.c is the
>> same line as you have mentioned:
>> BUG_ON(page_zone(start_page) != page_zone(end_page))
>>
>> I reproduce this crash by altering the memory banks' memory ranges
>> such that they are not aligned to the SECTION_SIZE_BITS size.
>
> How are you altering the ranges? Are you somehow breaking
> the checks based on the information in stuct zone that is in
> move_freepages_block()?
>
> It no longer seems like a punching-hole-in-memmap problem. Can
> you investigate how and why the range of pages passed in to
> move_freepages() belong to different zones?
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
>
Mel Gorman Aug. 4, 2011, 10:09 a.m. UTC | #5
On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
> Hi Mel,
> 
> My ARM system has 2 memory banks which have the following 2 PFN ranges:
> 60000-62000 and 70000-7ce00.
> 
> My SECTION_SIZE_BITS is #defined to 23.
> 

So bank 0 is 4 sections and bank 1 is 26 sections with the last section
incomplete.

> I am altering the ranges via the following kind of pseudo-code in the
> arch/arm/mach-*/mach-*.c file:
> meminfo->bank[0].size -= (1 << 20)
> meminfo->bank[1].size -= (1 << 20)
> 

Why are you taking 1M off each bank? I could understand aligning the
banks to a section size at least.

That said, there is an assumption that pages within a MAX_ORDER-aligned
block have a valid, initialised memmap. If you are taking 1M off the end of bank 0, it is no longer
MAX_ORDER aligned. This is the assumption move_freepages_block()
is falling foul of. The problem can be avoided by ensuring that memmap
is valid within MAX_ORDER-aligned ranges.
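
To make that concrete with the PFN ranges quoted in this thread
(assuming 4 KB pages and the default MAX_ORDER of 11, so
MAX_ORDER_NR_PAGES = 0x400): bank 0 originally ends at PFN 0x62000, a
multiple of 0x400. Trimming 1 MB (0x100 pages) moves the end to
0x61f00, which is not a multiple of 0x400 (0x61f00 = 0x61c00 + 0x300).
The block that trips the BUG_ON in the debug output above spans PFNs
0x61c00-0x61fff, so its last 0x100 struct pages lie beyond the
shrunken bank.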

> After altering the size of both memory banks, the PFN ranges now
> visible to the kernel are:
> 60000-61f00 and 70000-7cd00.
> 
> There is only one node and one ZONE_NORMAL zone on the system and this
> zone accounts for both memory banks as CONFIG_SPARSEMEM
> is enabled in the kernel.
> 

Ok.

> I put some printks in the move_freeblockpages() function, compiled the
> kernel and then ran the following commands:
> ifconfig eth0 107.109.39.102 up
> mount /dev/sda1 /mnt   # Mounted the USB pen drive
> mount -t nfs -o nolock 107.109.39.103:/home/kautuk nfsmnt   # NFS
> mounted an NFS share
> cp test_huge_file nfsmnt/
> 
> I got the following output:
> #> cp test_huge nfsmnt/
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c068a000 start_pfn=7c400 start_page=c0686000
> end_page=c068dfe0 end_pfn=7c7ff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
> page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c0652000 start_pfn=7a800 start_page=c064e000
> end_page=c0655fe0 end_pfn=7abff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
> page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c065a000 start_pfn=7ac00 start_page=c0656000
> end_page=c065dfe0 end_pfn=7afff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
> page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c0695000 start_pfn=7c800 start_page=c068e000
> end_page=c0695fe0 end_pfn=7cbff
> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
> page_zonenum(end_page) = 0
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> ------------------------------------------------------------move_freepages_block_start----------------------------------------
> kc: The page=c04f7c00 start_pfn=61c00 start_page=c04f6000
> end_page=c04fdfe0 end_pfn=61fff
> page_zone(start_page)=c048107c page_zone(end_page)=c0481358
> page_zonenum(end_page) = 1
> ------------------------------------------------------------move_freepages_block_end----------------------------------------
> kernel BUG at mm/page_alloc.c:849!
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = ce9fc000
> [00000000] *pgd=7ca5a031, *pte=00000000, *ppte=00000000
> 
> As per the last line, we can clearly see that the
> page_zone(start_page)=c048107c and page_zone(end_page)=c0481358,
> which are not equal to each other.
> Since they do not match, the code in move_freepages() bugchecks
> because of the following BUG_ON() check:
> page_zone(start_page) != page_zone(end_page)

> The reason for this that the page_zonenum(end_page) is equal to 1 and
> this is different from the page_zonenum(start_page) which is 0.
> 

Because the MAX_ORDER alignment is gone.

> On checking the code within page_zonenum(), I see that this code tries
> to retrieve the zone number from the end_page->flags.
> 

Yes. In the majority of cases a page's node and zone are stored in the
page->flags.

> The reason why we cannot expect the 0x61fff end_page->flags to contain
> a valid zone number is:
> memmap_init_zone() initializes the zone number of all pages for a zone
> via the set_page_links() inline function.
> For the end_page (whose PFN is 0x61fff), set_page_links() cannot be
> possibly called, as the zones are simply not aware of of PFNs above
> 0x61f00 and below 0x70000.
> 

Can you ensure that the ranges passed into free_area_init_node()
are MAX_ORDER aligned as this would initialise the struct pages. You
may have already seen that care is taken when freeing memmap that it
is aligned to MAX_ORDER in free_unused_memmap() in ARM.

> The (end >= zone->zone_start_pfn + zone->spanned_pages) in
> move_freepages_block() does not stop this crash from happening as both
> our memory banks are in the same zone and the empty space within them
> is accomodated into this zone via the CONFIG_SPARSEMEM
> config option.
> 
> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as
> any other BUG_ONs in the loop in move_freepages() as then the
> pfn_valid_within()/pfn_valid() function takes care of this
> functionality, especially in the case where the newly introduced
> CONFIG_HAVE_ARCH_PFN_VALID is
> enabled.
> 

This is an expensive option in terms of performance. If Russell
wants to pick it up, I won't object but I would strongly suggest that
you solve this problem by ensuring that memmap is initialised on a
MAX_ORDER-aligned boundaries as it'll perform better.

Thanks.
Kautuk Consul Aug. 5, 2011, 5:57 a.m. UTC | #6
Hi Mel,

Please find my comments inline to the email below.

2 general questions:
i)   If an email chain such as this leads to another kernel patch for
     the same problem, do I need to create a new email chain for that?
ii)  Sorry about my formatting problems. However, text such as
     backtraces and logs tends to wrap irrespective of whatever gmail
     settings/browser I try. Any pointers here?

Thanks,
Kautuk.

On Thu, Aug 4, 2011 at 3:39 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
>> Hi Mel,
>>
>> My ARM system has 2 memory banks which have the following 2 PFN ranges:
>> 60000-62000 and 70000-7ce00.
>>
>> My SECTION_SIZE_BITS is #defined to 23.
>>
>
> So bank 0 is 4 sections and bank 1 is 26 sections with the last section
> incomplete.
>
>> I am altering the ranges via the following kind of pseudo-code in the
>> arch/arm/mach-*/mach-*.c file:
>> meminfo->bank[0].size -= (1 << 20)
>> meminfo->bank[1].size -= (1 << 20)
>>
>
> Why are you taking 1M off each bank? I could understand aligning the
> banks to a section size at least.

The reason I am doing this is that one of our embedded boards
actually has this problem, which is why we see this kernel crash. I
am merely reproducing that problem by performing this step.

>
> That said, there is an assumption that pages within a MAX_ORDER-aligned
> block. If you are taking 1M off the end of bank 0, it is no longer
> MAX_ORDER aligned. This is the assumption move_freepages_block()
> is falling foul of. The problem can be avoided by ensuring that memmap
> is valid within MAX_ORDER-aligned ranges.
>
>> After altering the size of both memory banks, the PFN ranges now
>> visible to the kernel are:
>> 60000-61f00 and 70000-7cd00.
>>
>> There is only one node and one ZONE_NORMAL zone on the system and this
>> zone accounts for both memory banks as CONFIG_SPARSEMEM
>> is enabled in the kernel.
>>
>
> Ok.
>
>> I put some printks in the move_freeblockpages() function, compiled the
>> kernel and then ran the following commands:
>> ifconfig eth0 107.109.39.102 up
>> mount /dev/sda1 /mnt   # Mounted the USB pen drive
>> mount -t nfs -o nolock 107.109.39.103:/home/kautuk nfsmnt   # NFS
>> mounted an NFS share
>> cp test_huge_file nfsmnt/
>>
>> I got the following output:
>> #> cp test_huge nfsmnt/
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c068a000 start_pfn=7c400 start_page=c0686000
>> end_page=c068dfe0 end_pfn=7c7ff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
>> page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c0652000 start_pfn=7a800 start_page=c064e000
>> end_page=c0655fe0 end_pfn=7abff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
>> page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c065a000 start_pfn=7ac00 start_page=c0656000
>> end_page=c065dfe0 end_pfn=7afff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
>> page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c0695000 start_pfn=7c800 start_page=c068e000
>> end_page=c0695fe0 end_pfn=7cbff
>> page_zone(start_page)=c048107c page_zone(end_page)=c048107c
>> page_zonenum(end_page) = 0
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> ------------------------------------------------------------move_freepages_block_start----------------------------------------
>> kc: The page=c04f7c00 start_pfn=61c00 start_page=c04f6000
>> end_page=c04fdfe0 end_pfn=61fff
>> page_zone(start_page)=c048107c page_zone(end_page)=c0481358
>> page_zonenum(end_page) = 1
>> ------------------------------------------------------------move_freepages_block_end----------------------------------------
>> kernel BUG at mm/page_alloc.c:849!
>> Unable to handle kernel NULL pointer dereference at virtual address 00000000
>> pgd = ce9fc000
>> [00000000] *pgd=7ca5a031, *pte=00000000, *ppte=00000000
>>
>> As per the last line, we can clearly see that the
>> page_zone(start_page)=c048107c and page_zone(end_page)=c0481358,
>> which are not equal to each other.
>> Since they do not match, the code in move_freepages() bugchecks
>> because of the following BUG_ON() check:
>> page_zone(start_page) != page_zone(end_page)
>
>> The reason for this that the page_zonenum(end_page) is equal to 1 and
>> this is different from the page_zonenum(start_page) which is 0.
>>
>
> Because the MAX_ORDER alignment is gone.
>
>> On checking the code within page_zonenum(), I see that this code tries
>> to retrieve the zone number from the end_page->flags.
>>
>
> Yes. In the majority of cases a pages node and zone is stored in the
> page->flags.
>
>> The reason why we cannot expect the 0x61fff end_page->flags to contain
>> a valid zone number is:
>> memmap_init_zone() initializes the zone number of all pages for a zone
>> via the set_page_links() inline function.
>> For the end_page (whose PFN is 0x61fff), set_page_links() cannot be
>> possibly called, as the zones are simply not aware of of PFNs above
>> 0x61f00 and below 0x70000.
>>
>
> Can you ensure that the ranges passed into free_area_init_node()
> are MAX_ORDER aligned as this would initialise the struct pages. You
> may have already seen that care is taken when freeing memmap that it
> is aligned to MAX_ORDER in free_unused_memmap() in ARM.
>

Will this work? My doubt arises from the fact that there is only one
zone on the entire system, and it contains both memory banks.
The crash arises at PFN 0x61fff, which will not be covered by such a
check, as this function will try to act on the entire zone, i.e. the
PFN range 60000-7cd00, including the holes within it, since all of
this RAM falls into the same node and zone.
(Please correct me if I am wrong about this.)

I tried aligning the end parameter in the memory_present() function,
which is called separately for each memory bank.
I tried the following change in memory_present() as well as
mminit_validate_memmodel_limits():
end &= ~(pageblock_nr_pages-1);
But, in this case, the board simply does not boot up. I think that
would then require some change in the arch/arm code, which I think
would be an arch-specific solution to a possibly generic problem.

>> The (end >= zone->zone_start_pfn + zone->spanned_pages) in
>> move_freepages_block() does not stop this crash from happening as both
>> our memory banks are in the same zone and the empty space within them
>> is accomodated into this zone via the CONFIG_SPARSEMEM
>> config option.
>>
>> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as
>> any other BUG_ONs in the loop in move_freepages() as then the
>> pfn_valid_within()/pfn_valid() function takes care of this
>> functionality, especially in the case where the newly introduced
>> CONFIG_HAVE_ARCH_PFN_VALID is
>> enabled.
>>
>
> This is an expensive option in terms of performance. If Russell
> wants to pick it up, I won't object but I would strongly suggest that
> you solve this problem by ensuring that memmap is initialised on a
> MAX_ORDER-aligned boundaries as it'll perform better.
>

I couldn't really locate a method in the kernel by which we can
validate a pageblock (1024 pages for my platform) against the memory
banks on the system.

How about this:
We implement an arch_is_valid_pageblock() function, controlled by a
new config option CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK.
This arch function will simply check whether a pageblock is valid or
not, in terms of arch-specific memory banks or by using the memblock
APIs, depending on CONFIG_HAVE_MEMBLOCK.
We can modify the memmap_init_zone() function so that an outer loop
works in units of pageblocks, enabling us to skip invalid pageblocks.
We then wouldn't need to go for the HOLES_IN_ZONE option, as all PFN
ranges would be aligned to pageblock_nr_pages, removing the
possibility of this crash in move_freepages(). A rough sketch of the
idea follows.
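
A purely hypothetical sketch (the config symbol, the function name and
the memblock-backed implementation below are illustrative only, not
existing kernel API):

#ifdef CONFIG_ARCH_HAVE_IS_VALID_PAGEBLOCK
/* Hypothetical arch hook: does this pageblock lie entirely inside RAM? */
int arch_is_valid_pageblock(unsigned long start_pfn)
{
	return memblock_is_region_memory(__pfn_to_phys(start_pfn),
					 pageblock_nr_pages << PAGE_SHIFT);
}
#else
static inline int arch_is_valid_pageblock(unsigned long start_pfn)
{
	return 1;
}
#endif

memmap_init_zone() could then advance pfn by pageblock_nr_pages
whenever arch_is_valid_pageblock() returns 0.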


> Thanks.
>
> --
> Mel Gorman
> SUSE Labs
>
Mel Gorman Aug. 5, 2011, 8:47 a.m. UTC | #7
On Fri, Aug 05, 2011 at 11:27:21AM +0530, Kautuk Consul wrote:
> Hi Mel,
> 
> Please find my comments inline to the email below.
> 
> 2 general questions:
> i)    If an email chain such as this leads to another kernel patch for
> the same problem, do I need to
>       create a new email chain for that ?

Yes. I believe it's easier on the maintainer if a new thread is started
with the new patch leader summarising relevant information from the
discussion. A happy maintainer makes the day easier.

> ii)  Sorry about my formatting problems. However, text such as
> backtraces and logs tend to wrap
>       irrespective of whatever gmail settings/browser I try. Any
> pointers here ?
> 

I don't use gmail so I do not have any suggestions other than using a
different mailer. There are no suggestions in
Documentation/email-clients.txt on how to deal with gmail.

> On Thu, Aug 4, 2011 at 3:39 PM, Mel Gorman <mgorman@suse.de> wrote:
> > On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
> >> Hi Mel,
> >>
> >> My ARM system has 2 memory banks which have the following 2 PFN ranges:
> >> 60000-62000 and 70000-7ce00.
> >>
> >> My SECTION_SIZE_BITS is #defined to 23.
> >>
> >
> > So bank 0 is 4 sections and bank 1 is 26 sections with the last section
> > incomplete.
> >
> >> I am altering the ranges via the following kind of pseudo-code in the
> >> arch/arm/mach-*/mach-*.c file:
> >> meminfo->bank[0].size -= (1 << 20)
> >> meminfo->bank[1].size -= (1 << 20)
> >>
> >
> > Why are you taking 1M off each bank? I could understand aligning the
> > banks to a section size at least.
> 
> The reason I am doing this is that one of our embedded boards actually
> has this problem, due
> to which we see this kernel crash. I am merely reproducing this
> problem by performing this step.
> 

Ah, that makes sense.

> >> <SNIP>
> >> The reason why we cannot expect the 0x61fff end_page->flags to contain
> >> a valid zone number is:
> >> memmap_init_zone() initializes the zone number of all pages for a zone
> >> via the set_page_links() inline function.
> >> For the end_page (whose PFN is 0x61fff), set_page_links() cannot be
> >> possibly called, as the zones are simply not aware of of PFNs above
> >> 0x61f00 and below 0x70000.
> >>
> >
> > Can you ensure that the ranges passed into free_area_init_node()
> > are MAX_ORDER aligned as this would initialise the struct pages. You
> > may have already seen that care is taken when freeing memmap that it
> > is aligned to MAX_ORDER in free_unused_memmap() in ARM.
> >
> 
> Will this work ? My doubt arises from the fact that there is only one
> zone on the entire
> system which contains both memory banks.

That is a common situation. It's why present_pages and spanned_pages
in a zone can differ. As long as valid memmap is MAX_ORDER-aligned, it's
fine.

> The crash arises at the PFN 0x61fff, which will not be covered by such
> a check, as this function

No, it won't be covered by the range check. However, the memmap will be
initialised so even though the page is outside a valid bank of memory,
it'll still resolve to the correct zone. The struct page will be marked
Reserved so it'll never be used.

> will try to act on the entire zone, which is the PFN range:
> 60000-7cd00, including the holes within as
> all of this RAM falls into the same node and zone.
> ( Please correct me if I am wrong about this. )
> 
> I tried aligning the end parameter in the memory_present() function
> which is called separately
> for each memory bank.
> I tried the following change in memory_present() as well as
> mminit_validate_memodel_limits():
> end &= ~(pageblock_nr_pages-1);
> But, in this case, the board simply does not boot up. I think that
> will then require some change in the
> arch/arm code which I think would be an arch-specific solution to a
> possibly generic problem.
> 

I do not believe this is a generic problem. Machines have
holes in the zone all the time and this bug does not trigger.
mminit_validate_memmodel_limits() is the wrong place to make a change.
Look more towards where free_area_init_node() gets called to initialise
memmap.

> >> The (end >= zone->zone_start_pfn + zone->spanned_pages) in
> >> move_freepages_block() does not stop this crash from happening as both
> >> our memory banks are in the same zone and the empty space within them
> >> is accomodated into this zone via the CONFIG_SPARSEMEM
> >> config option.
> >>
> >> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as
> >> any other BUG_ONs in the loop in move_freepages() as then the
> >> pfn_valid_within()/pfn_valid() function takes care of this
> >> functionality, especially in the case where the newly introduced
> >> CONFIG_HAVE_ARCH_PFN_VALID is
> >> enabled.
> >>
> >
> > This is an expensive option in terms of performance. If Russell
> > wants to pick it up, I won't object but I would strongly suggest that
> > you solve this problem by ensuring that memmap is initialised on a
> > MAX_ORDER-aligned boundaries as it'll perform better.
> >
> 
> I couldn't really locate a method in the kernel wherein we can
> validate a pageblock(1024 pages for my
> platform) with respect to the memory banks on that system.
> 

I suspect what you're looking for is somewhere in arch/arm/mm/init.c.
I'm afraid I didn't dig through the ARM memory initialisation
code to see where it could be done but if it was me, I'd be looking at
how free_area_init_node() is called.

> How about this :
> We implement an arch_is_valid_pageblock() function, controlled by a
> new config option
> CONFIG_ARCH_HAVE IS_VALID_PAGEBLOCK.
> This arch function will simply check whether this pageblock is valid
> or not, in terms of arch-specific
> memory banks or by using the memblock APIs depending on CONFIG_HAVE_MEMBLOCK.
> We can modify the memmap_init_zone() function so that an outer loop
> works in measures of
> pageblocks thus enabling us to avoid invalid pageblocks.

That seems like massive overkill for what should be an alignment problem
when initialising memmap. It's adding complexity that is similar to
HOLES_IN_ZONE with very little gain.

When it gets down to it, I'd even prefer deleting the BUG_ON and
relying on the PageBuddy check over such a solution. However, the BUG_ON is
there because alignment to MAX_ORDER is expected so it is a valid
sanity check. There would need to be good evidence that initialising
memmap to MAX_ORDER alignment was somehow impossible.
Kautuk Consul Aug. 5, 2011, 11:40 a.m. UTC | #8
Ok, I analyzed the code and it seems that this alignment problem has
been solved by the changes made to the free_unused_memmap() code in
arch/arm/mm/init.c.

I backported those changes to free_unused_memmap_node() in
linux-2.6.35.9 and I don't see any more crashes. This solves my
problem.
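
For reference, the relevant idea in the newer free_unused_memmap() is
to round each bank's end up to a MAX_ORDER boundary before freeing the
memmap between banks, roughly along these lines (a paraphrased sketch,
not a verbatim quote of any particular kernel version):

for_each_bank(i, mi) {
	struct membank *bank = &mi->bank[i];

	bank_start = bank_pfn_start(bank);

	/* Free the memmap covering the hole before this bank, if any. */
	if (prev_bank_end && prev_bank_end < bank_start)
		free_memmap(prev_bank_end, bank_start);

	/*
	 * Align up: the VM expects valid memmap up to the bank end
	 * rounded to MAX_ORDER_NR_PAGES, which is exactly what
	 * move_freepages_block() relies on.
	 */
	prev_bank_end = ALIGN(bank_pfn_end(bank), MAX_ORDER_NR_PAGES);
}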

Thanks for all the help.


On Fri, Aug 5, 2011 at 2:17 PM, Mel Gorman <mgorman@suse.de> wrote:
> On Fri, Aug 05, 2011 at 11:27:21AM +0530, Kautuk Consul wrote:
>> Hi Mel,
>>
>> Please find my comments inline to the email below.
>>
>> 2 general questions:
>> i)    If an email chain such as this leads to another kernel patch for
>> the same problem, do I need to
>>       create a new email chain for that ?
>
> Yes. I believe it's easier on the maintainer if a new thread is started
> with the new patch leader summarising relevant information from the
> discussion. A happy maintainer makes the day easier.
>
>> ii)  Sorry about my formatting problems. However, text such as
>> backtraces and logs tend to wrap
>>       irrespective of whatever gmail settings/browser I try. Any
>> pointers here ?
>>
>
> I don't use gmail so I do not have any suggestions other than using a
> different mailer. There are no suggestions in
> Documentation/email-clients.txt on how to deal with gmail.
>
>> On Thu, Aug 4, 2011 at 3:39 PM, Mel Gorman <mgorman@suse.de> wrote:
>> > On Thu, Aug 04, 2011 at 03:06:39PM +0530, Kautuk Consul wrote:
>> >> Hi Mel,
>> >>
>> >> My ARM system has 2 memory banks which have the following 2 PFN ranges:
>> >> 60000-62000 and 70000-7ce00.
>> >>
>> >> My SECTION_SIZE_BITS is #defined to 23.
>> >>
>> >
>> > So bank 0 is 4 sections and bank 1 is 26 sections with the last section
>> > incomplete.
>> >
>> >> I am altering the ranges via the following kind of pseudo-code in the
>> >> arch/arm/mach-*/mach-*.c file:
>> >> meminfo->bank[0].size -= (1 << 20)
>> >> meminfo->bank[1].size -= (1 << 20)
>> >>
>> >
>> > Why are you taking 1M off each bank? I could understand aligning the
>> > banks to a section size at least.
>>
>> The reason I am doing this is that one of our embedded boards actually
>> has this problem, due
>> to which we see this kernel crash. I am merely reproducing this
>> problem by performing this step.
>>
>
> Ah, that makes sense.
>
>> >> <SNIP>
>> >> The reason why we cannot expect the 0x61fff end_page->flags to contain
>> >> a valid zone number is:
>> >> memmap_init_zone() initializes the zone number of all pages for a zone
>> >> via the set_page_links() inline function.
>> >> For the end_page (whose PFN is 0x61fff), set_page_links() cannot be
>> >> possibly called, as the zones are simply not aware of of PFNs above
>> >> 0x61f00 and below 0x70000.
>> >>
>> >
>> > Can you ensure that the ranges passed into free_area_init_node()
>> > are MAX_ORDER aligned as this would initialise the struct pages. You
>> > may have already seen that care is taken when freeing memmap that it
>> > is aligned to MAX_ORDER in free_unused_memmap() in ARM.
>> >
>>
>> Will this work ? My doubt arises from the fact that there is only one
>> zone on the entire
>> system which contains both memory banks.
>
> That is a common situation. It's why present_pages and spanned_pages
> in a zone can differ. As long as valid memmap is MAX_ORDER-aligned, it's
> fine.
>
>> The crash arises at the PFN 0x61fff, which will not be covered by such
>> a check, as this function
>
> No, it won't be covered by the range check. However, the memmap will be
> initialised so even though the page is outside a valid bank of memory,
> it'll still resolve to the correct zone. The struct page will be marked
> Reserved so it'll never be used.
>
>> will try to act on the entire zone, which is the PFN range:
>> 60000-7cd00, including the holes within as
>> all of this RAM falls into the same node and zone.
>> ( Please correct me if I am wrong about this. )
>>
>> I tried aligning the end parameter in the memory_present() function
>> which is called separately
>> for each memory bank.
>> I tried the following change in memory_present() as well as
>> mminit_validate_memodel_limits():
>> end &= ~(pageblock_nr_pages-1);
>> But, in this case, the board simply does not boot up. I think that
>> will then require some change in the
>> arch/arm code which I think would be an arch-specific solution to a
>> possibly generic problem.
>>
>
> I do not believe this is a generic problem. Machines have
> holes in the zone all the time and this bug does not trigger.
> mminit_validate_memodel_limits() is the wrong place to make a change.
> Look more towards where free_area_init_node() gets called to initialise
> memmap.
>
>> >> The (end >= zone->zone_start_pfn + zone->spanned_pages) in
>> >> move_freepages_block() does not stop this crash from happening as both
>> >> our memory banks are in the same zone and the empty space within them
>> >> is accomodated into this zone via the CONFIG_SPARSEMEM
>> >> config option.
>> >>
>> >> When we enable CONFIG_HOLES_IN_ZONE we survive this BUG_ON as well as
>> >> any other BUG_ONs in the loop in move_freepages() as then the
>> >> pfn_valid_within()/pfn_valid() function takes care of this
>> >> functionality, especially in the case where the newly introduced
>> >> CONFIG_HAVE_ARCH_PFN_VALID is
>> >> enabled.
>> >>
>> >
>> > This is an expensive option in terms of performance. If Russell
>> > wants to pick it up, I won't object but I would strongly suggest that
>> > you solve this problem by ensuring that memmap is initialised on a
>> > MAX_ORDER-aligned boundaries as it'll perform better.
>> >
>>
>> I couldn't really locate a method in the kernel wherein we can
>> validate a pageblock(1024 pages for my
>> platform) with respect to the memory banks on that system.
>>
>
> I suspect what you're looking for is somewhere in arch/arm/mm/init.c .
> I'm afraid I didn't dig through the ARM memory initialisation
> code to see where it could be done but if it was me, I'd be looking at
> how free_area_init_node() is called.
>
>> How about this :
>> We implement an arch_is_valid_pageblock() function, controlled by a
>> new config option
>> CONFIG_ARCH_HAVE IS_VALID_PAGEBLOCK.
>> This arch function will simply check whether this pageblock is valid
>> or not, in terms of arch-specific
>> memory banks or by using the memblock APIs depending on CONFIG_HAVE_MEMBLOCK.
>> We can modify the memmap_init_zone() function so that an outer loop
>> works in measures of
>> pageblocks thus enabling us to avoid invalid pageblocks.
>
> That seems like massive overkill for what should be an alignment problem
> when initialisating memmap. It's adding complexity that is similar to
> HOLES_IN_ZONE with very little gain.
>
> When it gets down to it, I'd even prefer deleting the BUG_ON as
> the PageBuddy check over such a solution. However, the BUG_ON is
> there because alignment to MAX_ORDER is expected so it is a valid
> sanity check. There would need to be good evidence that initialising
> memmap to MAX_ORDER alignment was somehow impossible.
>
> --
> Mel Gorman
> SUSE Labs
>

Patch

--- a/arch/arm/Kconfig	2011-08-02 16:51:32.109207996 +0530
+++ b/arch/arm/Kconfig	2011-08-02 16:56:19.036207994 +0530
@@ -1508,6 +1508,9 @@  config ARCH_SELECT_MEMORY_MODEL
 config HAVE_ARCH_PFN_VALID
 	def_bool ARCH_HAS_HOLES_MEMORYMODEL || !SPARSEMEM

+config HOLES_IN_ZONE
+	def_bool ARCH_HAS_HOLES_MEMORYMODEL || SPARSEMEM
+
 config HIGHMEM
 	bool "High Memory Support"
 	depends on MMU