diff mbox

[v3] mm: make expand_downwards symmetrical to expand_upwards

Message ID alpine.DEB.2.00.1104191657030.26867@router.home (mailing list archive)
State Not Applicable
Headers show

Commit Message

Christoph Lameter (Ampere) April 19, 2011, 9:58 p.m. UTC
On Tue, 19 Apr 2011, James Bottomley wrote:

> > Which part of me telling you that you will break lots of other things in
> > the core kernel dont you get?
>
> I get that you tell me this ... however, the systems that, according to
> you, should be failing to get to boot prompt do, in fact, manage it.

If you dont use certain subsystems then it may work. Also do you run with
debuggin on.

The following patch is I think what would be needed to fix it.



Subject: [PATCH] Fix discontig support for !NUMA

Under NUMA discontig nodes map directly to the kernel NUMA nodes.

However, when DISCONTIG is used without NUMA then the kernel has only
one NUMA mode (==0) but within the node there may be multiple discontig pages
on various "nodes" for page struct vector management purposes.

Define a function __page_to_nid() that always extracts the node from
the page struct. This can be used in places where we need the discontig
node. Define page_to_nid() under !NUMA to always return 0. This ensures
that the various subsystems relying on page_to_nid(page) == 0 on !NUMA
function properly.

<Untested since I do not have a PARISC system. There could be
additional occurrences that need __page_to_nid>

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/asm-generic/memory_model.h |    2 +-
 include/linux/mm.h                 |   10 ++++++++--
 mm/sparse.c                        |    2 +-
 3 files changed, 10 insertions(+), 4 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Motohiro KOSAKI April 20, 2011, 1:23 a.m. UTC | #1
> On Tue, 19 Apr 2011, James Bottomley wrote:
> 
> > > Which part of me telling you that you will break lots of other things in
> > > the core kernel dont you get?
> >
> > I get that you tell me this ... however, the systems that, according to
> > you, should be failing to get to boot prompt do, in fact, manage it.
> 
> If you dont use certain subsystems then it may work. Also do you run with
> debuggin on.
> 
> The following patch is I think what would be needed to fix it.

I'm worry about this patch. A lot of mm code assume !NUMA systems 
only have node 0. Not only SLUB.

I'm not sure why this unfortunate mismatch occur. but I think DISCONTIG
hacks makes less sense. Can we consider parisc turn NUMA on instead?



--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley April 20, 2011, 2:33 a.m. UTC | #2
On Tue, 2011-04-19 at 16:58 -0500, Christoph Lameter wrote:
> On Tue, 19 Apr 2011, James Bottomley wrote:
> 
> > > Which part of me telling you that you will break lots of other things in
> > > the core kernel dont you get?
> >
> > I get that you tell me this ... however, the systems that, according to
> > you, should be failing to get to boot prompt do, in fact, manage it.
> 
> If you dont use certain subsystems then it may work. Also do you run with
> debuggin on.
> 
> The following patch is I think what would be needed to fix it.

Not really: crashes immediately on boot

[    0.000000] FP[0] enabled: Rev 1 Model 20
[    0.000000] The 64-bit Kernel has started...
[    0.000000] bootconsole [ttyB0] enabled
[    0.000000] Initialized PDC Console for debugging.
[    0.000000] Determining PDC firmware type: 64 bit PAT.
[    0.000000] model 00008870 00000491 00000000 00000002 3e0505e7352af710 100000f0 00000008 000000b2 000000b2
[    0.000000] vers  00000301
[    0.000000] CPUID vers 20 rev 4 (0x00000284)
[    0.000000] capabilities 0x35
[    0.000000] model 9000/800/rp3440  
[    0.000000] parisc_cache_init: Only equivalent aliasing supported!
[    0.000000] Memory Ranges:
[    0.000000]  0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
[    0.000000]  1) Start 0x0000004040000000 End 0x000000407fdfffff Size   1022 MB
[    0.000000] Total Memory: 2046 MB
[    0.000000] initrd: 7f390000-7ffedf6d
[    0.000000] initrd: reserving 3f390000-3ffedf6d (mem_max 7fe00000)
[    0.000000] ------------[ cut here ]------------
[    0.000000] kernel BUG at mm/mm_init.c:127!
[    0.000000] 
[    0.000000]      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[    0.000000] PSW: 00001000000001001111111100001110 Not tainted
[    0.000000] r00-03  000000ff0804ff0e 000000004076a640 0000000040798c50 0000004080000000
[    0.000000] r04-07  0000000040746e40 0000000004040000 0000000000000001 0000000040654150
[    0.000000] r08-11  00000000405bd540 000000000407fe00 0000000000000001 0000000000000000
[    0.000000] r12-15  00000000405bc740 000f000000000000 00000000000001ff 000000004076a640
[    0.000000] r16-19  00000000000000ff 0000000000000000 2000000000000000 0000000000000000
[    0.000000] r20-23  0000004080000000 00000000405bd908 0000000000000000 0000000004040000
[    0.000000] r24-27  0000000000000001 0000000000000000 0000004080000000 0000000040746e40
[    0.000000] r28-31  2000000000000000 00000000405b0610 00000000405b0640 0000000000000000
[    0.000000] sr00-03  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.000000] sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
[    0.000000] 
[    0.000000] IASQ: 0000000000000000 0000000000000000 IAOQ: 0000000040798fac 0000000040798fb0
[    0.000000]  IIR: 03ffe01f    ISR: 0000000010350000  IOR: 0000010000000000
[    0.000000]  CPU:        0   CR30: 00000000405b0000 CR31: fffffff0f0e098e0
[    0.000000]  ORIG_R28: 000000000003fe00
[    0.000000]  IAOQ[0]: mminit_verify_page_links+0x84/0xa0
[    0.000000]  IAOQ[1]: mminit_verify_page_links+0x88/0xa0
[    0.000000]  RP(r2): memmap_init_zone+0x148/0x2a0
[    0.000000] Backtrace:
[    0.000000]  [<0000000040798c50>] memmap_init_zone+0x148/0x2a0
[    0.000000]  [<0000000040777ca8>] free_area_init_node+0x3c8/0x518
[    0.000000]  [<000000004076fde0>] paging_init+0x928/0xb20
[    0.000000]  [<0000000040770a48>] setup_arch+0xe8/0x120
[    0.000000]  [<000000004076c9a0>] start_kernel+0xf0/0x830
[    0.000000]  [<000000004011f4fc>] start_parisc+0xa4/0xb8
[    0.000000]  [<00000000404b0f0c>] packet_ioctl+0x1e4/0x208
[    0.000000]  [<00000000404a79d0>] unix_ioctl+0x70/0x168
[    0.000000]  [<0000000040482d24>] ip_mc_gsfget+0x14c/0x200
[    0.000000]  [<000000004046cc20>] raw_ioctl+0xe8/0x118
[    0.000000]  [<000000004044f524>] do_tcp_getsockopt+0x5c4/0x5d0
[    0.000000]  [<0000000040432d64>] netlink_getsockopt+0x15c/0x178
[    0.000000] 
[    0.000000] Backtrace:
[    0.000000]  [<000000004011f984>] show_stack+0x14/0x20
[    0.000000]  [<000000004011f9a8>] dump_stack+0x18/0x28
[    0.000000]  [<000000004012022c>] die_if_kernel+0x194/0x258
[    0.000000]  [<0000000040120b30>] handle_interruption+0x840/0x8f8
[    0.000000]  [<0000000040798fac>] mminit_verify_page_links+0x84/0xa0
[    0.000000] 
[    0.000000] ---[ end trace 139ce121c98e96c9 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] Backtrace:
[    0.000000]  [<000000004011f984>] show_stack+0x14/0x20
[    0.000000]  [<000000004011f9a8>] dump_stack+0x18/0x28
[    0.000000]  [<000000004015945c>] panic+0xd4/0x368
[    0.000000]  [<000000004015f054>] do_exit+0x89c/0x9d8
[    0.000000]  [<00000000401202d4>] die_if_kernel+0x23c/0x258
[    0.000000]  [<0000000040120b30>] handle_interruption+0x840/0x8f8
[    0.000000]  [<0000000040798fac>] mminit_verify_page_links+0x84/0xa0
[    0.000000] 

There's a lot more to discontigmem than just page_to_nid ... there's the
whole pfn_to_nid() thing as well

James


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pekka Enberg April 20, 2011, 5:53 a.m. UTC | #3
On Wed, Apr 20, 2011 at 4:23 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> I'm worry about this patch. A lot of mm code assume !NUMA systems
> only have node 0. Not only SLUB.

So is that a valid assumption or not? Christoph seems to think it is
and James seems to think it's not. Which way should we aim to fix it?
Would be nice if other people chimed in as we already know what James
and Christoph think.

                        Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Motohiro KOSAKI April 20, 2011, 7:15 a.m. UTC | #4
> On Wed, Apr 20, 2011 at 4:23 AM, KOSAKI Motohiro
> <kosaki.motohiro@jp.fujitsu.com> wrote:
> > I'm worry about this patch. A lot of mm code assume !NUMA systems
> > only have node 0. Not only SLUB.
> 
> So is that a valid assumption or not? Christoph seems to think it is
> and James seems to think it's not. Which way should we aim to fix it?
> Would be nice if other people chimed in as we already know what James
> and Christoph think.

I'm sorry. I don't know it really. The fact was gone into historical myst. ;-)

Now, CONFIG_NUMA has mainly five meanings.

1) system may has !0 node id.
2) compile mm/mempolicy.c (ie enable mempolicy APIs)
3) Allocator (kmalloc, vmalloc, alloc_page, et al) awake NUMA topology.
4) enable zone-reclaim feature
5) scheduler makes per-node load balancing scheduler domain

Anyway, we have to fix this issue.  I'm digging which fixing way has least risk.


btw, x86 don't have an issue. Probably it's a reason why this issue was neglected
long time.

arch/x86/Kconfig
-------------------------------------
config ARCH_DISCONTIGMEM_ENABLE
        def_bool y
        depends on NUMA && X86_32



--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pekka Enberg April 20, 2011, 7:34 a.m. UTC | #5
Hi!

On Wed, Apr 20, 2011 at 4:23 AM, KOSAKI Motohiro
>> <kosaki.motohiro@jp.fujitsu.com> wrote:
>> > I'm worry about this patch. A lot of mm code assume !NUMA systems
>> > only have node 0. Not only SLUB.
>>
>> So is that a valid assumption or not? Christoph seems to think it is
>> and James seems to think it's not. Which way should we aim to fix it?
>> Would be nice if other people chimed in as we already know what James
>> and Christoph think.

On Wed, Apr 20, 2011 at 10:15 AM, KOSAKI Motohiro
<kosaki.motohiro@jp.fujitsu.com> wrote:
> I'm sorry. I don't know it really. The fact was gone into historical myst. ;-)
>
> Now, CONFIG_NUMA has mainly five meanings.
>
> 1) system may has !0 node id.
> 2) compile mm/mempolicy.c (ie enable mempolicy APIs)
> 3) Allocator (kmalloc, vmalloc, alloc_page, et al) awake NUMA topology.
> 4) enable zone-reclaim feature
> 5) scheduler makes per-node load balancing scheduler domain
>
> Anyway, we have to fix this issue.  I'm digging which fixing way has least risk.
>
>
> btw, x86 don't have an issue. Probably it's a reason why this issue was neglected
> long time.
>
> arch/x86/Kconfig
> -------------------------------------
> config ARCH_DISCONTIGMEM_ENABLE
>        def_bool y
>        depends on NUMA && X86_32

That part makes me think the best option is to make parisc do
CONFIG_NUMA as well regardless of the historical intent was.

                        Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Wilcox April 20, 2011, 11:20 a.m. UTC | #6
On Wed, Apr 20, 2011 at 10:34:23AM +0300, Pekka Enberg wrote:
> That part makes me think the best option is to make parisc do
> CONFIG_NUMA as well regardless of the historical intent was.

But it's not just parisc.  It's six other architectures as well, some
of which aren't even SMP.  Does !SMP && NUMA make any kind of sense?

I think really, this is just a giant horrible misunderstanding on the part
of the MM people.  There's no reason why an ARM chip with 16MB of memory
at 0 and 16MB of memory at 1GB should be saddled with all the NUMA gunk.
Pekka Enberg April 20, 2011, 11:28 a.m. UTC | #7
Hi Matthew,

On Wed, Apr 20, 2011 at 10:34:23AM +0300, Pekka Enberg wrote:
>> That part makes me think the best option is to make parisc do
>> CONFIG_NUMA as well regardless of the historical intent was.

On Wed, Apr 20, 2011 at 2:20 PM, Matthew Wilcox <matthew@wil.cx> wrote:
> But it's not just parisc.  It's six other architectures as well, some
> of which aren't even SMP.  Does !SMP && NUMA make any kind of sense?

IIRC, we actually fixed SLAB or SLUB to work on such configs in the past.

On Wed, Apr 20, 2011 at 2:20 PM, Matthew Wilcox <matthew@wil.cx> wrote:
> I think really, this is just a giant horrible misunderstanding on the part
> of the MM people.  There's no reason why an ARM chip with 16MB of memory
> at 0 and 16MB of memory at 1GB should be saddled with all the NUMA gunk.

Right. My point was simply that since x86 doesn't support DISCONTIGMEM
without NUMA, the misunderstanding is likely very wide-spread.

                       Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) April 20, 2011, 1:58 p.m. UTC | #8
On Wed, 20 Apr 2011, Matthew Wilcox wrote:

> On Wed, Apr 20, 2011 at 10:34:23AM +0300, Pekka Enberg wrote:
> > That part makes me think the best option is to make parisc do
> > CONFIG_NUMA as well regardless of the historical intent was.
>
> But it's not just parisc.  It's six other architectures as well, some
> of which aren't even SMP.  Does !SMP && NUMA make any kind of sense?

Of course not.

> I think really, this is just a giant horrible misunderstanding on the part
> of the MM people.  There's no reason why an ARM chip with 16MB of memory
> at 0 and 16MB of memory at 1GB should be saddled with all the NUMA gunk.

DISCONTIG has fallen out of favor in the last years. SPARSEMEM has largely
replaced it. ARM uses that and does not suffer from these issue.

No one considered the issues of having a !NUMA configuration with
nodes (which DISCONTIG seems to create) when developing core code in the
last years. The implicit assumption has always been that page_to_nid(x)
etc is always zero on a !NUMA configuration.
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) April 20, 2011, 2:07 p.m. UTC | #9
On Wed, 20 Apr 2011, Pekka Enberg wrote:

> That part makes me think the best option is to make parisc do
> CONFIG_NUMA as well regardless of the historical intent was.

Another possilibity is to use SPARSEMEM instead? We can do the same for
the other arches that we have done to x86.


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley April 20, 2011, 2:15 p.m. UTC | #10
[added linux-arch to cc since we're going to be affecting them]
On Wed, 2011-04-20 at 14:28 +0300, Pekka Enberg wrote:
> Right. My point was simply that since x86 doesn't support DISCONTIGMEM
> without NUMA, the misunderstanding is likely very wide-spread.

Why don't we approach the problem in a few separate ways then. 

     1. We can look at what imposing NUMA on the DISCONTIGMEM archs
        would do ... the embedded ones are going to be hardest hit, but
        if it's not too much extra code, it might be palatable.
     2. The other is that we can audit mm to look at all the node
        assumptions in the non-numa case.  My suspicion is that
        accidentally or otherwise, it mostly works for the normal case,
        so there might not be much needed to pull it back to working
        properly for DISCONTIGMEM.
     3. Finally we could look at deprecating DISCONTIGMEM in favour of
        SPARSEMEM, but we'd still need to fix -stable for that case.
        Especially as it will take time to convert all the architectures

I'm certainly with Matthew: DISCONTIGMEM is supposed to be a lightweight
framework which allows machines with split physical memory ranges to
work.  That's a very common case nowadays.  Numa is supposed to be a
heavyweight framework to preserve node locality for non-uniform memory
access boxes (which none of the DISCONTIGMEM && !NUMA systems are).

James


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) April 20, 2011, 2:50 p.m. UTC | #11
On Wed, 20 Apr 2011, James Bottomley wrote:

>      1. We can look at what imposing NUMA on the DISCONTIGMEM archs
>         would do ... the embedded ones are going to be hardest hit, but
>         if it's not too much extra code, it might be palatable.
>      2. The other is that we can audit mm to look at all the node
>         assumptions in the non-numa case.  My suspicion is that
>         accidentally or otherwise, it mostly works for the normal case,
>         so there might not be much needed to pull it back to working
>         properly for DISCONTIGMEM.

The older code may work. SLAB f.e. does not call page_to_nid() in the
!NUMA case but keeps special metadata structures around in each slab page
that records the node used for allocation. The problem is with new code
added/revised in the last 5 years or so that uses page_to_nid() and
allocates only a single structure for !NUMA. There are also VM_BUG_ONs in
the page allocator that should trigger if page_to_nid() returns strange
values. I wonder why that never occurred.

>      3. Finally we could look at deprecating DISCONTIGMEM in favour
of >         SPARSEMEM, but we'd still need to fix -stable for that case.
>         Especially as it will take time to convert all the architectures

The fix needed is to mark DISCONTIGMEM without NUMA as broken for now. We
need an audit of the core VM before removing that or making it contingent
on the configurations of various VM subsystems.

> I'm certainly with Matthew: DISCONTIGMEM is supposed to be a lightweight
> framework which allows machines with split physical memory ranges to
> work.  That's a very common case nowadays.  Numa is supposed to be a
> heavyweight framework to preserve node locality for non-uniform memory
> access boxes (which none of the DISCONTIGMEM && !NUMA systems are).

Well yes but we have SPARSE for that today. DISCONTIG with multiple per
pgdat structures in a !NUMA case is just weird and unexpected for many who
have done VM coding in the last years.

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley April 20, 2011, 3:02 p.m. UTC | #12
On Wed, 2011-04-20 at 09:50 -0500, Christoph Lameter wrote:
> On Wed, 20 Apr 2011, James Bottomley wrote:
> 
> >      1. We can look at what imposing NUMA on the DISCONTIGMEM archs
> >         would do ... the embedded ones are going to be hardest hit, but
> >         if it's not too much extra code, it might be palatable.
> >      2. The other is that we can audit mm to look at all the node
> >         assumptions in the non-numa case.  My suspicion is that
> >         accidentally or otherwise, it mostly works for the normal case,
> >         so there might not be much needed to pull it back to working
> >         properly for DISCONTIGMEM.
> 
> The older code may work. SLAB f.e. does not call page_to_nid() in the
> !NUMA case but keeps special metadata structures around in each slab page
> that records the node used for allocation. The problem is with new code
> added/revised in the last 5 years or so that uses page_to_nid() and
> allocates only a single structure for !NUMA. There are also VM_BUG_ONs in
> the page allocator that should trigger if page_to_nid() returns strange
> values. I wonder why that never occurred.

Actually, I think slab got changed when discontigmem was added ...
that's why it all works OK.

> >      3. Finally we could look at deprecating DISCONTIGMEM in favour
> of >         SPARSEMEM, but we'd still need to fix -stable for that case.
> >         Especially as it will take time to convert all the architectures
> 
> The fix needed is to mark DISCONTIGMEM without NUMA as broken for now. We
> need an audit of the core VM before removing that or making it contingent
> on the configurations of various VM subsystems.

Don't be stupid ... that would cause six architectures to get marked
broken.

> > I'm certainly with Matthew: DISCONTIGMEM is supposed to be a lightweight
> > framework which allows machines with split physical memory ranges to
> > work.  That's a very common case nowadays.  Numa is supposed to be a
> > heavyweight framework to preserve node locality for non-uniform memory
> > access boxes (which none of the DISCONTIGMEM && !NUMA systems are).
> 
> Well yes but we have SPARSE for that today. DISCONTIG with multiple per
> pgdat structures in a !NUMA case is just weird and unexpected for many who
> have done VM coding in the last years.

Look, I'm not really interested in who understands what.  The fact is we
have six architectures with the possibility for DISCONTIGMEM && !NUMA,
so that's the case we need to fix in -stable.

They oops with SLUB, as far as I can tell, there are still no oops
reports with SLAB.  The simplest -stable fix seems to be to mark SLUB
broken on DISCONTIGMEM && !NUMA.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Lameter (Ampere) April 20, 2011, 3:22 p.m. UTC | #13
On Wed, 20 Apr 2011, James Bottomley wrote:

> > The older code may work. SLAB f.e. does not call page_to_nid() in the
> > !NUMA case but keeps special metadata structures around in each slab page
> > that records the node used for allocation. The problem is with new code
> > added/revised in the last 5 years or so that uses page_to_nid() and
> > allocates only a single structure for !NUMA. There are also VM_BUG_ONs in
> > the page allocator that should trigger if page_to_nid() returns strange
> > values. I wonder why that never occurred.
>
> Actually, I think slab got changed when discontigmem was added ...
> that's why it all works OK.

Could be. I was not around at the time.

> > >      3. Finally we could look at deprecating DISCONTIGMEM in favour
> > of >         SPARSEMEM, but we'd still need to fix -stable for that case.
> > >         Especially as it will take time to convert all the architectures
> >
> > The fix needed is to mark DISCONTIGMEM without NUMA as broken for now. We
> > need an audit of the core VM before removing that or making it contingent
> > on the configurations of various VM subsystems.
>
> Don't be stupid ... that would cause six architectures to get marked
> broken.

Yes they are broken right now. Marking just means showing the user that we
are aware of the situation.

> Look, I'm not really interested in who understands what.  The fact is we
> have six architectures with the possibility for DISCONTIGMEM && !NUMA,
> so that's the case we need to fix in -stable.
>
> They oops with SLUB, as far as I can tell, there are still no oops
> reports with SLAB.  The simplest -stable fix seems to be to mark SLUB
> broken on DISCONTIGMEM && !NUMA.

There is barely any testing going on at all of this since we have had this
issue for more than 5 years and have not noticed it. The absence of bug
reports therefore proves nothing. Code inspection of the VM shows
that this is an issue that arises in multiple subsystems and that we have
VM_BUG_ONs in the page allocator that should trigger for these situations.

Usage of DISCONTIGMEM and !NUMA is not safe and should be flagged as such.

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthew Wilcox April 20, 2011, 7:25 p.m. UTC | #14
On Wed, Apr 20, 2011 at 10:22:04AM -0500, Christoph Lameter wrote:
> There is barely any testing going on at all of this since we have had this
> issue for more than 5 years and have not noticed it. The absence of bug
> reports therefore proves nothing. Code inspection of the VM shows
> that this is an issue that arises in multiple subsystems and that we have
> VM_BUG_ONs in the page allocator that should trigger for these situations.

So ... we've proven that people using these architectures use SLAB
instead of SLUB, don't enable CONFIG_DEBUG_VM and don't use hugepages
(not really a surprise ... nobody's running Oracle on these arches :-)

I don't think that qualifies as "barely any testing".  I think that's
"nobody developing the Linux MM uses one of these architectures".
David Rientjes April 20, 2011, 9:34 p.m. UTC | #15
On Wed, 20 Apr 2011, Matthew Wilcox wrote:

> > That part makes me think the best option is to make parisc do
> > CONFIG_NUMA as well regardless of the historical intent was.
> 
> But it's not just parisc.  It's six other architectures as well, some
> of which aren't even SMP.  Does !SMP && NUMA make any kind of sense?
> 

It does as long as DISCONTIGMEM is hijacking NUMA abstractions throughout 
the code; for example, look at the .config that James is probably using 
for testing here:

	CONFIG_PA8X00=y
	CONFIG_64BIT=y
	CONFIG_DISCONTIGMEM=y
	CONFIG_NEED_MULTIPLE_NODES=y
	CONFIG_NODES_SHIFT=3

and CONFIG_NUMA is not enabled.  So we want CONFIG_NODES_SHIFT of 3 
(because MAX_PHYSMEM_RANGES is 8) and CONFIG_NEED_MULTIPLE_NODES is 
enabled because of DISCONTIGMEM:

	#
	# Both the NUMA code and DISCONTIGMEM use arrays of pg_data_t's
	# to represent different areas of memory.  This variable allows
	# those dependencies to exist individually.
	#
	config NEED_MULTIPLE_NODES
		def_bool y
		depends on DISCONTIGMEM || NUMA

when in reality we should do away with CONFIG_NEED_MULTIPLE_NODES and just 
force DISCONTIGMEM to enable CONFIG_NUMA at least for -stable and as a 
quick fix for James.

In the long run, we'll probably want to define a lighterweight CONFIG_NUMA 
as a layer that CONFIG_DISCONTIGMEM can use for memory range abstractions 
and then CONFIG_NUMA is built on top of it to define proximity between 
those ranges.
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Rientjes April 20, 2011, 9:42 p.m. UTC | #16
On Wed, 20 Apr 2011, Christoph Lameter wrote:

> There is barely any testing going on at all of this since we have had this
> issue for more than 5 years and have not noticed it. The absence of bug
> reports therefore proves nothing. Code inspection of the VM shows
> that this is an issue that arises in multiple subsystems and that we have
> VM_BUG_ONs in the page allocator that should trigger for these situations.
> 
> Usage of DISCONTIGMEM and !NUMA is not safe and should be flagged as such.
> 

We don't actually have any bug reports in front of us that show anything 
else in the VM other than slub has issues with this configuration, so 
marking them as broken is probably premature.  The parisc config that 
triggered this debugging enables CONFIG_SLAB by default, so it probably 
has gone unnoticed just because nobody other than James has actually tried 
it on hppa64.

Let's see if KOSAKI-san's fixes to Kconfig (even though I'd prefer the 
simpler and implicit "config NUMA def_bool ARCH_DISCONTIGMEM_ENABLE" over 
his config NUMA) and my fix to parisc to set the bit in N_NORMAL_MEMORY 
so that CONFIG_SLUB initializes kmem_cache_node correctly works and then 
address issues in the core VM as they arise.  Presumably someone has been 
running DISCONTIGMEM on hppa64 in the past five years without issues with 
defconfig, so the issue here may just be slub.
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2011-04-19 16:43:53.822507013 -0500
+++ linux-2.6/include/linux/mm.h	2011-04-19 16:44:52.082506944 -0500
@@ -666,14 +666,20 @@  static inline int zone_to_nid(struct zon
 }

 #ifdef NODE_NOT_IN_PAGE_FLAGS
-extern int page_to_nid(struct page *page);
+extern int __page_to_nid(struct page *page);
 #else
-static inline int page_to_nid(struct page *page)
+static inline int __page_to_nid(struct page *page)
 {
 	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
 }
 #endif

+#ifdef CONFIG_NUMA
+#define page_to_nid __page_to_nid
+#else
+#define page_to_nid(x) 0
+#endif
+
 static inline struct zone *page_zone(struct page *page)
 {
 	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
Index: linux-2.6/include/asm-generic/memory_model.h
===================================================================
--- linux-2.6.orig/include/asm-generic/memory_model.h	2011-04-19 16:45:26.772506904 -0500
+++ linux-2.6/include/asm-generic/memory_model.h	2011-04-19 16:46:02.602506861 -0500
@@ -40,7 +40,7 @@ 

 #define __page_to_pfn(pg)						\
 ({	struct page *__pg = (pg);					\
-	struct pglist_data *__pgdat = NODE_DATA(page_to_nid(__pg));	\
+	struct pglist_data *__pgdat = NODE_DATA(__page_to_nid(__pg));	\
 	(unsigned long)(__pg - __pgdat->node_mem_map) +			\
 	 __pgdat->node_start_pfn;					\
 })
Index: linux-2.6/mm/sparse.c
===================================================================
--- linux-2.6.orig/mm/sparse.c	2011-04-19 16:44:58.432506937 -0500
+++ linux-2.6/mm/sparse.c	2011-04-19 16:45:07.332506926 -0500
@@ -40,7 +40,7 @@  static u8 section_to_node_table[NR_MEM_S
 static u16 section_to_node_table[NR_MEM_SECTIONS] __cacheline_aligned;
 #endif

-int page_to_nid(struct page *page)
+int __page_to_nid(struct page *page)
 {
 	return section_to_node_table[page_to_section(page)];
 }