
[04/12] mips: Reserve memory for the kernel image resources

Message ID 20190423224748.3765-5-fancer.lancer@gmail.com (mailing list archive)
State Superseded
Series mips: Post-bootmem-memblock transition fixes

Commit Message

Serge Semin April 23, 2019, 10:47 p.m. UTC
The reserved_end variable had been used by the bootmem_init() code
to find the lowest limit of memory available for the memmap blob. The
original code just tried to find free memory space above where the kernel
was placed. This limitation seems justified for the memmap region search,
but I can't see any obvious reason to reserve the unused space below the
kernel, seeing that some platforms place it much higher than the standard
1MB. Moreover, the RELOCATABLE config allows the kernel to be loaded at any
memory address. So let's reserve only the memory occupied by the kernel,
leaving the region below it free for allocations. After that we can also
discard the code freeing the space between the kernel _text and
VMLINUX_LOAD_ADDRESS symbols, since it is going to be free anyway (unless
marked as reserved by a platform).

Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
---
 arch/mips/kernel/setup.c | 30 +++---------------------------
 1 file changed, 3 insertions(+), 27 deletions(-)
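
The patch body itself isn't reproduced on this page, but as quoted later in
the thread the core change replaces the blanket reservation of everything
from PHYS_OFFSET up to _end with a reservation of just the kernel image.
A minimal before/after sketch of that hunk:

	/* Before: reserve everything from PHYS_OFFSET up to the page above _end. */
	memblock_reserve(PHYS_OFFSET,
			 (PFN_UP(__pa_symbol(&_end)) << PAGE_SHIFT) - PHYS_OFFSET);

	/* After: reserve only the memory occupied by the kernel image itself. */
	memblock_reserve(__pa_symbol(&_text),
			 __pa_symbol(&_end) - __pa_symbol(&_text));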

Comments

Paul Burton April 24, 2019, 10:43 p.m. UTC | #1
Hi Serge,

On Wed, Apr 24, 2019 at 01:47:40AM +0300, Serge Semin wrote:
> The reserved_end variable had been used by the bootmem_init() code
> to find a lowest limit of memory available for memmap blob. The original
> code just tried to find a free memory space higher than kernel was placed.
> This limitation seems justified for the memmap ragion search process, but
> I can't see any obvious reason to reserve the unused space below kernel
> seeing some platforms place it much higher than standard 1MB.

There are 2 reasons I'm aware of:

 1) Older systems generally had something like an ISA bus which used
    addresses below the kernel, and bootloaders like YAMON left behind
    functions that could be called right at the start of RAM. This sort
    of thing should be accounted for by /memreserve/ in DT or similar
    platform-specific reservations though rather than generically, and
    at least Malta & SEAD-3 DTs already have /memreserve/ entries for
    it. So this part I think is OK. Some other older platforms might
    need updating, but that's fine.

 2) trap_init() only allocates memory for the exception vector if using
    a vectored interrupt mode. In other cases it just uses CAC_BASE
    which currently gets reserved as part of this region between
    PHYS_OFFSET & _text.

    I think this behavior is bogus, and we should instead:

    - Allocate the exception vector memory using memblock_alloc() for
      CPUs implementing MIPSr2 or higher (ie. CPUs with a programmable
      EBase register). If we're not using vectored interrupts then
      allocating one page will do, and we already have the size
      calculation for if we are.

    - Otherwise use CAC_BASE but call memblock_reserve() on the first
      page.

    I think we should make that change before this one goes in. I can
    try to get to it tomorrow, but feel free to beat me to it.
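
A minimal sketch of roughly what that could look like in trap_init()
(purely illustrative, with a hypothetical vec_size standing in for the
existing size calculation; not the actual patch):

	if (cpu_has_mips_r2_r6) {
		/* EBase is programmable: allocate the vector area rather than
		 * assuming the page at CAC_BASE belongs to us. */
		ebase = (unsigned long)memblock_alloc(vec_size, PAGE_SIZE);
	} else {
		/* Fixed exception base: keep using CAC_BASE, but make sure the
		 * first page is actually reserved. */
		ebase = CAC_BASE;
		memblock_reserve(__pa(ebase), PAGE_SIZE);
	}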

Thanks,
    Paul
Serge Semin April 26, 2019, midnight UTC | #2
On Wed, Apr 24, 2019 at 10:43:48PM +0000, Paul Burton wrote:
> Hi Serge,
> 
> On Wed, Apr 24, 2019 at 01:47:40AM +0300, Serge Semin wrote:
> > The reserved_end variable had been used by the bootmem_init() code
> > to find a lowest limit of memory available for memmap blob. The original
> > code just tried to find a free memory space higher than kernel was placed.
> > This limitation seems justified for the memmap ragion search process, but
> > I can't see any obvious reason to reserve the unused space below kernel
> > seeing some platforms place it much higher than standard 1MB.
> 
> There are 2 reasons I'm aware of:
> 
>  1) Older systems generally had something like an ISA bus which used
>     addresses below the kernel, and bootloaders like YAMON left behind
>     functions that could be called right at the start of RAM. This sort
>     of thing should be accounted for by /memreserve/ in DT or similar
>     platform-specific reservations though rather than generically, and
>     at least Malta & SEAD-3 DTs already have /memreserve/ entries for
>     it. So this part I think is OK. Some other older platforms might
>     need updating, but that's fine.
> 

Regarding ISA: as far as I remember, devices on that bus can only DMA to the
lowest 16MB. So if the kernel is too big or placed too high, they may be left
without any reachable memory at all in the current implementation.

>  2) trap_init() only allocates memory for the exception vector if using
>     a vectored interrupt mode. In other cases it just uses CAC_BASE
>     which currently gets reserved as part of this region between
>     PHYS_OFFSET & _text.
> 
>     I think this behavior is bogus, and we should instead:
> 
>     - Allocate the exception vector memory using memblock_alloc() for
>       CPUs implementing MIPSr2 or higher (ie. CPUs with a programmable
>       EBase register). If we're not using vectored interrupts then
>       allocating one page will do, and we already have the size
>       calculation for if we are.
> 
>     - Otherwise use CAC_BASE but call memblock_reserve() on the first
>       page.
> 
>     I think we should make that change before this one goes in. I can
>     try to get to it tomorrow, but feel free to beat me to it.
> 

As far as I understood you and the code, this should be enough to fix
the problem:
diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
index 98ca55d62201..f680253e2617 100644
--- a/arch/mips/kernel/traps.c
+++ b/arch/mips/kernel/traps.c
@@ -2326,6 +2326,8 @@ void __init trap_init(void)
 				ebase += (read_c0_ebase() & 0x3ffff000);
 			}
 		}
+
+		memblock_reserve(ebase, PAGE_SIZE);
 	}
 
 	if (cpu_has_mmips) {
---

The allocation has already been implemented in the if-branch under the
(cpu_has_veic || cpu_has_vint) condition, so we don't need to change
anything there.
If vectored interrupts aren't supported, the else-clause is taken and we
need to reserve whatever is set in the exception base address variable.

I'll add this patch between the 3rd and 4th ones if you are OK with it.

-Sergey

> Thanks,
>     Paul
Paul Burton April 30, 2019, 10:58 p.m. UTC | #3
Hi Serge,

On Fri, Apr 26, 2019 at 03:00:36AM +0300, Serge Semin wrote:
> >  1) Older systems generally had something like an ISA bus which used
> >     addresses below the kernel, and bootloaders like YAMON left behind
> >     functions that could be called right at the start of RAM. This sort
> >     of thing should be accounted for by /memreserve/ in DT or similar
> >     platform-specific reservations though rather than generically, and
> >     at least Malta & SEAD-3 DTs already have /memreserve/ entries for
> >     it. So this part I think is OK. Some other older platforms might
> >     need updating, but that's fine.
> > 
> 
> Regarding ISA. As far as I remember devices on that bus can DMA only to the
> lowest 16MB. So in case if kernel is too big or placed pretty much high,
> they may be left even without reachable memory at all in current
> implementation.

Sure - I'm not too worried about these old buses, platforms can continue
to reserve the memory through DT or otherwise if they need to.

> >  2) trap_init() only allocates memory for the exception vector if using
> >     a vectored interrupt mode. In other cases it just uses CAC_BASE
> >     which currently gets reserved as part of this region between
> >     PHYS_OFFSET & _text.
> > 
> >     I think this behavior is bogus, and we should instead:
> > 
> >     - Allocate the exception vector memory using memblock_alloc() for
> >       CPUs implementing MIPSr2 or higher (ie. CPUs with a programmable
> >       EBase register). If we're not using vectored interrupts then
> >       allocating one page will do, and we already have the size
> >       calculation for if we are.
> > 
> >     - Otherwise use CAC_BASE but call memblock_reserve() on the first
> >       page.
> > 
> >     I think we should make that change before this one goes in. I can
> >     try to get to it tomorrow, but feel free to beat me to it.
> > 
> 
> As far as I understood you and the code this should be enough to fix
> the problem:
> diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
> index 98ca55d62201..f680253e2617 100644
> --- a/arch/mips/kernel/traps.c
> +++ b/arch/mips/kernel/traps.c
> @@ -2326,6 +2326,8 @@ void __init trap_init(void)
>  				ebase += (read_c0_ebase() & 0x3ffff000);
>  			}
>  		}
> +
> +		memblock_reserve(ebase, PAGE_SIZE);
>  	}
>  
>  	if (cpu_has_mmips) {
> ---
> 
> Allocation has already been implemented in the if-branch under the
> (cpu_has_veic || cpu_has_vint) condition. So we don't need to change
> there anything.
> In case if vectored interrupts aren't supported the else-clause is
> taken and we need to reserve whatever is set in the exception base
> address variable.
> 
> I'll add this patch between 3d and 4th ones if you are ok with it.

I think that would work, but I have other motivations to allocate the
memory in non-vectored cases anyway. I just sent a series that does that
& cleans up a little [1]. If you could take a look that would be great.
With that change made I think this patch will be good to apply.

Thanks,
    Paul

[1] https://lore.kernel.org/linux-mips/20190430225216.7164-1-paul.burton@mips.com/T/#t
Serge Semin May 2, 2019, 2:24 p.m. UTC | #4
On Tue, Apr 30, 2019 at 10:58:33PM +0000, Paul Burton wrote:

Hello Paul

> Hi Serge,
> 
> On Fri, Apr 26, 2019 at 03:00:36AM +0300, Serge Semin wrote:
> > >  1) Older systems generally had something like an ISA bus which used
> > >     addresses below the kernel, and bootloaders like YAMON left behind
> > >     functions that could be called right at the start of RAM. This sort
> > >     of thing should be accounted for by /memreserve/ in DT or similar
> > >     platform-specific reservations though rather than generically, and
> > >     at least Malta & SEAD-3 DTs already have /memreserve/ entries for
> > >     it. So this part I think is OK. Some other older platforms might
> > >     need updating, but that's fine.
> > > 
> > 
> > Regarding ISA. As far as I remember devices on that bus can DMA only to the
> > lowest 16MB. So in case if kernel is too big or placed pretty much high,
> > they may be left even without reachable memory at all in current
> > implementation.
> 
> Sure - I'm not too worried about these old buses, platforms can continue
> to reserve the memory through DT or otherwise if they need to.
> 
> > >  2) trap_init() only allocates memory for the exception vector if using
> > >     a vectored interrupt mode. In other cases it just uses CAC_BASE
> > >     which currently gets reserved as part of this region between
> > >     PHYS_OFFSET & _text.
> > > 
> > >     I think this behavior is bogus, and we should instead:
> > > 
> > >     - Allocate the exception vector memory using memblock_alloc() for
> > >       CPUs implementing MIPSr2 or higher (ie. CPUs with a programmable
> > >       EBase register). If we're not using vectored interrupts then
> > >       allocating one page will do, and we already have the size
> > >       calculation for if we are.
> > > 
> > >     - Otherwise use CAC_BASE but call memblock_reserve() on the first
> > >       page.
> > > 
> > >     I think we should make that change before this one goes in. I can
> > >     try to get to it tomorrow, but feel free to beat me to it.
> > > 
> > 
> > As far as I understood you and the code this should be enough to fix
> > the problem:
> > diff --git a/arch/mips/kernel/traps.c b/arch/mips/kernel/traps.c
> > index 98ca55d62201..f680253e2617 100644
> > --- a/arch/mips/kernel/traps.c
> > +++ b/arch/mips/kernel/traps.c
> > @@ -2326,6 +2326,8 @@ void __init trap_init(void)
> >  				ebase += (read_c0_ebase() & 0x3ffff000);
> >  			}
> >  		}
> > +
> > +		memblock_reserve(ebase, PAGE_SIZE);
> >  	}
> >  
> >  	if (cpu_has_mmips) {
> > ---
> > 
> > Allocation has already been implemented in the if-branch under the
> > (cpu_has_veic || cpu_has_vint) condition. So we don't need to change
> > there anything.
> > In case if vectored interrupts aren't supported the else-clause is
> > taken and we need to reserve whatever is set in the exception base
> > address variable.
> > 
> > I'll add this patch between 3d and 4th ones if you are ok with it.
> 
> I think that would work, but I have other motivations to allocate the
> memory in non-vectored cases anyway. I just sent a series that does that
> & cleans up a little [1]. If you could take a look that would be great.
> With that change made I think this patch will be good to apply.
> 

I've just reviewed and tested your series on my machine. I tagged the whole
series in a response to the cover letter of [1].

Could you please proceed with reviewing this patchset? There are eight
more patches left without your tag or comment, and this patch also has
no explicit tag yet.

BTW I see you already applied patches 1-3 to mips-next, so what shall I
do when sending a v2 patchset with the fixes requested for patch 12 (and
possibly for others in the future)? Shall I just resend the series without
the already-applied patches, or send them again with your Acked-by tags?

-Sergey

> Thanks,
>     Paul
> 
> [1] https://lore.kernel.org/linux-mips/20190430225216.7164-1-paul.burton@mips.com/T/#t
Paul Burton May 2, 2019, 6:35 p.m. UTC | #5
Hello,

Serge Semin wrote:
> The reserved_end variable had been used by the bootmem_init() code
> to find a lowest limit of memory available for memmap blob. The original
> code just tried to find a free memory space higher than kernel was placed.
> This limitation seems justified for the memmap ragion search process, but
> I can't see any obvious reason to reserve the unused space below kernel
> seeing some platforms place it much higher than standard 1MB. Moreover
> the RELOCATION config enables it to be loaded at any memory address.
> So lets reserve the memory occupied by the kernel only, leaving the region
> below being free for allocations. After doing this we can now discard the
> code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> since it's going to be free anyway (unless marked as reserved by
> platforms).
> 
> Signed-off-by: Serge Semin <fancer.lancer@gmail.com>

Applied to mips-next.

Thanks,
    Paul

[ This message was auto-generated; if you believe anything is incorrect
  then please email paul.burton@mips.com to report it. ]
Paul Burton May 2, 2019, 6:45 p.m. UTC | #6
Hi Serge,

On Thu, May 02, 2019 at 05:24:37PM +0300, Serge Semin wrote:
> Just reviewed and tested your series on my machine. I tagged the whole series
> in a response to the cover-letter of [1].

Thanks for the review & testing; that series is now in mips-next.

> Could you please proceed with this patchset review procedure? There are
> also eight more patches left without your tag or comment.  This patch
> is also left with no explicit tag.
> 
> BTW I see you already applied patches 1-3 to the mips-next, so what shall I
> do when sending a v2 patchset with fixes asked to be provided for patch 12
> and possibly for others in future? Shall I just resend the series without that
> applied patches or send them over with your acked-by tagges?

I've so far applied patches 1-7 of your series to mips-next, and stopped
at patch 8 which has a comment to address.

My preference would be if you could send a v2 which just contains the
remaining patches (ie. patches 8-12 become patches 1-5), ideally atop
the mips-next branch.

The series looks good to me once the review comments are addressed, but
no need to add an Acked-by - it'll be implicit when I apply them to
mips-next.

Thanks,
    Paul
Serge Semin May 3, 2019, 5:21 p.m. UTC | #7
Hello Paul

On Thu, May 02, 2019 at 06:45:39PM +0000, Paul Burton wrote:
> Hi Serge,
> 
> On Thu, May 02, 2019 at 05:24:37PM +0300, Serge Semin wrote:
> > Just reviewed and tested your series on my machine. I tagged the whole series
> > in a response to the cover-letter of [1].
> 
> Thanks for the review & testing; that series is now in mips-next.
> 
> > Could you please proceed with this patchset review procedure? There are
> > also eight more patches left without your tag or comment.  This patch
> > is also left with no explicit tag.
> > 
> > BTW I see you already applied patches 1-3 to the mips-next, so what shall I
> > do when sending a v2 patchset with fixes asked to be provided for patch 12
> > and possibly for others in future? Shall I just resend the series without that
> > applied patches or send them over with your acked-by tagges?
> 
> I've so far applied patches 1-7 of your series to mips-next, and stopped
> at patch 8 which has a comment to address.
> 
> My preference would be if you could send a v2 which just contains the
> remaining patches (ie. patches 8-12 become patches 1-5), ideally atop
> the mips-next branch.
> 
> The series looks good to me once the review comments are addressed, but
> no need to add an Acked-by - it'll be implicit when I apply them to
> mips-next.
> 

Agreed. I'll do it shortly.

-Sergey

> Thanks,
>     Paul
Geert Uytterhoeven May 21, 2019, 2:56 p.m. UTC | #8
Hi Serge,

On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> The reserved_end variable had been used by the bootmem_init() code
> to find a lowest limit of memory available for memmap blob. The original
> code just tried to find a free memory space higher than kernel was placed.
> This limitation seems justified for the memmap ragion search process, but
> I can't see any obvious reason to reserve the unused space below kernel
> seeing some platforms place it much higher than standard 1MB. Moreover
> the RELOCATION config enables it to be loaded at any memory address.
> So lets reserve the memory occupied by the kernel only, leaving the region
> below being free for allocations. After doing this we can now discard the
> code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> since it's going to be free anyway (unless marked as reserved by
> platforms).
>
> Signed-off-by: Serge Semin <fancer.lancer@gmail.com>

This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:

    VFS: Mounted root (nfs filesystem) on device 0:13.
    devtmpfs: mounted
    BUG: Bad page state in process swapper  pfn:00001
    page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
    flags: 0x0()
    raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
    page dumped because: nonzero mapcount
    Modules linked in:
    CPU: 0 PID: 1 Comm: swapper Not tainted
5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
    Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
00000001 804a3490
            00000001 81000000 0030f231 80148558 00000003 10008400
87c1dd80 7599ee13
            00000000 00000000 804b0000 00000000 00000007 00000000
00000085 00000000
            62722d31 00000084 804b0000 39347874 00000000 804b7820
8040cef8 81000010
            00000001 00000007 00000001 81000000 00000008 8021de24
00000000 804a0000
            ...
    Call Trace:
    [<8010adec>] show_stack+0x74/0x104
    [<801a5e44>] bad_page+0x130/0x138
    [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
    [<801a789c>] free_unref_page+0x40/0x68
    [<801120f4>] free_init_pages+0xec/0x104
    [<803bdde8>] free_initmem+0x10/0x58
    [<803bdb8c>] kernel_init+0x20/0x100
    [<801057c8>] ret_from_kernel_thread+0x14/0x1c
    Disabling lock debugging due to kernel taint
    BUG: Bad page state in process swapper  pfn:00002
    [...]

CONFIG_RELOCATABLE is not set, so the only relevant part is the
change quoted below.

> --- a/arch/mips/kernel/setup.c
> +++ b/arch/mips/kernel/setup.c
> @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
>
>  static void __init bootmem_init(void)
>  {
> -       unsigned long reserved_end;
>         phys_addr_t ramstart = PHYS_ADDR_MAX;
>         int i;
>
> @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
>          * will reserve the area used for the initrd.
>          */
>         init_initrd();
> -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
>
> -       memblock_reserve(PHYS_OFFSET,
> -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> +       /* Reserve memory occupied by kernel. */
> +       memblock_reserve(__pa_symbol(&_text),
> +                       __pa_symbol(&_end) - __pa_symbol(&_text));
>
>         /*
>          * max_low_pfn is not a number of pages. The number of pages

With some debug code added:

    Determined physical RAM map:
     memory: 08000000 @ 00000000 (usable)
    bootmem_init:390: PHYS_OFFSET = 0x0
    bootmem_init:391: __pa_symbol(&_text) = 0x100000
    bootmem_init:392: __pa_symbol(&_end) = 0x4b77c8
    bootmem_init:393: PFN_UP(__pa_symbol(&_end)) = 0x4b8

Hence the old code reserved 1 MiB extra at the beginning.

Note that the new code also dropped the rounding up of the reserved block
size to a multiple of PAGE_SIZE. I'm not sure whether the latter actually
matters or not.
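
For comparison, keeping that rounding while still starting the reservation
at _text would be a one-liner along these lines (hypothetical, not a
proposed fix):

    memblock_reserve(__pa_symbol(&_text),
                     PAGE_ALIGN(__pa_symbol(&_end)) - __pa_symbol(&_text));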

Do you have a clue? Thanks!

Gr{oetje,eeting}s,

                        Geert
Mike Rapoport May 21, 2019, 3:53 p.m. UTC | #9
Hi Geert,

On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> Hi Serge,
> 
> On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > The reserved_end variable had been used by the bootmem_init() code
> > to find a lowest limit of memory available for memmap blob. The original
> > code just tried to find a free memory space higher than kernel was placed.
> > This limitation seems justified for the memmap ragion search process, but
> > I can't see any obvious reason to reserve the unused space below kernel
> > seeing some platforms place it much higher than standard 1MB. Moreover
> > the RELOCATION config enables it to be loaded at any memory address.
> > So lets reserve the memory occupied by the kernel only, leaving the region
> > below being free for allocations. After doing this we can now discard the
> > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > since it's going to be free anyway (unless marked as reserved by
> > platforms).
> >
> > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> 
> This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> 
>     VFS: Mounted root (nfs filesystem) on device 0:13.
>     devtmpfs: mounted
>     BUG: Bad page state in process swapper  pfn:00001
>     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
>     flags: 0x0()
>     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
>     page dumped because: nonzero mapcount
>     Modules linked in:
>     CPU: 0 PID: 1 Comm: swapper Not tainted
> 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
>     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> 00000001 804a3490
>             00000001 81000000 0030f231 80148558 00000003 10008400
> 87c1dd80 7599ee13
>             00000000 00000000 804b0000 00000000 00000007 00000000
> 00000085 00000000
>             62722d31 00000084 804b0000 39347874 00000000 804b7820
> 8040cef8 81000010
>             00000001 00000007 00000001 81000000 00000008 8021de24
> 00000000 804a0000
>             ...
>     Call Trace:
>     [<8010adec>] show_stack+0x74/0x104
>     [<801a5e44>] bad_page+0x130/0x138
>     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
>     [<801a789c>] free_unref_page+0x40/0x68
>     [<801120f4>] free_init_pages+0xec/0x104
>     [<803bdde8>] free_initmem+0x10/0x58
>     [<803bdb8c>] kernel_init+0x20/0x100
>     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
>     Disabling lock debugging due to kernel taint
>     BUG: Bad page state in process swapper  pfn:00002
>     [...]
> 
> CONFIG_RELOCATABLE is not set, so the only relevant part is the
> change quoted below.
> 
> > --- a/arch/mips/kernel/setup.c
> > +++ b/arch/mips/kernel/setup.c
> > @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
> >
> >  static void __init bootmem_init(void)
> >  {
> > -       unsigned long reserved_end;
> >         phys_addr_t ramstart = PHYS_ADDR_MAX;
> >         int i;
> >
> > @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
> >          * will reserve the area used for the initrd.
> >          */
> >         init_initrd();
> > -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
> >
> > -       memblock_reserve(PHYS_OFFSET,
> > -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> > +       /* Reserve memory occupied by kernel. */
> > +       memblock_reserve(__pa_symbol(&_text),
> > +                       __pa_symbol(&_end) - __pa_symbol(&_text));
> >
> >         /*
> >          * max_low_pfn is not a number of pages. The number of pages
> 
> With some debug code added:
> 
>     Determined physical RAM map:
>      memory: 08000000 @ 00000000 (usable)
>     bootmem_init:390: PHYS_OFFSET = 0x0
>     bootmem_init:391: __pa_symbol(&_text) = 0x100000
>     bootmem_init:392: __pa_symbol(&_end) = 0x4b77c8
>     bootmem_init:393: PFN_UP(__pa_symbol(&_end)) = 0x4b8

Have you tried adding memblock=debug to the command line?
Not sure it'll help, but still :)
 
> Hence the old code reserved 1 MiB extra at the beginning.
> 
> Note that the new code also dropped the rounding up of the memory block
> size to a multiple of PAGE_SIZE. I'm not sure the latter actually
> matters or not.

I'd say that bad page state for pfn 1 is caused by "freeing" the first 1M.

> Do you have a clue? Thanks!
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> -- 
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
>
Serge Semin May 21, 2019, 4:39 p.m. UTC | #10
Hello Geert, Mike

On Tue, May 21, 2019 at 06:53:10PM +0300, Mike Rapoport wrote:
> Hi Geert,
> 
> On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> > Hi Serge,
> > 
> > On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > > The reserved_end variable had been used by the bootmem_init() code
> > > to find a lowest limit of memory available for memmap blob. The original
> > > code just tried to find a free memory space higher than kernel was placed.
> > > This limitation seems justified for the memmap ragion search process, but
> > > I can't see any obvious reason to reserve the unused space below kernel
> > > seeing some platforms place it much higher than standard 1MB. Moreover
> > > the RELOCATION config enables it to be loaded at any memory address.
> > > So lets reserve the memory occupied by the kernel only, leaving the region
> > > below being free for allocations. After doing this we can now discard the
> > > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > > since it's going to be free anyway (unless marked as reserved by
> > > platforms).
> > >
> > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > 
> > This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> > image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> > 
> >     VFS: Mounted root (nfs filesystem) on device 0:13.
> >     devtmpfs: mounted
> >     BUG: Bad page state in process swapper  pfn:00001
> >     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
> >     flags: 0x0()
> >     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
> >     page dumped because: nonzero mapcount
> >     Modules linked in:
> >     CPU: 0 PID: 1 Comm: swapper Not tainted
> > 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
> >     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> > 00000001 804a3490
> >             00000001 81000000 0030f231 80148558 00000003 10008400
> > 87c1dd80 7599ee13
> >             00000000 00000000 804b0000 00000000 00000007 00000000
> > 00000085 00000000
> >             62722d31 00000084 804b0000 39347874 00000000 804b7820
> > 8040cef8 81000010
> >             00000001 00000007 00000001 81000000 00000008 8021de24
> > 00000000 804a0000
> >             ...
> >     Call Trace:
> >     [<8010adec>] show_stack+0x74/0x104
> >     [<801a5e44>] bad_page+0x130/0x138
> >     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
> >     [<801a789c>] free_unref_page+0x40/0x68
> >     [<801120f4>] free_init_pages+0xec/0x104
> >     [<803bdde8>] free_initmem+0x10/0x58
> >     [<803bdb8c>] kernel_init+0x20/0x100
> >     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
> >     Disabling lock debugging due to kernel taint
> >     BUG: Bad page state in process swapper  pfn:00002
> >     [...]
> > 

The root cause of the problem is most likely the prom_free_prom_memory() method
in arch/mips/txx9/generic/setup.c:
void __init prom_free_prom_memory(void)
{
        unsigned long saddr = PAGE_SIZE;
        unsigned long eaddr = __pa_symbol(&_text);
        
        if (saddr < eaddr)
                free_init_pages("prom memory", saddr, eaddr);
}

As you can see, the txx9 platform tries to free memory which isn't reserved
and is already free from the very beginning due to the patch above. So as soon
as you remove the free_init_pages("prom memory", ...) call, the problem should
be fixed. Could you try it and let us know whether it helped?
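
A minimal sketch of what that suggested change amounts to (illustrative
only, not an actual patch):

	void __init prom_free_prom_memory(void)
	{
		/* The region below the kernel is no longer reserved by
		 * bootmem_init(), so there is nothing left to free here. */
	}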

Regards,
-Sergey

> > CONFIG_RELOCATABLE is not set, so the only relevant part is the
> > change quoted below.
> > 
> > > --- a/arch/mips/kernel/setup.c
> > > +++ b/arch/mips/kernel/setup.c
> > > @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
> > >
> > >  static void __init bootmem_init(void)
> > >  {
> > > -       unsigned long reserved_end;
> > >         phys_addr_t ramstart = PHYS_ADDR_MAX;
> > >         int i;
> > >
> > > @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
> > >          * will reserve the area used for the initrd.
> > >          */
> > >         init_initrd();
> > > -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
> > >
> > > -       memblock_reserve(PHYS_OFFSET,
> > > -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> > > +       /* Reserve memory occupied by kernel. */
> > > +       memblock_reserve(__pa_symbol(&_text),
> > > +                       __pa_symbol(&_end) - __pa_symbol(&_text));
> > >
> > >         /*
> > >          * max_low_pfn is not a number of pages. The number of pages
> > 
> > With some debug code added:
> > 
> >     Determined physical RAM map:
> >      memory: 08000000 @ 00000000 (usable)
> >     bootmem_init:390: PHYS_OFFSET = 0x0
> >     bootmem_init:391: __pa_symbol(&_text) = 0x100000
> >     bootmem_init:392: __pa_symbol(&_end) = 0x4b77c8
> >     bootmem_init:393: PFN_UP(__pa_symbol(&_end)) = 0x4b8
> 
> Have you tried adding memblock=debug to the command line?
> Not sure it'll help, but still :)
>  
> > Hence the old code reserved 1 MiB extra at the beginning.
> > 
> > Note that the new code also dropped the rounding up of the memory block
> > size to a multiple of PAGE_SIZE. I'm not sure the latter actually
> > matters or not.
> 
> I'd say that bad page state for pfn 1 is caused by "freeing" the first 1M.
> 
> > Do you have a clue? Thanks!
> > 
> > Gr{oetje,eeting}s,
> > 
> >                         Geert
> > 
> > -- 
> > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> > 
> > In personal conversations with technical people, I call myself a hacker. But
> > when I'm talking to journalists I just say "programmer" or something like that.
> >                                 -- Linus Torvalds
> > 
> 
> -- 
> Sincerely yours,
> Mike.
>
Geert Uytterhoeven May 22, 2019, 7:47 a.m. UTC | #11
Hi Mike,

On Tue, May 21, 2019 at 5:53 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> > On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > > The reserved_end variable had been used by the bootmem_init() code
> > > to find a lowest limit of memory available for memmap blob. The original
> > > code just tried to find a free memory space higher than kernel was placed.
> > > This limitation seems justified for the memmap ragion search process, but
> > > I can't see any obvious reason to reserve the unused space below kernel
> > > seeing some platforms place it much higher than standard 1MB. Moreover
> > > the RELOCATION config enables it to be loaded at any memory address.
> > > So lets reserve the memory occupied by the kernel only, leaving the region
> > > below being free for allocations. After doing this we can now discard the
> > > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > > since it's going to be free anyway (unless marked as reserved by
> > > platforms).
> > >
> > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> >
> > This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> > image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> >
> >     VFS: Mounted root (nfs filesystem) on device 0:13.
> >     devtmpfs: mounted
> >     BUG: Bad page state in process swapper  pfn:00001
> >     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
> >     flags: 0x0()
> >     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
> >     page dumped because: nonzero mapcount
> >     Modules linked in:
> >     CPU: 0 PID: 1 Comm: swapper Not tainted
> > 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
> >     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> > 00000001 804a3490
> >             00000001 81000000 0030f231 80148558 00000003 10008400
> > 87c1dd80 7599ee13
> >             00000000 00000000 804b0000 00000000 00000007 00000000
> > 00000085 00000000
> >             62722d31 00000084 804b0000 39347874 00000000 804b7820
> > 8040cef8 81000010
> >             00000001 00000007 00000001 81000000 00000008 8021de24
> > 00000000 804a0000
> >             ...
> >     Call Trace:
> >     [<8010adec>] show_stack+0x74/0x104
> >     [<801a5e44>] bad_page+0x130/0x138
> >     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
> >     [<801a789c>] free_unref_page+0x40/0x68
> >     [<801120f4>] free_init_pages+0xec/0x104
> >     [<803bdde8>] free_initmem+0x10/0x58
> >     [<803bdb8c>] kernel_init+0x20/0x100
> >     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
> >     Disabling lock debugging due to kernel taint
> >     BUG: Bad page state in process swapper  pfn:00002
> >     [...]
> >
> > CONFIG_RELOCATABLE is not set, so the only relevant part is the
> > change quoted below.
> >
> > > --- a/arch/mips/kernel/setup.c
> > > +++ b/arch/mips/kernel/setup.c
> > > @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
> > >
> > >  static void __init bootmem_init(void)
> > >  {
> > > -       unsigned long reserved_end;
> > >         phys_addr_t ramstart = PHYS_ADDR_MAX;
> > >         int i;
> > >
> > > @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
> > >          * will reserve the area used for the initrd.
> > >          */
> > >         init_initrd();
> > > -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
> > >
> > > -       memblock_reserve(PHYS_OFFSET,
> > > -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> > > +       /* Reserve memory occupied by kernel. */
> > > +       memblock_reserve(__pa_symbol(&_text),
> > > +                       __pa_symbol(&_end) - __pa_symbol(&_text));
> > >
> > >         /*
> > >          * max_low_pfn is not a number of pages. The number of pages
> >
> > With some debug code added:
> >
> >     Determined physical RAM map:
> >      memory: 08000000 @ 00000000 (usable)
> >     bootmem_init:390: PHYS_OFFSET = 0x0
> >     bootmem_init:391: __pa_symbol(&_text) = 0x100000
> >     bootmem_init:392: __pa_symbol(&_end) = 0x4b77c8
> >     bootmem_init:393: PFN_UP(__pa_symbol(&_end)) = 0x4b8
>
> Have you tried adding memblock=debug to the command line?
> Not sure it'll help, but still :)

Thanks! Output below...

 Determined physical RAM map:
  memory: 08000000 @ 00000000 (usable)
+memblock_reserve: [0x00100000-0x004b77c7] setup_arch+0x258/0x8e4
 Initrd not found or empty - disabling initrd
+memblock_reserve: [0x00448000-0x00447fff] setup_arch+0x5ac/0x8e4
+MEMBLOCK configuration:
+ memory size = 0x08000000 reserved size = 0x003b77c8
+ memory.cnt  = 0x1
+ memory[0x0]    [0x00000000-0x07ffffff], 0x08000000 bytes on node 0 flags: 0x0
+ reserved.cnt  = 0x1
+ reserved[0x0]  [0x00100000-0x004b77c7], 0x003b77c8 bytes flags: 0x0
+memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 setup_arch+0x7ec/0x8e4
+memblock_reserve: [0x004b77e0-0x004b77ff] memblock_alloc_range_nid+0x130/0x178
 Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
 Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
 Zone ranges:
@@ -16,10 +26,48 @@ Movable zone start for each node
 Early memory node ranges
   node   0: [mem 0x0000000000000000-0x0000000007ffffff]
 Initmem setup node 0 [mem 0x0000000000000000-0x0000000007ffffff]
+memblock_alloc_try_nid: 1048576 bytes align=0x20 nid=0
from=0x00000000 max_addr=0x00000000
alloc_node_mem_map.constprop.31+0x6c/0xc8
+memblock_reserve: [0x004b7800-0x005b77ff] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 16 bytes align=0x20 nid=0 from=0x00000000
max_addr=0x00000000 setup_usemap.isra.13+0x68/0xa0
+memblock_reserve: [0x005b7800-0x005b780f] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 116 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 start_kernel+0xb0/0x508
+memblock_reserve: [0x005b7820-0x005b7893] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 116 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 start_kernel+0xf0/0x508
+memblock_reserve: [0x005b78a0-0x005b7913] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 116 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 start_kernel+0x114/0x508
+memblock_reserve: [0x005b7920-0x005b7993] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1
from=0x00000000 max_addr=0x00000000 pcpu_alloc_alloc_info+0x60/0xb8
+memblock_reserve: [0x005b8000-0x005b8fff] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 32768 bytes align=0x1000 nid=-1
from=0x01000000 max_addr=0x00000000 setup_per_cpu_areas+0x38/0xa8
+memblock_reserve: [0x01000000-0x01007fff] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_setup_first_chunk+0x200/0x588
+memblock_reserve: [0x005b79a0-0x005b79a3] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_setup_first_chunk+0x22c/0x588
+memblock_reserve: [0x005b79c0-0x005b79c3] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_setup_first_chunk+0x250/0x588
+memblock_reserve: [0x005b79e0-0x005b79e3] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_setup_first_chunk+0x288/0x588
+memblock_reserve: [0x005b7a00-0x005b7a03] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 120 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_setup_first_chunk+0x47c/0x588
+memblock_reserve: [0x005b7a20-0x005b7a97] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 89 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_alloc_first_chunk+0x88/0x2e0
+memblock_reserve: [0x005b7aa0-0x005b7af8] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 1024 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_alloc_first_chunk+0xd8/0x2e0
+memblock_reserve: [0x005b7b00-0x005b7eff] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 1028 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_alloc_first_chunk+0x118/0x2e0
+memblock_reserve: [0x005b9000-0x005b9403] memblock_alloc_range_nid+0x130/0x178
+memblock_alloc_try_nid: 256 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 pcpu_alloc_first_chunk+0x154/0x2e0
+memblock_reserve: [0x005b7f00-0x005b7fff] memblock_alloc_range_nid+0x130/0x178
+   memblock_free: [0x005b8000-0x005b8fff] start_kernel+0x164/0x508
 Built 1 zonelists, mobility grouping on.  Total pages: 32512
-Kernel command line:   console=ttyS0,9600 ip=on root=/dev/nfs rw
nfsroot=192.168.97.29:/nas/rbtx4927/debian-mipsel,tcp,v3
+Kernel command line:   console=ttyS0,9600 ip=on root=/dev/nfs rw
nfsroot=192.168.97.29:/nas/rbtx4927/debian-mipsel,tcp,v3
memblock=debug
+memblock_alloc_try_nid: 65536 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 alloc_large_system_hash+0x270/0x478
+memblock_reserve: [0x005b9420-0x005c941f] memblock_alloc_range_nid+0x130/0x178
 Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
+memblock_alloc_try_nid: 32768 bytes align=0x20 nid=-1 from=0x00000000
max_addr=0x00000000 alloc_large_system_hash+0x270/0x478
+memblock_reserve: [0x005c9420-0x005d141f] memblock_alloc_range_nid+0x130/0x178
 Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
+memblock_reserve: [0x00000000-0x000003ff] trap_init+0x58/0x474
 Memory: 126100K/131072K available (2830K kernel code, 147K rwdata,
508K rodata, 220K init, 93K bss, 4972K reserved, 0K cma-reserved)

> > Hence the old code reserved 1 MiB extra at the beginning.
> >
> > Note that the new code also dropped the rounding up of the memory block
> > size to a multiple of PAGE_SIZE. I'm not sure the latter actually
> > matters or not.
>
> I'd say that bad page state for pfn 1 is caused by "freeing" the first 1M.

Yes, by prom_free_prom_memory(), as pointed out by Serge.

Gr{oetje,eeting}s,

                        Geert
Geert Uytterhoeven May 22, 2019, 7:50 a.m. UTC | #12
Hi Serge,

On Tue, May 21, 2019 at 6:39 PM Serge Semin <fancer.lancer@gmail.com> wrote:
> On Tue, May 21, 2019 at 06:53:10PM +0300, Mike Rapoport wrote:
> > On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> > > On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > > > The reserved_end variable had been used by the bootmem_init() code
> > > > to find a lowest limit of memory available for memmap blob. The original
> > > > code just tried to find a free memory space higher than kernel was placed.
> > > > This limitation seems justified for the memmap ragion search process, but
> > > > I can't see any obvious reason to reserve the unused space below kernel
> > > > seeing some platforms place it much higher than standard 1MB. Moreover
> > > > the RELOCATION config enables it to be loaded at any memory address.
> > > > So lets reserve the memory occupied by the kernel only, leaving the region
> > > > below being free for allocations. After doing this we can now discard the
> > > > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > > > since it's going to be free anyway (unless marked as reserved by
> > > > platforms).
> > > >
> > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > >
> > > This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> > > image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> > >
> > >     VFS: Mounted root (nfs filesystem) on device 0:13.
> > >     devtmpfs: mounted
> > >     BUG: Bad page state in process swapper  pfn:00001
> > >     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
> > >     flags: 0x0()
> > >     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
> > >     page dumped because: nonzero mapcount
> > >     Modules linked in:
> > >     CPU: 0 PID: 1 Comm: swapper Not tainted
> > > 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
> > >     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> > > 00000001 804a3490
> > >             00000001 81000000 0030f231 80148558 00000003 10008400
> > > 87c1dd80 7599ee13
> > >             00000000 00000000 804b0000 00000000 00000007 00000000
> > > 00000085 00000000
> > >             62722d31 00000084 804b0000 39347874 00000000 804b7820
> > > 8040cef8 81000010
> > >             00000001 00000007 00000001 81000000 00000008 8021de24
> > > 00000000 804a0000
> > >             ...
> > >     Call Trace:
> > >     [<8010adec>] show_stack+0x74/0x104
> > >     [<801a5e44>] bad_page+0x130/0x138
> > >     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
> > >     [<801a789c>] free_unref_page+0x40/0x68
> > >     [<801120f4>] free_init_pages+0xec/0x104
> > >     [<803bdde8>] free_initmem+0x10/0x58
> > >     [<803bdb8c>] kernel_init+0x20/0x100
> > >     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
> > >     Disabling lock debugging due to kernel taint
> > >     BUG: Bad page state in process swapper  pfn:00002
> > >     [...]
> > >
>
> The root cause of the problem most likely is in prom_free_prom_memory() method of
> arch/mips/txx9/generic/setup.c:
> void __init prom_free_prom_memory(void)
> {
>         unsigned long saddr = PAGE_SIZE;
>         unsigned long eaddr = __pa_symbol(&_text);
>
>         if (saddr < eaddr)
>                 free_init_pages("prom memory", saddr, eaddr);
> }
>
> As you can see the txx9 platform tries to free a memory which isn't reserved
> and set free from the very beginning due to the patch above. So as soon as you
> remove the free_init_pages("prom memory", ...) the problem shall be fixed.
> Could you try it and send a result to us whether it helped?

Thanks, that does the trick!
Will send a patch shortly...

Gr{oetje,eeting}s,

                        Geert
Mike Rapoport May 22, 2019, 8:08 a.m. UTC | #13
On Wed, May 22, 2019 at 09:47:04AM +0200, Geert Uytterhoeven wrote:
> Hi Mike,
> 
> On Tue, May 21, 2019 at 5:53 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> > > On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > > > The reserved_end variable had been used by the bootmem_init() code
> > > > to find a lowest limit of memory available for memmap blob. The original
> > > > code just tried to find a free memory space higher than kernel was placed.
> > > > This limitation seems justified for the memmap ragion search process, but
> > > > I can't see any obvious reason to reserve the unused space below kernel
> > > > seeing some platforms place it much higher than standard 1MB. Moreover
> > > > the RELOCATION config enables it to be loaded at any memory address.
> > > > So lets reserve the memory occupied by the kernel only, leaving the region
> > > > below being free for allocations. After doing this we can now discard the
> > > > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > > > since it's going to be free anyway (unless marked as reserved by
> > > > platforms).
> > > >
> > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > >
> > > This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> > > image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> > >
> > >     VFS: Mounted root (nfs filesystem) on device 0:13.
> > >     devtmpfs: mounted
> > >     BUG: Bad page state in process swapper  pfn:00001
> > >     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
> > >     flags: 0x0()
> > >     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
> > >     page dumped because: nonzero mapcount
> > >     Modules linked in:
> > >     CPU: 0 PID: 1 Comm: swapper Not tainted
> > > 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
> > >     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> > > 00000001 804a3490
> > >             00000001 81000000 0030f231 80148558 00000003 10008400
> > > 87c1dd80 7599ee13
> > >             00000000 00000000 804b0000 00000000 00000007 00000000
> > > 00000085 00000000
> > >             62722d31 00000084 804b0000 39347874 00000000 804b7820
> > > 8040cef8 81000010
> > >             00000001 00000007 00000001 81000000 00000008 8021de24
> > > 00000000 804a0000
> > >             ...
> > >     Call Trace:
> > >     [<8010adec>] show_stack+0x74/0x104
> > >     [<801a5e44>] bad_page+0x130/0x138
> > >     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
> > >     [<801a789c>] free_unref_page+0x40/0x68
> > >     [<801120f4>] free_init_pages+0xec/0x104
> > >     [<803bdde8>] free_initmem+0x10/0x58
> > >     [<803bdb8c>] kernel_init+0x20/0x100
> > >     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
> > >     Disabling lock debugging due to kernel taint
> > >     BUG: Bad page state in process swapper  pfn:00002
> > >     [...]
> > >
> > > CONFIG_RELOCATABLE is not set, so the only relevant part is the
> > > change quoted below.
> > >
> > > > --- a/arch/mips/kernel/setup.c
> > > > +++ b/arch/mips/kernel/setup.c
> > > > @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
> > > >
> > > >  static void __init bootmem_init(void)
> > > >  {
> > > > -       unsigned long reserved_end;
> > > >         phys_addr_t ramstart = PHYS_ADDR_MAX;
> > > >         int i;
> > > >
> > > > @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
> > > >          * will reserve the area used for the initrd.
> > > >          */
> > > >         init_initrd();
> > > > -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
> > > >
> > > > -       memblock_reserve(PHYS_OFFSET,
> > > > -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> > > > +       /* Reserve memory occupied by kernel. */
> > > > +       memblock_reserve(__pa_symbol(&_text),
> > > > +                       __pa_symbol(&_end) - __pa_symbol(&_text));
> > > >
> > > >         /*
> > > >          * max_low_pfn is not a number of pages. The number of pages
> > >
> > > With some debug code added:
> > >
> > >     Determined physical RAM map:
> > >      memory: 08000000 @ 00000000 (usable)
> > >     bootmem_init:390: PHYS_OFFSET = 0x0
> > >     bootmem_init:391: __pa_symbol(&_text) = 0x100000
> > >     bootmem_init:392: __pa_symbol(&_end) = 0x4b77c8
> > >     bootmem_init:393: PFN_UP(__pa_symbol(&_end)) = 0x4b8
> >
> > Have you tried adding memblock=debug to the command line?
> > Not sure it'll help, but still :)
> 
> Thanks! Output below...
> 
>  Determined physical RAM map:
>   memory: 08000000 @ 00000000 (usable)
> +memblock_reserve: [0x00100000-0x004b77c7] setup_arch+0x258/0x8e4
>  Initrd not found or empty - disabling initrd
> +memblock_reserve: [0x00448000-0x00447fff] setup_arch+0x5ac/0x8e4
> +MEMBLOCK configuration:
> + memory size = 0x08000000 reserved size = 0x003b77c8
> + memory.cnt  = 0x1
> + memory[0x0]    [0x00000000-0x07ffffff], 0x08000000 bytes on node 0 flags: 0x0
> + reserved.cnt  = 0x1
> + reserved[0x0]  [0x00100000-0x004b77c7], 0x003b77c8 bytes flags: 0x0
> +memblock_alloc_try_nid: 32 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 setup_arch+0x7ec/0x8e4
> +memblock_reserve: [0x004b77e0-0x004b77ff] memblock_alloc_range_nid+0x130/0x178
>  Primary instruction cache 32kB, VIPT, 4-way, linesize 32 bytes.
>  Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
>  Zone ranges:
> @@ -16,10 +26,48 @@ Movable zone start for each node
>  Early memory node ranges
>    node   0: [mem 0x0000000000000000-0x0000000007ffffff]
>  Initmem setup node 0 [mem 0x0000000000000000-0x0000000007ffffff]
> +memblock_alloc_try_nid: 1048576 bytes align=0x20 nid=0
> from=0x00000000 max_addr=0x00000000
> alloc_node_mem_map.constprop.31+0x6c/0xc8
> +memblock_reserve: [0x004b7800-0x005b77ff] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 16 bytes align=0x20 nid=0 from=0x00000000
> max_addr=0x00000000 setup_usemap.isra.13+0x68/0xa0
> +memblock_reserve: [0x005b7800-0x005b780f] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 116 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 start_kernel+0xb0/0x508
> +memblock_reserve: [0x005b7820-0x005b7893] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 116 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 start_kernel+0xf0/0x508
> +memblock_reserve: [0x005b78a0-0x005b7913] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 116 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 start_kernel+0x114/0x508
> +memblock_reserve: [0x005b7920-0x005b7993] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 4096 bytes align=0x1000 nid=-1
> from=0x00000000 max_addr=0x00000000 pcpu_alloc_alloc_info+0x60/0xb8
> +memblock_reserve: [0x005b8000-0x005b8fff] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 32768 bytes align=0x1000 nid=-1
> from=0x01000000 max_addr=0x00000000 setup_per_cpu_areas+0x38/0xa8
> +memblock_reserve: [0x01000000-0x01007fff] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_setup_first_chunk+0x200/0x588
> +memblock_reserve: [0x005b79a0-0x005b79a3] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_setup_first_chunk+0x22c/0x588
> +memblock_reserve: [0x005b79c0-0x005b79c3] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_setup_first_chunk+0x250/0x588
> +memblock_reserve: [0x005b79e0-0x005b79e3] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 4 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_setup_first_chunk+0x288/0x588
> +memblock_reserve: [0x005b7a00-0x005b7a03] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 120 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_setup_first_chunk+0x47c/0x588
> +memblock_reserve: [0x005b7a20-0x005b7a97] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 89 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_alloc_first_chunk+0x88/0x2e0
> +memblock_reserve: [0x005b7aa0-0x005b7af8] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 1024 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_alloc_first_chunk+0xd8/0x2e0
> +memblock_reserve: [0x005b7b00-0x005b7eff] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 1028 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_alloc_first_chunk+0x118/0x2e0
> +memblock_reserve: [0x005b9000-0x005b9403] memblock_alloc_range_nid+0x130/0x178
> +memblock_alloc_try_nid: 256 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 pcpu_alloc_first_chunk+0x154/0x2e0
> +memblock_reserve: [0x005b7f00-0x005b7fff] memblock_alloc_range_nid+0x130/0x178
> +   memblock_free: [0x005b8000-0x005b8fff] start_kernel+0x164/0x508
>  Built 1 zonelists, mobility grouping on.  Total pages: 32512
> -Kernel command line:   console=ttyS0,9600 ip=on root=/dev/nfs rw
> nfsroot=192.168.97.29:/nas/rbtx4927/debian-mipsel,tcp,v3
> +Kernel command line:   console=ttyS0,9600 ip=on root=/dev/nfs rw
> nfsroot=192.168.97.29:/nas/rbtx4927/debian-mipsel,tcp,v3
> memblock=debug
> +memblock_alloc_try_nid: 65536 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 alloc_large_system_hash+0x270/0x478
> +memblock_reserve: [0x005b9420-0x005c941f] memblock_alloc_range_nid+0x130/0x178
>  Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
> +memblock_alloc_try_nid: 32768 bytes align=0x20 nid=-1 from=0x00000000
> max_addr=0x00000000 alloc_large_system_hash+0x270/0x478
> +memblock_reserve: [0x005c9420-0x005d141f] memblock_alloc_range_nid+0x130/0x178
>  Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
> +memblock_reserve: [0x00000000-0x000003ff] trap_init+0x58/0x474
>  Memory: 126100K/131072K available (2830K kernel code, 147K rwdata,
> 508K rodata, 220K init, 93K bss, 4972K reserved, 0K cma-reserved)

Presuming your system is !cpu_has_mips_r2_r6 and CAC_BASE is 0, the log
looks completely sane.
 
> > > Hence the old code reserved 1 MiB extra at the beginning.
> > >
> > > Note that the new code also dropped the rounding up of the memory block
> > > size to a multiple of PAGE_SIZE. I'm not sure the latter actually
> > > matters or not.
> >
> > I'd say that bad page state for pfn 1 is caused by "freeing" the first 1M.
> 
> Yes, by prom_free_prom_memory(), as pointed out by Serge.

I wonder how other MIPS variants would react to the fact that the memory
below the kernel is not reserved ;-)

> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> -- 
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
>
Geert Uytterhoeven May 22, 2019, 8:14 a.m. UTC | #14
Hi Mike,

On Wed, May 22, 2019 at 10:08 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> On Wed, May 22, 2019 at 09:47:04AM +0200, Geert Uytterhoeven wrote:
> > On Tue, May 21, 2019 at 5:53 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> > > > On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > > > > The reserved_end variable had been used by the bootmem_init() code
> > > > > to find a lowest limit of memory available for memmap blob. The original
> > > > > code just tried to find a free memory space higher than kernel was placed.
> > > > > This limitation seems justified for the memmap ragion search process, but
> > > > > I can't see any obvious reason to reserve the unused space below kernel
> > > > > seeing some platforms place it much higher than standard 1MB. Moreover
> > > > > the RELOCATION config enables it to be loaded at any memory address.
> > > > > So lets reserve the memory occupied by the kernel only, leaving the region
> > > > > below being free for allocations. After doing this we can now discard the
> > > > > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > > > > since it's going to be free anyway (unless marked as reserved by
> > > > > platforms).
> > > > >
> > > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > > >
> > > > This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> > > > image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> > > >
> > > >     VFS: Mounted root (nfs filesystem) on device 0:13.
> > > >     devtmpfs: mounted
> > > >     BUG: Bad page state in process swapper  pfn:00001
> > > >     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
> > > >     flags: 0x0()
> > > >     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
> > > >     page dumped because: nonzero mapcount
> > > >     Modules linked in:
> > > >     CPU: 0 PID: 1 Comm: swapper Not tainted
> > > > 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
> > > >     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> > > > 00000001 804a3490
> > > >             00000001 81000000 0030f231 80148558 00000003 10008400
> > > > 87c1dd80 7599ee13
> > > >             00000000 00000000 804b0000 00000000 00000007 00000000
> > > > 00000085 00000000
> > > >             62722d31 00000084 804b0000 39347874 00000000 804b7820
> > > > 8040cef8 81000010
> > > >             00000001 00000007 00000001 81000000 00000008 8021de24
> > > > 00000000 804a0000
> > > >             ...
> > > >     Call Trace:
> > > >     [<8010adec>] show_stack+0x74/0x104
> > > >     [<801a5e44>] bad_page+0x130/0x138
> > > >     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
> > > >     [<801a789c>] free_unref_page+0x40/0x68
> > > >     [<801120f4>] free_init_pages+0xec/0x104
> > > >     [<803bdde8>] free_initmem+0x10/0x58
> > > >     [<803bdb8c>] kernel_init+0x20/0x100
> > > >     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
> > > >     Disabling lock debugging due to kernel taint
> > > >     BUG: Bad page state in process swapper  pfn:00002
> > > >     [...]
> > > >
> > > > CONFIG_RELOCATABLE is not set, so the only relevant part is the
> > > > change quoted below.
> > > >
> > > > > --- a/arch/mips/kernel/setup.c
> > > > > +++ b/arch/mips/kernel/setup.c
> > > > > @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
> > > > >
> > > > >  static void __init bootmem_init(void)
> > > > >  {
> > > > > -       unsigned long reserved_end;
> > > > >         phys_addr_t ramstart = PHYS_ADDR_MAX;
> > > > >         int i;
> > > > >
> > > > > @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
> > > > >          * will reserve the area used for the initrd.
> > > > >          */
> > > > >         init_initrd();
> > > > > -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
> > > > >
> > > > > -       memblock_reserve(PHYS_OFFSET,
> > > > > -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> > > > > +       /* Reserve memory occupied by kernel. */
> > > > > +       memblock_reserve(__pa_symbol(&_text),
> > > > > +                       __pa_symbol(&_end) - __pa_symbol(&_text));
> > > > >
> > > > >         /*
> > > > >          * max_low_pfn is not a number of pages. The number of pages

> > > > Hence the old code reserved 1 MiB extra at the beginning.
> > > >
> > > > Note that the new code also dropped the rounding up of the memory block
> > > > size to a multiple of PAGE_SIZE. I'm not sure the latter actually
> > > > matters or not.
> > >
> > > I'd say that bad page state for pfn 1 is caused by "freeing" the first 1M.
> >
> > Yes, by prom_free_prom_memory(), as pointed out by Serge.
>
> I wonder how other MIPS variants would react to the fact that the memory
> below the kernel is not reserved ;-)

Exactly...

Looks like at least arch/mips/dec/prom/memory.c needs a similar but
more complicated fix, due to declance handling...


Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
Serge Semin May 22, 2019, 1:34 p.m. UTC | #15
Hello folks,

On Wed, May 22, 2019 at 10:14:47AM +0200, Geert Uytterhoeven wrote:
> Hi Mike,
> 
> On Wed, May 22, 2019 at 10:08 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > On Wed, May 22, 2019 at 09:47:04AM +0200, Geert Uytterhoeven wrote:
> > > On Tue, May 21, 2019 at 5:53 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > > On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> > > > > On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > > > > > The reserved_end variable had been used by the bootmem_init() code
> > > > > > to find a lowest limit of memory available for memmap blob. The original
> > > > > > code just tried to find a free memory space higher than kernel was placed.
> > > > > > This limitation seems justified for the memmap ragion search process, but
> > > > > > I can't see any obvious reason to reserve the unused space below kernel
> > > > > > seeing some platforms place it much higher than standard 1MB. Moreover
> > > > > > the RELOCATION config enables it to be loaded at any memory address.
> > > > > > So lets reserve the memory occupied by the kernel only, leaving the region
> > > > > > below being free for allocations. After doing this we can now discard the
> > > > > > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > > > > > since it's going to be free anyway (unless marked as reserved by
> > > > > > platforms).
> > > > > >
> > > > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > > > >
> > > > > This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> > > > > image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> > > > >
> > > > >     VFS: Mounted root (nfs filesystem) on device 0:13.
> > > > >     devtmpfs: mounted
> > > > >     BUG: Bad page state in process swapper  pfn:00001
> > > > >     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
> > > > >     flags: 0x0()
> > > > >     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
> > > > >     page dumped because: nonzero mapcount
> > > > >     Modules linked in:
> > > > >     CPU: 0 PID: 1 Comm: swapper Not tainted
> > > > > 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
> > > > >     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> > > > > 00000001 804a3490
> > > > >             00000001 81000000 0030f231 80148558 00000003 10008400
> > > > > 87c1dd80 7599ee13
> > > > >             00000000 00000000 804b0000 00000000 00000007 00000000
> > > > > 00000085 00000000
> > > > >             62722d31 00000084 804b0000 39347874 00000000 804b7820
> > > > > 8040cef8 81000010
> > > > >             00000001 00000007 00000001 81000000 00000008 8021de24
> > > > > 00000000 804a0000
> > > > >             ...
> > > > >     Call Trace:
> > > > >     [<8010adec>] show_stack+0x74/0x104
> > > > >     [<801a5e44>] bad_page+0x130/0x138
> > > > >     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
> > > > >     [<801a789c>] free_unref_page+0x40/0x68
> > > > >     [<801120f4>] free_init_pages+0xec/0x104
> > > > >     [<803bdde8>] free_initmem+0x10/0x58
> > > > >     [<803bdb8c>] kernel_init+0x20/0x100
> > > > >     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
> > > > >     Disabling lock debugging due to kernel taint
> > > > >     BUG: Bad page state in process swapper  pfn:00002
> > > > >     [...]
> > > > >
> > > > > CONFIG_RELOCATABLE is not set, so the only relevant part is the
> > > > > change quoted below.
> > > > >
> > > > > > --- a/arch/mips/kernel/setup.c
> > > > > > +++ b/arch/mips/kernel/setup.c
> > > > > > @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
> > > > > >
> > > > > >  static void __init bootmem_init(void)
> > > > > >  {
> > > > > > -       unsigned long reserved_end;
> > > > > >         phys_addr_t ramstart = PHYS_ADDR_MAX;
> > > > > >         int i;
> > > > > >
> > > > > > @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
> > > > > >          * will reserve the area used for the initrd.
> > > > > >          */
> > > > > >         init_initrd();
> > > > > > -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
> > > > > >
> > > > > > -       memblock_reserve(PHYS_OFFSET,
> > > > > > -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> > > > > > +       /* Reserve memory occupied by kernel. */
> > > > > > +       memblock_reserve(__pa_symbol(&_text),
> > > > > > +                       __pa_symbol(&_end) - __pa_symbol(&_text));
> > > > > >
> > > > > >         /*
> > > > > >          * max_low_pfn is not a number of pages. The number of pages
> 
> > > > > Hence the old code reserved 1 MiB extra at the beginning.
> > > > >
> > > > > Note that the new code also dropped the rounding up of the memory block
> > > > > size to a multiple of PAGE_SIZE. I'm not sure the latter actually
> > > > > matters or not.
> > > >
> > > > I'd say that bad page state for pfn 1 is caused by "freeing" the first 1M.
> > >
> > > Yes, by prom_free_prom_memory(), as pointed out by Serge.
> >
> > I wonder how other MIPS variants would react to the fact that the memory
> > below the kernel is not reserved ;-)
> 
> Exactly...
> 
> Looks like at least arch/mips/dec/prom/memory.c needs a similar but\
> more complicated fix, due to declance handling...
> 

The problem might be fixed there by the following patch:
---
diff --git a/arch/mips/dec/prom/memory.c b/arch/mips/dec/prom/memory.c
index 5073d2ed78bb..5a0c734b5d04 100644
--- a/arch/mips/dec/prom/memory.c
+++ b/arch/mips/dec/prom/memory.c
@@ -91,29 +91,14 @@ void __init prom_meminit(u32 magic)
 		pmax_setup_memory_region();
 	else
 		rex_setup_memory_region();
-}
-
-void __init prom_free_prom_memory(void)
-{
-	unsigned long end;
-
-	/*
-	 * Free everything below the kernel itself but leave
-	 * the first page reserved for the exception handlers.
-	 */
 
 #if IS_ENABLED(CONFIG_DECLANCE)
 	/*
-	 * Leave 128 KB reserved for Lance memory for
-	 * IOASIC DECstations.
+	 * Reserve 128 KB for Lance memory for IOASIC DECstations.
 	 *
 	 * XXX: save this address for use in dec_lance.c?
 	 */
 	if (IOASIC)
-		end = __pa(&_text) - 0x00020000;
-	else
+		memblock_reserve(__pa_symbol(&_text), 0x00020000);
 #endif
-		end = __pa(&_text);
-
-	free_init_pages("unused PROM memory", PAGE_SIZE, end);
 }
---

Didn't want to use prom_FREE_prom_memory to actually reserve a memory
chunk, so I moved the reservation into the prom_meminit() method.

Regarding the first page for the exception handlers: we don't need
to reserve it here, since it is already done in arch/mips/kernel/traps.c.

Regards,
-Sergey


> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
Geert Uytterhoeven May 22, 2019, 1:44 p.m. UTC | #16
Hi Serge,

On Wed, May 22, 2019 at 3:34 PM Serge Semin <fancer.lancer@gmail.com> wrote:
> On Wed, May 22, 2019 at 10:14:47AM +0200, Geert Uytterhoeven wrote:
> > On Wed, May 22, 2019 at 10:08 AM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > On Wed, May 22, 2019 at 09:47:04AM +0200, Geert Uytterhoeven wrote:
> > > > On Tue, May 21, 2019 at 5:53 PM Mike Rapoport <rppt@linux.ibm.com> wrote:
> > > > > On Tue, May 21, 2019 at 04:56:39PM +0200, Geert Uytterhoeven wrote:
> > > > > > On Wed, Apr 24, 2019 at 12:50 AM Serge Semin <fancer.lancer@gmail.com> wrote:
> > > > > > > The reserved_end variable had been used by the bootmem_init() code
> > > > > > > to find a lowest limit of memory available for memmap blob. The original
> > > > > > > code just tried to find a free memory space higher than kernel was placed.
> > > > > > > This limitation seems justified for the memmap ragion search process, but
> > > > > > > I can't see any obvious reason to reserve the unused space below kernel
> > > > > > > seeing some platforms place it much higher than standard 1MB. Moreover
> > > > > > > the RELOCATION config enables it to be loaded at any memory address.
> > > > > > > So lets reserve the memory occupied by the kernel only, leaving the region
> > > > > > > below being free for allocations. After doing this we can now discard the
> > > > > > > code freeing a space between kernel _text and VMLINUX_LOAD_ADDRESS symbols
> > > > > > > since it's going to be free anyway (unless marked as reserved by
> > > > > > > platforms).
> > > > > > >
> > > > > > > Signed-off-by: Serge Semin <fancer.lancer@gmail.com>
> > > > > >
> > > > > > This is now commit b93ddc4f9156205e ("mips: Reserve memory for the kernel
> > > > > > image resources") in v5.2-rc1, which causes rbtx4927 to crash during boot:
> > > > > >
> > > > > >     VFS: Mounted root (nfs filesystem) on device 0:13.
> > > > > >     devtmpfs: mounted
> > > > > >     BUG: Bad page state in process swapper  pfn:00001
> > > > > >     page:804b7820 refcount:0 mapcount:-128 mapping:00000000 index:0x1
> > > > > >     flags: 0x0()
> > > > > >     raw: 00000000 00000100 00000200 00000000 00000001 00000000 ffffff7f 00000000
> > > > > >     page dumped because: nonzero mapcount
> > > > > >     Modules linked in:
> > > > > >     CPU: 0 PID: 1 Comm: swapper Not tainted
> > > > > > 5.2.0-rc1-rbtx4927-00468-g3c05ea3d4077b756-dirty #137
> > > > > >     Stack : 00000000 10008400 8040dd2c 87c1b974 8044af63 8040dd2c
> > > > > > 00000001 804a3490
> > > > > >             00000001 81000000 0030f231 80148558 00000003 10008400
> > > > > > 87c1dd80 7599ee13
> > > > > >             00000000 00000000 804b0000 00000000 00000007 00000000
> > > > > > 00000085 00000000
> > > > > >             62722d31 00000084 804b0000 39347874 00000000 804b7820
> > > > > > 8040cef8 81000010
> > > > > >             00000001 00000007 00000001 81000000 00000008 8021de24
> > > > > > 00000000 804a0000
> > > > > >             ...
> > > > > >     Call Trace:
> > > > > >     [<8010adec>] show_stack+0x74/0x104
> > > > > >     [<801a5e44>] bad_page+0x130/0x138
> > > > > >     [<801a654c>] free_pcppages_bulk+0x17c/0x3b0
> > > > > >     [<801a789c>] free_unref_page+0x40/0x68
> > > > > >     [<801120f4>] free_init_pages+0xec/0x104
> > > > > >     [<803bdde8>] free_initmem+0x10/0x58
> > > > > >     [<803bdb8c>] kernel_init+0x20/0x100
> > > > > >     [<801057c8>] ret_from_kernel_thread+0x14/0x1c
> > > > > >     Disabling lock debugging due to kernel taint
> > > > > >     BUG: Bad page state in process swapper  pfn:00002
> > > > > >     [...]
> > > > > >
> > > > > > CONFIG_RELOCATABLE is not set, so the only relevant part is the
> > > > > > change quoted below.
> > > > > >
> > > > > > > --- a/arch/mips/kernel/setup.c
> > > > > > > +++ b/arch/mips/kernel/setup.c
> > > > > > > @@ -371,7 +371,6 @@ static void __init bootmem_init(void)
> > > > > > >
> > > > > > >  static void __init bootmem_init(void)
> > > > > > >  {
> > > > > > > -       unsigned long reserved_end;
> > > > > > >         phys_addr_t ramstart = PHYS_ADDR_MAX;
> > > > > > >         int i;
> > > > > > >
> > > > > > > @@ -382,10 +381,10 @@ static void __init bootmem_init(void)
> > > > > > >          * will reserve the area used for the initrd.
> > > > > > >          */
> > > > > > >         init_initrd();
> > > > > > > -       reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
> > > > > > >
> > > > > > > -       memblock_reserve(PHYS_OFFSET,
> > > > > > > -                        (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
> > > > > > > +       /* Reserve memory occupied by kernel. */
> > > > > > > +       memblock_reserve(__pa_symbol(&_text),
> > > > > > > +                       __pa_symbol(&_end) - __pa_symbol(&_text));
> > > > > > >
> > > > > > >         /*
> > > > > > >          * max_low_pfn is not a number of pages. The number of pages
> >
> > > > > > Hence the old code reserved 1 MiB extra at the beginning.
> > > > > >
> > > > > > Note that the new code also dropped the rounding up of the memory block
> > > > > > size to a multiple of PAGE_SIZE. I'm not sure the latter actually
> > > > > > matters or not.
> > > > >
> > > > > I'd say that bad page state for pfn 1 is caused by "freeing" the first 1M.
> > > >
> > > > Yes, by prom_free_prom_memory(), as pointed out by Serge.
> > >
> > > I wonder how other MIPS variants would react to the fact that the memory
> > > below the kernel is not reserved ;-)
> >
> > Exactly...
> >
> > Looks like at least arch/mips/dec/prom/memory.c needs a similar but\
> > more complicated fix, due to declance handling...
> >
>
> The problem might be fixed there by the next patch:
> ---
> diff --git a/arch/mips/dec/prom/memory.c b/arch/mips/dec/prom/memory.c
> index 5073d2ed78bb..5a0c734b5d04 100644
> --- a/arch/mips/dec/prom/memory.c
> +++ b/arch/mips/dec/prom/memory.c
> @@ -91,29 +91,14 @@ void __init prom_meminit(u32 magic)
>                 pmax_setup_memory_region();
>         else
>                 rex_setup_memory_region();
> -}
> -
> -void __init prom_free_prom_memory(void)
> -{
> -       unsigned long end;
> -
> -       /*
> -        * Free everything below the kernel itself but leave
> -        * the first page reserved for the exception handlers.
> -        */
>
>  #if IS_ENABLED(CONFIG_DECLANCE)
>         /*
> -        * Leave 128 KB reserved for Lance memory for
> -        * IOASIC DECstations.
> +        * Reserve 128 KB for Lance memory for IOASIC DECstations.
>          *
>          * XXX: save this address for use in dec_lance.c?
>          */
>         if (IOASIC)
> -               end = __pa(&_text) - 0x00020000;
> -       else
> +               memblock_reserve(__pa_symbol(&_text), 0x00020000);

Shouldn't that be

    memblock_reserve(__pa_symbol(&_text) - 0x00020000, 0x00020000);

?
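
In other words, going by the original code quoted above (end = __pa(&_text) -
0x00020000), the 128 KB kept for the Lance buffers is the region immediately
below the kernel image, so the reservation has to start 0x20000 below _text.
As a fragment (IOASIC and _text come from the existing DEC code), the assumed
layout is:
---
	/*
	 * Assumed layout on IOASIC DECstations, per the original code:
	 *
	 *   [_text - 0x20000, _text)   128 KB Lance DMA memory (keep reserved)
	 *   [_text, _end)              kernel image (reserved in setup.c)
	 */
	if (IOASIC)
		memblock_reserve(__pa_symbol(&_text) - 0x00020000, 0x00020000);
---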

>  #endif
> -               end = __pa(&_text);
> -
> -       free_init_pages("unused PROM memory", PAGE_SIZE, end);
>  }
> ---
>
> Didn't wanna use prom_FREE_prom_memory to actually reserve a memory
> chunk, so I moved the reservation into the prom_meminit() method.

I guess Maciej will test this on real hardware, eventually...

> Regarding the first page for the exception handlers. We don't need
> to reserve it here, since it is already done in arch/mips/kernel/traps.c .

Thanks for checking! That was actually something I was still wondering
about...

Gr{oetje,eeting}s,

                        Geert
Serge Semin May 22, 2019, 1:54 p.m. UTC | #17
On Wed, May 22, 2019 at 03:44:47PM +0200, Geert Uytterhoeven wrote:
> Hi Serge,
>
> ...
> 
> >
> > The problem might be fixed there by the next patch:
> > ---
> > diff --git a/arch/mips/dec/prom/memory.c b/arch/mips/dec/prom/memory.c
> > index 5073d2ed78bb..5a0c734b5d04 100644
> > --- a/arch/mips/dec/prom/memory.c
> > +++ b/arch/mips/dec/prom/memory.c
> > @@ -91,29 +91,14 @@ void __init prom_meminit(u32 magic)
> >                 pmax_setup_memory_region();
> >         else
> >                 rex_setup_memory_region();
> > -}
> > -
> > -void __init prom_free_prom_memory(void)
> > -{
> > -       unsigned long end;
> > -
> > -       /*
> > -        * Free everything below the kernel itself but leave
> > -        * the first page reserved for the exception handlers.
> > -        */
> >
> >  #if IS_ENABLED(CONFIG_DECLANCE)
> >         /*
> > -        * Leave 128 KB reserved for Lance memory for
> > -        * IOASIC DECstations.
> > +        * Reserve 128 KB for Lance memory for IOASIC DECstations.
> >          *
> >          * XXX: save this address for use in dec_lance.c?
> >          */
> >         if (IOASIC)
> > -               end = __pa(&_text) - 0x00020000;
> > -       else
> > +               memblock_reserve(__pa_symbol(&_text), 0x00020000);
> 
> Shouldn't that be
> 
>     memblock_reserve(__pa_symbol(&_text) - 0x00020000, 0x00020000);
> 
> ?
> 

Right. Thanks. The updated version of the patch is attached to this email.

-Sergey

> >  #endif
> > -               end = __pa(&_text);
> > -
> > -       free_init_pages("unused PROM memory", PAGE_SIZE, end);
> >  }
> > ---
> >
> > Didn't wanna use prom_FREE_prom_memory to actually reserve a memory
> > chunk, so I moved the reservation into the prom_meminit() method.
> 
> I guess Maciej will test this on real hardware, eventually...
> 
> > Regarding the first page for the exception handlers. We don't need
> > to reserve it here, since it is already done in arch/mips/kernel/traps.c .
> 
> Thanks for checking! That was actually something I was still wondering
> about...
> 
> Gr{oetje,eeting}s,
> 
>                         Geert
> 
> -- 
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
>                                 -- Linus Torvalds
diff --git a/arch/mips/dec/prom/memory.c b/arch/mips/dec/prom/memory.c
index 5073d2ed78bb..031c0cd45d85 100644
--- a/arch/mips/dec/prom/memory.c
+++ b/arch/mips/dec/prom/memory.c
@@ -91,29 +91,14 @@ void __init prom_meminit(u32 magic)
 		pmax_setup_memory_region();
 	else
 		rex_setup_memory_region();
-}
-
-void __init prom_free_prom_memory(void)
-{
-	unsigned long end;
-
-	/*
-	 * Free everything below the kernel itself but leave
-	 * the first page reserved for the exception handlers.
-	 */
 
 #if IS_ENABLED(CONFIG_DECLANCE)
 	/*
-	 * Leave 128 KB reserved for Lance memory for
-	 * IOASIC DECstations.
+	 * Reserve 128 KB for Lance memory for IOASIC DECstations.
 	 *
 	 * XXX: save this address for use in dec_lance.c?
 	 */
 	if (IOASIC)
-		end = __pa(&_text) - 0x00020000;
-	else
+		memblock_reserve(__pa_symbol(&_text) - 0x00020000, 0x00020000);
 #endif
-		end = __pa(&_text);
-
-	free_init_pages("unused PROM memory", PAGE_SIZE, end);
 }
Maciej W. Rozycki Oct. 14, 2020, 9:49 a.m. UTC | #18
On Wed, 22 May 2019, Serge Semin wrote:

> > > The problem might be fixed there by the next patch:
> > > ---
> > > diff --git a/arch/mips/dec/prom/memory.c b/arch/mips/dec/prom/memory.c
> > > index 5073d2ed78bb..5a0c734b5d04 100644
> > > --- a/arch/mips/dec/prom/memory.c
> > > +++ b/arch/mips/dec/prom/memory.c
> > > @@ -91,29 +91,14 @@ void __init prom_meminit(u32 magic)
> > >                 pmax_setup_memory_region();
> > >         else
> > >                 rex_setup_memory_region();
> > > -}
> > > -
> > > -void __init prom_free_prom_memory(void)
> > > -{
> > > -       unsigned long end;
> > > -
> > > -       /*
> > > -        * Free everything below the kernel itself but leave
> > > -        * the first page reserved for the exception handlers.
> > > -        */
> > >
> > >  #if IS_ENABLED(CONFIG_DECLANCE)
> > >         /*
> > > -        * Leave 128 KB reserved for Lance memory for
> > > -        * IOASIC DECstations.
> > > +        * Reserve 128 KB for Lance memory for IOASIC DECstations.
> > >          *
> > >          * XXX: save this address for use in dec_lance.c?
> > >          */
> > >         if (IOASIC)
> > > -               end = __pa(&_text) - 0x00020000;
> > > -       else
> > > +               memblock_reserve(__pa_symbol(&_text), 0x00020000);
> > 
> > Shouldn't that be
> > 
> >     memblock_reserve(__pa_symbol(&_text) - 0x00020000, 0x00020000);
> > 
> > ?
> > 
> 
> Right. Thanks. The updated version of the patch is attached to this email.
> 
> -Sergey
> 
> > >  #endif
> > > -               end = __pa(&_text);
> > > -
> > > -       free_init_pages("unused PROM memory", PAGE_SIZE, end);
> > >  }
> > > ---
> > >
> > > Didn't wanna use prom_FREE_prom_memory to actually reserve a memory
> > > chunk, so I moved the reservation into the prom_meminit() method.
> > 
> > I guess Maciej will test this on real hardware, eventually...

 I finally got to it as I was hit by it the hard way (and I did keep
the thread in my inbox!); however, this is the wrong fix.

 With DEC hardware the whole 128 KiB region is reserved as the firmware's
working memory area, and we call into the firmware on several occasions
throughout bootstrap.  Therefore we have to stay away from it until we know
we won't need any firmware services any longer, which is only at this point.
So this piece has to stay, and the removed reservation has to be reinstated
in platform code.  I'll be posting a fix separately.
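
(A rough sketch of that direction follows; the helper name, the includes and
the call site are assumptions, and this is not the actual fix referred to
above, which was posted separately.)
---
#include <linux/init.h>
#include <linux/memblock.h>
#include <asm/page.h>		/* PHYS_OFFSET, __pa_symbol() */
#include <asm/sections.h>	/* _text */

/*
 * Sketch only: keep everything below the kernel image reserved for as long
 * as the firmware may still be called, mirroring what the removed generic
 * reservation in bootmem_init() used to guarantee.  The area can then be
 * handed back in prom_free_prom_memory(), after the last firmware call.
 */
static void __init dec_reserve_prom_work_area(void)
{
	memblock_reserve(PHYS_OFFSET, __pa_symbol(&_text) - PHYS_OFFSET);
}
---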

 NB I suspect CFE platforms may need a similar fix, but I don't have 
access to my SWARM now, so I'll verify it on another occasion.

 Other platforms may need it too; at least up to a point the assumption was
that the kernel load address sits just past any firmware working area, which
is typically allocated right beyond the exception vector area, hence the
reservation.  I realise the assumption may have changed at some point, and
the oldtimers who knew about it may have been away or not paying enough
attention, while the newcomers did not realise that.

  Maciej
diff mbox series

Patch

diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
index 185e0e42e009..f71a7d32a687 100644
--- a/arch/mips/kernel/setup.c
+++ b/arch/mips/kernel/setup.c
@@ -371,7 +371,6 @@  static void __init bootmem_init(void)
 
 static void __init bootmem_init(void)
 {
-	unsigned long reserved_end;
 	phys_addr_t ramstart = PHYS_ADDR_MAX;
 	int i;
 
@@ -382,10 +381,10 @@  static void __init bootmem_init(void)
 	 * will reserve the area used for the initrd.
 	 */
 	init_initrd();
-	reserved_end = (unsigned long) PFN_UP(__pa_symbol(&_end));
 
-	memblock_reserve(PHYS_OFFSET,
-			 (reserved_end << PAGE_SHIFT) - PHYS_OFFSET);
+	/* Reserve memory occupied by kernel. */
+	memblock_reserve(__pa_symbol(&_text),
+			__pa_symbol(&_end) - __pa_symbol(&_text));
 
 	/*
 	 * max_low_pfn is not a number of pages. The number of pages
@@ -501,29 +500,6 @@  static void __init bootmem_init(void)
 		memory_present(0, start, end);
 	}
 
-#ifdef CONFIG_RELOCATABLE
-	/*
-	 * The kernel reserves all memory below its _end symbol as bootmem,
-	 * but the kernel may now be at a much higher address. The memory
-	 * between the original and new locations may be returned to the system.
-	 */
-	if (__pa_symbol(_text) > __pa_symbol(VMLINUX_LOAD_ADDRESS)) {
-		unsigned long offset;
-		extern void show_kernel_relocation(const char *level);
-
-		offset = __pa_symbol(_text) - __pa_symbol(VMLINUX_LOAD_ADDRESS);
-		memblock_free(__pa_symbol(VMLINUX_LOAD_ADDRESS), offset);
-
-#if defined(CONFIG_DEBUG_KERNEL) && defined(CONFIG_DEBUG_INFO)
-		/*
-		 * This information is necessary when debugging the kernel
-		 * But is a security vulnerability otherwise!
-		 */
-		show_kernel_relocation(KERN_INFO);
-#endif
-	}
-#endif
-
 	/*
 	 * Reserve initrd memory if needed.
 	 */