mm/page_alloc: fix a crash in free_pages_prepare()
diff mbox series

Message ID 1569613623-16820-1-git-send-email-cai@lca.pw
State New
Headers show
Series
  • mm/page_alloc: fix a crash in free_pages_prepare()
Related show

Commit Message

Qian Cai Sept. 27, 2019, 7:47 p.m. UTC
On architectures like s390, arch_free_page() could mark the page unused
(set_page_unused()) and any access later would trigger a kernel panic.
Fix it by moving arch_free_page() after all possible accessing calls.

 Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
 Krnl PSW : 0404e00180000000 0000000026c2b96e
(__free_pages_ok+0x34e/0x5d8)
            R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
 Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c
 000000000000000f
            000003d080012100 000003d080013fc0 0000000000000000
 0000000000100000
            00000000275cca48 0000000000000100 0000000000000008
 000003d080010000
            00000000000001d0 000003d000000000 0000000026c2b78a
 000000002717fdb0
 Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
            0000000026c2b962: e32014000036 pfd 2,1024(%r1)
           #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
           >0000000026c2b96e: 41101100  la %r1,256(%r1)
            0000000026c2b972: a737fff8  brctg %r3,26c2b962
            0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
            0000000026c2b97c: e31003400004 lg %r1,832
            0000000026c2b982: ebff1430016a asi 5168(%r1),-1
 Call Trace:
 __free_pages_ok+0x16a/0x5d8)
 memblock_free_all+0x206/0x290
 mem_init+0x58/0x120
 start_kernel+0x2b0/0x570
 startup_continue+0x6a/0xc0
 INFO: lockdep is turned off.
 Last Breaking-Event-Address:
 __free_pages_ok+0x372/0x5d8
 Kernel panic - not syncing: Fatal exception: panic_on_oops
00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000
26A2379C

Signed-off-by: Qian Cai <cai@lca.pw>
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Andrew Morton Sept. 27, 2019, 8:48 p.m. UTC | #1
On Fri, 27 Sep 2019 15:47:03 -0400 Qian Cai <cai@lca.pw> wrote:

> On architectures like s390, arch_free_page() could mark the page unused
> (set_page_unused()) and any access later would trigger a kernel panic.
> Fix it by moving arch_free_page() after all possible accessing calls.
> 
>  Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
>  Krnl PSW : 0404e00180000000 0000000026c2b96e
> (__free_pages_ok+0x34e/0x5d8)
>             R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>  Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c
>  000000000000000f
>             000003d080012100 000003d080013fc0 0000000000000000
>  0000000000100000
>             00000000275cca48 0000000000000100 0000000000000008
>  000003d080010000
>             00000000000001d0 000003d000000000 0000000026c2b78a
>  000000002717fdb0
>  Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
>             0000000026c2b962: e32014000036 pfd 2,1024(%r1)
>            #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
>            >0000000026c2b96e: 41101100  la %r1,256(%r1)
>             0000000026c2b972: a737fff8  brctg %r3,26c2b962
>             0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
>             0000000026c2b97c: e31003400004 lg %r1,832
>             0000000026c2b982: ebff1430016a asi 5168(%r1),-1
>  Call Trace:
>  __free_pages_ok+0x16a/0x5d8)
>  memblock_free_all+0x206/0x290
>  mem_init+0x58/0x120
>  start_kernel+0x2b0/0x570
>  startup_continue+0x6a/0xc0
>  INFO: lockdep is turned off.
>  Last Breaking-Event-Address:
>  __free_pages_ok+0x372/0x5d8
>  Kernel panic - not syncing: Fatal exception: panic_on_oops
> 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000
> 26A2379C
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,11 +1175,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  		debug_check_no_obj_freed(page_address(page),
>  					   PAGE_SIZE << order);
>  	}
> -	arch_free_page(page, order);
>  	if (want_init_on_free())
>  		kernel_init_free_pages(page, 1 << order);
>  
>  	kernel_poison_pages(page, 1 << order, 0);
> +	arch_free_page(page, order);
>  	if (debug_pagealloc_enabled())
>  		kernel_map_pages(page, 1 << order, 0);

This is all fairly mature code, isn't it?  What happened to make this
problem pop up now?

IOW, is a -stable backport needed?
Andrew Morton Sept. 27, 2019, 9:02 p.m. UTC | #2
On Fri, 27 Sep 2019 15:47:03 -0400 Qian Cai <cai@lca.pw> wrote:

> On architectures like s390, arch_free_page() could mark the page unused
> (set_page_unused()) and any access later would trigger a kernel panic.
> Fix it by moving arch_free_page() after all possible accessing calls.
> 
>  Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
>  Krnl PSW : 0404e00180000000 0000000026c2b96e
> (__free_pages_ok+0x34e/0x5d8)
>             R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>  Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c
>  000000000000000f
>             000003d080012100 000003d080013fc0 0000000000000000
>  0000000000100000
>             00000000275cca48 0000000000000100 0000000000000008
>  000003d080010000
>             00000000000001d0 000003d000000000 0000000026c2b78a
>  000000002717fdb0
>  Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
>             0000000026c2b962: e32014000036 pfd 2,1024(%r1)
>            #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
>            >0000000026c2b96e: 41101100  la %r1,256(%r1)
>             0000000026c2b972: a737fff8  brctg %r3,26c2b962
>             0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
>             0000000026c2b97c: e31003400004 lg %r1,832
>             0000000026c2b982: ebff1430016a asi 5168(%r1),-1
>  Call Trace:
>  __free_pages_ok+0x16a/0x5d8)
>  memblock_free_all+0x206/0x290
>  mem_init+0x58/0x120
>  start_kernel+0x2b0/0x570
>  startup_continue+0x6a/0xc0
>  INFO: lockdep is turned off.
>  Last Breaking-Event-Address:
>  __free_pages_ok+0x372/0x5d8
>  Kernel panic - not syncing: Fatal exception: panic_on_oops
> 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000
> 26A2379C
> 
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,11 +1175,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  		debug_check_no_obj_freed(page_address(page),
>  					   PAGE_SIZE << order);
>  	}
> -	arch_free_page(page, order);
>  	if (want_init_on_free())
>  		kernel_init_free_pages(page, 1 << order);
>  
>  	kernel_poison_pages(page, 1 << order, 0);
> +	arch_free_page(page, order);
>  	if (debug_pagealloc_enabled())
>  		kernel_map_pages(page, 1 << order, 0);

AFAICT the meticulously undocumented s390 set_page_unused() will cause
there to be a fault if anyone tries to access the page contents, yes?

So I think you've moved the arch_free_page() to be after the final
thing which can access page contents, yes?  If so, we should have a
comment in free_pages_prepare() to attmept to prevent this problem from
reoccurring as the code evolves?
Qian Cai Sept. 27, 2019, 9:15 p.m. UTC | #3
On Fri, 2019-09-27 at 13:48 -0700, Andrew Morton wrote:
> On Fri, 27 Sep 2019 15:47:03 -0400 Qian Cai <cai@lca.pw> wrote:
> 
> > On architectures like s390, arch_free_page() could mark the page unused
> > (set_page_unused()) and any access later would trigger a kernel panic.
> > Fix it by moving arch_free_page() after all possible accessing calls.
> > 
> >  Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> >  Krnl PSW : 0404e00180000000 0000000026c2b96e
> > (__free_pages_ok+0x34e/0x5d8)
> >             R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >  Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c
> >  000000000000000f
> >             000003d080012100 000003d080013fc0 0000000000000000
> >  0000000000100000
> >             00000000275cca48 0000000000000100 0000000000000008
> >  000003d080010000
> >             00000000000001d0 000003d000000000 0000000026c2b78a
> >  000000002717fdb0
> >  Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
> >             0000000026c2b962: e32014000036 pfd 2,1024(%r1)
> >            #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
> >            >0000000026c2b96e: 41101100  la %r1,256(%r1)
> >             0000000026c2b972: a737fff8  brctg %r3,26c2b962
> >             0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
> >             0000000026c2b97c: e31003400004 lg %r1,832
> >             0000000026c2b982: ebff1430016a asi 5168(%r1),-1
> >  Call Trace:
> >  __free_pages_ok+0x16a/0x5d8)
> >  memblock_free_all+0x206/0x290
> >  mem_init+0x58/0x120
> >  start_kernel+0x2b0/0x570
> >  startup_continue+0x6a/0xc0
> >  INFO: lockdep is turned off.
> >  Last Breaking-Event-Address:
> >  __free_pages_ok+0x372/0x5d8
> >  Kernel panic - not syncing: Fatal exception: panic_on_oops
> > 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000
> > 26A2379C
> > 
> > ...
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1175,11 +1175,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
> >  		debug_check_no_obj_freed(page_address(page),
> >  					   PAGE_SIZE << order);
> >  	}
> > -	arch_free_page(page, order);
> >  	if (want_init_on_free())
> >  		kernel_init_free_pages(page, 1 << order);
> >  
> >  	kernel_poison_pages(page, 1 << order, 0);
> > +	arch_free_page(page, order);
> >  	if (debug_pagealloc_enabled())
> >  		kernel_map_pages(page, 1 << order, 0);
> 
> This is all fairly mature code, isn't it?  What happened to make this
> problem pop up now?

In the past, there is only kernel_poison_pages() would trigger it but it needs
"page_poison=on" kernel cmdline, and I suspect nobody tested that on s390 in the
past.

Recently, kernel_init_free_pages() (the commit 6471384af2a6) was added to the
kernel and could trigger it as well.

> 
> IOW, is a -stable backport needed?

Since kernel_init_free_pages() was added in v5.3, I think it does not hurt to
backport to -stable there.
Qian Cai Sept. 27, 2019, 9:28 p.m. UTC | #4
On Fri, 2019-09-27 at 14:02 -0700, Andrew Morton wrote:
> On Fri, 27 Sep 2019 15:47:03 -0400 Qian Cai <cai@lca.pw> wrote:
> 
> > On architectures like s390, arch_free_page() could mark the page unused
> > (set_page_unused()) and any access later would trigger a kernel panic.
> > Fix it by moving arch_free_page() after all possible accessing calls.
> > 
> >  Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
> >  Krnl PSW : 0404e00180000000 0000000026c2b96e
> > (__free_pages_ok+0x34e/0x5d8)
> >             R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >  Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c
> >  000000000000000f
> >             000003d080012100 000003d080013fc0 0000000000000000
> >  0000000000100000
> >             00000000275cca48 0000000000000100 0000000000000008
> >  000003d080010000
> >             00000000000001d0 000003d000000000 0000000026c2b78a
> >  000000002717fdb0
> >  Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
> >             0000000026c2b962: e32014000036 pfd 2,1024(%r1)
> >            #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
> >            >0000000026c2b96e: 41101100  la %r1,256(%r1)
> >             0000000026c2b972: a737fff8  brctg %r3,26c2b962
> >             0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
> >             0000000026c2b97c: e31003400004 lg %r1,832
> >             0000000026c2b982: ebff1430016a asi 5168(%r1),-1
> >  Call Trace:
> >  __free_pages_ok+0x16a/0x5d8)
> >  memblock_free_all+0x206/0x290
> >  mem_init+0x58/0x120
> >  start_kernel+0x2b0/0x570
> >  startup_continue+0x6a/0xc0
> >  INFO: lockdep is turned off.
> >  Last Breaking-Event-Address:
> >  __free_pages_ok+0x372/0x5d8
> >  Kernel panic - not syncing: Fatal exception: panic_on_oops
> > 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000
> > 26A2379C
> > 
> > ...
> > 
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1175,11 +1175,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
> >  		debug_check_no_obj_freed(page_address(page),
> >  					   PAGE_SIZE << order);
> >  	}
> > -	arch_free_page(page, order);
> >  	if (want_init_on_free())
> >  		kernel_init_free_pages(page, 1 << order);
> >  
> >  	kernel_poison_pages(page, 1 << order, 0);
> > +	arch_free_page(page, order);
> >  	if (debug_pagealloc_enabled())
> >  		kernel_map_pages(page, 1 << order, 0);
> 
> AFAICT the meticulously undocumented s390 set_page_unused() will cause
> there to be a fault if anyone tries to access the page contents, yes?

Yes.

> 
> So I think you've moved the arch_free_page() to be after the final
> thing which can access page contents, yes?  If so, we should have a
> comment in free_pages_prepare() to attmept to prevent this problem from
> reoccurring as the code evolves?

Right, something like this above arch_free_page() there?

/*
 * It needs to be just above kernel_map_pages(), as s390 could mark those
 * pages unused and then trigger a fault when accessing.
 */
Andrew Morton Sept. 27, 2019, 9:59 p.m. UTC | #5
On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai <cai@lca.pw> wrote:

> > 
> > So I think you've moved the arch_free_page() to be after the final
> > thing which can access page contents, yes?  If so, we should have a
> > comment in free_pages_prepare() to attmept to prevent this problem from
> > reoccurring as the code evolves?
> 
> Right, something like this above arch_free_page() there?
> 
> /*
>  * It needs to be just above kernel_map_pages(), as s390 could mark those
>  * pages unused and then trigger a fault when accessing.
>  */

I did this.

--- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
+++ a/mm/page_alloc.c
@@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
 		kernel_init_free_pages(page, 1 << order);
 
 	kernel_poison_pages(page, 1 << order, 0);
+	/*
+	 * arch_free_page() can make the page's contents inaccessible.  s390
+	 * does this.  So nothing which can access the page's contents should
+	 * happen after this.
+	 */
 	arch_free_page(page, order);
+
 	if (debug_pagealloc_enabled())
 		kernel_map_pages(page, 1 << order, 0);
Alexander Duyck Sept. 27, 2019, 10:17 p.m. UTC | #6
On Fri, Sep 27, 2019 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai <cai@lca.pw> wrote:
>
> > >
> > > So I think you've moved the arch_free_page() to be after the final
> > > thing which can access page contents, yes?  If so, we should have a
> > > comment in free_pages_prepare() to attmept to prevent this problem from
> > > reoccurring as the code evolves?
> >
> > Right, something like this above arch_free_page() there?
> >
> > /*
> >  * It needs to be just above kernel_map_pages(), as s390 could mark those
> >  * pages unused and then trigger a fault when accessing.
> >  */
>
> I did this.
>
> --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
> +++ a/mm/page_alloc.c
> @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
>                 kernel_init_free_pages(page, 1 << order);
>
>         kernel_poison_pages(page, 1 << order, 0);
> +       /*
> +        * arch_free_page() can make the page's contents inaccessible.  s390
> +        * does this.  So nothing which can access the page's contents should
> +        * happen after this.
> +        */
>         arch_free_page(page, order);
> +
>         if (debug_pagealloc_enabled())
>                 kernel_map_pages(page, 1 << order, 0);
>

So the question I would have is what is the state of the page after it
has been marked unused and then pulled back in? I'm assuming it will
be all 0s.

I know with the work I am still doing on marking pages as unused this
ends up being an additional item that we will need to pay attention
to, however in our case we will just be initializing the page as zero
if we end up evicting it from the guest.
David Hildenbrand Sept. 28, 2019, 9:06 a.m. UTC | #7
On 28.09.19 00:17, Alexander Duyck wrote:
> On Fri, Sep 27, 2019 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai <cai@lca.pw> wrote:
>>
>>>>
>>>> So I think you've moved the arch_free_page() to be after the final
>>>> thing which can access page contents, yes?  If so, we should have a
>>>> comment in free_pages_prepare() to attmept to prevent this problem from
>>>> reoccurring as the code evolves?
>>>
>>> Right, something like this above arch_free_page() there?
>>>
>>> /*
>>>  * It needs to be just above kernel_map_pages(), as s390 could mark those
>>>  * pages unused and then trigger a fault when accessing.
>>>  */
>>
>> I did this.
>>
>> --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
>> +++ a/mm/page_alloc.c
>> @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
>>                 kernel_init_free_pages(page, 1 << order);
>>
>>         kernel_poison_pages(page, 1 << order, 0);
>> +       /*
>> +        * arch_free_page() can make the page's contents inaccessible.  s390
>> +        * does this.  So nothing which can access the page's contents should
>> +        * happen after this.
>> +        */
>>         arch_free_page(page, order);
>> +
>>         if (debug_pagealloc_enabled())
>>                 kernel_map_pages(page, 1 << order, 0);
>>
> 
> So the question I would have is what is the state of the page after it
> has been marked unused and then pulled back in? I'm assuming it will
> be all 0s.

I think this comment relates to the s390x implementation, so I'll try to
explain that. After arch_free_page() the page might have been zapped in
the hypervisor, but that might happen deferred. The guest ends up
triggering the ESSA instruction in arch_free_page(). That instruction
sets some extended-page-table-related ("pgste") bits in the hypervisor
tables for the guest ("gmap") and fills a buffer with these entries. The
page is marked _PGSTE_GPS_USAGE_UNUSED.

Once the buffer is full, the ESSA instruction intercepts to the
hypervisor, where the hypervisor can go over all recorded entries and
zap them *in case* the extended-page-table-related bits still indicate
that the page is unused by the guest (_PGSTE_GPS_USAGE_UNUSED) or is
marked to be logically zero (_PGSTE_GPS_ZERO). Zapping a page only
happens if it's a pte_swap(pte) entry and effectively triggers a
ptep_zap_unused() -> ptep_zap_swap_entry() -> free_swap_and_cache(). So
I think it will be backed with the zero page when pulled back in.

arch_alloc_page() will similarly trigger the ESSA instruction but only
set the extended-page-table-related bits, so the entry is no longer
_PGSTE_GPS_USAGE_UNUSED. This is basically to make sure a page won't get
zapped in the hypervisor while it is already getting used by the guest
again.

The implementation on the KVM side resides in
arch/s390/kvm/priv.c:handle_essa() but more importantly in
arch/s390/kvm/priv.c:__do_essa() ("pure interpretation path skipping
hardware interpretation completely").

Now, one interesting thing resides in
arch/s390/kvm/priv.c:pgste_perform_essa():
	/* If we are discarding a page, set it to logical zero */
	if (res)
		pgstev |= _PGSTE_GPS_ZERO;

So whenever we do an arch_free_page() in the guest, the page will
immediately also be set in the hypervisor to _PGSTE_GPS_ZERO. However, I
think setting the page logically to zero is just an "extended HW state"
and will not actually result in the page reading zeroes before we
actually zap it. I might be wrong and I only see one place where
_PGSTE_GPS_ZERO actually gets cleared again, especially when setting a
page stable (which looks bogus but as the documentation is confidential
I have no idea what's happening there).

Long story short: I think there is *no guarantee* that
a) After arch_free_page(), the page is actually zeroed-out
b) After arch_free_page() the page has been zapped in the hypervisor or
   will get zapped.
c) After arch_alloc_page(), the page was actually zeroed-out.

I might be wrong, depending on how _PGSTE_GPS_ZERO is actually used.

However, *if* the page was zapped in the hypervisor
(free_swap_and_cache()), I think it will get populated using the zero-page.

Also, please not that s390x requires an arch_alloc_page() after an
arch_free_page(). You cannot simply go ahead and reuse the page after
arch_alloc_page().

> 
> I know with the work I am still doing on marking pages as unused this
> ends up being an additional item that we will need to pay attention
> to, however in our case we will just be initializing the page as zero
> if we end up evicting it from the guest.

Please note that if you are using MADV_FREE instead of MADV_DONTNEED in
the hypervisor, you might end up with the same guarantees that s390x'
implementation gives you. Could be, that the page was not zapped/zeroed
out on the next access. Depends on if the hypervisor was feeling like
zapping entries marked using MADV_FREE.
Christian Borntraeger Sept. 30, 2019, 6:27 a.m. UTC | #8
On 28.09.19 11:06, David Hildenbrand wrote:
> On 28.09.19 00:17, Alexander Duyck wrote:
>> On Fri, Sep 27, 2019 at 2:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>>
>>> On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai <cai@lca.pw> wrote:
>>>
>>>>>
>>>>> So I think you've moved the arch_free_page() to be after the final
>>>>> thing which can access page contents, yes?  If so, we should have a
>>>>> comment in free_pages_prepare() to attmept to prevent this problem from
>>>>> reoccurring as the code evolves?
>>>>
>>>> Right, something like this above arch_free_page() there?
>>>>
>>>> /*
>>>>  * It needs to be just above kernel_map_pages(), as s390 could mark those
>>>>  * pages unused and then trigger a fault when accessing.
>>>>  */
>>>
>>> I did this.
>>>
>>> --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
>>> +++ a/mm/page_alloc.c
>>> @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
>>>                 kernel_init_free_pages(page, 1 << order);
>>>
>>>         kernel_poison_pages(page, 1 << order, 0);
>>> +       /*
>>> +        * arch_free_page() can make the page's contents inaccessible.  s390
>>> +        * does this.  So nothing which can access the page's contents should
>>> +        * happen after this.
>>> +        */
>>>         arch_free_page(page, order);
>>> +
>>>         if (debug_pagealloc_enabled())
>>>                 kernel_map_pages(page, 1 << order, 0);
>>>
>>
>> So the question I would have is what is the state of the page after it
>> has been marked unused and then pulled back in? I'm assuming it will
>> be all 0s.
> 
> I think this comment relates to the s390x implementation, so I'll try to
> explain that. After arch_free_page() the page might have been zapped in
> the hypervisor, but that might happen deferred. The guest ends up
> triggering the ESSA instruction in arch_free_page(). That instruction
> sets some extended-page-table-related ("pgste") bits in the hypervisor
> tables for the guest ("gmap") and fills a buffer with these entries. The
> page is marked _PGSTE_GPS_USAGE_UNUSED.

Yes. And that also means that architecturally it can be 0 or it can contain
the old content depending on whether the host has paged that page out or not
and how many pages have been marked unused.
I am also sure that the implementation of z/VM and KVM do differ in that regard.
For example KVM does not make use of the logical zero state but z/VM does.

In essence you can consider this like a ballooner that takes away the page lazily.


For a writeup of the details see 
https://www.kernel.org/doc/ols/2006/ols2006v2-pages-321-336.pdf
(This also contains additional states that were never merged upstream)
Christian Borntraeger Sept. 30, 2019, 6:30 a.m. UTC | #9
On 27.09.19 23:59, Andrew Morton wrote:
> On Fri, 27 Sep 2019 17:28:06 -0400 Qian Cai <cai@lca.pw> wrote:
> 
>>>
>>> So I think you've moved the arch_free_page() to be after the final
>>> thing which can access page contents, yes?  If so, we should have a
>>> comment in free_pages_prepare() to attmept to prevent this problem from
>>> reoccurring as the code evolves?
>>
>> Right, something like this above arch_free_page() there?
>>
>> /*
>>  * It needs to be just above kernel_map_pages(), as s390 could mark those
>>  * pages unused and then trigger a fault when accessing.
>>  */
> 
> I did this.
> 
> --- a/mm/page_alloc.c~mm-page_alloc-fix-a-crash-in-free_pages_prepare-fix
> +++ a/mm/page_alloc.c
> @@ -1179,7 +1179,13 @@ static __always_inline bool free_pages_p
>  		kernel_init_free_pages(page, 1 << order);
>  
>  	kernel_poison_pages(page, 1 << order, 0);
> +	/*
> +	 * arch_free_page() can make the page's contents inaccessible.  s390
> +	 * does this.  So nothing which can access the page's contents should
> +	 * happen after this.
> +	 */
>  	arch_free_page(page, order);
> +
>  	if (debug_pagealloc_enabled())
>  		kernel_map_pages(page, 1 << order, 0);


With that Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Heiko Carstens Sept. 30, 2019, 7:44 a.m. UTC | #10
On Fri, Sep 27, 2019 at 05:15:08PM -0400, Qian Cai wrote:
> On Fri, 2019-09-27 at 13:48 -0700, Andrew Morton wrote:
> > On Fri, 27 Sep 2019 15:47:03 -0400 Qian Cai <cai@lca.pw> wrote:
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1175,11 +1175,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
> > >  		debug_check_no_obj_freed(page_address(page),
> > >  					   PAGE_SIZE << order);
> > >  	}
> > > -	arch_free_page(page, order);
> > >  	if (want_init_on_free())
> > >  		kernel_init_free_pages(page, 1 << order);
> > >  
> > >  	kernel_poison_pages(page, 1 << order, 0);
> > > +	arch_free_page(page, order);
> > >  	if (debug_pagealloc_enabled())
> > >  		kernel_map_pages(page, 1 << order, 0);
> > 
> > This is all fairly mature code, isn't it?  What happened to make this
> > problem pop up now?
> 
> In the past, there is only kernel_poison_pages() would trigger it but it needs
> "page_poison=on" kernel cmdline, and I suspect nobody tested that on s390 in the
> past.

Yes. Peter Oberparleiter reported this also before my short vacation,
but I didn't have time to look into this. Thanks for fixing!

Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Michal Hocko Sept. 30, 2019, 11 a.m. UTC | #11
On Fri 27-09-19 15:47:03, Qian Cai wrote:
> On architectures like s390, arch_free_page() could mark the page unused
> (set_page_unused()) and any access later would trigger a kernel panic.
> Fix it by moving arch_free_page() after all possible accessing calls.
> 
>  Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
>  Krnl PSW : 0404e00180000000 0000000026c2b96e
> (__free_pages_ok+0x34e/0x5d8)
>             R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>  Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c
>  000000000000000f
>             000003d080012100 000003d080013fc0 0000000000000000
>  0000000000100000
>             00000000275cca48 0000000000000100 0000000000000008
>  000003d080010000
>             00000000000001d0 000003d000000000 0000000026c2b78a
>  000000002717fdb0
>  Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
>             0000000026c2b962: e32014000036 pfd 2,1024(%r1)
>            #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
>            >0000000026c2b96e: 41101100  la %r1,256(%r1)
>             0000000026c2b972: a737fff8  brctg %r3,26c2b962
>             0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
>             0000000026c2b97c: e31003400004 lg %r1,832
>             0000000026c2b982: ebff1430016a asi 5168(%r1),-1
>  Call Trace:
>  __free_pages_ok+0x16a/0x5d8)
>  memblock_free_all+0x206/0x290
>  mem_init+0x58/0x120
>  start_kernel+0x2b0/0x570
>  startup_continue+0x6a/0xc0
>  INFO: lockdep is turned off.
>  Last Breaking-Event-Address:
>  __free_pages_ok+0x372/0x5d8
>  Kernel panic - not syncing: Fatal exception: panic_on_oops
> 00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000
> 26A2379C
> 
> Signed-off-by: Qian Cai <cai@lca.pw>

Unless I am missing something 
Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
Cc: stable

With the comment discussed later in the thread
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3334a769eb91..a54ff6a60649 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,11 +1175,11 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  		debug_check_no_obj_freed(page_address(page),
>  					   PAGE_SIZE << order);
>  	}
> -	arch_free_page(page, order);
>  	if (want_init_on_free())
>  		kernel_init_free_pages(page, 1 << order);
>  
>  	kernel_poison_pages(page, 1 << order, 0);
> +	arch_free_page(page, order);
>  	if (debug_pagealloc_enabled())
>  		kernel_map_pages(page, 1 << order, 0);
>  
> -- 
> 1.8.3.1
>
Kirill A. Shutemov Sept. 30, 2019, 11:04 a.m. UTC | #12
On Fri, Sep 27, 2019 at 03:47:03PM -0400, Qian Cai wrote:
> On architectures like s390, arch_free_page() could mark the page unused
> (set_page_unused()) and any access later would trigger a kernel panic.
> Fix it by moving arch_free_page() after all possible accessing calls.

Looks like Power also uses the hook. Have you check that this patch will
not affect Power?
Qian Cai Sept. 30, 2019, 11:22 a.m. UTC | #13
> On Sep 30, 2019, at 7:04 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> Looks like Power also uses the hook. Have you check that this patch will
> not affect Power?

Yes, I did. Although I did only test radix memory which seems just return immediately in arch_free_page(), I code-review the non-radix memory path and did not find anything suspicious. However, if really needed, I could test hash memory config or adding powerpc developers to double-check.

Patch
diff mbox series

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3334a769eb91..a54ff6a60649 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1175,11 +1175,11 @@  static __always_inline bool free_pages_prepare(struct page *page,
 		debug_check_no_obj_freed(page_address(page),
 					   PAGE_SIZE << order);
 	}
-	arch_free_page(page, order);
 	if (want_init_on_free())
 		kernel_init_free_pages(page, 1 << order);
 
 	kernel_poison_pages(page, 1 << order, 0);
+	arch_free_page(page, order);
 	if (debug_pagealloc_enabled())
 		kernel_map_pages(page, 1 << order, 0);