
mm/hmm: Fix bad subpage pointer in try_to_unmap_one

Message ID 20190709223556.28908-1-rcampbell@nvidia.com (mailing list archive)
State New, archived
Series mm/hmm: Fix bad subpage pointer in try_to_unmap_one

Commit Message

Ralph Campbell July 9, 2019, 10:35 p.m. UTC
When migrating a ZONE_DEVICE private page from device memory to system
memory, the subpage pointer is initialized from a swap pte, which yields
an invalid page pointer. The result is a kernel panic such as:

BUG: unable to handle page fault for address: ffffea1fffffffc8

Initialize subpage correctly before calling page_remove_rmap().

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/rmap.c | 1 +
 1 file changed, 1 insertion(+)
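
For context, the path involved looks roughly like this. The sketch below is a
simplified paraphrase of the mm/rmap.c code of that era, not a verbatim
excerpt: subpage is derived from the pte's pfn, which is meaningless when the
pte is a device-private swap entry, and the resulting pointer is later passed
to page_remove_rmap() at the discard label.

/*
 * Simplified sketch of try_to_unmap_one() (paraphrased, not verbatim).
 */
while (page_vma_mapped_walk(&pvmw)) {
	/*
	 * For a present pte this resolves the exact subpage of a
	 * compound page, but a device-private page is mapped by a swap
	 * pte, so pte_pfn() returns nonsense and subpage is garbage.
	 */
	subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
	address = pvmw.address;

	if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
	    is_zone_device_page(page)) {
		/* ... install the special migration pte ... */
		subpage = page;	/* the one-line fix from this patch */
		goto discard;
	}

	/* ... other unmap/migrate paths ... */
discard:
	page_remove_rmap(subpage, PageHuge(page));	/* faults on a bad subpage */
	put_page(page);
}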

Comments

Andrew Morton July 10, 2019, 12:28 a.m. UTC | #1
On Tue, 9 Jul 2019 15:35:56 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:

> When migrating a ZONE device private page from device memory to system
> memory, the subpage pointer is initialized from a swap pte which computes
> an invalid page pointer. A kernel panic results such as:
> 
> BUG: unable to handle page fault for address: ffffea1fffffffc8
> 
> Initialize subpage correctly before calling page_remove_rmap().

I think this is

Fixes:  a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Cc: stable

yes?
Ralph Campbell July 10, 2019, 1:24 a.m. UTC | #2
On 7/9/19 5:28 PM, Andrew Morton wrote:
> On Tue, 9 Jul 2019 15:35:56 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
> 
>> When migrating a ZONE device private page from device memory to system
>> memory, the subpage pointer is initialized from a swap pte which computes
>> an invalid page pointer. A kernel panic results such as:
>>
>> BUG: unable to handle page fault for address: ffffea1fffffffc8
>>
>> Initialize subpage correctly before calling page_remove_rmap().
> 
> I think this is
> 
> Fixes:  a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
> Cc: stable
> 
> yes?
> 

Yes. Can you add this or should I send a v2?
Andrew Morton July 15, 2019, 10 p.m. UTC | #3
On Tue, 9 Jul 2019 18:24:57 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:

> 
> On 7/9/19 5:28 PM, Andrew Morton wrote:
> > On Tue, 9 Jul 2019 15:35:56 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
> > 
> >> When migrating a ZONE device private page from device memory to system
> >> memory, the subpage pointer is initialized from a swap pte which computes
> >> an invalid page pointer. A kernel panic results such as:
> >>
> >> BUG: unable to handle page fault for address: ffffea1fffffffc8
> >>
> >> Initialize subpage correctly before calling page_remove_rmap().
> > 
> > I think this is
> > 
> > Fixes:  a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
> > Cc: stable
> > 
> > yes?
> > 
> 
> Yes. Can you add this or should I send a v2?

I updated the patch.  Could we please have some review input?


From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/hmm: fix bad subpage pointer in try_to_unmap_one

When migrating a ZONE_DEVICE private page from device memory to system
memory, the subpage pointer is initialized from a swap pte, which yields
an invalid page pointer. The result is a kernel panic such as:

BUG: unable to handle page fault for address: ffffea1fffffffc8

Initialize subpage correctly before calling page_remove_rmap().

Link: http://lkml.kernel.org/r/20190709223556.28908-1-rcampbell@nvidia.com
Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/rmap.c |    1 +
 1 file changed, 1 insertion(+)

--- a/mm/rmap.c~mm-hmm-fix-bad-subpage-pointer-in-try_to_unmap_one
+++ a/mm/rmap.c
@@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page
 			 * No need to invalidate here it will synchronize on
 			 * against the special swap migration pte.
 			 */
+			subpage = page;
 			goto discard;
 		}
John Hubbard July 15, 2019, 11:34 p.m. UTC | #4
On 7/15/19 3:00 PM, Andrew Morton wrote:
> On Tue, 9 Jul 2019 18:24:57 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
> 
>>
>> On 7/9/19 5:28 PM, Andrew Morton wrote:
>>> On Tue, 9 Jul 2019 15:35:56 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
>>>
>>>> When migrating a ZONE device private page from device memory to system
>>>> memory, the subpage pointer is initialized from a swap pte which computes
>>>> an invalid page pointer. A kernel panic results such as:
>>>>
>>>> BUG: unable to handle page fault for address: ffffea1fffffffc8
>>>>
>>>> Initialize subpage correctly before calling page_remove_rmap().
>>>
>>> I think this is
>>>
>>> Fixes:  a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
>>> Cc: stable
>>>
>>> yes?
>>>
>>
>> Yes. Can you add this or should I send a v2?
> 
> I updated the patch.  Could we please have some review input?
> 
> 
> From: Ralph Campbell <rcampbell@nvidia.com>
> Subject: mm/hmm: fix bad subpage pointer in try_to_unmap_one
> 
> When migrating a ZONE device private page from device memory to system
> memory, the subpage pointer is initialized from a swap pte which computes
> an invalid page pointer. A kernel panic results such as:
> 
> BUG: unable to handle page fault for address: ffffea1fffffffc8
> 
> Initialize subpage correctly before calling page_remove_rmap().
> 
> Link: http://lkml.kernel.org/r/20190709223556.28908-1-rcampbell@nvidia.com
> Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  mm/rmap.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> --- a/mm/rmap.c~mm-hmm-fix-bad-subpage-pointer-in-try_to_unmap_one
> +++ a/mm/rmap.c
> @@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page
>  			 * No need to invalidate here it will synchronize on
>  			 * against the special swap migration pte.
>  			 */
> +			subpage = page;
>  			goto discard;
>  		}
>  

Hi Ralph and everyone,

While the above prevents a crash, I'm concerned that it is still not
an accurate fix. This fix leads to repeatedly removing the rmap against the
same struct page, which is odd, and it also doesn't directly address the
root cause, which I understand to be that this routine can't handle migrating
the zero page properly--over and back, anyway. (We should also mention more
about how this is triggered in the commit description.)

I'll take a closer look at possible fixes (I have to step out for a bit) soon, 
but any more experienced help is also appreciated here.

thanks,
Ralph Campbell July 16, 2019, 12:38 a.m. UTC | #5
On 7/15/19 4:34 PM, John Hubbard wrote:
> On 7/15/19 3:00 PM, Andrew Morton wrote:
>> On Tue, 9 Jul 2019 18:24:57 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
>>
>>>
>>> On 7/9/19 5:28 PM, Andrew Morton wrote:
>>>> On Tue, 9 Jul 2019 15:35:56 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
>>>>
>>>>> When migrating a ZONE device private page from device memory to system
>>>>> memory, the subpage pointer is initialized from a swap pte which computes
>>>>> an invalid page pointer. A kernel panic results such as:
>>>>>
>>>>> BUG: unable to handle page fault for address: ffffea1fffffffc8
>>>>>
>>>>> Initialize subpage correctly before calling page_remove_rmap().
>>>>
>>>> I think this is
>>>>
>>>> Fixes:  a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
>>>> Cc: stable
>>>>
>>>> yes?
>>>>
>>>
>>> Yes. Can you add this or should I send a v2?
>>
>> I updated the patch.  Could we please have some review input?
>>
>>
>> From: Ralph Campbell <rcampbell@nvidia.com>
>> Subject: mm/hmm: fix bad subpage pointer in try_to_unmap_one
>>
>> When migrating a ZONE device private page from device memory to system
>> memory, the subpage pointer is initialized from a swap pte which computes
>> an invalid page pointer. A kernel panic results such as:
>>
>> BUG: unable to handle page fault for address: ffffea1fffffffc8
>>
>> Initialize subpage correctly before calling page_remove_rmap().
>>
>> Link: http://lkml.kernel.org/r/20190709223556.28908-1-rcampbell@nvidia.com
>> Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Jason Gunthorpe <jgg@mellanox.com>
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>
>>   mm/rmap.c |    1 +
>>   1 file changed, 1 insertion(+)
>>
>> --- a/mm/rmap.c~mm-hmm-fix-bad-subpage-pointer-in-try_to_unmap_one
>> +++ a/mm/rmap.c
>> @@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page
>>   			 * No need to invalidate here it will synchronize on
>>   			 * against the special swap migration pte.
>>   			 */
>> +			subpage = page;
>>   			goto discard;
>>   		}
>>   
> 
> Hi Ralph and everyone,
> 
> While the above prevents a crash, I'm concerned that it is still not
> an accurate fix. This fix leads to repeatedly removing the rmap, against the
> same struct page, which is odd, and also doesn't directly address the
> root cause, which I understand to be: this routine can't handle migrating
> the zero page properly--over and back, anyway. (We should also mention more
> about how this is triggered, in the commit description.)
> 
> I'll take a closer look at possible fixes (I have to step out for a bit) soon,
> but any more experienced help is also appreciated here.
> 
> thanks,

I'm not surprised at the confusion. It took me quite a while to
understand how migrate_vma() works with ZONE_DEVICE private memory.
The big point to be aware of is that when migrating a page to
device private memory, the source page's page->mapping pointer
is copied to the ZONE_DEVICE struct page and the page_mapcount()
is increased. So, the kernel sees the page as being "mapped",
but the page table entry is a swap entry (is_swap_pte() is true), so
the CPU will fault if it tries to access the mapped address.
So yes, the source anon page is unmapped, DMA'ed to the device,
and then mapped again. Then on a CPU fault, the zone device page
is unmapped, DMA'ed to system memory, and mapped again.
The rmap_walk() is used to clear the temporary migration pte;
that is another important detail of how migrate_vma() works.
At the moment, only single anon private pages can migrate to
device private memory, so there are no subpages, and setting subpage
to "page" should be correct for now. I'm looking at supporting migration
of transparent huge pages, but that is a work in progress.
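
To make that sequence concrete, here is the round trip in heavily simplified
pseudo-C. Only the swap-entry helpers (is_swap_pte(), pte_to_swp_entry(),
is_device_private_entry(), device_private_entry_to_page()) are real kernel
calls; the numbered steps in the comments just restate the flow described
above, not any single kernel function.

/*
 * 1. system -> device: the anon page is unmapped, its data is DMA'ed to
 *    the device, and the CPU pte is replaced by a device-private swap
 *    entry.  page->mapping is carried over to the ZONE_DEVICE struct
 *    page and its mapcount is raised, so the page still looks "mapped".
 */

/* 2. A later CPU access faults because the pte is a swap pte: */
if (is_swap_pte(pte)) {
	swp_entry_t entry = pte_to_swp_entry(pte);

	if (is_device_private_entry(entry)) {
		/* the ZONE_DEVICE page backing this address */
		struct page *dpage = device_private_entry_to_page(entry);

		/*
		 * 3. device -> system: the driver's fault handler runs
		 *    migrate_vma() again; dpage is unmapped via
		 *    try_to_unmap_one() (the path this patch fixes),
		 *    temporary migration ptes are installed, the data is
		 *    DMA'ed back, and rmap_walk() is what clears those
		 *    migration ptes afterwards.
		 */
	}
}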
Let me know how much of all that you think should be in the change log.
Getting an Acked-by from Jerome would be nice too.

I see Christoph Hellwig got confused by this too [1].
I have a patch to clear page->mapping when freeing ZONE_DEVICE private
struct pages which I'll send out soon.
I'll probably also add some comments to struct page to include the
above info and maybe remove the _zd_pad_1 field.

[1] 740d6310ed4cd5c78e63 ("mm: don't clear ->mapping in hmm_devmem_free")
John Hubbard July 16, 2019, 6:14 a.m. UTC | #6
On 7/15/19 5:38 PM, Ralph Campbell wrote:
> On 7/15/19 4:34 PM, John Hubbard wrote:
>> On 7/15/19 3:00 PM, Andrew Morton wrote:
>>> On Tue, 9 Jul 2019 18:24:57 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
>>>
>>>   mm/rmap.c |    1 +
>>>   1 file changed, 1 insertion(+)
>>>
>>> --- a/mm/rmap.c~mm-hmm-fix-bad-subpage-pointer-in-try_to_unmap_one
>>> +++ a/mm/rmap.c
>>> @@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page
>>>                * No need to invalidate here it will synchronize on
>>>                * against the special swap migration pte.
>>>                */
>>> +            subpage = page;
>>>               goto discard;
>>>           }
>>
>> Hi Ralph and everyone,
>>
>> While the above prevents a crash, I'm concerned that it is still not
>> an accurate fix. This fix leads to repeatedly removing the rmap, against the
>> same struct page, which is odd, and also doesn't directly address the
>> root cause, which I understand to be: this routine can't handle migrating
>> the zero page properly--over and back, anyway. (We should also mention more
>> about how this is triggered, in the commit description.)
>>
>> I'll take a closer look at possible fixes (I have to step out for a bit) soon,
>> but any more experienced help is also appreciated here.
>>
>> thanks,
> 
> I'm not surprised at the confusion. It took me quite awhile to understand how 
> migrate_vma() works with ZONE_DEVICE private memory.
> The big point to be aware of is that when migrating a page to
> device private memory, the source page's page->mapping pointer
> is copied to the ZONE_DEVICE struct page and the page_mapcount()
> is increased. So, the kernel sees the page as being "mapped"
> but the page table entry as being is_swap_pte() so the CPU will fault
> if it tries to access the mapped address.

Thanks for humoring me here...

The part about the source page's page->mapping pointer being *copied*
to the ZONE_DEVICE struct page is particularly interesting, and maybe
even belongs in a comment (if not already there). Definitely at least in
the commit description, for now.

> So yes, the source anon page is unmapped, DMA'ed to the device,
> and then mapped again. Then on a CPU fault, the zone device page
> is unmapped, DMA'ed to system memory, and mapped again.
> The rmap_walk() is used to clear the temporary migration pte so
> that is another important detail of how migrate_vma() works.
> At the moment, only single anon private pages can migrate to
> device private memory so there are no subpages and setting it to "page"
> should be correct for now. I'm looking at supporting migration of
> transparent huge pages but that is a work in progress.

Well here, I worry, because subpage != tail page, right? subpage is a
strange variable name, and here it is used to record the page that
corresponds to *each* mapping that is found during the reverse page
mapping walk.

And that makes me suspect that if there were more than one of these
found (which is unlikely, given the light testing that we have available
so far, I realize), then there could possibly be a problem with the fix,
yes?
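
For reference, the walk in question looks roughly like this (paraphrased from
rmap_walk_anon(), not a verbatim excerpt); rmap_one() is try_to_unmap_one()
here, so it runs once per VMA that maps the page and recomputes subpage each
time:

/* Paraphrased sketch of rmap_walk_anon(); not a verbatim excerpt. */
anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
			       pgoff_start, pgoff_end) {
	struct vm_area_struct *vma = avc->vma;
	unsigned long address = vma_address(page, vma);

	/*
	 * One callback per mapping; try_to_unmap_one() then recomputes
	 * subpage from whatever pte(s) page_vma_mapped_walk() matches.
	 */
	if (!rwc->rmap_one(page, vma, address, rwc->arg))
		break;
}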

> Let me know how much of all that you think should be in the change log.
> Getting an Acked-by from Jerome would be nice too.
> 
> I see Christoph Hellwig got confused by this too [1].

Yeah, him and me both. :)

> I have a patch to clear page->mapping when freeing ZONE_DEVICE private
> struct pages which I'll send out soon.
> I'll probably also add some comments to struct page to include the
> above info and maybe remove the _zd_pad_1 field.
> 
> [1] 740d6310ed4cd5c78e63 ("mm: don't clear ->mapping in hmm_devmem_free")
> 

That's  b7a523109fb5c9d2d6dd3ffc1fa38a4f48c6f842 in linux.git, now.

thanks,
Jerome Glisse July 16, 2019, 3:33 p.m. UTC | #7
On Mon, Jul 15, 2019 at 11:14:31PM -0700, John Hubbard wrote:
> On 7/15/19 5:38 PM, Ralph Campbell wrote:
> > On 7/15/19 4:34 PM, John Hubbard wrote:
> > > On 7/15/19 3:00 PM, Andrew Morton wrote:
> > > > On Tue, 9 Jul 2019 18:24:57 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:
> > > > 
> > > >   mm/rmap.c |    1 +
> > > >   1 file changed, 1 insertion(+)
> > > > 
> > > > --- a/mm/rmap.c~mm-hmm-fix-bad-subpage-pointer-in-try_to_unmap_one
> > > > +++ a/mm/rmap.c
> > > > @@ -1476,6 +1476,7 @@ static bool try_to_unmap_one(struct page
> > > >                * No need to invalidate here it will synchronize on
> > > >                * against the special swap migration pte.
> > > >                */
> > > > +            subpage = page;
> > > >               goto discard;
> > > >           }
> > > 
> > > Hi Ralph and everyone,
> > > 
> > > While the above prevents a crash, I'm concerned that it is still not
> > > an accurate fix. This fix leads to repeatedly removing the rmap, against the
> > > same struct page, which is odd, and also doesn't directly address the
> > > root cause, which I understand to be: this routine can't handle migrating
> > > the zero page properly--over and back, anyway. (We should also mention more
> > > about how this is triggered, in the commit description.)
> > > 
> > > I'll take a closer look at possible fixes (I have to step out for a bit) soon,
> > > but any more experienced help is also appreciated here.
> > > 
> > > thanks,
> > 
> > I'm not surprised at the confusion. It took me quite awhile to
> > understand how migrate_vma() works with ZONE_DEVICE private memory.
> > The big point to be aware of is that when migrating a page to
> > device private memory, the source page's page->mapping pointer
> > is copied to the ZONE_DEVICE struct page and the page_mapcount()
> > is increased. So, the kernel sees the page as being "mapped"
> > but the page table entry as being is_swap_pte() so the CPU will fault
> > if it tries to access the mapped address.
> 
> Thanks for humoring me here...
> 
> The part about the source page's page->mapping pointer being *copied*
> to the ZONE_DEVICE struct page is particularly interesting, and belongs
> maybe even in a comment (if not already there). Definitely at least in
> the commit description, for now.
> 
> > So yes, the source anon page is unmapped, DMA'ed to the device,
> > and then mapped again. Then on a CPU fault, the zone device page
> > is unmapped, DMA'ed to system memory, and mapped again.
> > The rmap_walk() is used to clear the temporary migration pte so
> > that is another important detail of how migrate_vma() works.
> > At the moment, only single anon private pages can migrate to
> > device private memory so there are no subpages and setting it to "page"
> > should be correct for now. I'm looking at supporting migration of
> > transparent huge pages but that is a work in progress.
> 
> Well here, I worry, because subpage != tail page, right? subpage is a
> strange variable name, and here it is used to record the page that
> corresponds to *each* mapping that is found during the reverse page
> mapping walk.
> 
> And that makes me suspect that if there were more than one of these
> found (which is unlikely, given the light testing that we have available
> so far, I realize), then there could possibly be a problem with the fix,
> yes?

No THP when migrating to device memory so no tail vs head page here.

Cheers,
Jérôme
Andrew Morton July 16, 2019, 9:10 p.m. UTC | #8
On Mon, 15 Jul 2019 17:38:04 -0700 Ralph Campbell <rcampbell@nvidia.com> wrote:

> I'm not surprised at the confusion. It took me quite awhile to 
> understand how migrate_vma() works with ZONE_DEVICE private memory.
>
> ...
> 
> I see Christoph Hellwig got confused by this too [1].

While making such discoveries, please prepare a patch which adds
comments which will help the next poor soul!

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index e5dfe2ae6b0d..ec1af8b60423 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1476,6 +1476,7 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			 * No need to invalidate here it will synchronize on
 			 * against the special swap migration pte.
 			 */
+			subpage = page;
 			goto discard;
 		}